On the design of an efficient, fast, and predictable memory allocator

We are building our own operating system in C for our Principles of Operating Systems class. At one point, we were required to create a memory allocation system: given a whole heap of memory, we should be able to reserve some space within it and later free that space. In this post, I would like to write down the approach that I took.

Our system allocates in a way that respects the double-word boundary; every address and size is thus a multiple of 8. If the system is initialized with a main memory whose size is not a multiple of 8, slightly less space is used so that everything aligns properly. When a block of memory is requested, we use first-fit to find the earliest block that can satisfy the request. Such a block must be at least as large as the requested size. Since the returned block may be much larger than what was requested, we employ a buddy-allocator technique to divide the block. We call this division technique “chunking up.” As we demonstrate later, our chunking-up technique is similar to a buddy allocator but differs in one respect.
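
To make the description above concrete, here is a minimal sketch in C of the alignment rule and the first-fit search. The struct layout and field names (`size`, `pid`, `free`, `next`) are assumptions for illustration; our actual bookkeeping struct is not shown in this post.

```c
#include <stddef.h>

/* Hypothetical bookkeeping ("overhead") struct; field names are assumed. */
typedef struct block {
    size_t        size;  /* addressable bytes, excluding this header */
    int           pid;   /* owning process id */
    int           free;  /* nonzero if the block can be reserved */
    struct block *next;  /* next block, at a higher address */
} block_t;

/* Round a requested size up to the double-word (8-byte) boundary. */
static inline size_t align8(size_t n) {
    return (n + 7) & ~(size_t)7;
}

/* First-fit: walk from the head and return the earliest free block
 * large enough for the (aligned) request, or NULL if none exists. */
static block_t *first_fit(block_t *head, size_t request) {
    size_t need = align8(request);
    for (block_t *b = head; b != NULL; b = b->next)
        if (b->free && b->size >= need)
            return b;
    return NULL;
}
```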

When the memory is set up for the first time, which is when the kernel is being initialized, the memory is in the following configuration:

POS           ADDRESS    PID      FREE           SIZE
  0    0x7f2c171e0018      0       yes      134217720

The POS column indicates an absolute position counted from 0, while ADDRESS is the memory address usable by the user. POS accounts for the overhead struct; ADDRESS does not. In other words, ADDRESS is the address of the addressable allocated memory, which can be read, written, checked, and freed. FREE indicates whether the block is reserved at the time of reporting, and SIZE reports how large the allocated memory is, sans any overhead struct. If we were to sum the SIZE column, the total would be smaller than the total free memory space in the kernel-fresh state, since we must account for the overhead struct created for each memory block.

Now, when our system is fully initialized, the memory map would be akin to:

     POS           ADDRESS    PID      FREE           SIZE
       0    0x7fd464e5d018      0        no              8
      16    0x7fd464e5d028      0       yes              8
      32    0x7fd464e5d038      0        no              8
      48    0x7fd464e5d048      0       yes             72
     128    0x7fd464e5d098      0       yes            120
     256    0x7fd464e5d118      0       yes            248
     512    0x7fd464e5d218      0       yes            504
    1024    0x7fd464e5d418      0       yes           1016
    2048    0x7fd464e5d818      0       yes           2040
    4096    0x7fd464e5e018      0       yes           4088
    8192    0x7fd464e5f018      0       yes           8184
   16384    0x7fd464e61018      0       yes          16376
   32768    0x7fd464e65018      0       yes          32760
   65536    0x7fd464e6d018      0       yes          65528
  131072    0x7fd464e7d018      0       yes         131064
  262144    0x7fd464e9d018      0       yes         262136
  524288    0x7fd464edd018      0       yes         524280
 1048576    0x7fd464f5d018      0       yes        1048568
 2097152    0x7fd46505d018      0       yes        2097144
 4194304    0x7fd46525d018      0       yes        4194296
 8388608    0x7fd46565d018      0       yes        8388600
16777216    0x7fd465e5d018      0       yes       16777208
33554432    0x7fd466e5d018      0       yes       33554424
67108864    0x7fd468e5d018      0       yes       67108856

As is noticeable, the memory is divided in such a way that the blocks on the right-hand side, that is, the ones at larger addresses, tend to be of greater size. This is done deliberately. Put another way, we believe that blocks on the left-hand side are extremely short-lived or volatile in nature: they are used and freed quickly. For that reason, we want a system that does not need to scan far to hand out those blocks at a time of need. This can be observed in how the 8-byte block at 0x7fd464e5d028 is free: it must have been used by the shell program in some way and then freed before the memory mapper could look at it while it was still in use. For a fuller picture, the first block stores the PCB (Process Control Block) metadata, which is allocated by the kernel as soon as the OS starts running, and it will remain there as long as the OS is running.

This technique makes our first-fit algorithm enjoy virtually the same benefits as best-fit without compromising speed. Unlike a typical buddy allocator, however, we always make sure that a division leaves the right block slightly bigger than its left-hand cousin; that is, the division is not done evenly. For example, let’s study these blocks:

POS           ADDRESS    PID      FREE           SIZE
2048    0x7fd464e5d818      0       yes           2040
4096    0x7fd464e5e018      0       yes           4088

If the division were fully fair and equal, the block at POS 4096 would be of size 4080 bytes; instead, it is 8 bytes larger. Indeed, this is how our chunking up differs slightly from the standard buddy allocator: we ensure that the right-hand block is always slightly larger than its left-hand cousin. This is done in the hope that the left-hand side will be much smaller and thus more readily suitable for the requested size.

The generally larger blocks on the right also allow the next request to ask for a bigger size which, should that very block be able to accommodate it, can be returned right away without splitting a bigger one. Seen from a different perspective, blocks on the left tend to be short-lived; the algorithm is deliberately designed so that blocks of smaller size are concentrated on the left-hand side, very near the start of the iteration when we need to find a suitable block.

But we do not stop at first-fit and buddy allocation when it comes to allocating memory; otherwise, we run the risk of wasting a lot of space. That is because there comes a point where we can no longer chunk up a block, as doing so would create buddies of effectively 0 addressable bytes (if not fewer), such as in the case of a 16-byte block. When a 16-byte block is chunked, it should end up as two blocks of 8 bytes each; but since we need to create a bookkeeping struct for the new block on the right, the right buddy would end up with 0 bytes of addressable space. The left buddy does not need a new bookkeeping struct, since it can reuse the one the block was originally attached to.
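
One way to express this uneven chunking, under the assumption that the 8-byte overhead struct sits at the start of each region, is to round the split point down to the 8-byte boundary so that any surplus lands in the right-hand buddy. The rounding rule here is an interpretation for illustration; the post only states that the right buddy comes out slightly larger. The guard also encodes why a 16-byte block cannot be chunked.

```c
#include <stddef.h>

#define HDR 8  /* size of the bookkeeping struct, per the memory maps */

typedef struct { size_t left_addr; size_t right_addr; } split_t;

/* A region can be split only if both resulting buddies keep at least
 * 8 addressable bytes: two headers plus 8 bytes each side = 32 bytes.
 * A 16-byte region fails this test, as described above. */
static int can_split(size_t total) {
    return total >= 4 * HDR;
}

/* Divide a region of `total` bytes (existing header included) into two
 * sub-regions. The left share is rounded DOWN to the 8-byte boundary,
 * so any odd 8-byte unit goes to the right-hand buddy, making it the
 * slightly larger one. */
static split_t split_region(size_t total) {
    size_t payload = total - 2 * HDR;            /* both headers removed */
    size_t left    = (payload / 2) & ~(size_t)7; /* round down to 8 */
    split_t s = { left, payload - left };        /* surplus goes right */
    return s;
}
```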

Instead of chunking, in this case we employ a process we call space-fitting. For example, if we have memory in this layout:

POS           ADDRESS    PID      FREE           SIZE
  0    0x7f5465a75018      0        no              8
 16    0x7f5465a75028      0       yes             80
104    0x7f5465a75080      0        no             16
128    0x7f5465a75098      0        no             16
152    0x7f5465a750b0      0        no             16

If we issue malloc 8 to reserve an 8-byte addressable memory space, we end up with a new block at ADDRESS 0x7f5465a75038 of exactly 8 bytes:

POS           ADDRESS    PID      FREE           SIZE
  0    0x7f5465a75018      0        no              8
 16    0x7f5465a75028      0       yes              8
 32    0x7f5465a75038      0        no              8
 48    0x7f5465a75048      0       yes             64
120    0x7f5465a75090      0        no             16

Without space-fitting, this new region would likely be bigger than 8 bytes, although still less than 80 bytes wide.

Space-fitting works in this manner: first, first-fit and the buddy logic do their work to give us the best possible block as soon as possible. Now, if the block is larger than needed and its right-hand side is free, we transfer the extra bytes there. This is almost always possible because of how our memory allocator is designed to work; seen from another angle, the design keeps the right-hand side almost always free. In the rare case that the right side is not free but the left is, the extra bytes are transferred there instead. In the even rarer case that neither side is free, the block will “divide” itself when possible, even if the division is hardly equal and fair, as long as no block ends up with 0 bytes. If none of the above can be done, the block is left as-is.
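
A sketch of the space-fitting bookkeeping, assuming a hypothetical doubly linked list of headers: after first-fit has chosen a block, the surplus bytes are pushed into a free right neighbor first, then a free left neighbor. Only the size accounting is shown; the pointer arithmetic that actually relocates the neighbor's header is omitted. The `need` argument is assumed to be pre-aligned, so all sizes stay multiples of 8.

```c
#include <stddef.h>

typedef struct blk {
    size_t size;   /* addressable bytes */
    int    free;   /* nonzero if reservable */
    struct blk *prev, *next;
} blk;

/* Trim the chosen block `b` down to `need` bytes and move the surplus
 * into a free neighbor: right first (the common case by design), then
 * left. If neither neighbor is free, the caller falls back to an
 * uneven split, or leaves the block as-is. */
static void space_fit(blk *b, size_t need) {
    size_t extra = b->size - need;
    if (extra == 0) return;
    if (b->next && b->next->free) {          /* common case */
        b->next->size += extra;
        b->size = need;
    } else if (b->prev && b->prev->free) {   /* rare case */
        b->prev->size += extra;
        b->size = need;
    }
}
```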

This overall design has benefited us tremendously: our search does not have to iterate far; if we ever need to chunk up, we still have a block big enough to be chunked up again; and nearly every reserved block carries few extra, unnecessary bytes.

Now, let’s see what happens if we allocate 128 bytes of memory. First, take a look at the snapshot of our memory map before making such an allocation:

POS           ADDRESS    PID      FREE           SIZE
   0    0x7f5465a75018      0        no              8
  16    0x7f5465a75028      0       yes              8
  32    0x7f5465a75038      0        no              8
  48    0x7f5465a75048      0       yes             16
  72    0x7f5465a75060      0        no             32
 112    0x7f5465a75088      0       yes             40
 160    0x7f5465a750b8      0        no             16
 184    0x7f5465a750d0      0        no             16
 208    0x7f5465a750e8      0       yes             96
 312    0x7f5465a75150      0       yes             96
 416    0x7f5465a751b8      0       yes             88
 512    0x7f5465a75218      0       yes            504
1024    0x7f5465a75418      0       yes           1016
2048    0x7f5465a75818      0       yes           2040
 
(truncated for brevity)

Now, after performing memory allocation, our memory map is:

POS           ADDRESS    PID      FREE           SIZE
   0    0x7f5465a75018      0        no              8
  16    0x7f5465a75028      0       yes              8
  32    0x7f5465a75038      0        no              8
  48    0x7f5465a75048      0       yes             16
  72    0x7f5465a75060      0        no             32
 112    0x7f5465a75088      0       yes             40
 160    0x7f5465a750b8      0        no             16
 184    0x7f5465a750d0      0        no             16
 208    0x7f5465a750e8      0       yes             96
 312    0x7f5465a75150      0       yes             96
 416    0x7f5465a751b8      0       yes             88
 512    0x7f5465a75218      0        no            128
 648    0x7f5465a752a0      0       yes            368
1024    0x7f5465a75418      0       yes           1016

Notice that the memory at ADDRESS 0x7f5465a75218 is located right between two worlds: everything on its left is of smaller size, everything on its right of bigger size, showing that our memory allocator works just as designed. That is also why, instead of having (at least) a single 280-byte block on its left (0x7f5465a750e8, 0x7f5465a75150, and 0x7f5465a751b8 combined), that “potential” block is chunked into 3 blocks, completely upholding the rules of our allocator.

Now, what would happen if we were to allocate 64 bytes’ worth of memory? The answer is that our memory map would be as follows:

POS           ADDRESS    PID      FREE           SIZE
   0    0x7f5465a75018      0        no              8
  16    0x7f5465a75028      0       yes              8
  32    0x7f5465a75038      0        no              8
  48    0x7f5465a75048      0       yes             16
  72    0x7f5465a75060      0        no             32
 112    0x7f5465a75088      0        no             88
 208    0x7f5465a750e8      0       yes             40
 256    0x7f5465a75118      0        no             16
 280    0x7f5465a75130      0        no             16
 304    0x7f5465a75148      0       yes             48
 360    0x7f5465a75180      0       yes             48
 416    0x7f5465a751b8      0       yes             88
 512    0x7f5465a75218      0        no            128
 648    0x7f5465a752a0      0       yes            368
1024    0x7f5465a75418      0       yes           1016
2048    0x7f5465a75818      0       yes           2040
4096    0x7f5465a76018      0       yes           4088
8192    0x7f5465a77018      0       yes           8184

It used the 88-byte block at ADDRESS 0x7f5465a75088. Again, that block is located on the left-hand side of 0x7f5465a75218 (128 bytes), which we allocated earlier.

When a given memory block is freed, it is merged with the blocks to its right and/or left if they are free at the time of merging. When blocks are merged, their overhead structs are reclaimed and thus become reusable, resulting in the SIZE of the merged block being noticeably bigger than the individual sizes summed.
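
The merge-on-free behavior can be sketched as follows, again over a hypothetical doubly linked header list. Note how each absorbed neighbor contributes its 8-byte overhead struct to the merged SIZE, which is why a merge of free blocks of 8, 8, and 16 bytes yields 48 rather than 32 addressable bytes, as in the maps below.

```c
#include <stddef.h>

#define HDR 8  /* size of the bookkeeping struct, per the memory maps */

typedef struct blk {
    size_t size;   /* addressable bytes */
    int    free;   /* nonzero if reservable */
    struct blk *prev, *next;
} blk;

/* Free `b` and coalesce with free neighbors. Absorbing a neighbor
 * reclaims its header, so the merged size grows by size + HDR. */
static void free_block(blk *b) {
    b->free = 1;
    if (b->next && b->next->free) {          /* absorb right neighbor */
        b->size += HDR + b->next->size;
        b->next = b->next->next;
        if (b->next) b->next->prev = b;
    }
    if (b->prev && b->prev->free) {          /* fold into left neighbor */
        b->prev->size += HDR + b->size;
        b->prev->next = b->next;
        if (b->next) b->next->prev = b->prev;
    }
}
```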

To demonstrate, assume we had allocated 16 bytes (0x7f5465a75048) and 8 bytes (0x7f5465a75060) worth of space to end up with the following memory map:

POS           ADDRESS    PID      FREE           SIZE
   0    0x7f5465a75018      0        no              8
  16    0x7f5465a75028      0       yes              8
  32    0x7f5465a75038      0        no              8
  48    0x7f5465a75048      0        no             16
  72    0x7f5465a75060      0        no             32
 112    0x7f5465a75088      0        no             88
 208    0x7f5465a750e8      0       yes             16
 232    0x7f5465a75100      0        no             24
 264    0x7f5465a75120      0       yes             40
 312    0x7f5465a75150      0        no             16
 336    0x7f5465a75168      0        no             16
 360    0x7f5465a75180      0       yes             64
 432    0x7f5465a751c8      0       yes             72
 512    0x7f5465a75218      0        no            128
 648    0x7f5465a752a0      0       yes            368
1024    0x7f5465a75418      0       yes           1016
2048    0x7f5465a75818      0       yes           2040
4096    0x7f5465a76018      0       yes           4088

Now, let’s free 0x7f5465a75048:

POS           ADDRESS    PID      FREE           SIZE
  0    0x7f5465a75018      0        no              8
 16    0x7f5465a75028      0       yes              8
 32    0x7f5465a75038      0        no              8
 48    0x7f5465a75048      0       yes             16
 72    0x7f5465a75060      0        no             32
112    0x7f5465a75088      0        no             88
208    0x7f5465a750e8      0       yes             40

If we were then to free the region on its left, 0x7f5465a75038, our memory map would be merged all the way up to 0x7f5465a75028, making a super-sized block out of what used to be 8 bytes, far larger than the blocks accounted for individually: 32 bytes (0x7f5465a75028 + 0x7f5465a75038 + 0x7f5465a75048), since the overhead structs now become part of the addressable space.

POS           ADDRESS    PID      FREE           SIZE
  0    0x7f5465a75018      0        no              8
 16    0x7f5465a75028      0       yes             48
 72    0x7f5465a75060      0        no             32
112    0x7f5465a75088      0        no             88
208    0x7f5465a750e8      0       yes             40
256    0x7f5465a75118      0        no             16
280    0x7f5465a75130      0        no             16

Now, of course, this block is naturally bigger, but there is no need nor reason to move it to the right-hand side just to follow the rules. In fact, doing so would be extremely undesirable, as we would have to travel further to reserve new blocks that are usually temporary in nature, especially in these hotspot regions.

So far our algorithm has worked beautifully, but at this rate it is far from desirable. In a production environment where the memory is in a chunked-up state, when a user wants to allocate merely one byte more than the unallocated tail block can hold, the allocator will refuse to allocate even if the total free memory suffices to fulfill the request. For example, given our tail in this configuration:

     POS           ADDRESS    PID      FREE           SIZE
 4194304    0x7fec44cd2018      0       yes        4194296
 8388608    0x7fec450d2018      0       yes        8388600
16777216    0x7fec458d2018      0       yes       16777208
33554432    0x7fec468d2018      0       yes       33554424
67108864    0x7fec488d2018      0       yes       67108856

Reserving 67108857 bytes of memory, which would then be rounded up to meet our double-word boundary, would not be allowed under the current technique, because there is simply no block that first-fit can find to satisfy such a request. Fortunately, the fix is very easy: we just need to stitch memory back together, starting from the tail (the end tip of the memory) toward a position nearer the head, such that the combined sum of those blocks is large enough for the request. We call this technique stitching.
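
The feasibility half of stitching can be sketched as a backward walk from the tail, accumulating contiguous free blocks until the request fits; each merge reclaims one 8-byte header into addressable space. The struct and field names are the same illustrative assumptions as before, and the actual data movement (the reserved block “indenting” itself) is not shown.

```c
#include <stddef.h>

#define HDR 8  /* size of the bookkeeping struct, per the memory maps */

typedef struct blk {
    size_t size;       /* addressable bytes */
    int    free;       /* nonzero if reservable */
    struct blk *prev;  /* previous block, at a lower address */
} blk;

/* Walk backward from the tail over contiguous free blocks, summing
 * their sizes plus the headers reclaimed by merging. Returns 1 if
 * stitching can satisfy `need`, 0 otherwise. A non-free "hurdle"
 * block stops the walk. */
static int can_stitch(blk *tail, size_t need) {
    size_t have = 0;
    for (blk *b = tail; b && b->free; b = b->prev) {
        have += b->size;
        if (b != tail) have += HDR;  /* merged header becomes addressable */
        if (have >= need) return 1;
    }
    return 0;
}
```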

Without stitching, we would not be able to reserve 77108856 bytes of memory as, again, the last block is far too small for that. But with stitching, this is what happens when we reserve 77108856 bytes of memory:

     POS           ADDRESS    PID      FREE           SIZE
 4194304    0x7fec44cd2018      0       yes        4194296
 8388608    0x7fec450d2018      0       yes        8388600
16777216    0x7fec458d2018      0       yes       40331640
57108864    0x7fec47f48998      0        no       77108856

Notice that we managed to reserve exactly that size while transferring any extra bytes to the left buddy. The process works this way: first, 0x7fec488d2018 is merged with 0x7fec468d2018, which leaves 0x7fec468d2018 with extra bytes. 0x7fec468d2018 then transfers those extra bytes to its left buddy, 0x7fec458d2018, widening it from 16777208 bytes to 40331640 bytes. Since 0x7fec458d2018 has received the extra bytes, 0x7fec468d2018 can no longer stay at its position, so it “indents” itself, ending up at ADDRESS 0x7fec47f48998. The behavior of these stitched blocks remains the same as any other: if they are freed, the whole block is freed, and perhaps merged with its left or right buddy as appropriate; they are also subject to being chunked up, where applicable.

Thus, as demonstrated, our algorithm is rather efficient in its use of memory: when a user requests 8 bytes and the allocator finds a 16-byte block, the block can be chunked down and space-fitted so as to avoid wasting space. It is also fast at finding a suitable block, as it uses the plain-and-simple first-fit algorithm. On top of that, it is relatively predictable in its behavior thanks to the buddy allocator, albeit with some adjustments.

There is still room for improvement, however. One issue that needs attention is that the current technique may fail to reserve memory, even when the total free size is adequate, if there are hurdle blocks that force the tail end to start at a position past those hurdles; in other words, the new tail end will be smaller, which significantly reduces the chance of reserving memory of a bigger size. To visualize this, imagine we have this configuration:

     POS           ADDRESS    PID      FREE           SIZE
 8388608    0x7f99ea135018      0       yes        8388600
16777216    0x7f99ea935018      0       no        16777208
33554432    0x7f99eb935018      0       yes       33554424
67108864    0x7f99ed935018      0       no        67108856

Notice that the block at POS 16777216 is a hurdle block because it sits in the middle. If we want to reserve 33554428 bytes (4 bytes larger than what POS 33554432 can offer), our allocator will not return any block, even though the memory is completely capable of offering the space from a free-size standpoint. To handle this, we would need to defragment our memory by moving the hurdle blocks so that the tail end sits above them; to do so, we may need to move parts of our memory to disk. For now, this is not implemented due to its relative complexity, yet it is not rocket science.

Although we have talked about the mechanism without showing a single line of code from our allocator, I can be quite sure that the code is rather simple, because we don’t play with chances. All memory blocks are initialized and known at any given time; they are, in a way, uniform in nature. They all follow exactly the same rules, eliminating the need for many ifs and elses.


How to merge a subfolder git into its parent’s git repo

For years I have kept a centralized repo of notes, code snippets, diagrams, lab code, and basically everything I have learned that I think will be valuable. I do that because I know I will want to come back to it someday and regain the knowledge quickly. I find this organization extremely helpful.

So, naturally, when I took CSCI-E95 last semester, which was awesome, I wanted whatever I learned to end up inside that centralized repo as well. And of course, the code artifacts are of utmost importance.