Practical Efficient Memory Management

August 19, 2008 @ 3:26 | In Programming | 29 Comments | twitter rss digg
Gamelab 2008

A good memory management architecture is one of those key features that can make the difference between a successful application and an unsuccessful one. For realtime applications the memory architecture becomes critical. This article discovers how memory management is more than tracking where your malloc() and free() are located. Although the focus will be on realtime applications implemented in C, all the techniques described here can be translated to other scenarios because the terms described are language-independent.

The best memory management is doing no allocation at all. You should architect your software to minimize the interaction with the memory manager. In the past that was a realistic option but in modern architectures that objective becomes harder to achieve. Modern requisites for realtime architectures like, for example, content streaming or hot loading force us to have frequent interactions with the memory manager. This article describes how this can be done efficiently. The ideas provided plus the links to other articles will give you enough information to implement your own solution. No downloadable code is provided with this article but if you need help implementing the ideas described here do not hesitate to contact the author.

The article is subdivided in three parts. Each part is dedicated to different memory management layers. From low-level to high-level, the first layer is the Memory Allocator. The Memory allocator, in direct contact with the operating system, implements the memory allocation/deallocation functionality. The next layer that will be described is dedicated to Pools, or how to minimize the interaction with the Memory Allocator. In the last part, automatic memory management or how to avoid human interaction with the memory manager is described.

1. The Memory Allocator

The Memory Allocator is a class implementation with basic memory management functionality. An interface for this class would be something like this:

////////////////////////////////////////////////////////////////////////////
/// MemoryAllocator. Base class for all memory allocators
////////////////////////////////////////////////////////////////////////////
class MemoryAllocator
{
public:
    virtual ~MemoryAllocator() = 0 {};
 
    /// \return The name of the allocator implementing this interface
    virtual const char* GetName() const = 0;
 
    virtual void* Alloc(size_t size) = 0;
    virtual void* Realloc(void* ptr, size_t size) = 0;
    virtual void Dealloc(void* ptr) = 0;
 
    /// Returns the size of a memory block allocated by this allocator
    virtual size_t SizeOf(void* ptr) const = 0;
 
    /// Log statistics
    virtual void DumpStats() const = 0;
};

This proposed interface is virtual and a concrete implementation must be provided. Probably you will implement this removing the virtual class, but having it is a good option to reduce compile-time dependencies. Anyway the virtual functions will not be a bottleneck.

The first implementation for this interface is obvious: implement it using the standard C functions like malloc, realloc and free. The SizeOf() function can be implemented in Visual Studio with _msize. Rest of the platforms provide similar functionality. This first implementation may be enough for you. Test it, profile it and if you need better performance pass to the next step.

The next step starts with a piece of advice: do not implement your own allocator. Do not reinvent the wheel. Do not waste your precious time implementing something that probably others have already done.

What follows is a description of desirable properties that must have a good memory allocator. Next, links to existing implementations are provided.

Properties of a good memory allocator

When looking for existing memory allocator implementations you must pay attention at the following features:

  • Efficient for small block allocations: a high percentage of your allocations are for small blocks (from a byte to a few hundred). The allocation for this blocks must be very fast and the size overhead must be minimum. When you request N bytes to an allocator, it is normal that internally more than N bytes are allocated due to internal headers. A good small allocator must allocate/deallocate in O(1) time and add zero overhead. Think about it, for a small block a common overhead of 8 bytes can be quite a lot! Is it possible to have zero-bytes overhead? Of course, separating your small allocators from the big one and playing with virtual addresses that objective is achievable
  • Efficient for multithreaded programs: unlike in the past, multithreading is actually not an option for your architecture, it is mandatory. A good memory allocator must provide smarter mechanisms than protecting each function with a lock. Using Thread-Local-Storage (TLS) the locking mechanism can be avoided.

Although the implementation provided by Visual Studio have been improved a lot (last version tested: Visual Studio 2005) it is not especially good at the points described before. Small blocks have a high overhead and the functions are protected by thread locks, the worst way to implement a thread-safe memory allocator.

Known good implementations like Doug Lea Malloc Allocator [1] are not so good in multithreaded environments. From the author himself (about thread safety): “This is not especially fast, and can be a major bottleneck. If you are using malloc in a concurrent program, consider instead using ptmalloc, which is derived from a version of this malloc. (See http://www.malloc.de).”

The Hoard Memory Allocator [3] is very good at the two points described before. It is specially designed to be very fast on multiprocessor machines. It is licensed under the GNU terms. If you want to use it in a commercial application they offer commercial licenses in its website.

Inside the Google Performance Tools [4] there is a very efficient implementation of malloc. As described by the authors: “The fastest malloc we’ve seen”. Apart from the marketing issues, their implementation satisfies the two requisites described before.

You must test each implementation in your scenario and choose the one that better suits you. Architect the memory manager in a way that the internal allocator can be easily swapped. Your first implementation should be based on the standard malloc.

The allocator implementations described here have been tested in real and finished applications. There is an important point to emphasize here. Apart from the efficiency of your final applications you want to improve the day to day job when you are building your program. In those cases, efficiency is not as important as robustness. The Debug Implementation of Malloc() provided by Microsoft in their Microsoft Visual Studio is incredibly good at this point, detecting all kind of errors related to memory. This can save a lot of valuable time. So, one more time, being able to swap memory allocators is something you definitely want for your architecture.

Redirecting to your allocator

Having chosen an allocator the next step is redirecting all the program allocations through it. The first and most obvious solution is doing it explicitly, like the following example.

// Allocate a block from the memory manager
void *data = MemoryManager::Alloc(256);                                    
 
// Use the block
// ...
 
// Free the block
MemoryManager::DeAlloc(ptr);

And for C++ classes, the global operators new and delete must be overridden.

////////////////////////////////////////////////////////////////////////////
void* operator new(size_t size)
{
    return MemoryManager::Alloc(size);
}
 
////////////////////////////////////////////////////////////////////////////
void* operator new[](size_t size)
{
    return MemoryManager::Alloc(size);
}
////////////////////////////////////////////////////////////////////////////
void operator delete(void* ptr)
{
    MemoryManager::DeAlloc(ptr);
}
 
////////////////////////////////////////////////////////////////////////////
void operator delete[](void* ptr)
{
    MemoryManager::DeAlloc(ptr);
}

Apart from being insufficient, because there will be allocations that will not be caught with this method, it is prone to error. Users of this architecture have to remember that instead of using the malloc() functions they have to use other methods. Apart for creating less portable code, there is a great risk of losing allocations by not following that rule. Furthermore, code that can not be modified will not be redirected to your allocator. This includes all kind of Third-Parties you may be using. For example, DirectX or OpenGL allocations will not be caught. And last but not least this method is incompatible with other Third-parties doing the same thing (like overriding the operator new). You will be faced with hard to solve linking problems here. An example of Third-Party overriding the operator new and delete is the MFC.

To overcome all these problems platform specific solutions are needed. Some platforms (in many consoles for example) offer clean mechanisms to catch all the memory allocations. In other more heterogeneous platforms, like Windows PC, things are a little bit more complex. A method very similar to the one in [5] is described here for the Windows platform.

For catching all the memory allocations, the relevant memory functions will be ‘hooked’. The mechanism of hooking a function consists in inserting an assembler jump instruction at the beginning of the function. This way, all the code invoking the original function will be redirect to the new implementation. In addition, the original function code must be preserved because it will be needed. A copy of the beginning of the function, the prologue, is stored before patching it.

Commercial libraries like Detours [6] offer the hooking mechanism described. Free alternatives like the TRET toolkit [7] or the google-perftools [4] offer the same functionality. But if you do not want to incorporate such big packages into your project you can implement your own hooking mechanism. You are going to need a disassembler to properly implement it. Libdasm [8] is a recommended package for this purpose.

Having the hooking mechanism ready, now the target functions to be hooked must be selected. Standard C functions like malloc(), calloc(), realloc() and free() functions must be hooked. Global operators new and delete must be hooked too. But that is not enough, because under Windows lots of allocations come directly from functions like HeapAlloc(). At least this function family must be hooked too. [9] is a good lecture to discover how the virtual process address is populated in a Windows platform.

////////////////////////////////////////////////////////////////////////////
// Copy prologues for original CRT memory functions
OrgMalloc = static_cast<MallocT>(CopyPrologue(&malloc));
OrgCalloc = static_cast<CallocT>(CopyPrologue(&calloc));
OrgRealloc = static_cast<ReallocT>(CopyPrologue(&realloc));
OrgFree = static_cast<FreeT>(CopyPrologue(&free));
OrgMSize = static_cast<MSizeT>(CopyPrologue(&_msize));
 
// Hook CRT memory functions    
mCrtPatches[0] = PatchFunction(&malloc, &NsAlloc);
mCrtPatches[1] = PatchFunction(&calloc, &NsCalloc);
mCrtPatches[2] = PatchFunction(&realloc, &NsRealloc);
mCrtPatches[3] = PatchFunction(&free, &NsDealloc);
mCrtPatches[4] = PatchFunction(&_msize, &NsMSize);
 
// Obtain the CRT HMODULE because we need to patch operatow new/delete 
// functions and we cannot get a pointer to them
HMODULE crtModule;
GetModuleHandleEx(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS | 
    GET_MODULE_HANDLE_EX_FLAG_UNCHANGED_REFCOUNT, 
    reinterpret_cast<LPCTSTR>(&malloc), &crtModule);
 
// operator new        
mCrtPatches[5] = PatchFunction(crtModule, "??2@YAPAXI@Z", &NsAlloc);
// operator delete
 
 
mCrtPatches[6] = PatchFunction(crtModule, "??3@YAXPAX@Z", &NsDealloc);
// operator new[]
mCrtPatches[7] = PatchFunction(crtModule, "??_U@YAPAXI@Z", &NsAlloc);
// operator delete[]
mCrtPatches[8] = PatchFunction(crtModule, "??_V@YAXPAX@Z", &NsDealloc);
 
// Copy prologues for Windows memory functions    
OrgHeapAlloc = static_cast<HeapAllocT>(CopyPrologue(&HeapAlloc));
OrgHeapRealloc = static_cast<HeapReallocT>(CopyPrologue(&HeapReAlloc));
OrgHeapFree = static_cast<HeapFreeT>(CopyPrologue(&HeapFree));
 
// Hook Windows Heap functions
HMODULE hModule = GetModuleHandle(NST("kernel32"));
mW32Patches[0] = PatchFunction(hModule, "HeapAlloc", &NsHeapAlloc);
mW32Patches[1] = PatchFunction(hModule, "HeapReAlloc", &NsHeapRealloc);
mW32Patches[2] = PatchFunction(hModule, "HeapFree", &NsHeapFree);

Care must be taken to avoid infinite recursions. For example, the CRT memory functions are implemented using Windows Heap functions. This can be a problem if your memory allocator is implemented with malloc and free. Another source of problems can be found when you have several versions of the CRT in the same process. This happens when distinct dynamic libraries are linked with the static library version of the CRT.

The recommended settings are as follows. First, link as much libraries as possible with the DLL version of the CRT. This is the CRT that will be used by the memory manager too. Second, hook the malloc functions from this DLL to be redirected to your allocator. And third, hook the HeapAlloc functions to redirect only the requests being done in heaps distinct from the CRT used by the memory manager. Requests coming from the main CRT will be hooked by the malloc functions.

This is illustrated with the following example:

////////////////////////////////////////////////////////////////////////////
LPVOID WINAPI NsHeapAlloc(HANDLE hHeap, DWORD dwFlags, 
    DWORD_PTR dwBytes) 
{        
    void* ptr = OrgHeapAlloc(hHeap, dwFlags, dwBytes);
 
    if (hHeap != reinterpret_cast<HANDLE>(_get_heap_handle()))
    {
        MemoryManager::Track(ptr, dwBytes);
    }    
 
    return ptr;       
}

Memory tracking

Apart from the redirection mechanism to your own allocator, having control of all the memory allocations allows you to implement different tracking mechanisms that can give you valuable information. The tracking mechanism should be only active in the selected build configurations because it can add a considerable overhead. Two examples of tracking mechanisms are briefly described:

  • Leak detector: the memory manager can keep a list of the currently allocated blocks with extra information about them: when it was allocated, which thread requested it, in which module, stack trace, etc. When this tracking mechanism is active, the memory manager increases the memory allocation sizes by 4 bytes (a pointer, 8 bytes in 64 bits) to hold a pointer to this node info. When your program exits, the memory manager can report a list of the blocks still allocated and that are, probably, leaks. Having a leak detector active while you are building your software is an invaluable feature that will improve the quality of your architecture.
  • Memory Stats: with a tagging mechanism you can have exact control of where your memory allocations are being requested. Among all the tagging mechanisms, the stack based is the less intrusive and probably more flexible. Whenever a new memory block is requested to the memory manager, it is attached to the current stat block.

    The current stat block can be pushed / popped with macros like in the following example:

    void DX9RenderSystem::Init()
    {  
        NS_PROFILE("rendersystem", Mem);    
     
        mD3D = Direct3DCreate9(D3D_SDK_VERSION);    
     
    ...

    In the above example, all the memory allocated by the Init() function (and by all the functions called from it, including SLT allocations, etc) are accumulated in the “rendersystem” node. When exiting that function, the “rendersystem” node is popped from the stack.

    Each frame, the memory manager could collect all the node stats and information about the memory usage could be listed. For example:

    Node                            Memory         Memory (including children)
    --------------------------------------------------------------------------
    *root                  0%      10667320 ( 56%)    19029395 (100%)
    RenderSystem          56%       6381409 ( 34%)     6386525 ( 34%)
    AudioSystem           90%       1835510 ( 10%)     1836646 ( 10%)
    reflection            99%        105028 (  1%)      106576 (  1%)
    symbols              100%         29300 (  0%)       29300 (  0%)
    factory              100%         10800 (  0%)       11352 (  0%)
    --------------------------------------------------------------------------

 

2. Memory Pools

A memory pool preallocates a number of fixed size memory blocks from the memory allocator. The pool offers the same functionality than the memory allocator but due to the fixed size restriction it can offer several important advantages:

  • Avoiding Dynamic Memory Allocation: except for preallocated blocks, whose memory is requested to the memory allocator when the pools are initialized, each allocation operation done by the Memory Pool implies no heap allocation. This results in faster operation times, nearly in the same order of magnitude than stack based allocations
  • Constant time for Allocation / Deallocation: thanks to the fixed size restriction, an optimal implementation of a Memory Pool can allocate and deallocate block in O(1) time
  • Cache coherence: all the blocks preallocated in the Memory Pool are located in a contiguous block of memory improving the cache performance
  • Minimum Memory Fragmentation: in a Memory Pool there is no external fragmentation and the internal memory fragmentation when the pool is full is insignificant (there is only internal memory fragmentation in the big block allocated at initialization time)

Small objects (few bytes) that cannot be stored in the stack should me managed by Pools. A good implementation can be found in Alexandrescu’s book [10]. To encapsulate the pool usage class operator new and delete can be defined in the following way:

////////////////////////////////////////////////////////////////////////////
#define MANAGED_BY_POOL(Pool)\
    static void *operator new(NsSize size)\
    {\      
        return Pool->Allocate();\
    }\
\
    static void operator delete(void* ptr)\
    {\
        Pool->Deallocate(ptr);\
    }\
\
    static void *operator new(NsSize size, void* placementPtr)\
    {\       
        return placementPtr;\
    }\
\
    static void operator delete(void* ptr, void* placementPtr)\
    {\
        NS_UNUSED(ptr);\
        NS_UNUSED(placementPtr);\
    }

The following example implements a small struct that is transparently managed by a pool.

extern ObjectPool<Point, 32>* gPointPool;
 
////////////////////////////////////////////////////////////////////////////
struct Point
{
    NsFloat32 x, y, z;
 
    MANAGED_BY_POOL(gPointPool);
};

STL pools

STL is the perfect place where to use Memory Pools. Node base containers (list, map, hash_map) make heavy use of dynamic memory. To avoid the costs described above it is recommended to use Pools with STL. Memory used by STL containers is controlled by allocators that are passed to the container when it is created. [11] is a very recommended read that gives a good description of how to implement STL allocators. There you will learn how to code an allocator that uses a memory pool.

struct TrackInfo
{
    NsSize size;
    MemProfilerNode* profilerNode;
};
 
typedef std::map<
    const void*, 
    TrackInfo, 
    std::less<const void*>, 
    PoolStlAllocator<std::pair<const void*, TrackInfo>, 1024> 
 
> ExternalTrack;
 
////////////////////////////////////////////////////////////////////////////
ExternalTrack mExternalTrack;

3. Garbage Collection and Smart Pointers

The less interaction you have with the memory manager the better. Memory management errors are probably one of the worst problems in software engineering. Memory leaks, buffer overruns, etc. That is why modern languages like Java, Python and C# try to avoid exposing memory management to programmers using memory garbage collection techniques.

In C++ there is no automatic garbage collection but it can be implemented using object reference counters and smart pointers. This way, manual new and deletes can be avoided reducing the probability of incurring into memory errors. In Alexandrescu’s book [12] all the details for implementing smart pointers can be found.

To implement the reference counter mechanism each class must derive from a base class defining an integer for the counter and public functions AddReference() and Release() to manage the counter. The Release() function is in charge of deleting the instance when its counter reaches to zero. An intrusive implementation for reference counting where the counter is stored in the class itself is the recommended one for reasons of efficiency.

////////////////////////////////////////////////////////////////////////////
class BaseRefCounted: public BaseObject
{
public:
    /// Default constructor.
    BaseRefCounted();
 
    /// Increments reference count
    /// \returns Number of references after incrementing one
    int AddReference() const;
 
    /// Decrements reference count, deleting the object when reaches 0
    /// \returns Number of references after releasing one
    int Release() const;
 
    /// Gets current reference count for the object
    /// \return Object number of references
    int GetNumReferences() const;
 
protected:
    /// Destructor. Base classes are abstract classes. Destructor is pure 
    /// virtual. This destructor is declared protected to avoid deleting 
    /// reference counted objects. Release() should be used in this case.
    virtual ~BaseRefCounted() = 0;
 
private:
    /// Base classes are non-copyable objects
    //@{
    BaseRefCounted(const BaseRefCounted&);
    BaseRefCounted& operator=(const BaseRefCounted&);
    //@}
 
    volatile mutable int mRefCount;
};

There are alternatives to reference counting (read [13] for a mark-sweep algorithm) but it is probably not worth the effort using such a sophisticated algorithm if you are mixing in your program C++ code with other high level scripting language where garbage collection is implemented natively.

The code becomes a lot cleaner and safer when using smart pointers:

////////////////////////////////////////////////////////////////////////////
/// d0 and d1 are automatically destroyed when no code references them
////////////////////////////////////////////////////////////////////////////
Ptr<IStream> d0 = fileSystem->OpenRead("Dat/Stats/s0.dat");
d0->Read(buff, 512);
 
Ptr<IStream> d1 = fileSystem->OpenRead("Dat/Stats/s0.dat");
d1->Read(buff, 512);

4. Conclusion

This article described three memory management layers: the memory allocator, the pool allocator and the garbage collector allocator. Although memory management is important in the architecture, its usage must be as transparent as possible. This is the principle behind the layer subdivision proposed in this article.

Thanks for reading. Comments are opened for discussing anything related to it: bugs, suggestions, improvements, questions…
 

  1. Doug Lea Memory Allocator (dlmalloc)
  2. Wolfram Gloger’s malloc homepage
  3. The Hoard Memory Allocator
  4. google-perftools. Fast, multi-threaded malloc() and nifty performance analysis tools
  5. Monitoring Your PC’s Memory Usage For Game Development
  6. Detours instrumenting library homepage
  7. TRET: The Reverse Engineering Toolkit
  8. libdasm: simple x86 disassembly library
  9. Why Your Windows Game Won’t Run In 2,147,352,576 Bytes
  10. Loki Library – Small Object Implementation
  11. Pete Isensee on Custom STL Allocators
  12. Modern C++ Design
  13. A garbage collector for C and C++



  1. Good article.



    Comment by gyakoo
    August 19, 2008 @ 10:11 #

  2. Nice work Jesus! really interesting the way you can use your own allocator. Right now i’m using a lot of C# but in a really near future i’m going back to C++ and perhaps i could try this approach. Thank you and keep going.



    Comment by ProD
    August 19, 2008 @ 10:18 #

  3. Wow, too low-level for me sometimes but pretty interesting none the less. Keep up the good work!



    Comment by Miguel Herrero
    August 19, 2008 @ 11:02 #

  4. Fantastic article, very comprehensive!



    Comment by Josh Szepietowski
    August 19, 2008 @ 18:45 #

  5. Nice article, really good stuff. Just one thing, I´ve been playing with the detour lib to forward file access calls and the express edition is pretty lightweight, free and you can hook easily to all the msvcrt dlls you want. Check this code.

    fopen60 = reinterpret_cast(DetourFindFunction( “msvcr60.dll”, “fopen” ));
    if (fopen60) DetourAttach(&(PVOID&)fopen60, Myfopen60);

    fopen70 = reinterpret_cast(DetourFindFunction( “msvcr70.dll”, “fopen” ));
    if (fopen70) DetourAttach(&(PVOID&)fopen70, Myfopen70);

    fopen71 = reinterpret_cast(DetourFindFunction( “msvcr71.dll”, “fopen” ));
    if (fopen71) DetourAttach(&(PVOID&)fopen71, Myfopen71);

    fopen80 = reinterpret_cast(DetourFindFunction( “msvcr80.dll”, “fopen” ));
    if (fopen80) DetourAttach(&(PVOID&)fopen80, Myfopen80);

    I haven´t played with the memory allocation functions but it should be more or less the same. So if you get hooked to all the available visual studio crt implementations you can be sure that your allocator will receive all possible calls, or may be not. Probably you have already check this and it didn´t work.



    Comment by albaregar
    August 20, 2008 @ 1:22 #

  6. Hi Alberto,

    Yes, I played with Detours Express but the problem is that the license is for non-commercial applications:

    “Detours Express 2.1 is available for immediate download under a no-fee, click-through license for research, non-commercial, and non-production use on 32-bit code.”

    It is a pity. :)

    Thanks everybody for the good comments!



    Comment by ent
    August 20, 2008 @ 12:55 #

  7. Nice work, I liked the thoroughness and the organization of the article.

    I disagree on one point, though, and was missing some discussion of an aspect I consider fundamental.

    The way you extend the allocations to hold the leak information has the inconvenient that it will modify the footprint of the application in an undefined way. I believe it’s better to keep that information separate from the allocations themselves (say in a private list or even a different heap) so the behaviour of the application remains unchanged and you know exactly the overhead you’re adding. Also, it lets you easily extend the data you’re keeping, for instance for storing callstack information.

    The aspect I missed is the use of more than one heap, as opposed to have everything go into the same big heap (which I assume you allocate at the beginning using GlobalAlloc or similar). There are advantages and drawbacks to both approaches. Using multiple heaps you keep all your subsystems separate, minimize fragmentation (especially if you’re careful to give their own heap to libraries that tend to do lots of smallish allocations), and can enforce very specific memory budgets (which is critical if you work on consoles). On the other hand, you’re wasting some memory by having just a few KB available in each of the heaps, and by the overhead of the heap itself. Also you need a well defined policy when an allocation doesn’t fit in a private heap, but there’s memory available elsewhere: do you print an error message and get the memory from the global heap? or crash the application to catch systems going over budget asap?

    Finally, I would be wary of garbage collection in a real-time application, although I admit I haven’t tried it with C++ (just doing it with scripted languages can be a pain in the ass). And reference counting can also be a double-edged sword, if you’re not careful to avoid cyclic references, for which you absolutely need a way to keep weak references.

    Keep up the good work. Looking forward to seeing what you’re working on… :-)



    Comment by Sergio
    August 20, 2008 @ 20:10 #

  8. [...] Practical Efficient Memory Management You can never know too much about your memory. [...]



    Pingback by Squeezy Links | Programmer's Log
    August 20, 2008 @ 22:18 #

  9. Interesting article, there are some new parts there that I was not aware of(windows specific). When you are hooking new and delete operators are these the throwing versions or the no_throw versions and therefore which version are you not hooking?
    For a memory pool which I have created in the past I have used mmap and VirtalAlloc to create pages of memory, yet you suggest a preallocated fixed size with no additional allocations. To not allow further allocations requires extensive knowledge about the type used in the memory pool, ie how many of the object is it possible to have in game at once etc is this always possible or predictable?



    Comment by liam
    August 21, 2008 @ 15:07 #

  10. Sergio,

    Nice to see you here. :)

    Yes, the tracking technique described in the article modifies the footprint adding 4 bytes and changing the aligning (anyway if you want an aligned allocation you must use a specific function for that purpose). But the solution of having the blocks separately will increase the deallocation time (having to look for the tracking block). It is common to have libraries that modify the footprint in debug (Windows Heap, MSVCRT, STL, …) and I have always opted for that solution but I have to test the impact of that solution. Are you using it?

    About the heaps I think I didn’t explain it properly. All the allocations controlled by your application (operator new, hooked malloc, etc) go to the same allocator Heap. But for example, when you start to hook functions like HeapAlloc you are starting to receive allocations coming from other Heaps (for example, DirectX is linked with a different CRT that the one you use in your program compiled with Visual Studio 2005). Trying to redirect this allocations to your memory manager is problematic, a lot… it is better to only track that allocations (for stats and memory usage information) but allowing them to go to the original pool. It is really dificult to explain without source code… I prefer to have only one Heap for the application and budget it with the stats mechanism. What solution to you think is better?

    And the garbage collection… yes :) But what I have always said, you can use direct pointers (outside the smartpointer) for the critical parts and left the smart pointer to inform about the ownership of the pointer (the instance having the smart pointer is the owner, the one who will delete it). In the rest of the non critical code (90%) smart pointers + reference counting have always worked fine for me.



    Comment by ent
    August 21, 2008 @ 15:49 #

  11. liam,

    You should hook all the versions of the operators, including the array and non-throw versions. The code above is only hooking the normal and array version. The exported names are CRT version dependent and you must inspect the version you are using with a tool like Depends.exe to get that information.

    Yes, the trick of the pool with the VirtualAlloc have worked fine for me in the past. As similar technique is being used by the google allocator. The pool implementation can grow when there is no more space left. In that case, you preallocate another big chunk and chain it to the previous one. The recommended implementation from Alexandrescu (in the bibliography of the article) use exactly that technique.



    Comment by ent
    August 21, 2008 @ 15:58 #

  12. I don’t think the impact in the delete times would be too great, it’s just a lookup in a hash table. The system we use currently goes even further, reporting all memory information through the network to an external application that keeps a complete history, including high watermarks. Sadly, the performance hit of having to send stuff through the network means we don’t have that running all the time, but when you’re debugging a memory problem there’s no better way to do it (and the game only runs at a third or so of the normal speed). In consoles it’s tricky anyways, as you’re always so tight with memory (especially on the 360) that you really can’t even afford 4 extra bytes per allocation.
    .
    The thing with multiple heaps is not a result of external libraries demanding it, it’s done for internal organization of the application. What I’m describing is you allocate a big chunk of memory at the beginning, and create several Doug Lea (or whatever) heaps in it, instead of just one.
    For instance, imagine you’re working on a streaming application. You load a few chunks of the world and then create a bunch of dynamic objects. It all goes together, both the big textures and the small allocation of the objects. Then you unload a chunk of the world and load another, with possibly allocations of different sizes. Your memory is becoming disorganized and fragmented. Now imagine you let Lua use that same heap (Lua is horrible with allocations, does it all the time, with all kinds of sizes in an unpredictable manner).
    It becomes harder and harder to keep things in line this way. Sure, you can still keep track of how much memory everyone is using, but no matter how good the allocator is, fragmentation will very likely kill you in the end.
    If you’re targeting just PC, then it may make more sense, as you can rely on pages and virtual allocations to save you. As I said, there are advantages for either approach. Currently our engine is geared towards consoles, so we do heavy use of heaps. We have more than 60, each heap reserved for one streaming cell, or one character, or instance data for props, or lua, or AI, or… I personally think it’s too much, but one is probably too little :)
    .
    Hmm, why would you want to use direct pointers when you could use weak references? Strong references are fancy, but personally, unless a resource is really shared (e.g. a texture) I’d say you’re better off with specific ownership and weak references or handles. For instance, for basic game objects, you could create them with references and leave them hanging in the open, but I’d rather have a central manager owning them and everyone else holds a weak reference that goes NULL when the object is destroyed for any reason. Keeping direct pointers from one frame to the next is a recipe for obscure crash bugs and convoluted code that has to notify every single system in the game when an object gets deleted.



    Comment by Sergio
    August 21, 2008 @ 19:47 #

  13. This last comment was great, but I disagree on the weak references arguments. Sometimes happen that you want or even need something to be allocated until you stop using it or you decided so, and you can assume the performance hit of hard references. So usually the best option is supporting both weak and hard references.



    Comment by albaregar
    August 22, 2008 @ 0:18 #

  14. Agreed, supporting both is definitely the best option.



    Comment by Sergio
    August 22, 2008 @ 1:09 #

  15. Wenas…
    Un peasso the artículo, eh Jess.

    Back in the times of the PS2 (long ago and -not very- far away) we used the several heaps solution this way:
    - PS2 devkit has several times more memory than the retail console, so we dumped all our debug info in a heap outside of the first 32Mb where the game MUST fit when running a final release version.
    - At any moment or when the game crashed we just write the info stored in the debug heap to a file and analyze it with an external application. We were able to review the full memory history (with a call stack for each allocation) for a limited period of time. It was useful (and funny).
    I believe this is not possible in the current generation of consoles with the memory constraints that devkits have. A network solution (like Sergio told us) or just a file can do the trick.



    Comment by Gus
    August 23, 2008 @ 0:14 #

  16. I’ve created simple system for tracking memory related problems similar to the one described by Sergio. More in-depth info can be found here: http://msinilo.pl/blog/?cat=7 . Most recent article describes real-world application (it actually helped us to remove plenty of leaks and cut down memory consumption by 20%).



    Comment by yarpen
    August 24, 2008 @ 16:57 #

  17. Hi Gus, welcome… :)

    Following with Sergio, yes the Heap approach can be of great benefit in those platforms without a proper virtual memory mechanism. Now, we are using only a single heap for the main allocations plus a few Heaps for specific purposes (python and entities for example) and yes, 60 heaps seem to be a lot. I wonder how you make sure that the allocations are done to the proper heap in the code. a tagging mechanism?

    And about the weak pointers. I think you are thinking only in the logic part of the engine (correct me if I am wrong). In that part I agree with you that the best is using weak references for all the entities. But, outside that part, weak pointers cannot substitute normal pointers because weak pointers are not as cheap as raw pointers.



    Comment by ent
    August 25, 2008 @ 3:31 #

  18. Hi yarpen,

    Interesting article and interesting blog :)

    I have a question for you about the StackWalk64… Yes, it is slow but not using it generates incorrect stack traces in release (due to the FPO). Do you switch dinamically to StackWalk64 when RtlCaptureStackBackTrace fails?



    Comment by ent
    August 25, 2008 @ 3:57 #

  19. I considered it, but it turned out it’s easier just not to use FPO (performance gains werent noticeable in our case, so it just wasnt worth it).



    Comment by yarpen
    August 25, 2008 @ 10:38 #

  20. True, I will disable it (/Oy-) in Debug/Release and leave it activated in the Final configuration. Thanks for the idea.



    Comment by ent
    August 25, 2008 @ 21:06 #

  21. [...] Christian Gyrling’s Are we out of memory? and Jesus de Santos Garcia’s post about Practical Efficient Memory Management (though the latter is clearly a bit too PC-centric, so pinch of salt for consoles). Also read [...]



    Pingback by realtimecollisiondetection.net - the blog » Posts and links you should have read
    September 2, 2008 @ 17:34 #

  22. Hi,

    I am not sure if I agree on all parts of your article. You say: ‘the best memory management is doing no memory management at all’. This is derived from the performance optimization rule, but it doesn’t necessary need to work for memory management. I am not sure what “doing no memory management al all” means. In memory management it is harder to create allocators that are space efficient than time-efficient (creating an O(1) allocator is peanuts). Doing no memory management at all can often translate to (what you are also saying later in the article): avoid dynamic memory allocations. If you are trying to do so, this means that you will need to allocate up front or in pools. In many ways this is not space efficient at all. Pools are often thought of as good solutions to avoid fragmentation, but in fact the internal fragmentation is much worse than the external fragmentation. You say that internal fragmentation is insignificant at full loads. Sure, but the problem is generally that the pools are often not full. They need to be large enough for worse-cased scenarios. Using pools you design your application for worse case scenario loads. They cannot be load-balanced. To give an example: you have a memory pool that holds your particles. In some levels you have many particles and in some you have only a few. The pool size needs to be big enough to hold the particles in the worst-case level. Some levels that contain only a few particles may have very large textures, but due to the memory pool structure you cannot load-balance. In most of the levels you have wasted a large part of your memory. A decent general purpose allocator like the ones you suggested would have been much more efficient. You might say that you would setting a pools size per level would solve this problem, but in fact you are only moving it to a smaller scope. And, this process is very cumbersome and error-prone. To conclude, this example shows that “doing no memory management at all” is not entirely true. Doing memory management, but properly, would have been better.

    PROs: high performance. And if you get your title running it will keep on running. That is important on game consoles. CON: you are almost certain to be wasting a lot of memory.

    I would recommend to everybody to read the paper: “Dynamic storage allocation, a survey and critical review”. It is an essential read prior to doing anything with memory management. Google it ;)



    Comment by Jelle van der Beek
    September 12, 2008 @ 16:58 #

  23. Hi,

    Yes, the phrase ‘the best memory management is doing no memory management at all’ is not very correct. I agree with you. My idea was to express that the best very memory management is doing no dynamic allocation at all. Obviously that is hard to achieve without a good memory management architecture. I will fix it in the next version of the article.

    About the pools. Most implementations of pools allows some kind of granularity control. For example, your pool of particles can be a linked list of memory chunks. Imagine, you choose the granularity at 32. In that case your number of allocated particles will always be a multiple of 32. But a map with 200 particles will allocate 7 chunks for the pool and the map with 2000 particles will allocate 63 chunks.

    Thanks for commenting and thanks for that paper recommendation.



    Comment by ent
    September 16, 2008 @ 11:33 #

  24. Thanks for your answer!

    Ok, continuing down this path, what is a dynamic memory allocation? I know you probably mean: passing an allocation request to the OS. Because, whatever way you put it, dynamic memory allocations are being performed whether you use pools or whether you use the default malloc implementation. There aren’t many differences except for the fact that an allocation at the OS level may be slower in some implementations. The Windows allocator also has specific pool-like structures to optimize for allocations of certain sizes smaller than a certain treshold. Conceptually, there isn’t a much of a difference. Even more, many OS allocators will likely be more efficient than a common implementation. I have also seen some that operate really bad under some circumstances. So, what does “not doing dynamic memory allocations” precisely mean?

    I wonder whether you are not looking at this problem from the wrong angle. Personally I believe that many people do. It sometimes seems that pools are generally looked at to be a very good solution to avoiding fragmentation, but the paper I mentioned proves that global solutions are superior to local solutions. So I wonder how you will look at this problem after reading this paper (reading the first 30/40 pages should do).



    Comment by Jelle van der Beek
    September 16, 2008 @ 12:39 #

  25. cool stuff!



    Comment by elvis
    September 19, 2008 @ 22:55 #

  26. Thanks for great article.

    >Although the implementation provided by Visual Studio have been >improved a lot (last version tested: Visual Studio 2005) it is not >especially good at the points described before.

    What do you think about Low Fragmentation Heap mechanism?



    Comment by virtul
    September 21, 2008 @ 10:08 #

  27. Hi virtul,

    The problem with the Low-Fragmentation Heap is that the granularity can be excessive for very small blocks. For example, for a 10 bytes structure, 16 bytes are reserved from the system (an increase of 60%!). This was enough for me to refuse using them but I would like to see stats about them.

    Have you tried them?



    Comment by ent
    September 23, 2008 @ 20:01 #

  28. someone should repeat these memory tips on their fireviews.com profile. they seem really useful. i’d do it myself but i’m trying to save up my views



    Comment by yobbo
    July 5, 2010 @ 2:21 #

  29. I’ve written article about this topic, here:
    http://blog.szulak.net/programming/dynamic-memory-management-in-c/



    Comment by szulak
    September 18, 2013 @ 11:00 #


Sat, 22 Nov 2014 00:48:07 +0000 / 27 queries. 2.104 seconds / 4 Users Online

gentoo link wordpress link apache link PHP link

Theme modified from Pool theme. Valid XHTML and CSS