Fast File Loading / part II – Load-In-Place

February 21, 2007 @ 0:15 | In Programming | 20 Comments

In my first article on Fast File Loading, I described techniques, directly related to the hardware and the operating system, for loading files in the most efficient way. This one describes a technique for organizing data so that it can be loaded as fast as possible: basically, we are going to load data without doing any parsing at all. Although the techniques described here are oriented towards realtime applications, any application can use them to accelerate its load times.

The implementation and all the samples shown here can be found in a project that you can download. It is a project for Microsoft Visual Studio 8.0 and it has been tested under a 32-bit architecture. Extensions to other compilers and architectures should be easy to add. This code must be taken as a proof of concept and not as a solid, robust implementation ready to be used in a project. I have tried to isolate the dependencies on other systems so that you can concentrate on the topic of this article.

1. Load-In-Place (aka Relocatable Data Structures)

The basic philosophy behind Load-In-Place is “preprocess as much as possible”: do not waste CPU ticks on operations that can be done offline, so that the loading process can run in parallel without interfering with your main process.

Imagine that you could load your data with a simple fread() from a file, without doing any parsing at all. This is basically what we are going to try to do, without modifying the structures we usually have.

If we had, for example, a data file with a vector of floats, loading with this philosophy in mind would be as simple as memcpying the file to memory and pointing a float* to the beginning of the memory buffer. And that would be all: we would have our data ready to be used.

For simple structures that would be enough, but this simplicity disappears when we start to use more complex structures in C++:

  • Pointers: it is usual to have pointers embedded in the data. Pointers cannot be saved to disk without proper management, and when you load them they have to be relocated properly.
  • Dynamic Types: a pointer to a type T doesn’t imply that the dynamic type is T. With polymorphism, we have to solve the problem of loading/saving the proper dynamic type.
  • Constructors: objects in C++ have to be properly constructed (by invoking one of their constructors). We wouldn’t have a generic enough implementation if we didn’t allow object construction. Constructors are used by the compiler to set up internal structures associated with a class, like vtables, dynamic type info, etc.

Basically, what we want is a system able to take a snapshot of a memory region and save it to disk. Later, when loading the block, we need the system to make the proper adjustments so that the data ends up in the same state it was in before saving. The mechanisms described here are very similar to those used by compilers to generate modules (executables, static libraries, dynamic libraries) that later have to be loaded by the OS (which relocates their pointers, for example).

With this technique, the runtime loads native hardware formats. This may imply a redesign of your data pipeline. A good approach is to have a platform-independent format that is processed for each platform, producing a native format that is loaded in place. This is the same procedure that is followed when generating binaries (executables, libraries), so it may be a good idea to unify the two: now code and data are treated the same way.

2. Implementation Details

The class SerializationManager is in charge of loading/saving objects from/to file. Basically, you have a function for saving, to which you pass a pointer to a root object, and a function for loading with the same parameters. The root class has to satisfy some requirements to be serializable. The implementation is designed to be as unintrusive as possible, meaning that you can design your classes as you wish. To make a class in-placeable, you have to implement two functions for each class you want to save:

  • CollectPtrs(): this function is used when saving. In it, you collect all the pointers of the class and continue gathering recursively.
  • In-Place Constructor: this function is used when loading. It is a special constructor for Load-In-Place in which you relocate the pointers. It is a good idea not to have default constructors in the classes you want to in-place; that way, a default constructor is not silently called if you forget to call the in-place constructor.

The SerializationManager traverses the pointers starting from the root object, looking for contiguous memory blocks. When all the contiguous memory blocks have been located, each pointer is adjusted to become an offset from the beginning of its block. When loading, those pointers are relocated (the stored offset is relative to the beginning of the block) and the objects are initialized (each one only once) using the placement syntax of the new operator:

// Class constructor is called but no memory is reserved
Class *c = new(ptr) Class();

Let’s see a simple example. The function bodies below are schematic (the exact SerializationManager calls can be found in the sample project):

class N0;
class N1;
class N2;
class N3;

class N3
{
public:
   float a, b, c;

   N3() {}

   // In-place constructor: no pointers here, nothing to relocate
   N3(const SerializationManager &sm) {}

   void CollectPtrs(SerializationManager &sm) const {}
};

class N0
{
public:
   N2 *n2;
   N1 *n1;

   N0() {}

   // In-place constructor: relocate both embedded pointers
   N0(const SerializationManager &sm)
   {
      sm.Relocate(n2);
      sm.Relocate(n1);
   }

   void CollectPtrs(SerializationManager &sm) const
   {
      sm.CollectPtr(n2);
      sm.CollectPtr(n1);
   }
};

class N2
{
public:
   N1 *n1;
   N3 *n3;

   N2() {}

   N2(const SerializationManager &sm)
   {
      sm.Relocate(n1);
      sm.Relocate(n3);
   }

   void CollectPtrs(SerializationManager &sm) const
   {
      sm.CollectPtr(n1);
      sm.CollectPtr(n3);
   }
};

class N1
{
public:
   float val;
   N2 n2;   // held by value: N2 must be fully defined before N1

   N1() {}

   N1(const SerializationManager &sm): n2(sm) {}

   void CollectPtrs(SerializationManager &sm) const
   {
      n2.CollectPtrs(sm);
   }
};
With the above class definitions we serialize the following instances:

N0 n0;
N1 n1;
N3 n3;

n1.val = 1.0f;
n0.n1 = &n1;
n0.n2 = &n1.n2;
n0.n2->n1 = n0.n1;
n0.n2->n3 = &n3;

manager.Save("raw.class", n0);

This is a graphical representation of how the In-Place process would be applied to this sample:

To save classes in this way, you need to add the two functions described above to them. This intrusive method doesn’t allow, for example, saving STL containers. To solve this problem (and probably to avoid dependencies on a specific STL implementation), you will probably need to develop your own in-placeable containers. I have included a Vector and a StaticVector implementation in the sample project.

Polymorphic classes need to be initialized in a different way. For these classes, I am using an object factory that assigns a unique ID to each class. When saving these classes to disk, the ClassID is written in the vtable position (the 4 bytes at the beginning of the object in most compilers, on the 32-bit architectures targeted here). This is a sample of a Vector of polymorphic objects:

Vector<Renderable *> renderables(4);

renderables[0] = new Sphere(10.0f);
renderables[1] = new Cube(10.0f);
renderables[2] = 0;
renderables[3] = new Mesh(5000);

manager.Save("raw.class", renderables);

As you can see in the example above, null pointers don’t need any special treatment: a 0 is written to disk.

One last point about memory management. When you load classes in place using the SerializationManager, you receive a pointer to the root class inside a Resource instance (which acts like a smart pointer) holding the raw allocated memory. When you finish using that Resource, you should Release() it. When a Resource is Released, destructors for the classes in the tree are not invoked. That way, you can use the destructor to free memory when the class is used in the normal way (not loaded in place). When the class is loaded in place, it shouldn’t free the memory, because it did not allocate it itself.

Resource<Vector <Renderable *> > renderablesRS;

manager.Load("raw.class", renderablesRS);

// Use renderablesRS
// …



3. Efficiency

All the complex operations happen when saving, so I will concentrate on the loading phase. Loading in place basically consists of two parts: loading the entire file into memory (a relatively long I/O operation) and relocating the pointers through the Load-In-Place constructors (a relatively short CPU operation). Pointers are relocated by adding the address of the memory block, and each object is constructed if it has not been constructed already. Constructing the object means doing a placement new, or invoking the factory (through a virtual function) if the object is polymorphic.

All the critical SerializationManager functions are templatized and will be properly inlined in Release configurations. In the end, you can expect code of a quality very similar to (if not faster than) what you would have written manually.

For example, loading a vector of 3D points (a simple struct with 3 floats) generates the optimal code (just a single fread()) when it is compiled in a release configuration and all the code is inlined (all the relocation code disappears).

Pointer relocation is a very fast operation compared to reading the data from file. So, in the end, the Load-In-Place operation is heavily dominated by I/O, leaving the CPU free for other tasks. In my examples, the relocation phase is always under 0.1% of the total time (with relatively small files; bigger files will go below this). The rest of the time is the I/O phase.

There is one more thing to consider about efficiency. With Load-In-Place, you will have less memory fragmentation, because lots of objects are placed in one contiguous memory block.

4. A more realistic scenario

Having finished all the tests for the Load-In-Place implementation, I decided to take a real scenario from an old project in my source control repository. In that project, I had code to load/save a typical realtime mesh representation: a list of materials, lists of vertices, normals, binormals, tangents, texture coordinate channels, color channels, etc.

The code for that part is included in sample 8 of the project. That sample creates a mesh with 1000 vertices (randomly filled) with all the attributes and two texture coordinate channels. Two fictitious materials are added to the mesh.

In total, 80186 bytes are written to disk and 15 pointers are relocated. The relocation phase in this sample is under 0.03% of the total time on my machine. With this new loading method, the application started to load a lot faster. But that is not all: the door is now open to loading in a background thread, since the loading process is almost 100% I/O. I will talk about this in future articles.

5. Ideas for the future

These are possible ways to improve the simple implementation given within this article:

  • Integration with a class reflection system. Instead of implementing a function to save a class (like CollectPtrs()), you implement a description of the class. Using that description, not only can you save the class (generating the CollectPtrs implementation automatically), you can also, for example, edit it in realtime, edit classes saved to disk, and detect when a class saved to disk is not the same as the one you have at runtime and apply a proper transformation (automatic class versioning). A reflection system deserves an article of its own.
  • As you cannot use the STL, you will have to implement all the containers and algorithms that you need. You will probably want to expose the same interface as the STL, so that the clients of this system don’t have to learn a new (and non-standard) API.
  • A packaging system where you can pack and unpack several object trees into one file would be really useful if you start to use this system in a serious project.
  • Support for external references. In this article, only internal pointers are treated. External pointers (pointers to objects located outside the file) are needed to reference external resources.
  • Probably you will need a more sophisticated method to decide whether an object should be saved polymorphically or not. In the example given, all classes derived from BaseObject are treated polymorphically. The exception is arrays: arrays of objects are never treated polymorphically.
  • With the system described here, and with little modification, you can easily move objects in memory. The process is similar to the one for loading from file. A system that defragments memory by moving objects can easily be implemented this way.
  • Support for multiple inheritance. The implementation given here doesn’t support it. You will probably want multiple inheritance when using interfaces (implementing several interfaces in the same class).


6. Conclusion

A technique for loading data without parsing has been described. Reading C++ instances from disk is one of the fastest ways to load information without incurring a CPU cost (most of the time, the CPU is idle, waiting for the I/O operation to finish). That CPU power could be used to read compressed data (as described in one of my previous articles) to improve load times further. Decompression and pointer-relocation times can be hidden by the latency of the disk I/O operations. In the end, you can go as fast as the read operation allows.

  1. Sample accompanying this article
  2. Fast File Loading
  3. Feeding The Monster: Advanced Data Packing for Consoles
  4. gdalgorithms-list topic on In-place loaded data structures

  1. Super nice article! Sounds very much like the “current next-gen” engine being used by pretty much everybody these days. :)

    Comment by Brian Lawson
    February 26, 2007 @ 16:08 #

  2. Great that you are documenting this, Jesus; in-place serialization is a key factor in current-day console development that game development sites generally ignore because it is not very relevant to small projects on PCs.

    BTW, kudos on the cartoon “avatar”, pretty realistic I must say :P

    Comment by Manolete
    March 5, 2007 @ 19:29 #

  3. Great stuff, Jesus. We’ve been doing this for years, but it’s good to see we’re not the only ones. One thing interesting about our tools vs. runtime code is that we use the same serialization method for both, but different loaders. The tools use one that rebuilds a nicer class structure that has STL containers embedded in them, because in tools, we don’t care about memory allocations or load speed so much. At one point, I worked in a company that had a system that had reflection via class schemas and object factories for classes and pointers and all that crud–it was so frickin’ complicated that nobody could figure out how to add new objects to the data stream, so people began hacking around it. There is a balance to be found between ease of use and feature complexity, I think.

    Comment by Jason Hughes
    May 3, 2007 @ 19:11 #

  4. How come we just don’t design loadable data that has “pointers” or requires polymorphism and constructors. Sounds much simpler, and no “pointer walking” would be needed, no “where’s my vtable pointer??” concerns because of compiler implementation?

    For the other people commenting on this: No it’s not every current next engine that’s going to implement something like this. This doesn’t scale that well when you want to load (or hot-swap) small parts of your data (because you have to resolve pointers every time), nor does it scale well if you want to implement continuous streaming of data, because sooner or later you’re gonna have to defrag your memory, and you need REAL relocatable data.

    Comment by Rush
    May 4, 2007 @ 7:52 #

  5. oops I meant to say “design loadable data that has NO pointers and requires NO polymorphism and NO constructors”

    Comment by Rush
    May 4, 2007 @ 7:53 #

  6. This seems like a nice ripoff of my GDC 2006 presentation: “Feeding the Monster: Advanced Data Packaging for Consoles”. You even use the same “Load-In-Place” term, the type ID in the vtable, polymorphic relocation, reflexion, and argumentation for the system. See slides here for the original work:

    For those interested, my system adds file packaging, indexing and automatic defragmentation of the loaded assets.

    Comment by Bruno Champoux
    May 8, 2007 @ 17:08 #

  7. Bruno, your article is cited in the references of the article. And the Load-In-Place term is not only used in your PowerPoint. Please have a look at the other references and you will discover that this is the normal term used for this technology.

     This is an implementation (full source code, ready to be used) of Load-In-Place technology. Your presentation doesn’t include an implementation, so it is difficult to rip that off. And please don’t tell me that you are the inventor of reflection, polymorphic relocation, object factories…

    Comment by ent
    May 8, 2007 @ 20:27 #

  8. My intention with the GDC talk was to present to the public how A2M implemented its loading system, with no pretension of actually inventing anything. We are fully aware other people use similar relocation techniques (including the “in-place” expression), but to our knowledge, no one in the game industry had presented a complete, coherent, and working asset loading system incorporating C++ object-based LIP at the time of the GDC submission in July 2005. Our original C++ LIP system dates back to 2003, with games using it shipping in 2004. C++ object relocation was then a rare subject in general DLL and persistence discussions.

    Still, I think it’s questionable whether a mere footnote is enough to indicate that most of your thought process is based on our sole GDC talk; the similarity of argumentation, nomenclature and design ideas is just too convenient to be coincidental (that, and the mails I now remember we exchanged a month after the GDC).

    Comment by Bruno Champoux
    May 9, 2007 @ 1:56 #

  9. Jason,

     When you say “the tools use one that rebuilds a nicer class structure that has STL containers embedded in them”: how is that class generated? Dynamically, or is it statically defined by the author of the class?

     I started to implement a version of this that encapsulates vector, list and other containers with the same interface as the STL equivalents, but your idea seems interesting.

    Comment by ent
    May 9, 2007 @ 21:58 #

  10. Rush,

     I started with a simple implementation with only simple structs (no polymorphism, no pointers, etc.) but it was not too useful. We already had a lot of data represented as C++ objects, so we would have had to do conversions from simple structs on disk to complex C++ objects.

     I am using this to implement continuous streaming of data. In the end, you cannot stream continuously at the byte level. You pack your data in packages and your unit of streaming is the package. That package is loaded in place.

    I am going to write an article about this.

    Comment by ent
    May 9, 2007 @ 22:06 #

  11. Bruno,

     The point of this article was to release a simple implementation of Load-In-Place that anybody could download and start playing with. I spent more than a month working on the source code (after trying several implementations), starting from scratch. So please don’t say this is a ripoff of anything.

    When you publish something like you did, I think you expect others to continue or take some of your ideas to avoid reinventing the wheel. Your article gave me a lot of nice information to solve this problem and due to that I am including a link to your Presentation. So everybody reading my article can read yours. I think this is honest but if you think you deserve more credit for that, we can definitely talk about that.

    About the references to your article and others in Gamasutra: it has been an unintentional error that is going to be fixed. I already sent the information to the Gamasutra staff. Sorry for that.

    Comment by ent
    May 9, 2007 @ 22:13 #

  12. Great article, I’m happy to see someone has made a public implementation of a load-in-place system (it would be nice to have such a system in Boost). I’m the co-author of the system Bruno is mentioning and that we both presented at GDC, and, even if I like the system very much, I must say I have mixed feelings about load-in-place. No doubt it is the fastest way to load data (and implementing it was technically very interesting), but it affects loaded classes. Some of my ex-colleagues are still talking to me about the system’s relative burden, so I’m still not convinced the speed gain is worth the effect on production code. Maybe it’s better only for stable and complex data like animations, but not for frequently modified small structures like game objects. I really don’t know.

    The system also needs to have a complementary xml-binary-like format that is backward-compatible. The tool-mode engine needs to be able to save both this format and the load-in-place format. Keep up the good work, it would be nice if at some point we stop reinventing the wheel and share such code.


    Comment by Nicolas Fleury
    May 10, 2007 @ 7:40 #

  13. We are in the process of getting rid of the entire C++ introspection part of the system and replacing it with an IDL-based data description. It makes writing live tools and production code very simple, and yet keeps the LIP packaging/loading advantages. It also prevents the need for compiling the data classes twice (once for the engine side, once for the exporter side), since the IDL parser automatically generates the C# exporter.

    Simple data representation, fast loading, runtime memory defragmentation, we’re happy.

    Comment by Bruno Champoux
    May 10, 2007 @ 21:54 #

  14. Bruno, Nicolas, interesting to see you here.

    Yeah, this article bears more than a passing resemblance to the talk you guys did at GDC.

    I’ve come to regard load-in-place as somewhat of a lost cause lately, though, since we’re loading so much data this round, that it’s probably not as effective as it could be. Of course, it all depends on your requirements, but as the author points out, this method has a huge IO overhead and virtually zero CPU overhead. When the player is staring at the loading screen, I think we should do whatever we can to reduce the IO load even if it costs us 100% of our CPU. As such, loading less data (e.g. “prototype data”) and constructing the objects from that may be more effective going forward.

    For “large streaming worlds” (which we started doing on a PS1 game I was working on in 1996), you should do everything you can to have zero CPU overhead, so then I’d say you should try to load data that functionality consumes rather than full-blown objects. The requirement to call placement new on every object is overhead that’s just wasted in my opinion. Load PODs off of disc, and do as much or as little processing as you need to get that data into your game system.

    I think that data management is a serious topic for modern games, and sadly one that goes undiscussed during a game’s development, but the solutions are going to vary by the problems, and I would not be surprised if the “best” solutions for a game were as varied as the scenarios that need to load data. For my current game, for instance, some data gets cooked right into the executable, some gets loaded out of bare files, some gets loaded out of WADs. Some of the loaded data is further processed after load, and some isn’t. I don’t believe that a “one size fits all” solution is generally appropriate.

    Load-in-place definitely has its place, though, so my hats off to all of you guys for doing the presentations and articles.

    Comment by Tom Plunket
    May 11, 2007 @ 1:44 #

  15. In our case the I/O overhead is quite low, in fact we get effective transfer rates of about 6 to 10 MB/s on the Xbox 360 (probably more if we had the patience to write data to the edge of the disc every time), which is not bad (we don’t do PC games, so no HDD for us).

    We load compressed packages of fairly optimized data structures, and in the future we’ll try to use as much procedural data as possible. You are right, it is good to balance CPU time with loading time. Compression is crucial, in any form, procedural techniques or generic LZ, and that will make good use of all those processors we now have. Something we’ll need to improve in time, but (I think) orthogonal to LIP.

    I think the I/O problems many people have are due to seek times, and the parsing-less nature of load-in-place is still desirable.

    Comment by Bruno Champoux
    May 11, 2007 @ 15:54 #

  16. Hi Tom, nice to see you here.
    I tend to agree with you on many points, but I’m not as convinced that load-in-place data IO overhead is that bigger than with valuable-members-only data; useless members being good compression targets. Like Bruno said, procedural techniques and LIP are orthogonal, even if load-in-place nature could favor a tendency to load final classes. I would expect that streaming simplification tend to produce a speed gain.

    Actually, my main concern sounds like the opposite; raw data is typically taking more space than complex C++ objects, so load-in-place performance effect, positive or negative, might be negligible (compared with loading-in-place only parts of objects). I would then prefer the most productive approach in day-to-day programming.

     I believe in IDL solutions like Bruno mentioned; they can generate a lot of useful code for both tools and engine. But I still doubt full load-in-place as a single solution. The IDL can actually be used to do the exact opposite: control how to serialize a class. Programmers can then end up with more flexibility in their code by dealing only with the IDL for serialized members, however they are serialized. Combined with some form of C++ reflection (can be generated, can be meta-prog, can be both combined), it’s possible to save any object in the C++ engine in tool mode. In a sense, C++ reflection, IDL and load-in-place are all orthogonal.

    I now work in an environment with mixed solutions like you describe, and I tend to prefer it. It’s probably why data management is not a popular topic as you said; we just can’t agree on one ideal solution;)


    Comment by Nicolas Fleury
    May 12, 2007 @ 4:08 #

  17. I am in the process of creating a new Load/Save System. Having worked with both Serialization Systems and InPlace Systems I think that my next implementation is going to be a Serialization System.

     InPlace is 100% efficient, but it is very hard to maintain across platforms and between runtime & editors. You have to save the whole class (although in a lot of scenarios you could save only a small percentage). Compression will probably eliminate all that redundant information, but anyway it doesn’t seem the right way.

     My current approach is a Serialization System, where you serialize the members that you want, and where a schema file is generated for debug versions of the data. The final version doesn’t use schemas; it is a raw dump of the data in the same order it is read, and a lot of optimizations can be applied; for example, a std::vector of PODs can be read with a single fread operation. I have to test it, but I am almost sure I will reach nearly the same speed as in-place with this approach.

    Comment by ent
    October 16, 2007 @ 2:29 #

  18. I did a television interview about the whole subject; here it is:

    Comment by Bruno Champoux
    March 12, 2008 @ 13:41 #

  19. [...] be done is call to ‘fread’ to a memory buffer. For more detailed description of such system see Fast File Loading article by [...]

    Pingback by Reflection in C++ – part 2 | .mischief.mayhem.soap.
    November 29, 2009 @ 20:36 #

  20. Nice discussion

    Comment by shaasthra
    February 2, 2010 @ 13:52 #
