Fast File Loading

October 25, 2006 @ 0:41 | In Programming | 26 Comments | del.icio.us digg devbump rss

Reading data efficiently from Hard Disk and DVD units is vital for video games and one of the more important problems to solve in the next generation of games. While we are getting 20x performance in processing power and memory size, we are only getting 4x performance improvement in data devices (dvd for consoles).

I describe in this post how to efficiently read raw data from disk (hdd, dvd) oriented towards streaming files in a realtime application (although the concepts are useful in other areas). The platform used is Win32, but all the topics covered could be easily ported to other platforms.

I have included a project for Visual Studio 2005 with all the code described here and all the framework to test the different techniques. You can download it from here. The machine where I have done the test is a 3.2GHz Pentium 4. The devices used for testing are:

  • A 7200rpm Hard Disk with an average read performance of 46.8 MB/s (measured using Hd Tach)
  • A DVD-Rom unit (using a DVD+RW media) with a peak performance of 2.40Mb (measured using Nero CD-DVD Speed)

Windows (and all the Operating Systems in general) uses part of the physical memory (not being used by processes) for file caching (you can view how much memory is being used for the File System Cache in the Task Manager). To avoid Windows caching my test files I implemented a function that flushes the cache reading big files before measuring times. The tests were executed 10 times, lasting each one several minutes.

So let’s start the travel… Objective: having a 100MB file in the CPU memory as fast as possible.

1. The Standard and Portable way

The first option is using portable code from the standard C library. All we know the benefits of portability. So we allocate a buffer and fread() the file.

FILE *fp = fopen(FileName, “rb”);
fread(&g_buffer[0], 1, FileSize, fp);
fclose(fp);

 

Stats Min (MB/s) Max (MB/s) Average (MB/s)
HDD 47.847 48.828 48.527
DVD 2.381 2.386 2.383

 

2. The Win32 Native way

Trying to improve the native approach we go for the Win32 native file functions: CreateFile and ReadFile

HANDLE hFile = CreateFile(FileName, GENERIC_READ, 0, 0, OPEN_EXISTING,
   FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN, 0);

DWORD dwNumberOfBytesRead = 0;
ReadFile(hFile, &g_buffer[0], FileSize, &dwNumberOfBytesRead, 0);
CloseHandle(hFile);

 

Stats Min (MB/s) Max (MB/s) Average (MB/s)
HDD 47.483 48.852 48.497
DVD 2.383 2.390 2.386

 

Nearly the same performance. FILE_FLAG_SEQUENTIAL_SCAN is used to direct the Cache Manager to access the file sequentially. It is recommended to use it when reading large files with sequential access. I though that the FILE_FLAG_SEQUENTIAL_SCAN hint would give a better improvement than this but obviously the fread implementation in Win32 is doing a good job.

3. File Memory Mapping

Our next approach is trying memory mapped files where the system reads the data from disk on demand using the same mechanism that is used in Virtual Memory.

HANDLE hFile = CreateFile(FileName, GENERIC_READ, 0, 0, OPEN_EXISTING,
   FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN, 0);
HANDLE hFileMapping = CreateFileMapping(hFile, 0, PAGE_READONLY, 0, FileSize, 0);

int iPos = 0;
const unsigned int BlockSize = 128 * 1024;

while(iPos < FileSize)
{
   int iLeft = FileSize – iPos;
   int iBytesToRead = iLeft > BlockSize ? BlockSize: iLeft;

   void *rawBuffer =
      MapViewOfFile(hFileMapping, FILE_MAP_READ, 0, iPos, iBytesToRead);
   memcpy(&g_buffer[iPos], rawBuffer, iBytesToRead);
   UnmapViewOfFile(rawBuffer);

   iPos += iBytesToRead;
}

CloseHandle(hFileMapping);
CloseHandle(hFile);

 

Stats Min (MB/s) Max (MB/s) Average (MB/s)
HDD 45.830 48.828 48.190
DVD 2.524 2.528 2.526

 

We are getting a significant improvement when reading from the DVD. Reading from Hard Disk is nearly the same performance.

4. Asynchronous I/O

Async I/O places disk requests in a queue for the disk controller and returns immediately. One of the best advantages of Async I/O is that the disk is always busy without entering and leaving from kernel mode. It is important to note here that we are bypassing the Win32 cache using the FILE_FLAG_NO_BUFFERING avoiding any unnecessary memory copy operation. The data moves directly into the application from DMA. I found hard to use the flag FILE_FLAG_OVERLAPPED (for Async I/O) without the FILE_FLAG_NO_BUFFERING. Most of the times, windows was not overlapping my disk requests.

I got the best results keeping 8 I/O requests active at all times. While the requests are being processed the CPU is copying the already processed ones.

By the way, when using FILE_FLAG_OVERLAPPED, you need to read to sector aligned buffers. I am using VirtualAlloc for this purpose.

for(int i = 0; < NumBlocks; i++)
{
   // VirtualAlloc() creates storage that is page aligned
   // and so is disk sector aligned
   blocks[i] = static_cast<char *>
      (VirtualAlloc(0, BlockSize, MEM_COMMIT, PAGE_READWRITE));

   ZeroMemory(&overlapped[i], sizeof(OVERLAPPED));
   overlapped[i].hEvent = CreateEvent(0, false, false, 0);
}

HANDLE hFile = CreateFile(FileName, GENERIC_READ, 0, 0, OPEN_EXISTING,
   FILE_ATTRIBUTE_NORMAL | FILE_FLAG_NO_BUFFERING |
   FILE_FLAG_OVERLAPPED | FILE_FLAG_SEQUENTIAL_SCAN, 0);

int iWriterPos = 0;
int iReaderPos = 0;
int iIOPos = 0;
int iPos = 0;

do
{
   while(iWriterPos – iReaderPos != NumBlocks && iIOPos < FileSize)
   {
      overlapped[iWriterPos & NumBlocksMask].Offset = iIOPos;

      int iLeft = FileSize – iIOPos;
      int iBytesToRead = iLeft > BlockSize ? BlockSize: iLeft;

      const int iMaskedWriterPos = iWriterPos & NumBlocksMask;
      ReadFile(hFile, blocks[iMaskedWriterPos], iBytesToRead, 0,
         &overlapped[iMaskedWriterPos]);

      iWriterPos++;
      iIOPos += iBytesToRead;
   }

   const int iMaskedReaderPos = iReaderPos & NumBlocksMask;

   WaitForSingleObject(overlapped[iMaskedReaderPos].hEvent, INFINITE);

   int iLeft = FileSize – iPos;
   int iBytesToRead = iLeft > BlockSize ? BlockSize: iLeft;

   memcpy(&g_buffer[iPos], blocks[iMaskedReaderPos], iBytesToRead);

   iReaderPos++;
   iPos += iBytesToRead;

}
while(iPos < FileSize);

CloseHandle(hFile);

for(int i = 0; i < NumBlocks; i++)
{
   VirtualFree(blocks[i], BlockSize, MEM_COMMIT);
   CloseHandle(overlapped[i].hEvent);
}

 

Stats Min (MB/s) Max (MB/s) Average (MB/s)
HDD 48.239 48.852 48.700
DVD 2.384 2.408 2.399

 

We are getting the same performance that before for Hard Disk but we go below memory mapped files for DVD. There is something interesting that we are getting with this technique: stability (less difference between min and max values)

5. Data Compression

We can improve the performance if we put to good use the CPU (mostly idle in the previous tests). Data Compression gives work to the CPU while the disk is reading and allows to reduce the amount of data to be transferred. Ideally we should decompress faster than the I/O operation so that we get the decompression for free (in the sample code, you can detect when the I/O have to wait for the CPU activating the LOG_IO_STALLS macro).

I tried LZO and ZLIB to compress the data of the test. LZO is faster than ZLIB but it doesn’t allow streaming data so I chose ZLIB. The 100Mb file was compressed down to 64Mb.

Let’s see how each test benefits from the compression:

 

HardDisk Min (MB/s) Max (MB/s) Average (MB/s) Improvement %
ANSIC 59.067 69.348 67.558 +39%
W32 67.797 69.396 68.446 +41%
MMapped 52.247 53.562 52.980 +10%
Async I/O 68.634 69.832 69.242 +42%

 

DVD Min (MB/s) Max (MB/s) Average (MB/s) Improvement %
ANSIC 2.469 2.476 2.472 +3%
W32 3.437 3.467 3.455 +44%
MMapped 3.713 3.724 3.720 +47%
Async I/O 3.464 3.475 3.470 +44%

 

Definitely compression is a win. We are almost getting the theorical profit (56% for a 64 / 100 compression) in all the tests except Memory Mapped in Hard Disk (while in DVD is getting the best performance).

6. Conclusion

We have reviewed several ways to load files from disk and discovered that compression give us a performance boost. Async I/O with compression is my recommended way to load files as fast a possible due to:

  • Easy to code. We don’t have to deal with threads
  • Although it is not the best in all the scenarios, it gives a good average performance in all of them
  • Disk is always busy without entering and leaving from kernel mode. This is vital when reading from devices such as DVD where seek times are costly. This feature is really important when you stream lot of files from DVD. Although if you are reading from DVD you should be bundling your data files, but that is another history

Although Async I/O doesn’t need another thread, I recommend reading and writing with Async I/O in another thread in charge of decompressing and parsing while the main thread is consuming already loaded items.

Well, this is all for today. We got a fast technique for reading data from files. My next article on this topic will be about preparing the data to be preprocessed as fast as posible (using Async I/O and compression of course). The trick? Avoiding preprocessing and parsing at all: inplace data loading.

Thanks for reading.
 
 

  1. Sample accompanying this article
  2. Optimizing File Accesses via Ordering and Caching
  3. File and Network I/O using Win32 and .NET API’s on Windows XP



  1. Very nice article indeed, as always!! ;) )

    When you say:
    “Although Async I/O doesn’t need another thread, I recommend reading and writing with Async I/O in another thread in charge of decompressing and parsing while the main thread is consuming already loaded items.”

    I think the OS calls you when the IO has finished in a “worker thread”, where you can parse and decompress.

    Saludos



    Comment by Tuchi
    November 10, 2006 @ 18:25 #

  2. OS calls you if you use Completion Ports. I didn’t cover that in the article. It may be a good idea to extend it. :)



    Comment by ent
    November 23, 2006 @ 2:05 #

  3. Great article! I would really like to see your tests implemented for writing large datasets (gigs) to disk from a smaller memory buffer. For example, you have a 4 meg buffer and you fill it with randomized data, then write the buffer to a single file until it reaches 4 gigs using each of the methods you described.



    Comment by Todd
    December 15, 2006 @ 21:45 #

  4. Yes, a possible extension to this article include Writing Tests. In this first version only reading was taken into consideration: most of the realtime applications only take care about load times.



    Comment by ent
    December 20, 2006 @ 2:22 #

  5. True, but build processes care a lot about writing, and fast build processes mean happy artists and rapid iterations. You’ll also find memory mapping is a much bigger win when writing than reading.



    Comment by Andy P
    February 2, 2007 @ 15:17 #

  6. Thanks for the info. I will try it when I extend this document with Writing Timings. Probably is a factor for improving rapid iterations. But the key for rapid iteration is hot loading of resources. Another topic on itself. :)



    Comment by ent
    February 9, 2007 @ 2:59 #

  7. Nice article, I was wondering if you have an explanation for why file mapping gives such a nice performance boost with DVDs?



    Comment by Martin Fuller
    April 20, 2007 @ 11:08 #

  8. @Martin Fuller:

    Drastically reduced seek times. Seeking on a drive where the read head is a laser on a carriage is orders of magnitude slower than with electrically actuated drive heads.

    If you know where to look for your data, you’re shaving huge seek times off of each read.



    Comment by anon
    April 23, 2007 @ 18:49 #

  9. It would be interesting to see the complete performance gain including decompression after loading compressed data. Unfortunatly, that depends on the CPU / mem performance, so it would be hard to find a good measure.



    Comment by Marius Fahlbusch
    April 24, 2007 @ 15:40 #

  10. Martin & anon,

    The tests for this article were done reading a single file. So, seeks are completely eliminated from the equation.

    Why Memory Mapping gets that huge improvement is something I still have to investigate.

    And I should run these tests in more hardware configurations.
    And include a test using Completion Ports (as described in a comment above).

    So, there is lot of work pending here. :)



    Comment by ent
    April 24, 2007 @ 20:49 #

  11. Marius,

    The performance measured is after decompression. Although (and if not it is detected in the tests) the decompression is done in paralell while reading.



    Comment by ent
    April 24, 2007 @ 20:51 #

  12. Is direct DMA to application memory really used when using async IO? I’m sure DMA is used for the raw data to/from the disk, but in between I suppose there is a file system driver that removes/adds the file structures, headers, checksums and like. I don’t think the driver uses DMA to copy into application space. Correct me if I’m wrong.



    Comment by Moggen
    May 2, 2007 @ 11:20 #

  13. “The tests for this article were done reading a single file. So, seeks are completely eliminated from the equation.”
    This is true if the file is stored in sequence on the disk. If we introduce fragmentation of the file system, seek times becomes an issue. So, always make sure that you have a fresh (new) file system to maximize performance. Especially important with slow-seeking media like CD/DVD. With CD/DVD there might also be some gains in performance if we can control where the data is stored physically on the disc. The farther out on the disc, the higher teorethical bitrate (assuming constant rotational speed, ie. the upper limit in speed safely usable for a disc).
    BTW forgot to mention in last post: Great articles!



    Comment by Moggen
    May 2, 2007 @ 12:58 #

  14. Thanks for your article.
    Would you allow me to translate this article and its sequel into Korean?



    Comment by khris
    February 6, 2008 @ 2:16 #

  15. khris, of course I allow you. Tell me if I can help you in any way. And please, link to this article. Thanks.



    Comment by ent
    February 6, 2008 @ 10:50 #

  16. The performance gain of Asynchronous I/O should be from larger buffer.
    Here you are using 8 blocks of the buffer.
    I believe that the file mapping should get best performance if you are using 8*blocksize buffer instead of 1*blocksize buffer.

    Since DVD is much slower than HD, thus the benefits of a bigger buffer might not be as large as for HD. This may explaine that file mapping still has the best performance for DVD. In HD case, the larger buffer in
    Asynchronous I/O case (8*blocksize) overshadows the performance gain from file mapping.

    Also, if we are using file mapping, we do not need to copy that part of the memory, the above file mapping test case has unnecessary memory copying, otherwise, the file mapping method should beat all other approaches for both DVD and HD.



    Comment by James
    April 30, 2008 @ 1:44 #

  17. My above post did not consider the compression case, since it is a different issue.



    Comment by James
    April 30, 2008 @ 1:47 #

  18. [...] my first article on Fast File Loading, I described techniques directly related with the hardware and the Operating System to load files [...]



    Pingback by Fast File Loading / part II - Load-In-Place | EntBlog
    September 27, 2008 @ 23:28 #

  19. I’d expect memory mapping to be the fastest, as that’s the only option that will get the file data into the process address space without any copying: the file cache is mapped directly in.

    Your memory mapping test is adjusting the mapping using 128kb windows – does the speed improve using larger windows? There isn’t much reason not to use far larger windows, such as 64mb.



    Comment by Jon
    March 9, 2009 @ 1:20 #

  20. Yes, I was expecting a better efficiency for the memory mapping. But I think in this kind of scenario (reading a big file in linear order) the results are normal.

    I do not remember better results increasing the window size. But the test was done years ago. The source is provided with the article. May be you want to experiment and share with us the results.

    I expect to write a revision of the article in a future.



    Comment by ent
    March 9, 2009 @ 2:14 #

  21. Strange
    your article benchmark seems to imply that
    using Win32 API the “standard” way, with synchronous calls,
    results in the same performance as the more complex asynchronous method.

    Is that correct ?
    W32 3.437 3.467 3.455 +44%
    Async I/O 3.464 3.475 3.470 +44%



    Comment by Cyan
    April 27, 2009 @ 23:06 #

  22. Yes, more or less the same performance. What you get with the asynchronous methods is space for additional work that can be done in parallel.

    Any way, you can download the benchmark and try it on your machine. I would appreciate any comments on that.



    Comment by ent
    April 28, 2009 @ 14:23 #

  23. Hello

    Yes, I just did it, thanks to your pretty clear explanation.

    The first thing that striked me was that the timing necessary for completing the asynchronous call was exactly the same as for the synchronous call.

    After looking a bit further into details, it appears that :
    - My example focused on writing, rather than reading
    - I’m expanding the file size as it is written (well, quite common)
    - Chunks are large (a few MB)
    - In this case, the OS is single-deciding without notice to treat asynchronous calls as if they were synchronous.

    So, actually, i cannot use the “space” for additionnal work.

    I selected to test Writing first because it seems to be where most of the time is spent in my compression program. But maybe i should have a look at it again. Now, with 0.1s spent on a cached 100MB file, i wonder if there is really anything worthwhile to be saved…



    Comment by Cyan
    April 28, 2009 @ 14:46 #

  24. Hello. Nice article. I just don’t understand how did you get similar performance for fread and ReadFile. I’m working on a project where I need to read huge (a few GB) files and fread is about 6x slower than ReadFile, because it does other things as locking critical sections etc.



    Comment by Big Muscle
    May 2, 2009 @ 20:55 #

  25. Big Muscle, did you try the benchmark in your machine? would like to know the results.



    Comment by ent
    May 4, 2009 @ 22:34 #

  26. I would love to know about anyway to optimize file I/O over Linux.
    I’m having hard times on linux-powered smartphones, reading files from memory cards. What do you recomend me?



    Comment by Daniel "NeoStrider" Monteiro
    June 24, 2009 @ 3:18 #


Sat, 21 Nov 2009 22:09:16 +0100 / 29 queries. 1.770 seconds / 3 Users Online

gentoo link wordpress link apache link PHP link

Theme modified from Pool theme. Valid XHTML and CSS