So, I’m gradually working towards making a simple fighting game, using my Kinect as the input sensor. I think initially it’ll be one person punching a punching bag, then two people competing, and then probably some sort of computer-controlled AI that looks at your pose and chooses the right pose/move accordingly (although since I don’t really know how to fight, that will be interesting).
The Experiment
Anyhow, a few weeks ago I got a very rudimentary scripting host working, where I have my rendering infrastructure written in C++, and a Mono .NET window displaying the output etc. I have since improved it to allow me to create entities on-screen in a (somewhat) ad hoc manner (they’re all the same entity at the moment). In doing this I decided to run an experiment to confirm what we all know is true: if you have more than one instance of the same model on screen, they had better share the same vertices etc. in memory. My naive version loaded the model from disk and packed it into my renderable format in memory each time the script requested an entity from the engine. I wanted to do 1000 entities, but the naive approach crashed my computer at ~380, so for my performance comparisons I only went up to 200. My second approach uses boost::flyweight to share the model amongst all the entities that use it. I performance tested this one up to 200 as well, but as you can see below it gets up to 1000 dwarves on screen quite happily.
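
To make the difference between the two approaches concrete, here is a minimal sketch of the sharing idea. It is not the engine code and it deliberately does not use boost::flyweight: it is a hand-rolled cache built on std::shared_ptr, and every name in it (Model, ModelCache, loadModelFromDisk, spawn, the "dwarf.x" path) is made up for illustration. The loader only pretends to read a file and counts how often it was called.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical CPU-side model: a stand-in for the packed renderable format.
struct Model {
    std::string path;
    std::vector<float> vertices;
    std::vector<unsigned> indices;
};

// Hypothetical loader: fakes the expensive "read from disk and pack" step
// and bumps a counter so we can see how many times it ran.
inline std::shared_ptr<const Model> loadModelFromDisk(const std::string& path,
                                                      int& loadCount) {
    ++loadCount;
    auto model = std::make_shared<Model>();
    model->path = path;
    model->vertices.assign(9, 0.0f);  // one dummy triangle
    model->indices = {0, 1, 2};
    return model;
}

// Shares one immutable Model per path between every entity that asks for it.
class ModelCache {
public:
    std::shared_ptr<const Model> get(const std::string& path) {
        auto it = cache_.find(path);
        if (it != cache_.end()) {
            return it->second;  // already loaded: hand out the shared copy
        }
        auto model = loadModelFromDisk(path, loads_);
        cache_.emplace(path, model);
        return model;
    }

    int loads() const { return loads_; }

private:
    std::unordered_map<std::string, std::shared_ptr<const Model>> cache_;
    int loads_ = 0;
};

// An entity only holds a handle to the shared model, never its own copy.
struct Entity {
    std::shared_ptr<const Model> model;
};

// Creates n entities that all use the model at `path`.
inline std::vector<Entity> spawn(ModelCache& cache, const std::string& path,
                                 std::size_t n) {
    std::vector<Entity> entities;
    entities.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        entities.push_back(Entity{cache.get(path)});
    }
    return entities;
}
```

Spawning 1000 entities through the cache calls the loader once, and every entity points at the same Model in memory. The naive version is equivalent to calling loadModelFromDisk directly inside the loop.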

The Results
I’ve also attached a bunch of graphs comparing the two approaches. Obviously, the approach using the flyweights was much better than the naive approach, but there are a few interesting features on the graphs. First up is the load times. In the naive approach the curve starts out very steep for the 1, 2, 5 and 10 dwarf runs, and then peels off to a roughly linear function for the rest. It seems to me like it takes at least 10 reads of a file for Windows or my hard disk to cache the file fully. You’ll also notice that there is a tiny amount of growth in the load times for the flyweight approach, although I think this is down to the “.” that gets printed to the console for every constructed entity (printf and its higher-level cousins are SLOW).

Next is the memory graph. The thing that strikes me here is that the memory growth appears to slow for the naive approach, which I don’t quite understand. You can also see a delta between “max” memory and “rendering” memory. This is because the importer library loads a copy of the file into its own format, then I convert from its format into vertex and index arrays, then create vertex/index buffers out of those, and then clean everything up so that only the buffers are left. “Max” memory is therefore the usage during the loading phase, and “rendering” memory is the usage once we’ve created the buffers and cleaned up the other stuff. Note that there’s no delta between these in the flyweight approach and, as we would expect, the memory usage is constant.

Finally, a graph showing something that I have not made much better yet, and that is the frame time. I didn’t really expect any improvement, but it seems that repeatedly setting the same vertex and index buffer is slightly faster (maybe ~15%) than setting different buffers each time (I guess that’s the GPU cache being nice, or maybe the driver being smart and not actually doing anything). I am hopeful that if I do GPU instancing it’ll make the rendering much faster, because the only thing I’ll have to change per entity is the world matrix.
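
As a rough idea of what "only the world matrix changes per entity" means for instancing, here is a CPU-side sketch that packs one 4x4 matrix per entity into a single contiguous array, which is the kind of data a per-instance buffer would typically be filled with. This is an assumption about how it could be laid out, not something the engine does yet, and it does not touch any real graphics API. Mat4, translation and buildInstanceData are hypothetical names.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical 4x4 matrix type, 16 floats, translation in the last row.
using Mat4 = std::array<float, 16>;

// Builds a world matrix that only translates (no rotation or scale).
inline Mat4 translation(float x, float y, float z) {
    return Mat4{1.0f, 0.0f, 0.0f, 0.0f,
                0.0f, 1.0f, 0.0f, 0.0f,
                0.0f, 0.0f, 1.0f, 0.0f,
                x,    y,    z,    1.0f};
}

// Packs every entity's world matrix back to back. With instancing, the
// shared vertex/index buffers would be set once and this array would be
// the only per-entity data sent along with the single draw.
inline std::vector<float> buildInstanceData(const std::vector<Mat4>& worlds) {
    std::vector<float> data;
    data.reserve(worlds.size() * 16);
    for (const Mat4& world : worlds) {
        data.insert(data.end(), world.begin(), world.end());
    }
    return data;
}
```

For 200 dwarves that is 200 * 16 floats of per-instance data, while the vertex and index buffers stay exactly the same.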

A Problem
I am a little uneasy with my current approach, using boost::flyweight, because the flyweight pattern assumes an immutable shared object, and my design does the loading in two stages: the first loads from disk and packs into my runtime format, the second actually creates the GPU-based resources. This is deliberate, partly because I don’t want to couple the disk-based loading code to my graphics API, and partly because I want to go multi-threaded and perform the second stage in a prologue for each frame (i.e., at the start of each frame the load queue is processed, and recently loaded assets get turned into resources on the GPU). So, back to the problem: the flyweight supports get/casting to a const value_type& (value_type is your type, i.e. GraphicsResource), but I can’t populate the buffers inside the object if it is const, so I had to “cast away the constness”, which I wouldn’t advise in production code.
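
One possible way around the cast, sketched here rather than tested in the engine, is to treat the GPU-side state as a lazily filled cache and mark it `mutable`, so the second stage can run through a const reference. The class below is a made-up stand-in for GraphicsResource with a fake GpuBuffers struct instead of real API objects; only the `mutable` keyword itself is standard C++.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for real vertex/index buffer handles.
struct GpuBuffers {
    bool created = false;
    std::size_t vertexCount = 0;
};

// Hypothetical two-stage resource. Stage one (the constructor) holds the
// data packed from disk; stage two fills in the GPU-side state later.
class GraphicsResource {
public:
    explicit GraphicsResource(std::vector<float> vertices)
        : vertices_(std::move(vertices)) {}

    // Stage two. Declared const so it can be called through the
    // const value_type& a flyweight hands out; it only touches gpu_.
    void createGpuResources() const {
        if (gpu_.created) {
            return;  // already done, nothing to do
        }
        gpu_.vertexCount = vertices_.size() / 3;  // 3 floats per vertex
        gpu_.created = true;
    }

    bool ready() const { return gpu_.created; }
    std::size_t gpuVertexCount() const { return gpu_.vertexCount; }

private:
    std::vector<float> vertices_;  // logically immutable after loading
    mutable GpuBuffers gpu_;       // cache-like state, writable when const
};
```

This keeps the cast out of the calling code, but it does not make the object any more immutable than before: if the second stage runs on another thread, access to gpu_ would still need synchronising.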
Conclusion
As I always knew, sharing a single instance of resources like models is not just a good idea, it is vitally important. boost::flyweight is a nice implementation of the pattern and removes the need to write complicated factory and handle classes, but the fact that a cast from const to non-const is required to satisfy my design makes it unsuitable for production code. So, sharing instances is definitely the way to go; boost::flyweight just might not be the vehicle to get me there.