Klaim
« Reply #13 on: July 18, 2013, 09:38:46 AM »
[...]
Now, assuming I have identified the execution needs of the different systems, I need different kinds of execution systems too. The kinds of executor systems are:
1. A thread + a work queue: you push tasks into the queue and they are executed on that thread.
2. A thread pool + a work queue (or something similar) per system using it: you push tasks into the pool and they are executed by whatever thread has no other work to do.
Systems identified as A(X/Y) really need executor 1; there is no other choice. Systems identified as BX can use either: the first is a good choice if the update interval must be very short, while the second is fine otherwise. Systems identified as BY can also use either, but for scaling they should use executor 2.
I have to digress into technical details from here. The following is what I had to implement to fit my needs:
- WorkQueue<T> : this is just a wrapper around tbb::concurrent_queue< std::function<R(T)> > where R is a return type (currently I allow only void, but I'm changing that to anything soon). It can be implemented with boost::lockfree::queue too. It's a basic tool for building other kinds of concurrency communication. You push work into it, and when you call execute() all the work accumulated since the last execute() call is executed on the calling thread. It's generally useful and I believe the C++ standard will have one in a few years. However, one thing it doesn't allow is rescheduling a task. It's designed to execute non-looping tasks, which is fine for tasks that are not system updates. For system updates we need something more complex.
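A simplified sketch of that WorkQueue<T> idea, with a mutex-guarded std::queue standing in for tbb::concurrent_queue (the real one is lock-free), and callables of the form void(T&):

```cpp
#include <functional>
#include <mutex>
#include <queue>
#include <utility>

// Sketch of WorkQueue<T>: callables taking a T& are accumulated from any
// thread, then all run on whichever thread calls execute().
template <typename T>
class WorkQueue {
public:
    void push(std::function<void(T&)> work) {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push(std::move(work));
    }

    // Runs everything accumulated since the last call, on the calling thread.
    void execute(T& data) {
        std::queue<std::function<void(T&)>> pending;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            std::swap(pending, queue_); // work pushed during execution runs next time
        }
        while (!pending.empty()) {
            pending.front()(data);
            pending.pop();
        }
    }

private:
    std::mutex mutex_;
    std::queue<std::function<void(T&)>> queue_;
};
```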
- Task<T> : a move-only type which is basically an augmented std::function<T>, with several features like a unique id (using boost::uuid or a provided string), callbacks on some events in the lifetime of the Task instance, and conditional ending. A Task can be either one-shot OR rescheduling. You can also add conditions for ending rescheduling. Task is designed to force the user to define on creation what the task should do (function + rescheduling? interval? until what?), then move it into either a TaskChain or a TaskScheduler.
- TaskChain<T> : you can think of it as a vector of Task<T> combined with a WorkQueue<void>. The idea is that when you manipulate the TaskChain, to push back, push front or insert tasks at some position (before or after a specific task id), the work is actually pushed into the WorkQueue. TaskChain has an execute() function which will first execute the WorkQueue tasks, then execute each Task in the specific order they are registered in. This is important: it's an ordered chain of tasks to be executed. If after execution a Task says it doesn't need to be rescheduled, it will be removed from the vector at the end of the execution call. If it wants to be rescheduled, nothing happens, and as it is still in the vector, the next call to TaskChain execution will execute it again (and ask for an end request again). This means that if you execute a TaskChain regularly, you have strictly no locking happening, but you still have lock-free extra work going on if you add and remove tasks from it. It's the perfect tool for a system update "loop". So each system has a main TaskChain which contains the tasks to be done on each system update. If additional tasks are needed, just push them into the TaskChain. If they need to be done before the normal system update tasks, just push them to the front or insert them where you want relative to the normal tasks. The T in TaskChain<T> depends on the system. For example, I use a GraphicData struct which is modified by the graphic system and provides info like time since last frame. This data, once set for one update cycle, will be read by all Task<GraphicData> inside the TaskChain<GraphicData> which is inside the GraphicEngine. GraphicEngine exposes a schedule( Task<GraphicData> task ); function which just inserts Tasks before the Task which performs the rendering.
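Here is a stripped-down sketch of the execute-and-maybe-reschedule behaviour described above. The real TaskChain also defers insertions through a WorkQueue and carries task ids, both omitted here; I model a task as a plain std::function returning whether it wants to be rescheduled:

```cpp
#include <functional>
#include <utility>
#include <vector>

// Sketch of TaskChain<Data>: an ordered chain of tasks executed in
// registration order; tasks that decline rescheduling are dropped.
template <typename Data>
class TaskChain {
public:
    // A task reads/writes the per-update Data and returns true to stay
    // scheduled (looping task) or false to be removed after this pass.
    using Task = std::function<bool(Data&)>;

    void push_back(Task task) { tasks_.push_back(std::move(task)); }

    void execute(Data& data) {
        std::vector<Task> kept;
        kept.reserve(tasks_.size());
        for (Task& t : tasks_) {
            if (t(data))                 // true = reschedule for the next pass
                kept.push_back(std::move(t));
        }
        tasks_ = std::move(kept);        // one-shot tasks are now gone
    }

private:
    std::vector<Task> tasks_;
};
```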
- TaskScheduler: this is the ninja part. It's not written as a singleton, but really it should be one. Basically, TaskScheduler explicitly takes Task<void> and promises to execute it ASAP using a thread pool. Internally, the Task will be wrapped into a special tbb::task child type, which makes the interface with tbb::task_scheduler. TaskScheduler manages the tbb::task_scheduler internally too, so it's essentially an abstraction over tbb::task_scheduler, which lets me replace the TaskScheduler implementation later if needed (for example on a platform where there is a better solution). Then, the Task will either be pushed directly into tbb::task_scheduler, which just pushes it into the work queues of its internal thread pool; or, if the Task is marked as having a time interval before execution (rescheduling or not), it will be pushed into an internal tbb::concurrent_priority_queue<Task<void>>. However, Tasks can be associated with Clocks! Clocks are virtual clock representations, which means the time flow can go faster or slower (I don't need to go in reverse in my case). This means that there is a priority queue for each different Clock. Now, the tasks are sorted by time of execution according to the clock (or to real time), which makes it easy to just pop a task from the queue, see if it's time to execute it and, if not, just push it back into the queue (which will maintain the priority order). Once it's time to execute a task, it is pushed into the tbb::task_scheduler.
In the end, it is the main thread that loops (with a sleep of 5 microseconds or more) and pops tasks to push them into the TaskScheduler when it's time.
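The virtual-clock scheduling described above can be sketched like this. All names are illustrative, and poll() runs the task inline where the real system would hand it to the thread pool:

```cpp
#include <chrono>
#include <functional>
#include <queue>
#include <vector>

// A virtual clock: time flows at `scale` relative to real time
// (2.0 = twice as fast). One priority queue exists per clock.
using RealClock = std::chrono::steady_clock;

struct VirtualClock {
    double scale = 1.0;
    RealClock::time_point origin = RealClock::now();
    std::chrono::duration<double> now() const {
        return (RealClock::now() - origin) * scale;
    }
};

struct TimedTask {
    std::chrono::duration<double> due; // virtual time at which to run
    std::function<void()> run;
    bool operator>(const TimedTask& other) const { return due > other.due; }
};

// Tasks sorted by due time on one clock; the soonest task is on top.
class ClockQueue {
public:
    explicit ClockQueue(VirtualClock clock) : clock_(clock) {}

    void schedule(std::chrono::duration<double> delay, std::function<void()> f) {
        queue_.push({clock_.now() + delay, std::move(f)});
    }

    // Pop and run every task whose due time has passed on this clock.
    void poll() {
        while (!queue_.empty() && queue_.top().due <= clock_.now()) {
            auto task = queue_.top().run;
            queue_.pop();
            task(); // the original pushes into the thread pool instead
        }
    }

private:
    VirtualClock clock_;
    std::priority_queue<TimedTask, std::vector<TimedTask>,
                        std::greater<TimedTask>> queue_;
};
```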
Some important realisations you should already have if I was clear enough:
- Systems A(X/Y) will never use TaskScheduler for their update tasks; they just spawn and maintain a thread with a loop and execute the system's TaskChain on each cycle;
- Systems BY can use TaskScheduler for their update task, which will be configured as rescheduling and will then call the system's TaskChain execution each time it is executed;
- Systems BX can use either one, depending on the kind of time limits necessary for the system to be responsive.
But this is not the whole story. Once you have all the systems updating correctly, in as concurrent a way as possible, there is still room for some system task implementations to parallelize some processing. For example, it is very common in physics simulation to parallelize processing, as most data needed comes from a snapshot of the physics state from previous frames. The way you parallelize these tasks is very specific to the task, so you have to choose how to do it: through a tbb::parallel_for, through spawning tasks that you don't need to sync, or whatever. What you need is just to realize which kinds of resources will be used when you do this: for example, using tbb::parallel_for means that the tbb::tasks generated inside the algorithm will be pushed into the tbb::task_scheduler which is shared with TaskScheduler. This is actually a good thing: you wanted scaling, you got it. However, this implies that sometimes using tbb::parallel_for might be too compute-intensive and might slow everything down. That's ok, you just need to do it serially, in one Task, and all systems will work as they can.
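To illustrate the trade-off, here is a toy velocity-integration step done serially in one task, and split in two with std::async (standing in for tbb::parallel_for, which does the splitting automatically over the shared pool). The data layout and function names are invented for the example:

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Serial version: the whole loop runs inside one task.
std::vector<double> integrate_serial(std::vector<double> pos,
                                     const std::vector<double>& vel,
                                     double dt) {
    for (std::size_t i = 0; i < pos.size(); ++i) pos[i] += vel[i] * dt;
    return pos;
}

// Parallel version: the range is split in two; each half touches
// disjoint elements, so no synchronization is needed until the join.
std::vector<double> integrate_parallel(std::vector<double> pos,
                                       const std::vector<double>& vel,
                                       double dt) {
    const std::size_t mid = pos.size() / 2;
    auto first_half = std::async(std::launch::async, [&] {
        for (std::size_t i = 0; i < mid; ++i) pos[i] += vel[i] * dt;
    });
    for (std::size_t i = mid; i < pos.size(); ++i) pos[i] += vel[i] * dt;
    first_half.get(); // join before the result is read
    return pos;
}
```

Both produce the same result; the only question is whether the extra threads are worth taking away from the rest of the engine at that moment.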
Now, more game-specific tools:
- Id<T>: this is a generic unique identifier tool for any type. It is insanely useful. It works with any T, and you don't even need the definition of T to compile Id<T>. It also does some interesting checks on validity, but more importantly, each Id is unique, generated by boost::uuid, which helps a lot with creating Ids without having to synchronize with a counter or something similar. I'll get back to it.
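A sketch of the Id<T> shape, usable with a forward-declared T. Note that the atomic counter here is only a stand-in so the example is self-contained; boost::uuid, as used in the original, avoids shared state entirely:

```cpp
#include <atomic>
#include <cstdint>

// Sketch of Id<T>: a typed unique identifier that compiles even when T is
// an incomplete type, with a validity check for default-constructed ids.
template <typename T> // T may be forward-declared only
class Id {
public:
    Id() = default; // default-constructed ids are invalid

    static Id create() { return Id{++next_}; }

    bool is_valid() const { return value_ != 0; }

    friend bool operator==(Id a, Id b) { return a.value_ == b.value_; }
    friend bool operator!=(Id a, Id b) { return !(a == b); }

private:
    explicit Id(std::uint64_t v) : value_(v) {}
    std::uint64_t value_ = 0;
    // Stand-in for boost::uuid: a process-wide counter per identified type.
    static std::atomic<std::uint64_t> next_;
};

template <typename T>
std::atomic<std::uint64_t> Id<T>::next_{0};
```

Because the template never dereferences T, `Id<Player>` works with only `struct Player;` in scope, which keeps headers light.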
- Monitor<T>: this is a wrapper around a T& and a WorkQueue<void>. Basically, you can't access the T instance directly; you have to push work to do (mostly a lambda or std::function<void(T&)>) and it will be done when the execute() function is called. This is powerful because it means you can build both the T instance and the Monitor safely somewhere in the guts of a system, then provide a smart pointer to the Monitor (not the T instance), which will defer pushed work until we want to update the T instance. I'm using it in...
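A minimal sketch of the Monitor<T> pattern, again with a mutex-guarded queue standing in for the lock-free WorkQueue; the point it demonstrates is that the wrapped T is never touched until its owner decides to execute the deferred work:

```cpp
#include <functional>
#include <mutex>
#include <queue>
#include <utility>

// Sketch of Monitor<T>: callers never see the T directly; they push
// closures, which run only when the owning thread calls execute().
template <typename T>
class Monitor {
public:
    explicit Monitor(T& target) : target_(target) {}

    void push(std::function<void(T&)> work) {
        std::lock_guard<std::mutex> lock(mutex_);
        work_.push(std::move(work));
    }

    // Called by the thread that owns T, at a safe point in its update.
    void execute() {
        std::queue<std::function<void(T&)>> pending;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            std::swap(pending, work_);
        }
        while (!pending.empty()) {
            pending.front()(target_);
            pending.pop();
        }
    }

private:
    T& target_;
    std::mutex mutex_;
    std::queue<std::function<void(T&)>> work_;
};
```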
- MonitoredSystem<UpdateData>: this one is particularly useful for game-engine work. To summarize: MonitoredSystem<UpdateData> holds all data that needs to be updated on the same thread, assuming these data can be big and organized in "arrays". Basically, it holds a set of ObjectPool<T> where T is any type. You declare which types can be used through a function, and the associated pool is created if it doesn't exist. This pool is actually stable: the objects in it are all of the same type and instances never move in memory (it's not a std::vector<T>, which moves memory around; it's currently implemented as a boost::container::stable_vector<T>, which acts like a classic pool). Now, MonitoredSystem also exposes functions to create and destroy T instances, which are also associated with an Id<T>. MonitoredSystem provides ways to plug in and remove Controller<T,UpdateData> instances, which are basically like std::function< void( T&, const UpdateData& ) >. I'll get back to it in a minute. MonitoredSystem and each of the pools own a WorkQueue<void> to accumulate work to be done before the update. So, MonitoredSystem has an update( const UpdateData& ) function which will first execute work in the work queues, its own and the one of each pool, and then call the update of each pool's data. The update of each ObjectPool<T> calls all registered Controller<T,UpdateData> with the provided UpdateData instance and each of the T instances in a row. (Note: the update of each ObjectPool<T> instance could be done in parallel; I'm still not sure if it's worth it though, so I do it sequentially for now.)
So to clarify why it's useful (in a concurrent game system), let's say I want to implement a component-based engine. Components of types A, B and C can be updated separately, in parallel. However, components X, Y and Z have to be updated sequentially and on the same thread.
Here we can set up one MonitoredSystem which will own component instances of types X, Y and Z, and there will be only one thread calling monitored_system.update( update_data ); at each update cycle, updating all the components sequentially. Meanwhile, A, B and C instances can be hosted by different MonitoredSystem instances, which will be associated with rescheduling Task instances that act like a loop, executing when they can (or after an interval) and just updating each component type in parallel.
Then, MonitoredSystem also exposes a way to get access to a shared_ptr<Monitor<T>> if you can provide an Id<T>. Once you have that pointer, you can push work to do with T. This means that I can have an object gathering pointers to monitors of different types and just push work progressively, when needed, into the different components.
All this without a lock, with a simple syntax on the user side (obviously it's another story for the implementation of these systems...). Also, I'm improving these systems by adding async/future/promise semantics to the interfaces, which means you will be able to chain work, again without any locks.
About tbb + boost + c++11, some important facts I discovered progressively:
- C++11 provides low-level threading tools, which means it's not good enough for most of what you want to do (or maybe you want to implement your own task pool, in which case good luck);
- that being said, C++11 future/promise/async are the best tools to use in concurrent interfaces... but they miss important features! For example, future.then() (which allows adding a task to do once the previous task is done, by specifying which executor should be used to execute the continuation) is not standard yet, and async( executor, ... ) isn't either, so you don't have implementations around yet;
- boost 1.54.0 does provide future.then(), but it is still incomplete. Still, at least you can begin using it in your interfaces, and as it is officially planned to complete the continuation and async( executor, ... ) implementations, you can bet you'll be able to use these later in code using your interfaces. Now, just keep in mind that they are still marked as experimental by the boost.thread maintainer, which is not a problem in my case but suggests it might not be totally ready yet;
- an important bug of Visual Studio 2010/2012: std::thread relies on std::chrono to provide time information, like real-time clocks, which are used in synchronization tools like condition_variable (very useful for synching update tasks on both initialization and termination of systems). But in VC10/11, the clocks of std::chrono have a precision of 8 milliseconds, which makes them totally impractical for high-performance use like a game engine. I discovered that while using std::this_thread::sleep_for() in the main thread for popping tasks scheduled for later. The wait would be far longer than the required time, which is also visible in condition_variable use and other facilities provided by the standard library implementation. This really is a bug, but I asked the guy maintaining the VC STL (whose initials are STL, actually) if it was fixed in VS2013 and he said it will not be (because they lack time for that release).
- boost::chrono doesn't have this issue and works very precisely with time, and it's portable, so you should use it instead of std::chrono if Visual Studio is used on Windows. Now, as std::thread relies on std::chrono, you cannot pass it boost::chrono types. You have to use boost::thread to use boost::chrono. It's not a problem: boost::thread implements the C++11 features, works well, and also adds C++14 and C++YY features (to help provide an implementation of official proposals). Once I switched to boost::thread/chrono, the whole system sped up a lot!
- boost has some concurrent containers in boost::lockfree. However, it doesn't provide any associative container (like map or unordered map) and no thread pool (boost::asio can be used as a thread pool, but it's a bazooka for the task);
- tbb provides the task scheduler, which is a thread pool that works hand in hand with the system (as much as possible); it also provides associative containers like tbb::concurrent_unordered_map and tbb::concurrent_hash_map (the difference is that the last one can remove values concurrently, while the first one cannot);
- if you use Ogre, by default it will use either tbb::task_scheduler or a custom boost-based thread pool, if you want, for loading resources. So it helps with scaling too.
So to summarize: if you want future/promise/async, concurrent containers, task pools, and efficient, portable implementations of all these, you'll have to use that combo.
Finally, the meta data:
- understanding, implementing and refactoring these tools took me around 3 months, but keep in mind that I'm still improving the system today, and part of the time was spent fixing bugs discovered in the previously listed libraries;
- know that I have a strong background in C++ architecture design from professional work; it helps a lot;
- know that I worked full time during those 3 months;
- concurrency bugs are indeed longer to kill, because most of the time you don't understand them at first. However, having all these tools to compose my concurrency helps a lot with debugging, as I have unit tests for most of these tools and it's easy to reproduce bugs related to them. Hard bugs come mostly from using other libraries. Now, in this case, it's also easy to isolate the bug, because each library is used in a specific system, which makes isolating information easy;
- the longest I've spent hunting a concurrency bug in my game-specific engine was 2.5 work days, but keep in mind that I also have a reputation for really loving to hunt hard-to-understand bugs, so I'm not really frustrated by these kinds of situations;
- my current estimate is that the same engine in a more serial design would have taken less than a month to set up, working full time. My plans for the game require that it scales well, so I decided to still burn the time on the concurrency of the engine;
- I learned so much on the subject that I feel it will help my future developments a lot. Actually, I also have an open source project which is beginning to benefit from it;
- I think working with concurrency is really hard currently without the tools I've built and gathered from libraries. So if you don't feel like spending a year learning concurrency, don't bother. Scaling requirements are rare in games. Implementing an engine which scales is a big investment;
- I began 1 year ago, had to stop in mid-March and resumed 2 weeks ago.
In that time I also implemented client/server networking, the basics of the game-model system and other stuff specific to the game. What I'm saying is: I still managed to work on the game itself instead of engine code, but it's only now that I'm almost fully on game code, because I first needed to set up a context in which doing concurrency is both easy and intuitive. Now I implement concurrent systems easily.
- the game is not finished at all, and my tests don't reflect the final behaviour of the game, so I might be going in a totally wrong direction, but the more I work on this, the more confident I get that it was worth it, both for learning concurrency through practice and because it will allow me to do some interesting gameplay features later, without having to re-write the game-specific engine.
I'm open to questions.
Damn, I should clean all this text and make a blog post...