TIGSource Forums » Developer » Technical » Threads, pools and workers
Author Topic: Threads, pools and workers  (Read 1374 times)
SoulSharer
« on: July 17, 2013, 03:57:08 AM »

Hello there friends.

Has anyone tried to implement true multi-threading via worker threads that scales up nicely on N cores?
Can you share your experience on the matter?
How hard/time-consuming is it to accomplish, and what kind of resources (articles, books or anything else) helped you implement one?

There seems to be quite a lot to it, which makes it overwhelming to figure out where to start.

Thanks for the help.

Twitter: twitter.com/SoulSharer
Skype: disturbedfearless
Schrompf
« Reply #1 on: July 17, 2013, 04:40:08 AM »

I currently only have a simple system in place: a base class from which all parallel tasks derive, and a set of worker threads that regularly poll for new jobs. The job queue is a lock-free queue taken from Intel's Threading Building Blocks.

The lib is good! I haven't been able to squeeze a general job system into it, and they openly admit that their approach is not suited for general work involving possible waiting times. But you don't have to use all of it. I just used the threading base primitives and the various parallel container classes. I put texture loading, shader generation and such into this job system. The task base class is a two-step worker that offers a second function to execute from the main thread. This is necessary to create and upload the actual DirectX resources; you have to do that from the thread the Windows message pump runs in.

Downsides: inheriting from a base class is clumsy, and the worker threads polling for new jobs add quite a bit of latency to each job, so it's not really suitable for per-frame parallel jobs. Which is my target when I find some time again.

With C++11 and its threading support you could do better, though. Nowadays I simply put isolated, long-lasting tasks into an std::async(). Of course that's only viable for one-shot jobs. You could probably build a suitable job system with the primitives C++11 threading offers, but I haven't done so yet. I'd also be interested in anyone doing this. As slim as possible.

Let's Splatter it and then see if it still moves.
SoulSharer
« Reply #2 on: July 17, 2013, 04:47:20 AM »

Quote from: Schrompf on July 17, 2013, 04:40:08 AM
[...]

Thanks for the answer. I've seen quite a lot of people talk about Intel's TBB, but isn't it limiting to use? I mean, what if the library doesn't support every platform out there? (I know it supports Windows, Mac and Linux, but beyond that I'm not sure.)
JakobProgsch
« Reply #3 on: July 17, 2013, 08:00:29 AM »

C++11 already has threading facilities, and fun stuff like packaged_task. It doesn't have a thread pool/worker abstraction, but that is fairly easy to set up (and hard to do really well, obviously :) ).
I did some research on this, put my idea of a simple thread pool on GitHub ( https://github.com/progschj/ThreadPool ), and after some nice comments from other people I think it has become a fairly good starting point.
BleakProspects
« Reply #4 on: July 17, 2013, 08:15:56 AM »

I do it this way:
1. Each worker is an object with a global ID. They also have a status enum (Working, Stopped, Error, Done, etc.).
2. Worker threads register in a big map from their global IDs to the object. The map lives in the main thread.
3. The main thread produces jobs. When a new job is needed, the main thread looks through all the worker threads in the map, and those that are waiting for a job are assigned the job. If there aren't any workers yet, the main thread creates them.
4. The main thread registers an "OnJobDone" callback with the worker. When the worker finishes, this callback is called and the result of the job can be inspected. Similar callbacks are registered for other events.
5. Another function in the main thread checks which workers need to be destroyed, and does so.

This avoids any kind of polling, and all the locking can be done inside the callback functions. Most of what I know about this comes from a college class on distributed computing and from experience with distributed computing in robotics.
SoulSharer
« Reply #5 on: July 17, 2013, 08:30:38 AM »

Thanks for the thoughts, guys.

Quote from: JakobProgsch on July 17, 2013, 08:00:29 AM
[...]

I don't really want to rely on C++11, to be honest; not every compiler supports it to the same degree. Not to mention that standard library behaviour is implementation-defined and probably not extensible anyway. (So I'd rather abstract native solutions myself and add what's missing.)
Still, code helps to understand the approach, thanks. :)

Quote from: BleakProspects on July 17, 2013, 08:15:56 AM
[...]

Yeah, I've heard of people doing it that way, though I'm not sure how one would break everything into tasks/jobs. It also means the data you process in tasks/jobs shouldn't communicate with other tasks, but that hardly seems doable for AI, for example?
Schrompf
« Reply #6 on: July 17, 2013, 01:26:12 PM »

Exactly. It's not doable the straightforward way. That's why I consider this simple "task per thread" thinking a mistake. Building a reliable job queue is a challenge, but you can find one on the net. Splitting your work into tasks that *can* run in parallel is the actual challenge.
BleakProspects
« Reply #7 on: July 17, 2013, 05:39:30 PM »

Quote from: SoulSharer on July 17, 2013, 08:30:38 AM
[...]
That's a whole different can of worms that there are entire academic conferences on. Sometimes your jobs will need to communicate. That's a job-specific requirement. Sometimes they will fall nicely into "embarrassingly parallel" categories that require no communication. Breaking a task into jobs can also be tricky. For some tasks (like, say, image processing) this will be obvious. For other tasks (you mention AI), it will require clever thinking.
SoulSharer
« Reply #8 on: July 17, 2013, 11:05:41 PM »

I see, it makes a lot of sense. I guess it's better to go this way even if it's hard to prepare the jobs right, because it seems easier to transform an engine from single-threaded to multi-threaded with this approach than to put every subsystem in its own thread, which involves a lot of synchronization and won't scale.

I've watched the Civ5 GDC presentation and read some articles on Intel's site, but I still have only a vague idea of where to start.
Do you know any good resources on the matter? (be it a book, article or presentation)
Thanks a bunch.
Garthy
« Reply #9 on: July 18, 2013, 01:53:07 AM »


I worked on a problem with coarse threads (tasks on the order of seconds, usually), and just kept n*2 tasks active (where n is the number of cores on the current machine). Because the tasks were coarse it didn't scale perfectly, but more cores meant better overall performance, which was my goal in that case.

The problems I ran into were:

- Anything with complex interaction is a nightmare to debug. Tasks are best separated into pieces that are self-contained and interact as little as possible with each other and the surrounding environment. Better to go for 10% worse performance than to spend the next decade debugging a nightmare of strangely interacting code.

- Windows scheduling requires some manual intervention if you want to use priorities.

I developed my own threading library for it at the time, but nowadays I'd look into Boost threading if I had to do it again from scratch.

As for resources: mostly a combination of my uni studies and online research. If you aren't 110% certain what a race condition is, why you must avoid them, and how they can be solved with a mutex or mutex-like solution, keep reading until you understand it completely. You must approach multithreaded problems with a mentality of "how could this break", and must understand the basics thoroughly. The advanced stuff can come later.

Multithreaded programming isn't easy to pick up; expect some hassles if you're new to it.
SoulSharer
« Reply #10 on: July 18, 2013, 02:19:04 AM »


Quote from: Garthy on July 18, 2013, 01:53:07 AM
[...]

Thanks, will save me some pain.

I'm not a novice at this. I know how the OS handles threads, how mutexes, critical sections and events work, and their purpose. (I also know a little about atomic operations, though I've never tried them.)
So I'm looking to deepen my knowledge at this point.
Garthy
« Reply #11 on: July 18, 2013, 02:51:32 AM »

Thanks, will save me some pain.

Not a problem, happy to help.

I'm not a novice at this. I know how the OS handles threads, how mutexes, critical sections and events work, and their purpose. (I also know a little about atomic operations, though I've never tried them.)
So I'm looking to deepen my knowledge at this point.

Excellent, that'll definitely help then. :)
Klaim
« Reply #12 on: July 18, 2013, 09:33:58 AM »

[I cut this into several parts; I maxed out the maximum allowed length...]

The game I'm working on is highly concurrent, but I think it might be an extreme design.

First, I must say that I'm still learning. I read a lot on the subject and did a lot of little experiments even before starting the new game's engine (it's a reset of a codebase I did before, but I needed it to be more scalable for gameplay reasons).
Also, my focus is on the high-level design of the system rather than on how a thread pool is implemented. Understanding how it works internally is enough; I don't want to implement one, for good reasons.
The book I can recommend for learning about concurrency in C++ is "C++ Concurrency in Action", which covers all you need for a sane knowledge of the subject and will make you easily understand other related concepts.

Second, I follow a few design rules:

 - avoid waiting at all costs: design all interfaces meant to be used concurrently so that there is no waiting;
 - totally avoid sharing data (which is doable most of the time, at the expense of memory and copy time);
 - use lock-free or designed-for-concurrency containers wherever concurrency is involved;

Actually I didn't follow these rules exactly in practice, because in some cases I needed to have a system working to understand it better before being able to redesign it with a more concurrency-friendly interface. For example, one of the last pieces of waiting code in my engine is the virtual clocks implementation, which I basically copy-pasted from the previous codebase and added locks on access and update, so that I could focus on more important parts of the code. That was a good strategy, because now I can redesign the virtual clocks interface to avoid using a mutex; I see how to do it, and I can focus on it without wondering about the impact on other code.

One thing to note is that even if you parallelize some of the initialization and termination code of your game, or of some specific system, you will have to wait for something to synchronize the different systems to make them work together. For example, the graphic system takes some time to initialize fully, so inside it I only set up the strict minimum first, which makes the thread constructing the graphic system wait. Then the rest of the graphic initialization happens on a separate thread, and the graphic system interface exposes a way to push work to be done after initialization is complete, which lets the constructing thread continue with other things. Basically, initialization and termination of systems involve synchronization, so be aware of that and avoid too much concurrency there if it's not necessary. To be clearer: make the interfaces of the different systems accessible ONLY once they can take messages (which might be ignored or not, but at least pushing messages into them will not lock).


Now on to the more specific technical details.

I should point out that the game is client/server, and both sides need some kind of high concurrency for gameplay reasons, but the server side (the game-model-simulation part, if you prefer) is the part that benefits most from it.

The game uses (among other dependencies):
 - Ogre on the graphic side;
 - RakNet + Protobuf for networking;
 - TBB: concurrent containers and the task scheduler (I'll get back to it later);
 - Boost: diverse utilities, AND concurrency too!
 - C++11: AND concurrency too!
I'll get back to why I use C++11, Boost and TBB in combination for concurrency.

Fundamentally, any game is a set of systems on different layers of abstraction.
Each system works differently, and the opportunities to make a system more concurrency-friendly differ depending on its kind.
What I mean is: I use different kinds of concurrency for different systems, but there are basically two ways of thinking about it.

 A. The system always has to work on the same thread.
 B. The system can work on any thread.

By "system" I mean a group of subsystems working together, and most of the time this system has to be updated regularly. The necessary update interval depends a lot on the kind of system too, so you also have another choice to make:

 X. The system needs a very consistent and tight update "loop".
 Y. The system doesn't need a very tight "loop", just to be updated regularly; it's OK if it's not updated as often as possible.

For example, most graphic systems based on OpenGL have to use a specific thread, because most OGL driver implementations crash if you use a different one. The exception is data transfer to graphics memory, which can be parallelized in some ways. But to simplify: the graphic system must run on its own thread. It also has to have a consistent and tight update loop. I tried with sleeps in the update loop; you are guaranteed to get graphic hiccups even if the computer's task scheduler interface shows a flat process running.
The graphic system is AX.
However, the graphic resource loading system is actually BY, so the graphic system is composed of a graphic engine, which is AX, and a graphic assets manager, which is BY.

In my case there is no physics, and the game rules code can be updated by any thread, as long as the game state is not updated by several different threads at once (like most systems anyway). So the game logic is, on both client and server, BY.

The input system I'm not totally sure about yet. I think that in a precise, input-intensive game the input system should share the same thread as the output systems (graphics and audio). For now, as my game is not that kind of game and is more like an RTS where the actions have to be interpreted anyway, I consider the input system BX, but I can easily change that later.

http://www.klaimsden.net | Game : NetRush | Digital Story-Telling Technologies : Art Of Sequence
Klaim
« Reply #13 on: July 18, 2013, 09:38:46 AM »

[...]

Now, assuming I have identified the execution needs of the different systems, I need different kinds of executors too. The kinds of executor systems are:

 1. A thread + a work queue: you push tasks into the queue, and they are executed on that thread.
 2. A thread pool + a work queue (or something similar) per system using it: you push tasks into the pool, and they are executed by whatever thread has no other work to do.

Systems identified as A(X/Y) really need executor 1; there is no other choice.
Systems identified as BX can use either, the first one being a good choice if the update interval needs to be very short, while the second one is OK if that's not the case.
Systems identified as BY can use either, but for scaling they should use executor 2.

I have to digress into technical details from here. The following is what I had to implement to fit my needs:

 - WorkQueue<T>: this is just a wrapper around tbb::concurrent_queue< std::function<R(T)> >, where R is a return value (currently I only allow void, but I'm changing that to anything soon). It can be implemented with boost::lockfree::queue too. It's a basic tool for building other kinds of concurrent communication. You push work into it, and when you call execute() all the work accumulated since the last execute() call is executed on the calling thread. It's generally useful, and I believe the C++ standard will have one in a few years. However, one thing it doesn't allow is rescheduling a task. It's designed to execute non-looping tasks, which is fine for tasks that are not system updates. For system updates we need something more complex.

 - Task<T>: a move-only type which is basically an augmented std::function<T>, with several features like a unique id (using boost::uuid or a provided string), callbacks on some events in the lifetime of the Task instance, and conditional ending. A Task can be either one-shot OR rescheduling. You can also add conditions for ending rescheduling. The Task is designed to force the user to define on creation what the task should do (function + rescheduling? interval? until what?), then move it into either a TaskChain or a TaskScheduler.

 - TaskChain<T>: you can think of it as a vector of Task<T> combined with a WorkQueue<void>. The idea is that when you manipulate the TaskChain, to push back, push front, or insert tasks at some position (before or after a specific task id), the work is actually pushed into the WorkQueue. TaskChain has an execute() function which first executes the WorkQueue tasks, then executes each Task in the specific order they are registered in. This is important: it's an ordered chain of tasks to be executed. If after execution a Task says it doesn't need to be rescheduled, it is removed from the vector at the end of the execution call. If it wants to be rescheduled, nothing happens, and as it's still in the vector, the next call to TaskChain execution will execute it again (and ask for an end request again). This means that if you execute a TaskChain regularly you have strictly no locking happening, but you still have lock-free extra work going on if you add and remove tasks from it. It's the perfect tool for a system update "loop". So each system has a main TaskChain which contains the tasks to be done on each system update. If additional tasks are needed, just push them into the TaskChain. If they need to be done before the normal system update tasks, just push-front or insert them where you want relative to the normal tasks.
   The T in TaskChain<T> depends on the system. For example, I use a GraphicData struct which is modified by the graphic system and provides info like the time since the last frame. This data, once set for one update cycle, is read by all Task<GraphicData> inside the TaskChain<GraphicData> inside the GraphicEngine. GraphicEngine exposes a schedule( Task<GraphicData> task ); function which just inserts Tasks before the Task which performs the rendering.

 - TaskScheduler: this is the ninja part. It's not written as a singleton, but really it should be one. Basically, TaskScheduler explicitly takes Task<void> and promises to execute it ASAP using a thread pool. Internally, the Task is wrapped into a special tbb::task child type which interfaces with tbb::task_scheduler. TaskScheduler manages the tbb::task_scheduler internally too, so it's essentially an abstraction over tbb::task_scheduler, so that I can change the TaskScheduler implementation later if needed (for example on a platform where there is a better solution). Then the Task is either pushed directly into tbb::task_scheduler, which just pushes it into the work queues of its internal thread pool; or, if the Task is marked as having a time interval before execution (rescheduling or not), it is pushed into an internal tbb::concurrent_priority_queue<Task<void>>. However, Tasks can be associated with Clocks! Clocks are virtual clock representations, which means the flow of time can go faster or slower (I don't need it to go in reverse in my case). This means there is a priority queue for each different Clock. Tasks are sorted in order of time-to-execute according to the clock (or to real time), which makes it easy to just pop a task from the queue, see if it's time to execute it, and if not just push it back into the queue (which maintains the priority order). Once it's time to execute the task, it is pushed into the tbb::task_scheduler.

It's the main thread that loops (with a sleep of 5 microseconds or more) and pops tasks to push them into the TaskScheduler when it's time.

Some important realisations you should have already, if I was clear enough:
 - Systems A(X/Y) will never use TaskScheduler for their update tasks; just spawn and maintain a thread with a loop and execute the system's TaskChain on each cycle;
 - Systems BY can use TaskScheduler for their update task, which will be configured as rescheduling and will call the system's TaskChain execution each time it is executed;
 - Systems BX can use either one, depending on the kind of time limits necessary for the system to be responsive;

But this is not the whole story. Once you have all the systems updating correctly, in as concurrent a way as possible, there is still room for some systems' task implementations to parallelize some processing. For example, it is very common in physics simulation to parallelize processing, as most of the data needed comes from a snapshot of the physics state from previous frames. The way you parallelize these tasks is very specific to the task, so you have to choose how to do it: through a tbb::parallel_for, or by spawning tasks that you don't need to sync, or whatever. What you need to realize is which resources will be used when you do this: for example, using tbb::parallel_for means that the tbb::tasks generated by the algorithm will be pushed into the tbb::task_scheduler, which is shared with TaskScheduler.
This is actually a good thing: you wanted scaling, you got it. However, it implies that sometimes using tbb::parallel_for might be too compute-intensive and might slow everything down. That's OK; you just do it serially in one Task, and all the systems will work as they can.

Now, more game-specific tools:

 - Id<T>: this is a generic unique identifier tool for any type. It is insanely useful. It works with any T, and you don't even need the definition of T to compile Id<T>. It also does some interesting validity checks, but more importantly, each Id is unique, generated by boost::uuid, which helps a lot with creating Ids without having to synchronize on a counter or something similar. I'll get back to it.

 - Monitor<T>: this is a wrapper around a T& and a WorkQueue<void>. Basically, you can't access the T instance directly; you have to push work to do (mostly a lambda or std::function<void(T&)>), and it will be done when the execute() function is called. This is powerful because it means you can build both the T instance and the Monitor safely somewhere in the guts of a system, then provide a smart pointer to the Monitor (not the T instance) which defers pushed work until we want to update the T instance. I'm using it in...

 - MonitoredSystem<UpdateData>: this one is particularly useful for game-engine work. To summarize: MonitoredSystem<UpdateData> holds all the data that needs to be updated on the same thread, assuming these data can be big and organized in "arrays".
Basically, it holds a set of ObjectPool<T>, where T is any type. You declare which types can be used through a function, and the associated pool is created if it doesn't exist. This pool is stable: the objects in it are all of the same type, and instances never move in memory (it's not a std::vector<T>, which moves memory around; it's currently implemented as a boost::container::stable_vector<T>, which acts like a classic pool). MonitoredSystem also exposes functions to create and destroy T instances, which are also associated with an Id<T>.
MonitoredSystem provides ways to plug in and remove Controller<T,UpdateData> instances, which are basically like std::function< void( T&, const UpdateData& ) >. I'll get back to them in a minute.
MonitoredSystem and each of the pools own a WorkQueue<void> to accumulate work to be done before the update.
So, MonitoredSystem has an update( const UpdateData& ) function which first executes the work in the work queues (its own and each pool's), then calls the update of each pool's data. The update of each ObjectPool<T> calls all registered Controller<T,UpdateData> with the provided UpdateData instance and each of the T instances in a row.
(Note: the update of each ObjectPool<T> instance could be done in parallel; I'm still not sure it's worth it, so I do it sequentially for now.)

So, to clarify why it's useful (in a concurrent game system), let's say I want to implement a component-based engine. Components of types A, B and C can be updated separately, in parallel. However, components X, Y and Z have to be updated sequentially and on the same thread.

Here we can set up one MonitoredSystem which owns the component instances of types X, Y and Z, and there is only one thread calling monitored_system.update( update_data ); at each update cycle, updating all those components sequentially.
Meanwhile, the A, B and C instances can be hosted by different MonitoredSystem instances, each associated with a rescheduling Task instance that acts like a loop, executing when it can (or after an interval) and just updating each component type in parallel.

MonitoredSystem also exposes a way to get access to a shared_ptr<Monitor<T>> if you can provide an Id<T>. Once you have that pointer, you can push work to do with the T. This means I can have an object gathering pointers to monitors of different types and just push work progressively, when needed, into the different components.


All this without a lock, with a simple syntax on the user side (obviously it's another story for the implementation of these systems...).
Also, I'm improving these systems by adding async/future/promise semantics to the interfaces, which means you will be able to chain work, again without any locks.

About tbb + boost + c++11, some important facts I discovered progressively:

 - c++11 provide low level threading tools, which means it's not good enough for most of what you want to do (or maybe you want to implement your own task pool, in which case good luck);
 - that being said, C++11 future/promise/async are the best tools to use in concurrent interfaces...but they miss important features! for example, future.then() (which allow adding a task to do once the previous task is done, by specifying which executor should be used to execute the continuation) is not standard yet and async( executor, ... )isn't either so you don't have implementations around yet;
 - boost 1.54.0 does provide future.then() but it is still incomplete. Still, at least you can begin using it in your interfaces, and as it is officially planned to complete the continuation and async( executor, ... ) implementations, you can bet you'll be able to use these later in code using your interfaces. Just keep in mind that they are still marked as experimental by the boost.thread maintainer, which is not a problem in my case but suggests it might not be totally ready yet;
 - an important bug in Visual Studio 2010/2012: std::thread relies on std::chrono to provide time information, like the real-time clocks used in synchronization tools such as condition_variable (very useful for synching update tasks on both initialization and termination of systems). But in VC10/11, the clocks of std::chrono have a precision of 8 milliseconds, which makes them totally impractical for high-performance use like a game engine. I discovered this while using std::this_thread::sleep_for() in the main thread for popping tasks scheduled for later. The wait would be far longer than the requested time, which is also visible in condition_variable use and other facilities provided by the standard library implementation. This is really a bug, but I asked the guy maintaining the VC STL (whose initials are actually STL) whether it was fixed in VS2013 and he said it will not be (because they lack time for that release);
 - boost::chrono doesn't have this issue, works very precisely with time, and is portable, so you should use it instead of std::chrono if Visual Studio is used on Windows. Now, as std::thread relies on std::chrono, you cannot pass it boost::chrono types; you have to use boost::thread to use boost::chrono. That's not a problem: boost::thread implements the C++11 features, works well, and also adds C++14 and C++YY features (helping to provide an implementation of official proposals). Once I switched to boost::thread/chrono, the whole system sped up a lot!
 - boost has some concurrent containers in boost::lockfree. However, it doesn't provide any associative container (like a map or unordered map) and no thread pool (boost::asio can be used as a thread pool, but it's a bazooka for the task);
 - tbb provides the task scheduler, which is a thread pool that works hand in hand with the system (as much as possible); it also provides associative containers like tbb::concurrent_unordered_map and tbb::concurrent_hash_map (the difference is that the latter can remove values concurrently, the former cannot);
 - if you use Ogre, by default it will use either tbb::task_scheduler or, if you prefer, a custom boost-based thread pool for loading resources. So it helps with scaling too.

So to summarize: if you want future/promise/async, concurrent containers, task pools and efficient, portable implementations of all these, you'll have to use that combo.

Finally, the meta data:

 - understanding, implementing and refactoring these tools took me around 3 months, but keep in mind that I'm still improving the system today, and part of the time was spent fixing bugs discovered in the previously listed libraries;
 - know that I have a strong background in C++ architecture design from professional work; it helps a lot;
 - know that I worked full time for those 3 months;
 - concurrency bugs are indeed longer to kill, because most of the time you don't understand them at first. However, having all these tools to compose my concurrency helps a lot with debugging, as I have unit tests for most of them and it's easy to reproduce bugs related to them. The hard bugs come mostly from using other libraries. Even then, it's easy to isolate the bug case, because each library is used in a specific system, which makes isolating information easy;
 - the longest I've spent hunting a concurrency bug in my game-specific engine was 2.5 work days, but keep in mind that I also have a reputation of really loving to hunt hard-to-understand bugs, so I'm not really frustrated by these kinds of situations;
 - my current estimate is that the same engine in a more serial design would have taken less than a month to set up, working full time. My plans for the game require that it scales well, so I decided to still burn the time on the concurrency of the engine;
 - I learned so much on the subject that I feel it will help my future developments a lot. Actually, I also have an open source project which is beginning to benefit from it;
 - I think working with concurrency is really hard right now without the tools I've built and gathered from libraries. So if you don't feel like spending a year learning concurrency, don't bother. Scaling requirements are rare in games, and implementing an engine which scales is a big investment;
 - I began 1 year ago, had to stop in mid-March and resumed 2 weeks ago. In that time I also implemented client/server networking, the basics of the game-model system and other things specific to the game. What I'm saying is: I still managed to work on the game itself instead of engine code, but it's only now that I'm almost fully on game code, because I first needed to set up a context in which doing concurrency is both easy and intuitive. Now I implement concurrent systems easily;
 - the game is not finished at all, and my tests don't reflect the final behaviour of the game, so I might be going in a totally wrong direction. But the more I work on this, the more confident I get that it was worth it, both for learning concurrency by practice and because it will allow me to do some interesting gameplay features later, without having to rewrite the game-specific engine.

I'm open to questions.

Damn, I should clean all this text and make a blog post...
« Last Edit: July 18, 2013, 09:57:45 AM by Klaim »

http://www.klaimsden.net | Game : NetRush | Digital Story-Telling Technologies : Art Of Sequence
Klaim
« Reply #14 on: July 18, 2013, 10:18:11 AM »

I think this needs more work to be clarified; I will do that soon and publish a blog post. Take it as a draft.

Schrompf
« Reply #15 on: July 18, 2013, 10:29:20 AM »

Damn, I should clean all this text and make a blog post...

Maybe. Thanks for the write-up here! It was well worth reading.

I for myself decided that I don't need the full-blown parallel architecture in my engine(s). Most of the stuff just isn't compute-intensive enough to justify the huge maintenance effort that comes with it. Therefore I intend to only parallelize the critical parts - mainly the rendering - which should be a pure read-only process in theory. And I think that's a good idea, because the profiler always told me that rendering makes up 70% to 90% of my frame times. Once I get to that point I'll tell you if it worked out as imagined.

Let's Splatter it and then see if it still moves.
Klaim
« Reply #16 on: July 18, 2013, 10:33:00 AM »

Beware of parallelizing the rendering: if you don't rely on DX11, then only the resource management and the CPU-side data updates can be done concurrently (if you manage to make them not inter-dependent). Don't assume the rendering itself can be done concurrently outside the graphics card.

Your decision, I believe, is the sane decision for most games.

Schrompf
« Reply #17 on: July 18, 2013, 11:00:13 AM »

I know. I'm still on DX9, so I can't even create new resources in parallel. But a lot of the work in my render pipeline is visiting objects, collecting and culling draw calls, assigning shaders to fit the light list, and finally sorting them. All of that should work in parallel just fine, and the main thread can push D3D calls as fast as possible in the meantime.

Klaim
« Reply #18 on: July 18, 2013, 11:03:26 AM »

I see, it's a bit like the current situation with Ogre's rendering process, which has to visit the whole scene graph to build the rendering commands. There is massive work going on right now to re-architecture the graphics engine so that all this can be parallelized, which means a lot of scene data will be put in arrays with pre-fixed lengths.
The goal is both to improve scaling and to improve rendering speed.

SoulSharer
« Reply #19 on: July 18, 2013, 11:46:45 AM »

Holy balls, thank you sir.
It will take time to process this huge volume of info though; I'll probably reread it from time to time.

The only thing I fear so far is how much maintenance it will need in the end.
- Edit - Also, it feels like your approach is rather complex, which might bring issues to the table.
« Last Edit: July 18, 2013, 12:18:10 PM by SoulSharer »
