I don't think this was a particularly good article.
What follows might come off as brutally critical, and I want to establish that it is not my intention to be cruel. It is my intention, however, to highlight a number of significant issues with the article. If the author is reading: my apologies for any offense caused, and please don't take the following personally.
The article manages to get on my nerves within the first two sentences:
"I found that there is relative little information out there on how to multi-thread actual, real-world systems"
I don't think this is even remotely true.
The deceptive title doesn't help. It seems like "The Truth" was chosen simply to make the title sound more impressive, as if the article would reveal the greater truths behind multithreading.
Anyway, the article continues, delving into ways to work around the perceived problems of multithreading by bypassing the facilities available (e.g. going straight to CAS), without showing much understanding of what is available.
There is talk about using spinlocks. There are two main uses for spinlocks: (i) when you are trying to get the basics going and don't care about performance; and (ii) when you are getting into the really hairy time-critical stuff and you are willing to starve everything else running on the machine.
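For context, a bare-bones spinlock is just a loop around an atomic test-and-set, along these lines (a C++ sketch, illustrative rather than recommended):

```cpp
#include <atomic>

// A minimal test-and-set spinlock; this is the "churn" the article
// worries about, made explicit.
class Spinlock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // Spin until we win the flag. Under contention this burns
        // the rest of our slice doing nothing useful.
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // busy-wait
        }
    }
    void unlock() {
        flag_.clear(std::memory_order_release);
    }
};
```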
"With a spinlock, our CPU will churn waiting for the busy thread. This consumes power and creates heat."
I barely know where to start with this one. The scheduler is going to switch out the ill-behaved process; the real problem is starving everything else out. That will create problems, some of which will be managed, and *maybe* at the end of it we use power and generate heat. But a well-behaved system that is efficient and busy is going to generate heat and consume power as well, and many systems will manage that heat by slowing the processor. It is not a matter of power consumption and heat, but of efficient use of resources.
If you're ever forced into something like a spinlock, and there are usually mechanisms to avoid doing so, it should generally work like this:
- Check everything
- Does anything need updating?
- If so: Update it.
- Yield.
That last step is important. If your thread can't do anything useful right now, give it up to one that can.
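In C++ terms, that loop might look something like this, where needs_update() and do_update() are hypothetical stand-ins for whatever your checks and updates actually are:

```cpp
#include <thread>

bool needs_update();  // hypothetical: "does anything need updating?"
void do_update();     // hypothetical: "if so, update it"

void poll_loop() {
    for (;;) {
        if (needs_update()) {
            do_update();
        }
        // The important step: if we can't do anything useful right
        // now, give the slice to a thread that can.
        std::this_thread::yield();
    }
}
```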
There is also a distinct lack of actual measurement, with speculation in place of actual numbers. This quote covers it well:
"I have no good numbers for this and it would require a thorough investigation."
Or, short of a thorough investigation: some basic testing of the concepts, information on the test setup, and the results would have done.
And:
"But if you look at the cycle counts in the case of no contention they are all pretty similar. In fact, they are all pretty fast."
This is vague. Where are the numbers?
"A common multithreaded programming model is to create a number of jobs, go wide to execute them, and then go single-threaded again to collect the results."
Whilst this is one way multithreading is used, there are plenty of other ways. Sometimes you have tasks that run near-continuously and kick in with their contribution when they need to do something; GUI and audio management are two popular examples. Many times they sit in their own thread(s) and behave as ideal threads: sleeping and yielding most of the time, and grabbing a chunk of CPU time when they need it. The quoted statement is reasonable on its own, but it then becomes the focus of everything that follows, to the exclusion of all else.
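For reference, the go-wide-then-collect model the article quotes boils down to something like this sketch, with job() as a hypothetical unit of work:

```cpp
#include <thread>
#include <vector>

void job(int i);  // hypothetical unit of work

void run_jobs(int n) {
    std::vector<std::thread> workers;
    workers.reserve(n);
    for (int i = 0; i < n; ++i)
        workers.emplace_back(job, i);  // go wide
    for (auto& w : workers)
        w.join();                      // back to single-threaded
    // ... collect the results here ...
}
```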
There is talk of scheduling kicking in at the wrong time. Scheduling has to kick in eventually, as your thread and process aren't the only things running on the machine. You can't look at a single thread in isolation and take steps to optimise its performance and dodge the scheduler, because time taken in that thread is time not spent in other threads. An analogy: take team sports. You shouldn't fuss over the optimal performance of a single player at the cost of the others when the best overall course of action may be to pass the damn ball. You concentrate on making each player work well to ensure the best team outcome.
And on the subject of scheduling kicking in and spinlocking: the best thing you can do is to work *with* the scheduler, not against it. One way to do this is with condition variables. For those unfamiliar, these are a two-sided mechanism. One side is a way for a thread to inform the scheduling system that it is stuck waiting on something, and to pretty-please hand control back when the resource it has requested is ready. The other side is a way of saying that a thread has just provided information that another thread probably needs. They let the scheduler know what you want to happen, and how it needs to happen. It is not uncommon for a scheduler to pass control right back to the thread that was waiting, especially since the scheduler knows that thread yielded early, well before its slice was over; some schedulers prioritise threads by how little time they have used. Yield early and yield often.
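A minimal sketch of both sides using C++'s std::condition_variable (pthreads and most runtimes have direct equivalents; the work queue here is illustrative):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

std::mutex m;
std::condition_variable cv;
std::queue<int> work;

// Waiting side: "I'm stuck; wake me when there's something to do."
int take() {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return !work.empty(); });  // sleeps, doesn't spin
    int item = work.front();
    work.pop();
    return item;
}

// Signalling side: "I just produced something another thread wants."
void give(int item) {
    {
        std::lock_guard<std::mutex> lock(m);
        work.push(item);
    }
    cv.notify_one();
}
```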
Now search the article for the term "condition variable", this very important tool for assisting a scheduler.
The system spoken about seems to be a mish-mash of a transaction-based system and a routing system. It is a bit vague, but that is the impression I got.
Let's look at the transaction side first. Coarse-grained parallelism is generally preferred over fine-grained, and a solution should be designed accordingly. So: one thread builds up the set of changes, passes ownership safely to another thread that manages them (think a mutex on the pointer to the data), and the result is then reported back. The original thread can yield, or go off and do other stuff and check the result later.
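A rough sketch of that handoff, with ChangeSet as an illustrative stand-in for whatever the transaction payload actually is:

```cpp
#include <memory>
#include <mutex>
#include <vector>

struct ChangeSet { std::vector<int> changes; };  // stand-in payload

std::mutex slot_mutex;
std::unique_ptr<ChangeSet> slot;  // "the pointer to the data"

// Producer: build the set in private, then publish it under the lock.
void submit(std::unique_ptr<ChangeSet> cs) {
    std::lock_guard<std::mutex> lock(slot_mutex);
    slot = std::move(cs);  // ownership moves; producer stops touching it
}

// Consumer: take the whole set at once, then work on it lock-free.
std::unique_ptr<ChangeSet> take() {
    std::lock_guard<std::mutex> lock(slot_mutex);
    return std::move(slot);  // may be null if nothing was submitted
}
```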
Now for routing: if the system is genuinely shuffling a lot of data (the article is again vague on specifics), then it should provide a mechanism for setting up some sort of stream between the threads, and just manage the coordination issues and contention with other threads wanting access to the same data. Perhaps the system should accept large chunks of data, to which the other thread can request a pointer.
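One way to sketch such a stream, assuming a single producer and a single consumer, is a small ring of chunk pointers with atomic indices. Everything here is illustrative; it is one technique among several that would fit:

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Single-producer/single-consumer ring of chunk pointers: large blocks
// of data stream between two threads without a lock on the hot path.
template <typename T, std::size_t N>
class SpscRing {
    std::array<T*, N> slots_{};
    std::atomic<std::size_t> head_{0};  // advanced by the consumer
    std::atomic<std::size_t> tail_{0};  // advanced by the producer

public:
    bool push(T* chunk) {  // producer side
        std::size_t t = tail_.load(std::memory_order_relaxed);
        std::size_t next = (t + 1) % N;
        if (next == head_.load(std::memory_order_acquire))
            return false;  // full; caller can yield and retry
        slots_[t] = chunk;
        tail_.store(next, std::memory_order_release);
        return true;
    }

    T* pop() {  // consumer side
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return nullptr;  // empty
        T* chunk = slots_[h];
        head_.store((h + 1) % N, std::memory_order_release);
        return chunk;
    }
};
```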
Anyway, the specifics of the actual goals and performance parameters are left vague, but this is where I'd be starting in general.
The article does mention one useful concept which is worth noting: bulk allocation of resources (e.g. memory) that are stored per-thread and can then be handed out within that thread, allowing the resource to be divvied up without locking. This is a good idea, but it's also an old idea. Search for "glibc malloc arena" for a more advanced example.
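A toy version of the idea might look like this, with sizes and names made up for illustration (and with none of the refill or cleanup a real allocator needs):

```cpp
#include <cstddef>
#include <cstdlib>

constexpr std::size_t kArenaSize = 1 << 20;  // say, 1 MiB per thread

// Each thread grabs one big block up front, then bumps a cursor to
// hand out pieces. No lock needed: the arena belongs to one thread.
struct Arena {
    char* base = static_cast<char*>(std::malloc(kArenaSize));
    std::size_t used = 0;

    void* alloc(std::size_t n) {
        // Round up so returned pointers stay suitably aligned.
        n = (n + alignof(std::max_align_t) - 1)
            & ~(alignof(std::max_align_t) - 1);
        if (!base || used + n > kArenaSize)
            return nullptr;  // arena exhausted; a real one would refill
        void* p = base + used;
        used += n;
        return p;
    }
};

thread_local Arena tls_arena;  // one arena per thread; never freed here

void* fast_alloc(std::size_t n) {
    return tls_arena.alloc(n);
}
```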
And as with all things performance-related, measurement is important. "Measure" and "measurement" are two other words to search for in the article.
There's so much more, but that's enough for now.
Some palate cleansers:
https://en.wikipedia.org/wiki/Monitor_(synchronization)
https://en.wikipedia.org/wiki/Thread_(computing)#Multithreading
https://en.wikipedia.org/wiki/Parallel_computing
https://en.wikipedia.org/wiki/Scheduling_(computing)
https://en.wikipedia.org/wiki/Granularity_(parallel_computing)
Oh, and to be clear, I don't want to discourage the sharing of articles, even ones like this one. My input is specifically on the article itself, not on the fact that it was shared in the first place.