Topic: Glue b/t differently sorted arrays in data-oriented prog. for optimal locality?
Martin 2BAM
« on: July 02, 2015, 08:24:03 AM »

The glue between differently sorted arrays in data oriented programming (structs of arrays) for optimal locality of reference?



First of all, let me say this is for educational purposes. This topic is about optimizing data layout for locality of reference (CPU caching) and CPU branch prediction.
I'm not trying to optimize prematurely without a bottleneck, so please let's leave that kind of comment aside.

I'm using C++11 but will write examples in pseudo-code. If something doesn't make sense, please point it out!



Data-oriented programming uses separate arrays to represent objects' components, improving CPU caching and prediction by keeping related data compact and resident in cache.
But sometimes I'll need the same components ordered (sorted) differently for different optimizations, e.g. a particle with update data and render data.
If I sort new particles by time of death when they are inserted, I can skip a lot of branch mispredictions while updating:

Code:
array updateData = [alive] + [just died] + [dead]

Code:
now = time()
for(i = 0 to nAlive) {    //mostly true
   part = updateData[i]   //sequential array access, cached and fast
   if(part.timeOfDeath > now) {    //mostly true
      part.update();
   }
   else {    //mispredict once
      for(j = i to nAlive) {  //up till nAlive are dead
          destroy(j);
      }
      nAlive = i;
      break;
   }
}
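
For reference, here is roughly what that pseudo-code could look like in actual C++11, keeping the same logic and the array-of-structs shape of the example above. Particle, ParticlePool and destroy() are just placeholder names I'm making up here, not from any real engine.

Code:
#include <vector>
#include <cstddef>

struct Particle {
    float timeOfDeath;   // sorted at insertion so that once one dead particle is found, the rest are dead too
    float x, y, vx, vy;
    void update(float dt) { x += vx * dt; y += vy * dt; }
    void destroy() { /* release resources, return to a free list, etc. */ }
};

struct ParticlePool {
    std::vector<Particle> updateData;   // [alive] + [just died] + [dead]
    std::size_t nAlive = 0;

    void update(float now, float dt) {
        for (std::size_t i = 0; i < nAlive; ++i) {
            Particle& p = updateData[i];          // sequential access, cache friendly
            if (p.timeOfDeath > now) {            // mostly true
                p.update(dt);
            } else {                              // mispredicted once at most
                for (std::size_t j = i; j < nAlive; ++j)
                    updateData[j].destroy();      // everything from i onward is already dead
                nAlive = i;
                break;
            }
        }
    }
};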

But if I want to keep first-in, depth-ordered rendering (vertices) to avoid flicker, the particles might die in a different order than they are drawn.
(I can't use an OpenGL depth buffer, mostly because alpha blending won't work that way.)

Code:
array renderData = [visible (depth ordered)] + [invisible]
(also kept partitioned into visible/invisible to avoid branch mispredictions)

(EDIT: added to clarify)
Code:
for(i = 0 to nAlive) {    //sequential, branch-free
   render(renderData[i])
}

Sometimes I might not even know a black box's internal ordering of data; e.g. Box2D pools its b2Body objects by itself, and I can't predict how the internal allocator sorts them. I'd also need to update vertex information in a specific format that is convenient for the graphics card.

So my problem is: I can use an index to map from update data to render data (or vice versa), but at some point, when one set of data is used to update the other (e.g. physics modifying vertex data), access might start jumping all around in memory.
This mapping index would also need to be kept up to date on removal (death of particles) if the arrays are to remain compact and sorted by their different criteria.

(EDIT: added to clarify)
Code:
//Call between update and render...
for(i = 0 to nAlive) {    //sequential over updateData
   udata = updateData[i];
   rdata = renderData[udata.renderIndex];  //different order, memory jumps
   rdata.vertex = udata.position;
   rdata.color = udata.color;
   //...
}
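
To make the bookkeeping concrete, here is a hedged C++11 sketch of keeping such a cross-index valid when removing a particle with the usual swap-with-last trick. All names (UpdateData, RenderData, renderIndex, updateIndex) are hypothetical; note that plain swap-remove breaks any sort order, so a sorted variant would have to shift elements instead. This only shows the index patching itself.

Code:
#include <vector>
#include <cstddef>

struct UpdateData { float x, y, timeOfDeath; std::size_t renderIndex; };
struct RenderData { float vertexX, vertexY; unsigned color; std::size_t updateIndex; };

struct Particles {
    std::vector<UpdateData> updateData;
    std::vector<RenderData> renderData;

    // Remove particle i (update-side index), keeping both arrays compact and
    // the cross-indices valid.
    void remove(std::size_t i) {
        std::size_t r = updateData[i].renderIndex;

        // Swap-remove the update entry and patch the render entry that pointed
        // at the element we just moved into slot i.
        std::size_t lastU = updateData.size() - 1;
        if (i != lastU) {
            updateData[i] = updateData[lastU];
            renderData[updateData[i].renderIndex].updateIndex = i;
        }
        updateData.pop_back();

        // Swap-remove the render entry and patch the update entry that pointed
        // at the element we just moved into slot r.
        std::size_t lastR = renderData.size() - 1;
        if (r != lastR) {
            renderData[r] = renderData[lastR];
            updateData[renderData[r].updateIndex].renderIndex = r;
        }
        renderData.pop_back();
    }
};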

How is this handled in frameworks and engines? Which trade-off is better:
branch mispredictions on alive/dead checks, keeping both arrays matched in depth order?
Or jumping around in memory while updating the other array's data? (I can't see another solution for black-box frameworks.)

I'd greatly appreciate it if you could shed some light on this.
I must admit that even after reading a lot about CPU caching and branch prediction, I don't completely understand it yet, nor do I have the empirical knowledge to apply it in more complex scenarios.

Thanks for your time.
« Last Edit: July 02, 2015, 10:44:42 AM by Martin 2BAM »
RandyGaul
« Reply #1 on: July 02, 2015, 08:55:38 AM »

Quote
Data oriented programming uses separate arrays to represent objects components, to improve CPU caching and prediction by keeping related data compact and in the cache.

Well, this isn't really true. Sure, one can use data-oriented thinking to decide to use some arrays, but in the end data-oriented design is just an idea, a way of going about things. High performance and easy multithreading are just benefits that arise emergently. Here's a good explanation: http://www.codersnotes.com/notes/explaining-data-oriented-design

Your particle array is a good case study for applying data-oriented design. Usually it is good to start with "what are the problems I need to solve?". It sounds like you need to have particles, create and destroy them dynamically, and update them with multiple algorithms.

Each of these operations should probably be split into its own small algorithm (loop). We want to transform our data with algorithms that have well-defined data inputs and outputs.

The loop you've shown:

Code:
now = time()
for(i = 0 to nAlive) {    //mostly true
   part = updateData[i]   //sequential array access, cached and fast
   if(part.timeOfDeath > now) {    //mostly true
      part.update();
   }
   else {    //mispredict once
      for(j = i to nAlive) {  //up till nAlive are dead
          destroy(j);
      }
      nAlive = i;
      break;
   }
}

This loop is doing multiple things at once, which is why you need an if statement in the middle of it. You can factor it into multiple loops: one loop checks the alive flag and takes the appropriate action, and another calls update. Now we might have:

Code:
for particle p in pool
{
    if ( dead( p ) ) do destroy( p )
}

Code:
for particle p in pool
{
    update( p )
}

Now you can see that our loops themselves have minimal branching and are quite a bit simpler. With separate loops it is easy to notice that the first loop, the one that calls destroy, only really needs to read a boolean. Perhaps it's best to have an entirely separate array of these booleans (or ints) representing the death flag. That way, when we are calling update on particles, each cache line pulled into the CPU cache contains more data relevant to the update method. The same is true for the if ( dead( p ) ) check.
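
As a rough C++11 sketch of what I mean (ParticleArrays, destroyDead and updateAll are made-up names, not from any particular engine, and the layout is only one possibility):

Code:
#include <vector>
#include <cstdint>
#include <cstddef>

// Hypothetical SoA layout: the death flags live in their own array, so the
// culling pass streams over packed bytes instead of whole particles.
struct ParticleArrays {
    std::vector<float>        posX, posY;
    std::vector<float>        velX, velY;
    std::vector<std::uint8_t> dead;       // 0 = alive, 1 = dead
};

// Pass 1: compact away dead particles (swap-remove, order not preserved).
inline void destroyDead(ParticleArrays& p) {
    std::size_t i = 0;
    while (i < p.dead.size()) {
        if (p.dead[i]) {
            std::size_t last = p.dead.size() - 1;
            p.posX[i] = p.posX[last];  p.posY[i] = p.posY[last];
            p.velX[i] = p.velX[last];  p.velY[i] = p.velY[last];
            p.dead[i] = p.dead[last];
            p.posX.pop_back();  p.posY.pop_back();
            p.velX.pop_back();  p.velY.pop_back();
            p.dead.pop_back();
        } else {
            ++i;
        }
    }
}

// Pass 2: branch-free update over whatever is left.
inline void updateAll(ParticleArrays& p, float dt) {
    for (std::size_t i = 0; i < p.posX.size(); ++i) {
        p.posX[i] += p.velX[i] * dt;
        p.posY[i] += p.velY[i] * dt;
    }
}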

As for running another type of update algorithm, perhaps it would be best for the most performance-intensive algorithm to run across the array linearly, while the other algorithms use an index list. This can be a good choice since an index list is very simple to implement.
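
An index list might look something like this (again only a hedged sketch; drawOrder is a hypothetical array of indices sorted back-to-front, and renderInDepthOrder is a made-up name):

Code:
#include <vector>
#include <cstddef>

// Hypothetical: the hot update pass walks posX/posY linearly, while the
// depth-ordered render pass follows an index list into the same arrays.
struct Particles {
    std::vector<float>       posX, posY;
    std::vector<std::size_t> drawOrder;   // indices sorted back-to-front
};

template <typename DrawFn>
void renderInDepthOrder(const Particles& p, DrawFn draw) {
    for (std::size_t k = 0; k < p.drawOrder.size(); ++k) {
        std::size_t i = p.drawOrder[k];   // the indirection: may jump around in memory
        draw(p.posX[i], p.posY[i]);
    }
}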

If you find that index lists are just too slow, I'd recommend holding duplicated data in different arrays so each update algorithm can do a linear traversal across its own array. The downside to duplicating data is that creation/deletion of particles now requires a slightly more involved sync step; if creation/deletion is very important and frequent, more data-oriented thinking can be applied to the create/destroy functions.
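
And if you go the duplicated-data route, the sync step could be as simple as this sketch, which assumes for simplicity that both copies share the same order; keeping them sorted differently is exactly where the sync gets more involved. All names here are hypothetical.

Code:
#include <vector>
#include <cstddef>

// Hypothetical duplicated-data layout: each pass owns a copy of the fields it
// reads, so both traversals stay linear. The price is that create/destroy
// must keep the two sides in lockstep (the "sync step").
struct SimArrays  { std::vector<float> posX, posY, velX, velY; };
struct DrawArrays { std::vector<float> posX, posY; std::vector<unsigned> color; };

struct ParticleSystem {
    SimArrays  sim;
    DrawArrays draw;

    void create(float x, float y, float vx, float vy, unsigned color) {
        sim.posX.push_back(x);    sim.posY.push_back(y);
        sim.velX.push_back(vx);   sim.velY.push_back(vy);
        draw.posX.push_back(x);   draw.posY.push_back(y);
        draw.color.push_back(color);
    }

    void destroy(std::size_t i) {          // swap-remove on both sides
        std::size_t last = sim.posX.size() - 1;
        sim.posX[i]   = sim.posX[last];    sim.posY[i]  = sim.posY[last];
        sim.velX[i]   = sim.velX[last];    sim.velY[i]  = sim.velY[last];
        draw.posX[i]  = draw.posX[last];   draw.posY[i] = draw.posY[last];
        draw.color[i] = draw.color[last];
        sim.posX.pop_back();   sim.posY.pop_back();
        sim.velX.pop_back();   sim.velY.pop_back();
        draw.posX.pop_back();  draw.posY.pop_back();  draw.color.pop_back();
    }
};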

Here's a good video on CPU caches:

Martin 2BAM
« Reply #2 on: July 02, 2015, 09:19:20 AM »

I guess you didn't spot that I'm already doing that.

"Removing" branches (unrolling) isn't always a solution; you essentially can't, since loops are branches too.
The optimal approach is to make branches follow a predictable pattern so that the CPU's branch-prediction history table hits most of the time. Compacting and sorting the data (on insertion) is the way I've chosen to do that.

If you look carefully at my update code's comments, they point out where mispredictions might occur; but because the data is sorted (by time of death and, as a consequence, by alive/dead), those are rare.

Your examples are essentially the same as mine. There might be some benefit in moving the time-of-death member out of updateData (to fit a bit more data in cache), but since that requires two arrays, keeping them sorted the same way (for coherence) would mean jumping between memory locations, and it would be a bit redundant.
Otherwise, if unsorted, the dead() check would get branch mispredictions every frame because the dead entries are scattered around.

A branch misprediction stalls the CPU for roughly as long as an L1 cache miss that hits in L2, so it's important too.

I hope you see my point. Thanks anyway for your reply!
RandyGaul
« Reply #3 on: July 02, 2015, 11:12:07 AM »

I understood your post well. Be careful not to overlook the useful information I posted. The important concept I'm giving you is understanding the tradeoffs between different decisions.

Also, it is my opinion that trying to sort data for branch prediction is very silly when the if-statements can be factored into separate transformations. Trying to guess how the predictor works isn't a good idea, as different prediction implementations will yield different results, your code will become more complex to try to compensate for prediction, and in the end you won't have results you can easily verify.

Edit: Also, I would bet that your statement about branch mispredictions being as expensive as a cache miss is wildly uninformed, roughly an order of magnitude off. Mispredictions are generally way cheaper than cache misses.
« Last Edit: July 02, 2015, 11:19:50 AM by RandyGaul »
JakobProgsch
« Reply #4 on: July 02, 2015, 12:00:22 PM »

In the particle case I assume "update" is just something trivial, like moving the particle by its velocity vector. If the cost of update weren't trivial, the cost of misprediction probably wouldn't matter much. In that case it might be reasonable to just keep dead particles around until their total number exceeds some threshold, and only then clean them up in one sweep. You might not even need to exclude them from being updated: just keep the update loop branch-free and update everything. What does it matter that a dead particle is moved, as long as the relevant parts (the drawing, for example) ignore it? That way only very few iterations pay the branching cost.
The whole dynamic changes when you move to expensive operations, where saving the cost of updating dead elements is relevant. But at that point, who cares about mispredictions?
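
Something like this minimal C++11 sketch is what I have in mind (all names are hypothetical, and the 25% threshold is arbitrary):

Code:
#include <vector>
#include <cstddef>

struct Particles {
    std::vector<float> posX, posY, velX, velY, timeOfDeath;
};

// Branch-free update: dead particles get moved too, which is harmless as long
// as the render pass ignores them.
inline void updateAll(Particles& p, float dt) {
    for (std::size_t i = 0; i < p.posX.size(); ++i) {
        p.posX[i] += p.velX[i] * dt;
        p.posY[i] += p.velY[i] * dt;
    }
}

// Deferred cleanup: only pay for compaction once enough particles have died.
inline void maybeCompact(Particles& p, float now) {
    std::size_t dead = 0;
    for (std::size_t i = 0; i < p.timeOfDeath.size(); ++i)
        dead += (p.timeOfDeath[i] <= now) ? 1u : 0u;
    if (dead * 4 < p.timeOfDeath.size()) return;   // fewer than ~25% dead: skip

    std::size_t w = 0;                             // write cursor
    for (std::size_t r = 0; r < p.timeOfDeath.size(); ++r) {
        if (p.timeOfDeath[r] > now) {              // still alive: keep it
            p.posX[w] = p.posX[r];   p.posY[w] = p.posY[r];
            p.velX[w] = p.velX[r];   p.velY[w] = p.velY[r];
            p.timeOfDeath[w] = p.timeOfDeath[r];
            ++w;
        }
    }
    p.posX.resize(w);  p.posY.resize(w);
    p.velX.resize(w);  p.velY.resize(w);
    p.timeOfDeath.resize(w);
}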

If there are "logical" reasons that enforce a specific order, like keeping draw order, you don't have many options anyway. "Supporting structures" like external arrays that re-index the source array tend to introduce more cost via indirection than the cost of slightly messy code that profits from the awesomeness of sequential access.

The way one deals with these kinds of issues is really by lots of measuring, which annoyingly always trumps theorycrafting about them.

Martin 2BAM
« Reply #5 on: July 02, 2015, 01:27:08 PM »

Quote
sort data for branch prediction is very silly when the if-statements can be factored into separate transformations. Trying to guess how the predictor works isn't a good idea, as different prediction implementations will yield different results, your code will become more complex to try to compensate for prediction, and in the end you won't have results you can easily verify.

You still have to do the ifs in the other loop if you factor them out. Sorting by alive/dead is a different story.
Also, time of death / TTL is used for transforming the particle, e.g. its opacity.

Common consumer CPU branch predictors usually do one of these, or both:
1) Static prediction via hints, or the heuristic of assuming backward jumps are taken (usually true for loops)
2) A lookup in a 4-16 entry pattern history table
The key is a consistent pattern. I don't think that's going to change much in the near future.

Quote
Edit: Also, I would bet that your statement about branch mispredictions being as expensive as a cache miss is wildly uninformed, roughly an order of magnitude off. Mispredictions are generally way cheaper than cache misses.

L2 CACHE hit, ~10 cycles   (i.e. L1 cache miss)
https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf

On a modern machine, an L1 miss will typically cost around 10 cycles, an LL miss can cost as much as 200 cycles, and a mispredicted branch costs in the region of 10 to 30 cycles. Detailed cache and branch profiling can be very useful for understanding how your program interacts with the machine and thus how to make it faster.
http://valgrind.org/docs/manual/cg-manual.html

Branch mispredict ............................ 5 ns
L2 cache reference ........................... 7 ns  (i.e. L1 cache miss)

https://gist.github.com/hellerbarde/2843375
« Last Edit: July 02, 2015, 01:38:59 PM by Martin 2BAM »
Martin 2BAM
« Reply #6 on: July 02, 2015, 01:37:28 PM »

Quote
In the particle case I assume "update" is just something trivial, like moving the particle by its velocity vector. If the cost of update weren't trivial, the cost of misprediction probably wouldn't matter much. In that case it might be reasonable to just keep dead particles around until their total number exceeds some threshold, and only then clean them up in one sweep. You might not even need to exclude them from being updated: just keep the update loop branch-free and update everything. What does it matter that a dead particle is moved, as long as the relevant parts (the drawing, for example) ignore it? That way only very few iterations pay the branching cost.
The whole dynamic changes when you move to expensive operations, where saving the cost of updating dead elements is relevant. But at that point, who cares about mispredictions?

If there are "logical" reasons that enforce a specific order, like keeping draw order, you don't have many options anyway. "Supporting structures" like external arrays that re-index the source array tend to introduce more cost via indirection than the cost of slightly messy code that profits from the awesomeness of sequential access.

The way one deals with these kinds of issues is really by lots of measuring, which annoyingly always trumps theorycrafting about them.


Those are great points!
I have no escape from using split array structures: I need the vertex arrays fed to OpenGL to be compact and laid out consistently between this and other renderable things. But I can use a unified index so both stay sequential.

After devoting most of today to thinking about it, I ended up putting the tradeoffs in a spreadsheet, and yes... sequentiality wins.

Boreal
« Reply #7 on: July 02, 2015, 02:14:45 PM »

Quote
Edit: Also, I would bet that your statement about branch mispredictions being as expensive as a cache miss is wildly uninformed, roughly an order of magnitude off. Mispredictions are generally way cheaper than cache misses.

L2 CACHE hit, ~10 cycles   (i.e. L1 cache miss)
https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf

On a modern machine, an L1 miss will typically cost around 10 cycles, an LL miss can cost as much as 200 cycles, and a mispredicted branch costs in the region of 10 to 30 cycles. Detailed cache and branch profiling can be very useful for understanding how your program interacts with the machine and thus how to make it faster.
http://valgrind.org/docs/manual/cg-manual.html

Branch mispredict ............................ 5 ns
L2 cache reference ........................... 7 ns  (i.e. L1 cache miss)

https://gist.github.com/hellerbarde/2843375

An L1 cache miss is not what people mean when they say, generically, "cache miss". They mean that the data was not in any level of cache and has to come from main memory, which typically costs a few hundred cycles.