Crimsontide
« Reply #1280 on: September 21, 2018, 06:47:23 AM »
This is for an ECS system? I would put A and B in their own pools.

A trick that can help is to reserve address space without committing it (this depends on the OS, of course). That way you can reserve large sections up front but not actually consume the memory until it's needed.
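A minimal sketch of what that can look like on Windows (on POSIX systems, mmap with PROT_NONE plus a later mprotect plays a similar role):

    #include <windows.h>

    int main() {
        // Reserve 1 GiB of address space up front; this consumes no
        // physical memory or page file yet.
        const SIZE_T reserved = SIZE_T(1) << 30;
        void* base = VirtualAlloc(nullptr, reserved, MEM_RESERVE, PAGE_NOACCESS);

        // Commit only the first 64 KiB once the pool actually needs it;
        // commit more pages later as the pool grows.
        void* usable = VirtualAlloc(base, 64 * 1024, MEM_COMMIT, PAGE_READWRITE);

        // ... use `usable` ...

        VirtualFree(base, 0, MEM_RELEASE);
        return 0;
    }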
oahda
« Reply #1281 on: September 21, 2018, 07:11:39 AM »
Yeah, ECS. c:
Separate pools would get rid of the issue, so I'll probably just do that. I decided to leave the post up in case someone else has a different view, but I'll probably start coding as soon as I can be bothered.

Reserving without committing looks like a cool thing, but since it is indeed OS-dependent and I'm targeting so many platforms, even if it's possible on all of them it looks like I'd have to manage a bunch of platform-specific functionality (unless someone has done the work for me already). I probably don't need it anyway, since my games probably won't be huge, but good to know; I wasn't aware of it before!

What about per-component pools? Would you still reserve a section for a limited (and identical) amount of each type of component, or would you create a separate pool for each of those too? A lot of equally sized subpools for rare components might just be a big waste of space. Or cram them all into the same pool but with different subpool sizes? So many options…
« Last Edit: September 21, 2018, 07:18:53 AM by Prinsessa »
Crimsontide
« Reply #1282 on: September 21, 2018, 08:30:09 AM »
I never used ECS in an actual product, but I did toy with it a bit. Granted, there are a thousand different versions of what one can consider ECS...

I found the easiest way to lay out memory was pools of components, where each component type has its own pool. Everything else became a mess (as I think you're starting to see): either you're moving things around, or alignments are off, or you have complex searches to find optimal placement. With everything in its own pool it all becomes stupid simple.

The next thing I did (which was a bit different, but again, who knows if it's optimal) was not to store entities at all. Entities were just an id (uint64_t). I then used a couple of flat hash tables to map components to entities and vice versa. This made it easy and fast to find all entities with a given component, all the components of an entity, and so on. It made it much like working with a database: you'd make 'queries' to get sets of components/entities to work on, then process them in batches.
It worked well for a couple small tests, but I never did anything else with it.
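A minimal sketch of that layout, with names of my own invention and std::unordered_map standing in for the flat hash tables so the example stays self-contained:

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    using Entity = uint64_t; // entities are just ids, never stored as objects

    struct Position { float x, y, z; };

    // One pool per component type, with maps in both directions.
    struct PositionPool {
        std::vector<Position> components;                 // contiguous storage
        std::vector<Entity> indexToEntity;                // parallel to components
        std::unordered_map<Entity, size_t> entityToIndex; // entity -> dense index

        void add(Entity e, Position p) {
            entityToIndex[e] = components.size();
            indexToEntity.push_back(e);
            components.push_back(p);
        }

        Position* find(Entity e) {
            auto it = entityToIndex.find(e);
            return it == entityToIndex.end() ? nullptr : &components[it->second];
        }
    };

A 'query' for all entities with a Position is then just a walk over indexToEntity, and batch processing is a linear pass over components.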
oahda
« Reply #1283 on: September 21, 2018, 08:36:51 AM »
Yeah, entities as IDs and lookup tables was my plan too. Seems like it might be worth it to just go with the KISS principle and see how it scales in practice. Thanks for your input!
qMopey
« Reply #1284 on: September 21, 2018, 09:19:38 AM »
If your use case is to iterate over all elements once a frame, don't use a typical memory pool. Instead use a vector with a layer of abstraction for looking up an element: whenever an element is removed, swap the last element into its place and update the swapped element's handle. Handles translate the unique id of an element instance into an index into the vector.

Memory pools exist to avoid calling malloc (which typically locks internally) and to avoid heap fragmentation. They are not so great for keeping a contiguous list of elements, because once an element is freed the other elements cannot be moved around, since memory pools typically hand out pointers to internal storage rather than handles.
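A minimal sketch of that scheme (the names are made up, and a real version would likely add a generation counter to each handle to catch stale ids):

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    struct Particle { float x, y; };

    class ParticlePool {
        std::vector<Particle> items;                // dense; iterate this every frame
        std::vector<uint64_t> ids;                  // id of the element at each index
        std::unordered_map<uint64_t, size_t> index; // handle table: id -> dense index
        uint64_t nextId = 1;

    public:
        uint64_t add(Particle p) {
            uint64_t id = nextId++;
            index[id] = items.size();
            ids.push_back(id);
            items.push_back(p);
            return id;
        }

        void remove(uint64_t id) {
            size_t i = index.at(id);
            // Swap the last element into the hole, then patch its handle.
            items[i] = items.back();
            ids[i] = ids.back();
            index[ids[i]] = i;
            items.pop_back();
            ids.pop_back();
            index.erase(id);
        }

        std::vector<Particle>& dense() { return items; } // always contiguous
    };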
qMopey
« Reply #1285 on: September 21, 2018, 09:27:46 AM »
Oh, you're doing ECS. In that case I'll write a little more about the vectors. You store components in vectors where each vector contains one kind of component. However, there isn't necessarily a reason to store each kind of component in only one single vector; sometimes you want to split them up. For example, all of the collider components for a single room in the game can sit in one vector. That way only the vector associated with the room the player is in needs to be considered. Another use case is to load up a bunch of component vectors but not use them yet: stream them in during level loading, before the player needs to transition to a new level. This can help avoid loading screens.

Exactly how and when to store all your different component vectors is a matter of architectural preference.
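As one possible shape for the per-room idea (a sketch; the names are hypothetical):

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    struct Collider { float x, y, w, h; };
    using RoomId = uint32_t;

    // One collider vector per room. Only the active room's vector is touched
    // each frame, and a room's vector can be filled ahead of time while
    // streaming a level in.
    std::unordered_map<RoomId, std::vector<Collider>> collidersByRoom;

    void updateCollisions(RoomId activeRoom) {
        auto it = collidersByRoom.find(activeRoom);
        if (it == collidersByRoom.end()) return;
        for (Collider& c : it->second) {
            // ... collision checks against only this room's colliders ...
            (void)c;
        }
    }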
oahda
« Reply #1286 on: September 21, 2018, 11:53:55 AM »
Yeah, I was going to do the swap thing too. I actually have an older implementation in place already, but it's a mess, so I want to fix it up and simplify it. And yeah, I'm thinking of separate pools for different subscenes so that I can load things in and out asynchronously without losing track of what belongs to the main scene and so on.

Of course this is for a more general engine, so I don't want to tailor it too much to one game. But the game I have in mind is intended as a continuous experience with no loading screens beyond the initial one: a bit like an open-world game in miniature that all takes place in one small town. Going inside a house doesn't take you to a new scene; instead, the interiors are absent while the player is outside and loaded asynchronously as needed, as are chunks of the town that are too far away. Something like that, coupled with the usual LOD trickery.
qMopey
« Reply #1287 on: September 21, 2018, 01:09:24 PM »
Great, sounds like you have a grasp on what you want to do now.
Crimsontide
« Reply #1288 on: September 21, 2018, 01:55:07 PM »
Let us know how it works
oahda
« Reply #1289 on: September 21, 2018, 11:41:12 PM »
Thanks for the help!
oahda
« Reply #1290 on: September 24, 2018, 07:50:25 AM »
Did some testing and found that this…

    f0 = f3; f1 = f4; f2 = f5;

…is slower than this…

    memcpy(&f0, &f3, sizeof(float) * 3);

…at least on my PC. In my case these are sequential floats as members of a struct, and I do have to do this a bunch every frame, meaning the minor speedup might actually be significant. That made me wonder about padding… Does anybody have some good reading on when padding actually happens? If I have six floats in a row in a class, is there any reason to expect padding bytes between the floats in each subgroup of three, possibly breaking my memcpy?
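One way to settle the padding question for a specific struct at compile time (a sketch; the struct and member names are stand-ins for whatever you actually have):

    #include <cstddef> // offsetof

    struct Transform { float f0, f1, f2, f3, f4, f5; };

    // If these hold, the floats in each subgroup of three are contiguous
    // and the three-float memcpy can't be broken by padding.
    static_assert(offsetof(Transform, f2) - offsetof(Transform, f0) == 2 * sizeof(float),
                  "padding between f0 and f2");
    static_assert(offsetof(Transform, f5) - offsetof(Transform, f3) == 2 * sizeof(float),
                  "padding between f3 and f5");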
Crimsontide
« Reply #1291 on: September 24, 2018, 08:37:34 AM »
My feeling is your testing method isn't accurate...

For small bits of code, I find looking at the generated assembly is a better estimate of performance than a simple loop + timer. Generally speaking, the optimizer can do a lot of stuff, and it's probably not executing what you think it's executing...
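One concrete pitfall: if the result of the copy is never observably used, the optimizer is allowed to delete the whole loop, and the timer then measures nothing. A common workaround (a sketch, not from the posts above) is to force an observable side effect:

    #include <chrono>
    #include <cstdio>

    volatile float sink; // a volatile write can't be optimized away

    int main() {
        float f0 = 0, f1 = 0, f2 = 0, f3 = 1.0f, f4 = 2.0f, f5 = 3.0f;
        auto begin = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < 100000000; ++i) {
            f0 = f3; f1 = f4; f2 = f5; // code under test
            sink = f0 + f1 + f2;       // keeps the assignments observable
        }
        auto end = std::chrono::high_resolution_clock::now();
        long long ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count();
        std::printf("%lld ns\n", ns);
        return 0;
    }

The sink itself adds work inside the loop, so this still only gives a rough number; reading the generated assembly remains the reliable check.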
Crimsontide
« Reply #1292 on: September 24, 2018, 09:03:59 AM »
OK, I made a post that had errors in it and promptly deleted it; if anyone saw it, just ignore it. Anyways, this seems to work for me now:

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <cstring>
    #include <functional>
    #include <iostream>
    #include <memory>
    #include <string>

    using namespace std;

    struct S { float f0, f1, f2; };

    bool operator==(S s0, S s1) { return s0.f0 == s1.f0 && s0.f1 == s1.f1 && s0.f2 == s1.f2; }

    int main() {
        using TimePoint = std::chrono::high_resolution_clock::time_point;

        constexpr size_t iterations = 100000000;

        S* s = new S[iterations];
        S* t = new S[iterations];

        // test 1
        for (size_t i = 0; i < iterations; ++i) {
            s[i] = { 0.0f, 0.0f, 0.0f };
            t[i] = { static_cast<float>(i), static_cast<float>(i + 1), static_cast<float>(i + 2) };
        }
        TimePoint t0 = std::chrono::high_resolution_clock::now();
        for (size_t i = 0; i < iterations; ++i) { s[i].f0 = t[i].f0; s[i].f1 = t[i].f1; s[i].f2 = t[i].f2; }
        TimePoint t1 = std::chrono::high_resolution_clock::now();
        if (!std::equal(s, s + iterations, t)) cout << "error in test 1" << endl; else cout << "test 1 success" << endl;

        // test 2
        for (size_t i = 0; i < iterations; ++i) {
            s[i] = { 0.0f, 0.0f, 0.0f };
            t[i] = { static_cast<float>(i), static_cast<float>(i + 1), static_cast<float>(i + 2) };
        }
        TimePoint t2 = std::chrono::high_resolution_clock::now();
        for (size_t i = 0; i < iterations; ++i) { s[i] = t[i]; }
        TimePoint t3 = std::chrono::high_resolution_clock::now();
        if (!std::equal(s, s + iterations, t)) cout << "error in test 2" << endl; else cout << "test 2 success" << endl;

        // test 3
        for (size_t i = 0; i < iterations; ++i) {
            s[i] = { 0.0f, 0.0f, 0.0f };
            t[i] = { static_cast<float>(i), static_cast<float>(i + 1), static_cast<float>(i + 2) };
        }
        TimePoint t4 = std::chrono::high_resolution_clock::now();
        std::memcpy(s, t, sizeof(S) * iterations);
        TimePoint t5 = std::chrono::high_resolution_clock::now();
        if (!std::equal(s, s + iterations, t)) cout << "error in test 3" << endl; else cout << "test 3 success" << endl;

        cout << endl;

        std::chrono::nanoseconds d0 = t1 - t0;
        std::chrono::nanoseconds d1 = t3 - t2;
        std::chrono::nanoseconds d2 = t5 - t4;
        cout << "d0 = " << d0.count() << "ns" << endl;
        cout << "d1 = " << d1.count() << "ns" << endl;
        cout << "d2 = " << d2.count() << "ns" << endl;
        cout << endl;

        double e0 = static_cast<double>(d0.count()) / 1000000000.0;
        double e1 = static_cast<double>(d1.count()) / 1000000000.0;
        double e2 = static_cast<double>(d2.count()) / 1000000000.0;
        cout << "e0 = " << e0 << "s" << endl;
        cout << "e1 = " << e1 << "s" << endl;
        cout << "e2 = " << e2 << "s" << endl;
        cout << endl;

        getchar();
        return 0;
    }

The results are kinda what I would expect: there are differences, but they're small. On my system (Core i7, latest VC++ compiler) memcpy turns out to be the fastest method, by a small amount. My guess is that it's copying in larger chunks than the two loops, and that if the size of S were increased to a cache line we'd see similar performance. Off to test that.
qMopey
« Reply #1293 on: September 24, 2018, 09:15:08 AM »
Nowadays most compilers have an intrinsic for memcpy and will be doing vector operations as well.
Crimsontide
« Reply #1294 on: September 24, 2018, 09:24:47 AM »
Quote from: qMopey on September 24, 2018, 09:15:08 AM
Nowadays most compilers have an intrinsic for memcpy and will be doing vector operations as well.
In theory the compiler can do the same thing for assignments as well. The question is: does it?

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <cstring>
    #include <functional>
    #include <iostream>
    #include <memory>
    #include <string>

    using namespace std;

    // ----- Test -----

    template<class T> void Test(size_t iterations) {
        using TimePoint = std::chrono::high_resolution_clock::time_point;

        T* s = new T[iterations];
        T* t = new T[iterations];

        // test 1
        for (size_t i = 0; i < iterations; ++i) {
            s[i] = { 0.0f, 0.0f, 0.0f };
            t[i] = { static_cast<float>(i), static_cast<float>(i + 1), static_cast<float>(i + 2) };
        }
        TimePoint t0 = std::chrono::high_resolution_clock::now();
        for (size_t i = 0; i < iterations; ++i) { s[i].f0 = t[i].f0; s[i].f1 = t[i].f1; s[i].f2 = t[i].f2; }
        TimePoint t1 = std::chrono::high_resolution_clock::now();
        if (std::equal(s, s + iterations, t)) cout << "test 1 success" << endl; else cout << "error in test 1" << endl;

        // test 2
        for (size_t i = 0; i < iterations; ++i) {
            s[i] = { 0.0f, 0.0f, 0.0f };
            t[i] = { static_cast<float>(i), static_cast<float>(i + 1), static_cast<float>(i + 2) };
        }
        TimePoint t2 = std::chrono::high_resolution_clock::now();
        for (size_t i = 0; i < iterations; ++i) { s[i] = t[i]; }
        TimePoint t3 = std::chrono::high_resolution_clock::now();
        if (std::equal(s, s + iterations, t)) cout << "test 2 success" << endl; else cout << "error in test 2" << endl;

        // test 3
        for (size_t i = 0; i < iterations; ++i) {
            s[i] = { 0.0f, 0.0f, 0.0f };
            t[i] = { static_cast<float>(i), static_cast<float>(i + 1), static_cast<float>(i + 2) };
        }
        TimePoint t4 = std::chrono::high_resolution_clock::now();
        std::memcpy(s, t, sizeof(T) * iterations);
        TimePoint t5 = std::chrono::high_resolution_clock::now();
        if (std::equal(s, s + iterations, t)) cout << "test 3 success" << endl; else cout << "error in test 3" << endl;

        // test 4
        for (size_t i = 0; i < iterations; ++i) {
            s[i] = { 0.0f, 0.0f, 0.0f };
            t[i] = { static_cast<float>(i), static_cast<float>(i + 1), static_cast<float>(i + 2) };
        }
        TimePoint t6 = std::chrono::high_resolution_clock::now();
        for (size_t i = 0; i < iterations; ++i) { std::memcpy(s + i, t + i, sizeof(T)); }
        TimePoint t7 = std::chrono::high_resolution_clock::now();
        if (std::equal(s, s + iterations, t)) cout << "test 4 success" << endl; else cout << "error in test 4" << endl;

        cout << endl;

        // results
        std::chrono::nanoseconds d0 = t1 - t0;
        std::chrono::nanoseconds d1 = t3 - t2;
        std::chrono::nanoseconds d2 = t5 - t4;
        std::chrono::nanoseconds d3 = t7 - t6;
        cout << "d0 = " << d0.count() << "ns" << endl;
        cout << "d1 = " << d1.count() << "ns" << endl;
        cout << "d2 = " << d2.count() << "ns" << endl;
        cout << "d3 = " << d3.count() << "ns" << endl;
        cout << endl;

        double e0 = static_cast<double>(d0.count()) / 1000000000.0;
        double e1 = static_cast<double>(d1.count()) / 1000000000.0;
        double e2 = static_cast<double>(d2.count()) / 1000000000.0;
        double e3 = static_cast<double>(d3.count()) / 1000000000.0;
        cout << "e0 = " << e0 << "s" << endl;
        cout << "e1 = " << e1 << "s" << endl;
        cout << "e2 = " << e2 << "s" << endl;
        cout << "e3 = " << e3 << "s" << endl;
        cout << endl;

        // done
        delete[] s;
        delete[] t;
    }

    // ----- Main -----

    struct S { float f0, f1, f2; };

    bool operator==(S s0, S s1) { return s0.f0 == s1.f0 && s0.f1 == s1.f1 && s0.f2 == s1.f2; }

    struct SS { float f0, f1, f2, f3; };

    bool operator==(SS s0, SS s1) { return s0.f0 == s1.f0 && s0.f1 == s1.f1 && s0.f2 == s1.f2 && s0.f3 == s1.f3; }

    struct SSS { float f0, f1, f2, f3; float f4, f5, f6, f7; };

    bool operator==(SSS s0, SSS s1) { return s0.f0 == s1.f0 && s0.f1 == s1.f1 && s0.f2 == s1.f2 && s0.f3 == s1.f3 && s0.f4 == s1.f4 && s0.f5 == s1.f5 && s0.f6 == s1.f6 && s0.f7 == s1.f7; }

    int main() {
        cout << "S: " << endl;
        Test<S>(100000000);
        cout << endl;
        cout << "SS: " << endl;
        Test<SS>(100000000);
        cout << endl;
        cout << "SSS: " << endl;
        Test<SSS>(100000000);
        getchar();
        return 0;
    }

A bit of a mess, and test 1 isn't really accurate (since it doesn't do all the assignments for SS or SSS), but in the end it seems the form of copy doesn't matter. Cache coherency and the memory subsystem dictate the speed, not the form of copy chosen. So... do whatever is convenient; it won't really matter anyways.
oahda
« Reply #1295 on: September 24, 2018, 09:41:04 AM »
Wow, didn't expect you to put in all that work, haha. Thanks.
qMopey
« Reply #1296 on: September 24, 2018, 09:42:33 AM »
In my experience, assignments often aren't heavily optimized beyond a certain point. I don't really know why not.

For example, I once wrote a fast Fourier transform function and changed the inner loop from working through a couple of pointer dereferences to loading the values into locals on the stack, doing the ops on the locals, and storing the results back. I got around a 13% speed boost on MSVC with typical optimization flags. It's a pointer-aliasing scenario, but still, I was surprised the compiler didn't avoid the aliasing in such a simple case.
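A stripped-down illustration of that kind of rewrite (hypothetical names; the actual FFT inner loop isn't shown here):

    #include <cstddef>

    // Pointer-heavy version: if out and in can alias, the compiler may have
    // to reload in[i] after every store to out[i].
    void scale_slow(float* out, const float* in, float s, size_t n) {
        for (size_t i = 0; i < n; ++i) {
            out[i] = in[i] * s;
            out[i] += in[i]; // in[i] potentially reloaded after the store above
        }
    }

    // Same computation through a local: one load, arithmetic in registers,
    // one store, no aliasing assumptions in between.
    void scale_fast(float* out, const float* in, float s, size_t n) {
        for (size_t i = 0; i < n; ++i) {
            float x = in[i];
            out[i] = x * s + x;
        }
    }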
Crimsontide
« Reply #1297 on: September 24, 2018, 10:00:52 AM »
Quote from: qMopey on September 24, 2018, 09:42:33 AM
In my experience, assignments often aren't heavily optimized beyond a certain point. I don't really know why not. For example, I once wrote a fast Fourier transform function and changed the inner loop from working through a couple of pointer dereferences to loading the values into locals on the stack, doing the ops on the locals, and storing the results back. I got around a 13% speed boost on MSVC with typical optimization flags. It's a pointer-aliasing scenario, but still, I was surprised the compiler didn't avoid the aliasing in such a simple case.
Was this recently? I know in the past VC++ was terrible, but recently I've been impressed with its performance. For example (sort of tangential), I used to use memory pools all the time, since VC++'s new/delete had abysmal performance. I had a really fast implementation that could allocate/delete in the 20-50ns range. Recently I tested it against new/delete, and they had clearly made some serious improvements: their performance matched or beat mine in every case (whereas previously it wasn't even in the same ballpark). Perhaps give it a try with the latest version? It'd be nice to know if local variables could be faster under some circumstances.
qMopey
« Reply #1298 on: September 24, 2018, 10:20:51 AM »
Yep, it was very recent! To be clear, I didn't use memcpy; I was just talking about how little tweaks can help the compiler a lot (like memcpying three variables instead of doing three separate assignments).

This is just true in general for all compilers. They typically work "well enough", but don't actually go as crazy as people claim. I think they get a rep for doing insane optimizations because a lot of work goes into compilers to detect certain scenarios that can be heavily optimized from an implied slow state in C/C++. For example, lots of scalar routines (with loops inside) can be automagically converted to proper vectorized loops under the hood, divides and modulos can simply disappear from certain loops, and sometimes a lot of templated cruft can be compiled down tremendously.

And yep, new/delete on Win7 and above have gotten way better than the Win XP versions. It's pretty hard to beat malloc/free perf on Windows nowadays. I've noticed that speed improvements for allocators now seem to come primarily from avoiding the locking that likely happens when malloc is called (thread contention). I think general-purpose custom allocators have really fallen out of favor in recent years as a result (though these custom allocators would obviously still be useful on embedded hardware).