@namandixit Yep, that all checks out. So now the question is: when we pass memory to the GPU for two matrices A and B and want to perform A * B, the memory for each of A and B could be laid out in either row or column order, so which one does the GPU actually expect?
A = Translation * Rotation * Scaling
// And in block form:
[ R * S, T ]
[ { 0 }^T, 1 ]
// Above is typical column major notation we are all
// familiar with. Commonly we *pre-transpose* this into
// the following memory storage, so in memory we have:
[ (R * S)^T, { 0 } ]
[ T^T , 1 ]
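To make that storage concrete, here is a minimal C sketch (the Mat4 type and the function name are just mine for illustration) of building the affine matrix above as a flat float[16] in column major order. Columns are contiguous in memory, so the translation T lands in elements 12..14 and the last element is 1; write the flat array out four floats per source line and it reads exactly like the pre-transposed block above.

```
#include <string.h>

/* Hypothetical helper type: 16 floats in column major order,
 * i.e. element index = column * 4 + row. */
typedef struct { float e[16]; } Mat4;

/* Build A = [ R*S  T ; {0}^T  1 ] in column major storage.
 * rs is the 3x3 R*S block, also given column by column. */
static Mat4 mat4_from_rs_t(const float rs[9], const float t[3])
{
    Mat4 m;
    memset(m.e, 0, sizeof m.e);
    for (int col = 0; col < 3; ++col) {
        m.e[col * 4 + 0] = rs[col * 3 + 0];
        m.e[col * 4 + 1] = rs[col * 3 + 1];
        m.e[col * 4 + 2] = rs[col * 3 + 2];
        /* m.e[col * 4 + 3] stays 0: the { 0 }^T row */
    }
    m.e[12] = t[0]; /* translation lives in the last column, */
    m.e[13] = t[1]; /* which is the last four floats in memory */
    m.e[14] = t[2];
    m.e[15] = 1.0f;
    return m;
}
```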
So the GPU itself should not care whether the bytes in the block of memory that you call a matrix represent four rows in order or four columns in order, because it has no concept of matrices. The layout is only meaningful in the context of an API and a shading language.
We can call it whatever we want, sure, but a certain storage order will definitely be expected; i.e. if we transpose a matrix we get different behavior.
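As a concrete example of that last point, here is a sketch assuming desktop GL with a loader that exposes glUniformMatrix4fv, and with `loc` being a valid uniform location (both are my assumptions, not from the posts above): the same 16 floats mean different matrices to the shader depending on the transpose flag you pass at upload.

```
#include <GL/gl.h>  /* in practice obtained via a loader such as glad/GLEW */

/* Upload the same 16 floats; the transpose flag tells GL how to
 * interpret them, so flipping it effectively transposes the matrix
 * the shader sees. */
void upload_model_matrix(GLint loc, const float m[16])
{
    /* m is already in GL's column major order: pass it through as-is */
    glUniformMatrix4fv(loc, 1, GL_FALSE, m);

    /* If m had been stored row major instead, desktop GL could
     * transpose it for us on upload (GLES 2 requires GL_FALSE):
     * glUniformMatrix4fv(loc, 1, GL_TRUE, m); */
}
```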
...
So I did a little more reading and it looks like I was totally wrong about DX/GL expecting the same storage! To quote from Fabian here:
So, while OpenGL defaults to column-major storage and D3D defaults to row-major storage, they don’t store the same thing – the matrices themselves are different (transposed in fact) because of the different types of vectors they use.
I would definitely trust what Fabian says. When I was originally doing tests, I think I had forgotten that at one point I was swapping the ordering to keep the storage consistent between GL and DX, so I assumed the storage had to be consistent all the time. Consequently I thought GL used row major storage.
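A small layout fact that helps untangle this: storing a matrix M in column major order produces exactly the same 16 floats as storing its transpose M^T in row major order, so two APIs holding transposed matrices can still end up with identical bytes in memory. A quick sketch that makes the equivalence explicit (the helper names are mine):

```
#include <assert.h>
#include <string.h>

/* Flatten m[row][col] into 16 floats, column by column. */
static void store_col_major(const float m[4][4], float out[16])
{
    for (int c = 0; c < 4; ++c)
        for (int r = 0; r < 4; ++r)
            out[c * 4 + r] = m[r][c];
}

/* Flatten m[row][col] into 16 floats, row by row. */
static void store_row_major(const float m[4][4], float out[16])
{
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            out[r * 4 + c] = m[r][c];
}

/* store_col_major(M) equals store_row_major(transpose(M)), byte for byte. */
static void check_equivalence(const float m[4][4])
{
    float mt[4][4], a[16], b[16];
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            mt[r][c] = m[c][r];          /* build the transpose */
    store_col_major(m, a);
    store_row_major(mt, b);
    assert(memcmp(a, b, sizeof a) == 0); /* identical 16 floats */
}
```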
Fabian says:
HLSL supports both row-major and column-major storage. The default if you don’t do anything is column-major, but within a shader you can either set the global default for all matrices to row-major using a pragma, or specify row_major/col_major per matrix if you want to.
Either way, there’s no performance difference between v*M and M*v.
The interesting thing was that there didn't seem to be much of a performance difference either way, so what really matters is just picking something and staying consistent. Personally I think using OpenGL's storage (described in my code example above) everywhere would be a pretty good idea, which is what namandixit is doing in his source.
So it turns out I was also wrong about what I was calling row major. GL uses column major storage, which is why we have to do the whole pre-transpose thing.
Fabian points out that the FAQ I linked to is incorrect, and quotes directly from the OpenGL spec stating GL uses column major storage. So I had the storage terminology flipped due to the incorrect FAQ.
According to Fabian, DX uses row major storage by default, but this can be changed with an HLSL #pragma.
tl;dr
GL uses column major storage. DX uses row major storage by default, but this can be changed by the user. Whichever one you pick really doesn't matter at all in terms of shader performance.