I added the volatile change as described above and optimized with -O3. I'm not an assembly pro, but I have used it (and waded through it) from time to time. As expected, in both the switch and function versions the code has been completely inlined, and that inlined code rolled through the optimizer. From a cursory glance the switch seems to be the fastest. I didn't check the loop or the template (the loop will never be faster, though it'll probably be unrolled and would be 'as fast'; the template approach only works for constants and wasn't implemented properly).
In the switch version here's what's happening:
.string "Assertion failed."
This just defines the raw string value; it's the assembly equivalent of:
const char* string_1 = "Assertion failed.";
sub rsp, 40
mov DWORD PTR [rsp], 0
mov DWORD PTR [rsp+4], 1
mov DWORD PTR [rsp+8], 2
mov DWORD PTR [rsp+12], 3
mov DWORD PTR [rsp+16], 1
mov DWORD PTR [rsp+20], 2
mov DWORD PTR [rsp+24], 3
mov DWORD PTR [rsp+28], 0
This is the beginning of the main function. It moves the stack pointer down to reserve 40 bytes, then stores all the enum values it needs on the stack. You can think of each 'slot' on the stack as a variable in C++. For example, the C++ variable 'n1' is clearly stored at [rsp + 0], s1 at [rsp + 4], etc.
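To see why each variable gets its own stack slot at all under -O3, here is a minimal sketch of the kind of source that produces the eight DWORD stores above. The variable names and exact slot layout are guesses; the key point is the volatile qualifier, which forces the compiler to actually materialize each constant in its own 4-byte slot instead of folding the variables away:

```cpp
// Hypothetical reconstruction -- names and ordering are assumptions.
// Each volatile local occupies one 4-byte stack slot, matching the
// mov DWORD PTR [rsp+N], imm pattern shown above.
int demo() {
    volatile int a = 0, b = 1, c = 2, d = 3;  // first four slots
    volatile int e = 1, f = 2, g = 3, h = 0;  // next four slots
    // Each read below comes back from the stack, not a folded constant.
    return a + b + c + d + e + f + g + h;
}
```

Without volatile, the optimizer would compute the whole thing at compile time and emit no stores at all.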
mov ecx, DWORD PTR [rsp+16]
mov eax, DWORD PTR [rsp]
Here we move n2 into register ecx and n1 into eax. It doesn't seem to need ecx until much later on, so I'm not sure why it loads it so early; perhaps it's something to do with label alignment.
cmp eax, 1
Here it's comparing n1 with 1. If they are equal we jump to .L3. If it's less than or equal we jump to .L41 (yeah, weird, but it's not incorrect: the equal case has already branched away, so only strictly-less values actually take that jump).
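The cmp/je/jle pair is roughly how a compiler lowers a small switch into a compare chain. Here's a sketch in C++ (the label names .L3/.L41 from above are noted in comments; the function itself is illustrative, not from the original code):

```cpp
// Rough C++ shape of the compare chain described above.
int classify(int n) {
    if (n == 1) return 0;  // cmp eax, 1 ; je  .L3
    if (n <= 1) return 1;  // jle .L41 -- only n < 1 reaches this, since
                           // the n == 1 case already branched away
    return 2;              // fall through for n > 1
}
```

The second test reads as "less than or equal" in the assembly, but because the equal case was consumed by the first branch, it effectively splits the range three ways.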
If you trace it through you'll see it just jumps around and does a lot of compares. Here's another interesting bit:
cmp ecx, edx
mov edi, OFFSET FLAT:.LC0
xor eax, eax
add rsp, 40
We come into .L5 and do a compare, jumping away if equal. If it's not equal we fall through to .L8, which prints the .LC0 string "Assertion failed.". .L38 sets eax to 0 (the return value of the main function), rewinds the stack, and exits the function. Notice how this is right in the middle of the code, which is probably the most optimal place for it.
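The fall-through shape described above looks something like this in C++ (a sketch; the function name is mine, not from the original code):

```cpp
#include <cstdio>

// Rough C++ shape of the branch described above: the compare jumps
// past the print on equality; the unequal case falls straight through
// into the failure message (the .LC0 string).
bool check_equal(int lhs, int rhs) {
    if (lhs == rhs)                      // cmp ecx, edx ; je <past the print>
        return true;
    std::puts("Assertion failed.");      // reached only by fall-through
    return false;
}
```

Laying the failure path inline like this lets the common (equal) case take a single forward jump while the rare case costs no jump at all.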
As you can see, the outcome of these sorts of low-level optimizations is hard to predict, and it's best left to the compiler. All sorts of criteria come into play. Code alignment restrictions may cause it to shift instructions earlier or later. Jumps are often eliminated by well-placed code and a lot of 'fall through'. The compiler has a TON of clever tricks up its sleeve.
Unless it's a single function that, after profiling, takes a MASSIVE portion of your CPU time, it's not worth the time or effort. A person can hand-optimize these things better than a compiler, BUT it takes a MASSIVE amount of time and expertise to do so. It's best to just make the code clear, concise, correct, and easy to maintain, and let the compiler do its thing. If, after profiling, you find that a particular function is causing grief then by all means address it, but only AFTER profiling.
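If you want a quick first-pass measurement before reaching for a real profiler, a minimal timing harness is easy to sketch (this is an illustrative helper of my own, not a substitute for a sampling profiler like perf or VTune):

```cpp
#include <chrono>

// Minimal sketch: time an operation over many iterations and report the
// average in nanoseconds. Only useful as a rough first pass; a sampling
// profiler gives far more trustworthy numbers.
template <typename F>
long long time_avg_ns(F f, int iterations) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        f();  // the operation under test
    auto stop = std::chrono::steady_clock::now();
    auto total = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
    return total.count() / iterations;
}
```

Beware that the optimizer can delete a loop body with no observable effects, so make sure the measured function's result is actually used (or volatile-qualified).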
Here's the thing: this is all good fun for something to do when bored, but the performance difference is minimal. As for which is faster? Really, both will be similar. Cache coherence and how the functions are used in the rest of your code will become a much larger performance issue than the difference between these two functions.