In terms of ease of use, floats are easiest. For this reason, I use floats. The other two would be similar to use if implemented as classes, but fixedFloat will require some additional logic on the implementation-side for addition and subtraction that is not necessary for fixed point. Other operations will be similarly complicated and all will involve a few conversions each.
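The snippets in this post never spell out the fixedFloat type itself, so here is a minimal sketch of what they appear to assume. The field names `i` and `f` come from the code in this post; the field widths and the `ff_to_float` helper are my own assumptions, added only for illustration.

```cpp
// Hypothetical layout for the fixedFloat type. The names i and f match
// the snippets in this post; the widths are an assumption.
struct ff {
    int   i;  // integer part
    float f;  // fractional part, magnitude below 1
};

// Hypothetical helper: collapse a fixedFloat back into a plain float.
// Precision is lost once the integer part gets large.
inline float ff_to_float(const ff &v) {
    return static_cast<float>(v.i) + v.f;
}
```

For example, `ff{3, 0.25f}` would stand for 3.25.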
For a fixedFloat where the fractional part is 0 <= f < 1 the C++ code for an addition looks like this:
ff ff_add(ff *a, ff *b)
{
    ff result;
    int carry;
    result.f = a->f + b->f; //0 <= f < 2
    carry = int(result.f); //truncation; equal to (result.f >= 1)
    result.i = a->i + b->i + carry;
    result.f -= carry;
    return result;
}
But you specifically want to leverage the properties of a float for numbers with a magnitude less than one, so presumably you're talking about using -1 < f <= 0 for the float part when the value is negative? This complicates things, because the sign of the float part now has to be kept in step with the sign of the int part.
ff ff_add(ff *a, ff *b)
{
    ff result;
    int carry;
    result.f = a->f + b->f; //-2 < f < 2
    carry = int(result.f); //Truncation; zero unless f's magnitude reaches one.
    result.i = a->i + b->i + carry; //The carry moves into the integer part.
    result.f -= carry;
    //We must ensure the sign of the float matches the sign of the int.
    if (result.i > 0 && result.f < .0f) {--result.i; ++result.f;}
    else if (result.i < 0 && result.f > .0f) {++result.i; --result.f;}
    return result;
}
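To check the sign handling, here is a self-contained restatement of the signed addition that can be run on a few values. The struct layout, the name `ff_add_signed`, and the use of const references are assumptions made for this sketch; the steps are meant to mirror the function above, with the carry folded into the integer part.

```cpp
// Assumed layout: integer part plus a fractional part with |f| < 1
// that carries the same sign as the integer part.
struct ff {
    int   i;
    float f;
};

// Sketch of the signed addition (hypothetical name and signature).
inline ff ff_add_signed(const ff &a, const ff &b) {
    ff r;
    r.f = a.f + b.f;                    // -2 < f < 2
    int carry = static_cast<int>(r.f);  // truncates toward zero
    r.i = a.i + b.i + carry;
    r.f -= static_cast<float>(carry);
    // Make the sign of the fractional part agree with the integer part.
    if (r.i > 0 && r.f < 0.0f)      { --r.i; r.f += 1.0f; }
    else if (r.i < 0 && r.f > 0.0f) { ++r.i; r.f -= 1.0f; }
    return r;
}
```

With these definitions, 1.25 + (-0.5) goes through the sign fix-up: the raw sum is i = 1, f = -0.25, which gets normalised to i = 0, f = 0.75.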
Sticking with this format, let's look at a conversion from float and then a multiplication:
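ff ff_from_float(float v)
{
    //This actually looks pretty simple thanks to C++'s truncation rules.
    ff result;
    result.i = int(v); //Truncates toward zero, so i keeps the sign of v.
    result.f = v - result.i; //Remainder has the same sign as v, magnitude below one.
    return result;
}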
ff ff_mult(ff *a, ff *b)
{
    ff result;
    int aibi = a->i*b->i, carry;
    float aibf = a->i*b->f, afbi = a->f*b->i, afbf = a->f*b->f;
    result.f = aibf+afbi+afbf; //Cross terms and fractional product.
    carry = int(result.f); //Whole part of the sum moves into the integer part.
    result.f -= carry;
    result.i = aibi + carry;
    //We must ensure the sign of the float matches the sign of the int.
    if (result.i > 0 && result.f < .0f) {--result.i; ++result.f;}
    else if (result.i < 0 && result.f > .0f) {++result.i; --result.f;}
    return result;
}
You might be able to cut these down a little, but so can any half-decent optimizing compiler. Let's compare with a fixed-point multiplication:
int64 fx_from_float(float v)
{
    //Scale by 2^32 to move the value into 32.32 format.
    return int64(v*4294967296.0f);
}
//Multiply two 32.32 numbers
int64 fx_mult(int64 a, int64 b)
{
    //This code isn't 100% portable: right-shifting a negative signed value is
    //implementation-defined before C++20, and al*bl can overflow a signed 64-bit int.
    int64 ah = (a >> 32), al = (a&0xFFFFFFFF), bh = (b >> 32), bl = (b&0xFFFFFFFF);
    return ((ah*bh) << 32) + (al*bh + ah*bl) + ((al*bl) >> 32);
}
The fixed-point multiplication code there isn't perfect, but it gives a generally good measure of the operation's complexity. Generally this fx_mult will perform similarly to or slightly worse than a floating-point multiplication on modern hardware.
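One possible way around the signed-overflow problem in the low-word product is to keep the low words unsigned and do the sum in unsigned 64-bit arithmetic, converting back to signed only at the end. This is a sketch rather than a definitive implementation: the names `fx_mult_3232` and `fx_from_double` are hypothetical, and it assumes that `>>` on a negative `int64_t` is an arithmetic shift and that converting an out-of-range `uint64_t` to `int64_t` wraps (both guaranteed since C++20, and the usual behaviour of mainstream compilers before that). Like the original, it silently wraps if the true result does not fit in 32.32.

```cpp
#include <cstdint>

// Hypothetical helper: convert a double to 32.32 fixed point (scale by 2^32).
inline int64_t fx_from_double(double v) {
    return static_cast<int64_t>(v * 4294967296.0);
}

// Sketch: multiply two signed 32.32 numbers, result rounded toward -infinity.
inline int64_t fx_mult_3232(int64_t a, int64_t b) {
    // Signed high words (floor of the value), unsigned low words.
    int64_t  ah = a >> 32;
    int64_t  bh = b >> 32;
    uint64_t al = static_cast<uint64_t>(a) & 0xFFFFFFFFu;
    uint64_t bl = static_cast<uint64_t>(b) & 0xFFFFFFFFu;

    // ah*bh fits in int64_t; everything after that wraps modulo 2^64.
    uint64_t hi    = static_cast<uint64_t>(ah * bh) << 32;
    uint64_t cross = static_cast<uint64_t>(ah) * bl
                   + al * static_cast<uint64_t>(bh);
    uint64_t lo    = (al * bl) >> 32;  // full unsigned product, no overflow

    return static_cast<int64_t>(hi + cross + lo);
}
```

The case that separates this from the signed version is two low words near 0xFFFFFFFF: their product needs all 64 bits, which an unsigned type can hold but a signed one cannot.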
When you've got access to an integer type that's twice as long as your fixed-point numbers it gets simpler still:
//Multiply two 16.16 numbers
int32 fx_mult(int32 a, int32 b)
{
    return int32((int64(a) * int64(b)) >> 16);
}
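As a quick sanity check of the 16.16 version, here is a self-contained sketch with conversion helpers. The `fx16_` names and the helpers are hypothetical, and `int32_t`/`int64_t` from `<cstdint>` stand in for the `int32`/`int64` typedefs used above. It assumes `>>` on a negative value is an arithmetic shift, so products are rounded toward negative infinity rather than toward zero.

```cpp
#include <cstdint>

// Hypothetical helpers: convert between float and 16.16 (scale by 2^16).
inline int32_t fx16_from_float(float v) {
    return static_cast<int32_t>(v * 65536.0f);
}

inline float fx16_to_float(int32_t v) {
    return static_cast<float>(v) / 65536.0f;
}

// Multiply two 16.16 numbers through a 64-bit intermediate.
inline int32_t fx16_mult(int32_t a, int32_t b) {
    int64_t wide = static_cast<int64_t>(a) * static_cast<int64_t>(b); // 32.32
    return static_cast<int32_t>(wide >> 16);                          // back to 16.16
}
```

For instance, 1.5 is stored as 98304 and 2.25 as 147456; their 64-bit product shifted right by 16 is 221184, which is 3.375 in 16.16.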
(By the way, this might be a bit tangential, but can anyone tell me what the proper way to extract high and low words during a fixed-point multiply is? My implementation is subtly wrong or something and it's been giving me aliasing in some of my audio DSPs.)