This tutorial is meant to be a thorough and comprehensive introduction to fixed point arithmetic. It should give you a solid basis for an exact understanding of how it works and where its limitations lie. The tutorial is divided into 5 parts:

- Motivation
- How it works
- Efficient unsigned implementation
- Extending to signed implementation
- Limitations and hints

**MOTIVATION:**

A great number of bugs, inconsistencies and glitches in game mechanics/physics results from naive use of floats or from their intrinsic shortcomings. Perfectly consistent, reproducible game states are often required when you aim for perfection in game mechanics. For example, whenever you repeat an exactly reproducible sequence of actions in an action-puzzler, you expect the result to be identical and independent of possible shifts in the positional coordinates of the experiment. This is where integers or fixed point are the right tools to use.

**But we don't have to use fixed point exclusively to keep the desired consistency.** We can store positional values, perform positional translations and "simple" single multiplications/divisions in fixed point, and use floats for multiplication/division intensive tasks such as trigonometric functions. That way we get the best of both worlds and don't have to rewrite complicated mathematical functions for fixed point.

**Fixed point vs float, why they are not similar:**

Fixed point is an integer; we just interpret a fractional part with a fixed amount of bits into it.

**pros of fixed point:**

- consistent computation results thanks to the equidistant number distribution

- identical number representation across varying machines

**cons of fixed point:**

- significantly limited domain (watch out for multiplications)

- bad relative precision for smaller numbers in the fractional part

**In a float type the binary point is not fixed, it can "float".**

**pros of float:**

- great range

- great absolute precision for smaller numbers in the fractional part: floats keep a good relative precision across all magnitudes, which makes them well suited for scientific computations that are sensitive to relative errors

**cons of float:**

- bad absolute precision in higher range (increasingly worse for representing positional values)

- inconsistent accuracy since the number distribution is not equidistant

**HOW IT WORKS:**

Let us first become aware of the structure of our native number system. We interpret the number 349.5 the following way:

**349.5 = 3*10^2 + 4*10^1 + 9*10^0 + 5*10^-1**

We call 10 our base and see that the powers of the base determine the positional values. We can also see that each digit x must be in the domain **0 <= x < base**.

**An important property we will use later on is that a multiplication or division by the base corresponds to a shift of the decimal point.** One way to see why is by simply observing the arithmetical process:

Let's multiply the number 3.5 by 10:

**10*(3*10^0 + 5*10^-1) = 3*10^1 + 5*10^0 = 35**

We see that, since the distributive property holds, every power of the base gets raised by one, so the decimal point moves one place to the right. Accordingly, each power decreases by one when we divide by 10, so the decimal point moves one place to the left.

**As already mentioned, fixed point is an integer; we just interpret a fractional part into it.** Now we are ready to conceptualize fixed point using integers. We will use the term **resolution** for the amount of equidistant points per unit; it is the number of points reserved for the fractional part. In order to introduce all that properly it is key to understand the following distinction:

**- x, y, z are fixed point numbers**

**- x/resolution, y/resolution, z/resolution are the actual values of those numbers**

For example, if x = 200 and the resolution is 100, then its actual value is x/resolution = 200/100 = 2. In the following we see what the basic arithmetic looks like.

**So, for example, if we want to know the result of an addition in fixed point format, we have to** resolve the idea "x-value" + "y-value" = "z-value" to z, since z is the result in fixed point format. The same idea applies to the other operations:

**addition/subtraction:**

**x/resolution +- y/resolution = z/resolution -> x +- y = z**

It means we just add the two fixed point numbers like two integers. That's it.

**multiplication:**

**x/resolution * y/resolution = z/resolution -> x*y / resolution = z**

It corresponds to a regular integer multiplication whose result is then divided by the resolution.

**division:**

**(x/resolution) / (y/resolution) = z/resolution -> x/y * resolution = z**

It corresponds to a regular integer division whose result is then multiplied by the resolution.

Let us sum up the results for the basic arithmetic in fixed point:

**x + y = z**

**x - y = z**

**x*y / resolution = z**

**x/y * resolution = z**

**Note: The arithmetic we derived is the mathematical idea. We will see that in the actual machine computation the order of operations does matter and should not be ignored.**

**EFFICIENT UNSIGNED IMPLEMENTATION:**

To keep things readable we will encapsulate the arithmetic in functions here. We will assume regular 32 bit integers in this example.

Remember it was shown that multiplying or dividing a number by its base shifts the radix point. In a digital machine numbers are represented in base 2, and unsigned integers are stored according to this positional representation. So multiplications or divisions by powers of 2 can be achieved by bit shifting. A bit shift is a single, very fast machine operation, so we will choose the resolution to be a power of two to avoid unnecessary and costly divisions.

The bit shift operators are applied to integers, are usually written "<<" and ">>" in various languages, and mean the following for unsigned integers:

**(x << shifts) equals x * (2^shifts)**

**(x >> shifts) equals x / (2^shifts)**

So this is how we can implement our functions now:

    const uint resolution = 256; //example resolution
    const int resolutionShift = 8; //2^8 = 256
    const float inversedResolution = (float)(1.0 / (double)resolution);
    const uint roundBit = 1u << (resolutionShift - 1);

    public static uint fixedPointMultiply(uint x, uint y)
    {
        return (x * y + roundBit) >> resolutionShift;
    }

    public static uint fixedPointDivide(uint numerator, uint denominator)
    {
        uint roundFraction = denominator >> 1;
        return ((numerator << resolutionShift) + roundFraction) / denominator;
    }

    public static uint toFixedPoint(uint full, uint numerator, uint denominator)
    {
        return (full << resolutionShift) + fixedPointDivide(numerator, denominator);
    }

    public static uint fixedPointToInt(uint x) // always rounding towards negative infinity here!
    {
        return x >> resolutionShift;
    }

Notice that addition/subtraction in fixed point is identical to integer addition/subtraction, which is why we don't need functions for those. For a faster conversion to float (which your framework may require, e.g. for a draw position on the screen) the division is already precomputed in the float constant inversedResolution. toFixedPoint() is just a convenience function for converting a full integer value plus an additional fraction to fixed point.

Let's first point out the domain on which multiplication and division work properly. For multiplication we define the independent domain as the domain from which x and y can be chosen independently of each other:

**independent domain of multiplication:**

x and y must be within [0, resolution * 2^(16 - resolutionShift) - 1 = 2^16 - 1] (so that x*y fits into 32 bits)

**domain of division:**

numerator must be within [0, resolution * 2^(32 - 2*resolutionShift) - 1 = 2^24 - 1] (since the numerator is shifted left by resolutionShift before the division)

The **roundBit** constant adds a one right behind the least significant bit that will survive the shift. That way the least significant bit can be rounded up before the bits behind it are shifted away.

If you take a look at fixedPointDivide(uint numerator, uint denominator) and ignore the rounding, you see that the implementation of **(x/y * resolution)** is **(x << resolutionShift) / y**, which means the order of operations is changed: first x is multiplied by the resolution and only then divided by y.

**This multiplication by the resolution can be considered a "scope expansion". Only then is the division applied, to minimize quantization errors.** Imagine your resolution is 100 and you want to divide the value 2 by 4, corresponding to x = 200, y = 400 in fixed point. Without the scope expansion of x this is what would happen in fixed point: (200/400)*100, which is zero since the integer division 200/400 truncates to 0. Now doing it with scope expansion of x: (200*100)/400 = 50. And at this resolution 50 corresponds to 0.5 when converted to float, just as you expect.

**EXTENDING TO SIGNED IMPLEMENTATION:**

If it weren't for proper rounding, we could use the same implementation for signed and unsigned integers. However, it is worth knowing how signed numbers are represented in order to understand exactly how we can work with them.

**In a signed integer the most significant bit (msb) determines whether the number is positive or negative:**

**- If the msb is 0, the remaining bits represent the number just like an unsigned integer.**

**- If the msb is 1, the remaining bits can be read as an unsigned integer as well, but with an additional offset of -2^31 (for 32 bits).**

For example, let's assume we have only 1 byte to represent numbers; then the following is true:

- 00000000 represents 0

- 00000001 represents 1

- 10000000 represents -128

- 10000001 represents -128 + 1 = -127

- 11111111 represents -128 + 127 = -1

This representation of negative numbers is called **two's complement**.

Now we are about to see that the operator ">>" has to work in a special way to perform divisions by a power of 2 on negative numbers. As an example we can start with -128, consecutively divide it by two and list some results:

- 10000000 represents -128

- 11000000 represents -64

- 11100000 represents -32

- 11110000 represents -16

- 11111000 represents -8

We see that we have to **"extend the sign"** with every right shift instead of doing logical shifts. And this is actually what happens when you apply the right shift operator to signed integers in a language that supports it (which should be the regular case): it turns into an **arithmetic shift**. A left shift, on the other hand, works the same for signed and unsigned integers.

**One important consequence of the arithmetic right shift is that it rounds negative numbers towards negative infinity, while integer division always rounds towards zero.** So a right shift on negative numbers is not identical to a division by a power of two. Example:

-3 / 2 = -1

-3 >> 1 = -2

Let's take a look at the signed implementation now:

    const int bitSizeMinusOne = 31;
    const int resolution = 256; //example resolution
    const int resolutionShift = 8; //2^8 = 256
    const int resolutionShiftPlusOne = resolutionShift + 1;
    const float inversedResolution = (float)(1.0 / (double)resolution);
    const int roundBit = 1 << (resolutionShift - 1);
    const int bit1 = 1;

    public static int fixedPointMultiply(int x, int y)
    {
        return ((x * y) + roundBit) >> resolutionShift;
    }

    public static int fixedPointDivide(int numerator, int denominator)
    {
        int doubleResult = (numerator << resolutionShiftPlusOne) / denominator;
        int sign = (doubleResult >> bitSizeMinusOne) | bit1;
        int roundAdd = sign * (doubleResult & bit1);
        return (doubleResult + roundAdd) >> 1;
    }

    public static int toFixedPoint(int full, int numerator, int denominator)
    {
        return (full << resolutionShift) + fixedPointDivide(numerator, denominator);
    }

    public static int fixedPointToInt(int x) // always rounding towards negative infinity here!
    {
        return x >> resolutionShift;
    }

**independent domain of multiplication:**

x and y must be within [-2^15.5, 2^15.5 - 1] (so that x*y fits into 31 bits plus sign)

**domain of division:**

numerator must be within [-resolution * 2^(31 - 2*resolutionShift - 1), resolution * 2^(31 - 2*resolutionShift - 1) - 1] (since the numerator is shifted left by resolutionShift + 1 before the division)

We see that only the division requires a change, because we have to make a distinction between when to add and when to subtract the roundFraction.

**So the unsigned implementation would not always round correctly for signed integers.** But we don't want to use an if-statement, since branching slows down the flow of execution. The idea here is to double the numerator and then divide: if no rounding up is to happen, doubleResult is 2 times the result, which is an even number. However, when rounding up towards positive/negative infinity happens, we get 2 times the result plus/minus 1, which is an odd number. An odd number always has the lowest bit set, and that is what we are going to exploit:

int sign = (doubleResult >> bitSizeMinusOne) | bit1;

First we extract the sign (1 or -1; remember the representation of -1). The "|" operator performs a bitwise "or". It ensures the lowest bit is always set, so we get 1 instead of 0 for a non-negative number.

int roundAdd = sign * (doubleResult & bit1);

The "&" operator performs a bitwise "and". We use it to extract the lowest bit from doubleResult. Then we multiply it by the sign to obtain the correct direction of rounding.

return (doubleResult + roundAdd) >> 1;

Since doubleResult + roundAdd is always an even number, the ">>" operation performs an exact division by 2.

We could also perform the rounding purely with bit manipulations, without the multiplication; it would only take a few more instructions. However, multiplications on modern hardware are so fast that such an alternative implementation is unlikely to increase execution speed. Note, on the other hand, that divisions are still slow, because division is a complex operation by nature.

**LIMITATIONS AND HINTS:**

**One can easily overestimate the domain on which multiplication and division work properly. Both operations need headroom for the resolution, as we can see directly in the implementation.** But the multiplication suffers the most from this demand, since it restricts both operands, while the division only restricts the numerator. For example, if we spend as many bits on the fraction as on the full part, here 16, we can no longer multiply even small full values without overflow (nor divide with a full numerator in this configuration). Just multiply the full value 1 by 1 and you get a bit overflow, since full values in fixed point are now at least 2^16 in magnitude. There are 2 ways to approach this problem:

**Hint 1**:

We can expand the domain on which native multiplications work properly by chopping the multiplication down into several sub-parts. **Note however that this is only a good idea when hint 2 is out of the question, i.e. when you are already working with the largest integer type and still need a larger domain.** The idea is to chop the numbers up and multiply them as follows:

(xFull + xFraction) * (yFull + yFraction) =
xFull*yFull + xFull*yFraction + xFraction*yFull + xFraction*yFraction

We only have to be careful to properly take the fixed point arithmetic into account.

    public static int fixedPointExpansiveMultiply(int x, int y)
    {
        const int fractionBitMask = resolution - 1;
        int xFraction = x & fractionBitMask;
        int yFraction = y & fractionBitMask;
        int xFull = x >> resolutionShift;
        int yFull = y >> resolutionShift;
        int result = (xFull * yFull) << resolutionShift;
        result += (xFull * yFraction);
        result += (yFull * xFraction);
        result += ((xFraction * yFraction) + roundBit) >> resolutionShift;
        return result;
    }

**And here is the actual trick:**

    int xFull = x >> resolutionShift;
    int yFull = y >> resolutionShift;
    int result = (xFull * yFull) << resolutionShift;
    result += (xFull * yFraction);
    result += (yFull * xFraction);

Since we know xFull and yFull contain only the full parts, we can shrink them by resolutionShift bits without information loss before the multiplication is applied. That way we have expanded the independent domain to

**[-resolution * 2^((32 - resolutionShift - 1) / 2), resolution * 2^((32 - resolutionShift - 1) / 2) - 1].** In case we are using 16 bits for the fraction, we have now expanded the independent domain for 32 bit integer multiplication from [-2^15.5, 2^15.5 - 1] to [-2^23.5, 2^23.5 - 1], and that is as far as it can possibly go.

**Hint 2:**

If the hardware natively supports it, you should simply use more bits. For example you can use 64 bit integers with a 16 bit fraction instead of trying to find a slower workaround for 32 bits as in Hint 1. You might find code pieces like this for the multiplication on the internet:

    // precomputed value:
    #define K (1 << (Q-1))

    signed int a, b, result;
    signed long int temp;

    temp = (long int)a * (long int)b; // result type is operand's type
    // Rounding; mid values are rounded up
    temp += K;
    // Correct by dividing by base
    result = temp >> Q;

**This line of thought is not always a good idea:**

    temp = (long int)a * (long int)b;

Now when you think about **Hint 2** you see a possible fallacy: instead of casting to a bigger type for the intermediate result, you could use the bigger type in the first place. Casting like this adds complexity when the larger operand type is already natively supported by the hardware. So you should question when it makes sense to work around the native word size like that.

**Possible solution for varying precision demands:**

We can use several fixed point formats for varying precision demands, but remember that we don't have to sacrifice the use of floats for that purpose.