Author Topic: Fixed Point Arithmetic - A Comprehensive Introduction  (Read 12953 times)
J-Snake
« on: September 08, 2013, 05:28:52 AM »

This tutorial is meant to be a profound and comprehensive introduction to fixed point arithmetic. It should give you a solid basis for an exact understanding of how it works and where its limitations lie. The tutorial is divided into 5 parts:

  • Motivation
  • How it works
  • Efficient unsigned implementation
  • Extending to signed implementation
  • Limitations and hints

MOTIVATION:

A great number of bugs, inconsistencies and glitches in game mechanics and physics are the result of naive use of floats or of their intrinsic shortcomings. Perfectly consistent game states and their reproducibility are often required when you aim for perfection in game mechanics. For example, whenever you repeat an exactly reproducible sequence of actions in an action-puzzler, you expect the result to be identical, independent of possible positional coordinate shifts of the experiment. This is where integers or fixed point are the right tools to use.

But we don't have to use fixed point exclusively to keep the desired consistency. We can store positional values, perform positional translations and "simple" single multiplications/divisions in fixed point, and use floats for multiplication/division-intensive tasks such as trigonometric functions. That way we get the best of both worlds and don't have to rewrite complicated mathematical functions for fixed point.


Fixed point vs float, why they are not similar:


Fixed point is an integer; we merely interpret a fixed number of its bits as a fractional part.

pros of fixed point:

 - consistent computation results thanks to equidistant number distribution
 - identical number representation across varying machines


cons of fixed point:

 - significantly limited domain (watch out for multiplications)
 - bad relative precision for smaller numbers in its fractional part

  



In a float type the binary point is not fixed, it can "float".

pros of float:

 - great range
 - great absolute precision for smaller numbers in its fractional part:

    this is because floats keep good relative precision for all numbers, which makes them
    well suited for scientific computations that are sensitive to relative errors


cons of float:

 - bad absolute precision in the higher range (increasingly worse for representing positional values)
 - inconsistent accuracy since the number distribution is not equidistant





HOW IT WORKS:

Let us first become aware of the structure of our native number system. We interpret the number
349.5 in the following way:

349.5 = 3*10^2 + 4*10^1 + 9*10^0 + 5*10^-1  

We call 10 our base and see that the powers of the base determine the positional values. We can also see that each digit x must be in the domain 0 <= x < base. An important property, which we will use later on, is that a multiplication or division by the base corresponds to a shift of the decimal point. One way to see why is simply to observe the arithmetical process:

Let's multiply the number 3.5 by 10: 10*(3*10^0 + 5*10^-1) = 3*10^1 + 5*10^0 = 35

We see that, since the distributive property holds, every power gets "elevated" by one, so the decimal point moves one place to the right. Accordingly, each power decreases by one when a division by 10 is applied, so the decimal point moves one place to the left.

As already mentioned, fixed point is an integer; we just interpret a fractional part into it. Now we are ready to conceptualize fixed point using integers. We will use the term resolution for the number of equidistant points per unit; it is the number of points reserved for the fractional part. In order to introduce all of this properly it is key to understand the following distinction:

- x, y, z are fixed point numbers
- x/resolution, y/resolution, z/resolution are the actual values of those numbers

For example, if x = 200 and the resolution is 100, then its actual value is x/resolution = 200/100 = 2. In the following we see what the basic arithmetic looks like.

So, for example, if we want to know the result of an addition in fixed point format, we have to
resolve the idea "x-value" + "y-value" = "z-value" for z, since z is the result in fixed point format. The same idea applies to the other operations:


addition/subtraction:
x/resolution +- y/resolution = z/resolution -> x +- y = z
It means we add/subtract the two fixed point numbers just like two integers. That's it.

multiplication:
x/resolution * y/resolution = z/resolution -> x*y / resolution = z
It means it corresponds to a regular integer multiplication whose result is then divided by the resolution.

division:
(x/resolution) / (y/resolution) = z/resolution -> x/y * resolution = z
It means it corresponds to a regular integer division whose result is then multiplied by the resolution.

Let us sum up the results for the basic arithmetic in fixed point:

  • x+y = z
  • x-y = z
  • x*y / resolution = z
  • x/y * resolution = z

Note: The arithmetic we derived is the mathematical idea. We will see that in the actual machine computation the order of operations does matter and should not be ignored.
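Expressed as a minimal sketch (illustrative C#-style code, assuming a decimal resolution of 100):

Code:
const int resolution = 100;       //100 equidistant points per unit

int x = 250;                      //represents the value 250/100 = 2.5
int y = 50;                       //represents the value  50/100 = 0.5

int sum        = x + y;               //300 -> 3.0
int difference = x - y;               //200 -> 2.0
int product    = x * y / resolution;  //12500/100 = 125 -> 1.25
int quotient   = x * resolution / y;  //25000/50  = 500 -> 5.0 (expanded before dividing)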



EFFICIENT UNSIGNED IMPLEMENTATION:

For readability we will encapsulate the arithmetic in functions here. We will assume regular 32 bit integers in this example.

Remember, it was shown that multiplying or dividing a number by its base equals a shift of the decimal point. In a digital machine numbers are represented in base 2, and unsigned integers are stored according to this representation. So multiplications and divisions by powers of 2 can be achieved by bit-shifting. A bit shift is a single, very fast instruction, so we will choose the resolution to be a power of two to avoid unnecessary and costly divisions.

The bit shift operators are applied to integers, are usually written "<<" and ">>" in various languages, and mean the following for unsigned integers:

  • (x << shifts) equals x * (2^shifts)
  • (x >> shifts) equals x / (2^shifts)

So this is how we can implement our functions now:

Code:
const uint   resolution          = 256; //example resolution
const int    resolutionShift     = 8;   //2^8 = 256
const float  inversedResolution  = (float)(1.0 / (double)resolution);
const uint   roundBit            = 1u << (resolutionShift - 1);

public static uint fixedPointMultiply(uint x, uint y)
{
    return (x * y + roundBit) >> resolutionShift;
}

public static uint fixedPointDivide(uint numerator, uint denominator)
{
    uint roundFraction = denominator >> 1;

    return ((numerator << resolutionShift) + roundFraction) / denominator;
}

public static uint toFixedPoint(uint full, uint numerator, uint denominator)
{
    return (full << resolutionShift) + fixedPointDivide(numerator, denominator);
}

public static uint fixedPointToInt(uint x) // always rounding towards negative infinity here!
{
    return x >> resolutionShift;
}

Notice that addition/subtraction in fixed point is identical to integer addition/subtraction, which is why we don't need functions for those. For a faster conversion to float (which your framework may require, e.g. for a draw position on the screen) the division is already precomputed in the float constant inversedResolution. toFixedPoint() is just a convenient function for converting a full integer value plus an additional fraction to fixed point.
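For example, these functions might be exercised like this (a hypothetical sanity check, using the example resolution of 256):

Code:
uint a = toFixedPoint(2, 1, 2);      //2.5  -> 640
uint b = toFixedPoint(0, 1, 4);      //0.25 -> 64

uint p = fixedPointMultiply(a, b);   //2.5 * 0.25 = 0.625 -> 160
uint q = fixedPointDivide(a, b);     //2.5 / 0.25 = 10.0  -> 2560

float drawPosition = p * inversedResolution;   //0.625f, e.g. for the screen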

Let's first point out the domains on which multiplication and division work properly.
For multiplication we define the independent domain as the domain where x and y can be chosen
independently of each other:

independent domain of multiplication:
x and y must be within [0, resolution * 2^(16 - resolutionShift) - 1 = 2^16 - 1]


domain of division:
numerator must be within [0, resolution * 2^(32 - 2*resolutionShift - 1) - 1 = 2^23 - 1]


The roundBit constant adds a one to the bit position right below the future least significant bit. That way the least significant bit is rounded up (whenever the discarded fraction is at least one half) before the bits behind it are shifted away.

If you take a look at fixedPointDivide(uint numerator, uint denominator) and ignore the rounding, you see that the implementation of

(x/y * resolution) is ((x << resolutionShift) / y)

which means the order of operations is changed. First x is multiplied by the resolution and only then divided by y. This multiplication by the resolution can be considered a "scope expansion". Only then is the division applied, to minimize quantization errors. Imagine your resolution is 100 and you want to divide the value 2 by 4, corresponding to x = 200, y = 400 in fixed point. Without the scope expansion of x this is what would happen in fixed point: (200/400)*100. This will be zero, since 200/400 is not a full integer. Now doing it with scope expansion of x: (200*100)/400 = 50. And in this resolution 50 corresponds to 0.5 when converted to float, just as you expect.
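The same example in code (hypothetical values, with resolution = 100 so the decimal values are easy to follow):

Code:
uint x = 200, y = 400;                       //the values 2 and 4 at resolution 100

uint withoutExpansion = x / y * resolution;  //200/400 = 0, so the result is 0 (wrong)
uint withExpansion    = x * resolution / y;  //20000/400 = 50, which is 0.5 (correct)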





EXTENDING TO SIGNED IMPLEMENTATION:

If it weren't for proper rounding, we could use the same implementation for signed and unsigned integers. However, it is beneficial to know how signed numbers are represented in order to understand exactly how we can work with them.

In a signed integer the most significant bit (msb) determines whether the number is positive or negative:

- If msb is 0 then the rest of the bits are representing the number just like an unsigned integer.

- If msb is 1 then the rest of the bits can also be considered to represent an unsigned integer,
   but with an additional offset of -2^31.

  
   For example let's assume we have only 1 byte to represent numbers, then the following is true:

   - 00000000 represents    0
   - 00000001 represents    1
   - 10000000 represents -128
   - 10000001 represents -128 + 1   = -127
   - 11111111 represents -128 + 127 =   -1

  This representation of negative numbers is called Two's complement.

  Now we are about to see that the operator ">>" has to work in a special way to perform divisions
  by a power of 2 on negative numbers. As an example we can start with -128, consecutively divide
  it by two and list some results:
  
  - 10000000 represents -128
  - 11000000 represents  -64
  - 11100000 represents  -32
  - 11110000 represents  -16
  - 11111000 represents   -8

  We see that we have to "extend the sign" with every right shift instead of doing a
  logical shift. And this is actually what happens when you apply the right shift
  operator to signed integers in a language that supports it (which should be the regular
  case): it turns into an arithmetic shift. A left shift, on the other hand, works the
  same for signed and unsigned integers.
  
  One important consequence of the arithmetic right shift is that it rounds negative
  numbers towards negative infinity, while a division always rounds towards zero. So
  a right shift on negative numbers is not identical to a division.

  
  Example: -3  / 2 = -1
           -3 >> 1 = -2
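In code (a trivial illustration; the correction shown is only needed if you ever want true division semantics from a shift):

Code:
int byDivision = -3 / 2;         //-1, rounds towards zero
int byShift    = -3 >> 1;        //-2, rounds towards negative infinity

//adding (2^shifts - 1) to a negative value makes the shift round like a division:
int corrected  = (-3 + 1) >> 1;  //-1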
  

  Let's take a look at the signed implementation now:
  
  
Code:

const int    bitSizeMinusOne        = 31;
const int    resolution             = 256; //example resolution
const int    resolutionShift        = 8;   //2^8 = 256
const int    resolutionShiftPlusOne = resolutionShift + 1;
const float  inversedResolution     = (float)(1.0 / (double)resolution);
const int    roundBit               = 1 << (resolutionShift - 1);
const int    bit1                   = 1;


public static int fixedPointMultiply(int x, int y)
{
    return ((x * y) + roundBit) >> resolutionShift;
}

public static int fixedPointDivide(int numerator, int denominator)
{
    int doubleResult = (numerator << resolutionShiftPlusOne) / denominator;

    int sign         = (doubleResult >> bitSizeMinusOne) | bit1;
    int roundAdd     = sign * (doubleResult & bit1);

    return (doubleResult + roundAdd) >> 1;
}

public static int toFixedPoint(int full, int numerator, int denominator)
{
    return (full << resolutionShift) + fixedPointDivide(numerator, denominator);
}

public static int fixedPointToInt(int x) // always rounding towards negative infinity here!
{
    return x >> resolutionShift;
}

independent domain of multiplication:
x and y must be within [-2^15.5, 2^15.5 - 1]


domain of division:
numerator must be within [-resolution * 2^(31 - 2*resolutionShift - 1) + 1,
                           resolution * 2^(31 - 2*resolutionShift - 1) - 1] = [-2^22 + 1, 2^22 - 1]

 
We see that only the division requires a change, because we have to distinguish when to add and when to subtract the roundFraction. So the unsigned implementation would not always round correctly for signed integers. But we don't want to use an if-statement, since branching slows down the flow of execution. The idea here is to double the numerator and then divide. If no rounding up is to happen, doubleResult is 2 times the result, which is an even number. However, when rounding up towards positive/negative infinity happens, we get 2 times the result plus/minus 1, which is an odd number. An odd number always has the first bit set, and that is what we are going to exploit:

Code:
int sign         = (doubleResult >> bitSizeMinusOne) | bit1;
First we extract the sign (1/-1; remember the representation of -1). The "|" operator performs a logical "or" per bit. It ensures the first bit is always set, so that we get 1 instead of 0 for a positive number.

Code:
int roundAdd     = sign * (doubleResult & bit1);
The "&" operator does a logical "and" operation per bit. We use it to extract the first bit from the doubleresult. Then we multiply it by the sign to obtain the correct direction of adding.

Code:
return (doubleResult + roundAdd) >> 1;
Since doubleResult + roundAdd is always an even number, the ">>" operation performs a correct division by 2.
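Tracing the function with concrete numbers may help (hypothetical values, resolution = 256):

Code:
//-0.5 / 2.0 = -0.25: fixedPointDivide(-128, 512)
//doubleResult = (-128 << 9) / 512 = -128 (even) -> roundAdd = 0
//(-128 + 0) >> 1 = -64, which is -0.25 in fixed point

//-0.5 / 3.0 = -0.1666..: fixedPointDivide(-128, 768)
//doubleResult = -65536 / 768 = -85 (odd) -> sign = -1, roundAdd = -1
//(-85 - 1) >> 1 = -43, the correctly rounded result (exactly -42.67)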


We could also extract the rounding purely with bit manipulations, without using a multiplication. It would take just a few more instructions. However, multiplications on modern hardware are so fast that such an alternative implementation is unlikely to increase the speed of execution. Note, on the other hand, that divisions are still slow. That is because division is a complex operation by nature.
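For the curious, one possible multiplication-free variant looks like this (a sketch of the idea, not part of the tutorial's implementation; the function name is mine):

Code:
public static int fixedPointDivideNoMul(int numerator, int denominator)
{
    int doubleResult = (numerator << resolutionShiftPlusOne) / denominator;

    int signMask = doubleResult >> bitSizeMinusOne;               //0 or -1
    int roundAdd = ((doubleResult & bit1) ^ signMask) - signMask; //+bit1 or -bit1

    return (doubleResult + roundAdd) >> 1;
}

The xor-subtract pair conditionally negates the extracted bit, which replaces the multiplication by the sign.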






LIMITATIONS AND HINTS:

One can easily overestimate the domain on which multiplication and division work properly. Both operations need headroom for the resolution, as we can see directly in the implementation. The multiplication suffers the most from this demand, since it puts a restriction on both operands; the division restricts only the numerator. For example, if we give the fraction the same number of bits as the full part, here 16, we won't be able to do even unsigned multiplications with full parts anymore (division wouldn't work with a full numerator in this configuration either). Just multiply two full integers, let's say 1 * 1, and you get a bit overflow, since full parts in fixed point are now at least of size 2^16. There are 2 ways to approach this problem:

Hint 1:
We can expand the domain on which native multiplication works properly by chopping the multiplication down into several sub-parts. Note, however, that this is only a good idea when hint 2 is out of the question, i.e. when you are already working with the largest integer type and still need a larger domain. The idea is to chop the numbers up and multiply them as follows:

(xFull + xFraction) * (yFull + yFraction) =

xFull*yFull + xFull*yFraction + xFraction*yFull + xFraction * yFraction;

We only have to be careful to properly take the fixed point arithmetic into account.


      
Code:
public static int fixedPointExpansiveMultiply(int x, int y)
{
    //fractionBitMask = resolution - 1;
    int xFraction = x & fractionBitMask;
    int yFraction = y & fractionBitMask;
    int xFull     = x >> resolutionShift;
    int yFull     = y >> resolutionShift;

    int result = (xFull * yFull) << resolutionShift;

    result += (xFull * yFraction);
    result += (yFull * xFraction);
    result += ((xFraction * yFraction) + roundBit) >> resolutionShift;

    return result;
}
        

And here is the actual trick:
Code:
int xFull  = x >> resolutionShift;
int yFull  = y >> resolutionShift;
int result = (xFull * yFull) << resolutionShift;

Code:
result += (xFull * yFraction);
result += (yFull * xFraction);

Since we know xFull and yFull are the full parts, we can shrink them without information loss before the multiplication is applied. That way we have expanded the independent domain to

[-resolution * 2^((32 - resolutionShift - 1)/2), resolution * 2^((32 - resolutionShift - 1)/2) - 1].

In case we are using 16 bits for the fraction, we have now expanded the independent domain for 32 bit integer multiplication from [-2^15.5, 2^15.5 - 1] to [-2^23.5, 2^23.5 - 1], and that is as far as it can possibly go.
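A quick check of the expanded domain, using the example resolution of 256 from above (illustrative values):

Code:
int x = 1 << 19;   //the value 2^11 = 2048.0, inside the expanded domain of about ±2^19.5

int wrong = fixedPointMultiply(x, x);          //x*x = 2^38 silently overflows -> 0
int right = fixedPointExpansiveMultiply(x, x); //2^30, i.e. the value 2^22 = 4194304.0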


Hint 2:
If the hardware natively supports it then you should simply use more bits. For example, you can use 64 bit integers with a 16 bit fraction instead of resorting to a slower workaround for 32 bits as in Hint 1.
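A minimal sketch of that, assuming C# long with 16 fractional bits (the names are illustrative):

Code:
const int  resolutionShift64 = 16;
const long roundBit64        = 1L << (resolutionShift64 - 1);

public static long fixedPointMultiply64(long x, long y)
{
    return ((x * y) + roundBit64) >> resolutionShift64;
}

By the same argument as in the 32 bit case, the independent domain of this multiply is [-2^31.5, 2^31.5 - 1].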


You might find code pieces like this for the multiplication on the internet:

Code:
// precomputed value:
#define K   (1 << (Q-1))

signed int       a, b, result;
signed long int  temp;
temp = (long int) a * (long int) b; // result type is operand's type
// Rounding; mid values are rounded up
temp += K;
// Correct by dividing by base
result= temp >> Q;

This line of thought is not always a good idea:
Code:
temp = (long int) a * (long int) b;

Now when you think about Hint 2 you see a possible fallacy: instead of converting to a bigger type, you could simply use the bigger type in the first place. The cast adds convolution if the larger operand type is already natively supported by the hardware. So you should question when it makes sense to work around the native implementation like that.




Possible solution for varying precision demands:

We can use several fixed point formats for varying precision demands, but remember that we don't have to sacrifice the use of floats for that purpose.
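If you do mix several formats, converting between them is just another shift. A small sketch (two hypothetical formats with 8 and 16 fractional bits):

Code:
public static int from8To16(int x)   //gaining fractional bits: no information loss
{
    return x << 8;
}

public static int from16To8(int x)   //dropping fractional bits: round them away
{
    return (x + (1 << 7)) >> 8;
}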



J-Snake
« Reply #1 on: September 08, 2013, 06:32:49 AM »

The main conclusion we can draw is that arithmetic in fixed point needs considerable headroom for the resolution. That is why 32 bits are often not sufficient to perform the arithmetic. There are two ways to deal with this:

- We can store the data in a smaller type and convert it to a larger type for the arithmetic, or
  come up with more convoluted functions like fixedPointExpansiveMultiply() which increase the
  domain.

- Or we can do everything (storage and arithmetic) "uniformly" in a sufficiently large data type,
  if it exists.

The right choice depends on individual restrictions and demands. An overall safe bet is to use 64 bit integers uniformly on 64 bit hardware if 32 bit integers turn out to struggle too much.





Small exercise:
The following implementation of division is quoted from this Wikipedia page:
http://en.wikipedia.org/wiki/Q_%28number_format%29

Code:
signed int a,b,result;
signed long int temp;
temp = (long int)a << Q;
temp = temp+b/2;
result = temp/b;

This implementation is flawed. First find out what the actual flaw is. Then classify the operands for which it works correctly and those for which it does not.
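If you want to check your answer, a tiny harness like this may help (Q is assumed to be 8 here; the wrapper name is mine):

Code:
static int wikiDivide(int a, int b)  //the quoted implementation with Q = 8
{
    long temp = (long)a << 8;
    temp = temp + b / 2;
    return (int)(temp / b);
}

//compare e.g. wikiDivide(1, 2), wikiDivide(-1, 2) and wikiDivide(1, -2)
//against the exact results rounded to nearest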
Evan Balster
« Reply #2 on: September 08, 2013, 02:55:57 PM »

Code:
//Multiply a 32-bit number by a 16.16 fixed-point number.
static Sint32 Mult_16_16(Sint32 a, Sint32 b)
{
    Sint32 ah = (a>>16), al = (a&0xFFFF),
           bh = (b>>16), bl = (b&0xFFFF);
    return ((ah*bh)<<16) + (ah*bl)+(bh*al) + ((al*bl)>>16);
}

//Multiply a 32-bit number by an 8.24 fixed-point number.
static Sint32 Mult_8_24(Sint32 a, Sint32 b)
{
    Sint32 ah = (a>>16), al = (a&0xFFFF),
           bh = (b>>16), bl = (b&0xFFFF);
    return ((ah*bh)<<8) + (((ah*bl)+(bh*al)) >> 8) + ((al*bl)>>24);
}

The above procedures don't quite round correctly, but they can be used to multiply any 32-bit integer or fixed-point number by a fixed-point number of the denoted precision, yielding a number in the first operand's format.  Unlike the multiply described above (EDIT: didn't notice the expansive multiply; I'll chalk it up to my flu.) these procedures are not vulnerable to overflow and will function for any two values.

Due to the implementation-defined behaviour of right-shifting negative values, they may not be portable to all platforms.  (In practice this is rarely an issue.)


All that said, fixed-point isn't well-suited to representing numbers which are frequently subject to multiplication.  It represents a uniform additive space, whereas floating-point values distribute (somewhat) uniformly over logarithmic space.  Addition and subtraction are translations through additive space, while multiplication and division are translations through log-space.  Precision loss with floats is thus most often a result of addition (cancellation) and precision loss with fixed is most often a result of multiplication (expansion / aliasing).

My prescription is to use fixed-point for numbers whose magnitude is irrelevant; coordinates in time and space are a prime example, as there is never any sense in multiplying them and their zero-values are arbitrary.  Vectors, deltas and quantities such as mass or volume have meaningful zero-values, are very frequently multiplied, and are thus well-suited to floats.
J-Snake
« Reply #3 on: September 09, 2013, 05:45:27 PM »

Yeah, you are trying to implement the expansiveMultiply() function. The tutorial shows what it looks like with proper rounding. expansiveMultiply() can still overflow or compute wrong results for sufficiently large operand magnitudes. Its independent domain is shown in the tutorial.
Evan Balster
« Reply #4 on: September 09, 2013, 07:21:01 PM »

To the best of my understanding the functions I posted work for the complete 32-bit range.  The range of a multiply operation is the product of the domains of the operands; two 16-bit operands multiplied have a range of 32 bits.

EDIT:  Aha.  I see.  The reason your expansive multiply has a limited range is that it's performing the multiplication on 8 and 24 bit "pieces" of the number.  Multiplying the 24 bit components together requires 48 bits, causing overflow.  Splitting the value in half prevents overflow issues, because the range of the multiply never exceeds 32 bits.

EDIT:  And to be clear, I'm aware that the multiply can overflow if both numbers are very large.  But either number may be anywhere in the allowable range and there will be values of the other number that will yield an accurate multiplication.  Thus the domain is still the full 32 bits.

All that aside...  My main protest is against your suggestion that a 64-bit type should be used for fixed-point operations for reasons independent of the precision needed; it has precisely the same problem with range expansion as a 32-bit number, and the only "solution" is to ensure that the upper 32 bits of any stored value are always zeroes.
J-Snake
« Reply #5 on: September 10, 2013, 07:19:12 AM »

Quote
EDIT:  Aha.  I see.  The reason your expansive multiply has a limited range is that it's performing the multiplication on 8 and 24 bit "pieces" of the number.  Multiplying the 24 bit components together requires 48 bits, causing overflow.  Splitting the value in half prevents overflow issues, because the range of the multiply never exceeds 32 bits.
This is incorrect, as you can observe from your own code. The following multiplication on the full parts still needs the headroom for the resolution:

Code:
((ah*bh)<<16)

As mentioned, the working domain for expansiveMultiply() is shown in the tutorial, and it clearly shows the resolution-dependence. There is no magic that will let you work around that, since we cannot store more information than a given set of bits allows.


Quote
My main protest is against your suggestion that a 64-bit type should be used for fixed-point operations for reasons independent of the precision needed; it has precisely the same problem with range expansion as a 32-bit number, and the only "solution" is to ensure that the upper 32 bits of any stored value are always zeroes.
If the hardware can natively work on 64 bits then it is strongly recommended to use 64 bits if 32 bits are potentially sufficient for your demands but require an expansive multiply. That is the suggestion.
Evan Balster
« Reply #6 on: September 10, 2013, 10:23:25 AM »

Quote
Code:
((ah*bh)<<16)

As mentioned, the working domain for expansiveMultiply() is shown in the tutorial, and it clearly shows the resolution-dependence. There is no magic that will let you work around that, since we cannot store more information than a given set of bits allows.

This isn't a problem with expansiveMultiply or a limitation of the domain, though.  It's simply a number exceeding the well-understood limitations of its fixed-point format.  The domain of ah and bh is still the full set of 16-bit numbers; the overflow only arises if their product falls outside this set.  This is no more a problem than the fact that a multiply on two 32-bit integers might overflow.

Put differently: if either ah or bh is <= 1, the other may assume any 16-bit value and there can be no overflow.  Hence it's a range limitation and not a domain limitation.

That said, logic for throwing an overflow exception of some kind might be very helpful in some applications.  The code I posted was for a ring-modulation DSP, so rounding and overflow handling were deprioritized in favor of performance and simplicity.

Quote
If the hardware can natively work on 64 bits then it is strongly recommended to use 64 bits if 32 bits are potentially sufficient for your demands but require an expansive multiply. That is the suggestion.

Using temporary 64-bit values for the expansive multiply is entirely sensible; your phrasing suggested 32-bit as an insufficient storage format and that had me up in arms.


Anyway, I should probably quit making a mess of your thread.  Just being opinionated. :P
J-Snake
« Reply #7 on: September 10, 2013, 02:38:28 PM »

I might refine my formulation to make it clearer, but the overall trend of the suggestion is still to use 64 bits for likely demands in video games, especially now on modern hardware. We can often not ignore the demands of the division either. I am personally using 64 bits with 16 bits for the fraction for my upcoming engine.


Quote
The domain of ah and bh is still the full set of 16-bit numbers; the overflow only arises if their product falls outside this set.
Yes, that is all true, but it is only part of the story. After the full parts (ah * bh) are multiplied, the resolution shift moves the result towards the most significant bit, leaving only 8 bits for valid unsigned full parts here, not 16. That implies that the total result of expansiveMultiply() can be completely wrong when both operands are outside the independent domain. This domain describes for which operands (and not for parts within them) expansiveMultiply() will return proper computation results, given a fixed set of bits we are working with. That is how to see it.

Quote
The code I posted was for a ring-modulation DSP, so rounding and overflow handling were deprioritized in favor of performance and simplicity.
If you want to round, you only need to add the roundBit here, just as in the tutorial:

Code:
((al*bl + roundBit)>>16)

So rounding won't make any relevant impact on performance here, and the average precision will be almost doubled "for free".
Zaphos
« Reply #8 on: September 10, 2013, 03:26:48 PM »

Wasn't this same tutorial posted a while back?  Or was it slightly different?  I vaguely remember seeing one that Boris(?) seemed to have a number of objections to.

Quote
A great number of bugs, inconsistencies and glitches in game mechanics and physics are the result of naive use of floats
It's perhaps worth pointing out that sometimes you can mitigate inconsistencies pretty well just by re-centering and re-scaling positions before you do certain calculations, so that the objects of interest are near the origin with coordinates of average magnitude near 1.  For large worlds, one could also break the world into blocks and record object positions in local coordinates relative to the local block.
J-Snake
« Reply #9 on: September 10, 2013, 03:35:01 PM »

The previous tutorial has been left in an incomplete state for too long. Now it is complete.