web oriented universal language

Fast Flash9 Fixed Point Sine

Posted on 2007-11-11 by Nicolas Cannasse in General

Recently I've been starting playing a bit with Flash9.

A few months ago, after adapting the haXe compiler to target the AVM2 (which also runs ActionScript3), I wrote the hxASM library which enable you to write Flash9 assembler in haXe.

Yesterday, I started to micro-benchmark some of the most common operations in order to check what kind of thing is expensive and what kind of thing can be done fast.

I was using some Bench program written in haXe, then generating the corresponding AS3 code by using the haXe-to-AS3 Generator, and comparing the results of both outputs.

This way, I could find that some of the operations in haXe were a bit underoptimized, so now everything should be fixed on CVS and these improvements will be included in next 1.17 release.

I was also able to get some interesting conclusions :

  • array accesses are not that fast
  • static calls are very slow compared to method calls (like 10 times slower). That's a bit strange given that the VM should be able to resolve them at JIT'time and them optimize better
  • in particular, while floating point and integer arithmetics operations are very fast, calls to Math functions such as "sin" or "cos" are very slow

As you know, games and 3D engines use a lot of Math functions, so I was interested in writting an optimized Math.sin function. Using precomputed tables would not work because of array access time, so I was thinking to use some approximative float calculus (like Taylor series). After a bit of googling, I found this blog post which was exactly what I was looking for. This can be rewrote as the following code (without precomputed constants) :

function fastsin( x : Float ) {
   // clamp
   if( x < -PI ) x += 2*PI else if( x > PI ) x -= 2*PI;
   // approximate
   var sin = (x - x * abs(x) / PI) * 4 / PI;
   // adds correction
   var fix = sin + .225 * (sin * abs(sin) - sin);
   // we're done !
   return fix;
}

This code does not calculate exactly a sin, but gives a very very good approximate that can be perfectly used in games or 3D.

Clamping

First remark : the clamping of value in the [-PI,PI] range is a bit slow and inacurate in the case the value is 10*PI for example.

So how can we easily clamp a Float value between two bounds ?

Using a modulo does not work very well in that case since we want a mod(2PI). We need then some kind of overflow... just like Flash9 int type which is constraint into the range [-2^31,2^31].

Does that remind you something ? Let's do the maths :

var range = 2^31 / PI;
x = int(x * range) / range;

And here it is ! Superfast accurate clamp !

Fixed point

Another good thing to know is that integer calculus is a lot faster than floating point calculus, so making the rest of the fastsin operations using integer calculus would be a lot faster.

But we have floating point values, so how do we turn them into integer operations ? By using Fixed Point integers. For those that don't know about it, here how it works :

  • you convert a float to an int by multiplying it by K=2^15/PI, so you have a 0,000096 precision which should be enough for our fastsin.
  • Since our float is between -PI and PI, it will turn into an integer between [-2^15,2^15] which is good since it means we can multiply two of these integers without any overflow.
  • addition and substraction works the same, since (a*K)+(b*K) == (a+b)*K
  • for multiplication however, you have (a*K)*(b*K) = (a * b) * K * K so once multiplied you have to divise by K which is not very easy here, but there will be some trick.
  • once calculus is terminated, you can simply divide your integer by K to get the float result

Let's get things into practice. First, after clamping, we reduce the integer from [-2^31,2^31] to [-2^15,2^15] range :

var x = int(angle * 2^31 / PI) >> 16;
// with constant
var x = int(angle * 683565275.57643158978229477811035) >> 16;

Then we can perform several simplifications of the sin variable calculus :

// replace x by x/K
var sin = (x/K - x/K * abs(x/K) / PI) * (4 / PI);
// expand
var sin = (x - x * abs(x) / (PI * K)) * (4 / PI) / K;
// replacing K
var sin = (x - x * abs(x) / (PI * 2^15 / PI)) * (4 / PI) / K;
// reducing and multiplying by 2^30/PI
var sinB = (x - x * abs(x) / 2^15) * (4/PI) * (2^30/PI) / (2^15/PI);
// optimizing
var sinB = (x - ((x * abs(x)) >> 15)) * 41721;

Here, we then have calculated sinB = sin*2^30/PI using only integer calculus. Now that the first part is optimized, we can also continue with the second part :

// replace sin by sinB/(2^30/PI)
var fix = sinB/(2^30/PI) + .225 * 
            (sinB/(2^30/PI) * abs(sinB/(2^30/PI)) - sinB/(2^30/PI));
// expand
var fix = (.775*sinB + .225*sinB*abs(sinB) / (2^30/PI)) / (2^30/PI);
// factorize
var fix = (sinB + (.225/.775)*PI*sinB*abs(sinB)/2^30) * .775*PI/2^30
// factorize, calculate
var fix = (sinB + ~0.912*(sinB/2^15)*abs(sinB/2^15)) * .775* PI/2^30
// tmp variable
var sinC = sinB >> 15;
var fix = (sinB + ~0.912 * sinC * abs(sinC)) * .775*PI/2^30;
// one last trick for approximating 0.912 multiplication
var fix = (sinB + ((sinC * abs(sinC)) >> 9) * 467) * .775*PI/2^30;
// now calculate the last division real value
var fix = (sinB + ((sinC * abs(sinC)) >> 9) * 467)
return fix / 441009855.21060102566599663103894;

Let's explain a bit the 0.912 trick here. In fixed point, we want to do only a multiplication by an integer. So we have to find an good balance between precision and speed. So we start calculating all powers of 2 multiplied by 0.912, until we find that (2^9 * 0.912) ~= 467, so we can then substitute 0.912 by 467/2^9, and use a right shift instead of the division.

Ouch ! We're done ! Let's write everything down : this is the haXe version (you can't compile it with haXe 1.16 since __int__ was just added on CVS) :

function fastsin( x : Float ) {
   // clamp to [-2^15,2^15]
   var x = (untyped __int__(angle*683565275.576431589782294)) >> 16;
   // approximate
   var sinB = (x - ((x * (x<0?-x:x)) >> 15)) * 41721;
   // adds correction
   var sinC = sinB >> 15;
   var fix = sinB + (sinC * (sinC<0?-sinC:sinC) >> 9) * 467;
   // we're done
   return fix / 441009855.21060102566599663103894;
}

Doing some quick tests gives the following :

  • approximation works as good as the floating point version, with a difference of less than 0.00129 between the real sinus calculus and the fast one.
  • when called as a method, speed compared to Math.sin is 225% (more than two times faster), when inlined 320% (more than three times faster) !

Bad news and good news !

By doing this optimization, we could manage to remove entirely the cost of the static call. It means that putting fastsin inside a static function would go slower that real Math.sin. But the good news is that we can inline the code.

This is something that haXe will support soon. You will be able to mark fastsin method as inline and each time it's called, it will insert at compiletime in your code the 5 lines of the method body instead of the actual call, giving you the best speedup you can get.

The bad news is for ActionScript3 users. I don't know why, but even with typing everything to int, some of the integer operations are not correctly optimized by the MXMLC compiler. Hence, it actually performs quite bad...

Now, a very good news : if you're interested in getting the best performances from Flash9 VM, you will have to use haXe  :)

Nicolas Cannasse

Comments

hahaha, remarkable coincidence ;)
very nice addition! sined(z)
Posted by zjnue , 2007-11-12 00:47:38
I loved this, seriously!
Posted by IanLiu , 2007-11-12 01:05:13
schweet!.. nice article!

but.. ow, man.. static calls are slow?.. ugh.. i luv using statics.. esp for managers.. oh well..
Posted by guntur , 2007-11-12 01:16:19
One thing to remember is that even though static calls are slower - if you use the instance of the static rather than straight static calls/static functions the change is negligable. :)
Posted by Anonymous , 2007-11-12 03:13:46
Excellent info. I wonder what is up with the array and typing in as3, statics are also quick common in AS3 as it is forming (ala tweening engines etc).

It is killer what you have done here to look for these optimizations but I wonder if it has to do with type handling in the AVM2. joa ebert found some stuff with uint switching types in the machine and type switching when you are looking for speed can cause slowness.

http://drawk.wordpress.com/2007/06/06/as3-interesting-numeric-storage-behavior/
Posted by Ryan [draw.logic[ , 2007-11-12 03:53:37
quick question to poster #4: do you mean like a singleton?
Posted by guntur , 2007-11-12 04:33:06
To Ryan : yes, as André and Joa found out, it's very intersting to know that big integers are represented as floats (Number) as soon as they go out of the [-2^28,2^28] range. It means then that "floatint" could be further optimized by lowering precision by 3 bits.
Posted by Nicolas , 2007-11-12 14:09:24
Amazing !
Posted by Roberto , 2007-11-12 14:45:35
It was André and a colleague that found out about this behaviour :)

Ryan: I do not know which post you are refering to, but if it is the Number->4byte conversion you are talking about, there is some expensive Math going on, of course that is slow (but this can be done in a language like C very smoothly).

Just let me explain how the compiler handles types (not the AVM) and why I think that also the int vs. Number tests are falsified very often.

var i: int = 1;
var j: int = i * 2;

Compiles to

getlocal1
pushbyte 2
multiply
setlocal2

While this one:

var i: int = 1;//local 1
var k: int = 2;//local 3
var j: int = i * k;//local 2

compiles to:

getlocal1
getlocal3
multiplyInt
setlocal2

Now guess which one will be faster... That is something you have to keep in mind. By default all your values are type double (64bit) and this is pretty heavy. But you can help the compiler by typing your stuff correct (if you can).

The compiler is not really smart anyways. If you do somethig like var i: int = 1+1+1+1; it will compile to

pushbyte 1
pushbyte 1
pushbyte 1
pushbyte 1
add
add
add
convertToInt (not so sure if it would occur here)
setlocal1

while there is also a function addInt which would add two integers and not having to convert back and forth all the time (or.. figure out that it is a constant and just put a 4 there). So in the end I just want to note that ints _are_ faster as you can see in Nicolas example but you have to keep in mind what the compiler will produce which is unfortunately very often a mess.

So end of the story? Use haXe and hxASM :)
Posted by joa , 2007-11-13 13:57:49
Joa, you're not talking about haXe compiler here, right ;)
Posted by Nicolas , 2007-11-13 18:03:11
Ah, no ;)

But to be fair: I did not check what haXe would output in the same circumstances.
Posted by joa , 2007-11-13 20:17:03
Static function calls slow? Array access slow? Adobe isn't doing their job right!
Posted by Bent Rasmussen , 2007-12-28 17:03:35
>>Adobe isn't doing their job right!
Им многое стоило бы оптимизировать
They would cost a lot to optimize
Posted by Andrey , 2008-05-02 08:07:43

Post a comment

Name:
Email:
Url:
Security: Please enter 3919 here.
remember me
Comment:
 
 
Haxe Powered Rss flux Valid XHTML 1.0 Valid CSS