Wednesday, October 21, 2009

Cranking iPhone Performance to 11, Without Inline Assembly

I've been writing high performance code on microcontrollers and digital signal processors for embedded systems since the early 90s. I've made a career in realtime control software, the likes of which you'd see in military tracking systems or even in the computer running the engine of your car.

At GDC Austin 2009, Noel Llopis (@SnappyTouch) gave a compelling presentation entitled "Squeezing Every Drop Of Performance Out Of The iPhone". His presentation provides a wonderful overview of the performance concerns in developing applications for the iPhone. This is good stuff, close to my heart.

Noel also gave a presentation at 360iDev here in Denver entitled "Cranking Floating Point Performance To 11". The core of his presentation revolves around utilizing the vector floating point unit by moving more code into inline assembly.

Here is where our opinions differ. I was on that path too. In the early 90s I wrote about half of my code in assembly language. By the early 2000s it was probably 10%, and now it's zero. You see, over the last ten years or so things have changed. The machine instruction sets in modern processors are more powerful than ever. They're also getting more difficult to understand. This is especially the case when it comes to optimization and 'hidden' repercussions like pipeline stalls. On a pipelined processor, simply reordering your code can cause either a great performance boost or a great loss.

A few months back I got together with one of my fellow iPhone developers, former CTO of Tendril Networks. We were working on an audio project to do pitch detection and shifting (what some people call "autotuning"). The application required running two FFTs every 20 milliseconds. We were pretty much at the limit of the device's capabilities.

We spent the whole weekend working with the Xcode iPhone debugger, Shark, and Instruments and this was our strategy...

Empirical Compiler Optimization:
1. Measure the performance of the time critical section
2. Adjust the optimization settings of the compiler
3. Repeat

Empirical Code Style Optimization:
1. Measure the performance of the time critical section
2. Adjust the C/C++/Obj C code
3. Review results in the debugger's disassembler
3a. Look for fewer lines of assembly (which is almost always faster)
3b. Look for fewer library calls
4. Repeat
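
Step 1 in both loops above depends on a repeatable measurement. Here is a minimal sketch of a timing harness in plain C. The names critical_section and seconds_per_call are hypothetical stand-ins, not code from our project, and the portable clock() call is an assumption made to keep the example self-contained. On the device you would more likely read a finer-grained timer or simply rely on Shark and Instruments.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Volatile sink so the optimizer cannot discard the work being timed. */
static volatile float sink;

/* Hypothetical stand-in for the time critical section (e.g. one FFT pass). */
static void critical_section(void)
{
    float acc = 0.0f;
    for (int i = 0; i < 1024; ++i) {
        acc += (float)i * 0.5f;
    }
    sink = acc;
}

/* Returns the average CPU seconds per call of fn over `iterations` runs. */
static double seconds_per_call(void (*fn)(void), int iterations)
{
    assert(fn != NULL && iterations > 0);

    fn(); /* warm-up call so the caches are primed before timing starts */

    clock_t start = clock();
    for (int i = 0; i < iterations; ++i) {
        fn();
    }
    clock_t end = clock();

    return (double)(end - start) / CLOCKS_PER_SEC / iterations;
}
```

Call seconds_per_call(critical_section, 1000) before and after each compiler setting or source change, and compare the two numbers. Averaging over many iterations matters because a single call is usually shorter than the resolution of clock().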

Why we favor this approach:
1. Learning curve involves only cursory understanding of the VPU assembly
2. More time for developing usability and marketing your application
3. It's much, much, much less error prone than inline assembly
4. Not likely to inadvertently trigger performance problems with pipeline stalling
5. Performance will likely be BETTER than inline assembly in all but the rarest of circumstances

The primary compiler optimization setting tricks (optimizing for speed over size):
1. Disable Thumb mode
2. Set optimization level to "Fastest -O3"
3. Unroll Loops
4. Other C Flags = "-falign-loops=16"

The primary source optimization tricks:
1. Condense code into smaller chunks that fit inside the instruction cache
2. Use all floating point
3. Don't do indexed array lookups inside a bunch of floating point math (this will stall the VPU pipeline)
4. Don't use any division, instead always multiply with 1/x

I hope you're able to use some of this advice in your own applications - and crank it to eleven. :)

(Image courtesy Joseph Tey)

3 Comments:

Blogger Art said...

My own profiling indicates that fixed-point calculations are _substantially_ faster than the same float calculations on iPhone OS devices.

Furthermore, since the canonical Core Audio format is 8.24 fixed-point, I was under the impression that Apple felt the same way, at least when it comes to per-sample audio calculations.

On the other hand, I don't have much experience programming microcontrollers or dsp hardware at all: Is there a compelling performance reason to use floating-point?

October 21, 2009 at 2:15 PM  
Blogger JMathews said...

There is an older blog entry about floating point vs fixed point. All of my work prior to iPhone was done on processors without a VPU, so I understand the desire for fixed-point. But on iPhone you have the VPU, so you might as well use it.

It depends on how much math you're doing. If there is a good deal of it, and all the floating point math is grouped together, it will certainly be faster.

As for Core Audio being 8.24 that's true, at least in the most common configuration. But it's not very useful if all your audio processing math is floating point in between the record and play callbacks :)

October 21, 2009 at 2:47 PM  
Blogger JMathews said...

http://www.steamboatmountaindesigns.com/blog/2009/04/math-on-iphone.html

October 21, 2009 at 3:12 PM  
