I've been writing high performance code on microcontrollers and digital signal processors for embedded systems since the early 90s. I've made a career in realtime control software, the likes of which you'd see in military tracking systems or even in the computer running the engine of your car.
At GDC Austin 2009, Noel Llopis (@SnappyTouch) gave a compelling presentation entitled "Squeezing Every Drop Of Performance Out Of The iPhone". His presentation provides a wonderful overview of the performance concerns in developing applications for the iPhone. This is good stuff, close to my heart.
Noel also gave a presentation at 360iDev here in Denver entitled "Cranking Floating Point Performance To 11". The core of his presentation revolves around utilizing the vector floating point unit by moving more code into inline assembly.
Here is where our opinions differ. I was on that path too. In the early 90s I wrote about half of my code in assembly language, but by the early 00s it was probably 10%, and now it's zero. You see, over the last ten years or so things have changed. The machine instruction sets in modern processors are more powerful than ever, but they're also getting more difficult to understand. This is especially the case when it comes to optimization and 'hidden' repercussions like pipeline stalls: on a pipelined processor, reordering your code can cause either a great performance boost or a great loss.
A few months back I got together with one of my fellow iPhone developers, the former CTO of Tendril Networks. We were working on an audio project to do pitch detection and shifting (what some people call "autotuning"). The application required running two FFTs every 20 milliseconds. We were pretty much at the limit of the device's capabilities.
We spent the whole weekend working with the Xcode iPhone debugger, Shark, and Instruments and this was our strategy...
Empirical Compiler Optimization:
1. Measure the performance of the time critical section
2. Adjust the optimization settings of the compiler
3. Repeat
Empirical Code Style Optimization:
1. Measure the performance of the time critical section
2. Adjust the C/C++/Obj C code
3. Review results in the debugger's disassembler
3a. Look for fewer lines of assembly (which is almost always faster)
3b. Look for fewer library calls
4. Repeat
Why we favor this approach:
1. Learning curve involves only cursory understanding of the VPU assembly
2. More time for developing usability and marketing your application
3. It's much, much, much less error prone than inline assembly
4. Not likely to inadvertently trigger performance problems with pipeline stalling
5. Performance will likely be BETTER than inline assembly in all but the rarest of circumstances
The primary compiler optimization setting tricks (optimizing for speed over size):
1. Disable Thumb mode
2. Set optimization level to "Fastest -O3"
3. Unroll Loops
4. Other C Flags = "-falign-loops=16"
The primary source optimization tricks:
1. Condense code into smaller chunks that fit inside the instruction cache
2. Use all floating point
3. Don't do indexed array lookups inside a bunch of floating point math (this will stall the VPU pipe)
4. Don't use any division; instead, multiply by the reciprocal (1/x)
I hope you're able to use some of this advice in your own applications - and crank it to eleven. :)
(Image courtesy Joseph Tey)