DavidS
Posts: 4334
Joined: Thu Dec 15, 2011 6:39 am
Location: USA

ARM Assembly Optimization

Sat Jan 30, 2016 3:46 am

I have noticed that there are many references on ARM assembly, though very few on optimizing for the ARM CPU. The nice thing about the ARM architecture is that if you optimize for the ARMv7 you have equally optimized for the ARMv3, ARMv4, ARMv5, and ARMv6. Thus the only limit on the minimal ARM version is the instructions used by your code.

Optimizing for the ARM is a lot simpler than for most other architectures. The basic rules are:
  1: Minimize the use of branching. That is, unroll loops and the like.
  2: Avoid using the destination register of the last operation as a source or destination register for the current instruction.
  3: Minimize memory access as much as possible; use registers for local variables as much as you can.
  4: If you must do heavy memory access, try to interleave at least two non-memory-accessing instructions between each memory-accessing instruction.
Other than making sure that the caches are enabled, if writing for bare metal, following the simple rules above will provide the best speed in 99% of cases on the ARM. These simple rules also allow for multi-issue of instructions on the newer ARM CPUs (ARMv5, ARMv6, and ARMv7), thus allowing for more than one operation per CPU clock cycle.
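As a rough C-level illustration of rules 2 and 3 (a sketch only, not ARM assembly; the function name `sum_pairs` is made up for this example), the loop below keeps its running totals in local variables, which a compiler will normally hold in registers, and uses two independent accumulators so that consecutive additions do not depend on the result of the one just before:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: sum an array of 32-bit values.
 * The totals live in locals (normally registers), and the two
 * accumulators are independent of each other, so neither addition
 * has to wait for the other's result. */
uint32_t sum_pairs(const uint32_t *data, size_t count)
{
    uint32_t even_total = 0;
    uint32_t odd_total = 0;
    size_t i = 0;

    /* Two elements per iteration: half as many loop branches. */
    for (; i + 1 < count; i += 2) {
        even_total += data[i];
        odd_total += data[i + 1];
    }

    /* Pick up the last element when count is odd. */
    if (i < count) {
        even_total += data[i];
    }

    return even_total + odd_total;
}
```

Whether this actually beats the plain loop depends on the core and the compiler, so it is worth timing both.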

So long as things are thought through well, it is possible to produce very tight code while following these simple rules. Optimizing for size while also optimizing for speed, though, is generally an exercise in refining the algorithm that is being implemented.
RPi = The best ARM based RISC OS computer around
More than 95% of posts made from RISC OS on RPi 1B/1B+ computers. Most of the rest from RISC OS on RPi 2B/3B/3B+ computers

DavidS

Re: ARM Assembly Optimization

Mon Feb 01, 2016 8:47 am

I should note that I did not mention algorithmic optimizations, which should always be done when coding for best efficiency. That means simple things like moving anything that can be moved outside of a loop, using the most efficient method of calculation, taking advantage of power-of-two operations for many things, avoiding division if at all possible, preferring integer operations, etc. This is far from an exhaustive list.

Algorithmic optimization makes easily as much of a difference as any other form of optimization.
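To make the power-of-two point concrete, here is a minimal C sketch (the helper names are invented for illustration). It assumes unsigned values and a divisor of exactly 2^shift with shift below 32; under those assumptions a divide becomes a shift and a remainder becomes a mask:

```c
#include <stdint.h>

/* Hypothetical helpers: valid only for unsigned values and a
 * divisor that is a power of two (2^shift, with shift < 32). */
static inline uint32_t div_pow2(uint32_t value, unsigned shift)
{
    return value >> shift;               /* same as value / (1u << shift) */
}

static inline uint32_t mod_pow2(uint32_t value, unsigned shift)
{
    return value & ((1u << shift) - 1u); /* same as value % (1u << shift) */
}

/* Advance an index around a ring buffer whose size is a power of two. */
static inline uint32_t ring_next(uint32_t index, uint32_t size_pow2)
{
    return (index + 1u) & (size_pow2 - 1u);
}
```

A compiler will usually make this substitution itself when the divisor is a compile-time constant; the manual form matters mostly when the size is only known at run time.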

Heater
Posts: 13271
Joined: Tue Jul 17, 2012 3:02 pm

Re: ARM Assembly Optimization

Tue Feb 02, 2016 1:50 pm

DavidS,
1: Minimize the use of branching. That is unroll loops, and the like.
Is this always true?

Admittedly I have never written any assembler for ARM but many years ago I was surprised to find that unrolling a loop on Intel (486 I think it was) slowed it down. Basically the small rolled up loop fitted in the cache nicely and was very fast. Unrolling the loop threw away the benefits of the cache. Presumably ARMs have caches as well.

There is another downside to unrolling loops if you are into performance on multi-core systems. If you are writing in C you might have a loop like:


for (i = 1; i < 1000000; i++)
{
     /* Do something with */ array[i];
}  
You can then use OpenMP to split that loop up into 2, 4, 8 or more separate threads and run each on a different processor for a nice performance boost.
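A minimal sketch of what that annotation can look like, assuming a compiler with OpenMP support (for example GCC with `-fopenmp`); the function `scale_array` and its parameters are made up for the example:

```c
/* Hypothetical example: scale every element of an array.
 * With OpenMP enabled the iterations are shared among threads;
 * without it the pragma is ignored and the loop runs serially,
 * producing the same result. */
void scale_array(float *array, long count, float factor)
{
    long i;

    #pragma omp parallel for
    for (i = 0; i < count; i++) {
        array[i] *= factor;   /* each iteration touches only its own element */
    }
}
```

This only works so simply because the iterations are independent of one another; a loop that carries a value from one iteration to the next needs more care.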

All in all it can be easier to optimize code in a language like C than it is in assembler, especially if you want it to run on multiple architectures.

Which is not to say there aren't extreme cases where a bit of hand-crafted assembler is required.

Mostly what I want to optimize is the time it takes me to develop a program. So assembler is not often on the cards.

I think we had that debate before though....

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 23631
Joined: Sat Jul 30, 2011 7:41 pm

Re: ARM Assembly Optimization

Tue Feb 02, 2016 1:54 pm

Modern CPUs have much bigger instruction caches than they used to, so you can afford to unroll loops in more cases. But it's a real balancing act. Try running something like cachegrind on code to see what the effects are.
Principal Software Engineer at Raspberry Pi (Trading) Ltd.
Contrary to popular belief, humorous signatures are allowed. Here's an example...
"My grief counseller just died, luckily, he was so good, I didn't care."

Heater

Re: ARM Assembly Optimization

Tue Feb 02, 2016 4:30 pm

Quite so.

The first piece of advice when considering optimizing anything is to measure its performance, and measure it again when you are done. Tricks that work in one place may not work in another.
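One simple way to take such a measurement in C is the standard `clock()` function. The sketch below is only a starting point: `time_cpu_seconds` and `sample_work` are hypothetical names, and CPU time from `clock()` is coarse, so the workload should be repeated enough times to run for a noticeable fraction of a second.

```c
#include <time.h>

/* Written to a volatile so the compiler cannot discard the work. */
static volatile unsigned long sink;

/* Hypothetical workload to be timed. */
void sample_work(void)
{
    for (unsigned long i = 0; i < 100000UL; i++) {
        sink += i;
    }
}

/* Run `work` `repeats` times and return the CPU time used,
 * in seconds, as reported by the standard clock(). */
double time_cpu_seconds(void (*work)(void), int repeats)
{
    clock_t start = clock();
    for (int i = 0; i < repeats; i++) {
        work();
    }
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_cpu_seconds(sample_work, 100)` before and after a change gives two numbers to compare; a profiler or cachegrind gives a more detailed picture.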

Worse, tricks that speed up one data set may slow down another. I was playing around writing my own FFT from scratch a couple of years back. One version was the standard-looking arrangement of 3 nested loops; another, much simpler version was recursive. The recursive one was slower, as expected... until the input sample size was pushed over 8 Mbytes or so, then it started to win. I was never quite sure why that was.

Second advice when considering optimizing anything is don't. Not unless you really have to. Else you end up like Mel
http://www.pbm.com/~lindahl/mel.html

Steve Drain
Posts: 105
Joined: Tue Oct 30, 2012 2:08 pm
Location: Exeter UK

Re: ARM Assembly Optimization

Tue Feb 02, 2016 4:37 pm

DavidS wrote:1: Minimize the use of branching.
You do not mention conditional execution. This used to be very valuable in reducing branching, but I have read that it is unlikely to be so on more modern processors with better prediction. I cannot get out of the habit, myself, though. ;-)

DavidS

Re: ARM Assembly Optimization

Thu Feb 04, 2016 4:58 am

Steve Drain wrote:
DavidS wrote:1: Minimize the use of branching.
You do not mention conditional execution. This used to be very valuable in reducing branching, but I have read that it is unlikely to be so on more modern processors with better prediction. I cannot get out of the habit, myself, though. ;-)
That is a big part of avoiding branching on the ARM. Modern ARMs are still prone to predicting a branch incorrectly, especially in cases where either outcome is equally likely.
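Conditional execution itself is an assembly-level feature, but the same idea can be sketched in C. The first function below uses simple selections that a compiler is free to turn into conditional instructions instead of branches (that choice is the compiler's, not guaranteed); the second is explicitly branch-free using a mask. Both function names are invented for this example.

```c
#include <stdint.h>

/* Clamp a value into [low, high]. Simple selections like these are
 * commonly compiled to a compare plus a conditional instruction
 * rather than a branch, though that is up to the compiler. */
int32_t clamp_i32(int32_t value, int32_t low, int32_t high)
{
    value = (value < low) ? low : value;
    value = (value > high) ? high : value;
    return value;
}

/* Branch-free selection using a mask:
 * returns a when condition is non-zero, b otherwise. */
uint32_t select_u32(int condition, uint32_t a, uint32_t b)
{
    /* mask is all ones when condition is true, all zeros otherwise */
    uint32_t mask = (uint32_t)0 - (uint32_t)(condition != 0);
    return (a & mask) | (b & ~mask);
}
```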

Now if they do ever implement a true and complete fan in pipeline, with a wide enough bus, then branching will no longer take a hit, and branch prediction will become useless.

DavidS

Re: ARM Assembly Optimization

Thu Feb 04, 2016 5:02 am

Heater wrote:DavidS,
1: Minimize the use of branching. That is unroll loops, and the like.
Is this always true?

Obviously there are limits, in cache as well as in available registers. You often need to multiply the number of locally used data items by the number of times you unroll the loop, and keeping everything of immediate importance in CPU registers will be a limiting factor.

Usually for small loops (like one for an array like you mention), you want to unroll it between 2 and 4 times, no more than that. If you try to unroll it all of the way, other optimizations will be lost.
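A C sketch of that kind of modest unrolling, four elements per pass with a cleanup loop for the remainder. The function `add_offset` is a made-up example, and the loads are grouped ahead of the stores only to illustrate the interleaving idea; a compiler may reorder them anyway.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: add a constant to every byte of a buffer,
 * with the loop body unrolled four times. */
void add_offset(uint8_t *buffer, size_t length, uint8_t offset)
{
    size_t i = 0;
    size_t unrolled_end = length - (length % 4);

    /* Main loop: one branch per four elements. Loads come first,
     * then the stores, so no value is needed the instant it is loaded. */
    for (; i < unrolled_end; i += 4) {
        uint8_t a = buffer[i];
        uint8_t b = buffer[i + 1];
        uint8_t c = buffer[i + 2];
        uint8_t d = buffer[i + 3];
        buffer[i]     = (uint8_t)(a + offset);
        buffer[i + 1] = (uint8_t)(b + offset);
        buffer[i + 2] = (uint8_t)(c + offset);
        buffer[i + 3] = (uint8_t)(d + offset);
    }

    /* Cleanup loop for the last 0 to 3 elements. */
    for (; i < length; i++) {
        buffer[i] = (uint8_t)(buffer[i] + offset);
    }
}
```

Note that four temporaries are needed here where the rolled loop needs one, which is the register pressure mentioned above.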

I do not know what "OpenMP " is.

Yes, there are cases where dividing a task among multiple cores can also be faster, though that is an algorithmic optimization, not a low-level one. I am mainly pointing out what can be done once the algorithm is optimized for the best possible speed. It is also one I have not yet been able to take advantage of on the RPi 2 (I have only had it for 3 days), as I have yet to figure out how to enable the other 3 cores and then set up mutexes (using non-cacheable memory and the 'SWP' instruction). Once I do figure it out, I will be playing with some ideas I have for RISC OS programs that could be done very well with multiple cores.
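For comparison only, a simple mutual-exclusion flag can be sketched in portable C11 with `atomic_flag` from `<stdatomic.h>`, leaving the choice of instruction to the compiler. This is not the SWP-based approach described above, it says nothing about starting the extra cores, and it assumes a toolchain that provides C11 atomics; the names `spin_lock_t`, `spin_try_lock`, `spin_lock` and `spin_unlock` are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical spin lock built on the C11 atomic_flag type. */
typedef struct {
    atomic_flag held;
} spin_lock_t;

#define SPIN_LOCK_INIT { ATOMIC_FLAG_INIT }

/* Try once to take the lock; returns true on success. */
static inline bool spin_try_lock(spin_lock_t *lock)
{
    /* test_and_set returns the previous value: false means it was free. */
    return !atomic_flag_test_and_set_explicit(&lock->held,
                                              memory_order_acquire);
}

/* Busy-wait until the lock is obtained. */
static inline void spin_lock(spin_lock_t *lock)
{
    while (!spin_try_lock(lock)) {
        /* spin until the current holder releases the lock */
    }
}

/* Release the lock so another core or thread can take it. */
static inline void spin_unlock(spin_lock_t *lock)
{
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
}
```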

Heater

Re: ARM Assembly Optimization

Thu Feb 04, 2016 11:17 am

DavidS,
Usually for small loops (like one for an array like you mention), you want to unroll it between 2 and 4 times, no more than that. If you try to unroll it all of the way, other optimizations will be lost.
Here is a story...

Years ago I was implementing an encryption algorithm for a military project. The guts of that algorithm was basically a loop over a 256-byte block, applying the same handful of logical operations to each byte, except that at three particular positions in the buffer it did some different or extra logical operation.

This project was written in PL/M, a high level C/Pascal/Algol like language. So I wrote the thing and compiled it. Then looked at the assembler it generated. Sure enough the logical operations were a few simple instructions, but the bulk of the body of the loop was the tests and jumps for those three special cases. Yuk.

So. I removed the loop. Copied and pasted the logical operations 256 times. Added the little differences at the correct locations in that sequence. Compiling it again I got my nice clean string of instructions doing real work, no time wasted on useless tests. It ran about ten times faster than the loop version!

Now, being hell bent on speed, I looked at the generated assembler and noticed the compiler had not been as optimal as I would like. So I copied its generated assembler and removed the inefficiencies in it.

The speed gain from optimizing assembler code? About 20%.

We put the PL/M version into production. The speed gain of moving to assembler was not worth it. Having everything in the same language, and something people could easily read, was not worth giving up for a marginal performance gain.

Since that time, I have been very loath to dick around squeezing marginal gains out of code by writing in assembler or fiddling about optimizing it.

Note: Unrolling the loop in my historical example made things much faster because I could remove redundant loop count tests and because the processor had no cache, it did not matter that the code ended tens of times bigger. On a modern CPU it may be better to keep a tight loop in the cache and keep the tests in there.
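The story above removed the per-iteration tests by pasting the body 256 times. A different route to a similar end, sketched here in C, is to keep a test-free loop for the common work and handle the special positions separately. Everything specific in this sketch is invented for illustration: the per-byte operations, the three positions, and the function names. It also assumes each byte is processed independently of the others.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 256

/* Hypothetical per-byte operations standing in for the real ones. */
static inline uint8_t common_step(uint8_t x)
{
    return (uint8_t)((x ^ 0x5Au) + 1u);
}

static inline uint8_t special_step(uint8_t x)
{
    return (uint8_t)~x;
}

/* Hypothetical positions that need the extra operation. */
static const size_t special_at[3] = { 17, 100, 200 };

/* Version 1: every iteration pays for three comparisons. */
void transform_tested(uint8_t block[BLOCK_SIZE])
{
    for (size_t i = 0; i < BLOCK_SIZE; i++) {
        block[i] = common_step(block[i]);
        if (i == special_at[0] || i == special_at[1] || i == special_at[2]) {
            block[i] = special_step(block[i]);
        }
    }
}

/* Version 2: the common loop carries no special-case tests; the
 * three special bytes are patched afterwards. This is only valid
 * because each byte is independent of its neighbours. */
void transform_split(uint8_t block[BLOCK_SIZE])
{
    for (size_t i = 0; i < BLOCK_SIZE; i++) {
        block[i] = common_step(block[i]);
    }
    for (size_t k = 0; k < 3; k++) {
        block[special_at[k]] = special_step(block[special_at[k]]);
    }
}
```

Both versions produce the same block; which one is faster on a given machine is, as the thread keeps saying, something to measure.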

Heater

Re: ARM Assembly Optimization

Thu Feb 04, 2016 11:31 am

DavidS,
I do not know what "OpenMP " is.
OpenMP is a magical way to get the compiler to analyse your loops in C and other languages and split them up into multiple smaller loops that it then runs on multiple cores. Only got one core? It runs the code as normal. Got 2 cores? It automatically spreads the work around. Got more.... It's a standard, so it can be used with many compilers and architectures.
Yes there are cases where also dividing a task among multiple cores can be faster. Though that is an algorithmic optimization, not low level optimization.
Except, one can take one's normal single-threaded algorithm, add a few OpenMP annotations around some loops within it and OpenMP will sort it out for you. No change to the original algorithm required. In that way it is not an "algorithmic optimization".

I guess if one spent less time concentrating on micro-optimizations and assembler one might learn how to make really fast code on modern hardware :)

DavidS

Re: ARM Assembly Optimization

Fri Feb 05, 2016 5:24 am

Heater wrote:DavidS,
I do not know what "OpenMP " is.
OpenMP is a magical way to get the compiler to analyse your loops in C and other languages and split them up into multiple smaller loops that it then runs on multiple cores. Only got one core? It runs the code as normal. Got 2 cores? It automatically spreads the work around. Got more.... It's a standard, so it can be used with many compilers and architectures.
Yes there are cases where also dividing a task among multiple cores can be faster. Though that is an algorithmic optimization, not low level optimization.
Except, one can take one's normal single-threaded algorithm, add a few OpenMP annotations around some loops within it and OpenMP will sort it out for you. No change to the original algorithm required. In that way it is not an "algorithmic optimization".
Only if one is using a toolchain that knows these things, and one is using a known multi-core threading lib that is supported by said toolchain. Too many assumptions for my taste, and as there is no standard for multi-core threading on some operating systems (e.g. RISC OS), even worse.
I guess if one spent less time concentrating on micro-optimizations and assembler one might learn how to make really fast code on modern hardware :)
We need not worry about optimizing everything. Though Assembly is the easiest ARM language, other than BBC BASIC V :) .

People write compilers so there is at least that need for knowledge of optimization.

C toolchains cannot even agree on how to implement inline assembly, let alone how to implement things like SWI calls (to call the OS or other modules). This makes C inefficient.

And while there are smallish syntax differences between assemblers, the instruction set is ARM.

I write as much in BASIC V as I do in Assembly, as speed is not always needed. Though where things need to be fast ARM assembly is the only sure way to get it correct.

I must agree that micro optimization is going overboard, though knowing the basics is still important.

DavidS

Re: ARM Assembly Optimization

Fri Feb 05, 2016 5:34 am

On modern systems you must be frugal with unrolling a loop. That is to say, however far it is unrolled, it should fit into a few cache lines; a larger loop will still loop, just with n iterations unrolled. It is also important to keep as much as possible in registers, minimizing memory access; unrolling a loop too far could sacrifice this ability.

The more important optimizations are generally in instruction interleaving, thus preventing many causes of pipeline stalls, and when done correctly taking better advantage of multiple issue of instructions (that is running more than one instruction per clock cycle).

And obviously do not overdo optimization; just get it capable of running 10 times faster than needed on the lowest-end hardware it will ever run on, and you should be good.

Heater

Re: ARM Assembly Optimization

Fri Feb 05, 2016 5:56 am

DavidS,
People write compilers so there is at least that need for knowledge of optimization.
Good point.

Only a few dozen people in the world need to understand machine code and how to optimize it, so that they can write the code generators for compilers. The rest of the world can get by using the compilers. :)

On the other hand, I agree, anyone who is serious about programming computers should learn about such things and spend some time programming in assembler for at least one machine. And I would suggest ARM or MIPS over x86 any day.

It's good to be aware of how the machine works under the hood. And good for the soul.

Nowadays the only assembler programming I do is for the Parallax Propeller 32-bit 8-core micro-controller: https://www.parallax.com/product/p8x32a-d40

The Propeller architecture is such that writing PASM is almost required to get anything done. Which is OK as it's the simplest most elegant instruction set and assembler I have ever come across. Almost as easy as writing in an HLL. And using assembler from their IDE is very well integrated and easy. It's fun.

On the other hand even that little machine can be programmed in C which can be auto-parallelized by OpenMP. Which I have done for a Fast Fourier Transform. There is no operating system under there supporting that mind you, it's bare metal programming.
