ejolson
Posts: 3563
Joined: Tue Mar 18, 2014 11:47 am

ARMv8 32-bit mode and 64-bit division

Mon Sep 16, 2019 4:59 pm

There was a long-running thread here, one that comes up easily in web searches, comparing the speed of the Raspberry Pi to some common PCs. In light of this post, that old thread appears to contain some incorrect assumptions about running an ARMv8 processor in 32-bit mode. In particular, it seems that specifying the CPU type with

-march=armv8-a+crc+simd

allows the gcc compiler to issue udiv and sdiv instructions even in 32-bit mode, which significantly improves performance for certain types of calculations.

Since the previous table

https://www.raspberrypi.org/forums/view ... 72#p848915

includes so many comparisons with various PCs, it seems like a good idea to update it with more carefully obtained results for the various models of Raspberry Pi. However, I am unable to do so because my post got deleted and the previous thread locked.

If I understand things correctly, my old table contained a methodological error related to the options used with the C compiler on the Raspberry Pi models 3B and 3B+ which resulted in those computers appearing significantly slower than they really are. Thus, I'll be updating this thread with a new table containing new details and runtimes as well as the Pi 4B in the next post.

ejolson
Posts: 3563
Joined: Tue Mar 18, 2014 11:47 am

Re: ARMv8 32-bit mode and 64-bit division

Mon Sep 16, 2019 5:41 pm

ejolson wrote:
Mon Sep 16, 2019 4:59 pm
I'll be updating this thread with a new table containing new details and runtimes as well as the Pi 4B in the next post.
I have run the following program

Code:

#include <stdio.h>
#include <math.h>
#include <stdint.h>

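/* INT is supplied on the command line, e.g. -D INT=uint32_t */
#define N 10000000

static INT prime[1000000];  /* primes found so far, in increasing order */
static int count=0;

/* Trial division of n by the stored primes up to sqrt(n). */
static int isprime(INT n){
    INT sqrtn=sqrt((double)n)+0.5;
    int k;
    for(k=0;k<count;k++){
        if(prime[k]>sqrtn) return 1;
        if(n%prime[k]==0) return 0;  /* the division being timed */
    }
    return 1;
}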

int main(){
    INT n;
    for(n=2;n<=N;n++){
        if(isprime(n)) prime[count++]=n;
    }
    printf("Found a total of %d primes (%d-bit)\n",
        count,(int)sizeof(n)*8);
    return 0;
}
compiled as

Code:

$ gcc -march=armv8-a+crc+simd -mtune=cortex-a72 -O3 -o prime32 -D INT=uint32_t prime.c -lm
$ gcc -march=armv8-a+crc+simd -mtune=cortex-a72 -O3 -o prime64 -D INT=uint64_t prime.c -lm
When run on Raspbian Buster with a Pi 4B the following results were obtained:

Code:

$ time ./prime32 ; # Raspbian Buster Pi 4B
Found a total of 664579 primes (32-bit)

real    0m1.892s
user    0m1.841s
sys     0m0.052s
$ time ./prime64 ; # Raspbian Buster Pi 4B
Found a total of 664579 primes (64-bit)

real    0m39.000s
user    0m38.969s
sys     0m0.030s
When run on 64-bit Gentoo with a Pi 3B+ the following results were obtained:

Code:

$ time ./prime32 ; # 64-bit Gentoo Pi 3B+
Found a total of 664579 primes (32-bit)

real    0m2.991s
user    0m2.974s
sys     0m0.016s
$ time ./prime64 ; # 64-bit Gentoo Pi 3B+
Found a total of 664579 primes (64-bit)

real    0m3.214s
user    0m3.200s
sys     0m0.014s
From these results, it would appear that the original table here is still basically correct, even with the improved compiler options. In particular, I don't see 64-bit division performing anywhere near the speed of 32-bit division when the Pi is running in 32-bit mode. Moreover, when performing 64-bit divisions, the Pi 3B+ running in 64-bit mode is more than 10 times faster than the Pi 4B running in 32-bit mode.

From this post I was expecting 64-bit divisions done in 32-bit mode on the Pi 4B to be on par with 64-bit mode. What am I doing wrong?
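For reference, the ratios implied by the "real" times above can be checked with a few lines of C. This is only a sketch: the function names are made up for illustration, and the numbers are copied from the runs quoted in this post.

```c
#include <stdio.h>

/* Ratio of two wall-clock times; a value above 1 means the first run took longer. */
double slowdown(double slow_seconds, double fast_seconds) {
    return slow_seconds / fast_seconds;
}

/* Print the ratios for the "real" times reported in this post. */
void print_ratios(void) {
    const double pi4_aarch32_u32 = 1.892;   /* Pi 4B, Raspbian Buster, prime32 */
    const double pi4_aarch32_u64 = 39.000;  /* Pi 4B, Raspbian Buster, prime64 */
    const double pi3_aarch64_u32 = 2.991;   /* Pi 3B+, 64-bit Gentoo, prime32 */
    const double pi3_aarch64_u64 = 3.214;   /* Pi 3B+, 64-bit Gentoo, prime64 */

    printf("Pi 4B (32-bit mode), 64-bit vs 32-bit INT: %.2fx\n",
           slowdown(pi4_aarch32_u64, pi4_aarch32_u32));
    printf("Pi 3B+ (64-bit mode), 64-bit vs 32-bit INT: %.2fx\n",
           slowdown(pi3_aarch64_u64, pi3_aarch64_u32));
    printf("prime64, Pi 4B (32-bit mode) vs Pi 3B+ (64-bit mode): %.2fx\n",
           slowdown(pi4_aarch32_u64, pi3_aarch64_u64));
}
```

Calling print_ratios() from main gives roughly 21x, 1.07x and 12x, which is where the "more than 10 times faster" figure comes from.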
Last edited by ejolson on Mon Sep 16, 2019 8:37 pm, edited 1 time in total.

jahboater
Posts: 4681
Joined: Wed Feb 04, 2015 6:38 pm

Re: ARMv8 32-bit mode and 64-bit division

Mon Sep 16, 2019 6:07 pm

ejolson wrote:
Mon Sep 16, 2019 5:41 pm
From this post I was expecting 64-bit divisions done in 32-bit mode on the Pi 4B to be on par with 64-bit mode. What am I doing wrong?
32-bit mode has had udiv and sdiv instructions since ARMv7 (they are optional on some ARMv7-A cores but always present on ARMv8), but they only work with 32-bit operands.
In 32-bit mode, 64-bit division is done by a call to a library function, which is very slow.

64 bit mode has the same udiv and sdiv instructions but they accept 64 bit operands as well, and no library routines are ever used.

Your results look entirely reasonable to me.

In general, unsigned division is faster than signed, and 32 bit division is faster than 64 bit, even when hardware instructions are used.

I don't think you are doing anything wrong!
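A minimal way to see the difference for yourself is to compile two tiny functions and look at the generated assembly. This is a sketch, not taken from prime.c: the function names are made up, and the instruction sequences named in the comments are what I would expect from gcc, so please check them on your own toolchain.

```c
#include <stdint.h>

/* 32-bit remainder. With hardware divide enabled, gcc for 32-bit ARM can
   typically emit a udiv followed by a multiply-subtract. */
uint32_t rem32(uint32_t n, uint32_t d) {
    return n % d;
}

/* 64-bit remainder. 32-bit mode has no 64-bit divide instruction, so gcc
   normally emits a call to the EABI helper __aeabi_uldivmod instead.
   In 64-bit mode the same source becomes a single 64-bit udiv plus a
   multiply-subtract. */
uint64_t rem64(uint64_t n, uint64_t d) {
    return n % d;
}
```

Compiling with something like gcc -O2 -S and searching the .s file for udiv and __aeabi_uldivmod should show which path each function takes.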

ejolson
Posts: 3563
Joined: Tue Mar 18, 2014 11:47 am

Re: ARMv8 32-bit mode and 64-bit division

Mon Sep 16, 2019 6:24 pm

jahboater wrote:
Mon Sep 16, 2019 6:07 pm
Your results look entirely reasonable to me.
In your post

https://www.raspberrypi.org/forums/view ... 0#p1292430

you obtain the runtime
jahboater wrote:
Mon Mar 26, 2018 7:35 am
pi3+ (Cortex A53 in 32 bit mode, gcc 7.3, 1.4GHz)
Found a total of 664579 primes (64-bit)

real 0m14.691s
user 0m14.680s
sys 0m0.000s
for the Pi 3B+ in 32-bit mode, which is less than half the runtime I'm currently getting for the Pi 4B. Do you remember what the compiler settings were that you used?

ejolson
Posts: 3563
Joined: Tue Mar 18, 2014 11:47 am

Re: ARMv8 32-bit mode and 64-bit division

Mon Sep 16, 2019 6:46 pm

ejolson wrote:
Mon Sep 16, 2019 6:24 pm
Do you remember what the compiler settings were that you used?
I read through the rest of that thread and found the suggestion

-march=native -Os -mneon-for-64bits

I also tried

-march=native -O3 -mneon-for-64bits
-mcpu=cortex-a72 -mfpu=neon-fp-armv8 -mneon-for-64bits -Os
-mcpu=cortex-a72 -mfpu=neon-fp-armv8 -mneon-for-64bits -O3

and nothing made an appreciable difference in the runtime: In every case the Pi 4B running in 32-bit mode using 64-bit division resulted in runtimes that were greater than 30 seconds.

Buster is using gcc 8.3 and you were using gcc 7.3. Could that be the difference?

Are you sure your original reported Pi 3B+ run times were accurate?

ejolson
Posts: 3563
Joined: Tue Mar 18, 2014 11:47 am

Re: ARMv8 32-bit mode and 64-bit division

Mon Sep 16, 2019 7:29 pm

ejolson wrote:
Mon Sep 16, 2019 6:46 pm
Are you sure your original reported Pi 3B+ run times were accurate?
Since Raspbian Buster includes gcc-4.9, gcc-5.5, gcc-6.5 and gcc-7.3 I installed and tried them all. Using variations of

-mcpu=cortex-a53 -mfpu=neon-fp-armv8 -mfloat-abi=hard -mneon-for-64bits -Os

all run times are greater than 30 seconds.

Is it possible the 15 second run time you reported earlier was, in fact, for the 64-bit Odroid with optimization turned off?

jahboater
Posts: 4681
Joined: Wed Feb 04, 2015 6:38 pm

Re: ARMv8 32-bit mode and 64-bit division

Thu Sep 19, 2019 10:45 am

ejolson wrote:
Mon Sep 16, 2019 7:29 pm
ejolson wrote:
Mon Sep 16, 2019 6:46 pm
Are you sure your original reported Pi 3B+ run times were accurate?
Since Raspbian Buster includes gcc-4.9, gcc-5.5, gcc-6.5 and gcc-7.3 I installed and tried them all. Using variations of

-mcpu=cortex-a53 -mfpu=neon-fp-armv8 -mfloat-abi=hard -mneon-for-64bits

all run times are greater than 30 seconds.

Is it possible the 15 second run time you reported earlier was, in fact, for the 64-bit Odroid with optimization turned off?
Sorry Ejolson, I can't remember (but I doubt it).
I have just moved house twice (a long story) and am currently trying to unpack finally ....
So far the only working computer I have now is a Pi4 - which is doing very well.

The flag -mneon-for-64bits just encourages the compiler to use NEON more for 64-bit integer operations. This was considered costly in the past because of the long time required to move data to and from the NEON unit. Now that NEON is no longer a co-processor for armv8, that may not be the case any more.
It doesn't seem to make any difference for prime.c as I don't think NEON can do 64-bit integer division.

For my Pi4 at stock speed, Raspbian, GCC 9.2, I get:

Code:

pi@pi4:~/primes $ gcc -O3 -march=native -mtune=native -mfpu=neon-fp-armv8 prime.c -o prime -lm
pi@pi4:~/primes $ time ./prime
Found a total of 664579 primes (64-bit)

real	0m7.004s
user	0m6.978s
sys	0m0.021s
pi@pi4:~/primes $ 
pi@pi4:~/primes $ gcc -O3 -march=native -mtune=native -mfpu=neon-fp-armv8 prime.c -o prime -lm
pi@pi4:~/primes $ time ./prime
Found a total of 664579 primes (32-bit)

real	0m1.947s
user	0m1.915s
sys	0m0.011s
pi@pi4:~/primes $ 

Paeryn
Posts: 2664
Joined: Wed Nov 23, 2011 1:10 am
Location: Sheffield, England

Re: ARMv8 32-bit mode and 64-bit division

Thu Sep 19, 2019 2:06 pm

jahboater wrote:
Thu Sep 19, 2019 10:45 am
The flag -mneon-for-64bits just encourages the compiler to use NEON more for 64-bit integer operations. This was considered costly in the past because of the long time required to move data to and from the NEON unit. Now that NEON is no longer a co-processor for armv8, that may not be the case any more.
It doesn't seem to make any difference for prime.c as I don't think NEON can do 64-bit integer division.
NEON doesn't have any integer division instructions at all regardless of the width, the only divisions it can do are floating point divisions. Integer divisions have to be done one by one with the ordinary scalar SDIV or UDIV instructions which can't use the SIMD registers.
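To illustrate, here is a small sketch of an element-wise division loop (the function name and signature are made up for this example). Since there is no NEON integer divide, I would expect the compiler to issue one scalar UDIV per element here rather than a vectorised divide, though I have not checked every compiler version.

```c
#include <stddef.h>
#include <stdint.h>

/* Divide num[i] by den[i] for each element.
   The caller must make sure no den[i] is zero. */
void div_each(uint32_t *out, const uint32_t *num,
              const uint32_t *den, size_t n) {
    size_t i;
    for (i = 0; i < n; i++) {
        out[i] = num[i] / den[i];  /* one scalar integer divide per element */
    }
}
```

The same loop with float arrays is a different story, because floating point division does exist as a SIMD instruction.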
She who travels light — forgot something.

ejolson
Posts: 3563
Joined: Tue Mar 18, 2014 11:47 am

Re: ARMv8 32-bit mode and 64-bit division

Thu Sep 19, 2019 5:08 pm

Paeryn wrote:
Thu Sep 19, 2019 2:06 pm
jahboater wrote:
Thu Sep 19, 2019 10:45 am
The flag -mneon-for-64bits just encourages the compiler to use NEON more for 64-bit integer operations. This was considered costly in the past because of the long time required to move data to and from the NEON unit. Now that NEON is no longer a co-processor for armv8, that may not be the case any more.
It doesn't seem to make any difference for prime.c as I don't think NEON can do 64-bit integer division.
NEON doesn't have any integer division instructions at all regardless of the width, the only divisions it can do are floating point divisions. Integer divisions have to be done one by one with the ordinary scalar SDIV or UDIV instructions which can't use the SIMD registers.
It looks like I need to install version 9.2 of the gcc compiler to see what sort of magic makes it 5 times faster with 64-bit division when the Cortex-A72 is running in 32-bit mode. I'll report back soon if the newer compiler makes a difference.

asavah
Posts: 364
Joined: Thu Aug 14, 2014 12:49 am

Re: ARMv8 32-bit mode and 64-bit division

Thu Sep 19, 2019 6:56 pm

Just out of curiosity I tried this test on my own aarch64 OS running on rpi4 4GB
glibc-2.30
gcc-9.2.0

Code:

gcc -mcpu=cortex-a72+crc+fp+simd -O3 -DINT=uint64_t prime.c -lm -o prime64
strip prime64
time ./prime64
Found a total of 664579 primes (64-bit)

real    0m1.842s
user    0m1.826s
sys     0m0.011s

On ARM, -march=native is still not as good as its x86 counterpart, so it's better to specify the right flags explicitly.
On my system -march=native is equivalent to:

Code:

gcc -### -E - -march=native 2>&1 | sed -r '/cc1/!d;s/(")|(^.* - )//g'
-mlittle-endian -mabi=lp64 -march=armv8-a+crc
which is not optimal.
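Another way to check what a given set of flags actually enables is to test the predefined feature macros from inside a C file. The sketch below is my own illustration (the function name is made up); the macros are the ACLE feature-test macros as I understand them, and each is guarded with #ifdef so the file still builds where a macro is absent.

```c
#include <stdio.h>

/* Print which ARM target features the compiler was told to assume.
   Returns the number of lines printed. */
int print_target_features(void) {
    int lines = 0;
#if defined(__aarch64__)
    puts("target: AArch64 (64-bit mode)"); lines++;
#elif defined(__arm__)
    puts("target: AArch32 (32-bit mode)"); lines++;
#else
    puts("target: not ARM"); lines++;
#endif
#ifdef __ARM_ARCH
    printf("__ARM_ARCH = %d\n", (int)__ARM_ARCH); lines++;
#endif
#ifdef __ARM_FEATURE_IDIV
    puts("__ARM_FEATURE_IDIV: hardware integer divide assumed"); lines++;
#endif
#ifdef __ARM_FEATURE_CRC32
    puts("__ARM_FEATURE_CRC32: CRC32 instructions assumed"); lines++;
#endif
#ifdef __ARM_NEON
    puts("__ARM_NEON: Advanced SIMD assumed"); lines++;
#endif
    return lines;
}
```

Building this once with -march=native and once with the explicit -mcpu flags, then comparing the output, shows whether the two really differ on a particular system. Running gcc -dM -E - < /dev/null with the same flags dumps the full macro list.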

ejolson
Posts: 3563
Joined: Tue Mar 18, 2014 11:47 am

Re: ARMv8 32-bit mode and 64-bit division

Thu Sep 19, 2019 8:06 pm

ejolson wrote:
Thu Sep 19, 2019 5:08 pm
Paeryn wrote:
Thu Sep 19, 2019 2:06 pm
jahboater wrote:
Thu Sep 19, 2019 10:45 am
The flag -mneon-for-64bits just encourages the compiler to use NEON more for 64-bit integer operations. This was considered costly in the past because of the long time required to move data to and from the NEON unit. Now that NEON is no longer a co-processor for armv8, that may not be the case any more.
It doesn't seem to make any difference for prime.c as I don't think NEON can do 64-bit integer division.
NEON doesn't have any integer division instructions at all regardless of the width, the only divisions it can do are floating point divisions. Integer divisions have to be done one by one with the ordinary scalar SDIV or UDIV instructions which can't use the SIMD registers.
It looks like I need to install version 9.2 of the gcc compiler to see what sort of magic makes it 5 times faster with 64-bit division when the Cortex-A72 is running in 32-bit mode. I'll report back soon if the newer compiler makes a difference.
My SD card ran out of storage so I'm trying the build again using an external USB drive. For the record, the SD card was advertised as 16GB and fdisk reports that it stores a total of 15552479232 bytes. The root partition is 14.2GB formatted to 13.9GB of which about 9.3GB seems to be taken by Raspbian. Another 1.4GB was filled with rusty crates of hashbrowns. After including a few hundred megabytes of user files, only 13% of the SD card was left for compiling gcc, which unfortunately was not enough.

Are 32GB cards now recommended or have I done something strange to make Raspbian take more storage than it should?
Last edited by ejolson on Fri Sep 20, 2019 1:02 am, edited 1 time in total.

jahboater
Posts: 4681
Joined: Wed Feb 04, 2015 6:38 pm

Re: ARMv8 32-bit mode and 64-bit division

Thu Sep 19, 2019 9:16 pm

ejolson wrote:
Thu Sep 19, 2019 8:06 pm
Are 32GB cards now recommended or have I done something strange to make Raspbian take more storage than it should?
I don't know, cards are so cheap now that I only use 64GB or 128GB.

And 32GB cards are now cheaper than 16GB cards!

https://www.amazon.co.uk/SanDisk-microS ... 162&sr=8-3
