LdB
Posts: 1209
Joined: Wed Dec 07, 2016 2:29 pm

AARCH64 hard floats

Tue Aug 08, 2017 5:16 pm

Anyone know how to get hard floats online with Linaro AARCH64? I have tried every flag I can think of.

I was building the old Wolf3D raycaster, and there is a bit of a penalty without the FPU online when using doubles:
https://github.com/LdB-ECM/Raspberry-Pi ... er/RayCast

If you have a Pi3, put any of the 32-bit image versions on the SD card and look at the speed, then put kernel8.img (which is AARCH64) on the SD card and compare.

Slight speed drop with no hard fpu :-)

The other question is: do I have to turn on the MMU to get the NEON FPU online?

Ultibo
Posts: 158
Joined: Wed Sep 30, 2015 10:29 am
Location: Australia

Re: AARCH64 hard floats

Thu Aug 10, 2017 11:33 pm

LdB wrote (Tue Aug 08, 2017 5:16 pm):
"The other question is: do I have to turn on the MMU to get the NEON FPU online?"

Without the MMU, any difference between hard and soft floating point will likely be insignificant; the biggest performance gain available is from enabling the MMU, which allows full use of the L1 and L2 caches.

Of course, enabling the MMU also allows proper locking mechanisms through the use of LDREX/STREX.
Ultibo.org | Make something amazing
https://ultibo.org

Threads, multi-core, OpenGL, Camera, FAT, NTFS, TCP/IP, USB and more in 3MB with 2 second boot!

LdB
Posts: 1209
Joined: Wed Dec 07, 2016 2:29 pm

Re: AARCH64 hard floats

Fri Aug 11, 2017 6:06 am

Cheers for that Ultibo :-)

I had been procrastinating on that area because the TLB stuff had scared me off. Sigh, time to dive in.

I am, however, still confused: both the 32-bit and 64-bit code have no MMU setup, so why is the 32-bit so much faster?

dwelch67
Posts: 955
Joined: Sat May 26, 2012 5:32 pm

Re: AARCH64 hard floats

Sat Aug 12, 2017 2:31 am

What does an MMU on vs off have to do with anything? Do you mean data cache on vs off (instead of MMU)? You need the MMU to prevent the data cache from caching peripheral status registers and such, but the MMU makes all transactions slower by definition, even with the TLB working/helping.

So I am a little confused: how could the MMU have anything to do with the hard FPU?

Ultibo
Posts: 158
Joined: Wed Sep 30, 2015 10:29 am
Location: Australia

Re: AARCH64 hard floats

Sat Aug 12, 2017 10:07 am

dwelch67 wrote (Sat Aug 12, 2017 2:31 am):
"What does an MMU on vs off have to do with anything? Do you mean data cache on vs off (instead of MMU)?"

Enabling the data cache without also enabling the MMU has very little impact, because you need the MMU to mark memory pages as cacheable, shared, device, executable, etc.

The ARM Architecture Reference Manual for ARMv7-A, section B3.2.1 "VMSA behavior when a stage 1 MMU is disabled", says:

When the MMU is disabled:

1) All data accesses will be treated as strongly ordered, non-cacheable.

2) Instruction accesses will be treated as either non-cacheable or cacheable write-through depending on the "I" bit in the system control register.

So yes, the MMU has a major (positive) impact on performance and is one of the primary steps needed in getting useful functionality out of the Pi. Without it, the Pi is really just a poor-quality micro-controller.
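
To make the relationship between those enable bits concrete, here is a minimal C sketch. It assumes the usual SCTLR bit positions (M = bit 0, C = bit 2, I = bit 12), which are the same in the ARMv7-A SCTLR and the AArch64 SCTLR_EL1. The helper names are made up for illustration and do not come from any library.

```c
#include <stdbool.h>
#include <stdint.h>

/* SCTLR enable bits (same positions in ARMv7-A SCTLR and AArch64 SCTLR_EL1). */
#define SCTLR_M (1u << 0)   /* stage 1 MMU enable */
#define SCTLR_C (1u << 2)   /* data/unified cache enable */
#define SCTLR_I (1u << 12)  /* instruction cache enable */

/* Hypothetical helper: data accesses can only be cached when both the MMU
   and the data cache enable bits are set, per the rule quoted above. */
bool data_caching_possible(uint32_t sctlr)
{
    return (sctlr & SCTLR_M) != 0 && (sctlr & SCTLR_C) != 0;
}

/* Hypothetical helper: instruction caching depends only on the I bit. */
bool icache_enabled(uint32_t sctlr)
{
    return (sctlr & SCTLR_I) != 0;
}
```

With only the C bit set, `data_caching_possible()` still returns false, which is the "very little impact" case described above.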

dwelch67
Posts: 955
Joined: Sat May 26, 2012 5:32 pm

Re: AARCH64 hard floats

Sat Aug 12, 2017 1:21 pm

The MMU itself does not improve performance: every access has to be looked up in a table, so by definition it takes extra clocks PER access (even with the TLB). As I mentioned and you repeated, the MMU makes it possible to use the data cache safely, which can and does improve performance. With respect to the cache, the MMU lets you mark cacheable vs non-cacheable areas, and that is where it helps with performance. But this has nothing to do with hard floats so far. With or without the cache, soft float is slower than hard float; comparing hard float without the cache against soft float with it, if that was implied, makes no sense. Apples and oranges.

I am trying to see the connection to hard floats: why is the MMU being mentioned with respect to hardware floating point on AARCH64? It should be like the FPU in other cases: it is a power-hungry block, so you enable it if you need it. Is it no longer a coprocessor (that wouldn't make sense), and are there no more coprocessor enables?

LdB
Posts: 1209
Joined: Wed Dec 07, 2016 2:29 pm

Re: AARCH64 hard floats

Sat Aug 12, 2017 4:28 pm

Well, that explains the results I got: the MMU helped a little, but not massively.

What I found out was that on the Linaro AARCH64 toolchain I am using, you need to use -O3 or add +simd+fp to the architecture option,
e.g. -march=armv8-a+simd+fp

Many versions of GCC turn these on automatically; the version I was using just doesn't, so it behaves like a soft-float build. NEON is supposed to be mandatory in AARCH64, so I am not sure the code built without the flag is truly AARCH64. The -O3 flag crashes my program and I am still getting to the bottom of the problem, which is something to do with the auto-vectorization optimisations.
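
One way to check what a given toolchain build has actually enabled is to look at the ACLE predefined macros. The sketch below assumes the compiler follows ACLE and defines `__ARM_FP` and `__ARM_NEON` only when FP and SIMD code generation are on. The function name `fp_build_mode` is made up for illustration.

```c
/* Reports what the compiler was configured to generate, based on the
   ACLE predefined macros __aarch64__, __ARM_FP and __ARM_NEON. */
const char *fp_build_mode(void)
{
#if !defined(__aarch64__)
    return "not an AArch64 build";
#elif defined(__ARM_FP) && defined(__ARM_NEON)
    return "AArch64 with FP and SIMD";
#elif defined(__ARM_FP)
    return "AArch64 with FP only";
#else
    return "AArch64 without FP/SIMD";
#endif
}
```

On a Linux host, dumping the predefined macros with `aarch64-linux-gnu-gcc -dM -E - < /dev/null` should show the same information without compiling anything.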

Paeryn
Posts: 2634
Joined: Wed Nov 23, 2011 1:10 am
Location: Sheffield, England

Re: AARCH64 hard floats

Sat Aug 12, 2017 5:47 pm

From what I understand, in ARMv8 SIMD is mandatory when there is an FPU, so you can't have just one of them; it's either both or none. ARMv8 does allow for having neither FPU nor SIMD, though not as a general rule: the docs say that such a configuration is licensed only for implementations targeting specialised markets.

You should only need to add +simd to the architecture, as it is supposed to enable both SIMD and FP, though gcc should have those on by default for AArch64. I downloaded the Linaro 6.3 toolchain on my Linux box to test it, and it defaulted to having them on.
She who travels light — forgot something.

LdB
Posts: 1209
Joined: Wed Dec 07, 2016 2:29 pm

Re: AARCH64 hard floats

Sat Aug 12, 2017 6:05 pm

Yep, just did the same, as I finally worked out how to get 6.3 working on the Pi. Yes, they are on for the Linux version, but they forgot to turn them on in the Windows version, which was my problem :-)

The -O3 crash was because they forgot the alignment flag as well, which I have now also got working. I found the verbose setting that makes the compiler spit out the flags.

You were also right that it just needs +simd, and I have dropped them a notification. Confused the hell out of me for a while.

Update: Just tried Linaro 7.1 and the flags aren't set on the Windows cross version there either, and I got it running as well. Easy enough to fix once you realize the problem.

It would be nice if you could write 32-bit sections with the AARCH64 compiler; having to switch between two compilers just to punch in a 32-bit code section is a bit of a pain.

dwelch67
Posts: 955
Joined: Sat May 26, 2012 5:32 pm

Re: AARCH64 hard floats

Mon Aug 14, 2017 9:32 pm

So are you having compiler issues or runtime issues?

Code:

    // HCR_EL2.RW (bit 31) = 1: set EL1 to aarch64
    mov x1,#0x80000000
    msr hcr_el2,x1

    // disable traps
    // before 0x00C50838
    // just set the RES1 bits
    ldr x0,=0x00C00800
    msr sctlr_el1,x0
    // after 0x00C00800

    // CPACR_EL1.FPEN (bits 21:20) = 0b11: do not trap FP/SIMD at EL0/EL1
    ldr x0,=0x00300000
    msr cpacr_el1,x0

    // SPSR_EL2 = 0x3c5: D, A, I, F masked, return to EL1h
    mov x0,#0x3c5
    msr spsr_el2,x0
    //ldr x0,=enter_el1
    adr x0,enter_el1
    msr elr_el2,x0
    eret
enter_el1:

At runtime I simply added the CPACR write before switching from EL2; the rest is derived from your EL1 switch.
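
For anyone puzzling over the magic numbers, here is a small C sketch that pulls the relevant fields out of the three constants used above. It assumes the ARMv8-A field layouts (HCR_EL2.RW at bit 31, CPACR_EL1.FPEN at bits 21:20, SPSR_EL2 mode in bits 3:0 and D, A, I, F in bits 9:6). The function names are made up for illustration.

```c
#include <stdint.h>

/* HCR_EL2.RW, bit 31: 1 means EL1 executes in AArch64 state. */
int hcr_el1_is_aarch64(uint64_t hcr)
{
    return (int)((hcr >> 31) & 1u);
}

/* CPACR_EL1.FPEN, bits [21:20]: 0b11 means FP/SIMD accesses are not
   trapped at EL0 or EL1. */
unsigned cpacr_fpen(uint64_t cpacr)
{
    return (unsigned)((cpacr >> 20) & 3u);
}

/* SPSR_EL2.M[3:0]: 0b0101 selects EL1h (EL1 using SP_EL1). */
unsigned spsr_mode(uint64_t spsr)
{
    return (unsigned)(spsr & 0xFu);
}

/* SPSR_EL2 D, A, I, F mask bits, bits [9:6]. */
unsigned spsr_daif(uint64_t spsr)
{
    return (unsigned)((spsr >> 6) & 0xFu);
}
```

So `cpacr_fpen(0x00300000)` gives 3, which is the setting that lets FP and SIMD instructions run at EL1 without trapping.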

As far as gcc command line options go, none are needed; it defaults to supporting hard float.

Code:

float fun ( float a, float b )
{
    return(a+b);
}

Code:

aarch64-linux-gnu-gcc --version
aarch64-linux-gnu-gcc (Linaro GCC 5.4-2017.05) 5.4.1 20170404
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Code:

aarch64-linux-gnu-gcc -O2 -c fun.c -o fun.o
aarch64-linux-gnu-objdump -D fun.o
0000000000000000 <fun>:
   0:	1e212800 	fadd	s0, s0, s1
   4:	d65f03c0 	ret
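
As a cross-check on the disassembly, the instruction word can be decoded by hand. This is a minimal sketch assuming the A64 scalar FADD encoding (0 0 0 11110 type 1 Rm 0010 10 Rn Rd). The struct and function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded fields of a scalar FADD instruction. */
struct fadd_fields {
    unsigned type;       /* 0 = single (S), 1 = double (D), 3 = half (H) */
    unsigned rm, rn, rd; /* register numbers */
};

/* Returns true and fills *out when insn matches the scalar FADD pattern;
   returns false (leaving *out untouched) otherwise. */
bool decode_fadd_scalar(uint32_t insn, struct fadd_fields *out)
{
    /* Fixed bits: [31:24] = 0x1E, bit 21 = 1, bits [15:10] = 001010. */
    if ((insn & 0xFF20FC00u) != 0x1E202800u)
        return false;
    out->type = (insn >> 22) & 3u;
    out->rm = (insn >> 16) & 31u;
    out->rn = (insn >> 5) & 31u;
    out->rd = insn & 31u;
    return true;
}
```

Feeding it 0x1e212800 gives type 0 with rd = 0, rn = 0, rm = 1, which matches `fadd s0, s0, s1`.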

Code:

int notmain ( void )
{

    uart_init();
    hexstring(read_cpacr());
    hexstring(getel());
    hexstring(fun(5.0F,1.0F));

    return(0);
}

Output:

Code:

00300000
00000004
00000006
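
A quick C sketch of how to read the last two numbers. It assumes getel() returns the raw CurrentEL register (exception level in bits 3:2) and that hexstring() takes an unsigned int, so the float result gets converted on the way in; neither is shown in the listing, so treat both as guesses. The function names here are made up for illustration.

```c
#include <stdint.h>

/* CurrentEL holds the exception level in bits [3:2]. */
unsigned current_el_level(uint64_t current_el)
{
    return (unsigned)((current_el >> 2) & 3u);
}

/* Mirrors the implicit conversion in hexstring(fun(5.0F,1.0F)):
   the float sum is truncated to an unsigned integer. */
unsigned float_sum_as_unsigned(float a, float b)
{
    return (unsigned)(a + b);
}
```

Under those assumptions 00000004 decodes to EL1, and 00000006 is simply 5.0 + 1.0 converted to an integer.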
GCC 7.1.0, same story:

Code:

   80324:	1e2e1001 	fmov	s1, #1.000000000000000000e+00
   80328:	1e229000 	fmov	s0, #5.000000000000000000e+00
   8032c:	94000007 	bl	80348 <fun>
   80330:	1e390000 	fcvtzu	w0, s0
   80334:	97ffffa7 	bl	801d0 <hexstring>

Code:

0000000000080348 <fun>:
   80348:	1e212800 	fadd	s0, s0, s1
   8034c:	d65f03c0 	ret

I guess for simple numbers there is an immediate form of fmov. But for 1.2345678:

Code:

   80324:	90000000 	adrp	x0, 80000 <_start>
   80328:	1e2e1001 	fmov	s1, #1.000000000000000000e+00
   8032c:	bd435000 	ldr	s0, [x0, #848]
   80330:	94000006 	bl	80348 <fun>
...
   80350:	3f9e0651 	.word	0x3f9e0651

So I think it really is doing floating point.
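
The two cases can be reproduced on any host with a few lines of C. This is a sketch, assuming IEEE 754 single precision and the 8-bit FMOV (scalar, immediate) encoding, where only values with the low 19 fraction bits clear and a narrow exponent range fit. The function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Returns the raw IEEE 754 single-precision bit pattern of f. */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* True when f fits the 8-bit FMOV immediate: low 19 fraction bits are zero
   and bits [30:25] are either 100000 or 011111. */
bool fits_fmov_imm8(float f)
{
    uint32_t u = float_bits(f);
    uint32_t exp_hi = (u >> 25) & 0x3Fu;
    return (u & 0x7FFFFu) == 0 && (exp_hi == 0x20u || exp_hi == 0x1Fu);
}
```

1.0 and 5.0 pass the check, so they get an fmov immediate. 1.2345678 does not, and its bit pattern 0x3f9e0651 is exactly the .word in the literal pool at 0x80350, which is where the adrp plus ldr offset #848 points.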

If you already figured this out my apologies...

dwelch67
Posts: 955
Joined: Sat May 26, 2012 5:32 pm

Re: AARCH64 hard floats

Mon Aug 14, 2017 9:34 pm

Sorry, guess you solved it... I had fun figuring it out anyway...
