Cycl0ne
Posts: 102
Joined: Mon Jun 25, 2012 8:03 am

Framebuffer Speed

Fri Aug 03, 2012 10:30 am

Hi,

Just wondering whether the framebuffer really is this slow, or whether I am doing something wrong. I opened the framebuffer (old style, like the Linux fb) and cleared the screen to white (resolution 1600x1200 at 32 bit):

Code: Select all

	DPrintF("Buffer Addr %x\n", bm->Addr);

	INT32 *buff = (INT32*)bm->Addr; 
	for (int i = 0; i < 1600*1200; i += 4)	/* four words per pass; "i=+4" would assign +4 to i */
	{
		*buff++ = (UINT32)0xFFFFFFFF;
		*buff++ = (UINT32)0xFFFFFFFF;
		*buff++ = (UINT32)0xFFFFFFFF;
		*buff++ = (UINT32)0xFFFFFFFF;
	}
I put four stores in the loop body in case unrolling would help, but the screen clear still takes about 1 sec.

What I'm wondering about:
My MemSet() does about 300 MB/s. The framebuffer is 1600x1200x4 bytes = ~7.7 megabytes. Surely the loop can't be so slow that you can see the clear happen? Am I doing something wrong? Has anybody played with the fb?

Cheers
C.
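
A quick sanity check of the numbers in the post above, as a minimal sketch. It assumes a packed framebuffer with no row padding and takes "300 MB/s" to mean 300 * 10^6 bytes per second; the function names are made up for illustration. Under those assumptions a full clear should take roughly 25.6 ms, nowhere near a second.

```c
#include <stdint.h>

/* Size in bytes of a packed framebuffer (assumes no padding at the end of a row). */
static uint32_t fb_size_bytes(uint32_t width, uint32_t height, uint32_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

/* Expected fill time in microseconds at a given rate in bytes per second. */
static uint32_t expected_fill_us(uint32_t bytes, uint32_t bytes_per_sec)
{
    /* 64-bit intermediate so bytes * 1000000 cannot overflow. */
    return (uint32_t)(((uint64_t)bytes * 1000000u) / bytes_per_sec);
}
```

For 1600x1200 at 4 bytes per pixel this gives 7680000 bytes, and at 300 MB/s an expected fill time of 25600 microseconds.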

Cycl0ne
Posts: 102
Joined: Mon Jun 25, 2012 8:03 am

Re: Framebuffer Speed

Fri Aug 03, 2012 12:13 pm

OK, made some speed tests:

The timer ticks at 1 MHz, so 1 sec = 1000000 ticks (the "Hz" in the output below is really microseconds).

With this C routine and only one *buff++:
Time: 1036010 Hz

With 4 *buff++:
Time: 573781 Hz

With 8 *buff++:
Time: 495534 Hz

With the assembler-optimized MemSet from GitHub:
Time: 24829 Hz

R-E-S-P-E-C-T. DexOS, we have to code in assembler ;-) It's about 40 times faster than the standard C code, and still 20 times faster than the version with 8 *buff++.
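
To put the timings above into more familiar units, here is a small sketch that turns a tick count into bytes per second and full-screen fills per second. It assumes the timer counts at 1 MHz, as the post states (1 sec = 1000000), and that each test writes the whole 1600x1200x4 = 7680000 byte buffer; the helper names are hypothetical.

```c
#include <stdint.h>

#define TIMER_HZ 1000000u  /* timer rate taken from the post: 1 s = 1000000 ticks */

/* Throughput in bytes per second for `bytes` written in `ticks` timer ticks. */
static uint64_t bytes_per_second(uint32_t bytes, uint32_t ticks)
{
    return ((uint64_t)bytes * TIMER_HZ) / ticks;
}

/* Whole fills per second if one fill takes `ticks` timer ticks. */
static uint32_t fills_per_second(uint32_t ticks)
{
    return TIMER_HZ / ticks;
}
```

With these helpers the assembler MemSet result (24829 ticks) comes out at about 309 MB/s, in line with the 300 MB/s quoted earlier in the thread, while the single-store C loop (1036010 ticks) manages only about 7.4 MB/s.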

DexOS
Posts: 876
Joined: Wed May 16, 2012 6:32 pm
Contact: Website

Re: Framebuffer Speed

Fri Aug 03, 2012 3:20 pm

One of the reasons I am moving away from a normal OS (with GUI etc.) on the Pi and towards a more Arduino-like model is that writing to the screen is so slow, even with very optimized assembly code.
The best I can get is 45-50 FPS at 800x600 32bpp. I know someone is getting good results from using DMA.

The above is the best case, doing nothing else but dumping a buffer to the screen.

On top of this, a number of other coders and I have had to redo our framebuffer code 3 times now, as they keep changing stuff in the loader files.
Batteries not included, Some assembly required.

tufty
Posts: 1456
Joined: Sun Sep 11, 2011 2:32 pm

Re: Framebuffer Speed

Fri Aug 03, 2012 4:20 pm

I'd be interested to know what compiler you're using, what settings, and what assembler it's spitting out. There's something hideously wrong with that.

Also, you should probably be using dma.

Simon

Cycl0ne
Posts: 102
Joined: Mon Jun 25, 2012 8:03 am

Re: Framebuffer Speed

Fri Aug 03, 2012 4:48 pm

GCC:

arm-none-eabi-gcc.exe (GNU Tools for ARM Embedded Processors) 4.6.2 20120316 (release) [ARM/embedded-4_6-branch revision 185452]
Copyright (C) 2011 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Config Makefile:

Code: Select all

CFLAGS = -O0 -std=gnu99 -Werror -D__$(PLATFORM)__ -DRASPBERRY_PI -fno-builtin
Assembly:

Code: Select all

	ldr	r3, [fp, #-40]
	sub	r3, r3, #296
	ldr	r3, [r3, #0]
	ldr	r2, [fp, #-20]
	ldr	r2, [r2, #28]
	ldr	r0, [fp, #-40]
	ldr	r1, .L5+8
	blx	r3
	ldr	r3, [fp, #-20]
	ldr	r3, [r3, #28]
	str	r3, [fp, #-8]
	ldr	r3, .L5+12
	ldr	r3, [r3, #0]
	str	r3, [fp, #-24]
	mov	r3, #0
	str	r3, [fp, #-12]
	b	.L2
.L3:
	ldr	r3, [fp, #-8]
	mvn	r2, #0
	str	r2, [r3, #0]
	ldr	r3, [fp, #-8]
	add	r3, r3, #4
	str	r3, [fp, #-8]
	ldr	r3, [fp, #-8]
	mvn	r2, #0
	str	r2, [r3, #0]
	ldr	r3, [fp, #-8]
	add	r3, r3, #4
	str	r3, [fp, #-8]
	ldr	r3, [fp, #-8]
	mvn	r2, #0
	str	r2, [r3, #0]
	ldr	r3, [fp, #-8]
	add	r3, r3, #4
	str	r3, [fp, #-8]
	ldr	r3, [fp, #-8]
	mvn	r2, #0
	str	r2, [r3, #0]
	ldr	r3, [fp, #-8]
	add	r3, r3, #4
	str	r3, [fp, #-8]
	ldr	r3, [fp, #-8]
	mvn	r2, #0
	str	r2, [r3, #0]
	ldr	r3, [fp, #-8]
	add	r3, r3, #4
	str	r3, [fp, #-8]
	ldr	r3, [fp, #-8]
	mvn	r2, #0
	str	r2, [r3, #0]
	ldr	r3, [fp, #-8]
	add	r3, r3, #4
	str	r3, [fp, #-8]
	ldr	r3, [fp, #-8]
	mvn	r2, #0
	str	r2, [r3, #0]
	ldr	r3, [fp, #-8]
	add	r3, r3, #4
	str	r3, [fp, #-8]
	ldr	r3, [fp, #-8]
	mvn	r2, #0
	str	r2, [r3, #0]
	ldr	r3, [fp, #-8]
	add	r3, r3, #4
	str	r3, [fp, #-8]
	ldr	r3, [fp, #-12]
	add	r3, r3, #8
	str	r3, [fp, #-12]
.L2:
	ldr	r2, [fp, #-12]
	ldr	r3, .L5+16
	cmp	r2, r3
	ble	.L3
	ldr	r3, .L5+12
	ldr	r3, [r3, #0]
	str	r3, [fp, #-28]
	ldr	r3, [fp, #-40]
	sub	r3, r3, #296
	ldr	r3, [r3, #0]
	ldr	r1, [fp, #-28]
	ldr	r2, [fp, #-24]
	rsb	r2, r2, r1
	ldr	r0, [fp, #-40]
	ldr	r1, .L5+20
	blx	r3
.L4:
	b	.L4
.L6:
	.align	2
.L5:
	.word	name
	.word	version+7
	.word	.LC0
	.word	536883204
	.word	1919999
	.word	.LC1
C-Code:

Code: Select all

	DPrintF("Buffer Addr %x\n", bm->Addr);

	INT32 *buff = (INT32*)bm->Addr; //-0x20000000; //0x49385000
//	int time1 = READ32(ST_BASE+0x04);
//	MemSet(buff, 0xff, 1600*1200*4);
//	int time2 = READ32(ST_BASE+0x04);
//	DPrintF("Time: %d Hz\n", time2-time1);

	int time3 = READ32(ST_BASE+0x04);
	for (int i= 0; i<1600*1200; i+=8)
	{
		*buff++ = (UINT32)0xFFFFFFFF;
		*buff++ = (UINT32)0xFFFFFFFF;
		*buff++ = (UINT32)0xFFFFFFFF;
		*buff++ = (UINT32)0xFFFFFFFF;
		*buff++ = (UINT32)0xFFFFFFFF;
		*buff++ = (UINT32)0xFFFFFFFF;
		*buff++ = (UINT32)0xFFFFFFFF;
		*buff++ = (UINT32)0xFFFFFFFF;
	}
	int time4 = READ32(ST_BASE+0x04);

	DPrintF("Time: %d Hz\n", time4-time3);

	for(;;);
Can't go higher than -O0, otherwise my code VANISHES :P
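
The timing in the C code above is two reads of a free-running counter and a subtraction. A minimal sketch of that subtraction follows, assuming the counter is a 32-bit up-counter; the function name is made up, and READ32 and ST_BASE from the post are not redefined here. Using unsigned types keeps the result well defined even if the counter rolls over once between the two reads.

```c
#include <stdint.h>

/* Ticks elapsed between two reads of a free-running 32-bit up-counter.
   Unsigned subtraction wraps modulo 2^32, so a single rollover between
   the two reads still gives the right answer. */
static uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return end - start;
}
```

In the posted code this would replace `time4-time3`, with both variables declared as unsigned 32-bit values instead of `int`.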

dom
Raspberry Pi Engineer & Forum Moderator
Posts: 5213
Joined: Wed Aug 17, 2011 7:41 pm
Location: Cambridge

Re: Framebuffer Speed

Fri Aug 03, 2012 4:50 pm

Cycl0ne wrote: Can't go higher than -O0, otherwise my code VANISHES :P
You should make buff volatile.
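
A minimal sketch of what that suggestion could look like, written as a stand-alone helper with a hypothetical name rather than as a patch to the posted code. The volatile qualifier on the pointed-to type tells the compiler that every store is an observable side effect, so it must emit each one even at higher optimization levels.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill `count` 32-bit words starting at `dst` with `value`.
   Because the target is volatile-qualified, the compiler may not
   drop or merge the individual stores. */
static void fill32_volatile(volatile uint32_t *dst, uint32_t value, size_t count)
{
    while (count--) {
        *dst++ = value;
    }
}
```

For the screen clear in this thread the call would be something like `fill32_volatile((volatile uint32_t *)bm->Addr, 0xFFFFFFFF, 1600*1200)`.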

Cycl0ne
Posts: 102
Joined: Mon Jun 25, 2012 8:03 am

Re: Framebuffer Speed

Fri Aug 03, 2012 5:08 pm

Doesn't change anything in the timing:
Buffer Addr 49385000
Time: 495003 Hz
Time: 22085 Hz

But with my new Memset32(UINT32*, UINT32, UINT32) I achieve 22085 Hz. I believe we are done now, no more speed records. 45 FPS is OK for the moment, until I get the hardware-acceleration stuff from you, dom ;-)
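
The thread does not show the body of Memset32, so the following is only a plain C stand-in under stated assumptions: the third argument is taken to be a count of 32-bit words, and the name memset32_sketch is made up. It keeps the eight-stores-per-pass shape of the unrolled loop posted earlier and adds a tail loop so the count does not have to be a multiple of eight.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C stand-in for a 32-bit memset; `count` is in words (an assumption). */
static void memset32_sketch(uint32_t *dst, uint32_t value, size_t count)
{
    /* Main loop: eight stores per pass, like the unrolled loop in the thread. */
    while (count >= 8) {
        dst[0] = value; dst[1] = value; dst[2] = value; dst[3] = value;
        dst[4] = value; dst[5] = value; dst[6] = value; dst[7] = value;
        dst += 8;
        count -= 8;
    }
    /* Tail: fewer than eight words left. */
    while (count--) {
        *dst++ = value;
    }
}
```

A hand-written assembler version would be expected to beat this, as the measurements in the thread show; the sketch only illustrates the structure.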

tufty
Posts: 1456
Joined: Sun Sep 11, 2011 2:32 pm

Re: Framebuffer Speed

Fri Aug 03, 2012 7:27 pm

Ah, -O0 is going to (and has done) produce some really braindead code. It's fine for debugging, but there's no way you can use it in production. Rather than taking your loop variable into a register, for example, it stores it in the stack frame, and pulls it out every time it needs to update it, followed by pushing it back again. I would expect -O3 to produce something within an order of magnitude of the hand-tuned memcpy stuff.

Here's my interpretation of your code:

Code: Select all

void foo() {
  unsigned int * buf = (unsigned int *)0xf00b1234;
  for (int i = 0; i < 1600 * 1200; i+= 8)
  {
    *buf++ = 0xffffffff;
    *buf++ = 0xffffffff;
    *buf++ = 0xffffffff;
    *buf++ = 0xffffffff;
    *buf++ = 0xffffffff;
    *buf++ = 0xffffffff;
    *buf++ = 0xffffffff;
    *buf++ = 0xffffffff;
  }
}
Here's the assembly from a compile with -O0 (using -save-temps to get gcc annotations)

Code: Select all

foo:
        @ Function supports interworking.
        @ args = 0, pretend = 0, frame = 8
        @ frame_needed = 1, uses_anonymous_args = 0
        @ link register save eliminated.
        str     fp, [sp, #-4]!
        add     fp, sp, #0
        sub     sp, sp, #12
        ldr     r3, .L4
        str     r3, [fp, #-8]
        mov     r3, #0
        str     r3, [fp, #-12]
        b       .L2
.L3:
        ldr     r3, [fp, #-8]
        mvn     r2, #0
        str     r2, [r3, #0]
        ldr     r3, [fp, #-8]
        add     r3, r3, #4
        str     r3, [fp, #-8]
        ldr     r3, [fp, #-8]
        mvn     r2, #0
        str     r2, [r3, #0]
        ldr     r3, [fp, #-8]
        add     r3, r3, #4
        str     r3, [fp, #-8]
        ldr     r3, [fp, #-8]
        mvn     r2, #0
        str     r2, [r3, #0]
        ldr     r3, [fp, #-8]
        add     r3, r3, #4
        str     r3, [fp, #-8]
        ldr     r3, [fp, #-8]
        mvn     r2, #0
        str     r2, [r3, #0]
        ldr     r3, [fp, #-8]
        add     r3, r3, #4
        str     r3, [fp, #-8]
        ldr     r3, [fp, #-8]
        mvn     r2, #0
        str     r2, [r3, #0]
        ldr     r3, [fp, #-8]
        add     r3, r3, #4
        str     r3, [fp, #-8]
        ldr     r3, [fp, #-8]
        mvn     r2, #0
        str     r2, [r3, #0]
        ldr     r3, [fp, #-8]
        add     r3, r3, #4
        str     r3, [fp, #-8]
        ldr     r3, [fp, #-8]
        mvn     r2, #0
        str     r2, [r3, #0]
        ldr     r3, [fp, #-8]
        add     r3, r3, #4
        str     r3, [fp, #-8]
        ldr     r3, [fp, #-8]
        mvn     r2, #0
        str     r2, [r3, #0]
        ldr     r3, [fp, #-8]
        add     r3, r3, #4
        str     r3, [fp, #-8]
        ldr     r3, [fp, #-12]
        add     r3, r3, #8
        str     r3, [fp, #-12]
.L2:
        ldr     r2, [fp, #-12]
        ldr     r3, .L4+4
        cmp     r2, r3
        ble     .L3
        add     sp, fp, #0
        ldmfd   sp!, {fp}
        bx      lr
.L5:
        .align  2
.L4:
        .word   -267709900
        .word   1919999
and from -O3

Code: Select all

foo:
        @ Function supports interworking.
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        ldr     r3, .L5
        ldr     r1, .L5+4
        mvn     r2, #0
.L2:
        str     r2, [r3, #0]
        str     r2, [r3, #4]
        str     r2, [r3, #8]
        str     r2, [r3, #12]
        str     r2, [r3, #16]
        str     r2, [r3, #20]
        str     r2, [r3, #24]
        str     r2, [r3, #28]
        add     r3, r3, #32
        cmp     r3, r1
        bne     .L2
        bx      lr
.L6:
        .align  2
.L5:
        .word   -267709900
        .word   -260029900
The latter looks an awful lot like what you'd expect a trivially assembled clear screen to look like, and its performance should be pretty good. You'd get quite a lot faster by using stm rather than str, of course, but it's kinda half-way decent (gcc ain't great on ARM)

As to why your code's disappearing, I don't know. Are you using gcc to assemble and link in one step?

valtonia
Posts: 26
Joined: Wed Jul 04, 2012 9:09 pm

Re: Framebuffer Speed

Fri Aug 03, 2012 9:08 pm

@Cycl0ne

I just tried recompiling my (mainly framebuffer) stuff with -O3 (I'd also been using -O0 while trying to get everything working) and it is indeed an order of magnitude faster. I also don't see any code disappearing. I'm using the arm-bcm2708hardfp-linux-gnueabi-gcc stuff.

I also have a DMA clear screen which I think will be hard to beat. :D

Cycl0ne
Posts: 102
Joined: Mon Jun 25, 2012 8:03 am

Re: Framebuffer Speed

Fri Aug 03, 2012 10:05 pm

My code disappears because I'm using jump tables for access:

Code: Select all

static APTR FuncTab[] = 
{
	(void(*)) gfx_SetMode,
	(void(*)) gfx_SetFillMode,
	(void(*)) gfx_SetUseBackground,
	(void(*)) gfx_SetAPen,
	(void(*)) gfx_SetBPen,
	(void(*)) gfx_SetAPenC,
	(void(*)) gfx_SetBPenC,
	(void(*)) gfx_SetDash,
	(void(*)) gfx_DrawPoint,
	(void(*)) gfx_DrawPoint,

	(APTR) ((UINT32)-1)
};
and

Code: Select all

#define SetMode(a,b)			(((UINT32(*)(APTR, struct RastPort *, UINT32)) 			_GETVECADDR(GfxBase, 5)) (GfxBase, a,b))
#define SetFillMode(a,b)		(((UINT32(*)(APTR, struct RastPort *, UINT32)) 			_GETVECADDR(GfxBase, 6)) (GfxBase, a,b))
#define SetUseBackground(a,b)	(((BOOL  (*)(APTR, struct RastPort *, BOOL  )) 			_GETVECADDR(GfxBase, 7)) (GfxBase, a,b))
#define SetAPen(a,b)			(((UINT32(*)(APTR, struct RastPort *, UINT32)) 			_GETVECADDR(GfxBase, 8)) (GfxBase, a,b))
#define SetBPen(a,b)			(((UINT32(*)(APTR, struct RastPort *, UINT32))		 	_GETVECADDR(GfxBase, 9)) (GfxBase, a,b))
#define SetAPenC(a,b)			(((UINT32(*)(APTR, struct RastPort *, UINT32)) 			_GETVECADDR(GfxBase,10)) (GfxBase, a,b))
#define SetBPenC(a,b)			(((UINT32(*)(APTR, struct RastPort *, UINT32)) 			_GETVECADDR(GfxBase,11)) (GfxBase, a,b))
#define SetDash(a,b,c)			(((UINT32(*)(APTR, struct RastPort *, UINT32, UINT32)) 	_GETVECADDR(GfxBase,12)) (GfxBase, a,b,c))

#define DrawPoint(a,b,c)		(((UINT32(*)(APTR, struct RastPort *, UINT32, UINT32)) 	_GETVECADDR(GfxBase,13)) (GfxBase, a,b,c))
#define Draw(a,b,c)				(((VOID(*)(APTR, struct RastPort *, UINT32, UINT32)) 	_GETVECADDR(GfxBase,14)) (GfxBase, a,b,c))
With this, the optimizer doesn't see the "call" and thinks: this function is not used, so we don't compile it or link it into the final .img.

So it seems.
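
To illustrate the calling pattern described above, here is a minimal hosted-C sketch of dispatch through a function-pointer table. The thread does not show _GETVECADDR, so this does not reproduce it; the typedef, the two placeholder functions and call_vector are all hypothetical. The point is that the caller only ever performs an indirect call through an index, so no function is referenced by name at the call site.

```c
#include <stdint.h>

typedef uint32_t (*GfxFunc)(uint32_t, uint32_t);

/* Placeholder functions standing in for the gfx_* routines. */
static uint32_t gfx_add(uint32_t a, uint32_t b) { return a + b; }
static uint32_t gfx_mul(uint32_t a, uint32_t b) { return a * b; }

/* The jump table: the only place the functions are named. */
static const GfxFunc func_tab[] = { gfx_add, gfx_mul };

/* Look up a vector by index and call it; the compiler sees only an indirect call. */
static uint32_t call_vector(const GfxFunc *table, unsigned index,
                            uint32_t a, uint32_t b)
{
    return table[index](a, b);
}
```

If nothing in the program refers to the table itself, an optimizing build is free to drop both the table and the functions it lists.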

Cycl0ne
Posts: 102
Joined: Mon Jun 25, 2012 8:03 am

Re: Framebuffer Speed

Fri Aug 03, 2012 11:00 pm

tufty wrote: Ah, -O0 is going to (and has done) produce some really braindead code. It's fine for debugging, but there's no way you can use it in production. Rather than taking your loop variable into a register, for example, it stores it in the stack frame, and pulls it out every time it needs to update it, followed by pushing it back again. I would expect -O3 to produce something within an order of magnitude of the hand-tuned memcpy stuff.
Got -O2 running, and with it:
C:
Time: 71456 Hz
ASM Memset:
Time: 22156 Hz

Got better ;)

Edit: MEMSET vs DMA
Time: 22159 Hz
Time: 30316 Hz

So DMA is fast, but Memset32 is faster ;)

Cycl0ne
Posts: 102
Joined: Mon Jun 25, 2012 8:03 am

Re: Framebuffer Speed

Fri Aug 03, 2012 11:35 pm

tufty wrote: As to why your code's disappearing, I don't know. Are you using gcc to assemble and link in one step?
I found the "fault", thanks to valtonia. :)

As I thought, the -O1 to -O3 optimizer finds my static jump table/romtag, which is not referenced directly. It optimizes it away, so the functions can't be used and the code is never initialized.

Explanation:
The romtag I use has a pointer to the init function: no romtag, no init(). With this romtag you can easily add new modules to the system just by linking them into the main image, without modifying code. I have a romtag scanner which searches RAM for this romtag and initializes the code.

No Romtag -> no init
No Jumptable -> no code

Putting a volatile in front of the jump table and romtag tells the optimizer to leave the structures alone. That was it. Now the code is 20 KB smaller and faster. But the next problem: my SDHCI driver failed. Due to the "fast" code, I had to increase the waiting time for the MMC chip ;-)
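
The fix described above was volatile. As an alternative sketch, GCC and Clang also provide a `used` attribute intended for objects that nothing references by name. Everything below is hypothetical: the struct layout, the marker value and the scanner are invented for illustration and are not the poster's code. A real scanner would step through RAM; this one walks an array of candidate structs so it stays within defined C behaviour.

```c
#include <stdint.h>

#define ROMTAG_MAGIC 0x524F4D54u  /* hypothetical marker value ("ROMT") */

struct RomTag {
    uint32_t magic;             /* marker the scanner looks for */
    const struct RomTag *self;  /* points back at this struct to reject false hits */
    int (*init)(void);          /* module entry point */
};

static int gfx_init_calls;
static int gfx_init(void) { gfx_init_calls++; return 0; }

/* Nothing refers to this object by name in normal code; `used` asks the
   compiler to emit it anyway (GCC/Clang extension). */
static const struct RomTag gfx_romtag __attribute__((used)) = {
    ROMTAG_MAGIC, &gfx_romtag, gfx_init
};

/* Call init() for every valid romtag in [begin, end); return how many were found. */
static int scan_romtags(const struct RomTag *begin, const struct RomTag *end)
{
    int found = 0;
    for (const struct RomTag *p = begin; p < end; p++) {
        if (p->magic == ROMTAG_MAGIC && p->self == p) {
            p->init();
            found++;
        }
    }
    return found;
}
```

The self pointer is a cheap guard against a random word in memory that happens to equal the marker.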

DexOS
Posts: 876
Joined: Wed May 16, 2012 6:32 pm
Contact: Website

Re: Framebuffer Speed

Sat Aug 04, 2012 3:23 pm

Cycl0ne wrote: Got -O2 running, and with it:
C:
Time: 71456 Hz
ASM Memset:
Time: 22156 Hz

Got better ;)

Edit: MEMSET vs DMA
Time: 22159 Hz
Time: 30316 Hz

So DMA is fast, but Memset32 is faster ;)
The advantage of DMA is that you can do something else while the DMA is moving the data.
http://en.wikipedia.org/wiki/Direct_memory_access

Cycl0ne
Posts: 102
Joined: Mon Jun 25, 2012 8:03 am

Re: Framebuffer Speed

Sat Aug 04, 2012 3:40 pm

Disadvantage of DMA: every transfer has to be 128-bit aligned, so blitting/moving/filling gfx on screen will make your operations miss 127 pixels ;) in 32/24bpp mode. Lower bpp is worse.

DexOS
Posts: 876
Joined: Wed May 16, 2012 6:32 pm
Contact: Website

Re: Framebuffer Speed

Sat Aug 04, 2012 4:28 pm

Cycl0ne wrote: Disadvantage of DMA: every transfer has to be 128-bit aligned, so blitting/moving/filling gfx on screen will make your operations miss 127 pixels ;) in 32/24bpp mode. Lower bpp is worse.
Well, I have not even looked at DMA for the Pi, but I know someone who's making a bare metal GBA emulator, and he uses DMA on the Pi and does not have any such problem ;) .

dom
Raspberry Pi Engineer & Forum Moderator
Posts: 5213
Joined: Wed Aug 17, 2011 7:41 pm
Location: Cambridge

Re: Framebuffer Speed

Sat Aug 04, 2012 5:02 pm

Cycl0ne wrote: Disadvantage of DMA: every transfer has to be 128-bit aligned, so blitting/moving/filling gfx on screen will make your operations miss 127 pixels ;) in 32/24bpp mode. Lower bpp is worse.
Not true.

Cycl0ne
Posts: 102
Joined: Mon Jun 25, 2012 8:03 am

Re: Framebuffer Speed

Sat Aug 04, 2012 5:04 pm

DexOS wrote:
Cycl0ne wrote: Disadvantage of DMA: every transfer has to be 128-bit aligned, so blitting/moving/filling gfx on screen will make your operations miss 127 pixels ;) in 32/24bpp mode. Lower bpp is worse.
Well, I have not even looked at DMA for the Pi, but I know someone who's making a bare metal GBA emulator, and he uses DMA on the Pi and does not have any such problem ;) .
Hmm, with a GBA emulator everything works differently:
You emulate everything into a framebuffer in RAM and copy the final result to the real framebuffer of the VC. Yes, this is possible with DMA. Same with a ClearScreen via DMA. But if you poke into the screen with DMA, I think you will run into a problem.

Cycl0ne
Posts: 102
Joined: Mon Jun 25, 2012 8:03 am

Re: Framebuffer Speed

Sat Aug 04, 2012 5:10 pm

dom wrote:Not true.
Hmm, as mentioned in another post, the Broadcom doc says 128-bit/32-bit; the other 8-bit mode is not documented. And 8-bit alignment = 7 pixel offset in 32 bpp, or? Since the last 3 bits of the address are 0.

If my framebuffer starts at 0x0 and I want to put something at position 1,1 to 10,10, the start of the memory to blit is 0x1 <- not aligned ....

dom
Raspberry Pi Engineer & Forum Moderator
Posts: 5213
Joined: Wed Aug 17, 2011 7:41 pm
Location: Cambridge

Re: Framebuffer Speed

Sat Aug 04, 2012 5:41 pm

Cycl0ne wrote: Hmm, as mentioned in another post, the Broadcom doc says 128-bit/32-bit; the other 8-bit mode is not documented. And 8-bit alignment = 7 pixel offset in 32 bpp, or? Since the last 3 bits of the address are 0.

If my framebuffer starts at 0x0 and I want to put something at position 1,1 to 10,10, the start of the memory to blit is 0x1 <- not aligned ....
The 32-bit or 128-bit is purely width of access on the bus, and the only reason to select 32-bit is for peripheral accesses.
The byte enable masks are correctly applied, so arbitrary byte aligned start addresses, lengths and strides are fine.
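
To make the start address, length and stride in the rectangle example concrete, here is a small sketch of the arithmetic. It assumes a 1600-pixel-wide, 32bpp framebuffer with a pitch of 6400 bytes and no row padding; the real pitch should come from the framebuffer setup. The struct and function names are hypothetical, and no DMA registers are touched.

```c
#include <stdint.h>

/* Byte offset of pixel (x, y) from the start of the framebuffer. */
static uint32_t pixel_offset(uint32_t x, uint32_t y,
                             uint32_t pitch, uint32_t bytes_per_pixel)
{
    return y * pitch + x * bytes_per_pixel;
}

/* A rectangle expressed as start offset, bytes per row, rows, and row skip. */
struct BlitRect {
    uint32_t start;      /* byte offset of the first pixel */
    uint32_t row_bytes;  /* bytes to write in each row */
    uint32_t rows;       /* number of rows */
    uint32_t skip;       /* bytes from the end of one row to the start of the next */
};

/* Describe the inclusive rectangle (x0, y0) to (x1, y1). */
static struct BlitRect blit_rect(uint32_t x0, uint32_t y0, uint32_t x1, uint32_t y1,
                                 uint32_t pitch, uint32_t bytes_per_pixel)
{
    struct BlitRect r;
    r.start = pixel_offset(x0, y0, pitch, bytes_per_pixel);
    r.row_bytes = (x1 - x0 + 1) * bytes_per_pixel;
    r.rows = y1 - y0 + 1;
    r.skip = pitch - r.row_bytes;
    return r;
}
```

For the 1,1 to 10,10 rectangle this gives a start offset of 6404 bytes (not 0x1), 40 bytes per row, 10 rows and a skip of 6360 bytes. The start is not 16-byte aligned, which per the reply above is fine because byte-aligned starts, lengths and strides are supported.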
