Heater
Posts: 12714
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Thu Jan 17, 2019 6:25 am

Code: Select all

$ node
> a = [1, 2, 3, 4]
[ 1, 2, 3, 4 ]
> b = [5, 6, 7, 8]
[ 5, 6, 7, 8 ]
> t = a
[ 1, 2, 3, 4 ]
> a = b
[ 5, 6, 7, 8 ]
> b = t
[ 1, 2, 3, 4 ]
No array copy happens there.
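
It is just pointer swapping. In C terms it is roughly this (a hypothetical equivalent for illustration, not what V8 actually does internally):

Code: Select all

#include <stdio.h>

int main(void)
{
    int one[] = {1, 2, 3, 4};
    int two[] = {5, 6, 7, 8};

    int *a = one;   /* a and b are references, like the JS variables */
    int *b = two;

    int *t = a;     /* only the pointers move... */
    a = b;
    b = t;          /* ...no element is ever copied */

    printf("a[0] = %d, b[0] = %d\n", a[0], b[0]);   /* a[0] = 5, b[0] = 1 */
    return 0;
}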

ejolson
Posts: 3078
Joined: Tue Mar 18, 2014 11:47 am

Re: Why Avoid BASIC on RPi?

Thu Jan 17, 2019 7:02 am

Heater wrote:
Thu Jan 17, 2019 6:25 am

Code: Select all

$ node
> a = [1, 2, 3, 4]
[ 1, 2, 3, 4 ]
> b = [5, 6, 7, 8]
[ 5, 6, 7, 8 ]
> t = a
[ 1, 2, 3, 4 ]
> a = b
[ 5, 6, 7, 8 ]
> b = t
[ 1, 2, 3, 4 ]
No array copy happens there.
Nice! For completeness, how do you delete the t variable after all that so the reference counts stay at one? Or is something else going on?

Heater
Posts: 12714
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Thu Jan 17, 2019 9:01 am

Code: Select all

> t = undefined
undefined
>

ejolson
Posts: 3078
Joined: Tue Mar 18, 2014 11:47 am

Re: Why Avoid BASIC on RPi?

Fri Jan 18, 2019 8:48 am

ejolson wrote:
Tue Jan 15, 2019 8:06 pm
Heater wrote:
Tue Jan 15, 2019 7:56 pm
This BASIC thing has really gone to your head!
I'm just trying to stay off topic in this off-topic thread. Along those lines, it would be nice if a RISC OS BBC BASIC code (with or without inline assembler) were to appear.
I have finished writing visual.bas, a version of the Fibonacci code in Visual Basic. This program is based on the FreeBASIC fibo.bas program, with syntax modifications so it will compile and the addition of the missing Karatsuba algorithm. The program was compiled using the open-source vbnc compiler running in the mono environment with the optimizer turned on and integer range checking turned off. The best out of three runs are reported here:

Code: Select all

Pi Zero timings of the Fibonacci code

                   BUILD TIME    RUN TIME     TOTAL
serial                7.459       24.195     31.654
fibonacci             5.060       80.946     86.006
fibo_karatserial     56.154       40.090     96.244
integervol           17.886       87.556    105.442
visual               14.146      164.411    177.557
classic              10.851      217.004    227.855

$ time vbnc /optimize /removeintchecks visual.bas
$ time mono visual.exe >visual.txt
It should be noted that visual.bas is not compiled into a native ARM executable but instead into some sort of machine-independent intermediate code. This intermediate code is subsequently compiled at run time by the mono just-in-time compiler. It would be interesting to know how Microsoft's implementation of Visual Basic compares in terms of efficiency. Could one compile visual.bas using the official Microsoft compiler and then run the resulting intermediate code under mono on Raspbian? What about the other way around? How do the two implementations of Visual Basic compare on x86?

For reference the code is

Code: Select all

module fibonacci

'    visual.bas -- Compute the nth Fibonacci Number
'    Written January 14, 2019 by Eric Olson
'
'    This program demonstrates the expressiveness of Visual Basic
'    in the mono virtual machine as measured by explicitly coding
'    big-number arithmetic and then using the doubling formulas
'    
'        F(2k) = F(k)[2F(k+1)-F(k)]
'      F(2k+1) = F(k+1)[2F(k+1)-F(k)]+(-1)^(k+1)
'
'    and
'
'      F(2k+1) = F(k)[2F(k)+F(k+1)]+(-1)^(k)
'      F(2k+2) = F(k+1)[2F(k)+F(k+1)]
'     
'    to compute the nth Fibonacci number.

dim cmult,rexp,nmax as integer
dim radix as uinteger
dim cadd,cmax as ulong

sub bigprint(x() as uinteger)
    dim i as integer
    if x(0)=0 then
        console.writeline("0")
    else
        for i=x(0) to 1 step -1
            dim s as string
            dim j as integer
            j=x(i):s=str(j)
            if mid(s,1,1)=" " then
                s=mid(s,2,len(s)-1)
            end if
            if i<x(0) then
                s=mid("0000000000",1,rexp-len(s))+s
            end if
            console.write(s)
        next i
        console.writeline()
    end if
end sub

sub biginc(z() as uinteger)
    dim i as integer
    dim t,c as uinteger
    i=1:c=1
    while c
        t=z(i)+1
        if t>=radix then
            z(i)=t-radix
        else
            z(i)=t
            c=0
        end if
        i=i+1
    end while
end sub

sub bigadd(z() as uinteger,x() as uinteger,y() as uinteger)
    dim i,max as integer
    if y(0)>x(0) then
        max=y(0)
    else
        max=x(0)
    end if
    for i=1 to max+1
        z(i)=0
    next i
    for i=1 to max
        dim t,c as integer
        c=0
        if i<=x(0) then
            t=z(i)+x(i)
            if i<=y(0) then
                t=t+y(i)
            end if
        else
            t=z(i)+y(i)
        end if
        if t>=radix then
            c=1
            t=t-radix
        end if
        z(i)=t
        z(i+1)=z(i+1)+c
    next i
    z(0)=max+1
    while z(z(0))=0 and z(0)>1
        z(0)=z(0)-1
    end while
    return
end sub

sub bigdec(z() as uinteger)
    dim i as integer
    dim c as uinteger
    i=1:c=1
    while c
        if z(i)<1 then
            z(i)=radix-1
        else
            z(i)=z(i)-1
            c=0
        end if
        i=i+1
    end while
end sub

sub bigsub(z() as uinteger,x() as uinteger,y() as uinteger)
    dim i as integer
    dim t,c as uinteger
    c=0
    for i=1 to y(0)
        t=y(i)+c
        if x(i)<t then
            z(i)=x(i)+radix-t
            c=1
        else
            z(i)=x(i)-t
            c=0
        end if
    next i
    i=y(0)+1
    while c
        if x(i)<1 then
            z(i)=x(i)+radix-1
        else
            z(i)=x(i)-1
            c=0
        end if
        i=i+1
    end while
    while i<=x(0)
        z(i)=x(i)
        i=i+1
    end while
    z(0)=x(0)
    while z(z(0))=0 and z(0)>1
        z(0)=z(0)-1
    end while
    return
end sub

dim xy() as ulong
sub bigmulw(z() as uinteger,x() as uinteger,y() as uinteger)
    dim i,j,imax as integer
    imax=x(0)+y(0)
    for i=1 to imax
        xy(i)=0
    next i
    for i=1 to x(0)
        for j=1 to y(0)
            dim xt,yt as ulong
            xt=x(i):yt=y(j)
            xy(i+j-1)=xy(i+j-1)+xt*yt
        next j
        if(i mod cmult = 0) then
            dim k,kmax as integer
            kmax=i+y(0)-1
            for k=1 to kmax
                if(xy(k)>=cmax) then
                    xy(k)=xy(k)-cmax
                    xy(k+1)=xy(k+1)+cadd
                end if
            next k
        end if
    next i
    for i=1 to imax-1
        xy(i+1)=xy(i+1)+xy(i)\radix
        z(i)=xy(i) mod radix
    next i
    z(imax)=xy(imax)
    while z(imax)=0 and imax>1
        imax=imax-1
    end while 
    z(0)=imax
    return
end sub

sub bighigh(z() as uinteger,x() as uinteger,n as integer)
    if x(0) <= n then
        z(0)=0
        return
    end if
    dim i as integer
    for i=n+1 to x(0)
        z(i-n)=x(i)
    next i
    z(0)=x(0)-n
    return
end sub

sub biglow(z() as uinteger,x() as uinteger,n as integer)
    if x(0) <= n then z(0)=x(0) else z(0)=n
    dim i as integer
    for i=1 to z(0)
        z(i)=x(i)
    next i
    return
end sub

sub bigmul(z() as uinteger,x() as uinteger,y() as uinteger)
    if x(0)<44 or y(0)<44 then
        bigmulw(z,x,y)
        return
    end if
    dim i,M,n as integer
    if x(0)<y(0) then M=y(0) else M=x(0)
    n=M\2
    dim xl(n+1),xh(n+1),yl(n+1),yh(n+1) as uinteger
    biglow(xl,x,n):bighigh(xh,x,n)
    biglow(yl,y,n):bighigh(yh,y,n)
    dim z0(M+1),z1(M+3),z2(M+3) as uinteger
    bigadd(z0,xl,xh)
    bigadd(z2,yl,yh)
    bigmul(z1,z0,z2)
    bigmul(z0,xl,yl)
    bigmul(z2,xh,yh)
    bigsub(z1,z1,z0)
    bigsub(z1,z1,z2)
    dim imax as integer
    imax=x(0)+y(0)
    dim t,c as uinteger
    c=0
    for i=1 to imax
        t=c
        if i<=z0(0) then t=t+z0(i)
        if i>n and i-n<=z1(0) then t=t+z1(i-n)
        if i>2*n and i-2*n<=z2(0) then t=t+z2(i-2*n)
        c=0
        while t>=radix
            t=t-radix
            c=c+1
        end while
        z(i)=t
    next i
    while z(imax)=0 and imax>1
        imax=imax-1
    end while 
    z(0)=imax
    return
end sub

dim a(),b(),t1(),t2() as uinteger
sub fibowork(n as integer)
    if n=0 then
        a(0)=0:b(0)=1:b(1)=1
        return
    end if
    fibowork(n\2)
    if n mod 2=0 then
        rem [a,b]=[a*(2*b-a),b*(2*b-a)-(-1)^k]
        bigadd(t1,b,b):bigsub(t2,t1,a)
        bigmul(a,a,t2):bigmul(b,b,t2)
        if n mod 4=0 then bigdec(b) else biginc(b)
    else
        rem [a,b]=[a*(2*a+b)+(-1)^k,b*(2*a+b)]
        bigadd(t1,a,a):bigadd(t2,t1,b)
        bigmul(b,b,t2):bigmul(a,a,t2)
        if n mod 4=1 then biginc(a) else bigdec(a)
    end if
    return
end sub

sub fibo(n as integer)
    if n<2 then
        b(0)=1:b(1)=n
        return
    end if
    fibowork((n-1)\2)
    if n mod 2=0 then
        rem b=b*(2*a+b)
        bigadd(t1,a,a):bigadd(t2,t1,b):bigmul(b,b,t2)
    else
        rem b=b*(2*b-a)-(-1)^k
        bigadd(t1,b,b):bigsub(t2,t1,a):bigmul(b,b,t2)
        if n mod 4=1 then bigdec(b) else biginc(b)
    end if
    return
end sub

function main() as integer
    dim n as integer
    n=4784969
'    n=7499
    rexp=9
    cmult=5
    radix=10^rexp
    cadd=radix:cadd=cadd*cmult
    cmax=cadd*radix
    nmax=250100
    redim a(nmax),b(nmax),t1(nmax),t2(nmax),xy(nmax)
    fibo(n)
    bigprint(b)
    return 0
end function 

end module

Steve Drain
Posts: 105
Joined: Tue Oct 30, 2012 2:08 pm
Location: Exeter UK

Re: Why Avoid BASIC on RPi?

Fri Jan 18, 2019 9:50 am

ejolson wrote:
Fri Jan 18, 2019 8:48 am
it would be nice if a RISC OS BBC BASIC code were to appear.
Just for fun I set an integer version of my code running. It finished after 1117 minutes. :-)

That is extreme, because there are some optimisations still available.

ejolson
Posts: 3078
Joined: Tue Mar 18, 2014 11:47 am

Re: Why Avoid BASIC on RPi?

Fri Jan 18, 2019 5:04 pm

Steve Drain wrote:
Fri Jan 18, 2019 9:50 am
ejolson wrote:
Fri Jan 18, 2019 8:48 am
it would be nice if a RISC OS BBC BASIC code were to appear.
Just for fun I set an integer version of my code running. It finished after 1117 minutes. :-)

That is extreme, because there are some optimisations still available.
It is lucky that the Raspberry Pi runs on only a watt or two of electricity.

It was difficult to get the integer version of the line-numbered Basic program to work well because the size of the integers was small compared to double-precision floating point. When less work is done per operation, it is even more important that the loops be efficient.

I have a theory about how the gcc backend got broken in Free Basic: I think Free Basic generates C89-compliant code, but newer versions of gcc since 5.x use C99 as the default.

Heater
Posts: 12714
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Fri Jan 18, 2019 6:12 pm

I think my whole fibo(4784969) challenge idea is driving me insane....

I'm pretty sure that the last time I compared the parallel C solution with my parallel fibo_karatsuba in C++, the C++ was quite a way behind and was not scaling as well.

Today I ran them on my PC again:

Code: Select all

$ gcc -Wall -Wextra -O3 -fopenmp -o parallel parallel.c
$ time ./parallel > /dev/null
Using 8 cores.

real    0m0.233s
user    0m1.438s
sys     0m0.078s
$ cd ../c++/
$ g++ -Wall -Wextra  -std=c++17 -O2 -fopenmp -DUSE_OMP  -o fibo_karatsuba fibo_karatsuba.cpp fibo.cpp
$ time ./fibo_karatsuba > /dev/null

real    0m0.242s
user    0m1.250s
sys     0m0.109s

That is less than 4% behind!

That is before I started to think about ejolson's suggestion to split my karatsuba into two, one parallel and one serial. I still have them combined.

Could some kind soul confirm (or otherwise) this result? Latest code on github, not that it has changed noticeably.

Meanwhile, even more crazy people are posting ever more weird solutions in BASIC (I'll get around to them).

Meanwhile I discovered this: "Parallelism in Modern C++": https://www.youtube.com/watch?v=6Z3_qaFYF84

HPX promises to be a C++ standards-compliant implementation of std::async and the like: parallel processing on steroids, from Raspberry Pi to supercomputers. Which of course I will have to try out...

Edit: Sorry, wrong video, this one is better: “The Asynchronous C++ Parallel Programming Model” : https://www.youtube.com/watch?v=js-e8xAMd1s

Paeryn
Posts: 2569
Joined: Wed Nov 23, 2011 1:10 am
Location: Sheffield, England

Re: Why Avoid BASIC on RPi?

Fri Jan 18, 2019 9:05 pm

ejolson wrote:
Fri Jan 18, 2019 5:04 pm
I have a theory about how the gcc backend got broken in Free Basic: I think Free Basic generates C89-compliant code, but newer versions of gcc since 5.x use C99 as the default.
No, it's just plain broken. Automatic variables in the function that called setjmp() (in FreeBasic's case the entire program is in one large function) which have been altered since the call to setjmp() must be declared volatile, otherwise they will have indeterminate values when longjmp() causes execution to go back.

It's always been that way; you can get away with it so long as you don't have any optimisations switched on (though even that is no guarantee), as in that case gcc near-enough treats all variables as if they were volatile (in that all variables get written to memory and are always read from memory).
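
A minimal demonstration of the rule (hand-written for illustration, not FreeBasic output):

Code: Select all

#include <setjmp.h>
#include <stdio.h>

static jmp_buf env;

int main(void)
{
    int plain = 0;          /* modified after setjmp(): indeterminate after longjmp() */
    volatile int safe = 0;  /* volatile: guaranteed to survive the longjmp() */

    if (setjmp(env) == 0) {
        plain = 42;
        safe = 42;
        longjmp(env, 1);    /* execution resumes at setjmp(), which "returns" 1 */
    }

    /* With optimisation on, "plain" may legitimately print 0, 42 or anything */
    printf("plain = %d, safe = %d\n", plain, safe);
    return 0;
}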
She who travels light — forgot something.

Heater
Posts: 12714
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Fri Jan 18, 2019 9:25 pm

Paeryn,
No, it's just plain broken. Automatic variables in the function that called setjmp() (in FreeBasic's case the entire program is in one large function) which have been altered since the call to setjmp() must be declared volatile, otherwise they will have indeterminate values when longjmp() causes execution to go back.
I was wondering about this the other day...

I have not looked at the generated code but I guessed that the entire BASIC program came out as a single huge C function.

If that is the case then why use setjmp/longjmp? Why not just drop a label and use "goto" to get to it when required? As you say, setjmp/longjmp is not guaranteed to work even at -O0; it's undefined behavior in C.

By using "goto" the compiler will know how to keep variables in order without using any "volatile". And likely be faster. This really is not what volatile is for.

Alternatively why not handle it as one big state machine, using a switch()? After all, every place you can GOTO or GOSUB to in BASIC is a state of the program.

I don't see why they use setjmp/longjmp.
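
For example, a whole program like 10 GOSUB 100 / 20 PRINT "AFTER" / 30 END / 100 PRINT "SUB" / 110 RETURN could come out as one switch, something like this (a hypothetical hand translation, certainly not actual FreeBasic output):

Code: Select all

#include <stdio.h>

/* Every BASIC line number becomes a state; GOSUB pushes the state to
   return to, RETURN pops it. */
int main(void)
{
    int state = 10;
    int stack[64];
    int sp = 0;

    for (;;) {
        switch (state) {
        case 10:  stack[sp++] = 20; state = 100; break;  /* GOSUB 100 */
        case 20:  printf("AFTER\n"); state = 30; break;  /* PRINT */
        case 30:  return 0;                              /* END */
        case 100: printf("SUB\n"); state = 110; break;   /* PRINT */
        case 110: state = stack[--sp]; break;            /* RETURN */
        }
    }
}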

Paeryn
Posts: 2569
Joined: Wed Nov 23, 2011 1:10 am
Location: Sheffield, England

Re: Why Avoid BASIC on RPi?

Fri Jan 18, 2019 10:33 pm

Heater wrote:
Fri Jan 18, 2019 9:25 pm
Paeryn,
No, it's just plain broken. Automatic variables in the function that called setjmp() (in FreeBasic's case the entire program is in one large function) which have been altered since the call to setjmp() must be declared volatile, otherwise they will have indeterminate values when longjmp() causes execution to go back.
I was wondering about this the other day...

I have not looked at the generated code but I guessed that the entire BASIC program came out as a single huge C function.

If that is the case then why use setjmp/longjmp? Why not just drop a label and use "goto" to get to it when required? As you say, setjmp/longjmp is not guaranteed to work even at -O0; it's undefined behavior in C.

By using "goto" the compiler will know how to keep variables in order without using any "volatile". And likely be faster. This really is not what volatile is for.

Alternatively why not handle it as one big state machine, using a switch()? After all, every place you can GOTO or GOSUB to in BASIC is a state of the program.

I don't see why they use setjmp/longjmp.
Using goto would be problematic as sub-routines could be called from more than one place. You could probably get away with having return end with a jump-table of all possible goto labels, and both gosub and return would need to maintain a stack of indices to say which entry in that jump-table is needed.
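
Something along those lines would look like this (a hypothetical sketch of the same kind of program, 10 GOSUB 100 / 20 PRINT "AFTER" / 30 END / 100 PRINT "SUB" / 110 RETURN, not actual FreeBasic output):

Code: Select all

#include <stdio.h>

int main(void)
{
    int ret_stack[64];
    int sp = 0;

    /* 10 GOSUB 100 */
    ret_stack[sp++] = 0;        /* index 0 = "return to line 20" */
    goto line100;
line20:
    /* 20 PRINT "AFTER" */
    printf("AFTER\n");
    /* 30 END */
    return 0;
line100:
    /* 100 PRINT "SUB" */
    printf("SUB\n");
    /* 110 RETURN: dispatch through the jump-table of return labels */
    switch (ret_stack[--sp]) {
    case 0: goto line20;
    }
    return 1;                   /* only reached if the stack is corrupt */
}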

The author probably thought that the compiler would automatically synchronise the variables without understanding why it doesn't.
She who travels light — forgot something.

ejolson
Posts: 3078
Joined: Tue Mar 18, 2014 11:47 am

Re: Why Avoid BASIC on RPi?

Sat Jan 19, 2019 12:30 am

Paeryn wrote:
Fri Jan 18, 2019 9:05 pm
ejolson wrote:
Fri Jan 18, 2019 5:04 pm
I have a theory about how the gcc backend got broken in Free Basic: I think Free Basic generates C89-compliant code, but newer versions of gcc since 5.x use C99 as the default.
No, it's just plain broken. Automatic variables in the function that called setjmp() (in FreeBasic's case the entire program is in one large function) which have been altered since the call to setjmp() must be declared volatile, otherwise they will have indeterminate values when longjmp() causes execution to go back.

It's always been that way; you can get away with it so long as you don't have any optimisations switched on (though even that is no guarantee), as in that case gcc near-enough treats all variables as if they were volatile (in that all variables get written to memory and are always read from memory).
Oops. I was under the impression that volatile was not part of the ANSI/C89 standard. I guess it is, so never mind. Is it possible FreeBasic originally targeted a K&R-style C compiler as the code generator?

ejolson
Posts: 3078
Joined: Tue Mar 18, 2014 11:47 am

Re: Why Avoid BASIC on RPi?

Sat Jan 19, 2019 2:41 am

Heater wrote:
Fri Jan 18, 2019 6:12 pm
I think my whole fibo(4784969) challenge idea is driving me insane....
Congratulations. Your parallel C++ code seems much better suited to 64-bit Intel hardware than my parallel C code. Here are the compilation options:

Code: Select all

/usr/local/gcc-8.2/bin/gcc -O3 -march=native -mtune=native -fopenmp -o parallel fibo_4784969/c/parallel.c -lm
/usr/local/gcc-8.2/bin/g++ -O3 -march=native -mtune=native -DUSE_OMP -fopenmp -std=c++17 -o fibo_karatomp \
    fibo_4784969/c++/fibo_karatsuba.cpp fibo_4784969/c++/fibo.cpp -lm
/usr/local/gcc-8.2/bin/g++ -O3 -march=native -mtune=native -DUSE_ASYNC -std=c++17 -o fibo_karatasync \
    fibo_4784969/c++/fibo_karatsuba.cpp fibo_4784969/c++/fibo.cpp -lpthread -lm
The programs were run on 1 to 8 cores of a dual Xeon Gold 6126 server, a slightly more up-to-date model than the one linked earlier on ebay. The normalized parallel scaling is

[Graph: normalized parallel scaling of the three programs on 1 to 8 cores]

However, the raw performance results tell a different story.

[Graph: raw run times on 1 to 8 cores]

I find it interesting that the single-core performance of the C++ code is 65% faster than the C code on the Xeon server, while the C code is 65% faster than the C++ code on a Pi Zero. In numbers

Code: Select all

SINGLE-CORE RUN TIMES MEASURED IN SECONDS
         
                     Pi Zero    Gold 6126
parallel.c            24.195       0.632
fibo_karatsuba.cpp    40.090       0.382
From another point of view, the single-core performance of the Xeon processor appears 105 times faster than the Pi when running the C++ code; however, when running the C code it appears only 38 times faster. I suspect the ARM and Intel marketing teams would prefer C and C++ respectively. At the same time BASIC runs slower on both.
Last edited by ejolson on Sat Jan 19, 2019 3:49 am, edited 1 time in total.

Paeryn
Posts: 2569
Joined: Wed Nov 23, 2011 1:10 am
Location: Sheffield, England

Re: Why Avoid BASIC on RPi?

Sat Jan 19, 2019 3:43 am

ejolson wrote:
Sat Jan 19, 2019 12:30 am
Oops. I was under the impression that volatile was not part of the ANSI/C89 standard. I guess it is, so never mind. Is it possible FreeBasic originally targeted a K&R-style C compiler as the code generator?
Ah yes, volatile was introduced in C89. I've just found a scanned archive of the 1st edition of the K&R book, and volatile wasn't a keyword then. Nor were setjmp() and longjmp() mentioned, so I don't know what (if any) support there was for them before C89 (I learnt C a year or two after).
She who travels light — forgot something.

Heater
Posts: 12714
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Sat Jan 19, 2019 6:32 am

Paeryn,
Using goto would be problematic as sub-routines could be called from more than one place.
I thought the traditional way to do this was to implement a state machine. Every possible place that can be jumped to in the original source code becomes a state in the generated code. GOTO would then be a simple change of the state variable. GOSUB would be the same, but pushes the current state onto a stack so as to be able to RETURN by popping the state variable.

That is how I have implemented emulators and crude interpreters before; I have no idea how well it works when generating code to run the thing rather than just interpreting the original source.

jahboater
Posts: 4464
Joined: Wed Feb 04, 2015 6:38 pm

Re: Why Avoid BASIC on RPi?

Sat Jan 19, 2019 7:43 am

Paeryn wrote:
Sat Jan 19, 2019 3:43 am
ejolson wrote:
Sat Jan 19, 2019 12:30 am
Oops. I was under the impression that volatile was not part of the ANSI/C89 standard. I guess it is, so never mind. Is it possible FreeBasic originally targeted a K&R-style C compiler as the code generator?
Ah yes, volatile was introduced in C89. I've just found a scanned archive of the 1st edition of the K&R book, and volatile wasn't a keyword then. Nor were setjmp() and longjmp() mentioned, so I don't know what (if any) support there was for them before C89 (I learnt C a year or two after).
I have a program that I translated from B to C in 1984 that uses setjmp/longjmp. It was just a library function.
Setjmp or something similar would have been in B as well.

Something interactive like an old BASIC interpreter would need a non-local goto to return to the main driving loop after an interrupt; the user pressing ^C to stop a running program, for example. Probably more efficient than constantly checking a global flag.
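
The classic shape of it is something like this (just a sketch, not taken from any real interpreter):

Code: Select all

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>

static sigjmp_buf toplevel;

static void on_interrupt(int sig)
{
    (void)sig;
    siglongjmp(toplevel, 1);    /* abandon the running BASIC program */
}

int main(void)
{
    char line[256];

    signal(SIGINT, on_interrupt);

    if (sigsetjmp(toplevel, 1) != 0)
        printf("\nBREAK\n");    /* we land back here after ^C */

    for (;;) {
        printf("READY.\n");     /* main driving loop */
        if (fgets(line, sizeof line, stdin) == NULL)
            return 0;
        /* ...parse and run the line; ^C during a run jumps back above */
    }
}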

Heater
Posts: 12714
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Sat Jan 19, 2019 8:14 am

ejolson,
Congratulations. Your parallel C++ code seems much better suited to 64-bit Intel hardware than my parallel C code. Here are the compilation options:
Thanks. But it's not so cut and dried at my end...

Those are very nice results and graphs but somewhat perplexing.

Your raw performance graph shows fibo_karatomp always faster than parallel, quite significantly with more than one core. Here, on my PC, I see fibo_karatomp and parallel neck and neck, with parallel if anything a bit faster. How can that be?

The graph also shows fibo_karatasync very much slower than fibo_karatomp; here it's not good, but not as bad as in the graphs.

Code: Select all

MULTI-CORE RUN TIMES MEASURED IN SECONDS
     
                 Win10 PC    Pi 3B
parallel:        0.227s      1.075s
fibo_karatomp:   0.234s      1.489s
fibo_karatasync: 0.303s      1.789s
It's confusing; we are all over the map in relative performance, changing with cores and platforms.
I suspect the ARM and Intel marketing teams would prefer alternatively C++ or C.
Nah, it's all stuffed through the same GCC: same intermediate representation, same optimizers, same code generators. I have seen C++ compiled, byte for byte, to the same instructions as the same functionality in C, the C++ using classes while the C used structs and function pointer parameters to "methods". Wish I'd kept that code; it always shut up those who claim C is more efficient than C++.
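
The pattern on the C side looks something like this (a made-up miniature, since the original code is lost):

Code: Select all

#include <stdio.h>

/* A C "class": state in a struct, the "method" a function pointer that
   takes the struct as its first parameter, essentially the "this"
   pointer C++ passes behind the scenes. */
typedef struct Counter {
    int count;
    void (*add)(struct Counter *self, int n);
} Counter;

static void counter_add(struct Counter *self, int n)
{
    self->count += n;
}

int main(void)
{
    Counter c = { 0, counter_add };
    c.add(&c, 5);               /* C++ would spell this c.add(5) */
    printf("%d\n", c.count);    /* prints 5 */
    return 0;
}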

Rather, I suspect there is something subtle about the way cache is used in our programs that makes them more or less friendly to different architectures. Or something even more subtle that makes the code more friendly to Intel's multiple dispatch, branch prediction, speculative execution or whatever other magic goes on in those CPUs.

Or is it something to do with efficiency of thread switching...

About that async thing... I don't know if you watched that last video I linked to above; you should, it's amazing. The claim is that they can get hugely better parallelism using async and "futures", mostly because things like OpenMP rely on fork/join, which means everything has to run step by step with the slowest node holding everybody up at each step. With async and futures everything flies along until it really needs a result from another thread.

More down to earth, the claim is that they have a very much lighter-weight threading mechanism than the pthreads used by OpenMP; this of course is something I have to try for myself. So far I have only just got HPX built and printing "Hello world"...

ejolson
Posts: 3078
Joined: Tue Mar 18, 2014 11:47 am

Re: Why Avoid BASIC on RPi?

Sat Jan 19, 2019 9:44 am

Heater wrote:
Sat Jan 19, 2019 8:14 am
Or something even more subtle that makes the code more friendly to Intel's multiple dispatch, branch prediction, speculative execution or whatever other magic goes on in those CPUs.
The above hardware features may affect the assumption used in parallel.c that loops with if statements are slow compared to vectorisable arithmetic kernels.
Heater wrote:
Sat Jan 19, 2019 8:14 am
With async and futures everything flies along until it really needs a result from another thread.
I don't think async futures are any better for the kind of recursive parallelism being exploited in the Karatsuba algorithm. Note, however, that it is important to have a proper implementation of fork-join where a sync only blocks for tasks spawned in the current scope.
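
In OpenMP terms that is tasking, where taskwait blocks only on the child tasks spawned by the current task. A generic sketch of the idea (not the actual fibo code), built with gcc -fopenmp:

Code: Select all

#include <stdio.h>

/* Each call spawns two child tasks; the taskwait syncs with those two
   children only, not with tasks spawned elsewhere in the tree. */
static long count(int depth)
{
    if (depth == 0)
        return 1;

    long a = 0, b = 0;
    #pragma omp task shared(a)
    a = count(depth - 1);
    #pragma omp task shared(b)
    b = count(depth - 1);
    #pragma omp taskwait        /* blocks for the two tasks above only */
    return a + b;
}

int main(void)
{
    long r = 0;
    #pragma omp parallel
    #pragma omp single
    r = count(10);
    printf("%ld\n", r);         /* prints 1024 */
    return 0;
}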

Bypassing glibc emulation of POSIX threads is surely a good idea. Given the dominance of Linux, I'm surprised recent versions of OpenMP don't work directly with Linux native threads.

Heater
Posts: 12714
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Sat Jan 19, 2019 10:25 am

ejolson,
The above hardware features may affect the assumption used in parallel.c that loops with if statements are slow compared to vectorisable arithmetic kernels.
Yes, one might naturally assume that sticking a conditional inside a tight loop would slow things down a lot. That was certainly the case back in the day. Now, these Intel guys do a great job of branch prediction and speculative execution, and that conditional ends up costing nothing. It's like some quantum computer that executes all possible pathways through your code simultaneously and then magically produces the right result.
I don't think async futures are any better for the kind of recursive parallelism being exploited in the Karatsuba algorithm.
Karatsuba and our fibo algorithms may be tough nuts for this, but in that presentation of HPX he jumps straight into an example that is very recursive. He has very convincing arguments as to why async/future can make a lot more efficient use of hardware than endless forking and joining.
Bypassing glibc emulation of POSIX threads is surely a good idea. Given the dominance of Linux, I'm surprised recent versions of OpenMP don't work directly with Linux native threads.
As far as I can tell they are not even using Linux native threads, rather doing it all in user space for orders of magnitude less overhead in context switching.

Anyway, I'm not sure I'm going to get off the ground with HPX. I changed my std::async to hpx::async and now it takes ten times longer to build the code; not something I want to riff on. After all that the end result is a segfault. Now what?

ejolson
Posts: 3078
Joined: Tue Mar 18, 2014 11:47 am

Re: Why Avoid BASIC on RPi?

Sat Jan 19, 2019 7:22 pm

Heater wrote:
Sat Jan 19, 2019 10:25 am
ejolson,
The above hardware features may affect the assumption used in parallel.c that loops with if statements are slow compared to vectorisable arithmetic kernels.
Yes, one might naturally assume that sticking a conditional inside a tight loop would slow things down a lot. That was certainly the case back in the day. Now, these Intel guys do a great job of branch prediction and speculative execution, and that conditional ends up costing nothing. It's like some quantum computer that executes all possible pathways through your code simultaneously and then magically produces the right result.
I don't think async futures are any better for the kind of recursive parallelism being exploited in the Karatsuba algorithm.
Karatsuba and our fibo algorithms may be tough nuts for this, but in that presentation of HPX he jumps straight into an example that is very recursive. He has very convincing arguments as to why async/future can make a lot more efficient use of hardware than endless forking and joining.
Bypassing glibc emulation of POSIX threads is surely a good idea. Given the dominance of Linux, I'm surprised recent versions of OpenMP don't work directly with Linux native threads.
As far as I can tell they are not even using Linux native threads, rather doing it all in user space for orders of magnitude less overhead in context switching.

Anyway, I'm not sure I'm going to get off the ground with HPX. I changed my std::async to hpx::async and now it takes ten times longer to build the code; not something I want to riff on. After all that the end result is a segfault. Now what?
I only have the usual ideas about the segfault: missing initialization code, a library version mismatch or that the library was made out of snake oil. It appears, for once, not a reason to avoid BASIC.

Before doing anything in user space you need to employ some operating system feature to actually start the threads. It further appears the pool of worker threads is blocking instead of spinning when there is no work to do. This also requires interaction with the kernel.

Seems we're off-line here, probably flooding of the electrical tunnels. I expect the graphs to return by Monday if not sooner. After things are back up I'll rerun parallel.c with the missing micro-optimization workers(2*nprocs) to make it a little more competitive.

Heater
Posts: 12714
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Sat Jan 19, 2019 8:26 pm

Yeah, the possibly good news is that I can get my karatsuba, modified to use hpx::async, running.

The bad news is, I don't yet see any sign of it being faster than std::async, or even of it running on more than a single core!
Last edited by Heater on Sat Jan 19, 2019 8:46 pm, edited 1 time in total.

Heater
Posts: 12714
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Sat Jan 19, 2019 8:45 pm

ejolson,
Before doing anything in user space you need to employ some operating system feature to actually start the threads. It further appears the pool of worker threads is blocking instead of spinning when there is no work to do. This also requires interaction with the kernel.
Thinking about it, I suspect you are right, and I am starting to wonder what the guy was talking about.

Looking forward to your updated results.

ejolson
Posts: 3078
Joined: Tue Mar 18, 2014 11:47 am

Re: Why Avoid BASIC on RPi?

Sat Jan 19, 2019 10:00 pm

Heater wrote:
Sat Jan 19, 2019 8:14 am

Code: Select all

MULTI-CORE RUN TIMES MEASURED IN SECONDS
     
                 Win10 PC    Pi 3B
parallel:        0.227s      1.075s
fibo_karatomp:   0.234s      1.489s
fibo_karatasync: 0.303s      1.789s
It's confusing; we are all over the map in relative performance, changing with cores and platforms.
Running the C++ code fast on recent Intel hardware appears to require a recent version of g++ and the appropriate optimiser settings.

Are you still using version 6 of g++ with -O2 and the default generic tuning? As the expressivity of a language for conveying algorithms to the CPU depends strongly on the compiler, I would suggest trying the version 8.2 compiler along with the -march=native -mtune=native -O3 flags. If =native doesn't work on the Windows Subsystem for Linux, you might need to specify the processor type directly.

Heater
Posts: 12714
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Sun Jan 20, 2019 3:14 am

Those results were with the same build command you used above and my recent GCC 8.2.0 build. Namely:

Code: Select all

$ g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/x86_64-pc-linux-gnu/8.2.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../configure --disable-multilib --enable-languages=c,c++ --enable-checking=no
Thread model: posix
gcc version 8.2.0 (GCC)
$ g++ -O3 -march=native -mtune=native -DUSE_OMP -fopenmp -std=c++17 -o fibo_karatomp  fibo_4784969/c++/fibo_karatsuba.cpp fibo_4784969/c++/fibo.cpp
$
$ time ./fibo_karatomp > /dev/null

real    0m0.235s
user    0m1.156s
sys     0m0.188s
$
$ gcc -O3 -march=native -mtune=native -fopenmp -o parallel fibo_4784969/c/parallel.c
$
$ time ./parallel > /dev/null
Using 8 cores.

real    0m0.228s
user    0m1.297s
sys     0m0.047s
When I fiddle around changing optimization and other flags it's hard to see any appreciable change. The run times are jumping around so much from run to run.

As for setting -march, all I know is that it builds as x86-64. If I try setting any Intel processor type there it either fails with an illegal instruction or does not seem to change anything. I have no idea what dumb ass name Intel gave to my actual CPU, a Core i7.

jahboater
Posts: 4464
Joined: Wed Feb 04, 2015 6:38 pm

Re: Why Avoid BASIC on RPi?

Sun Jan 20, 2019 8:00 am

Heater wrote:
Sun Jan 20, 2019 3:14 am
I have no idea what dumb ass name Intel gave to my actual CPU, a Core i7.
See /proc/cpuinfo:

Code: Select all

$ grep Intel /proc/cpuinfo
vendor_id	: GenuineIntel
model name	: Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
vendor_id	: GenuineIntel
model name	: Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
vendor_id	: GenuineIntel
model name	: Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
vendor_id	: GenuineIntel
model name	: Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
But since you built the compiler yourself on that machine it is hopefully targeting the correct architecture by default.

Unless WSL is messing with the CPU brand string :(

Heater
Posts: 12714
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Sun Jan 20, 2019 8:38 am

jahboater,

My WSL does not lie:

Code: Select all

$ grep Intel /proc/cpuinfo
vendor_id       : GenuineIntel
model name      :        Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
Which agrees with what Windows tells me. Problem is GCC has no concept of an "i7-2600K".

The missing information is the dumb ass processor name Intel gave it, which I finally discover is "Sandy Bridge", from here: https://ark.intel.com/products/52214/In ... -3-80-GHz-

Why couldn't the ten dozen articles and documents re: setting GCC -march have told me that or hinted at it?

Anyway -march=sandybridge makes no noticeable difference.
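
For the record, it turns out GCC itself can be asked what -march=native resolves to, which would have saved all the searching (this works with my 8.2 build, at least):

Code: Select all

$ gcc -march=native -Q --help=target | grep 'march='
  -march=                               sandybridge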

Thanks.
