ejolson
Posts: 2897
Joined: Tue Mar 18, 2014 11:47 am

Re: Raspberry Pi Benchmarks

Mon Apr 01, 2019 5:30 pm

RoyLongbottom wrote:
Mon Apr 01, 2019 11:15 am
The following reference that you quoted indicates that four Raspberry Pi 3s obtain an HPL score of 3.463 GFLOPS, but neither the performance of a single 4 core node nor the versions of the benchmark packages used were mentioned. That speed is what I saw with my ATLAS version on one Pi 3 and, for the other version, I see over 5 GFLOPS.

https://www.raspberrypi.org/magpi/bench ... i-cluster/
That article appears to indicate inexperience on the part of the author as well as a lack of editorial oversight. This is somewhat unfortunate, as the MagPi serves as a definitive educational resource and the parallel Linpack as the de facto test that a cluster is functioning properly.

Maybe the article can motivate an easy challenge: use a cluster of four Raspberry Pi 3Bs to obtain around 20 GFLOPS on Linpack instead of 3.5 GFLOPS. Increasing N, tuning the configuration and making sure HPL was linked with OpenBLAS should be sufficient. I suspect most code clubs and schools have the resources to try, just not me.
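
As a rough sketch of the "increasing N" step: a common rule of thumb sizes the HPL matrix so that its N x N double precision elements (8 bytes each) fill about 80% of total cluster memory, then rounds N down to a multiple of the block size NB. The 80% fraction, the NB value of 128, the 1 GiB of usable memory per node and the function name are all assumptions for illustration, not settings taken from the MagPi article or this thread.

```python
import math

def suggest_hpl_n(nodes, mem_per_node_bytes, fraction=0.8, nb=128):
    """Rule-of-thumb HPL problem size: N*N doubles fill `fraction` of RAM."""
    total_bytes = nodes * mem_per_node_bytes
    n = int(math.sqrt(fraction * total_bytes / 8))  # 8 bytes per double
    return (n // nb) * nb  # round down to a multiple of the block size

# Four Raspberry Pi 3Bs, assuming 1 GiB usable on each (in practice some
# memory is reserved for the GPU and the operating system)
print(suggest_hpl_n(4, 1024**3))
```

A smaller fraction leaves more headroom for the operating system; if the nodes start swapping, the result will be far worse than with a slightly smaller N.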

RoyLongbottom
Posts: 270
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Wed Apr 24, 2019 6:44 pm

New Multithreaded Stress Tests

I have produced some new 32 bit and 64 bit stress tests that can use up to 64 threads, each using a dedicated segment of the data. There are three varieties: Integer, Single Precision Floating Point and Double Precision Floating Point. Run time parameters include duration, data size and number of threads. Full details, with RPi 3B and 3B+ results, are included in the tar.gz and zip files at ResearchGate, which contain the benchmarks and source code.

tar.gz
https://www.researchgate.net/profile/Ro ... UpdatesLog

zip
https://www.researchgate.net/profile/Ro ... UpdatesLog

Default use, with no run time parameters, is a benchmarking mode. Following are examples of integer and single precision floating point performance via data in caches and RAM.

Code:

  MP-Integer-Test 32 Bit v1.0 Tue Apr  9 12:33:19 2019

      Benchmark 1, 2, 4, 8, 16 and 32 Threads

                   MB/second
                KB    KB    MB            Same All
   Secs Thrds   16   160    16  Sumcheck   Tests

   8.8    1   3571  3293  2045  00000000    Yes
   6.3    2   6946  6511  2152  FFFFFFFF    Yes
   5.6    4  13892 12651  1895  5A5A5A5A    Yes
   5.6    8  13247 13764  1880  AAAAAAAA    Yes
   5.5   16  13853 14034  1879  CCCCCCCC    Yes
   5.5   32  13633 13829  1908  0F0F0F0F    Yes

            End of test Tue Apr  9 12:33:56 2019


 MP-Threaded-MFLOPS 32 Bit v1.0 Tue Apr  9 12:34:25 2019

             Benchmark 1, 2, 4 and 8 Threads

                        MFLOPS          Numeric Results
             Ops/   KB    KB    MB      KB     KB     MB
  Secs  Thrd Word 12.8   128  12.8    12.8    128   12.8

   3.2    T1   2   852   847   399   40392  76406  99700
   5.5    T2   2  1641  1666   415   40392  76406  99700
   7.5    T4   2  3098  3172   415   40392  76406  99700
   9.5    T8   2  3002  3072   413   40392  76406  99700
  14.0    T1   8  1905  1932  1460   54756  85091  99820
  16.9    T2   8  3798  3837  1635   54756  85091  99820
  19.2    T4   8  7307  7460  1633   54756  85091  99820
  21.5    T8   8  7258  7613  1645   54756  85091  99820
  36.9    T1  32  2025  2049  1923   35296  66020  99519
  44.8    T2  32  4031  4072  3638   35296  66020  99519
  49.1    T4  32  7986  8025  6085   35296  66020  99519
  53.3    T8  32  7929  8081  6212   35296  66020  99519

            End of test Tue Apr  9 12:35:18 2019
The stress tests were run at the same time as my program that measures CPU MHz, core voltage and temperature. Following is an indication of logged stress test results on my older Pi 3B:

Code:

  After  99 seconds to first data comparison error

  MP-Integer-Test 32 Bit v1.0 Sat Apr  6 15:45:41 2019

               15 Minute Stress Test

              Data                          Same All
 Seconds      Size Threads   MB/sec Sumcheck Threads

  107.4     640 KB      4      7157 5A5A5A5A  Yes
  115.8     640 KB      4      7176 5A5A5A5A  Yes
  124.2     640 KB      4      7149 5A5A5A5A  Yes
  132.3     640 KB      4      7504 5A5A5A5A  Yes
  140.8     640 KB      4      7074 5A5A5A5A  Yes
  149.2     640 KB      4      7159 5A5A5A5A  Yes
  157.4     640 KB      4      7364 AAAAAAAA  Yes
  165.3     640 KB      4      7569 AAAAAAAA  Yes
  173.6     640 KB      4      7293 AAAAAAAA  Yes
  182.0     640 KB      4      7149 AAAAAAAA  Yes
  190.2     640 KB      4      7322 AAAAAAAA  No 1
  198.3     640 KB      4      7460 AAAAAAAA  Yes


 After 315 seconds to first error

  MP-Threaded-MFLOPS 32 Bit v1.0 Thu Apr 18 10:56:22 2019

                   15 Minute Stress Test

            Data             Ops/         Numeric
 Seconds    Size  Threads    Word  MFLOPS Results     Passes

  325.7   640 KB        8      32    6812   59617      13125
  335.9   640 KB        8      32    6772   59617      13125
  346.0   640 KB        8      32    6815       0      13125
  356.1   640 KB        8      32    6790   59617      13125
The only data comparison failures that were detected were on the older Pi 3B, without the recommended “over_voltage=2” boot setting.
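
The kind of check that picks such failures out of a saved log can be sketched as follows. This is a hypothetical post-processing script, not part of the benchmark programs: the function name is invented, and the assumption that the last column reads "Yes" or "No ..." comes only from the integer test log layout shown above.

```python
def find_failed_rows(log_lines):
    """Return (seconds, sumcheck) for rows where threads disagreed."""
    failures = []
    for line in log_lines:
        fields = line.split()
        # Data rows look like: 190.2 640 KB 4 7322 AAAAAAAA No 1
        if len(fields) < 7 or fields[2] != "KB":
            continue  # skip headings and blank lines
        if fields[6] != "Yes":
            failures.append((float(fields[0]), fields[5]))
    return failures

sample = [
    " Seconds      Size Threads   MB/sec Sumcheck Threads",
    "  182.0     640 KB      4      7149 AAAAAAAA  Yes",
    "  190.2     640 KB      4      7322 AAAAAAAA  No 1",
    "  198.3     640 KB      4      7460 AAAAAAAA  Yes",
]
print(find_failed_rows(sample))
```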

The most stressful test appeared to be the integer program running 4 threads on 640 KB of data, which overfilled the L2 cache. Running with 8 threads instead proved to be worse. In that case, performance was much faster, with data for 4 threads held in cache, interrupted by swapping in data for the other 4 threads.

RoyLongbottom

Re: Raspberry Pi Benchmarks

Mon Jun 24, 2019 3:32 pm

Raspberry Pi 4 Benchmarks

I have run all of my relevant benchmarks and stress tests on the new Raspberry Pi 4. I was able to run the programs early because I accepted a request to become a volunteer, exercising the system prior to launch. I will be posting results here shortly, but I want to include others for desktop PCs with the first few. A report containing the benchmark results and comparisons with the Pi 3B+ is now at ResearchGate:

https://www.researchgate.net/publicatio ... Benchmarks

The PDF file was converted from simple hand-coded HTML, which is small enough for me to attach here in a ZIP file. This might be of use in providing easy access to others.
Attachment: Raspberry Pi 4B 32 Bit BenchmarksHTM.zip (31.34 KiB)

ejolson

Re: Raspberry Pi Benchmarks

Mon Jun 24, 2019 6:28 pm

RoyLongbottom wrote:
Mon Jun 24, 2019 3:32 pm
Raspberry Pi 4 Benchmarks

Did you use a heatsink or fan on the Pi 4 and did it throttle due to heat?

RoyLongbottom

Re: Raspberry Pi Benchmarks

Tue Jun 25, 2019 9:02 am

ejolson wrote:
Mon Jun 24, 2019 6:28 pm
RoyLongbottom wrote:
Mon Jun 24, 2019 3:32 pm
Raspberry Pi 4 Benchmarks
Did you use a heatsink or fan on the Pi 4 and did it throttle due to heat?
I will be producing a report of my stress testing efforts in due course. As with all modern processors (that I have encountered), the Pi 4 will throttle or crash when stretched to the limit without appropriate cooling arrangements.

An example of a worst case was running the High Performance Linpack benchmark. On a bare Pi 4 board, with throttling starting at 80°C, running time was 16 minutes, with a performance rating of 5.37 GFLOPS and a final temperature of 87°C, by which time throttled frequencies were between 500 and 600 MHz. That is not all bad news, as it completed the task.

A second Pi 4 board has a Power over Ethernet HAT fitted, providing fan based cooling. This ran the HPL test in 8 minutes at 10.9 GFLOPS, with a maximum temperature of 61°C.

A best case, with no cooling, was watching BBC iPlayer for more than 2 hours on a full screen 1920x1080 display, with CPU utilisation around 100% of one core and a maximum temperature of 70°C, at maximum MHz all the time. That also seemed to be the maximum temperature when running my single core benchmarks.
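
A minimal sketch of how temperature and clock speed could be sampled during runs like these, assuming the usual Linux sysfs files are present (they report millidegrees Celsius and kHz respectively). This is not the monitoring program used for the figures above; the paths and helper names are assumptions for illustration.

```python
from pathlib import Path

# Assumed sysfs locations on a Raspberry Pi running Linux
TEMP_FILE = Path("/sys/class/thermal/thermal_zone0/temp")
FREQ_FILE = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq")

def parse_temp_c(raw):
    """Convert a millidegree reading such as '61000' to degrees C."""
    return int(raw.strip()) / 1000.0

def parse_freq_mhz(raw):
    """Convert a kHz reading such as '1500000' to MHz."""
    return int(raw.strip()) / 1000.0

def read_status():
    """Return (degrees C, MHz), or None if the files cannot be read."""
    try:
        return (parse_temp_c(TEMP_FILE.read_text()),
                parse_freq_mhz(FREQ_FILE.read_text()))
    except (OSError, ValueError):
        return None

print(read_status())
```

Calling read_status() once a second while a benchmark runs would show the clock dropping once the temperature reaches the throttling threshold.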

RoyLongbottom

Re: Raspberry Pi Benchmarks

Tue Jun 25, 2019 3:39 pm

Single Core Benchmarks

The following provide benchmark results, with limited comments, on Raspberry Pi 4B performance gains over the Pi 3B+ and on relative Pi 4B performance between older ARM V7 and gcc 8 compilations. For the first few ancient benchmarks, ARM V6 compilations are also compared.

Results from somewhat older PCs are provided for the first few benchmarks, to indicate that the Pi 4 appears to have performance in the desktop PC category.

Whetstone Benchmark

This has a range of simple tests with an overall rating in MWIPS, the latter being dependent on floating point speeds and currently most influenced by variations in the COS and EXP tests. The Pi 4B is faster, but not excessively so.

Code:

System       MHz  MWIPS ------ MFLOPS------  -----  ------ -MOPS-------- ------
                            1      2      3      COS    EXP  FIXPT     IF  EQUAL
ARM V7
Pi 3B+       1400   1060    391    383    298   21.7   12.3   1740   2083   1392
Pi 4B        1500   1884    516    478    310   54.7   27.1   2498   2247    999
4B/3B+       1.07   1.78   1.32   1.25   1.04   2.52   2.21   1.44   1.08   0.72

ARM V6
Pi 3B+       1400   1094    391    407    348   21.7   12.3   1740   2084   1391
Pi 4B        1500   2048    520    473    389   53.8   27.1   2497   2245   2246
4B/3B+       1.07   1.87   1.33   1.16   1.12   2.47   2.20   1.44   1.08   1.61

gcc 8
Pi 3B+       1400   1063    393    373    300   21.8   12.3   1748   2097   1398
Pi 4B        1500   1883    522    471    313   54.9   26.4   2496   3178    998
4B/3B+       1.07   1.76   1.33   1.26   1.05   2.51   2.09   1.43   1.52   0.71

Similar to Pi 4B Results from PCs 

Core i5      @@@@   1813    537    501    317   50.4   32.1   2242    457    561
Core 2       2400   2057    586    580    387   56.7   34.4   2192    381    771
Phenom II    3000   2145    594    492    297   67.1   50.7   2053    694    703

        @@@@ Rated as 1600 MHz but running at up to 2300 MHz using Turbo Boost
https://www.researchgate.net/publicatio ... er_Results

Or, to compare with computers from 1964 to the 1990s, see:

https://www.researchgate.net/publicatio ... nd_Results

Dhrystone Benchmark

This appears to be the most popular ARM benchmark and is often subject to over-optimisation, so you can’t reliably compare results from different compilers. Ignoring this, results in VAX MIPS (aka DMIPS) and comparisons follow.

The Pi 4B was shown to be around twice as fast as the 3B+ and gcc 8 performance was similar to the ARM V7 compilation.

Code:

                                              Best
                ----- Compiler -----         DMIPS
 System     MHz  ARM V6 ARM V7  gcc 8  G8/V7  /MHz

 Pi 3B+    1400   2520   2825   2838   1.00   2.03
 Pi 4B     1500   5077   5366   5646   1.05   3.76

 4B/3B+    1.07   2.01   1.90   1.99          1.86

Similar to Pi 4B Results from PCs 

Athlon 64   2211         4462
Core i5     @@@@         4752
Core 2      2400         6446
Phenom II   3000         7615
https://www.researchgate.net/publicatio ... Longbottom
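
The "Best DMIPS/MHz" column can be reproduced from the Dhrystone table as a quick check. The sketch below uses only the Pi figures from that table; the 1757 divisor is the conventional VAX 11/780 Dhrystones per second score used to define DMIPS, and the function names are illustrative.

```python
VAX_DHRYSTONES_PER_SEC = 1757  # conventional VAX 11/780 reference score

def dmips(dhrystones_per_sec):
    """Convert a raw Dhrystones/second score to VAX MIPS (DMIPS)."""
    return dhrystones_per_sec / VAX_DHRYSTONES_PER_SEC

def best_dmips_per_mhz(scores, mhz):
    """Best DMIPS across compilers divided by clock speed in MHz."""
    return max(scores) / mhz

# ARM V6, ARM V7 and gcc 8 results from the table
pi3 = best_dmips_per_mhz([2520, 2825, 2838], 1400)
pi4 = best_dmips_per_mhz([5077, 5366, 5646], 1500)
print(f"{pi3:.2f} {pi4:.2f} {pi4 / pi3:.2f}")
```

Dividing by MHz removes the 7% clock speed advantage of the Pi 4B, so the remaining ratio reflects gains per clock cycle.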

Linpack 100 Benchmark MFLOPS

This original Linpack benchmark is dependent on fused multiply and add instructions, but the overheads in the standard source code restrict processing speed. It seems that the latest hardware has been modified to execute this type of code more efficiently. The original used Double Precision (DP) arithmetic but, for the benefit of the Pi, Single Precision (SP) versions were produced.

All measurements demonstrate that the Pi 4B was between 3.6 and 4.7 times faster than the Pi 3B+.

Code:

                      ARM V6        ARM V7        gcc 8     NEON
 System    MHz      DP     SP     DP     SP     DP     SP     SP 

 Pi 3B+    1400  206.0  220.2  210.5  225.2  224.8  227.3  536.0
 Pi 4B     1500  764.7  880.6  760.2  921.6  957.1 1068.8 2010.5

 4B/3B+    1.07   3.71   4.00   3.61   4.09   4.26   4.70   3.75

Similar to Pi 4B Results from PCs 

Pentium 4E 3000                630.3 
Athlon 64  2150                811.9  
Core 2     1830                997.7  
Core i5    @@@@               1064.7 

Later SSE2
Core 2     1830                             1119.4 
https://www.researchgate.net/publicatio ... Longbottom
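
For reference, the inner loop that dominates Linpack is the DAXPY operation, one multiply and one add per element, which is why fused multiply and add hardware matters. The sketch below is plain Python for illustration only and says nothing about the speeds above. It also shows the conventional Linpack operation count of 2/3 n^3 + 2 n^2 used to turn a run time into MFLOPS. The function names and the 0.001 second timing are hypothetical.

```python
def daxpy(a, x, y):
    """y[i] += a * x[i]: one multiply and one add per element."""
    for i in range(len(x)):
        y[i] += a * x[i]
    return y

def linpack_mflops(n, seconds):
    """MFLOPS from the conventional Linpack operation count."""
    ops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return ops / seconds / 1.0e6

print(daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 10.0, 10.0]))
print(round(linpack_mflops(100, 0.001), 1))  # n = 100, hypothetical time
```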

Livermore Loops Benchmark MFLOPS

This benchmark measures performance of 24 double precision kernels, initially used to select the latest supercomputer. The official average is the geometric mean, on which the Cray 1 supercomputer was rated at 11.9 MFLOPS. Following are the overall scores.

Based on Geomean results, the Pi 4B is shown as being 2.36 times faster than the 3B+, and even more so with the gcc 8 compilation, where the Pi 4B gcc 8/V7 performance ratio is 1.31.

Code:

System      MHz    Maximum Average Geomean Harmean Minimum

 ARM V7
 Pi 3B+     1400     464.8   246.7   220.1   193.9    78.3
 Pi 4B      1500    1800.2   635.1   519.0   416.1   155.3
 4B/3B+     1.07      3.87    2.57    2.36    2.15    1.98

 gcc 8
 Pi 3B+     1400     541.7   283.4   257.4   231.5    92.7
 Pi 4B      1500    1860.8   800.4   679.0   564.1   179.5
 4B/3B+     1.07      3.40    2.80    2.61    2.41    1.90
                                                
Similar to Pi 4B Results from PCs 

Core i5     @@@@    2326.4   717.1   438.0   246.2    34.7 
Athlon 64   2211    2565.9   740.9   460.5   285.7    48.4
Core 2      2400    2236.2   791.9   539.0   331.6    52.3
Phenom  II  3000    3893.9  1071.8   644.4   384.5    64.0

Later SSE2
Core 2      2400    2854.2  1047.6   848.2   653.8   141.8 
https://www.researchgate.net/publicatio ... Longbottom
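
The three averages reported for Livermore Loops weight slow kernels differently, which is why the Geomean and Harmean fall below the arithmetic Average in every row. A small sketch using Python's statistics module (geometric_mean needs Python 3.8 or later); the per-kernel MFLOPS values are made-up placeholders, not measured results.

```python
from statistics import mean, geometric_mean, harmonic_mean

# Hypothetical per-kernel MFLOPS values, for illustration only
kernel_mflops = [155.0, 400.0, 520.0, 900.0, 1800.0]

summary = {
    "Maximum": max(kernel_mflops),
    "Average": mean(kernel_mflops),
    "Geomean": geometric_mean(kernel_mflops),
    "Harmean": harmonic_mean(kernel_mflops),
    "Minimum": min(kernel_mflops),
}
for name, value in summary.items():
    print(f"{name:8s} {value:8.1f}")
```

The harmonic mean is pulled furthest towards the slowest kernel, so a large gap between Average and Harmean indicates a wide spread of kernel speeds.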

Next, single core benchmarks using caches and RAM

