ejolson
Posts: 3699
Joined: Tue Mar 18, 2014 11:47 am

Re: Raspberry Pi Benchmarks

Mon Apr 01, 2019 5:30 pm

RoyLongbottom wrote:
Mon Apr 01, 2019 11:15 am
The following reference that you quoted indicates that four Raspberry Pi 3s obtained an HPL score of 3.463 GFLOPS, but neither the performance of a single 4 core node nor the versions of the benchmark packages used were mentioned. That speed is what I saw with my ATLAS version on one Pi 3 and, for the other version, I see over 5 GFLOPS.

https://www.raspberrypi.org/magpi/bench ... i-cluster/
That article appears to indicate inexperience on the part of the author as well as a lack of editorial oversight. This is somewhat unfortunate, as the MagPi serves as a definitive educational resource and the parallel Linpack as the de facto test that a cluster is functioning properly.

Maybe the article can motivate an easy challenge: use a cluster of four Raspberry Pi 3Bs to obtain around 20 GFLOPs on Linpack instead of 3.5 GFLOPs. Increasing N, tuning the configuration and making sure HPL was linked with OpenBLAS should be sufficient. I suspect most code clubs and schools have the resources to try, just not me.
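
As a rough guide to the "increasing N" step, a common rule of thumb is to choose the HPL problem size so that the N x N matrix of 8-byte doubles fills a large fraction of the total cluster memory, rounded down to a multiple of the block size NB. The following Python sketch applies that rule. The 80% memory fraction, NB = 128 and the function names are assumptions for illustration, not values from the article; 1 GB per node is the RAM size of a Pi 3B.

```python
import math

def hpl_problem_size(nodes, mem_bytes_per_node, fraction=0.8, nb=128):
    """Largest N (a multiple of nb) whose N x N matrix of doubles
    fits in `fraction` of the total memory of the cluster."""
    total_bytes = nodes * mem_bytes_per_node
    n = int(math.sqrt(fraction * total_bytes / 8))  # 8 bytes per double
    return (n // nb) * nb

def approx_run_seconds(n, gflops):
    """Run time estimate from the leading-order operation count, about 2/3 N^3."""
    return (2.0 / 3.0) * n**3 / (gflops * 1e9)

# Four Pi 3Bs with 1 GB each; fraction and nb are assumed values.
n = hpl_problem_size(nodes=4, mem_bytes_per_node=1 << 30)
print("suggested N:", n)
print("approx seconds at 20 GFLOPS:", round(approx_run_seconds(n, 20.0)))
```

In practice the usable fraction depends on how much memory the operating system and MPI buffers need on each node, so a smaller N may be required to avoid swapping.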

RoyLongbottom
Posts: 299
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Wed Apr 24, 2019 6:44 pm

New Multithreaded Stress Tests

I have produced some new 32 bit and 64 bit stress tests that can use up to 64 threads, each using a dedicated segment of the data. There are three varieties: Integer, Single Precision Floating Point and Double Precision Floating Point. Run time parameters include duration, data size and number of threads. Full details, with RPi 3B and 3B+ results, are included in the tar.gz and zip files at ResearchGate that contain the benchmarks and source codes.

tar.gz
https://www.researchgate.net/profile/Ro ... UpdatesLog

zip
https://www.researchgate.net/profile/Ro ... UpdatesLog

Default use, with no run time parameters, is a benchmarking mode. Following are examples of integer and single precision floating point performance via data in caches and RAM.


  MP-Integer-Test 32 Bit v1.0 Tue Apr  9 12:33:19 2019

      Benchmark 1, 2, 4, 8, 16 and 32 Threads

                   MB/second
                KB    KB    MB            Same All
   Secs Thrds   16   160    16  Sumcheck   Tests

   8.8    1   3571  3293  2045  00000000    Yes
   6.3    2   6946  6511  2152  FFFFFFFF    Yes
   5.6    4  13892 12651  1895  5A5A5A5A    Yes
   5.6    8  13247 13764  1880  AAAAAAAA    Yes
   5.5   16  13853 14034  1879  CCCCCCCC    Yes
   5.5   32  13633 13829  1908  0F0F0F0F    Yes

            End of test Tue Apr  9 12:33:56 2019


 MP-Threaded-MFLOPS 32 Bit v1.0 Tue Apr  9 12:34:25 2019

             Benchmark 1, 2, 4 and 8 Threads

                        MFLOPS          Numeric Results
             Ops/   KB    KB    MB      KB     KB     MB
  Secs  Thrd Word 12.8   128  12.8    12.8    128   12.8

   3.2    T1   2   852   847   399   40392  76406  99700
   5.5    T2   2  1641  1666   415   40392  76406  99700
   7.5    T4   2  3098  3172   415   40392  76406  99700
   9.5    T8   2  3002  3072   413   40392  76406  99700
  14.0    T1   8  1905  1932  1460   54756  85091  99820
  16.9    T2   8  3798  3837  1635   54756  85091  99820
  19.2    T4   8  7307  7460  1633   54756  85091  99820
  21.5    T8   8  7258  7613  1645   54756  85091  99820
  36.9    T1  32  2025  2049  1923   35296  66020  99519
  44.8    T2  32  4031  4072  3638   35296  66020  99519
  49.1    T4  32  7986  8025  6085   35296  66020  99519
  53.3    T8  32  7929  8081  6212   35296  66020  99519

            End of test Tue Apr  9 12:35:18 2019
The stress tests were run at the same time as my program that measures CPU MHz, core voltage and temperature. Following is an indication of logged stress test results on my older Pi 3B:


  After  99 seconds to first data comparison error

  MP-Integer-Test 32 Bit v1.0 Sat Apr  6 15:45:41 2019

               15 Minute Stress Test

              Data                          Same All
 Seconds      Size Threads   MB/sec Sumcheck Threads

   107.4     640 KB      4      7157 5A5A5A5A  Yes
  115.8     640 KB      4      7176 5A5A5A5A  Yes
  124.2     640 KB      4      7149 5A5A5A5A  Yes
  132.3     640 KB      4      7504 5A5A5A5A  Yes
  140.8     640 KB      4      7074 5A5A5A5A  Yes
  149.2     640 KB      4      7159 5A5A5A5A  Yes
  157.4     640 KB      4      7364 AAAAAAAA  Yes
  165.3     640 KB      4      7569 AAAAAAAA  Yes
  173.6     640 KB      4      7293 AAAAAAAA  Yes
  182.0     640 KB      4      7149 AAAAAAAA  Yes
  190.2     640 KB      4      7322 AAAAAAAA  No 1
  198.3     640 KB      4      7460 AAAAAAAA  Yes


 After 315 seconds to first error

  MP-Threaded-MFLOPS 32 Bit v1.0 Thu Apr 18 10:56:22 2019

                   15 Minute Stress Test

            Data             Ops/         Numeric
 Seconds    Size  Threads    Word  MFLOPS Results     Passes

  325.7   640 KB        8      32    6812   59617      13125
  335.9   640 KB        8      32    6772   59617      13125
  346.0   640 KB        8      32    6815       0      13125
  356.1   640 KB        8      32    6790   59617      13125
The only data comparison failures that were detected were on the older Pi 3B, without the recommended “over_voltage=2” boot setting.

The most stressful test appeared to be the integer program running with 4 threads and 640 KB of data, which overfilled the L2 cache. Running with 8 threads instead proved to be worse. In this case, performance was much faster with data for 4 threads in cache, interrupted by swapping for the other 4 threads.

RoyLongbottom
Posts: 299
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Mon Jun 24, 2019 3:32 pm

Raspberry Pi 4 Benchmarks

I have run all of my relevant benchmarks and stress tests on the new Raspberry Pi 4. The early opportunity to run the programs was due to my acceptance of the request for me to become a volunteer, exercising the system prior to launch. I will be posting results here shortly, but I want to include others for desktop PCs with the first few. A report containing the benchmark results and comparisons with the Pi 3B+ is now at ResearchGate:

https://www.researchgate.net/publicatio ... Benchmarks

The PDF file was converted from simple hand coded HTML, which is small enough that I have attached it in a ZIP file. This might be of use in providing easy access to others.
Raspberry Pi 4B 32 Bit BenchmarksHTM.zip

ejolson
Posts: 3699
Joined: Tue Mar 18, 2014 11:47 am

Re: Raspberry Pi Benchmarks

Mon Jun 24, 2019 6:28 pm

RoyLongbottom wrote:
Mon Jun 24, 2019 3:32 pm
Raspberry Pi 4 Benchmarks

I have run all of my relevant benchmarks and stress tests on the new Raspberry Pi 4. The early opportunity to run the programs was due to my acceptance of the request for me to become a volunteer, exercising the system prior to launch. I will be posting results here shortly, but I want to include others for desktop PCs with the first few. A report containing the benchmark results and comparisons with the Pi 3B+ is now at ResearchGate:

https://www.researchgate.net/publicatio ... Benchmarks

The PDF file was converted from simple hand coded HTML, hence the small size that I have attached in a ZIP file. This might be of use in providing easy access to others.

Raspberry Pi 4B 32 Bit BenchmarksHTM.zip
Did you use a heatsink or fan on the Pi 4 and did it throttle due to heat?

RoyLongbottom
Posts: 299
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Tue Jun 25, 2019 9:02 am

ejolson wrote:
Mon Jun 24, 2019 6:28 pm
RoyLongbottom wrote:
Mon Jun 24, 2019 3:32 pm
Raspberry Pi 4 Benchmarks
Did you use a heatsink or fan on the Pi 4 and did it throttle due to heat?
I will be producing a report of my stress testing efforts in due course. Like all modern processors that I have encountered, the Pi 4 will throttle or crash when stretched to the limit without appropriate cooling arrangements.

A worst case example was running the High Performance Linpack benchmark. On a bare Pi 4 board, with throttling starting at 80°C, running time was 16 minutes with a performance rating of 5.37 GFLOPS and a final temperature of 87°C, when final throttling frequencies were between 600 and 500 MHz. That is not all bad news, as it completed the task.

A second Pi 4 board has a Power over Ethernet HAT fitted, providing fan based cooling. This ran the HPL test in 8 minutes at 10.9 GFLOPS, with a maximum temperature of 61°C.

A best case, with no cooling, was that I watched BBC iPlayer for more than 2 hours on a full screen 1920x1080 display, with CPU utilisation around 100% of one core and a maximum temperature of 70°C, at maximum MHz all the time. That also seemed to be the maximum temperature when running my single core benchmarks.
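
For anyone wanting to watch for throttling during a long run, here is a minimal Python sketch that samples temperature and CPU frequency. It is not the monitoring program used for the results above. It assumes the usual Linux sysfs files (`thermal_zone0/temp` in millidegrees C and `scaling_cur_freq` in kHz), and the helper names, the 1500 MHz maximum and the 80°C limit used in the check are illustrative assumptions taken from the figures quoted in this post.

```python
from pathlib import Path

# Assumed sysfs locations on a Linux based Pi system.
TEMP_PATH = Path("/sys/class/thermal/thermal_zone0/temp")
FREQ_PATH = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq")

def parse_temp_c(raw):
    """sysfs reports temperature in millidegrees Celsius."""
    return int(raw.strip()) / 1000.0

def parse_freq_mhz(raw):
    """sysfs reports the current CPU frequency in kHz."""
    return int(raw.strip()) / 1000.0

def is_probably_throttled(temp_c, freq_mhz, max_mhz=1500.0, limit_c=80.0):
    """Heuristic only: hot and running below the maximum clock."""
    return temp_c >= limit_c and freq_mhz < max_mhz

def sample():
    """Read one (temperature, frequency) pair from sysfs."""
    return (parse_temp_c(TEMP_PATH.read_text()),
            parse_freq_mhz(FREQ_PATH.read_text()))

if TEMP_PATH.exists() and FREQ_PATH.exists():
    try:
        temp_c, freq_mhz = sample()
        print(f"{temp_c:.1f} C  {freq_mhz:.0f} MHz  "
              f"throttled: {is_probably_throttled(temp_c, freq_mhz)}")
    except (OSError, ValueError):
        print("could not read sysfs values on this system")
```

Calling `sample()` once a second in a loop while a benchmark runs gives a simple log that can be lined up against the benchmark timings.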

RoyLongbottom
Posts: 299
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Tue Jun 25, 2019 3:39 pm

Single Core Benchmarks

The following provides benchmark results, with limited comments, on Raspberry Pi 4B performance gains over the Pi 3B+ and on how the older ARM V7 compilations compare with gcc 8 compilations on the Pi 4B. For the first few, ancient benchmarks, ARM V6 compilations are also compared.

Results of somewhat older PCs are provided for the first few benchmarks, to indicate that the Pi 4 appears to have performance in the desktop PC category.

Whetstone Benchmark

This has a range of simple tests with an overall rating in MWIPS, the latter being dependent on floating point speeds, currently most influenced by variations in the COS and EXP tests. The Pi 4B is faster, but not excessively so.


System       MHz  MWIPS ------ MFLOPS------  -----  ------ -MOPS-------- ------
                            1      2      3      COS    EXP  FIXPT     IF  EQUAL
ARM V7
Pi 3B+       1400   1060    391    383    298   21.7   12.3   1740   2083   1392
Pi 4B        1500   1884    516    478    310   54.7   27.1   2498   2247    999
4B/3B+       1.07   1.78   1.32   1.25   1.04   2.52   2.21   1.44   1.08   0.72

ARM V6
Pi 3B+       1400   1094    391    407    348   21.7   12.3   1740   2084   1391
Pi 4B        1500   2048    520    473    389   53.8   27.1   2497   2245   2246
4B/3B+       1.07   1.87   1.33   1.16   1.12   2.47   2.20   1.44   1.08   1.61

gcc 8
Pi 3B+       1400   1063    393    373    300   21.8   12.3   1748   2097   1398
Pi 4B        1500   1883    522    471    313   54.9   26.4   2496   3178    998
4B/3B+       1.07   1.76   1.33   1.26   1.05   2.51   2.09   1.43   1.52   0.71

Similar to Pi 4B Results from PCs 

Core i5      @@@@   1813    537    501    317   50.4   32.1   2242    457    561
Core 2       2400   2057    586    580    387   56.7   34.4   2192    381    771
Phenom II    3000   2145    594    492    297   67.1   50.7   2053    694    703

        @@@@ Rated as 1600 MHz but running at up to 2300 MHz using Turbo Boost
https://www.researchgate.net/publicatio ... er_Results

or to compare with computers from 1964 to the 1990s.

https://www.researchgate.net/publicatio ... nd_Results

Dhrystone Benchmark

This appears to be the most popular ARM benchmark and often subject to over optimisation. So you can’t compare results from different compilers. Ignoring this, results in VAX MIPS aka DMIPS and comparisons follow.

The Pi 4B was shown to be around twice as fast as the 3B+ and gcc 8 performance was similar to the ARM V7 compilation.


                                              Best
                ----- Compiler -----         DMIPS
 System     MHz  ARM V6 ARM V7  gcc 8  G8/V7  /MHz

 Pi 3B+    1400   2520   2825   2838   1.00   2.03
 Pi 4B     1500   5077   5366   5646   1.05   3.76

 4B/3B+    1.07   2.01   1.90   1.99          1.86

Similar to Pi 4B Results from PCs 

Athlon 64   2211         4462
Core i5     @@@@         4752
Core 2      2400         6446
Phenom II   3000         7615
https://www.researchgate.net/publicatio ... Longbottom

Linpack 100 Benchmark MFLOPS

This original Linpack benchmark is dependent on fused multiply and add instructions, but the overheads in the standard source code restrict processing speed. It seems that the latest hardware has been modified to execute this type of code more efficiently. The original used Double Precision (DP) arithmetic but, for the benefit of the Pi, Single Precision (SP) versions were produced.

All measurements demonstrate that the Pi 4B was between 3.6 and 4.7 times faster than the Pi 3B+.


                      ARM V6        ARM V7        gcc 8     NEON
 System    MHz      DP     SP     DP     SP     DP     SP     SP 

 Pi 3B+    1400  206.0  220.2  210.5  225.2  224.8  227.3  536.0
 Pi 4B     1500  764.7  880.6  760.2  921.6  957.1 1068.8 2010.5

 4B/3B+    1.07   3.71   4.00   3.61   4.09   4.26   4.70   3.75

Similar to Pi 4B Results from PCs 

Pentium 4E 3000                630.3 
Athlon 64  2150                811.9  
Core 2     1830                997.7  
Core i5    @@@@               1064.7 

Later SSE2
Core 2     1830                             1119.4 
https://www.researchgate.net/publicatio ... Longbottom

Livermore Loops Benchmark MFLOPS

This benchmark measures performance of 24 double precision kernels, initially used to select the latest supercomputer. The official average is the geometric mean, where the Cray 1 supercomputer was rated as 11.9 MFLOPS. Following are the overall scores.

Based on Geomean results, the Pi 4B is shown as being 2.36 times faster than the 3B+, and even more so via the gcc 8 compilation, where the Pi 4B gcc8/V7 performance ratio is 1.31.


System      MHz    Maximum Average Geomean Harmean Minimum

 ARM V7
 Pi 3B+     1400     464.8   246.7   220.1   193.9    78.3
 Pi 4B      1500    1800.2   635.1   519.0   416.1   155.3
 4B/3B+     1.07      3.87    2.57    2.36    2.15    1.98

 gcc 8
 Pi 3B+     1400     541.7   283.4   257.4   231.5    92.7
 Pi 4B      1500    1860.8   800.4   679.0   564.1   179.5
 4B/3B+     1.07      3.40    2.80    2.61    2.41    1.90
                                                
Similar to Pi 4B Results from PCs 

Core i5     @@@@    2326.4   717.1   438.0   246.2    34.7 
Athlon 64   2211    2565.9   740.9   460.5   285.7    48.4
Core 2      2400    2236.2   791.9   539.0   331.6    52.3
Phenom  II  3000    3893.9  1071.8   644.4   384.5    64.0

Later SSE2
Core 2      2400    2854.2  1047.6   848.2   653.8   141.8 
https://www.researchgate.net/publicatio ... Longbottom
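
To illustrate how the Average, Geomean and Harmean columns relate, here is a short Python sketch using the standard `statistics` module. The four kernel speeds are made up for illustration (the per-kernel figures are not listed here); only the 519.0 and 220.1 Geomean values are taken from the ARM V7 rows of the table.

```python
from statistics import mean, geometric_mean, harmonic_mean

# Hypothetical MFLOPS for four kernels, for illustration only.
kernel_mflops = [100.0, 200.0, 400.0, 800.0]

average = mean(kernel_mflops)
geomean = geometric_mean(kernel_mflops)
harmean = harmonic_mean(kernel_mflops)

# A few fast kernels pull up the arithmetic average much more than
# the geometric mean, and the harmonic mean is dominated by slow ones.
print(f"Average {average:.1f}  Geomean {geomean:.1f}  Harmean {harmean:.1f}")

# Pi 4B / Pi 3B+ ratio from the ARM V7 Geomean column of the table.
geomean_ratio = 519.0 / 220.1
print(f"4B/3B+ Geomean ratio {geomean_ratio:.2f}")
```

For any set of positive speeds that are not all equal, Harmean is below Geomean, which is below Average, matching the ordering of the columns in the table.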

Next, single core benchmarks using caches and RAM

RoyLongbottom
Posts: 299
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Wed Jun 26, 2019 4:15 pm

Single Core Benchmarks using Data in Caches and RAM

The following are brief summaries of details and results. For full information, see the HTM file in Raspberry Pi 4B 32 Bit BenchmarksHTM.zip, attached to the first message of this Raspberry Pi 4 benchmarks topic.

https://www.raspberrypi.org/forums/down ... p?id=30443

Fast Fourier Transforms Benchmarks

These are based on methods used in a real application with results being average running times from three passes. Below are Pi 4B results and performance gains over a Pi 3B+. In this case, the new gcc 8 compilations are not much different from the earlier benchmarks.


                              Time in milliseconds

              Raspberry Pi 4B FFT 1        Raspberry Pi 4B FFT 3

          ARM V7           gcc 8          ARM V7           gcc 8
   Size
      K       SP      DP      SP      DP      SP      DP      SP      DP

      1     0.04    0.04    0.04    0.04    0.06    0.05    0.05    0.04
      2     0.08    0.12    0.08    0.13    0.13    0.11    0.10    0.10
      4     0.32    0.37    0.29    0.34    0.27    0.24    0.24    0.23
      8     0.77    0.97    0.79    0.82    0.58    0.55    0.57    0.51
     16     1.69    2.01    1.65    1.85    1.49    1.35    1.32    1.19
     32     4.37    4.89    3.76    4.71    2.96    3.63    2.69    3.30
     64     9.12   26.55    8.82   30.64    7.46   10.75    6.60    9.47
    128    55.52  160.11   58.54  132.41   17.93   26.03   16.92   23.85
    256   305.92  423.06  275.44  373.12   41.16   55.06   37.61   55.97
    512   833.10  854.88  780.89  751.27   86.93  120.53   81.54  128.13
   1024  1617.49 1875.52 1578.70 1812.20  190.28  266.60  186.45  288.27

  Size              RPi 4B Gains (>1.0 4B running time is less)
     K
      1     3.45    3.46    4.02    3.94    3.06    2.66    2.88    3.45
      2     3.79    3.14    4.27    3.84    3.10    2.93    3.28    3.29
      4     2.46    2.50    3.19    3.84    3.86    3.23    3.24    3.22
      8     2.51    2.24    3.82    4.12    3.67    3.18    3.21    3.44
     16     2.76    2.62    3.08    3.23    3.17    4.06    3.25    4.10
     32     2.51    4.21    3.27    4.38    3.62    4.14    3.55    4.13
     64     3.79    4.86    4.23    4.27    3.88    3.42    3.95    3.51
    128     4.43    1.93    4.34    2.42    3.91    3.24    3.83    3.23
    256     1.92    1.51    2.25    1.97    3.82    3.57    3.86    3.23
    512     1.48    1.61    1.58    1.93    4.18    3.60    4.13    3.16
   1024     1.71    1.60    1.76    1.71    4.24    3.66    3.95    3.17
Following are some PC DP results, where the Pi 4 performs relatively well at the smaller FFT sizes before the PCs see the advantage of larger caches and faster RAM.


        Double Precision Version 3 Milliseconds

                   Linux, P4 Windows
                                                  
         Core i7  Core 2  Phenom    Atom Pentium
           4820K    6600  II 945    N455       4
     MHz    3900    2400    3000    1666    1900

  K Size
       1    0.02    0.05    0.03    0.17    0.09
       2    0.04    0.12    0.05    0.42    0.20
       4    0.08    0.29    0.14    0.91    0.42
       8    0.18    0.64    0.41    1.98    0.98
      16    0.37    1.29    0.88    3.92    6.23
      32    0.78    2.86    2.11   10.01    15.2
      64    1.70    6.21    4.64   23.51    34.6
     128    3.66   14.49   10.45   53.03    73.7
     256    8.09   34.85   26.32  115.83   156.0
     512   21.05   81.85   79.23  245.59   344.0
    1024   65.15  178.81  197.41  538.71   804.0
BusSpeed Benchmark

This is a read only benchmark with data from caches and RAM. The program first reads one word in every 32, skipping the data words in between, then repeats with decreasing increments, finally reading all data. This shows where data is read in bursts, enabling estimates to be made of bus speeds.

The following provides Pi 4B results and performance gains, over the Pi 3B+, for the most meaningful column, reading all data. Note that the best ratio indicates the impact of the larger L2 cache.

The PC versions are not comparable as assembly code was used. Details below are for a single thread using an MP version, indicating that the Pi 4 could be comparable on a per MHz basis.


                    Pi 4B  gcc 8                

    Reading Speed 4 Byte Words in MBytes/Second
  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read  Pi 4B  gcc 8
  KBytes  Words  Words  Words  Words  Words    All   Gain   Gain

      16   4880   5075   5612   5852   5877   5864   1.16   1.00
      32    846   1138   2153   3229   4908   5300   0.99   1.11
      64    746   1019   2035   3027   4910   5360   1.50   1.36 
     128    728    983   1952   2908   4888   5389   1.52   1.00
     256    683    934   1901   2794   4874   5431   1.55   1.06
     512    656    900   1760   2625   4585   5259   1.75   1.23
    1024    301    410    870   1356   2846   4238   2.81   1.21
    4096    233    248    531    996   2151   4045   2.35   1.27
   16384    236    258    511    891   2143   4011   2.35   1.25
   65536    237    257    508    881   2172   4015   2.40   1.28

 Core 2 2400 MHz
 L1 Cache  6751   6808   9050   9198   9270   9258
 L2 Cache  2094   2028   3328   4529   6670   7949     
 RAM        318    385    792   1436   2620   4902  
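
One plausible way to turn the skipped-word results into a bus speed estimate is sketched below in Python. It assumes 64-byte cache lines (16 four byte words), so that any increment of 16 words or more fetches a whole line for every word actually read. This is an illustration under that assumption, not necessarily the calculation used for the published reports, and the function name is hypothetical. The input figures are the Pi 4B 65536 KB row from the table.

```python
LINE_BYTES = 64  # assumed cache line (burst) size
WORD_BYTES = 4   # the benchmark reads 4 byte words

def estimated_bus_mb_per_sec(measured_mb_per_sec, inc_words):
    """Scale the speed of the words actually read by the bytes that
    must be moved for each of them (at most one full line)."""
    stride_bytes = inc_words * WORD_BYTES
    bytes_moved_per_word = min(stride_bytes, LINE_BYTES)
    return measured_mb_per_sec * bytes_moved_per_word / WORD_BYTES

# Pi 4B gcc 8 results for 65536 KB (RAM): increment in words -> MB/second.
ram_row = {32: 237, 16: 257, 8: 508, 4: 881, 2: 2172, 1: 4015}

for inc, mbps in ram_row.items():
    print(f"Inc{inc:<3} {mbps:5d} MB/s read -> "
          f"about {estimated_bus_mb_per_sec(mbps, inc):.0f} MB/s moved")
```

Under this assumption the Inc16 figure scales to roughly the same speed as Read All, which is consistent with RAM data arriving in 64-byte bursts.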
MemSpeed Benchmark MB/Second

This includes CPU speed dependent calculations using data from caches and RAM. The calculations are shown in the results column titles. Following are full Pi 4B results from running the gcc 8 recompiled versions, plus full Pi4B/3B+ and some old/gcc 8 comparisons. The Pi 4B is indicated as faster on all test functions, with best case on double precision calculations using cached data.


                           Pi 4B gcc 8  

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       8   11768   9844   3841  11787   9934   4351  10309   7816   7804
      16   11880   9880   3822  11886  10043   4363  10484   7902   7892
      32    9539   8528   3678   9517   8661   4098  10564   7948   7945
      64    9952   9310   3733   9997   9470   4160   8452   7717   7732
     128    9947   9591   3757   9990   9757   4178   8205   7680   7753
     256   10015   9604   3758  10030   9781   4186   8120   7734   7707
     512    9073   9300   3751   9472   9526   4175   7995   7709   7602
    1024    2681   5303   3594   2664   4965   3760   4828   3592   3569
    2048    1671   3488   3242   1757   3635   3540   2882   1036   1023
    4096    1777   3700   3283   1827   3627   3555   2433   1052   1054
    8192    1931   3805   3420   1933   3815   3629   2465    980    971
 
                             Pi 4B/3B+

       8    5.81   3.08   1.99   3.96   2.23   2.09   3.02   1.77   1.76
      16    5.85   3.09   1.98   3.99   2.25   2.10   3.06   1.78   1.77
      32    4.84   2.74   1.93   3.35   2.02   1.99   3.08   1.78   1.78
      64    5.15   3.06   1.99   3.63   2.30   2.07   2.61   1.83   1.83
     128    5.04   3.11   1.99   3.54   2.34   2.08   2.50   1.80   1.81
     256    5.06   3.11   1.99   3.52   2.34   2.08   2.45   1.82   1.82
     512    5.18   3.35   2.16   3.85   2.57   2.26   2.57   1.91   1.86
    1024    2.74   2.85   2.59   2.20   2.23   2.57   3.30   2.93   2.93
    2048    1.71   1.88   2.35   1.55   1.72   2.45   3.36   1.25   1.24
    4096    1.82   2.05   2.41   1.61   1.73   2.49   2.58   1.14   1.15
    8192    1.98   2.13   2.51   1.70   1.86   2.58   3.07   1.24   1.32

                               4B gcc 8 gains

       8    1.39   2.07   0.29   1.42   2.08   0.28   1.32   0.79   0.79
     512    1.12   2.01   0.40   1.15   2.05   0.39   1.59   0.98   0.96
    8192    1.02   2.00   1.82   1.12   2.89   2.65   1.09   0.96   0.95
Windows - Following are some Windows scores, showing that the Pi 4 is in the same league.


 Core 2 2400 MHz 
 L1 Cache  12556   6122   8921  16749   7683   9510  14924   4667   7536 
 L2 Cache  12755   6380   8463  11561   7578   7928   7798   5328   5349
 RAM        4816   4761   4828   3068   3085   3081   1568   1539   1547
NeonSpeed Benchmark MB/Second

This carries out some of the same calculations as MemSpeed. All results are for single precision floating point and integer calculations. Norm tests are as generated by the compiler, using NEON directives, and Neon tests use intrinsic functions.

The NEON calculations were faster than those from a normal C compilation, using cached data, but slower from RAM. Maximum speeds are also shown for NEON vector instructions, as single precision MFLOPS and integer MOPS, both at greater than two results per clock cycle.


                      Pi 4B gcc 8  

       Vector Reading Speed in MBytes/Second
  Memory  Float v=v+s*v  Int v=v+v+s   Neon v=v+v
  KBytes   Norm   Neon   Norm   Neon  Float    Int

      16   9884  12882   3910  12773  13090  15133
      32   9904  13061   3916  13002  13162  15239
      64   9029  11526   3450  10704  11708  12084
     128   9242  11784   3391  11016  11816  12179
     256   9283  11890   3396  11215  11929  12284
     512   9043  10680   3413  10211  10925  11241
    1024   5818   3310   3507   3288   3239   2902
    4096   4060   1994   3497   1991   2009   2011
   16384   4030   2063   3445   2068   2072   2067
   65536   3936   2109   3391   1858   2122   2121

  Max MOPS        3265          3251

                      Pi 4B/3B+

      16   3.05   2.16   2.21   1.90   1.87   1.89
      32   3.25   2.28   2.37   2.00   1.97   1.96
      64   3.85   2.99   2.94   2.59   2.67   2.70
     128   3.65   2.84   2.87   2.47   2.65   2.63
     256   3.60   2.82   2.81   2.45   2.61   2.60
     512   4.58   3.79   3.65   3.35   3.60   3.67
    1024   2.60   2.65   2.59   2.64   2.75   2.67
    4096   1.62   1.77   1.78   1.79   1.77   1.77
   16384   1.86   1.80   1.81   1.88   1.86   1.85
   65536   1.84   1.81   1.82   2.44   1.84   1.86

                      4B gcc 8 gains
 
      16   1.02   1.28   0.44   1.36   1.34   1.44
     512   0.87   0.97   0.34   1.02   0.99   0.98
    1024   1.90   1.03   1.11   1.04   1.01   0.91
   16384   1.99   1.02   1.72   1.05   1.02   1.03
Next Multithreading Benchmarks

RoyLongbottom
Posts: 299
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Fri Jun 28, 2019 8:36 pm

Multithreading Benchmarks

Most of the multithreading benchmarks execute the same calculations using 1, 2, 4 and 8 threads. Quite a few are versions of the single core benchmarks, with similar Pi 4 performance gains. With these, comparisons with PCs become more complicated and none are provided. There are variations in the number of cores, cache sizes, memory throughput and more efficient SIMD instructions, normally providing an advantage to the PCs. On the simpler MP benchmarks, the Raspberry Pi 4 systems could provide similar performance (at a much lower cost, of course).

Again, the ZIP file in the earlier message contains many more details:

https://www.raspberrypi.org/forums/view ... c#p1484388

MP-Whetstone Benchmark

Multiple threads each run the eight test functions at the same time, but with some dedicated variables. Measured speed is based on the last thread to finish and performance is generally proportional to the number of cores used. Following are Pi 4B results and performance gains over Pi 3B+. Seconds used indicates MP efficiency.


           MP-Whetstone Benchmark Linux/ARM Pi 4B gcc 8  

      MWIPS MFLOPS MFLOPS MFLOPS   Cos   Exp   Fixpt      If  Equal
                 1      2      3  MOPS  MOPS    MOPS    MOPS   MOPS

 1T  1889.5  538.7  537.6  311.4  56.3  26.1  7450.5  2243.2  659.9
 2T  3782.7 1065.5 1071.2  627.1 112.3  52.0 14525.7  4460.9 1327.3
 4T  7564.1 2101.0 2145.9 1250.4 225.0 104.1 29430.5  8944.2 2660.8
 8T  8003.6 2598.8 2797.0 1313.0 233.2 110.4 37906.3 10786.7 2799.4

   Overall Seconds   4.99 1T,   5.00 2T,   5.03 4T,  10.06 8T      

                4 Thread Performance ratios 4B/3B+                      
  
 gcc8  1.79   1.39   1.40   1.05   2.66   2.13   1.53   1.68   0.48
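
The scaling in the table can be checked with a few lines of Python. The MWIPS and overall seconds are copied from the Pi 4B gcc 8 results above. The throughput calculation assumes that every thread does the same fixed amount of work, which is an assumption made for this sketch, as are the function names.

```python
# MWIPS and overall seconds from the Pi 4B gcc 8 results, keyed by thread count.
mwips = {1: 1889.5, 2: 3782.7, 4: 7564.1, 8: 8003.6}
seconds = {1: 4.99, 2: 5.00, 4: 5.03, 8: 10.06}

def speedup(threads):
    """MWIPS relative to the single thread run."""
    return mwips[threads] / mwips[1]

def throughput_gain(threads):
    """Work per second relative to one thread, assuming each thread
    does the same fixed amount of work (assumption for this sketch)."""
    return threads * seconds[1] / seconds[threads]

for t in sorted(mwips):
    print(f"{t}T  MWIPS speedup {speedup(t):.2f}  "
          f"throughput gain {throughput_gain(t):.2f}")
```

Run time staying near 5 seconds up to 4 threads and then doubling at 8 threads is what would be expected from four cores: the extra threads have to wait their turn.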
MP-Dhrystone Benchmark

This executes multiple copies of the same program, but with some shared data, leading to inconsistent multithreading performance, as indicated in the results. The single thread speeds were similar to the earlier Dhrystone results, with RPi 4B ratings around twice as fast as those for the Pi 3B+.


                      Pi 3B+ gcc 8                        

 Threads                        1        2        4        8
 Seconds                     0.79     0.92     1.23     2.46
 Dhrystones per Second    5035879  8678942 13020489 13028455
 VAX MIPS rating             2866     4940     7411     7415

 
                      Pi 4B gcc 8                           
 
 Threads                        1        2        4        8
 Seconds                     0.79     1.21     2.62     4.88
 Dhrystones per Second   10126308 13262168 12230188 13106002
 VAX MIPS rating             5763     7548     6961     7459
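
The VAX MIPS rating is the Dhrystones per second figure divided by 1757, the score attributed to the VAX 11/780. This short Python sketch reproduces the ratings in the tables from the Dhrystones per second rows; the function name and the 1500 MHz clock used for the per MHz figure are the only additions.

```python
VAX_11_780_DHRYSTONES_PER_SEC = 1757  # reference score for 1 VAX MIPS

def vax_mips(dhrystones_per_second):
    """Convert Dhrystones per second to VAX MIPS (DMIPS)."""
    return dhrystones_per_second / VAX_11_780_DHRYSTONES_PER_SEC

# Dhrystones per second for 1, 2, 4 and 8 threads, from the tables above.
pi_3b_plus = [5035879, 8678942, 13020489, 13028455]
pi_4b = [10126308, 13262168, 12230188, 13106002]

print("Pi 3B+:", [round(vax_mips(d)) for d in pi_3b_plus])
print("Pi 4B: ", [round(vax_mips(d)) for d in pi_4b])

# Single thread DMIPS per MHz for the Pi 4B at 1500 MHz.
print("Pi 4B DMIPS/MHz:", round(vax_mips(pi_4b[0]) / 1500, 2))
```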
MP SP NEON Linpack Benchmark

This executes a single copy of the benchmark, at three data sizes, with the critical daxpy code multithreaded. This code was also modified to allow a higher level of parallelism, without changing any calculations. Even so, MP performance was much slower than running with no threading. The main reasons appear to be updating data in RAM to maintain integrity, so that performance reflects memory speeds, and exceptionally high thread start/stop overheads. The best no thread Pi 4 gain was 3.1 times using cached data, with the worst at 1.13 using RAM.


Linpack Single Precision MultiThreaded Benchmark
          Using NEON Intrinsics

  MFLOPS 0 to 4 Threads, N 100, 500, 1000     

 Threads      None        1        2        4 
 
                     Pi 3B+ gcc 8            

 N  100     638.49    66.92    66.23    66.14 
 N  500     471.71   304.69   297.05   305.51 
 N 1000     356.13   317.22   316.88   316.33 

                      Pi 4B gcc 8             

 N  100    2007.38   112.55   107.85   106.98 
 N  500    1332.24   686.10   686.11   689.02 
 N 1000     402.61   435.26   432.21   432.01 
MP BusSpeed (read only) Benchmark

Each thread accesses all of the data in separate sections covering caches and RAM, starting at different points, with this V7A v2 version. See single processor BusSpeed details regarding burst reading that can indicate significant differences. RdAll is the main area for comparison, where MP reading RAM is thought to indicate maximum performance.

Comparisons are provided for RdAll, at 1, 2 and 4 threads. These are subject to multiprocessing peculiarities, but Pi 4B/Pi 3B+ performance gains were indicated as being more than twice as fast in most cases.


             MP-BusSpd ARM V7A gcc 8  Pi 4B        
 
  KB      Inc32  Inc16   Inc8   Inc4   Inc2  RdAll  4B/3B      

 12.3 1T   5310   5616   5801   5898   5940  13425   2.54
      2T   9393  10008  11293  11293  11368  24932   2.47
      4T  15781  15015  17606  19034  22279  40736   2.47
      8T   8465   9599  14580  18465  20034  36831       
122.9 1T    664    930   1861   3191   5017  10281   2.18
      2T    564    726   1523   5376   9387  18985   2.04
      4T    486    919   1886   4289   8337  16979   1.04
      8T    487    912   1854   4275   8271  16826       
12288 1T    225    258    514   1010   1992   3975   2.02
      2T    202    421    450   1765   3307   7396   3.51
      4T    261    288    825   1332   1772   5014   2.44
      8T    218    273    496   1041   2571   4021       
MP RandMem Benchmark

This benchmark potentially reads and writes all data, in sections covering caches and RAM, each thread starting at different addresses. Random access can select any address after that. Serial reading speed is normally similar to BusSpeed RdAll. Writing tends to involve updating the appropriate memory area, providing constant speeds. Random access is significantly affected by burst reading and writing.

Besides the full results, Pi 4B/3B+ performance ratios for the four thread results are shown below. The Pi 3B+ appears to be faster reading data from the shared L2 cache, with 4 threads only. Otherwise, the average performance of the new processor was indicated as 80% faster.


    MP-RandMem Linux/ARM Pi 4B gcc 8

  KB       SerRD SerRDWR   RndRD RndRDWR

  12.3 1T    5950    7903    5945    7896
       2T   11849    7923   11887    7917
       4T   23404    7785   23395    7761
       8T   21903    7669   23104    7655
 122.9 1T    5670    7309    2002    1924
       2T   10682    7285    1648    1923
       4T    9944    7266    1813    1927
       8T    9896    7216    1812    1919
 12288 1T    3904    1075     179     164
       2T    7317    1055     215     164
       4T    3398    1063     343     165
       8T    4156    1062     350     165

             Pi 4B/3B+ gcc 8

  12.3 4T    1.43    1.82    1.43    1.81
 122.9 4T    0.79    1.87    0.76    1.86
 12288 4T    0.93    1.29    2.54    2.62
MP-MFLOPS Benchmarks

MP-MFLOPS measures floating point speed on data from caches and RAM, using 2 and 32 operations per data word read and written. There are three versions: single and double precision normal compilations, and a further single precision benchmark using NEON intrinsic functions.

Maximum GFLOPS here were 11.0 SP, 17.2 SP NEON and 10.4 DP. There were Pi 4B/3B+ gains all round, lowest where RAM speed was the limit and highest, at over six times faster, from the normal compilations.

Code: Select all

          MP-MFLOPS Linux/ARM Pi 4B gcc 8                    

        2 Ops/Word              32 Ops/Word         
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS

 1T     1224    1257     520    2814    2800    2803
 2T     2485    2257     525    5608    5575    5576
 4T     4119    3243     534   11018   10645    8358
 8T     4131    4618     541    9941   10339    8165
Max                                                 
4B/3B+  5.48    6.07    1.35    3.61    3.53    2.86

            NEON Intrinsic Functions Version

 1T     2797    2870     641    4422    4454    4405
 2T     3217    5601     569    8587    8800    8377
 4T     7902    9864     611   17061   17215    9704
 8T     7070   10562     603   15531   16203    9516
Max                                                 
4B/3B+  3.78    4.13    1.49    2.30    2.32    1.61

              Double Precision Version     
       
 1T     1203    1211     315    2675    2719    2674
 2T     2291    2441     293    5406    5421    4907
 4T     4673    2501     309   10313   10393    5256
 8T     4394    3550     265    8782   10110    5197
Max                                                 
4B/3B+  6.69    4.45    1.56    3.30    3.33    1.89
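
As a rough illustration of why the 12800 KB columns are RAM speed limited, the following Python sketch converts an MFLOPS figure at a given number of operations per word into the memory traffic it implies. It assumes, following the description above, that each word is read once and written once for every group of operations, and that SP words are 4 bytes and DP words 8 bytes. The function name is hypothetical and is not part of the benchmark.

```python
def implied_mb_per_sec(mflops, ops_per_word, bytes_per_word):
    """Memory traffic in each direction (MB/second) implied by an MFLOPS score.

    Assumption: every data word is read once and written once per
    ops_per_word floating point operations.
    """
    words_per_microsecond = mflops / ops_per_word  # millions of words per second
    return words_per_microsecond * bytes_per_word


# 4 thread results at 12800 KB and 2 operations per word, from the table above
sp_traffic = implied_mb_per_sec(534, 2, 4)   # single precision
dp_traffic = implied_mb_per_sec(309, 2, 8)   # double precision

print(f"SP: {sp_traffic:.0f} MB/s read and the same written")
print(f"DP: {dp_traffic:.0f} MB/s read and the same written")
```

On this reckoning both precisions move a broadly similar volume of data per second from RAM, which is consistent with memory rather than arithmetic being the limit at that size.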
OpenMP-MFLOPS Benchmarks

Using multithreading produced by OpenMP, this benchmark carries out the same single precision calculations as the MP-MFLOPS benchmarks but, in addition, calculations with eight operations per data word. There is also a notOpenMP-MFLOPS single core version, compiled from the same code and carrying out identical numbers of floating point calculations, but without an OpenMP compile directive. The two varieties were also repeated using double precision variables. The parameters for these benchmarks were used to test large PCs, with the last two data sizes tending to be dependent on RAM speed on the Raspberry Pi systems.

Below are Pi 3B+ and 4B results from the gcc 8 compilations. Included are 4 core MP gains and Pi 4B/3B+ performance ratios, both having wide variations, but worth having. Maximum GFLOPS were 19.9 SP and 9.2 DP.

Code: Select all

         gcc 8 Compiler Single Precision                                         

         Pi 3B+               Pi 4B                  Pi4 Gains 
 KB+Ops    4       1   4 core   4       1   4 core   4      1  
  /Word  Cores   Core   Gain  Cores   Core   Gain  Cores   Core

  100- 2  2139    778   2.75   5100   2270   2.25   2.38   2.92
 1000- 2   398    403   0.99    617    632   0.98   1.55   1.57
10000- 2   412    415   0.99    542    631   0.86   1.32   1.52

  100- 8  7348   1919   3.83  13805   5511   2.50   1.88   2.87
 1000- 8  1597   1448   1.10   2168   2217   0.98   1.36   1.53
10000- 8  1635   1444   1.13   2178   2542   0.86   1.33   1.76

  100-32  8497   2023   4.20  19921   5341   3.73   2.34   2.64
 1000-32  5997   1903   3.15   8556   5267   1.62   1.43   2.77
10000-32  6057   1914   3.16   8731   5276   1.65   1.44   2.76


         gcc 8 Double Precision                                  

         Pi 3B+               Pi 4B                  Pi4 Gains 
 KB+Ops    4       1   4 core   4       1   4 core   4      1  
  /Word  Cores   Core   Gain  Cores   Core   Gain  Cores   Core

  100- 2   711    203   3.50   3200    977   3.28   4.50   4.81
 1000- 2   193    168   1.15    274    295   0.93   1.42   1.76
10000- 2   199    172   1.16    273    307   0.89   1.37   1.78

  100- 8  1898    503   3.77   6771   2440   2.78   3.57   4.85
 1000- 8   730    434   1.68   1102   1072   1.03   1.51   2.47
10000- 8   755    435   1.74   1108   1255   0.88   1.47   2.89

  100-32  3072    793   3.87   9229   2725   3.39   3.00   3.44
 1000-32  2695    765   3.52   4256   2674   1.59   1.58   3.50
10000-32  2719    765   3.55   4469   2677   1.67   1.64   3.50
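
For anyone wanting to reproduce the gain columns, this is a minimal Python sketch of how they appear to be derived from the four MFLOPS figures on each row. The function and key names are made up for illustration.

```python
def gains(pi3_4core, pi3_1core, pi4_4core, pi4_1core):
    """Return the four ratio columns for one table row, rounded to 2 places."""
    return {
        "pi3_mp_gain": round(pi3_4core / pi3_1core, 2),   # Pi 3B+ 4 core gain
        "pi4_mp_gain": round(pi4_4core / pi4_1core, 2),   # Pi 4B 4 core gain
        "pi4_gain_4core": round(pi4_4core / pi3_4core, 2),  # Pi 4B/3B+, 4 cores
        "pi4_gain_1core": round(pi4_1core / pi3_1core, 2),  # Pi 4B/3B+, 1 core
    }


# Single precision, 100 KB, 2 operations per word
row = gains(2139, 778, 5100, 2270)
print(row)
# {'pi3_mp_gain': 2.75, 'pi4_mp_gain': 2.25, 'pi4_gain_4core': 2.38, 'pi4_gain_1core': 2.92}
```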
Maximum Possible Double Precision GFLOPS

The latest floating point performance improvements, via gcc 8, are due to automatic compilation of NEON instructions, in the format shown below for 32 operations per word using double precision working. These are not SIMD instructions, which would use 128 bit registers holding two DP words. Here, maximum speed would be two operations per clock cycle, via vfma fused multiply and add instructions, or 12 GFLOPS from four cores. ARM documentation appears to confirm that SIMD is not available using double precision functions. In this case, the DP benchmark measurements of around 10 GFLOPS are quite respectable.

I could not locate a maximum single precision speed, where SIMD is used, handling four words from one instruction. Fused operations could lead to 8 results per clock cycle, or 48 GFLOPS from four cores. The results here suggest that it might be 24 GFLOPS.

Code: Select all

.L18:                            
    vldr.64         d17, [r1]   
    vadd.f64        d16, d17, d4 
    vadd.f64        d18, d17, d0 
    vadd.f64        d25, d17, d15
    vadd.f64        d24, d17, d11
    vmul.f64        d16, d16, d5 
    vadd.f64        d23, d17, d31
    vadd.f64        d22, d17, d27
    vadd.f64        d21, d17, d2 
    vadd.f64        d20, d17, d6 
    vadd.f64        d19, d17, d13
    vfma.f64        d16, d18, d1 
    vadd.f64        d18, d17, d9 
    vadd.f64        d17, d17, d29
    vfma.f64        d16, d25, d14
    vfma.f64        d16, d24, d10
    vfma.f64        d16, d23, d30
    vfma.f64        d16, d22, d28
    vfms.f64        d16, d21, d3 
    vfms.f64        d16, d20, d7 
    vfms.f64        d16, d19, d12
    vfms.f64        d16, d18, d8 
    vfms.f64        d16, d17, d26
    vstmia.64       r1!, {d16}   
    cmp             r0, r1       
    bne             .L18         
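
The arithmetic behind those maximum figures, and the operation count of the loop above, can be checked with a short Python sketch. It assumes the 1500 MHz clock reported for the Pi 4B elsewhere in this thread, and counts each fused vfma or vfms as two floating point operations. The instruction counts were tallied by hand from the listing.

```python
CLOCK_GHZ = 1.5  # assumption: Pi 4B running at 1500 MHz
CORES = 4


def peak_gflops(flops_per_cycle, clock_ghz=CLOCK_GHZ, cores=CORES):
    """Theoretical peak GFLOPS for a given number of operations per clock."""
    return flops_per_cycle * clock_ghz * cores


# Floating point instructions in the .L18 loop, counted from the listing
instruction_counts = {"vadd": 11, "vmul": 1, "vfma": 5, "vfms": 5}
flops_per_instruction = {"vadd": 1, "vmul": 1, "vfma": 2, "vfms": 2}

ops_per_word = sum(
    count * flops_per_instruction[name]
    for name, count in instruction_counts.items()
)

print(ops_per_word)      # 32, matching the 32 operations per word test
print(peak_gflops(2))    # 12.0  DP, one fused multiply-add per cycle
print(peak_gflops(8))    # 48.0  SP, four word SIMD with fused operations
print(peak_gflops(4))    # 24.0  SP figure suggested by the results
```

Against the 12 GFLOPS ceiling, the best four thread DP result of 10393 MFLOPS is around 87% of the theoretical maximum.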
OpenMP-MemSpeed Benchmark

This is the same program as the single core MemSpeed benchmark, but with increased memory sizes and compiled using OpenMP directives. The same program was also compiled without these directives as a single core test, with similar Pi 4B/3B+ gains as the shorter version.

Although the source code appears to be suitable for speed up by parallelisation, many of the test functions are slower using OpenMP, and the effects on the Pi 3B+ and Pi 4B are not the same. Detailed comparisons of these results are rather meaningless. Following is one example.

Code: Select all

                           Pi 4B gcc 8                                   

     Memory Reading Speed Test OpenMP gcc 8 by Roy Longbottom           

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]     
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4    8097   8322   8641   8020   8436   8384  39701  19701  19712
       8    7814   8555   8756   8321   8548   8526  39042  19984  19996
      16    8149   7738   7742   8303   7779   8192  37995  19883  19984
      32    8969   8769   8799   9040   8759   8743  37737  20133  20130
      64    7617   7457   7437   7575   7380   7422  17770  15332  14248
     128   11221  10936  11003  11105  11011  10986  13650  13910  13881
     256   17883  18144  18036  17691  18094  17844  13073  12465  12535
     512   18001  18468  19675  17075  18221  19264  13511  13895  12008
    1024    9532  10590   9772  11842  11282  11277   7173   9473   9496
    2048    7095   7025   6866   7117   7043   6946   2914   3475   3468
    4096    7244   6927   7036   5951   7054   6531   2582   3130   3122
    8192    4578   7173   7025   6322   7078   7182   2504   3127   3115
   16384    5470   7043   7067   7103   7052   7020   2557   3093   3088
   32768    7359   7817   7766   7158   7078   7757   2618   3066   3094
   65536    7810   7268   7266   3824   7478   5164   2486   3016   2931
  131072    2460   2655   7224   7513   7308   7339   2540   2944   2940

 Single Core Results                                                                
       8   11775   3895   4342  11787   4325   4354  10334   7806   7816
     256   10032   3699   4223   9978   4289   4185   7105   7612   7621
   65536    2099   2587   3033   2103   3021   3001   2585   1105   1101
 
Next are I/O Benchmarks

RoyLongbottom

Re: Raspberry Pi Benchmarks

Sat Jun 29, 2019 2:05 pm

I/O Benchmarks

As used previously, variations of the same benchmark can be used to measure performance of local drives and networks, covering writing and reading large and small files, and random access. A run time parameter can be used to increase the sizes of the large files.

LanSpeed Benchmark - WiFi

Following are Raspberry Pi 3B+ and Pi 4B results using what I believe were both the 2.4 GHz and 5 GHz WiFi frequencies. Details on setting up the links can be found in the following PDF file, LAN/WiFi section:

https://www.researchgate.net/publicatio ... ress_Tests

Performance of the two systems was similar at both frequencies.

Code: Select all

 ******************** Pi 3B+ 2.4 GHz ********************
 
                       MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8    5.71    6.07    5.96    5.69    5.46    4.76
      16    6.14    6.38    6.47    6.14    6.15    5.91

 Random     Read                    Write
 From MB       4       8      16       4       8      16
 msecs      2.94   3.081   3.185    3.04    2.89     3.7

 200 Files  Write                   Read                  Delete
 File KB       4       8      16       4       8      16   secs
 MB/sec     0.16    0.57    0.96    0.36    0.63    1.17
 ms/file    25.3   14.31    17.1   11.46   13.04   14.06   2.138

 ********************* Pi 3B+ 5 GHz *********************

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3
 
       8   12.82   14.52   14.00   10.98   11.09    8.94
      16   11.60   12.91    4.48    9.16    8.19    7.69

 200 Files  Write                   Read                  Delete
 File KB       4       8      16       4       8      16   secs
 MB/sec     0.41    0.76    1.46    0.41    0.74    1.46
 ms/file    9.96   10.83   11.19   10.11   11.02   11.23   1.990

 Random similar to 2.4 GHz


 ********************* Pi 4B 2.4 GHz ********************

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8    6.35    6.33    6.38    7.05    6.98    7.10
      16    6.70    6.82    6.76    7.19    6.53    7.22

 Random     Read                    Write
 From MB       4       8      16       4       8      16
 msecs     2.691   2.875   3.048    3.13    2.93    2.84

 200 Files  Write                   Read                  Delete
 File KB       4       8      16       4       8      16   secs
 MB/sec     0.34    0.44    1.04    0.37    0.37    1.26
 ms/file   12.14   18.59    15.7    11.1    22.2   12.99   2.153


 ********************** Pi 4B 5 GHz *********************

                         MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   11.90   12.96   13.16   10.11    9.55    9.66
      16   11.50   13.93   14.13    9.91    8.88    9.92

 200 Files  Write                   Read                  Delete
 File KB       4       8      16       4       8      16   secs
 MB/sec     0.13    0.46    0.91    0.25    0.55    1.02
 ms/file   30.85   17.83   18.10   16.62   14.93   16.01   3.361

 Random similar to 2.4 GHz
LanSpeed Benchmark - LAN 1 Gbps Pi 4

There can be significant variability in performance with these small samples. For the large files, the default sizes were increased to produce more stable speeds. In this case, 1 Gbps was clearly demonstrated using the Pi 4B, around three times faster than the Pi 3B+. Random access was mostly slightly faster on the Pi 4B and, with the small files, it was perhaps 25% faster on writing and 50% faster on reading.

Code: Select all

 ************************ Pi 3B+ ************************
 
                       MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   31.17   31.62   31.61    13.5   26.19   26.38
      16   31.62   31.89   31.76    26.7   26.94   27.01

 Random     Read                    Write
 From MB       4       8      16       4       8      16
 msecs     0.007    1.09   0.688    1.16    1.04    1.08

 200 Files  Write                   Read                  Delete
 File KB       4       8      16       4       8      16   secs
 MB/sec     1.15    2.26    4.18    1.73    3.18    5.66
 ms/file    3.57    3.62    3.92    2.36    2.58    2.89   0.511

 Larger Files
                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

      32   31.99   31.61   32.13   21.39   27.09   26.87
      64   32.33   32.37   32.35   26.94   26.98    26.7

 ************************ Pi 4B ************************

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   67.82   12.97   90.19   99.84   93.49   96.83
      16   92.25   92.66   92.96   103.9  105.28   91.17

Random     Read                    Write
From MB        4       8      16       4       8      16
msecs      0.007    0.01    0.04    1.01    0.85    0.91

200 Files  Write                   Read                  Delete
File KB        4       8      16       4       8      16  secs
MB/sec      1.47     2.8    5.14    2.47    4.71    8.61
ms/file     2.78    2.92    3.19    1.66    1.74     1.9   0.256

 Larger Files
                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

      32    78.2   34.46   80.71   84.94   87.11   84.97
      64   88.18   87.52   87.03  111.34  109.58  107.28

     128   98.84   99.24   96.58  110.99  110.57   87.43
     256  106.75  105.43   106.4   85.78  108.99  106.29
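
To put the large file speeds in context, the following Python sketch converts the nominal 1 Gbps link rate into MB/second and expresses a measured speed as a fraction of it. It assumes 1 MB is 1,000,000 bytes and ignores all protocol overheads, so the real ceiling is somewhat lower than the figure used here. The function names are hypothetical.

```python
def line_rate_mb_per_sec(gbps):
    """Nominal link rate in MB/second, taking 1 MB as 10**6 bytes."""
    return gbps * 1000 / 8


def utilisation(measured_mb_per_sec, gbps=1.0):
    """Measured speed as a fraction of the nominal link rate."""
    return measured_mb_per_sec / line_rate_mb_per_sec(gbps)


print(line_rate_mb_per_sec(1.0))        # 125.0
print(f"{utilisation(111.34):.0%}")     # Pi 4B, 64 MB Read1
print(f"{utilisation(32.37):.0%}")      # Pi 3B+, 64 MB Write2
```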
DriveSpeed Benchmark - USB

Following are DriveSpeed results on Pi 3B+ and 4B, using the same high speed USB 3 stick (SanDisk Extreme with write/read ratings of 110/190 MB/s and 16 KB sectors). Other sticks would probably provide different comparative performance.

On the largest files shown, Pi 4B performance gains were 2.2 times on writing and 5.3 times on reading. Unlike LanSpeed, DriveSpeed uses Direct I/O, leading to an extra entry for cached files, where reading is mainly influenced by RAM speeds. Results can be too variable to provide meaningful comparisons.

Random access speeds were quite similar. On small files, relative reading speed was indicated as five times faster on the Pi 4B, but the 3B+ appeared to be nearly 30 times faster on writing.

For the Pi 4B, additional large file results are included for a Patriot Rage 2 USB 3 stick, rated as reading at up to 400 MB/second, with near 300 MB/second demonstrated using a Windows version of DriveSpeed. In this case, it appeared to be slightly slower than the first one on reading, but faster on writing, at 80 MB/second. This second drive also obtained those painfully slow speeds on writing small files.

Code: Select all

    ********************* Pi 3B+ USB 2 ********************

   DriveSpeed RasPi 1.1 Wed Apr 24 22:09:09 2019

 /media/pi/REMIX_OS/
 Total MB    9017, Free MB    7486, Used MB    1531

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   27.71   27.35   27.13   30.72    30.9   31.31
      16   27.21   27.54   23.69   29.89   31.34   31.27
Cached
       8   52.24   59.57   46.88  333.08  741.57  780.68

Random      Read                   Write
From MB        4       8      16       4       8      16
msecs      0.403   0.403   0.404    0.74    0.85    0.59

200 File   Write                    Read                  Delete
File KB        4       8      16       4       8      16   secs
MB/sec      1.10    2.12    3.82    6.04    9.17   14.01
ms/file     3.71    3.86    4.28    0.68    0.89    1.17   0.123

                        MBytes/Second
MB      Write1  Write2  Write3  Read1   Read2   Read3

    1000   27.25   27.25   27.19   31.23   31.27   31.27
    2000   27.30   27.07   27.32   31.32   31.26   31.26

 ********************* Pi 4B USB 3 *********************

   DriveSpeed RasPi 1.1 Fri Apr 26 17:21:56 2019

 /media/pi/REMIXOSSYS//
 Total MB    5108, Free MB    3982, Used MB    1126

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   33.28   32.27   32.28  161.34  162.25  163.85
      16   39.85   41.95   43.02  164.07  165.53  165.84
Cached
       8   33.32   34.96   34.96  593.94  582.25  589.22

Random      Read                   Write
From MB        4       8      16       4       8      16
msecs      0.383   0.372   0.371    0.77    0.83    0.63

200 File   Write                    Read                  Delete
File KB        4       8      16       4       8      16   secs
MB/sec      0.04    0.07    0.15   20.64   41.04   70.01
ms/file   110.04  109.97  110.01    0.20    0.20    0.23   0.089

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

     500   56.36   58.13   55.25  166.31  165.46  165.43
    1000   59.56   61.46   60.54  161.69  165.97  166.49


 /media/pi/PATRIOT/
 Total MB  120832, Free MB  120832, Used MB       0

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

    1000   80.87   80.23   81.92  131.41  130.72  130.39
    2000   83.67   81.82   82.14  130.85  131.29  131.36
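
The small file MB/sec and ms/file rows are two views of the same measurement. The Python sketch below shows the conversion, assuming a KB is 1024 bytes and a MB is 1,000,000 bytes, which is what the table values appear to imply. As the published ms/file figures are rounded, results only agree approximately. The function name is made up for illustration.

```python
def small_file_mb_per_sec(file_kb, ms_per_file):
    """Convert milliseconds per file into MB/second for a given file size.

    Assumptions: 1 KB = 1024 bytes, 1 MB = 10**6 bytes.
    """
    bytes_per_file = file_kb * 1024
    seconds_per_file = ms_per_file / 1000
    return bytes_per_file / seconds_per_file / 1e6


# Writing 4 KB files: Pi 3B+ at 3.71 ms/file, Pi 4B at 110.04 ms/file
pi3 = small_file_mb_per_sec(4, 3.71)
pi4 = small_file_mb_per_sec(4, 110.04)

print(f"Pi 3B+ {pi3:.2f} MB/sec, Pi 4B {pi4:.2f} MB/sec")
print(f"Pi 3B+ faster by {110.04 / 3.71:.1f} times on writing")
```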
DriveSpeed Benchmark Pi 4B Main Drive

This shows DriveSpeed performance measured on the main drive which, in this case, was nowhere near USB 3 speeds.

Code: Select all

     DriveSpeed RasPi 1.1 Mon Apr 29 10:20:57 2019

 Current Directory Path: /home/pi/Raspberry_Pi_Benchmarks/DriveSpeed/drive1
 Total MB   14845, Free MB    8198, Used MB    6646

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   16.41   11.21   12.27   39.81   40.10   40.39
      16   11.79   21.10   34.05   40.18   40.19   40.33
Cached
       8  137.47  156.43  285.59  580.73  598.66  587.97

Random      Read                   Write
From MB        4       8      16       4       8      16
msecs      0.371   0.371   0.363    1.28    1.53    1.30

200 File   Write                    Read                  Delete
File KB        4       8      16       4       8      16   secs
MB/sec      3.49    6.41    8.26    7.67   11.68   17.51
ms/file     1.17    1.28    1.98    0.53    0.70    0.94   0.014
Next Java and OpenGL

RoyLongbottom

Re: Raspberry Pi Benchmarks

Sat Jun 29, 2019 3:08 pm

Java and OpenGL

The Java benchmarks were run after installing Oracle Java 8, then OpenJDK11 later.

Java Whetstone Benchmark

Pi 4B performance was nearly as good as the compiled C version. However, there can be wide variations involving new Java versions. Here, the Pi 3B+ overall MWIPS rating was particularly slow, entirely due to the time taken by the N5 sin, cos and N8 exp, sqrt tests. Other than these, the Pi 4B was up to around three times faster.

Code: Select all

 ************************ Pi 3B+ ************************

      Whetstone Benchmark Java Version, May 14 2019, 15:02:11

                                                       1 Pass
  Test                  Result       MFLOPS     MOPS  millisecs

  N1 floating point  -1.124750137    215.20             0.0892
  N2 floating point  -1.131330490    208.76             0.6438
  N3 if then else     1.000000000             103.58    0.9992
  N4 fixed point     12.000000000             538.09    0.5854
  N5 sin,cos etc.     0.499110103               7.04   11.8100
  N6 floating point   0.999999821    106.22             5.0780
  N7 assignments      3.000000000             322.85    0.5724
  N8 exp,sqrt etc.    0.751108646               1.38   26.9200

  MWIPS                              214.14            46.6980

  Operating System    Linux, Arch. arm, Version 4.14.70-v7+
  Java Vendor         Oracle Corporation, Version  1.8.0_212
 

 ************************ Pi 4B ************************

     Whetstone Benchmark Java Version, May 14 2019, 14:16:44

                                                       1 Pass
  Test                  Result       MFLOPS     MOPS  millisecs

  N1 floating point  -1.124750137    503.94             0.0381
  N2 floating point  -1.131330490    488.37             0.2752
  N3 if then else     1.000000000             332.80    0.3110
  N4 fixed point     12.000000000             881.37    0.3574
  N5 sin,cos etc.     0.499110132              42.92    1.9384
  N6 floating point   0.999999821    345.77             1.5600
  N7 assignments      3.000000000             332.97    0.5550
  N8 exp,sqrt etc.    0.825148463              25.00    1.4880

  MWIPS                             1533.01             6.5231

  Operating System    Linux, Arch. arm, Version 4.19.29-v7l+
  Java Vendor         Oracle Corporation, Version  1.8.0_212

 ******************* Pi 4B OpenJDK11 *******************

  Whetstone Benchmark OpenJDK11 Java Version, May 15 2019, 18:48:20

                                                       1 Pass
  Test                  Result       MFLOPS     MOPS  millisecs

  N1 floating point  -1.124750137    524.02             0.0366
  N2 floating point  -1.131330490    494.12             0.2720
  N3 if then else     1.000000000             289.92    0.3570
  N4 fixed point     12.000000000            1092.99    0.2882
  N5 sin,cos etc.     0.499110132              59.86    1.3900
  N6 floating point   0.999999821    345.95             1.5592
  N7 assignments      3.000000000             331.54    0.5574
  N8 exp,sqrt etc.    0.825148463              25.41    1.4640

  MWIPS                             1687.92             5.9244

  Operating System    Linux, Arch. arm, Version 4.19.37-v7l+
  Java Vendor         BellSoft, Version  11.0.2-BellSoft
JavaDraw Benchmark

The benchmark uses from small to rather excessive numbers of simple objects to measure drawing performance in Frames Per Second (FPS). Five tests draw on a background of continuously changing colour shades, each test adding to the load. In order for this to run at maximum speed, it was necessary to disable the experimental GL driver. A later version was produced and run via OpenJDK11.

Pi 4B performance gains were best on the most complex test function.

Code: Select all

  ************************ Pi 3B+ ************************

   Java Drawing Benchmark, May 14 2019, 15:32:06
            Produced by javac 1.6.0_27

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1      566    56.55
  Display PNG Bitmap Twice Pass 2      651    65.00
  Plus 2 SweepGradient Circles         665    66.45
  Plus 200 Random Small Circles        660    65.93
  Plus 320 Long Lines                  442    44.16
  Plus 4000 Random Small Circles       334    33.30

         Total Elapsed Time  60.1 seconds

  Operating System    Linux, Arch. arm, Version 4.14.70-v7+
  Java Vendor         Oracle Corporation, Version  1.8.0_212


 ************************ Pi 4B ************************

   Java Drawing Benchmark, May 14 2019, 14:33:58
            Produced by javac 1.7.0_02

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1      791    79.05
  Display PNG Bitmap Twice Pass 2      932    93.11
  Plus 2 SweepGradient Circles        1152   115.17
  Plus 200 Random Small Circles       1200   119.98
  Plus 320 Long Lines                  784    78.31
  Plus 4000 Random Small Circles       621    62.03

         Total Elapsed Time  60.1 seconds

  Operating System    Linux, Arch. arm, Version 4.19.29-v7l+
  Java Vendor         Oracle Corporation, Version  1.8.0_212

 ******************* Pi 4B OpenJDK11 *******************

   Java Drawing Benchmark, May 15 2019, 18:55:41
            Produced by OpenJDK 11 javac

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1      877    87.65
  Display PNG Bitmap Twice Pass 2     1042   104.18
  Plus 2 SweepGradient Circles        1015   101.47
  Plus 200 Random Small Circles        779    77.85
  Plus 320 Long Lines                  336    33.52
  Plus 4000 Random Small Circles        83     8.25

         Total Elapsed Time  60.1 seconds

  Operating System    Linux, Arch. arm, Version 4.19.37-v7l+
  Java Vendor         BellSoft, Version  11.0.2-BellSoft
OpenGL GLUT Benchmark

In 2012, I approved a request from a Quality Engineer at Canonical to use this OpenGL benchmark in the testing framework of the Unity desktop software. The program can be run as a benchmark or, with selected functions, as a stress test of any duration.

After installing freeglut3, the benchmark ran as before. The benchmark measures graphics speed in terms of Frames Per Second (FPS) via six simple and more complex tests. The first four tests portray moving up and down a tunnel including various independently moving objects, with and without texturing. The last two tests represent a real application for designing kitchens. The first is in wireframe format, drawn with 23,000 straight lines. The second has colours and textures applied to the surfaces.

As a benchmark, it was run using the following script file:

Code: Select all

  export vblank_mode=0                                     
  ./videogl32 Width 320, Height 240, NoEnd                 
  ./videogl32 Width 640, Height 480, NoHeading, NoEnd      
  ./videogl32 Width 1024, Height 768, NoHeading, NoEnd     
  ./videogl32 NoHeading                                    
Following are results from the Pi 3B+ and Pi 4B. The early tests depend on graphics speed, with the later ones becoming CPU speed dependent.

Code: Select all

 ************************ Pi 3B+ ************************

 GLUT OpenGL Benchmark 32 Bit Version 1, Fri Apr 12 22:21:35 2019

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240    343.8    208.3     88.4     56.6     24.3     15.5
   640   480    243.0    170.3     82.8     54.5     24.2     15.5
  1024   768    110.6    101.2     63.6     47.8     24.1     15.4
  1920  1080     49.5     47.3     36.8     32.9     23.4     14.9


 ************************ Pi 4B ************************

 GLUT OpenGL Benchmark 32 Bit Version 1, Thu May  2 19:01:05 2019

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240    766.7    371.4    230.6    130.2     32.5     22.7
   640   480    427.3    276.5    206.0    121.8     31.7     22.2
  1024   768    193.1    178.8    150.5    110.4     31.9     21.5
  1920  1080     81.4     79.4     74.6     68.3     30.8     20.0
Stress Tests

I have run a series of stress tests, some of which lead to high temperatures and slow performance, due to CPU clock throttling. These will be rerun as changes have been incorporated, aimed at reducing temperatures.

RoyLongbottom

Re: Raspberry Pi Benchmarks

Wed Jul 10, 2019 5:34 pm

Pi 4B Stress Tests

First Unstressed Tests - It is quite easy to produce programs that run at high speeds on all cores of a modern computer, be it a PC, tablet, phone or a small board system like the Raspberry Pi. These programs are likely to lead to increased CPU temperatures. Given insufficient cooling arrangements, the systems are likely to continuously reduce CPU MHz (throttling) in order to continue operation, and eventually power down. Before examining the results of stress testing, it is useful to consider what can be run without throttling occurring, in this case, on a Raspberry Pi 4, without any cooling.

Graphics Tests No Cooling

Earlier, I connected the Pi 4 system to BBC iPlayer, via WiFi, and displayed programmes for more than two hours on a full screen 1920x1080 display (not a hot day). With CPU utilisation around 100% of one core, maximum temperature was 70°C, with CPU at 1500 MHz all the time.

OpenGL

For this exercise, I ran the OpenGL Textured Kitchen test for an hour, with a full screen display (hotter day than above). Following is a summary of recorded results by the program, the environmental monitor and vmstat. The program ran at 22 FPS over the whole period, with CPU at a constant 1500 MHz, recording slightly more than 100% utilisation of one core, with maximum temperature reaching 73°C.

Code: Select all

            ------- Monitors ------  --------- vmstat ---------  Video
                                                                  gl32
                           °C    °C      
 Seconds     MHz   Volts   CPU  PMIC    free    User System Idle   FPS

       0    1500  0.8894    61    54 3589900       0     0   100
     120    1500  0.8841    69    59 3523336      25     2    73    22
     240    1500  0.8841    71    62 3520464      25     2    73    22
     360    1500  0.8841    71    63 3522848      25     2    73    22
     480    1500  0.8841    73    63 3522292      25     2    73    22
     600    1500  0.8841    72    63 3522284      25     2    73    22
     720    1500  0.8841    72    63 3521780      24     2    74    22
     840    1500  0.8841    73    63 3520640      25     2    73    22
     960    1500  0.8841    72    63 3520884      25     2    73    22
    1080    1500  0.8841    72    63 3520140      25     2    73    22
    1200    1500  0.8841    73    63 3519864      24     2    73    22
    1320    1500  0.8841    73    63 3519892      25     2    73    22
    1440    1500  0.8841    73    63 3519892      25     2    73    22
    1560    1500  0.8841    73    63 3518880      25     2    73    22
    1680    1500  0.8841    72    63 3519264      25     2    73    22
    1800    1500  0.8841    73    63 3517976      25     2    73    22
    1920    1500  0.8841    73    63 3518616      25     2    73    22
    2040    1500  0.8841    72    63 3517984      25     2    73    22
    2160    1500  0.8841    72    63 3518604      24     2    73    22
    2280    1500  0.8841    73    63 3518496      25     2    73    22
    2400    1500  0.8841    73    63 3518868      25     2    73    22
    2520    1500  0.8841    72    63 3518488      25     2    73    22
    2640    1500  0.8841    73    63 3518212      25     2    73    22
    2760    1500  0.8841    73    63 3520008      25     2    73    22
    2880    1500  0.8841    73    63 3519756      25     2    73    22
    3000    1500  0.8841    73    63 3516752      25     3    72    22
    3120    1500  0.8841    73    63 3518132      25     2    73    22
    3240    1500  0.8841    73    63 3518132      25     2    73    22
    3360    1500  0.8841    73    63 3517620      24     2    73    22
    3480    1500  0.8841    73    63 3517428      25     2    73    22
    3600    1500  0.8841    73    63 3517656      25     2    73    22
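
The "slightly more than 100% of one core" figure follows from the vmstat columns. As a minimal Python sketch, assuming the vmstat User and System percentages are averaged over all four cores:

```python
def cores_busy(user_pct, system_pct, cores=4):
    """Equivalent number of fully busy cores from vmstat percentages.

    Assumption: vmstat reports percentages of total capacity over all cores.
    """
    return round((user_pct + system_pct) / 100 * cores, 2)


# Typical row from the table above: User 25, System 2
print(cores_busy(25, 2))   # 1.08, or 108% of one core
```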
Single Core and Multi-Core CPU Tests

Below are various results from running five minute MP-Integer-Tests on a Raspberry Pi 4B, out of the case, with no cooling attachment. As indicated earlier, ongoing speed measurements by a benchmark provide a better understanding of behaviour than samples of CPU MHz, which can vary rapidly.

Starting and maximum recorded temperatures are shown, along with the time at which 80°C was reached (if at all), the point where throttling starts. The first column is for a run using a single thread, where CPU MHz, and effectively the measured speeds, were constant over the whole period. The second column provides details when using four threads, with data in L1 caches. The next two made use of data in L2 cache and started throttling after one minute, sooner than the L1 run, but they also started at a higher temperature. The last column provides results with data in RAM, which ran at full speed for over four and a half minutes.

Code: Select all

                Speed in MB/second

   KB        512      64     640    1536   15624
   Threads     1       4       4       8       4
   Start°C    62      60      62      64      61

  Seconds
      10    5718   23631   22628   20177    3445
      20    5717   23603   22634   18329    3443
      30    5640   23416   22670   18756    3405
      40    5735   23613   22045   17737    3440
      50    5740   23618   22636   18456    3444
      60    5652   23244   22069   19059    3410
      70    5707   23483   19864   17648    3437
      80    5736   23360   18639   16017    3445
      90    5683   21552   17986   16654    3447
     100    5695   20867   17383   14864    3395
     110    5719   20218   16475   14805    3437
     120    5672   19017   16207   15128    3443
     130    5727   18871   15165   13328    3401
     140    5735   18888   14773   12638    3437
     150    5732   18460   14979   12780    3443
     160    5677   17799   14780   13086    3440
     170    5719   17976   14313   13221    3404
     180    5711   18005   14391   12618    3443
     190    5650   17745   14018   12185    3440
     200    5738   17312   14120   13267    3397
     210    5709   17241   14062   11916    3442
     220    5678   17124   14004   11866    3441
     230    5719   17392   13467   12018    3397
     240    5720   16990   13728   11825    3440
     250    5651   17289   13372   12011    3434
     260    5714   17135   13683   11596    3442
     270    5717   16891   13584   11481    3398
     280    5657   16505   13055   11781    3442
     290    5725   17049   13396   11550    3445
     300    5713   16578   12957   11666    3402

   Max      5740   23631   22670   20177    3447
   Min      5640   16505   12957   11481    3395
   %          98      70      57      57      98

  Max °C      72      82      84      85      80
 Time 80°C   N/A      90      60      60     280
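As a cross-check on the summary rows, here is a minimal Python sketch that recomputes the % (Min/Max) figure and estimates when the slowdown began for the 64 KB, four thread column. The function names and the 5% tolerance are my own assumptions for illustration; they are not part of the benchmark programs.

```python
# Speeds in MB/second, sampled every 10 seconds, copied from the
# "64 KB, 4 threads" column of the table above.
SPEEDS_64KB_4T = [
    23631, 23603, 23416, 23613, 23618, 23244, 23483, 23360, 21552, 20867,
    20218, 19017, 18871, 18888, 18460, 17799, 17976, 18005, 17745, 17312,
    17241, 17124, 17392, 16990, 17289, 17135, 16891, 16505, 17049, 16578,
]


def min_max_percent(speeds):
    """Return the slowest sample as a rounded percentage of the fastest."""
    return round(100 * min(speeds) / max(speeds))


def first_slowdown(speeds, interval=10, tolerance=0.95):
    """Return the time (seconds) of the first sample below tolerance * max.

    The 5% tolerance is an arbitrary choice for this sketch, used to ignore
    normal run-to-run variation. Returns None if no sample is that slow.
    """
    limit = tolerance * max(speeds)
    for index, speed in enumerate(speeds):
        if speed < limit:
            return (index + 1) * interval
    return None


if __name__ == "__main__":
    print(min_max_percent(SPEEDS_64KB_4T))  # 70, as in the % row
    print(first_slowdown(SPEEDS_64KB_4T))   # 90, matching the Time 80°C row
```

The measured speed starts to fall at the same 90 second sample where 80°C was recorded, which is the point of using ongoing MB/second figures rather than occasional MHz readings.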
Integer Stress Tests - MP-IntStress

The following are results of 15 minute stress tests, using 1280 KB data and 8 threads. The data size is greater than the L2 cache but, as only four threads were executed at a time, the data in use was in cache. The tests then ran at full speed, with additional swapping of cached data.

Four tests were carried out: with no added cooling on a bare board, with a copper heatsink fitted, then with the official, and expensive, Power Over Ethernet fan and, finally, using an inexpensive case with a fitted fan. The changing CPU MHz measurements show that throttling is occurring but, with coarse sampling, they do not reflect real performance, unlike the MB/second details.

With no cooling, throttling started after a minute, reaching 85°C to 86°C, slowly reducing performance to almost half speed. The copper heatsink produced a small improvement. During the two tests where fans were used, the processor ran continuously at 1500 MHz and throughput effectively at a constant MB/second. The POE fan appeared to be slightly more efficient.

Code: Select all

                No Cooling         Copper Heatsink    Official POE Hat   Case With Fan

Seconds MB/sec   MHz  °C   MB/sec   MHz  °C   MB/sec   MHz  °C   MB/sec   MHz  °C

      0         1500  60           1500  60           1500  47           1500  41
     20  21651  1500  73    21381  1500  71    21770  1500  56    22018  1500  54
     40  21892  1500  79    20517  1500  74    21767  1500  57    21979  1500  56
     60  20919  1500  81    21407  1500  77    22234  1500  57    22076  1500  58
     80  17174  1000  81    21153  1500  79    22035  1500  58    22248  1500  60
    100  15643  1000  81    20960  1500  81    21920  1500  59    22153  1500  61
    120  15163  1000  82    18967  1500  82    22184  1500  60    22239  1500  63
    140  14756  1000  81    16828  1000  81    21941  1500  60    22037  1500  64
    160  14491  1000  83    15892  1500  83    21863  1500  60    22231  1500  65
    180  14492  1000  83    16157  1000  82    21753  1500  60    22130  1500  64
    200  14283  1000  84    15039  1000  82    21921  1500  60    22050  1500  65
    220  14386  1000  83    15438  1000  82    21656  1500  60    22210  1500  66
    240  14101  1000  83    14905  1000  82    21908  1500  60    22132  1500  65
    260  13574  1000  84    14597  1000  83    21983  1500  60    22298  1500  65
    280  13763  1000  83    14703  1000  83    21701  1500  60    22031  1500  66
    300  13179  1000  84    14519  1000  82    21857  1500  60    22285  1500  65
    320  13566  1000  84    14204  1000  84    21791  1500  60    22009  1500  65
    340  13368   750  84    14139   750  83    21468  1500  60    22101  1500  65
    360  13530  1000  84    14249  1000  84    22162  1500  60    22166  1500  65
    380  13190  1000  85    14457  1000  82    21819  1500  61    22163  1500  66
    400  13215  1000  84    14395  1000  83    21800  1500  60    22243  1500  65
    420  13021   750  85    14365  1000  83    22083  1500  61    22115  1500  64
    440  13127  1000  84    14214  1000  83    21780  1500  60    22172  1500  64
    460  12933  1000  85    14152  1000  83    21902  1500  60    22138  1500  64
    480  12658  1000  85    14090  1000  84    21964  1500  60    22220  1500  64
    500  12981   750  83    14199  1000  84    22026  1500  61    22061  1500  65
    520  12699  1000  85    14005  1000  83    21661  1500  61    22027  1500  64
    540  12622  1000  84    13987  1000  84    21684  1500  60    22281  1500  65
    560  12761  1000  84    14222  1000  84    22071  1500  59    22097  1500  64
    580  13408  1000  84    13845  1000  84    21728  1500  58    22225  1500  64
    600  13878  1000  85    13945  1000  84    21981  1500  59    22091  1500  62
    620  13893  1000  83    13877  1000  84    21704  1500  58    22203  1500  62
    640  13717  1000  86    13844  1000  84    21935  1500  58    22133  1500  62
    660  13321  1000  85    13774  1000  83    21816  1500  61    22075  1500  62
    680  13154  1000  85    13500  1000  83    21827  1500  61    22229  1500  63
    700  12663  1000  85    13926  1000  83    21995  1500  60    22007  1500  63
    720  12504  1000  85    13722  1000  83    22004  1500  60    22279  1500  64
    740  12501   750  85    13778   750  84    21954  1500  60    22020  1500  65
    760  12227  1000  85    13564  1000  83    21848  1500  60    22270  1500  65
    780  12199   750  85    13755  1000  82    21840  1500  61    22129  1500  65
    800  12505  1000  85    13451  1500  82    22137  1500  59    22175  1500  64
    820  12268   750  85    13587  1000  83    21876  1500  60    22210  1500  64
    840  12322  1500  78    13610  1000  82    21685  1500  61    22041  1500  65
    860  12312  1500  74    14411  1500  75    22077  1500  61    22192  1500  65
    880  12306  1500  73    14380  1500  73    21842  1500  61    22109  1500  65
    900  12305  1500  71    14345  1500  73    21883  1500  61    22199  1500  65

    Max  21892        86    21407        84    22234        61    22298        66
    Min  12199   750        13451   750        21468  1500        21979  1500
%Min/Max    56                 63                 97                 99
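The source of RPiHeatMHzVolts2 is not shown in this post. As a rough stand-in, the sketch below samples temperature and clock speed from the standard Linux sysfs files. It is an assumption that these files exist on the system under test and that scaling_cur_freq reflects throttling on a given kernel; the function names are mine.

```python
import time

# Standard Linux sysfs locations; assumed to be present on the Pi under test.
TEMP_PATH = "/sys/class/thermal/thermal_zone0/temp"
FREQ_PATH = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq"


def read_celsius(path=TEMP_PATH):
    """The kernel reports temperature in thousandths of a degree Celsius."""
    with open(path) as handle:
        return int(handle.read().strip()) / 1000.0


def read_mhz(path=FREQ_PATH):
    """The cpufreq interface reports the current clock in kHz."""
    with open(path) as handle:
        return int(handle.read().strip()) / 1000.0


def log_samples(passes, seconds, temp_path=TEMP_PATH, freq_path=FREQ_PATH):
    """Collect (elapsed seconds, MHz, °C) tuples at a fixed interval."""
    samples = []
    for n in range(passes):
        samples.append((n * seconds, read_mhz(freq_path), read_celsius(temp_path)))
        if n + 1 < passes:
            time.sleep(seconds)
    return samples


if __name__ == "__main__":
    # Five samples, one second apart, printed as Seconds / MHz / °C.
    for elapsed, mhz, celsius in log_samples(passes=5, seconds=1):
        print(f"{elapsed:6d} {mhz:6.0f} {celsius:5.1f}")
```

With samples 20 seconds apart, as in the table, a clock that changes many times a second is caught more or less at random, which is why the MHz column is only a rough guide.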
Single Precision Floating Point Stress Tests - MP-FPUStress

The table below covers the first 10 minutes of tests on the three cooling configurations. This time, the rather meaningless variations in recorded CPU MHz are not included. Again they used 1280 KB data (320K words) and 8 threads, with 8 floating point operations per word. Maximum temperatures and associated performance degradations were similar to those during the integer tests.

The following graphs provide a more meaningful indication of the effects of the adequate cooling that is needed for this kind of CPU utilisation (confirmed during running by vmstat as 100% of four cores).

Code: Select all

            No Cooling  Copper HS    Case+Fan

Seconds     °C GFLOPS   °C GFLOPS    °C GFLOPS

      0     61          59           40
     20     76  19.2    73   19.6    55  20.7
     40     81  19.0    78   19.4    61  20.3
     60     82  17.8    80   19.6    62  20.2
     80     83  15.5    82   17.2    64  20.7
    100     84  15.0    82   15.6    65  20.2
    120     83  14.0    82   14.5    66  20.3
    140     84  13.3    81   13.9    65  20.3
    160     84  13.3    83   13.9    66  20.7
    180     86  12.9    83   13.5    67  20.3
    200     85  13.0    83   13.6    67  20.3
    220     84  12.8    84   13.4    66  20.4
    240     84  12.6    83   13.3    67  20.6
    260     83  12.6    84   13.3    67  20.3
    280     85  12.2    84   13.3    67  20.4
    300     84  12.1    83   13.0    67  20.3
    320     85  12.0    84   13.0    67  20.8
    340     84  11.6    85   12.8    67  20.3
    360     85  11.6    84   13.0    67  20.2
    380     85  11.3    83   12.7    67  20.7
    400     85  11.6    84   12.8    67  20.5
    420     84  11.6    84   12.5    68  20.2
    440     85  11.5    84   12.7    67  20.4
    460     84  11.5    85   12.6    67  20.4
    480     85  11.5    84   12.3    66  20.2
    500     84  11.1    85   12.4    67  20.3
    520     85  11.3    83   12.4    67  20.2
    540     84  11.4    85   12.4    68  20.5
    560     84  11.3    84   12.3    67  20.2
    580     85  11.3    83   12.3    67  20.4
    600     85  11.3    84   12.3    67  20.2

    900     85  10.9    84   12.2    67  20.3

   Max          19.2         19.6        20.8
   Min          10.9         12.2        20.3
%Min/Max          57           62          98
[Graph: GFLOPS.gif]


Double Precision Floating Point Stress Tests - MP-FPUStressDP

Four sets of results are below, again excluding the CPU MHz figures, but including PMIC temperatures. All used 8 threads. For each of two configurations there is a run without and a run with the case/fan: the first uses 1280 KB data at 8 operations per word, the second 128 KB at 32 operations per word.

The second one runs at a higher speed and lower temperature, using data in L1 caches, compared with the other via L2 cache. Maximum temperature and performance degradation of the latter were similar to the earlier examples.

Code: Select all

DP     1280 KB        8       8              128 KB        8      32

                 CPU  PMIC          CPU  PMIC          CPU  PMIC          CPU  PMIC
 Second GFLOPS    °C    °C GFLOPS    °C    °C GFLOPS    °C    °C GFLOPS    °C    °C

      0           48    42           45  42.0           54  47.7           39  35.4
     20    9.3    64  55.2    9.1    61  55.2   10.7    70  57.1   10.7    39  35.4
     40    9.2    73  62.8    9.0    65  59.0   10.6    73  61.8   10.7    53  43.9
     60    9.2    79  68.4    9.1    67  61.8   10.7    75  64.6   10.6    56  48.6
     80    8.8    80  70.3    9.3    66  62.8   10.7    78  67.5   10.6    57  50.5
    100    7.8    81  70.3    9.1    67  62.8   10.7    80  69.4   10.7    58  51.4
    120    7.2    82  70.3    9.2    67  62.8   10.1    82  70.3   10.7    59  53.3
    140    6.8    82  70.3    9.3    67  62.8    9.5    81  70.3   10.7    59  53.3
    160    6.5    82  70.3    9.1    68  62.8    9.1    80  70.3   10.6    59  53.3
    180    6.3    82  70.3    9.1    68  62.8    8.7    82  70.3   10.7    60  53.3
    200    6.1    81  70.3    9.3    68  64.6    8.5    81  70.3   10.7    59  54.3
    220    6.2    82  70.3    9.1    69  62.8    8.5    82  70.3   10.7    59  54.3
    240    6.2    83  72.2    9.1    68  62.8    8.3    81  70.3   10.6    60  54.3
    260    6.1    83  72.2    9.3    68  62.8    8.3    81  70.3   10.7    59  54.3
    280    6.1    84  72.2    9.1    67  64.6    8.0    83  70.3   10.7    61  54.3
    300    6.1    83  70.3    9.1    68  64.6    8.0    81  70.3   10.6    60  54.3
    320    6.0    84  72.2    9.1    68  64.6    7.9    82  70.3   10.7    61  54.3
    340    5.9    85  72.2    9.2    68  64.6    7.6    82  71.2   10.8    61  53.3
    360    5.8    85  72.2    9.1    68  62.8    7.7    82  70.3   10.7    60  54.3
    380    5.8    84  72.2    9.2    68  64.6    7.8    83  70.3   10.6    60  54.3
    400    5.7    84  72.2    9.1    68  62.8    7.7    83  70.3   10.6    61  54.3
    420    5.7    84  72.2    9.2    68  62.8    7.7    82  70.3   10.6    60  54.3
    440    5.6    84  72.2    9.1    68  64.6    7.6    82  70.3   10.7    60  54.3
    460    5.7    84  72.2    9.1    68  62.8    7.6    83  70.3   10.6    61  54.3
    480    5.6    84  72.2    9.1    69  64.6    7.5    82  70.3   10.7    60  54.3
    500    5.6    84  72.2    9.1    69  62.8    7.5    82  71.2   10.6    60  54.3
    520    5.5    85  72.2    9.1    68  62.8    7.4    81  70.3   10.7    60  54.3
    540    5.5    84  74.1    9.3    67  64.6    7.4    82  70.3   10.7    60  54.3
    560    5.5    84  72.2    9.1    69  62.8    7.4    82  70.3   10.8    59  54.3
    580    5.4    84  74.1    9.1    67  64.6    7.3    82  70.3   10.7    60  55.2
    600    5.5    84  74.1    9.2    68  62.8    7.3    81  70.3   10.7    60  54.3
    620    5.4    85  74.1    9.2    68  62.8    7.3    82  70.3   10.6    61  54.3
    640    5.4    84  74.1    9.2    69  62.8    7.3    83  70.3   10.6    62  55.2
    660    5.4    85  74.1    9.3    68  62.8    7.3    83  70.3   10.7    60  54.3
    680    5.5    85  72.2    9.0    67  62.8    7.3    83  70.3   10.7    60  54.3
    700    5.4    85  74.1    9.1    69  62.8    7.3    81  70.3   10.7    60  54.3
    720    5.4    85  72.2    9.2    68  64.6    7.3    84  70.3   10.7    60  54.3
    740    5.4    84  72.2    9.1    68  62.8    7.3    82  70.3   10.7    60  55.2
    760    5.3    85  74.1    9.1    68  62.8    7.3    81  70.3   10.7    60  54.3
    780    5.4    85  74.1    9.3    67  62.8    7.3    83  70.3   10.7    59  54.3
    800    5.4    84  74.1    9.1    69  64.6    7.3    81  70.3   10.7    60  54.3
    820    5.3    85  72.2    9.1    68  62.8    7.3    82  70.3   10.7    60  54.3
    840    5.3    84  72.2    9.2    68  62.8    7.2    82  70.3   10.7    60  54.3
    860    5.2    85  74.1    9.1    69  64.6    7.2    81  70.3   10.6    60  54.3
    880    5.2    85  74.1    9.1    68  62.8    7.2    82  70.3   10.6    60  54.3
    900    5.3    84  74.1    9.1    68  62.8    7.2    81  70.3   10.6    60  54.3

   Max     9.3    85  74.1    9.3    69  64.6   10.7    84  71.2   10.8    62  55.2
   Min     5.2                9.0                7.2               10.6
%Min/Max    57                 97                 67                 98
More Later

RoyLongbottom
Posts: 299
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Fri Jul 19, 2019 4:26 pm

High Performance Linpack (HPL)

In 2016, it was found that a precompiled version of High Performance Linpack (HPL) could produce wrong and inconsistent numeric results, as well as system crashes. For more information see this PDF file at ResearchGate:

https://www.researchgate.net/publicatio ... rror_Tests

That report includes behaviour of another version, compiled to use ATLAS, employing alternative Basic Linear Algebra Subprograms. This took 14 hours to build. The original precompiled version would not run on the Pi 4 but I rebuilt ATLAS on the new system, this time taking 8 hours. An example of the output for the fastest Pi 4 results is shown below:

Code: Select all

The following parameter values will be used:

N      :   20000 
NB     :     128 
PMAP   : Row-major process mapping
P      :       2 
Q      :       2 
PFACT  :   Right 
NBMIN  :       4 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :  1ringM 
DEPTH  :       1 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       20000   128     2     2             490.49              1.087e+01
HPL_pdgesv() start time Sun May 26 12:03:05 2019

HPL_pdgesv() end time   Sun May 26 12:11:15 2019

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0010188 ...... PASSED
================================================================================
The programs were run on a bare board Pi 4 and one in the inexpensive case with a fan. No data errors or system freezes/crashes were encountered over these and many more runs.
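The Gflops figure in the HPL output above can be cross-checked from N and the run time. The sketch below assumes HPL's nominal operation count of 2/3 N³ + 3/2 N² floating point operations for the solve; the function name is mine.

```python
def hpl_gflops(n, seconds):
    """GFLOPS for an N x N solve, using HPL's nominal operation count."""
    operations = (2.0 / 3.0) * n**3 + (3.0 / 2.0) * n**2
    return operations / seconds / 1e9


if __name__ == "__main__":
    # N and Time taken from the WR11C2R4 result line above.
    print(f"{hpl_gflops(20000, 490.49):.3e}")  # 1.087e+01
```

The same function can be applied to the N and Seconds columns of the summary table further down, allowing for rounding of the reported times.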

The following is a summary of four tests on each of the test beds. The bare board arrangement performs relatively well in short duration tests, but the long ones are needed to demonstrate maximum performance. The latter was 10.8 Double Precision GFLOPS, similar to the speed of my MP-FPUStressDP program. The uncooled processor ran at 58% of the cooled speed, again similar to the MP-FPUStressDP results. As they should be, the sumchecks of hot and cold systems were identical at a given data size.

Assuming similarity with the original scalar Linpack benchmark, data size would be N x N x 8 for double precision operation or 3.2 GB at N = 20000, as approximately confirmed by the vmstat memory details provided below. The latter also indicate that the four core CPU utilisation was 100%.
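The arithmetic in the paragraph above can be written out as a short sketch. It assumes that vmstat reports free memory in units of 1024 bytes (its default) and uses the first and last "free" values from the vmstat listing in this post; the function names are mine.

```python
def matrix_bytes(n, bytes_per_word=8):
    """Memory for an N x N matrix of double precision words."""
    return n * n * bytes_per_word


def vmstat_drop_bytes(free_before_kb, free_after_kb):
    """Fall in free memory between two vmstat samples, in bytes."""
    return (free_before_kb - free_after_kb) * 1024


if __name__ == "__main__":
    needed = matrix_bytes(20000)
    used = vmstat_drop_bytes(3510712, 276088)
    print(f"Matrix  {needed / 1e9:.2f} GB")  # 3.20 GB
    print(f"vmstat  {used / 1e9:.2f} GB")    # 3.31 GB
```

The vmstat figure is a little larger than the matrix alone, as would be expected with program workspace and other system activity included.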

Below the table is a graph, of the worst case uncooled scenario, to demonstrate CPU MHz throttling and temperature (°C times 10), based on samples every 10 seconds.

Code: Select all

Cooling       N  Seconds GFLOPS SumCheck Max °C  Av MHz

None        4000     5.7   7.4  0.002398    71    1500
Fan         4000     5.2   8.2  0.002398    54    1500
None        8000    39.9   8.6  0.001675    81    1500
Fan         8000    36.7   9.3  0.001675    61    1500
None       16000   404.3   6.8  0.001126    86     919
Fan        16000   263.0  10.4  0.001126    70    1500
None       20000   856.0   6.2  0.001019    87     828
Fan        20000   494.3  10.9  0.001019    71    1500

%None/Fan  20000      58    58      Same            55


procs  -----------memory---------- ---swap-- -----io---- -system- ------cpu-----
r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs us sy id wa st

0  0      0 3510712  30172 276440    0    0    17     1   90  111 16  1 83  0  0
4  0      0 3097880  30180 277088    0    0     0     6  526  515 52  3 45  0  0
4  0      0 2357404  30188 276492    0    0     0     6  620  344 95  5  0  0  0
4  0      0 1615192  30196 276976    0    0     0    11  586  289 95  5  0  0  0
5  0      0  871872  30204 271032    0    0     0     5  490   75 96  4  0  0  0
4  0    768  282692  26828 241092    0   34    20    40  604  307 95  4  0  0  0
4  0    768  276088  26968 250344    6    0   118    12  591  288 99  1  0  0  0
[Graph: HPL.gif]
Last edited by RoyLongbottom on Fri Jul 19, 2019 5:45 pm, edited 1 time in total.

ejolson
Posts: 3699
Joined: Tue Mar 18, 2014 11:47 am

Re: Raspberry Pi Benchmarks

Fri Jul 19, 2019 5:16 pm

RoyLongbottom wrote:
Fri Jul 19, 2019 4:26 pm
High Performance Linpack (HPL)

In 1993, it was found that a precompiled version of High Performance Linpack (HPL) could produce the wrong and inconsistent numeric calculations, also system crashes. For more information see this PDF file at ResearchGate:

https://www.researchgate.net/publicatio ... rror_Tests
While those problems with the Pi 3B seem long ago, I'm pretty sure it was after 1993.

RoyLongbottom
Posts: 299
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Fri Jul 19, 2019 5:46 pm

ejolson wrote:
Fri Jul 19, 2019 5:16 pm
RoyLongbottom wrote:
Fri Jul 19, 2019 4:26 pm
High Performance Linpack (HPL)

In 1993, it was found that a precompiled version of High Performance Linpack (HPL) could produce the wrong and inconsistent numeric calculations, also system crashes. For more information see this PDF file at ResearchGate:

https://www.researchgate.net/publicatio ... rror_Tests
While those problems with the Pi 3B seem long ago, I'm pretty sure it was after 1993.
Thanks, last time I had indicated 2006

RoyLongbottom
Posts: 299
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Sat Jul 20, 2019 11:35 am

Livermore Loops Plus OpenGL Stress Tests (Including Dual Monitors)

For more details of stress testing options see:

https://www.researchgate.net/publicatio ... ce_Linpack

Three copies of the Livermore Loops stress test were run along with the OpenGL Tiled Kitchen section, on a Pi 4 without any cooling, then in the case with a fan. The former program was arranged to have a nominal duration of 864 seconds (72 x 12). When running, the CPU load is continuously changing and that can be reflected in ongoing temperature and OpenGL Frames Per Second. The tests make use of six terminal windows and a full screen display, run by the commands shown below. This is followed by the results.

With no cooling, there were the usual increases in temperature and performance degradation, but not as severe as in some of the earlier tests. With cooling, performance was effectively constant. Averages at the end reflect the differences. There were no reports of errors or any sign of system failures.

Dual Monitors - The benchmarks, with no cooling, were repeated using two monitors, providing a screen area of 3840 x 1080 pixels, the results being included below. Performance was only between 7% and 15% slower than the single monitor example. Benchmark results of all OpenGL tests are provided at the end of the table, showing that those more dependent on graphics speed were affected by the number of pixels displayed.

Code: Select all

Run Commands

Terminal 1
vmstat 10 100

Terminal 2 script file
lxterminal -e ./RPiHeatMHzVolts2 Passes 120 Seconds 10 Log 20
lxterminal -e ./liverloopsPiA7R Seconds 12 Log 20
lxterminal -e ./liverloopsPiA7R Seconds 12 Log 21
lxterminal -e ./liverloopsPiA7R Seconds 12 Log 22

Terminal 3
./videogl32 Test 6, Mins 16, Log 20

                                                 Dual Monitors
          No Cooling          Case + Fan         No Cooling

Seconds   MHz     °C   FPS    MHz    °C   FPS    MHz    °C   FPS

      0   1500    64         1500    42         1500    69
     30   1000    82    19   1500    57    20   1000    82    13
     60   1000    82    16   1500    62    21    750    84    13
     90   1500    83    15   1500    66    20   1000    83    12
    120    750    85    13   1500    64    21   1000    85    11
    150   1000    84    13   1500    62    20    600    84    10
    180   1000    83    14   1500    60    22    750    85    10
    210   1000    84    15   1500    62    21   1000    85    12
    240   1000    83    14   1500    61    19    750    84    12
    270   1000    84    14   1500    63    21   1000    85    11
    300   1000    84    14   1500    61    21    750    84    12
    330    750    84    14   1500    64    21   1000    85    12
    360   1000    82    14   1500    64    21    750    84    11
    390   1000    83    12   1500    66    21    750    84    12
    420   1000    84    13   1500    63    21    750    84    12
    450   1000    84    14   1500    62    20    750    85    11
    480    750    84    12   1500    63    21    750    85    12
    510    750    85    13   1500    61    21   1000    84    12
    540    750    84    11   1500    59    21    750    84    11
    570   1000    84    12   1500    62    21   1000    85    11
    600   1000    84    14   1500    62    22    750    83    10
    630   1000    84    13   1500    66    19    750    84    11
    660    750    84    14   1500    60    21    750    85    12
    690    750    86    13   1500    65    21   1000    85    12
    720   1000    84    13   1500    63    21    600    83    11
    750   1000    83    13   1500    62    21   1000    84    12
    780    750    84    12   1500    61    21   1000    85    11
    810    750    85    12   1500    62    21   1000    84    11
    840   1000    85    12   1500    58    21    750    86    10
    870    750    85    12   1500    58    21    750    85    11
    900   1000    84    13   1500    54    21   1000    85    10
    930   1000    85    13   1500    50    21   1000    85    11
    960   1000    84    13   1500    49    21    750    85    11
    990   1000    85    14   1500    45    21    750    85    12

Average    956    83    13   1500    60    21    866    84    11
%Fan        64   139    64
MFLOPS     916               1502                854
%Fan        61

 OpenGL Benchmark Single and Dual Monitors

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

  1920  1080     58.2     56.7     54.5     49.9     31.0     20.7
  3840  1080     27.9     26.5     26.0     25.2     25.7     16.3
Last edited by RoyLongbottom on Sat Jul 20, 2019 11:50 am, edited 1 time in total.

RoyLongbottom
Posts: 299
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Sat Jul 20, 2019 11:49 am

Input/Output Stress Test - burnindrive2

This is essentially the same as my program used during hundreds of UK Government and University computer acceptance trials during the 1970s and 1980s, with some significant achievements. Burnindrive writes four files, using 164 blocks of 64 KB, repeated 16 times (164.0 MB), with each block containing a unique data pattern. The files are then read for two minutes, in a more or less random sequence, with data and file ID checked for correct values. Then each block (unique pattern) is read numerous times, over one second, again with checking for correct values. Total time is normally about 5 minutes for all tests, with default parameters. More details are included in the following:

https://www.researchgate.net/publicatio ... ce_Linpack
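To illustrate the write then verify idea described above, here is a minimal Python sketch. It is not the burnindrive2 code: the block layout (a two byte header holding file ID and pattern number, followed by filler) and all function names are hypothetical, chosen only to show unique patterns being written and then checked in a pseudo-random order.

```python
import os
import random
import tempfile

BLOCK_SIZE = 64 * 1024  # 64 KB blocks, as in the description above


def make_block(file_id, pattern_id, size=BLOCK_SIZE):
    """Build one block: a 2-byte header (file ID, pattern ID) plus filler.

    Hypothetical layout, used only to give each block a checkable pattern.
    Both IDs must be below 256 to fit in one byte each.
    """
    header = bytes([file_id, pattern_id])
    filler = bytes([(pattern_id * 37 + file_id) % 256]) * (size - len(header))
    return header + filler


def write_file(path, file_id, patterns, repeats, size=BLOCK_SIZE):
    """Write every pattern once per repeat and return the block count."""
    with open(path, "wb") as handle:
        for _ in range(repeats):
            for pattern_id in range(patterns):
                handle.write(make_block(file_id, pattern_id, size))
    return patterns * repeats


def verify_file(path, file_id, patterns, blocks, reads, seed=0, size=BLOCK_SIZE):
    """Read blocks in a pseudo-random order and count mismatches."""
    rng = random.Random(seed)
    errors = 0
    with open(path, "rb") as handle:
        for _ in range(reads):
            index = rng.randrange(blocks)
            handle.seek(index * size)
            expected = make_block(file_id, index % patterns, size)
            if handle.read(size) != expected:
                errors += 1
    return errors


if __name__ == "__main__":
    print(164 * 64 * 16 / 1024)  # 164.0 MB per file at full size
    with tempfile.TemporaryDirectory() as folder:
        # Scaled down demo: 164 patterns of 1 KB, written once.
        path = os.path.join(folder, "burn0.dat")
        blocks = write_file(path, file_id=0, patterns=164, repeats=1, size=1024)
        print(verify_file(path, 0, 164, blocks, reads=1000, size=1024))
```

Including the file ID in every block means that a read returning data from the wrong file is detected as well as corrupted data.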

For the stress test, three copies of burnindrive2 were run, accessing the main drive, a USB 3 stick and a remote PC via a 1 Gbps LAN, along with MP-IntStress using four threads. The environment was monitored using RPiHeatMHzVolts2, vmstat for drive activity and CPU MHz, and sar -n for network traffic. Commands used and results are provided below. Stress tests are generally based on executing a fixed set of activities, where completion times can vary. Hence, the provided results are extrapolated approximations, with drive speeds the average for a particular activity.

All stress tests ran to completion without detecting any errors. CPU utilisation was around 90% of four cores but CPU throttling still occurred, with temperatures up to 86°C (and possibly not enough throttling). Performance measured by the stress tests was broadly in line with the system vmstat and sar measurements. In order to indicate which activity suffered the most degradation, performance of standalone runs is also provided. It seems that LAN traffic was given a higher priority, with no speed reduction, followed by the main SD drive. Worst was the CPU bound program, probably suffering from a lower priority besides throttling.

Code: Select all

Commands from 4 terminals
vmstat 30 30
sar -n DEV 30 30
./RPiHeatMHzVolts2 Passes 30 Seconds 30 Log 20

Script File
lxterminal -e ./burnindrive2 Mins 10, Log 4
lxterminal -e ./burnindrive2 Mins 10, FilePath /media/pi/PATRIOT Log 5
lxterminal -e ./burnindrive2 Mins 10, FilePath /media/public Log 6
lxterminal -e ./MP-IntStress Threads 3, Mins 15, KB 480, Log 4

Results
          ------ MB/second ------     
   Secs   Main USB 3 1Gbps  MP-Int  MHz    °C
         Drive Drive   LAN  Stress 

      0                            1500    55
     30   11.9  38.0  42.3  13116  1500    66
     60   11.9  44.1  32.8  13063  1500    73
     90   28.1  44.1  32.8  13615  1500    75
    120   28.1  44.1  32.8  13734  1500    81
    150   28.1  44.1  32.8  13370  1500    83
    180   28.1  44.1  32.8  13555  1000    82
    210   28.1  44.1  32.8  13285  1000    82
    240   28.1  44.1  32.8  13194  1000    82
    270   28.1  44.1  32.8  13022  1000    83
    300   28.1  44.1  32.8  13316  1000    82
    330   28.1  44.1  32.8  13615  1000    82
    360   28.1  44.1  32.8  13677  1000    84
    390   28.1  44.1  32.8  13315  1000    83
    420   28.1  44.1  32.8  13273  1000    82
    450   28.1  44.1  32.8  13117  1000    83
    480   28.1  44.1  32.8  12860  1000    83
    510   28.1  44.1  32.8  12370  1000    83
    540   28.1  44.1  32.8  11863  1000    84
    570   28.1  44.1  32.8  11550  1000    84
    600   28.1  44.1  32.8  11312  1000    82
    630   28.1  44.1  32.7  10895  1000    83
    660   28.1  54.0  32.7  10696  1000    83
    690   29.7  54.0  32.7  10479  1000    84
    720   29.7  54.0  32.7  10223   750    84
    750   29.7  54.0  32.7  10227  1000    85
    780   29.7  54.0  32.7  10413   750    84
    810   29.7  54.0        10090   750    86
    840   29.7               9952  1000    84

  Stand Alone
  Max     33.4  68.6  32.3  22664


vmstat
procs -----------memory---------- --swap-- -----io---- -system- ------cpu-----
 r  b   swpd   free   buff  cache  si  so    bi    bo   in   cs us sy id wa st
Start
 6  2      0 3499820  45700 271552  0   0 12409 32193 16450 13425 54 24 20  2  0
 2  2      0 3503956  45776 264632  0   0 46811 12381 27174 16714 68 23  3  5  0
 4  2      0 3506080  45816 264348  0   0 76271   248 25885 16188 64 22  7  7  0
Read 1
 5  2      0 3502984  45992 264844  0   0 75473     5 18777 14118 67 24  3  6  0
 5  2      0 3504888  46032 264884  0   0 74726     7 18907 14631 66 25  4  5  0
Read 2
 6  2      0 3503236  46544 265452  0   0 86628     7 17180 15114 62 28  4  6  0
 4  2      0 3501964  46592 265452  0   0 80815     6 15395 14321 68 28  2  2  0

 Ethernet Read sar -n DEV 
  rxpck/s   txpck/s    rxkB/s    txkB/s  rxcmp/s  txcmp/s  rxmcst/s   %ifutil

 24841.37   6883.90  36206.23    505.50     0.00     0.00     0.03     29.66
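The sar line can be related to the MB/second figures with a little arithmetic. The sketch below assumes that sar's kB means 1024 bytes and that %ifutil, on a full duplex link, is based on the larger of the receive and transmit rates against a 1 Gbps line speed; the function names are mine.

```python
def kb_to_mb_per_sec(kb_per_sec):
    """Convert a sar kB/s figure (1024 byte units) to MB/second."""
    return kb_per_sec / 1024.0


def link_utilisation_percent(rx_kb, tx_kb, link_mbps=1000):
    """Busiest direction as a percentage of the nominal link speed."""
    busiest = max(rx_kb, tx_kb)
    bits_per_sec = busiest * 1024 * 8
    return 100.0 * bits_per_sec / (link_mbps * 1e6)


if __name__ == "__main__":
    rx, tx = 36206.23, 505.50  # rxkB/s and txkB/s from the sar line above
    print(round(kb_to_mb_per_sec(rx), 1))              # 35.4
    print(round(link_utilisation_percent(rx, tx), 2))  # 29.66
```

Under those assumptions the calculated utilisation agrees with the %ifutil reported, and the 35.4 MB/second is a little above the 32.8 measured by burnindrive2, as might be expected with protocol overheads included.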

RoyLongbottom
Posts: 299
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Tue Sep 03, 2019 9:00 pm

64 Bit Raspberry Pi 4B Benchmarks

The Gentoo 64 bit Operating System is now available for the Raspberry Pi 4B from:

https://github.com/sakaki-/gentoo-on-rpi-64bit

Existing benchmarks were used to provide comparisons with the old Pi 3B+ model and the Pi 4B system using 32 bit Raspbian. Brief reminders of the benchmark functions are included. The same programs were run on both systems under Gentoo, providing valid comparisons. Comparisons of 64 bit and 32 bit operation are not necessarily as valid, as different versions of the gcc compiler and different options were used. Compiler optimisation parameters used for the 64 bit versions were simply -O3 -march=armv8-a.

The first part is for my original single core programs,

Whetstone Benchmark

This has a number of simple programming loops, with the overall MWIPS rating dependent on floating point calculations, lately those identified as COS and EXP. The last three tests (FIXPT, IF and EQUAL) can be over optimised (shown as N/A), but the time does not affect the overall rating much.

For this simple code, at 64 bits, the average Pi 4 performance gain over the Pi 3B+ was 2.12 times, but only around 1.3 times for straightforward floating point calculations. As should be expected, the Pi 4B 32 bit version was not much slower, with a 64b/32b MWIPS ratio of 1.20.

Code: Select all

System       MHz  MWIPS   ----- MFLOPS------   ------------ MOPS--------------
                            1      2      3     COS    EXP  FIXPT    IF   EQUAL

Pi 3B+ 64b  1400   1071    383    403    328   20.9   12.4   1704   N/A    1357
Pi 4B  64b  1500   2269    522    534    398   54.8   39.8   2487   N/A     997
Pi4/3B+     1.07   2.12   1.36   1.32   1.21   2.63   3.21   1.46   N/A    0.73

Pi 4B 32b   1500   1884    516    478    310   54.7   27.1   2498   2247    999
64b/32b     1.00   1.20   1.01   1.12   1.28   1.00   1.47   1.00   N/A    1.00
Dhrystone Benchmark

This appears to be the most popular ARM benchmark and is often subject to over optimisation, so results from different compilers cannot be compared safely. Ignoring this, results in VAX MIPS (aka DMIPS) and comparisons follow. This benchmark has no significant data arrays suitable for vectorisation.

Using the same 64 bit program, the Pi 4B was more than twice as fast as the Pi 3B+. On the Pi 4B, the 64 bit compilation was 52% faster than the 32 bit one.
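
As a rough check, the DMIPS/MHz and ratio figures in the table that follows can be reproduced from the DMIPS and MHz values. The sketch below is only illustrative: the function names are made up here, and the divisor of 1757 Dhrystones per second per VAX MIPS is the conventional one, assumed rather than taken from the benchmark source.

```python
# Illustrative helpers; the names are not from the benchmark source.
VAX_DHRYSTONES_PER_SECOND = 1757  # conventional divisor for VAX MIPS (assumed)

def dmips(dhrystones_per_second):
    """Convert Dhrystones per second to VAX MIPS (DMIPS)."""
    return dhrystones_per_second / VAX_DHRYSTONES_PER_SECOND

def dmips_per_mhz(dmips_rating, mhz):
    """Normalise a DMIPS rating by CPU clock speed."""
    return dmips_rating / mhz

def ratio(new, old):
    """Relative performance, as in the Pi4/3B+ and 64b/32b rows."""
    return new / old

print(round(dmips_per_mhz(8176, 1500), 2))  # Pi 4B 64b DMIPS/MHz
print(round(ratio(8176, 4028), 2))          # Pi4/3B+
print(round(ratio(8176, 5366), 2))          # 64b/32b
```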

Code: Select all

                             DMIPS
System      MHz     DMIPS     /MHz

Pi 3B+ 64b  1400     4028     2.88
Pi 4B  64b  1500     8176     5.45
Pi4/3B+     1.07     2.03

Pi 4B 32b   1500     5366     3.58
64b/32b     1.00     1.52
Linpack Benchmark

The original Linpack benchmark specified the use of double precision (DP) floating point arithmetic, and the code used here is identical to that initially approved for use on old PCs. For the benefit of early ARM computers, the code is also run using single precision (SP) numbers. A version was also produced, replacing the key Daxpy code with NEON Intrinsic Functions, using vector operations, also with single precision calculations. The Pi 3B+ 32 bit results are also provided for clarification.
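
For readers unfamiliar with it, Daxpy is the inner loop y = a*x + y that dominates Linpack run time. The following is a plain Python sketch of that loop, not the benchmark's C or NEON code. Counting two floating point operations per element is the usual convention and is assumed here.

```python
def daxpy(a, x, y):
    """Return a*x + y element by element (the Linpack inner loop)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [a * xi + yi for xi, yi in zip(x, y)]

def daxpy_flops(n):
    """One multiply and one add per element."""
    return 2 * n

print(daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
print(daxpy_flops(100))
```

Because each element is independent of the others, this loop is a natural fit for vector instructions that handle several elements at once.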

At 64 bits, Pi 4/3B+ performance ratios were generally higher than those with the earlier benchmarks. As could be expected, the virtually compiler independent performance using NEON Intrinsic Functions was similar at 32 bits and 64 bits. The main 64 bit gain was with the compiled single precision version, which obtained the same performance as that via NEON Intrinsics.

Code: Select all

System      MHz     ------- MFLOPS --------
                      DP      SP    SP NEON

Pi 3B+ 64b  1400    396.6    562.1    604.2
Pi 4B  64b  1500   1059.9   1977.8   1968.6
Pi4/3B+     1.07     2.67     3.52     3.26

Pi 4B 32b   1500    760.2    921.6   2010.5
64b/32b     1.00     1.39     2.15     0.98

Pi 3B+ 32b  1400    210.5    225.2    562.5
Pi4/3B+     1.07     3.61     4.09     3.57
Livermore Loops Benchmark

This was the original main benchmark for supercomputers, first introduced in 1970 and initially comprising 14 kernels of numerical application, written in Fortran. This was increased to 24 kernels in the 1980s. Following are overall MFLOPS ratings, with the geometric mean being the official average performance, followed by details from the 24 kernels. Note that these are for double precision calculations.

All the ratings indicate reasonably significant performance gains of Pi 4 over Pi 3B+ and 64 bits over 32 bits. Results from the 24 kernels indicate some higher gains. Also note the maximum speed of 2.49 GFLOPS (Double Precision).
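
For reference, the three averages quoted in the summary can be defined as in the sketch below. The sample values are a few rounded per-kernel results, used only as input. I assume the benchmark derives its summary from more measurements than the 24 rounded figures listed, so values computed this way need not match the table.

```python
import math

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    # exp of the mean of logs avoids overflow from a long product
    return math.exp(sum(math.log(v) for v in values) / len(values))

def harmonic_mean(values):
    return len(values) / sum(1.0 / v for v in values)

# First few Pi 4B 64 bit kernel results, used only as sample input
sample = [2026, 997, 987, 948, 372, 739]
print(arithmetic_mean(sample), geometric_mean(sample), harmonic_mean(sample))
```

For any set of positive results the harmonic mean is the lowest and the arithmetic mean the highest, which is the ordering seen in the summary rows.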

Code: Select all

System      MHz Maximum Average Geomean Harmean Minimum

Pi 3B+ 64b 1400   737.7   319.4   284.7   250.6    91.6
Pi 4B  64b 1500  2490.5     892   730.3   603.3   212.4
Pi4/3B+    1.07    3.38    2.79    2.57    2.41    2.32

Pi 4B 32b  1500  1800.2   635.1     519   416.1   155.3
64b/32b    1.00    1.38    1.40    1.41    1.45    1.37


MFLOPS Of 24 Kernels

Pi 3B+    540   296   539   527   226   175   738   428   484   251   169   245
64b       127   161   291   258   440   520   333   280   310    93   362   209

Pi 4B    2026   997   987   948   372   739  2033  2491  1980   758   495   875
64b       220   404   811   710   753  1124   444   397  1061   414   822   283

Pi4/3B+  3.75  3.37  1.83  1.80  1.65  4.23  2.76  5.83  4.09  3.02  2.92  3.57
         1.73  2.51  2.79  2.75  1.71  2.16  1.33  1.42  3.43  4.48  2.27  1.36
         Min   1.33  Max   5.83

Pi 4B     746   964   988   943   212   538  1169  1800  1032   469   214   186
32b       159   335   778   623   732  1034   320   350   489   360   749   187

64b/32b  2.72  1.03  1.00  1.00  1.76  1.37  1.74  1.38  1.92  1.62  2.31  4.70
         1.38  1.20  1.04  1.14  1.03  1.09  1.39  1.13  2.17  1.15  1.10  1.51
         Min   1.00  Max   4.70
Memory Benchmarks

This batch of programs measures speed dependent on data from caches and RAM.

MemSpeed Benchmark

The MemSpeed benchmark measures data reading speeds in MegaBytes per second, carrying out calculations on arrays of cache and RAM data, normally sized 2 x 4 KB to 2 x 4 MB. Calculations are as shown in the result headings. For the first two double precision tests, speed in MFLOPS can be calculated by dividing MB/second by 8 and 16 respectively. For single precision, divide by 4 and 8.
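
The conversion just described can be written as a small lookup, sketched below. The dictionary and function names are illustrative only. The sample inputs are the fastest Pi 3B+ and Pi 4B 64 bit results for the first test, taken from the tables in this post.

```python
# Bytes read per floating point operation, from the divisors quoted above.
# Keys are (precision, test); the names are illustrative only.
BYTES_PER_FLOP = {
    ("double", "x[m]=x[m]+s*y[m]"): 8,
    ("double", "x[m]=x[m]+y[m]"): 16,
    ("single", "x[m]=x[m]+s*y[m]"): 4,
    ("single", "x[m]=x[m]+y[m]"): 8,
}

def mflops(mb_per_second, precision, test):
    """Convert a MemSpeed MB/second result to MFLOPS."""
    return mb_per_second / BYTES_PER_FLOP[(precision, test)]

print(round(mflops(4813, "double", "x[m]=x[m]+s*y[m]")))   # Pi 3B+ 64b
print(round(mflops(15719, "double", "x[m]=x[m]+s*y[m]")))  # Pi 4B 64b
```

These reproduce the first Max MFLOPS figure reported under the corresponding result tables.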

Results are provided below for the Gentoo 64 bit version on the Pi 3B+ and Pi 4B, and the Raspbian 32 bit variety on the Pi 4B, then a sample of relative performance, covering data from L1 cache, L2 cache and RAM.

Gains greater than the 7% CPU MHz difference were recorded all round by the Pi 4B over the Pi 3B+. The most impressive were with L2 cache based data and the more intensive floating point calculations.

On the Pi 4B, speeds of 64 bit and 32 bit compilations were similar using RAM based data and executing some integer tests, but the 64 bit version was significantly faster on cache based floating point calculations.

Code: Select all

                              Gentoo 64b Pi 3B+

         Memory Reading Speed Test armv8 64 Bit by Roy Longbottom

               Start of test Fri Aug 16 12:48:51 2019

    Memory  x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m]       x[m]=y[m]
    KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
      Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

          8   4813   2897   4350   6180   3954   4831   5378   4324   4324
         16   4540   2900   4356   6213   3961   4838   5401   4344   4333
         32   4184   2780   4047   5540   3721   4483   5421   4285   4316
         64   3784   2678   3803   4776   3547   4171   4925   4087   4051
        128   3613   2694   3842   4731   3562   4188   4967   4087   4103
        256   3133   2652   3800   4626   3493   4027   4967   4093   4096
        512    670    882   1630   2913   2422   2718   3101   3141   2780
       1024    587    774   1017   1310   1287   1184   1105   1526   1543
       2048    555    746    917   1143   1131   1043   1071   1007   1128
       4096    545    691   1130   1039   1015   1140   1045   1087    892
       8192    537    795   1139    980   1133   1148    887    854    922
 Max MFLOPS    602    725

                              Gentoo 64b Pi 4B

          8  15530  13973  12509  15570  14025  15534  11417   9308   7798
         16  15719  14042  12750  15745  14200  15660  11753   9447   7890
         32  14062  12228  11435  14052  12699  12855  11864   9459   7937
         64  12195  11344  10698  12211  11705  12025   8872   8752   7904
        128  12172  11360  10755  12166  11862  11975   8569   8460   7913
        256  12228  11369  10697  12123  11790  12082   8073   8222   7896
        512  11269  10738  10206  10985  11164  11590   8017   6280   6557
       1024   3407   2635   3281   3396   3242   2979   3765   3947   4029
       2048   1525   1832   1838   1851   1607   1838   2819   2790   2770
       4096   1407   1851   1859   1861   1666   1840   2485   2487   2410
       8192   1913   1914   1922   1528   1895   1891   2496   2234   2489
 Max MFLOPS   1965   3511

                              Comparison 64b Pi4/3B+

          8   3.23   4.82   2.88   2.52   3.55   3.22   2.12   2.15   1.80
         16   3.46   4.84   2.93   2.53   3.58   3.24   2.18   2.17   1.82
 
        256   3.90   4.29   2.82   2.62   3.38   3.00   1.63   2.01   1.93
        512  16.82  12.17   6.26   3.77   4.61   4.26   2.59   2.00   2.36
       1024   5.80   3.40   3.23   2.59   2.52   2.52   3.41   2.59   2.61

       4096   2.58   2.68   1.65   1.79   1.64   1.61   2.38   2.29   2.70
       8192   3.56   2.41   1.69   1.56   1.67   1.65   2.81   2.62   2.70

                              Raspbian 32b Pi 4B

          8   8459   4766  13344   8303   4768  15553   7806   9926   9927
         16   7142   3918   8649   7103   4094   9309   7899  10086  10056
         32   7969   4490  10339   7941   4532  11627   7758  10070  10048
         64   8126   4602   9909   8114   4617  11069   7425   8021   8070
        128   8302   4651   9623   8311   4657  10836   7374   8049   7934
        256   8319   4663   9627   8360   4666  10768   7530   7922   7925
        512   8088   4629   9453   8239   4650  10696   5023   7904   7949
       1024   3581   3113   3618   3577   3150   3675   5358   2431   1560
       2048   1338   1808   1780   1811   1832   1773   2131    950    956
       4096   1881   1880   1852   1879   1664   1336   1988    984   1054
       8192   1890   1901   1884   1729   1319   1367   2252   1018   1021
 Max MFLOPS   1057   1192

                              Comparison Pi 4B 64b/32b

          8   1.84   2.93   0.94   1.88   2.94   1.00   1.46   0.94   0.79
         16   2.20   3.58   1.47   2.22   3.47   1.68   1.49   0.94   0.78

        256   1.47   2.44   1.11   1.45   2.53   1.12   1.07   1.04   1.00
        512   1.39   2.32   1.08   1.33   2.40   1.08   1.60   0.79   0.82
       1024   0.95   0.85   0.91   0.95   1.03   0.81   0.70   1.62   2.58
 
       4096   0.75   0.98   1.00   0.99   1.00   1.38   1.25   2.53   2.29
       8192   1.01   1.01   1.02   0.88   1.44   1.38   1.11   2.19   2.44
NeonSpeed Benchmark

This carries out some of the same calculations as MemSpeed. All results are for 32 bit floating point and integer calculations. Norm functions are as generated by the compiler, using NEON directives, and Neon functions are produced using Intrinsic Functions.

Unlike on the Pi 3B+, running the same programs, compiled code on the Pi 4 was no longer slower than that produced via Intrinsic Functions. This led to performance gains of up to just over five times.

Except using L1 cache based data, performance was essentially the same using 32 bit and 64 bit benchmarks.

Code: Select all

                     Gentoo 64b Pi 3B+

   NEON Speed Test armv8 64 Bit V 1.0 Fri Aug 16 2019

       Vector Reading Speed in MBytes/Second 
     Memory  Float v=v+s*v  Int v=v+v+s  Neon v=v+v
     KBytes   Norm   Neon   Norm   Neon  Float    Int

         16   2715   5110   3945   4826   5426   5598
         32   2528   4326   3569   4191   4596   4661
         64   2491   4153   3494   4068   4407   4429
        128   2537   4228   3583   4120   4461   4473
        256   2526   4265   3614   4140   4480   4514
        512   1917   2830   2545   2579   2896   2964
       1024   1166   1299   1152   1257   1205   1229
       4096   1022   1135   1132   1122   1130   1100
      16384   1080   1026   1131   1016   1064   1094
      65536    996   1120   1061    831   1110   1069

                     Gentoo 64b Pi 4B

         16  13982  16424  12505  15239  16065  17193
         32   9554  10753   8981   9657  10970  11025
         64  10658  11833  10274  10722  12110  12134
        128  10657  11887  10337  10680  11994  11973
        256  10709  11970  10360  10774  12003  12083
        512  10147  11441   9733  10209  11264  11532
       1024   2964   3222   2876   3216   3270   2942
       4096   1734   1712   1729   1772   1586   1728
      16384   1592   1922   1818   1923   1926   1667
      65536   1970   1736   1997   1747   1884   2021

                   Comparison 64b Pi4/3B+

         16   5.15   3.21   3.17   3.16   2.96   3.07

        256   4.24   2.81   2.87   2.60   2.68   2.68
        512   5.29   4.04   3.82   3.96   3.89   3.89

      65536   1.98   1.55   1.88   2.10   1.70   1.89
                             
                      Raspbian 32b Pi 4B

         16   9677  10072   8905   9358   9776  10473
         32  10149  10330   9364   9539   9988  10543
         64  10948  11708  10466  10568  11318  11994
        128  10484  11232  10410  10104  11200  11792
        256  10509  11369  10428  10264  11273  11842
        512  10406  11066  10134  10054  11075  11467
       1024   3069   3202   3159   3166   3204   3203
       4096   1721   1910   1908   1882   1903   1900
      16384   2023   2009   2008   1965   2032   2013
      65536   2073   2074   2074   2073   2068   2064

                   Comparison Pi 4B 64b/32b

         16   1.44   1.63   1.40   1.63   1.64   1.64

        256   1.02   1.05   0.99   1.05   1.06   1.02
        512   0.98   1.03   0.96   1.02   1.02   1.01

      65536   0.95   0.84   0.96   0.84   0.91   0.98
BusSpeed Benchmark

This is a read only benchmark with data from caches and RAM. The program reads one word with 32 word address increments, followed by decreasing increments, finally reading all data. This shows where data is read in bursts, enabling estimates of bus speeds to be made. The two comparison columns are for two word and one word increments.

Most data transfers were 2.0 to 2.5 times faster on the Pi 4, including from RAM, and somewhat higher with L2 cache based data.

The 64 bit version still deals with 32 bit words but transferred data somewhat quicker than the 32 bit program, as shown by the Pi 4 results.
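
The effect of the address increments can be pictured with the sketch below, which counts the words read and the distinct bursts (cache lines) touched for each increment. The 64 byte line size is an assumption made for illustration, and the function name is made up.

```python
WORD_BYTES = 4    # the benchmark reads 4 byte words
LINE_BYTES = 64   # assumed burst / cache line size

def words_and_lines(kbytes, inc):
    """Words read and distinct lines touched for one pass at a given increment."""
    n_words = kbytes * 1024 // WORD_BYTES
    indices = range(0, n_words, inc)
    lines = {i * WORD_BYTES // LINE_BYTES for i in indices}
    return len(indices), len(lines)

for inc in (32, 16, 8, 4, 2, 1):
    words, lines = words_and_lines(16, inc)
    print(f"Inc{inc:<2} words read {words:5d}  lines touched {lines:4d}")
```

Under that assumption, Inc32 and Inc16 each use only one word per line fetched, and every halving of the increment after that doubles the words used per line. That would be consistent with the RAM results being similar for Inc32 and Inc16, then roughly doubling at each step down to Read All.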

Code: Select all

                       Gentoo 64b Pi 3B+

    BusSpeed armv8 64 Bit Fri Aug 16 12:53:43 2019

     Reading Speed 4 Byte Words in MBytes/Second
     Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read   Inc2   Read
     KBytes  Words  Words  Words  Words  Words    All  Words    All

         16   3819   4253   4622   5041   5089   3870
         32   1234   1328   2067   3158   4082   3674
         64    681    704   1325   2208   3350   3602
        128    638    646   1214   2070   3238   3625
        256    592    617   1165   1991   3164   3622
        512    295    309    640    985   2085   2790
       1024    108    120    271    525   1070   1636
       4096     98    123    249    486    881   1840
      16384    121    114    246    480    977   1642
      65536    121    124    248    409    989   1864

                       Gentoo 64b Pi 4B
                                                          Pi4/3B+

         16   4999   5042   5665   5885   5891   8217   1.16   2.12
         32   1578   2105   3283   4339   5154   7507   1.26   2.04
         64    585    911   1855   3085   5163   7918   1.54   2.20
        128    590    932   1888   3110   5161   7874   1.59   2.17
        256    598    934   1908   3056   5265   7883   1.66   2.18
        512    603    939   1822   3019   5124   7716   2.46   2.77
       1024    319    482   1060   1885   3283   5721   3.07   3.50
       4096    209    253    503   1006   2009   4111   2.28   2.23
      16384    209    261    520   1041   2071   4115   2.12   2.51
      65536    203    263    489   1011   2023   4036   2.05   2.17

                       Raspbian 32b Pi 4B
                                                          64b/32b

         16   3836   4049   4467   5885   4641   5858   1.14   1.14
         32    761   1473   2594   3216   3960   4780   1.01   1.01
         64    409    801   1684   2422   3745   3940   0.95   0.95
        128    406    803   1202   1914   3037   5377   1.32   1.32
        256    415    700   1165   2481   4789   5137   1.27   1.27
        512    392    760   1243   2455   3764   4264   1.38   1.38
       1024    230    256    623   1061   2455   3501   1.59   1.59
       4096    197    214    454    938   1852   3195   1.80   1.80
      16384    138    215    445    897   1724   3210   1.91   1.91
      65536    174    215    398    744   1655   3130   1.61   1.61
Fast Fourier Transforms Benchmark

This is a real application provided by my collaborator at Compuserve Forum. There are two versions. The first one is the original C program. The second is an optimised version, originally using my x86 assembly code, but translated back into C code, making use of the partitioning and (my) arrangement to optimise for burst reading from RAM. Three measurements are made at each size, using both single and double precision data, for FFT sizes between 1K and 1024K. Results are in milliseconds, with those shown here being the average of the three measurements.

There were gains all round on the Pi 4, compared with the 3B+, mainly between 3 and 4 times on the optimised version, and less so using FFT1, which depends more on data transfer speed.

On the Pi 4, performance from the 32 bit compilation was often similar to that at 64 bits. This is probably due to much of the data being read on a skipped sequential basis, not good for vectorisation.
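
Since these results are run times rather than speeds, the comparison columns are the slower time divided by the faster one. A minimal sketch, using the 1024K single precision FFT1 times from the tables (the function name is illustrative):

```python
def speed_ratio(baseline_ms, new_ms):
    """How many times faster the new system is, from run times."""
    return baseline_ms / new_ms

# 1024K single precision FFT1 times in milliseconds, from the tables
pi3b_64 = 1746.23
pi4b_64 = 1250.51
pi4b_32 = 1617.49

print(round(speed_ratio(pi3b_64, pi4b_64), 2))  # Pi4/3B+
print(round(speed_ratio(pi4b_32, pi4b_64), 2))  # 64b/32b
```

A ratio below 1.0 in the 64b/32b columns therefore means the 32 bit compilation finished sooner.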

Code: Select all

                 Gentoo 64b Pi 3B+

       Size    FFT1           FFT3
          K      SP      DP     SP     DP

          1    0.13    0.15   0.15   0.17
          2    0.29    0.39   0.32   0.38
          4    0.76    1.13   0.79   0.85
          8    1.93    2.66   1.77   1.94
         16    4.02    5.51   4.69   5.14
         32    9.50   25.11   9.51  13.67
         64   42.53  110.21  25.30  32.25
        128  151.08  257.41  57.68  76.71
        256  355.88  589.07 129.47 174.85
        512  819.91 1324.89 297.80 390.74
       1024 1746.23 2943.08 641.50 863.82

       
               Gentoo 64b Pi 4B             Pi4/3B+

       Size    FFT1           FFT3          FFT1          FFT3
          K      SP      DP     SP     DP     SP     DP     SP     DP

          1    0.04    0.04   0.04   0.04   3.30   3.62   3.60   4.13
          2    0.08    0.14   0.11   0.09   3.81   2.88   2.82   4.03
          4    0.25    0.38   0.19   0.22   3.05   2.93   4.13   3.86
          8    0.79    1.31   0.46   0.50   2.45   2.04   3.87   3.87
         16    2.15    2.91   1.15   1.09   1.87   1.89   4.07   4.71
         32    5.71    6.76   2.48   3.18   1.66   3.71   3.83   4.30
         64   15.22   51.00   5.43   9.29   2.79   2.16   4.66   3.47
        128   83.47  151.95  16.28  24.75   1.81   1.69   3.54   3.10
        256  231.24  362.64  39.13  57.28   1.54   1.62   3.31   3.05
        512  561.16  765.18  90.20 133.21   1.46   1.73   3.30   2.93
       1024 1250.51 1878.44 213.35 303.39   1.40   1.57   3.01   2.85


              Raspbian 32b Pi 4B            64b/32b

       Size    FFT1           FFT3          FFT1          FFT3
          K      SP      DP     SP     DP     SP     DP     SP     DP

          1    0.04    0.04   0.06   0.05   0.99   0.96   1.44   1.18
          2    0.08    0.12   0.13   0.11   1.04   0.89   1.14   1.18
          4    0.32    0.37   0.27   0.24   1.28   0.96   1.42   1.09
          8    0.77    0.97   0.58   0.55   0.98   0.74   1.26   1.09
         16    1.69    2.01   1.49   1.35   0.78   0.69   1.29   1.24
         32    4.37    4.89   2.96   3.63   0.77   0.72   1.19   1.14
         64    9.12   26.55   7.46  10.75   0.60   0.52   1.37   1.16
        128   55.52  160.11  17.93  26.03   0.67   1.05   1.10   1.05
        256  305.92  423.06  41.16  55.06   1.32   1.17   1.05   0.96
        512  833.10  854.88  86.93 120.53   1.48   1.12   0.96   0.90
       1024 1617.49 1875.52 190.28 266.60   1.29   1.00   0.89   0.88
Next Multithreading Benchmarks

RoyLongbottom
Posts: 299
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Wed Sep 18, 2019 2:23 pm

64 Bit Raspberry Pi 4B Multithreading Benchmarks

Many of these benchmarks run using 1, 2, 4 and 8 threads, with others executing programs on all available cores via OpenMP.

MP-Whetstone Benchmark

Multiple threads each run the eight test functions at the same time, but with some dedicated variables. Measured speed is based on the last thread to finish. Mutex functions are used to avoid update conflicts, by allowing only one thread at a time to access common data. Performance is generally proportional to the number of cores used. There can be some significant differences from the single CPU Whetstone benchmark results on particular tests, due to a different compiler being used. None of the test functions are suitable for SIMD operation, with the simpler instructions being used. Overall seconds indicates MP efficiency.

As with the single core version, the average Pi 4 performance gain over the Pi 3B+ was just over 2 times. The 64 bit and 32 bit Pi 4B speeds were more similar, this time with the 32 bit version being somewhat faster on some floating point calculations.
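
The structure described, with private work per thread, a mutex around shared updates and timing taken when the last thread finishes, might be sketched as follows. This is not the benchmark's code. It only illustrates the pattern, and Python threads will not show the near linear scaling of the C program.

```python
import threading
import time

def run_threads(n_threads, work_items):
    """Run identical work in n_threads and time until the last one ends."""
    lock = threading.Lock()
    shared = {"total": 0}

    def worker():
        local = 0
        for i in range(work_items):   # private work, no locking needed
            local += i & 1
        with lock:                    # one thread at a time updates shared data
            shared["total"] += local

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:                 # the last join returns after the slowest thread
        t.join()
    elapsed = time.perf_counter() - start
    return shared["total"], elapsed

total, seconds = run_threads(4, 10000)
print(total, seconds)
```

With four cores, the elapsed time should stay roughly constant up to four threads and double at eight, which is the pattern in the Overall Seconds lines.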

Code: Select all

           MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
  Threads              1      2      3  MOPS   MOPS   MOPS   MOPS   MOPS

                            Gentoo Pi 3B+ 64 Bits

    1       1152    383    383    328   23.2   13.0   N/A    2721   1365
    2       2312    767    767    657   46.5   26.0   N/A    5461   2738
    4       4580   1506   1526   1304   92.0   51.6   N/A   10777   5449
    8       4788   1815   1961   1382   95.0   53.3   N/A   13827   5811

            Overall Seconds   4.96 1T,   4.95 2T,   5.05 4T,  10.07 8T
    
                            Gentoo Pi 4B 64 Bits

    1       2395    536    538    397   60.8   39.0   N/A    4483    997
    2       4784   1062   1079    794  121.2   77.9   N/A    8932   1990
    4       9476   2125   2080   1568  240.8  155.3   N/A   17718   3962
    8       9834   2631   2744   1630  243.6  160.1   N/A   22265   4053

            Overall Seconds   4.99 1T,   5.01 2T,   5.12 4T,  10.17 8T

                             Pi 4B/3B+ 64 Bits
 
    1       2.08   1.40   1.41   1.21   2.62   3.00   N/A    1.65   0.73
    2       2.07   1.39   1.41   1.21   2.61   3.00   N/A    1.64   0.73
    4       2.07   1.41   1.36   1.20   2.62   3.01   N/A    1.64   0.73
    8       2.05   1.45   1.40   1.18   2.56   3.00   N/A    1.61   0.70

                           Raspbian Pi 4B 32 Bits

    1       2059    673    680    311   55.6   33.1   7462   2245    995
    2       4117   1342   1391    624  110.7   65.9  14887   4467   1986
    4       7910   2652   2722   1180  208.5  132.6  29291   8952   3832
    8       8652   3057   2971   1268  233.2  149.6  38368  11923   3942

            Overall Seconds   4.99 1T,   5.01 2T,   5.29 4T,  10.71 8T

                            Pi 4B 64 bits/32 bits

    1       1.16   0.80   0.79   1.28   1.09   1.18   N/A    2.00   1.00
    2       1.16   0.79   0.78   1.27   1.09   1.18   N/A    2.00   1.00
    4       1.20   0.80   0.76   1.33   1.15   1.17   N/A    1.98   1.03
    8       1.14   0.86   0.92   1.28   1.04   1.07   N/A    1.87   1.03
MP Dhrystone Benchmark

This executes multiple copies of the same program, but with some shared data, leading to inconsistent multithreading performance with not much gain using multiple cores.

The single thread speeds were similar to the earlier Dhrystone results, with RPi 4B ratings around twice as fast as those for the Pi 3B+. The single thread Pi 4B 64 bit/32 bit speed ratio was also similar to that during the single core tests.
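
The limited gain from extra threads is easier to see when each rating is expressed relative to the single thread result. A small sketch using the VAX MIPS values from the results that follow (the function name is illustrative):

```python
def scaling(ratings):
    """Throughput at each thread count relative to one thread."""
    base = ratings[1]
    return {threads: round(value / base, 2) for threads, value in ratings.items()}

# VAX MIPS by thread count, from the results table
pi3b_64 = {1: 4207, 2: 6804, 4: 7401, 8: 7415}
pi4b_64 = {1: 8880, 2: 7828, 4: 8303, 8: 8314}

print(scaling(pi3b_64))
print(scaling(pi4b_64))
```

A value of 4.0 at four threads would indicate perfect scaling. Here the Pi 4B at 64 bits never exceeds its single thread rating.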

Code: Select all

 MP-Dhrystone Benchmark armv8 64 Bit Fri Aug 23 00:44:05 2019

                  Using 1, 2, 4 and 8 Threads

Threads                       1        2        4        8

Seconds                     0.54     0.67     1.23     2.46
Dhrystones per Second    7391586 11954301 11300304 13028539
VAX MIPS Pi 3B+ 64 bits     4207     6804     7401     7415
  
VAX MIPS Pi 4B  64 bits     8880     7828     8303     8314

Pi 4B/3B+       64 bits     2.11     1.15     1.12     1.12

VAX MIPS Pi 4B  32 bits     5539     5739     6735     7232

Pi 4B   64 bits/32 bits     1.60     1.36     1.23     1.15
MP Linpack Benchmark (Single Precision NEON)

This executes a single copy of the benchmark, at three data sizes, with the critical daxpy code multithreaded. This code was also modified to allow a higher level of parallelism, without changing any calculations. Even so, MP performance was much slower than running as a single thread. The main reasons appear to be updating data in RAM to maintain integrity, with performance reflecting memory speeds, and exceptionally high thread start/stop overheads.

This benchmark uses the same NEON Intrinsic Functions as the single core program, with similar speeds at N = 100, without the threading overheads, but decreasing with larger data sizes, involving RAM accesses.

The full logged output is shown for the first entry, to demonstrate error checking facilities. The sumchecks were identical from the Pi 3B+ and Pi 4B at Gentoo 64 bits, but those from the Raspbian 32 bit test were different, as shown below. Ignoring the slow threaded results, performance ratios of CPU speed limited tests were similar to the single core version.
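
A comparison of the numeric results between two runs can be automated along the lines below. The values are the N = 100 figures copied from the logs that follow. The dictionary layout and function name are only for illustration.

```python
# N = 100 numeric results copied from the logs in this post
gentoo_64 = {"NR": 1.97, "RE": 4.69621336e-05, "MA": 1.19209290e-07,
             "X0": -1.31130219e-05, "XN": -1.30534172e-05}
raspbian_32 = {"NR": 2.17, "RE": 5.16722466e-05, "MA": 1.19209290e-07,
               "X0": -2.38418579e-07, "XN": -5.06639481e-06}

def differing_fields(a, b):
    """Names of result fields whose values are not identical."""
    return sorted(k for k in a if a[k] != b[k])

print(differing_fields(gentoo_64, raspbian_32))
```

Only the machine epsilon (MA) agrees between the 64 bit and 32 bit runs. The other fields differ slightly.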

Code: Select all

           Gentoo Pi 3B+ 64 Bits

 Linpack Single Precision MultiThreaded Benchmark
  64 Bit NEON Intrinsics, Fri Aug 23 00:45:54 2019

   MFLOPS 0 to 4 Threads, N 100, 500, 1000

Threads     None      1       2       4

N  100    642.56   66.69   66.05   65.54
N  500    479.48  274.36  274.85  269.07
N 1000    363.77  316.17  310.37  316.71

 NR=norm resid RE=resid MA=machep X0=x[0]-1 XN=x[n-1]-1

 N              100             500            1000

 NR            1.97            5.40           13.51
 RE  4.69621336e-05  6.44138840e-04  3.22485110e-03
 MA  1.19209290e-07  1.19209290e-07  1.19209290e-07
 X0 -1.31130219e-05  5.79357147e-05 -3.08930874e-04
 XN -1.30534172e-05  3.51667404e-05  1.90019608e-04

Thread
 0 - 4 Same Results    Same Results    Same Results


           Gentoo Pi 4B 64 Bits

N  100   2252.70   97.25   97.43   97.41
N  500   1628.24  665.21  646.63  674.38
N 1000    399.87  406.80  405.84  399.54


           Pi 4B/3B+ 64 Bits

N  100      3.51    1.46    1.48    1.49
N  500      3.40    2.42    2.35    2.51
N 1000      1.10    1.29    1.31    1.26


           Raspbian Pi 4B 32 Bits

N  100   1921.53  108.66  101.88  102.46
N  500   1548.81  530.23  714.37  733.09
N 1000    399.94  378.11  364.78  398.21

           Pi 4B 64 bits/32 bits

N  100      1.17    0.89    0.96    0.95
N  500      1.05    1.25    0.91    0.92
N 1000      1.00    1.08    1.11    1.00

           32 bit numeric results

 N              100             500            1000

 NR            2.17            5.42            9.50
 RE  5.16722466e-05  6.46698638e-04  2.26586126e-03
 MA  1.19209290e-07  1.19209290e-07  1.19209290e-07
 X0 -2.38418579e-07 -5.54323196e-05 -1.26898289e-04
 XN -5.06639481e-06 -4.70876694e-06  1.41978264e-04
MP BusSpeed (read only) Benchmark

In this version, each thread accesses all of the data in separate sections, covering caches and RAM, starting at different points. See the single processor BusSpeed details regarding burst reading, which can indicate significant differences.

Comparisons are provided for RdAll, at 1, 2 and 4 threads. Pi 4B/3B+ performance ratios were similar to those for the single core tests. There was an exception with two threads, on the Pi 4, using RAM at 64 bits, probably due to caching effects and not seen on subsequent repeated tests.

Particularly note that performance was significantly better using the 32 bit Raspbian compiler. Below are examples of disassembly, showing that the 64 bit code employed scalar operation, using 32 bit w registers, with the 32 bit version benefiting from using 128 bit q registers, for Single Instruction Multiple Data (SIMD) operation. Compile options are included below, where alternatives were also tried for the 64 bit version on the Pi 4B, but failed to implement SIMD operation.
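
The reason the AND loop can be vectorised at all is that AND is associative and commutative, so words can be combined in four independent lanes and merged at the end. The Python sketch below only models that idea. It is not a translation of either disassembly, and the function names are made up.

```python
from functools import reduce

def and_scalar(words):
    """AND all words one at a time, as scalar code using w registers would."""
    return reduce(lambda acc, w: acc & w, words, 0xFFFFFFFF)

def and_four_lanes(words):
    """AND words in groups of four, mimicking one 128 bit q register."""
    lanes = [0xFFFFFFFF] * 4
    for i in range(0, len(words), 4):
        for lane, w in enumerate(words[i:i + 4]):
            lanes[lane] &= w
    # Merge the four lanes into a single 32 bit result
    return lanes[0] & lanes[1] & lanes[2] & lanes[3]

data = [0xFFFFFFFF - (1 << (i % 32)) for i in range(64)]
print(hex(and_scalar(data)), hex(and_four_lanes(data)))
```

Both routes give the same result, with the lane version needing a quarter of the AND steps in the main loop.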

Code: Select all

                  Gentoo Pi 3B+ 64 Bits 

 MP-BusSpd armv8 64 Bit Fri Aug 23 00:47:43 2019

   MB/Second Reading Data, 1, 2, 4 and 8 Threads
   Staggered starting addresses to avoid caching

KB Threads  Inc32   Inc16   Inc8    Inc4    Inc2    RdAll

 12.3 1T     3138    2822    3044    2383    1708    1737
      2T     5354    4865    5647    4519    3303    3362
      4T     7922    7504    9717    6794    6216    6597
      8T     5125    4159    6987    6696    5350    5195
122.9 1T      640     666    1191    1864    1627    1712
      2T     1008    1018    1926    3496    3268    3387
      4T      962    1042    2157    4259    6427    4372
      8T     1031    1047    2147    3952    6317    6514
12288 1T      124     114     260     527    1016    1363
      2T      137     138     275     487     946    2182
      4T      105     118     240     409     975    2158
      8T      108     117     236     504    1077    2051

                                                          RdAll
                      Gentoo Pi 4B 64 Bits              Pi 4B/3B+

 12.3 1T    4864    4879    5378    4379    4115    4221    2.43
      2T    8159    6924    9179    8006    7689    7837    2.33
      4T   12677   11531   14850   12554   13807   14794    2.24
      8T    7398    6927   10881   11675   11497   13075    2.52
122.9 1T     665     926    1869    2714    3557    4152    2.43
      2T     610     696    1549    4898    7188    8184    2.42
      4T     476     865    1885    4107    8058   14617    3.34
      8T     474     883    1848    3919    7939   13633    2.09
12288 1T     202     210     514    1044    2033    3616    2.65
      2T     258     425     853    1551    3693    6228    2.85
      4T     217     346     497    1024    2181    3789    1.76
      8T     220     275     540    1030    1937    3577    1.74


                      Raspbian Pi 4B 32 Bits              64b/32b

 12.3 1T    5263    5637    5809    5894    5936   13445    0.31
      2T    9412   10020   10567   11454   11604   24980    0.31
      4T   16282   15577   16418   21222   20000   45530    0.32
      8T   11600   13285   16070   18579   20593   36837    0.35
122.9 1T     739     956    1888    3153    5008    9527    0.44
      2T     629    1158    1568    5058    9509   16489    0.50
      4T     600    1093    2134    4527    8732   16816    0.87
      8T     593    1104    2121    4382    8629   17158    0.79
12288 1T     238     258     518    1005    2001    4029    0.90
      2T     278     228     453    1690    1826    3628    1.72
      4T     269     257     740    1019    1790    4145    0.91
      8T     233     292     532     926    2186    3581    1.00
MP BusSpeed Disassembly

Code: Select all

       Source Code 64 AND instructions in main loop
  
   for (i=start; i<end; i=i+64)
   {
       andsum1[t] = andsum1[t] 
           & array[i   ] & array[i+1 ] & array[i+2 ] & array[i+3 ]
           & array[i+4 ] & array[i+5 ] & array[i+6 ] & array[i+7 ]
    To
           & array[i+56] & array[i+57] & array[i+58] & array[i+59]
           & array[i+60] & array[i+61] & array[i+62] & array[i+63];
   }


Pi 32 Bit Raspbian Compile
gcc mpbusspd2.c cpuidc.c -lpthread -lm -lrt -O3 -mcpu=cortex-a7
           -mfloat-abi=hard -mfpu=neon-vfpv4 -o MP-BusSpd2PiA7


Pi 64 Bit Gentoo Compile
gcc mpbusspd2.c cpuidc.c -lpthread -lm -lrt -O3 -march=armv8-a -o MP-BusSpd2Pi64

Parameters also tried
-march=armv8-a+crc -mtune=cortex-a72 -ftree-vectorize -O2 -pipe 
-fomit-frame-pointer




Pi 32 Bit Disassembly          Pi 64 Bit Disassembly

vld1.32 {q6}, [lr]              ldp     w30, w17, [x0, 52]
vld1.32 {q7}, [r6]              and     w18, w18, w30
vand    q10, q10, q6            and     w1, w1, w18
vld1.32 {q6}, [r0]              ldp     w18, w30, [x0, 60]
vand    q9, q9, q7              and     w17, w17, w18
vand    q12, q12, q6            and     w1, w1, w17
vld1.32 {q7}, [ip]              ldp     w17, w18, [x0, 68]
vld1.32 {q6}, [r7]              and     w30, w30, w17
add     r1, r3, #96             and     w1, w1, w30
add     r6, r3, #144            ldp     w30, w17, [x0, 76]
vand    q11, q11, q7            and     w18, w18, w30
vand    q14, q14, q6            and     w1, w1, w18
vld1.32 {q7}, [r1]              ldp     w18, w30, [x0, 84]
vld1.32 {q6}, [r6]              and     w17, w17, w18
MP RandMem Benchmark

This benchmark potentially reads and writes all data, in sections covering caches and RAM, with each thread starting at a different address. Random access can select any address after that. Writing tends to involve updating the appropriate memory area, providing constant speeds. Random access is significantly affected by burst reading and writing.

At 64 bits, the Pi 4B provided variable gains over the Pi 3B+. On the Pi 4B itself, gains from 64 bits over 32 bits were smaller.

Code: Select all

 MP-RandMem armv8 64 Bit Aug 2019  Using 1, 2, 4 and 8 Threads

         Serial Serial Random Random Serial Serial Random Random
KB+Thread  Read   RdWr   Read   RdWr     Read   RdWr   Read   RdWr

Gentoo Pi 4B 64 Bits

 12.3 1T    5922   7871   5892   7857
      2T   11856   7882  11902   7923
      4T   22964   7821  22276   7832
      8T   23225   7751  22082   7717
122.9 1T    5827   7276   2052   1921
      2T   10965   7258   1754   1924
      4T   10969   7232   1848   1929
      8T   10896   7158   1834   1909
12288 1T    3879   1052    188    170
      2T    4848    935    218    168
      4T    4684    943    332    170
      8T    3982   1049    340    171

Gentoo Pi 3B+ 64 Bits                    Raspbian Pi 4B 32 Bits

 12.3 1T    4901   3587   4912   3585     5860   7905   5927   7657
      2T    8749   3564   8719   3556    11747   7908  11182   7746
      4T   17108   3504  17160   3505    21416   7626  17382   7731
      8T   16885   3475  16650   3485    20649   7528  20431   7378
122.9 1T    3921   3339   1010    974     5479   7269   1826   1923
      2T    7360   3350   1814    972    10355   6964   1667   1920
      4T   12199   3313   2281    969     9808   7177   1715   1908
      8T   12089   3313   2279    968    11677   7058   1697   1919
12288 1T    2024    828     83     67     3438   1271    179    152
      2T    2169    820    142     67     4176   1204    213    167
      4T    2178    818    154     67     4227   1117    337    161
      8T    2219    821    161     67     3479   1093    287    168

                         4 Thread  Comparisons
            Pi 4B/3B+ 64 Bits             Pi 4B 64 bits/32 bits

 12.3 4T    1.34   2.23   1.30   2.23     1.07   1.03   1.28   1.01
122.9 4T    0.90   2.18   0.81   1.99     1.12   1.01   1.08   1.01
12288 4T    2.15   1.15   2.16   2.54     1.11   0.84   0.99   1.06
MP-MFLOPS Benchmarks

MP-MFLOPS measures floating point speed on data from caches and RAM. The first calculations are as used in Memory Speed Benchmark, with a multiply and an add per data word read. The second uses 32 operations per input data word of the form x = (x+a)*b-(x+c)*d+(x+e)*f -- more. Tests cover 1, 2, 4 and 8 threads, each carrying out the same calculations but accessing different segments of the data. Versions are available using single precision and double precision data, plus one with NEON intrinsic functions. The numeric results are converted into a simple sumcheck, that should be constant, irrespective of the number of threads used. Correct values are included at the end of the results below. Note the differences using NEON functions and double or single precision floating point instructions.
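As a rough illustration of the calculation forms described above, here is a minimal C sketch. The three-term expression quoted above works out at eight floating point operations per word (three adds, three multiplies, one subtract and one further add); the 32 operation version extends the same pattern and is not reproduced here. The function names, the argument lists and the exact form of the two operation case are assumptions for illustration, not taken from the benchmark source, and threading is left out.

```c
#include <stddef.h>

/* Hypothetical 2 operations per word case: one add and one multiply. */
void ops2(float *x, size_t n, float a, float b)
{
    for (size_t i = 0; i < n; i++)
        x[i] = (x[i] + a) * b;
}

/* Hypothetical 8 operations per word case, using the form quoted in the
   text: x = (x+a)*b - (x+c)*d + (x+e)*f. */
void ops8(float *x, size_t n, float a, float b, float c,
          float d, float e, float f)
{
    for (size_t i = 0; i < n; i++)
        x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
}
```

In a threaded version, each thread would call a function like this on its own segment of the array.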

There can be wide variations in speeds, affected by the short running times and by factors such as variations in cached data. To help in interpreting results, comparisons are provided of results using one and four threads. These indicate that, with cache based data, the Pi 4B was more than 3.5 times faster than the Pi 3B+ at two operations per word, but less so at 32 operations.

The 64 bit and 32 bit comparisons were, no doubt, influenced by the particular compiler version used, and this is reflected in the main disassembled code shown below, for 32 operations per word. The 32 bit compile included -mfpu=neon-vfpv4, but NEON instructions were not generated, resulting in scalar operation, using single word s registers. I have another version, compiled with -funsafe-math-optimizations, that does produce NEON instructions, with similar performance to the 64 bit version, but more sumcheck differences.

The benchmark compiled to use NEON Intrinsic Functions does not include any that specify fused multiply and add operations, reducing maximum possible speed. The 64 bit compiler converts the functions to include fused instructions, providing the fastest speeds.

The main compiler independent feature that provides a clear advantage to 64 bit operation is that the CPU, at 32 bits, does not support double precision SIMD (NEON) operation, so single word d registers are compiled. On the other hand, the performance gain does not appear to meet the potential. This suggests that there are other limiting factors - see disassembly below.

Single Precision

Code: Select all

          MP-MFLOPS armv8 64Bit Thu Aug 22 19:50:10 2019

          FPU Add & Multiply using 1, 2, 4 and 8 Threads

    2 Ops/Word         32 Ops/Word        2 Ops/Word         32 Ops/Word
KB  12.8  128  12800   12.8    128 12800  12.8  128  12800   12.8   128  12800

    Gentoo Pi 4B 64 Bits MFLOPS

1T  2908  2854   459   5778   5734  5405
2T  5700  5311   457  10935  11212  7968
4T 10375  5588   490  18181  21842  7637
8T  9675  8460   511  20128  20567  8568

    Gentoo Pi 3B+ 64 Bits MFLOPS           Raspbian Pi 4B 32 Bits MFLOPS

1T   792   806   373   1780   1783  1724   987   993   606   2816   2794  2804
2T  1482  1596   382   3542   3509  3380  1823  1837   567   5610   5541  5497
4T  2861  2742   429   5849   7013  5465  2119  3349   647   9884  10702  9081
8T  2770  2877   429   6434   6700  6101  3136  3783   609  10230  10504  9240

                     4 Thread  Comparisons
    Pi 4B/3B+ 64 Bits                     Pi 4B 64 bits/32 bits

1T  3.67  3.54  1.23   3.25   3.22  3.14  2.95  2.87  0.76   2.05   2.05  1.93
4T  3.63  2.04  1.14   3.11   3.11  1.40  4.90  1.67  0.76   1.84   2.04  0.84
Double Precision

Code: Select all

       MP-MFLOPS armv8 64Bit Double Precision Thu Aug 22 19:51:42 2019

          FPU Add & Multiply using 1, 2, 4 and 8 Threads

    2 Ops/Word         32 Ops/Word        2 Ops/Word         32 Ops/Word
KB  12.8  128  12800   12.8   128  12800  12.8  128  12800   12.8   128  12800

    Gentoo Pi 4B 64 Bits MFLOPS

1T  1464  1386   225   3398   3386  3182
2T  2837  2792   228   6720   6741  4547
4T  5172  3414   251  10405  12762  4763
8T  4774  4353   275  11506  12118  4865

    Gentoo Pi 3B+ 64 Bits MFLOPS           Raspbian Pi 4B 32 Bits MFLOPS

1T   415   386   206   1400   1403  1333  1187  1220   309   2682   2714  2701
2T   820   813   209   2804   2767  2597  2420  2416   282   5379   5415  4780
4T  1328  1323   212   5433   5340  2465  4665  2381   317  10256  10336  5242
8T  1343  1308   214   5090   5006  3280  4385  3114   310   9721  10340  5131

                     4 Thread  Comparisons
    Pi 4B/3B+ 64 Bits                     Pi 4B 64 bits/32 bits

1T  3.53  3.59  1.09   2.43   2.41  2.39  1.23  1.14  0.73   1.27   1.25  1.18
4T  3.89  2.58  1.18   1.92   2.39  1.93  1.11  1.43  0.79   1.01   1.23  0.91
NEON Single Precision

Code: Select all

     MP-MFLOPS NEON Intrinsics 64 Bit Thu Aug 22 19:52:48 2019

          FPU Add & Multiply using 1, 2, 4 and 8 Threads

    2 Ops/Word         32 Ops/Word        2 Ops/Word         32 Ops/Word
KB  12.8  128  12800   12.8   128  12800  12.8  128  12800   12.8   128  12800

    Gentoo Pi 4B 64 Bits MFLOPS

1T  3311  3192   535   6442   6548  6198
2T  4607  6186   552  13030  13012  8468
4T  6279  5725   562  23798  24128  9374
8T  7815 12044   486  22725  21712  9395

    Gentoo Pi 3B+ 64 Bits MFLOPS           Raspbian Pi 4B 32 Bits MFLOPS

1T   830   823   406   2989   2986  2792  2491  2399   615   4325   4285  4261
2T  1575  1498   414   5981   5872  5445  5629  5520   591   8602   8463  8308
4T  2217  2650   431  11661  11644  6061 10580  5594   553  16991  16493  9124
8T  2733  3197   437  10505  10637  6708  7047 10785   513  14325  16219  8867

                      4 Thread  Comparisons
    Pi 4B/3B+ 64 Bits                     Pi 4B 64 bits/32 bits

1T  3.99  3.88  1.32   2.16   2.19  2.22  1.33  1.33  0.87   1.49   1.53  1.45
4T  2.83  2.16  1.30   2.04   2.07  1.55  0.59  1.02  1.02   1.40   1.46  1.03
MP-MFLOPS Disassembly

On the Pi 4B, with single precision floating point and SIMD, four word registers were used (see 4s below). With this, four results of calculations might be expected per clock cycle, or 6 GFLOPS per core at 1.5 GHz and up to 24 GFLOPS using all four cores. Then instructions such as fused multiply and add could double the speed, to 12 GFLOPS per core. For the mix of instructions below, expectations might be 70% of this, or 8.4 GFLOPS. Using double precision, with two words in the 128 bit registers, expectations might be half that, at 4.2 GFLOPS per core, with this code.
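The arithmetic behind these expectations can be sketched in C as below, assuming the Pi 4B's 1.5 GHz clock and one SIMD instruction issued per clock cycle per core. The function name and parameters are hypothetical, and the 70% factor is simply the figure quoted above, passed in as a parameter.

```c
/* Hypothetical helper: expected GFLOPS =
   clock (GHz) x words per register x operations per word per instruction
   x cores x efficiency of the instruction mix (1.0 = ideal). */
double expected_gflops(double clock_ghz, int words_per_reg,
                       int ops_per_word, int cores, double efficiency)
{
    return clock_ghz * words_per_reg * ops_per_word * cores * efficiency;
}
```

With these assumptions, expected_gflops(1.5, 4, 1, 1, 1.0) gives 6 and expected_gflops(1.5, 4, 1, 4, 1.0) gives 24. With fused operations and the 70% mix, expected_gflops(1.5, 4, 2, 1, 0.7) gives about 8.4 for single precision and expected_gflops(1.5, 2, 2, 1, 0.7) about 4.2 for double precision.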

Code: Select all

SP NEON 24.1 GFLOPS 6.55 1 core          DP 12.7 GFLOPS - 3.39 1 core 

.L41:                                   .L84:
ldr     q1, [x1]                        ldr     q16, [x2, x0]
ldr     q0, [sp, 64]                    add     w3, w3, 1
fadd    v18.4s, v20.4s, v1.4s           cmp     w3, w6
fadd    v17.4s, v22.4s, v1.4s           fadd    v15.2d, v16.2d, v14.2d
fadd    v0.4s, v0.4s, v1.4s             fadd    v17.2d, v16.2d, v12.2d
fadd    v16.4s, v24.4s, v1.4s           fmul    v15.2d, v15.2d, v13.2d
fadd    v7.4s, v26.4s, v1.4s            fmls    v15.2d, v17.2d, v11.2d
fadd    v6.4s, v28.4s, v1.4s            fadd    v17.2d, v16.2d, v10.2d
fadd    v5.4s, v30.4s, v1.4s            fmla    v15.2d, v17.2d, v9.2d
fmul    v0.4s, v0.4s, v19.4s            fadd    v17.2d, v16.2d, v8.2d
fadd    v4.4s, v10.4s, v1.4s            fmls    v15.2d, v17.2d, v31.2d
fadd    v3.4s, v12.4s, v1.4s            fadd    v17.2d, v16.2d, v30.2d
fadd    v2.4s, v14.4s, v1.4s            fmla    v15.2d, v17.2d, v29.2d
fadd    v1.4s, v8.4s, v1.4s             fadd    v17.2d, v16.2d, v28.2d
fmls    v0.4s, v21.4s, v18.4s           fmls    v15.2d, v17.2d, v0.2d
fmla    v0.4s, v23.4s, v17.4s           fadd    v17.2d, v16.2d, v27.2d
fmls    v0.4s, v25.4s, v16.4s           fmla    v15.2d, v17.2d, v26.2d
fmla    v0.4s, v27.4s, v7.4s            fadd    v17.2d, v16.2d, v25.2d
fmls    v0.4s, v29.4s, v6.4s            fmls    v15.2d, v17.2d, v24.2d
fmla    v0.4s, v31.4s, v5.4s            fadd    v17.2d, v16.2d, v23.2d
fmls    v0.4s, v9.4s, v1.4s             fmla    v15.2d, v17.2d, v22.2d
fmla    v0.4s, v4.4s, v11.4s            fadd    v17.2d, v16.2d, v21.2d
fmls    v0.4s, v3.4s, v13.4s            fadd    v16.2d, v16.2d, v19.2d
fmla    v0.4s, v2.4s, v15.4s            fmls    v15.2d, v17.2d, v20.2d
str     q0, [x1], 16                    fmla    v15.2d, v16.2d, v18.2d
cmp     x1, x0                          str     q15, [x2, x0]
bne     .L41                            add     x0, x0, 16
                                        bcc     .L84


                     32 bit    64 bit    32 bit     64 bit   32 bit    64 bit
                         SP        SP        DP        DP   NEON SP   NEON SP

Maximum GFLOPS         10.7      21.8      10.3      12.7      17.0      24.1

Instructions
Total                    27        39        26        27        67        27
Floating point           22        32        22        32        32        22

FP operations
Total                    32       128        32        64       128       128
Add or subtract          11        44        11        22        21        44
Multiply                  1         4         1         2        11         4
Fused                    20        80        20        40         0        80

Add example           fadds      fadd     faddd      fadd  vadd.f32      fadd
                        s16,   v15.4s,      d25,   v15.2d,       q9,    v1.4s,
                        s23,   v16.4s,      d17,   v16.2d,       q8,    v8.4s,
                        s2     v15.4s       d15    v14.2d        q14    v1.4s

Multiply example     fnmuls      fmul     fmuld      fmul  vmul.f32      fmul
                        s16,   v15.4s,      d16,   v15.2d,       q9,    v0.4s,
                         s3,   v15.4s,      d16,   v15.2d,       q9,    v0.4s,
                        s16    v17.4s       d5     v13.2d       q12    v19.4s

Fused example      vfma.f32      fmla  vfma.f64      fmla       N/A      fmla
                        s16,   v15.4s,      d16,   v15.2d,              v0.4s,
                        s29,   v17.4s,      d22,   v17.2d,              v4.4s,
                         s9     v0.4s       d28    v22.2d              v11.4s

FP registers used        32         4        32        25        16        32
MP-MFLOPS Sumchecks

Different instructions, such as SP versus DP, may not produce identical numeric results. Variations also depend on the number of passes; here the results moved closer to 1.0 as data size increased. The only anomaly is the value marked -X below.

Code: Select all

                  2 Ops/Word              32 Ops/Word
  KB         12.8     128   12800    12.8     128   12800

SP
4B/64  1T   76406   97075   99969   66015   95363   99951
3B/64  1T   76406   97075   99969   66015   95363   99951
4B/32  1T   76406   97075   99969   66015   95363   99951

DP
4B/64  1T   76384   97072   99969   66065   95370   99951
3B/64  1T   76384   97072   99969   66065   95370   99951
4B/32  1T   76384   97072   99969   66065   95370   99951

NEON SP
4B/64  1T   76406   97075   99969   66015   95363   99951
3B/64  1T   76406   97075   99969   66015   95363   99951
4B/32  1T   76406   97075   99969   66014-X 95363   99951
OpenMP-MFLOPS Benchmark

This benchmark carries out the same calculations as the MP-MFLOPS Benchmarks but, in addition, includes calculations with eight operations per data word. There is also a notOpenMP-MFLOPS single core version, compiled from the same code and carrying out identical numbers of floating point calculations, but without an OpenMP compile directive.

Following is an example of full output. The strange test names were carried forward from a 2014 CUDA benchmark, via Windows and Linux Intel CPU versions. Details are in the following GigaFLOPS Benchmarks report, covering MP-MFLOPS, QPAR and OpenMP. This showed nearly 100 GFLOPS from a Core i7 CPU and 400 GFLOPS from a GeForce GTX 650 graphics card, via CUDA.

https://www.webarchive.org.uk/wayback/a ... hmarks.htm

The detail is followed by MFLOPS results on Pi 3B+ and Pi 4B. The direct conversion of the code from large systems led to excessive memory demands for Raspberry Pi systems, with too many tests dependent on RAM speed, and low MP performance gains. There were glimpses of the usual performance gains and a maximum of over 20 SP GFLOPS on a 64 bit Pi 4B.
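As a cross-check on how the MFLOPS column relates to the other columns in the output below, the following C sketch assumes MFLOPS = words x operations per word x repeat passes / seconds / 1,000,000. That relation is a reading of the column headings rather than something taken from the source code, and the function name is hypothetical.

```c
/* Hypothetical helper: MFLOPS from the columns of the detailed output. */
double mflops(double words, double ops_per_word,
              double passes, double seconds)
{
    return words * ops_per_word * passes / seconds / 1.0e6;
}
```

For the first 32 operations per word line, mflops(100000, 32, 2500, 0.391602) comes to about 20429, in line with the reported figure.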

Code: Select all

            OpenMP MFLOPS64 Thu Aug 22 19:54:59 2019

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     100000     2     2500   0.092836     5386    0.929538   Yes
 Data in & out    1000000     2      250   0.887743      563    0.992550   Yes
 Data in & out   10000000     2       25   0.917173      545    0.999250   Yes

 Data in & out     100000     8     2500   0.129858    15401    0.957117   Yes
 Data in & out    1000000     8      250   0.899561     2223    0.995518   Yes
 Data in & out   10000000     8       25   0.847036     2361    0.999549   Yes

 Data in & out     100000    32     2500   0.391602    20429    0.890215   Yes
 Data in & out    1000000    32      250   0.989877     8082    0.988088   Yes
 Data in & out   10000000    32       25   0.944493     8470    0.998796   Yes

                End of test Thu Aug 22 19:55:05 2019*


         --------------- MFLOPS -------------- -------- Compare --------
Mbytes/  Pi 3B+       Pi 4B        Pi 4B                    Pi 4B
Threads    64b          64b          32b        4b/3b       64/32b
           All   1CP    All    1CP   All   1CP   All   1CP    All    1CP
 
  0.4/2   2674   755   5386   2780  4716  2850  2.01  3.68   1.14   0.98
    4/2    411   404    563    557   556   429  1.37  1.38   1.01   1.30
   40/2    419   408    545    588   544   632  1.30  1.44   1.00   0.93

  0.4/8   7029  1886  15401   5555  7981  5191  2.19  2.95   1.93   1.07
    4/8   1656  1495   2223   2116  2389  2082  1.34  1.42   0.93   1.02
   40/8   1725  1507   2361   2310  2199  2003  1.37  1.53   1.07   1.15

 0.4/32   6648  1699  20429   5647  8147  5449  3.07  3.32   2.51   1.04
   4/32   5977  1616   8082   5445  7951  5385  1.35  3.37   1.02   1.01
  40/32   6027  1616   8470   5479  8030  5379  1.41  3.39   1.05   1.02
Next More Multi threading Benchmarks

RoyLongbottom
Posts: 299
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Wed Sep 18, 2019 2:42 pm

More 64 Bit Raspberry Pi 4B Multithreading Benchmarks

OpenMP-MemSpeed

This is the same program as the single core MemSpeed benchmark, but with increased memory sizes and compiled using OpenMP directives. The same program was also compiled without these directives (NotOpenMP-MemSpeed2), with example single core results shown after the detailed measurements. Although the source code appears to be suitable for speed up by parallelisation, many of the test functions are slower using OpenMP, and the effects on the Pi 3B+ and Pi 4B are not the same. Detailed comparisons of these results are rather meaningless.
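A minimal sketch of how one of the test loops might be parallelised is shown below, using the x[m]=x[m]+s*y[m] calculation from the first column of the results. The function name and signature are hypothetical, not the benchmark's own. With gcc, the OpenMP directive only takes effect when compiling with -fopenmp; otherwise the same source runs on a single core.

```c
/* Hypothetical kernel: x[m] = x[m] + s * y[m] over n double words.
   The directive is guarded so that the file also builds cleanly
   without OpenMP support. */
void add_scaled(double *x, const double *y, double s, long n)
{
    long m;
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (m = 0; m < n; m++)
        x[m] = x[m] + s * y[m];
}
```

For a loop this simple, the cost of starting and synchronising threads can outweigh the work done per pass, which may be one reason for OpenMP results slower than the single core ones.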

Code: Select all

                         Gentoo Pi 3B+ 64 Bits

      Memory Reading Speed Test OpenMP 64 Bit by Roy Longbottom

               Start of test Fri Sep  6 12:44:14 2019

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4    6302   3584   1853   9425   5287   2000  12537   4762   2141
       8    5122   3699   1897   8911   5620   2012  13017   5610   2111
      16    7283   3717   1873  10812   5659   1974  13006   6428   2080
      32    6953   3675   1762  10058   5515   1974  11998   6101   1997
      64    6967   3683   1836  10052   5531   1966  12142   6169   2054
     128    7021   3694   1848  10049   5544   2035   9932   6269   2091
     256    7048   3680   1908  10196   5593   1976   8831   6323   2067
     512    4986   1606   1722   6324   4189   1304   4509   4135   1647
    1024    2284   2692   1397   3932   2385   1321   1277   1268    942
    2048     981   2650   1398   1749   3360   1471    758    838   1043
    4096     874   1578   1355   3757   3398    909    760    852    756
    8192    1038   2585   1092   3805   1646   1243    857    751    870
   16384     917   2359   1734   1184   3151   1179    880    776    814
   32768    2983   1229   1916   1519   2880   1293    808    847   1373
   65536    3214   1259   1298   3018   1319   1286    894    857   1147
  131072     839    673    779    918    883    765   1263   1286    512

 Not OMP                                                                
       8    4694   2913   4841   6213   3944   4844   5402   4337   4337
     256    3791   2572   3921   4428   3223   3922   4941   4065   4070
   65536    1064   1070   1106   1075   1086   1028    763    849    847


                         Gentoo Pi 4B 64 Bits

     Memory Reading Speed Test OpenMP 64 Bit by Roy Longbottom

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4    7854   9082   3171   7660   9140   2232  30162  15534   2692
       8    8238   8906   3150   8308   9253   2339  29033  15749   2673
      16    8217   8964   3136   8408   9044   2350  31531  15867   2472
      32    8598   8192   3085   8547   8094   2387  17252  14505   2377
      64    9084   8654   3130   8902   8606   2410  18959  14678   2393
     128   11338  11686   3091  11811  11261   2858  14852  15240   2361
     256   16320  17582   3236  17404  16671   2652  13683  14741   2411
     512   17581  18033   3089  16204  18086   2758  12921  10441   2331
    1024   14527  13629   2891  15196  13782   2682   4323   6169   2272
    2048    5018   7240   3120   7328   7241   2512   3370   3428   2215
    4096    4054   7200   3135   7330   5612   2916   2775   2703   2196
    8192    2130   2261   3867   7731   7527   3823   2701   2615   2184
   16384    3795   4552   3364   2106   7417   3397   1793   2709   2100
   32768    2065   6760   3327   7215   7144   3797   2108   2376   2242
   65536    2462   2245   2390   7160   3945   2742   2746   2386   2259
  131072    3276   3526   2324   8110   1927   2882   2584   2719   1965

 Not OMP                                                                
       8   15527  13976  15533  15504  14021  15537  11563   9311   7794
     256   12236  11434  12096  12084  11740  12156   7883   8044   7818
   65536    2047   2046   2037   2034   2054   2071   2567   2554   2547




                        Raspbian Pi 4B 32 Bits

     Memory Reading Speed Test OpenMP Version 2 by Roy Longbottom

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4    7650   6427   1247   7389   6401   1587  39558  19538    883
       8    7777   6511   1263   7655   6534   1586  39076  19920    890
      16    8180   7840   1275   8412   7870   1566  38490  20039    846
      32    9355   9127   1295   9685   9266   1612  36718  19878    862
      64    8949   8763   1223   8971   8727   1566  14949  14827    844
     128   12241  11610   1247  12328  12303   1742  13945  15134    876
     256   17543  14765   1300  18010  17894   1748  12710  13167    839
     512   18252  15466   1265  18030  16934   1651  12814  12407    874
    1024    9044  12367   1432  12278  12201   1641   6907   9438    846
    2048    6975   6620   1521   7031   6999   1676   3073   3365    797
    4096    3539   7303   1440   7267   7247   1730   2348   3165    831
    8192    7547   7759   1369   7608   7659   1762   2622   3133    904
   16384    3877   7559   1329   7987   5744   1506   2514   3136    850
   32768    7391   3974   1317   7290   6655   1763   2586   3102    921
   65536    8209   7779   1341   7856   7290   1805   2445   2834    851
  131072    5086   7344   1280   3475   5222   1688   2358   2968    830

 Not OMP                                                                
       8    8603  11757  13383   8607  11754  13384   7827   7796   7796
     256    8312   9879   9991   8355   9988   9993   7530   7803   7805
   65536    2098   2073   2081   2087   2077   2068   2590    961    965
 
Stress Testing Programs Benchmarking Mode

My latest stress testing programs have parameters that specify running time, data size, number of threads, log file number and, in two cases, processing density. When run without parameters, the full range of options is used, providing a useful benchmark. Log file results from Pi 4B tests, and comparisons, are provided below. The programs are available in:

https://www.researchgate.net/profile/Ro ... UpdatesLog

Integer Stress Test-Benchmark

The integer program test loop comprises 32 add or subtract instructions, operating on hexadecimal data patterns, with sequences of 8 subtracts then 8 adds to restore the original pattern. Disassembly shows that the test loop, in fact, used 68 instructions, most of the additional ones being register loads. The result is 68/32 instructions per 4 byte word. At the maximum of 1943M words per second, using a single core, the resultant execution speed was 4129 MIPS, with nearly four times more using all cores.
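The MIPS figure can be reproduced with a little arithmetic, sketched in C below. The function names are hypothetical; the inputs are the 4 byte word size and the 68 instructions per 32 words quoted above.

```c
/* Millions of 4 byte words per second from MB/second. */
double mwords_per_second(double mb_per_second)
{
    return mb_per_second / 4.0;
}

/* MIPS from millions of words per second and the instruction count
   of a loop that handles words_per_loop words. */
double mips(double mwords, double instructions_per_loop,
            double words_per_loop)
{
    return mwords * instructions_per_loop / words_per_loop;
}
```

Using the single thread 16 KB result below, mwords_per_second(7771) is about 1943, and mips(1943, 68, 32) is about 4129.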

The tables below show speeds on the systems considered, along with average performance gains of the Pi 4B at 64 bits, which were somewhat limited in this case.

Code: Select all

                 Gentoo Pi 4B 64 Bits

  MP-Integer-Test 64 Bit v1.0 Fri Sep  6 16:33:36 2019

      Benchmark 1, 2, 4, 8, 16 and 32 Threads

                   MB/second
                KB    KB    MB            Same All
   Secs Thrds   16   160    16  Sumcheck   Tests

   4.3    1   7771  7352  3895  00000000    Yes
   3.3    2  15467 14218  3714  FFFFFFFF    Yes
   3.0    4  28715 26652  3345  5A5A5A5A    Yes
   3.0    8  30292 26310  3334  AAAAAAAA    Yes
   3.0   16  29466 28503  3337  CCCCCCCC    Yes
   3.0   32  29351 30358  3390  0F0F0F0F    Yes


              Pi 4B 32 bit MB/sec        Pi 3B+ 64 bit MB/sec

               KB      KB      MB         KB      KB      MB
               16     160      16         16     160      16
   Threads
        1    5964    5756    3931       4823    3884    1209
        2   11787   11430    3748       9613    7709    1908
        4   23214   22060    3456      17737   15137    1779
        8   22197   22171    3472      17651   18692    1767
       16   22671   23299    3256      18255   18793    1757
       32   21379   21881    3346      18246   18674    1748

             Pi 4B 64b/32b              64b Pi 4B/3B+
Average
Gain         1.31    1.25    0.99       1.63    1.67    2.13
Single Precision Floating Point Stress Test-Benchmark

This and the double precision program carry out the same calculations as MP-MFLOPS, but are slightly faster by including a loop that repeats the tests within the calculate functions. Maximum speeds were 6.75 GFLOPS, using one core, and 26.7 GFLOPS with four cores.

These programs were compiled using a later compiler than that used for MP-MFLOPS, at least resulting in similar speeds between 32 bit and 64 bit versions. Typical Pi 4B/3B+ performance improvements were indicated.

Code: Select all

  MP-Threaded-MFLOPS 64 Bit v1.0 Fri Sep  6 16:30:12 2019

             Benchmark 1, 2, 4 and 8 Threads

                        MFLOPS          Numeric Results
             Ops/   KB    KB    MB      KB     KB     MB
  Secs  Thrd Word 12.8   128  12.8    12.8    128   12.8

   1.7    T1   2  2819  2874   504   40392  76406  99700
   3.2    T2   2  5592  5702   511   40392  76406  99700
   4.6    T4   2  9223  7520   519   40392  76406  99700
   6.0    T8   2  9520 10471   545   40392  76406  99700
   8.2    T1   8  5381  5595  2050   54764  85092  99820
   9.8    T2   8 11039 10883  2173   54764  85092  99820
  11.3    T4   8 19087 21040  2044   54764  85092  99820
  12.9    T8   8 19747 21107  2016   54764  85092  99820
  17.5    T1  32  6693  6753  6377   35206  66015  99520
  20.2    T2  32 13491 13464  8710   35206  66015  99520
  22.2    T4  32 25732 26704  9160   35206  66015  99520
  24.1    T8  32 25708 25770  8927   35206  66015  99520

            End of test Fri Sep  6 16:30:37 2019


              Pi 4B 32 bit               Pi 3B+ 64 bit
Threads       KB      KB      MB         KB      KB      MB
Ops/wd       12.8     128    12.8       12.8     128    12.8

T1   2       2641    2607     646        838     826     373
T2   2       5089    5116     618       1659    1650     380
T4   2       8282    8522     683       2584    3296     384
T8   2       8756    9847     686       3013    3056     391
T1   8       5543    5428    2597       1981    1972    1354
T2   8      10754   10603    2711       3936    3923    1518
T4   8      18716   20823    2844       7482    7396    1531
T8   8      19859   21684    2555       7399    7705    1534
T1  32       5309    5274    5265       2820    2809    2462
T2  32      10557   10509    9991       5636    5583    4754
T4  32      20416   20919   11340      10640   10882    6020
T8  32      20072   19787    9330      10641   10926    6159

              Average Pi 4B Performance Gains

  Ops/Word      Pi 4B 64b/32b              64b Pi 4B/3B+

        2    1.09    1.04    0.79       3.37    3.16    1.36
        8    1.00    1.01    0.77       2.69    2.80    1.40
       32    1.27    1.29    0.96       2.40    2.41    1.85
Double Precision Floating Point Stress Test-Benchmark

Maximum measured DP speeds were 3.39 GFLOPS, using one core, and 13.2 GFLOPS with four cores. Some of the 64/32 bit and 4B/3B+ performance ratios were similar to those from MP-MFLOPS.

Code: Select all

  MP-Threaded-MFLOPS 64 Bit v1.0 Fri Sep  6 16:31:24 2019

    Double Precision Benchmark 1, 2, 4 and 8 Threads

                        MFLOPS          Numeric Results
             Ops/   KB    KB    MB      KB     KB     MB
  Secs  Thrd Word 12.8   128  12.8    12.8    128   12.8

   3.2    T1   2  1398  1462   285   40395  76384  99700
   6.2    T2   2  2799  2807   256   40395  76384  99700
   8.9    T4   2  5024  4589   257   40395  76384  99700
  11.5    T8   2  5089  5545   280   40395  76384  99700
  15.7    T1   8  2668  2790  1103   54805  85108  99820
  18.8    T2   8  5670  5545  1158   54805  85108  99820
  21.7    T4   8 10259 10011  1068   54805  85108  99820
  24.7    T8   8 10239 10824  1036   54805  85108  99820
  34.1    T1  32  3317  3390  3195   35159  66065  99521
  39.2    T2  32  6791  6754  4753   35159  66065  99521
  43.1    T4  32 12940 13200  4497   35159  66065  99521
  46.9    T8  32 13200 13049  4557   35159  66065  99521

            End of test Fri Sep  6 16:32:11 2019

              Pi 4B 32 bit               Pi 3B+ 64 bit
Threads       KB      KB      MB         KB      KB      MB
Ops/wd       12.8     128    12.8       12.8     128    12.8

T1   2        993     998     329        412     411     193
T2   2       1971    1995     309        828     824     194
T4   2       3633    3937     340       1543    1514     197
T8   2       3635    3796     339       1525    1551     196
T1   8       2378    2445    1288        980     978     696
T2   8       4770    4860    1282       1975    1964     782
T4   8       9281    9556    1210       3688    3688     781
T8   8       9119    9448    1245       3726    3689     787
T1  32       2697    2726    2708       1402    1403    1231
T2  32       5397    5446    5163       2808    2808    2399
T4  32      10689   10806    5146       5379    5413    3195
T8  32      10716   10494    4497       5450    5485    3150

              Average Pi 4B Performance Gains

  Ops/Word   Pi 4B 64b/32b              64b Pi 4B/3B+

        2    1.40    1.37    0.82       3.34    3.39    1.38
        8    1.13    1.12    0.87       2.78    2.83    1.44
       32    1.23    1.24    1.00       2.40    2.41    1.86
High Performance Linpack Benchmark

Earlier, the High Performance Linpack Benchmark was run on Raspberry Pi 3 models and, later, on the Raspberry Pi 4 system, both via the 32 bit Raspbian Operating System. Details and results can be found in the following reports.

https://www.researchgate.net/publicatio ... rror_Tests

https://www.researchgate.net/publicatio ... ce_Linpack

Initially, two versions of HPL tests were run, one accessing precompiled Basic Linear Algebra Subprograms and the other with ATLAS alternatives, that had to be built. The whole benchmark suite was produced according to instructions in the following.

https://computenodes.net/2018/06/28/bui ... pberry-pi/

The ATLAS version was installed, as the older benchmark would not run on the Pi 4. One issue is the time required for the build, apparently due to the numerous tuning tests. Time taken was 14 hours using a Pi 3B+, then 8 hours on a Pi 4. Later, 64 bit ATLAS was built on the Pi 3B+, via Gentoo, taking 26 hours, that included extended periods swapping data with the rather slow main drive.

The procedure specified in the above was used, successfully leading to a working package. Only one change was required, to line 95 of Make.rpi, replacing /home/pi/atlas-build with /home/demouser/atlas-build:

Code: Select all

LAdir  = /home/demouser/atlas-build

Following the introduction of 64 bit Gentoo for the Pi 4B, ATLAS was built again, taking more than 10 hours. As indicated in the above links, the HPL benchmark can be a useful stress test, due to the long running time with heavy processing. It can lead to CPU MHz being throttled on the Pi 4B, producing slow GFLOPS speeds. The tests reported here were run using a Pi 4B with a cooling fan, with CPU MHz monitored to help to indicate that the processor was running at full speed.

Results and comparisons are provided below, followed by the main report for the best Pi 4B Gentoo result. Importantly, maximum performance depends on the amount of RAM available. As with the original single CPU Linpack benchmark, where N is the matrix problem size, the minimum memory used is N x N x 8 Bytes (double precision), that is 512 MB for N = 8000 and 3.2 GB for N = 20000. The end of the detailed output indicates a further problem, where the first run at maximum size might be slow, with extra time spent swapping data out of RAM to create space for the HPL data.
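
As a rough check on the memory sizes above and the GFLOPS figures in the table that follows, here is a minimal Python sketch. It assumes the conventional HPL operation count of 2/3 N^3 + 3/2 N^2 floating point operations; the function names are illustrative only and are not part of the benchmark.

```python
def hpl_matrix_bytes(n: int, bytes_per_word: int = 8) -> int:
    """Minimum memory for the N x N double precision matrix, in bytes."""
    return n * n * bytes_per_word


def hpl_gflops(n: int, seconds: float) -> float:
    """GFLOPS for problem size n, assuming the conventional HPL
    operation count of 2/3 N^3 + 3/2 N^2 floating point operations."""
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1e9


# Problem sizes and Pi 4B 64 bit run times taken from the results table.
for n, secs in [(8000, 38.22), (20000, 513.67)]:
    megabytes = hpl_matrix_bytes(n) / 1e6
    print(f"N={n}: {megabytes:.0f} MB, {hpl_gflops(n, secs):.2f} GFLOPS")
```

With the quoted run times, this gives values in line with the 8.93 and 10.38 GFLOPS reported for the 64 bit Pi 4B.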

Next, the benchmark produces a sumcheck but, with the ATLAS implementation, the values are not consistent for the same problem size, although all those shown here were reported as PASSED (within specified tolerances). Such differences could be produced by different CPU models or alternative compilations, but the least understandable case is identified at the end of the detailed output, where the sumcheck varies on repeating the program on the same system.

Comparing Pi 4B 32 bit and 64 bit GFLOPS maximum speeds, the 32 bit version appears to be slightly faster (or the same within reasonable tolerances). It is not clear (to me) whether the compiled code takes full advantage of the 64 bit technology, or whether additional compile options should be specified for the different packages involved.

Code: Select all

        ------ Time ------   ----- GFLOPS -----  ----------- Sumcheck ----------

          4B     4B    3B+     4B     4B    3B+         4B         4B        3B+
    N    64b    32b    64b    64b    32b    64b        64b        32b        64b

 4000   5.51   5.20  14.53   7.75   8.20   2.94  0.0022808  0.0023975  0.0025857
 8000  38.22  36.70 101.59   8.93   9.30   3.36  0.0017216  0.0016746  0.0017518
16000 269.26 263.00         10.14  10.40         0.0012577  0.0011258
20000 513.67 494.30         10.38  10.80         0.0009637  0.0010188

              GFLOPS Comparisons 

                4B           64b
    N        64b/32b       4B/3B+

 4000          0.95          2.64
 8000          0.96          2.66
16000          0.98
20000          0.96


--------------------------------------------------------------------------------

The following parameter values will be used:

N      :    1000 
NB     :     128 
PMAP   : Row-major process mapping
P      :       2 
Q      :       2 
PFACT  :   Right 
NBMIN  :       4 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :  1ringM 
DEPTH  :       1 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       20000   128     2     2             513.67              1.038e+01
HPL_pdgesv() start time Fri Aug 23 10:57:30 2019

HPL_pdgesv() end time   Fri Aug 23 11:06:04 2019

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0009637 ...... PASSED
================================================================================

WR11C2R4       20000   128     2     2             516.71              1.032e+01

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0008697 ...... PASSED
================================================================================
First Run

WR11C2R4       20000   128     2     2             656.89              8.120e+00

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0009470 ...... PASSED
================================================================================ 

RoyLongbottom
Posts: 299
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

64 Bit Raspberry Pi 4B Java and I/O Benchmarks

Wed Sep 25, 2019 7:41 pm

64 Bit Raspberry Pi 4B Java and I/O Benchmarks

The benchmark results included below enable comparisons between the Pi 4B and 3B+, using the same benchmark and system software. A sample of Pi 4B 32 bit speeds is also included in some cases.

Three operational differences were observed, compared with the 32 bit Raspbian benchmarking exercise. The first was with dual monitors. When displaying a window twice the pixel width of one monitor, the 32 bit program spread it across both monitors, but no mirroring appeared to be available. The 64 bit benchmark provided mirroring for smaller windows but, with mirroring switched off, squashed the double width image into a single monitor display.

The standard version of my DriveSpeed benchmark uses Direct I/O, to avoid caching, on writing and reading large files, and works as expected when running the 32 bit version. Running the 64 bit version, this led to failures to write or read. Variations were produced to enable performance measurements.

My broadband hub has dual 2.4 and 5 GHz capabilities. Significant variations in performance can be produced in this mode, on different devices, but appear to be more significant using the 64 bit benchmark. I changed the hub settings to provide different 2.4 and 5 GHz ports but, unlike the 32 bit benchmark, the 64 bit version would not connect to the network, using the same Pi 4B system.

Note that these differences could be due to program, software and/or hub incompatibility.

OpenGL GLUT Benchmark

The benchmark measures graphics speed in terms of Frames Per Second (FPS) via six simple and more complex tests. The first four tests portray moving up and down a tunnel including various independently moving objects, with and without texturing. The last two tests represent a real application for designing kitchens. The first is in wireframe format, drawn with 23,000 straight lines, followed by one with applied textures.

Pi 4B average performance gains are included below. The best, at 2.1 times, were with textured objects and the worst, at around 1.5 times, with the slow kitchen displays.

Dual Monitors - The benchmark was also run with two 1920x1080 monitors connected. It displayed two identical displays when the mirror option was selected. Without this, the normal display, from where the program is executed, appeared on one display, and the OpenGL images on the other. This was fine when the usual display dimensions, as shown below, were specified. With no parameters, full screen image was assumed to be 3840x1080 and this was displayed horizontally squashed into 1920 pixels. FPS measurements for the latter are shown below.

On running the 32 bit version via Raspbian, the default display was 3840x1080, across both monitors, but only on one monitor, when 1920x1080 parameters or less were specified. There was no mirror option.

Code: Select all

############################# Pi 3B+ #############################

 GLUT OpenGL Benchmark 64 Bit Version 1, Fri Sep 20 11:15:47 2019

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   160   120    389.6    227.2    122.6     75.3     30.0     21.5
   320   240    328.1    201.7    113.8     73.3     30.2     21.3
   640   480    203.3    144.7     87.8     62.0     30.2     21.0
  1024   768    107.1     94.5     60.3     51.1     28.9     20.0
  1920  1080     45.3     47.5     36.9     33.1     28.7     20.0


############################## Pi 4B #############################

  GLUT OpenGL Benchmark 64 Bit Version 1, Thu Sep 12 20:48:21 2019

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   160   120    767.4    420.3    258.3    154.3     45.7     31.7
   320   240    682.9    388.8    245.0    148.3     45.1     30.8
   640   480    367.1    262.6    217.9    140.1     46.2     30.9
  1024   768    150.8    148.8    128.6    117.3     45.3     30.4
  1920  1080     71.9     73.9     64.0     61.6     43.3     27.9

  Pi 4B Gains    1.77     1.74     2.12     2.10     1.52     1.46


  Dual Monitor- mirrored displays
  1920  1080     65.0     66.3     61.6     58.2     42.7     27.5

  Dual Monitor - not mirrored squashed image on one monitor
  3840  1080     60.9     59.6     57.2     54.8     40.8     26.8

  32 Bit
  1920  1080     81.4     79.4     74.6     68.3     30.8     20.0
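
The Pi 4B Gains row above appears to be the average, over the five window sizes, of the Pi 4B FPS divided by the Pi 3B+ FPS. A minimal Python sketch of that calculation, under that assumption, follows. The FPS values are copied from the two 64 bit tables and the names are illustrative only.

```python
# FPS per window size, copied from the two 64 bit tables above.
# Columns: coloured few, coloured all, textured few, textured all,
# wireframe kitchen, textured kitchen.
PI_3B_PLUS = [
    [389.6, 227.2, 122.6, 75.3, 30.0, 21.5],
    [328.1, 201.7, 113.8, 73.3, 30.2, 21.3],
    [203.3, 144.7, 87.8, 62.0, 30.2, 21.0],
    [107.1, 94.5, 60.3, 51.1, 28.9, 20.0],
    [45.3, 47.5, 36.9, 33.1, 28.7, 20.0],
]
PI_4B = [
    [767.4, 420.3, 258.3, 154.3, 45.7, 31.7],
    [682.9, 388.8, 245.0, 148.3, 45.1, 30.8],
    [367.1, 262.6, 217.9, 140.1, 46.2, 30.9],
    [150.8, 148.8, 128.6, 117.3, 45.3, 30.4],
    [71.9, 73.9, 64.0, 61.6, 43.3, 27.9],
]


def average_gains(new, old):
    """For each column, average the new/old ratio over all rows."""
    columns = len(new[0])
    return [
        sum(n[c] / o[c] for n, o in zip(new, old)) / len(new)
        for c in range(columns)
    ]


gains = average_gains(PI_4B, PI_3B_PLUS)
print("  ".join(f"{g:.2f}" for g in gains))
```

Averaging the per-size ratios weights every window size equally, which is why the result differs from a ratio of column totals, where the small, fast windows would dominate.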

JavaDraw Benchmark

The benchmark draws from a small to a rather excessive number of simple objects to measure drawing performance in Frames Per Second (FPS). Five tests draw on a background of continuously changing colour shades, each test adding to the load.

Pi 4B performance gains, shown below, were between 2.10 and 3.42 times.

Code: Select all

############################# Pi 3B+ #############################

   Java Drawing Benchmark, Sep 20 2019, 11:08:33
            Produced by javac 1.7.0_02

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1      335    33.46
  Display PNG Bitmap Twice Pass 2      546    54.53
  Plus 2 SweepGradient Circles         502    50.08
  Plus 200 Random Small Circles        366    36.59
  Plus 320 Long Lines                  134    13.30
  Plus 4000 Random Small Circles        46     4.59

         Total Elapsed Time  60.2 seconds

  Operating System    Linux, Arch. aarch64, Version 4.19.67
  Java Vendor         IcedTea, Version  1.8.0_222


############################# Pi 4B ##############################

   Java Drawing Benchmark, Sep 12 2019, 20:18:28
            Produced by javac 1.7.0_02

  Test                              Frames      FPS  Gains

  Display PNG Bitmap Twice Pass 1     1146   114.52   3.42
  Display PNG Bitmap Twice Pass 2     1318   131.79   2.42
  Plus 2 SweepGradient Circles        1237   123.66   2.47
  Plus 200 Random Small Circles        972    97.13   2.65
  Plus 320 Long Lines                  415    41.48   3.12
  Plus 4000 Random Small Circles        97     9.65   2.10

         Total Elapsed Time  60.1 seconds

  Operating System    Linux, Arch. aarch64, Version 4.19.67
  Java Vendor         IcedTea, Version  1.8.0_222

32 bit Pi 4B speeds were between 8.25 and 104.47 FPS

Java Whetstone Benchmark

The benchmark measures performance of various floating point and integer calculations, with an overall rating in Million Whetstone Instructions Per Second (MWIPS).

Code: Select all

############################# Pi 3B+ #############################

    Whetstone Benchmark Java Version, Sep 20 2019, 11:06:12

                                                       1 Pass
  Test                  Result       MFLOPS     MOPS  millisecs

  N1 floating point  -1.124750137    310.88             0.0618
  N2 floating point  -1.131330490    289.41             0.4644
  N3 if then else     1.000000000             241.15    0.4292
  N4 fixed point     12.000000000             706.28    0.4460
  N5 sin,cos etc.     0.499110132              23.31    3.5700
  N6 floating point   0.999999821    130.04             4.1480
  N7 assignments      3.000000000              89.19    2.0720
  N8 exp,sqrt etc.    0.825148463              21.92    1.6970

  MWIPS                              775.89            12.8884

  Operating System    Linux, Arch. aarch64, Version 4.19.67
  Java Vendor         IcedTea, Version  1.8.0_222


############################# Pi 4B ##############################
 
    Whetstone Benchmark Java Version, Sep 12 2019, 20:15:35

                                                      1 Pass
  Test                  Result       MFLOPS     MOPS  millisecs Gains

  N1 floating point  -1.124750137    488.80             0.0393   1.57
  N2 floating point  -1.131330490    475.92             0.2824   1.64
  N3 if then else     1.000000000             344.31    0.3006   1.43
  N4 fixed point     12.000000000            1571.86    0.2004   2.23
  N5 sin,cos etc.     0.499110132              43.55    1.9104   1.87
  N6 floating point   0.999999821    264.15             2.0420   2.03
  N7 assignments      3.000000000             264.00    0.7000   2.96
  N8 exp,sqrt etc.    0.825148463              25.80    1.4420   1.18

  MWIPS                             1445.70             6.9171   1.86

  Operating System    Linux, Arch. aarch64, Version 4.19.67
  Java Vendor         IcedTea, Version  1.8.0_222

DriveSpeed Benchmark

This benchmark has the format shown below, measuring writing and reading speeds of large files, cached files, random access and numerous small files. Run time parameters are available to specify large file size and the file path.

Code: Select all

########################## Pi 4B USB 3 ###########################

   DriveSpeed RasPi 64 Bit 2.0 Fri Sep 13 22:25:40 2019
 
 Selected File Path: 
 /run/media/demouser/PATRIOT//
 Total MB  120832, Free MB  119778, Used MB    1054

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

 512    30.72    31.11    34.01   287.24   295.04   311.90
1024    34.66    36.11    35.45   298.87   302.38   300.26
 Cached
   8    42.03    39.58    38.85  1167.71  1029.35  1061.56

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.004    0.007    0.310     9.65    10.42     9.71

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      0.03     0.07     0.13   268.10   427.95   657.48
 ms/file   122.73   122.28   122.22     0.02     0.02     0.02    2.557

For non-cached tests, in the standard version of this benchmark, the file open call includes the O_DIRECT option, specifying Direct I/O (no caching). The latest minor variation of this appears to work, as expected, on the 32 bit Raspbian version, on both main and USB drives. The 64 bit compilation indicated a failure to write to the main SD drive and a failure to read from USB flash drives. Omitting O_DIRECT for reading appeared to correct the latter (see above). To check this and enable main drive measurements, separate large file write only and read only programs, without direct I/O, were produced, to follow write/reboot/read procedures. These were also necessary to indicate throughput when simultaneously writing or reading two USB 3 drives.
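
To illustrate the O_DIRECT behaviour described above, here is a minimal Python sketch. It is not the DriveSpeed source code. It assumes a Linux system, uses a page aligned buffer (which Direct I/O generally requires) and falls back to ordinary cached reading if the direct attempt is rejected. The names read_speed and BLOCK are illustrative only.

```python
import mmap
import os
import time

BLOCK = 1 << 20  # 1 MiB read size, a multiple of common sector sizes


def _timed_read(path, flags, buf):
    """Read the whole file sequentially, returning (bytes, seconds)."""
    fd = os.open(path, flags)
    try:
        total = 0
        start = time.perf_counter()
        while True:
            n = os.readv(fd, [buf])
            if n == 0:
                break
            total += n
        return total, time.perf_counter() - start
    finally:
        os.close(fd)


def read_speed(path):
    """Return (bytes_read, MB_per_second, used_direct) for one read pass."""
    # O_DIRECT is Linux specific; getattr keeps the sketch usable elsewhere.
    direct = getattr(os, "O_DIRECT", 0)
    used_direct = bool(direct)
    buf = mmap.mmap(-1, BLOCK)  # anonymous mappings are page aligned
    try:
        try:
            total, secs = _timed_read(path, os.O_RDONLY | direct, buf)
        except OSError:
            # Direct I/O rejected by the file system or driver:
            # repeat the pass through the page cache instead.
            used_direct = False
            total, secs = _timed_read(path, os.O_RDONLY, buf)
    finally:
        buf.close()
    return total, total / 1e6 / max(secs, 1e-9), used_direct
```

A fallback read of a recently written file will normally be served from RAM, so the cached figure says little about the drive itself. That is the reason for the write/reboot/read procedure.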


USB Flash Drives

Two FAT 32 formatted USB 3 sticks were used: P, at 128 GB, with 32 KB sectors and a reading speed rated as up to 400 MB/second, and R, an 8.8 GB partition, with 8 KB sectors and a reading speed rated as up to 190 MB/second (but it appears to do better sometimes).

Following is a summary of results, indicating USB 3 large file reading speed improvements of between 6.7 and 8.1 times, but disappointing writing performance. The slower P writing speeds might be affected by the mysteries of updating file allocation tables, which would also influence random access and dealing with lots of small files, including file delete times. USB 3 use provided little or no performance gains for the latter. Cached reading reflects RAM speed.

Code: Select all

MB/second 16 MB USB 2, 1024 MB USB 3

 System   Drive  Write1  Write2  Write3   Read1   Read2   Read3

Pi 3B+  USB 2 P     11.5    11.4    11.5    36.6    37.7    37.3
Pi 3B+  USB 2 R     15.9    16.4    13.9    37.1    40.1    39.8
Pi 4B   USB 2 P     12.6    12.6    12.6    37.0    37.3    37.2
Pi 4B   USB 2 R     22.6    22.9    22.9    36.5    36.3    36.5
Pi 4B   USB 3 P     34.7    36.1    35.5   298.9   302.4   300.3
Pi 4B   USB 3 R     48.9    44.6    53.4   249.4   248.8   246.2
Compare MB/second
Pi 4B   P USB 3/2   2.75    2.88    2.81    8.07    8.11    8.07
Pi 4B   R USB 3/2   2.17    1.94    2.33    6.83    6.85    6.74

Cached MB/second Write1  Write2  Write3   Read1   Read2   Read3
Pi 3B+  USB 2 P     13.6    14.2    14.4   633.4   544.0   464.3
Pi 3B+  USB 2 R     13.7    14.4    19.4   623.5   661.4   557.6
Pi 4B   USB 2 P     15.0    14.7    14.8  1204.0  1047.3  1066.3
Pi 4B   USB 2 R     20.8    21.2    13.9   930.2   933.6  1230.3
Pi 4B   USB 3 P     42.0    39.6    38.9  1167.7  1029.4  1061.6
Pi 4B   USB 3 R     21.1    15.9    36.2  1103.6   944.9   981.0
Compare
Pi 4B   P USB 3/2   2.80    2.70    2.63    0.97    0.98    1.00
Pi 4B   R USB 3/2   1.01    0.75    2.60    1.19    1.01    0.80

Random milliseconds
                   Read                   Write
Pi 3B+  USB 2 P    0.013   0.013   0.254   11.76   10.18    9.80
Pi 3B+  USB 2 R    0.017   0.008   0.032    1.09    1.39   11.72
Pi 4B   USB 2 P    0.006   0.007   0.215    9.56    8.54    8.75
Pi 4B   USB 2 R    0.009   0.005   0.016    1.35    2.12    1.34
Pi 4B   USB 3 P    0.004   0.007   0.310    9.65   10.42    9.71
Pi 4B   USB 3 R    0.004   0.004   0.008    1.75    0.85    0.92
Compare
Pi 4B   P USB 3/2   1.50    1.00    0.69    0.99    0.82    0.90
Pi 4B   R USB 3/2   2.25    1.25    2.00    0.77    2.49    1.46

200 Small Files  milliseconds
                  Write                    Read                  Delete
Pi 3B+  USB 2 P    134.2   128.6   129.6    0.08    0.12    0.07    3.36
Pi 3B+  USB 2 R    105.5   104.7   107.6    0.05    0.05    0.07    0.26
Pi 4B   USB 2 P    125.8   125.5   125.8    0.02    0.02    0.02    3.12
Pi 4B   USB 2 R    104.1   104.0   104.0    0.02    0.02    0.03    0.14
Pi 4B   USB 3 P    122.7   122.3   122.2    0.02    0.02    0.02    2.56
Pi 4B   USB 3 R    105.4   104.0   104.3    0.02    0.02    0.03    0.15
Compare
Pi 4B   P USB 3/2   1.03    1.03    1.03    1.00    1.00    1.00    1.22
Pi 4B   R USB 3/2   0.99    1.00    1.00    1.00    1.00    1.00    0.95

Drive Write/Reboot/Read Tests

The write test also reads the data for verification, but this will normally be cached in RAM, with high data transfer speeds. VMSTAT results are provided, covering reading speeds.


Main SD Drive

This is rated at up to 98 MB/second reading speed but only achieves near 46 MB/second. VMSTAT results confirm the data transfer speed, with three files eventually occupying around 3 GB of the cache, low CPU utilisation of 2% and 23% waiting for I/O (percentages of all four cores, so multiply by 4 for one core equivalents).

Code: Select all

 Current Directory Path: /home/demouser/RPi3-64-Bit-Benchmarks/IOtests/writeread
 Total MB   28225, Free MB   18761, Used MB    9464
 
                1024 MB   MBytes/Second
       Write1   Write2   Write3    Read1    Read2    Read3

Write   18.99    19.34    19.47  1337.09  1164.91  1325.96
Read     N/A      N/A      N/A     45.80    45.88    45.89


procs  -----------memory---------- ---swap-- -----io---- -system-- ------cpu----
r  b  swpd   free   buff   cache   si   so    bi    bo   in   cs us sy id wa st

0  1     0 673848  60668 2792716    0    0 45056     0  767 1181  0  2 75 23  0
0  1     0 630228  60668 2835544    0    0 44544     0  789 1199  0  2 74 23  0
0  1     0 585204  60668 2880268    0    0 45056     0  691 1041  0  3 75 23  0

USB 3 Drive P

Read only speed was similar to that from the earlier detailed test. Note the high average CPU utilisation of 17%, equivalent to 68% of one core.

Code: Select all

 Selected File Path: 
 /run/media/demouser/PATRIOT/
 Total MB  120832, Free MB  119752, Used MB    1080

                 1024 MB   MBytes/Second
       Write1   Write2   Write3    Read1    Read2    Read3

Write   58.45    23.10    22.91  1368.04  1190.71  1354.84
Read     N/A      N/A      N/A    306.18   294.93   302.91

procs -----------memory--------- ---swap-- -----io----  -system-- ------cpu----
r  b  swpd   free   buff   cache   si   so    bi    bo   in    cs us sy id wa st

1  0   256 811672  20920 2696504    0    0 305664     0 3898 6182  1 15 73 11  0
0  1   256 510852  20920 2996188    0    0 303616     0 4304 5936  1 16 72 12  0
1  0   256 239400  20920 3267636    0    0 307184     0 4512 6177  1 17 71 11  0
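
The figures quoted for this test can be derived from the VMSTAT rows above, as in the following minimal Python sketch. It assumes that the bi column counts 1024 byte blocks per second, that samples were taken at one second intervals, and that CPU utilisation is user plus system time. The names are illustrative only.

```python
# The three VMSTAT samples shown above for USB 3 drive P.
VMSTAT_ROWS = """\
1  0   256 811672  20920 2696504    0    0 305664     0 3898 6182  1 15 73 11  0
0  1   256 510852  20920 2996188    0    0 303616     0 4304 5936  1 16 72 12  0
1  0   256 239400  20920 3267636    0    0 307184     0 4512 6177  1 17 71 11  0
"""
CORES = 4           # Pi 4B
BLOCK_BYTES = 1024  # assumption: bi is in 1024 byte blocks per second


def summarise(rows):
    """Return (average read MB/s, average CPU %, one core equivalent %)."""
    read_mb, busy = [], []
    for line in rows.splitlines():
        fields = [int(x) for x in line.split()]
        bi, us, sy = fields[8], fields[12], fields[13]
        read_mb.append(bi * BLOCK_BYTES / 1e6)
        busy.append(us + sy)
    count = len(busy)
    avg_busy = sum(busy) / count
    return sum(read_mb) / count, avg_busy, avg_busy * CORES


mb_per_sec, cpu_pct, one_core_pct = summarise(VMSTAT_ROWS)
print(f"{mb_per_sec:.1f} MB/s, {cpu_pct:.0f}% CPU, {one_core_pct:.0f}% of one core")
```

The CPU figures reproduce the 17% and 68% quoted, and the data rate is broadly in line with the benchmark's own read speeds, allowing for the assumed block size.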

USB 3 Drive R

This time, the data transfer speed was slower than in the earlier example, at around 174 MB/second compared with around 248 MB/second.

Code: Select all

 Selected File Path: 
 /run/media/demouser/REMIX_OS/
 Total MB    9017, Free MB    7485, Used MB    1532

                 1024 MB   MBytes/Second                  
       Write1   Write2   Write3    Read1    Read2    Read3

Write   46.43    28.81    36.57  1265.07  1103.23  1236.02
Read     N/A      N/A      N/A    172.71   172.14   176.49

procs -----------memory--------- ---swap-- -----io----  -system-- ------cpu----
r  b  swpd   free   buff   cache   si   so    bi    bo   in    cs us sy id wa st

0  1   256 111512    912 3417624    0    0 175189     0 4315 5929  1 12 71 17  0
0  1   256 169756    992 3358840    0    0 169043     0 4064 5515  1 11 71 17  0
0  1   256 177444   1068 3351176    0    0 155724     0 4088 6023  1 12 70 16  0

USB 3 Drives R and P Together

File sizes were reduced to 512 MB for these tests, in order to ensure that there would be sufficient RAM to contain six copies, as indicated in VMSTAT cache occupancy. This makes it more tricky to measure total throughput, but the following appears to provide a best case example, with a maximum of up to 386 MB/second, with CPU utilisation near 100% of one core.

A bad example follows later, where one drive appears to be running at USB 2 speed.

Code: Select all

Write/Read Thu Sep 19 16:07:48 2019  /run/media/demouser/REMIX_OS/
Write/Read Thu Sep 19 16:07:46 2019  /run/media/demouser/PATRIOT/

                   512 MB MBytes/Second
       Write1   Write2   Write3    Read1    Read2    Read3

  R     28.72    33.89    44.69  1302.19  1131.65  1374.24
  P     11.93     8.86     6.21  1232.47  1072.38  1213.36

Sep 23 17:11:21 2019 /run/media/demouser/PATRIOT/
Sep 23 17:11:20 2019 /run/media/demouser/REMIX_OS/

                  512 MB MBytes/Second
       Write1   Write2   Write3    Read1    Read2    Read3   Seconds

 P       N/A      N/A      N/A    159.78   187.44   294.23   7.7 
 R       N/A      N/A      N/A    221.83   232.10   230.94   6.7+2 delayed start

procs -----------memory--------- ---swap-- -----io----  -system-- ------cpu----
r  b  swpd   free   buff   cache   si   so    bi    bo   in    cs us sy id wa st

0  0     0 3160720  74616  296092   0    0     0     0  2031 3601  4  2 94  0  0
0  1     0 3112052  74616  342188   0    0  45552    0  1512 2257  1  3 93  4  0
0  1     0 2908004  74616  547600   0    0 206336    0  4684 7169  4 14 67 15  0
2  0     0 2531960  74616  919400   0    0 369136    0  5495 8033  4 24 47 25  0
2  0     0 2149064  74616 1303288   0    0 382960    0  5168 7007  1 21 52 26  0
1  1     0 1771492  74616 1681348   0    0 385024    0  5969 8255  1 23 49 26  0
1  1     0 1383524  74616 2068788   0    0 386016    0  5621 7926  1 21 49 29  0
0  2     0  999100  74616 2453280   0    0 383488    0  4602 6895  1 19 54 26  0
0  1     0  628988  74616 2824188   0    0 368640    0  5405 8153  2 20 56 22  0
1  0     0  310748  74624 3142732   0    0 317424   20  4622 6551  1 17 72 10  0
1  0     0  223052  73680 3231812   0    0 268288    0  2815 5012  1 18 72 10  0
0  0     0  223824  73680 3231280   0    0  32768    0  1044 2009  1  3 95  1  0
0  0     0  223824  73680 3231280   0    0      0    0   393  619  0  0 99  0  0

 ===============================================================================

 Bad Example

       Write1   Write2   Write3    Read1    Read2    Read3

 P       N/A      N/A      N/A     36.37    37.72    37.48
 R       N/A      N/A      N/A    248.18   248.22   223.53

LAN and WiFi Benchmarks

The Raspberry Pi LanSpeed64 version uses the same programming code as for the DriveSpeed benchmark, except O_DIRECT is not used on creating files. The measurements were made between the Pi 4B and a Windows 7 based PC, where the data transfer speed was confirmed via Task Manager Network information and sysstat sar -n DEV on the Raspberry Pi 4. SAMBA was also installed to connect a remote PC and enable an Intel Windows version, LanSpdx86Win.exe, to be run.

An example of a LanSpeed64 log file is provided below, preceded by examples of the required mount and run commands.

Code: Select all

Commands

sudo mount -t cifs -o dir_mode=0777,file_mode=0777 //192.168.1.68/d /media/public

./LanSpeed64 FilePath /media/public/test

Log File

   LanSpeed RasPi 64 Bit 1.0 Thu Sep 12 22:06:06 2019
 
 Selected File Path: 
 /media/public/test/
 Total MB  266240, Free MB   70991, Used MB  195249

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8    66.13    92.09    92.76    96.36    96.85    97.30
  16    80.79    93.59    94.61   103.99   104.34   104.57

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.004    0.009    0.435     0.95     0.92     0.93

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      1.37     2.45     4.77     1.37     2.49     4.92
 ms/file     2.99     3.35     3.43     2.98     3.29     3.33    0.467

LAN and WiFi Benchmark Results

Below are results from programs run on the Pi 3B+ and 4B, plus others from running on a PC.

Dealing with large files, PC to Pi 4B and Pi 4B to PC LAN speeds demonstrated some gigabit performance examples (over 100 MB/second), around three times faster than on the Pi 3B+. My BT Hub has dual 2.4 and 5 GHz WiFi capabilities, leading to the erratic WiFi performance shown below, where (I think) greater than 10 MB/second is indicative of 5 GHz and around 4 MB/second of 2.4 GHz, the former usually only on writing. In this case, the hub was inches away from the Pi.

I changed the hub settings to provide separate 2.4 and 5 GHz hub address selections, with 72 and 180 Mbits/second being indicated, respectively. These sorts of numbers were confirmed on my Smartphone, but were variable. The 64 bit version would not connect to the network at 5 GHz, unlike the 32 bit program which, for example, obtained 15 MB/second writing and 8 MB/second reading. These differences could be, I suppose, due to program, software and/or hub incompatibility.
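
To relate the indicated link rates to the MB/second results, here is a minimal Python sketch of the conversion. It assumes 8 bits per byte with MB meaning one million bytes, and it ignores protocol overheads, so the results are upper limits rather than expected speeds. The function names are illustrative only.

```python
def mbits_to_mbytes(mbits_per_second):
    """Convert a link rate in Mbits/second to MB/second (8 bits per byte)."""
    return mbits_per_second / 8.0


def link_efficiency(measured_mbytes, link_mbits):
    """Fraction of the raw link rate achieved by a measured MB/second."""
    return measured_mbytes / mbits_to_mbytes(link_mbits)


for label, rate in [("2.4 GHz WiFi", 72), ("5 GHz WiFi", 180), ("Gigabit LAN", 1000)]:
    print(f"{label}: {rate} Mbits/second is at most {mbits_to_mbytes(rate):.1f} MB/second")

# Best Pi 4B LAN reading speed from the 128 MB results, against gigabit.
print(f"LAN read efficiency: {link_efficiency(107.82, 1000):.2f}")
```

On these assumptions, a 72 Mbits/second connection cannot exceed 9 MB/second, which supports reading the results above 10 MB/second as 5 GHz operation.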

Random access times appeared to be quite similar on all WiFi tests, with faster but variable comparative times via LAN. There were similar relationships on dealing with numerous small files.

Code: Select all

Large Files MB/second

System   MB   Write1   Write2   Write3    Read1    Read2   Read3

PC WiFi  16     4.08     4.16     4.11     2.34     1.68     1.30
PC LAN   16   106.11   106.11   105.89    50.67    33.86    25.47
LAN 3B+  16    28.63    29.03    28.96    22.18    32.28    32.61
3B+ WiFi 16    11.15    11.00    10.76     4.01     3.89     3.09
4B WiFi1 16     6.43     6.39     6.47     4.33     4.13     4.86
4B WiFi2 16    13.26    13.34    13.25     3.69     4.22     4.00
4B LAN   16    80.79    93.59    94.61   103.99   104.34   104.57
4B LAN  128    96.58    96.67    95.74   106.41   107.24   107.82


Random milliseconds

System          Read                     Write

PC WiFi        1.711    1.972    2.015     2.26     2.28     2.25
PC LAN         0.606    0.590    0.532     0.47     0.48     0.47
LAN 3B+        0.030    0.816    0.484     1.19     1.16     1.16
3B+ WiFi       3.052    3.167    3.475     3.60     3.39     3.45
4B WiFi1       3.286    3.549    3.627     4.02     3.45     3.72
4B WiFi2       2.786    2.822    2.944     3.20     2.94     2.92
4B LAN         0.004    0.009    0.435     0.95     0.92     0.93


200 Small Files  milliseconds per file

System      Write                       Read                     Delete

PC WiFi     10.09    12.42    13.81     5.50     6.11     8.06    1.507 
PC LAN       4.05     4.59     4.53     2.38     2.23     2.64    0.661 
LAN 3B+      3.72     4.36     4.45     3.33     3.40     3.60    0.378
3B+ WiFi    12.61    13.53    14.97    13.17    14.06    15.88    2.534
4B WiFi1    15.08    16.53    22.83    12.96    14.23    17.29    2.509
4B WiFi2    11.38    12.85    12.82    10.64    11.83    14.15    2.083
4B LAN       2.99     3.35     3.43     2.98     3.29     3.33    0.467

RoyLongbottom
Posts: 299
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Wed Nov 13, 2019 10:52 am

Raspberry Pi 4B 64 Bit Stress Tests

My 64 bit stress testing programs have now been run on Gentoo controlled Pi 4B systems, with some variations. They demonstrate the same heating and processor speed variations reported for those tested at 32 bits via Raspbian. The detailed Raspberry Pi 4B 64 Bit Benchmarks and Stress Tests PDF report has now been uploaded to ResearchGate:

https://www.researchgate.net/publicatio ... ress_Tests

along with a tar.gz archive file containing the benchmarks and source codes in:

https://www.researchgate.net/profile/Ro ... UpdatesLog

As a reminder, there are also separate reports for 32 bit benchmarks and stress tests:

https://www.researchgate.net/publicatio ... Benchmarks
https://www.researchgate.net/publicatio ... ce_Linpack

plus benchmarks and source codes for both in:

https://www.researchgate.net/profile/Ro ... UpdatesLog
