ejolson
Posts: 3036
Joined: Tue Mar 18, 2014 11:47 am

Re: Raspberry Pi Benchmarks

Mon Apr 01, 2019 5:30 pm

RoyLongbottom wrote:
Mon Apr 01, 2019 11:15 am
The following reference that you quoted indicates that four Raspberry Pi 3s obtain an HPL score of 3.463 GFLOPS, but it mentions neither the performance of a single 4 core node nor which versions of the benchmark packages were used. That speed is what I saw with my ATLAS version on one Pi 3 and, for the other version, I see over 5 GFLOPS.

https://www.raspberrypi.org/magpi/bench ... i-cluster/
That article appears to indicate inexperience on the part of the author as well as a lack of editorial oversight. This is somewhat unfortunate, as the MagPi serves as a definitive educational resource and the parallel Linpack as the de facto test that a cluster is functioning properly.

Maybe the article can motivate an easy challenge: use a cluster of four Raspberry Pi 3Bs to obtain around 20 GFLOPs on Linpack instead of 3.5 GFLOPs. Increasing N, tuning the configuration and making sure HPL was linked with OpenBLAS should be sufficient. I suspect most code clubs and schools have the resources to try, just not me.
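
As a rough sketch of the "increasing N" step: a common rule of thumb (an assumption here, not something taken from the MagPi article) is to size the HPL matrix so that it fills about 80% of the cluster's total RAM, at 8 bytes per double precision element, rounded down to a multiple of the block size NB. The function name, the 80% fraction and NB = 128 are illustrative choices only.

```python
import math

def hpl_problem_size(nodes, bytes_per_node, fraction=0.8, nb=128):
    """Largest N (a multiple of nb) whose N x N matrix of doubles fits in
    `fraction` of the total memory. A rule of thumb, not a tuned value."""
    usable_bytes = nodes * bytes_per_node * fraction
    n = math.isqrt(int(usable_bytes // 8))   # 8 bytes per double
    return (n // nb) * nb                    # round down to a multiple of NB

if __name__ == "__main__":
    # Four Pi 3B boards with 1 GiB each (assumed configuration).
    print(hpl_problem_size(4, 1024**3))
```

The result is only a starting point; the operating system and MPI buffers also need memory, so a smaller fraction may be necessary in practice.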

RoyLongbottom
Posts: 276
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Wed Apr 24, 2019 6:44 pm

New Multithreaded Stress Tests

I have produced some new 32 bit and 64 bit stress tests that can use up to 64 threads, each using a dedicated segment of the data. There are three varieties: Integer, Single Precision Floating Point and Double Precision Floating Point. Run time parameters include duration, data size and number of threads. Full details, with RPi 3B and 3B+ results, are included in the tar.gz and zip files at ResearchGate that contain the benchmarks and source codes.

tar.gz
https://www.researchgate.net/profile/Ro ... UpdatesLog

zip
https://www.researchgate.net/profile/Ro ... UpdatesLog

Default use, with no run time parameters, is a benchmarking mode. Following are examples of integer and single precision floating point performance via data in caches and RAM.

  MP-Integer-Test 32 Bit v1.0 Tue Apr  9 12:33:19 2019

      Benchmark 1, 2, 4, 8, 16 and 32 Threads

                   MB/second
                KB    KB    MB            Same All
   Secs Thrds   16   160    16  Sumcheck   Tests

   8.8    1   3571  3293  2045  00000000    Yes
   6.3    2   6946  6511  2152  FFFFFFFF    Yes
   5.6    4  13892 12651  1895  5A5A5A5A    Yes
   5.6    8  13247 13764  1880  AAAAAAAA    Yes
   5.5   16  13853 14034  1879  CCCCCCCC    Yes
   5.5   32  13633 13829  1908  0F0F0F0F    Yes

            End of test Tue Apr  9 12:33:56 2019


 MP-Threaded-MFLOPS 32 Bit v1.0 Tue Apr  9 12:34:25 2019

             Benchmark 1, 2, 4 and 8 Threads

                        MFLOPS          Numeric Results
             Ops/   KB    KB    MB      KB     KB     MB
  Secs  Thrd Word 12.8   128  12.8    12.8    128   12.8

   3.2    T1   2   852   847   399   40392  76406  99700
   5.5    T2   2  1641  1666   415   40392  76406  99700
   7.5    T4   2  3098  3172   415   40392  76406  99700
   9.5    T8   2  3002  3072   413   40392  76406  99700
  14.0    T1   8  1905  1932  1460   54756  85091  99820
  16.9    T2   8  3798  3837  1635   54756  85091  99820
  19.2    T4   8  7307  7460  1633   54756  85091  99820
  21.5    T8   8  7258  7613  1645   54756  85091  99820
  36.9    T1  32  2025  2049  1923   35296  66020  99519
  44.8    T2  32  4031  4072  3638   35296  66020  99519
  49.1    T4  32  7986  8025  6085   35296  66020  99519
  53.3    T8  32  7929  8081  6212   35296  66020  99519

            End of test Tue Apr  9 12:35:18 2019

The stress tests were run at the same time as my program that measures CPU MHz, core voltage and temperature. Following is an indication of logged stress test results on my older Pi 3B:

  After  99 seconds to first data comparison error

  MP-Integer-Test 32 Bit v1.0 Sat Apr  6 15:45:41 2019

               15 Minute Stress Test

              Data                          Same All
 Seconds      Size Threads   MB/sec Sumcheck Threads

  107.4     640 KB      4      7157 5A5A5A5A  Yes
  115.8     640 KB      4      7176 5A5A5A5A  Yes
  124.2     640 KB      4      7149 5A5A5A5A  Yes
  132.3     640 KB      4      7504 5A5A5A5A  Yes
  140.8     640 KB      4      7074 5A5A5A5A  Yes
  149.2     640 KB      4      7159 5A5A5A5A  Yes
  157.4     640 KB      4      7364 AAAAAAAA  Yes
  165.3     640 KB      4      7569 AAAAAAAA  Yes
  173.6     640 KB      4      7293 AAAAAAAA  Yes
  182.0     640 KB      4      7149 AAAAAAAA  Yes
  190.2     640 KB      4      7322 AAAAAAAA  No 1
  198.3     640 KB      4      7460 AAAAAAAA  Yes


 After 315 seconds to first error

  MP-Threaded-MFLOPS 32 Bit v1.0 Thu Apr 18 10:56:22 2019

                   15 Minute Stress Test

            Data             Ops/         Numeric
 Seconds    Size  Threads    Word  MFLOPS Results     Passes

  325.7   640 KB        8      32    6812   59617      13125
  335.9   640 KB        8      32    6772   59617      13125
  346.0   640 KB        8      32    6815       0      13125
  356.1   640 KB        8      32    6790   59617      13125

The only data comparison failures that were detected were on the older Pi 3B, without the recommended “over_voltage=2” boot setting.

The most stressful test appeared to be running the integer program with 4 threads and 640 KB of data, which overfilled the L2 cache. Running with 8 threads instead proved to be even more stressful. In this case, performance was much faster, with data for 4 threads in cache, interrupted by swapping for the other 4 threads.
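
The real test code is in the ResearchGate archives. The following is only an illustrative sketch of how a "Same All Threads" style check could work, assuming each thread fills its own dedicated data segment with a bit pattern, reads it back and reports a sumcheck. The function names, the AND/OR reduction and the segment size are hypothetical.

```python
import threading

def segment_check(pattern, words):
    """Fill a private segment with `pattern`, then AND and OR it back together.
    With uncorrupted data both reductions equal the pattern itself."""
    segment = [pattern] * words
    and_all, or_all = 0xFFFFFFFF, 0
    for w in segment:
        and_all &= w
        or_all |= w
    return and_all if and_all == or_all else None   # None marks a mismatch

def run_pass(n_threads, pattern=0x5A5A5A5A, words=4096):
    """Run one pass on n_threads; return (sumcheck, same_on_all_threads)."""
    results = [None] * n_threads

    def worker(i):
        results[i] = segment_check(pattern, words)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    same_all = all(r is not None and r == results[0] for r in results)
    return results[0], same_all

if __name__ == "__main__":
    sumcheck, same = run_pass(4)
    print(f"{sumcheck:08X}", "Yes" if same else "No")
```

On healthy hardware every thread reports the same value; a bit flipped in one segment would make that thread's result differ, which is the kind of event a "No" entry in the log points to.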

RoyLongbottom
Mon Jun 24, 2019 3:32 pm

Raspberry Pi 4 Benchmarks

I have run all of my relevant benchmarks and stress tests on the new Raspberry Pi 4. The early opportunity to run the programs was due to my acceptance of the request for me to become a volunteer, exercising the system prior to launch. I will be posting results here shortly, but I want to include others for desktop PCs with the first few. A report containing the benchmark results and comparisons with the Pi 3B+ is now at ResearchGate:

https://www.researchgate.net/publicatio ... Benchmarks

The PDF file was converted from simple hand coded HTML, hence the small size that I have attached in a ZIP file. This might be of use in providing easy access to others.
Raspberry Pi 4B 32 Bit BenchmarksHTM.zip (31.34 KiB)

ejolson
Mon Jun 24, 2019 6:28 pm

RoyLongbottom wrote:
Mon Jun 24, 2019 3:32 pm
Raspberry Pi 4 Benchmarks

I have run all of my relevant benchmarks and stress tests on the new Raspberry Pi 4. The early opportunity to run the programs was due to my acceptance of the request for me to become a volunteer, exercising the system prior to launch. I will be posting results here shortly, but I want to include others for desktop PCs with the first few. A report containing the benchmark results and comparisons with the Pi 3B+ is now at ResearchGate:

https://www.researchgate.net/publicatio ... Benchmarks

The PDF file was converted from simple hand coded HTML, hence the small size that I have attached in a ZIP file. This might be of use in providing easy access to others.

Raspberry Pi 4B 32 Bit BenchmarksHTM.zip

Did you use a heatsink or fan on the Pi 4, and did it throttle due to heat?

RoyLongbottom
Tue Jun 25, 2019 9:02 am

ejolson wrote:
Mon Jun 24, 2019 6:28 pm
RoyLongbottom wrote:
Mon Jun 24, 2019 3:32 pm
Raspberry Pi 4 Benchmarks
Did you use a heatsink or fan on the Pi 4 and did it throttle due to heat?

I will be producing a report of my stress testing efforts in due course. As with all modern processors that I have encountered, the Pi 4 will throttle or crash when stretched to the limit without appropriate cooling arrangements.

An example of the worst case was running the High Performance Linpack benchmark. On a bare Pi 4 board, with throttling starting at 80°C, running time was 16 minutes, with a performance rating of 5.37 GFLOPS and a final temperature of 87°C, by which time throttled frequencies were between 500 and 600 MHz. That is not all bad news, as it completed the task.

A second Pi 4 board has a Power over Ethernet HAT fitted, providing fan based cooling. This ran the HPL test in 8 minutes at 10.9 GFLOPS, with a maximum temperature of 61°C.

A best case, with no cooling, was watching BBC iPlayer for more than 2 hours on a full screen 1920x1080 display, with CPU utilisation around 100% of one core and a maximum temperature of 70°C, at maximum MHz all the time. That also seemed to be the maximum temperature when running my single core benchmarks.
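
For anyone wanting to log similar figures, here is a minimal sketch that polls the firmware through the `vcgencmd` tool (`measure_temp`, `measure_clock arm` and `get_throttled`). It is not the monitoring program used for the results above; the helper names and the choice of which throttle bits to report are assumptions for illustration.

```python
import re
import subprocess

def parse_temp(text):
    """Parse output such as "temp=61.0'C" into degrees Celsius."""
    return float(re.search(r"temp=([\d.]+)", text).group(1))

def parse_clock_mhz(text):
    """Parse output such as "frequency(48)=1500000000" into MHz."""
    return int(text.strip().split("=")[1]) / 1e6

def parse_throttled(text):
    """Decode "throttled=0x..." into the two heat related flags."""
    bits = int(text.strip().split("=")[1], 16)
    return {
        "throttled_now": bool(bits & (1 << 2)),             # bit 2: currently throttled
        "throttling_has_occurred": bool(bits & (1 << 18)),  # bit 18: throttled since boot
    }

def vcgencmd(*args):
    """Run vcgencmd with the given arguments and return its text output."""
    done = subprocess.run(["vcgencmd", *args], capture_output=True, text=True, check=True)
    return done.stdout

if __name__ == "__main__":
    # Only works on a Raspberry Pi with vcgencmd installed.
    print(parse_temp(vcgencmd("measure_temp")),
          parse_clock_mhz(vcgencmd("measure_clock", "arm")),
          parse_throttled(vcgencmd("get_throttled")))
```

Calling this once a second while a benchmark runs gives a temperature and MHz trace against which throttling can be matched.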

RoyLongbottom
Tue Jun 25, 2019 3:39 pm

Single Core Benchmarks

The following provides benchmark results, with limited comments on Raspberry Pi 4B performance gains over the Pi 3B+ and on how the older ARM V7 compilations compare with the gcc 8 compilations on the Pi 4B. For the first few, ancient benchmarks, ARM V6 compilations are also compared.

Results of somewhat older PCs are provided for the first few benchmarks, to indicate that the Pi 4 appears to have performance in the desktop PC category.

Whetstone Benchmark

This has a range of simple tests with an overall rating in MWIPS, the latter being dependent on floating point speeds, currently most influenced by variations in the COS and EXP tests. The Pi 4B is faster, but not excessively so, with MWIPS ratios between 1.76 and 1.87.

System       MHz  MWIPS ------ MFLOPS------  -----  ------ -MOPS-------- ------
                            1      2      3      COS    EXP  FIXPT     IF  EQUAL
ARM V7
Pi 3B+       1400   1060    391    383    298   21.7   12.3   1740   2083   1392
Pi 4B        1500   1884    516    478    310   54.7   27.1   2498   2247    999
4B/3B+       1.07   1.78   1.32   1.25   1.04   2.52   2.21   1.44   1.08   0.72

ARM V6
Pi 3B+       1400   1094    391    407    348   21.7   12.3   1740   2084   1391
Pi 4B        1500   2048    520    473    389   53.8   27.1   2497   2245   2246
4B/3B+       1.07   1.87   1.33   1.16   1.12   2.47   2.20   1.44   1.08   1.61

gcc 8
Pi 3B+       1400   1063    393    373    300   21.8   12.3   1748   2097   1398
Pi 4B        1500   1883    522    471    313   54.9   26.4   2496   3178    998
4B/3B+       1.07   1.76   1.33   1.26   1.05   2.51   2.09   1.43   1.52   0.71

Similar to Pi 4B Results from PCs 

Core i5      @@@@   1813    537    501    317   50.4   32.1   2242    457    561
Core 2       2400   2057    586    580    387   56.7   34.4   2192    381    771
Phenom II    3000   2145    594    492    297   67.1   50.7   2053    694    703

        @@@@ Rated as 1600 MHz but running at up to 2300 MHz using Turbo Boost
https://www.researchgate.net/publicatio ... er_Results

or to compare with computers from 1964 to the 1990s.

https://www.researchgate.net/publicatio ... nd_Results

Dhrystone Benchmark

This appears to be the most popular ARM benchmark and is often subject to over optimisation, so results from different compilers cannot safely be compared. Ignoring this, results in VAX MIPS (aka DMIPS) and comparisons follow.

The Pi 4B was shown to be around twice as fast as the 3B+ and gcc 8 performance was similar to the ARM V7 compilation.

                                              Best
                ----- Compiler -----         DMIPS
 System     MHz  ARM V6 ARM V7  gcc 8  G8/V7  /MHz

 Pi 3B+    1400   2520   2825   2838   1.00   2.03
 Pi 4B     1500   5077   5366   5646   1.05   3.76

 4B/3B+    1.07   2.01   1.90   1.99          1.86

Similar to Pi 4B Results from PCs 

Athlon 64   2211         4462
Core i5     @@@@         4752
Core 2      2400         6446
Phenom II   3000         7615
https://www.researchgate.net/publicatio ... Longbottom
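
As a small worked example of the arithmetic behind the table: the VAX MIPS (DMIPS) rating is conventionally the Dhrystones per second score divided by 1757, the score of the VAX 11/780 reference machine, and the DMIPS/MHz column is that rating divided by the CPU clock. The function names below are made up for illustration.

```python
VAX_11_780_DHRYSTONES_PER_SEC = 1757   # reference score, defined as 1 DMIPS

def dmips(dhrystones_per_second):
    """Convert a raw Dhrystones per second score to VAX MIPS."""
    return dhrystones_per_second / VAX_11_780_DHRYSTONES_PER_SEC

def dmips_per_mhz(dmips_value, mhz):
    """Normalise a DMIPS rating by clock speed."""
    return dmips_value / mhz

if __name__ == "__main__":
    # Best (gcc 8) scores from the table above.
    for name, score, mhz in [("Pi 3B+", 2838, 1400), ("Pi 4B", 5646, 1500)]:
        print(f"{name}: {dmips_per_mhz(score, mhz):.2f} DMIPS/MHz")
```

The per MHz figure is the one to use when comparing processor designs, since it removes the effect of the 1400 versus 1500 MHz clocks.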

Linpack 100 Benchmark MFLOPS

This original Linpack benchmark is dependent on fused multiply and add instructions, but the overheads in the standard source code restrict processing speed. It seems that the latest hardware has been modified to execute this type of code more efficiently. The original used Double Precision (DP) arithmetic but, for the benefit of the Pi, Single Precision (SP) versions were produced.

All measurements demonstrate that the Pi 4B was between 3.6 and 4.7 times faster than the Pi 3B+.

                      ARM V6        ARM V7        gcc 8     NEON
 System    MHz      DP     SP     DP     SP     DP     SP     SP 

 Pi 3B+    1400  206.0  220.2  210.5  225.2  224.8  227.3  536.0
 Pi 4B     1500  764.7  880.6  760.2  921.6  957.1 1068.8 2010.5

 4B/3B+    1.07   3.71   4.00   3.61   4.09   4.26   4.70   3.75

Similar to Pi 4B Results from PCs 

Pentium 4E 3000                630.3 
Athlon 64  2150                811.9  
Core 2     1830                997.7  
Core i5    @@@@               1064.7 

Later SSE2
Core 2     1830                             1119.4 
https://www.researchgate.net/publicatio ... Longbottom
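
To make the "multiply and add" dependence concrete, here is a minimal sketch of the daxpy step at the heart of Linpack, together with the conventional operation count used to turn a solve time into MFLOPS (2n³/3 + 2n² for an n x n system). This is plain Python for illustration, not the benchmark source, and the function names are assumptions.

```python
def daxpy(a, x, y):
    """Return a*x + y element by element: one multiply and one add per word."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def linpack_flops(n):
    """Conventional floating point operation count for an n x n solve."""
    return 2.0 * n**3 / 3.0 + 2.0 * n**2

def mflops(n, seconds):
    """MFLOPS rating for solving an n x n system in `seconds`."""
    return linpack_flops(n) / seconds / 1e6

if __name__ == "__main__":
    print(daxpy(2.0, [1.0, 2.0], [3.0, 4.0]))
    print(linpack_flops(100))
```

With N = 100 the whole matrix is only 80 KB in double precision, so the score mainly reflects how well the CPU and caches handle short multiply and add loops.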

Livermore Loops Benchmark MFLOPS

This benchmark measures performance of 24 double precision kernels, initially used to select the latest supercomputer. The official average is the geometric mean, on which the Cray 1 supercomputer was rated at 11.9 MFLOPS. Following are MFLOPS for the individual kernels, followed by overall scores.

Based on the Geomean results, the Pi 4B is shown as being 2.36 times faster than the 3B+, and even more so (2.61 times) via the gcc 8 compilation, where the gcc 8/V7 performance ratio on the Pi 4B is 1.31.

System      MHz    Maximum Average Geomean Harmean Minimum

 ARM V7
 Pi 3B+     1400     464.8   246.7   220.1   193.9    78.3
 Pi 4B      1500    1800.2   635.1   519.0   416.1   155.3
 4B/3B+     1.07      3.87    2.57    2.36    2.15    1.98

 gcc 8
 Pi 3B+     1400     541.7   283.4   257.4   231.5    92.7
 Pi 4B      1500    1860.8   800.4   679.0   564.1   179.5
 4B/3B+     1.07      3.40    2.80    2.61    2.41    1.90
                                                
Similar to Pi 4B Results from PCs 

Core i5     @@@@    2326.4   717.1   438.0   246.2    34.7 
Athlon 64   2211    2565.9   740.9   460.5   285.7    48.4
Core 2      2400    2236.2   791.9   539.0   331.6    52.3
Phenom  II  3000    3893.9  1071.8   644.4   384.5    64.0

Later SSE2
Core 2      2400    2854.2  1047.6   848.2   653.8   141.8 
https://www.researchgate.net/publicatio ... Longbottom
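
For reference, the five summary columns can be derived from the per kernel MFLOPS as sketched below, using Python's `statistics` module. The sample values are made up for illustration and are not measured results.

```python
import statistics

def summarise(kernel_mflops):
    """Summary statistics in the same order as the table columns."""
    return {
        "Maximum": max(kernel_mflops),
        "Average": statistics.fmean(kernel_mflops),
        "Geomean": statistics.geometric_mean(kernel_mflops),
        "Harmean": statistics.harmonic_mean(kernel_mflops),
        "Minimum": min(kernel_mflops),
    }

if __name__ == "__main__":
    sample = [100.0, 400.0, 1600.0]   # hypothetical kernel speeds
    for name, value in summarise(sample).items():
        print(f"{name:8s} {value:8.1f}")
```

The geometric and harmonic means are pulled down by slow kernels far more than the arithmetic average, which is why the columns always fall in the order Maximum, Average, Geomean, Harmean, Minimum.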

Next, single core benchmarks using caches and RAM

RoyLongbottom
Wed Jun 26, 2019 4:15 pm

Single Core Benchmarks using Data in Caches and RAM

The following are brief summaries of details and results. For the full information, see the HTM file in Raspberry Pi 4B 32 Bit BenchmarksHTM.zip, attached to the first message of this Raspberry Pi 4 benchmarks topic.

https://www.raspberrypi.org/forums/down ... p?id=30443

Fast Fourier Transforms Benchmarks

These are based on methods used in a real application with results being average running times from three passes. Below are Pi 4B results and performance gains over a Pi 3B+. In this case, the new gcc 8 compilations are not much different from the earlier benchmarks.

                              Time in milliseconds

              Raspberry Pi 4B FFT 1        Raspberry Pi 4B FFT 3

          ARM V7           gcc 8          ARM V7           gcc 8
   Size
      K       SP      DP      SP      DP      SP      DP      SP      DP

      1     0.04    0.04    0.04    0.04    0.06    0.05    0.05    0.04
      2     0.08    0.12    0.08    0.13    0.13    0.11    0.10    0.10
      4     0.32    0.37    0.29    0.34    0.27    0.24    0.24    0.23
      8     0.77    0.97    0.79    0.82    0.58    0.55    0.57    0.51
     16     1.69    2.01    1.65    1.85    1.49    1.35    1.32    1.19
     32     4.37    4.89    3.76    4.71    2.96    3.63    2.69    3.30
     64     9.12   26.55    8.82   30.64    7.46   10.75    6.60    9.47
    128    55.52  160.11   58.54  132.41   17.93   26.03   16.92   23.85
    256   305.92  423.06  275.44  373.12   41.16   55.06   37.61   55.97
    512   833.10  854.88  780.89  751.27   86.93  120.53   81.54  128.13
   1024  1617.49 1875.52 1578.70 1812.20  190.28  266.60  186.45  288.27

  Size              RPi 4B Gains (>1.0 4B running time is less)
     K
      1     3.45    3.46    4.02    3.94    3.06    2.66    2.88    3.45
      2     3.79    3.14    4.27    3.84    3.10    2.93    3.28    3.29
      4     2.46    2.50    3.19    3.84    3.86    3.23    3.24    3.22
      8     2.51    2.24    3.82    4.12    3.67    3.18    3.21    3.44
     16     2.76    2.62    3.08    3.23    3.17    4.06    3.25    4.10
     32     2.51    4.21    3.27    4.38    3.62    4.14    3.55    4.13
     64     3.79    4.86    4.23    4.27    3.88    3.42    3.95    3.51
    128     4.43    1.93    4.34    2.42    3.91    3.24    3.83    3.23
    256     1.92    1.51    2.25    1.97    3.82    3.57    3.86    3.23
    512     1.48    1.61    1.58    1.93    4.18    3.60    4.13    3.16
   1024     1.71    1.60    1.76    1.71    4.24    3.66    3.95    3.17

Following are some PC DP results, where the Pi 4 performs relatively well at the smaller FFT sizes before the PCs see the advantage of larger caches and faster RAM.

        Double Precision Version 3 Milliseconds

                   Linux, P4 Windows
                                                  
         Core i7  Core 2  Phenom    Atom Pentium
           4820K    6600  II 945    N455       4
     MHz    3900    2400    3000    1666    1900

  K Size
       1    0.02    0.05    0.03    0.17    0.09
       2    0.04    0.12    0.05    0.42    0.20
       4    0.08    0.29    0.14    0.91    0.42
       8    0.18    0.64    0.41    1.98    0.98
      16    0.37    1.29    0.88    3.92    6.23
      32    0.78    2.86    2.11   10.01    15.2
      64    1.70    6.21    4.64   23.51    34.6
     128    3.66   14.49   10.45   53.03    73.7
     256    8.09   34.85   26.32  115.83   156.0
     512   21.05   81.85   79.23  245.59   344.0
    1024   65.15  178.81  197.41  538.71   804.0

BusSpeed Benchmark

This is a read only benchmark with data from caches and RAM. The program starts by reading one word in every 32, skipping the words in between, then repeats with decreasing increments, finally reading all data. This shows where data is read in bursts, enabling estimates to be made of bus speeds.

The following provides Pi 4B results and performance gains, over the Pi 3B+, for the most meaningful column, reading all data. Note that the best ratio indicates the impact of the larger L2 cache.

The PC versions are not comparable as assembly code was used. Details below are for a single thread using an MP version, indicating that the Pi 4 could be comparable on a per MHz basis.

                    Pi 4B  gcc 8                

    Reading Speed 4 Byte Words in MBytes/Second
  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read  Pi 4B  gcc 8
  KBytes  Words  Words  Words  Words  Words    All   Gain   Gain

      16   4880   5075   5612   5852   5877   5864   1.16   1.00
      32    846   1138   2153   3229   4908   5300   0.99   1.11
      64    746   1019   2035   3027   4910   5360   1.50   1.36 
     128    728    983   1952   2908   4888   5389   1.52   1.00
     256    683    934   1901   2794   4874   5431   1.55   1.06
     512    656    900   1760   2625   4585   5259   1.75   1.23
    1024    301    410    870   1356   2846   4238   2.81   1.21
    4096    233    248    531    996   2151   4045   2.35   1.27
   16384    236    258    511    891   2143   4011   2.35   1.25
   65536    237    257    508    881   2172   4015   2.40   1.28

 Core 2 2400 MHz
 L1 Cache  6751   6808   9050   9198   9270   9258
 L2 Cache  2094   2028   3328   4529   6670   7949     
 RAM        318    385    792   1436   2620   4902  
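
The access pattern behind the Inc32 to Read All columns can be sketched as follows. This is an illustrative Python version, not the benchmark code: the AND reduction, the function names and the choice to count only the bytes actually read are assumptions, and Python itself is far too slow for the MB/second figures to mean anything.

```python
from array import array
import time

INCREMENTS = (32, 16, 8, 4, 2, 1)   # words stepped per read, as in the table columns

def strided_read(data, inc):
    """AND together every inc-th word; return (result, words_read)."""
    acc = 0xFFFFFFFF
    words_read = 0
    for i in range(0, len(data), inc):
        acc &= data[i]
        words_read += 1
    return acc, words_read

def run(kbytes=16):
    """Return {inc: (words_read, mbytes_per_second)} for one buffer size."""
    n_words = kbytes * 1024 // 4                  # 4 byte words
    data = array("I", [0xFFFFFFFF]) * n_words     # "I" is 4 bytes on common platforms
    results = {}
    for inc in INCREMENTS:
        start = time.perf_counter()
        _, words_read = strided_read(data, inc)
        elapsed = max(time.perf_counter() - start, 1e-9)
        results[inc] = (words_read, words_read * 4 / elapsed / 1e6)
    return results

if __name__ == "__main__":
    for inc, (words, rate) in run(16).items():
        print(f"Inc{inc:<2d} words read {words:5d}  {rate:8.1f} MB/s")
```

With a large increment each word read can cost a whole burst from memory, so the Inc32 speed falls well below the Read All speed once data no longer fits in L1 cache.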

MemSpeed Benchmark MB/Second

This includes CPU speed dependent calculations using data from caches and RAM. The calculations are shown in the results column titles. Following are full Pi 4B results from running the gcc 8 recompiled versions, plus full Pi4B/3B+ and some old/gcc 8 comparisons. The Pi 4B is indicated as faster on all test functions, with best case on double precision calculations using cached data.

                           Pi 4B gcc 8  

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       8   11768   9844   3841  11787   9934   4351  10309   7816   7804
      16   11880   9880   3822  11886  10043   4363  10484   7902   7892
      32    9539   8528   3678   9517   8661   4098  10564   7948   7945
      64    9952   9310   3733   9997   9470   4160   8452   7717   7732
     128    9947   9591   3757   9990   9757   4178   8205   7680   7753
     256   10015   9604   3758  10030   9781   4186   8120   7734   7707
     512    9073   9300   3751   9472   9526   4175   7995   7709   7602
    1024    2681   5303   3594   2664   4965   3760   4828   3592   3569
    2048    1671   3488   3242   1757   3635   3540   2882   1036   1023
    4096    1777   3700   3283   1827   3627   3555   2433   1052   1054
    8192    1931   3805   3420   1933   3815   3629   2465    980    971
 
                             Pi 4B/3B+

       8    5.81   3.08   1.99   3.96   2.23   2.09   3.02   1.77   1.76
      16    5.85   3.09   1.98   3.99   2.25   2.10   3.06   1.78   1.77
      32    4.84   2.74   1.93   3.35   2.02   1.99   3.08   1.78   1.78
      64    5.15   3.06   1.99   3.63   2.30   2.07   2.61   1.83   1.83
     128    5.04   3.11   1.99   3.54   2.34   2.08   2.50   1.80   1.81
     256    5.06   3.11   1.99   3.52   2.34   2.08   2.45   1.82   1.82
     512    5.18   3.35   2.16   3.85   2.57   2.26   2.57   1.91   1.86
    1024    2.74   2.85   2.59   2.20   2.23   2.57   3.30   2.93   2.93
    2048    1.71   1.88   2.35   1.55   1.72   2.45   3.36   1.25   1.24
    4096    1.82   2.05   2.41   1.61   1.73   2.49   2.58   1.14   1.15
    8192    1.98   2.13   2.51   1.70   1.86   2.58   3.07   1.24   1.32

                               4B gcc 8 gains

       8    1.39   2.07   0.29   1.42   2.08   0.28   1.32   0.79   0.79
     512    1.12   2.01   0.40   1.15   2.05   0.39   1.59   0.98   0.96
    8192    1.02   2.00   1.82   1.12   2.89   2.65   1.09   0.96   0.95

Following are some scores from a PC running Windows, showing that the Pi 4 is in the same league.

 Core 2 2400 MHz 
 L1 Cache  12556   6122   8921  16749   7683   9510  14924   4667   7536 
 L2 Cache  12755   6380   8463  11561   7578   7928   7798   5328   5349
 RAM        4816   4761   4828   3068   3085   3081   1568   1539   1547

NeonSpeed Benchmark MB/Second

This carries out some of the same calculations as MemSpeed. All results are for single precision floating point and integer calculations. The Norm tests use code as generated by the compiler with NEON directives, and the Neon tests use intrinsic functions.

The NEON calculations were faster than those from a normal C compilation, using cached data, but slower from RAM. Maximum speeds are also shown for NEON vector instructions, as single precision MFLOPS and integer MOPS, both at greater than two results per clock cycle.

                      Pi 4B gcc 8  

       Vector Reading Speed in MBytes/Second
  Memory  Float v=v+s*v  Int v=v+v+s   Neon v=v+v
  KBytes   Norm   Neon   Norm   Neon  Float    Int

      16   9884  12882   3910  12773  13090  15133
      32   9904  13061   3916  13002  13162  15239
      64   9029  11526   3450  10704  11708  12084
     128   9242  11784   3391  11016  11816  12179
     256   9283  11890   3396  11215  11929  12284
     512   9043  10680   3413  10211  10925  11241
    1024   5818   3310   3507   3288   3239   2902
    4096   4060   1994   3497   1991   2009   2011
   16384   4030   2063   3445   2068   2072   2067
   65536   3936   2109   3391   1858   2122   2121

  Max MOPS        3265          3251

                      Pi 4B/3B+

      16   3.05   2.16   2.21   1.90   1.87   1.89
      32   3.25   2.28   2.37   2.00   1.97   1.96
      64   3.85   2.99   2.94   2.59   2.67   2.70
     128   3.65   2.84   2.87   2.47   2.65   2.63
     256   3.60   2.82   2.81   2.45   2.61   2.60
     512   4.58   3.79   3.65   3.35   3.60   3.67
    1024   2.60   2.65   2.59   2.64   2.75   2.67
    4096   1.62   1.77   1.78   1.79   1.77   1.77
   16384   1.86   1.80   1.81   1.88   1.86   1.85
   65536   1.84   1.81   1.82   2.44   1.84   1.86

                      4B gcc 8 gains
 
      16   1.02   1.28   0.44   1.36   1.34   1.44
     512   0.87   0.97   0.34   1.02   0.99   0.98
    1024   1.90   1.03   1.11   1.04   1.01   0.91
   16384   1.99   1.02   1.72   1.05   1.02   1.03

Next, multithreading benchmarks

RoyLongbottom
Fri Jun 28, 2019 8:36 pm

Multithreading Benchmarks

Most of the multithreading benchmarks execute the same calculations using 1, 2, 4 and 8 threads. Quite a few are versions of the single core benchmarks, with similar Pi 4 performance gains. With these, comparisons with PCs become more complicated and none are provided. There are variations in the number of cores, cache sizes, memory throughput and more efficient SIMD instructions, normally providing an advantage to the PCs. On the simpler MP benchmarks, the Raspberry Pi 4 systems could provide similar performance (at a much lower cost, of course).

Again, the ZIP file in the earlier message contains many more details:

https://www.raspberrypi.org/forums/view ... c#p1484388

MP-Whetstone Benchmark

Multiple threads each run the eight test functions at the same time, but with some dedicated variables. Measured speed is based on the last thread to finish and performance is generally proportional to the number of cores used. Following are Pi 4B results and performance gains over Pi 3B+. Seconds used indicates MP efficiency.

           MP-Whetstone Benchmark Linux/ARM Pi 4B gcc 8  

      MWIPS MFLOPS MFLOPS MFLOPS   Cos   Exp   Fixpt      If  Equal
                 1      2      3  MOPS  MOPS    MOPS    MOPS   MOPS

 1T  1889.5  538.7  537.6  311.4  56.3  26.1  7450.5  2243.2  659.9
 2T  3782.7 1065.5 1071.2  627.1 112.3  52.0 14525.7  4460.9 1327.3
 4T  7564.1 2101.0 2145.9 1250.4 225.0 104.1 29430.5  8944.2 2660.8
 8T  8003.6 2598.8 2797.0 1313.0 233.2 110.4 37906.3 10786.7 2799.4

   Overall Seconds   4.99 1T,   5.00 2T,   5.03 4T,  10.06 8T      

                4 Thread Performance ratios 4B/3B+                      
  
 gcc8  1.79   1.39   1.40   1.05   2.66   2.13   1.53   1.68   0.48
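
The "last thread to finish" timing described above can be sketched like this. It is a minimal Python illustration with made up names, not the benchmark code, and because of CPython's global interpreter lock the threads here will not scale with the number of cores the way the compiled benchmark does.

```python
import threading
import time

def fp_work(n):
    """Stand-in for a test function: n add and multiply steps on private data."""
    x = 1.0
    for _ in range(n):
        x = (x + 1.0) * 0.5
    return x

def timed_threads(n_threads, work, ops_per_thread):
    """Run `work` in n threads; the rate is based on the time taken
    until the last thread has finished."""
    threads = [threading.Thread(target=work, args=(ops_per_thread,))
               for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()                      # returns only once every thread is done
    elapsed = max(time.perf_counter() - start, 1e-9)
    total_ops = n_threads * ops_per_thread
    return total_ops / elapsed / 1e6, elapsed

if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        rate, secs = timed_threads(n, fp_work, 200_000)
        print(f"{n}T  {rate:8.2f} MOPS  {secs:6.3f} seconds")
```

Measured this way, running 8 threads on 4 cores shows up as roughly doubled elapsed time rather than higher throughput, which matches the 10.06 seconds reported for 8T above.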

MP-Dhrystone Benchmark

This executes multiple copies of the same program, but with some shared data, leading to inconsistent multithreading performance, as indicated in the results. The single thread speeds were similar to the earlier Dhrystone results, with RPi 4B ratings around twice as fast as those for the Pi 3B+.

                      Pi 3B+ gcc 8                        

 Threads                        1        2        4        8
 Seconds                     0.79     0.92     1.23     2.46
 Dhrystones per Second    5035879  8678942 13020489 13028455
 VAX MIPS rating             2866     4940     7411     7415

 
                      Pi 4B gcc 8                           
 
 Threads                        1        2        4        8
 Seconds                     0.79     1.21     2.62     4.88
 Dhrystones per Second   10126308 13262168 12230188 13106002
 VAX MIPS rating             5763     7548     6961     7459

MP SP NEON Linpack Benchmark

This executes a single copy of the benchmark, at three data sizes, with the critical daxpy code multithreaded. This code was also modified to allow a higher level of parallelism, without changing any calculations. Even so, MP performance was much slower than running with no threading. The main reasons appear to be updating data in RAM to maintain integrity, with performance reflecting memory speeds, and exceptionally high thread start/stop overheads. The best Pi 4 gain with no threads was 3.1 times, using cached data, with the worst at 1.13 times, using RAM.

Linpack Single Precision MultiThreaded Benchmark
          Using NEON Intrinsics

  MFLOPS 0 to 4 Threads, N 100, 500, 1000     

 Threads      None        1        2        4 
 
                     Pi 3B+ gcc 8            

 N  100     638.49    66.92    66.23    66.14 
 N  500     471.71   304.69   297.05   305.51 
 N 1000     356.13   317.22   316.88   316.33 

                      Pi 4B gcc 8             

 N  100    2007.38   112.55   107.85   106.98 
 N  500    1332.24   686.10   686.11   689.02 
 N 1000     402.61   435.26   432.21   432.01 

MP BusSpeed (read only) Benchmark

Each thread accesses all of the data in separate sections covering caches and RAM, starting at different points, with this V7A v2 version. See single processor BusSpeed details regarding burst reading that can indicate significant differences. RdAll is the main area for comparison, where MP reading RAM is thought to indicate maximum performance.

Comparisons are provided for RdAll, at 1, 2 and 4 threads. These are subject to multiprocessing peculiarities, but the Pi 4B was indicated as being more than twice as fast as the Pi 3B+ in all cases except 4 threads at 122.9 KB, where the ratio was 1.04.

             MP-BusSpd ARM V7A gcc 8  Pi 4B        
 
  KB      Inc32  Inc16   Inc8   Inc4   Inc2  RdAll  4B/3B      

 12.3 1T   5310   5616   5801   5898   5940  13425   2.54
      2T   9393  10008  11293  11293  11368  24932   2.47
      4T  15781  15015  17606  19034  22279  40736   2.47
      8T   8465   9599  14580  18465  20034  36831       
122.9 1T    664    930   1861   3191   5017  10281   2.18
      2T    564    726   1523   5376   9387  18985   2.04
      4T    486    919   1886   4289   8337  16979   1.04
      8T    487    912   1854   4275   8271  16826       
12288 1T    225    258    514   1010   1992   3975   2.02
      2T    202    421    450   1765   3307   7396   3.51
      4T    261    288    825   1332   1772   5014   2.44
      8T    218    273    496   1041   2571   4021       

MP RandMem Benchmark

This benchmark potentially reads and writes all data, in sections covering caches and RAM, each thread starting at different addresses. Random access can select any address after that. Serial reading speed is normally similar to BusSpeed RdAll. Writing tends to involve updating the appropriate memory area, providing constant speeds. Random access is significantly affected by burst reading and writing.

Besides the full results, comparisons of the four thread results are shown below for Pi 4B/3B+ performance ratios. The Pi 3B+ appears to be faster at reading data from the shared L2 cache, with 4 threads only; otherwise, the average performance of the new processor was indicated as 80% faster.

    MP-RandMem Linux/ARM Pi 4B gcc 8

  KB       SerRD SerRDWR   RndRD RndRDWR

  12.3 1T    5950    7903    5945    7896
       2T   11849    7923   11887    7917
       4T   23404    7785   23395    7761
       8T   21903    7669   23104    7655
 122.9 1T    5670    7309    2002    1924
       2T   10682    7285    1648    1923
       4T    9944    7266    1813    1927
       8T    9896    7216    1812    1919
 12288 1T    3904    1075     179     164
       2T    7317    1055     215     164
       4T    3398    1063     343     165
       8T    4156    1062     350     165

             Pi 4B/3B+ gcc 8

  12.3 4T    1.43    1.82    1.43    1.81
 122.9 4T    0.79    1.87    0.76    1.86
 12288 4T    0.93    1.29    2.54    2.62
MP-MFLOPS Benchmarks

MP-MFLOPS measures floating point speed on data from caches and RAM, using 2 and 32 operations per data word read and written. There are three versions: single and double precision normal compilations, and a further single precision benchmark using NEON intrinsic functions.

Maximum GFLOPS here were 11.0 SP, 17.2 SP NEON and 10.4 DP. There were Pi 4B/3B+ gains all round, lowest where RAM speed was the limit and highest from the normal compilations, at over six times faster.

Code: Select all

          MP-MFLOPS Linux/ARM Pi 4B gcc 8                    

        2 Ops/Word              32 Ops/Word         
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS

 1T     1224    1257     520    2814    2800    2803
 2T     2485    2257     525    5608    5575    5576
 4T     4119    3243     534   11018   10645    8358
 8T     4131    4618     541    9941   10339    8165
Max                                                 
4B/3B+  5.48    6.07    1.35    3.61    3.53    2.86

            NEON Intrinsic Functions Version

 1T     2797    2870     641    4422    4454    4405
 2T     3217    5601     569    8587    8800    8377
 4T     7902    9864     611   17061   17215    9704
 8T     7070   10562     603   15531   16203    9516
Max                                                 
4B/3B+  3.78    4.13    1.49    2.30    2.32    1.61

              Double Precision Version     
       
 1T     1203    1211     315    2675    2719    2674
 2T     2291    2441     293    5406    5421    4907
 4T     4673    2501     309   10313   10393    5256
 8T     4394    3550     265    8782   10110    5197
Max                                                 
4B/3B+  6.69    4.45    1.56    3.30    3.33    1.89
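
The MFLOPS figures above come from counting operations per data word. A minimal sketch of that arithmetic follows, assuming a 2 operations per word kernel of the form (x + a) * b (inferred from the add and multiply pattern of the compiled loop shown later in this post, not taken from the benchmark source) and assuming 4 byte single precision words. The function names are hypothetical, and interpreted Python timings are not comparable with the compiled results.

```python
import time


def mflops(words, ops_per_word, passes, seconds):
    """Millions of floating point operations per second for one timed run."""
    return words * ops_per_word * passes / seconds / 1e6


def two_ops_pass(x, a, b):
    """One pass of an assumed 2 ops/word kernel: one add and one multiply per word."""
    return [(v + a) * b for v in x]


if __name__ == "__main__":
    words = 3200      # 12.8 KB of 4 byte words (assumed word size)
    passes = 200
    x = [0.999] * words

    start = time.perf_counter()
    for _ in range(passes):
        x = two_ops_pass(x, 0.0001, 0.9999)
    elapsed = time.perf_counter() - start

    # Illustrates the calculation only; the speed itself reflects the interpreter.
    print(f"{mflops(words, 2, passes, elapsed):.1f} MFLOPS")
```

The same formula applies to the 32 operations per word tests, with only ops_per_word changed.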
OpenMP-MFLOPS Benchmarks

Using multithreading produced by OpenMP, this benchmark carries out the same single precision calculations as the MP-MFLOPS benchmarks but, in addition, calculations with eight operations per data word. There is also a notOpenMP-MFLOPS single core version, compiled from the same code and carrying out identical numbers of floating point calculations, but without an OpenMP compile directive. The two varieties were also repeated using double precision variables. The parameters for these benchmarks were those used to test large PCs, with the last two data sizes tending to be dependent on RAM speed on the Raspberry Pi systems.

Below are Pi 3B+ and 4B results from the gcc 8 compilations. Included are 4 core MP gains and Pi 4B/3B+ performance ratios, both having wide variations, but worth having. Maximum GFLOPS were 19.9 SP and 9.2 DP.

Code: Select all

         gcc 8 Compiler Single Precision                                         

         Pi 3B+               Pi 4B                  Pi4 Gains 
 KB+Ops    4       1   4 core   4       1   4 core   4      1  
  /Word  Cores   Core   Gain  Cores   Core   Gain  Cores   Core

  100- 2  2139    778   2.75   5100   2270   2.25   2.38   2.92
 1000- 2   398    403   0.99    617    632   0.98   1.55   1.57
10000- 2   412    415   0.99    542    631   0.86   1.32   1.52

  100- 8  7348   1919   3.83  13805   5511   2.50   1.88   2.87
 1000- 8  1597   1448   1.10   2168   2217   0.98   1.36   1.53
10000- 8  1635   1444   1.13   2178   2542   0.86   1.33   1.76

  100-32  8497   2023   4.20  19921   5341   3.73   2.34   2.64
 1000-32  5997   1903   3.15   8556   5267   1.62   1.43   2.77
10000-32  6057   1914   3.16   8731   5276   1.65   1.44   2.76


         gcc 8 Double Precision                                  

         Pi 3B+               Pi 4B                  Pi4 Gains 
 KB+Ops    4       1   4 core   4       1   4 core   4      1  
  /Word  Cores   Core   Gain  Cores   Core   Gain  Cores   Core

  100- 2   711    203   3.50   3200    977   3.28   4.50   4.81
 1000- 2   193    168   1.15    274    295   0.93   1.42   1.76
10000- 2   199    172   1.16    273    307   0.89   1.37   1.78

  100- 8  1898    503   3.77   6771   2440   2.78   3.57   4.85
 1000- 8   730    434   1.68   1102   1072   1.03   1.51   2.47
10000- 8   755    435   1.74   1108   1255   0.88   1.47   2.89

  100-32  3072    793   3.87   9229   2725   3.39   3.00   3.44
 1000-32  2695    765   3.52   4256   2674   1.59   1.58   3.50
10000-32  2719    765   3.55   4469   2677   1.67   1.64   3.50
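
The gain columns in these tables are simple ratios of the MFLOPS columns. As a check, this short sketch recomputes them for the single precision 100 KB, 32 operations per word row. The helper name is hypothetical.

```python
def gain(faster, slower):
    """Ratio of two MFLOPS results, rounded as in the tables."""
    return round(faster / slower, 2)


# Single precision, 100 KB, 32 operations per word (MFLOPS from the table above)
pi3_4core, pi3_1core = 8497, 2023
pi4_4core, pi4_1core = 19921, 5341

print("Pi 3B+ 4 core gain:", gain(pi3_4core, pi3_1core))
print("Pi 4B  4 core gain:", gain(pi4_4core, pi4_1core))
print("Pi4 gain, 4 cores :", gain(pi4_4core, pi3_4core))
print("Pi4 gain, 1 core  :", gain(pi4_1core, pi3_1core))

# A RAM speed limited row for contrast: Pi 4B, 10000 KB, 2 operations per word
print("Pi 4B 4 core gain, RAM limited:", gain(542, 631))
```

A 4 core gain below 1.0, as in the last line, means the four threads together were slower than the single core run at that data size.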
Maximum Possible Double Precision GFLOPS

The latest floating point performance improvements, via gcc 8, are due to automatic compilation of NEON instructions, in the format shown below for 32 operations per word using double precision. These are not SIMD instructions, which would use 128 bit registers holding two DP words. Here, maximum speed would be two operations per clock cycle, via vfma fused multiply and add instructions, or 12 GFLOPS from four cores. ARM documentation appears to confirm that SIMD is not available using double precision functions. In this case, the DP benchmark measurements of around 10 GFLOPS are quite respectable.

I could not locate the maximum single precision speed, where SIMD is used to handle four words with one instruction. Fused operations could lead to 8 results per clock cycle, or 48 GFLOPS from four cores. The results here suggest that it might be 24 GFLOPS.

Code: Select all

.L18:                            
    vldr.64         d17, [r1]   
    vadd.f64        d16, d17, d4 
    vadd.f64        d18, d17, d0 
    vadd.f64        d25, d17, d15
    vadd.f64        d24, d17, d11
    vmul.f64        d16, d16, d5 
    vadd.f64        d23, d17, d31
    vadd.f64        d22, d17, d27
    vadd.f64        d21, d17, d2 
    vadd.f64        d20, d17, d6 
    vadd.f64        d19, d17, d13
    vfma.f64        d16, d18, d1 
    vadd.f64        d18, d17, d9 
    vadd.f64        d17, d17, d29
    vfma.f64        d16, d25, d14
    vfma.f64        d16, d24, d10
    vfma.f64        d16, d23, d30
    vfma.f64        d16, d22, d28
    vfms.f64        d16, d21, d3 
    vfms.f64        d16, d20, d7 
    vfms.f64        d16, d19, d12
    vfms.f64        d16, d18, d8 
    vfms.f64        d16, d17, d26
    vstmia.64       r1!, {d16}   
    cmp             r0, r1       
    bne             .L18         
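
The 12 and 48 GFLOPS figures, and the 32 operations per word, can be checked with a little arithmetic. The sketch below counts the floating point operations in the loop listed above, treating each fused vfma or vfms as two operations, and multiplies operations per clock cycle by clock speed and core count. The 1.5 GHz clock is the 1500 MHz reported for the Pi 4B elsewhere in this thread. The operations per cycle values are the ones assumed in the text, not measured.

```python
from collections import Counter

# Mnemonics of the loop body, in the order they appear in the listing above
MNEMONICS = (
    ["vldr.64"] + ["vadd.f64"] * 4 + ["vmul.f64"] + ["vadd.f64"] * 5
    + ["vfma.f64"] + ["vadd.f64"] * 2 + ["vfma.f64"] * 4
    + ["vfms.f64"] * 5 + ["vstmia.64", "cmp", "bne"]
)

# Floating point operations per instruction; fused multiply-add counts as two
FLOPS_PER_INSTRUCTION = {"vadd.f64": 1, "vmul.f64": 1, "vfma.f64": 2, "vfms.f64": 2}


def flops_per_word(mnemonics):
    """Total floating point operations in one loop iteration (one data word)."""
    counts = Counter(mnemonics)
    return sum(FLOPS_PER_INSTRUCTION.get(name, 0) * n for name, n in counts.items())


def peak_gflops(ops_per_cycle, ghz, cores):
    """Theoretical peak: operations per clock cycle x clock in GHz x cores."""
    return ops_per_cycle * ghz * cores


print("Operations per word in loop:", flops_per_word(MNEMONICS))
print("DP peak, 2 ops/cycle:", peak_gflops(2, 1.5, 4), "GFLOPS")
print("SP peak, 8 ops/cycle:", peak_gflops(8, 1.5, 4), "GFLOPS")
print("Measured DP 10.393 GFLOPS is", round(10.393 / peak_gflops(2, 1.5, 4) * 100), "% of peak")
```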
OpenMP-MemSpeed Benchmark

This is the same program as the single core MemSpeed benchmark, but with increased memory sizes and compiled using OpenMP directives. The same program was also compiled without these directives as a single core test, with similar Pi 4B/3B+ gains as the shorter version.

Although the source code appears to be suitable for speed up by parallelisation, many of the test functions are slower using OpenMP, and the effects are not the same on the Pi 3B+ and Pi 4B. Detailed comparisons of these results are rather meaningless. Following is one example.

Code: Select all

                           Pi 4B gcc 8                                   

     Memory Reading Speed Test OpenMP gcc 8 by Roy Longbottom           

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]     
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4    8097   8322   8641   8020   8436   8384  39701  19701  19712
       8    7814   8555   8756   8321   8548   8526  39042  19984  19996
      16    8149   7738   7742   8303   7779   8192  37995  19883  19984
      32    8969   8769   8799   9040   8759   8743  37737  20133  20130
      64    7617   7457   7437   7575   7380   7422  17770  15332  14248
     128   11221  10936  11003  11105  11011  10986  13650  13910  13881
     256   17883  18144  18036  17691  18094  17844  13073  12465  12535
     512   18001  18468  19675  17075  18221  19264  13511  13895  12008
    1024    9532  10590   9772  11842  11282  11277   7173   9473   9496
    2048    7095   7025   6866   7117   7043   6946   2914   3475   3468
    4096    7244   6927   7036   5951   7054   6531   2582   3130   3122
    8192    4578   7173   7025   6322   7078   7182   2504   3127   3115
   16384    5470   7043   7067   7103   7052   7020   2557   3093   3088
   32768    7359   7817   7766   7158   7078   7757   2618   3066   3094
   65536    7810   7268   7266   3824   7478   5164   2486   3016   2931
  131072    2460   2655   7224   7513   7308   7339   2540   2944   2940

 Single Core Results                                                                
       8   11775   3895   4342  11787   4325   4354  10334   7806   7816
     256   10032   3699   4223   9978   4289   4185   7105   7612   7621
   65536    2099   2587   3033   2103   3021   3001   2585   1105   1101
 
Next are I/O Benchmarks

RoyLongbottom
Posts: 276
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Sat Jun 29, 2019 2:05 pm

I/O Benchmarks

As used previously, variations of the same benchmark can be used to measure performance of local drives and networks, covering writing and reading of large and small files, and random access. A run time parameter can be used to increase the sizes of the large files.

LanSpeed Benchmark - WiFi

Following are Raspberry Pi 3B+ and Pi 4B results using what I believe were both 2.4 GHz and 5 GHz WiFi frequencies. Details on setting up the links can be found in the LAN/WiFi section of the following PDF file:

https://www.researchgate.net/publicatio ... ress_Tests

Performance of the two systems was similar at both frequencies.

Code: Select all

 ******************** Pi 3B+ 2.4 GHz ********************
 
                       MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8    5.71    6.07    5.96    5.69    5.46    4.76
      16    6.14    6.38    6.47    6.14    6.15    5.91

 Random     Read                    Write
 From MB       4       8      16       4       8      16
 msecs      2.94   3.081   3.185    3.04    2.89     3.7

 200 Files  Write                   Read                  Delete
 File KB       4       8      16       4       8      16   secs
 MB/sec     0.16    0.57    0.96    0.36    0.63    1.17
 ms/file    25.3   14.31    17.1   11.46   13.04   14.06   2.138

 ********************* Pi 3B+ 5 GHz *********************

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3
 
       8   12.82   14.52   14.00   10.98   11.09    8.94
      16   11.60   12.91    4.48    9.16    8.19    7.69

 200 Files  Write                   Read                  Delete
 File KB       4       8      16       4       8      16   secs
 MB/sec     0.41    0.76    1.46    0.41    0.74    1.46
 ms/file    9.96   10.83   11.19   10.11   11.02   11.23   1.990

 Random similar to 2.4 GHz


 ********************* Pi 4B 2.4 GHz ********************

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8    6.35    6.33    6.38    7.05    6.98    7.10
      16    6.70    6.82    6.76    7.19    6.53    7.22

 Random     Read                    Write
 From MB       4       8      16       4       8      16
 msecs     2.691   2.875   3.048    3.13    2.93    2.84

 200 Files  Write                   Read                  Delete
 File KB       4       8      16       4       8      16   secs
 MB/sec     0.34    0.44    1.04    0.37    0.37    1.26
 ms/file   12.14   18.59    15.7    11.1    22.2   12.99   2.153


 ********************** Pi 4B 5 GHz *********************

                         MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   11.90   12.96   13.16   10.11    9.55    9.66
      16   11.50   13.93   14.13    9.91    8.88    9.92

 200 Files  Write                   Read                  Delete
 File KB       4       8      16       4       8      16   secs
 MB/sec     0.13    0.46    0.91    0.25    0.55    1.02
 ms/file   30.85   17.83   18.10   16.62   14.93   16.01   3.361

 Random similar to 2.4 GHz
LanSpeed Benchmark - LAN 1 Gbps Pi 4

There can be significant variability in performance with these small samples. For the large files, the default sizes were increased to produce more stable speeds. In this case, 1 Gbps was clearly demonstrated using the Pi 4B, around three times faster than the Pi 3B+. Random access was mostly slightly faster via the Pi 4B and, with the small files, perhaps 25% faster on writing and 50% faster on reading.

Code: Select all

 ************************ Pi 3B+ ************************
 
                       MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   31.17   31.62   31.61    13.5   26.19   26.38
      16   31.62   31.89   31.76    26.7   26.94   27.01

 Random     Read                    Write
 From MB       4       8      16       4       8      16
 msecs     0.007    1.09   0.688    1.16    1.04    1.08

 200 Files  Write                   Read                  Delete
 File KB       4       8      16       4       8      16   secs
 MB/sec     1.15    2.26    4.18    1.73    3.18    5.66
 ms/file    3.57    3.62    3.92    2.36    2.58    2.89   0.511

 Larger Files
                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

      32   31.99   31.61   32.13   21.39   27.09   26.87
      64   32.33   32.37   32.35   26.94   26.98    26.7

 ************************ Pi 4B ************************

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   67.82   12.97   90.19   99.84   93.49   96.83
      16   92.25   92.66   92.96   103.9  105.28   91.17

Random     Read                    Write
From MB        4       8      16       4       8      16
msecs      0.007    0.01    0.04    1.01    0.85    0.91

200 Files  Write                   Read                  Delete
File KB        4       8      16       4       8      16  secs
MB/sec      1.47     2.8    5.14    2.47    4.71    8.61
ms/file     2.78    2.92    3.19    1.66    1.74     1.9   0.256

 Larger Files
                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

      32    78.2   34.46   80.71   84.94   87.11   84.97
      64   88.18   87.52   87.03  111.34  109.58  107.28

     128   98.84   99.24   96.58  110.99  110.57   87.43
     256  106.75  105.43   106.4   85.78  108.99  106.29
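
To relate these MBytes/Second figures to the 1 Gbps link speed, the following sketch converts them to Mbits/second and to a fraction of line rate. It assumes the benchmark's MB means 10^6 bytes and ignores protocol overheads, so the fractions are indicative only. The function names are hypothetical.

```python
LINK_MBITS = 1000.0  # nominal 1 Gbps Ethernet


def mbits_per_sec(mbytes_per_sec):
    """Convert MBytes/second to Mbits/second (8 bits per byte)."""
    return mbytes_per_sec * 8.0


def line_rate_fraction(mbytes_per_sec, link_mbits=LINK_MBITS):
    """Fraction of the nominal link speed that a measured rate represents."""
    return mbits_per_sec(mbytes_per_sec) / link_mbits


# Best large file results from the tables above
for label, mbytes in [("Pi 4B read, 64 MB", 111.34), ("Pi 3B+ write, 64 MB", 32.35)]:
    print(f"{label}: {mbits_per_sec(mbytes):.0f} Mbps, "
          f"{line_rate_fraction(mbytes):.0%} of 1 Gbps")
```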
DriveSpeed Benchmark - USB

Following are DriveSpeed results on Pi 3B+ and 4B, using the same high speed USB 3 stick (SanDisk Extreme with write/read ratings of 110/190 MB/s and 16 KB sectors). Other sticks would probably provide different comparative performance.

On the largest files shown, Pi 4B performance gains were 2.2 times on writing and 5.3 times on reading. Unlike LanSpeed, DriveSpeed uses Direct I/O, leading to an extra entry for cached files, where reading is mainly influenced by RAM speeds. Results can be too variable to provide meaningful comparisons.

Random access speeds were quite similar. On small files, reading speed was indicated as up to five times faster on the Pi 4B, but the 3B+ appeared to be nearly 30 times faster on writing.

For the Pi 4B, additional large file results are included for a Patriot Rage 2 USB 3 stick, rated as reading at up to 400 MB/second, with nearly 300 MB/second demonstrated using a Windows version of DriveSpeed. In this case, it appeared to be slightly slower than the first one on reading, but faster on writing, at 80 MB/second. This second drive also obtained those painfully slow speeds on writing small files.

Code: Select all

    ********************* Pi 3B+ USB 2 ********************

   DriveSpeed RasPi 1.1 Wed Apr 24 22:09:09 2019

 /media/pi/REMIX_OS/
 Total MB    9017, Free MB    7486, Used MB    1531

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   27.71   27.35   27.13   30.72    30.9   31.31
      16   27.21   27.54   23.69   29.89   31.34   31.27
Cached
       8   52.24   59.57   46.88  333.08  741.57  780.68

Random      Read                   Write
From MB        4       8      16       4       8      16
msecs      0.403   0.403   0.404    0.74    0.85    0.59

200 File   Write                    Read                  Delete
File KB        4       8      16       4       8      16   secs
MB/sec      1.10    2.12    3.82    6.04    9.17   14.01
ms/file     3.71    3.86    4.28    0.68    0.89    1.17   0.123

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

    1000   27.25   27.25   27.19   31.23   31.27   31.27
    2000   27.30   27.07   27.32   31.32   31.26   31.26

 ********************* Pi 4B USB 3 *********************

   DriveSpeed RasPi 1.1 Fri Apr 26 17:21:56 2019

 /media/pi/REMIXOSSYS//
 Total MB    5108, Free MB    3982, Used MB    1126

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   33.28   32.27   32.28  161.34  162.25  163.85
      16   39.85   41.95   43.02  164.07  165.53  165.84
Cached
       8   33.32   34.96   34.96  593.94  582.25  589.22

Random      Read                   Write
From MB        4       8      16       4       8      16
msecs      0.383   0.372   0.371    0.77    0.83    0.63

200 File   Write                    Read                  Delete
File KB        4       8      16       4       8      16   secs
MB/sec      0.04    0.07    0.15   20.64   41.04   70.01
ms/file   110.04  109.97  110.01    0.20    0.20    0.23   0.089

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

     500   56.36   58.13   55.25  166.31  165.46  165.43
    1000   59.56   61.46   60.54  161.69  165.97  166.49


 /media/pi/PATRIOT/
 Total MB  120832, Free MB  120832, Used MB       0

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

    1000   80.87   80.23   81.92  131.41  130.72  130.39
    2000   83.67   81.82   82.14  130.85  131.29  131.36
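
In the small file sections, the MB/sec and ms/file rows appear to be two views of the same measurement. The sketch below reproduces the printed MB/sec values from the ms/file values, assuming file sizes in units of 1024 bytes and MB as 10^6 bytes. That convention is inferred from the numbers in the tables, not from the benchmark source, and the function name is hypothetical.

```python
def mb_per_sec(file_kb, ms_per_file):
    """MB/second (10^6 bytes) for one file of file_kb x 1024 bytes handled in ms_per_file."""
    return file_kb * 1024 / ms_per_file / 1000.0


# 4 KB file writing times from the USB tables above
pi3_ms, pi4_ms = 3.71, 110.04

print(f"Pi 3B+ 4 KB write: {mb_per_sec(4, pi3_ms):.2f} MB/sec")
print(f"Pi 4B  4 KB write: {mb_per_sec(4, pi4_ms):.2f} MB/sec")
print(f"Pi 3B+ advantage : {pi4_ms / pi3_ms:.1f} times")
```

The ratio of the two writing times is where a figure of nearly 30 times comes from.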
DriveSpeed Benchmark Pi 4B Main Drive

This demonstrates DriveSpeed performance measured on the main drive which, in this case, was nowhere near USB 3 speeds.

Code: Select all

     DriveSpeed RasPi 1.1 Mon Apr 29 10:20:57 2019

 Current Directory Path: /home/pi/Raspberry_Pi_Benchmarks/DriveSpeed/drive1
 Total MB   14845, Free MB    8198, Used MB    6646

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   16.41   11.21   12.27   39.81   40.10   40.39
      16   11.79   21.10   34.05   40.18   40.19   40.33
Cached
       8  137.47  156.43  285.59  580.73  598.66  587.97

Random      Read                   Write
From MB        4       8      16       4       8      16
msecs      0.371   0.371   0.363    1.28    1.53    1.30

200 File   Write                    Read                  Delete
File KB        4       8      16       4       8      16   secs
MB/sec      3.49    6.41    8.26    7.67   11.68   17.51
ms/file     1.17    1.28    1.98    0.53    0.70    0.94   0.014
Next Java and OpenGL

RoyLongbottom
Posts: 276
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Sat Jun 29, 2019 3:08 pm

Java and OpenGL

The Java benchmarks were run after installing Oracle Java 8, then OpenJDK11 later.

Java Whetstone Benchmark

Pi 4B performance was nearly as good as the compiled C version. However, there can be wide variations involving new Java versions. Here, the Pi 3B+ overall MWIPS rating was particularly slow, entirely due to the time taken by the sin, cos and exp, sqrt tests. Other than these, Pi 4B gains in the results below ranged from almost none (assignments) to just over three times (if then else and the N6 floating point test).

Code: Select all

 ************************ Pi 3B+ ************************

      Whetstone Benchmark Java Version, May 14 2019, 15:02:11

                                                       1 Pass
  Test                  Result       MFLOPS     MOPS  millisecs

  N1 floating point  -1.124750137    215.20             0.0892
  N2 floating point  -1.131330490    208.76             0.6438
  N3 if then else     1.000000000             103.58    0.9992
  N4 fixed point     12.000000000             538.09    0.5854
  N5 sin,cos etc.     0.499110103               7.04   11.8100
  N6 floating point   0.999999821    106.22             5.0780
  N7 assignments      3.000000000             322.85    0.5724
  N8 exp,sqrt etc.    0.751108646               1.38   26.9200

  MWIPS                              214.14            46.6980

  Operating System    Linux, Arch. arm, Version 4.14.70-v7+
  Java Vendor         Oracle Corporation, Version  1.8.0_212
 

 ************************ Pi 4B ************************

     Whetstone Benchmark Java Version, May 14 2019, 14:16:44

                                                       1 Pass
  Test                  Result       MFLOPS     MOPS  millisecs

  N1 floating point  -1.124750137    503.94             0.0381
  N2 floating point  -1.131330490    488.37             0.2752
  N3 if then else     1.000000000             332.80    0.3110
  N4 fixed point     12.000000000             881.37    0.3574
  N5 sin,cos etc.     0.499110132              42.92    1.9384
  N6 floating point   0.999999821    345.77             1.5600
  N7 assignments      3.000000000             332.97    0.5550
  N8 exp,sqrt etc.    0.825148463              25.00    1.4880

  MWIPS                             1533.01             6.5231

  Operating System    Linux, Arch. arm, Version 4.19.29-v7l+
  Java Vendor         Oracle Corporation, Version  1.8.0_212

 ******************* Pi 4B OpenJDK11 *******************

  Whetstone Benchmark OpenJDK11 Java Version, May 15 2019, 18:48:20

                                                       1 Pass
  Test                  Result       MFLOPS     MOPS  millisecs

  N1 floating point  -1.124750137    524.02             0.0366
  N2 floating point  -1.131330490    494.12             0.2720
  N3 if then else     1.000000000             289.92    0.3570
  N4 fixed point     12.000000000            1092.99    0.2882
  N5 sin,cos etc.     0.499110132              59.86    1.3900
  N6 floating point   0.999999821    345.95             1.5592
  N7 assignments      3.000000000             331.54    0.5574
  N8 exp,sqrt etc.    0.825148463              25.41    1.4640

  MWIPS                             1687.92             5.9244

  Operating System    Linux, Arch. arm, Version 4.19.37-v7l+
  Java Vendor         BellSoft, Version  11.0.2-BellSoft
JavaDraw Benchmark

The benchmark draws simple objects, in numbers ranging from small to rather excessive, to measure drawing performance in Frames Per Second (FPS). Five tests draw on a background of continuously changing colour shades, each test adding to the load. In order for this to run at maximum speed, it was necessary to disable the experimental GL driver. A later version was produced and run via OpenJDK11.

Pi 4B performance gains were best on the most complex test function.

Code: Select all

  ************************ Pi 3B+ ************************

   Java Drawing Benchmark, May 14 2019, 15:32:06
            Produced by javac 1.6.0_27

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1      566    56.55
  Display PNG Bitmap Twice Pass 2      651    65.00
  Plus 2 SweepGradient Circles         665    66.45
  Plus 200 Random Small Circles        660    65.93
  Plus 320 Long Lines                  442    44.16
  Plus 4000 Random Small Circles       334    33.30

         Total Elapsed Time  60.1 seconds

  Operating System    Linux, Arch. arm, Version 4.14.70-v7+
  Java Vendor         Oracle Corporation, Version  1.8.0_212


 ************************ Pi 4B ************************

   Java Drawing Benchmark, May 14 2019, 14:33:58
            Produced by javac 1.7.0_02

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1      791    79.05
  Display PNG Bitmap Twice Pass 2      932    93.11
  Plus 2 SweepGradient Circles        1152   115.17
  Plus 200 Random Small Circles       1200   119.98
  Plus 320 Long Lines                  784    78.31
  Plus 4000 Random Small Circles       621    62.03

         Total Elapsed Time  60.1 seconds

  Operating System    Linux, Arch. arm, Version 4.19.29-v7l+
  Java Vendor         Oracle Corporation, Version  1.8.0_212

 ******************* Pi 4B OpenJDK11 *******************

   Java Drawing Benchmark, May 15 2019, 18:55:41
            Produced by OpenJDK 11 javac

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1      877    87.65
  Display PNG Bitmap Twice Pass 2     1042   104.18
  Plus 2 SweepGradient Circles        1015   101.47
  Plus 200 Random Small Circles        779    77.85
  Plus 320 Long Lines                  336    33.52
  Plus 4000 Random Small Circles        83     8.25

         Total Elapsed Time  60.1 seconds

  Operating System    Linux, Arch. arm, Version 4.19.37-v7l+
  Java Vendor         BellSoft, Version  11.0.2-BellSoft
OpenGL GLUT Benchmark

In 2012, I approved a request from a Quality Engineer at Canonical, to use this OpenGL benchmark in the testing framework of the Unity desktop software. The program can be run as a benchmark, or selected functions, as a stress test of any duration.

After installing freeglut3, the benchmark ran as before. The benchmark measures graphics speed in terms of Frames Per Second (FPS) via six simple and more complex tests. The first four tests portray moving up and down a tunnel including various independently moving objects, with and without texturing. The last two tests represent a real application for designing kitchens. The first is in wireframe format, drawn with 23,000 straight lines. The second has colours and textures applied to the surfaces.

As a benchmark, it was run using the following script file

Code: Select all

  export vblank_mode=0                                     
  ./videogl32 Width 320, Height 240, NoEnd                 
  ./videogl32 Width 640, Height 480, NoHeading, NoEnd      
  ./videogl32 Width 1024, Height 768, NoHeading, NoEnd     
  ./videogl32 NoHeading                                    
Following are results from the Pi 3B+ and Pi 4B. The early tests depend on graphics speed, with the later ones becoming CPU speed dependent.

Code: Select all

 ************************ Pi 3B+ ************************

 GLUT OpenGL Benchmark 32 Bit Version 1, Fri Apr 12 22:21:35 2019

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240    343.8    208.3     88.4     56.6     24.3     15.5
   640   480    243.0    170.3     82.8     54.5     24.2     15.5
  1024   768    110.6    101.2     63.6     47.8     24.1     15.4
  1920  1080     49.5     47.3     36.8     32.9     23.4     14.9


 ************************ Pi 4B ************************

 GLUT OpenGL Benchmark 32 Bit Version 1, Thu May  2 19:01:05 2019

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240    766.7    371.4    230.6    130.2     32.5     22.7
   640   480    427.3    276.5    206.0    121.8     31.7     22.2
  1024   768    193.1    178.8    150.5    110.4     31.9     21.5
  1920  1080     81.4     79.4     74.6     68.3     30.8     20.0
Stress Tests

I have run a series of stress tests, some of which lead to high temperatures and slow performance, due to CPU clock throttling. These will be rerun, as changes aimed at reducing temperatures have since been incorporated.

RoyLongbottom
Posts: 276
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Wed Jul 10, 2019 5:34 pm

Pi 4B Stress Tests

First Unstressed Tests - It is quite easy to produce programs that run at high speeds on all cores of a modern computer, be it a PC, tablet, phone or a small board system like the Raspberry Pi. These programs are likely to lead to increased CPU temperatures. Given insufficient cooling arrangements, the systems are likely to continuously reduce CPU MHz (throttling) in order to continue operation, and eventually power down. Before examining the results of stress testing, it is useful to consider what can be run without throttling occurring, in this case, on a Raspberry Pi 4, without any cooling.

Graphics Tests No Cooling

Earlier, I connected the Pi 4 system to BBC iPlayer, via WiFi, and displayed programmes for more than two hours on a full screen 1920x1080 display (not a hot day). With CPU utilisation around 100% of one core, maximum temperature was 70°C, with CPU at 1500 MHz all the time.

OpenGL

For this exercise, I ran the OpenGL Textured Kitchen test for an hour, with a full screen display (hotter day than above). Following is a summary of results recorded by the program, the environmental monitor and vmstat. The program ran at 22 FPS over the whole period, with CPU at a constant 1500 MHz, recording slightly more than 100% utilisation of one core, with maximum temperature reaching 73°C.

Code: Select all

            ------- Monitors ------  --------- vmstat ---------  Video
                                                                  gl32
                           °C    °C      
 Seconds     MHz   Volts   CPU  PMIC    free    User System Idle   FPS

       0    1500  0.8894    61    54 3589900       0     0   100
     120    1500  0.8841    69    59 3523336      25     2    73    22
     240    1500  0.8841    71    62 3520464      25     2    73    22
     360    1500  0.8841    71    63 3522848      25     2    73    22
     480    1500  0.8841    73    63 3522292      25     2    73    22
     600    1500  0.8841    72    63 3522284      25     2    73    22
     720    1500  0.8841    72    63 3521780      24     2    74    22
     840    1500  0.8841    73    63 3520640      25     2    73    22
     960    1500  0.8841    72    63 3520884      25     2    73    22
    1080    1500  0.8841    72    63 3520140      25     2    73    22
    1200    1500  0.8841    73    63 3519864      24     2    73    22
    1320    1500  0.8841    73    63 3519892      25     2    73    22
    1440    1500  0.8841    73    63 3519892      25     2    73    22
    1560    1500  0.8841    73    63 3518880      25     2    73    22
    1680    1500  0.8841    72    63 3519264      25     2    73    22
    1800    1500  0.8841    73    63 3517976      25     2    73    22
    1920    1500  0.8841    73    63 3518616      25     2    73    22
    2040    1500  0.8841    72    63 3517984      25     2    73    22
    2160    1500  0.8841    72    63 3518604      24     2    73    22
    2280    1500  0.8841    73    63 3518496      25     2    73    22
    2400    1500  0.8841    73    63 3518868      25     2    73    22
    2520    1500  0.8841    72    63 3518488      25     2    73    22
    2640    1500  0.8841    73    63 3518212      25     2    73    22
    2760    1500  0.8841    73    63 3520008      25     2    73    22
    2880    1500  0.8841    73    63 3519756      25     2    73    22
    3000    1500  0.8841    73    63 3516752      25     3    72    22
    3120    1500  0.8841    73    63 3518132      25     2    73    22
    3240    1500  0.8841    73    63 3518132      25     2    73    22
    3360    1500  0.8841    73    63 3517620      24     2    73    22
    3480    1500  0.8841    73    63 3517428      25     2    73    22
    3600    1500  0.8841    73    63 3517656      25     2    73    22
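
The vmstat User and System columns are percentages of total CPU time across all cores, so around 25% on the four core Pi 4B corresponds to roughly one fully loaded core. A minimal sketch of that conversion, with a hypothetical function name:

```python
def cores_busy(user_pct, system_pct, n_cores=4):
    """Equivalent number of fully busy cores from system-wide vmstat percentages."""
    return (user_pct + system_pct) / 100.0 * n_cores


# Typical row from the table above: User 25, System 2, Idle 73
busy = cores_busy(25, 2)
print(f"{busy:.2f} cores busy, or {busy * 100:.0f}% of one core")
```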
Single Core and Multi-Core CPU Tests

Below are results from running five-minute MP-Integer-Tests on a Raspberry Pi 4B, out of its case, with no cooling attachment. As indicated earlier, continuous speed measurements from a benchmark give a better understanding of behaviour than samples of CPU MHz, which can vary rapidly.

Starting and maximum recorded temperatures are shown, along with the time at which 80°C was reached (if it was), the point where throttling starts. The first column is for a run using a single thread, where CPU MHz, and effectively the measured speeds, were constant over the whole period. The second column provides details when using four threads, with data in L1 caches. The next two used data in L2 cache. Here throttling started after one minute, sooner than in the L1 run, though from a higher starting temperature. The last column provides results with data in RAM, where the test ran at full speed for over four and a half minutes.

Code: Select all

                Speed in MB/second

   KB        512      64     640    1536   15624
   Threads     1       4       4       8       4
   Start°C    62      60      62      64      61

  Seconds
      10    5718   23631   22628   20177    3445
      20    5717   23603   22634   18329    3443
      30    5640   23416   22670   18756    3405
      40    5735   23613   22045   17737    3440
      50    5740   23618   22636   18456    3444
      60    5652   23244   22069   19059    3410
      70    5707   23483   19864   17648    3437
      80    5736   23360   18639   16017    3445
      90    5683   21552   17986   16654    3447
     100    5695   20867   17383   14864    3395
     110    5719   20218   16475   14805    3437
     120    5672   19017   16207   15128    3443
     130    5727   18871   15165   13328    3401
     140    5735   18888   14773   12638    3437
     150    5732   18460   14979   12780    3443
     160    5677   17799   14780   13086    3440
     170    5719   17976   14313   13221    3404
     180    5711   18005   14391   12618    3443
     190    5650   17745   14018   12185    3440
     200    5738   17312   14120   13267    3397
     210    5709   17241   14062   11916    3442
     220    5678   17124   14004   11866    3441
     230    5719   17392   13467   12018    3397
     240    5720   16990   13728   11825    3440
     250    5651   17289   13372   12011    3434
     260    5714   17135   13683   11596    3442
     270    5717   16891   13584   11481    3398
     280    5657   16505   13055   11781    3442
     290    5725   17049   13396   11550    3445
     300    5713   16578   12957   11666    3402

   Max      5740   23631   22670   20177    3447
   Min      5640   16505   12957   11481    3395
   %          98      70      57      57      98

  Max °C      72      82      84      85      80
 Time 80°C   N/A      90      60      60     280
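
As a rough sketch of how the summary rows can be reproduced, the Python below recomputes Max, Min and % for the 64 KB, 4-thread column, and finds the first 10 second sample that falls noticeably below the starting speed. The function names and the 95% threshold are my own assumptions for illustration; they are not taken from the benchmark source.

```python
# Speeds in MB/second copied from the 64 KB, 4-thread column above,
# one sample every 10 seconds.
speeds_64kb_4t = [
    23631, 23603, 23416, 23613, 23618, 23244, 23483, 23360, 21552, 20867,
    20218, 19017, 18871, 18888, 18460, 17799, 17976, 18005, 17745, 17312,
    17241, 17124, 17392, 16990, 17289, 17135, 16891, 16505, 17049, 16578,
]


def min_max_percent(samples):
    """Return (max, min, min as a rounded percentage of max)."""
    hi, lo = max(samples), min(samples)
    return hi, lo, round(100 * lo / hi)


def first_drop_time(samples, interval_s=10, threshold=0.95):
    """Return the time in seconds of the first sample slower than
    threshold * the first sample, or None if speed never drops that far.
    The 0.95 threshold is an arbitrary choice for this sketch."""
    baseline = samples[0]
    for index, speed in enumerate(samples, start=1):
        if speed < threshold * baseline:
            return index * interval_s
    return None


print(min_max_percent(speeds_64kb_4t))
print(first_drop_time(speeds_64kb_4t))
```

With this column the first slow sample falls at the same time as the table's "Time 80°C" entry, which is the point of measuring speed continuously rather than sampling MHz.
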
Integer Stress Tests - MP-IntStress

The following are results of 15 minute stress tests, using 1280 KB of data and 8 threads. The data is larger than the L2 cache but, as only four threads were executed at a time, the data in use was still held in cache. The tests therefore ran at full cache speed, with some additional swapping of cached data.
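
The cache reasoning can be checked with some simple arithmetic. This is only a simplified model, assuming the data is divided equally into one dedicated segment per thread, that at most four threads run at once on the four cores, and that the Pi 4B has 32 KB of L1 data cache per core and 1 MB of shared L2. The function name and the model itself are mine, not part of the benchmark.

```python
L1_DATA_KB_PER_CORE = 32  # assumption: L1 data cache per Cortex-A72 core
L2_SHARED_KB = 1024       # assumption: L2 cache shared by all four cores
CORES = 4


def working_set(data_kb, threads, cores=CORES):
    """Return (KB per thread, KB in use at once, likely cache level).

    Simplified model: equal segments per thread, and no more threads
    running at any moment than there are cores.
    """
    per_thread_kb = data_kb / threads
    resident_kb = per_thread_kb * min(threads, cores)
    if per_thread_kb <= L1_DATA_KB_PER_CORE:
        level = "L1"
    elif resident_kb <= L2_SHARED_KB:
        level = "L2"
    else:
        level = "RAM"
    return per_thread_kb, resident_kb, level


# Data sizes and thread counts used in the tests in this post.
for data_kb, threads in [(64, 4), (640, 4), (1536, 8), (15624, 4), (1280, 8)]:
    print(data_kb, threads, working_set(data_kb, threads))
```

For 1280 KB and 8 threads this gives 160 KB per thread and 640 KB in use by four running threads, which is under the assumed 1 MB of L2.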

Four tests were carried out: with no added cooling on a bare board, with a copper heatsink fitted, then with the official, and expensive, Power Over Ethernet fan and, finally, in an inexpensive case with a fitted fan. The changing CPU MHz measurements show that throttling is occurring but, being coarsely sampled, they do not reflect real performance, unlike the MB/second figures.

With no cooling, throttling started after about a minute, with temperatures reaching 85°C to 86°C and performance slowly falling to almost half speed. The copper heatsink produced a small improvement. During the two tests where fans were used, the processor ran continuously at 1500 MHz and throughput was effectively constant. The POE fan appeared to be slightly more effective, with a maximum of 61°C against 66°C for the case fan.

Code: Select all

                No Cooling         Copper Heatsink    Official POE Hat   Case With Fan

Seconds MB/sec   MHz  °C   MB/sec   MHz  °C   MB/sec   MHz  °C   MB/sec   MHz  °C

      0         1500  60           1500  60           1500  47           1500  41
     20  21651  1500  73    21381  1500  71    21770  1500  56    22018  1500  54
     40  21892  1500  79    20517  1500  74    21767  1500  57    21979  1500  56
     60  20919  1500  81    21407  1500  77    22234  1500  57    22076  1500  58
     80  17174  1000  81    21153  1500  79    22035  1500  58    22248  1500  60
    100  15643  1000  81    20960  1500  81    21920  1500  59    22153  1500  61
    120  15163  1000  82    18967  1500  82    22184  1500  60    22239  1500  63
    140  14756  1000  81    16828  1000  81    21941  1500  60    22037  1500  64
    160  14491  1000  83    15892  1500  83    21863  1500  60    22231  1500  65
    180  14492  1000  83    16157  1000  82    21753  1500  60    22130  1500  64
    200  14283  1000  84    15039  1000  82    21921  1500  60    22050  1500  65
    220  14386  1000  83    15438  1000  82    21656  1500  60    22210  1500  66
    240  14101  1000  83    14905  1000  82    21908  1500  60    22132  1500  65
    260  13574  1000  84    14597  1000  83    21983  1500  60    22298  1500  65
    280  13763  1000  83    14703  1000  83    21701  1500  60    22031  1500  66
    300  13179  1000  84    14519  1000  82    21857  1500  60    22285  1500  65
    320  13566  1000  84    14204  1000  84    21791  1500  60    22009  1500  65
    340  13368   750  84    14139   750  83    21468  1500  60    22101  1500  65
    360  13530  1000  84    14249  1000  84    22162  1500  60    22166  1500  65
    380  13190  1000  85    14457  1000  82    21819  1500  61    22163  1500  66
    400  13215  1000  84    14395  1000  83    21800  1500  60    22243  1500  65
    420  13021   750  85    14365  1000  83    22083  1500  61    22115  1500  64
    440  13127  1000  84    14214  1000  83    21780  1500  60    22172  1500  64
    460  12933  1000  85    14152  1000  83    21902  1500  60    22138  1500  64
    480  12658  1000  85    14090  1000  84    21964  1500  60    22220  1500  64
    500  12981   750  83    14199  1000  84    22026  1500  61    22061  1500  65
    520  12699  1000  85    14005  1000  83    21661  1500  61    22027  1500  64
    540  12622  1000  84    13987  1000  84    21684  1500  60    22281  1500  65
    560  12761  1000  84    14222  1000  84    22071  1500  59    22097  1500  64
    580  13408  1000  84    13845  1000  84    21728  1500  58    22225  1500  64
    600  13878  1000  85    13945  1000  84    21981  1500  59    22091  1500  62
    620  13893  1000  83    13877  1000  84    21704  1500  58    22203  1500  62
    640  13717  1000  86    13844  1000  84    21935  1500  58    22133  1500  62
    660  13321  1000  85    13774  1000  83    21816  1500  61    22075  1500  62
    680  13154  1000  85    13500  1000  83    21827  1500  61    22229  1500  63
    700  12663  1000  85    13926  1000  83    21995  1500  60    22007  1500  63
    720  12504  1000  85    13722  1000  83    22004  1500  60    22279  1500  64
    740  12501   750  85    13778   750  84    21954  1500  60    22020  1500  65
    760  12227  1000  85    13564  1000  83    21848  1500  60    22270  1500  65
    780  12199   750  85    13755  1000  82    21840  1500  61    22129  1500  65
    800  12505  1000  85    13451  1500  82    22137  1500  59    22175  1500  64
    820  12268   750  85    13587  1000  83    21876  1500  60    22210  1500  64
    840  12322  1500  78    13610  1000  82    21685  1500  61    22041  1500  65
    860  12312  1500  74    14411  1500  75    22077  1500  61    22192  1500  65
    880  12306  1500  73    14380  1500  73    21842  1500  61    22109  1500  65
    900  12305  1500  71    14345  1500  73    21883  1500  61    22199  1500  65

    Max  21892        86    21407        84    22234        61    22298        66
    Min  12199   750        13451   750        21468  1500        21979  1500
%Min/Max    56                 63                 97                 99
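
To illustrate why the sampled CPU MHz can mislead, the sketch below takes a few rows from the Copper Heatsink columns above and lists the times where the MHz sample reads 1500 although throughput is well below the peak. The 90% cut-off and the function name are assumptions made for this example only.

```python
# (seconds, MB/sec, sampled MHz) copied from the Copper Heatsink columns above.
copper_rows = [
    (60, 21407, 1500),
    (80, 21153, 1500),
    (100, 20960, 1500),
    (120, 18967, 1500),
    (140, 16828, 1000),
    (160, 15892, 1500),
    (180, 16157, 1000),
    (200, 15039, 1000),
]


def misleading_samples(rows, full_mhz=1500, cutoff=0.90):
    """Return the times where the MHz sample shows full clock speed
    but measured MB/sec is below cutoff * the best MB/sec in rows.
    The 0.90 cutoff is an arbitrary choice for this sketch."""
    peak = max(mb_per_sec for _, mb_per_sec, _ in rows)
    return [
        seconds
        for seconds, mb_per_sec, mhz in rows
        if mhz == full_mhz and mb_per_sec < cutoff * peak
    ]


print(misleading_samples(copper_rows))
```

The rows it picks out show 1500 MHz at the instant of sampling, even though the MB/second figure for the preceding 20 seconds indicates that the CPU was throttled for part of that interval.
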
Single Precision Floating Point Stress Tests - MP-FPUStress

The table below covers the first 10 minutes, plus the final sample at 900 seconds, of tests on three cooling configurations. This time, the rather meaningless variations in recorded CPU MHz are not included. Again, the tests used 1280 KB of data (320K words) and 8 threads, with 8 floating point operations per word. Maximum temperatures and performance degradation were similar to those during the integer tests.
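
The word count and the way a GFLOPS figure follows from it can be sketched as below. This assumes 4 byte single precision and 8 byte double precision words, and a hypothetical count of passes over the data in a timed interval. The benchmark's own counting is not shown in this post, so treat the function names and parameters as illustrative only.

```python
def words_in(data_kb, bytes_per_word):
    """Number of words in data_kb kilobytes (1 KB = 1024 bytes)."""
    return data_kb * 1024 // bytes_per_word


def gflops(words, ops_per_word, passes, seconds):
    """Billions of floating point operations per second, given how many
    complete passes over the data were made in the timed interval.
    'passes' is a hypothetical parameter for this sketch."""
    return words * ops_per_word * passes / seconds / 1e9


sp_words = words_in(1280, 4)  # single precision: 327680 words, that is 320K
dp_words = words_in(1280, 8)  # double precision: half as many words
print(sp_words, dp_words)
print(gflops(sp_words, 8, passes=1000, seconds=1.0))
```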

The following graphs provide a more meaningful indication of the effects of the adequate cooling that is needed for this kind of CPU utilisation (confirmed by vmstat, during the run, as 100% of four cores).

Code: Select all

            No Cooling  Copper HS    Case+Fan

Seconds     °C GFLOPS   °C GFLOPS    °C GFLOPS

      0     61          59           40
     20     76  19.2    73   19.6    55  20.7
     40     81  19.0    78   19.4    61  20.3
     60     82  17.8    80   19.6    62  20.2
     80     83  15.5    82   17.2    64  20.7
    100     84  15.0    82   15.6    65  20.2
    120     83  14.0    82   14.5    66  20.3
    140     84  13.3    81   13.9    65  20.3
    160     84  13.3    83   13.9    66  20.7
    180     86  12.9    83   13.5    67  20.3
    200     85  13.0    83   13.6    67  20.3
    220     84  12.8    84   13.4    66  20.4
    240     84  12.6    83   13.3    67  20.6
    260     83  12.6    84   13.3    67  20.3
    280     85  12.2    84   13.3    67  20.4
    300     84  12.1    83   13.0    67  20.3
    320     85  12.0    84   13.0    67  20.8
    340     84  11.6    85   12.8    67  20.3
    360     85  11.6    84   13.0    67  20.2
    380     85  11.3    83   12.7    67  20.7
    400     85  11.6    84   12.8    67  20.5
    420     84  11.6    84   12.5    68  20.2
    440     85  11.5    84   12.7    67  20.4
    460     84  11.5    85   12.6    67  20.4
    480     85  11.5    84   12.3    66  20.2
    500     84  11.1    85   12.4    67  20.3
    520     85  11.3    83   12.4    67  20.2
    540     84  11.4    85   12.4    68  20.5
    560     84  11.3    84   12.3    67  20.2
    580     85  11.3    83   12.3    67  20.4
    600     85  11.3    84   12.3    67  20.2

    900     85  10.9    84   12.2    67  20.3

   Max          19.2         19.6        20.8
   Min          10.9         12.2        20.3
%Min/Max          57           62          98
[Image: GFLOPS.gif]


Double Precision Floating Point Stress Tests - MP-FPUStressDP

Four sets of results are below, again excluding the CPU MHz figures but including PMIC temperatures. All use 8 threads and are shown without, then with, the case and fan: the first pair with a 1280 KB data size at 8 operations per word, the second pair with 128 KB at 32 operations per word.

The second configuration (128 KB), using data in L1 caches, runs at a higher speed and lower temperature than the first (1280 KB), which uses L2 cache. Maximum temperature and performance degradation of the latter, without the fan, were similar to the earlier examples.

Code: Select all

DP         1280 KB, 8 threads, 8 ops/word        128 KB, 8 threads, 32 ops/word
            No Cooling          Case+Fan           No Cooling          Case+Fan

                 CPU  PMIC          CPU  PMIC          CPU  PMIC          CPU  PMIC
Seconds GFLOPS    °C    °C GFLOPS    °C    °C GFLOPS    °C    °C GFLOPS    °C    °C

      0           48    42           45  42.0           54  47.7           39  35.4
     20    9.3    64  55.2    9.1    61  55.2   10.7    70  57.1   10.7    39  35.4
     40    9.2    73  62.8    9.0    65  59.0   10.6    73  61.8   10.7    53  43.9
     60    9.2    79  68.4    9.1    67  61.8   10.7    75  64.6   10.6    56  48.6
     80    8.8    80  70.3    9.3    66  62.8   10.7    78  67.5   10.6    57  50.5
    100    7.8    81  70.3    9.1    67  62.8   10.7    80  69.4   10.7    58  51.4
    120    7.2    82  70.3    9.2    67  62.8   10.1    82  70.3   10.7    59  53.3
    140    6.8    82  70.3    9.3    67  62.8    9.5    81  70.3   10.7    59  53.3
    160    6.5    82  70.3    9.1    68  62.8    9.1    80  70.3   10.6    59  53.3
    180    6.3    82  70.3    9.1    68  62.8    8.7    82  70.3   10.7    60  53.3
    200    6.1    81  70.3    9.3    68  64.6    8.5    81  70.3   10.7    59  54.3
    220    6.2    82  70.3    9.1    69  62.8    8.5    82  70.3   10.7    59  54.3
    240    6.2    83  72.2    9.1    68  62.8    8.3    81  70.3   10.6    60  54.3
    260    6.1    83  72.2    9.3    68  62.8    8.3    81  70.3   10.7    59  54.3
    280    6.1    84  72.2    9.1    67  64.6    8.0    83  70.3   10.7    61  54.3
    300    6.1    83  70.3    9.1    68  64.6    8.0    81  70.3   10.6    60  54.3
    320    6.0    84  72.2    9.1    68  64.6    7.9    82  70.3   10.7    61  54.3
    340    5.9    85  72.2    9.2    68  64.6    7.6    82  71.2   10.8    61  53.3
    360    5.8    85  72.2    9.1    68  62.8    7.7    82  70.3   10.7    60  54.3
    380    5.8    84  72.2    9.2    68  64.6    7.8    83  70.3   10.6    60  54.3
    400    5.7    84  72.2    9.1    68  62.8    7.7    83  70.3   10.6    61  54.3
    420    5.7    84  72.2    9.2    68  62.8    7.7    82  70.3   10.6    60  54.3
    440    5.6    84  72.2    9.1    68  64.6    7.6    82  70.3   10.7    60  54.3
    460    5.7    84  72.2    9.1    68  62.8    7.6    83  70.3   10.6    61  54.3
    480    5.6    84  72.2    9.1    69  64.6    7.5    82  70.3   10.7    60  54.3
    500    5.6    84  72.2    9.1    69  62.8    7.5    82  71.2   10.6    60  54.3
    520    5.5    85  72.2    9.1    68  62.8    7.4    81  70.3   10.7    60  54.3
    540    5.5    84  74.1    9.3    67  64.6    7.4    82  70.3   10.7    60  54.3
    560    5.5    84  72.2    9.1    69  62.8    7.4    82  70.3   10.8    59  54.3
    580    5.4    84  74.1    9.1    67  64.6    7.3    82  70.3   10.7    60  55.2
    600    5.5    84  74.1    9.2    68  62.8    7.3    81  70.3   10.7    60  54.3
    620    5.4    85  74.1    9.2    68  62.8    7.3    82  70.3   10.6    61  54.3
    640    5.4    84  74.1    9.2    69  62.8    7.3    83  70.3   10.6    62  55.2
    660    5.4    85  74.1    9.3    68  62.8    7.3    83  70.3   10.7    60  54.3
    680    5.5    85  72.2    9.0    67  62.8    7.3    83  70.3   10.7    60  54.3
    700    5.4    85  74.1    9.1    69  62.8    7.3    81  70.3   10.7    60  54.3
    720    5.4    85  72.2    9.2    68  64.6    7.3    84  70.3   10.7    60  54.3
    740    5.4    84  72.2    9.1    68  62.8    7.3    82  70.3   10.7    60  55.2
    760    5.3    85  74.1    9.1    68  62.8    7.3    81  70.3   10.7    60  54.3
    780    5.4    85  74.1    9.3    67  62.8    7.3    83  70.3   10.7    59  54.3
    800    5.4    84  74.1    9.1    69  64.6    7.3    81  70.3   10.7    60  54.3
    820    5.3    85  72.2    9.1    68  62.8    7.3    82  70.3   10.7    60  54.3
    840    5.3    84  72.2    9.2    68  62.8    7.2    82  70.3   10.7    60  54.3
    860    5.2    85  74.1    9.1    69  64.6    7.2    81  70.3   10.6    60  54.3
    880    5.2    85  74.1    9.1    68  62.8    7.2    82  70.3   10.6    60  54.3
    900    5.3    84  74.1    9.1    68  62.8    7.2    81  70.3   10.6    60  54.3

   Max     9.3    85  74.1    9.3    69  64.6   10.7    84  71.2   10.8    62  55.2
   Min     5.2                9.0                7.2               10.6
%Min/Max    57                 97                 67                 98
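
For anyone wanting to log similar MHz and temperature columns alongside their own tests, here is a minimal Python sketch. It assumes Raspberry Pi OS with the vcgencmd utility (measure_temp, and measure_temp pmic on the Pi 4) and the usual cpufreq sysfs file. It is not the method used to produce the tables above, and the function names are my own.

```python
import re
import subprocess
import time


def parse_temp(text):
    """Extract degrees C from vcgencmd output such as "temp=62.0'C"."""
    match = re.search(r"temp=([0-9.]+)", text)
    if match is None:
        raise ValueError(f"unexpected vcgencmd output: {text!r}")
    return float(match.group(1))


def read_temp(sensor=None):
    """Read the SoC temperature, or the PMIC one with sensor="pmic"."""
    cmd = ["vcgencmd", "measure_temp"]
    if sensor:
        cmd.append(sensor)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_temp(result.stdout)


def read_cpu_mhz(path="/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq"):
    """Current CPU frequency in MHz (the sysfs file reports kHz)."""
    with open(path) as handle:
        return int(handle.read()) // 1000


def sample(interval_s=20, duration_s=900):
    """Print elapsed seconds, CPU MHz, CPU °C and PMIC °C at fixed intervals."""
    start = time.monotonic()
    elapsed = 0
    while elapsed <= duration_s:
        print(f"{elapsed:7d} {read_cpu_mhz():5d} "
              f"{read_temp():5.1f} {read_temp('pmic'):5.1f}")
        elapsed += interval_s
        # Sleep until the next scheduled sample so the timing does not drift.
        time.sleep(max(0.0, start + elapsed - time.monotonic()))
```

Calling sample(20, 900) on a Pi, in a second terminal while a stress test runs, would print one line every 20 seconds for 15 minutes. As noted earlier, the MHz value is only an instantaneous sample, so it is best read alongside the benchmark's own speed figures.
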
More Later
