Bojan Nikolic


How to easily measure Floating Point Operations Per Second (FLOPS)

The hard way of measuring FLOPS is to modify your program so that it keeps track of the number of floating-point operations performed in each module/function, run it on your target hardware, and finally divide the operation count by the run time. However, this may require extensive modification of the program, and if the counting is done at too granular a level (i.e., in too tight a loop) it can itself affect the performance of the program.
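To make the manual approach concrete, here is a minimal sketch in Python. The function `dot_counted` and the test vectors are hypothetical illustrations, not taken from any real program; the point is only that the counting code sits inside the inner loop, which is exactly where it costs the most.

```python
import time


def dot_counted(xs, ys):
    """Dot product that also returns the number of floating-point operations."""
    total = 0.0
    flops = 0
    for x, y in zip(xs, ys):
        total += x * y  # one multiply and one add
        flops += 2      # bookkeeping inside the tight loop
    return total, flops


n = 100_000
xs = [1.0] * n
ys = [2.0] * n

start = time.perf_counter()
result, flops = dot_counted(xs, ys)
elapsed = time.perf_counter() - start

# Divide the operation count by the run time to get the rate.
print(f"{flops} operations in {elapsed:.4f} s = {flops / elapsed / 1e6:.1f} MFLOPS")
```

Note that the extra `flops += 2` is executed once per element, so the measurement perturbs the very loop it is trying to measure.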

A much easier way of measuring FLOPS for a particular combination of program and hardware is to use the CPU performance counters, which are now conveniently accessible under Linux through the perf tools. In particular, my Intel CPU can count an event called FP_COMP_OPS_EXE, which I guess stands for Floating Point Computational Operations Executed. This event has five umasks (unit masks), each selecting one class of operation:

  • X87: traditional 8087-style 80-bit floating point operations
  • SSE_FP_PACKED_DOUBLE: SSE double-precision on packed data (128-bit registers, so this is two operations)
  • SSE_FP_SCALAR_SINGLE: one single-precision operation
  • SSE_PACKED_SINGLE: four single-precision operations (32-bit single precision packed into a 128-bit register)
  • SSE_SCALAR_DOUBLE: one double-precision operation

These events can be translated into the raw codes to be monitored, as explained here; in my case this gives the following output:

./check_events FP_COMP_OPS_EXE:X87 FP_COMP_OPS_EXE:SSE_FP_PACKED_DOUBLE FP_COMP_OPS_EXE:SSE_FP_SCALAR_SINGLE FP_COMP_OPS_EXE:SSE_PACKED_SINGLE FP_COMP_OPS_EXE:SSE_SCALAR_DOUBLE
Detected PMU models:
    [18, ix86arch, "Intel X86 architectural PMU"]
    [51, perf, "perf_events generic PMU"]
    [68, snb, "Intel Sandy Bridge"]
Total events: 2332 available, 166 supported
Requested Event: FP_COMP_OPS_EXE:X87
Actual    Event: snb::FP_COMP_OPS_EXE:X87:k=1:u=1:e=0:i=0:c=0:t=0
PMU            : Intel Sandy Bridge
IDX            : 142606353
Codes          : 0x530110
Requested Event: FP_COMP_OPS_EXE:SSE_FP_PACKED_DOUBLE
Actual    Event: snb::FP_COMP_OPS_EXE:SSE_FP_PACKED_DOUBLE:k=1:u=1:e=0:i=0:c=0:t=0
PMU            : Intel Sandy Bridge
IDX            : 142606353
Codes          : 0x531010
Requested Event: FP_COMP_OPS_EXE:SSE_FP_SCALAR_SINGLE
Actual    Event: snb::FP_COMP_OPS_EXE:SSE_FP_SCALAR_SINGLE:k=1:u=1:e=0:i=0:c=0:t=0
PMU            : Intel Sandy Bridge
IDX            : 142606353
Codes          : 0x532010
Requested Event: FP_COMP_OPS_EXE:SSE_PACKED_SINGLE
Actual    Event: snb::FP_COMP_OPS_EXE:SSE_PACKED_SINGLE:k=1:u=1:e=0:i=0:c=0:t=0
PMU            : Intel Sandy Bridge
IDX            : 142606353
Codes          : 0x534010
Requested Event: FP_COMP_OPS_EXE:SSE_SCALAR_DOUBLE
Actual    Event: snb::FP_COMP_OPS_EXE:SSE_SCALAR_DOUBLE:k=1:u=1:e=0:i=0:c=0:t=0
PMU            : Intel Sandy Bridge
IDX            : 142606353
Codes          : 0x538010
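The structure of these codes can be read off with a few bit operations. The sketch below assumes the codes follow the layout of Intel's performance event select register, i.e., the event number in the low byte, the umask in the next byte, and control flags above that; the helper name `decode_raw_event` is my own.

```python
def decode_raw_event(code):
    """Split a raw event code into event select, umask and flag bytes."""
    return {
        "event_select": code & 0xFF,   # bits 0-7
        "umask": (code >> 8) & 0xFF,   # bits 8-15
        "flags": (code >> 16) & 0xFF,  # bits 16-23 (assumed: USR, OS, INT, EN, ...)
    }


# Codes copied from the check_events output above.
codes = {
    "X87": 0x530110,
    "SSE_FP_PACKED_DOUBLE": 0x531010,
    "SSE_FP_SCALAR_SINGLE": 0x532010,
    "SSE_PACKED_SINGLE": 0x534010,
    "SSE_SCALAR_DOUBLE": 0x538010,
}

for name, code in codes.items():
    fields = decode_raw_event(code)
    print(f"{name:22s} event=0x{fields['event_select']:02x} "
          f"umask=0x{fields['umask']:02x} flags=0x{fields['flags']:02x}")
```

Under that assumption all five codes share the same event number (0x10) and the same flags (0x53), and differ only in the umask, which is consistent with them being five umasks of a single event.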

The resulting codes are supplied to the perf stat program (each prefixed with r to mark it as a raw event). The resulting counts are then added up, each weighted by the number of operations per event, to give the total number of operations, which is divided by the total time taken. For example, for a carefully tuned program, I measure the following:

xcorrelators-bench/CPU-correlator$ perf stat -e r530110 -e r531010 -e r532010 -e r534010 -e r538010  ./correlator
total maxFlops with 2 threads is: 18.5955
correlate took 1.09114 s, max Gflops = 18.5955, achieved 17.8384 Gflops, 95.9282 % efficiency
throughput: 6.9222 GB/s load, 4.6148 MB/s store

 Performance counter stats for './correlator':

            32,693 r530110                                                      [80.01%]
                 0 r531010                                                      [79.99%]
                 0 r532010                                                      [80.01%]
    39,195,349,051 r534010                                                      [80.02%]
                17 r538010                                                      [80.02%]

       8.015465141 seconds time elapsed

This shows that the program performed close to 40 billion packed single-precision operations, i.e., about 160 billion single-precision operations in total (four per packed operation), during its 8 second run time. This leads to a direct estimate of about 20 GFLOPS (more precisely, 156.8 billion operations / 8.015 s ≈ 19.6 GFLOPS) of practical performance in this case.
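The same arithmetic can be scripted. The following is a rough sketch that parses the perf stat output shown above and weights each counter by the number of operations per event; the weights follow the umask descriptions earlier in this post, and the function name `gflops_from_perf_stat` and the parsing patterns are my own assumptions about the output format rather than anything perf guarantees.

```python
import re

# Operations per counted event, following the umask list above.
OPS_PER_EVENT = {
    "r530110": 1,  # X87
    "r531010": 2,  # SSE_FP_PACKED_DOUBLE: two doubles per 128-bit register
    "r532010": 1,  # SSE_FP_SCALAR_SINGLE
    "r534010": 4,  # SSE_PACKED_SINGLE: four singles per 128-bit register
    "r538010": 1,  # SSE_SCALAR_DOUBLE
}

# Counter lines copied from the perf stat run above.
PERF_OUTPUT = """\
            32,693 r530110                                                      [80.01%]
                 0 r531010                                                      [79.99%]
                 0 r532010                                                      [80.01%]
    39,195,349,051 r534010                                                      [80.02%]
                17 r538010                                                      [80.02%]

       8.015465141 seconds time elapsed
"""


def gflops_from_perf_stat(text, weights=OPS_PER_EVENT):
    """Return (total operations, elapsed seconds, GFLOPS) from perf stat text."""
    total_ops = 0
    elapsed = None
    for line in text.splitlines():
        counter = re.match(r"\s*([\d,]+)\s+(r[0-9a-f]+)\b", line)
        if counter and counter.group(2) in weights:
            count = int(counter.group(1).replace(",", ""))
            total_ops += count * weights[counter.group(2)]
            continue
        timing = re.match(r"\s*([\d.]+)\s+seconds time elapsed", line)
        if timing:
            elapsed = float(timing.group(1))
    if elapsed is None:
        raise ValueError("no 'seconds time elapsed' line found")
    return total_ops, elapsed, total_ops / elapsed / 1e9


total_ops, elapsed, gflops = gflops_from_perf_stat(PERF_OUTPUT)
print(f"{total_ops} operations in {elapsed:.3f} s = {gflops:.2f} GFLOPS")
```

Keeping the weights in a separate table makes it easy to adapt the script to a CPU with wider vector registers, where a packed event would stand for more operations.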
