How to easily measure Floating Point Operations Per Second (FLOPS)
The hard way of measuring FLOPS is to modify your program so that it keeps track of the number of floating-point operations performed in each module/function, run it on your target hardware, and finally divide the operation count by the elapsed time. But this may require extensive modification to the program, and if the counting is done at too fine a granularity (i.e., in too tight a loop) it can itself affect the performance of the program.
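For reference, the manual approach can be sketched roughly as follows. This is only an illustration under simple assumptions: the function names `dot_with_flop_count` and `measure_flops` are made up for this example, and a dot product stands in for whatever your real kernel does.

```python
import time


def dot_with_flop_count(xs, ys):
    """Dot product that also reports how many floating-point operations it did."""
    total = 0.0
    flops = 0
    for x, y in zip(xs, ys):
        total += x * y  # one multiply and one add
        flops += 2      # the bookkeeping itself costs time in a tight loop
    return total, flops


def measure_flops(n=100_000):
    """Run the instrumented kernel and divide operations by elapsed time."""
    xs = [1.0] * n
    ys = [2.0] * n
    start = time.perf_counter()
    total, flops = dot_with_flop_count(xs, ys)
    elapsed = time.perf_counter() - start
    return total, flops, flops / elapsed


if __name__ == "__main__":
    total, flops, rate = measure_flops()
    print(f"{flops} operations, {rate / 1e6:.1f} MFLOPS")
```

Note that the counter update sits inside the innermost loop, which is exactly the kind of instrumentation that distorts the measurement.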
A much easier way of measuring FLOPS for a particular combination of program and hardware is to use the CPU performance counters, now very conveniently accessible under Linux using the perf tools. In particular, my Intel CPU can count an event called FP_COMP_OPS_EXE, which I guess stands for Floating Point Computational Operations Executed. There are five umasks of this event:
- X87: traditional 8087-style 80-bit floating-point operations
- SSE_FP_PACKED_DOUBLE: SSE double-precision on packed data (128-bit registers, so each event is two operations)
- SSE_FP_SCALAR_SINGLE: one single-precision operation
- SSE_PACKED_SINGLE: four single-precision operations (32-bit single-precision values packed into a 128-bit register)
- SSE_SCALAR_DOUBLE: one double-precision operation
These events can be turned into codes to be monitored as explained here, leading in my case to the following output:
    ./check_events FP_COMP_OPS_EXE:X87 FP_COMP_OPS_EXE:SSE_FP_PACKED_DOUBLE FP_COMP_OPS_EXE:SSE_FP_SCALAR_SINGLE FP_COMP_OPS_EXE:SSE_PACKED_SINGLE FP_COMP_OPS_EXE:SSE_SCALAR_DOUBLE
    Detected PMU models:
        [18, ix86arch, "Intel X86 architectural PMU"]
        [51, perf, "perf_events generic PMU"]
        [68, snb, "Intel Sandy Bridge"]
    Total events: 2332 available, 166 supported

    Requested Event: FP_COMP_OPS_EXE:X87
    Actual Event: snb::FP_COMP_OPS_EXE:X87:k=1:u=1:e=0:i=0:c=0:t=0
    PMU : Intel Sandy Bridge
    IDX : 142606353
    Codes : 0x530110

    Requested Event: FP_COMP_OPS_EXE:SSE_FP_PACKED_DOUBLE
    Actual Event: snb::FP_COMP_OPS_EXE:SSE_FP_PACKED_DOUBLE:k=1:u=1:e=0:i=0:c=0:t=0
    PMU : Intel Sandy Bridge
    IDX : 142606353
    Codes : 0x531010

    Requested Event: FP_COMP_OPS_EXE:SSE_FP_SCALAR_SINGLE
    Actual Event: snb::FP_COMP_OPS_EXE:SSE_FP_SCALAR_SINGLE:k=1:u=1:e=0:i=0:c=0:t=0
    PMU : Intel Sandy Bridge
    IDX : 142606353
    Codes : 0x532010

    Requested Event: FP_COMP_OPS_EXE:SSE_PACKED_SINGLE
    Actual Event: snb::FP_COMP_OPS_EXE:SSE_PACKED_SINGLE:k=1:u=1:e=0:i=0:c=0:t=0
    PMU : Intel Sandy Bridge
    IDX : 142606353
    Codes : 0x534010

    Requested Event: FP_COMP_OPS_EXE:SSE_SCALAR_DOUBLE
    Actual Event: snb::FP_COMP_OPS_EXE:SSE_SCALAR_DOUBLE:k=1:u=1:e=0:i=0:c=0:t=0
    PMU : Intel Sandy Bridge
    IDX : 142606353
    Codes : 0x538010
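The Codes values above map onto perf's raw event syntax, which is the letter r followed by the code in hexadecimal. As a small sketch of that step, the snippet below builds the corresponding perf stat command line. The helper names `raw_event` and `perf_stat_command` are invented for this example, and the codes are the ones reported for my Sandy Bridge CPU; yours will likely differ.

```python
import shlex

# Codes reported by check_events on the author's Sandy Bridge machine.
CODES = {
    "X87": 0x530110,
    "SSE_FP_PACKED_DOUBLE": 0x531010,
    "SSE_FP_SCALAR_SINGLE": 0x532010,
    "SSE_PACKED_SINGLE": 0x534010,
    "SSE_SCALAR_DOUBLE": 0x538010,
}


def raw_event(code: int) -> str:
    """Format a numeric event code in perf's raw syntax: 'r' plus lowercase hex."""
    return f"r{code:x}"


def perf_stat_command(program: str) -> list[str]:
    """Build the argument list for 'perf stat' with one -e option per event."""
    args = ["perf", "stat"]
    for code in CODES.values():
        args += ["-e", raw_event(code)]
    args.append(program)
    return args


if __name__ == "__main__":
    # Print the command so it can be pasted into a shell.
    print(shlex.join(perf_stat_command("./correlator")))
```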
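The resulting codes are supplied to the perf stat program. Each event count is then weighted by the number of operations that event represents (for example, four for packed single-precision), the weighted counts are added up to give the total number of operations, and this total is divided by the total time taken. For example, for a carefully tuned program, I measure the following: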
    xcorrelators-bench/CPU-correlator$ perf stat -e r530110 -e r531010 -e r532010 -e r534010 -e r538010 ./correlator
    total maxFlops with 2 threads is: 18.5955
    correlate took 1.09114 s, max Gflops = 18.5955, achieved 17.8384 Gflops, 95.9282 % efficiency
    throughput: 6.9222 GB/s load, 4.6148 MB/s store

    Performance counter stats for './correlator':

                32,693 r530110 [80.01%]
                     0 r531010 [79.99%]
                     0 r532010 [80.01%]
        39,195,349,051 r534010 [80.02%]
                    17 r538010 [80.02%]

           8.015465141 seconds time elapsed
This shows that the program executed close to 40 billion packed single-precision events, i.e., about 157 billion single-precision operations in total, during its 8 second run-time, giving a direct estimate of roughly 19.6 GFLOPS of practical performance in this case.
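The arithmetic can be written out as a short script. This is a sketch using the counts from the perf stat run above; the per-event weights follow the umask descriptions, and counting each x87 event as a single operation is my assumption. The names `OPS_PER_EVENT`, `total_flops` and `gflops` are made up for this example.

```python
# Operations represented by one occurrence of each raw event
# (weights taken from the umask descriptions; x87 = 1 is an assumption).
OPS_PER_EVENT = {
    "r530110": 1,  # X87
    "r531010": 2,  # SSE_FP_PACKED_DOUBLE
    "r532010": 1,  # SSE_FP_SCALAR_SINGLE
    "r534010": 4,  # SSE_PACKED_SINGLE
    "r538010": 1,  # SSE_SCALAR_DOUBLE
}

# Counts and elapsed time copied from the perf stat output above.
COUNTS = {
    "r530110": 32_693,
    "r531010": 0,
    "r532010": 0,
    "r534010": 39_195_349_051,
    "r538010": 17,
}
ELAPSED_SECONDS = 8.015465141


def total_flops(counts, weights):
    """Sum the event counts, each weighted by its operations per event."""
    return sum(counts[event] * weights[event] for event in counts)


def gflops(counts, weights, seconds):
    """Total weighted operations divided by wall-clock time, in GFLOPS."""
    return total_flops(counts, weights) / seconds / 1e9


if __name__ == "__main__":
    print(f"{total_flops(COUNTS, OPS_PER_EVENT):,} operations")
    print(f"{gflops(COUNTS, OPS_PER_EVENT, ELAPSED_SECONDS):.2f} GFLOPS")
```

The packed single-precision event dominates so completely here that the other four counters barely change the result.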