Extended Berkeley Packet Filter (eBPF/BPF) is a fully programmable way of collecting information and changing behaviour from inside the Linux kernel. Amongst it’s many uses collecting information of what limits the performance of a system as a whole or of specific applications. From its Kernel viewpoint, BPF has unique ability to collect information about I/O, networking and other aspects of overall performance that often dominates simple computational throughput in importance.

There are numerous such tools in the bcc collection: https://github.com/iovisor/bcc. On Debian they are available as bpfcc-tools . Many are described on https://www.brendangregg.com/ebpf.html as well as (I believe) in his book: https://www.brendangregg.com/bpf-performance-tools-book.html.

Amongst the capabilities of these tools is to record the stack of user-space applications at various tracing points. This allows attribution of these traces to higher level functions in the program which often makes it far easier to interpret their significance.

The mechanism for interpreting the stack when doing this recording is the frame pointer.

Frame Pointer

Frame pointer is a technique of implementing functions in which the value of the stack pointer (i.e. the sp register) at entry into the function is stored in the base pointer register (i.e., bp register) while the previous value of the bp register is pushed onto the stack. It has the advantage of easy calculation of addresses of standard variables on the stack (because any subsequent changes to stack do not need to be taken into account) at the cost of using a register. It also has an advantage that it becomes very easy to identify which part of the stack (`stack frame’) belongs to each function call.

GCC controls the use of this technique with: -fno-omit-frame-pointer/-fomit-frame-pointer options. The default in many relevant scenarios is -fomit-frame-pointer. The effect on a simple function (g()) which only calls another function (f()) is shown below:

-fno-omit-frame-pointer -fomit-frame-pointer


0000000000001192 <g>:
    1192:	55                   	push   %rbp
    1193:	48 89 e5             	mov    %rsp,%rbp
    1196:	b8 00 00 00 00       	mov    $0x0,%eax
    119b:	e8 99 ff ff ff       	call   1139 <f>
    11a0:	90                   	nop
    11a1:	5d                   	pop    %rbp
    11a2:	c3                   	ret




0000000000001193 <g>:
    1193:	48 83 ec 08          	sub    $0x8,%rsp
    1197:	b8 00 00 00 00       	mov    $0x0,%eax
    119c:	e8 98 ff ff ff       	call   1139 <f>
    11a1:	90                   	nop
    11a2:	48 83 c4 08          	add    $0x8,%rsp
    11a6:	c3                   	ret


With the stack pointer present this is the print-out of the stack when in function f() (which was called from g which in turn was called from main):

(gdb) x/30x $sp
0x7fffffffe950:	0x00000000	0x00000000	0x00000000	0x00000000
0x7fffffffe960:	0x00000000	0x00000000	0x00000000	0x00000000
0x7fffffffe970:	0xffffe980	0x00007fff	0x555551a0	0x00005555
0x7fffffffe980:	0xffffe990	0x00007fff	0x555551b1	0x00005555
0x7fffffffe990:	0x00000001	0x00000000	0xf7e031ca	0x00007fff

The chain of frame-pointer is clearly visible which allows trivially walking up the stack one function frame at a time. Since the return instruction address is right next to the frame-pointer the frame can be attributed to a function easily too.

This is basically the technique eBPF performance tools use.

How to use in practice

To obtain the stack attribution clearly the code needs to be compiled with -fno-omit-frame-pointer or equivalent. Since the frame-pointers create a chain, it will be broken when entering any function which was not compiled -fno-omit-frame-pointer.

For example libc is typically compiled -fomit-frame-pointer so walking the stack will be broken at that function. This can easily be a problem since the very purpose of using eBPF is often trace system-related events which are typical called from libc!

One simple example derived from this public Github issue is the following program:

#include <unistd.h>
#include <sys/syscall.h>
#include <time.h>

void f() {
    for (size_t i = 0; i < 15; ++i) {
      //sleep(1);
      struct timespec ts;
      ts.tv_sec=1;
      ts.tv_nsec=0;
      syscall(SYS_clock_nanosleep, CLOCK_REALTIME, 0, &ts, &ts );
    }
}

void g() {
    f();
}

int main() {
    g();
}

We can trace up the stack because the SYS call is made directly:

$ sudo   offwaketime-bpfcc --stack-storage-size 200000  -Up $(pgrep t2) -d 5
Tracing blocked time (us) by user off-CPU and waker stack for 5 secs.

    waker:           swapper/1 0
    --               --
    syscall
    g
    main
    __libc_start_call_main
    target:          t2 11296
        4001289

However original version (using the commented out line sleep(1)) calls into C and the stack can not be walked up:

$ sudo   offwaketime-bpfcc --stack-storage-size 200000  -Up $(pgrep t2) -d 5
Tracing blocked time (us) by user off-CPU and waker stack for 5 secs.

    waker:           swapper/0 0
    --               --
    __clock_nanosleep
    target:          t2 10797
        4002066