如何使 bpftrace:runqlat.bt 按 workqueue name 统计延迟

问题

当我们想利用 ebpf 统计 task/kworker 的调度延迟，通常有两个前端选择，一是 bcc，另一个是 bpftrace。而如果要自定义，通常选择先上手 bpftrace，因为简单。

bpftrace/tools 下的工具 runqlat.bt 就可以统计调度延迟，只不过输出是直方图，我们希望能够像 perf sched 那样输出 task/kworker 的调度延迟时间。

浏览 runqlat.bt

先浏览下代码，参考 v0.11.4:

 1 #!/usr/bin/env bpftrace
 2 /*
 3  * runqlat.bt   CPU scheduler run queue latency as a histogram.
 4  *              For Linux, uses bpftrace, eBPF.
 5  *
 6  * This is a bpftrace version of the bcc tool of the same name.
 7  *
 8  * Copyright 2018 Netflix, Inc.
 9  * Licensed under the Apache License, Version 2.0 (the "License")
10  *
11  * 17-Sep-2018  Brendan Gregg   Created this.
12  */
13
14 #include <linux/sched.h>
15
16 BEGIN
17 {
18         printf("Tracing CPU scheduler... Hit Ctrl-C to end.\n");
19 }
20
21 tracepoint:sched:sched_wakeup,
22 tracepoint:sched:sched_wakeup_new
23 {
24         @qtime[args->pid] = nsecs;
25 }
26
27 tracepoint:sched:sched_switch
28 {
29         if (args->prev_state == TASK_RUNNING) {
30                 @qtime[args->prev_pid] = nsecs;
31         }
32
33         $ns = @qtime[args->next_pid];
34         if ($ns) {
35                 @usecs = hist((nsecs - $ns) / 1000);
36         }
37         delete(@qtime[args->next_pid]);
38 }
39
40 END
41 {
42         clear(@qtime);
43 }

行数不多，看上去也很简单，对于首次接触 bpftrace 的这段代码主要盲点: args/@/$/nsecs/hist/delete/clear/BEGIN/END。

先运行下看看：

root@localhost:/bpftrace# bpftrace tools/runqlat.bt
Attaching 5 probes...
Tracing CPU scheduler... Hit Ctrl-C to end.
^C


@usecs:
[2, 4)               236 |@@@@@@@@@                                           |
[4, 8)              1109 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
[8, 16)             1246 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32)             772 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                    |
[32, 64)             233 |@@@@@@@@@                                           |
[64, 128)            318 |@@@@@@@@@@@@@                                       |
[128, 256)           147 |@@@@@@                                              |
[256, 512)            42 |@                                                   |
[512, 1K)             52 |@@                                                  |
[1K, 2K)              22 |                                                    |
[2K, 4K)               7 |                                                    |

这些输出是怎么产生的？带着这些疑问来认识 bpftrace。

了解 bpftrace 语言

基本语法就是：probe[,probe,...] /filter/ { action }

probe 可以有多种，比如这里的 traceponit，action 可以有多个，filter 是过滤条件，可写可不写，看情况。

@, $

代表变量类型，@ 表示全局 (map variables)，而 $ 表示 scratch variables(栈变量？)

@[] 表示关联数组，它是 map variable 的变体，格式 @array_name[key_name, key_name2, ...] = value, 类似 hash or Python Dict：

>>> Dict = {1: 'Geeks', 2: 'For', 3: 'Geeks'}
>>> Dict
{1: 'Geeks', 2: 'For', 3: 'Geeks'}
>>> Dict[1]
'Geeks'
>>> Dict[2]
'For'

args

docs/tutorial_one_liners.md:

args is a pointer to a struct containing all the tracepoint arguments. This
struct is automatically generated by bpftrace based tracepoint information.

args 是一个指针，指向该 tracepoint 的所有参数。这个结构是由 bpftrace 根据 tracepoint 信息自动生成。

可以用如下命令找出这个结构：

root@localhost:/# bpftrace -vl 'tracepoint:sched:sched_wakeup'
tracepoint:sched:sched_wakeup
    char comm[16];
    pid_t pid;
    int prio;
    int success;
    int target_cpu;
    long state;

再看下 wakeup tracepoint code:

static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
                           struct rq_flags *rf)
{
        check_preempt_curr(rq, p, wake_flags);
        p->state = TASK_RUNNING;
        trace_sched_wakeup(p);

DEFINE_EVENT(sched_wakeup_template, sched_wakeup,
             TP_PROTO(struct task_struct *p),
             TP_ARGS(p));

DECLARE_EVENT_CLASS(sched_wakeup_template,

        TP_PROTO(struct task_struct *p),

        TP_ARGS(__perf_task(p)),

        TP_STRUCT__entry(
                __array(        char,   comm,   TASK_COMM_LEN   )
                __field(        pid_t,  pid                     )
                __field(        int,    prio                    )
                __field(        int,    success                 )
                __field(        int,    target_cpu              )
        ),

到这里就清晰了，这个 args 就指向 TP_STRUCT__entry。

nsecs

内置变量，里面用的是 {bpf_}ktime_get_ns()。

BEGIN, END

These are special built-in events provided by the bpftrace runtime. BEGIN is triggered before all other probes are attached. END is triggered after all other probes are detached.

内置probe，管开头和结束。

delete/clear/hist

都是内置 map functions，map 是一种特殊的 data type，除了用作 variable 外，还可以用来存储直方图（调用hist）等。

delete 是从 map 里删除一个 key，所以要指定 key，比如上面的 delete(@qtime[args->next_pid])，

而 clear 是清除 map 里所有 keys/values, 所以语法是 clear(map m), 比如上面的 clear(@qtime)。

bpftrace 退出时所有的 maps 都会打印出来, 不想打就在 END block 里加上 clear。

ok, 在了解下这些基本语言规则后，我们来修改这个 bt 文件。

修改 runqlat.bt

调度延迟的时间就是从 wakeup 开始一直到 start running 这个时间点。

wakeup 时记录 wakeup time 到 @qtimes[task_id] 里，而在 switch 时，用 nsecs 减去这个 wakeup 的时间(前提要存在)就是这个 task 的调度延迟。

来看下 kernel trace_sched_switch():

TRACE_EVENT(sched_switch,

        TP_PROTO(bool preempt,
                 struct task_struct *prev,
                 struct task_struct *next),

        TP_ARGS(preempt, prev, next),

        TP_STRUCT__entry(
                __array(        char,   prev_comm,      TASK_COMM_LEN   )
                __field(        pid_t,  prev_pid                        )
                __field(        int,    prev_prio                       )
                __field(        long,   prev_state                      )
                __array(        char,   next_comm,      TASK_COMM_LEN   )
                __field(        pid_t,  next_pid                        )
                __field(        int,    next_prio                       )
        ),

        TP_fast_assign(
                memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
                __entry->prev_pid       = prev->pid;
                __entry->prev_prio      = prev->prio;
                __entry->prev_state     = __trace_sched_switch_state(preempt, prev);
                memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
                __entry->next_pid       = next->pid;
                __entry->next_prio      = next->prio;

我们关心的是 prev_comm 和 next_comm，next 是即将 running 的 task，所以 switch block 可以改成：


        $ns = @qtime[args->next_pid];
        if ($ns) {
-               @usecs = hist((nsecs - $ns) / 1000);
+               @uesc[args->next_comm] = ((nsecs - $ns) / 1000);
        }

这样就输出所有 tasks 的调度延迟了，效果如下：

Attaching 5 probes...
Tracing CPU scheduler... Hit Ctrl-C to end.



@usecs[ksoftirqd/6]: 14
@usecs[swapper/2]: 14
@usecs[swapper/1]: 17
@usecs[swapper/0]: 17
@usecs[kworker/7:1]: 32
@usecs[swapper/6]: 42
@usecs[kworker/6:1]: 50
@usecs[kworker/3:2]: 57

因为要 Ctrl-c 才终止，把 bpftrace tools/runqlat.bt 输出重定向到文本，grep kworker 搜索这个文本即可完成 kworker 统计。

那代码要支持过滤 kworker 怎么写了？高版本有字符串包含，这个版本只有 strncmp，修改如下:

 tracepoint:sched:sched_wakeup_new
 {
-       @qtime[args->pid] = nsecs;
+       if (!strncmp(args->comm, "kworker", 7)) {
+               @qtime[args->pid] = nsecs;
+       }
 }

 tracepoint:sched:sched_switch
 {
        if (args->prev_state == TASK_RUNNING) {
-               @qtime[args->prev_pid] = nsecs;
+               if (!strncmp(args->prev_comm, "kworker", 7)) {
+                       @qtime[args->prev_pid] = nsecs;
+               }
        }

这样一次性统计所有 kworker 的调度延迟，我们跑下看看：

Attaching 5 probes...
Tracing CPU scheduler... Hit Ctrl-C to end.
^C


@usecs[kworker/3:2]: 40
@usecs[kworker/u17:2]: 44
@usecs[kworker/7:1]: 64
@usecs[kworker/u16:14]: 81
@usecs[kworker/6:1]: 92
@usecs[kworker/u16:5]: 120
@usecs[kworker/u16:7]: 121
@usecs[kworker/u16:1]: 124

怎么没有 workqueue name? 方括号里的显示来自 args->next_comm，也就是入参 next 的 comm，来看 kernel 部分(大概还有一半-)。

btw: 看了下最新的 Linux 6.3rc 也不支持。