The EROFS talk at LPC2022 mentioned a decompression scheduling-latency problem:


Recently, Android posted an optimization patch for this issue:

Author: Sandeep Dhavale <dhavale@google.com>
Date: Thu Dec 22 01:15:29 2022 +0000

EROFS: Replace erofs_unzipd workqueue with per-cpu threads

Using per-cpu thread pool we can reduce the scheduling latency compared
to workqueue implementation. With this patch scheduling latency and
variation is reduced as per-cpu threads are SCHED_FIFO kthread_workers.

The results were evaluated on arm64 Android devices running 5.10 kernel.

The table below shows resulting improvements of total scheduling latency
for the same app launch benchmark runs with 50 iterations. Scheduling
latency is the latency between when the task (workqueue kworker vs
kthread_worker) became eligible to run to when it actually started
running.

| | workqueue | kthread_worker | diff |
|---|---|---|---|
| Average (us) | 15253 | 2914 | -80.89% |
| Median (us) | 14001 | 2912 | -79.20% |
| Minimum (us) | 3117 | 1027 | -67.05% |
| Maximum (us) | 30170 | 3805 | -87.39% |
| Standard deviation (us) | 7166 | 359 | |

Background: Boot times and cold app launch benchmarks are very
important to android ecosystem as they directly translate to
responsiveness from user point of view. While erofs provides
lot of important features like space savings, we saw some
performance penalty in cold app launch benchmarks in few scenarios.
Analysis showed that the significant variance was coming from the
scheduling cost while decompression cost was more or less the same.

OK, time to run through the tests and statistics myself: find the problem, then verify it, you know the drill. Host: Ubuntu; board: arm64, running a 4.x kernel.

How to simulate heavy load

fio should do the job. The configuration is as follows:
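The original job file is not shown here, so below is a reconstructed sketch based on the fio startup output later in this section: the job name, ioengine, cpuload, and the 50000 value come from that output, while everything else (runtime, file size, file name) is an assumption, and cpuchunks is my guess for the option that fio reports back as cpucycle:

```ini
; burn-cpu.fio -- reconstructed sketch, not the original file
[global]
time_based
runtime=60              ; assumed

; one 4K sync random reader, matching
; "randread: (g=0): rw=randread, bs=4096B-4096B, ioengine=sync, iodepth=1"
[randread]
rw=randread
bs=4k
ioengine=sync
iodepth=1
size=256m               ; assumed
filename=/data/fio.test ; assumed

; eight CPU burners at 90% load, matching
; "burn_8x90%_qsort (qsort): ioengine=cpuio, cpuload=90, cpucycle=50000"
[burn_8x90%_qsort]
ioengine=cpuio
cpuload=90
cpuchunks=50000
cpumode=qsort
numjobs=8
```

With numjobs=8 plus the single reader, this would account for the "Starting 9 processes" line in the output below.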




Build fio from source with the configuration below (for arm64), then push it straight to the board and run it.

./configure --prefix=out --cc=aarch64-linux-gnu-gcc-7 --build-static

If aarch64-linux-gnu-gcc-7 is not installed, just apt install it.

Also, the board kernel must have CONFIG_SYSVIPC=y enabled, otherwise fio fails at startup with:

shmget: Function not implemented
error: failed to setup shm segment


XXX:/ # /data/fio /data/burn-cpu.fio
randread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
Starting 9 processes
burn_8x90%_qsort (qsort): ioengine=cpuio, cpuload=90, cpucycle=50000
burn_8x90%_qsort (qsort): ioengine=cpuio, cpuload=90, cpucycle=50000
burn_8x90%_qsort (qsort): ioengine=cpuio, cpuload=90, cpucycle=50000
burn_8x90%_qsort (qsort): ioengine=cpuio, cpuload=90, cpucycle=50000
burn_8x90%_qsort (qsort): ioengine=cpuio, cpuload=90, cpucycle=50000
burn_8x90%_qsort (qsort): ioengine=cpuio, cpuload=90, cpucycle=50000
burn_8x90%_qsort (qsort): ioengine=cpuio, cpuload=90, cpucycle=50000
burn_8x90%_qsort (qsort): ioengine=cpuio, cpuload=90, cpucycle=50000

Reproducing the sched-latency UI

Judging by the screenshots, it appears to be something like systrace. The official Android documentation says:

System tracing is an Android tool that saves device activity to a trace file. On devices running Android 10 (API level 29) or higher, trace files are saved in Perfetto format, as shown later in this topic. On devices running an earlier version of Android, trace files are saved in Systrace format.

The current environment is Android 11, so Perfetto format it is.

  • Generating a trace file

Both command-line and UI tools are available; see the official documentation for details. Here we use the perfetto command line (since we are already driving the device over adb).

The recording config comes from Record new trace at https://ui.perfetto.dev/, with only CPU/Scheduling details enabled.
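For reference, a minimal sketch of the config.pbtx that page generates when only scheduling details are enabled; the buffer size, duration, and exact event list here are assumptions, not the actual generated file:

```
buffers {
  size_kb: 65536
  fill_policy: RING_BUFFER
}

data_sources {
  config {
    name: "linux.ftrace"
    ftrace_config {
      # Scheduling events needed to see wakeup and switch points
      ftrace_events: "sched/sched_switch"
      ftrace_events: "sched/sched_wakeup"
      ftrace_events: "sched/sched_waking"
    }
  }
}

# Stop automatically after 30 s (ctrl-c also works)
duration_ms: 30000
```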

step1: Start fio first.

step2: Start recording, and make sure it keeps running:

adb shell "cat /data/config.pbtx | perfetto --txt -c - -o /data/misc/perfetto-traces/trace.perfetto-trace"

step3: Perform a scenario action (here, launching a system app such as Settings). When done, recording either stops automatically per the config or can be interrupted with ctrl-c.

step4: Pull the trace into the current directory:

adb pull /data/misc/perfetto-traces/trace.perfetto-trace .

  • Viewing the trace

Just use Open trace file at https://ui.perfetto.dev/ to view it; the result is essentially an HTML report.

Create an HTML report from the trace file using a web-based UI or from the command line.

The command-line description seems slightly off: a Perfetto trace file must first be converted to the legacy systrace format, and only then turned into HTML with --from-file. If your network cannot reach the converter tool to download it, the online UI is the only option.

  • Analyzing the trace

The perfetto docs, under "CPU scheduling events", mention:

On Android and Linux Perfetto can gather scheduler traces via the Linux Kernel ftrace infrastructure.

This allows to get fine grained scheduling events such as:

  • Which threads were scheduled on which CPU core at any point in time, with nanosecond accuracy.
  • The reason why a running thread got descheduled (e.g. pre-emption, blocked on a mutex, blocking syscall or any other wait queue).
  • The point in time when a thread became eligible to be executed, even if it was not put immediately on any CPU run queue, together with the source thread that made it executable.

The docs capture a CPU slice of process 897, as in the figure below:


The bottom-right corner already describes scheduling latency:

This is interval from when the task became eligible to run to when it started running.

Note that the patch's commit message ends with the same wording, so this must be it, right?

In the figure, Start time is the moment the thread actually starts running, while the Wakeup point is when it becomes eligible to execute but, for some reason, has not yet been put on a CPU run queue. The docs list a couple of such reasons:

  • All CPUs might be busy running some other thread, and Thread A needs to wait to get a run queue slot assigned (or the other threads have higher priority).
  • Some other CPU might be free, but the scheduler's load balancer might take some time to move the thread onto it.

OK, so 897's scheduling latency is the interval from Wakeup to Start time, i.e. 10us 729ns:

  Start time: 6s 177ms 391us 761ns 
- Wakeup: 6s 177ms 381us 32ns
Scheduling latency = 10us 729ns
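The subtraction above, done in plain shell arithmetic by converting both timestamps to nanoseconds first:

```shell
# Convert each perfetto timestamp (s / ms / us / ns fields) to nanoseconds, then subtract.
start_ns=$(( 6*1000000000 + 177*1000000 + 391*1000 + 761 ))   # Start time: 6s 177ms 391us 761ns
wakeup_ns=$(( 6*1000000000 + 177*1000000 + 381*1000 + 32 ))   # Wakeup:     6s 177ms 381us 32ns
echo "scheduling latency: $(( start_ns - wakeup_ns )) ns"
# prints: scheduling latency: 10729 ns
```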

Looking up the color-block definitions: Blue seems to be Runnable (it could run, but the scheduler hasn't picked it to run yet).

Kernel support for workqueue names

My kernel only shows kworker/u17:3; showing the workqueue name requires this patch:

commit 6b59808bfe482642287ddf3fe9d4cccb10756652
Author: Tejun Heo <tj@kernel.org>
Date: Fri May 18 08:47:13 2018 -0700

workqueue: Show the latest workqueue name in /proc/PID/{comm,stat,status}


XXX:/ # ps -ef | grep kworker | grep erofs
root 584 2 0 15:55:32 ? 00:00:00 [kworker/u17:4-erofs_unzipd]

XXX:/ # cat /proc/584/status
Name: kworker/u17:4-erofs_unzipd
Umask: 0000
State: I (idle)
Tgid: 584
Ngid: 0
Pid: 584

OK, capture again, as in the figure below:


That said, perfetto does not seem to support aggregating this latency, and checking slices by hand gets tiring.


The kernel's perf tool can produce the statistics directly:

perf sched latency will summarize scheduler latencies by task, including average and maximum delay:

Checked Android's built-in simpleperf: its list does show sched events and record works, but I could not find the summary in report, so let's just build perf.

Go into tools/perf:

make CROSS_COMPILE=aarch64-linux-gnu- ARCH=arm64 LDFLAGS=-static O=out

A perf binary is generated in the current directory; push it to the board and record first:

perf sched record -- sleep 5    # record for 5 seconds

When it finishes there is a perf.data in the current directory; keep an eye on the recording length.

Filter for kworkers with perf sched latency | grep kworker:

kworker/4:2-eve:404   |     88.386 ms |      207 | avg:   0.248 ms | max:  12.380 ms | max at:  271.002611 s
kworker/u16:18-:1499  |     36.641 ms |      681 | avg:   0.081 ms | max:   4.086 ms | max at:  269.173878 s
kworker/u16:17-:1498  |     21.918 ms |      511 | avg:   0.080 ms | max:   2.706 ms | max at:  272.811798 s
kworker/u16:19-:1500  |     41.742 ms |      713 | avg:   0.077 ms | max:   4.098 ms | max at:  272.342235 s

Works nicely. But where is my kworker? To be continued.