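I did crash analysis before on the R platform. It basically meant capturing the CPU registers and other state at the moment of the crash, disassembling with objdump, and lining the result up against the source to locate the problem. Now that I'm on a phone platform there is one more tool, Trace32, which Qualcomm uses for all of its crash analysis. Crashes all land on my desk now, and asking Qualcomm every time is not sustainable, so I'm picking the skill up again. I think Trace32 can be skipped; the same old routine still basically works.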

First, the crash scene:

[ 1256.852648] Unable to handle kernel NULL pointer dereference at virtual address 00000004 
[ 1256.931061] pgd = e2110000
[ 1256.933725] [00000004] *pgd=00000000
[ 1256.937725] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
[ 1256.942470] Modules linked in: wlan(O) [last unloaded: wlan]
[ 1256.948098] CPU: 1 PID: 585 Comm: qti Tainted: G W O 3.18.71-perf-g24d2c84 #1
[ 1256.956092] task: e41161c0 ti: e2148000 task.ti: e2148000
[ 1256.961479] PC is at diagchar_read+0x610/0x11fc
[ 1256.965978] LR is at 0x0
[ 1256.968488] pc : [<c04e0b24>] lr : [<00000000>] psr: 60010013
[ 1256.968488] sp : e2149ef0 ip : 00000051 fp : b1bb5b7c
[ 1256.979944] r10: c6651000 r9 : 00000201 r8 : c5227bc0
[ 1256.985152] r7 : c13dbc38 r6 : 00000014 r5 : 000186a0 r4 : b1bb5b78
[ 1256.991676] r3 : 00000000 r2 : 80000000 r1 : 00000000 r0 : 00000000

A kernel NULL pointer dereference. One key piece of information is the pc: c04e0b24.
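When there are many crash logs to go through, pulling the registers out by hand gets tedious. The following is a minimal sketch, not part of the original workflow, that extracts the register values from the dump above with a regular expression. The names `REG_RE` and `parse_registers` are made up for illustration, and the pattern only assumes the 32-bit ARM oops layout shown in this log.

```python
import re

# Matches "name : value" pairs as printed in the ARM oops register dump,
# with or without the [<...>] wrapper that the kernel puts around pc and lr.
REG_RE = re.compile(
    r"\b(pc|lr|psr|sp|ip|fp|r\d+)\s*:\s*(?:\[<)?([0-9a-f]{8})(?:>\])?"
)


def parse_registers(oops_text):
    """Return {register name: integer value} for every register found."""
    return {name: int(value, 16) for name, value in REG_RE.findall(oops_text)}


# Register lines copied from the crash log above.
OOPS = """\
[ 1256.968488] pc : [<c04e0b24>] lr : [<00000000>] psr: 60010013
[ 1256.968488] sp : e2149ef0 ip : 00000051 fp : b1bb5b7c
[ 1256.979944] r10: c6651000 r9 : 00000201 r8 : c5227bc0
[ 1256.985152] r7 : c13dbc38 r6 : 00000014 r5 : 000186a0 r4 : b1bb5b78
[ 1256.991676] r3 : 00000000 r2 : 80000000 r1 : 00000000 r0 : 00000000
"""

regs = parse_registers(OOPS)
print(f"pc = {regs['pc']:08x}, r3 = {regs['r3']:08x}, r8 = {regs['r8']:08x}")
```

The resulting dictionary can then feed the later steps (looking up the pc, checking r3 and r8) without retyping addresses.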

After running objdump on vmlinux (the -lD options are basically enough), search for the pc:

/code/kernel/msm-3.18/drivers/char/diag/diagchar_core.c:2933
c04e0b1c: e3510000 cmp r1, #0
c04e0b20: 05983030 ldreq r3, [r8, #48] ; 0x30
c04e0b24: 05933004 ldreq r3, [r3, #4] ===============> crash happens here
c04e0b28: 0a000028 beq c04e0bd0 <diagchar_read+0x6bc>
c04e0b2c: ea0001b2 b c04e11fc <diagchar_read+0xce8>
/code/kernel/msm-3.18/drivers/char/diag/diagchar_core.c:2937
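A full -lD dump of vmlinux is large. One possible shortcut, sketched below, is to compute the bounds of the faulting function from the "PC is at diagchar_read+0x610/0x11fc" line and disassemble only that range. The helper name `function_range` is hypothetical, and the tool name `arm-eabi-objdump` is an assumption based on the `arm-eabi-` prefix of the gdb used later; `-l`, `-d`, `--start-address` and `--stop-address` are standard GNU objdump options.

```python
import re


def function_range(pc, symbol_line):
    """From pc and a 'sym+0xOFF/0xSIZE' string, return (symbol, start, end)."""
    m = re.search(r"(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)", symbol_line)
    if m is None:
        raise ValueError("no symbol+offset/size found")
    symbol = m.group(1)
    offset = int(m.group(2), 16)   # distance of pc from the function start
    size = int(m.group(3), 16)     # total size of the function in bytes
    start = pc - offset
    return symbol, start, start + size


symbol, start, end = function_range(
    0xc04e0b24, "PC is at diagchar_read+0x610/0x11fc"
)

# Assumption: the cross objdump is called arm-eabi-objdump.
cmd = [
    "arm-eabi-objdump", "-ld",
    f"--start-address={start:#x}",
    f"--stop-address={end:#x}",
    "vmlinux",
]
print(" ".join(cmd))
```

As a sanity check, the computed start address agrees with the listing: the branch target c04e0bd0 is annotated as diagchar_read+0x6bc, and c04e0bd0 - 0x6bc gives the same start.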

Find line 2933 in the source:

        if (driver->data_ready[index] & EVENT_MASKS_TYPE) {
                /* Copy the type of data being passed */
                data_type = driver->data_ready[index] & EVENT_MASKS_TYPE;
                session_info = diag_md_session_get_peripheral(APPS_DATA);
                COPY_USER_SPACE_OR_EXIT(buf, data_type, 4);
2931            if (session_info && session_info->event_mask &&
2932                session_info->event_mask->ptr) {
2933                    COPY_USER_SPACE_OR_EXIT(buf + sizeof(int),
                                *(session_info->event_mask->ptr),
                                session_info->event_mask->mask_len);
                } else {
                        COPY_USER_SPACE_OR_EXIT(buf + sizeof(int),
                                *(event_mask.ptr),
                                event_mask.mask_len);
                }
                driver->data_ready[index] ^= EVENT_MASKS_TYPE;
                goto exit;
        }

Line 2933 is the COPY_USER_SPACE_OR_EXIT(buf + sizeof(int), call, and that is where the crash occurred. Looking at the crash scene again:

r8 : c5227bc0
r3 : 00000000

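In other words, r3 = 0 is what triggered the crash. r3 and r8 look like they are related to lines 2934 and 2935, but are they really? First check the offsets. The structs are full of macros and nested structs, so let gdb help: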

$ arm-eabi-gdb vmlinux 
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.

(gdb) p &((struct diag_md_session_t*)0)->event_mask
$1 = (struct diag_mask_info **) 0x30

(gdb) p &((struct diag_mask_info*)0)->mask_len
$2 = (int *) 0x4 <__vectors_start+4>

Looking a bit further up in the listing, we can work out:

r8 = session_info
r3 = [r8 + 48] = session_info->event_mask
r3 = [r3 + 4] = session_info->event_mask->mask_len
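To double-check this mapping against the oops, the two loads can be replayed by hand. The sketch below is only illustrative arithmetic: it uses the two offsets reported by gdb and the r8 and r3 values from the register dump, and the names `replay` and `u32` are invented for this example.

```python
EVENT_MASK_OFFSET = 0x30  # gdb: &((struct diag_md_session_t*)0)->event_mask
MASK_LEN_OFFSET = 0x4     # gdb: &((struct diag_mask_info*)0)->mask_len


def u32(value):
    """Wrap to 32 bits, as address arithmetic does on this ARM target."""
    return value & 0xFFFFFFFF


def replay(r8, loaded_event_mask):
    """Return the addresses read by the two ldreq instructions."""
    first_load = u32(r8 + EVENT_MASK_OFFSET)                 # ldreq r3, [r8, #48]
    second_load = u32(loaded_event_mask + MASK_LEN_OFFSET)   # ldreq r3, [r3, #4]
    return first_load, second_load


# r8 and r3 are taken from the register dump; r3 holds the result of the
# first load at the moment the second load faults.
first, second = replay(r8=0xc5227bc0, loaded_event_mask=0x00000000)
print(f"first load reads  {first:08x}")
print(f"second load reads {second:08x}")
```

The second address comes out as 00000004, which matches the "NULL pointer dereference at virtual address 00000004" line at the top of the log.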

The crash is triggered by r3 = 0, so session_info->event_mask is 0? But line 2931 has already checked for that:

/code/kernel/msm-3.18/drivers/char/diag/diagchar_core.c:2931 (discriminator 1)
c04e0a68: e3580000 cmp r8, #0 ==================> r8 = session_info
/work/buildfarm/jenkins/workspace/buildfarml_rmnj_10/kernel/msm-3.18/drivers/char/diag/diagchar_core.c:2930 (discriminator 1)
c04e0a6c: e2833004 add r3, r3, #4
c04e0a70: e58d3024 str r3, [sp, #36] ; 0x24
/code/kernel/msm-3.18/drivers/char/diag/diagchar_core.c:2931 (discriminator 1)
c04e0a74: 0a00002d beq c04e0b30 <diagchar_read+0x61c>
c04e0a78: e5982030 ldr r2, [r8, #48] ; 0x30 =====> r2 = session_info->event_mask
c04e0a7c: e3520000 cmp r2, #0 ==============> session_info->event_mask == 0?
c04e0a80: 0a00002a beq c04e0b30 <diagchar_read+0x61c>
/code/kernel/msm-3.18/drivers/char/diag/diagchar_core.c:2932 (discriminator 1)
c04e0a84: e592a000 ldr sl, [r2]
/code/kernel/msm-3.18/drivers/char/diag/diagchar_core.c:2931 (discriminator 1)
c04e0a88: e35a0000 cmp sl, #0
c04e0a8c: 0a000027 beq c04e0b30 <diagchar_read+0x61c>
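To confirm that the check at c04e0a78 and the load at c04e0b20 really read the same memory location, the raw instruction words can be decoded independently of objdump. The sketch below handles only the A32 "LDR (immediate)" word load with plain offset addressing, which is all this listing needs; the function name `decode_ldr_imm` is invented for this example.

```python
# A32 condition codes indexed by bits [31:28]; 0xE (always) prints as "".
CONDS = ["eq", "ne", "cs", "cc", "mi", "pl", "vs", "vc",
         "hi", "ls", "ge", "lt", "gt", "le", "", None]


def decode_ldr_imm(word):
    """Decode an A32 'LDR Rt, [Rn, #imm12]' word (offset addressing only)."""
    cond = CONDS[word >> 28]
    if cond is None or ((word >> 25) & 0x7) != 0b010:
        raise ValueError("not a conditional load/store immediate encoding")
    p, u, b, w, l = [(word >> bit) & 1 for bit in (24, 23, 22, 21, 20)]
    if not (l == 1 and b == 0 and p == 1 and w == 0):
        raise ValueError("not a word LDR with plain offset addressing")
    rn = (word >> 16) & 0xF   # base register
    rt = (word >> 12) & 0xF   # destination register
    imm = word & 0xFFF        # unsigned 12-bit offset
    if u == 0:                # U bit clear means the offset is subtracted
        imm = -imm
    return f"ldr{cond} r{rt}, [r{rn}, #{imm}]"


for addr, word in [(0xc04e0a78, 0xe5982030), (0xc04e0a84, 0xe592a000),
                   (0xc04e0b20, 0x05983030), (0xc04e0b24, 0x05933004)]:
    print(f"{addr:08x}: {decode_ldr_imm(word)}")
```

Both c04e0a78 and c04e0b20 decode to a load from [r8, #48], so the pointer that was tested and the pointer that was later dereferenced come from two separate reads of the same field. (objdump prints r10 as sl.)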

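So the pointer was non-zero when it was checked at c04e0a78, yet the reload of the same field at c04e0b20 returned 0. Could the DDR have flipped a bit? Most likely a hardware problem.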