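I used to do crash analysis on the R platform. It basically meant capturing the CPU registers and other state at the moment of the crash, disassembling with objdump, and mapping the result back to the source code. Now that I work on a phone platform there is one more tool, TRACE32, which Qualcomm uses for all of its crash analysis. Crash issues now all land on me, and asking Qualcomm every time is not sustainable, so I am picking this skill back up. I think it can be done without TRACE32; it is basically the same old routine.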

First, the crash scene:

[ 1256.852648] Unable to handle kernel NULL pointer dereference at virtual address 00000004 
[ 1256.931061] pgd = e2110000 
[ 1256.933725] [00000004] *pgd=00000000 
[ 1256.937725] Internal error: Oops: 5 [#1] PREEMPT SMP ARM 
[ 1256.942470] Modules linked in: wlan(O) [last unloaded: wlan] 
[ 1256.948098] CPU: 1 PID: 585 Comm: qti Tainted: G W O 3.18.71-perf-g24d2c84 #1 
[ 1256.956092] task: e41161c0 ti: e2148000 task.ti: e2148000 
[ 1256.961479] PC is at diagchar_read+0x610/0x11fc 
[ 1256.965978] LR is at 0x0 
[ 1256.968488] pc : [<c04e0b24>] lr : [<00000000>] psr: 60010013 
[ 1256.968488] sp : e2149ef0 ip : 00000051 fp : b1bb5b7c 
[ 1256.979944] r10: c6651000 r9 : 00000201 r8 : c5227bc0 
[ 1256.985152] r7 : c13dbc38 r6 : 00000014 r5 : 000186a0 r4 : b1bb5b78 
[ 1256.991676] r3 : 00000000 r2 : 80000000 r1 : 00000000 r0 : 00000000 

This is a kernel NULL pointer dereference. One key piece of information is pc: c04e0b24.
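As a cross-check before searching the disassembly, the "PC is at diagchar_read+0x610/0x11fc" line can be turned back into the start address of the function. The following is a minimal sketch; the names parse_pc_symbol and function_start are made up for this illustration and are not part of any kernel tool.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Fields of the "symbol+0xOFFSET/0xSIZE" text printed after "PC is at". */
struct pc_symbol {
    char name[64];        /* function name, e.g. "diagchar_read" */
    unsigned int offset;  /* distance of pc from the start of the function */
    unsigned int size;    /* total size of the function in bytes */
};

/* Hypothetical helper: returns 1 if all three fields were parsed, else 0. */
static int parse_pc_symbol(const char *text, struct pc_symbol *out)
{
    /* %x accepts an optional 0x prefix, so "0x610" parses directly. */
    return sscanf(text, "%63[^+]+%x/%x",
                  out->name, &out->offset, &out->size) == 3;
}

/* Hypothetical helper: start address of the function that contains pc. */
static uint32_t function_start(uint32_t pc, const struct pc_symbol *sym)
{
    assert(sym->offset < sym->size);  /* pc must lie inside the function */
    return pc - sym->offset;
}
```

Feeding it "diagchar_read+0x610/0x11fc" and pc 0xc04e0b24 gives 0xc04e0514 as the start of diagchar_read. The branch targets in the objdump listing agree: c04e0bd0 is labelled diagchar_read+0x6bc, and c04e0bd0 - 0x6bc is also c04e0514, so vmlinux matches the running kernel.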

Disassemble vmlinux with objdump (the -lD options are basically enough), then search the output for the pc:

/code/kernel/msm-3.18/drivers/char/diag/diagchar_core.c:2933
c04e0b1c:    e3510000     cmp     r1, #0
c04e0b20:    05983030     ldreq      r3, [r8, #48]    ; 0x30
c04e0b24:    05933004     ldreq     r3, [r3, #4] ===============> crash here
c04e0b28:    0a000028     beq     c04e0bd0 <diagchar_read+0x6bc>
c04e0b2c:    ea0001b2     b     c04e11fc <diagchar_read+0xce8> 
/code/kernel/msm-3.18/drivers/char/diag/diagchar_core.c:2937
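To double-check the faulting instruction without trusting the mnemonic column, the raw word 05933004 can be decoded by hand. The sketch below covers only the A32 "LDR/STR with 12-bit immediate offset" form; the struct and function names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded fields of an A32 LDR/STR with a 12-bit immediate offset. */
struct a32_ldst_imm {
    int valid;       /* 1 if bits 27..25 are 010 (immediate-offset form) */
    unsigned cond;   /* bits 31..28: 0x0 = EQ, 0xE = AL */
    int pre_index;   /* P bit (24): offset applied before the access */
    int add;         /* U bit (23): 1 = add offset, 0 = subtract */
    int load;        /* L bit (20): 1 = LDR, 0 = STR */
    unsigned rn;     /* bits 19..16: base register */
    unsigned rd;     /* bits 15..12: destination/source register */
    uint32_t imm12;  /* bits 11..0: unsigned offset */
};

static struct a32_ldst_imm decode_ldst_imm(uint32_t insn)
{
    struct a32_ldst_imm d;

    d.valid     = ((insn >> 25) & 0x7u) == 0x2u;
    d.cond      = insn >> 28;
    d.pre_index = (insn >> 24) & 1u;
    d.add       = (insn >> 23) & 1u;
    d.load      = (insn >> 20) & 1u;
    d.rn        = (insn >> 16) & 0xFu;
    d.rd        = (insn >> 12) & 0xFu;
    d.imm12     = insn & 0xFFFu;
    return d;
}

/* Address the instruction accesses, given the current value of Rn. */
static uint32_t access_address(const struct a32_ldst_imm *d, uint32_t rn_value)
{
    assert(d->valid);
    if (!d->pre_index)
        return rn_value;  /* post-indexed: the base is used unchanged */
    return d->add ? rn_value + d->imm12 : rn_value - d->imm12;
}
```

Decoding 0x05933004 yields cond = EQ, a load, Rn = r3, Rd = r3, offset 4. With r3 = 0 from the register dump, the accessed address is 0x4, which is exactly the "virtual address 00000004" reported in the Oops.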

Now find line 2933 in the source:

    if (driver->data_ready[index] & EVENT_MASKS_TYPE) {
        /*Copy the type of data being passed*/
        data_type = driver->data_ready[index] & EVENT_MASKS_TYPE;
        session_info = diag_md_session_get_peripheral(APPS_DATA);
        COPY_USER_SPACE_OR_EXIT(buf, data_type, 4);
2931        if (session_info && session_info->event_mask &&
2932            session_info->event_mask->ptr) {
2933            COPY_USER_SPACE_OR_EXIT(buf + sizeof(int),
                    *(session_info->event_mask->ptr),
                    session_info->event_mask->mask_len);
        } else {
            COPY_USER_SPACE_OR_EXIT(buf + sizeof(int),
                        *(event_mask.ptr),
                        event_mask.mask_len);
        }
        driver->data_ready[index] ^= EVENT_MASKS_TYPE;
        goto exit;
    }

Line 2933 is the start of the call COPY_USER_SPACE_OR_EXIT(buf + sizeof(int), ...). Now look again at the registers at the crash site:

r8 : c5227bc0
r3 : 00000000

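In other words, r3 = 0 is what triggered the crash: ldreq r3, [r3, #4] reads from r3 + 4 = 0x4, which matches the faulting virtual address 00000004 in the log. r3 and r8 look like they belong to the arguments on lines 2934 and 2935. To confirm that, check the field offsets first. The structs are full of macros and nested structs, so let gdb do the work: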

$ arm-eabi-gdb vmlinux 
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.

(gdb) p &((struct diag_md_session_t*)0)->event_mask
$1 = (struct diag_mask_info **) 0x30

(gdb) p &((struct diag_mask_info*)0)->mask_len
$2 = (int *) 0x4 <__vectors_start+4>
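The gdb expression &((struct T *)0)->member is the same idea as C's offsetof. The sketch below reproduces the two offsets with stand-in structs. These are not the driver's real definitions: the members in front of event_mask are not shown in this post, so they are replaced by an opaque placeholder, and pointers are stored as uint32_t so the layout matches the 32-bit ARM target even on a 64-bit host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct diag_mask_info (only the first two fields). */
struct demo_mask_info {
    uint32_t ptr;       /* 32-bit pointer on the target, offset 0x0 */
    int32_t  mask_len;  /* offset 0x4 */
};

/* Stand-in for struct diag_md_session_t. */
struct demo_md_session {
    uint8_t  not_shown[0x30];  /* placeholder for the members before event_mask */
    uint32_t event_mask;       /* 32-bit pointer on the target, offset 0x30 */
};

/* Compile-time checks: the layout matches what gdb printed. */
static_assert(offsetof(struct demo_md_session, event_mask) == 0x30,
              "event_mask should sit at the offset used by ldr r3, [r8, #48]");
static_assert(offsetof(struct demo_mask_info, mask_len) == 0x4,
              "mask_len should sit at the offset used by ldr r3, [r3, #4]");

static size_t event_mask_offset(void)
{
    return offsetof(struct demo_md_session, event_mask);
}

static size_t mask_len_offset(void)
{
    return offsetof(struct demo_mask_info, mask_len);
}
```

The point is only that the immediates in the two ldreq instructions (#48 and #4) are field offsets, which is why gdb's answer can be matched directly against the disassembly.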

Combined with the instructions a little further up, this gives:

r8 = session_info
r3 = [r8 + 48] = session_info->event_mask
r3 = [r3 + 4] = session_info->event_mask->mask_len

The crash was triggered by r3 = 0, so session_info->event_mask was 0? But line 2931 has already checked for that:

/code/kernel/msm-3.18/drivers/char/diag/diagchar_core.c:2931 (discriminator 1)
c04e0a68:    e3580000     cmp    r8, #0 ==================> r8 = session_info
/work/buildfarm/jenkins/workspace/buildfarml_rmnj_10/kernel/msm-3.18/drivers/char/diag/diagchar_core.c:2930 (discriminator 1)
c04e0a6c:    e2833004     add    r3, r3, #4
c04e0a70:    e58d3024     str    r3, [sp, #36]    ; 0x24
/code/kernel/msm-3.18/drivers/char/diag/diagchar_core.c:2931 (discriminator 1)
c04e0a74:    0a00002d     beq    c04e0b30 <diagchar_read+0x61c>
c04e0a78:    e5982030     ldr    r2, [r8, #48]    ; 0x30 =====> r2 = session_info->event_mask
c04e0a7c:    e3520000     cmp    r2, #0 ==============> session_info->event_mask == 0?
c04e0a80:    0a00002a     beq    c04e0b30 <diagchar_read+0x61c>
/code/kernel/msm-3.18/drivers/char/diag/diagchar_core.c:2932 (discriminator 1)
c04e0a84:    e592a000     ldr    sl, [r2]
/code/kernel/msm-3.18/drivers/char/diag/diagchar_core.c:2931 (discriminator 1)
c04e0a88:    e35a0000     cmp    sl, #0
c04e0a8c:    0a000027     beq    c04e0b30 <diagchar_read+0x61c>
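The two listings show that session_info->event_mask is loaded twice: once into r2 for the check at c04e0a78, and again into r3 at c04e0b20 for the use. The C model below sketches that sequence with simplified stand-in structs. The `between` callback is a hypothetical hook standing for whatever altered memory between the two loads; the sketch makes no claim about what that was.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins; only the fields touched by the code path. */
struct demo_mask_info { uint8_t *ptr; int mask_len; };
struct demo_session   { struct demo_mask_info *event_mask; };

/* Hook for "something changes memory between the check and the use". */
typedef void (*interfere_fn)(struct demo_session *s);

#define WOULD_FAULT   (-1)  /* the real code would read address 0x4 here */
#define USED_FALLBACK (-2)  /* the else branch (global event_mask) is taken */

static int mask_len_double_load(struct demo_session *s, interfere_fn between)
{
    struct demo_mask_info *checked, *reloaded;

    if (s == NULL)                        /* cmp r8, #0 */
        return USED_FALLBACK;
    checked = s->event_mask;              /* ldr r2, [r8, #48] */
    if (checked == NULL)                  /* cmp r2, #0 */
        return USED_FALLBACK;
    if (checked->ptr == NULL)             /* ldr sl, [r2]; cmp sl, #0 */
        return USED_FALLBACK;

    if (between != NULL)
        between(s);                       /* memory changes after the check */

    reloaded = s->event_mask;             /* ldreq r3, [r8, #48] */
    if (reloaded == NULL)
        return WOULD_FAULT;               /* ldreq r3, [r3, #4] with r3 = 0 */
    return reloaded->mask_len;
}

/* One possible interference: the pointer becomes NULL. */
static void clear_event_mask(struct demo_session *s)
{
    assert(s != NULL);
    s->event_mask = NULL;
}
```

Called with no interference, the model returns mask_len. Called with clear_event_mask, it passes every check and still ends at the WOULD_FAULT path, which is the state captured in the Oops: the check on line 2931 does not protect the second load.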

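So, could the value in DDR have flipped? My guess is that this is most likely a hardware problem. That said, the same symptom (non-NULL at the check on line 2931, NULL at the reload on line 2933) would also appear if another thread cleared session_info->event_mask between the two loads, so a race is worth ruling out before blaming the hardware.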