[Android Stability] Part 015 [Problem] Unable to handle kernel NULL pointer dereference

0. Symptom

  • Device crash

1. Problem Analysis

1.1 dmesg_TZ.txt

[    9.188060][  T175] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000102
[ 9.188065][ T175] Mem abort info:
[ 9.188067][ T175] ESR = 0x0000000096000005
[ 9.188069][ T175] EC = 0x25: DABT (current EL), IL = 32 bits
[ 9.188072][ T175] SET = 0, FnV = 0
[ 9.188074][ T175] EA = 0, S1PTW = 0
[ 9.188075][ T175] FSC = 0x05: level 1 translation fault
[ 9.188078][ T175] Data abort info:
[ 9.188079][ T175] ISV = 0, ISS = 0x00000005
[ 9.188080][ T175] CM = 0, WnR = 0
[ 9.188083][ T175] user pgtable: 4k pages, 39-bit VAs, pgdp=00000000c850e000
[ 9.188086][ T175] [0000000000000102] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
[ 9.188095][ T175] Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
[ 9.188188][ T175] Dumping ftrace buffer:
[ 9.188199][ T175] (ftrace buffer empty)

[ 9.188845][ T175] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[ 9.188849][ T175] Workqueue: events power_supply_changed_work
[ 9.188863][ T175] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 9.188868][ T175] pc : __queue_work+0x28/0x550
[ 9.188876][ T175] lr : queue_work_on+0x3c/0x80
[ 9.188880][ T175] sp : ffffffc00b473ca0
[ 9.188882][ T175] x29: ffffffc00b473ca0 x28: ffffff804531dbc8 x27: ffffff82f2740fa8
[ 9.188890][ T175] x26: ffffff800b791f10 x25: 0000000000000000 x24: 0000000000000007
[ 9.188896][ T175] x23: 0000000000000000 x22: 0000000000000001 x21: 0000000000000000
[ 9.188902][ T175] x20: 0000000000000000 x19: ffffff806d0f9148 x18: ffffffc00ac0d040
[ 9.188908][ T175] x17: 000000002a4cec24 x16: 000000002a4cec24 x15: 0000000000000046
[ 9.188914][ T175] x14: 0000000000000000 x13: 0000000000000ef0 x12: 0000000000000002
[ 9.188920][ T175] x11: 0000000000000000 x10: ffffffffffffd240 x9 : 000000000000001b
[ 9.188926][ T175] x8 : 0000000000000001 x7 : ffffff806baa9380 x6 : 000000161b03f216
[ 9.188932][ T175] x5 : 1672031b16000000 x4 : 0080000000000000 x3 : 1b430b9338000000
[ 9.188939][ T175] x2 : ffffff806d0f9148 x1 : 0000000000000000 x0 : 0000000000000020
[ 9.188946][ T175] Call trace:
[ 9.188948][ T175] __queue_work+0x28/0x550
[ 9.188953][ T175] queue_work_on+0x3c/0x80
[ 9.188957][ T175] fts_power_usb_notifier_callback+0x2c/0x40 [focaltech_spi]
[ 9.189037][ T175] blocking_notifier_call_chain+0x70/0xbc
[ 9.189047][ T175] power_supply_changed_work+0x7c/0xc8
[ 9.189054][ T175] process_one_work+0x1e4/0x43c
[ 9.189060][ T175] worker_thread+0x25c/0x430
[ 9.189065][ T175] kthread+0x104/0x1d4
[ 9.189069][ T175] ret_from_fork+0x10/0x20
[ 9.189079][ T175] Code: a9054ff4 910003fd aa0203f3 aa0103f7 (39440828)

Initial findings:

  • Problem type: Unable to handle kernel NULL pointer dereference at virtual address 0000000000000102
  • Faulting module: focaltech_spi
  • Faulting function: fts_power_usb_notifier_callback+0x2c
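The "Mem abort info" and "Data abort info" fields in the log can be reproduced from the raw ESR value, which is a quick sanity check when only the Oops line survives. The sketch below assumes the ARMv8-A ESR_EL1 layout for data aborts (EC in bits [31:26], IL in bit 25, ISS in bits [24:0]); the function name decode_data_abort_esr is illustrative and is not a kernel API.

```python
def decode_data_abort_esr(esr: int) -> dict:
    """Split an ESR_EL1 value for a data abort into the fields printed by the kernel."""
    iss = esr & 0x1FFFFFF              # ISS: bits [24:0]
    return {
        "EC": (esr >> 26) & 0x3F,      # exception class, 0x25 = DABT (current EL)
        "IL": (esr >> 25) & 1,         # 1 = 32-bit instruction
        "ISS": iss,
        "ISV": (iss >> 24) & 1,
        "SET": (iss >> 11) & 0x3,
        "FnV": (iss >> 10) & 1,
        "EA": (iss >> 9) & 1,
        "CM": (iss >> 8) & 1,
        "S1PTW": (iss >> 7) & 1,
        "WnR": (iss >> 6) & 1,         # 0 = the faulting access was a read
        "FSC": iss & 0x3F,             # 0x05 = level 1 translation fault
    }


fields = decode_data_abort_esr(0x96000005)
for name, value in fields.items():
    print(f"{name} = {value:#x}")
```

WnR = 0 together with FSC = 0x05 says the fault was a read from an unmapped address, which matches a load through a NULL base pointer plus a small offset.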

1.2 Restoring the crash scene with TRACE32

android-stability-015_0001.png
The restored context shows that the faulting instruction is:

ldrb w8,[x1, #0x102]

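This instruction loads one byte from the address x1 + 0x102 into w8 (the low 32 bits of x8). Since x1 is 0 at this point, the accessed address is 0x102, which is exactly the NULL pointer dereference address (0000000000000102) reported in the call trace.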
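The same instruction can be recovered without TRACE32 from the word in parentheses on the "Code:" line of the oops (39440828), which is the instruction at pc. The following is a minimal sketch that handles only the A64 "LDRB (immediate, unsigned offset)" encoding; decode_ldrb_uimm is a hypothetical helper name, and any other opcode is rejected rather than guessed.

```python
def decode_ldrb_uimm(word: int) -> dict:
    """Decode an A64 LDRB (immediate, unsigned offset) instruction word."""
    # Bits [31:22] are fixed to 0b0011100101 for this encoding.
    if (word >> 22) != 0b0011100101:
        raise ValueError(f"{word:#010x} is not LDRB (immediate, unsigned offset)")
    return {
        "rt": word & 0x1F,             # destination register (Wt)
        "rn": (word >> 5) & 0x1F,      # base register (Xn)
        "imm": (word >> 10) & 0xFFF,   # byte offset, not scaled for LDRB
    }


insn = decode_ldrb_uimm(0x39440828)
text = f"ldrb w{insn['rt']}, [x{insn['rn']}, #{insn['imm']:#x}]"
print(text)

# x1 is 0 in the register dump, so the effective address is just the offset.
x1 = 0x0
fault_addr = x1 + insn["imm"]
print(f"fault address = {fault_addr:#018x}")
```

The computed fault address equals the virtual address in the first line of the oops, which confirms that the base register x1, and not the offset, is the bad value.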

At this point x1 holds the argument struct workqueue_struct *wq, so wq is NULL, which means this wq has already been destroyed!
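The link between x1 and wq follows from the AArch64 procedure call standard, which passes the first integer or pointer arguments in x0, x1, x2 and so on. The sketch below lines up the register dump with the parameter order of __queue_work(int cpu, struct workqueue_struct *wq, struct work_struct *work) as it appears in mainline kernel/workqueue.c; it is assumed that the vendor kernel keeps the same signature, and map_args is an illustrative helper.

```python
# Argument registers copied from the oops register dump.
REGS = {"x0": 0x20, "x1": 0x0, "x2": 0xFFFFFF806D0F9148}

# Parameter order of __queue_work() in mainline (assumed unchanged here).
PARAMS = ["int cpu", "struct workqueue_struct *wq", "struct work_struct *work"]


def map_args(regs: dict, params: list) -> dict:
    """Pair the i-th parameter with register x<i>, as AAPCS64 does for integer/pointer args."""
    return {param: regs[f"x{i}"] for i, param in enumerate(params)}


args = map_args(REGS, PARAMS)
for param, value in args.items():
    print(f"{param:32s} = {value:#018x}")
```

Because pc is only 0x28 bytes into __queue_work, the argument registers are very likely still untouched. The two words just before the faulting one on the "Code:" line, aa0203f3 and aa0103f7, decode to mov x19, x2 and mov x23, x1, and the dump shows x19 equal to x2 and x23 equal to 0, which is consistent with wq arriving as NULL.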

Move the stack frame up to fts_power_usb_notifier_callback:
android-stability-015_0002.png

In this frame, the wq passed in is a valid wq!
So the problem is that the wq was destroyed while fts_power_usb_notifier_callback was calling down into queue_work.

2. Root Cause

The workqueue fts_data->ts_workqueue was destroyed while fts_power_usb_notifier_callback was executing.