Notes on a KE (kernel exception) crash caused by a NULL pointer dereference in the fp (fingerprint) driver

1. Background

https://wayawbott0.f.mioffice.cn/sheets/shtk4qr1GSkUjvozmsj0OWi0tGe
1731377038298.png
Test build: V816.0.24.8.26.UGUCNXM
MTBF soak testing on the stable build reported a large number of crashes caused by NULL pointer dereferences.

2. Analysis

2.1 Parsing the dump

Parse the dump offline with the linux ramdump parser tool, then open dmesg_tz.txt:

[51222.768793][T13540] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000038
[51222.768825][T13540] Mem abort info:
[51222.768836][T13540] ESR = 0x96000007
[51222.768848][T13540] EC = 0x25: DABT (current EL), IL = 32 bits
[51222.768858][T13540] SET = 0, FnV = 0
[51222.768868][T13540] EA = 0, S1PTW = 0
[51222.768877][T13540] Data abort info:
[51222.768887][T13540] ISV = 0, ISS = 0x00000007
[51222.768896][T13540] CM = 0, WnR = 0
[51222.768909][T13540] user pgtable: 4k pages, 39-bit VAs, pgdp=00000000bd874000
[51222.768919][T13540] [0000000000000038] pgd=00000000e7355003, p4d=00000000e7355003, pud=00000000e7355003, pmd=000000084d2de003, pte=0000000000000000
[51222.768955][T13540] Internal error: Oops: 96000007 [#1] PREEMPT SMP
[51222.768996][T13540] Skip md ftrace buffer dump for: 0x1609e0

//...

[51222.770472][T13540] CPU: 1 PID: 13540 Comm: pool-10-thread- Tainted: G WC O 5.10.198-android12-9-00085-g226a9632f13d-ab11136126 #1
[51222.770483][T13540] Hardware name: Qualcomm Technologies, Inc. Flame QRD (DT)
[51222.770498][T13540] pstate: 00400005 (nzcv daif +PAN -UAO -TCO BTYPE=--)
[51222.770518][T13540] pc : mutex_lock+0x34/0x184
[51222.770535][T13540] lr : seq_read_iter+0x4c/0x640
[51222.770545][T13540] sp : ffffffc031873bb0
[51222.770555][T13540] x29: ffffffc031873bc0 x28: ffffff883b6d5c80
[51222.770572][T13540] x27: 0000000000000000 x26: 0000000000000000
[51222.770589][T13540] x25: 0000000000000000 x24: ffffff884e36b478
[51222.770606][T13540] x23: ffffffc031873c50 x22: 0000000000000400
[51222.770622][T13540] x21: ffffffc031873c78 x20: 0000000000000000
[51222.770638][T13540] x19: 0000000000000038 x18: ffffffc01b6ad050
[51222.770654][T13540] x17: 0000000000000000 x16: 0000000000000000
[51222.770670][T13540] x15: 0000000000000000 x14: 0000000000000008
[51222.770686][T13540] x13: ffffffc031873ca8 x12: 0000000000000004
[51222.770703][T13540] x11: ffffff883b6d5c80 x10: 0000000000000000
[51222.770719][T13540] x9 : 0000000000000000 x8 : 0000000000000038
[51222.770735][T13540] x7 : 0000000000000000 x6 : 0000000000000000
[51222.770751][T13540] x5 : ffffff805abde818 x4 : 0000000000000000
[51222.770768][T13540] x3 : ffffffc031873de0 x2 : ffffff883b6d5c80
[51222.770784][T13540] x1 : 0000000000000000 x0 : 0000000000000038
[51222.770801][T13540] Call trace:
[51222.770815][T13540] mutex_lock+0x34/0x184
[51222.770828][T13540] seq_read_iter+0x4c/0x640
[51222.770841][T13540] seq_read+0xfc/0x134
[51222.770856][T13540] proc_reg_read+0x104/0x1fc
[51222.770871][T13540] vfs_read+0xf4/0x368
[51222.770884][T13540] ksys_read+0x7c/0xf0
[51222.770897][T13540] __arm64_sys_read+0x20/0x30
[51222.770911][T13540] el0_svc_common+0xd4/0x270
[51222.770926][T13540] el0_svc+0x28/0x98
[51222.770939][T13540] el0_sync_handler+0x8c/0xf0
[51222.770952][T13540] el0_sync+0x1b8/0x1c0
[51222.770968][T13540] Code: d503201f aa0803e0 aa1f03e1 aa0103e9 (c8e97d02)
[51222.770982][T13540] ---[ end trace a7da2251c6cbb391 ]---
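
As a cross-check on the "Mem abort info" lines in the log, the ESR value can be decoded by hand. The sketch below follows the ESR_EL1 field layout from the Arm architecture (EC in bits 31:26, IL in bit 25, ISS in bits 24:0; for data aborts, WnR in bit 6 and DFSC in bits 5:0). The struct and function names are made up for illustration and are not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of interest from an AArch64 ESR value (illustrative struct). */
struct esr_fields {
    unsigned ec;    /* exception class, bits 31:26 */
    unsigned il;    /* instruction length bit, bit 25 */
    unsigned iss;   /* instruction specific syndrome, bits 24:0 */
    unsigned wnr;   /* data abort only: 1 = write, 0 = read (bit 6) */
    unsigned dfsc;  /* data abort only: fault status code (bits 5:0) */
};

struct esr_fields decode_esr(uint32_t esr)
{
    struct esr_fields f;

    f.ec = (esr >> 26) & 0x3f;
    f.il = (esr >> 25) & 0x1;
    f.iss = esr & 0x1ffffff;
    f.wnr = (f.iss >> 6) & 0x1;
    f.dfsc = f.iss & 0x3f;
    return f;
}

/* Returns the page table level for a translation fault, or -1 otherwise. */
int translation_fault_level(unsigned dfsc)
{
    if (dfsc >= 0x04 && dfsc <= 0x07)
        return (int)(dfsc - 0x04);
    return -1;
}

/* Decodes the ESR from the log above and checks it against the printed fields. */
void check_logged_esr(void)
{
    struct esr_fields f = decode_esr(0x96000007u);

    assert(f.ec == 0x25);   /* DABT (current EL) */
    assert(f.wnr == 0);     /* read access */
    assert(translation_fault_level(f.dfsc) == 3);
}
```

For 0x96000007 this gives EC = 0x25, WnR = 0 (a read) and DFSC = 0x07, a level 3 translation fault, which agrees with the pte=0000000000000000 entry in the page table walk printed above.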

2.2 Restoring the crash context in TRACE32

1731377046745.png

r.s pc mutex_lock+0x34
r.s lr seq_read_iter+0x4c
r.s sp 0xffffffc031873bb0
r.s x29 0xffffffc031873bc0
r.s x28 0xffffff883b6d5c80
r.s x27 0x0000000000000000
r.s x26 0x0000000000000000
r.s x25 0x0000000000000000
r.s x24 0xffffff884e36b478
r.s x23 0xffffffc031873c50
r.s x22 0x0000000000000400
r.s x21 0xffffffc031873c78
r.s x20 0x0000000000000000
r.s x19 0x0000000000000038
r.s x18 0xffffffc01b6ad050
r.s x17 0x0000000000000000
r.s x16 0x0000000000000000
r.s x15 0x0000000000000000
r.s x14 0x0000000000000008
r.s x13 0xffffffc031873ca8
r.s x12 0x0000000000000004
r.s x11 0xffffff883b6d5c80
r.s x10 0x0000000000000000
r.s x9 0x0000000000000000
r.s x8 0x0000000000000038
r.s x7 0x0000000000000000
r.s x6 0x0000000000000000
r.s x5 0xffffff805abde818
r.s x4 0x0000000000000000
r.s x3 0xffffffc031873de0
r.s x2 0xffffff883b6d5c80
r.s x1 0x0000000000000000
r.s x0 0x0000000000000038

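After entering the register values, open the stack frame view. The PC sits at the point where the m->lock mutex is taken:
1731377050489.png
The variable m is read from iocb->ki_filp->private_data, and that value is NULL at this point.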
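
The fault address 0x38 can be checked against the structure layout: if m is NULL, &m->lock is just the offset of lock inside struct seq_file. The sketch below mirrors the leading fields of struct seq_file as I understand the 5.10 layout (an assumption, not copied from this build's headers); struct mutex_sketch is only a placeholder, since the contents of the mutex do not affect its offset.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder for struct mutex; only its position in the parent matters. */
struct mutex_sketch {
    long owner;
};

/* Simplified mirror of the leading fields of struct seq_file
 * (assumed 5.10 layout, 64-bit build). */
struct seq_file_sketch {
    char *buf;
    size_t size;
    size_t from;
    size_t count;
    size_t pad_until;
    long long index;     /* loff_t in the kernel */
    long long read_pos;  /* loff_t in the kernel */
    struct mutex_sketch lock;
};

/* Address that &m->lock evaluates to when m is NULL. */
size_t lock_addr_for_null_seq_file(void)
{
    return offsetof(struct seq_file_sketch, lock);
}

/* Holds on an LP64 target, where pointers and size_t are 8 bytes. */
void check_fault_address(void)
{
    assert(lock_addr_for_null_seq_file() == 0x38);
}
```

On a 64-bit build this evaluates to 0x38, the same value as the faulting virtual address and as x0 and x19 in the register dump.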


Continuing through the stack and stepping the PC back to the earlier frames shows that this variable originates from a struct file.
1731377054521.png


Inspect the struct file at that address:
v.v %s %t %o (struct file *)0xFFFFFF884E36B400
1731377058121.png
The file that triggered the problem turns out to be hwinfo.

2.3 The /proc/hwinfo node

Searching on the device turned up a /proc/hwinfo node. Running cat on it by hand crashes the phone, and the resulting dump matches the dump from the MTBF run.

2.4 Code review

1731377061570.png
1731377064978.png
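The hwinfo node is created by the fingerprint module. From the code, the node exists only to print a single log line; some fingerprint drivers have the creation commented out, while others still keep it.
We also verified that every affected device has the /proc/hwinfo node, that cat on it crashes the device every time, and that the stack trace matches the one from the MTBF crashes.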
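
The driver source appears here only as screenshots, so the following is a userspace model of one common way to reach this state, not the driver's actual code: the read handler goes through the seq_file path, but the open handler never attaches a seq_file to file->private_data (in the kernel that is what seq_open() or single_open() do). All names ending in _model are hypothetical, and the -EFAULT return stands in for the point where the real kernel dereferences the NULL pointer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal stand-ins for struct seq_file and struct file. */
struct seq_file_model {
    int lock;
};

struct file_model {
    void *private_data;
};

/* Hypothetical open handler that only logs and never sets private_data. */
int hwinfo_open_model(struct file_model *f)
{
    (void)f;
    return 0;
}

/* Models what single_open()/seq_open() provide: a seq_file on the file. */
int single_open_model(struct file_model *f)
{
    struct seq_file_model *m = calloc(1, sizeof(*m));

    if (!m)
        return -ENOMEM;
    f->private_data = m;
    return 0;
}

/* Frees the seq_file attached by single_open_model(). */
void release_model(struct file_model *f)
{
    free(f->private_data);
    f->private_data = NULL;
}

/* Models the start of the seq_file read path, which locks m->lock.
 * The kernel does not check m; this model reports -EFAULT instead. */
int seq_read_model(struct file_model *f)
{
    struct seq_file_model *m = f->private_data;

    if (!m)
        return -EFAULT; /* the kernel would fault on &m->lock here */
    m->lock = 1;        /* "mutex_lock" */
    m->lock = 0;        /* "mutex_unlock" */
    return 0;
}

/* Walks both cases: read after a log-only open, then after a proper open. */
void check_open_paths(void)
{
    struct file_model f = { NULL };

    assert(hwinfo_open_model(&f) == 0);
    assert(seq_read_model(&f) == -EFAULT);
    assert(single_open_model(&f) == 0);
    assert(seq_read_model(&f) == 0);
    release_model(&f);
}
```

In this model a read after hwinfo_open_model() hits the NULL private_data, while a read after single_open_model() succeeds, which is the difference between the crashing node and a correctly registered seq_file node.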

3. Solution

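This node dates back to a fingerprint requirement from long ago. It was confirmed in 2022 that the requirement is no longer needed, so the node can be removed.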