[Android Stability] Part 041 [Issue] Unable to handle kernel paging request at virtual address 00046ffca9037bf9
iliuqi
1. Symptom
The device crashed with a kernel panic.
2. Analysis
    [ 189.052980][ T5068] Unable to handle kernel paging request at virtual address 00046ffca9037bf9
    [ 189.052991][ T5068] Mem abort info:
    [ 189.052997][ T5068]   ESR = 0x0000000096000004
    [ 189.053005][ T5068]   EC = 0x25: DABT (current EL), IL = 32 bits
    [ 189.053013][ T5068]   SET = 0, FnV = 0
    [ 189.053020][ T5068]   EA = 0, S1PTW = 0
    [ 189.053027][ T5068]   FSC = 0x04: level 0 translation fault
    [ 189.053035][ T5068] Data abort info:
    [ 189.053039][ T5068]   ISV = 0, ISS = 0x00000004
    [ 189.053045][ T5068]   CM = 0, WnR = 0
    [ 189.053053][ T5068] [00046ffca9037bf9] address between user and kernel address ranges
    [ 189.053064][ T5068] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
    [ 189.053311][ T5068] Dumping ftrace buffer:
    [ 189.053331][ T5068]    (ftrace buffer empty)
    [ 189.055391][ T5068] CPU: 1 PID: 5068 Comm: binder:1027_3 Tainted: G WC OE 6.1.118-android14-11-maybe-dirty #1
    [ 189.055405][ T5068] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
    [ 189.055412][ T5068] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=
    [ 189.055426][ T5068] pc : dpm_complete+0x128/0x44c
    [ 189.055451][ T5068] lr : dpm_complete+0x114/0x44c
    [ 189.055462][ T5068] sp : ffffffc0243fbb40
    [ 189.055468][ T5068] x29: ffffffc0243fbb60 x28: ffffff8035d52580 x27: ffffffc00a1fc000
    [ 189.055489][ T5068] x26: ffffffc00a1fc210 x25: ffffffc0243fbb48 x24: ffffff8093e724a0
    [ 189.055508][ T5068] x23: ffffff8093e72518 x22: ffffff8093e72400 x21: ffffffc0092f0ae9
    [ 189.055527][ T5068] x20: ffffffc00a1fc1c0 x19: 0000000000000010 x18: ffffffc022c2d078
    [ 189.055545][ T5068] x17: 000000007b71745f x16: 000000007b71745f x15: ffffff8179342180
    [ 189.055564][ T5068] x14: 0000000000000010 x13: ffffffc0082809d4 x12: ffffffc00939e698
    [ 189.055582][ T5068] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffffc00a0c7000
    [ 189.055600][ T5068] x8 : a9046ffca9037bfd x7 : 3a4d50006574656c x6 : 0000101a1e00090b
    [ 189.055619][ T5068] x5 : 0b09001e1a100000 x4 : 0000008000000000 x3 : ffffff8056d3a9c8
    [ 189.055637][ T5068] x2 : 00000000ffff93a3 x1 : 0000000000000000 x0 : ffffff8093e72400
    [ 189.055657][ T5068] Call trace:
    [ 189.055663][ T5068]  dpm_complete+0x128/0x44c
    [ 189.055677][ T5068]  suspend_devices_and_enter+0x894/0xc04
    [ 189.055698][ T5068]  pm_suspend+0x330/0x694
    [ 189.055711][ T5068]  state_store+0x104/0x1c8
    [ 189.055724][ T5068]  kobj_attr_store+0x30/0x48
    [ 189.055747][ T5068]  sysfs_kf_write+0x54/0x6c
    [ 189.055769][ T5068]  kernfs_fop_write_iter+0x104/0x1a4
    [ 189.055789][ T5068]  vfs_write+0x244/0x2e0
    [ 189.055805][ T5068]  ksys_write+0x78/0xe8
    [ 189.055816][ T5068]  __arm64_sys_write+0x1c/0x2c
    [ 189.055829][ T5068]  invoke_syscall+0x58/0x114
    [ 189.055845][ T5068]  el0_svc_common+0xb4/0xfc
    [ 189.055857][ T5068]  do_el0_svc+0x24/0x84
    [ 189.055867][ T5068]  el0_svc+0x2c/0x90
    [ 189.055884][ T5068]  el0t_64_sync_handler+0x68/0xb4
    [ 189.055897][ T5068]  el0t_64_sync+0x1a4/0x1a8
    [ 189.055920][ T5068] Code: b40002a8 f9400508 b40003e8 aa1603e0 (b85fc110)
    [ 189.055933][ T5068]
    [ 189.169167][ T5068] Kernel panic - not syncing: Oops: Fatal exception
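The reported fault address can be cross-checked against the register dump. Decoding the bracketed instruction word b85fc110 by hand (my own decoding, not part of the original log) gives ldur w16, [x8, #-4], so the access targets x8 - 4, and the arm64 fault handler reports that address after untagging it. The sketch below redoes this arithmetic in user space. The helper names are hypothetical, and untag_addr() is only a simplified model of the kernel's untagged_addr().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified model of arm64 untagged_addr(): AND the address with its
 * sign-extension from bit 55. This clears the top (tag) byte when bit 55
 * is 0 and leaves a kernel-range address unchanged.
 */
static uint64_t untag_addr(uint64_t addr)
{
    uint64_t low56 = addr & 0x00ffffffffffffffULL;
    uint64_t sext = (addr & (1ULL << 55)) ? (low56 | 0xff00000000000000ULL)
                                          : low56;
    return addr & sext;
}

/* Effective address of "ldur w16, [x8, #-4]" for a given x8. */
static uint64_t ldur_minus4_target(uint64_t x8)
{
    return x8 - 4;
}

/* x8 is taken from the register dump in the Oops. */
static uint64_t reported_fault_addr(void)
{
    return untag_addr(ldur_minus4_target(0xa9046ffca9037bfdULL));
}

/* The result should equal the address printed in the first Oops line. */
static void fault_addr_self_check(void)
{
    assert(reported_fault_addr() == 0x00046ffca9037bf9ULL);
}
```

So the bad address comes straight from x8, which does not look like a kernel pointer at all (the valid kernel addresses in this dump all start with ffffff). The load at offset -4 from a call target is what I would expect from a kCFI type check before an indirect call, but that part is an inference, not something the log states.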
2.1 Narrowing down the module

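The problem occurs during a system suspend cycle: the call trace enters through state_store and pm_suspend and faults in dpm_complete.
Devices are suspended one after another.
The device that triggers the fault is disp_feature/disp-DSI-0.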

Something went wrong in the suspend flow: the class of disp-DSI-0 appears to have been unregistered.
2.2 First issue

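dmesg shows two threads running the initialization flow concurrently:
Around 7.0x s, thread T615 reaches mi_display_pwrkey_callback_set.
Around 7.04 s, thread T710 handles the pwrkey IRQ.
Around 7.40 s, T710 initializes mi disp_core and mi disp_log.
Around 7.45 s, T675 tries to initialize mi disp_core and mi disp_log again, sees that they are already initialized, and returns immediately.
Around 7.45 s, T675 initializes mi disp_feature.
This gives the first issue:
The display initialization is, surprisingly, triggered from the power-key interrupt handler instead of going through the normal display flow. This needs to be reworked.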
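One way to read this first issue: the backtrace in the next section shows mi_display_powerkey_callback reaching mi_disp_feature_init through mi_get_disp_feature, so the getter apparently initializes on demand. Below is a minimal user-space sketch of the alternative, assuming the getter is made side-effect free and the callback simply bails out when the display is not ready. All names here (feature_model, feature_get, feature_init_from_probe, powerkey_callback) are hypothetical stand-ins, not the driver's real API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct disp_feature. */
struct feature_model { int version; };

static struct feature_model g_storage;
static struct feature_model *g_feature;   /* NULL until the probe path runs */
static int g_init_calls;                  /* counts real initializations */

/* Initialization is reachable only from the normal display probe path. */
static int feature_init_from_probe(void)
{
    if (g_feature)
        return 0;                         /* already initialized */
    g_init_calls++;
    g_storage.version = 1;
    g_feature = &g_storage;
    return 0;
}

/* The getter has no side effects: it never initializes anything. */
static struct feature_model *feature_get(void)
{
    return g_feature;
}

/* Power-key callback: ignore the event if the display is not ready yet. */
static int powerkey_callback(void)
{
    struct feature_model *df = feature_get();

    if (!df)
        return -1;                        /* too early, nothing to do */
    return 0;
}

/* An early key press must not initialize the display. */
static void getter_self_check(void)
{
    assert(powerkey_callback() == -1);
    assert(g_init_calls == 0);
    assert(feature_init_from_probe() == 0);
    assert(powerkey_callback() == 0);
    assert(g_init_calls == 1);
}
```

With this shape, an early power-key interrupt can no longer race the display probe into a second initialization attempt.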
2.3 Second issue
    Line 4538: [ 7.456376][ T710] sysfs: cannot create duplicate filename '/devices/virtual/mi_display/disp_feature'
    Line 4549: [ 7.467624][ T710] CPU: 1 PID: 710 Comm: irq/135-pm8941_ Tainted: G WC OE 6.1.118-android14-11-maybe-dirty #1
    Line 4559: [ 7.485547][ T710] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
    Line 4560: [ 7.485552][ T710] Call trace:
    Line 4561: [ 7.485555][ T710]  dump_backtrace+0xf4/0x11c
    Line 4562: [ 7.485569][ T710]  show_stack+0x18/0x24
    Line 4563: [ 7.485573][ T710]  dump_stack_lvl+0x60/0x90
    Line 4564: [ 7.485580][ T710]  sysfs_create_dir_ns+0xf0/0x150
    Line 4565: [ 7.485588][ T710]  kobject_add_internal+0x228/0x478
    Line 4566: [ 7.485595][ T710]  kobject_add+0x94/0x10c
    Line 4567: [ 7.485600][ T710]  device_add+0x144/0x618
    Line 4568: [ 7.485607][ T710]  device_create_groups_vargs+0xcc/0x12c
    Line 4570: [ 7.499011][ T710]  device_create+0x58/0x80
    Line 4571: [ 7.499017][ T710]  mi_disp_feature_init+0xdc/0x20c [msm_drm]
    Line 4573: [ 7.510902][ T710]  mi_get_disp_feature+0x20/0x40 [msm_drm]
    Line 4575: [ 7.522143][ T710]  mi_display_powerkey_callback+0x18/0x80 [msm_drm]
    Line 4577: [ 7.537274][ T710]  pm8941_pwrkey_irq+0x1e8/0x330 [pm8941_pwrkey]
    Line 4578: [ 7.537302][ T710]  irq_thread_fn+0x44/0xa4
    Line 4579: [ 7.537315][ T710]  irq_thread+0x164/0x290
    Line 4580: [ 7.537320][ T710]  kthread+0x10c/0x154
    Line 4581: [ 7.537328][ T710]  ret_from_fork+0x10/0x20
    Line 4583: [ 7.547231][ T710] kobject_add_internal failed for disp_feature with -EEXIST, don't try to register things with the same name in the same directory.
    Line 4588: [ 7.559217][ T710] [mi_disp:mi_disp_feature_init [msm_drm]] [E]create device failed for disp_feature
    Line 4591: [ 7.572531][ T710] ------------[ cut here ]------------
    Line 4593: [ 7.584887][ T710] remove_proc_entry: removing non-empty directory '/proc/mi_display', leaking at least 'mipi_rw_prim'
    Line 4594: [ 7.584917][ T710] WARNING: CPU: 1 PID: 710 at fs/proc/generic.c:720 remove_proc_entry+0x1e0/0x1ec
    Line 4595: [ 7.584935][ T710] Modules linked in: rmnet_wlan(OE) rmnet_offload(OE) rmnet_perf(OE) rmnet_shs(OE) rmnet_perf_tether(OE) rmnet_core(OE) gauge_iio(E) ipanetm(OE)
    Line 4625: [ 7.672205][ T710] CPU: 1 PID: 710 Comm: irq/135-pm8941_ Tainted: G WC OE 6.1.118-android14-11-maybe-dirty #1
    Line 4626: [ 7.672211][ T710] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
    Line 4627: [ 7.672214][ T710] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
    Line 4628: [ 7.672219][ T710] pc : remove_proc_entry+0x1e0/0x1ec
    Line 4629: [ 7.672234][ T710] lr : remove_proc_entry+0x1e0/0x1ec
    Line 4630: [ 7.672240][ T710] sp : ffffffc00b4a3c60
    Line 4631: [ 7.672242][ T710] x29: ffffffc00b4a3c80 x28: 0000000000000000 x27: 00000000ffffffff
    Line 4632: [ 7.672250][ T710] x26: 0000000000000001 x25: ffffffc00a1a4580 x24: 000000000000000a
    Line 4633: [ 7.672256][ T710] x23: 000000000000000a x22: ffffffc009318048 x21: ffffff804c52b180
    Line 4634: [ 7.672263][ T710] x20: ffffff804c52b22c x19: ffffff804c52b200 x18: ffffffc00aafd048
    Line 4635: [ 7.672269][ T710] x17: 0000000000000015 x16: 00000000000000a4 x15: ffffffc00902ec88
    Line 4636: [ 7.672276][ T710] x14: 0000000000000001 x13: 000000000000004e x12: 0000000000000018
    Line 4637: [ 7.672282][ T710] x11: 00000000ffffffff
    Line 4640: [ 7.687628][ T710] x10: ffffffc00a09eb5c x9 : 67aa0542b3522000
    Line 4641: [ 7.687638][ T710] x8 : 67aa0542b3522000 x7 : 656c20746120676e x6 : 0000000000000027
    Line 4642: [ 7.687644][ T710] x5 : ffffff8179154234 x4 : ffffffc0093675d5 x3 : ffff0a00ffffff04
    Line 4643: [ 7.687651][ T710] x2 : 0000000000000001 x1 : 0000000000000000 x0 : 0000000000000063
    Line 4644: [ 7.687658][ T710] Call trace:
    Line 4645: [ 7.687663][ T710]  remove_proc_entry+0x1e0/0x1ec
    Line 4646: [ 7.687673][ T710]  mi_disp_core_deinit+0x34/0x60 [msm_drm]
    Line 4653: [ 7.705247][ T710]  mi_disp_feature_init+0x16c/0x20c [msm_drm]
    Line 4663: [ 7.722296][ T710]  mi_get_disp_feature+0x20/0x40 [msm_drm]
    Line 4669: [ 7.739086][ T710]  mi_display_powerkey_callback+0x18/0x80 [msm_drm]
    Line 4671: [ 7.762509][ T710]  pm8941_pwrkey_irq+0x1e8/0x330 [pm8941_pwrkey]
    Line 4672: [ 7.762528][ T710]  irq_thread_fn+0x44/0xa4
    Line 4673: [ 7.762539][ T710]  irq_thread+0x164/0x290
    Line 4674: [ 7.762544][ T710]  kthread+0x10c/0x154
    Line 4675: [ 7.762550][ T710]  ret_from_fork+0x10/0x20
    Line 4677: [ 7.784476][ T710] ---[ end trace 0000000000000000 ]---
    Line 4678: [ 7.784632][ T710] [mi_disp:mi_display_powerkey_callback [msm_drm]] [E]invalid dsi_display or dsi_panel ptr
pm8941_pwrkey_irq ultimately ends up calling mi_disp_core_deinit. The corresponding code:
    void mi_disp_core_deinit(void)
    {
        if (!g_disp_core)
            return;
        debugfs_remove_recursive(g_disp_core->debugfs_dir);
        remove_proc_entry(MI_DISPLAY_PROCFS_DIR, NULL);
        class_destroy(g_disp_core->class);
        kfree(g_disp_core);
        g_disp_core = NULL;
    }
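This destroys g_disp_core->class, frees g_disp_core with kfree, and sets g_disp_core to NULL.
I specifically asked an AI about this:
- class_destroy() unregisters and destroys the class.
- kfree(g_disp_core) does not zero the memory that g_disp_core points to. It returns the block to the allocator, which may hand it out again to any later allocation.
- g_disp_core = NULL only makes this one pointer variable point to NULL instead of the old address. Any other copy of the old address is unaffected.
Looking at the callers of mi_disp_core_deinit, the call site turns out to be the following code: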
    int mi_disp_feature_init(void)
    {
        int ret = 0;
        struct disp_feature *df = NULL;
        struct disp_core *disp_core = NULL;
        int i;

        ret = mi_disp_core_init();
        if (ret < 0)
            return -ENODEV;

        mi_disp_log_init();

        disp_core = mi_get_disp_core();
        if (!disp_core)
            return -ENODEV;

        if (g_disp_feature) {
            DISP_INFO("mi disp_feature already initialized, return!\n");
            return 0;
        }

        df = kzalloc(sizeof(struct disp_feature), GFP_KERNEL);
        if (!df) {
            DISP_ERROR("can not allocate Buffer\n");
            ret = -ENOMEM;
            goto err_core_deinit;
        }

        ret = mi_disp_cdev_register(DISP_FEATURE_DEVICE_NAME,
                                    &disp_feature_fops, &df->cdev);
        if (ret < 0) {
            DISP_ERROR("cdev register failed for %s\n", DISP_FEATURE_DEVICE_NAME);
            goto err_alloc_mem;
        }

        df->dev_id = df->cdev->dev;
        df->class = disp_core->class;
        df->pdev = device_create(df->class, NULL, df->dev_id, df,
                                 DISP_FEATURE_DEVICE_NAME);
        if (IS_ERR(df->pdev)) {
            DISP_ERROR("create device failed for %s\n", DISP_FEATURE_DEVICE_NAME);
            ret = -ENODEV;
            goto err_cdev_register;
        }

        df->version = MI_DISP_FEATURE_VERSION;
        for (i = MI_DISP_PRIMARY; i < MI_DISP_MAX; i++) {
            df->d_display[i].dev = NULL;
            df->d_display[i].display = NULL;
            df->d_display[i].disp_id = MI_DISP_MAX;
            df->d_display[i].intf_type = MI_INTF_MAX;
            mutex_init(&df->d_display[i].mutex_lock);
        }
        INIT_LIST_HEAD(&df->client_list);
        spin_lock_init(&df->client_spinlock);

        g_disp_feature = df;

        DISP_INFO("mi disp_feature driver initialized!\n");

        if (hwconf_init() < 0) {
            DISP_ERROR("can not initialize hwconf.\n");
        }

        return 0;

    err_cdev_register:
        mi_disp_cdev_unregister(df->cdev);
    err_alloc_mem:
        kfree(df);
    err_core_deinit:
        mi_disp_core_deinit();
        return ret;
    }
    goto err_cdev_register
    err_cdev_register:                        // execution jumps here
        mi_disp_cdev_unregister(df->cdev);    // unregister the cdev
    err_alloc_mem:
        kfree(df);                            // return df's memory to the allocator
    err_core_deinit:
        mi_disp_core_deinit();                // this is the call in question
    void mi_disp_cdev_unregister(struct cdev *cdev)
    {
        unregister_chrdev_region(cdev->dev, 1);
        cdev_del(cdev);
        cdev = NULL;
    }
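This reveals the second issue.
cdev is a formal parameter, that is, a local variable of the function. Setting it to NULL has no effect on the caller's argument.
So df->cdev should still be non-NULL. Looking at g_disp_feature->cdev confirms that it was indeed never cleared.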

The disassembly of the function confirms this as well.

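x0 carries the value of cdev on entry. The function immediately copies x0 into x19, and every later operation works on x19 rather than on x0.
The ldr x19, [sp, #0x10] near the end is not the compiler using x19 as a stack pointer. It is most likely the function epilogue restoring the caller's x19, a callee-saved register, from the stack. Nothing is ever stored back to the caller's df->cdev, so the caller's pointer is unchanged when the function returns.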
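The pass-by-value behaviour described above is easy to reproduce outside the kernel. The sketch below contrasts a function shaped like mi_disp_cdev_unregister with a variant that takes the address of the caller's pointer. cdev_model, unregister_by_value and unregister_and_clear are hypothetical names used only for illustration, and the resource release itself is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct cdev. */
struct cdev_model { int dev; };

/* Same shape as mi_disp_cdev_unregister(): the pointer is passed by value. */
static void unregister_by_value(struct cdev_model *cdev)
{
    /* ... release whatever *cdev owns ... */
    cdev = NULL;        /* only the local copy changes */
    (void)cdev;
}

/* Variant that can actually clear the caller's pointer. */
static void unregister_and_clear(struct cdev_model **cdevp)
{
    if (!cdevp || !*cdevp)
        return;
    /* ... release whatever **cdevp owns ... */
    *cdevp = NULL;      /* writes through to the caller's variable */
}

/* The caller's pointer survives the by-value call, not the by-address one. */
static void by_value_self_check(void)
{
    struct cdev_model c = { .dev = 1 };
    struct cdev_model *p = &c;

    unregister_by_value(p);
    assert(p == &c);
    unregister_and_clear(&p);
    assert(p == NULL);
}
```

Whether the real driver should change the function signature or clear df->cdev at the call site is a design choice this sketch does not settle.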
2.4 Third issue
    goto err_cdev_register
    err_cdev_register:                        // execution jumps here
        mi_disp_cdev_unregister(df->cdev);    // unregister the cdev
    err_alloc_mem:
        kfree(df);                            // return df's memory to the allocator
    err_core_deinit:
        mi_disp_core_deinit();                // this is the call in question
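After kfree(df), neither df nor g_disp_feature is set to NULL, which easily leads to problems.
One thing to note here:
df and g_disp_feature point to the same block of memory, but they are two separate pointer variables stored at different addresses. kfree(df) only returns the block to the allocator. If the block is later reused by someone else, df and g_disp_feature still hold the old address, and dereferencing either of them will misbehave.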
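The aliasing point can be sketched in user space as follows, assuming a small helper that frees the object and clears every pointer known to refer to it. feature_model, g_feature_alias and feature_free are hypothetical names; the kernel version would use kfree and the driver's real globals.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct disp_feature. */
struct feature_model { int version; };

/* Stands in for the global g_disp_feature. */
static struct feature_model *g_feature_alias;

/*
 * Free the object and clear both the caller's pointer and the global
 * alias, so that no copy of the stale address survives the free.
 */
static void feature_free(struct feature_model **dfp)
{
    if (!dfp || !*dfp)
        return;
    if (g_feature_alias == *dfp)
        g_feature_alias = NULL;
    free(*dfp);
    *dfp = NULL;
}

/* After the free, neither pointer still holds the old address. */
static void alias_self_check(void)
{
    struct feature_model *df = calloc(1, sizeof(*df));

    assert(df != NULL);
    g_feature_alias = df;
    feature_free(&df);
    assert(df == NULL);
    assert(g_feature_alias == NULL);
}
```

The comparison against the alias happens before the free, so the helper never touches memory it has already released.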
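3. Summary
Although three issues were found, the actual crash happened because the class was destroyed without that state being propagated to g_disp_feature. Both g_disp_core and g_disp_feature need to be set to NULL.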
    df->class = disp_core->class;    // disp_core->class is copied into disp_feature
    void mi_disp_core_deinit(void)
    {
        if (!g_disp_core)
            return;
        debugfs_remove_recursive(g_disp_core->debugfs_dir);
        remove_proc_entry(MI_DISPLAY_PROCFS_DIR, NULL);
        class_destroy(g_disp_core->class);
        kfree(g_disp_core);
        g_disp_core = NULL;
    }
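So the suspend flow assumed the class still existed, and that caused the crash. In TRACE32, every member of the class looks corrupted, which indicates that the memory block had most likely been reused by someone else.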
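To make the summary concrete, here is a user-space model of why a stale class pointer is dangerous while a NULL one is harmless. It assumes the PM complete path guards the class callback with NULL checks only, which matches my reading of the two cbz instructions in the Code: line of the Oops. Every name below is hypothetical. In the real driver the equivalent step would more likely be removing the device before its class is destroyed, which is an assumption on my part and not the fix described in this article.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for dev_pm_ops, struct class and struct device. */
struct pm_ops_model { void (*complete)(void); };
struct class_model { const struct pm_ops_model *pm; };
struct device_model { struct class_model *cls; };

static int g_complete_calls;

static void class_complete(void)
{
    g_complete_calls++;
}

static const struct pm_ops_model g_pm = { .complete = class_complete };

static struct class_model *class_create_model(void)
{
    struct class_model *cls = calloc(1, sizeof(*cls));

    if (cls)
        cls->pm = &g_pm;
    return cls;
}

/* Destroy the class and clear the device's cached pointer in one step. */
static void class_destroy_and_detach(struct class_model **clsp,
                                     struct device_model *dev)
{
    if (!clsp || !*clsp)
        return;
    if (dev && dev->cls == *clsp)
        dev->cls = NULL;
    free(*clsp);
    *clsp = NULL;
}

/* NULL checks protect against a cleared pointer, never against a stale one. */
static int device_complete_model(const struct device_model *dev)
{
    if (dev->cls && dev->cls->pm && dev->cls->pm->complete) {
        dev->cls->pm->complete();
        return 1;   /* callback ran */
    }
    return 0;       /* no class callback to run */
}

/* Once detached, the complete step skips the class callback cleanly. */
static void detach_self_check(void)
{
    struct class_model *cls = class_create_model();
    struct device_model dev = { .cls = cls };

    assert(cls != NULL);
    assert(device_complete_model(&dev) == 1);
    class_destroy_and_detach(&cls, &dev);
    assert(device_complete_model(&dev) == 0);
    assert(g_complete_calls == 1);
}
```

Had the device kept the freed address instead, both checks would have passed on garbage, and that is the pattern the Oops shows.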