
0. 问题现象
死机,dmesg中日志如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
| [ 34.029307][ C0] ------------[ cut here ]------------ [ 35.844227][ C0] msm_watchdog f410000.qcom,wdt: QCOM Apps Watchdog bark! Now = 35.844222 [ 35.844236][ C0] msm_watchdog f410000.qcom,wdt: QCOM Apps Watchdog last pet at 15.841438 [ 35.844243][ C0] msm_watchdog f410000.qcom,wdt: cpu alive mask from last pet ff [ 35.844347][ C0] CPU0 next event is 35840000000 [ 35.844350][ C0] CPU1 next event is 34048000000 [ 35.844353][ C0] CPU2 next event is 34048000000 [ 35.844356][ C0] CPU3 next event is 34048000000 [ 35.844358][ C0] CPU4 next event is 34048000000 [ 35.844360][ C0] CPU5 next event is 34048000000 [ 35.844363][ C0] CPU6 next event is 34048000000 [ 35.844365][ C0] CPU7 next event is 34046135975 [ 35.850421][ C0] Top irqs in last 35850 ms: [ 35.850426][ C0] IRQ 11 [GICv3:arch_timer] - 29310 times [ 35.850430][ C0] IRQ 295 [ipcc:glink-native-adsp] - 6227 times [ 35.850434][ C0] IRQ 284 [GICv3:cci] - 4691 times [ 35.850438][ C0] IRQ 16 [GICv3:ipcc_0] - 8446 times [ 35.850441][ C0] IRQ 93 [GICv3:ufshcd] - 9749 times [ 35.850447][ C0] msm_watchdog f410000.qcom,wdt: Causing a QCOM Apps Watchdog bite! [ 35.850456][ C0] msm_watchdog f410000.qcom,wdt: Wdog - STS: 0x1401be, CTL: 0x3, BARK TIME: 0x9ffc4, BITE TIME: 0xeffa6
|
1. 问题分析
1.1 check watchdog基础信息
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
| ----------begin Watchdog---------- Non-secure Watchdog data Watchdog enabled Pet time: 15.0s Bark time: 20.0s Watchdog last pet: 1.5e-08 Watchdog next pet: 15.000000015 Watchdog next bark: 20.000000015 Watchdog task is waiting on core 6 from 30.961304 CPU online bits: 11111111 CPU runqueue online bits: 11111101 CPU isolated bits: 00000000 CPU dying bits: 00000010 pet_timer_flags: 0x29400006 pet_timer_expires: 4294900006 Current jiffies : 4294901254 Timestamp of last timer interrupt(last_jiffies_update): 35.828 tick_do_timer_cpu: 7 CPU logical map: [0, 1, 2, 3, 4, 5, 6, 7] tick_broadcast_oneshot_mask: 00000000 tick_broadcast_pending_mask: 00000000 tick_broadcast_force_mask: 00000000 tick_broadcast_device cpumask: 00000010 tick_broadcast_device next_event: 9223372036.854776 CPU0 tick_device next_event: 35.840000 CPU1 tick_device next_event: 34.048000 CPU2 tick_device next_event: 34.048000 CPU3 tick_device next_event: 34.048000 CPU4 tick_device next_event: 34.048000 CPU5 tick_device next_event: 34.048000 CPU6 tick_device next_event: 34.048000 CPU7 tick_device next_event: 34.046136 CPU#0 : ping_start: 15.841391 : ping_end: 15.841398 CPU#1 : ping_start: 15.841398 : ping_end: 15.841405 CPU#2 : ping_start: 15.841406 : ping_end: 15.841413 CPU#3 : ping_start: 15.841414 : ping_end: 15.841422 CPU#4 : ping_start: 15.841422 : ping_end: 15.841428 CPU#5 : ping_start: 15.841429 : ping_end: 15.841434 CPU#6 : ping_start: 15.841434 : ping_end: 15.841434 CPU#7 : ping_start: 15.841434 : ping_end: 15.841436 timer_fired : 30.961298 thread_start : 15.841390 ------------end Watchdog----------
|
1.2 查看tz中记录的non secure watchdog日志
core 0:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81
| Core 0 PC: __delay+9c <ffffffc008faa978>
Core 0 LR: __delay[jt]+8c <ffffffc008faa998>
[<ffffffc008faa978>] __delay+0x9c
[<ffffffc008faaa54>] __const_udelay+0x28
[<ffffffc0014d5890>] qcom_wdt_trigger_bite[qcom_wdt_core]+0x424
[<ffffffc0014d6118>] qcom_wdt_bark_handler[qcom_wdt_core]+0xac
[<ffffffc008142e9c>] __handle_irq_event_percpu+0xbc
[<ffffffc008143150>] handle_irq_event+0x48
[<ffffffc008149218>] handle_fasteoi_irq+0x124
[<ffffffc008141ed0>] generic_handle_domain_irq+0x44
[<ffffffc008010144>] gic_handle_irq+0x4c
[<ffffffc008fdf820>] el1_interrupt+0x98
[<ffffffc008fdf77c>] el1h_64_irq_handler+0x18
[<ffffffc008011304>] ret_to_kernel[jt]+0x0
[<ffffffc00813fd60>] console_emit_next_record+0x330
[<ffffffc00813d598>] console_unlock+0x150
[<ffffffc00813d2f0>] vprintk_emit[jt]+0x16c
[<ffffffc00813d6e4>] vprintk_default+0x44
[<ffffffc0081402a4>] vprintk[jt]+0x90
[<ffffffc008fd87c4>] _printk+0x54
[<ffffffc0080afb14>] __warn_printk+0xe0
[<ffffffc008d87634>] dev_watchdog+0x220
[<ffffffc0081863cc>] call_timer_fn+0x4c
[<ffffffc00818622c>] __run_timers+0x1b8
[<ffffffc0081852d4>] run_timer_softirq+0x24
[<ffffffc008010330>] __do_softirq+0x11c
[<ffffffc008016f68>] ____do_softirq+0x10
[<ffffffc008016ef4>] call_on_irq_stack+0x3c
[<ffffffc008016f48>] do_softirq_own_stack+0x1c
[<ffffffc0080b9340>] __irq_exit_rcu+0x54
[<ffffffc0080b92e0>] irq_exit_rcu+0x10
[<ffffffc008fdf82c>] el1_interrupt+0xa4
[<ffffffc008fdf77c>] el1h_64_irq_handler+0x18
[<ffffffc008011304>] ret_to_kernel[jt]+0x0
[<ffffffc00815ddc0>] rcu_momentary_dyntick_idle+0x18
[<ffffffc0081c3b4c>] multi_cpu_stop+0x94
[<ffffffc0081c4330>] cpu_stopper_thread+0xe8
[<ffffffc0080e9a7c>] smpboot_thread_fn[jt]+0x104
[<ffffffc0080e2db0>] kthread+0x104
[<ffffffc008016ea8>] ret_from_fork+0x10
|
其他核的trace都类似以下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
| Core 7 PC: rcu_momentary_dyntick_idle+3c <ffffffc00815dde4>
Core 7 LR: multi_cpu_stop+94 <ffffffc0081c3b4c>
[<ffffffc00815dde4>] rcu_momentary_dyntick_idle+0x3c
[<ffffffc0081c3b4c>] multi_cpu_stop+0x94
[<ffffffc0081c4330>] cpu_stopper_thread+0xe8
[<ffffffc0080e9a7c>] smpboot_thread_fn[jt]+0x104
[<ffffffc0080e2db0>] kthread+0x104
[<ffffffc008016ea8>] ret_from_fork+0x10
|
所有的cpu都暂时停止了工作,等待rcu_mementary_dyntick_idle
。
而cpu0 正在执行printk的动作,其他core都在等cpu0执行完输出任务,msm-watchdog没办法在规定时间内pet,导致了此次异常
2. 解决方案
关闭kernel的串口日志