[Android稳定性] 第018篇 [问题篇] 串口日志未关闭导致的watchdog

0. 问题现象

死机,dmesg中日志如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
[   34.029307][    C0] ------------[ cut here ]------------
[ 35.844227][ C0] msm_watchdog f410000.qcom,wdt: QCOM Apps Watchdog bark! Now = 35.844222
[ 35.844236][ C0] msm_watchdog f410000.qcom,wdt: QCOM Apps Watchdog last pet at 15.841438
[ 35.844243][ C0] msm_watchdog f410000.qcom,wdt: cpu alive mask from last pet ff
[ 35.844347][ C0] CPU0 next event is 35840000000
[ 35.844350][ C0] CPU1 next event is 34048000000
[ 35.844353][ C0] CPU2 next event is 34048000000
[ 35.844356][ C0] CPU3 next event is 34048000000
[ 35.844358][ C0] CPU4 next event is 34048000000
[ 35.844360][ C0] CPU5 next event is 34048000000
[ 35.844363][ C0] CPU6 next event is 34048000000
[ 35.844365][ C0] CPU7 next event is 34046135975
[ 35.850421][ C0] Top irqs in last 35850 ms:
[ 35.850426][ C0] IRQ 11 [GICv3:arch_timer] - 29310 times
[ 35.850430][ C0] IRQ 295 [ipcc:glink-native-adsp] - 6227 times
[ 35.850434][ C0] IRQ 284 [GICv3:cci] - 4691 times
[ 35.850438][ C0] IRQ 16 [GICv3:ipcc_0] - 8446 times
[ 35.850441][ C0] IRQ 93 [GICv3:ufshcd] - 9749 times
[ 35.850447][ C0] msm_watchdog f410000.qcom,wdt: Causing a QCOM Apps Watchdog bite!
[ 35.850456][ C0] msm_watchdog f410000.qcom,wdt: Wdog - STS: 0x1401be, CTL: 0x3, BARK TIME: 0x9ffc4, BITE TIME: 0xeffa6

1. 问题分析

1.1 check watchdog基础信息

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
----------begin Watchdog----------
Non-secure Watchdog data
Watchdog enabled
Pet time: 15.0s
Bark time: 20.0s
Watchdog last pet: 1.5e-08
Watchdog next pet: 15.000000015
Watchdog next bark: 20.000000015
Watchdog task is waiting on core 6 from 30.961304
CPU online bits: 11111111
CPU runqueue online bits: 11111101
CPU isolated bits: 00000000
CPU dying bits: 00000010
pet_timer_flags: 0x29400006
pet_timer_expires: 4294900006
Current jiffies : 4294901254
Timestamp of last timer interrupt(last_jiffies_update): 35.828
tick_do_timer_cpu: 7
CPU logical map: [0, 1, 2, 3, 4, 5, 6, 7]
tick_broadcast_oneshot_mask: 00000000
tick_broadcast_pending_mask: 00000000
tick_broadcast_force_mask: 00000000
tick_broadcast_device cpumask: 00000010
tick_broadcast_device next_event: 9223372036.854776
CPU0 tick_device next_event: 35.840000
CPU1 tick_device next_event: 34.048000
CPU2 tick_device next_event: 34.048000
CPU3 tick_device next_event: 34.048000
CPU4 tick_device next_event: 34.048000
CPU5 tick_device next_event: 34.048000
CPU6 tick_device next_event: 34.048000
CPU7 tick_device next_event: 34.046136
CPU#0 : ping_start: 15.841391 : ping_end: 15.841398
CPU#1 : ping_start: 15.841398 : ping_end: 15.841405
CPU#2 : ping_start: 15.841406 : ping_end: 15.841413
CPU#3 : ping_start: 15.841414 : ping_end: 15.841422
CPU#4 : ping_start: 15.841422 : ping_end: 15.841428
CPU#5 : ping_start: 15.841429 : ping_end: 15.841434
CPU#6 : ping_start: 15.841434 : ping_end: 15.841434
CPU#7 : ping_start: 15.841434 : ping_end: 15.841436
timer_fired : 30.961298
thread_start : 15.841390
------------end Watchdog----------

1.2 查看tz中记录的non secure watchdog日志

core 0:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
Core 0 PC: __delay+9c <ffffffc008faa978>

Core 0 LR: __delay[jt]+8c <ffffffc008faa998>



[<ffffffc008faa978>] __delay+0x9c

[<ffffffc008faaa54>] __const_udelay+0x28

[<ffffffc0014d5890>] qcom_wdt_trigger_bite[qcom_wdt_core]+0x424

[<ffffffc0014d6118>] qcom_wdt_bark_handler[qcom_wdt_core]+0xac

[<ffffffc008142e9c>] __handle_irq_event_percpu+0xbc

[<ffffffc008143150>] handle_irq_event+0x48

[<ffffffc008149218>] handle_fasteoi_irq+0x124

[<ffffffc008141ed0>] generic_handle_domain_irq+0x44

[<ffffffc008010144>] gic_handle_irq+0x4c

[<ffffffc008fdf820>] el1_interrupt+0x98

[<ffffffc008fdf77c>] el1h_64_irq_handler+0x18

[<ffffffc008011304>] ret_to_kernel[jt]+0x0

[<ffffffc00813fd60>] console_emit_next_record+0x330

[<ffffffc00813d598>] console_unlock+0x150

[<ffffffc00813d2f0>] vprintk_emit[jt]+0x16c

[<ffffffc00813d6e4>] vprintk_default+0x44

[<ffffffc0081402a4>] vprintk[jt]+0x90

[<ffffffc008fd87c4>] _printk+0x54

[<ffffffc0080afb14>] __warn_printk+0xe0

[<ffffffc008d87634>] dev_watchdog+0x220

[<ffffffc0081863cc>] call_timer_fn+0x4c

[<ffffffc00818622c>] __run_timers+0x1b8

[<ffffffc0081852d4>] run_timer_softirq+0x24

[<ffffffc008010330>] __do_softirq+0x11c

[<ffffffc008016f68>] ____do_softirq+0x10

[<ffffffc008016ef4>] call_on_irq_stack+0x3c

[<ffffffc008016f48>] do_softirq_own_stack+0x1c

[<ffffffc0080b9340>] __irq_exit_rcu+0x54

[<ffffffc0080b92e0>] irq_exit_rcu+0x10

[<ffffffc008fdf82c>] el1_interrupt+0xa4

[<ffffffc008fdf77c>] el1h_64_irq_handler+0x18

[<ffffffc008011304>] ret_to_kernel[jt]+0x0

[<ffffffc00815ddc0>] rcu_momentary_dyntick_idle+0x18

[<ffffffc0081c3b4c>] multi_cpu_stop+0x94

[<ffffffc0081c4330>] cpu_stopper_thread+0xe8

[<ffffffc0080e9a7c>] smpboot_thread_fn[jt]+0x104

[<ffffffc0080e2db0>] kthread+0x104

[<ffffffc008016ea8>] ret_from_fork+0x10

其他核的trace都类似以下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
Core 7 PC: rcu_momentary_dyntick_idle+3c <ffffffc00815dde4>

Core 7 LR: multi_cpu_stop+94 <ffffffc0081c3b4c>



[<ffffffc00815dde4>] rcu_momentary_dyntick_idle+0x3c

[<ffffffc0081c3b4c>] multi_cpu_stop+0x94

[<ffffffc0081c4330>] cpu_stopper_thread+0xe8

[<ffffffc0080e9a7c>] smpboot_thread_fn[jt]+0x104

[<ffffffc0080e2db0>] kthread+0x104

[<ffffffc008016ea8>] ret_from_fork+0x10

所有的cpu都暂时停止了工作,等待rcu_mementary_dyntick_idle
而cpu0 正在执行printk的动作,其他core都在等cpu0执行完输出任务,msm-watchdog没办法在规定时间内pet,导致了此次异常

2. 解决方案

关闭kernel的串口日志