[Android稳定性] 第025篇 [问题篇] KASAN slab-out-of-bounds内存越界问题

一、问题背景

kasan版本

corgi: 4967550

刚跑测出现4台死机,看着都是一样问题

概率4/7

二、问题分析

2.1 dmesg_TZ.txt

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
[ 1776.466411][T12145] BUG: KASAN: slab-out-of-bounds in usbpd_mi_vdm_received_cb+0x478/0xb84 [usbpd]
[ 1776.466641][T12145] Read of size 4 at addr ffffff808d6c0a60 by task kworker/u17:3/12145
[ 1776.466686][T12145] CPU: 7 PID: 12145 Comm: kworker/u17:3 Tainted: G C OE 6.1.99-android14-11-maybe-dirty-qki-consolidate #1
[ 1776.466716][T12145] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[ 1776.466736][T12145] Workqueue: usbpd_wq usbpd_sm [usbpd]
[ 1776.466946][T12145] Call trace:
[ 1776.466962][T12145] dump_backtrace+0x1c0/0x1f0
[ 1776.466992][T12145] show_stack+0x2c/0x3c
[ 1776.467017][T12145] dump_stack_lvl+0x104/0x170
[ 1776.467045][T12145] print_report+0x15c/0x4a0
[ 1776.467080][T12145] kasan_report+0xd4/0x130
[ 1776.467109][T12145] __asan_load4+0x94/0x9c
[ 1776.467143][T12145] usbpd_mi_vdm_received_cb+0x478/0xb84 [usbpd]
[ 1776.467346][T12145] handle_vdm_rx+0x360/0x14f0 [usbpd]
[ 1776.467548][T12145] handle_state_snk_ready+0x3f0/0x10dc [usbpd]
[ 1776.467748][T12145] usbpd_sm+0x6b4/0x12dc [usbpd]
[ 1776.467950][T12145] process_one_work+0x538/0xbbc
[ 1776.467982][T12145] worker_thread+0x57c/0x95c
[ 1776.468012][T12145] kthread+0x1c4/0x1f8
[ 1776.468042][T12145] ret_from_fork+0x10/0x20
[ 1776.468084][T12145] Allocated by task 18133:
[ 1776.468103][T12145] kasan_set_track+0x4c/0x84
[ 1776.468126][T12145] kasan_save_alloc_info+0x24/0x34
[ 1776.468148][T12145] __kasan_kmalloc+0xa0/0xbc
[ 1776.468168][T12145] __kmalloc+0xf8/0x284
[ 1776.468187][T12145] phy_msg_received+0x51c/0xefc [usbpd]
[ 1776.468382][T12145] pdphy_msg_rx_irq+0x318/0x4f8 [qpnp_pdphy]
[ 1776.468436][T12145] __handle_irq_event_percpu+0x16c/0x554
[ 1776.468455][T12145] handle_irq_event+0x6c/0x12c
[ 1776.468471][T12145] handle_edge_irq+0x1ec/0x68c
[ 1776.468493][T12145] generic_handle_irq+0x70/0x94
[ 1776.468509][T12145] periph_interrupt+0x24c/0x2c4 [spmi_pmic_arb]
[ 1776.468581][T12145] pmic_arb_chained_irq+0x2f4/0x5e4 [spmi_pmic_arb]
[ 1776.468650][T12145] generic_handle_domain_irq+0x74/0x98
[ 1776.468667][T12145] gic_handle_irq+0x70/0x1bc
[ 1776.468699][T12145] The buggy address belongs to the object at ffffff808d6c0a40
[ 1776.468699][T12145] which belongs to the cache kmalloc-64 of size 64
[ 1776.468723][T12145] The buggy address is located 32 bytes inside of
[ 1776.468723][T12145] 64-byte region [ffffff808d6c0a40, ffffff808d6c0a80)
[ 1776.468763][T12145] The buggy address belongs to the physical page:
[ 1776.468781][T12145] page:fffffffe0235b000 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10d6c0
[ 1776.468801][T12145] head:fffffffe0235b000 order:1 compound_mapcount:0 compound_pincount:0
[ 1776.468815][T12145] flags: 0x4000000000010200(slab|head|zone=1)
[ 1776.468839][T12145] raw: 4000000000010200 fffffffe07f37d08 ffffff800b41c3f0 ffffff800b4283c0
[ 1776.468856][T12145] raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000
[ 1776.468866][T12145] page dumped because: kasan: bad access detected
[ 1776.468876][T12145] page_owner tracks the page as allocated
[ 1776.468892][T12145] page last allocated via order 1, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 950, tgid 950 (modprobe), ts 158955933088, free_ts 158788154806
[ 1776.468932][T12145] post_alloc_hook+0x110/0x118
[ 1776.468956][T12145] prep_new_page+0x34/0x244
[ 1776.468982][T12145] get_page_from_freelist+0x1938/0x19a0
[ 1776.469006][T12145] __alloc_pages+0x144/0x35c
[ 1776.469028][T12145] alloc_slab_page+0x8c/0x1ac
[ 1776.469050][T12145] new_slab+0xa4/0x374
[ 1776.469070][T12145] ___slab_alloc+0x844/0xd08
[ 1776.469087][T12145] __slab_alloc+0x6c/0xac
[ 1776.469105][T12145] __kmem_cache_alloc_node+0x26c/0x2e0
[ 1776.469123][T12145] __kmalloc_node_track_caller+0xe0/0x26c
[ 1776.469143][T12145] kvasprintf+0xbc/0x16c
[ 1776.469165][T12145] kasprintf+0x80/0xb0
[ 1776.469184][T12145] iommu_group_add_device+0xd0/0x468
[ 1776.469207][T12145] __iommu_probe_device+0x328/0x558
[ 1776.469234][T12145] iommu_probe_device+0x44/0x380
[ 1776.469257][T12145] of_iommu_configure+0x308/0x3b8
[ 1776.469277][T12145] page last free stack trace:
[ 1776.469293][T12145] free_unref_page_prepare+0x520/0x550
[ 1776.469315][T12145] free_unref_page+0x7c/0x408
[ 1776.469336][T12145] __free_pages+0xf4/0x120
[ 1776.469356][T12145] __vunmap+0x51c/0x5fc
[ 1776.469375][T12145] vfree+0xa4/0xf4
[ 1776.469393][T12145] load_module+0x2d78/0x3040
[ 1776.469416][T12145] __arm64_sys_finit_module+0x170/0x1d4
[ 1776.469436][T12145] invoke_syscall+0x88/0x1d4
[ 1776.469458][T12145] el0_svc_common+0xec/0x1cc
[ 1776.469479][T12145] do_el0_svc+0x50/0x194
[ 1776.469500][T12145] el0_svc+0x30/0x84
[ 1776.469518][T12145] el0t_64_sync_handler+0x68/0xb4
[ 1776.469535][T12145] el0t_64_sync+0x190/0x194
[ 1776.469565][T12145] Memory state around the buggy address:
[ 1776.469584][T12145] ffffff808d6c0900: fc fc fc fc fc fc fc fc 00 00 00 00 03 fc fc fc
[ 1776.469605][T12145] ffffff808d6c0980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1776.469626][T12145] >ffffff808d6c0a00: fc fc fc fc fc fc fc fc 00 00 00 00 fc fc fc fc
[ 1776.469645][T12145] ^
[ 1776.469664][T12145] ffffff808d6c0a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1776.469685][T12145] ffffff808d6c0b00: fc fc fc fc fc fc fc fc 00 00 00 00 03 fc fc fc
[ 1776.469704][T12145] ==================================================================

得到的信息点:

  • 问题类型:slab-out-of-bounds
  • 问题函数:usbpd_mi_vdm_received_cb+0x478/0xb84 [usbpd]
  • 越界地址:Read of size 4 at addr ffffff808d6c0a60 by task kworker/u17:3/12145

2.2 trace32恢复现场

对应的源码

目前来看死在了第5395行的case中,而这个case只有一个for循环

1
2
3
4
5
6
case USBPD_UVDM_SESSION_SEED:
for (i = 0; i < USBPD_UVDM_SS_LEN; i++) {
pd->vdm_data.s_secert[i] = vdos[i];
usbpd_dbg(&pd->dev, "usbpd s_secert vdos[%d]=0x%x", i, vdos[i]);
}
break;

那问题点就只有在pd->vdm_data.s_secert[i] = vdos[i],所以我们重点查看一下这个,推导出越界地址:ffffff808d6c0a60

2.2.1 推到rx_msg起始地址

查看汇编代码:

1
2
3
4
5
6
7
8
910012F8            add     x24,x23,#0x4     ; x24,vdos,#4
AA1803E0 mov x0,x24
9535B65B bl 0xFFFFFFC00865D468 ; __asan_load4_noabort //这前面三句是用来kasan检测的,每次read size 4看是否越界
D100E260 sub x0,x19,#0x38 ; x0,hdlr,#56
B94006F9 ldr w25,[x23,#0x4] ; w25,[vdos,#4] //vdos+0x4 取32位传到w25
9535B6CD bl 0xFFFFFFC00865D63C ; __asan_store8_noabort
F81C8279 stur x25,[x19,#-0x38] ; x25,[hdlr,#-56]

ldr w25,[x23,#0x4]

这个就说明,每次vdos经过一次for循环,地址递增0x4

我们现在看一下vdos变量的起始地址:

vdos是直接函数传参传入的!查看堆栈,看上一个栈帧

注意:013栈帧中已经显示了vdos的地址了,为何我们还要去推导地址呢?

原因在于,for循环导致了vdos的地址进行了偏移,所以我们需要找到起始地址,就需要从源头开始推导

vods变量是由handle_vdm_rx传递的。

1
2
3
4
5
6
7
static void handle_vdm_rx(struct usbpd *pd, struct rx_msg *rx_msg)
{
int ret;
u32 vdm_hdr =
rx_msg->data_len >= sizeof(u32) ? ((u32 *)rx_msg->payload)[0] : 0;

u32 *vdos = (u32 *)&rx_msg->payload[sizeof(u32)];

vdos其实是rx_msg->payload的起始地址,这边还进行了一次强转,强转为u32类型

继续查看栈帧,找到rx_msg的初始化

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
static bool handle_data_snk_ready(struct usbpd *pd, struct rx_msg *rx_msg)
{
u32 ado;

switch (PD_MSG_HDR_TYPE(rx_msg->hdr)) {
case MSG_SOURCE_CAPABILITIES:
/* save the PDOs so userspace can further evaluate */
memset(&pd->received_pdos, 0,
sizeof(pd->received_pdos));
memcpy(&pd->received_pdos, rx_msg->payload,
min_t(size_t, rx_msg->data_len,
sizeof(pd->received_pdos)));
pd->src_cap_id++;

usbpd_set_state(pd, PE_SNK_EVALUATE_CAPABILITY);
break;
case MSG_VDM:
handle_vdm_rx(pd, rx_msg); //这边传入
break;

handle_data_snk_ready的调用栈

1
2
3
4
5
6
7
8
9
10
11
static void handle_state_snk_ready(struct usbpd *pd, struct rx_msg *rx_msg)
{
int ret;

if (rx_msg && !PD_MSG_HDR_COUNT(rx_msg->hdr) &&
handle_ctrl_snk_ready(pd, rx_msg)) {
return;
} else if (rx_msg && !PD_MSG_HDR_IS_EXTENDED(rx_msg->hdr) &&
handle_data_snk_ready(pd, rx_msg)) { //这边调用
return;

对应的栈帧如下:

可以看到rx_msg的最开始初始化的地方在函数usbpd_sm

1
2
3
4
5
6
7
8
9
10
11
static void usbpd_sm(struct work_struct *w)
{
//...
struct rx_msg *rx_msg = NULL;
//...
if (!list_empty(&pd->rx_q)) {
rx_msg = list_first_entry(&pd->rx_q, struct rx_msg, entry); //去除rx_msg
list_del(&rx_msg->entry); // 删除entry,也就是从链表中去掉
}
// ...
}

在这个函数中rx_msg初始化后,进行了赋值

所以就可以得到rx_msg结构体的起始地址为:0xffffff808d6c0a40

2.2.2 推导vdos起始地址

rx_msg初始化时,有两个步骤:

  1. list_first_entry(&pd->rx_q, struct rx_msg, entry)
  2. list_del(&rx_msg->entry)

这里也详细解释一下这部分:

1
2
3
4
5
6
7
8
9
10
11
12
13
#define list_entry(ptr, type, member) \
container_of(ptr, type, member)

/**
* list_first_entry - get the first element from a list
* @ptr: the list head to take the element from.
* @type: the type of the struct this is embedded in.
* @member: the name of the list_head within the struct.
*
* Note, that list is expected to be not empty.
*/
#define list_first_entry(ptr, type, member) \
list_entry((ptr)->next, type, member)

list_first_entry展开后起始就是container_of的封装,所以这句的意义就是根据pd->rx_q的next指针以及成员变量entry,来找到rx_msg的起始地址。

可以看到pd->rx_q的链表前后指针均指向同一地址也就是0xffffff809e4396d8

同时rx_q的变量的起始地址也是0xffffff809e4396d8

而entry是在struct rx_msg的偏移为:

我们查看一下汇编:

x19寄存器里其实就是rx_msg的起始地址,但是因为x19后续还需要操作,所以会将x19的地址保存到x29-0x18的位置,也就是0xFFFFFFC049757C28

我们去读一下这个地址

这个地址也就和栈帧中的rx_msg的起始地址对应上了

这里标记一下entry的next指针和prev指针,由于会执行list_del操作才会这样,后面会用到

我们继续推导vdos的起始地址

vdos的起始地址=rx_msg的起始地址+payload成员的偏移

所以vdos的起始地址为0xFFFFFF808D6C0A40+0x18=0xFFFFFF808D6C0A58

2.2.3 确认vdos的范围

vdos的范围也就是payload的范围,这个肯定是在这个rx_msg初始化时就已经确定的

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
static void phy_msg_received(struct usbpd *pd, enum pd_sop_type sop,
u8 *buf, size_t len)
{
struct rx_msg *rx_msg;
unsigned long flags;
u16 header;
u8 msg_type, num_objs;

if (sop == SOPII_MSG) {
usbpd_err(&pd->dev, "only SOP/SOP' supported\n");
return;
}

if (len < 2) {
usbpd_err(&pd->dev, "invalid message received, len=%zd\n", len);
return;
}

header = *((u16 *)buf);
buf += sizeof(u16);
len -= sizeof(u16);

if (len % 4 != 0) {
usbpd_err(&pd->dev, "len=%zd not multiple of 4\n", len);
return;
}

/* if MSGID already seen, discard */
if (PD_MSG_HDR_ID(header) == pd->rx_msgid[sop] &&
PD_MSG_HDR_TYPE(header) != MSG_SOFT_RESET) {
usbpd_dbg(&pd->dev, "MessageID already seen, discarding\n");
return;
}

pd->rx_msgid[sop] = PD_MSG_HDR_ID(header);

/* discard Pings */
if (PD_MSG_HDR_TYPE(header) == MSG_PING && !len)
return;

/* check header's count field to see if it matches len */
if (PD_MSG_HDR_COUNT(header) != (len / 4)) {
usbpd_err(&pd->dev, "header count (%d) mismatch, len=%zd\n",
PD_MSG_HDR_COUNT(header), len);
return;
}

/* if spec rev differs (i.e. is older), update PHY */
if (PD_MSG_HDR_REV(header) < pd->spec_rev)
pd->spec_rev = PD_MSG_HDR_REV(header);

msg_type = PD_MSG_HDR_TYPE(header);
num_objs = PD_MSG_HDR_COUNT(header);
usbpd_dbg(&pd->dev, "%s type(%d) num_objs(%d)\n",
msg_to_string(msg_type, num_objs,
PD_MSG_HDR_IS_EXTENDED(header)),
msg_type, num_objs);

#if IS_ENABLED(CONFIG_PD_POLICY_MANAGER)
if ((msg_type == MSG_ALERT) && num_objs){
usbpd_dbg(&pd->dev, "ignore MSG_ALERT\n");
return;
}
#endif /* CONFIG_PD_POLICY_MANAGER */

if (!PD_MSG_HDR_IS_EXTENDED(header)) {
rx_msg = kzalloc(sizeof(*rx_msg) + len, GFP_ATOMIC);
if (!rx_msg)
return;

rx_msg->hdr = header;
rx_msg->data_len = len;
memcpy(rx_msg->payload, buf, len);
} else {
rx_msg = pd_ext_msg_received(pd, header, buf, len, sop);
if (!rx_msg)
return;
}

if (pd->vdm_in_suspend && msg_type == MSG_VDM) {
usbpd_dbg(&pd->dev, "Skip wq and handle VDM directly\n");
handle_vdm_rx(pd, rx_msg);
kfree(rx_msg);
return;
}

#if IS_ENABLED(CONFIG_PD_POLICY_MANAGER)
#ifdef DR_SWAP_RESPONSE_TIME
if (IS_CTRL(rx_msg, MSG_DR_SWAP))
pd->dr_swap_recvd_time = ktime_get();
#endif /* DR_SWAP_RESPONSE_TIME */
#endif /* CONFIG_PD_POLICY_MANAGER */

spin_lock_irqsave(&pd->rx_lock, flags);
list_add_tail(&rx_msg->entry, &pd->rx_q);
spin_unlock_irqrestore(&pd->rx_lock, flags);

if (!work_busy(&pd->sm_work))
kick_sm(pd, 0);
else
usbpd_dbg(&pd->dev, "usbpd_sm already running\n");
}

我们可以看到使用了kzalloc进行了内存的申请,payload的大小为len,而且len是通过传值引入并经过了一系列的计算,如果我们推导其实挺麻烦的,但是我们可以看到其中有一个操作

rx_msg->data_len = len;

它会将len的大小保存到rx_msg中,那其实我们只需要读取这个成员就能知道这个size

三、确认根因

由上面的推导,我们已知下面的信息

  1. vdos起始地址=rx_msg->payload起始地址=0xffffff808d6c0a58
  2. vdos的范围=rx_msg->data_len=0x8

而for循环中一次性读取4字节,而且会循环4次,这里当i=2的时候其实就会超出内存了

这里为什么一次性地址递增0x4,是因为进行了一次强转

四、解决方案

循环次数不应该默认为4,而应该使用计算得出

1
for(i=0; i < rx_msg->data_len/sizeof(u32); i++)

五、扩展

这个问题是KASAN检测出来的slab内存泄漏问题,正好可以回顾一下slab内存的分布

使用kmalloc申请内存时,当使能SLAB_BUG以及KASAN时,内存分布如上图,本次申请的内存分布如下:

我们已知slab内存申请时会填充0xcc 以及最后填充0x5a,那就可以定位出这段

就可以很熟悉的看到这个slab内存,DEAD000000000100和DEAD000000000122 也就对应我们上面所提到的entry对象

我们现在来找一下alloc/free track,0xFFFFFFC0038FD978对应的地址就是0xffffff808d6c0a90

这个也就对应上dmesg_TZ中的cpu0 以及pid=18133

1
2
3
4
5
6
7
8
[ 1776.468084][T12145] Allocated by task 18133:

CPU0 6 process is running
curr: logcat ( 18133) [affinity=0xff] [vruntime= 10803060235950]
idle: swapper/0 ( 0) [affinity=0x01] [vruntime= 0]
stop: migration/0 ( 21) [affinity=0x01] [vruntime= 0]
CFS 2 process is pending
curr: logcat ( 18133) [affinity=0xff] [vruntime= 10803060235950]

我们在找一下kasan alloc meta

地址:FFFFFFC00865F4A0