Switch-Router

SACK and the Kernel TCP Retransmission Queue

Published at 2022-12-16 | Last Update 2022-12-16

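The kernel's TCP retransmission queue is not a queue of packets that have already been retransmitted. It holds the sk_buffs that the peer has not yet acknowledged, that is, the packets that may still need to be retransmitted.

The queue is stored as a red-black tree in struct sock.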

struct sock {
	...
	union {
		struct sk_buff	*sk_send_head;
		struct rb_root	tcp_rtx_queue;   // root of the red-black tree
	};
	...
};

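The sk_buffs in the queue (on the tree) are ordered by sequence number. For example, if SND_UNA is currently 101, the first element of the queue is the sk_buff whose starting sequence number is 101.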
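
To make the ordering concrete, here is a minimal user-space sketch of a wraparound-aware sequence comparison. It is modeled on the kernel's before() helper; the names seq_before and rtx_head_index are hypothetical and exist only for this illustration, and a plain array stands in for the red-black tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* True if sequence number a comes before b, allowing for 32-bit wraparound.
 * Modeled on the kernel's before(); seq_before is a name made up here. */
static int seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* Hypothetical helper: return the index of the segment with the smallest
 * starting sequence number, i.e. the one that would sit at the queue head. */
static size_t rtx_head_index(const uint32_t *start_seq, size_t n)
{
	size_t head = 0;

	assert(n > 0);
	for (size_t i = 1; i < n; i++)
		if (seq_before(start_seq[i], start_seq[head]))
			head = i;
	return head;
}
```

Given the starting sequence numbers {1101, 101, 2101}, rtx_head_index picks the entry that starts at 101, which is the segment beginning at SND_UNA in the example above.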

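An sk_buff is removed from the queue only once it has been fully acknowledged, that is, covered by the cumulative ACK.

This means that even if a SACK block in some ack tells the sender that a range of data has been received by the peer, the sender cannot remove the corresponding sk_buffs from the queue.

Instead, it sets flags on those sk_buffs.

For a TCP connection, the cb area of an sk_buff holds a struct tcp_skb_cb, which contains an 8-bit bitmap field, sacked, dedicated to SACK-related flags.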

#define TCP_SKB_CB(__skb)	((struct tcp_skb_cb *)&((__skb)->cb[0]))

struct tcp_skb_cb {
	...
	__u8		sacked;		/* State flags for SACK.	*/
#define TCPCB_SACKED_ACKED	0x01	/* SKB ACK'd by a SACK block	*/
#define TCPCB_SACKED_RETRANS	0x02	/* SKB retransmitted		*/
#define TCPCB_LOST		0x04	/* SKB is lost			*/
	...
};

These flags mean the following:

SACKED (S): the sk_buff has been acknowledged by a SACK block.

RETRANS (R): the sk_buff has been retransmitted.

LOST (L): the sk_buff has been judged lost.

The kernel source also has a comment explaining what each combination of these flags means.

 * Tag  InFlight  Description
 * 0    1         - orig segment is in flight.
 * S    0         - nothing flies, orig reached receiver.
 * L    0         - nothing flies, orig lost by net.
 * R    2         - both orig and retransmit are in flight.
 * L|R  1         - orig is lost, retransmit is in flight.
 * S|R  1         - orig reached receiver, retrans is still in flight.
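
The table can be restated as a small function. This is a user-space sketch, not kernel code: copies_in_flight is a hypothetical helper, and only the three flag values are taken from the definitions above. It assumes the flag combination is one of the six rows listed in the table.

```c
#include <assert.h>
#include <stdint.h>

#define TCPCB_SACKED_ACKED	0x01	/* S */
#define TCPCB_SACKED_RETRANS	0x02	/* R */
#define TCPCB_LOST		0x04	/* L */

/* Hypothetical helper: how many copies of a segment are presumed to be
 * in the network, given its sacked flags. */
static int copies_in_flight(uint8_t sacked)
{
	int n = 1;	/* the original transmission */

	/* The original is no longer in flight once it has either
	 * reached the receiver (S) or been judged lost (L). */
	if (sacked & (TCPCB_SACKED_ACKED | TCPCB_LOST))
		n--;

	/* A retransmitted copy is in flight. */
	if (sacked & TCPCB_SACKED_RETRANS)
		n++;

	assert(n >= 0 && n <= 2);
	return n;
}
```

Calling copies_in_flight(TCPCB_LOST | TCPCB_SACKED_RETRANS) gives 1, matching the L|R row: the original is gone and only the retransmission is still travelling.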

In addition, struct tcp_sock has several counters that track how many packets on the retransmission queue are in each state.

struct tcp_sock {
	...
	u32	lost_out;	/* Lost packets			*/
	u32	sacked_out;	/* SACK'd packets		*/
	...
	u32	packets_out;	/* Packets which are "in flight"	*/
	u32	retrans_out;	/* Retransmitted packets out	*/
};

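lost_out: the number of packets the sender currently considers lost.

sacked_out: the number of packets currently acknowledged by SACK.

packets_out: the number of packets the sender counts as in flight (a packet counts as in flight as long as it has not been fully acknowledged).

retrans_out: the number of packets that, in the sender's view, are currently being retransmitted.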
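
These counters combine into the sender's estimate of how many packets are actually in the network. The sketch below mirrors the kernel helpers tcp_left_out() and tcp_packets_in_flight() in user space; struct rtx_counters and the function names left_out and packets_in_flight are stand-ins invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space copy of the four tcp_sock counters. */
struct rtx_counters {
	uint32_t packets_out;
	uint32_t sacked_out;
	uint32_t lost_out;
	uint32_t retrans_out;
};

/* Packets whose original copy has left the network: SACKed or lost. */
static uint32_t left_out(const struct rtx_counters *c)
{
	return c->sacked_out + c->lost_out;
}

/* Estimate of packets still in the network, retransmissions included. */
static uint32_t packets_in_flight(const struct rtx_counters *c)
{
	assert(left_out(c) <= c->packets_out);
	return c->packets_out - left_out(c) + c->retrans_out;
}
```

For instance, with packets_out = 4, sacked_out = 2, lost_out = 2 and retrans_out = 1, the estimate is 4 - (2 + 2) + 1 = 1: only the single retransmission is still in flight.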

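That explanation is a bit dry, so let's walk through a concrete example.

The sender sends 5 packets of 1000 bytes each to the receiver, and the 2nd and 4th packets are lost. Of the 5 acks the receiver sends back, two carry a SACK block. The corresponding packetdrill script is in the appendix at the end of this post.

The whole exchange is shown in the figure below; the numbers in parentheses are SACK blocks.

The packet capture is as follows.

From the sender's point of view, let's look at what changes as each ack arrives.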

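Before the 1st ack arrives

Five packets have been sent, so before the 1st ack arrives there are 5 sk_buffs on the retransmission queue, none of them with any flag set. The state in tcp_sock is as follows.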

tp->retrans_out = 0
tp->sacked_out = 0
tp->lost_out = 0
tp->packets_out = 5   // [1:1001] [1001:2001] [2001:3001] [3001:4001] [4001:5001]

1st ack arrives

192.0.2.1.58393 > 192.168.165.149.9996: Flags [.], ack 1001, win 257, length 0

The sk_buff at the head of the retransmission queue is removed, and the number of packets in flight drops.

tp->retrans_out = 0
tp->sacked_out = 0
tp->lost_out = 0
tp->packets_out = 4   // [1001:2001] [2001:3001] [3001:4001] [4001:5001]

2nd ack arrives

192.0.2.1.58393 > 192.168.165.149.9996: Flags [.], ack 1001, win 257, options [sack 1 {2001:3001},nop,nop], length 0

The 2nd ack carries a SACK block, so the sk_buff for [2001:3001] is tagged SACKED.

tp->retrans_out = 0
tp->sacked_out = 1   // [2001:3001]
tp->lost_out = 0
tp->packets_out = 4  // [1001:2001] [2001:3001] [3001:4001] [4001:5001]

3rd ack arrives

192.0.2.1.58393 > 192.168.165.149.9996: Flags [.], ack 1001, win 257, options [sack 1 {4001:5001},nop,nop], length 0

The 3rd ack carries a SACK block, so the sk_buff for [4001:5001] is tagged SACKED.

tp->retrans_out = 0
tp->sacked_out = 2   // [2001:3001] [4001:5001]
tp->lost_out = 0
tp->packets_out = 4  // [1001:2001] [2001:3001] [3001:4001] [4001:5001]

Retransmission

The sender retransmits.

192.168.165.149.9996 > 192.0.2.1.58393: Flags [P.], seq 1001:2001, ack 1, win 502, length 1000

On the retransmission queue, [1001:2001] is tagged RETRANS and LOST, and [3001:4001] is tagged LOST.

tp->retrans_out = 1  // [1001:2001]
tp->sacked_out = 2   // [2001:3001] [4001:5001]
tp->lost_out = 2     // [1001:2001] [3001:4001]
tp->packets_out = 4  // [1001:2001] [2001:3001] [3001:4001] [4001:5001]

4th ack arrives

192.0.2.1.58393 > 192.168.165.149.9996: Flags [.], ack 3001, win 257, length 0

The 4th ack acknowledges everything before 3001, so both [1001:2001] and [2001:3001] can be removed from the retransmission queue.

tp->retrans_out = 0
tp->sacked_out = 1   // [4001:5001]
tp->lost_out = 1     // [3001:4001]
tp->packets_out = 2  // [3001:4001] [4001:5001]

Retransmission

The sender retransmits.

192.168.165.149.9996 > 192.0.2.1.58393: Flags [P.], seq 3001:4001, ack 1, win 502, length 1000

On the retransmission queue, [3001:4001] is tagged RETRANS.

tp->retrans_out = 1  // [3001:4001]
tp->sacked_out = 1   // [4001:5001]
tp->lost_out = 1     // [3001:4001]
tp->packets_out = 2  // [3001:4001] [4001:5001]

5th ack arrives

192.0.2.1.58393 > 192.168.165.149.9996: Flags [.], ack 5001, win 257, length 0

All packets have been acknowledged. No sk_buff remains on the retransmission queue.

tp->retrans_out = 0 
tp->sacked_out = 0  
tp->lost_out = 0 
tp->packets_out = 0
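
The walkthrough above can be replayed with a small user-space model. This is only a sketch under simplifying assumptions, not how the kernel does it: struct seg, struct counters and the rtx_* functions are hypothetical names, a fixed array replaces the red-black tree, sequence numbers are compared without wraparound handling, and the counters are recomputed from the flags on demand, whereas the kernel updates them incrementally.

```c
#include <assert.h>
#include <stdint.h>

#define TCPCB_SACKED_ACKED	0x01
#define TCPCB_SACKED_RETRANS	0x02
#define TCPCB_LOST		0x04

#define NSEG 5

/* Hypothetical stand-in for an sk_buff on the retransmission queue. */
struct seg {
	uint32_t seq, end_seq;	/* covers [seq, end_seq) */
	uint8_t sacked;		/* TCPCB_* flags */
	int queued;		/* still on the queue? */
};

struct counters {
	unsigned packets_out, sacked_out, lost_out, retrans_out;
};

/* Queue five 1000-byte segments starting at sequence number 1. */
static void rtx_init(struct seg *q)
{
	for (int i = 0; i < NSEG; i++) {
		q[i].seq = 1 + 1000u * i;
		q[i].end_seq = q[i].seq + 1000;
		q[i].sacked = 0;
		q[i].queued = 1;
	}
}

/* Cumulative ACK: remove every segment that ends at or before ack. */
static void rtx_ack(struct seg *q, uint32_t ack)
{
	for (int i = 0; i < NSEG; i++)
		if (q[i].end_seq <= ack)
			q[i].queued = 0;
}

/* SACK block [start, end): tag covered segments but keep them queued. */
static void rtx_sack(struct seg *q, uint32_t start, uint32_t end)
{
	assert(start < end);
	for (int i = 0; i < NSEG; i++)
		if (q[i].queued && q[i].seq >= start && q[i].end_seq <= end)
			q[i].sacked |= TCPCB_SACKED_ACKED;
}

/* Set TCPCB_LOST and/or TCPCB_SACKED_RETRANS on the segment at seq. */
static void rtx_tag(struct seg *q, uint32_t seq, uint8_t flag)
{
	for (int i = 0; i < NSEG; i++)
		if (q[i].queued && q[i].seq == seq)
			q[i].sacked |= flag;
}

/* Recompute the four counters from the flags of the queued segments. */
static struct counters rtx_count(const struct seg *q)
{
	struct counters c = {0, 0, 0, 0};

	for (int i = 0; i < NSEG; i++) {
		if (!q[i].queued)
			continue;
		c.packets_out++;
		c.sacked_out += !!(q[i].sacked & TCPCB_SACKED_ACKED);
		c.lost_out += !!(q[i].sacked & TCPCB_LOST);
		c.retrans_out += !!(q[i].sacked & TCPCB_SACKED_RETRANS);
	}
	return c;
}
```

To replay the example, call rtx_init, then rtx_ack(q, 1001), rtx_sack(q, 2001, 3001), rtx_sack(q, 4001, 5001), rtx_tag(q, 1001, TCPCB_LOST | TCPCB_SACKED_RETRANS), rtx_tag(q, 3001, TCPCB_LOST), rtx_ack(q, 3001), rtx_tag(q, 3001, TCPCB_SACKED_RETRANS) and finally rtx_ack(q, 5001), reading rtx_count(q) after each step. Note that rtx_sack never clears queued: only rtx_ack removes a segment, which is the point of the whole walkthrough.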

Appendix

The packetdrill script used in the experiment

0 `sysctl -q net.ipv4.tcp_sack=1`
0 `sysctl -q net.ipv4.tcp_recovery=0`
+0  socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

// 3-way handshake
+0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+0 >  S. 0:0(0) ack 1 win 64240 <mss 1460,nop,nop,sackOK,nop,wscale 7>
//0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
+.1 < . 1:1(0) ack 1 win 257
+0 accept(3, ..., ...) = 4

// Write 5 data segments.
+0 write(4, ..., 1000) = 1000
+0 > P. 1:1001(1000) ack 1
+0 write(4, ..., 1000) = 1000
+0 > P. 1001:2001(1000) ack 1
+0 write(4, ..., 1000) = 1000
+0 > P. 2001:3001(1000) ack 1
+0 write(4, ..., 1000) = 1000
+0 > P. 3001:4001(1000) ack 1
+0 write(4, ..., 1000) = 1000
+0 > P. 4001:5001(1000) ack 1

// 1st ack 
+0 < . 1:1(0) ack 1001 win 257

// 2nd ack
+0 < . 1:1(0) ack 1001 win 257 <sack 2001:3001,nop,nop>
// 3rd ack
+0 < . 1:1(0) ack 1001 win 257 <sack 4001:5001,nop,nop>

// Retransmission of [1001:2001], then 4th ack
+0.1 > P. 1001:2001(1000) ack 1
+0 < . 1:1(0) ack 3001 win 257

// Retransmission of [3001:4001], then 5th ack
+0.1 > P. 3001:4001(1000) ack 1
+0 < . 1:1(0) ack 5001 win 257