Ideally, the window size or number of packets in flight of the transport protocol should be set to the BDP of the path, where bandwidth = rate of the bottleneck link, and RTT = end-to-end RTT from sending data to receiving the ACK. Understand why. Suppose BDP = N packets. Then, after you have sent N packets, the ACK for the first packet would have come back, enabling you to send the next packet, and so on. So in the ideal case, a window size of exactly N gives you just enough pipelining to fully utilize the bottleneck. ACKs arrive at precisely the moment you need to send the next packet ("ack clocking"). With fewer than N packets in flight, the bottleneck may sometimes be idle, leading to lower throughput. With more than N packets in flight, the extra packets get queued up at the bottleneck router (and eventually dropped when the buffer is full); you won't increase throughput, but you will increase per-packet delay. So, in an ideal world, maintain a BDP's worth of window, and no buffer is needed at the bottleneck router.
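As a back-of-the-envelope illustration (the link rate, RTT, and packet size below are made-up values, not from any measurement):

```python
# Back-of-the-envelope BDP calculation (illustrative numbers only).
link_rate_bps = 100e6          # bottleneck rate: 100 Mbit/s (assumed)
rtt_s = 0.05                   # end-to-end RTT: 50 ms (assumed)
packet_size_bytes = 1500       # typical Ethernet MTU-sized packet

bdp_bits = link_rate_bps * rtt_s                 # bits "in flight" to fill the pipe
bdp_packets = bdp_bits / (packet_size_bytes * 8) # convert to packets

print(f"BDP = {bdp_bits / 8 / 1e3:.0f} KB = {bdp_packets:.0f} packets")
# With these numbers: BDP = 625 KB, i.e. about 417 packets, so a window of
# ~417 packets keeps the bottleneck fully utilized with no queueing.
```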
Understand the graph of throughput vs. number of packets in flight (it increases up to the BDP, then flattens out). Understand the graph of observed packet delay vs. number of packets in flight (flat initially, as no packet sees queueing; it increases linearly after the number of packets crosses the BDP, due to queueing delays). [See the bufferbloat reference.]
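Both curves fall out of a simple fluid model of one flow over one bottleneck; a minimal sketch (all parameter values assumed) is:

```python
# Idealized throughput and per-packet delay as a function of packets in flight,
# for a single flow through one bottleneck with ample buffer (fluid model).
def throughput(n_inflight, bdp_packets, bottleneck_rate):
    # Below the BDP the sender limits the rate; above it the bottleneck caps it.
    return bottleneck_rate * min(n_inflight / bdp_packets, 1.0)

def delay(n_inflight, bdp_packets, base_rtt):
    # Up to the BDP no packet waits in the queue; beyond it, every extra
    # packet in flight sits in the bottleneck buffer and adds queueing delay.
    excess = max(n_inflight - bdp_packets, 0)
    return base_rtt * (1 + excess / bdp_packets)

bdp, rate, rtt = 100, 1000.0, 0.1   # packets, packets/s, seconds (assumed)
for n in (25, 50, 100, 200, 400):
    print(n, throughput(n, bdp, rate), round(delay(n, bdp, rtt), 3))
# Throughput climbs linearly until n = BDP and then flattens at the link rate;
# delay stays at the base RTT until n = BDP and then grows linearly.
```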
In practice, BDP is hard to measure, so TCP uses a self-learning mechanism: it increases the window size when all is going well, and reduces it when it observes congestion (as indicated by packet loss, for example). This ensures that the window hovers around the ideal size, though it can undershoot and overshoot it. A buffer at the bottleneck link is needed to accommodate these fluctuations.
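A toy sketch of this increase/decrease loop (not real TCP code; the loss pattern is made up):

```python
# Toy version of TCP's self-learning window adjustment (AIMD in
# congestion avoidance); the loss pattern below is invented.
def update_cwnd(cwnd, loss_detected):
    if loss_detected:
        return max(cwnd / 2.0, 1.0)   # multiplicative decrease on congestion
    return cwnd + 1.0                 # additive increase: +1 packet per RTT

cwnd = 10.0
for loss in [False] * 5 + [True] + [False] * 5:
    cwnd = update_cwnd(cwnd, loss)
print(cwnd)   # 10 -> 15 -> 7.5 -> 12.5: hovers around the (unknown) ideal window
```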
BDP is also variable due to cross traffic (other flows in the network). So the bottleneck buffer is also needed to absorb fluctuations caused by other traffic, ensuring the bottleneck stays utilized and packets are not dropped when the available bandwidth changes suddenly.
The bottleneck buffer size is a very important parameter in TCP performance. What is the ideal bottleneck buffer size? With no cross traffic and an ideal world, it is OK to have no buffer at all (as explained above, a buffer only hurts delay with no throughput benefit in the ideal case). However, given that the BDP is not known and is variable, and TCP uses heuristics to discover the BDP and the ideal window size, we need some buffering in the real world. How should we set the buffer size in the real world?
Even before we set the buffer size to its ideal value: what happens if the buffer size is above the ideal value? Overbuffering: too much delay, with no gain in throughput. What happens if we undersize the buffer? The bottleneck may run out of packets to send and be underutilized. So the buffer should be just large enough to keep the link occupied, but not much beyond that. Now let's see what the ideal value of the buffer is.
One simple heuristic: set the buffer size equal to the BDP, so that even if a BDP's worth of packets arrives in a burst, you can absorb it. A more analytical argument for BDP-sized buffers follows.
Consider a very simple TCP model. The TCP sender is in congestion-avoidance steady state: it increases its window up to W packets, a packet loss happens, and it reduces the window to W/2. Let's think in terms of packets for ease of analysis.
Now, packet loss happens when the window reaches W packets. At this point, the bottleneck buffer and the pipe have both filled up, so the bottleneck router has dropped a packet. Let the size of the bottleneck buffer be B packets, the rate of the bottleneck be R, and the end-to-end delay on the path be D. Then BDP = R * D, and W = R * D + B.
Now, the sender's window reduces to W/2. Until the sender gets at least W/2 ACKs, it cannot send the next packet. While the sender is paused, the bottleneck buffer should have enough packets to sustain the link. Since the sender receives ACKs at the bottleneck rate (ack clocking), the time taken for the sender to receive W/2 ACKs equals the time taken for the bottleneck link to send W/2 packets. That is, the bottleneck buffer should be able to supply at least W/2 packets at the bottleneck rate before new packets start arriving, so B should be at least W/2. Substituting B = W/2 into W = R*D + B gives W/2 = R*D, i.e., W = 2*R*D and hence B = R*D.
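To sanity-check the argument with concrete (made-up) numbers:

```python
# Sanity check of the buffer-sizing argument with arbitrary numbers.
R = 1000        # bottleneck rate, packets per second (assumed)
D = 0.1         # end-to-end (propagation) delay, seconds (assumed)

BDP = R * D             # 100 packets in the "pipe"
B = BDP                 # buffer sized to one BDP
W = BDP + B             # window at which the drop occurs: 200 packets

# After the loss the window halves; the sender must wait for W/2 ACKs.
acks_needed = W / 2                     # 100 ACKs
time_to_get_acks = acks_needed / R      # ACKs arrive at the bottleneck rate
packets_buffer_can_supply = B           # what the queue can drain meanwhile

print(acks_needed, packets_buffer_can_supply)   # 100.0 100 -> they match:
# the buffer empties exactly when the sender is allowed to send again.
```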
If the buffer is set to this value of R*D, it empties exactly when the sender has received W/2 ACKs and starts sending new data. So the buffer occupancy goes from R*D down to 0, and starts filling up again when the sender ramps up.
When the buffer is set to the BDP, the max window size is W = 2 * BDP and the min window size is W/2 = BDP. So the window keeps oscillating between BDP and (BDP + buffer size) = 2 * BDP. The average window size is 1.5 * BDP.
When the queue is empty (the buffer has fully drained, just before TCP starts ramping up after a loss), the queueing delay is 0, so the total delay = D. When the queue is full (TCP is at its max window size, the buffer is full, just before a drop), the queueing delay is D, so the total delay = 2D. That is, the TCP segment RTT swings between D and 2D due to queueing delay, so the average RTT = 1.5 * D.
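The swing in window, queue, and RTT can be traced with a tiny per-RTT fluid sketch (the BDP, buffer, and base-RTT values are assumptions, and this is an idealized model, not a packet-level simulation):

```python
# Idealized congestion-avoidance sawtooth with a BDP-sized buffer.
BDP = 100                       # pipe size in packets (assumed)
B = BDP                         # buffer sized to one BDP
D = 0.1                         # base RTT in seconds (assumed)

cwnd, rtts = float(BDP), []
for _ in range(1000):
    queue = max(cwnd - BDP, 0)            # packets that do not fit in the pipe
    rtts.append(D * (1 + queue / BDP))    # queueing adds up to one extra D
    cwnd = cwnd / 2 if cwnd >= BDP + B else cwnd + 1   # AIMD sawtooth

print(min(rtts), max(rtts), sum(rtts) / len(rtts))
# cwnd swings between BDP and 2*BDP, the queue between 0 and BDP, and the
# observed RTT between D and 2*D, with an average of roughly 1.5*D.
```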
What happens if the buffer is too large or too small? If the buffer is too large, it never empties between two TCP cycles, leading to extra queueing delay. Note that TCP can fill up any available buffer space by choosing a larger-than-required window size. This is the bufferbloat problem, which exists in the Internet today to some extent, especially in wireless networks, home cable networks, etc. [See the bufferbloat reference paper.] For example, if the buffer size is 3 * BDP, the TCP window oscillates between 4 * BDP and 2 * BDP; after 2 * BDP out of the 3 * BDP buffer drains out, TCP starts ramping up again. So the buffer never empties, leading to unnecessary extra queueing delay.
If the buffer is too small, it leads to buffer underflow: the buffer cannot keep the bottleneck link busy while TCP slows down (and TCP will be slowing down often due to frequent buffer drops). Frequent buffer drops (and other sources of loss) can cause low TCP throughput because the window holds fewer-than-ideal packets. Suppose the buffer size = 0.5 * BDP. Then the window oscillates between 1.5 * BDP and 0.75 * BDP. TCP starts ramping up only after getting 0.75 * BDP worth of ACKs, but during this time the buffer can only push through 0.5 * BDP worth of data. For the rest of the time, the TCP sender is waiting for ACKs while the buffer is empty and the link sits unused.
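To compare oversized and undersized buffers side by side, here is a crude fluid-model utilization estimate (same idealized single-flow assumptions as above; the buffer fractions are illustrative):

```python
# Crude fluid-model estimate of bottleneck utilization vs. buffer size.
def utilization(buffer_frac, bdp=100, n_rtts=2000):
    B = buffer_frac * bdp          # buffer size in packets
    D = 1.0                        # base RTT (arbitrary time units)
    R = bdp / D                    # bottleneck rate, packets per time unit
    cwnd, busy, total = float(bdp), 0.0, 0.0
    for _ in range(n_rtts):
        queue = max(cwnd - bdp, 0)
        rtt = D + queue / R                    # base RTT plus queueing delay
        busy += cwnd / R                       # time the link spends sending cwnd packets
        total += rtt
        cwnd = cwnd / 2 if cwnd >= bdp + B else cwnd + 1
    return busy / total

for frac in (0.25, 0.5, 1.0, 2.0, 3.0):
    print(f"buffer = {frac:>4} * BDP -> utilization ~ {utilization(frac):.2f}")
# Undersized buffers leave the link partly idle each cycle (utilization < 1);
# a buffer of one BDP or more keeps the link fully busy, but anything beyond
# one BDP only adds standing queueing delay (bufferbloat), not throughput.
```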
What happens when there is more than one flow? Does the BDP heuristic for buffer size still hold? When multiple flows share a link of rate R, their throughputs add up to R, so their BDPs add up to R*D (assuming equal delays D). So the conventional wisdom was that links should be provisioned with a buffer of bandwidth * average expected delay. However, analysis ["Sizing Router Buffers" reference] shows that significantly lower values are enough, because the peaks of the different flows' windows are not synchronized.
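The rule of thumb from that analysis is a buffer of about C * RTT / sqrt(n) for n long-lived, desynchronized flows; a quick numeric comparison (the link rate and RTT below are illustrative):

```python
# Buffer needed for n long-lived, desynchronized flows, per the
# "Sizing Router Buffers" result: B ~ C * RTT / sqrt(n).
import math

C_bps = 10e9            # 10 Gbit/s bottleneck (illustrative)
rtt_s = 0.25            # 250 ms average RTT (illustrative)
bdp_bits = C_bps * rtt_s

for n in (1, 100, 10000):
    buf_bits = bdp_bits / math.sqrt(n)
    print(f"n = {n:>5} flows -> buffer ~ {buf_bits / 8 / 1e6:.1f} MB")
# With one flow the full BDP (~312 MB here) is needed; with 10,000 flows
# a buffer ~100x smaller suffices because the flows' sawtooths are out of phase.
```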
Average window size = 3/4 * W, so the average throughput = 3/4 * W * MSS / RTT. Clearly, the TCP throughput achieved has an inverse relationship with RTT. This is called the RTT unfairness of TCP: if two flows share a bottleneck link, the flow with the higher RTT gets a lower share of the bottleneck link.
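To see the effect numerically, here is a quick sketch with two hypothetical flows that reach the same maximum window W but have very different RTTs (all numbers assumed):

```python
# Two flows with the same max window W but different RTTs (assumed numbers).
MSS_bytes = 1460
W = 100                       # max window in segments (assumed identical)
for rtt_ms in (20, 200):
    thr_bps = 0.75 * W * MSS_bytes * 8 / (rtt_ms / 1000)
    print(f"RTT = {rtt_ms:>3} ms -> ~{thr_bps / 1e6:.1f} Mbit/s")
# The flow with a 10x longer RTT gets ~10x less throughput: RTT unfairness.
```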
What is the relationship between TCP throughput and loss rate? [See the reference on the TCP throughput model.] Consider the sawtooth diagram, and one cycle where the window goes from W/2 to W. Since the window size increases by 1 segment every RTT, it increases by W/2 segments in W/2 * RTT. The number of packets transmitted in one cycle is the area under one "tooth" = (W/2)^2 + 1/2 * (W/2)^2 = 3/8 * W^2. The expected number of packets in one cycle is also 1/p, if p is the probability of packet loss. Equating the two, we get W = sqrt(8/(3p)). Substituting into the throughput formula, we get
throughput = sqrt(3/2) * MSS / ( RTT * sqrt(p) )
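This is the familiar inverse-square-root relationship between throughput and loss; plugging in illustrative numbers (the MSS, RTT, and loss rates below are assumptions):

```python
# Plugging illustrative numbers into the throughput formula above.
import math

MSS_bytes = 1460
rtt_s = 0.1                       # 100 ms RTT (assumed)
for p in (1e-2, 1e-4, 1e-6):      # packet loss probabilities
    thr_Bps = math.sqrt(1.5) * MSS_bytes / (rtt_s * math.sqrt(p))
    print(f"p = {p:.0e} -> ~{thr_Bps * 8 / 1e6:.1f} Mbit/s")
# Throughput scales as 1/sqrt(p): cutting the loss rate by 100x
# improves throughput by only 10x, and vice versa.
```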
In general, more losses mean TCP is frequently reducing its window size, which can lead to lower-than-optimal utilization of the bottleneck link and lower-than-ideal throughput.
Finally, discuss fairness. Consider two users sharing a link of capacity C. Let (x, y) be their throughputs; we can represent their achieved rates as the point (x, y) on a graph. For both flows to utilize the link capacity fully, we want x + y = C. For fairness, we want x = y. We want the congestion control algorithm to converge to the intersection of these two lines. We can show via graphical reasoning that Additive Increase Multiplicative Decrease (AIMD) converges to this ideal point, no matter what values of x and y we start with.
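A small simulation of the same graphical argument (a Chiu-Jain style sketch; the capacity and starting rates are arbitrary illustrative values):

```python
# Two AIMD flows sharing a link of capacity C (Chiu-Jain style sketch;
# the capacity and starting rates are arbitrary illustrative values).
C = 100.0                       # link capacity
x, y = 80.0, 10.0               # deliberately unfair starting rates

for _ in range(2000):
    if x + y > C:               # congestion signal: both back off multiplicatively
        x, y = x / 2, y / 2
    else:                       # otherwise both probe upward additively
        x, y = x + 1, y + 1

print(round(x, 2), round(y, 2))
# Additive increase moves the point (x, y) along the 45-degree direction,
# pushing the ratio y/x toward 1, while multiplicative decrease moves it
# toward the origin, preserving the ratio. Over repeated cycles the rates
# become equal (fairness) while the total rate cycles up to x + y = C.
```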