Recently I came across a data communications subject that was new to me: the bandwidth-delay product. Knowing about it can help you recognize certain network issues and find ways to resolve them. The scenario is two Linux hosts, a source and a destination, communicating with each other over a high-capacity network link. The question is: given this scenario, how can you reach maximum throughput over the network?
BDP
The first step is to determine the bandwidth-delay product for this network. The bandwidth-delay product (BDP) is defined as the product of a data link's capacity (in bits per second) and its round-trip delay time (in seconds). The result is the maximum amount of data (in bits or bytes) that can be on the network at any given time, that is, data that has been transmitted but not yet acknowledged.
Why is this important? The TCP protocol is designed for reliable transmission of data, and acknowledgements are an essential part of the protocol. A high BDP value has an impact on the efficiency of TCP, because the protocol can only achieve optimum throughput if the sender sends a sufficiently large quantity of data before being required to stop and wait for a confirming message (acknowledgement) from the receiver, acknowledging successful receipt of that data.
Round-trip delay time (RTD) or Round-trip time (RTT) is defined as the length of time it takes for a signal to be sent plus the length of time it takes for an acknowledgment of that signal to be received.
For measuring the RTT and the bandwidth, you can use tools such as ping and iPerf3.
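For example, the measurements could be taken along these lines (dst-host is a placeholder for the destination's hostname or IP address):

```shell
# Measure the RTT from the source host; the average is in the summary line:
ping -c 10 dst-host

# Measure the bandwidth: start an iperf3 server on the destination host...
iperf3 -s

# ...then run the client on the source host for 30 seconds:
iperf3 -c dst-host -t 30
```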
In our example we use the following average values:
RTT = 3 ms
Bandwidth = 2655752000 bits/seconds
Now the BDP = Bandwidth (bits/second) x RTT (seconds)
BDP = 2655752000 x 3×10^-3 = 7967256 bits = 995907 bytes ≈ 996 kB
The BDP for this network is about 1 MB. Networks with a BDP significantly larger than 1×10^5 bits (12,500 bytes) are considered Long Fat Networks (LFNs). So our example network is definitely an LFN.
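The calculation above is easy to script; here is a small sketch using shell integer arithmetic (the function name is my own):

```shell
# Bandwidth-delay product in bytes, using integer arithmetic.
# Arguments: bandwidth in bits/second, RTT in milliseconds.
bdp_bytes() {
  # bits/s * ms / 1000 = bits in flight; / 8 = bytes
  echo $(( $1 * $2 / 1000 / 8 ))
}

bdp_bytes 2655752000 3   # prints 995907
```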
TCP tuning
Now we will look at how to optimize the performance between our two hosts. Tests are performed by copying a 1 GB file of random data from the source host to the destination host. The most important TCP tuning parameter is the TCP window size, which controls how much data can be in the network at any one point. The optimal value for the TCP window size is the bandwidth-delay product. The following TCP settings play an important role at the source host:
net.core.wmem_default: This sets the default OS send buffer size for all types of connections.
net.core.wmem_max: This sets the max OS send buffer size for all types of connections.
net.ipv4.tcp_wmem: TCP Auto-tuning setting. “This variable takes 3 different values which holds information on how much TCP sendbuffer memory space each TCP socket has to use. Every TCP socket has this much buffer space to use before the buffer is filled up. Each of the three values are used under different conditions. … The first value in this variable tells the minimum TCP send buffer space available for a single TCP socket. … The second value in the variable tells us the default buffer space allowed for a single TCP socket to use. … The third value tells the kernel the maximum TCP send buffer space.”
At the target host, the counterparts:
net.core.rmem_default: This sets the default OS receive buffer size for all types of connections.
net.core.rmem_max: This sets the max OS receive buffer size for all types of connections.
net.ipv4.tcp_rmem: TCP Auto-tuning setting. “The first value tells the kernel the minimum receive buffer for each TCP connection, and this buffer is always allocated to a TCP socket, even under high pressure on the system. … The second value specified tells the kernel the default receive buffer allocated for each TCP socket. This value overrides the /proc/sys/net/core/rmem_default value used by other protocols. … The third and last value specified in this variable specifies the maximum receive buffer that can be allocated for a TCP socket.”
And finally, for both hosts, net.ipv4.tcp_mem: TCP Auto-tuning setting. The tcp_mem variable defines how the TCP stack should behave when it comes to memory usage; the unit used here is memory pages. … The first value specified in the tcp_mem variable tells the kernel the low threshold. Below this point, the TCP stack does not bother at all about putting any pressure on the memory usage of different TCP sockets. … The second value tells the kernel at which point to start pressuring memory usage down. … The third value tells the kernel the maximum number of memory pages it may use. If this value is reached, TCP streams and packets start getting dropped until memory usage drops again. This value includes all TCP sockets currently in use. The defaults are calculated at boot time from the amount of available memory.
For more details on these settings, see the tcp(7) man page.
These settings are all kernel parameters, which can be set using the sysctl command, e.g.:
# sysctl -w net.core.rmem_default=497954
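Note that sysctl -w only changes the running kernel; the setting is lost at reboot. To make it persistent, you can also drop it into a file under /etc/sysctl.d/ (the file name below is just an example):

```shell
# Apply immediately:
sysctl -w net.core.rmem_default=497954

# Make it survive a reboot:
echo 'net.core.rmem_default = 497954' > /etc/sysctl.d/90-tcp-tuning.conf
sysctl -p /etc/sysctl.d/90-tcp-tuning.conf
```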
Testing
In the first scenario, we will use the default (untuned) TCP settings:
On the source host:
net.core.wmem_default 212992
net.core.wmem_max 212992
net.ipv4.tcp_wmem 4096 87380 212992
On the destination host:
net.core.rmem_default 212992
net.core.rmem_max 212992
net.ipv4.tcp_rmem 4096 87380 212992
File transfer took on average 22 seconds, bandwidth 45.5 MB/s.
In the second scenario, the TCP default window sizes are set according to the bandwidth-delay product calculated above: the defaults are set to 995908 bytes.
On the source host:
net.core.wmem_default 995908
net.core.wmem_max 1000000
net.ipv4.tcp_wmem 4096 995908 1000000
On the destination host:
net.core.rmem_default 995908
net.core.rmem_max 1000000
net.ipv4.tcp_rmem 4096 995908 1000000
File transfer took on average 7 seconds, bandwidth 145 MB/s.
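For reference, a sketch of the sysctl commands that apply these scenario-two values (run as root; note that tcp_wmem/tcp_rmem take three space-separated values, so they must be quoted):

```shell
# On the source host (send buffers):
sysctl -w net.core.wmem_default=995908
sysctl -w net.core.wmem_max=1000000
sysctl -w net.ipv4.tcp_wmem='4096 995908 1000000'

# On the destination host (receive buffers):
sysctl -w net.core.rmem_default=995908
sysctl -w net.core.rmem_max=1000000
sysctl -w net.ipv4.tcp_rmem='4096 995908 1000000'
```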
In the third scenario, to verify that the BDP calculation was accurate, the TCP default window sizes were set to 2x the BDP calculations. This does not give any performance increase, and may even increase latency due to ‘buffer bloat’.
On the source host:
net.core.wmem_default 2000000
net.core.wmem_max 3000000
net.ipv4.tcp_wmem 4096 2000000 3000000
On the destination host:
net.core.rmem_default 2000000
net.core.rmem_max 3000000
net.ipv4.tcp_rmem 4096 2000000 3000000
File transfer took on average 7 seconds, bandwidth 143 MB/s
In the fourth scenario, the TCP default window sizes are set to BDP / 2, and the maximum window sizes are set to the BDP. This gives performance similar to the second scenario, because the TCP window sizes are scaled up to the maximum very quickly.
On the source host:
net.core.wmem_default 497954
net.core.wmem_max 995908
net.ipv4.tcp_wmem 4096 497954 995908
On the destination host:
net.core.rmem_default 497954
net.core.rmem_max 995908
net.ipv4.tcp_rmem 4096 497954 995908
File transfer took on average 7 seconds, bandwidth 143 MB/s.
Conclusion
The fourth scenario is a good choice: it performs as well as the second scenario, while initially allocating only half the memory.
Also, don't forget to check the receive and transmit ring buffers on the NICs of both hosts (using ethtool). In this example, the recommended settings are:
RX (receive) buffer: 4096
TX (transmit) buffer: 4096
RX-jumbo (receive buffer for jumbo frames): 4096
# ethtool -G eth0 rx 4096 tx 4096 rx-jumbo 4096
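Before changing anything, it is worth inspecting the current and hardware-maximum ring sizes first, since you cannot set a value above what the NIC supports (eth0 is a placeholder for your interface name):

```shell
# Show current ring buffer settings and the hardware maximums:
ethtool -g eth0
```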
This post is based on a paper written by my dear colleague Tiarnán Ó Corráin. Thank you very much for this Mr. T.!
Besides that, I used various resources; the most valuable were the Wikipedia pages, Travis Sparks, and the tcp man page.
As always, I thank you for reading.