About Long Fat Networks and TCP tuning

Recently I came across a data communications topic that was largely unknown to me: the bandwidth-delay product. Understanding it can help you recognize certain network issues and find ways to resolve them. The scenario is two Linux hosts, a source and a destination, communicating with each other over a high-capacity network link. The question is: given this scenario, how can you reach maximum throughput over the network?

BDP

The first step is to determine the bandwidth-delay product for this network. The bandwidth-delay product (BDP) is defined as the product of a data link’s capacity (in bits per second) and its round-trip delay time (in seconds). The result, an amount of data in bits (or bytes), is the maximum amount of data that can be on the network at any given time, that is, data that has been transmitted but not yet acknowledged.
Why is this important? TCP is designed for reliable transmission of data, and acknowledgements are an essential part of the protocol. A high BDP affects the efficiency of TCP, because the protocol can only achieve optimum throughput if the sender transmits a sufficiently large quantity of data before it has to stop and wait for an acknowledgement from the receiver confirming successful receipt of that data.

Round-trip delay time (RTD) or Round-trip time (RTT) is defined as the length of time it takes for a signal to be sent plus the length of time it takes for an acknowledgment of that signal to be received.

For measuring the RTT and the bandwidth, you can use tooling like ping and iPerf3.
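As an illustration of getting the RTT out of ping, here is a minimal Python sketch that extracts the average from the summary line that Linux (iputils) ping prints at the end of a run. The sample line and its numbers are made up for illustration, and the helper name average_rtt_seconds is hypothetical; other ping implementations may format the summary differently.

```python
import re

# Matches the four slash-separated values of the iputils summary line,
# e.g. "rtt min/avg/max/mdev = a/b/c/d ms".
RTT_SUMMARY = re.compile(r"=\s*([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+)\s*ms")


def average_rtt_seconds(summary_line: str) -> float:
    """Return the average RTT in seconds from a ping summary line."""
    match = RTT_SUMMARY.search(summary_line)
    if match is None:
        raise ValueError("no RTT summary found in: %r" % summary_line)
    # Group 2 is the average, reported in milliseconds.
    return float(match.group(2)) / 1000.0


# Made-up sample line, not output captured from the hosts in this post.
sample = "rtt min/avg/max/mdev = 2.871/3.012/3.204/0.118 ms"
print(average_rtt_seconds(sample), "seconds")
```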

In our example we use the following average values:
RTT = 3 ms
Bandwidth = 2655752000 bits/second

Now the BDP = Bandwidth (bits/second) × RTT (s)
BDP = 2655752000 × 3×10⁻³ = 7967256 bits ≈ 995908 bytes ≈ 996 kB

The BDP for this network is about 1 MB. Networks with a BDP significantly larger than 1×10⁵ bits (12500 bytes) are considered Long Fat Networks (LFNs), so our example network is definitely an LFN.
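The calculation above can be reproduced with a few lines of Python. This is only a sketch: the function names are my own, and the LFN check uses a plain "greater than" comparison where the definition says "significantly larger". Exact integer division gives 995907 bytes, within one byte of the figure used in this post.

```python
BANDWIDTH_BPS = 2_655_752_000  # measured bandwidth in bits per second
RTT_MS = 3                     # measured round-trip time in milliseconds
LFN_THRESHOLD_BITS = 10**5     # 1×10^5 bits, i.e. 12500 bytes


def bdp_bits(bandwidth_bps: int, rtt_ms: int) -> int:
    """Bandwidth-delay product in bits (integer math avoids float rounding)."""
    return bandwidth_bps * rtt_ms // 1000


def is_lfn(bdp: int) -> bool:
    """Simplified check: BDP above the 10^5-bit threshold."""
    return bdp > LFN_THRESHOLD_BITS


bits = bdp_bits(BANDWIDTH_BPS, RTT_MS)
print(bits, "bits")
print(bits // 8, "bytes")
print("Long Fat Network:", is_lfn(bits))
```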

TCP tuning

Now we will look at how to optimize throughput between our two hosts. Tests are performed by copying a 1 GB file of random data from the source host to the target host. The most important tuning parameter for TCP is the window size, which controls how much data can be in flight on the network at any one time. The optimal value for the TCP window size is the bandwidth-delay product. The following TCP settings play an important role at the source host:

net.core.wmem_default: This sets the default OS send buffer size for all types of connections.

net.core.wmem_max: This sets the max OS send buffer size for all types of connections.

net.ipv4.tcp_wmem: TCP Auto-tuning setting. “This variable takes 3 different values which holds information on how much TCP sendbuffer memory space each TCP socket has to use. Every TCP socket has this much buffer space to use before the buffer is filled up. Each of the three values are used under different conditions. … The first value in this variable tells the minimum TCP send buffer space available for a single TCP socket. … The second value in the variable tells us the default buffer space allowed for a single TCP socket to use. … The third value tells the kernel the maximum TCP send buffer space.”

At the target host, the counterparts:

net.core.rmem_default: This sets the default OS receive buffer size for all types of connections.

net.core.rmem_max: This sets the max OS receive buffer size for all types of connections.

net.ipv4.tcp_rmem: TCP Auto-tuning setting. “The first value tells the kernel the minimum receive buffer for each TCP connection, and this buffer is always allocated to a TCP socket, even under high pressure on the system. … The second value specified tells the kernel the default receive buffer allocated for each TCP socket. This value overrides the /proc/sys/net/core/rmem_default value used by other protocols. … The third and last value specified in this variable specifies the maximum receive buffer that can be allocated for a TCP socket.”

And finally, for both hosts: net.ipv4.tcp_mem: TCP Auto-tuning setting. The tcp_mem variable defines how the TCP stack should behave when it comes to memory usage; the unit used here is memory pages. … The first value specified in the tcp_mem variable tells the kernel the low threshold. Below this point, the TCP stack does not bother at all about putting any pressure on the memory usage by different TCP sockets. … The second value tells the kernel at which point to start pressuring memory usage down. … The third value tells the kernel how many memory pages it may use maximally. If this value is reached, TCP streams and packets start getting dropped until we reach a lower memory usage again. This value includes all TCP sockets currently in use. The defaults are calculated at boot time from the amount of available memory.
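Because tcp_mem is expressed in memory pages rather than bytes, it helps to convert before comparing it with the buffer sizes above. The sketch below does that conversion in Python. The three tcp_mem values are placeholders chosen for illustration, not values read from the test hosts, and the helper name pages_to_bytes is my own.

```python
import mmap


def pages_to_bytes(pages: int, page_size: int) -> int:
    """Convert a tcp_mem threshold from memory pages to bytes."""
    return pages * page_size


# Placeholder thresholds (low, pressure, max) in pages; hypothetical values.
example_tcp_mem = (88560, 118080, 177120)

# mmap.PAGESIZE is the page size of the machine running this script,
# commonly 4096 bytes on x86 Linux.
for label, pages in zip(("low", "pressure", "max"), example_tcp_mem):
    print(label, pages_to_bytes(pages, mmap.PAGESIZE), "bytes")
```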

For more details on these settings, see the tcp(7) man page.

These settings are all kernel parameters, which can be set using the sysctl command, e.g.:

# sysctl -w net.core.rmem_default=497954
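The same parameters are also exposed as files under /proc/sys, with the dots in the name replaced by slashes. As a sketch, the Python below maps a parameter name to its path and parses the value(s); the helper names are hypothetical, and the simple dot-to-slash mapping is an assumption that holds for the parameters in this post but not for names that themselves contain dots (such as some interface names).

```python
from pathlib import PurePosixPath
from typing import List


def sysctl_path(name: str) -> PurePosixPath:
    """Map a sysctl name like net.ipv4.tcp_rmem to its /proc/sys path."""
    return PurePosixPath("/proc/sys") / name.replace(".", "/")


def parse_values(text: str) -> List[int]:
    """Parse one or more whitespace-separated integers."""
    return [int(field) for field in text.split()]


def read_sysctl(name: str) -> List[int]:
    """Read the current value(s) of a parameter (Linux only)."""
    with open(str(sysctl_path(name))) as handle:
        return parse_values(handle.read())


print(sysctl_path("net.core.rmem_default"))
# Parsing a min/default/max triple as used by tcp_rmem and tcp_wmem.
print(parse_values("4096 87380 212992"))
```

On a Linux host, read_sysctl("net.ipv4.tcp_rmem") would return the three values currently in effect.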

Testing

In the first scenario, we will use the default (untuned) TCP settings:

On the source host:

net.core.wmem_default 212992

net.core.wmem_max 212992

net.ipv4.tcp_wmem 4096 87380 212992

On the destination host:

net.core.rmem_default 212992

net.core.rmem_max 212992

net.ipv4.tcp_rmem 4096 87380 212992

File transfer took on average 22 seconds, bandwidth 45.5 MB/s
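As a rough cross-check of this result, the Python sketch below recomputes the reported rate from the file size and duration, and compares it with the ceiling that one window per round trip imposes (window size divided by RTT). It assumes that 1 GB means 10^9 bytes and, optimistically, that the whole 212992-byte buffer is usable as TCP window; both are my assumptions, not facts taken from the measurements.

```python
FILE_BYTES = 10**9      # assumption: "1 GB" means 10^9 bytes
RTT_S = 0.003           # round-trip time from the BDP section
BUFFER_BYTES = 212_992  # untuned buffer maximum in this scenario


def measured_mb_per_s(file_bytes: int, seconds: float) -> float:
    """Average transfer rate in MB/s for a file copied in `seconds`."""
    return file_bytes / seconds / 1e6


def window_bound_mb_per_s(window_bytes: int, rtt_s: float) -> float:
    """Upper bound in MB/s when at most one window is sent per round trip."""
    return window_bytes / rtt_s / 1e6


print(f"measured:     {measured_mb_per_s(FILE_BYTES, 22):.1f} MB/s")
print(f"window bound: {window_bound_mb_per_s(BUFFER_BYTES, RTT_S):.1f} MB/s")
```

The measured rate stays below the computed ceiling, which is consistent with the window, rather than the link, being the limiting factor here.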

In the second scenario, TCP default window sizes are set according to the bandwidth-delay product calculations given above. TCP default sizes are set to 995908 bytes.

On the source host:

net.core.wmem_default 995908

net.core.wmem_max 1000000

net.ipv4.tcp_wmem 4096 995908 1000000

On the destination host:

net.core.rmem_default 995908

net.core.rmem_max 1000000

net.ipv4.tcp_rmem 4096 995908 1000000

File transfer took on average 7 seconds, bandwidth 145 MB/s

In the third scenario, to verify that the BDP calculation was accurate, the TCP default window sizes are set to 2x the calculated BDP. This does not give any performance increase, and may even increase latency due to bufferbloat.

On the source host:

net.core.wmem_default 2000000

net.core.wmem_max 3000000

net.ipv4.tcp_wmem 4096 2000000 3000000

On the destination host:

net.core.rmem_default 2000000

net.core.rmem_max 3000000

net.ipv4.tcp_rmem 4096 2000000 3000000

File transfer took on average 7 seconds, bandwidth 143 MB/s

In the fourth scenario, the TCP default window sizes are set to BDP / 2, and the maximum window sizes are set to the BDP. This gives performance similar to the second scenario, because the TCP window sizes will be scaled very quickly to the maximum.

On the source host:

net.core.wmem_default 497954

net.core.wmem_max 995908

net.ipv4.tcp_wmem 4096 497954 995908

On the destination host:

net.core.rmem_default 497954

net.core.rmem_max 995908

net.ipv4.tcp_rmem 4096 497954 995908

File transfer took on average 7 seconds, bandwidth 143 MB/s.
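The settings of this fourth scenario follow mechanically from the BDP, so they can be generated. The Python sketch below derives them and prints matching sysctl commands; the function names are hypothetical, and the minimum of 4096 bytes is simply carried over from the listings above.

```python
from typing import Dict, List


def scenario_four(bdp_bytes: int, direction: str, minimum: int = 4096) -> Dict[str, str]:
    """Settings with default = BDP / 2 and max = BDP.

    direction is "w" for the send side (source host)
    or "r" for the receive side (destination host).
    """
    default = bdp_bytes // 2
    return {
        f"net.core.{direction}mem_default": str(default),
        f"net.core.{direction}mem_max": str(bdp_bytes),
        f"net.ipv4.tcp_{direction}mem": f"{minimum} {default} {bdp_bytes}",
    }


def as_commands(settings: Dict[str, str]) -> List[str]:
    """Render the settings as sysctl -w command lines."""
    return [f'sysctl -w {name}="{value}"' for name, value in settings.items()]


for host, direction in (("source", "w"), ("destination", "r")):
    print(f"# on the {host} host")
    for command in as_commands(scenario_four(995_908, direction)):
        print(command)
```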

Conclusion

The fourth scenario is a good choice: it performs as well as the second scenario while initially allocating only half of the memory.

Also, don’t forget to check the receive and transmit buffers on the NICs of both hosts (using ethtool). In this example, the recommended settings are:

RX (receive) buffer: 4096

TX (transmit) buffer: 4096

RX-jumbo (receive buffer for jumbo frames): 4096

# ethtool -G eth0 rx 4096 tx 4096 rx-jumbo 4096

This post is based on a paper written by my dear colleague Tiarnán Ó Corráin. Thank you very much for this Mr. T.!

Besides that, I used various resources; the most valuable were Wikipedia pages, Travis Sparks and the TCP man page.

As always, I thank you for reading.

Adventures in a Virtual World