Increase TCP performance for gigabit network

From Blue-IT.org Wiki

Source of this article

The whole text of this article is from this source: [1], written by Mattias Wadenstein <maswan@acc.umu.se>, please send suggestions for improvements. Feel free to use or publish in original or modified form with attribution or link directly here.

TCP performance tuning - how to tune linux

The short summary:

  • The default Linux tcp window sizing parameters sucks.
  • The short fix [wirespeed for gigE within 5 ms RTT and fastE within 50 ms RTT]:
vim /etc/sysctl.conf
net/core/rmem_max = 8738000
net/core/wmem_max = 6553600
net/ipv4/tcp_rmem = 8192 873800 8738000
net/ipv4/tcp_wmem = 4096 655360 6553600


It might also be a good idea to increase vm/min_free_kbytes, especially if you have e1000 with NAPI or similar. A sensible value is 16M or 64M:

vm/min_free_kbytes = 65536

If you run an ancient kernel, increase the txqueuelen to at least 1000:

ifconfig ethN txqueuelen 1000

If you are seeing "TCP: drop open request" for real load (not a DDoS), you need to increase tcp_max_syn_backlog (8192 worked much better than 1024 on heavy webserver load).

The background:

TCP performance is limited by latency and window size (and overhead, which reduces the effective window size) by window_size/RTT (this is how much data that can be "in transit" over the link at any given moment).

To get the actual transfer speeds possible you have to divide the resulting window by the latency (in seconds):

The overhead is: window/2^tcp_adv_win_scale (tcp_adv_win_scale default is 2)

So for linux default parameters for the recieve window (tcp_rmem): 87380 - (87380 / 2^2) = 65536.

Given a transatlantic link (150 ms RTT), the maximum performance ends up at: 65536/0.150 = 436906 bytes/s or about 400 kbyte/s, which is really slow today.

With the increased default size: (873800 - 873800/2^2)/0.150 = 4369000 bytes/s, or about 4Mbytes/s, which is resonable for a modern network. And note that this is the default, if the sender is configured with a larger window size it will happily scale up to 10 times this (8738000*0.75/0.150 = ~40Mbytes/s), pretty good for a modern network.

For the txqueuelen, this is mostly relevant for gigE, but should not hurt anything else. Old kernels have shipped with a default txqueuelen of 100, which is definately too low and hurts performance.

net/core/[rw]mem_max is in bytes, and the largest possible window size. net/ipv4/tcp_[rw]mem is in bytes and is "min default max" for the tcp windows, this is negotiated between both sender and reciever. "r" is for when this machine is on the recieving end, "w" when the connection is initiated from this machine.

There are more tuning parameters, for the Linux kernel they are documented in Documentation/networking/ip-sysctl.txt, but in our experience only the parameters above need tuning to get good tcp performance..


So, what's the downside?

None pretty much. It uses a bit more kernel memory, but this is well regulated by a tuning parameter (net/ipv4/tcp_mem) that has good defaults (percentage of physical ram). Note that you shouldn't touch that unless you really know what you are doing. If you change it and set it too high, you might end up with no memory left for processes and stuff.

If you go up above the middle value of net/ipv4/tcp_mem, you enter tcp_memory_pressure, which means that new tcp windows won't grow until you have gotten back under the pressure value. Allowing bigger windows means that it takes fewer connections for someone evil to make the rest of the tcp streams to go slow.

What you remove is an artificial limit to tcp performance, without that limit you are bounded by the available end-to-end bandwidth and loss. So you might end up saturating your uplink more effectively, but tcp is good at handling this.

The txqueuelen increase will eat about 1.5 megabytes of memory at most given an MSS of 1500 bytes (normal ethernet).

Regarding min_free_kbytes, faster networking means kernel buffers get full faster and you need more headroom to be able to allocate them. You need to have enough to last until the vm manages to free up more memory, and at high transfer speeds you have high buffer filling speeds too. This will eat memory though, memory that will not be available for normal processes or file cache.

If you see stuff like "swapper: page allocation failure. order:0, mode:0x20" you definately need to increase min_free_kbytes for the vm.


Notes for Linux 2.4 users:

The RFC1323 window scale value is initially calculated as roof(ln(x/65536)/ln(2)), where in 2.4 based kernels x = initial receive buffer (or tcp_rmem[default]) and in 2.6 kernels x = max(tcp_rmem[max], core_rmem_max)

This effectively limits 2.4 kernel tcp windows to ~tcp_rmem[default], but from 2.4.27 (or possibly earlier) the wscale is set to max(calculated_wscale, tcp_default_win_scale) which means that one can set tcp_default_win_scale to the same value as a 2.6 kernel would.

So if there is a tcp_default_win_scale in /proc/sys/net/ipv4 you should add net/ipv4/tcp_default_win_scale = roof(ln(tcp_rmem[max]/65536)/ln(2)) to /etc/sysctl.conf


Notes for other operating systems:

Find out how to set the tcp window sizing options, then increase them to sensible values. Most operating systems have the same horribly small default values.

AIX example: no -p -o sb_max=8738000 -o rfc1323=1 -o tcp_recvspace=873800 -o tcp_sendspace=873800 -o use_isno=0 AIX 5.2 and up sets tuning parameters permanently like this, otherwise you have to set them at boot in /etc/rc.net. See man no for more information on AIX.

Solaris example:

ndd -set /dev/tcp tcp_xmit_hiwat 873800
ndd -set /dev/tcp tcp_recv_hiwat 873800
ndd -set /dev/tcp tcp_max_buf 8738000

See also:

[2] - long page with lots of info on obscure operating systems.

About this document:

Written by Mattias Wadenstein <maswan@acc.umu.se>, please send suggestions for improvements. Feel free to use or publish in original or modified form with attribution or link directly here.