How to achieve Gigabit speeds with Linux


1 Gbit/s network cards have been available for some time now and 10 Gbit/s cards have recently become available. However, achieving rates of the order of Gigabits per second is not straightforward: it requires careful tuning of several components in the end systems. This document aims to provide information on how to configure end systems to achieve Gbit/s data transfers.

Before you begin with the configuration, MAKE SURE YOU HAVE PERMISSION TO TRANSFER DATA ON THE ENTIRE NETWORK. Gbit/s transfers can cause real damage to a network and are not (yet) appropriate for production environments.

Although this document is mainly about achieving data transfers using TCP, some parts are also useful if you wish to use other transport protocols. If possible, you should always try UDP data transfers first, and move on to TCP once everything works.

TCP will usually work well if the traffic is not competing with other flows. If any segment in the data path carries other traffic, TCP will perform worse. This is discussed in more detail in section 5.
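For example, an initial UDP test can be run with iperf (see the tools appendix). The server address and the 900 Mbit/s target rate below are placeholders to adapt to your own setup:

```shell
# On the receiving host, start an iperf UDP server:
iperf -s -u

# On the sending host, generate UDP traffic towards it for 30 seconds
# at a target rate of 900 Mbit/s:
iperf -c <server address> -u -b 900M -t 30
```

The reported loss rate and jitter give a first indication of whether the path and end systems can sustain the target rate before TCP tuning begins.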

1 - HARDWARE

The first thing to make sure for a good data transfer is appropriate hardware. Here are some guidelines for hardware configurations for 1 and 10 Gbits/s.
  1. 1 Gbit/s network cards
    a) PCI 64-bit/66 MHz bus recommended (4 Gbit/s theoretical bus limit)
    b) Pay attention to shared buses on the motherboard (for example, the SuperMicro motherboard for Intel Xeon processors splits the PCI bus into 3 segments: PCI slots 1-3; PCI slot 4 plus the on-board SCSI controller, if one exists; and PCI slots 5-6).
  2. Intel 10 Gbit/s network cards
    a) PCI-X 133 MHz bus recommended (8.5 Gbit/s theoretical limit)
    b) Processor (and motherboard) with a 533 MHz front-side bus
    c) PCI slot configured as bus master (improved stability)
    d) The Intel 10 Gbit/s card should be alone on its bus segment for optimal performance
    e) The PCI burst size should be increased to 512K
    f) The card driver should be configured with the following parameters:
      i. Interrupt coalescence
      ii. Jumbo frames
      iii. Gigabit Ethernet flow control (it should also be active on the connecting switch)
      iv. Increased network card packet buffers (default 1024, maximum 4096)
      v. RX and TX checksum offload enabled
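Several of the driver parameters above can be set with standard Linux tools. The following is an illustrative sketch, not an authoritative recipe: the interface name eth1 is a placeholder, and exact option support varies by driver and ethtool version.

```shell
# Enable jumbo frames (9000-byte MTU) on the interface
/sbin/ifconfig eth1 mtu 9000

# Enable Ethernet flow control (pause frames) in both directions
/sbin/ethtool -A eth1 rx on tx on

# Increase the RX/TX descriptor ring buffers towards the maximum
/sbin/ethtool -G eth1 rx 4096 tx 4096

# Enable RX and TX checksum offload
/sbin/ethtool -K eth1 rx on tx on
```

Remember that flow control only helps if the connecting switch has it enabled as well, and that every device on the path must accept jumbo frames for a 9000-byte MTU to be usable end to end.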

2 - DRIVER ISSUES

interrupt coalescence settings

The interrupt coalescence (IC) feature available for the Intel PRO/1000 XT NIC (as well as many other NICs) can be set for receive (RxInt) and transmit (TxInt) interrupts. These values delay interrupts in units of 1.024 us. For the latest driver at the time of writing (5.2.20), the default value is 0, which means the host CPU is interrupted for each packet received. Interrupt reduction can improve CPU efficiency if properly tuned for the specific network traffic. As Ethernet frames arrive, the NIC places them in memory, but waits the RxInt time before generating an interrupt to indicate that one or more frames have been received. Increasing IC thus reduces the number of context switches made by the kernel to service the interrupts, but adds extra latency to frame reception.

When increasing the IC, there should be a sufficient number of descriptors in the ring buffers associated with the interface to hold the number of packets expected between consecutive interrupts.

As expected, increasing the IC settings increases latency; the difference in latency reflects the extra time packets spend in the NIC's memory before being processed by the kernel.

If TxInt is reduced to 0, throughput is significantly affected for all values of RxInt, due to increased PCI activity and insufficient CPU power to cope with the context switching on the sending PC.

If CPU power is important for your system (for example on a shared server machine), then it is recommended to use a high interrupt coalescence in order to moderate CPU usage. If the machine is going to be dedicated to a single transfer, then interrupt coalescence should be off.
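As a sketch of how interrupt coalescence can be adjusted (eth1 and the values are placeholders): newer drivers expose the settings through ethtool, while older Intel drivers take them as module parameters at load time.

```shell
# Delay RX/TX interrupt generation; ethtool expresses these in microseconds
/sbin/ethtool -C eth1 rx-usecs 64 tx-usecs 64

# Alternatively, for drivers configured at module load time
# (e1000 takes RxIntDelay/TxIntDelay in units of 1.024 us):
# /sbin/modprobe e1000 RxIntDelay=64 TxIntDelay=64
```

Whichever mechanism your driver supports, verify the active values afterwards (e.g. with "ethtool -c eth1") rather than assuming the request was accepted.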

NAPI

Since version 2.4.20 of the Linux kernel, the network subsystem has changed and is now called NAPI (for New API) [1]. The new API handles received packets per device rather than per packet: under load, the kernel polls the device for packets instead of servicing an interrupt for each one.

Although NAPI is compatible with the old system, and so with old drivers, you need to use a NAPI-aware driver to enable this improvement on your machine. Such a driver exists, for example, for the SysKonnect Gigabit card [LINK TO BE PROVIDED BY MATHIEU].

The NAPI network subsystem is a lot more efficient than the old system, especially in a high-performance context. Its main advantages are interrupt mitigation under high load (the kernel falls back to polling instead of taking one interrupt per packet) and early dropping of packets in the driver when the system cannot keep up, which avoids receive livelock [3].

One problem is that there is no parallelism on SMP machines for traffic coming in from a single interface, because a device is always handled by a single CPU.

3 - KERNEL CONFIGURATION

3.1 Interface transmit queue length (txqueuelen)

There are settings available to regulate the size of the queues between the kernel network subsystem and the network interface card driver. As with any queue, it is recommended to size it so that losses do not occur due to local buffer overflows. Careful tuning is therefore required to ensure that the queue sizes are optimal for your network connection.

These settings are especially important for TCP, as losses on local queues will cause TCP to fall into congestion control, which limits the TCP sending rate. For UDP, full queues will simply cause packet losses.

There are two queues to consider: txqueuelen, which controls the transmit queue size, and netdev_max_backlog, which determines the receive queue size.

The txqueuelen sets the length of the transmit queue of the device. It is useful to set this to small values for slower devices with high latency (modem links, ISDN) to prevent fast bulk transfers from disturbing interactive traffic such as telnet too much.

Users can manually set this queue size using the ifconfig command on the required device. Eg.

/sbin/ifconfig eth2 txqueuelen 2000

The default of 100 is inadequate for long-distance, high-throughput pipes. For example, on a network with an RTT of 120 ms running at Gigabit rates, a txqueuelen of at least 10000 is recommended.
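The figure of 10000 can be sanity-checked with a back-of-the-envelope calculation: at 1 Gbit/s and a 120 ms RTT, the number of full-size (1500-byte) Ethernet frames in flight is roughly rate * rtt / (8 * MTU):

```shell
#!/bin/sh
RATE=1000000000   # line rate in bit/s (1 Gbit/s)
RTT_MS=120        # round-trip time in milliseconds
MTU=1500          # standard Ethernet frame size in bytes
# packets in flight = (RATE/8 bytes per second * RTT in seconds) / MTU
PACKETS=$(( RATE / 8 * RTT_MS / 1000 / MTU ))
echo "packets in flight: $PACKETS"   # prints: packets in flight: 10000
```

The transmit queue should be able to hold on the order of this many packets, which matches the recommendation above.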

3.2 kernel receiver backlog

For the receiver side, there is a similar queue for incoming packets. This queue builds up when the interface receives packets faster than the kernel can process them. If it is too small (the default is 300), packets will be lost at the receiver, rather than on the network. This value can be set with:

/sbin/sysctl -w net.core.netdev_max_backlog=2000

3.3 TCP cache parameter (Yee)

Linux 2.4.x TCP caches network transfer statistics. The idea behind this is to improve TCP performance by not having to rediscover the optimal congestion avoidance settings (ssthresh) for every connection. However, on high-speed networks, or during periods of low congestion, a new TCP connection will use the cached values and can perform worse as a result.

In order to rectify this, one can flush all the tcp cache settings using the command:

/sbin/sysctl -w net.ipv4.route.flush=1

Note that this flushes all routes and is only temporary, i.e. the command must be run every time the cache is to be emptied.

3.4 SACKs and Nagle

SACKs (Selective Acknowledgments) are an optimisation to TCP which considerably improves performance in normal scenarios. In Gigabit networks with no competing traffic, however, they can have the opposite effect. To improve performance in that case they can be turned off with:

/sbin/sysctl -w net.ipv4.tcp_sack=0

The Nagle algorithm should, however, remain switched on. This is the default. You can check whether your program disables Nagle by looking for the TCP_NODELAY socket option; if it is set, consider removing it.

4 - SOCKET BUFFERS

TCP uses what it calls the "congestion window" or CWND [4] to determine how many packets can be sent at one time. The maximum congestion window is related to the amount of buffer space that the kernel allocates for each socket. For each socket, there is a default value for the buffer size, which can be changed by the program using a system library call just before opening the socket.

If the buffers are too small, like they are when default values are used, the TCP congestion window will never fully open up. If the buffers are too large, the sender can overrun the receiver, and the TCP window will shut down.

4.1 Socket buffers and bandwidth delay product

The optimal socket buffer size is twice the bandwidth * delay product of the link:

socket buffer size = 2 * bandwidth * delay

Estimating an approximate delay for the path is straightforward with a tool such as ping (see the tools appendix below). Getting an idea of the available bandwidth is more difficult. Once again, you shouldn't attempt to transfer Gigabits per second of data unless you have at least minimal control over all the links in the path. Otherwise, tools like pchar and pathchar can be used to get an idea of the bottleneck bandwidth on a path. Note that these tools are not very reliable, since estimating the available bandwidth on a path is still an open research issue.
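As a worked example of the formula above (the 1 Gbit/s bandwidth and 100 ms round-trip delay are illustrative values, not measurements):

```shell
#!/bin/sh
BANDWIDTH=1000000000   # assumed bottleneck bandwidth in bit/s (1 Gbit/s)
RTT_MS=100             # assumed round-trip delay in milliseconds (e.g. from ping)
# socket buffer size = 2 * bandwidth * delay, converted to bytes
BUFFER=$(( 2 * BANDWIDTH / 8 * RTT_MS / 1000 ))
echo "socket buffer size: $BUFFER bytes"   # prints: socket buffer size: 25000000 bytes
```

So on such a path each socket would need roughly a 25 MB buffer, far above the typical kernel defaults, which is why the settings in the following subsections matter.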

Note that you should set the socket buffer size to the same value on both sender and receiver. To change the socket buffer size with iperf you use the -w option.

When you are building an application, you use the appropriate "set socket option" system call, setsockopt(). Here is an example in C (other languages use a similar construct):

int sndsize = 2 * 1024 * 1024;  /* desired send buffer size in bytes */
err = setsockopt(socket_descriptor, SOL_SOCKET, SO_SNDBUF, (char *)&sndsize, sizeof(sndsize));

and in the receiver

int rcvsize = 2 * 1024 * 1024;  /* desired receive buffer size in bytes */
err = setsockopt(socket_descriptor, SOL_SOCKET, SO_RCVBUF, (char *)&rcvsize, sizeof(rcvsize));

To check the actual buffer size you can use the "get socket option" system call, getsockopt():

int sockbufsize = 0; socklen_t size = sizeof(sockbufsize);
err = getsockopt(socket_descriptor, SOL_SOCKET, SO_SNDBUF, (char *)&sockbufsize, &size);

4.2 Socket buffer memory limits: rmem and wmem (default and max)

If your program asks for more socket buffer memory than the kernel is configured to allow, it won't get it. You can adjust the maximum socket buffer memory with:

/sbin/sysctl -w net.core.rmem_max=VALUE

where VALUE should be large enough for your socket buffer size. You should also set the "write" value:

/sbin/sysctl -w net.core.wmem_max=VALUE

You may also need to raise the overall TCP memory limits (three values, in pages):

/sbin/sysctl -w net.ipv4.tcp_mem="MIN DEFAULT MAX"

This should be done on both the sender and the receiver.

4.3 autotuning in 2.4 kernels

If you don't set the socket buffer options in your program, Linux will attempt to auto-tune your TCP connection: it will allocate as little memory as possible for the connection while maintaining high performance. The bounds for this auto-tuning are set by:

/sbin/sysctl -w net.ipv4.tcp_rmem= MIN DEFAULT MAX

/sbin/sysctl -w net.ipv4.tcp_wmem= MIN DEFAULT MAX
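As an illustrative example (these particular numbers are common defaults with extra headroom, not authoritative recommendations), the following lets the kernel auto-tune each connection between 4 KB and 8 MB:

```shell
# min / default / max buffer sizes in bytes, for receive and send
/sbin/sysctl -w net.ipv4.tcp_rmem="4096 87380 8388608"
/sbin/sysctl -w net.ipv4.tcp_wmem="4096 65536 8388608"
```

The MAX value caps what auto-tuning can grow a connection to, so it should be at least as large as the socket buffer size computed from the bandwidth-delay product in section 4.1.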

5 - OTHER METHODS (Miguel)

There are some extra actions you can take to improve throughput in IP networks. Here are three of them:

5.1 Using large block sizes

Using large data block sizes improves performance. Applications frequently use 8 KB blocks; a value of 64 KB is a better choice.

5.2 Parallel streams

If possible, the application can use several TCP streams to transfer the data. This involves creating more than one socket and parallelising the data transfer among those sockets. iperf can be configured to do this with the -P option, e.g. for 4 parallel streams:

iperf -c <server address> -P 4

bbcp and GridFTP (see the tools appendix) are two file transfer tools that also allow the creation of parallel streams for data transfer. Be aware that performance is worse when disk access is involved; disks are frequently the bottleneck. A detailed analysis of disk performance is outside the scope of this document.

5.3 New TCP stack

Current TCP has been shown not to scale to high bandwidth-delay product networks. Several proposals have emerged to overcome this limitation; the main ones are High Speed TCP [5], Scalable TCP [6] and FAST [7]. Installing the appropriate stack in your Linux kernel can therefore improve your performance considerably. Implementations of HS-TCP and Scalable TCP can be found at the DataTAG site (http://www.datatag.org).

6 - NETWORK SUPPORT (Miguel)

The previous sections of this document dealt with end system configuration. They assume no control over the network equipment (routers, switches, etc.) in the data path. It is, however, recommended that supporting configuration (such as the Ethernet flow control and jumbo frame settings mentioned in section 1) is also put in place in the network. This may require complex coordination between all the domains in the data path.

A - Tools

ping - The standard tool to measure end-to-end delay. Available in every operating system

traceroute - Lists all routers between two hosts. Usually available by default in Linux

tcpdump - dumps TCP header information for a specified source/destination. Widely used and useful for network debugging.

pathchar - a tool to estimate the bandwidth available on each link in a given path (not very reliable). http://www.caida.org/tools/utilities/others/pathchar/

iperf - currently the most used tool for traffic generation and measurement of end-to-end TCP/UDP performance.

bbftp - File Transfer Software, http://doc.in2p3.fr/bbftp/

GridFTP - File Transfer Software, http://www.globus.org/datagrid/gridftp.html

B - References

[1] J.H. Salim, R. Olsson and A. Kuznetsov, "Beyond Softnet". In Proc. Linux 2.5 Kernel Developers Summit, San Jose, CA, USA, March 2001. Available at

[2] - LINK TO BE PROVIDED BY MATHIEU

[3] Jeffrey C. Mogul, K.K. Ramakrishnan, "Eliminating Receive Livelock in an Interrupt-driven Kernel", ACM Transactions on Computer Systems

[4] W. Richard Stevens, "TCP/IP Illustrated, Volume 1 - The Protocols", Addison-Wesley Professional Computing Series

[5] Sally Floyd, "HighSpeed TCP for Large Congestion Windows", IETF InternetDrafts, http://www.ietf.org/internet-drafts/draft-ietf-tsvwg-highspeed-01.txt

[6] Tom Kelly, "ScalableTCP: Improving Performance in HighSpeed Wide Area Networks", First International Workshop on Protocols for Fast Long-Distance Networks, February 2003

[7] Cheng Jin, David X. Wei and Steven Low, "FAST TCP for high-speed longdistance networks", http://netlab.caltech.edu/pub/papers/draft-jwl-tcp-fast-01.txt
