1 Gbit/s network cards have been available for some time now and 10 Gbit/s cards have recently become available. However, achieving data rates of the order of Gigabits per second is not straightforward: it requires careful tuning of several components in the end systems. This document aims to provide information on how to configure end systems to achieve Gbit/s data transfers.
Before you begin with the configuration, MAKE SURE YOU HAVE PERMISSION TO TRANSFER DATA ON THE ENTIRE NETWORK. Gbit/s transfers can cause real damage to a network and are not (yet) appropriate for production environments.
Although this document is mainly about achieving data transfers using TCP, some parts are also useful if you wish to use other transport protocols. If possible, you should always try UDP data transfers first and, when everything works as expected, move on to TCP.
TCP will usually work well if the traffic is not competing with other flows. If any of the segments in the data path carries other traffic, TCP will perform worse. This is discussed in more detail in section 5.
When increasing the IC there should be a sufficient number of descriptors in the ring buffers associated with the interface to hold the number of packets expected between consecutive interrupts.
As expected, the latency increases with the IC settings: the difference in latency reflects the increased length of time packets spend in the NIC's memory before being processed by the kernel.
If TxInt is reduced to 0, throughput is significantly affected for all values of RxInt, due to increased PCI activity and insufficient CPU power to cope with the context switching in the sending PC.
If CPU power is important for your system (for example a shared server machine), then it is recommended to use a high interrupt coalescence setting in order to moderate CPU usage. If the machine is going to be dedicated to a single transfer, then interrupt coalescence should be switched off.
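Depending on the driver, the interrupt coalescence parameters and the ring-buffer sizes can be inspected and changed with the ethtool utility. The commands below are only a sketch: the interface name (eth0), the parameter names and the values are illustrative, not every driver supports every option, and the path to ethtool may differ on your distribution.

/sbin/ethtool -c eth0
/sbin/ethtool -C eth0 rx-usecs 100 tx-usecs 100
/sbin/ethtool -g eth0
/sbin/ethtool -G eth0 rx 4096 tx 4096

The first pair of commands shows and sets the coalescence timers; the second pair shows the current and maximum ring-buffer sizes and enlarges the receive and transmit rings.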
Although NAPI is compatible with the old system and therefore with old drivers, you need to use a NAPI-aware driver to enable this improvement on your machine. Such a driver exists, for example, for the SysKonnect Gigabit card [LINK TO BE PROVIDED BY MATHIEU].
The NAPI network subsystem is a lot more efficient than the old system, especially in a high performance context. The pros are:
These settings are especially important for TCP, as losses on local queues will cause TCP to fall into congestion control, which limits its sending rate. With UDP, full queues simply cause packet loss.
There are two queues to consider: txqueuelen, which controls the transmit queue size, and netdev_max_backlog, which determines the receive queue size.
txqueuelen sets the length of the transmit queue of the device. It is useful to set this to small values for slower devices with a high latency (modem links, ISDN) to prevent fast bulk transfers from disturbing interactive traffic such as telnet too much.
Users can manually set this queue size using the ifconfig command on the required device, e.g.:
/sbin/ifconfig eth2 txqueuelen 2000
The default of 100 is inadequate for long-distance, high-throughput pipes. For example, on a network with an RTT of 120 ms running at Gigabit rates, a txqueuelen of at least 10000 is recommended.
The receive queue size, netdev_max_backlog, can be increased in a similar way:

/sbin/sysctl -w net.core.netdev_max_backlog=2000
In order to rectify this, one can flush all the cached TCP settings using the command:
/sbin/sysctl -w net.ipv4.route.flush=1
Note that this flushes all routes and is only temporary, i.e. the command must be run every time the cache is to be emptied.
/sbin/sysctl -w net.ipv4.tcp_sack=0
The Nagle algorithm, however, should be turned on (this is the default). You can check whether your program has Nagle switched off by looking for a call that sets the TCP_NODELAY socket option. If it does, comment that call out.
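For reference, Nagle is switched off in C with a call of the following form (a minimal sketch; TCP_NODELAY is declared in netinet/tcp.h and the variable names are illustrative). If you find such a call in your application, remove it:

int flag = 1; err = setsockopt(socket_descriptor, IPPROTO_TCP, TCP_NODELAY, (char *)&flag, sizeof(flag));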
If the buffers are too small, as they are when the default values are used, the TCP congestion window will never fully open up. If the buffers are too large, the sender can overrun the receiver and the TCP window will shut down.
socket buffer size = 2 * bandwidth * delay
Estimating an approximate delay for the path is straightforward with a tool such as ping (see the tools section below). Getting an idea of the available bandwidth is more difficult. Once again, you shouldn't attempt to transfer Gigabits per second of data unless you have at least minimal control over all the links in the path. Failing that, tools like pchar and pathchar can be used to estimate the bottleneck bandwidth of a path. Note that these tools are not very reliable, since estimating the available bandwidth on a path is still an open research issue.
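As a rough worked example (assuming the delay in the formula is the round-trip time measured with ping): on a 1 Gbit/s path with a 100 ms RTT, 1 Gbit/s corresponds to 125 Mbyte/s, so the formula suggests a socket buffer of 2 * 125 Mbyte/s * 0.1 s = 25 Mbytes.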
Note that you should set the socket buffer size to the same value on both sender and receiver. To change the socket buffer size with iperf, use the -w option.
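For example, to run a memory-to-memory test with 8 Mbyte socket buffers (the buffer size and host name below are only illustrative), start iperf in server mode on the receiver:

iperf -s -w 8M

and run the client on the sender:

iperf -c <receiver_host> -w 8M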
When you are building an application, you use the appropriate "set socket option" system call. Here is an example using C (other languages use a similar construction):
int socket_descriptor; int sndsize; int err;  /* sndsize holds the desired buffer size in bytes */
err = setsockopt(socket_descriptor, SOL_SOCKET, SO_SNDBUF, (char *)&sndsize, sizeof(sndsize));
and in the receiver
int socket_descriptor; int rcvsize; int err;  /* rcvsize holds the desired buffer size in bytes */
err = setsockopt(socket_descriptor, SOL_SOCKET, SO_RCVBUF, (char *)&rcvsize, sizeof(rcvsize));
To check what the buffer size is, you can use the "get socket option" system call:
int sockbufsize = 0; socklen_t size = sizeof(sockbufsize);
err = getsockopt(socket_descriptor, SOL_SOCKET, SO_SNDBUF, (char *)&sockbufsize, &size);
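Putting these calls together, a minimal self-contained C sketch could look like the following (the 4 Mbyte buffer size is only an illustrative value). Note that the Linux kernel doubles the value passed to setsockopt to allow space for bookkeeping overhead, so getsockopt normally reports twice the requested size.

#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    int sndsize = 4 * 1024 * 1024;   /* requested send buffer size in bytes */
    int sockbufsize = 0;
    socklen_t size = sizeof(sockbufsize);
    int sd = socket(AF_INET, SOCK_STREAM, 0);   /* TCP socket */

    /* request the larger buffer before connecting */
    setsockopt(sd, SOL_SOCKET, SO_SNDBUF, (char *)&sndsize, sizeof(sndsize));

    /* read back the size the kernel actually allocated */
    getsockopt(sd, SOL_SOCKET, SO_SNDBUF, (char *)&sockbufsize, &size);
    printf("send buffer size: %d bytes\n", sockbufsize);
    return 0;
}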
/sbin/sysctl -w net.core.rmem_max=VALUE

where VALUE should be large enough for your socket buffer size. You should also set the "write" value:

/sbin/sysctl -w net.core.wmem_max=VALUE
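As an illustration (the value is an assumption chosen to match an 8 Mbyte socket buffer, not a tuned recommendation):

/sbin/sysctl -w net.core.rmem_max=8388608
/sbin/sysctl -w net.core.wmem_max=8388608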
/sbin/sysctl -w net.ipv4.tcp_mem="MIN DEFAULT MAX"
This should be done both in the sender and the receiver.
/sbin/sysctl -w net.ipv4.tcp_rmem="MIN DEFAULT MAX"
/sbin/sysctl -w net.ipv4.tcp_wmem="MIN DEFAULT MAX"
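Purely as an illustration of the syntax (the numbers are assumptions, not tuned recommendations), buffers of up to 8 Mbytes could be configured with:

/sbin/sysctl -w net.ipv4.tcp_rmem="4096 87380 8388608"
/sbin/sysctl -w net.ipv4.tcp_wmem="4096 65536 8388608"

For tcp_rmem and tcp_wmem the three values are the minimum, default and maximum buffer size in bytes; tcp_mem, in contrast, is expressed in memory pages.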
bbcp and GridFTP (see the Tools appendix) are two file transfer tools that also allow the creation of parallel streams for data transfer. Be aware that when disk access is involved, performance is usually worse: disks are frequently the bottleneck. A detailed analysis of disk performance is outside the scope of this document.
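Before resorting to parallel file transfers it can be worth checking, with a memory-to-memory test, whether multiple streams actually help on your path. iperf can open several parallel TCP streams with the -P option (the stream count and buffer size below are only illustrative):

iperf -c <receiver_host> -w 4M -P 4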
traceroute - Lists all routers between two hosts. Usually available by default on Linux.
tcpdump - Dumps all TCP header information for a specified source/destination. Widely used and useful for network debugging.
pathchar - A tool to estimate the bandwidth of all the links in a given path (not very reliable). http://www.caida.org/tools/utilities/others/pathchar/
iperf - Currently the most widely used tool for traffic generation and measurement of end-to-end TCP/UDP performance.
bbftp - File Transfer Software, http://doc.in2p3.fr/bbftp/
GridFTP - File Transfer Software, http://www.globus.org/datagrid/gridftp.html
[2] LINK TO BE PROVIDED BY MATHIEU
[3] Jeffrey C. Mogul, K.K. Ramakrishnan, "Eliminating Receive Livelock in an Interrupt-driven Kernel", ACM Transactions on Computer Systems
[4] W. Richard Stevens, "TCP/IP Illustrated, Volume 1 - The Protocols", Addison-Wesley Professional Computing Series
[5] Sally Floyd, "HighSpeed TCP for Large Congestion Windows", IETF Internet-Draft, http://www.ietf.org/internet-drafts/draft-ietf-tsvwg-highspeed-01.txt
[6] Tom Kelly, "Scalable TCP: Improving Performance in Highspeed Wide Area Networks", First International Workshop on Protocols for Fast Long-Distance Networks, February 2003
[7] Cheng Jin, David X. Wei and Steven Low, "FAST TCP for High-Speed Long-Distance Networks", http://netlab.caltech.edu/pub/papers/draft-jwl-tcp-fast-01.txt