Tuesday, 17 January 2012

Network Tuning and Performance Guide (OpenBSD)

Many of today's desktop systems and servers come with on-board gigabit network controllers. After some simple speed tests you will soon find that you are not able to transfer data over the network much faster than you did with a 100 Mbit link. There are many factors which affect network performance, including hardware, operating systems and network stack options. The purpose of this page is to explain how you can achieve transfer rates of up to 930 megabits per second over a gigabit link using OpenBSD as a firewall or transparent bridge.

It is important to remember that you can not expect to reach gigabit speeds using slow hardware or an unoptimized firewall rule set. Speed and efficiency are key to our goal. Let's start with the most important aspect of the equation, hardware.

Hardware


No matter what operating system you choose, the machine you run on will determine the theoretical speed limit you can expect to achieve. When people talk about how fast a system is they always mention CPU clock speed. We would expect an AMD64 2.4GHz to run faster than a Pentium3 1.0 GHz, but CPU speed is not the key, motherboard bus speed is.

In terms of a firewall or bridge we are looking to move data through the system as fast as possible. This means we need to have a PCI bus that is able to move data quickly between network interfaces. To do this the machine must have a wide bus and high bus speed. CPU clock speed is a very minor part of the equation.

The quality of a network card is key to high throughput. As a very general rule, using the on-board network card is going to be much slower than an add-in PCI card. The reason is that most desktop motherboard manufacturers use cheap on-board network chip sets that use CPU processing time instead of handling TCP traffic by themselves. This leads to very slow network performance and high CPU load.

A gigabit network controller built on board using the CPU will slow the entire system down. More than likely the system will not even be able to sustain 100 Mbit speeds while also pegging the CPU at 100%. A network controller that is able to negotiate a gigabit link is _very_ different from a controller that can transfer a gigabit of data per second.

Ideally you want to use a server based add-on card with a TCP offload engine or TCP accelerator. We have seen very good speeds with the Intel Pro/1000 MT series cards, which use the em(4) driver. They are not too expensive and all OSs support them.

This is not to say that all on-board chip sets are bad. Supermicro uses an Intel 82546EB Gigabit Ethernet Controller on their server motherboards. It offers two(2) copper gigabit ports through a single chip set on a 133MHz, 64 bit wide PCI-X bus, pre-fetches up to 64 packet descriptors and has two 64 KB on-chip packet buffers. This is an exceptionally fast chip and it saves space by being built onto the server board.

Now, in order to move data in and out of the network cards as fast as possible we need a bus with a wide data path and a high clock speed. For example, a PCI-X 64bit slot is wider than a PCI-X 32bit slot, just as a 66MHz bus is faster than a 33MHz bus. Wide is good, fast is good, but wide and fast are better.

The equation to calculate the theoretical speed of a PCI or PCI-X slot is the following:
 (bus speed in MHz) * (bus width in bits) / 8 = speed in Megabytes/second
66 MHz * 32 bit / 8 = 264 Megabytes/second

For example, if we have a motherboard with a 32bit wide bus running at 66MHz then the theoretical max speed we can push data through the slot is 66*32/8= 264 Megabytes/second. With a server class board we could use a 64bit slot running at 133MHz and reach speeds of 133*64/8= 1064 Megabytes/second.

Now that you have the max speed of a single PCI slot, understand that this number represents the max speed of the bus if nothing else is using it. Since all PCI cards and on-board chips use the same bus, they must also be taken into account. If we have two network cards each using a 64bit, 133MHz slot then each slot will get to use 50% of the total speed of the PCI bus. Each card can do 133*64/8= 1064 Megabytes/second and if both network cards are being used at once, like on a firewall, then each card can use 1064/2= 532 Megabytes/second max. This is still well above the maximum speed of a gigabit connection which can move 1000/8= 125 Megabytes/second.
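The bus arithmetic above can be sketched in a few lines of Python. This is only an illustration of the formula on this page; the function names are our own and the even split between cards is the same simplifying assumption used in the text.

```python
def bus_bandwidth_mb_s(clock_mhz, width_bits):
    """Theoretical peak of a shared PCI/PCI-X bus in Megabytes/second."""
    return clock_mhz * width_bits / 8


def per_card_share_mb_s(clock_mhz, width_bits, active_cards):
    """Assume the bus is split evenly between cards busy at the same time."""
    return bus_bandwidth_mb_s(clock_mhz, width_bits) / active_cards


# A gigabit link moves 1000 megabits, or 125 Megabytes, per second.
GIGABIT_MB_S = 1000 / 8

desktop = bus_bandwidth_mb_s(66, 32)           # 32bit slot at 66MHz
server = bus_bandwidth_mb_s(133, 64)           # 64bit slot at 133MHz
firewall = per_card_share_mb_s(133, 64, 2)     # two cards active at once

print(desktop, server, firewall, firewall > GIGABIT_MB_S)
```

If the per-card share drops below 125 Megabytes/second, the bus and not the network card becomes the bottleneck.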

PCI Express is a newer technology which elevates bus bandwidth from hundreds of megabytes per second to many gigabytes per second. This allows a single machine to support multiple gigabit ports per interface card or even multiple 10 gigabit ports. The PCIe link is built around dedicated unidirectional couples of serial (1-bit), point-to-point connections known as lanes. This is in sharp contrast to the earlier PCI connection, which is a bus-based system where all the devices share the same bidirectional, 32-bit or 64-bit parallel bus. PCIe's dedicated lanes allow for an incredible increase in bandwidth.

Let's take a look at some of the new PCI Express (PCIe) interface speeds compared to the older PCI bus. These values were collected from the PCIe Wikipedia page. The PCIe speeds are listed with the encoding overhead already subtracted:
(type)        (bus speed) * (bus width)       / 8 = (speed in Megabytes/second)

PCI           66 MHz   * 32 bit            / 8 =   264 MB/s
PCIe v1 x1    2500 MHz * 1  1-bit lane     / 8 =   250 MB/s (after 20% overhead)
PCIe v2 x1    5000 MHz * 1  1-bit lane     / 8 =   500 MB/s (after 20% overhead)
PCIe v2 x2    5000 MHz * 2  1-bit lanes    / 8 =  1000 MB/s (after 20% overhead)
PCIe v2 x4    5000 MHz * 4  1-bit lanes    / 8 =  2000 MB/s (after 20% overhead)
PCIe v2 x8    5000 MHz * 8  1-bit lanes    / 8 =  4000 MB/s (after 20% overhead)
PCIe v2 x16   5000 MHz * 16 1-bit lanes    / 8 =  8000 MB/s (after 20% overhead)
PCIe v2 x32   5000 MHz * 32 1-bit lanes    / 8 = 16000 MB/s (after 20% overhead)
PCIe v3 x32   8000 MHz * 32 1-bit lanes    / 8 = 31508 MB/s (after 1.5% overhead)

We highly recommend getting an interface card supporting PCIe due to its high bandwidth and low power usage. Note that PCIe version 2.x has a 20% bandwidth overhead which PCIe version 3.x almost entirely removes. PCIe 2.0 delivers 5 GT/s (GT/s is Gigatransfers per second), but employs an 8b/10b encoding scheme which results in a 20 percent overhead on the raw bit rate. PCIe 3.0 delivers 8 GT/s and replaces 8b/10b with a technique called "scrambling" in which "a known binary polynomial" is applied to a data stream in a feedback topology. Because the scrambling polynomial is known, the data can be recovered by running it through a feedback topology using the inverse polynomial. PCIe 3.0 combines scrambling with a 128b/130b encoding scheme, reducing the overhead to approximately 1.5%, as opposed to the 20% overhead of the 8b/10b encoding used by PCIe 2.0.

Look at the specifications of the motherboard you expect to use and the above equation to get a rough idea of the speeds you can expect out of the box. Hardware speed is the key to a fast firewall. Before setting up your new system and possibly wasting hours wondering why it is not reaching your speed goals, make sure you understand the limitations of the hardware. Do not expect throughput out of your system hardware that it is _not_ capable of.

For example, when using a four port network card on a machine, consider the bandwidth of the adapter slot you put it into. Standard PCI is a 32 bit wide interface and the bus speed is 33MHz or 66MHz. This bandwidth is shared across all devices on the same bus. PCIe v1 is a serial connection with a 2.5 GHz frequency in both directions for a 1x slot. The effective maximum bandwidth is 250MB/s in each direction. So, if you decide to support four(4) 1Gbps connections on one card it might be best to do it with a PCIe v2 8x or faster slot and card.
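To see how the encoding scheme changes the usable bandwidth, here is a small Python sketch. It assumes the publicly documented transfer rates of 2.5, 5 and 8 GT/s per lane for PCIe 1.x, 2.x and 3.x; the table and function names are our own and only illustrate the arithmetic.

```python
# Per generation: (megatransfers per second per lane, payload bits, bits on the wire)
PCIE_GENERATIONS = {
    1: (2500, 8, 10),      # 8b/10b encoding, 20% overhead
    2: (5000, 8, 10),      # 8b/10b encoding, 20% overhead
    3: (8000, 128, 130),   # 128b/130b encoding, roughly 1.5% overhead
}


def pcie_bandwidth_mb_s(generation, lanes):
    """Usable bandwidth in one direction, in Megabytes/second."""
    rate_mt_s, payload_bits, wire_bits = PCIE_GENERATIONS[generation]
    return rate_mt_s * lanes * payload_bits / (wire_bits * 8)


for generation, lanes in [(1, 1), (2, 1), (2, 8), (3, 8)]:
    speed = pcie_bandwidth_mb_s(generation, lanes)
    print("PCIe v%d x%d: %.0f MB/s" % (generation, lanes, speed))
```

A single gigabit port needs 125 Megabytes/second in each direction, so even a PCIe v1 x1 slot has headroom for one port, while a four port card is better served by a wider or newer slot.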

How much ram do I need for a firewall?


For a standard OpenBSD firewall one(1) gigabyte of ram is more than enough. In fact, unless you are running many memory hungry services you will actually use less than 100 megabytes of ram at any one time. On our testing system we had eight(8) gig available, but OpenBSD will only recognize 3.1 gig of that no matter if you use the i386 or AMD64 kernel. One of the few times you may need more ram is if your firewall is going to load tables in Pf with tens of thousands of entries. These days ram is cheap, but there is no need to put four(4) to eight(8) gigabytes in the machine as it will only go to waste. In fact, having too much RAM in your box will COST you memory, as more kernel memory is used up tracking all your RAM. So cutting your ram to 2 GB will probably improve the upper limit.

What kind of hardware would you recommend for a firewall ?


Keep in mind that if you are looking for a small home firewall any old hardware will do. You do not have to spend a lot of money to get a very decent firewall these days. Something like an Intel Core 2 Duo or an AMD Athlon and DDR ram would work fine. If you are in a pinch even an Intel P3 will have more than enough bandwidth for a small home office of 5 to 10 people. Old hardware that is stable is a great solution.

If you are looking to use more modern hardware and want detailed specifics, here is the setup which we use in the cluster. It is extremely fast, practically silent and incredibly power efficient. This box is able to sustain full gigabit speeds (~102MB/sec data throughput) bi-directionally as well as run other software like packet sniffers and analysis programs. It is a quad core box running at 2.4GHz and uses DDR3 ram at 1333MHz. All the parts are very power efficient and the entire system uses only fifty six (56) watts at idle and sixty two (62) watts at full load. On average the CPU and motherboard run at eleven(11) degrees Celsius over ambient room temperature.
Processor    : AMD Athlon II X4 610e Propus 2.4GHz 45watt
CPU Cooler : Zalman 9500A-LED 92mm 2 Ball CPU Cooler (fan off)
Motherboard : Asus M4A89GTD Pro/USB3 AM3 AMD 890GX
Memory : Kingston 4GB DDR3 KVR1333D3N9K2/4G
Hard Drive : Western Digital Caviar Green WD30EZRX
Power Supply : Antec Green 380 watts EA-380D
Network Card : Intel PRO/1000 GT PCI PWLA8391GT (two cards)
Case : Antec LanBoy Air (completely fan-less)

You can reduce the power consumption of your firewall and keep track of system temperatures by using Power Management with apmd and Sensorsd hardware monitor (sensorsd.conf).

Can we achieve higher transfer rates with a Maximum Transmission Unit (MTU) of 9000 ?


It is sometimes recommended to set the MTU of your network interface above the default value of 1500. Users of "jumbo frames" can set the MTU as high as 9000 if all of their network equipment supports "Jumbo Frames." The MTU value tells the network card the largest payload, in bytes, it may send in a single Ethernet frame. While this may be useful when connecting two hosts directly together using the same MTU, it is a lot less useful when connecting through a switch or network which does not support a larger MTU.

When a switch or a machine receives a packet that is larger than the MTU it is able to forward, it must fragment the packet. This takes time and is very inefficient. The throughput you may gain when connecting to similar high MTU machines you will lose when connecting to any 1500 MTU machine.

Either way, increasing the MTU _may_ not be necessary depending on your situation. Two(2) gigabits per second can be attained using a 10Gbit card at the normal 1500 byte MTU setting with the network tweaks listed on this page. Understand that an MTU of 9000 would significantly reduce the network overhead of a TCP connection compared to an MTU of 1500, but we can still sustain high transfer rates. If you need transfer speeds over 2 gigabits per second then you will definitely need to look at setting your MTU to 9000. Take a look at the section on this page titled, "Can we achieve 10 gigabit speeds using OpenBSD or FreeBSD ?" for details.
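As a rough illustration of the per-packet overhead at each MTU, the following Python sketch estimates how much of the wire carries TCP payload. It assumes standard Ethernet framing (8 byte preamble, 14 byte header, 4 byte checksum, 12 byte inter-frame gap) and IPv4 and TCP headers of 20 bytes each with no options; the names are our own, and real traffic with TCP timestamps or VLAN tags will come out slightly lower.

```python
ETHERNET_OVERHEAD = 8 + 14 + 4 + 12   # preamble, header, checksum, inter-frame gap
IP_TCP_HEADERS = 20 + 20              # IPv4 + TCP, assuming no options


def tcp_efficiency(mtu):
    """Fraction of the bytes on the wire that are TCP payload."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETHERNET_OVERHEAD)


def tcp_goodput_mbit_s(link_mbit_s, mtu):
    """Best case TCP payload rate for a given link speed and MTU."""
    return link_mbit_s * tcp_efficiency(mtu)


for mtu in (1500, 9000):
    print("MTU %d: %.1f%% payload, %.0f Mbit/s on a gigabit link"
          % (mtu, 100 * tcp_efficiency(mtu), tcp_goodput_mbit_s(1000, mtu)))
```

The bigger win from jumbo frames is usually the lower packet rate: at the same throughput an MTU of 9000 means roughly one sixth as many packets, and therefore fewer interrupts, for the firewall to handle.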

What TTL (Time to live) should we use ?


Time to live is the limit on the period of time or number of iterations a packet can experience before it should be discarded. Note that the TTL you use is for the one way trip to the remote machine. That remote machine will then have its own TTL set when it returns packets to you. A packet's TTL is reduced by one for every router it goes through. Once the TTL reaches zero(0) the packet is discarded no matter where it is in the network.

Understand that most modern OSs like OpenBSD, FreeBSD, Ubuntu and RHEL set the default TTL at 64 hops. This should be plenty to reach through the Internet to the other side of the world. If you use traceroute and give an ip, traceroute will show you how many hops it takes to reach your destination. For example, we can go from the eastern United States to a university in Asia in 23 hops. If our TTL was set to 64 then the packet would still have had 41 more hops it could have used before it was dropped.

Let's do a quick test by seeing how many hops (routers) we need to go through to get to the other side of the world. We are located in the north eastern United States. The other side of the globe is located in the ocean just west of Geraldton, Australia. If we do an icmp based traceroute to the tourist board at geraldtontourist.com.au (202.191.55.184) it is only 16 hops away. BTW, according to Geoip "202.191.55.184" might be located in Sydney, Australia, but we are definitely on the same continent. So, possibly five(5) more hops to Geraldton on the west coast.
# traceroute -I 202.191.55.184

traceroute to 202.191.55.184 (202.191.55.184), 64 hops max, 60 byte packets
1 L100.BLTMMD-VFTTP-16.verizon-gni.net (71.166.35.1) 6.72 ms 4.721 ms 4.947 ms
2 G11-0-0-316.BLTMMD-LCR-03.verizon-gni.net (130.81.49.8) 7.505 ms 7.282 ms 7.485 ms
3 so-9-2-0-0.LCC1-RES-BB-RTR1-RE1.verizon-gni.net (130.81.28.80) 9.933 ms 9.809 ms 9.975 ms
4 0.ae1.BR2.IAD8.ALTER.NET (152.63.34.21) 49.952 ms 9.865 ms 9.961 ms
5 ae6.edge1.washingtondc4.level3.net (4.68.62.133) 12.630 ms 0.xe-0-0-0.XL3.IAD8.ALTER.NET (152.63.32.214) 12.479 ms ae7.edge1.washingtondc4.level3.net (4.68.62.137) 24.850 ms
6 GigabitEthernet4-0-0.GW8.IAD8.ALTER.NET (152.63.33.93) 9.829 ms GigabitEthernet6-0-0.GW8.IAD8.ALTER.NET (152.63.33.13) 14.865 ms GigabitEthernet4-0-0.GW8.IAD8.ALTER.NET (152.63.33.93) 12.363 ms
7 ae-84-84.ebr4.Washington1.Level3.net (4.69.134.185) 14.751 ms ae-94-94.ebr4.Washington1.Level3.net (4.69.134.189) 12.356 ms 17.356 ms
8 ae-4-4.ebr3.LosAngeles1.Level3.net (4.69.132.81) 87.438 ms 82.472 ms ge-7-0-0.lax22.ip4.tinet.net (89.149.185.222) 92.395 ms
9 singtel-gw.ip4.tinet.net (77.67.79.14) 84.833 ms 84.900 ms *
10 203.208.148.18 (203.208.148.18) 301.232 ms 229.793 ms 232.466 ms
11 * * *
12 203.208.148.18 (203.208.148.18) 228.924 ms 229.758 ms *
13 * * 119.225.2.166 (119.225.2.166) 241.682 ms
14 * 203-22-107-13.ico.com.au (203.22.107.13) 236.695 ms *
15 119.225.2.166 (119.225.2.166) 236.767 ms 237.288 ms 202.191.55.202 (202.191.55.202) 239.965 ms
16 202.191.55.184 (202.191.55.184) 239.966 ms 203-22-107-13.ico.com.au (203.22.107.13) 237.264 ms 234.789 ms

Can you set the TTL higher? Yes, the highest value is 255 since the TTL is an 8 bit field in the IP header. This is normally considered too high for any network. A TTL of 64 should be fine for most instances.

Can we achieve 10 gigabit speeds using OpenBSD or FreeBSD ?


Yes. In fact, with the right hardware and a little knowledge we can achieve over 9.2 gigabits per second with simultaneous bi-directional transfers through a Pf firewall. Understand that there are some limitations you need to be aware of, not with the transfer speeds, but with the choice of hardware and operating system.

10g firewall hardware

The critical parts of any firewall are going to be the network card, the motherboard bus bandwidth and the memory speed of the machine. No matter how good your OS is, if you can not actually move the data through the hardware you will never be able to reach 10 gigabit speeds. The list below is exactly the hardware we tested with for the FreeBSD firewall and both Linux machines. Notice that this is _not_ the fastest or most expensive, bleeding edge Intel Core i7 Nehalem CPU or hardware. A firewall does not need to be exotic to be fast. What we have is a 2U server which uses 65 watts of power at idle and 80 watts at full load (measured with a Kill-A-Watt) and it can support 10G speeds. The network card is a dual port 10g fiber card in a PCI Express x8 motherboard slot. The memory runs at 1333MHz using DDR3 ECC ram. Also, the CPU, motherboard and OS support the Advanced Encryption Standard (AES) Instruction Set, or AES-NI, for hardware accelerated AES encryption and decryption in the CPU in case you decide to set up a VPN.
Processor    : Intel Xeon L5630 Westmere 2.13GHz 12MB L3 Cache LGA 1366 40 Watt Quad-Core
Motherboard : Supermicro X8ST3-F
Chassis : SuperChassis 825TQ-R700LPV 2U rackmount (Revision K)
Memory : KVR1333D3E9SK2/4G 4GB 1333MHz DDR3 ECC CL9 DIMM (Kit of 2) w/ Thermal Sensor
Hard Drive : Western Digital Black Caviar 2TB (SATA)
Network Card : Myricom Myri-10G "Gen2" 10G-PCIE2-8B2-2S (PCI Express x8)
Transceiver : Myricom Myri-10G SFP+ 10GBase-SR optical fiber transceiver (850nm wavelength)

The Operating System

We prefer to use OpenBSD due to the newer version of Pf and CARP. The problem is OpenBSD does not have a wide range of 10g card drivers available. The newer, higher performance cards which achieve full line speeds, low system load and are widely available are just not supported by OpenBSD at this time, but FreeBSD does offer support. If you want to stick with OpenBSD please take a look at the Intel X520-SR2 Dual Port 10GbE Adapter, which worked fine in our tests, but it was hard to find a seller.

FreeBSD (latest stable or -current) has support for many of the newest 10g fiber and copper based cards and many vendors openly support the OS. FreeBSD also has Pf, though using the older OpenBSD 4.1 rules syntax, and supports CARP and ALTQ. This is the OS we decided to use since we could also use the Myricom Myri-10G "Gen2" optical 10g cards which perform at full 10g speeds bi-directionally. Myricom supports the FreeBSD OS and its newest firmware drivers are included with the basic system install. The Myri10GE FreeBSD driver is named mxge, and has been integrated in FreeBSD since 6.3.

The latest version of FreeBSD is a great OS out of the box. These are the very minimal modifications and configuration changes we made to the system to get it up to 10g speeds.
### FreeBSD (latest stable) 10 gigabit configuration
##

### /boot/loader.conf
##
autoboot_delay="3" # reduce boot menu delay from 10 to 3 seconds
if_mxge_load="YES" # load the Myri10GE kernel module on boot
loader_logo="beastie" # old FreeBSD logo menu
net.inet.tcp.syncache.hashsize=1024 # syncache hash size
net.inet.tcp.syncache.bucketlimit=100 # syncache bucket limit
net.inet.tcp.tcbhashsize=4096 # tcb hash size
net.isr.bindthreads=0 # do not bind threads to CPUs
net.isr.direct=1 # interrupt handling via multiple CPU
net.isr.direct_force=1 # "
net.isr.maxthreads=3 # Max number of threads for NIC IRQ balancing (4 cores in box)
vm.kmem_size=1G # physical memory available for kernel (320Mb by default)

### /etc/sysctl.conf
##
kern.ipc.maxsockbuf=16777216 # kernel socket buffer space
kern.ipc.nmbclusters=262144 # kernel mbuf space raised 275MB of kernel dedicated ram
kern.ipc.somaxconn=32768 # size of the listen queue for accepting new TCP connections
kern.ipc.maxsockets=204800 # increase the limit of the open sockets
kern.randompid=348 # randomized processes id's
net.inet.icmp.icmplim=50 # reply to no more than 50 ICMP packets per sec
net.inet.ip.process_options=0 # do not process any IP options in the IP headers
net.inet.ip.redirect=0 # do not allow ip header redirects
net.inet.ip.rtexpire=2 # route cache expire in two seconds
net.inet.ip.rtminexpire=2 # "
net.inet.ip.rtmaxcache=256 # route cache entries increased
net.inet.icmp.drop_redirect=1 # drop icmp redirects
net.inet.tcp.blackhole=2 # drop any TCP packets to closed ports
net.inet.tcp.delayed_ack=0 # no need to delay ACK's
net.inet.tcp.drop_synfin=1 # drop TCP packets which have SYN and FIN set
net.inet.tcp.msl=7500 # close lost tcp connections in 7.5 seconds (default 30)
net.inet.tcp.nolocaltimewait=1 # do not create TIME_WAIT state for localhost
net.inet.tcp.path_mtu_discovery=0 # disable MTU path discovery
net.inet.tcp.recvbuf_max=16777216 # TCP receive buffer space
net.inet.tcp.recvspace=8192 # decrease buffers for incoming data
net.inet.tcp.sendbuf_max=16777216 # TCP send buffer space
net.inet.tcp.sendspace=16384 # decrease buffers for outgoing data
net.inet.udp.blackhole=1 # drop any UDP packets to closed ports
security.bsd.see_other_uids=0 # keeps users segregated to their own processes list
security.bsd.see_other_gids=0 # "

### /etc/rc.conf
##
## disable sendmail
sendmail_enable="NO"
sendmail_submit_enable="NO"
sendmail_outbound_enable="NO"
sendmail_msp_queue_enable="NO"

## enable daemons
openntpd_enable="YES"
postfix_enable="YES"
sshd_enable="YES"
syslogd_flags="-ss"

## Configure network with max MTU of 9000 without Large Receive Offload (LRO)
ifconfig_mxge0="inet 10.10.10.1 netmask 255.255.255.0 mtu 9000"
ifconfig_mxge1="inet 172.16.16.1 netmask 255.255.255.0 mtu 9000"

## enable pf firewall support
pf_enable="YES" # Enable PF (load module if required)
pf_rules="/etc/pf.conf" # rules definition file for pf
pf_flags="" # additional flags for pfctl start up
pflog_enable="YES" # start pflogd(8)
pflog_logfile="/var/log/pflog" # pflogd logfile location
pflog_flags="" # additional flags for pflogd start up

10g bidirectional network speed test #1

To test, we set up the FreeBSD firewall in the middle of two Ubuntu Linux servers. Pf is enabled on the firewall and its only rule is "pass all keep state". All three machines are using the exact same hardware as stated above. The testing tool "iperf" was configured to do a bidirectional test. A connection was made from Ubuntu Linux #1 through the firewall to Ubuntu Linux #2. Simultaneously, another connection was made from Ubuntu Linux #2 through the firewall to Ubuntu Linux #1. The result was an average speed of 1.15 gigabytes per second (GB/s) in each direction simultaneously. An impressive result.
ubuntu linux #1   <->   BSD firewall  <->   ubuntu linux #2
10.10.10.100 10.10.10.1 - 172.16.16.1 172.16.16.100
Flow 1 -> <- Flow 2

box1$ iperf -c box2 -i 1 -t 60 -d
box2$ iperf -s
[flow 1] 0.0-30.0 sec 32.7 GBytes 9.35 Gbits/sec
[flow 2] 0.0-30.0 sec 31.8 GBytes 9.12 Gbits/sec

Average Speed: 9.2 Gbits/sec or 1.15 gigabytes per second (GB/s) in each direction simultaneously.

10g unidirectional network speed test #2

Netperf is also an excellent testing tool. With this test we set up the machine as we would for a public firewall. The FreeBSD box in firewall mode with NAT, scrubbing and tcp sequence number randomization enabled can still get 9.892 gigabits per second from one Linux box to the other. Most importantly, at an MTU of 9000 (jumbo packets) we can achieve 8,294,531 packets over sixty seconds _through_ the NAT'ed firewall at 9.922 gigabits per second. When the MTU is limited to 1500 (standard MTU for most of the Internet) we hit almost 10 million packets over sixty seconds (9,666,777 packets or 161,112 pps) and 1.9333 gigabits per second. Notice that the FreeBSD machine is using 12.5% of the 4 cores for interrupt processing during these tests and the rest of the machine is sitting 86.4% idle.
## Interrupts during the Netperf tests average 12%-14%
CPU: 0.0% user, 0.0% nice, 1.0% system, 12.5% interrupt, 86.4% idle

##### TCP stream test at an MTU of 8972 (~9000)
:~# netperf -H 10.10.10.100 -t TCP_STREAM -C -c -l 60
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB

87380 65536 65536 60.01 9892.83 7.12 5.60 0.472 0.371

##### UDP stream test at an MTU of 8972 (~9000)
:~# netperf -H 172.16.16.100 -t UDP_STREAM -l 60 -C -c -- -m 8972 -s 128K -S 128K
Socket Message Elapsed Messages CPU Service
Size Size Time Okay Errors Throughput Util Demand
bytes bytes secs # # 10^6bits/sec % SS us/KB

262144 8972 60.00 8294531 0 9922.4 11.14 inf

##### UDP stream test at an MTU of 1500
:~# netperf -H 172.16.16.100 -t UDP_STREAM -l 60 -C -c -- -m 1500 -s 128K -S 128K
Socket Message Elapsed Messages CPU Service
Size Size Time Okay Errors Throughput Util Demand
bytes bytes secs # # 10^6bits/sec % SS us/KB

262144 1500 60.00 9666777 0 1933.3 6.74 inf
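
The throughput column in the UDP results can be cross-checked from the message counts. The short Python sketch below only redoes that arithmetic with the numbers from the two UDP runs above; the function names are our own.

```python
def udp_throughput_mbit_s(messages, message_bytes, seconds):
    """Payload throughput in 10^6 bits/sec, the unit netperf reports."""
    return messages * message_bytes * 8 / seconds / 1e6


def packets_per_second(messages, seconds):
    return messages / seconds


# (messages okay, message size in bytes, elapsed seconds) from the netperf output
jumbo = udp_throughput_mbit_s(8294531, 8972, 60.0)
standard = udp_throughput_mbit_s(9666777, 1500, 60.0)

print("MTU 9000 run: %.1f Mbit/s" % jumbo)
print("MTU 1500 run: %.1f Mbit/s at %d pps"
      % (standard, packets_per_second(9666777, 60.0)))
```

Both runs push a similar number of packets per second, which suggests that the 1500 byte test is limited by packet rate rather than by raw bandwidth.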

10g Summary information

Other important items to note about the firewall:

  • Doing a TCP SYN attack, the firewall can make an average of 4000 new states per second at 25% CPU interrupt utilization on a quad core machine. This _may_ be a limit induced by the single CPU core used.

  • Currently, you can not use ALTQ to support your 10 gigabit interfaces. The parent bandwidth value in ALTQ is a 32bit int and thus can not support values over 2^32 or 4294Mb (4.29Gb). We have reported this "bug" to the developers.
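
The 32bit limit in the last item is easy to check. The Python sketch below assumes the bandwidth is stored in bits per second in an unsigned 32bit integer; the wrap-around shown is only an illustration of what truncation to 32 bits would do and is not taken from the ALTQ source.

```python
UINT32_MAX = 2**32 - 1   # largest value an unsigned 32bit int can hold


def fits_in_uint32(bits_per_second):
    return 0 <= bits_per_second <= UINT32_MAX


def truncated_to_uint32(bits_per_second):
    """What is left if the value is silently cut down to 32 bits."""
    return bits_per_second & UINT32_MAX


one_gigabit = 1000000000
ten_gigabit = 10000000000

print(UINT32_MAX // 10**6, "Mbit/s is the largest representable bandwidth")
print(fits_in_uint32(one_gigabit), fits_in_uint32(ten_gigabit))
print(truncated_to_uint32(ten_gigabit))
```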


If you need to support a 10 gigabit network and have an external connection which can also support 10g or even 40 gigabit then FreeBSD with the right hardware will do perfectly.

When trying to attain maximum throughput, the most important options involve TCP window sizes and send/receive space buffers.
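
The reason window and buffer sizes matter so much is the bandwidth-delay product: a sender can only have one window of unacknowledged data in flight per round trip. The Python sketch below shows that relationship. The function names are our own and the 240 ms round trip is simply the latency seen in the traceroute to Australia earlier on this page.

```python
def bandwidth_delay_product_bytes(link_mbit_s, rtt_ms):
    """Window needed to keep a link full at a given round trip time."""
    return link_mbit_s * 1e6 / 8 * rtt_ms / 1e3


def max_throughput_mbit_s(window_bytes, rtt_ms):
    """Upper bound on a single TCP stream: one window per round trip."""
    return window_bytes * 8 / (rtt_ms / 1e3) / 1e6


# Window needed to fill a gigabit link on a 1 ms LAN and a 240 ms long haul path
print(bandwidth_delay_product_bytes(1000, 1))
print(bandwidth_delay_product_bytes(1000, 240))

# A 65535 byte window (the maximum without window scaling) over 240 ms
print(max_throughput_mbit_s(65535, 240))
```

On a LAN a few hundred kilobytes of buffer is enough for gigabit speeds, while long distance paths need window scaling and much larger buffers to get anywhere near line rate.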

Should we use the OpenBSD GENERIC or GENERIC.MP kernel?


As of OpenBSD v5.0 you are welcome to use either one. Both kernels performed exceptionally well in our speed tests. GENERIC is the single CPU kernel while GENERIC.MP is the multi CPU kernel.

Despite the recent development of multiple processor support in OpenBSD, the kernel still operates as if it were running on a single processor system. On an SMP system only one processor is able to run the kernel at any point in time, a semantic which is enforced by a Big Giant Lock. The Big Giant Lock (BGL) works like a token. If the kernel is being run on one CPU then that CPU has the BGL and thus the kernel can _not_ be run on a second CPU. The network stack, and thus pf and pfsync, run in the kernel and so under the Big Giant Lock.

If you have access to a multi core machine and are expecting to use programs that will take advantage of the cores, for example an intrusion detection app, monitoring script or real time network reporting tool, then the multi core kernel is a good choice. PF is _not_ a multi core program so it will not benefit from the multi core kernel. Truthfully, if you have multiple cores then use them.

OpenBSD v5.0 and later network stack "speed_tweaks"


These tweaks are for OpenBSD v5.0 and later. The network stack in 5.0 and later will dynamically adjust the TCP send and receive window sizes. A lot of work has been done to remove many of the bottlenecks in the network code and in how Pf handles traffic compared to earlier releases.

We tested TCP bandwidth rates with iperf and found 5.0 to be quite good with rfc1323 enabled and using the built in dynamic TCP window sizing. The default send and receive space for UDP was fine for connections up to 25Mbit/sec sending and receiving on the OpenBSD box. This means that if you have a 25/25 FIOS Internet connection you do NOT have to change anything. But, for testing we wanted to see what size buffer was necessary for a 100 Mbit/sec network flooded with UDP traffic. We increased the net.inet.udp.recvspace and net.inet.udp.sendspace values to 128KByte (131072 bytes). With these buffers iperf was able to support speeds of 200Mbit/sec without packet loss. This is an excellent trade-off of just 128KByte for a nicely sized buffer which a 100Mbit network would not overflow.

NOTE: It is very important to remember to use "keep state" or "modulate state" on every single one of your pf rules. OpenBSD 5.0 and later use Dynamic Adjustment of TCP Window Sizes. If your rules do not keep state and pass the initial SYN packet from the client to the server, the window size can not be negotiated. This means your network speeds will be very, very slow, in the hundreds of kilobytes per second instead of tens of megabytes per second. Check out our PF Config (pf.conf) page for more detailed information.
### Calomel.org  OpenBSD v5.0 and later /etc/sysctl.conf
##
ddb.panic=0 # do not enter ddb console on kernel panic, reboot if possible
kern.bufcachepercent=90 # Allow the kernel to use up to 90% of the RAM for cache (default 10%)
machdep.allowaperture=2 # Access the X Window System (if you use X on the system)
net.inet.ip.forwarding=1 # Permit forwarding (routing) of packets through the firewall
net.inet.ip.ifq.maxlen=512 # Maximum allowed input queue length (256*number of physical interfaces)
net.inet.ip.mtudisc=0 # TCP MTU (Maximum Transmission Unit) discovery off since our mss is small enough
net.inet.tcp.mssdflt=1472 # maximum segment size (1472 from scrub pf.conf match statement)
#net.inet.udp.recvspace=131072 # Increase UDP "receive" buffer size. Good for 200Mbit without packet drop.
#net.inet.udp.sendspace=131072 # Increase UDP "send" buffer size. Good for 200Mbit without packet drop.

OpenBSD v4.8 and earlier network stack "speed_tweaks"


First, make sure you are running OpenBSD v4.8 or earlier. These settings will significantly increase the network transfer rates of the machine.

Second, make sure you have applied any patches to the system according to the OpenBSD page. We have a patch guide if you need it, Patching OpenBSD kernel and packages.

The following options are put in the /etc/sysctl.conf file. They will increase the network buffer sizes and allow TCP window scaling. Understand that these settings are at the upper extreme. We found them perfectly suited to a production environment which can saturate a gigabit link. You may not need to set each of the values this high, but that is up to your environment and testing methods. A summary explanation follows each option on the same line.
### Calomel.org  OpenBSD v4.8 and earlier /etc/sysctl.conf
##
ddb.panic=0 # do not enter ddb console on kernel panic, reboot if possible
kern.bufcachepercent=90 # Allow the kernel to use up to 90% of the RAM for cache (default 10%)
kern.maxclusters=128000 # Cluster allocation limit
machdep.allowaperture=2 # Access the X Window System
machdep.kbdreset=1 # permit console CTRL-ALT-DEL to do a nice halt
net.bpf.bufsize=1048576 # Internal kernel buffer for storing captured packets received from the network
net.inet.icmp.errppslimit=1000 # Maximum number of outgoing ICMP error messages per second
net.inet.icmp.rediraccept=0 # Deny icmp redirects
net.inet.ip.forwarding=1 # Permit forwarding (routing) of packets
net.inet.ip.ifq.maxlen=512 # Maximum allowed input queue length (256*number of interfaces)
net.inet.ip.mtudisc=0 # TCP MTU (Maximum Transmission Unit) discovery off since our mss is small enough
net.inet.ip.ttl=64 # the TTL should match what we have for "min-ttl" in scrub rule in pf.conf
net.inet.ipcomp.enable=1 # IP Payload Compression protocol (IPComp) reduces the size of IP datagrams
net.inet.tcp.ackonpush=0 # if set to 1, acks for packets with the push bit set are not delayed
net.inet.tcp.ecn=0 # Explicit Congestion Notification disabled
net.inet.tcp.mssdflt=1472 # maximum segment size (1472 from scrub pf.conf match statement)
net.inet.tcp.recvspace=262144 # Increase TCP "receive" windows size to increase performance
net.inet.tcp.rfc1323=1 # RFC1323 enable optional TCP protocol features (window scale and time stamps)
net.inet.tcp.rfc3390=1 # RFC3390 increasing TCP's Initial Congestion Window
net.inet.tcp.sack=1 # TCP Selective ACK (SACK) Packet Recovery
net.inet.tcp.sendspace=262144 # Increase TCP "send" windows size to increase performance
net.inet.udp.recvspace=262144 # Increase UDP "receive" windows size to increase performance
net.inet.udp.sendspace=262144 # Increase UDP "send" windows size to increase performance
vm.swapencrypt.enable=1 # encrypt pages that go to swap

### CARP options if needed
# net.inet.carp.arpbalance=0 # CARP load-balance
# net.inet.carp.log=2 # Log CARP state changes
# net.inet.carp.preempt=1 # Enable CARP interfaces to preempt each other (0 -> 1)
# net.inet.ip.forwarding=1 # Enable packet forwarding through the firewall (0 -> 1)

You can apply each of these settings manually by using sysctl on the command line. For example, "sysctl kern.maxclusters=128000" will set the kern.maxclusters variable until the machine is rebooted. By setting the variables manually you can test each of them to see if they will help your machine.
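As a rough sketch of that one-at-a-time approach, the hypothetical helper below (the name sysctl_commands is ours, not part of OpenBSD) reads a sysctl.conf-style file in the OpenBSD name=value format and prints the matching sysctl command for each active line. It only prints; nothing changes until you run one of the printed commands yourself as root.

```shell
#!/bin/sh
# Hypothetical helper: turn active "name=value" lines from a sysctl.conf-style
# file (read on stdin) into "sysctl name=value" commands. Comments, blank
# lines and trailing whitespace are dropped. Nothing is applied.
sysctl_commands() {
    sed -e 's/#.*$//' \
        -e 's/[[:space:]]*$//' \
        -e '/^$/d' \
        -e 's/^[[:space:]]*/sysctl /'
}

# Demonstration with two lines taken from the listing above.
sysctl_commands <<'EOF'
### example input
kern.maxclusters=128000 # Cluster allocation limit
net.inet.tcp.recvspace=262144 # Increase TCP "receive" windows size
EOF
# prints:
#   sysctl kern.maxclusters=128000
#   sysctl net.inet.tcp.recvspace=262144
```

To try it against the real file, run "sysctl_commands < /etc/sysctl.conf", then apply one printed command at a time and repeat your benchmark after each change.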
For more information about OpenBSD's PF firewall and HFSC quality of service options, check out our PF Config (pf.conf) and PF quality of service HFSC "how to's".

Testing and verifying network speeds


Continuing with OpenBSD v5.0, a lot of work has been done on the single and multi-core kernels focused on speed and efficiency improvements. Since many OpenBSD machines will be used as a firewall or bridge, we wanted to see what speeds we could expect when passing traffic through the machine. Let's take a look at the single and multi-core kernels, the effect of having PF enabled or disabled, and the effect of our "speed tweaks" listed in the section above.

The testing hardware


To do our testing we use the latest patches applied to the latest distribution. Our test setup consists of two (2) identical boxes, each containing an Intel Core 2 Quad (Q9300), 8 GB of RAM and an Intel PRO/1000 MT (CAT5e copper) network card. The cards were put in a 64-bit PCI-X slot running at 133 MHz. The boxes are connected to each other through an Extreme Networks Summit X450a-48t gigabit switch using 12-foot unshielded CAT6 cables.
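For context on why the slot matters, the theoretical peak of a parallel PCI or PCI-X bus is simply its width in bits multiplied by its clock in MHz. The small sketch below (the function name is our own) does that arithmetic; real-world throughput is lower because of protocol overhead and because the bus may be shared with other devices.

```shell
#!/bin/sh
# Theoretical peak of a parallel bus in Mbits/sec: width (bits) x clock (MHz).
bus_peak_mbits() {
    echo $(( $1 * $2 ))
}

bus_peak_mbits 64 133   # 64-bit PCI-X at 133 MHz       -> prints 8512
bus_peak_mbits 32 33    # classic 32-bit PCI at 33 MHz  -> prints 1056
```

At roughly 8.5 Gbits/sec the PCI-X slot used here leaves plenty of headroom for a gigabit card, while plain 32-bit/33 MHz PCI peaks barely above gigabit line rate before any overhead is counted.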

The testing software


The following iperf options were used on the machines we will call test0 and test1. We sustain a full speed transfer for 30 seconds and take the average speed in Mbits/sec as the result. Iperf is available through the OpenBSD repositories using "pkg_add iperf".
## iperf listening server
root@test1: iperf -s

## iperf sending client
root@test0: iperf -i 1 -t 30 -c test1

The PF rules


The following minimal PF rules were used whenever PF was enabled (pf=YES):
# pfctl -sr
scrub in all fragment reassemble
pass in all flags S/SA keep state
block drop in on ! lo0 proto tcp from any to any port = 6000


Test 1: No Speed Tweaks. Using the GENERIC and GENERIC.MP kernels (patched -stable) with the default TCP window sizes we are able to sustain over 300 Mbits/sec (about 37 Megabytes/sec). Since the link is gigabit (1000 Mbits/sec maximum) we are using less than 40% of our network line speed.

bsd.single_processor_patched
pf=YES
speed_tweaks=NO
[ 1] 0.0-30.0 sec 1.10 GBytes 315 Mbits/sec

bsd.single_processor_patched
pf=NO
speed_tweaks=NO
[ 1] 0.0-30.0 sec 1.24 GBytes 356 Mbits/sec

bsd.multi_processor_patched
pf=YES
speed_tweaks=NO
[ 4] 0.0-30.2 sec 1.13 GBytes 321 Mbits/sec

bsd.multi_processor_patched
pf=NO
speed_tweaks=NO
[ 4] 0.0-30.0 sec 1.28 GBytes 368 Mbits/sec

According to the results, network utilization was quite poor. We pushed data across the network at well under half of its capacity: the best run reached 368 Mbits/sec of a possible 1000 Mbits/sec, or about 37%. On a home network behind a cable modem or FIOS connection you will not notice. But what if you have access to a gigabit or 10 gigabit network?
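The utilization figure can be reproduced from any iperf summary line. A minimal sketch, assuming the summary format shown in the results (the rate followed by the literal field Mbits/sec) and a hypothetical helper name, iperf_utilization:

```shell
#!/bin/sh
# Read iperf summary lines on stdin and print the rate, its share of the
# link speed (Mbits/sec, default 1000) and the equivalent megabytes/sec.
iperf_utilization() {
    awk -v link="${1:-1000}" '{
        for (i = 2; i <= NF; i++)
            if ($i == "Mbits/sec")
                printf "%s Mbits/sec = %.1f%% of link, %.1f MBytes/sec\n", \
                    $(i-1), $(i-1) * 100 / link, $(i-1) / 8
    }'
}

echo "[ 4] 0.0-30.0 sec 1.28 GBytes 368 Mbits/sec" | iperf_utilization 1000
# prints: 368 Mbits/sec = 36.8% of link, 46.0 MBytes/sec
```

Piping the client output straight in, for example "iperf -i 1 -t 30 -c test1 | iperf_utilization 1000", prints one such line for every interval report as well as the final average.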

Test 2: Calomel.org Speed Tweaks. Using the GENERIC and GENERIC.MP kernels (patched -stable) we are able to sustain around 800 Mbits/sec, roughly 2.2 to 2.7 times the default speeds depending on the configuration.

bsd.single_processor_patched
pf=YES
speed_tweaks=YES
[ 1] 0.0-30.0 sec 2.95 GBytes 845 Mbits/sec

bsd.single_processor_patched
pf=NO
speed_tweaks=YES
[ 1] 0.0-30.0 sec 3.25 GBytes 868 Mbits/sec

bsd.multi_processor_patched
pf=YES
speed_tweaks=YES
[ 4] 0.0-30.0 sec 2.69 GBytes 772 Mbits/sec

bsd.multi_processor_patched
pf=NO
speed_tweaks=YES
[ 4] 0.0-30.2 sec 2.82 GBytes 803 Mbits/sec

These results are much better. We are utilizing roughly 77% to 87% of a gigabit link, which works out to about 96 to 108 megabytes per second sustained. Both the single-processor and multi-processor kernels performed efficiently, and enabling PF cost only about 3 to 4 percent of throughput.

Why do these "speed tweaks" work? What is the theory?


The dominant protocol used on the Internet today is TCP, a "reliable" "window-based" protocol. The best possible network performance is achieved when the network pipe between the sender and the receiver is kept full of data. Take a look at the excellent study done at the Pittsburgh Supercomputing Center titled, "Enabling High Performance Data Transfers". They cover bandwidth delay products (BDP), buffers, maximum TCP buffer (memory) space, socket buffer sizes, TCP large window extensions (RFC1323), TCP selective acknowledgments option (SACK, RFC2018) and path MTU theory.
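To make the bandwidth delay product concrete: the BDP is the link speed multiplied by the round trip time, and it is the amount of data that must be in flight to keep the pipe full. The sketch below uses our own helper names and an assumed round trip time of 2 ms (2000 microseconds) purely for illustration; measure your own path with ping. Because bits per microsecond equal Mbits/sec, the arithmetic stays in whole numbers.

```shell
#!/bin/sh
# Bandwidth delay product in bytes:
# link speed (Mbits/sec) x round trip time (microseconds) / 8.
bdp_bytes() {
    echo $(( $1 * $2 / 8 ))
}

# Upper bound on single-connection throughput in Mbits/sec for a given
# window (bytes) and round trip time (microseconds), rounded down.
window_limit_mbits() {
    echo $(( $1 * 8 / $2 ))
}

bdp_bytes 1000 2000             # gigabit link, assumed 2 ms RTT -> prints 250000
window_limit_mbits 262144 2000  # 262144-byte window             -> prints 1048
window_limit_mbits 16384 2000   # illustrative 16 KB window      -> prints 65
```

Under that assumed round trip time, a 262144-byte send and receive space is just large enough to fill a gigabit link, whereas a much smaller window caps a single TCP stream far below line rate no matter how fast the hardware is.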
Your firewall is one of the most important machines on the network. Keep the system time up to date with OpenNTPD "how to" (ntpd.conf), monitor your hardware with S.M.A.R.T. - Monitoring hard drive health and keep track of any changed files with a custom Intrusion Detection (IDS) using mtree. If you need to verify a hard drive for bad sectors check out Badblocks hard drive validation/wipe.

Other Operating System Software


The next few sections are dedicated to operating systems other than OpenBSD. Each OS has some way to increase the overall throughput of the system. Just scroll to the OS you are most interested in.

RedHat or CentOS Linux network stack


### Calomel.org  RedHat or CentOS Linux  /etc/sysctl.conf
##
# Some of the defaults may be different for your kernel. Load this file with
# "sysctl -p". These are just suggested values that worked well to
# increase throughput in several network benchmark tests.

### IPV4 specific settings
# turns TCP timestamp support off, default 1, reduces CPU use
net.ipv4.tcp_timestamps = 0
# turn SACK support on -- you probably want this off for 10GigE
net.ipv4.tcp_sack = 1
# scaling support
net.ipv4.tcp_window_scaling=1
# on systems with a VERY fast bus to memory interface this is the big plus
# sets min/default/max TCP read buffer, default 4096 87380 174760
# the value used below is 1000000 bytes (about 1 MB); long, high-latency paths may need larger buffers
net.ipv4.tcp_rmem = 1000000 1000000 1000000
# sets min/default/max TCP write buffer, default 4096 16384 131072
net.ipv4.tcp_wmem = 1000000 1000000 1000000
# sets min/pressure/max TCP buffer space (measured in memory pages, not bytes), default 31744 32256 32768
net.ipv4.tcp_mem = 150000000 150000000 150000000

### CORE settings (for socket and UDP effect)
# maximum receive socket buffer size, default 131071
net.core.rmem_max = 1000000
# maximum send socket buffer size, default 131071
net.core.wmem_max = 1000000
# default receive socket buffer size, default 65535
net.core.rmem_default = 2524287
# default send socket buffer size, default 65535
net.core.wmem_default = 2524287
# maximum amount of option memory buffers, default 10240
net.core.optmem_max = 2524287
# number of unprocessed input packets before kernel starts dropping them, default 300
net.core.netdev_max_backlog = 300000
# enable window scaling RFC1323 TCP window scaling
net.ipv4.tcp_window_scaling=1

Suse or openSUSE Linux network stack


### Calomel.org  Suse or openSUSE Linux  /etc/sysctl.conf
##
# Some of the defaults may be different for your kernel. Load this file with
# "sysctl -p". These are just suggested values that worked well to
# increase throughput in several network benchmark tests.

# packet reordering in a network can be interpreted as packet loss
# and increasing the value of this parameter should improve performance
net.ipv4.tcp_reordering = 20
# Sets the Maximum Socket Send Buffer for TCP Protocol
net.ipv4.tcp_wmem = 8192 87380 16777216
# Sets the Maximum Socket Receive Buffer for TCP Protocol
net.ipv4.tcp_rmem = 8192 87380 16777216
# 1 = do not cache TCP metrics from closed connections for reuse by new connections to the same host
net.ipv4.tcp_no_metrics_save = 1
# You can set this to one of the many available high speed congestion variants like "cubic" or "highspeed" (HS-TCP)
net.ipv4.tcp_congestion_control = cubic
# sets the Maximum Socket Send Buffer for all protocols
net.core.wmem_max = 16777216
# Sets the Maximum Socket Receive Buffer for all protocols
net.core.rmem_max = 16777216

Windows XP/2000 Server/Server 2003 network stack


Edit the registry using "regedit" and look for the following section:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters

Now add the following values:

  • Add a registry DWORD named TcpWindowSize with a value of 131400 (click on 'decimal').

  • Add a registry DWORD named Tcp1323Opts with a value of 3. This will enable rfc1323 scaling and timestamps.

  • Add a registry DWORD named ForwardBufferMemory with a value of 80000. This increases the memory set aside for buffering forwarded (routed) packets.

  • Add a registry DWORD named NumForwardPackets with a value of 60000. Increase buffer for forwarded packets.


One last note for Windows XP users: when you install Service Pack 2 (SP2), make sure to disable "Internet Connection Sharing". It is a major network slow down, and disabling it should fix that performance problem. Also make sure you turn off or remove QoS in the TCP/IP Network settings.

Questions?


How can I find performance bottlenecks and display real time statistics about the firewall hardware?
On OpenBSD (and other BSD-based systems) run the command "systat vmstat" to get a top-like display of memory totals, paging amounts, swap numbers, interrupts per second and much more. Systat is incredibly useful for determining where the performance bottleneck is on a machine.
