Keeping FreeBSD TCP performance in the midst of a highly buffered connection

21-Jun-14

I was perplexed recently when I began an rsync job to a Raspberry Pi server. I know exactly what limits the bandwidth of this connection: it is the CPU (or network) on the Raspberry Pi, which cannot accept data fast enough.

So, even though my server is on a 1 Gbit/s interface, and the Raspberry Pi is on a 100 Mbit/s interface, the transfer rate is ~ 10 Mbit/s. Fair enough.

But what really perplexed me is that the presence of this rsync connection severely limited other connections, notably Samba. The Simpsons episode playing in the living room had noticeably stuttering audio.

So, I began to investigate. The same low rate occurred with iperf. It seemed a little better from my basement computer than from the living room machine. Here is an iperf run from the basement to the FreeBSD server:


C:\Users\Poojan\Downloads\iperf-2.0.5-2-win32>iperf -c server
------------------------------------------------------------
Client connecting to server, TCP port 5001
TCP window size: 64.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.20 port 64155 connected with 192.168.1.8 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.1 sec 25.1 MBytes 21.0 Mbits/sec

Without rsync running, it would be around 670 Mbit/s.

I started playing around with buffers. Curiously, reducing sendbuf_max helped:


Poojan@server ~ >sudo sysctl net.inet.tcp.sendbuf_max
net.inet.tcp.sendbuf_max: 262144
Poojan@server ~ >sudo sysctl net.inet.tcp.sendbuf_max=65536
net.inet.tcp.sendbuf_max: 262144 -> 65536

Which yielded:

C:\Users\Poojan\Downloads\iperf-2.0.5-2-win32>iperf -c server
------------------------------------------------------------
Client connecting to server, TCP port 5001
TCP window size: 64.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.20 port 64171 connected with 192.168.1.8 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 788 MBytes 661 Mbits/sec

I posited that maybe there is some overall limit on buffer memory, and rsync was stealing all of it, so making the buffers smaller left more available to iperf. I went hunting for this limit.

I tried doubling kern.ipc.maxsockbuf:


Poojan@server ~ >sudo sysctl -w kern.ipc.maxsockbuf=524288
kern.ipc.maxsockbuf: 262144 -> 524288

which yielded:

C:\Users\Poojan\Downloads\iperf-2.0.5-2-win32>iperf -c server
------------------------------------------------------------
Client connecting to server, TCP port 5001
TCP window size: 64.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.20 port 64216 connected with 192.168.1.8 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 25.0 MBytes 20.9 Mbits/sec

No luck. Note: I realized that the above was with jumbo frames enabled on both the server and the client. I disabled jumbo frames on the client.
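
(For what it's worth, on the FreeBSD side jumbo frames are just a matter of the interface MTU; checking and resetting it looks something like this, where em0 is a placeholder for the server's actual NIC:)

ifconfig em0 | grep mtu
sudo ifconfig em0 mtu 1500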

I then did a netstat -m, just in case:


1470/5175/6645 mbufs in use (current/cache/total)
271/2635/2906/10485760 mbuf clusters in use (current/cache/total/max)
271/2635 mbuf+clusters out of packet secondary zone in use (current/cache)
85/335/420/762208 4k (page size) jumbo clusters in use (current/cache/total/max)
1041/361/1402/225839 9k jumbo clusters in use (current/cache/total/max)
0/0/0/127034 16k jumbo clusters in use (current/cache/total/max)
10618K/11152K/21771K bytes allocated to network (current/cache/total)
1106/2171/531 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
361/1345/0 requests for jumbo clusters denied (4k/9k/16k)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile

This didn’t really show any indication that buffers were being over-subscribed, at least not during the tests.

But now, with a sendbuf_max of 262144 and a maxsockbuf of 524288, my iperf reading went down:


C:\Users\Poojan\Downloads\iperf-2.0.5-2-win32>iperf -c server
------------------------------------------------------------
Client connecting to server, TCP port 5001
TCP window size: 64.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.20 port 64438 connected with 192.168.1.8 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.7 sec 2.88 MBytes 2.25 Mbits/sec

From reading this summary of FreeBSD buffers, it seems that kern.ipc.maxsockbuf operates at a different level than net.inet.tcp.sendbuf_max. And, in fact, having both of them large hurts performance. So, maybe this is just pure buffer bloat.
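
In hindsight, a quick way to spot that kind of queuing would have been the Send-Q column of netstat, which I didn't capture at the time; something like:

netstat -an -p tcp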

But then I realized that my better results came when the send buffer was less than 64k. So, I disabled RFC 1323 (which enables window scaling, i.e. TCP windows larger than 64k, in addition to timestamps). And voila!
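
(I didn't paste the command above; the knob is the rfc1323 sysctl, set roughly like this, with the same line in /etc/sysctl.conf to make it persistent:)

sudo sysctl net.inet.tcp.rfc1323=0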


C:\Users\Poojan\Downloads\iperf-2.0.5-2-win32>iperf -c server
------------------------------------------------------------
Client connecting to server, TCP port 5001
TCP window size: 64.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.20 port 65203 connected with 192.168.1.8 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 798 MBytes 669 Mbits/sec


worrbase – CrashPlan on FreeBSD 9.0, A HOWTO

15-Jun-14

I’m blogging this as a reminder to myself of what to do. I found it years ago, but that link doesn’t seem to exist anymore. Anyway, this is a good write-up for FreeBSD 9.0:

worrbase – CrashPlan on FreeBSD 9.0, A HOWTO.

The only difference between this and what I did is that I use the Oracle JRE 1.8 port (/usr/ports/java/linux-oracle-jre18/), so my JAVACOMMON looks like:

JAVACOMMON=/usr/local/linux-oracle-jre1.8.0/bin/java
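
The JRE itself came from ports in the usual way (Oracle's license may require fetching the distfile by hand first):

cd /usr/ports/java/linux-oracle-jre18 && sudo make install clean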


NFSv4 ACL history

28-May-14

Good summary of NFSv4 ACLs (and their history):

Implementing Native NFSv4 ACLs in Linux (by Greg Banks at SGI).


Crucial m500 120GB as a ZFS slog

12-Apr-14

I re-did my benchmarks of my ZFS ZIL using a Crucial m500. The main reason I got this drive is that it has power-loss protection. And its speeds are adequate to keep up with my gigabit Ethernet Samba server.

Here are the results. My first run was decent, but not stellar:

I didn’t do a zpool iostat during this run. This roughly matches what I was getting for 4k random writes. My second run was much better:

And it matches a typical zpool iostat:

Overall, I’m pretty happy. I was hoping to get at least 125 MB/s, a rate my gigabit Ethernet can’t really saturate anyway.

I can now even try turning on forced sync writes in Samba.
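
By forced sync writes I mean settings along these lines in smb.conf (a sketch; sync always forces an fsync after every write, and strict sync makes Samba honor the client's flush requests):

[global]
    strict sync = yes
    sync always = yes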


Crucial m500 120GB benchmarks

09-Apr-14

The latest in my obsession with SSDs. I jumped on a $70 deal at NewEgg. I bought it to use as an SLOG (ZIL) in my ZFS server, because of its write speeds and because of its power-loss protection.

I immediately updated the firmware from MU03 to MU05.

This may have been with a SATA II (3 Gbps) cable:

Wha? The 4K random write is quite low. Let’s repeat:

A little better at 76 MB/s for random 4k writes, but not the 120 MB/s that I expected.

Let’s try the native SATA III port and a shorter cable:

Wow! That sequential read is fast. But the random write (QD=1) is still disappointing. What gives?

Post Script

Looks like someone else gets similar results, which don’t line up with what TweakTown measured.

Oh, well. The power-loss protection is important, and what really matters is how the drive performs as the ZIL. Hopefully, I’ll see closer to 140 MB/s than 65 MB/s.

Update 2014-04-09

Tried re-running with Intel’s RST driver:

Slightly better, but still no 120 MB/s. I have a feeling that something changed with the latest firmware. I wish I had run this before I upgraded.


More ZIL/SLOG comparisons

05-Apr-14

Suspicious of my previous tests using iozone, I wrote a small program (that follows) to measure synchronous write speed. I tested this with and without an SLOG device, and then used a really fast Plextor M5 Pro 256GB SSD as the SLOG. I wanted to make sure that something could achieve the fast SLOG speed that I hoped for. (Remember I am doing this in the interest of spending money on a new SSD.)
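
Swapping an SLOG in and out of a pool for tests like this is just a zpool add/remove pair; something like the following, where the pool name and GPT label are placeholders:

zpool add tank log gpt/plextor-slog
zpool remove tank gpt/plextor-slog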

Summary

An SLOG device can degrade performance by forcing synchronous writes to wait until they are committed to the SLOG. However, it cannot enhance performance: the maximum (steady-state) performance is determined by the bandwidth to the main storage.
More…


SLOG tests on a 32GB Kingston SSD with iozone

02-Apr-14

This is a follow-up to my previous post. I ran iozone both with and without the -e flag. This flag includes a sync/flush in the speed calculations. That flush should tell ZFS to commit the in-flight data to disk, thus flushing it to the SLOG device (or to the on-disk ZIL).

I ran four tests: one without the -e flag and one with it; I then repeated the pair with the SLOG device removed.
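
For reference, an invocation of this kind of test looks something like the following (file and record sizes are illustrative; -e is the flag in question and -i 0 selects the write/rewrite test):

iozone -e -s 512m -r 128k -i 0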

Here are the results: More…


ZFS slog (ZIL) considerations

01-Apr-14

Running List of Alternatives

2015-08-11

It looks, from this AnandTech article, like the m550 does not have as strong a power-loss protection as I originally thought. It protects the “lower pages” (least-significant bits) from corruption between page writes. But it does not protect the DRAM buffer from losing data.

2014-07-07

The Transcend MTS600 line is capable of 310 MB/s sequential writes. It is ideal as an SLOG, as it is also available in a 32GB size. The downside is that it uses the M.2 (NGFF) format. There is also an MTS400 line, which is only capable of 160 MB/s, and an MTS800 line, which has the same 310 MB/s performance as the MTS600.

2014-05-10

The Seagate PRO models have power-loss protection, and they seem to do very well on sequential reads and writes, better than the Crucial m500.

The new ADATA SP920 is essentially the same as the Crucial m550 (except that its NAND configuration is like the Crucial m500’s); all of these have power-loss protection. The ADATA performs considerably better than the (older-generation) Crucial m500.

And of course, the Crucial m550 itself is faster than the ADATA SP920, due to its more parallel NAND configuration.

Update on 2014-05-09

This rundown of SSDs indicates that the power-loss protection must last seconds, which invalidates the use of a capacitor cable to do this. Also, it’s not just the data in transit; it’s the mapping tables that need to be protected. (In fact, it seems that you can brick your SSD if you cut power to it, and that might explain why one of my SSDs doesn’t work.)

Original Post

My latest obsession is what drive is best for a SLOG (dedicated ZIL device).

I plan on testing tonight whether sequential writes or random writes are more important. This DDRDrive marketing propaganda states that (at least for multiple producers) random IOPS are more important. But I’m very unlikely to see multiple simultaneous synchronous writers to my ZFS pool. (Maybe a ports build or something, but that’s unlikely.)

I have numerous options. But let’s start with the power-loss protection. I found this cable online, which has a 2.2 mF capacitor on each of the 5V and 12V lines. For those of you who don’t know, that’s really big as far as capacitors go. This should let the SLOG device maintain a charge for quite a bit longer during a power failure.

For example, the SanDisk Extreme II 120GB lists 3.4 W as its power consumption during writes. If I assume that all of that comes from the 5.0 V rail (which is the conservative calculation), I get 0.68 A (i.e. 680 mA) as the current draw during writes.

If I then assume that a +/- 10% voltage fluctuation is all that can be tolerated, the capacitor will maintain a voltage for:

ΔT = C × ΔV / I = 2.2 mF × 0.5 V / 680 mA ≈ 1.6 ms

So, the cable maintains a voltage for 1.6 ms. This does not seem like much, but if I assume that my producer (Samba) can generate at most 1 Gbps (or 125 MB/s), I can get 125 MB/s × 1.6 ms = 200 kB to disk. (Of course, I’d be lucky to get 100 MB/s with the stock NICs in any of the machines on my LAN, but that’s a different story.)

But that’s not really the point. The point is that the SSD should remain powered after the producer (the Samba server in this case) dies. This allows writes in progress to be completed. In that case, what really matters is how quickly the SSD can commit data to flash. That, of course, depends on the write speed of the SSD.

I’ve looked at four main candidates for an SSD:

  • OCZ Vertex 2 (60 GB)
  • Crucial m500 (120 GB)
  • SanDisk Ultra II (120 GB)
  • Kingston V300 (120GB)

The benefits of each are as follows. Note that I’ve picked the smallest capacity available for each architecture.

OCZ Vertex 2

The main benefit is price. This thing can be had on eBay for $40. It’s also speedy for its generation. Its random write (4k) is pretty high at 65 MB/s. Its sequential write is 225 MB/s (this is probably for compressible data; see below on the 120 GB model doing 146 MB/s). This means it can keep up with the 125 MB/s Ethernet, with a fair margin to spare. For this guy, I probably need the capacitor cable.

Crucial m500

There are two main benefits here. First is speed. Both its sequential write and random write are really high, at 141 MB/s and 121 MB/s, respectively. The second is that this drive has power-loss protection. So, I don’t need a clunky cable with built-in capacitors. Whatever data has made it to the drive will get committed.

SanDisk Ultra II

Once again, speed is a huge benefit. This blows away anything else in both sequential (339 MB/s) and random (133 MB/s) writes. In addition, SanDisk has a unique nCache non-volatile cache. This allows data to hit SLC (really MLC operating in SLC mode) before it hits the MLC. This approximates the best of both worlds: the speed and (temporal) reliability of SLC, with the cheaper/denser MLC for permanent storage.

It does, however, have a DDR cache. I hope that when a flush is issued, this DDR gets flushed (at least to the nCache). But, to be safe, I’ll probably buy the capacitor cable to pair with this if I go this route.

Kingston V300 (120GB)

This doesn’t have the fancy flash footwork of the SanDisk nCache, nor does it have the power-loss protection of the m500. However, it tends to be cheaper. And the speed is quite sufficient at 133 MB/s random and 172 MB/s sequential. It’s basically an older-generation SSD.

(Lack of) Conclusion

So, which will I go for? Well, before I make a decision, I want to test whether sequential or random write is the key metric. I plan on doing this tonight. I have a 32GB Kingston SV100 drive as SLOG on the system. I’ll use dd or something to issue synchronous writes of various sizes. The 32GB Kingston has a sequential write speed of 68 MB/s, but a dismal random write speed. If sequential writes are the key metric, I should see a throughput of around 68 MB/s. If not, I should see something much lower.

That’s the hypothesis, anyway. I’ll test it tonight.
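
The test will be something along these lines (the pool and dataset names are placeholders; sync=always pushes every write through the ZIL, and /dev/zero is only honest if compression is off on the dataset):

sudo zfs create -o sync=always tank/synctest
sudo dd if=/dev/zero of=/tank/synctest/testfile bs=128k count=2000
sudo zfs destroy tank/synctest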

Thoughts on DDRDrive

Warning: this section is mere conjecture.

The DDRDrive propaganda is interesting. They didn’t get a good result with just one producer, so they pushed it to multiple producers. To be fair, this is probably a good model of an enterprise situation (a database server, for example). But their scenario involves multiple file systems on the same server. Is it likely that one would split a database across multiple file systems? I’d posit that in an enterprise environment, one would just split up the multiple databases between multiple servers (or virtual servers) and be done.

In addition, they show the write speeds of their measured disks dropping off after some time. This is probably because the drive starts out with plenty of pre-erased blocks, but further writes need to rewrite over old data (erase + write) rather than doing a simple write. However, this situation is less likely on newer versions of FreeBSD, as ZFS now supports TRIM. That should keep freed sectors erased, so that the erase+write cycle does not occur.
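
(If I recall correctly, whether ZFS is issuing TRIM on FreeBSD can be checked with a sysctl, something like:)

sysctl vfs.zfs.trim.enabled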

Rejected Options

The Vertex 2 also comes in a 120 GB version, but its speed isn’t much faster (73 MB/s random and 146 MB/s sequential). I’d take it if it were as cheap as or cheaper than the 60GB.

The OCZ Agility 3 was also an option, but it isn’t any faster with non-compressible data.



Kingston SV100S2/32 and SV100S2/64 benchmarks

28-Mar-14

Update 2014-04-06: Tests on the 64GB with a $10 Sabrent USB 3.0 enclosure added to the end.

The 64GB SSD has been my solid-state workhorse in my ZFS pool for a while. First, it was the L2ARC, then it was the ZIL.

In fact, I used to have two, but I broke the connector off of one, which is a different story.

Curiously, I never benchmarked these drives using CrystalDiskMark. (I did benchmark them several times using dd.) I recently bought the 32GB drive on eBay, so I took the opportunity to benchmark both.

First, I benchmarked the 32GB via a USB 3.0 dock:

This wasn’t much different than the native SATA performance:

The USB 3.0 and SATA results are the same, except that the queue depth 32 read gets slightly faster on SATA (presumably due to native command queuing).

And finally, here are the 64 GB numbers (native SATA only):

Update 2014-04-06

I tested the 64GB using a $10 Sabrent USB 3.0 bus-powered 2.5″ enclosure. This beats any $30 USB 3.0 drive out there, although it is comparatively bulky (enclosure + cable).


The quest for the perfect router

11-Mar-14

All I want is a router that supports QoS and IPv6.

I recently upgraded my cable internet to DOCSIS 3.0, and with it got IPv6.

I was using a WD N900 router that I got for cheap at Staples. But I noticed that its QoS rules don’t apply to IPv6. It would (I assume) prioritize VoIP and so on, but IPv6 wasn’t rate-limited in any way. I could tell this because IPv4 traffic (via a speed test) would follow the upload/download limits I set for QoS, but IPv6 would not.

So, I decided to reuse the Atom D525 board I had and build a pfSense router. Note that this is the second time I’m using pfSense; the previous time was with an old Dell Inspiron laptop.

I went ahead and bought a refurbished mini-ITX case for $30. (The refurb is no longer available.)

Now, I’ve hit a snag: the case I bought won’t fit an expansion card. And, the USB gigabit Ethernet adapter I have periodically disappears from pfSense. This causes the LAN IPv6 (prefix-delegated) address to disappear, which then means all the computers on the LAN lose IPv6 access.

It would, indeed, be better to figure out why the ASIX chipset driver (ue0 is the device assigned to the USB Ethernet adapter) loses its mind every now and then. But I can’t spend that kind of time debugging the problem.

Instead, I plan on installing a PCI Ethernet card that I have lying around. The problem is (once again) that the case I bought does not have a slot for a PCI card. So, I’ll probably take the PCI bracket off the card, stick it in the case, and cut a hole in the case so I can get the Ethernet port out.

It probably won’t be pretty, but I don’t want to spend any more time/money on this.

And that’s my dilemma. I’m sort of a perfectionist about what I want (IPv6 + QoS), and that usually means spending either time or money.
