Last week, February 17, we released pfSense Plus 21.02 alongside pfSense CE 2.5. It was the culmination of nine months of work on new features, testing, and bug fixing, and we were quite proud of it. Unfortunately, an obscure and esoteric bug lurked inside, and it resulted in an All Hands On Deck call for our engineering and support teams. Today’s blog dives into the interesting details of how our team, outstanding professionals that they are, handled and debugged the problem, and how this team really makes Netgate special. The following gets technical in places, but I think you’ll appreciate the insight it brings.
A day after we announced the release of pfSense Plus 21.02, our support organization started getting calls from a few customers who were experiencing lock-ups under certain circumstances. At first the reports were hard to make sense of, but within a day the support team had enough of the puzzle put together to alert the engineering team that there was a problem needing immediate attention. It seemed to affect only the SG-3100 product, which was our first big clue for debugging, but that’s also a very popular product, so we had to act quickly. The support team also put together a reliable reproduction case so that the engineering team could better study the problem and quickly evaluate possible fixes. As a precautionary measure, we also halted the ability of SG-3100 users to upgrade to pfSense Plus 21.02 while we figured out the problem.
The engineering team took over and assembled its key players to start reproducing the problem and working out a debugging strategy. While we don’t always ship our appliances with debug code enabled, we have a few tools readily at our disposal. One is the “^T” (Control-T) information handler, which you can press at any time in a FreeBSD terminal session. It tells you which process is currently running in the foreground and gives you a little bit of information on what that process is doing. In our case, we ran a simple “ping” test from the command line and observed that packets weren’t flowing. We then pressed “^T” and got this:
load: 1.52 cmd: ping 12106 [*pf rulesets] 30.38r 0.00u 0.00s 0% 2112k
Uh-oh! The “*” is never a good sign: it means that the process is stuck inside the kernel, waiting on a lock. A “lock” is a multiprocessor synchronization tool that is kind of like a traffic light at a busy intersection. Code executing on one CPU has to take turns with code executing on another CPU if they happen to be operating in overlapping areas. These overlapping areas are identified as part of the design process developers use when writing multiprocessor-aware code. Sometimes only one CPU is allowed to operate at a time while the others wait; other times multiple CPUs operate at the same time because the developer knows they won’t conflict. A common example is when multiple CPUs try to read from a location in memory at the same time that another CPU is trying to write to it. The readers won’t interfere with each other as long as the writer stays out of their way and they stay out of its way. Locking and synchronization is one of the most complicated topics in computer science and engineering, but most kernel developers understand it really well.
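To make the reader/writer idea concrete, here is a minimal user-space sketch in C using POSIX read/write locks. The kernel uses its own locking primitives, and the “rule table” and function names here are purely hypothetical, but the traffic-light behavior is the same: many readers share a green light, while a writer gets an exclusive one.

#include <pthread.h>

/* A hypothetical rule table protected by a reader/writer lock. */
static pthread_rwlock_t rules_lock = PTHREAD_RWLOCK_INITIALIZER;
static int rule_count;

/* Readers: many threads (CPUs) may hold the read lock at the same time. */
int
lookup_rules(void)
{
    int n;

    pthread_rwlock_rdlock(&rules_lock);    /* shared green light */
    n = rule_count;                        /* read-only work */
    pthread_rwlock_unlock(&rules_lock);
    return (n);
}

/* Writer: only one thread may change the rules, with no readers running. */
void
update_rules(int new_count)
{
    pthread_rwlock_wrlock(&rules_lock);    /* exclusive green light */
    rule_count = new_count;
    pthread_rwlock_unlock(&rules_lock);
}

int
main(void)
{
    update_rules(10);
    return (lookup_rules() == 10 ? 0 : 1);
}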
Normally these lock waits complete in a matter of microseconds. This one looked stuck and wasn’t letting the ping process do any work. It also gave us the hint that the problem might be in the “pf” packet filter code in the kernel. Because most network packets have to travel through this code (it’s the firewall module that pfSense is built around), this hypothesis seemed reasonable.
Since this pointed to a multiprocessor problem, we rebooted our test hardware with only a single CPU core enabled. Sure enough, the problem went away, even though we hadn’t changed any code. This was another strong piece of evidence that pointed to the multi-processor locks, but we still didn’t know exactly where to look yet.
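For anyone who wants to run the same kind of experiment on a FreeBSD machine, one common way to boot with a single CPU is to disable SMP with a loader tunable; see smp(4) for the exact knob on your release. For example, in /boot/loader.conf:

# Boot with symmetric multiprocessing disabled (single CPU core)
kern.smp.disabled="1"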
Another tool that FreeBSD provides to pfSense, and part of what makes it such a powerful platform to develop on, is a built-in kernel debugger, KDB. We don’t always enable it in the code we ship on our appliances, but adding it to our test environment is quick and easy. With the kernel debugger available, we also enabled the WITNESS debugging toolkit. Its job is to track all of the locks in the kernel, automatically alert on common coding mistakes with these locks, and provide detailed tracing information to the kernel debugger when the automatic detection isn’t enough.
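For readers who build their own FreeBSD kernels, these facilities are turned on with kernel configuration options along these lines (a sketch of a typical debug configuration, not our exact build):

options KDB        # kernel debugger framework
options DDB        # interactive debugger backend (the "db>" prompt)
options WITNESS    # track lock acquisitions and warn about misuse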
Armed with the KDB and WITNESS tools, we engaged the debugger and ran the command “show alllocks”. This displays information about every active lock in the kernel, including which code holds a lock (i.e., has a green light to proceed) and which code is waiting on one (i.e., is stuck at a red light). What we saw was this:
db> show alllocks
Process 13200 (sshd) thread 0xdca493c0 (100102)
shared rm PFil shared rmlock (PFil shared rmlock) r = 0 (0xc625e1c0) locked @ FreeBSD-src/sys/net/pfil.c:108
exclusive rw tcpinp (tcpinp) r = 0 (0xdf014b98) locked @ FreeBSD-src/sys/netinet/tcp_usrreq.c:965
exclusive sx so_snd_sx (so_snd_sx) r = 0 (0xdf4ca174) locked @ FreeBSD-src/sys/kern/uipc_sockbuf.c:285
Process 39349 (dpinger) thread 0xdee60b40 (100132)
shared rm PFil shared rmlock (PFil shared rmlock) r = 0 (0xc625e1c0) locked @ FreeBSD-src/sys/net/pfil.c:108
I’ve edited the output for brevity; the command actually produced several pages of information. What it told us is that “sshd” and “dpinger” were waiting on a lock. That was somewhat expected, but what was really weird was that it only showed processes waiting to get the lock; it didn’t seem to know of any process owning it. In other words, all red lights and no green lights. It’s common to have multiple red lights, but to have a red light you must have a green light. Usually if you find the process with the green light, it’ll point you closer to the problem.
Next we ran the “show lockchain” command. This is an especially powerful tool that helps determine inter-dependencies between multiple kernel locks. To our surprise, the output was corrupt! The code was not only stuck waiting on a lock; the act of waiting had corrupted its state, making it unlikely to ever recover without a reboot. The mystery deepened.
Something to keep in mind is that kernel locks come in many varieties. Some are optimized for uses that do frequent and symmetrical reads and writes of data, while others are optimized for uses that do mostly reads and only a few writes. This latter type is called an “rmlock” (read-mostly lock) and makes specialized use of the CPU cache and its inter-processor communication channels to be as efficient as possible in the common case. This lock is used heavily in the “pf” firewall code so that packets passing through can quickly read their firewall rules, yet still stay synchronized with updates that write new rules.
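Inside the kernel, the read-mostly lock has its own API, rmlock(9). A simplified sketch of the usage pattern looks something like this; the rule-table names are hypothetical and not the actual pf code:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/rmlock.h>

/* A hypothetical rule table protected by a read-mostly lock. */
static struct rmlock rules_rmlock;
static int rule_count;

static void
rules_init(void)
{
    rm_init(&rules_rmlock, "rules");
}

/* Fast path: each packet takes a cheap read lock to consult the rules;
 * many CPUs may hold it at once. */
static int
rules_lookup(void)
{
    struct rm_priotracker tracker;
    int n;

    rm_rlock(&rules_rmlock, &tracker);
    n = rule_count;
    rm_runlock(&rules_rmlock, &tracker);
    return (n);
}

/* Slow path: rule updates take the exclusive (and more expensive) write lock. */
static void
rules_update(int new_count)
{
    rm_wlock(&rules_rmlock);
    rule_count = new_count;
    rm_wunlock(&rules_rmlock);
}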
We did an experiment where we replaced this “rmlock” with a more generic and less efficient “rwlock”, which required about 10 lines of code changes. The result was that the problem went away, but performance suffered a little bit. This was both another clue as to the cause and a fallback in case we had trouble finding the real culprit. Correctly working code, even if it’s slower, is always better than broken code, right?
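For comparison, the read path with the generic rwlock(9) drops the per-CPU tracker; the substitution in our experiment was essentially of this shape (again a sketch against the hypothetical rule table above, not the actual diff):

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/rwlock.h>

/* Same hypothetical rule table, now behind a plain reader/writer lock. */
static struct rwlock rules_rwlock;
static int rule_count;

static void
rules_init(void)
{
    rw_init(&rules_rwlock, "rules");
}

static int
rules_lookup(void)
{
    int n;

    rw_rlock(&rules_rwlock);
    n = rule_count;
    rw_runlock(&rules_rwlock);
    return (n);
}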
Now we knew that the “rmlock” module might be to blame, and this is where our team morphed into code wizards: they started disassembling the code and examining the low-level machine instructions, line by line. The details are explained here in much better depth than I can provide, but what we found was that a few lines of machine code, generated by the compiler, were correct but out of sequence. By executing out of sequence, the CPU was getting confused as to whether a lock might already be owned by code executing on another CPU. This caused it to force all processes to wait for the lock, even though the lock wasn’t actually owned. All red lights, no green lights. All network packets were stuck waiting for a signal to proceed that would never come.
Our problem wasn’t that the compiler was producing bad code; because the “rmlock” code is so well optimized, we had entered a realm where the normal rules sometimes don’t apply and need special exceptions. Modern CPUs play a lot of tricks with re-arranging the order in which instructions are executed. Usually this is done to speed up the overall code, and it’s done in cooperation with rules set by the compiler and ultimately controlled by the developer. It’s also usually done automatically and correctly; very few developers ever have to think about details this obscure and esoteric. Even fewer can read machine code and know when it looks out of order.
It was tempting to substitute in the “rwlock” band-aid I described above, sacrificing performance for a quick and superficial fix. Pushing past that temptation to find and fix the true root cause is what makes our Netgate team so special and powerful. The fix was to add a single line of code that gives the compiler a clear directive to keep the problematic segment of machine code executing in the proper sequence, something we call a “memory barrier”. Because pfSense and FreeBSD are open source, the fix has already been publicly reviewed and contributed upstream, and can be found here.
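To give a flavor of what such a directive looks like, here is a tiny, generic illustration of a compiler barrier in C, using the classic GCC/Clang idiom of an empty asm statement with a "memory" clobber. It is only a sketch of the concept, with made-up variable names, and is not the actual upstream patch, which is linked above.

static int shared_data;
static volatile int data_ready;

/* Tell the compiler it may not move loads or stores across this point. */
#define compiler_barrier()  __asm__ __volatile__("" ::: "memory")

void
publish(int value)
{
    shared_data = value;   /* step 1: write the data            */
    compiler_barrier();    /* keep the two stores in this order */
    data_ready = 1;        /* step 2: then signal it is ready   */
}

/* Without the barrier, the compiler would be free to reorder the two
 * stores, and a reader on another CPU could see data_ready == 1 while
 * shared_data is still stale -- the same flavor of out-of-sequence
 * trouble described above. */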
The total time elapsed from the first report to the proposed solution was 48 hours. This is an amazing turnaround time for something this deep and esoteric. We’ve been testing the fix extensively inside of Netgate since then. The problem seems to affect only a subset of Netgate appliances that use non-x86 CPUs, and it’s unlikely that it affects any of our pfSense CE users. However, out of an abundance of caution, and for clarity, we provided a patch update to pfSense Plus 21.02 for all appliances.
An obvious question is how this bug escaped our pre-release testing. Part of the answer is that it requires a specific set of circumstances to trigger, and those conditions were not obvious right away. We’re taking a very close look at our testing procedures going forward, and we’re adding this case to our test plan.
I can’t thank the Netgate support and engineering teams enough for their amazing work on a difficult problem. You all are awesome!