Solving a Mystery Bug in LLVM


Random SIGSEGV

This story starts with a failing test where the failure was a SEGV. Pulling up the relevant code in the debugger showed that it was a null dereference. Normally this would be straightforward to debug, but in this case the null dereference was right after a null check. I set a breakpoint at the null check, but the SIGSEGV still occurred. The breakthrough came when I set breakpoints on every instruction between the null check and the dereference. Something was jumping in between the null check and the dereference.

COMEFROM

In the absence of a rewind debugger, I had to figure out what code was branching to the middle of a block of code. Fortunately, I had the branch target and objdump. I was able to find the offending branch by text searching the objdump disassembly for the branch target. Next I had to figure out where this branch was supposed to go. I was able to use the objdump of the unlinked objects to see that the intended target was exactly 2^14 away from where it ended up going. Whenever you see an exact power of 2 like that, you know something fishy is going on.

LLVM Just Needs to Relax

Why is 2^14 important? Because PPC64 conditional branches include 12 signed bits for the relative branch target. Instructions are required to be 32 bit aligned, so we get the 2 lowest bits of the branch target for free. Hence 14 bits. Conditional branches are almost always within a function, and functions rarely span 14 bits of address space, so this isn’t a common issue. When a function is too large, the compiler triggers Branch Relaxation scanning the branches to see if any of them are too far. The ones that are too far are re-written as two branches: A short conditional branch with inverted tests and hinting, and an unconditional branch. This increases the code size, so relaxation is repeated until a fixed point is reached.

How Does LLVM Know?

This is where the actual bug was found. In order to decide which branches to relax, LLVM needs to scan the code to determine how many bytes separate a branch from its target. This sounds like it would be easy on PPC64. Every instruction is 4 bytes. The problem with that logic is that not everything that LLVM is working with is an instruction yet. Some of them are pseudo instructions. I don’t remember how I found which pseudo instruction was miscounted, but there was a two-instruction pseudo-op that was incorrectly sized.

Law of Large Codebases

In order for the bug to trigger, there needed to be a branch target that was just large enough that it was over, but with a handful of miscounts that prevented it from being relaxed. Most codebases would never see the bug. But Google has a huge codebase and a large testing infrastructure to re-run the tests after dependent code changes. We were able to find the bug only because we had a large number of chances to find it.

Lessons

The most important lesson is that you can find even odd corner case bugs by thoroughly bounding and documenting behavior. Sometimes you get a stroke of insight to set multiple breakpoints or that the branch is off by 2^14. But another fun lesson is that for some PRs, the time invested is not matched by the number of lines changed