I was recently given an interview question where the hardest part was conducting a frequency count on a large set of strings. I have thoughts on how the interview went, and I'll share them at the end, but first I want to run through several different ways you could solve this problem.
Here's the basic setup of the problem: you have 6 billion strings, and their frequencies are distributed according to a power law. There are approximately 100 million unique strings, and you need to find the top K, where K is approximately 100 thousand. The strings are each around 50 characters on average. The strings are distributed in a peaky manner, meaning that a given string is likely to appear near other uses of the same string, rather than being spread uniformly at random through the 6 billion entries.
The problem was presented in Java. Here is an outline of a naive solution:
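This is my reconstruction of the obvious approach (count everything in a HashMap, then keep the top K with a small min-heap), not the exact code from the interview:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.stream.Stream;

class NaiveTopK {
    static List<Map.Entry<String, Long>> topK(Stream<String> records, int k) {
        // One pass over the records: count every string.
        Map<String, Long> counts = new HashMap<>();
        records.forEach(s -> counts.merge(s, 1L, Long::sum));

        // Keep the k largest counts with a min-heap of size k.
        PriorityQueue<Map.Entry<String, Long>> heap =
            new PriorityQueue<>((a, b) -> Long.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) heap.poll();  // evict the current minimum
        }
        return new ArrayList<>(heap);
    }
}
```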
The supposed problem with the algorithm above is that a hashmap is guaranteed to run out of memory. The interviewer specifically said that “There isn’t a machine around where this wouldn’t run out of memory.”
The solutions in this first group look at changing the environment rather than the algorithm.
Are there machines that won’t run out of memory as the interviewer said? I was suspicious when he blindly asserted this.
The default load factor for a hashmap in Java is 75%. A 50 character string takes up 144 bytes (more for 64 bit). Each entry in the hashmap consumes 36 bytes (assume 64 bit Java and double the size of all references). An Integer value object consumes another 16 bytes.

Adding these up we have 144 + 16 + 36 = 196 bytes. Let's round to 256 to be conservative, so we use 2^8 bytes per entry. We have 100 million unique strings; 1 million is ~2^20, so 100 million is ~2^27. We would need approximately 2^35 bytes of memory, plus load factor and extras. 2^32 is 4G, so 2^35 is 32G. Adjusting for the load factor, the fact that we need 1.5 times the memory when the table expands, and other things in the JVM, let's say 64G. Even if I'm off by a factor of 2, a machine with 128G of memory is not unreasonable.
Much of the overhead of a HashMap in Java is due to Java and not to the HashMap itself. We can allocate the table to the correct size initially and avoid the 1.5x factor. The entries themselves can also be smaller due to inline storage (fewer pointers), storing primitive ints, and less string overhead. I'm more confident about the memory usage of C++, so let's say we would only need 32G. Java can call C++ via the JNI, and the HashMap is purely temporary, so this is an option. If we need more memory reductions, we can calculate a 128-bit hash of the strings in question so that we only store 32 bytes per string, rather than 100 bytes for a 50-character Unicode string.
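To illustrate the hashing idea in Java (the discussion above proposes C++), with MD5 standing in as an arbitrary 128-bit hash, the keys shrink to a pair of longs:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

class HashKeyCounter {
    // A 128-bit key held as two longs; a record supplies equals/hashCode.
    record Hash128(long hi, long lo) {}

    private final MessageDigest md5;
    // Pre-sized so the table never rehashes (roughly 2^28 buckets for ~100M keys).
    private final Map<Hash128, Integer> counts = new HashMap<>(1 << 28);

    HashKeyCounter() throws NoSuchAlgorithmException {
        md5 = MessageDigest.getInstance("MD5");
    }

    void add(String s) {
        byte[] d = md5.digest(s.getBytes(StandardCharsets.UTF_8)); // digest() resets the instance
        long hi = 0, lo = 0;
        for (int i = 0; i < 8; i++)  hi = (hi << 8) | (d[i] & 0xFFL);
        for (int i = 8; i < 16; i++) lo = (lo << 8) | (d[i] & 0xFFL);
        counts.merge(new Hash128(hi, lo), 1, Integer::sum);
    }
}
```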
In C++, you don't necessarily have to have sufficient RAM for all the entries at once. Especially given that the strings are peaky in their uses, you could mmap an array larger than will fit in memory, backed by a file on SSD. You can then use this array to back a hashtable. The least recently used portions of this table will automatically be flushed to SSD by the kernel. Using this strategy is strictly an engineering tradeoff: is it fast enough for our use case? Given that the data is peaky, I would expect the kernel's caching to be effective for this use case.
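The same trick can be sketched in Java with a memory-mapped file, though this is only an illustration of the idea the paragraph describes in C++ terms: a single MappedByteBuffer is limited to 2 GB, so a full-sized table would need several mappings, and keyHash is assumed to be any 64-bit string hash with 0 reserved as the empty-slot marker.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

class MmapCountTable implements AutoCloseable {
    private static final int SLOT = 12;  // 8-byte key hash + 4-byte count
    private final int slots;
    private final RandomAccessFile file;
    private final MappedByteBuffer buf;

    MmapCountTable(String path, int slots) throws IOException {
        this.slots = slots;
        this.file = new RandomAccessFile(path, "rw");
        // The kernel pages cold parts of this mapping out to the backing file.
        this.buf = file.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, (long) slots * SLOT);
    }

    void increment(long keyHash) {
        int i = Math.floorMod(keyHash, slots);
        while (true) {  // linear probing; 0 is reserved as the empty marker
            int off = i * SLOT;
            long stored = buf.getLong(off);
            if (stored == 0) buf.putLong(off, keyHash);  // claim an empty slot
            if (stored == 0 || stored == keyHash) {
                buf.putInt(off + 8, buf.getInt(off + 8) + 1);
                return;
            }
            i = (i + 1) % slots;
        }
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```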
The next two solutions look at the data in two passes. One is exact, and the other is probabilistic.
Here we need sufficient disk space to store a second copy of all 6 billion records. We can run through all of the records and compute a hash of each string. We take the high 16 bits of the hash and write the record to a file named with those 16 bits in hex. This will create at most 65,536 files. The advantage of these files is that every copy of a given string ends up in a single file. We can then build a hashmap for the first file safely within memory and discard all but the top K. We repeat for the next file, and then compare hashmaps to keep only the top K. This strategy works because the binning is deterministic, so we don't miss any uses of a string.
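A minimal sketch of the binning pass. The 64-bit FNV-1a hash here is just an illustrative choice, and appending line-by-line with Files.write is the simplest thing that works; a real pass would keep a bounded pool of buffered writers:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class Binner {
    // Any stable 64-bit hash works; FNV-1a is used here purely for illustration.
    static int bucketOf(String s) {
        long h = 0xcbf29ce484222325L;                      // FNV-1a offset basis
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 0x100000001b3L;                           // FNV-1a prime
        }
        return (int) (h >>> 48);                           // high 16 bits
    }

    static void bin(Iterable<String> records, Path dir) throws IOException {
        for (String r : records) {
            Path bucket = dir.resolve(String.format("%04x", bucketOf(r)));
            Files.write(bucket, (r + "\n").getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }
}
```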
An array of primitives in Java is much more memory efficient than a hashmap of the same size, because it avoids wrapper objects and doesn't store the keys at all.
We could allocate an array of f * 100 million integers, where f is configurable. In the first pass through the records, we would hash each string and update the counter for the corresponding entry in the array. This is a naive hash table in that we don't handle collisions at all; we just allow colliding strings to update the same counter.
At the end of the first pass, we know the quantity of records and can estimate the number of collisions in the table. We can then derive a threshold t from K plus a three-sigma estimate of the collisions.
We can then run through the records a second time, skipping any record whose entry in the table is lower than t, and count the remaining records exactly with an ordinary hashmap.
This solution is probabilistic in that there is a small chance that a collision could cause us to choose an imperfect set of high frequency strings. If a probabilistic solution is acceptable, this has the nice property of requiring no extra disk space.
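A minimal sketch of both passes, assuming the slot function is any reasonable spreading of the string's hash and that the threshold t is derived as described above:

```java
import java.util.HashMap;
import java.util.Map;

class ApproxTopKFilter {
    private final int[] counters;

    ApproxTopKFilter(int size) {       // e.g. f * 100 million
        counters = new int[size];
    }

    private int slot(String s) {
        // Spread the 32-bit hashCode a bit; collisions are tolerated by design.
        long h = s.hashCode() * 0x9E3779B97F4A7C15L;
        return Math.floorMod(h, counters.length);
    }

    // Pass 1: bump one counter per record; colliding strings share a counter,
    // so a counter is always >= the true count of any string mapping to it.
    void observe(String s) {
        counters[slot(s)]++;
    }

    // Pass 2: count exactly, but only the records whose shared counter cleared t.
    Map<String, Long> exactCounts(Iterable<String> records, int t) {
        Map<String, Long> exact = new HashMap<>();
        for (String r : records) {
            if (counters[slot(r)] >= t) exact.merge(r, 1L, Long::sum);
        }
        return exact;
    }
}
```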
I only have one probabilistic solution. This solution was discounted by my interviewer because he didn't understand the algorithm I was describing. If the frequencies are distributed according to a power law, then this solution is very likely to work, and it has better performance properties than either of the two-pass solutions.
Reservoir sampling is a standard technique for choosing a fixed-size sample from a stream of items without revisiting them. The way we would use it to solve the frequency problem is to take a 0.1% sample of the logs. We would be left with 6 million records, and then we can run the hashmap algorithm on the sample. With high probability, the elements that are most populous in the sample will be the most populous elements in the full set of records.
The interviewer claimed that a reservoir sample wouldn't work because the data was peaky, but the reservoir sampling algorithm is designed specifically to choose a uniformly random sample. If the data were maintained in order, you would end up with a sample that remains peaky.
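For reference, here is the textbook reservoir (Algorithm R); nothing about a peaky input order affects its uniformity:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

class Reservoir {
    private final List<String> sample;
    private final int n;   // sample size, e.g. ~6 million for a 0.1% sample
    private long seen = 0;

    Reservoir(int n) {
        this.n = n;
        this.sample = new ArrayList<>(n);
    }

    void offer(String s) {
        seen++;
        if (sample.size() < n) {
            sample.add(s);
        } else {
            // Keep s with probability n / seen by overwriting a random slot.
            long j = ThreadLocalRandom.current().nextLong(seen);
            if (j < n) sample.set((int) j, s);
        }
    }

    List<String> sample() {
        return sample;
    }
}
```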
This is a solution I came up with during the interview. I like it because it demonstrates that often you can apply insights from a different discipline in novel ways in computer science.
I've never written a MapReduce, not even at Google. But this would be a pretty obvious use for one. The initial key-value pair would be index -> string. We would chunk the data by index to mappers. The mappers would emit the string as the key and 1 as the value. The local combining function would sum the values. The reduce step would apply the same combining function. This is like the 2nd homework exercise from a class on MapReduce.
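Using the open-source Hadoop API as a stand-in (an assumption; the post doesn't name a framework), the job is just word count with the record string as the word:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: (offset, record) -> (record, 1). The same summing reducer is
// registered as the combiner so counts are partially aggregated on each mapper.
class CountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text record, Context ctx)
            throws IOException, InterruptedException {
        ctx.write(record, ONE);
    }
}

class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        ctx.write(key, new IntWritable(sum));
    }
}
```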
Interviewing is hard. You have a brief moment of time to assess someone's ability to contribute at your company. In some ways, you want your interview problems to look like the problems a candidate is likely to face on the job. On the flip side, they need to be small and contained so that you can repeatedly ask the same question of different candidates.
Humility comes in recognizing that although the problems need to resemble the work, the solutions don't necessarily have to be solutions that you've seen before. If a candidate suggests a reservoir sample, don't just discount it, especially if you aren't absolutely sure that it won't work.
If you say that a line of code is guaranteed to run out of memory, you should absolutely know how much memory a machine would need for that line not to run out of memory. The interviewer discounted my suggestion of a MapReduce by saying that the bug was discovered on Thursday, we need a solution by next Wednesday, and that won't be enough time to stand up a MapReduce cluster. Well, if running on an instance with 64G of memory is a solution, let's just use a bigger instance, and I'll enjoy my 6 day weekend.
I have certainly beaten this horse to death. I was slightly aggravated by this interview, and I wanted to channel it into something productive. I guess making a fairly large catalog of potential solutions is a relatively productive outlet. I have tried to make the problem and the company vague enough that it won’t show up in a Google search done during the interview. Although I didn’t sign an NDA, I don’t really like exposing interview problems. I have mangled this one enough by extracting purely the frequency counting problem that it shouldn’t be an issue.
QuickSelect solves the problem of finding the kth lowest element in an unsorted array. QuickPartition is a variation on that problem, where we want the k lowest elements from an unsorted array.
I have written up two different versions in Haskell:
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE MultiWayIf #-}
{-# LANGUAGE OverloadedLists #-}
{-# LANGUAGE ScopedTypeVariables #-}
module Main where
import Control.Monad (when)
import Control.Monad.Primitive
import Control.Monad.ST.Strict
import Data.Foldable
import qualified Data.Sequence as S
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as VM
This version uses Data.Sequence to partition and accumulate the elements to be returned. If you need a Data.Vector, you can convert from a Data.Sequence. This version has the advantage of being simple. It splits the array into 3 parts in order to guarantee forward progress.
The basic algorithm is:

- Set keep to empty.
- Choose a pivot.
- Partition the array based on the pivot into lower, equal, and upper parts.
- If the lower part has fewer than k elements, add it to keep (plus any needed equal elements) and recurse into the upper part; otherwise recurse into the lower part.
quickPartition :: forall a. (Ord a, V.Unbox a) => Int -> V.Vector a -> S.Seq a
quickPartition k as = part k as S.empty
  where
    toVec :: S.Seq a -> V.Vector a
    toVec s = V.fromListN (S.length s) $ toList s

    -- Split the vector into elements below, equal to, and above the pivot.
    split :: a -> V.Vector a -> (S.Seq a, S.Seq a, S.Seq a)
    split pivot as = go as S.empty S.empty S.empty
      where
        go :: V.Vector a -> S.Seq a -> S.Seq a -> S.Seq a -> (S.Seq a, S.Seq a, S.Seq a)
        go as first mid last
          | V.null as = (first, mid, last)
          | otherwise =
              let !h = V.head as
                  !t = V.tail as
               in if | h > pivot -> go t first mid (last S.|> h)
                     | h < pivot -> go t (first S.|> h) mid last
                     | h == pivot -> go t first (mid S.|> h) last

    -- Accumulate elements known to be among the k lowest into keep.
    part 0 as keep = keep
    part k as keep
      | V.null as = keep
      | otherwise =
          let !midIdx = V.length as `div` 2
              !pivot = as V.! midIdx
              (first, mid, last) = split pivot as
              !count = S.length first
              !midCount = S.length mid
           in if | count <= k && count + midCount >= k ->
                     keep <> first <> S.take (k - count) mid
                 | count < k ->
                     part (k - count - midCount) (toVec last) (keep <> first <> mid)
                 | count > k -> part k (toVec first) keep
This version doesn't use Data.Sequence. Instead it uses mutable arrays and does a more standard quicksort-style partition, recursing only into the necessary parts. The Data.Vector library has the nice feature of allowing slices. This makes the partition step easier to write, because we don't have to keep track of low and high, or whether they are inclusive, etc. The slices modify the underlying array, so at the end we clone the relevant portion and then freeze it. I will admit that array manipulation and loops are uglier in Haskell.
quickPartition2 :: forall a. (Ord a, V.Unbox a, Show a) => Int -> V.Vector a -> V.Vector a
quickPartition2 k as =
  if V.length as <= k
    then as
    else runST $ arrange k as
  where
    -- Move the k smallest elements to the front of a mutable copy,
    -- then freeze just that prefix.
    arrange :: (PrimMonad m) => Int -> V.Vector a -> m (V.Vector a)
    arrange k as = do
      asM <- V.thaw as
      arrangeM k asM
      V.unsafeFreeze =<< (VM.clone $ VM.slice 0 k asM)

    -- Rearrange in place so the k smallest elements occupy the first k slots.
    arrangeM :: (PrimMonad m) => Int -> VM.MVector (PrimState m) a -> m ()
    arrangeM 0 asM = return ()
    arrangeM k asM = do
      let !sz = VM.length asM
          !midIdx = sz `div` 2
      pivot <- VM.read asM midIdx
      (minN, maxN) <- partitionStep pivot asM
      if | minN <= k && k <= maxN -> return ()
         | maxN < k -> arrangeM (k - maxN) $ VM.slice maxN (sz - maxN) asM
         | minN > k -> arrangeM k $ VM.slice 0 minN asM

    -- Hoare-style partition. Returns (lo, hi): cutting at any index in that
    -- range leaves elements <= pivot on the left and >= pivot on the right.
    partitionStep :: (PrimMonad m) => a -> VM.MVector (PrimState m) a -> m (Int, Int)
    partitionStep pivot asM
      | VM.length asM == 1 = return (0, 1)
      | otherwise = meetLoop 0 ((VM.length asM) - 1)
      where
        meetLoop i j
          | i > j = return (j + 1, i)
          | otherwise = do
              i' <- findLowLoop i
              j' <- findHighLoop j
              if i' <= j'
                then do
                  when (i' < j') $ VM.swap asM i' j'
                  meetLoop (i' + 1) (j' - 1)
                else meetLoop i' j'

        findLowLoop i
          | i == VM.length asM = return i
          | otherwise = do
              iElement <- VM.read asM i
              if iElement >= pivot
                then return i
                else findLowLoop (i + 1)

        findHighLoop j = do
          jElement <- VM.read asM j
          if jElement <= pivot
            then return j
            else findHighLoop (j - 1)
A set of signed-all-bits-set combs for a point B on an elliptic curve (or any group) is a collection of data tables. There are 3 parameters: n combs, with t teeth per comb and s bits of separation between teeth. As a concrete example, we will consider n = 4, t = 5, and s = 13. We need to build 4 tables that look like this:
index | contents |
---|---|
0 | B ⋅ 2^52 − B ⋅ 2^39 − B ⋅ 2^26 − B ⋅ 2^13 − B ⋅ 2^0 |
1 | B ⋅ 2^52 − B ⋅ 2^39 − B ⋅ 2^26 − B ⋅ 2^13 + B ⋅ 2^0 |
2 | B ⋅ 2^52 − B ⋅ 2^39 − B ⋅ 2^26 + B ⋅ 2^13 − B ⋅ 2^0 |
3 | B ⋅ 2^52 − B ⋅ 2^39 − B ⋅ 2^26 + B ⋅ 2^13 + B ⋅ 2^0 |
… | … |
15 | B ⋅ 2^52 + B ⋅ 2^39 + B ⋅ 2^26 + B ⋅ 2^13 + B ⋅ 2^0 |
The other 3 tables look the same, but they start at B ⋅ 2^65, B ⋅ 2^130, and B ⋅ 2^195.
We only need 16 entries in the table rather than 32, because we can negate. If the high bit is 1, we don't negate. If it is 0, we invert the lower 4 bits of the table index, and then negate the point. This in effect negates only the highest tooth.
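A tiny sketch of that bookkeeping, using plain longs as stand-in group elements so that arithmetic negation plays the role of point negation:

```java
class CombLookup {
    // fiveBits holds one comb's 5 signed digits at the current offset, encoded
    // as bits (1 = plus, 0 = minus). Complementing the low 4 bits and negating
    // the whole entry leaves the four low teeth unchanged and flips only the
    // top tooth, so a 16-entry table covers all 32 sign patterns.
    static long lookup(long[] table, int fiveBits) {
        boolean positive = (fiveBits & 0x10) != 0;
        int index = positive ? (fiveBits & 0xF) : (~fiveBits & 0xF);
        return positive ? table[index] : -table[index];
    }
}
```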
Here is the algorithm for the first step, as pseudocode. There are some shortcuts you can take with this algorithm, but I have left them out for simplicity.
teeth_sets = Array<Point>[4][4]
everything_points = Array<Point>[4](IdentityPoint)
accumulator = B
for (i = 0 to 259)
    if (i % s == 0)
        everything_points[i / (t * s)] *= accumulator
    elif (i % s == 1)
        teeth_number = (i / s) % t
        if (teeth_number != t - 1)
            teeth_sets[i / (t * s)][teeth_number] = accumulator
    accumulator = accumulator * 2
At the end of the loop, we have, for each comb, the entry with all teeth set to positive and each tooth doubled (squared, in the pseudocode's multiplicative notation). The doubled teeth are important for building out the other 15 entries that go into the table.
A Gray code for n bits is a cyclical arrangement of all n-bit strings so that each string differs from its two neighbors by exactly 1 bit. The term generally refers to a specific such code, called the reflected binary code. We will use the 4-bit Gray code. See Wikipedia for more on Gray codes.
By taking the everything point and adding or subtracting teeth, we can flip a specific bit in our signed representation. Let's give an example. The first everything point is B ⋅ 2^52 + B ⋅ 2^39 + B ⋅ 2^26 + B ⋅ 2^13 + B ⋅ 2^0. The first tooth we saved is B ⋅ 2^1. If we subtract this tooth, we get B ⋅ 2^52 + B ⋅ 2^39 + B ⋅ 2^26 + B ⋅ 2^13 + B ⋅ 2^0 − B ⋅ 2^1 = B ⋅ 2^52 + B ⋅ 2^39 + B ⋅ 2^26 + B ⋅ 2^13 − B ⋅ 2^0.
By remembering which direction we need to flip and walking a gray code, we can generate all the entries in each comb.
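The bookkeeping is small: the reflected binary Gray code is gray(i) = i XOR (i >> 1), and the bit that flips between steps i - 1 and i is the number of trailing zeros of i. A sketch:

```java
class GrayWalk {
    static int gray(int i) {
        return i ^ (i >>> 1);
    }

    // Between gray(i - 1) and gray(i), exactly this bit position toggles.
    static int flippedBit(int i) {
        return Integer.numberOfTrailingZeros(i);
    }

    // True means the flipped bit is now set, so we add the doubled tooth;
    // false means we subtract it.
    static boolean nowSet(int i) {
        return (gray(i) & (1 << flippedBit(i))) != 0;
    }
}
```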
There's a programming interview problem floating around the internet that goes something like this: given an array of numbers, for each entry, find the product of all the other entries. There's a twist to solve it without using division. It turns out the problem actually has a use in reducing multiple points to affine representations.
In a finite field, division is much more expensive than multiplication. Each of the points in our comb is represented in projective space as (X,Y,Z). For any non-zero scalar λ, the point (λX,λY,λZ) is the same point as (X,Y,Z). We can save space (and therefore lookup time) and store the points as (X/Z,Y/Z,1), where the 1 is implied. In order to do this we need to divide each point by its own Z. Here’s where the interview problem comes in: We want each point to have the same Z. Then we only need to do a single inversion. In order to do this, we have to multiply each point by every Z but its own.
Here’s the solution: Call the result OtherZ. Break the problem into two halves: OtherZBefore and OtherZAfter
Then run two loops (only the first loop is shown; the second is analogous):
accumulator = Identity
for (i = 0 to max)
    OtherZBefore[i] = accumulator
    accumulator *= points[i].Z
Then OtherZ[i] = OtherZBefore[i] ⋅ OtherZAfter[i]
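Here is the same trick sketched for field elements, with BigInteger arithmetic mod p standing in for the field implementation:

```java
import java.math.BigInteger;

class BatchOtherZ {
    // otherZ[i] = product of every z[j] except z[i], all reduced mod p.
    static BigInteger[] productsOfOthers(BigInteger[] z, BigInteger p) {
        int n = z.length;
        BigInteger[] before = new BigInteger[n];
        BigInteger[] after = new BigInteger[n];

        BigInteger acc = BigInteger.ONE;
        for (int i = 0; i < n; i++) {          // products of everything before i
            before[i] = acc;
            acc = acc.multiply(z[i]).mod(p);
        }
        acc = BigInteger.ONE;
        for (int i = n - 1; i >= 0; i--) {     // products of everything after i
            after[i] = acc;
            acc = acc.multiply(z[i]).mod(p);
        }

        BigInteger[] otherZ = new BigInteger[n];
        for (int i = 0; i < n; i++) {
            otherZ[i] = before[i].multiply(after[i]).mod(p);
        }
        return otherZ;
    }
}
```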
Tada! You have a completed comb. Now you can write the data to disk, or print it out and embed it in the source code.
I wrote an elliptic curve signature system that uses a Granger-Moss prime and has some nice properties.
Yes, I know it is unlikely to be widely adopted. There were several reasons to write a signature system (one of them involved the perf tool).

The signature system is based on scalar multiplication of points on an elliptic curve over a prime field. The prime is (t^11 − 1)/(t − 1) where t = 2^26 − 15. This prime is congruent to 3 mod 4 for fast square roots. Approximate division by t is straightforward: q = z >> 26; r = (z & (2^26 − 1)) + 15 ⋅ q, so that z = q ⋅ t + r.
The curve is an Edwards curve, defined as x^2 + y^2 = 1 + d ⋅ x^2 ⋅ y^2 with d = −49142. This is the smallest such d that yields a curve with cofactor 4. Thanks to the team at safecurves.cr.yp.to, I was able to use Sage and numerical sieves to construct proofs of safety for the curve.
The software uses a signed-all-bits-set representation for speedups and to avoid timing attacks from early termination. In signed-all-bits-set representation, a number is represented in binary, but each digit is ±1. To convert a b-bit number e to this representation mod l, write e = sum(i = 0 to b − 1, 2^i ⋅ d_i) with d_i ∈ {−1, 1}. Rearranging, (e + 2^b − 1)/2 = sum(i = 0 to b − 1, 2^i ⋅ (d_i + 1)/2). So given a binary number e, if we want its digits in SABS representation, we compute (e + 2^b − 1)/2 and read off its bits: a 1 bit means d_i = +1 and a 0 bit means d_i = −1. Since the scalars we are concerned about form a prime field, we can always divide by 2.
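A sketch of the recoding with BigInteger, where l is the prime group order, b is the number of signed digits the combs consume (assumed to be at least the bit length of l), and the scalar e is assumed to be reduced mod l:

```java
import java.math.BigInteger;

class SabsRecode {
    // Returns b digits, each +1 or -1, whose weighted sum(2^i * d_i) is
    // congruent to e mod l. Bit i of (e + 2^b - 1) / 2 (mod l) is (d_i + 1) / 2.
    static int[] toSabsDigits(BigInteger e, BigInteger l, int b) {
        BigInteger halve = BigInteger.TWO.modInverse(l);   // division by 2 mod l
        BigInteger recoded = e.add(BigInteger.ONE.shiftLeft(b))
                              .subtract(BigInteger.ONE)
                              .multiply(halve)
                              .mod(l);
        int[] digits = new int[b];
        for (int i = 0; i < b; i++) {
            digits[i] = recoded.testBit(i) ? 1 : -1;
        }
        return digits;
    }
}
```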
To Sign:
There is a well known point B. The signing key is a pair (A, a), where A = a ⋅ B and a is a scalar, the secret key. Choose a session key k, then compute the point R = k ⋅ B. Use blake2b512 to produce a 512-bit value h = Hash(R, A, M), where A is the compressed public key and M is the message. Compute s = k − h ⋅ a. The signature is the pair (R, s), where R is compressed.
To Verify: Compute h = Hash(R,A,M). Check that sB + hA = R. This can be done without decompressing R. We can verify the check equation: sB + hA = (k−ha) ⋅ B + ha ⋅ B = k ⋅ B = R.
This particular system was chosen based on guidance from Dan Bernstein and Tanja Lange
The signature algorithm uses 4 5-bit combs with teeth separation of 13 bits. Combs are straightforward once you see what they are doing. The first table contains 16 entries:
index | contents |
---|---|
0 | B ⋅ 2^52 − B ⋅ 2^39 − B ⋅ 2^26 − B ⋅ 2^13 − B ⋅ 2^0 |
1 | B ⋅ 2^52 − B ⋅ 2^39 − B ⋅ 2^26 − B ⋅ 2^13 + B ⋅ 2^0 |
… | … |
15 | B ⋅ 2^52 + B ⋅ 2^39 + B ⋅ 2^26 + B ⋅ 2^13 + B ⋅ 2^0 |
We only need 16 entries because we can use the 5th bit as a sign bit. The next comb is constructed in a similar manner, starting with B ⋅ 2^65. To multiply B by a scalar, we start with acc equal to the identity and do the following 13 times: double acc, then for each of the 4 combs use the 5 scalar bits at the current tooth offset (high bit as sign, as described above) to select a table entry and add or subtract it from acc.
You always use a comb when signing, because B is known. You can use one for half of the verification, because you compute sB, and again B is known. If you need to verify several signatures from the same key, it can be effective to compute a comb for the known public key A to speed up the computation of hA.
At the very end of the article by DJB and Tanja Lange, there is a discussion of what to use to derive the session key. The article argues that using the message and the private key to generate the session key is best because it is stateless and simple, and there is no security to be gained by including randomness in the generation. This analysis misses a specific kind of attack: a fault attack. In this system, the message is hashed twice, once to generate the session key k and again to produce h. If an attacker can cause the document to be different for the second hashing, they can produce multiple signatures with the same session key, which reveals the private key: two signatures with the same k satisfy s1 − s2 = (h2 − h1) ⋅ a, so a = (s1 − s2)/(h2 − h1). If, however, the random number hardware is working and you include a random value in the session key hash, then the fault attack won't work.
This story starts with a failing test where the failure was a SEGV. Pulling up the relevant code in the debugger showed that it was a null dereference. Normally this would be straightforward to debug, but in this case the null dereference was right after a null check. I set a breakpoint at the null check, but the SIGSEGV still occurred. The breakthrough came when I set breakpoints on every instruction between the null check and the dereference. Something was jumping in between the null check and the dereference.
In the absence of a rewind debugger, I had to figure out what code was branching to the middle of a block of code. Fortunately, I had the branch target and objdump. I was able to find the offending branch by text searching the objdump disassembly for the branch target. Next I had to figure out where this branch was supposed to go. I was able to use the objdump of the unlinked objects to see that the intended target was exactly 2^14 away from where it ended up going. Whenever you see an exact power of 2 like that, you know something fishy is going on.
Why is 2^14 important? Because PPC64 conditional branches include 12 signed bits for the relative branch target. Instructions are required to be 32-bit aligned, so we get the 2 lowest bits of the branch target for free. Hence 14 bits. Conditional branches are almost always within a function, and functions rarely span 14 bits of address space, so this isn't a common issue. When a function is too large, the compiler triggers branch relaxation, scanning the branches to see if any of them are too far away. The ones that are too far are rewritten as two branches: a short conditional branch with the test inverted and the hint adjusted, and an unconditional branch. This increases the code size, so relaxation is repeated until a fixed point is reached.
This is where the actual bug was found. In order to decide which branches to relax, LLVM needs to scan the code to determine how many bytes separate a branch from its target. This sounds like it would be easy on PPC64. Every instruction is 4 bytes. The problem with that logic is that not everything that LLVM is working with is an instruction yet. Some of them are pseudo instructions. I don’t remember how I found which pseudo instruction was miscounted, but there was a two-instruction pseudo-op that was incorrectly sized.
In order for the bug to trigger, there needed to be a branch whose distance to its target was just over the limit, but with enough miscounted pseudo instructions that LLVM believed it was still in range and skipped relaxing it. Most codebases would never see the bug. But Google has a huge codebase and a large testing infrastructure that re-runs tests after dependent code changes. We were able to find the bug only because we had a large number of chances to find it.
The most important lesson is that you can find even odd corner-case bugs by thoroughly bounding and documenting behavior. Sometimes you get a stroke of insight, like setting breakpoints on every instruction or noticing that the branch is off by exactly 2^14. But another fun lesson is that for some PRs, the time invested is not matched by the number of lines changed.