I was recently given an interview question where the hardest part was conducting a frequency count on a large set of strings. I have thoughts on how the interview went, and I'll share them at the end, but first I want to run through several different ways you could solve this problem.
Here's the basic setup of the problem: you have 6 billion strings, and their frequencies are distributed according to a power law. There are approximately 100 million unique strings, and you need to find the top K, where K is approximately 100 thousand. The strings are each around 50 characters on average. The strings are distributed in a peaky manner, meaning that a given string is likely to appear near other uses of the same string, rather than being spread uniformly at random through the 6 billion entries.
The problem was presented in Java. Here is an outline of a naive solution:
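This is my reconstruction of the obvious approach (count everything in a HashMap, then keep the top K with a small min-heap), not the exact code from the interview:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.stream.Stream;

class NaiveTopK {
    static List<Map.Entry<String, Long>> topK(Stream<String> records, int k) {
        // One pass over the records: count every string.
        Map<String, Long> counts = new HashMap<>();
        records.forEach(s -> counts.merge(s, 1L, Long::sum));

        // Keep the k largest counts with a min-heap of size k.
        PriorityQueue<Map.Entry<String, Long>> heap =
            new PriorityQueue<>((a, b) -> Long.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) heap.poll();  // evict the current minimum
        }
        return new ArrayList<>(heap);
    }
}
```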
The supposed problem with the algorithm above is that a hashmap is guaranteed to run out of memory. The interviewer specifically said that “There isn’t a machine around where this wouldn’t run out of memory.”
The solutions in this first group look at changing the environment rather than the algorithm.
Are there machines that won’t run out of memory as the interviewer said? I was suspicious when he blindly asserted this.
The default load factor for a hashmap in Java is 75%. A 50 character string takes up 144 bytes (more for 64 bit). Each entry in the hashmap consumes 36 bytes (assume 64 bit Java and double the size of all references). An Integer value object consumes another 16 bytes.

Adding these up we have 144 + 16 + 36 = 196 bytes. Let's round to 256 to be conservative, so we use 2^8 bytes per entry. We have 100 million unique strings; 1 million is ~2^20, so 100 million is ~2^27. We would need approximately 2^35 bytes of memory, plus load factor and extras. 2^32 is 4G, so 2^35 is 32G. Adjusting for the load factor, the fact that we need 1.5 times the memory when the table expands, and other things in the JVM, let's say 64G. Even if I'm off by a factor of 2, a machine with 128G of memory is not unreasonable.
Much of the overhead of a HashMap in Java is due to Java and not to the HashMap itself. We can allocate the table to the correct size initially and avoid the 1.5x factor. The entries themselves can also be smaller due to inline storage (fewer pointers), storing primitive ints, and less string overhead. I'm more confident about the memory usage of C++, so let's say we would only need 32G. Java can call C++ via the JNI, and the HashMap is purely temporary, so this is an option. If we need more memory reductions, we can calculate a 128-bit hash of the strings in question so that we only store 32 bytes per string, rather than 100 bytes for a 50-character Unicode string.
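To illustrate the hashing idea in Java (the discussion above proposes C++), with MD5 standing in as an arbitrary 128-bit hash, the keys shrink to a pair of longs:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

class HashKeyCounter {
    // A 128-bit key held as two longs; a record supplies equals/hashCode.
    record Hash128(long hi, long lo) {}

    private final MessageDigest md5;
    // Pre-sized so the table never rehashes (roughly 2^28 buckets for ~100M keys).
    private final Map<Hash128, Integer> counts = new HashMap<>(1 << 28);

    HashKeyCounter() throws NoSuchAlgorithmException {
        md5 = MessageDigest.getInstance("MD5");
    }

    void add(String s) {
        byte[] d = md5.digest(s.getBytes(StandardCharsets.UTF_8)); // digest() resets the instance
        long hi = 0, lo = 0;
        for (int i = 0; i < 8; i++)  hi = (hi << 8) | (d[i] & 0xFFL);
        for (int i = 8; i < 16; i++) lo = (lo << 8) | (d[i] & 0xFFL);
        counts.merge(new Hash128(hi, lo), 1, Integer::sum);
    }
}
```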
In C++, you don't necessarily have to have sufficient RAM for all the entries at once. Especially given that the strings are peaky in their uses, you could mmap an array larger than will fit in memory, backed by a file on SSD. You can then use this array to back a hashtable. The least recently used portions of this table will automatically be flushed to SSD by the kernel. Using this strategy is strictly an engineering tradeoff: is it fast enough for our use case? Given that the data is peaky, I would expect the kernel's caching to be effective for this use case.
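The same trick can be sketched in Java with a memory-mapped file, though this is only an illustration of the idea the paragraph describes in C++ terms: a single MappedByteBuffer is limited to 2 GB, so a full-sized table would need several mappings, and keyHash is assumed to be any 64-bit string hash with 0 reserved as the empty-slot marker.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

class MmapCountTable implements AutoCloseable {
    private static final int SLOT = 12;  // 8-byte key hash + 4-byte count
    private final int slots;
    private final RandomAccessFile file;
    private final MappedByteBuffer buf;

    MmapCountTable(String path, int slots) throws IOException {
        this.slots = slots;
        this.file = new RandomAccessFile(path, "rw");
        // The kernel pages cold parts of this mapping out to the backing file.
        this.buf = file.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, (long) slots * SLOT);
    }

    void increment(long keyHash) {
        int i = Math.floorMod(keyHash, slots);
        while (true) {  // linear probing; 0 is reserved as the empty marker
            int off = i * SLOT;
            long stored = buf.getLong(off);
            if (stored == 0) buf.putLong(off, keyHash);  // claim an empty slot
            if (stored == 0 || stored == keyHash) {
                buf.putInt(off + 8, buf.getInt(off + 8) + 1);
                return;
            }
            i = (i + 1) % slots;
        }
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```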
The next two solutions look at the data in two passes. One is exact, and the other is probabilistic.
Here we need sufficient disk space to store a second copy of all 6 billion records. We can run through all of the records and compute a hash of each string. We take the high 16 bits of the hash and write the record to a file named with those 16 bits in hex. This will create at most 65,536 files. The advantage of these files is that every copy of a given string ends up in a single file. We can then build a hashmap for the first file safely within memory and discard all but the top K. We repeat for the next file, and then compare hashmaps to keep only the top K. This strategy works because the binning is deterministic, so we don't miss any uses of a string.
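A minimal sketch of the binning pass. The 64-bit FNV-1a hash here is just an illustrative choice, and appending line-by-line with Files.write is the simplest thing that works; a real pass would keep a bounded pool of buffered writers:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class Binner {
    // Any stable 64-bit hash works; FNV-1a is used here purely for illustration.
    static int bucketOf(String s) {
        long h = 0xcbf29ce484222325L;                      // FNV-1a offset basis
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 0x100000001b3L;                           // FNV-1a prime
        }
        return (int) (h >>> 48);                           // high 16 bits
    }

    static void bin(Iterable<String> records, Path dir) throws IOException {
        for (String r : records) {
            Path bucket = dir.resolve(String.format("%04x", bucketOf(r)));
            Files.write(bucket, (r + "\n").getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }
}
```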
An array of primitives in Java is much more memory efficient than a hashmap of the same size, because it avoids wrapper objects and doesn't store the keys at all.
We could allocate an array of f * 100 million integers, where f is configurable. In the first pass through the records, we would hash each string and update the counter for the corresponding entry in the array. This is a naive hash table in that we don't handle collisions at all; we just allow colliding strings to update the same counter.
At the end of the first pass, we know the quantity of records and can estimate the number of collisions in the table. We can then derive a threshold t from K plus a three-sigma estimate of the collisions.
We can then run through the records a second time, skipping any record whose entry in the table is lower than t, and count the remaining records exactly with an ordinary hashmap.
This solution is probabilistic in that there is a small chance that a collision could cause us to choose an imperfect set of high frequency strings. If a probabilistic solution is acceptable, this has the nice property of requiring no extra disk space.
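A minimal sketch of both passes, assuming the slot function is any reasonable spreading of the string's hash and that the threshold t is derived as described above:

```java
import java.util.HashMap;
import java.util.Map;

class ApproxTopKFilter {
    private final int[] counters;

    ApproxTopKFilter(int size) {       // e.g. f * 100 million
        counters = new int[size];
    }

    private int slot(String s) {
        // Spread the 32-bit hashCode a bit; collisions are tolerated by design.
        long h = s.hashCode() * 0x9E3779B97F4A7C15L;
        return Math.floorMod(h, counters.length);
    }

    // Pass 1: bump one counter per record; colliding strings share a counter,
    // so a counter is always >= the true count of any string mapping to it.
    void observe(String s) {
        counters[slot(s)]++;
    }

    // Pass 2: count exactly, but only the records whose shared counter cleared t.
    Map<String, Long> exactCounts(Iterable<String> records, int t) {
        Map<String, Long> exact = new HashMap<>();
        for (String r : records) {
            if (counters[slot(r)] >= t) exact.merge(r, 1L, Long::sum);
        }
        return exact;
    }
}
```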
I only have one probabilistic solution. This solution was discounted by my interviewer because he didn't understand the algorithm I was describing. If the frequencies are distributed according to a power law, then this solution is very likely to work, and it has better performance properties than either of the two-pass solutions.
Reservoir sampling is a standard technique for choosing a fixed-size sample from a stream of items without revisiting them. The way we would use it to solve the frequency problem is to take a 0.1% sample of the logs. We would be left with 6 million records, and then we can run the hashmap algorithm on the sample. With high probability, the elements that are most populous in the sample will be the most populous elements in the full set of records.
The interviewer claimed that a reservoir sample wouldn't work because the data was peaky, but the reservoir sampling algorithm is designed specifically to choose a uniformly random sample. If the data were maintained in order, you would end up with a sample that remains peaky.
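For reference, here is the textbook reservoir (Algorithm R); nothing about a peaky input order affects its uniformity:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

class Reservoir {
    private final List<String> sample;
    private final int n;   // sample size, e.g. ~6 million for a 0.1% sample
    private long seen = 0;

    Reservoir(int n) {
        this.n = n;
        this.sample = new ArrayList<>(n);
    }

    void offer(String s) {
        seen++;
        if (sample.size() < n) {
            sample.add(s);
        } else {
            // Keep s with probability n / seen by overwriting a random slot.
            long j = ThreadLocalRandom.current().nextLong(seen);
            if (j < n) sample.set((int) j, s);
        }
    }

    List<String> sample() {
        return sample;
    }
}
```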
This is a solution I came up with during the interview. I like it because it demonstrates that often you can apply insights from a different discipline in novel ways in computer science.
I've never written a MapReduce, not even at Google. But this would be a pretty obvious use for one. The initial key-value pair would be index -> string. We would chunk the data by index to mappers. The mappers would emit the string as the key and 1 as the value. The local combining function would sum the values. The reduce step would apply the same combining function. This is like the 2nd homework exercise from a class on MapReduce.
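Using the open-source Hadoop API as a stand-in (an assumption; the post doesn't name a framework), the job is just word count with the record string as the word:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: (offset, record) -> (record, 1). The same summing reducer is
// registered as the combiner so counts are partially aggregated on each mapper.
class CountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text record, Context ctx)
            throws IOException, InterruptedException {
        ctx.write(record, ONE);
    }
}

class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        ctx.write(key, new IntWritable(sum));
    }
}
```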
Interviewing is hard. You have a brief moment of time to assess someone's ability to contribute at your company. In some ways, you want your interview problems to look like the problems a candidate is likely to face on the job. On the flip side, they need to be small and contained so that you can repeatedly ask the same question of different candidates.
Humility comes in recognizing that although the problems need to resemble the work, the solutions don't necessarily have to be solutions that you've seen before. If a candidate suggests a reservoir sample, don't just discount it, especially if you aren't absolutely sure that it won't work.
If you say that a line of code is guaranteed to run out of memory, you should absolutely know how much memory a machine would need for that line not to run out of memory. The interviewer discounted my suggestion of a MapReduce by saying that the bug was discovered on Thursday, we need a solution by next Wednesday, and that won't be enough time to stand up a MapReduce cluster. Well, if running on an instance with 64G of memory is a solution, let's just use a bigger instance, and I'll enjoy my 6 day weekend.
I have certainly beaten this horse to death. I was slightly aggravated by this interview, and I wanted to channel it into something productive. I guess making a fairly large catalog of potential solutions is a relatively productive outlet. I have tried to make the problem and the company vague enough that it won’t show up in a Google search done during the interview. Although I didn’t sign an NDA, I don’t really like exposing interview problems. I have mangled this one enough by extracting purely the frequency counting problem that it shouldn’t be an issue.
QuickSelect solves the problem of finding the kth lowest element in an unsorted array. QuickPartition is a variation on that problem, where we want the k lowest elements from an unsorted array.
I have written up two different versions in Haskell:
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE MultiWayIf #-}
{-# LANGUAGE OverloadedLists #-}
{-# LANGUAGE ScopedTypeVariables #-}
module Main where
import Control.Monad (when)
import Control.Monad.Primitive
import Control.Monad.ST.Strict
import Data.Foldable
import qualified Data.Sequence as S
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as VM
This version uses Data.Sequence to partition and accumulate the elements to be returned. If you need a Data.Vector, you can convert from a Data.Sequence. This version has the advantage of being simple. It splits the array into 3 parts in order to guarantee forward progress.
The basic algorithm is:

- Set keep to empty.
- Choose a pivot.
- Partition the array based on the pivot into lower, equal, and upper parts.
- If the lower part has fewer than k elements, add it to keep (plus any needed equal elements) and recurse into the upper part; otherwise recurse into the lower part.
quickPartition :: forall a. (Ord a, V.Unbox a) => Int -> V.Vector a -> S.Seq a
quickPartition k as = part k as S.empty
  where
    toVec :: S.Seq a -> V.Vector a
    toVec s = V.fromListN (S.length s) $ toList s

    -- Split the vector into elements below, equal to, and above the pivot.
    split :: a -> V.Vector a -> (S.Seq a, S.Seq a, S.Seq a)
    split pivot as = go as S.empty S.empty S.empty
      where
        go :: V.Vector a -> S.Seq a -> S.Seq a -> S.Seq a -> (S.Seq a, S.Seq a, S.Seq a)
        go as first mid last
          | V.null as = (first, mid, last)
          | otherwise =
              let !h = V.head as
                  !t = V.tail as
               in if | h > pivot -> go t first mid (last S.|> h)
                     | h < pivot -> go t (first S.|> h) mid last
                     | h == pivot -> go t first (mid S.|> h) last

    -- Accumulate elements known to be among the k lowest into keep.
    part 0 as keep = keep
    part k as keep
      | V.null as = keep
      | otherwise =
          let !midIdx = V.length as `div` 2
              !pivot = as V.! midIdx
              (first, mid, last) = split pivot as
              !count = S.length first
              !midCount = S.length mid
           in if | count <= k && count + midCount >= k ->
                     keep <> first <> S.take (k - count) mid
                 | count < k ->
                     part (k - count - midCount) (toVec last) (keep <> first <> mid)
                 | count > k -> part k (toVec first) keep
This version doesn't use Data.Sequence. Instead it uses mutable arrays and does a more standard quicksort-style partition, recursing only into the necessary parts. The Data.Vector library has the nice feature of allowing slices. This makes the partition step easier to write, because we don't have to keep track of low and high, or whether they are inclusive, etc. The slices modify the underlying array, so at the end we clone the relevant portion and then freeze it. I will admit that array manipulation and loops are uglier in Haskell.
quickPartition2 :: forall a. (Ord a, V.Unbox a, Show a) => Int -> V.Vector a -> V.Vector a
quickPartition2 k as =
  if V.length as <= k
    then as
    else runST $ arrange k as
  where
    -- Move the k smallest elements to the front of a mutable copy,
    -- then freeze just that prefix.
    arrange :: (PrimMonad m) => Int -> V.Vector a -> m (V.Vector a)
    arrange k as = do
      asM <- V.thaw as
      arrangeM k asM
      V.unsafeFreeze =<< (VM.clone $ VM.slice 0 k asM)

    -- Rearrange in place so the k smallest elements occupy the first k slots.
    arrangeM :: (PrimMonad m) => Int -> VM.MVector (PrimState m) a -> m ()
    arrangeM 0 asM = return ()
    arrangeM k asM = do
      let !sz = VM.length asM
          !midIdx = sz `div` 2
      pivot <- VM.read asM midIdx
      (minN, maxN) <- partitionStep pivot asM
      if | minN <= k && k <= maxN -> return ()
         | maxN < k -> arrangeM (k - maxN) $ VM.slice maxN (sz - maxN) asM
         | minN > k -> arrangeM k $ VM.slice 0 minN asM

    -- Hoare-style partition. Returns (lo, hi): cutting at any index in that
    -- range leaves elements <= pivot on the left and >= pivot on the right.
    partitionStep :: (PrimMonad m) => a -> VM.MVector (PrimState m) a -> m (Int, Int)
    partitionStep pivot asM
      | VM.length asM == 1 = return (0, 1)
      | otherwise = meetLoop 0 ((VM.length asM) - 1)
      where
        meetLoop i j
          | i > j = return (j + 1, i)
          | otherwise = do
              i' <- findLowLoop i
              j' <- findHighLoop j
              if i' <= j'
                then do
                  when (i' < j') $ VM.swap asM i' j'
                  meetLoop (i' + 1) (j' - 1)
                else meetLoop i' j'

        findLowLoop i
          | i == VM.length asM = return i
          | otherwise = do
              iElement <- VM.read asM i
              if iElement >= pivot
                then return i
                else findLowLoop (i + 1)

        findHighLoop j = do
          jElement <- VM.read asM j
          if jElement <= pivot
            then return j
            else findHighLoop (j - 1)
A set of signed-all-bits-set combs for a point B on an elliptic curve (or any group) is a collection of data tables. There are 3 parameters: n combs, with t teeth per comb and s bits of separation between teeth. As a concrete example, we will consider n = 4, t = 5, and s = 13. We need to build 4 tables that look like this:
index | contents |
---|---|
0 | B ⋅ 2^52 − B ⋅ 2^39 − B ⋅ 2^26 − B ⋅ 2^13 − B ⋅ 2^0 |
1 | B ⋅ 2^52 − B ⋅ 2^39 − B ⋅ 2^26 − B ⋅ 2^13 + B ⋅ 2^0 |
2 | B ⋅ 2^52 − B ⋅ 2^39 − B ⋅ 2^26 + B ⋅ 2^13 − B ⋅ 2^0 |
3 | B ⋅ 2^52 − B ⋅ 2^39 − B ⋅ 2^26 + B ⋅ 2^13 + B ⋅ 2^0 |
… | … |
15 | B ⋅ 2^52 + B ⋅ 2^39 + B ⋅ 2^26 + B ⋅ 2^13 + B ⋅ 2^0 |
The other 3 tables look the same, but they start at B ⋅ 2^65, B ⋅ 2^130, and B ⋅ 2^195.
We only need 16 entries in the table rather than 32, because we can negate. If the high bit is 1, we don't negate. If it is 0, we invert the lower 4 bits of the table index, and then negate the point. This in effect negates only the highest tooth.
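A tiny sketch of that bookkeeping, using plain longs as stand-in group elements so that arithmetic negation plays the role of point negation:

```java
class CombLookup {
    // fiveBits holds one comb's 5 signed digits at the current offset, encoded
    // as bits (1 = plus, 0 = minus). Complementing the low 4 bits and negating
    // the whole entry leaves the four low teeth unchanged and flips only the
    // top tooth, so a 16-entry table covers all 32 sign patterns.
    static long lookup(long[] table, int fiveBits) {
        boolean positive = (fiveBits & 0x10) != 0;
        int index = positive ? (fiveBits & 0xF) : (~fiveBits & 0xF);
        return positive ? table[index] : -table[index];
    }
}
```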
Here is the algorithm for the first step, as pseudocode. There are some shortcuts you can take with this algorithm, but I have left them out for simplicity.
teeth_sets = Array<Point>[4][4]
everything_points = Array<Point>[4](IdentityPoint)
accumulator = B
for (i = 0 to 259)
    if (i % s == 0)
        everything_points[i / (t * s)] *= accumulator
    elif (i % s == 1)
        teeth_number = (i / s) % t
        if (teeth_number != t - 1)
            teeth_sets[i / (t * s)][teeth_number] = accumulator
    accumulator = accumulator * 2
At the end of the loop, we have, for each comb, the entry with all teeth set to positive and each tooth doubled (squared, in the pseudocode's multiplicative notation). The doubled teeth are important for building out the other 15 entries that go into the table.
A Gray code for n bits is a cyclical arrangement of all n-bit strings so that each string differs from its two neighbors by exactly 1 bit. The term generally refers to a specific such code, called the reflected binary code. We will use the 4-bit Gray code. See Wikipedia for more on Gray codes.
By taking the everything point and adding or subtracting teeth, we can flip a specific bit in our signed representation. Let's give an example. The first everything point is B ⋅ 2^52 + B ⋅ 2^39 + B ⋅ 2^26 + B ⋅ 2^13 + B ⋅ 2^0. The first tooth we saved is B ⋅ 2^1. If we subtract this tooth, we get B ⋅ 2^52 + B ⋅ 2^39 + B ⋅ 2^26 + B ⋅ 2^13 + B ⋅ 2^0 − B ⋅ 2^1 = B ⋅ 2^52 + B ⋅ 2^39 + B ⋅ 2^26 + B ⋅ 2^13 − B ⋅ 2^0.
By remembering which direction we need to flip and walking a gray code, we can generate all the entries in each comb.
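The bookkeeping is small: the reflected binary Gray code is gray(i) = i XOR (i >> 1), and the bit that flips between steps i - 1 and i is the number of trailing zeros of i. A sketch:

```java
class GrayWalk {
    static int gray(int i) {
        return i ^ (i >>> 1);
    }

    // Between gray(i - 1) and gray(i), exactly this bit position toggles.
    static int flippedBit(int i) {
        return Integer.numberOfTrailingZeros(i);
    }

    // True means the flipped bit is now set, so we add the doubled tooth;
    // false means we subtract it.
    static boolean nowSet(int i) {
        return (gray(i) & (1 << flippedBit(i))) != 0;
    }
}
```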
There's a programming interview problem floating around the internet that goes something like this: given an array of numbers, for each entry, find the product of all the other entries. There's a twist to solve it without using division. It turns out the problem actually has a use in reducing multiple points to affine representations.
In a finite field, division is much more expensive than multiplication. Each of the points in our comb is represented in projective space as (X,Y,Z). For any non-zero scalar λ, the point (λX,λY,λZ) is the same point as (X,Y,Z). We can save space (and therefore lookup time) and store the points as (X/Z,Y/Z,1), where the 1 is implied. In order to do this we need to divide each point by its own Z. Here’s where the interview problem comes in: We want each point to have the same Z. Then we only need to do a single inversion. In order to do this, we have to multiply each point by every Z but its own.
Here’s the solution: Call the result OtherZ. Break the problem into two halves: OtherZBefore and OtherZAfter
Then run two loops (only the first loop is shown; the second is analogous):
accumulator = Identity
for (i = 0 to max)
    OtherZBefore[i] = accumulator
    accumulator *= points[i].Z
Then OtherZ[i] = OtherZBefore[i] ⋅ OtherZAfter[i]
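Here is the same trick sketched for field elements, with BigInteger arithmetic mod p standing in for the field implementation:

```java
import java.math.BigInteger;

class BatchOtherZ {
    // otherZ[i] = product of every z[j] except z[i], all reduced mod p.
    static BigInteger[] productsOfOthers(BigInteger[] z, BigInteger p) {
        int n = z.length;
        BigInteger[] before = new BigInteger[n];
        BigInteger[] after = new BigInteger[n];

        BigInteger acc = BigInteger.ONE;
        for (int i = 0; i < n; i++) {          // products of everything before i
            before[i] = acc;
            acc = acc.multiply(z[i]).mod(p);
        }
        acc = BigInteger.ONE;
        for (int i = n - 1; i >= 0; i--) {     // products of everything after i
            after[i] = acc;
            acc = acc.multiply(z[i]).mod(p);
        }

        BigInteger[] otherZ = new BigInteger[n];
        for (int i = 0; i < n; i++) {
            otherZ[i] = before[i].multiply(after[i]).mod(p);
        }
        return otherZ;
    }
}
```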
Tada! You have a completed comb. Now you can write the data to disk, or print it out and embed it in the source code.
I wrote an elliptic curve signature system that uses a Granger-Moss prime and has some nice properties.
Yes, I know it is unlikely to be widely adopted. There were several reasons to write a signature system (one of them involved the perf tool).

The signature system is based on scalar multiplication of points on an elliptic curve over a prime field. The prime is (t^11 − 1)/(t − 1) where t = 2^26 − 15. This prime is congruent to 3 mod 4 for fast square roots. Approximate division by t is straightforward: q = z >> 26; r = (z & (2^26 − 1)) + 15 ⋅ q, so that z = q ⋅ t + r.
The curve is an Edwards curve, defined as x^2 + y^2 = 1 + d ⋅ x^2 ⋅ y^2 with d = −49142. This is the smallest such d that yields a curve with cofactor 4. Thanks to the team at safecurves.cr.yp.to, I was able to use Sage and numerical sieves to construct proofs of safety for the curve.
The software uses a signed-all-bits-set representation for speedups and to avoid timing attacks from early termination. In signed-all-bits-set representation, a number is represented in binary, but each digit is ±1. To convert a b-bit number e to this representation mod l, write e = sum(i = 0 to b − 1, 2^i ⋅ d_i) with d_i ∈ {−1, 1}. Rearranging, (e + 2^b − 1)/2 = sum(i = 0 to b − 1, 2^i ⋅ (d_i + 1)/2). So given a binary number e, if we want its digits in SABS representation, we compute (e + 2^b − 1)/2 and read off its bits: a 1 bit means d_i = +1 and a 0 bit means d_i = −1. Since the scalars we are concerned about form a prime field, we can always divide by 2.
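A sketch of the recoding with BigInteger, where l is the prime group order, b is the number of signed digits the combs consume (assumed to be at least the bit length of l), and the scalar e is assumed to be reduced mod l:

```java
import java.math.BigInteger;

class SabsRecode {
    // Returns b digits, each +1 or -1, whose weighted sum(2^i * d_i) is
    // congruent to e mod l. Bit i of (e + 2^b - 1) / 2 (mod l) is (d_i + 1) / 2.
    static int[] toSabsDigits(BigInteger e, BigInteger l, int b) {
        BigInteger halve = BigInteger.TWO.modInverse(l);   // division by 2 mod l
        BigInteger recoded = e.add(BigInteger.ONE.shiftLeft(b))
                              .subtract(BigInteger.ONE)
                              .multiply(halve)
                              .mod(l);
        int[] digits = new int[b];
        for (int i = 0; i < b; i++) {
            digits[i] = recoded.testBit(i) ? 1 : -1;
        }
        return digits;
    }
}
```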
To Sign:
There is a well known point B. The signing key is a pair (A, a), where A = a ⋅ B and a is a scalar, the secret key. Choose a session key k, then compute the point R = k ⋅ B. Use blake2b512 to produce a 512-bit value h = Hash(R, A, M), where A is the compressed public key and M is the message. Compute s = k − h ⋅ a. The signature is the pair (R, s), where R is compressed.
To Verify: Compute h = Hash(R,A,M). Check that sB + hA = R. This can be done without decompressing R. We can verify the check equation: sB + hA = (k−ha) ⋅ B + ha ⋅ B = k ⋅ B = R.
This particular system was chosen based on guidance from Dan Bernstein and Tanja Lange
The signature algorithm uses 4 5-bit combs with teeth separation of 13 bits. Combs are straightforward once you see what they are doing. The first table contains 16 entries:
index | contents |
---|---|
0 | B ⋅ 2^52 − B ⋅ 2^39 − B ⋅ 2^26 − B ⋅ 2^13 − B ⋅ 2^0 |
1 | B ⋅ 2^52 − B ⋅ 2^39 − B ⋅ 2^26 − B ⋅ 2^13 + B ⋅ 2^0 |
… | … |
15 | B ⋅ 2^52 + B ⋅ 2^39 + B ⋅ 2^26 + B ⋅ 2^13 + B ⋅ 2^0 |
We only need 16 entries because we can use the 5th bit as a sign bit. The next comb is constructed in a similar manner, starting with B ⋅ 2^65. To multiply B by a scalar, we start with acc equal to the identity and do the following 13 times: double acc, then for each of the 4 combs use the 5 scalar bits at the current tooth offset (high bit as sign, as described above) to select a table entry and add or subtract it from acc.
You always use a comb when signing, because B is known. You can use one for half of the verification, because you compute sB, and again B is known. If you need to verify several signatures from the same key, it can be effective to compute a comb for the known public key A to speed up the computation of hA.
At the very end of the article by DJB and Tanja Lange, there is a discussion of what to use to derive the session key. The article argues that using the message and the private key to generate the session key is best because it is stateless and simple, and there is no security to be gained by including randomness in the generation. This analysis misses a specific kind of attack: a fault attack. In this system, the message is hashed twice, once to generate the session key k and again to produce h. If an attacker can cause the document to be different for the second hashing, they can produce multiple signatures with the same session key, which reveals the private key: two signatures with the same k satisfy s1 − s2 = (h2 − h1) ⋅ a, so a = (s1 − s2)/(h2 − h1). If, however, the random number hardware is working and you include a random value in the session key hash, then the fault attack won't work.
This story starts with a failing test where the failure was a SEGV. Pulling up the relevant code in the debugger showed that it was a null dereference. Normally this would be straightforward to debug, but in this case the null dereference was right after a null check. I set a breakpoint at the null check, but the SIGSEGV still occurred. The breakthrough came when I set breakpoints on every instruction between the null check and the dereference. Something was jumping in between the null check and the dereference.
In the absence of a rewind debugger, I had to figure out what code was branching to the middle of a block of code. Fortunately, I had the branch target and objdump. I was able to find the offending branch by text searching the objdump disassembly for the branch target. Next I had to figure out where this branch was supposed to go. I was able to use the objdump of the unlinked objects to see that the intended target was exactly 2^14 away from where it ended up going. Whenever you see an exact power of 2 like that, you know something fishy is going on.
Why is 2^14 important? Because PPC64 conditional branches include 12 signed bits for the relative branch target. Instructions are required to be 32-bit aligned, so we get the 2 lowest bits of the branch target for free. Hence 14 bits. Conditional branches are almost always within a function, and functions rarely span 14 bits of address space, so this isn't a common issue. When a function is too large, the compiler triggers branch relaxation, scanning the branches to see if any of them are too far away. The ones that are too far are rewritten as two branches: a short conditional branch with the test inverted and the hint adjusted, and an unconditional branch. This increases the code size, so relaxation is repeated until a fixed point is reached.
This is where the actual bug was found. In order to decide which branches to relax, LLVM needs to scan the code to determine how many bytes separate a branch from its target. This sounds like it would be easy on PPC64. Every instruction is 4 bytes. The problem with that logic is that not everything that LLVM is working with is an instruction yet. Some of them are pseudo instructions. I don’t remember how I found which pseudo instruction was miscounted, but there was a two-instruction pseudo-op that was incorrectly sized.
In order for the bug to trigger, there needed to be a branch whose distance to its target was just over the limit, but with enough miscounted pseudo instructions that LLVM believed it was still in range and skipped relaxing it. Most codebases would never see the bug. But Google has a huge codebase and a large testing infrastructure that re-runs tests after dependent code changes. We were able to find the bug only because we had a large number of chances to find it.
The most important lesson is that you can find even odd corner-case bugs by thoroughly bounding and documenting behavior. Sometimes you get a stroke of insight, like setting breakpoints on every instruction or noticing that the branch is off by exactly 2^14. But another fun lesson is that for some PRs, the time invested is not matched by the number of lines changed.