I recently was given an interview question where the hardest part was to conduct a frequency count on a large set of strings. I have thoughts on how the interview went and I’ll share them at the end, but I wanted to run through several different ways that you could solve this problem.
Here’s the basic setup of the problem: You have 6 billion strings, and their frequencies are distributed according to a power law. There are approximately 100 million unique strings, and you need to find the top K, where K is approximately 100 thousand. The strings are each around 50 characters on average. The strings are distributed in a peaky manner, meaning that a given string tends to appear near other occurrences of the same string, rather than being spread uniformly at random through the 6 billion entries.
The problem was presented in Java. Here is an outline of a naive solution.
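Something along these lines, as a rough sketch, assuming the records arrive as an `Iterable<String>` and that all of the unique strings and their counts fit in memory:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NaiveTopK {
    // Count every string in a HashMap, then sort the unique strings by count.
    public static List<Map.Entry<String, Long>> topK(Iterable<String> records, int k) {
        Map<String, Long> counts = new HashMap<>();
        for (String s : records) {
            counts.merge(s, 1L, Long::sum);
        }
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()));
        return entries.subList(0, Math.min(k, entries.size()));
    }
}
```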
The supposed problem with the algorithm above is that a hashmap is guaranteed to run out of memory. The interviewer specifically said that “There isn’t a machine around where this wouldn’t run out of memory.”
The first few solutions below look at changing the environment rather than the algorithm.
Is it really true that no machine could hold this, as the interviewer said? I was suspicious when he asserted it so flatly.
The default load factor for a HashMap in Java is 75%. A 50-character String takes up roughly 144 bytes (more on 64-bit). Each entry in the HashMap consumes another 36 bytes (assuming 64-bit Java and doubling the size of all references). An Integer value object would consume another 16 bytes.
Adding these up, we have 144 + 16 + 36 = 196 bytes per entry. Let’s round up to 256 = 2^8 bytes to be conservative. We have 100 million unique strings; 1 million is ~2^20, so 100 million is ~2^27. We would need approximately 2^35 bytes of memory, plus load factor and extras. 2^32 is 4 GB, so 2^35 is 32 GB. Adjusting for the load factor, the fact that we need 1.5x the memory when the table expands, and everything else in the JVM, let’s say 64 GB. Even if I’m off by a factor of 2, a machine with 128 GB of memory is not unreasonable.
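As a sanity check, the same back-of-the-envelope numbers in code (the per-object sizes are the rough figures assumed above, not measurements):

```java
public class MemoryEstimate {
    public static void main(String[] args) {
        long perString = 144;   // ~50-char String object (rough figure from above)
        long perEntry  = 36;    // HashMap entry overhead with 64-bit references
        long perBoxed  = 16;    // boxed Integer count
        System.out.println("per record: " + (perString + perEntry + perBoxed) + " bytes"); // 196
        long rounded = 256;                 // round up to 2^8 to be conservative
        long unique  = 1L << 27;            // ~134 million, rounding 100 million up
        System.out.println("table: " + (rounded * unique) / (1L << 30) + " GB");           // 32 GB
    }
}
```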
Much of the overhead of a HashMap in Java is due to Java and not to the hash map itself. We can allocate the table to the correct size initially and avoid the 1.5x expansion factor. The entries themselves can also be smaller thanks to inline storage (fewer pointers), storing primitive ints, and less string overhead. I’m more confident about the memory usage of C++, so let’s say we would only need 32 GB. Java can call C++ via the JNI, and the hash map is purely temporary, so this is an option. If we need more memory reductions, we can compute a 128-bit hash of the strings in question so that we only store 16 bytes per string, rather than ~100 bytes for a 50-character string.
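The hashing trick looks roughly like this in Java (the same idea applies in C++). MD5 is used here purely as a convenient 128-bit hash, and the digest is wrapped in a ByteBuffer so it can serve as a map key; both are illustrative choices, not the only options:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HashedKey {
    // Replace each ~100-byte string key with a fixed 16-byte (128-bit) digest.
    // MD5 is chosen only because it is a readily available 128-bit hash, not for security.
    public static ByteBuffer keyFor(String s, MessageDigest md5) {
        md5.reset();
        byte[] digest = md5.digest(s.getBytes(StandardCharsets.UTF_8));
        return ByteBuffer.wrap(digest); // ByteBuffer implements equals/hashCode over its contents
    }

    public static MessageDigest newMd5() {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("MD5 is required to be present", e);
        }
    }
}
```

A map keyed on these digests stores 16 bytes per key instead of the full string; the tradeoff is that recovering the actual top-K strings takes either a second pass or keeping one example string per digest.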
In C++, you don’t necessarily have to have sufficient RAM for all the entries at once. Especially given that the strings are peaky in their uses, you could mmap an array larger than will fit in memory, backed by a file on SSD, and use that array to back a hash table. The least recently used portions of the table will automatically be flushed to SSD by the kernel. Using this strategy is strictly an engineering tradeoff: is it fast enough for our use case? Given that the data is peaky, I would expect the kernel’s caching to be effective here.
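Java can reach the same kernel machinery through memory-mapped files. A rough sketch, assuming an open-addressed table of 16-byte slots (64-bit hash, 64-bit count); note that a single MappedByteBuffer tops out at 2 GB, so a table of the size discussed above would need several mappings:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapCountTable {
    private static final int SLOT_BYTES = 16; // 8-byte hash + 8-byte count
    private final MappedByteBuffer table;
    private final int slots;

    public MmapCountTable(Path file, int slots) throws IOException {
        this.slots = slots;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // The kernel pages the least recently used parts of this mapping out to the file.
            table = ch.map(FileChannel.MapMode.READ_WRITE, 0, (long) slots * SLOT_BYTES);
        }
    }

    public void increment(String s) {
        long h = hash64(s);
        int idx = (int) Math.floorMod(h, (long) slots);
        // Linear probing: find the slot already holding this hash, or the first empty slot.
        while (true) {
            int offset = idx * SLOT_BYTES;
            long storedHash = table.getLong(offset);
            long count = table.getLong(offset + 8);
            if (count == 0 || storedHash == h) {
                table.putLong(offset, h);
                table.putLong(offset + 8, count + 1);
                return;
            }
            idx = (idx + 1) % slots;
        }
    }

    private static long hash64(String s) {
        long h = 0xcbf29ce484222325L; // FNV-1a, chosen only as a simple 64-bit hash
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }
}
```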
Both of these solutions look at the data in two passes. One is exact, and the other is probabilistic.
Here we need sufficient disk space to store a second copy of all 6 billion records. We run through all of the records and compute a hash of each string. We take the high 16 bits of the hash and write the record to a file named with those 16 bits in hex. This creates at most 65,536 files, and the advantage is that every occurrence of a given string lands in a single file. We can then build a hashmap for the first file safely within memory and discard all but the top K. We repeat for the next file, then compare hashmaps to keep only the overall top K. This strategy works because the binning is deterministic, so we don’t miss any uses of a string.
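A sketch of the binning pass, assuming the records stream in one string per line and that String.hashCode is an acceptable stable hash:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class HashBinner {
    // Append each record to one of up to 65,536 bin files, chosen by the top
    // 16 bits of the string's hash. Every occurrence of a given string lands
    // in the same file, so each file can later be counted independently.
    public static void bin(Iterable<String> records, Path dir) throws IOException {
        BufferedWriter[] writers = new BufferedWriter[1 << 16];
        try {
            for (String s : records) {
                int bin = s.hashCode() >>> 16;   // top 16 bits of a stable hash
                if (writers[bin] == null) {
                    writers[bin] = Files.newBufferedWriter(dir.resolve(String.format("%04x", bin)));
                }
                writers[bin].write(s);
                writers[bin].newLine();
            }
        } finally {
            for (BufferedWriter w : writers) {
                if (w != null) w.close();
            }
        }
    }
}
```

In practice you would cap the number of simultaneously open files (or make several passes over subsets of the bins), since 65,536 open buffered writers is a lot of file handles and buffer memory on its own.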
An array of primitive ints in Java is much more memory efficient than a hashmap of the same size, because it avoids boxed wrappers and doesn’t store the keys at all.
We could allocate an array of f * 100 million integers, where f is configurable. In the first pass through the records, we hash each string and increment the counter at the corresponding index in the array. This is a naive hash table in that we don’t handle collisions at all; colliding strings simply update the same counter.
At the end of the first pass, we know the total number of records and can estimate the number of collisions in the table. We can then pick a threshold t that admits roughly the top K entries, padded by a 3-sigma estimate of the collisions.
We can then run through the records a second time, skipping any record whose entry in the table is lower than t.
This solution is probabilistic in that there is a small chance that a collision could cause us to choose an imperfect set of high frequency strings. If a probabilistic solution is acceptable, this has the nice property of requiring no extra disk space.
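A sketch of the two passes, assuming a stable hash of the strings and a table of f * 100 million counters; f and the threshold t are the tunable parts, chosen as described above:

```java
import java.util.HashMap;
import java.util.Map;

public class CountingArrayFilter {
    // First pass: bump a counter per hash slot, ignoring collisions entirely.
    // Collisions only ever inflate the counts in a slot.
    public static int[] firstPass(Iterable<String> records, int tableSize) {
        int[] counts = new int[tableSize];
        for (String s : records) {
            counts[slot(s, tableSize)]++;
        }
        return counts;
    }

    // Second pass: keep exact counts only for strings whose slot cleared the
    // threshold t; everything else is skipped without allocating anything.
    public static Map<String, Long> secondPass(Iterable<String> records, int[] counts, int t) {
        Map<String, Long> exact = new HashMap<>();
        for (String s : records) {
            if (counts[slot(s, counts.length)] >= t) {
                exact.merge(s, 1L, Long::sum);
            }
        }
        return exact;
    }

    private static int slot(String s, int tableSize) {
        return Math.floorMod(s.hashCode(), tableSize);
    }
}
```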
I have one more probabilistic solution. This solution was discounted by my interviewer because he didn’t understand the algorithm I was describing. If the frequencies are distributed according to a power law, then this solution is very likely to work, and it has better performance properties than either of the two-pass solutions.
Reservoir Sampling is a standard technique for choosing a sample from a stream of items without revisiting them. The way we would use this algorithm to solve the frequency problem is to take a 0.1% sample of the logs. We will be left with 6 Million records, and then we can run the hashmap algorithm on the sample. With high probability, the elements that are most populous in the sample will be the most populous elements in the full set of records.
The interviewer claimed that a reservoir sample wouldn’t work because the data was peaky, but the reservoir sampling algorithm is designed specifically to choose a uniformly random sample. If you kept the sampled records in their original order, the sample itself would still look peaky.
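A minimal sketch of reservoir sampling (Algorithm R), which keeps a fixed-size, uniformly random sample from a stream of unknown length:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class ReservoirSampler {
    // Algorithm R: after n items have gone by, each of them is in the reservoir
    // with probability sampleSize / n, regardless of the order they arrived in.
    public static List<String> sample(Iterable<String> records, int sampleSize, Random rng) {
        List<String> reservoir = new ArrayList<>(sampleSize);
        long seen = 0;
        for (String s : records) {
            seen++;
            if (reservoir.size() < sampleSize) {
                reservoir.add(s);
            } else {
                long j = (long) (rng.nextDouble() * seen); // uniform in [0, seen)
                if (j < sampleSize) {
                    reservoir.set((int) j, s);
                }
            }
        }
        return reservoir;
    }
}
```

Running the naive hashmap solution over the resulting 6 million records is then an easy in-memory job.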
This is a solution I came up with during the interview. I like it because it demonstrates that often you can apply insights from a different discipline in novel ways in computer science.
I’ve never written a MapReduce, not even at Google. But this would be a pretty obvious use for one. The initial key-value pair would be index -> string. We would chunk the data by index to the mappers. Each mapper would emit the string as the key with a value of 1. A local combining function would sum the values, and the reduce step would apply the same combining function. This is like the second homework exercise from a class on MapReduce.
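If the cluster happened to be Hadoop, this is essentially the canonical word-count job; a sketch of the mapper and reducer, with the reducer doubling as the combiner (wired up via job.setCombinerClass):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class FrequencyCount {
    // Map: (record index, string) -> (string, 1)
    public static class CountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, ONE);
        }
    }

    // Combine/Reduce: (string, [1, 1, ...]) -> (string, total count)
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }
}
```

Selecting the top K from the reduced counts is a small final step over 100 million (string, count) pairs.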
Interviewing is hard. You have a brief window of time to assess someone’s ability to contribute at your company. In some ways, you want your interview problems to look like the problems a candidate is likely to face on the job. On the flip side, they need to be small and contained so that you can repeatedly ask the same question of different candidates.
Humility comes in recognizing that although the problems need to resemble the work, the solutions don’t necessarily have to be solutions that you’ve seen before. If a candidate suggests a reservoir sample, don’t just discount it. Especially if you aren’t absolutely sure that it won’t work.
If you say that a line is guaranteed to run out of memory, you should absolutely know how much memory a machine would need in order for that line not to run out of memory. The interviewer discounted my suggestion of a MapReduce by saying that the bug was discovered on Thursday, a solution was needed by the next Wednesday, and that wasn’t enough time to stand up a MapReduce cluster. Well, if running on an instance with 64 GB of memory is a solution, let’s just use a bigger instance, and I’ll enjoy my six-day weekend.
I have certainly beaten this horse to death. I was slightly aggravated by this interview, and I wanted to channel it into something productive. I guess making a fairly large catalog of potential solutions is a relatively productive outlet. I have tried to make the problem and the company vague enough that it won’t show up in a Google search done during the interview. Although I didn’t sign an NDA, I don’t really like exposing interview problems. I have mangled this one enough by extracting purely the frequency counting problem that it shouldn’t be an issue.