Cardinality Estimation

ProcessOne
· 3 min read
Send by email

In server world, we always need to maintain some metrics; We need to measure to improve. A very common one being “unique active user” per unit of time. While this is really easy to describe, it’s complex when it comes to implementation.

Naive implementation logs all events (let say, user connection), either on memory or disk, and count number of unique entries in the log on a given time frame by removing duplicates.
Well, this is really consuming and can not scale on real server handling millions of user. Better take the probabilistic estimation way… and here comes a very impressive algorithm.

Back in 2007, a team at INRIA published a paper about an efficient algorithm for estimating the number of distinct elements, known as the cardinality, of large data ensembles. This paper by Philippe Flajolet, Éric Fusy, Olivier Gandouet and Frédéric Meunier can be found here.
There are many blogposts about it already, but this paper is worth reading to make your day !

The HyperLogLog algorithm is just genius thanks to uncommon approach that make it require only few Kilobytes of data to give an accurate cardinality estimation on large sets. Basically, the idea is to rely on probability of distribution of user id numeric representation (hashes).

The HyperLogLog algorithm works for maximal cardinalities in the range [0..109] while handling a number of registers (m) in the range [24..216].
Of course, using less registers reduce computing time and memory use, at cost of cardinality estimation precision.

So, what is best value of m (number of registers) for my need ?
Well, that all depends ! Of course, we will run it on huge production services, so it must be fast. It also must be accurate in most cases.

Let’s assess errors margin and approximations

From the original paper, we have an estimation of the standard error of the algorithm:

Let σ ≈ 1.04/√m represent the standard error; the estimates provided by HYPERLOGLOG
are expected to be within σ, 2σ, 3σ of the exact count in respectively 65%, 95%, 99%
of all the cases.
bitsmσ 2σ 3σ
101024±3.25% ±6.50% ±9.75%
112048±2.30% ±4.60% ±6.90%
124096±1.62% ±3.26% ±4.89%
138192±1.15% ±2.30% ±3.45%
1416384±0.81% ±1.62% ±2.43%
1532768±0.57% ±1.14% ±1.71%
1665536±0.40% ±0.81% ±1.23%

Let say one need cardinality approximation ±~3%, with a trust level of 95%.
One can get cardinality ±3.26% using only 4096 registers.
Now if i want error <2% with a trust level of 99%, I must use 215 or 216 registers.

Now, what about execution runtime and memory consumption ?

Let’s play with the erlang implementation available on github thanks to Shyun Yeoh (vaxelfel).

First, let’s check how much Erlang VM memory is needed to store one hyperloglog record, given number of registers m

mmemory in bytes
20482845
40966173
819213341
1638428701
3276862469
65536131101

The memory consumption is approximative, and depends on default heap size and other low level memory parameters. But it gives the big picture anyway.

At last, let’s try to update an HyperLogLog record one million time with unique values, and report time and error margin:

merrorms
2048+3.41%2178
4096+2.31%2414
8192+0.71%6343
16384-0.49%11100
32768+0.16%25667
65536+0.10%9584

Without surprise, the consumed time in HyperLogLog update is almost proportional to data structure size in memory, at one exception.
When using m=216, there is an approximate 5x speedup on this implementation. I guess this is related to some binary matching optimization on Erlang VM but I could not get measures clearly stating this.

Parallelization

A production server never run on its own. The magic with HyperLogLog algorithm is that merging data from different servers is possible as long at they run with the same hash function and the same number of registers.
The resulting estimate of merged data gives cardinality for whole cluster. Bravo !

Conclusion

For rapid cardinality estimation using Erlang implementation, there is two main use of HyperLogLog:

  • For fast operation and minor resources consumption, use of 212 registers.
  • For precise estimation, if memory is a problem, use 214 registers, if CPU is a problem use 216 registers.