Happy Birthday Paradox
By Noam YadgarThe Birthday paradox only seems like a paradox because it unexpectedly challenges our perception of probability. It states that in a group of only 23 individuals, there is more than a 50% chance that at least two are sharing a birthday. Although it may seem counterintuitive, we have the math that proves and supports this statement.
In this article, we will gradually reprogram our intuition and build the knowledge required to derive the expression that proves the paradox. This blog focuses on software engineering and system design so we will follow a real-world example of this problem found in software. But first, let’s start with basic counting.
Part I: Counting Lists
Suppose we have the following set:
\[\large{ \{a,b,c\} }\]We would like to know how many pairs can we construct from this set? We can manually build a set of all possible pairs to answer this.
\[\{(a,a),(a,b),(a,c),\\(b,a),(b,b),(b,c),\\(c,a),(c,b),(c,c)\}\]By simply counting the members of this set, we can see that it’s \(9\). Since our original question can be rephrased as how many pairs can we make out of a set of size \(3\)? For many, it would seem intuitive to think of the \(9\) as \(3^2\), where \(3\) stands for the size of our set and \(2\), for the length of a pair.
Let’s try another example and see if this pattern holds. How many triplets can we make out of the set \(\{p,q\}\)?
\[\{(p,p,p),(p,p,q),(p,q,p),(p,q,q),\\(q,p,p),(q,p,q),(q,q,p),(q,q,q)\}\]By counting, we can see that the size of this set is \(2^3 = 8\), so it seems like we can generalize the question and come up with a generalized answer:
How many lists of length \(r\) can we make out of a set of size \(n\)?
The answer appears to be \(n^r\), but how can we prove it?
Proof by mapping
We should develop an expression that will help us count the number of possible lists without actually going and counting them one by one. If we can develop such an expression, we can then algebraically check if it equals \(n^r\).
One way of doing this is by mapping each member to a number. The number we are mapping for each member will be at the base of the size of our set. For example, consider the set:
\[\large{ \{a,b,c,d\} }\]We can map each member of this set to a number in \(\text{base}_4\), so we will get:
\[\large{ \{0,1,2,3\} }\]Suppose we would like to know how many triplets we can make from this set. The answer is all the numbers ranging from \(000\) to \(333\) in \(\text{base}_4\). In other words: \(1 + 333\) in \(\text{base}_4\).
Converting to decimal
Converting any base to decimal is done by assigning an index to each number column. From right to left, we start from 0 to however long our number is (in our case, 3 digits long). We then take each digit and multiply it by the original base, raised to the power of the index, Lastly, we will sum up all the terms.
In our example of \(333\) in \(\text{base}_4\), we do:
\[3\cdot4^2 + 3\cdot4^1 + 3\cdot4^0 = 63\]By adding 1, we get \(64\) triplets.
Please note that this process is no different from the way we often write large decimal numbers (e.g., \(4\) million as \(4\cdot10^6\)).
We can generalize this process of counting as follows:
\[\large { 1 + \sum_{i=0}^{(r-1)} n^i(n-1) }\]Algebraic proof
Now we check if the following equation is true:
\[\large { 1 + \sum_{i=0}^{(r-1)} n^i(n-1) } = n^r\]\[ \cancel{1} + \cancel{n}-\cancel{1} + \cancel{n^2}-\cancel{n} + \cancel{n^3}-\cancel{n^2} \cdots n^r\cancel{-n^{r-1}} \\ = n^r \]This equation is valid if \(r\) is a natural number. Since \(r\) is indeed a natural number in our case, we can answer the question with certainty:
Provided \(r \in \mathbb{N}\), How many lists of length \(r\) can we make out of a set of size \(n\)? The answer:
\[ \huge {n^r} \]In software engineering
The question of how many things of some type we can pull out of some set, is very common in software engineering. A good example is Git’s representation of a shortened commit SHA. A full commit SHA is usually a 40-digit long hexadecimal number. However, Git’s shortened version is only seven digits. If we think of hexadecimal as a set of 16 symbols, we can calculate all the possible seven digits lists of this set and come up with the number of all possible 7-digit long, shortened SHAs:
\[ 16^7 = 268,435,456 \]Internally, Git is always storing and using the full 40-digit long SHAs
Part II: Probability
An event in sequence
Suppose we roll a die twice. What is the probability of getting the number \(5\) in both attempts? To answer this question, we can think of a die as a set of 6 possible outcomes \(\{1,2,3,4,5,6\}\) and the two attempts as a pair, where each element of the pair represents the outcome of a single roll.
Now we can ask. How many pairs can we make from a set of 6 members? Lucky for us, we just learned to answer this type of question; it’s \(6^2\). Since there’s only one occurrence of the pair \((5,5)\), our probability is \(\large{\frac{1}{6^2}}\) (~2.77%).
Another way of looking at this is to think about the probability of the first roll to produce 5, multiplied by the probability of the second roll to produce 5: \(\large{ \frac{1}{6} \cdot \frac{1}{6} = (\frac{1}{6})^2 = \frac{1}{6^2} }\).
flowchart TB 1((1)) --> 1b((1)) 1 --> 2b((2)) 1 --> 3b((3)) 1 --> 4b((4)) 1 --> 5b((5)) 1 --> 6b((6)) 2((2)) --> 1b 2 --> 2b 2 --> 3b 2 --> 4b 2 --> 5b 2 --> 6b 3((3)) --> 1b 3 --> 2b 3 --> 3b 3 --> 4b 3 --> 5b 3 --> 6b 4((4)) --> 1b 4 --> 2b 4 --> 3b 4 --> 4b 4 --> 5b 4 --> 6b 5((5)) --> 1b 5 --> 2b 5 --> 3b 5 --> 4b 5 --> 5b 5 --> 6b 6((6)) --> 1b 6 --> 2b 6 --> 3b 6 --> 4b 6 --> 5b 6 --> 6b linkStyle 28 stroke:#1f1,stroke-width:4px,color:red;
For any event that randomly produces an outcome from \(n\) possibilities, we can say that the probability of \(r\) times of that event to produce some list of outcomes is \(\large{\frac{1}{n^r}}\) or simply:
\[ \large{ (\frac{1}{n})^r } \]In the Git’s shortened SHA example, any commit has a chance of \(\large{(\frac{1}{16})^7}\) to produce some shortened SHA. One way of looking at this is, to imagine that any insertion of a hexadecimal digit to a position in our shortened SHA, is an event that can produce an outcome of 1 digit, out of 16. Since we have 7 positions, we have \(16^7\) possible SHAs, and we are interested in just 1, thus, \(\large{\frac{1}{16^7}}\).
A sequence of events
Let’s return to our scenario of rolling a die twice and slightly modify the question. What is the probability of not getting the same number in both attempts? We can answer it by doing a similar process and using some basic common logic.
There are \(6^2\) possible outcomes of rolling a die twice, and there are \(6\) outcomes that have similar numbers in both attempts (\((1,1) \cdots (6,6)\)). Therefore, the probability is \(\large{ \frac{6^2 - 6}{6^2} = \frac{30}{36} = \frac{5}{6}}\) (about 83.3%).
This question becomes significantly harder to solve if we increase the number of attempts. For example, if we roll a die three times, what is the probability of not getting a number more than once?
The probability should be something out of \(6^3\) possibilities. But counting all triplets with more than one similar number is a little harder. If we slowly count them all, we get \(96\) triplets so the probability is:
\[\frac{6^3 - 96}{6^3} = \frac{120}{216} = \frac{5}{9} \]To make things easier, we should come up with a systematic and consistent way of counting all of these possibilities, regardless of size.
Permutations
Permutations are all the ways we can arrange the order of some set without repetitions. To count all the permutations of a set, we take the factorial product of its size. For example, counting the permutations of a set with a size of 6 elements is:
\[6! = 6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1 = 720 \]The factorial process is closely related to the process of counting lists. For example, the number of possible triplets we can make from a set of three members is \(3^3\). You can also think about it as \(3\cdot3\cdot3\), where the first number represents three items we can choose for the first position of the triplet, the second number also represents three items we can choose for the second position, and the third represents three items we can choose for the third position.
flowchart TB 1((1)) --> 1b((1)) 1 --> 2b((2)) 1 --> 3b((3)) 2((2)) --> 1b 2 --> 2b 2 --> 3b 3((3)) --> 1b 3 --> 2b 3 --> 3b 1b --> 1c((1)) 1b --> 2c((2)) 1b --> 3c((3)) 2b --> 1c 2b --> 2c 2b --> 3c 3b --> 1c 3b --> 2c 3b --> 3c
Now, if we count all possible ordered sets from a set of three members (i.e., lists without repetitions), for the first position, we can choose three numbers, for the second position we are left to choose two numbers, and in the last position, we can only choose one number; that is by definition \(3! = 3\cdot2\cdot1 \) possibilities
flowchart TB 1((1)) --> 2b((2)) 1 --> 3b((3)) 2((2)) --> 1b((1)) 2 --> 3b 3((3)) --> 1b((1)) 3 --> 2b 2b --> 1c((1)) 2b --> 3c((3)) 3b --> 1c 3b --> 2c((2)) 1b --> 2c 1b --> 3c linkStyle 0,7 stroke:#1f1 linkStyle 1,9 stroke:#f11 linkStyle 5,6 stroke:#f1f linkStyle 4,10 stroke:blue linkStyle 3,8 stroke:salmon linkStyle 2,11 stroke:orange
Instead of counting lists that have more than one similar number and subtracting from the total, counting all the permutations directly, will be easier.
Our question is a bit more complex, as we are trying to find all the permutations of triplets that have no repeated numbers out of a set of 6 elements. We need a way to cut the factorial process mid-way, proportionally to the size of our subsets. To do this, we can use the \(nPr\) formula, where \(n\) is the size of our set (in our case 6) and \(r\) is the size of our subsets (in our case 3).
\[ P(n,r) = \frac{n!}{(n-r)!} \]By pluging in \(P(6,3)\) we get \(120\) (the number of all three attempts without repetitions).
Part III: The birthday problem
If you take some time to examine some of the possible triplets, you may realize that the probability of not getting the same number more than once in three attempts, implicitly includes the likelihood of two attempts as well.
For example:
flowchart TB 12((1,2)) --> 3b((3)) 12 --> 4b((4)) 12 --> 5b((5)) 12 --> 6b((6)) 34((3,4)) --> 1b((1)) 34 --> 2b((2)) 34 --> 5b((5)) 34 --> 6b((6))
Earlier, we said that the probability of some event that randomly picks an item out of \(n\) possibilities is \(\large{ \frac{1}{n} }\), and if we want to calculate a sequence of results that are coming from \(r\) times of that event, it will be \(\large{ (\frac{1}{n})^r }\). You may now realize that calculating the probability of any sequence of events is done by multiplying the probabilities of all events together.
In our example, we have three events with three probabilities, all linked together by the size of 6:
- The first roll of a die, with a probability of \(\large{ \frac{6}{6} }\) to get a sequence without repetitions.
- The second roll, with probability of \(\large{ \frac{5}{6} }\), because the first attempt already takes one number.
- The third roll, with a probability of \(\large{ \frac{4}{6} }\), because the first couple of attempts already take two numbers.
We found that the probability of not getting the same number more than once within three attempts, is \(\large{\frac{5}{9}}\), let’s see if multiplying the probabilities of the three events leads us to the same result:
\[ \frac{6}{6} \cdot \frac{5}{6} \cdot \frac{4}{6} = \frac{120}{216} = \frac{5}{9} \]Indeed, we’ve reached the same result. Conveniently, we don’t have to count all possible sequences to calculate this probability. Instead, we multiply the probabilities of each discrete event to yield the same result.
Formula
Now we know that to calculate the probability of finding a list of length \(r\) without repetitions is the process of: \(\frac{n-1}{n} \cdot \frac{n-2}{n} \cdots \frac{n-r}{n}\). We would like to describe the probability efficiently. We can take the \(nPr\) formula and divide it by the total number of possibilities:
\[ \large{ \frac{\frac{n!}{(n-r)!}}{n^r} } \]The above formula evaluates the likelihood of producing no collisions; since we are interested in the opposite, we will subtract the entire expression from 1. Let’s subtract from 1 and rearrange:
\[ \large{ 1 - \frac{n!}{(n-r)! \cdot n^r} } \]Let’s go to our Git example and check the likelihood of at least two commits sharing a similar shortened SHA within 20,000 commits:
\[ 1 - \frac{16^7!}{(16^7-20,000)! \cdot 16^{140,000}} \approx 52.5\%\]If we think about how many lists of \(20,000\) shortened SHAs we can make out of \(16^7\), it’s easier to believe that there is at least a 50% chance that, in one list, we will have more than one identical shortened SHA.
xychart-beta x-axis [10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000] x-axis "Items per 16^7" y-axis "Probability of collision" 0 --> 1 bar [0.16996338, 0.52531973, 0.81297238, 0.94923026, 0.99050505, 0.99877662, 0.99989140, 0.99999336, 0.99999972, 0.99999999] line [0.16996338, 0.52531973, 0.81297238, 0.94923026, 0.99050505, 0.99877662, 0.99989140, 0.99999336, 0.99999972, 0.99999999]
Big numbers
Using the formula above, we are forced to deal with big numbers like \(16^7!\) or \(16^{7\cdot20,000}\). These numbers usually cannot be computed using standard calculators and are often calculated by programs that use various special techniques.
It’s important to remember that the formula above is just another way of writing:
\[ 1 - \frac{n-1}{n} \cdot \frac{n-2}{n} \cdots \frac{n-r}{n} \]In fact, for computing the Git example, I wrote a Python function that iterates through \(r\) and gradually multiplies the current term with the next term.
def birthday(n, r) -> float:
p = 1
for i in range(1, r+1):
p *= (n-i)/n
return 1 - p
>>> birthday(16**7, 20000)
0.5253197304787406
It is much easier for a computer to multiply 20,000 terms, than to evaluate large factorial and exponent terms.
Approximation
In computing, we say that \(r\) introduces linear complexity (\(O(n)\)). As \(r\) is getting larger and larger, the more computing power we need to calculate this type of probability and indeed, sometimes we want to check collisions for big collections (e.g., SHA-512).
To overcome this, we can develop a reasonable approximation using the famous Euler constant \(e\).
This part of the article requires a basic understanding of calculus, Taylor series , and \(e\).
In the function \(e^x\), when the absolute distance of \(x\) is much smaller than one (\(|x| \ll 1\)), using the Taylor series expansion, we can approximate \(e^x \approx 1 + x\) (the first term of the expansion).
Earlier, we have shown that to find the likelihood of \(r\) items without collisions is:
\[ \frac{n-1}{n} \cdot \frac{n-2}{n} \cdots \frac{n-r}{n} \]Which can also be written as:
\[ (1-\frac{1}{n}) \cdot (1-\frac{2}{n}) \cdots (1-\frac{n-r}{n}) \]If \(n\) is very big, relative to \(r\), each term of this series has a fraction that is much smaller than 1, more importantly, its absolute distance from 0 is much smaller than 1. Taking our Git example, the first term is \((1-\large{\frac{1}{16^7}}) \), we see that \(|(-\frac{1}{16^7})| \ll 1\). Therefore, we can approximate: \(e^{-1/16^7} \approx (1-\frac{1}{16^7})\).
We can replace all the terms in our expression with their corresponding approximations:
\[ e^{-1/n} \cdot e^{-2/n} \cdots e^{-r/n} \]Which can be rewritten as:
\[ \large{ e^{-(1+2 \cdots + r)/n} }\]The sum of \(r\) consecutive numbers can be expressed in terms of average as: \(\large{\frac{r+r^2}{2}}\). We rewrite the function as: \(e^{-\large{ \frac{(r+r^2)/2}{n} }}\) and finally, we can multiply the exponent by 2, invert the likelihood by subtracting from 1, and get our final expression:
\[\large{ 1 - e^{- \frac{r+r^2}{2n} } }\]Since in many scenarios \(n\) is significantly bigger than \(r\), we can ignore the addition of \(r\) in the numerator and still get a very good approximation:
\[\large{ 1 - e^{- \frac{r^2}{2n} } }\]from math import e
def birthday_approx(n, r) -> float:
return 1 - e**(-1*(r*r)/(2*n))
>>> birthday_approx(16**7, 20000)
0.5252932621873736
SHA-512
Now that we have a good approximation for the birthday problem, we can reverse the question and determine how many items produce \(P\) probability of collision. Let’s take SHA-512 and see how many SHAs we should use to raise the collision probability to 50%.
SHA-512 is represented as a 128-long hex number, so the number of possibilities is a giant 155-digit long number:
\[ 16^{128} = 1.340780793\cdot10^{154} \]We can place this number in the expression and equate to 0.5:
\[ 1-e^{-r^2/2(16^{128})}=0.5 \]Because \(1-x=0.5\) means that \(x=0.5\), the following is also true:
\[ e^{-r^2/2(16^{128})}=0.5 \]Using \(\large{e^{\ln{x}}} =x\), we can rewrite the equation as:
\[ e^{-r^2/2(16^{128})}=e^{\ln{0.5}} \]Now, we can omit \(e\):
\[ \frac{-r^2}{2(16^{128})}=\ln{0.5} \]Multiplying by \(-2(16^{128})\):
\[ r^2 = 1.8587168528\cdot10^{154} \]Taking the square root of this number, we get the number of SHAs that raises the collision likelihood to 50%:
\[ r = \sqrt{1.8587168528\cdot10^{154}} \\ r = 1.363347664\cdot10^{77} \]How big is that number? To raise the likelihood of collision to 50%, we would need a number with 78 digits, that’s around the number of atoms in the universe.