## Table of Contents

## Introduction

I share my birthday with my mom. On ~~my~~ her birthday I asked if she
had heard of the following puzzle:

How many people have to be in a room before the probability that there is a shared birthday is greater than 50%?

This is called the Birthday Problem, and the solution is known as
the *birthday paradox*. It is an interesting problem because:

- The answer is counterintuitive (hence the name
*birthday paradox*). - The solution relies on elements of combinatorics, set theory, and probability.
- The ramifications affect the security of cryptographic hash algorithms.

The next section covers the math that you’ll need. If you’re already familiar with probability, feel free to skip ahead.

## Probability Crash Course

Here is an extremely abbreviated summary of of math that you’ll need from combinatorics, probability, and set theory:

- The expression
`N!`

, where`N`

is a non-negative integer, is called a factorial, and it is means “the product of all the integers from`N`

down to`1`

”. So`5! = 5 × 4 × 3 × 2 × 1 = 120`

. To make our lives easier,`0! = 1`

by definition. - In probability, an occurance is called an
*event*, and the probability of an*event*`A`

is denoted`P(A)`

. - The value of a probability is a real number between
`0`

and`1`

, inclusive (written`P(A) ∈ [0, 1]`

). Inclusive means that`0`

and`1`

are valid probabilities, where`0`

means “this event is impossible” and`1`

means “this event will occur with certainty”. - You can convert a probability to a percentage by multiplying by
`100`

and appending a trailing`%`

. So`P(A) = 0.1`

means`10%`

,`P(A) = 0.2`

means`20%`

, and so on. - The probability of two independent events
`A`

and`B`

both occurring is written`P(A ⋂ B)`

and read as “the joint probability of A and B”. The value is calculated as`P(A) × P(B)`

. This pattern of multiplication continues, so the probability of three independent events`A`

,`B`

, and`C`

occurring is`P(A ⋂ B ⋂ C) = P(A) × P(B) × P(C)`

, and so on. The upside-down U symbols represent an intersection. - The probability of an event
`A`

**not**occurring is written`P(¬A)`

, called the*compliment*, and calculated as`1 - P(A)`

. This is useful because sometimes it is much easier to calculate the probability that an event will not occur, as we’ll see below.

Here’s what all the chicanery above looks like visually:

To calculate a discrete (countable) probability, you sum up all the matching events, then divide the sum by the total number of possible events.

For example, if you put one red marble and three blue marbles in a jar and then randomly choose a single marble out of the jar, then the probability that you will choose the red marble is:

```
P(R) = probability of choosing a red marble
P(R) = number of red marbles / total number of marbles
P(R) = 1/(1 + 3) = 0.25 = 25%
```

Here’s a more complicated example: Let’s say you put 3 red marbles and 5
blue marbles in a jar, pick a marble from the jar at random, and then
roll a fair, 6-sided die. What is the probability that you will
pick a red marble from the jar **and** roll a 5 or a 6 on the die?

```
P(R) = probability of choosing a red marble
P(R) = number of red marbles / total number of marbles
P(R) = 3/(3 + 5) = 0.375 = 37.5%
P(D) = probability of rolling a 5 or a 6 on the die
P(D) = number of matching die faces / total number of die faces
P(D) = 2/6 = ~0.333 = ~33.3%
P(R ⋂ D) = P(R) × P(D)
P(R ⋂ D) = 3/8 × 1/3 = 0.125 = 12.5%
```

## Solving the Birthday Problem

Let’s get back to finding a solution.

To save time, I’ll tell you in up front advance that this is one of
those problems where it much easier to calculate the compliment; that
is, the probability that everyone has a *unique* birthday.

If there is only one person in the room, then the probability that
everyone present has a unique birthday is `1.0`

(`100%`

).

```
P1 = P(unique first birthday) = 365/365
P1 = 1.0 = 100%
```

If there are two people in the room, the probability that everyone has a unique birthday can be calculated by multiplying the probability from the previous step by the probability that the second person’s birthday is not the same as the first person’s birthday. Algebraically:

```
P2 = P(unique second birthday)
P2 = P1 × 364/365
P2 = 365/365 × 364/365 = (365 × 364)/(365^2) = 364/365
P2 = ~0.9972 = 99.7%
```

Add a third person, and a pattern begins to emerge:

```
P3 = P(unique second birthday ⋂ unique third birthday)
P3 = P2 × 363/365
P3 = 365/365 × 364/365 × 363/365 = (365 × 364 × 363)/(365^3)
P3 = ~0.9917 = ~99.2%
```

Do you see the pattern?

The probability that everyone in a group of `N`

people has a unique
birthday is the product of the numbers between `365`

and `365 - (N - 1)`

(inclusive), divided by `365`

.^{N}

We can calculate the product of the numbers between `365 - (N - 1)`

and
`365`

by dividing `N!`

by the product of terms we don’t care about. I
don’t have a name for this (the equation is the same as a permutation
without repetition, but that’s just a coincidence), so let’s call it
a “truncated factorial”:

```
TF(X, Y) = product of numbers between X and Y (inclusive) where X ∈ [1, Y)
TF(X, Y) = Y!/(X - 1))!
TF((365 - (N - 1)), 365) = 365!/((365 - (N - 1) - 1))!
TF(366 - N, 365) = 365!/((365 + -N + 1 + -1))!
TF(366 - N, 365) = 365!/(365 - N)!
```

Combined with the previous equation:

```
U(N) = P(N unique birthdays)
U(N) = 365!/((365 - N)! × 365^N)
```

Then we calculate the compliment of the probability of N unique birthdays to determine the probability of shared birthdays in a group of N people:

```
S(N) = P(shared birthdays among N people)
S(N) = P(¬U(N)) = 1 - U(N)
S(N) = 1 - 365!/((365 - N)! × 365^N)
```

At this point you can set `S(N) = 0.5`

and solve for `N`

to find an
exact solution, but often for discrete probabilities it’s easier to
write a script and let a computer do the work for you.

So I wrote a couple of scripts which use the derived algorithm to do the following:

- Calculate the probability of shared birthdays in a room containing
`N`

people for`N ∈ [1, 50]`

. - Exclude intermediate rows in the table to keep it short.
- Highlight the relevant results (values of
`N`

where`S(N) > 0.5`

). - Render the bar chart and table in the next section.

The full script is available in the companion GitHub repository. Here is the Ruby code for our derived equation:

```
# memoize 365!
F365 = 365.factorial
#
# Calculate the probability of at least one shared birthday in a group
# of N people.
#
def shared_birthdays(v)
1 - F365 / (365 - v).factorial * (365 ** -v)
end
```

## Solution

Here are the results of the script from the previous section:

Number of People | Probability | Percent |
---|---|---|

1 | 0 | 0.00% |

2 | 0.00273 | 0.27% |

3 | 0.0082 | 0.82% |

4 | 0.01635 | 1.64% |

… | ||

22 | 0.47569 | 47.57% |

23 | 0.50729 | 50.73% |

24 | 0.53834 | 53.83% |

25 | 0.56869 | 56.87% |

So the answer to the birthday problem is:

**If there are 23 or more people in the room, then the probability of at
least one shared birthday is greater than 50%**.

## Implications

The answer is called the *birthday paradox* is because it doesn’t feel
intuitive. When asked, most people guess that a much higher number of
people (50 or 130) are needed before the probability of a shared
birthday is greater than 50%.

The mistake is that we aren’t just looking for a person in the room that
shares a birthday with a single person. Instead, we are looking for
*anyone* who shares a birthday with *anyone else*. As a result, the
probability of a shared birthday increases much faster than expected.

This has implications for the collision resistance property of cryptographic hash functions. Collision resistance means that it is hard to find two inputs which hash to the same output.

The birthday paradox means that for all hash functions with an N-bit
output, the probability of finding a collision for a random input is
greater than `0.5`

after calculating `2`

hashes. To
make matters worse, searching for hash collisions is an embarassingly
parallel problem, and programs like hashcat are network-aware and
GPGPU-accelerated.^{N/2}

This is one of the reasons that the output sizes of cryptographic hash functions are much larger than non-cryptographic hash functions. SHA-2 and BLAKE2, for example, generate an output of 256 bits or more, while non-cryptographic hash functions typically only generate an output of 32 or 64 bits.

## Conclusion

My mom was not amused when I asked if 23 people were present when I was born.

## Further Reading

- Companion GitHub repository
- The Birthday Problem (Wikipedia)
- Inkscape: Used to create the probability basics figure.
- matplotlib: Used to create the shared birthday probability chart.
- hugo-shortcode-table: Used to create the shared birthday probability table.