🎂 Simulating the Birthday Problem using Python
4 min read

🎂 Simulating the Birthday Problem using Python

So recently I discovered the Birthday Problem and I was intrigued by the paradox.

What is the Birthday Problem?

Suppose you have a group of people, how big do you think that group of people would need to be in order for there to be a high certainty of two people sharing a birthday with each other?

For simplicity, you can ignore leap years and assume that all birthdays are equally likely (and there are no twins).

The birthday problem tells us that for a given set of 23 people, the chance of two of them being born on the same day is 50%. For a set of 50 people, this would be 97%. For 75 people, it is 99.97%.

Let that sink in: 23 people and 50% chance that there's a matching birthday! That sounds quite crazy, right? How is it possible that we only need such a small number of people to be very certain of a shared birthday?

The math

So how can we mathematically calculate this probability to prove our hypothesis?

In order to find the answer, we will flip the problem around: we will calculate the probability of people not sharing a birthday, which can be done using combinatronics.

Probability of a match + probability of no match is equal to 1. So we can work it out like this:

First we assume that a first person with a birthday exists. The probability of this person 1 having a birthday is \( \frac{365}{365} \). Then we multiply that number by the probability that person 2 doesn't share the same birthday: \( \frac{364}{365} \). We then multiply that number with the probability that person 3 doesn't share a birthday with either person 1 or person 2.

This way you get:

\( \frac{365}{365} \cdot \frac{364}{365} \cdot \frac{363}{365} \)

To get the probability that 23 people don't share the same birthday we do this until:

\( \frac{365}{365} \cdot \frac{364}{365} \cdot \frac{363}{365} \ldots \frac{343}{365} \)

A general formula to calculate the probability that N people don't share the same birthday, we can set up as:

\( \frac{365!}{(365-n)! \cdot 365^n} \)

So for 23 people, we get:

\( \frac{365!}{342! \cdot  365^{23}} \)

But this is the probability that they don't share a birthday.

To get the probability that they do, we subtract this number from 1:

\( 1 - \frac{365!}{342! \cdot  365^{23}} \)

from math import factorial
1 - (factorial(365) / (factorial(342) * 365**23))

And indeed, if we run this, we get 0.5073.

Why does it work?

The trick for getting a > 50% of a match in a relatively small group, is that we're not comparing 1 birthday with the other 22 birthdays. We are actually comparing each person's birthday to all other 22 person's birthdays, and do that for the whole group of people. The number of possible pairs or combinations grows very fast (quadratically) as the number of people in a group grows.

You can calculate the number of combinations using:

n(n-1)/2

So a group of 23 people has 253 combinations.

A group of 50 people then has 1225 combinations.

Simulating the birthday problem

import numpy as np
import matplotlib.pyplot as plt

Then we create a Simulation class:

class Simulation():
    birthday_options = np.arange(1,366)
    
    def __init__(self, n_simulations, group_sizes):
        self.n_simulations = n_simulations
        self.group_sizes = group_sizes
        
    def _is_match(self, group):
        return len(set(group)) < len(group)
        
    def _simulate_group(self, group_size):
        matches = 0
        for sim in range(self.n_simulations):
            group = np.random.choice(self.birthday_options, size=group_size, replace=True)
            if self._is_match(group):
                matches += 1
        return matches
    
    def run(self):
        probs = []
        for size in self.group_sizes:
            matches = self._simulate_group(size)
            prob = matches / self.n_simulations
            probs.append(prob)
        return probs

We set the number of simulations to run per group size and the group sizes (1 to 100 in this case).

simulations = 10000
group_sizes = np.arange(1, 101)

Now we can instantiate a Simulation instance which we can run using the .run()method.

sim = Simulation(simulations, group_sizes)
probs = sim.run()

The .run()method iterates over all group sizes and for each of these runs a simulation. So from all birthday possibilities in a year, it draws a sample of size group_size (with replacement) and then checks whether there are two matching birthdays in that sample. We do this n_simulation times, and then calculate the % of times we get a matching birthday for that group size.

The simulation run returns the probabilities related to the different group sizes.

The probs variable can then nicely be plotted against the group_sizes.

plt.plot(group_sizes, probs)
plt.xlabel('group size')
plt.ylabel('probability')
plt.title('Probability of a birthday match')
plt.vlines([23], 0, 1, colors=['indianred'], linestyles=['dashed'], label='n=23')
plt.legend(loc='lower right')
plt.show()
Probability of a birthday match across group sizes
Probabiltiy of a birthday match across group sizes

As you can see, the simulation results in roughly the same number as the mathematical equation: ~50%.

That's it for today! Hope this tutorial made you see how easy it is to make simulations in python and how they can help you visualise mathematical phenomena!

Let's keep in touch! 📫

If you would like to be notified whenever I post a new article, you can sign up for my email newsletter here.

If you have any comments, questions or want to collaborate, please email me at lucy@lucytalksdata.com or drop me a message on Twitter.