Summary

This document is intended to be a concise report to display the results of a simulation of the Monty Hall problem. For more information about this problem, go to https://en.wikipedia.org/wiki/Monty_Hall_problem.

This simulation was created as part of the Data Science Certificate in the class Methods for Data Analysis at University of Washington.

The idea is to observe the differences of switching or not switching door choice after the host shows one of the three doors. We will see these differences in terms of distributions of probabilties, and also mean, standard error and variance of probabilities, for each scenario (switching and not switching).

This report will show all the code used with the explanation of what it does, and at the end we will show the results of the simulation.

As extra work, just for fun, two shiny apps were built around this simulation (their information are on their pages):


Understanding the functions

The simulation uses three functions that are the base of the process, as described below.

# Won the game or not
sim.won <- function(prize.door, first.door, switching){
  as.numeric(ifelse(first.door == prize.door, !switching, switching))
}

# Simulates randomly n door choices (1/3 probability for each door initially)
sim.choose.door <- function(n){
  door.prob <- runif(n)
  ifelse(door.prob <= 1/3, 1, ifelse(door.prob <= 2/3, 2, 3))
}

# Simulates randomly n choices of switching (TRUE) or not (FALSE)
sim.choose.switching <- function(n){
  switching.prob <- runif(n)
  ifelse(switching.prob <= 1/2, TRUE, FALSE)
}

The function sim.won defines if a given scenario is a winning one (returns 1) or not (returns 0) by knowing the door with the prize (prize.door), the door chosen first (first.door) and if the participant swicthes or not (switching) the first choice of door.

sim.choose.door simulates randomly n number of door choices, using a 1/3 probability of initially chosing any door, returning the chosen door (1, 2 or 3). Whereas sim.choose.switching simulates randomly n times the option of switching the first door choice (returns TRUE) or not (returns FALSE), using a 1/2 probability of switching.

With these three functions defined we can create the game simulation function.

# Simulates n games and calculates the probability of winning by switching or not switching
sim.game <- function(n=1000){
  prize.door <- sim.choose.door(n) # Define prize door (n times)
  first.door <- sim.choose.door(n) # Chose first door (n times)
  switching <- sim.choose.switching(n) # Define if will switch first door (n times)
  won <- sim.won(prize.door, first.door, switching) # Calculates the outcome of each run
  
  df <- data.frame(won = won, switching = switching) # Put runs in data frame
  
  # Calculate probabilities from the outcomes
  prob.win.switching <- table(df$won, df$switching)[2,2]/sum(table(df$won, df$switching)[,2])
  prob.win.not.switching <- table(df$won, df$switching)[1,2]/sum(table(df$won, df$switching)[,1])
  
  return(c(prob.win.switching, prob.win.not.switching))
}

The function sim.game simulates n runs of the Monty Hall problem and calculates, using the results of the n runs, the probability of winning after switching or not switching the initial door choice.

If we run sim.game many times we will have several results of probabilities to analyze. With this in mind, we created the following function.

# Plots distributions of probabilities using data frame with columns prob.win.switching and prob.win.not.switching
plot.probs <- function(df, bins = 50){
  require(ggplot2)
  require(gridExtra)
  
  # Plot for Switching
  bw <- (max(df$prob.win.switching) - min(df$prob.win.switching))/(bins - 1)
  h1 <- ggplot(df, aes(prob.win.switching)) + geom_histogram(binwidth = bw) + 
    ggtitle('Switching') + xlab('Probability of Winning') + ylab('Frequency') + 
    geom_vline(xintercept = mean(df$prob.win.switching), colour='red') + 
    scale_x_continuous(breaks = c(0.20,0.30,0.40,0.50,0.60,0.80, round(mean(df$prob.win.switching),3)), limits=c(0.20, 0.80))
  
  # Plot for NOT Switching
  bw <- (max(df$prob.win.not.switching) - min(df$prob.win.not.switching))/(bins - 1)
  h2 <- ggplot(df, aes(prob.win.not.switching)) + geom_histogram(binwidth = bw) + 
    ggtitle('NOT Switching') + xlab('Probability of Winning') + ylab('Frequency') + 
    geom_vline(xintercept = mean(df$prob.win.not.switching), colour='red') +
    scale_x_continuous(breaks = c(0.20,0.40,0.50,0.60,0.70,0.80, round(mean(df$prob.win.not.switching),3)), limits=c(0.20, 0.80))
  
  grid.arrange(h1, h2, nrow = 2, top="Distributions of Probability of Winning")
}

The function plot.probs receives a data frame with columns names prob.win.switching and prob.win.not.switching and plots the distribution of probability of winning by switching or not switching first door choice. It is worth noticing that the mean probability is shown in a red vertical line.

Now to automaticaly run sim.game a repetead number of times and analyze the results, we use the following function.

# Executes, repeated times, n runs of Monty Hall problem, also showing plot and descriptive results
dist.game <- function(reps=300, n=100){
  # Minimum n is 100
  if(n < 100){
    n <- 100
  }
  # Minimum reps is 300
  if(reps < 300){
    reps <- 300
  }
  
  dist <- data.frame(prob.win.switching = rep(0, times = reps),
                     prob.win.not.switching = rep(0, times = reps))
  
  # Repeating, "reps" times, "n" runs of Monty Hall problem 
  for(i in 1:reps){
    dist[i, ] <- sim.game(n)
  }
  
  # Plots distributions
  plot.probs(dist)
  
  # Returns descriptive results
  return(cat("<br/><br/>
                  <li>Mean of Probability of Winnning by Switching: <span style='color:red'>", 
                 round(mean(dist$prob.win.switching), 3), "</span></li>",
                 "<li>Std of Probability of Winnning by Switching: <span style='color:red'>", 
                 round(sqrt(var(dist$prob.win.switching)), 3), "</span></li>",
                 "<li>Var. of Probability of Winnning by Switching: <span style='color:red'>", 
                 round(var(dist$prob.win.switching), 5), "</span></li><br/>",
             
                 "<li>Mean of Probability of Winnning by NOT Switching: <span style='color:red'>", 
                 round(mean(dist$prob.win.not.switching), 3), "</span></li>",
                 "<li>Std of Probability of Winnning by NOT Switching: <span style='color:red'>", 
                 round(sqrt(var(dist$prob.win.not.switching)), 3), "</span></li>",
                 "<li>Var. of Probability of Winnning by NOT Switching: <span style='color:red'>", 
                 round(var(dist$prob.win.not.switching), 5), "</span><br/>", sep=""))
}

The function dist.game above executes, repeated times, n runs of Monty Hall problem, also showing the plot with the distribution of probabilities of winnning by switching or not switching, and returns some descriptive results (mean, standard error and variance) of the probabilities.

Running the Simulation

Finally, we can now run the simulation of the Monty Hall problem. The functions created give us the freedom to choose the number of runs used to calculate the probabilities of winning and also define the number of repetitions of these runs, to allow us to analyze the results later.

Below we run 1000 repetitions with 1000 runs of the Monty Hall problem each.

dist.game(1000, 1000)


  • Mean of Probability of Winnning by Switching: 0.668
  • Std of Probability of Winnning by Switching: 0.022
  • Var. of Probability of Winnning by Switching: 0.00047

  • Mean of Probability of Winnning by NOT Switching: 0.332
  • Std of Probability of Winnning by NOT Switching: 0.029
  • Var. of Probability of Winnning by NOT Switching: 0.00085

    The results show that switching allows us to have a higher chance of winning, approximately 66% against 33% of not switching. Is is worth noticing that these results are completed aligned with the results given using probability theory.

    As a final note, we run the simulation 20 times with default values, to observe the running time.

    library(microbenchmark)
    benchmark <- microbenchmark(
      dist.game(),
      times = 20
    )
    print(benchmark)
    ## Unit: seconds
    ##         expr     min       lq     mean   median       uq      max neval
    ##  dist.game() 1.00555 1.031424 1.055917 1.046032 1.086596 1.128799    20

    We can see that the mean simulation time is less than 2 seconds, resulting in great computing time.