Data Analysis in Social Science

Solving probability problems using simulation

# Set the random seed
set.seed(12345)

BERNOULLI TRIAL

Here we flip a fair coin, where a 1 = heads and a 0 = tails. We get a 1 (heads). You can change the prob parameter, but the important point is that for a Bernoulli trial, n and size both need to be one.

rbinom(n = 1, size = 1, prob = 0.50)

## [1] 1

Let’s flip 100 coins! using the replicate() function.

replicate(100,
                  rbinom(n = 1, size = 1, prob = .5))

##   [1] 1 1 1 0 0 0 1 1 1 0 0 1 0 0 0 0 0 0 1 0 0 1 1 1 0 1 1 0 0 1 0 0 1 0 0
##  [36] 1 1 1 0 1 0 1 1 0 0 0 0 0 1 1 1 0 0 1 0 1 0 0 0 1 0 1 1 1 0 1 0 1 1 1
##  [71] 1 0 0 0 1 1 1 1 0 1 1 0 0 0 0 1 1 1 0 1 1 0 1 1 1 1 1 0 0 0

# let's assign it to an object called heads
heads <- replicate(100,
                  rbinom(n = 1, size = 1, prob = .5))

And we can look at the relative frequency:

mean(heads)

## [1] 0.61

Not quite .5, but close. What if we used more samples?

heads = replicate(10000,
                  rbinom(n = 1, size = 1, prob = .5))

mean(heads)

## [1] 0.5028

Much better!

This is the essence of “simulation”.

BINOMIAL DISTRIBUTION

The full binomial distribution, simply extends what we have already been doing. We can count the head of the repeated Bernoulli trails:

sum(heads)

## [1] 5028

We can use the Binomial distribution to do the same thing:

rbinom(n = 1, size = 10000, prob = .5)

## [1] 5023

Now the number won’t be exactly the same (i.e., these are random draws), But the underlying process is the same.

Now, we can use these functions to solve most (all?) of the probability problems addressed in this week’s lecture. Let’s modify the parameters of the binomial distribution so we can see what happens. Going back to the example from class, what’s the probablity of having 2 girls in a family of 3 children?

k = 2 (the number of “successes” or girls) n = 3(the total number of “trials” (the number of children) p = 0.487 (the probability of having a girl

dbinom(2, size = 3, prob = .487)

## [1] 0.3650031

Exercise

Suppose there are twelve multiple choice questions in an English class quiz. Each question has five possible answers, and only one of them is correct. Find the probability of having four correct answers if a student attempts to answer every question at random. Since only one out of five possible answers is correct, the probability of answering a question correctly by random is 1/5=0.2.

dbinom(4, size = 12, prob=0.2)

## [1] 0.1328756

dbinom() gives the density of the binomial distribution

Monty Hall

Randomly select the doors which have the prize behind it

prize <- sample(1:3, 10000, replace = TRUE)

Randomly select the contestant’s door choice

stay <- sample(1:3, 10000, replace = TRUE)


reveal <- rep(0, 10000)
change <- rep(0, 10000)

for(i in 1:10000) {
  x <- c(1:3)[-c(prize[i], stay[i])]
  
  ## If contestant chose wrong door, then only one possible door to reveal
  ## If contestant already chose the correct door, then two possible doors to reveal
  ## If so, choose randomly which door to reveal
  reveal[i] <- x[sample.int(length(x), size = 1)]
  
  ## This changes from the original choice to the other door not revealed
  change[i] <- c(1:3)[-c(reveal[i], stay[i])]
}

Number of wins for changing or staying with initial door respectively

changewin <- ifelse(change == prize, 1, 0)
staywin <- ifelse(stay == prize, 1, 0)

Proportion of wins

change_perc <- mean(changewin)
stay_perc <- mean(staywin)

change_perc

## [1] 0.672

stay_perc

## [1] 0.328

What this tells you is that your probability of winning the prize changes (from 1/3 before the door reveal) to 2/3 if you choose to switch doors afer the reveal. If you don’t switch, your chance of winning remains at 1/3.

Birthday problem (Imai’s book, page 244)

The problem asks the number of people one needs in order for the probability that at least two people have the same birthday to exceed 0.5, assuming that each birthday is equally likely.

For the birthday problem, we will be sampling with replacement k unique birthdays out of 365 days. Sampling with replacement means that for each of k draws, every one of 365 days is equally likely to be sampled regardless of which dates were sampled before. In other words, the fact that one person is born on a certain day of the year does not exclude someone else from being born on the same day. Once we sample k possibly non-unique birthdays, we check whether they are all different. After repeating this sampling procedure many times, we compute the fraction of simulation trials where at least two birthdays are the same, and this fraction serves as an estimate of the corresponding probability.

k <- 23 # number of people
sims <- 1000 # number of simulations
pr <- rep(NA, sims) # container for the estimates
event <- 0 # counter
for (i in 1:sims) {
  days <- sample(1:365, k, replace = TRUE)
  days.unique <- unique(days) # unique birthdays
  ## if there are duplicates, the number of unique birthdays
  ## will be less than the number of birthdays, whic is `k'
  if (length(days.unique) < k) {
    event <- event + 1
  }
}

Fraction of trials where at least two bdays are the same

answer <- event / sims

answer

## [1] 0.502

With a group of about 23 people the probability that two people share the same birthday is about 0.5.

https://rpubs.com/StatGirl302/TheBirthdayProblem or https://github.com/KobaKhit/Bday-Problem-in-R

The end!

Data Analysis in Social Science - lab2

Raluca Popp

30 January 2018