In today’s lab, we’ll use R to do random sampling. We can use R to sample from a set of outcomes that we enumerate or from any of a number of well-known distributions (like the normal distribution, binomial distribution, chi-square distribution…). Let’s start by sampling from a set of outcomes which we create. You should log in to Rstudio.cloud with your school email account and try each piece of code below. I encourage you to take your time and tweak each piece of code to make sure that you understand how it works.
Letters <- c("A", "B", "C")
“Letters” is now a vector with 3 elements, “A”, “B”, and “C”.
Try running the following line of code a few times. Is it possible to sample the same letter twice?
sample(Letters, size=2)
Now, try the following which samples 2 elements from “Letters” with replacement.
sample(Letters, size=2, replace=TRUE)
Which of the following two lines of code will return an error (try predicting the answer before running the code)?
sample(Letters, size=4)
sample(Letters, size=4, replace=TRUE)
If I sample two letters from A, B and C with replacement, how often will A be one of them? We could solve this as a probability problem or we could use brute force (meaning simply trying it over and over again). Let’s run this experiment 1000 times! We’ll use the replicate function in R:
results <- replicate(1000, sample(Letters, size=2, replace=TRUE))
“results” is now a 2 x 1000 matrix (2 rows and 1000 columns). Each column is an experiment with two sampled letters. We want to check each column and see if it contains the letter A.
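If you’d like to see this for yourself, you can peek at the matrix with a couple of quick checks:
dim(results) # should print 2 and 1000
results[, 1:5] # look at the first five experiments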
We can check the first three columns individually by running the following lines:
"A" %in% results[, 1] # for the first column of results
"A" %in% results[, 2]
"A" %in% results[, 3]
But it would take a long time to check all 1000 columns individually. Instead we can make R do this to all 1000 columns using the “apply” function. (Note: the 2 in the code below tells it to apply this function to every column. If we used 1 in place of 2, it would apply it to every row instead.)
apply(results, 2, function(x) "A" %in% x)
This returns a vector with 1000 logical values: TRUE or FALSE. We can count the number of TRUE and FALSE values as follows:
table(apply(results, 2, function(x) "A" %in% x))
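A handy shortcut: R treats TRUE as 1 and FALSE as 0, so taking the mean of this vector gives the proportion of experiments that contained an A directly:
mean(apply(results, 2, function(x) "A" %in% x)) # proportion of the 1000 experiments containing A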
Question: If we sample two letters from A, B and C without replacement, how often will we sample the letter A? Try answering this question using brute force with the help of the code above.
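If you get stuck, here is one possible way to set it up (try it on your own first!). It looks just like the code above, except that we leave out replace=TRUE so that sample uses its default of sampling without replacement. The name “results_no_replace” is just a suggestion:
results_no_replace <- replicate(1000, sample(Letters, size=2))
table(apply(results_no_replace, 2, function(x) "A" %in% x))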
Imagine that my new company makes cell phone chargers and provides one-year warranties. Of course, all of my cell phone chargers last at least one year and none of them last more than two years. They are equally likely to fail at any point in the second year of use. In math terms, I’d say that the lifespans of my chargers follow a uniform distribution with a minimum of 1 and a maximum of 2. I can simulate the lifespan of one cell phone charger by doing:
runif(n=1, min=1, max=2)
or the lifespans of 100 cell phone chargers by doing:
runif(n=100, min=1, max=2)
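Since lifespans are uniform between 1 and 2 years, the average lifespan of a charger is (1 + 2)/2 = 1.5 years. As a quick sanity check, the average of 100 simulated lifespans should usually land fairly close to that:
mean(runif(n=100, min=1, max=2)) # should typically be close to 1.5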
If someone buys 100 cell phone chargers, how likely are they to have an average lifespan less than 1.4 years? We could answer this mathematically (although we don’t know how to do that quite yet) but, once again, let’s use brute force and simply simulate this 1000 times:
cell_lives <- replicate(1000, runif(n=100, min=1, max=2))
“cell_lives” is now a matrix with 100 rows and 1000 columns. Each column is one simulation (or one sample of a possible set of 100 cell phones).
We can once again use the apply function to get averages for each of these 1000 simulations.
sample_averages <- apply(cell_lives, 2, mean)
Let’s take a look using a histogram!
hist(sample_averages)
We could also get a quick summary of these values:
summary(sample_averages)
Depending on your sample (we’ll each have a different one), you might find that all of the averages are over 1.4. So, any average less than 1.4 must be quite rare! We’ll need more samples to get a sense of just how rare. Let’s simulate this 100,000 times and look at the results. I would do this a million times but that takes too long!
cell_lives <- replicate(1e5, runif(n=100, min=1, max=2))
sample_averages <- apply(cell_lives, 2, mean)
table(sample_averages < 1.4)
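As before, since TRUE counts as 1 and FALSE as 0, taking the mean of this logical vector turns the count into an estimated probability:
mean(sample_averages < 1.4) # estimated probability that the average lifespan is under 1.4 years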
Now imagine 10 basketball players who each make 40% of their shots and who each take 30 shots. How many buckets do they each make?
We can simulate each basketball player’s 30 shots as a binomial random variable. Note: This is (at least a little) wrong! We have reason to believe that basketball shooting is not perfectly independent (and the “hot hand” exists). But, perhaps, this model of reality is close enough to being true that it is useful. Here is a simulation of 10 players, each taking 30 (independent!) shots with a 40% chance of success on each shot:
rbinom(n=10, size=30, prob=0.4)
Let’s look at the results:
shots_made <- rbinom(n=10, size=30, prob=0.4)
hist(shots_made)
mean(shots_made); sd(shots_made)
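For comparison (this is just a check, not something you need to memorize), a binomial with size 30 and probability 0.4 has a theoretical mean of 30 × 0.4 = 12 made shots and a theoretical standard deviation of sqrt(30 × 0.4 × 0.6), which is about 2.68. With only 10 players, your simulated mean and standard deviation may be somewhat off from these values:
30 * 0.4 # theoretical mean number of shots made
sqrt(30 * 0.4 * 0.6) # theoretical standard deviation, about 2.68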
Maybe this model isn’t useful because some players are better shooters than others. In other words, some players have a better chance of making each shot and some players have a worse chance. Let’s imagine that shooting ability is normally distributed and that players have a mean shooting ability of 40% with a standard deviation of 5%. Now, let’s simulate the shooting ability of 10 basketball players.
abilities <- rnorm(n=10, mean=0.40, sd=0.05)
hist(abilities)
Now we can take these 10 shooters and let them each take 30 (independent) shots. Their chance of making each shot will depend on their ability.
shots_made <- rbinom(n=10, size=30, prob=abilities)
hist(shots_made)
mean(shots_made); sd(shots_made)
How would you expect the standard deviation in the number of shots made in this population (with shooters of different abilities) to compare to the standard deviation in shots made in the (previous) population of shooters, each with 40% shooting ability?
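One way to check your intuition (after you’ve made a prediction!) is to simulate a much larger group of players under each model and compare the two standard deviations. Here is a possible sketch using 10,000 players so that the estimates are less noisy; the object names are just suggestions:
shots_same_ability <- rbinom(n=10000, size=30, prob=0.4) # everyone shoots exactly 40%
varied_abilities <- rnorm(n=10000, mean=0.40, sd=0.05) # abilities differ from player to player
shots_varied_ability <- rbinom(n=10000, size=30, prob=varied_abilities)
sd(shots_same_ability); sd(shots_varied_ability)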
Come up with your own scenario or probability problem that you would like to simulate. Describe the scenario and then simulate it.