Download the data file pirate_survey_noerrors.txt from http://nathanieldphillips.com/wp-content/uploads/2015/05/pirate_survey_noerrors.txt. The data are stored in a tab-separated text file with headers. Load the data into a dataframe called pirates. Because the file is tab-separated, use sep = "\t".

pirates <- read.table("http://nathanieldphillips.com/wp-content/uploads/2015/05/pirate_survey_noerrors.txt", 
                      sep = "\t", header = TRUE, stringsAsFactors = FALSE)

Question 1: Create the following histograms of the number of tattoos pirates have separately for each favorite pirate. Add appropriate labels for each plot. Hint: Use unique(pirates$favorite.pirate) as your index values. Additionally, before creating the loop, set up a 2 x 3 plotting region using par(mfrow = c(2, 3))

par(mfrow = c(2, 3))

for (favorite.pirate.i in unique(pirates$favorite.pirate)) {
  data.temp <- subset(pirates, favorite.pirate == favorite.pirate.i)
  hist(data.temp$tattoos,
       main = favorite.pirate.i, 
       xlab = "tattoos")
}

Question 2: The law of large numbers says that as the sample size grows, the sample mean tends to get closer to the true population mean. Let’s test this by conducting a simulation. For sample sizes of 1 to 100, calculate the average absolute difference between the sample mean and the population mean for samples from a Normal distribution with mean 100 and standard deviation 10.

Step 1: Create the design matrix

Step 2: Set up the loop over the rows of the design matrix

Step 3: For each row of the design matrix, extract the sample size (N).

Step 4: Draw N samples from a Normal distribution with mean 100 and standard deviation 10

Step 5: Calculate the absolute difference between the sample mean and the population mean.

Step 6: Save the difference in the design matrix

design.matrix <- expand.grid(
  "sample.size" = 1:100,
  "simulation" = 1:100,
  "result" = NA
  ) #Create the design matrix
head(design.matrix)
##   sample.size simulation result
## 1           1          1     NA
## 2           2          1     NA
## 3           3          1     NA
## 4           4          1     NA
## 5           5          1     NA
## 6           6          1     NA
# Set up the loop over the rows of the design matrix
for(row.i in 1:nrow(design.matrix)) {
  sample.size.i <- design.matrix$sample.size[row.i]
  data <- rnorm(n = sample.size.i, 
                mean = 100, 
                sd = 10)
  sample.mean <- mean(data)
  # Step 5 asks for the absolute difference, so wrap it in abs()
  diff <- abs(sample.mean - 100)
  design.matrix$result[row.i] <- diff
}

head(design.matrix)
##   sample.size simulation   result
## 1           1          1 1.936064
## 2           2          1 7.861386
## 3           3          1 5.679104
## 4           4          1 1.877948
## 5           5          1 2.648310
## 6           6          1 5.127005

Question 3: Plot your aggregate results from Question 2.

plot(design.matrix$sample.size, design.matrix$result, 
     xlim = c(0, 100), 
     ylim = c(0, 30),
     xlab = "sample.size",
     ylab = "Result",
     pch = 16, 
     col = "pink"
     )
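The plot above shows every individual simulation. One possible way to aggregate first is sketched below: average the absolute differences within each sample size using aggregate(), then plot one point per sample size. The object names sim and agg and the seed are assumptions made for this illustration, and the simulation is rebuilt inside the block only so that it runs on its own.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Rebuild the Question 2 simulation (100 simulations per sample size)
sim <- expand.grid(sample.size = 1:100, simulation = 1:100, result = NA)
for (row.i in 1:nrow(sim)) {
  x <- rnorm(n = sim$sample.size[row.i], mean = 100, sd = 10)
  sim$result[row.i] <- abs(mean(x) - 100)
}

# Mean absolute difference for each sample size
agg <- aggregate(result ~ sample.size, data = sim, FUN = mean)

plot(agg$sample.size, agg$result,
     type = "b", pch = 16,
     xlab = "Sample size",
     ylab = "Mean absolute difference")
```

The aggregated curve should fall as the sample size grows, which is the pattern the law of large numbers predicts.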

Question 4: How many people do you need in a room for the probability to be greater than 0.50 that at least two people in the room have the same birthday? Answer this question using a simulation. For example, if there are 2 people in the room, what is the probability that they have the same birthday? Now what about 3, 4, … 365 people?

Step 1: Create the design matrix

Step 2: Set up the loop over the rows of the design matrix

Step 3: For each row of the design matrix, extract the number of people in the room (N).

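Step 4: Simulate those N people in a room and figure out if at least two have the same birthday. Hint: sample N birthdays with replacement from 1:365, then compare the number of birthdays with the number of unique birthdays.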

Step 5: Save the result (TRUE or FALSE) in the design matrix

design.matrix <- expand.grid(
  "people.in.room" = 1:365,
  "simulation" = 1:100,
  "result" = NA
  ) #Create the design matrix

head(design.matrix)
##   people.in.room simulation result
## 1              1          1     NA
## 2              2          1     NA
## 3              3          1     NA
## 4              4          1     NA
## 5              5          1     NA
## 6              6          1     NA
for(row.i in 1:nrow(design.matrix)) {
  people.i <- design.matrix$people.in.room[row.i]
  # Draw one birthday (day 1 to 365) per person
  bdays <- sample(x = 1:365, size = people.i, replace = TRUE)
  # TRUE if at least two people share a birthday
  result <- length(bdays) != length(unique(bdays))
  design.matrix$result[row.i] <- result
}

head(design.matrix)
##   people.in.room simulation result
## 1              1          1  FALSE
## 2              2          1  FALSE
## 3              3          1  FALSE
## 4              4          1  FALSE
## 5              5          1  FALSE
## 6              6          1  FALSE

Question 5: Aggregate your data from question 4 and plot it.
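
A minimal sketch of one way to approach this, assuming the design matrix from Question 4: take the mean of the TRUE/FALSE results within each room size (the proportion of simulations with a shared birthday) and plot it. The names bday and agg and the seed are illustrative assumptions; the simulation is rebuilt here only so the block is self-contained.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Rebuild the Question 4 simulation (100 simulations per room size)
bday <- expand.grid(people.in.room = 1:365, simulation = 1:100, result = NA)
for (row.i in 1:nrow(bday)) {
  n.people <- bday$people.in.room[row.i]
  days <- sample(x = 1:365, size = n.people, replace = TRUE)
  bday$result[row.i] <- length(days) != length(unique(days))
}

# Proportion of simulations with a shared birthday, per room size
agg <- aggregate(result ~ people.in.room, data = bday, FUN = mean)

plot(agg$people.in.room, agg$result,
     type = "l",
     xlab = "People in room",
     ylab = "P(at least one shared birthday)")
abline(h = 0.5, lty = 2)  # reference line at 0.50

# Smallest room size whose estimated probability exceeds 0.50
min(agg$people.in.room[agg$result > 0.5])
```

With only 100 simulations per room size the estimate is noisy, so the smallest room size found may land a little above or below the analytic answer of 23.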

Question 6: What is a p-value? Let’s find out: create a vector of length 100 called p.values, where each entry is the p.value of a one-sample t-test conducted on a sample of size 10 from a normal distribution with mean 0 and standard deviation 1.
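
One way this could be done is sketched below, assuming the null hypothesis of the t-test is mean = 0 (the t.test() default, written out here as mu = 0). The seed is an arbitrary choice for reproducibility.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

p.values <- rep(NA, 100)  # container for the 100 p-values

for (i in 1:100) {
  x <- rnorm(n = 10, mean = 0, sd = 1)      # one sample of size 10
  p.values[i] <- t.test(x, mu = 0)$p.value  # one-sample t-test against 0
}

head(p.values)
```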

Question 7: Create a histogram of p.values. Additionally, what percent of the values are less than .05?
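
A possible sketch, regenerating p.values with replicate() so the block runs on its own. The bin width of .05, the seed, and the name pct.below are assumptions made for this illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# 100 p-values from one-sample t-tests on samples of size 10
p.values <- replicate(100, t.test(rnorm(n = 10, mean = 0, sd = 1), mu = 0)$p.value)

hist(p.values,
     breaks = seq(0, 1, by = 0.05),  # the first bin holds p < .05
     main = "p-values under the null hypothesis",
     xlab = "p-value")

# Percent of p-values below .05
pct.below <- mean(p.values < .05) * 100
pct.below
```

Because the null hypothesis is true in these simulations, the p-values should be roughly uniform between 0 and 1, so about 5 percent are expected to fall below .05.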

Question 8: Now, repeat the simulations from Question 6, but include separate simulations for normal distributions with means ranging from 0 to 5 in steps of .5. Keep the null hypothesis the same (mean = 0) for all tests.

Step 1: Create the design matrix

Step 2: Set up the loop over the rows of the design matrix

Step 3: For each row of the design matrix, extract the population mean.

Step 4: Draw 10 samples from a Normal distribution with mean equal to the population mean from Step 3 and standard deviation 1

Step 5: Calculate the t-test and extract the p.value

Step 6: Save the p.value in the design matrix
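
The steps above can be sketched as follows. The number of simulations per mean (100, matching the length of p.values in Question 6), the column names, and the seed are assumptions, since the question does not fix them.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Step 1: design matrix with one row per (population mean, simulation)
design.matrix <- expand.grid(
  "population.mean" = seq(0, 5, by = 0.5),
  "simulation" = 1:100,  # assumed number of simulations per mean
  "p.value" = NA
)

# Step 2: loop over the rows of the design matrix
for (row.i in 1:nrow(design.matrix)) {
  # Step 3: extract the population mean
  mean.i <- design.matrix$population.mean[row.i]
  # Step 4: draw 10 samples from a Normal distribution
  x <- rnorm(n = 10, mean = mean.i, sd = 1)
  # Steps 5 and 6: test against mean = 0 and save the p-value
  design.matrix$p.value[row.i] <- t.test(x, mu = 0)$p.value
}

head(design.matrix)
```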

Question 9: Create histograms of the simulations from Question 8, with one histogram for the simulations from each mean value.
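
A minimal sketch of one approach, reusing the loop-over-unique-values pattern from Question 1. Means from 0 to 5 in steps of .5 give 11 distinct values, so this sketch assumes a 3 x 4 plotting region. The name sims, the seed, and the 100 simulations per mean are also assumptions, and the simulation is rebuilt only so the block runs on its own.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Rebuild the p-value simulation across population means
sims <- expand.grid(population.mean = seq(0, 5, by = 0.5),
                    simulation = 1:100,
                    p.value = NA)
for (row.i in 1:nrow(sims)) {
  x <- rnorm(n = 10, mean = sims$population.mean[row.i], sd = 1)
  sims$p.value[row.i] <- t.test(x, mu = 0)$p.value
}

mean.values <- unique(sims$population.mean)

par(mfrow = c(3, 4))  # room for all 11 histograms

for (mean.i in mean.values) {
  hist(sims$p.value[sims$population.mean == mean.i],
       breaks = seq(0, 1, by = 0.05),
       main = paste("mean =", mean.i),
       xlab = "p-value")
}

par(mfrow = c(1, 1))  # reset the plotting region
```

As the population mean moves away from the null value of 0, the p-values should pile up near 0.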
