Download the dataframe pirate_survey_noerrors.txt from http://nathanieldphillips.com/wp-content/uploads/2015/05/pirate_survey_noerrors.txt. The data are stored in a tab-separated text file with headers. Load the dataframe into an object called pirates. Because the file is tab-separated, use sep = "\t".
pirates <- read.table("http://nathanieldphillips.com/wp-content/uploads/2015/05/pirate_survey_noerrors.txt",
                      sep = "\t", header = TRUE, stringsAsFactors = FALSE)
Question 1: Create the following histograms of the number of tattoos pirates have separately for each favorite pirate. Add appropriate labels for each plot. Hint: Use unique(pirates$favorite.pirate) as your index values. Additionally, before creating the loop, set up a 2 x 3 plotting region using par(mfrow = c(2, 3))
par(mfrow = c(2, 3))
for (favorite.pirate.i in unique(pirates$favorite.pirate)) {
data.temp <- subset(pirates, favorite.pirate == favorite.pirate.i)
hist(data.temp$tattoos,
main = favorite.pirate.i, xlab = "tattoos")
}
Question 2: The law of large numbers says that the larger your sample size, the closer your sample statistic will tend to be to the true population value. Let's test this by conducting a simulation. For sample sizes of 1 to 100, calculate the average absolute difference between the sample mean and the population mean from a Normal distribution with mean 100 and standard deviation 10.
Step 1: Create the design matrix
Step 2: Set up the loop over the rows of the design matrix
Step 3: For each row of the design matrix, extract the sample size (N).
Step 4: Draw N samples from a Normal distribution with mean 100 and standard deviation 10
Step 5: Calculate the absolute difference between the sample mean and the population mean.
Step 6: Save the difference in the design matrix
design.matrix <- expand.grid(
"sample.size" = 1:100,
"simulation" = 1:100,
"result" = NA
) #Create the design matrix
head(design.matrix)
## sample.size simulation result
## 1 1 1 NA
## 2 2 1 NA
## 3 3 1 NA
## 4 4 1 NA
## 5 5 1 NA
## 6 6 1 NA
#set up the loop
for (row.i in 1:nrow(design.matrix)) {
  sample.size.i <- design.matrix$sample.size[row.i]
  data <- rnorm(n = sample.size.i,
                mean = 100,
                sd = 10)
  sample.mean <- mean(data)
  diff <- abs(sample.mean - 100)  # absolute difference, as Step 5 specifies
  design.matrix$result[row.i] <- diff
}
head(design.matrix)
## sample.size simulation result
## 1 1 1 1.936064
## 2 2 1 7.861386
## 3 3 1 5.679104
## 4 4 1 1.877948
## 5 5 1 2.648310
## 6 6 1 5.127005
Question 3: Plot your aggregate results from Question 2.
plot(design.matrix$sample.size, design.matrix$result,
xlim = c(0, 100),
ylim = c(0, 30),
xlab = "sample.size",
ylab = "Result",
pch = 16,
col = "pink"
)
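The plot above shows every individual simulation rather than an aggregate. One possible way to aggregate first is sketched below, assuming the same design-matrix layout as in Question 2 and that the absolute difference is the quantity of interest. The seed, the object name agg, and the axis labels are illustrative choices, not part of the original solution.

```r
# Self-contained sketch: rebuild the Question 2 simulation, then aggregate.
set.seed(1)  # assumption: seed added only so the sketch is reproducible

design.matrix <- expand.grid(
  "sample.size" = 1:100,
  "simulation" = 1:100,
  "result" = NA
)

for (row.i in 1:nrow(design.matrix)) {
  sample.size.i <- design.matrix$sample.size[row.i]
  x <- rnorm(n = sample.size.i, mean = 100, sd = 10)
  design.matrix$result[row.i] <- abs(mean(x) - 100)
}

# Average absolute difference across the 100 simulations of each sample size
agg <- aggregate(result ~ sample.size, data = design.matrix, FUN = mean)

plot(agg$sample.size, agg$result,
     type = "b", pch = 16,
     xlab = "Sample size",
     ylab = "Mean absolute difference from population mean")
```

Averaging over simulations gives one point per sample size, so the downward trend predicted by the law of large numbers is easier to see than in the raw scatter.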
Question 4: How many people do you need in a room for the probability to be greater than 0.50 that at least two people in the room have the same birthday? Answer this question using a simulation. For example, if there are 2 people in the room, what is the probability that they have the same birthday? Now what about 3, 4, … 365 people?
Step 1: Create the design matrix
Step 2: Set up the loop over the rows of the design matrix
Step 3: For each row of the design matrix, extract the number of people in the room (N).
Step 4: Simulate those N people in a room and figure out if at least two have the same birthday. Here’s a Hint:
Step 5: Save the result (TRUE or FALSE) in the design matrix
design.matrix <- expand.grid(
"people.in.room" = 1:365,
"simulation" = 1:100,
"result" = NA
) #Create the design matrix
head(design.matrix)
## people.in.room simulation result
## 1 1 1 NA
## 2 2 1 NA
## 3 3 1 NA
## 4 4 1 NA
## 5 5 1 NA
## 6 6 1 NA
for (row.i in 1:nrow(design.matrix)) {
  people.i <- design.matrix$people.in.room[row.i]
  bdays <- sample(x = 1:365, size = people.i, replace = TRUE)
  result <- length(bdays) != length(unique(bdays))
  design.matrix$result[row.i] <- result
}
head(design.matrix)
## people.in.room simulation result
## 1 1 1 FALSE
## 2 2 1 FALSE
## 3 3 1 FALSE
## 4 4 1 FALSE
## 5 5 1 FALSE
## 6 6 1 FALSE
Question 5: Aggregate your data from question 4 and plot it.
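No solution is given for this question, so the following is only a sketch of one approach, assuming the design-matrix layout from Question 4. To keep it quick, the room size is capped at 100 people (an assumption; the exercise asks for up to 365), and the seed and the name agg are illustrative.

```r
# Self-contained sketch: rerun a smaller birthday simulation, then aggregate.
set.seed(1)  # assumption: seed added only so the sketch is reproducible

design.matrix <- expand.grid(
  "people.in.room" = 1:100,  # assumption: capped at 100 to keep this fast
  "simulation" = 1:100,
  "result" = NA
)

for (row.i in 1:nrow(design.matrix)) {
  people.i <- design.matrix$people.in.room[row.i]
  bdays <- sample(x = 1:365, size = people.i, replace = TRUE)
  # TRUE if at least two people share a birthday
  design.matrix$result[row.i] <- any(duplicated(bdays))
}

# Proportion of simulations with a shared birthday, per room size
agg <- aggregate(result ~ people.in.room, data = design.matrix, FUN = mean)

plot(agg$people.in.room, agg$result,
     type = "l",
     xlab = "People in room",
     ylab = "Estimated P(shared birthday)")
abline(h = 0.5, lty = 2)  # reference line at probability 0.50

# Smallest room size whose estimated probability exceeds 0.50
min(agg$people.in.room[agg$result > 0.5])
```

The exact answer from probability theory is 23 people. With only 100 simulations per room size the estimate is noisy, so the simulated threshold may land a few people either side of that.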
Question 6: What is a p-value? Let’s find out: create a vector of length 100 called p.values, where each entry is the p.value of a one-sample t-test conducted on a sample of size 10 from a normal distribution with mean 0 and standard deviation 1.
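A minimal sketch of one way to build this vector is shown here. It assumes the t-test is against a null mean of 0 (the default for t.test()); the seed and the loop variable names are illustrative.

```r
set.seed(1)  # assumption: seed added only so the sketch is reproducible

p.values <- rep(NA, 100)  # container for the 100 p-values

for (i in 1:100) {
  x <- rnorm(n = 10, mean = 0, sd = 1)      # sample of size 10
  p.values[i] <- t.test(x, mu = 0)$p.value  # one-sample t-test against mu = 0
}

head(p.values)
```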
Question 7: Create a histogram of p.values. Additionally, what percent of the values are less than .05?
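The sketch below regenerates the p-values so that it runs on its own, then draws the histogram and computes the percentage. Because the null hypothesis is true in every simulation, the p-values should be roughly uniform between 0 and 1, with about 5% falling below .05 on average. The seed, the plot labels, and the name pct.below are illustrative.

```r
set.seed(1)  # assumption: seed added only so the sketch is reproducible

# 100 p-values from one-sample t-tests on samples of size 10 from N(0, 1)
p.values <- replicate(100, t.test(rnorm(n = 10, mean = 0, sd = 1), mu = 0)$p.value)

hist(p.values,
     main = "p-values when the null hypothesis is true",
     xlab = "p-value")

# Percent of p-values below .05
pct.below <- mean(p.values < .05) * 100
pct.below
```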
Question 8: Now, repeat the simulations from Question 6, but include separate simulations for normal distributions with means ranging from 0 to 5 in steps of .5. Keep the null hypothesis the same (mean = 0) for all tests.
Step 1: Create the design matrix
Step 2: Set up the loop over the rows of the design matrix
Step 3: For each row of the design matrix, extract the population mean.
Step 4: Draw 10 samples from a Normal distribution with mean equal to the population mean from Step 3 and standard deviation 1
Step 5: Calculate the t-test and extract the p.value
Step 6: Save the p.value in the design matrix
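The steps above can be sketched as follows. This is one possible implementation, not the original solution: the column name pop.mean, the choice of 100 simulations per mean, and the seed are all assumptions.

```r
set.seed(1)  # assumption: seed added only so the sketch is reproducible

# Step 1: design matrix with one row per (population mean, simulation) pair
design.matrix <- expand.grid(
  "pop.mean" = seq(0, 5, .5),
  "simulation" = 1:100,  # assumption: 100 simulations per mean
  "result" = NA
)

# Step 2: loop over the rows of the design matrix
for (row.i in 1:nrow(design.matrix)) {
  # Step 3: extract the population mean for this row
  pop.mean.i <- design.matrix$pop.mean[row.i]
  # Step 4: draw 10 samples from a Normal distribution with that mean
  x <- rnorm(n = 10, mean = pop.mean.i, sd = 1)
  # Steps 5 and 6: test against mu = 0 and save the p-value
  design.matrix$result[row.i] <- t.test(x, mu = 0)$p.value
}

head(design.matrix)
```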
Question 9: Create histograms of the simulations from Question 8, one for each mean value (means from 0 to 5 in steps of .5 give 11 histograms).
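One way to draw these histograms is to reuse the loop-over-unique-values pattern from Question 1. The sketch below rebuilds the p-value simulation so that it runs on its own. The column name pop.mean, the 3 x 4 plotting grid, and the seed are assumptions rather than part of the original.

```r
set.seed(1)  # assumption: seed added only so the sketch is reproducible

# Rebuild the p-value simulation: means 0 to 5 in steps of .5, null mean = 0
design.matrix <- expand.grid(
  "pop.mean" = seq(0, 5, .5),
  "simulation" = 1:100,
  "result" = NA
)

for (row.i in 1:nrow(design.matrix)) {
  x <- rnorm(n = 10, mean = design.matrix$pop.mean[row.i], sd = 1)
  design.matrix$result[row.i] <- t.test(x, mu = 0)$p.value
}

# One histogram of p-values per population mean (3 x 4 grid holds all 11)
par(mfrow = c(3, 4))

for (pop.mean.i in unique(design.matrix$pop.mean)) {
  data.temp <- subset(design.matrix, pop.mean == pop.mean.i)
  hist(data.temp$result,
       main = paste("Mean =", pop.mean.i),
       xlab = "p-value",
       xlim = c(0, 1))
}

par(mfrow = c(1, 1))  # reset the plotting region
```

As the true mean moves away from 0, the p-values should pile up near 0, which shows the test's power increasing.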