All R code should be placed inside of R chunks.

Hypothesis Testing

You may use the t.test() function to complete this assignment.

Problem 1

A new process for producing a type of novolac resin is supposed to have a mean cycle time of 3.5 hours per batch. Six batches are produced and their cycle times, in hours, were 3.45, 3.47, 3.57, 3.52, 3.40, 3.63. Can you conclude the mean cycle time is different from 3.5 hours?

# Insert R code for problem 1 here
x = c(3.45,3.47,3.57,3.52,3.4,3.63)
t.test(x,mu = 3.5, alternative = 'two.sided', conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  x
## t = 0.19426, df = 5, p-value = 0.8536
## alternative hypothesis: true mean is not equal to 3.5
## 95 percent confidence interval:
##  3.418447 3.594886
## sample estimates:
## mean of x 
##  3.506667

At the 0.05 significance level, we do not have significant evidence that the mean cycle time differs from 3.5 hours (t = 0.19426, df = 5, p-value = 0.8536). The 95% confidence interval (3.418, 3.595) contains 3.5, which agrees with this conclusion.
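As a check on the t.test() output, the test statistic can be rebuilt from the one-sample formula \(t = (\bar{x} - \mu_0)/(s/\sqrt{n})\). The sketch below is only an illustration; the names mu0, t_stat, and p_val are chosen here and are not part of the assignment.

```r
# Cycle times (hours) from Problem 1
x <- c(3.45, 3.47, 3.57, 3.52, 3.40, 3.63)
mu0 <- 3.5                      # hypothesized mean (name chosen for this sketch)
n <- length(x)

# One-sample t statistic: (sample mean - hypothesized mean) / standard error
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_val <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

round(c(t = t_stat, p = p_val), 4)
```

Both values should agree with the t and p-value lines printed by t.test() above.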

Problem 2

A new formulation sunscreen (SPF 30) was designed and tested against the current production version. Seven subjects applied the new formula while a different seven individuals applied the old product. The time it takes for erythema (redness) to occur is measured for each individual. Does the new product outperform the current product? You may assume the variances are equal.

New: 15, 21, 22, 26, 29, 35, 37

Old: 14, 19, 22, 22, 30, 31, 34

Define notation and state the hypotheses. Compute the test statistic and \(p\)-value. What is your conclusion?

# Insert R code for problem 2 here
NEW = c(15,21,22,26,29,35,37)
OLD = c(14,19,22,22,30,31,34)
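t.test(NEW, OLD, mu = 0, alternative = 'greater', var.equal = TRUE, conf.level = 0.95)
## 
##  Two Sample t-test
## 
## data:  NEW and OLD
## t = 0.45905, df = 12, p-value = 0.3272
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -5.353383       Inf
## sample estimates:
## mean of x mean of y 
##  26.42857  24.57143

Let \(\mu_{New}\) and \(\mu_{Old}\) be the mean times to erythema under the new and old formulas. The hypotheses are \(H_0: \mu_{New} \le \mu_{Old}\) versus \(H_a: \mu_{New} > \mu_{Old}\). Because the alternative is \(\mu_{New} > \mu_{Old}\), NEW is the first argument to t.test() with alternative = 'greater'.

The test statistic is t = 0.45905 with 12 degrees of freedom, and the p-value is 0.3272. At the 0.05 significance level, we do not have sufficient statistical evidence to claim that redness takes longer to appear with the new sunscreen than with the old one.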
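Since the problem asks for the test statistic, it may help to see where the equal-variance t statistic comes from. The sketch below computes the pooled variance and the statistic for the difference NEW minus OLD by hand; the names s2_pooled, se_diff, t_stat, and p_val are chosen for this illustration only.

```r
NEW <- c(15, 21, 22, 26, 29, 35, 37)
OLD <- c(14, 19, 22, 22, 30, 31, 34)
n_new <- length(NEW)
n_old <- length(OLD)

# Pooled variance: weighted average of the two sample variances
s2_pooled <- ((n_new - 1) * var(NEW) + (n_old - 1) * var(OLD)) / (n_new + n_old - 2)

# Standard error of the difference in sample means
se_diff <- sqrt(s2_pooled * (1 / n_new + 1 / n_old))

# t statistic for NEW minus OLD, and the upper-tail p-value for Ha: mu_New > mu_Old
t_stat <- (mean(NEW) - mean(OLD)) / se_diff
p_val <- pt(t_stat, df = n_new + n_old - 2, lower.tail = FALSE)

round(c(pooled_var = s2_pooled, t = t_stat, p = p_val), 4)
```

The standard error here is roughly 4 time units, more than twice the observed difference in means of about 1.86, which is why the two-sample test finds no evidence of a difference.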

Problem 3

What if I told you that each pair of sunscreen measurements in Problem 2 was taken from the same individual (left/right arm)? Does this change your results?

# Insert R code for problem 3 here
DIFF = NEW - OLD # Within-subject differences (the paired test analyzes these)
t.test(NEW, OLD, mu = 0, alternative = 'greater', paired = TRUE)
## 
##  Paired t-test
## 
## data:  NEW and OLD
## t = 2.5174, df = 6, p-value = 0.02272
## alternative hypothesis: true mean difference is greater than 0
## 95 percent confidence interval:
##  0.4236372       Inf
## sample estimates:
## mean difference 
##        1.857143

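Yes, this changes the result. With both formulas applied to the same individual, the two samples are no longer independent, so a paired t-test on the 7 within-subject differences is appropriate. Pairing removes the subject-to-subject variability, which shrinks the standard error even though the degrees of freedom drop from 12 to 6. At the 0.05 significance level, we do have sufficient evidence (t = 2.5174, df = 6, p-value = 0.02272) that redness takes longer to appear with the new sunscreen than with the old one.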
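A paired t-test is a one-sample t-test on the within-subject differences. The sketch below rebuilds the statistic from DIFF and compares it with t.test() applied to the differences; the names se_paired, t_paired, p_paired, and one_sample are chosen for this illustration.

```r
NEW <- c(15, 21, 22, 26, 29, 35, 37)
OLD <- c(14, 19, 22, 22, 30, 31, 34)
DIFF <- NEW - OLD                      # one difference per subject

# Standard error of the mean difference
se_paired <- sd(DIFF) / sqrt(length(DIFF))

# t statistic and upper-tail p-value with n - 1 = 6 degrees of freedom
t_paired <- mean(DIFF) / se_paired
p_paired <- pt(t_paired, df = length(DIFF) - 1, lower.tail = FALSE)

# The same test expressed as a one-sample t-test on the differences
one_sample <- t.test(DIFF, mu = 0, alternative = "greater")

round(c(se = se_paired, t = t_paired, p = p_paired), 4)
```

The standard error of the mean difference is well under 1, compared with roughly 4 when the two groups are treated as independent, so the same mean difference of about 1.86 produces a much larger t statistic.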

Simulations

Problem 4

In this exercise we will investigate the Central Limit Theorem (CLT). Recall, the CLT states that with a large enough sample drawn from a population with a mean \(\mu\) and a standard deviation \(\sigma\), the sampling distribution of the sample mean (\(\bar{x}\)) will be approximately normal with mean \(\mu\) and standard deviation \(\frac{\sigma}{\sqrt{n}}\):

\[\bar{x} \sim N(\mu, \frac{\sigma}{\sqrt{n}})\]

Assume the population distribution for the number of university sporting events (Events) attended in the past month by each Virginia Tech student (25,000 students) follows the distribution simulated below.

set.seed(915)
Events = c(rpois(n = 10000, lambda = 1.5), rpois(n = 10000, lambda = 10), rpois(n=5000, lambda = 25))
barplot(table(Events), xlab = "Number of Events", ylab = "Frequency",
        main = "Number of Sporting Events Attended for each Student")

  (a) Describe the center, spread, and shape of the distribution. Compute any necessary summary statistics.
# Insert R code for problem 4a here
summary(Events)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   8.000   9.602  14.000  44.000
sd(Events)
## [1] 9.144079

The median is 8 and the mean is higher, at 9.602, because the distribution is skewed to the right. The distribution is also multimodal, with peaks near 1, 10, and 25 that correspond to the three Poisson components of the population. The standard deviation is 9.14, so the values are widely spread around the mean (the range is 0 to 44).

  (b) Assume you are tasked with quantifying the average number of sporting events Virginia Tech students attend in the past month. You do not have the resources to obtain data on every Virginia Tech student, so you decide to take a random sample from all 25,000 students in the population. Draw a sample of size \(n = 10\) from the population Events and compute the sample mean (this is done for you in the code below).
Events_sample = sample(Events, size = 10, replace = FALSE)
mean(Events_sample)
## [1] 4.4

Repeat the above procedure 10,000 times. That is, draw 10,000 samples of size \(n=10\) from the population. This will give you 10,000 different sample means. Plot the histogram of the 10,000 sample means.

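# Insert R code for problem 4b here

# Draw a new sample inside replicate() so each of the 10,000 means comes from a different sample
x = replicate(10000, mean(sample(Events, size = 10, replace = FALSE)))
hist(x, xlab = "Sample Mean", main = "Sampling Distribution of the Mean (n = 10)")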

  (c) You decide that perhaps a sample size of \(n=10\) is not large enough, and instead use a sample size of \(n=50\). Repeat part (b) using a sample size of \(n=50\). Plot the histogram of the sampled means.
# Insert R code for problem 4c here
# Draw a new sample of 50 students for each of the 10,000 means
x2 = replicate(10000, mean(sample(Events, size = 50, replace = FALSE)))
hist(x2, xlab = "Sample Mean", main = "Sampling Distribution of the Mean (n = 50)")

  (d) Compare the histograms in (b) and (c). What do you notice? How does changing the sample size impact the sampling distribution (distribution of the sample means)? You may want to try different sample sizes.

Both sampling distributions are centered near the population mean of 9.6, because the sample mean is an unbiased estimator of \(\mu\) at any sample size. What changes with the sample size is the spread and the shape. With \(n = 10\) the sample means vary widely (the CLT standard deviation is \(9.144/\sqrt{10} \approx 2.89\)) and the distribution still shows some right skew. With \(n = 50\) the standard deviation shrinks to about \(9.144/\sqrt{50} \approx 1.29\) and the distribution looks much closer to a normal curve. Larger sample sizes, such as \(n = 1000\), concentrate the sample means even more tightly around the population mean.
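The CLT prediction for the spread can be checked numerically. The sketch below regenerates the population with the same seed, simulates sample means for \(n = 10\) and \(n = 50\), and compares their standard deviation with \(\sigma/\sqrt{n}\). The helper sd_of_means and the reduced replication count n_reps are assumptions made for this illustration, not part of the assignment.

```r
set.seed(915)
Events <- c(rpois(n = 10000, lambda = 1.5),
            rpois(n = 10000, lambda = 10),
            rpois(n = 5000, lambda = 25))

n_reps <- 5000   # fewer replications than the assignment, to keep the sketch quick

# Helper (hypothetical name): SD of n_reps sample means, each from a fresh sample of size n
sd_of_means <- function(n) {
  sd(replicate(n_reps, mean(sample(Events, size = n, replace = FALSE))))
}

sizes <- c(10, 50)
simulated <- sapply(sizes, sd_of_means)
predicted <- sd(Events) / sqrt(sizes)   # CLT standard deviation sigma / sqrt(n)

data.frame(n = sizes, simulated = simulated, predicted = predicted)
```

The simulated and predicted columns should be close to each other, and both should shrink by a factor of about \(\sqrt{5}\) when the sample size goes from 10 to 50.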

Problem 5

Here we will investigate the impact that sample size has on the Type I error rate (false positive). Conduct a simulation study, like we did with the batteries, with sample sizes of \(n = 5, 10, 25, 100\) drawn from the following populations. \[X_A \sim N(\mu = 20, \sigma = 2) \qquad X_B \sim N(\mu = 20, \sigma = 2)\] Note this is the same as we did in the first part of the lab, where both batteries have mean of 20 hours and a standard deviation of 2. Calculate the proportion of \(p\)-values that are less than \(\alpha = 0.05\) for each sample size. What do you notice?

# Insert R code for problem 5 here
mu_A = 20
mu_B = 20
sigma_A = 2
sigma_B = 2
N_Sims = 1000 #Number of simulations

for(n in c(5, 10, 25, 100)){
  n_A = n_B = n
  save_t = numeric(N_Sims) #Empty vector to store t statistics
  save_p = numeric(N_Sims) #Empty vector to store p values

  for(i in 1:N_Sims){
    A = rnorm(n=n_A, mean = mu_A, sd = sigma_A)
    B = rnorm(n=n_B, mean = mu_B, sd = sigma_B)
    T_Test = t.test(A,B, mu = 0, var.equal = TRUE)
    save_t[i] = T_Test$statistic
    save_p[i] = T_Test$p.value
  }
  hist(save_p, breaks = 20, main = paste("p-values, n =", n))
  #Proportion of p-values below alpha = 0.05 (estimated Type I error rate)
  print(mean(save_p < 0.05))
}

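In each histogram the first bar contains the p-values below 0.05. Because both populations have the same mean, the null hypothesis is true, and every rejection is a Type I error. The p-values are roughly uniform on (0, 1) for every sample size, so the proportion below 0.05 stays close to \(\alpha = 0.05\) whether \(n\) is 5, 10, 25, or 100. Any differences between sample sizes are simulation noise: the Type I error rate is set by \(\alpha\), not by the sample size.

Note from Ryan: I am assuming we are allowed to reuse the code from the lab, since the problem says the setup is the same as in the lab. If not, that is completely my fault, and I did not have any intention to cheat.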
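To judge how much the simulated Type I error rates in Problem 5 can differ by chance alone, one can compute the Monte Carlo standard error of a proportion based on 1000 simulations. This is a rough sketch that assumes the true rejection rate equals \(\alpha\); the names mc_se and noise_range are chosen for this illustration.

```r
alpha <- 0.05
N_Sims <- 1000

# Standard error of a simulated proportion when the true rejection rate is alpha
mc_se <- sqrt(alpha * (1 - alpha) / N_Sims)

# Approximate range (alpha plus or minus 2 standard errors) expected from noise alone
noise_range <- alpha + c(-2, 2) * mc_se

round(c(se = mc_se, lower = noise_range[1], upper = noise_range[2]), 4)
```

Under this assumption, simulated rates anywhere from about 0.036 to 0.064 are consistent with a true Type I error rate of 0.05, so small differences between sample sizes should not be read as a trend.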

Problem 6

Here we will investigate the impact that sample size has on the power of the test (correctly identifying a difference). Conduct a simulation study, like we did with the batteries, with sample sizes of \(n = 5, 10, 25, 100\) drawn from the following populations. \[X_A \sim N(\mu = 22, \sigma = 2) \qquad X_B \sim N(\mu = 20, \sigma = 2)\] Note this is the same as we did in the second part of the lab, where the name brand battery has a mean life of 22 hours and the generic battery has a mean life of 20 hours. Calculate the proportion of \(p\)-values that are less than \(\alpha = 0.05\) for each sample size. What do you notice?

# Insert R code for problem 6 here
set.seed(916)
mu_A = 22 #NOTE this is different
mu_B = 20
sigma_A = 2
sigma_B = 2
N_Sims = 1000 #Number of simulations

for(n in c(5, 10, 25, 100)){
  n_A = n_B = n
  save_t = numeric(N_Sims) #Empty vector to store t statistics
  save_p = numeric(N_Sims) #Empty vector to store p values

  for(i in 1:N_Sims){
    A = rnorm(n=n_A, mean = mu_A, sd = sigma_A)
    B = rnorm(n=n_B, mean = mu_B, sd = sigma_B)
    T_Test = t.test(A,B, mu = 0, var.equal = TRUE)
    save_t[i] = T_Test$statistic
    save_p[i] = T_Test$p.value
  }
  hist(save_p, breaks = 20, main = paste("p-values, n =", n))
  #Proportion of p-values below alpha = 0.05 (estimated power)
  print(mean(save_p < 0.05))
}

The p-value histograms for all sample sizes are skewed to the right, with most of the mass near 0. Because the population means really differ (22 versus 20), the proportion of p-values below 0.05 is the power of the test. That proportion rises as the sample size increases, and at \(n = 100\) nearly every simulated test rejects the null hypothesis.
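The simulated power can be compared with the theoretical power from power.t.test() in base R's stats package. The sketch below assumes the same setup as the simulation (difference in means of 2, common standard deviation of 2, two-sided test at \(\alpha = 0.05\)); the names sizes and theoretical_power are chosen for this illustration.

```r
sizes <- c(5, 10, 25, 100)   # per-group sample sizes from Problem 6

# Theoretical power of the two-sample, two-sided t-test for each sample size
theoretical_power <- sapply(sizes, function(n) {
  power.t.test(n = n, delta = 2, sd = 2, sig.level = 0.05,
               type = "two.sample", alternative = "two.sided")$power
})
names(theoretical_power) <- paste0("n=", sizes)

round(theoretical_power, 3)
```

The proportions of p-values below 0.05 from the simulation should land near these values, up to simulation noise.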