This is entirely our own work except as noted at the end of the document.
Prob1 - Import the data set Spruce into R.
Ht.change.# Your code here
Spruce <- read.csv("http://www1.appstate.edu/~arnholta/Data/Spruce.csv")
glimpse(Spruce)
Observations: 72
Variables: 9
$ Tree <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
$ Competition <fctr> NC, NC, NC, NC, NC, NC, NC, NC, NC, NC, NC, NC, C...
$ Fertilizer <fctr> F, F, F, F, F, F, NF, NF, NF, NF, NF, NF, F, F, F...
$ Height0 <dbl> 15.0, 9.0, 12.0, 13.7, 12.0, 12.0, 16.8, 14.6, 16....
$ Height5 <dbl> 60.0, 45.2, 42.0, 49.5, 47.3, 56.4, 43.5, 49.2, 54...
$ Diameter0 <dbl> 1.984375, 1.190625, 1.785937, 1.587500, 1.587500, ...
$ Diameter5 <dbl> 7.4, 5.2, 5.7, 6.4, 6.2, 7.4, 4.9, 5.4, 7.1, 5.1, ...
$ Ht.change <dbl> 45.0, 36.2, 30.0, 35.8, 35.3, 44.4, 26.7, 34.6, 38...
$ Di.change <dbl> 5.415625, 4.009375, 3.914062, 4.812500, 4.612500, ...
# Your code here
set.seed(124)
num <- 10^4 - 1
changeHeight0 <- numeric(num)
changeHeight5 <- numeric(num)
for(i in 1:num) {
changeHeight0[i] <- mean(sample(Spruce$Height0, length(Spruce$Height0), replace = TRUE))
changeHeight5[i] <- mean(sample(Spruce$Height5, length(Spruce$Height5), replace = TRUE))
}
changeHeight <- changeHeight5 - changeHeight0
quantile(changeHeight, c(.025, .975))
2.5% 97.5%
28.29014 33.55007
SOLUTION: We are 95% confident that the true mean height change over the 5-year period of the study lies in the interval (28.290, 33.550).
Ht.change for the seedlings in the fertilized and nonfertilized plots.# Your code here
Spruce %>%
group_by(Fertilizer) %>%
filter(Fertilizer == "F") %>%
summarize(mean(Ht.change))
# A tibble: 1 x 2
Fertilizer `mean(Ht.change)`
<fctr> <dbl>
1 F 38.28889
Spruce %>%
group_by(Fertilizer) %>%
filter(Fertilizer == "NF") %>%
summarize(mean(Ht.change))
# A tibble: 1 x 2
Fertilizer `mean(Ht.change)`
<fctr> <dbl>
1 NF 23.57778
# Your code here
quantile(changeHeight, c(.05, .95))
5% 95%
28.72208 33.15861
SOLUTION: 28.722 - We are 95% confident that the difference in mean heights over the 5-year period of the study is no less than 28.722.
Prob2 - Consider the data set Girls2004 with birth weights of baby girls born in Wyoming or Alaska.
# Your code here
Girls2004 <- read.csv("http://www1.appstate.edu/~arnholta/Data/Girls2004.csv")
glimpse(Girls2004)
Observations: 80
Variables: 6
$ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
$ State <fctr> WY, WY, WY, WY, WY, WY, WY, WY, WY, WY, WY, WY, WY...
$ MothersAge <fctr> 15-19, 35-39, 25-29, 20-24, 25-29, 20-24, 20-24, 2...
$ Smoker <fctr> No, No, No, No, No, No, No, No, No, No, No, Yes, Y...
$ Weight <int> 3085, 3515, 3775, 3265, 2970, 2850, 2737, 3515, 374...
$ Gestation <int> 40, 39, 40, 39, 40, 38, 38, 37, 39, 40, 41, 39, 40,...
Girls2004 %>%
group_by(State) %>%
filter(State == "WY") %>%
summarize(mean(Weight))
# A tibble: 1 x 2
State `mean(Weight)`
<fctr> <dbl>
1 WY 3207.9
Girls2004 %>%
group_by(State) %>%
filter(State == "AK") %>%
summarize(mean(Weight))
# A tibble: 1 x 2
State `mean(Weight)`
<fctr> <dbl>
1 AK 3516.35
SOLUTION: The average weight of babies born in WY is 3207.9, while the average weight of babies born in AK is 3516.35.
# Your code here
WYGirl <- Girls2004 %>%
group_by(State) %>%
filter(State == "WY")
AKGirl <- Girls2004 %>%
group_by(State) %>%
filter(State == "AK")
obsMean <- mean(AKGirl$Weight) - mean(WYGirl$Weight)
obsSD <- sd(AKGirl$Weight) - sd(WYGirl$Weight)
set.seed(124)
num <- 10^4 - 1
meanAK <- numeric(num)
meanWY <- numeric(num)
for(i in 1:num) {
meanAK[i] <- mean(sample(AKGirl$Weight, length(AKGirl$Weight), replace = TRUE))
meanWY[i] <- mean(sample(WYGirl$Weight, length(WYGirl$Weight), replace = TRUE))
}
BootMean <- meanAK - meanWY
quantile(BootMean, c(.025, .975))
2.5% 97.5%
84.14875 523.69000
SOLUTION: We are 95% confident that the difference in mean weights for girls born in AK and WY is in the interval (84.15, 523.69).
# Your code here
Girls2004 %>%
group_by(Smoker) %>%
filter(Smoker == "Yes") %>%
summarize(mean(Weight))
# A tibble: 1 x 2
Smoker `mean(Weight)`
<fctr> <dbl>
1 Yes 3114.636
Girls2004 %>%
group_by(Smoker) %>%
filter(Smoker == "No") %>%
summarize(mean(Weight))
# A tibble: 1 x 2
Smoker `mean(Weight)`
<fctr> <dbl>
1 No 3401.58
SOLUTION: The mean weights of babies born to smokers is 3114.64 while the mean weight of babies born to non-smokers is 3401.58.
# Your code here
YGirl <- Girls2004 %>%
group_by(Smoker) %>%
filter(Smoker == "Yes")
NGirl <- Girls2004 %>%
group_by(State) %>%
filter(State == "AK")
obsMean <- mean(YGirl$Weight) - mean(NGirl$Weight)
obsSD <- sd(YGirl$Weight) - sd(NGirl$Weight)
set.seed(124)
num <- 10^4 - 1
meanY <- numeric(num)
meanN <- numeric(num)
for(i in 1:num) {
meanY[i] <- mean(sample(YGirl$Weight, length(YGirl$Weight), replace = TRUE))
meanN[i] <- mean(sample(NGirl$Weight, length(NGirl$Weight), replace = TRUE))
}
BootMean <- meanN - meanY
quantile(BootMean, c(.025, .975))
2.5% 97.5%
81.65307 725.48784
SOLUTION: We are 95% confident that the true difference in mean weights between babies born to nonsmokers and smokers lies in the interval (81.65, 725.49).
Prob3 - Import the FlightDelays data set into R. Although the data represent all flights for United Airlines and American Airlines in May and June 2009, assume for this exercise that these flights are a sample from all flights flown by the two airlines under similar conditions. We will compare the lengths of flight delays between the two airlines.
# Your code here
FD <- read.csv("http://www1.appstate.edu/~arnholta/Data/FlightDelays.csv")
glimpse(FD)
Observations: 4,029
Variables: 10
$ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
$ Carrier <fctr> UA, UA, UA, UA, UA, UA, UA, UA, UA, UA, UA, UA, ...
$ FlightNo <int> 403, 405, 409, 511, 667, 669, 673, 677, 679, 681,...
$ Destination <fctr> DEN, DEN, DEN, ORD, ORD, ORD, ORD, ORD, ORD, ORD...
$ DepartTime <fctr> 4-8am, 8-Noon, 4-8pm, 8-Noon, 4-8am, 4-8am, 8-No...
$ Day <fctr> Fri, Fri, Fri, Fri, Fri, Fri, Fri, Fri, Fri, Fri...
$ Month <fctr> May, May, May, May, May, May, May, May, May, May...
$ FlightLength <int> 281, 277, 279, 158, 143, 150, 158, 160, 160, 163,...
$ Delay <int> -1, 102, 4, -2, -3, 0, -5, 0, 10, 60, 0, 32, 0, 4...
$ Delayed30 <fctr> No, Yes, No, No, No, No, No, No, No, Yes, No, Ye...
FD %>%
group_by(Carrier) %>%
filter(Carrier == "UA") %>%
summarize(mean(Delay))
# A tibble: 1 x 2
Carrier `mean(Delay)`
<fctr> <dbl>
1 UA 15.98308
FD %>%
group_by(Carrier) %>%
filter(Carrier == "AA") %>%
summarize(mean(Delay))
# A tibble: 1 x 2
Carrier `mean(Delay)`
<fctr> <dbl>
1 AA 10.09738
# Your code here
UA <- FD %>%
group_by(Carrier) %>%
filter(Carrier == "UA")
AA <- FD %>%
group_by(Carrier) %>%
filter(Carrier == "AA")
set.seed(124)
num <- 10^4 - 1
meanUA <- numeric(num)
meanAA <- numeric(num)
for(i in 1:num) {
meanUA[i] <- mean(sample(UA$Delay, length(UA$Delay), replace = TRUE))
meanAA[i] <- mean(sample(AA$Delay, length(AA$Delay), replace = TRUE))
}
BootMean <- meanUA - meanAA
quantile(BootMean, c(.025, .975))
2.5% 97.5%
2.877827 8.958710
SOLUTION: We are 95% confident that the true difference in mean flight delays between the two airlines is in the interval (2.88, 8.96).
Prob4 - Run a simulation to see if the \(t\) ratio \(T = (\bar{X} -\mu)/(S/\sqrt{n})\) has a \(t\) distribution or even an approximate \(t\) distribution when the samples are drawn from a nonnormal distribution. Be sure to superimpose the appropriate \(t\) density curve over the density of your simulated \(T\). Try two different nonnormal distributions \(\left( Unif(a = 0, b = 1), Exp(\lambda = 1) \right)\) and remember to see if sample size makes a difference (use \(n = 15\) and \(n=500\)).
# Uniform dist with n = 15
n <- 15
Mu <- 0
sigma <- 1
sims <- 10^4
xbar <- numeric(sims)
SD <- numeric(sims)
for(i in 1:sims) {
xn <- rnorm(n, Mu, sigma)
xbar[i] <- mean(xn)
SD[i] <- sd(xn)
}
Tu15 <- (xbar - 6) / (SD / 2)
hist(Tu15, freq = FALSE)
#curve(dt(x+ 12, 14), -30, -5, add = TRUE)
lines(density(Tu15, adjust = 2))
# Uniform dist with n = 500
n <- 500
Mu <- 0
sigma <- 1
sims <- 10^4
xbar <- numeric(sims)
SD <- numeric(sims)
for(i in 1:sims) {
xn <- rnorm(n, Mu, sigma)
xbar[i] <- mean(xn)
SD[i] <- sd(xn)
}
Tu500 <- (xbar - 6) / (SD / 2)
hist(Tu500, prob = TRUE)
#curve(dt(x + 12, 499), -13.5, -10.5, add = TRUE)
lines(density(Tu500, adjust = 2))
# Exponential dist with n = 15
n <- 15
Mu <- 0
sigma <- 1
sims <- 10^4
xbar <- numeric(sims)
SD <- numeric(sims)
for(i in 1:sims) {
xn <- rexp(n, sigma)
xbar[i] <- mean(xn)
SD[i] <- sd(xn)
}
Te15 <- (xbar - 6) / (SD / 2)
hist(Te15, freq = FALSE)
#curve(dt(x + 11, 14), -60, 0, add = TRUE)
lines(density(Te15, adjust = 2))
# Exponential dist with n = 500
n <- 500
Mu <- 0
sigma <- 1
sims <- 10^4
xbar <- numeric(sims)
SD <- numeric(sims)
for(i in 1:sims) {
xn <- rexp(n, sigma)
xbar[i] <- mean(xn)
SD[i] <- sd(xn)
}
Te500 <- (xbar - 6) / (SD / 2)
hist(Te500, freq = FALSE)
#curve(dt(x + 10, 499), -13.5, -10.5, add = TRUE)
lines(density(Te500, adjust = 2))
SOULTION: Using a small sample size, the uniform distribution and the exponential distribution are both skewed to the left. However, using a larger sample size, both distributions begin to look more normal.
Prob5 - One question is the 2002 General Social Survey asked participants whom they voted for in the 2000 election. Of the 980 women who voted, 459 voted for Bush. Of the 759 men who voted, 426 voted for Bush.
# Your code here
set.seed(241)
prop.test(459, 980)
1-sample proportions test with continuity correction
data: 459 out of 980, null probability 0.5
X-squared = 3.7969, df = 1, p-value = 0.05135
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.4368038 0.5001819
sample estimates:
p
0.4683673
SOLUTION: We are 95% confident that the true proportion of women who voted for Bush lies in the interval (43.68%, 50.02%).
# Your code here
prop.test(426, 759)
1-sample proportions test with continuity correction
data: 426 out of 759, null probability 0.5
X-squared = 11.152, df = 1, p-value = 0.0008396
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.5250797 0.5968214
sample estimates:
p
0.5612648
SOLUTION: We are 95% confident that the true proportion of men who voted for Bush lies in the interval (52.51%, 59.68%). The intervals for men and women do not overlap in this case, so we can conclude with that gender preference that men liked Bush more than women did.
# Your code here
Prob6 - A retail store wishes to conduct a marketing survey of its customers to see if customers would favor longer store hours. How many people should be in their sample if the marketers want their margin of error to be at most 3% with 95% confidence, assuming
# Your code here
SOLUTION:
# Your code here
SOLUTION:
Prob7 - Suppose researchers wish to study the effectiveness of a new drug to alleviate hives due to math anxiety. Seven hundred math students are randomly assigned to take either this drug or a placebo. Suppose 34 of the 350 students who took the drug break out in hives compared to 56 of the 350 students who took the placebo.
# Your code here
SOLUTION:
# Your code here
SOLUTION:
SOLUTION:
# Your code here
SOLUTION:
Prob8 - An article in the March 2003 New England Journal of Medicine describes a study to see if aspirin is effective in reducing the incidence of colorectal adenomas, a precursor to most colorectal cancers (Sandler et al. (2003)). Of 517 patients in the study, 259 were randomly assigned to receive aspirin and the remaining 258 received a placebo. One or more adenomas were found in 44 of the aspirin group and 70 in the placebo group. Find a 95% one-sided upper bound for the difference in proportions \((p_A - p_P)\) and interpret your interval.
# Your code here
SOLUTION:
Prob9 - The data set Bangladesh has measurements on water quality from 271 wells in Bangladsesh. There are two missing values in the chlorine variable. Use the following R code to remove these two observations.
> chlorine <- with(Bangladesh, Chlorine[!is.na(Chlorine)])
# Your code here
# Your code here
SOLUTION:
# Your code here
SOLUTION:
# Your code here
SOLUTION:
Bangladesh) and compare with the formula \(t\) and bootstrap \(t\) intervals.# Your code here
SOLUTION:
Prob10 - The data set MnGroundwater has measurements on water quality of 895 randomly selected wells in Minnesota.
# Your code here
# Your code here
SOLUTION:
# Your code here
SOLUTION:
# Your code here
SOLUTION:
Prob11 Consider the babies born in Texas in 2004 (TXBirths2004). We will compare the weights of babies born to nonsmokers and smokers.
# Your code here
# Your code here
SOLUTION:
# Your code here
SOLUTION:
# Your code here
SOLUTION:
# Your code here
SOLUTION: