This is entirely our own work except as noted at the end of the document.
Prob1 - Import the data set Spruce into R.
Ht.change.# Your code here
library(resampledata)
ggplot(Spruce, aes(x = Ht.change)) + geom_histogram()
mean(Spruce$Ht.change)
[1] 30.93333
sd(Spruce$Ht.change)
[1] 11.04943
The height change has a somewhat normal distribution. The mean is 30.9 and the standard deviation is 11.05.
# Your code here
ht.change_t <- t.test(Spruce$Ht.change)
ht.change_t
One Sample t-test
data: Spruce$Ht.change
t = 23.755, df = 71, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
28.33685 33.52982
sample estimates:
mean of x
30.93333
SOLUTION: We are 95% confident that the average height change is between 28.34 and 33.53.
Ht.change for the seedlings in the fertilized and nonfertilized plots.# Your code here
ggplot(Spruce, aes(x = Ht.change)) + geom_histogram() + facet_grid(Fertilizer~.)
# Your code here
fert.height_t <- t.test(Spruce$Ht.change[Spruce$Fertilizer == "F"],Spruce$Ht.change[Spruce$Fertilizer == "NF"])
fert.height_t
Welch Two Sample t-test
data: Spruce$Ht.change[Spruce$Fertilizer == "F"] and Spruce$Ht.change[Spruce$Fertilizer == "NF"]
t = 7.5586, df = 69.697, p-value = 1.214e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
10.82909 18.59314
sample estimates:
mean of x mean of y
38.28889 23.57778
SOLUTION: We are 95% confident that the difference in average height change between Fertilized and non Fertilized is 10.83 and 18.59.
Prob2 - Consider the data set Girls2004 with birth weights of baby girls born in Wyoming or Alaska.
# Your code here
ggplot(Girls2004, aes(x = Weight)) + geom_histogram() + facet_grid(State~.)
SOLUTION: The distribution of both weights in Arkansas and Wyomig are normal. Wyoming has a less scattered distibution and Arkansas has a larger distribution.
# Your code here
girls_t <- t.test(Girls2004$Weight[Girls2004$State == "AK"],Girls2004$Weight[Girls2004$State == "WY"])
girls_t
Welch Two Sample t-test
data: Girls2004$Weight[Girls2004$State == "AK"] and Girls2004$Weight[Girls2004$State == "WY"]
t = 2.7316, df = 71.007, p-value = 0.007946
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
83.29395 533.60605
sample estimates:
mean of x mean of y
3516.35 3207.90
SOLUTION: We are 95% confident that the mean difference in baby weights in grams is between 3207.90 and 3516.25.
# Your code here
head(Girls2004)
ID State MothersAge Smoker Weight Gestation
1 1 WY 15-19 No 3085 40
2 2 WY 35-39 No 3515 39
3 3 WY 25-29 No 3775 40
4 4 WY 20-24 No 3265 39
5 5 WY 25-29 No 2970 40
6 6 WY 20-24 No 2850 38
ggplot(Girls2004, aes(x = Weight)) + geom_histogram() + facet_grid(Smoker~.)
SOLUTION: There are a lot more babies born to nonsmokers than born to smokers. The distribution of the babies born to nonsmokers is normal. The distribution of babies born to smokers is fairly uniform.
# Your code here
girls_t <- t.test(Girls2004$Weight[Girls2004$Smoker == "Yes"],Girls2004$Weight[Girls2004$Smoker == "No"])
girls_t
Welch Two Sample t-test
data: Girls2004$Weight[Girls2004$Smoker == "Yes"] and Girls2004$Weight[Girls2004$Smoker == "No"]
t = -1.8552, df = 14.35, p-value = 0.08423
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-617.9197 44.0330
sample estimates:
mean of x mean of y
3114.636 3401.580
SOLUTION: We are 95% confident that the difference in weights of the babies born to nonsmokers and babies born to smokers is 3114.636 and 3401.58 grams.
Prob3 - Import the FlightDelays data set into R. Although the data represent all flights for United Airlines and American Airlines in May and June 2009, assume for this exercise that these flights are a sample from all flights flown by the two airlines under similar conditions. We will compare the lengths of flight delays between the two airlines.
# Your code here
library(resampledata)
head(FlightDelays)
ID Carrier FlightNo Destination DepartTime Day Month FlightLength Delay
1 1 UA 403 DEN 4-8am Fri May 281 -1
2 2 UA 405 DEN 8-Noon Fri May 277 102
3 3 UA 409 DEN 4-8pm Fri May 279 4
4 4 UA 511 ORD 8-Noon Fri May 158 -2
5 5 UA 667 ORD 4-8am Fri May 143 -3
6 6 UA 669 ORD 4-8am Fri May 150 0
Delayed30
1 No
2 Yes
3 No
4 No
5 No
6 No
ggplot(FlightDelays, aes(x = Delay)) + geom_histogram() + facet_grid(Carrier~.)
# Your code here
flights_t <- t.test(FlightDelays$Delay[FlightDelays$Carrier == "UA"],FlightDelays$Delay[FlightDelays$Carrier == "AA"])
flights_t
Welch Two Sample t-test
data: FlightDelays$Delay[FlightDelays$Carrier == "UA"] and FlightDelays$Delay[FlightDelays$Carrier == "AA"]
t = 3.8255, df = 1843.8, p-value = 0.0001349
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
2.868194 8.903198
sample estimates:
mean of x mean of y
15.98308 10.09738
#t.test(Delay~Carrier, data = FlightDelays, conf = 0.95) Another way to do this
SOLUTION: We are 95% confident that the difference in average flight delays by carrier is between 10.1 and 15.98.
Prob4 - Run a simulation to see if the \(t\) ratio \(T = (\bar{X} -\mu)/(S/\sqrt{n})\) has a \(t\) distribution or even an approximate \(t\) distribution when the samples are drawn from a nonnormal distribution. Be sure to superimpose the appropriate \(t\) density curve over the density of your simulated \(T\). Try two different nonnormal distributions \(\left( Unif(a = 0, b = 1), Exp(\lambda = 1) \right)\) and remember to see if sample size makes a difference (use \(n = 15\) and \(n=500\)).
# Your code here
n = 15
mu = 0.5
sims <- 10^4
xbar <- numeric(sims)
SD <- numeric(sims)
for (i in 1:sims) {
xn <- runif(n, mu, 1)
xbar[i] <- mean(xn)
SD[i] <- sd(xn)
}
T <- (xbar - mu)/(SD/sqrt(n))
hist(T, freq = FALSE, breaks = "scott")
curve(dt(x, n-1), add = TRUE, col = "blue")
n = 500
mu = 0.5
sims <- 10^4
xbar <- numeric(sims)
SD <- numeric(sims)
for (i in 1:sims) {
xn <- runif(n, mu, 1)
xbar[i] <- mean(xn)
SD[i] <- sd(xn)
}
T <- (xbar - mu)/(SD/sqrt(n))
hist(T, freq = FALSE, breaks = "scott")
curve(dt(x, n-1), add = TRUE, col = "blue")
n = 15
mu = 0.5
sims <- 10^4
xbar <- numeric(sims)
SD <- numeric(sims)
for (i in 1:sims) {
xn <- rexp(n, rate = 1)
xbar[i] <- mean(xn)
SD[i] <- sd(xn)
}
T <- (xbar - (1/1))/(SD/sqrt(n))
hist(T, freq = FALSE, breaks = "scott", xlim = c(-5, 5))
curve(dt(x, n-1), add = TRUE, col = "blue")
n = 500
mu = 0.5
sims <- 10^4
xbar <- numeric(sims)
SD <- numeric(sims)
for (i in 1:sims) {
xn <- rexp(n, rate = 1)
xbar[i] <- mean(xn)
SD[i] <- sd(xn)
}
T <- (xbar - (1/1))/(SD/sqrt(n))
hist(T, freq = FALSE, breaks = "scott", xlim = c(-5, 5))
curve(dt(x, n-1), add = TRUE, col = "blue")
SOULTION: As the sample size grows larger, the distribution becomes normal.
Prob5 - One question is the 2002 General Social Survey asked participants whom they voted for in the 2000 election. Of the 980 women who voted, 459 voted for Bush. Of the 759 men who voted, 426 voted for Bush.
# Your code here
prop.test(x = 459, n = 980, conf.level = .95, correct = FALSE)
1-sample proportions test without continuity correction
data: 459 out of 980, null probability 0.5
X-squared = 3.9224, df = 1, p-value = 0.04765
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.4373100 0.4996717
sample estimates:
p
0.4683673
SOLUTION: We are 95% confident that the proportion of women that voted for Bush in the 2000 election is between 43.7% and 49.9%.
# Your code here
prop.test(x = 426, n = 759, conf.level = .95, correct = FALSE)
1-sample proportions test without continuity correction
data: 426 out of 759, null probability 0.5
X-squared = 11.395, df = 1, p-value = 0.0007363
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.5257409 0.5961717
sample estimates:
p
0.5612648
SOLUTION: We are 95% confident that the proportion of men that voted for Bush in the 2000 election is between 52.57% and 59.6%. We can conclude that there was a larger proportion of men that voted for Bush than women.
# Your code here
SOLUTION:
Prob6 - A retail store wishes to conduct a marketing survey of its customers to see if customers would favor longer store hours. How many people should be in their sample if the marketers want their margin of error to be at most 3% with 95% confidence, assuming
# Your code here
SOLUTION:
# Your code here
SOLUTION:
Prob7 - Suppose researchers wish to study the effectiveness of a new drug to alleviate hives due to math anxiety. Seven hundred math students are randomly assigned to take either this drug or a placebo. Suppose 34 of the 350 students who took the drug break out in hives compared to 56 of the 350 students who took the placebo.
# Your code here
SOLUTION:
# Your code here
SOLUTION:
SOLUTION:
# Your code here
SOLUTION:
Prob8 - An article in the March 2003 New England Journal of Medicine describes a study to see if aspirin is effective in reducing the incidence of colorectal adenomas, a precursor to most colorectal cancers (Sandler et al. (2003)). Of 517 patients in the study, 259 were randomly assigned to receive aspirin and the remaining 258 received a placebo. One or more adenomas were found in 44 of the aspirin group and 70 in the placebo group. Find a 95% one-sided upper bound for the difference in proportions \((p_A - p_P)\) and interpret your interval.
# Your code here
SOLUTION:
Prob9 - The data set Bangladesh has measurements on water quality from 271 wells in Bangladsesh. There are two missing values in the chlorine variable. Use the following R code to remove these two observations.
> chlorine <- with(Bangladesh, Chlorine[!is.na(Chlorine)])
# Your code here
chlorine <- with(Bangladesh, Chlorine[!is.na(Chlorine)])
# Your code here
Bangladesh %>%
group_by(Chlorine) %>%
summarize(mean(Chlorine), sd(Chlorine), n())
# A tibble: 202 x 4
Chlorine `mean(Chlorine)` `sd(Chlorine)` `n()`
<dbl> <dbl> <dbl> <int>
1 1.0 1.0 NA 1
2 1.2 1.2 0 2
3 1.3 1.3 NA 1
4 1.4 1.4 0 2
5 1.7 1.7 0 2
6 1.8 1.8 0 3
7 1.9 1.9 NA 1
8 2.0 2.0 0 3
9 2.2 2.2 NA 1
10 2.4 2.4 0 3
# ... with 192 more rows
ggplot(Bangladesh, aes(x = Chlorine)) + geom_histogram()
SOLUTION: The distribution of the chlorine levels in Bangladesh are skewed right.
# Your code here
SOLUTION:
# Your code here
SOLUTION:
Bangladesh) and compare with the formula \(t\) and bootstrap \(t\) intervals.# Your code here
SOLUTION:
Problem 10 (26 in book)The data set MnGroundwater has measurements on water quality of 895 randomly selected wells in Minnesota.
MnGroundwater <- read.csv("http://www1.appstate.edu/~arnholta/Data/MnGroundwater.csv")
# Your code here
hist(MnGroundwater$Alkalinity)
qqnorm(MnGroundwater$Alkalinity)
ggplot(MnGroundwater, aes(x = Alkalinity)) + geom_density()
SOLUTION:
# Your code here
alk_t <- t.test(MnGroundwater$Alkalinity)
alk_t
One Sample t-test
data: MnGroundwater$Alkalinity
t = 80.272, df = 894, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
283575.6 297789.8
sample estimates:
mean of x
290682.7
SOLUTION: We are 95% confident that the alkalinity levels in Minnesota wells is between 283575.6 and 297789.8.
# Your code here
SOLUTION:
Problem 11 (27 in the book)Consider the babies born in Texas in 2004 (TXBirths2004). We will compare the weights of babies born to nonsmokers and smokers.
TXBirths2004 <- read.csv("http://www1.appstate.edu/~arnholta/Data/TXBirths2004.csv")
TXBirths2004 %>%
group_by(Smoker) %>%
summarize(n())
# A tibble: 2 x 2
Smoker `n()`
<fctr> <int>
1 No 1497
2 Yes 90
SOLUTION: There are 90 smokers, and 1,497 nonsmokers in the dataset.
ggplot(TXBirths2004, aes(x = Weight)) + geom_histogram() + facet_grid(Smoker~.)
SOLUTION: The distribution of the nonsmokers weights are left skewed. The distribution of the weights of the smokers is fairy uniform.
SOLUTION:
# Your code here
SOLUTION: