This is entirely our own work except as noted at the end of the document.
Prob6 - A retail store wishes to conduct a marketing survey of its customers to see if customers would favor longer store hours. How many people should be in their sample if the marketers want their margin of error to be at most 3% with 95% confidence, assuming
error <- 0.03
ptilde <- .5
sample_size <- (ptilde*(1-ptilde))/(error/1.96)^2
sample_size
[1] 1067.111
SOLUTION: To provide a confidence level of 95% and a margin of error of a maximum 3%, the marketers should get responses from a minimum of 1068 people.
ptilde <- .65
sample_size <- (ptilde*(1-ptilde))/(error/1.96)^2
sample_size
[1] 971.0711
SOLUTION: Given 65% of respondants of the survey favored longer hours, the store should use a sample size of 972 people.
Prob7 - Suppose researchers wish to study the effectiveness of a new drug to alleviate hives due to math anxiety. Seven hundred math students are randomly assigned to take either this drug or a placebo. Suppose 34 of the 350 students who took the drug break out in hives compared to 56 of the 350 students who took the placebo.
test <- prop.test(34,350)
test
1-sample proportions test with continuity correction
data: 34 out of 350, null probability 0.5
X-squared = 225.6, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.06913692 0.13429260
sample estimates:
p
0.09714286
SOLUTION: We are 95% confident that the proportion of students taking the drug who break out in hives is within (0.0691, 0.1343).
test <- prop.test(56,350)
test
1-sample proportions test with continuity correction
data: 56 out of 350, null probability 0.5
X-squared = 160.48, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.1240384 0.2036158
sample estimates:
p
0.16
SOLUTION: We are 95% confident that the proportion of students taking the placebo who break out in hives is within (0.124, 0.204).
SOLUTION: The intervals do overlap, but we cannot conclude anything because there could be other variables coming into play.
x <- c(34,56)
n <- c(350,350)
test <- prop.test(x,n)
test
2-sample test for equality of proportions with continuity
correction
data: x out of n
X-squared = 5.623, df = 1, p-value = 0.01773
alternative hypothesis: two.sided
95 percent confidence interval:
-0.11508783 -0.01062645
sample estimates:
prop 1 prop 2
0.09714286 0.16000000
SOLUTION: We are 95% confident that the difference in the proportion of students taking the placebo who break out in hives to the proportion of students taking the placebo who break out in hives is within (-0.1151, -0.0106). Since zero is not included in the interval we have sufficient statistical evidence to reject that there is no difference in the effects of the drug and the placebo.
Prob8 - An article in the March 2003 New England Journal of Medicine describes a study to see if aspirin is effective in reducing the incidence of colorectal adenomas, a precursor to most colorectal cancers (Sandler et al. (2003)). Of 517 patients in the study, 259 were randomly assigned to receive aspirin and the remaining 258 received a placebo. One or more adenomas were found in 44 of the aspirin group and 70 in the placebo group. Find a 95% one-sided upper bound for the difference in proportions \((p_A - p_P)\) and interpret your interval.
x <- c(44, 70)
n <- c(259, 258)
test <- prop.test(x, n, alternative="less")
test
2-sample test for equality of proportions with continuity
correction
data: x out of n
X-squared = 7.158, df = 1, p-value = 0.003732
alternative hypothesis: less
95 percent confidence interval:
-1.00000000 -0.03801355
sample estimates:
prop 1 prop 2
0.1698842 0.2713178
SOLUTION: We are 95% confident that the difference in adenomas found in the aspirin group and the placebo group is less than or equal to -0.038.
Prob9 - The data set Bangladesh has measurements on water quality from 271 wells in Bangladsesh. There are two missing values in the chlorine variable. Use the following R code to remove these two observations.
> chlorine <- with(Bangladesh, Chlorine[!is.na(Chlorine)])
Bangladesh <- read.csv("http://www1.appstate.edu/~arnholta/Data/Bangladesh.csv")
chlorine <- with(Bangladesh, Chlorine[!is.na(Chlorine)])
mean(chlorine)
[1] 78.08401
sd(chlorine)
[1] 210.0192
hist(chlorine)
boxplot(chlorine)
SOLUTION: Skewed right with a mean of 78.084 and a standard deviation of 210.02.
t.test(chlorine)
One Sample t-test
data: chlorine
t = 6.0979, df = 268, p-value = 3.736e-09
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
52.87263 103.29539
sample estimates:
mean of x
78.08401
SOLUTION: We are 95% confident that the mean of chlorine levels in Bangladesh wells is within (52.873, 103.295).
sims <- 10^4
Tstar <- numeric(sims)
for(i in 1:sims){
bs <- sample(chlorine, size = sum(chlorine), replace = TRUE)
Tstar[i] <- (mean(bs) - mean(chlorine)/(sd(bs)/sqrt(sum(chlorine))))
}
ggplot(data = data.frame(Tstar = Tstar), aes(x= Tstar)) +
geom_density()
TS <- quantile(Tstar, prob = c(0.025, 0.975))
TS
2.5% 97.5%
19.41677 28.67781
SOLUTION: We would construct a bootstrap \(t\) confidence interval with a 95% confidence level to account for skewness. We are 95% confident that mean chlorine levels lie within (52.213, 55.95).
Bangladesh) and compare with the formula \(t\) and bootstrap \(t\) intervals.skewness <- skewness(Bangladesh$Arsenic, na.rm = TRUE)
sims <- 10^4
Tstar <- numeric(sims)
for(i in 1:sims){
bs <- sample(Bangladesh$Arsenic, size = sum(!is.na(Bangladesh$Arsenic)), replace = TRUE)
Tstar[i] <- (mean(bs) - mean(Bangladesh$Arsenic)/(sd(bs)/(sqrt(sum(!is.na(Bangladesh$Arsenic))))))
}
TS <- quantile(Tstar, prob = c(0.025, 0.975))
TS
2.5% 97.5%
82.20605 158.23365
SOLUTION: The skewness is shown and matters, but we can work accordingly. With this method, we are 95% confident that the mean arsenic levels are within (81.904, 154.371).
Prob10 - The data set MnGroundwater has measurements on water quality of 895 randomly selected wells in Minnesota.
MnGroundwater <- read.csv("http://www1.appstate.edu/~arnholta/Data/MnGroundwater.csv")
glimpse(MnGroundwater)
Observations: 895
Variables: 10
$ County <fctr> Aitkin, Aitkin, Aitkin, Aitkin, Aitkin, Aitkin,...
$ Aquifer.Group <fctr> surficial Quaternary, buried Quaternary, buried...
$ Water.Level <int> 55, 30, 20, 3, 0, 30, 15, 10, 20, 25, 26, 16, 26...
$ Alkalinity <int> 137000, 214000, 120000, 283000, 236000, 229000, ...
$ Aluminum <dbl> 0.059, 2.380, 0.410, 158.190, 0.059, 0.059, 0.63...
$ Arsenic <dbl> 1.810, 0.059, 1.440, 6.340, 10.170, 6.900, 3.650...
$ Chloride <int> 490, 89250, 300, 780, 5090, 2590, 770, 750, 1700...
$ Lead <dbl> 0.17, 0.18, 0.52, 1.15, 0.02, 0.12, 0.34, 0.26, ...
$ pH <dbl> 7.10, 7.60, 6.90, 8.20, 7.90, 7.80, 8.20, 7.80, ...
$ Basin.Name <fctr> Upper Mississippi River, Upper Mississippi Rive...
hist(MnGroundwater$Alkalinity)
ggplot(MnGroundwater, aes(x= Alkalinity)) +
geom_density()
qqnorm(MnGroundwater$Alkalinity)
qqline(MnGroundwater$Alkalinity)
mean(MnGroundwater$Alkalinity)
[1] 290682.7
sd(MnGroundwater$Alkalinity)
[1] 108334.3
SOLUTION: The distribution of alkalinity is approximately normal with a mean of 290682.7 and a standard deviation of 108334.3.
t.test(MnGroundwater$Alkalinity)
One Sample t-test
data: MnGroundwater$Alkalinity
t = 80.272, df = 894, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
283575.6 297789.8
sample estimates:
mean of x
290682.7
SOLUTION: We are 95% confident that the mean alkalinity levels in Minnesota wells is within (283575.6, 297789.8).
sims <- 10^4
Tstar <- numeric(sims)
for(i in 1:sims){
bs <- sample(MnGroundwater$Alkalinity, size = sum(!is.na(MnGroundwater$Alkalinity)), replace = TRUE)
Tstar[i] <- (mean(bs) - mean(MnGroundwater$Alkalinity)/(sd(bs)/sqrt(sum(!is.na(MnGroundwater$Alkalinity)))))
}
TS <- quantile(Tstar, prob = c(0.025, 0.975))
TS
2.5% 97.5%
283640.4 297749.4
SOLUTION: We are 95% confident that the mean akalinity levels in Minnesota wells is within (283450.6, 297649.3).
Prob11 Consider the babies born in Texas in 2004 (TXBirths2004). We will compare the weights of babies born to nonsmokers and smokers.
TXBirths2004 <- read.csv("http://www1.appstate.edu/~arnholta/Data/TXBirths2004.csv")
glimpse(TXBirths2004)
Observations: 1,587
Variables: 8
$ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
$ MothersAge <fctr> 20-24, 20-24, 25-29, 25-29, 15-19, 30-34, 30-34, 2...
$ Smoker <fctr> No, No, No, No, No, No, No, No, No, No, No, No, No...
$ Gender <fctr> Male, Male, Female, Female, Female, Female, Male, ...
$ Weight <int> 3033, 3232, 3317, 2560, 2126, 2948, 3884, 2665, 371...
$ Gestation <int> 39, 40, 37, 36, 37, 38, 39, 38, 40, 37, 39, 37, 40,...
$ Number <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ Multiple <fctr> No, No, No, No, No, No, No, No, No, No, No, No, No...
smokers <- TXBirths2004$Smoker == "Yes"
sum(smokers)
[1] 90
nonsmokers <- TXBirths2004$Smoker == "No"
sum(nonsmokers)
[1] 1497
SOLUTION: 90 smokers and 1497 nonsmokers are represented in this dataset.
msmokers <- TXBirths2004 %>%
group_by(Smoker) %>%
summarize(Mean = mean(Weight), StDev = sd(Weight), n())
msmokers
# A tibble: 2 x 4
Smoker Mean StDev `n()`
<fctr> <dbl> <dbl> <int>
1 No 3287.494 554.4829 1497
2 Yes 3205.989 504.2439 90
library(ggplot2)
ggplot(TXBirths2004, aes(x = Weight)) +
geom_histogram() +
facet_grid(~Smoker)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
SOLUTION: The distribution of smokers is approximately normal with a mean of 3205.989 and a standard deviation of 504.244. The distribution of nonsmokers is slightly skewed to the left with a mean of 3287.494 and a standard deviation of 554.483.
SmokersWeight <- subset(TXBirths2004, select = Weight, Smoker == "Yes", drop = TRUE)
NonsmokersWeight <- subset(TXBirths2004, select = Weight, Smoker == "No", drop = TRUE)
thetahat <- mean(SmokersWeight) - mean(NonsmokersWeight)
nx <- length(SmokersWeight)
ny <- length(NonsmokersWeight)
SE <- sqrt(var(SmokersWeight)/nx + var(NonsmokersWeight)/ny)
N <- 10^4
Tstar <- numeric(N)
DM <- numeric(N)
for(i in 1:N){
xboot <- sample(SmokersWeight, nx, replace = TRUE)
yboot <- sample(NonsmokersWeight, ny, replace = TRUE)
Tstar[i] <- (mean(xboot) - mean(yboot) - thetahat)/(sqrt(var(xboot)/nx + var(yboot)/ny))
DM[i] <- mean(xboot) - mean(yboot)
}
CItboot <- thetahat - quantile(Tstar, c(.975, .025)) * SE
names(CItboot) <- NULL
CItboot
[1] -187.13140 27.92286
SOLUTION: The bootstrap \(t\) interval allows us to be 95% confident that the difference in means of smokers and nonsmokers is within (-190.843, 27.609).
t.test(SmokersWeight, NonsmokersWeight, alternative="greater")
Welch Two Sample t-test
data: SmokersWeight and NonsmokersWeight
t = -1.4806, df = 102.38, p-value = 0.9291
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
-172.8809 Inf
sample estimates:
mean of x mean of y
3205.989 3287.494
SOLUTION: We are 95% confident that the difference in weights of Texan babies is at least -172.193 by using the \(t\) confidence interval.