This is entirely our own work except as noted at the end of the document.
Prob1 - Import the data set Spruce into R.
Ht.change.# Your code here
spruce <- read.csv("http://www1.appstate.edu/~arnholta/Data/Spruce.csv")
hist(spruce$Ht.change)
ggplot(spruce,aes(x=Ht.change))+geom_density()
# Your code here
t.test(spruce$Ht.change,conf.level = 0.95)$conf
[1] 28.33685 33.52982
attr(,"conf.level")
[1] 0.95
SOLUTION: We are 95% confident that the mean height change for spruce trees in the 5 year period was between 28.34 and 33.53.
Ht.change for the seedlings in the fertilized and nonfertilized plots.F <- spruce %>%
select(Ht.change,Fertilizer) %>%
filter(Fertilizer=="F")
NF <- spruce %>%
select(Ht.change,Fertilizer) %>%
filter(Fertilizer == "NF")
ggplot(spruce, aes(Ht.change, fill = Fertilizer)) + geom_histogram(alpha = 0.5)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qqnorm(F$Ht.change)
qqnorm(NF$Ht.change)
# Your code here
t.test(F$Ht.change,NF$Ht.change,alternative = "less",conf.level = .95)$conf
[1] -Inf 17.95558
attr(,"conf.level")
[1] 0.95
SOLUTION:
Prob2 - Consider the data set Girls2004 with birth weights of baby girls born in Wyoming or Alaska.
# Your code here
Girls2004 <- read.csv("http://www1.appstate.edu/~arnholta/Data/Girls2004.csv")
GirlsWY <- Girls2004 %>%
filter(State == "WY")
GirlsAK <- Girls2004 %>%
filter(State == "AK")
hist(GirlsWY$Weight)
ggplot(GirlsWY, aes(x = Weight)) +
geom_density()
hist(GirlsAK$Weight)
ggplot(GirlsAK, aes(x = Weight)) +
geom_density()
SOLUTION: Both Wyoming and Alaska appear to have a normal distribution, but the weights in Wyoming have a stronger normal distribution.
# Your code here
t.test(Girls2004$Weight[Girls2004$State=="AK"], Girls2004$Weight[Girls2004$State=="WY"],conf.level = 0.95)$conf
[1] 83.29395 533.60605
attr(,"conf.level")
[1] 0.95
SOLUTION: We can state with 95% confidence that the difference in mean of weights between girls in AK and WY falls between 83.294 and 533.606.
# Your code here
GirlsSmoker <- Girls2004 %>% filter(Smoker == "Yes")
GirlsNonSmoker <- Girls2004 %>% filter(Smoker == "No")
hist(GirlsSmoker$Weight)
ggplot(GirlsSmoker, aes(x = Weight)) +
geom_density()
hist(GirlsNonSmoker$Weight)
ggplot(GirlsNonSmoker, aes(x = Weight)) +
geom_density()
SOLUTION: The weights make a normal distribution for both smokers and non smokers in both AK and WY, but the non smokers baby weights is potentially positively skewed.
# Your code here
t.test(Girls2004$Weight[Girls2004$Smoker=="No"], Girls2004$Weight[Girls2004$Smoker=="Yes"], conf.level = .95)$conf
[1] -44.0330 617.9197
attr(,"conf.level")
[1] 0.95
SOLUTION: We are 95% confident that the mean difference in weight between girls born to women who are smokers and non smokers is between -44.03 and 617.92.
Prob3 - Import the FlightDelays data set into R. Although the data represent all flights for United Airlines and American Airlines in May and June 2009, assume for this exercise that these flights are a sample from all flights flown by the two airlines under similar conditions. We will compare the lengths of flight delays between the two airlines.
# Your code here
FlightDelays <- read.csv(url("http://www1.appstate.edu/~arnholta/Data/FlightDelays.csv"))
UA <- FlightDelays %>%
select(Delay,Carrier) %>%
filter(Carrier == "UA")
AA <- FlightDelays %>%
select(Delay,Carrier) %>%
filter(Carrier == "AA")
hist(UA$Delay)
ggplot(UA, aes(x = Delay)) +
geom_density()
hist(AA$Delay)
ggplot(AA, aes(x = Delay)) +
geom_density()
# Your code here
t.test(UA$Delay, AA$Delay, conf.level = .95)$conf
[1] 2.868194 8.903198
attr(,"conf.level")
[1] 0.95
SOLUTION: We are 95% confident that the difference in mean delay for both airlines is between 2.868194 and 8.903198.
Prob4 - Run a simulation to see if the \(t\) ratio \(T = (\bar{X} -\mu)/(S/\sqrt{n})\) has a \(t\) distribution or even an approximate \(t\) distribution when the samples are drawn from a nonnormal distribution. Be sure to superimpose the appropriate \(t\) density curve over the density of your simulated \(T\). Try two different nonnormal distributions \(\left( Unif(a = 0, b = 1), Exp(\lambda = 1) \right)\) and remember to see if sample size makes a difference (use \(n = 15\) and \(n=500\)).
When n is 15, (Unif(a = 0, b = 1))
# Your code here
n = 15;
a = 0;
b = 1;
sims <- 10^4
xbar <- numeric(sims)
SD <- numeric(sims)
for(i in 1:sims){
xn <- runif(n, a, b)
xbar[i] <- mean(xn)
SD[i] <- sd(xn)
}
T <- (xbar - 0.5)/(SD/sqrt(15))
curve(dt(x, 3), -5, 5)
hist(T, freq = FALSE)
curve(dt(x, 3), add = TRUE)
When n = 500, (Unif(a = 0, b = 1))
n = 500;
a = 0;
b = 1;
sims <- 10^4
xbar <- numeric(sims)
SD <- numeric(sims)
for(i in 1:sims){
xn <- runif(n, a, b)
xbar[i] <- mean(xn)
SD[i] <- sd(xn)
}
T <- (xbar - 0.5)/(SD/sqrt(500))
curve(dt(x, 3), -5, 5)
hist(T, freq = FALSE)
curve(dt(x, 3), add = TRUE)
When n = 15, EXP(lamba = 1)
n = 15
lamba = 1
sims <- 10^4
SD <- numeric(sims)
xbar <- numeric(sims)
for(i in 1:sims){
xn <- rexp(n, lamba)
xbar[i] <- mean(xn)
SD[i] <- sd(xn)
}
T <- (xbar - 1)/(SD/sqrt(15))
curve(dt(x, 3), -5, 5)
hist(T, freq = FALSE)
curve(dt(x, 3), add = TRUE)
When n = 500, EXP(lamba = 1)
n = 500
lamba = 1
sims <- 10^4
SD <- numeric(sims)
xbar <- numeric(sims)
for(i in 1:sims){
xn <- rexp(n, lamba)
xbar[i] <- mean(xn)
SD[i] <- sd(xn)
}
T <- (xbar - 1)/(SD/sqrt(500))
curve(dt(x, 3), -5, 5)
hist(T, freq = FALSE)
curve(dt(x, 3), add = TRUE)
SOULTION: After examining the graphs the larger n is for normal distributions the larger the more improved the distribution becomes. The mean is affected by sampling size of the exponential distributions.
Prob5 - One question is the 2002 General Social Survey asked participants whom they voted for in the 2000 election. Of the 980 women who voted, 459 voted for Bush. Of the 759 men who voted, 426 voted for Bush.
# Your code here
# Using a one sample z-test for a proportion
ci_women <- prop.test(459, 980, conf.level = .95, correct = FALSE)$conf
ci_women
[1] 0.4373100 0.4996717
attr(,"conf.level")
[1] 0.95
SOLUTION: We are 95% confident that the proportion of women that voted for President Bush fell between 0.437 and 0.499.
# Your code here
# Using a one sample z-test for a proportion
ci_men <- prop.test(426, 759, conf.level = .95, correct = FALSE)$conf
ci_men
[1] 0.5257409 0.5961717
attr(,"conf.level")
[1] 0.95
SOLUTION: We are 95% confident that the proportion of men that voted for President Bush fell between 0.525 and 0.596.
# Your code here
prop.test(c(459, 429), c(980, 759), correct = FALSE)$conf
[1] -0.14396498 -0.04973511
attr(,"conf.level")
[1] 0.95
SOLUTION: WE are 95% confidence that the difference in proportions of men who voted for bush and women who voted for Bush falls in the interval of (-0.143, -0.049)
Prob6 - A retail store wishes to conduct a marketing survey of its customers to see if customers would favor longer store hours. How many people should be in their sample if the marketers want their margin of error to be at most 3% with 95% confidence, assuming
# Your code here
z <- qnorm(.975)
p <- 0.5
mE <- 0.03
sampSize <- ((z^2) * p * (1 - p)) / (mE^2)
sampSize
[1] 1067.072
SOLUTION: To obtain an marigin error of 3% with 95% confidence level for the survey, the retail store should survey at most 1068 people
# Your code here
z <- qnorm(.975)
p <- 0.65
mE <- 0.03
sampSize <- ((z^2) * p * (1 - p)) / (mE^2)
sampSize
[1] 971.0354
SOLUTION: To obtain an marigin error of 3% with 95% confidence level for the survey and a previous survey of proportion estimate 65%, the retail store should survey at most 971 people
Prob7 - Suppose researchers wish to study the effectiveness of a new drug to alleviate hives due to math anxiety. Seven hundred math students are randomly assigned to take either this drug or a placebo. Suppose 34 of the 350 students who took the drug break out in hives compared to 56 of the 350 students who took the placebo.
# Your code here
prop.test(34, 350)$conf
[1] 0.06913692 0.13429260
attr(,"conf.level")
[1] 0.95
SOLUTION: With 95% confidence, of students taking the drug who break out in hives is between the interval (0.069, 0.13)
# Your code here
prop.test(56, 350)$conf
[1] 0.1240384 0.2036158
attr(,"conf.level")
[1] 0.95
SOLUTION: With 95% confidence, that the oproportion of students taking the placebo who break out in hives is between the interval (0.12, 0.20)
SOLUTION: The intervals do overlap, howver nothing can be concluded until a difference in proportions test is conducted.
# Your code here
prop.test(c(56, 34), c(350, 350))$conf
[1] 0.01062645 0.11508783
attr(,"conf.level")
[1] 0.95
SOLUTION: WE are 95% confidence that the difference in proportions that the the difference in proportions of students who break out in hives by using or not using this drug is between the interval of (0.01,0.11)
Prob8 - An article in the March 2003 New England Journal of Medicine describes a study to see if aspirin is effective in reducing the incidence of colorectal adenomas, a precursor to most colorectal cancers (Sandler et al. (2003)). Of 517 patients in the study, 259 were randomly assigned to receive aspirin and the remaining 258 received a placebo. One or more adenomas were found in 44 of the aspirin group and 70 in the placebo group. Find a 95% one-sided upper bound for the difference in proportions \((p_A - p_P)\) and interpret your interval.
# Your code here
prop.test(c(44, 70), c(259, 258), alternative = "less")$conf
[1] -1.00000000 -0.03801355
attr(,"conf.level")
[1] 0.95
SOLUTION: We are 95% confident that the difference in adenomas found in the aspirin group and the placebo group is less than or equal to -0.03801.
Prob9 - The data set Bangladesh has measurements on water quality from 271 wells in Bangladsesh. There are two missing values in the chlorine variable. Use the following R code to remove these two observations.
> chlorine <- with(Bangladesh, Chlorine[!is.na(Chlorine)])
# Your code here
Bangladesh <- read.csv("http://www1.appstate.edu/~arnholta/Data/Bangladesh.csv")
chlorine <- with(Bangladesh, Chlorine[!is.na(Chlorine)])
# Your code here
hist(chlorine)
IQR(chlorine)
[1] 50.5
qqnorm(chlorine)
qqline(chlorine)
SOLUTION: The distribution for chlorine is skewed to the right with an IQR of 50.5.
# Your code here
t.test(chlorine)
One Sample t-test
data: chlorine
t = 6.0979, df = 268, p-value = 3.736e-09
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
52.87263 103.29539
sample estimates:
mean of x
78.08401
SOLUTION: We are 95% confident that the mean of chlorine levels in Bangladesh wells lies between 52.873 and 103.295.
# Your code here
sims <- 10^4
xbar <- mean(chlorine)
n <- length(chlorine)
tstar <- numeric(sims)
for(i in 1:sims){
x <- sample(chlorine, size = n, replace = TRUE)
tstar[i] <- (mean(x) - xbar)/(sd(x)/sqrt(n))
}
TS <- quantile(tstar,c(0.025,0.975))
lower<-xbar-(TS[2]*(sd(chlorine)/sqrt(n)))
upper<-xbar-(TS[1]*(sd(chlorine)/sqrt(n)))
lower
97.5%
56.84449
upper
2.5%
111.8312
SOLUTION: From our bootstrap we can say with 95% confidence that the true mean of chlorine levels lie between 56.844494 and 111.8311533. We would report this interval instead of the confidence interval because bootstrapping accounts for the skewed data.
Bangladesh) and compare with the formula \(t\) and bootstrap \(t\) intervals.# Your code here
k_3 <- skewness(Bangladesh$Arsenic)
n <- length(Bangladesh$Arsenic)
xbar <- mean(Bangladesh$Arsenic)
q <- qt(0.975, n - 1)
lower1 <- xbar+(k_3 / ((6 * sqrt(n)) * (1 + (2 * q^2)))) - (q * (sd(Bangladesh$Arsenic)/sqrt(n)))
upper1 <- xbar+(k_3/((6 *sqrt(n)) * (1 + (2*q^2)))) + (q * (sd(Bangladesh$Arsenic)/sqrt(n)))
lower1
[1] 89.68893
upper1
[1] 160.9619
t.test(Bangladesh$Arsenic)$conf
[1] 89.68342 160.95643
attr(,"conf.level")
[1] 0.95
sims <- 10^4
tstar <- numeric(sims)
for(i in 1:sims){
x <- sample(Bangladesh$Arsenic, size = n, replace = TRUE)
tstar[i] <- (mean(x) - xbar)/(sd(x)/sqrt(n))
}
TS <- quantile(tstar,c(0.025,0.975))
lower2<-xbar-(TS[2]*(sd(Bangladesh$Arsenic)/sqrt(n)))
upper2<-xbar-(TS[1]*(sd(Bangladesh$Arsenic)/sqrt(n)))
lower2
97.5%
95.33215
upper2
2.5%
172.5809
SOLUTION: The Johnsons T confidence interval gives us 89.6889317 and 160.9619374 and is very close to the normal t interval of 89.68342 and 160.95643. The bootstrapped t confidence interval lies between 95.3321539 and 172.5809332.
Prob10 - The data set MnGroundwater has measurements on water quality of 895 randomly selected wells in Minnesota.
# Your code here
# Your code here
library(ggplot2)
library(resampledata)
# Histogram
ggplot(MnGroundwater, aes(x=Alkalinity)) +
geom_histogram(color= "black", fill = "blue") +
labs(title = "Alkalinity in Minnesota Histogram") +
theme_bw()
#density plot
ggplot(MnGroundwater, aes(x=Alkalinity)) +
geom_density(color= "black", fill = "blue") +
labs(title = "Alkalinity in Minnesota Density Plot") +
theme_bw()
#Normal Quartile Plot
ggplot(MnGroundwater, aes(sample=Alkalinity)) +
geom_qq(color= "black", fill = "blue") +
labs(title = "Alkalinity in Minnesota Normal Quartile Plot") +
theme_bw()
SOLUTION: By examining the three different graphs we can conclude that the distribution of the normal quartile plot and the histogram is normal.
# Your code here
tConfidence <- t.test(MnGroundwater$Alkalinity)$conf
tConfidence
[1] 283575.6 297789.8
attr(,"conf.level")
[1] 0.95
SOLUTION: We are 95% confident that the mean alkalinty levels in the Minnesota wells is between 283575.6 and 297789.8.
# Your code here
sims <- 10^4
xbar <- mean(MnGroundwater$Alkalinity)
n <- length(MnGroundwater$Alkalinity)
t <- numeric(sims)
for(i in 1:sims) {
bs <- sample(MnGroundwater$Alkalinity,
size = sum(!is.na(MnGroundwater$Alkalinity)), replace = TRUE)
t[i] <- (mean(bs) - mean(MnGroundwater$Alkalinity)/
(sd(bs)/sqrt(sum(!is.na(MnGroundwater$Alkalinity)))))
}
quantile(t, prob = c(0.025, 0.975))
2.5% 97.5%
283612.0 297714.5
SOLUTION: By examining out bootstrap we are 95% confident that the mean alkalinty levels in the Minnesota wells is between 283342.7 and 297801.6.
Prob11 Consider the babies born in Texas in 2004 (TXBirths2004). We will compare the weights of babies born to nonsmokers and smokers.
# Your code here
# Your code here
library(dplyr)
TXBirths2004 %>%
group_by(Smoker) %>%
summarise(Count = n())
# A tibble: 2 x 2
Smoker Count
<fctr> <int>
1 No 1497
2 Yes 90
SOLUTION: There are a total of 90 smokers and 1496 non-smokers in this data set
# Your code here
ggplot(TXBirths2004, aes(x=Weight)) +
geom_histogram(color= "black", fill = "orange") +
labs(title = "Histogram for Smokers and non-smokers") +
theme_bw() +
facet_grid(~Smoker)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
TXBirths2004 %>%
group_by(Smoker) %>%
summarise(Mean = mean(Weight), SD = sd(Weight), n())
# A tibble: 2 x 4
Smoker Mean SD `n()`
<fctr> <dbl> <dbl> <int>
1 No 3287.494 554.4829 1497
2 Yes 3205.989 504.2439 90
SOLUTION: The distributions for both smokers and non-smokers looks pretty normal. Smokers have a mean of 3205.989 and a standard deviation of 504.2439, while non-smokers have a mean of 3287.494 and a standard deviation of 554.4829.
# Your code here
smokerWeight <- filter(TXBirths2004, Smoker == "Yes")$Weight
nonSmokerWeight <- filter(TXBirths2004, Smoker == "No")$Weight
confidence <- t.test(smokerWeight, nonSmokerWeight)$confidence
sims <- 10^4
that <- mean(smokerWeight) - mean(nonSmokerWeight)
smoker_length <- length(smokerWeight)
nonSmoker_length <- length(nonSmokerWeight)
SE <- sqrt(var(smokerWeight)/smoker_length + var(nonSmokerWeight)/nonSmoker_length)
tstar <- numeric(sims)
for (i in 1:sims) {
smoke <- sample(smokerWeight, size = smoker_length, replace = TRUE)
nonsmoke <- sample(nonSmokerWeight, size = nonSmoker_length, replace = TRUE)
tstar[i] <- (mean(smoke) - mean(nonsmoke) - that) / sqrt(var(smoke)/smoker_length + var(nonsmoke)/nonSmoker_length)
}
boot <- that - quantile(tstar, c(0.025, 0.975)) * SE
boot
2.5% 97.5%
26.0874 -189.8860
SOLUTION: By looking at the bootstrap t interval we can conclude with 95% confidence that the difference in means is 27.63071 and -188.02910.
# Your code here
t.test(smokerWeight, nonSmokerWeight, alternative = "less")$conf
[1] -Inf 9.87141
attr(,"conf.level")
[1] 0.95
SOLUTION: By examining the one sided 95% t confidence interval we can conclude that the average in mean birth weights from a mother that smokes vs. a mother that does not smoke is 9.87141.