Chapter Seven Homework

This is entirely our own work except as noted at the end of the document.

Prob1 - Import the data set Spruce into R.

Create exploratory plots to check the distribution of the variable Ht.change.

# Your code here
library(resampledata)
ggplot(Spruce, aes(x = Ht.change)) + geom_histogram()
mean(Spruce$Ht.change)

[1] 30.93333

sd(Spruce$Ht.change)

[1] 11.04943

The height change has a somewhat normal distribution. The mean is 30.9 and the standard deviation is 11.05.

Find a 95% $t$ confidence interval for the mean height change over the 5-year period of the study and give a sentence interpreting your interval.

ht.change_t <- t.test(Spruce$Ht.change)
ht.change_t


    One Sample t-test

data:  Spruce$Ht.change
t = 23.755, df = 71, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 28.33685 33.52982
sample estimates:
mean of x 
 30.93333

SOLUTION: We are 95% confident that the average height change is between 28.34 and 33.53.

Create exploratory plots to compare the distributions of the variable Ht.change for the seedlings in the fertilized and nonfertilized plots.

# Your code here
ggplot(Spruce, aes(x = Ht.change)) + geom_histogram() + facet_grid(Fertilizer~.)

Find the 95% one-sided lower $t$ confidence bound for the difference in mean heights ($\mu_F - \mu_{NF}$) over the 5-year period of the study and give a sentence interpreting your interval.

# Your code here
fert.height_t <- t.test(Spruce$Ht.change[Spruce$Fertilizer == "F"],Spruce$Ht.change[Spruce$Fertilizer == "NF"])
fert.height_t


    Welch Two Sample t-test

data:  Spruce$Ht.change[Spruce$Fertilizer == "F"] and Spruce$Ht.change[Spruce$Fertilizer == "NF"]
t = 7.5586, df = 69.697, p-value = 1.214e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 10.82909 18.59314
sample estimates:
mean of x mean of y 
 38.28889  23.57778

SOLUTION: We are 95% confident that the difference in average height change between Fertilized and non Fertilized is 10.83 and 18.59.

Prob2 - Consider the data set Girls2004 with birth weights of baby girls born in Wyoming or Alaska.

Create exploratory plots and compare the distribution of weight between the babies born in the two states.

# Your code here
ggplot(Girls2004, aes(x = Weight)) + geom_histogram() + facet_grid(State~.)

SOLUTION: The distribution of both weights in Arkansas and Wyomig are normal. Wyoming has a less scattered distibution and Arkansas has a larger distribution.

Find a 95% $t$ confidence interval for the difference in mean weights for girls born in these two states. Give a sentence interpreting this interval.

# Your code here
girls_t <- t.test(Girls2004$Weight[Girls2004$State == "AK"],Girls2004$Weight[Girls2004$State == "WY"])
girls_t


    Welch Two Sample t-test

data:  Girls2004$Weight[Girls2004$State == "AK"] and Girls2004$Weight[Girls2004$State == "WY"]
t = 2.7316, df = 71.007, p-value = 0.007946
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  83.29395 533.60605
sample estimates:
mean of x mean of y 
  3516.35   3207.90

SOLUTION: We are 95% confident that the mean difference in baby weights in grams is between 3207.90 and 3516.25.

Create exploratory plots and compare the distribution of weights between babies born to nonsmokers and babies born to smokers.

# Your code here
head(Girls2004)

  ID State MothersAge Smoker Weight Gestation
1  1    WY      15-19     No   3085        40
2  2    WY      35-39     No   3515        39
3  3    WY      25-29     No   3775        40
4  4    WY      20-24     No   3265        39
5  5    WY      25-29     No   2970        40
6  6    WY      20-24     No   2850        38

ggplot(Girls2004, aes(x = Weight)) + geom_histogram() + facet_grid(Smoker~.)

SOLUTION: There are a lot more babies born to nonsmokers than born to smokers. The distribution of the babies born to nonsmokers is normal. The distribution of babies born to smokers is fairly uniform.

Find a 95% $t$ confidence interval for the difference in mean weights between babies born to nonsmokers and smokers. Give a sentence interpreting this interval.

# Your code here
girls_t <- t.test(Girls2004$Weight[Girls2004$Smoker == "Yes"],Girls2004$Weight[Girls2004$Smoker == "No"])
girls_t


    Welch Two Sample t-test

data:  Girls2004$Weight[Girls2004$Smoker == "Yes"] and Girls2004$Weight[Girls2004$Smoker == "No"]
t = -1.8552, df = 14.35, p-value = 0.08423
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -617.9197   44.0330
sample estimates:
mean of x mean of y 
 3114.636  3401.580

SOLUTION: We are 95% confident that the difference in weights of the babies born to nonsmokers and babies born to smokers is 3114.636 and 3401.58 grams.

Prob3 - Import the FlightDelays data set into R. Although the data represent all flights for United Airlines and American Airlines in May and June 2009, assume for this exercise that these flights are a sample from all flights flown by the two airlines under similar conditions. We will compare the lengths of flight delays between the two airlines.

Create exploratory plots of the lengths of delays for the two airlines.

# Your code here
library(resampledata)
head(FlightDelays)

  ID Carrier FlightNo Destination DepartTime Day Month FlightLength Delay
1  1      UA      403         DEN      4-8am Fri   May          281    -1
2  2      UA      405         DEN     8-Noon Fri   May          277   102
3  3      UA      409         DEN      4-8pm Fri   May          279     4
4  4      UA      511         ORD     8-Noon Fri   May          158    -2
5  5      UA      667         ORD      4-8am Fri   May          143    -3
6  6      UA      669         ORD      4-8am Fri   May          150     0
  Delayed30
1        No
2       Yes
3        No
4        No
5        No
6        No

ggplot(FlightDelays, aes(x = Delay)) + geom_histogram() + facet_grid(Carrier~.)

Find a 95% $t$ confidence interval for the difference in mean flight delays between the two airlines and interpret this interval.

# Your code here
flights_t <- t.test(FlightDelays$Delay[FlightDelays$Carrier == "UA"],FlightDelays$Delay[FlightDelays$Carrier == "AA"])
flights_t


    Welch Two Sample t-test

data:  FlightDelays$Delay[FlightDelays$Carrier == "UA"] and FlightDelays$Delay[FlightDelays$Carrier == "AA"]
t = 3.8255, df = 1843.8, p-value = 0.0001349
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 2.868194 8.903198
sample estimates:
mean of x mean of y 
 15.98308  10.09738

#t.test(Delay~Carrier, data = FlightDelays, conf = 0.95) Another way to do this

SOLUTION: We are 95% confident that the difference in average flight delays by carrier is between 10.1 and 15.98.

Prob4 - Run a simulation to see if the $t$ ratio $T = (\bar{X} -\mu)/(S/\sqrt{n})$ has a $t$ distribution or even an approximate $t$ distribution when the samples are drawn from a nonnormal distribution. Be sure to superimpose the appropriate $t$ density curve over the density of your simulated $T$. Try two different nonnormal distributions $\left( Unif(a = 0, b = 1), Exp(\lambda = 1) \right)$ and remember to see if sample size makes a difference (use $n = 15$ and $n=500$).

n = 15
mu = 0.5
sims <- 10^4
xbar <- numeric(sims)
SD <- numeric(sims)
for (i in 1:sims) {
  xn <- runif(n, mu, 1)
  xbar[i] <- mean(xn)
  SD[i] <- sd(xn)
}
T <- (xbar - mu)/(SD/sqrt(n))
hist(T, freq = FALSE, breaks = "scott")
curve(dt(x, n-1), add = TRUE, col = "blue")

n = 500
mu = 0.5
sims <- 10^4
xbar <- numeric(sims)
SD <- numeric(sims)
for (i in 1:sims) {
  xn <- runif(n, mu, 1)
  xbar[i] <- mean(xn)
  SD[i] <- sd(xn)
}
T <- (xbar - mu)/(SD/sqrt(n))
hist(T, freq = FALSE, breaks = "scott")
curve(dt(x, n-1), add = TRUE, col = "blue")

n = 15
mu = 0.5
sims <- 10^4
xbar <- numeric(sims)
SD <- numeric(sims)
for (i in 1:sims) {
  xn <- rexp(n, rate = 1)
  xbar[i] <- mean(xn)
  SD[i] <- sd(xn)
}
T <- (xbar - (1/1))/(SD/sqrt(n))
hist(T, freq = FALSE, breaks = "scott", xlim = c(-5, 5))
curve(dt(x, n-1), add = TRUE, col = "blue")

n = 500
mu = 0.5
sims <- 10^4
xbar <- numeric(sims)
SD <- numeric(sims)
for (i in 1:sims) {
  xn <- rexp(n, rate = 1)
  xbar[i] <- mean(xn)
  SD[i] <- sd(xn)
}
T <- (xbar - (1/1))/(SD/sqrt(n))
hist(T, freq = FALSE, breaks = "scott", xlim = c(-5, 5))
curve(dt(x, n-1), add = TRUE, col = "blue")

SOULTION: As the sample size grows larger, the distribution becomes normal.

Prob5 - One question is the 2002 General Social Survey asked participants whom they voted for in the 2000 election. Of the 980 women who voted, 459 voted for Bush. Of the 759 men who voted, 426 voted for Bush.

Find a 95% confidence interval for the proportion of women who voted for Bush.

# Your code here
prop.test(x = 459, n = 980, conf.level = 0.95, correct = FALSE)


    1-sample proportions test without continuity correction

data:  459 out of 980, null probability 0.5
X-squared = 3.9224, df = 1, p-value = 0.04765
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.4373100 0.4996717
sample estimates:
        p 
0.4683673

SOLUTION: We are 95% certain that the proportion of women that voted for Bush in the 2002 General Social Survey is between prop.test(459,980)$conf.int[1] and prop.test(459,980)$conf.int[2].

Find a 95% confidence interval for the proportion of men who voted for Bush. Do the intervals for the men and women overlap? What, if anything, can you conclude about gender difference in voter preference?

# Your code here
prop.test(x = 426, n = 759, conf.level = 0.95, correct = FALSE)


    1-sample proportions test without continuity correction

data:  426 out of 759, null probability 0.5
X-squared = 11.395, df = 1, p-value = 0.0007363
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5257409 0.5961717
sample estimates:
        p 
0.5612648

We are 95% certain that the proportion of men that voted for Bush in the 2002 General Social Survey is between prop.test(426,759)$conf.int[1] and prop.test(426,759)$conf.int[2].

SOLUTION: (20C) - Find a 95% confidence interval for the difference in proportions and interpret your interval.

# Your code here
prop.test(x = c(459,426), n = c(980,759), correct = FALSE)


    2-sample test for equality of proportions without continuity
    correction

data:  c(459, 426) out of c(980, 759)
X-squared = 14.77, df = 1, p-value = 0.0001215
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.14003926 -0.04575569
sample estimates:
   prop 1    prop 2 
0.4683673 0.5612648

SOLUTION: We are 95% that the difference in proportion between the mean of women voting for Bush and men voting for Bush lies between prop.test(c(459,426),c(980,759))$conf.int[1] and prop.test(c(459,426),c(980,759))$conf.int[2].

Prob6 - A retail store wishes to conduct a marketing survey of its customers to see if customers would favor longer store hours. How many people should be in their sample if the marketers want their margin of error to be at most 3% with 95% confidence, assuming

they have no preconceived idea of how customers will respond, and

e <- 0.03
n <- .5
sample_size <- ceiling((n*(1-n))/(e/1.96)^2)
sample_size

[1] 1068

SOLUTION: To have a 95% confidence interval and a margin error of 3% there should be 1068 people in the sample.

a previous survey indicated that about 65% of customers favor longer store hours.

e<- .03
n <- .65
sample_size <- ceiling((n*(1-n))/(e/1.96)^2)
sample_size

[1] 972

SOLUTION: The survey with 65% customers favoring longer store hours should use a sample size of 972 people.

Prob7 - Suppose researchers wish to study the effectiveness of a new drug to alleviate hives due to math anxiety. Seven hundred math students are randomly assigned to take either this drug or a placebo. Suppose 34 of the 350 students who took the drug break out in hives compared to 56 of the 350 students who took the placebo.

Compute a 95% confidence interval for the proportion of students taking the drug who break out in hives.

prop.test(34,350)


    1-sample proportions test with continuity correction

data:  34 out of 350, null probability 0.5
X-squared = 225.6, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.06913692 0.13429260
sample estimates:
         p 
0.09714286

SOLUTION: We are 95% confident that the porpotion of students taking the drug who break out in hives fall between .069137 and .134293.

Compute a 95% confidence interval for the proportion of students taking the placebo who break out in hives.

 prop.test(56,350)


    1-sample proportions test with continuity correction

data:  56 out of 350, null probability 0.5
X-squared = 160.48, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.1240384 0.2036158
sample estimates:
   p 
0.16

SOLUTION: We are 95% confident that the proportion of students taking the placebo and break out in hives fall between .12404 and .20362

Do the intervals overlap? What, if anything, can you conclude about the effectiveness of the drug?

SOLUTION: Yes they overlap; cannot conclude anything.

Compute 95% confidence interval for the difference in proportions of students who break out in hives by using or not using this drug and give a sentence interpreting this interval.

x <- c(34,56)
n <- c(350,350)
test <- prop.test(x,n)
test


    2-sample test for equality of proportions with continuity
    correction

data:  x out of n
X-squared = 5.623, df = 1, p-value = 0.01773
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.11508783 -0.01062645
sample estimates:
    prop 1     prop 2 
0.09714286 0.16000000

SOLUTION:We are 95% confident that the difference in proportions of students who break out in hives by using or not using this drug is between -.115088 and -.016265.

Prob8 - An article in the March 2003 New England Journal of Medicine describes a study to see if aspirin is effective in reducing the incidence of colorectal adenomas, a precursor to most colorectal cancers (Sandler et al. (2003)). Of 517 patients in the study, 259 were randomly assigned to receive aspirin and the remaining 258 received a placebo. One or more adenomas were found in 44 of the aspirin group and 70 in the placebo group. Find a 95% one-sided upper bound for the difference in proportions $(p_A - p_P)$ and interpret your interval.

a <- c(44, 70)
b <- c(259, 258)
test <- prop.test(a, b, alternative="less")
test


    2-sample test for equality of proportions with continuity
    correction

data:  a out of b
X-squared = 7.158, df = 1, p-value = 0.003732
alternative hypothesis: less
95 percent confidence interval:
 -1.00000000 -0.03801355
sample estimates:
   prop 1    prop 2 
0.1698842 0.2713178

SOLUTION: We are 95% confident that the difference in placebo is between -1 and -.03801.

Prob9 - The data set Bangladesh has measurements on water quality from 271 wells in Bangladsesh. There are two missing values in the chlorine variable. Use the following R code to remove these two observations.

> chlorine <- with(Bangladesh, Chlorine[!is.na(Chlorine)])

chlorine <- with(Bangladesh, Chlorine[!is.na(Chlorine)])

Compute the numeric summaries of the chlorine levels and create a plot and comment on the distribution.

hist(chlorine)
mu <- median(chlorine)
sd <- IQR(chlorine)
n <- length(chlorine)

SOLUTION:

Find a 95% $t$ confidence interval for the mean $\mu$ of chlorine levels in Bangladesh wells.

# 95% CI for mean/mu of chlorine levels 

t.test(chlorine, conf.level = .95)$conf

[1]  52.87263 103.29539
attr(,"conf.level")
[1] 0.95

# I am 95% confident that the mean of chlorine levels in Bangladesh lies between 52.873 and 103.295

SOLUTION:

Find a 95% bootstrap percentile and bootstrap $t$ confidence intervals for the mean chlorine level and compare results. Which confidence interval will you report?

# what's the difference between the two 
# T bootstrap
xbar <- mean(chlorine)
sims <- 10^4
Tstar <- numeric(sims)
xbarstar <- numeric(sims)
for(i in 1:sims){
  bs <- sample(chlorine, size = n, replace = TRUE)
  Tstar[i] <- (mean(bs)-xbar)/(sd(bs)/sqrt(xbar))
  xbarstar[i] <- mean(bs) # this is for 95% bootstrap percentile
}
# critical values Q1 Q2 
CIt <- quantile(Tstar, prob = c(0.025, 0.975))
names(CIt) <- NULL
CIt

[1] -1.4765689  0.9049597

# Find bootstrap t CI
LL <- xbar - CIt[2]*sd/sqrt(n)
UL <- xbar - CIt[1]*sd/sqrt(n)
c(LL, UL)

[1] 75.29761 82.63043

# Percentile bootstrap CI
quantile(xbarstar, prob = c(0.025, 0.975))

    2.5%    97.5% 
 54.4308 105.4736

Solution: I will report the bootstrap t confidence interval because our data is extremly skewed.

Problem 10 (26 in book)The data set MnGroundwater has measurements on water quality of 895 randomly selected wells in Minnesota.

MnGroundwater <- read.csv("http://www1.appstate.edu/~arnholta/Data/MnGroundwater.csv")

Create a histogram, a density, and a normal quantile plot of the alkalinity and comment on the distribution.

# Your code here
hist(MnGroundwater$Alkalinity)
qqnorm(MnGroundwater$Alkalinity)
ggplot(MnGroundwater, aes(x = Alkalinity)) + geom_density()

SOLUTION: The distribution of the Alkalinity levels are fairly normal.

Find the 95% $t$ confidence interval for the mean $\mu$ of alkalinity levels in Minnesota wells.

# Your code here
alk_t <- t.test(MnGroundwater$Alkalinity)
alk_t


    One Sample t-test

data:  MnGroundwater$Alkalinity
t = 80.272, df = 894, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 283575.6 297789.8
sample estimates:
mean of x 
 290682.7

SOLUTION: We are 95% confident that the alkalinity levels in Minnesota wells is between 283575.6 and 297789.8.

Find the 95% bootstrap percentile and bootstrap $t$ confidence intervals for the mean alkalinity level and compare the results. Which confidence interval will you report?

# Your code here
thetahat <- mean(MnGroundwater$Alkalinity)
sims <- 10^4
Tstar <- numeric(sims)
for(i in 1:sims){
  bs <- sample(MnGroundwater$Alkalinity, size = sum(!is.na(MnGroundwater$Alkalinity)), replace = TRUE)
  Tstar[i] <- (mean(bs) - thetahat)/(sd(bs)/sqrt(sum(!is.na(MnGroundwater$Alkalinity))))
}
ggplot(data = data.frame(Tstar = Tstar), aes(x = Tstar)) +
  geom_density(fill = "purple", alpha = 0.5) + 
  theme_bw()

library(boot)
mean.boot <- function(data, i){
  d <- data[i]
  M <- mean(d)
  V <- var(d)/length(i)
  return(c(M, V))
}
boot.out <- boot(MnGroundwater$Alkalinity, mean.boot, R = 10^4)
boot.ci(boot.out, conf = 0.95, type = c("perc", "stud"))

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

CALL : 
boot.ci(boot.out = boot.out, conf = 0.95, type = c("perc", "stud"))

Intervals : 
Level    Studentized          Percentile     
95%   (283485, 297802 )   (283513, 297848 )  
Calculations and Intervals on Original Scale

SOLUTION: We are 95% confident that the intervals for the average alkalinity levels are between 283622 and 297713. The bootstrap confidence intervals are a smaller range of numbers by about 200 comapred to the t test confidence intervals. Because the interval range is smaller for the bootstrap, it would be recommended to use the bootstrap 95% confidence intervals, rather than the t test 95% confidence intervals. The distribution of the alkalinity levels in Minnesota appears to be a normal distribution so this means that a normal t test interval would also be appropriate.

Problem 11 (27 in the book) Consider the babies born in Texas in 2004 (TXBirths2004). We will compare the weights of babies born to nonsmokers and smokers.

TXBirths2004 <- read.csv("http://www1.appstate.edu/~arnholta/Data/TXBirths2004.csv")

How many nonsmokers and smokers are there in this data set?

TXBirths2004 %>%
  group_by(Smoker) %>%
  summarize(n())

# A tibble: 2 x 2
  Smoker `n()`
  <fctr> <int>
1     No  1497
2    Yes    90

SOLUTION: There are 90 smokers, and 1,497 nonsmokers in the dataset.

Create exploratory plots of the weights for the two groups and comment on the distributions.

ggplot(TXBirths2004, aes(x = Weight)) + geom_histogram() + facet_grid(Smoker~.)

SOLUTION: The distribution of the nonsmokers weights are left skewed. The distribution of the weights of the smokers is fairly uniform.

Compute the 95% confidence interval for the difference in means using the formula $t$, bootstrap percentile, and bootstrap $t$ methods and compare your results. Which interval would you report?

smokers <- TXBirths2004 %>% filter(TXBirths2004$Smoker == "Yes")
nonsmokers <- TXBirths2004 %>% filter(TXBirths2004$Smoker == "No")

t.test(smokers$Weight, nonsmokers$Weight)


    Welch Two Sample t-test

data:  smokers$Weight and nonsmokers$Weight
t = -1.4806, df = 102.38, p-value = 0.1418
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -190.69149   27.68196
sample estimates:
mean of x mean of y 
 3205.989  3287.494

thetahat <- mean(smokers$Weight) - mean(nonsmokers$Weight)
se <- sqrt(var(smokers$Weight)/length(smokers$Weight) + 
             var(nonsmokers$Weight)/length(nonsmokers$Weight))
sims<- 10^4
Tstar <- numeric(sims)
diffMeans <- numeric(sims)
for(i in 1:sims)
{
  bootsmokeyes <- sample(smokers$Weight, length(smokers$Weight), replace=TRUE)
  bootsmokeno <- sample(nonsmokers$Weight, length(nonsmokers$Weight), replace=TRUE)
  Tstar[i] <- (mean(bootsmokeyes) - mean(bootsmokeno) - thetahat) / 
  sqrt(var(bootsmokeyes)/length(smokers$Weight) + var(bootsmokeno)/length(nonsmokers$Weight))
  diffMeans[i] <- mean(bootsmokeyes) - mean(bootsmokeno)
}
ciboot <- thetahat - quantile(Tstar, c(.975, .025)) * se
ciboot

     97.5%       2.5% 
-189.48830   26.28634

cit <- quantile(diffMeans, c(0.025, 0.975))
cit

      2.5%      97.5% 
-187.35872   25.47227

SOLUTION: The confidence interval for a regular t test has a difference in numbers as 219. The bootstrap confidence interval has a difference in numbers as 218. Because the range of numbers is so similar, the t-test is what I would recommend to report for the average baby weights difference between nonsmoker mothers and smoker mothers because it is a simply calculation, whereas the bootstrap confidence interval takes more time to get a not-very-much-more precise answer.

Modify your result from the previous question to obtain a one-sided 95% $t$ confidence interval (hypothesizing that babies born to nonsmokers weigh more than babies born to smokers).

# Your code here
t.test(smokers$Weight,nonsmokers$Weight, alternative = "greater")


    Welch Two Sample t-test

data:  smokers$Weight and nonsmokers$Weight
t = -1.4806, df = 102.38, p-value = 0.9291
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 -172.8809       Inf
sample estimates:
mean of x mean of y 
 3205.989  3287.494

SOLUTION: We are 95% confident that the difference in weights of babies from nonsmoker mothers and smoker mothers is greater than -179.88 grams.

Chapter Seven Homework

Samantha Widman, Becca Bryan

Wednesday, Dec 06, 2017 - 02:02:35 PM