Chapter Seven Homework

This is entirely our own work except as noted at the end of the document.

Prob1 - Import the data set Spruce into R.

Create exploratory plots to check the distribution of the variable Ht.change.

# Your code here
spruce <- read.csv("http://www1.appstate.edu/~arnholta/Data/Spruce.csv")
hist(spruce$Ht.change)
ggplot(spruce,aes(x=Ht.change))+geom_density()

Find a 95% \(t\) confidence interval for the mean height change over the 5-year period of the study and give a sentence interpreting your interval.

# Your code here
t.test(spruce$Ht.change,conf.level = 0.95)$conf

[1] 28.33685 33.52982
attr(,"conf.level")
[1] 0.95

SOLUTION: We are 95% confident that the mean height change for spruce trees in the 5 year period was between 28.34 and 33.53.

Create exploratory plots to compare the distributions of the variable Ht.change for the seedlings in the fertilized and nonfertilized plots.

F <- spruce %>% 
  select(Ht.change,Fertilizer) %>%
  filter(Fertilizer=="F")

NF <- spruce %>% 
  select(Ht.change,Fertilizer) %>%
  filter(Fertilizer == "NF") 

ggplot(spruce, aes(Ht.change, fill = Fertilizer)) + geom_histogram(alpha = 0.5)

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qqnorm(F$Ht.change)
qqnorm(NF$Ht.change)

Find the 95% one-sided lower \(t\) confidence bound for the difference in mean heights (\(\mu_F - \mu_{NF}\)) over the 5-year period of the study and give a sentence interpreting your interval.

# Your code here

t.test(F$Ht.change,NF$Ht.change,alternative = "less",conf.level = .95)$conf

[1]     -Inf 17.95558
attr(,"conf.level")
[1] 0.95

SOLUTION:

Prob2 - Consider the data set Girls2004 with birth weights of baby girls born in Wyoming or Alaska.

Create exploratory plots and compare the distribution of weight between the babies born in the two states.

# Your code here
Girls2004 <- read.csv("http://www1.appstate.edu/~arnholta/Data/Girls2004.csv")

GirlsWY <- Girls2004 %>% 
  filter(State == "WY")
GirlsAK <- Girls2004 %>% 
  filter(State == "AK")

hist(GirlsWY$Weight)

ggplot(GirlsWY, aes(x = Weight)) + 
  geom_density() 

hist(GirlsAK$Weight)

ggplot(GirlsAK, aes(x = Weight)) + 
  geom_density()

SOLUTION: Both Wyoming and Alaska appear to have a normal distribution, but the weights in Wyoming have a stronger normal distribution.

Find a 95% \(t\) confidence interval for the difference in mean weights for girls born in these two states. Give a sentence interpreting this interval.

# Your code here
t.test(Girls2004$Weight[Girls2004$State=="AK"], Girls2004$Weight[Girls2004$State=="WY"],conf.level = 0.95)$conf

[1]  83.29395 533.60605
attr(,"conf.level")
[1] 0.95

SOLUTION: We can state with 95% confidence that the difference in mean of weights between girls in AK and WY falls between 83.294 and 533.606.

Create exploratory plots and compare the distribution of weights between babies born to nonsmokers and babies born to smokers.

# Your code here
GirlsSmoker <- Girls2004 %>% filter(Smoker == "Yes")
GirlsNonSmoker <- Girls2004 %>% filter(Smoker == "No")

hist(GirlsSmoker$Weight)

ggplot(GirlsSmoker, aes(x = Weight)) + 
  geom_density() 

hist(GirlsNonSmoker$Weight)

ggplot(GirlsNonSmoker, aes(x = Weight)) + 
  geom_density()

SOLUTION: The weights make a normal distribution for both smokers and non smokers in both AK and WY, but the non smokers baby weights is potentially positively skewed.

Find a 95% \(t\) confidence interval for the difference in mean weights between babies born to nonsmokers and smokers. Give a sentence interpreting this interval.

# Your code here
t.test(Girls2004$Weight[Girls2004$Smoker=="No"], Girls2004$Weight[Girls2004$Smoker=="Yes"], conf.level = .95)$conf

[1] -44.0330 617.9197
attr(,"conf.level")
[1] 0.95

SOLUTION: We are 95% confident that the mean difference in weight between girls born to women who are smokers and non smokers is between -44.03 and 617.92.

Prob3 - Import the FlightDelays data set into R. Although the data represent all flights for United Airlines and American Airlines in May and June 2009, assume for this exercise that these flights are a sample from all flights flown by the two airlines under similar conditions. We will compare the lengths of flight delays between the two airlines.

Create exploratory plots of the lengths of delays for the two airlines.

# Your code here
FlightDelays <- read.csv(url("http://www1.appstate.edu/~arnholta/Data/FlightDelays.csv"))

UA <- FlightDelays %>% 
  select(Delay,Carrier) %>%
  filter(Carrier == "UA")
AA <- FlightDelays %>% 
  select(Delay,Carrier) %>%
  filter(Carrier == "AA")

hist(UA$Delay)

ggplot(UA, aes(x = Delay)) + 
  geom_density() 

hist(AA$Delay)

ggplot(AA, aes(x = Delay)) + 
  geom_density()

Find a 95% \(t\) confidence interval for the difference in mean flight delays between the two airlines and interpret this interval.

# Your code here
t.test(UA$Delay, AA$Delay, conf.level = .95)$conf

[1] 2.868194 8.903198
attr(,"conf.level")
[1] 0.95

SOLUTION: We are 95% confident that the difference in mean delay for both airlines is between 2.868194 and 8.903198.

Prob4 - Run a simulation to see if the \(t\) ratio \(T = (\bar{X} -\mu)/(S/\sqrt{n})\) has a \(t\) distribution or even an approximate \(t\) distribution when the samples are drawn from a nonnormal distribution. Be sure to superimpose the appropriate \(t\) density curve over the density of your simulated \(T\). Try two different nonnormal distributions \(\left( Unif(a = 0, b = 1), Exp(\lambda = 1) \right)\) and remember to see if sample size makes a difference (use \(n = 15\) and \(n=500\)).

When n is 15, (Unif(a = 0, b = 1))

# Your code here

n = 15;
a = 0;
b = 1;

sims <- 10^4
xbar <- numeric(sims)
SD <- numeric(sims)
for(i in 1:sims){
  xn <- runif(n, a, b)
  xbar[i] <- mean(xn)
  SD[i] <- sd(xn)
}

T <- (xbar - 0.5)/(SD/sqrt(15))
curve(dt(x, 3), -5, 5)
hist(T, freq = FALSE)
curve(dt(x, 3), add = TRUE)

When n = 500, (Unif(a = 0, b = 1))

n = 500;
a = 0;
b = 1;

sims <- 10^4
xbar <- numeric(sims)
SD <- numeric(sims)
for(i in 1:sims){
  xn <- runif(n, a, b)
  xbar[i] <- mean(xn)
  SD[i] <- sd(xn)
}

T <- (xbar - 0.5)/(SD/sqrt(500))
curve(dt(x, 3), -5, 5)
hist(T, freq = FALSE)
curve(dt(x, 3), add = TRUE)

When n = 15, EXP(lamba = 1)

n = 15
lamba = 1

sims <- 10^4
SD <- numeric(sims)
xbar <- numeric(sims)
for(i in 1:sims){
  xn <- rexp(n, lamba)
  xbar[i] <- mean(xn)
  SD[i] <- sd(xn)
}

T <- (xbar - 1)/(SD/sqrt(15))
curve(dt(x, 3), -5, 5)
hist(T, freq = FALSE)
curve(dt(x, 3), add = TRUE)

When n = 500, EXP(lamba = 1)

n = 500
lamba = 1

sims <- 10^4
SD <- numeric(sims)
xbar <- numeric(sims)
for(i in 1:sims){
  xn <- rexp(n, lamba)
  xbar[i] <- mean(xn)
  SD[i] <- sd(xn)
}

T <- (xbar - 1)/(SD/sqrt(500))
curve(dt(x, 3), -5, 5)
hist(T, freq = FALSE)
curve(dt(x, 3), add = TRUE)

SOULTION: After examining the graphs the larger n is for normal distributions the larger the more improved the distribution becomes. The mean is affected by sampling size of the exponential distributions.

Prob5 - One question is the 2002 General Social Survey asked participants whom they voted for in the 2000 election. Of the 980 women who voted, 459 voted for Bush. Of the 759 men who voted, 426 voted for Bush.

Find a 95% confidence interval for the proportion of women who voted for Bush.

# Your code here
# Using a one sample z-test for a proportion
ci_women <- prop.test(459, 980, conf.level = .95, correct = FALSE)$conf
ci_women

[1] 0.4373100 0.4996717
attr(,"conf.level")
[1] 0.95

SOLUTION: We are 95% confident that the proportion of women that voted for President Bush fell between 0.437 and 0.499.

Find a 95% confidence interval for the proportion of men who voted for Bush. Do the intervals for the men and women overlap? What, if anything, can you conclude about gender difference in voter preference?

# Your code here
# Using a one sample z-test for a proportion
ci_men <- prop.test(426, 759, conf.level = .95, correct = FALSE)$conf
ci_men

[1] 0.5257409 0.5961717
attr(,"conf.level")
[1] 0.95

SOLUTION: We are 95% confident that the proportion of men that voted for President Bush fell between 0.525 and 0.596.

Find a 95% confidence interval for the difference in proportions and interpret your interval.

# Your code here
prop.test(c(459, 429), c(980, 759), correct = FALSE)$conf

[1] -0.14396498 -0.04973511
attr(,"conf.level")
[1] 0.95

SOLUTION: WE are 95% confidence that the difference in proportions of men who voted for bush and women who voted for Bush falls in the interval of (-0.143, -0.049)

Prob6 - A retail store wishes to conduct a marketing survey of its customers to see if customers would favor longer store hours. How many people should be in their sample if the marketers want their margin of error to be at most 3% with 95% confidence, assuming

they have no preconceived idea of how customers will respond, and

# Your code here
z <- qnorm(.975)
p <- 0.5
mE <- 0.03 
sampSize <- ((z^2) * p * (1 - p)) / (mE^2)
sampSize

[1] 1067.072

SOLUTION: To obtain an marigin error of 3% with 95% confidence level for the survey, the retail store should survey at most 1068 people

a previous survey indicated that about 65% of customers favor longer store hours.

# Your code here
z <- qnorm(.975)
p <- 0.65
mE <- 0.03 
sampSize <- ((z^2) * p * (1 - p)) / (mE^2)
sampSize

[1] 971.0354

SOLUTION: To obtain an marigin error of 3% with 95% confidence level for the survey and a previous survey of proportion estimate 65%, the retail store should survey at most 971 people

Prob7 - Suppose researchers wish to study the effectiveness of a new drug to alleviate hives due to math anxiety. Seven hundred math students are randomly assigned to take either this drug or a placebo. Suppose 34 of the 350 students who took the drug break out in hives compared to 56 of the 350 students who took the placebo.

Compute a 95% confidence interval for the proportion of students taking the drug who break out in hives.

# Your code here
prop.test(34, 350)$conf

[1] 0.06913692 0.13429260
attr(,"conf.level")
[1] 0.95

SOLUTION: With 95% confidence, of students taking the drug who break out in hives is between the interval (0.069, 0.13)

Compute a 95% confidence interval for the proportion of students taking the placebo who break out in hives.

# Your code here
prop.test(56, 350)$conf

[1] 0.1240384 0.2036158
attr(,"conf.level")
[1] 0.95

SOLUTION: With 95% confidence, that the oproportion of students taking the placebo who break out in hives is between the interval (0.12, 0.20)

Do the intervals overlap? What, if anything, can you conclude about the effectiveness of the drug?

SOLUTION: The intervals do overlap, howver nothing can be concluded until a difference in proportions test is conducted.

Compute 95% confidence interval for the difference in proportions of students who break out in hives by using or not using this drug and give a sentence interpreting this interval.

# Your code here
prop.test(c(56, 34), c(350, 350))$conf

[1] 0.01062645 0.11508783
attr(,"conf.level")
[1] 0.95

SOLUTION: WE are 95% confidence that the difference in proportions that the the difference in proportions of students who break out in hives by using or not using this drug is between the interval of (0.01,0.11)

Prob8 - An article in the March 2003 New England Journal of Medicine describes a study to see if aspirin is effective in reducing the incidence of colorectal adenomas, a precursor to most colorectal cancers (Sandler et al. (2003)). Of 517 patients in the study, 259 were randomly assigned to receive aspirin and the remaining 258 received a placebo. One or more adenomas were found in 44 of the aspirin group and 70 in the placebo group. Find a 95% one-sided upper bound for the difference in proportions \((p_A - p_P)\) and interpret your interval.

# Your code here

prop.test(c(44, 70), c(259, 258), alternative = "less")$conf

[1] -1.00000000 -0.03801355
attr(,"conf.level")
[1] 0.95

SOLUTION: We are 95% confident that the difference in adenomas found in the aspirin group and the placebo group is less than or equal to -0.03801.

Prob9 - The data set Bangladesh has measurements on water quality from 271 wells in Bangladsesh. There are two missing values in the chlorine variable. Use the following R code to remove these two observations.

> chlorine <- with(Bangladesh, Chlorine[!is.na(Chlorine)])

# Your code here
Bangladesh <- read.csv("http://www1.appstate.edu/~arnholta/Data/Bangladesh.csv")
chlorine <- with(Bangladesh, Chlorine[!is.na(Chlorine)])

Compute the numeric summaries of the chlorine levels and create a plot and comment on the distribution.

# Your code here
hist(chlorine)
IQR(chlorine)

[1] 50.5

qqnorm(chlorine)
qqline(chlorine)

SOLUTION: The distribution for chlorine is skewed to the right with an IQR of 50.5.

Find a 95% \(t\) confidence interval for the mean \(\mu\) of chlorine levels in Bangladesh wells.

# Your code here
t.test(chlorine)


    One Sample t-test

data:  chlorine
t = 6.0979, df = 268, p-value = 3.736e-09
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
  52.87263 103.29539
sample estimates:
mean of x 
 78.08401

SOLUTION: We are 95% confident that the mean of chlorine levels in Bangladesh wells lies between 52.873 and 103.295.

Find a 95% bootstrap percentile and bootstrap \(t\) confidence intervals for the mean chlorine level and compare results. Which confidence interval will you report?

# Your code here
sims <- 10^4
xbar <- mean(chlorine)
n <- length(chlorine)
tstar <- numeric(sims)
for(i in 1:sims){
  x <- sample(chlorine, size = n, replace = TRUE)
  tstar[i] <- (mean(x) - xbar)/(sd(x)/sqrt(n))
}
TS <- quantile(tstar,c(0.025,0.975))
lower<-xbar-(TS[2]*(sd(chlorine)/sqrt(n)))
upper<-xbar-(TS[1]*(sd(chlorine)/sqrt(n)))
lower

   97.5% 
56.84449

upper

    2.5% 
111.8312

SOLUTION: From our bootstrap we can say with 95% confidence that the true mean of chlorine levels lie between 56.844494 and 111.8311533. We would report this interval instead of the confidence interval because bootstrapping accounts for the skewed data.

Johnson’s \(t\) confidence interval adjusts for skewness by shifting endpoints right or left for positive or negative skewness, respectively. The interval is \(\bar{X} + \hat{\kappa_3}/(6\sqrt{n})(1 + 2q^2) \pm q(S/\sqrt{n})\), where \(\hat{\kappa_3}\) is a sample estimate of the population skewness \(E(X - \mu)/\sigma^3\) and \(q\) denotes the \(1 - \alpha/2\) quantile for a \(t\) distribution with \(n-1\) degrees of freedom. Calculate Johnson’s \(t\) interval for the arsenic data (in Bangladesh) and compare with the formula \(t\) and bootstrap \(t\) intervals.

# Your code here
k_3 <- skewness(Bangladesh$Arsenic)
n <- length(Bangladesh$Arsenic)
xbar <- mean(Bangladesh$Arsenic)
q <- qt(0.975, n - 1)

lower1 <- xbar+(k_3 / ((6 * sqrt(n)) * (1 + (2 * q^2)))) - (q * (sd(Bangladesh$Arsenic)/sqrt(n)))
upper1 <- xbar+(k_3/((6 *sqrt(n)) * (1 + (2*q^2)))) + (q * (sd(Bangladesh$Arsenic)/sqrt(n)))
lower1

[1] 89.68893

upper1

[1] 160.9619

t.test(Bangladesh$Arsenic)$conf

[1]  89.68342 160.95643
attr(,"conf.level")
[1] 0.95

sims <- 10^4
tstar <- numeric(sims)
for(i in 1:sims){
  x <- sample(Bangladesh$Arsenic, size = n, replace = TRUE)
  tstar[i] <- (mean(x) - xbar)/(sd(x)/sqrt(n))
}
TS <- quantile(tstar,c(0.025,0.975))
lower2<-xbar-(TS[2]*(sd(Bangladesh$Arsenic)/sqrt(n)))
upper2<-xbar-(TS[1]*(sd(Bangladesh$Arsenic)/sqrt(n)))
lower2

   97.5% 
95.33215

upper2

    2.5% 
172.5809

SOLUTION: The Johnsons T confidence interval gives us 89.6889317 and 160.9619374 and is very close to the normal t interval of 89.68342 and 160.95643. The bootstrapped t confidence interval lies between 95.3321539 and 172.5809332.

Prob10 - The data set MnGroundwater has measurements on water quality of 895 randomly selected wells in Minnesota.

# Your code here

Create a histogram, a density, and a normal quantile plot of the alkalinity and comment on the distribution.

# Your code here
library(ggplot2)
library(resampledata)

# Histogram 
ggplot(MnGroundwater, aes(x=Alkalinity)) +
  geom_histogram(color= "black", fill = "blue") +
  labs(title = "Alkalinity in Minnesota Histogram") +
  theme_bw()

#density plot
ggplot(MnGroundwater, aes(x=Alkalinity)) +
  geom_density(color= "black", fill = "blue") +
  labs(title = "Alkalinity in Minnesota Density Plot") +
  theme_bw()

#Normal Quartile Plot 
ggplot(MnGroundwater, aes(sample=Alkalinity)) +
  geom_qq(color= "black", fill = "blue") +
  labs(title = "Alkalinity in Minnesota Normal Quartile Plot") +
  theme_bw()

SOLUTION: By examining the three different graphs we can conclude that the distribution of the normal quartile plot and the histogram is normal.

Find the 95% \(t\) confidence interval for the mean \(\mu\) of alkalinity levels in Minnesota wells.

# Your code here
tConfidence <- t.test(MnGroundwater$Alkalinity)$conf 
tConfidence

[1] 283575.6 297789.8
attr(,"conf.level")
[1] 0.95

SOLUTION: We are 95% confident that the mean alkalinty levels in the Minnesota wells is between 283575.6 and 297789.8.

Find the 95% bootstrap percentile and bootstrap \(t\) confidence intervals for the mean alkalinity level and compare the results. Which confidence interval will you report?

# Your code here
sims <- 10^4
xbar <- mean(MnGroundwater$Alkalinity)
n <- length(MnGroundwater$Alkalinity)
t <- numeric(sims)

for(i in 1:sims) {
    bs <- sample(MnGroundwater$Alkalinity, 
                 size = sum(!is.na(MnGroundwater$Alkalinity)), replace = TRUE)
  t[i] <- (mean(bs) - mean(MnGroundwater$Alkalinity)/
             (sd(bs)/sqrt(sum(!is.na(MnGroundwater$Alkalinity)))))
}

quantile(t, prob = c(0.025, 0.975))

    2.5%    97.5% 
283612.0 297714.5

SOLUTION: By examining out bootstrap we are 95% confident that the mean alkalinty levels in the Minnesota wells is between 283342.7 and 297801.6.

Prob11 Consider the babies born in Texas in 2004 (TXBirths2004). We will compare the weights of babies born to nonsmokers and smokers.

# Your code here

How many nonsmokers and smokers are there in this data set?

# Your code here
library(dplyr)
TXBirths2004 %>%
  group_by(Smoker) %>%
  summarise(Count = n())

# A tibble: 2 x 2
  Smoker Count
  <fctr> <int>
1     No  1497
2    Yes    90

SOLUTION: There are a total of 90 smokers and 1496 non-smokers in this data set

Create exploratory plots of the weights for the two groups and comment on the distributions.

# Your code here
ggplot(TXBirths2004, aes(x=Weight)) + 
  geom_histogram(color= "black", fill = "orange") +
  labs(title = "Histogram for Smokers and non-smokers") +
  theme_bw() + 
  facet_grid(~Smoker)

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

TXBirths2004 %>%
  group_by(Smoker) %>%
  summarise(Mean = mean(Weight), SD = sd(Weight), n())

# A tibble: 2 x 4
  Smoker     Mean       SD `n()`
  <fctr>    <dbl>    <dbl> <int>
1     No 3287.494 554.4829  1497
2    Yes 3205.989 504.2439    90

SOLUTION: The distributions for both smokers and non-smokers looks pretty normal. Smokers have a mean of 3205.989 and a standard deviation of 504.2439, while non-smokers have a mean of 3287.494 and a standard deviation of 554.4829.

Compute the 95% confidence interval for the difference in means using the formula \(t\), bootstrap percentile, and bootstrap \(t\) methods and compare your results. Which interval would you report?

# Your code here

smokerWeight <- filter(TXBirths2004, Smoker == "Yes")$Weight
nonSmokerWeight <- filter(TXBirths2004, Smoker == "No")$Weight
confidence <- t.test(smokerWeight, nonSmokerWeight)$confidence

sims <- 10^4
that <- mean(smokerWeight) - mean(nonSmokerWeight)
smoker_length <- length(smokerWeight)
nonSmoker_length <- length(nonSmokerWeight)
SE <- sqrt(var(smokerWeight)/smoker_length + var(nonSmokerWeight)/nonSmoker_length)

tstar <- numeric(sims)
for (i in 1:sims) {
  smoke <- sample(smokerWeight, size = smoker_length, replace = TRUE)
  nonsmoke <- sample(nonSmokerWeight, size = nonSmoker_length, replace = TRUE)
  tstar[i] <- (mean(smoke) - mean(nonsmoke) - that) / sqrt(var(smoke)/smoker_length + var(nonsmoke)/nonSmoker_length)
}

boot <- that - quantile(tstar, c(0.025, 0.975)) * SE
boot

     2.5%     97.5% 
  26.0874 -189.8860

SOLUTION: By looking at the bootstrap t interval we can conclude with 95% confidence that the difference in means is 27.63071 and -188.02910.

Modify your result from the previous question to obtain a one-sided 95% \(t\) confidence interval (hypothesizing that babies born to nonsmokers weigh more than babies born to smokers).

# Your code here
t.test(smokerWeight, nonSmokerWeight, alternative = "less")$conf

[1]    -Inf 9.87141
attr(,"conf.level")
[1] 0.95

SOLUTION: By examining the one sided 95% t confidence interval we can conclude that the average in mean birth weights from a mother that smokes vs. a mother that does not smoke is 9.87141.

Chapter Seven Homework

Cristian Gulisano, Michael Mckay, Sarah Mbiki

Wednesday, Dec 06, 2017 - 04:36:52 PM