Chapter Seven Homework Answers Part 2

This is entirely our own work except as noted at the end of the document.

Prob6 - A retail store wishes to conduct a marketing survey of its customers to see if customers would favor longer store hours. How many people should be in their sample if the marketers want their margin of error to be at most 3% with 95% confidence, assuming

they have no preconceived idea of how customers will respond, and

error <- 0.03
ptilde <- .5
sample_size <- (ptilde*(1-ptilde))/(error/1.96)^2
sample_size

[1] 1067.111

SOLUTION: To provide a confidence level of 95% and a margin of error of a maximum 3%, the marketers should get responses from a minimum of 1068 people.

a previous survey indicated that about 65% of customers favor longer store hours.

ptilde <- .65
sample_size <- (ptilde*(1-ptilde))/(error/1.96)^2
sample_size

[1] 971.0711

SOLUTION: Given 65% of respondants of the survey favored longer hours, the store should use a sample size of 972 people.

Prob7 - Suppose researchers wish to study the effectiveness of a new drug to alleviate hives due to math anxiety. Seven hundred math students are randomly assigned to take either this drug or a placebo. Suppose 34 of the 350 students who took the drug break out in hives compared to 56 of the 350 students who took the placebo.

Compute a 95% confidence interval for the proportion of students taking the drug who break out in hives.

test <- prop.test(34,350)
test


    1-sample proportions test with continuity correction

data:  34 out of 350, null probability 0.5
X-squared = 225.6, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.06913692 0.13429260
sample estimates:
         p 
0.09714286

SOLUTION: We are 95% confident that the proportion of students taking the drug who break out in hives is within (0.0691, 0.1343).

Compute a 95% confidence interval for the proportion of students taking the placebo who break out in hives.

test <- prop.test(56,350)
test


    1-sample proportions test with continuity correction

data:  56 out of 350, null probability 0.5
X-squared = 160.48, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.1240384 0.2036158
sample estimates:
   p 
0.16

SOLUTION: We are 95% confident that the proportion of students taking the placebo who break out in hives is within (0.124, 0.204).

Do the intervals overlap? What, if anything, can you conclude about the effectiveness of the drug?

SOLUTION: The intervals do overlap, but we cannot conclude anything because there could be other variables coming into play.

Compute 95% confidence interval for the difference in proportions of students who break out in hives by using or not using this drug and give a sentence interpreting this interval.

x <- c(34,56)
n <- c(350,350)
test <- prop.test(x,n)
test


    2-sample test for equality of proportions with continuity
    correction

data:  x out of n
X-squared = 5.623, df = 1, p-value = 0.01773
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.11508783 -0.01062645
sample estimates:
    prop 1     prop 2 
0.09714286 0.16000000

SOLUTION: We are 95% confident that the difference in the proportion of students taking the placebo who break out in hives to the proportion of students taking the placebo who break out in hives is within (-0.1151, -0.0106). Since zero is not included in the interval we have sufficient statistical evidence to reject that there is no difference in the effects of the drug and the placebo.

Prob8 - An article in the March 2003 New England Journal of Medicine describes a study to see if aspirin is effective in reducing the incidence of colorectal adenomas, a precursor to most colorectal cancers (Sandler et al. (2003)). Of 517 patients in the study, 259 were randomly assigned to receive aspirin and the remaining 258 received a placebo. One or more adenomas were found in 44 of the aspirin group and 70 in the placebo group. Find a 95% one-sided upper bound for the difference in proportions \((p_A - p_P)\) and interpret your interval.

x <- c(44, 70)
n <- c(259, 258)
test <- prop.test(x, n, alternative="less")
test


    2-sample test for equality of proportions with continuity
    correction

data:  x out of n
X-squared = 7.158, df = 1, p-value = 0.003732
alternative hypothesis: less
95 percent confidence interval:
 -1.00000000 -0.03801355
sample estimates:
   prop 1    prop 2 
0.1698842 0.2713178

SOLUTION: We are 95% confident that the difference in adenomas found in the aspirin group and the placebo group is less than or equal to -0.038.

Prob9 - The data set Bangladesh has measurements on water quality from 271 wells in Bangladsesh. There are two missing values in the chlorine variable. Use the following R code to remove these two observations.

> chlorine <- with(Bangladesh, Chlorine[!is.na(Chlorine)])

Bangladesh <- read.csv("http://www1.appstate.edu/~arnholta/Data/Bangladesh.csv")
chlorine <- with(Bangladesh, Chlorine[!is.na(Chlorine)])

Compute the numeric summaries of the chlorine levels and create a plot and comment on the distribution.

mean(chlorine)

[1] 78.08401

sd(chlorine)

[1] 210.0192

hist(chlorine)
boxplot(chlorine)

SOLUTION: Skewed right with a mean of 78.084 and a standard deviation of 210.02.

Find a 95% \(t\) confidence interval for the mean \(\mu\) of chlorine levels in Bangladesh wells.

t.test(chlorine)


    One Sample t-test

data:  chlorine
t = 6.0979, df = 268, p-value = 3.736e-09
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
  52.87263 103.29539
sample estimates:
mean of x 
 78.08401

SOLUTION: We are 95% confident that the mean of chlorine levels in Bangladesh wells is within (52.873, 103.295).

Find a 95% bootstrap percentile and bootstrap \(t\) confidence intervals for the mean chlorine level and compare results. Which confidence interval will you report?

sims <- 10^4
Tstar <- numeric(sims)
for(i in 1:sims){
  bs <- sample(chlorine, size = sum(chlorine), replace = TRUE)
  Tstar[i] <- (mean(bs) - mean(chlorine)/(sd(bs)/sqrt(sum(chlorine))))
}
ggplot(data = data.frame(Tstar = Tstar), aes(x= Tstar)) +
  geom_density()
TS <- quantile(Tstar, prob = c(0.025, 0.975))
TS

    2.5%    97.5% 
19.41677 28.67781

SOLUTION: We would construct a bootstrap \(t\) confidence interval with a 95% confidence level to account for skewness. We are 95% confident that mean chlorine levels lie within (52.213, 55.95).

Johnson’s \(t\) confidence interval adjusts for skewness by shifting endpoints right or left for positive or negative skewness, respectively. The interval is \(\bar{X} + \hat{\kappa_3}/(6\sqrt{n})(1 + 2q^2) \pm q(S/\sqrt{n})\), where \(\hat{\kappa_3}\) is a sample estimate of the population skewness \(E(X - \mu)/\sigma^3\) and \(q\) denotes the \(1 - \alpha/2\) quantile for a \(t\) distribution with \(n-1\) degrees of freedom. Calculate Johnson’s \(t\) interval for the arsenic data (in Bangladesh) and compare with the formula \(t\) and bootstrap \(t\) intervals.

skewness <- skewness(Bangladesh$Arsenic, na.rm = TRUE)
sims <- 10^4
Tstar <- numeric(sims)
for(i in 1:sims){
  bs <- sample(Bangladesh$Arsenic, size = sum(!is.na(Bangladesh$Arsenic)), replace = TRUE)
  Tstar[i] <- (mean(bs) - mean(Bangladesh$Arsenic)/(sd(bs)/(sqrt(sum(!is.na(Bangladesh$Arsenic))))))
}
TS <- quantile(Tstar, prob = c(0.025, 0.975))
TS

     2.5%     97.5% 
 82.20605 158.23365

SOLUTION: The skewness is shown and matters, but we can work accordingly. With this method, we are 95% confident that the mean arsenic levels are within (81.904, 154.371).

Prob10 - The data set MnGroundwater has measurements on water quality of 895 randomly selected wells in Minnesota.

MnGroundwater <- read.csv("http://www1.appstate.edu/~arnholta/Data/MnGroundwater.csv")
glimpse(MnGroundwater)

Observations: 895
Variables: 10
$ County        <fctr> Aitkin, Aitkin, Aitkin, Aitkin, Aitkin, Aitkin,...
$ Aquifer.Group <fctr> surficial Quaternary, buried Quaternary, buried...
$ Water.Level   <int> 55, 30, 20, 3, 0, 30, 15, 10, 20, 25, 26, 16, 26...
$ Alkalinity    <int> 137000, 214000, 120000, 283000, 236000, 229000, ...
$ Aluminum      <dbl> 0.059, 2.380, 0.410, 158.190, 0.059, 0.059, 0.63...
$ Arsenic       <dbl> 1.810, 0.059, 1.440, 6.340, 10.170, 6.900, 3.650...
$ Chloride      <int> 490, 89250, 300, 780, 5090, 2590, 770, 750, 1700...
$ Lead          <dbl> 0.17, 0.18, 0.52, 1.15, 0.02, 0.12, 0.34, 0.26, ...
$ pH            <dbl> 7.10, 7.60, 6.90, 8.20, 7.90, 7.80, 8.20, 7.80, ...
$ Basin.Name    <fctr> Upper Mississippi River, Upper Mississippi Rive...

Create a histogram, a density, and a normal quantile plot of the alkalinity and comment on the distribution.

hist(MnGroundwater$Alkalinity)
ggplot(MnGroundwater, aes(x= Alkalinity)) +
  geom_density()
qqnorm(MnGroundwater$Alkalinity)
qqline(MnGroundwater$Alkalinity)
mean(MnGroundwater$Alkalinity)

[1] 290682.7

sd(MnGroundwater$Alkalinity)

[1] 108334.3

SOLUTION: The distribution of alkalinity is approximately normal with a mean of 290682.7 and a standard deviation of 108334.3.

Find the 95% \(t\) confidence interval for the mean \(\mu\) of alkalinity levels in Minnesota wells.

t.test(MnGroundwater$Alkalinity)


    One Sample t-test

data:  MnGroundwater$Alkalinity
t = 80.272, df = 894, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 283575.6 297789.8
sample estimates:
mean of x 
 290682.7

SOLUTION: We are 95% confident that the mean alkalinity levels in Minnesota wells is within (283575.6, 297789.8).

Find the 95% bootstrap percentile and bootstrap \(t\) confidence intervals for the mean alkalinity level and compare the results. Which confidence interval will you report?

sims <- 10^4
Tstar <- numeric(sims)
for(i in 1:sims){
  bs <- sample(MnGroundwater$Alkalinity, size = sum(!is.na(MnGroundwater$Alkalinity)), replace = TRUE)
  Tstar[i] <- (mean(bs) - mean(MnGroundwater$Alkalinity)/(sd(bs)/sqrt(sum(!is.na(MnGroundwater$Alkalinity)))))
}
TS <- quantile(Tstar, prob = c(0.025, 0.975))
TS

    2.5%    97.5% 
283640.4 297749.4

SOLUTION: We are 95% confident that the mean akalinity levels in Minnesota wells is within (283450.6, 297649.3).

Prob11 Consider the babies born in Texas in 2004 (TXBirths2004). We will compare the weights of babies born to nonsmokers and smokers.

TXBirths2004 <- read.csv("http://www1.appstate.edu/~arnholta/Data/TXBirths2004.csv")
glimpse(TXBirths2004)

Observations: 1,587
Variables: 8
$ ID         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
$ MothersAge <fctr> 20-24, 20-24, 25-29, 25-29, 15-19, 30-34, 30-34, 2...
$ Smoker     <fctr> No, No, No, No, No, No, No, No, No, No, No, No, No...
$ Gender     <fctr> Male, Male, Female, Female, Female, Female, Male, ...
$ Weight     <int> 3033, 3232, 3317, 2560, 2126, 2948, 3884, 2665, 371...
$ Gestation  <int> 39, 40, 37, 36, 37, 38, 39, 38, 40, 37, 39, 37, 40,...
$ Number     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ Multiple   <fctr> No, No, No, No, No, No, No, No, No, No, No, No, No...

How many nonsmokers and smokers are there in this data set?

smokers <- TXBirths2004$Smoker == "Yes"
sum(smokers)

[1] 90

nonsmokers <- TXBirths2004$Smoker == "No"
sum(nonsmokers)

[1] 1497

SOLUTION: 90 smokers and 1497 nonsmokers are represented in this dataset.

Create exploratory plots of the weights for the two groups and comment on the distributions.

msmokers <- TXBirths2004 %>%
  group_by(Smoker) %>%
  summarize(Mean = mean(Weight), StDev = sd(Weight), n())
msmokers

# A tibble: 2 x 4
  Smoker     Mean    StDev `n()`
  <fctr>    <dbl>    <dbl> <int>
1     No 3287.494 554.4829  1497
2    Yes 3205.989 504.2439    90

library(ggplot2)
ggplot(TXBirths2004, aes(x = Weight)) +
  geom_histogram() +
  facet_grid(~Smoker)

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

SOLUTION: The distribution of smokers is approximately normal with a mean of 3205.989 and a standard deviation of 504.244. The distribution of nonsmokers is slightly skewed to the left with a mean of 3287.494 and a standard deviation of 554.483.

Compute the 95% confidence interval for the difference in means using the formula \(t\), bootstrap percentile, and bootstrap \(t\) methods and compare your results. Which interval would you report?

SmokersWeight <- subset(TXBirths2004, select = Weight, Smoker == "Yes", drop = TRUE)
NonsmokersWeight <- subset(TXBirths2004, select = Weight, Smoker == "No", drop = TRUE)
thetahat <- mean(SmokersWeight) - mean(NonsmokersWeight)
nx <- length(SmokersWeight)
ny <- length(NonsmokersWeight)
SE <- sqrt(var(SmokersWeight)/nx + var(NonsmokersWeight)/ny)
N <- 10^4
Tstar <- numeric(N)
DM <- numeric(N)
for(i in 1:N){
  xboot <- sample(SmokersWeight, nx, replace = TRUE)
  yboot <- sample(NonsmokersWeight, ny, replace = TRUE)
  Tstar[i] <- (mean(xboot) - mean(yboot) - thetahat)/(sqrt(var(xboot)/nx + var(yboot)/ny))
  DM[i] <- mean(xboot) - mean(yboot)
}
CItboot <- thetahat - quantile(Tstar, c(.975, .025)) * SE
names(CItboot) <- NULL
CItboot

[1] -187.13140   27.92286

SOLUTION: The bootstrap \(t\) interval allows us to be 95% confident that the difference in means of smokers and nonsmokers is within (-190.843, 27.609).

Modify your result from the previous question to obtain a one-sided 95% \(t\) confidence interval (hypothesizing that babies born to nonsmokers weigh more than babies born to smokers).

t.test(SmokersWeight, NonsmokersWeight, alternative="greater")


    Welch Two Sample t-test

data:  SmokersWeight and NonsmokersWeight
t = -1.4806, df = 102.38, p-value = 0.9291
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 -172.8809       Inf
sample estimates:
mean of x mean of y 
 3205.989  3287.494

SOLUTION: We are 95% confident that the difference in weights of Texan babies is at least -172.193 by using the \(t\) confidence interval.

Chapter Seven Homework Answers Part 2

Sara Hursa

Wednesday, Dec 06, 2017 - 02:25:52 PM