Chapter 7 Homework Part 2

Prob6 - A retail store wishes to conduct a marketing survey of its customers to see if customers would favor longer store hours. How many people should be in their sample if the marketers want their margin of error to be at most 3% with 95% confidence, assuming

they have no preconceived idea of how customers will respond, and

# Use nsize to get the minimum size to give the margin of error with the default p of 0.5
nsize(b = .03, type = "pi")


The required sample size (n) to estimate the population 
proportion of successes with a 0.95 confidence interval 
so that the margin of error is no more than 0.03 is 1068 .

SOLUTION: Solution above. Because we have no prior estimate of the population proportion, we use the default value p = 0.5, which gives the largest (most conservative) sample size.

a previous survey indicated that about 65% of customers favor longer store hours.

# Use nsize to get the minimum size to give the margin of error with the prior estimate p = 0.65
nsize(b = .03, type = "pi", p = 0.65)


The required sample size (n) to estimate the population 
proportion of successes with a 0.95 confidence interval 
so that the margin of error is no more than 0.03 is 972 .

SOLUTION: Solution above. Here a previous survey gives an estimate of the population proportion, so we use p = 0.65.
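
The two sample sizes can be checked by hand from the margin-of-error formula \(z\sqrt{p(1-p)/n} \le b\), which rearranges to \(n \ge z^2 p(1-p)/b^2\). The sketch below uses only base R; `sample_size_prop` is a hypothetical helper name introduced for illustration and is not the `nsize` function used above.

```r
# Minimum n so that the margin of error z * sqrt(p * (1 - p) / n) is at most b.
# sample_size_prop is a hypothetical helper, not part of any package.
sample_size_prop <- function(b, p = 0.5, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)   # two-sided normal critical value
  ceiling(z^2 * p * (1 - p) / b^2)       # round up to the next whole person
}

n_no_prior <- sample_size_prop(b = 0.03)            # no prior estimate, p = 0.5
n_prior    <- sample_size_prop(b = 0.03, p = 0.65)  # prior estimate of 65%
print(c(n_no_prior, n_prior))
```

Under these assumptions the helper reproduces the values 1068 and 972 reported above.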

Prob7 - Suppose researchers wish to study the effectiveness of a new drug to alleviate hives due to math anxiety. Seven hundred math students are randomly assigned to take either this drug or a placebo. Suppose 34 of the 350 students who took the drug break out in hives compared to 56 of the 350 students who took the placebo.

Compute a 95% confidence interval for the proportion of students taking the drug who break out in hives.

confidence_experiment <- prop.test(34, 350, correct = FALSE)$conf

SOLUTION: We can conclude with 95% confidence that the proportion of students that take the drug who break out with hives is between 0.0703507 and 0.1326822.

Compute a 95% confidence interval for the proportion of students taking the placebo who break out in hives.

confidence_control <- prop.test(56, 350, correct = FALSE)$conf

SOLUTION: We can conclude with 95% confidence that the proportion of students who took the placebo (the control group) and broke out in hives is between 0.125315 and 0.2020674.
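
For a single proportion, `prop.test` with `correct = FALSE` returns the Wilson score interval. As a cross-check, the sketch below computes that interval directly in base R; `wilson_ci` is a hypothetical helper name used only for this illustration.

```r
# Wilson score interval for one proportion (no continuity correction).
# wilson_ci is a hypothetical helper written for this check.
wilson_ci <- function(x, n, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)
  p_hat <- x / n
  denom <- 1 + z^2 / n
  center <- (p_hat + z^2 / (2 * n)) / denom
  half <- z * sqrt(p_hat * (1 - p_hat) / n + z^2 / (4 * n^2)) / denom
  c(lower = center - half, upper = center + half)
}

drug_ci <- wilson_ci(34, 350)      # students who took the drug
placebo_ci <- wilson_ci(56, 350)   # students who took the placebo
print(rbind(drug = drug_ci, placebo = placebo_ci))
```

The endpoints agree with the `prop.test` intervals reported in the two solutions above.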

Do the intervals overlap? What, if anything, can you conclude about the effectiveness of the drug?

SOLUTION: The confidence intervals overlap between 0.125315 and 0.1326822. Overlap alone does not show that the two proportions are equal, so no firm conclusion about the drug's effectiveness can be drawn from these two intervals; a confidence interval for the difference in proportions is needed.

Compute a 95% confidence interval for the difference in proportions of students who break out in hives by using or not using this drug and give a sentence interpreting this interval.

# Extract the confidence interval for (drug proportion - placebo proportion)
confidence_total <- prop.test(c(34, 56), c(350, 350), correct = FALSE)$conf.int

SOLUTION: We can say with 95% confidence that the difference in the proportions of students who break out in hives (drug minus placebo) is between approximately -0.112 and -0.013. This interval does not include 0, so we can also conclude that the difference in proportions is significant, or that the medication does have an effect.
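
The interval for a difference in two proportions can also be computed by hand as \((\hat{p}_1 - \hat{p}_2) \pm z\sqrt{\hat{p}_1(1-\hat{p}_1)/n_1 + \hat{p}_2(1-\hat{p}_2)/n_2}\). The following base R sketch applies that formula to the counts in this problem; the variable names are illustrative only.

```r
# Wald interval for the difference in proportions (drug - placebo)
x <- c(drug = 34, placebo = 56)   # students who broke out in hives
n <- c(350, 350)                  # group sizes
p_hat <- x / n
diff_hat <- p_hat[["drug"]] - p_hat[["placebo"]]
se <- sqrt(sum(p_hat * (1 - p_hat) / n))   # unpooled standard error
z <- qnorm(0.975)
diff_ci <- diff_hat + c(-1, 1) * z * se
print(diff_ci)
```

This matches the `conf.int` element returned by the two-sample `prop.test` call with `correct = FALSE`, which is the element to extract rather than the test statistic.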

Prob8 - An article in the March 2003 New England Journal of Medicine describes a study to see if aspirin is effective in reducing the incidence of colorectal adenomas, a precursor to most colorectal cancers (Sandler et al. (2003)). Of 517 patients in the study, 259 were randomly assigned to receive aspirin and the remaining 258 received a placebo. One or more adenomas were found in 44 of the aspirin group and 70 in the placebo group. Find a 95% one-sided upper bound for the difference in proportions \((p_A - p_P)\) and interpret your interval.

# alternative = "less" gives a one-sided interval of the form (-1, upper bound)
confidence <- prop.test(c(44, 70), c(259, 258), correct = FALSE, alternative = "less")$conf.int

SOLUTION: We can say with 95% confidence that the difference in the proportions of patients with adenomas, \(p_A - p_P\), is at most approximately -0.042. Because this upper bound is below 0, we have evidence that aspirin is effective in reducing the incidence of adenomas.
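
A one-sided upper bound uses the 0.95 normal quantile instead of the 0.975 quantile and keeps only the upper endpoint. The sketch below computes it directly in base R and compares it with `prop.test`, where an interval of the form (-1, upper) corresponds to `alternative = "less"`. Variable names are illustrative.

```r
# One-sided 95% upper bound for p_A - p_P (aspirin - placebo)
x <- c(aspirin = 44, placebo = 70)   # patients with one or more adenomas
n <- c(259, 258)                     # group sizes
p_hat <- x / n
diff_hat <- p_hat[["aspirin"]] - p_hat[["placebo"]]
se <- sqrt(sum(p_hat * (1 - p_hat) / n))
upper_bound <- diff_hat + qnorm(0.95) * se

# The same bound from prop.test: second element of the one-sided interval
upper_from_prop_test <- prop.test(x, n, correct = FALSE,
                                  alternative = "less")$conf.int[2]
print(c(by_hand = upper_bound, prop_test = upper_from_prop_test))
```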

Prob9 - The data set Bangladesh has measurements on water quality from 271 wells in Bangladesh. There are two missing values in the chlorine variable. Use the following R code to remove these two observations.

> chlorine <- with(Bangladesh, Chlorine[!is.na(Chlorine)])

chlorine <- with(Bangladesh, Chlorine[!is.na(Chlorine)])

Compute the numeric summaries of the chlorine levels and create a plot and comment on the distribution.

ggplot(data.frame(X = chlorine), aes(x = X)) +
  geom_density(color = "black", fill = "cyan") +
  theme_bw() +
  labs(title = "Water Chlorine Levels in Bangladesh", x = "Chlorine Level")

SOLUTION: The distribution is strongly right-skewed, similar in shape to an exponential distribution. Because of the skew, the best measure of center is the median, which is 14.2, and the best measure of spread is the IQR, which is 50.5.
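
The numeric summaries quoted in the solution come from standard base R functions. The sketch below shows the calls on a simulated right-skewed vector, `chlorine_demo`, which is a hypothetical stand-in so the example runs without the Bangladesh data; with the real data, replace it with `chlorine`.

```r
# Hypothetical stand-in for the chlorine vector: 269 right-skewed values.
# These are simulated numbers, not the Bangladesh measurements.
set.seed(1)
chlorine_demo <- rexp(269, rate = 1 / 70)

five_num <- summary(chlorine_demo)   # min, quartiles, median, mean, max
center <- median(chlorine_demo)      # resistant measure of center
spread <- IQR(chlorine_demo)         # resistant measure of spread
print(five_num)
print(c(median = center, IQR = spread))
```

For right-skewed data the mean sits above the median, which is why the median and IQR are preferred here.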

Find a 95% \(t\) confidence interval for the mean \(\mu\) of chlorine levels in Bangladesh wells.

confidence <- t.test(chlorine)$conf

SOLUTION: We can say with 95% confidence that the true mean of chlorine levels in Bangladesh wells is between 52.872635 and 103.2953948.

Find a 95% bootstrap percentile and bootstrap \(t\) confidence intervals for the mean chlorine level and compare results. Which confidence interval will you report?

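# Bootstrap t: resample the data and studentize each resampled mean
sims <- 10^4
xbar <- mean(chlorine)
n <- length(chlorine)
t_star <- numeric(sims)
for (i in 1:sims) {
  x <- sample(chlorine, size = n, replace = TRUE)
  t_star[i] <- (mean(x) - xbar) / (sd(x)/sqrt(n))
}
# Quantiles of the bootstrap t statistics replace the usual t quantiles
confidence_boot <- quantile(t_star, c(0.025, 0.975))
# The upper quantile gives the lower endpoint and vice versa
lower_bound <- xbar - (confidence_boot[2] * (sd(chlorine)/sqrt(n)))
upper_bound <- xbar - (confidence_boot[1] * (sd(chlorine)/sqrt(n)))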

SOLUTION: From the bootstrap t interval we can say with 95% confidence that the true mean of chlorine levels in Bangladesh wells is between 57.3526732 and 112.3002873. We would report the bootstrap t interval because it adjusts for skewness, and our sample is strongly right-skewed. The sample is also large enough (n > 10) for the bootstrap t interval to be reasonable.
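
The question also asks for a bootstrap percentile interval, which takes the 2.5% and 97.5% quantiles of the resampled means directly, with no studentizing. A minimal sketch follows; `x_demo` is a simulated, hypothetical stand-in so the example is self-contained, and with the real data it would be replaced by `chlorine`.

```r
# Bootstrap percentile interval for a mean.
# x_demo is simulated right-skewed data, not the Bangladesh measurements.
set.seed(42)
x_demo <- rexp(269, rate = 1 / 70)

sims <- 10^4
boot_means <- numeric(sims)
for (i in seq_len(sims)) {
  # Resample with replacement, same size as the original sample
  boot_means[i] <- mean(sample(x_demo, replace = TRUE))
}
percentile_ci <- quantile(boot_means, c(0.025, 0.975))
print(percentile_ci)
```

For skewed data the percentile interval usually adjusts for skewness less than the bootstrap t interval does, which is one reason to prefer the bootstrap t here.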

Johnson’s \(t\) confidence interval adjusts for skewness by shifting endpoints right or left for positive or negative skewness, respectively. The interval is \(\bar{X} + \frac{\hat{\kappa}_3}{6\sqrt{n}}(1 + 2q^2) \pm q(S/\sqrt{n})\), where \(\hat{\kappa}_3\) is a sample estimate of the population skewness \(E[(X - \mu)^3]/\sigma^3\) and \(q\) denotes the \(1 - \alpha/2\) quantile for a \(t\) distribution with \(n-1\) degrees of freedom. Calculate Johnson’s \(t\) interval for the arsenic data (in Bangladesh) and compare with the formula \(t\) and bootstrap \(t\) intervals.

# skewness() comes from the e1071 package
library(e1071)
n <- length(Bangladesh$Arsenic)
xbar <- mean(Bangladesh$Arsenic)
q <- qt(0.975, n - 1)
k3 <- skewness(Bangladesh$Arsenic)
# Johnson's shift: (k3 / (6 * sqrt(n))) multiplied by (1 + 2 * q^2)
shift <- (k3 / (6 * sqrt(n))) * (1 + 2 * q^2)
lower_bound <- xbar + shift - (q * (sd(Bangladesh$Arsenic)/sqrt(n)))
upper_bound <- xbar + shift + (q * (sd(Bangladesh$Arsenic)/sqrt(n)))

# Formula t interval
confidence <- t.test(Bangladesh$Arsenic)$conf

# Bootstrap t interval
sims <- 10^4
t_star <- numeric(sims)
for (i in 1:sims) {
  x <- sample(Bangladesh$Arsenic, size = n, replace = TRUE)
  t_star[i] <- (mean(x) - xbar) / (sd(x)/sqrt(n))
}
confidence_boot <- quantile(t_star, c(0.025, 0.975))
lower_bound_boot <- xbar - (confidence_boot[2] * (sd(Bangladesh$Arsenic)/sqrt(n)))
upper_bound_boot <- xbar - (confidence_boot[1] * (sd(Bangladesh$Arsenic)/sqrt(n)))

SOLUTION: The formula t interval is (89.6834234, 160.956429) and the bootstrap t confidence interval is (95.2007911, 173.0304234). Johnson’s t interval is the formula t interval shifted upward by \(\hat{\kappa}_3(1 + 2q^2)/(6\sqrt{n})\), roughly 0.42, giving approximately (90.11, 161.38). Johnson’s interval is therefore close to the formula t interval, while the bootstrap t interval is shifted upward considerably more.

Prob10 - The data set MnGroundwater has measurements on water quality of 895 randomly selected wells in Minnesota.

Create a histogram, a density, and a normal quantile plot of the alkalinity and comment on the distribution.

ggplot(MnGroundwater, aes(x = Alkalinity)) +
  geom_histogram(color = "black", fill = "cyan") +
  theme_bw() +
  labs(title = "Histogram of Water Alkalinity")
ggplot(MnGroundwater, aes(sample = Alkalinity)) +
  geom_qq(color = "black", fill = "cyan") +
  theme_bw() +
  labs(title = "QQ Plot of Water Alkalinity")
ggplot(MnGroundwater, aes(x = Alkalinity)) +
  geom_density(color = "black", fill = "cyan") +
  theme_bw() +
  labs(title = "Density of Water Alkalinity")

SOLUTION: The distribution appears to be very close to normal.

Find the 95% \(t\) confidence interval for the mean \(\mu\) of alkalinity levels in Minnesota wells.

confidence <- t.test(MnGroundwater$Alkalinity)$conf

SOLUTION: We can state with 95% confidence that the true mean of alkalinity values is between \(2.835756 \times 10^{5}\) and \(2.9778976 \times 10^{5}\).

Find the 95% bootstrap percentile and bootstrap \(t\) confidence intervals for the mean alkalinity level and compare the results. Which confidence interval will you report?

sims <- 10^4
xbar <- mean(MnGroundwater$Alkalinity)
n <- length(MnGroundwater$Alkalinity)
t_star <- numeric(sims)
for (i in 1:sims) {
  x <- sample(MnGroundwater$Alkalinity, size = n, replace = TRUE)
  t_star[i] <- (mean(x) - xbar) / (sd(x)/sqrt(n))
}
confidence_boot <- quantile(t_star, c(0.025, 0.975))
lower_bound <- xbar - (confidence_boot[2] * (sd(MnGroundwater$Alkalinity)/sqrt(n)))
upper_bound <- xbar - (confidence_boot[1] * (sd(MnGroundwater$Alkalinity)/sqrt(n)))

SOLUTION: From the bootstrap t interval we can state with 95% confidence that the true mean of alkalinity values is between \(2.8347042 \times 10^{5}\) and \(2.9770769 \times 10^{5}\). We would report this interval rather than the formula t interval, since the bootstrap t adjusts for skewness and other departures from normality in the data. The sample is also large enough (n > 10) for the bootstrap t interval to be reliable.

Prob11 - Consider the babies born in Texas in 2004 (TXBirths2004). We will compare the weights of babies born to nonsmokers and smokers.

How many nonsmokers and smokers are there in this data set?

TXBirths2004 %>% 
  group_by(Smoker) %>% 
  summarise(Count = n())

# A tibble: 2 x 2
  Smoker Count
  <fctr> <int>
1     No  1497
2    Yes    90

Create exploratory plots of the weights for the two groups and comment on the distributions.

ggplot(TXBirths2004, aes(x = Weight)) +
  geom_density(color = "black", fill = "cyan") +
  theme_bw() +
  facet_grid(Smoker~.) +
  labs(title = "Birth Weights of Texas Children in 2004", x = "Birth Weight")
ggplot(TXBirths2004, aes(sample = Weight)) +
  geom_qq(color = "black", fill = "cyan") +
  theme_bw() +
  facet_grid(Smoker~.) +
  labs(title = "Normal QQ Plots of Birth Weights of Texas Children in 2004")

SOLUTION: The distribution for both smokers and non-smokers appears to be normal but skewed slightly to the left.

Compute the 95% confidence interval for the difference in means using the formula \(t\), bootstrap percentile, and bootstrap \(t\) methods and compare your results. Which interval would you report?

smoker <- filter(TXBirths2004, Smoker == "Yes")$Weight
nonsmoker <- filter(TXBirths2004, Smoker == "No")$Weight
# Formula t interval for mean(smoker) - mean(nonsmoker)
confidence <- t.test(smoker, nonsmoker)$conf.int
sims <- 10^4
thetahat <- mean(smoker) - mean(nonsmoker)
nx <- length(smoker)
ny <- length(nonsmoker)
SE <- sqrt(var(smoker)/nx + var(nonsmoker)/ny)
t_star <- numeric(sims)
for (i in 1:sims) {
  # Resample each group independently from its own data
  x <- sample(smoker, size = nx, replace = TRUE)
  y <- sample(nonsmoker, size = ny, replace = TRUE)
  t_star[i] <- (mean(x) - mean(y) - thetahat) / sqrt(var(x)/nx + var(y)/ny)
}
# The upper quantile gives the lower endpoint, so list it first
confidence_boot <- thetahat - quantile(t_star, c(0.975, 0.025)) * SE

SOLUTION: The formula t confidence interval for the difference in mean birth weight (smokers minus nonsmokers) is between and . The bootstrap t interval tells us with 95% confidence that the difference in means is between -275.7348774 and -55.2241197. In this case we would report the bootstrap t confidence interval, which is reasonable here because both sample sizes exceed 10.
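
The bootstrap percentile interval requested in the question resamples each group independently from its own data and takes quantiles of the resampled differences in means. The sketch below illustrates the procedure on simulated data; `smoker_demo` and `nonsmoker_demo` are hypothetical stand-ins (their means and standard deviations are made up), so the printed numbers say nothing about TXBirths2004.

```r
# Two-sample bootstrap percentile interval for a difference in means.
# Simulated stand-ins with the same group sizes as the data set (90 and 1497).
set.seed(2004)
smoker_demo <- rnorm(90, mean = 3200, sd = 500)
nonsmoker_demo <- rnorm(1497, mean = 3300, sd = 500)

sims <- 10^4
diff_star <- numeric(sims)
for (i in seq_len(sims)) {
  x <- sample(smoker_demo, replace = TRUE)      # resample smokers only
  y <- sample(nonsmoker_demo, replace = TRUE)   # resample nonsmokers only
  diff_star[i] <- mean(x) - mean(y)
}
percentile_ci <- quantile(diff_star, c(0.025, 0.975))
print(percentile_ci)
```

With the real data, `smoker` and `nonsmoker` would take the place of the simulated vectors.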

Modify your result from the previous question to obtain a one-sided 95% \(t\) confidence interval (hypothesizing that babies born to nonsmokers weigh more than babies born to smokers).

confidence <- t.test(smoker, nonsmoker, alternative = "less")$conf

SOLUTION: We can conclude with 95% confidence that the difference in mean birth weights (smokers minus nonsmokers) is less than 9.8714098.

Gurney Buchanan

Wednesday, Dec 06, 2017 - 01:20:39 PM