Prob6 - A retail store wishes to conduct a marketing survey of its customers to see if customers would favor longer store hours. How many people should be in their sample if the marketers want their margin of error to be at most 3% with 95% confidence, assuming

# Use nsize to get the minimum size to give the margin of error with the default p of 0.5
nsize(b = .03, type = "pi")

The required sample size (n) to estimate the population 
proportion of successes with a 0.95 confidence interval 
so that the margin of error is no more than 0.03 is 1068 . 

SOLUTION: Solution above. In this case we have no prior estimate of the population proportion, so we use the default value of p = 0.5, which gives the most conservative (largest) sample size.

# Use nsize to get the minimum size to give the margin of error with the known p of 0.65
nsize(b = .03, type = "pi", p = 0.65)

The required sample size (n) to estimate the population 
proportion of successes with a 0.95 confidence interval 
so that the margin of error is no more than 0.03 is 972 . 

SOLUTION: Solution above. In this case we have a prior estimate of the population proportion, so we use that value (0.65) as our p.
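
Both sample sizes can be cross-checked against the usual formula n = ceiling(z^2 p(1 - p) / b^2). The sketch below uses base R only and assumes `nsize()` applies this formula; `n_for_proportion` is a hypothetical helper name, not a package function.

```r
# Hypothetical helper (not a package function): smallest n such that the
# margin of error of a confidence interval for a proportion is at most b.
n_for_proportion <- function(b, p = 0.5, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)  # two-sided critical value
  ceiling(z^2 * p * (1 - p) / b^2)      # round up to a whole person
}

n_default <- n_for_proportion(b = 0.03)            # conservative p = 0.5
n_prior   <- n_for_proportion(b = 0.03, p = 0.65)  # prior estimate p = 0.65
print(c(n_default, n_prior))
```

Because p(1 - p) is largest at p = 0.5, any other prior estimate yields a smaller required sample.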

Prob7 - Suppose researchers wish to study the effectiveness of a new drug to alleviate hives due to math anxiety. Seven hundred math students are randomly assigned to take either this drug or a placebo. Suppose 34 of the 350 students who took the drug break out in hives compared to 56 of the 350 students who took the placebo.

confidence_experiment <- prop.test(34, 350, correct = FALSE)$conf

SOLUTION: We can conclude with 95% confidence that the proportion of students that take the drug who break out with hives is between 0.0703507 and 0.1326822.

confidence_control <- prop.test(56, 350, correct = FALSE)$conf

SOLUTION: We can conclude with 95% confidence that the proportion of students who took the placebo (the control group) and broke out in hives is between 0.125315 and 0.2020674.
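
With `correct = FALSE`, the one-sample interval from `prop.test()` is the Wilson score interval. The sketch below recomputes both intervals by hand as a check; `wilson_ci` is a hypothetical helper written for this illustration.

```r
# Hypothetical helper: Wilson score interval for a single proportion.
wilson_ci <- function(x, n, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)
  p_hat <- x / n
  center <- (p_hat + z^2 / (2 * n)) / (1 + z^2 / n)
  half_width <- z * sqrt(p_hat * (1 - p_hat) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  c(lower = center - half_width, upper = center + half_width)
}

drug    <- wilson_ci(34, 350)  # students who took the drug
placebo <- wilson_ci(56, 350)  # students who took the placebo
print(rbind(drug, placebo))

# Compare with prop.test() when the continuity correction is switched off.
from_prop_test <- as.numeric(prop.test(34, 350, correct = FALSE)$conf.int)
print(all.equal(unname(drug), from_prop_test))
```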

SOLUTION: The confidence intervals overlap between 0.125315 and 0.1326822. Overlap of the two separate intervals does not by itself settle whether the proportions differ, so we next examine a confidence interval for the difference in proportions.

confidence_total <- prop.test(c(34, 56), c(350, 350), correct = FALSE)$conf.int

SOLUTION: We can say with 95% confidence that the difference in the proportions of hives (drug group minus placebo group) is between approximately -0.112 and -0.013. This interval does not include 0, so we can also conclude that the difference in proportions is significant, or that the medication does have an effect.
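
The interval for a difference in proportions is held in the `conf.int` component of the `prop.test()` result. The sketch below recomputes it by hand, assuming the unpooled (Wald) standard error. All variable names are chosen for this illustration.

```r
x <- c(34, 56)    # students with hives: drug, placebo
n <- c(350, 350)  # group sizes

p_hat <- x / n
diff_hat <- p_hat[1] - p_hat[2]           # drug minus placebo
se <- sqrt(sum(p_hat * (1 - p_hat) / n))  # unpooled standard error
z <- qnorm(0.975)

ci_manual <- diff_hat + c(-1, 1) * z * se
ci_proptest <- as.numeric(prop.test(x, n, correct = FALSE)$conf.int)

print(round(ci_manual, 4))
print(all.equal(ci_manual, ci_proptest))
```

Both endpoints are negative, so the interval lies entirely below 0.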

Prob8 - An article in the March 2003 New England Journal of Medicine describes a study to see if aspirin is effective in reducing the incidence of colorectal adenomas, a precursor to most colorectal cancers (Sandler et al. (2003)). Of 517 patients in the study, 259 were randomly assigned to receive aspirin and the remaining 258 received a placebo. One or more adenomas were found in 44 of the aspirin group and 70 in the placebo group. Find a 95% one-sided upper bound for the difference in proportions \((p_A - p_P)\) and interpret your interval.

confidence <- prop.test(c(44, 70), c(259, 258), correct = FALSE, alternative = "less")$conf.int

SOLUTION: We can say with 95% confidence that the difference in the proportions of patients with adenomas (aspirin group minus placebo group) is at most approximately -0.042. Because this upper bound is below 0, the data support the idea that aspirin is effective in reducing the incidence of adenomas.
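
The one-sided upper bound requested in the problem can also be worked out by hand. This is a sketch that assumes the usual unpooled standard error for a difference of two proportions and \(z_{0.95} \approx 1.645\):

```latex
\[
\hat{p}_A = \frac{44}{259} \approx 0.1699, \qquad
\hat{p}_P = \frac{70}{258} \approx 0.2713
\]
\[
\widehat{SE} = \sqrt{\frac{\hat{p}_A(1-\hat{p}_A)}{259} + \frac{\hat{p}_P(1-\hat{p}_P)}{258}} \approx 0.0362
\]
\[
(\hat{p}_A - \hat{p}_P) + z_{0.95}\,\widehat{SE} \approx -0.1014 + 1.645\,(0.0362) \approx -0.042
\]
```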

Prob9 - The data set Bangladesh has measurements on water quality from 271 wells in Bangladesh. There are two missing values in the chlorine variable. Use the following R code to remove these two observations.

> chlorine <- with(Bangladesh, Chlorine[!is.na(Chlorine)])

chlorine <- with(Bangladesh, Chlorine[!is.na(Chlorine)])
ggplot(data.frame(X = chlorine), aes(x = X)) +
  geom_density(color = "black", fill = "cyan") +
  theme_bw() +
  labs(title = "Water Chlorine Levels in Bangladesh", x = "Chlorine Level")

SOLUTION: The distribution is strongly right-skewed, resembling an exponential distribution. Because of the skew, the best measure of center is the median, which is 14.2, and the best measure of spread is the IQR, which is 50.5.

confidence <- t.test(chlorine)$conf

SOLUTION: We can say with 95% confidence that the true mean of chlorine levels in Bangladesh wells is between 52.872635 and 103.2953948.

sims <- 10^4
xbar <- mean(chlorine)
n <- length(chlorine)
t_star <- numeric(sims)
for (i in 1:sims) {
  x <- sample(chlorine, size = n, replace = TRUE)
  t_star[i] <- (mean(x) - xbar) / (sd(x)/sqrt(n))
}
confidence_boot <- quantile(t_star, c(0.025, 0.975))
lower_bound = xbar - (confidence_boot[2] * (sd(chlorine)/sqrt(n)))
upper_bound = xbar - (confidence_boot[1] * (sd(chlorine)/sqrt(n)))

SOLUTION: From our bootstrap we can say with 95% confidence that the true mean of chlorine levels in Bangladesh wells is between 57.3526732 and 112.3002873. In this case we would use the bootstrap t interval, because it accounts for the skewness in the sample. The sample size is also well above 10, which supports using the bootstrap t interval.
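
The bootstrap t loop can be wrapped in a reusable function. This is a minimal sketch run on simulated right-skewed data rather than the well measurements; `boot_t_ci` is a hypothetical helper, and the seed is set only to make the illustration reproducible.

```r
# Hypothetical helper: bootstrap t confidence interval for a mean.
boot_t_ci <- function(x, sims = 10^4, conf.level = 0.95) {
  n <- length(x)
  xbar <- mean(x)
  se <- sd(x) / sqrt(n)
  # Studentize each resampled mean with its own standard error.
  t_star <- replicate(sims, {
    xs <- sample(x, size = n, replace = TRUE)
    (mean(xs) - xbar) / (sd(xs) / sqrt(n))
  })
  alpha <- 1 - conf.level
  q <- unname(quantile(t_star, c(1 - alpha / 2, alpha / 2)))
  # The upper quantile of t* gives the lower endpoint, and vice versa.
  c(lower = xbar - q[1] * se, upper = xbar - q[2] * se)
}

set.seed(1)                         # assumption: seed for reproducibility only
skewed <- rexp(200, rate = 1 / 50)  # simulated data, not the Bangladesh wells
ci <- boot_t_ci(skewed, sims = 2000)
print(ci)
```

For right-skewed data the t* distribution tends to be left-skewed, which pushes both endpoints upward relative to the ordinary t interval.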

# skewness() comes from the e1071 package
library(e1071)
n <- length(Bangladesh$Arsenic)
xbar <- mean(Bangladesh$Arsenic)
q <- qt(0.975, n - 1)
k3 <- skewness(Bangladesh$Arsenic)
lower_bound <- xbar + (k3/((6 * sqrt(n)) * (1 + (2 * q^2)))) - (q * (sd(Bangladesh$Arsenic)/sqrt(n)))
upper_bound <- xbar + (k3/((6 * sqrt(n)) * (1 + (2 * q^2)))) + (q * (sd(Bangladesh$Arsenic)/sqrt(n)))

confidence <- t.test(Bangladesh$Arsenic)$conf

sims <- 10^4
t_star <- numeric(sims)
for (i in 1:sims) {
  x <- sample(Bangladesh$Arsenic, size = n, replace = TRUE)
  t_star[i] <- (mean(x) - xbar) / (sd(x)/sqrt(n))
}
confidence_boot <- quantile(t_star, c(0.025, 0.975))
lower_bound_boot = xbar - (confidence_boot[2] * (sd(Bangladesh$Arsenic)/sqrt(n)))
upper_bound_boot = xbar - (confidence_boot[1] * (sd(Bangladesh$Arsenic)/sqrt(n)))

SOLUTION: Our Johnson’s t confidence interval is (89.6889317, 160.9619374), the ordinary t interval is (89.6834234, 160.956429), and the bootstrap t confidence interval is (95.2007911, 173.0304234). The Johnson’s t interval is almost exactly the same as the ordinary t interval, but the bootstrap t interval is shifted upward.

Prob10 - The data set MnGroundwater has measurements on water quality of 895 randomly selected wells in Minnesota.

ggplot(MnGroundwater, aes(x = Alkalinity)) +
  geom_histogram(color = "black", fill = "cyan") +
  theme_bw() +
  labs(title = "Histogram of Water Alkalinity")
ggplot(MnGroundwater, aes(sample = Alkalinity)) +
  geom_qq(color = "black", fill = "cyan") +
  theme_bw() +
  labs(title = "QQ Plot of Water Alkalinity")
ggplot(MnGroundwater, aes(x = Alkalinity)) +
  geom_density(color = "black", fill = "cyan") +
  theme_bw() +
  labs(title = "Density Plot of Water Alkalinity")

SOLUTION: The distribution appears to be very close to normal.

confidence <- t.test(MnGroundwater$Alkalinity)$conf

SOLUTION: We can state with 95% confidence that the true mean of alkalinity values is between 2.835756 × 10^5 and 2.9778976 × 10^5.

sims <- 10^4
xbar <- mean(MnGroundwater$Alkalinity)
n <- length(MnGroundwater$Alkalinity)
t_star <- numeric(sims)
for (i in 1:sims) {
  x <- sample(MnGroundwater$Alkalinity, size = n, replace = TRUE)
  t_star[i] <- (mean(x) - xbar) / (sd(x)/sqrt(n))
}
confidence_boot <- quantile(t_star, c(0.025, 0.975))
lower_bound = xbar - (confidence_boot[2] * (sd(MnGroundwater$Alkalinity)/sqrt(n)))
upper_bound = xbar - (confidence_boot[1] * (sd(MnGroundwater$Alkalinity)/sqrt(n)))

SOLUTION: From our bootstrap we can state with 95% confidence that the true mean of alkalinity values is between 2.8347042 × 10^5 and 2.9770769 × 10^5. We would use this interval rather than the ordinary t interval, as it corrects for some skewness and non-normality in the data. The sample size is also well above 10, which supports the reliability of the bootstrap t confidence interval here.

Prob11 - Consider the babies born in Texas in 2004 (TXBirths2004). We will compare the weights of babies born to nonsmokers and smokers.

TXBirths2004 %>% 
  group_by(Smoker) %>% 
  summarise(Count = n())
# A tibble: 2 x 2
  Smoker Count
  <fctr> <int>
1     No  1497
2    Yes    90
ggplot(TXBirths2004, aes(x = Weight)) +
  geom_density(color = "black", fill = "cyan") +
  theme_bw() +
  facet_grid(Smoker~.) +
  labs(title = "Birth Weights of Texas Children in 2004", x = "Birth Weight")
ggplot(TXBirths2004, aes(sample = Weight)) +
  geom_qq(color = "black", fill = "cyan") +
  theme_bw() +
  facet_grid(Smoker~.) +
  labs(title = "Birth Weights of Texas Children in 2004", y = "Birth Weight")

SOLUTION: The distributions for both smokers and nonsmokers appear to be approximately normal, with a slight skew to the left.

smoker <- filter(TXBirths2004, Smoker == "Yes")$Weight
nonsmoker <- filter(TXBirths2004, Smoker == "No")$Weight
confidence <- t.test(smoker, nonsmoker)$conf.int
sims <- 10^4
thetahat <- mean(smoker) - mean(nonsmoker)
nx <- length(smoker)
ny <- length(nonsmoker)
SE <- sqrt(var(smoker)/nx + var(nonsmoker)/ny)
t_star <- numeric(sims)
for (i in 1:sims) {
  x <- sample(smoker, size = nx, replace = TRUE)
  y <- sample(nonsmoker, size = ny, replace = TRUE)
  t_star[i] <- (mean(x) - mean(y) - thetahat) / sqrt(var(x)/nx + var(y)/ny)
}
confidence_boot <- thetahat - quantile(t_star, c(0.975, 0.025)) * SE  # ordered (lower, upper)

SOLUTION: We can compute a 95% t confidence interval for the difference in mean weight between babies born to smokers and nonsmokers (its endpoints are missing here). The bootstrap interval tells us with 95% confidence that the difference in means is between -275.7348774 and -55.2241197. In this case we would use the bootstrap confidence interval. We can also suggest the bootstrap confidence interval because n > 10.
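
In a two-sample bootstrap t interval, each group is resampled from its own data and the difference is studentized with the resampled standard error. The sketch below uses simulated data rather than TXBirths2004; `boot_t_diff_ci` is a hypothetical helper, and the group means, standard deviations, and seed are assumptions chosen only for illustration.

```r
# Hypothetical helper: bootstrap t interval for a difference in means.
boot_t_diff_ci <- function(x, y, sims = 10^4, conf.level = 0.95) {
  nx <- length(x)
  ny <- length(y)
  theta_hat <- mean(x) - mean(y)
  se <- sqrt(var(x) / nx + var(y) / ny)
  t_star <- replicate(sims, {
    xs <- sample(x, size = nx, replace = TRUE)  # resample group 1 from itself
    ys <- sample(y, size = ny, replace = TRUE)  # resample group 2 from itself
    (mean(xs) - mean(ys) - theta_hat) / sqrt(var(xs) / nx + var(ys) / ny)
  })
  alpha <- 1 - conf.level
  q <- unname(quantile(t_star, c(1 - alpha / 2, alpha / 2)))
  c(lower = theta_hat - q[1] * se, upper = theta_hat - q[2] * se)
}

set.seed(2)                                # assumption: seed for reproducibility
g1 <- rnorm(90, mean = 3200, sd = 500)     # simulated small group
g2 <- rnorm(1500, mean = 3300, sd = 500)   # simulated large group
ci <- boot_t_diff_ci(g1, g2, sims = 2000)
print(ci)
```

Taking the quantiles in the order (upper, lower) returns the endpoints already sorted as (lower, upper).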

confidence <- t.test(smoker, nonsmoker, alternative = "less")$conf

SOLUTION: We can conclude with 95% confidence that the difference in mean birth weights (smokers minus nonsmokers) is at most 9.8714098. Because this upper bound is above 0, the one-sided interval does not rule out a difference of zero.
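
With `alternative = "less"`, `t.test()` returns an interval of the form (-Inf, upper). The sketch below recomputes the upper bound by hand using the Welch standard error and degrees of freedom. It uses simulated data, not TXBirths2004, and all names and parameter values are assumptions for illustration.

```r
set.seed(3)                               # assumption: seed for reproducibility
g1 <- rnorm(90, mean = 3200, sd = 500)    # simulated group 1
g2 <- rnorm(1500, mean = 3300, sd = 500)  # simulated group 2

v1 <- var(g1) / length(g1)
v2 <- var(g2) / length(g2)
se <- sqrt(v1 + v2)  # Welch standard error
# Welch-Satterthwaite degrees of freedom
df <- (v1 + v2)^2 / (v1^2 / (length(g1) - 1) + v2^2 / (length(g2) - 1))

upper_manual <- (mean(g1) - mean(g2)) + qt(0.95, df) * se
one_sided <- t.test(g1, g2, alternative = "less")$conf.int

print(c(manual = upper_manual, t_test = one_sided[2]))
```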