This is entirely our own work except as noted at the end of the document.

1 Problem 1

Prob1 - Import the data set Spruce into R.

1.1 Part A

  • Create exploratory plots to check the distribution of the variable Ht.change.
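library(ggplot2)  # plotting functions used throughout
library(dplyr)    # filter() used in the t.test calls
# The data sets (Spruce, Girls2004, FlightDelays) are assumed to be loaded already
ggplot(Spruce, aes(x = Ht.change)) +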
  geom_density(color = "black", fill = "skyblue2") +
  theme_bw() +
  labs(title = "Spruce Tree Height Change", x = "Height Change")
ggplot(Spruce, aes(sample = Ht.change)) +
  geom_qq() +
  labs(title = "QQ Plot of Spruce Height Changes")

1.2 Part B

  • Find a 95% \(t\) confidence interval for the mean height change over the 5-year period of the study and give a sentence interpreting your interval.
(confidence <- t.test(Spruce$Ht.change)$conf)
[1] 28.33685 33.52982
attr(,"conf.level")
[1] 0.95

SOLUTION: Statement: We can say with 95% confidence that the mean height change over the 5-year period falls between 28.34 and 33.53. Interpretation: 95% of the confidence intervals produced using this method will contain the true mean.
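
As a cross-check on how t.test builds a one-sample interval, the sketch below computes \(\bar{x} \pm t^{*} s/\sqrt{n}\) by hand and compares it with the built-in result. The vector x is simulated stand-in data (an assumption for illustration, not the Spruce measurements), so its numbers will differ from the interval reported above.

```r
# Hypothetical stand-in data; NOT the Spruce measurements.
set.seed(1)
x <- rnorm(50, mean = 30, sd = 11)

n <- length(x)
t_star <- qt(0.975, df = n - 1)           # critical value for 95% confidence
half_width <- t_star * sd(x) / sqrt(n)    # margin of error
manual_ci <- c(mean(x) - half_width, mean(x) + half_width)

# Built-in interval for comparison
builtin_ci <- as.numeric(t.test(x)$conf.int)

manual_ci
builtin_ci
```

The two printed vectors should agree, which shows that the interval is just the sample mean plus or minus a \(t\) critical value times the standard error.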

1.3 Part C

  • Create exploratory plots to compare the distributions of the variable Ht.change for the seedlings in the fertilized and nonfertilized plots.
ggplot(Spruce, aes(x = Ht.change)) +
  geom_density(color = "black", fill = "skyblue2") +
  theme_bw() +
  labs(title = "Spruce Tree Height Change", x = "Height Change") +
  facet_grid(Fertilizer~.)
ggplot(Spruce, aes(sample = Ht.change)) +
  geom_qq() +
  theme_bw() +
  labs(title = "QQ Plot of Spruce Height Changes") +
  facet_grid(Fertilizer~.)

1.4 Part D

  • Find the 95% one-sided lower \(t\) confidence bound for the difference in mean heights (\(\mu_F - \mu_{NF}\)) over the 5-year period of the study and give a sentence interpreting your interval.
(confidence <- t.test(filter(Spruce, Fertilizer == "F")$Ht.change, filter(Spruce, Fertilizer == "NF")$Ht.change)$conf)
[1] 10.82909 18.59314
attr(,"conf.level")
[1] 0.95

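SOLUTION: Statement: We can say with 95% confidence that the difference in mean height change between fertilized and nonfertilized seedlings (\(\mu_F - \mu_{NF}\)) falls between 10.83 and 18.59. Note that the call above returns a two-sided interval; the one-sided lower bound the question asks for comes from adding alternative = "greater" to t.test, and it will lie somewhat above 10.83 because it uses the 95th rather than the 97.5th percentile of the \(t\) distribution. Interpretation: 95% of the confidence intervals produced using this method will contain the true difference in means.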
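
Because the question asks for a one-sided lower bound, here is a minimal sketch of how such a bound could be requested from t.test. The vectors fert and nofert are simulated stand-ins (assumptions for illustration, not the Spruce data), so no numbers here should be read as results for the study.

```r
# Hypothetical stand-in data; NOT the Spruce measurements.
set.seed(2)
fert   <- rnorm(36, mean = 38, sd = 9)   # stands in for fertilized plots
nofert <- rnorm(36, mean = 23, sd = 8)   # stands in for nonfertilized plots

# Two-sided 95% interval (the default)
two_sided <- t.test(fert, nofert)$conf.int

# One-sided 95% lower bound for mu_F - mu_NF
one_sided <- t.test(fert, nofert, alternative = "greater")$conf.int

two_sided
one_sided   # upper limit is Inf; the first element is the lower bound
```

With alternative = "greater" the upper limit is infinite, and the lower limit sits above the two-sided lower limit because all 5% of the error is placed in one tail.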

2 Problem 2

Prob2 - Consider the data set Girls2004 with birth weights of baby girls born in Wyoming or Alaska.

2.1 Part A

  • Create exploratory plots and compare the distribution of weight between the babies born in the two states.
ggplot(Girls2004, aes(x = Weight)) +
  geom_density(color = "black", fill = "skyblue2") +
  theme_bw() +
  labs(title = "Birth Weight of Baby Girls", x = "Weight") +
  facet_grid(State~.)
ggplot(Girls2004, aes(sample = Weight)) +
  geom_qq() +
  theme_bw() +
  labs(title = "QQ Plot of Birth Weights") +
  facet_grid(State~.)

SOLUTION: The weights in both states appear approximately normally distributed, although the Wyoming weights follow a normal distribution more closely.

2.2 Part B

  • Find a 95% \(t\) confidence interval for the difference in mean weights for girls born in these two states. Give a sentence interpreting this interval.
(confidence <- t.test(filter(Girls2004, State == "AK")$Weight, filter(Girls2004, State == "WY")$Weight)$conf)
[1]  83.29395 533.60605
attr(,"conf.level")
[1] 0.95

SOLUTION: Statement: We can say with 95% confidence that the difference in mean weights between girls born in Alaska and girls born in Wyoming (Alaska minus Wyoming) falls between 83.29 and 533.61. Because the whole interval lies above zero, the data suggest that girls born in Alaska are heavier on average. Interpretation: 95% of the confidence intervals produced using this method will contain the true difference in means.
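
When t.test is given two samples, it uses the Welch (unequal variance) procedure by default. The sketch below reproduces that calculation by hand on simulated stand-in data; g1 and g2 are assumptions for illustration, not the Girls2004 weights.

```r
# Hypothetical stand-in data; NOT the Girls2004 weights.
set.seed(3)
g1 <- rnorm(40, mean = 3500, sd = 500)
g2 <- rnorm(40, mean = 3200, sd = 450)

v1 <- var(g1) / length(g1)   # squared standard error, group 1
v2 <- var(g2) / length(g2)   # squared standard error, group 2
se <- sqrt(v1 + v2)          # standard error of the difference

# Welch-Satterthwaite approximate degrees of freedom
df_welch <- (v1 + v2)^2 /
  (v1^2 / (length(g1) - 1) + v2^2 / (length(g2) - 1))

diff_means <- mean(g1) - mean(g2)
manual_ci <- diff_means + c(-1, 1) * qt(0.975, df_welch) * se

builtin <- t.test(g1, g2)    # var.equal = FALSE is the default
manual_ci
as.numeric(builtin$conf.int)
```

The order of the two arguments sets the sign of the difference, so t.test(g1, g2) estimates the mean of g1 minus the mean of g2.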

2.3 Part C

  • Create exploratory plots and compare the distribution of weights between babies born to nonsmokers and babies born to smokers.
ggplot(Girls2004, aes(x = Weight)) +
  geom_density(color = "black", fill = "skyblue2") +
  theme_bw() +
  labs(title = "Birth Weight of Baby Girls Faceted on Smoking", x = "Weight") +
  facet_grid(Smoker~.)
ggplot(Girls2004, aes(sample = Weight)) +
  geom_qq() +
  theme_bw() +
  labs(title = "QQ Plot of Birth Weights") +
  facet_grid(Smoker~.)

SOLUTION: The weights are approximately normally distributed for both smokers and nonsmokers, although the birth weights of babies born to nonsmokers may be slightly positively skewed.

2.4 Part D

  • Find a 95% \(t\) confidence interval for the difference in mean weights between babies born to nonsmokers and smokers. Give a sentence interpreting this interval.
(confidence <- t.test(filter(Girls2004, Smoker == "No")$Weight, filter(Girls2004, Smoker == "Yes")$Weight)$conf)
[1] -44.0330 617.9197
attr(,"conf.level")
[1] 0.95

SOLUTION: Statement: We can say with 95% confidence that the difference in mean weights between girls born to mothers who do not smoke and girls born to mothers who smoke (nonsmoker minus smoker) falls between -44.03 and 617.92. Because this interval contains zero, the data do not rule out equal mean weights for the two groups. Interpretation: 95% of the confidence intervals produced using this method will contain the true difference in means.
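
A quick way to read the two Girls2004 intervals is to check whether each one contains zero. The sketch below does this with the limits printed in Parts B and D; contains_zero is a hypothetical helper written for this illustration.

```r
# Interval limits copied from the printed output in Parts B and D
ci_state  <- c(83.29395, 533.60605)   # Alaska minus Wyoming
ci_smoker <- c(-44.0330, 617.9197)    # nonsmoker minus smoker

# TRUE when 0 lies inside the interval, i.e. "no difference" is plausible
contains_zero <- function(ci) ci[1] <= 0 && 0 <= ci[2]

contains_zero(ci_state)    # FALSE
contains_zero(ci_smoker)   # TRUE
```

The state interval excludes zero, while the smoking interval includes it, so only the state comparison shows a difference at the 95% confidence level.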

3 Problem 3

Prob3 - Import the FlightDelays data set into R. Although the data represent all flights for United Airlines and American Airlines in May and June 2009, assume for this exercise that these flights are a sample from all flights flown by the two airlines under similar conditions. We will compare the lengths of flight delays between the two airlines.

3.1 Part A

  • Create exploratory plots of the lengths of delays for the two airlines.
ggplot(FlightDelays, aes(x = Delay)) +
  geom_density(color = "black", fill = "skyblue2") +
  theme_bw() +
  labs(title = "Delay Time for Flights", x = "Time (minutes)") +
  facet_grid(Carrier~.)
ggplot(FlightDelays, aes(sample = Delay)) +
  geom_qq() +
  theme_bw() +
  labs(title = "QQ Plot of Delay Times") +
  facet_grid(Carrier~.)

3.2 Part B

  • Find a 95% \(t\) confidence interval for the difference in mean flight delays between the two airlines and interpret this interval.
(confidence <- t.test(filter(FlightDelays, Carrier == "UA")$Delay, filter(FlightDelays, Carrier == "AA")$Delay)$conf)
[1] 2.868194 8.903198
attr(,"conf.level")
[1] 0.95

SOLUTION: Statement: We can say with 95% confidence that the difference in mean delay times between United Airlines and American Airlines flights (United minus American) falls between 2.87 and 8.90 minutes. Interpretation: 95% of the confidence intervals produced using this method will contain the true difference in means.

4 Problem 4

Prob4 - Run a simulation to see if the \(t\) ratio \(T = (\bar{X} -\mu)/(S/\sqrt{n})\) has a \(t\) distribution or even an approximate \(t\) distribution when the samples are drawn from a nonnormal distribution. Be sure to superimpose the appropriate \(t\) density curve over the density of your simulated \(T\). Try two different nonnormal distributions \(\left( Unif(a = 0, b = 1), Exp(\lambda = 1) \right)\) and remember to see if sample size makes a difference (use \(n = 15\) and \(n=500\)).

#Parameters for Uniform Distribution 
a <- 0
b <- 1
mu_unif <- (a + b)/2
#Parameters for Exponential Distribution
lam <- 1
mu_exp <- 1/lam
#Other Parameters
n1 <- 15
n2 <- 500
#Run the simulation
sims <- 10 ^ 4
xbar_unif1 <- numeric(sims)
xbar_unif2 <- numeric(sims)
xbar_exp1 <- numeric(sims)
xbar_exp2 <- numeric(sims)
SD_unif1 <- numeric(sims)
SD_unif2 <- numeric(sims)
SD_exp1 <- numeric(sims)
SD_exp2 <- numeric(sims)
for (i in 1:sims) {
  xunif1 <- runif(n1, a, b)
  xunif2 <- runif(n2, a, b)
  xexp1 <- rexp(n1, lam)
  xexp2 <- rexp(n2, lam)
  xbar_unif1[i] <- mean(xunif1)
  xbar_unif2[i] <- mean(xunif2)
  xbar_exp1[i] <- mean(xexp1)
  xbar_exp2[i] <- mean(xexp2)
  SD_unif1[i] <- sd(xunif1)
  SD_unif2[i] <- sd(xunif2)
  SD_exp1[i] <- sd(xexp1)
  SD_exp2[i] <- sd(xexp2)
}
t_val_unif1 <- (xbar_unif1 - mu_unif)/(SD_unif1/sqrt(n1))
t_val_unif2 <- (xbar_unif2 - mu_unif)/(SD_unif2/sqrt(n2))
t_val_exp1 <- (xbar_exp1 - mu_exp)/(SD_exp1/sqrt(n1))
t_val_exp2 <- (xbar_exp2 - mu_exp)/(SD_exp2/sqrt(n2))

# Histogram of simulated T values with the matching t density in red.
# color = "red" is set outside aes() so the curve is actually drawn in red.
ggplot(data.frame(data = t_val_unif1), aes(x = data)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, color = "black", fill = "skyblue2") +
  theme_bw() +
  stat_function(fun = dt, args = list(df = 14), color = "red") +
  labs(title = "Uniform Distribution with N = 15")
ggplot(data.frame(data = t_val_unif2), aes(x = data)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, color = "black", fill = "skyblue2") +
  theme_bw() +
  stat_function(fun = dt, args = list(df = 499), color = "red") +
  labs(title = "Uniform Distribution with N = 500")
ggplot(data.frame(data = t_val_exp1), aes(x = data)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, color = "black", fill = "skyblue2") +
  theme_bw() +
  stat_function(fun = dt, args = list(df = 14), color = "red") +
  labs(title = "Exponential Distribution with N = 15")
ggplot(data.frame(data = t_val_exp2), aes(x = data)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, color = "black", fill = "skyblue2") +
  theme_bw() +
  stat_function(fun = dt, args = list(df = 499), color = "red") +
  labs(title = "Exponential Distribution with N = 500")

SOLUTION: We can notice a couple of differences in these graphs. When sampling from the uniform distribution, the simulated \(T\) values follow the \(t\) density closely even with a sample size of 15, and increasing the sample size to 500 improves the fit only slightly. When sampling from the exponential distribution, the simulated \(T\) values at \(n = 15\) are skewed to the left relative to the \(t\) density, and the fit is much closer at \(n = 500\). Sample size therefore matters far more for the skewed exponential population than for the symmetric uniform one.
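
To put numbers on the visual comparison, the sketch below estimates how often the simulated \(T\) falls beyond the 2.5% and 97.5% quantiles of the matching \(t\) distribution. If \(T\) really followed that \(t\) distribution, each tail would hold about 2.5% of the values. The helpers sim_t and tail_props are hypothetical names written for this illustration, and the run uses 5000 simulations rather than the \(10^4\) used above.

```r
set.seed(4)
sims <- 5000

# Simulate `sims` values of T = (xbar - mu) / (s / sqrt(n)).
# `draw` is a function(n) that returns one sample of size n.
sim_t <- function(draw, n, mu) {
  replicate(sims, {
    x <- draw(n)
    (mean(x) - mu) / (sd(x) / sqrt(n))
  })
}

# Proportion of simulated T values beyond the 2.5% and 97.5% t quantiles
tail_props <- function(t_vals, n) {
  crit <- qt(0.975, df = n - 1)
  c(lower = mean(t_vals < -crit), upper = mean(t_vals > crit))
}

# runif() defaults to Unif(0, 1) and rexp() defaults to rate = 1
tails_unif15 <- tail_props(sim_t(runif, 15, 0.5), 15)
tails_exp15  <- tail_props(sim_t(rexp, 15, 1), 15)
tails_exp500 <- tail_props(sim_t(rexp, 500, 1), 500)

rbind(tails_unif15, tails_exp15, tails_exp500)
```

For the exponential samples with \(n = 15\), the lower tail should come out noticeably heavier than the upper tail, which is the left skew seen in the histogram.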

5 Problem 5

Prob5 - One question in the 2002 General Social Survey asked participants whom they voted for in the 2000 election. Of the 980 women who voted, 459 voted for Bush. Of the 759 men who voted, 426 voted for Bush.

5.1 Part A

  • Find a 95% confidence interval for the proportion of women who voted for Bush.
(confidence_women <- prop.test(459, 980, correct = FALSE)$conf)
[1] 0.4373100 0.4996717
attr(,"conf.level")
[1] 0.95

SOLUTION: Statement: We can say with 95% confidence that the proportion of women who voted for Bush in 2000 falls between 0.4373 and 0.4997.
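
With correct = FALSE, the interval that prop.test returns for a single proportion is the Wilson score interval rather than the simpler Wald interval \(\hat{p} \pm z^{*}\sqrt{\hat{p}(1-\hat{p})/n}\). The sketch below computes both by hand from the counts in the problem; the variable names are chosen for this illustration.

```r
x <- 459   # women who voted for Bush
n <- 980   # women who voted
p_hat <- x / n
z <- qnorm(0.975)

# Wilson score interval
center <- (p_hat + z^2 / (2 * n)) / (1 + z^2 / n)
half   <- z * sqrt(p_hat * (1 - p_hat) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
wilson_ci <- c(center - half, center + half)

# Simpler Wald interval, for comparison
wald_ci <- p_hat + c(-1, 1) * z * sqrt(p_hat * (1 - p_hat) / n)

wilson_ci
wald_ci
```

The Wilson limits match the prop.test output for Part A, and with a sample this large the Wald limits differ from them only slightly.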

5.2 Part B

  • Find a 95% confidence interval for the proportion of men who voted for Bush. Do the intervals for the men and women overlap? What, if anything, can you conclude about gender difference in voter preference?
(confidence_men <- prop.test(426, 759, correct = FALSE)$conf)
[1] 0.5257409 0.5961717
attr(,"conf.level")
[1] 0.95

SOLUTION: Statement: We can say with 95% confidence that the proportion of men who voted for Bush in 2000 falls between 0.5257 and 0.5962.

5.3 Part C

SOLUTION: We can note that the confidence intervals do not overlap, and the interval for men (0.5257 to 0.5962) lies entirely above the interval for women (0.4373 to 0.4997). We can conclude that a larger proportion of men than women voted for Bush in the 2000 election.
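
Checking two separate intervals for overlap is a conservative comparison. A more direct approach is a single confidence interval for the difference in proportions, which prop.test produces when given both groups at once. The sketch below uses the counts from the problem statement; the variable names are chosen for this illustration.

```r
# Counts from the problem statement: men first, then women
bush   <- c(426, 459)
voters <- c(759, 980)

# 95% interval for p_men - p_women, without continuity correction
diff_test <- prop.test(bush, voters, correct = FALSE)
diff_ci <- as.numeric(diff_test$conf.int)

diff_ci
```

An interval that lies entirely above zero supports the same conclusion as the overlap check, namely that men voted for Bush at a higher rate than women.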