This is entirely our own work except as noted at the end of the document.

Prob1 - Import the data set Spruce into R.

Spruce <- read.csv("http://www1.appstate.edu/~arnholta/Data/Spruce.csv")
glimpse(Spruce)
Observations: 72
Variables: 9
$ Tree        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
$ Competition <fctr> NC, NC, NC, NC, NC, NC, NC, NC, NC, NC, NC, NC, C...
$ Fertilizer  <fctr> F, F, F, F, F, F, NF, NF, NF, NF, NF, NF, F, F, F...
$ Height0     <dbl> 15.0, 9.0, 12.0, 13.7, 12.0, 12.0, 16.8, 14.6, 16....
$ Height5     <dbl> 60.0, 45.2, 42.0, 49.5, 47.3, 56.4, 43.5, 49.2, 54...
$ Diameter0   <dbl> 1.984375, 1.190625, 1.785937, 1.587500, 1.587500, ...
$ Diameter5   <dbl> 7.4, 5.2, 5.7, 6.4, 6.2, 7.4, 4.9, 5.4, 7.1, 5.1, ...
$ Ht.change   <dbl> 45.0, 36.2, 30.0, 35.8, 35.3, 44.4, 26.7, 34.6, 38...
$ Di.change   <dbl> 5.415625, 4.009375, 3.914062, 4.812500, 4.612500, ...
hist(Spruce$Ht.change)
boxplot(Spruce$Ht.change)

t.test(Spruce$Ht.change)

    One Sample t-test

data:  Spruce$Ht.change
t = 23.755, df = 71, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 28.33685 33.52982
sample estimates:
mean of x 
 30.93333 

SOLUTION: We are 95% confident that the change in height for a spruce tree in a 5-year period is within (28.337, 33.530).

library(dplyr)
fert <- Spruce %>%
  group_by(Fertilizer) %>%
  summarize(Mean = mean(Ht.change), StDev = sd(Ht.change), n())
fert
# A tibble: 2 x 4
  Fertilizer     Mean    StDev `n()`
      <fctr>    <dbl>    <dbl> <int>
1          F 38.28889 7.980540    36
2         NF 23.57778 8.525193    36
library(ggplot2)
ggplot(Spruce, aes(x = Ht.change, fill = factor(Fertilizer))) +
  geom_histogram(position = "dodge", bins = 10)

t.test(Spruce$Ht.change~Spruce$Fertilizer, alternative = "greater")

    Welch Two Sample t-test

data:  Spruce$Ht.change by Spruce$Fertilizer
t = 7.5586, df = 69.697, p-value = 6.068e-11
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 11.46664      Inf
sample estimates:
 mean in group F mean in group NF 
        38.28889         23.57778 

SOLUTION: We are 95% confident that the difference in heights between fertilized and non-fertilized spruce trees in a 5-year period is at least 11.467.

Prob2 - Consider the data set Girls2004 with birth weights of baby girls born in Wyoming or Alaska.

Girls2004 <- read.csv("http://www1.appstate.edu/~arnholta/Data/Girls2004.csv")
glimpse(Girls2004)
Observations: 80
Variables: 6
$ ID         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
$ State      <fctr> WY, WY, WY, WY, WY, WY, WY, WY, WY, WY, WY, WY, WY...
$ MothersAge <fctr> 15-19, 35-39, 25-29, 20-24, 25-29, 20-24, 20-24, 2...
$ Smoker     <fctr> No, No, No, No, No, No, No, No, No, No, No, Yes, Y...
$ Weight     <int> 3085, 3515, 3775, 3265, 2970, 2850, 2737, 3515, 374...
$ Gestation  <int> 40, 39, 40, 39, 40, 38, 38, 37, 39, 40, 41, 39, 40,...
state <- Girls2004 %>%
  group_by(State) %>%
  summarize(Mean = mean(Weight), StDev = sd(Weight), n())
state
# A tibble: 2 x 4
   State    Mean    StDev `n()`
  <fctr>   <dbl>    <dbl> <int>
1     AK 3516.35 578.8336    40
2     WY 3207.90 418.3184    40
ggplot(Girls2004, aes(x = Weight, fill = factor(State))) +
  geom_histogram(position = "dodge", bins = 10)

SOLUTION: AK~N(3516.35, 578.83) and WY~N(3207.90, 418.32)

t.test(Girls2004$Weight[Girls2004$State == "AK"],Girls2004$Weight[Girls2004$State == "WY"])

    Welch Two Sample t-test

data:  Girls2004$Weight[Girls2004$State == "AK"] and Girls2004$Weight[Girls2004$State == "WY"]
t = 2.7316, df = 71.007, p-value = 0.007946
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  83.29395 533.60605
sample estimates:
mean of x mean of y 
  3516.35   3207.90 

SOLUTION: We are 95% confident that the difference in weights between girls born in Alaska and Wyoming is within (83.294, 533.606).

glimpse(Girls2004)
Observations: 80
Variables: 6
$ ID         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
$ State      <fctr> WY, WY, WY, WY, WY, WY, WY, WY, WY, WY, WY, WY, WY...
$ MothersAge <fctr> 15-19, 35-39, 25-29, 20-24, 25-29, 20-24, 20-24, 2...
$ Smoker     <fctr> No, No, No, No, No, No, No, No, No, No, No, Yes, Y...
$ Weight     <int> 3085, 3515, 3775, 3265, 2970, 2850, 2737, 3515, 374...
$ Gestation  <int> 40, 39, 40, 39, 40, 38, 38, 37, 39, 40, 41, 39, 40,...
smoker <- Girls2004 %>%
  group_by(Smoker) %>%
  summarize(Mean = mean(Weight), StDev = sd(Weight), n())
smoker
# A tibble: 2 x 4
  Smoker     Mean    StDev `n()`
  <fctr>    <dbl>    <dbl> <int>
1     No 3401.580 526.1256    69
2    Yes 3114.636 467.9971    11
ggplot(Girls2004, aes(x = Weight, fill = factor(Smoker))) +
  geom_histogram(position = "dodge", bins = 8)

SOLUTION: NO~N(3401.58, 526.13) and YES~N(3114.64, 467.997)

t.test(Girls2004$Weight[Girls2004$Smoker == "Yes"],Girls2004$Weight[Girls2004$Smoker == "No"])

    Welch Two Sample t-test

data:  Girls2004$Weight[Girls2004$Smoker == "Yes"] and Girls2004$Weight[Girls2004$Smoker == "No"]
t = -1.8552, df = 14.35, p-value = 0.08423
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -617.9197   44.0330
sample estimates:
mean of x mean of y 
 3114.636  3401.580 

SOLUTION: We are 95% confident that the difference of weights of girls born to smokers and non-smokers is within (-617.92, 44.033).

Prob3 - Import the FlightDelays data set into R. Although the data represent all flights for United Airlines and American Airlines in May and June 2009, assume for this exercise that these flights are a sample from all flights flown by the two airlines under similar conditions. We will compare the lengths of flight delays between the two airlines.

FlightDelays <- read.csv("http://www1.appstate.edu/~arnholta/Data/FlightDelays.csv")
glimpse(FlightDelays)
Observations: 4,029
Variables: 10
$ ID           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
$ Carrier      <fctr> UA, UA, UA, UA, UA, UA, UA, UA, UA, UA, UA, UA, ...
$ FlightNo     <int> 403, 405, 409, 511, 667, 669, 673, 677, 679, 681,...
$ Destination  <fctr> DEN, DEN, DEN, ORD, ORD, ORD, ORD, ORD, ORD, ORD...
$ DepartTime   <fctr> 4-8am, 8-Noon, 4-8pm, 8-Noon, 4-8am, 4-8am, 8-No...
$ Day          <fctr> Fri, Fri, Fri, Fri, Fri, Fri, Fri, Fri, Fri, Fri...
$ Month        <fctr> May, May, May, May, May, May, May, May, May, May...
$ FlightLength <int> 281, 277, 279, 158, 143, 150, 158, 160, 160, 163,...
$ Delay        <int> -1, 102, 4, -2, -3, 0, -5, 0, 10, 60, 0, 32, 0, 4...
$ Delayed30    <fctr> No, Yes, No, No, No, No, No, No, No, Yes, No, Ye...
hist(FlightDelays$Delay)
boxplot(FlightDelays$Delay)

t.test(FlightDelays$Delay~FlightDelays$Carrier, alternative = "two.sided")

    Welch Two Sample t-test

data:  FlightDelays$Delay by FlightDelays$Carrier
t = -3.8255, df = 1843.8, p-value = 0.0001349
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -8.903198 -2.868194
sample estimates:
mean in group AA mean in group UA 
        10.09738         15.98308 

SOLUTION: We are 95% confident that the difference in delay times between United Airlines and American Airlines is within (-8.903, -2.868).

Prob4 - Run a simulation to see if the \(t\) ratio \(T = (\bar{X} -\mu)/(S/\sqrt{n})\) has a \(t\) distribution or even an approximate \(t\) distribution when the samples are drawn from a nonnormal distribution. Be sure to superimpose the appropriate \(t\) density curve over the density of your simulated \(T\). Try two different nonnormal distributions \(\left( Unif(a = 0, b = 1), Exp(\lambda = 1) \right)\) and remember to see if sample size makes a difference (use \(n = 15\) and \(n=500\)).

n = 15
a = 0
b = 1
sims <- 10^4
xbar <- numeric(sims)
SD <- numeric(sims)
for(i in 1:sims){
  xn <- runif(n, a, b)
  xbar[i] <- mean(xn)
  SD[i] <- sd(xn)
}
T <- (xbar - 0.5)/(SD/sqrt(15))
curve(dt(x, 3), -5, 5)
hist(T, freq = FALSE)
curve(dt(x, 3), add = TRUE)

n = 15
lamda = 1
sims <- 10^4
xbar <- numeric(sims)
SD <- numeric(sims)
for(i in 1:sims){
  xn <- rexp(n, lamda)
  xbar[i] <- mean(xn)
  SD[i] <- sd(xn)
}
T <- (xbar - 1)/(SD/sqrt(15))
curve(dt(x, 3), -5, 5)
hist(T, freq = FALSE)
curve(dt(x, 3), add = TRUE)

n = 500
a = 0
b = 1
sims <- 10^4
xbar <- numeric(sims)
SD <- numeric(sims)
for(i in 1:sims){
  xn <- runif(n, a, b)
  xbar[i] <- mean(xn)
  SD[i] <- sd(xn)
}
T <- (xbar - 0.5)/(SD/sqrt(500))
curve(dt(x, 3), -5, 5)
hist(T, freq = FALSE)
curve(dt(x, 3), add = TRUE)

n = 500
lamda = 1
sims <- 10^4
xbar <- numeric(sims)
SD <- numeric(sims)
for(i in 1:sims){
  xn <- rexp(n, lamda)
  xbar[i] <- mean(xn)
  SD[i] <- sd(xn)
}
T <- (xbar - 1)/(SD/sqrt(500))
curve(dt(x, 3), -5, 5)
hist(T, freq = FALSE)
curve(dt(x, 3), add = TRUE)

Prob5 - One question is the 2002 General Social Survey asked participants whom they voted for in the 2000 election. Of the 980 women who voted, 459 voted for Bush. Of the 759 men who voted, 426 voted for Bush.

prop.test(459, 980)

    1-sample proportions test with continuity correction

data:  459 out of 980, null probability 0.5
X-squared = 3.7969, df = 1, p-value = 0.05135
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.4368038 0.5001819
sample estimates:
        p 
0.4683673 

SOLUTION: We are 95% confident that the proportion of women who voted for Bush from the 2002 General Social Survey is within (0.437, 0.5).

prop.test(426, 759)

    1-sample proportions test with continuity correction

data:  426 out of 759, null probability 0.5
X-squared = 11.152, df = 1, p-value = 0.0008396
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5250797 0.5968214
sample estimates:
        p 
0.5612648 

SOLUTION: We are 95% confident that the proportion of men who voted for Bush from the 2002 General Social Survey is within (0.525, 0.597). The intervals for the proportion of women who voted for Bush and the proportion of men who voted for Bush do not overlap since the interval for women ends at 0.500 exactly. Also we tested the proportion of means seperately, therefore we cannot conclude anything about their difference.