Chapter Seven Homework Answers Part 1

This is entirely our own work except as noted at the end of the document.

Prob1 - Import the data set Spruce into R.

Create exploratory plots to check the distribution of the variable Ht.change.

Spruce <- read.csv("http://www1.appstate.edu/~arnholta/Data/Spruce.csv")
glimpse(Spruce)

Observations: 72
Variables: 9
$ Tree        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
$ Competition <fctr> NC, NC, NC, NC, NC, NC, NC, NC, NC, NC, NC, NC, C...
$ Fertilizer  <fctr> F, F, F, F, F, F, NF, NF, NF, NF, NF, NF, F, F, F...
$ Height0     <dbl> 15.0, 9.0, 12.0, 13.7, 12.0, 12.0, 16.8, 14.6, 16....
$ Height5     <dbl> 60.0, 45.2, 42.0, 49.5, 47.3, 56.4, 43.5, 49.2, 54...
$ Diameter0   <dbl> 1.984375, 1.190625, 1.785937, 1.587500, 1.587500, ...
$ Diameter5   <dbl> 7.4, 5.2, 5.7, 6.4, 6.2, 7.4, 4.9, 5.4, 7.1, 5.1, ...
$ Ht.change   <dbl> 45.0, 36.2, 30.0, 35.8, 35.3, 44.4, 26.7, 34.6, 38...
$ Di.change   <dbl> 5.415625, 4.009375, 3.914062, 4.812500, 4.612500, ...

hist(Spruce$Ht.change)
boxplot(Spruce$Ht.change)

Find a 95% \(t\) confidence interval for the mean height change over the 5-year period of the study and give a sentence interpreting your interval.

t.test(Spruce$Ht.change)


    One Sample t-test

data:  Spruce$Ht.change
t = 23.755, df = 71, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 28.33685 33.52982
sample estimates:
mean of x 
 30.93333

SOLUTION: We are 95% confident that the change in height for a spruce tree in a 5-year period is within (28.337, 33.530).

Create exploratory plots to compare the distributions of the variable Ht.change for the seedlings in the fertilized and nonfertilized plots.

library(dplyr)
fert <- Spruce %>%
  group_by(Fertilizer) %>%
  summarize(Mean = mean(Ht.change), StDev = sd(Ht.change), n())
fert

# A tibble: 2 x 4
  Fertilizer     Mean    StDev `n()`
      <fctr>    <dbl>    <dbl> <int>
1          F 38.28889 7.980540    36
2         NF 23.57778 8.525193    36

library(ggplot2)
ggplot(Spruce, aes(x = Ht.change, fill = factor(Fertilizer))) +
  geom_histogram(position = "dodge", bins = 10)

Find the 95% one-sided lower \(t\) confidence bound for the difference in mean heights (\(\mu_F - \mu_{NF}\)) over the 5-year period of the study and give a sentence interpreting your interval.

t.test(Spruce$Ht.change~Spruce$Fertilizer, alternative = "greater")


    Welch Two Sample t-test

data:  Spruce$Ht.change by Spruce$Fertilizer
t = 7.5586, df = 69.697, p-value = 6.068e-11
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 11.46664      Inf
sample estimates:
 mean in group F mean in group NF 
        38.28889         23.57778

SOLUTION: We are 95% confident that the difference in heights between fertilized and non-fertilized spruce trees in a 5-year period is at least 11.467.

Prob2 - Consider the data set Girls2004 with birth weights of baby girls born in Wyoming or Alaska.

Create exploratory plots and compare the distribution of weight between the babies born in the two states.

Girls2004 <- read.csv("http://www1.appstate.edu/~arnholta/Data/Girls2004.csv")
glimpse(Girls2004)

Observations: 80
Variables: 6
$ ID         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
$ State      <fctr> WY, WY, WY, WY, WY, WY, WY, WY, WY, WY, WY, WY, WY...
$ MothersAge <fctr> 15-19, 35-39, 25-29, 20-24, 25-29, 20-24, 20-24, 2...
$ Smoker     <fctr> No, No, No, No, No, No, No, No, No, No, No, Yes, Y...
$ Weight     <int> 3085, 3515, 3775, 3265, 2970, 2850, 2737, 3515, 374...
$ Gestation  <int> 40, 39, 40, 39, 40, 38, 38, 37, 39, 40, 41, 39, 40,...

state <- Girls2004 %>%
  group_by(State) %>%
  summarize(Mean = mean(Weight), StDev = sd(Weight), n())
state

# A tibble: 2 x 4
   State    Mean    StDev `n()`
  <fctr>   <dbl>    <dbl> <int>
1     AK 3516.35 578.8336    40
2     WY 3207.90 418.3184    40

ggplot(Girls2004, aes(x = Weight, fill = factor(State))) +
  geom_histogram(position = "dodge", bins = 10)

SOLUTION: AK~N(3516.35, 578.83) and WY~N(3207.90, 418.32)

Find a 95% \(t\) confidence interval for the difference in mean weights for girls born in these two states. Give a sentence interpreting this interval.

t.test(Girls2004$Weight[Girls2004$State == "AK"],Girls2004$Weight[Girls2004$State == "WY"])


    Welch Two Sample t-test

data:  Girls2004$Weight[Girls2004$State == "AK"] and Girls2004$Weight[Girls2004$State == "WY"]
t = 2.7316, df = 71.007, p-value = 0.007946
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  83.29395 533.60605
sample estimates:
mean of x mean of y 
  3516.35   3207.90

SOLUTION: We are 95% confident that the difference in weights between girls born in Alaska and Wyoming is within (83.294, 533.606).

Create exploratory plots and compare the distribution of weights between babies born to nonsmokers and babies born to smokers.

glimpse(Girls2004)

Observations: 80
Variables: 6
$ ID         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
$ State      <fctr> WY, WY, WY, WY, WY, WY, WY, WY, WY, WY, WY, WY, WY...
$ MothersAge <fctr> 15-19, 35-39, 25-29, 20-24, 25-29, 20-24, 20-24, 2...
$ Smoker     <fctr> No, No, No, No, No, No, No, No, No, No, No, Yes, Y...
$ Weight     <int> 3085, 3515, 3775, 3265, 2970, 2850, 2737, 3515, 374...
$ Gestation  <int> 40, 39, 40, 39, 40, 38, 38, 37, 39, 40, 41, 39, 40,...

smoker <- Girls2004 %>%
  group_by(Smoker) %>%
  summarize(Mean = mean(Weight), StDev = sd(Weight), n())
smoker

# A tibble: 2 x 4
  Smoker     Mean    StDev `n()`
  <fctr>    <dbl>    <dbl> <int>
1     No 3401.580 526.1256    69
2    Yes 3114.636 467.9971    11

ggplot(Girls2004, aes(x = Weight, fill = factor(Smoker))) +
  geom_histogram(position = "dodge", bins = 8)

SOLUTION: NO~N(3401.58, 526.13) and YES~N(3114.64, 467.997)

Find a 95% \(t\) confidence interval for the difference in mean weights between babies born to nonsmokers and smokers. Give a sentence interpreting this interval.

t.test(Girls2004$Weight[Girls2004$Smoker == "Yes"],Girls2004$Weight[Girls2004$Smoker == "No"])


    Welch Two Sample t-test

data:  Girls2004$Weight[Girls2004$Smoker == "Yes"] and Girls2004$Weight[Girls2004$Smoker == "No"]
t = -1.8552, df = 14.35, p-value = 0.08423
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -617.9197   44.0330
sample estimates:
mean of x mean of y 
 3114.636  3401.580

SOLUTION: We are 95% confident that the difference of weights of girls born to smokers and non-smokers is within (-617.92, 44.033).

Prob3 - Import the FlightDelays data set into R. Although the data represent all flights for United Airlines and American Airlines in May and June 2009, assume for this exercise that these flights are a sample from all flights flown by the two airlines under similar conditions. We will compare the lengths of flight delays between the two airlines.

Create exploratory plots of the lengths of delays for the two airlines.

FlightDelays <- read.csv("http://www1.appstate.edu/~arnholta/Data/FlightDelays.csv")
glimpse(FlightDelays)

Observations: 4,029
Variables: 10
$ ID           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
$ Carrier      <fctr> UA, UA, UA, UA, UA, UA, UA, UA, UA, UA, UA, UA, ...
$ FlightNo     <int> 403, 405, 409, 511, 667, 669, 673, 677, 679, 681,...
$ Destination  <fctr> DEN, DEN, DEN, ORD, ORD, ORD, ORD, ORD, ORD, ORD...
$ DepartTime   <fctr> 4-8am, 8-Noon, 4-8pm, 8-Noon, 4-8am, 4-8am, 8-No...
$ Day          <fctr> Fri, Fri, Fri, Fri, Fri, Fri, Fri, Fri, Fri, Fri...
$ Month        <fctr> May, May, May, May, May, May, May, May, May, May...
$ FlightLength <int> 281, 277, 279, 158, 143, 150, 158, 160, 160, 163,...
$ Delay        <int> -1, 102, 4, -2, -3, 0, -5, 0, 10, 60, 0, 32, 0, 4...
$ Delayed30    <fctr> No, Yes, No, No, No, No, No, No, No, Yes, No, Ye...

hist(FlightDelays$Delay)
boxplot(FlightDelays$Delay)

Find a 95% \(t\) confidence interval for the difference in mean flight delays between the two airlines and interpret this interval.

t.test(FlightDelays$Delay~FlightDelays$Carrier, alternative = "two.sided")


    Welch Two Sample t-test

data:  FlightDelays$Delay by FlightDelays$Carrier
t = -3.8255, df = 1843.8, p-value = 0.0001349
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -8.903198 -2.868194
sample estimates:
mean in group AA mean in group UA 
        10.09738         15.98308

SOLUTION: We are 95% confident that the difference in delay times between United Airlines and American Airlines is within (-8.903, -2.868).

Prob4 - Run a simulation to see if the \(t\) ratio \(T = (\bar{X} -\mu)/(S/\sqrt{n})\) has a \(t\) distribution or even an approximate \(t\) distribution when the samples are drawn from a nonnormal distribution. Be sure to superimpose the appropriate \(t\) density curve over the density of your simulated \(T\). Try two different nonnormal distributions \(\left( Unif(a = 0, b = 1), Exp(\lambda = 1) \right)\) and remember to see if sample size makes a difference (use \(n = 15\) and \(n=500\)).

n = 15
a = 0
b = 1
sims <- 10^4
xbar <- numeric(sims)
SD <- numeric(sims)
for(i in 1:sims){
  xn <- runif(n, a, b)
  xbar[i] <- mean(xn)
  SD[i] <- sd(xn)
}
T <- (xbar - 0.5)/(SD/sqrt(15))
curve(dt(x, 3), -5, 5)
hist(T, freq = FALSE)
curve(dt(x, 3), add = TRUE)

n = 15
lamda = 1
sims <- 10^4
xbar <- numeric(sims)
SD <- numeric(sims)
for(i in 1:sims){
  xn <- rexp(n, lamda)
  xbar[i] <- mean(xn)
  SD[i] <- sd(xn)
}
T <- (xbar - 1)/(SD/sqrt(15))
curve(dt(x, 3), -5, 5)
hist(T, freq = FALSE)
curve(dt(x, 3), add = TRUE)

n = 500
a = 0
b = 1
sims <- 10^4
xbar <- numeric(sims)
SD <- numeric(sims)
for(i in 1:sims){
  xn <- runif(n, a, b)
  xbar[i] <- mean(xn)
  SD[i] <- sd(xn)
}
T <- (xbar - 0.5)/(SD/sqrt(500))
curve(dt(x, 3), -5, 5)
hist(T, freq = FALSE)
curve(dt(x, 3), add = TRUE)

n = 500
lamda = 1
sims <- 10^4
xbar <- numeric(sims)
SD <- numeric(sims)
for(i in 1:sims){
  xn <- rexp(n, lamda)
  xbar[i] <- mean(xn)
  SD[i] <- sd(xn)
}
T <- (xbar - 1)/(SD/sqrt(500))
curve(dt(x, 3), -5, 5)
hist(T, freq = FALSE)
curve(dt(x, 3), add = TRUE)

Prob5 - One question is the 2002 General Social Survey asked participants whom they voted for in the 2000 election. Of the 980 women who voted, 459 voted for Bush. Of the 759 men who voted, 426 voted for Bush.

Find a 95% confidence interval for the proportion of women who voted for Bush.

prop.test(459, 980)


    1-sample proportions test with continuity correction

data:  459 out of 980, null probability 0.5
X-squared = 3.7969, df = 1, p-value = 0.05135
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.4368038 0.5001819
sample estimates:
        p 
0.4683673

SOLUTION: We are 95% confident that the proportion of women who voted for Bush from the 2002 General Social Survey is within (0.437, 0.5).

Find a 95% confidence interval for the proportion of men who voted for Bush. Do the intervals for the men and women overlap? What, if anything, can you conclude about gender difference in voter preference?

prop.test(426, 759)


    1-sample proportions test with continuity correction

data:  426 out of 759, null probability 0.5
X-squared = 11.152, df = 1, p-value = 0.0008396
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5250797 0.5968214
sample estimates:
        p 
0.5612648

SOLUTION: We are 95% confident that the proportion of men who voted for Bush from the 2002 General Social Survey is within (0.525, 0.597). The intervals for the proportion of women who voted for Bush and the proportion of men who voted for Bush do not overlap since the interval for women ends at 0.500 exactly. Also we tested the proportion of means seperately, therefore we cannot conclude anything about their difference.

Chapter Seven Homework Answers Part 1

Sara Hursa

Thursday, Nov 30, 2017 - 01:37:47 PM