This is entirely our own work except as noted at the end of the document.
Prob1 - Import the data set Spruce into R.
Ht.change.Spruce <- read.csv("http://www1.appstate.edu/~arnholta/Data/Spruce.csv")
glimpse(Spruce)
Observations: 72
Variables: 9
$ Tree <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
$ Competition <fctr> NC, NC, NC, NC, NC, NC, NC, NC, NC, NC, NC, NC, C...
$ Fertilizer <fctr> F, F, F, F, F, F, NF, NF, NF, NF, NF, NF, F, F, F...
$ Height0 <dbl> 15.0, 9.0, 12.0, 13.7, 12.0, 12.0, 16.8, 14.6, 16....
$ Height5 <dbl> 60.0, 45.2, 42.0, 49.5, 47.3, 56.4, 43.5, 49.2, 54...
$ Diameter0 <dbl> 1.984375, 1.190625, 1.785937, 1.587500, 1.587500, ...
$ Diameter5 <dbl> 7.4, 5.2, 5.7, 6.4, 6.2, 7.4, 4.9, 5.4, 7.1, 5.1, ...
$ Ht.change <dbl> 45.0, 36.2, 30.0, 35.8, 35.3, 44.4, 26.7, 34.6, 38...
$ Di.change <dbl> 5.415625, 4.009375, 3.914062, 4.812500, 4.612500, ...
hist(Spruce$Ht.change)
boxplot(Spruce$Ht.change)
t.test(Spruce$Ht.change)
One Sample t-test
data: Spruce$Ht.change
t = 23.755, df = 71, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
28.33685 33.52982
sample estimates:
mean of x
30.93333
SOLUTION: We are 95% confident that the change in height for a spruce tree in a 5-year period is within (28.337, 33.530).
Ht.change for the seedlings in the fertilized and nonfertilized plots.library(dplyr)
fert <- Spruce %>%
group_by(Fertilizer) %>%
summarize(Mean = mean(Ht.change), StDev = sd(Ht.change), n())
fert
# A tibble: 2 x 4
Fertilizer Mean StDev `n()`
<fctr> <dbl> <dbl> <int>
1 F 38.28889 7.980540 36
2 NF 23.57778 8.525193 36
library(ggplot2)
ggplot(Spruce, aes(x = Ht.change, fill = factor(Fertilizer))) +
geom_histogram(position = "dodge", bins = 10)
t.test(Spruce$Ht.change~Spruce$Fertilizer, alternative = "greater")
Welch Two Sample t-test
data: Spruce$Ht.change by Spruce$Fertilizer
t = 7.5586, df = 69.697, p-value = 6.068e-11
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
11.46664 Inf
sample estimates:
mean in group F mean in group NF
38.28889 23.57778
SOLUTION: We are 95% confident that the difference in heights between fertilized and non-fertilized spruce trees in a 5-year period is at least 11.467.
Prob2 - Consider the data set Girls2004 with birth weights of baby girls born in Wyoming or Alaska.
Girls2004 <- read.csv("http://www1.appstate.edu/~arnholta/Data/Girls2004.csv")
glimpse(Girls2004)
Observations: 80
Variables: 6
$ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
$ State <fctr> WY, WY, WY, WY, WY, WY, WY, WY, WY, WY, WY, WY, WY...
$ MothersAge <fctr> 15-19, 35-39, 25-29, 20-24, 25-29, 20-24, 20-24, 2...
$ Smoker <fctr> No, No, No, No, No, No, No, No, No, No, No, Yes, Y...
$ Weight <int> 3085, 3515, 3775, 3265, 2970, 2850, 2737, 3515, 374...
$ Gestation <int> 40, 39, 40, 39, 40, 38, 38, 37, 39, 40, 41, 39, 40,...
state <- Girls2004 %>%
group_by(State) %>%
summarize(Mean = mean(Weight), StDev = sd(Weight), n())
state
# A tibble: 2 x 4
State Mean StDev `n()`
<fctr> <dbl> <dbl> <int>
1 AK 3516.35 578.8336 40
2 WY 3207.90 418.3184 40
ggplot(Girls2004, aes(x = Weight, fill = factor(State))) +
geom_histogram(position = "dodge", bins = 10)
SOLUTION: AK~N(3516.35, 578.83) and WY~N(3207.90, 418.32)
t.test(Girls2004$Weight[Girls2004$State == "AK"],Girls2004$Weight[Girls2004$State == "WY"])
Welch Two Sample t-test
data: Girls2004$Weight[Girls2004$State == "AK"] and Girls2004$Weight[Girls2004$State == "WY"]
t = 2.7316, df = 71.007, p-value = 0.007946
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
83.29395 533.60605
sample estimates:
mean of x mean of y
3516.35 3207.90
SOLUTION: We are 95% confident that the difference in weights between girls born in Alaska and Wyoming is within (83.294, 533.606).
glimpse(Girls2004)
Observations: 80
Variables: 6
$ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
$ State <fctr> WY, WY, WY, WY, WY, WY, WY, WY, WY, WY, WY, WY, WY...
$ MothersAge <fctr> 15-19, 35-39, 25-29, 20-24, 25-29, 20-24, 20-24, 2...
$ Smoker <fctr> No, No, No, No, No, No, No, No, No, No, No, Yes, Y...
$ Weight <int> 3085, 3515, 3775, 3265, 2970, 2850, 2737, 3515, 374...
$ Gestation <int> 40, 39, 40, 39, 40, 38, 38, 37, 39, 40, 41, 39, 40,...
smoker <- Girls2004 %>%
group_by(Smoker) %>%
summarize(Mean = mean(Weight), StDev = sd(Weight), n())
smoker
# A tibble: 2 x 4
Smoker Mean StDev `n()`
<fctr> <dbl> <dbl> <int>
1 No 3401.580 526.1256 69
2 Yes 3114.636 467.9971 11
ggplot(Girls2004, aes(x = Weight, fill = factor(Smoker))) +
geom_histogram(position = "dodge", bins = 8)
SOLUTION: NO~N(3401.58, 526.13) and YES~N(3114.64, 467.997)
t.test(Girls2004$Weight[Girls2004$Smoker == "Yes"],Girls2004$Weight[Girls2004$Smoker == "No"])
Welch Two Sample t-test
data: Girls2004$Weight[Girls2004$Smoker == "Yes"] and Girls2004$Weight[Girls2004$Smoker == "No"]
t = -1.8552, df = 14.35, p-value = 0.08423
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-617.9197 44.0330
sample estimates:
mean of x mean of y
3114.636 3401.580
SOLUTION: We are 95% confident that the difference of weights of girls born to smokers and non-smokers is within (-617.92, 44.033).
Prob3 - Import the FlightDelays data set into R. Although the data represent all flights for United Airlines and American Airlines in May and June 2009, assume for this exercise that these flights are a sample from all flights flown by the two airlines under similar conditions. We will compare the lengths of flight delays between the two airlines.
FlightDelays <- read.csv("http://www1.appstate.edu/~arnholta/Data/FlightDelays.csv")
glimpse(FlightDelays)
Observations: 4,029
Variables: 10
$ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
$ Carrier <fctr> UA, UA, UA, UA, UA, UA, UA, UA, UA, UA, UA, UA, ...
$ FlightNo <int> 403, 405, 409, 511, 667, 669, 673, 677, 679, 681,...
$ Destination <fctr> DEN, DEN, DEN, ORD, ORD, ORD, ORD, ORD, ORD, ORD...
$ DepartTime <fctr> 4-8am, 8-Noon, 4-8pm, 8-Noon, 4-8am, 4-8am, 8-No...
$ Day <fctr> Fri, Fri, Fri, Fri, Fri, Fri, Fri, Fri, Fri, Fri...
$ Month <fctr> May, May, May, May, May, May, May, May, May, May...
$ FlightLength <int> 281, 277, 279, 158, 143, 150, 158, 160, 160, 163,...
$ Delay <int> -1, 102, 4, -2, -3, 0, -5, 0, 10, 60, 0, 32, 0, 4...
$ Delayed30 <fctr> No, Yes, No, No, No, No, No, No, No, Yes, No, Ye...
hist(FlightDelays$Delay)
boxplot(FlightDelays$Delay)
t.test(FlightDelays$Delay~FlightDelays$Carrier, alternative = "two.sided")
Welch Two Sample t-test
data: FlightDelays$Delay by FlightDelays$Carrier
t = -3.8255, df = 1843.8, p-value = 0.0001349
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-8.903198 -2.868194
sample estimates:
mean in group AA mean in group UA
10.09738 15.98308
SOLUTION: We are 95% confident that the difference in delay times between United Airlines and American Airlines is within (-8.903, -2.868).
Prob4 - Run a simulation to see if the \(t\) ratio \(T = (\bar{X} -\mu)/(S/\sqrt{n})\) has a \(t\) distribution or even an approximate \(t\) distribution when the samples are drawn from a nonnormal distribution. Be sure to superimpose the appropriate \(t\) density curve over the density of your simulated \(T\). Try two different nonnormal distributions \(\left( Unif(a = 0, b = 1), Exp(\lambda = 1) \right)\) and remember to see if sample size makes a difference (use \(n = 15\) and \(n=500\)).
n = 15
a = 0
b = 1
sims <- 10^4
xbar <- numeric(sims)
SD <- numeric(sims)
for(i in 1:sims){
xn <- runif(n, a, b)
xbar[i] <- mean(xn)
SD[i] <- sd(xn)
}
T <- (xbar - 0.5)/(SD/sqrt(15))
curve(dt(x, 3), -5, 5)
hist(T, freq = FALSE)
curve(dt(x, 3), add = TRUE)
n = 15
lamda = 1
sims <- 10^4
xbar <- numeric(sims)
SD <- numeric(sims)
for(i in 1:sims){
xn <- rexp(n, lamda)
xbar[i] <- mean(xn)
SD[i] <- sd(xn)
}
T <- (xbar - 1)/(SD/sqrt(15))
curve(dt(x, 3), -5, 5)
hist(T, freq = FALSE)
curve(dt(x, 3), add = TRUE)
n = 500
a = 0
b = 1
sims <- 10^4
xbar <- numeric(sims)
SD <- numeric(sims)
for(i in 1:sims){
xn <- runif(n, a, b)
xbar[i] <- mean(xn)
SD[i] <- sd(xn)
}
T <- (xbar - 0.5)/(SD/sqrt(500))
curve(dt(x, 3), -5, 5)
hist(T, freq = FALSE)
curve(dt(x, 3), add = TRUE)
n = 500
lamda = 1
sims <- 10^4
xbar <- numeric(sims)
SD <- numeric(sims)
for(i in 1:sims){
xn <- rexp(n, lamda)
xbar[i] <- mean(xn)
SD[i] <- sd(xn)
}
T <- (xbar - 1)/(SD/sqrt(500))
curve(dt(x, 3), -5, 5)
hist(T, freq = FALSE)
curve(dt(x, 3), add = TRUE)
Prob5 - One question is the 2002 General Social Survey asked participants whom they voted for in the 2000 election. Of the 980 women who voted, 459 voted for Bush. Of the 759 men who voted, 426 voted for Bush.
prop.test(459, 980)
1-sample proportions test with continuity correction
data: 459 out of 980, null probability 0.5
X-squared = 3.7969, df = 1, p-value = 0.05135
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.4368038 0.5001819
sample estimates:
p
0.4683673
SOLUTION: We are 95% confident that the proportion of women who voted for Bush from the 2002 General Social Survey is within (0.437, 0.5).
prop.test(426, 759)
1-sample proportions test with continuity correction
data: 426 out of 759, null probability 0.5
X-squared = 11.152, df = 1, p-value = 0.0008396
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.5250797 0.5968214
sample estimates:
p
0.5612648
SOLUTION: We are 95% confident that the proportion of men who voted for Bush from the 2002 General Social Survey is within (0.525, 0.597). The intervals for the proportion of women who voted for Bush and the proportion of men who voted for Bush do not overlap since the interval for women ends at 0.500 exactly. Also we tested the proportion of means seperately, therefore we cannot conclude anything about their difference.