This is entirely our own work except as noted at the end of the document.
Spruce into R.*Note: data set information can be found at http://www1.appstate.edu/~arnholta/Data/
Spruce <- read.csv("http://www1.appstate.edu/~arnholta/Data/Spruce.csv")
head(Spruce)
Tree Competition Fertilizer Height0 Height5 Diameter0 Diameter5
1 1 NC F 15.0 60.0 1.984375 7.4
2 2 NC F 9.0 45.2 1.190625 5.2
3 3 NC F 12.0 42.0 1.785937 5.7
4 4 NC F 13.7 49.5 1.587500 6.4
5 5 NC F 12.0 47.3 1.587500 6.2
6 6 NC F 12.0 56.4 1.587500 7.4
Ht.change Di.change
1 45.0 5.415625
2 36.2 4.009375
3 30.0 3.914062
4 35.8 4.812500
5 35.3 4.612500
6 44.4 5.812500
summary(Spruce)
Tree Competition Fertilizer Height0 Height5
Min. : 1.00 C :36 F :36 Min. : 9.00 Min. :24.00
1st Qu.:18.75 NC:36 NF:36 1st Qu.:13.28 1st Qu.:37.92
Median :36.50 Median :14.75 Median :45.30
Mean :36.50 Mean :14.57 Mean :45.51
3rd Qu.:54.25 3rd Qu.:16.00 3rd Qu.:53.25
Max. :72.00 Max. :18.70 Max. :68.00
Diameter0 Diameter5 Ht.change Di.change
Min. :1.191 Min. : 2.700 Min. : 8.30 Min. :1.019
1st Qu.:1.786 1st Qu.: 4.450 1st Qu.:23.20 1st Qu.:2.712
Median :1.984 Median : 5.750 Median :30.10 Median :3.915
Mean :1.935 Mean : 5.931 Mean :30.93 Mean :3.996
3rd Qu.:1.984 3rd Qu.: 7.100 3rd Qu.:38.17 3rd Qu.:5.116
Max. :2.381 Max. :11.300 Max. :51.50 Max. :8.919
Ht.change.EDA.hist(Spruce$Ht.change)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
EDA.qq(Spruce$Ht.change)
spruce_t <- t.test(Spruce$Ht.change)
head(spruce_t)
$statistic
t
23.7549
$parameter
df
71
$p.value
[1] 1.636099e-35
$conf.int
[1] 28.33685 33.52982
attr(,"conf.level")
[1] 0.95
$estimate
mean of x
30.93333
$null.value
mean
0
spruce_t$conf.int[1]
[1] 28.33685
spruce_t$conf.int[2]
[1] 33.52982
SOLUTION: We are 95% confident that the mean height change for a spruce tree over the 5-year period is within (28.33685, 33.52982) .
Ht.change for the seedlings in the fertilized and nonfertilized plots.#With Fertilizer
EDA.hist(Spruce$Ht.change[Spruce$Fertilizer == "F"])
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
EDA.qq(Spruce$Ht.change[Spruce$Fertilizer == "F"])
#Without Fertilizer
EDA.hist(Spruce$Ht.change[Spruce$Fertilizer != "F"])
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
EDA.qq(Spruce$Ht.change[Spruce$Fertilizer != "F"])
spruce_t2 <- t.test(Spruce$Ht.change~Spruce$Fertilizer,alternative="greater")
head(spruce_t2)
$statistic
t
7.558587
$parameter
df
69.6971
$p.value
[1] 6.067505e-11
$conf.int
[1] 11.46664 Inf
attr(,"conf.level")
[1] 0.95
$estimate
mean in group F mean in group NF
38.28889 23.57778
$null.value
difference in means
0
spruce_t2$conf.int[1]
[1] 11.46664
SOLUTION: We are 95% confident that the difference in heights between spruce trees that were fertilized and those that were not during a 5 year period is at least 11.46664.
Girls2004 with birth weights of baby girls born in Wyoming or Alaska.Girls2004 <- read.csv("http://www1.appstate.edu/~arnholta/Data/Girls2004.csv")
Girls2004
ID State MothersAge Smoker Weight Gestation
1 1 WY 15-19 No 3085 40
2 2 WY 35-39 No 3515 39
3 3 WY 25-29 No 3775 40
4 4 WY 20-24 No 3265 39
5 5 WY 25-29 No 2970 40
6 6 WY 20-24 No 2850 38
7 7 WY 20-24 No 2737 38
8 8 WY 25-29 No 3515 37
9 9 WY 25-29 No 3742 39
10 10 WY 35-39 No 3570 40
11 11 WY 20-24 No 3834 41
12 12 WY 20-24 Yes 3090 39
13 13 WY 25-29 Yes 3350 40
14 14 WY 30-34 No 3292 37
15 15 WY 15-19 No 3317 40
16 16 WY 30-34 No 2485 37
17 17 WY 20-24 No 3215 39
18 18 WY 20-24 No 3230 40
19 19 WY 30-34 No 3345 39
20 20 WY 25-29 No 3050 41
21 21 WY 30-34 No 2212 37
22 22 WY 35-39 No 3605 39
23 23 WY 30-34 No 2722 39
24 24 WY 30-34 No 2880 39
25 25 WY 20-24 No 3610 39
26 26 WY 30-34 No 3355 39
27 27 WY 20-24 No 3995 41
28 28 WY 20-24 Yes 2948 39
29 29 WY 35-39 No 3345 41
30 30 WY 30-34 Yes 2892 39
31 31 WY 20-24 No 2466 37
32 32 WY 20-24 Yes 3290 39
33 33 WY 25-29 No 3310 39
34 34 WY 40-44 No 3175 37
35 35 WY 25-29 No 2715 38
36 36 WY 25-29 No 3540 38
37 37 WY 25-29 No 3402 38
38 38 WY 25-29 Yes 3923 39
39 39 WY 20-24 No 3204 37
40 40 WY 15-19 Yes 2495 37
41 41 AK 20-24 No 4337 41
42 42 AK 20-24 No 2948 40
43 43 AK 30-34 No 3269 39
44 44 AK 20-24 No 3608 38
45 45 AK 30-34 No 4016 39
46 46 AK 25-29 No 2919 40
47 47 AK 20-24 No 2608 37
48 48 AK 40-44 No 4309 39
49 49 AK 20-24 No 3288 39
50 50 AK 25-29 No 3742 38
51 51 AK 15-19 No 4394 41
52 52 AK 20-24 No 2182 37
53 53 AK 25-29 No 4592 40
54 54 AK 20-24 No 3090 39
55 55 AK 30-34 No 3770 40
56 56 AK 20-24 No 3977 39
57 57 AK 25-29 No 3153 40
58 58 AK 25-29 No 3458 41
59 59 AK 15-19 No 3912 38
60 60 AK 20-24 Yes 2863 40
61 61 AK 35-39 No 3190 39
62 62 AK 25-29 Yes 3515 38
63 63 AK 25-29 No 3288 39
64 64 AK 15-19 No 3114 40
65 65 AK 30-34 Yes 3543 41
66 66 AK 20-24 No 3825 39
67 67 AK 25-29 No 3458 39
68 68 AK 30-34 No 3698 41
69 69 AK 20-24 No 3572 39
70 70 AK 30-34 Yes 2352 40
71 71 AK 20-24 No 3175 40
72 72 AK 25-29 No 3742 41
73 73 AK 20-24 No 3997 39
74 74 AK 25-29 No 2576 38
75 75 AK 30-34 No 3572 40
76 76 AK 35-39 No 3968 39
77 77 AK 20-24 No 4564 42
78 78 AK 20-24 No 4210 40
79 79 AK 25-29 No 3260 38
80 80 AK 20-24 No 3600 40
summary(Girls2004)
ID State MothersAge Smoker Weight
Min. : 1.00 AK:40 15-19: 6 No :69 Min. :2182
1st Qu.:20.75 WY:40 20-24:29 Yes:11 1st Qu.:3076
Median :40.50 25-29:22 Median :3331
Mean :40.50 30-34:15 Mean :3362
3rd Qu.:60.25 35-39: 6 3rd Qu.:3709
Max. :80.00 40-44: 2 Max. :4592
Gestation
Min. :37.00
1st Qu.:38.00
Median :39.00
Mean :39.14
3rd Qu.:40.00
Max. :42.00
#Wyoming
EDA.hist(Girls2004$Weight[Girls2004$State == "WY"])
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
EDA.qq(Girls2004$Weight[Girls2004$State == "WY"])
#Alaska
EDA.hist(Girls2004$Weight[Girls2004$State == "AK"])
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
EDA.qq(Girls2004$Weight[Girls2004$State == "AK"])
SOLUTION: On average, the distribution of weight between the babies born in the two states is greater in Alaksa than in Wyoming.
girls_t <- t.test(Girls2004$Weight[Girls2004$State == "AK"],Girls2004$Weight[Girls2004$State == "WY"])
girls_t
Welch Two Sample t-test
data: Girls2004$Weight[Girls2004$State == "AK"] and Girls2004$Weight[Girls2004$State == "WY"]
t = 2.7316, df = 71.007, p-value = 0.007946
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
83.29395 533.60605
sample estimates:
mean of x mean of y
3516.35 3207.90
girls_t$conf.int[1]
[1] 83.29395
girls_t$conf.int[2]
[1] 533.606
SOLUTION: We are 95% confident that the difference in weights between girls born in Wyoming and girls born in Alaska is within (83.29395, 533.606).
#Smokers
EDA.hist(Girls2004$Weight[Girls2004$Smoker == "Yes"])
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
EDA.qq(Girls2004$Weight[Girls2004$Smoker == "Yes"])
#Nonsmokers
EDA.hist(Girls2004$Weight[Girls2004$Smoker == "No"])
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
EDA.qq(Girls2004$Weight[Girls2004$Smoker == "No"])
SOLUTION: On avereage, the weights of babies born for smokers is less than the weights of babies born for nonsmokers.
girls_t2 <- t.test(Girls2004$Weight[Girls2004$Smoker == "Yes"],Girls2004$Weight[Girls2004$Smoker == "No"])
girls_t2
Welch Two Sample t-test
data: Girls2004$Weight[Girls2004$Smoker == "Yes"] and Girls2004$Weight[Girls2004$Smoker == "No"]
t = -1.8552, df = 14.35, p-value = 0.08423
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-617.9197 44.0330
sample estimates:
mean of x mean of y
3114.636 3401.580
girls_t2$conf.int[1]
[1] -617.9197
girls_t2$conf.int[2]
[1] 44.033
SOLUTION: We are 95% confident that the difference in weights between girls born to smokers and girls born to non-smokers is within (-617.9197, 44.033)
FlightDelays data set into R. Although the data represent all flights for United Airlines and American Airlines in May and June 2009, assume for this exercise that these flights are a sample from all flights flown by the two airlines under similar conditions. We will compare the lengths of flight delays between the two airlines.FlightDelays <- read.csv("http://www1.appstate.edu/~arnholta/Data/FlightDelays.csv")
summary(FlightDelays)
ID Carrier FlightNo Destination DepartTime
Min. : 1 AA:2906 Min. : 71.0 BNA: 172 4-8am : 699
1st Qu.:1008 UA:1123 1st Qu.: 371.0 DEN: 264 4-8pm : 972
Median :2015 Median : 691.0 DFW: 918 8-Mid : 257
Mean :2015 Mean : 827.1 IAD: 55 8-Noon :1053
3rd Qu.:3022 3rd Qu.: 787.0 MIA: 610 Noon-4pm:1048
Max. :4029 Max. :2255.0 ORD:1785
STL: 225
Day Month FlightLength Delay Delayed30
Fri:637 June:2030 Min. : 68.0 Min. :-19.00 No :3432
Mon:630 May :1999 1st Qu.:155.0 1st Qu.: -6.00 Yes: 597
Sat:453 Median :163.0 Median : -3.00
Sun:551 Mean :185.3 Mean : 11.74
Thu:566 3rd Qu.:228.0 3rd Qu.: 5.00
Tue:628 Max. :295.0 Max. :693.00
Wed:564
head(FlightDelays)
ID Carrier FlightNo Destination DepartTime Day Month FlightLength Delay
1 1 UA 403 DEN 4-8am Fri May 281 -1
2 2 UA 405 DEN 8-Noon Fri May 277 102
3 3 UA 409 DEN 4-8pm Fri May 279 4
4 4 UA 511 ORD 8-Noon Fri May 158 -2
5 5 UA 667 ORD 4-8am Fri May 143 -3
6 6 UA 669 ORD 4-8am Fri May 150 0
Delayed30
1 No
2 Yes
3 No
4 No
5 No
6 No
#United Airlines
EDA.hist(FlightDelays$Delay[FlightDelays$Carrier == "UA"])
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
EDA.qq(FlightDelays$Delay[FlightDelays$Carrier == "UA"])
#American Airlines
EDA.hist(FlightDelays$Delay[FlightDelays$Carrier == "AA"])
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
EDA.qq(FlightDelays$Delay[FlightDelays$Carrier == "AA"])
flight_t <- t.test(FlightDelays$Delay[FlightDelays$Carrier == "UA"],FlightDelays$Delay[FlightDelays$Carrier == "AA"])
flight_t
Welch Two Sample t-test
data: FlightDelays$Delay[FlightDelays$Carrier == "UA"] and FlightDelays$Delay[FlightDelays$Carrier == "AA"]
t = 3.8255, df = 1843.8, p-value = 0.0001349
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
2.868194 8.903198
sample estimates:
mean of x mean of y
15.98308 10.09738
flight_t$conf.int[1]
[1] 2.868194
flight_t$conf.int[2]
[1] 8.903198
SOLUTION: We are 95% confident that the difference in delay times between United Airlines and American Airlines is within (2.868194, 8.903198).
\(T = \frac{\bar{X} - \mu}{\frac{s}{\sqrt{n}}}\) ~ \(t_{n-1}\) Assuming t distribution \(t_{n-1}\) is a normal distribution
sims <- 10^4
xbar <- numeric(sims)
SD <- numeric(sims)
#Example given in class,
## Start
Mu <- 100
sigma <- 10
n<- 15
## End
for (i in 1: sims)
{
xn <- rnorm(n, Mu, sigma)
xbar[i] <- mean(xn)
SD[i] <- sd(xn)
}
T <- (xbar - Mu) - (SD / sqrt(n))
hist(T)
#alternative way with hist graphics of density historgram
hist(T, freq = FALSE)
# Need to add t distribution according to the 14 degrees of freedom
curve(dt(x,15))
curve(dt(x,15), -5, 5)
# Class notes
junk <- rt(10000, 15)
hist(junk)
hist(junk, freq = FALSE)
hist(junk, freq = FALSE, breaks = "Scott")
curve(dt(x, 15), add = TRUE)
SOULTION: Unbalanced sample sizes, pooled CI’s are not capturing true mean differenceat 95%. Balanced case, pooled CI does better.