Consider a population that has a normal distribution with mean \(\mu = 36\), standard deviation \(\sigma = 8\)
The sampling distribution of \(\bar{X}\) for samples of size 200 will have what distribution, mean, and standard error?
Use R to draw a random sample of size \(200\) from this population. Conduct EDA on your sample.
Compute the bootstrap distribution for your sample mean, and note the bootstrap mean and standard error.
Compare the bootstrap distribution to the theoretical sampling distribution by creating a table like Table 5.2.
Repeat parts a-d for sample sizes of \(n = 50\) and \(n = 10\). Carefully describe your observations about the effects of sample size on the bootstrap distribution.
# Your code here
set.seed(13)
sam <- rnorm(200, 36, 8)
hist(sam)
xbar <- mean(sam)
SD <- sd(sam)
Your answers:
Normal Distribution
Mean = 35.8785255
SD = 8.1760899
\(x \leftarrow N(36,8)\)
\(\overline{X} \leftarrow N(36, \frac{8}{\sqrt{200}})\)
#simulation
B <- 10^4
bsmean <- numeric(B)
for (i in 1:B){
bsmean[i] <- mean(sample(sam, 200, replace = TRUE))
}
hist(bsmean)
bsxbar <- mean(bsmean)
bsSD <- sd(bsmean)
bias <- bsxbar - xbar
bias
[1] 0.001037884
      Mean: 35.8795634
      Standard Error: 0.5815149
# Your code here
#table
#knitr::kable(sam, caption = "Bootstrap vs Theoretical Distribution", booktabs = TRUE, digits = 4)
B <- 10^4
b_mean <- numeric(B)
for(i in 1: B) {
b_mean[i] <- mean(sample(B, 200, replace = TRUE)) #Something wrong changed to B so code would run
}
bmean_bar <- mean(b_mean)
bmean_bar
[1] 5002.603
| _ | Mean | SD |
|---|---|---|
| Population | 30.0000 | 8.0000 |
| Sampling | 36.0000 | 1.1313 |
| Sample | 35.7887 | 7.8005 |
| Bootstrap | 35.7968 | 1.0923 |
set.seed(13)
sam <- rnorm(50, 36, 8)
hist(sam)
xbar <- mean(sam)
SD <- sd(sam)
      Mean = 35.7886505
      Standard Deviation = 7.8004732
#simulations
#simulation
B <- 10^4
bsmean <- numeric(B)
for (i in 1:B){
bsmean[i] <- mean(sample(sam, 50, replace = TRUE))
}
hist(bsmean)
bsxbar <- mean(bsmean)
bsSD <- sd(bsmean)
      Bootstrap Standard Error = 1.0836931
d
#table
set.seed(13)
sam <- rnorm(10, 36, 8)
hist(sam)
xbar <- mean(sam)
SD <- sd(sam)
      Mean = 40.8004308
      Standard Error = 5.575858
#simulation
B <- 10^4
bsmean <- numeric(B)
for (i in 1:B){
bsmean[i] <- mean(sample(sam, 10, replace = TRUE))
}
hist(bsmean)
bsxbar <- mean(bsmean)
bsSD <- sd(bsmean)
Bootstrap Standard Error = 1.6741516
#table
Your Answer here.
set.seed(31)
ne <- 14 # n even
no <- 15 # n odd
wwe <- rnorm(ne) # draw random sample of size ne
wwo <- rnorm(no) # draw random sample of size no
N <- 10^4
even.boot <- numeric(N) # save space
odd.boot <- numeric(N)
for (i in 1:N)
{
x.even <- sample(wwe, ne, replace = TRUE)
x.odd <- sample(wwo, no, replace = TRUE)
even.boot[i] <- median(x.even)
odd.boot[i] <- median(x.odd)
}
Median <- c(even.boot, odd.boot)
Parity <- rep(c("n = 14", "n = 15"), each = N)
DF <- data.frame(Median = Median, Parity = Parity)
ggplot(data = DF, aes(x = Median)) +
geom_histogram(fill = "lightblue", color = "black") +
theme_bw() +
facet_grid(Parity ~.)
Figure 2.1: Histograms of bootstrapped median values
Your answer:
Import the data from data set Bangladesh. In addition to arsenic concentrations for 271 wells, the data set contains cobalt and chlorine concentrations.
Conduct EDA on the chlorine concentrations and describe the salient features.
Bootstrap the mean.
Find and interpret the 95% bootstrap percentile confidence interval.
What is the bootstrap estimate of the bias? What fraction of the bootstrap standard error does it represent?
Bangladesh <- read.csv("http://www1.appstate.edu/~arnholta/Data/Bangladesh.csv")
head(Bangladesh)
Arsenic Chlorine Cobalt
1 2400 6.2 0.42
2 6 116.0 0.45
3 904 14.8 0.63
4 321 35.9 0.68
5 1280 18.9 0.58
6 151 7.8 0.35
The Chlorine variable has some missing values. The following code will remove these entries:
set.seed(13)
chlorine <- subset(Bangladesh, select = Chlorine, subset = !is.na(Chlorine), drop = TRUE)
chl.mean <- mean(chlorine)
chl.sd <- sd(chlorine)
Your answers:
Mean = 78.0840149
SD = 210.019193
Graph:
# B
set.seed(13)
#simulation
B <- 10^4
chl.bsmean <- numeric(B)
for (i in 1:B){
chl.bsmean[i] <- mean(sample(chlorine, 10, replace = TRUE))
}
chl.xbar <- mean(chl.bsmean)
Bootstrap Mean = 77.342283
set.seed(13)
# Your code here
quantile(chl.bsmean, probs = c(.025, .975))
2.5% 97.5%
11.81975 248.25000
set.seed(13)
bsxbar <- mean(bsmean)
bsSD <- sd(bsmean)
bsxbar
[1] 40.81758
bsSD
[1] 1.674152
Your answer:
# Your code here
N <- 10^4
t_means <- numeric(N)
for(i in 1: N)
{
t_means[i] <- mean(sample(chlorine, length(chlorine), replace = TRUE), trim = .25)
}
mean(t_means, trim = .25)
[1] 17.70971
The values seems to be alot smaller because of the trimming cutting of the top 25% outliers and 25% of the lest most values leading to a guess alot closer to what it should be if there was not outliers.
The data set FishMercury contains mercury levels (parts per million) for 30 fish caught in lakes in Minnesota.
Create a histogram or boxplot of the data. What do you observe?
Bootstrap the mean and record the bootstrap standard error and the 95% bootstrap percentile interval.
Remove the outlier and bootstrap the mean of the remaining data. Record the bootstrap standard error and the 95% bootstrap percentile interval.
What effect did removing the outlier have on the bootstrap distribution, in particular, the standard error?
FishMercury <- read.csv("http://www1.appstate.edu/~arnholta/Data/FishMercury.csv")
head(FishMercury)
Mercury
1 1.870
2 0.160
3 0.088
4 0.160
5 0.145
6 0.099
Your answers:
# Your code here
hist(FishMercury$Mercury)
Note that there is one value (1.87) very far removed from the rest of the values.
boot.fish = numeric(10^4)
for(i in 1: 10^4)
{
boot.fish[i] = mean(sample(FishMercury$Mercury, 10, replace = TRUE))
}
boot.fishMean = mean(boot.fish)
boot.fishSD = sd(boot.fish)
quantile(boot.fish, .025, .0975)
2.5%
0.0993
Bootstrap Mean = 0.1821923
Bootstrap Standard error = 0.0992932
# Your code here
In section 3.3, we performed a permutation test to determine if men and women consumed, on average, different amounts of hot wings.
Bootstrap the difference in means and describe the bootstrap distribution.
Find a 95% bootstrap percentile confidence interval for the difference of means and give a sentence interpreting this interval.
How do the bootstrap and permutation distribution differ?
BeerWings <- read.csv("http://www1.appstate.edu/~arnholta/Data/Beerwings.csv")
head(BeerWings)
ID Hotwings Beer Gender
1 1 4 24 F
2 2 5 0 F
3 3 5 12 F
4 4 6 12 F
5 5 7 12 F
6 6 7 12 F
Your answers:
NewBeerWings = BeerWings %>%
group_by(Gender) %>%
summarise(Mn = mean(Hotwings), SD = sd(Hotwings))
maleWing = subset(BeerWings, select = Hotwings, subset = Gender == "M", drop = TRUE)
femaleWing = subset(BeerWings, select = Hotwings, subset = Gender == "F", drop = TRUE)
MN = numeric(10^4)
for (i in 1: 10^4){
P = mean(sample(maleWing, length(maleWing), replace = TRUE))
J = mean(sample(femaleWing, length(femaleWing), replace = TRUE))
MN[i] = P - J
}
BS = mean(MN)
BSSD = sd(MN)
Bootstrap difference in variance = 1.4383282
quantile(MN, .025, .975)
2.5%
2.4
Import the data from Girls2004 (see Section 1.2).
Perform some exploratory data analysis and obtain summary statistics on the weight of baby girls born in Wyoming and Arkansas (do seperate analyses for each state).
Bootstrap the difference in means, plot the distribution, and give the summary statistics. Obtain a 95% bootstrap percentile confidence interval and interpret this interval.
What is the bootstrap estimate of the bias? What fraction of the bootstrap standard error does it represent?
Conduct a permutation test to calculate the difference in mean weights and state your conclusion?
For what population(s), if any does this calculation hold? Explain?
Girls2004 <- read.csv("http://www1.appstate.edu/~arnholta/Data/Girls2004.csv")
head(Girls2004)
ID State MothersAge Smoker Weight Gestation
1 1 WY 15-19 No 3085 40
2 2 WY 35-39 No 3515 39
3 3 WY 25-29 No 3775 40
4 4 WY 20-24 No 3265 39
5 5 WY 25-29 No 2970 40
6 6 WY 20-24 No 2850 38
Your answers:
# Your code here
# Part a.
girls <- Girls2004
group_by(girls,State) %>%
filter(State =="WY") %>%
summarize(mean(Weight))
# A tibble: 1 x 2
State `mean(Weight)`
<fctr> <dbl>
1 WY 3207.9
group_by(girls,State) %>%
filter(State =="AK") %>%
summarize(mean(Weight))
# A tibble: 1 x 2
State `mean(Weight)`
<fctr> <dbl>
1 AK 3516.35
# Your code here
gw <- girls %>%
group_by(State) %>%
filter(State=="WY")
ga <- girls %>%
group_by(State) %>%
filter(State =="AK")
o_diff <- mean(ga$Weight)- mean(gw$Weight)
o_sd <- sd(ga$Weight) - sd(gw$Weight)
Sims<-10^4
gwboot.means <- numeric(Sims)
gaboot.means <- numeric(Sims)
for(i in 1:Sims)
{
gwboot.means[i] <- mean(sample(gw$Weight,length(gw$Weight),replace=TRUE))
gaboot.means[i] <- mean(sample(ga$Weight, length(ga$Weight), replace = TRUE))
}
diff <- gaboot.means - gwboot.means
ggplot(data = data.frame(diff), aes(x = diff)) +
geom_histogram()
b_diff <- mean(diff)
b_diff
[1] 308.8014
b_sd <- sd(diff)
b_sd
[1] 111.9133
conf <- quantile(diff,prob=c(.04,.96))
conf
4% 96%
108.647 502.931
# Your code here
bias <- b_diff - o_diff
bias
[1] 0.3514125
#fraction of bootstrap error
abs((b_diff - o_diff) / (sd(diff)*100))
[1] 3.140042e-05
# Your code here
N <- 10^4
ga_diff <- numeric(N)
gw_diff <- numeric(N)
for(i in 1: N)
{
ga_diff[i] <- mean(sample(ga$Weight, length(ga$Weight), replace = FALSE))
gw_diff[i] <- mean(sample(gw$Weight, length(gw$Weight), replace = FALSE))
}
f_diff <- ga_diff - gw_diff
mean(f_diff)
[1] 308.45
A: We can not find enough information to conclude anything about the populations from the sample conducted; it would require a larger scale to conclude anything about these states and the difference in weight of the babies.
IceCream contains calorie information for a sample of brands of chocolate and vanilla ice cream. Use the bootstrap to determine whether or not there is a difference in the mean number of calories.IceCream <- read.csv("http://www1.appstate.edu/~arnholta/Data/IceCream.csv")
head(IceCream)
Brand VanillaCalories VanillaFat VanillaSugar ChocolateCalories
1 Baskin Robbins 260 16.0 26.0 260
2 Ben & Jerry's 240 16.0 19.0 260
3 Blue Bunny 140 7.0 12.0 130
4 Breyers 140 7.0 13.0 140
5 Brigham's 190 12.0 17.0 200
6 Bulla 234 13.5 21.8 266
ChocolateFat ChocolateSugar
1 14 31.0
2 16 22.0
3 7 14.0
4 8 16.0
5 12 18.0
6 15 22.6
Your answer:
Choclate <- IceCream$ChocolateCalories
Vanilla <- IceCream$VanillaCalories
sims <- 10^4
bs_mean_diff <- numeric(sims)
for(i in 1:sims)
{
V <- mean(sample(Vanilla, length(Vanilla), replace = TRUE))
C <- mean(sample(Choclate, length(Choclate), replace = TRUE))
bs_mean_diff[i] <- abs(V - C)
}
bs_Xbar <- mean(bs_mean_diff)
bs_Xbar
[1] 12.27885
Import the data from Flight Delays Case Study in Section 1.1 data into R. Although the data are on all UA and AA flights flown in May and June of 2009, we will assume these represent a sample from a larger population of UA and AA flights flown under similar circumstances. We will consider the ratio of the means of the flight delay lengths, \(\mu_{\text{UA}} / \mu_{\text{AA}}\).
Perform some exploratory data analysis on flight delay lengths for each of UA and AA flights.
Bootstrap the mean of flight delay lengths for each airline seperately and describe the distribution.
Bootstrap the ratio of means. Provide plots of the bootstrap distribution and describe the distribution.
Find the 95% bootstrap percentile interval for the ratio of means. Interpret this interval.
What is the bootstrap estimate of the bias? What fraction of the bootstrap standard error does it represent?
For inference in this text, we assume that the observations are independent. Is that condition met here? Explain.
FlightDelays <- read.csv("http://www1.appstate.edu/~arnholta/Data/FlightDelays.csv")
Your answers:
group_by(FlightDelays,Carrier) %>%
filter(Carrier =="UA") %>%
summarize(mean(Delay))
# A tibble: 1 x 2
Carrier `mean(Delay)`
<fctr> <dbl>
1 UA 15.98308
group_by(FlightDelays,Carrier) %>%
filter(Carrier=="AA") %>%
summarize(mean(Delay))
# A tibble: 1 x 2
Carrier `mean(Delay)`
<fctr> <dbl>
1 AA 10.09738
UA <- FlightDelays %>%
group_by(Carrier) %>%
filter(Carrier=="UA")
AA <- FlightDelays %>%
group_by(Carrier) %>%
filter(Carrier=="AA")
sims <- 10^4
uaMean <- numeric(sims)
aaMean <- numeric(sims)
for(i in 1:sims)
{
uaMean[i] <- mean(sample(UA$Delay, length(UA$Delay), replace = TRUE))
aaMean[i] <- mean(sample(AA$Delay, length(AA$Delay), replace = TRUE))
}
ggplot(data.frame(X = uaMean), aes(x = X)) +
geom_histogram(aes(fill = ..count..)) +
scale_y_continuous(name = "Count") +
ggtitle("UA Means")
mean(uaMean)
[1] 15.97876
sd(uaMean)
[1] 1.329588
ggplot(data.frame(X = aaMean), aes(x = X)) +
geom_histogram(aes(fill = ..count..)) +
scale_y_continuous(name = "Count") +
ggtitle("AA Means")
mean(aaMean)
[1] 10.09799
sd(aaMean)
[1] 0.7349373
meanRatio <- numeric(sims)
for(i in 1:sims)
{
meanRatio[i] <- aaMean[i]/uaMean[i]
}
ggplot(data.frame(X = meanRatio), aes(x = X)) +
geom_histogram(aes(fill = ..count..)) +
scale_y_continuous(name = "Count") +
ggtitle("Mean Ratio")
mean(meanRatio)
[1] 0.636473
sd(meanRatio)
[1] 0.07166806
q <- quantile(meanRatio,c(0.05,.95))
round(q[1], 4)
5%
0.5254
round(q[2], 4)
95%
0.7633
The 95% bootstrap percenitle interval for the ratio of the mean is between 0.5243 and 0.7596. This means we are 95% confident that the ratio of mean flight delay times for UA is between 0.5243 and 0.7596 less than AA. e.
bias <- mean(meanRatio) - (mean(UA)/mean(AA))
bias #estimate of bias
[1] NA
SE <- bias/sd(meanRatio)
SE #fraction of the bootstrap standard error
[1] NA
Two college students collected data on the price of hardcover textbooks from two disciplinary areas: Mathematics and the Natural Sciences, and the Social Sciences (Hien and Baker (2010)). The data are in the file BookPrices.
Perform some exploratory data analysis on book prices for ` of the two disciplinary areas.
Bootstrap the mean of the book price for each area separately and describe the distributions.
Bootstrap the ratio of means. Provide plots of the bootstrap distribution and comment.
Find the 95% bootstrap percentile interval for the ratio of means. Interpret this interval.
What is the bootstrap estimate of the bias? What fraction of the bootstrap standard error does it represent?
BookPrices <- read.csv("http://www1.appstate.edu/~arnholta/Data/BookPrices.csv")
summary(BookPrices)
Subject Area Price
Economics : 7 Math & Science :27 Min. : 11.0
Mathematics : 7 Social Sciences:17 1st Qu.:113.0
Chemistry : 5 Median :139.0
Computer Science: 5 Mean :134.4
Physics : 5 3rd Qu.:175.9
Biology : 4 Max. :222.7
(Other) :11
Your answers:
# Your code here
math <- BookPrices %>%
filter(Area == "Math & Science")
social <- BookPrices %>%
filter(Area == "Social Sciences")
om_mean <- mean(math$Price)
om_mean
[1] 156.7341
om_sd <- sd(social$Price)
om_sd
[1] 71.91385
os_sd <- sd(math$Price)
os_sd
[1] 39.14483
o_ratio <- om_mean / os_sd
o_ratio
[1] 4.003954
N <- 10^4
math_m <- numeric(N)
social_m <- numeric(N)
for(i in 1: N)
{
math_m[i] <- mean(sample(math$Price, length(math$Price), replace = TRUE))
social_m[i] <- mean(sample(social$Price, length(social$Price), replace = TRUE))
}
hist(math_m)
hist(social_m)
# Your code here
ratio <- math_m / social_m
hist(ratio)
# Your code here
confi <- quantile(ratio, c(.05, .95))
confi
5% 95%
1.214372 2.209606
# Your code here
#bias
mean(ratio) - o_ratio
[1] -2.374535
#fraction of error in percentage
abs((mean(ratio) - o_ratio) / (sd(ratio)*100))
[1] 0.07502823