1. Consider a population that has a normal distribution with mean \(\mu = 36\) and standard deviation \(\sigma = 8\).

    1. The sampling distribution of \(\bar{X}\) for samples of size 200 will have what distribution, mean, and standard error?

    2. Use R to draw a random sample of size \(200\) from this population. Conduct EDA on your sample.

    3. Compute the bootstrap distribution for your sample mean, and note the bootstrap mean and standard error.

    4. Compare the bootstrap distribution to the theoretical sampling distribution by creating a table like Table 5.2.

    5. Repeat parts a-d for sample sizes of \(n = 50\) and \(n = 10\). Carefully describe your observations about the effects of sample size on the bootstrap distribution.


Your answers:

  1. Since the population is normal, the sampling distribution is \(\bar{X} \sim N(36, 0.566)\): normal, with mean \(\mu = 36\) and standard error \(\sigma/\sqrt{n} = 8/\sqrt{200} \approx 0.566\).

# Part b
set.seed(13)
Sam <- rnorm(200, 36, 8)  # random sample of size 200 from N(36, 8)
hist(Sam)

xbar <- mean(Sam)
xbar
[1] 35.87853
SD <- sd(Sam)
SD
[1] 8.17609
qqnorm(Sam)
qqline(Sam)

The sample is approximately normal, with mean 35.87853 and standard deviation 8.17609, close to the population values of 36 and 8.

B <- 10^4          # number of bootstrap resamples
bsm <- numeric(B)
for (i in 1:B) {
  bsm[i] <- mean(sample(Sam, 200, replace = TRUE))  # resample from the original sample
}
BSM <- mean(bsm)
BSM
[1] 35.87956
BSSE <- sd(bsm)
BSSE
[1] 0.5815149
Bias <- BSM - xbar
Bias
[1] 0.001037884
hist(bsm)

The bootstrap distribution is approximately normal, with a bootstrap mean of 35.87956, a bootstrap standard error of 0.5815149, and negligible bias (0.001).

df <- data.frame(Mean = c(36, 36, xbar, BSM), sd = c(8, 8/sqrt(200), SD, BSSE))
row.names(df) <- c("population", "Sampling", "Sample", "Bootstrap")
knitr::kable(df)

|           |     Mean|        sd|
|:----------|--------:|---------:|
|population | 36.00000| 8.0000000|
|Sampling   | 36.00000| 0.5656854|
|Sample     | 35.87853| 8.1760899|
|Bootstrap  | 35.87956| 0.5815149|

Parts a-d for \(n = 10\):

  1. \(\bar{X} \sim N(36, 2.5298)\), since \(8/\sqrt{10} \approx 2.5298\).

Samp <- rnorm(10, 36, 8)  # random sample of size 10
hist(Samp)

qqnorm(Samp)
qqline(Samp)

xbar <- mean(Samp)
xbar
[1] 39.0327
SD <- sd(Samp)
SD
[1] 9.960148
B <- 10^4
bsm <- numeric(B)
for (i in 1:B) {
  bsm[i] <- mean(sample(Samp, 10, replace = TRUE))  # resample Samp, the n = 10 sample
}
BSM <- mean(bsm)
BSM
[1] 35.87488
BSSE <- sd(bsm)
BSSE
[1] 2.560988
Bias <- BSM - xbar
Bias
[1] -3.15782
df <- data.frame(Mean = c(36, 36, xbar, BSM), sd = c(8, 8/sqrt(10), SD, BSSE))
row.names(df) <- c("population", "Sampling", "Sample", "Bootstrap")
knitr::kable(df)

|           |     Mean|        sd|
|:----------|--------:|---------:|
|population | 36.00000| 8.0000000|
|Sampling   | 36.00000| 2.5298221|
|Sample     | 39.03270| 9.9601483|
|Bootstrap  | 35.87488| 2.5609875|

Parts a-d for \(n = 50\):

  1. \(\bar{X} \sim N(36, 1.1314)\), since \(8/\sqrt{50} \approx 1.1314\).

Samp <- rnorm(50, 36, 8)  # random sample of size 50
hist(Samp)

qqnorm(Samp)
qqline(Samp)

xbar <- mean(Samp)
xbar
[1] 37.58786
SD <- sd(Samp)
SD
[1] 8.524557
set.seed(31)
B <- 10^4
bsm <- numeric(B)
for (i in 1:B) {
  bsm[i] <- mean(sample(Samp, 50, replace = TRUE))  # resample Samp, the n = 50 sample
}
BSM <- mean(bsm)
BSM
[1] 35.87243
BSSE <- sd(bsm)
BSSE
[1] 2.5992
Bias <- BSM - xbar
Bias
[1] -1.715429
df <- data.frame(Mean = c(36, 36, xbar, BSM), sd = c(8, 8/sqrt(50), SD, BSSE))
row.names(df) <- c("population", "Sampling", "Sample", "Bootstrap")
knitr::kable(df)

|           |     Mean|        sd|
|:----------|--------:|---------:|
|population | 36.00000| 8.0000000|
|Sampling   | 36.00000| 1.1313708|
|Sample     | 37.58786| 8.5245571|
|Bootstrap  | 35.87243| 2.5992004|

As the sample size increases, the bootstrap standard error shrinks in step with the theoretical standard error \(\sigma/\sqrt{n}\), and the bootstrap distribution becomes narrower, more symmetric, and more closely centered on the population mean.


  1. We investigate the bootstrap distribution of the median. Create random samples of size \(n\) for various \(n\) and bootstrap the median. Describe the bootstrap distribution. Change the sample sizes to 36 and 37; 200 and 201; 10,000 and 10,001. Note the similarities/dissimilarities, trends and so on. Why does the parity of the sample size matter?
library(ggplot2)  # needed for the plots below

set.seed(31)
ne <- 14 # n even
no <- 15 # n odd

wwe <- rnorm(ne) # draw random sample of size ne
wwo <- rnorm(no) # draw random sample of size no

N <- 10^4
even.boot <- numeric(N) # save space
odd.boot <- numeric(N)
for (i in 1:N)
{
  x.even <- sample(wwe, ne, replace = TRUE)
  x.odd <- sample(wwo, no, replace = TRUE)
  even.boot[i] <- median(x.even)
  odd.boot[i] <- median(x.odd)
}

Median <- c(even.boot, odd.boot)
Parity <- rep(c("n = 14", "n = 15"), each = N)
DF <- data.frame(Median = Median, Parity = Parity)

ggplot(data = DF, aes(x = Median)) + 
  geom_histogram(fill = "lightblue", color = "black") + 
  theme_bw() + 
  facet_grid(Parity ~.)
Figure 1: Histograms of bootstrapped median values


set.seed(31)
ne <- 36 # n even
no <- 37 # n odd

wwe <- rnorm(ne) # draw random sample of size ne
wwo <- rnorm(no) # draw random sample of size no

N <- 10^4
even.boot <- numeric(N) # save space
odd.boot <- numeric(N)
for (i in 1:N)
{
  x.even <- sample(wwe, ne, replace = TRUE)
  x.odd <- sample(wwo, no, replace = TRUE)
  even.boot[i] <- median(x.even)
  odd.boot[i] <- median(x.odd)
}

Median <- c(even.boot, odd.boot)
Parity <- rep(c("n = 36", "n = 37"), each = N)
DF <- data.frame(Median = Median, Parity = Parity)

ggplot(data = DF, aes(x = Median)) + 
  geom_histogram(fill = "lightblue", color = "black") + 
  theme_bw() + 
  facet_grid(Parity ~.)
Figure 2: Histograms of bootstrapped median values

set.seed(31)
ne <- 200 # n even
no <- 201 # n odd

wwe <- rnorm(ne) # draw random sample of size ne
wwo <- rnorm(no) # draw random sample of size no

N <- 10^4
even.boot <- numeric(N) # save space
odd.boot <- numeric(N)
for (i in 1:N)
{
  x.even <- sample(wwe, ne, replace = TRUE)
  x.odd <- sample(wwo, no, replace = TRUE)
  even.boot[i] <- median(x.even)
  odd.boot[i] <- median(x.odd)
}

Median <- c(even.boot, odd.boot)
Parity <- rep(c("n = 200", "n = 201"), each = N)
DF <- data.frame(Median = Median, Parity = Parity)

ggplot(data = DF, aes(x = Median)) + 
  geom_histogram(fill = "lightblue", color = "black") + 
  theme_bw() + 
  facet_grid(Parity ~.)
Figure 3: Histograms of bootstrapped median values

set.seed(31)
ne <- 10000 # n even
no <- 10001 # n odd

wwe <- rnorm(ne) # draw random sample of size ne
wwo <- rnorm(no) # draw random sample of size no

N <- 10^4
even.boot <- numeric(N) # save space
odd.boot <- numeric(N)
for (i in 1:N)
{
  x.even <- sample(wwe, ne, replace = TRUE)
  x.odd <- sample(wwo, no, replace = TRUE)
  even.boot[i] <- median(x.even)
  odd.boot[i] <- median(x.odd)
}

Median <- c(even.boot, odd.boot)
Parity <- rep(c("n = 10000", "n = 10001"), each = N)
DF <- data.frame(Median = Median, Parity = Parity)

ggplot(data = DF, aes(x = Median)) + 
  geom_histogram(fill = "lightblue", color = "black") + 
  theme_bw() + 
  facet_grid(Parity ~.)
Figure 4: Histograms of bootstrapped median values


Your answer: As \(n\) increases, the variance and standard deviation of the bootstrap distribution of the median decrease. Parity matters because for odd \(n\) the sample median is a single order statistic, so every bootstrap median must equal one of the observed data values and the histogram is spiky, with few distinct values; for even \(n\) the median is the average of the two middle order statistics, so many more values are possible and the distribution looks smoother. The contrast is striking for small \(n\) and fades as \(n\) grows, as the toy example below shows.
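A small illustration of the parity effect (the data values here are arbitrary, chosen only to make the point):

x <- c(1, 3, 6, 10, 15)  # n = 5 (odd): the median is the 3rd order statistic
median(x)                # 6 -- any resample's median must be one of the data values

y <- c(1, 3, 6, 10)      # n = 4 (even): the median averages the two middle values
median(y)                # 4.5 -- resample medians can fall between data values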


  1. Import the data from the data set Bangladesh. In addition to arsenic concentrations for 271 wells, the data set contains cobalt and chlorine concentrations.

    1. Conduct EDA on the chlorine concentrations and describe the salient features.

    2. Bootstrap the mean.

    3. Find and interpret the 95% bootstrap percentile confidence interval.

    4. What is the bootstrap estimate of the bias? What fraction of the bootstrap standard error does it represent?

Bangladesh <- read.csv("http://www1.appstate.edu/~arnholta/Data/Bangladesh.csv")
head(Bangladesh)
  Arsenic Chlorine Cobalt
1    2400      6.2   0.42
2       6    116.0   0.45
3     904     14.8   0.63
4     321     35.9   0.68
5    1280     18.9   0.58
6     151      7.8   0.35

The Chlorine variable has some missing values. The following code will remove these entries:

chlorine <- subset(Bangladesh, select = Chlorine, subset = !is.na(Chlorine), drop = TRUE)
library(dplyr)
data.frame(Chlorine = chlorine) %>%
  summarize(mean = mean(Chlorine), sd = sd(Chlorine), n = n())
      mean       sd   n
1 78.08401 210.0192 269

Your answers:

  1. The chlorine concentrations are strongly right-skewed, with a few extremely large values: the mean is 78.08 while the standard deviation, 210.02, is more than twice the mean.
data.frame(Chlorine = chlorine) %>%
  ggplot(aes(x = Chlorine)) +
  geom_histogram(color = "black", fill = "green") +
  labs(title = "Chlorine Levels in Bangladesh")

xbar <- mean(chlorine)
xbar
[1] 78.08401
s <- sd(chlorine)
s
[1] 210.0192

The sample mean is 78.08401 and the sample standard deviation is 210.0192. The distribution is strongly right-skewed rather than normal.

sims <- 10^4
bootstrap <- numeric(sims)

for (i in 1:sims) {
  bootstrap[i] <- mean(sample(chlorine, length(chlorine), replace = TRUE))
}

bootxbar <- mean(bootstrap)
bootxbar
[1] 77.96584

booterror <- sd(bootstrap)
booterror
[1] 12.71351

Bias <- bootxbar - mean(chlorine)  # bootstrap mean minus the observed sample mean
Bias
[1] -0.11817

Running the bootstrap gives a bootstrap mean of 77.96584 and a bootstrap standard error of 12.71351; the bootstrap mean is very close to the sample mean of 78.08.

quantile(bootstrap, prob = c(.025, .975))
     2.5%     97.5% 
 54.97361 104.82507 

We are 95% confident that the mean chlorine concentration in Bangladesh wells lies in the interval (54.97, 104.83).

  1. The bootstrap estimate of the bias is \(-0.118\) (bootstrap mean minus sample mean), which is only about 0.9% of the bootstrap standard error of 12.71 (\(0.118/12.71 \approx 0.009\)), so the bias is negligible.

  1. Consider the Bangladesh chlorine concentrations. Bootstrap the trimmed mean (say, trim the upper and lower 25%) and compare your results with the usual mean (previous exercise).

Your answer:

set.seed(13)
B <- 10^4
bsm <- numeric(B)
for (i in 1:B) {
  # resample the chlorine data and take the 25% trimmed mean of each resample
  bsm[i] <- mean(sample(chlorine, length(chlorine), replace = TRUE), trim = 0.25)
}
hist(bsm)

qqnorm(bsm)
qqline(bsm)

BSM <- mean(bsm)
BSM

Because the chlorine data are strongly right-skewed, the 25% trimmed mean ignores the extreme values: its bootstrap distribution is centered far below the ordinary mean (78.08) and has a much smaller standard error, making the trimmed mean a more resistant measure of center for these data.


  1. The data set FishMercury contains mercury levels (parts per million) for 30 fish caught in lakes in Minnesota.

    1. Create a histogram or boxplot of the data. What do you observe?

    2. Bootstrap the mean and record the bootstrap standard error and the 95% bootstrap percentile interval.

    3. Remove the outlier and bootstrap the mean of the remaining data. Record the bootstrap standard error and the 95% bootstrap percentile interval.

    4. What effect did removing the outlier have on the bootstrap distribution, in particular, the standard error?

FishMercury <- read.csv("http://www1.appstate.edu/~arnholta/Data/FishMercury.csv")
head(FishMercury)
  Mercury
1   1.870
2   0.160
3   0.088
4   0.160
5   0.145
6   0.099

Your answers:

hist(FishMercury$Mercury)

boxplot(FishMercury$Mercury)

Note that there is one value (1.87) very far removed from the rest of the values.

set.seed(13)
sims <- 10^4
bsm <- numeric(sims)
for (i in 1:sims) {
  bsm[i] <- mean(sample(FishMercury$Mercury, 30, replace = TRUE))
}
hist(bsm)

qqnorm(bsm)
qqline(bsm)

BSM <- mean(bsm)
BSM
[1] 0.1817277
BSS <- sd(bsm)
BSS
[1] 0.05742464
bias <- BSM-mean(FishMercury$Mercury)
bias
[1] -0.0001389333
quantile(bsm, prob=c(0.025,0.975))
     2.5%     97.5% 
0.1121667 0.3064675 
mean(bsm) - qnorm(0.975) * sd(bsm)  # normal-approximation interval, for comparison
[1] 0.06917751
mean(bsm) + qnorm(0.975) * sd(bsm)
[1] 0.2942779
y <- FishMercury$Mercury[FishMercury$Mercury < 1]  # drop the outlier (1.87), leaving 29 values
sims <- 10^4
bsm <- numeric(sims)
for (i in 1:sims) {
  bsm[i] <- mean(sample(y, length(y), replace = TRUE))
}
hist(bsm)

qqnorm(bsm)
qqline(bsm)

BSM <- mean(bsm)
BSM
[1] 0.1211786
BSS <- sd(bsm)
BSS 
[1] 0.00964194
bias <- BSM - mean(y)  # bias relative to the mean of the outlier-free data
bias
[1] -0.0024766
quantile(bsm,prob=c(0.025, 0.975))
     2.5%     97.5% 
0.1015172 0.1396905 
mean(bsm)-qnorm(0.975)*sd(bsm)
[1] 0.1022808
mean(bsm)+qnorm(0.975)*sd(bsm)
[1] 0.1400765
  1. Removing the outlier made the bootstrap distribution approximately normal and sharply reduced the bootstrap standard error (from about 0.057 to about 0.010), which in turn narrowed the 95% interval.

  1. In Section 3.3, we performed a permutation test to determine if men and women consumed, on average, different amounts of hot wings.

    1. Bootstrap the difference in means and describe the bootstrap distribution.

    2. Find a 95% bootstrap percentile confidence interval for the difference of means and give a sentence interpreting this interval.

    3. How do the bootstrap and permutation distribution differ?


BeerWings <- read.csv("http://www1.appstate.edu/~arnholta/Data/Beerwings.csv")
head(BeerWings)
  ID Hotwings Beer Gender
1  1        4   24      F
2  2        5    0      F
3  3        5   12      F
4  4        6   12      F
5  5        7   12      F
6  6        7   12      F

Your answers:

library(dplyr)
BW <- BeerWings %>%
  group_by(Gender) %>%
  summarize(mean = mean(Hotwings))
BW
# A tibble: 2 x 2
  Gender      mean
  <fctr>     <dbl>
1      F  9.333333
2      M 14.533333
wings.M <- BeerWings$Hotwings[BeerWings$Gender == "M"]
wings.F <- BeerWings$Hotwings[BeerWings$Gender == "F"]
observed <- mean(wings.M) - mean(wings.F)  # 14.533 - 9.333 = 5.2
sims <- 10^4
bsm <- numeric(sims)
for (i in 1:sims) {
  # resample each group separately, then take the difference in means
  bsm[i] <- mean(sample(wings.M, 15, replace = TRUE)) -
    mean(sample(wings.F, 15, replace = TRUE))
}

hist(bsm)

qqnorm(bsm)
qqline(bsm)

BSM <- mean(bsm)
bias <- BSM - observed                  # bootstrap estimate of bias
quantile(bsm, prob = c(0.025, 0.975))   # 95% bootstrap percentile interval

The bootstrap distribution is approximately normal and centered near the observed difference of 5.2 hot wings. We are 95% confident that the amount by which the mean number of hot wings eaten by males exceeds the mean for females lies within the percentile interval above.

  1. The permutation distribution is built under the null hypothesis that gender makes no difference, so it is centered at 0; the bootstrap distribution is centered at the observed difference of about 5.2. The permutation distribution is used to test a hypothesis, while the bootstrap distribution is used to estimate the parameter and attach a standard error and confidence interval to it.

  1. Import the data from Girls2004 (see Section 1.2).

    1. Perform some exploratory data analysis and obtain summary statistics on the weight of baby girls born in Wyoming and Arkansas (do separate analyses for each state).

    2. Bootstrap the difference in means, plot the distribution, and give the summary statistics. Obtain a 95% bootstrap percentile confidence interval and interpret this interval.

    3. What is the bootstrap estimate of the bias? What fraction of the bootstrap standard error does it represent?

    4. Conduct a permutation test on the difference in mean weights and state your conclusion.

    5. For what population(s), if any, does this conclusion hold? Explain.

Girls2004 <- read.csv("http://www1.appstate.edu/~arnholta/Data/Girls2004.csv")
head(Girls2004)
  ID State MothersAge Smoker Weight Gestation
1  1    WY      15-19     No   3085        40
2  2    WY      35-39     No   3515        39
3  3    WY      25-29     No   3775        40
4  4    WY      20-24     No   3265        39
5  5    WY      25-29     No   2970        40
6  6    WY      20-24     No   2850        38

Your answers:

# Your code here
# Part a.
# Your code here
# Your code here
# Your code here
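
Until the answers above are filled in, here is a minimal sketch of parts a-d. It assumes the State codes for Wyoming and Arkansas are "WY" and "AR" (verify with unique(Girls2004$State)); the seed and number of resamples are arbitrary choices.

library(dplyr)
# Part a: summary statistics by state
Girls2004 %>%
  group_by(State) %>%
  summarize(n = n(), mean = mean(Weight), sd = sd(Weight))

wy <- Girls2004$Weight[Girls2004$State == "WY"]
ar <- Girls2004$Weight[Girls2004$State == "AR"]  # "AR" is assumed; check the data

# Part b: bootstrap the difference in mean weights
set.seed(1)
N <- 10^4
boot.diff <- numeric(N)
for (i in 1:N) {
  boot.diff[i] <- mean(sample(wy, length(wy), replace = TRUE)) -
    mean(sample(ar, length(ar), replace = TRUE))
}
hist(boot.diff)
quantile(boot.diff, prob = c(0.025, 0.975))  # 95% percentile interval

# Part c: bootstrap estimate of bias and its fraction of the bootstrap SE
bias <- mean(boot.diff) - (mean(wy) - mean(ar))
bias / sd(boot.diff)

# Part d: permutation test of the difference in means
observed <- mean(wy) - mean(ar)
pooled <- c(wy, ar)
perm.diff <- numeric(N)
for (i in 1:N) {
  index <- sample(length(pooled), length(wy))  # randomly reassign state labels
  perm.diff[i] <- mean(pooled[index]) - mean(pooled[-index])
}
(sum(abs(perm.diff) >= abs(observed)) + 1) / (N + 1)  # two-sided permutation p-value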

  1. Do chocolate and vanilla ice creams have the same number of calories? The data set IceCream contains calorie information for a sample of brands of chocolate and vanilla ice cream. Use the bootstrap to determine whether or not there is a difference in the mean number of calories.
IceCream <- read.csv("http://www1.appstate.edu/~arnholta/Data/IceCream.csv")
head(IceCream)
           Brand VanillaCalories VanillaFat VanillaSugar ChocolateCalories
1 Baskin Robbins             260       16.0         26.0               260
2  Ben & Jerry's             240       16.0         19.0               260
3     Blue Bunny             140        7.0         12.0               130
4        Breyers             140        7.0         13.0               140
5      Brigham's             190       12.0         17.0               200
6          Bulla             234       13.5         21.8               266
  ChocolateFat ChocolateSugar
1           14           31.0
2           16           22.0
3            7           14.0
4            8           16.0
5           12           18.0
6           15           22.6

Your answer:

# Your code here
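
A minimal sketch of one reasonable approach: since each brand supplies both flavors, treat the data as paired and bootstrap the mean within-brand calorie difference (the seed and resample count are arbitrary choices).

# Within-brand difference in calories (paired data)
diffs <- IceCream$ChocolateCalories - IceCream$VanillaCalories

set.seed(1)
N <- 10^4
boot.mean <- numeric(N)
for (i in 1:N) {
  boot.mean[i] <- mean(sample(diffs, length(diffs), replace = TRUE))
}
hist(boot.mean)
quantile(boot.mean, prob = c(0.025, 0.975))  # 95% percentile interval

If 0 falls outside the percentile interval, the data support a difference in mean calories between chocolate and vanilla.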

  1. Import the data from the Flight Delays Case Study in Section 1.1 into R. Although the data are on all UA and AA flights flown in May and June of 2009, we will assume these represent a sample from a larger population of UA and AA flights flown under similar circumstances. We will consider the ratio of the means of the flight delay lengths, \(\mu_{\text{UA}} / \mu_{\text{AA}}\).

    1. Perform some exploratory data analysis on flight delay lengths for each of UA and AA flights.

    2. Bootstrap the mean of flight delay lengths for each airline separately and describe the distribution.

    3. Bootstrap the ratio of means. Provide plots of the bootstrap distribution and describe the distribution.

    4. Find the 95% bootstrap percentile interval for the ratio of means. Interpret this interval.

    5. What is the bootstrap estimate of the bias? What fraction of the bootstrap standard error does it represent?

    6. For inference in this text, we assume that the observations are independent. Is that condition met here? Explain.

FlightDelays <- read.csv("http://www1.appstate.edu/~arnholta/Data/FlightDelays.csv")

Your answers:

# Your code here
# Your code here
# Your code here
# Your code here
# Your code here
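
Until the answers above are filled in, here is a minimal sketch of parts b-e, assuming the case study's Carrier (UA/AA) and Delay columns; the seed and resample count are arbitrary choices.

ua <- FlightDelays$Delay[FlightDelays$Carrier == "UA"]
aa <- FlightDelays$Delay[FlightDelays$Carrier == "AA"]

set.seed(1)
N <- 10^4
boot.ua <- numeric(N)     # part b: bootstrap each airline's mean separately
boot.aa <- numeric(N)
boot.ratio <- numeric(N)  # part c: bootstrap the ratio of means
for (i in 1:N) {
  boot.ua[i] <- mean(sample(ua, length(ua), replace = TRUE))
  boot.aa[i] <- mean(sample(aa, length(aa), replace = TRUE))
  boot.ratio[i] <- boot.ua[i] / boot.aa[i]
}
hist(boot.ratio)
quantile(boot.ratio, prob = c(0.025, 0.975))   # part d: 95% percentile interval

bias <- mean(boot.ratio) - mean(ua) / mean(aa) # part e: bootstrap estimate of bias
bias / sd(boot.ratio)                          # as a fraction of the bootstrap SE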

  1. Two college students collected data on the price of hardcover textbooks from two disciplinary areas: Mathematics and the Natural Sciences, and the Social Sciences (Hien and Baker 2010). The data are in the file BookPrices.

    1. Perform some exploratory data analysis on book prices for each of the two disciplinary areas.

    2. Bootstrap the mean of the book price for each area separately and describe the distributions.

    3. Bootstrap the ratio of means. Provide plots of the bootstrap distribution and comment.

    4. Find the 95% bootstrap percentile interval for the ratio of means. Interpret this interval.

    5. What is the bootstrap estimate of the bias? What fraction of the bootstrap standard error does it represent?

BookPrices <- read.csv("http://www1.appstate.edu/~arnholta/Data/BookPrices.csv")

Your answers:

# Your code here
# Your code here
# Your code here
# Your code here
# Your code here
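
A minimal sketch, assuming the data set has Area and Price columns; both column names and the area labels below are assumptions to verify against head(BookPrices), and the seed and resample count are arbitrary choices.

head(BookPrices)  # check the actual column names and area labels first

math <- BookPrices$Price[BookPrices$Area == "Math & Science"]     # label assumed
social <- BookPrices$Price[BookPrices$Area == "Social Sciences"]  # label assumed

set.seed(1)
N <- 10^4
boot.ratio <- numeric(N)
for (i in 1:N) {
  # resample each area separately and take the ratio of the resampled means
  boot.ratio[i] <- mean(sample(math, length(math), replace = TRUE)) /
    mean(sample(social, length(social), replace = TRUE))
}
hist(boot.ratio)
quantile(boot.ratio, prob = c(0.025, 0.975))           # 95% percentile interval

bias <- mean(boot.ratio) - mean(math) / mean(social)   # bootstrap estimate of bias
bias / sd(boot.ratio)                                  # fraction of the bootstrap SE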