Homework3

Part 2: CONTENT PROBLEMS

Problem 2 : What’s wrong?

The standard deviation of the bootstrap distribution will be approximately the same as the standard deviation of the original sample. This is not true: the bootstrap distribution is the distribution of a statistic (such as the mean) over many resamples, so its standard deviation estimates the standard error of that statistic. For the sample mean this is roughly s/√n, which is smaller than the sample standard deviation s.
The bootstrap distribution is created by resampling without replacement from the original sample. This is not true: bootstrap resamples are drawn with replacement. Resampling without replacement at the original sample size would reproduce the same sample every time.
When generating the resamples, it is best to use a sample size smaller than the size of the original sample. This is not true: each resample must be the same size as the original sample, so that the bootstrap distribution reflects the variability of the statistic at that sample size. (Professor Knudson, 2017; https://www.youtube.com/watch?v=CeLlJ5EngYY)
The bootstrap distribution is created by resampling with replacement from the population. This is not true: the resamples are drawn with replacement from the original sample, which stands in for the population. In practice we rarely have access to the whole population.
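
The first statement can be checked numerically. The following is a minimal sketch, assuming a hypothetical Normal sample of 50 values (the data, the seed, and the variable names are illustrative and not part of the assignment). It compares the standard deviation of the bootstrap distribution of the mean with the sample standard deviation and with s/√n.

```r
set.seed(42)                          # arbitrary seed, for reproducibility only
x <- rnorm(50, mean = 100, sd = 15)   # hypothetical sample, not homework data
n <- length(x)

# Bootstrap distribution of the sample mean: resample WITH replacement,
# each resample the same size as the original sample
boot_means <- replicate(5000, mean(sample(x, size = n, replace = TRUE)))

sd_sample  <- sd(x)                   # spread of the original observations
se_formula <- sd_sample / sqrt(n)     # usual standard error of the mean
sd_boot    <- sd(boot_means)          # spread of the bootstrap distribution

c(sample_sd = sd_sample, formula_se = se_formula, bootstrap_sd = sd_boot)
```

Under these assumptions the bootstrap standard deviation should land close to the formula standard error and well below the sample standard deviation.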

Problem 3 : A Small-sample permutation test

Do new “directed reading activities” improve the reading ability of elementary school students, as measured by their “Degree of Reading Power” (DRP) scores? A study assigns students at random to either the new method (treatment group, 21 students) or traditional teaching methods (control group, 23 students).

To illustrate the process, let’s perform a permutation test by hand for a small random subset of the DRP data. Here is the subset:

Treatment group: 57, 61
Control group: 42, 62, 41, 28

TreatG <- c(57, 61)
ContG <- c(42, 62, 41, 28)

Calculate the difference in means $\bar{x}_{\text{treatment}} - \bar{x}_{\text{control}}$ between the two groups. This is the observed value of the statistic.

mean(TreatG)-mean(ContG)

## [1] 15.75

Resample: Start with the six scores and choose an SRS of two scores to form the treatment group for the first resample. (Hint: You can do this using the sample function in R or you could even use a 6-sided die. Using either method, be sure to skip repeated digits.) A permutation resample is an SRS, without replacement. The remaining four scores are the control group. What is the difference in the group means for this sample?

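n1 <- length(TreatG)
n2 <- length(ContG)
# Pool all six scores. c() concatenates the vectors, whereas TreatG + ContG
# would add them elementwise (with recycling) and produce four sums.
sumG <- c(TreatG, ContG)

# Choose 2 of the 6 positions without replacement for the treatment group;
# the remaining 4 scores form the control group for this resample
treatIdx <- sample(n1 + n2, size = n1, replace = FALSE)
TreatGP <- sumG[treatIdx]
ContGP <- sumG[-treatIdx]
mean(TreatGP) - mean(ContGP)

The value depends on the random draw. For example, if 57 and 42 are drawn for the treatment group, the control group is 61, 62, 41, 28 and the difference in means is 49.5 - 48 = 1.5.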

Repeat part (b) 20 times to get 20 resamples and 20 values of the statistic. Make a histogram of the distribution of these 20 values. This is the permutation distribution for your resamples.

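# Generate the 20 resampled statistics in code. Each resample picks 2 of the
# 6 pooled scores as the treatment group and uses the other 4 as the control group.
# Every possible difference lies between -21 and 19.5.
allScores <- c(TreatG, ContG)
PermD <- replicate(20, {
  treatIdx <- sample(length(allScores), size = 2, replace = FALSE)
  mean(allScores[treatIdx]) - mean(allScores[-treatIdx])
})
hist(PermD)

What proportion of the 20 statistic values were equal to or greater than the original value in part (a)? You have just estimated the one-sided P-value for these 6 observations.

mean(PermD >= 15.75)

This proportion is the estimated one-sided P-value. It changes from run to run because it is based on only 20 random resamples.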

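For this small data set, there are only 15 (6 choose 2) possible permutations of the data. As a result, we can calculate the exact p-value by counting the number of permutations with a statistic value greater than or equal to the original value and then dividing by 15. What is the exact p-value here? How close was your estimate?

# Enumerate all 15 ways to choose the 2 treatment scores
allScores <- c(TreatG, ContG)
splits <- combn(length(allScores), 2)
permStats <- apply(splits, 2, function(idx) mean(allScores[idx]) - mean(allScores[-idx]))

# Exact one-sided p-value: share of splits with a difference of at least 15.75
sum(permStats >= 15.75) / 15

## [1] 0.2

Three of the 15 splits give a difference of at least 15.75: the treatment pairs (57, 61), (57, 62), and (61, 62), with differences 15.75, 16.5, and 19.5. The exact p-value is therefore 3/15 = 0.2, so these six observations alone provide no strong evidence that the new activities improve DRP scores. An estimate based on only 20 resamples has a standard error of about sqrt(0.2 × 0.8 / 20) ≈ 0.09, so it can easily differ from 0.2 by 0.1 or more.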
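
To see how the number of resamples affects the estimated P-value, the following sketch compares an estimate from 20 resamples, an estimate from 10,000 resamples, and the exact value from full enumeration. The helper `perm_stat`, the variable names, and the seed are hypothetical choices made for illustration; only the six scores come from the problem.

```r
scores <- c(57, 61, 42, 62, 41, 28)   # treatment scores first, then control
observed <- mean(scores[1:2]) - mean(scores[3:6])

# One permutation resample: pick 2 of the 6 positions (without replacement)
# as the treatment group; the other 4 form the control group
perm_stat <- function(scores) {
  idx <- sample(length(scores), size = 2, replace = FALSE)
  mean(scores[idx]) - mean(scores[-idx])
}

set.seed(1)   # arbitrary seed, for reproducibility only
p_small <- mean(replicate(20, perm_stat(scores)) >= observed)
p_large <- mean(replicate(10000, perm_stat(scores)) >= observed)

# Exact value: enumerate all choose(6, 2) = 15 splits
all_stats <- combn(length(scores), 2,
                   FUN = function(idx) mean(scores[idx]) - mean(scores[-idx]))
p_exact <- mean(all_stats >= observed)

c(p_20 = p_small, p_10000 = p_large, p_exact = p_exact)
```

The 10,000-resample estimate should sit very close to the exact value, while the 20-resample estimate can be noticeably off in either direction.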

Problem 4: Bootstrap Comparison of Tree Diameters

We want to compare the mean diameter at breast height (DBH) for trees from the northern and southern halves of a land tract using a random sample of 30 trees from each region. These data are available on the WISE site, named nspines.csv.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0

## ── Conflicts ────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(ggplot2)
library(readxl)
library(stringr)

nspines <- read_excel("nspines.xlsx")
str(nspines)

## Classes 'tbl_df', 'tbl' and 'data.frame':    60 obs. of  2 variables:
##  $ ns : chr  "n" "n" "n" "n" ...
##  $ dbh: num  27.8 14.5 39.1 3.2 58.8 55.5 25 5.4 19 30.6 ...

summary(nspines)

##       ns                 dbh       
##  Length:60          Min.   : 2.20  
##  Class :character   1st Qu.:14.10  
##  Mode  :character   Median :29.85  
##                     Mean   :29.12  
##                     3rd Qu.:43.90  
##                     Max.   :58.80

Use a side-by-side boxplot or faceted histograms to examine the data graphically (splitting by region). Does it appear reasonable to use standard t procedures? (Hint: Recall assumptions for t-procedures such as approximate normality and/or a sufficiently large sample size)

# Side-by-side boxplots of DBH by region (n = north, s = south)
ggplot(nspines, aes(x = ns, y = dbh, fill = ns)) +
  geom_boxplot()

Calculate our observed statistic $\bar{x}_{\text{North}} - \bar{x}_{\text{South}}$

xnorth <- nspines %>%
  filter(ns == "n")

Xnorth <- xnorth$dbh

xsouth <- nspines %>% 
  filter(ns == "s")

Xsouth <- xsouth$dbh



mean(xnorth$dbh)-mean(xsouth$dbh)

## [1] -10.83333

Bootstrap the difference in means $\bar{x}_{\text{North}} - \bar{x}_{\text{South}}$ (at least 1000 times) and look at the bootstrap distribution. (Include the histogram).

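nspinesBS <- function(Xnorth, Xsouth, nsim) {

  # Sample sizes come from the numeric vectors passed in. length() of the
  # data frames xnorth/xsouth would return the number of columns (2).
  n1 <- length(Xnorth)
  n2 <- length(Xsouth)

  # Preallocate the vector of bootstrap statistics
  BS <- numeric(nsim)

  for (i in 1:nsim) {
    # Resample each region separately, with replacement, at its original size
    bootSamp1 <- sample(1:n1, n1, replace = TRUE)
    bootSamp2 <- sample(1:n2, n2, replace = TRUE)
    BS[i] <- mean(Xnorth[bootSamp1]) - mean(Xsouth[bootSamp2])
  }

  return(BS)
}

set.seed(2020)  # any fixed seed makes the run reproducible
nspinesBootCI <- nspinesBS(Xnorth, Xsouth, nsim = 10000)

# Center of the bootstrap distribution; compare with the observed -10.83333
mean(nspinesBootCI)

hist(nspinesBootCI)

The bootstrap distribution should be centered near the observed difference of -10.83333. A one-sample t-test on the bootstrap replicates is not used here, because its interval describes the mean of the 10000 simulated values and shrinks as nsim grows; it is not a confidence interval for the difference in population means.

There are two ways we discussed calculating a bootstrap confidence interval, the quantile method and the hybrid method (this is also known as the “bootstrap t confidence interval” because it combines the standard error from the bootstrap with the critical value from the t). Calculate both types of confidence intervals.

# Quantile Method (95% interval: 2.5th and 97.5th percentiles)
quantile(nspinesBootCI, c(0.025, 0.975))

# Hybrid Method
# The bootstrap standard error is the SD of the bootstrap distribution itself;
# it is not divided by sqrt(n)
bootSE <- sd(nspinesBootCI)
bootSE

# t critical value with df = min(n1, n2) - 1, a conservative choice
tCrit <- qt(0.975, df = min(length(Xnorth), length(Xsouth)) - 1)
(mean(Xnorth) - mean(Xsouth)) + c(-1, 1) * tCrit * bootSE

Because the resamples are random, the printed values vary slightly from run to run. As a sanity check, the bootstrap standard error should be close to the standard error implied by the Welch t-test on the same data, about 10.83/2.63 ≈ 4.1.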

If the bootstrap distribution is approximately Normal and the bias is small, we can use the hybrid method (“bootstrap t confidence interval”); however, it is vulnerable to departure from Normality and large bias. Comment on whether the conditions for the hybrid method (“bootstrap t confidence interval”) are met. Do you believe this interval would be reliable? Note: The bootstrap estimate of the bias is the difference between the mean of the bootstrap distribution and the statistic from the original data. This means that we want the bootstrap distribution to be centered around the observed statistic.

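The conditions are checked on the bootstrap distribution itself. First, the histogram (or a Normal quantile plot) of the bootstrap differences should look approximately Normal. Second, the bias, estimated as the mean of the bootstrap distribution minus the observed difference of -10.83333, should be small relative to the bootstrap standard error. With 30 trees from each region, the bootstrap distribution of a difference in means is usually close to Normal and nearly centered on the observed statistic, so the hybrid interval is likely to be reliable here, provided the histogram shows no strong skew.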
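
The bias and Normality conditions described in this part can be checked with a few lines of code. The sketch below is illustrative only: `north` and `south` are simulated stand-ins (an assumption, since the nspines data are not reproduced here), and the seed and variable names are arbitrary.

```r
set.seed(7)                               # arbitrary seed
north <- rnorm(30, mean = 24, sd = 17)    # hypothetical stand-in for northern DBH
south <- rnorm(30, mean = 35, sd = 15)    # hypothetical stand-in for southern DBH

observed <- mean(north) - mean(south)

# Bootstrap the difference in means, resampling each group with replacement
boot_diff <- replicate(5000, {
  mean(sample(north, replace = TRUE)) - mean(sample(south, replace = TRUE))
})

boot_bias <- mean(boot_diff) - observed   # bootstrap estimate of bias
boot_se   <- sd(boot_diff)                # bootstrap standard error

# Bias as a fraction of the standard error; values near 0 are reassuring
bias_ratio <- boot_bias / boot_se

# Normal quantile plot: points close to the line suggest approximate Normality
qqnorm(boot_diff)
qqline(boot_diff)

c(bias = boot_bias, se = boot_se, bias_ratio = bias_ratio)
```

With the real data, `north` and `south` would be replaced by `Xnorth` and `Xsouth`; a bias ratio near zero and a roughly straight quantile plot would support using the hybrid interval.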

Compare the bootstrap results with the usual two-sample t confidence interval. How do the intervals differ? Which would you use?

t.test(Xnorth, Xsouth)

## 
##  Welch Two Sample t-test
## 
## data:  Xnorth and Xsouth
## t = -2.6286, df = 55.725, p-value = 0.01106
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -19.090199  -2.576468
## sample estimates:
## mean of x mean of y 
##  23.70000  34.53333

23.70000-34.53333

## [1] -10.83333

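The Welch interval, (-19.09, -2.58), is centered on the observed difference of -10.83 and implies a standard error of about 10.83/2.63 ≈ 4.1. Bootstrap intervals built from resamples of all 30 trees per region should have a similar center and width, because the bootstrap standard error estimates the same quantity. When the bootstrap distribution is approximately Normal with small bias, the intervals agree and the two-sample t interval is the simpler choice. If the bootstrap distribution is clearly skewed, the quantile interval is preferable because it does not assume Normality.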
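
One way to make this comparison concrete is to put the two standard errors and the two intervals side by side. The sketch below again uses simulated stand-in data (an assumption, with arbitrary seed and names) and the Welch standard error formula sqrt(s1^2/n1 + s2^2/n2).

```r
set.seed(11)                              # arbitrary seed
north <- rnorm(30, mean = 24, sd = 17)    # hypothetical stand-in for northern DBH
south <- rnorm(30, mean = 35, sd = 15)    # hypothetical stand-in for southern DBH

# Welch standard error of the difference in means
welch_se <- sqrt(var(north) / length(north) + var(south) / length(south))

# Bootstrap standard error: SD of the bootstrap distribution of the difference
boot_diff <- replicate(5000,
  mean(sample(north, replace = TRUE)) - mean(sample(south, replace = TRUE)))
boot_se <- sd(boot_diff)

# 95% intervals from each approach
welch_ci    <- as.numeric(t.test(north, south)$conf.int)
quantile_ci <- as.numeric(quantile(boot_diff, c(0.025, 0.975)))

rbind(welch = welch_ci, bootstrap_quantile = quantile_ci)
c(welch_se = welch_se, boot_se = boot_se)
```

For roughly Normal data of this size, the two standard errors are expected to differ by only a few percent, which is why the intervals end up with similar widths.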

Takuma Mimura

2/13/2020

Part 1 : PACKET

Problem 1: TURN IN YOUR PACKET
