Homework 3 Part 2

Problem 3 Part B

scores <- c(57, 61, 42, 62, 41, 28)
sample(scores, 2, replace = FALSE) #repeat 20 times
## [1] 28 42

I ran the sample function 20 times. Each run drew two values to form the treatment group for that resample, and the remaining four values formed the control group. The output above shows one resample; the remaining 19 resamples are recorded on my paper.
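The repeated draws described above could also be done in a single call. The following is only a sketch: the seed and the names `resamples` and `treat_idx` are my own assumptions, and it will not reproduce the 20 resamples recorded on paper.

```r
scores <- c(57, 61, 42, 62, 41, 28)
set.seed(2020)  # assumption: arbitrary seed, only so the sketch is reproducible

# Draw 20 resamples. Sampling positions (not values) makes the control
# group simply "everything that was not picked".
resamples <- replicate(20, {
  treat_idx <- sample(seq_along(scores), 2, replace = FALSE)
  list(treatment = scores[treat_idx], control = scores[-treat_idx])
}, simplify = FALSE)

resamples[[1]]  # first resample: two treatment scores, four control scores
```

Sampling indices rather than values keeps the treatment/control split correct even if two scores happened to be equal.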

Part C

ResampleValues <- c(-9,3.75,16.5,-10.5,-9,-20.25,15.75,1.5,16.5,-9,-10.5,15.75,4.5,15.75,4.5,-5.25,-21,10.25,-9,-21)
hist(ResampleValues)

For each resample, I calculated the difference in group means (xbar treatment - xbar control) and stored the results in the vector ResampleValues. I then plotted a histogram of these values.
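As a check on the hand calculation, the statistic for a single resample can be computed as below. The helper name `diff_in_means` is hypothetical; the example uses the resample shown in Part B (28 and 42 as the treatment group).

```r
# Difference in group means for one resample (treatment minus control)
diff_in_means <- function(treatment, control) {
  mean(treatment) - mean(control)
}

# Treatment = the two sampled scores, control = the remaining four
diff_in_means(c(28, 42), c(57, 61, 62, 41))
## [1] -20.25
```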

Problem 4 Part A

pines <- read.csv("nspines (3).csv", 
                header = TRUE)
library(tidyverse)
## ── Attaching packages ─────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
treeplot <- ggplot(pines, aes(x=factor(ns), y=dbh, fill=factor(ns)))+
  geom_boxplot()
print(treeplot)

I created side-by-side boxplots of dbh for the two regions (north and south). Each region has 30 observations, which meets the usual n >= 30 rule of thumb for using a t-procedure. Both distributions appear skewed: in the north the median sits closer to Q1 (suggesting right skew), and in the south the median sits closer to Q3 (suggesting left skew).
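The visual impression of skew can be checked numerically by comparing the distances from the median to each quartile. This is a rough sketch: it assumes the `north` and `south` vectors listed in Part C of Problem 4 hold the same dbh values as the data file, and `median_offset` is a hypothetical helper, not a standard skewness measure.

```r
north <- c(27.8,14.5,39.1,3.2,58.8,55.5,25,5.4,19,30.6,15.1,3.6,28.4,15,2.2,
           14.2,44.2,25.7,11.2,46.8,36.9,54.1,10.2,2.5,13.8,43.5,13.8,39.7,6.4,4.8)
south <- c(44.4,26.1,50.4,23.3,39.5,51,48.1,47.2,40.3,37.4,36.8,21.7,35.7,32,40.4,
           12.8,5.6,44.3,52.9,38,2.6,44.6,45.5,29.1,18.7,7,43.8,28.3,36.9,51.6)

# Positive: median sits nearer Q1 (longer upper half of the box, right skew).
# Negative: median sits nearer Q3 (longer lower half of the box, left skew).
median_offset <- function(x) {
  q <- quantile(x, c(0.25, 0.5, 0.75), names = FALSE)
  (q[3] - q[2]) - (q[2] - q[1])
}

c(north = median_offset(north), south = median_offset(south))
c(n_north = length(north), n_south = length(south))  # group sizes
```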

Part B

library(tidyverse)
pines_grouped = pines %>% group_by(ns) %>% summarize(mean=mean(dbh))
pines_grouped
## # A tibble: 2 x 2
##   ns     mean
##   <fct> <dbl>
## 1 n      23.7
## 2 s      34.5
nmean=23.7
smean=34.5

observed_statistic <- nmean - smean
observed_statistic
## [1] -10.8

The observed difference in means (north - south) is approximately -10.8.
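Typing the rounded means by hand loses a little precision. A sketch of computing the statistic directly from the data is shown here; it assumes the `north` and `south` vectors from Part C match the file, and `pines_like` is a hypothetical stand-in for the `pines` data frame.

```r
north <- c(27.8,14.5,39.1,3.2,58.8,55.5,25,5.4,19,30.6,15.1,3.6,28.4,15,2.2,
           14.2,44.2,25.7,11.2,46.8,36.9,54.1,10.2,2.5,13.8,43.5,13.8,39.7,6.4,4.8)
south <- c(44.4,26.1,50.4,23.3,39.5,51,48.1,47.2,40.3,37.4,36.8,21.7,35.7,32,40.4,
           12.8,5.6,44.3,52.9,38,2.6,44.6,45.5,29.1,18.7,7,43.8,28.3,36.9,51.6)

# Hypothetical stand-in for the pines data frame (columns ns and dbh)
pines_like <- data.frame(
  ns  = rep(c("n", "s"), times = c(length(north), length(south))),
  dbh = c(north, south)
)

# Group means as a named numeric vector, then north minus south
group_means <- vapply(split(pines_like$dbh, pines_like$ns), mean, numeric(1))
observed_statistic <- group_means[["n"]] - group_means[["s"]]
observed_statistic
## [1] -10.83333
```

The unrounded value agrees with the sample estimates that t.test reports in Part F (23.70000 and 34.53333).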

Part C

north <- c(27.8,14.5,39.1,3.2,58.8,55.5,25,5.4,19,30.6,15.1,3.6,28.4,15,2.2,14.2,44.2,25.7,11.2,46.8,36.9,54.1,10.2,2.5,13.8,43.5,13.8,39.7,6.4,4.8)
south <- c(44.4,26.1,50.4,23.3,39.5,51,48.1,47.2,40.3,37.4,36.8,21.7,35.7,32,40.4,12.8,5.6,44.3,52.9,38,2.6,44.6,45.5,29.1,18.7,7,43.8,28.3,36.9,51.6)

set.seed(1)
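# Bootstrap the difference in sample means (north - south).
# Returns a vector of nsim bootstrap differences.
bootStrapCI2<-function(north, south, nsim){
  n1<-length(north)
  n2<-length(south)
  
  bootCI2<-numeric(nsim)  # preallocate instead of growing the vector
  
  for(i in seq_len(nsim)){
    # resample indices with replacement, independently within each group
    bootSamp1<-sample(1:n1, n1, replace=TRUE)
    bootSamp2<-sample(1:n2, n2, replace=TRUE)
    bootCI2[i]<-mean(north[bootSamp1])-mean(south[bootSamp2])
  }
  
  return(bootCI2)
}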

bootpine<-bootStrapCI2(north, south, 10000)
hist(bootpine)

The histogram looks approximately symmetric. The mean of the bootstrap distribution is approximately -10.8.

Part D

#Quantile Method for Calculating a Bootstrap Confidence Interval
quantile(bootpine, c(0.025,0.975))
##       2.5%      97.5% 
## -18.773333  -2.552667
#Hybrid Method for Calculating a Bootstrap Confidence Interval
se<-sd(bootpine)
mean(north)-mean(south)+c(-1,1)*qt(0.975,df=59)*se
## [1] -19.052932  -2.613735

Part E

According to the histogram, the bootstrap distribution is approximately normal. The bootstrap estimate of the bias is the mean of the bootstrap distribution minus the observed statistic: -10.8 - (-10.8), which is approximately 0 at this precision. With an approximately normal distribution and a bias that is small relative to the standard error (se of about 4.1), the hybrid method can be used.
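A bootstrap bias estimate is conventionally the mean of the bootstrap distribution minus the observed statistic. The sketch below computes it together with the bias-to-se ratio. It uses `replicate` instead of the `bootStrapCI2` function from Part C, and the names `boot_diffs`, `boot_bias`, and `boot_se` are my own, so the exact numbers are not guaranteed to match the output reported in Part D.

```r
north <- c(27.8,14.5,39.1,3.2,58.8,55.5,25,5.4,19,30.6,15.1,3.6,28.4,15,2.2,
           14.2,44.2,25.7,11.2,46.8,36.9,54.1,10.2,2.5,13.8,43.5,13.8,39.7,6.4,4.8)
south <- c(44.4,26.1,50.4,23.3,39.5,51,48.1,47.2,40.3,37.4,36.8,21.7,35.7,32,40.4,
           12.8,5.6,44.3,52.9,38,2.6,44.6,45.5,29.1,18.7,7,43.8,28.3,36.9,51.6)

set.seed(1)
observed <- mean(north) - mean(south)

# 10000 bootstrap differences in means, resampling each group separately
boot_diffs <- replicate(
  10000,
  mean(sample(north, replace = TRUE)) - mean(sample(south, replace = TRUE))
)

boot_bias <- mean(boot_diffs) - observed  # bootstrap estimate of bias
boot_se   <- sd(boot_diffs)               # bootstrap standard error

c(bias = boot_bias, se = boot_se, ratio = abs(boot_bias) / boot_se)
```

A ratio close to zero supports treating the bias as negligible relative to the standard error.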

Part F

t.test(north, south, alternative="two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  north and south
## t = -2.6286, df = 55.725, p-value = 0.01106
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -19.090199  -2.576468
## sample estimates:
## mean of x mean of y 
##  23.70000  34.53333

The intervals from the three methods are very similar. With 30 observations in each group, the sample sizes are large enough for the t-based methods to be reasonable, so any of the three methods can be used to calculate the confidence interval for this data. I would prefer the bootstrap quantile method because it does not assume the data are normally distributed, and I believe it would give the most accurate interval here.
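To make the comparison concrete, the endpoints reported in Parts D and F can be placed side by side with their widths. This sketch only re-enters the numbers printed above; the matrix name `intervals` and its row labels are my own.

```r
# Interval endpoints copied from the output in Parts D and F
intervals <- rbind(
  quantile = c(-18.773333, -2.552667),
  hybrid   = c(-19.052932, -2.613735),
  welch_t  = c(-19.090199, -2.576468)
)
colnames(intervals) <- c("lower", "upper")

# Append the width of each interval for comparison
widths <- intervals[, "upper"] - intervals[, "lower"]
cbind(intervals, width = widths)
```

The three widths differ by less than 0.3, which is the sense in which the intervals are "very similar".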