In collaboration with Bryce O’Connor and Sam Nurmi

Problem 2

  1. No becuase the original dat will be skewwed and the bootstrap data will adjust accordiclgy so the stadard deviations will be diffrent
  2. No, bootstapping is resampling with replacment of the original statement
  3. No, becuase you use bootstarpping on small data sets so its more liely that you would want to generate larger resampling with many simulations such as 10000 resamplings
  4. No, its from the sample not the population

Problem 3

  1. mean of treatment = 59 mean of control = 43.25 59 - 43.25 = 15.75

  2. I created a vector using the data then used the sample tool with no replacement

drp <- c(57, 61, 42, 62, 41, 28)
sample(drp, 1, replace = FALSE, prob= NULL)
## [1] 41

using this the frst two values I got would be used as the treatment results 61, 62
and the remaining values the control 57, 42, 41, 28

  1. I just re-ran this code 20 times and recorded the sampling
  sim 1:
    treatment: 61, 62 
    control: 57, 42, 41, 28
    observed value = 61.5 - 42 = 19.5
    
  sim 2:
    treatment:61, 41
    control: 57, 42, 62, 28
    observed value = 51 - 47.25 = 3.77
    
  sim 3:
    treatment: 41, 61 
    control: 57, 42, 62, 28
    observed value = 51 - 47. 25 = 3.75
    
  sim 4:
    treatment: 41, 57
    control: 62, 42, 61, 28
    observed value = 49 - 48.25 = 0.75
    
  sim 5:
    treatment: 61, 41
    control: 57, 42, 62, 28
    observed value = 51 - 47.25 = 3.77 
    
  sim 6:
    treatment: 57, 61
    control: 62, 42, 41, 28
    observed value = 59 - 43.25 = 15.75
    
  sim 7:
    treatment: 42, 61
    control: 57, 62, 41, 28
    observed value = 51.5 - 47 = 4.5

  sim 8:
    treatment: 57, 28
    control: 61, 42, 41, 62
    observed value = 42.5 - 51.5 = -9
    
  Sim 9:
    treatment: 28, 41
    control: 57, 42, 61, 62
    observed value = 34.5 - 55.5 = -21
    
  sim 10:
    treatment: 41, 62
    control: 57, 42, 61, 28
    observed value = 51.5 - 47 = 4.5
    
  sim 11:
    treatment: 42, 28
    control: 57, 41, 61, 62
    observed value =  35 - 55.25 = -20.25
    
  sim 12:
    treatment: 28, 41
    control: 57, 42, 61, 62
    observed value = 34.5 - 55.5 = -21
    
  sim 13:
    treatment: 62, 28
    control: 57, 42, 61, 41
    observed value = 45 - 50.25 = -5.25
    
  sim 14:
    treatment: 28, 41
    control:  57, 42, 61, 62
    observed value = 34.5 - 55.5 = -21
    
  sim 15:
    treatment: 28, 57
    control: 41, 42, 61, 62
    observed value = 42.5 - 51.5 = -9
    
  sim 16:
    treatment: 41, 62
    control: 57, 42, 61, 28
    observed value = 51.5 - 47 = 4.5
    
  sim 17:
    treatment: 57, 28
    control: 41, 42, 61, 62
    observed value = 42.5 - 51.5 = -9
    
  sim 18:
    treatment: 57, 42
    control: 41, 28, 61, 62
    observed value = 49.5 - 48 = 1.5
    
  sim 19:
    treatment: 57, 42
    control: 41, 28, 61, 62
    observed value = 49.5 - 48 = 1.5
    
  sim 20:
    treatment: 57, 41
    control: 42, 28, 61, 62
    observed value = 49 - 48.25 = 0.75
    

then I made a histogram

## Warning: package 'tidyverse' was built under R version 3.6.2
## -- Attaching packages -------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## Warning: package 'ggplot2' was built under R version 3.6.2
## Warning: package 'tidyr' was built under R version 3.6.2
## Warning: package 'readr' was built under R version 3.6.2
## Warning: package 'purrr' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.2
## Warning: package 'stringr' was built under R version 3.6.2
## Warning: package 'forcats' was built under R version 3.6.2
## -- Conflicts ----------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
vals <- c(-21, 1.5, 0.75, -9, -9, -9, -5.25, -21, 3.77, -20.25, 4.5, 15.75, 4.5, 19.5, -21, 0.75, 3.77, 4.5, 1.5, 3.75)
valmatrix <- matrix(vals, nrow=20)
valmatrix <- as.data.frame(valmatrix)

ggplot(valmatrix, aes(vals))+
  geom_histogram(bins = 20, color = 'black', fill = 'red')+
  theme_bw()+
  labs(title <- 'Run Simulation P-Values', x = 'Sim P-Value')

  1. I found the estimated p-value based on the values above estimated p-value = 0.1

  2. p-value = 2/15= 0.133 so our estimated value was not far off the actual p value

Problem 4

  1. I created a box plot, it would not be reasonable to use a standard t procedure becuase the sample size is both small and the distribution is skewed.
trees = read.csv("nspines.csv", 
                 header=TRUE,
                 na.strings = "?")

boxplot1<-ggplot(trees, aes(x=factor(ns), y=dbh, fill=factor(ns)))+
  geom_boxplot()+
  theme_bw()+
  ggtitle("Side-by-Side Boxplot Plot of Tree location by DBH")

print(boxplot1)

  1. I calculated the observed statistic
north = trees[1:30,2]
south = trees[31:60,2]

observedStat = mean(north)- mean(south)
observedStat
## [1] -10.83333
  1. I ran a bootstrap on the data for 10000 simulations and created a historgram
bootStrapCI2<-function(data1, data2, nsim){
  
  n1<-length(data1)
  n2<-length(data2)
  
  bootCI2<-c()
  
  for(i in 1:nsim){
    bootSamp1<-sample(1:n1, n1, replace=TRUE)
    bootSamp2<-sample(1:n2, n2, replace=TRUE)
    thisXbar<-mean(data1[bootSamp1])-mean(data2[bootSamp2])
    bootCI2<-c(bootCI2, thisXbar)
  }
  
  return(bootCI2)
}

DBHBoot <- bootStrapCI2(north,south, nsim = 10000)
hist(DBHBoot)

  1. I calculated the confidence interval using both methods
quantile(DBHBoot, c(0.025, 0.975))
##       2.5%      97.5% 
## -18.686750  -2.839917
se <- sd(DBHBoot)
mean(north) - mean(south) + c(-1,1) * qt(0.975, df=29) * se
## [1] -19.132659  -2.534008
  1. The ditribution is close to normal, since the distribution is not really skewed the hybrid method should be effective.

  2. The diffrence between the two confidence intervals is small however I prefer the quantile method since it does not have any assumptions about the distribution. I imagine this one can be used in most situations.