Homwork3

In collaboration with Bryce O’Connor and Sam Nurmi

Problem 2

No becuase the original dat will be skewwed and the bootstrap data will adjust accordiclgy so the stadard deviations will be diffrent
No, bootstapping is resampling with replacment of the original statement
No, becuase you use bootstarpping on small data sets so its more liely that you would want to generate larger resampling with many simulations such as 10000 resamplings
No, its from the sample not the population

Problem 3

mean of treatment = 59 mean of control = 43.25 59 - 43.25 = 15.75
I created a vector using the data then used the sample tool with no replacement

drp <- c(57, 61, 42, 62, 41, 28)
sample(drp, 1, replace = FALSE, prob= NULL)

## [1] 41

using this the frst two values I got would be used as the treatment results 61, 62
and the remaining values the control 57, 42, 41, 28

I just re-ran this code 20 times and recorded the sampling

  sim 1:
    treatment: 61, 62 
    control: 57, 42, 41, 28
    observed value = 61.5 - 42 = 19.5
    
  sim 2:
    treatment:61, 41
    control: 57, 42, 62, 28
    observed value = 51 - 47.25 = 3.77
    
  sim 3:
    treatment: 41, 61 
    control: 57, 42, 62, 28
    observed value = 51 - 47. 25 = 3.75
    
  sim 4:
    treatment: 41, 57
    control: 62, 42, 61, 28
    observed value = 49 - 48.25 = 0.75
    
  sim 5:
    treatment: 61, 41
    control: 57, 42, 62, 28
    observed value = 51 - 47.25 = 3.77 
    
  sim 6:
    treatment: 57, 61
    control: 62, 42, 41, 28
    observed value = 59 - 43.25 = 15.75
    
  sim 7:
    treatment: 42, 61
    control: 57, 62, 41, 28
    observed value = 51.5 - 47 = 4.5

  sim 8:
    treatment: 57, 28
    control: 61, 42, 41, 62
    observed value = 42.5 - 51.5 = -9
    
  Sim 9:
    treatment: 28, 41
    control: 57, 42, 61, 62
    observed value = 34.5 - 55.5 = -21
    
  sim 10:
    treatment: 41, 62
    control: 57, 42, 61, 28
    observed value = 51.5 - 47 = 4.5
    
  sim 11:
    treatment: 42, 28
    control: 57, 41, 61, 62
    observed value =  35 - 55.25 = -20.25
    
  sim 12:
    treatment: 28, 41
    control: 57, 42, 61, 62
    observed value = 34.5 - 55.5 = -21
    
  sim 13:
    treatment: 62, 28
    control: 57, 42, 61, 41
    observed value = 45 - 50.25 = -5.25
    
  sim 14:
    treatment: 28, 41
    control:  57, 42, 61, 62
    observed value = 34.5 - 55.5 = -21
    
  sim 15:
    treatment: 28, 57
    control: 41, 42, 61, 62
    observed value = 42.5 - 51.5 = -9
    
  sim 16:
    treatment: 41, 62
    control: 57, 42, 61, 28
    observed value = 51.5 - 47 = 4.5
    
  sim 17:
    treatment: 57, 28
    control: 41, 42, 61, 62
    observed value = 42.5 - 51.5 = -9
    
  sim 18:
    treatment: 57, 42
    control: 41, 28, 61, 62
    observed value = 49.5 - 48 = 1.5
    
  sim 19:
    treatment: 57, 42
    control: 41, 28, 61, 62
    observed value = 49.5 - 48 = 1.5
    
  sim 20:
    treatment: 57, 41
    control: 42, 28, 61, 62
    observed value = 49 - 48.25 = 0.75

then I made a histogram

## Warning: package 'tidyverse' was built under R version 3.6.2

## -- Attaching packages -------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## Warning: package 'ggplot2' was built under R version 3.6.2

## Warning: package 'tidyr' was built under R version 3.6.2

## Warning: package 'readr' was built under R version 3.6.2

## Warning: package 'purrr' was built under R version 3.6.2

## Warning: package 'dplyr' was built under R version 3.6.2

## Warning: package 'stringr' was built under R version 3.6.2

## Warning: package 'forcats' was built under R version 3.6.2

## -- Conflicts ----------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

vals <- c(-21, 1.5, 0.75, -9, -9, -9, -5.25, -21, 3.77, -20.25, 4.5, 15.75, 4.5, 19.5, -21, 0.75, 3.77, 4.5, 1.5, 3.75)
valmatrix <- matrix(vals, nrow=20)
valmatrix <- as.data.frame(valmatrix)

ggplot(valmatrix, aes(vals))+
  geom_histogram(bins = 20, color = 'black', fill = 'red')+
  theme_bw()+
  labs(title <- 'Run Simulation P-Values', x = 'Sim P-Value')

I found the estimated p-value based on the values above estimated p-value = 0.1
p-value = 2/15= 0.133 so our estimated value was not far off the actual p value

Problem 4

I created a box plot, it would not be reasonable to use a standard t procedure becuase the sample size is both small and the distribution is skewed.

trees = read.csv("nspines.csv", 
                 header=TRUE,
                 na.strings = "?")

boxplot1<-ggplot(trees, aes(x=factor(ns), y=dbh, fill=factor(ns)))+
  geom_boxplot()+
  theme_bw()+
  ggtitle("Side-by-Side Boxplot Plot of Tree location by DBH")

print(boxplot1)

I calculated the observed statistic

north = trees[1:30,2]
south = trees[31:60,2]

observedStat = mean(north)- mean(south)
observedStat

## [1] -10.83333

I ran a bootstrap on the data for 10000 simulations and created a historgram

bootStrapCI2<-function(data1, data2, nsim){
  
  n1<-length(data1)
  n2<-length(data2)
  
  bootCI2<-c()
  
  for(i in 1:nsim){
    bootSamp1<-sample(1:n1, n1, replace=TRUE)
    bootSamp2<-sample(1:n2, n2, replace=TRUE)
    thisXbar<-mean(data1[bootSamp1])-mean(data2[bootSamp2])
    bootCI2<-c(bootCI2, thisXbar)
  }
  
  return(bootCI2)
}

DBHBoot <- bootStrapCI2(north,south, nsim = 10000)
hist(DBHBoot)

I calculated the confidence interval using both methods

quantile(DBHBoot, c(0.025, 0.975))

##       2.5%      97.5% 
## -18.686750  -2.839917

se <- sd(DBHBoot)
mean(north) - mean(south) + c(-1,1) * qt(0.975, df=29) * se

## [1] -19.132659  -2.534008

The ditribution is close to normal, since the distribution is not really skewed the hybrid method should be effective.
The diffrence between the two confidence intervals is small however I prefer the quantile method since it does not have any assumptions about the distribution. I imagine this one can be used in most situations.

Homwork3

Dakota Barksdale

2/18/2020

Problem 2

Problem 3

Problem 4