#Problem 2:
  
#Explain what is wrong with each of the following statements.
#a) The standard deviation of the bootstrap distribution will be approximately the same as the standard deviation of the original sample.

#The standard deviation of the bootstrap distribution will not be the same as the standard deviation of the original sample. The bootstrap distribution is a distribution of resample means, so its standard deviation estimates the standard error of the sample mean, roughly s/sqrt(n), which is much smaller than the sample standard deviation s. The bootstrap uses the same n and the same values as the sample, but the statistic it tracks (the mean of each resample) varies far less than the individual observations do.
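
#a minimal sketch with made-up numbers (the vector x below is hypothetical)
#showing that the sd of the bootstrap means tracks s/sqrt(n), not s:
x <- c(12, 15, 9, 22, 18, 14, 11, 20)
bootMeans <- replicate(2000, mean(sample(x, length(x), replace=TRUE)))
sd(x)                   #sample standard deviation s
sd(x)/sqrt(length(x))   #plug-in standard error of the mean
sd(bootMeans)           #close to s/sqrt(n), far smaller than s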
  
#b) The bootstrap distribution is created by resampling without replacement from the original sample.

#The statement has it backwards: the bootstrap resamples WITH replacement, not without. Each value picked from the original sample is added back into the pool, so picking one value does not change the probability of picking the others. Say you had 10 cards in a hat and you chose one: you would write down the response and then put the card back into the hat, so every draw has the same probabilities as the first. Resampling without replacement would simply reshuffle the original sample.
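
#a minimal sketch of the hat example (the numbers stand in for the 10 cards):
hat <- 1:10
sample(hat, 10, replace=TRUE)   #a bootstrap-style resample: values can repeat
sample(hat, 10, replace=FALSE)  #without replacement: just a reshuffle of the hat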
  
#c) When generating the resamples, it is best to use a sample size smaller than the size of the original sample.

#It is not correct to use a smaller (or larger) sample size when resampling. Each resample should have the same size n as the original sample, because the spread of the bootstrap distribution depends on n: resamples of a different size would misstate the standard error of the statistic. Many resamples of size n make the bootstrap distribution representative of what repeated samples of size n would look like.
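
#a minimal sketch with hypothetical data: resampling with a smaller n than
#the original sample overstates the variability of the mean
x <- c(12, 15, 9, 22, 18, 14, 11, 20)
bootFull  <- replicate(2000, mean(sample(x, length(x), replace=TRUE)))
bootSmall <- replicate(2000, mean(sample(x, 4, replace=TRUE)))
sd(bootFull)   #approximates the standard error of a mean of n = 8
sd(bootSmall)  #noticeably larger: the standard error of a mean of only 4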

#d) The bootstrap distribution is created by resampling with replacement from the population.

#Bootstrapping resamples from the sample, not from the population. Population parameters are fixed and unknown, and we do not have access to the full population; the whole point of the bootstrap is to treat the sample as a stand-in for the population and resample from it.

#Problem 3

Xbartreatment<-mean(c(57,61))
Xbarcontrol<-mean(c(42,62,41,28))
Xbartreatment-Xbarcontrol
## [1] 15.75
#difference of 15.75
resamplegroup<-c(57,62,41,28,42,61)
sample(resamplegroup, 2)
## [1] 42 28
newtreatment<-mean(c(42,28))
newcontrol<-mean(c(57,62,41,61))
newtreatment-newcontrol
## [1] -20.25
#difference of -20.25 for this resample. Note that mean() takes a single
#vector: mean(42,28) would silently treat 28 as the trim argument and
#return 42, so the values must be wrapped in c().


n1<-2
n2<-4
diff<-c()
for(i in 1:20){
  #choose n1 of the 6 indices without replacement for the "treatment" group;
  #the remaining n2 indices act as the "control" group
  permsamp<-sample(1:(n1+n2), n1, replace=FALSE)
  thisxbar<-mean(resamplegroup[permsamp])-mean(resamplegroup[-permsamp])
  diff<-c(diff, thisxbar)
}
hist(diff)

mean(diff>=15.75)
## [1] 0.2
#the proportion of the 20 permutation differences at or above the observed
#value of 15.75 was 0.2

#calculating all 15 possible treatment assignments to get the exact p-value

mean(c(57,62))-mean(c(41,28,42,61))
## [1] 16.5
mean(c(57,41))-mean(c(62,28,42,61))
## [1] 0.75
mean(c(57,28))-mean(c(62,41,42,61))
## [1] -9
mean(c(57,42))-mean(c(62,41,28,61))
## [1] 1.5
mean(c(57,61))-mean(c(62,41,42,28))
## [1] 15.75
mean(c(62,41))-mean(c(57,28,42,61))
## [1] 4.5
mean(c(62,28))-mean(c(57,41,42,61))
## [1] -5.25
mean(c(62,42))-mean(c(57,41,28,61))
## [1] 5.25
mean(c(62,61))-mean(c(57,41,28,42))
## [1] 19.5
mean(c(41,28))-mean(c(57,62,42,61))
## [1] -21
mean(c(41,42))-mean(c(57,62,28,61))
## [1] -10.5
mean(c(41,61))-mean(c(57,62,28,42))
## [1] 3.75
mean(c(28,42))-mean(c(57,62,41,61))
## [1] -20.25
mean(c(28,61))-mean(c(57,62,41,42))
## [1] -6
mean(c(42,61))-mean(c(57,62,41,28))
## [1] 4.5
3/15
## [1] 0.2
#0.2 is the exact p-value for a difference at or above 15.75: 3 of the 15
#equally likely treatment assignments (16.5, 15.75, and 19.5) give a
#difference at least that large
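
#the same enumeration can be done programmatically as a check on the 15 hand
#computations above: combn() lists every way to put 2 of the 6 values into
#the treatment group
idx <- combn(6, 2)
alldiffs <- apply(idx, 2, function(j) mean(resamplegroup[j])-mean(resamplegroup[-j]))
mean(alldiffs >= 15.75)
## [1] 0.2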

#Problem 4

call<-read.csv("calls80.csv", header=TRUE)
hist(call$length)

#bootstrap for 1000 resamples

bootstrapCI1 <- function(data, nsim){
  n <- length(data)
  bootCI <- c()
  for(i in 1:nsim){
    bootSamp <- sample(1:n, n, replace=TRUE)  #indices for one resample, drawn with replacement
    thisXbar <- mean(data[bootSamp])          #mean of that resample
    bootCI <- c(bootCI, thisXbar)             #collect the bootstrap means
  }
  return(bootCI)
}

callBootCI <- bootstrapCI1(call$length, nsim=1000)

hist(callBootCI, main="Call lengths")

#with a sample size of 80 the bootstrap distribution is roughly normal, although the histogram shows a slight skew to the right.

qqnorm(callBootCI, main="Call Lengths")
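#a reference line makes departures from normality easier to judge
qqline(callBootCI)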

#the tails of the normal quantile plot confirm a slight right skew: the upper tail pulls away from the line and becomes more dispersed, with a few outliers. This matches the slight right skew seen in the histogram.

SRSCall<-c(104,102, 35, 211, 56, 325, 67, 9, 179, 59)

#the resampling logic is identical to bootstrapCI1 above, so that function can
#be reused directly
SRSBootCI<-bootstrapCI1(SRSCall, nsim=1000)

hist(SRSBootCI, main="SRS Call Lengths")

qqnorm(SRSBootCI, main="SRS Call Lengths")
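#reference line, as above
qqline(SRSBootCI)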

#Standard errors

#the standard deviation of the bootstrap distribution of means is itself the
#estimated standard error of the sample mean (it already plays the role of
#s/sqrt(n)), so it should not be divided by sqrt(n) again
SE1<-sd(callBootCI)
SE1
SE2<-sd(SRSBootCI)
SE2
#other things being equal, the standard error is larger for a smaller sample
#size: the standard error estimates s/sqrt(n), so with fewer observations the
#sample mean varies more from resample to resample and the bootstrap
#distribution is wider. Here the two samples also differ in spread, so the
#observed comparison reflects both s and n.
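
#sanity check: each bootstrap SE should be close to the plug-in formula
#s/sqrt(n) computed from its own sample
sd(call$length)/sqrt(80)
sd(SRSCall)/sqrt(10)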


#Problem 5

spines <- read.csv("nspines.csv", header = TRUE)

boxplot(spines$dbh ~ spines$ns, xlab = "Region", ylab = "DBH")

#it looks as if we should not use the usual t procedures because the data do not reflect normality, and the sample sizes of n=30 per region are not sufficiently large to rely on the central limit theorem.
#after viewing the spines data we can see that the north trees are rows 1:30 and the south trees are rows 31:60

spinesnorth<-(spines$dbh[1:30])
spinessouth<-(spines$dbh[31:60])
meannorth<-mean(spinesnorth)
meansouth<-mean(spinessouth)

test_stat<-meannorth-meansouth
test_stat
## [1] -10.83333
#observed difference of -10.833

bootstrapCI2 <- function(data1, data2, nsim){
  n1 <- length(data1)
  n2 <- length(data2)

  bootCI2 <- c()

  for(i in 1:nsim){
    bootSamp1 <- sample(1:n1, n1, replace=TRUE)  #resample each group separately,
    bootSamp2 <- sample(1:n2, n2, replace=TRUE)  #with replacement
    thisXbar <- mean(data1[bootSamp1])-mean(data2[bootSamp2])  #difference in resample means
    bootCI2 <- c(bootCI2, thisXbar)
  }

  return(bootCI2)
}

spinesBootCI2 <- bootstrapCI2(spinesnorth, spinessouth, nsim=20000)
hist(spinesBootCI2)

#quantile method
quantile(spinesBootCI2, c(0.025,.975))
##      2.5%     97.5% 
## -18.63675  -2.75325
#Hybrid method
bootSE<-sd(spinesBootCI2)
(mean(spinesnorth)-mean(spinessouth))+c(-1,1)*qt(0.975, df=60)*bootSE
## [1] -18.981866  -2.684801
#I would not use the hybrid method because it assumes the bootstrap distribution is approximately normal, and the initial sample data were skewed in both the north and south groups. Moreover, n=30 is a relatively small sample size, even though it is often treated as the bare minimum.
#although this interval might not be wildly inaccurate, the quantile method provides a more trustworthy measure here, as reflected in its tighter confidence interval.

#usual t confidence interval: Welch's two sample

t.test(spinesnorth,spinessouth)
## 
##  Welch Two Sample t-test
## 
## data:  spinesnorth and spinessouth
## t = -2.6286, df = 55.725, p-value = 0.01106
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -19.090199  -2.576468
## sample estimates:
## mean of x mean of y 
##  23.70000  34.53333
#95 percent CI: -19.09, -2.576

#the traditional t-test has the widest confidence interval, which reflects the non-normality of our samples and introduces error into the estimate.
#the t-test is the least accurate method for skewed data, while the hybrid method comes closer, but is still not ideal.
#the best of these methods is the quantile method: it has the narrowest confidence interval, which reflects the accuracy of the simulation, and a histogram of the bootstrap differences shows that their distribution is close to normal.
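
#a quick side-by-side of the three interval widths (smaller width = tighter interval):
diff(quantile(spinesBootCI2, c(0.025, 0.975)))   #quantile method
2*qt(0.975, df=60)*bootSE                        #hybrid method
diff(t.test(spinesnorth, spinessouth)$conf.int)  #Welch two-sample t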