Bootstrapping

Here we use bootstrapping to estimate the mean of a given variable X. Say the population is X, made up of the following values:

X <- c(21,45,38,45,43,60,87,54,22,42,3,55,43,66,56,57,60,47,45,63,58,64,56,71,39,88,65,47,77,46,75,55,45,51,50,67,44,81,60,58,73,55,45,65,77,68,82,77,79,68,59,49,85,65,61,63)
length(X) # number of data points

## [1] 56

mean(X) # the true average of X

## [1] 57.5

X has 56 data points with a true mean of 57.5. Suppose a sample of size 20 is drawn from X. This is a small sample, but bootstrapping can help us estimate the true mean along with the standard error of that estimate. Let’s see how that works.

set.seed(124)
X.sample <- sample(X, 20, replace = FALSE) # take a random sample of size 20 from X without replacement
X.sample

##  [1] 43 56 47 64 55 56 46 39 77 66 67 60 51 44 47 38 71 68 65 45

mean(X.sample)

## [1] 55.25

The sample underestimates the true mean of 57.5: the sample mean is 55.25. Bootstrapping will give us the standard error, which we can use to construct a confidence interval around the mean estimate.

library(mosaic)

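# Draw one bootstrap resample (same size as x, with replacement) and return its mean
boo <- function(x){
  a <- sample(x, length(x), replace = TRUE)
  mean(a)
}
# Repeat the resampling 1000 times; do() collects the means in a column named result
my.boot <- do(1000) * {boo(X.sample)}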
head(my.boot)

##   result
## 1  57.70
## 2  59.00
## 3  57.20
## 4  61.10
## 5  51.00
## 6  51.95

length(my.boot$result)

## [1] 1000

s.error <- sd(my.boot$result)
s.error # bootstrap estimate of the standard error of the sample mean

## [1] 2.540572
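
As a sanity check, the bootstrap standard error can be compared with the textbook formula for the standard error of a mean, the sample standard deviation divided by the square root of the sample size. The sketch below is not part of the original analysis; it retypes the 20 sampled values shown earlier, and the names x_sample and analytic_se are made up for illustration. The two estimates should land close to each other, though not exactly equal.

```r
# Sketch: analytic standard error of the mean, s / sqrt(n), for comparison
# with the bootstrap estimate. x_sample retypes the sample printed earlier.
x_sample <- c(43, 56, 47, 64, 55, 56, 46, 39, 77, 66,
              67, 60, 51, 44, 47, 38, 71, 68, 65, 45)

n <- length(x_sample)                  # sample size (20)
analytic_se <- sd(x_sample) / sqrt(n)  # s / sqrt(n)
analytic_se
```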

It is important to look at the distribution of the bootstrap means:

hist(my.boot$result)

Because the distribution is approximately normal, we can construct the 95% confidence interval using 1.96 as the critical value.

upper.bound <- mean(X.sample) + 1.96 * s.error
lower.bound <- mean(X.sample) - 1.96 * s.error
upper.bound

## [1] 60.22952

lower.bound

## [1] 50.27048

The interval is (50.27, 60.23). This means we are 95% confident that the true mean lies between 50.27 and 60.23. In fact, the true mean is 57.5, so the interval does contain it.
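
When the histogram of bootstrap means does not look normal, a common alternative is the percentile interval, which takes the 2.5% and 97.5% quantiles of the bootstrap means directly instead of relying on 1.96. The following is a minimal sketch of that different technique in base R, without mosaic. The seed and the names x_sample, boot_means and percentile_ci are arbitrary choices for illustration, so the numbers will not match the mosaic run above exactly.

```r
# Sketch: percentile bootstrap interval in base R (a swapped-in alternative
# to the normal-approximation interval). x_sample retypes the sample above.
x_sample <- c(43, 56, 47, 64, 55, 56, 46, 39, 77, 66,
              67, 60, 51, 44, 47, 38, 71, 68, 65, 45)

set.seed(1)  # arbitrary seed, only to make the sketch reproducible

# 1000 bootstrap resamples: same size as the sample, drawn with replacement
boot_means <- replicate(
  1000,
  mean(sample(x_sample, length(x_sample), replace = TRUE))
)

# The middle 95% of the bootstrap means forms the percentile interval
percentile_ci <- quantile(boot_means, probs = c(0.025, 0.975))
percentile_ci
```

With a roughly symmetric bootstrap distribution like this one, the percentile interval and the 1.96-based interval should be similar; they tend to diverge when the distribution is skewed.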

J Mess