A function for applying the Bootstrap Method in Machine Learning

Introduction

The bootstrap is a resampling method introduced in 1979 by Bradley Efron of Stanford’s Department of Statistics. It changed statistical research because it addresses many problems that classical statistical methods cannot solve.

The bootstrap has many uses, the most common being to assess the quality of an estimate. Traditionally, people rely on p-values to judge an estimate, an approach that has quite a few weaknesses. The bootstrap is said to be more effective than p-values for this purpose.

To illustrate, we will apply the bootstrap to the Portfolio data, which holds the returns of two assets, X and Y. The typical portfolio-management problem is to find the proportion to invest in each of the two assets so that the investment risk is minimal, following Harry Markowitz’s portfolio theory.
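For context, the risk-minimizing split between two assets has a closed form. The sketch below assumes a fraction alpha is invested in X and 1 - alpha in Y. The helper name min_var_weight and the simulated returns are illustrative assumptions and are not part of the original analysis.

```r
# Hypothetical helper: fraction of capital to put in X so that
# var(alpha * X + (1 - alpha) * Y) is as small as possible.
min_var_weight <- function(x, y) {
  (var(y) - cov(x, y)) / (var(x) + var(y) - 2 * cov(x, y))
}

# Simulated returns (an assumption, used only to keep the sketch self-contained)
set.seed(1)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)

alpha <- min_var_weight(x, y)

# Variance of the combined position for a given weight a
portfolio_var <- function(a) var(a * x + (1 - a) * y)
portfolio_var(alpha)
```

The weight depends on the covariance of the two assets, which is why the accuracy of the estimated correlation matters for this problem.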

Once the problem is set, the correlation between the two assets has to be calculated. But an estimate of the correlation coefficient based on a single, limited sample says little about its own accuracy. With the bootstrap, the estimate is recomputed on many resampled data sets in order to gauge how accurate the estimated correlation is.

# Load data
library(ISLR)
data("Portfolio")
head(Portfolio)

##            X          Y
## 1 -0.8952509 -0.2349235
## 2 -1.5624543 -0.8851760
## 3 -0.4170899  0.2718880
## 4  1.0443557 -0.7341975
## 5 -0.3155684  0.8419834
## 6 -1.7371238 -2.0371910

# Based on the given sample, calculate the correlation value
cor(Portfolio$X, Portfolio$Y)

## [1] 0.5154676

Note that 0.5154676 is only the correlation for this one specific sample. Each resample drawn from the original data will give a different correlation value. We also want to know more about this estimate, such as its distribution or its standard deviation (i.e., how precise it is).

The function below will solve this problem with three parameters:

The original data,
The number of resampling repetitions N,
The random seed sed.

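# Create function
correlation_bootstrap <- function(Portfolio, N, sed){
  corr <- numeric(N)   # one slot per bootstrap replicate
  set.seed(sed)        # make the resampling reproducible
  n <- nrow(Portfolio)
  for (i in seq_len(N)){
    # Draw n rows with replacement and recompute the correlation
    u <- Portfolio[sample(n, n, replace = TRUE), ]
    corr[i] <- cor(u$X, u$Y)
  }
  return(corr)
}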

# Apply function with N = 10000, sed = 24
u <- correlation_bootstrap(Portfolio, 10000, 24)

# See the first 6 obs 
head(u)

## [1] 0.5764427 0.4654958 0.5697109 0.4806435 0.5321107 0.4930609
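The same resampling loop can also be written with replicate(). This is a minimal sketch under stated assumptions: the name correlation_bootstrap_rep and the simulated data frame dat are illustrative stand-ins, not part of the original code.

```r
# Simulated stand-in for the Portfolio data (an assumption for this sketch)
set.seed(1)
x <- rnorm(100)
dat <- data.frame(X = x, Y = 0.5 * x + rnorm(100))

# Hypothetical variant of correlation_bootstrap() built on replicate()
correlation_bootstrap_rep <- function(data, N, sed) {
  set.seed(sed)
  n <- nrow(data)
  replicate(N, {
    idx <- sample(n, n, replace = TRUE)  # row indices drawn with replacement
    cor(data$X[idx], data$Y[idx])
  })
}

corr_rep <- correlation_bootstrap_rep(dat, 1000, 24)
length(corr_rep)
```

Because both versions draw the row indices in the same order, they should return the same values for the same seed.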

# Show the distribution of the bootstrap correlations in a histogram
library(magrittr)  # provides %>%
library(ggplot2)

data.frame(Correlation = u) %>% 
  ggplot(aes(Correlation)) + 
  geom_histogram(fill = 'red', alpha = 0.3) +
  geom_vline(xintercept = mean(u), color = "blue") +
  theme_minimal()

The blue line marks the mean correlation over the 10000 bootstrap resamples. The histogram suggests that the correlation coefficient is approximately normally distributed. The standard deviation is quite small:

u %>% sd()

## [1] 0.06588243

u %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2756  0.4743  0.5188  0.5167  0.5625  0.7458

Across the 10000 bootstrap resamples, the smallest correlation value is approximately 0.2756 and the largest is approximately 0.7458, while the standard deviation is small (around 0.0659).
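The bootstrap replicates can also be summarized as an interval. The sketch below computes a 95% percentile interval with quantile(). It runs on simulated data (an assumption, so that it does not depend on the objects above), and the names dat, boot_corr and ci are illustrative.

```r
# Simulated stand-in for the Portfolio data (an assumption for this sketch)
set.seed(1)
x <- rnorm(100)
dat <- data.frame(X = x, Y = 0.5 * x + rnorm(100))

# Bootstrap replicates of the correlation
n <- nrow(dat)
boot_corr <- replicate(2000, {
  idx <- sample(n, n, replace = TRUE)
  cor(dat$X[idx], dat$Y[idx])
})

# 95% percentile interval: the 2.5% and 97.5% quantiles of the replicates
ci <- quantile(boot_corr, probs = c(0.025, 0.975))
ci
```

With the vector u from above, the same call would be quantile(u, probs = c(0.025, 0.975)).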

R also supports the bootstrap without a hand-written for loop, through the boot() function. Instead of writing the whole resampling routine yourself, you write only a function that computes the statistic of interest and then pass it to boot(). Statisticians who would rather not write much code will find this flexible and helpful.

# Statistic function for boot(): correlation of X and Y on the rows in index
corr_a <- function(Portfolio, index){
  X <- Portfolio$X[index]
  Y <- Portfolio$Y[index]
  return(cor(X,Y))
}

# Check the function on rows 1 to 20 of X and Y
corr_a(Portfolio, 1:20)

## [1] 0.280034

# Apply boot()
library(boot)
set.seed(30) 
correlation <- boot(Portfolio, corr_a, R = 10000)

correlation

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Portfolio, statistic = corr_a, R = 10000)
## 
## 
## Bootstrap Statistics :
##      original       bias    std. error
## t1* 0.5154676 -0.001193204   0.0662034
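The three columns of this summary can be recomputed from the returned object, whose t0 component holds the statistic on the full sample and whose t component holds the replicates. The sketch below is an illustration on simulated data (an assumption), with the hypothetical names dat, corr_stat and b.

```r
library(boot)

# Simulated stand-in for the Portfolio data (an assumption for this sketch)
set.seed(1)
x <- rnorm(100)
dat <- data.frame(X = x, Y = 0.5 * x + rnorm(100))

# Statistic function in the form boot() expects: (data, index)
corr_stat <- function(data, index) cor(data$X[index], data$Y[index])

set.seed(30)
b <- boot(dat, corr_stat, R = 1000)

original  <- b$t0                  # statistic on the full sample
bias      <- mean(b$t) - b$t0      # mean of the replicates minus the original
std_error <- sd(b$t[, 1])          # standard deviation of the replicates
c(original = original, bias = bias, std.error = std_error)
```

The bias is the gap between the average bootstrap correlation and the full-sample correlation, and the standard error is the same quantity computed earlier with sd().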

# Histogram of the bootstrap correlations returned by boot()
library(magrittr)  # provides %>%
library(ggplot2)

data.frame(New = correlation$t[, 1]) %>% 
  ggplot(aes(New)) + 
  geom_histogram(fill = "blue", alpha = 0.3) + 
  geom_vline(xintercept = mean(correlation$t), color = "red") + 
  theme_minimal()

Mai Thi Nguyen

4/15/2018