---
title: "Stats 401: Homework 2"
author: "Miles Chen"
date: ""
output:
  pdf_document: default
---

Introduction

In this homework assignment, I’ll try to lead you through coding up algorithms for bootstrap, cross-validation, and tree-making from scratch. The hope is that by coding these algorithms by hand, you’ll gain a deeper understanding of how these tools work.

1) Bootstrap

Bootstrapping refers to the technique of resampling our sample with replacement. Conceptually, this is equivalent to making infinitely many copies of our sample to form a "pseudo-population," and then drawing a random sample of the same size as our original sample from that pseudo-population.
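As a tiny illustration (not part of the assignment; the five-value sample and the seed below are made up), a single bootstrap resample can include some observations more than once and leave others out entirely:

# a made-up sample of five values, just to illustrate resampling with replacement
small_sample <- c(2, 5, 7, 8, 9)
set.seed(42)  # arbitrary seed, for this illustration only
small_sample[sample(5, replace = TRUE)]  # some values repeat, others are omitted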

We often use bootstrap to get an estimate of the standard error (the standard deviation of the sampling distribution) of a statistic when we do not have access to the actual population.

In this example, we will try to estimate the standard error of the correlation of a sample.

We will begin with a synthetic population.

set.seed(1)  # for reproducibility
# x is 10,000 random normal values, centered at 100, sd 10, rounded to 1 decimal place
x <- round(rnorm(10^4, 100, 10), 1) 
# y is created to be correlated with x, but with error
y <- x + round(rnorm(10^4, 0, 5), 1)
# the correlation between x and y in the population is 0.8986
cor(x, y)
## [1] 0.8986356
# plot(x,y) # optional if you want to see the plot

The population we created has a correlation of 0.8986. In real life, we would not know what the population actually looks like, and this value of \(\rho\) would be unknown to us.

# we randomly sample 100 of these values to create our sample
samp <- sample(10^4, 100)
x_samp <- x[samp] # we subset x and y accordingly
y_samp <- y[samp]
cor(x_samp, y_samp)  # the correlation in our sample is about 0.8983
## [1] 0.8983157

We have a sample of 100 values. The correlation between x and y in our sample is 0.8983. Without knowing what the population looks like, we may wish to estimate the standard error of this value.

One way we can do this is via bootstrap resampling. To perform bootstrap resampling, we resample our sample with replacement. In our case, there are 100 observations in our sample. So we will sample the integers 1:100 with replacement to choose which values will be in our bootstrap sample.

We subset our current sample accordingly and calculate the resulting correlation.

We repeat this many times (say, 1,000 or more; the code below uses 10,000) and record the correlation for each bootstrap sample. This effectively builds up a sampling distribution of values of the correlation r. Taking the standard deviation of all those values gives an estimate of the standard error.

# one instance
set.seed(1)
index <- sample(100, replace = TRUE)
x_boot <- x_samp[index]
y_boot <- y_samp[index]
cor(x_boot, y_boot)
## [1] 0.8761554
# your turn...
# Perform many bootstrap replicates (the loop below uses 10,000)
# For each replicate, calculate the resulting correlation, and store that value.
# So everyone gets the same results, make a for loop, and
# include the line set.seed(i) before you perform the sampling

R <- rep(NA, 10000)
for(i in 1:10000){
  # write your code here ...
  set.seed(i)
  index <- sample(100, replace = TRUE)
  x_boot <- x_samp[index]
  y_boot <- y_samp[index]
  R[i] <- cor(x_boot, y_boot)
}

hist(R)

sd(R)
## [1] 0.01984005
mean(R)
## [1] 0.8968892
# after performing the replicates, produce a histogram of the different observed correlations
# Calculate the standard deviation of the correlation estimates

2) Cross-Validation

We can use cross-validation to evaluate the predictive performance of several competing models.

Again, we will manually implement cross-validation from scratch first, and then use the built-in function in R.

We will use the dataset ironslag from the package DAAG (a companion library for the textbook Data Analysis and Graphics in R).

The description of the data is as follows: The iron content of crushed blast-furnace slag can be determined by a chemical test at a laboratory or estimated by a cheaper, quicker magnetic test. These data were collected to investigate the extent to which the results of a chemical test of iron content can be predicted from a magnetic test of iron content, and the nature of the relationship between these quantities. [Hand, D.J., Daly, F., et al. (1993) A Handbook of Small Data Sets]

The ironslag data has 53 observations, each with two values - the measurement using the chemical test and the measurement from the magnetic test.

We can start by fitting a linear regression model \(Y = \beta_0 + \beta_1 X + \epsilon\). A quick look at the scatterplot seems to indicate that the data may not be linear.

# install.packages("DAAG") # if necessary
library(DAAG)
## Warning: package 'DAAG' was built under R version 4.4.2
x <- seq(10,40, .1) # a sequence used to plot lines

L1 <- lm(magnetic ~ chemical, data = ironslag)
plot(ironslag$chemical, ironslag$magnetic, main = "Linear fit", pch = 16)
yhat1 <- L1$coef[1] + L1$coef[2] * x
lines(x, yhat1, lwd = 2, col = "blue")

In addition to the linear model, fit the following models that predict the magnetic measurement (Y) from the chemical measurement (X).

Quadratic: \(Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon\)

Exponential: \(\log(Y) = \beta_0 + \beta_1 X + \epsilon\), equivalent to \(Y = \exp(\beta_0 + \beta_1 X + \epsilon)\)

log-log: \(\log(Y) = \beta_0 + \beta_1 \log(X) + \epsilon\)

# I've started each of these for you. 
# Your job is to create the plots with fitted lines.

L2 <- lm(magnetic ~ chemical + I(chemical^2), data = ironslag)
#plot the sample
plot(ironslag$chemical, ironslag$magnetic, main = "Quadratic fit", pch = 16)
#plot the model's line
yhat2 <- L2$coef[1] + L2$coef[2] * x + L2$coef[3] * x^2
lines(x, yhat2, lwd = 2, col = "blue")

L3 <- lm(log(magnetic) ~ chemical, data = ironslag)
# when plotting the fitted line for this one, create estimates of log(y-hat) linearly
# then exponentiate log(y-hat)
plot(ironslag$chemical, ironslag$magnetic, main = "Exponential fit", pch = 16)
yhat3 <- exp(L3$coef[1] + L3$coef[2] * x)
lines(x, yhat3, lwd = 2, col = "blue")

L4 <- lm(log(magnetic) ~ log(chemical), data = ironslag)
# for this one, use plot(log(chemical), log(magnetic))
# the y-axis is now on the log-scale, so you can create and plot log(y-hat) directly
# just remember that you'll use log(x) rather than x directly
plot(log(ironslag$chemical), log(ironslag$magnetic), main = "Log-log fit", pch = 16)
yhat4 <- L4$coef[1] + L4$coef[2] * log(x)
lines(log(x), yhat4, lwd = 2, col = "blue")

Leave-One-Out Cross-Validation

We will first attempt leave-one-out cross validation. In LOOCV, we remove one data point from our data set. We fit the model to the remaining 52 data points. With this model, we make a prediction for the left-out point. We then compare that prediction to the actual value to calculate the squared error. Once we find the squared error for all 53 points, we can take the mean to get a cross-validation error estimate of that model.
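In symbols, if \(\hat{y}_{(-i)}\) denotes the prediction for the \(i\)th observation from the model fit to the other 52 observations, the LOOCV error estimate is

\[
\text{CV}_{(53)} = \frac{1}{53} \sum_{i=1}^{53} \left( y_i - \hat{y}_{(-i)} \right)^2.
\]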

To test out our four models, we will build a loop that will remove one point of data at a time. Thus, we will make a for(i in 1:53) loop. For each iteration of the loop, we will fit the four models on the remaining 52 data points, and make a prediction for the remaining point.

# create vectors to store the validation errors for each model
# error_model1 <- rep(NA, 53)
# ...

error_model1 <- rep(NA, 53)
error_model2 <- rep(NA, 53)
error_model3 <- rep(NA, 53)
error_model4 <- rep(NA, 53)


for(i in 1:53){
  
  # select the ith row of the data
  # store this row as the 'test' case
  # store the remaining rows as the 'training' data
  test_case <- ironslag[i, ]
  training  <- ironslag[-i, ]
  
  # fit the four models and calculate the prediction error for each one
  model1 <- lm(magnetic ~ chemical, data = training)
  fitted_value <- predict(model1, newdata = test_case)
  error_model1[i] <- test_case$magnetic - fitted_value
  
  model2 <- lm(magnetic ~ chemical + I(chemical^2), data = training)
  fitted_value <- predict(model2, newdata = test_case)
  error_model2[i] <- test_case$magnetic - fitted_value
  
  # models 3 and 4 predict log(magnetic), so exponentiate to get back to the original scale
  model3 <- lm(log(magnetic) ~ chemical, data = training)
  fitted_value <- exp(predict(model3, newdata = test_case))
  error_model3[i] <- test_case$magnetic - fitted_value
  
  model4 <- lm(log(magnetic) ~ log(chemical), data = training)
  fitted_value <- exp(predict(model4, newdata = test_case))
  error_model4[i] <- test_case$magnetic - fitted_value
}


mean(error_model1^2)
## [1] 19.55644
mean(error_model2^2)
## [1] 17.85248
mean(error_model3^2)
## [1] 18.44188
mean(error_model4^2)
## [1] 20.45424

Compare the sizes of the cross-validation errors to help you decide which model does the best job of predicting the test cases.
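One convenient way to compare the four estimates (a small sketch reusing the error vectors computed above; the model labels are just names chosen here) is to collect them into a named vector:

# collect the four LOOCV error estimates into a named vector for easy comparison
loocv_errors <- c(linear      = mean(error_model1^2),
                  quadratic   = mean(error_model2^2),
                  exponential = mean(error_model3^2),
                  log_log     = mean(error_model4^2))
loocv_errors
which.min(loocv_errors)  # the quadratic model has the smallest LOOCV error here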

Using the same approach, perform 6-fold cross-validation. I have already created the list of cases to include in each fold.

set.seed(1)
shuffled <- sample(53) # shuffles the values 1 to 53
cases <- list(
  test1 = shuffled[ 1:9], # the first 9 go into test set #1
  test2 = shuffled[10:18],
  test3 = shuffled[19:27],
  test4 = shuffled[28:36],
  test5 = shuffled[37:45],
  test6 = shuffled[46:53]
)
cv1 <- rep(NA, 6)
cv2 <- rep(NA, 6)
cv3 <- rep(NA, 6)
cv4 <- rep(NA, 6)

# There are 6 test sets; each has 9 observations, except test set #6, which has 8
# you can access each vector in the cases list via cases[[1]], cases[[2]], etc.

for(i in 1:6){
  # the held-out (validation) cases for this fold
  v_index <- cases[[i]]
  n <- length(v_index)
  
  # split the data into the training and validation sets for this fold
  train <- ironslag[-v_index, ]
  valid <- ironslag[ v_index, ]
  
  model1 <- lm(magnetic ~ chemical, data = train)
  fitted_value <- predict(model1, valid)
  cv1[i] <- sum((valid$magnetic - fitted_value)^2) / n
  
  model2 <- lm(magnetic ~ chemical + I(chemical^2), data = train)
  fitted_value <- predict(model2, valid)
  cv2[i] <- sum((valid$magnetic - fitted_value)^2) / n
  
  model3 <- lm(log(magnetic) ~ chemical, data = train)
  fitted_value <- exp(predict(model3, valid))
  cv3[i] <- sum((valid$magnetic - fitted_value)^2) / n
  
  model4 <- lm(log(magnetic) ~ log(chemical), data = train)
  fitted_value <- exp(predict(model4, valid))
  cv4[i] <- sum((valid$magnetic - fitted_value)^2) / n
}

# assemble the fold-by-fold errors into a table (outside the loop, so it prints)
cv_results <- rbind(cv1, cv2, cv3, cv4)
cv_results <- cbind(cv_results, rowMeans(cv_results))
colnames(cv_results) <- c("Val_Set_1", "Val_Set_2", "Val_Set_3", "Val_Set_4",
                          "Val_Set_5", "Val_Set_6", "Means")
cv_results

Again, compare the sizes of the cross-validation errors to help you decide which model does the best job of predicting the test cases.
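As with LOOCV, a quick comparison (a small sketch using the cv_results table assembled above) looks at the mean CV error of each model:

# the mean 6-fold CV error for each model is stored in the "Means" column
cv_results[, "Means"]
which.min(cv_results[, "Means"])  # the model with the smallest mean CV error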

Cross-validation with R

Now that you have written your cross-validation script from scratch, we can use the built-in functions in R. The package boot has the function cv.glm(), which can be used to estimate cross-validation error.

To make use of cv.glm() on the linear models, we must first use glm() to fit a generalized linear model to our data. If you do not change the argument family in the function glm(), it will fit a linear model.

# library(boot)
library(boot)
set.seed(1)
L1 <- glm(magnetic ~ chemical, data = ironslag) # equivalent to lm(magnetic ~ chemical)
L2 <- glm(magnetic ~ chemical + I(chemical^2), data = ironslag)
L3 <- glm(log(magnetic) ~ chemical, data = ironslag)
L4 <- glm(log(magnetic) ~ log(chemical), data = ironslag)
L1_loocv <- cv.glm(ironslag, L1)$delta[1]
L1_6foldcv <- cv.glm(ironslag, L1, K = 6)$delta[1]
L1_loocv
## [1] 19.55644
# find the LOOCV and 6-fold CV values for the other three models
L2_loocv <- cv.glm(ironslag, L2)$delta[1]
L2_6foldcv <- cv.glm(ironslag, L2, K = 6)$delta[1]
L2_loocv
## [1] 17.85248
# models L3 and L4 predict log(magnetic), so we supply a cost function that
# compares predictions and observations on the original (unlogged) scale
exp_cost <- function(y, yhat) {
  mean((exp(y) - exp(yhat))^2)
}

L3_loocv <- cv.glm(ironslag, L3, cost = exp_cost)$delta[1]
L3_6foldcv <- cv.glm(ironslag, L3, K = 6, cost = exp_cost)$delta[1]
L3_loocv
## [1] 18.44188
L4_loocv <- cv.glm(ironslag, L4, cost = exp_cost)$delta[1]
L4_6foldcv <- cv.glm(ironslag, L4, K = 6, cost = exp_cost)$delta[1]
L4_loocv
## [1] 20.45424

Your LOOCV estimates from cv.glm() should match the estimates from your hand-coded algorithm. The 6-fold CV estimates will vary from run to run because the assignment of rows to folds is random.
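As a quick sanity check (a small sketch reusing objects created earlier in this assignment), the hand-coded LOOCV estimate for the linear model should agree with the cv.glm() estimate up to floating-point error:

# the hand-coded LOOCV estimate and cv.glm()'s LOOCV estimate should agree
all.equal(mean(error_model1^2), L1_loocv)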

3) Decision Trees from Scratch

We will start by writing an algorithm to make a decision tree from scratch.

A decision tree considers all possible splits and makes the split that results in the smallest possible sum of squared residuals (SSR). It then recursively applies the same procedure to each of the resulting partitions. We will write a short loop that considers all possible splits for a chosen variable and keeps track of the resulting SSR.
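Concretely, for a candidate boundary \(b\) on a predictor \(x\), the quantity we want to minimize is

\[
\text{SSR}(b) = \sum_{i:\, x_i \le b} \left( y_i - \bar{y}_L \right)^2 + \sum_{i:\, x_i > b} \left( y_i - \bar{y}_R \right)^2,
\]

where \(\bar{y}_L\) and \(\bar{y}_R\) are the mean responses of the observations on the left and right sides of the boundary.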

We look at the Auto dataset from the package ISLR2. We will try to model the variable mpg (gas mileage) based on the other variables. The variable origin is categorical, and the variable name cannot be used as a predictor.

For this first example, we will consider only the single predictor displacement.

# install.packages("ISLR2") # install the package if necessary
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.4.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data(Auto)

# create an empty vector to store the resulting SSR after a split
total_ssr <- rep(NA, nrow(Auto) - 1)  # one entry for each of the nrow(Auto) - 1 possible splits
boundaries <- rep(NA, nrow(Auto) - 1)

for(i in 1:(nrow(Auto)-1)) {
  # we will go through every possible split, which can occur between any two observations
val1 <- sort(Auto$displacement)[i] # we sort the values of displacement and select the ith value
val2 <- sort(Auto$displacement)[i + 1] # we also select the next value
boundary <- mean(c(val1, val2)) # we take the mean to find our boundary value
  
  # split the mpg value data into two groups
  # 1) those that correspond to displacement being less than the boundary
  mpg1 <- Auto %>% filter(displacement <= boundary) %>% select(mpg)
  # 2) those that correspond to displacement being greater than the boundary
  mpg2 <- Auto %>% filter(displacement > boundary) %>% select(mpg)
  # find the mean of both groups
mean1 <- colMeans(mpg1)
mean2 <- colMeans(mpg2)  
  
  # calculate the SSR for each group
ssr1 <- sum((mpg1-mean1)^2)
ssr2 <- sum((mpg2-mean2)^2)  
  
  # add the SSRs from both groups together and store the resulting value in total_ssr
total_ssr[i] <- ssr1+ssr2
  # also store the corresponding boundary value to the vector boundaries
boundaries[i] <- boundary  
  
}

plot(total_ssr, type = "l")

# find the boundary that minimizes total SSR after the split
minimizer <- which(total_ssr == min(total_ssr))
val <- boundaries[minimizer]

# use the boundary value to split the data into Auto1 and Auto2
Auto1 <- filter(Auto, displacement <= val)
Auto2 <- filter(Auto, displacement > val)
# repeat the above process on just Auto1 to find where to make another split
# re-initialize the storage vectors so their lengths match the number of possible splits in Auto1
total_ssr <- rep(NA, nrow(Auto1) - 1)
boundaries <- rep(NA, nrow(Auto1) - 1)

for(i in 1:(nrow(Auto1) - 1)) {
  # we go through every possible split, which can occur between any two observations
  val1 <- sort(Auto1$displacement)[i]     # sort the values of displacement and select the ith value
  val2 <- sort(Auto1$displacement)[i + 1] # also select the next value
  boundary <- mean(c(val1, val2))         # take the mean to find our boundary value
  # split the mpg values into two groups
  # 1) those that correspond to displacement being less than the boundary
  mpg1 <- Auto1 %>% filter(displacement <= boundary) %>% select(mpg)
  # 2) those that correspond to displacement being greater than the boundary
  mpg2 <- Auto1 %>% filter(displacement > boundary) %>% select(mpg)
  # find the mean of both groups
  mean1 <- colMeans(mpg1)
  mean2 <- colMeans(mpg2)
  
  # calculate the SSR for each group
  ssr1 <- sum((mpg1 - mean1)^2)
  ssr2 <- sum((mpg2 - mean2)^2)
  # add the SSRs from both groups together and store the resulting value in total_ssr
  total_ssr[i] <- ssr1 + ssr2
  # also store the corresponding boundary value in the vector boundaries
  boundaries[i] <- boundary
}

plot(total_ssr, type = "l")

# the boundary that minimizes total SSR within Auto1
(minimizer <- which(total_ssr == min(total_ssr)))
(val <- boundaries[minimizer])

Compare the results of your partitioning code to the results of calling rpart() on the Auto data.

library(rpart)
## Warning: package 'rpart' was built under R version 4.4.2
rpart(mpg ~ displacement, data = Auto)  
## n= 392 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 392 23818.9900 23.44592  
##    2) displacement>=190.5 170  2210.1880 16.66000  
##      4) displacement>=284.5 98   662.3563 14.70612 *
##      5) displacement< 284.5 72   664.4728 19.31944 *
##    3) displacement< 190.5 222  7785.9020 28.64234  
##      6) displacement>=112.5 102  2046.9750 25.31863 *
##      7) displacement< 112.5 120  3654.3430 31.46750  
##       14) displacement>=93.5 65  1338.6510 30.11692 *
##       15) displacement< 93.5 55  2057.0070 33.06364  
##         30) displacement< 80.5 16   541.5144 29.13125 *
##         31) displacement>=80.5 39  1166.5690 34.67692 *
# ^^ building a tree, and considering partition only on the variable displacement

4) Decision Trees with R

Create a decision tree on the Auto data using all the numeric predictors with the function rpart().

Once you make the tree, use cv.tree() or plotcp() to see if pruning the tree will improve prediction performance.

Plot the resulting tree with the text labels.

library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.4.2
fit <- rpart(mpg ~ . - name - origin, data = Auto)
rpart.plot(fit, main="Decision Tree", cex = 0.75)

print(fit)
## n= 392 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 392 23818.9900 23.44592  
##    2) displacement>=190.5 170  2210.1880 16.66000  
##      4) horsepower>=127 96   457.0662 14.51875 *
##      5) horsepower< 127 74   741.9541 19.43784 *
##    3) displacement< 190.5 222  7785.9020 28.64234  
##      6) horsepower>=70.5 151  3347.5600 26.28013  
##       12) year< 78.5 94  1222.0920 24.12021  
##         24) weight>=2305 55   413.6684 22.28545 *
##         25) weight< 2305 39   362.1677 26.70769 *
##       13) year>=78.5 57   963.7389 29.84211  
##         26) weight>=2580 33   224.9988 27.46061 *
##         27) weight< 2580 24   294.2333 33.11667 *
##      7) horsepower< 70.5 71  1803.7790 33.66620  
##       14) year< 77.5 28   280.2500 29.75000 *
##       15) year>=77.5 43   814.4786 36.21628 *
plotcp(fit)

fit <- rpart(mpg ~ . - name - origin, data = Auto, cp = 0.039)
rpart.plot(fit, main="Decision Tree", cex = 0.75, shadow.col = "pink")

We did not bother splitting the data into a training set and a test set. Still, use the resulting tree to make predictions on the data to get an estimate of the mean squared error.

pred.tree <- predict(fit, newdata = Auto)
paste("Mean: ", mean((pred.tree - Auto$mpg)^2))
## [1] "Mean:  13.2363003071597"

5) Bagging and Random Forests

Before we try to use the Random Forest functions that are available in R, let's see how bagging works.

With bagging, we perform bootstrap resampling on the data and fit a separate tree to each of our bootstrapped datasets. At the end, we use each tree to make a prediction, and we average the different predictions.
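In symbols, if \(\hat{f}^{*b}(x)\) denotes the prediction from the tree fit to the \(b\)th bootstrap sample, then the bagged prediction over \(B\) trees is

\[
\hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x).
\]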

For our learning example, we will consider building a tree to predict mpg based on the variable displacement.

set.seed(1) # set seed for reproducibility

# The Auto dataset has 392 rows.
# we randomly sample 392 rows with replacement. This forms our first bootstrap sample
boot1 <- Auto[sample(nrow(Auto), replace = TRUE), ]
# We fit a tree to predict mpg from displacement, based on the data in boot1
tree1 <- rpart(mpg ~ displacement, data = boot1)
# our resulting tree. 
tree1
## n= 392 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 392 24043.4800 23.61224  
##    2) displacement>=190.5 175  2728.9290 16.88914  
##      4) displacement>=284.5 91   493.2400 14.50000 *
##      5) displacement< 284.5 84  1153.5470 19.47738 *
##    3) displacement< 190.5 217  7025.4880 29.03410  
##      6) displacement>=106 114  2203.3050 26.55000 *
##      7) displacement< 106 103  3340.1220 31.78350  
##       14) displacement>=93.5 41   811.3649 29.87073 *
##       15) displacement< 93.5 62  2279.5550 33.04839  
##         30) displacement< 82.5 23   710.3661 29.28696 *
##         31) displacement>=82.5 39  1051.8670 35.26667 *
# prediction for mpg for a vehicle with displacement 307
(p1 <- predict(tree1, newdata = Auto[1,]))
##    1 
## 14.5
# We repeat the process and create two additional trees

boot2 <- Auto[sample(nrow(Auto), replace = TRUE), ]
tree2 <- rpart(mpg ~ displacement, data = boot2)
p2 <- predict(tree2, newdata = Auto[1,])
boot3 <- Auto[sample(nrow(Auto), replace = TRUE), ]
tree3 <- rpart(mpg ~ displacement, data = boot3)
p3 <- predict(tree3, newdata = Auto[1,])



# Based on this 'bag' of three trees, what is the predicted mpg of a vehicle
# with displacement 307

mean.pred.mpg <- (p1+p2+p3)/3
paste("Mean of Predictions: ", mean.pred.mpg)
## [1] "Mean of Predictions:  14.643782673637"

To do something similar with the function randomForest(), we provide the formula of the tree we want to create and how many trees we want to use. In our case, we make three trees. randomForest() pretty much does the rest. Of course, in real life, we would probably never make a tree with only one predictor, and bagging with only three trees seems rather silly.

library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.2
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
set.seed(1)
bag_Auto = randomForest(mpg ~ displacement, data = Auto, ntree = 3)
yhat.bag = predict(bag_Auto, newdata = Auto[1,])
yhat.bag
##    1 
## 14.9

Random Forests

In Random Forests we restrict the variables that a tree can use to make a split at each node. This often forces a tree to make a sub-optimal split. The result is that we get trees that are less correlated with each other.

Using randomForest(), we can impose this restriction by setting the argument mtry to a value smaller than the total number of variables available. In the Auto data, there are 7 predictors, so setting mtry to any value less than 7 will cause us to grow a Random Forest. We often select mtry to be about p/3 (where p is the number of predictors) or something similar.

# grow a random forest on the Auto data, using mtry = 3.
fit <- randomForest(mpg ~ . - name - origin, data = Auto, mtry = 3)
fit
## 
## Call:
##  randomForest(formula = mpg ~ . - name - origin, data = Auto,      mtry = 3) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 7.335487
##                     % Var explained: 87.93
# comment on the mean squared residuals compared to the MSE of the single decision tree
varImpPlot(fit)
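To address the comment above, one way to put the two error estimates side by side (a small sketch; it refits the pruned tree because the object fit now holds the random forest):

# refit the pruned single tree from Part 4 (the name tree_fit is used only for this comparison)
tree_fit <- rpart(mpg ~ . - name - origin, data = Auto, cp = 0.039)
mean((predict(tree_fit, newdata = Auto) - Auto$mpg)^2)  # single-tree training MSE, about 13.2
# the random forest's mean of squared residuals printed above is about 7.3, and that is an
# out-of-bag estimate, so the forest predicts noticeably better than the single tree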