Conceptual Exercises

2 and 4)

These were handwritten and turned in separately.

5)

For the majority-vote method, we would choose red, since 6 of the 10 bootstrap estimates are greater than 0.5.

probs <- c(0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75)
mean(probs)
## [1] 0.45

For the average-probability approach, we would choose green, since the average of the bootstrapped estimates is 0.45, which is below 0.5.
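
For completeness, the majority-vote count can be checked directly from the same vector of estimates (a minimal sketch using the probs vector defined above):

sum(probs > 0.5)
## [1] 6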

Applied Exercises

8)

a)

set.seed(1)
library(tree)
## Warning: package 'tree' was built under R version 3.6.3
library(ISLR)
## Warning: package 'ISLR' was built under R version 3.6.3
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.6.3
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
attach(Carseats)
library(gbm)
## Warning: package 'gbm' was built under R version 3.6.3
## Loaded gbm 2.1.8
train <- sample(1:nrow(Carseats), nrow(Carseats)*.8)
cartrain <- Carseats[train,]
cartest <- Carseats[-train,]

b)

tree = tree(Sales ~., cartrain)
plot(tree)
text(tree, pretty = 0)

mean((predict(tree, cartest) - cartest$Sales)^2)
## [1] 4.936081

It would be tedious to interpret the entire tree, so I will trace a single path. For a given observation, if the shelf location is good, price is greater than 109.5, advertising is less than 12.5, and competitor price is greater than 133.5, the predicted sales are about 9.669 (Sales is recorded in thousands of units, so roughly 9,669 car seats). The MSE of this model evaluated on the test set is about 4.94.
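
Rather than reading the splits off the plot, the same tree can also be printed as text, which makes individual paths easier to trace; a minimal sketch using the fitted object from above:

# Text form of the tree: each row shows the split rule, node size, deviance, and fitted value
print(tree)
# Variables actually used in the splits and the residual mean deviance
summary(tree)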

c)

set.seed(10)
cv.carseats = cv.tree(tree, FUN = prune.tree)
plot(cv.carseats)

bestsize <- cv.carseats$size[which.min(cv.carseats$dev)]

pruned <- prune.tree(tree, best = bestsize)
mean((predict(pruned, cartest) - cartest$Sales)^2)
## [1] 4.936081

Cross-validation chose the full tree, so no pruning occurred and the pruned tree is identical to the original. Therefore pruning did not improve the test MSE.
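
As a quick sanity check (not required by the exercise), the test MSE can also be computed for a range of smaller subtrees; a sketch assuming the objects tree, bestsize, and cartest defined above:

# Test MSE for subtrees of size 2 up to the CV-selected size
sapply(2:bestsize, function(s) {
        p <- prune.tree(tree, best = s)
        mean((predict(p, cartest) - cartest$Sales)^2)
})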

d)

bag <- randomForest(Sales ~., data = cartrain, mtry = 10, importance = T)
importance(bag)
##               %IncMSE IncNodePurity
## CompPrice   36.398048     264.73544
## Income      10.284220     142.37481
## Advertising 22.268097     191.00144
## Population  -2.698225      73.24495
## Price       79.158793     722.91424
## ShelveLoc   74.436877     700.55028
## Age         27.588941     234.69303
## Education    1.441073      60.47524
## Urban       -1.910174      10.65936
## US           1.862926      11.63553
varImpPlot(bag)

mean((predict(bag, cartest) - cartest$Sales)^2)
## [1] 2.954726

The test MSE is 2.95, and the most important variables according to the importance function are price and shelving location.

e)

set.seed(1)
merror <- rep(NA, 8)
for(i in 3:10){
        rf <- randomForest(Sales ~., data = cartrain, mtry = i, importance = T)
        merror[i-2] <- mean((predict(rf, cartest) - cartest$Sales)^2)
}
df <- data.frame(m = 3:10, merror = merror)
df
plot(3:10, merror, type = "l", col = "blue")

As m first increases, the test error decreases because the bias decreases. At large values of m, the test error increases because the increase in variance overtakes the decrease in bias. Since the value of m with the lowest test error is 8, we will use that model to examine variable importance.

rf <- randomForest(Sales ~., data = cartrain, mtry = 8, importance = T)
importance(rf)
##               %IncMSE IncNodePurity
## CompPrice   33.077597     248.63599
## Income       7.406193     145.22246
## Advertising 23.244972     187.83014
## Population  -3.223337      80.07774
## Price       70.133677     719.85810
## ShelveLoc   81.193097     687.45774
## Age         23.905307     246.15285
## Education    2.279859      65.58869
## Urban       -2.082055      11.27248
## US           4.525378      15.96660
varImpPlot(rf)

Again, price and shelving location seem to be the most important variables.
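
As an aside, m could also be chosen using the out-of-bag (OOB) error instead of the held-out test set, which avoids using the test set for tuning; a minimal sketch assuming the same training data cartrain:

set.seed(1)
# OOB MSE after the final tree, for each candidate value of m
oob <- sapply(3:10, function(m) {
        fit <- randomForest(Sales ~ ., data = cartrain, mtry = m)
        tail(fit$mse, 1)
})
(3:10)[which.min(oob)]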

10)

a)

Hitters <- Hitters[!is.na(Hitters$Salary),]
Hitters$Salary <- log(Hitters$Salary)

b)

Hitterstrain <- Hitters[1:200,]
Hitterstest <- Hitters[201:nrow(Hitters),]

c)

library(gbm)
lambda <- c(.001, .01, .1, .5, .9)
testmses <- rep(NA, length(lambda))
trainmses <- rep(NA, length(lambda))
for(i in 1:length(testmses)){
        boost = gbm(Salary ~., data = Hitterstrain, distribution = "gaussian", n.trees = 1000, shrinkage = lambda[i])
        testmses[i] <- mean( (predict(boost, Hitterstest, n.trees = 1000) - Hitterstest$Salary) ^2)
        trainmses[i] <- mean( (predict(boost, Hitterstrain, n.trees = 1000) - Hitterstrain$Salary) ^2)
}
plot(lambda, trainmses, type = "b")

data.frame(lambda, testmses, trainmses)

d)

plot(lambda, testmses, type = "b")
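
As a side note, at a fixed shrinkage gbm can also choose the number of trees by cross-validation instead of always using 1,000; a sketch (assuming the same training and test sets as above and a shrinkage of 0.1):

set.seed(1)
cvboost <- gbm(Salary ~ ., data = Hitterstrain, distribution = "gaussian",
               n.trees = 1000, shrinkage = 0.1, cv.folds = 5)
# Number of trees that minimizes the cross-validated error
best.iter <- gbm.perf(cvboost, method = "cv", plot.it = FALSE)
mean( (predict(cvboost, Hitterstest, n.trees = best.iter) - Hitterstest$Salary) ^2)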

e)

We will use multiple linear regression and ridge regression.

library(glmnet)
## Warning: package 'glmnet' was built under R version 3.6.3
## Loading required package: Matrix
## Loaded glmnet 4.0-2
x = model.matrix(Salary ~ ., Hitterstrain)[, -1]
y = Hitterstrain$Salary
testx = model.matrix(Salary ~ ., Hitterstest)[, -1]

lm <- lm(Salary ~ ., Hitterstrain)
mselm <- mean( (predict(lm, Hitterstest) - Hitterstest$Salary) ^2)

# Choose the ridge penalty by cross-validation rather than averaging over the whole lambda path
cvridge <- cv.glmnet(x, y, alpha = 0)
mseridge <- mean( (predict(cvridge, newx = testx, s = "lambda.min") - Hitterstest$Salary) ^2)
data.frame(mselm, mseridge, mseboost = 0.2699382)   # mseboost is the best boosting test MSE from part c)

Boosting is the far superior model in this case.

f)

bestboost <- gbm(Salary ~., data = Hitterstrain, distribution = "gaussian", n.trees = 1000, shrinkage = .1)
summary(bestboost)

It looks like the number of at-bats and the number of hits are the most important variables.

g)

bag <- randomForest(Salary~., Hitterstrain, mtry = ncol(Hitters)-1)
mean( (predict(bag, Hitterstest) - Hitterstest$Salary) ^2)
## [1] 0.231263

The test MSE is about 0.23, which is the smallest of all the models evaluated.

4)

A regression tree fit to a 0/1 response predicts the conditional probability of class 1 instead of assigning each observation a hard 0 or 1 label. This means that a regression tree gives more information about the strength of a given prediction than a classification tree does. For example, a classification tree might assign a 1 to an observation whose conditional probability is 0.51; with a regression tree we would see that this observation does not clearly fit into either category, since the conditional probability itself is output.

With this knowledge, we can adjust the classification threshold based on the objectives of the research question. For example, if researchers do not want to miss any 1s but do not mind misclassifying some 0s, they can assign a 1 to predictions with a conditional probability of 0.3 or higher instead of 0.5, as in the sketch below.
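
A small illustration with hypothetical fitted probabilities, classified at the default 0.5 threshold and at a more sensitive 0.3 threshold:

phat <- c(0.10, 0.28, 0.35, 0.51, 0.80)
as.integer(phat >= 0.5)   # 0 0 0 1 1  (default rule)
as.integer(phat >= 0.3)   # 0 0 1 1 1  (fewer missed 1s, more misclassified 0s)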

5)

a)

set.seed(1)
df.train <- read.csv("./HW7train.csv", header = T)
index <- sample(1:nrow(df.train), 900)
train <- df.train[index,]
test <- df.train[-index,]

b)

set.seed(2)
library(randomForest)
rf <- randomForest(y ~ ., data = train, importance = T)
importance(rf)
##      %IncMSE IncNodePurity
## X1  42.60365     1963.3613
## X2  33.27170     1585.4286
## X3  27.45592      536.9722
## X4  33.63289      748.1658
## X5  35.67263      919.2697
## X6  34.55811      840.2030
## X7  37.14877      804.8543
## X8  31.74922      629.6042
## X9  36.71831      860.1537
## X10 37.82703      885.3555
plot(rf$importance[,1],type="b",axes=F,ann=F,ylim=c(0,max(rf$importance[,1])+1))
axis(1,at=1:10,lab=names(df.train)[-1])
axis(2,at=seq(0,max(rf$importance)+1,0.25),las=1)
box()

It looks like X1 and X2 are more important relative to the other predictors.

c)

set.seed(1)
mse.perm <- rep(NA, 10)
for(i in 1:10){
        newtrain <- train
        newtrain[,i+1] <- sample(newtrain[,i+1])
        newrf <- randomForest(y ~ ., data = newtrain, importance = T)
        preds <- predict(newrf, test)
        mse.perm[i] <- mean((preds - test$y)^2)
}

plot(mse.perm,type="b",axes=F,ann=F,ylim=c(0,max(mse.perm)+1))
axis(1,at=1:10,lab=names(df.train)[-1])
axis(2,at=seq(0,max(mse.perm)+1,0.25),las=1)
box()

For the permutation-test method, X6 and X9 look to be the most important, which is a different result than in part b). Overall, predictor importance is spread more evenly under the permutation method.
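
Permutation importance is usually reported as the increase in error over the un-permuted baseline, which puts the numbers on an easier scale to read; a sketch using the forest rf fitted in part b) and the test split above:

base.mse <- mean((predict(rf, test) - test$y)^2)
# Increase in test MSE caused by permuting each predictor
round(mse.perm - base.mse, 3)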

d)

set.seed(1)
mse.loo <- rep(NA, 10)
for(i in 1:10){
        newtrain <- train[, -(i+1)]   # drop predictor i (column i+1; column 1 is y)
        newrf <- randomForest(y ~ ., data = newtrain, importance = T)
        preds <- predict(newrf, test)
        mse.loo[i] <- mean((preds - test$y)^2)
}

plot(mse.loo,type="b",axes=F,ann=F,ylim=c(0,max(mse.loo)+1))
axis(1,at=1:10,lab=names(df.train)[-1])
axis(2,at=seq(0,max(mse.loo)+1,0.25),las=1)
box()

data.frame(predictor = 1:10, mse.loo, mse.perm, rf.imp = rf$importance[, 1])

This looks more like the results from part c), with some very slight differences. I would trust the results from part c) more than the results from part b), since the forest's built-in node-purity importance (the regression analogue of the Gini index) can be biased when predictors are collinear.

e)

pairs(df.train)

cor(df.train)
##             y            X1           X2           X3           X4           X5
## y   1.0000000  0.5463990304  0.534840459  0.216550698  0.282289549  0.319120322
## X1  0.5463990  1.0000000000  0.786200930 -0.029067890  0.008199860  0.053218327
## X2  0.5348405  0.7862009304  1.000000000 -0.001424644  0.025740263  0.037342231
## X3  0.2165507 -0.0290678897 -0.001424644  1.000000000 -0.055665572 -0.031538181
## X4  0.2822895  0.0081998603  0.025740263 -0.055665572  1.000000000 -0.003122681
## X5  0.3191203  0.0532183267  0.037342231 -0.031538181 -0.003122681  1.000000000
## X6  0.3086712  0.0414064920  0.031501437 -0.008487021  0.042530646  0.003964744
## X7  0.2686604 -0.0034344072  0.010981310 -0.070005796 -0.019070281 -0.028256980
## X8  0.2465965 -0.0090246666 -0.036528226 -0.053461043 -0.086141758 -0.002933351
## X9  0.3159788  0.0225791156  0.006356100  0.035802833  0.019762720 -0.026747664
## X10 0.3029442 -0.0001144572 -0.030733785 -0.062700444  0.004922118  0.092347592
##               X6           X7           X8           X9           X10
## y    0.308671180  0.268660423  0.246596485  0.315978758  0.3029442218
## X1   0.041406492 -0.003434407 -0.009024667  0.022579116 -0.0001144572
## X2   0.031501437  0.010981310 -0.036528226  0.006356100 -0.0307337853
## X3  -0.008487021 -0.070005796 -0.053461043  0.035802833 -0.0627004437
## X4   0.042530646 -0.019070281 -0.086141758  0.019762720  0.0049221179
## X5   0.003964744 -0.028256980 -0.002933351 -0.026747664  0.0923475923
## X6   1.000000000 -0.044693887 -0.045780661 -0.022807914  0.0098542314
## X7  -0.044693887  1.000000000  0.006187248 -0.005864789  0.0233617429
## X8  -0.045780661  0.006187248  1.000000000  0.038380209 -0.0122219848
## X9  -0.022807914 -0.005864789  0.038380209  1.000000000  0.0078488086
## X10  0.009854231  0.023361743 -0.012221985  0.007848809  1.0000000000

X1 and X2 are highly correlated (r ≈ 0.79), and they were also the variables flagged as most important in part b). This collinearity may be causing the difference between the results in part b) and part c).
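
A quick follow-up sketch: refit the forest without X2; if X1 and X2 are splitting credit because they are highly correlated, X1's importance should increase once X2 is removed (this uses the same training split as part b)):

set.seed(2)
rf.noX2 <- randomForest(y ~ . - X2, data = train, importance = T)
importance(rf.noX2)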

6)

a)

set.seed(1)
sigma = 5
dftrain <- data.frame(replicate(10, runif(100, -10, 10)))
y <- rowSums(dftrain) + rnorm(100, sd = sigma)
dftrain <- data.frame(y, dftrain)

b)

set.seed(1)
sigma = 5
dftest <- data.frame(replicate(10, runif(10000, -10, 10)))
y <- rowSums(dftest) + rnorm(10000, sd = sigma)
dftest <- data.frame(y, dftest)

c)

set.seed(1)
bag <- randomForest(y ~ ., data = dftrain, mtry = 10)
preds <- predict(bag, dftest)
bagmse <- mean( (preds - dftest$y)^2 )

rf <- randomForest(y ~ ., data = dftrain, mtry = 3)
preds <- predict(rf, dftest)
rfmse <- mean( (preds - dftest$y)^2 )
data.frame("Bagging MSE" = bagmse, "Random Forest MSE" = rfmse)

d)

set.seed(2)
bagmse <- rep(NA, 50)
for(i in 1:50){
        dftrain <- data.frame(replicate(10, runif(100, -10, 10)))
        y <- rowSums(dftrain) + rnorm(100, sd = 5)
        dftrain <- data.frame(y, dftrain)
        
        bag <- randomForest(y ~ ., data = dftrain, mtry = 10)
        preds <- predict(bag, dftest)
        bagmse[i] <- mean( (preds - dftest$y)^2 )
}
errbagg <- mean(bagmse)

rfmse <- rep(NA, 50)
for(i in 1:50){
        dftrain <- data.frame(replicate(10, runif(100, -10, 10)))
        y <- rowSums(dftrain) + rnorm(100, sd = 5)
        dftrain <- data.frame(y, dftrain)
        
        rf <- randomForest(y ~ ., data = dftrain, mtry = 3)
        preds <- predict(rf, dftest)
        rfmse[i] <- mean( (preds - dftest$y)^2 ) 
}
errrf <- mean(rfmse)

data.frame("Mean Bagging Error" = errbagg, "Mean Random Forest Error" = errrf)

e)

sigma <- c(.01, 1, 5, 10, 20, 100)
difference <- rep(NA, length(sigma))
for(i in 1:length(sigma)){
        bagmse <- rep(NA, 50)
        for(j in 1:50){
                dftrain <- data.frame(replicate(10, runif(100, -10, 10)))
                y <- rowSums(dftrain) + rnorm(100, sd = sigma[i])
                dftrain <- data.frame(y, dftrain)
                
                bag <- randomForest(y ~ ., data = dftrain, mtry = 10)
                preds <- predict(bag, dftest)
                bagmse[j] <- mean( (preds - dftest$y)^2 )
        }
        errbagg <- mean(bagmse)
        
        rfmse <- rep(NA, 50)
        for(j in 1:50){
                dftrain <- data.frame(replicate(10, runif(100, -10, 10)))
                y <- rowSums(dftrain) + rnorm(100, sd = sigma[i])
                dftrain <- data.frame(y, dftrain)
                
                rf <- randomForest(y ~ ., data = dftrain, mtry = 3)
                preds <- predict(rf, dftest)
                rfmse[j] <- mean( (preds - dftest$y)^2 ) 
        }
        errrf <- mean(rfmse)
        
        difference[i] <- errbagg - errrf
}
data.frame(sigma, difference)
plot(sigma[-6], difference[-6], type = "l", col = "blue")

For small values of sigma, the difference is negative, which indicates that bagging outperformed the random forest. The difference turns positive at sigma = 5, and once sigma reaches 100 the difference is positive and relatively large in magnitude. As sigma gets very large, the random forest outperforms bagging.

Bagging is a more flexible model than a random forest because it considers all predictors at every split, whereas the random forest can only choose from a random subset of 3 predictors at each split. In other words, bagging is more data-dependent and has higher variance than the random forest. When the signal-to-noise ratio is low (sigma is high), bagging overfits the noise in the training data and the random forest outperforms it. When the signal-to-noise ratio is high (sigma is small), bagging's data dependence gives it an advantage over the random forest.
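
To make the signal-to-noise comparison concrete: the signal is a sum of ten Unif(-10, 10) variables, so its variance is 10 * 20^2 / 12 ≈ 333 and its standard deviation is about 18.3. A short sketch of the resulting signal-to-noise ratio at each value of sigma used above:

signal.sd <- sqrt(10 * 20^2 / 12)
round(signal.sd / c(.01, 1, 5, 10, 20, 100), 2)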