These were handwritten and turned in separately.
For the majority vote method, we would choose red since 6 of the 10 estimates are greater than 0.5.
mean(c(0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75))
## [1] 0.45
For the average probability approach, we would choose green since the average of the bootstrapped estimates is 0.45, which is below 0.5.
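As a quick sketch of both rules side by side (using the ten bootstrapped estimates listed above):
probs <- c(0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75)
sum(probs > 0.5)   # majority vote: 6 of the 10 estimates exceed 0.5, so predict red
mean(probs)        # average probability: 0.45 is below 0.5, so predict green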
set.seed(1)
library(tree)
library(ISLR)
library(randomForest)
attach(Carseats)
library(gbm)
train <- sample(1:nrow(Carseats), nrow(Carseats)*.8)
cartrain <- Carseats[train,]
cartest <- Carseats[-train,]
tree = tree(Sales ~., cartrain)
plot(tree)
text(tree, pretty = 0)
mean((predict(tree, cartest) - cartest$Sales)^2)
## [1] 4.936081
It would be tedious to interpret the entire tree, so I will trace a single path. For a given observation, if the shelf location is good, the price is greater than 109.5, advertising is less than 12.5, and the competitor's price is greater than 133.5, the predicted sales are about 9,669 units (a fitted value of 9.669, since Sales is recorded in thousands). The MSE of this model evaluated on the test set is 4.94.
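For reference, the complete set of split rules can also be listed rather than traced off the plot; printing the fitted tree object shows each node's split, number of observations, deviance, and fitted value (output omitted here):
print(tree)   # node-by-node split rules and leaf predictions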
set.seed(10)
cv.carseats = cv.tree(tree, FUN = prune.tree)
plot(cv.carseats)
bestsize <- cv.carseats$size[which.min(cv.carseats$dev)]
pruned <- prune.tree(tree, best = bestsize)
mean((predict(pruned, cartest) - cartest$Sales)^2)
## [1] 4.936081
Cross-validation chose the full tree, so no pruning occurred. Therefore, the pruning process did not improve the test MSE.
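As a quick check on what cross-validation selected, the size/deviance pairs stored in the cv.tree object can be tabulated directly:
cbind(size = cv.carseats$size, dev = cv.carseats$dev)   # CV deviance for each candidate subtree size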
bag <- randomForest(Sales ~., data = cartrain, mtry = 10, importance = T)
importance(bag)
## %IncMSE IncNodePurity
## CompPrice 36.398048 264.73544
## Income 10.284220 142.37481
## Advertising 22.268097 191.00144
## Population -2.698225 73.24495
## Price 79.158793 722.91424
## ShelveLoc 74.436877 700.55028
## Age 27.588941 234.69303
## Education 1.441073 60.47524
## Urban -1.910174 10.65936
## US 1.862926 11.63553
varImpPlot(bag)
mean((predict(bag, cartest) - cartest$Sales)^2)
## [1] 2.954726
The test MSE is 2.95, and the most important variables according to the importance function are Price and ShelveLoc.
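A quick way to rank the predictors by the permutation-based measure, rather than reading them off the table, is to sort that column of the importance matrix:
sort(importance(bag)[, "%IncMSE"], decreasing = TRUE)   # Price and ShelveLoc come out on top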
set.seed(1)
merror <- rep(NA, 8)
for(i in 3:10){
  rf <- randomForest(Sales ~., data = cartrain, mtry = i, importance = T)
  merror[i-2] <- mean((predict(rf, cartest) - cartest$Sales)^2)
}
df <- data.frame(m = 3:10, merror = merror)
df
plot(3:10, merror, type = "l", col = "blue")
As m first increases, the test error decreases because each tree can split on stronger predictors, reducing bias. At large values of m, the trees become more correlated with one another, and the resulting increase in variance overtakes the decrease in bias, so the test error rises. Since the value of m with the lowest test error is 8, we will use that model to determine variable importance.
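A one-line check on the data frame built above confirms which value of m minimizes the test MSE:
df$m[which.min(df$merror)]   # value of m with the lowest test MSE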
rf <- randomForest(Sales ~., data = cartrain, mtry = 8, importance = T)
importance(rf)
## %IncMSE IncNodePurity
## CompPrice 33.077597 248.63599
## Income 7.406193 145.22246
## Advertising 23.244972 187.83014
## Population -3.223337 80.07774
## Price 70.133677 719.85810
## ShelveLoc 81.193097 687.45774
## Age 23.905307 246.15285
## Education 2.279859 65.58869
## Urban -2.082055 11.27248
## US 4.525378 15.96660
varImpPlot(rf)
Again, price and shelving location seem to be the most important variables.
Hitters <- Hitters[!is.na(Hitters$Salary),]
Hitters$Salary <- log(Hitters$Salary)
Hitterstrain <- Hitters[1:200,]
Hitterstest <- Hitters[201:nrow(Hitters),]
library(gbm)
lambda <- c(.001, .01, .1, .5, .9)
testmses <- rep(NA, length(lambda))
trainmses <- rep(NA, length(lambda))
for(i in seq_along(lambda)){
  boost <- gbm(Salary ~ ., data = Hitterstrain, distribution = "gaussian", n.trees = 1000, shrinkage = lambda[i])
  testmses[i] <- mean((predict(boost, Hitterstest, n.trees = 1000) - Hitterstest$Salary)^2)
  trainmses[i] <- mean((predict(boost, Hitterstrain, n.trees = 1000) - Hitterstrain$Salary)^2)
}
plot(lambda, trainmses, type = "b")
data.frame(lambda, testmses, trainmses)
plot(lambda, testmses, type = "b")
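Rather than reading the minimum off the plot, the shrinkage value with the lowest test MSE can be pulled directly from the vectors computed above:
lambda[which.min(testmses)]   # shrinkage value with the lowest test MSE
min(testmses)                 # the corresponding test MSE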
For comparison, we will also fit multiple linear regression and ridge regression.
library(glmnet)
x <- model.matrix(Salary ~ ., Hitterstrain)[, -1]      # drop the intercept column added by model.matrix
y <- Hitterstrain$Salary
testx <- model.matrix(Salary ~ ., Hitterstest)[, -1]
lmfit <- lm(Salary ~ ., Hitterstrain)
mselm <- mean((predict(lmfit, Hitterstest) - Hitterstest$Salary)^2)
ridge <- cv.glmnet(x, y, alpha = 0)                    # choose the ridge penalty by cross-validation
mseridge <- mean((predict(ridge, testx, s = "lambda.min") - Hitterstest$Salary)^2)
data.frame(mselm, mseridge, mseboost = min(testmses))   # best boosting test MSE from above (0.2699382)
Boosting is the far superior model in this case.
bestboost <- gbm(Salary ~., data = Hitterstrain, distribution = "gaussian", n.trees = 1000, shrinkage = .1)
summary(bestboost)
It looks like number of at bats and number of hits are the most important variables.
bag <- randomForest(Salary~., Hitterstrain, mtry = ncol(Hitters)-1)
mean( (predict(bag, Hitterstest) - Hitterstest$Salary) ^2)
## [1] 0.231263
The test MSE for bagging is 0.23, which is the smallest of all the models evaluated.
A regression tree would predict classes using conditional probabilities instead of assigning each observation a binary response of 1 or 0. This means that a regression tree gives us more information about the strength of a given prediction than a classification tree does. For example, a classification tree might assign a 1 to an observation whose conditional probability is 0.51; with a regression tree, we would see that this observation does not fit clearly into either category, since the conditional probability itself is output.
With this knowledge, we can adjust the acceptance threshold based on the objectives of the research question. For example, if researchers do not want to miss any 1s, but do not mind if some 0s are misclassified, they can assign a 1 to predictions with a conditional probability of 0.3 or higher instead of 0.5, as sketched below.
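Here is a minimal sketch of this thresholding idea, using a hypothetical vector of predicted conditional probabilities:
probs <- c(0.12, 0.31, 0.48, 0.51, 0.72, 0.95)   # hypothetical regression-tree outputs
pred.default <- as.numeric(probs >= 0.5)         # standard 0.5 cutoff
pred.lenient <- as.numeric(probs >= 0.3)         # lower cutoff so fewer true 1s are missed
rbind(pred.default, pred.lenient)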
set.seed(1)
df.train <- read.csv("./HW7train.csv", header = T)
index <- sample(1:nrow(df.train), 900)
train <- df.train[index,]
test <- df.train[-index,]
set.seed(2)
library(randomForest)
rf <- randomForest(y ~ ., data = train, importance = T)
importance(rf)
## %IncMSE IncNodePurity
## X1 42.60365 1963.3613
## X2 33.27170 1585.4286
## X3 27.45592 536.9722
## X4 33.63289 748.1658
## X5 35.67263 919.2697
## X6 34.55811 840.2030
## X7 37.14877 804.8543
## X8 31.74922 629.6042
## X9 36.71831 860.1537
## X10 37.82703 885.3555
plot(rf$importance[,1],type="b",axes=F,ann=F,ylim=c(0,max(rf$importance[,1])+1))
axis(1,at=1:10,lab=names(df.train)[-1])
axis(2,at=seq(0,max(rf$importance)+1,0.25),las=1)
box()
It looks like X1 and X2 are more important than the other predictors.
set.seed(1)
mse.perm <- rep(NA, 10)
for(i in 1:10){
  newtrain <- train
  newtrain[,i+1] <- sample(newtrain[,i+1])
  newrf <- randomForest(y ~ ., data = newtrain, importance = T)
  preds <- predict(newrf, test)
  mse.perm[i] <- mean((preds - test$y)^2)
}
plot(mse.perm,type="b",axes=F,ann=F,ylim=c(0,max(mse.perm)+1))
axis(1,at=1:10,lab=names(df.train)[-1])
axis(2,at=seq(0,max(mse.perm)+1,0.25),las=1)
box()
For the permutation method, X6 and X9 look to be the most important, which differs from the result in part b). Overall, predictor importance is spread more evenly across the predictors under the permutation method.
set.seed(1)
mse.loo <- rep(NA, 10)
for(i in 1:10){
  newtrain <- train[,-(i+1)]
  newrf <- randomForest(y ~ ., data = newtrain, importance = T)
  preds <- predict(newrf, test)
  mse.loo[i] <- mean((preds - test$y)^2)
}
plot(mse.loo,type="b",axes=F,ann=F,ylim=c(0,max(mse.loo)+1))
axis(1,at=1:10,lab=names(df.train)[-1])
axis(2,at=seq(0,max(mse.loo)+1,0.25),las=1)
box()
data.frame(predictor = 1:10, mse.loo, mse.perm, rf.imp = rf$importance[,1])
This looks more like the results from part c), with some very slight differences. I would trust the results from part c) more than the results from part b), since the random forest's built-in importance measure can be biased when predictors are highly correlated.
pairs(df.train)
cor(df.train)
## y X1 X2 X3 X4 X5
## y 1.0000000 0.5463990304 0.534840459 0.216550698 0.282289549 0.319120322
## X1 0.5463990 1.0000000000 0.786200930 -0.029067890 0.008199860 0.053218327
## X2 0.5348405 0.7862009304 1.000000000 -0.001424644 0.025740263 0.037342231
## X3 0.2165507 -0.0290678897 -0.001424644 1.000000000 -0.055665572 -0.031538181
## X4 0.2822895 0.0081998603 0.025740263 -0.055665572 1.000000000 -0.003122681
## X5 0.3191203 0.0532183267 0.037342231 -0.031538181 -0.003122681 1.000000000
## X6 0.3086712 0.0414064920 0.031501437 -0.008487021 0.042530646 0.003964744
## X7 0.2686604 -0.0034344072 0.010981310 -0.070005796 -0.019070281 -0.028256980
## X8 0.2465965 -0.0090246666 -0.036528226 -0.053461043 -0.086141758 -0.002933351
## X9 0.3159788 0.0225791156 0.006356100 0.035802833 0.019762720 -0.026747664
## X10 0.3029442 -0.0001144572 -0.030733785 -0.062700444 0.004922118 0.092347592
## X6 X7 X8 X9 X10
## y 0.308671180 0.268660423 0.246596485 0.315978758 0.3029442218
## X1 0.041406492 -0.003434407 -0.009024667 0.022579116 -0.0001144572
## X2 0.031501437 0.010981310 -0.036528226 0.006356100 -0.0307337853
## X3 -0.008487021 -0.070005796 -0.053461043 0.035802833 -0.0627004437
## X4 0.042530646 -0.019070281 -0.086141758 0.019762720 0.0049221179
## X5 0.003964744 -0.028256980 -0.002933351 -0.026747664 0.0923475923
## X6 1.000000000 -0.044693887 -0.045780661 -0.022807914 0.0098542314
## X7 -0.044693887 1.000000000 0.006187248 -0.005864789 0.0233617429
## X8 -0.045780661 0.006187248 1.000000000 0.038380209 -0.0122219848
## X9 -0.022807914 -0.005864789 0.038380209 1.000000000 0.0078488086
## X10 0.009854231 0.023361743 -0.012221985 0.007848809 1.0000000000
X1 and X2 appear to be collinear (their correlation is about 0.79), and they were also the two variables ranked most important by the built-in importance measure. This collinearity might be causing the difference between the results in part b) and part c).
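As a focused check of just that pair (the value matches the 0.786 entry in the correlation matrix above):
cor(df.train$X1, df.train$X2)   # correlation between X1 and X2, roughly 0.79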
set.seed(1)
sigma <- 5
dftrain <- data.frame(replicate(10, runif(100, -10, 10)))
y <- rowSums(dftrain) + rnorm(100, sd = sigma)
dftrain <- data.frame(y, dftrain)
set.seed(1)
dftest <- data.frame(replicate(10, runif(10000, -10, 10)))
y <- rowSums(dftest) + rnorm(10000, sd = sigma)
dftest <- data.frame(y, dftest)
set.seed(1)
bag <- randomForest(y ~ ., data = dftrain, mtry = 10)
preds <- predict(bag, dftest)
bagmse <- mean( (preds - dftest$y)^2 )
rf <- randomForest(y ~ ., data = dftrain, mtry = 3)
preds <- predict(rf, dftest)
rfmse <- mean( (preds - dftest$y)^2 )
data.frame("Bagging MSE" = bagmse, "Random Forest MSE" = rfmse)
set.seed(2)
bagmse <- rep(NA, 50)
for(i in 1:50){
  dftrain <- data.frame(replicate(10, runif(100, -10, 10)))
  y <- rowSums(dftrain) + rnorm(100, sd = 5)
  dftrain <- data.frame(y, dftrain)
  bag <- randomForest(y ~ ., data = dftrain, mtry = 10)
  preds <- predict(bag, dftest)
  bagmse[i] <- mean( (preds - dftest$y)^2 )
}
errbagg <- mean(bagmse)
rfmse <- rep(NA, 50)
for(i in 1:50){
  dftrain <- data.frame(replicate(10, runif(100, -10, 10)))
  y <- rowSums(dftrain) + rnorm(100, sd = 5)
  dftrain <- data.frame(y, dftrain)
  rf <- randomForest(y ~ ., data = dftrain, mtry = 3)
  preds <- predict(rf, dftest)
  rfmse[i] <- mean( (preds - dftest$y)^2 )
}
errrf <- mean(rfmse)
data.frame("Mean Bagging Error" = errbagg, "Mean Random Forest Error" = errrf)
sigma <- c(.01, 1, 5, 10, 20, 100)
difference <- rep(NA, length(sigma))
for(i in 1:length(sigma)){
  bagmse <- rep(NA, 50)
  for(j in 1:50){
    dftrain <- data.frame(replicate(10, runif(100, -10, 10)))
    y <- rowSums(dftrain) + rnorm(100, sd = sigma[i])
    dftrain <- data.frame(y, dftrain)
    bag <- randomForest(y ~ ., data = dftrain, mtry = 10)
    preds <- predict(bag, dftest)
    bagmse[j] <- mean( (preds - dftest$y)^2 )
  }
  errbagg <- mean(bagmse)
  rfmse <- rep(NA, 50)
  for(j in 1:50){
    dftrain <- data.frame(replicate(10, runif(100, -10, 10)))
    y <- rowSums(dftrain) + rnorm(100, sd = sigma[i])
    dftrain <- data.frame(y, dftrain)
    rf <- randomForest(y ~ ., data = dftrain, mtry = 3)
    preds <- predict(rf, dftest)
    rfmse[j] <- mean( (preds - dftest$y)^2 )
  }
  errrf <- mean(rfmse)
  difference[i] <- errbagg - errrf
}
data.frame(sigma, difference)
plot(sigma[-6], difference[-6], type = "l", col = "blue")
For small values of sigma, the difference is negative, which indicates that bagging outperformed the random forest. The difference turns positive at sigma = 5, and by sigma = 100 it is positive and relatively large in magnitude. As sigma gets very large, the random forest outperforms bagging.
Bagging is a more flexible model than a random forest since it considers all of the predictors at each split, whereas the random forest here can only choose from a random subset of 3 predictors at each split, making it less flexible. In other words, bagging is more data-dependent and higher-variance than the random forest. When the signal-to-noise ratio is low (sigma is high), bagging overfits the noise in the training data and the random forest outperforms it. However, when the signal-to-noise ratio is high (sigma is small), bagging's greater data dependence gives it an advantage over the random forest.
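To put rough numbers on this signal-to-noise argument, here is a back-of-the-envelope sketch based only on the simulation design above: each predictor is Uniform(-10, 10), so its variance is 20^2/12, and the signal is the sum of the ten predictors.
signal.var <- 10 * (20^2 / 12)       # variance of the signal term, about 333.3
snr <- signal.var / sigma^2          # signal-to-noise ratio at each sigma value used above
round(data.frame(sigma, snr), 2)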