Random forests consider only a random subset of the features at each split and reduce variance through bootstrap aggregation, so the individual trees in a forest are grown from different bootstrap samples and different candidate-feature subsets. Random forests build many bushy trees and then average them to reduce the variance.
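In the randomForest package the only difference between bagging and a random forest is the mtry argument. A minimal illustration on a hypothetical data frame df with factor response y (placeholder names, not objects used in this assignment):

library(randomForest)
p <- ncol(df) - 1 # number of predictors (df and y are hypothetical placeholders)
bag.fit <- randomForest(y ~ ., data = df, mtry = p) # bagging: all p predictors tried at each split
rf.fit <- randomForest(y ~ ., data = df, mtry = floor(sqrt(p))) # random forest: random subset of size sqrt(p)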

Answers

1.a. For the spam data, partition the data into 2/3 training and 1/3 test data.

Answer: The data was partitioned into 2/3 training and 1/3 test sets.

# load data

library(ElemStatLearn);data(spam)
set.seed(128)
train <- sample(1:nrow(spam), nrow(spam)*2/3)
spam_train <- spam[train,]
spam_test <- spam[-train,]
1.b. Build the bagging model for the spam training data using all the variables.

Answer: A bagging model was built using the spam training data and all the variables.

library(randomForest)
set.seed(111)
spam.bag <- randomForest(spam~., data=spam_train, mtry=(dim(spam)[2]-1)) # bagging: mtry = dim(spam)[2]-1 = 57, i.e. all predictors (total columns minus the response)
spam.bag
## 
## Call:
##  randomForest(formula = spam ~ ., data = spam_train, mtry = (dim(spam)[2] -      1)) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 57
## 
##         OOB estimate of  error rate: 5.41%
## Confusion matrix:
##       email spam class.error
## email  1798   75  0.04004271
## spam     91 1103  0.07621441
1.c. Get the out-of-bag error rate for the training data.

Answer: This model had an out-of-bag error rate of 5.41% on the training data.

# predicted values are for out-of-bag observations
yhat=spam.bag$predicted
y=spam_train$spam # training data
mean(y != yhat) # out of bag error rate for training data
## [1] 0.05412455
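As a sanity check, randomForest also records the cumulative out-of-bag error after each tree in the err.rate component of a classification forest; the last entry should agree with the rate computed above.

tail(spam.bag$err.rate[, 'OOB'], 1) # OOB error rate after all 500 trees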
1.d. Apply it to the test data and get the confusion matrix and error rate.

Answer: This model had an error rate of 6.84% on the test data. The confusion matrix for the test data is shown below.
yhat = predict(spam.bag, spam_test)
y = spam_test$spam
table(y,yhat)
##        yhat
## y       email spam
##   email   872   43
##   spam     62  557

mean(y != yhat) # error rate for test data
## [1] 0.0684485
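The overall error rate hides the asymmetry between the two kinds of mistakes. Row-normalizing the confusion matrix with base R's prop.table gives the per-class rates directly; a quick optional check:

prop.table(table(y, yhat), 1) # row proportions: per-class correct/error rates on the test set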
1.e. Build the random forest model for the spam training data using all the variables. Find the best mtry among sqrt(p) ± 1 and ntree among 500, 1000, 2000. Get the out-of-bag error rate for the training data.

Answer: A random forest model was fit for each parameter combination. The best model was built from 1000 trees, with 9 variables randomly sampled as candidates at each split. This model had an out-of-bag error rate of 4.50% on the training data.
# get m
m = round(sqrt(dim(spam_train)[2]-1)) # sqrt(p) rounded to integer
oob_err <- NULL # rows of (ntree, mtry, error rate) collected here
for (i in c(500,1000,2000)){ # try three different values for number of trees
  for (j in c(m-1, m, m+1)){ # try three different values for number of predictors to sample
    set.seed(123)
    rf.spam=randomForest(spam ~., data=spam_train,  mtry=j, ntree=i)
    # get oob error rate for training data
    yhat=rf.spam$predicted
    y=spam_train$spam
    error_rate <- mean(y != yhat)
    oob_err <- rbind(oob_err, c(i,j,error_rate)) # append row of ntree, mtry, and error rate
  }
}
oob_err <- as.data.frame(oob_err) # convert to data frame
names(oob_err) <- c('ntree', 'mtry', 'oob_error_rate') # add column names
oob_err$mtry <- as.factor(oob_err$mtry) # mtry to factor
library(ggplot2)
ggplot(oob_err, aes(x=ntree, y=oob_error_rate, group=mtry)) + geom_line(aes(color=mtry)) # create plot

# get parameters for best tree
ntree_x <- oob_err$ntree[which.min(oob_err$oob_error_rate)] # find ntree that minimizes oob_error_rate
ntree_x
## [1] 1000
mtry_x <- as.numeric(as.character(oob_err$mtry[which.min(oob_err$oob_error_rate)])) # find mtry that minimizes oob_error_rate
mtry_x
## [1] 9
min(oob_err$oob_error_rate) # find min oob_error_rate
## [1] 0.04499511
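An alternative to the manual grid above is randomForest's built-in tuneRF(), which steps mtry up and down from a starting value until the out-of-bag error stops improving. A minimal sketch, assuming the response spam is the last column of spam_train (note it tunes mtry only, so ntree would still be chosen separately):

set.seed(123)
tuneRF(spam_train[, -ncol(spam_train)], spam_train$spam,
       ntreeTry = 500, stepFactor = 1.5, improve = 0.01) # reports OOB error for each mtry tried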
1.f. Apply it to the test data and get the confusion matrix and error rate.

Answer: The best random forest model was applied to the test data. It had an error rate of 6.58% on the test data; the confusion matrix is shown below.

# build best tree
set.seed(123)
rf.spam=randomForest(spam ~., data=spam_train,  importance=TRUE, mtry=mtry_x, ntree=ntree_x)

# predict on test set

yhat <- predict(rf.spam, spam_test, type = 'class')
y <- spam_test$spam
table(y,yhat) 
##        yhat
## y       email spam
##   email   875   40
##   spam     61  558
mean(y != yhat) # error rate
## [1] 0.06584094
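The error rate above depends on the default majority-vote cutoff. The forest's class probabilities allow a threshold-free summary as an optional check; this sketch assumes the pROC package, which is not used elsewhere in this assignment.

library(pROC) # assumed to be installed
prob <- predict(rf.spam, spam_test, type = 'prob')[, 'spam'] # estimated P(spam) per message
auc(roc(spam_test$spam, prob)) # area under the ROC curve on the test set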
1.g. Find the 10 most important variables for the random forest model.

Answer: The 10 most important variables in this model can be seen in the output below. A.52 is the most important variable by both measures of importance (MeanDecreaseAccuracy and MeanDecreaseGini).

# overall importance

varImpPlot(rf.spam, sort=TRUE, n.var=10) # plot top 10

best <- as.data.frame(round(importance(rf.spam), 2)) # get importance values
top <- best[order(-best$MeanDecreaseGini), , drop = FALSE] # sort highest first
head(top, 10) # print the 10 most important variables
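The remaining items answer question 2, which repeats the analysis on the cpus data from the MASS package with perf as the response. The chunk that partitions cpus and tunes the forest is not shown above, so the sketch below reconstructs it under stated assumptions (the column selection, seed values, and search grid mirror the spam loop) so that cpus_train, cpus_test, and oob_err with its oob_mse column exist for the code that follows.

library(MASS); data(cpus)
cpus0 <- cpus[, 2:8] # keep the six predictors (syct..chmax) plus the response perf
set.seed(128) # assumed seed, mirroring the spam partition
train <- sample(1:nrow(cpus0), nrow(cpus0)*2/3)
cpus_train <- cpus0[train,]
cpus_test <- cpus0[-train,]
m <- round(sqrt(ncol(cpus_train) - 1)) # sqrt(p), mirroring the spam grid
oob_err <- NULL # rows of (ntree, mtry, OOB MSE) collected here
for (i in c(500, 1000, 2000)){
  for (j in c(m-1, m, m+1)){
    set.seed(123)
    rf.cpus <- randomForest(perf ~ ., data=cpus_train, mtry=j, ntree=i)
    oob_err <- rbind(oob_err, c(i, j, mean((rf.cpus$predicted - cpus_train$perf)^2)))
  }
}
oob_err <- as.data.frame(oob_err)
names(oob_err) <- c('ntree', 'mtry', 'oob_mse')
oob_err$mtry <- as.factor(oob_err$mtry)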

# get parameters for best cpus tree

ntree_x <- oob_err$ntree[which.min(oob_err$oob_mse)]
ntree_x
## [1] 2000
mtry_x <- as.numeric(as.character(oob_err$mtry[which.min(oob_err$oob_mse)])) # find mtry that minimizes oob_mse
mtry_x
## [1] 3
min(oob_err$oob_mse)
## [1] 2944.397
2.c. Apply it to the test data and get the mean square error and R squared value.

Answer: The best model was applied to the test data. It had a mean square error of 3934.898 and an R squared value of 0.8579 on the test data.

# build best tree
set.seed(123)
rf.cpus=randomForest(perf ~., data=cpus_train, importance=TRUE, mtry=mtry_x, ntree=ntree_x)
yhat = predict(rf.cpus, newdata=cpus_test)
y <- cpus_test$perf
mse=mean((yhat-y)^2);mse
## [1] 3934.898
rsq=1-sum((y-yhat)^2)/sum((y-mean(y))^2);rsq
## [1] 0.8579401
2.d. What are the important variables?

Answer: Variable importance can be seen in the table and plot below. mmax has the highest IncNodePurity and cach the highest %IncMSE, while syct ranks lowest on both measures.

# overall importance
best <- as.data.frame(round(importance(rf.cpus), 2)) # get importance values
best
##       %IncMSE IncNodePurity
## syct    18.60      176920.6
## mmin    23.62      237667.6
## mmax    36.30     1130594.6
## cach    39.81      567109.8
## chmin   29.49      541894.7
## chmax   22.83      730332.2
varImpPlot(rf.cpus) # plot all
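Importance scores rank variables but do not show the direction of their effects. randomForest's partialPlot() plots the marginal effect of a single predictor; an optional sketch for the top-ranked variable:

partialPlot(rf.cpus, cpus_train, mmax, main='Partial dependence of perf on mmax')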

2.e. Find the outliers using the proximity matrix.

Answer: No points had an outlier index greater than 10, so there are no outliers in the data.
set.seed(123)
rf.cpus=randomForest(perf ~., data=cpus_train, mtry=mtry_x, ntree=ntree_x, proximity=TRUE, oob.prox=FALSE)
out = apply(rf.cpus$proximity, 1, function(x) 1/(sum(x^2)-1)) # calculate outlyingness (remove proximity with self, which is 1)
plot(out) # plot outlyingness; most values are near 0

A point is considered an outlier if its outlier index is greater than 10, so there are no outliers in this case.
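randomForest also ships an outlier() helper that computes the same proximity-based outlyingness and standardizes it by the median and MAD; a quick cross-check of the manual calculation above:

out_std <- outlier(rf.cpus$proximity) # standardized outlyingness from the proximity matrix
plot(out_std, type='h'); abline(h=10, lty=2) # conventional cutoff of 10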

# MDSplot(cpus.rf,cpus0[train,]$perf) #multidimensional scaling plot