The problems on this assignment require you to use regression/classification trees to predict a response variable from a given set of covariate variables. Answer all questions specified on the problem but if you see something interesting and want to do more analysis please report it as well. Don’t forget to include discussion and to make your plots look nice and presentable. Submit your .rmd file with the knitted PDF (or knitted Word Document saved as a PDF). If you are still having trouble with .rmd, let us know and we will help you, but we are going to start requiring this from here on out.
The BostonHousing dataset reported by Harrison and Rubinfeld (1978) is available as a data.frame in the mlbench package (Leisch and Dimitriadou, 2009). The goal here is to predict the median value of owner-occupied homes in USD 1000s (the medv variable) from the other predictors in the dataset. Use this dataset to do the following.
Construct a regression tree using rpart(). The following need to be included in your discussion. How many nodes did your tree have? Did you prune the tree? Did it decrease the number of nodes? What is the prediction error (calculate MSE)? Provide the predicted vs observed plot.
#load all libraries
library(vcd)
library(lattice)
library(randomForest)
library(party)
library(partykit)
library(mboost)
library(TH.data)
library(ipred)
library(rpart)
library(mlbench)
library(ggplot2)
library(dplyr)
library(tidyr)
library(boot)
library(fastAdaboost)
We will use rpart() from the rpart package to build a regression tree on all predictor variables and plot the resulting tree.
data(BostonHousing)
set.seed(12345)
#build a decision tree with rpart and plot the tree
p1.1.tree <- rpart(medv~.,data=BostonHousing,control = rpart.control( minsplit = 10))
plot(as.party(p1.1.tree),
tp_args = list(id = FALSE))
From the tree plot it appears the fitted tree has 9 terminal nodes. We will confirm with the CP plot and table below that this size also has the lowest cross-validated error (xerror).
#plot the cp of the tree
plotcp(p1.1.tree)
The CP plot and table confirm that the lowest xerror, 0.1933019, occurs at a tree size of 9 terminal nodes (8 splits).
#print the cp table of the tree
p1.1.tree$cptable
## CP nsplit rel error xerror xstd
## 1 0.45274420 0 1.0000000 1.0039957 0.08324106
## 2 0.17117244 1 0.5472558 0.6305946 0.05607611
## 3 0.07165784 2 0.3760834 0.4074182 0.04529558
## 4 0.05900152 3 0.3044255 0.3704079 0.04521515
## 5 0.03375589 4 0.2454240 0.2860290 0.03408051
## 6 0.02661300 5 0.2116681 0.2250617 0.02338934
## 7 0.02357238 6 0.1850551 0.2165841 0.02301406
## 8 0.01085935 7 0.1614827 0.1990835 0.02300119
## 9 0.01000000 8 0.1506234 0.1933019 0.02287841
As can be seen from the CP table, there is no reason to prune the tree because the current tree already has the lowest available xerror. We will use the predict function to obtain the fitted values of the tree and then use these values to calculate and print the MSE.
#use the predict function to obtain the tree's fitted values
pred <- predict(p1.1.tree, newdata = BostonHousing)
#calculate the (in-sample) MSE of the tree's fitted values
MSE_p1_tree <- round(mean((BostonHousing$medv - pred)^2), 3)
#print the MSE
MSE_p1_tree
## [1] 12.716
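The MSE above is computed on the same observations used to grow the tree, so it is an in-sample estimate. As a rough check, the following sketch (not part of the reported results; MSE_p1_tree_cv is a hypothetical object name) uses rpart's xpred.rpart() to obtain 10-fold cross-validated predictions; the column with the smallest cp corresponds to the unpruned tree fit above.
set.seed(12345)
#10-fold cross-validated predictions for each cp value in the fitted tree
xp <- xpred.rpart(p1.1.tree, xval = 10)
#take the column for the smallest cp, i.e. the full 9-leaf tree
cv_pred <- xp[, which.min(as.numeric(colnames(xp)))]
MSE_p1_tree_cv <- round(mean((BostonHousing$medv - cv_pred)^2), 3)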
The predicted vs observed plot below shows noticeable prediction error in the fitted values.
#create a data.frame of the BostonHousing medv variable with the predicted values
bhousetree <- as.data.frame(cbind(medv=BostonHousing$medv,predicted=pred))
#use ggplot2 to build a predicted vs observed plot of the regression tree
ggplot(data=bhousetree,aes(x=medv,y=predicted)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title='Predicted vs Observed Plot of Regression Tree - Boston Housing Data',
x='Observed Value of Median Value of Owner Occupied Homes',
y='Predicted Value of Median Value of Owner Occupied Homes')
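Although pruning was not needed for this fit, a minimal sketch of how one would prune is given below for completeness; it follows the same pattern used later for the glaucoma tree (p1.1.pruned is a hypothetical object name).
#identify the cp value with the smallest cross-validated error and prune there
opt <- which.min(p1.1.tree$cptable[, "xerror"])
p1.1.pruned <- prune(p1.1.tree, cp = p1.1.tree$cptable[opt, "CP"])
#for this fit, pruning returns the same 9-leaf tree, since the full tree
#already attains the minimum xerror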
Perform bagging with 50 trees. Report the prediction error (MSE). Provide the predicted vs observed plot.
#Bagging for the BostonHousing data: draw 50 multinomial bootstrap weight
#vectors and refit the base tree with each set of case weights
trees1 <- vector(mode = "list", length = 50)
n1 <- nrow(BostonHousing)
bootsamples1 <- rmultinom(length(trees1), n1, rep(1, n1)/n1)
p1.2.tree <- rpart(medv ~ ., data = BostonHousing, control = rpart.control(xval = 0))
for (i in 1:length(trees1))
trees1[[i]] <- update(p1.2.tree, weights = bootsamples1[,i])
#use the predict function to predict medv; note these predictions come from the
#single base tree p1.2.tree, not from averaging the 50 bootstrap trees in trees1
pred2 <- predict(p1.2.tree, newdata = BostonHousing)
#calculate the MSE
MSE_p2_bagging <- round(mean((BostonHousing$medv - pred2)^2), 3)
The MSE from this bagging example is higher than that of the single tree above, which at first seems odd. The explanation is that the predictions come from the single base tree p1.2.tree, which was grown with rpart's default settings rather than minsplit = 10, and they are not averaged over the 50 bootstrap trees; a sketch of the averaged ensemble prediction is given after the MSE below.
MSE_p2_bagging
## [1] 16.245
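For reference, a minimal sketch of aggregating the bootstrap ensemble: each observation's prediction is averaged over the 50 trees stored in trees1. The objects p_boot, pred2_bag, and MSE_p2_bagged_avg are hypothetical names, and the resulting MSE is not part of the results reported above.
#predict medv with each bootstrap tree and average across the ensemble
p_boot <- sapply(trees1, predict, newdata = BostonHousing)
pred2_bag <- rowMeans(p_boot)
#the bagged MSE could then be computed exactly as above
MSE_p2_bagged_avg <- round(mean((BostonHousing$medv - pred2_bag)^2), 3)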
The predicted vs observed plot for this fit shows that the predicted values are scattered more widely around the observed medv values than those of the tree in part (a).
#bind the predicted values with the medv variable for the plot df
bhousebag <- as.data.frame(cbind(medv=BostonHousing$medv,predicted=pred2))
#use ggplot to create a predicted vs observed plot
ggplot(data=bhousebag,aes(x=medv,y=predicted)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title='Predicted vs Observed Plot of Regression Tree Bagging - Boston Housing Data',
x='Observed Value of Median Value of Owner Occupied Homes',
y='Predicted Value of Median Value of Owner Occupied Homes')
Use the randomForest() function in R to perform bagging. Report the prediction error (MSE). Was it the same as (b)? If they are different, what do you think caused it? Provide the predicted vs observed plot. (For bagging, set mtry equal to the number of predictor variables.)
Answer - The difference between a random forest and bagging is that bagging considers all of the features at every split, whereas a random forest considers only a random subset of the features at each split. Because of this, the bagged trees can be hurt by multicollinearity, overfitting, or other problems among the covariates. If the random forest returns a lower error, it is because the bagged trees repeatedly use covariates that actually increase the error. Setting mtry to 13 means all 13 predictors are candidates at every split, which makes randomForest() equivalent to bagging.
p1.3.rf <- randomForest(medv ~ ., data = BostonHousing, mtry = 13)
#with no newdata, predict() returns the out-of-bag predictions
pred3 <- predict(p1.3.rf)
bhouserf <- as.data.frame(cbind(medv = BostonHousing$medv, predicted = pred3))
MSE_p3_rf <- round(mean((BostonHousing$medv - pred3)^2), 3)
MSE_p3_rf
## [1] 10.499
The predicted vs observed plot below shows a noticeably tighter fit, and thus lower error, than the bagging example when mtry is set to 13.
ggplot(data=bhouserf,aes(x=medv,y=predicted)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title='Predicted vs Observed Plot of Random Forest (mtry = 13) - Boston Housing Data',
x='Observed Value of Median Value of Owner Occupied Homes',
y='Predicted Value of Median Value of Owner Occupied Homes')
Use randomForest() function in R to perform random forest. Report the prediction error (MSE). Provide the predicted vs observed plot.
Without specifying mtry, randomForest() uses its regression default of floor(p/3) = 4 variables per split, so each split considers only a random subset of the predictors rather than all 13. The lower out-of-bag MSE lends some support to the hypothesis above that some of the variables in the data set are not helpful for predicting the median value of owner-occupied homes.
p1.3.rf2 <- randomForest(medv ~ ., data = BostonHousing)
#out-of-bag predictions again, now with the default mtry
pred4 <- predict(p1.3.rf2)
bhouserf2 <- as.data.frame(cbind(medv = BostonHousing$medv, predicted = pred4))
MSE_p3_rf2 <- round(mean((BostonHousing$medv - pred4)^2), 3)
MSE_p3_rf2
## [1] 9.742
ggplot(data=bhouserf2,aes(x=medv,y=predicted)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title='Predicted vs Observed Plot of Random Forest (default mtry) - Boston Housing Data',
x='Observed Value of Median Value of Owner Occupied Homes',
y='Predicted Value of Median Value of Owner Occupied Homes')
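For completeness, a sketch of how the mtry tuning parameter could be explored (this step is not part of the results above): randomForest::tuneRF() searches over mtry values using the out-of-bag error, starting from the regression default of 4.
set.seed(12345)
#search over mtry, doubling/halving from the default of 4, judged by OOB error
tuneRF(x = BostonHousing[, names(BostonHousing) != "medv"],
       y = BostonHousing$medv,
       mtryStart = 4, ntreeTry = 200, stepFactor = 2, improve = 0.01)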
Provide a table containing each method and associated MSE. Which method is more accurate?
Answer - The most accurate model, as described above, is the random forest without mtry specified. This is because the random forest selects covariates from a random subset at each split instead of always choosing among all 13 covariates. This suggests that some covariates either correlate with one another or are not helpful for predicting the median value of owner-occupied homes, so the bagged model is probably overfit. Note also that the tree and bagging MSEs are in-sample while the random forest MSEs are out-of-bag estimates; a comparison on a common held-out test set is sketched after the table below.
p1_table <- as.data.frame(cbind(Trees=MSE_p1_tree,Bagging=MSE_p2_bagging,
RandomForest_mtry13=MSE_p3_rf,RandomForest=MSE_p3_rf2))
p1_table
## Trees Bagging RandomForest_mtry13 RandomForest
## 1 12.716 16.245 10.499 9.742
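Because the table mixes in-sample and out-of-bag errors, a minimal sketch of a more even-handed comparison on a single held-out test set follows. The 70/30 split and the object names (bh_train, bh_test, test_mse, fit_tree, fit_bag, fit_rf) are assumptions for illustration and were not used for the results above.
set.seed(12345)
train_i <- sample(seq_len(nrow(BostonHousing)), size = floor(0.7 * nrow(BostonHousing)))
bh_train <- BostonHousing[train_i, ]
bh_test <- BostonHousing[-train_i, ]
test_mse <- function(pred) round(mean((bh_test$medv - pred)^2), 3)
fit_tree <- rpart(medv ~ ., data = bh_train, control = rpart.control(minsplit = 10))
fit_bag <- randomForest(medv ~ ., data = bh_train, mtry = 13)  #bagging: all 13 predictors per split
fit_rf <- randomForest(medv ~ ., data = bh_train)              #default mtry
c(Tree = test_mse(predict(fit_tree, newdata = bh_test)),
  Bagging = test_mse(predict(fit_bag, newdata = bh_test)),
  RandomForest = test_mse(predict(fit_rf, newdata = bh_test)))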
Consider the glaucoma data (data = “GlaucomaM”, package = “TH.data”).
Build a logistic regression model. Note that most of the predictor variables are highly correlated. Hence, logistic regression model using the whole set of variables will not work here as it is sensitive to correlation.
Here we will use the step() function to step through the model and find the combination of variables with the lowest AIC. Using the formula() function we will capture the formula for this model and use it throughout the problem; the printout of the formula is below. Before performing stepwise regression we will remove variables that are highly correlated (pairwise correlation above 0.8) with another predictor.
data(GlaucomaM)
#removal of highly correlated variables
#create an object class containing the class variable and remove it from the data set
Class <- GlaucomaM$Class
Glauc <- GlaucomaM[,!names(GlaucomaM) %in% 'Class']
#create a correlation matrix of the data set and zero out the upper triangle
#and the diagonal so each pair of variables is considered only once
tmp <- cor(Glauc)
tmp[upper.tri(tmp)] <- 0
diag(tmp) <- 0
#drop any variable whose correlation with another predictor exceeds 0.8,
#then add the class variable back
GlaucomaM <- Glauc[,!apply(tmp,2,function(x) any(x > 0.8))]
GlaucomaM <- cbind(GlaucomaM,Class)
#set class to binary
GlaucomaM <- GlaucomaM %>%
mutate(Class = ifelse(Class=='glaucoma',1,0))
#use stepwise regression for the glm
glauc_glm <- step(glm(Class ~., data = GlaucomaM, family = "binomial"),trace=F)
#extract the formula with the lowest aic
form_glauc <- formula(glauc_glm)
form_glauc
## Class ~ phct + phci + vbrn + vari + tms + mr + rnf + emd
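As a cross-check on the correlation filter above (which, as written, screens only positive correlations greater than 0.8), caret::findCorrelation() flags variables by absolute pairwise correlation, including strong negative correlations. This sketch assumes the caret package is installed (it is not loaded above), and the filtered set it suggests may differ from the one used here; drop_cols and Glauc_alt are hypothetical names.
#columns recommended for removal at |r| > 0.8
drop_cols <- caret::findCorrelation(cor(Glauc), cutoff = 0.8, names = TRUE)
Glauc_alt <- Glauc[, !names(Glauc) %in% drop_cols]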
The error rate for the logistic regression model using 10-fold cross-validation is 0.1.
#set the number of folds and create an empty vector for the fold error rates
n_folds <- 10
folds_i <- sample(rep(1:n_folds, length.out = nrow(GlaucomaM)))
cv_error <- rep(NA, n_folds)
#loop through the folds, fit the model on the training folds, and compute the
#misclassification rate on the held-out fold from its confusion matrix
for (k in 1:n_folds) {
test_i <- which(folds_i == k)
train <- GlaucomaM[-test_i, ]
test <- GlaucomaM[test_i, ]
model <- glm(form_glauc,data=train,family='binomial')
pred <- predict(model,test,type='response')
conf <- table("actual"=test$Class,
"predicted"=pred>0.5)
cv_error[k] <- round(1 - sum(diag(conf)) / nrow(test), 4)
}
#average the misclassification rate over all folds
glauc.glm.error <- mean(cv_error)
glauc.glm.error
## [1] 0.1
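As a cross-check on the hand-rolled loop above, boot::cv.glm() (the boot package is already loaded) gives the same kind of 10-fold estimate; the cost function below converts the cross-validated predicted probabilities into a misclassification rate at a 0.5 cutoff. This is only a sketch (cost_01 is a hypothetical helper) and its estimate may differ slightly from the 0.1 reported above.
set.seed(12345)
#0/1 loss at a 0.5 probability cutoff
cost_01 <- function(r, pi) mean(abs(r - pi) > 0.5)
cv.glm(GlaucomaM, glauc_glm, cost = cost_01, K = 10)$delta[1]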
Find a function (package in R) that can conduct the “adaboost” ensemble modeling. Use it to predict glaucoma and report error rate.
Answer - We will use the adaboost() function from the fastAdaboost package. Interestingly, this algorithm predicts glaucoma perfectly on the data it was fit to, with a zero error rate.
adaboost <- adaboost(form_glauc,data=GlaucomaM,nIter=100)
ada.pred <- predict(adaboost,newdata=GlaucomaM)
ada.pred$error
## [1] 0
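Since the zero error above is computed on the same observations used to fit the model, a quick holdout check is worth sketching. The 70/30 split and the names tr and ada_holdout are assumptions for illustration, not part of the original analysis.
set.seed(12345)
tr <- sample(seq_len(nrow(GlaucomaM)), size = floor(0.7 * nrow(GlaucomaM)))
ada_holdout <- adaboost(form_glauc, data = GlaucomaM[tr, ], nIter = 100)
#error rate on the 30% of observations not used for fitting
predict(ada_holdout, newdata = GlaucomaM[-tr, ])$error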
Report the error rates based on single tree, bagging and random forest. (A table would be great for this)
glauc.tree <- rpart(form_glauc,data=GlaucomaM,control = rpart.control( minsplit = 10))
plot(as.party(glauc.tree),
tp_args = list(id = FALSE))
The CP plot below shows that the cross-validated error increases again after the first split, so pruning needs to take place.
plotcp(glauc.tree)
Examining the xerror column, the minimum occurs at one split, i.e. a tree with two terminal nodes.
glauc.tree$cptable
## CP nsplit rel error xerror xstd
## 1 0.41853844 0 1.0000000 1.0109762 0.003344427
## 2 0.06359004 1 0.5814616 0.7188660 0.077847819
## 3 0.04630239 2 0.5178715 0.7751342 0.084885000
## 4 0.03670627 3 0.4715691 0.7730566 0.088126420
## 5 0.03537562 5 0.3981566 0.7761894 0.091971800
## 6 0.02305684 6 0.3627810 0.8317244 0.099554213
## 7 0.02093145 7 0.3397241 0.7948401 0.099770885
## 8 0.02083691 8 0.3187927 0.7937709 0.100329235
## 9 0.01834305 9 0.2979558 0.8002381 0.100443286
## 10 0.01438561 11 0.2612697 0.8372278 0.101765027
## 11 0.01000000 13 0.2324985 0.8530324 0.101768494
The tree produced by pruning has only two terminal nodes. A single split like this will not be very predictive for glaucoma.
opt <- which.min(glauc.tree$cptable[,"xerror"])
cp <- glauc.tree$cptable[opt, "CP"]
glauc.prune <- prune(glauc.tree, cp = cp)
plot(as.party(glauc.prune),
tp_args = list(id = FALSE))
The cross-validated error rate of roughly 0.51 suggests that we may have over-pruned the tree and reduced its overall predictive power.
n_folds <- 10
folds_i <- sample(rep(1:n_folds, length.out = nrow(GlaucomaM)))
cv_error <- rep(NA, n_folds)
for (k in 1:n_folds) {
test_i <- which(folds_i == k)
train <- GlaucomaM[-test_i, ]
test <- GlaucomaM[test_i, ]
#grow a tree on the training folds and prune it at the cp with the lowest xerror
glauc.tree <- rpart(form_glauc,data=train,control = rpart.control(minsplit = 10))
opt <- which.min(glauc.tree$cptable[,"xerror"])
cp <- glauc.tree$cptable[opt, "CP"]
glauc.prune <- prune(glauc.tree, cp = cp)
#misclassification rate of the pruned tree on the held-out fold
pred <- predict(glauc.prune,test)
conf <- table("actual"=test$Class,"predicted"=pred>0.5)
cv_error[k] <- round(1 - sum(diag(conf)) / nrow(test), 4)
}
glauc.tree.error <- mean(cv_error)
glauc.tree.error
## [1] 0.51667
The random forest error shown below (the mean squared error of the predicted class values against the 0/1 labels) is substantially lower than the cross-validated error of the decision tree. In this example we did not set the mtry tuning parameter.
n_folds <- 10
folds_i <- sample(rep(1:n_folds, length.out = nrow(GlaucomaM)))
cv_error <- rep(NA, n_folds)
for (k in 1:n_folds) {
test_i <- which(folds_i == k)
train <- GlaucomaM[-test_i, ]
test <- GlaucomaM[test_i, ]
glauc.rf <- randomForest(form_glauc,data=train)
pred <- predict(glauc.rf,test)
#squared-error (Brier-type) loss of the predicted values against the 0/1 labels
cv_error[k] <- round(mean((test$Class - pred)^2), 3)
}
glauc.rf.error <- mean(cv_error)
glauc.rf.error
## [1] 0.0571
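The error above is a squared-error measure of the numeric 0/1 class. As an alternative sketch (not part of the reported results; glauc.rf.class is a hypothetical name), treating Class as a factor makes randomForest() perform classification, so printing the fit reports an out-of-bag misclassification rate and confusion matrix directly.
set.seed(12345)
glauc.rf.class <- randomForest(form_glauc,
                               data = transform(GlaucomaM, Class = factor(Class)))
#printing the fit shows the OOB error rate and confusion matrix
glauc.rf.class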
Bagging with 10-fold cross-validation produces an error rate very close to that of the GLM.
n_folds <- 10
folds_i <- sample(rep(1:n_folds, length.out = nrow(GlaucomaM)))
cv_error <- rep(NA, n_folds)
for (k in 1:n_folds) {
test_i <- which(folds_i == k)
train <- GlaucomaM[-test_i, ]
test <- GlaucomaM[test_i, ]
#draw 50 bootstrap weight vectors and refit the base tree with each
trees1 <- vector(mode = "list", length = 50)
n1 <- nrow(train)
bootsamples1 <- rmultinom(length(trees1), n1, rep(1, n1)/n1)
glauc.bag <- rpart(form_glauc, data = train, control = rpart.control(xval = 0))
for (i in 1:length(trees1)){
trees1[[i]] <- update(glauc.bag, weights = bootsamples1[,i])
}
#as in the Boston Housing bagging example, the predictions below come from the
#base tree glauc.bag rather than from averaging the bootstrap trees in trees1
pred <- predict(glauc.bag,test)
cv_error[k] <- round(mean((test$Class - pred)^2), 3)
}
glauc.bag.error <- mean(cv_error)
glauc.bag.error
## [1] 0.1097
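As a cross-check, the ipred package (already loaded) provides bagging(), which aggregates the bootstrap trees itself and, with coob = TRUE, reports an out-of-bag misclassification estimate. This is only a sketch (glauc.ipred is a hypothetical name) and will not reproduce the number above.
set.seed(12345)
glauc.ipred <- bagging(form_glauc,
                       data = transform(GlaucomaM, Class = factor(Class)),
                       nbagg = 50, coob = TRUE)
#out-of-bag misclassification estimate for the bagged classification trees
glauc.ipred$err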
The AdaBoost algorithm looks like the clear choice here with an error rate of zero. However, that error was not estimated with the same 10-fold cross-validation used for the other methods, so the random forest, which still shows substantial predictive power, may be the safer choice on new data from the same tests.
q2.table <- as.data.frame(rbind(glauc.glm.error,ada.pred$error,glauc.tree.error,glauc.rf.error,glauc.bag.error))
row.names(q2.table) <- c('GLM Error','AdaBoost Error','Tree Error','randomForest Error','Bagging Error')
q2.table
## V1
## GLM Error 0.10000
## AdaBoost Error 0.00000
## Tree Error 0.51667
## randomForest Error 0.05710
## Bagging Error 0.10970
Write a conclusion comparing the above results (use a table to report models and corresponding error rates). Which one is the best model?
Answer - From the above results, the AdaBoost algorithm had a zero error rate and predicted the Class variable perfectly. Again, since the same 10-fold cross-validation method was not applied to AdaBoost, the random forest may still be the more reliable model for predicting glaucoma.
From the above analysis, which variables seem to be important in predicting Glaucoma?
Answer - We will print the summary of the GLM and examine the covariates' statistical significance. The tms variable is the most significant, followed by emd, phci, vbrn, rnf, vari, phct, and mr. This means tms, the third moment superior measurement, carries the most weight when predicting glaucoma from these tests. A random forest variable importance sketch follows the model summary as a complement.
summary(glauc_glm)
##
## Call:
## glm(formula = Class ~ phct + phci + vbrn + vari + tms + mr +
## rnf + emd, family = "binomial", data = GlaucomaM)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.17940 -0.53719 0.01616 0.48980 3.08580
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.485 2.781 1.972 0.0486 *
## phct 5.571 3.306 1.685 0.0919 .
## phci 6.859 2.946 2.328 0.0199 *
## vbrn -5.171 2.366 -2.185 0.0289 *
## vari -15.743 9.093 -1.731 0.0834 .
## tms 6.946 2.426 2.863 0.0042 **
## mr -4.778 3.063 -1.560 0.1188
## rnf -12.198 5.788 -2.107 0.0351 *
## emd 7.778 3.254 2.390 0.0168 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 271.71 on 195 degrees of freedom
## Residual deviance: 134.90 on 187 degrees of freedom
## AIC: 152.9
##
## Number of Fisher Scoring iterations: 6
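As a complement to the GLM p-values above, a minimal sketch (not part of the original analysis; glauc.rf.imp is a hypothetical name) of random forest variable importance on the same reduced predictor set; varImpPlot() ranks the predictors by mean decrease in node impurity.
set.seed(12345)
glauc.rf.imp <- randomForest(form_glauc,
                             data = transform(GlaucomaM, Class = factor(Class)))
#importance scores and a ranked plot of the predictors
importance(glauc.rf.imp)
varImpPlot(glauc.rf.imp, main = "Variable importance for predicting glaucoma")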