The problems on this assignment require you to use regression/classification trees to predict a response variable from a given set of covariate variables. Answer all questions specified on the problem but if you see something interesting and want to do more analysis please report it as well. Don’t forget to include discussion and to make your plots look nice and presentable. Submit your .rmd file with the knitted PDF (or knitted Word Document saved as a PDF). If you are still having trouble with .rmd, let us know and we will help you, but we are going to start requiring this from here on out.
The BostonHousing dataset reported by Harrison and Rubinfeld (1978) is available as a data.frame in the mlbench package (Leisch and Dimitriadou, 2009). The goal here is to predict the median value of owner-occupied homes in USD 1000s (the medv variable) from the other predictors in the dataset. Use this dataset to do the following.
Construct a regression tree using rpart(). The following need to be included in your discussion. How many nodes did your tree have? Did you prune the tree? Did it decrease the number of nodes? What is the prediction error (calculate MSE)? Provide the predicted vs observed plot.
#load all libraries
library(vcd)
library(lattice)
library(randomForest)
library(party)
library(partykit)
library(mboost)
library(TH.data)
library(ipred)
library(rpart)
library(mlbench)
library(ggplot2)
library(dplyr)
library(tidyr)
library(boot)
library(fastAdaboost)
We will use rpart() from the rpart package to build a regression tree on all predictor variables and plot the resulting tree.
data(BostonHousing)
set.seed(12345)
#build a decision tree with rpart and plot the tree
p1.1.tree <- rpart(medv~.,data=BostonHousing,control = rpart.control( minsplit = 10))
plot(as.party(p1.1.tree),
tp_args = list(id = FALSE))
From the tree plot it appears the fitted tree has 9 terminal nodes. We will confirm with the CP plot and table below that this size also has the lowest cross-validated error (xerror).
#plot the cp of the tree
plotcp(p1.1.tree)
The CP plot and table confirm that the lowest xerror, 0.1933019, occurs at a tree size of 9 terminal nodes (8 splits).
#print the cp table of the tree
p1.1.tree$cptable
## CP nsplit rel error xerror xstd
## 1 0.45274420 0 1.0000000 1.0039957 0.08324106
## 2 0.17117244 1 0.5472558 0.6305946 0.05607611
## 3 0.07165784 2 0.3760834 0.4074182 0.04529558
## 4 0.05900152 3 0.3044255 0.3704079 0.04521515
## 5 0.03375589 4 0.2454240 0.2860290 0.03408051
## 6 0.02661300 5 0.2116681 0.2250617 0.02338934
## 7 0.02357238 6 0.1850551 0.2165841 0.02301406
## 8 0.01085935 7 0.1614827 0.1990835 0.02300119
## 9 0.01000000 8 0.1506234 0.1933019 0.02287841
As can be seen from the CP table, there is no reason to prune the tree because the current tree already has the lowest available xerror. We will use the predict function to obtain the fitted values of the tree and then use these values to calculate and print the MSE.
#use the predict function to obtain the tree's fitted values
pred <- predict(p1.1.tree, newdata = BostonHousing)
#calculate the (in-sample) MSE of the tree's fitted values
MSE_p1_tree <- round(mean((BostonHousing$medv - pred)^2), 3)
#print the MSE
MSE_p1_tree
## [1] 12.716
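The MSE above is computed on the same observations used to grow the tree, so it is an in-sample estimate. As a rough check, the following sketch (not part of the reported results; MSE_p1_tree_cv is a hypothetical object name) uses rpart's xpred.rpart() to obtain 10-fold cross-validated predictions; the column with the smallest cp corresponds to the unpruned tree fit above.
set.seed(12345)
#10-fold cross-validated predictions for each cp value in the fitted tree
xp <- xpred.rpart(p1.1.tree, xval = 10)
#take the column for the smallest cp, i.e. the full 9-leaf tree
cv_pred <- xp[, which.min(as.numeric(colnames(xp)))]
MSE_p1_tree_cv <- round(mean((BostonHousing$medv - cv_pred)^2), 3)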
The predicted vs observed plot below shows noticeable prediction error in the fitted values.
#create a data.frame of the BostonHousing medv variable with the predicted values
bhousetree <- as.data.frame(cbind(medv=BostonHousing$medv,predicted=pred))
#use ggplot2 to build a predicted vs observed plot of the regression tree
ggplot(data=bhousetree,aes(x=medv,y=predicted)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title='Predicted vs Observed Plot of Regression Tree - Boston Housing Data',
x='Observed Value of Median Value of Owner Occupied Homes',
y='Predicted Value of Median Value of Owner Occupied Homes')
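Although pruning was not needed for this fit, a minimal sketch of how one would prune is given below for completeness; it follows the same pattern used later for the glaucoma tree (p1.1.pruned is a hypothetical object name).
#identify the cp value with the smallest cross-validated error and prune there
opt <- which.min(p1.1.tree$cptable[, "xerror"])
p1.1.pruned <- prune(p1.1.tree, cp = p1.1.tree$cptable[opt, "CP"])
#for this fit, pruning returns the same 9-leaf tree, since the full tree
#already attains the minimum xerror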
Perform bagging with 50 trees. Report the prediction error (MSE). Provide the predicted vs observed plot.
#Bagging for the BostonHousing data: draw 50 multinomial bootstrap weight
#vectors and refit the base tree with each set of case weights
trees1 <- vector(mode = "list", length = 50)
n1 <- nrow(BostonHousing)
bootsamples1 <- rmultinom(length(trees1), n1, rep(1, n1)/n1)
p1.2.tree <- rpart(medv ~ ., data = BostonHousing, control = rpart.control(xval = 0))
for (i in 1:length(trees1))
trees1[[i]] <- update(p1.2.tree, weights = bootsamples1[,i])
#use the predict function to predict medv; note these predictions come from the
#single base tree p1.2.tree, not from averaging the 50 bootstrap trees in trees1
pred2 <- predict(p1.2.tree, newdata = BostonHousing)
#calculate the MSE
MSE_p2_bagging <- round(mean((BostonHousing$medv - pred2)^2), 3)
The MSE from this bagging example is higher than that of the single tree above, which at first seems odd. The explanation is that the predictions come from the single base tree p1.2.tree, which was grown with rpart's default settings rather than minsplit = 10, and they are not averaged over the 50 bootstrap trees; a sketch of the averaged ensemble prediction is given after the MSE below.
MSE_p2_bagging
## [1] 16.245
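For reference, a minimal sketch of aggregating the bootstrap ensemble: each observation's prediction is averaged over the 50 trees stored in trees1. The objects p_boot, pred2_bag, and MSE_p2_bagged_avg are hypothetical names, and the resulting MSE is not part of the results reported above.
#predict medv with each bootstrap tree and average across the ensemble
p_boot <- sapply(trees1, predict, newdata = BostonHousing)
pred2_bag <- rowMeans(p_boot)
#the bagged MSE could then be computed exactly as above
MSE_p2_bagged_avg <- round(mean((BostonHousing$medv - pred2_bag)^2), 3)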
The predicted vs observed plot for this fit shows that the predicted values are scattered more widely around the observed medv values than those of the tree in part (a).
#bind the predicted values with the medv variable for the plot df
bhousebag <- as.data.frame(cbind(medv=BostonHousing$medv,predicted=pred2))
#use ggplot to create a predicted vs observed plot
ggplot(data=bhousebag,aes(x=medv,y=predicted)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title='Predicted vs Observed Plot of Regression Tree Bagging - Boston Housing Data',
x='Observed Value of Median Value of Owner Occupied Homes',
y='Predicted Value of Median Value of Owner Occupied Homes')
Use the randomForest() function in R to perform bagging. Report the prediction error (MSE). Was it the same as (b)? If they are different, what do you think caused it? Provide the predicted vs observed plot. (For bagging, set mtry equal to the number of predictor variables.)
Answer - The difference between a random forest and bagging is that bagging considers all of the features at every split, whereas a random forest considers only a random subset of the features at each split. Because of this, the bagged trees can be hurt by multicollinearity, overfitting, or other problems among the covariates. If the random forest returns a lower error, it is because the bagged trees repeatedly use covariates that actually increase the error. Setting mtry to 13 means all 13 predictors are candidates at every split, which makes randomForest() equivalent to bagging.
p1.3.rf <- randomForest(medv ~ ., data = BostonHousing, mtry = 13)
#with no newdata, predict() returns the out-of-bag predictions
pred3 <- predict(p1.3.rf)
bhouserf <- as.data.frame(cbind(medv = BostonHousing$medv, predicted = pred3))
MSE_p3_rf <- round(mean((BostonHousing$medv - pred3)^2), 3)
MSE_p3_rf
## [1] 10.499
The predicted vs observed plot below shows a noticeably tighter fit, and thus lower error, than the bagging example when mtry is set to 13.
ggplot(data=bhouserf,aes(x=medv,y=predicted)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title='Predicted vs Observed Plot of Random Forest (mtry = 13) - Boston Housing Data',
x='Observed Value of Median Value of Owner Occupied Homes',
y='Predicted Value of Median Value of Owner Occupied Homes')
Use randomForest() function in R to perform random forest. Report the prediction error (MSE). Provide the predicted vs observed plot.
Without specifying mtry, randomForest() uses its regression default of floor(p/3) = 4 variables per split, so each split considers only a random subset of the predictors rather than all 13. The lower out-of-bag MSE lends some support to the hypothesis above that some of the variables in the data set are not helpful for predicting the median value of owner-occupied homes.
p1.3.rf2 <- randomForest(medv ~ ., data = BostonHousing)
#out-of-bag predictions again, now with the default mtry
pred4 <- predict(p1.3.rf2)
bhouserf2 <- as.data.frame(cbind(medv = BostonHousing$medv, predicted = pred4))
MSE_p3_rf2 <- round(mean((BostonHousing$medv - pred4)^2), 3)
MSE_p3_rf2
## [1] 9.742
ggplot(data=bhouserf2,aes(x=medv,y=predicted)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title='Predicted vs Observed Plot of Random Forest (default mtry) - Boston Housing Data',
x='Observed Value of Median Value of Owner Occupied Homes',
y='Predicted Value of Median Value of Owner Occupied Homes')
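For completeness, a sketch of how the mtry tuning parameter could be explored (this step is not part of the results above): randomForest::tuneRF() searches over mtry values using the out-of-bag error, starting from the regression default of 4.
set.seed(12345)
#search over mtry, doubling/halving from the default of 4, judged by OOB error
tuneRF(x = BostonHousing[, names(BostonHousing) != "medv"],
       y = BostonHousing$medv,
       mtryStart = 4, ntreeTry = 200, stepFactor = 2, improve = 0.01)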
Provide a table containing each method and associated MSE. Which method is more accurate?
Answer - The most accurate model, as described above, is the random forest without mtry specified. This is because the random forest selects covariates from a random subset at each split instead of always choosing among all 13 covariates. This suggests that some covariates either correlate with one another or are not helpful for predicting the median value of owner-occupied homes, so the bagged model is probably overfit. Note also that the tree and bagging MSEs are in-sample while the random forest MSEs are out-of-bag estimates; a comparison on a common held-out test set is sketched after the table below.
p1_table <- as.data.frame(cbind(Trees=MSE_p1_tree,Bagging=MSE_p2_bagging,
RandomForest_mtry13=MSE_p3_rf,RandomForest=MSE_p3_rf2))
p1_table
## Trees Bagging RandomForest_mtry13 RandomForest
## 1 12.716 16.245 10.499 9.742
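Because the table mixes in-sample and out-of-bag errors, a minimal sketch of a more even-handed comparison on a single held-out test set follows. The 70/30 split and the object names (bh_train, bh_test, test_mse, fit_tree, fit_bag, fit_rf) are assumptions for illustration and were not used for the results above.
set.seed(12345)
train_i <- sample(seq_len(nrow(BostonHousing)), size = floor(0.7 * nrow(BostonHousing)))
bh_train <- BostonHousing[train_i, ]
bh_test <- BostonHousing[-train_i, ]
test_mse <- function(pred) round(mean((bh_test$medv - pred)^2), 3)
fit_tree <- rpart(medv ~ ., data = bh_train, control = rpart.control(minsplit = 10))
fit_bag <- randomForest(medv ~ ., data = bh_train, mtry = 13)  #bagging: all 13 predictors per split
fit_rf <- randomForest(medv ~ ., data = bh_train)              #default mtry
c(Tree = test_mse(predict(fit_tree, newdata = bh_test)),
  Bagging = test_mse(predict(fit_bag, newdata = bh_test)),
  RandomForest = test_mse(predict(fit_rf, newdata = bh_test)))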
Consider the glaucoma data (data = “GlaucomaM”, package = “TH.data”).
Build a logistic regression model. Note that most of the predictor variables are highly correlated. Hence, logistic regression model using the whole set of variables will not work here as it is sensitive to correlation.
Here we will use the step() function to step through the model and find the combination of variables with the lowest AIC. Using the formula() function we will capture the formula for this model and use it throughout the problem; the printout of the formula is below. Before performing stepwise regression we will remove variables that are highly correlated (pairwise correlation above 0.8) with another predictor.
data(GlaucomaM)
#removal of highly correlated variables
#create an object class containing the class variable and remove it from the data set
Class <- GlaucomaM$Class
Glauc <- GlaucomaM[,!names(GlaucomaM) %in% 'Class']
#create a correlation matrix of the data set and zero out the upper triangle
#and the diagonal so each pair of variables is considered only once
tmp <- cor(Glauc)
tmp[upper.tri(tmp)] <- 0
diag(tmp) <- 0
#drop any variable whose correlation with another predictor exceeds 0.8,
#then add the class variable back
GlaucomaM <- Glauc[,!apply(tmp,2,function(x) any(x > 0.8))]
GlaucomaM <- cbind(GlaucomaM,Class)
#set class to binary
GlaucomaM <- GlaucomaM %>%
mutate(Class = ifelse(Class=='glaucoma',1,0))
#use stepwise regression for the glm
glauc_glm <- step(glm(Class ~., data = GlaucomaM, family = "binomial"),trace=F)
#extract the formula with the lowest aic
form_glauc <- formula(glauc_glm)
form_glauc
## Class ~ phct + phci + vbrn + vari + tms + mr + rnf + emd
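As a cross-check on the correlation filter above (which, as written, screens only positive correlations greater than 0.8), caret::findCorrelation() flags variables by absolute pairwise correlation, including strong negative correlations. This sketch assumes the caret package is installed (it is not loaded above), and the filtered set it suggests may differ from the one used here; drop_cols and Glauc_alt are hypothetical names.
#columns recommended for removal at |r| > 0.8
drop_cols <- caret::findCorrelation(cor(Glauc), cutoff = 0.8, names = TRUE)
Glauc_alt <- Glauc[, !names(Glauc) %in% drop_cols]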
The error rate for the logistic regression model using 10-fold cross-validation is 0.1.
#set the number of folds and create an empty vector for the fold error rates
n_folds <- 10
folds_i <- sample(rep(1:n_folds, length.out = nrow(GlaucomaM)))
cv_error <- rep(NA, n_folds)
#loop through the folds, fit the model on the training folds, and compute the
#misclassification rate on the held-out fold from its confusion matrix
for (k in 1:n_folds) {
test_i <- which(folds_i == k)
train <- GlaucomaM[-test_i, ]
test <- GlaucomaM[test_i, ]
model <- glm(form_glauc,data=train,family='binomial')
pred <- predict(model,test,type='response')
conf <- table("actual"=test$Class,
"predicted"=pred>0.5)
cv_error[k] <- round(1 - sum(diag(conf)) / nrow(test), 4)
}
#average the misclassification rate over all folds
glauc.glm.error <- mean(cv_error)
glauc.glm.error
## [1] 0.1
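As a cross-check on the hand-rolled loop above, boot::cv.glm() (the boot package is already loaded) gives the same kind of 10-fold estimate; the cost function below converts the cross-validated predicted probabilities into a misclassification rate at a 0.5 cutoff. This is only a sketch (cost_01 is a hypothetical helper) and its estimate may differ slightly from the 0.1 reported above.
set.seed(12345)
#0/1 loss at a 0.5 probability cutoff
cost_01 <- function(r, pi) mean(abs(r - pi) > 0.5)
cv.glm(GlaucomaM, glauc_glm, cost = cost_01, K = 10)$delta[1]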
Find a function (package in R) that can conduct the “adaboost” ensemble modeling. Use it to predict glaucoma and report error rate.
Answer - We will use the adaboost() function from the fastAdaboost package. Interestingly, this algorithm predicts glaucoma perfectly on the data it was fit to, with a zero error rate.
adaboost <- adaboost(form_glauc,data=GlaucomaM,nIter=100)
ada.pred <- predict(adaboost,newdata=GlaucomaM)
ada.pred$error
## [1] 0
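Since the zero error above is computed on the same observations used to fit the model, a quick holdout check is worth sketching. The 70/30 split and the names tr and ada_holdout are assumptions for illustration, not part of the original analysis.
set.seed(12345)
tr <- sample(seq_len(nrow(GlaucomaM)), size = floor(0.7 * nrow(GlaucomaM)))
ada_holdout <- adaboost(form_glauc, data = GlaucomaM[tr, ], nIter = 100)
#error rate on the 30% of observations not used for fitting
predict(ada_holdout, newdata = GlaucomaM[-tr, ])$error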
Report the error rates based on single tree, bagging and random forest. (A table would be great for this)
glauc.tree <- rpart(form_glauc,data=GlaucomaM,control = rpart.control( minsplit = 10))
plot(as.party(glauc.tree),
tp_args = list(id = FALSE))
The CP plot below shows that the cross-validated error increases again after the first split, so pruning needs to take place.
plotcp(glauc.tree)
Examining the xerror column, the minimum occurs at one split, i.e. a tree with two terminal nodes.
glauc.tree$cptable
## CP nsplit rel error xerror xstd
## 1 0.41853844 0 1.0000000 1.0109762 0.003344427
## 2 0.06359004 1 0.5814616 0.7188660 0.077847819
## 3 0.04630239 2 0.5178715 0.7751342 0.084885000
## 4 0.03670627 3 0.4715691 0.7730566 0.088126420
## 5 0.03537562 5 0.3981566 0.7761894 0.091971800
## 6 0.02305684 6 0.3627810 0.8317244 0.099554213
## 7 0.02093145 7 0.3397241 0.7948401 0.099770885
## 8 0.02083691 8 0.3187927 0.7937709 0.100329235
## 9 0.01834305 9 0.2979558 0.8002381 0.100443286
## 10 0.01438561 11 0.2612697 0.8372278 0.101765027
## 11 0.01000000 13 0.2324985 0.8530324 0.101768494
The tree produced by pruning has only two terminal nodes. A single split like this will not be very predictive for glaucoma.
opt <- which.min(glauc.tree$cptable[,"xerror"])
cp <- glauc.tree$cptable[opt, "CP"]
glauc.prune <- prune(glauc.tree, cp = cp)
plot(as.party(glauc.prune),
tp_args = list(id = FALSE))
The cross-validated error rate of roughly 0.51 suggests that we may have over-pruned the tree and reduced its overall predictive power.
n_folds <- 10
folds_i <- sample(rep(1:n_folds, length.out = nrow(GlaucomaM)))
cv_error <- rep(NA, n_folds)
for (k in 1:n_folds) {
test_i <- which(folds_i == k)
train <- GlaucomaM[-test_i, ]
test <- GlaucomaM[test_i, ]
#grow a tree on the training folds and prune it at the cp with the lowest xerror
glauc.tree <- rpart(form_glauc,data=train,control = rpart.control(minsplit = 10))
opt <- which.min(glauc.tree$cptable[,"xerror"])
cp <- glauc.tree$cptable[opt, "CP"]
glauc.prune <- prune(glauc.tree, cp = cp)
#misclassification rate of the pruned tree on the held-out fold
pred <- predict(glauc.prune,test)
conf <- table("actual"=test$Class,"predicted"=pred>0.5)
cv_error[k] <- round(1 - sum(diag(conf)) / nrow(test), 4)
}
glauc.tree.error <- mean(cv_error)
glauc.tree.error
## [1] 0.51667
The random forest error shown below (the mean squared error of the predicted class values against the 0/1 labels) is substantially lower than the cross-validated error of the decision tree. In this example we did not set the mtry tuning parameter.
n_folds <- 10
folds_i <- sample(rep(1:n_folds, length.out = nrow(GlaucomaM)))
cv_error <- rep(NA, n_folds)
for (k in 1:n_folds) {
test_i <- which(folds_i == k)
train <- GlaucomaM[-test_i, ]
test <- GlaucomaM[test_i, ]
glauc.rf <- randomForest(form_glauc,data=train)
pred <- predict(glauc.rf,test)
#squared-error (Brier-type) loss of the predicted values against the 0/1 labels
cv_error[k] <- round(mean((test$Class - pred)^2), 3)
}
glauc.rf.error <- mean(cv_error)
glauc.rf.error
## [1] 0.0571
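The error above is a squared-error measure of the numeric 0/1 class. As an alternative sketch (not part of the reported results; glauc.rf.class is a hypothetical name), treating Class as a factor makes randomForest() perform classification, so printing the fit reports an out-of-bag misclassification rate and confusion matrix directly.
set.seed(12345)
glauc.rf.class <- randomForest(form_glauc,
                               data = transform(GlaucomaM, Class = factor(Class)))
#printing the fit shows the OOB error rate and confusion matrix
glauc.rf.class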
Bagging with 10-fold cross-validation produces an error rate very close to that of the GLM.
n_folds <- 10
folds_i <- sample(rep(1:n_folds, length.out = nrow(GlaucomaM)))
cv_error <- rep(NA, n_folds)
for (k in 1:n_folds) {
test_i <- which(folds_i == k)
train <- GlaucomaM[-test_i, ]
test <- GlaucomaM[test_i, ]
#draw 50 bootstrap weight vectors and refit the base tree with each
trees1 <- vector(mode = "list", length = 50)
n1 <- nrow(train)
bootsamples1 <- rmultinom(length(trees1), n1, rep(1, n1)/n1)
glauc.bag <- rpart(form_glauc, data = train, control = rpart.control(xval = 0))
for (i in 1:length(trees1)){
trees1[[i]] <- update(glauc.bag, weights = bootsamples1[,i])
}
#as in the Boston Housing bagging example, the predictions below come from the
#base tree glauc.bag rather than from averaging the bootstrap trees in trees1
pred <- predict(glauc.bag,test)
cv_error[k] <- round(mean((test$Class - pred)^2), 3)
}
glauc.bag.error <- mean(cv_error)
glauc.bag.error
## [1] 0.1097
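As a cross-check, the ipred package (already loaded) provides bagging(), which aggregates the bootstrap trees itself and, with coob = TRUE, reports an out-of-bag misclassification estimate. This is only a sketch (glauc.ipred is a hypothetical name) and will not reproduce the number above.
set.seed(12345)
glauc.ipred <- bagging(form_glauc,
                       data = transform(GlaucomaM, Class = factor(Class)),
                       nbagg = 50, coob = TRUE)
#out-of-bag misclassification estimate for the bagged classification trees
glauc.ipred$err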
The AdaBoost algorithm looks like the clear choice here with an error rate of zero. However, that error was not estimated with the same 10-fold cross-validation used for the other methods, so the random forest, which still shows substantial predictive power, may be the safer choice on new data from the same tests.
q2.table <- as.data.frame(rbind(glauc.glm.error,ada.pred$error,glauc.tree.error,glauc.rf.error,glauc.bag.error))
row.names(q2.table) <- c('GLM Error','AdaBoost Error','Tree Error','randomForest Error','Bagging Error')
q2.table
## V1
## GLM Error 0.10000
## AdaBoost Error 0.00000
## Tree Error 0.51667
## randomForest Error 0.05710
## Bagging Error 0.10970
Write a conclusion comparing the above results (use a table to report models and corresponding error rates). Which one is the best model?
Answer - From the above results, the AdaBoost algorithm had a zero error rate and predicted the Class variable perfectly. Again, since the same 10-fold cross-validation method was not applied to AdaBoost, the random forest may still be the more reliable model for predicting glaucoma.
From the above analysis, which variables seem to be important in predicting Glaucoma?
Answer - We will print the summary of the GLM and examine the covariates' statistical significance. The tms variable is the most significant, followed by emd, phci, vbrn, rnf, vari, phct, and mr. This means tms, the third moment superior measurement, carries the most weight when predicting glaucoma from these tests. A random forest variable importance sketch follows the model summary as a complement.
summary(glauc_glm)
##
## Call:
## glm(formula = Class ~ phct + phci + vbrn + vari + tms + mr +
## rnf + emd, family = "binomial", data = GlaucomaM)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.17940 -0.53719 0.01616 0.48980 3.08580
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.485 2.781 1.972 0.0486 *
## phct 5.571 3.306 1.685 0.0919 .
## phci 6.859 2.946 2.328 0.0199 *
## vbrn -5.171 2.366 -2.185 0.0289 *
## vari -15.743 9.093 -1.731 0.0834 .
## tms 6.946 2.426 2.863 0.0042 **
## mr -4.778 3.063 -1.560 0.1188
## rnf -12.198 5.788 -2.107 0.0351 *
## emd 7.778 3.254 2.390 0.0168 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 271.71 on 195 degrees of freedom
## Residual deviance: 134.90 on 187 degrees of freedom
## AIC: 152.9
##
## Number of Fisher Scoring iterations: 6
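As a complement to the GLM p-values above, a minimal sketch (not part of the original analysis; glauc.rf.imp is a hypothetical name) of random forest variable importance on the same reduced predictor set; varImpPlot() ranks the predictors by mean decrease in node impurity.
set.seed(12345)
glauc.rf.imp <- randomForest(form_glauc,
                             data = transform(GlaucomaM, Class = factor(Class)))
#importance scores and a ranked plot of the predictors
importance(glauc.rf.imp)
varImpPlot(glauc.rf.imp, main = "Variable importance for predicting glaucoma")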