For today, I will be demonstrating some of the uses of tree-based methods to investigate data. We will start with using trees for classification.

The data in question come from the Mushroom data set in the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/Mushroom. They were originally drawn from The Audubon Society Field Guide to North American Mushrooms (1981), G. H. Lincoff (Pres.), New York: Alfred A. Knopf. The data form a table of 8124 observations. The question of interest is whether each mushroom is edible or poisonous, as indicated in the first column with e for edible and p for poisonous. The predictive attributes in the table are as follows:

  1. cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s
  2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
  3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y
  4. bruises?: bruises=t,no=f
  5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s
  6. gill-attachment: attached=a,descending=d,free=f,notched=n
  7. gill-spacing: close=c,crowded=w,distant=d
  8. gill-size: broad=b,narrow=n
  9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y
  10. stalk-shape: enlarging=e,tapering=t
  11. stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=?
  12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
  13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
  14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
  15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
  16. veil-type: partial=p,universal=u
  17. veil-color: brown=n,orange=o,white=w,yellow=y
  18. ring-number: none=n,one=o,two=t
  19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z
  20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y
  21. population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y
  22. habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d

Please note that mushrooms whose edibility could not be determined were also labeled as poisonous. In addition, there are missing values for the eleventh predictor, ‘stalk-root’; these are noted with a “?”.

We will be using the rpart library to investigate the data.

library(rpart)

We will also be loading several other libraries that will assist us in making more elaborate tree plots.

library(rattle)
## Rattle: A free graphical interface for data mining with R.
## Version 3.4.1 Copyright (c) 2006-2014 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(rpart.plot)
library(RColorBrewer)

We will now load the data into memory and apply appropriate column names. In addition, we will add a predictor “missed” to account for the missing data: 1 indicates a missing stalk-root value, while 0 signals the opposite.

mushroom <- read.csv('agaricus-lepiota.data',header=F,sep=',')
names(mushroom) <- c('edible','cap_shape','cap_surface','cap_color','bruises','odor','gill_attach','gill_spacing','gill_size','gill_color','stalk_shape','stalk_root','stalk_surface_abo','stalk_surface_bel','stalk_color_abo','stalk_color_bel','veil_type','veil_color','ring_number','ring_type','spore_color','pop','habitat')
missed <- rep(NA,8124)
for(i in 1:8124){
    ifelse(mushroom$stalk_root[i] == '?',missed[i] <- 1, missed[i] <- 0)
}
mushroom$missed <- as.factor(missed)
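
The same indicator can be built without an explicit loop. Below is a vectorized sketch, equivalent to the loop above, that also tallies how many records are affected.

# Vectorized equivalent of the loop above: flag rows whose stalk_root is '?'.
mushroom$missed <- as.factor(ifelse(mushroom$stalk_root == '?', 1, 0))
# Quick check of how many records have a missing stalk-root value.
table(mushroom$missed)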

Next, we will split the data set into a training and test set, leaving a thousand records for testing while the rest are used for training the model.

train.m <- sample(8124,7124)
mush.train <- mushroom[train.m,]
mush.test <- mushroom[-train.m,]
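
Note that no seed is set before this split, so the exact records selected (and the confusion matrix reported later) will vary from run to run. A minimal sketch of a reproducible split is shown below; the seed value is arbitrary and not taken from the original analysis.

# Fixing a seed before sample() makes the split, and everything downstream
# of it, reproducible across runs.
set.seed(3)
train.m <- sample(8124,7124)
mush.train <- mushroom[train.m,]
mush.test <- mushroom[-train.m,]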

At this point, we will construct our model using the rpart() function and display the resulting splits both numerically and graphically.

mush.model <- rpart(edible~.,data=mush.train)
mush.model
## n= 7124 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 7124 3422 e (0.51965188 0.48034812)  
##   2) odor=a,l,n 3808  106 e (0.97216387 0.02783613)  
##     4) spore_color=b,h,k,n,o,u,w,y 3745   43 e (0.98851802 0.01148198) *
##     5) spore_color=r 63    0 p (0.00000000 1.00000000) *
##   3) odor=c,f,m,p,s,y 3316    0 p (0.00000000 1.00000000) *
# Unfortunately, plot() is not working well with this data for some reason.
# It may still be an option for other data sets, so I'll keep it below, but I will use the
# prp() function from rpart.plot instead.
# par(mar=c(1.25,1,1,.25)) # Adjustment to fit the graph.
# plot(mush.model,uniform=T,compress=T)
# text(mush.model)
prp(mush.model)

For a more elaborate plot of the model, we will use the fancyRpartPlot() function.

par(mar=c(5,4,4,2)+.1)
subtext <- 'e - edible; p - poisonous'
fancyRpartPlot(mush.model,main='Decision Tree of edible mushrooms',sub=subtext)

The plotcp() function lets us check whether the tree model needs pruning.

plotcp(mush.model)
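
For reference, if the plot did suggest pruning, one common recipe is to prune at the cp value that minimizes the cross-validated error in the model's cp table. A sketch of that step, shown here only as a template:

# Prune at the cp with the lowest cross-validated error (xerror).
best.cp <- mush.model$cptable[which.min(mush.model$cptable[, "xerror"]), "CP"]
mush.pruned <- prune(mush.model, cp = best.cp)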

However, in this case we will not prune the tree. With confidence in the model, we will now see how well it performs on our test data. It is safe to say that the results are encouraging.

table(predicted=predict(mush.model,newdata=mush.test,type='class'),actual=mush.test$edible)
##          actual
## predicted   e   p
##         e 506   5
##         p   0 489
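
As a quick follow-up, the overall accuracy can be computed directly from the predictions. This is a small usage sketch; the exact figure depends on the random split above.

# Proportion of test mushrooms classified correctly.
mush.pred <- predict(mush.model,newdata=mush.test,type='class')
mean(mush.pred == mush.test$edible)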

We will now take a look at an example of a regression tree. The data we will use are from the NO2 data set in the StatLib Archive at http://lib.stat.cmu.edu/datasets/. The response is the hourly value of log(NO2) measured at a street with car traffic; the remaining variables are traffic and meteorological measurements. The variables in the data set are as follows:

  1. hourly values of log(no2)
  2. log(# of cars/hr)
  3. temp 2 meters above ground (celsius)
  4. wind speed (meters/sec.)
  5. temp diff between 25 and 2 meters above ground (celsius)
  6. wind direction (degrees 0 to 360)
  7. hour of day
  8. day number from 10/1/01

We will, again, read the data into memory, apply appropriate column names, and split the set into training and test sets. At random, 400 records are chosen as a training set, while the remaining 100 are left as a test set.

set.seed(4)
no2 <- read.csv('NO2.dat',header=F,sep='')
names(no2) <- c('no2_val','log_cars','temp2','wind','temp252','winddir','hour','daynum')
train.n <- sample(500,400)
no2.train <- no2[train.n,]
no2.test <- no2[-train.n,]

We will create our model using all variables and observe the error and R^2 values at each split.

no2.model <- rpart(no2_val~.,data=no2.train)
plotcp(no2.model)

rsq.rpart(no2.model)
## 
## Regression tree:
## rpart(formula = no2_val ~ ., data = no2.train)
## 
## Variables actually used in tree construction:
## [1] log_cars temp2    temp252  wind     winddir 
## 
## Root node error: 219.65/400 = 0.54913
## 
## n= 400 
## 
##          CP nsplit rel error  xerror     xstd
## 1  0.253574      0   1.00000 1.00710 0.086479
## 2  0.123478      1   0.74643 0.75542 0.066957
## 3  0.054535      2   0.62295 0.67950 0.058598
## 4  0.046293      3   0.56841 0.70148 0.059206
## 5  0.030904      4   0.52212 0.63616 0.050252
## 6  0.019002      5   0.49122 0.63014 0.050787
## 7  0.018667      6   0.47221 0.65221 0.051037
## 8  0.017713      7   0.45355 0.64957 0.051752
## 9  0.016306      9   0.41812 0.64130 0.051170
## 10 0.015141     10   0.40182 0.63155 0.051080
## 11 0.012428     11   0.38667 0.62289 0.049650
## 12 0.012240     12   0.37425 0.62214 0.049583
## 13 0.010367     13   0.36201 0.61138 0.049224
## 14 0.010000     14   0.35164 0.60948 0.048905
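
For reference, the cp table above is also what the common one-standard-error rule reads from: choose the smallest tree whose xerror lies within one xstd of the minimum. A sketch of applying that rule to this fit:

# One-standard-error rule applied to the fitted tree's cp table.
cpt <- no2.model$cptable
min.row <- which.min(cpt[, "xerror"])
threshold <- cpt[min.row, "xerror"] + cpt[min.row, "xstd"]
cp.1se <- cpt[cpt[, "xerror"] <= threshold, "CP"][1]
no2.prune.1se <- prune(no2.model, cp = cp.1se)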

At this time we won’t do any pruning. With our model at hand, we can go about making plots of the splits.

par(mar=c(1.25,1,1,.25))
plot(no2.model,uniform=T,compress=T)
text(no2.model,pretty=NULL,cex=.8)

fancyRpartPlot(no2.model)

We can also figure out the mean squared error for this model against our test data.

mean((predict(no2.model,newdata=no2.test) - no2.test$no2_val)^2)
## [1] 0.3079992

Pruning can sometimes improve results. Here we will prune using cp = .014 and observe the mean squared error.

no2.prune <- prune(no2.model,cp=.014)
mean((predict(no2.prune,newdata=no2.test) - no2.test$no2_val)^2)
## [1] 0.361854

We will now examine the no2 data using bagging, which can be done in R through the randomForest library. Keep in mind that for bagging, all of the predictive variables are considered at each split of the tree, so mtry is set to the full number of predictors (seven here).
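
As a quick sanity check that mtry matches the number of predictors, the count can be computed from the data frame itself rather than hard-coded; the literal 7 used in the call below is equivalent.

# With the response in column 1 of no2.train, the remaining columns are the
# predictors.
p <- ncol(no2.train) - 1
p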

library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
set.seed(5)
no2.bag <- randomForest(no2_val~.,data=no2.train,mtry=7,importance=T)
no2.bag
## 
## Call:
##  randomForest(formula = no2_val ~ ., data = no2.train, mtry = 7,      importance = T) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##           Mean of squared residuals: 0.2236494
##                     % Var explained: 59.27

With our model in place, we can compare the predicted values from the model to the actual values in the test set. Here, we will plot the two sets of values and draw a reference line (y = x) to gauge how closely they agree.

no2.bag.pred <- predict(no2.bag,newdata=no2.test)
plot(no2.bag.pred,no2.test$no2_val,pch=10)
abline(0,1,col='red')

Finally, we can derive the mean squared error, comparing predicted to actual results.

mean((no2.bag.pred - no2.test$no2_val)^2)
## [1] 0.2089594

We can do something similar with a random forest, calculating the mean squared error for the model. However, we will consider only three variables at each split.

set.seed(6)
no2.rf <- randomForest(no2_val~.,data=no2.train,mtry=3,importance=T)
no2.rf
## 
## Call:
##  randomForest(formula = no2_val ~ ., data = no2.train, mtry = 3,      importance = T) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 0.2251164
##                     % Var explained: 59.01
no2.rf.pred <- predict(no2.rf,newdata=no2.test)
mean((no2.rf.pred - no2.test$no2_val)^2)
## [1] 0.2126008
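
The choice of mtry = 3 is somewhat arbitrary. A rough way to compare candidate values is to refit over each possible mtry and check the test-set error; the sketch below uses a single split and a single seed, so it is only a coarse guide.

# Test-set MSE for each candidate mtry from 1 to 7.
mtry.mse <- sapply(1:7, function(m) {
  set.seed(6)
  fit <- randomForest(no2_val ~ ., data = no2.train, mtry = m)
  mean((predict(fit, newdata = no2.test) - no2.test$no2_val)^2)
})
mtry.mse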

In addition, we can look at the relative importance of each variable in the model, measured both by the decrease in prediction accuracy on out-of-bag samples when that variable is permuted and by the total decrease in node impurity from splits on that variable. This information can be found using the importance() and varImpPlot() functions.

varImpPlot(no2.rf)

importance(no2.rf)
##           %IncMSE IncNodePurity
## log_cars 54.82355      60.94893
## temp2    22.75752      25.17390
## wind     31.50442      39.08811
## temp252  22.16151      23.42393
## winddir  15.61321      23.96181
## hour     16.11011      24.17910
## daynum    7.74120      14.52980

The last method we will look at is boosting, which can be done through the gbm package. Again, we will fit a model and assess the relative importance of each predictive variable through the summary() function. We will also construct partial dependence plots for the top two variables, showing the marginal effect of each variable once the others have been integrated out.

library(gbm)
set.seed(7)
no2.boost <- gbm(no2_val~.,data=no2.train,distribution='gaussian',n.trees=5000,interaction.depth=4)
summary(no2.boost)

##               var   rel.inf
## log_cars log_cars 37.545525
## wind         wind 20.271634
## temp2       temp2 12.179537
## temp252   temp252 12.138150
## winddir   winddir 10.882722
## daynum     daynum  4.131832
## hour         hour  2.850600
par(mfrow=c(1,2))
plot(no2.boost,i='log_cars')
plot(no2.boost,i='wind')

We can also use the model to make predictions and compare them against the test data.

no2.boost.pred <- predict(no2.boost,newdata=no2.test,n.trees=5000)
mean((no2.boost.pred - no2.test$no2_val)^2)
## [1] 0.2190693

Note that the default shrinkage value is .001. Next, we will try .01 and .1 as shrinkage values and see whether the mean squared error improves.

no2.boost.2 <- gbm(no2_val~.,data=no2.train,distribution='gaussian',n.trees=5000,interaction.depth=4,shrinkage=.01,verbose=F)
no2.boost.2.pred <- predict(no2.boost.2,newdata=no2.test,n.trees=5000)
mean((no2.boost.2.pred - no2.test$no2_val)^2)
## [1] 0.2253821
no2.boost.3 <- gbm(no2_val~.,data=no2.train,distribution='gaussian',n.trees=5000,interaction.depth=4,shrinkage=.1,verbose=F)
no2.boost.3.pred <- predict(no2.boost.3,newdata=no2.test,n.trees=5000)
mean((no2.boost.3.pred - no2.test$no2_val)^2)
## [1] 0.2601128
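
One likely reason the larger shrinkage values do worse here is that all 5000 trees are still used at prediction time; with a faster learning rate, fewer trees are usually appropriate. As a sketch, gbm.perf() can estimate a reasonable number of trees for the shrinkage = .1 fit; the out-of-bag method is used because the models above were not fit with cv.folds.

# Estimate a sensible number of trees for no2.boost.3 and re-evaluate.
# OOB estimates tend to be conservative, but they require no refitting.
best.iter <- gbm.perf(no2.boost.3, method = 'OOB', plot.it = FALSE)
no2.boost.3.pred.b <- predict(no2.boost.3, newdata = no2.test, n.trees = best.iter)
mean((no2.boost.3.pred.b - no2.test$no2_val)^2)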