Data Set 1: Death Penalty

This data set contains 362 cases, each described by three variables: Agg, an aggravation index measuring the severity of the crime (6 being the most severe); VRace, the race of the victim (1 = white, 0 = black); and Death, whether the defendant was given the death penalty (1) or not (0).

library(MASS) 
library(tree)

dpen <- read.csv("C:/DataMining/Data/DeathPenalty.csv")
# Recode the 0/1 outcome as a labelled factor in a single step
dpen$Death <- factor(dpen$Death, levels = c(0, 1), labels = c("NonDeathPenelty", "DeathPenelty"))
head(dpen)
##   Agg VRace           Death
## 1   1     1    DeathPenelty
## 2   1     1    DeathPenelty
## 3   1     1 NonDeathPenelty
## 4   1     1 NonDeathPenelty
## 5   1     1 NonDeathPenelty
## 6   1     1 NonDeathPenelty

This first tree is the default tree that R produces. It has six terminal nodes and splits only on Agg and VRace. It is not very complex, but it can still be simplified.

dpentree <- tree(Death~.,data=dpen)
dpentree
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 362 321.90 NonDeathPenelty ( 0.83702 0.16298 )  
##    2) Agg < 2.5 283  58.12 NonDeathPenelty ( 0.97880 0.02120 )  
##      4) Agg < 1.5 244  32.35 NonDeathPenelty ( 0.98770 0.01230 ) *
##      5) Agg > 1.5 39  21.15 NonDeathPenelty ( 0.92308 0.07692 ) *
##    3) Agg > 2.5 79 100.10 DeathPenelty ( 0.32911 0.67089 )  
##      6) Agg < 4.5 42  57.84 NonDeathPenelty ( 0.54762 0.45238 )  
##       12) VRace < 0.5 17  18.55 NonDeathPenelty ( 0.76471 0.23529 ) *
##       13) VRace > 0.5 25  33.65 DeathPenelty ( 0.40000 0.60000 ) *
##      7) Agg > 4.5 37  20.82 DeathPenelty ( 0.08108 0.91892 )  
##       14) VRace < 0.5 11  12.89 DeathPenelty ( 0.27273 0.72727 ) *
##       15) VRace > 0.5 26   0.00 DeathPenelty ( 0.00000 1.00000 ) *
plot(dpentree,col=8)
text(dpentree, cex=.75)
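
As a check on how the deviance column can be read, the sketch below recomputes the root-node deviance from class counts. It assumes the usual multinomial deviance for a classification node, -2 times the sum over classes of n_k log(p_k). The counts 303 and 59 are back-calculated from the printed proportions, and the helper name node_deviance is hypothetical.

```r
# Hypothetical helper: multinomial deviance of one node from its class counts.
node_deviance <- function(counts) {
  counts <- counts[counts > 0]          # empty classes contribute nothing
  p <- counts / sum(counts)
  -2 * sum(counts * log(p))
}

# Root node: 362 cases, about 83.7% NonDeathPenelty (303) and 16.3% DeathPenelty (59).
root_dev <- node_deviance(c(303, 59))
round(root_dev, 1)

# A pure node such as node 15 (26 cases, all DeathPenelty) has deviance 0.
pure_dev <- node_deviance(c(0, 26))
pure_dev
```

The first value should agree with the root deviance in the printed tree up to rounding, and the pure node explains the 0.00 shown for node 15.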

The summary for the tree shows a residual mean deviance of 0.3331 and a misclassification error rate of about 6.4% (23 of 362 cases). Case number 127 is shown as an example: with Agg = 1 and VRace = 0 it falls in terminal node 4, which predicts NonDeathPenelty, matching the actual outcome.

summary(dpentree)
## 
## Classification tree:
## tree(formula = Death ~ ., data = dpen)
## Number of terminal nodes:  6 
## Residual mean deviance:  0.3331 = 118.6 / 356 
## Misclassification error rate: 0.06354 = 23 / 362
dpen[127,]
##     Agg VRace           Death
## 127   1     0 NonDeathPenelty
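
The two summary figures can be rebuilt from the terminal nodes of the printed tree. The sketch below is only an arithmetic check under that assumption: the leaf table is typed in by hand from the output, and the column name p_minor (share of the minority class in a leaf) is hypothetical.

```r
# Terminal nodes copied from the printed tree: size, deviance, share of the minority class.
leaves <- data.frame(
  node     = c(4, 5, 12, 13, 14, 15),
  n        = c(244, 39, 17, 25, 11, 26),
  deviance = c(32.35, 21.15, 18.55, 33.65, 12.89, 0.00),
  p_minor  = c(0.01230, 0.07692, 0.23529, 0.40000, 0.27273, 0.00000)
)

# Each leaf misclassifies the cases that are not in its majority class.
errors <- round(leaves$n * leaves$p_minor)
misclass_rate <- sum(errors) / sum(leaves$n)

# Residual mean deviance = total leaf deviance / (cases - number of leaves).
resid_mean_dev <- sum(leaves$deviance) / (sum(leaves$n) - nrow(leaves))

sum(errors)                 # misclassified cases
round(misclass_rate, 5)     # misclassification error rate
round(resid_mean_dev, 4)    # residual mean deviance
```

Both results should match the summary output: 23 misclassified cases out of 362 and a residual mean deviance of 118.6 / 356.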

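A simpler tree is created by snipping off the splits below nodes 2 and 7, where both child nodes predict the same class. The snipped tree has four terminal nodes and a slightly higher residual mean deviance (0.3663 versus 0.3331), while the misclassification error rate is unchanged at 23 / 362.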

dpensnip <- snip.tree(dpentree, nodes = c(2, 7))
dpensnip
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 362 321.90 NonDeathPenelty ( 0.83702 0.16298 )  
##    2) Agg < 2.5 283  58.12 NonDeathPenelty ( 0.97880 0.02120 ) *
##    3) Agg > 2.5 79 100.10 DeathPenelty ( 0.32911 0.67089 )  
##      6) Agg < 4.5 42  57.84 NonDeathPenelty ( 0.54762 0.45238 )  
##       12) VRace < 0.5 17  18.55 NonDeathPenelty ( 0.76471 0.23529 ) *
##       13) VRace > 0.5 25  33.65 DeathPenelty ( 0.40000 0.60000 ) *
##      7) Agg > 4.5 37  20.82 DeathPenelty ( 0.08108 0.91892 ) *
plot(dpensnip)
text(dpensnip)

summary(dpensnip)
## 
## Classification tree:
## snip.tree(tree = dpentree, nodes = c(2L, 7L))
## Number of terminal nodes:  4 
## Residual mean deviance:  0.3663 = 131.1 / 358 
## Misclassification error rate: 0.06354 = 23 / 362
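
The reason the error rate stays at 23 / 362 can be checked by hand: when two sibling leaves predict the same class, merging them into their parent leaves every prediction unchanged. The sketch below uses only numbers copied from the two printed trees; the variable names are hypothetical.

```r
# Misclassified cases = node size times the share of the minority class in that node.
errors_children_of_2 <- round(244 * 0.01230) + round(39 * 0.07692)
errors_node_2        <- round(283 * 0.02120)
errors_children_of_7 <- round(11 * 0.27273) + round(26 * 0.00000)
errors_node_7        <- round(37 * 0.08108)

c(errors_children_of_2, errors_node_2)   # same count before and after snipping node 2
c(errors_children_of_7, errors_node_7)   # same count before and after snipping node 7

# The deviance does change: the four leaves of the snipped tree give a larger total.
snipped_dev <- sum(c(58.12, 18.55, 33.65, 20.82))
snipped_mean_dev <- snipped_dev / (362 - 4)
round(snipped_mean_dev, 4)
```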

Data Set 2: House Prices

This data set includes prices and characteristics of n=128 houses.

hprice <- read.csv("C:/DataMining/Data/HousePrices.csv")
head(hprice)
##   HomeID  Price SqFt Bedrooms Bathrooms Offers Brick Neighborhood
## 1      1 114300 1790        2         2      2    No         East
## 2      2 114200 2030        4         2      3    No         East
## 3      3 114800 1740        3         2      1    No         East
## 4      4  94700 1980        3         2      3    No         East
## 5      5 119800 2130        3         3      3    No         East
## 6      6 114600 1780        3         2      2    No        North

The first regression tree has 10 terminal nodes. To read the first split: if the house is in the East (a) or North (b) neighborhood, continue to the left; if it is in the West (c) neighborhood, continue to the right. For Brick, “a” stands for No and “b” for Yes.

hptree <- tree(Price ~., data=hprice)
hptree
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 128 9.169e+10 130400  
##    2) Neighborhood: East,North 89 3.007e+10 117800  
##      4) SqFt < 2020 55 1.257e+10 110400  
##        8) Brick: No 40 6.656e+09 105800  
##         16) Offers < 2.5 17 1.151e+09 114500 *
##         17) Offers > 2.5 23 3.295e+09  99420 *
##        9) Brick: Yes 15 2.857e+09 122500 *
##      5) SqFt > 2020 34 9.617e+09 129800  
##       10) Brick: No 23 6.316e+09 123800  
##         20) Bathrooms < 2.5 10 1.351e+09 111700 *
##         21) Bathrooms > 2.5 13 2.373e+09 133100 *
##       11) Brick: Yes 11 7.527e+08 142300 *
##    3) Neighborhood: West 39 1.487e+10 159300  
##      6) Brick: No 23 4.024e+09 148200  
##       12) SqFt < 2010 9 3.002e+08 137000 *
##       13) SqFt > 2010 14 1.844e+09 155500 *
##      7) Brick: Yes 16 3.983e+09 175200  
##       14) Bedrooms < 3.5 8 4.316e+08 164100 *
##       15) Bedrooms > 3.5 8 1.580e+09 186300 *
plot(hptree, col=8)
text(hptree, digits=2)

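The regression tree’s summary shows a residual mean deviance of 1.35e+08. The value is large because it is measured in squared dollars; its square root, about $11,600, is a more interpretable measure of a typical prediction error. When case number 53 is run through the tree it lands in terminal node 21, giving a predicted price of $133,100 (the plot labels are rounded to two significant digits by digits=2), while the actual price was $117,400, an error of $15,700.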

summary(hptree)
## 
## Regression tree:
## tree(formula = Price ~ ., data = hprice)
## Variables actually used in tree construction:
## [1] "Neighborhood" "SqFt"         "Brick"        "Offers"      
## [5] "Bathrooms"    "Bedrooms"    
## Number of terminal nodes:  10 
## Residual mean deviance:  1.35e+08 = 1.594e+10 / 118 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -30320.0  -7756.0    450.6      0.0   6153.0  31460.0
hprice[53,]
##    HomeID  Price SqFt Bedrooms Bathrooms Offers Brick Neighborhood
## 53     53 117400 2150        2         3      4    No        North
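
To make the path of case 53 through the tree explicit, the sketch below hard-codes the splits and leaf values from the printed tree in a hypothetical helper, predict_price. It is an illustration of how the tree is read, not a call into the tree package; with the fitted object, predict(hptree, hprice[53,]) would be the usual route.

```r
# Hypothetical helper that hard-codes the splits of the printed tree.
predict_price <- function(neighborhood, sqft, brick, offers, bathrooms, bedrooms) {
  if (neighborhood %in% c("East", "North")) {
    if (sqft < 2020) {
      if (brick == "Yes") return(122500)
      if (offers < 2.5) 114500 else 99420
    } else {
      if (brick == "Yes") return(142300)
      if (bathrooms < 2.5) 111700 else 133100
    }
  } else {                                   # West
    if (brick == "No") {
      if (sqft < 2010) 137000 else 155500
    } else {
      if (bedrooms < 3.5) 164100 else 186300
    }
  }
}

# Case 53: North, 2150 sq ft, not brick, 4 offers, 3 bathrooms, 2 bedrooms.
predicted <- predict_price("North", 2150, "No", 4, 3, 2)
actual <- 117400
predicted - actual          # prediction error in dollars

# Square root of the residual mean deviance, as a typical error in dollars.
typical_error <- round(sqrt(1.35e8))
typical_error
```

Following the splits by hand gives node 21 (North, SqFt > 2020, Brick: No, Bathrooms > 2.5), whose yval in the printed tree is 133100.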

To see whether a less complex tree could produce a smaller deviance, the cross-validated deviance is plotted against tree size. The plot shows that the tree with 10 terminal nodes has the lowest cross-validated deviance, so a simpler tree can be created only at the expense of a higher deviance.

set.seed(2)
cvpst <- cv.tree(hptree, K=10)## 10 folds used
cvpst$size
## [1] 10  9  8  7  5  4  3  2  1
cvpst$dev
## [1] 31400030314 40194910574 40174107958 40103665878 42501546086 46137039945
## [7] 50541218968 50508083845 93160356249
plot(cvpst, pch=21, bg=8, type="p", cex=1.5) 
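
The size with the lowest cross-validated deviance can also be picked out numerically instead of read off the plot. The sketch below copies the two vectors from the output above so that it runs on its own; with the fitted object the same idea would be cvpst$size[which.min(cvpst$dev)]. The names cv_size and cv_dev are hypothetical.

```r
# Sizes and cross-validated deviances copied from the output above.
cv_size <- c(10, 9, 8, 7, 5, 4, 3, 2, 1)
cv_dev  <- c(31400030314, 40194910574, 40174107958, 40103665878, 42501546086,
             46137039945, 50541218968, 50508083845, 93160356249)

# Tree size with the smallest cross-validated deviance.
best_size <- cv_size[which.min(cv_dev)]
best_size

# Cross-validated deviance of each size relative to the best one.
relative_dev <- round(cv_dev / min(cv_dev), 2)
data.frame(size = cv_size, relative_dev = relative_dev)
```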

Data Set 3: Spam

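This data set is loaded as 4600 observations of 58 variables. The file has no header row and comes without variable labels or a comprehensive data dictionary. Because read.csv() uses header=TRUE by default, the first record of the file is taken as the column names; this is why the columns are called X0, X0.64, and so on, and why one of the file's 4601 records is missing from the data. Assuming the variables are listed in the same order, the first 57 are described here: http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.names. The last variable, X1, is the classification: 1 for spam and 0 for not spam.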

download.file("http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data","spambase.data")
spam<-read.csv("./spambase.data")
head(spam)
##     X0 X0.64 X0.64.1 X0.1 X0.32 X0.2 X0.3 X0.4 X0.5 X0.6 X0.7 X0.64.2 X0.8
## 1 0.21  0.28    0.50    0  0.14 0.28 0.21 0.07 0.00 0.94 0.21    0.79 0.65
## 2 0.06  0.00    0.71    0  1.23 0.19 0.19 0.12 0.64 0.25 0.38    0.45 0.12
## 3 0.00  0.00    0.00    0  0.63 0.00 0.31 0.63 0.31 0.63 0.31    0.31 0.31
## 4 0.00  0.00    0.00    0  0.63 0.00 0.31 0.63 0.31 0.63 0.31    0.31 0.31
## 5 0.00  0.00    0.00    0  1.85 0.00 0.00 1.85 0.00 0.00 0.00    0.00 0.00
## 6 0.00  0.00    0.00    0  1.92 0.00 0.00 0.00 0.00 0.64 0.96    1.28 0.00
##   X0.9 X0.10 X0.32.1 X0.11 X1.29 X1.93 X0.12 X0.96 X0.13 X0.14 X0.15 X0.16
## 1 0.21  0.14    0.14  0.07  0.28  3.47  0.00  1.59     0  0.43  0.43     0
## 2 0.00  1.75    0.06  0.06  1.03  1.36  0.32  0.51     0  1.16  0.06     0
## 3 0.00  0.00    0.31  0.00  0.00  3.18  0.00  0.31     0  0.00  0.00     0
## 4 0.00  0.00    0.31  0.00  0.00  3.18  0.00  0.31     0  0.00  0.00     0
## 5 0.00  0.00    0.00  0.00  0.00  0.00  0.00  0.00     0  0.00  0.00     0
## 6 0.00  0.00    0.96  0.00  0.32  3.85  0.00  0.64     0  0.00  0.00     0
##   X0.17 X0.18 X0.19 X0.20 X0.21 X0.22 X0.23 X0.24 X0.25 X0.26 X0.27 X0.28
## 1     0     0     0     0     0     0     0     0     0     0     0  0.07
## 2     0     0     0     0     0     0     0     0     0     0     0  0.00
## 3     0     0     0     0     0     0     0     0     0     0     0  0.00
## 4     0     0     0     0     0     0     0     0     0     0     0  0.00
## 5     0     0     0     0     0     0     0     0     0     0     0  0.00
## 6     0     0     0     0     0     0     0     0     0     0     0  0.00
##   X0.29 X0.30 X0.31 X0.33 X0.34 X0.35 X0.36 X0.37 X0.38 X0.39 X0.40 X0.41
## 1     0     0  0.00     0     0  0.00     0  0.00  0.00     0     0  0.00
## 2     0     0  0.06     0     0  0.12     0  0.06  0.06     0     0  0.01
## 3     0     0  0.00     0     0  0.00     0  0.00  0.00     0     0  0.00
## 4     0     0  0.00     0     0  0.00     0  0.00  0.00     0     0  0.00
## 5     0     0  0.00     0     0  0.00     0  0.00  0.00     0     0  0.00
## 6     0     0  0.00     0     0  0.00     0  0.00  0.00     0     0  0.00
##   X0.42 X0.43 X0.778 X0.44 X0.45 X3.756 X61 X278 X1
## 1 0.132     0  0.372 0.180 0.048  5.114 101 1028  1
## 2 0.143     0  0.276 0.184 0.010  9.821 485 2259  1
## 3 0.137     0  0.137 0.000 0.000  3.537  40  191  1
## 4 0.135     0  0.135 0.000 0.000  3.537  40  191  1
## 5 0.223     0  0.000 0.000 0.000  3.000  15   54  1
## 6 0.054     0  0.164 0.054 0.000  1.671   4  112  1
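
The column names above (X0, X0.64, ...) look like data values because read.csv() treats the first line of a file as a header unless told otherwise. The sketch below reproduces the effect on three made-up records in the same header-less layout; the values and the variable names are illustrative only.

```r
# Three made-up records in a header-less, comma-separated layout.
raw <- "0,0.64,1\n0.21,0.28,1\n0.06,0,0\n"

with_header    <- read.csv(text = raw)                  # default header = TRUE
without_header <- read.csv(text = raw, header = FALSE)

names(with_header)      # first record turned into column names
nrow(with_header)       # one record fewer than the text holds
names(without_header)   # default names V1, V2, V3
nrow(without_header)    # all three records kept
```

Reading the downloaded file with header = FALSE would keep every record, at the cost of column names V1 to V58 instead of the names used in the rest of this section.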

The first tree produced is fairly complex, with 10 terminal nodes. The hope is that it can be simplified further.

spmtree <- tree(X1 ~., data=spam) 
spmtree
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 4600 1098.00 0.39390  
##    2) X0.44 < 0.0555 3470  623.60 0.23490  
##      4) X0.3 < 0.055 3140  430.50 0.16400  
##        8) X0.778 < 0.378 2737  247.40 0.10050  
##         16) X0.32.1 < 0.2 2507  168.80 0.07260  
##           32) X0.15 < 0.01 2439  135.50 0.05904 *
##           33) X0.15 > 0.01 68   16.76 0.55880 *
##         17) X0.32.1 > 0.2 230   55.40 0.40430 *
##        9) X0.778 > 0.378 403   97.07 0.59550  
##         18) X278 < 55.5 182   37.14 0.28570 *
##         19) X278 > 55.5 221   28.07 0.85070 *
##      5) X0.3 > 0.055 330   27.27 0.90910  
##       10) X0.18 < 0.14 317   16.09 0.94640 *
##       11) X0.18 > 0.14 13    0.00 0.00000 *
##    3) X0.44 > 0.0555 1130  117.30 0.88230  
##      6) X0.16 < 0.4 1060   65.38 0.93400  
##       12) X0.38 < 0.49 1045   52.11 0.94740 *
##       13) X0.38 > 0.49 15    0.00 0.00000 *
##      7) X0.16 > 0.4 70    6.30 0.10000 *
plot(spmtree, col=8)
text(spmtree, cex=.75)

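The summary of the first tree shows a low residual mean deviance of 0.07568. Because X1 is stored as a numeric 0/1 variable, tree() fits a regression tree, so each node's yval is the proportion of spam in that node and the summary reports no misclassification error rate. The low deviance suggests the tree can probably be simplified without greatly reducing its performance.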

summary(spmtree)
## 
## Regression tree:
## tree(formula = X1 ~ ., data = spam)
## Variables actually used in tree construction:
## [1] "X0.44"   "X0.3"    "X0.778"  "X0.32.1" "X0.15"   "X278"    "X0.18"  
## [8] "X0.16"   "X0.38"  
## Number of terminal nodes:  10 
## Residual mean deviance:  0.07568 = 347.4 / 4590 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.94740 -0.05904 -0.05904  0.00000  0.05263  0.94100
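
Since this is a regression tree on a 0/1 response, a misclassification rate can only be implied. The sketch below assumes a cutoff of 0.5 on each leaf's yval (the share of spam in the leaf) and counts the messages on the minority side of every leaf. The leaf table is typed in from the printed tree, and the cutoff is an assumption rather than something the tree package applies here.

```r
# Terminal nodes copied from the printed tree: size and yval (share of spam in the node).
leaves <- data.frame(
  node = c(32, 33, 17, 18, 19, 10, 11, 12, 13, 7),
  n    = c(2439, 68, 230, 182, 221, 317, 13, 1045, 15, 70),
  yval = c(0.05904, 0.55880, 0.40430, 0.28570, 0.85070,
           0.94640, 0.00000, 0.94740, 0.00000, 0.10000)
)

# Assumption: call a leaf "spam" when more than half of its messages are spam.
# The messages on the minority side of each leaf are then misclassified.
errors <- round(leaves$n * pmin(leaves$yval, 1 - leaves$yval))

implied_rate <- sum(errors) / sum(leaves$n)
sum(errors)               # implied misclassified messages
round(implied_rate, 4)    # implied training error rate
```

Fitting a classification tree directly, by converting X1 to a factor before calling tree(), would make summary() report this kind of rate itself.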

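The next three trees are pruned with increasing values of the cost-complexity parameter k. With k=100 three terminal nodes remain; k=200 and k=300 both reduce the tree to the same single split on X0.44.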

pstcut <- prune.tree(spmtree,k=100)
plot(pstcut)
text(pstcut, digits=2)

pstcut
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 4600 1098.00 0.3939  
##   2) X0.44 < 0.0555 3470  623.60 0.2349  
##     4) X0.3 < 0.055 3140  430.50 0.1640 *
##     5) X0.3 > 0.055 330   27.27 0.9091 *
##   3) X0.44 > 0.0555 1130  117.30 0.8823 *
pstcut <- prune.tree(spmtree,k=200)
plot(pstcut)
text(pstcut, digits=2)

pstcut
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 4600 1098.0 0.3939  
##   2) X0.44 < 0.0555 3470  623.6 0.2349 *
##   3) X0.44 > 0.0555 1130  117.3 0.8823 *
pstcut <- prune.tree(spmtree,k=300)
plot(pstcut)
text(pstcut, digits=2)

pstcut
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 4600 1098.0 0.3939  
##   2) X0.44 < 0.0555 3470  623.6 0.2349 *
##   3) X0.44 > 0.0555 1130  117.3 0.8823 *
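
Why k=200 and k=300 give the same tree can be seen from how much deviance each remaining split removes. The sketch below is a simplified reading of cost-complexity pruning: it compares the gain of a single split with k, whereas prune.tree() weighs whole subtrees. The deviances are copied from the printed trees and the variable names are hypothetical.

```r
# Deviance removed by a split = parent deviance - sum of child deviances.
gain_root_split  <- 1098.0 - (623.6 + 117.3)   # split on X0.44 at the root
gain_node2_split <- 623.6 - (430.5 + 27.27)    # split on X0.3 below node 2

# Simplified rule: a split that adds one leaf is worth keeping when its gain exceeds k.
k_values <- c(100, 200, 300)
kept <- data.frame(
  k                = k_values,
  keep_root_split  = gain_root_split  > k_values,
  keep_node2_split = gain_node2_split > k_values
)
kept
```

The split on X0.3 removes less than 200 units of deviance, so it survives k=100 but not k=200, while the root split removes more than 300 and survives all three values.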

To find a suitable size for a regression tree on this data set, the cross-validated deviance is plotted against tree size. The deviance keeps decreasing as the complexity of the tree increases, so there is no obvious optimal balance between complexity and low deviance. A tree with 6 terminal nodes is a reasonable compromise: it has a lower deviance than any smaller tree but is still considerably simpler than the first tree.

set.seed(2)
cvpst <- cv.tree(spmtree, K=10)## 10 folds used
cvpst$size
##  [1] 10  9  8  7  6  5  4  3  2  1
cvpst$dev
##  [1]  390.9341  407.8643  437.3305  446.5432  463.3060  496.5684  538.7192
##  [8]  613.6478  775.0800 1098.8074
plot(cvpst, pch=21, bg=8, type="p", cex=1.5)

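A tree with 6 terminal nodes would be requested with best=6. The call below uses best=10, which returns the full 10-node tree, so the printed tree and its summary are identical to those of the first tree. As before, the summary is for a regression tree: it reports a low residual mean deviance but no misclassification rate.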

pstcut <- prune.tree(spmtree, best=10)
pstcut
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 4600 1098.00 0.39390  
##    2) X0.44 < 0.0555 3470  623.60 0.23490  
##      4) X0.3 < 0.055 3140  430.50 0.16400  
##        8) X0.778 < 0.378 2737  247.40 0.10050  
##         16) X0.32.1 < 0.2 2507  168.80 0.07260  
##           32) X0.15 < 0.01 2439  135.50 0.05904 *
##           33) X0.15 > 0.01 68   16.76 0.55880 *
##         17) X0.32.1 > 0.2 230   55.40 0.40430 *
##        9) X0.778 > 0.378 403   97.07 0.59550  
##         18) X278 < 55.5 182   37.14 0.28570 *
##         19) X278 > 55.5 221   28.07 0.85070 *
##      5) X0.3 > 0.055 330   27.27 0.90910  
##       10) X0.18 < 0.14 317   16.09 0.94640 *
##       11) X0.18 > 0.14 13    0.00 0.00000 *
##    3) X0.44 > 0.0555 1130  117.30 0.88230  
##      6) X0.16 < 0.4 1060   65.38 0.93400  
##       12) X0.38 < 0.49 1045   52.11 0.94740 *
##       13) X0.38 > 0.49 15    0.00 0.00000 *
##      7) X0.16 > 0.4 70    6.30 0.10000 *
plot(pstcut, col=8)
text(pstcut)

summary(pstcut)
## 
## Regression tree:
## tree(formula = X1 ~ ., data = spam)
## Variables actually used in tree construction:
## [1] "X0.44"   "X0.3"    "X0.778"  "X0.32.1" "X0.15"   "X278"    "X0.18"  
## [8] "X0.16"   "X0.38"  
## Number of terminal nodes:  10 
## Residual mean deviance:  0.07568 = 347.4 / 4590 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.94740 -0.05904 -0.05904  0.00000  0.05263  0.94100