Data Set 1: Death Penalty
This data set contains 362 observations of crimes with three variables: an aggravation index measuring the severity of the crime (6 being the most severe), the race of the victim (1 = white, 0 = black), and whether the defendant was given the death penalty. A quick cross-tabulation of the outcome follows the preview below.
library(MASS)
library(tree)
dpen <- read.csv("C:/DataMining/Data/DeathPenalty.csv")
dpen$Death <- factor(dpen$Death, levels = c(0, 1),
                     labels = c("NonDeathPenalty", "DeathPenalty"))
head(dpen)
## Agg VRace Death
## 1 1 1 DeathPenalty
## 2 1 1 DeathPenalty
## 3 1 1 NonDeathPenalty
## 4 1 1 NonDeathPenalty
## 5 1 1 NonDeathPenalty
## 6 1 1 NonDeathPenalty
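Before building a tree, it helps to see how the outcome is distributed across the aggravation index. A minimal sketch, using the two columns shown above:

table(dpen$Agg, dpen$Death)  # counts of each outcome at each aggravation level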
This first tree is the default result that tree() produces. It has six terminal nodes and is not especially complex, but it can still be simplified further.
dpentree <- tree(Death~.,data=dpen)
dpentree
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 362 321.90 NonDeathPenalty ( 0.83702 0.16298 )
## 2) Agg < 2.5 283 58.12 NonDeathPenalty ( 0.97880 0.02120 )
## 4) Agg < 1.5 244 32.35 NonDeathPenalty ( 0.98770 0.01230 ) *
## 5) Agg > 1.5 39 21.15 NonDeathPenalty ( 0.92308 0.07692 ) *
## 3) Agg > 2.5 79 100.10 DeathPenalty ( 0.32911 0.67089 )
## 6) Agg < 4.5 42 57.84 NonDeathPenalty ( 0.54762 0.45238 )
## 12) VRace < 0.5 17 18.55 NonDeathPenalty ( 0.76471 0.23529 ) *
## 13) VRace > 0.5 25 33.65 DeathPenalty ( 0.40000 0.60000 ) *
## 7) Agg > 4.5 37 20.82 DeathPenalty ( 0.08108 0.91892 )
## 14) VRace < 0.5 11 12.89 DeathPenalty ( 0.27273 0.72727 ) *
## 15) VRace > 0.5 26 0.00 DeathPenalty ( 0.00000 1.00000 ) *
plot(dpentree,col=8)
text(dpentree, cex=.75)

Here is the summary for the tree, showing a low residual mean deviance and a low misclassification error rate. Also shown is case number 127, which the tree classifies correctly; a predict() check follows the output below.
summary(dpentree)
##
## Classification tree:
## tree(formula = Death ~ ., data = dpen)
## Number of terminal nodes: 6
## Residual mean deviance: 0.3331 = 118.6 / 356
## Misclassification error rate: 0.06354 = 23 / 362
dpen[127,]
## Agg VRace Death
## 127 1 0 NonDeathPenalty
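As a check, predict() with type="class" returns the predicted factor level for a classification tree; a minimal sketch for case 127:

predict(dpentree, dpen[127, ], type = "class")  # should agree with the recorded NonDeathPenalty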
A simpler tree can be created with snip.tree() by removing splits whose two terminal nodes predict the same class (nodes 2 and 7 above). The snipped tree has a slightly higher residual mean deviance, but the increase is minor and the misclassification count is unchanged, as checked after the summary below.
dpensnip <- snip.tree(dpentree, nodes = c(2, 7))
dpensnip
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 362 321.90 NonDeathPenalty ( 0.83702 0.16298 )
## 2) Agg < 2.5 283 58.12 NonDeathPenalty ( 0.97880 0.02120 ) *
## 3) Agg > 2.5 79 100.10 DeathPenalty ( 0.32911 0.67089 )
## 6) Agg < 4.5 42 57.84 NonDeathPenalty ( 0.54762 0.45238 )
## 12) VRace < 0.5 17 18.55 NonDeathPenalty ( 0.76471 0.23529 ) *
## 13) VRace > 0.5 25 33.65 DeathPenalty ( 0.40000 0.60000 ) *
## 7) Agg > 4.5 37 20.82 DeathPenalty ( 0.08108 0.91892 ) *
plot(dpensnip)
text(dpensnip)

summary(dpensnip)
##
## Classification tree:
## snip.tree(tree = dpentree, nodes = c(2L, 7L))
## Number of terminal nodes: 4
## Residual mean deviance: 0.3663 = 131.1 / 358
## Misclassification error rate: 0.06354 = 23 / 362
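Since both summaries report 23 misclassified cases, misclass.tree() from the tree package can confirm that snipping cost nothing in accuracy (a quick sketch):

misclass.tree(dpentree)  # 23, per the summary of the full tree
misclass.tree(dpensnip)  # also 23, per the snipped summary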
Data Set 2: House Prices
This data set includes prices and characteristics of n=128 houses.
hprice <- read.csv("C:/DataMining/Data/HousePrices.csv")
head(hprice)
## HomeID Price SqFt Bedrooms Bathrooms Offers Brick Neighborhood
## 1 1 114300 1790 2 2 2 No East
## 2 2 114200 2030 4 2 3 No East
## 3 3 114800 1740 3 2 1 No East
## 4 4 94700 1980 3 2 3 No East
## 5 5 119800 2130 3 3 3 No East
## 6 6 114600 1780 3 2 2 No North
This first regression tree has 10 terminal nodes. To read the first split: if the house is in the East (a) or North (b) neighborhood, continue to the left; if it is in the West (c) neighborhood, continue to the right. For Brick, "a" is a response of No and "b" is a response of Yes; a check of the factor levels follows the plot below.
hptree <- tree(Price ~., data=hprice)
hptree
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 128 9.169e+10 130400
## 2) Neighborhood: East,North 89 3.007e+10 117800
## 4) SqFt < 2020 55 1.257e+10 110400
## 8) Brick: No 40 6.656e+09 105800
## 16) Offers < 2.5 17 1.151e+09 114500 *
## 17) Offers > 2.5 23 3.295e+09 99420 *
## 9) Brick: Yes 15 2.857e+09 122500 *
## 5) SqFt > 2020 34 9.617e+09 129800
## 10) Brick: No 23 6.316e+09 123800
## 20) Bathrooms < 2.5 10 1.351e+09 111700 *
## 21) Bathrooms > 2.5 13 2.373e+09 133100 *
## 11) Brick: Yes 11 7.527e+08 142300 *
## 3) Neighborhood: West 39 1.487e+10 159300
## 6) Brick: No 23 4.024e+09 148200
## 12) SqFt < 2010 9 3.002e+08 137000 *
## 13) SqFt > 2010 14 1.844e+09 155500 *
## 7) Brick: Yes 16 3.983e+09 175200
## 14) Bedrooms < 3.5 8 4.316e+08 164100 *
## 15) Bedrooms > 3.5 8 1.580e+09 186300 *
plot(hptree, col=8)
text(hptree, digits=2)
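The split labels a, b, c follow the alphabetical order of the factor levels, so checking the levels confirms the mapping described above (a sketch, assuming the character columns were read as factors, the pre-4.0 read.csv default):

levels(hprice$Neighborhood)  # "East" "North" "West" -> a, b, c
levels(hprice$Brick)         # "No" "Yes" -> a, b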

The regression tree’s summary shows a very large residual mean deviance, which is expected here because prices are in dollars and deviance is measured in squared units. When case number 53 is run through the tree, it lands in terminal node 21, whose mean price of $133,100 (rounded to $130,000 on the plot) is the prediction; the actual price was $117,400, an error of roughly $15,700. A predict() check follows the output below.
summary(hptree)
##
## Regression tree:
## tree(formula = Price ~ ., data = hprice)
## Variables actually used in tree construction:
## [1] "Neighborhood" "SqFt" "Brick" "Offers"
## [5] "Bathrooms" "Bedrooms"
## Number of terminal nodes: 10
## Residual mean deviance: 1.35e+08 = 1.594e+10 / 118
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -30320.0 -7756.0 450.6 0.0 6153.0 31460.0
hprice[53,]
## HomeID Price SqFt Bedrooms Bathrooms Offers Brick Neighborhood
## 53 53 117400 2150 2 3 4 No North
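A minimal sketch of that calculation with predict(), which returns the terminal node's mean price for a regression tree:

predict(hptree, hprice[53, ])                     # fitted price for case 53
hprice$Price[53] - predict(hptree, hprice[53, ])  # residual (actual minus predicted)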
To see whether a less complex tree can be had without a larger deviance, the following cross-validation plot is created. The plot shows that 10 terminal nodes give the lowest cross-validated deviance, so a simpler tree could be created, but only at the expense of an even higher residual mean deviance. The best size is also extracted programmatically after the plot call below.
set.seed(2)
cvpst <- cv.tree(hptree, K=10)  # 10 folds used
cvpst$size
## [1] 10 9 8 7 5 4 3 2 1
cvpst$dev
## [1] 31400030314 40194910574 40174107958 40103665878 42501546086 46137039945
## [7] 50541218968 50508083845 93160356249
plot(cvpst, pch=21, bg=8, type="p", cex=1.5)
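The best size can also be pulled out programmatically from the size and dev vectors above (a small sketch; here the minimum sits at the full tree, so pruning changes nothing):

best.size <- cvpst$size[which.min(cvpst$dev)]  # 10 for this run
hpcut <- prune.tree(hptree, best = best.size)  # identical to hptree when best.size is 10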

Data Set 3: Spam
This data set contains 4600 observations and 58 variables. Unfortunately, it ships with no variable labels and no comprehensive data dictionary. Assuming the variables appear in the order the repository describes, names for the first 57 can be found at http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.names. The last variable, X1, is known: it is the classification of spam (1) or not spam (0). The odd column names are a side effect of how the file was read, discussed after the preview below.
download.file("http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data","spambase.data")
spam <- read.csv("./spambase.data")
head(spam)
## X0 X0.64 X0.64.1 X0.1 X0.32 X0.2 X0.3 X0.4 X0.5 X0.6 X0.7 X0.64.2 X0.8
## 1 0.21 0.28 0.50 0 0.14 0.28 0.21 0.07 0.00 0.94 0.21 0.79 0.65
## 2 0.06 0.00 0.71 0 1.23 0.19 0.19 0.12 0.64 0.25 0.38 0.45 0.12
## 3 0.00 0.00 0.00 0 0.63 0.00 0.31 0.63 0.31 0.63 0.31 0.31 0.31
## 4 0.00 0.00 0.00 0 0.63 0.00 0.31 0.63 0.31 0.63 0.31 0.31 0.31
## 5 0.00 0.00 0.00 0 1.85 0.00 0.00 1.85 0.00 0.00 0.00 0.00 0.00
## 6 0.00 0.00 0.00 0 1.92 0.00 0.00 0.00 0.00 0.64 0.96 1.28 0.00
## X0.9 X0.10 X0.32.1 X0.11 X1.29 X1.93 X0.12 X0.96 X0.13 X0.14 X0.15 X0.16
## 1 0.21 0.14 0.14 0.07 0.28 3.47 0.00 1.59 0 0.43 0.43 0
## 2 0.00 1.75 0.06 0.06 1.03 1.36 0.32 0.51 0 1.16 0.06 0
## 3 0.00 0.00 0.31 0.00 0.00 3.18 0.00 0.31 0 0.00 0.00 0
## 4 0.00 0.00 0.31 0.00 0.00 3.18 0.00 0.31 0 0.00 0.00 0
## 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.00 0.00 0
## 6 0.00 0.00 0.96 0.00 0.32 3.85 0.00 0.64 0 0.00 0.00 0
## X0.17 X0.18 X0.19 X0.20 X0.21 X0.22 X0.23 X0.24 X0.25 X0.26 X0.27 X0.28
## 1 0 0 0 0 0 0 0 0 0 0 0 0.07
## 2 0 0 0 0 0 0 0 0 0 0 0 0.00
## 3 0 0 0 0 0 0 0 0 0 0 0 0.00
## 4 0 0 0 0 0 0 0 0 0 0 0 0.00
## 5 0 0 0 0 0 0 0 0 0 0 0 0.00
## 6 0 0 0 0 0 0 0 0 0 0 0 0.00
## X0.29 X0.30 X0.31 X0.33 X0.34 X0.35 X0.36 X0.37 X0.38 X0.39 X0.40 X0.41
## 1 0 0 0.00 0 0 0.00 0 0.00 0.00 0 0 0.00
## 2 0 0 0.06 0 0 0.12 0 0.06 0.06 0 0 0.01
## 3 0 0 0.00 0 0 0.00 0 0.00 0.00 0 0 0.00
## 4 0 0 0.00 0 0 0.00 0 0.00 0.00 0 0 0.00
## 5 0 0 0.00 0 0 0.00 0 0.00 0.00 0 0 0.00
## 6 0 0 0.00 0 0 0.00 0 0.00 0.00 0 0 0.00
## X0.42 X0.43 X0.778 X0.44 X0.45 X3.756 X61 X278 X1
## 1 0.132 0 0.372 0.180 0.048 5.114 101 1028 1
## 2 0.143 0 0.276 0.184 0.010 9.821 485 2259 1
## 3 0.137 0 0.137 0.000 0.000 3.537 40 191 1
## 4 0.135 0 0.135 0.000 0.000 3.537 40 191 1
## 5 0.223 0 0.000 0.000 0.000 3.000 15 54 1
## 6 0.054 0 0.164 0.054 0.000 1.671 4 112 1
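The raw file has no header row, so the read.csv() call above consumed the first record as column names; that is where labels such as X0 and X0.64 come from, and why one of the 4601 observations is missing. A hedged alternative (not used in the analysis below) that keeps every row:

spam.full <- read.csv("spambase.data", header = FALSE)  # keep all 4601 rows
names(spam.full)[58] <- "spam"                          # label the class column; this name is illustrative
dim(spam.full)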
The first tree produced is fairly complex, with 10 terminal nodes. The hope is that it can be simplified further.
spmtree <- tree(X1 ~., data=spam)
spmtree
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 4600 1098.00 0.39390
## 2) X0.44 < 0.0555 3470 623.60 0.23490
## 4) X0.3 < 0.055 3140 430.50 0.16400
## 8) X0.778 < 0.378 2737 247.40 0.10050
## 16) X0.32.1 < 0.2 2507 168.80 0.07260
## 32) X0.15 < 0.01 2439 135.50 0.05904 *
## 33) X0.15 > 0.01 68 16.76 0.55880 *
## 17) X0.32.1 > 0.2 230 55.40 0.40430 *
## 9) X0.778 > 0.378 403 97.07 0.59550
## 18) X278 < 55.5 182 37.14 0.28570 *
## 19) X278 > 55.5 221 28.07 0.85070 *
## 5) X0.3 > 0.055 330 27.27 0.90910
## 10) X0.18 < 0.14 317 16.09 0.94640 *
## 11) X0.18 > 0.14 13 0.00 0.00000 *
## 3) X0.44 > 0.0555 1130 117.30 0.88230
## 6) X0.16 < 0.4 1060 65.38 0.93400
## 12) X0.38 < 0.49 1045 52.11 0.94740 *
## 13) X0.38 > 0.49 15 0.00 0.00000 *
## 7) X0.16 > 0.4 70 6.30 0.10000 *
plot(spmtree, col=8)
text(spmtree, cex=.75)

The next three trees are pruned progressively by raising the cost-complexity penalty k. Note that k=200 and k=300 yield the same two-node tree; the full pruning sequence, shown after the third tree below, makes it clear why.
pstcut <- prune.tree(spmtree,k=100)
plot(pstcut)
text(pstcut, digits=2)

pstcut
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 4600 1098.00 0.3939
## 2) X0.44 < 0.0555 3470 623.60 0.2349
## 4) X0.3 < 0.055 3140 430.50 0.1640 *
## 5) X0.3 > 0.055 330 27.27 0.9091 *
## 3) X0.44 > 0.0555 1130 117.30 0.8823 *
pstcut <- prune.tree(spmtree,k=200)
plot(pstcut)
text(pstcut, digits=2)

pstcut
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 4600 1098.0 0.3939
## 2) X0.44 < 0.0555 3470 623.6 0.2349 *
## 3) X0.44 > 0.0555 1130 117.3 0.8823 *
pstcut <- prune.tree(spmtree,k=300)
plot(pstcut)
text(pstcut, digits=2)

pstcut
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 4600 1098.0 0.3939
## 2) X0.44 < 0.0555 3470 623.6 0.2349 *
## 3) X0.44 > 0.0555 1130 117.3 0.8823 *
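Rather than guessing at k, calling prune.tree() without it returns the whole cost-complexity sequence, showing the k threshold at which each subtree size becomes optimal (a quick sketch):

pseq <- prune.tree(spmtree)                          # pruning sequence for the full tree
cbind(size = pseq$size, dev = pseq$dev, k = pseq$k)  # one row per subtree in the sequence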
To find the optimal size for a tree on this data set, the following cross-validation plot is created. The cross-validated deviance keeps falling as the tree grows and is lowest at the full 10-node tree, so there is no smaller size that clearly balances complexity against deviance, and pruning with best=10 simply returns the original tree.
set.seed(2)
cvpst <- cv.tree(spmtree, K=10)  # 10 folds used
cvpst$size
## [1] 10 9 8 7 6 5 4 3 2 1
cvpst$dev
## [1] 390.9341 407.8643 437.3305 446.5432 463.3060 496.5684 538.7192
## [8] 613.6478 775.0800 1098.8074
plot(cvpst, pch=21, bg=8, type="p", cex=1.5)

That tree, kept at 10 terminal nodes, has a low residual mean deviance. Because X1 was read as numeric, tree() fit a regression tree, so the summary reports deviance rather than a misclassification rate; an approximate classification error is computed after the summary below.
pstcut <- prune.tree(spmtree, best=10)
pstcut
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 4600 1098.00 0.39390
## 2) X0.44 < 0.0555 3470 623.60 0.23490
## 4) X0.3 < 0.055 3140 430.50 0.16400
## 8) X0.778 < 0.378 2737 247.40 0.10050
## 16) X0.32.1 < 0.2 2507 168.80 0.07260
## 32) X0.15 < 0.01 2439 135.50 0.05904 *
## 33) X0.15 > 0.01 68 16.76 0.55880 *
## 17) X0.32.1 > 0.2 230 55.40 0.40430 *
## 9) X0.778 > 0.378 403 97.07 0.59550
## 18) X278 < 55.5 182 37.14 0.28570 *
## 19) X278 > 55.5 221 28.07 0.85070 *
## 5) X0.3 > 0.055 330 27.27 0.90910
## 10) X0.18 < 0.14 317 16.09 0.94640 *
## 11) X0.18 > 0.14 13 0.00 0.00000 *
## 3) X0.44 > 0.0555 1130 117.30 0.88230
## 6) X0.16 < 0.4 1060 65.38 0.93400
## 12) X0.38 < 0.49 1045 52.11 0.94740 *
## 13) X0.38 > 0.49 15 0.00 0.00000 *
## 7) X0.16 > 0.4 70 6.30 0.10000 *
plot(pstcut, col=8)
text(pstcut)

summary(pstcut)
##
## Regression tree:
## tree(formula = X1 ~ ., data = spam)
## Variables actually used in tree construction:
## [1] "X0.44" "X0.3" "X0.778" "X0.32.1" "X0.15" "X278" "X0.18"
## [8] "X0.16" "X0.38"
## Number of terminal nodes: 10
## Residual mean deviance: 0.07568 = 347.4 / 4590
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.94740 -0.05904 -0.05904 0.00000 0.05263 0.94100
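As noted above, an approximate classification error can be had by thresholding the tree's fitted values at 0.5 (a sketch; the 0.5 cutoff is an assumption):

pred <- as.numeric(predict(pstcut) > 0.5)   # fitted values -> 0/1 labels
table(actual = spam$X1, predicted = pred)   # confusion matrix
mean(pred != spam$X1)                       # approximate misclassification rate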