Using this data set, we will perform a CART analysis to construct a tree diagram that predicts whether someone will receive the death penalty based on the other variables in the set.
download.file("https://www.biz.uiowa.edu/faculty/jledolter/DataMining/DeathPenalty.csv", "DeathPenalty.csv",method="curl")
dpen <- read.csv("DeathPenalty.csv")
head(dpen)
## Agg VRace Death
## 1 1 1 1
## 2 1 1 1
## 3 1 1 0
## 4 1 1 0
## 5 1 1 0
## 6 1 1 0
library(tree)
length(dpen$Death)
## [1] 362
# the first fit restricts splitting with mindev=0.1; the second call,
# using the default mindev, immediately replaces it
dpentree <- tree(Death ~ ., data = dpen, mindev = 0.1, mincut = 1)
dpentree <- tree(Death ~ ., data = dpen, mincut = 1)
dpentree
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 362 49.380 0.1630
## 2) Agg < 3.5 307 13.360 0.0456
## 4) Agg < 2.5 283 5.873 0.0212 *
## 5) Agg > 2.5 24 5.333 0.3333 *
## 3) Agg > 3.5 55 8.182 0.8182
## 6) VRace < 0.5 17 4.118 0.5882
## 12) Agg < 5.5 13 3.231 0.4615 *
## 13) Agg > 5.5 4 0.000 1.0000 *
## 7) VRace > 0.5 38 2.763 0.9211
## 14) Agg < 4.5 12 2.250 0.7500 *
## 15) Agg > 4.5 26 0.000 1.0000 *
plot(dpentree, col=8)
text(dpentree, digits=2)
Because the Death variable is binary, we interpret the values at the terminal leaves as the probability that someone will receive the death penalty. Based on the tree, anyone with an Agg value above 3.5 has a much higher predicted chance of receiving the death penalty. This tree seems reasonable, but we should still prune it to see if it can be improved.
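To make this concrete, we can ask the fitted tree for predictions on hypothetical cases (a sketch: the Agg and VRace values below are invented for illustration, and the expected outputs are read off the leaf yvals above).
# two made-up cases: a low-aggravation case and a high-aggravation case
newcases <- data.frame(Agg = c(2, 6), VRace = c(1, 1))
predict(dpentree, newdata = newcases)   # expect about 0.021 (leaf 4) and 1.000 (leaf 15)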
First, we will prune the tree with a penalty value of k = 1 to see how it changes.
dpencut <- prune.tree(dpentree,k=1)
plot(dpencut)   # k = 1 is just one choice; other penalty values can be tried
text(dpencut, digits=2)
dpencut
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 362 49.380 0.1630
## 2) Agg < 3.5 307 13.360 0.0456
## 4) Agg < 2.5 283 5.873 0.0212 *
## 5) Agg > 2.5 24 5.333 0.3333 *
## 3) Agg > 3.5 55 8.182 0.8182
## 6) VRace < 0.5 17 4.118 0.5882 *
## 7) VRace > 0.5 38 2.763 0.9211 *
In this tree, nodes 6 and 7 are now terminal, so each side of the root splits only once. In the original tree, the right side was split by victim's race (VRace) and then again by Agg.
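We can verify one of these collapses by hand: weakest-link pruning removes a split when the deviance it saves, per extra leaf, is less than the penalty k. Checking node 6 of the original tree against the printed deviances:
# node 6 (deviance 4.118) split into nodes 12 and 13 (3.231 and 0.000);
# the split saves 4.118 - (3.231 + 0.000) = 0.887 in deviance for one
# extra leaf, which is less than k = 1, so it is pruned away
4.118 - (3.231 + 0.000)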
Now we will increase the penalty to see how the tree changes. Larger values of k prune more aggressively, keeping only splits that save more than k of deviance per extra leaf.
dpencut <- prune.tree(dpentree,k=2)
plot(dpencut)
text(dpencut, digits=2)
dpencut
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 362 49.380 0.1630
## 2) Agg < 3.5 307 13.360 0.0456
## 4) Agg < 2.5 283 5.873 0.0212 *
## 5) Agg > 2.5 24 5.333 0.3333 *
## 3) Agg > 3.5 55 8.182 0.8182 *
The tree has now changed drastically: there is no further splitting on the right side. If Agg is greater than 3.5, the predicted chance of receiving the death penalty is simply 0.82.
Now we will compute the whole pruning sequence and plot it to see which tree size decreases the deviance most efficiently; splits that only gradually decrease the deviance after the large drops are not worth the extra complexity.
dpencut <- prune.tree(dpentree)
dpencut
## $size
## [1] 6 5 4 3 2 1
##
## $dev
## [1] 16.68689 17.20005 18.08693 19.38794 21.54338 49.38398
##
## $k
## [1] -Inf 0.5131579 0.8868778 1.3010132 2.1554387 27.8405962
##
## $method
## [1] "deviance"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
plot(dpencut)
According to this graph, the two-leaf tree may be the best choice: it captures the large drops in deviance, after which the deviance decreases only gradually. The next tree uses best=2, which asks prune.tree for the best subtree with two terminal nodes.
dpenbest <- prune.tree(dpentree,best=2)
dpenbest
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 362 49.380 0.1630
## 2) Agg < 3.5 307 13.360 0.0456 *
## 3) Agg > 3.5 55 8.182 0.8182 *
plot(dpenbest)
text(dpenbest, digits=2)
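As a quick sanity check, the fitted values of this two-leaf tree should take only the two yvals printed above:
unique(round(predict(dpenbest), 4))   # expect 0.0456 and 0.8182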
Using this house prices data, we will use CART analysis to predict the neighborhood in which a house is located. We first remove the first column, which is just a case number assigned to each house.
download.file("https://www.biz.uiowa.edu/faculty/jledolter/datamining/HousePrices.csv", "HousePrices.csv",method="curl")
hp <- read.csv("HousePrices.csv")   # read the local copy we just downloaded
hp <- hp[ , -1]                     # drop the case-number column
library(MASS)
library(tree)
head(hp)
## Price SqFt Bedrooms Bathrooms Offers Brick Neighborhood
## 1 114300 1790 2 2 2 No East
## 2 114200 2030 4 2 3 No East
## 3 114800 1740 3 2 1 No East
## 4 94700 1980 3 2 3 No East
## 5 119800 2130 3 3 3 No East
## 6 114600 1780 3 2 2 No North
hptree <- tree(Neighborhood~.,data=hp)
hptree
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 128 280.800 East ( 0.35156 0.34375 0.30469 )
## 2) Price < 128750 68 92.140 North ( 0.41176 0.58824 0.00000 )
## 4) Offers < 3.5 52 72.090 East ( 0.50000 0.50000 0.00000 )
## 8) Price < 93200 5 0.000 North ( 0.00000 1.00000 0.00000 ) *
## 9) Price > 93200 47 64.620 East ( 0.55319 0.44681 0.00000 )
## 18) Price < 118500 35 48.260 North ( 0.45714 0.54286 0.00000 )
## 36) Price < 105850 8 8.997 East ( 0.75000 0.25000 0.00000 ) *
## 37) Price > 105850 27 35.590 North ( 0.37037 0.62963 0.00000 ) *
## 19) Price > 118500 12 10.810 East ( 0.83333 0.16667 0.00000 ) *
## 5) Offers > 3.5 16 12.060 North ( 0.12500 0.87500 0.00000 ) *
## 3) Price > 128750 60 98.140 West ( 0.28333 0.06667 0.65000 )
## 6) Price < 157350 41 77.260 West ( 0.41463 0.09756 0.48780 )
## 12) Brick: No 26 42.680 West ( 0.19231 0.11538 0.69231 )
## 24) Bathrooms < 2.5 12 6.884 West ( 0.00000 0.08333 0.91667 ) *
## 25) Bathrooms > 2.5 14 27.780 West ( 0.35714 0.14286 0.50000 )
## 50) SqFt < 2350 9 17.910 West ( 0.22222 0.22222 0.55556 ) *
## 51) SqFt > 2350 5 6.730 East ( 0.60000 0.00000 0.40000 ) *
## 13) Brick: Yes 15 18.830 East ( 0.80000 0.06667 0.13333 )
## 26) SqFt < 2000 5 10.550 East ( 0.40000 0.20000 0.40000 ) *
## 27) SqFt > 2000 10 0.000 East ( 1.00000 0.00000 0.00000 ) *
## 7) Price > 157350 19 0.000 West ( 0.00000 0.00000 1.00000 ) *
plot(hptree)
text(hptree,digits=2)
This is the first tree, with no pruning. According to it, a house can be in the West neighborhood only if it is priced over $128,750. However, this is not the optimal tree: nodes 26 and 27 both lead to the same neighborhood. Therefore, we should prune the tree to simplify it without adding much error.
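Before pruning, it is worth seeing why the tree kept a split whose two leaves agree. Splitting node 13 still lowers the total deviance, even though both children are labeled East; a quick check from the printed output:
# node 13 (deviance 18.83) splits into nodes 26 and 27 (10.55 and 0.00);
# 10.55 + 0.00 < 18.83, so the split reduces deviance despite both
# children predicting East
10.55 + 0.00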
hpcut <- prune.tree(hptree)
hpcut
## $size
## [1] 11 10 9 8 7 6 5 4 3 2 1
##
## $dev
## [1] 109.5356 112.6789 116.3501 121.8967 129.3611 137.3564 145.3700
## [8] 153.6520 169.4032 190.2832 280.7536
##
## $k
## [1] -Inf 3.143293 3.671246 5.546599 7.464395 7.995287 8.013625
## [8] 8.281956 15.751268 20.879921 90.470458
##
## $method
## [1] "deviance"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
plot(hpcut)
There are not very many drastic drops in deviance after k = 2. Based on the graph, the best k values lie between 2 and 4, so our next tree will use k = 3 to see how the tree changes.
hpcut <- prune.tree(hptree,k=3)
plot(hpcut)
text(hpcut, digits=2)
hpcut
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 128 280.800 East ( 0.35156 0.34375 0.30469 )
## 2) Price < 128750 68 92.140 North ( 0.41176 0.58824 0.00000 )
## 4) Offers < 3.5 52 72.090 East ( 0.50000 0.50000 0.00000 )
## 8) Price < 93200 5 0.000 North ( 0.00000 1.00000 0.00000 ) *
## 9) Price > 93200 47 64.620 East ( 0.55319 0.44681 0.00000 )
## 18) Price < 118500 35 48.260 North ( 0.45714 0.54286 0.00000 )
## 36) Price < 105850 8 8.997 East ( 0.75000 0.25000 0.00000 ) *
## 37) Price > 105850 27 35.590 North ( 0.37037 0.62963 0.00000 ) *
## 19) Price > 118500 12 10.810 East ( 0.83333 0.16667 0.00000 ) *
## 5) Offers > 3.5 16 12.060 North ( 0.12500 0.87500 0.00000 ) *
## 3) Price > 128750 60 98.140 West ( 0.28333 0.06667 0.65000 )
## 6) Price < 157350 41 77.260 West ( 0.41463 0.09756 0.48780 )
## 12) Brick: No 26 42.680 West ( 0.19231 0.11538 0.69231 )
## 24) Bathrooms < 2.5 12 6.884 West ( 0.00000 0.08333 0.91667 ) *
## 25) Bathrooms > 2.5 14 27.780 West ( 0.35714 0.14286 0.50000 )
## 50) SqFt < 2350 9 17.910 West ( 0.22222 0.22222 0.55556 ) *
## 51) SqFt > 2350 5 6.730 East ( 0.60000 0.00000 0.40000 ) *
## 13) Brick: Yes 15 18.830 East ( 0.80000 0.06667 0.13333 )
## 26) SqFt < 2000 5 10.550 East ( 0.40000 0.20000 0.40000 ) *
## 27) SqFt > 2000 10 0.000 East ( 1.00000 0.00000 0.00000 ) *
## 7) Price > 157350 19 0.000 West ( 0.00000 0.00000 1.00000 ) *
This tree did not change at all from the original. The pruning sequence above shows why: the smallest penalty at which any branch is removed is about k = 3.14, so k = 3 prunes nothing. In addition, node 13 still splits into two leaves that both predict East. Therefore, we will snip node 13 directly to simplify the tree. First, here is a summary of the full tree for reference.
summary(hptree)
##
## Classification tree:
## tree(formula = Neighborhood ~ ., data = hp)
## Variables actually used in tree construction:
## [1] "Price" "Offers" "Brick" "Bathrooms" "SqFt"
## Number of terminal nodes: 11
## Residual mean deviance: 0.9362 = 109.5 / 117
## Misclassification error rate: 0.2031 = 26 / 128
hpsnip <- snip.tree(hptree, nodes = c(13))
hpsnip
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 128 280.800 East ( 0.35156 0.34375 0.30469 )
## 2) Price < 128750 68 92.140 North ( 0.41176 0.58824 0.00000 )
## 4) Offers < 3.5 52 72.090 East ( 0.50000 0.50000 0.00000 )
## 8) Price < 93200 5 0.000 North ( 0.00000 1.00000 0.00000 ) *
## 9) Price > 93200 47 64.620 East ( 0.55319 0.44681 0.00000 )
## 18) Price < 118500 35 48.260 North ( 0.45714 0.54286 0.00000 )
## 36) Price < 105850 8 8.997 East ( 0.75000 0.25000 0.00000 ) *
## 37) Price > 105850 27 35.590 North ( 0.37037 0.62963 0.00000 ) *
## 19) Price > 118500 12 10.810 East ( 0.83333 0.16667 0.00000 ) *
## 5) Offers > 3.5 16 12.060 North ( 0.12500 0.87500 0.00000 ) *
## 3) Price > 128750 60 98.140 West ( 0.28333 0.06667 0.65000 )
## 6) Price < 157350 41 77.260 West ( 0.41463 0.09756 0.48780 )
## 12) Brick: No 26 42.680 West ( 0.19231 0.11538 0.69231 )
## 24) Bathrooms < 2.5 12 6.884 West ( 0.00000 0.08333 0.91667 ) *
## 25) Bathrooms > 2.5 14 27.780 West ( 0.35714 0.14286 0.50000 )
## 50) SqFt < 2350 9 17.910 West ( 0.22222 0.22222 0.55556 ) *
## 51) SqFt > 2350 5 6.730 East ( 0.60000 0.00000 0.40000 ) *
## 13) Brick: Yes 15 18.830 East ( 0.80000 0.06667 0.13333 ) *
## 7) Price > 157350 19 0.000 West ( 0.00000 0.00000 1.00000 ) *
plot(hpsnip)
text(hpsnip)
Now node 13 leads directly to the East neighborhood instead of splitting into two leaves that both lead to East.
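To gauge what the snip costs in fit, we can compare in-sample confusion matrices for the full and snipped trees (a sketch; predict with type="class" returns each house's predicted neighborhood):
# rows are predictions, columns are the true neighborhoods
table(predict(hptree, type = "class"), hp$Neighborhood)
table(predict(hpsnip, type = "class"), hp$Neighborhood)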
With this data, we will use CART analysis to determine whether an email is spam. The spam indicator is the last column, X1: it contains 1 if the email is spam and 0 if it is not. (The raw file has no header row, so read.csv treats the first record as column names, which is why the variables have names like X0.64.)
download.file("http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data", "spambase.data",method="curl")
spam<-read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data")
summary(spam)
## X0 X0.64 X0.64.1 X0.1
## Min. :0.0000 Min. : 0.0000 Min. :0.0000 Min. : 0.00000
## 1st Qu.:0.0000 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.: 0.00000
## Median :0.0000 Median : 0.0000 Median :0.0000 Median : 0.00000
## Mean :0.1046 Mean : 0.2129 Mean :0.2806 Mean : 0.06544
## 3rd Qu.:0.0000 3rd Qu.: 0.0000 3rd Qu.:0.4200 3rd Qu.: 0.00000
## Max. :4.5400 Max. :14.2800 Max. :5.1000 Max. :42.81000
## X0.32 X0.2 X0.3 X0.4
## Min. : 0.0000 Min. :0.00000 Min. :0.0000 Min. : 0.0000
## 1st Qu.: 0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.: 0.0000
## Median : 0.0000 Median :0.00000 Median :0.0000 Median : 0.0000
## Mean : 0.3122 Mean :0.09592 Mean :0.1142 Mean : 0.1053
## 3rd Qu.: 0.3825 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.: 0.0000
## Max. :10.0000 Max. :5.88000 Max. :7.2700 Max. :11.1100
## X0.5 X0.6 X0.7 X0.64.2
## Min. :0.00000 Min. : 0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.: 0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median : 0.0000 Median :0.00000 Median :0.1000
## Mean :0.09009 Mean : 0.2395 Mean :0.05984 Mean :0.5417
## 3rd Qu.:0.00000 3rd Qu.: 0.1600 3rd Qu.:0.00000 3rd Qu.:0.8000
## Max. :5.26000 Max. :18.1800 Max. :2.61000 Max. :9.6700
## X0.8 X0.9 X0.10 X0.32.1
## Min. :0.00000 Min. : 0.00000 Min. :0.00000 Min. : 0.0000
## 1st Qu.:0.00000 1st Qu.: 0.00000 1st Qu.:0.00000 1st Qu.: 0.0000
## Median :0.00000 Median : 0.00000 Median :0.00000 Median : 0.0000
## Mean :0.09395 Mean : 0.05864 Mean :0.04922 Mean : 0.2488
## 3rd Qu.:0.00000 3rd Qu.: 0.00000 3rd Qu.:0.00000 3rd Qu.: 0.1000
## Max. :5.55000 Max. :10.00000 Max. :4.41000 Max. :20.0000
## X0.11 X1.29 X1.93 X0.12
## Min. :0.0000 Min. :0.0000 Min. : 0.000 Min. : 0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 0.000 1st Qu.: 0.0000
## Median :0.0000 Median :0.0000 Median : 1.310 Median : 0.0000
## Mean :0.1426 Mean :0.1845 Mean : 1.662 Mean : 0.0856
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.: 2.640 3rd Qu.: 0.0000
## Max. :7.1400 Max. :9.0900 Max. :18.750 Max. :18.1800
## X0.96 X0.13 X0.14 X0.15
## Min. : 0.0000 Min. : 0.0000 Min. :0.0000 Min. : 0.00000
## 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.: 0.00000
## Median : 0.2200 Median : 0.0000 Median :0.0000 Median : 0.00000
## Mean : 0.8097 Mean : 0.1212 Mean :0.1017 Mean : 0.09429
## 3rd Qu.: 1.2700 3rd Qu.: 0.0000 3rd Qu.:0.0000 3rd Qu.: 0.00000
## Max. :11.1100 Max. :17.1000 Max. :5.4500 Max. :12.50000
## X0.16 X0.17 X0.18 X0.19
## Min. : 0.0000 Min. : 0.0000 Min. : 0.0000 Min. :0.0000
## 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.:0.0000
## Median : 0.0000 Median : 0.0000 Median : 0.0000 Median :0.0000
## Mean : 0.5496 Mean : 0.2654 Mean : 0.7675 Mean :0.1249
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.:0.0000
## Max. :20.8300 Max. :16.6600 Max. :33.3300 Max. :9.0900
## X0.20 X0.21 X0.22 X0.23
## Min. : 0.00000 Min. :0.0000 Min. : 0.00000 Min. :0.00000
## 1st Qu.: 0.00000 1st Qu.:0.0000 1st Qu.: 0.00000 1st Qu.:0.00000
## Median : 0.00000 Median :0.0000 Median : 0.00000 Median :0.00000
## Mean : 0.09894 Mean :0.1029 Mean : 0.06477 Mean :0.04706
## 3rd Qu.: 0.00000 3rd Qu.:0.0000 3rd Qu.: 0.00000 3rd Qu.:0.00000
## Max. :14.28000 Max. :5.8800 Max. :12.50000 Max. :4.76000
## X0.24 X0.25 X0.26 X0.27
## Min. : 0.00000 Min. :0.00000 Min. : 0.0000 Min. :0.0000
## 1st Qu.: 0.00000 1st Qu.:0.00000 1st Qu.: 0.0000 1st Qu.:0.0000
## Median : 0.00000 Median :0.00000 Median : 0.0000 Median :0.0000
## Mean : 0.09725 Mean :0.04785 Mean : 0.1054 Mean :0.0975
## 3rd Qu.: 0.00000 3rd Qu.:0.00000 3rd Qu.: 0.0000 3rd Qu.:0.0000
## Max. :18.18000 Max. :4.76000 Max. :20.0000 Max. :7.6900
## X0.28 X0.29 X0.30 X0.31
## Min. :0.000 Min. :0.0000 Min. : 0.00000 Min. :0.00000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.: 0.00000 1st Qu.:0.00000
## Median :0.000 Median :0.0000 Median : 0.00000 Median :0.00000
## Mean :0.137 Mean :0.0132 Mean : 0.07865 Mean :0.06485
## 3rd Qu.:0.000 3rd Qu.:0.0000 3rd Qu.: 0.00000 3rd Qu.:0.00000
## Max. :6.890 Max. :8.3300 Max. :11.11000 Max. :4.76000
## X0.33 X0.34 X0.35 X0.36
## Min. :0.00000 Min. : 0.0000 Min. :0.00000 Min. : 0.00000
## 1st Qu.:0.00000 1st Qu.: 0.0000 1st Qu.:0.00000 1st Qu.: 0.00000
## Median :0.00000 Median : 0.0000 Median :0.00000 Median : 0.00000
## Mean :0.04368 Mean : 0.1324 Mean :0.04611 Mean : 0.07921
## 3rd Qu.:0.00000 3rd Qu.: 0.0000 3rd Qu.:0.00000 3rd Qu.: 0.00000
## Max. :7.14000 Max. :14.2800 Max. :3.57000 Max. :20.00000
## X0.37 X0.38 X0.39 X0.40
## Min. : 0.0000 Min. : 0.0000 Min. :0.000000 Min. : 0.00000
## 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.:0.000000 1st Qu.: 0.00000
## Median : 0.0000 Median : 0.0000 Median :0.000000 Median : 0.00000
## Mean : 0.3013 Mean : 0.1799 Mean :0.005446 Mean : 0.03188
## 3rd Qu.: 0.1100 3rd Qu.: 0.0000 3rd Qu.:0.000000 3rd Qu.: 0.00000
## Max. :21.4200 Max. :22.0500 Max. :2.170000 Max. :10.00000
## X0.41 X0.42 X0.43 X0.778
## Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. : 0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.: 0.0000
## Median :0.00000 Median :0.0650 Median :0.00000 Median : 0.0000
## Mean :0.03858 Mean :0.1391 Mean :0.01698 Mean : 0.2690
## 3rd Qu.:0.00000 3rd Qu.:0.1880 3rd Qu.:0.00000 3rd Qu.: 0.3142
## Max. :4.38500 Max. :9.7520 Max. :4.08100 Max. :32.4780
## X0.44 X0.45 X3.756 X61
## Min. :0.00000 Min. : 0.00000 Min. : 1.000 Min. : 1.00
## 1st Qu.:0.00000 1st Qu.: 0.00000 1st Qu.: 1.588 1st Qu.: 6.00
## Median :0.00000 Median : 0.00000 Median : 2.276 Median : 15.00
## Mean :0.07583 Mean : 0.04425 Mean : 5.192 Mean : 52.17
## 3rd Qu.:0.05200 3rd Qu.: 0.00000 3rd Qu.: 3.705 3rd Qu.: 43.00
## Max. :6.00300 Max. :19.82900 Max. :1102.500 Max. :9989.00
## X278 X1
## Min. : 1.0 Min. :0.0000
## 1st Qu.: 35.0 1st Qu.:0.0000
## Median : 95.0 Median :0.0000
## Mean : 283.3 Mean :0.3939
## 3rd Qu.: 265.2 3rd Qu.:1.0000
## Max. :15841.0 Max. :1.0000
library(tree)
length(spam$X1)
## [1] 4600
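There are 4600 emails here rather than the 4601 in the original spambase data because the first record was consumed as the header. A cleaner read would keep all rows (a sketch, not used below so that the printed output stays comparable):
spam2 <- read.csv("spambase.data", header = FALSE)   # keeps all 4601 records
names(spam2)[ncol(spam2)] <- "X1"                    # label the spam indicator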
# as before, the fit with mindev=0.1 is immediately replaced by one
# using the default mindev
spamtree <- tree(X1 ~ ., data = spam, mindev = 0.1, mincut = 1)
spamtree <- tree(X1 ~ ., data = spam, mincut = 1)
spamtree
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 4600 1098.00 0.39390
## 2) X0.44 < 0.0555 3470 623.60 0.23490
## 4) X0.3 < 0.055 3140 430.50 0.16400
## 8) X0.778 < 0.378 2737 247.40 0.10050
## 16) X0.32.1 < 0.2 2507 168.80 0.07260
## 32) X0.15 < 0.01 2439 135.50 0.05904 *
## 33) X0.15 > 0.01 68 16.76 0.55880 *
## 17) X0.32.1 > 0.2 230 55.40 0.40430 *
## 9) X0.778 > 0.378 403 97.07 0.59550
## 18) X278 < 55.5 182 37.14 0.28570 *
## 19) X278 > 55.5 221 28.07 0.85070 *
## 5) X0.3 > 0.055 330 27.27 0.90910
## 10) X0.18 < 0.14 317 16.09 0.94640 *
## 11) X0.18 > 0.14 13 0.00 0.00000 *
## 3) X0.44 > 0.0555 1130 117.30 0.88230
## 6) X0.16 < 0.4 1060 65.38 0.93400
## 12) X0.38 < 0.49 1045 52.11 0.94740 *
## 13) X0.38 > 0.49 15 0.00 0.00000 *
## 7) X0.16 > 0.4 70 6.30 0.10000 *
plot(spamtree, col=8)
text(spamtree, digits=2)
This tree is somewhat difficult to read because it references many variables and has a large number of nodes. Because the dependent variable is binary, we interpret the terminal-node values as probabilities that an email is spam.
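Because the tree was fit as a regression on the 0/1 indicator, we can turn its fitted probabilities into labels by thresholding at 0.5 and tabulating them against the truth (a sketch):
spamhat <- as.numeric(predict(spamtree) > 0.5)   # classify as spam above 0.5
table(predicted = spamhat, actual = spam$X1)     # in-sample confusion matrix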
spamcut <- prune.tree(spamtree)
spamcut
## $size
## [1] 10 9 8 7 6 5 4 3 2 1
##
## $dev
## [1] 347.3674 358.5518 371.8239 388.3484 411.5347 443.3914 489.0601
## [8] 575.1522 740.9267 1098.2296
##
## $k
## [1] -Inf 11.18440 13.27210 16.52453 23.18634 31.85670 45.66866
## [8] 86.09210 165.77452 357.30286
##
## $method
## [1] "deviance"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
plot(spamcut)
As with the house prices data, there are few large drops after the first one. Based on this graph, the best tree sizes appear to be between 3 and 5 leaves. We will try pruning with k values of 3 through 5 to see how the tree changes.
spamcut <- prune.tree(spamtree,k=3)
plot(spamcut)
text(spamcut, digits=2)
spamcut
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 4600 1098.00 0.39390
## 2) X0.44 < 0.0555 3470 623.60 0.23490
## 4) X0.3 < 0.055 3140 430.50 0.16400
## 8) X0.778 < 0.378 2737 247.40 0.10050
## 16) X0.32.1 < 0.2 2507 168.80 0.07260
## 32) X0.15 < 0.01 2439 135.50 0.05904 *
## 33) X0.15 > 0.01 68 16.76 0.55880 *
## 17) X0.32.1 > 0.2 230 55.40 0.40430 *
## 9) X0.778 > 0.378 403 97.07 0.59550
## 18) X278 < 55.5 182 37.14 0.28570 *
## 19) X278 > 55.5 221 28.07 0.85070 *
## 5) X0.3 > 0.055 330 27.27 0.90910
## 10) X0.18 < 0.14 317 16.09 0.94640 *
## 11) X0.18 > 0.14 13 0.00 0.00000 *
## 3) X0.44 > 0.0555 1130 117.30 0.88230
## 6) X0.16 < 0.4 1060 65.38 0.93400
## 12) X0.38 < 0.49 1045 52.11 0.94740 *
## 13) X0.38 > 0.49 15 0.00 0.00000 *
## 7) X0.16 > 0.4 70 6.30 0.10000 *
There was no change between the original tree and this one. The pruning sequence explains why: the smallest penalty at which any branch is removed is about k = 11.2, so k = 3 prunes nothing. We will increase the k value to confirm.
spamcut <- prune.tree(spamtree,k=4)
plot(spamcut)
text(spamcut, digits=2)
spamcut
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 4600 1098.00 0.39390
## 2) X0.44 < 0.0555 3470 623.60 0.23490
## 4) X0.3 < 0.055 3140 430.50 0.16400
## 8) X0.778 < 0.378 2737 247.40 0.10050
## 16) X0.32.1 < 0.2 2507 168.80 0.07260
## 32) X0.15 < 0.01 2439 135.50 0.05904 *
## 33) X0.15 > 0.01 68 16.76 0.55880 *
## 17) X0.32.1 > 0.2 230 55.40 0.40430 *
## 9) X0.778 > 0.378 403 97.07 0.59550
## 18) X278 < 55.5 182 37.14 0.28570 *
## 19) X278 > 55.5 221 28.07 0.85070 *
## 5) X0.3 > 0.055 330 27.27 0.90910
## 10) X0.18 < 0.14 317 16.09 0.94640 *
## 11) X0.18 > 0.14 13 0.00 0.00000 *
## 3) X0.44 > 0.0555 1130 117.30 0.88230
## 6) X0.16 < 0.4 1060 65.38 0.93400
## 12) X0.38 < 0.49 1045 52.11 0.94740 *
## 13) X0.38 > 0.49 15 0.00 0.00000 *
## 7) X0.16 > 0.4 70 6.30 0.10000 *
Again, there was no change, since k = 4 is still well below 11.2. We will increase the k value once more.
spamcut <- prune.tree(spamtree,k=5)
plot(spamcut)
text(spamcut, digits=2)
spamcut
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 4600 1098.00 0.39390
## 2) X0.44 < 0.0555 3470 623.60 0.23490
## 4) X0.3 < 0.055 3140 430.50 0.16400
## 8) X0.778 < 0.378 2737 247.40 0.10050
## 16) X0.32.1 < 0.2 2507 168.80 0.07260
## 32) X0.15 < 0.01 2439 135.50 0.05904 *
## 33) X0.15 > 0.01 68 16.76 0.55880 *
## 17) X0.32.1 > 0.2 230 55.40 0.40430 *
## 9) X0.778 > 0.378 403 97.07 0.59550
## 18) X278 < 55.5 182 37.14 0.28570 *
## 19) X278 > 55.5 221 28.07 0.85070 *
## 5) X0.3 > 0.055 330 27.27 0.90910
## 10) X0.18 < 0.14 317 16.09 0.94640 *
## 11) X0.18 > 0.14 13 0.00 0.00000 *
## 3) X0.44 > 0.0555 1130 117.30 0.88230
## 6) X0.16 < 0.4 1060 65.38 0.93400
## 12) X0.38 < 0.49 1045 52.11 0.94740 *
## 13) X0.38 > 0.49 15 0.00 0.00000 *
## 7) X0.16 > 0.4 70 6.30 0.10000 *
Again, the tree is unchanged: any k below about 11.2 leaves every branch intact. (To actually reach a tree with 3 to 5 leaves we would need k roughly between 32 and 166, or we could request a size directly, e.g. prune.tree(spamtree, best=5).) Since penalties this small leave the full tree as the lowest-deviance option, we keep it. While it is not easy to read, this is our final tree.
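A more principled way to choose the amount of pruning than eyeballing the deviance plot is cross-validation; the tree package provides cv.tree for this (a sketch; results vary with the random folds):
set.seed(1)                   # cross-validation assigns folds at random
spamcv <- cv.tree(spamtree)   # 10-fold CV over the pruning sequence
plot(spamcv)                  # size versus cross-validated deviance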