Using this data set, we will perform a CART analysis to construct a tree diagram that predicts whether someone will receive the death penalty based on the other variables in the set.
download.file("https://www.biz.uiowa.edu/faculty/jledolter/DataMining/DeathPenalty.csv", "DeathPenalty.csv",method="curl")
dpen <- read.csv("DeathPenalty.csv")
head(dpen)
## Agg VRace Death
## 1 1 1 1
## 2 1 1 1
## 3 1 1 0
## 4 1 1 0
## 5 1 1 0
## 6 1 1 0
library(tree)
length(dpen$Death)
## [1] 362
# the first fit restricts splitting with mindev=0.1; the second call,
# using the default mindev, immediately replaces it
dpentree <- tree(Death ~ ., data = dpen, mindev = 0.1, mincut = 1)
dpentree <- tree(Death ~ ., data = dpen, mincut = 1)
dpentree
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 362 49.380 0.1630
## 2) Agg < 3.5 307 13.360 0.0456
## 4) Agg < 2.5 283 5.873 0.0212 *
## 5) Agg > 2.5 24 5.333 0.3333 *
## 3) Agg > 3.5 55 8.182 0.8182
## 6) VRace < 0.5 17 4.118 0.5882
## 12) Agg < 5.5 13 3.231 0.4615 *
## 13) Agg > 5.5 4 0.000 1.0000 *
## 7) VRace > 0.5 38 2.763 0.9211
## 14) Agg < 4.5 12 2.250 0.7500 *
## 15) Agg > 4.5 26 0.000 1.0000 *
plot(dpentree, col=8)
text(dpentree, digits=2)
Because the Death variable is binary, we interpret the values at the terminal leaves as the probability that someone will receive the death penalty. Based on the tree, anyone with an Agg value above 3.5 has a much higher predicted chance of receiving the death penalty. This tree seems reasonable, but we should still prune it to see if it can be improved.
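To make this concrete, we can ask the fitted tree for predictions on hypothetical cases (a sketch: the Agg and VRace values below are invented for illustration, and the expected outputs are read off the leaf yvals above).
# two made-up cases: a low-aggravation case and a high-aggravation case
newcases <- data.frame(Agg = c(2, 6), VRace = c(1, 1))
predict(dpentree, newdata = newcases)   # expect about 0.021 (leaf 4) and 1.000 (leaf 15)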
First, we will prune the tree with a penalty value of k = 1 to see how it changes.
dpencut <- prune.tree(dpentree,k=1)
plot(dpencut)   # k = 1 is just one choice; other penalty values can be tried
text(dpencut, digits=2)
dpencut
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 362 49.380 0.1630
## 2) Agg < 3.5 307 13.360 0.0456
## 4) Agg < 2.5 283 5.873 0.0212 *
## 5) Agg > 2.5 24 5.333 0.3333 *
## 3) Agg > 3.5 55 8.182 0.8182
## 6) VRace < 0.5 17 4.118 0.5882 *
## 7) VRace > 0.5 38 2.763 0.9211 *
In this tree, nodes 6 and 7 are now terminal, so each side of the root splits only once. In the original tree, the right side was split by victim's race (VRace) and then again by Agg.
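We can verify one of these collapses by hand: weakest-link pruning removes a split when the deviance it saves, per extra leaf, is less than the penalty k. Checking node 6 of the original tree against the printed deviances:
# node 6 (deviance 4.118) split into nodes 12 and 13 (3.231 and 0.000);
# the split saves 4.118 - (3.231 + 0.000) = 0.887 in deviance for one
# extra leaf, which is less than k = 1, so it is pruned away
4.118 - (3.231 + 0.000)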
Now we will increase the penalty to see how the tree changes. Larger values of k prune more aggressively, keeping only splits that save more than k of deviance per extra leaf.
dpencut <- prune.tree(dpentree,k=2)
plot(dpencut)
text(dpencut, digits=2)
dpencut
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 362 49.380 0.1630
## 2) Agg < 3.5 307 13.360 0.0456
## 4) Agg < 2.5 283 5.873 0.0212 *
## 5) Agg > 2.5 24 5.333 0.3333 *
## 3) Agg > 3.5 55 8.182 0.8182 *
The tree has now changed drastically: there is no further splitting on the right side. If Agg is greater than 3.5, the predicted chance of receiving the death penalty is simply 0.82.
Now we will compute the whole pruning sequence and plot it to see which tree size decreases the deviance most efficiently; splits that only gradually decrease the deviance after the large drops are not worth the extra complexity.
dpencut <- prune.tree(dpentree)
dpencut
## $size
## [1] 6 5 4 3 2 1
##
## $dev
## [1] 16.68689 17.20005 18.08693 19.38794 21.54338 49.38398
##
## $k
## [1] -Inf 0.5131579 0.8868778 1.3010132 2.1554387 27.8405962
##
## $method
## [1] "deviance"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
plot(dpencut)
According to this graph, the two-leaf tree may be the best choice: it captures the large drops in deviance, after which the deviance decreases only gradually. The next tree uses best=2, which asks prune.tree for the best subtree with two terminal nodes.
dpenbest <- prune.tree(dpentree,best=2)
dpenbest
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 362 49.380 0.1630
## 2) Agg < 3.5 307 13.360 0.0456 *
## 3) Agg > 3.5 55 8.182 0.8182 *
plot(dpenbest)
text(dpenbest, digits=2)
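As a quick sanity check, the fitted values of this two-leaf tree should take only the two yvals printed above:
unique(round(predict(dpenbest), 4))   # expect 0.0456 and 0.8182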
Using this house prices data, we will use CART analysis to predict the neighborhood in which a house is located. We first remove the first column, which is just a case number assigned to each house.
download.file("https://www.biz.uiowa.edu/faculty/jledolter/datamining/HousePrices.csv", "HousePrices.csv",method="curl")
hp <- read.csv("HousePrices.csv")   # read the local copy we just downloaded
hp <- hp[ , -1]                     # drop the case-number column
library(MASS)
library(tree)
head(hp)
## Price SqFt Bedrooms Bathrooms Offers Brick Neighborhood
## 1 114300 1790 2 2 2 No East
## 2 114200 2030 4 2 3 No East
## 3 114800 1740 3 2 1 No East
## 4 94700 1980 3 2 3 No East
## 5 119800 2130 3 3 3 No East
## 6 114600 1780 3 2 2 No North
hptree <- tree(Neighborhood~.,data=hp)
hptree
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 128 280.800 East ( 0.35156 0.34375 0.30469 )
## 2) Price < 128750 68 92.140 North ( 0.41176 0.58824 0.00000 )
## 4) Offers < 3.5 52 72.090 East ( 0.50000 0.50000 0.00000 )
## 8) Price < 93200 5 0.000 North ( 0.00000 1.00000 0.00000 ) *
## 9) Price > 93200 47 64.620 East ( 0.55319 0.44681 0.00000 )
## 18) Price < 118500 35 48.260 North ( 0.45714 0.54286 0.00000 )
## 36) Price < 105850 8 8.997 East ( 0.75000 0.25000 0.00000 ) *
## 37) Price > 105850 27 35.590 North ( 0.37037 0.62963 0.00000 ) *
## 19) Price > 118500 12 10.810 East ( 0.83333 0.16667 0.00000 ) *
## 5) Offers > 3.5 16 12.060 North ( 0.12500 0.87500 0.00000 ) *
## 3) Price > 128750 60 98.140 West ( 0.28333 0.06667 0.65000 )
## 6) Price < 157350 41 77.260 West ( 0.41463 0.09756 0.48780 )
## 12) Brick: No 26 42.680 West ( 0.19231 0.11538 0.69231 )
## 24) Bathrooms < 2.5 12 6.884 West ( 0.00000 0.08333 0.91667 ) *
## 25) Bathrooms > 2.5 14 27.780 West ( 0.35714 0.14286 0.50000 )
## 50) SqFt < 2350 9 17.910 West ( 0.22222 0.22222 0.55556 ) *
## 51) SqFt > 2350 5 6.730 East ( 0.60000 0.00000 0.40000 ) *
## 13) Brick: Yes 15 18.830 East ( 0.80000 0.06667 0.13333 )
## 26) SqFt < 2000 5 10.550 East ( 0.40000 0.20000 0.40000 ) *
## 27) SqFt > 2000 10 0.000 East ( 1.00000 0.00000 0.00000 ) *
## 7) Price > 157350 19 0.000 West ( 0.00000 0.00000 1.00000 ) *
plot(hptree)
text(hptree,digits=2)
This is the first tree, with no pruning. According to it, a house can be in the West neighborhood only if it is priced over $128,750. However, this is not the optimal tree: nodes 26 and 27 both lead to the same neighborhood. Therefore, we should prune the tree to simplify it without adding much error.
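Before pruning, it is worth seeing why the tree kept a split whose two leaves agree. Splitting node 13 still lowers the total deviance, even though both children are labeled East; a quick check from the printed output:
# node 13 (deviance 18.83) splits into nodes 26 and 27 (10.55 and 0.00);
# 10.55 + 0.00 < 18.83, so the split reduces deviance despite both
# children predicting East
10.55 + 0.00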
hpcut <- prune.tree(hptree)
hpcut
## $size
## [1] 11 10 9 8 7 6 5 4 3 2 1
##
## $dev
## [1] 109.5356 112.6789 116.3501 121.8967 129.3611 137.3564 145.3700
## [8] 153.6520 169.4032 190.2832 280.7536
##
## $k
## [1] -Inf 3.143293 3.671246 5.546599 7.464395 7.995287 8.013625
## [8] 8.281956 15.751268 20.879921 90.470458
##
## $method
## [1] "deviance"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
plot(hpcut)
There are not very many drastic drops in deviance after k = 2. Based on the graph, the best k values lie between 2 and 4, so our next tree will use k = 3 to see how the tree changes.
hpcut <- prune.tree(hptree,k=3)
plot(hpcut)
text(hpcut, digits=2)
hpcut
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 128 280.800 East ( 0.35156 0.34375 0.30469 )
## 2) Price < 128750 68 92.140 North ( 0.41176 0.58824 0.00000 )
## 4) Offers < 3.5 52 72.090 East ( 0.50000 0.50000 0.00000 )
## 8) Price < 93200 5 0.000 North ( 0.00000 1.00000 0.00000 ) *
## 9) Price > 93200 47 64.620 East ( 0.55319 0.44681 0.00000 )
## 18) Price < 118500 35 48.260 North ( 0.45714 0.54286 0.00000 )
## 36) Price < 105850 8 8.997 East ( 0.75000 0.25000 0.00000 ) *
## 37) Price > 105850 27 35.590 North ( 0.37037 0.62963 0.00000 ) *
## 19) Price > 118500 12 10.810 East ( 0.83333 0.16667 0.00000 ) *
## 5) Offers > 3.5 16 12.060 North ( 0.12500 0.87500 0.00000 ) *
## 3) Price > 128750 60 98.140 West ( 0.28333 0.06667 0.65000 )
## 6) Price < 157350 41 77.260 West ( 0.41463 0.09756 0.48780 )
## 12) Brick: No 26 42.680 West ( 0.19231 0.11538 0.69231 )
## 24) Bathrooms < 2.5 12 6.884 West ( 0.00000 0.08333 0.91667 ) *
## 25) Bathrooms > 2.5 14 27.780 West ( 0.35714 0.14286 0.50000 )
## 50) SqFt < 2350 9 17.910 West ( 0.22222 0.22222 0.55556 ) *
## 51) SqFt > 2350 5 6.730 East ( 0.60000 0.00000 0.40000 ) *
## 13) Brick: Yes 15 18.830 East ( 0.80000 0.06667 0.13333 )
## 26) SqFt < 2000 5 10.550 East ( 0.40000 0.20000 0.40000 ) *
## 27) SqFt > 2000 10 0.000 East ( 1.00000 0.00000 0.00000 ) *
## 7) Price > 157350 19 0.000 West ( 0.00000 0.00000 1.00000 ) *
This tree did not change at all from the original. The pruning sequence above shows why: the smallest penalty at which any branch is removed is about k = 3.14, so k = 3 prunes nothing. In addition, node 13 still splits into two leaves that both predict East. Therefore, we will snip node 13 directly to simplify the tree. First, here is a summary of the full tree for reference.
summary(hptree)
##
## Classification tree:
## tree(formula = Neighborhood ~ ., data = hp)
## Variables actually used in tree construction:
## [1] "Price" "Offers" "Brick" "Bathrooms" "SqFt"
## Number of terminal nodes: 11
## Residual mean deviance: 0.9362 = 109.5 / 117
## Misclassification error rate: 0.2031 = 26 / 128
hpsnip <- snip.tree(hptree, nodes = c(13))
hpsnip
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 128 280.800 East ( 0.35156 0.34375 0.30469 )
## 2) Price < 128750 68 92.140 North ( 0.41176 0.58824 0.00000 )
## 4) Offers < 3.5 52 72.090 East ( 0.50000 0.50000 0.00000 )
## 8) Price < 93200 5 0.000 North ( 0.00000 1.00000 0.00000 ) *
## 9) Price > 93200 47 64.620 East ( 0.55319 0.44681 0.00000 )
## 18) Price < 118500 35 48.260 North ( 0.45714 0.54286 0.00000 )
## 36) Price < 105850 8 8.997 East ( 0.75000 0.25000 0.00000 ) *
## 37) Price > 105850 27 35.590 North ( 0.37037 0.62963 0.00000 ) *
## 19) Price > 118500 12 10.810 East ( 0.83333 0.16667 0.00000 ) *
## 5) Offers > 3.5 16 12.060 North ( 0.12500 0.87500 0.00000 ) *
## 3) Price > 128750 60 98.140 West ( 0.28333 0.06667 0.65000 )
## 6) Price < 157350 41 77.260 West ( 0.41463 0.09756 0.48780 )
## 12) Brick: No 26 42.680 West ( 0.19231 0.11538 0.69231 )
## 24) Bathrooms < 2.5 12 6.884 West ( 0.00000 0.08333 0.91667 ) *
## 25) Bathrooms > 2.5 14 27.780 West ( 0.35714 0.14286 0.50000 )
## 50) SqFt < 2350 9 17.910 West ( 0.22222 0.22222 0.55556 ) *
## 51) SqFt > 2350 5 6.730 East ( 0.60000 0.00000 0.40000 ) *
## 13) Brick: Yes 15 18.830 East ( 0.80000 0.06667 0.13333 ) *
## 7) Price > 157350 19 0.000 West ( 0.00000 0.00000 1.00000 ) *
plot(hpsnip)
text(hpsnip)
Now node 13 leads directly to the East neighborhood instead of splitting into two leaves that both lead to East.
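To gauge what the snip costs in fit, we can compare in-sample confusion matrices for the full and snipped trees (a sketch; predict with type="class" returns each house's predicted neighborhood):
# rows are predictions, columns are the true neighborhoods
table(predict(hptree, type = "class"), hp$Neighborhood)
table(predict(hpsnip, type = "class"), hp$Neighborhood)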
With this data, we will use CART analysis to determine whether an email is spam. The spam indicator is the last column, X1: it contains 1 if the email is spam and 0 if it is not. (The raw file has no header row, so read.csv treats the first record as column names, which is why the variables have names like X0.64.)
download.file("http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data", "spambase.data",method="curl")
spam<-read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data")
summary(spam)
## X0 X0.64 X0.64.1 X0.1
## Min. :0.0000 Min. : 0.0000 Min. :0.0000 Min. : 0.00000
## 1st Qu.:0.0000 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.: 0.00000
## Median :0.0000 Median : 0.0000 Median :0.0000 Median : 0.00000
## Mean :0.1046 Mean : 0.2129 Mean :0.2806 Mean : 0.06544
## 3rd Qu.:0.0000 3rd Qu.: 0.0000 3rd Qu.:0.4200 3rd Qu.: 0.00000
## Max. :4.5400 Max. :14.2800 Max. :5.1000 Max. :42.81000
## X0.32 X0.2 X0.3 X0.4
## Min. : 0.0000 Min. :0.00000 Min. :0.0000 Min. : 0.0000
## 1st Qu.: 0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.: 0.0000
## Median : 0.0000 Median :0.00000 Median :0.0000 Median : 0.0000
## Mean : 0.3122 Mean :0.09592 Mean :0.1142 Mean : 0.1053
## 3rd Qu.: 0.3825 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.: 0.0000
## Max. :10.0000 Max. :5.88000 Max. :7.2700 Max. :11.1100
## X0.5 X0.6 X0.7 X0.64.2
## Min. :0.00000 Min. : 0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.: 0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median : 0.0000 Median :0.00000 Median :0.1000
## Mean :0.09009 Mean : 0.2395 Mean :0.05984 Mean :0.5417
## 3rd Qu.:0.00000 3rd Qu.: 0.1600 3rd Qu.:0.00000 3rd Qu.:0.8000
## Max. :5.26000 Max. :18.1800 Max. :2.61000 Max. :9.6700
## X0.8 X0.9 X0.10 X0.32.1
## Min. :0.00000 Min. : 0.00000 Min. :0.00000 Min. : 0.0000
## 1st Qu.:0.00000 1st Qu.: 0.00000 1st Qu.:0.00000 1st Qu.: 0.0000
## Median :0.00000 Median : 0.00000 Median :0.00000 Median : 0.0000
## Mean :0.09395 Mean : 0.05864 Mean :0.04922 Mean : 0.2488
## 3rd Qu.:0.00000 3rd Qu.: 0.00000 3rd Qu.:0.00000 3rd Qu.: 0.1000
## Max. :5.55000 Max. :10.00000 Max. :4.41000 Max. :20.0000
## X0.11 X1.29 X1.93 X0.12
## Min. :0.0000 Min. :0.0000 Min. : 0.000 Min. : 0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 0.000 1st Qu.: 0.0000
## Median :0.0000 Median :0.0000 Median : 1.310 Median : 0.0000
## Mean :0.1426 Mean :0.1845 Mean : 1.662 Mean : 0.0856
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.: 2.640 3rd Qu.: 0.0000
## Max. :7.1400 Max. :9.0900 Max. :18.750 Max. :18.1800
## X0.96 X0.13 X0.14 X0.15
## Min. : 0.0000 Min. : 0.0000 Min. :0.0000 Min. : 0.00000
## 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.: 0.00000
## Median : 0.2200 Median : 0.0000 Median :0.0000 Median : 0.00000
## Mean : 0.8097 Mean : 0.1212 Mean :0.1017 Mean : 0.09429
## 3rd Qu.: 1.2700 3rd Qu.: 0.0000 3rd Qu.:0.0000 3rd Qu.: 0.00000
## Max. :11.1100 Max. :17.1000 Max. :5.4500 Max. :12.50000
## X0.16 X0.17 X0.18 X0.19
## Min. : 0.0000 Min. : 0.0000 Min. : 0.0000 Min. :0.0000
## 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.:0.0000
## Median : 0.0000 Median : 0.0000 Median : 0.0000 Median :0.0000
## Mean : 0.5496 Mean : 0.2654 Mean : 0.7675 Mean :0.1249
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.:0.0000
## Max. :20.8300 Max. :16.6600 Max. :33.3300 Max. :9.0900
## X0.20 X0.21 X0.22 X0.23
## Min. : 0.00000 Min. :0.0000 Min. : 0.00000 Min. :0.00000
## 1st Qu.: 0.00000 1st Qu.:0.0000 1st Qu.: 0.00000 1st Qu.:0.00000
## Median : 0.00000 Median :0.0000 Median : 0.00000 Median :0.00000
## Mean : 0.09894 Mean :0.1029 Mean : 0.06477 Mean :0.04706
## 3rd Qu.: 0.00000 3rd Qu.:0.0000 3rd Qu.: 0.00000 3rd Qu.:0.00000
## Max. :14.28000 Max. :5.8800 Max. :12.50000 Max. :4.76000
## X0.24 X0.25 X0.26 X0.27
## Min. : 0.00000 Min. :0.00000 Min. : 0.0000 Min. :0.0000
## 1st Qu.: 0.00000 1st Qu.:0.00000 1st Qu.: 0.0000 1st Qu.:0.0000
## Median : 0.00000 Median :0.00000 Median : 0.0000 Median :0.0000
## Mean : 0.09725 Mean :0.04785 Mean : 0.1054 Mean :0.0975
## 3rd Qu.: 0.00000 3rd Qu.:0.00000 3rd Qu.: 0.0000 3rd Qu.:0.0000
## Max. :18.18000 Max. :4.76000 Max. :20.0000 Max. :7.6900
## X0.28 X0.29 X0.30 X0.31
## Min. :0.000 Min. :0.0000 Min. : 0.00000 Min. :0.00000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.: 0.00000 1st Qu.:0.00000
## Median :0.000 Median :0.0000 Median : 0.00000 Median :0.00000
## Mean :0.137 Mean :0.0132 Mean : 0.07865 Mean :0.06485
## 3rd Qu.:0.000 3rd Qu.:0.0000 3rd Qu.: 0.00000 3rd Qu.:0.00000
## Max. :6.890 Max. :8.3300 Max. :11.11000 Max. :4.76000
## X0.33 X0.34 X0.35 X0.36
## Min. :0.00000 Min. : 0.0000 Min. :0.00000 Min. : 0.00000
## 1st Qu.:0.00000 1st Qu.: 0.0000 1st Qu.:0.00000 1st Qu.: 0.00000
## Median :0.00000 Median : 0.0000 Median :0.00000 Median : 0.00000
## Mean :0.04368 Mean : 0.1324 Mean :0.04611 Mean : 0.07921
## 3rd Qu.:0.00000 3rd Qu.: 0.0000 3rd Qu.:0.00000 3rd Qu.: 0.00000
## Max. :7.14000 Max. :14.2800 Max. :3.57000 Max. :20.00000
## X0.37 X0.38 X0.39 X0.40
## Min. : 0.0000 Min. : 0.0000 Min. :0.000000 Min. : 0.00000
## 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.:0.000000 1st Qu.: 0.00000
## Median : 0.0000 Median : 0.0000 Median :0.000000 Median : 0.00000
## Mean : 0.3013 Mean : 0.1799 Mean :0.005446 Mean : 0.03188
## 3rd Qu.: 0.1100 3rd Qu.: 0.0000 3rd Qu.:0.000000 3rd Qu.: 0.00000
## Max. :21.4200 Max. :22.0500 Max. :2.170000 Max. :10.00000
## X0.41 X0.42 X0.43 X0.778
## Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. : 0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.: 0.0000
## Median :0.00000 Median :0.0650 Median :0.00000 Median : 0.0000
## Mean :0.03858 Mean :0.1391 Mean :0.01698 Mean : 0.2690
## 3rd Qu.:0.00000 3rd Qu.:0.1880 3rd Qu.:0.00000 3rd Qu.: 0.3142
## Max. :4.38500 Max. :9.7520 Max. :4.08100 Max. :32.4780
## X0.44 X0.45 X3.756 X61
## Min. :0.00000 Min. : 0.00000 Min. : 1.000 Min. : 1.00
## 1st Qu.:0.00000 1st Qu.: 0.00000 1st Qu.: 1.588 1st Qu.: 6.00
## Median :0.00000 Median : 0.00000 Median : 2.276 Median : 15.00
## Mean :0.07583 Mean : 0.04425 Mean : 5.192 Mean : 52.17
## 3rd Qu.:0.05200 3rd Qu.: 0.00000 3rd Qu.: 3.705 3rd Qu.: 43.00
## Max. :6.00300 Max. :19.82900 Max. :1102.500 Max. :9989.00
## X278 X1
## Min. : 1.0 Min. :0.0000
## 1st Qu.: 35.0 1st Qu.:0.0000
## Median : 95.0 Median :0.0000
## Mean : 283.3 Mean :0.3939
## 3rd Qu.: 265.2 3rd Qu.:1.0000
## Max. :15841.0 Max. :1.0000
library(tree)
length(spam$X1)
## [1] 4600
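There are 4600 emails here rather than the 4601 in the original spambase data because the first record was consumed as the header. A cleaner read would keep all rows (a sketch, not used below so that the printed output stays comparable):
spam2 <- read.csv("spambase.data", header = FALSE)   # keeps all 4601 records
names(spam2)[ncol(spam2)] <- "X1"                    # label the spam indicator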
# as before, the fit with mindev=0.1 is immediately replaced by one
# using the default mindev
spamtree <- tree(X1 ~ ., data = spam, mindev = 0.1, mincut = 1)
spamtree <- tree(X1 ~ ., data = spam, mincut = 1)
spamtree
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 4600 1098.00 0.39390
## 2) X0.44 < 0.0555 3470 623.60 0.23490
## 4) X0.3 < 0.055 3140 430.50 0.16400
## 8) X0.778 < 0.378 2737 247.40 0.10050
## 16) X0.32.1 < 0.2 2507 168.80 0.07260
## 32) X0.15 < 0.01 2439 135.50 0.05904 *
## 33) X0.15 > 0.01 68 16.76 0.55880 *
## 17) X0.32.1 > 0.2 230 55.40 0.40430 *
## 9) X0.778 > 0.378 403 97.07 0.59550
## 18) X278 < 55.5 182 37.14 0.28570 *
## 19) X278 > 55.5 221 28.07 0.85070 *
## 5) X0.3 > 0.055 330 27.27 0.90910
## 10) X0.18 < 0.14 317 16.09 0.94640 *
## 11) X0.18 > 0.14 13 0.00 0.00000 *
## 3) X0.44 > 0.0555 1130 117.30 0.88230
## 6) X0.16 < 0.4 1060 65.38 0.93400
## 12) X0.38 < 0.49 1045 52.11 0.94740 *
## 13) X0.38 > 0.49 15 0.00 0.00000 *
## 7) X0.16 > 0.4 70 6.30 0.10000 *
plot(spamtree, col=8)
text(spamtree, digits=2)
This tree is somewhat difficult to read because it references many variables and has a large number of nodes. Because the dependent variable is binary, we interpret the terminal-node values as probabilities that an email is spam.
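Because the tree was fit as a regression on the 0/1 indicator, we can turn its fitted probabilities into labels by thresholding at 0.5 and tabulating them against the truth (a sketch):
spamhat <- as.numeric(predict(spamtree) > 0.5)   # classify as spam above 0.5
table(predicted = spamhat, actual = spam$X1)     # in-sample confusion matrix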
spamcut <- prune.tree(spamtree)
spamcut
## $size
## [1] 10 9 8 7 6 5 4 3 2 1
##
## $dev
## [1] 347.3674 358.5518 371.8239 388.3484 411.5347 443.3914 489.0601
## [8] 575.1522 740.9267 1098.2296
##
## $k
## [1] -Inf 11.18440 13.27210 16.52453 23.18634 31.85670 45.66866
## [8] 86.09210 165.77452 357.30286
##
## $method
## [1] "deviance"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
plot(spamcut)
As with the house prices data, there are few large drops after the first one. Based on this graph, the best tree sizes appear to be between 3 and 5 leaves. We will try pruning with k values of 3 through 5 to see how the tree changes.
spamcut <- prune.tree(spamtree,k=3)
plot(spamcut)
text(spamcut, digits=2)
spamcut
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 4600 1098.00 0.39390
## 2) X0.44 < 0.0555 3470 623.60 0.23490
## 4) X0.3 < 0.055 3140 430.50 0.16400
## 8) X0.778 < 0.378 2737 247.40 0.10050
## 16) X0.32.1 < 0.2 2507 168.80 0.07260
## 32) X0.15 < 0.01 2439 135.50 0.05904 *
## 33) X0.15 > 0.01 68 16.76 0.55880 *
## 17) X0.32.1 > 0.2 230 55.40 0.40430 *
## 9) X0.778 > 0.378 403 97.07 0.59550
## 18) X278 < 55.5 182 37.14 0.28570 *
## 19) X278 > 55.5 221 28.07 0.85070 *
## 5) X0.3 > 0.055 330 27.27 0.90910
## 10) X0.18 < 0.14 317 16.09 0.94640 *
## 11) X0.18 > 0.14 13 0.00 0.00000 *
## 3) X0.44 > 0.0555 1130 117.30 0.88230
## 6) X0.16 < 0.4 1060 65.38 0.93400
## 12) X0.38 < 0.49 1045 52.11 0.94740 *
## 13) X0.38 > 0.49 15 0.00 0.00000 *
## 7) X0.16 > 0.4 70 6.30 0.10000 *
There was no change between the original tree and this one. The pruning sequence explains why: the smallest penalty at which any branch is removed is about k = 11.2, so k = 3 prunes nothing. We will increase the k value to confirm.
spamcut <- prune.tree(spamtree,k=4)
plot(spamcut)
text(spamcut, digits=2)
spamcut
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 4600 1098.00 0.39390
## 2) X0.44 < 0.0555 3470 623.60 0.23490
## 4) X0.3 < 0.055 3140 430.50 0.16400
## 8) X0.778 < 0.378 2737 247.40 0.10050
## 16) X0.32.1 < 0.2 2507 168.80 0.07260
## 32) X0.15 < 0.01 2439 135.50 0.05904 *
## 33) X0.15 > 0.01 68 16.76 0.55880 *
## 17) X0.32.1 > 0.2 230 55.40 0.40430 *
## 9) X0.778 > 0.378 403 97.07 0.59550
## 18) X278 < 55.5 182 37.14 0.28570 *
## 19) X278 > 55.5 221 28.07 0.85070 *
## 5) X0.3 > 0.055 330 27.27 0.90910
## 10) X0.18 < 0.14 317 16.09 0.94640 *
## 11) X0.18 > 0.14 13 0.00 0.00000 *
## 3) X0.44 > 0.0555 1130 117.30 0.88230
## 6) X0.16 < 0.4 1060 65.38 0.93400
## 12) X0.38 < 0.49 1045 52.11 0.94740 *
## 13) X0.38 > 0.49 15 0.00 0.00000 *
## 7) X0.16 > 0.4 70 6.30 0.10000 *
Again, there was no change, since k = 4 is still well below 11.2. We will increase the k value once more.
spamcut <- prune.tree(spamtree,k=5)
plot(spamcut)
text(spamcut, digits=2)
spamcut
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 4600 1098.00 0.39390
## 2) X0.44 < 0.0555 3470 623.60 0.23490
## 4) X0.3 < 0.055 3140 430.50 0.16400
## 8) X0.778 < 0.378 2737 247.40 0.10050
## 16) X0.32.1 < 0.2 2507 168.80 0.07260
## 32) X0.15 < 0.01 2439 135.50 0.05904 *
## 33) X0.15 > 0.01 68 16.76 0.55880 *
## 17) X0.32.1 > 0.2 230 55.40 0.40430 *
## 9) X0.778 > 0.378 403 97.07 0.59550
## 18) X278 < 55.5 182 37.14 0.28570 *
## 19) X278 > 55.5 221 28.07 0.85070 *
## 5) X0.3 > 0.055 330 27.27 0.90910
## 10) X0.18 < 0.14 317 16.09 0.94640 *
## 11) X0.18 > 0.14 13 0.00 0.00000 *
## 3) X0.44 > 0.0555 1130 117.30 0.88230
## 6) X0.16 < 0.4 1060 65.38 0.93400
## 12) X0.38 < 0.49 1045 52.11 0.94740 *
## 13) X0.38 > 0.49 15 0.00 0.00000 *
## 7) X0.16 > 0.4 70 6.30 0.10000 *
Again, the tree is unchanged: any k below about 11.2 leaves every branch intact. (To actually reach a tree with 3 to 5 leaves we would need k roughly between 32 and 166, or we could request a size directly, e.g. prune.tree(spamtree, best=5).) Since penalties this small leave the full tree as the lowest-deviance option, we keep it. While it is not easy to read, this is our final tree.
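A more principled way to choose the amount of pruning than eyeballing the deviance plot is cross-validation; the tree package provides cv.tree for this (a sketch; results vary with the random folds):
set.seed(1)                   # cross-validation assigns folds at random
spamcv <- cv.tree(spamtree)   # 10-fold CV over the pruning sequence
plot(spamcv)                  # size versus cross-validated deviance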