In this blog post, we fit a classification tree to the crime data set from HW Assignment 3, visualize its decision-making hierarchy, and compare it with the classification hierarchy I created manually for that assignment.
The HW3 dataset describes the suburbs or neighborhoods of a major city. The training set contains 466 observations of 13 variables, including the response variable target, which takes the value 1 if the neighborhood has an above-median crime rate and 0 otherwise. All of the predictor variables are numeric.
We import the data and summarize each field, its expected impact, and some key statistics below for reference. A more in-depth analysis from the point of view of logistic regression is available in my earlier study: [https://rpubs.com/Fixed_Point/599508]
| Variable | Expected impact on crime | Comments |
|---|---|---|
| zn | Lower | Proportion of a town's land zoned for lots greater than 25,000 square feet. Large lots are expected to exclude cheaper, smaller houses and have wealthier owners. |
| indus | Higher | Proportion of non-retail business acres per suburb, associated with industry, noise, heavy traffic, and unpleasant visual effects. |
| chas | Lower | Charles River dummy variable: 1 if the suburb borders the river, 0 otherwise. A riverside location is an amenity, and a wealthier suburb has less crime. |
| nox | Higher | Concentration of nitrogen oxides in the atmosphere. Since nox is pollution, an increase might make the neighborhood less appealing and more crime-ridden. |
| rm | Lower | Average number of rooms per house, a proxy for wealth. More rooms could be associated with lower crime. |
| age | Higher | Proportion of owner-occupied units built before 1940. Unit age is related to structure quality, so an older housing stock should be associated with less affluence. |
| dis | Lower | Weighted distance to five employment centers. Closer proximity to an urban center should be associated with more crime if crime concentrates around urban centers. |
| rad | Lower | Index of access to radial highways, measuring proximity to urban centers and ease of access; easier access, however, may also allow criminals to travel in. |
| tax | Lower | Higher taxes should provide more resources to the suburb but could also reduce house values through the higher tax burden. Both directions are plausible, but I argue that higher taxes may fund more policing. |
| ptratio | Higher | A low pupil-teacher ratio per school district implies greater wealth and is a proxy for school quality, so an increase in the ratio should imply a higher crime rate. |
| lstat | Higher | A higher proportion of low-status residents (less educated, less skilled workers) could imply a higher crime rate because of a greater level of poverty. |
| medv | Lower | A higher median property value might imply the neighborhood has a lower crime rate. |
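The summary statistics below come from applying base R's summary() to the imported training set. A minimal sketch of the import step, assuming the training file is named crime-training-data.csv (a hypothetical file name); the tidyverse is loaded here since later steps use the pipe and mutate():

# Import the training data; the file name is an assumption
library(tidyverse)
dat = read.csv("crime-training-data.csv")
summary(dat)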
## zn indus chas nox
## Min. : 0.00 Min. : 0.460 Min. :0.00000 Min. :0.3890
## 1st Qu.: 0.00 1st Qu.: 5.145 1st Qu.:0.00000 1st Qu.:0.4480
## Median : 0.00 Median : 9.690 Median :0.00000 Median :0.5380
## Mean : 11.58 Mean :11.105 Mean :0.07082 Mean :0.5543
## 3rd Qu.: 16.25 3rd Qu.:18.100 3rd Qu.:0.00000 3rd Qu.:0.6240
## Max. :100.00 Max. :27.740 Max. :1.00000 Max. :0.8710
## rm age dis rad
## Min. :3.863 Min. : 2.90 Min. : 1.130 Min. : 1.00
## 1st Qu.:5.887 1st Qu.: 43.88 1st Qu.: 2.101 1st Qu.: 4.00
## Median :6.210 Median : 77.15 Median : 3.191 Median : 5.00
## Mean :6.291 Mean : 68.37 Mean : 3.796 Mean : 9.53
## 3rd Qu.:6.630 3rd Qu.: 94.10 3rd Qu.: 5.215 3rd Qu.:24.00
## Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.00
## tax ptratio lstat medv
## Min. :187.0 Min. :12.6 Min. : 1.730 Min. : 5.00
## 1st Qu.:281.0 1st Qu.:16.9 1st Qu.: 7.043 1st Qu.:17.02
## Median :334.5 Median :18.9 Median :11.350 Median :21.20
## Mean :409.5 Mean :18.4 Mean :12.631 Mean :22.59
## 3rd Qu.:666.0 3rd Qu.:20.2 3rd Qu.:16.930 3rd Qu.:25.00
## Max. :711.0 Max. :22.0 Max. :37.970 Max. :50.00
## target
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4914
## 3rd Qu.:1.0000
## Max. :1.0000
After a detailed analysis of the variables, I proposed the rules below to classify high (1) and low (0) crime neighborhoods.
To create those rules, I transformed the tax rate variable after a visual analysis of its conditional distribution: many high-crime neighborhoods clustered around a tax rate of 300 while low-crime neighborhoods did not.
\[ Tax2 = (tax - 300)^2\]
# Create the engineered tax2 variable
dat %>% mutate( tax2 = (tax - 300)^2 ) -> dat2
# Manual classification rule: returns 1 (high crime) or 0 (low crime)
manualrule <- function(df){
  n = nrow(df)
  outcome = vector(mode = "integer", length = n)
  for (i in seq_len(n)) {
    nox     = df[i, "nox"]
    rad     = df[i, "rad"]
    tax2    = df[i, "tax2"]
    ptratio = df[i, "ptratio"]
    # Rules are applied in order; the first match decides the label
    if (nox < 0.48) {
      val = "Low"
    } else if (nox > 0.6) {
      val = "High"
    } else if (rad > 6) {
      val = "High"
    } else if (tax2 > 5000) {
      val = "Low"
    } else if (ptratio > 18) {
      val = "High"
    } else {
      val = "Low"
    }
    outcome[i] = ifelse(val == "High", 1, 0)
  }
  return(outcome)
}
Those rules are summarized below for reference; the reasoning behind them is covered in the previously mentioned analysis.
If nox < 0.48, assign to Low Crime. If nox > 0.60, assign to High Crime. In the intermediate region, we consider other variables:

- If rad > 6, assign to High Crime.
- If rad <= 6, use \(Tax2 = (Tax - 300)^2\) as an artificial variable:
  - If \(Tax2 > 5000\), assign to Low Crime.
  - Otherwise, if ptratio > 18, assign to High Crime; else assign to Low Crime.
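As an aside, the same rule chain can be written as a compact vectorized function with dplyr's case_when(), where the first matching condition wins, exactly like the if/else chain above. A minimal sketch (manualrule2 is a hypothetical name):

# Vectorized equivalent of manualrule (sketch)
manualrule2 <- function(df){
  with(df, case_when(
    nox < 0.48   ~ 0L,  # Low
    nox > 0.60   ~ 1L,  # High
    rad > 6      ~ 1L,  # High
    tax2 > 5000  ~ 0L,  # Low
    ptratio > 18 ~ 1L,  # High
    TRUE         ~ 0L   # Low
  ))
}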
For the classification tree approach, we divide the data set into two parts: dat2.train (90% of the observations) for training and dat2.eval (10% of the observations) for evaluation.
# Partition the data
set.seed(11)
N = nrow(dat2)
(idxsample = sample( N, round( N/10, 0) ) )
## [1] 250 34 184 409 144 113 37 60 374 396 213 214 62 222 341 163 93
## [18] 235 39 45 290 433 424 47 448 6 323 363 12 450 226 131 394 208
## [35] 346 267 98 329 459 296 327 392 376 463 188 263 244
dat2.eval = dat2[idxsample,]
dat2.train = dat2[-idxsample,]
# Summarize the evaluation set (skim is from the skimr package)
library(skimr)
skim(dat2.eval)
| Name | dat2.eval |
| Number of rows | 47 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| numeric | 14 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| zn | 0 | 1 | 8.29 | 22.67 | 0.00 | 0.00 | 0.00 | 0.00 | 90.00 | ▇▁▁▁▁ |
| indus | 0 | 1 | 12.48 | 6.59 | 1.22 | 7.42 | 12.83 | 18.10 | 25.65 | ▅▆▁▇▁ |
| chas | 0 | 1 | 0.09 | 0.28 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| nox | 0 | 1 | 0.57 | 0.12 | 0.40 | 0.48 | 0.55 | 0.67 | 0.87 | ▆▇▅▃▁ |
| rm | 0 | 1 | 6.26 | 0.65 | 4.63 | 5.89 | 6.21 | 6.75 | 7.83 | ▁▃▇▅▁ |
| age | 0 | 1 | 72.01 | 26.34 | 9.90 | 52.10 | 82.90 | 93.80 | 100.00 | ▁▂▂▂▇ |
| dis | 0 | 1 | 3.62 | 2.32 | 1.13 | 1.99 | 2.53 | 4.87 | 12.13 | ▇▃▁▁▁ |
| rad | 0 | 1 | 11.00 | 9.49 | 1.00 | 4.00 | 5.00 | 24.00 | 24.00 | ▇▁▁▁▅ |
| tax | 0 | 1 | 429.34 | 182.36 | 187.00 | 292.50 | 384.00 | 666.00 | 666.00 | ▆▆▅▁▇ |
| ptratio | 0 | 1 | 18.99 | 1.89 | 14.70 | 18.10 | 20.20 | 20.20 | 21.20 | ▂▁▂▂▇ |
| lstat | 0 | 1 | 14.13 | 6.91 | 1.98 | 9.37 | 13.59 | 17.42 | 34.37 | ▅▇▇▂▁ |
| medv | 0 | 1 | 21.46 | 9.52 | 7.40 | 15.50 | 19.40 | 25.50 | 50.00 | ▅▇▂▂▁ |
| target | 0 | 1 | 0.55 | 0.50 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▆▁▁▁▇ |
| tax2 | 0 | 1 | 49276.36 | 61657.84 | 16.00 | 1764.50 | 9604.00 | 133956.00 | 133956.00 | ▇▁▁▁▅ |
skim(dat2.train)
| Name | dat2.train |
| Number of rows | 419 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| numeric | 14 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| zn | 0 | 1 | 11.95 | 23.44 | 0.00 | 0.00 | 0.00 | 20.00 | 100.00 | ▇▁▁▁▁ |
| indus | 0 | 1 | 10.95 | 6.86 | 0.46 | 5.04 | 8.56 | 18.10 | 27.74 | ▇▆▁▇▁ |
| chas | 0 | 1 | 0.07 | 0.25 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| nox | 0 | 1 | 0.55 | 0.12 | 0.39 | 0.45 | 0.53 | 0.62 | 0.87 | ▇▇▅▃▁ |
| rm | 0 | 1 | 6.29 | 0.71 | 3.86 | 5.89 | 6.21 | 6.62 | 8.78 | ▁▂▇▂▁ |
| age | 0 | 1 | 67.96 | 28.54 | 2.90 | 42.35 | 76.90 | 94.20 | 100.00 | ▂▂▂▃▇ |
| dis | 0 | 1 | 3.82 | 2.08 | 1.14 | 2.11 | 3.27 | 5.21 | 10.71 | ▇▅▃▁▁ |
| rad | 0 | 1 | 9.37 | 8.59 | 1.00 | 4.00 | 5.00 | 16.00 | 24.00 | ▇▂▁▁▃ |
| tax | 0 | 1 | 407.28 | 166.29 | 188.00 | 280.00 | 330.00 | 666.00 | 711.00 | ▇▇▅▁▇ |
| ptratio | 0 | 1 | 18.33 | 2.22 | 12.60 | 16.85 | 18.70 | 20.20 | 22.00 | ▁▃▅▅▇ |
| lstat | 0 | 1 | 12.46 | 7.11 | 1.73 | 6.87 | 10.88 | 16.43 | 37.97 | ▇▇▃▂▁ |
| medv | 0 | 1 | 22.72 | 9.21 | 5.00 | 17.10 | 21.60 | 25.00 | 50.00 | ▂▇▅▁▁ |
| target | 0 | 1 | 0.48 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▇ |
| tax2 | 0 | 1 | 39093.65 | 57945.67 | 0.00 | 576.00 | 6084.00 | 133956.00 | 168921.00 | ▇▁▁▃▁ |
# Build the classification tree model (tree package)
library(tree)
tree.train = tree(target ~ . , data = dat2.train )
summary(tree.train)
##
## Regression tree:
## tree(formula = target ~ ., data = dat2.train)
## Variables actually used in tree construction:
## [1] "nox" "rad" "rm" "dis" "indus"
## Number of terminal nodes: 9
## Residual mean deviance: 0.03145 = 12.89 / 410
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.993000 -0.032790 0.000000 0.000000 0.006993 0.967200
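Note that summary() reports a regression tree: because target is numeric 0/1, tree() grows a regression tree whose fitted values we treat as probabilities and threshold at 0.5 below. If an explicit classification tree were preferred, target could be coerced to a factor first; a minimal sketch (tree.class and dat2.train.f are hypothetical names):

# Sketch: a true classification tree via a factor response
dat2.train.f = dat2.train
dat2.train.f$target = factor(dat2.train.f$target, labels = c("Low", "High"))
tree.class = tree(target ~ . , data = dat2.train.f)
summary(tree.class) # should now report a classification tree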
# Display its structure and values
plot(tree.train)
text(tree.train)
# Display the out of sample accuracy
pred.eval = predict(tree.train , newdata= dat2.eval )
class.eval = ifelse(pred.eval > 0.5, 1, 0 )
table( class.eval, dat2.eval$target )
##
## class.eval 0 1
## 0 21 1
## 1 0 25
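The tree misclassifies only one of the 47 hold-out observations. As a quick check, the accuracy can be computed directly from the predictions:

# Out-of-sample accuracy: share of correct predictions
mean(class.eval == dat2.eval$target) # (21 + 25) / 47 ≈ 0.979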
It is not entirely clear that we ought to compare model accuracies on the test sample alone, given its small size and the fact that the manual classification rule was never backtested on any split; consequently, there is no reason to withhold the training observations when evaluating the manual rule.
For ease of comparison, it makes more sense to evaluate the accuracy of the classification tree and the manual approach on the full data set, which we now undertake below. We use the caret library to produce the full set of evaluation statistics.
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
(lvl = factor(c("High", "Low")))
## [1] High Low
## Levels: High Low
pred.full = predict(tree.train, newdata=dat2)
class.full = ifelse(pred.full > 0.5, 1, 0)
man.full = manualrule(dat2)
table(class.full, dat2$target )
##
## class.full 0 1
## 0 231 10
## 1 6 219
table(man.full, dat2$target )
##
## man.full 0 1
## 0 211 8
## 1 26 221
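Since caret is already loaded, its confusionMatrix() can produce accuracy, sensitivity, specificity, and precision in a single call; a minimal sketch for the tree's predictions (the manual rule is evaluated analogously):

# Full evaluation statistics via caret; "1" (high crime) is the positive class
confusionMatrix( factor(class.full, levels = c(0, 1)),
                 factor(dat2$target, levels = c(0, 1)),
                 positive = "1" )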
We conclude that the classification tree outperforms the manually created rule. First, the tree’s accuracy is somewhat higher (96.6%) than the manual rule’s (92.7%). Second, the manual rule tends to mislabel low-crime neighborhoods as high crime: its precision, TP/(TP + FP) = 221/(221 + 26) ≈ 89.5%, is lower than the tree’s 219/(219 + 6) ≈ 97.3%. Nonetheless, the manual rule has an organizational structure that compares well with the automated approach. For example, I picked nox as the most important variable, and the threshold I chose manually is consistent with the automated selection. The key improvements of the automated approach occur in the later splits of the partition.