Introduction

In this blog post, we fit a classification tree to the crime data set used in HW Assignment 3, visualize its decision-making hierarchy, and compare the tree grown by the standard algorithm with the hierarchy I created manually for that assignment.

Background

The data set for HW3 describes the suburbs or neighborhoods of a major city. The training set contains 466 observations of 13 variables, including the response variable target, which takes the value 1 if the neighborhood has an above-median crime rate and 0 otherwise. The remaining predictor variables are all numeric.

We import the data and summarize the description of each field, along with some key statistics, below for reference. A more in-depth analysis from the point of view of logistic regression, including the description table below, is available in my earlier study: [https://rpubs.com/Fixed_Point/599508]
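For reference, a minimal import-and-summary sketch is shown below (the file name crime-training-data.csv is an assumption; substitute the actual HW3 file):

library(dplyr)  # for %>% and mutate(), used later

# Hypothetical file name for the HW3 training data
dat <- read.csv("crime-training-data.csv")

# Key statistics for each field (the summary shown below)
summary(dat)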

Expected Impact of an Increase in Each Variable on the Crime Rate

Variable | Impact | Comments
zn | Lower | Proportion of a town's land zoned for lots greater than 25,000 square feet. Large lots exclude cheaper, smaller houses and attract wealthier owners.
indus | Higher | Proportion of non-retail business acres per suburb; associated with industry, noise, heavy traffic, and unpleasant visual effects.
chas | Lower | Charles River dummy variable: 1 if the suburb borders the river, 0 otherwise. chas proxies the amenities of a riverside location, and a wealthier suburb has less crime.
nox | Higher | Concentration of nitrogen oxides in the atmosphere. nox measures pollution, so an increase might make the neighborhood less appealing and more crime-ridden.
rm | Lower | Average number of rooms in a house, a proxy for wealth. An increase in rooms should be associated with lower crime.
age | Higher | Proportion of owner-occupied units built before 1940. Unit age is related to structure quality; an older unit should be associated with less affluence.
dis | Lower | Weighted distance to five employment centers. Crime is associated with urban centers, so greater distance should mean less crime.
rad | Higher | Index of access to radial highways, measuring proximity to urban centers and ease of access. Easier access may allow criminals to travel in as well.
tax | Lower | Higher taxes should provide more resources to the suburb but could also reduce house values through the higher tax burden. Both directions are plausible, but I argue higher taxes may fund more policing.
ptratio | Higher | A low pupil-teacher ratio per school district implies greater wealth and is a proxy for school quality, so an increase in the ratio should imply a higher crime rate.
lstat | Higher | A higher proportion of low-status residents (less educated, less skilled workers) could imply a higher crime rate because of greater poverty.
medv | Lower | A higher median property value might imply the neighborhood has a lower crime rate.
##        zn             indus             chas              nox        
##  Min.   :  0.00   Min.   : 0.460   Min.   :0.00000   Min.   :0.3890  
##  1st Qu.:  0.00   1st Qu.: 5.145   1st Qu.:0.00000   1st Qu.:0.4480  
##  Median :  0.00   Median : 9.690   Median :0.00000   Median :0.5380  
##  Mean   : 11.58   Mean   :11.105   Mean   :0.07082   Mean   :0.5543  
##  3rd Qu.: 16.25   3rd Qu.:18.100   3rd Qu.:0.00000   3rd Qu.:0.6240  
##  Max.   :100.00   Max.   :27.740   Max.   :1.00000   Max.   :0.8710  
##        rm             age              dis              rad       
##  Min.   :3.863   Min.   :  2.90   Min.   : 1.130   Min.   : 1.00  
##  1st Qu.:5.887   1st Qu.: 43.88   1st Qu.: 2.101   1st Qu.: 4.00  
##  Median :6.210   Median : 77.15   Median : 3.191   Median : 5.00  
##  Mean   :6.291   Mean   : 68.37   Mean   : 3.796   Mean   : 9.53  
##  3rd Qu.:6.630   3rd Qu.: 94.10   3rd Qu.: 5.215   3rd Qu.:24.00  
##  Max.   :8.780   Max.   :100.00   Max.   :12.127   Max.   :24.00  
##       tax           ptratio         lstat             medv      
##  Min.   :187.0   Min.   :12.6   Min.   : 1.730   Min.   : 5.00  
##  1st Qu.:281.0   1st Qu.:16.9   1st Qu.: 7.043   1st Qu.:17.02  
##  Median :334.5   Median :18.9   Median :11.350   Median :21.20  
##  Mean   :409.5   Mean   :18.4   Mean   :12.631   Mean   :22.59  
##  3rd Qu.:666.0   3rd Qu.:20.2   3rd Qu.:16.930   3rd Qu.:25.00  
##  Max.   :711.0   Max.   :22.0   Max.   :37.970   Max.   :50.00  
##      target      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.4914  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

Manually Created Rule

After a detailed analysis of the variables, I proposed the rules below to classify high (1) and low (0) crime neighborhoods.

To create those rules, I decided to transform the tax rate variable after a visual analysis of its conditional distribution. Conditional on the other splits, many high crime neighborhoods clustered around a tax rate of 300 while low crime neighborhoods did not.

\[ Tax2 = (tax - 300)^2\]

# Add the engineered tax2 feature
dat %>% mutate( tax2 = (tax - 300)^2 ) -> dat2
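The clustering is easy to see by plotting the distribution of the tax rate conditional on target; a sketch of that visual check, assuming ggplot2:

library(ggplot2)

# Density of tax by crime class: high crime neighborhoods (target = 1)
# bunch near tax = 300, motivating the (tax - 300)^2 transform
ggplot(dat2, aes(x = tax, fill = factor(target))) +
  geom_density(alpha = 0.4) +
  labs(fill = "target")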

manualrule <- function(df){

    n = nrow(df)
    
    outcome = vector(mode="integer", length=n)
    
    for (i in seq_len(n))
    {
          nox     = df[i, "nox"]
          rad     = df[i, "rad"]
          tax2    = df[i, "tax2"]
          ptratio = df[i, "ptratio"]
          
          if( nox < 0.48 )
          {
              # Low pollution: assign to Low crime outright
              val = "Low" 
          }
          else if( nox > 0.6 )
          {   
              # High pollution: assign to High crime outright
              val = "High"
          }
          else if( rad > 6 )
          {
              # Intermediate nox: easy highway access implies High crime
              val = "High"
          }
          else if( tax2 > 5000 )
          {
              # Tax rate far from 300: assign to Low crime
              val = "Low"
          }
          else if( ptratio > 18 )
          {
              # Weak schools (high pupil-teacher ratio): High crime
              val = "High"
          }
          else
          {
              val = "Low"
          }
          
          outcome[i] = ifelse(val == "High", 1, 0)
    }
    return(outcome)
}

Those rules are summarized below for reference, but we will not repeat the reasoning behind them since it is covered in the previously mentioned analysis.

1. If nox < 0.48, assign to Low crime.
2. If nox > 0.60, assign to High crime.
3. In the intermediate region, consider the other variables:
   - If rad > 6, assign to High crime.
   - Otherwise, use \(Tax2 = (tax - 300)^2\) as an artificial variable: if \(Tax2 > 5000\), assign to Low crime.
   - Otherwise, if ptratio > 18, assign to High crime; else assign to Low crime.
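The same rule can also be written as a single vectorized expression; a sketch using dplyr::case_when, whose conditions are evaluated in order and so mirror the if/else chain above:

library(dplyr)

# Vectorized equivalent of manualrule()
manualrule_vec <- function(df){
  with(df, case_when(
    nox < 0.48   ~ 0L,  # low pollution       -> Low crime
    nox > 0.60   ~ 1L,  # high pollution      -> High crime
    rad > 6      ~ 1L,  # easy highway access -> High crime
    tax2 > 5000  ~ 0L,  # tax far from 300    -> Low crime
    ptratio > 18 ~ 1L,  # weak schools        -> High crime
    TRUE         ~ 0L   # remaining cases     -> Low crime
  ))
}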

Classification Tree

The classification tree approach divides the data into two parts, dat2.train (90% of the observations) and dat2.eval (10% of the observations), used for training and evaluation respectively.

# Partition the data

set.seed(11)

N = nrow(dat2)
(idxsample = sample( N, round( N/10, 0) ) )
##  [1] 250  34 184 409 144 113  37  60 374 396 213 214  62 222 341 163  93
## [18] 235  39  45 290 433 424  47 448   6 323 363  12 450 226 131 394 208
## [35] 346 267  98 329 459 296 327 392 376 463 188 263 244
dat2.eval  = dat2[idxsample,]
dat2.train = dat2[-idxsample,]

library(skimr)  # for skim()
skim(dat2.eval)
Data summary
Name dat2.eval
Number of rows 47
Number of columns 14
_______________________
Column type frequency:
numeric 14
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
zn 0 1 8.29 22.67 0.00 0.00 0.00 0.00 90.00 ▇▁▁▁▁
indus 0 1 12.48 6.59 1.22 7.42 12.83 18.10 25.65 ▅▆▁▇▁
chas 0 1 0.09 0.28 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
nox 0 1 0.57 0.12 0.40 0.48 0.55 0.67 0.87 ▆▇▅▃▁
rm 0 1 6.26 0.65 4.63 5.89 6.21 6.75 7.83 ▁▃▇▅▁
age 0 1 72.01 26.34 9.90 52.10 82.90 93.80 100.00 ▁▂▂▂▇
dis 0 1 3.62 2.32 1.13 1.99 2.53 4.87 12.13 ▇▃▁▁▁
rad 0 1 11.00 9.49 1.00 4.00 5.00 24.00 24.00 ▇▁▁▁▅
tax 0 1 429.34 182.36 187.00 292.50 384.00 666.00 666.00 ▆▆▅▁▇
ptratio 0 1 18.99 1.89 14.70 18.10 20.20 20.20 21.20 ▂▁▂▂▇
lstat 0 1 14.13 6.91 1.98 9.37 13.59 17.42 34.37 ▅▇▇▂▁
medv 0 1 21.46 9.52 7.40 15.50 19.40 25.50 50.00 ▅▇▂▂▁
target 0 1 0.55 0.50 0.00 0.00 1.00 1.00 1.00 ▆▁▁▁▇
tax2 0 1 49276.36 61657.84 16.00 1764.50 9604.00 133956.00 133956.00 ▇▁▁▁▅
skim(dat2.train)
Data summary
Name dat2.train
Number of rows 419
Number of columns 14
_______________________
Column type frequency:
numeric 14
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
zn 0 1 11.95 23.44 0.00 0.00 0.00 20.00 100.00 ▇▁▁▁▁
indus 0 1 10.95 6.86 0.46 5.04 8.56 18.10 27.74 ▇▆▁▇▁
chas 0 1 0.07 0.25 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
nox 0 1 0.55 0.12 0.39 0.45 0.53 0.62 0.87 ▇▇▅▃▁
rm 0 1 6.29 0.71 3.86 5.89 6.21 6.62 8.78 ▁▂▇▂▁
age 0 1 67.96 28.54 2.90 42.35 76.90 94.20 100.00 ▂▂▂▃▇
dis 0 1 3.82 2.08 1.14 2.11 3.27 5.21 10.71 ▇▅▃▁▁
rad 0 1 9.37 8.59 1.00 4.00 5.00 16.00 24.00 ▇▂▁▁▃
tax 0 1 407.28 166.29 188.00 280.00 330.00 666.00 711.00 ▇▇▅▁▇
ptratio 0 1 18.33 2.22 12.60 16.85 18.70 20.20 22.00 ▁▃▅▅▇
lstat 0 1 12.46 7.11 1.73 6.87 10.88 16.43 37.97 ▇▇▃▂▁
medv 0 1 22.72 9.21 5.00 17.10 21.60 25.00 50.00 ▂▇▅▁▁
target 0 1 0.48 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▇
tax2 0 1 39093.65 57945.67 0.00 576.00 6084.00 133956.00 168921.00 ▇▁▁▃▁
# Build the classification tree model
library(tree)

# target is numeric 0/1, so tree() grows a regression tree whose fitted
# values can be read as probabilities of high crime
tree.train = tree(target ~ . , data = dat2.train )
summary(tree.train)
## 
## Regression tree:
## tree(formula = target ~ ., data = dat2.train)
## Variables actually used in tree construction:
## [1] "nox"   "rad"   "rm"    "dis"   "indus"
## Number of terminal nodes:  9 
## Residual mean deviance:  0.03145 = 12.89 / 410 
## Distribution of residuals:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.993000 -0.032790  0.000000  0.000000  0.006993  0.967200
# Display its structure and values
plot(tree.train)
text(tree.train)

# Display the out of sample accuracy
pred.eval = predict(tree.train , newdata= dat2.eval )
class.eval = ifelse(pred.eval > 0.5, 1, 0 )
table( class.eval, dat2.eval$target )
##           
## class.eval  0  1
##          0 21  1
##          1  0 25
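From this table the out-of-sample accuracy is (21 + 25) / 47, about 97.9%; a one-line check:

# Proportion of hold-out observations classified correctly
mean(class.eval == dat2.eval$target)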

Comparison by Full Sample

It is not entirely clear that we ought to compare model accuracies on the test sample alone, given its small size and the fact that the manual classification rule was never backtested on a held-out split. Since the manual rule did not use the training partition at all, there is no compelling reason to exclude those observations when evaluating it.

For ease of comparison, it makes more sense to evaluate the accuracy of the classification tree and the manual approach on the full data set, which we now undertake below. We use the caret library to produce the full set of evaluation statistics.

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
(lvl = factor(c("High", "Low")))
## [1] High Low 
## Levels: High Low
pred.full = predict(tree.train, newdata=dat2)
class.full = ifelse(pred.full > 0.5, 1, 0)
man.full = manualrule(dat2) 

table(class.full, dat2$target )
##           
## class.full   0   1
##          0 231  10
##          1   6 219
table(man.full, dat2$target )
##         
## man.full   0   1
##        0 211   8
##        1  26 221
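The headline statistics cited below can be reproduced with caret's confusionMatrix; a sketch, treating 1 (high crime) as the positive class:

# Full evaluation statistics (accuracy, precision, sensitivity, ...)
confusionMatrix(factor(class.full), factor(dat2$target), positive = "1")
confusionMatrix(factor(man.full), factor(dat2$target), positive = "1")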

We conclude that the classification tree outperforms the manually created rule. First, the tree's accuracy is somewhat higher (96.6%) than the manual method's (92.7%). Second, the manual rule tends to misreport low crime neighborhoods as high crime: its precision is lower, 89.5% versus 97.3% for the tree. Nonetheless, the manual rule has an organizational structure that compares well with the automated approach. For example, I picked nox as the most important variable, and the threshold chosen manually is consistent with the automated selection. The key improvements from the automated approach occur in the later stages of partitioning the scenarios.
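As a quick sanity check, the quoted figures follow directly from the two confusion matrices above:

# Classification tree, full sample
(231 + 219) / 466  # accuracy  ~ 0.966
219 / (219 + 6)    # precision ~ 0.973 for the high crime class

# Manual rule, full sample
(211 + 221) / 466  # accuracy  ~ 0.927
221 / (221 + 26)   # precision ~ 0.895 for the high crime class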