When to use

A classification tree is a useful technique when the response variable is binary. With a binary dependent / response variable we can use logistic regression as a method for pattern recognition: it gives the probability of the outcome through the logistic equation, P(Y = 1) = exp(b0 + b1x1 + ...) / (1 + exp(b0 + b1x1 + ...)). However, if we want a classification rather than a probability, a classification tree is useful.

We want to identify subgroups of the predictor space in which one outcome predominates.

Load the rpart package and the kyphosis dataset that ships with it.

library(rpart)
data("kyphosis")
dim(kyphosis)
## [1] 81  4
head(kyphosis)
##   Kyphosis Age Number Start
## 1   absent  71      3     5
## 2   absent 158      3    14
## 3  present 128      4     5
## 4   absent   2      5     1
## 5   absent   1      4    15
## 6   absent   1      2    16
summary(kyphosis)
##     Kyphosis       Age             Number           Start      
##  absent :64   Min.   :  1.00   Min.   : 2.000   Min.   : 1.00  
##  present:17   1st Qu.: 26.00   1st Qu.: 3.000   1st Qu.: 9.00  
##               Median : 87.00   Median : 4.000   Median :13.00  
##               Mean   : 83.65   Mean   : 4.049   Mean   :11.49  
##               3rd Qu.:130.00   3rd Qu.: 5.000   3rd Qu.:16.00  
##               Max.   :206.00   Max.   :10.000   Max.   :18.00

We want to build a classification tree that tells us whether corrective surgery is likely to be successful, based on the pre-treatment factors: Age at surgery (in months), Number (the number of vertebrae involved) and Start (the topmost vertebra operated on).

Fit a logistic regression model

kyphosis1 <- glm(Kyphosis~., data=kyphosis, family = binomial)
summary(kyphosis1)
## 
## Call:
## glm(formula = Kyphosis ~ ., family = binomial, data = kyphosis)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3124  -0.5484  -0.3632  -0.1659   2.1613  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -2.036934   1.449575  -1.405  0.15996   
## Age          0.010930   0.006446   1.696  0.08996 . 
## Number       0.410601   0.224861   1.826  0.06785 . 
## Start       -0.206510   0.067699  -3.050  0.00229 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 83.234  on 80  degrees of freedom
## Residual deviance: 61.380  on 77  degrees of freedom
## AIC: 69.38
## 
## Number of Fisher Scoring iterations: 5

In this case R models the probability of Kyphosis = present (it works on the alphabetical order of the factor levels).
Thus, the higher the Start value (the lower down the spine the surgery begins), the lower the probability of residual kyphosis, i.e. the better the chance of the surgery being successful. Surgery in the lumbar region does better than in the dorsal region.
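To see this on the probability scale, we can predict for two hypothetical patients who differ only in Start (the covariate values below are made up purely for illustration):

newpat <- data.frame(Age = c(80, 80), Number = c(4, 4), Start = c(5, 16))
predict(kyphosis1, newdata = newpat, type = "response")   # probability of kyphosis being present
# the patient with the larger Start value gets the smaller predicted probability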

Calculate the adjusted odds ratios and their 95% confidence intervals

kyphosisor <- exp(kyphosis1$coefficients)
round(kyphosisor,2)
## (Intercept)         Age      Number       Start 
##        0.13        1.01        1.51        0.81
round(cbind(kyphosisor, exp(confint(kyphosis1))),2)
##             kyphosisor 2.5 % 97.5 %
## (Intercept)       0.13  0.01   1.95
## Age               1.01  1.00   1.02
## Number            1.51  1.00   2.45
## Start             0.81  0.71   0.92

Make the classification tree

kyphosis2 <- rpart(Kyphosis~.,data=kyphosis)
kyphosis2
## n= 81 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 81 17 absent (0.79012346 0.20987654)  
##    2) Start>=8.5 62  6 absent (0.90322581 0.09677419)  
##      4) Start>=14.5 29  0 absent (1.00000000 0.00000000) *
##      5) Start< 14.5 33  6 absent (0.81818182 0.18181818)  
##       10) Age< 55 12  0 absent (1.00000000 0.00000000) *
##       11) Age>=55 21  6 absent (0.71428571 0.28571429)  
##         22) Age>=111 14  2 absent (0.85714286 0.14285714) *
##         23) Age< 111 7  3 present (0.42857143 0.57142857) *
##    3) Start< 8.5 19  8 present (0.42105263 0.57894737) *
summary(kyphosis2)
## Call:
## rpart(formula = Kyphosis ~ ., data = kyphosis)
##   n= 81 
## 
##           CP nsplit rel error   xerror      xstd
## 1 0.17647059      0 1.0000000 1.000000 0.2155872
## 2 0.01960784      1 0.8235294 1.117647 0.2243268
## 3 0.01000000      4 0.7647059 1.117647 0.2243268
## 
## Variable importance
##  Start    Age Number 
##     64     24     12 
## 
## Node number 1: 81 observations,    complexity param=0.1764706
##   predicted class=absent   expected loss=0.2098765  P(node) =1
##     class counts:    64    17
##    probabilities: 0.790 0.210 
##   left son=2 (62 obs) right son=3 (19 obs)
##   Primary splits:
##       Start  < 8.5  to the right, improve=6.762330, (0 missing)
##       Number < 5.5  to the left,  improve=2.866795, (0 missing)
##       Age    < 39.5 to the left,  improve=2.250212, (0 missing)
##   Surrogate splits:
##       Number < 6.5  to the left,  agree=0.802, adj=0.158, (0 split)
## 
## Node number 2: 62 observations,    complexity param=0.01960784
##   predicted class=absent   expected loss=0.09677419  P(node) =0.7654321
##     class counts:    56     6
##    probabilities: 0.903 0.097 
##   left son=4 (29 obs) right son=5 (33 obs)
##   Primary splits:
##       Start  < 14.5 to the right, improve=1.0205280, (0 missing)
##       Age    < 55   to the left,  improve=0.6848635, (0 missing)
##       Number < 4.5  to the left,  improve=0.2975332, (0 missing)
##   Surrogate splits:
##       Number < 3.5  to the left,  agree=0.645, adj=0.241, (0 split)
##       Age    < 16   to the left,  agree=0.597, adj=0.138, (0 split)
## 
## Node number 3: 19 observations
##   predicted class=present  expected loss=0.4210526  P(node) =0.2345679
##     class counts:     8    11
##    probabilities: 0.421 0.579 
## 
## Node number 4: 29 observations
##   predicted class=absent   expected loss=0  P(node) =0.3580247
##     class counts:    29     0
##    probabilities: 1.000 0.000 
## 
## Node number 5: 33 observations,    complexity param=0.01960784
##   predicted class=absent   expected loss=0.1818182  P(node) =0.4074074
##     class counts:    27     6
##    probabilities: 0.818 0.182 
##   left son=10 (12 obs) right son=11 (21 obs)
##   Primary splits:
##       Age    < 55   to the left,  improve=1.2467530, (0 missing)
##       Start  < 12.5 to the right, improve=0.2887701, (0 missing)
##       Number < 3.5  to the right, improve=0.1753247, (0 missing)
##   Surrogate splits:
##       Start  < 9.5  to the left,  agree=0.758, adj=0.333, (0 split)
##       Number < 5.5  to the right, agree=0.697, adj=0.167, (0 split)
## 
## Node number 10: 12 observations
##   predicted class=absent   expected loss=0  P(node) =0.1481481
##     class counts:    12     0
##    probabilities: 1.000 0.000 
## 
## Node number 11: 21 observations,    complexity param=0.01960784
##   predicted class=absent   expected loss=0.2857143  P(node) =0.2592593
##     class counts:    15     6
##    probabilities: 0.714 0.286 
##   left son=22 (14 obs) right son=23 (7 obs)
##   Primary splits:
##       Age    < 111  to the right, improve=1.71428600, (0 missing)
##       Start  < 12.5 to the right, improve=0.79365080, (0 missing)
##       Number < 3.5  to the right, improve=0.07142857, (0 missing)
## 
## Node number 22: 14 observations
##   predicted class=absent   expected loss=0.1428571  P(node) =0.1728395
##     class counts:    12     2
##    probabilities: 0.857 0.143 
## 
## Node number 23: 7 observations
##   predicted class=present  expected loss=0.4285714  P(node) =0.08641975
##     class counts:     3     4
##    probabilities: 0.429 0.571

The summary tells us that Start is the most important variable, followed by Age and Number.
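The same scores can also be pulled straight out of the fitted object:

kyphosis2$variable.importance   # raw (unscaled) variable importance scores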

library(rpart.plot)

Plot the Classification Tree

rpart.plot(kyphosis2, type=3, col="blue")

Parts of the plot:
1. There are 5 terminal nodes.
2. Kyphosis is absent when the surgery starts below the 14th vertebra (Start >= 14.5), i.e. at the D8 vertebra or lower: in these patients the kyphosis got corrected.
3. Patients whose surgery starts at or above the 8th vertebra (Start < 8.5, i.e. D1 or higher) mostly have kyphosis present.
4. If the surgery starts between D2 and D7 (Start between 8.5 and 14.5) and the patient is younger than 55 months or at least 111 months old at surgery, there is again no residual kyphosis.
5. Patients aged 55 - 111 months whose surgery starts between D2 and D7 mostly have residual kyphosis.

Conclusion:
1. Patients whose surgery starts at or above the D1 vertebra are likely to have residual kyphosis.
2. Kyphosis is corrected if the surgery starts at the D8 vertebra or below.
3. Kyphosis is also usually corrected if the surgery starts between D2 and D7 and the patient is younger than 55 months or 111 months or older at the time of surgery.

rpart.plot(kyphosis2, col="blue", type=3, extra=2, main="Classification Tree for Kyphosis Data with number in each node.")
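To see how well this tree reclassifies the training data, we can cross-tabulate the observed and predicted classes. This is resubstitution accuracy, so it is an optimistic estimate:

pred <- predict(kyphosis2, type = "class")            # predicted class for each patient
table(Observed = kyphosis$Kyphosis, Predicted = pred)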

Theory behind classification trees

Let's assume we have a binary response variable of interest, Y, which takes the values 0 or 1.
Let's assume there are two covariates / independent variables, x1 and x2.
x1 and x2 have a predictor space of 1 - 10 (they can take values between 1 and 10).
We plot the observed values of Y over the predictor space of x1 and x2.
The classification tree approach tries to partition the predictor space into rectangles (in two-dimensional data).
It chooses the rectangles so that, inside each rectangle, most of the observations have the same value of Y, predominantly 1 or predominantly 0.

  1. The tree starts with a root node containing all the observations.
  2. Count how many observations have Y = 0 and how many have Y = 1.
  3. Now we have to split the root node.
  4. Choose one covariate and a cut point.
  5. Distribute the observations into two daughter nodes according to the chosen cut point of the chosen covariate.
  6. Count how many observations in each daughter node have Y = 1 and how many have Y = 0.
  7. Split each daughter node on another (or the same) covariate with another (or the same) cut point.
  8. Again count the values of Y in these nodes.
  9. Keep splitting until most of the observations in each node have the same value of Y.
  10. A node in which all the Y values are of the same type is called a pure node.
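As a toy illustration of this process, we can simulate two covariates on a 1 - 10 scale with a binary Y that is mostly 1 inside one rectangle of the predictor space, and check that rpart roughly recovers that rectangle (the cut points 6 and 4 below are made up purely for illustration):

set.seed(1)                                  # illustrative simulation only
x1 <- runif(200, 1, 10)
x2 <- runif(200, 1, 10)
inside <- x1 > 6 & x2 < 4                    # the "true" rectangle
y <- factor(ifelse(inside, rbinom(200, 1, 0.9), rbinom(200, 1, 0.1)))
toy <- data.frame(y, x1, x2)
rpart(y ~ x1 + x2, data = toy)               # the splits approximate x1 > 6 and x2 < 4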

How to choose the cut point and the covariate for the first split?
An impurity criterion such as entropy (or the Gini index, which is rpart's default) is used for this purpose.
A simple way to understand this is to apply a two-sample proportion test to the counts of the response variable Y in the two candidate daughter nodes. Do this for the other covariates and for different cut points, generate p-values with the proportion test, and choose the covariate and cut point with the smallest p-value: this gives us the best split for the first node.
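As an illustration of this analogy (it is not what rpart actually computes), we can compare the proportion of patients with kyphosis present on either side of the candidate cut point Start < 8.5:

left <- kyphosis$Kyphosis[kyphosis$Start >= 8.5]
right <- kyphosis$Kyphosis[kyphosis$Start < 8.5]
prop.test(x = c(sum(left == "present"), sum(right == "present")),
          n = c(length(left), length(right)))
# repeating this over covariates and cut points, the smallest p-value
# points to the strongest first split (here Start, as in the fitted tree)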

This process continues until the nodes become too small to split: by default rpart will not attempt to split a node with fewer than 20 observations (minsplit = 20). Growing a large tree and then cutting back splits that do not improve the fit enough (controlled by the complexity parameter cp) is called pruning.
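These settings can be adjusted explicitly; a sketch with rpart's default values shown for illustration:

kyphosis3 <- rpart(Kyphosis ~ ., data = kyphosis,
                   control = rpart.control(minsplit = 20,  # smallest node rpart will try to split
                                           cp = 0.01))     # minimum improvement required per split
printcp(kyphosis3)            # cross-validated error for each subtree size
prune(kyphosis3, cp = 0.02)   # cut the tree back at a chosen complexity parameter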

Choice of a Regression Approach

  1. Binary Response Variable: Logistic Regression.
  2. Ternary (or multi-category) Response Variable: Multinomial Regression.
  3. Count Variable: Poisson Regression.
  4. Continuous Variable: Linear Regression.
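The corresponding R calls, as a sketch (outcome, x1, x2 and mydata are placeholder names, not objects from this post):

glm(outcome ~ x1 + x2, data = mydata, family = binomial)   # binary: logistic regression
nnet::multinom(outcome ~ x1 + x2, data = mydata)           # more than two categories: multinomial
glm(outcome ~ x1 + x2, data = mydata, family = poisson)    # counts: Poisson regression
lm(outcome ~ x1 + x2, data = mydata)                       # continuous: linear regression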