A classification tree is a useful technique when the response variable is binary. With a binary response variable we can use logistic regression as a method for pattern recognition, calculating the probabilities from the logistic equation shown below. However, if we want the classification itself rather than a probability, a classification tree is useful.
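For reference, a minimal statement of that equation for covariates x1, ..., xk (the standard logistic regression model, not anything specific to this data set):

$$
P(Y = 1 \mid x_1, \ldots, x_k) = \frac{\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}{1 + \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}
$$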
We want to identify subgroups in the prediction space with predominantly one pattern.
Load the rpart package, which also provides the kyphosis data set.
library(rpart)
data("kyphosis")
dim(kyphosis)
## [1] 81 4
head(kyphosis)
## Kyphosis Age Number Start
## 1 absent 71 3 5
## 2 absent 158 3 14
## 3 present 128 4 5
## 4 absent 2 5 1
## 5 absent 1 4 15
## 6 absent 1 2 16
summary(kyphosis)
## Kyphosis Age Number Start
## absent :64 Min. : 1.00 Min. : 2.000 Min. : 1.00
## present:17 1st Qu.: 26.00 1st Qu.: 3.000 1st Qu.: 9.00
## Median : 87.00 Median : 4.000 Median :13.00
## Mean : 83.65 Mean : 4.049 Mean :11.49
## 3rd Qu.:130.00 3rd Qu.: 5.000 3rd Qu.:16.00
## Max. :206.00 Max. :10.000 Max. :18.00
We want to build a classification tree that tells us whether corrective surgery is likely to be successful based on pretreatment factors: Age at surgery (in months), Number (the number of vertebrae involved) and Start (the number of the first, i.e. topmost, vertebra operated on).
kyphosis1 <- glm(Kyphosis~., data=kyphosis, family = binomial)
summary(kyphosis1)
##
## Call:
## glm(formula = Kyphosis ~ ., family = binomial, data = kyphosis)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3124 -0.5484 -0.3632 -0.1659 2.1613
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.036934 1.449575 -1.405 0.15996
## Age 0.010930 0.006446 1.696 0.08996 .
## Number 0.410601 0.224861 1.826 0.06785 .
## Start -0.206510 0.067699 -3.050 0.00229 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 83.234 on 80 degrees of freedom
## Residual deviance: 61.380 on 77 degrees of freedom
## AIC: 69.38
##
## Number of Fisher Scoring iterations: 5
In this case R models the probability that Kyphosis = present, because the factor levels are taken in alphabetical order and "present" is the second level.
Thus the higher the Start value (the lower down the spine the topmost operated vertebra), the lower the odds of residual kyphosis, i.e. the better the chance that the surgery is successful. Surgery in the lumbar region does better than in the dorsal region.
kyphosisor <- exp(kyphosis1$coefficients)
round(kyphosisor,2)
## (Intercept) Age Number Start
## 0.13 1.01 1.51 0.81
round(cbind(kyphosisor, exp(confint(kyphosis1))),2)
## kyphosisor 2.5 % 97.5 %
## (Intercept) 0.13 0.01 1.95
## Age 1.01 1.00 1.02
## Number 1.51 1.00 2.45
## Start 0.81 0.71 0.92
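For example, the fitted logistic model can return a predicted probability of residual kyphosis for a new patient. The covariate values below are hypothetical, purely for illustration:
# predicted probability of Kyphosis = present for a hypothetical patient:
# 60 months old, 4 vertebrae involved, surgery starting at vertebra 10
predict(kyphosis1, newdata = data.frame(Age = 60, Number = 4, Start = 10),
        type = "response")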
kyphosis2 <- rpart(Kyphosis~.,data=kyphosis)
kyphosis2
## n= 81
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 81 17 absent (0.79012346 0.20987654)
## 2) Start>=8.5 62 6 absent (0.90322581 0.09677419)
## 4) Start>=14.5 29 0 absent (1.00000000 0.00000000) *
## 5) Start< 14.5 33 6 absent (0.81818182 0.18181818)
## 10) Age< 55 12 0 absent (1.00000000 0.00000000) *
## 11) Age>=55 21 6 absent (0.71428571 0.28571429)
## 22) Age>=111 14 2 absent (0.85714286 0.14285714) *
## 23) Age< 111 7 3 present (0.42857143 0.57142857) *
## 3) Start< 8.5 19 8 present (0.42105263 0.57894737) *
summary(kyphosis2)
## Call:
## rpart(formula = Kyphosis ~ ., data = kyphosis)
## n= 81
##
## CP nsplit rel error xerror xstd
## 1 0.17647059 0 1.0000000 1.000000 0.2155872
## 2 0.01960784 1 0.8235294 1.117647 0.2243268
## 3 0.01000000 4 0.7647059 1.117647 0.2243268
##
## Variable importance
## Start Age Number
## 64 24 12
##
## Node number 1: 81 observations, complexity param=0.1764706
## predicted class=absent expected loss=0.2098765 P(node) =1
## class counts: 64 17
## probabilities: 0.790 0.210
## left son=2 (62 obs) right son=3 (19 obs)
## Primary splits:
## Start < 8.5 to the right, improve=6.762330, (0 missing)
## Number < 5.5 to the left, improve=2.866795, (0 missing)
## Age < 39.5 to the left, improve=2.250212, (0 missing)
## Surrogate splits:
## Number < 6.5 to the left, agree=0.802, adj=0.158, (0 split)
##
## Node number 2: 62 observations, complexity param=0.01960784
## predicted class=absent expected loss=0.09677419 P(node) =0.7654321
## class counts: 56 6
## probabilities: 0.903 0.097
## left son=4 (29 obs) right son=5 (33 obs)
## Primary splits:
## Start < 14.5 to the right, improve=1.0205280, (0 missing)
## Age < 55 to the left, improve=0.6848635, (0 missing)
## Number < 4.5 to the left, improve=0.2975332, (0 missing)
## Surrogate splits:
## Number < 3.5 to the left, agree=0.645, adj=0.241, (0 split)
## Age < 16 to the left, agree=0.597, adj=0.138, (0 split)
##
## Node number 3: 19 observations
## predicted class=present expected loss=0.4210526 P(node) =0.2345679
## class counts: 8 11
## probabilities: 0.421 0.579
##
## Node number 4: 29 observations
## predicted class=absent expected loss=0 P(node) =0.3580247
## class counts: 29 0
## probabilities: 1.000 0.000
##
## Node number 5: 33 observations, complexity param=0.01960784
## predicted class=absent expected loss=0.1818182 P(node) =0.4074074
## class counts: 27 6
## probabilities: 0.818 0.182
## left son=10 (12 obs) right son=11 (21 obs)
## Primary splits:
## Age < 55 to the left, improve=1.2467530, (0 missing)
## Start < 12.5 to the right, improve=0.2887701, (0 missing)
## Number < 3.5 to the right, improve=0.1753247, (0 missing)
## Surrogate splits:
## Start < 9.5 to the left, agree=0.758, adj=0.333, (0 split)
## Number < 5.5 to the right, agree=0.697, adj=0.167, (0 split)
##
## Node number 10: 12 observations
## predicted class=absent expected loss=0 P(node) =0.1481481
## class counts: 12 0
## probabilities: 1.000 0.000
##
## Node number 11: 21 observations, complexity param=0.01960784
## predicted class=absent expected loss=0.2857143 P(node) =0.2592593
## class counts: 15 6
## probabilities: 0.714 0.286
## left son=22 (14 obs) right son=23 (7 obs)
## Primary splits:
## Age < 111 to the right, improve=1.71428600, (0 missing)
## Start < 12.5 to the right, improve=0.79365080, (0 missing)
## Number < 3.5 to the right, improve=0.07142857, (0 missing)
##
## Node number 22: 14 observations
## predicted class=absent expected loss=0.1428571 P(node) =0.1728395
## class counts: 12 2
## probabilities: 0.857 0.143
##
## Node number 23: 7 observations
## predicted class=present expected loss=0.4285714 P(node) =0.08641975
## class counts: 3 4
## probabilities: 0.429 0.571
The summary tells us that Start is the most important variable, followed by Age and Number.
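If you want the raw importance scores rather than the rescaled values printed by summary(), they are stored in the fitted object:
# raw variable importance scores from the fitted rpart object
kyphosis2$variable.importance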
library(rpart.plot)
Create a Classification Tree
rpart.plot(kyphosis2, type=3, col="blue")
Parts of the plot:
1. There are 5 terminal nodes.
2. Kyphosis is classified as absent when Start >= 14.5, i.e. when the topmost operated vertebra is the 15th or lower. In these patients the kyphosis got corrected; this corresponds to roughly the D8 vertebra or below.
3. Patients whose surgery starts at the 8th vertebra or higher up the spine (Start < 8.5, roughly D1 and above) are classified as having kyphosis present.
4. If the surgery starts between roughly D2 and D7 (Start between 8.5 and 14.5) and the patient was younger than 55 months or at least 111 months old at surgery, then again the patient is classified as having no residual kyphosis.
5. In patients aged between 55 and 111 months whose surgery started between D2 and D7, residual kyphosis is classified as present.
Conclusion:
1. Patients whose surgery starts at or above the D1 vertebra are predicted to have residual kyphosis.
2. Kyphosis is corrected if the surgery starts at the D8 vertebra or below.
3. Kyphosis is also corrected if the surgery starts between D2 and D7 and the patient is younger than 55 months or at least 111 months old at surgery.
rpart.plot(kyphosis2, col="blue", type=3, extra=2, main="Classification Tree for Kyphosis Data with number in each node.")
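As a quick check on these conclusions, we can cross-tabulate the classes predicted by the tree against the observed outcomes. Note that this is the fit on the training data, so it will look optimistic:
# confusion matrix of tree predictions vs observed outcomes (training data)
table(Predicted = predict(kyphosis2, type = "class"),
      Observed  = kyphosis$Kyphosis)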
Let us assume we have a binary response variable of interest, Y, which takes the values 0 or 1.
Let us assume there are two covariates / independent variables, x1 and x2.
x1 and x2 have a prediction space of 1 to 10 (each can take values between 1 and 10).
We plot the observed values of Y over the prediction space of x1 and x2.
The classification tree approach tries to find rectangles in the prediction space (for two-dimensional data).
The approach chooses the rectangles in such a way that the observations inside each rectangle have a predominant pattern of 1 or 0 for Y.
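As an illustration (simulated data, made up purely for this sketch), Y is predominantly 1 inside one rectangle of the (x1, x2) space and predominantly 0 elsewhere, and rpart recovers splits close to the true boundaries:
set.seed(1)
# two covariates taking values between 1 and 10
x1 <- runif(200, 1, 10)
x2 <- runif(200, 1, 10)
# Y is mostly 1 inside the rectangle x1 > 6 and x2 > 4, with some noise
y  <- factor(ifelse(x1 > 6 & x2 > 4, rbinom(200, 1, 0.9), rbinom(200, 1, 0.1)))
toy <- data.frame(y, x1, x2)
rpart(y ~ x1 + x2, data = toy)  # splits should fall close to x1 = 6 and x2 = 4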
How do we choose the covariate and the cut point for the first node?
An impurity measure, such as the entropy criterion, is used for this purpose.
A simple way to understand this is to apply a two-sample proportion test to the counts of the response variable Y in the two candidate child nodes. Do this for each covariate and for different cut points, generate p-values with the proportion test, and choose the covariate and cut point with the smallest p-value; this gives us the best split for the first node.
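A rough sketch of this idea on the kyphosis data, scanning candidate cut points for Start only (this mimics the intuition, not the exact impurity calculation rpart performs):
# illustrative only: search cut points for Start with a two-sample proportion test
vals <- sort(unique(kyphosis$Start))
cuts <- head(vals, -1) + diff(vals) / 2   # candidate cut points (midpoints)
pvals <- sapply(cuts, function(cp) {
  left  <- kyphosis$Kyphosis[kyphosis$Start <  cp]
  right <- kyphosis$Kyphosis[kyphosis$Start >= cp]
  suppressWarnings(   # small counts trigger chi-squared approximation warnings
    prop.test(c(sum(left == "present"), sum(right == "present")),
              c(length(left), length(right)))$p.value)
})
cuts[which.min(pvals)]   # best cut point for Start by this criterion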
This process continues until the nodes become too small to split: by default rpart does not attempt to split a node with fewer than 20 observations (the minsplit parameter). Cutting an already grown tree back to a simpler one, using the complexity parameter, is called pruning.
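These settings are controlled through rpart.control(); a minimal sketch of changing them (the values below are just examples, not recommendations):
# grow a larger tree by relaxing the stopping rules, then prune it back
kyphosis3 <- rpart(Kyphosis ~ ., data = kyphosis,
                   control = rpart.control(minsplit = 10, cp = 0.001))
kyphosis3_pruned <- prune(kyphosis3, cp = 0.02)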