Decision trees are a collection of predictive analytics techniques that use tree-like graphs to predict the response variable. One such method is CHAID, explained in a previous blog. The most popular decision tree method is CART, or Classification and Regression Trees. Decision trees partition the data set into mutually exclusive and exhaustive subsets, and the repeated splitting of the original data results in a tree-like structure.
We can use the CART method for both regression and classification. In this blog, I will go through CART classification in detail using the Titanic example. Any decision tree model consists of 5 steps:
1. Start with the complete training data in the root node.
2. Decide on a splitting criterion
3. Split the data into two or more subsets based on the above metric
4. Using independent variables, repeat step 3 for each of the subsets of the data until
–(a) All the independent variables are exhausted, or they are not statistically significant at the chosen alpha
–(b) We meet the stopping criteria
5. Generate business rules for the terminal nodes (nodes without any branches) of the tree
The data used in this blog is the same as that used in the other classification posts, i.e. the Titanic dataset from Kaggle. In this problem, we have to identify what types of people had a higher chance of survival. A sample of the data is shown below. A basic EDA was performed in the blog on logistic regression.
| Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|
| 0 | 3 | female | 45.00 | 1 | 4 | 27.9000 | S |
| 0 | 3 | male | 19.00 | 0 | 0 | 8.1583 | S |
| 0 | 3 | male | 17.00 | 0 | 0 | 8.6625 | S |
| 1 | 3 | male | 0.42 | 0 | 1 | 8.5167 | C |
| 1 | 3 | male | 25.00 | 0 | 0 | 0.0000 | S |
In CHAID-based classification, we used the p-value from hypothesis tests as the splitting criterion. In CART, we will use impurity metrics like the Gini index and entropy.
The Gini index is defined mathematically as: \[ Gini\,Index = \sum_{i=1}^k \sum_{j=1,\, j\neq i}^k P(C_i|t)\times P(C_j|t)\] where \(P(C_i|t)\) = proportion of observations belonging to class \(C_i\) in node \(t\). Equivalently, this can be written as:
\[ Gini\,Index = 1 - \sum_{i=1}^k P(C_i|t)^2\]
Gini Index is a measure of total variance across the classes. It has a small value if all of the \(P(C_i|t)\) are close to zero or one. For this reason, the Gini index is a measure of node purity. A small value indicates that a node contains observations from a single class predominantly.
An alternative to the Gini index is entropy, defined as: \[ Entropy = - \sum_{i=1}^k P(C_i|t)\times \log P(C_i|t)\] Like the Gini index, the entropy takes a small value if the node is pure. Numerically, the Gini index and the entropy are quite similar.
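To make these definitions concrete, below is a minimal R sketch of both impurity measures as functions of a vector of class labels (the names `gini_index` and `entropy` are my own, not from any package). Applied to the full training data, assuming it is loaded in a data frame called `titanic` with a 0/1 `Survived` column, `gini_index(titanic$Survived)` reproduces the base-node impurity reported below.

```r
# Gini index: 1 minus the sum of squared class proportions in the node
gini_index <- function(labels) {
  p <- prop.table(table(labels))
  1 - sum(p^2)
}

# Entropy: -sum(p * log(p)) over the classes present in the node
entropy <- function(labels) {
  p <- prop.table(table(labels))
  -sum(p * log(p))
}

# A pure node scores 0 on both measures; a 50/50 node scores the maximum
gini_index(c(1, 1, 1, 1))   # 0
gini_index(c(1, 0, 1, 0))   # 0.5
entropy(c(1, 0, 1, 0))      # log(2) ~ 0.693
```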
For this blog, I will use the Gini index as my impurity measure. The aim is to choose the split that minimises the weighted Gini index of the resulting nodes.
The Gini impurity in the base node (complete data) is
## [1] 0.472365
The impurity of the overall data is 0.47. Let us look at how the impurity changes if we split the data based on various features.
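The tables that follow can be reproduced with a small helper that, for a given feature, reports each child node's share of observations and Gini index, together with the weighted impurity of the split. This is only a sketch: it assumes the same `titanic` data frame with the Kaggle column names, reuses `gini_index()` from above, and the helper names are my own.

```r
# Per-level child impurities when splitting on a single feature
split_summary <- function(data, feature, target = "Survived") {
  groups <- split(data[[target]], data[[feature]])
  data.frame(level             = names(groups),
             proportion.of.obs = sapply(groups, length) / nrow(data),
             child.gini.index  = sapply(groups, gini_index),
             row.names         = NULL)
}

# Weighted impurity after the split: sum of (share of observations * child Gini)
weighted_impurity <- function(data, feature, target = "Survived") {
  s <- split_summary(data, feature, target)
  sum(s$proportion.of.obs * s$child.gini.index)
}

split_summary(titanic, "Sex")
weighted_impurity(titanic, "Sex")   # roughly 0.3338
```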
Splitting the data on gender (Sex) gives:

| Sex | Survived = 1 | Survived = 0 | proportion.of.obs | child.gini.index |
|---|---|---|---|---|
| female | 231 | 81 | 0.3509561 | 0.3844305 |
| male | 109 | 468 | 0.6490439 | 0.3064437 |
## [1] "The total weighted impurity after the split is: 0.3338"
Next, consider the passenger class (Pclass). Since CART produces binary splits, we first look at the impurity within each of the three classes and then group them into two subsets before computing the weighted impurity:

| Pclass | Survived = 1 | Survived = 0 | proportion.of.obs | child.gini.index |
|---|---|---|---|---|
| 1 | 134 | 80 | 0.2407199 | 0.4681632 |
| 2 | 87 | 97 | 0.2069741 | 0.4985232 |
| 3 | 119 | 372 | 0.5523060 | 0.3672459 |
Grouping the first and second classes together against the third:

| Pclass (grouped) | Survived = 1 | Survived = 0 | proportion.of.obs | child.gini.index |
|---|---|---|---|---|
| 3rd Class | 119 | 372 | 0.552306 | 0.3672459 |
| First and second Class | 221 | 177 | 0.447694 | 0.4938890 |
## [1] "The total weighted impurity after the split is: 0.423943259996532"
For the number of siblings/spouses aboard (SibSp):

| SibSp | Survived = 1 | Survived = 0 | proportion.of.obs | child.gini.index |
|---|---|---|---|---|
| 0 | 208 | 398 | 0.6816648 | 0.4508490 |
| 1 | 112 | 97 | 0.2350956 | 0.4974245 |
| 2 | 13 | 15 | 0.0314961 | 0.4974490 |
| 3 | 4 | 12 | 0.0179978 | 0.3750000 |
| 4 | 3 | 15 | 0.0202475 | 0.2777778 |
| 5 | 0 | 5 | 0.0056243 | 0.0000000 |
| 8 | 0 | 7 | 0.0078740 | 0.0000000 |
Grouping passengers with one or two siblings/spouses against the rest:

| SibSp (grouped) | Survived = 1 | Survived = 0 | proportion.of.obs | child.gini.index |
|---|---|---|---|---|
| One or two siblings | 125 | 112 | 0.2665917 | 0.4984956 |
| others | 215 | 437 | 0.7334083 | 0.4420330 |
## [1] "The total weighted impurity after the split is: 0.457085468377958"
For the number of parents/children aboard (Parch):

| Parch | Survived = 1 | Survived = 0 | proportion.of.obs | child.gini.index |
|---|---|---|---|---|
| 0 | 231 | 445 | 0.7604049 | 0.4498923 |
| 1 | 65 | 53 | 0.1327334 | 0.4948291 |
| 2 | 40 | 40 | 0.0899888 | 0.5000000 |
| 3 | 3 | 2 | 0.0056243 | 0.4800000 |
| 4 | 0 | 4 | 0.0044994 | 0.0000000 |
| 5 | 1 | 4 | 0.0056243 | 0.3200000 |
| 6 | 0 | 1 | 0.0011249 | 0.0000000 |
Grouping Parch values of 1, 2 and 3 against the rest:

| Parch (grouped) | Survived = 1 | Survived = 0 | proportion.of.obs | child.gini.index |
|---|---|---|---|---|
| 1,2&3 | 108 | 95 | 0.2283465 | 0.4979495 |
| others | 232 | 454 | 0.7716535 | 0.4476366 |
## [1] "The total weighted impurity after the split is: 0.459125378001722"
There are three ports where the passengers embarked (C = Cherbourg, Q = Queenstown, S = Southampton). The Gini index within each port is:
| Embarked | Survived = 1 | Survived = 0 | proportion.of.obs | child.gini.index |
|---|---|---|---|---|
| C | 93 | 75 | 0.1889764 | 0.4942602 |
| Q | 30 | 47 | 0.0866142 | 0.4756283 |
| S | 217 | 427 | 0.7244094 | 0.4468336 |
Grouping Queenstown and Southampton together against Cherbourg:

| Embarked (grouped) | Survived = 1 | Survived = 0 | proportion.of.obs | child.gini.index |
|---|---|---|---|---|
| C | 93 | 75 | 0.1889764 | 0.4942602 |
| Q&S | 247 | 474 | 0.8110236 | 0.4504377 |
## [1] "The total weighted impurity after the split is: 0.458719142423424"
Age is a continuous variable, so we have to find the ideal age at which to split the data, along with the impurity metrics. To find the ideal cutoff, I have plotted the weighted Gini index for splits at different ages.
| Age_class | survived_prob | proportion.of.obs | child.gini.index |
|---|---|---|---|
| <7 | 0.6470588 | 0.0573678 | 0.4567474 |
| >=7 | 0.3663484 | 0.9426322 | 0.4642745 |
## [1] "The total weighted impurity after the split is: 0.46"
Similar to age, fare is a continuous variable, so we first have to find the optimal fare threshold to split on.
| Fare_class | survived_prob | proportion.of.obs | child.gini.index |
|---|---|---|---|
| <=11 | 0.2087912 | 0.4094488 | 0.3303949 |
| >11 | 0.5028571 | 0.5905512 | 0.4999837 |
From the fare table above, the weighted impurity of the split works out to roughly 0.43 (0.409 × 0.330 + 0.591 × 0.500). Among all the factors above, gender has the lowest weighted Gini index of 0.3338, so the initial split of the data will be on gender.
After splitting the data based on gender, we will get two data sets. We then perform a similar analysis on each of these subsets to split the data further, and so on. The splitting stops when we meet a stopping criterion. Different stopping criteria for CART classification trees are:
1. The number of cases in the node is less than some pre-specified limit.
2. The purity of the node is more than some pre-specified threshold.
3. Depth of the node is more than some pre-specified threshold.
4. Predictor values for all records in the node are identical.
In this blog, I will set the maximum depth of the tree to 3. In the machine learning blogs, I will go deeper into how to select the ideal stopping criteria. For max depth = 3, we get the following decision tree:
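A tree like this can be grown with the `rpart` package, the standard CART implementation in R. The snippet below is a sketch rather than the exact code behind the figure: it assumes the same `titanic` data frame, and `rpart`'s other defaults (such as `minsplit` and `cp`) also act as stopping criteria, so the fitted tree may differ slightly unless they are relaxed.

```r
library(rpart)
library(rpart.plot)

titanic$Survived <- factor(titanic$Survived)   # treat the target as categorical

# Classification tree grown with the Gini index, limited to a depth of 3
fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
             data = titanic, method = "class",
             parms = list(split = "gini"),
             control = rpart.control(maxdepth = 3))

rpart.plot(fit)    # draw the tree
rpart.rules(fit)   # business rules for the terminal nodes
```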
From the above tree, I can generate the following rules:
1. A female passenger has a higher probability of surviving unless she is travelling in third class with a fare less than 23.
2. A male passenger has a lower likelihood of surviving unless he is a child of age less than six years and with up to two siblings.
Using this decision tree, we can observe how sociological factors like gender, age and economic status affected the decisions on who would get into a lifeboat at a crucial moment during the Titanic disaster.