Decision Trees and Random Forests: A Machine Learning Perspective

Shaurya Jauhari, Mora Lab, GMU. (Email: shauryajauhari@gzhmu.edu.cn)
2019-06-27 08:55:59

RoadMap

  • Introduction
  • General Concepts
    • Bootstrapping
    • Aggregating
    • Bagging
    • Classification and Regression Trees (CART)
  • Decision Trees
  • “Best” Split
    • Gini Index
  • Random Forests
  • Packages: rpart, party, randomForest

Decision Trees

  • Cleave data into smaller portions to elicit patterns that aid prediction | Recursive Partitioning
  • The resulting logical structures can be interpreted without any a priori knowledge.

"Best" Split

Condition Check

  • Underperformance noted with nominal/ numerical data
  • Which attribute to select for tree partitioning?
  • Measures of Purity: Information Gain, Gain Ratio, Chi-Squared statistic, etc.
  • \( Information\ Gain = Entropy_{Old} - Entropy_{New} \), where \( Entropy = -\sum_{i=1}^{c} p_i \log_2 p_i \) over the \( c \) classes (a sketch follows this list)
  • Gini Index
  • Only binary classification problems are considered here
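As a rough illustration of these purity measures, here is a minimal R sketch of entropy and information gain. The helper names (entropy, information_gain) and the toy Play Tennis vectors are illustrative assumptions, not taken from the slides; the counts follow the classic 14-row example (Sunny 2 Yes/3 No, Overcast 4/0, Rain 3/2).

# Entropy of a vector of class labels: -sum(p_i * log2(p_i))
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

# Information Gain = Entropy_Old (parent) - Entropy_New (weighted average over child nodes)
information_gain <- function(labels, feature) {
  children <- split(labels, feature)
  new_entropy <- sum(sapply(children, function(s) length(s) / length(labels) * entropy(s)))
  entropy(labels) - new_entropy
}

# Toy Play Tennis data (classic example; counts assumed, not fully shown in the slides)
outlook <- c(rep("Sunny", 5), rep("Overcast", 4), rep("Rain", 5))
play    <- c("No", "No", "No", "Yes", "Yes",   # Sunny: 2 Yes, 3 No
             "Yes", "Yes", "Yes", "Yes",       # Overcast: 4 Yes
             "Yes", "Yes", "No", "Yes", "No")  # Rain: 3 Yes, 2 No
information_gain(play, outlook)                # gain from splitting on Outlook (approx. 0.25)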

Gini Index

  • Calculated for each node
  • What's the Root node?
  • First calculate how well each feature classifies data
    • Outlook -> Play Tennis; Temperature -> Play Tennis; Humidity -> Play Tennis; Wind -> Play Tennis
      (Figure: Outlook Gini Index calculation)

Gini Index (Contd ...)

  • Perform such quantitation for all features
  • “Impure”
    • None of the leaf nodes denote 100% Yes or 100% No for Play Tennis
  • Compare impurities
    • the less, the better
  • \( Gini\ Impurity_{Sunny} = 1 - P(\text{yes})^2 - P(\text{no})^2 \)
    • \( 1 - (2/(2+3))^2 - (3/(2+3))^2 \)
    • = 0.48 (reproduced in the sketch after this list)
  • Calculate the same for Rain and Overcast
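A minimal R sketch of the per-leaf calculation above; the function name gini_leaf is an illustrative choice. It reproduces the Sunny leaf value of 0.48.

# Gini impurity of one leaf: 1 - P(yes)^2 - P(no)^2 (written to handle any number of classes)
gini_leaf <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

gini_leaf(c(yes = 2, no = 3))   # Sunny leaf: 1 - (2/5)^2 - (3/5)^2 = 0.48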

Gini Index (Contd ...)

  • Further, calculate the same for each class/ level of the remaining features (Temperature, Humidity, and Wind)
  • Classification counts could be different for different features. That's OK!
    (Figure: Temperature Gini Index calculation)

Gini Index (Contd ...)

  • Next, calculate the weighted average Gini impurity for each feature
    • \( Gini\ Impurity_{Outlook} = \frac{n_{Sunny}}{n_{Outlook}} \cdot Gini\ Impurity_{Sunny} + \frac{n_{Rain}}{n_{Outlook}} \cdot Gini\ Impurity_{Rain} + \frac{n_{Overcast}}{n_{Outlook}} \cdot Gini\ Impurity_{Overcast} \), where each weight is (classifications in the leaf node / total classifications in the feature)

Gini Index (Contd ...)

  • \( Gini\ Impurity_{Outlook} = \frac{5}{5+5+4} \times 0.48 + \frac{n_{Rain}}{5+5+4} \times Gini\ Impurity_{Rain} + \frac{n_{Overcast}}{5+5+4} \times Gini\ Impurity_{Overcast} \)
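Continuing the sketch (reusing gini_leaf() from above), the weighted average can be computed as follows. The Rain (3 Yes / 2 No) and Overcast (4 Yes / 0 No) counts are assumed from the classic Play Tennis data, since only the Sunny leaf is worked out in the slides.

# Weighted Gini impurity of Outlook: each leaf's impurity weighted by its share of the samples
leaf_counts <- list(
  Sunny    = c(yes = 2, no = 3),   # from the slides
  Rain     = c(yes = 3, no = 2),   # assumed counts
  Overcast = c(yes = 4, no = 0)    # assumed counts
)
n_total <- sum(sapply(leaf_counts, sum))                          # 5 + 5 + 4 = 14
sum(sapply(leaf_counts, function(cnt) sum(cnt) / n_total * gini_leaf(cnt)))
# (5/14)*0.48 + (5/14)*0.48 + (4/14)*0 = 0.343 (approx.)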

Gini Index (Contd ...)

  • Eventually impurities for each feature will be compared
    • the one with the least impurity becomes the root node/ splitting attribute
  • What about intermediary nodes?
    • Still not pure.
    • an iterative process until purity is achieved.
  • Check the leaf nodes of the root node with the protocol for Gini Index calculation
    • Now using the non-root node features

Gini Index (Contd ...)

  • For features with continuous/ numerical data
    • sort the data
    • calculate the average of every two adjacent values
    • Gini index will be calculated on this average
      • Weight < 100 kg.; Height < 6 ft., etc.
    • Subsequently, these Gini indices are compared to choose the split threshold (see the sketch below)
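A rough R sketch of this procedure for a numeric feature, reusing gini_leaf() from the earlier sketch; the weight values and labels are made up purely for illustration.

weight <- c(65, 70, 75, 80, 90, 95, 100, 110)    # hypothetical weights (kg)
label  <- c("No", "No", "Yes", "Yes", "Yes", "No", "Yes", "Yes")

w <- sort(weight); y <- label[order(weight)]     # sort the data
thresholds <- (head(w, -1) + tail(w, -1)) / 2    # averages of every two adjacent values

# Weighted Gini impurity of splitting at each candidate threshold
gini_split <- sapply(thresholds, function(t) {
  left <- table(y[w < t]); right <- table(y[w >= t])
  sum(left) / length(y) * gini_leaf(left) + sum(right) / length(y) * gini_leaf(right)
})
thresholds[which.min(gini_split)]                # lowest-impurity threshold (72.5 for this toy data)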

Gini Index (Contd ...)

  • For features with ranked data

    • calculate Gini index for all ranks
    • ignore last rank
  • For features with multiple-choice/ nominal data

    • calculate Gini index for each choice as well as each possible combination
    • ignore combination of all choices
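A small R sketch of enumerating the candidate splits for a nominal feature; the colour levels are hypothetical.

choices <- c("Red", "Green", "Blue")             # hypothetical nominal levels

# Each single choice and each combination, ignoring the combination of all choices
candidate_splits <- unlist(
  lapply(seq_len(length(choices) - 1), function(k) combn(choices, k, simplify = FALSE)),
  recursive = FALSE
)
candidate_splits
# A Gini index would then be computed for each {subset} vs. {rest} partition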

Take it easy!

Gini Index Formula: \( Gini = 1 - \sum_{i=1}^{c} p_i^2 \), where \( c \) is the total number of classes and \( p_i \) is the proportion of class \( i \) in the node.

  • R Packages
    • ineq: ineq(data, type = "Gini")
    • DescTools: Gini()
    • reldist: gini()
    • REAT: gini()
    • rpart: rpart(y ~ x + z, data = df, parms = list(split = "gini"))
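A hedged usage sketch of the rpart call listed above, on R's built-in iris data (the dataset choice is ours, not from the slides).

library(rpart)

# Classification tree grown with the Gini index as the splitting criterion
fit <- rpart(Species ~ ., data = iris, method = "class", parms = list(split = "gini"))
print(fit)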

Central Dogma of Random Forests: "Variety" brings "Veracity"

Random Forests: Accuracy

  • Out-of-Bag samples: cases left out of a tree's bootstrap sample act as its individual test cases
  • Out-of-bag error: misclassification error on these left-out cases
  • The number of variables used per step can also be altered
    • to bring variety to the tree structure
    • reduce bias
    • settings available in package functions
    • typical to consider the square root of the number of variables
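A minimal randomForest sketch on iris illustrating the out-of-bag error and the mtry setting; for classification the package defaults to roughly the square root of the number of variables. The dataset, seed, and tuning values are illustrative assumptions.

library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)   # default mtry ~ sqrt(#variables)
rf$err.rate[rf$ntree, "OOB"]                                # out-of-bag misclassification error

# mtry can be altered to bring more (or less) variety to the individual trees
tuneRF(iris[, -5], iris$Species, ntreeTry = 200, stepFactor = 1.5, improve = 0.01)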

Random Forests: Missing Data

  • Imputing missing data | data normalization
  • Data: numerical/ categorical
  • The idea is to make an initial guess and refine it iteratively
    • Categorical: most common class
    • Numerical: median value
  • Refinement:
    • compare samples similar to sample with missing data
    • Build proximity matrix to track similarity
      • (sample x sample) matrix
      • values in corresponding cell augmented by 1 for samples ending in a common leaf node
      • done for both cells, (row x column) and (column x row), of the matrix

Random Forests: Missing Data

  • Similarity
    • Step 1:
      • Build random forest
      • run the dataset on all of the trees
      • construct/ update the proximity matrix
    • Step 2:
      • Divide the proximity matrix by the number of trees
      • these values are the proximity values/ weights to be used for making better guesses about the missing data
    • Step 3:
      • Fill in the “weighted” values to get the first “revised” dataset.
  • Repeat the aforementioned steps several times, until the values converge.
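The randomForest package implements this proximity-weighted refinement; a hedged sketch, deliberately introducing missing values into iris for illustration.

library(randomForest)

set.seed(1)
iris_na <- iris
iris_na[sample(nrow(iris_na), 10), "Sepal.Width"] <- NA   # fabricate some missing values

# Initial guess only: median for numeric columns, most frequent level for factors
rough <- na.roughfix(iris_na)

# Iterative refinement: grow a forest, use proximities to re-weight the guesses, repeat
imputed <- rfImpute(Species ~ ., data = iris_na, iter = 5, ntree = 300)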

Random Forests: Missing Data (Contd...)

  • Proximity Matrix -> Distance Matrix -> Heatmap -> MDS Plot
  • (TP,TN,FP,FN) -> Confusion Matrix -> Accuracy -> ROC Curve
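A hedged sketch of both pipelines using randomForest's built-ins; note that the proximity matrix must be requested when growing the forest. The iris data and seed are illustrative.

library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500, proximity = TRUE)

# Proximity matrix -> distance matrix -> heatmap -> MDS plot
heatmap(1 - rf$proximity, symm = TRUE)      # distance = 1 - proximity
MDSplot(rf, iris$Species, k = 2)            # multidimensional scaling of the distances

# (TP, TN, FP, FN) -> confusion matrix -> accuracy
rf$confusion
sum(diag(rf$confusion[, 1:3])) / sum(rf$confusion[, 1:3])   # overall OOB accuracy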

R Packages

party

ctree(formula, data, controls = ctree_control())
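A hedged usage sketch of ctree() on iris; the control value shown is the package default, spelled out only for illustration.

library(party)

ct <- ctree(Species ~ ., data = iris, controls = ctree_control(mincriterion = 0.95))
plot(ct)                           # plot the conditional inference tree
table(predict(ct), iris$Species)   # resubstitution confusion matrix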

rpart

t <- rpart(formula, data)
rpart.plot(t, extra)
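A concrete, hedged example of the two calls above; note that rpart.plot() comes from the separate rpart.plot package, and the iris data and extra = 104 setting are illustrative choices.

library(rpart)
library(rpart.plot)

t <- rpart(Species ~ ., data = iris, method = "class")
rpart.plot(t, extra = 104)   # per-class probabilities plus the percentage of observations per node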

randomForest

randomForest(formula, data, ntree, mtry)

Disclaimer

Thank You