1 🌳 What is a Decision Tree?
A Decision Tree is a model that predicts outcomes by splitting data into branches, like a flowchart:
- Each node asks a question based on a feature (e.g., "Is age > 30?")
- Each branch represents an answer (Yes/No)
- Each leaf node gives a final prediction (e.g., class or numeric value)
1.1 🎯 Why Use a Decision Tree?
- Simple and easy to interpret
- Works for classification and regression
- Can handle both categorical and numerical features
- No need for scaling or normalization
- Shows feature importance via tree structure
1.2 🔧 How Does It Work?
- Start with the full dataset.
- The algorithm picks the best feature and value to split the data.
- It uses a cost function to decide which split is best.
- The process is repeated recursively on each branch (see the sketch after this list).
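A minimal, illustrative sketch of this recursive procedure (a simplified version written for this post, not scikit-learn's optimized CART), using the Gini impurity defined in Section 2:

```python
import numpy as np

def gini(y):
    """Gini impurity of the labels in a node."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def build_tree(X, y, depth=0, max_depth=3):
    # Stop when the node is pure or the depth limit is reached.
    if len(np.unique(y)) == 1 or depth == max_depth:
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}  # majority class
    best = None
    for j in range(X.shape[1]):                     # try every feature...
        for t in np.unique(X[:, j]):                # ...and every threshold
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            # Weighted impurity of the two children (see Section 2.2.2).
            cost = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or cost < best["cost"]:
                best = {"cost": cost, "feature": j, "threshold": t}
    if best is None:                                # no meaningful split left
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}
    mask = X[:, best["feature"]] <= best["threshold"]
    return {
        "feature": best["feature"],
        "threshold": best["threshold"],
        "left": build_tree(X[mask], y[mask], depth + 1, max_depth),
        "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth),
    }

# Toy data: feature 0 is age, feature 1 is a binary flag (made up for illustration).
X = np.array([[25, 0], [30, 1], [45, 0], [35, 1], [52, 1], [23, 0]])
y = np.array([0, 0, 1, 1, 1, 0])
print(build_tree(X, y))  # splits on age <= 30, yielding two pure leaves
```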
1.3 ❓ When Does the Tree Stop?
- All data in a node is from the same class (pure)
- Maximum depth is reached
- No further meaningful splits
- Too few samples remain in a node to allow another split (see the hyperparameter sketch below)
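In practice, these criteria are usually set as hyperparameters. A sketch with scikit-learn (assuming it is installed; the parameter names below are its actual API):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(
    max_depth=4,                # stop at a maximum depth
    min_samples_split=10,       # a node needs at least 10 samples to be split
    min_samples_leaf=5,         # every leaf must keep at least 5 samples
    min_impurity_decrease=0.0,  # require at least this much impurity reduction
)
clf.fit(X, y)
```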
1.4 ✅ Advantages
- Easy to explain to non-experts
- Works well with non-linear data
- Some implementations handle missing values (e.g., surrogate splits in CART)
- No need for data normalization
1.5 ⚠️ Disadvantages
- Can overfit easily
- Unstable: small data changes may cause big tree changes
- A single tree is often less accurate than ensemble methods or other models
🔧 Tip: Use pruning or ensemble methods like Random Forests to improve performance.
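A sketch of both remedies with scikit-learn: cost-complexity pruning via `ccp_alpha`, and a Random Forest ensemble (the dataset is just a convenient built-in example):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A pruned tree: larger ccp_alpha prunes more aggressively.
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
# An ensemble of trees that averages away single-tree instability.
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print(cross_val_score(pruned, X, y, cv=5).mean())
print(cross_val_score(forest, X, y, cv=5).mean())
```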
1.6 📊 Summary
| Task | Cost Function | What It Measures | Goal |
|---|---|---|---|
| Classification | Gini Index / Entropy | Class impurity in a node | Maximize class purity |
| Regression | Mean Squared Error (MSE) | Squared difference from the node mean | Minimize prediction error |
1.7 🚀 Final Thought
A Decision Tree is like playing "20 Questions": at each step, the model asks the most useful question to move closer to the right answer.
2 📉 Cost Functions
2.1 Classification: the cost function in a classification tree
The goal is to make nodes as pure as possible (mostly one class).
Gini Index: \[ Gini = 1 - \sum p_k^2 \]
Entropy: \[ Entropy = -\sum p_k \log_2(p_k) \]
Where \(p_k\) is the proportion of class \(k\) in the node.
2.2 Impurity: how mixed is a node?
In a classification tree, the goal is to assign each input to a category (e.g., "Yes" or "No", or class A/B/C).
To decide where to split the data, the tree uses a cost function that measures how mixed the classes are in a node.
This is called impurity โ and the tree wants to make nodes that are as pure as possible.
2.2.1 🔹 Main impurity measures
1. Gini Index (used in CART)
\[ Gini = 1 - \sum_{k=1}^{K} p_k^2 \]
Where:
- \(p_k\) is the proportion of class \(k\) in the node
- Gini is 0 when the node is pure (all samples belong to one class)
- Gini is maximum when classes are evenly split
2. Entropy (used in ID3/C4.5)
\[ Entropy = -\sum_{k=1}^{K} p_k \log_2(p_k) \]
Like Gini, entropy is 0 for a pure node and largest when the classes are evenly split.
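To make the formulas concrete, here is a small sketch that computes both measures for a node's labels with NumPy; the function names are our own, not a library API.

```python
import numpy as np

def gini(y):
    """Gini = 1 - sum(p_k^2), where p_k are the class proportions in the node."""
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def entropy(y):
    """Entropy = -sum(p_k * log2(p_k)), skipping classes with p_k = 0."""
    p = np.bincount(y) / len(y)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

y = np.array([0, 0, 0, 1, 1])  # 3 samples of class 0, 2 of class 1
print(gini(y))                 # 1 - (0.6^2 + 0.4^2) = 0.48
print(entropy(y))              # ≈ 0.971 bits
```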
2.2.2 🔍 What does the tree do?
At each step, the tree chooses the split that produces child nodes with the lowest total impurity.
\[ \text{Total impurity} = \frac{n_{left}}{n} \cdot impurity_{left} + \frac{n_{right}}{n} \cdot impurity_{right} \]
It tries different features and thresholds and picks the one that reduces impurity the most.
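For example, take a node with 10 samples (6 of class A, 4 of class B): its Gini is \(1 - (0.6^2 + 0.4^2) = 0.48\). A candidate split that sends 4 pure class-A samples left and the remaining 2 A / 4 B right gives \(Gini_{left} = 0\) and \(Gini_{right} = 1 - \left((1/3)^2 + (2/3)^2\right) = 4/9 \approx 0.44\), so the total impurity is \(\frac{4}{10} \cdot 0 + \frac{6}{10} \cdot 0.44 \approx 0.27\), a clear improvement over 0.48.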
2.2.3 🧠 Intuition
- If a node has mostly one class, the impurity is low → good!
- If a node has a mix of classes, the impurity is high → bad!
- The tree wants to create branches where each node is as โpureโ as possible.
2.2.4 📊 Summary
| Measure | Best Value | Meaning |
|---|---|---|
| Gini Index | 0 | Node is pure |
| Entropy | 0 | Node is pure |
For both measures, a higher value means a more mixed node, i.e., more uncertainty/impurity.
- CART (Classification and Regression Trees) typically uses Gini Index.
- The cost function in classification trees is designed to minimize class impurity at each split.
2.3 Regression: the cost function in a regression tree
The goal is to group similar target values together.
- Mean Squared Error (MSE): \[ MSE = \frac{1}{n} \sum (y_i - \hat{y})^2 \]
Where:
- \(y_i\): actual value
- \(\hat{y}\): predicted value (the mean in the node)
In a regression tree, the goal is to predict numeric values (e.g., house prices, sales, temperature).
To do that, the tree splits the data in a way that minimizes the prediction error.
2.3.1 🎯 What is the prediction at each leaf?
At each leaf of the regression tree, the prediction is simply the average of all the target values in that node.
\[ \hat{y} = \frac{1}{n} \sum_{i=1}^{n} y_i \]
2.3.2 🧮 What is the cost function?
Regression trees use the Mean Squared Error (MSE) as the cost function.
\[ MSE = \frac{1}{n} \sum (y_i - \hat{y})^2 \]
Where:
- \(y_i\) = actual value
- \(\hat{y}\) = predicted value (the mean of the node)
- \(n\) = number of samples in the node
2.3.3 🔍 What does the tree try to do?
At every split, the tree looks for the feature and threshold that minimizes the total MSE in the child nodes.
\[ \text{Total MSE} = \frac{n_{left}}{n} MSE_{left} + \frac{n_{right}}{n} MSE_{right} \]
It tries many possible splits and picks the one that gives the lowest total MSE.
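A minimal sketch of that search for a single numeric feature (the helper names are our own, not a library API):

```python
import numpy as np

def node_mse(y):
    """MSE around the node mean, i.e. the variance of y within the node."""
    return np.mean((y - y.mean()) ** 2) if len(y) else 0.0

def best_split(x, y):
    best_t, best_cost = None, np.inf
    for t in np.unique(x)[:-1]:          # candidate thresholds
        left, right = y[x <= t], y[x > t]
        # Weighted child MSE, as in the Total MSE formula above.
        cost = (len(left) * node_mse(left) + len(right) * node_mse(right)) / len(y)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
print(best_split(x, y))  # splits at 3.0, right where y jumps from ~1 to ~5
```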
2.3.4 🧠 Intuition
- If a node contains very similar values, the MSE will be low → good split.
- If the values are very spread out, the MSE will be high → bad split.
- So the tree wants to create groups (nodes) where the values are close together.
2.3.5 📊 Summary
- The cost function in regression trees is MSE.
- The tree splits the data to minimize MSE in the output variable.
- Final predictions are just the mean of the node.
This is how regression trees learn to make accurate predictions.
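To tie the pieces together, here is a short end-to-end run with scikit-learn's DecisionTreeRegressor (a real class; the toy data is made up for illustration): a depth-1 tree finds the jump in the target and predicts each node's mean.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])

reg = DecisionTreeRegressor(max_depth=1).fit(X, y)  # a single split (a "stump")
print(reg.predict([[2.5], [10.5]]))                 # ~[1.0, 5.0], the node means
```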