1 🌳 What is a Decision Tree?
A Decision Tree is a model that predicts outcomes by splitting data into branches, like a flowchart:
- Each node asks a question based on a feature (e.g., "Is age > 30?")
- Each branch represents an answer (Yes/No)
- Each leaf node gives a final prediction (e.g., class or numeric value)
1.1 🎯 Why Use a Decision Tree?
- Simple and easy to interpret
- Works for classification and regression
- Can handle both categorical and numerical features
- No need for scaling or normalization
- Shows feature importance via tree structure
1.2 🔧 How Does It Work?
- Start with the full dataset.
- The algorithm picks the best feature and value to split the data.
- It uses a cost function to decide which split is best.
- The process is repeated recursively on each branch (see the sketch after this list).
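A minimal, illustrative sketch of this recursive procedure (a simplified version written for this post, not scikit-learn's optimized CART), using the Gini impurity defined in Section 2:

```python
import numpy as np

def gini(y):
    """Gini impurity of the labels in a node."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def build_tree(X, y, depth=0, max_depth=3):
    # Stop when the node is pure or the depth limit is reached.
    if len(np.unique(y)) == 1 or depth == max_depth:
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}  # majority class
    best = None
    for j in range(X.shape[1]):                     # try every feature...
        for t in np.unique(X[:, j]):                # ...and every threshold
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            # Weighted impurity of the two children (see Section 2.2.2).
            cost = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or cost < best["cost"]:
                best = {"cost": cost, "feature": j, "threshold": t}
    if best is None:                                # no meaningful split left
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}
    mask = X[:, best["feature"]] <= best["threshold"]
    return {
        "feature": best["feature"],
        "threshold": best["threshold"],
        "left": build_tree(X[mask], y[mask], depth + 1, max_depth),
        "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth),
    }

# Toy data: feature 0 is age, feature 1 is a binary flag (made up for illustration).
X = np.array([[25, 0], [30, 1], [45, 0], [35, 1], [52, 1], [23, 0]])
y = np.array([0, 0, 1, 1, 1, 0])
print(build_tree(X, y))  # splits on age <= 30, yielding two pure leaves
```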
1.3 ❓ When Does the Tree Stop?
- All data in a node is from the same class (pure)
- Maximum depth is reached
- No further meaningful splits
- Too few samples remain in a node to allow another split (see the hyperparameter sketch below)
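In practice, these criteria are usually set as hyperparameters. A sketch with scikit-learn (assuming it is installed; the parameter names below are its actual API):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(
    max_depth=4,                # stop at a maximum depth
    min_samples_split=10,       # a node needs at least 10 samples to be split
    min_samples_leaf=5,         # every leaf must keep at least 5 samples
    min_impurity_decrease=0.0,  # require at least this much impurity reduction
)
clf.fit(X, y)
```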
1.4 ✅ Advantages
- Easy to explain to non-experts
- Works well with non-linear data
- Some implementations handle missing values (e.g., surrogate splits in CART)
- No need for data normalization
1.5 ⚠️ Disadvantages
- Can overfit easily
- Unstable: small data changes may cause big tree changes
- A single tree is often less accurate than ensemble methods or other models
🔧 Tip: Use pruning or ensemble methods like Random Forests to improve performance.
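A sketch of both remedies with scikit-learn: cost-complexity pruning via `ccp_alpha`, and a Random Forest ensemble (the dataset is just a convenient built-in example):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A pruned tree: larger ccp_alpha prunes more aggressively.
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
# An ensemble of trees that averages away single-tree instability.
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print(cross_val_score(pruned, X, y, cv=5).mean())
print(cross_val_score(forest, X, y, cv=5).mean())
```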
1.6 📊 Summary
| Task | Cost Function | What It Measures | Goal |
|---|---|---|---|
| Classification | Gini Index / Entropy | Class impurity in a node | Maximize class purity |
| Regression | Mean Squared Error (MSE) | Squared difference from the node mean | Minimize prediction error |
1.7 🚀 Final Thought
A Decision Tree is like playing "20 Questions": at each step, the model asks the most useful question to move closer to the right answer.
2 📉 Cost Functions
2.1 Classification: the cost function in a classification tree
The goal is to make nodes as pure as possible (mostly one class).
Gini Index: \[ Gini = 1 - \sum p_k^2 \]
Entropy: \[ Entropy = -\sum p_k \log_2(p_k) \]
Where \(p_k\) is the proportion of class \(k\) in the node.
2.2 Impurity: how mixed is a node?
In a classification tree, the goal is to assign each input to a category (e.g., "Yes" or "No", or class A/B/C).
To decide where to split the data, the tree uses a cost function that measures how mixed the classes are in a node.
This is called impurity โ and the tree wants to make nodes that are as pure as possible.
2.2.1 🔹 Main impurity measures
1. Gini Index (used in CART)
\[ Gini = 1 - \sum_{k=1}^{K} p_k^2 \]
Where:
- \(p_k\) is the proportion of class \(k\) in the node
- Gini is 0 when the node is pure (all samples belong to one class)
- Gini is maximum when classes are evenly split
2. Entropy (used in ID3/C4.5)
\[ Entropy = -\sum_{k=1}^{K} p_k \log_2(p_k) \]
Like Gini, entropy is 0 for a pure node and largest when the classes are evenly split.
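To make the formulas concrete, here is a small sketch that computes both measures for a node's labels with NumPy; the function names are our own, not a library API.

```python
import numpy as np

def gini(y):
    """Gini = 1 - sum(p_k^2), where p_k are the class proportions in the node."""
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def entropy(y):
    """Entropy = -sum(p_k * log2(p_k)), skipping classes with p_k = 0."""
    p = np.bincount(y) / len(y)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

y = np.array([0, 0, 0, 1, 1])  # 3 samples of class 0, 2 of class 1
print(gini(y))                 # 1 - (0.6^2 + 0.4^2) = 0.48
print(entropy(y))              # ≈ 0.971 bits
```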
2.2.2 🔍 What does the tree do?
At each step, the tree chooses the split that produces child nodes with the lowest total impurity.
\[ \text{Total impurity} = \frac{n_{left}}{n} \cdot impurity_{left} + \frac{n_{right}}{n} \cdot impurity_{right} \]
It tries different features and thresholds and picks the one that reduces impurity the most.
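For example, take a node with 10 samples (6 of class A, 4 of class B): its Gini is \(1 - (0.6^2 + 0.4^2) = 0.48\). A candidate split that sends 4 pure class-A samples left and the remaining 2 A / 4 B right gives \(Gini_{left} = 0\) and \(Gini_{right} = 1 - \left((1/3)^2 + (2/3)^2\right) = 4/9 \approx 0.44\), so the total impurity is \(\frac{4}{10} \cdot 0 + \frac{6}{10} \cdot 0.44 \approx 0.27\), a clear improvement over 0.48.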
2.2.3 🧠 Intuition
- If a node has mostly one class, the impurity is low → good!
- If a node has a mix of classes, the impurity is high → bad!
- The tree wants to create branches where each node is as โpureโ as possible.
2.2.4 📊 Summary
| Measure | Best Value | Meaning |
|---|---|---|
| Gini Index | 0 | Node is pure |
| Entropy | 0 | Node is pure |
For both measures, a higher value means a more mixed node, i.e., more uncertainty/impurity.
- CART (Classification and Regression Trees) typically uses Gini Index.
- The cost function in classification trees is designed to minimize class impurity at each split.
2.3 Regression: the cost function in a regression tree
The goal is to group similar target values together.
- Mean Squared Error (MSE): \[ MSE = \frac{1}{n} \sum (y_i - \hat{y})^2 \]
Where:
- \(y_i\): actual value
- \(\hat{y}\): predicted value (the mean in the node)
In a regression tree, the goal is to predict numeric values (e.g., house prices, sales, temperature).
To do that, the tree splits the data in a way that minimizes the prediction error.
2.3.1 🎯 What is the prediction at each leaf?
At each leaf of the regression tree, the prediction is simply the average of all the target values in that node.
\[ \hat{y} = \frac{1}{n} \sum_{i=1}^{n} y_i \]
2.3.2 🧮 What is the cost function?
Regression trees use the Mean Squared Error (MSE) as the cost function.
\[ MSE = \frac{1}{n} \sum (y_i - \hat{y})^2 \]
Where:
- \(y_i\) = actual value
- \(\hat{y}\) = predicted value (the mean of the node)
- \(n\) = number of samples in the node
2.3.3 🔍 What does the tree try to do?
At every split, the tree looks for the feature and threshold that minimizes the total MSE in the child nodes.
\[ \text{Total MSE} = \frac{n_{left}}{n} MSE_{left} + \frac{n_{right}}{n} MSE_{right} \]
It tries many possible splits and picks the one that gives the lowest total MSE.
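A minimal sketch of that search for a single numeric feature (the helper names are our own, not a library API):

```python
import numpy as np

def node_mse(y):
    """MSE around the node mean, i.e. the variance of y within the node."""
    return np.mean((y - y.mean()) ** 2) if len(y) else 0.0

def best_split(x, y):
    best_t, best_cost = None, np.inf
    for t in np.unique(x)[:-1]:          # candidate thresholds
        left, right = y[x <= t], y[x > t]
        # Weighted child MSE, as in the Total MSE formula above.
        cost = (len(left) * node_mse(left) + len(right) * node_mse(right)) / len(y)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
print(best_split(x, y))  # splits at 3.0, right where y jumps from ~1 to ~5
```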
2.3.4 🧠 Intuition
- If a node contains very similar values, the MSE will be low → good split.
- If the values are very spread out, the MSE will be high → bad split.
- So the tree wants to create groups (nodes) where the values are close together.
2.3.5 📊 Summary
- The cost function in regression trees is MSE.
- The tree splits the data to minimize MSE in the output variable.
- Final predictions are just the mean of the node.
This is how regression trees learn to make accurate predictions.
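To tie the pieces together, here is a short end-to-end run with scikit-learn's DecisionTreeRegressor (a real class; the toy data is made up for illustration): a depth-1 tree finds the jump in the target and predicts each node's mean.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])

reg = DecisionTreeRegressor(max_depth=1).fit(X, y)  # a single split (a "stump")
print(reg.predict([[2.5], [10.5]]))                 # ~[1.0, 5.0], the node means
```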