set.seed(20)

1 Introduction

Iterative Dichotomiser 3 (ID3) is a popular algorithm used to generate decision trees.

ID3 (Iterative Dichotomiser 3) is an algorithm used to create a decision tree, which is a predictive model used in machine learning and statistics. ID3 is particularly suited for classification problems where the target variable is categorical. Developed by Ross Quinlan, ID3 constructs decision trees using a top-down, greedy approach, selecting the attribute that maximizes information gain at each step.

ID3 can be adapted to handle continuous data as well. To be this, the comtinous data needs to be converted to into categorical attributes by defining thresholds. This involves splitting the continuous attribute into intervals or bins. One common method is to choose the threshold that provides the highest information gain https://www.kaggle.com/code/mrbisht/discretization-continuous-variables https://www.kaggle.com/code/mrbisht/discretization-continuous-variables.

ID3 primarily uses information gain based on entropy reduction to split attributes. However, some variations or adaptations of decision tree algorithms may use standard deviation reduction (SDR), particularly when dealing with continuous target variables in regression trees. This is more commonly associated with algorithms like CART (Classification and Regression Trees).

2 Standard Deviation Reduction (SDR)

When dealing with continuous data, especially for regression tasks, standard deviation reduction can be used to measure the effectiveness of a split. The basic idea is to choose splits that reduce the standard deviation within the resulting subsets.

The standard deviation is use to calculate the homogeneity of a numerical sample. If the numerical sample is completely homogeneous, its standard deviation would be zero (0)

In this manual, we’ll do a short demonstration of this algorithm, and also implement it in R.

To start with, consider the table below, which is the training data. From this data we’re to train a model that is capable of predicting student scores based on some given features. After training this model, teachers, parents, and students should be able to decide on what would a student score be based on those features. In this problem, the target variable is G3, while the features are Guardian, Sex, Studyingtime, and Paid.

sex guardian studytime G3
F other 3 yes 13
F mother 2 no 14
F mother 3 no 15
F mother 1 no 11
F mother 2 no 17
F mother 2 no 11
M mother 2 no 10
F father 2 no 10
F other 2 no 9

3 Computational Implementation

Firtly, we state the required formular for this process. For every target variable T and some predictors X, we have that:

  1. The mean of the continuous target variable is given as \[\bar{T} = \frac{1}{n} \sum_{i=1}^{n} T_i\]
  2. The standard deviation of the target variable is given as

\[\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (T_i - \bar{T})^2}\] 3. The coefficient of variation which is a measure of relative variability. It is the ratio of the standard deviation to the mean and is often expressed as a percentage is given as \[CV = \frac{\sigma}{\bar{T}} \times 100\%\]

  1. The Standard deviation for two attributes (prediction and target) is given as

\[\begin{equation} \sigma(T,X) = \sum_{c \in X}P(c)\sigma(c) \end{equation}\]

  1. Finally, the algorithm standard deviation reduction is calculated with the equation below: \[\begin{equation} \sigma R(T,X) = \sigma(T) - \sigma (T,X) \end{equation}\]

3.1 Decision criteria and Splitting conditions

  • The attribute(predictor) with the largest reduction is chosen for the decision node
  • The CV for any branch must be smaller than 10% or when there are few instances (n) remaining in that branch after split. If any branch fail the threshold, we split such branch further to reduce the impurity.

3.2 Process of Training

We start training the model by calculating the mean of the target continous variable, and its standard deviation respectively. - Count of G3 = 8 \[\begin{equation} \bar{G3} = \frac{1}{n} \sum_{i=1}^{n} G3_i = \frac{8+11+18+14+19+11+6+9}{8} \end{equation}\]

\[= \frac{96}{8}\] \[=12\] \[\begin{equation} \sigma (G3) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (T_i - \bar{G3})^2} =\sqrt{\frac{(8-12)^2+(11-12)^2+(18-12)^2+(14-12)^2+(19-12)^2+(11-12)^2+(6-12)^2+(9-12)^2}{8}} \\ =\sqrt{\frac{(-4)^2+(-1)^2+(6)^2+(2)^2+(7)^2+(-1)^2+(-6)^2+(-3)^2}{8}}\\ =\sqrt{\frac{16+1+36+4+49+1+36+9}{8}}\\ =\sqrt{\frac{152}{8}}\\ =\sqrt{19}\\ = 4.36 \end{equation}\]

\[\begin{equation} CV = \frac{\sigma (G3)}{\bar{G3}} \\ = \frac{4.36}{12} \times 100\% = 36.33\% \end{equation}\]

3.2.1 Now we calculate the mean, standard deviation, coefficient of variation, and standard deviation reduction of each attribute with respect to the target variable (G3).

The first feature is guardian

A quick explanation of how this would be done:

In the dataset given, from the guardian attribute, observe the number of instances Father occure, and for those instances pick the corresponding G3. Do the same for other guardian levels, you’ll have the result below:

3.2.1.1 Father serving as a Guardian

Count of Father = 3 G3 of Guardian as Father = 8, 14, 11.

  • Mean of G3, when Guardians is Father \[\begin{equation} \bar{(G3|G=F)} = \frac{8+14+11}{3} \\ =11 \end{equation}\]

  • Standard deviation of G3 when Guardians is Father \[\begin{equation} \sqrt{\frac{(8-11)^2+(14-11)^2+(11-11)^2}{3}} \\ =\sqrt{\frac{(-3)^2+(3)^2+(0)^2}{3}}\\ =\sqrt{\frac{9+9}{3}}\\ =\sqrt{\frac{18}{3}}\\ =\sqrt{6}\\ =2.45 \end{equation}\]

  • CV of G3 when Guardians is Father

\[\begin{equation} CV(Father) = \frac{\sigma(Father)}{\bar{G3|G=Father}} \times 100\% \\ CV(Father) = \frac{2.45}{11} \times 100\% \\ CV(Father) = 22.27\% \end{equation}\]

3.2.1.2 Mother serving as a Guardian

Count of Mother = 3 G3 of Guardian as Mother = 11, 19, 9.

-Mean of G3, when Guardians is Mother \[\begin{equation} \bar{G3|G=M} = \frac{11+19+9}{3}\\ =\frac{39}{3}\\ =13 \end{equation}\]

  • Standard deviation of G3, when Guardians is Mother \[\begin{equation} \sqrt{\frac{(11-13)^2+(19-13)^2+(9-13)^2}{3}} \\ =\sqrt{\frac{(-2)^2+(6)^2+(-4)^2}{3}}\\ =\sqrt{\frac{4+35+16}{3}}\\ =\sqrt{\frac{36}{3}}\\ =\sqrt{18.667}\\ =4.32 \end{equation}\]

  • CV of Mother as a Guardian \[\begin{equation} CV(Mother) = \frac{\sigma(Mother)}{\bar{G3|G=Mother}} \times 100\% \\ CV(Mother) = \frac{4.32}{13} \times 100\% \\ CV(Mother) = 33.23\% \end{equation}\]

3.2.1.3 Other forms of Guardian

Count of Other = 2 G3 of Guardian as Other = 18, 6.

-Mean \[\begin{equation} \bar{G3|G=O} = \frac{18+6}{2} \\ =\frac{24}{2} \\ =12 \end{equation}\]

  • Standard deviation \[\begin{equation} \sqrt{\frac{(18-12)^2+(6-12)^2}{2}} \\ =\sqrt{\frac{(6)^2+(-6)^2}{2}}\\ =\sqrt{\frac{36+36}{2}}\\ =\sqrt{\frac{72}{2}}\\ =\sqrt{36}\\ =6 \end{equation}\]

  • CV of Other forms of Guardian \[\begin{equation} CV(Other) = \frac{\sigma(O)}{\bar{G3|G=Other}} \times 100\% \\ CV(Other) = \frac{6}{12} \times 100\% \\ CV(Other) = 50\% \end{equation}\]

  • Standard deviation of Guardian \[\begin{equation} \sigma (G) = \sum_{c \in G} P(c) \sigma (c) \\ =\frac{3}{8} \times 2.45 + \frac{3}{8}\times 4.32 + \frac{2}{8}\times 6 \\ = 0.9375 + 1.6125 + 1.5 \\ = 4.05 \end{equation}\]

  • Standard deviation Reduction of Guardian \[\begin{equation} \sigma R(G3, G) = \sigma (G3) - \sigma (G) \\ = 4.36 - 4.05\\ = 0.31 \end{equation}\]

Guardian Ave STDev CV Count
Father 11 2.45 22.27 3.00
Mother 13 4.32 33.23 3.00
Other 12 6 50 2.00
SDR 0.31

3.2.2 Sex feature

3.2.2.1 Female candidate

Count of Female = 5 G3 of Female sex = 8, 14, 11, 6, 9.

-Mean of G3, when Sex is Female \[\begin{equation} \bar{G3|Sex=F} = \frac{8+14+11+6+9}{5}\\ =\frac{48}{5}\\ =9.6 \end{equation}\]

  • Standard deviation of G3 when Sex is Female \[\begin{equation} \sqrt{\frac{(8-9.6)^2+(14-9.6)^2+(11-9.6)^2+(6-9.6)^2+(9-9.6)^2}{5}}\\ =\sqrt{\frac{(-1.6)^2+(4.4)^2+(1.4)^2+(-3.6)^2+(-0.6)^2}{5}}\\ =\sqrt{\frac{2.56+19.36+1.96+12.96+0.36}{5}}\\ =\sqrt{\frac{37.2}{5}}\\ =\sqrt{7.44}\\ =2.73 \end{equation}\]

  • CV of G3 when Sex is Female \[\begin{equation} CV(Female) = \frac{\sigma(Female)}{\bar{G3|Sex=Female}} \times 100\%\\ CV(Female) = \frac{2.73}{9.6} \times 100\%\\ CV(Female) = 28.41\% \end{equation}\]

3.2.2.2 Male candidate

Count of Male = 3 G3 of Male sex = 11, 18, 19.

-Mean of G3, when Sex is Male \[\begin{equation} \bar{G3|Sex=F} = \frac{11+18+19}{3}\\ =\frac{48}{3}\\ =16 \end{equation}\]

  • Standard deviation of G3 when Sex is Male \[\begin{equation} \sqrt{\frac{(11-16)^2+(18-16)^2+(19-16)^2+(9-16)^2}{3}}\\ =\sqrt{\frac{(-5)^2+(2)^2+(3)^2+(-7)^2}{3}}\\ =\sqrt{\frac{25+4+49}{3}}\\ =\sqrt{\frac{78}{3}}\\ =\sqrt{26}\\ =5.10 \end{equation}\]

  • CV of G3 when Sex is Male \[\begin{equation} CV(Male) = \frac{\sigma(Male)}{\bar{G3|Sex=Male}} \times 100\%\\ CV(Male) = \frac{5.10}{16} \times 100\%\\ CV(Male) = 31.87\% \end{equation}\]

  • Standard deviation of Guardian \[\begin{equation} \sigma (S) = \sum_{c \in S} P(c) \sigma (c)\\ =\frac{5}{8} \times 2.73 + \frac{3}{8}\times 5.10 \\ = 1.7063 + 1.9125\\ = 3.61875 \end{equation}\]

  • Standard deviation Reduction of Sex \[\begin{equation} \sigma R(G3, S) = \sigma (G3) - \sigma (S) \\ = 4.36 - 3.62\\ = 0.74 \end{equation}\]

Sex Ave STDev CV Count
Female 9.6 2.73 28.41 5.00
Male 16 5.1 31.87 3.00
SDR 0.74

3.2.3 Study time feature

3.2.3.1 One hour of study per day

Count of One hour of study per day G3 of One hour = 8, 18, 11.

-Mean of G3, when Sudent study for one hr \[\begin{equation} \bar{G3|Study= 1} = \frac{8+18+11}{3} \\ =\frac{37}{3}\\ =12.33 \end{equation}\]

  • Standard deviation of G3 when Student study for one hr \[\begin{equation} \sqrt{\frac{(8-12.33)^2+(18-12.33)^2+(11-12.33)^2}{3}}\\ =\sqrt{\frac{(-4.33)^2+(5.67)^2+(-1.33)^2}{3}}\\ =\sqrt{\frac{18.7489+32.1489+1.7689}{3}}\\ =\sqrt{\frac{52.6667}{3}}\\ =\sqrt{17.56}\\ =4.19 \end{equation}\]

  • CV of G3 when Study time is one hr \[\begin{equation} CV(One) = \frac{\sigma(One)}{\bar{G3|Sudytime=One}} \times 100\% \\ CV(One) = \frac{4.19}{14} \times 100\% \\ CV(One) = 29.93\% \end{equation}\]

3.2.3.2 Two hours of study per day

Count of Two hours of study per day G3 of Two hours = 11,6.

-Mean of G3, when Sudent study for two hrs \[\begin{equation} \bar{G3|Study= 1} = \frac{11+6}{2} \\ =\frac{17}{2}\\ = 8.5 \end{equation}\]

  • Standard deviation of G3 when Student study for two hrs \[\begin{equation} \sqrt{\frac{(11-8.5)^2+(6-8.5)^2}{2}}\\ =\sqrt{\frac{(2.5)^2+(-2.5)^2}{2}}\\ =\sqrt{\frac{6.25+6.25}{2}}\\ =\sqrt{\frac{12.5}{2}}\\ =\sqrt{6.25}\\ =2.5 \end{equation}\]

  • CV of G3 when Study time is two hr \[\begin{equation} CV(Two) = \frac{\sigma(Two)}{\bar{G3|Sudytime=Two}} \times 100\% \\ CV(Two) = \frac{2.5}{8.5} \times 100\% \\ CV(Two) = 29.41\% \end{equation}\]

3.2.3.3 Three hours of study per day

Count of Three hours of study per day G3 of Three hours = 19, 9.

-Mean of G3, when Student study for three hrs \[\begin{equation} \bar{G3|Study= 3} = \frac{19+9}{2} \\ =\frac{28}{2}\\ =14 \end{equation}\]

  • Standard deviation of G3 when Student study for three hrs \[\begin{equation} \sqrt{\frac{(19-14)^2+(9-14)^2}{2}}\\ =\sqrt{\frac{(5)^2+(-5)^2}{2}}\\ =\sqrt{\frac{25+25}{2}}\\ =\sqrt{\frac{50}{2}}\\ =\sqrt{25}\\ =5 \end{equation}\]

  • CV of G3 when Study time is three hrs \[\begin{equation} CV(Three) = \frac{\sigma(Two)}{\bar{G3|Sudytime=Three}} \times 100\% \\ CV(Three) = \frac{5}{14} \times 100\% \\ CV(Three) = 35.71\% \end{equation}\]

3.2.3.4 Four hours of study per day

Count of Four hours of study per day is 1 G3 of Four hours = 11.

-Mean of G3, when Student study for four hrs \[\begin{equation} \bar{G3|Study= 4} = \frac{11}{1}\\ =11 \end{equation}\]

  • Standard deviation of G3 when Student study for four hrs \[\begin{equation} \sqrt{\frac{(11-141^2}{2}}\\ =\sqrt{\frac{(0)^2}{1}}\\ =0 \end{equation}\]

  • CV of G3 when Study time is four hrs \[\begin{equation} CV(Four) = \frac{\sigma(Four)}{\bar{G3|Sudytime=Four}} \times 100\%\\ CV(Four) = \frac{0}{11} \times 100\%\\ CV(Four) = 0\% \end{equation}\]

  • Standard deviation of Study hr \[\begin{equation} \sigma (St) = \sum_{c \in St} P(c) \sigma (c)\\ =\frac{3}{8} \times 4.19 + \frac{2}{8}\times 2.5+ \frac{2}{8} \times 5 + \frac{1}{8} \times 0 \\ = 1.571 + 0.625 + 1.25\\ = 3.446 \end{equation}\]

  • Standard deviation Reduction of Study \[\begin{equation} \sigma R(G3, S) = \sigma (G3) - \sigma (St)\\ = 4.36 - 3.446\\ = 0.914 \end{equation}\]

StudyTime Ave STDev CV Count
1 12.33 4.19 29.93 3.000
2 8.5 2.5 29.41 2.000
3 14 5 35.71 2.000
4 0 0 0 1.000
SDR 0.914

3.3 Selecting the node for the split.

From the result of the analysis in the table above, the SDR(i.e \(\sigma\)R) for Guardian is 0.31, Sex is 0.74, Study time is 0.914, and Paid is 2.2. Out of all these features, Paid has the largest SDR making it the best out of all the features. However, study time is the next after Sex.

The two levels of Paid (No and Yes) have a coefficient of variation greater than 10% which is the minimum and the threshold for branch split. Hence, we split this two branches further.

4 Implementation in R

library(tidyverse)
library(readxl)
data <- read_excel("C:/Users/DELL/Desktop/2024_Projects/Iterative Dichotomizer 3/ID3_DATASET.xlsx")
head(data, 15)
## # A tibble: 15 × 5
##    sex   guardian studytime paid     G3
##    <chr> <chr>        <dbl> <chr> <dbl>
##  1 F     mother           2 no       11
##  2 F     father           2 no       11
##  3 F     mother           2 no       12
##  4 F     mother           3 no       14
##  5 F     father           2 no       13
##  6 M     mother           2 no       13
##  7 M     mother           2 no       13
##  8 F     mother           2 no       13
##  9 M     mother           2 no       17
## 10 M     mother           2 no       13
## 11 F     mother           2 no       14
## 12 F     father           3 no       13
## 13 M     father           1 no       12
## 14 M     mother           2 no       13
## 15 M     other            3 no       15

4.1 Data Partitioning

library(caTools)
sap <- sample.split(data$sex, SplitRatio = 0.7)
Train <- subset(data, sap==T)
Test <- subset(data, sap==F)

4.2 Training the model

#install.packages("C50")
library(C50)
## Warning: package 'C50' was built under R version 4.4.1
Model_IDE3 <- C5.0(as.factor(G3)~., data = Train)
pr <- predict(Model_IDE3, newdata = Test)
pr
##   [1] 12 11 10 11 11 11 11 10 10 13 12 11 11 11 11 11 11 11 12 12 11 11 10 11 12
##  [26] 11 10 11 13 11 10 11 13 10 11 11 11 11 11 12 11 11 10 11 10 11 11 11 13 12
##  [51] 11 11 11 11 11 10 11 13 10 11 10 11 11 11 11 11 13 11 11 11 10 11 13 13 11
##  [76] 11 11 11 11 11 11 11 11 13 10 11 12 11 11 11 11 13 10 13 10 11 11 10 11 13
## [101] 9  10 11 12 13 10 13 11 10 12 11 11 11 11 10 13 11 11 13 11 12 13 13 11 11
## [126] 11 10 10 11 11 11 13 12 11 11 11 11 11 13 11 11 11 13 11 10 12 11 11 12 11
## [151] 12 13 11 12 11 11 11 11 12 11 11 11 10 12 12 11 11 11 13 11 11 12 11 11 10
## [176] 11 11 11 11 11 11 9  13 12 12 11 13 13 13 9  11 11 11 13 11
## Levels: 0 1 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
#confusionMatrix(as.factor(pr), as.factor(Test$G3))
as.factor(Test$G3)
##   [1] 13 13 13 12 17 12 12 15 12 12 11 13 13 16 13 15 14 16 10 12 16 11 15 11 10
##  [26] 14 10 13 13 10 16 11 14 11 10 18 14 11 14 13 9  15 12 8  11 12 11 9  13 12
##  [51] 10 8  11 13 17 10 11 11 14 14 10 13 12 10 14 15 8  15 13 13 11 11 13 10 8 
##  [76] 11 13 14 13 10 8  11 10 15 15 15 11 7  12 13 10 17 11 13 16 10 15 15 10 15
## [101] 13 17 17 15 8  18 11 15 17 14 13 17 13 10 15 14 10 10 14 15 14 14 17 16 15
## [126] 10 18 17 11 17 13 11 10 7  8  12 12 16 9  8  9  11 8  11 13 11 11 13 8  10
## [151] 11 9  10 9  10 10 8  15 8  8  8  9  8  14 11 8  12 13 13 10 10 12 10 9  8 
## [176] 10 11 0  13 10 10 0  11 10 12 9  12 17 12 9  14 16 0  10 10
## Levels: 0 7 8 9 10 11 12 13 14 15 16 17 18
data_re<-na.omit(data)
length(data_re$G3)
## [1] 649
length(data$G3)
## [1] 649
plot(Model_IDE3)
## Warning in partysplit(varid = as.integer(i), breaks = as.numeric(j[1]), : NAs
## introduced by coercion
## Warning in partysplit(varid = as.integer(i), breaks = as.numeric(j[1]), : NAs
## introduced by coercion
## Warning in partysplit(varid = as.integer(i), breaks = as.numeric(j[1]), : NAs
## introduced by coercion
## Warning in partysplit(varid = as.integer(i), breaks = as.numeric(j[1]), : NAs
## introduced by coercion
## Warning in partysplit(varid = as.integer(i), breaks = as.numeric(j[1]), : NAs
## introduced by coercion
## Warning in partysplit(varid = as.integer(i), breaks = as.numeric(j[1]), : NAs
## introduced by coercion
## Warning in partysplit(varid = as.integer(i), breaks = as.numeric(j[1]), : NAs
## introduced by coercion
## Warning in partysplit(varid = as.integer(i), breaks = as.numeric(j[1]), : NAs
## introduced by coercion
## Warning in partysplit(varid = as.integer(i), breaks = as.numeric(j[1]), : NAs
## introduced by coercion
## Warning in partysplit(varid = as.integer(i), breaks = as.numeric(j[1]), : NAs
## introduced by coercion
## Warning in partysplit(varid = as.integer(i), breaks = as.numeric(j[1]), : NAs
## introduced by coercion
## Warning in .bincode(as.numeric(x), breaks = unique(c(-Inf, breaks_split(split),
## : NAs introduced by coercion
## Warning in .bincode(as.numeric(x), breaks = unique(c(-Inf, breaks_split(split),
## : NAs introduced by coercion
## Warning in .bincode(as.numeric(x), breaks = unique(c(-Inf, breaks_split(split),
## : NAs introduced by coercion
## Warning in .bincode(as.numeric(x), breaks = unique(c(-Inf, breaks_split(split),
## : NAs introduced by coercion
## Warning in .bincode(as.numeric(x), breaks = unique(c(-Inf, breaks_split(split),
## : NAs introduced by coercion
## Warning in .bincode(as.numeric(x), breaks = unique(c(-Inf, breaks_split(split),
## : NAs introduced by coercion
## Warning in .bincode(as.numeric(x), breaks = unique(c(-Inf, breaks_split(split),
## : NAs introduced by coercion
## Warning in .bincode(as.numeric(x), breaks = unique(c(-Inf, breaks_split(split),
## : NAs introduced by coercion
## Warning in .bincode(as.numeric(x), breaks = unique(c(-Inf, breaks_split(split),
## : NAs introduced by coercion
## Warning in .bincode(as.numeric(x), breaks = unique(c(-Inf, breaks_split(split),
## : NAs introduced by coercion