set.seed(20)
Iterative Dichotomiser 3 (ID3) is a popular algorithm used to generate decision trees.
ID3 (Iterative Dichotomiser 3) is an algorithm used to create a decision tree, which is a predictive model used in machine learning and statistics. ID3 is particularly suited for classification problems where the target variable is categorical. Developed by Ross Quinlan, ID3 constructs decision trees using a top-down, greedy approach, selecting the attribute that maximizes information gain at each step.
ID3 can be adapted to handle continuous data as well. To be this, the comtinous data needs to be converted to into categorical attributes by defining thresholds. This involves splitting the continuous attribute into intervals or bins. One common method is to choose the threshold that provides the highest information gain https://www.kaggle.com/code/mrbisht/discretization-continuous-variables https://www.kaggle.com/code/mrbisht/discretization-continuous-variables.
ID3 primarily uses information gain based on entropy reduction to split attributes. However, some variations or adaptations of decision tree algorithms may use standard deviation reduction (SDR), particularly when dealing with continuous target variables in regression trees. This is more commonly associated with algorithms like CART (Classification and Regression Trees).
When dealing with continuous data, especially for regression tasks, standard deviation reduction can be used to measure the effectiveness of a split. The basic idea is to choose splits that reduce the standard deviation within the resulting subsets.
The standard deviation is use to calculate the homogeneity of a numerical sample. If the numerical sample is completely homogeneous, its standard deviation would be zero (0)
In this manual, we’ll do a short demonstration of this algorithm, and also implement it in R.
To start with, consider the table below, which is the training data. From this data we’re to train a model that is capable of predicting student scores based on some given features. After training this model, teachers, parents, and students should be able to decide on what would a student score be based on those features. In this problem, the target variable is G3, while the features are Guardian, Sex, Studyingtime, and Paid.
| sex | guardian | studytime | paid | G3 |
|---|---|---|---|---|
| F | other | 3 | yes | 13 |
| F | mother | 2 | no | 14 |
| F | mother | 3 | no | 15 |
| F | mother | 1 | no | 11 |
| F | mother | 2 | no | 17 |
| F | mother | 2 | no | 11 |
| M | mother | 2 | no | 10 |
| F | father | 2 | no | 10 |
| F | other | 2 | no | 9 |
Firtly, we state the required formular for this process. For every target variable T and some predictors X, we have that:
\[\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (T_i - \bar{T})^2}\] 3. The coefficient of variation which is a measure of relative variability. It is the ratio of the standard deviation to the mean and is often expressed as a percentage is given as \[CV = \frac{\sigma}{\bar{T}} \times 100\%\]
\[\begin{equation} \sigma(T,X) = \sum_{c \in X}P(c)\sigma(c) \end{equation}\]
We start training the model by calculating the mean of the target continous variable, and its standard deviation respectively. - Count of G3 = 8 \[\begin{equation} \bar{G3} = \frac{1}{n} \sum_{i=1}^{n} G3_i = \frac{8+11+18+14+19+11+6+9}{8} \end{equation}\]
\[= \frac{96}{8}\] \[=12\] \[\begin{equation} \sigma (G3) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (T_i - \bar{G3})^2} =\sqrt{\frac{(8-12)^2+(11-12)^2+(18-12)^2+(14-12)^2+(19-12)^2+(11-12)^2+(6-12)^2+(9-12)^2}{8}} \\ =\sqrt{\frac{(-4)^2+(-1)^2+(6)^2+(2)^2+(7)^2+(-1)^2+(-6)^2+(-3)^2}{8}}\\ =\sqrt{\frac{16+1+36+4+49+1+36+9}{8}}\\ =\sqrt{\frac{152}{8}}\\ =\sqrt{19}\\ = 4.36 \end{equation}\]
\[\begin{equation} CV = \frac{\sigma (G3)}{\bar{G3}} \\ = \frac{4.36}{12} \times 100\% = 36.33\% \end{equation}\]
The first feature is guardian
A quick explanation of how this would be done:
In the dataset given, from the guardian attribute, observe the number of instances Father occure, and for those instances pick the corresponding G3. Do the same for other guardian levels, you’ll have the result below:
Count of Father = 3 G3 of Guardian as Father = 8, 14, 11.
Mean of G3, when Guardians is Father \[\begin{equation} \bar{(G3|G=F)} = \frac{8+14+11}{3} \\ =11 \end{equation}\]
Standard deviation of G3 when Guardians is Father \[\begin{equation} \sqrt{\frac{(8-11)^2+(14-11)^2+(11-11)^2}{3}} \\ =\sqrt{\frac{(-3)^2+(3)^2+(0)^2}{3}}\\ =\sqrt{\frac{9+9}{3}}\\ =\sqrt{\frac{18}{3}}\\ =\sqrt{6}\\ =2.45 \end{equation}\]
CV of G3 when Guardians is Father
\[\begin{equation} CV(Father) = \frac{\sigma(Father)}{\bar{G3|G=Father}} \times 100\% \\ CV(Father) = \frac{2.45}{11} \times 100\% \\ CV(Father) = 22.27\% \end{equation}\]
Count of Mother = 3 G3 of Guardian as Mother = 11, 19, 9.
-Mean of G3, when Guardians is Mother \[\begin{equation} \bar{G3|G=M} = \frac{11+19+9}{3}\\ =\frac{39}{3}\\ =13 \end{equation}\]
Standard deviation of G3, when Guardians is Mother \[\begin{equation} \sqrt{\frac{(11-13)^2+(19-13)^2+(9-13)^2}{3}} \\ =\sqrt{\frac{(-2)^2+(6)^2+(-4)^2}{3}}\\ =\sqrt{\frac{4+35+16}{3}}\\ =\sqrt{\frac{36}{3}}\\ =\sqrt{18.667}\\ =4.32 \end{equation}\]
CV of Mother as a Guardian \[\begin{equation} CV(Mother) = \frac{\sigma(Mother)}{\bar{G3|G=Mother}} \times 100\% \\ CV(Mother) = \frac{4.32}{13} \times 100\% \\ CV(Mother) = 33.23\% \end{equation}\]
Count of Other = 2 G3 of Guardian as Other = 18, 6.
-Mean \[\begin{equation} \bar{G3|G=O} = \frac{18+6}{2} \\ =\frac{24}{2} \\ =12 \end{equation}\]
Standard deviation \[\begin{equation} \sqrt{\frac{(18-12)^2+(6-12)^2}{2}} \\ =\sqrt{\frac{(6)^2+(-6)^2}{2}}\\ =\sqrt{\frac{36+36}{2}}\\ =\sqrt{\frac{72}{2}}\\ =\sqrt{36}\\ =6 \end{equation}\]
CV of Other forms of Guardian \[\begin{equation} CV(Other) = \frac{\sigma(O)}{\bar{G3|G=Other}} \times 100\% \\ CV(Other) = \frac{6}{12} \times 100\% \\ CV(Other) = 50\% \end{equation}\]
Standard deviation of Guardian \[\begin{equation} \sigma (G) = \sum_{c \in G} P(c) \sigma (c) \\ =\frac{3}{8} \times 2.45 + \frac{3}{8}\times 4.32 + \frac{2}{8}\times 6 \\ = 0.9375 + 1.6125 + 1.5 \\ = 4.05 \end{equation}\]
Standard deviation Reduction of Guardian \[\begin{equation} \sigma R(G3, G) = \sigma (G3) - \sigma (G) \\ = 4.36 - 4.05\\ = 0.31 \end{equation}\]
| Guardian | Ave | STDev | CV | Count |
|---|---|---|---|---|
| Father | 11 | 2.45 | 22.27 | 3.00 |
| Mother | 13 | 4.32 | 33.23 | 3.00 |
| Other | 12 | 6 | 50 | 2.00 |
| SDR | 0.31 |
Count of Female = 5 G3 of Female sex = 8, 14, 11, 6, 9.
-Mean of G3, when Sex is Female \[\begin{equation} \bar{G3|Sex=F} = \frac{8+14+11+6+9}{5}\\ =\frac{48}{5}\\ =9.6 \end{equation}\]
Standard deviation of G3 when Sex is Female \[\begin{equation} \sqrt{\frac{(8-9.6)^2+(14-9.6)^2+(11-9.6)^2+(6-9.6)^2+(9-9.6)^2}{5}}\\ =\sqrt{\frac{(-1.6)^2+(4.4)^2+(1.4)^2+(-3.6)^2+(-0.6)^2}{5}}\\ =\sqrt{\frac{2.56+19.36+1.96+12.96+0.36}{5}}\\ =\sqrt{\frac{37.2}{5}}\\ =\sqrt{7.44}\\ =2.73 \end{equation}\]
CV of G3 when Sex is Female \[\begin{equation} CV(Female) = \frac{\sigma(Female)}{\bar{G3|Sex=Female}} \times 100\%\\ CV(Female) = \frac{2.73}{9.6} \times 100\%\\ CV(Female) = 28.41\% \end{equation}\]
Count of Male = 3 G3 of Male sex = 11, 18, 19.
-Mean of G3, when Sex is Male \[\begin{equation} \bar{G3|Sex=F} = \frac{11+18+19}{3}\\ =\frac{48}{3}\\ =16 \end{equation}\]
Standard deviation of G3 when Sex is Male \[\begin{equation} \sqrt{\frac{(11-16)^2+(18-16)^2+(19-16)^2+(9-16)^2}{3}}\\ =\sqrt{\frac{(-5)^2+(2)^2+(3)^2+(-7)^2}{3}}\\ =\sqrt{\frac{25+4+49}{3}}\\ =\sqrt{\frac{78}{3}}\\ =\sqrt{26}\\ =5.10 \end{equation}\]
CV of G3 when Sex is Male \[\begin{equation} CV(Male) = \frac{\sigma(Male)}{\bar{G3|Sex=Male}} \times 100\%\\ CV(Male) = \frac{5.10}{16} \times 100\%\\ CV(Male) = 31.87\% \end{equation}\]
Standard deviation of Guardian \[\begin{equation} \sigma (S) = \sum_{c \in S} P(c) \sigma (c)\\ =\frac{5}{8} \times 2.73 + \frac{3}{8}\times 5.10 \\ = 1.7063 + 1.9125\\ = 3.61875 \end{equation}\]
Standard deviation Reduction of Sex \[\begin{equation} \sigma R(G3, S) = \sigma (G3) - \sigma (S) \\ = 4.36 - 3.62\\ = 0.74 \end{equation}\]
| Sex | Ave | STDev | CV | Count |
|---|---|---|---|---|
| Female | 9.6 | 2.73 | 28.41 | 5.00 |
| Male | 16 | 5.1 | 31.87 | 3.00 |
| SDR | 0.74 |
Count of One hour of study per day G3 of One hour = 8, 18, 11.
-Mean of G3, when Sudent study for one hr \[\begin{equation} \bar{G3|Study= 1} = \frac{8+18+11}{3} \\ =\frac{37}{3}\\ =12.33 \end{equation}\]
Standard deviation of G3 when Student study for one hr \[\begin{equation} \sqrt{\frac{(8-12.33)^2+(18-12.33)^2+(11-12.33)^2}{3}}\\ =\sqrt{\frac{(-4.33)^2+(5.67)^2+(-1.33)^2}{3}}\\ =\sqrt{\frac{18.7489+32.1489+1.7689}{3}}\\ =\sqrt{\frac{52.6667}{3}}\\ =\sqrt{17.56}\\ =4.19 \end{equation}\]
CV of G3 when Study time is one hr \[\begin{equation} CV(One) = \frac{\sigma(One)}{\bar{G3|Sudytime=One}} \times 100\% \\ CV(One) = \frac{4.19}{14} \times 100\% \\ CV(One) = 29.93\% \end{equation}\]
Count of Two hours of study per day G3 of Two hours = 11,6.
-Mean of G3, when Sudent study for two hrs \[\begin{equation} \bar{G3|Study= 1} = \frac{11+6}{2} \\ =\frac{17}{2}\\ = 8.5 \end{equation}\]
Standard deviation of G3 when Student study for two hrs \[\begin{equation} \sqrt{\frac{(11-8.5)^2+(6-8.5)^2}{2}}\\ =\sqrt{\frac{(2.5)^2+(-2.5)^2}{2}}\\ =\sqrt{\frac{6.25+6.25}{2}}\\ =\sqrt{\frac{12.5}{2}}\\ =\sqrt{6.25}\\ =2.5 \end{equation}\]
CV of G3 when Study time is two hr \[\begin{equation} CV(Two) = \frac{\sigma(Two)}{\bar{G3|Sudytime=Two}} \times 100\% \\ CV(Two) = \frac{2.5}{8.5} \times 100\% \\ CV(Two) = 29.41\% \end{equation}\]
Count of Three hours of study per day G3 of Three hours = 19, 9.
-Mean of G3, when Student study for three hrs \[\begin{equation} \bar{G3|Study= 3} = \frac{19+9}{2} \\ =\frac{28}{2}\\ =14 \end{equation}\]
Standard deviation of G3 when Student study for three hrs \[\begin{equation} \sqrt{\frac{(19-14)^2+(9-14)^2}{2}}\\ =\sqrt{\frac{(5)^2+(-5)^2}{2}}\\ =\sqrt{\frac{25+25}{2}}\\ =\sqrt{\frac{50}{2}}\\ =\sqrt{25}\\ =5 \end{equation}\]
CV of G3 when Study time is three hrs \[\begin{equation} CV(Three) = \frac{\sigma(Two)}{\bar{G3|Sudytime=Three}} \times 100\% \\ CV(Three) = \frac{5}{14} \times 100\% \\ CV(Three) = 35.71\% \end{equation}\]
Count of Four hours of study per day is 1 G3 of Four hours = 11.
-Mean of G3, when Student study for four hrs \[\begin{equation} \bar{G3|Study= 4} = \frac{11}{1}\\ =11 \end{equation}\]
Standard deviation of G3 when Student study for four hrs \[\begin{equation} \sqrt{\frac{(11-141^2}{2}}\\ =\sqrt{\frac{(0)^2}{1}}\\ =0 \end{equation}\]
CV of G3 when Study time is four hrs \[\begin{equation} CV(Four) = \frac{\sigma(Four)}{\bar{G3|Sudytime=Four}} \times 100\%\\ CV(Four) = \frac{0}{11} \times 100\%\\ CV(Four) = 0\% \end{equation}\]
Standard deviation of Study hr \[\begin{equation} \sigma (St) = \sum_{c \in St} P(c) \sigma (c)\\ =\frac{3}{8} \times 4.19 + \frac{2}{8}\times 2.5+ \frac{2}{8} \times 5 + \frac{1}{8} \times 0 \\ = 1.571 + 0.625 + 1.25\\ = 3.446 \end{equation}\]
Standard deviation Reduction of Study \[\begin{equation} \sigma R(G3, S) = \sigma (G3) - \sigma (St)\\ = 4.36 - 3.446\\ = 0.914 \end{equation}\]
| StudyTime | Ave | STDev | CV | Count |
|---|---|---|---|---|
| 1 | 12.33 | 4.19 | 29.93 | 3.000 |
| 2 | 8.5 | 2.5 | 29.41 | 2.000 |
| 3 | 14 | 5 | 35.71 | 2.000 |
| 4 | 0 | 0 | 0 | 1.000 |
| SDR | 0.914 |
Count of not Paid = 4 G3 of student who has not Paid = 8, 11, 14 ,6.
-Mean of G3, when student has not Paid \[\begin{equation} \bar{G3|Paid=No} = \frac{8+11+14+6}{4} \\ =\frac{39}{4}\\ =9.75 \end{equation}\]
Standard deviation of G3 when Student has not paid \[\begin{equation} \sqrt{\frac{(8-9.75)^2+(14-9.75)^2+(11-9.75)^2+(6-9.75)^2}{4}}\\ =\sqrt{\frac{(-1.75)^2+(4.25)^2+(1.25)^2+(-3.75)^2}{4}}\\ =\sqrt{\frac{3.0625+18.0625+1.5625+14.0625}{4}}\\ =\sqrt{\frac{36.75}{4}}\\ =\sqrt{9.1875}\\ =3.031 \end{equation}\]
CV of G3 when Student has not paid \[\begin{equation} CV(No) = \frac{\sigma(No)}{\bar{G3|Paid=No}} \times 100\%\\ CV(No) = \frac{3.031}{9.75} \times 100\%\\ CV(No) = 31.09\% \end{equation}\]
Count of student who has Paid = 4 G3 of student who has paid = 11, 18, 19,9.
-Mean of G3, when Student has paid \[\begin{equation} \bar{G3|Paid=Yes} = \frac{11+18+19+9}{4}\\ =\frac{57}{4}\\ =14.27 \end{equation}\]
Standard deviation of G3 when Student has paid \[\begin{equation} \sqrt{\frac{(11-14.27)^2+(18-14.27)^2+(19-14.27)^2+(9-14.27)^2}{4}}\\ =\sqrt{\frac{(-3.27)^2+(3.73)^2+(4.73)^2+(-5.27)^2}{4}}\\ =\sqrt{\frac{10.6929+13.9129+22.3729+27.7729}{4}}\\ =\sqrt{\frac{74.7516}{4}}\\ =\sqrt{18.6879}\\ =4.32 \end{equation}\]
CV of G3 when Student has paid \[\begin{equation} CV(Yes) = \frac{\sigma(Yes)}{\bar{G3|Paid=Yes}} \times 100\%\\ CV(Yes) = \frac{4.32}{14.27} \times 100\%\\ CV(Yes) = 30.27\% \end{equation}\]
Standard deviation of Paid \[\begin{equation} \sigma (P) = \sum_{c \in P} P(c) \sigma (c)\\ =\frac{4}{8} \times 3.031 + \frac{4}{8}\times 4.32 \\ = 1.5155 + 2.16\\ = 3.6755 \end{equation}\]
Standard deviation Reduction of Paid
\[\begin{equation} \sigma R(G3, S) = \sigma (G3) - \sigma (S)\\ = 4.36 - 2.16\\ = 2.2 \end{equation}\]
| Paid | Ave | STDev | CV | Count |
|---|---|---|---|---|
| No | 9.75 | 3.03 | 31.09 | 4.0 |
| Yes | 14.27 | 4.32 | 32.27 | 4.0 |
| SDR | 2.2 |
From the result of the analysis in the table above, the SDR(i.e \(\sigma\)R) for Guardian is 0.31, Sex is 0.74, Study time is 0.914, and Paid is 2.2. Out of all these features, Paid has the largest SDR making it the best out of all the features. However, study time is the next after Sex.
The two levels of Paid (No and Yes) have a coefficient of variation greater than 10% which is the minimum and the threshold for branch split. Hence, we split this two branches further.
| Guardian | Sex | StudyTIME | G3 |
|---|---|---|---|
| Father | F | 1 | 8 |
| Mother | M | 2 | 11 |
| Father | F | 4 | 4 |
| Other | F | 2 | 6 |
Average = 9.75, STDev = 3.03, CV = 31.09
Count of not Paid = 2 G3 of student for such students = 8,4.
-Mean of G3 such students \[\begin{equation} \bar{G3|G=Father} = \frac{8+4}{2} \\ =6 \end{equation}\]
Standard deviation \[\begin{equation} \sqrt{\frac{(8-6)^2+(4-6)^2}{2}}\\ =\sqrt{\frac{(2)^2+(-2)^2}{2}}\\ =\sqrt{\frac{4+4}{2}}\\ =\sqrt{\frac{8}{2}}\\ =\sqrt{4}\\ =2 \end{equation}\]
CV of G3 \[\begin{equation} CV(Father) = \frac{\sigma(Father)}{\bar{G3|G=Father}} \times 100\%\\ CV(Father) = \frac{2}{6} \times 100\%\\ CV(Father) = 33.3\% \end{equation}\]
Count of not Paid = 1 G3 of student for such students = 11.
-Mean of G3 such students \[\begin{equation} \bar{G3|G=Mother} = \frac{11}{1} \\ = 11 \end{equation}\]
Standard deviation \[\begin{equation} \sqrt{\frac{(11-11)^2}{1}}\\ =\sqrt{\frac{(0)^2}{1}}\\ =\sqrt{0}\\ =0 \end{equation}\]
CV of G3 \[\begin{equation} CV(Mother) = \frac{\sigma(Mother)}{\bar{G3|G=Mother}} \times 100\%\\ CV(Mother) = \frac{0}{11} \times 100\%\\ CV(Mother) = 0\% \end{equation}\]
Count of not Paid = 1 G3 of student for such students = 6.
-Mean of G3 such students \[\begin{equation} \bar{G3|G=Other} = \frac{6}{1} \\ =6 \end{equation}\]
Standard deviation \[\begin{equation} \sqrt{\frac{(6-6)^2}{1}}\\ =\sqrt{\frac{(0)^2}{1}}\\ =\sqrt{0}{1}\\ =\sqrt{0}\\ =0 \end{equation}\]
CV of G3 \[\begin{equation} CV(Other) = \frac{\sigma(Other)}{\bar{G3|G=Other}} \times 100\%\\ CV(Other) = \frac{0}{6} \times 100\%\\ CV(Other) = 0\% \end{equation}\]
Standard deviation of Paid to Guardians \[\begin{equation} \sigma (G) = \sum_{c \in G} P(c) \sigma (c)\\ =\frac{1}{4} \times 0 + \frac{2}{4}\times 2 + \frac{1}{4} \times 0 \\ = 0 + 1+ 0 \\ = 1 \end{equation}\]
Standard deviation Reduction of Paid
\[\begin{equation} \sigma R(G3, G) = \sigma (G3) - \sigma (G)\\ = 3.03 - 1\\ = 2.03 \end{equation}\]
| Guardian | STDev | CV | Ave | Count |
|---|---|---|---|---|
| Father | 2 | 33.3 | 6 | 2.00 |
| Mother | 0 | 0 | 11 | 1.00 |
| Other | 0 | 0 | 6 | 1.00 |
| SDR | 2.03 |
Count of Female student who has not paid = 3 G3 of student for such students = 8, 4, 6.
-Mean of G3 such students \[\begin{equation} \bar{G3|G=Mother} = \frac{8+4+6}{3} \\ = \frac{18}{3}\\ = 6 \end{equation}\]
Standard deviation \[\begin{equation} \sqrt{\frac{(8-6)^2+(4-6)^2+(6-6)^2}{3}}\\ =\sqrt{\frac{(2)^2+(-2)^2+(0)^2}{3}}\\ =\sqrt{\frac{4+4+0}{3}}\\ =\sqrt{\frac{8}{3}}\\ =\sqrt{2.67}\\ = 1.63 \end{equation}\]
CV of G3 \[\begin{equation} CV(Female) = \frac{\sigma(Female)}{\bar{G3|S=Female}} \times 100\%\\ CV(Female) = \frac{1.63}{6} \times 100\%\\ CV(Female) = 27.22\% \end{equation}\]
Count of Male student who has not paid = 1 G3 of student for such students = 11. \[\begin{equation} \bar{G3|S=Male} = \frac{11}{1} \\ = 11 \end{equation}\]
Standard deviation \[\begin{equation} \sqrt{\frac{(11-11)^2}{1}}\\ =\sqrt{\frac{(0)^2}{1}}\\ =\sqrt{0}\\ =0 \end{equation}\]
CV of G3 \[\begin{equation} CV(Male) = \frac{\sigma(Male)}{\bar{G3|S=Male}} \times 100\%\\ CV(Male) = \frac{0}{11} \times 100\%\\ CV(Male) = 0\% \end{equation}\]
Standard deviation of Paid to Sex \[\begin{equation} \sigma (PS) = \sum_{c \in S} P(c) \sigma (c)\\ =\frac{3}{4} \times 1.63 + \frac{1}{4}\times 0 \\ = 1.22 + 0 \\ = 1.22 \end{equation}\]
Standard deviation Reduction of Paid
\[\begin{equation} \sigma R(G3, PS) = \sigma (G3) - \sigma (PS)\\ = 3.03 - 1.22\\ = 1.81 \end{equation}\]
| Guardian | STDev | CV | Ave | Count |
|---|---|---|---|---|
| Female | 1.63 | 27.22 | 6 | 3.00 |
| Male | 0 | 0 | 11 | 1.00 |
| SDR | 1.81 |
Count of Student who study for 1 hr and has not paid = 1 G3 of student for such students = 8. \[\begin{equation} \bar{G3|St=One} = \frac{8}{1} \\ = 8 \end{equation}\]
Standard deviation \[\begin{equation} \sqrt{\frac{(8-8)^2}{1}}\\ =\sqrt{\frac{(0)^2}{1}}\\ =\sqrt{0}\\ =0 \end{equation}\]
CV of G3 \[\begin{equation} CV(One) = \frac{\sigma(One)}{\bar{G3|St=One}} \times 100\%\\ CV(One) = \frac{0}{8} \times 100\%\\ CV(One) = 0\% \end{equation}\]
Count of Student who study for 1 hr and has not paid = 2 G3 of student for such students = 11, 6. \[\begin{equation} \bar{G3|St=One} = \frac{11+6}{2} \\ = 8.5 \end{equation}\]
Standard deviation \[\begin{equation} \sqrt{\frac{(11-8.5)^2+(6-8.5)^2}{2}}\\ =\sqrt{\frac{(2.5)^2+(-2.5)^2}{2}}\\ =\sqrt{\frac{6.25+6.25}{2}}\\ =\sqrt{\frac{12.5}{2}}\\ =\sqrt{6.25}\\ = 2.5 \end{equation}\]
CV of G3 \[\begin{equation} CV(Two) = \frac{\sigma(Two)}{\bar{G3|St=Two}} \times 100\%\\ CV(Two) = \frac{2.5}{8.5} \times 100\%\\ CV(Two) = 29.41\% \end{equation}\]
Count of Student who study for 1 hr and has not paid = 1 G3 of student for such students = 4. \[\begin{equation} \bar{G3|St=Four} = \frac{4}{1} \\ = 4 \end{equation}\]
Standard deviation \[\begin{equation} \sqrt{\frac{(4-4)^2}{1}}\\ =\sqrt{\frac{(0)^2}{1}}\\ =\sqrt{0}\\ =0 \end{equation}\]
CV of G3 \[\begin{equation} CV(Four) = \frac{\sigma(Four)}{\bar{G3|St=Four}} \times 100\%\\ CV(Four) = \frac{0}{4} \times 100\%\\ CV(Four) = 0\% \end{equation}\]
Standard deviation of Paid to Study Time \[\begin{equation} \sigma (PSt) = \sum_{c \in St} P(c) \sigma (c)\\ =\frac{1}{4} \times 0 + \frac{2}{4}\times 2.5 +\frac{1}{4} \times 0 \\ = 0 + 1.25 + 0 \\ = 1.25 \end{equation}\]
Standard deviation Reduction of Paid
\[\begin{equation} \sigma R(G3, PS) = \sigma (G3) - \sigma (PS)\\ = 3.03 - 1.25\\ = 1.78 \end{equation}\]
| StudyTIME | STDev | CV | Ave | count |
|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 1.00 |
| 2 | 2.5 | 29.41 | 8.5 | 2.00 |
| 4 | 0 | 0 | 0 | 1.00 |
| SDR | 1.78 |
| Guardian | Sex | StudyTIME | G3 |
|---|---|---|---|
| Other | M | 1 | 18 |
| Mother | M | 3 | 19 |
| Father | F | 1 | 11 |
| Mother | F | 3 | 9 |
Average = 14.27, STDev = 4.32, CV = 32.27
Count of Paid = 2 G3 of student for such students = 19, 9. -Mean of G3 such students \[\begin{equation} \bar{G3|G=Mother} = \frac{19 + 9}{2} \\ =\frac{28}{2} = 14 \end{equation}\]
Standard deviation \[\begin{equation} \sqrt{(19-14)^2+(9-14)^2}{2}\\ \sqrt{(5)^2+(-5)^2}{2}\\ \sqrt{25+25}{2}\\ \sqrt{50}{2}\\ \sqrt{25}\\ =5 \end{equation}\]
CV of G3 \[\begin{equation} CV(Mother) = \frac{\sigma(Mother)}{\bar{G3|G=Mother}} \times 100\%\\ CV(Mother) = \frac{5}{14} \times 100\%\\ CV(Mother) = 35.71\% \end{equation}\]
Count of Student such student = 1 G3 of student for such students = 18. \[\begin{equation} \bar{G3|St=Four} = \frac{18}{1} \\ = 18 \end{equation}\]
Standard deviation \[\begin{equation} \sqrt{\frac{(18-18)^2}{1}}\\ =\sqrt{\frac{(0)^2}{1}}\\ =\sqrt{0}\\ =0 \end{equation}\]
CV of G3 \[\begin{equation} CV(Other) = \frac{\sigma(Other)}{\bar{G3|G=Other}} \times 100\%\\ CV(Other) = \frac{0}{18} \times 100\%\\ CV(Other) = 0\% \end{equation}\]
Count of such student = 1 G3 for such students = 11. \[\begin{equation} \bar{G3|St=Four} = \frac{11}{1} \\ = 11 \end{equation}\]
Standard deviation \[\begin{equation} \sqrt{\frac{(11-11)^2}{1}}\\ =\sqrt{\frac{(0)^2}{1}}\\ =\sqrt{0}\\ =0 \end{equation}\]
CV of G3 \[\begin{equation} CV(Father) = \frac{\sigma(Father)}{\bar{G3|G=Father}} \times 100\%\\ CV(Father) = \frac{0}{11} \times 100\%\\ CV(Father) = 0\% \end{equation}\]
Standard deviation of Paid to guardian \[\begin{equation} \sigma (PG) = \sum_{c \in G} P(c) \sigma (c)\\ =\frac{1}{4} \times 0 + \frac{2}{4}\times 5 +\frac{1}{4} \times 0 \\ = 0 + 2.5 + 0 \\ = 2.5 \end{equation}\]
Standard deviation Reduction of Paid
\[\begin{equation} \sigma R(G3, PS) = \sigma (G3) - \sigma (PS)\\ = 3.03 - 2.5\\ = 0.53 \end{equation}\]
| Guardian | STDev | CV | Ave | count |
|---|---|---|---|---|
| Mother | 5 | 35.71 | 14 | 2.00 |
| Other | 0 | 0 | 18 | 1.00 |
| Father | 0 | 0 | 11 | 1.00 |
| SDR | 0.53 |
Count of Paid = 2 G3 of student for such students = 18, 19. -Mean of G3 such students \[\begin{equation} \bar{(G3|G=Male)} = \frac{18 + 19}{2} \\ =\frac{37}{2} = 18.5 \end{equation}\]
Standard deviation \[\begin{equation} \sqrt{(18-18.5)^2+(19-18.5)^2}{2}\\ \sqrt{(-0.5)^2+(0.5)^2}{2}\\ \sqrt{0.25+0.25}{2}\\ \sqrt{0.5}{2}\\ \sqrt{0.25}\\ =0.5 \end{equation}\]
CV of G3 \[\begin{equation} CV(Male) = \frac{\sigma(Male)}{\bar{G3|S=Male}} \times 100\%\\ CV(Male) = \frac{0.5}{18.5} \times 100\%\\ CV(Male) = 2.70\% \end{equation}\]
Count of Paid = 2 G3 of student for such students = 11, 9. -Mean of G3 such students \[\begin{equation} \bar{(G3|G=Female)} = \frac{11 + 9}{2} \\ =\frac{20}{2} = 10 \end{equation}\]
Standard deviation \[\begin{equation} \sqrt{(11-10)^2+(9-10)^2}{2}\\ \sqrt{(1)^2+(-1)^2}{2}\\ \sqrt{1+1}{2}\\ \sqrt{2}{2}\\ \sqrt{1}\\ =1 \end{equation}\]
CV of G3 \[\begin{equation} CV(Female) = \frac{\sigma(Female)}{\bar{G3|G=Female}} \times 100\%\\ CV(Female) = \frac{1}{10} \times 100\%\\ CV(Female) = 10\% \end{equation}\]
Standard deviation of Paid to Sex \[\begin{equation} \sigma (PSe) = \sum_{c \in S} P(c) \sigma (c)\\ =\frac{2}{4} \times 0.5 + \frac{2}{4}\times 1 \\ = 0.25 + 0.5 \\ = 0.75 \end{equation}\]
Standard deviation Reduction of Paid
\[\begin{equation} \sigma R(G3, PS) = \sigma (G3) - \sigma (PS)\\ = 4.32 - 0.75\\ = 3.57 \end{equation}\]
| StudyTIME | STDev | CV | Ave | count |
|---|---|---|---|---|
| Male | 0.5 | 2.7 | 18.5 | 2.00 |
| Female | 1 | 10 | 10 | 2.00 |
| SDR | 3.57 |
Count of Paid = 2 G3 of student for such students = 18, 11. - Mean of G3 such students \[\begin{equation} \bar{(G3|St= one)} = \frac{18 + 11}{2} \\ =\frac{29}{2} = 14.5 \end{equation}\]
Standard deviation \[\begin{equation} \sqrt{(18-14.5)^2+(11-14.5)^2}{2}\\ \sqrt{(3.5)^2+(-3.5)^2}{2}\\ \sqrt{12.25+12.25}{2}\\ \sqrt{24.5}{2}\\ \sqrt{12.25}\\ =3.5 \end{equation}\]
CV of G3 \[\begin{equation} CV(One) = \frac{\sigma(One)}{\bar{G3|St=One}} \times 100\%\\ CV(One) = \frac{3.5}{14.5} \times 100\%\\ CV(One) = 24.14\% \end{equation}\]
Count of Paid = 2 G3 of student for such students = 19, 9. -Mean of G3 such students \[\begin{equation} \bar{(G3|St=three)} = \frac{19 + 9}{2} \\ =\frac{28}{2} = 14 \end{equation}\]
Standard deviation \[\begin{equation} \sqrt{(19-14)^2+(9-14)^2}{2}\\ \sqrt{(5)^2+(-5)^2}{2}\\ \sqrt{25+25}{2}\\ \sqrt{50}{2}\\ \sqrt{25}\\ =5 \end{equation}\]
CV of G3 \[\begin{equation} CV(Three) = \frac{\sigma(Three)}{\bar{G3|St=Three}} \times 100\%\\ CV(Three) = \frac{5}{14} \times 100\%\\ CV(Three) = 35.71\% \end{equation}\]
Standard deviation of Paid to Study Time \[\begin{equation} \sigma (PSt) = \sum_{c \in St} P(c) \sigma (c)\\ =\frac{1}{4} \times 0 + \frac{2}{4}\times 2.5 +\frac{1}{4} \times 0 \\ = 0 + 1.25 + 0 \\ = 1.25 \end{equation}\]
Standard deviation Reduction of Paid
\[\begin{equation} \sigma R(G3, PS) = \sigma (G3) - \sigma (PS)\\ = 3.03 - 1.25\\ = 1.78 \end{equation}\]
| StudyTIME | STDev | CV | Ave | count |
|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 1.00 |
| 2 | 2.5 | 29.41 | 8.5 | 2.00 |
| 4 | 0 | 0 | 0 | 1.00 |
| SDR | 1.78 |
library(tidyverse)
library(readxl)
data <- read_excel("C:/Users/DELL/Desktop/2024_Projects/Iterative Dichotomizer 3/ID3_DATASET.xlsx")
head(data, 15)
## # A tibble: 15 × 5
## sex guardian studytime paid G3
## <chr> <chr> <dbl> <chr> <dbl>
## 1 F mother 2 no 11
## 2 F father 2 no 11
## 3 F mother 2 no 12
## 4 F mother 3 no 14
## 5 F father 2 no 13
## 6 M mother 2 no 13
## 7 M mother 2 no 13
## 8 F mother 2 no 13
## 9 M mother 2 no 17
## 10 M mother 2 no 13
## 11 F mother 2 no 14
## 12 F father 3 no 13
## 13 M father 1 no 12
## 14 M mother 2 no 13
## 15 M other 3 no 15
library(caTools)
sap <- sample.split(data$sex, SplitRatio = 0.7)
Train <- subset(data, sap==T)
Test <- subset(data, sap==F)
#install.packages("C50")
library(C50)
## Warning: package 'C50' was built under R version 4.4.1
Model_IDE3 <- C5.0(as.factor(G3)~., data = Train)
pr <- predict(Model_IDE3, newdata = Test)
pr
## [1] 12 11 10 11 11 11 11 10 10 13 12 11 11 11 11 11 11 11 12 12 11 11 10 11 12
## [26] 11 10 11 13 11 10 11 13 10 11 11 11 11 11 12 11 11 10 11 10 11 11 11 13 12
## [51] 11 11 11 11 11 10 11 13 10 11 10 11 11 11 11 11 13 11 11 11 10 11 13 13 11
## [76] 11 11 11 11 11 11 11 11 13 10 11 12 11 11 11 11 13 10 13 10 11 11 10 11 13
## [101] 9 10 11 12 13 10 13 11 10 12 11 11 11 11 10 13 11 11 13 11 12 13 13 11 11
## [126] 11 10 10 11 11 11 13 12 11 11 11 11 11 13 11 11 11 13 11 10 12 11 11 12 11
## [151] 12 13 11 12 11 11 11 11 12 11 11 11 10 12 12 11 11 11 13 11 11 12 11 11 10
## [176] 11 11 11 11 11 11 9 13 12 12 11 13 13 13 9 11 11 11 13 11
## Levels: 0 1 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
#confusionMatrix(as.factor(pr), as.factor(Test$G3))
as.factor(Test$G3)
## [1] 13 13 13 12 17 12 12 15 12 12 11 13 13 16 13 15 14 16 10 12 16 11 15 11 10
## [26] 14 10 13 13 10 16 11 14 11 10 18 14 11 14 13 9 15 12 8 11 12 11 9 13 12
## [51] 10 8 11 13 17 10 11 11 14 14 10 13 12 10 14 15 8 15 13 13 11 11 13 10 8
## [76] 11 13 14 13 10 8 11 10 15 15 15 11 7 12 13 10 17 11 13 16 10 15 15 10 15
## [101] 13 17 17 15 8 18 11 15 17 14 13 17 13 10 15 14 10 10 14 15 14 14 17 16 15
## [126] 10 18 17 11 17 13 11 10 7 8 12 12 16 9 8 9 11 8 11 13 11 11 13 8 10
## [151] 11 9 10 9 10 10 8 15 8 8 8 9 8 14 11 8 12 13 13 10 10 12 10 9 8
## [176] 10 11 0 13 10 10 0 11 10 12 9 12 17 12 9 14 16 0 10 10
## Levels: 0 7 8 9 10 11 12 13 14 15 16 17 18
data_re<-na.omit(data)
length(data_re$G3)
## [1] 649
length(data$G3)
## [1] 649
plot(Model_IDE3)
## Warning in partysplit(varid = as.integer(i), breaks = as.numeric(j[1]), : NAs
## introduced by coercion
## Warning in partysplit(varid = as.integer(i), breaks = as.numeric(j[1]), : NAs
## introduced by coercion
## Warning in partysplit(varid = as.integer(i), breaks = as.numeric(j[1]), : NAs
## introduced by coercion
## Warning in partysplit(varid = as.integer(i), breaks = as.numeric(j[1]), : NAs
## introduced by coercion
## Warning in partysplit(varid = as.integer(i), breaks = as.numeric(j[1]), : NAs
## introduced by coercion
## Warning in partysplit(varid = as.integer(i), breaks = as.numeric(j[1]), : NAs
## introduced by coercion
## Warning in partysplit(varid = as.integer(i), breaks = as.numeric(j[1]), : NAs
## introduced by coercion
## Warning in partysplit(varid = as.integer(i), breaks = as.numeric(j[1]), : NAs
## introduced by coercion
## Warning in partysplit(varid = as.integer(i), breaks = as.numeric(j[1]), : NAs
## introduced by coercion
## Warning in partysplit(varid = as.integer(i), breaks = as.numeric(j[1]), : NAs
## introduced by coercion
## Warning in partysplit(varid = as.integer(i), breaks = as.numeric(j[1]), : NAs
## introduced by coercion
## Warning in .bincode(as.numeric(x), breaks = unique(c(-Inf, breaks_split(split),
## : NAs introduced by coercion
## Warning in .bincode(as.numeric(x), breaks = unique(c(-Inf, breaks_split(split),
## : NAs introduced by coercion
## Warning in .bincode(as.numeric(x), breaks = unique(c(-Inf, breaks_split(split),
## : NAs introduced by coercion
## Warning in .bincode(as.numeric(x), breaks = unique(c(-Inf, breaks_split(split),
## : NAs introduced by coercion
## Warning in .bincode(as.numeric(x), breaks = unique(c(-Inf, breaks_split(split),
## : NAs introduced by coercion
## Warning in .bincode(as.numeric(x), breaks = unique(c(-Inf, breaks_split(split),
## : NAs introduced by coercion
## Warning in .bincode(as.numeric(x), breaks = unique(c(-Inf, breaks_split(split),
## : NAs introduced by coercion
## Warning in .bincode(as.numeric(x), breaks = unique(c(-Inf, breaks_split(split),
## : NAs introduced by coercion
## Warning in .bincode(as.numeric(x), breaks = unique(c(-Inf, breaks_split(split),
## : NAs introduced by coercion
## Warning in .bincode(as.numeric(x), breaks = unique(c(-Inf, breaks_split(split),
## : NAs introduced by coercion