1 Introduction / Business Understanding

Alcohol consumption causes numerous problems for individuals, families and society, and drinking often begins at a young age. This study examines the drinking habits of secondary-school students enrolled in Mathematics and Portuguese language courses. The data describe student achievement in secondary education at two Portuguese schools.

The attributes include student grades, demographic, social and school-related features, and the data were collected using school reports and questionnaires. Two datasets are provided, covering performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). Together they contain rich social, gender and study information about the students, which we can use for exploratory data analysis and to try to predict student alcohol consumption (specifically, problematic consumption). The aim of this report is to determine whether there is any correlation between these students' environmental and personal factors and their likelihood of consuming alcohol.

2 Data Preparation

2.2 Read Data

The dataset was obtained from Kaggle and contains social, gender and study information about students in secondary education at two Portuguese schools.
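The import code is not echoed in the report; a minimal sketch, assuming the two CSV files named in this section are loaded with read.csv() (paths and arguments are illustrative):

# Read the two Kaggle files and count the students in each course -- a sketch;
# the exact paths/arguments used in the report are not shown
math <- read.csv("student-mat.csv", stringsAsFactors = TRUE)
por  <- read.csv("student-por.csv", stringsAsFactors = TRUE)

nrow(math)
nrow(por)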

#> [1] 395
#> [1] 649

We can see how many students are in each course: 395 in the Math course and 649 in the Portuguese language course. We then combine the two datasets so that the wrangling steps can be applied to both math and por at once, as sketched below.
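A minimal sketch of the combining step, assuming dplyr::bind_rows() with a subject label added to each dataset (consistent with the subject column in the output below):

library(dplyr)

# Tag each dataset with its subject, then stack the two into one data frame
students <- bind_rows(
  math %>% mutate(subject = "math"),
  por  %>% mutate(subject = "por")
)

glimpse(students)   # 1,044 observations of 34 variables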

#> Observations: 1,044
#> Variables: 34
#> $ school     <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, ...
#> $ sex        <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F, M,...
#> $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, ...
#> $ address    <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,...
#> $ famsize    <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, GT3, G...
#> $ Pstatus    <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, T, T,...
#> $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3,...
#> $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2,...
#> $ Mjob       <fct> at_home, at_home, at_home, health, other, services, othe...
#> $ Fjob       <fct> teacher, other, other, services, other, other, other, te...
#> $ reason     <fct> course, course, other, home, home, reputation, home, hom...
#> $ guardian   <fct> mother, father, mother, mother, father, mother, mother, ...
#> $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1,...
#> $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1,...
#> $ failures   <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,...
#> $ schoolsup  <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, no, n...
#> $ famsup     <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes, yes,...
#> $ paid       <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, no, y...
#> $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes, yes,...
#> $ nursery    <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes, ye...
#> $ higher     <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, y...
#> $ internet   <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes, yes,...
#> $ romantic   <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no, no,...
#> $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5,...
#> $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5,...
#> $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5,...
#> $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,...
#> $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4,...
#> $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5,...
#> $ absences   <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6, 4, ...
#> $ G1         <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, 14, 1...
#> $ G2         <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, 16, 1...
#> $ G3         <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11, 16, ...
#> $ subject    <chr> "math", "math", "math", "math", "math", "math", "math", ...

Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:

  • school : student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
  • sex : student’s sex (binary: ‘F’ - female or ‘M’ - male)
  • age : student’s age (numeric: from 15 to 22)
  • address : student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
  • famsize : family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
  • Pstatus : parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
  • Medu : mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
  • Fedu : father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
  • Mjob : mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
  • Fjob : father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
  • reason : reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
  • guardian : student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
  • traveltime : home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
  • studytime : weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
  • failures : number of past class failures (numeric: n if 1<=n<3, else 4)
  • schoolsup : extra educational support (binary: yes or no)
  • famsup : family educational support (binary: yes or no)
  • paid : extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
  • activities : extra-curricular activities (binary: yes or no)
  • nursery : attended nursery school (binary: yes or no)
  • higher : wants to take higher education (binary: yes or no)
  • internet : Internet access at home (binary: yes or no)
  • romantic : with a romantic relationship (binary: yes or no)
  • famrel : quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
  • freetime : free time after school (numeric: from 1 - very low to 5 - very high)
  • goout : going out with friends (numeric: from 1 - very low to 5 - very high)
  • Dalc : workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
  • Walc : weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
  • health : current health status (numeric: from 1 - very bad to 5 - very good)
  • absences : number of school absences (numeric: from 0 to 93)

These grades relate to the course subject, Math or Portuguese:

  • G1 : first period grade (numeric: from 0 to 20)
  • G2 : second period grade (numeric: from 0 to 20)
  • G3 : final grade (numeric: from 0 to 20)

2.3 Data Wrangling and Cleansing

We can convert several variables to factors and calculate each student's average grade; a sketch of this step follows the column listing below.

#>  [1] "school"     "sex"        "age"        "address"    "famsize"   
#>  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
#> [11] "reason"     "guardian"   "traveltime" "studytime"  "failures"  
#> [16] "schoolsup"  "famsup"     "paid"       "activities" "nursery"   
#> [21] "higher"     "internet"   "romantic"   "famrel"     "freetime"  
#> [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
#> [31] "G1"         "G2"         "G3"         "subject"
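A sketch of this wrangling step in base R; the exact set of columns converted to factors may differ, but the dummy names that appear in the Random Forest variable importance later (e.g. Medu2, goout5, health5) suggest at least these ordinal columns were converted:

# Convert the ordinal integer columns to factors
ordinal_cols <- c("Medu", "Fedu", "traveltime", "studytime", "failures",
                  "famrel", "freetime", "goout", "health")
students[ordinal_cols] <- lapply(students[ordinal_cols], as.factor)

# Average grade over the three grading periods; G1-G3 are assumed to be dropped
# before modeling (consistent with the 29 predictors reported in section 3.3)
students$average_grade <- (students$G1 + students$G2 + students$G3) / 3
students$G1 <- students$G2 <- students$G3 <- NULL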

Before exploring further, we should check whether the data contains any NA values.
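A one-line check, assuming base R's anyNA():

anyNA(students)   # FALSE means there are no missing values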

#> [1] FALSE

2.4 Data Exploration

The first thing we explore is the variable we want to predict: the level of alcohol consumption on workdays (Dalc) and at weekends (Walc).

Higher levels of alcohol consumption are associated with relatively lower average grades.

Feature Engineering: Total Alcohol Consumption

#> 
#>          1          2          3          4          5 
#> 0.69636015 0.18773946 0.06609195 0.02490421 0.02490421
#> 
#>          1          2          3          4          5 
#> 0.38122605 0.22509579 0.19157088 0.13218391 0.06992337
#> 
#>   1   2   3   4   5 
#> 727 196  69  26  26
#> 
#>   1   2   3   4   5 
#> 398 235 200 138  73

The 5:2 weighting is preferable to a 1:1 average because it matches the actual split of the week into five workdays and two weekend days, so the combined Talc value better reflects each student's weekly consumption.
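A sketch of the feature engineering step, assuming the day-proportion formula described in the Conclusion (Dalc weighted 5/7, Walc weighted 2/7); the exact rounding and labeling used in the report may differ:

# Weighted total alcohol consumption: five workdays (Dalc) vs. two weekend days (Walc)
students$Talc <- round((5 * students$Dalc + 2 * students$Walc) / 7)
students$Talc <- factor(students$Talc, levels = 1:5,
                        labels = c("Very Low", "Low", "Medium", "High", "Very High"))

# Drop the source columns (an assumption, consistent with the 29 predictors
# used by the models in section 3.3)
students$Dalc <- students$Walc <- NULL

prop.table(table(students$Talc))
table(students$Talc)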

#> 
#>   Very Low        Low     Medium       High  Very High 
#> 0.37452107 0.36685824 0.16187739 0.06321839 0.03352490
#> 
#>  Very Low       Low    Medium      High Very High 
#>       391       383       169        66        35

3 Model

Split Math & Por Data

After the wrangling process is complete, we can split the combined data back into the math and por datasets.
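A minimal sketch of this split:

# Separate the combined data back into the two subjects
math <- students[students$subject == "math", ]
por  <- students[students$subject == "por", ]

prop.table(table(math$Talc))
prop.table(table(por$Talc))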

#> 
#>   Very Low        Low     Medium       High  Very High 
#> 0.37974684 0.35696203 0.17468354 0.05569620 0.03291139
#> 
#>   Very Low        Low     Medium       High  Very High 
#> 0.37134052 0.37288136 0.15408320 0.06779661 0.03389831

Near-zero variance columns

#> [1] 1

We can identify variables whose variance is close to zero with nearZeroVar(). Within a single-subject dataset the subject column is constant, so we exclude it during pre-processing later.
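A sketch using caret::nearZeroVar(); within a single-subject dataset the subject column is constant, so it gets flagged:

library(caret)

nzv <- nearZeroVar(math)     # column indices with (near-)zero variance
colnames(math)[nzv]

# Exclude the subject column from both subsets before modeling
math <- math[, setdiff(colnames(math), "subject")]
por  <- por[, setdiff(colnames(por), "subject")]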

Cross Validation & Pre-Processing

MATH
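The splitting and balancing code is not echoed; a sketch consistent with the proportions below, assuming an 80/20 stratified split with caret::createDataPartition() and upsampling of the training classes with caret::upSample() (the perfectly equal 0.2 proportions, and the 605 balanced training samples reported for the Random Forest later, both point to upsampling). The same procedure is assumed for the por data:

library(caret)

set.seed(100)   # the seed value is an assumption
idx <- createDataPartition(math$Talc, p = 0.8, list = FALSE)
math_train <- math[idx, ]
math_test  <- math[-idx, ]

prop.table(table(math_train$Talc))
prop.table(table(math_test$Talc))

# Balance the training classes by upsampling the minority levels
math_train_up <- upSample(x = math_train[, setdiff(colnames(math_train), "Talc")],
                          y = math_train$Talc, yname = "Talc")
prop.table(table(math_train_up$Talc))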

#> 
#>   Very Low        Low     Medium       High  Very High 
#> 0.38050314 0.36163522 0.16981132 0.05345912 0.03459119
#> 
#>   Very Low        Low     Medium       High  Very High 
#> 0.37662338 0.33766234 0.19480519 0.06493506 0.02597403
#> 
#>  Very Low       Low    Medium      High Very High 
#>       0.2       0.2       0.2       0.2       0.2

POR

#> 
#>   Very Low        Low     Medium       High  Very High 
#> 0.37428023 0.37044146 0.15547025 0.06525912 0.03454894
#> 
#>  Very Low       Low    Medium      High Very High 
#> 0.3593750 0.3828125 0.1484375 0.0781250 0.0312500
#> 
#>  Very Low       Low    Medium      High Very High 
#>       0.2       0.2       0.2       0.2       0.2

3.1 Naive Bayes

3.1.1 Math Subject

Build Model

Prediction

Model Evaluation

Confusion matrix
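The chunks under Build Model, Prediction and Model Evaluation above are not echoed; a combined sketch of the three steps, assuming e1071::naiveBayes() fitted on the upsampled math training set from the earlier sketches (object names are illustrative):

library(e1071)
library(caret)

# Build: fit Naive Bayes on the balanced math training data
nb_math <- naiveBayes(Talc ~ ., data = math_train_up)

# Prediction: hard class labels and per-class probabilities for the test set
nb_pred_math <- predict(nb_math, newdata = math_test, type = "class")
nb_prob_math <- predict(nb_math, newdata = math_test, type = "raw")

# Evaluation: confusion matrix against the true test labels
confusionMatrix(nb_pred_math, math_test$Talc)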

#> Confusion Matrix and Statistics
#> 
#>            Reference
#> Prediction  Very Low Low Medium High Very High
#>   Very Low        29   0      1    0         0
#>   Low              0  26      0    1         0
#>   Medium           0   0     14    0         0
#>   High             0   0      0    4         0
#>   Very High        0   0      0    0         2
#> 
#> Overall Statistics
#>                                                
#>                Accuracy : 0.974                
#>                  95% CI : (0.9093, 0.9968)     
#>     No Information Rate : 0.3766               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.9626               
#>                                                
#>  Mcnemar's Test P-Value : NA                   
#> 
#> Statistics by Class:
#> 
#>                      Class: Very Low Class: Low Class: Medium Class: High
#> Sensitivity                   1.0000     1.0000        0.9333     0.80000
#> Specificity                   0.9792     0.9804        1.0000     1.00000
#> Pos Pred Value                0.9667     0.9630        1.0000     1.00000
#> Neg Pred Value                1.0000     1.0000        0.9841     0.98630
#> Prevalence                    0.3766     0.3377        0.1948     0.06494
#> Detection Rate                0.3766     0.3377        0.1818     0.05195
#> Detection Prevalence          0.3896     0.3506        0.1818     0.05195
#> Balanced Accuracy             0.9896     0.9902        0.9667     0.90000
#>                      Class: Very High
#> Sensitivity                   1.00000
#> Specificity                   1.00000
#> Pos Pred Value                1.00000
#> Neg Pred Value                1.00000
#> Prevalence                    0.02597
#> Detection Rate                0.02597
#> Detection Prevalence          0.02597
#> Balanced Accuracy             1.00000

ROC (Receiver Operating Characteristic)

The AUC-ROC curve is a performance measure for classification problems at various threshold settings. The ROC is a probability curve, and the AUC represents the degree or measure of separability between the classes.
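A sketch of how these quantities can be produced with pROC, reusing the objects from the Naive Bayes sketch above (converting the factor predictions to their numeric codes, as the call shown under AUC below suggests, is an assumption):

library(pROC)

# Numeric-coded predictions (1 = Very Low ... 5 = Very High) for the multi-class AUC
y_pred <- as.numeric(nb_pred_math)
multiclass.roc(response = math_test$Talc, predictor = y_pred)

# Per-class predicted probabilities for the first test observations
head(nb_prob_math)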

#>         Very Low         Low       Medium           High          Very High
#> [1,] 0.997999884 0.001580914 0.0004139177 0.000003950184 0.0000013342613665
#> [2,] 0.998082362 0.001633682 0.0002764905 0.000007298232 0.0000001675807808
#> [3,] 0.003388426 0.996103024 0.0005007134 0.000007836138 0.0000000003096437
#> [4,] 0.845009905 0.005411335 0.0027735149 0.146786544997 0.0000187004431713
#> [5,] 0.997359856 0.001351401 0.0002025819 0.001084922958 0.0000012375322506
#> [6,] 0.106325221 0.789514446 0.0996893776 0.000189879775 0.0042810751495299

AUC (Area Under The Curve)

#> 
#> Call:
#> multiclass.roc.default(response = math_test$Talc, predictor = y_pred)
#> 
#> Data: y_pred with 5 levels of math_test$Talc: Very Low, Low, Medium, High, Very High.
#> Multi-class area under the curve: 0.9613

3.1.2 Portuguese Subject

Build Model

Prediction

Model Evaluation

Confusion matrix

#> Confusion Matrix and Statistics
#> 
#>            Reference
#> Prediction  Very Low Low Medium High Very High
#>   Very Low        43   0      0    0         1
#>   Low              0  48      0    0         1
#>   Medium           1   0     19    0         0
#>   High             1   0      0   10         0
#>   Very High        1   1      0    0         2
#> 
#> Overall Statistics
#>                                                
#>                Accuracy : 0.9531               
#>                  95% CI : (0.9008, 0.9826)     
#>     No Information Rate : 0.3828               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.9329               
#>                                                
#>  Mcnemar's Test P-Value : NA                   
#> 
#> Statistics by Class:
#> 
#>                      Class: Very Low Class: Low Class: Medium Class: High
#> Sensitivity                   0.9348     0.9796        1.0000     1.00000
#> Specificity                   0.9878     0.9873        0.9908     0.99153
#> Pos Pred Value                0.9773     0.9796        0.9500     0.90909
#> Neg Pred Value                0.9643     0.9873        1.0000     1.00000
#> Prevalence                    0.3594     0.3828        0.1484     0.07812
#> Detection Rate                0.3359     0.3750        0.1484     0.07812
#> Detection Prevalence          0.3438     0.3828        0.1562     0.08594
#> Balanced Accuracy             0.9613     0.9835        0.9954     0.99576
#>                      Class: Very High
#> Sensitivity                   0.50000
#> Specificity                   0.98387
#> Pos Pred Value                0.50000
#> Neg Pred Value                0.98387
#> Prevalence                    0.03125
#> Detection Rate                0.01562
#> Detection Prevalence          0.03125
#> Balanced Accuracy             0.74194

ROC (Receiver Operating Characteristic)

The AUC-ROC curve is a performance measure for classification problems at various threshold settings. The ROC is a probability curve, and the AUC represents the degree or measure of separability between the classes.

#>          Very Low          Low        Medium           High        Very High
#> [1,] 0.9986728184 0.0008012609 0.00042334075 0.000102294063 0.00000028591796
#> [2,] 0.9987594918 0.0004767296 0.00066986002 0.000093882385 0.00000003618707
#> [3,] 0.0080327007 0.0134080249 0.97480815410 0.001500363288 0.00225075706231
#> [4,] 0.9996412280 0.0002615960 0.00009362201 0.000003538023 0.00000001597722
#> [5,] 0.0002625151 0.0005375040 0.99826238689 0.000936614627 0.00000097939785
#> [6,] 0.1049034925 0.2980870184 0.25529525769 0.136329451718 0.20538477965125

AUC (Area Under The Curve)

#> 
#> Call:
#> multiclass.roc.default(response = por_test$Talc, predictor = y_pred)
#> 
#> Data: y_pred with 5 levels of por_test$Talc: Very Low, Low, Medium, High, Very High.
#> Multi-class area under the curve: 0.8271

3.2 Decision Tree

3.2.1 Math Subject

Build Model

CTREE default

RPART

CTREE

RPART

Prediction CTREE

RPART

Model Evaluation

Confusion matrix
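The tree-building and prediction chunks above are not echoed; a combined sketch of the two approaches being compared, assuming party::ctree() with its default mincriterion of 0.95 and rpart::rpart() fitted on the upsampled math training set (the confusion matrix below corresponds to one of these trees):

library(party)
library(rpart)
library(caret)

# CTREE: conditional inference tree with the default mincriterion
dt_ctree <- ctree(Talc ~ ., data = math_train_up)

# RPART: CART-style tree for comparison
dt_rpart <- rpart(Talc ~ ., data = math_train_up, method = "class")

# Predictions on the hold-out set and a confusion matrix
ctree_pred <- predict(dt_ctree, newdata = math_test)
rpart_pred <- predict(dt_rpart, newdata = math_test, type = "class")
confusionMatrix(ctree_pred, math_test$Talc)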

#> Confusion Matrix and Statistics
#> 
#>            Reference
#> Prediction  Very Low Low Medium High Very High
#>   Very Low         4   6      5    0         0
#>   Low              5   5      2    1         0
#>   Medium          11  11      6    1         1
#>   High             5   3      1    3         1
#>   Very High        4   1      1    0         0
#> 
#> Overall Statistics
#>                                           
#>                Accuracy : 0.2338          
#>                  95% CI : (0.1448, 0.3441)
#>     No Information Rate : 0.3766          
#>     P-Value [Acc > NIR] : 0.99739         
#>                                           
#>                   Kappa : 0.0186          
#>                                           
#>  Mcnemar's Test P-Value : 0.02429         
#> 
#> Statistics by Class:
#> 
#>                      Class: Very Low Class: Low Class: Medium Class: High
#> Sensitivity                  0.13793    0.19231       0.40000     0.60000
#> Specificity                  0.77083    0.84314       0.61290     0.86111
#> Pos Pred Value               0.26667    0.38462       0.20000     0.23077
#> Neg Pred Value               0.59677    0.67188       0.80851     0.96875
#> Prevalence                   0.37662    0.33766       0.19481     0.06494
#> Detection Rate               0.05195    0.06494       0.07792     0.03896
#> Detection Prevalence         0.19481    0.16883       0.38961     0.16883
#> Balanced Accuracy            0.45438    0.51772       0.50645     0.73056
#>                      Class: Very High
#> Sensitivity                   0.00000
#> Specificity                   0.92000
#> Pos Pred Value                0.00000
#> Neg Pred Value                0.97183
#> Prevalence                    0.02597
#> Detection Rate                0.00000
#> Detection Prevalence          0.07792
#> Balanced Accuracy             0.46000

ROC (Receiver Operating Characteristic)

#>     Very Low        Low    Medium      High Very High
#> 1 0.38888889 0.05555556 0.5555556 0.0000000         0
#> 2 0.38888889 0.05555556 0.5555556 0.0000000         0
#> 3 0.25000000 0.00000000 0.0000000 0.7500000         0
#> 4 0.05555556 0.00000000 0.1111111 0.8333333         0
#> 5 0.11111111 0.11111111 0.0000000 0.7777778         0
#> 6 0.30769231 0.69230769 0.0000000 0.0000000         0

AUC (Area Under The Curve)

#> 
#> Call:
#> multiclass.roc.default(response = math_test$Talc, predictor = y_pred)
#> 
#> Data: y_pred with 5 levels of math_test$Talc: Very Low, Low, Medium, High, Very High.
#> Multi-class area under the curve: 0.9613

3.2.2 Portuguese Subject

Build Model

CTREE default

RPART

CTREE

RPART

Prediction CTREE

RPART

Model Evaluation

Confusion matrix

#> Confusion Matrix and Statistics
#> 
#>            Reference
#> Prediction  Very Low Low Medium High Very High
#>   Very Low         9   9      2    2         2
#>   Low             18  23      5    1         1
#>   Medium           9  11      8    2         1
#>   High             7   5      3    3         0
#>   Very High        3   1      1    2         0
#> 
#> Overall Statistics
#>                                           
#>                Accuracy : 0.3359          
#>                  95% CI : (0.2549, 0.4248)
#>     No Information Rate : 0.3828          
#>     P-Value [Acc > NIR] : 0.88204         
#>                                           
#>                   Kappa : 0.1031          
#>                                           
#>  Mcnemar's Test P-Value : 0.06307         
#> 
#> Statistics by Class:
#> 
#>                      Class: Very Low Class: Low Class: Medium Class: High
#> Sensitivity                  0.19565     0.4694        0.4211     0.30000
#> Specificity                  0.81707     0.6835        0.7890     0.87288
#> Pos Pred Value               0.37500     0.4792        0.2581     0.16667
#> Neg Pred Value               0.64423     0.6750        0.8866     0.93636
#> Prevalence                   0.35938     0.3828        0.1484     0.07812
#> Detection Rate               0.07031     0.1797        0.0625     0.02344
#> Detection Prevalence         0.18750     0.3750        0.2422     0.14062
#> Balanced Accuracy            0.50636     0.5765        0.6050     0.58644
#>                      Class: Very High
#> Sensitivity                   0.00000
#> Specificity                   0.94355
#> Pos Pred Value                0.00000
#> Neg Pred Value                0.96694
#> Prevalence                    0.03125
#> Detection Rate                0.00000
#> Detection Prevalence          0.05469
#> Balanced Accuracy             0.47177

ROC (Receiver Operating Characteristic)

#>    Very Low       Low    Medium High Very High
#> 1 0.3571429 0.6428571 0.0000000    0         0
#> 2 0.8333333 0.1666667 0.0000000    0         0
#> 3 0.0000000 0.1818182 0.8181818    0         0
#> 4 0.7777778 0.0000000 0.2222222    0         0
#> 5 0.2857143 0.7142857 0.0000000    0         0
#> 6 0.4000000 0.6000000 0.0000000    0         0

AUC (Area Under The Curve)

#> 
#> Call:
#> multiclass.roc.default(response = por_test$Talc, predictor = y_pred)
#> 
#> Data: y_pred with 5 levels of por_test$Talc: Very Low, Low, Medium, High, Very High.
#> Multi-class area under the curve: 0.8271

3.2.3 Proof

In this section I want to check whether the Decision Tree model is vulnerable or sensitive to overfitting. This is tested by predicting with the model on both the held-out data_test and the data_train that was used to build it, and comparing the two confusion matrices.
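A sketch of this check, reusing the objects from the earlier sketches (por_test is assumed to come from the same split procedure applied to the por data):

library(party)
library(caret)

# Class distribution of the held-out test labels for both subjects
table(math_test$Talc)
table(por_test$Talc)

# Predict with the same tree on the data it was trained on and on unseen data,
# then compare the two confusion matrices
confusionMatrix(predict(dt_ctree, newdata = math_train_up), math_train_up$Talc)
confusionMatrix(predict(dt_ctree, newdata = math_test),     math_test$Talc)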

#> 
#>  Very Low       Low    Medium      High Very High 
#>        29        26        15         5         2
#> 
#>  Very Low       Low    Medium      High Very High 
#>        46        49        19        10         4

#> Confusion Matrix and Statistics
#> 
#>            Reference
#> Prediction  Very Low Low Medium High Very High
#>   Very Low        74  19     16    0         0
#>   Low             17  74     14    7         0
#>   Medium          18  13     78    0         0
#>   High             8   6      9  114         0
#>   Very High        4   9      4    0       121
#> 
#> Overall Statistics
#>                                                
#>                Accuracy : 0.762                
#>                  95% CI : (0.726, 0.7954)      
#>     No Information Rate : 0.2                  
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.7025               
#>                                                
#>  Mcnemar's Test P-Value : NA                   
#> 
#> Statistics by Class:
#> 
#>                      Class: Very Low Class: Low Class: Medium Class: High
#> Sensitivity                   0.6116     0.6116        0.6446      0.9421
#> Specificity                   0.9277     0.9215        0.9360      0.9525
#> Pos Pred Value                0.6789     0.6607        0.7156      0.8321
#> Neg Pred Value                0.9052     0.9047        0.9133      0.9850
#> Prevalence                    0.2000     0.2000        0.2000      0.2000
#> Detection Rate                0.1223     0.1223        0.1289      0.1884
#> Detection Prevalence          0.1802     0.1851        0.1802      0.2264
#> Balanced Accuracy             0.7696     0.7665        0.7903      0.9473
#>                      Class: Very High
#> Sensitivity                    1.0000
#> Specificity                    0.9649
#> Pos Pred Value                 0.8768
#> Neg Pred Value                 1.0000
#> Prevalence                     0.2000
#> Detection Rate                 0.2000
#> Detection Prevalence           0.2281
#> Balanced Accuracy              0.9824
#> Confusion Matrix and Statistics
#> 
#>            Reference
#> Prediction  Very Low Low Medium High Very High
#>   Very Low        10   8      8    0         1
#>   Low              6   5      2    1         0
#>   Medium           3   8      2    1         0
#>   High             6   4      2    3         1
#>   Very High        4   1      1    0         0
#> 
#> Overall Statistics
#>                                           
#>                Accuracy : 0.2597          
#>                  95% CI : (0.1664, 0.3723)
#>     No Information Rate : 0.3766          
#>     P-Value [Acc > NIR] : 0.98890         
#>                                           
#>                   Kappa : 0.0203          
#>                                           
#>  Mcnemar's Test P-Value : 0.03911         
#> 
#> Statistics by Class:
#> 
#>                      Class: Very Low Class: Low Class: Medium Class: High
#> Sensitivity                   0.3448    0.19231       0.13333     0.60000
#> Specificity                   0.6458    0.82353       0.80645     0.81944
#> Pos Pred Value                0.3704    0.35714       0.14286     0.18750
#> Neg Pred Value                0.6200    0.66667       0.79365     0.96721
#> Prevalence                    0.3766    0.33766       0.19481     0.06494
#> Detection Rate                0.1299    0.06494       0.02597     0.03896
#> Detection Prevalence          0.3506    0.18182       0.18182     0.20779
#> Balanced Accuracy             0.4953    0.50792       0.46989     0.70972
#>                      Class: Very High
#> Sensitivity                   0.00000
#> Specificity                   0.92000
#> Pos Pred Value                0.00000
#> Neg Pred Value                0.97183
#> Prevalence                    0.02597
#> Detection Rate                0.00000
#> Detection Prevalence          0.07792
#> Balanced Accuracy             0.46000

Based on the experiment above (roughly 76% accuracy on the training data versus 26% on the test data), we can conclude that our Decision Tree model is indeed vulnerable to overfitting.

3.3 Random Forest

Random forest is a classification method based on the ensemble approach. A random forest is built from many decision trees (by default 500), each with different characteristics: every tree is trained on a bootstrap sample of the observations and considers a random subset of the predictors at each split.

Ensemble learning is a machine learning paradigm where multiple models (often called “weak learners”) are trained to solve the same problem and combined to get better results. The main hypothesis is that when weak models are correctly combined we can obtain more accurate and/or robust models.

3.3.1 K - fold Cross Validation

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.

The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation.

The k value must be chosen carefully for your data sample. The choice of k is usually 5 or 10, but there is no formal rule. As k gets larger, the difference in size between the training set and the resampling subsets gets smaller. As this difference decreases, the bias of the technique becomes smaller.

To summarize, there is a bias-variance trade-off associated with the choice of k in k-fold cross-validation. Typically, given these considerations, one performs k-fold cross-validation using k = 5 or k = 10, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.

3.3.2 Math Subject

Build Model
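The training call is not echoed; a sketch consistent with the resampling summary below, assuming caret::train() with method = "rf" and 5-fold cross-validation repeated 5 times on the upsampled training set:

library(caret)

set.seed(100)   # the seed value is an assumption
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)

rf_math <- train(Talc ~ ., data = math_train_up,
                 method = "rf", trControl = ctrl)

rf_math                # accuracy and Kappa across the tuned mtry values
# rf_math$finalModel   # OOB error estimate (see Final Model Inspection below)
# varImp(rf_math)      # variable importance (see Variable Importance below)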

#> Random Forest 
#> 
#> 605 samples
#>  29 predictor
#>   5 classes: 'Very Low', 'Low', 'Medium', 'High', 'Very High' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold, repeated 5 times) 
#> Summary of sample sizes: 485, 485, 483, 485, 482, 485, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy   Kappa    
#>    2    0.8515453  0.8144140
#>   32    0.8502179  0.8127533
#>   62    0.8273983  0.7842296
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 2.

Final Model Inspection

The OOB (Out-Of-Bag) estimate indicates the model's error on the observations left out of each tree's bootstrap sample, i.e. an estimate of accuracy on unseen data.

Variable Importance

#> rf variable importance
#> 
#>   only 20 most important variables shown (out of 62)
#> 
#>               Overall
#> average_grade  100.00
#> absences        99.98
#> age             73.71
#> sexM            66.40
#> goout5          42.74
#> paidyes         34.67
#> freetime4       32.88
#> famsupyes       32.46
#> addressU        31.71
#> famsizeLE3      29.71
#> romanticyes     29.49
#> activitiesyes   29.23
#> studytime2      28.29
#> Fjobservices    28.17
#> famrel4         27.11
#> Medu2           26.76
#> Fjobother       26.36
#> Fedu2           26.19
#> nurseryyes      25.55
#> health5         25.01

Prediction

Model Evaluation

Confusion matrix

#> Confusion Matrix and Statistics
#> 
#>            Reference
#> Prediction  Very Low Low Medium High Very High
#>   Very Low        20  20      7    0         0
#>   Low              5   3      3    2         0
#>   Medium           0   3      4    3         2
#>   High             3   0      0    0         0
#>   Very High        1   0      1    0         0
#> 
#> Overall Statistics
#>                                           
#>                Accuracy : 0.3506          
#>                  95% CI : (0.2453, 0.4678)
#>     No Information Rate : 0.3766          
#>     P-Value [Acc > NIR] : 0.7194          
#>                                           
#>                   Kappa : 0.0444          
#>                                           
#>  Mcnemar's Test P-Value : NA              
#> 
#> Statistics by Class:
#> 
#>                      Class: Very Low Class: Low Class: Medium Class: High
#> Sensitivity                   0.6897    0.11538       0.26667     0.00000
#> Specificity                   0.4375    0.80392       0.87097     0.95833
#> Pos Pred Value                0.4255    0.23077       0.33333     0.00000
#> Neg Pred Value                0.7000    0.64062       0.83077     0.93243
#> Prevalence                    0.3766    0.33766       0.19481     0.06494
#> Detection Rate                0.2597    0.03896       0.05195     0.00000
#> Detection Prevalence          0.6104    0.16883       0.15584     0.03896
#> Balanced Accuracy             0.5636    0.45965       0.56882     0.47917
#>                      Class: Very High
#> Sensitivity                   0.00000
#> Specificity                   0.97333
#> Pos Pred Value                0.00000
#> Neg Pred Value                0.97333
#> Prevalence                    0.02597
#> Detection Rate                0.00000
#> Detection Prevalence          0.02597
#> Balanced Accuracy             0.48667

ROC (Receiver Operating Characteristic)

AUC (Area Under The Curve)

#> 
#> Call:
#> multiclass.roc.default(response = math_test$Talc, predictor = y_pred)
#> 
#> Data: y_pred with 5 levels of math_test$Talc: Very Low, Low, Medium, High, Very High.
#> Multi-class area under the curve: 0.9613

3.3.3 Portuguese Subject

Build Model

#> Random Forest 
#> 
#> 975 samples
#>  29 predictor
#>   5 classes: 'Very Low', 'Low', 'Medium', 'High', 'Very High' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold, repeated 5 times) 
#> Summary of sample sizes: 780, 780, 780, 780, 780, 780, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy   Kappa    
#>    2    0.8354872  0.7943590
#>   32    0.8334359  0.7917949
#>   62    0.8227692  0.7784615
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 2.

Final Model Inspection

The OOB (Out-Of-Bag) estimate indicates the model's error on the observations left out of each tree's bootstrap sample, i.e. an estimate of accuracy on unseen data.

Variable Importance

#> rf variable importance
#> 
#>   only 20 most important variables shown (out of 62)
#> 
#>                Overall
#> average_grade   100.00
#> absences         91.49
#> sexM             73.77
#> age              65.25
#> famsupyes        38.19
#> goout5           36.82
#> health5          31.54
#> activitiesyes    30.41
#> studytime2       29.72
#> romanticyes      29.61
#> famrel5          28.97
#> famsizeLE3       28.16
#> health3          27.69
#> Fjobservices     27.66
#> schoolMS         27.44
#> freetime3        27.18
#> addressU         26.93
#> guardianmother   26.23
#> famrel4          25.95
#> Mjobother        25.33

Prediction

Model Evaluation

Confusion matrix

#> Confusion Matrix and Statistics
#> 
#>            Reference
#> Prediction  Very Low Low Medium High Very High
#>   Very Low        27  37     11    4         2
#>   Low             11   5      0    1         1
#>   Medium           6   6      3    2         1
#>   High             0   1      4    0         0
#>   Very High        2   0      1    3         0
#> 
#> Overall Statistics
#>                                           
#>                Accuracy : 0.2734          
#>                  95% CI : (0.1984, 0.3592)
#>     No Information Rate : 0.3828          
#>     P-Value [Acc > NIR] : 0.9964613       
#>                                           
#>                   Kappa : -0.0479         
#>                                           
#>  Mcnemar's Test P-Value : 0.0007883       
#> 
#> Statistics by Class:
#> 
#>                      Class: Very Low Class: Low Class: Medium Class: High
#> Sensitivity                   0.5870    0.10204       0.15789     0.00000
#> Specificity                   0.3415    0.83544       0.86239     0.95763
#> Pos Pred Value                0.3333    0.27778       0.16667     0.00000
#> Neg Pred Value                0.5957    0.60000       0.85455     0.91870
#> Prevalence                    0.3594    0.38281       0.14844     0.07812
#> Detection Rate                0.2109    0.03906       0.02344     0.00000
#> Detection Prevalence          0.6328    0.14062       0.14062     0.03906
#> Balanced Accuracy             0.4642    0.46874       0.51014     0.47881
#>                      Class: Very High
#> Sensitivity                   0.00000
#> Specificity                   0.95161
#> Pos Pred Value                0.00000
#> Neg Pred Value                0.96721
#> Prevalence                    0.03125
#> Detection Rate                0.00000
#> Detection Prevalence          0.04688
#> Balanced Accuracy             0.47581

ROC (Receiver Operating Characteristic)

AUC (Area Under The Curve)

#> 
#> Call:
#> multiclass.roc.default(response = por_test$Talc, predictor = y_pred)
#> 
#> Data: y_pred with 5 levels of por_test$Talc: Very Low, Low, Medium, High, Very High.
#> Multi-class area under the curve: 0.8271

4 Conclusion

  • In this case study I built separate models for math and por, so we can inspect each model's results individually or average the performance of the two models.

  • Before proceeding to the cross-validation process, I tried two options for the formula that creates the Talc column: the first uses the simple average of Dalc and Walc, while the second weights Dalc by 5/7 and Walc by 2/7, based on the number of workdays and weekend days. After experimenting with both formulas, I prefer the second, day-proportion option because it better reflects weekly consumption and gives a more informative Talc value.

  • In addition to the points already mentioned, I also ran a number of experiments on the Decision Tree model, combining several mincriterion values with the two functions used to build the trees, ctree() and rpart().

The following shows the mincriterion combinations I used for the Decision Tree math (multiclass target) model, so we can analyze the model's ability (sensitivity) to predict each target class.

Based on these experiments, we can conclude that none of the four models predicts the Very High target level well. Because the exploratory analysis showed that the level of alcohol consumption greatly affects student performance, I want to prioritize sensitivity for the High and Very High classes, so we use the model with the default mincriterion.

The following shows the combinations of mincriterion values and model functions I used for the Decision Tree math (multiclass target) model, so we can compare the evaluation metrics.

Based on these experiments, we can conclude that ctree(), with the mincriterion mentioned above, builds the better Decision Tree model.

The primary differences between conditional inference trees (ctree from the party package in R) and more traditional decision tree algorithms (such as rpart in R) are:

Both rpart and ctree recursively perform univariate splits of the dependent variable based on the values of a set of covariates. ctree comes with a number of possible transformations for both the DV and the covariates (see the help for Transformations in the party package).

So generally the main difference seems to be that ctree uses a covariate selection scheme based on statistical theory (i.e. selection by permutation-based significance tests), and thereby avoids a potential bias in rpart; otherwise they are similar, e.g. conditional inference trees can be used as base learners for random forests.

  • Besides comparing the three machine learning methods I used (Naive Bayes, Decision Tree and Random Forest), in this case study I also compared each model's performance on a binary target variable versus a multiclass target variable. You can reproduce the binary-target results by uncommenting the relevant lines in the chunk below.
# Packages assumed by this chunk: dplyr for %>%/summarise and yardstick for
# the *_vec metric helpers (accuracy_vec, sens_vec, spec_vec, precision_vec)
library(dplyr)
library(yardstick)

# Naive Bayes (math): accuracy, sensitivity, specificity, precision and AUC
a <- NB_table_math %>%
  summarise(
    Accuracy = accuracy_vec(Talc, Talc_pred),
    Sensitivity = sens_vec(Talc, Talc_pred),
    Specificity = spec_vec(Talc, Talc_pred),
    Precision = precision_vec(Talc, Talc_pred)
  )
# a$AUC <- AUC_NB_math_binary@y.values[[1]]
a$AUC <- AUC_NB_math_multiclass$auc[[1]]
a$Model <- "Naive Bayes Math"
a <- a[, c(6,1:5)]

b <- NB_table_por %>%
  summarise(
    Accuracy = accuracy_vec(Talc, Talc_pred),
    Sensitivity = sens_vec(Talc, Talc_pred),
    Specificity = spec_vec(Talc, Talc_pred),
    Precision = precision_vec(Talc, Talc_pred)
  )
# b$AUC <- AUC_NB_por_binary@y.values[[1]]
b$AUC <- AUC_NB_por_multiclass$auc[[1]]
b$Model <- "Naive Bayes Portugese"
b <- b[, c(6,1:5)]

# Average the math and por metrics to get a single mean row per method
g <- (a[,-1] + b[,-1])/2
g$Model <- "Naive Bayes Mean"
g <- g[, c(6,1:5)]

c <- DT_table_math %>%
  summarise(
    Accuracy = accuracy_vec(Talc, Talc_pred),
    Sensitivity = sens_vec(Talc, Talc_pred),
    Specificity = spec_vec(Talc, Talc_pred),
    Precision = precision_vec(Talc, Talc_pred)
  )
# c$AUC <- AUC_DT_math_binary@y.values[[1]]
c$AUC <- AUC_DT_math_multiclass$auc[[1]]
c$Model <- "Decision Tree Math"
c <- c[, c(6,1:5)]

d <- DT_table_por %>%
  summarise(
    Accuracy = accuracy_vec(Talc, Talc_pred),
    Sensitivity = sens_vec(Talc, Talc_pred),
    Specificity = spec_vec(Talc, Talc_pred),
    Precision = precision_vec(Talc, Talc_pred)
  )
# d$AUC <- AUC_DT_por_binary@y.values[[1]]
d$AUC <- AUC_DT_por_multiclass$auc[[1]]
d$Model <- "Decision Tree Portugese"
d <- d[, c(6,1:5)]

h <- (c[,-1] + d[,-1])/2
h$Model <- "Decision Tree Mean"
h <- h[, c(6,1:5)]


e <- RF_table_math %>%
  summarise(
    Accuracy = accuracy_vec(Talc, Talc_pred),
    Sensitivity = sens_vec(Talc, Talc_pred),
    Specificity = spec_vec(Talc, Talc_pred),
    Precision = precision_vec(Talc, Talc_pred)
  )
# e$AUC <- AUC_RF_math_binary@y.values[[1]]
e$AUC <- AUC_RF_math_multiclass$auc[[1]]
e$Model <- "Random Forest Math"
e <- e[, c(6,1:5)]

f <- RF_table_por %>%
  summarise(
    Accuracy = accuracy_vec(Talc, Talc_pred),
    Sensitivity = sens_vec(Talc, Talc_pred),
    Specificity = spec_vec(Talc, Talc_pred),
    Precision = precision_vec(Talc, Talc_pred)
  )
# f$AUC <- AUC_RF_por_binary@y.values[[1]]
f$AUC <- AUC_RF_por_multiclass$auc[[1]]
f$Model <- "Random Forest Portugese"
f <- f[, c(6,1:5)]

i <- (e[,-1] + f[,-1])/2
i$Model <- "Random Forest Mean"
i <- i[, c(6,1:5)]

# Bind all individual and averaged summaries into one comparison table
Multiclass <- rbind(a, b, c, d, e, f, g, h, i)
Multiclass

The following are the performance results of each model on the binary class target

The following are the performance results of each model on the multiclass target

As a final conclusion, we should use the Naive Bayes model for this case. This may be because the variables used to predict student alcohol consumption levels are largely independent of one another, which suits the Naive Bayes independence assumption. Moreover, when the target class is changed from binary to multiclass, the performance of our tree-based models decreases drastically.