1 Introduction

This project aims to predict the mathematics performance of students in two schools and to give a clearer view of the factors that play a role in their performance.

2 Methodology

2.1 Dataset

The data were retrieved from the Kaggle website. The folder contains two files: one for mathematics grades and one for Portuguese language grades. In this project, we focus on student performance in mathematics.

2.2 Model Evaluation

2.2.1 Confusion Matrix

To evaluate the performance of our classification models, we use the confusion matrix.

                     Actually positive      Actually negative
Predicted positive   True positives (TP)    False positives (FP)
Predicted negative   False negatives (FN)   True negatives (TN)

From this matrix, we compute some rates:

  • Accuracy is the proportion of all predictions that are correct. It is calculated as follows: \[ Accuracy=\frac{TP+TN}{TP+TN+FP+FN} \]
  • Sensitivity is the probability of predicting a positive outcome when the actual outcome is positive. It is calculated as follows: \[ Sensitivity=\frac{TP}{TP+FN} \]
  • Specificity is the probability of not predicting a positive outcome when the actual outcome is not positive. It is calculated as follows: \[ Specificity=\frac{TN}{TN+FP} \]
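The three rates above can be sketched in a few lines of code. This is a minimal illustration in Python (not the code used to produce this report); the function name `binary_rates` and the cell counts are hypothetical and chosen only to exercise the formulas.

```python
def binary_rates(tp, fn, fp, tn):
    """Return (accuracy, sensitivity, specificity) from the four cell counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # share of actual positives that were detected
    specificity = tn / (tn + fp)   # share of actual negatives that were rejected
    return accuracy, sensitivity, specificity


# Hypothetical counts, not taken from the student data.
acc, sens, spec = binary_rates(tp=40, fn=10, fp=5, tn=45)
print(acc, sens, spec)
```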

2.2.2 Root Mean Square Error

Validation of our model is based on the Root Mean Square Error (RMSE). The RMSE loss function is calculated as follows: \[ RMSE=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left( y_i-\hat{y}_i\right)^2} \] with

  • \(y_i\) the actual outcome for observation \(i\),
  • \(\hat{y}_i\) the predicted outcome for observation \(i\),
  • \(N\) the number of observations.
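As a small worked example of the RMSE formula, the sketch below computes it for a handful of grades. It is written in Python for illustration only; the function name `rmse` and the sample values are assumptions, not part of the original analysis.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical grades: the squared errors are 1, 0, 4 and 1, so the mean is 1.5.
actual = [10, 12, 8, 15]
predicted = [11, 12, 6, 14]
print(rmse(actual, predicted))
```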

3 Data analysis and model preparation

The dataset contains 395 observations and 33 variables, including the output target, the final grade (G3). There are no missing values.

We will start with a short description of each variable used in this project.

  1. school - student’s school (binary:
    • “GP” Gabriel Pereira or
    • “MS” Mousinho da Silveira)
  2. sex - student’s sex (binary:
    • “F” - female or
    • “M” - male)
  3. age - student’s age (numeric: from \(15\) to \(22\))
  4. address - student’s home address type (binary:
    • “U” - urban or
    • “R” - rural)
  5. famsize - family size (binary:
    • “LE3” less or equal to \(3\) or
    • “GT3” greater than \(3\))
  6. Pstatus - parent’s cohabitation status (binary:
    • “T” living together or
    • “A” apart)
  7. Medu - mother’s education (numeric:
    • 0 - none,
    • 1 - primary education (4th grade),
    • 2 – 5th to 9th grade,
    • 3 – secondary education or
    • 4 – higher education)
  8. Fedu - father’s education (numeric:
    • 0 - none,
    • 1 - primary education (4th grade),
    • 2 – 5th to 9th grade,
    • 3 – secondary education or
    • 4 – higher education)
  9. Mjob - mother’s job (nominal:
    • “teacher”,
    • “health” care related,
    • civil “services” (e.g. administrative or police),
    • “at_home” or
    • “other”)
  10. Fjob - father’s job (nominal:
    • “teacher”,
    • “health” care related,
    • civil “services” (e.g. administrative or police),
    • “at_home” or
    • “other”)
  11. reason - reason to choose this school (nominal:
    • close to “home”,
    • school “reputation”,
    • “course” preference or
    • “other”)
  12. guardian - student’s guardian (nominal:
    • “mother”,
    • “father” or
    • “other”)
  13. traveltime - home to school travel time (numeric:
    • 1 - \(<15\) min.,
    • 2 - \(15\) to \(30\) min.,
    • 3 - \(30\) min. to \(1\) hour, or
    • 4 - \(>1\) hour)
  14. studytime - weekly study time (numeric:
    • 1 - \(<2\) hours,
    • 2 - \(2\) to \(5\) hours,
    • 3 - \(5\) to \(10\) hours, or
    • 4 - \(>10\) hours)
  15. failures - number of past class failures (numeric:
    • n if \(1\leqslant n <3\),
    • else \(4\))
  16. schoolsup - extra educational support (binary: yes or no)
  17. famsup - family educational support (binary: yes or no)
  18. paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
  19. activities - extra-curricular activities (binary: yes or no)
  20. nursery - attended nursery school (binary: yes or no)
  21. higher - wants to take higher education (binary: yes or no)
  22. internet - Internet access at home (binary: yes or no)
  23. romantic - with a romantic relationship (binary: yes or no)
  24. famrel - quality of family relationships (numeric:
    • from \(1\) - very bad to \(5\) - excellent)
  25. freetime - free time after school (numeric:
    • from \(1\) - very low to \(5\) - very high)
  26. goout - going out with friends (numeric:
    • from \(1\) - very low to \(5\) - very high)
  27. Dalc - workday alcohol consumption (numeric:
    • from \(1\) - very low to \(5\) - very high)
  28. Walc - weekend alcohol consumption (numeric:
    • from \(1\) - very low to \(5\) - very high)
  29. health - current health status (numeric:
    • from \(1\) - very bad to \(5\) - very good)
  30. absences - number of school absences (numeric: from \(0\) to \(93\))
  31. G1 - first period grade (numeric: from \(0\) to \(20\))
  32. G2 - second period grade (numeric: from \(0\) to \(20\))
  33. G3 - final grade (numeric: from \(0\) to \(20\), output target)

3.1 Data exploration and visualization

Let’s start exploring the data with a summary of all variables.

##  school   sex          age       address famsize   Pstatus      Medu      
##  GP:349   F:208   Min.   :15.0   R: 88   GT3:281   A: 41   Min.   :0.000  
##  MS: 46   M:187   1st Qu.:16.0   U:307   LE3:114   T:354   1st Qu.:2.000  
##                   Median :17.0                             Median :3.000  
##                   Mean   :16.7                             Mean   :2.749  
##                   3rd Qu.:18.0                             3rd Qu.:4.000  
##                   Max.   :22.0                             Max.   :4.000  
##       Fedu             Mjob           Fjob            reason      guardian  
##  Min.   :0.000   at_home : 59   at_home : 20   course    :145   father: 90  
##  1st Qu.:2.000   health  : 34   health  : 18   home      :109   mother:273  
##  Median :2.000   other   :141   other   :217   other     : 36   other : 32  
##  Mean   :2.522   services:103   services:111   reputation:105               
##  3rd Qu.:3.000   teacher : 58   teacher : 29                                
##  Max.   :4.000                                                              
##    traveltime      studytime        failures      schoolsup famsup     paid    
##  Min.   :1.000   Min.   :1.000   Min.   :0.0000   no :344   no :153   no :214  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.0000   yes: 51   yes:242   yes:181  
##  Median :1.000   Median :2.000   Median :0.0000                                
##  Mean   :1.448   Mean   :2.035   Mean   :0.3342                                
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:0.0000                                
##  Max.   :4.000   Max.   :4.000   Max.   :3.0000                                
##  activities nursery   higher    internet  romantic      famrel     
##  no :194    no : 81   no : 20   no : 66   no :263   Min.   :1.000  
##  yes:201    yes:314   yes:375   yes:329   yes:132   1st Qu.:4.000  
##                                                     Median :4.000  
##                                                     Mean   :3.944  
##                                                     3rd Qu.:5.000  
##                                                     Max.   :5.000  
##     freetime         goout            Dalc            Walc      
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :3.000   Median :3.000   Median :1.000   Median :2.000  
##  Mean   :3.235   Mean   :3.109   Mean   :1.481   Mean   :2.291  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:3.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##      health         absences            G1              G2       
##  Min.   :1.000   Min.   : 0.000   Min.   : 3.00   Min.   : 0.00  
##  1st Qu.:3.000   1st Qu.: 0.000   1st Qu.: 8.00   1st Qu.: 9.00  
##  Median :4.000   Median : 4.000   Median :11.00   Median :11.00  
##  Mean   :3.554   Mean   : 5.709   Mean   :10.91   Mean   :10.71  
##  3rd Qu.:5.000   3rd Qu.: 8.000   3rd Qu.:13.00   3rd Qu.:13.00  
##  Max.   :5.000   Max.   :75.000   Max.   :19.00   Max.   :19.00  
##        G3       
##  Min.   : 0.00  
##  1st Qu.: 8.00  
##  Median :11.00  
##  Mean   :10.42  
##  3rd Qu.:14.00  
##  Max.   :20.00

From the above table, we observe that the Gabriel Pereira school has 349 students while the Mousinho da Silveira school has 46 students. The bar plots below show that we have 208 female students and 187 male students.

The graph below shows that most of the students are between 15 and 18 years old. Two male students are 21 and 22 years old.

We observe that 78% of students come from urban areas, 71% of families have more than \(3\) members, and in 90% of families the parents live together. In addition, the data show that for 69% of students the mother is the guardian.
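These percentages follow directly from the counts in the summary output (395 students in total). The short Python check below reproduces them; the dictionary labels are descriptive names introduced here for illustration.

```python
n_students = 395

# Counts read from the summary output: address U, famsize GT3, Pstatus T, guardian mother.
counts = {
    "urban address": 307,
    "family size greater than 3": 281,
    "parents living together": 354,
    "mother is guardian": 273,
}

# Percentage of all students, rounded to the nearest whole number.
percentages = {label: round(100 * count / n_students) for label, count in counts.items()}
for label, pct in percentages.items():
    print(f"{label}: {pct}%")
```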

The majority of the parents are educated and \(53.92 \%\) possess a secondary or higher education.

About 4% of students were absent more than 20 times, which is an acceptable rate. Notably, these students have averages of 10.6, 10.5 and 10.3 on G1, G2 and G3, respectively.

As these graphs show, most students do not receive extra school support. 61% of families support their children with their studies, while 46% of students take extra paid classes.

33% of students are in a romantic relationship, and the table below shows the average scores. In addition, 83% of students have internet access at home.

## # A tibble: 4 × 4
##   studytime `mean(G1)` `mean(G2)` `mean(G3)`
##       <int>      <dbl>      <dbl>      <dbl>
## 1         1       10.4       10.3       10.0
## 2         2       10.7       10.5       10.2
## 3         3       12.0       11.5       11.4
## 4         4       11.9       12.0       11.3

Aspiring to higher education can be an important motivator for students to achieve higher scores. Students who do not want to pursue higher education have an average of 6.8.

3.2 Density

As shown in the figure above, the grades G1, G2 and G3 are approximately normally distributed with similar means.

In practice, graders rarely give a zero on an exam. For this reason, we removed all grades equal to zero in G2 and G3. The density function then looks like the graph below:
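The filtering step described above can be sketched as follows. This is an illustrative Python snippet with hypothetical rows, not the code that produced the report; only the column names G1, G2 and G3 come from the dataset.

```python
# Hypothetical student records; only the grade columns are shown.
students = [
    {"G1": 10, "G2": 11, "G3": 12},
    {"G1": 7, "G2": 0, "G3": 0},
    {"G1": 9, "G2": 8, "G3": 0},
    {"G1": 14, "G2": 15, "G3": 15},
]

# Keep a row only if neither G2 nor G3 is zero.
kept = [row for row in students if row["G2"] != 0 and row["G3"] != 0]
print(len(students), "rows before,", len(kept), "rows after")
```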

4 Correlation

5 Data splitting

6 Linear regression

In this section, we apply linear regression to our data.

We start by predicting the final grade using linear regression models of the form \(Y=f(X)+\epsilon\) for an unknown function \(f\), where \(\epsilon\) is a mean-zero random error. If \(f\) is a linear function, we can write our multiple regression model as: \[Y=\beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_nX_n+\epsilon\] where \(X_i\) represents the \(i\)th predictor and \(\beta_i\) represents the association between that predictor and the response. \(\beta_i\) is the slope of the \(i\)th predictor, so we can interpret it as the change in \(Y\) when \(X_i\) increases by one unit while all other predictors are held fixed.
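The interpretation of \(\beta_i\) as a per-unit change can be checked numerically. The Python sketch below evaluates the linear part of the model for made-up coefficients; the function name `predict` and all numeric values are assumptions for illustration and are not estimates from the student data.

```python
def predict(intercept, coefficients, predictors):
    """Evaluate beta_0 + sum_i beta_i * x_i for a single observation."""
    if len(coefficients) != len(predictors):
        raise ValueError("need one coefficient per predictor")
    return intercept + sum(b * x for b, x in zip(coefficients, predictors))


# Hypothetical coefficients and predictor values.
beta_0 = 1.0
betas = [0.5, 2.0]
x = [4.0, 3.0]

baseline = predict(beta_0, betas, x)
# Increase the first predictor by one unit and hold the second fixed.
bumped = predict(beta_0, betas, [x[0] + 1.0, x[1]])
effect = bumped - baseline  # equals the first coefficient
print(baseline, bumped, effect)
```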

The models are summarized in the following output:

##  [1] 0.9359376 0.9382631 0.9401577 0.9415117 0.9426209 0.9430300 0.9435459
##  [8] 0.9440981 0.9445789 0.9451584 0.9455578 0.9458782 0.9461376 0.9463028
## [15] 0.9463366 0.9463498 0.9463504 0.9463669 0.9463921 0.9462961 0.9462427
## [22] 0.9461150 0.9460059 0.9458840 0.9457488 0.9456030 0.9454526 0.9453025
## [29] 0.9451544 0.9449731 0.9447972 0.9446033 0.9444067
## [1] 19
##  [1] 0.9359376 0.9382631 0.9401577 0.9415117 0.9426209 0.9430300 0.9435459
##  [8] 0.9440981 0.9445789 0.9451584 0.9455578 0.9458782 0.9459919 0.9460610
## [15] 0.9461070 0.9461586 0.9462099 0.9462334 0.9462090 0.9461614 0.9461237
## [22] 0.9460635 0.9459372 0.9458246 0.9456784 0.9455447 0.9454005 0.9452502
## [29] 0.9450794 0.9449504 0.9447653 0.9445861 0.9443901
## [1] 18

6.1 Removing less important variables

7 Random Forest

Random forests are generally an improved version of decision trees. The forest builds multiple decision trees and merges them to get a more accurate and stable outcome. A major advantage of random forests is that they can be used for both regression and classification. Here, we used the model as a regressor and applied it to predict the final grade (G3).

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## Bagged CART 
## 
## 288 samples
##  32 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 258, 258, 259, 259, 260, 261, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.8690477  0.9260334  0.6916692

## [1] 1.106591
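The idea of building many trees and merging their outputs can be sketched with plain bagging: each tree is fitted on a bootstrap resample and the predictions are averaged. The Python code below is a toy illustration under strong assumptions (one predictor, one-split trees, hypothetical data) and is not the model fitted in this report; a full random forest would additionally sample a random subset of predictors at each split.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree (a stump) on a single predictor."""
    best = None
    # Candidate thresholds: every distinct x except the largest, so both sides are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((y - left_mean) ** 2 for y in left) + sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # All x values are identical: fall back to the overall mean.
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def bagged_predict(xs, ys, x_new, n_trees=25, seed=0):
    """Average the predictions of stumps fitted on bootstrap resamples."""
    rng = random.Random(seed)
    n = len(xs)
    predictions = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        stump = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        predictions.append(stump(x_new))
    return sum(predictions) / n_trees


# Hypothetical second-period grades (predictor) and final grades (response).
g2 = [5, 7, 8, 10, 11, 13, 14, 16]
g3 = [5, 6, 8, 10, 11, 13, 15, 16]
low = bagged_predict(g2, g3, x_new=6)
high = bagged_predict(g2, g3, x_new=15)
print(low, high)
```

Averaging over resamples is what makes the combined prediction more stable than that of any single tree.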

Predicting score using rpart model

A decision tree model is a tree-like graph that includes the outcomes of probabilistic events. It is an intuitive algorithm that is easily applied in modeling [1]. It uses a tree representation to solve the problem: starting from the root and moving down, each internal node represents a decision and each leaf is a terminal node.

##   minsplit maxdepth   cp      error
## 1       12       13 0.01 0.08293618
## 2       16        8 0.01 0.08466819
## 3       18       13 0.01 0.08476208
## 4       19       10 0.01 0.08478310
## 5        8       11 0.01 0.08486986

## [1] 1.291236

8 Turning scores into categorical factors

In this section, we transform scores into categories by binning them into the intervals [0,5], (5,10], (10,15] and (15,20].
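A minimal sketch of this binning step is given below in Python, for illustration only. The interval labels match those in the confusion matrix output; the function name `grade_category` is an assumption.

```python
def grade_category(score):
    """Map a grade between 0 and 20 to its interval label."""
    if not 0 <= score <= 20:
        raise ValueError("score must lie between 0 and 20")
    if score <= 5:
        return "[0,5]"
    if score <= 10:
        return "(5,10]"
    if score <= 15:
        return "(10,15]"
    return "(15,20]"


# Boundary values fall in the lower interval because each upper bound is inclusive.
for score in (0, 5, 6, 10, 11, 20):
    print(score, grade_category(score))
```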

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction [0,5] (5,10] (10,15] (15,20]
##    [0,5]       0      0       0       0
##    (5,10]      1     22       6       0
##    (10,15]     0      6      23       6
##    (15,20]     0      0       1       4
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7101          
##                  95% CI : (0.5884, 0.8131)
##     No Information Rate : 0.4348          
##     P-Value [Acc > NIR] : 3.444e-06       
##                                           
##                   Kappa : 0.5156          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: [0,5] Class: (5,10] Class: (10,15] Class: (15,20]
## Sensitivity               0.00000        0.7857         0.7667        0.40000
## Specificity               1.00000        0.8293         0.6923        0.98305
## Pos Pred Value                NaN        0.7586         0.6571        0.80000
## Neg Pred Value            0.98551        0.8500         0.7941        0.90625
## Prevalence                0.01449        0.4058         0.4348        0.14493
## Detection Rate            0.00000        0.3188         0.3333        0.05797
## Detection Prevalence      0.00000        0.4203         0.5072        0.07246
## Balanced Accuracy         0.50000        0.8075         0.7295        0.69153
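The per-class statistics above come from treating each class in turn as "positive" and all other classes as "negative". The Python sketch below recomputes the overall accuracy and the sensitivity and specificity of one class from the counts in the confusion matrix; the function name `one_vs_rest` is an assumption introduced for illustration.

```python
labels = ["[0,5]", "(5,10]", "(10,15]", "(15,20]"]

# Rows are predictions and columns are reference classes, copied from the output above.
matrix = [
    [0, 0, 0, 0],
    [1, 22, 6, 0],
    [0, 6, 23, 6],
    [0, 0, 1, 4],
]

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[k][k] for k in range(len(labels))) / total


def one_vs_rest(k):
    """Return (sensitivity, specificity) for class k against all other classes."""
    tp = matrix[k][k]
    fn = sum(matrix[r][k] for r in range(len(labels))) - tp  # rest of column k
    fp = sum(matrix[k]) - tp                                 # rest of row k
    tn = total - tp - fn - fp
    return tp / (tp + fn), tn / (tn + fp)


sens, spec = one_vs_rest(labels.index("(5,10]"))
print(round(accuracy, 4), round(sens, 4), round(spec, 4))
```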

9 Results

It would be worthwhile to compare students’ performance over three years and to predict their final grades.

10 Conclusion