This project aims to predict the mathematics performance of students in two schools and to give a clearer view of the factors that play a role in their performance.
The data were retrieved from the Kaggle website. The folder contains two files: one for mathematics grades and one for Portuguese language grades. In this project, we focus on the performance of students in mathematics.
To evaluate the performance of our classification models, we use the confusion matrix.
| | Actually Positive | Actually Negative |
|---|---|---|
| Predicted positive | True positives (TP) | False positives (FP) |
| Predicted negative | False negatives (FN) | True negatives (TN) |
From this matrix, we compute some rates:
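As a minimal sketch of the kind of rates usually derived from such a matrix, the Python snippet below computes accuracy, sensitivity, specificity and precision from the four cell counts using their standard definitions. The function name `classification_rates` and the example counts are hypothetical and chosen only for illustration; the rates used in the report itself may differ.

```python
def classification_rates(tp, fp, fn, tn):
    """Standard rates derived from the four cells of a binary confusion matrix."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,    # share of all predictions that are correct
        "sensitivity": tp / (tp + fn),    # true positive rate (recall)
        "specificity": tn / (tn + fp),    # true negative rate
        "precision": tp / (tp + fp),      # positive predictive value
    }

# Hypothetical counts, for illustration only.
rates = classification_rates(tp=40, fp=10, fn=5, tn=45)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

The function assumes every denominator is non-zero; a class that is never predicted or never observed would need separate handling.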
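Validation of our regression models is based on the value of the Root Mean Square Error (RMSE). The RMSE loss function is calculated as follows: \[\begin{equation} RMSE=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left( y_i-\hat{y}_i\right)^2} \end{equation}\] with \(y_i\) the observed grade of student \(i\), \(\hat{y}_i\) the corresponding predicted grade and \(N\) the number of observations.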
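The RMSE formula can be illustrated with a short Python sketch. The function name `rmse` and the four observed and predicted grades are made up for the example and are not taken from the dataset.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(observed))

# Hypothetical grades on the 0-20 scale, for illustration only.
observed = [10, 12, 8, 15]
predicted = [11, 12, 6, 14]
print(round(rmse(observed, predicted), 4))  # sqrt(6 / 4) = 1.2247
```

Because the errors are squared before averaging, one large miss weighs more than several small ones, which is why RMSE is sensitive to outlying predictions.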
This dataset contains 395 observations and 33 variables, including the output target: the final grade. The data have no missing values.
We will start with a short description of each variable used in this project.
- school - student’s school (binary)
- sex - student’s sex (binary)
- age - student’s age (numeric: from \(15\) to \(22\))
- address - student’s home address type (binary)
- famsize - family size (binary)
- Pstatus - parent’s cohabitation status (binary)
- Medu - mother’s education (numeric)
- Fedu - father’s education (numeric)
- Mjob - mother’s job (nominal)
- Fjob - father’s job (nominal)
- reason - reason to choose this school (nominal)
- guardian - student’s guardian (nominal)
- traveltime - home to school travel time (numeric)
- studytime - weekly study time (numeric)
- failures - number of past class failures (numeric)
- schoolsup - extra educational support (binary: yes or no)
- famsup - family educational support (binary: yes or no)
- paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
- activities - extra-curricular activities (binary: yes or no)
- nursery - attended nursery school (binary: yes or no)
- higher - wants to take higher education (binary: yes or no)
- internet - Internet access at home (binary: yes or no)
- romantic - with a romantic relationship (binary: yes or no)
- famrel - quality of family relationships (numeric)
- freetime - free time after school (numeric)
- goout - going out with friends (numeric)
- Dalc - workday alcohol consumption (numeric)
- Walc - weekend alcohol consumption (numeric)
- health - current health status (numeric)
- absences - number of school absences (numeric: from \(0\) to \(93\))
- G1 - first period grade (numeric: from \(0\) to \(20\))
- G2 - second period grade (numeric: from \(0\) to \(20\))
- G3 - final grade (numeric: from \(0\) to \(20\), output target)

Let’s start exploring the data with a summary of all variables.
| Categorical variable | Level counts |
|---|---|
| school | GP: 349, MS: 46 |
| sex | F: 208, M: 187 |
| address | R: 88, U: 307 |
| famsize | GT3: 281, LE3: 114 |
| Pstatus | A: 41, T: 354 |
| Mjob | at_home: 59, health: 34, other: 141, services: 103, teacher: 58 |
| Fjob | at_home: 20, health: 18, other: 217, services: 111, teacher: 29 |
| reason | course: 145, home: 109, other: 36, reputation: 105 |
| guardian | father: 90, mother: 273, other: 32 |
| schoolsup | no: 344, yes: 51 |
| famsup | no: 153, yes: 242 |
| paid | no: 214, yes: 181 |
| activities | no: 194, yes: 201 |
| nursery | no: 81, yes: 314 |
| higher | no: 20, yes: 375 |
| internet | no: 66, yes: 329 |
| romantic | no: 263, yes: 132 |

| Numeric variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| age | 15.0 | 16.0 | 17.0 | 16.7 | 18.0 | 22.0 |
| Medu | 0.000 | 2.000 | 3.000 | 2.749 | 4.000 | 4.000 |
| Fedu | 0.000 | 2.000 | 2.000 | 2.522 | 3.000 | 4.000 |
| traveltime | 1.000 | 1.000 | 1.000 | 1.448 | 2.000 | 4.000 |
| studytime | 1.000 | 1.000 | 2.000 | 2.035 | 2.000 | 4.000 |
| failures | 0.0000 | 0.0000 | 0.0000 | 0.3342 | 0.0000 | 3.0000 |
| famrel | 1.000 | 4.000 | 4.000 | 3.944 | 5.000 | 5.000 |
| freetime | 1.000 | 3.000 | 3.000 | 3.235 | 4.000 | 5.000 |
| goout | 1.000 | 2.000 | 3.000 | 3.109 | 4.000 | 5.000 |
| Dalc | 1.000 | 1.000 | 1.000 | 1.481 | 2.000 | 5.000 |
| Walc | 1.000 | 1.000 | 2.000 | 2.291 | 3.000 | 5.000 |
| health | 1.000 | 3.000 | 4.000 | 3.554 | 5.000 | 5.000 |
| absences | 0.000 | 0.000 | 4.000 | 5.709 | 8.000 | 75.000 |
| G1 | 3.00 | 8.00 | 11.00 | 10.91 | 13.00 | 19.00 |
| G2 | 0.00 | 9.00 | 11.00 | 10.71 | 13.00 | 19.00 |
| G3 | 0.00 | 8.00 | 11.00 | 10.42 | 14.00 | 20.00 |
From the table above, we observe that the Gabriel Pereira school has 349 students while the Mousinho da Silveira school has 46 students. The bar plots below show that we have 208 female students and 187 male students.
The graph below shows that most of the students are between 15 and 18 years old. Two male students are 21 and 22 years old.
We observe that 78% of students live in urban areas, 71% come from families of more than \(3\) members, and in 90% of cases the parents live together. In addition, the data show that for 69% of students the mother is the guardian.
The majority of the parents are educated and \(53.92 \%\) possess a secondary or higher education.
About 4% of students were absent more than 20 times, which is an acceptable rate. Notably, these students have averages of 10.6, 10.5 and 10.3 on G1, G2 and G3 respectively.
As these graphs show, most students do not receive extra school support. 61% of families support their children with their studies, while 46% of students take extra paid classes.
33% of students are in a romantic relationship and 83% have access to the internet at home. The table below shows the average scores on G1, G2 and G3 by weekly study time.
## # A tibble: 4 × 4
## studytime `mean(G1)` `mean(G2)` `mean(G3)`
## <int> <dbl> <dbl> <dbl>
## 1 1 10.4 10.3 10.0
## 2 2 10.7 10.5 10.2
## 3 3 12.0 11.5 11.4
## 4 4 11.9 12.0 11.3
Seeking higher education can be an important motivator for students to achieve higher scores. Students who do not want to pursue higher education have an average of 6.8.
As shown in the figure above, grades G1, G2 and G3 are approximately normally distributed with similar means.
In practice, markers do not give a zero on exams; for this reason we removed all grades equal to zero in G2 and G3. The resulting density function looks like the graph below:
In this section, we start by applying linear regression to our data.
We predict the final grade using linear regression models that take the form \(Y=f(X)+\epsilon\) for an unknown function \(f\), where \(\epsilon\) is a mean-zero random error. If \(f\) is a linear function, we can write our multiple regression model as: \[Y=\beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_nX_n+\epsilon\] where \(X_i\) represents the \(i\)th predictor and \(\beta_i\) represents the association between that predictor and the response. \(\beta_i\) is the slope of the \(i\)th predictor, so we can interpret it as the change in \(Y\) when \(X_i\) increases by one unit while all other predictors are held fixed.
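To make the multiple regression model concrete, here is a minimal Python sketch that estimates \(\beta_0,\dots,\beta_n\) by ordinary least squares through the normal equations. It is an illustration under stated assumptions, not the fitting routine used for the results in this report: the function name `fit_ols` is hypothetical, and the toy data are generated exactly from \(Y = 1 + 2X_1 + 3X_2\) with no noise.

```python
def fit_ols(X, y):
    """Least-squares coefficients [b0, b1, ..., bn] via the normal equations."""
    rows = [[1.0] + list(r) for r in X]  # prepend a constant 1 for the intercept b0
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; X'X is singular")
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

# Toy data generated from Y = 1 + 2*X1 + 3*X2 (no noise), for illustration only.
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [9, 8, 19, 18, 26]
coefficients = fit_ols(X, y)
print([round(b, 6) for b in coefficients])
```

On this noise-free toy data the recovered coefficients match the generating values up to rounding, which illustrates the interpretation of each \(\beta_i\) as a per-unit change in \(Y\).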
The models are summarized in the following table:
## [1] 0.9359376 0.9382631 0.9401577 0.9415117 0.9426209 0.9430300 0.9435459
## [8] 0.9440981 0.9445789 0.9451584 0.9455578 0.9458782 0.9461376 0.9463028
## [15] 0.9463366 0.9463498 0.9463504 0.9463669 0.9463921 0.9462961 0.9462427
## [22] 0.9461150 0.9460059 0.9458840 0.9457488 0.9456030 0.9454526 0.9453025
## [29] 0.9451544 0.9449731 0.9447972 0.9446033 0.9444067
## [1] 19
## [1] 0.9359376 0.9382631 0.9401577 0.9415117 0.9426209 0.9430300 0.9435459
## [8] 0.9440981 0.9445789 0.9451584 0.9455578 0.9458782 0.9459919 0.9460610
## [15] 0.9461070 0.9461586 0.9462099 0.9462334 0.9462090 0.9461614 0.9461237
## [22] 0.9460635 0.9459372 0.9458246 0.9456784 0.9455447 0.9454005 0.9452502
## [29] 0.9450794 0.9449504 0.9447653 0.9445861 0.9443901
## [1] 18
Random forests are generally an improved version of decision trees. The forest builds multiple decision trees and merges them to get a more accurate and stable outcome. A major advantage of random forests is that they can be used for both regression and classification. Here, we applied the method as a regression model to predict the final grade (G3), which is why its performance is reported as an RMSE.
## Bagged CART
##
## 288 samples
## 32 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 258, 258, 259, 259, 260, 261, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.8690477 0.9260334 0.6916692
## [1] 1.106591
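The idea of building many trees and merging them can be sketched in Python as bagging: each tree is fitted on a bootstrap resample of the data and the predictions are averaged. This is a simplified illustration, not the model fitted above. It uses one-split trees (stumps) on a single predictor, the names `fit_stump` and `bagged_stumps` are hypothetical, and the toy grades are made up. A full random forest would additionally sample a random subset of predictors at each split, which is omitted here.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression tree on a single numeric predictor."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # exclude the max so both sides are non-empty
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = sum((y - left_mean) ** 2 for y in left) + sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: fall back to the overall mean
        overall = mean(ys)
        return lambda x: overall
    _, split, low, high = best
    return lambda x: low if x <= split else high

def bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Average the predictions of stumps fitted on bootstrap resamples."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n rows with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: mean(m(x) for m in models)

# Hypothetical (G2, G3) pairs, for illustration only.
xs = [5, 7, 8, 9, 10, 11, 12, 13, 15, 17]
ys = [4, 7, 8, 8, 10, 11, 12, 14, 15, 18]
predict = bagged_stumps(xs, ys)
print(predict(6), predict(16))
```

Averaging over resamples smooths the step-like prediction of a single stump, which is the source of the stability attributed to tree ensembles.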
# Predicting score using rpart model

A decision tree model is a tree-like graph including probability event outcomes. It is an intuitive algorithm that is easily applied in modeling. It uses a tree representation to solve the problem: starting from the root and going down, each node is either a decision node or a terminal (leaf) node.
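The walk from the root down to a leaf can be illustrated with a small hand-built tree in Python. The structure, thresholds and leaf values below are entirely hypothetical and are not the tree fitted in this report; the convention that values below the threshold go left is also an assumption of the sketch.

```python
# Each decision node tests one feature against a threshold;
# each leaf (terminal node) stores a predicted grade.
tree = {
    "feature": "G2", "threshold": 10,
    "left": {
        "feature": "G1", "threshold": 7,
        "left": {"value": 6.0},
        "right": {"value": 9.0},
    },
    "right": {
        "feature": "G2", "threshold": 14,
        "left": {"value": 12.0},
        "right": {"value": 16.5},
    },
}

def predict(node, row):
    """Follow decisions from the root until a leaf is reached."""
    while "value" not in node:
        branch = "left" if row[node["feature"]] < node["threshold"] else "right"
        node = node[branch]
    return node["value"]

print(predict(tree, {"G1": 8, "G2": 9}))    # G2 < 10, then G1 >= 7 -> 9.0
print(predict(tree, {"G1": 12, "G2": 15}))  # G2 >= 10, then G2 >= 14 -> 16.5
```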
## minsplit maxdepth cp error
## 1 12 13 0.01 0.08293618
## 2 16 8 0.01 0.08466819
## 3 18 13 0.01 0.08476208
## 4 19 10 0.01 0.08478310
## 5 8 11 0.01 0.08486986
## [1] 1.291236
In this section, we transform scores into categories by binning them into the intervals [0,5], (5,10], (10,15] and (15,20].
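A minimal Python sketch of this binning step is shown below, assuming the four intervals that appear as class labels in the confusion matrix: [0,5], (5,10], (10,15] and (15,20]. The function name `grade_category` is hypothetical.

```python
import bisect

BREAKS = [0, 5, 10, 15, 20]
LABELS = ["[0,5]", "(5,10]", "(10,15]", "(15,20]"]

def grade_category(score):
    """Map a 0-20 score to its right-closed interval label."""
    if not BREAKS[0] <= score <= BREAKS[-1]:
        raise ValueError("score must lie between 0 and 20")
    # bisect_left returns the index of the first break >= score, so a score
    # equal to a break falls in the interval that ends at that break.
    i = bisect.bisect_left(BREAKS, score)
    return LABELS[max(i, 1) - 1]  # max(...) keeps a score of 0 in the first bin

for score in (0, 5, 6, 10, 15, 20):
    print(score, grade_category(score))
```

Closing the intervals on the right means a score of exactly 10 is placed in (5,10] rather than (10,15], matching the labels shown in the output.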
## Confusion Matrix and Statistics
##
## Reference
## Prediction [0,5] (5,10] (10,15] (15,20]
## [0,5] 0 0 0 0
## (5,10] 1 22 6 0
## (10,15] 0 6 23 6
## (15,20] 0 0 1 4
##
## Overall Statistics
##
## Accuracy : 0.7101
## 95% CI : (0.5884, 0.8131)
## No Information Rate : 0.4348
## P-Value [Acc > NIR] : 3.444e-06
##
## Kappa : 0.5156
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: [0,5] Class: (5,10] Class: (10,15] Class: (15,20]
## Sensitivity 0.00000 0.7857 0.7667 0.40000
## Specificity 1.00000 0.8293 0.6923 0.98305
## Pos Pred Value NaN 0.7586 0.6571 0.80000
## Neg Pred Value 0.98551 0.8500 0.7941 0.90625
## Prevalence 0.01449 0.4058 0.4348 0.14493
## Detection Rate 0.00000 0.3188 0.3333 0.05797
## Detection Prevalence 0.00000 0.4203 0.5072 0.07246
## Balanced Accuracy 0.50000 0.8075 0.7295 0.69153
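The overall accuracy and the per-class sensitivity and specificity reported above can be recomputed from the confusion matrix counts. The Python sketch below copies the counts from the output (rows are predictions, columns are reference classes); the variable and function names are illustrative only.

```python
labels = ["[0,5]", "(5,10]", "(10,15]", "(15,20]"]
# Rows: predicted class, columns: reference (actual) class.
matrix = [
    [0, 0, 0, 0],
    [1, 22, 6, 0],
    [0, 6, 23, 6],
    [0, 0, 1, 4],
]

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / total

def class_rates(i):
    """One-vs-rest sensitivity and specificity for class i."""
    tp = matrix[i][i]
    actual = sum(row[i] for row in matrix)  # column total: students truly in class i
    fp = sum(matrix[i]) - tp                # predicted as class i but in another class
    tn = total - actual - fp
    return tp / actual, tn / (total - actual)

print(f"accuracy: {accuracy:.4f}")  # 49 / 69 = 0.7101
for i, label in enumerate(labels):
    sensitivity, specificity = class_rates(i)
    print(f"{label}: sensitivity={sensitivity:.4f}, specificity={specificity:.4f}")
```

Only one student in the test set falls in [0,5] and the model never predicts that class, which explains its sensitivity of 0 alongside a specificity of 1.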
It would be worthwhile to compare students’ performance over three years and to predict their final grades.