Introduction

This project uses the Student Performance dataset from the UCI Machine Learning Repository. The repository provides two datasets: student-mat.csv for the Mathematics course and student-por.csv for the Portuguese language course. We use these datasets to build models that predict the final grade in Mathematics and Portuguese from the students' earlier grades and their demographic, social and school-related features.

Data exploration

Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:

1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)

2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)

3 age - student’s age (numeric: from 15 to 22)

4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)

5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)

6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)

7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)

8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)

9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)

10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)

11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)

12 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)

13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)

14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)

15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)

16 schoolsup - extra educational support (binary: yes or no)

17 famsup - family educational support (binary: yes or no)

18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)

19 activities - extra-curricular activities (binary: yes or no)

20 nursery - attended nursery school (binary: yes or no)

21 higher - wants to take higher education (binary: yes or no)

22 internet - Internet access at home (binary: yes or no)

23 romantic - with a romantic relationship (binary: yes or no)

24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)

25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)

26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)

27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)

28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)

29 health - current health status (numeric: from 1 - very bad to 5 - very good)

30 absences - number of school absences (numeric: from 0 to 93)

These grades are related with the course subject, Math or Portuguese:

31 G1 - first period grade (numeric: from 0 to 20)

32 G2 - second period grade (numeric: from 0 to 20)

33 G3 - final grade (numeric: from 0 to 20, output target)

Dimension for Maths dataset

## [1] 395  33

Dimension for Portuguese dataset

## [1] 649  33
##     school              sex                 age         address         
##  Length:395         Length:395         Min.   :15.0   Length:395        
##  Class :character   Class :character   1st Qu.:16.0   Class :character  
##  Mode  :character   Mode  :character   Median :17.0   Mode  :character  
##                                        Mean   :16.7                     
##                                        3rd Qu.:18.0                     
##                                        Max.   :22.0                     
##    famsize            Pstatus               Medu            Fedu      
##  Length:395         Length:395         Min.   :0.000   Min.   :0.000  
##  Class :character   Class :character   1st Qu.:2.000   1st Qu.:2.000  
##  Mode  :character   Mode  :character   Median :3.000   Median :2.000  
##                                        Mean   :2.749   Mean   :2.522  
##                                        3rd Qu.:4.000   3rd Qu.:3.000  
##                                        Max.   :4.000   Max.   :4.000  
##      Mjob               Fjob              reason            guardian        
##  Length:395         Length:395         Length:395         Length:395        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    traveltime      studytime        failures       schoolsup        
##  Min.   :1.000   Min.   :1.000   Min.   :0.0000   Length:395        
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.0000   Class :character  
##  Median :1.000   Median :2.000   Median :0.0000   Mode  :character  
##  Mean   :1.448   Mean   :2.035   Mean   :0.3342                     
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:0.0000                     
##  Max.   :4.000   Max.   :4.000   Max.   :3.0000                     
##     famsup              paid            activities          nursery         
##  Length:395         Length:395         Length:395         Length:395        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     higher            internet           romantic             famrel     
##  Length:395         Length:395         Length:395         Min.   :1.000  
##  Class :character   Class :character   Class :character   1st Qu.:4.000  
##  Mode  :character   Mode  :character   Mode  :character   Median :4.000  
##                                                           Mean   :3.944  
##                                                           3rd Qu.:5.000  
##                                                           Max.   :5.000  
##     freetime         goout            Dalc            Walc      
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :3.000   Median :3.000   Median :1.000   Median :2.000  
##  Mean   :3.235   Mean   :3.109   Mean   :1.481   Mean   :2.291  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:3.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##      health         absences            G1              G2       
##  Min.   :1.000   Min.   : 0.000   Min.   : 3.00   Min.   : 0.00  
##  1st Qu.:3.000   1st Qu.: 0.000   1st Qu.: 8.00   1st Qu.: 9.00  
##  Median :4.000   Median : 4.000   Median :11.00   Median :11.00  
##  Mean   :3.554   Mean   : 5.709   Mean   :10.91   Mean   :10.71  
##  3rd Qu.:5.000   3rd Qu.: 8.000   3rd Qu.:13.00   3rd Qu.:13.00  
##  Max.   :5.000   Max.   :75.000   Max.   :19.00   Max.   :19.00  
##        G3       
##  Min.   : 0.00  
##  1st Qu.: 8.00  
##  Median :11.00  
##  Mean   :10.42  
##  3rd Qu.:14.00  
##  Max.   :20.00
##     school              sex                 age          address         
##  Length:649         Length:649         Min.   :15.00   Length:649        
##  Class :character   Class :character   1st Qu.:16.00   Class :character  
##  Mode  :character   Mode  :character   Median :17.00   Mode  :character  
##                                        Mean   :16.74                     
##                                        3rd Qu.:18.00                     
##                                        Max.   :22.00                     
##    famsize            Pstatus               Medu            Fedu      
##  Length:649         Length:649         Min.   :0.000   Min.   :0.000  
##  Class :character   Class :character   1st Qu.:2.000   1st Qu.:1.000  
##  Mode  :character   Mode  :character   Median :2.000   Median :2.000  
##                                        Mean   :2.515   Mean   :2.307  
##                                        3rd Qu.:4.000   3rd Qu.:3.000  
##                                        Max.   :4.000   Max.   :4.000  
##      Mjob               Fjob              reason            guardian        
##  Length:649         Length:649         Length:649         Length:649        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    traveltime      studytime        failures       schoolsup        
##  Min.   :1.000   Min.   :1.000   Min.   :0.0000   Length:649        
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.0000   Class :character  
##  Median :1.000   Median :2.000   Median :0.0000   Mode  :character  
##  Mean   :1.569   Mean   :1.931   Mean   :0.2219                     
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:0.0000                     
##  Max.   :4.000   Max.   :4.000   Max.   :3.0000                     
##     famsup              paid            activities          nursery         
##  Length:649         Length:649         Length:649         Length:649        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     higher            internet           romantic             famrel     
##  Length:649         Length:649         Length:649         Min.   :1.000  
##  Class :character   Class :character   Class :character   1st Qu.:4.000  
##  Mode  :character   Mode  :character   Mode  :character   Median :4.000  
##                                                           Mean   :3.931  
##                                                           3rd Qu.:5.000  
##                                                           Max.   :5.000  
##     freetime        goout            Dalc            Walc          health     
##  Min.   :1.00   Min.   :1.000   Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  1st Qu.:3.00   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.00   1st Qu.:2.000  
##  Median :3.00   Median :3.000   Median :1.000   Median :2.00   Median :4.000  
##  Mean   :3.18   Mean   :3.185   Mean   :1.502   Mean   :2.28   Mean   :3.536  
##  3rd Qu.:4.00   3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:3.00   3rd Qu.:5.000  
##  Max.   :5.00   Max.   :5.000   Max.   :5.000   Max.   :5.00   Max.   :5.000  
##     absences            G1             G2              G3       
##  Min.   : 0.000   Min.   : 0.0   Min.   : 0.00   Min.   : 0.00  
##  1st Qu.: 0.000   1st Qu.:10.0   1st Qu.:10.00   1st Qu.:10.00  
##  Median : 2.000   Median :11.0   Median :11.00   Median :12.00  
##  Mean   : 3.659   Mean   :11.4   Mean   :11.57   Mean   :11.91  
##  3rd Qu.: 6.000   3rd Qu.:13.0   3rd Qu.:13.00   3rd Qu.:14.00  
##  Max.   :32.000   Max.   :19.0   Max.   :19.00   Max.   :19.00
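
The dimensions and summaries above could be reproduced along the following lines. This is a minimal sketch, not the original code: the helper name read_students is hypothetical, and it assumes the files are semicolon-separated, as they are distributed by UCI. A tiny stand-in file is used so the example runs without the real data.

```r
# Hypothetical helper: read one of the UCI student files.
# Assumption: the CSVs are semicolon-separated, so sep = ";" is needed.
read_students <- function(path) {
  read.csv(path, sep = ";", header = TRUE, stringsAsFactors = FALSE)
}

# Tiny stand-in file so the sketch runs without the real data.
demo_path <- tempfile(fileext = ".csv")
writeLines(c("school;sex;age;G3",
             "GP;F;18;6",
             "MS;M;17;10"), demo_path)

demo <- read_students(demo_path)
dim(demo)      # number of rows and columns
summary(demo)  # per-column summary in the same layout as the output above
```

With the real files in the working directory, the calls would be read_students("student-mat.csv") and read_students("student-por.csv").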

We clean both datasets by replacing the yes/no values with 1/0. The cleaned datasets are named maths_clean and port_clean.
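
A minimal sketch of this recoding step is shown here. The helper name recode_yes_no and the toy data frame are assumptions for illustration; the original cleaning code is not part of this report.

```r
# Hypothetical helper: recode every yes/no column to 1/0,
# leaving all other columns untouched.
recode_yes_no <- function(df) {
  is_yes_no <- vapply(df, function(col) {
    is.character(col) && all(col %in% c("yes", "no"))
  }, logical(1))
  df[is_yes_no] <- lapply(df[is_yes_no], function(col) as.integer(col == "yes"))
  df
}

# Toy data with two yes/no columns, one other character column and one numeric column.
demo <- data.frame(
  schoolsup = c("yes", "no", "no"),
  higher    = c("yes", "yes", "no"),
  sex       = c("F", "M", "F"),
  age       = c(15, 16, 17),
  stringsAsFactors = FALSE
)
demo_clean <- recode_yes_no(demo)
str(demo_clean)
```

Applied to the real data, this would be maths_clean <- recode_yes_no(maths), assuming maths holds the raw Mathematics data. Columns such as school and sex stay as text, which is consistent with the schoolMS and sexM terms in the model output.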

Data Visualization

Correlation matrix

From the correlation matrices for both datasets we see a strong positive relationship among the grades G1, G2 and G3. The variables age, Medu, Fedu, traveltime, studytime, failures, schoolsup, famsup, paid, activities, nursery, higher, internet, romantic, famrel, freetime, goout, Dalc, Walc, health and absences are positively correlated, and the others are negatively correlated.
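
As a rough illustration of how such a matrix can be computed, the sketch below restricts a data frame to its numeric columns and calls cor(). The helper name numeric_cor and the toy values are assumptions, not the data or code used for the plots.

```r
# Hypothetical helper: correlation matrix over the numeric columns only.
numeric_cor <- function(df) {
  num_cols <- vapply(df, is.numeric, logical(1))
  cor(df[, num_cols, drop = FALSE])
}

# Toy data: grades that move together and absences that move against them.
demo <- data.frame(
  G1       = c(5, 10, 12, 15, 18),
  G2       = c(6, 9, 13, 14, 19),
  G3       = c(5, 10, 13, 15, 19),
  absences = c(10, 4, 6, 2, 0),
  school   = c("GP", "GP", "MS", "MS", "GP"),
  stringsAsFactors = FALSE
)
cm <- numeric_cor(demo)
round(cm, 2)
```

Reading a single row of the matrix, for example cm["G3", ], shows how each variable correlates with the final grade.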

Mathematics

Portuguese

Histogram

The histograms below for the Maths and Portuguese datasets show that the grades G1, G2 and G3 are approximately normally distributed. Absences are right-skewed in both datasets.
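
The sketch below shows one way to inspect such a distribution without drawing it, using made-up absence counts (an assumption, not the real data). A mean well above the median is a quick sign of right skew.

```r
# Made-up absence counts with a long right tail.
demo_absences <- c(0, 0, 0, 1, 2, 2, 3, 4, 6, 8, 14, 30)

# Bin the values as hist() would, but return the counts instead of plotting.
h <- hist(demo_absences, breaks = seq(0, 30, by = 5), plot = FALSE)
h$counts

# Right skew pulls the mean above the median.
mean(demo_absences) > median(demo_absences)
```

Dropping plot = FALSE draws the histogram itself, for example hist(maths_clean$absences).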

Mathematics

Portuguese

Boxplot

In the boxplots below we see a high interquartile range in the variables activities, famsup, paid and romantic.
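
The variables named here are yes/no columns recoded to 1/0. As a hedged illustration with made-up vectors, the interquartile range of such a column is 1 whenever both values are common, which makes the box span the whole range:

```r
# Made-up 0/1 columns: one balanced, one dominated by zeros.
balanced <- c(0, 0, 0, 0, 1, 1, 1, 1)
lopsided <- c(0, 0, 0, 0, 0, 0, 0, 1)

IQR(balanced)  # quartiles fall on 0 and 1, so the box covers the full range
IQR(lopsided)  # both quartiles fall on 0, so the box collapses
```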

Mathematics

Portuguese

Model building

We split the cleaned datasets maths_clean and port_clean into a training set and a testing set: 80% of the rows are used for training and 20% for testing.
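
One simple way to make such a split is a random sample of row indices. The sketch below is an assumption about the approach, since the report does not show the splitting code; the helper name split_train_test and the seed are hypothetical.

```r
# Hypothetical helper: random 80/20 split of the rows of a data frame.
split_train_test <- function(df, train_frac = 0.8, seed = 123) {
  set.seed(seed)  # fixed seed so the split is reproducible
  train_idx <- sample(seq_len(nrow(df)), size = floor(train_frac * nrow(df)))
  list(train = df[train_idx, , drop = FALSE],
       test  = df[-train_idx, , drop = FALSE])
}

# Stand-in data frame with the same number of rows as the Maths dataset.
demo <- data.frame(id = 1:395, G3 = rep(0:4, length.out = 395))
parts <- split_train_test(demo)
nrow(parts$train)
nrow(parts$test)
```

For 395 rows this gives 316 training rows, which agrees with the 315 null-deviance degrees of freedom reported in the model output.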

Model 1

We start by building a linear model that includes all of the variables. The residual plots show no pattern in the distribution, the residual histogram is approximately normal, and the Q-Q plot closely follows the normal line with slight skewing at the left.
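
The calls behind output of this kind are lm(), summary() and confint(). The sketch below uses simulated stand-in data (an assumption, so the numbers will not match the report) to show the full-model fit and the residual checks described above.

```r
# Simulated stand-in for a training set; not the real student data.
set.seed(1)
n <- 60
demo_train <- data.frame(
  G1       = round(runif(n, 5, 18)),
  absences = rpois(n, 4),
  sex      = sample(c("F", "M"), n, replace = TRUE)
)
demo_train$G2 <- demo_train$G1 + round(rnorm(n, 0, 1))
demo_train$G3 <- demo_train$G2 + round(rnorm(n, 0, 1))

# "G3 ~ ." regresses G3 on every other column of the data frame.
model1 <- lm(G3 ~ ., data = demo_train)
summary(model1)   # coefficient table, residual standard error, R-squared
confint(model1)   # 95% confidence intervals for the coefficients

# Residual diagnostics; uncomment the plotting calls to draw them.
res <- residuals(model1)
# plot(fitted(model1), res)   # residuals vs fitted values
# hist(res)                   # residual histogram
# qqnorm(res); qqline(res)    # normal Q-Q plot
```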

## 
## Call:
## lm(formula = G3 ~ ., data = maths_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6879 -0.5536  0.1867  1.0543  4.8993 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.35049    2.45912   0.143  0.88677    
## schoolMS          0.49774    0.41931   1.187  0.23624    
## sexM              0.11195    0.27015   0.414  0.67890    
## age              -0.22802    0.11733  -1.943  0.05299 .  
## addressU          0.15458    0.31608   0.489  0.62519    
## famsizeLE3        0.06696    0.26479   0.253  0.80054    
## PstatusT         -0.30613    0.38231  -0.801  0.42399    
## Medu              0.02343    0.18218   0.129  0.89775    
## Fedu             -0.08043    0.15383  -0.523  0.60153    
## Mjobhealth       -0.15046    0.60642  -0.248  0.80424    
## Mjobother         0.17475    0.38401   0.455  0.64943    
## Mjobservices      0.20531    0.42486   0.483  0.62931    
## Mjobteacher       0.06101    0.56767   0.107  0.91449    
## Fjobhealth        0.04300    0.79739   0.054  0.95703    
## Fjobother        -0.08565    0.51814  -0.165  0.86883    
## Fjobservices     -0.28895    0.54231  -0.533  0.59460    
## Fjobteacher       0.04175    0.70168   0.060  0.95260    
## reasonhome       -0.22651    0.29977  -0.756  0.45054    
## reasonother       0.34134    0.43261   0.789  0.43079    
## reasonreputation  0.14007    0.32262   0.434  0.66451    
## guardianmother    0.04885    0.30129   0.162  0.87132    
## guardianother    -0.12815    0.52328  -0.245  0.80672    
## traveltime        0.06367    0.19045   0.334  0.73842    
## studytime        -0.03416    0.15788  -0.216  0.82884    
## failures         -0.11375    0.18423  -0.617  0.53747    
## schoolsup         0.38803    0.37305   1.040  0.29919    
## famsup            0.06910    0.26762   0.258  0.79643    
## paid              0.23221    0.25905   0.896  0.37083    
## activities       -0.37649    0.23886  -1.576  0.11613    
## nursery          -0.22119    0.29912  -0.739  0.46025    
## higher           -0.47963    0.56595  -0.847  0.39747    
## internet         -0.02805    0.33648  -0.083  0.93363    
## romantic         -0.24316    0.25965  -0.936  0.34984    
## famrel            0.33757    0.13836   2.440  0.01533 *  
## freetime          0.11970    0.12962   0.924  0.35656    
## goout             0.03599    0.12599   0.286  0.77534    
## Dalc             -0.21077    0.18005  -1.171  0.24277    
## Walc              0.16672    0.13503   1.235  0.21801    
## health            0.06933    0.08466   0.819  0.41354    
## absences          0.04595    0.01503   3.057  0.00246 ** 
## G1                0.20604    0.06960   2.960  0.00334 ** 
## G2                0.93901    0.05954  15.771  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.974 on 274 degrees of freedom
## Multiple R-squared:  0.8463, Adjusted R-squared:  0.8233 
## F-statistic: 36.81 on 41 and 274 DF,  p-value: < 2.2e-16
##                        2.5 %      97.5 %
## (Intercept)      -4.49068821 5.191664522
## schoolMS         -0.32773580 1.323214007
## sexM             -0.41987649 0.643774322
## age              -0.45900249 0.002966819
## addressU         -0.46767582 0.776844450
## famsizeLE3       -0.45431435 0.588244079
## PstatusT         -1.05877450 0.446521687
## Medu             -0.33521443 0.382077550
## Fedu             -0.38327357 0.222421688
## Mjobhealth       -1.34429986 1.043382521
## Mjobother        -0.58124422 0.930736198
## Mjobservices     -0.63109900 1.041723340
## Mjobteacher      -1.05654220 1.178566169
## Fjobhealth       -1.52678446 1.612786268
## Fjobother        -1.10568187 0.934387483
## Fjobservices     -1.35656225 0.778671398
## Fjobteacher      -1.33962093 1.423122841
## reasonhome       -0.81665784 0.363639782
## reasonother      -0.51033164 1.193011557
## reasonreputation -0.49505569 0.775195546
## guardianmother   -0.54429363 0.641996121
## guardianother    -1.15830776 0.902013135
## traveltime       -0.31127305 0.438607217
## studytime        -0.34497360 0.276644225
## failures         -0.47643785 0.248941581
## schoolsup        -0.34637763 1.122433574
## famsup           -0.45774206 0.595948996
## paid             -0.27776764 0.742190923
## activities       -0.84671730 0.093733926
## nursery          -0.81005707 0.367674398
## higher           -1.59380152 0.634541515
## internet         -0.69046461 0.634369695
## romantic         -0.75432680 0.268001184
## famrel            0.06518645 0.609954778
## freetime         -0.13546957 0.374873600
## goout            -0.21204344 0.284031507
## Dalc             -0.56524047 0.143693501
## Walc             -0.09911334 0.432559106
## health           -0.09734057 0.236009649
## absences          0.01635556 0.075544247
## G1                0.06901604 0.343054541
## G2                0.82180038 1.056225830
## 
## Call:
## lm(formula = G3 ~ ., data = port_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6879 -0.5536  0.1867  1.0543  4.8993 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.35049    2.45912   0.143  0.88677    
## schoolMS          0.49774    0.41931   1.187  0.23624    
## sexM              0.11195    0.27015   0.414  0.67890    
## age              -0.22802    0.11733  -1.943  0.05299 .  
## addressU          0.15458    0.31608   0.489  0.62519    
## famsizeLE3        0.06696    0.26479   0.253  0.80054    
## PstatusT         -0.30613    0.38231  -0.801  0.42399    
## Medu              0.02343    0.18218   0.129  0.89775    
## Fedu             -0.08043    0.15383  -0.523  0.60153    
## Mjobhealth       -0.15046    0.60642  -0.248  0.80424    
## Mjobother         0.17475    0.38401   0.455  0.64943    
## Mjobservices      0.20531    0.42486   0.483  0.62931    
## Mjobteacher       0.06101    0.56767   0.107  0.91449    
## Fjobhealth        0.04300    0.79739   0.054  0.95703    
## Fjobother        -0.08565    0.51814  -0.165  0.86883    
## Fjobservices     -0.28895    0.54231  -0.533  0.59460    
## Fjobteacher       0.04175    0.70168   0.060  0.95260    
## reasonhome       -0.22651    0.29977  -0.756  0.45054    
## reasonother       0.34134    0.43261   0.789  0.43079    
## reasonreputation  0.14007    0.32262   0.434  0.66451    
## guardianmother    0.04885    0.30129   0.162  0.87132    
## guardianother    -0.12815    0.52328  -0.245  0.80672    
## traveltime        0.06367    0.19045   0.334  0.73842    
## studytime        -0.03416    0.15788  -0.216  0.82884    
## failures         -0.11375    0.18423  -0.617  0.53747    
## schoolsup         0.38803    0.37305   1.040  0.29919    
## famsup            0.06910    0.26762   0.258  0.79643    
## paid              0.23221    0.25905   0.896  0.37083    
## activities       -0.37649    0.23886  -1.576  0.11613    
## nursery          -0.22119    0.29912  -0.739  0.46025    
## higher           -0.47963    0.56595  -0.847  0.39747    
## internet         -0.02805    0.33648  -0.083  0.93363    
## romantic         -0.24316    0.25965  -0.936  0.34984    
## famrel            0.33757    0.13836   2.440  0.01533 *  
## freetime          0.11970    0.12962   0.924  0.35656    
## goout             0.03599    0.12599   0.286  0.77534    
## Dalc             -0.21077    0.18005  -1.171  0.24277    
## Walc              0.16672    0.13503   1.235  0.21801    
## health            0.06933    0.08466   0.819  0.41354    
## absences          0.04595    0.01503   3.057  0.00246 ** 
## G1                0.20604    0.06960   2.960  0.00334 ** 
## G2                0.93901    0.05954  15.771  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.974 on 274 degrees of freedom
## Multiple R-squared:  0.8463, Adjusted R-squared:  0.8233 
## F-statistic: 36.81 on 41 and 274 DF,  p-value: < 2.2e-16
##                        2.5 %      97.5 %
## (Intercept)      -4.49068821 5.191664522
## schoolMS         -0.32773580 1.323214007
## sexM             -0.41987649 0.643774322
## age              -0.45900249 0.002966819
## addressU         -0.46767582 0.776844450
## famsizeLE3       -0.45431435 0.588244079
## PstatusT         -1.05877450 0.446521687
## Medu             -0.33521443 0.382077550
## Fedu             -0.38327357 0.222421688
## Mjobhealth       -1.34429986 1.043382521
## Mjobother        -0.58124422 0.930736198
## Mjobservices     -0.63109900 1.041723340
## Mjobteacher      -1.05654220 1.178566169
## Fjobhealth       -1.52678446 1.612786268
## Fjobother        -1.10568187 0.934387483
## Fjobservices     -1.35656225 0.778671398
## Fjobteacher      -1.33962093 1.423122841
## reasonhome       -0.81665784 0.363639782
## reasonother      -0.51033164 1.193011557
## reasonreputation -0.49505569 0.775195546
## guardianmother   -0.54429363 0.641996121
## guardianother    -1.15830776 0.902013135
## traveltime       -0.31127305 0.438607217
## studytime        -0.34497360 0.276644225
## failures         -0.47643785 0.248941581
## schoolsup        -0.34637763 1.122433574
## famsup           -0.45774206 0.595948996
## paid             -0.27776764 0.742190923
## activities       -0.84671730 0.093733926
## nursery          -0.81005707 0.367674398
## higher           -1.59380152 0.634541515
## internet         -0.69046461 0.634369695
## romantic         -0.75432680 0.268001184
## famrel            0.06518645 0.609954778
## freetime         -0.13546957 0.374873600
## goout            -0.21204344 0.284031507
## Dalc             -0.56524047 0.143693501
## Walc             -0.09911334 0.432559106
## health           -0.09734057 0.236009649
## absences          0.01635556 0.075544247
## G1                0.06901604 0.343054541
## G2                0.82180038 1.056225830

Model 2

We create a second linear model, this time keeping only the variables selected by the stepAIC function, which performs stepwise selection by AIC. The second model performs better than Model 1: the residual standard error is smaller (1.91 vs 1.974), and the Adjusted R-squared (0.8346 vs 0.8233) and the F-statistic (228.1 vs 36.81) have improved. The p-value is essentially the same. The residual plots look similar to those of Model 1, and since we still see the skewness we will use glm for our next model.
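
A minimal sketch of the selection step on simulated data follows. The data and the wrapper name select_by_aic are assumptions; the wrapper calls MASS::stepAIC when the MASS package is installed and otherwise falls back to the base R function step(), which also selects by AIC. Because the criterion is AIC rather than a p-value threshold, a retained term need not be significant at the 5% level (Walc in the output below is an example).

```r
# Simulated stand-in data: G3 depends on G2 only; the other columns are noise.
set.seed(2)
n <- 80
demo_train <- data.frame(
  G2     = round(runif(n, 4, 18)),
  noise1 = rnorm(n),
  noise2 = rnorm(n)
)
demo_train$G3 <- demo_train$G2 + rnorm(n)

full_model <- lm(G3 ~ ., data = demo_train)

# Hypothetical wrapper: stepwise selection by AIC in both directions.
select_by_aic <- function(model) {
  if (requireNamespace("MASS", quietly = TRUE)) {
    MASS::stepAIC(model, direction = "both", trace = FALSE)
  } else {
    step(model, direction = "both", trace = 0)  # base R fallback
  }
}

model2 <- select_by_aic(full_model)
formula(model2)   # terms kept by the selection
AIC(full_model)
AIC(model2)       # never higher than the AIC of the full model
```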

## 
## Call:
## lm(formula = G3 ~ age + activities + famrel + Walc + absences + 
##     G1 + G2, data = maths_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7867 -0.5062  0.2511  1.0438  4.2313 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.12327    1.54795   0.080  0.93658    
## age         -0.23169    0.08507  -2.724  0.00683 ** 
## activities  -0.40157    0.21636  -1.856  0.06440 .  
## famrel       0.38918    0.12427   3.132  0.00191 ** 
## Walc         0.13070    0.08770   1.490  0.13717    
## absences     0.03967    0.01306   3.037  0.00259 ** 
## G1           0.18729    0.06003   3.120  0.00198 ** 
## G2           0.95443    0.05265  18.127  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.91 on 308 degrees of freedom
## Multiple R-squared:  0.8383, Adjusted R-squared:  0.8346 
## F-statistic: 228.1 on 7 and 308 DF,  p-value: < 2.2e-16
##                   2.5 %      97.5 %
## (Intercept) -2.92262230  3.16916866
## age         -0.39907895 -0.06429662
## activities  -0.82729521  0.02414932
## famrel       0.14464860  0.63371896
## Walc        -0.04186637  0.30326278
## absences     0.01397166  0.06537164
## G1           0.06917031  0.30541952
## G2           0.85083148  1.05803670
## 
## Call:
## lm(formula = G3 ~ age + activities + famrel + Walc + absences + 
##     G1 + G2, data = port_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7867 -0.5062  0.2511  1.0438  4.2313 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.12327    1.54795   0.080  0.93658    
## age         -0.23169    0.08507  -2.724  0.00683 ** 
## activities  -0.40157    0.21636  -1.856  0.06440 .  
## famrel       0.38918    0.12427   3.132  0.00191 ** 
## Walc         0.13070    0.08770   1.490  0.13717    
## absences     0.03967    0.01306   3.037  0.00259 ** 
## G1           0.18729    0.06003   3.120  0.00198 ** 
## G2           0.95443    0.05265  18.127  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.91 on 308 degrees of freedom
## Multiple R-squared:  0.8383, Adjusted R-squared:  0.8346 
## F-statistic: 228.1 on 7 and 308 DF,  p-value: < 2.2e-16
##                   2.5 %      97.5 %
## (Intercept) -2.92262230  3.16916866
## age         -0.39907895 -0.06429662
## activities  -0.82729521  0.02414932
## famrel       0.14464860  0.63371896
## Walc        -0.04186637  0.30326278
## absences     0.01397166  0.06537164
## G1           0.06917031  0.30541952
## G2           0.85083148  1.05803670

Model 3

We create the third model using a generalized linear model with the Gaussian family, again using stepAIC to select the variables.
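
The output below reports the same coefficients for this model as for Model 2. That is expected: a Gaussian glm with the default identity link fits the same least-squares model as lm. The sketch below checks this on simulated data (an assumption, not the student data).

```r
# Simulated stand-in data.
set.seed(3)
n <- 50
demo_train <- data.frame(
  G2     = round(runif(n, 4, 18)),
  famrel = sample(1:5, n, replace = TRUE)
)
demo_train$G3 <- 0.95 * demo_train$G2 + 0.4 * demo_train$famrel + rnorm(n)

lm_fit  <- lm(G3 ~ G2 + famrel, data = demo_train)
glm_fit <- glm(G3 ~ G2 + famrel, family = gaussian, data = demo_train)

# Same coefficients and same AIC from both fitting functions.
all.equal(coef(lm_fit), coef(glm_fit))
all.equal(AIC(lm_fit), AIC(glm_fit))
```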

## 
## Call:
## glm(formula = G3 ~ age + activities + famrel + Walc + absences + 
##     G1 + G2, family = gaussian, data = maths_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.7867  -0.5062   0.2511   1.0438   4.2313  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.12327    1.54795   0.080  0.93658    
## age         -0.23169    0.08507  -2.724  0.00683 ** 
## activities  -0.40157    0.21636  -1.856  0.06440 .  
## famrel       0.38918    0.12427   3.132  0.00191 ** 
## Walc         0.13070    0.08770   1.490  0.13717    
## absences     0.03967    0.01306   3.037  0.00259 ** 
## G1           0.18729    0.06003   3.120  0.00198 ** 
## G2           0.95443    0.05265  18.127  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 3.646842)
## 
##     Null deviance: 6946.9  on 315  degrees of freedom
## Residual deviance: 1123.2  on 308  degrees of freedom
## AIC: 1315.5
## 
## Number of Fisher Scoring iterations: 2
## 
## Call:
## glm(formula = G3 ~ age + activities + famrel + Walc + absences + 
##     G1 + G2, family = gaussian, data = port_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.7867  -0.5062   0.2511   1.0438   4.2313  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.12327    1.54795   0.080  0.93658    
## age         -0.23169    0.08507  -2.724  0.00683 ** 
## activities  -0.40157    0.21636  -1.856  0.06440 .  
## famrel       0.38918    0.12427   3.132  0.00191 ** 
## Walc         0.13070    0.08770   1.490  0.13717    
## absences     0.03967    0.01306   3.037  0.00259 ** 
## G1           0.18729    0.06003   3.120  0.00198 ** 
## G2           0.95443    0.05265  18.127  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 3.646842)
## 
##     Null deviance: 6946.9  on 315  degrees of freedom
## Residual deviance: 1123.2  on 308  degrees of freedom
## AIC: 1315.5
## 
## Number of Fisher Scoring iterations: 2

Model selection

The model values reported above are the same for the Mathematics and Portuguese datasets, so we choose a single model for both. We reject Model 1 since it has a lower Adjusted R-squared, a higher residual standard error and a higher AIC than Model 2 and Model 3.

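Model 3, the generalized linear model, is our choice for both the Mathematics and Portuguese datasets since the R-squared value reported for it (0.8383) is slightly higher than the Adjusted R-squared of Model 2 (0.8346). The residual standard error and AIC of the two models are identical.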

The most significant variables for a student's final grade are: student's age, extra-curricular activities, quality of family relationships, weekend alcohol consumption, number of school absences, first period grade and second period grade.

Mathematics

## fitting null model for pseudo-r2
                         LM 1          LM 2          GLM
Residual standard error  1.9737994     1.9096706     1.9096706
Adjusted R-squared       0.8233445     0.8346371     0.8383118
Predictors               42.0000000    8.0000000     8.0000000
AIC                      1367.4380030  1315.5263920  1315.5263920

Portuguese

## fitting null model for pseudo-r2
                         LM 1          LM 2          GLM
Residual standard error  1.9737994     1.9096706     1.9096706
Adjusted R-squared       0.8233445     0.8346371     0.8383118
Predictors               42.0000000    8.0000000     8.0000000
AIC                      1367.4380030  1315.5263920  1315.5263920
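
A comparison table of this shape could be assembled roughly as follows. This is a sketch on simulated data, not the original code. One swapped-in detail is labeled plainly: the report prints a pseudo-R-squared message for the GLM, whereas the sketch uses 1 minus the ratio of residual to null deviance, which for a Gaussian model equals the multiple (not adjusted) R-squared. The Predictors row is taken to be the number of coefficients including the intercept, which matches the 42 and 8 shown above.

```r
# Simulated stand-in data.
set.seed(4)
n <- 60
d <- data.frame(G1 = round(runif(n, 4, 18)), noise = rnorm(n))
d$G2 <- d$G1 + rnorm(n)
d$G3 <- d$G2 + rnorm(n)

lm1  <- lm(G3 ~ ., data = d)                               # all variables
lm2  <- lm(G3 ~ G1 + G2, data = d)                         # reduced linear model
glm3 <- glm(G3 ~ G1 + G2, family = gaussian, data = d)     # same terms via glm

# R-squared analogue for the GLM: 1 - residual deviance / null deviance.
glm_r2 <- 1 - glm3$deviance / glm3$null.deviance

comparison <- data.frame(
  "LM 1" = c(sigma(lm1), summary(lm1)$adj.r.squared, length(coef(lm1)), AIC(lm1)),
  "LM 2" = c(sigma(lm2), summary(lm2)$adj.r.squared, length(coef(lm2)), AIC(lm2)),
  "GLM"  = c(sigma(glm3), glm_r2, length(coef(glm3)), AIC(glm3)),
  row.names = c("Residual standard error", "R-squared (adjusted for LM)",
                "Predictors", "AIC"),
  check.names = FALSE
)
comparison
```

Because the GLM column holds an unadjusted R-squared, it is at least as large as the Adjusted R-squared of the linear model with the same terms, so a small gap between those two entries does not by itself indicate a better fit.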

References

P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.

https://rdrr.io/r/stats/sigma.html

https://www.statology.org/glm-r-squared/

https://archive.ics.uci.edu/ml/datasets/student+performance