For this project I will be using the Student Performance dataset from the UCI Machine Learning Repository. The repository has two datasets: one for Mathematics (student-mat.csv) and one for Portuguese language (student-por.csv). In this project we use these datasets to build models that predict the final grades in Mathematics and Portuguese from the students' earlier grades and demographic, social and school-related features.
Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:
1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
3 age - student’s age (numeric: from 15 to 22)
4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
12 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16 schoolsup - extra educational support (binary: yes or no)
17 famsup - family educational support (binary: yes or no)
18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19 activities - extra-curricular activities (binary: yes or no)
20 nursery - attended nursery school (binary: yes or no)
21 higher - wants to take higher education (binary: yes or no)
22 internet - Internet access at home (binary: yes or no)
23 romantic - with a romantic relationship (binary: yes or no)
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)
These grades are related with the course subject, Math or Portuguese:
31 G1 - first period grade (numeric: from 0 to 20)
32 G2 - second period grade (numeric: from 0 to 20)
33 G3 - final grade (numeric: from 0 to 20, output target)
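A minimal sketch of loading one of these files in R. It assumes the UCI files are semicolon-separated with a header row; `read_students` is a hypothetical helper name, and the inline rows are made up so that the sketch runs without the real files.

```r
# Hypothetical helper: read one of the student files.
# Assumption: fields are separated by ";" and the first row holds column names.
read_students <- function(path) {
  read.csv(path, sep = ";", header = TRUE, stringsAsFactors = FALSE)
}

# Made-up sample rows standing in for student-mat.csv.
sample_text <- "school;sex;age;G1;G2;G3
GP;F;18;5;6;6
GP;M;17;14;15;15
MS;F;16;10;11;12"
sample_file <- tempfile(fileext = ".csv")
writeLines(sample_text, sample_file)

students <- read_students(sample_file)
dim(students)   # number of rows and columns
str(students)   # column types
```

With the real files in the working directory, the same helper would be called as `read_students("student-mat.csv")` and `read_students("student-por.csv")`.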
Dimension for Maths dataset
## [1] 395 33
Dimension for Portuguese dataset
## [1] 649 33
## school sex age address
## Length:395 Length:395 Min. :15.0 Length:395
## Class :character Class :character 1st Qu.:16.0 Class :character
## Mode :character Mode :character Median :17.0 Mode :character
## Mean :16.7
## 3rd Qu.:18.0
## Max. :22.0
## famsize Pstatus Medu Fedu
## Length:395 Length:395 Min. :0.000 Min. :0.000
## Class :character Class :character 1st Qu.:2.000 1st Qu.:2.000
## Mode :character Mode :character Median :3.000 Median :2.000
## Mean :2.749 Mean :2.522
## 3rd Qu.:4.000 3rd Qu.:3.000
## Max. :4.000 Max. :4.000
## Mjob Fjob reason guardian
## Length:395 Length:395 Length:395 Length:395
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## traveltime studytime failures schoolsup
## Min. :1.000 Min. :1.000 Min. :0.0000 Length:395
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.0000 Class :character
## Median :1.000 Median :2.000 Median :0.0000 Mode :character
## Mean :1.448 Mean :2.035 Mean :0.3342
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :4.000 Max. :3.0000
## famsup paid activities nursery
## Length:395 Length:395 Length:395 Length:395
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## higher internet romantic famrel
## Length:395 Length:395 Length:395 Min. :1.000
## Class :character Class :character Class :character 1st Qu.:4.000
## Mode :character Mode :character Mode :character Median :4.000
## Mean :3.944
## 3rd Qu.:5.000
## Max. :5.000
## freetime goout Dalc Walc
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
## Median :3.000 Median :3.000 Median :1.000 Median :2.000
## Mean :3.235 Mean :3.109 Mean :1.481 Mean :2.291
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:3.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## health absences G1 G2
## Min. :1.000 Min. : 0.000 Min. : 3.00 Min. : 0.00
## 1st Qu.:3.000 1st Qu.: 0.000 1st Qu.: 8.00 1st Qu.: 9.00
## Median :4.000 Median : 4.000 Median :11.00 Median :11.00
## Mean :3.554 Mean : 5.709 Mean :10.91 Mean :10.71
## 3rd Qu.:5.000 3rd Qu.: 8.000 3rd Qu.:13.00 3rd Qu.:13.00
## Max. :5.000 Max. :75.000 Max. :19.00 Max. :19.00
## G3
## Min. : 0.00
## 1st Qu.: 8.00
## Median :11.00
## Mean :10.42
## 3rd Qu.:14.00
## Max. :20.00
## school sex age address
## Length:649 Length:649 Min. :15.00 Length:649
## Class :character Class :character 1st Qu.:16.00 Class :character
## Mode :character Mode :character Median :17.00 Mode :character
## Mean :16.74
## 3rd Qu.:18.00
## Max. :22.00
## famsize Pstatus Medu Fedu
## Length:649 Length:649 Min. :0.000 Min. :0.000
## Class :character Class :character 1st Qu.:2.000 1st Qu.:1.000
## Mode :character Mode :character Median :2.000 Median :2.000
## Mean :2.515 Mean :2.307
## 3rd Qu.:4.000 3rd Qu.:3.000
## Max. :4.000 Max. :4.000
## Mjob Fjob reason guardian
## Length:649 Length:649 Length:649 Length:649
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## traveltime studytime failures schoolsup
## Min. :1.000 Min. :1.000 Min. :0.0000 Length:649
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.0000 Class :character
## Median :1.000 Median :2.000 Median :0.0000 Mode :character
## Mean :1.569 Mean :1.931 Mean :0.2219
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :4.000 Max. :3.0000
## famsup paid activities nursery
## Length:649 Length:649 Length:649 Length:649
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## higher internet romantic famrel
## Length:649 Length:649 Length:649 Min. :1.000
## Class :character Class :character Class :character 1st Qu.:4.000
## Mode :character Mode :character Mode :character Median :4.000
## Mean :3.931
## 3rd Qu.:5.000
## Max. :5.000
## freetime goout Dalc Walc health
## Min. :1.00 Min. :1.000 Min. :1.000 Min. :1.00 Min. :1.000
## 1st Qu.:3.00 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:2.000
## Median :3.00 Median :3.000 Median :1.000 Median :2.00 Median :4.000
## Mean :3.18 Mean :3.185 Mean :1.502 Mean :2.28 Mean :3.536
## 3rd Qu.:4.00 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:3.00 3rd Qu.:5.000
## Max. :5.00 Max. :5.000 Max. :5.000 Max. :5.00 Max. :5.000
## absences G1 G2 G3
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.:10.0 1st Qu.:10.00 1st Qu.:10.00
## Median : 2.000 Median :11.0 Median :11.00 Median :12.00
## Mean : 3.659 Mean :11.4 Mean :11.57 Mean :11.91
## 3rd Qu.: 6.000 3rd Qu.:13.0 3rd Qu.:13.00 3rd Qu.:14.00
## Max. :32.000 Max. :19.0 Max. :19.00 Max. :19.00
We clean both datasets by replacing yes/no with 1/0. The cleaned datasets are maths_clean and port_clean.
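The recoding step can be sketched as follows. This is one possible approach, not necessarily the code used for this report: `recode_yes_no` is a hypothetical helper, and the data frame holds made-up values.

```r
# Toy data frame (made-up values) with yes/no columns like those in the dataset.
students <- data.frame(
  schoolsup = c("yes", "no", "no"),
  internet  = c("no", "yes", "yes"),
  Mjob      = c("teacher", "other", "health"),
  G3        = c(10, 12, 15),
  stringsAsFactors = FALSE
)

# Hypothetical helper: recode every character column that contains only
# "yes"/"no" to 1/0 and leave all other columns untouched.
recode_yes_no <- function(df) {
  is_yes_no <- vapply(
    df,
    function(col) is.character(col) && all(col %in% c("yes", "no")),
    logical(1)
  )
  df[is_yes_no] <- lapply(df[is_yes_no], function(col) as.integer(col == "yes"))
  df
}

students_clean <- recode_yes_no(students)
str(students_clean)
```

Nominal columns such as Mjob stay as text, so `lm` still expands them into dummy variables.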
From the correlation matrix for both datasets we see a strong positive relationship among the grades G1, G2 and G3. The variables age, Medu, Fedu, traveltime, studytime, failures, schoolsup, famsup, paid, activities, nursery, higher, internet, romantic, famrel, freetime, goout, Dalc, Walc, health and absences are positively correlated, and the others are negatively correlated.
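A correlation matrix of this kind can be computed with `cor()` on the numeric columns. The sketch below uses synthetic data in which G1, G2 and G3 are built to be related; the values are illustrative only and are not taken from the student datasets.

```r
set.seed(1)
n <- 100

# Synthetic grades: G2 depends on G1, and G3 depends on G2.
G1 <- round(rnorm(n, mean = 11, sd = 3))
toy <- data.frame(
  G1       = G1,
  G2       = G1 + rnorm(n, sd = 1),
  absences = rpois(n, lambda = 5),
  sex      = sample(c("F", "M"), n, replace = TRUE)
)
toy$G3 <- toy$G2 + rnorm(n, sd = 1)

# cor() needs numeric input, so drop the character column first.
numeric_cols <- vapply(toy, is.numeric, logical(1))
cor_matrix <- cor(toy[, numeric_cols])
round(cor_matrix, 2)

# Correlation of each numeric variable with the target G3, largest first.
sort(cor_matrix[, "G3"], decreasing = TRUE)
```

Reading off the G3 column makes it explicit which variable each correlation is measured against.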
From the histograms below for the Maths and Portuguese datasets, we see that the grades G1, G2 and G3 are approximately normally distributed. Absences are right-skewed in both datasets.
In the boxplots below we see a high interquartile range in the variables activities, famsup, paid and romantic.
We split the cleaned datasets maths_clean and port_clean into a training set and a testing set: 80% for training and 20% for testing.
We start by building a linear model that includes all of the variables. The residual plots show no pattern in the distribution, the residual histogram is approximately normal, and the Q-Q plot closely follows the normal line, with slight skew in the left tail.
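An 80/20 split can be sketched with base R's `sample()`. The seed and the use of simple random sampling are assumptions; the report does not say how the split was drawn. For the 395-row Mathematics data, 80% gives 316 training rows, which agrees with the degrees of freedom in the model output (274 residual plus 42 coefficients).

```r
set.seed(123)  # assumed seed, only so that the split is reproducible

# Toy data standing in for maths_clean (made-up values).
toy <- data.frame(id = 1:100, G3 = sample(0:20, 100, replace = TRUE))

# Draw 80% of the row indices for training; the remaining rows form the test set.
train_idx <- sample(seq_len(nrow(toy)), size = floor(0.8 * nrow(toy)))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

nrow(toy_train)
nrow(toy_test)
```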
##
## Call:
## lm(formula = G3 ~ ., data = maths_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6879 -0.5536 0.1867 1.0543 4.8993
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.35049 2.45912 0.143 0.88677
## schoolMS 0.49774 0.41931 1.187 0.23624
## sexM 0.11195 0.27015 0.414 0.67890
## age -0.22802 0.11733 -1.943 0.05299 .
## addressU 0.15458 0.31608 0.489 0.62519
## famsizeLE3 0.06696 0.26479 0.253 0.80054
## PstatusT -0.30613 0.38231 -0.801 0.42399
## Medu 0.02343 0.18218 0.129 0.89775
## Fedu -0.08043 0.15383 -0.523 0.60153
## Mjobhealth -0.15046 0.60642 -0.248 0.80424
## Mjobother 0.17475 0.38401 0.455 0.64943
## Mjobservices 0.20531 0.42486 0.483 0.62931
## Mjobteacher 0.06101 0.56767 0.107 0.91449
## Fjobhealth 0.04300 0.79739 0.054 0.95703
## Fjobother -0.08565 0.51814 -0.165 0.86883
## Fjobservices -0.28895 0.54231 -0.533 0.59460
## Fjobteacher 0.04175 0.70168 0.060 0.95260
## reasonhome -0.22651 0.29977 -0.756 0.45054
## reasonother 0.34134 0.43261 0.789 0.43079
## reasonreputation 0.14007 0.32262 0.434 0.66451
## guardianmother 0.04885 0.30129 0.162 0.87132
## guardianother -0.12815 0.52328 -0.245 0.80672
## traveltime 0.06367 0.19045 0.334 0.73842
## studytime -0.03416 0.15788 -0.216 0.82884
## failures -0.11375 0.18423 -0.617 0.53747
## schoolsup 0.38803 0.37305 1.040 0.29919
## famsup 0.06910 0.26762 0.258 0.79643
## paid 0.23221 0.25905 0.896 0.37083
## activities -0.37649 0.23886 -1.576 0.11613
## nursery -0.22119 0.29912 -0.739 0.46025
## higher -0.47963 0.56595 -0.847 0.39747
## internet -0.02805 0.33648 -0.083 0.93363
## romantic -0.24316 0.25965 -0.936 0.34984
## famrel 0.33757 0.13836 2.440 0.01533 *
## freetime 0.11970 0.12962 0.924 0.35656
## goout 0.03599 0.12599 0.286 0.77534
## Dalc -0.21077 0.18005 -1.171 0.24277
## Walc 0.16672 0.13503 1.235 0.21801
## health 0.06933 0.08466 0.819 0.41354
## absences 0.04595 0.01503 3.057 0.00246 **
## G1 0.20604 0.06960 2.960 0.00334 **
## G2 0.93901 0.05954 15.771 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.974 on 274 degrees of freedom
## Multiple R-squared: 0.8463, Adjusted R-squared: 0.8233
## F-statistic: 36.81 on 41 and 274 DF, p-value: < 2.2e-16
## 2.5 % 97.5 %
## (Intercept) -4.49068821 5.191664522
## schoolMS -0.32773580 1.323214007
## sexM -0.41987649 0.643774322
## age -0.45900249 0.002966819
## addressU -0.46767582 0.776844450
## famsizeLE3 -0.45431435 0.588244079
## PstatusT -1.05877450 0.446521687
## Medu -0.33521443 0.382077550
## Fedu -0.38327357 0.222421688
## Mjobhealth -1.34429986 1.043382521
## Mjobother -0.58124422 0.930736198
## Mjobservices -0.63109900 1.041723340
## Mjobteacher -1.05654220 1.178566169
## Fjobhealth -1.52678446 1.612786268
## Fjobother -1.10568187 0.934387483
## Fjobservices -1.35656225 0.778671398
## Fjobteacher -1.33962093 1.423122841
## reasonhome -0.81665784 0.363639782
## reasonother -0.51033164 1.193011557
## reasonreputation -0.49505569 0.775195546
## guardianmother -0.54429363 0.641996121
## guardianother -1.15830776 0.902013135
## traveltime -0.31127305 0.438607217
## studytime -0.34497360 0.276644225
## failures -0.47643785 0.248941581
## schoolsup -0.34637763 1.122433574
## famsup -0.45774206 0.595948996
## paid -0.27776764 0.742190923
## activities -0.84671730 0.093733926
## nursery -0.81005707 0.367674398
## higher -1.59380152 0.634541515
## internet -0.69046461 0.634369695
## romantic -0.75432680 0.268001184
## famrel 0.06518645 0.609954778
## freetime -0.13546957 0.374873600
## goout -0.21204344 0.284031507
## Dalc -0.56524047 0.143693501
## Walc -0.09911334 0.432559106
## health -0.09734057 0.236009649
## absences 0.01635556 0.075544247
## G1 0.06901604 0.343054541
## G2 0.82180038 1.056225830
##
## Call:
## lm(formula = G3 ~ ., data = port_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6879 -0.5536 0.1867 1.0543 4.8993
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.35049 2.45912 0.143 0.88677
## schoolMS 0.49774 0.41931 1.187 0.23624
## sexM 0.11195 0.27015 0.414 0.67890
## age -0.22802 0.11733 -1.943 0.05299 .
## addressU 0.15458 0.31608 0.489 0.62519
## famsizeLE3 0.06696 0.26479 0.253 0.80054
## PstatusT -0.30613 0.38231 -0.801 0.42399
## Medu 0.02343 0.18218 0.129 0.89775
## Fedu -0.08043 0.15383 -0.523 0.60153
## Mjobhealth -0.15046 0.60642 -0.248 0.80424
## Mjobother 0.17475 0.38401 0.455 0.64943
## Mjobservices 0.20531 0.42486 0.483 0.62931
## Mjobteacher 0.06101 0.56767 0.107 0.91449
## Fjobhealth 0.04300 0.79739 0.054 0.95703
## Fjobother -0.08565 0.51814 -0.165 0.86883
## Fjobservices -0.28895 0.54231 -0.533 0.59460
## Fjobteacher 0.04175 0.70168 0.060 0.95260
## reasonhome -0.22651 0.29977 -0.756 0.45054
## reasonother 0.34134 0.43261 0.789 0.43079
## reasonreputation 0.14007 0.32262 0.434 0.66451
## guardianmother 0.04885 0.30129 0.162 0.87132
## guardianother -0.12815 0.52328 -0.245 0.80672
## traveltime 0.06367 0.19045 0.334 0.73842
## studytime -0.03416 0.15788 -0.216 0.82884
## failures -0.11375 0.18423 -0.617 0.53747
## schoolsup 0.38803 0.37305 1.040 0.29919
## famsup 0.06910 0.26762 0.258 0.79643
## paid 0.23221 0.25905 0.896 0.37083
## activities -0.37649 0.23886 -1.576 0.11613
## nursery -0.22119 0.29912 -0.739 0.46025
## higher -0.47963 0.56595 -0.847 0.39747
## internet -0.02805 0.33648 -0.083 0.93363
## romantic -0.24316 0.25965 -0.936 0.34984
## famrel 0.33757 0.13836 2.440 0.01533 *
## freetime 0.11970 0.12962 0.924 0.35656
## goout 0.03599 0.12599 0.286 0.77534
## Dalc -0.21077 0.18005 -1.171 0.24277
## Walc 0.16672 0.13503 1.235 0.21801
## health 0.06933 0.08466 0.819 0.41354
## absences 0.04595 0.01503 3.057 0.00246 **
## G1 0.20604 0.06960 2.960 0.00334 **
## G2 0.93901 0.05954 15.771 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.974 on 274 degrees of freedom
## Multiple R-squared: 0.8463, Adjusted R-squared: 0.8233
## F-statistic: 36.81 on 41 and 274 DF, p-value: < 2.2e-16
## 2.5 % 97.5 %
## (Intercept) -4.49068821 5.191664522
## schoolMS -0.32773580 1.323214007
## sexM -0.41987649 0.643774322
## age -0.45900249 0.002966819
## addressU -0.46767582 0.776844450
## famsizeLE3 -0.45431435 0.588244079
## PstatusT -1.05877450 0.446521687
## Medu -0.33521443 0.382077550
## Fedu -0.38327357 0.222421688
## Mjobhealth -1.34429986 1.043382521
## Mjobother -0.58124422 0.930736198
## Mjobservices -0.63109900 1.041723340
## Mjobteacher -1.05654220 1.178566169
## Fjobhealth -1.52678446 1.612786268
## Fjobother -1.10568187 0.934387483
## Fjobservices -1.35656225 0.778671398
## Fjobteacher -1.33962093 1.423122841
## reasonhome -0.81665784 0.363639782
## reasonother -0.51033164 1.193011557
## reasonreputation -0.49505569 0.775195546
## guardianmother -0.54429363 0.641996121
## guardianother -1.15830776 0.902013135
## traveltime -0.31127305 0.438607217
## studytime -0.34497360 0.276644225
## failures -0.47643785 0.248941581
## schoolsup -0.34637763 1.122433574
## famsup -0.45774206 0.595948996
## paid -0.27776764 0.742190923
## activities -0.84671730 0.093733926
## nursery -0.81005707 0.367674398
## higher -1.59380152 0.634541515
## internet -0.69046461 0.634369695
## romantic -0.75432680 0.268001184
## famrel 0.06518645 0.609954778
## freetime -0.13546957 0.374873600
## goout -0.21204344 0.284031507
## Dalc -0.56524047 0.143693501
## Walc -0.09911334 0.432559106
## health -0.09734057 0.236009649
## absences 0.01635556 0.075544247
## G1 0.06901604 0.343054541
## G2 0.82180038 1.056225830
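Output of the kind shown above comes from `summary()` and `confint()` applied to an `lm` fit. The sketch below reproduces the pattern on synthetic data; the variable names mirror the dataset, but the values and coefficients are made up.

```r
set.seed(42)
n <- 200

# Synthetic data: G3 is built mostly from G2, with a smaller contribution from G1.
toy <- data.frame(
  G1       = rnorm(n, mean = 11, sd = 3),
  absences = rpois(n, lambda = 5),
  sex      = factor(sample(c("F", "M"), n, replace = TRUE))
)
toy$G2 <- toy$G1 + rnorm(n, sd = 1.5)
toy$G3 <- 0.2 * toy$G1 + 0.9 * toy$G2 + rnorm(n, sd = 2)

# "G3 ~ ." regresses G3 on every other column in the data frame.
model1 <- lm(G3 ~ ., data = toy)

summary(model1)                 # coefficients, residual standard error, R-squared
confint(model1, level = 0.95)   # 2.5 % and 97.5 % bounds for each coefficient
```

A coefficient whose interval excludes zero corresponds to a p-value below 0.05 in the summary table.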
We create a second model, again a linear model, this time keeping only the variables selected by the stepAIC function (stepwise selection by AIC). The second model performs better than Model 1: the residual standard error is smaller (1.91 vs. 1.974), and the adjusted R-squared (0.8346 vs. 0.8233) and F-statistic (228.1 vs. 36.81) have improved. The p-value is essentially the same. The residual plots look similar to those of Model 1, and since we still see skewness we will use glm for our next model.
##
## Call:
## lm(formula = G3 ~ age + activities + famrel + Walc + absences +
## G1 + G2, data = maths_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7867 -0.5062 0.2511 1.0438 4.2313
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.12327 1.54795 0.080 0.93658
## age -0.23169 0.08507 -2.724 0.00683 **
## activities -0.40157 0.21636 -1.856 0.06440 .
## famrel 0.38918 0.12427 3.132 0.00191 **
## Walc 0.13070 0.08770 1.490 0.13717
## absences 0.03967 0.01306 3.037 0.00259 **
## G1 0.18729 0.06003 3.120 0.00198 **
## G2 0.95443 0.05265 18.127 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.91 on 308 degrees of freedom
## Multiple R-squared: 0.8383, Adjusted R-squared: 0.8346
## F-statistic: 228.1 on 7 and 308 DF, p-value: < 2.2e-16
## 2.5 % 97.5 %
## (Intercept) -2.92262230 3.16916866
## age -0.39907895 -0.06429662
## activities -0.82729521 0.02414932
## famrel 0.14464860 0.63371896
## Walc -0.04186637 0.30326278
## absences 0.01397166 0.06537164
## G1 0.06917031 0.30541952
## G2 0.85083148 1.05803670
##
## Call:
## lm(formula = G3 ~ age + activities + famrel + Walc + absences +
## G1 + G2, data = port_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7867 -0.5062 0.2511 1.0438 4.2313
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.12327 1.54795 0.080 0.93658
## age -0.23169 0.08507 -2.724 0.00683 **
## activities -0.40157 0.21636 -1.856 0.06440 .
## famrel 0.38918 0.12427 3.132 0.00191 **
## Walc 0.13070 0.08770 1.490 0.13717
## absences 0.03967 0.01306 3.037 0.00259 **
## G1 0.18729 0.06003 3.120 0.00198 **
## G2 0.95443 0.05265 18.127 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.91 on 308 degrees of freedom
## Multiple R-squared: 0.8383, Adjusted R-squared: 0.8346
## F-statistic: 228.1 on 7 and 308 DF, p-value: < 2.2e-16
## 2.5 % 97.5 %
## (Intercept) -2.92262230 3.16916866
## age -0.39907895 -0.06429662
## activities -0.82729521 0.02414932
## famrel 0.14464860 0.63371896
## Walc -0.04186637 0.30326278
## absences 0.01397166 0.06537164
## G1 0.06917031 0.30541952
## G2 0.85083148 1.05803670
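A minimal sketch of stepwise selection with `stepAIC` from the MASS package, on synthetic data. The `direction = "both"` setting is an assumption, since the report does not state which direction was used. Selection is driven by AIC rather than by p-values, which is why a retained variable can still have a p-value above 0.05.

```r
library(MASS)

set.seed(7)
n <- 200

# Synthetic data: G3 depends on G2 and absences; the noise columns are irrelevant.
toy <- data.frame(
  G2       = rnorm(n, mean = 11, sd = 3),
  absences = rpois(n, lambda = 5),
  noise1   = rnorm(n),
  noise2   = rnorm(n)
)
toy$G3 <- 0.95 * toy$G2 + 0.05 * toy$absences + rnorm(n, sd = 2)

full_model <- lm(G3 ~ ., data = toy)

# Start from the full model and add or drop terms while AIC keeps decreasing.
step_model <- stepAIC(full_model, direction = "both", trace = FALSE)

formula(step_model)           # terms that survived selection
AIC(full_model, step_model)   # the selected model has an AIC no higher than the full model
```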
We create the third model using a generalized linear model, again using stepAIC to select the variables.
##
## Call:
## glm(formula = G3 ~ age + activities + famrel + Walc + absences +
## G1 + G2, family = gaussian, data = maths_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.7867 -0.5062 0.2511 1.0438 4.2313
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.12327 1.54795 0.080 0.93658
## age -0.23169 0.08507 -2.724 0.00683 **
## activities -0.40157 0.21636 -1.856 0.06440 .
## famrel 0.38918 0.12427 3.132 0.00191 **
## Walc 0.13070 0.08770 1.490 0.13717
## absences 0.03967 0.01306 3.037 0.00259 **
## G1 0.18729 0.06003 3.120 0.00198 **
## G2 0.95443 0.05265 18.127 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 3.646842)
##
## Null deviance: 6946.9 on 315 degrees of freedom
## Residual deviance: 1123.2 on 308 degrees of freedom
## AIC: 1315.5
##
## Number of Fisher Scoring iterations: 2
##
## Call:
## glm(formula = G3 ~ age + activities + famrel + Walc + absences +
## G1 + G2, family = gaussian, data = port_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.7867 -0.5062 0.2511 1.0438 4.2313
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.12327 1.54795 0.080 0.93658
## age -0.23169 0.08507 -2.724 0.00683 **
## activities -0.40157 0.21636 -1.856 0.06440 .
## famrel 0.38918 0.12427 3.132 0.00191 **
## Walc 0.13070 0.08770 1.490 0.13717
## absences 0.03967 0.01306 3.037 0.00259 **
## G1 0.18729 0.06003 3.120 0.00198 **
## G2 0.95443 0.05265 18.127 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 3.646842)
##
## Null deviance: 6946.9 on 315 degrees of freedom
## Residual deviance: 1123.2 on 308 degrees of freedom
## AIC: 1315.5
##
## Number of Fisher Scoring iterations: 2
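The GLM output above has the same coefficients and AIC as the second linear model, and its dispersion parameter (3.646842) is the square of that model's residual standard error (1.91). This is expected when `glm` is fit with the Gaussian family and its default identity link. The sketch below checks the equivalence on synthetic data (made-up values).

```r
set.seed(11)
n <- 150

toy <- data.frame(G2 = rnorm(n, mean = 11, sd = 3), absences = rpois(n, lambda = 5))
toy$G3 <- 0.95 * toy$G2 + 0.04 * toy$absences + rnorm(n, sd = 2)

fit_lm  <- lm(G3 ~ G2 + absences, data = toy)
fit_glm <- glm(G3 ~ G2 + absences, family = gaussian, data = toy)

# Same coefficient estimates and the same AIC.
all.equal(coef(fit_lm), coef(fit_glm))
all.equal(AIC(fit_lm), AIC(fit_glm))

# The GLM dispersion parameter is the squared residual standard error of lm.
all.equal(summary(fit_glm)$dispersion, sigma(fit_lm)^2)

# A deviance-based R-squared from the GLM equals lm's multiple (unadjusted) R-squared.
pseudo_r2 <- 1 - fit_glm$deviance / fit_glm$null.deviance
all.equal(pseudo_r2, summary(fit_lm)$r.squared)
```

Applied to the output above, 1 - 1123.2 / 6946.9 is approximately 0.8383, the multiple R-squared of the second linear model.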
We see the same model values for the Mathematics and Portuguese datasets, so we choose one model for both datasets. We discard Model 1 because it has a lower adjusted R-squared, a higher residual standard error and a higher AIC than Model 2 and Model 3.
Model 3, the generalized linear model, is our choice for both the Mathematics and Portuguese datasets. Its residual standard error and AIC are the same as Model 2's, and its reported R-squared (0.8383, a pseudo-R-squared) is slightly higher than Model 2's adjusted R-squared (0.8346).
The most significant variables in determining a student's final grade are: student's age, extra-curricular activities, quality of family relationships, weekend alcohol consumption, number of school absences, first period grade and second period grade.
Mathematics
## fitting null model for pseudo-r2
| | LM 1 | LM 2 | GLM |
|---|---|---|---|
| Residual standard error | 1.9737994 | 1.9096706 | 1.9096706 |
| Adjusted R-squared | 0.8233445 | 0.8346371 | 0.8383118 |
| Predictors | 42 | 8 | 8 |
| AIC | 1367.4380030 | 1315.5263920 | 1315.5263920 |
Portuguese
## fitting null model for pseudo-r2
| | LM 1 | LM 2 | GLM |
|---|---|---|---|
| Residual standard error | 1.9737994 | 1.9096706 | 1.9096706 |
| Adjusted R-squared | 0.8233445 | 0.8346371 | 0.8383118 |
| Predictors | 42 | 8 | 8 |
| AIC | 1367.4380030 | 1315.5263920 | 1315.5263920 |
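A comparison table like the ones above can be assembled from `sigma()`, `summary()` and `AIC()`. The sketch below is one way to do it on synthetic data; `model_metrics` is a hypothetical helper, not the report's code. The Predictors row above appears to count estimated coefficients including the intercept (41 + 1 and 7 + 1), so the helper counts coefficients the same way.

```r
set.seed(3)
n <- 200

# Synthetic data (made-up values); "noise" is an irrelevant predictor.
toy <- data.frame(G1 = rnorm(n, mean = 11, sd = 3), noise = rnorm(n))
toy$G2 <- toy$G1 + rnorm(n, sd = 1.5)
toy$G3 <- 0.2 * toy$G1 + 0.9 * toy$G2 + rnorm(n, sd = 2)

lm1 <- lm(G3 ~ ., data = toy)        # all variables
lm2 <- lm(G3 ~ G1 + G2, data = toy)  # reduced model

# Hypothetical helper: collect the comparison metrics for one lm fit.
model_metrics <- function(model) {
  c(
    "Residual standard error" = sigma(model),
    "Adjusted R-squared"      = summary(model)$adj.r.squared,
    "Coefficients"            = length(coef(model)),  # includes the intercept
    "AIC"                     = AIC(model)
  )
}

comparison <- cbind("LM 1" = model_metrics(lm1), "LM 2" = model_metrics(lm2))
round(comparison, 4)
```

Lower residual standard error and AIC, and higher adjusted R-squared, indicate the better fit on the training data.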
P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.
https://rdrr.io/r/stats/sigma.html