My data come from two Portuguese schools and contain, for each student, information on age, demographics, attendance, family structure, and much more. The data were collected in response to Portugal's high student failure rates: despite improvements in education level around the year 2000, roughly 40% of students in Portuguese schools were leaving school early, compared with the European average of 15%.
Portuguese school information systems mostly rely on paper sheets, so the database was built from two sources: school reports based on those paper sheets, and questionnaires used to complement that information. The questionnaires included demographic, social, and emotional components, as well as other school-related variables. Few students answered the question about family income, so that variable was not used in the final dataset.
What is the best model to predict final grades (g3) from the set of predictors given?
## school sex age address famsize pstatus medu fedu mjob fjob reason
## 1 GP F 18 U GT3 A 4 4 at_home teacher course
## 2 GP F 17 U GT3 T 1 1 at_home other course
## 3 GP F 15 U LE3 T 1 1 at_home other other
## 4 GP F 15 U GT3 T 4 2 health services home
## 5 GP F 16 U GT3 T 3 3 other other home
## 6 GP M 16 U LE3 T 4 3 services other reputation
## guardian traveltime studytime failures schoolsup famsup paid activities
## 1 mother 2 2 0 TRUE FALSE FALSE FALSE
## 2 father 1 2 0 FALSE TRUE FALSE FALSE
## 3 mother 1 2 3 TRUE FALSE TRUE FALSE
## 4 mother 1 3 0 FALSE TRUE TRUE TRUE
## 5 father 1 2 0 FALSE TRUE TRUE FALSE
## 6 mother 1 2 0 FALSE TRUE TRUE TRUE
## nursery higher internet romantic famrel freetime goout dalc walc health
## 1 TRUE TRUE FALSE FALSE 4 3 4 1 1 3
## 2 FALSE TRUE TRUE FALSE 5 3 3 1 1 3
## 3 TRUE TRUE TRUE FALSE 4 3 2 2 3 3
## 4 TRUE TRUE TRUE TRUE 3 2 2 1 1 5
## 5 TRUE TRUE FALSE FALSE 4 3 2 1 2 5
## 6 TRUE TRUE TRUE FALSE 5 4 2 1 2 5
## absences g1 g2 g3
## 1 6 5 6 6
## 2 4 5 5 6
## 3 10 7 8 10
## 4 2 15 14 15
## 5 4 6 10 10
## 6 10 15 15 15
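For reference, here is a minimal sketch of how the data above might be loaded, assuming the UCI Student Performance files (the file name, the semicolon separator, and the recoding of yes/no questionnaire fields to logicals are assumptions on my part):

# Assumed loading step for the UCI Student Performance data
grades <- read.table("student-mat.csv", sep = ";", header = TRUE)
# Recode yes/no questionnaire fields to logical, matching the preview above
yn <- c("schoolsup", "famsup", "paid", "activities", "nursery", "higher", "internet", "romantic")
grades[yn] <- lapply(grades[yn], function(v) v == "yes")
head(grades)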
For this project I will be using LASSO regression, a powerful tool that reduces the number of predictors in a model, making it more interpretable and reducing its dimensionality. It does this by shrinking coefficient weights toward 0, setting some exactly to 0 so that only the important predictors remain.
Lasso regression is used to help determine which variables are important to keep in the model. It adds an L1 penalty on the coefficients to the residual sum of squares, weighted by the regularization parameter \(\lambda\). In glmnet, the mixing parameter \(\alpha\) selects the type of penalty (\(\alpha = 0\) is ridge regression, \(0 < \alpha < 1\) is elastic net, and \(\alpha = 1\) is lasso regression). Ridge and elastic net are other forms of regularized linear regression.
\[L_{lasso}(\hat{\beta}) = \sum_{i=1}^n\left(y_i - x_i^T\hat{\beta}\right)^2 + \lambda\sum_{j=1}^m|\hat{\beta}_j|\] Both ridge regression and lasso are techniques designed primarily for prediction rather than inference. This allows you to work with complex (high-dimensional) datasets while not having to worry as much about overfitting.
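To make the objective concrete, here is a minimal sketch of the penalized loss above as an R function (illustrative only; glmnet optimizes a scaled version of this objective with efficient coordinate descent):

# Penalized loss: residual sum of squares plus the L1 penalty on the coefficients
lasso_loss <- function(beta, X, y, lambda) {
  sum((y - X %*% beta)^2) + lambda * sum(abs(beta))
}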
Before we can use LASSO regression, we must first check for multicollinearity, because when predictors are highly collinear the LASSO may arbitrarily choose one of them to keep and drop the others.
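As a reminder of what the (G)VIF below measures: the VIF of a single numeric predictor is \(1/(1-R_j^2)\), where \(R_j^2\) comes from regressing that predictor on all the other predictors. A rough illustrative check for one predictor (the predictor set here is an approximation and will not reproduce the table below exactly):

# VIF for age: regress age on everything except the grade variables
r2_age <- summary(lm(age ~ . - g1 - g2 - g3, data = grades))$r.squared
1 / (1 - r2_age)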
## GVIF Df GVIF^(1/(2*Df))
## sex 1.341504 1 1.158233
## age 1.632852 1 1.277831
## traveltime 1.200273 1 1.095570
## studytime 1.299177 1 1.139814
## absences 1.236113 1 1.111806
## medu 2.896548 1 1.701925
## fedu 2.062927 1 1.436289
## failures 1.324186 1 1.150733
## mjob 3.021069 4 1.148207
## fjob 2.029016 4 1.092473
## famrel 1.130420 1 1.063212
## freetime 1.293159 1 1.137172
## goout 1.427386 1 1.194732
## walc 2.323741 1 1.524382
## dalc 1.981845 1 1.407780
## internet 1.200967 1 1.095887
## romantic 1.127690 1 1.061928
## health 1.158221 1 1.076207
## school 1.392342 1 1.179975
## pstatus 1.121966 1 1.059229
## famsize 1.108846 1 1.053018
## reason 1.383986 3 1.055655
## guardian 1.597902 2 1.124314
##
## Call: cv.glmnet(x = x, y = y, alpha = 1)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.1468 27 18.27 1.827 18
## 1se 1.2475 4 19.95 2.074 1
We can see that the minimum cross-validated mean-squared error is achieved when \(\lambda = 0.1468\), which keeps 18 nonzero coefficients.
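Both candidate values of \(\lambda\) can be read directly off the fitted cv.glmnet object, as in the appendix code:

grades_lasso$lambda.min  # lambda minimizing CV mean-squared error (0.1468 here)
grades_lasso$lambda.1se  # largest lambda within one SE of that minimum (1.2475)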
## 25 x 1 sparse Matrix of class "dgCMatrix"
## s0
## sex 0.776561885
## age -0.078450089
## school .
## famsize 0.456095427
## pstatus -0.112236235
## reason 0.216980058
## guardian .
## famsup -0.402610835
## internet 0.339557170
## romantic -0.667087353
## higher 0.983374838
## dalc .
## walc .
## health -0.009597781
## traveltime -0.210190511
## studytime 0.160975721
## absences 0.015084610
## medu 0.411126852
## fedu .
## failures -1.625070501
## mjob .
## fjob .
## famrel 0.009226948
## freetime 0.057419408
## goout -0.322244394
After the LASSO regression, we have our reduced model:
\[g_3 = \beta_0 + \beta_1 sex + \beta_2 age + \beta_3 famsize + \beta_4 pstatus + \beta_5 reason + \beta_6 famsup + \beta_7 internet + \beta_8 romantic + \beta_9 higher + \beta_{10} health\] \[+ \beta_{11} traveltime + \beta_{12} studytime + \beta_{13} absences + \beta_{14} medu + \beta_{15} failures + \beta_{16} famrel + \beta_{17} freetime + \beta_{18} goout + \epsilon\]
Where the errors satisfy \(\epsilon \sim \text{Normal}(0, \sigma^2)\).
The assumptions for this model are:
Normality: The QQ plot shows some deviation from the reference line in the tails, indicating potential concerns with the normality of the residuals. That being said, the rest of the residuals follow the reference line very closely, indicating fairly normally distributed residuals.
Equal variance: Based on the plot of residuals vs fitted values, the residuals stay fairly evenly distributed above and below 0 across the range of predicted values. That being said, there is a prominent downward-sloping band of points below the rest, which likely corresponds to the large cluster of 0 final grades that I will mention later.
Independence: The description of the data does not say how students were selected, and the dataset appears to include all students with grades in Math and Portuguese at the two schools. Since the data are not a random sample, the independence assumption is questionable.
##
## Call:
## lm(formula = g3 ~ goout + freetime + famrel + failures + medu +
## absences + studytime + traveltime + health + higher + romantic +
## internet + famsup + reason + pstatus + famsize + age + sex,
## data = grades)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.4888 -2.2330 0.4778 2.8871 9.3490
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.70808 3.63598 2.945 0.00343 **
## goout -0.50794 0.19907 -2.552 0.01112 *
## freetime 0.25389 0.22878 1.110 0.26783
## famrel 0.15579 0.23833 0.654 0.51374
## failures -1.62001 0.31214 -5.190 3.46e-07 ***
## medu 0.45092 0.21306 2.116 0.03497 *
## absences 0.04236 0.02787 1.520 0.12943
## studytime 0.44984 0.27502 1.636 0.10276
## traveltime -0.32636 0.30910 -1.056 0.29172
## health -0.12059 0.15599 -0.773 0.43999
## higherTRUE 1.57030 1.04421 1.504 0.13347
## romanticTRUE -1.00297 0.46267 -2.168 0.03080 *
## internetTRUE 0.61188 0.58980 1.037 0.30021
## famsupTRUE -0.75780 0.45274 -1.674 0.09501 .
## reasonhome 0.19292 0.54116 0.356 0.72167
## reasonother 1.35309 0.78077 1.733 0.08392 .
## reasonreputation 0.71946 0.55463 1.297 0.19536
## pstatusT -0.48708 0.70762 -0.688 0.49167
## famsizeLE3 0.74103 0.47190 1.570 0.11719
## age -0.17898 0.18036 -0.992 0.32169
## sexM 1.13609 0.47060 2.414 0.01625 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.126 on 374 degrees of freedom
## Multiple R-squared: 0.2301, Adjusted R-squared: 0.1889
## F-statistic: 5.588 on 20 and 374 DF, p-value: 1.224e-12
Based on our summary table, the model is significant, with an overall p-value of 1.22e-12. This indicates that the model captures some relationship with the response and can be used for predictions. That being said, other metrics such as the adjusted \(R^2\) of 0.1889 indicate that less than 20% of the variability in the response is explained by variability in the predictors. In making predictions with this data, I found that a 95% prediction interval for a given set of demographic information was around 11 grade points wide out of 20, spanning over half of the grade scale for each prediction, which, although accurate, is not very precise. Additionally, because the students were not randomly selected, we cannot generalize to the greater population of Portugal, nor can we claim any causal relationship (because this is not an experiment). Overall, the predictors listed have some correlation with final grades, but the correlation is weak overall, and they are not great predictors of final grade. I think it would be interesting to see this study repeated with random selection of students across a population.
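An interval like the one described above can be produced with predict() on the reduced model fit in the appendix code (using the first student as newdata purely for illustration):

# 95% prediction interval for an individual student's final grade
predict(gradeslm, newdata = grades[1, ], interval = "prediction", level = 0.95)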
I think the LASSO regression was very useful for paring down a large model, quickly eliminating insignificant predictors and producing the best model in terms of cross-validated mean squared error. Some potential reasons why I have a relatively low adjusted \(R^2\) value despite the large number of predictors in my model are as follows:
* The predictors are not highly correlated with grades: although the remaining predictors are correlated with final grades, the relationships are weak, as the LASSO regression itself suggests. Almost all of the predictors that survived had coefficients fairly close to 0, indicating only weak correlation with final grades. This is reinforced by what happens when g1 or g2 is added: the LASSO drops nearly all other predictors while the semester grade keeps a large coefficient (see the sketch just after this list).
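A sketch of that check, reusing the x, y, and seed from the appendix code (illustrative; the exact output depends on the cross-validation folds):

# Add the first semester grade to the predictor matrix and rerun the lasso
x_g1 <- cbind(x, g1 = grades$g1)
set.seed(2178)
cv_g1 <- cv.glmnet(x_g1, y, alpha = 1)
coef(cv_g1, s = "lambda.min")  # per the text above, g1 keeps a large coefficient while most other predictors drop to 0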
I think it would be interesting to build a model with a binary categorical response, pass or fail, which could then easily be evaluated with k-fold cross-validation. Additionally, this could remove the error introduced by the large cluster of 0 grades in our final grade data.
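A minimal sketch of that idea, assuming a pass mark of 10 out of 20 (the threshold and the choice of 10 folds are assumptions, not from the original analysis):

# Binary response: pass = final grade of at least 10/20 (assumed threshold)
pass <- as.numeric(grades$g3 >= 10)
set.seed(2178)
cv_pass <- cv.glmnet(x, pass, family = "binomial", type.measure = "class", nfolds = 10)
min(cv_pass$cvm)  # estimated misclassification rate at the best lambda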
Next, I think it would be interesting to add more predictors, to see whether some variable not yet in the model has additional explanatory power for final grades without being directly related to their calculation (unlike the semester grades, which have been shown to be highly correlated with final grades).
# Packages used in this analysis
library(car)      # vif()
library(glmnet)   # glmnet(), cv.glmnet()
library(ggplot2)  # histograms
library(cowplot)  # plot_grid()

# Full linear model for the multicollinearity check
# (VIF depends only on the predictors, so the choice of response does not change it)
full_lm <- lm(g3 ~ sex + age + traveltime + studytime + absences + medu + fedu + failures + mjob + fjob + famrel + freetime + goout + walc + dalc + internet + romantic + health + school + pstatus + famsize + reason + guardian, data = grades)
vif(full_lm)
y <- grades$g3
# data.matrix() converts factors to integer codes, so multi-level factors
# (mjob, fjob, reason, guardian) enter the lasso as single numeric columns
x <- data.matrix(grades[, c("sex", "age", "school", "famsize", "pstatus", "reason", "guardian", "famsup", "internet", "romantic", "higher", "dalc", "walc", "health", "traveltime", "studytime", "absences", "medu", "fedu", "failures", "mjob", "fjob", "famrel", "freetime", "goout")])
# Lasso regression: alpha = 1 selects the pure L1 penalty
set.seed(2178)  # reproducible cross-validation folds
grades_lasso <- cv.glmnet(x, y, alpha = 1)
grades_lasso
plot(grades_lasso$glmnet.fit, "lambda", label = FALSE)  # coefficient paths vs log(lambda)
lasso.min <- grades_lasso$lambda.min  # lambda minimizing CV error
lasso.model <- glmnet(x = x, y = y, alpha = 1, lambda = lasso.min)
lasso.model$beta  # sparse coefficient vector at lambda.min
#fitting the reduced multiple regression
gradeslm <- lm(g3~goout+freetime+famrel+failures+medu+absences+studytime+traveltime+health+higher+romantic+internet+famsup+reason+pstatus+famsize+age+sex, data=grades)
# QQ plot of the residuals
res <- resid(gradeslm)
qqnorm(res)
qqline(res, col = "blue")
# Residuals vs fitted values
plot(fitted(gradeslm), res)
abline(0, 0)
#summary
summary(gradeslm)
# Histograms of final grades (g3) and first semester grades (g1)
hist_g3 <- ggplot(grades, aes(x = g3)) + geom_histogram(binwidth = 1, fill = "blue", color = "black") + ggtitle("Final Semester Grades")
hist_g1 <- ggplot(grades, aes(x = g1)) + geom_histogram(binwidth = 1, fill = "red", color = "black") + ggtitle("First Semester Grades")
plot_grid(hist_g3, hist_g1, labels = "AUTO")