Introduction:

My data comes from two Portuguese schools and contains information on each student's age, demographics, attendance, family structure, and much more. The data were collected because of high student failure rates in Portugal, despite improvements in education levels around the year 2000: roughly 40% of Portuguese students were leaving school early, compared with a European average of 15%.

Portuguese school information systems mostly rely on paper sheets, so the database was built from two sources: school reports based on those paper sheets, and questionnaires used to complement that information. The questionnaires covered demographic, social, and emotional components, along with other school-related variables. Few students answered the family income question, so income was not included in the final dataset.

My question from this dataset is:

What is the best model to predict final grades (g3) from the set of predictors given?

The first six rows of the dataset are shown below:

##   school sex age address famsize pstatus medu fedu     mjob     fjob     reason
## 1     GP   F  18       U     GT3       A    4    4  at_home  teacher     course
## 2     GP   F  17       U     GT3       T    1    1  at_home    other     course
## 3     GP   F  15       U     LE3       T    1    1  at_home    other      other
## 4     GP   F  15       U     GT3       T    4    2   health services       home
## 5     GP   F  16       U     GT3       T    3    3    other    other       home
## 6     GP   M  16       U     LE3       T    4    3 services    other reputation
##   guardian traveltime studytime failures schoolsup famsup  paid activities
## 1   mother          2         2        0      TRUE  FALSE FALSE      FALSE
## 2   father          1         2        0     FALSE   TRUE FALSE      FALSE
## 3   mother          1         2        3      TRUE  FALSE  TRUE      FALSE
## 4   mother          1         3        0     FALSE   TRUE  TRUE       TRUE
## 5   father          1         2        0     FALSE   TRUE  TRUE      FALSE
## 6   mother          1         2        0     FALSE   TRUE  TRUE       TRUE
##   nursery higher internet romantic famrel freetime goout dalc walc health
## 1    TRUE   TRUE    FALSE    FALSE      4        3     4    1    1      3
## 2   FALSE   TRUE     TRUE    FALSE      5        3     3    1    1      3
## 3    TRUE   TRUE     TRUE    FALSE      4        3     2    2    3      3
## 4    TRUE   TRUE     TRUE     TRUE      3        2     2    1    1      5
## 5    TRUE   TRUE    FALSE    FALSE      4        3     2    1    2      5
## 6    TRUE   TRUE     TRUE    FALSE      5        4     2    1    2      5
##   absences g1 g2 g3
## 1        6  5  6  6
## 2        4  5  5  6
## 3       10  7  8 10
## 4        2 15 14 15
## 5        4  6 10 10
## 6       10 15 15 15

Methodology:

For this project I will be using LASSO regression, a powerful tool that reduces the number of predictors in a model, making it more interpretable and reducing its dimensionality. It does this by shrinking some coefficient weights all the way to 0, keeping only the predictors that are important.

LASSO regression basics:

Lasso regression is used to help determine which variables are important to keep in the model. It adds a penalty to the residual sum of squares: the sum of the absolute values of the coefficients, multiplied by the regularization parameter \(\lambda\). (In glmnet, a separate mixing parameter \(\alpha\) selects the type of penalty: \(\alpha = 0\) is ridge regression, \(0 < \alpha < 1\) is elastic net, and \(\alpha = 1\) is lasso. Ridge and elastic net are other forms of regularized linear regression.)

\[L_{\text{lasso}}(\hat{\beta}) = \sum_{i=1}^n\left(y_i - x_i^\top\hat{\beta}\right)^2 + \lambda\sum_{j=1}^m|\hat{\beta}_j|\] Both ridge regression and lasso are techniques designed primarily for prediction rather than inference. This lets you work with complex (high-dimensional) datasets without having to worry as much about overfitting.
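
To make the shrinkage concrete, here is a minimal sketch on simulated data (not the grades data): only three of ten simulated predictors truly matter, and raising \(\lambda\) zeroes out the rest.

#minimal sketch with simulated data: lasso zeroes out noise coefficients
library(glmnet)
set.seed(1)
n <- 100; p <- 10
X_sim <- matrix(rnorm(n * p), n, p)
beta_true <- c(3, -2, 1.5, rep(0, p - 3))   #only 3 predictors truly matter
y_sim <- as.vector(X_sim %*% beta_true + rnorm(n))
fit <- glmnet(X_sim, y_sim, alpha = 1)      #alpha = 1 gives the lasso penalty
coef(fit, s = 0.1)   #small lambda: most coefficients survive
coef(fit, s = 1)     #larger lambda: the noise coefficients are exactly 0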

Before we can use the LASSO regression, we must first check for multicollinearity: when predictors are highly correlated, the LASSO can arbitrarily keep one and drop the others. All of the GVIF^(1/(2*Df)) values below sit well under the usual rule-of-thumb cutoff of \(\sqrt{5} \approx 2.24\) (the GVIF analogue of VIF = 5), so multicollinearity is not a serious concern here.

##                GVIF Df GVIF^(1/(2*Df))
## sex        1.341504  1        1.158233
## age        1.632852  1        1.277831
## traveltime 1.200273  1        1.095570
## studytime  1.299177  1        1.139814
## absences   1.236113  1        1.111806
## medu       2.896548  1        1.701925
## fedu       2.062927  1        1.436289
## failures   1.324186  1        1.150733
## mjob       3.021069  4        1.148207
## fjob       2.029016  4        1.092473
## famrel     1.130420  1        1.063212
## freetime   1.293159  1        1.137172
## goout      1.427386  1        1.194732
## walc       2.323741  1        1.524382
## dalc       1.981845  1        1.407780
## internet   1.200967  1        1.095887
## romantic   1.127690  1        1.061928
## health     1.158221  1        1.076207
## school     1.392342  1        1.179975
## pstatus    1.121966  1        1.059229
## famsize    1.108846  1        1.053018
## reason     1.383986  3        1.055655
## guardian   1.597902  2        1.124314

LASSO Process:

## 
## Call:  cv.glmnet(x = x, y = y, alpha = 1) 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index Measure    SE Nonzero
## min 0.1468    27   18.27 1.827      18
## 1se 1.2475     4   19.95 2.074       1

We can see that the minimum cross-validated mean-squared error is achieved when \(\lambda = 0.1468\), which keeps 18 nonzero coefficients; the sparser 1-standard-error model (\(\lambda = 1.2475\)) keeps only one. The coefficient estimates at \(\lambda = 0.1468\) are shown below.

## 25 x 1 sparse Matrix of class "dgCMatrix"
##                      s0
## sex         0.776561885
## age        -0.078450089
## school      .          
## famsize     0.456095427
## pstatus    -0.112236235
## reason      0.216980058
## guardian    .          
## famsup     -0.402610835
## internet    0.339557170
## romantic   -0.667087353
## higher      0.983374838
## dalc        .          
## walc        .          
## health     -0.009597781
## traveltime -0.210190511
## studytime   0.160975721
## absences    0.015084610
## medu        0.411126852
## fedu        .          
## failures   -1.625070501
## mjob        .          
## fjob        .          
## famrel      0.009226948
## freetime    0.057419408
## goout      -0.322244394
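
For reference, essentially the same estimates can be read directly off the cross-validation object, without refitting; grades_lasso below is the cv.glmnet fit from the R code section.

coef(grades_lasso, s = "lambda.min")   #18 nonzero coefficients, as in the table above
coef(grades_lasso, s = "lambda.1se")   #the sparser 1-SE model (1 nonzero)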

After the LASSO regression, our model is:

\[g_3 = \beta_0 + \beta_1 sex + \beta_2 age + \beta_3 famsize + \beta_4 pstatus + \beta_5 reason + \beta_6 famsup + \beta_7 internet + \beta_8 romantic + \beta_9 higher + \beta_{10} health\]
\[+\, \beta_{11} traveltime + \beta_{12} studytime + \beta_{13} absences + \beta_{14} medu + \beta_{15} failures + \beta_{16} famrel + \beta_{17} freetime + \beta_{18} goout + \epsilon\]

where the errors \(\epsilon \sim \text{Normal}(0, \sigma^2)\) are independent.

Model Conditions:

The assumptions for this model are:

Normality: The QQ plot shows some deviation from the reference line, indicating potential concerns with the normality of the residuals. That said, the bulk of the residuals follow the reference line closely, suggesting they are approximately normally distributed.

Equal variance: In the residuals-vs-fitted plot, the residuals stay fairly evenly distributed above and below 0 across the range of predicted values. That said, there is a distinct band of points with a negative slope below the rest of the residuals, which likely corresponds to the large cluster of final grades of 0, discussed later.

Independence: The data description does not explain how students were selected; the dataset appears to include all students with grades in Math and Portuguese at the two schools, so the observations were not randomly sampled and independence is questionable.

Conclusions:

## 
## Call:
## lm(formula = g3 ~ goout + freetime + famrel + failures + medu + 
##     absences + studytime + traveltime + health + higher + romantic + 
##     internet + famsup + reason + pstatus + famsize + age + sex, 
##     data = grades)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4888  -2.2330   0.4778   2.8871   9.3490 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      10.70808    3.63598   2.945  0.00343 ** 
## goout            -0.50794    0.19907  -2.552  0.01112 *  
## freetime          0.25389    0.22878   1.110  0.26783    
## famrel            0.15579    0.23833   0.654  0.51374    
## failures         -1.62001    0.31214  -5.190 3.46e-07 ***
## medu              0.45092    0.21306   2.116  0.03497 *  
## absences          0.04236    0.02787   1.520  0.12943    
## studytime         0.44984    0.27502   1.636  0.10276    
## traveltime       -0.32636    0.30910  -1.056  0.29172    
## health           -0.12059    0.15599  -0.773  0.43999    
## higherTRUE        1.57030    1.04421   1.504  0.13347    
## romanticTRUE     -1.00297    0.46267  -2.168  0.03080 *  
## internetTRUE      0.61188    0.58980   1.037  0.30021    
## famsupTRUE       -0.75780    0.45274  -1.674  0.09501 .  
## reasonhome        0.19292    0.54116   0.356  0.72167    
## reasonother       1.35309    0.78077   1.733  0.08392 .  
## reasonreputation  0.71946    0.55463   1.297  0.19536    
## pstatusT         -0.48708    0.70762  -0.688  0.49167    
## famsizeLE3        0.74103    0.47190   1.570  0.11719    
## age              -0.17898    0.18036  -0.992  0.32169    
## sexM              1.13609    0.47060   2.414  0.01625 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.126 on 374 degrees of freedom
## Multiple R-squared:  0.2301, Adjusted R-squared:  0.1889 
## F-statistic: 5.588 on 20 and 374 DF,  p-value: 1.224e-12

Based on our summary table, we can see that the model is significant, with an overall p-value of 1.22e-12. This indicates that the model has some relationship with the response and can be used for predictions. That said, other metrics temper this: the adjusted \(R^2\) of 0.1889 indicates that less than 20% of the variability in the response is explained by variability in the predictors. In making predictions with this data, I found that a 95% prediction interval for any given set of demographic information spanned around 11 grade points out of 20, over half of the grade scale, which, although accurate, is not very precise. Additionally, because the students were not randomly sampled, we cannot generalize to the greater population of Portugal, nor claim any causal relationship (this is not an experiment). Overall, the predictors listed have some correlation with final grades, but the correlation is weak, and they are not great predictors of final grade. I think it would be interesting to see this study repeated with random selection of students across a population.
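
For reference, here is a minimal sketch of how such an interval is computed; new_student is just the first observed row reused as a stand-in, not an actual new student.

#95% prediction interval for one hypothetical student
new_student <- grades[1, ]
predict(gradeslm, newdata = new_student, interval = "prediction", level = 0.95)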

Discussion:

I think the LASSO regression was very useful for paring down a large model, quickly eliminating insignificant predictors and producing the best model in terms of mean squared error. Some potential reasons why the adjusted \(R^2\) is relatively low despite the large number of predictors in the model are as follows:

* The predictors are not highly correlated with grades: although the predictors are correlated with final grades, the relationships are weak, as the LASSO regression shows. Almost all of the predictors that remained had coefficients fairly close to 0, indicating that they are only weakly correlated with final grades. This is underscored by adding g1 or g2 to the model, which causes all other predictors to be dropped during the LASSO regression while the semester grade keeps a large coefficient; a quick sketch of this check follows.
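
Here is a minimal sketch of that check, reusing the x and y objects defined in the R code section.

#append first-semester grades and rerun the lasso
x_g1 <- cbind(x, g1 = grades$g1)
set.seed(2178)
cv_g1 <- cv.glmnet(x_g1, y, alpha = 1)
coef(cv_g1, s = "lambda.min")   #g1 keeps a large coefficient; the rest drop out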

Further Research:

I think it would be interesting to build a model with the response as a binary categorical variable, pass or fail, which could then easily be evaluated with k-fold cross-validation. Additionally, this could remove the distortion caused by the large portion of 0 grades in our final-grade data. A minimal sketch of this idea is below.
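
This sketch reuses x from the R code section; note that the passing threshold of 10 out of 20 is my assumption, not something taken from the data description.

#binary pass/fail response (threshold of 10/20 is assumed)
pass <- as.numeric(grades$g3 >= 10)
set.seed(2178)
pass_lasso <- cv.glmnet(x, pass, family = "binomial", type.measure = "class")
pass_lasso
min(pass_lasso$cvm)   #cross-validated misclassification rate at the best lambda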

Next, I think it would be interesting to add more predictors, to see whether some variable not yet in the model has additional explanatory power for final grades without being directly related to their calculation (unlike the semester grades, which have been shown to be highly correlated with final grades).

R code:

#packages used throughout
library(car)      #vif()
library(glmnet)   #glmnet(), cv.glmnet()
library(ggplot2)  #histograms
library(cowplot)  #plot_grid()

#multicollinearity check; VIF depends only on the predictors, so the choice
#of response does not affect it
vif_lm <- lm(g3~sex+age+traveltime+studytime+absences+medu+fedu+failures+mjob+fjob+famrel+freetime+goout+walc+dalc+internet+romantic+health+school+pstatus+famsize+reason+guardian, data=grades)
vif(vif_lm)


#response and predictor matrix for glmnet
y <- grades$g3
x <- data.matrix(grades[,c("sex", "age","school","famsize","pstatus","reason","guardian","famsup","internet","romantic","higher","dalc","walc","health", "traveltime", "studytime", "absences","medu","fedu","failures", "mjob", "fjob", "famrel", "freetime", "goout")])
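#note: data.matrix() converts factors to their integer codes, so multi-level
#predictors like mjob, fjob, reason, and guardian enter as single numeric
#columns; a dummy-coded alternative (an assumption, not what was run here)
#would be x <- model.matrix(~ ., data = grades[, pred_cols])[, -1], where
#pred_cols is the same vector of predictor names as above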


#LASSO regression with cross-validated lambda (alpha = 1 gives the lasso penalty)
set.seed(2178)
grades_lasso <- cv.glmnet(x, y, alpha=1)
grades_lasso

#coefficient paths as a function of log(lambda)
plot(grades_lasso$glmnet.fit,"lambda",label=FALSE)


#refit at the MSE-minimizing lambda and inspect the coefficients
lasso.min <- grades_lasso$lambda.min
lasso.model <- glmnet(x=x, y=y, alpha=1, lambda=lasso.min)
lasso.model$beta


#fitting the reduced multiple regression
gradeslm <- lm(g3~goout+freetime+famrel+failures+medu+absences+studytime+traveltime+health+higher+romantic+internet+famsup+reason+pstatus+famsize+age+sex, data=grades)


#QQ plot of the residuals
res <- resid(gradeslm)
qqnorm(res)
qqline(res, col="blue")


#residuals vs fitted values
plot(fitted(gradeslm), res)
abline(0, 0)


#summary
summary(gradeslm)



#histograms of final grades and first-semester grades
g3_hist <- ggplot(grades, aes(x=g3)) + geom_histogram(binwidth=1, fill="blue", color="black") + ggtitle("Final Semester Grades")

g1_hist <- ggplot(grades, aes(x=g1)) + geom_histogram(binwidth=1, fill="red", color="black") + ggtitle("First Semester Grades")

plot_grid(g3_hist, g1_hist, labels = "AUTO")