Data Dive 11

Build a linear (or generalized linear) model as you like. Use whatever response variable and explanatory variables you prefer. Use the tools from previous weeks to diagnose the model. Highlight any issues with the model. Interpret at least one of the coefficients.

Importing all the libraries

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Warning: package 'pROC' was built under R version 4.3.2
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
## 
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
## Warning: package 'caret' was built under R version 4.3.2
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
## Warning: package 'ResourceSelection' was built under R version 4.3.2
## ResourceSelection 0.3-6   2023-06-27

Loading the data

## Rows: 4424 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr  (1): Target
## dbl (36): Marital status, Application mode, Application order, Course, Dayti...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Summary

## spc_tbl_ [4,424 × 37] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Marital status                                : num [1:4424] 1 1 1 1 2 2 1 1 1 1 ...
##  $ Application mode                              : num [1:4424] 17 15 1 17 39 39 1 18 1 1 ...
##  $ Application order                             : num [1:4424] 5 1 5 2 1 1 1 4 3 1 ...
##  $ Course                                        : num [1:4424] 171 9254 9070 9773 8014 ...
##  $ Daytime/evening attendance                      : num [1:4424] 1 1 1 1 0 0 1 1 1 1 ...
##  $ Previous qualification                        : num [1:4424] 1 1 1 1 1 19 1 1 1 1 ...
##  $ Previous qualification (grade)                : num [1:4424] 122 160 122 122 100 ...
##  $ Nacionality                                   : num [1:4424] 1 1 1 1 1 1 1 1 62 1 ...
##  $ Mother's qualification                        : num [1:4424] 19 1 37 38 37 37 19 37 1 1 ...
##  $ Father's qualification                        : num [1:4424] 12 3 37 37 38 37 38 37 1 19 ...
##  $ Mother's occupation                           : num [1:4424] 5 3 9 5 9 9 7 9 9 4 ...
##  $ Father's occupation                           : num [1:4424] 9 3 9 3 9 7 10 9 9 7 ...
##  $ Admission grade                               : num [1:4424] 127 142 125 120 142 ...
##  $ Displaced                                     : num [1:4424] 1 1 1 1 0 0 1 1 0 1 ...
##  $ Educational special needs                     : num [1:4424] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Debtor                                        : num [1:4424] 0 0 0 0 0 1 0 0 0 1 ...
##  $ Tuition fees up to date                       : num [1:4424] 1 0 0 1 1 1 1 0 1 0 ...
##  $ Gender                                        : num [1:4424] 1 1 1 0 0 1 0 1 0 0 ...
##  $ Scholarship holder                            : num [1:4424] 0 0 0 0 0 0 1 0 1 0 ...
##  $ Age at enrollment                             : num [1:4424] 20 19 19 20 45 50 18 22 21 18 ...
##  $ International                                 : num [1:4424] 0 0 0 0 0 0 0 0 1 0 ...
##  $ Curricular units 1st sem (credited)           : num [1:4424] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Curricular units 1st sem (enrolled)           : num [1:4424] 0 6 6 6 6 5 7 5 6 6 ...
##  $ Curricular units 1st sem (evaluations)        : num [1:4424] 0 6 0 8 9 10 9 5 8 9 ...
##  $ Curricular units 1st sem (approved)           : num [1:4424] 0 6 0 6 5 5 7 0 6 5 ...
##  $ Curricular units 1st sem (grade)              : num [1:4424] 0 14 0 13.4 12.3 ...
##  $ Curricular units 1st sem (without evaluations): num [1:4424] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Curricular units 2nd sem (credited)           : num [1:4424] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Curricular units 2nd sem (enrolled)           : num [1:4424] 0 6 6 6 6 5 8 5 6 6 ...
##  $ Curricular units 2nd sem (evaluations)        : num [1:4424] 0 6 0 10 6 17 8 5 7 14 ...
##  $ Curricular units 2nd sem (approved)           : num [1:4424] 0 6 0 5 6 5 8 0 6 2 ...
##  $ Curricular units 2nd sem (grade)              : num [1:4424] 0 13.7 0 12.4 13 ...
##  $ Curricular units 2nd sem (without evaluations): num [1:4424] 0 0 0 0 0 5 0 0 0 0 ...
##  $ Unemployment rate                             : num [1:4424] 10.8 13.9 10.8 9.4 13.9 16.2 15.5 15.5 16.2 8.9 ...
##  $ Inflation rate                                : num [1:4424] 1.4 -0.3 1.4 -0.8 -0.3 0.3 2.8 2.8 0.3 1.4 ...
##  $ GDP                                           : num [1:4424] 1.74 0.79 1.74 -3.12 0.79 -0.92 -4.06 -4.06 -0.92 3.51 ...
##  $ Target                                        : chr [1:4424] "Dropout" "Graduate" "Dropout" "Graduate" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Marital status` = col_double(),
##   ..   `Application mode` = col_double(),
##   ..   `Application order` = col_double(),
##   ..   Course = col_double(),
##   ..   `Daytime/evening attendance   ` = col_double(),
##   ..   `Previous qualification` = col_double(),
##   ..   `Previous qualification (grade)` = col_double(),
##   ..   Nacionality = col_double(),
##   ..   `Mother's qualification` = col_double(),
##   ..   `Father's qualification` = col_double(),
##   ..   `Mother's occupation` = col_double(),
##   ..   `Father's occupation` = col_double(),
##   ..   `Admission grade` = col_double(),
##   ..   Displaced = col_double(),
##   ..   `Educational special needs` = col_double(),
##   ..   Debtor = col_double(),
##   ..   `Tuition fees up to date` = col_double(),
##   ..   Gender = col_double(),
##   ..   `Scholarship holder` = col_double(),
##   ..   `Age at enrollment` = col_double(),
##   ..   International = col_double(),
##   ..   `Curricular units 1st sem (credited)` = col_double(),
##   ..   `Curricular units 1st sem (enrolled)` = col_double(),
##   ..   `Curricular units 1st sem (evaluations)` = col_double(),
##   ..   `Curricular units 1st sem (approved)` = col_double(),
##   ..   `Curricular units 1st sem (grade)` = col_double(),
##   ..   `Curricular units 1st sem (without evaluations)` = col_double(),
##   ..   `Curricular units 2nd sem (credited)` = col_double(),
##   ..   `Curricular units 2nd sem (enrolled)` = col_double(),
##   ..   `Curricular units 2nd sem (evaluations)` = col_double(),
##   ..   `Curricular units 2nd sem (approved)` = col_double(),
##   ..   `Curricular units 2nd sem (grade)` = col_double(),
##   ..   `Curricular units 2nd sem (without evaluations)` = col_double(),
##   ..   `Unemployment rate` = col_double(),
##   ..   `Inflation rate` = col_double(),
##   ..   GDP = col_double(),
##   ..   Target = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Model building

## 
## Call:
## glm(formula = Target ~ Gender * `Age at enrollment` + I(`Admission grade`^2) + 
##     `Scholarship holder`, family = binomial(link = "logit"), 
##     data = data)
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 1.136e+00  2.146e-01   5.292 1.21e-07 ***
## Gender                     -5.675e-01  2.311e-01  -2.455   0.0141 *  
## `Age at enrollment`        -5.342e-02  5.922e-03  -9.020  < 2e-16 ***
## I(`Admission grade`^2)      5.611e-05  9.328e-06   6.015 1.80e-09 ***
## `Scholarship holder`        1.249e+00  1.017e-01  12.282  < 2e-16 ***
## Gender:`Age at enrollment` -4.960e-03  9.135e-03  -0.543   0.5871    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5554.5  on 4423  degrees of freedom
## Residual deviance: 4934.0  on 4418  degrees of freedom
## AIC: 4946
## 
## Number of Fisher Scoring iterations: 4
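As a quick consistency check on the summary above, the null deviance of an intercept-only logistic model depends only on the class counts. The sketch below assumes the response separates the 1,421 dropouts (410 + 1,011 in the Reference column of the confusion matrix reported later in this report) from the remaining 3,003 of the 4,424 rows. The variable names are illustrative.

```r
# Class counts: dropouts taken from the confusion matrix's Reference column,
# every other row lumped into a single "not dropout" class (an assumption).
n_total   <- 4424
n_dropout <- 410 + 1011
n_other   <- n_total - n_dropout

# Null deviance of an intercept-only binomial model: -2 * log-likelihood
# evaluated at the observed class proportions.
p_dropout <- n_dropout / n_total
null_dev  <- -2 * (n_dropout * log(p_dropout) + n_other * log(1 - p_dropout))

round(null_dev, 1)  # compare with "Null deviance: 5554.5" in the summary
```

If the assumption holds, the value should agree with the reported null deviance, which would mean that every row not labelled Dropout is treated as a single class in the fit.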

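The response variable is Target. R's glm() treats the first level of a factor response as "failure" and models the probability of the remaining levels. Assuming Target was converted to a factor with the default alphabetical level order, Dropout is the first level, so the fitted probability is the probability of not dropping out. The ROC output later in this report is consistent with this: it sets Dropout as control and Graduate as case with direction controls < cases, meaning graduates receive the higher fitted values. Every coefficient below is therefore on the log-odds of not dropping out, holding the other variables constant.

Gender: 1 represents males and 0 represents females. Because the model includes a Gender:Age at enrollment interaction, the Gender coefficient (-0.5675) is the male-female difference at age 0, which lies outside the data. At a given age the difference is -0.5675 - 0.00496 × age. It is negative at every age, so males have lower log-odds of not dropping out than females, i.e. males are more likely to drop out.

Age at enrollment: The coefficient (-0.0534) is the age slope for females (Gender = 0). Each additional year of age at enrollment lowers the log-odds of not dropping out, so older entrants are more likely to drop out. For males the slope is -0.0534 - 0.0050 = -0.0584.

I(Admission grade^2): The positive coefficient means that higher admission grades are associated with higher log-odds of not dropping out. The model contains only the squared term, so the curvature is imposed by the specification rather than tested; this fit does not by itself show that the relationship is non-linear.

Scholarship holder: The positive coefficient (1.249) indicates that scholarship holders have higher log-odds of not dropping out than students without a scholarship, i.e. they are less likely to drop out.

Gender:Age at enrollment: The interaction is not statistically significant (p = 0.5871), so there is no evidence in this model that the effect of age at enrollment differs between genders.

The model’s AIC is 4946. On its own this number has no interpretation; it is useful only for comparison with other models fitted to the same data.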
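Which outcome the coefficients describe depends on how the response is coded. The sketch below uses simulated data (not the student dataset; `x`, `target`, and the effect size are made up for illustration) to show R's documented convention that `glm()` with a factor response treats the first level as failure, and how an explicit 0/1 indicator mirrors the signs.

```r
set.seed(42)

# Simulated predictor and outcome: larger x makes "Graduate" more likely.
n <- 500
x <- rnorm(n)
graduated <- rbinom(n, size = 1, prob = plogis(1.5 * x))
target <- factor(ifelse(graduated == 1, "Graduate", "Dropout"))

levels(target)  # first level is treated as "failure" by glm()

# Factor response: the model describes P(target is not the first level),
# which here is P(Graduate).
fit_default <- glm(target ~ x, family = binomial(link = "logit"))

# Explicit indicator: the model now describes P(Dropout).
is_dropout <- as.integer(target == "Dropout")
fit_dropout <- glm(is_dropout ~ x, family = binomial(link = "logit"))

# Same fit, mirrored signs.
rbind(default = coef(fit_default), dropout = coef(fit_dropout))
```

Coding the response explicitly, as with `is_dropout` here, removes any ambiguity about which direction the coefficients refer to.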

Multicollinearity Check

We can use Variance Inflation Factors (VIF) to check for multicollinearity among predictors. High VIF values (typically VIF > 10) suggest that multicollinearity might be inflating the variances of the parameter estimates, which can lead to incorrect conclusions.

## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
##                     Gender        `Age at enrollment` 
##                  10.826355                   1.750837 
##     I(`Admission grade`^2)       `Scholarship holder` 
##                   1.014343                   1.017090 
## Gender:`Age at enrollment` 
##                  11.954864

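The VIF output flags two terms above the usual threshold of 10: “Gender” (10.83) and the interaction term “Gender:Age at enrollment” (11.95). The remaining predictors have VIFs between 1.01 and 1.75.

This is expected rather than alarming. The interaction term is the product of Gender and age, and because age is always far from zero, that product is nearly proportional to Gender itself. The collinearity is therefore structural: it widens the standard errors of these two terms but does not affect the fitted values. As the warning in the output suggests, vif() can be rerun with type = 'predictor' for models with interactions; centering age before forming the interaction is another common remedy.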
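Why an interaction inflates the VIF of its components can be seen on simulated data. This sketch uses synthetic `gender` and `age` values and a hand-rolled `vif_of()` helper instead of `car::vif()`; all three are illustrative assumptions. It computes the VIF of a binary predictor from the R² of regressing it on the other terms, before and after centering age.

```r
set.seed(1)

# Synthetic predictors, loosely in the range of the student data.
n <- 2000
gender <- rbinom(n, size = 1, prob = 0.35)
age <- round(runif(n, min = 17, max = 50))

# VIF of one term = 1 / (1 - R^2) from regressing it on the other terms.
vif_of <- function(term, others) {
  r2 <- summary(lm(term ~ others))$r.squared
  1 / (1 - r2)
}

# Raw interaction: gender * age is nearly proportional to gender,
# because age is always far from zero.
vif_raw <- vif_of(gender, cbind(age, gender * age))

# Centered interaction: gender * (age - mean(age)) is close to
# uncorrelated with gender.
age_c <- age - mean(age)
vif_centered <- vif_of(gender, cbind(age_c, gender * age_c))

c(raw = vif_raw, centered = vif_centered)
```

Centering changes what the Gender coefficient means (it becomes the difference at the mean age) but leaves the fitted values unchanged.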

Residual Analysis

Overall, the residual diagnostic plots show no major issues, which suggests a reasonable fit of the model to the data.

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  model$y, fitted(model)
## X-squared = 13.076, df = 8, p-value = 0.1093

The p-value of 0.1093 is above 0.05, so the Hosmer-Lemeshow test does not reject the null hypothesis that the model fits the data. This means there is no statistically significant evidence of lack of fit; it is not positive proof that the model fits well.
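The reported p-value can be reproduced from the test statistic alone. A small sketch, assuming the test used 10 groups so that the reference distribution is chi-squared with 10 - 2 = 8 degrees of freedom (this matches the df in the output):

```r
# Values copied from the Hosmer-Lemeshow output above.
statistic <- 13.076
df <- 8

# Upper-tail probability of the chi-squared distribution.
p_value <- pchisq(statistic, df = df, lower.tail = FALSE)
round(p_value, 4)
```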

AUC - ROC curve

## Setting levels: control = Dropout, case = Graduate
## Setting direction: controls < cases

## Area under the curve: 0.7614

Confusion Matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Dropout Graduate
##   Dropout      410      155
##   Graduate    1011     2054
##                                          
##                Accuracy : 0.6788         
##                  95% CI : (0.6633, 0.694)
##     No Information Rate : 0.6085         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.2446         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.2885         
##             Specificity : 0.9298         
##          Pos Pred Value : 0.7257         
##          Neg Pred Value : 0.6701         
##              Prevalence : 0.3915         
##          Detection Rate : 0.1129         
##    Detection Prevalence : 0.1556         
##       Balanced Accuracy : 0.6092         
##                                          
##        'Positive' Class : Dropout        
## 

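With Dropout as the positive class, the model has high specificity (0.93): it correctly labels 2,054 of the 2,209 graduates. Its sensitivity is low (0.29): it identifies only 410 of the 1,421 dropouts and classifies the other 1,011 as graduates. The accuracy of 0.68 is only modestly above the no-information rate of 0.61.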
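The headline metrics follow directly from the four cell counts. The sketch below rebuilds them by hand in base R, with Dropout as the positive class; the object names are illustrative.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference).
cm <- matrix(
  c(410, 1011, 155, 2054),
  nrow = 2,
  dimnames = list(
    Prediction = c("Dropout", "Graduate"),
    Reference  = c("Dropout", "Graduate")
  )
)

# "Dropout" is the positive class.
sensitivity <- cm["Dropout", "Dropout"] / sum(cm[, "Dropout"])
specificity <- cm["Graduate", "Graduate"] / sum(cm[, "Graduate"])
accuracy    <- sum(diag(cm)) / sum(cm)
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(sensitivity = sensitivity, specificity = specificity,
        accuracy = accuracy, balanced_accuracy = balanced_accuracy), 4)
```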

Coefficients

##                (Intercept)                     Gender 
##                  3.1138405                  0.5669666 
##        `Age at enrollment`     I(`Admission grade`^2) 
##                  0.9479814                  1.0000561 
##       `Scholarship holder` Gender:`Age at enrollment` 
##                  3.4880810                  0.9950519

Interpretation of Coefficients

The values above are the exponentiated coefficients, i.e. odds ratios. The outcome being modeled is not dropping out: glm() treats the first level of a factor response (here Dropout, assuming default alphabetical ordering) as failure, and the ROC output assigns graduates the higher fitted values. An odds ratio above 1 therefore means lower dropout risk.

Gender: With Gender coded 1 for male and 0 for female, 0.567 is the male-to-female odds ratio at age 0, because the model includes a gender-by-age interaction. Taken at face value, males have 43.3% (1 - 0.5669666) lower odds of not dropping out than females. At realistic ages the ratio is somewhat smaller, for example exp(-0.5675 - 0.00496 × 20) ≈ 0.51 at age 20.

Age at enrollment: For females, the odds of not dropping out are multiplied by 0.948 for each additional year of age at enrollment, a decrease of about 5.2% per year. For males the factor is 0.948 × 0.995 ≈ 0.943. Older students at enrollment are therefore more likely to drop out.

Admission grade (squared): The odds of not dropping out are multiplied by 1.0000561 for each one-unit increase in the squared admission grade. This looks negligible only because one unit of squared grade is a tiny step. Raising the admission grade from 120 to 140 changes the squared grade by 140² - 120² = 5,200 and multiplies the odds by exp(5200 × 5.611e-05) ≈ 1.34.

Scholarship holder: The odds of not dropping out for a scholarship holder are 3.49 times the odds for a non-scholarship holder. Holding a scholarship is associated with substantially lower odds of dropping out in this model.

Gender:Age at enrollment: The interaction term has an odds ratio of 0.995, which is the factor by which the per-year age odds ratio for males differs from that for females. It is very close to 1 and not statistically significant (p = 0.5871).
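Odds ratios for terms involved in an interaction or a transformation depend on where they are evaluated. The sketch below recomputes them from the coefficients printed in the model summary; the short names in `b`, the helper `gender_or_at()`, and the example ages and grades are illustrative choices, and all ratios refer to the odds of whichever outcome the model was fitted to.

```r
# Coefficients copied from the model summary (log-odds scale).
b <- c(
  gender      = -5.675e-01,
  age         = -5.342e-02,
  adm_grade_2 =  5.611e-05,
  scholarship =  1.249e+00,
  gender_age  = -4.960e-03
)

# Odds ratios for a one-unit change in each term.
round(exp(b), 4)

# Male-vs-female odds ratio at a given age:
# the interaction makes it age-dependent.
gender_or_at <- function(age) exp(b[["gender"]] + b[["gender_age"]] * age)
gender_or_at(c(18, 20, 30))

# Odds ratio for moving the admission grade from 120 to 140:
# the squared term changes by 140^2 - 120^2 = 5200 units.
grade_or <- exp(b[["adm_grade_2"]] * (140^2 - 120^2))
grade_or
```

Reporting the gender odds ratio at a few representative ages, rather than the single value at age 0, gives a summary that stays inside the range of the data.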

Summary

We built a logistic regression model to predict whether a student will graduate or drop out based on various predictors, including gender, age at enrollment, admission grade squared, and scholarship holder status.

Model Diagnostics:

Variance Inflation Factor (VIF): Checked for multicollinearity among predictors. Gender (10.83) and the Gender:Age at enrollment interaction (11.95) had VIFs above 10, which is expected for an interaction term and its component; the other predictors were all below 2.

Residuals Analysis: Reviewed four diagnostic plots to assess the residuals. No major issues were visible, indicating a reasonable fit of the model to the data.

Hosmer-Lemeshow Test: Conducted a goodness-of-fit test and found a p-value of 0.1093, giving no statistically significant evidence of lack of fit.

Model Performance:

ROC Curve: Generated a Receiver Operating Characteristic curve with an Area Under the Curve (AUC) of 0.7614, indicating acceptable but not strong discriminative ability.

Confusion Matrix: Calculated the model’s accuracy (about 67.88%) and other performance metrics like sensitivity and specificity. The model was more effective at predicting graduates than dropouts.

Coefficient Interpretation: We interpreted the coefficients of the model as odds ratios, which describe the change in odds for a one-unit change in the predictor variables.

Model Issues: The main issue highlighted was the model’s low sensitivity (0.2885): it identified only about 29% of the students who dropped out, so it is not effective at flagging students at risk of dropping out.