Data Dive 10

Select an interesting binary column of data, or one which can be reasonably converted into a binary variable. This should be something worth modeling.
Build a logistic regression model for this variable, using between 1-4 explanatory variables.
Interpret the coefficients, and explain what they mean in your notebook.
(Bonus) Using the Standard Error for at least one coefficient, build a C.I. for that coefficient, and interpret its meaning.
Consider a transformation for any explanatory variable, and illustrate why you need the transformation (or why you do not).
Scatter Plots …

Importing all the libraries

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading the data

## Rows: 4424 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr  (1): Target
## dbl (36): Marital status, Application mode, Application order, Course, Dayti...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## spc_tbl_ [4,424 × 37] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Marital status                                : num [1:4424] 1 1 1 1 2 2 1 1 1 1 ...
##  $ Application mode                              : num [1:4424] 17 15 1 17 39 39 1 18 1 1 ...
##  $ Application order                             : num [1:4424] 5 1 5 2 1 1 1 4 3 1 ...
##  $ Course                                        : num [1:4424] 171 9254 9070 9773 8014 ...
##  $ Daytime/evening attendance                      : num [1:4424] 1 1 1 1 0 0 1 1 1 1 ...
##  $ Previous qualification                        : num [1:4424] 1 1 1 1 1 19 1 1 1 1 ...
##  $ Previous qualification (grade)                : num [1:4424] 122 160 122 122 100 ...
##  $ Nacionality                                   : num [1:4424] 1 1 1 1 1 1 1 1 62 1 ...
##  $ Mother's qualification                        : num [1:4424] 19 1 37 38 37 37 19 37 1 1 ...
##  $ Father's qualification                        : num [1:4424] 12 3 37 37 38 37 38 37 1 19 ...
##  $ Mother's occupation                           : num [1:4424] 5 3 9 5 9 9 7 9 9 4 ...
##  $ Father's occupation                           : num [1:4424] 9 3 9 3 9 7 10 9 9 7 ...
##  $ Admission grade                               : num [1:4424] 127 142 125 120 142 ...
##  $ Displaced                                     : num [1:4424] 1 1 1 1 0 0 1 1 0 1 ...
##  $ Educational special needs                     : num [1:4424] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Debtor                                        : num [1:4424] 0 0 0 0 0 1 0 0 0 1 ...
##  $ Tuition fees up to date                       : num [1:4424] 1 0 0 1 1 1 1 0 1 0 ...
##  $ Gender                                        : num [1:4424] 1 1 1 0 0 1 0 1 0 0 ...
##  $ Scholarship holder                            : num [1:4424] 0 0 0 0 0 0 1 0 1 0 ...
##  $ Age at enrollment                             : num [1:4424] 20 19 19 20 45 50 18 22 21 18 ...
##  $ International                                 : num [1:4424] 0 0 0 0 0 0 0 0 1 0 ...
##  $ Curricular units 1st sem (credited)           : num [1:4424] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Curricular units 1st sem (enrolled)           : num [1:4424] 0 6 6 6 6 5 7 5 6 6 ...
##  $ Curricular units 1st sem (evaluations)        : num [1:4424] 0 6 0 8 9 10 9 5 8 9 ...
##  $ Curricular units 1st sem (approved)           : num [1:4424] 0 6 0 6 5 5 7 0 6 5 ...
##  $ Curricular units 1st sem (grade)              : num [1:4424] 0 14 0 13.4 12.3 ...
##  $ Curricular units 1st sem (without evaluations): num [1:4424] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Curricular units 2nd sem (credited)           : num [1:4424] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Curricular units 2nd sem (enrolled)           : num [1:4424] 0 6 6 6 6 5 8 5 6 6 ...
##  $ Curricular units 2nd sem (evaluations)        : num [1:4424] 0 6 0 10 6 17 8 5 7 14 ...
##  $ Curricular units 2nd sem (approved)           : num [1:4424] 0 6 0 5 6 5 8 0 6 2 ...
##  $ Curricular units 2nd sem (grade)              : num [1:4424] 0 13.7 0 12.4 13 ...
##  $ Curricular units 2nd sem (without evaluations): num [1:4424] 0 0 0 0 0 5 0 0 0 0 ...
##  $ Unemployment rate                             : num [1:4424] 10.8 13.9 10.8 9.4 13.9 16.2 15.5 15.5 16.2 8.9 ...
##  $ Inflation rate                                : num [1:4424] 1.4 -0.3 1.4 -0.8 -0.3 0.3 2.8 2.8 0.3 1.4 ...
##  $ GDP                                           : num [1:4424] 1.74 0.79 1.74 -3.12 0.79 -0.92 -4.06 -4.06 -0.92 3.51 ...
##  $ Target                                        : chr [1:4424] "Dropout" "Graduate" "Dropout" "Graduate" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Marital status` = col_double(),
##   ..   `Application mode` = col_double(),
##   ..   `Application order` = col_double(),
##   ..   Course = col_double(),
##   ..   `Daytime/evening attendance   ` = col_double(),
##   ..   `Previous qualification` = col_double(),
##   ..   `Previous qualification (grade)` = col_double(),
##   ..   Nacionality = col_double(),
##   ..   `Mother's qualification` = col_double(),
##   ..   `Father's qualification` = col_double(),
##   ..   `Mother's occupation` = col_double(),
##   ..   `Father's occupation` = col_double(),
##   ..   `Admission grade` = col_double(),
##   ..   Displaced = col_double(),
##   ..   `Educational special needs` = col_double(),
##   ..   Debtor = col_double(),
##   ..   `Tuition fees up to date` = col_double(),
##   ..   Gender = col_double(),
##   ..   `Scholarship holder` = col_double(),
##   ..   `Age at enrollment` = col_double(),
##   ..   International = col_double(),
##   ..   `Curricular units 1st sem (credited)` = col_double(),
##   ..   `Curricular units 1st sem (enrolled)` = col_double(),
##   ..   `Curricular units 1st sem (evaluations)` = col_double(),
##   ..   `Curricular units 1st sem (approved)` = col_double(),
##   ..   `Curricular units 1st sem (grade)` = col_double(),
##   ..   `Curricular units 1st sem (without evaluations)` = col_double(),
##   ..   `Curricular units 2nd sem (credited)` = col_double(),
##   ..   `Curricular units 2nd sem (enrolled)` = col_double(),
##   ..   `Curricular units 2nd sem (evaluations)` = col_double(),
##   ..   `Curricular units 2nd sem (approved)` = col_double(),
##   ..   `Curricular units 2nd sem (grade)` = col_double(),
##   ..   `Curricular units 2nd sem (without evaluations)` = col_double(),
##   ..   `Unemployment rate` = col_double(),
##   ..   `Inflation rate` = col_double(),
##   ..   GDP = col_double(),
##   ..   Target = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

We can use the column “Target” as our binary response variable. Here we convert it to a binary format: 1 for Dropout and 0 for Graduate.
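The encoding described above can be sketched as follows. This is a minimal illustration on a toy vector, not the notebook's own code; the names `target` and `target_binary` are assumptions made for the example.

```r
# Toy stand-in for the Target column (the real column has 4,424 values).
target <- c("Dropout", "Graduate", "Dropout", "Graduate")

# 1 for "Dropout", 0 otherwise. Note that under this rule any label
# other than "Dropout" is also coded 0, not only "Graduate".
target_binary <- ifelse(target == "Dropout", 1, 0)

# Cross-tabulate to confirm the mapping before modeling.
table(target, target_binary)
```

Cross-tabulating the original labels against the new column is a cheap way to confirm that every label landed in the intended class.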

## 
## Call:
## glm(formula = Target_binary ~ `Admission grade` + `Age at enrollment` + 
##     `Unemployment rate` + GDP, family = binomial, data = data)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -0.5702899  0.3443138  -1.656   0.0977 .  
## `Admission grade`   -0.0141455  0.0023412  -6.042 1.52e-09 ***
## `Age at enrollment`  0.0681507  0.0043866  15.536  < 2e-16 ***
## `Unemployment rate` -0.0002795  0.0132473  -0.021   0.9832    
## GDP                 -0.0319840  0.0155899  -2.052   0.0402 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5554.5  on 4423  degrees of freedom
## Residual deviance: 5241.5  on 4419  degrees of freedom
## AIC: 5251.5
## 
## Number of Fisher Scoring iterations: 4
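To make the log-odds scale of the summary above more concrete, the printed estimates can be turned into a predicted probability. The sketch below copies the coefficients from the output; the student profile (admission grade 127, age 20, unemployment rate 10.8, GDP 1.74) is a hypothetical example chosen for illustration, and the short names are assumptions.

```r
# Coefficients copied from the model summary (log-odds scale).
coefs <- c(intercept       = -0.5702899,
           admission_grade = -0.0141455,
           age             =  0.0681507,
           unemployment    = -0.0002795,
           gdp             = -0.0319840)

# Hypothetical student; the leading 1 multiplies the intercept.
student <- c(intercept = 1, admission_grade = 127, age = 20,
             unemployment = 10.8, gdp = 1.74)

# Linear predictor, then the inverse logit to get a probability.
log_odds  <- sum(coefs * student)
p_dropout <- plogis(log_odds)
p_dropout
```

`plogis()` is the logistic function in base R, so no extra packages are needed for this step.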

With 95% confidence, we can say that the true value of the Admission grade coefficient lies between -0.0188 and -0.0096. The interval is entirely negative and does not contain 0, so the log odds of dropping out decrease as the admission grade increases. In other words, students with higher admission grades are less likely to drop out.

Interpreting each coefficient with the other variables held fixed: a higher Admission grade decreases the likelihood of dropping out (each additional point multiplies the odds by about exp(-0.0141) ≈ 0.986). An older Age at enrollment increases the likelihood of dropping out (each additional year multiplies the odds by about exp(0.0682) ≈ 1.071). Unemployment rate did not show a significant effect (p = 0.98, and its interval contains 0). Higher GDP (or less negative GDP in the case of economic contractions) decreases the likelihood of dropping out, although this effect is only marginally significant (p = 0.04).
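Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to read. The sketch below uses the estimates printed in the model summary; the vector name `estimates` and the 10-point rescaling are illustrative assumptions.

```r
# Slope estimates copied from the model summary (log-odds per unit).
estimates <- c("Admission grade"   = -0.0141455,
               "Age at enrollment" =  0.0681507,
               "Unemployment rate" = -0.0002795,
               "GDP"               = -0.0319840)

# Odds ratio per one-unit increase in each variable.
odds_ratios <- exp(estimates)
round(odds_ratios, 3)

# A one-point change in admission grade is small, so a 10-point
# change may be a more natural unit to report.
or_grade_10 <- exp(10 * estimates[["Admission grade"]])
round(or_grade_10, 3)
```

An odds ratio below 1 means the odds of dropping out shrink as the variable increases; a value above 1 means they grow.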

## Waiting for profiling to be done...
##                           2.5 %       97.5 %
## (Intercept)         -1.24471559  0.105311249
## `Admission grade`   -0.01875599 -0.009576510
## `Age at enrollment`  0.05961943  0.076820562
## `Unemployment rate` -0.02623150  0.025706748
## GDP                 -0.06250329 -0.001380817
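The intervals above come from likelihood profiling (hence the "Waiting for profiling to be done..." message). A simpler interval can be built by hand from the standard error, as estimate ± z × SE. The sketch below does this for the Admission grade coefficient using the values printed in the model summary; it is a Wald approximation, so it should land close to the profiled interval but not match it exactly.

```r
# Estimate and standard error for Admission grade, from the summary.
est <- -0.0141455
se  <-  0.0023412

# Two-sided 95% critical value of the standard normal (about 1.96).
z <- qnorm(0.975)

# Wald interval on the log-odds scale.
wald_ci <- est + c(-1, 1) * z * se
round(wald_ci, 4)

# The same interval expressed as odds ratios.
round(exp(wald_ci), 4)
```

If a fitted model object is available, `confint.default()` in base R computes Wald intervals of this kind for all coefficients at once.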

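Admission grade: The distribution is skewed, so we try a logarithmic transformation to make it more symmetric. Note that a log transformation compresses the right tail, so it helps with right skew; applied to a left-skewed variable it would make the asymmetry worse.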

GDP: The distribution seems symmetric, so a transformation might not be necessary.
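As a rough numeric companion to reading histograms, sample skewness can be compared before and after a log transformation. The sketch below uses two small made-up vectors (not values from the dataset) and a hand-written `skewness()` helper, both of which are assumptions introduced for illustration.

```r
# Moment-based sample skewness: 0 for a symmetric sample,
# positive for a long right tail, negative for a long left tail.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Made-up examples: one right-skewed vector and its mirror image.
right_skewed <- c(1, 2, 2, 3, 3, 3, 4, 5, 8, 20)
left_skewed  <- 21 - right_skewed

# The log pulls in the long right tail, so skewness moves toward 0.
c(raw = skewness(right_skewed), logged = skewness(log(right_skewed)))

# For the left-skewed vector the log stretches the left tail further.
c(raw = skewness(left_skewed), logged = skewness(log(left_skewed)))
```

Running the same helper on a real column and on its log gives a quick check of whether the transformation moves the distribution toward symmetry.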

## 
## Call:
## glm(formula = Target_binary ~ Log_Admission_grade + `Age at enrollment` + 
##     `Unemployment rate` + GDP, family = binomial, data = data)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         15.0396645  2.8342167   5.306 1.12e-07 ***
## Log_Admission_grade -3.2184984  0.5245240  -6.136 8.46e-10 ***
## `Age at enrollment`  0.0678671  0.0043856  15.475  < 2e-16 ***
## `Unemployment rate` -0.0005923  0.0132472  -0.045   0.9643    
## GDP                 -0.0323142  0.0155918  -2.073   0.0382 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5554.5  on 4423  degrees of freedom
## Residual deviance: 5240.4  on 4419  degrees of freedom
## AIC: 5250.4
## 
## Number of Fisher Scoring iterations: 4
## AIC for original model: 5251.457
## AIC for transformed model: 5250.443

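The logarithmic transformation of Admission grade improved the model fit only slightly: the AIC drops from 5251.457 to 5250.443, a difference of about 1. That gain is too small to justify the less interpretable log scale, so we will go on with the original model.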
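The comparison can be made explicit with the two AIC values printed above. The sketch below also checks that, for this binary-response model, the AIC equals the residual deviance plus twice the number of estimated coefficients (five here, including the intercept). The cutoff of 2 is a common rule of thumb, stated here as an assumption and not as a formal test.

```r
# AIC values copied from the output above.
aic_original <- 5251.457
aic_log      <- 5250.443

# Positive difference means the log model has the lower AIC.
delta_aic <- aic_original - aic_log
delta_aic

# Rule of thumb (assumption): differences under about 2 are
# usually treated as negligible.
delta_aic < 2

# Consistency check: printed residual deviance (rounded) + 2 * 5.
n_coef <- 5
aic_from_deviance <- 5241.5 + 2 * n_coef
aic_from_deviance
```

With fitted model objects at hand, `AIC(model_a, model_b)` reports the same comparison directly.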