Week 11 | Data Dive — GLMs (Part 2)

# Load the necessary libraries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)
library(ggrepel)
library(effsize)
library(pwrss)

## 
## Attaching package: 'pwrss'
## 
## The following object is masked from 'package:stats':
## 
##     power.t.test

library(boot)
library(broom)
library(lindia)
# dplyr and ggplot2 are already attached by tidyverse above
library(glmnet)  # For regularized generalized linear models (lasso, ridge, elastic net)

## Warning: package 'glmnet' was built under R version 4.3.2

## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loaded glmnet 4.1-8

library(caret)

## Warning: package 'caret' was built under R version 4.3.2

## Loading required package: lattice
## 
## Attaching package: 'lattice'
## 
## The following object is masked from 'package:boot':
## 
##     melanoma
## 
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

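# Read the semicolon-delimited student data.
# Note that the name `mpg` masks the unrelated ggplot2::mpg dataset.
mpg <- read_delim("C:/Users/kondo/OneDrive/Desktop/INTRO to Statistics and R/Data Set and work/data.csv", delim = ";", show_col_types = FALSE)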

glimpse(mpg)

## Rows: 4,424
## Columns: 37
## $ `Marital status`                                 <dbl> 1, 1, 1, 1, 2, 2, 1, …
## $ `Application mode`                               <dbl> 17, 15, 1, 17, 39, 39…
## $ `Application order`                              <dbl> 5, 1, 5, 2, 1, 1, 1, …
## $ Course                                           <dbl> 171, 9254, 9070, 9773…
## $ `Daytime/evening attendance\t`                   <dbl> 1, 1, 1, 1, 0, 0, 1, …
## $ `Previous qualification`                         <dbl> 1, 1, 1, 1, 1, 19, 1,…
## $ `Previous qualification (grade)`                 <dbl> 122.0, 160.0, 122.0, …
## $ Nacionality                                      <dbl> 1, 1, 1, 1, 1, 1, 1, …
## $ `Mother's qualification`                         <dbl> 19, 1, 37, 38, 37, 37…
## $ `Father's qualification`                         <dbl> 12, 3, 37, 37, 38, 37…
## $ `Mother's occupation`                            <dbl> 5, 3, 9, 5, 9, 9, 7, …
## $ `Father's occupation`                            <dbl> 9, 3, 9, 3, 9, 7, 10,…
## $ `Admission grade`                                <dbl> 127.3, 142.5, 124.8, …
## $ Displaced                                        <dbl> 1, 1, 1, 1, 0, 0, 1, …
## $ `Educational special needs`                      <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ Debtor                                           <dbl> 0, 0, 0, 0, 0, 1, 0, …
## $ `Tuition fees up to date`                        <dbl> 1, 0, 0, 1, 1, 1, 1, …
## $ Gender                                           <dbl> 1, 1, 1, 0, 0, 1, 0, …
## $ `Scholarship holder`                             <dbl> 0, 0, 0, 0, 0, 0, 1, …
## $ `Age at enrollment`                              <dbl> 20, 19, 19, 20, 45, 5…
## $ International                                    <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ `Curricular units 1st sem (credited)`            <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ `Curricular units 1st sem (enrolled)`            <dbl> 0, 6, 6, 6, 6, 5, 7, …
## $ `Curricular units 1st sem (evaluations)`         <dbl> 0, 6, 0, 8, 9, 10, 9,…
## $ `Curricular units 1st sem (approved)`            <dbl> 0, 6, 0, 6, 5, 5, 7, …
## $ `Curricular units 1st sem (grade)`               <dbl> 0.00000, 14.00000, 0.…
## $ `Curricular units 1st sem (without evaluations)` <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ `Curricular units 2nd sem (credited)`            <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ `Curricular units 2nd sem (enrolled)`            <dbl> 0, 6, 6, 6, 6, 5, 8, …
## $ `Curricular units 2nd sem (evaluations)`         <dbl> 0, 6, 0, 10, 6, 17, 8…
## $ `Curricular units 2nd sem (approved)`            <dbl> 0, 6, 0, 5, 6, 5, 8, …
## $ `Curricular units 2nd sem (grade)`               <dbl> 0.00000, 13.66667, 0.…
## $ `Curricular units 2nd sem (without evaluations)` <dbl> 0, 0, 0, 0, 0, 5, 0, …
## $ `Unemployment rate`                              <dbl> 10.8, 13.9, 10.8, 9.4…
## $ `Inflation rate`                                 <dbl> 1.4, -0.3, 1.4, -0.8,…
## $ GDP                                              <dbl> 1.74, 0.79, 1.74, -3.…
## $ Target                                           <chr> "Dropout", "Graduate"…

Tasks: Build a linear (or generalized linear) model, using whatever response variable and explanatory variables you prefer. Use the tools from previous weeks to diagnose the model. Highlight any issues with the model. Interpret at least one of the coefficients.

# Build a linear regression model
model <- lm(`Admission grade` ~ `Age at enrollment` + `Tuition fees up to date` + GDP, data = mpg)

# Check the summary of the linear regression model
summary(model)

## 
## Call:
## lm(formula = `Admission grade` ~ `Age at enrollment` + `Tuition fees up to date` + 
##     GDP, data = mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.967  -9.196  -1.126   7.743  63.864 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               125.99989    1.00701 125.123  < 2e-16 ***
## `Age at enrollment`        -0.04264    0.02918  -1.461  0.14407    
## `Tuition fees up to date`   2.23753    0.68152   3.283  0.00103 ** 
## GDP                        -0.13287    0.09599  -1.384  0.16639    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.46 on 4420 degrees of freedom
## Multiple R-squared:  0.003787,   Adjusted R-squared:  0.00311 
## F-statistic:   5.6 on 3 and 4420 DF,  p-value: 0.0007878

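Here, we’re predicting “Admission grade” from three explanatory variables: “Age at enrollment”, “Tuition fees up to date”, and “GDP”. Only “Tuition fees up to date” is statistically significant (p = 0.00103): holding age and GDP fixed, students whose fees are up to date have admission grades about 2.24 points higher on average. The main issue is explanatory power. With an R-squared of 0.0038, the model explains less than 1% of the variance in admission grade, and the residual standard error of 14.46 points is large compared with the estimated effects. The residuals are also right-skewed (minimum -32.97, median -1.13, maximum 63.86), so the normality assumption deserves a check.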
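The diagnostics from previous weeks can be applied to a fit like this one. The following is a minimal base R sketch on simulated stand-in data: the data frame `sim`, its column names, and the effect sizes are all hypothetical and are not the student dataset. With the real model, `fit` would simply be `model`.

```r
# Simulated stand-in data; column names and effect sizes are assumptions
set.seed(42)
n <- 500
sim <- data.frame(
  age = round(runif(n, 17, 60)),
  fees_ok = rbinom(n, 1, 0.9),
  gdp = rnorm(n, 0, 2)
)
sim$grade <- 125 + 2 * sim$fees_ok + rnorm(n, sd = 14)

fit <- lm(grade ~ age + fees_ok + gdp, data = sim)

# Share of variance explained; values near 0 mean weak predictive power
r_squared <- summary(fit)$r.squared

# Residuals vs fitted values: look for curvature or a funnel shape
plot(fitted(fit), resid(fit), xlab = "Fitted", ylab = "Residual")
abline(h = 0, lty = 2)

# Normal Q-Q plot of residuals: look for heavy or skewed tails
qqnorm(resid(fit))
qqline(resid(fit))

# Influential observations: Cook's distance above 4/n is a common rule of thumb
influential <- which(cooks.distance(fit) > 4 / n)
```

A low `r_squared` together with a skewed Q-Q plot would point to the same issues noted for the admission grade model.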

# Select variables for modeling
selected_vars <- c("Target", "Age at enrollment", "Admission grade", "Tuition fees up to date", "Gender")

# Create a subset of the dataset
data_subset <- mpg %>%
  select(all_of(selected_vars))

# Split the data into training and testing sets
set.seed(123)  # For reproducibility
train_index <- createDataPartition(data_subset$Target, p = 0.7, list = FALSE)
train_data <- data_subset[train_index, ]
test_data <- data_subset[-train_index, ]

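# Build a generalized linear model (logistic regression).
# Target has three levels (Dropout, Enrolled, Graduate). With family = binomial
# and a factor response, glm() treats the first level ("Dropout") as failure and
# all other levels as success, so this model estimates P(not Dropout).
glm_model <- glm(as.factor(Target) ~ ., data = train_data, family = binomial)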

# Assess model fit
summary(glm_model)

## 
## Call:
## glm(formula = as.factor(Target) ~ ., family = binomial, data = train_data)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -1.967374   0.446017  -4.411 1.03e-05 ***
## `Age at enrollment`       -0.056716   0.005733  -9.892  < 2e-16 ***
## `Admission grade`          0.014723   0.003086   4.772 1.83e-06 ***
## `Tuition fees up to date`  2.786490   0.159903  17.426  < 2e-16 ***
## Gender                    -0.679008   0.090529  -7.500 6.36e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3889.6  on 3097  degrees of freedom
## Residual deviance: 3156.5  on 3093  degrees of freedom
## AIC: 3166.5
## 
## Number of Fisher Scoring iterations: 4
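The `glm()` call above passes a three-level factor as the response. With `family = binomial`, R treats the first factor level as failure and every other level as success. The sketch below illustrates this on simulated data (the variables `x` and `target` are hypothetical stand-ins, not the student dataset) by showing that the factor fit matches an explicit "not Dropout" indicator.

```r
# Hypothetical stand-in data; not the student dataset
set.seed(1)
n <- 300
x <- rnorm(n)
target <- sample(c("Dropout", "Enrolled", "Graduate"), n, replace = TRUE)

# Fit 1: three-level factor response, as in the model above
fit_factor <- glm(as.factor(target) ~ x, family = binomial)

# Fit 2: explicit 0/1 response, where 1 means any level other than "Dropout"
not_dropout <- as.integer(target != "Dropout")
fit_binary <- glm(not_dropout ~ x, family = binomial)

# The two fits have the same coefficients
same_fit <- isTRUE(all.equal(coef(fit_factor), coef(fit_binary)))
same_fit
```

If the intent is to model Graduate versus Dropout only, one option is to drop the “Enrolled” rows before fitting. Either way, building the 0/1 indicator explicitly makes the modeled event visible in the code.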

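# Predicted probability of the success class (any Target level other than "Dropout")
predictions <- predict(glm_model, newdata = test_data, type = "response")

# Classify with a probability threshold of 0.5. The positive label is written
# "Graduate" here, but it stands for "not Dropout" (Enrolled or Graduate).
predicted_classes <- ifelse(predictions > 0.5, "Graduate", "Dropout")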

# Create a confusion matrix
confusion_matrix <- table(Predicted = predicted_classes, Actual = test_data$Target)

# Calculate accuracy, precision, recall, and F1 score.
# Rows are predictions and columns are actual classes. "Enrolled" and
# "Graduate" together form the positive (not Dropout) class.
positive <- c("Enrolled", "Graduate")
tp <- sum(confusion_matrix["Graduate", positive])  # true positives
accuracy <- (tp + confusion_matrix["Dropout", "Dropout"]) / sum(confusion_matrix)
precision <- tp / sum(confusion_matrix["Graduate", ])  # share of positive predictions that are right
recall <- tp / sum(confusion_matrix[, positive])       # share of actual positives that are found
f1_score <- 2 * (precision * recall) / (precision + recall)

# Print the confusion matrix and performance metrics
confusion_matrix

##           Actual
## Predicted  Dropout Enrolled Graduate
##   Dropout      172       21       33
##   Graduate     254      217      629

cat("Accuracy:", accuracy, "\n")

## Accuracy: 0.7677225

cat("Precision:", precision, "\n")

## Precision: 0.7690909

cat("Recall:", recall, "\n")

## Recall: 0.94

cat("F1 Score:", f1_score, "\n")

## F1 Score: 0.846

Performance Metrics:

Accuracy: 0.7677225 Precision: 0.7690909 Recall: 0.94 F1 Score: 0.846

Here’s a brief interpretation. Because the binomial model treats every Target level other than “Dropout” as a success, “Enrolled” and “Graduate” are counted together as the positive class.

Accuracy: The model places about 76.77% of test students on the correct side of the Dropout versus not-Dropout split. For context, predicting “not Dropout” for every student would already be right 67.87% of the time (900 of 1,326), so the gain over the majority-class baseline is modest.

Precision: The precision is approximately 76.91%. When the model predicts that a student will not drop out (the “Graduate” row), it is correct about 76.91% of the time.

Recall: The recall is 94%. The model finds 846 of the 900 students who did not drop out. The cost falls on the other class: only 172 of the 426 actual dropouts (40.38%) are flagged, which is the main weakness if the goal is to identify students at risk.

F1 Score: The F1 score, the harmonic mean of precision and recall for the not-Dropout class, is 0.846.
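As a cross-check on the metrics, the confusion matrix printed above can be rebuilt from its counts and collapsed to the two outcomes a binomial `glm()` actually separates when given a three-level factor: Dropout versus everything else. This is a sketch under that reading; the object names `cm` and `cm2` are illustrative.

```r
# Counts copied from the confusion matrix printed above (filled column by column)
cm <- matrix(
  c(172, 254, 21, 217, 33, 629), nrow = 2,
  dimnames = list(Predicted = c("Dropout", "Graduate"),
                  Actual = c("Dropout", "Enrolled", "Graduate"))
)

# Collapse the actual classes: Enrolled and Graduate both count as not Dropout
cm2 <- cbind(Dropout = cm[, "Dropout"],
             NotDropout = cm[, "Enrolled"] + cm[, "Graduate"])

# Correct predictions: predicted Dropout and was Dropout, or predicted
# "Graduate" (not Dropout) and was not Dropout
accuracy <- (cm2["Dropout", "Dropout"] + cm2["Graduate", "NotDropout"]) / sum(cm2)

# Majority-class baseline: always predict the larger actual class
baseline <- max(colSums(cm2)) / sum(cm2)

# Share of actual dropouts that the model flags as Dropout
dropout_recall <- cm2["Dropout", "Dropout"] / sum(cm2[, "Dropout"])

round(c(accuracy = accuracy, baseline = baseline, dropout_recall = dropout_recall), 4)
# accuracy 0.7677, baseline 0.6787, dropout_recall 0.4038
```

Comparing accuracy against the baseline, and looking at recall for the minority class, gives a fairer picture than accuracy alone when the classes are imbalanced.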

# Interpret the coefficients
# Extract coefficients
coef(glm_model)

##               (Intercept)       `Age at enrollment`         `Admission grade` 
##               -1.96737357               -0.05671599                0.01472295 
## `Tuition fees up to date`                    Gender 
##                2.78649004               -0.67900789

# Interpretation of the Admission grade coefficient.
# Names of non-syntactic columns keep their backticks in coef().
admission_grade_coef <- coef(glm_model)["`Admission grade`"]
exp(admission_grade_coef)  # Exponentiate to interpret as an odds ratio

## `Admission grade` 
##          1.014832

Each coefficient is the change in the log-odds of not dropping out (Target is “Enrolled” or “Graduate” rather than “Dropout”) for a one-unit increase in that predictor, holding the other predictors fixed. Exponentiating a coefficient gives the corresponding odds ratio. Here’s a brief interpretation:

Intercept: This is the log-odds when all predictors are zero (age 0, admission grade 0, fees not up to date, female). Those values lie outside the range of the data, so the intercept serves only as a reference point.

Age at enrollment: For each additional year of age at enrollment, the log-odds decrease by approximately 0.0567 (odds ratio of about 0.945).

Admission grade: For each one-point increase in the admission grade, the log-odds increase by approximately 0.0147 (odds ratio of about 1.015, or roughly 1.5% higher odds per point).

Tuition fees up to date: This is a 0/1 variable, so a one-unit increase means fees being up to date rather than not. It has by far the largest coefficient: the log-odds increase by approximately 2.7865 (odds ratio of about 16.2).

Gender: For the gender variable (1 for male, 0 for female), being male decreases the log-odds by approximately 0.6790 compared to being female (odds ratio of about 0.507).

Coefficient names for non-syntactic columns keep their backticks, so the lookup key must be "`Admission grade`". A key that matches no name, such as "Admission Grade", returns NA, which represents a missing value rather than an estimate.
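The step from log-odds to odds ratios and probabilities can be sketched with the coefficient values printed by `coef(glm_model)` above. The short names in the vector `b` and the example student are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the coef(glm_model) output above;
# the short names are illustrative labels, not the model's own names
b <- c(intercept = -1.96737357, age = -0.05671599,
       admission_grade = 0.01472295, fees_up_to_date = 2.78649004,
       gender = -0.67900789)

# Odds ratios: multiplicative change in the odds per one-unit increase
odds_ratios <- exp(b[-1])
round(odds_ratios, 3)

# Predicted probability for a hypothetical student (illustrative values):
# age 20, admission grade 127, fees up to date, Gender = 0
x <- c(1, 20, 127, 1, 0)
log_odds <- sum(b * x)

# The inverse logit turns log-odds into a probability
prob_not_dropout <- plogis(log_odds)
prob_not_dropout
```

Odds ratios above 1 raise the odds of the modeled event and values below 1 lower them, which is often easier to communicate than raw log-odds.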


Vaishali Kondoju

2023-11-04