Importing data:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ResourceSelection)
## Warning: package 'ResourceSelection' was built under R version 4.4.2
## ResourceSelection 0.3-6 2023-06-27
library(ggplot2)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
dataset <-read_delim("C:/Users/MSKR/MASTERS_ADS/STATISTICS_SEM1/DATA_SET_1.csv", delim = ",")
## Rows: 4424 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Target
## dbl (36): Marital status, Application mode, Application order, Course, Dayti...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Mutating new columns for better data visualization:
dataset_1<-dataset
dataset_1<-mutate(dataset_1, day_eve_class= ifelse(dataset_1$`Daytime/evening attendance ` == 1, "day","evening"))
dataset_1<-mutate(dataset, marital_status = ifelse(dataset$`Marital status` == 1, "single",
ifelse(`Marital status` == 2, "married",
ifelse(`Marital status` == 3, "widower",
ifelse(`Marital status` == 4, "divorced",
ifelse(`Marital status` == 5, "facto union",
ifelse(`Marital status` == 6, "legally seperated", "no")))))))
dataset_1$`Daytime/evening attendance `<-as.integer(dataset_1$`Daytime/evening attendance `)
dataset_1<-mutate(dataset_1, day_eve_class= ifelse(dataset_1$`Daytime/evening attendance `== 1, "day","evening"))
colnames(dataset_1)[c(26, 32)] <- c("Grades_1stSem", "Grades_2ndSem")
dataset_1$sem_results <- rowMeans(dataset_1[, c("Grades_1stSem", "Grades_2ndSem")], na.rm = TRUE)
dataset_1<-mutate(dataset_1, target = ifelse(dataset$Target == "Graduate",1,
ifelse(Target == "Enrolled",2,
ifelse(Target == "Dropout", 0, "no"))))
dataset_1$`Previous qualification (grade)`<-as.integer(dataset_1$`Previous qualification (grade)`)
dataset_1$`Admission grade`<-as.integer(dataset_1$`Admission grade`)
dataset_1$prev_qual_grade<-dataset_1$`Previous qualification (grade)`
dataset_1$adm_grade<-dataset_1$`Admission grade`
dataset_1$prev_perf <- rowMeans(dataset_1[, c("prev_qual_grade","adm_grade")], na.rm = TRUE)
dataset_1$past_grade<-dataset_1$prev_perf
The central challenge is to predict which students will successfully graduate, remain enrolled, or drop out based on their demographics and academic performance.
The specific questions are: Why do some students graduate while others drop out or remain enrolled without completing their degrees?
And how can we use the available data to predict these outcomes and implement timely interventions?
The goal of this project is to analyze the dataset using statistical methods to uncover patterns that influence student outcomes.
These insights will be used to build models to predict whether a student will graduate, remain enrolled, or drop out.
The impact of this analysis is significant: by identifying at-risk students early, institutions can design better interventions to increase graduation rates and improve overall student success.
dg<- dataset_1|>
group_by(marital_status,Target)|>
summarise(num_std=n())
## `summarise()` has grouped output by 'marital_status'. You can override using
## the `.groups` argument.
p1<- dg|>
ggplot(aes(x=marital_status, y=num_std,fill=Target))+
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Marital status vs Target",
x = "marital_status",
y = "Frequency") +
theme_minimal()
p1
Majority of the students from Single category are “Graduates” followed by “Droupouts”.
Admission grades are assumed to be higher for students who are expected to complete their graduation than those of Dropouts.
p2<- dataset_1|>
ggplot(aes(x=Target,y=`Admission grade`))+
geom_bar(stat="identity")+
labs(title = "Admission grades trend",
x = "Target",
y = "Grades")
p2
dd<- dataset_1|>
group_by(Target,day_eve_class)|>
summarise(freq=n())
## `summarise()` has grouped output by 'Target'. You can override using the
## `.groups` argument.
p3<- dd|>
ggplot(aes(x=day_eve_class,y=freq,fill=Target))+
geom_bar(stat="identity", position = position_dodge())
p3
p1<- dataset_1|>
ggplot(aes(x = Target, y =sem_results)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "sem_resuts vs target", x = "Target", y = "sem_results")
p1
## `geom_smooth()` using formula = 'y ~ x'
dataset_2<-filter(dataset_1,dataset_1$Grades_2ndSem!=0)
p3<- dataset_2|>
ggplot(aes(x = past_grade, y =Grades_2ndSem)) +
geom_point(size=1, shape=1) +
#geom_smooth(method = "lm", se = FALSE) +
labs(title = "1st sem vs target", x = "Past Grades", y = "2nd Sem results")
p3
library(pwr)
effect_size <- 0.1 # Define your minimum effect size (10% difference)
alpha <- 0.05
power <- 0.80
# Calculate sample size for each group
n <- pwr.2p.test(h = ES.h(0.5, 0.5 + effect_size), sig.level = alpha, power = power)$n
#Create a contingency table
dataset_2<-filter(dataset_1,Target!="Enrolled")
contingency_table <- table(dataset_2$Target)
prop_test <- prop.test(contingency_table)
prop_test
##
## 1-sample proportions test with continuity correction
##
## data: contingency_table, null probability 0.5
## X-squared = 170.63, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.3755686 0.4075829
## sample estimates:
## p
## 0.3914601
The test value(0.3914601) is greater than considered alpha (0.05), Therefore, the Null Hypothesis fails.
Summarizing the results in short:
Test Type: One-sample proportions test with continuity correction.
Test Statistic: Chi-squared value is 170.63 with 1 degree of freedom.
P-value: Less than 2.2e-16, indicating strong evidence against the null hypothesis.
95% Confidence Interval: The true proportion is estimated to be between approximately 0.3756 and 0.4076.
Sample Estimate: The observed proportion in your data is approximately 0.3915.
# Create a bar plot for graduation rates
ggplot(dataset_2, aes(x = Target)) +
geom_bar(aes(y = (..count..)/sum(..count..)), fill = "steelblue") +
labs(title = "Proportion of Graduates vs. Dropouts",
y = "Proportion",
x = "Target") +
scale_y_continuous(labels = scales::percent)
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Filter data for graduates and enrolled students
graduates <- dataset_1 |> filter(Target == "Graduate")
enrolled <- dataset_1 |> filter(Target == "Enrolled")
t_test_result <- t.test(graduates$sem_results, enrolled$sem_results)
t_test_result
##
## Welch Two Sample t-test
##
## data: graduates$sem_results and enrolled$sem_results
## t = 11.611, df = 1153.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.287354 1.810905
## sample estimates:
## mean of x mean of y
## 12.67056 11.12143
Summarizing the results in short:
Test Type: Welch Two Sample t-test comparing the means of two groups (graduates and enrolled).
Test Statistic: t = 11.611 with degrees of freedom (df) ≈ 1153.5.
P-value: Less than 2.2e-16, indicating strong evidence against the null hypothesis.
95% Confidence Interval: The true difference in means is estimated to be between approximately 1.2874 and 1.8109.
Sample Estimates:
Mean of graduates (x): 12.67056
Mean of enrolled (y): 11.12143
# Create a box plot for sem_resuts comparison
ggplot(dataset_1, aes(x = Target, y = sem_results)) +
geom_boxplot(fill = "lightgreen") +
labs(title = "Comparison of sem_results between Graduates and Enrolled Students",
y = "sem_results",
x = "Target")
As the dependent variable (target) is not normally distributed, linear regression may not be suitable.
GLMs are a flexible extension of linear regression models that allow the dependent variable to have a non-normal distribution.
For a binomial regression model, we shall have the target to be either 0 or 1. In our case we can drop the students who fall under “Enrolled” (2) category as they do not affect our analysis on understanding about the “Droupout” and “Graduate” students.
# Ensuring 'day_eve_class' is treated as a factor
dataset_1$day_eve_class <- as.factor(dataset_1$day_eve_class)
df_bin<-filter(dataset_1,target!=2)
drops <- c("Target")
df_bin<-df_bin[ , !(names(df_bin) %in% drops)]
df_bin$target<-as.numeric(df_bin$target)
dataset_1$`Previous qualification (grade)`<-as.integer(dataset_1$`Previous qualification (grade)`)
dataset_1$`Admission grade`<-as.integer(dataset_1$`Admission grade`)
dataset_1$prev_qual_grade<-dataset_1$`Previous qualification (grade)`
dataset_1$adm_grade<-dataset_1$`Admission grade`
# Full model with all available predictors
full_model <- glm(target ~ ., data = df_bin, family = binomial)
summary(full_model)
##
## Call:
## glm(formula = target ~ ., family = binomial, data = df_bin)
##
## Coefficients: (7 not defined because of singularities)
## Estimate Std. Error z value
## (Intercept) 3.842e+00 1.555e+01 0.247
## `Marital status` -8.810e-01 3.891e+00 -0.226
## `Application mode` -7.551e-03 4.968e-03 -1.520
## `Application order` -3.454e-02 5.379e-02 -0.642
## Course -1.928e-04 6.408e-05 -3.009
## `Daytime/evening attendance\\t` -1.627e-01 2.500e-01 -0.651
## `Previous qualification` 2.060e-02 7.568e-03 2.722
## `Previous qualification (grade)` -2.707e-03 6.061e-03 -0.447
## Nacionality -5.491e-02 1.413e-02 -3.887
## `Mother's qualification` -4.805e-03 5.185e-03 -0.927
## `Father's qualification` 7.966e-03 5.130e-03 1.553
## `Mother's occupation` 1.216e-02 6.637e-03 1.832
## `Father's occupation` -4.282e-03 6.993e-03 -0.612
## `Admission grade` 2.433e-03 5.778e-03 0.421
## Displaced -3.396e-01 1.542e-01 -2.202
## `Educational special needs` -2.866e-01 5.269e-01 -0.544
## Debtor -1.258e+00 2.368e-01 -5.312
## `Tuition fees up to date` 2.643e+00 2.802e-01 9.433
## Gender -4.018e-01 1.391e-01 -2.888
## `Scholarship holder` 7.958e-01 1.659e-01 4.797
## `Age at enrollment` -3.340e-02 1.379e-02 -2.421
## International 3.310e+00 8.123e-01 4.075
## `Curricular units 1st sem (credited)` -2.717e-01 1.024e-01 -2.652
## `Curricular units 1st sem (enrolled)` -2.030e-01 1.495e-01 -1.357
## `Curricular units 1st sem (evaluations)` 1.031e-02 3.773e-02 0.273
## `Curricular units 1st sem (approved)` 6.419e-01 8.166e-02 7.861
## Grades_1stSem -7.154e-02 5.130e-02 -1.395
## `Curricular units 1st sem (without evaluations)` 2.322e-01 1.678e-01 1.384
## `Curricular units 2nd sem (credited)` -1.210e-01 1.064e-01 -1.137
## `Curricular units 2nd sem (enrolled)` -7.848e-01 1.443e-01 -5.437
## `Curricular units 2nd sem (evaluations)` -1.798e-02 3.591e-02 -0.501
## `Curricular units 2nd sem (approved)` 9.776e-01 7.420e-02 13.176
## Grades_2ndSem 1.271e-01 5.360e-02 2.371
## `Curricular units 2nd sem (without evaluations)` 1.995e-01 1.362e-01 1.465
## `Unemployment rate` -5.728e-02 2.778e-02 -2.062
## `Inflation rate` 2.648e-02 4.750e-02 0.557
## GDP -2.970e-02 3.280e-02 -0.905
## marital_statusfacto union 8.970e-01 4.078e+00 0.220
## marital_statuslegally seperated 1.211e+00 8.389e+00 0.144
## marital_statusmarried -2.371e+00 7.735e+00 -0.307
## marital_statussingle -3.354e+00 1.162e+01 -0.289
## marital_statuswidower NA NA NA
## day_eve_classevening NA NA NA
## sem_results NA NA NA
## prev_qual_grade NA NA NA
## adm_grade NA NA NA
## prev_perf NA NA NA
## past_grade NA NA NA
## Pr(>|z|)
## (Intercept) 0.804877
## `Marital status` 0.820864
## `Application mode` 0.128554
## `Application order` 0.520782
## Course 0.002624 **
## `Daytime/evening attendance\\t` 0.515168
## `Previous qualification` 0.006487 **
## `Previous qualification (grade)` 0.655173
## Nacionality 0.000102 ***
## `Mother's qualification` 0.354114
## `Father's qualification` 0.120497
## `Mother's occupation` 0.066878 .
## `Father's occupation` 0.540363
## `Admission grade` 0.673647
## Displaced 0.027684 *
## `Educational special needs` 0.586503
## Debtor 1.08e-07 ***
## `Tuition fees up to date` < 2e-16 ***
## Gender 0.003876 **
## `Scholarship holder` 1.61e-06 ***
## `Age at enrollment` 0.015477 *
## International 4.60e-05 ***
## `Curricular units 1st sem (credited)` 0.008006 **
## `Curricular units 1st sem (enrolled)` 0.174655
## `Curricular units 1st sem (evaluations)` 0.784583
## `Curricular units 1st sem (approved)` 3.81e-15 ***
## Grades_1stSem 0.163153
## `Curricular units 1st sem (without evaluations)` 0.166434
## `Curricular units 2nd sem (credited)` 0.255495
## `Curricular units 2nd sem (enrolled)` 5.41e-08 ***
## `Curricular units 2nd sem (evaluations)` 0.616564
## `Curricular units 2nd sem (approved)` < 2e-16 ***
## Grades_2ndSem 0.017733 *
## `Curricular units 2nd sem (without evaluations)` 0.142993
## `Unemployment rate` 0.039226 *
## `Inflation rate` 0.577228
## GDP 0.365238
## marital_statusfacto union 0.825912
## marital_statuslegally seperated 0.885261
## marital_statusmarried 0.759160
## marital_statussingle 0.772847
## marital_statuswidower NA
## day_eve_classevening NA
## sem_results NA
## prev_qual_grade NA
## adm_grade NA
## prev_perf NA
## past_grade NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4859.8 on 3629 degrees of freedom
## Residual deviance: 1647.0 on 3589 degrees of freedom
## AIC: 1729
##
## Number of Fisher Scoring iterations: 8
Summary analysis:
Variables like marital_status, day_eve_class, sem_results, and prev_perf are not estimable because of perfect multi collinearity or redundancy. These columns are likely linearly dependent or contain constant values.
Most coefficients (e.g.,marital_status, Application_mode, etc.) have very large standard errors and near-zero estimates. This might be because of multi-collinearity or model over fitting.
A sharp drop in deviance (4859.8) implies the model explains almost all the variance in the response. A very low AIC (84) suggests the model fits the data well, but it could be misleading given the over fitting symptoms.
The large coefficient estimates and near-zero deviance suggest perfect separation in the data. This means the explanatory variables are nearly perfect predictors of the target, but this could fail on new data.
To overcome the above fall outs, we can find the best model that fits the data nearly favorable in the estimations.
Akaike Information Criterion (AIC) is a metric used to evaluate and compare the quality of statistical models, particularly in the context of regression and time series analysis.
# Stepwise selection using AIC
best_model <- stepAIC(full_model, direction = "both", trace = FALSE)
summary(best_model)
##
## Call:
## glm(formula = target ~ Course + `Previous qualification` + Nacionality +
## `Mother's occupation` + Displaced + Debtor + `Tuition fees up to date` +
## Gender + `Scholarship holder` + `Age at enrollment` + International +
## `Curricular units 1st sem (credited)` + `Curricular units 1st sem (approved)` +
## `Curricular units 1st sem (without evaluations)` + `Curricular units 2nd sem (enrolled)` +
## `Curricular units 2nd sem (approved)` + Grades_2ndSem + `Curricular units 2nd sem (without evaluations)` +
## `Unemployment rate` + `Marital status`, family = binomial,
## data = df_bin)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -8.107e-01 5.170e-01 -1.568
## Course -2.299e-04 5.703e-05 -4.031
## `Previous qualification` 1.496e-02 6.704e-03 2.232
## Nacionality -5.434e-02 1.421e-02 -3.824
## `Mother's occupation` 8.116e-03 3.353e-03 2.420
## Displaced -3.477e-01 1.466e-01 -2.371
## Debtor -1.256e+00 2.343e-01 -5.359
## `Tuition fees up to date` 2.672e+00 2.814e-01 9.495
## Gender -4.183e-01 1.369e-01 -3.056
## `Scholarship holder` 8.036e-01 1.633e-01 4.921
## `Age at enrollment` -3.742e-02 1.089e-02 -3.438
## International 3.194e+00 8.120e-01 3.934
## `Curricular units 1st sem (credited)` -3.872e-01 5.041e-02 -7.682
## `Curricular units 1st sem (approved)` 6.033e-01 6.714e-02 8.986
## `Curricular units 1st sem (without evaluations)` 2.371e-01 1.602e-01 1.481
## `Curricular units 2nd sem (enrolled)` -9.836e-01 7.708e-02 -12.760
## `Curricular units 2nd sem (approved)` 9.894e-01 6.559e-02 15.084
## Grades_2ndSem 8.543e-02 4.239e-02 2.016
## `Curricular units 2nd sem (without evaluations)` 2.058e-01 1.330e-01 1.547
## `Unemployment rate` -5.288e-02 2.486e-02 -2.128
## `Marital status` 2.013e-01 1.410e-01 1.428
## Pr(>|z|)
## (Intercept) 0.116898
## Course 5.54e-05 ***
## `Previous qualification` 0.025646 *
## Nacionality 0.000131 ***
## `Mother's occupation` 0.015513 *
## Displaced 0.017742 *
## Debtor 8.37e-08 ***
## `Tuition fees up to date` < 2e-16 ***
## Gender 0.002242 **
## `Scholarship holder` 8.60e-07 ***
## `Age at enrollment` 0.000586 ***
## International 8.37e-05 ***
## `Curricular units 1st sem (credited)` 1.56e-14 ***
## `Curricular units 1st sem (approved)` < 2e-16 ***
## `Curricular units 1st sem (without evaluations)` 0.138730
## `Curricular units 2nd sem (enrolled)` < 2e-16 ***
## `Curricular units 2nd sem (approved)` < 2e-16 ***
## Grades_2ndSem 0.043841 *
## `Curricular units 2nd sem (without evaluations)` 0.121835
## `Unemployment rate` 0.033375 *
## `Marital status` 0.153423
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4859.8 on 3629 degrees of freedom
## Residual deviance: 1659.1 on 3609 degrees of freedom
## AIC: 1701.1
##
## Number of Fisher Scoring iterations: 7
odds_ratios <- exp(coef(best_model))
print(odds_ratios)
## (Intercept)
## 0.4445558
## Course
## 0.9997701
## `Previous qualification`
## 1.0150718
## Nacionality
## 0.9471104
## `Mother's occupation`
## 1.0081490
## Displaced
## 0.7063150
## Debtor
## 0.2848719
## `Tuition fees up to date`
## 14.4679419
## Gender
## 0.6581893
## `Scholarship holder`
## 2.2335079
## `Age at enrollment`
## 0.9632678
## International
## 24.3904190
## `Curricular units 1st sem (credited)`
## 0.6789414
## `Curricular units 1st sem (approved)`
## 1.8281873
## `Curricular units 1st sem (without evaluations)`
## 1.2676240
## `Curricular units 2nd sem (enrolled)`
## 0.3739787
## `Curricular units 2nd sem (approved)`
## 2.6895669
## Grades_2ndSem
## 1.0891886
## `Curricular units 2nd sem (without evaluations)`
## 1.2284935
## `Unemployment rate`
## 0.9484920
## `Marital status`
## 1.2230046
conf_intervals <- exp(confint(best_model))
## Waiting for profiling to be done...
print(conf_intervals)
## 2.5 % 97.5 %
## (Intercept) 0.1601521 1.2171744
## Course 0.9996573 0.9998808
## `Previous qualification` 1.0020865 1.0287939
## Nacionality 0.9203756 0.9739408
## `Mother's occupation` 1.0019720 1.0152707
## Displaced 0.5286202 0.9396994
## Debtor 0.1799871 0.4513463
## `Tuition fees up to date` 8.4665229 25.5787590
## Gender 0.5035818 0.8614147
## `Scholarship holder` 1.6294400 3.0927850
## `Age at enrollment` 0.9430934 0.9842764
## International 5.1153038 123.4982590
## `Curricular units 1st sem (credited)` 0.6146627 0.7490615
## `Curricular units 1st sem (approved)` 1.6049745 2.0886667
## `Curricular units 1st sem (without evaluations)` 0.9273699 1.7370475
## `Curricular units 2nd sem (enrolled)` 0.3204658 0.4336439
## `Curricular units 2nd sem (approved)` 2.3712039 3.0668913
## Grades_2ndSem 1.0044698 1.1858909
## `Curricular units 2nd sem (without evaluations)` 0.9472852 1.5925452
## `Unemployment rate` 0.9032723 0.9957707
## `Marital status` 0.9317590 1.6174690
library(ggplot2)
odds_data <- data.frame(
Predictor = names(odds_ratios),
OddsRatio = odds_ratios,
LowerCI = conf_intervals[, 1],
UpperCI = conf_intervals[, 2]
)
ggplot(odds_data, aes(x = Predictor, y = OddsRatio)) +
geom_bar(stat = "identity", fill = "skyblue") +
geom_errorbar(aes(ymin = LowerCI, ymax = UpperCI), width = 0.1) +
theme_minimal() +
labs(title = "Odds Ratios of Predictors", y = "Odds Ratio")
Model Summary:
Null Deviance: 4859.8 (deviance with no predictors).
Residual Deviance: 1659.1 (deviance after adding predictors, much lower—indicating improved model fit).
AIC: 1701.1 (used for model comparison; lower values indicate better fit).
Interpretation of Significant Predictors:
Course: Negative coefficient (beta = -2.299e-04), significant at p < 0.001. Suggests that specific courses have a small but significant negative effect on the log-odds of the target outcome.
Previous qualification: Positive and significant (beta = 1.496e-02), indicating better previous qualifications increase the likelihood of the outcome.
Nacionality: Negative and significant (beta = -5.434e-02), suggesting non-domestic nationality decreases the likelihood of the target outcome.
Debtor: Strong negative impact (beta = -1.256,p < 0.001) on the outcome. Being a debtor significantly reduces the chances.
Tuition fees up to date: Large positive coefficient (beta = 2.672, p < 0.001), indicating that students up-to-date on tuition have much higher odds of the outcome.
Scholarship holder: Positive effect (beta = 0.8036, p < 0.001), showing that receiving a scholarship increases the odds of success.
International: Strong positive impact (beta = 3.194), indicating international students are more likely to achieve the outcome.
Curricular units: Various significant contributions:
1st sem credited: Negative (beta = -0.387), fewer credited units decrease likelihood.
2nd sem enrolled: Strong negative (beta = -0.983), suggesting being enrolled without approval reduces success.
2nd sem approved: Positive (beta = 0.989), indicating more approvals increase success likelihood.
Non-Significant Predictors:
Model Diagnostics:
Residual Deviance vs. Null Deviance: Substantial reduction, indicating that the predictors provide a much better fit than the null model.
Number of Iterations: 7 iterations indicate stable convergence for this logistic regression.
# 1. Residuals vs Fitted Plot
plot(best_model, which = 1, main = "Residuals vs Fitted")
Interpretation:
Horizontal band of residuals: The residuals are spread relatively evenly around the 0-line, which is expected for a well-fitted model.
Outliers (2013, 2010, 3480): These data points have high residuals, suggesting they do not fit the model well. They could indicate influential or misclassified observations.
No clear pattern: Ideally, residuals should not show any systematic structure or trend. The absence of a curve or clustering around certain fitted values is a positive sign.
# 2. Cook's Distance
plot(best_model, which = 4, main = "Cook's Distance")
Interpretation:
Highlighted Observations (1334, 2013, 2766): These points have higher Cook’s Distance values, indicating they could be influential observations.
Threshold for Influence: A general rule of thumb is that values above 1 (or significantly larger than the majority of the data) are influential, but none here exceed that threshold.
# 3. Leverage Plot
plot(hatvalues(best_model), main = "Leverage Values", ylab = "Leverage")
abline(h = 2 * mean(hatvalues(best_model)), col = "red", lty = 2)
Interpretation:
Most observations have low leverage values, which is expected in a well-distributed dataset.
High-leverage points are seen near the top of the plot but remain below a threshold of 0.5. A few points have higher leverage, meaning their predictor values are farther from the center of the data set, but they don’t seem to be extreme outliers.
# 4. Variance Inflation Factor (VIF) for multicollinearity
vif(best_model)
## Course
## 7.655553
## `Previous qualification`
## 1.111836
## Nacionality
## 3.262373
## `Mother's occupation`
## 1.042082
## Displaced
## 1.250714
## Debtor
## 1.127495
## `Tuition fees up to date`
## 1.132134
## Gender
## 1.090516
## `Scholarship holder`
## 1.049656
## `Age at enrollment`
## 1.752143
## International
## 3.349127
## `Curricular units 1st sem (credited)`
## 4.299866
## `Curricular units 1st sem (approved)`
## 8.776707
## `Curricular units 1st sem (without evaluations)`
## 1.168276
## `Curricular units 2nd sem (enrolled)`
## 11.197613
## `Curricular units 2nd sem (approved)`
## 6.949030
## Grades_2ndSem
## 8.472960
## `Curricular units 2nd sem (without evaluations)`
## 1.132925
## `Unemployment rate`
## 1.095878
## `Marital status`
## 1.505745
Low VIFs (< 5) for most variables: Indicates low multicollinearity and minimal impact on the stability of coefficients.
Moderate VIFs (5–10): Suggest some multicollinearity but not critically problematic.
High VIFs (> 10): Indicates strong multicollinearity, potentially problematic.
The significant features for our models turned out to be: Previous Qualification, Nationality, Scholarship holder and Semester grades.
Students who couldn’t perform well in their previous grades, if they can perform well in the current semesters, they have higher chances of graduating.
Students should prefer day classes over evening classes in order to have a better chance of graduating.