Final Project - Intro to Stats with R

Student Demographics v/s Academic Performance

Importing data:

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ResourceSelection)

## Warning: package 'ResourceSelection' was built under R version 4.4.2

## ResourceSelection 0.3-6   2023-06-27

library(ggplot2)
library(car)

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

library(MASS)

## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select

dataset <-read_delim("C:/Users/MSKR/MASTERS_ADS/STATISTICS_SEM1/DATA_SET_1.csv", delim = ",")

## Rows: 4424 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): Target
## dbl (36): Marital status, Application mode, Application order, Course, Dayti...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Mutating new columns for better data visualization:

dataset_1<-dataset
dataset_1<-mutate(dataset_1, day_eve_class= ifelse(dataset_1$`Daytime/evening attendance    ` == 1, "day","evening"))

dataset_1<-mutate(dataset, marital_status = ifelse(dataset$`Marital status` == 1, "single",
                    ifelse(`Marital status` == 2, "married",
                    ifelse(`Marital status` == 3, "widower",
                    ifelse(`Marital status` == 4, "divorced",
                    ifelse(`Marital status` == 5, "facto union",
                    ifelse(`Marital status` == 6, "legally seperated", "no")))))))

dataset_1$`Daytime/evening attendance   `<-as.integer(dataset_1$`Daytime/evening attendance `)

dataset_1<-mutate(dataset_1, day_eve_class= ifelse(dataset_1$`Daytime/evening attendance    `== 1, "day","evening"))

colnames(dataset_1)[c(26, 32)] <- c("Grades_1stSem", "Grades_2ndSem")

dataset_1$sem_results <- rowMeans(dataset_1[, c("Grades_1stSem", "Grades_2ndSem")], na.rm = TRUE)

dataset_1<-mutate(dataset_1, target = ifelse(dataset$Target == "Graduate",1,
                    ifelse(Target == "Enrolled",2,
                    ifelse(Target == "Dropout", 0, "no"))))

dataset_1$`Previous qualification (grade)`<-as.integer(dataset_1$`Previous qualification (grade)`)

dataset_1$`Admission grade`<-as.integer(dataset_1$`Admission grade`) 

dataset_1$prev_qual_grade<-dataset_1$`Previous qualification (grade)`

dataset_1$adm_grade<-dataset_1$`Admission grade`

dataset_1$prev_perf <- rowMeans(dataset_1[, c("prev_qual_grade","adm_grade")], na.rm = TRUE)

dataset_1$past_grade<-dataset_1$prev_perf

Problem Statement:

The central challenge is to predict which students will successfully graduate, remain enrolled, or drop out based on their demographics and academic performance.
The specific questions are: Why do some students graduate while others drop out or remain enrolled without completing their degrees?
And how can we use the available data to predict these outcomes and implement timely interventions?

Project Goal and Impact:

The goal of this project is to analyze the dataset using statistical methods to uncover patterns that influence student outcomes.
These insights will be used to build models to predict whether a student will graduate, remain enrolled, or drop out.
The impact of this analysis is significant: by identifying at-risk students early, institutions can design better interventions to increase graduation rates and improve overall student success.

Exploratory Data Analysis (EDA):

Our data set is biased towards “Single” marital status student category.

dg<- dataset_1|>
  group_by(marital_status,Target)|>
  summarise(num_std=n())

## `summarise()` has grouped output by 'marital_status'. You can override using
## the `.groups` argument.

p1<- dg|>
  ggplot(aes(x=marital_status, y=num_std,fill=Target))+
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Marital status vs Target",
       x = "marital_status",
       y = "Frequency") +
  theme_minimal()
p1

Majority of the students from Single category are “Graduates” followed by “Droupouts”.
Admission grades are assumed to be higher for students who are expected to complete their graduation than those of Dropouts.

p2<- dataset_1|>
  ggplot(aes(x=Target,y=`Admission grade`))+
  geom_bar(stat="identity")+
  labs(title = "Admission grades trend",
       x = "Target",
       y = "Grades") 
p2

Students attending day classes have better academic performances than those of attending in the evening.

dd<- dataset_1|>
  group_by(Target,day_eve_class)|>
  summarise(freq=n())

## `summarise()` has grouped output by 'Target'. You can override using the
## `.groups` argument.

p3<- dd|>
  ggplot(aes(x=day_eve_class,y=freq,fill=Target))+
  geom_bar(stat="identity", position = position_dodge())
p3

Assumptions and Justifications:

The average semester grades for graduate students is higher than those who dropout from the courses.

p1<- dataset_1|>
  ggplot(aes(x = Target, y =sem_results)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "sem_resuts vs target", x = "Target", y = "sem_results")
p1

## `geom_smooth()` using formula = 'y ~ x'

The plot avg_sem_results v/s target shows that our assumption is valid.

Students who performed well in their previous grades and admission tests will also perform good in their current semester results.

dataset_2<-filter(dataset_1,dataset_1$Grades_2ndSem!=0)

p3<- dataset_2|>
  ggplot(aes(x = past_grade, y =Grades_2ndSem)) +
  geom_point(size=1, shape=1) +
  #geom_smooth(method = "lm", se = FALSE) +
  labs(title = "1st sem vs target", x = "Past Grades", y = "2nd Sem results")
p3

From the scatter plot, the majority of the students whose past grades are near to average values have average performance in their semesters as well, whereas the students with higher past grades have a diversified spread of their performance which is not in our assumption.

Hypothesis Testing:

The true proportion between Graduate and Dropout students count is 0.5. (used Neyman-Pearson framework)

library(pwr)

effect_size <- 0.1  # Define your minimum effect size (10% difference)
alpha <- 0.05
power <- 0.80

# Calculate sample size for each group
n <- pwr.2p.test(h = ES.h(0.5, 0.5 + effect_size), sig.level = alpha, power = power)$n

The sample size is calculated, which is required for a two-proportion test, given a minimum detectable effect size (10% difference), a significance level of 0.05, and a desired power of 0.80.

#Create a contingency table 
dataset_2<-filter(dataset_1,Target!="Enrolled")
contingency_table <- table(dataset_2$Target)

We have filtered out the “Enrolled” category students from our data as this set of students are still pursuing the program and would be used in testing our model if they are going to be graduated or dropout of the program.

Perform the z-test:

prop_test <- prop.test(contingency_table)
prop_test

## 
##  1-sample proportions test with continuity correction
## 
## data:  contingency_table, null probability 0.5
## X-squared = 170.63, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.3755686 0.4075829
## sample estimates:
##         p 
## 0.3914601

The test value(0.3914601) is greater than considered alpha (0.05), Therefore, the Null Hypothesis fails.

Summarizing the results in short:

Test Type: One-sample proportions test with continuity correction.
Test Statistic: Chi-squared value is 170.63 with 1 degree of freedom.
P-value: Less than 2.2e-16, indicating strong evidence against the null hypothesis.
95% Confidence Interval: The true proportion is estimated to be between approximately 0.3756 and 0.4076.
Sample Estimate: The observed proportion in your data is approximately 0.3915.

# Create a bar plot for graduation rates
ggplot(dataset_2, aes(x = Target)) +
  geom_bar(aes(y = (..count..)/sum(..count..)), fill = "steelblue") +
  labs(title = "Proportion of Graduates vs. Dropouts",
       y = "Proportion",
       x = "Target") +
  scale_y_continuous(labels = scales::percent)

## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The above metrics and bar plot conclude that there is strong evidence to suggest that the true proportion of the population differs from 0.5, and the Alternate Hypothesis is correct.

There is no difference in the average semester grades of Graduate and Enrolled students. (used Fisher’s significance testing framework)

# Filter data for graduates and enrolled students
graduates <- dataset_1 |> filter(Target == "Graduate")
enrolled <- dataset_1 |> filter(Target == "Enrolled")

Perform the t-test

t_test_result <- t.test(graduates$sem_results, enrolled$sem_results)
t_test_result

## 
##  Welch Two Sample t-test
## 
## data:  graduates$sem_results and enrolled$sem_results
## t = 11.611, df = 1153.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.287354 1.810905
## sample estimates:
## mean of x mean of y 
##  12.67056  11.12143

Summarizing the results in short:

Test Type: Welch Two Sample t-test comparing the means of two groups (graduates and enrolled).
Test Statistic: t = 11.611 with degrees of freedom (df) ≈ 1153.5.
P-value: Less than 2.2e-16, indicating strong evidence against the null hypothesis.
95% Confidence Interval: The true difference in means is estimated to be between approximately 1.2874 and 1.8109.
Sample Estimates:
- Mean of graduates (x): 12.67056
- Mean of enrolled (y): 11.12143

# Create a box plot for sem_resuts comparison
ggplot(dataset_1, aes(x = Target, y = sem_results)) +
  geom_boxplot(fill = "lightgreen") +
  labs(title = "Comparison of sem_results between Graduates and Enrolled Students",
       y = "sem_results",
       x = "Target")

Since the p value is less than 0.05, our hypothesis fails and proves that the average semester results of Enrolled students is different to those of Graduated students.

Generalized Linear Model (GLM) :

As the dependent variable (target) is not normally distributed, linear regression may not be suitable.
GLMs are a flexible extension of linear regression models that allow the dependent variable to have a non-normal distribution.
For a binomial regression model, we shall have the target to be either 0 or 1. In our case we can drop the students who fall under “Enrolled” (2) category as they do not affect our analysis on understanding about the “Droupout” and “Graduate” students.

# Ensuring 'day_eve_class' is treated as a factor
dataset_1$day_eve_class <- as.factor(dataset_1$day_eve_class)

df_bin<-filter(dataset_1,target!=2)

drops <- c("Target")
df_bin<-df_bin[ , !(names(df_bin) %in% drops)]

df_bin$target<-as.numeric(df_bin$target)

Adding a column “prev_perf” which has the mean values of Previous_grades and Admission_grades of a student.

dataset_1$`Previous qualification (grade)`<-as.integer(dataset_1$`Previous qualification (grade)`)

dataset_1$`Admission grade`<-as.integer(dataset_1$`Admission grade`) 

dataset_1$prev_qual_grade<-dataset_1$`Previous qualification (grade)`

dataset_1$adm_grade<-dataset_1$`Admission grade`

# Full model with all available predictors
full_model <- glm(target ~ ., data = df_bin, family = binomial)

summary(full_model)

## 
## Call:
## glm(formula = target ~ ., family = binomial, data = df_bin)
## 
## Coefficients: (7 not defined because of singularities)
##                                                    Estimate Std. Error z value
## (Intercept)                                       3.842e+00  1.555e+01   0.247
## `Marital status`                                 -8.810e-01  3.891e+00  -0.226
## `Application mode`                               -7.551e-03  4.968e-03  -1.520
## `Application order`                              -3.454e-02  5.379e-02  -0.642
## Course                                           -1.928e-04  6.408e-05  -3.009
## `Daytime/evening attendance\\t`                  -1.627e-01  2.500e-01  -0.651
## `Previous qualification`                          2.060e-02  7.568e-03   2.722
## `Previous qualification (grade)`                 -2.707e-03  6.061e-03  -0.447
## Nacionality                                      -5.491e-02  1.413e-02  -3.887
## `Mother's qualification`                         -4.805e-03  5.185e-03  -0.927
## `Father's qualification`                          7.966e-03  5.130e-03   1.553
## `Mother's occupation`                             1.216e-02  6.637e-03   1.832
## `Father's occupation`                            -4.282e-03  6.993e-03  -0.612
## `Admission grade`                                 2.433e-03  5.778e-03   0.421
## Displaced                                        -3.396e-01  1.542e-01  -2.202
## `Educational special needs`                      -2.866e-01  5.269e-01  -0.544
## Debtor                                           -1.258e+00  2.368e-01  -5.312
## `Tuition fees up to date`                         2.643e+00  2.802e-01   9.433
## Gender                                           -4.018e-01  1.391e-01  -2.888
## `Scholarship holder`                              7.958e-01  1.659e-01   4.797
## `Age at enrollment`                              -3.340e-02  1.379e-02  -2.421
## International                                     3.310e+00  8.123e-01   4.075
## `Curricular units 1st sem (credited)`            -2.717e-01  1.024e-01  -2.652
## `Curricular units 1st sem (enrolled)`            -2.030e-01  1.495e-01  -1.357
## `Curricular units 1st sem (evaluations)`          1.031e-02  3.773e-02   0.273
## `Curricular units 1st sem (approved)`             6.419e-01  8.166e-02   7.861
## Grades_1stSem                                    -7.154e-02  5.130e-02  -1.395
## `Curricular units 1st sem (without evaluations)`  2.322e-01  1.678e-01   1.384
## `Curricular units 2nd sem (credited)`            -1.210e-01  1.064e-01  -1.137
## `Curricular units 2nd sem (enrolled)`            -7.848e-01  1.443e-01  -5.437
## `Curricular units 2nd sem (evaluations)`         -1.798e-02  3.591e-02  -0.501
## `Curricular units 2nd sem (approved)`             9.776e-01  7.420e-02  13.176
## Grades_2ndSem                                     1.271e-01  5.360e-02   2.371
## `Curricular units 2nd sem (without evaluations)`  1.995e-01  1.362e-01   1.465
## `Unemployment rate`                              -5.728e-02  2.778e-02  -2.062
## `Inflation rate`                                  2.648e-02  4.750e-02   0.557
## GDP                                              -2.970e-02  3.280e-02  -0.905
## marital_statusfacto union                         8.970e-01  4.078e+00   0.220
## marital_statuslegally seperated                   1.211e+00  8.389e+00   0.144
## marital_statusmarried                            -2.371e+00  7.735e+00  -0.307
## marital_statussingle                             -3.354e+00  1.162e+01  -0.289
## marital_statuswidower                                    NA         NA      NA
## day_eve_classevening                                     NA         NA      NA
## sem_results                                              NA         NA      NA
## prev_qual_grade                                          NA         NA      NA
## adm_grade                                                NA         NA      NA
## prev_perf                                                NA         NA      NA
## past_grade                                               NA         NA      NA
##                                                  Pr(>|z|)    
## (Intercept)                                      0.804877    
## `Marital status`                                 0.820864    
## `Application mode`                               0.128554    
## `Application order`                              0.520782    
## Course                                           0.002624 ** 
## `Daytime/evening attendance\\t`                  0.515168    
## `Previous qualification`                         0.006487 ** 
## `Previous qualification (grade)`                 0.655173    
## Nacionality                                      0.000102 ***
## `Mother's qualification`                         0.354114    
## `Father's qualification`                         0.120497    
## `Mother's occupation`                            0.066878 .  
## `Father's occupation`                            0.540363    
## `Admission grade`                                0.673647    
## Displaced                                        0.027684 *  
## `Educational special needs`                      0.586503    
## Debtor                                           1.08e-07 ***
## `Tuition fees up to date`                         < 2e-16 ***
## Gender                                           0.003876 ** 
## `Scholarship holder`                             1.61e-06 ***
## `Age at enrollment`                              0.015477 *  
## International                                    4.60e-05 ***
## `Curricular units 1st sem (credited)`            0.008006 ** 
## `Curricular units 1st sem (enrolled)`            0.174655    
## `Curricular units 1st sem (evaluations)`         0.784583    
## `Curricular units 1st sem (approved)`            3.81e-15 ***
## Grades_1stSem                                    0.163153    
## `Curricular units 1st sem (without evaluations)` 0.166434    
## `Curricular units 2nd sem (credited)`            0.255495    
## `Curricular units 2nd sem (enrolled)`            5.41e-08 ***
## `Curricular units 2nd sem (evaluations)`         0.616564    
## `Curricular units 2nd sem (approved)`             < 2e-16 ***
## Grades_2ndSem                                    0.017733 *  
## `Curricular units 2nd sem (without evaluations)` 0.142993    
## `Unemployment rate`                              0.039226 *  
## `Inflation rate`                                 0.577228    
## GDP                                              0.365238    
## marital_statusfacto union                        0.825912    
## marital_statuslegally seperated                  0.885261    
## marital_statusmarried                            0.759160    
## marital_statussingle                             0.772847    
## marital_statuswidower                                  NA    
## day_eve_classevening                                   NA    
## sem_results                                            NA    
## prev_qual_grade                                        NA    
## adm_grade                                              NA    
## prev_perf                                              NA    
## past_grade                                             NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4859.8  on 3629  degrees of freedom
## Residual deviance: 1647.0  on 3589  degrees of freedom
## AIC: 1729
## 
## Number of Fisher Scoring iterations: 8

Summary analysis:
- Variables like marital_status, day_eve_class, sem_results, and prev_perf are not estimable because of perfect multi collinearity or redundancy. These columns are likely linearly dependent or contain constant values.
- Most coefficients (e.g.,marital_status, Application_mode, etc.) have very large standard errors and near-zero estimates. This might be because of multi-collinearity or model over fitting.
- A sharp drop in deviance (4859.8) implies the model explains almost all the variance in the response. A very low AIC (84) suggests the model fits the data well, but it could be misleading given the over fitting symptoms.
- The large coefficient estimates and near-zero deviance suggest perfect separation in the data. This means the explanatory variables are nearly perfect predictors of the target, but this could fail on new data.
- To overcome the above fall outs, we can find the best model that fits the data nearly favorable in the estimations.
Akaike Information Criterion (AIC) is a metric used to evaluate and compare the quality of statistical models, particularly in the context of regression and time series analysis.

# Stepwise selection using AIC
best_model <- stepAIC(full_model, direction = "both", trace = FALSE)

summary(best_model)

## 
## Call:
## glm(formula = target ~ Course + `Previous qualification` + Nacionality + 
##     `Mother's occupation` + Displaced + Debtor + `Tuition fees up to date` + 
##     Gender + `Scholarship holder` + `Age at enrollment` + International + 
##     `Curricular units 1st sem (credited)` + `Curricular units 1st sem (approved)` + 
##     `Curricular units 1st sem (without evaluations)` + `Curricular units 2nd sem (enrolled)` + 
##     `Curricular units 2nd sem (approved)` + Grades_2ndSem + `Curricular units 2nd sem (without evaluations)` + 
##     `Unemployment rate` + `Marital status`, family = binomial, 
##     data = df_bin)
## 
## Coefficients:
##                                                    Estimate Std. Error z value
## (Intercept)                                      -8.107e-01  5.170e-01  -1.568
## Course                                           -2.299e-04  5.703e-05  -4.031
## `Previous qualification`                          1.496e-02  6.704e-03   2.232
## Nacionality                                      -5.434e-02  1.421e-02  -3.824
## `Mother's occupation`                             8.116e-03  3.353e-03   2.420
## Displaced                                        -3.477e-01  1.466e-01  -2.371
## Debtor                                           -1.256e+00  2.343e-01  -5.359
## `Tuition fees up to date`                         2.672e+00  2.814e-01   9.495
## Gender                                           -4.183e-01  1.369e-01  -3.056
## `Scholarship holder`                              8.036e-01  1.633e-01   4.921
## `Age at enrollment`                              -3.742e-02  1.089e-02  -3.438
## International                                     3.194e+00  8.120e-01   3.934
## `Curricular units 1st sem (credited)`            -3.872e-01  5.041e-02  -7.682
## `Curricular units 1st sem (approved)`             6.033e-01  6.714e-02   8.986
## `Curricular units 1st sem (without evaluations)`  2.371e-01  1.602e-01   1.481
## `Curricular units 2nd sem (enrolled)`            -9.836e-01  7.708e-02 -12.760
## `Curricular units 2nd sem (approved)`             9.894e-01  6.559e-02  15.084
## Grades_2ndSem                                     8.543e-02  4.239e-02   2.016
## `Curricular units 2nd sem (without evaluations)`  2.058e-01  1.330e-01   1.547
## `Unemployment rate`                              -5.288e-02  2.486e-02  -2.128
## `Marital status`                                  2.013e-01  1.410e-01   1.428
##                                                  Pr(>|z|)    
## (Intercept)                                      0.116898    
## Course                                           5.54e-05 ***
## `Previous qualification`                         0.025646 *  
## Nacionality                                      0.000131 ***
## `Mother's occupation`                            0.015513 *  
## Displaced                                        0.017742 *  
## Debtor                                           8.37e-08 ***
## `Tuition fees up to date`                         < 2e-16 ***
## Gender                                           0.002242 ** 
## `Scholarship holder`                             8.60e-07 ***
## `Age at enrollment`                              0.000586 ***
## International                                    8.37e-05 ***
## `Curricular units 1st sem (credited)`            1.56e-14 ***
## `Curricular units 1st sem (approved)`             < 2e-16 ***
## `Curricular units 1st sem (without evaluations)` 0.138730    
## `Curricular units 2nd sem (enrolled)`             < 2e-16 ***
## `Curricular units 2nd sem (approved)`             < 2e-16 ***
## Grades_2ndSem                                    0.043841 *  
## `Curricular units 2nd sem (without evaluations)` 0.121835    
## `Unemployment rate`                              0.033375 *  
## `Marital status`                                 0.153423    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4859.8  on 3629  degrees of freedom
## Residual deviance: 1659.1  on 3609  degrees of freedom
## AIC: 1701.1
## 
## Number of Fisher Scoring iterations: 7

odds_ratios <- exp(coef(best_model))
print(odds_ratios)

##                                      (Intercept) 
##                                        0.4445558 
##                                           Course 
##                                        0.9997701 
##                         `Previous qualification` 
##                                        1.0150718 
##                                      Nacionality 
##                                        0.9471104 
##                            `Mother's occupation` 
##                                        1.0081490 
##                                        Displaced 
##                                        0.7063150 
##                                           Debtor 
##                                        0.2848719 
##                        `Tuition fees up to date` 
##                                       14.4679419 
##                                           Gender 
##                                        0.6581893 
##                             `Scholarship holder` 
##                                        2.2335079 
##                              `Age at enrollment` 
##                                        0.9632678 
##                                    International 
##                                       24.3904190 
##            `Curricular units 1st sem (credited)` 
##                                        0.6789414 
##            `Curricular units 1st sem (approved)` 
##                                        1.8281873 
## `Curricular units 1st sem (without evaluations)` 
##                                        1.2676240 
##            `Curricular units 2nd sem (enrolled)` 
##                                        0.3739787 
##            `Curricular units 2nd sem (approved)` 
##                                        2.6895669 
##                                    Grades_2ndSem 
##                                        1.0891886 
## `Curricular units 2nd sem (without evaluations)` 
##                                        1.2284935 
##                              `Unemployment rate` 
##                                        0.9484920 
##                                 `Marital status` 
##                                        1.2230046

conf_intervals <- exp(confint(best_model))

## Waiting for profiling to be done...

print(conf_intervals)

##                                                      2.5 %      97.5 %
## (Intercept)                                      0.1601521   1.2171744
## Course                                           0.9996573   0.9998808
## `Previous qualification`                         1.0020865   1.0287939
## Nacionality                                      0.9203756   0.9739408
## `Mother's occupation`                            1.0019720   1.0152707
## Displaced                                        0.5286202   0.9396994
## Debtor                                           0.1799871   0.4513463
## `Tuition fees up to date`                        8.4665229  25.5787590
## Gender                                           0.5035818   0.8614147
## `Scholarship holder`                             1.6294400   3.0927850
## `Age at enrollment`                              0.9430934   0.9842764
## International                                    5.1153038 123.4982590
## `Curricular units 1st sem (credited)`            0.6146627   0.7490615
## `Curricular units 1st sem (approved)`            1.6049745   2.0886667
## `Curricular units 1st sem (without evaluations)` 0.9273699   1.7370475
## `Curricular units 2nd sem (enrolled)`            0.3204658   0.4336439
## `Curricular units 2nd sem (approved)`            2.3712039   3.0668913
## Grades_2ndSem                                    1.0044698   1.1858909
## `Curricular units 2nd sem (without evaluations)` 0.9472852   1.5925452
## `Unemployment rate`                              0.9032723   0.9957707
## `Marital status`                                 0.9317590   1.6174690

library(ggplot2)

odds_data <- data.frame(
  Predictor = names(odds_ratios),
  OddsRatio = odds_ratios,
  LowerCI = conf_intervals[, 1],
  UpperCI = conf_intervals[, 2]
)

ggplot(odds_data, aes(x = Predictor, y = OddsRatio)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  geom_errorbar(aes(ymin = LowerCI, ymax = UpperCI), width = 0.1) +
  theme_minimal() +
  labs(title = "Odds Ratios of Predictors", y = "Odds Ratio")

Model Summary:
- Null Deviance: 4859.8 (deviance with no predictors).
- Residual Deviance: 1659.1 (deviance after adding predictors, much lower—indicating improved model fit).
- AIC: 1701.1 (used for model comparison; lower values indicate better fit).
Interpretation of Significant Predictors:
- Course: Negative coefficient (beta = -2.299e-04), significant at p < 0.001. Suggests that specific courses have a small but significant negative effect on the log-odds of the target outcome.
- Previous qualification: Positive and significant (beta = 1.496e-02), indicating better previous qualifications increase the likelihood of the outcome.
- Nacionality: Negative and significant (beta = -5.434e-02), suggesting non-domestic nationality decreases the likelihood of the target outcome.
- Debtor: Strong negative impact (beta = -1.256,p < 0.001) on the outcome. Being a debtor significantly reduces the chances.
- Tuition fees up to date: Large positive coefficient (beta = 2.672, p < 0.001), indicating that students up-to-date on tuition have much higher odds of the outcome.
- Scholarship holder: Positive effect (beta = 0.8036, p < 0.001), showing that receiving a scholarship increases the odds of success.
- International: Strong positive impact (beta = 3.194), indicating international students are more likely to achieve the outcome.
- Curricular units: Various significant contributions:
  - 1st sem credited: Negative (beta = -0.387), fewer credited units decrease likelihood.
  - 2nd sem enrolled: Strong negative (beta = -0.983), suggesting being enrolled without approval reduces success.
  - 2nd sem approved: Positive (beta = 0.989), indicating more approvals increase success likelihood.
Non-Significant Predictors:
- Curricular units 1st sem without evaluations and Marital status have higher p-values (>0.1), indicating these variables do not significantly affect the target in this model.
Model Diagnostics:
- Residual Deviance vs. Null Deviance: Substantial reduction, indicating that the predictors provide a much better fit than the null model.
- Number of Iterations: 7 iterations indicate stable convergence for this logistic regression.

# 1. Residuals vs Fitted Plot
plot(best_model, which = 1, main = "Residuals vs Fitted")

Interpretation:

Horizontal band of residuals: The residuals are spread relatively evenly around the 0-line, which is expected for a well-fitted model.
Outliers (2013, 2010, 3480): These data points have high residuals, suggesting they do not fit the model well. They could indicate influential or misclassified observations.
No clear pattern: Ideally, residuals should not show any systematic structure or trend. The absence of a curve or clustering around certain fitted values is a positive sign.

# 2. Cook's Distance
plot(best_model, which = 4, main = "Cook's Distance")

Interpretation:

Highlighted Observations (1334, 2013, 2766): These points have higher Cook’s Distance values, indicating they could be influential observations.
Threshold for Influence: A general rule of thumb is that values above 1 (or significantly larger than the majority of the data) are influential, but none here exceed that threshold.

# 3. Leverage Plot
plot(hatvalues(best_model), main = "Leverage Values", ylab = "Leverage")
abline(h = 2 * mean(hatvalues(best_model)), col = "red", lty = 2)

Interpretation:

Most observations have low leverage values, which is expected in a well-distributed dataset.
High-leverage points are seen near the top of the plot but remain below a threshold of 0.5. A few points have higher leverage, meaning their predictor values are farther from the center of the data set, but they don’t seem to be extreme outliers.

# 4. Variance Inflation Factor (VIF) for multicollinearity
vif(best_model)

##                                           Course 
##                                         7.655553 
##                         `Previous qualification` 
##                                         1.111836 
##                                      Nacionality 
##                                         3.262373 
##                            `Mother's occupation` 
##                                         1.042082 
##                                        Displaced 
##                                         1.250714 
##                                           Debtor 
##                                         1.127495 
##                        `Tuition fees up to date` 
##                                         1.132134 
##                                           Gender 
##                                         1.090516 
##                             `Scholarship holder` 
##                                         1.049656 
##                              `Age at enrollment` 
##                                         1.752143 
##                                    International 
##                                         3.349127 
##            `Curricular units 1st sem (credited)` 
##                                         4.299866 
##            `Curricular units 1st sem (approved)` 
##                                         8.776707 
## `Curricular units 1st sem (without evaluations)` 
##                                         1.168276 
##            `Curricular units 2nd sem (enrolled)` 
##                                        11.197613 
##            `Curricular units 2nd sem (approved)` 
##                                         6.949030 
##                                    Grades_2ndSem 
##                                         8.472960 
## `Curricular units 2nd sem (without evaluations)` 
##                                         1.132925 
##                              `Unemployment rate` 
##                                         1.095878 
##                                 `Marital status` 
##                                         1.505745

Low VIFs (< 5) for most variables: Indicates low multicollinearity and minimal impact on the stability of coefficients.
- Examples: Mohther’s_occupation (1.04), Debtor(1.13), Gender(1.09).
Moderate VIFs (5–10): Suggest some multicollinearity but not critically problematic.
- Examples: Cirricular units 2nd sem (approved) (6.95), Cirricular units 2nd sem (grade) (8.47).
High VIFs (> 10): Indicates strong multicollinearity, potentially problematic.
- Examples: Cirricular units 2nd sem (enrolled) (11.20), Cirricular units 1st sem (approved) (8.78), Cirricular units 1st sem (credited) (4.30 approaching 5).

Conclusion and Recommendations:

The significant features for our models turned out to be: Previous Qualification, Nationality, Scholarship holder and Semester grades.
Students who couldn’t perform well in their previous grades, if they can perform well in the current semesters, they have higher chances of graduating.
Students should prefer day classes over evening classes in order to have a better chance of graduating.

Weeks 14 | Data Dive — Communicating Statistics

Kiran

2024-12-11

Final Project - Intro to Stats with R

Student Demographics v/s Academic Performance

Problem Statement:

Project Goal and Impact:

Exploratory Data Analysis (EDA):

Assumptions and Justifications:

Hypothesis Testing:

Generalized Linear Model (GLM) :

Conclusion and Recommendations: