R Markdown


Select an interesting binary column of data, or one which can be reasonably converted into a binary variable

“Displaced”: This column has two values, 1 for “yes” and 0 for “no,” indicating whether a student is displaced.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
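# Load the data set (the path is relative to the working directory)
df <- read.csv('./Downloads/students_dropout_and_academic_success.csv')

# Keep the binary response and the three explanatory variables
selected_columns <- c("Displaced", "Previous_qualification_grade", "Age_at_enrollment", "Gender")
df <- df[selected_columns]

# Recode Gender as a 0/1 indicator (1 stays 1, any other value becomes 0)
df$Gender <- ifelse(df$Gender == 1, 1, 0)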

# Check for missing values and handle them if needed
df <- na.omit(df)

# Perform logistic regression
logistic_model <- glm(Displaced ~ Previous_qualification_grade + Age_at_enrollment + Gender, data = df, family = "binomial")


summary(logistic_model)
## 
## Call:
## glm(formula = Displaced ~ Previous_qualification_grade + Age_at_enrollment + 
##     Gender, family = "binomial", data = df)
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   4.468443   0.376841  11.858  < 2e-16 ***
## Previous_qualification_grade -0.009947   0.002518  -3.950 7.81e-05 ***
## Age_at_enrollment            -0.124142   0.006005 -20.674  < 2e-16 ***
## Gender                       -0.342906   0.068673  -4.993 5.93e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6091.5  on 4423  degrees of freedom
## Residual deviance: 5399.8  on 4420  degrees of freedom
## AIC: 5407.8
## 
## Number of Fisher Scoring iterations: 4

Coefficients:

Intercept (4.468443): This is the log-odds of being displaced when all explanatory variables equal zero, that is, for a female student (Gender = 0) with a previous qualification grade of 0 and an age at enrollment of 0. An enrollment age of 0 is outside any realistic range, so the intercept is an extrapolation that anchors the model rather than a quantity to interpret on its own.

Previous_qualification_grade (-0.009947): For each one-unit increase in the previous qualification grade, the log-odds of being displaced decrease by approximately 0.0099, holding all other variables constant. This implies that as the previous qualification grade increases, the likelihood of being displaced decreases.

Age_at_enrollment (-0.124142): For each one-year increase in age at enrollment, the log-odds of being displaced decrease by approximately 0.1241, holding all other variables constant. This suggests that older students are less likely to be displaced.

Gender (-0.342906): If the student is male (Gender = 1), the log-odds of being displaced are lower by approximately 0.3429 compared to female students (Gender = 0), holding all other variables constant. This indicates that, on average, male students are less likely to be displaced than female students.
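
To make the log-odds scale more concrete, the coefficients can be combined into a predicted probability for a single student. The following is a minimal sketch that copies the estimates from the summary output above. The student profile (grade 130, age 20, female) is a hypothetical example chosen for illustration, not a row from the data.

```r
# Coefficient estimates copied from the summary output
b <- c(intercept = 4.468443,
       grade     = -0.009947,
       age       = -0.124142,
       gender    = -0.342906)

# Hypothetical student profile (illustrative values, not taken from the data)
student <- c(grade = 130, age = 20, gender = 0)

# Linear predictor: the log-odds of displacement for this profile
log_odds <- b[["intercept"]] + sum(b[names(student)] * student)

# The inverse logit converts log-odds to a probability
prob <- plogis(log_odds)

print(round(c(log_odds = log_odds, probability = prob), 3))
```

With the fitted model in scope, predict(logistic_model, newdata = ..., type = "response") should give the same kind of result without copying coefficients by hand.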

Significance:

The “Estimate” column gives the coefficient for each variable, and the “Std. Error” column gives the standard error of that estimate. The “z value” column is the estimate divided by its standard error, and it is used to test whether the coefficient differs from zero. The “Pr(>|z|)” column gives the corresponding two-sided p-value. Small p-values (e.g., < 0.05) indicate that the coefficient is statistically significant. In this model, all three explanatory variables (Previous_qualification_grade, Age_at_enrollment, and Gender) are statistically significant: each p-value is below 0.001, as indicated by the '***' notation.
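
The relationship between these columns can be reproduced by hand. This is a small sketch that uses the estimates and standard errors as printed in the summary output, so the results may differ from the table in the last digit because of rounding.

```r
# Estimates and standard errors copied from the summary output
est <- c(4.468443, -0.009947, -0.124142, -0.342906)
se  <- c(0.376841,  0.002518,  0.006005,  0.068673)
names(est) <- names(se) <- c("(Intercept)", "Previous_qualification_grade",
                             "Age_at_enrollment", "Gender")

# Wald z statistic: estimate divided by its standard error
z <- est / se

# Two-sided p-value from the standard normal distribution
p <- 2 * pnorm(-abs(z))

print(data.frame(estimate = est, std_error = se,
                 z_value = round(z, 3), p_value = signif(p, 3)))
```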

The null deviance (6091.5 on 4423 degrees of freedom) measures the fit of an intercept-only model, and the residual deviance (5399.8 on 4420 degrees of freedom) measures the fit of the model with the three explanatory variables. The difference, 691.7 on 3 degrees of freedom, is far larger than would be expected by chance, so the explanatory variables jointly improve the fit. The improvement is modest in relative terms, though: the residual deviance is only about 11% smaller than the null deviance.
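
The comparison of the two deviances can be made explicit as a likelihood-ratio test. The sketch below uses only the deviances and degrees of freedom printed in the summary output. The pseudo-R-squared line assumes an ungrouped 0/1 response, in which case McFadden's measure reduces to one minus the ratio of the deviances.

```r
# Deviances and degrees of freedom copied from the summary output
null_dev  <- 6091.5; null_df  <- 4423
resid_dev <- 5399.8; resid_df <- 4420

# Likelihood-ratio statistic and its degrees of freedom
lr_stat <- null_dev - resid_dev
lr_df   <- null_df - resid_df

# Upper-tail chi-square probability for the drop in deviance
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# McFadden's pseudo-R-squared (assumes an ungrouped binary response)
mcfadden_r2 <- 1 - resid_dev / null_dev

print(c(lr_stat = lr_stat, df = lr_df, p_value = p_value,
        mcfadden_r2 = round(mcfadden_r2, 3)))
```

With both models fitted, anova(null_model, logistic_model, test = "Chisq") performs the same test, where null_model is a hypothetical name for glm(Displaced ~ 1, ...).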

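The AIC (Akaike Information Criterion) balances goodness of fit against model complexity by adding a penalty of 2 for each estimated parameter. It has no meaningful absolute scale: the value of 5407.8 for this model is only informative when compared with the AIC of other models fitted to the same data, where the lower value indicates the preferred model.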
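
As a rough illustration of such a comparison, the AIC of this model and of the intercept-only model can both be recovered from the printed deviances. This sketch assumes an ungrouped 0/1 response, for which the deviance equals minus twice the log-likelihood, so that AIC = deviance + 2k with k the number of estimated coefficients.

```r
# Deviances copied from the summary output
resid_dev <- 5399.8
null_dev  <- 6091.5

# Number of estimated coefficients in each model
k_model <- 4  # intercept plus three slopes
k_null  <- 1  # intercept only

# AIC = deviance + 2 * k (valid here under the ungrouped binary assumption)
aic_model <- resid_dev + 2 * k_model
aic_null  <- null_dev + 2 * k_null

print(c(aic_model = aic_model, aic_null = aic_null,
        difference = aic_null - aic_model))
```

The first value reproduces the AIC reported in the summary, and the gap between the two values is what carries the information.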

These interpretations assume that the usual logistic regression assumptions hold: independent observations, a linear relationship between each continuous explanatory variable and the log-odds, and no serious multicollinearity or overfitting. The coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are usually easier to interpret.
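
The conversion to odds ratios takes one line per quantity. The following sketch works from the printed estimates and standard errors and adds approximate 95% Wald intervals. The interval method is an assumption made here for illustration, and profile-likelihood intervals from confint() on the fitted model may differ slightly.

```r
# Slope estimates and standard errors copied from the summary output
est <- c(Previous_qualification_grade = -0.009947,
         Age_at_enrollment            = -0.124142,
         Gender                       = -0.342906)
se  <- c(Previous_qualification_grade = 0.002518,
         Age_at_enrollment            = 0.006005,
         Gender                       = 0.068673)

# Odds ratio: multiplicative change in the odds per one-unit increase
odds_ratio <- exp(est)

# Approximate 95% Wald interval, computed on the log-odds scale
z_crit  <- qnorm(0.975)
ci_low  <- exp(est - z_crit * se)
ci_high <- exp(est + z_crit * se)

print(round(cbind(odds_ratio, ci_low, ci_high), 3))
```

An odds ratio below 1 means the odds of displacement fall as the variable increases. With the fitted model in scope, exp(coef(logistic_model)) gives the same odds ratios directly.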

Explanatory Variable: Previous Qualification Grade

hist(df$Previous_qualification_grade, main = "Histogram of Previous Qualification Grade", xlab = "Previous Qualification Grade")

# Plot the fitted log-odds from a single-predictor model against the explanatory variable
logit_model <- glm(Displaced ~ Previous_qualification_grade, family = "binomial", data = df)
log_odds <- predict(logit_model, type = "link")
plot(df$Previous_qualification_grade, log_odds, xlab = "Previous Qualification Grade", ylab = "Log-Odds of Displacement")

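The plot shows a straight line, but that is by construction: predict(..., type = "link") returns the fitted log-odds, which the model defines as a linear function of the explanatory variable. The plot therefore cannot confirm the linearity assumption on its own. Logistic regression does assume that the log-odds change linearly with each continuous explanatory variable, so a proper check compares the fitted line with empirical log-odds, for example computed within bins of the explanatory variable. If those points fall close to a straight line, no transformation is required.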
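
One way to check linearity against the data is to plot empirical log-odds within bins of the explanatory variable. The sketch below runs on simulated data so that it is self-contained: x and y are stand-ins for Previous_qualification_grade and Displaced, and the range, coefficients, and number of bins are arbitrary assumptions, not values from the data set.

```r
set.seed(42)

# Simulated stand-ins for the real columns (arbitrary range and coefficients)
n <- 4000
x <- runif(n, 100, 180)
y <- rbinom(n, 1, plogis(1.5 - 0.01 * x))

# Split x into ten bins with roughly equal counts
breaks <- quantile(x, probs = seq(0, 1, by = 0.1))
bin <- cut(x, breaks = breaks, include.lowest = TRUE)

# Per-bin mean of x, number of events, and number of observations
bin_mid  <- as.numeric(tapply(x, bin, mean))
events   <- as.numeric(tapply(y, bin, sum))
n_in_bin <- as.numeric(tapply(y, bin, length))

# Empirical log-odds, with 0.5 added to each count to avoid log(0)
emp_logit <- log((events + 0.5) / (n_in_bin - events + 0.5))

# Roughly collinear points support the linearity assumption
plot(bin_mid, emp_logit,
     xlab = "Bin mean of explanatory variable",
     ylab = "Empirical log-odds")
abline(lm(emp_logit ~ bin_mid), lty = 2)
```

To apply this to the real data, replace x with df$Previous_qualification_grade and y with df$Displaced. If the grade has many tied values, the quantile breaks may need to be made unique before calling cut().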