Dataset Introduction - Student Exam Performance

Records of past students' pass/fail outcomes, together with their lifestyle factors and study habits

  • Features: Gender, Age, Hours spent Studying, Attendance, Sleep, Tutoring, Previous GPA
  • Target Variable: Pass/Fail Status
  • Number of datapoints: 10,000

Topic Introduction - Logistic Regression

A regression method that models the probability of a class label, which makes it suitable for classification.

Using this dataset, we’ll use logistic regression to predict whether a student will pass or fail their class based on the given features.

  • Logistic Regression is primarily used for binary classification, which is why pass/fail is the target variable of the model
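As a minimal sketch of what the model computes: logistic regression passes a linear combination of the features through the logistic (sigmoid) function to obtain a probability between 0 and 1, which is then thresholded (commonly at 0.5) to give a pass/fail label. The function name `sigmoid` and the input values below are illustrative assumptions, not taken from the dataset.

```r
# Logistic (sigmoid) function: maps any real number into the interval (0, 1).
sigmoid <- function(z) 1 / (1 + exp(-z))

# Illustrative linear-predictor values (hypothetical, not from the dataset).
z <- c(-4, 0, 4)
probs <- sigmoid(z)

# Classify as pass (1) when the predicted probability exceeds 0.5.
labels <- ifelse(probs > 0.5, 1, 0)

# Base R's plogis() implements the same function.
all.equal(probs, plogis(z))
```

A linear predictor of 0 corresponds to a probability of exactly 0.5, so the sign of the linear predictor decides the predicted class under this threshold.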

Data Exploration 1

A summary of the features and target variable used in the regression

summary(test_scores)
##     gender               age        study_hours_per_day attendance_rate 
##  Length:10000       Min.   :15.00   Min.   :0.50        Min.   : 50.80  
##  Class :character   1st Qu.:15.00   1st Qu.:2.20        1st Qu.: 78.28  
##  Mode  :character   Median :16.00   Median :3.01        Median : 85.10  
##                     Mean   :16.49   Mean   :3.02        Mean   : 84.70  
##                     3rd Qu.:17.00   3rd Qu.:3.83        3rd Qu.: 91.90  
##                     Max.   :18.00   Max.   :7.24        Max.   :100.00  
##   sleep_hours       tutoring          previous_gpa    pass_fail        
##  Min.   : 4.000   Length:10000       Min.   :0.000   Length:10000      
##  1st Qu.: 6.338   Class :character   1st Qu.:1.610   Class :character  
##  Median : 7.030   Mode  :character   Median :1.990   Mode  :character  
##  Mean   : 7.019                      Mean   :1.985                     
##  3rd Qu.: 7.690                      3rd Qu.:2.350                     
##  Max.   :10.000                      Max.   :3.990

Data Exploration 2

The pie charts display the recorded ages and genders of the students in the dataset

Data Exploration 3

A histogram displaying the previous grade point average of all students

Building the Logistic Regression Model

set.seed(1) #ensures results are reproducible

#Convert pass_fail from character to binary (1 = Pass, 0 = Fail)
test_scores$pass_fail <- ifelse(test_scores$pass_fail == "Pass", 1, 0)
test_scores$pass_fail <- as.integer(test_scores$pass_fail)

#Split the data into training (~70%) and testing (~30%) sets
#in_train is a logical mask; naming it "sample" would shadow base::sample()
in_train <- sample(c(TRUE,FALSE), nrow(test_scores), replace = TRUE, 
                   prob = c(0.7,0.3))
train <- test_scores[in_train, ]
test <- test_scores[!in_train, ]

#fit the log-reg model
model <- glm(pass_fail~gender+age+study_hours_per_day+attendance_rate
             +sleep_hours+tutoring+previous_gpa, 
             family="binomial", data=train)
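The held-out test set exists so the model can be checked on rows it never saw. The sketch below shows one way to do that, assuming a 0.5 probability threshold. It runs on synthetic stand-in data, because the real `test_scores` is not reproduced here; the names `demo`, `demo_train`, `demo_test`, and `demo_model`, and the simulation coefficients, are hypothetical.

```r
set.seed(1)

# Synthetic stand-in data (hypothetical; NOT the real test_scores).
n <- 1000
demo <- data.frame(
  previous_gpa        = runif(n, 0, 4),
  study_hours_per_day = runif(n, 0.5, 7)
)
# Simulate pass/fail with arbitrary illustrative coefficients.
demo$pass_fail <- rbinom(n, 1, plogis(-6 + 2 * demo$previous_gpa +
                                        0.5 * demo$study_hours_per_day))

# Same style of 70/30 split as above.
in_demo_train <- sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.7, 0.3))
demo_train <- demo[in_demo_train, ]
demo_test  <- demo[!in_demo_train, ]

demo_model <- glm(pass_fail ~ previous_gpa + study_hours_per_day,
                  family = "binomial", data = demo_train)

# Predicted probabilities for unseen rows, thresholded at 0.5.
test_prob <- predict(demo_model, newdata = demo_test, type = "response")
test_pred <- ifelse(test_prob > 0.5, 1, 0)

# Confusion matrix and overall accuracy on the held-out rows.
conf_mat <- table(predicted = test_pred, actual = demo_test$pass_fail)
accuracy <- mean(test_pred == demo_test$pass_fail)
```

With the real data, the same `predict(..., newdata = test, type = "response")` call on `model` would give held-out probabilities for the students in `test`.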

Model Results

## 
## Call:
## glm(formula = pass_fail ~ gender + age + study_hours_per_day + 
##     attendance_rate + sleep_hours + tutoring + previous_gpa, 
##     family = "binomial", data = train)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -14.479958   0.792455 -18.272  < 2e-16 ***
## genderMale           -0.019043   0.077901  -0.244  0.80688    
## age                  -0.091416   0.034657  -2.638  0.00835 ** 
## study_hours_per_day   0.717959   0.041720  17.209  < 2e-16 ***
## attendance_rate       0.023697   0.004149   5.711 1.12e-08 ***
## sleep_hours           0.001004   0.039124   0.026  0.97953    
## tutoringYes           0.012586   0.085445   0.147  0.88290    
## previous_gpa          5.872510   0.155392  37.792  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9649.6  on 6963  degrees of freedom
## Residual deviance: 4215.1  on 6956  degrees of freedom
## AIC: 4231.1
## 
## Number of Fisher Scoring iterations: 6
## fitting null model for pseudo-r2
##  McFadden 
## 0.5631866
##                        Overall
## genderMale           0.2444526
## age                  2.6377480
## study_hours_per_day 17.2090195
## attendance_rate      5.7114153
## sleep_hours          0.0256615
## tutoringYes          0.1472951
## previous_gpa        37.7915921

Model Results Cont’d

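McFadden's R^2 measures how much the fitted model improves on an intercept-only (null) model. It is not on the same scale as the R^2 of linear regression: values between 0.2 and 0.4 are commonly read as a very good fit, so a value above 0.4 indicates a strong fit. The McFadden R^2 for this model is ~0.563.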
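The reported value can be checked by hand from the model summary. For ungrouped 0/1 outcomes the deviance equals -2 times the log-likelihood, so McFadden's R^2 reduces to one minus the ratio of residual to null deviance. The sketch below uses the deviances as printed (rounded to one decimal), so it agrees with the reported value only to about three decimal places; the variable names are illustrative.

```r
# Deviances copied from the model summary (rounded as printed).
null_deviance     <- 9649.6
residual_deviance <- 4215.1

# For 0/1 outcomes, deviance = -2 * log-likelihood, so the ratio of
# deviances equals the ratio of log-likelihoods.
mcfadden_r2 <- 1 - residual_deviance / null_deviance
round(mcfadden_r2, 3)
```

On a fitted `glm` object the same quantity is available as `1 - model$deviance / model$null.deviance`.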

Another way to analyse the model is to ask which features had the greatest impact, which we assess with varImp from the caret package. For a glm, the score is the absolute value of each coefficient's z statistic, so the greater the value, the greater the impact the feature had.

  • Gender: 0.2444526
  • Age: 2.6377480
  • Study hours: 17.2090195
  • Attendance: 5.7114153
  • Sleep: 0.0256615
  • Tutoring: 0.1472951
  • Previous GPA: 37.7915921
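The importance scores listed above coincide with the absolute z values in the coefficient table, that is, each estimate divided by its standard error. The sketch below recomputes three of them from the printed table; the data frame `coefs` and its column names are illustrative, and small differences from the listed scores come from rounding in the printed estimates.

```r
# Estimates and standard errors copied from the coefficient table.
coefs <- data.frame(
  term      = c("study_hours_per_day", "attendance_rate", "previous_gpa"),
  estimate  = c(0.717959, 0.023697, 5.872510),
  std_error = c(0.041720, 0.004149, 0.155392)
)

# Absolute z statistic = |estimate / standard error|.
coefs$abs_z <- abs(coefs$estimate / coefs$std_error)

# Rank terms from most to least important.
coefs[order(-coefs$abs_z), c("term", "abs_z")]
```

Because the score depends on the standard error as well as the estimate, it reflects how precisely an effect is estimated, not only how large the coefficient is.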

Smaller Dataset

To plot the logistic regression surface in three dimensions, the number of features needs to be reduced to two. Study hours and previous GPA were selected for the plot because they had the greatest influence on the model.

library(dplyr)  #provides %>% and select()

df_small <- train %>%
  select(previous_gpa, study_hours_per_day, pass_fail)
model_small <- glm(pass_fail ~ previous_gpa + study_hours_per_day,
                   data = df_small, family = "binomial")
grid <- expand.grid(
  previous_gpa = seq(min(df_small$previous_gpa),
                     max(df_small$previous_gpa),
                     length.out = 40),
  study_hours_per_day = seq(min(df_small$study_hours_per_day),
                            max(df_small$study_hours_per_day),
                            length.out = 40)
)
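`expand.grid()` returns one row per combination of its arguments, with the first argument varying fastest. The 40 × 40 grid above therefore has 1,600 rows. A small sketch with arbitrary values makes the row order visible; the names `demo_grid`, `gpa`, and `hours` are illustrative only.

```r
# Small illustrative grid (values are arbitrary).
demo_grid <- expand.grid(
  gpa   = c(1, 2, 3),
  hours = c(10, 20)
)
demo_grid
# gpa cycles through 1, 2, 3 for hours = 10, then again for hours = 20.

nrow(demo_grid)  # 3 * 2 = 6 combinations
```

This ordering matters when the predictions are later reshaped into a matrix, since it determines which grid variable runs along the rows and which along the columns.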

Generating the Logistic Regression Plot

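library(plotly)  #provides plot_ly() and layout()

#Predicted probability of passing at every grid point
grid$pred <- predict(model_small, newdata = grid, type = "response")
#plotly surfaces expect rows of z to follow y and columns to follow x.
#previous_gpa (x) varies fastest in grid, so fill the matrix row by row.
z_matrix <- matrix(grid$pred, nrow = 40, ncol = 40, byrow = TRUE)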
reg_plot <- plot_ly(
  x = unique(grid$previous_gpa),
  y = unique(grid$study_hours_per_day),
  z = z_matrix,
  type = "surface",
  colorscale = list(c(0, "red"), c(1, "green"))
) %>%
  layout(
    title = "3D Logistic Regression Surface: GPA × Study Hours",
    scene = list(
      xaxis = list(title = "Previous GPA"),
      yaxis = list(title = "Study Hours per Day"),
      zaxis = list(title = "Predicted Probability of Passing")
    )
  )

Final Plot