Records of past students' final grades, based on lifestyle factors and study habits
- Features: Gender, Age, Hours spent Studying, Attendance, Sleep, Tutoring, Previous GPA
- Target Variable: Pass/Fail Status
- Number of datapoints: 10,000
Logistic regression is a regression algorithm used for classification: it estimates the probability that an observation belongs to a class.
Using this data set, we’ll use logistic regression to predict whether a student will pass or fail their class based on the given features.
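As a minimal sketch of what the model computes: logistic regression passes a linear combination of the features through the logistic (sigmoid) function, which yields a probability between 0 and 1. The intercept and slope below are made-up illustrative values, not estimates from this data set, and the 0.5 cutoff is an assumed default.

```r
# Logistic (sigmoid) function: maps any real number to a value in (0, 1)
logistic <- function(z) 1 / (1 + exp(-z))

# Hypothetical coefficients for a one-feature model (illustration only)
b0 <- -3  # intercept
b1 <- 1   # slope for study hours

study_hours <- c(1, 3, 5)
p_pass <- logistic(b0 + b1 * study_hours)

# Classify as "Pass" when the predicted probability is at least 0.5
predicted <- ifelse(p_pass >= 0.5, "Pass", "Fail")
print(data.frame(study_hours, p_pass, predicted))
```

Base R's plogis() computes the same function, so logistic(2) and plogis(2) agree.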
A summary of the features used in the regression:
summary(test_scores)
##     gender               age        study_hours_per_day attendance_rate 
##  Length:10000       Min.   :15.00   Min.   :0.50        Min.   : 50.80  
##  Class :character   1st Qu.:15.00   1st Qu.:2.20        1st Qu.: 78.28  
##  Mode  :character   Median :16.00   Median :3.01        Median : 85.10  
##                     Mean   :16.49   Mean   :3.02        Mean   : 84.70  
##                     3rd Qu.:17.00   3rd Qu.:3.83        3rd Qu.: 91.90  
##                     Max.   :18.00   Max.   :7.24        Max.   :100.00  
##   sleep_hours       tutoring          previous_gpa    pass_fail        
##  Min.   : 4.000   Length:10000       Min.   :0.000   Length:10000      
##  1st Qu.: 6.338   Class :character   1st Qu.:1.610   Class :character  
##  Median : 7.030   Mode  :character   Median :1.990   Mode  :character  
##  Mean   : 7.019                      Mean   :1.985                     
##  3rd Qu.: 7.690                      3rd Qu.:2.350                     
##  Max.   :10.000                      Max.   :3.990                     
The pie charts display the recorded ages and genders of the students in the dataset.
A histogram displaying the previous grade point average of all students.
set.seed(1)  # ensures results are reproducible

# Convert pass_fail from character to binary (1 = Pass, 0 = Fail)
test_scores$pass_fail <- as.integer(test_scores$pass_fail == "Pass")

# Split the data into training (~70%) and testing (~30%) sets.
# The mask is named in_train so it does not shadow the sample() function.
in_train <- sample(c(TRUE, FALSE), nrow(test_scores), replace = TRUE,
                   prob = c(0.7, 0.3))
train <- test_scores[in_train, ]
test  <- test_scores[!in_train, ]

# Fit the logistic regression model on the training set
model <- glm(pass_fail ~ gender + age + study_hours_per_day + attendance_rate +
               sleep_hours + tutoring + previous_gpa,
             family = "binomial", data = train)
## 
## Call:
## glm(formula = pass_fail ~ gender + age + study_hours_per_day + 
##     attendance_rate + sleep_hours + tutoring + previous_gpa, 
##     family = "binomial", data = train)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -14.479958   0.792455 -18.272  < 2e-16 ***
## genderMale           -0.019043   0.077901  -0.244  0.80688    
## age                  -0.091416   0.034657  -2.638  0.00835 ** 
## study_hours_per_day   0.717959   0.041720  17.209  < 2e-16 ***
## attendance_rate       0.023697   0.004149   5.711 1.12e-08 ***
## sleep_hours           0.001004   0.039124   0.026  0.97953    
## tutoringYes           0.012586   0.085445   0.147  0.88290    
## previous_gpa          5.872510   0.155392  37.792  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9649.6  on 6963  degrees of freedom
## Residual deviance: 4215.1  on 6956  degrees of freedom
## AIC: 4231.1
## 
## Number of Fisher Scoring iterations: 6
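The held-out test set can be used to check how well a fitted model generalises. The sketch below is self-contained, so it simulates a small stand-in data set with two of the features; the simulation coefficients, the names sim, train_sim, test_sim and fit, and the 0.5 threshold are all assumptions made for illustration. With the real data, the same predict() and accuracy steps would apply to the fitted model and the test set.

```r
set.seed(1)

# Simulated stand-in for test_scores (synthetic values, illustration only)
n <- 1000
sim <- data.frame(
  previous_gpa        = runif(n, 0, 4),
  study_hours_per_day = runif(n, 0.5, 7)
)
p <- plogis(-8 + 3 * sim$previous_gpa + 0.7 * sim$study_hours_per_day)
sim$pass_fail <- rbinom(n, 1, p)

# 70/30 split, mirroring the approach used above
in_train  <- sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.7, 0.3))
train_sim <- sim[in_train, ]
test_sim  <- sim[!in_train, ]

fit <- glm(pass_fail ~ previous_gpa + study_hours_per_day,
           family = "binomial", data = train_sim)

# Predicted probabilities on the held-out rows, thresholded at 0.5
prob_pass  <- predict(fit, newdata = test_sim, type = "response")
pred_class <- as.integer(prob_pass >= 0.5)

# Confusion matrix and overall accuracy on the held-out rows
print(table(predicted = pred_class, actual = test_sim$pass_fail))
accuracy <- mean(pred_class == test_sim$pass_fail)
print(accuracy)
```

Accuracy is only one possible summary; when the classes are imbalanced, the confusion matrix is usually the more informative of the two.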
## fitting null model for pseudo-r2
##  McFadden 
## 0.5631866
##                        Overall
## genderMale           0.2444526
## age                  2.6377480
## study_hours_per_day 17.2090195
## attendance_rate      5.7114153
## sleep_hours          0.0256615
## tutoringYes          0.1472951
## previous_gpa        37.7915921
McFadden's R^2 measures how much the fitted model improves on an intercept-only (null) model. As a rule of thumb, values above 0.4 indicate a good fit. The McFadden R^2 for this model is approximately 0.563.
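As a quick check, the reported value can be reproduced by hand from the two deviances in the model summary. This works here because the response is a 0/1 outcome, for which the deviance equals -2 times the log-likelihood. The variable names are illustrative.

```r
# Deviances copied from the summary(model) output
null_deviance     <- 9649.6
residual_deviance <- 4215.1

# McFadden's R^2 = 1 - logLik(model) / logLik(null)
#                = 1 - residual deviance / null deviance  (0/1 response)
mcfadden <- 1 - residual_deviance / null_deviance
print(round(mcfadden, 3))  # 0.563
```

The small difference from the reported 0.5631866 in later decimal places comes from the deviances being rounded in the printed summary.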
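Another useful analysis is to determine which features had the greatest impact on the model, which we’ll assess with varImp (from the caret package). For this model, the reported values match the absolute z statistics in the model summary. The greater the value, the greater the impact the feature had.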
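The importance scores can be reproduced from the coefficient table, since each one is the absolute value of the estimate divided by its standard error. The sketch below copies those two columns from the summary output; the vector names are illustrative.

```r
# Estimates and standard errors copied from the summary(model) output
estimate <- c(genderMale = -0.019043, age = -0.091416,
              study_hours_per_day = 0.717959, attendance_rate = 0.023697,
              sleep_hours = 0.001004, tutoringYes = 0.012586,
              previous_gpa = 5.872510)
std_error <- c(0.077901, 0.034657, 0.041720, 0.004149,
               0.039124, 0.085445, 0.155392)

# Importance as |z| = |estimate / standard error|, sorted high to low
importance <- sort(abs(estimate / std_error), decreasing = TRUE)
print(round(importance, 2))
```

Sorted this way, previous_gpa and study_hours_per_day come out on top, followed by attendance_rate and age; gender, tutoring and sleep hours contribute very little.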
To plot the logistic regression surface, the number of features needs to be reduced to two. Study hours and previous GPA were selected for the plot because they had the greatest influence on the model.
library(dplyr)   # provides %>% and select()
library(plotly)  # provides plot_ly() and layout()

# Keep only the two most influential features and the target
df_small <- train %>%
  select(previous_gpa, study_hours_per_day, pass_fail)

# Refit the logistic regression on the reduced feature set
model_small <- glm(pass_fail ~ previous_gpa + study_hours_per_day,
                   data = df_small, family = "binomial")

# 40 x 40 grid spanning the observed range of both features
grid <- expand.grid(
  previous_gpa = seq(min(df_small$previous_gpa),
                     max(df_small$previous_gpa),
                     length.out = 40),
  study_hours_per_day = seq(min(df_small$study_hours_per_day),
                            max(df_small$study_hours_per_day),
                            length.out = 40)
)
grid$pred <- predict(model_small, newdata = grid, type = "response")

# plotly reads z[i, j] as the height at (y[i], x[j]). expand.grid varies
# previous_gpa fastest, so fill by row: rows = study hours, columns = GPA.
z_matrix <- matrix(grid$pred, nrow = 40, ncol = 40, byrow = TRUE)

reg_plot <- plot_ly(
  x = unique(grid$previous_gpa),
  y = unique(grid$study_hours_per_day),
  z = z_matrix,
  type = "surface",
  colorscale = list(c(0, "red"), c(1, "green"))
) %>%
  layout(
    title = "3D Logistic Regression Surface: GPA × Study Hours",
    scene = list(
      xaxis = list(title = "Previous GPA"),
      yaxis = list(title = "Study Hours per Day"),
      zaxis = list(title = "Predicted Probability of Passing")
    )
  )