In part 1 of this project, I showed that students at risk of failing could be detected with nearly 90% accuracy as early as the first quarter of a course. Identifying these students early is important for intervening in time to reduce the risk of failure, but which types of intervention work best?
To answer this question, I shift from random forests to logistic regression and structural equation modeling (SEM). The strength of random forest lies in its predictive power, particularly when the relationship between the predictors and the target is non-linear. Its drawback is its weakness in interpreting the underlying relationship between predictors and the target. Conversely, the strength of logistic regression and SEM is that they allow us to interpret both the magnitude of effects and the effect of each variable while controlling for confounders.
#load packages
library(readr)
library(lavaan)
library(dplyr)
library(caret)
library(MASS)
rm(list = ls())
setwd('/Users/williamkye/Box Sync/nyc data science academy/mooc/')
sem <- read_csv("sem1.csv")
set.seed(0)
trainIdx <- createDataPartition(sem$final_result,
                                p = .8,
                                list = FALSE,
                                times = 1)
train <- sem[trainIdx,]
test <- sem[-trainIdx,]
# Fit a logistic regression on all predictors
model <- glm(final_result ~ ., family = binomial(link = 'logit'), data = train)
# Predicted probabilities on the held-out set, thresholded at 0.5
fitted.results <- predict(model, newdata = test, type = 'response')
fitted.results <- ifelse(fitted.results > 0.5, 1, 0)
misClassificError <- mean(fitted.results != test$final_result)
print(paste('Accuracy', 1 - misClassificError))
## [1] "Accuracy 0.93312101910828"
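Accuracy alone can hide how well the at-risk students themselves are detected, so a confusion matrix is a useful companion to the number above. The following is a minimal sketch on made-up data: the `actual` and `predicted` vectors are illustrative stand-ins for `test$final_result` and `fitted.results`, and treating 1 as the positive class is an assumption, since the coding of `final_result` is not shown here.

```r
# Made-up stand-ins for test$final_result and fitted.results (illustrative only)
actual    <- c(1, 1, 1, 0, 0, 1, 0, 1, 1, 0)
predicted <- c(1, 1, 0, 0, 1, 1, 0, 1, 1, 0)

# Cross-tabulate predictions against the true labels
conf_mat <- table(Predicted = predicted, Actual = actual)
print(conf_mat)

# Accuracy: share of correct predictions (the diagonal of the table)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Sensitivity and specificity, treating 1 as the positive class (an assumption)
sensitivity <- conf_mat["1", "1"] / sum(conf_mat[, "1"])
specificity <- conf_mat["0", "0"] / sum(conf_mat[, "0"])

print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```

If one class is much rarer than the other, sensitivity and specificity can differ sharply even when overall accuracy looks high.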
sort(model$coefficients, decreasing = TRUE)[1:20]
##                     age_band55<=       num_assessment_CMA_group_1
##                        3.5646340                        1.9581201
##       num_assessment_CMA_group_4       num_assessment_CMA_group_2
##                        1.4256663                        1.4133993
##       num_assessment_CMA_group_7 highest_educationNo Formal quals
##                        1.3821967                        1.2958133
##       num_assessment_CMA_group_3       num_assessment_CMA_group_5
##                        1.2301218                        1.1003804
##      num_assessment_CMA_group_13      num_assessment_CMA_group_10
##                        1.0229442                        0.9671943
##      num_assessment_CMA_group_20      num_assessment_CMA_group_28
##                        0.9567808                        0.9289148
##      num_assessment_CMA_group_26      num_assessment_CMA_group_12
##                        0.9224321                        0.8694280
##      num_assessment_CMA_group_24       num_assessment_CMA_group_9
##                        0.8488794                        0.8469322
##      num_assessment_CMA_group_11      num_assessment_CMA_group_15
##                        0.8394068                        0.8331867
##      num_assessment_CMA_group_16      num_assessment_CMA_group_14
##                        0.8226895                        0.7990772
The full logistic regression, which includes every predictor, has an accuracy of 93%, remarkably similar to the random forest model. The regression coefficients are preferable to random forest's variable importance measure because they show both the effect of each covariate while holding the other covariates constant and the magnitude of that effect. However, with over 200 predictors in the model, the results are unwieldy to read, and the covariates are highly susceptible to multicollinearity, which undermines our confidence in interpreting the significance of individual predictors.
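To make the point about magnitude concrete: logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios. Here is a minimal sketch on simulated data. The variables `score`, `repeat_student`, and `passed` are hypothetical and are not columns from `sem1.csv`, and the Wald intervals from `confint.default()` are just one reasonable choice.

```r
set.seed(0)

# Simulated data (illustrative only): one continuous and one binary predictor
n <- 500
score <- rnorm(n)
repeat_student <- rbinom(n, 1, 0.3)

# True log-odds of passing: 0.5 + 1.2 * score - 0.8 * repeat_student
p <- plogis(0.5 + 1.2 * score - 0.8 * repeat_student)
passed <- rbinom(n, 1, p)

fit <- glm(passed ~ score + repeat_student, family = binomial(link = "logit"))

# Exponentiated coefficients are odds ratios:
# the multiplicative change in the odds of passing per one-unit increase
odds_ratios <- exp(coef(fit))

# Wald 95% confidence intervals, moved to the odds-ratio scale
wald_ci <- exp(confint.default(fit))

print(cbind(OR = odds_ratios, wald_ci))
```

An odds ratio above 1 means the predictor raises the odds of passing with the other predictor held fixed, and a ratio below 1 means it lowers them.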
SEM is different from logistic regression models in two important ways:
For these reasons, SEM is well suited to examining how intervention can best be used to help students at risk of failing. First, it provides a way to use all of the predictors by identifying the latent variables that underlie the relationship between students' online interactions and assessment scores and their likelihood of passing a class. Second, both random forest and logistic regression highlight test scores as the most important variables for predicting success, but neither allows for analysis of how online activity may affect students' performance on assessments, which in turn affects their likelihood of passing or failing.
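The mediation idea can be sketched with the product-of-coefficients approach: the indirect effect is the path from predictor to mediator multiplied by the path from mediator to outcome. The example below is a simplified illustration and not the model used in this project. It uses simulated data, hypothetical variable names, a continuous outcome, and plain `lm()` instead of lavaan.

```r
set.seed(1)

# Simulated data (illustrative only)
n <- 1000
activity <- rnorm(n)                                 # predictor, e.g. online activity
score    <- 0.5 * activity + rnorm(n)                # mediator, e.g. assessment score
outcome  <- 0.7 * score + 0.1 * activity + rnorm(n)  # continuous outcome for simplicity

# Path a: predictor -> mediator
path_a <- unname(coef(lm(score ~ activity))["activity"])

# Path b: mediator -> outcome, controlling for the predictor
fit_b  <- lm(outcome ~ score + activity)
path_b <- unname(coef(fit_b)["score"])

# Direct effect of the predictor, holding the mediator fixed
direct <- unname(coef(fit_b)["activity"])

# Indirect (mediated) effect: product of the two paths
indirect <- path_a * path_b

# Total effect from a regression without the mediator
total <- unname(coef(lm(outcome ~ activity))["activity"])

print(c(indirect = indirect, direct = direct, total = total))
```

With ordinary least squares on a single sample, the total effect equals the direct effect plus the indirect effect, which is a handy sanity check. That identity does not carry over exactly to models with categorical outcomes.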
mediation<- '
score_TMA_group_3 ~ a0*activity_2 + a1*content_2 + a2*forum_2+ a3*resource_2
score_TMA_group_8 ~ b0*activity_7 + b1*content_7 + b2*forum_7+ b3*resource_7
score_TMA_group_13 ~ c0*activity_12 + c1*content_12 + c2*forum_12+ c3*resource_12
score_TMA_group_19 ~ d0*activity_18 + d1*content_18 + d2*forum_18+ d3*resource_18
score_TMA_group_24 ~ e0*activity_23 + e1*content_23 + e2*forum_23+ e3*resource_23
final_result~ f*score_TMA_group_3 +g*score_TMA_group_8+ h*score_TMA_group_13 +i*score_TMA_group_19 +j*score_TMA_group_24 + k*gender + l*imd_band + m*num_of_prev_attempts+n*studied_credits + o*disability
indirect_activity2 := a0*f
indirect_content2 := a1*f
indirect_forum2 := a2*f
indirect_resource2 := a3*f
indirect_activity7 := b0*g
indirect_content7 := b1*g
indirect_forum7 := b2*g
indirect_resource7 := b3*g
indirect_activity12 := c0*h
indirect_content12 := c1*h
indirect_forum12 := c2*h
indirect_resource12 := c3*h
indirect_activity18 := d0*i
indirect_content18 := d1*i
indirect_forum18 := d2*i
indirect_resource18 := d3*i
indirect_activity23 := e0*j
indirect_content23 := e1*j
indirect_forum23 := e2*j
indirect_resource23 := e3*j
'
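# Note: `sem` names both the data frame and lavaan's fitting function; R resolves the call to the function
res <- sem(mediation, data = sem, ordered = c('final_result', 'gender', 'disability'))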
#str(sem$gender)
#table(sem$content_group_3)
summary(res)
Below are the truncated results of my mediation model, in which I include only the significant effects.
# Defined Parameters:
#                       Estimate  Std.Err  z-value  P(>|z|)
# content week 7           0.000    0.000    2.151    0.031
# forum week 7             0.001    0.000    7.486    0.000
# resources week 7         0.000    0.000    4.228    0.000
# activity week 12         0.001    0.000    4.611    0.000
# activity week 18         0.002    0.000    8.415    0.000
# content week 18          0.001    0.000   10.004    0.000
# resources week 18        0.000    0.000    4.764    0.000
# activity week 23         0.002    0.000   15.191    0.000
# content week 23          0.001    0.000   38.287    0.000
The indirect effects demonstrate significant pathways for mediation. For instance, the positive and significant effect for content week 7 indicates that a student's total activity on content modules during weeks 1-7 significantly affected their score on the week 8 assessment, which in turn significantly affected their likelihood of passing or failing.
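An estimate that prints as 0.000 can still have a large z-value, because both the estimate and its standard error are small and are rounded to three decimals. The sketch below shows how a z-value for an indirect effect can be computed with the first-order delta method (the Sobel test). The path estimates are made-up numbers, and the formula ignores any covariance between the two paths, so it only approximates what lavaan reports for parameters defined with `:=`.

```r
# Hypothetical path estimates and standard errors (made-up numbers)
a <- 0.020; se_a <- 0.004   # predictor -> assessment score
b <- 0.050; se_b <- 0.005   # assessment score -> outcome

# Indirect effect: product of the two paths
indirect <- a * b

# First-order delta-method (Sobel) standard error,
# assuming the two path estimates are uncorrelated
se_indirect <- sqrt(b^2 * se_a^2 + a^2 * se_b^2)

# z-value and two-sided p-value
z <- indirect / se_indirect
p_value <- 2 * pnorm(-abs(z))

print(round(c(estimate = indirect, se = se_indirect, z = z, p = p_value), 3))
```

Here the indirect effect is 0.001 and its standard error rounds to 0.000 at three decimals, yet the z-value is well above conventional significance thresholds.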
From the above mediation results, we can infer several patterns:
As personal/online learning continues to grow in popularity, it is important to understand the factors that shape student success. In particular, as the data associated with online learning become more available and robust, there is now an important opportunity to identify intervention techniques that prevent student failure. In this project, I used structural equation modeling to test how students' assessment scores mediate the relationship between different online interactivity modules and the likelihood of passing the course. I find that the best course of intervention for students at risk of failure is to encourage increased activity on content modules and increased overall activity. In particular, this is most effective during the middle of a course, rather than at the beginning or the end.