In this project, I’ll demonstrate how to predict future dropout using some of the most well-established machine learning algorithms. The data stems from a Massive Open Online Course (MOOC) platform aimed primarily at medical students. The data has been preprocessed beforehand: I queried user activity data from each student’s first week of studying. Using this data, I’ll explore whether student activity during the first week predicts future dropout. We will use a binary outcome variable, defined as whether the student was inactive during the two consecutive weeks following the first week. If no activity is present, the student is classified as a dropout; conversely, if the student was active, he/she is classified as retained.
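The actual preprocessing pipeline isn’t shown here, but a minimal, hypothetical sketch of how such a label could be derived from a raw activity log might look as follows (the file name and column names are assumptions, not the real data):
import pandas as pd
# Hypothetical event log: one row per user action, with a timestamp column
events = pd.read_csv("activity_log.csv", parse_dates = ["timestamp"])
# Study week relative to each student's first recorded action (1, 2, 3, ...)
events["week"] = (events.groupby("user_id")["timestamp"]
                  .transform(lambda ts: (ts - ts.min()).dt.days // 7 + 1))
# Dropout = no activity during the two weeks following the first week
active_weeks = events.groupby("user_id")["week"].agg(set)
dropout = active_weeks.apply(lambda w: int(not ({2, 3} & w)))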
Load the required packages for importing the data, preprocessing, and modeling.
import pandas as pd # For data manipulation
import numpy as np # For numerical operations
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
import seaborn as sns
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.metrics import confusion_matrix
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)
import warnings
warnings.filterwarnings('ignore')
Read the dataset.
df = pd.read_csv("df_dropout.csv")
We’ll start by taking a closer look at the dataset. It includes 4101 rows, with each row corresponding to aggregated scores for a unique student from the first study week.
print(len(df))
## 4101
The dataset has altogether 24 columns, of which 23 are student activity features, whereas dropout is the target variable.
print(df.columns)
## Index(['total_rt', 'courses_rt', 'exam_rt', 'guidedlearning_rt', 'profile_rt',
## 'quiz_rt', 'welcome_rt', 'have_taken_a_quiz', 'have_taken_an_exam_quiz',
## 'have_repeated_a_quiz', 'quizzes_count', 'quiz_sessions_count',
## 'exam_quizzes_count', 'exam_quiz_sessions_count', 'actions_total_count',
## 'action_page_scroll_count', 'action_out_of_focus_count',
## 'action_page_read_count', 'last_day_active', 'unique_days_active',
## 'active_more_than_1_unique_day', 'reading_sessions_count',
## 'have_read_less_than_10mins', 'dropout'],
## dtype='object')
It seems that all features are either integers or floats.
print(df.dtypes)
## total_rt float64
## courses_rt float64
## exam_rt float64
## guidedlearning_rt float64
## profile_rt float64
## quiz_rt float64
## welcome_rt float64
## have_taken_a_quiz int64
## have_taken_an_exam_quiz int64
## have_repeated_a_quiz int64
## quizzes_count float64
## quiz_sessions_count float64
## exam_quizzes_count float64
## exam_quiz_sessions_count float64
## actions_total_count int64
## action_page_scroll_count float64
## action_out_of_focus_count float64
## action_page_read_count float64
## last_day_active int64
## unique_days_active int64
## active_more_than_1_unique_day int64
## reading_sessions_count int64
## have_read_less_than_10mins int64
## dropout int64
## dtype: object
We’ll also take a glance at the descriptive statistics of the student activity features and the target variable.
print(df.describe())
## total_rt courses_rt exam_rt guidedlearning_rt profile_rt \
## count 4101.000000 4101.000000 4101.000000 4101.000000 4101.000000
## mean 59.364115 1.349522 9.744260 0.184371 0.115096
## std 139.144609 2.834167 59.695501 1.362583 0.734116
## min 0.000000 0.000000 0.000000 0.000000 0.000000
## 25% 2.300667 0.143583 0.000000 0.000000 0.000000
## 50% 9.371117 0.594517 0.000000 0.000000 0.000000
## 75% 49.112367 1.493050 0.000000 0.000000 0.000000
## max 2243.159717 63.635350 1444.356150 50.858583 35.125683
##
## quiz_rt welcome_rt have_taken_a_quiz have_taken_an_exam_quiz \
## count 4101.000000 4101.000000 4101.000000 4101.000000
## mean 0.198568 0.219454 0.374055 0.166301
## std 0.757390 0.719694 0.483937 0.372396
## min 0.000000 0.000000 0.000000 0.000000
## 25% 0.000000 0.000000 0.000000 0.000000
## 50% 0.000000 0.000000 0.000000 0.000000
## 75% 0.000000 0.000000 1.000000 0.000000
## max 24.633050 9.859617 1.000000 1.000000
##
## have_repeated_a_quiz quizzes_count quiz_sessions_count \
## count 4101.000000 4101.000000 4101.000000
## mean 0.176786 20.629115 3.066325
## std 0.381534 83.911677 9.027302
## min 0.000000 0.000000 0.000000
## 25% 0.000000 0.000000 0.000000
## 50% 0.000000 0.000000 0.000000
## 75% 0.000000 3.000000 1.000000
## max 1.000000 2098.000000 182.000000
##
## exam_quizzes_count exam_quiz_sessions_count actions_total_count \
## count 4101.000000 4101.000000 4101.000000
## mean 11.113631 0.507193 318.341624
## std 83.259036 2.336614 740.858978
## min 0.000000 0.000000 1.000000
## 25% 0.000000 0.000000 20.000000
## 50% 0.000000 0.000000 65.000000
## 75% 0.000000 0.000000 294.000000
## max 1958.000000 64.000000 12588.000000
##
## action_page_scroll_count action_out_of_focus_count \
## count 4101.000000 4101.000000
## mean 63.374543 54.542307
## std 219.958797 128.129422
## min 0.000000 0.000000
## 25% 0.000000 3.000000
## 50% 10.000000 11.000000
## 75% 47.000000 50.000000
## max 6635.000000 3280.000000
##
## action_page_read_count last_day_active unique_days_active \
## count 4101.000000 4101.000000 4101.000000
## mean 2.576201 2.373567 2.363082
## std 8.988033 2.916467 1.907902
## min 0.000000 0.000000 1.000000
## 25% 0.000000 0.000000 1.000000
## 50% 0.000000 0.000000 1.000000
## 75% 0.000000 6.000000 3.000000
## max 136.000000 7.000000 8.000000
##
## active_more_than_1_unique_day reading_sessions_count \
## count 4101.000000 4101.000000
## mean 0.473299 3.375274
## std 0.499347 3.708509
## min 0.000000 1.000000
## 25% 0.000000 1.000000
## 50% 0.000000 2.000000
## 75% 1.000000 4.000000
## max 1.000000 77.000000
##
## have_read_less_than_10mins dropout
## count 4101.000000 4101.000000
## mean 0.507925 0.520605
## std 0.499998 0.499636
## min 0.000000 0.000000
## 25% 0.000000 0.000000
## 50% 1.000000 1.000000
## 75% 1.000000 1.000000
## max 1.000000 1.000000
There doesn’t seem to be any missing data in the dataset.
df.isna().any()
## total_rt False
## courses_rt False
## exam_rt False
## guidedlearning_rt False
## profile_rt False
## quiz_rt False
## welcome_rt False
## have_taken_a_quiz False
## have_taken_an_exam_quiz False
## have_repeated_a_quiz False
## quizzes_count False
## quiz_sessions_count False
## exam_quizzes_count False
## exam_quiz_sessions_count False
## actions_total_count False
## action_page_scroll_count False
## action_out_of_focus_count False
## action_page_read_count False
## last_day_active False
## unique_days_active False
## active_more_than_1_unique_day False
## reading_sessions_count False
## have_read_less_than_10mins False
## dropout False
## dtype: bool
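An even terser check counts the missing cells across the whole frame; a value of 0 confirms the dataset is complete.
# Total number of missing cells across the entire dataset
print(df.isna().sum().sum())
## 0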
Let’s check the balance between dropout and retention students. Note that retention = 0 and dropout = 1. The balance appears to be surprisingly good, with 1966 retained students and 2135 dropouts.
dct = {}
for i in df['dropout']:
    if i in dct.keys():
        dct[i] += 1
    else:
        dct[i] = 1
print(dct)
## {1: 2135, 0: 1966}
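As an aside, pandas offers a one-liner that produces the same counts:
# Equivalent, more idiomatic class count
print(df['dropout'].value_counts().to_dict())
## {1: 2135, 0: 1966}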
Next, we’ll move on to some exploratory data analyses. Out of curiosity, we’ll check the point-biserial correlations between the student activity features and dropout using a heatmap. Point-biserial correlations can be interpreted as effect sizes, where 0.2-0.5 indicates a small effect, 0.5-0.8 a moderate effect, and >= 0.8 a large effect. It appears that two of the features, namely last_day_active and unique_days_active, showed a correlation with dropout corresponding to a medium effect.
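As a side note, for a binary variable the point-biserial correlation equals the Pearson correlation that df.corr() computes, so the heatmap below already shows the right coefficients. If one wants to cross-check a single pair, scipy offers a direct implementation (a minimal sketch):
from scipy import stats
# Point-biserial correlation between dropout and one activity feature
r, p = stats.pointbiserialr(df['dropout'], df['unique_days_active'])
print(round(r, 3))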
df_corr = df.corr()
mask = np.zeros_like(df_corr, dtype = bool)
mask[np.triu_indices_from(mask)] = True
b = sns.heatmap(df_corr, mask = mask, annot = True, annot_kws = {"size": 4})
b.tick_params(labelsize = 4)
plt.show()
Let’s visualize the four student activity features that showed the strongest relationships to dropout using bar plots. There appear to be huge class-level differences in each of these features (note that error bars represent the standard error of the mean).
df_slice = df[['dropout', 'last_day_active', 'unique_days_active', 'have_read_less_than_10mins', 'reading_sessions_count']].copy()
df_slice['dropout'] = df_slice['dropout'].apply(lambda x: "Dropout" if x == 1 else "Retention")
plt.subplot(2, 2, 1)
sns.barplot(x="dropout", y="last_day_active", data = df_slice)
plt.subplot(2, 2, 2)
sns.barplot(x="dropout", y="unique_days_active", data = df_slice)
plt.subplot(2, 2, 3)
sns.barplot(x="dropout", y="have_read_less_than_10mins", data = df_slice)
plt.subplot(2, 2, 4)
sns.barplot(x="dropout", y="reading_sessions_count", data = df_slice)
plt.tight_layout()
plt.show()
Data preprocessing is important before the actual model computation. First, we split the df dataset into X and y, where X includes the student activity features and y the target variable (i.e., dropout).
X = df.drop("dropout", axis = 1)
y = df['dropout']
Next, we split the data into a training and a test set using the train_test_split function. We set random_state to 0 for the sake of reproducibility and use a test size of 0.25 (i.e., the training set includes 75% of the students, whereas the test set includes 25%).
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0, test_size = 0.25)
Moreover, we check whether the proportion of dropouts is balanced across the training and test sets. The proportions match closely, with 51% of the students being dropouts in the training set and 54% in the test set.
dct_train = {}
for i in list(y_train):
    if i in dct_train.keys():
        dct_train[i] += 1
    else:
        dct_train[i] = 1
print(dct_train.get(1) / sum(dct_train.values()))
## 0.5138211382113821
dct_test = {}
for i in list(y_test):
    if i in dct_test.keys():
        dct_test[i] += 1
    else:
        dct_test[i] = 1
print(dct_test.get(1) / sum(dct_test.values()))
## 0.5409356725146199
It’s important to standardize the features before feeding them to the model(s). This ensures that all features are on the same scale (i.e., mean = 0, sd = 1) and are thus treated equally by the models. Note that the scaler is fit on the training set only and merely applied to the test set, which prevents information from the test set leaking into training.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
To be on the safe side, we also check that the numpy arrays are shaped as they should be.
print(X_train.shape)
## (3075, 23)
print(X_test.shape)
## (1026, 23)
print(y_train.shape)
## (3075,)
print(y_test.shape)
## (1026,)
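Similarly, one can verify that the standardization did what we expect: the training features should now have a mean of approximately 0 and a standard deviation of approximately 1.
# Sanity check: standardized training features have mean ~0 and sd ~1
print(X_train.mean(axis = 0).round(2))
print(X_train.std(axis = 0).round(2))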
Lastly, I define a function called metrics, which yields a one-row dataframe with the four performance metrics typically used in classification problems: accuracy, precision, recall, and F1 score.
def metrics(mat, name):
    # mat is a 2x2 confusion matrix: rows = true labels, columns = predicted labels
    tn, fp, fn, tp = mat[0][0], mat[0][1], mat[1][0], mat[1][1]
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1_score = tp / (tp + (fp + fn) / 2) # equivalent to 2*TP / (2*TP + FP + FN)
    return pd.DataFrame([{"model": name, "accuracy": accuracy, "precision": precision,
                          "recall": recall, "f1_score": f1_score}])
Finally, we can start fitting the learning algorithms. In this project, I’ll use five algorithms, namely Logistic regression, Random forest, Gradient boosting, XGBoost, and the Support vector classifier.
Logistic regression measures the relationship between the categorical target variable and one or more predictors by estimating probabilities with a logistic/sigmoid function. It is the algorithm one typically starts with, and against which the more complex non-linear algorithms are compared.
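As a quick illustration, the sigmoid squashes any real-valued score (a weighted sum of the features) into a probability between 0 and 1; a score of 0 sits exactly on the decision boundary.
# Minimal illustration of the logistic/sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
print(sigmoid(0.0)) # 0.5, i.e., the decision boundary
print(sigmoid(2.5)) # close to 1, i.e., a confident positive prediction
Now let’s fit the model.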
## Logistic regression
# Set classifier
logReg = LogisticRegression()
# Fit the classifier to the training set
logReg.fit(X_train, y_train)
# Predict the labels of the test set
y_pred = logReg.predict(X_test)
# Compute confusion matrix
log_regression = confusion_matrix(y_test, y_pred)
# Retrieve metrics
logreg_metrics = metrics(log_regression, "Logistic regression")
logreg_metrics
## model accuracy precision recall f1_score
## 0 Logistic regression 0.77193 0.76013 0.845045 0.800341
The performance appears to be quite good, with an accuracy of 77.2%.
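As a cross-check of the hand-rolled metrics function, sklearn’s classification_report (imported earlier) computes the same precision, recall, and F1 from the raw labels; the row for class 1 should match the values above.
# Cross-check against sklearn's built-in report (class 1 = dropout)
print(classification_report(y_test, y_pred))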
The random forest is an ensemble learning algorithm that builds upon the Classification And Regression Tree (CART) paradigm. It uses bagging and feature randomness when building each individual tree, aiming to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree. Random forests have consistently been shown to generalize well to unseen data. Let’s fit the model.
# Initiate the RF classifier
rf = RandomForestClassifier(max_depth = 8, max_features = 1, n_estimators = 300, random_state = 0)
# Fit on training set
rf = rf.fit(X_train, y_train)
# Predict on test set
y_pred = rf.predict(X_test)
# Compute confusion matrix
rf_conf = confusion_matrix(y_test, y_pred)
# Retrieve metrics
rf_metrics = metrics(rf_conf, "Random Forest")
rf_metrics
## model accuracy precision recall f1_score
## 0 Random Forest 0.789474 0.776509 0.857658 0.815068
The output shows that the Random forest performs quite well, yielding an accuracy of 78.9%.
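A useful by-product of the Random forest is its feature importances, which indicate which first-week activity features drive the predictions (a quick sketch; the names come from the columns of X):
# Top-5 most important features according to the fitted forest
importances = pd.Series(rf.feature_importances_, index = X.columns)
print(importances.sort_values(ascending = False).head())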
Gradient boosting is another CART ensemble algorithm, which relies on the idea that the best possible next model, when combined with the previous models, minimizes the overall prediction error. Let’s fit the model.
# Initiate the gradient boosting classifier
gb = GradientBoostingClassifier(learning_rate = 0.3, max_depth = 2, n_estimators = 10)
# Fit on training set
gb = gb.fit(X_train, y_train)
# Predict on test set
y_pred = gb.predict(X_test)
# Compute confusion matrix
gb_conf = confusion_matrix(y_test, y_pred)
# Retrieve metrics
gb_metrics = metrics(gb_conf, "Gradient boosting")
gb_metrics
## model accuracy precision recall f1_score
## 0 Gradient boosting 0.780702 0.767857 0.852252 0.807857
Gradient boosting shows quite a decent performance as well, with its accuracy (78.1%) being slightly lower than that of the Random forest.
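The hyperparameters used above were fixed by hand; GridSearchCV (imported at the top but not used so far) offers a systematic alternative. A sketch, with an illustrative, arbitrary grid:
# Illustrative grid search for gradient boosting (grid values are arbitrary)
param_grid = {"learning_rate": [0.1, 0.3], "max_depth": [2, 3], "n_estimators": [10, 50]}
gb_search = GridSearchCV(GradientBoostingClassifier(), param_grid, cv = 5, scoring = "accuracy")
gb_search.fit(X_train, y_train)
print(gb_search.best_params_)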
The third CART ensemble learning algorithm used here is XGBoost, which builds upon function approximation by optimizing specific loss functions and applying several regularization techniques. Let’s fit the model.
# Initiate the XGboost classifier
xgb = XGBClassifier(learning_rate = 0.4, max_depth = 2, n_estimators = 16)
# Fit on training set
xgb = xgb.fit(X_train, y_train)
# Predict on test set
y_pred = xgb.predict(X_test)
# Compute confusion matrix
xgb_conf = confusion_matrix(y_test, y_pred)
# Retrieve metrics
xgb_metrics = metrics(xgb_conf, "XGBoost")
xgb_metrics
## model accuracy precision recall f1_score
## 0 XGBoost 0.785575 0.77686 0.846847 0.810345
The performance of XGBoost (accuracy = 78.6%) appears to be better than that of Gradient boosting but slightly worse than that of the Random forest.
Lastly, we’ll see how well the Support vector classifier (SVC) generalizes. In its basic form, the SVC, like logistic regression, constructs a hyperplane that separates the data into classes; with a non-linear kernel such as the RBF kernel used here, it can also learn non-linear decision boundaries.
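For intuition, the RBF kernel scores the similarity of two samples by their squared distance, with gamma controlling how fast the similarity decays: k(x, z) = exp(-gamma * ||x - z||^2). A minimal sketch:
# RBF kernel value for two feature vectors; identical points score 1.0
def rbf_kernel(x, z, gamma = 0.01):
    return np.exp(-gamma * np.sum((x - z) ** 2))
print(rbf_kernel(np.zeros(3), np.zeros(3))) # 1.0
print(rbf_kernel(np.zeros(3), np.ones(3))) # < 1, decays with distance
Let’s fit the model.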
# Initiate the SVC classifier
svc = SVC(C = 10, gamma = 0.01, kernel = 'rbf')
# Fit on training set
svc = svc.fit(X_train, y_train)
# Predict on test set
y_pred = svc.predict(X_test)
# Compute confusion matrix
svc_conf = confusion_matrix(y_test, y_pred)
# Retrieve metrics
svc_metrics = metrics(svc_conf, "Support vector classifier")
svc_metrics
## model accuracy precision recall f1_score
## 0 Support vector classifier 0.785575 0.77686 0.846847 0.810345
The SVC also appears to yield quite good performance (accuracy = 78.6%), on par with the performance metrics of XGBoost.
Finally, we combine the output into a single data frame and sort the models by accuracy in descending order. It seems that the Random forest was the best classifier on all performance metrics in the present project.
final_metrics = pd.concat([logreg_metrics, rf_metrics, gb_metrics, xgb_metrics, svc_metrics])
print(final_metrics.sort_values(by = "accuracy", ascending = False))
## model accuracy precision recall f1_score
## 0 Random Forest 0.789474 0.776509 0.857658 0.815068
## 0 XGBoost 0.785575 0.776860 0.846847 0.810345
## 0 Support vector classifier 0.785575 0.776860 0.846847 0.810345
## 0 Gradient boosting 0.780702 0.767857 0.852252 0.807857
## 0 Logistic regression 0.771930 0.760130 0.845045 0.800341
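Accuracy is not the only lens: the roc_curve and roc_auc_score functions imported at the top can compare classifiers in a threshold-free way. A sketch for the Random forest, using its class-1 probabilities:
# ROC curve for the random forest (probability of the dropout class)
y_prob = rf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_prob)
plt.plot(fpr, tpr, label = "Random forest (AUC = %.3f)" % roc_auc_score(y_test, y_prob))
plt.plot([0, 1], [0, 1], linestyle = "--", color = "grey") # chance level
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()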
In this tutorial, I examined dropout on a Massive Open Online Course (MOOC) platform using student activity data from the first week after enrolment. I used five machine learning algorithms (Logistic regression, Random forest, Gradient boosting, XGBoost, and the Support vector classifier). Of these models, the Random forest classifier yielded the best results, with an accuracy of 78.9% and a recall of 85.8%, suggesting that students who will drop out can be detected fairly accurately already during the first week of studying. I hope you enjoyed the tutorial.