Summary

In this project, I’ll demonstrate how to predict future dropout using some of the most well-established machine learning algorithms. The data stems from a massive open online course (MOOC) platform targeted primarily at medical students. The data has been preprocessed beforehand: I queried user activity data from each student’s first week of studying. Using this data, I’ll explore whether student activity during the first week predicts future dropout. The outcome is a binary variable defined by whether the student was inactive during the two consecutive weeks following the first week. If no activity is present, the student is classified as dropout; conversely, if the student was active, he/she is classified as retention.
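To make the outcome definition concrete, here is a minimal sketch of how such a label could be derived from a per-student activity log. This is not the preprocessing actually used for df_dropout.csv; the column names (user_id, week, actions) are hypothetical.

import pandas as pd

# Toy activity log (hypothetical columns): one row per student per week
log = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "week":    [1, 2, 1, 2, 3],
    "actions": [10, 5, 3, 0, 0],
})

# Dropout rule sketched in the summary: no activity in weeks 2 and 3 -> dropout (1)
all_users = log["user_id"].unique()
later = log[log["week"].isin([2, 3])]
active_later = later.groupby("user_id")["actions"].sum().reindex(all_users, fill_value=0) > 0
dropout = (~active_later).astype(int).rename("dropout")
print(dropout)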


Load required packages and import data

Load the required packages for data import, preprocessing, modelling, and evaluation.

import pandas as pd # For data manipulation
import numpy as np  # For data manipulation
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
import seaborn as sns
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.metrics import confusion_matrix
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)
import warnings
warnings.filterwarnings('ignore')

Read dataset

df = pd.read_csv("df_dropout.csv")

A first glance at the data

We’ll start by looking a bit closer at the dataset. It includes 4101 rows, with each row corresponding to aggregated scores for a unique student from the first study week.

print(len(df))
## 4101

The dataset has altogether 24 columns, of which 23 are student activity features; dropout is the target variable.

print(df.columns)
## Index(['total_rt', 'courses_rt', 'exam_rt', 'guidedlearning_rt', 'profile_rt',
##        'quiz_rt', 'welcome_rt', 'have_taken_a_quiz', 'have_taken_an_exam_quiz',
##        'have_repeated_a_quiz', 'quizzes_count', 'quiz_sessions_count',
##        'exam_quizzes_count', 'exam_quiz_sessions_count', 'actions_total_count',
##        'action_page_scroll_count', 'action_out_of_focus_count',
##        'action_page_read_count', 'last_day_active', 'unique_days_active',
##        'active_more_than_1_unique_day', 'reading_sessions_count',
##        'have_read_less_than_10mins', 'dropout'],
##       dtype='object')

It seems that all features are either integers or floats.

print(df.dtypes)
## total_rt                         float64
## courses_rt                       float64
## exam_rt                          float64
## guidedlearning_rt                float64
## profile_rt                       float64
## quiz_rt                          float64
## welcome_rt                       float64
## have_taken_a_quiz                  int64
## have_taken_an_exam_quiz            int64
## have_repeated_a_quiz               int64
## quizzes_count                    float64
## quiz_sessions_count              float64
## exam_quizzes_count               float64
## exam_quiz_sessions_count         float64
## actions_total_count                int64
## action_page_scroll_count         float64
## action_out_of_focus_count        float64
## action_page_read_count           float64
## last_day_active                    int64
## unique_days_active                 int64
## active_more_than_1_unique_day      int64
## reading_sessions_count             int64
## have_read_less_than_10mins         int64
## dropout                            int64
## dtype: object

We’ll also take a glimpse at the descriptives of the student activity features and the target variable.

print(df.describe())
##           total_rt   courses_rt      exam_rt  guidedlearning_rt   profile_rt  \
## count  4101.000000  4101.000000  4101.000000        4101.000000  4101.000000   
## mean     59.364115     1.349522     9.744260           0.184371     0.115096   
## std     139.144609     2.834167    59.695501           1.362583     0.734116   
## min       0.000000     0.000000     0.000000           0.000000     0.000000   
## 25%       2.300667     0.143583     0.000000           0.000000     0.000000   
## 50%       9.371117     0.594517     0.000000           0.000000     0.000000   
## 75%      49.112367     1.493050     0.000000           0.000000     0.000000   
## max    2243.159717    63.635350  1444.356150          50.858583    35.125683   
## 
##            quiz_rt   welcome_rt  have_taken_a_quiz  have_taken_an_exam_quiz  \
## count  4101.000000  4101.000000        4101.000000              4101.000000   
## mean      0.198568     0.219454           0.374055                 0.166301   
## std       0.757390     0.719694           0.483937                 0.372396   
## min       0.000000     0.000000           0.000000                 0.000000   
## 25%       0.000000     0.000000           0.000000                 0.000000   
## 50%       0.000000     0.000000           0.000000                 0.000000   
## 75%       0.000000     0.000000           1.000000                 0.000000   
## max      24.633050     9.859617           1.000000                 1.000000   
## 
##        have_repeated_a_quiz  quizzes_count  quiz_sessions_count  \
## count           4101.000000    4101.000000          4101.000000   
## mean               0.176786      20.629115             3.066325   
## std                0.381534      83.911677             9.027302   
## min                0.000000       0.000000             0.000000   
## 25%                0.000000       0.000000             0.000000   
## 50%                0.000000       0.000000             0.000000   
## 75%                0.000000       3.000000             1.000000   
## max                1.000000    2098.000000           182.000000   
## 
##        exam_quizzes_count  exam_quiz_sessions_count  actions_total_count  \
## count         4101.000000               4101.000000          4101.000000   
## mean            11.113631                  0.507193           318.341624   
## std             83.259036                  2.336614           740.858978   
## min              0.000000                  0.000000             1.000000   
## 25%              0.000000                  0.000000            20.000000   
## 50%              0.000000                  0.000000            65.000000   
## 75%              0.000000                  0.000000           294.000000   
## max           1958.000000                 64.000000         12588.000000   
## 
##        action_page_scroll_count  action_out_of_focus_count  \
## count               4101.000000                4101.000000   
## mean                  63.374543                  54.542307   
## std                  219.958797                 128.129422   
## min                    0.000000                   0.000000   
## 25%                    0.000000                   3.000000   
## 50%                   10.000000                  11.000000   
## 75%                   47.000000                  50.000000   
## max                 6635.000000                3280.000000   
## 
##        action_page_read_count  last_day_active  unique_days_active  \
## count             4101.000000      4101.000000         4101.000000   
## mean                 2.576201         2.373567            2.363082   
## std                  8.988033         2.916467            1.907902   
## min                  0.000000         0.000000            1.000000   
## 25%                  0.000000         0.000000            1.000000   
## 50%                  0.000000         0.000000            1.000000   
## 75%                  0.000000         6.000000            3.000000   
## max                136.000000         7.000000            8.000000   
## 
##        active_more_than_1_unique_day  reading_sessions_count  \
## count                    4101.000000             4101.000000   
## mean                        0.473299                3.375274   
## std                         0.499347                3.708509   
## min                         0.000000                1.000000   
## 25%                         0.000000                1.000000   
## 50%                         0.000000                2.000000   
## 75%                         1.000000                4.000000   
## max                         1.000000               77.000000   
## 
##        have_read_less_than_10mins      dropout  
## count                 4101.000000  4101.000000  
## mean                     0.507925     0.520605  
## std                      0.499998     0.499636  
## min                      0.000000     0.000000  
## 25%                      0.000000     0.000000  
## 50%                      1.000000     1.000000  
## 75%                      1.000000     1.000000  
## max                      1.000000     1.000000

There doesn’t seem to be any missing data in the dataset.

df.isna().any()
## total_rt                         False
## courses_rt                       False
## exam_rt                          False
## guidedlearning_rt                False
## profile_rt                       False
## quiz_rt                          False
## welcome_rt                       False
## have_taken_a_quiz                False
## have_taken_an_exam_quiz          False
## have_repeated_a_quiz             False
## quizzes_count                    False
## quiz_sessions_count              False
## exam_quizzes_count               False
## exam_quiz_sessions_count         False
## actions_total_count              False
## action_page_scroll_count         False
## action_out_of_focus_count        False
## action_page_read_count           False
## last_day_active                  False
## unique_days_active               False
## active_more_than_1_unique_day    False
## reading_sessions_count           False
## have_read_less_than_10mins       False
## dropout                          False
## dtype: bool

Let’s check the balance between dropout and retention students. Note that retention = 0 and dropout = 1. The balance appears to be surprisingly good, with 1966 retention students and 2135 dropout students.

dct = {}

for i in df['dropout']:
    if i in dct.keys():
        dct[i] += 1
    else:
        dct[i] = 1
    
print(dct)
## {1: 2135, 0: 1966}
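The same counts can also be obtained with a one-liner; this is simply an equivalent pandas idiom.

# Equivalent, more idiomatic count of the two classes
print(df['dropout'].value_counts())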

Exploratory data analysis (EDA)

Next, we’ll move on to some exploratory data analyses. Out of curiosity, we’ll check the point-biserial correlations between the student activity features and dropout using a heatmap. Point-biserial correlations can be interpreted as effect sizes; following the usual conventions for correlation coefficients, roughly |r| ≈ 0.1 indicates a small effect, ≈ 0.3 a medium effect, and ≥ 0.5 a large effect. It appears that two of the features, last_day_active and unique_days_active, show a correlation with dropout corresponding to a medium effect.

df_corr = df.corr()

# Mask the upper triangle so each correlation is shown only once
mask = np.zeros_like(df_corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
b = sns.heatmap(df_corr, mask=mask, annot=True, annot_kws={"size": 4})
b.tick_params(labelsize=4)
plt.show()
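As a cross-check, the point-biserial correlation between a single feature and the binary dropout label can also be computed directly; the sketch below assumes scipy is available (it is a dependency of scikit-learn).

# Point-biserial correlation between one activity feature and the binary dropout label.
# For a 0/1 target this is numerically equivalent to the Pearson correlation in the heatmap.
from scipy.stats import pointbiserialr

r, p = pointbiserialr(df['dropout'], df['last_day_active'])
print(round(r, 3), round(p, 4))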


Let’s visualize the four student activity features that showed the strongest relationships to dropout using bar plots. There appear to be substantial differences between the classes in each of these features (note that error bars represent the standard error of the mean).

df_slice = df[['dropout', 'last_day_active', 'unique_days_active', 'have_read_less_than_10mins', 'reading_sessions_count']].copy()

# Recode the target into readable class labels for plotting
df_slice['dropout'] = df_slice['dropout'].apply(lambda x: "Dropout" if x == 1 else "Retention")

plt.subplot(2, 2, 1)
sns.barplot(x="dropout", y="last_day_active", data=df_slice)
plt.subplot(2, 2, 2)
sns.barplot(x="dropout", y="unique_days_active", data=df_slice)
plt.subplot(2, 2, 3)
sns.barplot(x="dropout", y="have_read_less_than_10mins", data=df_slice)
plt.subplot(2, 2, 4)
sns.barplot(x="dropout", y="reading_sessions_count", data=df_slice)
plt.tight_layout()
plt.show()

Data preprocessing

Data preprocessing is important before the actual model computation. First, we split the df dataset into X and y, where X includes the student activity features and y the target variable (i.e., dropout).

X = df.drop("dropout", axis = 1)

y = df['dropout']

Next, we split the data into a training set and a test set using the train_test_split function. We set random_state to 0 for the sake of replicability and use a test size of 0.25 (i.e., the training set includes 75% of the students, whereas the test set includes 25% of the students).

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0, test_size = 0.25)

Moreover, we check whether the proportion of dropouts is balanced across the training and test sets. The two sets match closely, with 51% of the students being dropouts in the training set and 54% in the test set.

dct_train = {}
for i in list(y_train):
    if i in dct_train.keys():
        dct_train[i] += 1
    else:
        dct_train[i] = 1

print(dct_train.get(1) / sum(dct_train.values()))
## 0.5138211382113821
dct_test = {}
for i in list(y_test):
    if i in dct_test.keys():
        dct_test[i] += 1
    else:
        dct_test[i] = 1

print(dct_test.get(1) / sum(dct_test.values()))
## 0.5409356725146199
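If one wants to guarantee (rather than just check) that the class proportions match, train_test_split accepts a stratify argument; a minimal alternative split would look like this (not used in the rest of the tutorial).

# Alternative: a stratified split keeps the dropout proportion identical in both sets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, random_state=0, test_size=0.25, stratify=y)
print(y_train_s.mean(), y_test_s.mean())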


It’s important to standardize the features before feeding them to the model(s). This ensures that all features are on the same scale (i.e., mean = 0, sd = 1) and are thus treated equally by the models.

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
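A quick way to confirm that the scaling worked is to look at the column means and standard deviations of the transformed training set; they should be numerically close to 0 and 1.

# Sanity check: standardized training features should have mean ~0 and sd ~1
print(np.round(X_train.mean(axis=0), 3))
print(np.round(X_train.std(axis=0), 3))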

To be on the safe side, we also check that the numpy arrays are shaped as they should be.

print(X_train.shape)
## (3075, 23)
print(X_test.shape)
## (1026, 23)
print(y_train.shape)
## (3075,)
print(y_test.shape)
## (1026,)


Lastly, I define a function called metrics, which returns a one-row dataframe with the four performance metrics typically used in classification problems:

def metrics(mat, name):
    # mat is a 2x2 confusion matrix: rows = true labels, columns = predicted labels
    tn, fp, fn, tp = mat[0][0], mat[0][1], mat[1][0], mat[1][1]
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1_score = tp / (tp + (fp + fn) / 2)
    return pd.DataFrame([{"model": name, "accuracy": accuracy, "precision": precision,
                          "recall": recall, "f1_score": f1_score}],
                        columns=["model", "accuracy", "precision", "recall", "f1_score"])
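Since these four metrics are also available directly in scikit-learn, the helper can be cross-checked against the built-in scorers; the sketch below is an equivalent version that works from the raw labels rather than the confusion matrix.

# Cross-check of the metrics helper against scikit-learn's built-in scorers
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def metrics_sklearn(y_true, y_pred, name):
    return pd.DataFrame([{
        "model": name,
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1_score": f1_score(y_true, y_pred),
    }])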

Machine learning

Finally, we can start computing the learning algorithms. In this project, I’ll use five algorithms, namely Logistic regression, Random forest, Gradient boosting, XGBoost, and Support vector classifier.

Logistic regression

Logistic regression measures the relationship between the categorical target variable and one or more predictors by estimating probabilities with a logistic/sigmoid function. It is typically the algorithm one starts with, and the one against which more complex non-linear algorithms are compared. Let’s compute the model.
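For reference, the sigmoid that maps a linear combination of the features to a dropout probability looks like this (a minimal illustration, not part of the modelling code).

# The logistic/sigmoid function: maps any real-valued score z to a probability in (0, 1)
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0), sigmoid(2), sigmoid(-2))  # 0.5, ~0.88, ~0.12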

## Logistic regression
# Set classifier
logreg = LogisticRegression()

# Fit the classifier to the training set
logreg = logreg.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = logreg.predict(X_test)

# Compute confusion matrix
log_regression = confusion_matrix(y_test, y_pred)

# Retrieve metrics
logreg_metrics = metrics(log_regression, "Logistic regression")

logreg_metrics
##                  model  accuracy  precision    recall  f1_score
## 0  Logistic regression   0.77193    0.76013  0.845045  0.800341

The performance appears to be quite good, with an accuracy of 77.2% and a recall of 84.5%.


Random Forest

The random forest is an ensemble learning algorithm that builds upon the Classification And Regression Tree (CART) paradigm. It uses bagging and feature randomness when building each individual tree, aiming to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree. Random forests have consistently been shown to generalize well to unseen data. Let’s compute the model.

# Initiate the RF classifier 
rf = RandomForestClassifier(max_depth =  8, max_features = 1, n_estimators = 300, random_state = 0)

# Fit on training set
rf = rf.fit(X_train, y_train)

# Predict on test set
y_pred = rf.predict(X_test)

# Compute confusion matrix
rf_conf = confusion_matrix(y_test, y_pred)

# Retrieve metrics
rf_metrics = metrics(rf_conf, "Random Forest")

rf_metrics
##            model  accuracy  precision    recall  f1_score
## 0  Random Forest  0.789474   0.776509  0.857658  0.815068

The output shows that the Random forest performs quite well, yielding an accuracy of 78.9%.
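The hyperparameters used above (max_depth = 8, max_features = 1, n_estimators = 300) look like the result of a prior tuning step; GridSearchCV was imported earlier but not shown in this write-up, so here is a hedged sketch of how such values could be found. The grid below is illustrative, not the one actually used.

# Illustrative hyperparameter search for the random forest (grid values are assumptions)
param_grid = {
    "max_depth": [4, 8, 12],
    "max_features": [1, "sqrt"],
    "n_estimators": [100, 300],
}
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print(grid.best_params_)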


Gradient boosting

Gradient boosting is another CART ensemble algorithm, which relies on the idea that the best possible next model, when combined with the previous models, minimizes the overall prediction error. Let’s compute the model.

# Initiate the gradient boosting classifier 
gb = GradientBoostingClassifier(learning_rate = 0.3, max_depth = 2, n_estimators = 10)

# Fit on training set
gb = gb.fit(X_train, y_train)

# Predict on test set
y_pred = gb.predict(X_test)

# Compute confusion matrix
gb_conf = confusion_matrix(y_test, y_pred)

# Retrieve metrics
gb_metrics = metrics(gb_conf, "Gradient boosting")

gb_metrics
##                model  accuracy  precision    recall  f1_score
## 0  Gradient boosting  0.780702   0.767857  0.852252  0.807857

Gradient boosting shows quite decent performance as well, with the accuracy (78.1%) being slightly lower than that of the Random forest.
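To illustrate the "each new tree corrects the previous ones" idea, we can inspect the fitted booster's staged predictions; a small sketch (not part of the original analysis):

# Test accuracy after each boosting stage: later trees should, on average, reduce the error
from sklearn.metrics import accuracy_score

for stage, y_stage in enumerate(gb.staged_predict(X_test), start=1):
    print(stage, round(accuracy_score(y_test, y_stage), 3))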


XGBoost

The third CART ensemble learning algorithm used here is XGBoost, which builds upon gradient boosting by optimizing a regularized objective (a loss function plus penalty terms) and adding several algorithmic optimizations. Let’s compute the model.

# Initiate the XGboost classifier 
xgb = XGBClassifier(learning_rate = 0.4, max_depth = 2, n_estimators = 16)

# Fit on training set
xgb = xgb.fit(X_train, y_train)

# Predict on test set
y_pred = xgb.predict(X_test)

# Compute confusion matrix
xgb_conf = confusion_matrix(y_test, y_pred)

# Retrieve metrics
xgb_metrics = metrics(xgb_conf, "XGBoost")

xgb_metrics
##      model  accuracy  precision    recall  f1_score
## 0  XGBoost  0.785575    0.77686  0.846847  0.810345

The performance of XGBoost (accuracy = 78.6%) appears to be better than that of Gradient boosting but slightly worse than that of the Random forest.
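As a side note, the fitted booster also exposes feature importances, which can hint at which first-week activities drive the predictions; since the scaled matrices are plain numpy arrays, the column names are taken from X. This is an optional sketch, not part of the original pipeline.

# Feature importances of the fitted XGBoost model, matched back to the original column names
importances = pd.Series(xgb.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))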


Support vector classifier

Lastly, we’ll see how well the Support vector classifier (SVC) generalizes. An SVC finds the hyperplane that best separates the classes; with a linear kernel it is a linear model like logistic regression, but here we use a radial basis function (RBF) kernel, which allows a non-linear decision boundary. Let’s compute the model.

# Initiate the SVC classifier 
svc = SVC(C =  10, gamma = 0.01, kernel = 'rbf')

# Fit on training set
svc = svc.fit(X_train, y_train)

# Predict on test set
y_pred = svc.predict(X_test)

# Compute confusion matrix
svc_conf = confusion_matrix(y_test, y_pred)

# Retrieve metrics
svc_metrics = metrics(svc_conf, "Support vector classifier")

svc_metrics
##                        model  accuracy  precision    recall  f1_score
## 0  Support vector classifier  0.785575    0.77686  0.846847  0.810345

The SVC also appears to yield quite good performance (accuracy = 78.6%), on par with the performance metrics of XGBoost.


Retrieve summary metrics

Lastly, we combine the per-model outputs into a single data frame and sort by accuracy in descending order. The Random forest was the best classifier on accuracy, recall, and F1 score in this project, with XGBoost and the SVC marginally ahead on precision.

final_metrics = pd.concat([logreg_metrics, rf_metrics, gb_metrics, xgb_metrics, svc_metrics])

print(final_metrics.sort_values(by = "accuracy", ascending = False))
##                        model  accuracy  precision    recall  f1_score
## 0              Random Forest  0.789474   0.776509  0.857658  0.815068
## 0                    XGBoost  0.785575   0.776860  0.846847  0.810345
## 0  Support vector classifier  0.785575   0.776860  0.846847  0.810345
## 0          Gradient boosting  0.780702   0.767857  0.852252  0.807857
## 0        Logistic regression  0.771930   0.760130  0.845045  0.800341
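roc_curve and roc_auc_score were imported at the top but not used above; as a complement to accuracy, a sketch of the ROC AUC for the best-performing model could look like this (not part of the original results).

# ROC AUC for the random forest, using predicted dropout probabilities on the test set
y_proba = rf.predict_proba(X_test)[:, 1]
print(round(roc_auc_score(y_test, y_proba), 3))

fpr, tpr, _ = roc_curve(y_test, y_proba)
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()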


Conclusions

In this tutorial, I examined dropout on a Massive open online course (MOOC) platform using student activity data from the first week after enrolment. I used five machine learning algorithms (Logistic regression, Random forest, Gradient boosting, XGBoost, and Support vector classifier). Of these models, the Random forest classifier yielded the best results, with an accuracy of 78.9% and a recall of 85.8%, suggesting that students who drop out can be detected fairly accurately already during their first week of studying. I hope you enjoyed the tutorial.