STA 631 Project: Online Course Engagement Data Analysis

Objectives

  1. Describe probability as a foundation of statistical modeling, including inference and maximum likelihood estimation.

  2. Determine and apply the appropriate generalized linear model for a specific data context.

  3. Conduct model selection for a set of candidate models.

  4. Use programming software (i.e., R) to fit and assess statistical models.

  5. Communicate the results of statistical models to a general audience.

About Dataset

Description:

This dataset captures user engagement metrics from an online course platform, facilitating analyses on factors influencing course completion. It includes user demographics, course-specific data, and engagement metrics.

Source: Kaggle

# Load necessary libraries
library(tidyverse)
library(ggplot2)
library(caret)
library(reshape2)
library(pROC)
library(MuMIn)

Features

UserID: Unique identifier for each user

CourseCategory: Category of the course taken by the user (e.g., Programming, Business, Arts)

TimeSpentOnCourse: Total time spent by the user on the course in hours

NumberOfVideosWatched: Total number of videos watched by the user

NumberOfQuizzesTaken: Total number of quizzes taken by the user

QuizScores: Average scores achieved by the user in quizzes (percentage)

CompletionRate: Percentage of course content completed by the user

DeviceType: Type of device used by the user (0: Desktop, 1: Mobile)

CourseCompletion (Target Variable): Course completion status (0: Not Completed, 1: Completed)

# Load the data
data <- read.csv("online_course_engagement_data.csv")

# View the structure of the dataset
str(data)
## 'data.frame':    9000 obs. of  9 variables:
##  $ UserID               : int  5618 4326 5849 4992 3866 8650 4321 4589 4215 8089 ...
##  $ CourseCategory       : chr  "Health" "Arts" "Arts" "Science" ...
##  $ TimeSpentOnCourse    : num  30 27.8 86.8 35 92.5 ...
##  $ NumberOfVideosWatched: int  17 1 14 17 16 12 10 16 8 15 ...
##  $ NumberOfQuizzesTaken : int  3 5 2 10 0 7 2 3 4 10 ...
##  $ QuizScores           : num  50.4 62.6 78.5 59.2 98.4 ...
##  $ CompletionRate       : num  20.9 65.6 63.8 95.4 18.1 ...
##  $ DeviceType           : int  1 1 1 0 0 0 1 1 0 1 ...
##  $ CourseCompletion     : int  0 0 1 1 0 1 0 0 1 0 ...
# Summary of the data
summary(data)
##      UserID     CourseCategory     TimeSpentOnCourse NumberOfVideosWatched
##  Min.   :   1   Length:9000        Min.   : 1.005    Min.   : 0.00        
##  1st Qu.:2252   Class :character   1st Qu.:25.441    1st Qu.: 5.00        
##  Median :4484   Mode  :character   Median :49.818    Median :10.00        
##  Mean   :4499                      Mean   :50.164    Mean   :10.02        
##  3rd Qu.:6751                      3rd Qu.:75.070    3rd Qu.:15.00        
##  Max.   :9000                      Max.   :99.993    Max.   :20.00        
##  NumberOfQuizzesTaken   QuizScores    CompletionRate       DeviceType    
##  Min.   : 0.000       Min.   :50.01   Min.   : 0.00933   Min.   :0.0000  
##  1st Qu.: 2.000       1st Qu.:62.28   1st Qu.:25.65361   1st Qu.:0.0000  
##  Median : 5.000       Median :74.74   Median :50.26412   Median :1.0000  
##  Mean   : 5.091       Mean   :74.71   Mean   :50.34015   Mean   :0.5007  
##  3rd Qu.: 8.000       3rd Qu.:87.02   3rd Qu.:75.57249   3rd Qu.:1.0000  
##  Max.   :10.000       Max.   :99.99   Max.   :99.97971   Max.   :1.0000  
##  CourseCompletion
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.3964  
##  3rd Qu.:1.0000  
##  Max.   :1.0000
# Check for missing values
sum(is.na(data))
## [1] 0

Objective 1: Describe Probability as a Foundation of Statistical Modeling

In this project, probability serves as the foundation for understanding the behavior of our dataset. We start with basic descriptive statistics to examine the distributions of key variables such as TimeSpentOnCourse, QuizScores, and CourseCompletion. By visualizing these distributions through histograms and bar charts, we gain insights into the probability of different events. For instance:

Histogram of TimeSpentOnCourse: This shows the empirical distribution of the time students spend on courses. The relative height of each bar estimates the probability that a randomly chosen student falls in that one-hour bin, which helps us understand typical engagement patterns.

# Visualize the distribution of time spent on course
ggplot(data, aes(x = TimeSpentOnCourse)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black") +
  labs(title = "Distribution of Time Spent on Course", x = "Time Spent (hours)", y = "Frequency")

Histogram of QuizScores: This displays the distribution of quiz scores among students, providing insights into the probability of achieving certain score ranges. This understanding helps in modeling student performance and its impact on course completion.

# Visualize the distribution of quiz scores
ggplot(data, aes(x = QuizScores)) +
  geom_histogram(binwidth = 5, fill = "green", color = "black") +
  labs(title = "Distribution of Quiz Scores", x = "Quiz Scores (%)", y = "Frequency")

Bar Chart of CourseCompletion: By visualizing the counts of students who completed or did not complete the course, we understand the overall probability of course completion within the dataset.

# Visualize the distribution of course completion status
# (factor() gives a discrete axis with only the two categories 0 and 1)
ggplot(data, aes(x = factor(CourseCompletion))) +
  geom_bar(fill = "purple", color = "black") +
  labs(title = "Distribution of Course Completion Status", x = "Course Completion (0: Not Completed, 1: Completed)", y = "Count")

# Course completion by course category (count)
ggplot(data, aes(x = CourseCategory, fill = as.factor(CourseCompletion))) +
  geom_bar(position = "dodge") +
  labs(title = "Course Completion by Course Category", x = "Course Category", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
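
Objective 1 also mentions maximum likelihood estimation. As a minimal sketch, the code below treats each user's completion as a Bernoulli trial and finds the completion probability that maximizes the likelihood. The counts (3568 completers out of 9000 users) are taken from this report's own output; the function name log_lik is an illustrative choice and not part of the original analysis.

```r
# Counts from the report's output: 3568 of 9000 users completed the course
n_completed <- 3568
n_total <- 9000

# Bernoulli log-likelihood of a candidate completion probability p
log_lik <- function(p) {
  n_completed * log(p) + (n_total - n_completed) * log(1 - p)
}

# Maximize the log-likelihood numerically over the open interval (0, 1)
fit <- optimize(log_lik, interval = c(0.001, 0.999), maximum = TRUE)
p_hat_numeric <- fit$maximum

# Closed-form maximum likelihood estimate: the sample proportion
p_hat_closed <- n_completed / n_total

print(c(numeric = p_hat_numeric, closed_form = p_hat_closed))
```

The numeric maximizer should agree with the sample proportion up to the optimizer's tolerance, which is the same value summary() reports as the mean of CourseCompletion.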

These initial steps lay the groundwork for more advanced statistical modeling by providing a probabilistic framework for interpreting the data.

Objective 2: Determine and Apply the Appropriate Generalized Linear Model

To predict whether a student completes a course (CourseCompletion), we apply a logistic regression model, which is a type of generalized linear model (GLM) suited for binary outcomes. Logistic regression models the probability that a given input point belongs to a certain class (e.g., completed or not completed). Here’s how we implement and interpret the logistic regression model:

Model Specification: The logistic regression model is specified with CourseCompletion as the dependent variable and predictors including TimeSpentOnCourse, NumberOfVideosWatched, NumberOfQuizzesTaken, QuizScores, CompletionRate, and DeviceType. This choice is based on the nature of our dependent variable, which is binary.

# Convert CourseCompletion to a factor
data$CourseCompletion <- as.factor(data$CourseCompletion)

Fitting the Model: Using the glm() function in R, we fit the logistic regression model to our data. This involves estimating the coefficients for each predictor variable to best predict the probability of course completion.

# Fit a logistic regression model
glm_model <- glm(CourseCompletion ~ TimeSpentOnCourse + NumberOfVideosWatched + NumberOfQuizzesTaken + QuizScores + CompletionRate + DeviceType, 
                 data = data, 
                 family = binomial)

Model Interpretation: The output of the logistic regression model includes coefficients that indicate the direction and strength of the relationship between each predictor and the probability of course completion. For example, a positive coefficient for TimeSpentOnCourse would suggest that more time spent on the course increases the likelihood of completion.
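
Because logistic regression is linear on the log-odds scale, exponentiating a coefficient gives an odds ratio, which is often easier to explain than the raw estimate. The sketch below uses the estimates printed in the summary(selected_model) output of this report; with the fitted object available, exp(coef(selected_model)) would give the same quantities. The 10-hour comparison is an illustrative choice.

```r
# Coefficient estimates copied from the summary(selected_model) output
coefs <- c(
  TimeSpentOnCourse     = 0.020439,
  NumberOfVideosWatched = 0.130973,
  NumberOfQuizzesTaken  = 0.307289,
  QuizScores            = 0.072361,
  CompletionRate        = 0.036382
)

# Odds ratio for a one-unit increase in each predictor, others held fixed
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))

# Odds ratio for a larger, more tangible change: 10 extra hours on the course
or_ten_hours <- exp(10 * coefs["TimeSpentOnCourse"])
print(round(or_ten_hours, 3))
```

An odds ratio above 1 means the odds of completion rise as the predictor increases, holding the other predictors constant.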

This application of logistic regression meets the objective by appropriately modeling the relationship between student engagement metrics and course completion status.

Objective 3: Conduct Model Selection

We conduct model selection using information criteria: stepwise selection based on AIC, followed by a comparison of the candidate models on AICc.

Model selection helps identify the model that best balances complexity and performance. We use stepwise selection based on AIC (Akaike Information Criterion) for this purpose.

AIC: A measure of model quality that balances goodness of fit and complexity. Lower AIC values indicate a better model.
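
As a worked check of that definition, AIC = 2k - 2 logLik and BIC = log(n) k - 2 logLik, where k is the number of estimated parameters and n the number of observations. The sketch below plugs in the log-likelihoods and parameter counts reported in this analysis's model selection table. That table reports AICc, which adds a small-sample correction that is negligible at n = 9000.

```r
# Values taken from the model selection table in this report
n_obs <- 9000
models <- data.frame(
  model   = c("selected_model", "glm_model"),
  k       = c(6, 7),                    # parameters, including the intercept
  log_lik = c(-3973.580, -3973.574)
)

# AIC penalizes each parameter by 2; BIC penalizes each parameter by log(n)
models$AIC <- 2 * models$k - 2 * models$log_lik
models$BIC <- log(n_obs) * models$k - 2 * models$log_lik

print(models)
```

Because log(9000) is much larger than 2, BIC penalizes the extra DeviceType parameter more heavily than AIC does.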

Stepwise Selection: Using the step() function in R, we perform stepwise selection, which iteratively adds or removes predictors based on their impact on the AIC. The goal is to minimize the AIC, balancing model fit and complexity.

Model Comparison: The models are compared based on their AIC values, and the one with the lowest AIC is selected as the best model. This process helps in identifying the key predictors that significantly contribute to the model while avoiding overfitting.

The selected model, after stepwise selection, includes the most relevant predictors, ensuring that our model is both interpretable and robust.
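
To show the mechanics of step() in isolation, here is a minimal sketch on simulated data. The data frame toy, its columns, and the coefficients used to generate it are all hypothetical and unrelated to the course dataset; the noise column is included only to give the procedure a candidate to drop.

```r
set.seed(42)

# Hypothetical toy data: two real predictors and one pure-noise predictor
n <- 500
toy <- data.frame(
  hours   = runif(n, 1, 100),
  quizzes = rpois(n, 5),
  noise   = rnorm(n)
)
eta <- -3 + 0.03 * toy$hours + 0.3 * toy$quizzes
toy$completed <- rbinom(n, 1, plogis(eta))

# Full model, then stepwise search that adds or drops terms to lower AIC
full <- glm(completed ~ hours + quizzes + noise, data = toy, family = binomial)
chosen <- step(full, direction = "both", trace = FALSE)

print(formula(chosen))
print(c(full = AIC(full), chosen = AIC(chosen)))
```

step() only moves to a model with a lower AIC, so the chosen model's AIC is never higher than that of the starting model.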

# Model selection using AIC
selected_model <- step(glm_model, direction = "both", trace = FALSE)

# Compare models using AICc (the criterion model.sel() ranks by here)
library(MuMIn)
model_comparison <- model.sel(glm_model, selected_model)
model_comparison
## Model selection table 
##                (Intrc)   CmplR    DvcTy  NmOQT NmOVW   QzScr   TmSOC df
## selected_model  -11.84 0.03638          0.3073 0.131 0.07236 0.02044  6
## glm_model       -11.84 0.03638 0.006056 0.3073 0.131 0.07236 0.02044  7
##                   logLik   AICc delta weight
## selected_model -3973.580 7959.2  0.00   0.73
## glm_model      -3973.574 7961.2  1.99   0.27
## Models ranked by AICc(x)
summary(selected_model)
## 
## Call:
## glm(formula = CourseCompletion ~ TimeSpentOnCourse + NumberOfVideosWatched + 
##     NumberOfQuizzesTaken + QuizScores + CompletionRate, family = binomial, 
##     data = data)
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -11.841118   0.261989  -45.20   <2e-16 ***
## TimeSpentOnCourse       0.020439   0.001018   20.07   <2e-16 ***
## NumberOfVideosWatched   0.130973   0.005002   26.18   <2e-16 ***
## NumberOfQuizzesTaken    0.307289   0.009907   31.02   <2e-16 ***
## QuizScores              0.072361   0.002220   32.59   <2e-16 ***
## CompletionRate          0.036382   0.001098   33.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12087.8  on 8999  degrees of freedom
## Residual deviance:  7947.2  on 8994  degrees of freedom
## AIC: 7959.2
## 
## Number of Fisher Scoring iterations: 5

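The residual deviance (7947.2 on 8994 degrees of freedom) is far below the null deviance (12087.8 on 8999 degrees of freedom), so the five predictors together fit the data much better than an intercept-only model. The selected model's AICc (7959.2) is about 2 lower than the full model's (7961.2) while the log-likelihoods are nearly identical, so dropping DeviceType costs essentially no fit and yields a simpler model.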
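
The drop in deviance can be tested formally with a likelihood ratio test: under the null hypothesis that none of the predictors matter, the difference between null and residual deviance is approximately chi-squared with degrees of freedom equal to the number of added parameters. This is a sketch using the deviances printed in the summary(selected_model) output, not a step from the original analysis.

```r
# Deviances and degrees of freedom from the summary(selected_model) output
null_deviance     <- 12087.8
residual_deviance <- 7947.2
df_null           <- 8999
df_residual       <- 8994

# Likelihood ratio statistic and its degrees of freedom
lr_stat <- null_deviance - residual_deviance
lr_df   <- df_null - df_residual

# Upper-tail chi-squared probability
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

print(c(statistic = lr_stat, df = lr_df, p_value = p_value))
```

With the fitted objects in memory, anova(selected_model, test = "Chisq") performs related sequential tests term by term.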

The stepwise selection process retained five predictors: TimeSpentOnCourse, NumberOfVideosWatched, NumberOfQuizzesTaken, QuizScores, and CompletionRate. Each has a positive coefficient and is statistically significant (p < 2e-16), while DeviceType was dropped. The result is a simpler model that fits as well as the full model and is straightforward to interpret.
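
To make the selected model concrete, the sketch below converts its coefficients into a predicted completion probability for one student. The coefficients come from the summary(selected_model) output; the student's values are hypothetical, chosen near the medians reported by summary(data).

```r
# Coefficients from the summary(selected_model) output
coefs <- c(
  Intercept             = -11.841118,
  TimeSpentOnCourse     = 0.020439,
  NumberOfVideosWatched = 0.130973,
  NumberOfQuizzesTaken  = 0.307289,
  QuizScores            = 0.072361,
  CompletionRate        = 0.036382
)

# Hypothetical student (the leading 1 multiplies the intercept)
student <- c(
  Intercept             = 1,
  TimeSpentOnCourse     = 50,
  NumberOfVideosWatched = 10,
  NumberOfQuizzesTaken  = 5,
  QuizScores            = 75,
  CompletionRate        = 50
)

# Linear predictor on the log-odds scale, then the inverse logit
eta  <- sum(coefs * student)
prob <- plogis(eta)

print(c(log_odds = eta, probability = prob))
```

With the fitted object available, predict(selected_model, newdata = ..., type = "response") does the same calculation.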

Objective 4: Use Programming Software (R) to Fit and Assess Statistical Models

Throughout the project, we leverage R to handle various aspects of data analysis, from data preprocessing to model fitting and assessment.

# Predict probabilities of course completion
# (uses the full model, glm_model; selected_model differs only by dropping DeviceType)
data$PredictedProb <- predict(glm_model, type = "response")

# Create a confusion matrix
table(data$CourseCompletion, data$PredictedProb > 0.5)
##    
##     FALSE TRUE
##   0  4666  766
##   1  1036 2532
# Plot ROC curve
roc_curve <- roc(data$CourseCompletion, data$PredictedProb)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_curve, main = "ROC Curve for Course Completion Prediction")

# Calculate the Area Under the Curve (AUC)
auc_value <- auc(roc_curve)
auc_value
## Area under the curve: 0.8697

Here’s how we utilize R:

Data Loading and Preprocessing: We use read.csv() to load the dataset and tidyverse for data manipulation. Functions like summary() and str() help in understanding the dataset structure.

Descriptive Statistics and Visualization: Libraries like ggplot2 are used to create visualizations that depict the distributions of key variables and relationships between them.

Model Fitting: The glm() function is employed to fit the logistic regression model. This involves specifying the formula and family (binomial) appropriate for our binary outcome variable.

Model Selection: Using the step() function, we perform stepwise model selection to identify the best model based on AIC.

Model Assessment: We assess model performance through the ROC curve and AUC using the pROC package. A confusion matrix built with table() at a 0.5 probability threshold shows that the model classifies (4666 + 2532) / 9000, or about 80%, of students correctly.
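
The confusion matrix supports more than overall accuracy. As a sketch, the code below rebuilds the matrix from the counts printed in this report and derives accuracy, sensitivity, and specificity. The object name cm is illustrative.

```r
# Confusion matrix counts from the report (rows: actual, columns: predicted > 0.5)
cm <- matrix(
  c(4666, 1036, 766, 2532),
  nrow = 2,
  dimnames = list(actual = c("0", "1"), predicted = c("FALSE", "TRUE"))
)

# Share of all students classified correctly
accuracy <- sum(diag(cm)) / sum(cm)

# Share of actual completers predicted to complete
sensitivity <- cm["1", "TRUE"] / sum(cm["1", ])

# Share of actual non-completers predicted not to complete
specificity <- cm["0", "FALSE"] / sum(cm["0", ])

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 3))
```

At the 0.5 threshold the model is better at identifying non-completers than completers; lowering the threshold would trade specificity for sensitivity.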

By using R, we demonstrate proficiency in applying statistical software to perform comprehensive data analysis and modeling tasks, meeting the course objective of fitting and assessing statistical models using programming software.

Objective 5: Communicate Results

Effectively communicating the results of statistical models involves visualizations and interpretations that are accessible to a general audience.

# Visualize predicted probabilities
ggplot(data, aes(x = PredictedProb, fill = CourseCompletion)) +
  geom_histogram(binwidth = 0.1, position = "dodge", color = "white") +
  labs(title = "Predicted Probabilities of Course Completion", x = "Predicted Probability", y = "Count")

Insights from Visualizations:

Time Spent on Course and Quiz Scores: Significant predictors of course completion, as more time spent and higher quiz scores are associated with higher completion rates.

Course Category: Completion rates vary across different course categories, indicating that course type might influence completion rates.

Summary and Interpretation:

Descriptive Statistics: Provided an overview of key variables like Time Spent on Course and Quiz Scores.

GLM Application: Applied a logistic regression model to predict course completion based on engagement metrics.

Model Selection: Used stepwise selection on AIC and an AICc comparison to select the best-fit model.

Model Assessment: Evaluated model performance using predicted probabilities and ROC curve analysis.

Time Spent on Course: Significant positive predictor of course completion.

Number of Videos Watched: Also positively associated with course completion.

Quiz Scores: Higher quiz scores increase the likelihood of completing the course.

Device Type: Not a meaningful predictor. Its coefficient in the full model is close to zero (0.006), and stepwise selection dropped it from the final model.

Interpretation of the ROC Curve:

AUC Value:

Here, the auc_value is 0.8697, which indicates that the model has a good ability to distinguish between students who will complete the course and those who will not.

An AUC value of 0.8697 means that there is an 86.97% chance that the model will rank a randomly chosen student who completed the course higher than a randomly chosen student who did not.
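
This ranking interpretation can be checked directly. The sketch below uses simulated scores (hypothetical, not the course data) and computes the AUC two ways: as the share of completer and non-completer pairs in which the completer has the higher score, and through the equivalent rank-sum formula.

```r
set.seed(1)

# Hypothetical model scores: completers tend to score higher than non-completers
scores_completed <- rnorm(200, mean = 1)
scores_not       <- rnorm(300, mean = 0)

# AUC as a pairwise probability: compare every completer with every non-completer
wins <- outer(scores_completed, scores_not, ">")
ties <- outer(scores_completed, scores_not, "==")
auc_pairwise <- mean(wins + 0.5 * ties)

# Same quantity from the Mann-Whitney rank-sum formula
n1 <- length(scores_completed)
n0 <- length(scores_not)
ranks <- rank(c(scores_completed, scores_not))
auc_rank <- (sum(ranks[seq_len(n1)]) - n1 * (n1 + 1) / 2) / (n1 * n0)

print(c(pairwise = auc_pairwise, rank_based = auc_rank))
```

Both calculations give the same number, which is why the AUC can be read as a probability of correct ranking.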

Conclusion:

The ROC curve provides a visual assessment of the model’s performance across different classification thresholds. The AUC quantifies this performance into a single value, making it easier to compare different models. A higher AUC value indicates a better-performing model, and in this case, it suggests that the logistic regression model is effective in predicting course completion.

Here’s how we communicate these results:

Visualizations: We use ggplot2 and pROC to create intuitive visualizations and summaries such as:

Histograms of predicted probabilities to show the likelihood of course completion.

ROC Curve to illustrate the model’s performance in distinguishing between students who complete and do not complete the course.

Confusion Matrix to evaluate the model’s classification accuracy.

Interpretation of Coefficients: We explain the coefficients of the logistic regression model in simple terms. For instance, we describe how an increase in TimeSpentOnCourse increases the probability of course completion, and why DeviceType was dropped from the final model.

Summary of Findings: We summarize the key insights from the model, highlighting the most important predictors of course completion and their practical implications. This includes discussing the impact of engagement metrics on student success.

These steps ensure that our statistical findings are conveyed clearly and effectively to a non-technical audience.