STA 631 Project: Online Course Engagement Data Analysis
Objectives
Describe probability as a foundation of statistical modeling, including inference and maximum likelihood estimation.
Determine and apply the appropriate generalized linear model for a specific data context.
Conduct model selection for a set of candidate models.
Use programming software (i.e., R) to fit and assess statistical models.
Communicate the results of statistical models to a general audience.
About Dataset
Description:
This dataset captures user engagement metrics from an online course platform, facilitating analyses on factors influencing course completion. It includes user demographics, course-specific data, and engagement metrics.
Source: Kaggle
# Load necessary libraries
library(tidyverse)
library(ggplot2)
library(caret)
library(reshape2)
library(pROC)
library(MuMIn)
Features
UserID: Unique identifier for each user
CourseCategory: Category of the course taken by the user (e.g., Programming, Business, Arts)
TimeSpentOnCourse: Total time spent by the user on the course in hours
NumberOfVideosWatched: Total number of videos watched by the user
NumberOfQuizzesTaken: Total number of quizzes taken by the user
QuizScores: Average scores achieved by the user in quizzes (percentage)
CompletionRate: Percentage of course content completed by the user
DeviceType: Type of device used by the user (0: Desktop, 1: Mobile)
CourseCompletion (Target Variable): Course completion status (0: Not Completed, 1: Completed)
# Load the data
data <- read.csv("online_course_engagement_data.csv")
# View the structure of the dataset
str(data)
## 'data.frame': 9000 obs. of 9 variables:
## $ UserID : int 5618 4326 5849 4992 3866 8650 4321 4589 4215 8089 ...
## $ CourseCategory : chr "Health" "Arts" "Arts" "Science" ...
## $ TimeSpentOnCourse : num 30 27.8 86.8 35 92.5 ...
## $ NumberOfVideosWatched: int 17 1 14 17 16 12 10 16 8 15 ...
## $ NumberOfQuizzesTaken : int 3 5 2 10 0 7 2 3 4 10 ...
## $ QuizScores : num 50.4 62.6 78.5 59.2 98.4 ...
## $ CompletionRate : num 20.9 65.6 63.8 95.4 18.1 ...
## $ DeviceType : int 1 1 1 0 0 0 1 1 0 1 ...
## $ CourseCompletion : int 0 0 1 1 0 1 0 0 1 0 ...
# Summary statistics for each variable
summary(data)
## UserID CourseCategory TimeSpentOnCourse NumberOfVideosWatched
## Min. : 1 Length:9000 Min. : 1.005 Min. : 0.00
## 1st Qu.:2252 Class :character 1st Qu.:25.441 1st Qu.: 5.00
## Median :4484 Mode :character Median :49.818 Median :10.00
## Mean :4499 Mean :50.164 Mean :10.02
## 3rd Qu.:6751 3rd Qu.:75.070 3rd Qu.:15.00
## Max. :9000 Max. :99.993 Max. :20.00
## NumberOfQuizzesTaken QuizScores CompletionRate DeviceType
## Min. : 0.000 Min. :50.01 Min. : 0.00933 Min. :0.0000
## 1st Qu.: 2.000 1st Qu.:62.28 1st Qu.:25.65361 1st Qu.:0.0000
## Median : 5.000 Median :74.74 Median :50.26412 Median :1.0000
## Mean : 5.091 Mean :74.71 Mean :50.34015 Mean :0.5007
## 3rd Qu.: 8.000 3rd Qu.:87.02 3rd Qu.:75.57249 3rd Qu.:1.0000
## Max. :10.000 Max. :99.99 Max. :99.97971 Max. :1.0000
## CourseCompletion
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.3964
## 3rd Qu.:1.0000
## Max. :1.0000
## [1] 0
Objective 1: Describe Probability as a Foundation of Statistical Modeling
In this project, probability serves as the foundation for understanding the behavior of our dataset. We start with basic descriptive statistics to examine the distributions of key variables such as TimeSpentOnCourse, QuizScores, and CourseCompletion. By visualizing these distributions through histograms and bar charts, we gain insights into the probability of different events. For instance:
Histogram of TimeSpentOnCourse: This shows the empirical distribution of the time students spend on courses. The share of students falling in each bin estimates the probability of spending that amount of time, which helps us understand typical engagement patterns.
# Visualize the distribution of time spent on course
ggplot(data, aes(x = TimeSpentOnCourse)) +
geom_histogram(binwidth = 1, fill = "blue", color = "black") +
labs(title = "Distribution of Time Spent on Course", x = "Time Spent (hours)", y = "Frequency")
Histogram of QuizScores: This displays the distribution of quiz scores among students, providing insights into the probability of achieving certain score ranges. This understanding helps in modeling student performance and its impact on course completion.
# Visualize the distribution of quiz scores
ggplot(data, aes(x = QuizScores)) +
geom_histogram(binwidth = 5, fill = "green", color = "black") +
labs(title = "Distribution of Quiz Scores", x = "Quiz Scores (%)", y = "Frequency")
Bar Chart of CourseCompletion: By visualizing the counts of students who completed or did not complete the course, we understand the overall probability of course completion within the dataset.
# Visualize the distribution of course completion status
ggplot(data, aes(x = factor(CourseCompletion))) +
geom_bar(fill = "purple", color = "black") +
labs(title = "Distribution of Course Completion Status", x = "Course Completion (0: Not Completed, 1: Completed)", y = "Count")
# Course completion by course category (count)
ggplot(data, aes(x = CourseCategory, fill = as.factor(CourseCompletion))) +
geom_bar(position = "dodge") +
labs(title = "Course Completion by Course Category", x = "Course Category", y = "Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
These initial steps lay the groundwork for more advanced statistical modeling by providing a probabilistic framework for interpreting the data.
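The overall completion probability discussed above is also the simplest example of maximum likelihood estimation. The sketch below is illustrative rather than part of the original analysis: it treats each student's completion status as a Bernoulli outcome and takes the counts from the confusion matrix reported under Objective 4 (3,568 completers out of 9,000, consistent with the CourseCompletion mean of 0.3964). The function name `log_lik` is an assumption introduced here.

```r
# Counts taken from the rows of the reported confusion matrix:
# 4666 + 766 students did not complete, 1036 + 2532 completed.
n_completed <- 1036 + 2532
n_total <- 9000

# Bernoulli log-likelihood of the data for a completion probability p.
log_lik <- function(p) {
  n_completed * log(p) + (n_total - n_completed) * log(1 - p)
}

# Closed-form maximum likelihood estimate: the sample proportion.
p_hat <- n_completed / n_total

# Numerical check: maximize the log-likelihood over (0, 1).
p_numeric <- optimize(log_lik, interval = c(0.001, 0.999), maximum = TRUE)$maximum

print(c(closed_form = p_hat, numeric = p_numeric))

# -2 times the maximized log-likelihood is the deviance of an
# intercept-only model; it should be close to the null deviance
# reported in the model summary.
print(-2 * log_lik(p_hat))
```

The same principle, choosing the parameter values that make the observed data most likely, is what glm() applies when it estimates the logistic regression coefficients.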
Objective 2: Determine and Apply the Appropriate Generalized Linear Model
To predict whether a student completes a course (CourseCompletion), we apply a logistic regression model, which is a type of generalized linear model (GLM) suited for binary outcomes. Logistic regression models the probability that a given input point belongs to a certain class (e.g., completed or not completed). Here’s how we implement and interpret the logistic regression model:
Model Specification: The logistic regression model is specified with CourseCompletion as the dependent variable and predictors including TimeSpentOnCourse, NumberOfVideosWatched, NumberOfQuizzesTaken, QuizScores, CompletionRate, and DeviceType. This choice is based on the nature of our dependent variable, which is binary.
Fitting the Model: Using the glm() function in R, we fit the logistic regression model to our data. This involves estimating the coefficients for each predictor variable to best predict the probability of course completion.
# Fit a logistic regression model
glm_model <- glm(CourseCompletion ~ TimeSpentOnCourse + NumberOfVideosWatched + NumberOfQuizzesTaken + QuizScores + CompletionRate + DeviceType,
data = data,
family = binomial)
Model Interpretation: The output of the logistic regression model includes coefficients on the log-odds scale. The sign of each coefficient gives the direction of the relationship, and exponentiating it gives the odds ratio for a one-unit increase in that predictor with the others held constant. For example, a positive coefficient for TimeSpentOnCourse means that more time spent on the course is associated with higher odds of completion.
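As a rough illustration of this interpretation, the sketch below converts the estimates from the coefficient table reported later in this report into odds ratios, then computes a predicted probability for one student. The `student` values are hypothetical and chosen only for illustration; they are not taken from the dataset.

```r
# Estimates copied from the coefficient table of the selected model.
coefs <- c(
  "(Intercept)"         = -11.841118,
  TimeSpentOnCourse     = 0.020439,
  NumberOfVideosWatched = 0.130973,
  NumberOfQuizzesTaken  = 0.307289,
  QuizScores            = 0.072361,
  CompletionRate        = 0.036382
)

# Odds ratios: multiplicative change in the odds of completion for a
# one-unit increase in each predictor, holding the others fixed.
odds_ratios <- exp(coefs[-1])
print(round(odds_ratios, 3))

# Hypothetical student (assumed values, for illustration only).
student <- c(
  TimeSpentOnCourse     = 50,
  NumberOfVideosWatched = 10,
  NumberOfQuizzesTaken  = 5,
  QuizScores            = 75,
  CompletionRate        = 50
)

# Linear predictor (log-odds), then the inverse logit gives a probability.
log_odds <- coefs[["(Intercept)"]] + sum(coefs[names(student)] * student)
prob <- plogis(log_odds)
print(prob)
```

An odds ratio slightly above 1 for TimeSpentOnCourse, for instance, means each additional hour multiplies the odds of completion by that factor; it does not add a fixed amount to the probability.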
This application of logistic regression meets the objective by appropriately modeling the relationship between student engagement metrics and course completion status.
Objective 3: Conduct Model Selection
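We conduct model selection using information criteria: AIC for stepwise selection and AICc (the small-sample corrected AIC reported by the MuMIn package) for comparing the candidate models.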
Model selection helps identify the best model that balances complexity and performance. We used stepwise selection based on AIC (Akaike Information Criterion) for this purpose.
AIC: A measure of model quality that balances goodness of fit and complexity. Lower AIC values indicate a better model.
Stepwise Selection: Using the step() function in R, we perform stepwise selection, which iteratively adds or removes predictors based on their impact on the AIC. The goal is to minimize the AIC, balancing model fit and complexity.
Model Comparison: The models are compared based on their AIC values, and the one with the lowest AIC is selected as the best model. This process helps in identifying the key predictors that significantly contribute to the model while avoiding overfitting.
The selected model, after stepwise selection, includes the most relevant predictors, ensuring that our model is both interpretable and robust.
# Model selection using AIC
selected_model <- step(glm_model, direction = "both", trace = FALSE)
# Compare the full and selected models (model.sel() ranks by AICc by default)
library(MuMIn)
model_comparison <- model.sel(glm_model, selected_model)
model_comparison
## Model selection table
## (Intrc) CmplR DvcTy NmOQT NmOVW QzScr TmSOC df
## selected_model -11.84 0.03638 0.3073 0.131 0.07236 0.02044 6
## glm_model -11.84 0.03638 0.006056 0.3073 0.131 0.07236 0.02044 7
## logLik AICc delta weight
## selected_model -3973.580 7959.2 0.00 0.73
## glm_model -3973.574 7961.2 1.99 0.27
## Models ranked by AICc(x)
# Summarize the selected model
summary(selected_model)
##
## Call:
## glm(formula = CourseCompletion ~ TimeSpentOnCourse + NumberOfVideosWatched +
## NumberOfQuizzesTaken + QuizScores + CompletionRate, family = binomial,
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.841118 0.261989 -45.20 <2e-16 ***
## TimeSpentOnCourse 0.020439 0.001018 20.07 <2e-16 ***
## NumberOfVideosWatched 0.130973 0.005002 26.18 <2e-16 ***
## NumberOfQuizzesTaken 0.307289 0.009907 31.02 <2e-16 ***
## QuizScores 0.072361 0.002220 32.59 <2e-16 ***
## CompletionRate 0.036382 0.001098 33.12 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12087.8 on 8999 degrees of freedom
## Residual deviance: 7947.2 on 8994 degrees of freedom
## AIC: 7959.2
##
## Number of Fisher Scoring iterations: 5
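The residual deviance (7947.2 on 8994 degrees of freedom) is far below the null deviance (12087.8 on 8999 degrees of freedom), so the predictors explain a substantial share of the variation in course completion. The selected model's AICc (7959.2) is about 2 points lower than the full model's (7961.2). Because the two log-likelihoods are nearly identical (-3973.580 vs. -3973.574), this difference reflects the removal of one parameter, DeviceType, rather than a better fit.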
The stepwise selection process retained five predictors: TimeSpentOnCourse, NumberOfVideosWatched, NumberOfQuizzesTaken, QuizScores, and CompletionRate. Each is statistically significant (p < 2e-16) with a positive coefficient, while DeviceType was dropped. The final model is therefore slightly simpler than the full model with essentially the same fit, which keeps it interpretable and reduces the risk of overfitting.
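To make the criteria concrete, the following sketch recomputes AIC, AICc, the AICc differences, and the Akaike weights from the log-likelihoods and parameter counts in the model selection table above. It uses only base R and the reported values; the data frame `models` and the vector `rel_lik` are names introduced here for illustration.

```r
# Values copied from the model selection table (n = 9000 observations).
n <- 9000
models <- data.frame(
  model  = c("selected_model", "glm_model"),
  k      = c(6, 7),                    # number of estimated parameters (df)
  logLik = c(-3973.580, -3973.574)
)

# AIC = 2k - 2 logLik; AICc adds a small-sample correction term.
models$AIC  <- 2 * models$k - 2 * models$logLik
models$AICc <- models$AIC + 2 * models$k * (models$k + 1) / (n - models$k - 1)

# Differences from the best (lowest) AICc and the resulting Akaike weights.
models$delta <- models$AICc - min(models$AICc)
rel_lik <- exp(-0.5 * models$delta)
models$weight <- rel_lik / sum(rel_lik)

print(models)
```

With 9,000 observations the correction term is tiny, so AICc and AIC agree to one decimal place here.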
Objective 4: Use Programming Software (R) to Fit and Assess Statistical Models
Throughout the project, we leverage R to handle various aspects of data analysis, from data preprocessing to model fitting and assessment.
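# Predict probabilities of course completion (from the full model, glm_model)
data$PredictedProb <- predict(glm_model, type = "response")
# Create a confusion matrix at a 0.5 threshold
# (rows: actual status, columns: predicted probability > 0.5)
table(data$CourseCompletion, data$PredictedProb > 0.5)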
##
## FALSE TRUE
## 0 4666 766
## 1 1036 2532
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.8697
Here’s how we utilize R:
Data Loading and Preprocessing: We use read.csv() to load the dataset and tidyverse for data manipulation. Functions like summary() and str() help in understanding the dataset structure.
Descriptive Statistics and Visualization: Libraries like ggplot2 are used to create visualizations that depict the distributions of key variables and relationships between them.
Model Fitting: The glm() function is employed to fit the logistic regression model. This involves specifying the formula and family (binomial) appropriate for our binary outcome variable.
Model Selection: Using the step() function, we perform stepwise model selection to identify the best model based on AIC.
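Model Assessment: We assess model performance through the ROC curve and AUC using the pROC package. A confusion matrix at a 0.5 probability threshold (built here with table(); the confusionMatrix() function from the caret package offers a more detailed alternative) shows classification accuracy, which is about 80% for this model.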
By using R, we demonstrate proficiency in applying statistical software to perform comprehensive data analysis and modeling tasks, meeting the course objective of fitting and assessing statistical models using programming software.
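The confusion matrix reported above can be summarized with a few standard metrics. The sketch below recomputes them in base R from the four reported counts; the object names (`cm`, `metrics`) are introduced here for illustration and are not from the original code.

```r
# Counts copied from the confusion matrix
# (rows: actual status, columns: predicted probability > 0.5).
cm <- matrix(
  c(4666,  766,
    1036, 2532),
  nrow = 2, byrow = TRUE,
  dimnames = list(actual = c("0", "1"), predicted = c("FALSE", "TRUE"))
)

tn <- cm["0", "FALSE"]  # non-completers predicted not to complete
fp <- cm["0", "TRUE"]   # non-completers predicted to complete
fn <- cm["1", "FALSE"]  # completers predicted not to complete
tp <- cm["1", "TRUE"]   # completers predicted to complete

metrics <- c(
  accuracy    = (tp + tn) / sum(cm),  # share of all students classified correctly
  sensitivity = tp / (tp + fn),       # share of completers identified
  specificity = tn / (tn + fp),       # share of non-completers identified
  precision   = tp / (tp + fp)        # share of predicted completers who completed
)
print(round(metrics, 3))
```

These values depend on the 0.5 threshold; a different cutoff would trade sensitivity against specificity.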
Objective 5: Communicate Results
Effectively communicating the results of statistical models involves visualizations and interpretations that are accessible to a general audience.
# Visualize predicted probabilities
ggplot(data, aes(x = PredictedProb, fill = factor(CourseCompletion))) +
  geom_histogram(binwidth = 0.1, position = "dodge", color = "white") +
  labs(title = "Predicted Probabilities of Course Completion", x = "Predicted Probability", y = "Count", fill = "Course Completion")
Insights from Visualizations:
Time Spent on Course and Quiz Scores: Significant predictors of course completion, as more time spent and higher quiz scores are associated with higher completion rates.
Course Category: Completion counts differ across course categories in the bar chart, suggesting that course type might influence completion. CourseCategory was not included in the model, so this remains untested.
Summary and Interpretation:
Descriptive Statistics: Provided an overview of key variables like Time Spent on Course and Quiz Scores.
GLM Application: Applied a logistic regression model to predict course completion based on engagement metrics.
Model Selection: Used stepwise selection based on AIC and compared the candidate models by AICc to select the best-fit model.
Model Assessment: Evaluated model performance using predicted probabilities and ROC curve analysis.
Time Spent on Course: Significant positive predictor of course completion.
Number of Videos Watched: Also positively associated with course completion.
Quiz Scores: Higher quiz scores increase the likelihood of completing the course.
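Device Type: Not a meaningful predictor. Its estimated coefficient in the full model was close to zero (0.006), and stepwise selection dropped it, so the data show no clear difference in completion between desktop and mobile users.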
Interpretation of the ROC Curve:
AUC Value:
Here, the AUC (auc_value) is 0.8697, which indicates that the model has a good ability to distinguish between students who complete the course and those who do not.
An AUC value of 0.8697 means that there is an 86.97% chance that the model will rank a randomly chosen student who completed the course higher than a randomly chosen student who did not.
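This ranking interpretation can be checked directly. The sketch below uses simulated data, not the course dataset, and base R only: it computes the AUC once as the share of (completer, non-completer) pairs in which the completer has the higher score, and once with the equivalent rank-based (Mann-Whitney) formula. All variable names and the simulation settings are assumptions made for illustration.

```r
# Simulated data for illustration only; this is NOT the course dataset.
set.seed(631)
n <- 500
completed <- rbinom(n, size = 1, prob = 0.4)
# Hypothetical predicted probabilities: completers tend to score higher.
score <- plogis(rnorm(n, mean = ifelse(completed == 1, 1, -1)))

pos <- score[completed == 1]
neg <- score[completed == 0]

# AUC as the share of (completer, non-completer) pairs ranked correctly;
# tied pairs count as one half.
pairs <- outer(pos, neg, FUN = "-")
auc_pairs <- mean((pairs > 0) + 0.5 * (pairs == 0))

# Equivalent rank-based (Mann-Whitney) form.
r <- rank(score)
auc_rank <- (sum(r[completed == 1]) - length(pos) * (length(pos) + 1) / 2) /
  (length(pos) * length(neg))

print(c(pairwise = auc_pairs, rank_based = auc_rank))
```

The two numbers agree, which is why the AUC can be read as the probability that a randomly chosen completer is ranked above a randomly chosen non-completer.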
Conclusion:
The ROC curve provides a visual assessment of the model’s performance across different classification thresholds. The AUC quantifies this performance into a single value, making it easier to compare different models. A higher AUC value indicates a better-performing model, and in this case, it suggests that the logistic regression model is effective in predicting course completion.
Here’s how we communicate these results:
Visualizations: We use ggplot2 to create intuitive visualizations such as:
Histograms of predicted probabilities to show the likelihood of course completion.
ROC Curve to illustrate the model’s performance in distinguishing between students who complete and do not complete the course.
Confusion Matrix to evaluate the model’s classification accuracy.
Interpretation of Coefficients: We explain the coefficients of the logistic regression model in simple terms. For instance, we describe how an increase in TimeSpentOnCourse increases the probability of course completion, and we note that DeviceType showed no meaningful effect and was dropped during model selection.
Summary of Findings: We summarize the key insights from the model, highlighting the most important predictors of course completion and their practical implications. This includes discussing the impact of engagement metrics on student success.
These steps ensure that our statistical findings are conveyed clearly and effectively to a non-technical audience.