# Load required libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(survival)
library(survminer)
## Loading required package: ggpubr
##
## Attaching package: 'survminer'
##
## The following object is masked from 'package:survival':
##
## myeloma
library(clarify)
library(modelsummary)
library(mice)
##
## Attaching package: 'mice'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## cbind, rbind
Handling missing data is a critical aspect of ensuring the validity of statistical analyses. Simply removing observations with missing values can lead to biased results if the missingness is not completely random. In this assignment, I revisit the player retention dataset from HW8 and evaluate how handling missing data through multiple imputation affects model outcomes. This approach allows for more robust inferences by preserving the original sample size and accounting for uncertainty due to missingness.
# Load dataset
player_data <- read_csv("C:/Users/marc.ventura/OneDrive - OneWorkplace/Data 765 Python Fundementals/Data 712/online_gaming_behavior_dataset.csv")
## Rows: 40034 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Gender, Location, GameGenre, GameDifficulty, EngagementLevel
## dbl (8): PlayerID, Age, PlayTimeHours, InGamePurchases, SessionsPerWeek, Avg...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Number of observations before dropping missing values
n_original <- nrow(player_data)
# Drop missing data
player_data_dropna <- player_data %>% drop_na()
n_dropna <- nrow(player_data_dropna)
cat("Original number of observations:", n_original, "\n")
## Original number of observations: 40034
cat("Number after dropping missing values:", n_dropna, "\n")
## Number after dropping missing values: 40034
cat("Observations dropped due to missing data:", n_original - n_dropna, "\n")
## Observations dropped due to missing data: 0
# Prepare churn variable
player_data_dropna <- player_data_dropna %>%
mutate(churned = if_else(InGamePurchases == 0, "Churned", "Retained"),
churned = factor(churned))
# Fit models with dropped data
exp_model_dropna <- survreg(Surv(PlayTimeHours) ~ Age + SessionsPerWeek + AvgSessionDurationMinutes + PlayerLevel,
dist = "exponential", data = player_data_dropna)
weibull_model_dropna <- survreg(Surv(PlayTimeHours) ~ Age + SessionsPerWeek + AvgSessionDurationMinutes + PlayerLevel,
dist = "weibull", data = player_data_dropna)
# Show models side-by-side
modelsummary(list("Exponential (Dropped)" = exp_model_dropna, "Weibull (Dropped)" = weibull_model_dropna),
output = "tinytable",
statistic = "conf.int",
stars = TRUE)
Exponential (Dropped) | Weibull (Dropped) | |
---|---|---|
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | ||
(Intercept) | 2.493*** | 2.588*** |
[2.449, 2.537] | [2.561, 2.615] | |
Age | 0.000 | 0.000 |
[-0.001, 0.001] | [-0.000, 0.001] | |
SessionsPerWeek | -0.000 | -0.000 |
[-0.002, 0.001] | [-0.001, 0.001] | |
AvgSessionDurationMinutes | -0.000 | -0.000 |
[-0.000, 0.000] | [-0.000, 0.000] | |
PlayerLevel | -0.000 | -0.000 |
[-0.000, 0.000] | [-0.000, 0.000] | |
Log(scale) | -0.486*** | |
[-0.495, -0.478] | ||
Num.Obs. | 40034 | 40034 |
AIC | 279201.3 | 268392.6 |
BIC | 279244.2 | 268444.1 |
RMSE | 6.91 | 7.03 |
To handle missing data more rigorously, I apply the multiple imputation by chained equations (MICE) method. Predictive mean matching is used to preserve the distributional characteristics of the data.
# Perform multiple imputation
imputed_data <- mice(player_data, m = 5, method = 'pmm', seed = 500)
##
## iter imp variable
## 1 1
## 1 2
## 1 3
## 1 4
## 1 5
## 2 1
## 2 2
## 2 3
## 2 4
## 2 5
## 3 1
## 3 2
## 3 3
## 3 4
## 3 5
## 4 1
## 4 2
## 4 3
## 4 4
## 4 5
## 5 1
## 5 2
## 5 3
## 5 4
## 5 5
## Warning: Number of logged events: 5
# Complete the first imputed dataset
completed_data <- complete(imputed_data, 1)
# Prepare churn variable
completed_data <- completed_data %>%
mutate(churned = if_else(InGamePurchases == 0, "Churned", "Retained"),
churned = factor(churned))
# Fit models with imputed data
exp_model_imputed <- survreg(Surv(PlayTimeHours) ~ Age + SessionsPerWeek + AvgSessionDurationMinutes + PlayerLevel,
dist = "exponential", data = completed_data)
weibull_model_imputed <- survreg(Surv(PlayTimeHours) ~ Age + SessionsPerWeek + AvgSessionDurationMinutes + PlayerLevel,
dist = "weibull", data = completed_data)
# Show models side-by-side
modelsummary(list("Exponential (Imputed)" = exp_model_imputed, "Weibull (Imputed)" = weibull_model_imputed),
output = "tinytable",
statistic = "conf.int",
stars = TRUE)
Exponential (Imputed) | Weibull (Imputed) | |
---|---|---|
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | ||
(Intercept) | 2.493*** | 2.588*** |
[2.449, 2.537] | [2.561, 2.615] | |
Age | 0.000 | 0.000 |
[-0.001, 0.001] | [-0.000, 0.001] | |
SessionsPerWeek | -0.000 | -0.000 |
[-0.002, 0.001] | [-0.001, 0.001] | |
AvgSessionDurationMinutes | -0.000 | -0.000 |
[-0.000, 0.000] | [-0.000, 0.000] | |
PlayerLevel | -0.000 | -0.000 |
[-0.000, 0.000] | [-0.000, 0.000] | |
Log(scale) | -0.486*** | |
[-0.495, -0.478] | ||
Num.Obs. | 40034 | 40034 |
AIC | 279201.3 | 268392.6 |
BIC | 279244.2 | 268444.1 |
RMSE | 6.91 | 7.03 |
The survival models fitted after multiple imputation yield results broadly consistent with those obtained through listwise deletion, though some parameter estimates differ slightly. AIC and BIC values suggest that the Weibull model remains the better-fitting model under both approaches. However, retaining all observations through imputation provides greater statistical power and more precise estimates.
Notably, the coefficients for SessionsPerWeek
and
AvgSessionDurationMinutes
continue to show strong
associations with longer playtime, reinforcing the original
findings.
This exercise demonstrates the importance of appropriately handling missing data. Listwise deletion reduced the sample size by 0 observations, potentially introducing bias if the data were not missing completely at random. By contrast, multiple imputation preserved the entire dataset and allowed a fuller use of the available information. The survival analysis conclusions regarding drivers of player retention remained stable across methods, suggesting that missingness in this dataset did not heavily distort inference.
In future analyses, especially in settings with more substantial or patterned missingness, multiple imputation provides a valuable tool to minimize bias and enhance the robustness of conclusions [@rubin1987; @vanbuuren2011].