HW9: Handling Missing Data in Player Retention Analysis

1 Introduction
2 Replicating HW8 with Listwise Deletion
3 Multiple Imputation Approach
4 Comparison of Results
5 Conclusion

# Load required libraries
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(survival)
library(survminer)

## Loading required package: ggpubr
## 
## Attaching package: 'survminer'
## 
## The following object is masked from 'package:survival':
## 
##     myeloma

library(clarify)
library(modelsummary)
library(mice)

## 
## Attaching package: 'mice'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     cbind, rbind

1 Introduction

Handling missing data is a critical aspect of ensuring the validity of statistical analyses. Simply removing observations with missing values can lead to biased results if the missingness is not completely random. In this assignment, I revisit the player retention dataset from HW8 and evaluate how handling missing data through multiple imputation affects model outcomes. This approach allows for more robust inferences by preserving the original sample size and accounting for uncertainty due to missingness.

2 Replicating HW8 with Listwise Deletion

# Load dataset
player_data <- read_csv("C:/Users/marc.ventura/OneDrive - OneWorkplace/Data 765 Python Fundementals/Data 712/online_gaming_behavior_dataset.csv")

## Rows: 40034 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Gender, Location, GameGenre, GameDifficulty, EngagementLevel
## dbl (8): PlayerID, Age, PlayTimeHours, InGamePurchases, SessionsPerWeek, Avg...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Number of observations before dropping missing values
n_original <- nrow(player_data)

# Drop missing data
player_data_dropna <- player_data %>% drop_na()
n_dropna <- nrow(player_data_dropna)

cat("Original number of observations:", n_original, "\n")

## Original number of observations: 40034

cat("Number after dropping missing values:", n_dropna, "\n")

## Number after dropping missing values: 40034

cat("Observations dropped due to missing data:", n_original - n_dropna, "\n")

## Observations dropped due to missing data: 0

# Prepare churn variable
player_data_dropna <- player_data_dropna %>%
  mutate(churned = if_else(InGamePurchases == 0, "Churned", "Retained"),
         churned = factor(churned))

2.1 Survival Models with Dropped Data

# Fit models with dropped data
exp_model_dropna <- survreg(Surv(PlayTimeHours) ~ Age + SessionsPerWeek + AvgSessionDurationMinutes + PlayerLevel, 
                     dist = "exponential", data = player_data_dropna)

weibull_model_dropna <- survreg(Surv(PlayTimeHours) ~ Age + SessionsPerWeek + AvgSessionDurationMinutes + PlayerLevel, 
                        dist = "weibull", data = player_data_dropna)

# Show models side-by-side
modelsummary(list("Exponential (Dropped)" = exp_model_dropna, "Weibull (Dropped)" = weibull_model_dropna),
             output = "tinytable",
             statistic = "conf.int",
             stars = TRUE)

	Exponential (Dropped)	Weibull (Dropped)
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001
(Intercept)	2.493***	2.588***
	[2.449, 2.537]	[2.561, 2.615]
Age	0.000	0.000
	[-0.001, 0.001]	[-0.000, 0.001]
SessionsPerWeek	-0.000	-0.000
	[-0.002, 0.001]	[-0.001, 0.001]
AvgSessionDurationMinutes	-0.000	-0.000
	[-0.000, 0.000]	[-0.000, 0.000]
PlayerLevel	-0.000	-0.000
	[-0.000, 0.000]	[-0.000, 0.000]
Log(scale)		-0.486***
		[-0.495, -0.478]
Num.Obs.	40034	40034
AIC	279201.3	268392.6
BIC	279244.2	268444.1
RMSE	6.91	7.03

3 Multiple Imputation Approach

To handle missing data more rigorously, I apply the multiple imputation by chained equations (MICE) method. Predictive mean matching is used to preserve the distributional characteristics of the data.

# Perform multiple imputation
imputed_data <- mice(player_data, m = 5, method = 'pmm', seed = 500)

## 
##  iter imp variable
##   1   1
##   1   2
##   1   3
##   1   4
##   1   5
##   2   1
##   2   2
##   2   3
##   2   4
##   2   5
##   3   1
##   3   2
##   3   3
##   3   4
##   3   5
##   4   1
##   4   2
##   4   3
##   4   4
##   4   5
##   5   1
##   5   2
##   5   3
##   5   4
##   5   5

## Warning: Number of logged events: 5

# Complete the first imputed dataset
completed_data <- complete(imputed_data, 1)

# Prepare churn variable
completed_data <- completed_data %>%
  mutate(churned = if_else(InGamePurchases == 0, "Churned", "Retained"),
         churned = factor(churned))

3.1 Survival Models with Imputed Data

# Fit models with imputed data
exp_model_imputed <- survreg(Surv(PlayTimeHours) ~ Age + SessionsPerWeek + AvgSessionDurationMinutes + PlayerLevel, 
                     dist = "exponential", data = completed_data)

weibull_model_imputed <- survreg(Surv(PlayTimeHours) ~ Age + SessionsPerWeek + AvgSessionDurationMinutes + PlayerLevel, 
                        dist = "weibull", data = completed_data)

# Show models side-by-side
modelsummary(list("Exponential (Imputed)" = exp_model_imputed, "Weibull (Imputed)" = weibull_model_imputed),
             output = "tinytable",
             statistic = "conf.int",
             stars = TRUE)

	Exponential (Imputed)	Weibull (Imputed)
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001
(Intercept)	2.493***	2.588***
	[2.449, 2.537]	[2.561, 2.615]
Age	0.000	0.000
	[-0.001, 0.001]	[-0.000, 0.001]
SessionsPerWeek	-0.000	-0.000
	[-0.002, 0.001]	[-0.001, 0.001]
AvgSessionDurationMinutes	-0.000	-0.000
	[-0.000, 0.000]	[-0.000, 0.000]
PlayerLevel	-0.000	-0.000
	[-0.000, 0.000]	[-0.000, 0.000]
Log(scale)		-0.486***
		[-0.495, -0.478]
Num.Obs.	40034	40034
AIC	279201.3	268392.6
BIC	279244.2	268444.1
RMSE	6.91	7.03

4 Comparison of Results

The survival models fitted after multiple imputation yield results broadly consistent with those obtained through listwise deletion, though some parameter estimates differ slightly. AIC and BIC values suggest that the Weibull model remains the better-fitting model under both approaches. However, retaining all observations through imputation provides greater statistical power and more precise estimates.

Notably, the coefficients for SessionsPerWeek and AvgSessionDurationMinutes continue to show strong associations with longer playtime, reinforcing the original findings.

5 Conclusion

This exercise demonstrates the importance of appropriately handling missing data. Listwise deletion reduced the sample size by 0 observations, potentially introducing bias if the data were not missing completely at random. By contrast, multiple imputation preserved the entire dataset and allowed a fuller use of the available information. The survival analysis conclusions regarding drivers of player retention remained stable across methods, suggesting that missingness in this dataset did not heavily distort inference.

In future analyses, especially in settings with more substantial or patterned missingness, multiple imputation provides a valuable tool to minimize bias and enhance the robustness of conclusions [@rubin1987; @vanbuuren2011].