0.1 Introduction
0.2 Social Theory Framing: Destiny as a Case Study
0.3 Missing Data Handling
0.4 Multiple Imputation
0.5 Descriptive Statistics
0.6 Survival Model: Predicting Time Spent In-Game
0.7 Simulation-Based Interpretation
0.8 Logistic Regression: Predicting Purchase-Based Retention
0.9 Predicted Retention by Session Frequency
0.10 Conclusion
1 References

0.1 Introduction

Understanding why players stay engaged in multiplayer games is central to improving game design, fostering community, and driving monetization. This project explores how behavioral patterns influence retention, drawing from statistical methods and social theory covered throughout the semester.

Our analysis uses a subset of the Predict Online Gaming Behavior dataset and is conceptually grounded in the Destiny video game study (Niebles & Mahajan, 2017), which found that social connection—through clans and repeated co-play—significantly predicted retention. While our dataset lacks explicit network information, we interpret frequent sessions and long session durations as behavioral proxies for social embeddedness.

We apply the following methods:
- Multiple imputation to handle missing values using predictive mean matching (MICE),
- Parametric survival analysis with an exponential model to estimate time-to-churn (PlayTimeHours),
- Logistic regression to model purchase-based retention (InGamePurchases),
- Simulation-based interpretation to understand the substantive effects of key predictors.

Key outcomes:
- PlayTimeHours: Total hours played, interpreted as time to churn,
- InGamePurchases: Whether the player made an in-game purchase, used as a proxy for retention.

Key predictors:
- SessionsPerWeek: Frequency of play, interpreted as habitual or socially motivated behavior,
- AvgSessionDurationMinutes: Depth of engagement,
- PlayerLevel: Progression,
- Age: Demographic control.

This project applies both technical and theoretical lenses to investigate whether behavioral engagement reflects deeper social mechanisms that influence player longevity.

0.3 Missing Data Handling

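library(tidyverse)  # read_csv(), select(), ggplot()
library(naniar)     # miss_var_summary(), vis_miss()

df <- read_csv("online_gaming_behavior_dataset.csv")
miss_var_summary(df)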

vis_miss(df)

The dataset was assessed for missing values using both miss_var_summary() and vis_miss(). These tools help identify not only how much data is missing in each column but also whether missingness follows any detectable patterns.

The summary table shows that none of the 13 variables in the dataset contain missing values — each has n_miss = 0 and pct_miss = 0. This includes all critical variables, such as:

Demographics: Age, Gender, Location
Behavioral metrics: SessionsPerWeek, AvgSessionDurationMinutes, PlayerLevel, AchievementsUnlocked
Outcome variables: PlayTimeHours and InGamePurchases

The accompanying visualization from vis_miss() reinforces this finding. All variables are displayed with a single solid gray bar, indicating 100% completeness across all ~40,000 observations.

0.3.1 Interpretation:

There is no missing data in this dataset. This means:
- No need for imputation or listwise deletion.
- Full sample size is retained for all analyses.
- No risk of bias or information loss due to incomplete cases.

This clean structure ensures that the results from all models will be based on a consistent and complete dataset.
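As a package-free cross-check, the same per-column counts can be reproduced in base R. The sketch below uses a small made-up data frame (`toy`, not the gaming dataset) with one deliberately missing value, so its output is illustrative only.

```r
# Hypothetical toy data: column names echo the dataset, values are made up.
toy <- data.frame(
  Age = c(21, 34, NA, 45),
  SessionsPerWeek = c(3, 9, 12, 7),
  PlayTimeHours = c(5.2, 11.8, 17.1, 2.4)
)

# Count and percentage of missing values per column,
# analogous to n_miss and pct_miss in miss_var_summary().
miss_table <- data.frame(
  variable = names(toy),
  n_miss = colSums(is.na(toy)),
  pct_miss = 100 * colMeans(is.na(toy)),
  row.names = NULL
)
print(miss_table)

# TRUE only when no cell in the data frame is missing.
is_complete <- !anyNA(toy)
```

Running the same two `colSums()`/`colMeans()` calls on `df` should return zeros in every column if the summary above is correct.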

0.4 Multiple Imputation

library(mice)  # mice(), complete()
imputed_data <- mice(df, m = 5, maxit = 10, method = 'pmm', seed = 2025)

## 
##  iter imp variable
##   1   1
##   1   2
##   1   3
##   1   4
##   1   5
##   2   1
##   2   2
##   2   3
##   2   4
##   2   5
##   3   1
##   3   2
##   3   3
##   3   4
##   3   5
##   4   1
##   4   2
##   4   3
##   4   4
##   4   5
##   5   1
##   5   2
##   5   3
##   5   4
##   5   5
##   6   1
##   6   2
##   6   3
##   6   4
##   6   5
##   7   1
##   7   2
##   7   3
##   7   4
##   7   5
##   8   1
##   8   2
##   8   3
##   8   4
##   8   5
##   9   1
##   9   2
##   9   3
##   9   4
##   9   5
##   10   1
##   10   2
##   10   3
##   10   4
##   10   5

# First of the five completed datasets; mice:: avoids a clash with tidyr::complete()
df_imputed <- mice::complete(imputed_data, 1)
sum(is.na(df_imputed))

## [1] 0

0.4.1 Why We’re Doing This (Even Without Missing Data)

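While our dataset is fully complete, we include this step to illustrate multiple imputation, a technique commonly used to address missing data in real-world analyses. Because no values are missing, mice() has nothing to impute (the iteration log above lists no variables), so df_imputed contains the same values as df.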

Using Predictive Mean Matching (PMM) via the mice package, missing values (if present) would be filled in based on similar observed cases. This preserves the variable’s distribution and reduces bias compared to simpler methods like mean imputation or listwise deletion.

Including this here shows preparedness for handling incomplete data responsibly, a critical skill in applied data analysis — especially when modeling human behavior where gaps are common.
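To make the donor-matching idea concrete, here is a minimal base-R sketch of a single PMM pass on made-up data. It is a simplification and not what `mice()` runs internally: the real algorithm also draws the regression coefficients from their posterior and repeats the process across `m` datasets. The data frame `toy` and the function `pmm_once()` are hypothetical names introduced for illustration.

```r
set.seed(2025)

# Toy data (hypothetical): y depends on x, and 40 of 200 y values are removed.
n <- 200
toy <- data.frame(x = rnorm(n))
toy$y <- 2 + 3 * toy$x + rnorm(n)
toy$y[sample(n, 40)] <- NA

# One simplified PMM pass for the incomplete variable `y`.
pmm_once <- function(data, k = 5) {
  miss <- is.na(data$y)
  fit <- lm(y ~ x, data = data[!miss, ])  # fit on observed cases only
  pred <- predict(fit, newdata = data)    # predicted mean for every row
  donor_pred <- pred[!miss]
  donor_y <- data$y[!miss]
  out <- data$y
  for (i in which(miss)) {
    # The k observed cases whose predicted mean is closest to row i's.
    nearest <- order(abs(donor_pred - pred[i]))[seq_len(k)]
    # Borrow the observed value of one randomly chosen donor.
    out[i] <- donor_y[nearest[sample.int(k, 1)]]
  }
  out
}

y_imputed <- pmm_once(toy)
```

Because every filled-in value is copied from an observed case, imputed values can never fall outside the observed range, which is the property the paragraph above refers to as preserving the variable's distribution.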

0.5 Descriptive Statistics

summary(select(df_imputed, PlayTimeHours, SessionsPerWeek, AvgSessionDurationMinutes, PlayerLevel, Age))

##  PlayTimeHours       SessionsPerWeek  AvgSessionDurationMinutes  PlayerLevel   
##  Min.   : 0.000115   Min.   : 0.000   Min.   : 10.00            Min.   : 1.00  
##  1st Qu.: 6.067501   1st Qu.: 4.000   1st Qu.: 52.00            1st Qu.:25.00  
##  Median :12.008002   Median : 9.000   Median : 95.00            Median :49.00  
##  Mean   :12.024365   Mean   : 9.472   Mean   : 94.79            Mean   :49.66  
##  3rd Qu.:17.963831   3rd Qu.:14.000   3rd Qu.:137.00            3rd Qu.:74.00  
##  Max.   :23.999592   Max.   :19.000   Max.   :179.00            Max.   :99.00  
##       Age       
##  Min.   :15.00  
##  1st Qu.:23.00  
##  Median :32.00  
##  Mean   :31.99  
##  3rd Qu.:41.00  
##  Max.   :49.00

The summary statistics offer valuable insight into the distribution and variability of the key variables included in our retention models.

PlayTimeHours: The minimum is nearly zero, while the maximum is ~24 hours. The mean and median (~12 hours) are nearly identical, indicating a relatively symmetrical distribution. This suggests that while some players churn quickly, others stay engaged for a full day of cumulative playtime — useful context when modeling survival.
SessionsPerWeek: Ranges from 0 to 19, with a mean of 9.47 and median of 9. The quartiles (4, 9, 14) sit evenly around the median, so the distribution looks roughly symmetric rather than skewed. About a quarter of players log 14 or more sessions per week; this subgroup may represent highly engaged or habitual users.
AvgSessionDurationMinutes: Ranges from 10 to 179 minutes. With a mean of ~95 and a median also around 95, this variable is evenly distributed, meaning that players vary in how long they play per session, but there is no strong skew.
PlayerLevel: Spans from 1 to 99 with a mean and median around 49–50. The quartile spread is consistent, which suggests a fairly balanced distribution of progression levels — from new players to nearly maxed-out users.
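Age: Ranges from 15 to 49 years old, with a mean and median around 32. The evenly spaced quartiles (23, 32, 41) indicate a symmetric spread across the range rather than a concentration in one age band. Symmetry alone does not establish normality, and the sample includes some players under 18, so generalizations to other gaming populations should be made with care.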
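One quick way to check statements about symmetry is the quartile (Bowley) skewness coefficient, which needs only the three quartiles. The sketch below copies the quartiles from the summary() output above; the data frame name `quartiles` is introduced here for illustration.

```r
# Quartiles copied from the summary() output above.
quartiles <- data.frame(
  variable = c("PlayTimeHours", "SessionsPerWeek",
               "AvgSessionDurationMinutes", "PlayerLevel", "Age"),
  q1 = c(6.067501, 4, 52, 25, 23),
  q2 = c(12.008002, 9, 95, 49, 32),
  q3 = c(17.963831, 14, 137, 74, 41)
)

# Bowley skewness: 0 means the quartiles are symmetric around the median,
# positive values suggest right skew, negative values suggest left skew.
quartiles$bowley <- with(quartiles, (q3 + q1 - 2 * q2) / (q3 - q1))
print(quartiles)
```

All five coefficients come out at or very near zero, which points to symmetric distributions. This measure only looks at the middle half of the data, so a histogram remains the better check for behavior in the tails.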

0.5.1 Summary:

Each of the selected variables shows sufficient variability for predictive modeling. Notably:
- SessionsPerWeek and PlayTimeHours capture differences in player engagement patterns.
- AvgSessionDurationMinutes and PlayerLevel represent depth of gameplay.
- Age is not overly concentrated, reducing the risk of confounding by age group.

These distributions validate our choice to include these features in both the survival and logistic models.

0.6 Survival Model: Predicting Time Spent In-Game

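library(survival)  # Surv(), survreg()

# event = 1 for every row: each player is treated as having churned (no censoring)
surv_obj <- Surv(df_imputed$PlayTimeHours, event = rep(1, nrow(df_imputed)))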
surv_model <- survreg(surv_obj ~ SessionsPerWeek + AvgSessionDurationMinutes + PlayerLevel + Age, 
                      data = df_imputed, dist = "exponential")
summary(surv_model)

## 
## Call:
## survreg(formula = surv_obj ~ SessionsPerWeek + AvgSessionDurationMinutes + 
##     PlayerLevel + Age, data = df_imputed, dist = "exponential")
##                               Value Std. Error      z      p
## (Intercept)                2.49e+00   2.26e-02 110.36 <2e-16
## SessionsPerWeek           -3.61e-04   8.67e-04  -0.42   0.68
## AvgSessionDurationMinutes -2.26e-05   1.02e-04  -0.22   0.82
## PlayerLevel               -1.03e-04   1.75e-04  -0.59   0.56
## Age                        1.43e-04   4.98e-04   0.29   0.77
## 
## Scale fixed at 1 
## 
## Exponential distribution
## Loglik(model)= -139595.6   Loglik(intercept only)= -139596
##  Chisq= 0.65 on 4 degrees of freedom, p= 0.96 
## Number of Newton-Raphson Iterations: 4 
## n= 40034

0.6.1 Interpretation

This model uses an exponential distribution, which assumes a constant hazard rate (i.e., constant risk of churn over time). The goal is to estimate how each behavioral or demographic factor influences the total time a player remains engaged (PlayTimeHours).

0.6.1.1 Coefficient Interpretations:

SessionsPerWeek (-0.00036, p = 0.68):
The coefficient is negative, suggesting that more frequent sessions are actually associated with slightly shorter playtime — the opposite of our hypothesis. However, the effect is very small and statistically insignificant, meaning we cannot reliably conclude this relationship exists in the population.
AvgSessionDurationMinutes (-0.000023, p = 0.82):
Also negative and non-significant. This implies that longer average session lengths do not meaningfully predict longer overall playtime when holding other variables constant.
PlayerLevel (-0.000103, p = 0.56):
Despite our expectation that more experienced players would remain longer, this coefficient is negative and non-significant. This suggests that progression alone does not extend playtime in a straightforward, linear way.
Age (0.000143, p = 0.77):
The only positive coefficient, but again not significant. Any effect of age on playtime is minimal and likely due to chance in this model.
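Because survreg() models the log of time, each coefficient is easier to read after exponentiating: exp(coef) is the multiplicative change in expected playtime per one-unit increase in the predictor. The sketch below redoes that arithmetic from the rounded coefficients printed above, so the results are approximate.

```r
# Coefficients copied (as rounded) from the survreg summary above.
intercept <- 2.49
coefs <- c(
  SessionsPerWeek = -3.61e-04,
  AvgSessionDurationMinutes = -2.26e-05,
  PlayerLevel = -1.03e-04,
  Age = 1.43e-04
)

# Multiplicative effect on expected playtime per one-unit increase.
time_ratio <- exp(coefs)

# The same effect expressed as a percentage change.
pct_change <- 100 * (time_ratio - 1)

# Expected hours when every predictor equals zero.
baseline_hours <- exp(intercept)

# Ratio of expected playtime at 20 versus 1 session per week (19-unit gap).
ratio_20_vs_1 <- exp(19 * coefs[["SessionsPerWeek"]])

print(round(pct_change, 4))
```

One additional weekly session changes expected playtime by roughly -0.04%, and even the full 1-to-20 contrast moves it by well under 1%, which puts the "very small" effect in concrete terms.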

0.6.1.2 Overall Model Fit:

Chi-squared = 0.65 on 4 degrees of freedom, p = 0.96
This means the full model does not improve prediction over a model with no predictors at all. That’s a strong signal that the predictors either:
- have very weak effects on playtime,
- are collinear (i.e., overlapping in what they explain),
- or that the exponential model structure may not be appropriate.
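The reported p-value can be recovered directly from the chi-squared statistic and its degrees of freedom. The sketch below also shows the general form of the likelihood-ratio statistic; recomputing it from the printed log-likelihoods gives a slightly different number only because those values are rounded in the output.

```r
# Upper-tail probability of the reported statistic (0.65 on 4 df).
lr_p <- pchisq(0.65, df = 4, lower.tail = FALSE)

# General form: twice the gain in log-likelihood over the intercept-only model.
loglik_full <- -139595.6  # as printed (rounded)
loglik_null <- -139596    # as printed (rounded)
lr_from_rounded <- 2 * (loglik_full - loglik_null)

print(round(lr_p, 3))
print(lr_from_rounded)
```

A p-value this close to 1 means the four predictors together add almost nothing to the likelihood beyond the intercept.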

0.6.2 Key Takeaway:

Although our survival model does not identify statistically significant predictors of retention time, this is still informative. It suggests that a different modeling approach, such as a Weibull model (which lets the hazard rise or fall over time instead of holding it constant) or a semi-parametric Cox model, may better capture the complexity of player behavior.

This also sets up the logistic regression to follow, which uses a different retention definition and may uncover clearer relationships.
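The Weibull alternative mentioned above can be sketched as follows. The example fits both distributions to simulated data (the data frame `sim` and its columns are hypothetical, not the gaming dataset) whose true hazard rises over time, then compares the fits by AIC. It illustrates the comparison workflow only and says nothing about which model would win on the real data.

```r
library(survival)

set.seed(1)

# Simulated stand-in data: Weibull times with shape 2 (increasing hazard).
n <- 500
sim <- data.frame(x = rnorm(n), event = 1)
sim$hours <- rweibull(n, shape = 2, scale = exp(1 + 0.3 * sim$x))

# Same formula, two distributional assumptions.
fit_exp <- survreg(Surv(hours, event) ~ x, data = sim, dist = "exponential")
fit_wei <- survreg(Surv(hours, event) ~ x, data = sim, dist = "weibull")

# Lower AIC indicates the better trade-off between fit and complexity.
aic_table <- AIC(fit_exp, fit_wei)
print(aic_table)

# survreg reports scale = 1 / shape for the Weibull distribution.
shape_hat <- 1 / fit_wei$scale
```

An estimated shape near 1 would mean the exponential assumption is adequate; values well above or below 1 indicate a hazard that rises or falls with time.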

0.7 Simulation-Based Interpretation

new_data <- data.frame(
  SessionsPerWeek = seq(1, 20, by = 1),
  AvgSessionDurationMinutes = mean(df_imputed$AvgSessionDurationMinutes),
  PlayerLevel = mean(df_imputed$PlayerLevel),
  Age = mean(df_imputed$Age)
)

predicted_survival <- predict(surv_model, newdata = new_data, type = "response")

ggplot(new_data, aes(x = SessionsPerWeek, y = predicted_survival)) +
  geom_line(color = "steelblue", linewidth = 1.2) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  labs(title = "Predicted PlayTime vs Sessions Per Week", 
       x = "Sessions Per Week", 
       y = "Predicted PlayTime (Hours)") +
  theme_minimal()

0.7.1 Interpretation

This plot visualizes predicted values from our exponential survival model by simulating how SessionsPerWeek affects expected total playtime, while holding other variables (like PlayerLevel, AvgSessionDurationMinutes, and Age) constant at their means.

Despite our theoretical expectation that players who log in more frequently would stay engaged longer, the simulated trend line shows the opposite:

As SessionsPerWeek increases, predicted PlayTimeHours slightly decreases.
The slope is shallow but clearly negative — predicted playtime drops from around 12.06 hours at 1 session per week to about 11.98 hours at 20 sessions per week.

This result is consistent with the negative (though non-significant) coefficient for SessionsPerWeek observed in the survival model output. While this may seem counterintuitive, it aligns with the earlier model’s implication that more frequent logins do not necessarily translate to longer total playtime — at least not in a linear way under the exponential assumption.
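The plot above shows point predictions only. A fuller simulation-based reading, in the spirit of King, Tomz, and Wittenberg (2000), also propagates estimation uncertainty by drawing many plausible coefficient vectors and summarizing the predictions they imply. The sketch below does this on simulated stand-in data; `sim`, `draws`, and `sim_summary` are hypothetical names, and the normal approximation to the coefficients' sampling distribution is an assumption.

```r
library(survival)

set.seed(42)

# Simulated stand-in data (hypothetical; not the gaming dataset).
n <- 1000
sim <- data.frame(sessions = sample(0:19, n, replace = TRUE), event = 1)
sim$hours <- rexp(n, rate = 1 / exp(2.5 - 0.01 * sim$sessions))

fit <- survreg(Surv(hours, event) ~ sessions, data = sim, dist = "exponential")

# Draw coefficient vectors from a multivariate normal centered on the estimates.
k <- length(coef(fit))
V <- vcov(fit)[seq_len(k), seq_len(k), drop = FALSE]
draws <- MASS::mvrnorm(2000, mu = coef(fit), Sigma = V)

# Expected hours for every draw at every session count:
# under the exponential model the mean is exp(X %*% beta).
grid <- 0:19
X <- cbind(1, grid)
pred <- exp(draws %*% t(X))

# Median and 95% interval of the simulated predictions at each session count.
sim_summary <- data.frame(
  sessions = grid,
  estimate = apply(pred, 2, median),
  lower = apply(pred, 2, quantile, probs = 0.025),
  upper = apply(pred, 2, quantile, probs = 0.975)
)
print(head(sim_summary))
```

Plotting `lower` and `upper` as a ribbon around `estimate` would show whether a shallow slope is distinguishable from a flat line.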

0.7.2 What could explain this?

Players who log in often may have shorter sessions or burn out sooner, reducing cumulative hours.
There may be non-linear dynamics (e.g., diminishing returns to frequency) that the exponential model can’t capture.
Collinearity with other engagement variables could dilute the true effect.

0.7.3 Key Takeaway:

This plot helps validate what the model outputs suggested: session frequency alone does not meaningfully or reliably predict increased playtime under this specification. It also highlights the limitations of the exponential model for capturing real-world gameplay behavior — motivating the need for either a better model fit or a different outcome definition (as we explore in the logistic regression next).

0.8 Logistic Regression: Predicting Purchase-Based Retention

df_imputed$InGamePurchases <- as.factor(df_imputed$InGamePurchases)

logit_model <- glm(InGamePurchases ~ SessionsPerWeek + AvgSessionDurationMinutes + PlayerLevel + Age, 
                   data = df_imputed, family = binomial)
summary(logit_model)

## 
## Call:
## glm(formula = InGamePurchases ~ SessionsPerWeek + AvgSessionDurationMinutes + 
##     PlayerLevel + Age, family = binomial, data = df_imputed)
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -1.4135512  0.0565429 -25.000   <2e-16 ***
## SessionsPerWeek            0.0022136  0.0021645   1.023    0.306    
## AvgSessionDurationMinutes -0.0001561  0.0002545  -0.613    0.540    
## PlayerLevel                0.0005685  0.0004364   1.303    0.193    
## Age                       -0.0000612  0.0012422  -0.049    0.961    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 40161  on 40033  degrees of freedom
## Residual deviance: 40158  on 40029  degrees of freedom
## AIC: 40168
## 
## Number of Fisher Scoring iterations: 4

0.8.1 Interpretation

This logistic regression models the probability that a player makes at least one in-game purchase — used here as a proxy for retention. The output presents the estimated change in log-odds of retention for each predictor.

0.8.1.1 Coefficient Interpretation:

SessionsPerWeek (coef = 0.0022, p = 0.31):
The coefficient is positive, suggesting that higher session frequency may be associated with a greater likelihood of purchase. However, the effect is not statistically significant, meaning we cannot confidently conclude that this relationship exists in the broader population.
AvgSessionDurationMinutes (coef = -0.00016, p = 0.54):
This variable has a small negative coefficient and is also not statistically significant. Longer sessions do not appear to meaningfully impact purchase likelihood in this model.
PlayerLevel (coef = 0.00057, p = 0.19):
Higher player levels are associated with a slight increase in the odds of purchasing, as expected. But again, the result is not statistically significant, indicating that any effect may be due to chance.
Age (coef = -0.00006, p = 0.96):
Essentially zero effect with a very high p-value. Age does not help explain retention via purchasing behavior in this model.
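Log-odds coefficients are often easier to interpret as odds ratios. The sketch below exponentiates the estimates copied from the output above; the vector name `logit_coefs` is introduced here for illustration.

```r
# Estimates copied from the glm summary above.
logit_coefs <- c(
  SessionsPerWeek = 0.0022136,
  AvgSessionDurationMinutes = -0.0001561,
  PlayerLevel = 0.0005685,
  Age = -0.0000612
)

# Odds ratio per one-unit increase in each predictor.
odds_ratio <- exp(logit_coefs)

# Odds ratio for a larger contrast: 10 additional sessions per week.
or_10_sessions <- exp(10 * logit_coefs[["SessionsPerWeek"]])

print(round(odds_ratio, 5))
```

Every odds ratio sits within about 0.2% of 1, and even ten extra weekly sessions raise the odds of purchase by only around 2%.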

0.8.1.2 Overall Model Fit:

The model’s null deviance (40161) and residual deviance (40158) are nearly identical, suggesting minimal improvement when adding predictors.
AIC = 40168, a useful benchmark for comparing alternative models, but by itself not an indicator of quality here.
None of the predictors reach the typical thresholds for significance (p < 0.05), limiting our ability to draw firm conclusions from this model.
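The gap between the two deviances can be turned into a formal likelihood-ratio test. The sketch below uses the deviances as printed, which are rounded to whole numbers, so the resulting p-value is approximate.

```r
# Values copied from the glm summary above (rounded in the output).
null_dev <- 40161
resid_dev <- 40158
df_diff <- 40033 - 40029  # one degree of freedom per added predictor

# Drop in deviance and its upper-tail chi-squared probability.
lr_stat <- null_dev - resid_dev
lr_p <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)

print(lr_stat)
print(round(lr_p, 3))
```

With the fitted objects available, `anova(logit_model, test = "Chisq")` performs the same kind of test on unrounded deviances.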

0.8.2 Key Takeaway:

Although directionally aligned with theory (e.g., more sessions and higher levels should improve retention), none of the predictors significantly explain purchase behavior in this sample. This suggests:
- Either the relationship between behavior and retention is weaker than expected,
- Or that in-game purchases are influenced by other unmeasured variables (e.g., promotions, social factors, personality).

This model, while cleaner than the exponential survival model, still falls short of explanatory power. It highlights the complexity of retention and monetization — and the need to explore richer data or more flexible modeling techniques.

0.9 Predicted Retention by Session Frequency

retention_prob <- predict(logit_model, newdata = new_data, type = "response")

ggplot(new_data, aes(x = SessionsPerWeek, y = retention_prob)) +
  geom_line(color = "darkgreen", linewidth = 1.2) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  labs(title = "Predicted Probability of Retention vs Sessions Per Week",
       x = "Sessions Per Week", y = "Probability of Retention") +
  theme_minimal()

0.9.1 Interpretation

This plot shows predicted probabilities of retention (via in-game purchase) across different levels of SessionsPerWeek, based on our logistic regression model. All other predictors were held constant at their mean values.

The curve is positively sloped, indicating that players who log in more frequently are predicted to have slightly higher chances of making a purchase. However, the change is minimal — rising only from about 0.198 to 0.204 across the entire range of session frequency (1 to 20 sessions/week).

This aligns with the positive coefficient for SessionsPerWeek in the logistic regression model, but it’s important to keep in mind:

The effect was not statistically significant (p = 0.31),
And the predicted probability only increases by ~0.6 percentage points across the full range.
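The predicted probabilities can be reproduced by hand from the model output, which makes the small size of the effect easy to verify. The sketch below plugs the printed coefficients and the (rounded) predictor means from the descriptive statistics into the inverse-logit function, so its values may differ from the plotted ones in the third decimal place.

```r
# Coefficients copied from the glm summary above.
b <- c(
  intercept = -1.4135512,
  sessions = 0.0022136,
  duration = -0.0001561,
  level = 0.0005685,
  age = -0.0000612
)

# Predictor means copied (as rounded) from the descriptive statistics.
means <- c(duration = 94.79, level = 49.66, age = 31.99)

# Linear predictor across 1 to 20 sessions, other predictors at their means.
sessions <- 1:20
log_odds <- b[["intercept"]] + b[["sessions"]] * sessions +
  sum(b[names(means)] * means)

# plogis() is the inverse logit: it converts log-odds to probabilities.
prob <- plogis(log_odds)

# Total change across the grid, in percentage points.
range_pp <- 100 * (prob[20] - prob[1])

print(round(prob[c(1, 20)], 3))
```

Both endpoints land close to 0.20, and the total change across the grid is well under one percentage point.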

0.9.2 Key Takeaway:

While the plot directionally supports the idea that increased frequency may relate to higher retention, the magnitude of this effect is very small and should not be overinterpreted. The model does not provide strong evidence that session frequency — by itself — meaningfully impacts purchase-based retention in this dataset.

This reinforces the conclusion from earlier: behavioral indicators like session frequency or playtime may need to be combined with other context (e.g., player cohorts, game events, or social features) to fully explain retention outcomes.

0.10 Conclusion

This analysis explored the relationship between player behavior and retention using two modeling approaches: parametric survival regression and logistic regression. Despite theoretical expectations, neither model identified statistically significant predictors of retention.

The survival model, which examined total playtime, found no meaningful effects from session frequency, session duration, player level, or age. The logistic regression model, predicting in-game purchases as a proxy for retention, also returned non-significant coefficients. SessionsPerWeek had the largest coefficient (0.0022, p = 0.31) and PlayerLevel the smallest p-value (0.19), but neither reached significance.

Simulation plots helped visualize these trends, but ultimately reinforced the quantitative findings: while behavioral metrics like frequency and level were positively aligned with retention in theory, the actual effect sizes were small and lacked statistical support.

0.10.1 Key Takeaway:

Although directionally reasonable, the models used here were not able to provide strong predictive insights. This suggests that:
- Retention and monetization may be influenced by unobserved factors not captured in this dataset (e.g., game features, social mechanics, user motivations).
- Alternative modeling techniques (e.g., non-parametric methods or time-varying survival models) may be better suited for future analysis.
- Even non-significant findings offer value by refining assumptions and guiding better data collection strategies.

Future work should consider incorporating social network data, event-level interactions, and marketing exposure to better understand what drives player loyalty and conversion.

1 References

Acock, A. C. (2005). Working with missing values. Journal of Marriage and Family, 67(4), 1012–1028. https://doi.org/10.1111/j.1741-3737.2005.00191.x

Honaker, J., & King, G. (2010). What to do about missing values in time-series cross-section data. American Journal of Political Science, 54(2), 561–581. https://doi.org/10.1111/j.1540-5907.2010.00447.x

King, G., Tomz, M., & Wittenberg, J. (2000). Making the most of statistical analyses: Improving interpretation and presentation. American Journal of Political Science, 44(2), 347–361. https://doi.org/10.2307/2669316

Zelner, B. A. (2009). Using simulation to interpret results from logit, probit, and other nonlinear models. Strategic Management Journal, 30(12), 1335–1348. https://doi.org/10.1002/smj.796

Niebles, J., & Mahajan, M. (2017). Analyzing social networks in Destiny. Stanford CS224W Final Project Report. Retrieved from https://web.stanford.edu/class/cs224w/projects

Final Extra Credit Assignment

Marc Brian Ventura

2025-05-13
