Understanding why players stay engaged in multiplayer games is central to improving game design, fostering community, and driving monetization. This project explores how behavioral patterns influence retention, drawing from statistical methods and social theory covered throughout the semester.
Our analysis uses a subset of the Predict Online Gaming Behavior dataset and is conceptually grounded in the Destiny video game study (Niebles & Mahajan, 2017), which found that social connection—through clans and repeated co-play—significantly predicted retention. While our dataset lacks explicit network information, we interpret frequent sessions and long session durations as behavioral proxies for social embeddedness.
We apply the following methods:
- Multiple imputation to handle missing values using predictive mean matching (MICE),
- Parametric survival analysis with an exponential model to estimate time-to-churn (PlayTimeHours),
- Logistic regression of in-game purchases as an alternative proxy for retention,
- Simulation-based interpretation to understand the substantive effects of key predictors.
Key outcome:
- PlayTimeHours: Total hours played, interpreted as time to churn.
Key predictors:
- SessionsPerWeek: Frequency of play, interpreted as habitual or socially motivated behavior,
- AvgSessionDurationMinutes: Depth of engagement,
- PlayerLevel: Progression,
- Age: Demographic control.
This project applies both technical and theoretical lenses to investigate whether behavioral engagement reflects deeper social mechanisms that influence player longevity.
# Packages for reading the data and inspecting missingness
library(readr)
library(naniar)
df <- read_csv("online_gaming_behavior_dataset.csv")
miss_var_summary(df)
vis_miss(df)
The dataset was assessed for missing values using both miss_var_summary() and vis_miss(). These tools help identify not only how much data is missing in each column but also whether missingness follows any detectable patterns.
The summary table shows that none of the 13 variables in the dataset contain missing values: each has n_miss = 0 and pct_miss = 0. This includes all critical variables, such as:
- Age, Gender, Location
- SessionsPerWeek, AvgSessionDurationMinutes, PlayerLevel, AchievementsUnlocked
- PlayTimeHours and InGamePurchases
The accompanying visualization from vis_miss() reinforces this finding. All variables are displayed with a single solid gray bar, indicating 100% completeness across all ~40,000 observations.
There is no missing data in this dataset. This means:
- No need for imputation or listwise deletion.
- Full sample size is retained for all analyses.
- No risk of bias or information loss due to incomplete cases.
This clean structure ensures that the results from all models will be based on a consistent and complete dataset.
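For readers unfamiliar with the n_miss and pct_miss columns, the following base R sketch shows what those quantities amount to. The toy data frame and its values are hypothetical and are not drawn from the gaming dataset.

```r
# Toy data frame (hypothetical values) with a few deliberate gaps
toy <- data.frame(
  Age = c(21, NA, 35, 40),
  SessionsPerWeek = c(3, 7, NA, NA),
  PlayerLevel = c(10, 20, 30, 40)
)

# Per-column count and percentage of missing values
n_miss <- colSums(is.na(toy))
pct_miss <- 100 * colMeans(is.na(toy))

miss_table <- data.frame(
  variable = names(n_miss),
  n_miss = n_miss,
  pct_miss = pct_miss,
  row.names = NULL
)
print(miss_table)

# A dataset is complete only when every count is zero
all_complete <- all(n_miss == 0)
```

In the report's dataset the equivalent of all_complete would be TRUE, since every variable has zero missing values.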
library(mice)
imputed_data <- mice(df, m = 5, maxit = 10, method = 'pmm', seed = 2025)
##
## iter imp variable
## 1 1
## 1 2
## 1 3
## 1 4
## 1 5
## 2 1
## 2 2
## 2 3
## 2 4
## 2 5
## 3 1
## 3 2
## 3 3
## 3 4
## 3 5
## 4 1
## 4 2
## 4 3
## 4 4
## 4 5
## 5 1
## 5 2
## 5 3
## 5 4
## 5 5
## 6 1
## 6 2
## 6 3
## 6 4
## 6 5
## 7 1
## 7 2
## 7 3
## 7 4
## 7 5
## 8 1
## 8 2
## 8 3
## 8 4
## 8 5
## 9 1
## 9 2
## 9 3
## 9 4
## 9 5
## 10 1
## 10 2
## 10 3
## 10 4
## 10 5
df_imputed <- complete(imputed_data, 1)
sum(is.na(df_imputed))
## [1] 0
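While our dataset is fully complete, we include this step to illustrate multiple imputation, a technique commonly used to address missing data in real-world analyses. Because no values are missing, the call leaves the data unchanged, which is why the iteration log above lists no imputed variables.
Using Predictive Mean Matching (PMM) via the mice package, missing values (if present) would be filled in with observed values borrowed from similar cases. This preserves the variable's distribution and reduces bias compared to simpler methods like mean imputation or listwise deletion. Note that complete(imputed_data, 1) extracts only the first of the five imputed datasets; with genuinely missing data, the model would be fit to all five and the estimates pooled.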
Including this here shows preparedness for handling incomplete data responsibly, a critical skill in applied data analysis — especially when modeling human behavior where gaps are common.
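As a rough sketch of what a full multiple-imputation workflow could look like when data really are missing, the example below uses the small nhanes dataset bundled with mice rather than the gaming data. The model formula (bmi ~ age + chl) is chosen only for illustration.

```r
library(mice)

# nhanes: small example dataset shipped with mice that contains missing values
imp <- mice(nhanes, m = 5, maxit = 10, method = "pmm",
            seed = 2025, printFlag = FALSE)

# Fit the same regression in each of the 5 completed datasets
fits <- with(imp, lm(bmi ~ age + chl))

# Combine estimates and standard errors across imputations (Rubin's rules)
pooled <- pool(fits)
pooled_summary <- summary(pooled)
print(pooled_summary)
```

Pooling across all imputed datasets, rather than analysing a single completed copy, is what lets the standard errors reflect uncertainty about the imputed values.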
summary(select(df_imputed, PlayTimeHours, SessionsPerWeek, AvgSessionDurationMinutes, PlayerLevel, Age))
## PlayTimeHours SessionsPerWeek AvgSessionDurationMinutes PlayerLevel
## Min. : 0.000115 Min. : 0.000 Min. : 10.00 Min. : 1.00
## 1st Qu.: 6.067501 1st Qu.: 4.000 1st Qu.: 52.00 1st Qu.:25.00
## Median :12.008002 Median : 9.000 Median : 95.00 Median :49.00
## Mean :12.024365 Mean : 9.472 Mean : 94.79 Mean :49.66
## 3rd Qu.:17.963831 3rd Qu.:14.000 3rd Qu.:137.00 3rd Qu.:74.00
## Max. :23.999592 Max. :19.000 Max. :179.00 Max. :99.00
## Age
## Min. :15.00
## 1st Qu.:23.00
## Median :32.00
## Mean :31.99
## 3rd Qu.:41.00
## Max. :49.00
The summary statistics offer valuable insight into the distribution and variability of the key variables included in our retention models.
PlayTimeHours: The minimum is nearly zero, while the maximum is ~24 hours. The mean and median (~12 hours) are nearly identical, and the quartiles (about 6, 12, and 18) are evenly spaced, indicating a symmetric and roughly uniform distribution. This suggests that while some players churn quickly, others accumulate up to a full day of playtime, which is useful context when modeling survival.
SessionsPerWeek: Ranges from 0 to 19, with a mean of 9.47 and median of 9. The quartiles (4, 9, and 14) are evenly spaced around the median, so there is little evidence of skew. Roughly a quarter of players log in 14 or more times per week; this subgroup may represent highly engaged or habitual users.
AvgSessionDurationMinutes: Ranges from 10 to 179 minutes. With a mean and median both around 95, this variable is symmetric: players vary widely in how long they play per session, but there is no strong skew.
PlayerLevel: Spans from 1 to 99 with a mean and median around 49–50. The quartile spread is consistent, which suggests a fairly balanced distribution of progression levels, from new players to nearly maxed-out users.
Age: Ranges from 15 to 49 years old, with a mean and median around 32. The distribution is symmetric, but symmetry alone does not imply normality: the evenly spaced quartiles (23, 32, 41) are closer to what a uniform distribution would produce. The sample spans teenagers through middle-aged adults, so age-related conclusions apply only within that range.
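One way to check the shape of a variable from its summary statistics alone is to compare the reported quartiles with those of a reference distribution. The sketch below does this for PlayTimeHours against a uniform distribution on [0, 24]; the choice of a uniform reference is an assumption made here for illustration, and the skewed sample at the end is simulated, not taken from the dataset.

```r
# Quartiles of a continuous uniform distribution on [0, 24]
probs <- c(0.25, 0.50, 0.75)
uniform_q <- qunif(probs, min = 0, max = 24)

# Quartiles reported in the summary output for PlayTimeHours
reported_q <- c(6.067501, 12.008002, 17.963831)

comparison <- data.frame(prob = probs, uniform = uniform_q, reported = reported_q)
print(comparison)

# Contrast (simulated, illustrative only): in right-skewed data the mean
# sits clearly above the median, unlike the variables summarised above
set.seed(2025)
skewed <- rexp(10000, rate = 1 / 9)
skew_check <- c(mean = mean(skewed), median = median(skewed))
print(skew_check)
```

The reported quartiles fall within about 0.1 hours of the uniform ones, which is the pattern one would expect from evenly spread playtime rather than from a bell-shaped or skewed distribution.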
Each of the selected variables shows sufficient variability for predictive modeling. Notably:
- SessionsPerWeek and PlayTimeHours capture differences in player engagement patterns.
- AvgSessionDurationMinutes and PlayerLevel represent depth of gameplay.
- Age is not overly concentrated, reducing the risk of confounding by age group.
These distributions validate our choice to include these features in both the survival and logistic models.
library(survival)
# Every player is coded as an observed churn event (event = 1), so nothing is censored
surv_obj <- Surv(df_imputed$PlayTimeHours, event = rep(1, nrow(df_imputed)))
surv_model <- survreg(surv_obj ~ SessionsPerWeek + AvgSessionDurationMinutes + PlayerLevel + Age,
                      data = df_imputed, dist = "exponential")
summary(surv_model)
##
## Call:
## survreg(formula = surv_obj ~ SessionsPerWeek + AvgSessionDurationMinutes +
## PlayerLevel + Age, data = df_imputed, dist = "exponential")
## Value Std. Error z p
## (Intercept) 2.49e+00 2.26e-02 110.36 <2e-16
## SessionsPerWeek -3.61e-04 8.67e-04 -0.42 0.68
## AvgSessionDurationMinutes -2.26e-05 1.02e-04 -0.22 0.82
## PlayerLevel -1.03e-04 1.75e-04 -0.59 0.56
## Age 1.43e-04 4.98e-04 0.29 0.77
##
## Scale fixed at 1
##
## Exponential distribution
## Loglik(model)= -139595.6 Loglik(intercept only)= -139596
## Chisq= 0.65 on 4 degrees of freedom, p= 0.96
## Number of Newton-Raphson Iterations: 4
## n= 40034
This model uses an exponential distribution, which assumes a constant hazard rate (i.e., constant risk of churn over time). Every player is treated as having churned (event = 1 for all rows), so no observations are censored. The goal is to estimate how each behavioral or demographic factor influences the total time a player remains engaged (PlayTimeHours).
SessionsPerWeek (-0.00036, p = 0.68):
The coefficient is negative, suggesting that more frequent sessions are
actually associated with slightly shorter playtime — the opposite of our
hypothesis. However, the effect is very small and
statistically insignificant, meaning we cannot reliably
conclude this relationship exists in the population.
AvgSessionDurationMinutes (-0.000023, p = 0.82):
Also negative and non-significant. This implies that longer average session lengths do not meaningfully predict longer overall playtime when holding other variables constant.
PlayerLevel (-0.000103, p = 0.56):
Despite our expectation that more experienced players would remain
longer, this coefficient is negative and non-significant. This suggests
that progression alone does not extend playtime in a straightforward,
linear way.
Age (0.000143, p = 0.77):
The only positive coefficient, but again not
significant. Any effect of age on playtime is minimal and
likely due to chance in this model.
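Because survreg models the logarithm of time, each coefficient can be exponentiated into a multiplicative "time ratio". The sketch below applies that arithmetic to the rounded values printed in the model summary; it is a reading aid built from the reported numbers, not a refit of the model.

```r
# Coefficients copied (rounded) from the survreg summary above
coefs <- c(
  SessionsPerWeek = -3.61e-04,
  AvgSessionDurationMinutes = -2.26e-05,
  PlayerLevel = -1.03e-04,
  Age = 1.43e-04
)
intercept <- 2.49

# exp(coef) is the multiplicative change in expected time per one-unit increase
time_ratio <- exp(coefs)
pct_change <- 100 * (time_ratio - 1)
print(round(pct_change, 4))

# Expected playtime (hours) when all predictors equal zero
baseline_hours <- exp(intercept)

# Overall likelihood-ratio test reported in the output: Chisq = 0.65 on 4 df
lr_p <- pchisq(0.65, df = 4, lower.tail = FALSE)
print(c(baseline_hours = baseline_hours, lr_p = lr_p))
```

Each additional session per week changes expected playtime by only a few hundredths of a percent, and the baseline of roughly 12 hours is essentially the sample mean, which is another way of seeing that the predictors add almost nothing.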
Although our survival model does not identify statistically significant predictors of retention time, this is still informative. It suggests that a different modeling approach, such as a Weibull model (which allows the hazard to rise or fall monotonically over time) or a semi-parametric Cox model (which leaves the baseline hazard unspecified), may better capture the complexity of player behavior.
This also sets up the logistic regression to follow, which uses a different retention definition and may uncover clearer relationships.
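To make the alternatives mentioned above concrete, here is a minimal sketch that fits exponential, Weibull, and Cox models to simulated data. The data-generating process (one predictor x, Weibull shape 2) is hypothetical and was chosen so that the exponential assumption is visibly wrong; it says nothing about how the gaming data would behave.

```r
library(survival)

set.seed(2025)
n <- 2000
# Hypothetical data: one predictor and Weibull event times with shape 2
x <- rnorm(n)
time <- rweibull(n, shape = 2, scale = exp(2 + 0.3 * x))
event <- rep(1, n)  # every subject has an observed event, as in the report
sim <- data.frame(time = time, event = event, x = x)

fit_exp <- survreg(Surv(time, event) ~ x, data = sim, dist = "exponential")
fit_wei <- survreg(Surv(time, event) ~ x, data = sim, dist = "weibull")

# AIC = -2 * logLik + 2 * (number of estimated parameters)
aic_exp <- -2 * fit_exp$loglik[2] + 2 * length(coef(fit_exp))
aic_wei <- -2 * fit_wei$loglik[2] + 2 * (length(coef(fit_wei)) + 1)  # +1 for scale

# survreg parameterises the Weibull with scale = 1 / shape
shape_hat <- 1 / fit_wei$scale

# Cox model: estimates the effect of x without specifying the baseline hazard
fit_cox <- coxph(Surv(time, event) ~ x, data = sim)

print(c(aic_exponential = aic_exp, aic_weibull = aic_wei, shape = shape_hat))
print(coef(fit_cox))
```

A lower AIC for the Weibull fit and an estimated shape away from 1 would indicate that the constant-hazard assumption does not hold. Note that the Cox coefficient is on the hazard scale, so its sign is the reverse of the survreg coefficient.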
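library(ggplot2)
# Vary SessionsPerWeek while holding the other predictors at their means.
# Observed values range from 0 to 19, so 20 is a slight extrapolation.
new_data <- data.frame(
  SessionsPerWeek = seq(1, 20, by = 1),
  AvgSessionDurationMinutes = mean(df_imputed$AvgSessionDurationMinutes),
  PlayerLevel = mean(df_imputed$PlayerLevel),
  Age = mean(df_imputed$Age)
)
# type = "response" returns predictions on the original time scale (hours)
predicted_survival <- predict(surv_model, newdata = new_data, type = "response")
# linewidth replaces size for line geoms in ggplot2 3.4.0 and later
ggplot(new_data, aes(x = SessionsPerWeek, y = predicted_survival)) +
  geom_line(color = "steelblue", linewidth = 1.2) +
  labs(title = "Predicted PlayTime vs Sessions Per Week",
       x = "Sessions Per Week",
       y = "Predicted PlayTime (Hours)") +
  theme_minimal()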
This plot visualizes predicted values from our exponential survival model by simulating how SessionsPerWeek affects expected total playtime, while holding other variables (like PlayerLevel, AvgSessionDurationMinutes, and Age) constant at their means.
Despite our theoretical expectation that players who log in more frequently would stay engaged longer, the simulated trend line shows the opposite: predicted playtime declines slightly as SessionsPerWeek increases.
This result is consistent with the negative (though non-significant) coefficient for SessionsPerWeek observed in the survival model output. While this may seem counterintuitive, it aligns with the earlier model's implication that more frequent logins do not necessarily translate to longer total playtime, at least not in a linear way under the exponential assumption.
This plot helps validate what the model outputs suggested: session frequency alone does not meaningfully or reliably predict increased playtime under this specification. It also highlights the limitations of the exponential model for capturing real-world gameplay behavior — motivating the need for either a better model fit or a different outcome definition (as we explore in the logistic regression next).
# Treat the purchase indicator as a two-level factor for the binomial model
df_imputed$InGamePurchases <- as.factor(df_imputed$InGamePurchases)
logit_model <- glm(InGamePurchases ~ SessionsPerWeek + AvgSessionDurationMinutes + PlayerLevel + Age,
                   data = df_imputed, family = binomial)
summary(logit_model)
##
## Call:
## glm(formula = InGamePurchases ~ SessionsPerWeek + AvgSessionDurationMinutes +
## PlayerLevel + Age, family = binomial, data = df_imputed)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.4135512 0.0565429 -25.000 <2e-16 ***
## SessionsPerWeek 0.0022136 0.0021645 1.023 0.306
## AvgSessionDurationMinutes -0.0001561 0.0002545 -0.613 0.540
## PlayerLevel 0.0005685 0.0004364 1.303 0.193
## Age -0.0000612 0.0012422 -0.049 0.961
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 40161 on 40033 degrees of freedom
## Residual deviance: 40158 on 40029 degrees of freedom
## AIC: 40168
##
## Number of Fisher Scoring iterations: 4
This logistic regression models the probability that a player makes at least one in-game purchase — used here as a proxy for retention. The output presents the estimated change in log-odds of retention for each predictor.
SessionsPerWeek (coef = 0.0022, p = 0.31):
The coefficient is positive, suggesting that higher session frequency
may be associated with a greater likelihood of purchase. However, the
effect is not statistically significant, meaning we
cannot confidently conclude that this relationship exists in the broader
population.
AvgSessionDurationMinutes (coef = -0.00016, p = 0.54):
This variable has a small negative coefficient and is also not statistically significant. Longer sessions do not appear to meaningfully impact purchase likelihood in this model.
PlayerLevel (coef = 0.00057, p = 0.19):
Higher player levels are associated with a slight increase in the odds of purchasing, as expected. But again, the result is not statistically significant, indicating that any effect may be due to chance.
Age (coef = -0.00006, p = 0.96):
Essentially zero effect with a very high p-value. Age does not help
explain retention via purchasing behavior in this model.
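Log-odds coefficients are easier to read as odds ratios. The short sketch below exponentiates the estimates printed in the glm summary; the 10-session contrast is an arbitrary illustrative choice.

```r
# Estimates copied from the glm summary above
logit_coefs <- c(
  SessionsPerWeek = 0.0022136,
  AvgSessionDurationMinutes = -0.0001561,
  PlayerLevel = 0.0005685,
  Age = -0.0000612
)

# exp(coef) is the odds ratio for a one-unit increase in the predictor
odds_ratio <- exp(logit_coefs)
print(round(odds_ratio, 4))

# Odds ratio for a larger contrast: 10 additional sessions per week
or_10_sessions <- exp(10 * logit_coefs[["SessionsPerWeek"]])
print(or_10_sessions)
```

All four odds ratios sit within a fraction of a percent of 1, and even a 10-session difference shifts the odds of purchase by only about 2%.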
Although directionally aligned with theory (e.g., more sessions and higher levels should improve retention), none of the predictors significantly explain purchase behavior in this sample. This suggests:
- Either the relationship between behavior and retention is weaker than expected,
- Or that in-game purchases are influenced by other unmeasured variables (e.g., promotions, social factors, personality).
This model, while cleaner than the exponential survival model, still falls short of explanatory power. It highlights the complexity of retention and monetization — and the need to explore richer data or more flexible modeling techniques.
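The lack of explanatory power can also be seen from the deviances in the output. The sketch below computes an approximate likelihood-ratio test of the full model against the intercept-only model; because the printed deviances are rounded to whole numbers, the result is only approximate.

```r
# Values copied from the glm summary above (rounded as printed)
null_deviance <- 40161
residual_deviance <- 40158
df_diff <- 40033 - 40029  # four predictors

# Drop in deviance is approximately chi-squared under the null of no effects
lr_stat <- null_deviance - residual_deviance
lr_p <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)
print(c(lr_stat = lr_stat, df = df_diff, p_value = lr_p))
```

A p-value well above 0.05 here means the four predictors jointly do not improve on a model that predicts the same purchase probability for everyone.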
# type = "response" returns predicted probabilities rather than log-odds
retention_prob <- predict(logit_model, newdata = new_data, type = "response")
# linewidth replaces size for line geoms in ggplot2 3.4.0 and later
ggplot(new_data, aes(x = SessionsPerWeek, y = retention_prob)) +
  geom_line(color = "darkgreen", linewidth = 1.2) +
  labs(title = "Predicted Probability of Retention vs Sessions Per Week",
       x = "Sessions Per Week", y = "Probability of Retention") +
  theme_minimal()
This plot shows predicted probabilities of retention (via in-game purchase) across different levels of SessionsPerWeek, based on our logistic regression model. All other predictors were held constant at their mean values.
The curve is positively sloped, indicating that players who log in more frequently are predicted to have slightly higher chances of making a purchase. However, the change is minimal — rising only from about 0.198 to 0.204 across the entire range of session frequency (1 to 20 sessions/week).
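The endpoints of the curve can be reproduced by hand from the printed coefficients and the sample means in the summary statistics. This is a sketch using rounded published values, so small differences from the fitted model's own predictions are expected.

```r
# Estimates copied from the glm summary above
b0 <- -1.4135512
b_sessions <- 0.0022136
b_duration <- -0.0001561
b_level <- 0.0005685
b_age <- -0.0000612

# Other predictors held at the means reported in the summary statistics
mean_duration <- 94.79
mean_level <- 49.66
mean_age <- 31.99

sessions <- c(1, 20)
linear_predictor <- b0 + b_sessions * sessions +
  b_duration * mean_duration + b_level * mean_level + b_age * mean_age

# plogis() is the inverse logit: 1 / (1 + exp(-x))
predicted_prob <- plogis(linear_predictor)
print(round(predicted_prob, 3))
```

The two probabilities come out at roughly 0.198 and 0.205, a difference of well under one percentage point across the full range of session frequency.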
This aligns with the positive coefficient for SessionsPerWeek in the logistic regression model, but that coefficient is not statistically significant (p = 0.31), so the slope should be read with caution.
While the plot directionally supports the idea that increased frequency may relate to higher retention, the magnitude of this effect is very small and should not be overinterpreted. The model does not provide strong evidence that session frequency — by itself — meaningfully impacts purchase-based retention in this dataset.
This reinforces the conclusion from earlier: behavioral indicators like session frequency or playtime may need to be combined with other context (e.g., player cohorts, game events, or social features) to fully explain retention outcomes.
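A point prediction alone does not show how uncertain the estimated change is. The sketch below illustrates one common simulation approach: draw coefficient vectors from their estimated sampling distribution and compute the resulting spread in predicted probabilities. The data here are simulated with a hypothetical weak effect, and the single-predictor model is a simplification of the report's model.

```r
set.seed(2025)
n <- 5000
# Hypothetical data with a weak true effect of sessions on purchase
sessions <- sample(0:19, n, replace = TRUE)
purchase <- rbinom(n, 1, plogis(-1.4 + 0.002 * sessions))
fit <- glm(purchase ~ sessions, family = binomial)

# Draw 1000 coefficient vectors from the estimated sampling distribution
# (MASS::mvrnorm is called with its namespace to avoid masking dplyr::select)
draws <- MASS::mvrnorm(1000, mu = coef(fit), Sigma = vcov(fit))

# Two scenarios: intercept column plus sessions = 1 and sessions = 20
scenarios <- cbind(1, c(1, 20))
probs <- plogis(draws %*% t(scenarios))  # 1000 x 2 matrix of probabilities

# First difference: P(purchase | 20 sessions) - P(purchase | 1 session)
first_diff <- probs[, 2] - probs[, 1]
interval <- quantile(first_diff, c(0.025, 0.5, 0.975))
print(round(interval, 4))
```

If the resulting interval for the first difference spans zero, the data cannot distinguish a small positive effect of session frequency from no effect at all.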
This analysis explored the relationship between player behavior and retention using two modeling approaches: parametric survival regression and logistic regression. Despite theoretical expectations, neither model identified statistically significant predictors of retention.
The survival model, which examined total playtime, found no meaningful effects from session frequency, session duration, player level, or age. The logistic regression model, predicting in-game purchases as a proxy for retention, also returned non-significant coefficients, including for SessionsPerWeek, which had the largest per-unit coefficient but was still far from significant (p = 0.31).
Simulation plots helped visualize these trends, but ultimately reinforced the quantitative findings: behavioral metrics like frequency and level were expected to be positively associated with retention, yet their estimated effects were slightly negative in the survival model, slightly positive in the logistic model, and lacked statistical support in both.
Although directionally reasonable, the models used here were not able to provide strong predictive insights. This suggests that:
- Retention and monetization may be influenced by unobserved factors not captured in this dataset (e.g., game features, social mechanics, user motivations).
- Alternative modeling techniques (e.g., non-parametric methods or time-varying survival models) may be better suited for future analysis.
- Even non-significant findings offer value by refining assumptions and guiding better data collection strategies.
Future work should consider incorporating social network data, event-level interactions, and marketing exposure to better understand what drives player loyalty and conversion.
Acock, A. C. (2005). Working with missing values. Journal of Marriage and Family, 67(4), 1012–1028. https://doi.org/10.1111/j.1741-3737.2005.00191.x
Honaker, J., & King, G. (2010). What to do about missing values in time-series cross-section data. American Journal of Political Science, 54(2), 561–581. https://doi.org/10.1111/j.1540-5907.2010.00447.x
King, G., Tomz, M., & Wittenberg, J. (2000). Making the most of statistical analyses: Improving interpretation and presentation. American Journal of Political Science, 44(2), 347–361. https://doi.org/10.2307/2669316
Zelner, B. A. (2009). Using simulation to interpret results from logit, probit, and other nonlinear models. Strategic Management Journal, 30(12), 1335–1348. https://doi.org/10.1002/smj.796
Niebles, J., & Mahajan, M. (2017). Analyzing social networks in Destiny. Stanford CS224W Final Project Report. Retrieved from https://web.stanford.edu/class/cs224w/projects
0.2 Social Theory Framing: Destiny as a Case Study
This project is conceptually informed by findings from Analyzing Social Networks in Destiny, which showed that social integration—particularly through clans and co-play—was a powerful predictor of retention. Players embedded in strong social networks stayed significantly longer than solo players, even when controlling for progression or skill.
Although our dataset lacks explicit social network variables, we draw on these insights by interpreting behavioral metrics like:
- Frequent logins (SessionsPerWeek),
- Longer average sessions (AvgSessionDurationMinutes),
- Progression (PlayerLevel)
as indirect proxies for social engagement.
Our modeling strategy attempts to detect latent social dynamics via observable player behavior. While less granular than the Destiny study’s network analysis, this approach reflects a common real-world constraint: behavioral data often precedes social network instrumentation. Accordingly, this report asks: To what extent can player behavior stand in for social connectivity in predicting retention?