Exploratory Data Analysis of Game Churn

Author

Emma Simkova

##Set Up##

  #Libraries#
library(tidyverse)
library(knitr)
library(kableExtra)
library(patchwork)

game <- read_csv("game_churn.csv")

#renaming columns to get rid of spaces
game <- game %>% 
  rename(Start_Moves    = `Start Moves`,
         Extra_Moves    = `Extra Moves`,
         Used_Moves     = `Used Moves`,
         Buy_More_Moves = `Buy More Moves`,
         Used_Coins     = `Used Coins`,
         End_Type       = `End Type`,
         Play_Time_Sec  = `Play Time Sec`)

#converting variables into factors
game <- game %>%
  mutate(Churn    = factor(Churn, levels = c("No", "Yes"), labels = c("Stayed", "Churned")),
         End_Type = factor(End_Type),
         Day      = factor(Day,
                           levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"),
                           ordered = TRUE))
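
Because factor() turns any value that is not in the listed levels into NA without a warning, a quick check after the conversion can catch misspelt or unexpected categories. The sketch below is only an illustration: toy_raw is a small made-up tibble (not the real data), and the misspelt "monday" is included on purpose to show what the check would report.

```r
library(dplyr)

# Made-up stand-in for the raw data; "monday" is a deliberate misspelling
toy_raw <- tibble(
  Churn = c("No", "Yes", "No"),
  Day   = c("Monday", "monday", "Sunday")
)

day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

# Same conversion as in the set-up chunk, applied to the toy data
toy <- toy_raw %>%
  mutate(Churn = factor(Churn, levels = c("No", "Yes"),
                        labels = c("Stayed", "Churned")),
         Day   = factor(Day, levels = day_levels, ordered = TRUE))

# Count the NAs in each factor column after conversion
na_counts <- toy %>%
  summarise(across(c(Churn, Day), ~ sum(is.na(.x))))

na_counts
```

Any non-zero count means at least one value did not match the listed levels and would be dropped from later grouped summaries or shown as NA.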

Introduction

In this report I use R, RStudio, Quarto and the tidyverse to explore the game_churn.csv dataset, which contains gameplay information for 13,548 player-level observations from a mobile game. Each row represents a single play of a level by a player, and the columns describe different aspects of their behaviour in that level, such as score, number of moves, play time and whether the player eventually churned, i.e. stopped playing the game.

The aim of this report is to look for simple patterns that might be linked with players churning. I focus on summary tables and visualisations using dplyr for data manipulation and ggplot2 for data visualisation.

The dataset contains 15 variables. The main attributes for each observation are:

  • ID: unique identifier for each player.
  • Churn: whether or not the player stopped playing the game.
  • Level: the level number the player is on.
  • Start Moves: the number of moves the player started the level with.
  • Extra Moves: the number of extra moves the player purchased for that level.
  • Used Moves: the total number of moves the player used in that level.
  • Buy More Moves: number of times the player purchased more moves in that level.
  • Used Coins: the total coins the player used in the level.
  • End Type: the reason why the level ended, with possible values of “Win”, “Lose”, “Restart” and “Quit”.
  • Play Time Sec: the number of seconds the player has spent playing the level.
  • Rolling Losses: the number of successive losses by the player.
  • Scores: the score achieved by the player.
  • Datetime: the timestamp when the player started the level.
  • Hour: the hour of the day (from 0-23) during which the player started playing the level.
  • Day: the day of the week on which the player started playing the level.

Churn Count and Percentages

#First, I look at how many player-level observations are in each churn group and what percentage they represent.#
table_churn <- game %>%
  group_by(Churn) %>%
  summarise(Count = n()) %>%
  mutate(Percent = round(100 * Count / sum(Count), 1))

table_churn
# A tibble: 2 × 3
  Churn   Count Percent
  <fct>   <int>   <dbl>
1 Stayed   8896    65.7
2 Churned  4652    34.3
#Now I format this table using knitr::kable() and kableExtra.#
knitr::kable(table_churn, format = "html", digits = c(0, 0, 1), align = "lrr", col.names = c("Churn Status", "Count", "Percent"),
             caption = "Number and percentage of observations by churn status", table.attr = 'data-quarto-disable-processing = "true"') %>%
  kable_styling(full_width = FALSE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "skyblue4")
Number and percentage of observations by churn status
Churn Status   Count   Percent
Stayed          8896      65.7
Churned         4652      34.3

The churn table shows that most observations belong to players who stayed (65.7%), while about one-third belong to players who churned (34.3%). Churn is therefore fairly common, so understanding player behaviour is important. Further analysis of play time, scores and end types may help explain why some players continue playing while others leave.
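
Since each row is one play of a level, the percentages above describe observations rather than individual players. A player-level version can be sketched by keeping one row per ID first. This assumes that ID identifies a player and that Churn is constant for each ID, which I have not verified; toy is a made-up tibble used only for illustration.

```r
library(dplyr)

# Made-up stand-in for `game`: player 1 has four plays, players 2 and 3 one each
toy <- tibble(
  ID    = c(1, 1, 1, 1, 2, 3),
  Churn = factor(c("Stayed", "Stayed", "Stayed", "Stayed", "Churned", "Churned"),
                 levels = c("Stayed", "Churned"))
)

player_churn <- toy %>%
  distinct(ID, Churn) %>%             # one row per player (assumes Churn is fixed per ID)
  count(Churn, name = "Players") %>%  # number of players in each churn group
  mutate(Percent = round(100 * Players / sum(Players), 1))

player_churn
```

In the toy data two-thirds of the rows are "Stayed", but only one of the three players stayed, so the two views can differ when some players contribute many more rows than others.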

Distribution of Playtime and Scores

#Histogram 1 - Playtime#
  #This histogram shows how long players typically spend on a level.#

p1 <- ggplot(game, aes(x = Play_Time_Sec)) +
  geom_histogram(binwidth = 20, fill = "maroon1", color = "navy") +
  coord_cartesian(xlim = c(0, 500)) +
  #Most play sessions last under 500 seconds, so I restricted the x-axis range to make the distribution clearer. A small number of outliers with very long play times were stretching the original axis and making the plot hard to read.#
  labs(title = "Distribution of Play Time", x = "Play time (seconds)", y = "Count")


#Histogram 2 - Scores#
  #This histogram shows the overall distribution of scores achieved by players across all levels.#

p2 <- ggplot(game, aes(x = Scores)) +
  geom_histogram(binwidth = 500, fill = "mediumaquamarine", color = "navy") +
  labs(title = "Distribution of Scores", x = "Scores", y = "Count")

p1 + p2

Both histograms show right-skewed distributions, meaning most observations fall within a lower range while a smaller number extend into much higher values. The Playtime histogram indicates that the vast majority of sessions last under 500 seconds, with only a few long-duration outliers. This suggests that players usually complete or quit levels fairly quickly. The Scores histogram shows a similar pattern: most scores fall between roughly 1,500 and 4,500 points, while very high scores are much less common. Both play time and performance therefore tend to cluster within a typical range, with only a small minority of unusually long or high-scoring sessions standing out from the rest.
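
Right skew can also be checked numerically: in a right-skewed variable the mean is pulled above the median by the long tail. The following is a minimal sketch on made-up play times (toy is not the real data), which also computes the share of sessions at or under the 500-second cut-off used in the plot.

```r
library(dplyr)

# Made-up play times in seconds, with one very long outlier
toy <- tibble(Play_Time_Sec = c(30, 45, 60, 75, 90, 1200))

skew_check <- toy %>%
  summarise(
    Median_Time     = median(Play_Time_Sec),
    Mean_Time       = mean(Play_Time_Sec),
    Share_Under_500 = mean(Play_Time_Sec <= 500)  # proportion of sessions within the plotted range
  )

skew_check
```

The same summarise() call could be applied to game, and to Scores in place of Play_Time_Sec, to put numbers next to what the histograms show.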

After exploring the individual distributions of play time and scores, I next examine how these variables relate to player behaviour. The following scatterplot shows the relationship between used moves and scores, with dot size representing play time and colour representing churn status.

Scores vs Used Moves

#Scatter Plot#  
  #This Scatterplot shows us the relationship between Used Moves, Scores, and Churn Behaviour.#

game_scatter <- game %>%
  filter(Play_Time_Sec <= 500)

ggplot(data = game_scatter, mapping = aes(x = Used_Moves, y = Scores, colour = Churn, size = Play_Time_Sec)) +
  geom_point(alpha = 0.4) +
  scale_size_continuous(name = "Play time (seconds)") +
  scale_colour_manual(values = c("Stayed" = "orchid1", "Churned" = "lightseagreen")) +
  labs(title = "Scores vs Used Moves by Churn Status", x = "Used moves", y = "Score", colour = "Churn status") +
  theme_minimal()

#the x-axis is Used Moves, the y-axis is Scores, and the colours represent Churn groups.#

The scatterplot shows a clear positive relationship between used moves and scores: players who use more moves generally achieve higher scores. Both churned and stayed players follow a similar pattern, with no strong separation between the two groups. Larger dots (longer play times) appear across the plot, meaning longer sessions happen at many different score and move levels. Overall, used moves and scores do not strongly distinguish churn behaviour.
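
One way to put a number on the relationship seen in the scatterplot is the Pearson correlation between used moves and scores, computed separately for each churn group. This is a sketch on made-up values (toy is not the real data); similar correlations in both groups would be consistent with the two groups following the same pattern.

```r
library(dplyr)

# Made-up stand-in for `game`: four plays per churn group
toy <- tibble(
  Churn      = factor(rep(c("Stayed", "Churned"), each = 4),
                      levels = c("Stayed", "Churned")),
  Used_Moves = c(10, 15, 20, 25, 12, 18, 22, 30),
  Scores     = c(1000, 1600, 1900, 2600, 1100, 1700, 2300, 2900)
)

cor_by_churn <- toy %>%
  group_by(Churn) %>%
  summarise(Correlation = cor(Used_Moves, Scores),  # Pearson correlation by default
            .groups = "drop")

cor_by_churn
```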

Comparison of Play Time and Scores Across Groups

#Box Plot 1 - Playtime by Churn#
  #Do players who churn play for less/more time?#

b1 <- ggplot(game, aes(x = Churn, y = Play_Time_Sec)) +
  geom_boxplot(fill = "deeppink3", color = "black", outlier.colour = "cyan3") +
  labs(title = "Play Time by Churn Status", x = "Churn Status", y = "Play Time (seconds)") +
  coord_cartesian(ylim = c(0, 500))
#Shows how long players spend in a level.#
#Compares two groups Stayed vs Churned.#

#Box Plot 2 - Scores by End Type#
  #Do different end types lead to different scores?#

b2 <- ggplot(game, aes(x = End_Type, y = Scores)) +
  geom_boxplot(fill = "darkslategray4", color = "black", outlier.colour = "cyan3") +
  labs(title = "Scores by End Type", x = "End Type", y = "Score")
#Shows how high scores are in different end types (Win, Lose, Restart, Quit).#
#Has nothing to do with churn.#

b1 + b2

These two boxplots are not directly related, but both explore how a numerical variable varies across categories. The first boxplot compares play time between churned and non-churned players, showing how long each group typically spends on a level. The second boxplot examines how scores vary across end types (Win, Lose, Restart, Quit), highlighting differences in performance depending on how the level concluded. Displaying the boxplots side-by-side makes it easier to compare the spreads, medians and variability across groups, helping to reveal patterns in both player behaviour and level outcomes.
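
The medians and spreads drawn in a boxplot can also be listed in a small table, which makes group comparisons easier to state precisely. The sketch below uses made-up play times (toy is not the real data) and reports the median and interquartile range for each churn group.

```r
library(dplyr)

# Made-up stand-in for `game`: four plays per churn group
toy <- tibble(
  Churn         = factor(rep(c("Stayed", "Churned"), each = 4),
                         levels = c("Stayed", "Churned")),
  Play_Time_Sec = c(40, 60, 80, 100, 30, 50, 70, 90)
)

play_time_summary <- toy %>%
  group_by(Churn) %>%
  summarise(
    Median_Time = median(Play_Time_Sec),  # the line inside each box
    Spread_IQR  = IQR(Play_Time_Sec),     # the height of each box
    Plays       = n()
  )

play_time_summary
```

Swapping Churn for End_Type and Play_Time_Sec for Scores would give the matching table for the second boxplot.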

Digital Marketing Insights

In this section I focus on questions that are useful for digital marketing and player retention, such as which players spend more, when players are most active, and when churn is more likely to occur.

Average Coins Used by Churn Group

#Table#

coins_churn <- game %>%
  group_by(Churn) %>%
  summarise(Average_Coins = mean(Used_Coins, na.rm = TRUE))

knitr::kable(coins_churn, format = "html", digits = 2, align = "lr", col.names = c("Churn Group", "Average Coins Used"), caption = "Average coins used by churn group", table.attr = 'data-quarto-disable-processing = "true"') %>%
  kable_styling(full_width = FALSE) %>%
  row_spec(0, bold = TRUE, color = "midnightblue", background = "plum2")
Average coins used by churn group
Churn Group   Average Coins Used
Stayed                      1.42
Churned                     1.26
#The use of this table is for marketing, to know if churned players spend less.#

The table shows that players who stayed in the game used slightly more coins on average (1.42) than players who eventually churned (1.26). Although the difference is not very large, it suggests that players who remain engaged in the game are more likely to spend coins during levels, possibly because they continue investing in gameplay resources such as extra moves or power-ups. In contrast, churned players appear to use fewer coins, which may indicate lower engagement or reduced willingness to spend before leaving the game.
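
An average on its own cannot show whether one group spends coins less often or simply spends less each time. A minimal sketch of how the two could be separated is given below, on made-up values (toy is not the real data): the mean is reported next to the share of plays in which any coins were used.

```r
library(dplyr)

# Made-up stand-in for `game`: four plays per churn group
toy <- tibble(
  Churn      = factor(rep(c("Stayed", "Churned"), each = 4),
                      levels = c("Stayed", "Churned")),
  Used_Coins = c(0, 0, 5, 3, 0, 0, 0, 4)
)

coin_detail <- toy %>%
  group_by(Churn) %>%
  summarise(
    Average_Coins  = mean(Used_Coins, na.rm = TRUE),
    Share_Spending = mean(Used_Coins > 0, na.rm = TRUE),  # proportion of plays with any coin use
    Plays          = n()
  )

coin_detail
```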

Churn Proportion by Day of the Week

#Bar Chart#

churn_day <- game %>%
  group_by(Day, Churn) %>%
  summarise(Count = n(), .groups = "drop_last") %>% #keeps the grouping by Day so Percent is calculated within each day#
  mutate(Percent = Count / sum(Count))

ggplot(churn_day, aes(x = Day, y = Percent, fill = Churn)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = c("Stayed" = "palegreen2", "Churned" = "violetred3"), name = "Churn status") +
  labs(title = "Churn Proportion by Day of the Week", x = "Day of week", y = "Proportion of observations", fill = "Churn status") +
  theme_minimal()

#Good use for timing a campaign in the marketing environment.#

The stacked bar chart shows that the proportion of players who churn versus stay is very consistent across all days of the week. For every day, roughly one-third of observations belong to churned players, while about two-thirds represent players who stayed. This suggests that churn behaviour does not vary much depending on the day a player starts a level. In other words, players are equally likely to churn whether they play on weekdays or weekends, indicating that day of the week is not a strong factor influencing churn in this dataset.

Average Play Time by Hour of Day

#Line Chart#

time_by_hour <- game %>%
  group_by(Hour) %>%
  summarise(Avg_Play_Time = mean(Play_Time_Sec, na.rm = TRUE))

ggplot(time_by_hour, aes(x = Hour, y = Avg_Play_Time)) +
  geom_line(linewidth = 1, color = "darkorchid4") + #line colour#
  geom_point(color = "darkseagreen1", size = 2) + #dot colour#
  labs(title = "Average Play Time by Hour of Day", x = "Hour of day", y = "Average play time (seconds)") +
  theme_minimal()

#Finds peak engagement hours for ads or push notifications.#

The line chart shows how the average play time varies across the 24 hours of the day. Although the changes are not dramatic, there are noticeable fluctuations throughout the day. Play time tends to be slightly higher during the early morning hours (around midnight and 1–2 AM) and again around midday. In contrast, there is a clear dip in average play time around hour 10, where players spend the least amount of time on a level. Overall, the pattern suggests that players’ engagement, measured by how long they play a level, varies modestly depending on the time of day, but no strong or consistent trend appears across the full 24-hour cycle.
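
The peak and the dip can be read off the summary table directly rather than by eye. The sketch below uses made-up values (toy is not the real data) and also counts the plays behind each hourly average, since an average based on very few plays would be less reliable.

```r
library(dplyr)

# Made-up stand-in for `game`: two plays in each of three hours
toy <- tibble(
  Hour          = c(0, 0, 10, 10, 12, 12),
  Play_Time_Sec = c(120, 140, 60, 80, 110, 130)
)

time_by_hour <- toy %>%
  group_by(Hour) %>%
  summarise(Avg_Play_Time = mean(Play_Time_Sec, na.rm = TRUE),
            Plays         = n())

# Hours with the lowest and highest average play time
lowest_hour  <- time_by_hour %>% slice_min(Avg_Play_Time, n = 1)
highest_hour <- time_by_hour %>% slice_max(Avg_Play_Time, n = 1)

lowest_hour
highest_hour
```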

Churn Rate by Day and Hour of Day

#Heatmap#

churn_day_hour <- game %>%
  group_by(Day, Hour, Churn) %>%
  summarise(Count = n(), .groups = "drop") %>% #drops the grouping so it can be set again explicitly below#
  group_by(Day, Hour) %>%
  mutate(Churn_Rate = Count / sum(Count)) %>%
  filter(Churn == "Churned")

ggplot(churn_day_hour, aes(x = Hour, y = Day, fill = Churn_Rate)) +
  geom_tile() +
  scale_fill_gradient(low = "lightyellow", high = "darkmagenta") +
  labs(title = "Churn Rate by Day and Hour of Day", x = "Hour of day", y = "Day of week", fill = "Churn rate") +
  theme_minimal()

#Combines timing + churn risk.#

The heatmap shows that churn rates stay fairly similar across all days and hours. There are some small differences, with a few darker squares showing slightly higher churn at certain times, but no strong pattern stands out. Overall, players seem equally likely to churn no matter what day or time they play.
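
With 7 days and 24 hours there are 168 cells, so some cells may contain only a few observations, and a darker square could reflect a small sample rather than a real difference. A possible variation is sketched below on made-up values (toy is not the real data). It computes the churn rate as the mean of a logical, which keeps cells with no churned observations at a rate of 0, and flags cells below a hypothetical minimum size min_n (the value 3 is only for illustration).

```r
library(dplyr)

# Made-up stand-in for `game`: four plays on Monday at 9 and one at 21
toy <- tibble(
  Day   = c("Monday", "Monday", "Monday", "Monday", "Monday"),
  Hour  = c(9, 9, 9, 9, 21),
  Churn = c("Stayed", "Stayed", "Stayed", "Churned", "Stayed")
)

min_n <- 3  # hypothetical threshold for a "small" cell, chosen for this example only

churn_cells <- toy %>%
  group_by(Day, Hour) %>%
  summarise(Churn_Rate = mean(Churn == "Churned"),  # share of churned observations in the cell
            n          = n(),                       # number of observations in the cell
            .groups    = "drop") %>%
  mutate(Small_Cell = n < min_n)                    # TRUE where the rate rests on few observations

churn_cells
```

Cells flagged as small could then be greyed out or annotated in the heatmap instead of being read at face value.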

Conclusion

Overall, most players continue playing the game, though around one third eventually churn, so retention remains an important challenge. The distributions of both play time and scores show similar patterns across the whole player base, with most sessions being short and most scores falling in a typical mid range. When comparing behaviour between churned and non-churned players, there are no major differences in play time, moves used or scores, suggesting that basic gameplay activity does not strongly predict churn.

However, players who stay in the game use slightly more coins on average, indicating higher engagement or willingness to invest in gameplay resources. Timing patterns also appear stable: churn proportions look almost identical across all days of the week, and average play time varies only slightly across hours of the day. The heatmap confirms that churn rates remain fairly consistent regardless of when players play.

Taken together, these results suggest that churn in this dataset is not strongly associated with performance or timing, but may instead be linked to lower engagement and reduced spending among players who eventually leave.

Bonus: Meme

A scare from first year of Digital Analytics