Kyal Huntley
2024-12-08
## Issue Description
As the lead developer of Community Game LLC, I am using this project to research trends in player engagement to help shape the development of a new game planned for January 2025. Understanding player behavior changes over time, and how factors like seasonal trends or game genres influence engagement and retention, is crucial.
This analysis aims to inform the design and marketing strategy for my upcoming game by identifying seasonal trends in player count and the genres with the most stable player bases. Insights gained will help determine optimal content release times and game genre selection.
The dataset used for this analysis comes from the Tidy Tuesday project, which provides real-world datasets for analysis. The dataset includes player count statistics for various games. The key columns in the dataset are:
URL to the dataset:
https://github.com/rfordatascience/tidytuesday/tree/3557b4badd5da4f33a03e8e61b554743fb19e40f/data/2021/2021-03-16
The dataset documentation can be found on the Tidy Tuesday
GitHub repository:
Dataset
Documentation
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Rows: 83631 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): gamename, month, avg_peak_perc
## dbl (4): year, avg, gain, peak
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Convert avg_peak_perc to numeric by removing '%' symbols and converting
player_data <- player_data %>%
mutate(avg_peak_perc = parse_number(avg_peak_perc))## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `avg_peak_perc = parse_number(avg_peak_perc)`.
## Caused by warning:
## ! 156 parsing failures.
## row col expected actual
## 41807 -- a number NaN%
## 49939 -- a number NaN%
## 51313 -- a number NaN%
## 53083 -- a number NaN%
## 55133 -- a number NaN%
## ..... ... ........ ......
## See problems(...) for more details.
# Filter out rows with missing values in critical columns
cleaned_data <- player_data %>%
filter(!is.na(avg) & !is.na(gain) & !is.na(peak) & !is.na(avg_peak_perc))
# Create a 'season' column based on the month
cleaned_data <- cleaned_data %>%
mutate(season = case_when(
month %in% c("December", "January", "February") ~ "Winter",
month %in% c("March", "April", "May") ~ "Spring",
month %in% c("June", "July", "August") ~ "Summer",
month %in% c("September", "October", "November") ~ "Fall",
TRUE ~ NA_character_ # Handle any invalid month names
))
# Remove rows with invalid or missing seasons
cleaned_data <- cleaned_data %>%
filter(!is.na(season))
# Check the cleaned dataset
str(cleaned_data)## tibble [82,267 × 8] (S3: tbl_df/tbl/data.frame)
## $ gamename : chr [1:82267] "Counter-Strike: Global Offensive" "Dota 2" "PLAYERUNKNOWN'S BATTLEGROUNDS" "Apex Legends" ...
## $ year : num [1:82267] 2021 2021 2021 2021 2021 ...
## $ month : chr [1:82267] "February" "February" "February" "February" ...
## $ avg : num [1:82267] 741013 404832 198958 120983 117742 ...
## $ gain : num [1:82267] -2196 -27840 -2290 49216 -24375 ...
## $ peak : num [1:82267] 1123485 651615 447390 196799 224276 ...
## $ avg_peak_perc: num [1:82267] 66 62.1 44.5 61.5 52.5 ...
## ..- attr(*, "problems")= tibble [156 × 4] (S3: tbl_df/tbl/data.frame)
## .. ..$ row : int [1:156] 41807 49939 51313 53083 55133 55160 55239 55247 55391 55494 ...
## .. ..$ col : int [1:156] NA NA NA NA NA NA NA NA NA NA ...
## .. ..$ expected: chr [1:156] "a number" "a number" "a number" "a number" ...
## .. ..$ actual : chr [1:156] "NaN%" "NaN%" "NaN%" "NaN%" ...
## $ season : chr [1:82267] "Winter" "Winter" "Winter" "Winter" ...
## gamename year month avg
## Length:82267 Min. :2012 Length:82267 Min. : 0.0
## Class :character 1st Qu.:2016 Class :character 1st Qu.: 53.9
## Mode :character Median :2018 Mode :character Median : 203.1
## Mean :2017 Mean : 2749.3
## 3rd Qu.:2019 3rd Qu.: 756.0
## Max. :2021 Max. :1584886.8
## gain peak avg_peak_perc season
## Min. :-250249.0 Min. : 1 Min. : 0.00 Length:82267
## 1st Qu.: -38.3 1st Qu.: 139 1st Qu.:34.63 Class :character
## Median : -1.6 Median : 499 Median :43.90 Mode :character
## Mean : -10.3 Mean : 5419 Mean :42.58
## 3rd Qu.: 22.3 3rd Qu.: 1706 3rd Qu.:51.39
## Max. : 426446.1 Max. :3236027 Max. :88.84
# Average player count by season
seasonal_trends <- cleaned_data %>%
group_by(season) %>%
summarise(avg_players = mean(avg, na.rm = TRUE))
# Plot the results
library(ggplot2)
ggplot(seasonal_trends, aes(x = season, y = avg_players, fill = season)) +
geom_bar(stat = "identity") +
labs(title = "Average Player Count by Season", x = "Season", y = "Average Player Count")
## Q2 Setup 1
# Load necessary libraries
library(tidyverse)
# Load the dataset
player_data <- read_csv("C:/Users/kyalg/Downloads/games.csv")## Rows: 83631 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): gamename, month, avg_peak_perc
## dbl (4): year, avg, gain, peak
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Step 1: Filter out rows with missing or zero avg values
player_data_filtered <- player_data %>%
filter(!is.na(avg) & avg > 0)
# Step 2: Group by game name and calculate the average avg per game
avg_per_game <- player_data_filtered %>%
group_by(gamename) %>%
summarise(avg_player_count = mean(avg, na.rm = TRUE)) %>% # Calculate mean avg per game
arrange(desc(avg_player_count)) # Sort in descending order
# Step 3: Get the top 100 games
top_100_games <- avg_per_game %>%
slice_head(n = 100) # Select the top 100 rows
# Display the top 100 games
print(top_100_games)## # A tibble: 100 × 2
## gamename avg_player_count
## <chr> <dbl>
## 1 PLAYERUNKNOWN'S BATTLEGROUNDS 482025.
## 2 Dota 2 453732.
## 3 Counter-Strike: Global Offensive 308645.
## 4 Cyberpunk 2077 146416.
## 5 Apex Legends 83044.
## 6 Destiny 2 75035.
## 7 Grand Theft Auto V 64019.
## 8 Team Fortress 2 52885.
## 9 Tom Clancy's Rainbow Six Siege 52743.
## 10 Football Manager 2021 49945.
## # ℹ 90 more rows
# Load necessary libraries
library(tidyverse)
library(stringr)
# Load the player data
player_data <- read_csv("C:/Users/kyalg/Downloads/games.csv")## Rows: 83631 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): gamename, month, avg_peak_perc
## dbl (4): year, avg, gain, peak
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Create the genre mapping
genre_mapping <- tibble(
gamename = c(
"playerunknown's battlegrounds",
"dota 2",
"counter-strike: global offensive",
"cyberpunk 2077",
"apex legends",
"destiny 2",
"grand theft auto v",
"team fortress 2",
"tom clancy's rainbow six siege",
"football manager 2021",
"football manager 2020",
"monster hunter: world",
"ark: survival evolved",
"fall guys: ultimate knockout",
"rocket league",
"warframe",
"rust",
"dyson sphere program",
"sid meier's civilization v",
"phasmophobia",
"sid meier's civilization vi",
"garry's mod",
"fallout 4",
"mount & blade ii: bannerlord",
"dead by daylight",
"crusader kings iii",
"terraria",
"path of exile",
"payday 2",
"unturned",
"paladins",
"dota underlords",
"total war: warhammer ii",
"ea sports fifa 21",
"sea of thieves",
"counter-strike",
"arma 3",
"nba 2k20",
"the witcher 3: wild hunt",
"age of empires ii: definitive edition",
"total war: three kingdoms",
"football manager 2018",
"wallpaper engine",
"stardew valley",
"black desert online",
"z1 battle royale",
"hearts of iron iv",
"borderlands 3",
"farming simulator 19",
"war thunder",
"football manager 2017",
"euro truck simulator 2",
"left 4 dead 2",
"dark souls iii",
"stellaris",
"tmodloader",
"europa universalis iv",
"satisfactory",
"smite",
"rimworld",
"football manager 2016",
"clicker heroes",
"red dead online",
"the elder scrolls online",
"world of tanks blitz",
"cities: skylines",
"borderlands 2",
"7 days to die",
"nba 2k21",
"assassin's creed odyssey",
"counter-strike: source",
"star wars: the old republic",
"don't starve together",
"efootball pes 2021 season update",
"fallout 76",
"football manager 2015",
"dayz",
"final fantasy xiv online",
"factorio",
"hades",
"risk of rain 2",
"divinity: original sin 2",
"total war: rome ii - emperor edition",
"efootball pes 2020",
"sekiro: shadows die twice",
"ring of elysium",
"yu-gi-oh! duel links",
"slay the spire",
"runescape",
"the binding of isaac: rebirth",
"phantasy star online 2",
"brawlhalla",
"nba 2k19",
"conan exiles",
"vrchat",
"football manager 2014",
"age of empires ii (2013)",
"home behind 2",
"mount & blade: warband",
"ngu idle"
),
genre = c(
"Battle Royale", "MOBA", "FPS", "RPG", "Battle Royale", "FPS", "Open World", "FPS",
"FPS", "Simulation", "Simulation", "Action RPG", "Survival", "Party", "Sports",
"MMO", "Survival", "Simulation", "Strategy", "Horror", "Strategy", "Sandbox",
"RPG", "RPG", "Survival Horror", "Strategy", "Sandbox", "Action RPG", "Co-op FPS",
"Survival", "Hero Shooter", "Strategy", "Strategy", "Sports", "Adventure", "FPS",
"FPS", "Sports", "RPG", "Strategy", "Strategy", "Simulation", "Utility",
"Simulation", "MMO", "Battle Royale", "Strategy", "Shooter", "Simulation",
"Shooter", "Simulation", "Simulation", "Co-op FPS", "Action RPG", "Strategy",
"Sandbox", "Strategy", "Simulation", "MOBA", "Simulation", "Simulation", "Idle",
"Open World", "MMO", "Action", "City Builder", "Shooter", "Survival", "Sports",
"RPG", "FPS", "MMO", "Survival", "Sports", "RPG", "Simulation", "Survival",
"MMO", "Simulation", "Roguelike", "Roguelike", "RPG", "Strategy", "Sports",
"Action", "Battle Royale", "Card Game", "Roguelike", "MMO", "Roguelike", "MMO",
"Fighting", "Sports", "Survival", "Social", "Simulation", "Strategy", "RPG",
"RPG", "Idle"
)
)
# Standardize game names in both datasets (lowercase and trim whitespace)
player_data <- player_data %>%
mutate(gamename = str_trim(str_to_lower(gamename)))
genre_mapping <- genre_mapping %>%
mutate(gamename = str_trim(str_to_lower(gamename)))
# Join the player data with genre mapping
combined_data <- player_data %>%
left_join(genre_mapping, by = "gamename") %>%
select(gamename, avg, genre) %>%
filter(!is.na(genre)) # Filter out rows without a genre
# Save the combined data to a CSV file
write_csv(combined_data, "C:/Users/kyalg/Downloads/game_avg_genre.csv")
# Display the combined data
print(combined_data)## # A tibble: 5,182 × 3
## gamename avg genre
## <chr> <dbl> <chr>
## 1 counter-strike: global offensive 741013. FPS
## 2 dota 2 404832. MOBA
## 3 playerunknown's battlegrounds 198958. Battle Royale
## 4 apex legends 120983. Battle Royale
## 5 rust 117742. Survival
## 6 team fortress 2 101231. FPS
## 7 grand theft auto v 90648. Open World
## 8 tom clancy's rainbow six siege 72383. FPS
## 9 rocket league 53723. Sports
## 10 path of exile 46920. Action RPG
## # ℹ 5,172 more rows
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
# Load the combined data with game names, average player counts, and genres
game_avg_genre <- read_csv("C:/Users/kyalg/Downloads/game_avg_genre.csv")## Rows: 5182 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): gamename, genre
## dbl (1): avg
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Remove the "Card Game" genre
game_avg_genre_filtered <- game_avg_genre %>%
filter(genre != "Card Game")
# Calculate the average player count and standard deviation for each genre
genre_stability <- game_avg_genre_filtered %>%
group_by(genre) %>%
summarise(
avg_player_count = mean(avg, na.rm = TRUE),
player_count_sd = sd(avg, na.rm = TRUE)
) %>%
arrange(player_count_sd) # Sort by standard deviation (stability)
# Plot 1: Player Count Stability by Genre
ggplot(genre_stability, aes(x = reorder(genre, player_count_sd), y = player_count_sd, fill = genre)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_y_continuous(labels = comma) + # Format y-axis with commas
labs(
title = "Player Count Stability by Genre",
x = "Genre",
y = "Standard Deviation of Player Count"
) +
theme_minimal() +
theme(legend.position = "none")# Select the top 10 genres by average player count
top_10_genres <- genre_stability %>%
arrange(desc(avg_player_count)) %>%
slice_head(n = 10)
# Plot 2: Top 10 Genres by Average Player Count
ggplot(top_10_genres, aes(x = reorder(genre, avg_player_count), y = avg_player_count, fill = genre)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_y_continuous(labels = comma) + # Format y-axis with commas
labs(
title = "Top 10 Genres by Average Player Count",
x = "Genre",
y = "Average Player Count"
) +
theme_minimal() +
theme(legend.position = "none")