Data Analysis Final Presentation

Kyal Huntley

2024-12-08

## Issue Description

As the lead developer of Community Game LLC, I am using this project to research trends in player engagement to help shape the development of a new game planned for January 2025. Understanding player behavior changes over time, and how factors like seasonal trends or game genres influence engagement and retention, is crucial.

This analysis aims to inform the design and marketing strategy for my upcoming game by identifying seasonal trends in player count and the genres with the most stable player bases. Insights gained will help determine optimal content release times and game genre selection.

Questions

  1. How do seasonal trends affect player count in various games?
  1. What genre of game has the most stable player base?

Data Source

The dataset used for this analysis comes from the Tidy Tuesday project, which provides real-world datasets for analysis. The dataset includes player count statistics for various games. The key columns in the dataset are:

URL to the dataset:
https://github.com/rfordatascience/tidytuesday/tree/3557b4badd5da4f33a03e8e61b554743fb19e40f/data/2021/2021-03-16

Documentation

The dataset documentation can be found on the Tidy Tuesday GitHub repository:
Dataset Documentation

Description of the Data

# Load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load the dataset
player_data <- read_csv("C:/Users/kyalg/Downloads/games.csv")
## Rows: 83631 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): gamename, month, avg_peak_perc
## dbl (4): year, avg, gain, peak
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Convert avg_peak_perc to numeric by removing '%' symbols and converting
player_data <- player_data %>%
  mutate(avg_peak_perc = parse_number(avg_peak_perc))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `avg_peak_perc = parse_number(avg_peak_perc)`.
## Caused by warning:
## ! 156 parsing failures.
##   row col expected actual
## 41807  -- a number   NaN%
## 49939  -- a number   NaN%
## 51313  -- a number   NaN%
## 53083  -- a number   NaN%
## 55133  -- a number   NaN%
## ..... ... ........ ......
## See problems(...) for more details.
# Filter out rows with missing values in critical columns
cleaned_data <- player_data %>%
  filter(!is.na(avg) & !is.na(gain) & !is.na(peak) & !is.na(avg_peak_perc))

# Create a 'season' column based on the month
cleaned_data <- cleaned_data %>%
  mutate(season = case_when(
    month %in% c("December", "January", "February") ~ "Winter",
    month %in% c("March", "April", "May") ~ "Spring",
    month %in% c("June", "July", "August") ~ "Summer",
    month %in% c("September", "October", "November") ~ "Fall",
    TRUE ~ NA_character_  # Handle any invalid month names
  ))

# Remove rows with invalid or missing seasons
cleaned_data <- cleaned_data %>%
  filter(!is.na(season))

# Check the cleaned dataset
str(cleaned_data)
## tibble [82,267 × 8] (S3: tbl_df/tbl/data.frame)
##  $ gamename     : chr [1:82267] "Counter-Strike: Global Offensive" "Dota 2" "PLAYERUNKNOWN'S BATTLEGROUNDS" "Apex Legends" ...
##  $ year         : num [1:82267] 2021 2021 2021 2021 2021 ...
##  $ month        : chr [1:82267] "February" "February" "February" "February" ...
##  $ avg          : num [1:82267] 741013 404832 198958 120983 117742 ...
##  $ gain         : num [1:82267] -2196 -27840 -2290 49216 -24375 ...
##  $ peak         : num [1:82267] 1123485 651615 447390 196799 224276 ...
##  $ avg_peak_perc: num [1:82267] 66 62.1 44.5 61.5 52.5 ...
##   ..- attr(*, "problems")= tibble [156 × 4] (S3: tbl_df/tbl/data.frame)
##   .. ..$ row     : int [1:156] 41807 49939 51313 53083 55133 55160 55239 55247 55391 55494 ...
##   .. ..$ col     : int [1:156] NA NA NA NA NA NA NA NA NA NA ...
##   .. ..$ expected: chr [1:156] "a number" "a number" "a number" "a number" ...
##   .. ..$ actual  : chr [1:156] "NaN%" "NaN%" "NaN%" "NaN%" ...
##  $ season       : chr [1:82267] "Winter" "Winter" "Winter" "Winter" ...
summary(cleaned_data)
##    gamename              year         month                avg           
##  Length:82267       Min.   :2012   Length:82267       Min.   :      0.0  
##  Class :character   1st Qu.:2016   Class :character   1st Qu.:     53.9  
##  Mode  :character   Median :2018   Mode  :character   Median :    203.1  
##                     Mean   :2017                      Mean   :   2749.3  
##                     3rd Qu.:2019                      3rd Qu.:    756.0  
##                     Max.   :2021                      Max.   :1584886.8  
##       gain                peak         avg_peak_perc      season         
##  Min.   :-250249.0   Min.   :      1   Min.   : 0.00   Length:82267      
##  1st Qu.:    -38.3   1st Qu.:    139   1st Qu.:34.63   Class :character  
##  Median :     -1.6   Median :    499   Median :43.90   Mode  :character  
##  Mean   :    -10.3   Mean   :   5419   Mean   :42.58                     
##  3rd Qu.:     22.3   3rd Qu.:   1706   3rd Qu.:51.39                     
##  Max.   : 426446.1   Max.   :3236027   Max.   :88.84

Q1

# Average player count by season
seasonal_trends <- cleaned_data %>%
  group_by(season) %>%
  summarise(avg_players = mean(avg, na.rm = TRUE))

# Plot the results
library(ggplot2)
ggplot(seasonal_trends, aes(x = season, y = avg_players, fill = season)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Player Count by Season", x = "Season", y = "Average Player Count")

## Q2 Setup 1

# Load necessary libraries
library(tidyverse)

# Load the dataset
player_data <- read_csv("C:/Users/kyalg/Downloads/games.csv")
## Rows: 83631 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): gamename, month, avg_peak_perc
## dbl (4): year, avg, gain, peak
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Step 1: Filter out rows with missing or zero avg values
player_data_filtered <- player_data %>%
  filter(!is.na(avg) & avg > 0)

# Step 2: Group by game name and calculate the average avg per game
avg_per_game <- player_data_filtered %>%
  group_by(gamename) %>%
  summarise(avg_player_count = mean(avg, na.rm = TRUE)) %>%  # Calculate mean avg per game
  arrange(desc(avg_player_count))                            # Sort in descending order

# Step 3: Get the top 100 games
top_100_games <- avg_per_game %>%
  slice_head(n = 100)  # Select the top 100 rows

# Display the top 100 games
print(top_100_games)
## # A tibble: 100 × 2
##    gamename                         avg_player_count
##    <chr>                                       <dbl>
##  1 PLAYERUNKNOWN'S BATTLEGROUNDS             482025.
##  2 Dota 2                                    453732.
##  3 Counter-Strike: Global Offensive          308645.
##  4 Cyberpunk 2077                            146416.
##  5 Apex Legends                               83044.
##  6 Destiny 2                                  75035.
##  7 Grand Theft Auto V                         64019.
##  8 Team Fortress 2                            52885.
##  9 Tom Clancy's Rainbow Six Siege             52743.
## 10 Football Manager 2021                      49945.
## # ℹ 90 more rows
# Optional: Save the top 100 games to a CSV file
write_csv(top_100_games, "C:/Users/kyalg/Downloads/top_100_games.csv")

Q2 Setup 2

# Load necessary libraries
library(tidyverse)
library(stringr)

# Load the player data
player_data <- read_csv("C:/Users/kyalg/Downloads/games.csv")
## Rows: 83631 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): gamename, month, avg_peak_perc
## dbl (4): year, avg, gain, peak
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Create the genre mapping
genre_mapping <- tibble(
  gamename = c(
    "playerunknown's battlegrounds", 
    "dota 2",
    "counter-strike: global offensive",
    "cyberpunk 2077",
    "apex legends",
    "destiny 2",
    "grand theft auto v",
    "team fortress 2",
    "tom clancy's rainbow six siege",
    "football manager 2021",
    "football manager 2020",
    "monster hunter: world",
    "ark: survival evolved",
    "fall guys: ultimate knockout",
    "rocket league",
    "warframe",
    "rust",
    "dyson sphere program",
    "sid meier's civilization v",
    "phasmophobia",
    "sid meier's civilization vi",
    "garry's mod",
    "fallout 4",
    "mount & blade ii: bannerlord",
    "dead by daylight",
    "crusader kings iii",
    "terraria",
    "path of exile",
    "payday 2",
    "unturned",
    "paladins",
    "dota underlords",
    "total war: warhammer ii",
    "ea sports fifa 21",
    "sea of thieves",
    "counter-strike",
    "arma 3",
    "nba 2k20",
    "the witcher 3: wild hunt",
    "age of empires ii: definitive edition",
    "total war: three kingdoms",
    "football manager 2018",
    "wallpaper engine",
    "stardew valley",
    "black desert online",
    "z1 battle royale",
    "hearts of iron iv",
    "borderlands 3",
    "farming simulator 19",
    "war thunder",
    "football manager 2017",
    "euro truck simulator 2",
    "left 4 dead 2",
    "dark souls iii",
    "stellaris",
    "tmodloader",
    "europa universalis iv",
    "satisfactory",
    "smite",
    "rimworld",
    "football manager 2016",
    "clicker heroes",
    "red dead online",
    "the elder scrolls online",
    "world of tanks blitz",
    "cities: skylines",
    "borderlands 2",
    "7 days to die",
    "nba 2k21",
    "assassin's creed odyssey",
    "counter-strike: source",
    "star wars: the old republic",
    "don't starve together",
    "efootball pes 2021 season update",
    "fallout 76",
    "football manager 2015",
    "dayz",
    "final fantasy xiv online",
    "factorio",
    "hades",
    "risk of rain 2",
    "divinity: original sin 2",
    "total war: rome ii - emperor edition",
    "efootball pes 2020",
    "sekiro: shadows die twice",
    "ring of elysium",
    "yu-gi-oh! duel links",
    "slay the spire",
    "runescape",
    "the binding of isaac: rebirth",
    "phantasy star online 2",
    "brawlhalla",
    "nba 2k19",
    "conan exiles",
    "vrchat",
    "football manager 2014",
    "age of empires ii (2013)",
    "home behind 2",
    "mount & blade: warband",
    "ngu idle"
  ),
  genre = c(
    "Battle Royale", "MOBA", "FPS", "RPG", "Battle Royale", "FPS", "Open World", "FPS",
    "FPS", "Simulation", "Simulation", "Action RPG", "Survival", "Party", "Sports", 
    "MMO", "Survival", "Simulation", "Strategy", "Horror", "Strategy", "Sandbox", 
    "RPG", "RPG", "Survival Horror", "Strategy", "Sandbox", "Action RPG", "Co-op FPS", 
    "Survival", "Hero Shooter", "Strategy", "Strategy", "Sports", "Adventure", "FPS", 
    "FPS", "Sports", "RPG", "Strategy", "Strategy", "Simulation", "Utility", 
    "Simulation", "MMO", "Battle Royale", "Strategy", "Shooter", "Simulation", 
    "Shooter", "Simulation", "Simulation", "Co-op FPS", "Action RPG", "Strategy", 
    "Sandbox", "Strategy", "Simulation", "MOBA", "Simulation", "Simulation", "Idle", 
    "Open World", "MMO", "Action", "City Builder", "Shooter", "Survival", "Sports", 
    "RPG", "FPS", "MMO", "Survival", "Sports", "RPG", "Simulation", "Survival", 
    "MMO", "Simulation", "Roguelike", "Roguelike", "RPG", "Strategy", "Sports", 
    "Action", "Battle Royale", "Card Game", "Roguelike", "MMO", "Roguelike", "MMO", 
    "Fighting", "Sports", "Survival", "Social", "Simulation", "Strategy", "RPG", 
    "RPG", "Idle"
  )
)

# Standardize game names in both datasets (lowercase and trim whitespace)
player_data <- player_data %>%
  mutate(gamename = str_trim(str_to_lower(gamename)))

genre_mapping <- genre_mapping %>%
  mutate(gamename = str_trim(str_to_lower(gamename)))

# Join the player data with genre mapping
combined_data <- player_data %>%
  left_join(genre_mapping, by = "gamename") %>%
  select(gamename, avg, genre) %>%
  filter(!is.na(genre))  # Filter out rows without a genre

# Save the combined data to a CSV file
write_csv(combined_data, "C:/Users/kyalg/Downloads/game_avg_genre.csv")

# Display the combined data
print(combined_data)
## # A tibble: 5,182 × 3
##    gamename                             avg genre        
##    <chr>                              <dbl> <chr>        
##  1 counter-strike: global offensive 741013. FPS          
##  2 dota 2                           404832. MOBA         
##  3 playerunknown's battlegrounds    198958. Battle Royale
##  4 apex legends                     120983. Battle Royale
##  5 rust                             117742. Survival     
##  6 team fortress 2                  101231. FPS          
##  7 grand theft auto v                90648. Open World   
##  8 tom clancy's rainbow six siege    72383. FPS          
##  9 rocket league                     53723. Sports       
## 10 path of exile                     46920. Action RPG   
## # ℹ 5,172 more rows

Q2

# Load necessary libraries
library(tidyverse)
library(scales)  # For formatting numbers
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
# Load the combined data with game names, average player counts, and genres
game_avg_genre <- read_csv("C:/Users/kyalg/Downloads/game_avg_genre.csv")
## Rows: 5182 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): gamename, genre
## dbl (1): avg
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Remove the "Card Game" genre
game_avg_genre_filtered <- game_avg_genre %>%
  filter(genre != "Card Game")

# Calculate the average player count and standard deviation for each genre
genre_stability <- game_avg_genre_filtered %>%
  group_by(genre) %>%
  summarise(
    avg_player_count = mean(avg, na.rm = TRUE),
    player_count_sd = sd(avg, na.rm = TRUE)
  ) %>%
  arrange(player_count_sd)  # Sort by standard deviation (stability)

# Plot 1: Player Count Stability by Genre
ggplot(genre_stability, aes(x = reorder(genre, player_count_sd), y = player_count_sd, fill = genre)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_y_continuous(labels = comma) +  # Format y-axis with commas
  labs(
    title = "Player Count Stability by Genre",
    x = "Genre",
    y = "Standard Deviation of Player Count"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

# Select the top 10 genres by average player count
top_10_genres <- genre_stability %>%
  arrange(desc(avg_player_count)) %>%
  slice_head(n = 10)

# Plot 2: Top 10 Genres by Average Player Count
ggplot(top_10_genres, aes(x = reorder(genre, avg_player_count), y = avg_player_count, fill = genre)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_y_continuous(labels = comma) +  # Format y-axis with commas
  labs(
    title = "Top 10 Genres by Average Player Count",
    x = "Genre",
    y = "Average Player Count"
  ) +
  theme_minimal() +
  theme(legend.position = "none")