All-Time Video Game Statistics

DSA_406_001_SP25_project_vmbaxi

Author

Vaasu Baxi

Published

April 9, 2025

Initial Data Exploration and Motivation

Reading in the Dataset

# Reading in the "Games" data
videogame_data <- read.csv("games.csv")

Inspecting the Dataset

# Checking dimensions, first and last few rows
dim(videogame_data)
[1] 195209     10
head(videogame_data)
       id                              name       date rating reviews plays
1 1000001 Cathode Ray Tube Amusement Device 1947-12-31    3.6      85   149
2 1000002                  Bertie the Brain 1950-08-25    3.0      26    46
3 1000003                               Nim 1951-12-31    1.9       9    26
4 1000004                          Draughts 1952-08-31    2.8       9    30
5 1000005                               OXO 1952-12-31    3.1      22    80
6 1000006                              Pool 1954-06-26    3.1      12    33
  playing backlogs wishlists
1       1       42        72
2       0        9        17
3       0        2         8
4       0        4         7
5       0       11        15
6       0        3         4
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          description
1 The cathode ray tube amusement device is the earliest known interactive electronic game to use a cathode ray tube (CRT). It is a device that records and controls the quality of an electronic signal. The strength of the electronic signals produced by the amusement device is controlled by knobs which influences the trajectory of the CRT's light beam. The device is purely electromechanical and does not use any memory device, computer, or programming. The player turns a control knob to position the CRT beam on the screen; to the player, the beam appears as a dot, which represents a reticle or scope. The player has a restricted amount of time in which to maneuver the dot so that it overlaps an airplane, and then to fire at the airplane by pressing a button. If the beam's gun falls within the predefined mechanical coordinates of a target when the user presses the button, then the CRT beam defocuses, simulating an explosion.
2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           Currently considered the first videogame in history. A tic-tac-toe clone.
3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        The Nimrod was a special purpose computer that played the game of Nim, designed and built by Ferranti and displayed at the Exhibition of Science during the 1951 Festival of Britain. It was the first digital computer exclusively designed to play a game, though its true intention was to illustrate the principles of the (then novel) digital computer for the public.
4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               A game of draughts (a.k.a. checkers) written for the Ferranti Mark 1 computer by Christopher Strachey at the University of Manchester between 1951 and 1952. In the summer of 1952, the program was able to "play a complete game of Draughts at a reasonable speed".
5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   OXO was a computer game developed by Alexander S. Douglas in 1952 for the EDSAC computer, which simulates a game of Noughts and crosses, also sometimes called Tic-tac-toe. OXO is the earliest known game to display visuals on a video monitor. To play OXO, the player would enter input using a rotary telephone controller, and output was displayed on the computer's 35×16 dot matrix cathode ray tube. Each game was played against an artificially intelligent opponent.
6                                                                                               A game of pool (billiards) developed by William George Brown and Ted Lewis in 1954 on the MIDSAC computer, intended primarily to showcase the computing power of the MIDSAC. "The game displayed a 2-inch rendition of the pool cue for the players to line up their shots and ran a simulation of the colliding and ricocheting balls in real-time, implementing a full game of a cue ball and 15 frame balls for two players. Graphics were drawn in real-time on a monochrome 13" point plotting X-Y display, the screen being updated by the program 40 times a second (that is, in a normal in-game situations with 2 to 4 balls moving at once). However, for time constraints, the table and its pockets weren’t drawn by the computer graphics, but were rather drawn manually onto the display using a grease pencil." - Norbert Landsteiner for masswerk.at
tail(videogame_data)
            id                                name date rating reviews plays
195204 1204007                 Hometown Poker Hero          NA       0     0
195205 1204008            Duke Nukem II Remastered         4.5       0     4
195206 1204009                            Deadoxer          NA       0     0
195207 1204010                     Lottso! Express          NA       0     0
195208 1204011              Little White Man vs. X          NA       0     0
195209 1204012 Lost Lagoon 2: Cursed and Forgotten          NA       0     1
       playing backlogs wishlists
195204       0        0         0
195205       0        1         2
195206       0        1         0
195207       0        0         0
195208       0        1         0
195209       0        0         0
                                                                                                                                                                                                                                                                                                                                                                         description
195204                                                                                                                                                                                                                                                                                                                                                                              
195205                                                                                                                                                                                                                       A remaster by Blaze Entertainment as part of Duke Nukem 1+2 Remastered which is set to be released exclusively for Evercade in Duke Nukem Collection 1.
195206                                                                                                                                                                                                                                                                                                                      This is a small horror game in which you have to escape.
195207                                                                                                                                                                                                                                                                                                                                                                              
195208                                                                                                                                                   Attention: Attention, attention, attention!!! Sometimes the game automatically locks enemies, press the tab key to unlock it. (At the beginning, pressing tab is to lock the enemy) Other operations are in the legal terms
195209 Escape from a mysterious and dangerous island in Lost Lagoon 2: Cursed & Forgotten! After waking up shipwrecked, you realize that you have been cursed by powers beyond your understanding. Break the curse quickly because malevolent islanders lurk in the lush landscape and are dead set on making you their next victim! Find a way to return home before it's too late.
nrow(videogame_data)
[1] 195209
ncol(videogame_data)
[1] 10

Brief Description

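The games.csv dataset contains one row per video game, including its title, release date, rating, and counts of reviews, plays, current players, backlogs, and wishlists, plus a text description. Genres and platforms come from separate files that are merged in later.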

The dataset was acquired through Kaggle and can be accessed at this link: https://www.kaggle.com/datasets/gsimonx37/backloggd?select=games.csv

Motivation

Understanding video game trends can help answer key industry questions, such as:

  • What factors contribute to a game’s success?
  • Are certain platforms or genres more successful than others?
  • How do different genre trends change over time?
  • What types of games have higher/lower scores?
  • What are the most and least popular games of all time, and does popularity correlate with critic scores for those types of games?

Hypothesis

A potential hypothesis to test:

  • “Game developers spend millions building hype—but are players actually playing the games they wishlist?”
  • “Games with higher wishlist counts are more likely to end up in user backlogs than to be actively played.”

Ethical Considerations

  • Bias Awareness: Recognizing potential biases, such as favoring certain game genres or assuming popularity equals quality.
  • Data Integrity: Ensuring the dataset is accurate and not misrepresented.
  • Representation: Consideration of indie vs. AAA developers

Table Creation / Data Dictionary

data_dictionary <- data.frame(
  Variable_Name = colnames(videogame_data),
  Class = sapply(videogame_data, class),
  Continuity = sapply(videogame_data, function(x) ifelse(is.numeric(x), "Continuous", "Discrete")),
  Description = c(
    "Game ID Number",
    "Game Title",
    "When it was Released",
    "What is the Viewership Rating",
    "How many Reviews",
    "How many Plays the game has",
    "How many are Playing",
    "How many people put it to the side to play later, Backlogs",
    "How many people added to the Wishlist",
    "Description of The game"
  )
)
summary(videogame_data)
       id              name               date               rating      
 Min.   :1000001   Length:195209      Length:195209      Min.   :-0.30   
 1st Qu.:1051405   Class :character   Class :character   1st Qu.: 2.50   
 Median :1102074   Mode  :character   Mode  :character   Median : 3.00   
 Mean   :1102931                                         Mean   : 2.97   
 3rd Qu.:1154452                                         3rd Qu.: 3.50   
 Max.   :1204012                                         Max.   : 6.20   
                                                         NA's   :131670  
    reviews             plays            playing             backlogs       
 Min.   :  -1.000   Min.   :   -1.0   Min.   :   -1.000   Min.   :   -1.00  
 1st Qu.:   0.000   1st Qu.:    0.0   1st Qu.:    0.000   1st Qu.:    0.00  
 Median :   0.000   Median :    2.0   Median :    0.000   Median :    1.00  
 Mean   :   9.211   Mean   :  135.6   Mean   :    4.331   Mean   :   39.89  
 3rd Qu.:   1.000   3rd Qu.:   10.0   3rd Qu.:    0.000   3rd Qu.:    5.00  
 Max.   :8814.000   Max.   :83000.0   Max.   :10000.000   Max.   :17000.00  
                                                                            
   wishlists        description       
 Min.   :   -1.00   Length:195209     
 1st Qu.:    0.00   Class :character  
 Median :    0.00   Mode  :character  
 Mean   :   20.11                     
 3rd Qu.:    3.00                     
 Max.   :11000.00                     
                                      
data_dictionary
            Variable_Name     Class Continuity
id                     id   integer Continuous
name                 name character   Discrete
date                 date character   Discrete
rating             rating   numeric Continuous
reviews           reviews   integer Continuous
plays               plays   integer Continuous
playing           playing   integer Continuous
backlogs         backlogs   integer Continuous
wishlists       wishlists   integer Continuous
description   description character   Discrete
                                                           Description
id                                                      Game ID Number
name                                                        Game Title
date                                              When it was Released
rating                                   What is the Viewership Rating
reviews                                               How many Reviews
plays                                      How many Plays the game has
playing                                           How many are Playing
backlogs    How many people put it to the side to play later, Backlogs
wishlists                        How many people added to the Wishlist
description                                    Description of The game
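
The summary above reports minimum values of -1 in the count columns and ratings ranging from -0.30 to 6.20. A minimal sketch of how such values could be flagged before analysis is shown here. It uses a small hypothetical data frame (`toy_games`) instead of the real file, and it assumes, without confirmation from the dataset's documentation, that counts should be non-negative and that ratings are meant to fall between 0 and 5.

```r
# Hypothetical toy data that mimics the suspicious values seen in summary()
toy_games <- data.frame(
  id      = 1:5,
  rating  = c(3.6, -0.3, 6.2, NA, 4.0),
  reviews = c(85, -1, 0, 12, 3),
  plays   = c(149, 46, -1, 30, 80)
)

count_cols <- c("reviews", "plays")

# Number of negative values in each count column (NA values are ignored)
negative_counts <- sapply(toy_games[count_cols], function(x) sum(x < 0, na.rm = TRUE))

# Flag ratings outside the ASSUMED 0-5 range; missing ratings are not flagged
rating_out_of_range <- !is.na(toy_games$rating) &
  (toy_games$rating < 0 | toy_games$rating > 5)

negative_counts
sum(rating_out_of_range)
```

Whether flagged rows should be dropped, set to NA, or kept depends on what the -1 values mean in the source, which this sketch does not try to decide.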

Merging Datasets

There were three datasets: games.csv, genres.csv, and platforms.csv. I combined them using dplyr joins. The genre and platform datasets, however, contain multiple rows for the same game ID, whereas games.csv is keyed by game ID. I therefore joined both datasets to games.csv by game ID to add two new columns, genre and platform. Because many games have multiple genres and multiple platforms, the joins create “many-to-many” relationships, and the original 195k-row dataset grew to more than 650k rows, one for each game, genre, and platform combination.
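
If one row per game is preferred, an alternative is to collapse the duplicate rows before joining. The sketch below illustrates that idea on tiny hypothetical data frames (`toy_games`, `toy_genres`); it is not the approach used in the code that follows, which keeps the long format.

```r
library(dplyr)

# Hypothetical stand-ins for games.csv and genres.csv
toy_games  <- data.frame(id = c(1, 2, 3), name = c("A", "B", "C"))
toy_genres <- data.frame(
  id    = c(1, 1, 2),
  genre = c("RPG", "Adventure", "Puzzle")
)

# Collapse to one row per game: all genres pasted into a single string
genres_collapsed <- toy_genres %>%
  group_by(id) %>%
  summarise(genre = paste(sort(unique(genre)), collapse = ", "), .groups = "drop")

# Each game now matches at most one genre row, so the row count is preserved
toy_joined <- toy_games %>%
  left_join(genres_collapsed, by = "id")

toy_joined
```

The trade-off is that a collapsed genre string is harder to group by, so the long format remains more convenient for per-genre summaries.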

# Load necessary libraries
library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)

# Load the datasets
games <- read.csv("games.csv", stringsAsFactors = FALSE)
genres <- read.csv("genres.csv", stringsAsFactors = FALSE)
platforms <- read.csv("platforms.csv", stringsAsFactors = FALSE)

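# Trim whitespace and replace empty strings with NA
# (trimws() returns character, so every column becomes character here;
# the date and numeric columns are converted back below)
games <- games %>% mutate(across(everything(), ~na_if(trimws(.x), "")))
genres <- genres %>% mutate(across(everything(), ~na_if(trimws(.x), "")))
platforms <- platforms %>% mutate(across(everything(), ~na_if(trimws(.x), "")))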

# Convert `date` column to Date format
games$date <- as.Date(games$date, format = "%Y-%m-%d")

# Convert numeric columns
numeric_cols <- c("rating", "reviews", "plays", "playing", "backlogs", "wishlists")
games[numeric_cols] <- lapply(games[numeric_cols], as.numeric)

# Handle missing values:
games <- games %>%
  mutate(
    rating = ifelse(is.na(rating), median(rating, na.rm = TRUE), rating),
    reviews = ifelse(is.na(reviews), 0, reviews),
    plays = ifelse(is.na(plays), 0, plays),
    playing = ifelse(is.na(playing), 0, playing),
    backlogs = ifelse(is.na(backlogs), 0, backlogs),
    wishlists = ifelse(is.na(wishlists), 0, wishlists)
  )

# Merge datasets using left joins
final_data <- games %>%
  left_join(genres, by = "id") %>%
  left_join(platforms, by = "id")
Warning in left_join(., genres, by = "id"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 2 of `x` matches multiple rows in `y`.
ℹ Row 191291 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
Warning in left_join(., platforms, by = "id"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 7 of `x` matches multiple rows in `y`.
ℹ Row 2 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
# Remove duplicate rows
final_data <- distinct(final_data)

# Drop unnecessary columns if needed (like description if it's too verbose)
final_data <- final_data %>% select(-description)

# Remove rows where critical information (like game ID or name) is missing
final_data <- final_data %>% filter(!is.na(id) & !is.na(name))

# View the cleaned data summary
summary(final_data)
      id                name                date                rating      
 Length:655041      Length:655041      Min.   :1947-12-31   Min.   :-0.300  
 Class :character   Class :character   1st Qu.:2012-06-26   1st Qu.: 3.000  
 Mode  :character   Mode  :character   Median :2018-03-21   Median : 3.000  
                                       Mean   :2014-10-11   Mean   : 2.999  
                                       3rd Qu.:2021-11-24   3rd Qu.: 3.100  
                                       Max.   :2030-12-20   Max.   : 6.200  
                                       NA's   :68575                        
    reviews            plays            playing           backlogs      
 Min.   :  -1.00   Min.   :   -1.0   Min.   :   -1.0   Min.   :   -1.0  
 1st Qu.:   0.00   1st Qu.:    1.0   1st Qu.:    0.0   1st Qu.:    0.0  
 Median :   0.00   Median :    6.0   Median :    0.0   Median :    3.0  
 Mean   :  27.49   Mean   :  395.7   Mean   :   14.5   Mean   :  118.2  
 3rd Qu.:   4.00   3rd Qu.:   44.0   3rd Qu.:    1.0   3rd Qu.:   23.0  
 Max.   :8814.00   Max.   :83000.0   Max.   :10000.0   Max.   :17000.0  
                                                                        
   wishlists           genre             platform        
 Min.   :   -1.00   Length:655041      Length:655041     
 1st Qu.:    0.00   Class :character   Class :character  
 Median :    2.00   Mode  :character   Mode  :character  
 Mean   :   55.42                                        
 3rd Qu.:   11.00                                        
 Max.   :11000.00                                        
                                                         

Problem-Solution Story Arc: “From Hype to Play – Why Don’t Players Finish Games?”

The Problem

Across the gaming industry, there’s a known disconnect between player anticipation (as shown by wishlists) and actual engagement (reflected in how many people are playing the game post-launch). Publishers and developers often rely on pre-release metrics like wishlists, ratings, and reviews to forecast a game’s success. But what if these signals are misleading?

Data Investigation

We analyzed a dataset of video games with fields like:

  • wishlists (anticipation/hype)
  • playing (real-time engagement)
  • backlogs (games bought but not yet played)
  • genres, platforms, ratings, and release dates

Visualization of Dataset

  1. Wishlist vs. Playing: Does Hype Translate to Play?

Each point represents a game, plotting how many people have wishlisted it against how many are currently playing it.

Games with high wishlist counts but low playing counts may be anticipated but underplayed titles, or titles whose marketing hype did not convert into playtime. A few games combine high wishlist counts with high playing counts.

library(ggplot2)
library(dplyr)

# Scatterplot of wishlists vs. playing to examine whether hype translates into engagement
ggplot(final_data, aes(x = wishlists, y = playing)) +
  geom_point(alpha = 0.2, color = "darkred") +
  geom_smooth(method = "lm", color = "blue") +
  labs(
    title = "Wishlist vs. Playing: Does Hype Translate to Engagement?",
    x = "Wishlists (Hype)",
    y = "Currently Playing (Engagement)"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
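
Because the merged data holds one row per game, genre, and platform combination, a game with many genres or platforms contributes many identical points to a plot like this one. A minimal sketch of reducing the data to one row per game and summarizing the relationship with a rank correlation is given here. The data frame `toy_final` is a hypothetical stand-in, and the choice of Spearman correlation is an assumption made for illustration because the counts are heavily skewed.

```r
library(dplyr)

# Hypothetical merged data: game 1 appears once per genre/platform pair
toy_final <- data.frame(
  id        = c(1, 1, 1, 2, 3, 4),
  wishlists = c(100, 100, 100, 10, 50, 0),
  playing   = c(20, 20, 20, 1, 2, 0)
)

# Keep one row per game so multi-genre titles are not over-weighted
per_game <- toy_final %>% distinct(id, .keep_all = TRUE)

# Rank-based correlation is less sensitive to a few very large titles
rho <- cor(per_game$wishlists, per_game$playing, method = "spearman")
rho
```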

  2. Backlog by Platform: Where Do Games Go to Die?

This bar chart compares the average backlog count per game across platforms. It highlights the platforms where users are most likely to add games to their backlog (games owned but not played).

Platforms like OnLive Game System or Google Stadia tend to have higher average backlogs, suggesting users may buy in bundles or get overwhelmed by massive libraries, leaving more games unplayed.

# Compute the average backlog count per platform and keep the 10 highest
platform_backlogs <- final_data %>%
  group_by(platform) %>%
  summarize(avg_backlogs = mean(backlogs, na.rm = TRUE)) %>%
  slice_max(avg_backlogs, n = 10)
# Bar plot of the average backlog count per game, by platform
ggplot(platform_backlogs, aes(x = reorder(platform, avg_backlogs), y = avg_backlogs)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Backlogged Games by Platform",
    x = "Platform",
    y = "Average Number of Games Backlogged"
  ) +
  theme_minimal()
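
An average over a platform with only a handful of games can be dominated by one or two popular titles. The sketch below shows one way to report the number of distinct games behind each average and to exclude very small platforms. The data frame `toy_final` and the threshold `min_games` are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical merged data; game 1 is duplicated as it would be across genres
toy_final <- data.frame(
  id       = c(1, 1, 2, 3, 4, 5, 5),
  platform = c("Big", "Big", "Big", "Big", "Tiny", "Big", "Tiny"),
  backlogs = c(10, 10, 20, 30, 500, 40, 40)
)

min_games <- 3  # hypothetical threshold, not taken from the analysis

platform_summary <- toy_final %>%
  distinct(id, platform, .keep_all = TRUE) %>%   # one row per game-platform pair
  group_by(platform) %>%
  summarise(
    n_games      = n_distinct(id),
    avg_backlogs = mean(backlogs),
    .groups      = "drop"
  ) %>%
  filter(n_games >= min_games) %>%               # drop platforms with few games
  arrange(desc(avg_backlogs))

platform_summary
```

In this toy example the "Tiny" platform has the largest raw average but is backed by only two games, so it is filtered out.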

  3. Relationship between Rating and Number of Reviews

This plot shows how a game’s rating relates to the number of user reviews it receives. Each dot represents a game.

Games with higher ratings often receive more reviews, though there are some highly rated games with low visibility. This suggests critical acclaim can drive engagement, but marketing and popularity still play a role.

# Find the game with the largest number of reviews so it can be labeled on the plot
max_point <- final_data[which.max(final_data$reviews), ]

# Creates a scatterplot for the ratings vs reviews of each game
ggplot(final_data, aes(x = rating, y = reviews)) +
  geom_point(alpha = 0.6, color = "pink") +
  geom_smooth(method = "lm", color = "red") +
  geom_text(data = max_point, aes(label = name), vjust = -1, color = "black", size = 3) +
  labs(title = "Ratings vs. Number of Reviews",
       x = "Game Rating",
       y = "Number of Reviews") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

  4. Average Playing by Genre: Which Genres Keep Players Engaged?

This chart shows the average number of players actively playing games in each genre.

Genres like Turn Based Strategy, Brawler, and RPG consistently rank high in active players, showing that these genres have broad and lasting appeal. Developers in these spaces may see stronger ongoing engagement.

# Compute the average playing count per genre and keep the 10 highest
genre_playing <- final_data %>%
  group_by(genre) %>%
  summarize(avg_playing = mean(playing, na.rm = TRUE)) %>%
  slice_max(avg_playing, n = 10)
# Bar plot of the average playing count by genre
ggplot(genre_playing, aes(x = reorder(genre, avg_playing), y = avg_playing)) +
  geom_col(fill = "darkorange") +
  coord_flip() +
  labs(
    title = "Average Playing Count by Genre",
    x = "Genre",
    y = "Average Currently Playing"
  ) +
  theme_minimal()

  5. Rating vs. Playing: Do Better Games Get Played More?

This visualization explores whether higher-rated games are played more frequently by plotting ratings against current playing counts.

There’s a visible trend: higher-rated games often have more players, supporting the idea that quality drives playtime. However, there are exceptions — suggesting other factors like marketing, platform, or genre also influence engagement.

# Scatterplot of rating vs. current players to see whether higher-rated games are played more
ggplot(final_data, aes(x = rating, y = playing)) +
  geom_point(alpha = 0.5, color = "darkgreen") +
  geom_smooth(method = "lm", color = "red") +
  labs(
    title = "Do Higher-Rated Games Get Played More?",
    x = "Rating",
    y = "Currently Playing"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

  6. Monthly Game Release Trends: Does Timing Affect Success?

This shows the number of games released per month, revealing industry patterns in launch timing.

Spikes appear around holiday season months (Oct–Dec), aligning with consumer spending cycles. Strategic release timing could give games a competitive edge during high-demand periods.

library(lubridate)

# Groups the release date by months
monthly_release <- final_data %>%
  mutate(Month = month(date, label = TRUE)) %>%
  group_by(Month) %>%
  summarize(Count = n())

# Bar plot of the number of releases per month, to show which months see the most releases
ggplot(monthly_release, aes(x = Month, y = Count)) +
  geom_col(fill = "purple") +
  labs(
    title = "Which Months See the Most Game Releases?",
    x = "Month",
    y = "Number of Games Released"
  ) +
  theme_minimal()
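
Two caveats apply to this count. First, the merged data has one row per game, genre, and platform combination, so `n()` counts rows rather than games. Second, several of the earliest entries shown by `head()` carry a December 31 date (for example 1947-12-31 and 1951-12-31), which may be a placeholder for releases where only the year is known; that interpretation is an assumption, not something the dataset states. The sketch below counts distinct games per month and tallies December 31 dates separately, using a hypothetical data frame `toy_final`.

```r
library(dplyr)
library(lubridate)

# Hypothetical merged data: games 1 and 3 are repeated across genres/platforms
toy_final <- data.frame(
  id   = c(1, 1, 2, 3, 3, 3),
  date = as.Date(c("2020-12-31", "2020-12-31", "2021-03-15",
                   "2019-12-31", "2019-12-31", "2019-12-31"))
)

monthly_games <- toy_final %>%
  distinct(id, date) %>%            # one row per game
  filter(!is.na(date)) %>%
  mutate(
    Month     = month(date),        # numeric month, 1-12
    is_dec_31 = month(date) == 12 & day(date) == 31
  ) %>%
  group_by(Month) %>%
  summarise(games = n(), dec_31 = sum(is_dec_31), .groups = "drop")

monthly_games
```

If a large share of December releases fall on the 31st in the real data, the holiday-season spike would deserve a closer look before drawing conclusions about release timing.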

This faceted line plot shows how average playing and backlogs change over the years, broken down by genre.

Some genres like RPGs and Strategy show increasing backlog trends, suggesting players buy them but may not start or finish them due to time demands. In contrast, Puzzle games often maintain high play rates, reflecting fast, accessible gameplay.

# Add a year column
final_data <- final_data %>%
  mutate(Year = year(date)) %>%
  filter(!is.na(Year), !is.na(genre))

# Choose top genres by overall count
top_genres <- final_data %>%
  count(genre, sort = TRUE) %>%
  top_n(6, n) %>%
  pull(genre)

# Summarize playing and backlogs by year and genre
trend_data <- final_data %>%
  filter(genre %in% top_genres) %>%
  group_by(Year, genre) %>%
  summarize(
    avg_playing = mean(playing, na.rm = TRUE),
    avg_backlogs = mean(backlogs, na.rm = TRUE),
    .groups = 'drop'
  )

# Reshape data for plotting
library(tidyr)
trend_long <- trend_data %>%
  pivot_longer(cols = c(avg_playing, avg_backlogs), names_to = "Metric", values_to = "Value")

# Plots
ggplot(trend_long, aes(x = Year, y = Value, color = Metric)) +
  geom_line(linewidth = 1) +
  facet_wrap(~ genre, scales = "free_y") +
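  scale_color_manual(values = c("avg_playing" = "darkgreen", "avg_backlogs" = "darkred"),
                     # Set breaks explicitly so the labels line up with the right series;
                     # the default break order is alphabetical (avg_backlogs first)
                     breaks = c("avg_playing", "avg_backlogs"),
                     labels = c("Playing", "Backlogs")) +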
  labs(
    title = "Trends in Game Engagement vs. Backlogs by Genre (Over Time)",
    x = "Year",
    y = "Average Count",
    color = "Metric"
  ) +
  theme_minimal() +
  theme(strip.text = element_text(face = "bold", size = 12))

# Scatter plot: Wishlist vs Backlog with trend line
ggplot(games, aes(x = wishlists, y = backlogs)) +
  geom_point(alpha = 0.5, color = "#4B9CD3") +
  geom_smooth(method = "lm", se = TRUE, color = "#D7263D") +
  labs(
    title = "Wishlist vs. Backlogs: When Interest Doesn’t Equal Action",
    x = "Number of Wishlists",
    y = "Number of Backlogs"
  ) +
  theme_minimal(base_size = 14)
`geom_smooth()` using formula = 'y ~ x'

# Step 1: Keep only unique rows per game-genre pair
unique_genre_wishlist <- final_data %>%
  select(id, genre, wishlists) %>%
  distinct(id, genre, .keep_all = TRUE)

# Step 2: Summarize total wishlists per genre
genre_wishlist_summary <- unique_genre_wishlist %>%
  group_by(genre) %>%
  summarise(total_wishlists = sum(wishlists, na.rm = TRUE)) %>%
  arrange(desc(total_wishlists))

# Step 3: Print summary
print(genre_wishlist_summary)
# A tibble: 23 × 2
   genre               total_wishlists
   <chr>                         <dbl>
 1 Adventure                   2578066
 2 RPG                         1458065
 3 Indie                        962595
 4 Shooter                      761335
 5 Platform                     665188
 6 Puzzle                       614949
 7 Brawler                      396359
 8 Strategy                     374641
 9 Simulator                    366933
10 Turn Based Strategy          214250
# ℹ 13 more rows

Exploratory Analysis Questions

What is the distribution of current players across games?

Why it matters: Understanding how actively games are played gives insight into player engagement patterns and potential hits vs misses in the market.

Approach: By creating a scatter plot of the playing variable vs the wishlist variable we can see how the current player counts are distributed across games.

Findings: Most games have low current player counts, and only a few titles are very popular. In general, games with more wishlists also have more current players.

How does rating relate to number of reviews?

Why it matters: It helps gauge whether higher-rated games attract more attention, and how user sentiment (rating) connects to user volume (reviews).

Approach: Used a scatter plot of rating vs. reviews, with a smooth trend line to analyze their relationship. Also used another scatter plot to show the rating vs playing relationship.

Findings: Higher-rated games tend to receive more reviews. However, some well-rated games still have few reviews, likely due to niche appeal or lower popularity. The rating vs. playing graph is almost identical to the rating vs. reviews graph.

How does average player count differ across genres?

Why it matters: Identifies which genres have strong, sustained engagement and which might struggle to retain players.

Approach: Grouped data by genre, calculated average playing, and visualized via bar chart. Also made a faceted line plot to show how playing and backlogs changed for some genres across the years.

Findings: Genres like Turn-Based Strategy, MOBA, and Brawler have the highest average playing counts. There also appears to be a large spike across all genres for games released around the 2010s, followed by a steep decline until the 2020s.

What happens to backlogs when the number of wishlists increases?

Why it matters: Reveals whether user interest (wishlists) translates into actual playtime or leads to games sitting in backlogs.

Approach: Created a scatter plot of wishlists vs. backlogs to analyze the trend.

Findings: As wishlist counts increase, so do backlog counts, suggesting that users intend to play these games but delay starting. This indicates a possible disconnect between intent and follow-through.

Hypothesis Generation

Hypothesis:

Games with higher wishlist counts are more likely to end up in user backlogs than be actively played.

Based on EDA observations: We observed a positive correlation between wishlists and backlogs. However, games that are wishlisted more often do not always show high current play counts.

Why it matters to stakeholders: Game developers and marketers need to understand if their audience is committing to playing or just collecting titles. If games are wishlisted but not played, marketing efforts or game design might need adjustment to improve engagement post-purchase.

Testing Data: Time-stamped user data tracing the path from wishlist to purchase to play time, along with game length, game complexity, and possibly completion rate.

We can test this with a regression model, such as a logistic regression predicting whether a wishlisted game is played or backlogged, and include control variables like genre, platform, and rating.
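A minimal sketch of such a logistic regression, using base R's `glm()`, might look like the following. Everything in `toy_events` is simulated: the `played` outcome (1 = played, 0 = backlogged), the coefficients in the data-generating step, and the genre labels are all assumptions made for illustration, and platform is left out to keep the example short.

```r
# Simulated wishlist records; every column and value here is hypothetical.
set.seed(406)
n <- 1000
toy_events <- data.frame(
  wishlists = round(rlnorm(n, meanlog = 4, sdlog = 1)),
  rating    = round(runif(n, min = 1, max = 5), 1),
  genre     = factor(sample(c("RPG", "MOBA", "Brawler"), n, replace = TRUE))
)

# Assumed data-generating process: heavily wishlisted games are less likely to be played.
log_odds <- 0.5 - 0.4 * log1p(toy_events$wishlists) + 0.3 * toy_events$rating
toy_events$played <- rbinom(n, size = 1, prob = plogis(log_odds))

# Logistic regression: played (1) vs. backlogged (0), controlling for rating and genre.
fit <- glm(
  played ~ log1p(wishlists) + rating + genre,
  data = toy_events,
  family = binomial()
)

# Exponentiated coefficients are odds ratios; a value below 1 means lower odds of play.
odds_ratios <- exp(coef(fit))
print(round(odds_ratios, 3))
```

With real user-level data, an odds ratio below 1 on the wishlist term would support the hypothesis, while a value near or above 1 would not. The log transform on wishlists is a design choice to tame the heavy right skew of count data.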

If TRUE: Developers might consider streamlined tutorials or shorter game formats to convert interest into play. If FALSE: Users may be wishlisting for other reasons such as sales or collecting.

Stakeholder Communication

In this exploratory analysis of a video game dataset, we investigated trends in player engagement, game ratings, and genre-specific behavior to uncover insights for developers and publishers. Our dataset included more than 500k games with variables such as current player counts, ratings, reviews, wishlists, and genre/platform data.

Our analysis began by assessing the distribution of player engagement. We found that most games have very low current player counts, with only a few dominating the space. When examining genre-level engagement, Turn-Based Strategy, MOBA, Brawler, and RPG titles consistently showed higher average player counts, while Strategy and Simulator games tended to attract less attention over time.

We also discovered a positive relationship between a game’s rating and the number of reviews, suggesting that better-rated games tend to spark more user engagement and conversation. However, some well-rated games had few reviews, indicating potential missed marketing opportunities or niche targeting.

A key pattern emerged around wishlists and backlogs. While games with more wishlists were expected to be played more, we observed that these games often landed in users’ backlogs instead. This implies that high user interest doesn’t always lead to immediate play, possibly due to game length, complexity, or user time constraints.

From this, we developed a hypothesis:

“Games with higher wishlist counts are more likely to end up in user backlogs than be actively played.”

This insight matters for studios looking to convert hype into actual gameplay. If true, developers might need to simplify onboarding experiences or create stronger incentives to start games post-purchase.

For further research, we recommend collecting time-series user behavior (e.g., wishlist to playtime) and analyzing the role of game duration or tutorial design. Testing this hypothesis with logistic regression could quantify how strongly wishlist activity predicts backlog behavior across genres.

Our findings suggest that while user interest is high, converting that interest into action is a key challenge. Stakeholders can use these insights to fine-tune game design, marketing campaigns, and release strategies.

# Scatter plot: Wishlist vs Backlog with trend line
ggplot(games, aes(x = wishlists, y = backlogs)) +
  geom_point(alpha = 0.5, color = "#4B9CD3") +
  geom_smooth(method = "lm", se = TRUE, color = "#D7263D") +
  labs(
    title = "Wishlist vs. Backlogs: When Interest Doesn’t Equal Action",
    x = "Number of Wishlists",
    y = "Number of Backlogs"
  ) +
  theme_minimal(base_size = 14)
`geom_smooth()` using formula = 'y ~ x'

The next graph shows which of the available genres have the most player engagement, which can help companies decide which types of games to focus on.

# Calculate average current players by genre
avg_playing_by_genre <- final_data %>%
  group_by(genre) %>%
  summarise(avg_playing = mean(playing, na.rm = TRUE)) %>%
  arrange(desc(avg_playing))

# Bar plot
ggplot(avg_playing_by_genre, aes(x = reorder(genre, avg_playing), y = avg_playing)) +
  geom_col(fill = "#00A6A6") +
  coord_flip() +
  labs(
    title = "Which Game Genres Keep Players Coming Back?",
    x = "Game Genre",
    y = "Average Number of Current Players"
  ) +
  theme_minimal(base_size = 14)