Top Games of All Time, hackernoon.com

PRESS START: DATA LOADING…

Video games have evolved from a niche form of entertainment into a global cultural phenomenon with a vast and diverse player base and an industry worth billions of dollars. As the gaming industry continues to grow, so does the wealth of data generated by players, developers, and platforms. Through descriptive analysis and visualization, we aim to uncover valuable insights about the characteristics of highly rated games and the developers that make them, discern patterns in player preferences, and gain a richer understanding of this dynamic realm where pixels meet passion. Join me on this gaming odyssey as we explore the data behind the games that have captured our imaginations and challenged our skills.

Let’s import our libraries and the data set.

library(tidyverse)
library(stringr)
library(tidyr)
library(dplyr)
library(ggplot2)
library(lubridate)
library(stats)
library(scales)
library(corrplot)

gamesData <- read.csv("games.csv", header=TRUE)

LEVEL ONE: PREPROCESSING

In the realm of data adventures, every epic journey begins with preparation. Our first quest, Preprocessing, is the essential foundation upon which our data adventure is built. We will clean our raw data into a pristine form, ready to reveal its hidden treasures. Armed with tidy datasets and a sense of purpose, we will be poised to delve deeper into the world of video games. Our adventure begins here, in the realm of Preprocessing, where data becomes knowledge, and knowledge becomes power.

This analysis uses data from Backloggd, a video game collection website that allows users to log their game play and stats, review the games they play, and connect with friends and other players. A distinctive feature of this website is that it allows users to create backlogs, which are games they haven’t finished yet, and wishlists of games they want to play.

The games dataset contains about 1,500 video game entries dating back to 1980, and includes information like genre, user reviews, game summary, user wishlist, release dates, and player engagement. By analyzing this data set, we will gain valuable insight into player preferences, gaming trends, and market dynamics in this booming industry.

Now let’s see exactly what we are working with by using the glimpse() function. This gives us a brief overview of the data frame, including the number of rows and columns, the data types, and the first few observations of each variable.

glimpse(gamesData)

## Rows: 1,512
## Columns: 14
## $ X                 <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
## $ Title             <chr> "Elden Ring", "Hades", "The Legend of Zelda: Breath …
## $ Release.Date      <chr> "Feb 25, 2022", "Dec 10, 2019", "Mar 03, 2017", "Sep…
## $ Team              <chr> "['Bandai Namco Entertainment', 'FromSoftware']", "[…
## $ Rating            <dbl> 4.5, 4.3, 4.4, 4.2, 4.4, 4.3, 4.2, 4.3, 3.0, 4.3, 4.…
## $ Times.Listed      <chr> "3.9K", "2.9K", "4.3K", "3.5K", "3K", "2.3K", "1.6K"…
## $ Number.of.Reviews <chr> "3.9K", "2.9K", "4.3K", "3.5K", "3K", "2.3K", "1.6K"…
## $ Genres            <chr> "['Adventure', 'RPG']", "['Adventure', 'Brawler', 'I…
## $ Summary           <chr> "Elden Ring is a fantasy, action and open world game…
## $ Reviews           <chr> "[\"The first playthrough of elden ring is one of th…
## $ Plays             <chr> "17K", "21K", "30K", "28K", "21K", "33K", "7.2K", "9…
## $ Playing           <chr> "3.8K", "3.2K", "2.5K", "679", "2.4K", "1.8K", "1.1K…
## $ Backlogs          <chr> "4.6K", "6.3K", "5K", "4.9K", "8.3K", "1.1K", "4.5K"…
## $ Wishlist          <chr> "4.8K", "3.6K", "2.6K", "1.8K", "2.3K", "230", "3.8K…

Describing the data

X - Arbitrary index of the game
Title - The Title of the game (due to multiple platforms and remakes of games, this includes duplicates)
Release Date - Date of release of the game’s first version
Team - Game developer
Rating - Numerical average rating on Backloggd
Times Listed - Number of users who listed this game
Number of Reviews - Number of user reviews
Genres - All genres pertaining to a specific game (some games have several genres)
Summary - Summary of the game, provided by the developers
Reviews - Text based user reviews of games
Plays - Number of users that have played the game
Playing - Number of current users that are playing the game
Backlogs - Number of users who have the game, but have not started playing yet.
Wishlist - Number of users who wish to play the game.

Converting Data

We have a lot of numerical data that is stored as characters! This is not ideal, so before we can do any analysis we need to convert it. Since several columns need the same type of cleaning, I created two functions that can be applied to any column that needs them. The first function is k_convert, which converts Plays, Playing, Backlogs, and Wishlist from shorthand strings such as “4.6K” into their numeric form, 4600. The second function is clean_col, which removes the square brackets and quotation marks from Genres and Team, leaving a plain comma-separated list that is easier to aggregate. I also want to convert the Release.Date column into a proper date we can analyze; since it is the only date column, I will convert it directly rather than write a function.

CREATE FUNCTIONS:

k_convert <- function(text_vector) {
  converted_values <- numeric(length(text_vector))
  
  for (i in seq_along(text_vector)) {  # seq_along() is safe for empty input
    value <- text_vector[i]
    if (grepl("K", value, fixed = TRUE)) {
      numeric_part <- as.numeric(str_replace(value, "K", "")) * 1000
    } else {
      numeric_part <- as.numeric(value)
    }
    converted_values[i] <- numeric_part
  }
  
  return(converted_values)
}

clean_col  <- function(text_vector) {
  cleaned_text <- text_vector %>%
    str_replace("\\[", "") %>%
    str_replace("\\]", "") %>%
    str_replace_all(pattern = "'", replacement = "")
  
  return(cleaned_text)
}
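
The same conversion can be written without an explicit loop. The following is a minimal sketch rather than the version used in this analysis: the name k_convert_vec is hypothetical, it uses base R only, and it assumes the only suffix in the data is an uppercase "K".

```r
# Hypothetical vectorised variant of k_convert (base R only).
# Assumes values look like "4.6K" or "679"; other suffixes are not handled.
k_convert_vec <- function(text_vector) {
  has_k <- grepl("K", text_vector, fixed = TRUE)
  numeric_part <- as.numeric(sub("K", "", text_vector, fixed = TRUE))
  # Multiply by 1000 only where a "K" suffix was present
  ifelse(has_k, numeric_part * 1000, numeric_part)
}

sample_values <- c("4.6K", "679", "3K")
converted <- k_convert_vec(sample_values)
print(converted)
```

Because every step operates on the whole vector at once, this form is usually shorter and faster on long columns, although for about 1,500 rows either approach is fine.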

APPLY FUNCTIONS

g <- clean_col(gamesData$Genres)

t <- clean_col(gamesData$Team)

r <- gamesData$Release.Date %>% 
         as.Date(format = "%b %d, %Y")

p <- k_convert(gamesData$Plays)

pl <- k_convert(gamesData$Playing)

b <- k_convert(gamesData$Backlogs)

w <- k_convert(gamesData$Wishlist)
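
One thing to watch with the date conversion: as.Date() returns NA, without an error, for any string that does not match the supplied format. The sketch below illustrates this with two made-up inputs; the second string is hypothetical and is not taken from the dataset. Note also that %b matches abbreviated month names in the current locale, so this assumes an English locale.

```r
# Illustrative inputs: the second value is a made-up example of a string
# that does not match the "%b %d, %Y" pattern.
raw_dates <- c("Feb 25, 2022", "not a date")

parsed <- as.Date(raw_dates, format = "%b %d, %Y")
print(parsed)

# Inspect the raw strings that failed to parse
unparsed <- raw_dates[is.na(parsed)]
print(unparsed)
```

Checking the unparsed strings this way is a quick method for finding out why a date column ends up with missing values.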

CREATE NEW DATA FRAMES

Now that we have all our columns converted into a usable format, let’s combine them into a nice, clean data frame. Using the glimpse() function again, we can see that the data types have been successfully changed from <chr> to <dbl> or <date>, and that the previous shorthand strings are now plain numbers. I have also renamed two of the columns that were confusing:

I changed Plays into total_plays because this column sums up the total plays of a particular game.
I changed Playing into current_playing because this column contains the number of users currently playing a particular game.

games_df <- data.frame(
    title = gamesData$Title, 
    release_date = r,
    rating = gamesData$Rating,
    genres = g,
    team = t,
    total_plays = p,
    current_playing = pl,
    backlogs = b,
    wishlist = w
    ) 
glimpse(games_df)

## Rows: 1,512
## Columns: 9
## $ title           <chr> "Elden Ring", "Hades", "The Legend of Zelda: Breath of…
## $ release_date    <date> 2022-02-25, 2019-12-10, 2017-03-03, 2015-09-15, 2017-…
## $ rating          <dbl> 4.5, 4.3, 4.4, 4.2, 4.4, 4.3, 4.2, 4.3, 3.0, 4.3, 4.4,…
## $ genres          <chr> "Adventure, RPG", "Adventure, Brawler, Indie, RPG", "A…
## $ team            <chr> "Bandai Namco Entertainment, FromSoftware", "Supergian…
## $ total_plays     <dbl> 17000, 21000, 30000, 28000, 21000, 33000, 7200, 9200, …
## $ current_playing <dbl> 3800, 3200, 2500, 679, 2400, 1800, 1100, 759, 470, 110…
## $ backlogs        <dbl> 4600, 6300, 5000, 4900, 8300, 1100, 4500, 3400, 776, 6…
## $ wishlist        <dbl> 4800, 3600, 2600, 1800, 2300, 230, 3800, 3300, 126, 36…

I am very interested in evaluating the genres column because most video games are listed with several genres. To make this easier to analyze, I made an additional data frame, genre_df, that has the genres separated. This splits each observation (video game) into several rows, one per genre listed in the genres column. As a result, genre_df contains “duplicates” of each game in which only the genres column differs. Because a game with several genres is counted several times, this data frame is not suitable for evaluating ratings, which is why the genre analysis needs its own table.

genre_df <- games_df %>%
  separate_rows(genres, sep = ", ") %>%
  mutate(genre_dummy = 1)
head(genre_df)
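
To see why the long format distorts rating statistics, here is a small base R sketch of the same idea on made-up data. The games, ratings, and object names are hypothetical, and strsplit() stands in for separate_rows() so the example has no package dependencies.

```r
# Made-up data: one game with two genres, one game with a single genre
toy_games <- data.frame(
  title  = c("Game A", "Game B"),
  rating = c(4.5, 3.0),
  genres = c("Adventure, RPG", "Shooter"),
  stringsAsFactors = FALSE
)

# Split the genre strings and repeat each game once per genre
split_genres <- strsplit(toy_games$genres, ", ", fixed = TRUE)
n_genres <- lengths(split_genres)

toy_long <- data.frame(
  title  = rep(toy_games$title, n_genres),
  rating = rep(toy_games$rating, n_genres),
  genres = unlist(split_genres),
  stringsAsFactors = FALSE
)
print(toy_long)

# The multi-genre game is now counted twice, which shifts the mean
wide_mean <- mean(toy_games$rating)
long_mean <- mean(toy_long$rating)
```

In this toy case the mean rating moves from 3.75 to 4 simply because "Game A" appears on two rows.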

NA Search

NA values can skew our results, so it is best to be aware of them before doing any analysis. Below I created a matrix that displays every column, and sums up the totals of NA data in each column. We can see that the rating column has 13 instances of NA data, and release_date has 3. We will keep this in mind when performing any analysis with those two columns.

na_matrix <- is.na(games_df)
na_count <- colSums(na_matrix)
print(na_count)

##           title    release_date          rating          genres            team 
##               0               3              13               0               0 
##     total_plays current_playing        backlogs        wishlist 
##               0               0               0               0
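
As a rough illustration of what "keeping the NA values in mind" means in practice, the sketch below uses a made-up three-row data frame (not the real data) to show how to list the incomplete rows and how a single NA affects mean().

```r
# Made-up data with one missing rating and one missing release date
toy_na <- data.frame(
  title        = c("Game A", "Game B", "Game C"),
  rating       = c(4.5, NA, 3.0),
  release_date = as.Date(c("2022-02-25", "2019-12-10", NA))
)

# Same per-column count as above
toy_na_count <- colSums(is.na(toy_na))
print(toy_na_count)

# Rows that contain at least one NA
incomplete_rows <- toy_na[!complete.cases(toy_na), ]
print(incomplete_rows)

# mean() returns NA unless missing values are dropped explicitly
mean_with_na <- mean(toy_na$rating)
mean_without_na <- mean(toy_na$rating, na.rm = TRUE)
```

This is why the later summaries pass na.rm = TRUE or call na.omit() before working with rating.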

LEVEL TWO: EXPLORATORY DATA ANALYSIS

With our data meticulously cleaned and prepared, it’s time to embark on the next level of our journey: Exploratory Data Analysis (EDA). This phase is where we wield our analytical tools and keen insights to uncover hidden treasures within the dataset. As we traverse this level, we’ll craft questions that guide our exploration and lead us to valuable insights about the world of video games.

First, we produce a simple summary table for our numerical columns: rating, total_plays, current_playing, backlogs, and wishlist. This gives us the minimum, quartiles, median, mean, and maximum of each variable, which describe its center and spread. Here, we see that game ratings range from 0.7 to 4.8, with a mean of 3.719 and a median of 3.8.

games_df %>%
  select(rating, total_plays, 
         current_playing, backlogs, wishlist) %>%
  summary

##      rating       total_plays    current_playing     backlogs     
##  Min.   :0.700   Min.   :    0   Min.   :   0.0   Min.   :   1.0  
##  1st Qu.:3.400   1st Qu.: 1800   1st Qu.:  43.0   1st Qu.: 461.8  
##  Median :3.800   Median : 4200   Median : 112.5   Median :1000.0  
##  Mean   :3.719   Mean   : 6254   Mean   : 267.4   Mean   :1452.6  
##  3rd Qu.:4.100   3rd Qu.: 9100   3rd Qu.: 298.0   3rd Qu.:2100.0  
##  Max.   :4.800   Max.   :33000   Max.   :3800.0   Max.   :8300.0  
##  NA's   :13                                                       
##     wishlist     
##  Min.   :   2.0  
##  1st Qu.: 212.0  
##  Median : 496.0  
##  Mean   : 780.5  
##  3rd Qu.:1100.0  
##  Max.   :5400.0  
##

Now let’s explore our categorical data. Here we build a quick frequency table for game genres, ordered by descending frequency. Adventure is clearly the most common genre, followed by RPG (role playing game), with Shooter, Platform (like Mario or Sonic), and Indie rounding out the top five. I will also build one for the team column, to see which game developers have the most games in the dataset. We can see that Capcom, Square Enix, and Nintendo are industry leaders, along with Sega and Ubisoft.

genre_df %>% 
  count(genres) %>%
  arrange(desc(n))

games_df %>% 
  count(team) %>%
  arrange(desc(n)) %>%
  head(n=10)

OUR QUEST(IONS ABOUT DATA):

Which video game genres are the most popular?
Has genre popularity changed over time?
What game developers consistently put out top rated, or top played games?
What are insights to user preferences regarding game ratings, playtime, and their wishlists?

ACHIEVEMENT UNLOCKED: DATA VISUALIZATION

Huzzah! In our quest to unlock insights from the vast world of video games, our journey leads us to data visualization. Through charts and graphs, we will reveal the genres that reign supreme, uncover how genre popularity has evolved over the decades, and spotlight game developers who consistently craft top-rated and highly played titles. Lastly, we will identify the games coveted by players, and which remain untouched, collecting virtual dust on their digital shelves.

GENRE POPULARITY OVER TIME

Here we have a line plot of the top 10 game genres over time (1980-2023). Again, since one game can have several genres, this could explain how similar some of the genre lines are, yet it is clear that as time progresses, the diversity of game genres continues to grow.

genre_df <- genre_df %>%
  filter(!is.na(release_date))

genre_counts <- genre_df %>%
  mutate(year = year(release_date)) %>%      # release_date is already a Date
  group_by(year, genres) %>%
  summarise(count = n(), .groups = "drop")   # n() has no na.rm; NAs were filtered above

top_10_genres <- genre_counts %>%
  group_by(genres) %>%
  summarise(total_count = sum(count)) %>%
  top_n(10, wt = total_count) %>%
  arrange(desc(total_count))

filtered_genre_counts <- genre_counts %>%
  filter(genres %in% top_10_genres$genres)

ggplot(filtered_genre_counts, aes(x = year, y = count, color = genres)) +
  geom_line() +
  labs(title = "Top 10 Genre Popularity Over Time",
       x = "Year",
       y = "Genre Count",
       caption = "Source: Backloggd") +
  theme_minimal() +
  theme(legend.key.size = unit(0.2, "cm"))
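
One detail worth knowing about grouped counts like the one above: group_by() followed by n() only returns year and genre combinations that actually occur, so a genre with no releases in a given year has no row rather than a zero, and geom_line() draws straight across the gap. The base R sketch below, on made-up data with hypothetical object names, shows one way to make those zeros explicit using table().

```r
# Made-up long-format data: one row per game and genre
toy_genres <- data.frame(
  year   = c(2017, 2017, 2017, 2022),
  genres = c("Adventure", "RPG", "Adventure", "RPG")
)

# table() cross-tabulates every year with every genre, including
# combinations that never occur, which get a count of 0
counts_all <- as.data.frame(
  table(year = toy_genres$year, genres = toy_genres$genres),
  responseName = "count"
)
print(counts_all)
```

Whether to plot the zeros is a judgment call; including them makes quiet years visible instead of interpolating over them.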

Team Developers

Here we can see which video game developers have the most games in the dataset. The counts are a little skewed, because each distinct team entry is counted separately rather than being rolled up under its umbrella company. For example, Nintendo puts out a lot of games, but the studios under its umbrella have their own categories: Nintendo itself, Game Freak for the Pokemon franchise, and EAD for Donkey Kong, Mario, Zelda, and Animal Crossing. We can also see that Sonic is listed twice. This is definitely something to consider when discussing which companies put out the most games.

dev_freq <- games_df %>% 
  count(team) %>%
  arrange(desc(n))

top_20_devs <- head(dev_freq, n=20)

ggplot(top_20_devs, aes(x = reorder(team, -n), y = n)) +
  geom_bar(stat = "identity", fill = "pink") +
  # bars are horizontal after coord_flip(), so nudge the labels with hjust
  geom_text(aes(label = n), hjust = -0.3, size = 3) +
  coord_flip() +  
  # labs() names the aesthetics, not the screen axes: x is still team
  labs(title = "Top 20 Video Game Developers",
       x = "Teams",
       y = "Number of Games Developed",
       caption = "Source: Backloggd") +
  theme_minimal()

This chart is great for seeing how each individual team fares, but I think the umbrella companies are important to visualize as well. The first thing we are going to do is aggregate the developer teams so that each umbrella company's teams are grouped together. I created a new data frame that groups games under 10 major companies drawn from the top_20_devs list visualized earlier, along with an Other category for all other developers, and will use this new data frame for the analysis.

grouped_df <- games_df %>%
  mutate(
    grouped_dev = case_when(
      grepl("nintendo", team, ignore.case = TRUE) ~ "Nintendo",
      grepl("sega", team, ignore.case = TRUE) ~ "Sega",
      grepl("ubisoft", team, ignore.case = TRUE) ~ "Ubisoft",
      grepl("square enix", team, ignore.case = TRUE) ~ "Square Enix",
      grepl("capcom", team, ignore.case = TRUE) ~ "Capcom",
      grepl("bandai", team, ignore.case = TRUE) ~ "Bandai",
      grepl("Bethesda", team, ignore.case = TRUE) ~ "Bethesda",
      grepl("activision", team, ignore.case = TRUE) ~ "Activision",
      grepl("Electronic Arts", team, ignore.case = TRUE) ~ "EA Games",
      grepl("Sony", team, ignore.case = TRUE) ~ "Sony",
      TRUE ~ "Other"
    )
  ) %>%
  group_by(grouped_dev) %>%
  summarise(
    total_plays = sum(total_plays),
    current_playing = sum(current_playing),
    backlogs = sum(backlogs),
    wishlist = sum(wishlist),
    rating = mean(rating, na.rm = TRUE)
  ) %>%
  ungroup()

print(grouped_df)

## # A tibble: 11 × 6
##    grouped_dev total_plays current_playing backlogs wishlist rating
##    <chr>             <dbl>           <dbl>    <dbl>    <dbl>  <dbl>
##  1 Activision       253900            6506    39788    21859   3.55
##  2 Bandai           358384           26566    91571    55845   3.7 
##  3 Bethesda         329400           12830    81615    37636   3.68
##  4 Capcom           447673           17110   123056    63509   3.76
##  5 EA Games         485580           11471    75606    34413   3.65
##  6 Nintendo        1980200           67237   331337   188355   3.82
##  7 Other           3894037          190658   986482   510213   3.70
##  8 Sega             353432           19151   112853    55407   3.53
##  9 Sony             715260           25969   174284   115391   3.91
## 10 Square Enix      386389           22373   134435    79040   3.85
## 11 Ubisoft          251156            4407    45270    18509   3.38
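
The grouping rule above is "first matching pattern wins". As a minimal sketch of that rule, assuming a lookup table instead of a chain of conditions, the base R version below keeps the patterns in one named vector so that adding a company is a one-line change. The names publisher_patterns and group_developer are hypothetical, only three companies are included, and the team strings are made up for illustration.

```r
# Hypothetical lookup table: label = pattern to search for (case-insensitive).
# Order matters: earlier entries take priority over later ones.
publisher_patterns <- c(Nintendo = "nintendo", Sega = "sega", Capcom = "capcom")

group_developer <- function(team, patterns = publisher_patterns) {
  grouped <- rep("Other", length(team))
  # Apply patterns in reverse so higher-priority labels overwrite lower ones
  for (label in rev(names(patterns))) {
    hit <- grepl(patterns[[label]], team, ignore.case = TRUE)
    grouped[hit] <- label
  }
  grouped
}

# Made-up team strings
teams <- c("Nintendo, Game Freak", "Sega, Sonic Team",
           "Supergiant Games", "Capcom, Nintendo")
grouped <- group_developer(teams)
print(grouped)
```

The last string matches two patterns and is assigned to the earlier one in the table, which mirrors how a game credited to two large companies is counted under only one of them in the summary above.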

GAME RATINGS FOR TOP 10 DEVELOPER TEAMS

Here we have a boxplot of the video game ratings for the top 10 game developers. The dark blue line inside each box is the median, which indicates the central tendency. The edges of the light blue box are the first and third quartiles, so the width of the box is the interquartile range (IQR), the spread of the middle 50% of the data. The whiskers extend to the most extreme ratings within 1.5 times the IQR of the box, and individual points beyond the whiskers are plotted as outliers. Finally, we can get a rough sense of skewness from this plot as well. A perfectly symmetrical distribution would have the median in the middle of the box, while positive skew pushes the line toward the left edge (such as the Sonic box plot) and negative skew pushes it toward the right edge (such as Square Enix). Negative skew usually means that the median is larger than the mean, while positive skew usually means that the median is smaller than the mean.

top_10_teams <- games_df %>%
  group_by(team) %>%
  summarise(num_games = n()) %>%
  top_n(10, wt = num_games) %>%
  arrange(desc(num_games))

filtered_games_df <- games_df %>%
  filter(team %in% top_10_teams$team)

filtered_games_df <- na.omit(filtered_games_df)
wrapped_team_names <- str_wrap(filtered_games_df$team, width = 35)

ggplot(filtered_games_df, aes(x = wrapped_team_names, y = rating)) +
  geom_boxplot(fill = "skyblue", color = "blue") +
  labs(
    title = "Distribution of Ratings by \nTop 10 Game Developers",
    x = "Game Developer (Team)",
    y = "Rating"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip()
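
To make the anatomy of a box plot concrete, the sketch below computes the same pieces by hand for a small made-up vector of ratings (the values are hypothetical). It assumes the common convention of quartiles for the box and 1.5 times the IQR for the whisker limits.

```r
# Made-up ratings with one unusually low value
ratings <- c(1.0, 3.4, 3.6, 3.8, 4.0, 4.1, 4.3)

# Box edges: first and third quartiles
q1 <- quantile(ratings, 0.25, names = FALSE)
q3 <- quantile(ratings, 0.75, names = FALSE)
iqr <- q3 - q1

# Whiskers reach the most extreme values inside these fences
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
inside <- ratings[ratings >= lower_fence & ratings <= upper_fence]
whiskers <- range(inside)

# Anything beyond the fences is drawn as an individual outlier point
outliers <- ratings[ratings < lower_fence | ratings > upper_fence]

# A mean below the median hints at negative skew
center <- c(mean = mean(ratings), median = median(ratings))
print(list(box = c(q1, q3), whiskers = whiskers,
           outliers = outliers, center = center))
```

Here the single low rating drags the mean below the median, the pattern described above as negative skew.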

TOP DEV TEAMS, TOTAL GAME PLAYS, AND RATINGS

Lastly, we have a bar plot of total game plays from grouped_df, where all developer teams are grouped under their umbrella company and each bar is colored by that company's average rating. I have excluded the Other category to make the plot more readable. Within this dataset, Nintendo is clearly the leader in total plays, and it maintains a fairly high average rating as well.

filtered_grouped_df <- grouped_df %>%
  filter(grouped_dev != "Other")

barplot <- ggplot(filtered_grouped_df, 
                  aes(x = reorder(grouped_dev, total_plays), 
                      y = total_plays, fill = rating)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Total Plays and Rating by Game Development Team",
    x = "Game Development Team",
    y = "Total Plays",
    fill = "Rating"
  ) +
  theme_minimal() +
  theme(legend.position = "top") +
  coord_flip()


print(barplot)

BOSS BATTLE: STATISTICAL ANALYSIS

Equipped with our sharpest analytical swords, we must face the boss hidden deep within the data dungeon to uncover hidden insights. This stage of our quest involves hypothesis testing, regression analysis, and correlation exploration. With our formidable tools, we will emerge victorious from this statistical battle, with newfound knowledge to share.

TEST 1: HYPOTHESIS TESTING WITH GAME RATINGS

For this test I am using filtered_games_df, which keeps the top 10 developer teams as separate groups rather than rolling them up under umbrella companies. Because different games are made by different subsidiary companies, this gives a more detailed analysis.

\(H_0\): There is no difference in game ratings among developer teams.
\(H_A\): There is a significant difference in game ratings among developer teams.

filtered_games_df <- games_df %>%
  filter(team %in% top_10_teams$team)

filtered_games_df <- na.omit(filtered_games_df)
anova_result <- aov(rating ~ team, data = filtered_games_df)

summary(anova_result)

##              Df Sum Sq Mean Sq F value   Pr(>F)    
## team          9  14.92  1.6575   6.596 4.45e-08 ***
## Residuals   173  43.47  0.2513                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Due to the extremely small p-value (4.45e-08), we have very strong evidence to reject the null hypothesis and conclude that there is a statistically significant difference in mean ratings among the top 10 developer teams. The ANOVA does not tell us which teams differ from one another, only that at least one does.
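
For readers who want to see where the numbers in the ANOVA table come from, the sketch below recomputes the F value and p-value from the mean squares and degrees of freedom printed above. The inputs are the rounded values from that table, so the results agree only up to rounding.

```r
# Values copied from the ANOVA table above (rounded as printed)
ms_team  <- 1.6575   # mean square between teams
ms_resid <- 0.2513   # mean square within teams (residuals)
df_team  <- 9
df_resid <- 173

# F is the ratio of between-group to within-group variance
f_value <- ms_team / ms_resid

# The p-value is the upper tail of the F distribution at that ratio
p_value <- pf(f_value, df_team, df_resid, lower.tail = FALSE)

print(c(F = f_value, p = p_value))
```

An F value well above 1 means the ratings vary more between teams than would be expected from the variation within teams alone.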

TEST 2: LINEAR REGRESSION

Here we are going to model game ratings as a function of total plays. The line represents the best-fitting linear relationship between total game plays and rating. The shaded area is the 95% confidence interval for the fitted line, which is not the same as a prediction interval for individual games. Here we can see that the points are widely scattered around the best-fit line, indicating that total plays explain little of the variation in rating: higher rated games are not necessarily the games that are played the most.

linear_regression_plot <- ggplot(filtered_games_df, aes(x = total_plays, y = rating)) +
  geom_point() +  
  geom_smooth(method = "lm", se = TRUE, color = "#D35400") +  
  labs(
    title = "Linear Regression: Rating vs. Total Game Plays",
    x = "Total Game Plays",
    y = "Rating"
  ) +
  theme_minimal()


print(linear_regression_plot)
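
The shaded band drawn by geom_smooth() describes uncertainty about the fitted line, not about individual games. The sketch below illustrates the difference on simulated data; the numbers are synthetic, generated with a fixed seed, and are not drawn from the games dataset.

```r
# Synthetic data: a weak linear trend plus a lot of noise
set.seed(42)
plays  <- runif(50, min = 0, max = 30000)
rating <- 3.5 + 0.00001 * plays + rnorm(50, sd = 0.5)
toy_fit_data <- data.frame(plays = plays, rating = rating)

fit <- lm(rating ~ plays, data = toy_fit_data)

# Intervals for a hypothetical game with 10,000 plays
new_game  <- data.frame(plays = 10000)
conf_band <- predict(fit, new_game, interval = "confidence")  # for the mean rating
pred_band <- predict(fit, new_game, interval = "prediction")  # for a single game

conf_width <- conf_band[, "upr"] - conf_band[, "lwr"]
pred_width <- pred_band[, "upr"] - pred_band[, "lwr"]

# Share of the variation in rating explained by plays
r_squared <- summary(fit)$r.squared

print(c(conf_width = conf_width, pred_width = pred_width, r_squared = r_squared))
```

The prediction interval is always the wider of the two, so many points falling outside the shaded confidence band is expected and is not by itself a sign of a poor model; the R-squared value is the more direct measure of how much plays explain.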

TEST 3: CORRELATION EXPLORATION

Now we perform our finishing move, exploring the correlation between user data (ratings, total_plays, current_playing, backlogs, and wishlist) from our grouped game developers.
* A value of 1 indicates a perfect positive correlation, meaning as one variable increases, the other also increases linearly.
* A value of -1 indicates a perfect negative correlation, meaning as one value increases, the other decreases linearly.
* A value close to 0 indicates a weak, or no linear correlation between the variables.

Here, we see that backlogs and wishlist have the strongest correlation, meaning that developers whose games end up on many users' wishlists also tend to have many backlogged games. These are the games users want to play. The weakest correlation is between rating and total plays, which suggests that there may be little or no linear relationship between these two variables, consistent with what we saw in the linear regression. Keep in mind that these correlations are computed from only 10 developer-level rows, so they should be read as rough indications.

correlation_matrix <- cor(filtered_grouped_df[,
              c("total_plays", 
                "current_playing", 
                "backlogs", 
                "wishlist", 
                "rating")])

corrplot(
  correlation_matrix,
  method = "color",
  type = "upper",
  order = "hclust",
  tl.col = "black",
  tl.srt = 45,
  diag = FALSE
)
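
To make the three reference values of the correlation coefficient concrete, here is a small sketch on made-up vectors (all names and numbers are hypothetical) that produces a correlation of 1, -1, and 0.

```r
# Made-up variables built to hit the three reference cases
x <- c(1, 2, 3, 4, 5)
toy_cor_data <- data.frame(
  x         = x,
  doubled   = 2 * x,              # rises in lockstep with x
  negated   = -x,                 # falls in lockstep with x
  unrelated = c(2, 5, 1, 5, 2)    # no linear trend with x
)

toy_cor <- cor(toy_cor_data)
print(round(toy_cor, 2))
```

Note that the coefficient only measures linear association: the unrelated column is clearly not constant, yet its correlation with x is zero.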

GAME OVER: FINAL THOUGHTS AND FUTURE QUESTS

Congrats, adventurer! In our epic journey through the world of video games, we ventured into the realm of data analysis to gain deeper insights into the gaming universe and explored the relationships between important game-related variables. Some of these insights include:

Game ratings may not be the best indicator of which games users play the most
Adventure is by far the most common genre, and has been near the top of the genre charts for over 40 years
Nintendo is an industry leader with nearly two million logged plays and also maintains a high average user rating, but Capcom has the most games in the dataset
There is a statistically significant difference in game ratings among developer teams

Our quest left us with a deeper understanding of how these variables interact in the gaming world. Armed with this knowledge, we are now better equipped to navigate the challenges and mysteries that await us in our gaming adventures.

How to Win at Data Analysis

Level Up with Video Game Insights

Tegan McCoy

2023-09-22