Video games have evolved from a niche form of entertainment into a global cultural phenomenon, captivating millions of players across the globe and sustaining an industry worth billions of dollars. As the gaming industry continues to grow, so does the wealth of data generated by players, developers, and platforms. Through descriptive analysis and visualization, we aim to uncover valuable insights about the characteristics of highly rated games and the developers that make them, discern patterns in player preferences, and gain a richer understanding of this dynamic realm where pixels meet passion. Join me on this gaming odyssey as we explore the data behind the games that have captured our imaginations and challenged our skills.
library(tidyverse)
library(stringr)
library(tidyr)
library(dplyr)
library(ggplot2)
library(lubridate)
library(stats)
library(scales)
library(corrplot)
gamesData <- read.csv("games.csv", header=TRUE)
In the realm of data adventures, every epic journey begins with preparation. Our first quest, Preprocessing, is the essential foundation upon which our data adventure is built. We will clean our raw data into a pristine form, ready to reveal its hidden treasures. Armed with tidy datasets and a sense of purpose, we will be poised to delve deeper into the world of video games. Our adventure begins here, in the realm of Preprocessing, where data becomes knowledge, and knowledge becomes power.
This analysis uses data from Backloggd, a video game collection website that allows users to log their gameplay and stats, review the games they play, and connect with friends and other players. A distinctive feature of this website is that it allows users to create backlogs, which are lists of games they haven’t finished yet, and wishlists of games they want to play.
The games dataset contains information from thousands of
video games dating back to 1980, and includes information like genre,
user reviews, game summary, user wishlist, release dates, and player
engagement. By analyzing this data set, we will gain valuable insight
into player preferences, gaming trends, and market dynamics in this
booming industry.
Now let’s see exactly what we are working with using the
glimpse() function. This gives us a brief overview
of the data frame, including the number of variables, the data types,
and the first few observations of each variable.
glimpse(gamesData)
## Rows: 1,512
## Columns: 14
## $ X <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
## $ Title <chr> "Elden Ring", "Hades", "The Legend of Zelda: Breath …
## $ Release.Date <chr> "Feb 25, 2022", "Dec 10, 2019", "Mar 03, 2017", "Sep…
## $ Team <chr> "['Bandai Namco Entertainment', 'FromSoftware']", "[…
## $ Rating <dbl> 4.5, 4.3, 4.4, 4.2, 4.4, 4.3, 4.2, 4.3, 3.0, 4.3, 4.…
## $ Times.Listed <chr> "3.9K", "2.9K", "4.3K", "3.5K", "3K", "2.3K", "1.6K"…
## $ Number.of.Reviews <chr> "3.9K", "2.9K", "4.3K", "3.5K", "3K", "2.3K", "1.6K"…
## $ Genres <chr> "['Adventure', 'RPG']", "['Adventure', 'Brawler', 'I…
## $ Summary <chr> "Elden Ring is a fantasy, action and open world game…
## $ Reviews <chr> "[\"The first playthrough of elden ring is one of th…
## $ Plays <chr> "17K", "21K", "30K", "28K", "21K", "33K", "7.2K", "9…
## $ Playing <chr> "3.8K", "3.2K", "2.5K", "679", "2.4K", "1.8K", "1.1K…
## $ Backlogs <chr> "4.6K", "6.3K", "5K", "4.9K", "8.3K", "1.1K", "4.5K"…
## $ Wishlist <chr> "4.8K", "3.6K", "2.6K", "1.8K", "2.3K", "230", "3.8K…
We have a lot of numerical data that is stored as characters! This is
not ideal, so before we can do any sort of analysis we need to convert
these columns. Since several columns need the same type of cleaning,
I created two functions that can be applied to any column that needs it.
The first function is k_convert, which converts Plays,
Playing, Backlogs, and Wishlist
from character values such as “4.6K” into their numeric form, 4600. The
second function is clean_col, which removes the square brackets
and single quotes from Genres and Team, leaving
comma-separated lists that are easier to aggregate. I also want to convert the
Release.Date column into a proper date we can analyze;
since only one column is a date, I will apply the conversion to that
column directly.
k_convert <- function(text_vector) {
  converted_values <- numeric(length(text_vector))
  # seq_along() is safe for empty input, unlike 1:length()
  for (i in seq_along(text_vector)) {
    value <- text_vector[i]
    if (grepl("K", value, fixed = TRUE)) {
      # "4.6K" -> 4.6 * 1000
      numeric_part <- as.numeric(str_replace(value, "K", "")) * 1000
    } else {
      numeric_part <- as.numeric(value)
    }
    converted_values[i] <- numeric_part
  }
  return(converted_values)
}
clean_col <- function(text_vector) {
  # Drop the surrounding square brackets, then every single quote
  cleaned_text <- text_vector %>%
    str_replace("\\[", "") %>%
    str_replace("\\]", "") %>%
    str_replace_all(pattern = "'", replacement = "")
  return(cleaned_text)
}
g <- clean_col(gamesData$Genres)
t <- clean_col(gamesData$Team)
r <- gamesData$Release.Date %>%
as.Date(format = "%b %d, %Y")
p <- k_convert(gamesData$Plays)
pl <- k_convert(gamesData$Playing)
b <- k_convert(gamesData$Backlogs)
w <- k_convert(gamesData$Wishlist)
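The same conversion can also be written without an explicit loop. The following is a minimal sketch rather than the code used in this analysis: the name k_convert_vec and the sample values are made up for illustration, and it assumes the only suffix in the data is an upper-case "K", as in the glimpse() output.

```r
# Hypothetical vectorized variant of k_convert (the name is illustrative)
k_convert_vec <- function(text_vector) {
  has_k <- grepl("K", text_vector, fixed = TRUE)
  # Strip the "K" suffix, then parse what is left as a number
  numeric_part <- as.numeric(sub("K", "", text_vector, fixed = TRUE))
  # Scale only the entries that carried the suffix
  ifelse(has_k, numeric_part * 1000, numeric_part)
}

sample_counts <- c("4.6K", "679", "3K")  # made-up sample values
converted <- k_convert_vec(sample_counts)
converted
```

Because grepl(), sub(), and ifelse() all operate on whole vectors, this version avoids the per-element loop while following the same rule.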
Now that we have all our variables converted into a usable format,
let’s combine them into a nice, clean data frame. Using the
glimpse() function again, we can see that the data types
have been successfully changed from character to numeric (and to a date
for release_date). Two columns were also renamed:
* Plays into total_plays because this column sums up the total plays of a particular game.
* Playing into current_playing because this column contains the number of users currently playing a particular game.
games_df <- data.frame(
title = gamesData$Title,
release_date = r,
rating = gamesData$Rating,
genres = g,
team = t,
total_plays = p,
current_playing = pl,
backlogs = b,
wishlist = w
)
glimpse(games_df)
## Rows: 1,512
## Columns: 9
## $ title <chr> "Elden Ring", "Hades", "The Legend of Zelda: Breath of…
## $ release_date <date> 2022-02-25, 2019-12-10, 2017-03-03, 2015-09-15, 2017-…
## $ rating <dbl> 4.5, 4.3, 4.4, 4.2, 4.4, 4.3, 4.2, 4.3, 3.0, 4.3, 4.4,…
## $ genres <chr> "Adventure, RPG", "Adventure, Brawler, Indie, RPG", "A…
## $ team <chr> "Bandai Namco Entertainment, FromSoftware", "Supergian…
## $ total_plays <dbl> 17000, 21000, 30000, 28000, 21000, 33000, 7200, 9200, …
## $ current_playing <dbl> 3800, 3200, 2500, 679, 2400, 1800, 1100, 759, 470, 110…
## $ backlogs <dbl> 4600, 6300, 5000, 4900, 8300, 1100, 4500, 3400, 776, 6…
## $ wishlist <dbl> 4800, 3600, 2600, 1800, 2300, 230, 3800, 3300, 126, 36…
I am very interested in evaluating the genres column because most
video games are listed with several genres. To make
it easier to analyze, I made an additional data frame,
genre_df, that has the genres separated. This splits each
observation (video game) into several rows depending on how many genres
were listed in the genres column. As you can see in the
tibble below, there are “duplicates” of each game where only the genre
column differs. Because of these duplicates, this data frame is not well
suited to evaluating rating, which is why the genre
analysis needs its own separate table.
genre_df <- games_df %>%
separate_rows(genres, sep = ", ") %>%
mutate(genre_dummy = 1)
head(genre_df)
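To make the effect of separate_rows() concrete, here is a small sketch on made-up data (the titles, genres, and ratings are illustrative only, not taken from the data set). It also shows why ratings should not be averaged on the long table.

```r
library(tidyr)

# Made-up two-game example; titles, genres, and ratings are illustrative only
toy_games <- data.frame(
  title  = c("Game A", "Game B"),
  genres = c("Adventure, RPG", "Shooter"),
  rating = c(4.5, 3.0)
)

# Each comma-separated genre becomes its own row; other columns are repeated
toy_long <- separate_rows(toy_games, genres, sep = ", ")
toy_long

# Averaging on the long table counts the two-genre game twice,
# so per-game statistics should come from the original table
mean_long <- mean(toy_long$rating)
mean_games <- mean(toy_games$rating)
```

Here "Game A" appears in two rows of toy_long, so its rating is weighted twice in mean_long but only once in mean_games.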
NA values can skew our results, so it is best to be aware of them
before doing any analysis. Below I created a matrix that displays every
column, and sums up the totals of NA data in each column.
We can see that the rating column has 13 instances of
NA data, and release_date has 3. We will keep
this in mind when performing any analysis with those two columns.
na_matrix <- is.na(games_df)
na_count <- colSums(na_matrix)
print(na_count)
## title release_date rating genres team
## 0 3 13 0 0
## total_plays current_playing backlogs wishlist
## 0 0 0 0
With our data meticulously cleaned and prepared, it’s time to embark on the next level of our journey: Exploratory Data Analysis (EDA). This phase is where we wield our analytical tools and keen insights to uncover hidden treasures within the dataset. As we traverse this level, we’ll craft questions that guide our exploration and lead us to valuable insights about the world of video games.
First, we start with a simple summary table for our
numerical data columns: rating, total_plays,
current_playing, backlogs, and
wishlist. This gives us the range, quartiles, median, and
mean of each variable. Here, we see that game ratings range
from 0.7 to 4.8, with a mean of 3.719.
games_df %>%
select(rating, total_plays,
current_playing, backlogs, wishlist) %>%
summary
## rating total_plays current_playing backlogs
## Min. :0.700 Min. : 0 Min. : 0.0 Min. : 1.0
## 1st Qu.:3.400 1st Qu.: 1800 1st Qu.: 43.0 1st Qu.: 461.8
## Median :3.800 Median : 4200 Median : 112.5 Median :1000.0
## Mean :3.719 Mean : 6254 Mean : 267.4 Mean :1452.6
## 3rd Qu.:4.100 3rd Qu.: 9100 3rd Qu.: 298.0 3rd Qu.:2100.0
## Max. :4.800 Max. :33000 Max. :3800.0 Max. :8300.0
## NA's :13
## wishlist
## Min. : 2.0
## 1st Qu.: 212.0
## Median : 496.0
## Mean : 780.5
## 3rd Qu.:1100.0
## Max. :5400.0
##
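summary() reports quartiles and the mean but not the standard deviation or variance. As a hedged sketch on made-up values (not the actual ratings), these can be computed separately, taking care to drop missing values the same way summary() does.

```r
# Made-up ratings with one missing value, for illustration only
toy_ratings <- c(3.4, 3.8, 4.1, NA, 4.5, 2.9)

# na.rm = TRUE drops the missing value before computing each statistic
rating_sd    <- sd(toy_ratings, na.rm = TRUE)
rating_var   <- var(toy_ratings, na.rm = TRUE)
rating_iqr   <- IQR(toy_ratings, na.rm = TRUE)
rating_range <- range(toy_ratings, na.rm = TRUE)

c(sd = rating_sd, var = rating_var, iqr = rating_iqr)
```

Without na.rm = TRUE, sd() and var() return NA as soon as a single rating is missing, which matters for the rating column here.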
Now let’s explore our categorical data. Here we build a quick
frequency table for game genres, ordered by descending
frequency. Adventure is clearly the most popular genre, followed by RPG
(role-playing game); Shooter, Platform (like Mario or Sonic),
and Indie round out the top five. I will also build one for
the team column to see which game developers have the most
games to their name. We can see that Capcom, Square Enix, and Nintendo
are industry leaders, along with Sega and Ubisoft.
genre_df %>%
count(genres) %>%
arrange(desc(n))
games_df %>%
count(team) %>%
arrange(desc(n)) %>%
head(n=10)
Huzzah! In our quest to unlock insights from the vast world of video games, our journey leads us to data visualization. Through charts and graphs, we will reveal the genres that reign supreme, uncover how genre popularity has evolved over the decades, and spotlight game developers who consistently craft top-rated and highly played titles. Lastly, we will identify the games coveted by players, and which remain untouched, collecting virtual dust on their digital shelves.
The bar chart below shows how often each of the top ten video game genres appears in the data, with Adventure being the clear leader.
genre_freq <- genre_df %>%
count(genres) %>%
arrange(desc(n))
top_10_genres <- head(genre_freq, n=10)
ggplot(top_10_genres, aes(x = reorder(genres, -n), y = n)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  # hjust nudges the count labels past the end of the horizontal bars
  geom_text(aes(label = n), hjust = -0.3, size = 3) +
  coord_flip() +
  # labs() names the aesthetics, so x is still the genre after coord_flip()
  labs(title = "Top 10 Most Popular Video Game Genres",
       x = "Genre",
       y = "Frequency",
       caption = "Source: Backloggd") +
  theme_minimal()
Here we have a line plot of the top 10 game genres over time (1980-2023). Again, since one game can have several genres, this could explain how similar some of the genre lines are, yet it is clear that as time progresses, the diversity of game genres continues to grow.
genre_df <- genre_df %>%
filter(!is.na(release_date))
genre_counts <- genre_df %>%
  mutate(year = year(release_date)) %>%
  group_by(year, genres) %>%
  # na.rm is not a summarise() argument; it would create a column named na.rm
  summarise(count = n(), .groups = "drop")
top_10_genres <- genre_counts %>%
group_by(genres) %>%
summarise(total_count = sum(count)) %>%
top_n(10, wt = total_count) %>%
arrange(desc(total_count))
filtered_genre_counts <- genre_counts %>%
filter(genres %in% top_10_genres$genres)
ggplot(filtered_genre_counts, aes(x = year, y = count, color = genres)) +
  geom_line() +
  labs(title = "Top 10 Genre Popularity Over Time",
       x = "Year",
       y = "Genre Count",
       caption = "Source: Backloggd") +
  theme_minimal() +
  theme(legend.key.size = unit(0.2, "cm"))
Here we can see which video game developers lead the industry by number of games released. Note that these counts are somewhat misleading: they are not aggregated by umbrella company, so each combination of teams is counted separately. For example, Nintendo puts out a lot of games, but its associated teams have their own categories: Nintendo on its own, Game Freak for the Pokemon franchise, and EAD for Donkey Kong, Mario, Zelda, and Animal Crossing. We can also see that Sonic is listed twice. This is definitely something to consider when discussing which companies put out the most games.
dev_freq <- games_df %>%
count(team) %>%
arrange(desc(n))
top_20_devs <- head(dev_freq, n=20)
ggplot(top_20_devs, aes(x = reorder(team, -n), y = n)) +
  geom_bar(stat = "identity", fill = "pink") +
  # hjust nudges the count labels past the end of the horizontal bars
  geom_text(aes(label = n), hjust = -0.3, size = 3) +
  coord_flip() +
  # labs() names the aesthetics, so x is still the team after coord_flip()
  labs(title = "Top 20 Video Game Developers",
       x = "Teams",
       y = "Number of Games Developed",
       caption = "Source: Backloggd") +
  theme_minimal()
This chart is great for seeing how each subsidiary company fares, but I
think the data for the umbrella companies is important to visualize
as well. The first thing we are going to do is aggregate the developer
teams so that all teams under the same umbrella company are grouped together.
I created a new data frame that includes the top 10 unique
developers from the top_20_devs list visualized earlier,
along with an Other category for all other developers, and
will use this new data frame for the analysis that follows.
grouped_df <- games_df %>%
  mutate(
    # case_when() uses the first matching rule, so order matters for
    # games credited to more than one of these companies
    grouped_dev = case_when(
      grepl("nintendo", team, ignore.case = TRUE) ~ "Nintendo",
      grepl("sega", team, ignore.case = TRUE) ~ "Sega",
      grepl("ubisoft", team, ignore.case = TRUE) ~ "Ubisoft",
      grepl("square enix", team, ignore.case = TRUE) ~ "Square Enix",
      grepl("capcom", team, ignore.case = TRUE) ~ "Capcom",
      grepl("bandai", team, ignore.case = TRUE) ~ "Bandai",
      grepl("Bethesda", team, ignore.case = TRUE) ~ "Bethesda",
      grepl("activision", team, ignore.case = TRUE) ~ "Activision",
      grepl("Electronic Arts", team, ignore.case = TRUE) ~ "EA Games",
      grepl("Sony", team, ignore.case = TRUE) ~ "Sony",
      TRUE ~ "Other"
    )
  ) %>%
  group_by(grouped_dev) %>%
  summarise(
    total_plays = sum(total_plays),
    current_playing = sum(current_playing),
    backlogs = sum(backlogs),
    wishlist = sum(wishlist),
    # ratings contain NA values, so drop them before averaging
    rating = mean(rating, na.rm = TRUE)
  ) %>%
  ungroup()
print(grouped_df)
## # A tibble: 11 × 6
## grouped_dev total_plays current_playing backlogs wishlist rating
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Activision 253900 6506 39788 21859 3.55
## 2 Bandai 358384 26566 91571 55845 3.7
## 3 Bethesda 329400 12830 81615 37636 3.68
## 4 Capcom 447673 17110 123056 63509 3.76
## 5 EA Games 485580 11471 75606 34413 3.65
## 6 Nintendo 1980200 67237 331337 188355 3.82
## 7 Other 3894037 190658 986482 510213 3.70
## 8 Sega 353432 19151 112853 55407 3.53
## 9 Sony 715260 25969 174284 115391 3.91
## 10 Square Enix 386389 22373 134435 79040 3.85
## 11 Ubisoft 251156 4407 45270 18509 3.38
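The grouping rule relies on case_when() assigning the first label whose pattern matches. A minimal sketch with made-up team strings (written in the same comma-separated style as the cleaned team column, but not taken from the data) shows what that means for games credited to more than one company.

```r
library(dplyr)

# Made-up team strings for illustration only
toy_teams <- c("Nintendo, Game Freak",
               "Sega, Sonic Team",
               "Supergiant Games",
               "Sega, Nintendo")

toy_groups <- case_when(
  grepl("nintendo", toy_teams, ignore.case = TRUE) ~ "Nintendo",
  grepl("sega", toy_teams, ignore.case = TRUE) ~ "Sega",
  TRUE ~ "Other"  # fallback for anything not matched above
)
toy_groups
```

The last string mentions both companies but is labeled "Nintendo" because that rule is checked first, so a co-developed game is counted under only one umbrella company.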
Here we have a boxplot of the video game ratings for the top 10 game developers. The dark blue line inside each box is the median and indicates the central tendency. The edges of the light blue box are the first and third quartiles, so the length of the box (the interquartile range) measures the variability of the middle 50% of the data. The whiskers extend to the most extreme ratings within 1.5 times the interquartile range of the box edges, and individual points beyond the whiskers are plotted as outliers. Finally, we can gauge skewness from this visual representation as well. A perfectly symmetrical distribution would have the median in the middle of the box, while positive skew has the line to the left (such as the Sonic box plot) and negative skew has the line to the right (such as Square Enix). Negative skew typically means that the median is larger than the mean, while positive skew typically means that the median is smaller than the mean.
top_10_teams <- games_df %>%
group_by(team) %>%
summarise(num_games = n()) %>%
top_n(10, wt = num_games) %>%
arrange(desc(num_games))
filtered_games_df <- games_df %>%
filter(team %in% top_10_teams$team)
filtered_games_df <- na.omit(filtered_games_df)
wrapped_team_names <- str_wrap(filtered_games_df$team, width = 35)
ggplot(filtered_games_df, aes(x = wrapped_team_names, y = rating)) +
geom_boxplot(fill = "skyblue", color = "blue") +
labs(
title = "Distribution of Ratings by \nTop 10 Game Developers",
x = "Game Developer (Team)",
y = "Rating"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
coord_flip()
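The pieces of a boxplot can also be computed by hand. The sketch below uses made-up ratings for a single hypothetical developer and the usual 1.5 × IQR rule for outliers; it is meant to illustrate the definitions, not to reproduce the plot above.

```r
# Made-up ratings for a single developer, for illustration only
toy_ratings <- c(2.0, 3.4, 3.6, 3.8, 3.9, 4.0, 4.2)

q <- quantile(toy_ratings, c(0.25, 0.5, 0.75))
iqr <- q[["75%"]] - q[["25%"]]  # length of the box

# Points beyond 1.5 * IQR from the box edges are drawn as outliers
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr
is_outlier <- toy_ratings < lower_fence | toy_ratings > upper_fence
outliers <- toy_ratings[is_outlier]

# Whiskers end at the most extreme values that are not outliers
whisker_range <- range(toy_ratings[!is_outlier])

# A low outlier pulls the mean below the median (negative skew)
c(mean = mean(toy_ratings), median = median(toy_ratings))
```

In this toy example the single low rating falls outside the lower fence, so it would be plotted as a point and the lower whisker would stop at the next lowest rating.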
Lastly, we have a bar plot of the distribution of total game plays
from our grouped_df where we have all game developer teams
grouped together under their umbrella company, and each bar is colored
by average rating for that developer team. I have excluded the
other category to make it more readable. It is clear that
Nintendo is the industry leader in total game plays, and maintains a
pretty high rating.
filtered_grouped_df <- grouped_df %>%
filter(grouped_dev != "Other")
barplot <- ggplot(filtered_grouped_df,
aes(x = reorder(grouped_dev, total_plays),
y = total_plays, fill = rating)) +
geom_bar(stat = "identity", position = "dodge") +
labs(
title = "Total Plays and Rating by Game Development Team",
x = "Game Development Team",
y = "Total Plays",
fill = "Rating"
) +
theme_minimal() +
theme(legend.position = "top") +
coord_flip()
print(barplot)
Equipped with our sharpest analytical swords, we must face the boss hidden deep within the data dungeon to uncover hidden insights. This stage of our quest involves hypothesis testing, regression analysis, and correlation exploration. With our formidable tools, we will emerge victorious from this statistical battle, with newfound knowledge to share.
For this test I am using filtered_games_df, which keeps each
developer team as its own group rather than merging teams under their
umbrella companies. Since different games are made by different
subsidiary teams, this gives a more detailed analysis.
\(H_0\): There is no difference in mean game ratings among developer teams.
\(H_A\): At least one developer team has a different mean game rating.
filtered_games_df <- games_df %>%
filter(team %in% top_10_teams$team)
filtered_games_df <- na.omit(filtered_games_df)
anova_result <- aov(rating ~ team, data = filtered_games_df)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## team 9 14.92 1.6575 6.596 4.45e-08 ***
## Residuals 173 43.47 0.2513
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Because the p-value is extremely small (4.45e-08), we have very strong
evidence to reject the null hypothesis and conclude that there is a
statistically significant difference in mean ratings among developer
teams. The test does not tell us which teams differ from one another.
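As a check on how the ANOVA table fits together, the F value and p-value can be recomputed from the sums of squares and degrees of freedom printed above. This is a small worked sketch; the eta squared line is an extra effect-size calculation that is not part of the original output.

```r
# Values copied from the ANOVA table above
ss_team  <- 14.92
df_team  <- 9
ss_resid <- 43.47
df_resid <- 173

ms_team  <- ss_team / df_team     # mean square between teams
ms_resid <- ss_resid / df_resid   # mean square within teams
f_value  <- ms_team / ms_resid

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_team, df_resid, lower.tail = FALSE)

# Share of rating variance associated with team (eta squared)
eta_sq <- ss_team / (ss_team + ss_resid)

c(F = f_value, p = p_value, eta_sq = eta_sq)
```

Small differences from the printed table are expected because the sums of squares are rounded to two decimals.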
Here we fit a simple linear regression of game rating on total
plays. The line represents the best-fitting linear relationship between
total game plays and rating, and the shaded area is the confidence band
for that fitted line (by default a 95% confidence interval for the mean
rating, not a prediction interval for individual games). The points are
widely scattered around the line, which suggests that total plays alone
explains little of the variation in ratings: higher rated games are not
necessarily the games that are played the most.
linear_regression_plot <- ggplot(filtered_games_df, aes(x = total_plays, y = rating)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE, color = "#D35400") +
labs(
title = "Linear Regression: Rating vs. Total Game Plays",
x = "Total Game Plays",
y = "Rating"
) +
theme_minimal()
print(linear_regression_plot)
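The shaded band drawn by geom_smooth() describes uncertainty about the fitted line, which is narrower than the range expected for individual games. The sketch below illustrates the difference on simulated data (not the Backloggd data); the coefficients and noise level are arbitrary assumptions.

```r
set.seed(42)  # simulated data, not the Backloggd data set

# Weak linear signal plus noise, loosely mimicking a noisy scatterplot
toy <- data.frame(total_plays = runif(100, 0, 30000))
toy$rating <- 3.5 + 0.00001 * toy$total_plays + rnorm(100, sd = 0.5)

fit <- lm(rating ~ total_plays, data = toy)
r_squared <- summary(fit)$r.squared  # share of rating variance explained

new_game <- data.frame(total_plays = 10000)
# Confidence interval: uncertainty about the mean rating at this play count
conf_int <- predict(fit, new_game, interval = "confidence")
# Prediction interval: range expected for one individual game's rating
pred_int <- predict(fit, new_game, interval = "prediction")

conf_width <- conf_int[, "upr"] - conf_int[, "lwr"]
pred_width <- pred_int[, "upr"] - pred_int[, "lwr"]
```

The prediction interval is always wider than the confidence interval, so many points falling outside the shaded band is expected even when the model fits well; r_squared is a more direct measure of how much of the variation the line explains.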
Now we perform our finishing move, exploring the correlation between
user data (rating, total_plays,
current_playing, backlogs, and
wishlist) from our grouped game developers.
* A value of 1 indicates a perfect positive correlation, meaning as one variable increases, the other also increases linearly.
* A value of -1 indicates a perfect negative correlation, meaning as one value increases, the other decreases linearly.
* A value close to 0 indicates a weak or no linear correlation between the variables.
Here, we see that backlogs and wishlist
have the strongest correlation. Because these are developer-level totals,
this means that developers whose games end up on many users’ wishlists
also tend to have games that are heavily backlogged. These are
the games users want to play. The weakest correlation is
between rating and total plays, suggesting little or no linear
relationship between these two variables, which is consistent with what
we saw in the linear regression.
correlation_matrix <- cor(filtered_grouped_df[,
c("total_plays",
"current_playing",
"backlogs",
"wishlist",
"rating")])
corrplot(
correlation_matrix,
method = "color",
type = "upper",
order = "hclust",
tl.col = "black",
tl.srt = 45,
diag = FALSE
)
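The three reference values for the correlation coefficient can be reproduced with cor() on a few made-up numbers (chosen for illustration, unrelated to the data set).

```r
x <- c(1, 2, 3, 4, 5)  # made-up values

r_pos  <- cor(x, 2 * x + 1)         # perfect positive linear relationship
r_neg  <- cor(x, -3 * x)            # perfect negative linear relationship
r_none <- cor(x, c(2, 1, 3, 1, 2))  # y shows no linear trend in x

c(positive = r_pos, negative = r_neg, none = r_none)
```

Note that the grouped developer table has only ten rows, so correlations computed from it can shift noticeably with a single unusual developer.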
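Congrats, adventurer! In our epic journey through the world of video games, we ventured into the realm of data analysis to gain deeper insights into the gaming universe and explored the relationships between important game-related variables. Some of these insights include:
* Adventure is the most common genre, followed by RPG, Shooter, Platform, and Indie.
* Among the grouped developers, Nintendo leads in total plays while maintaining a high average rating.
* Ratings differ significantly among the top 10 developer teams (ANOVA, p = 4.45e-08).
* Backlogs and wishlist show the strongest correlation, while rating and total plays show the weakest.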
Our quest left us with a deeper understanding of how these variables interact in the gaming world. Armed with this knowledge, we are now better equipped to navigate the challenges and mysteries that await us in our gaming adventures.