Video Game Sales Analysis

Author

Trevor Reule

Introduction:

Since I was a kid, video games have been one of my passions. Recently, as a business student, I have become more interested in the financial side of the video game industry. The data set I will analyze is the vgsales CSV file (Video Game Sales), which can be found on Kaggle. It is interesting because it includes over 16,500 games along with their release year, publisher, platform, genre, and sales. The games included are mostly from 1980 to 2016. I will perform several analyses using my favorite game series as reference points, with the goal of telling a story about the financial side of the video game industry.

Link to dataset: https://myxavier-my.sharepoint.com/:x:/r/personal/reulet_xavier_edu/_layouts/15/Doc.aspx?sourcedoc=%7B3654DF9D-0C83-40C0-870C-467DE30D7709%7D&file=vgsales.csv&action=default&mobileredirect=true

Data Dictionary:

Rank - The game’s position in terms of global sales

Name - Game title

Platform - The console or system the game was released on

Year - Year the game was released

Genre - The type of game

Publisher - The company that published the game

NA_Sales - North American unit sales in millions

EU_Sales - European unit sales in millions

JP_Sales - Japan unit sales in millions

Other_Sales - Unit sales from all other parts of the world in millions

Global_Sales - Total unit sales in millions

# Load in the tidyverse library for functionality
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load in the vgsales data set
vg_sales <- read_csv("vgsales.csv")
Rows: 16598 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Name, Platform, Year, Genre, Publisher
dbl (6): Rank, NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
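
The column specification above lists Year as a character column, most likely because missing years are stored as a text placeholder rather than left blank. Here is a minimal sketch of one way to handle this at load time. It assumes the placeholder is the string "N/A" (an assumption; check the file) and uses a small inline CSV with made-up rows in place of vgsales.csv:

```r
library(readr)

# Hypothetical three-row stand-in for vgsales.csv (names and values are made up)
toy_csv <- I("Name,Year,Global_Sales\nGame A,2006,1.5\nGame B,N/A,0.4\nGame C,1985,2.0\n")

# Treat "N/A" as missing so that Year is parsed as a number
# (the "N/A" placeholder is an assumption about the real file)
toy_sales <- read_csv(toy_csv, na = c("", "NA", "N/A"), show_col_types = FALSE)

# Inspect the parsed column: numeric, with NA where the placeholder was
str(toy_sales$Year)
```

With a numeric Year, the later filters and plots would treat the axis as continuous, so the rest of the code might need small adjustments.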

First, let’s take a look at some data featuring the most famous video game character, Mario:

# Visualization 1:
# Detect games with 'Mario' in the title
mario_games <- vg_sales %>%
  filter(str_detect(Name, "(?i)Mario"))

# Group by release year and calculate total worldwide sales for each year
mario_sales_by_year <- mario_games %>%
  group_by(Year) %>%
  reframe(total_sales = sum(Global_Sales, na.rm = TRUE))

# Create bar plot
ggplot(mario_sales_by_year, aes(x = Year, y = total_sales)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Worldwide Unit Sales of Mario Games Each Year a Game Was Released",
       x = "Year",
       y = "Total Worldwide Sales (millions)") +
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

This graph of worldwide unit sales of Mario games, by release year, helps explain Nintendo’s marketing strategy, especially around new console launches. In 1985, Mario saw quick success with Super Mario Bros., an innovative game for home consoles. It did so well because it was the flagship title for the hot new console of the time, the Nintendo Entertainment System. Mario game sales were also unusually high in 1996 (Super Mario 64), when the Nintendo 64 was released, with a smaller spike in 2002 (Super Mario Sunshine) shortly after the GameCube was released and another spike in 2007 (Super Mario Galaxy) shortly after the Wii was released. There seems to be a correlation between high worldwide unit sales for Mario games and the release dates of Nintendo consoles. Since Mario is Nintendo’s main character, it makes sense that the company uses his likeness for launch titles to boost initial sales of a new console.
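
One way to check which title drives each yearly spike is to pull the best-selling Mario game per year along with its share of that year's total. The following is only a sketch: `toy_mario` is a hypothetical stand-in for `mario_games`, and its titles and sales figures are made up.

```r
library(dplyr)

# Hypothetical rows standing in for mario_games (titles and sales are made up)
toy_mario <- tibble(
  Name = c("Mario Title A", "Mario Title B", "Mario Title C", "Mario Title D"),
  Year = c("1996", "1996", "2007", "2007"),
  Global_Sales = c(10.0, 2.5, 1.0, 8.0)
)

# For each year, keep the single best-selling title and its share of that year's total
top_title_by_year <- toy_mario %>%
  group_by(Year) %>%
  mutate(year_total = sum(Global_Sales)) %>%
  slice_max(Global_Sales, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  mutate(share_of_year = Global_Sales / year_total) %>%
  select(Year, Name, Global_Sales, share_of_year)

top_title_by_year
```

Running the same pipeline on the real `mario_games` frame would show whether a single launch-window title accounts for most of a spike or whether several releases contribute.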

Next, let’s take a look at the video game industry as a whole and see how popular video games are in general:

# Visualization 2
# Create a data frame that finds the total worldwide video game unit sales per year
worldwide_sales <- vg_sales %>%
  # Year is stored as text, so compare against strings; drop the post-2016 years
  filter(!(Year %in% c("2017", "2020"))) %>%
  group_by(Year) %>%
  summarise(total_worldwide_sales = sum(Global_Sales, na.rm = TRUE)) %>%
  arrange(Year)

# Plot the yearly totals as points
ggplot(worldwide_sales, aes(x = Year, y = total_worldwide_sales)) +
  geom_point() +
  labs(title = "Total Worldwide Unit Sales of Video Games Over Time",
       x = "Year",
       y = "Total Worldwide Unit Sales (millions)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

I first filtered out the years after 2016 that appear in the data (2017 and 2020) because the sales records seem to stop at 2016. The total worldwide unit sales per year tell a story about the overall trajectory of the video game industry from 1980 to 2016. From 1980 to 1995, unit sales remained relatively constant. That changed when the Nintendo 64 came out in 1996; with the advent of 3D gaming, unit sales rose by about 100 million copies. Sales then increased steadily until peaking in 2008, in the middle of the ‘console wars’ era. This is when the Wii, PlayStation 3, and Xbox 360 were at their most popular, which suggests that gaming itself was at its most popular at this time. Since the data only goes to 2016, we are missing an entire new generation of gaming. However, we can assume that after the steep drop at the end of the series, unit sales would have gone back up as the new generation rolled out, because there were new consoles for everyone to play on.
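
To put a number on a jump like the one described above, the year-over-year change can be computed from the yearly totals. This is a sketch under stated assumptions: `toy_yearly` is a hypothetical stand-in for `worldwide_sales`, and its values are made up.

```r
library(dplyr)

# Hypothetical yearly totals standing in for worldwide_sales (values are made up)
toy_yearly <- tibble(
  Year = c("1994", "1995", "1996", "1997"),
  total_worldwide_sales = c(80, 90, 200, 210)
)

# Difference between each year's total and the previous year's total
yearly_change <- toy_yearly %>%
  arrange(Year) %>%
  mutate(change = total_worldwide_sales - dplyr::lag(total_worldwide_sales))

# Year with the largest increase over the previous year
biggest_jump <- yearly_change %>%
  slice_max(change, n = 1, with_ties = FALSE)

biggest_jump
```

The first year has no predecessor, so its `change` is `NA`. If the real data has gaps between years, the difference would span more than one year and should be read with that in mind.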

Next, let’s see which types of games are most popular in each region:

# Visualization 3
# Find genre sales by region
genre_sales <- vg_sales %>%
  group_by(Genre) %>%
  summarise(
    "Total NA Unit Sales" = sum(NA_Sales, na.rm = TRUE),
    "Total EU Unit Sales" = sum(EU_Sales, na.rm = TRUE),
    "Total JP Unit Sales" = sum(JP_Sales, na.rm = TRUE),
    "Total Other Unit Sales" = sum(Other_Sales, na.rm = TRUE)
  )

# Reshape the data to long format
genre_sales_long <- genre_sales %>%
  pivot_longer(cols = starts_with("total"), names_to = "Region", values_to = "Total_Sales") 

# Create the graph
ggplot(genre_sales_long, aes(x = Genre, y = Total_Sales, fill = Region)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Genre Popularity by Region",
       x = "Genre",
       y = "Total Unit Sales (millions)",
       fill = "Region") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

There are some interesting contrasts between regions on this graph. Action games are wildly popular in North America and Europe but less so in Japan. Role-Playing, by contrast, is more popular in Japan than in the other regions. Sports, shooter, and platform games are also much more popular in North America than elsewhere, though this could partly reflect North America’s easy access to the internet and gaming, as well as its much larger population. Still, it is interesting to see how unpopular these same genres are in Japan, even though the country is a hub for video games as a whole.
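
Because raw totals partly reflect market size, one option is to compare each genre's share of its own region's sales instead. The sketch below assumes a long-format frame shaped like `genre_sales_long`; `toy_long` and its numbers are made up for illustration.

```r
library(dplyr)

# Hypothetical long-format data shaped like genre_sales_long (values are made up)
toy_long <- tibble(
  Genre = rep(c("Action", "Role-Playing"), times = 2),
  Region = rep(c("Total NA Unit Sales", "Total JP Unit Sales"), each = 2),
  Total_Sales = c(80, 20, 10, 30)
)

# Each genre's share of its own region's total, so regions of different size are comparable
genre_share <- toy_long %>%
  group_by(Region) %>%
  mutate(share = Total_Sales / sum(Total_Sales)) %>%
  ungroup()

genre_share
```

Plotting `share` instead of `Total_Sales` on the y-axis would show genre preference within each region rather than absolute volume.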

Let’s look at which consoles have the highest game unit sales:

# Visualization 4
# Create a frame that finds unit sales per console
console_sales <- vg_sales %>%
  group_by(Platform) %>%
  summarise(total_sales = sum(Global_Sales, na.rm = TRUE)) %>%
  arrange(total_sales)

# Create plot that shows video game unit sales for each console listed
ggplot(console_sales, aes(x = reorder(Platform, total_sales), y = total_sales)) +
  geom_bar(stat = "identity") +
  labs(title = "Total Game Sales by Console",
       x = "Console",
       y = "Total Sales (millions)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

These results show how popular the ‘console wars’ period mentioned earlier was, as well as the longevity of the PS2. The Xbox 360, PlayStation 3, and Wii were all competing at the same time, but the PlayStation 2 came out all the way back in 2000, five years before the Xbox 360, and lasted until 2013. Funnily enough, the PlayStation 2 was basically competing with its own sister product, the PlayStation 3, and both consoles were still wildly successful. It’s also interesting that even though Nintendo is one of the pioneers of gaming, it doesn’t have the console with the highest game sales, although the Wii does come close. The data may also be incomplete, because it may not account for unit sales on newer consoles such as the Switch, PlayStation 4, or Xbox One.
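
The longevity point can be examined in the data by finding the first and last release year recorded for each platform. This sketch assumes Year is stored as text with an "N/A" placeholder for missing values (an assumption about the file); `toy_platforms` and its rows are made up.

```r
library(dplyr)

# Hypothetical rows standing in for vg_sales (values are made up)
toy_platforms <- tibble(
  Platform = c("PS2", "PS2", "PS2", "X360", "X360"),
  Year = c("2000", "2006", "N/A", "2005", "2010"),
  Global_Sales = c(1.0, 2.0, 0.5, 1.5, 2.5)
)

# Span of release years and total sales per platform
platform_span <- toy_platforms %>%
  # Non-numeric placeholders such as "N/A" become NA
  mutate(year_num = suppressWarnings(as.integer(Year))) %>%
  group_by(Platform) %>%
  summarise(
    first_year = min(year_num, na.rm = TRUE),
    last_year = max(year_num, na.rm = TRUE),
    total_sales = sum(Global_Sales, na.rm = TRUE)
  ) %>%
  mutate(years_active = last_year - first_year + 1)

platform_span
```

Note that this measures the years in which games in the data set were released for a platform, not how long the console itself was manufactured or sold.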

Here’s some analysis on my favorite game franchise, Fallout:

# Visualization 5
# Create data frame that includes only the main fallout games
fallout_games <- vg_sales %>%
  filter(Name == "Fallout 3" | Name == "Fallout: New Vegas" | Name == "Fallout 4")

# Sum Global and North American sales for each game across platforms
fallout_sales_total <- fallout_games %>%
  group_by(Name) %>%
  reframe(total_sales = sum(Global_Sales, na.rm = TRUE),
          total_NA_Sales = sum(NA_Sales, na.rm = TRUE))

# Create a graph to visualize the difference in Global and NA sales
ggplot(fallout_sales_total, aes(x = Name, y = total_sales, fill = Name)) +
  geom_col() +
  geom_text(aes(label = paste0("NA Sales: ", total_NA_Sales)), vjust = 1, color = "black") +
  labs(title = "Unit Sales of Fallout Games Worldwide Compared to North America",
       y = "Global Unit Sales (millions)",
       fill = "Game Title") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Looking at North American unit sales of my current favorite game franchise, Fallout, it is interesting to see how large a share of the units sold were in North America. The games take place in America and parody the American government, which may explain why North Americans are so drawn to the series. This is especially true for Fallout 3 and Fallout: New Vegas, where more than half of all units sold were in North America.
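
The "more than half" comparison can be made explicit by dividing North American sales by global sales for each title. This is a minimal sketch: `toy_fallout` mirrors the columns of `fallout_sales_total`, but its titles and values are made up.

```r
library(dplyr)

# Hypothetical frame with the same columns as fallout_sales_total (values are made up)
toy_fallout <- tibble(
  Name = c("Game X", "Game Y"),
  total_sales = c(4.0, 5.0),
  total_NA_Sales = c(3.0, 2.0)
)

# North American share of global unit sales, and whether it exceeds one half
na_share <- toy_fallout %>%
  mutate(
    na_share = total_NA_Sales / total_sales,
    majority_na = na_share > 0.5
  )

na_share
```

Applied to the real frame, the `na_share` column could replace the raw "NA Sales" figure in the bar labels so the proportion is visible directly.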

Fallout Steam Analysis

Continuing with my love of Fallout, I decided to use an API to pull user reviews from the Steam website and see whether there is a difference in sentiment between Fallout 3 and Fallout 4. It will be interesting to see which one people prefer, the ‘classic’ or the ‘new and improved’.

Link to API data: https://myxavier-my.sharepoint.com/:x:/r/personal/reulet_xavier_edu/_layouts/15/Doc.aspx?sourcedoc=%7B24DA8CB9-8669-41B3-8545-A32B865FE3FD%7D&file=fallout_combined_reviews.csv&action=default&mobileredirect=true

# Visualization with API data
# First we need to load some libraries for our sentiment analysis
library(tidytext)
Warning: package 'tidytext' was built under R version 4.3.3
library(textdata)
Warning: package 'textdata' was built under R version 4.3.3
# Then, load the data file I saved to my OneDrive and copied to my computer
fallout_combined_reviews <- read_csv("fallout_combined_reviews.csv")
Rows: 137 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): reviews.language, reviews.review, game
dbl (14): reviews.recommendationid, reviews.author$steamid, reviews.author$n...
lgl  (6): reviews.voted_up, reviews.steam_purchase, reviews.received_for_fre...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# New 'tidy' data frame with review words
tidy_fallout_reviews <- fallout_combined_reviews %>%
  unnest_tokens(word, reviews.review)

# Remove stop words (and filter out "fallout")
tidy_fallout_reviews <- tidy_fallout_reviews %>%
  anti_join(stop_words) %>% 
  filter(word != "fallout")
Joining with `by = join_by(word)`
# Perform sentiment analysis using the 'bing' lexicon
sentiment_analysis <- tidy_fallout_reviews %>%
  inner_join(get_sentiments("bing"), by = "word")

# Classify words as positive or negative
sentiment_analysis <- sentiment_analysis %>%
  mutate(sentiment_category = ifelse(sentiment == "positive", "Positive", "Negative"))

# Count the positive and negative words for each game
word_counts <- sentiment_analysis %>%
  count(game, sentiment_category)

# Visualize positive and negative word frequency per game
ggplot(word_counts, aes(x = sentiment_category, y = n, fill = sentiment_category)) +
  geom_bar(stat = "identity") +
  facet_wrap(~game, scales = "free") +
  labs(title = "Frequency of Positive and Negative Words per Game",
       x = "Sentiment Category",
       y = "Frequency",
       fill = "Sentiment Category") +
  theme_minimal()

Note: I filtered the word “fallout” out of the analysis because the lexicon classifies it as negative, while in these reviews it is only used to refer to the games.

In general, it seems that for both Fallout 3 and Fallout 4, the sentiment words in the reviews are mostly negative. In fact, it seems that the average frequency of negative words per review is similar for both games, even though the number of reviews collected for each game is not equal. There could be multiple reasons for this. One is that Steam users tend to swear, even in a positive context, which could produce more negative results than intended. Another is that users may be having issues with Steam itself rather than with the game. Some reviews on the Fallout 3 Steam page say the reviewers cannot get the game to run; this is a problem with running the game through Steam and does not reflect their actual sentiment toward the Fallout games.
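
Since the number of reviews differs between the two games, raw word counts are hard to compare directly. One hedged alternative is to look at the share of sentiment words that are negative within each game. The sketch below uses `toy_words`, a made-up stand-in for `sentiment_analysis` with one row per matched word.

```r
library(dplyr)

# Hypothetical one-row-per-word data shaped like sentiment_analysis (values are made up)
toy_words <- tibble(
  game = c(rep("Game X", 5), rep("Game Y", 4)),
  sentiment_category = c("Negative", "Negative", "Negative", "Positive", "Positive",
                         "Negative", "Negative", "Positive", "Positive")
)

# Count words per game and category, then convert counts to within-game shares
sentiment_share <- toy_words %>%
  count(game, sentiment_category) %>%
  group_by(game) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

sentiment_share
```

Shares within each game are comparable even when one game has many more reviews than the other, which raw frequencies are not.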

Conclusion

All in all, the video game industry appears to be cyclical: as new consoles roll out, video game unit sales climb sharply, as seen in visualizations 1 and 2. Genre popularity differs greatly between regions, as seen in visualization 3, so marketers must carefully analyze their target demographics before creating and releasing a game in a given genre. For example, Japanese players are unlikely to respond to a new Call of Duty release the way American players do, but they will seemingly be ecstatic when the next Final Fantasy game comes out. Competition between consoles also seems to encourage a rise in unit sales, as the console wars era was the peak of video game unit sales, as seen in visualizations 2 and 4. The vg_sales data set I used is not comprehensive, so data from the newest generation of consoles is missing. If the trajectory follows the cycle, unit sales should have increased at the release of the PS5 and Xbox Series X. As for Fallout, I am not swayed by the amount of negativity the games seemingly got from Steam users. There is a wide margin of error, especially because Steam users tend to use vulgar language. Sifting through the reviews individually, it actually seems that most of the negativity, especially for Fallout 3, is because of Steam optimization problems.