Video games have been one of my passions since I was a kid. Recently, as a business student, I have become more interested in the financial side of the video game industry. The data set I will analyze is vgsales (Video Game Sales), a CSV file that can be found on Kaggle. This set is interesting because it includes over 16,500 games along with their release year, publisher, platform, genre, and sales. The games included were released from 1980 to 2016. I will perform various analyses using my favorite game series as references, hopefully telling a story about the financial side of the video game industry.
Link to dataset: https://myxavier-my.sharepoint.com/:x:/r/personal/reulet_xavier_edu/_layouts/15/Doc.aspx?sourcedoc=%7B3654DF9D-0C83-40C0-870C-467DE30D7709%7D&file=vgsales.csv&action=default&mobileredirect=true
Data Dictionary:
Rank - The game's position in terms of global sales
Name - Game title
Platform - Console the game was released on
Year - Year the game came out
Genre - The type of game
Publisher - Company that published the game
NA_Sales - North American unit sales in millions
EU_Sales - European unit sales in millions
JP_Sales - Japan unit sales in millions
Other_Sales - Unit sales from all other parts of the world in millions
Global_Sales - Total unit sales in millions
# Load in the tidyverse library for functionality
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.3 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load in the vgsales data set
vg_sales <- read_csv("vgsales.csv")
Rows: 16598 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Name, Platform, Year, Genre, Publisher
dbl (6): Rank, NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
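The column specification above shows Year being read as text (chr) rather than as a number. A likely cause, which is an assumption here and not something the output confirms, is that some rows hold a non-numeric placeholder such as "N/A". The sketch below shows one way to convert such a column on a small made-up table (toy_sales and its values are hypothetical stand-ins for vg_sales):

```r
library(dplyr)

# Toy stand-in for vg_sales; the titles and values are made up for illustration.
toy_sales <- tibble(
  Name = c("Game A", "Game B", "Game C"),
  Year = c("2006", "N/A", "1985"),
  Global_Sales = c(1.5, 0.2, 3.0)
)

# Convert Year to integer; non-numeric entries such as "N/A" become NA.
# suppressWarnings() hides the coercion warning that as.integer() raises for them.
toy_sales_clean <- toy_sales %>%
  mutate(Year = suppressWarnings(as.integer(Year)))

# Count how many rows lost their year in the conversion.
n_missing_year <- sum(is.na(toy_sales_clean$Year))
```

With a numeric Year, plots can use a continuous x-axis, and rows without a usable year can be dropped explicitly instead of appearing as their own category.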
First, let’s take a look at some data featuring the most famous video game character, Mario:
# Visualization 1:
# Detect games with 'Mario' in the title
mario_games <- vg_sales %>%
  filter(str_detect(Name, "(?i)Mario"))

# Group by release year and calculate total worldwide sales for each year
mario_sales_by_year <- mario_games %>%
  group_by(Year) %>%
  reframe(total_sales = sum(Global_Sales, na.rm = TRUE))

# Create bar plot
ggplot(mario_sales_by_year, aes(x = Year, y = total_sales)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Worldwide Unit Sales of Mario Games Each Year a Game Was Released",
       x = "Year",
       y = "Total Worldwide Sales (millions)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
I feel that this graph, which shows the worldwide unit sales of Mario games for each year one was released, reflects Nintendo’s marketing strategy, especially around new console launches. In 1985, Mario saw quick success with Super Mario Bros., an innovative game for home consoles. It did so well because it was the flagship title for the hot new console of the time, the Nintendo Entertainment System. Mario game sales were also unusually high in 1996 (Super Mario 64), when the Nintendo 64 was released, with a smaller spike in 2002 (Super Mario Sunshine) shortly after the GameCube was released, and another spike in 2007 (Super Mario Galaxy) shortly after the Wii was released. There seems to be a correlation between high worldwide unit sales for Mario games and the release dates of Nintendo consoles. Since Mario is Nintendo’s main character, it makes sense that the company uses his likeness for launch titles in order to boost initial sales of a new console.
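To check which individual title is behind each yearly spike, the best-selling Mario game per year can be pulled out directly. This is only a sketch on made-up rows: toy_sales, the game names, and the sales figures are hypothetical, and slice_max() is one reasonable choice, not necessarily the approach used for the chart above.

```r
library(dplyr)
library(stringr)

# Toy stand-in for vg_sales; names and sales figures are made up.
toy_sales <- tibble(
  Name = c("Mario Demo A", "Mario Demo B", "Other Game", "mario demo C"),
  Year = c("1985", "1985", "1985", "1996"),
  Global_Sales = c(40, 2, 5, 11)
)

# Keep Mario titles (case-insensitive), then take the top seller in each year.
top_mario_by_year <- toy_sales %>%
  filter(str_detect(Name, "(?i)mario")) %>%
  group_by(Year) %>%
  slice_max(Global_Sales, n = 1, with_ties = FALSE) %>%
  ungroup()
```

Running the same pipeline on vg_sales would list one title per year, which makes it easy to see whether a spike comes from a single launch title or from several games.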
Next, let’s take a look at the video game industry as a whole, and see how popular video games are in general:
# Visualization 2
# Create a data frame that finds the total worldwide video game unit sales per year
worldwide_sales <- vg_sales %>%
  group_by(Year) %>%
  filter(!(Year %in% c(2017, 2020))) %>%
  summarise(total_worldwide_sales = sum(Global_Sales, na.rm = TRUE)) %>%
  arrange(Year)

# Create a point plot (one point per year) to represent the data
ggplot(worldwide_sales, aes(x = Year, y = total_worldwide_sales)) +
  geom_point() +
  labs(title = "Total Worldwide Unit Sales of Video Games Over Time",
       x = "Year",
       y = "Total Worldwide Unit Sales (millions)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
I first filtered out 2017 and 2020 because the data set has almost no sales recorded for those years; the data effectively stops at 2016. The total worldwide unit sales per year tell a story about the overall trajectory of the video game industry from 1980 to 2016. From 1980 to 1995, unit sales remained relatively constant. That changed when the Nintendo 64 came out in 1996: with the advent of 3D gaming, unit sales rose by about 100 million copies. Sales then increased steadily until their peak in 2008, in the middle of the ‘console wars’ era. This is when the Wii, PlayStation 3, and Xbox 360 were at their most popular, which suggests that gaming itself was at its most popular at this time. Since the data only goes to 2016, we are missing an entire new generation of gaming. However, we can assume that after the steep drop, unit sales would go back up as the new generation rolled out, because there were new consoles for everyone to play on.
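A claim such as "unit sales rose by about 100 million copies" can be read off the chart, but it can also be computed as a year-over-year difference. The sketch below does this with dplyr's lag() on a small made-up table; toy_yearly and its numbers are hypothetical and only mirror the shape of worldwide_sales.

```r
library(dplyr)

# Toy yearly totals in millions of units; the numbers are made up.
toy_yearly <- tibble(
  Year = 1994:1997,
  total_worldwide_sales = c(80, 90, 200, 205)
)

# Year-over-year change: this year's total minus the previous year's total.
yearly_change <- toy_yearly %>%
  arrange(Year) %>%
  mutate(change = total_worldwide_sales - lag(total_worldwide_sales))

# The year with the largest increase over the year before it.
largest_jump <- yearly_change %>%
  slice_max(change, n = 1)
```

The first year has no previous year to compare against, so its change is NA. On the real data, Year would need to be numeric first for arrange() to order the years reliably.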
Next, let’s see which types of games are most popular in each region:
# Visualization 3
# Find genre sales by region
genre_sales <- vg_sales %>%
  group_by(Genre) %>%
  summarise("Total NA Unit Sales" = sum(NA_Sales, na.rm = TRUE),
            "Total EU Unit Sales" = sum(EU_Sales, na.rm = TRUE),
            "Total JP Unit Sales" = sum(JP_Sales, na.rm = TRUE),
            "Total Other Unit Sales" = sum(Other_Sales, na.rm = TRUE))

# Reshape the data to long format
genre_sales_long <- genre_sales %>%
  pivot_longer(cols = starts_with("total"), names_to = "Region", values_to = "Total_Sales")

# Create the graph
ggplot(genre_sales_long, aes(x = Genre, y = Total_Sales, fill = Region)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Genre Popularity by Region",
       x = "Genre",
       y = "Total Unit Sales (millions)",
       fill = "Region") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
There are some interesting contrasts between regions in this graph. Action games are wildly popular in North America and Europe, but less so in Japan. This makes sense, because Role-Playing, arguably the opposite of action, is the genre that stands out most in Japan compared with the other regions. Sports, shooter, and platform games are also much more popular in North America than in other regions, but this could partly reflect North America’s larger population and easy access to the Internet and gaming. Still, it is interesting to see how unpopular these same genres are in Japan, even though it is a hub for video games as a whole.
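Because raw unit sales favor regions with larger markets, one way to compare genre preferences more fairly is to look at each genre's share of its own region's total. The following is a minimal sketch on made-up numbers; toy_genre_sales is hypothetical, and only two regions are included to keep it short.

```r
library(dplyr)
library(tidyr)

# Toy genre totals in millions of units; the numbers are made up.
toy_genre_sales <- tibble(
  Genre = c("Action", "Role-Playing", "Sports"),
  NA_Sales = c(50, 20, 30),
  JP_Sales = c(10, 25, 15)
)

# Reshape to long format, then divide each genre by its region's total sales.
genre_share <- toy_genre_sales %>%
  pivot_longer(cols = ends_with("_Sales"),
               names_to = "Region",
               values_to = "Sales") %>%
  group_by(Region) %>%
  mutate(share = Sales / sum(Sales)) %>%
  ungroup()
```

Plotting share instead of Sales would put every region on the same 0 to 1 scale, so a genre's bar reflects preference within the region and not the size of the market.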
Let’s look at which consoles have the most game unit sales:
# Visualization 4
# Create a frame that finds unit sales per console
console_sales <- vg_sales %>%
  group_by(Platform) %>%
  summarise(total_sales = sum(Global_Sales, na.rm = TRUE)) %>%
  arrange(total_sales)

# Create plot that shows video game unit sales for each console listed
ggplot(console_sales, aes(x = reorder(Platform, total_sales), y = total_sales)) +
  geom_bar(stat = "identity") +
  labs(title = "Total Game Sales by Console",
       x = "Console",
       y = "Total Sales (millions)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
These results reflect the vast popularity of the ‘console wars’ era mentioned earlier, as well as the longevity of the PS2. The Xbox 360, PlayStation 3, and Wii all competed at the same time, but the PlayStation 2 came out all the way back in 2000, five years before the Xbox 360, and lasted until 2013. Funnily enough, the PlayStation 2 was basically competing with its own sister product, the PlayStation 3, and both consoles were still wildly successful. It is also interesting that even though Nintendo is one of the pioneers of gaming, it does not have the console with the highest game sales, although the Wii comes close. The totals for newer consoles are also incomplete: because the data ends in 2016, it misses the Switch and captures only the early years of the PlayStation 4 and Xbox One.
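Total sales per console depend on how many games were released for it as well as on how well each game sold. A rough sketch of how the two could be separated, using a made-up table (toy_sales and the console names are hypothetical):

```r
library(dplyr)

# Toy stand-in for vg_sales; console names and sales are made up.
toy_sales <- tibble(
  Platform = c("Console A", "Console A", "Console B"),
  Global_Sales = c(2, 4, 5)
)

# For each console: number of titles, total sales, and average sales per title.
console_summary <- toy_sales %>%
  group_by(Platform) %>%
  summarise(
    n_titles = n(),
    total_sales = sum(Global_Sales, na.rm = TRUE),
    sales_per_title = total_sales / n_titles
  ) %>%
  arrange(desc(total_sales))
```

In this toy case Console A leads on total sales while Console B sells more per title, which is the kind of difference a single bar chart of totals would hide.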
Here’s some analysis on my favorite game franchise, Fallout:
# Visualization 5
# Create data frame that includes only the main Fallout games
fallout_games <- vg_sales %>%
  filter(Name == "Fallout 3" | Name == "Fallout: New Vegas" | Name == "Fallout 4")

# Total the Global and North American sales for each game across platforms
fallout_sales_total <- fallout_games %>%
  group_by(Name) %>%
  reframe(total_sales = sum(Global_Sales, na.rm = TRUE),
          total_NA_Sales = sum(NA_Sales, na.rm = TRUE))

# Create a graph to visualize the difference in Global and NA sales
ggplot(fallout_sales_total, aes(x = Name, y = total_sales, fill = Name)) +
  geom_col() +
  geom_text(aes(label = paste0("NA Sales: ", total_NA_Sales)), vjust = 1, color = "black") +
  labs(title = "Unit Sales of Fallout Games Worldwide Compared to North America",
       y = "Global Unit Sales (millions)",
       fill = "Game Title") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Looking at the North American unit sales of my current favorite game franchise, Fallout, it is interesting to see how large a percentage of the units sold for these games were in North America. The games take place in America and parody the American government, which would explain why North Americans are so drawn to the series. This is especially true for Fallout 3 and Fallout: New Vegas, where more than half of all units sold were in North America.
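The "more than half" observation can be made explicit by dividing North American sales by global sales for each game. This is a sketch on made-up rows, not the real Fallout figures: toy_fallout, the titles, and the numbers are all hypothetical, with one title split across two platform rows to mirror the multi-platform releases.

```r
library(dplyr)

# Toy rows in millions of units; titles and figures are made up.
toy_fallout <- tibble(
  Name = c("Title A", "Title A", "Title B"),
  NA_Sales = c(2, 1, 1),
  Global_Sales = c(4, 2, 4)
)

# Sum across platforms per title, then compute the North American share.
na_share <- toy_fallout %>%
  group_by(Name) %>%
  summarise(
    total_sales = sum(Global_Sales, na.rm = TRUE),
    total_NA_Sales = sum(NA_Sales, na.rm = TRUE)
  ) %>%
  mutate(na_share = total_NA_Sales / total_sales)
```

A na_share above 0.5 means more than half of a game's units were sold in North America.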
Fallout Steam Analysis
Continuing with my love of Fallout, I used an API to pull user reviews from the Steam website to see whether sentiment differs between Fallout 3 and Fallout 4. It will be interesting to see which one people prefer, the ‘classic’ or the ‘new and improved’.
Link to API data: https://myxavier-my.sharepoint.com/:x:/r/personal/reulet_xavier_edu/_layouts/15/Doc.aspx?sourcedoc=%7B24DA8CB9-8669-41B3-8545-A32B865FE3FD%7D&file=fallout_combined_reviews.csv&action=default&mobileredirect=true
# Visualization with API data
# First we need to load some libraries for our sentiment analysis
library(tidytext)
Warning: package 'tidytext' was built under R version 4.3.3
library(textdata)
Warning: package 'textdata' was built under R version 4.3.3
# Then, load the data file I saved to my OneDrive and copied to my computer
fallout_combined_reviews <- read_csv("fallout_combined_reviews.csv")
Rows: 137 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): reviews.language, reviews.review, game
dbl (14): reviews.recommendationid, reviews.author$steamid, reviews.author$n...
lgl (6): reviews.voted_up, reviews.steam_purchase, reviews.received_for_fre...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# New 'tidy' data frame with review words
tidy_fallout_reviews <- fallout_combined_reviews %>%
  unnest_tokens(word, reviews.review)

# Remove stop words (and filter out "fallout")
tidy_fallout_reviews <- tidy_fallout_reviews %>%
  anti_join(stop_words) %>%
  filter(word != "fallout")
Joining with `by = join_by(word)`
# Perform sentiment analysis using the 'bing' lexicon
sentiment_analysis <- tidy_fallout_reviews %>%
  inner_join(get_sentiments("bing"), by = "word")

# Classify words as positive or negative
sentiment_analysis <- sentiment_analysis %>%
  mutate(sentiment_category = ifelse(sentiment == "positive", "Positive", "Negative"))

# Count the frequency of positive and negative words for each game
word_counts <- sentiment_analysis %>%
  count(game, sentiment_category)

# Visualize positive and negative word frequency per game
ggplot(word_counts, aes(x = sentiment_category, y = n, fill = sentiment_category)) +
  geom_bar(stat = "identity") +
  facet_wrap(~game, scales = "free") +
  labs(title = "Frequency of Positive and Negative Words per Game",
       x = "Sentiment Category",
       y = "Frequency",
       fill = "Sentiment Category") +
  theme_minimal()
Note: I filtered out the word “fallout” from the analysis because the lexicon classifies it as negative, while in these reviews it is only used to refer to the games.
In general, it seems that for both Fallout 3 and Fallout 4, the reviews lean negative. In fact, the share of negative words appears similar for both games, even though the number of reviews collected for each game is not equal. This could be for multiple reasons, one being that Steam users tend to be more prone to swearing, even in a positive context. This could create more negative results than intended. Another reason could be that users are having issues with Steam itself, and not the game. Some reviews on the Fallout 3 Steam page claim that the reviewers cannot get the game to run; this is a problem with Steam’s operations, and does not reflect the actual sentiment toward the Fallout games.
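Since the two games have different numbers of reviews, raw word counts are hard to compare directly. One option is to turn the counts into shares within each game. The sketch below assumes a table shaped like word_counts (columns game, sentiment_category, n); the game labels and counts in toy_word_counts are made up.

```r
library(dplyr)

# Toy table in the same shape as word_counts; labels and counts are made up.
toy_word_counts <- tibble(
  game = c("Game A", "Game A", "Game B", "Game B"),
  sentiment_category = c("Negative", "Positive", "Negative", "Positive"),
  n = c(30, 20, 6, 4)
)

# Share of each sentiment category within its game.
sentiment_share <- toy_word_counts %>%
  group_by(game) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

# Keep only the negative share for a quick side-by-side comparison.
negative_share <- sentiment_share %>%
  filter(sentiment_category == "Negative") %>%
  select(game, share)
```

In this toy case both games end up with the same negative share even though one has five times as many sentiment words, which is the comparison that raw counts on free axes only hint at.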
Conclusion
All in all, it seems that the video game industry is cyclical: as soon as new consoles roll out, video game unit sales skyrocket, as seen in visualizations 1 and 2. Genre popularity is vastly different between regions, as seen in visualization 3. This means marketers must carefully analyze their demographics before creating and releasing a game within a certain genre. For example, Japanese players are not likely to respond to the release of a new Call of Duty game the way Americans will, but they will seemingly be ecstatic when the next Final Fantasy game comes out. Competition between game consoles also seems to encourage a rise in unit sales, as the console wars era was the peak of video game unit sales, as seen in visualizations 2 and 4. The original vg_sales data set I was using is not completely comprehensive, so data from the newest generation of consoles is missing. If the trajectory follows the cycle, unit sales should have increased at the release of the PS5 and Xbox Series X. As for Fallout, I am not swayed by the amount of negativity the games seemingly got from Steam users. There is a wide margin of error, especially because Steam users tend to use vulgar language. Sifting through the reviews individually, it actually seems that most of the negativity, especially for Fallout 3, is because of Steam optimization problems.