The video game industry is highly competitive, with thousands of games released annually but only a small percentage achieving significant sales. To maximize the commercial success of our next game, I need a data-driven approach to identify:
To explore these questions, I analyze the vgsales data set, which contains information on 16,324 video games released between 1980 and 2016.
Each row represents a single video game title, and the data set includes the following variables:
Rank: Ranking of the game based on global sales.
Name: Name of the video game.
Platform: The gaming platform (e.g., PS4, Xbox, PC).
Year: Year of release.
Genre: Game genre (e.g., Action, Sports, RPG).
Publisher: The company that published the game.
NA_Sales: Sales in North America (millions).
EU_Sales: Sales in Europe (millions).
JP_Sales: Sales in Japan (millions).
Other_Sales: Sales in other regions (millions).
Global_Sales: Total worldwide sales (millions).

First, I check for missing values in each column:
colSums(is.na(video_games))
##         Rank         Name     Platform         Year        Genre    Publisher
##            0            0            0            0            0            0
##     NA_Sales     EU_Sales     JP_Sales  Other_Sales Global_Sales
##            0            0            0            0            0
The output shows that there are no missing values in the data set. This means I can proceed with analysis without needing to handle any missing data.
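One caveat: is.na() only detects true NA values. Text placeholders such as an empty string or the literal "N/A" in a character column would pass this check unnoticed. Below is a minimal sketch of a complementary check. The toy data frame and the helper name count_placeholders are illustrative assumptions, not part of the original analysis.

```r
# Toy data frame for illustration only; these rows are not from vgsales.
toy <- data.frame(
  Name      = c("Game A", "Game B", "Game C"),
  Publisher = c("N/A", "Studio X", ""),
  Year      = c(2001, NA, 2010),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count true NAs plus common text placeholders per column.
count_placeholders <- function(df, placeholders = c("", "N/A", "NA")) {
  sapply(df, function(col) {
    n_na <- sum(is.na(col))
    # Only character columns can hold text placeholders.
    n_text <- if (is.character(col)) sum(col %in% placeholders) else 0
    n_na + n_text
  })
}

missing_counts <- count_placeholders(toy)
print(missing_counts)
```

Running the same helper on video_games would show whether any column hides placeholder text that colSums(is.na(...)) cannot see.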
Next, I look at summary statistics to understand the distribution of the sales figures and the other numerical variables:
summary(video_games)
##       Rank           Name             Platform              Year
##  Min.   :    1   Length:16324       Length:16324       Min.   :1980
##  1st Qu.: 4136   Class :character   Class :character   1st Qu.:2003
##  Median : 8294   Mode  :character   Mode  :character   Median :2007
##  Mean   : 8292                                         Mean   :2006
##  3rd Qu.:12439                                         3rd Qu.:2010
##  Max.   :16600                                         Max.   :2016
##     Genre            Publisher            NA_Sales          EU_Sales
##  Length:16324       Length:16324       Min.   : 0.0000   Min.   : 0.0000
##  Class :character   Class :character   1st Qu.: 0.0000   1st Qu.: 0.0000
##  Mode  :character   Mode  :character   Median : 0.0800   Median : 0.0200
##                                        Mean   : 0.2655   Mean   : 0.1476
##                                        3rd Qu.: 0.2400   3rd Qu.: 0.1100
##                                        Max.   :41.4900   Max.   :29.0200
##     JP_Sales         Other_Sales        Global_Sales
##  Min.   : 0.00000   Min.   : 0.00000   Min.   : 0.0100
##  1st Qu.: 0.00000   1st Qu.: 0.00000   1st Qu.: 0.0600
##  Median : 0.00000   Median : 0.01000   Median : 0.1700
##  Mean   : 0.07867   Mean   : 0.04833   Mean   : 0.5403
##  3rd Qu.: 0.04000   3rd Qu.: 0.04000   3rd Qu.: 0.4800
##  Max.   :10.22000   Max.   :10.57000   Max.   :82.7400
Key Insights:
The scatter plot below displays the relationship between a game’s global sales and its rank.
library(ggplot2)  # plotting functions used below

# Scatter plot of global sales against sales rank
ggplot(video_games, aes(x = Rank, y = Global_Sales)) +
  geom_point(alpha = 0.5, color = "blue") +
  labs(title = "Global Video Game Sales",
       x = "Game Rank",
       y = "Global Sales (millions)") +
  theme_minimal()
Most video games have relatively low sales, clustering near the bottom
of the plot. However, there is a clear outlier: one game stands out with
significantly higher global sales than any other.
To better understand the scatter plot of global game sales, I present a table of the top 10 best-selling games.
library(dplyr)  # provides %>%, arrange(), and select()

# Ten best-selling games by global sales
video_games %>%
  arrange(desc(Global_Sales)) %>%
  select(Rank, Name, Platform, Genre, Global_Sales) %>%
  head(10)
##    Rank                      Name Platform        Genre Global_Sales
## 1     1                Wii Sports      Wii       Sports        82.74
## 2     2         Super Mario Bros.      NES     Platform        40.24
## 3     3            Mario Kart Wii      Wii       Racing        35.82
## 4     4         Wii Sports Resort      Wii       Sports        33.00
## 5     5  Pokemon Red/Pokemon Blue       GB Role-Playing        31.37
## 6     6                    Tetris       GB       Puzzle        30.26
## 7     7     New Super Mario Bros.       DS     Platform        30.01
## 8     8                  Wii Play      Wii         Misc        29.02
## 9     9 New Super Mario Bros. Wii      Wii     Platform        28.62
## 10   10                 Duck Hunt      NES      Shooter        28.31
The highest-selling video game is Wii Sports, with 82.74 million copies sold, more than double the sales of the runner-up, Super Mario Bros. (40.24 million copies). This confirms that Wii Sports is the outlier observed in the scatter plot.
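The size of that gap can be verified with a quick calculation. The sketch below simply reuses the two Global_Sales values from the top 10 table; the variable names are illustrative.

```r
# Global sales (millions) of the top two games, copied from the top 10 table.
top_two <- c("Wii Sports" = 82.74, "Super Mario Bros." = 40.24)

# Ratio of the best seller to the runner-up.
sales_ratio <- top_two[["Wii Sports"]] / top_two[["Super Mario Bros."]]
print(round(sales_ratio, 2))  # about 2.06
```

A ratio slightly above 2 means Wii Sports alone sold more than the second-ranked game twice over.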
To better understand the distribution of video game sales, I plot a histogram on a logarithmic x-axis, which compensates for the strong right skew of the data.
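# Histogram of global sales on a log10 x-axis.
# Note: with scale_x_log10(), binwidth is measured in log10 units,
# so binwidth = 1 means each bin spans one order of magnitude.
ggplot(video_games, aes(x = Global_Sales)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black", alpha = 0.7) +
  scale_x_log10() +
  labs(title = "Distribution of Global Video Game Sales",
       x = "Global Sales (millions, log scale)", y = "Count") +
  theme_minimal()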
The histogram confirms that most video games sell fewer than 1 million copies, while only a handful exceed 10 million. The log transformation makes it easier to see the distribution, highlighting the extreme sales disparity between low-selling and blockbuster games.
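The "fewer than 1 million" and "more than 10 million" statements can also be expressed as proportions. The following is a minimal sketch: the helper name sales_shares and its thresholds are assumptions for illustration, and the demo input is just the ten best sellers from the top 10 table, so its result does not describe the full data set.

```r
# Hypothetical helper: share of titles below `low` and above `high`
# (both thresholds in millions of copies).
sales_shares <- function(sales, low = 1, high = 10) {
  c(below_low  = mean(sales < low),
    above_high = mean(sales > high))
}

# Demo input: Global_Sales of the ten best sellers from the top 10 table.
# Every value here exceeds 10 million, so this only exercises the helper.
top10_sales <- c(82.74, 40.24, 35.82, 33.00, 31.37,
                 30.26, 30.01, 29.02, 28.62, 28.31)
shares <- sales_shares(top10_sales)
print(shares)
```

Calling sales_shares(video_games$Global_Sales) would give the corresponding proportions for all 16,324 titles.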
These findings suggest that selecting the right market, genre, and platform is critical. A poorly positioned game could easily be lost in the vast number of low-selling titles.