1. Introduction

The video game industry is highly competitive, with thousands of games released annually but only a small percentage achieving significant sales. To maximize the commercial success of our next game, I need a data-driven approach to identify:

  1. The best target market – Which regions have the highest sales potential (North America, Europe, Japan, etc.)?
  2. The most profitable genre – What type of game sells the most (Action, RPG, Sports, etc.)?
  3. The ideal gaming platform – Should we focus on console, PC, or another platform (PlayStation, Xbox, PC, etc.)?

2. Data Set Overview

2.1 Data Source

To explore these questions, I analyze the vgsales data set, which contains information on 16,324 video games released between 1980 and 2016.

2.2 Variables and Observations

Each row represents one game on one platform (a title released on several platforms appears once per platform), and the data set includes the following variables:

  • Rank: Ranking of the game based on global sales.
  • Name: Name of the video game.
  • Platform: The gaming platform (e.g., PS4, Xbox, PC).
  • Year: Year of release.
  • Genre: Game genre (e.g., Action, Sports, RPG).
  • Publisher: The company that published the game.
  • NA_Sales: Sales in North America (millions of copies).
  • EU_Sales: Sales in Europe (millions of copies).
  • JP_Sales: Sales in Japan (millions of copies).
  • Other_Sales: Sales in other regions (millions of copies).
  • Global_Sales: Total worldwide sales (millions of copies).

3. Brief Descriptive Analytics

3.1 Missing Values

First, I check for missing values in each column:

colSums(is.na(video_games))
##         Rank         Name     Platform         Year        Genre    Publisher 
##            0            0            0            0            0            0 
##     NA_Sales     EU_Sales     JP_Sales  Other_Sales Global_Sales 
##            0            0            0            0            0

The output shows no NA values in any column, so I can proceed with the analysis without handling missing data.
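One caveat is that is.na() only detects values that R has stored as NA. If a raw file encodes missing entries as text, a character column can report zero NAs and still contain gaps. The sketch below illustrates such a check on a small hypothetical data frame (toy); the toy values and the list of placeholder strings are assumptions for illustration, not taken from the vgsales file.

```r
# Hypothetical toy data standing in for video_games (illustrative values only)
toy <- data.frame(
  Name      = c("Game A", "Game B", "Game C"),
  Publisher = c("Studio X", "N/A", "Unknown"),
  Year      = c(2005, 2010, 2012),
  stringsAsFactors = FALSE
)

# Assumed placeholder strings; adjust to whatever the raw file actually uses
placeholders <- c("N/A", "NA", "", "Unknown")

# Standard NA check: reports zero for every column in this toy example
na_counts <- colSums(is.na(toy))

# Placeholder check: counts text values that stand in for missing data
placeholder_counts <- sapply(
  toy,
  function(col) sum(as.character(col) %in% placeholders)
)

na_counts
placeholder_counts
```

Running the same sapply() call on video_games would show whether any column hides missing entries behind such strings.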

3.2 Summary Statistics

Next, I look at summary statistics to understand the distribution of the sales data and the other numerical variables.

summary(video_games)
##       Rank           Name             Platform              Year     
##  Min.   :    1   Length:16324       Length:16324       Min.   :1980  
##  1st Qu.: 4136   Class :character   Class :character   1st Qu.:2003  
##  Median : 8294   Mode  :character   Mode  :character   Median :2007  
##  Mean   : 8292                                         Mean   :2006  
##  3rd Qu.:12439                                         3rd Qu.:2010  
##  Max.   :16600                                         Max.   :2016  
##     Genre            Publisher            NA_Sales          EU_Sales      
##  Length:16324       Length:16324       Min.   : 0.0000   Min.   : 0.0000  
##  Class :character   Class :character   1st Qu.: 0.0000   1st Qu.: 0.0000  
##  Mode  :character   Mode  :character   Median : 0.0800   Median : 0.0200  
##                                        Mean   : 0.2655   Mean   : 0.1476  
##                                        3rd Qu.: 0.2400   3rd Qu.: 0.1100  
##                                        Max.   :41.4900   Max.   :29.0200  
##     JP_Sales         Other_Sales        Global_Sales    
##  Min.   : 0.00000   Min.   : 0.00000   Min.   : 0.0100  
##  1st Qu.: 0.00000   1st Qu.: 0.00000   1st Qu.: 0.0600  
##  Median : 0.00000   Median : 0.01000   Median : 0.1700  
##  Mean   : 0.07867   Mean   : 0.04833   Mean   : 0.5403  
##  3rd Qu.: 0.04000   3rd Qu.: 0.04000   3rd Qu.: 0.4800  
##  Max.   :10.22000   Max.   :10.57000   Max.   :82.7400

Key Insights:

  • The data set covers games released between 1980 and 2016; the middle 50% of titles were released between 2003 and 2010 (the first and third quartiles of Year).
  • Global sales range from 0.01 million to 82.74 million, but the median is only 0.17 million, indicating that while a few games become massive hits, most sell relatively few copies.
  • North America is the largest market: it has the highest mean sales per title (0.27M, versus 0.15M for Europe and 0.08M for Japan) and the highest single-title maximum (41.49M). Japan’s median is 0.00M, meaning at least half of the titles recorded essentially no sales there, which may reflect different gaming preferences.
  • The sales distribution is strongly right-skewed: the mean of global sales (0.54M) exceeds the third quartile (0.48M), so a small number of hits account for a disproportionate share of total sales.
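Because every column has the same number of rows, the regional means in the summary output can be converted into approximate market shares. The following is a rough sketch that uses only the rounded means printed above, so the shares are approximate and do not sum to exactly 100%; the variable names are my own.

```r
# Mean sales per title (millions), copied from the summary() output
region_means <- c(
  NA_Sales    = 0.2655,
  EU_Sales    = 0.1476,
  JP_Sales    = 0.07867,
  Other_Sales = 0.04833
)
global_mean <- 0.5403

# Share of global sales attributable to each region, in percent
region_share <- round(100 * region_means / global_mean, 1)
region_share
```

On these figures North America accounts for roughly half of global sales (about 49%), Europe for about 27%, Japan for about 15%, and other regions for about 9%.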

3.3 Identifying Outliers

The scatter plot below displays the relationship between a game’s global sales and its rank.

library(ggplot2)  # provides ggplot(), geom_point(), labs(), theme_minimal()

ggplot(video_games, aes(x = Rank, y = Global_Sales)) +
  geom_point(alpha = 0.5, color = "blue") +
  labs(title = "Global Video Game Sales",
       x = "Game Rank",
       y = "Global Sales (millions)") +
  theme_minimal()

Most video games have relatively low sales, clustering near the bottom of the plot. However, there is a clear outlier: one game stands out with far higher global sales than any other.

3.4 Top 10 Best-Selling Games

To better understand the scatter plot of global game sales, I present a table of the top 10 best-selling games.

library(dplyr)  # provides %>%, arrange(), desc(), select()

video_games %>%
  arrange(desc(Global_Sales)) %>%
  select(Rank, Name, Platform, Genre, Global_Sales) %>%
  head(10)
##    Rank                      Name Platform        Genre Global_Sales
## 1     1                Wii Sports      Wii       Sports        82.74
## 2     2         Super Mario Bros.      NES     Platform        40.24
## 3     3            Mario Kart Wii      Wii       Racing        35.82
## 4     4         Wii Sports Resort      Wii       Sports        33.00
## 5     5  Pokemon Red/Pokemon Blue       GB Role-Playing        31.37
## 6     6                    Tetris       GB       Puzzle        30.26
## 7     7     New Super Mario Bros.       DS     Platform        30.01
## 8     8                  Wii Play      Wii         Misc        29.02
## 9     9 New Super Mario Bros. Wii      Wii     Platform        28.62
## 10   10                 Duck Hunt      NES      Shooter        28.31

The highest-selling video game is Wii Sports, with 82.74 million copies sold, more than double the sales of the second best-selling game, Super Mario Bros. (40.24 million copies). This confirms that Wii Sports is the outlier observed in the scatter plot.
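The degree of concentration at the top can be estimated from numbers already shown. The sketch below is a back-of-the-envelope calculation: it approximates total global sales as the number of rows times the rounded mean from the summary output, so the resulting share is an estimate, and the variable names are my own.

```r
# Global sales (millions) of the top 10 titles, copied from the table above
top10 <- c(82.74, 40.24, 35.82, 33.00, 31.37,
           30.26, 30.01, 29.02, 28.62, 28.31)

n_titles    <- 16324    # rows in the data set
mean_global <- 0.5403   # rounded mean of Global_Sales from summary()

# Approximate total sales across all titles (millions)
total_est <- n_titles * mean_global

# Share of total sales captured by the top 10, in percent
top10_share <- 100 * sum(top10) / total_est

# Share of all titles that the top 10 represent, in percent
title_share <- 100 * length(top10) / n_titles

# How far Wii Sports leads the runner-up
lead_ratio <- top10[1] / top10[2]

round(c(top10_share = top10_share,
        title_share = title_share,
        lead_ratio  = lead_ratio), 2)
```

Under these assumptions, ten titles (about 0.06% of all rows) account for roughly 4% of estimated global sales.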

3.5 Distribution of Global Sales

To better understand the distribution of video game sales, I plot a histogram with a logarithmic x-axis to account for the strong right skew of the data.

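ggplot(video_games, aes(x = Global_Sales)) +
  # binwidth is measured in log10 units once scale_x_log10() is applied;
  # 0.1 gives about ten bins per factor of ten (binwidth = 1 would leave only a few bins)
  geom_histogram(binwidth = 0.1, fill = "blue", color = "black", alpha = 0.7) +
  scale_x_log10() + 
  labs(title = "Distribution of Global Video Game Sales",
       x = "Global Sales (millions, log scale)", y = "Count") +
  theme_minimal()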

The histogram confirms that most video games sell fewer than 1 million copies, while only a handful exceed 10 million. The log transformation makes it easier to see the distribution, highlighting the extreme sales disparity between low-selling and blockbuster games.
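When reading a histogram on a log10 axis, it helps to remember that ggplot2 applies binwidth on the transformed scale, so each bin covers a constant multiplicative factor and not a constant number of copies. The sketch below is a minimal illustration using only the minimum and maximum of Global_Sales from the summary output; bins_needed() and growth_factor() are hypothetical helpers, and the bin counts are approximate because ggplot2's bin alignment can add one.

```r
# Range of Global_Sales (millions) reported by summary()
sales_range <- c(0.01, 82.74)

# Width of the data on the log10 scale (in decades)
log_span <- diff(log10(sales_range))

# Approximate number of bins needed to cover the data for a given binwidth
bins_needed <- function(binwidth) ceiling(log_span / binwidth)

# Factor by which sales grow from the left edge of a bin to the right edge
growth_factor <- function(binwidth) 10^binwidth

data.frame(
  binwidth      = c(1, 0.25, 0.1),
  bins          = sapply(c(1, 0.25, 0.1), bins_needed),
  growth_factor = round(sapply(c(1, 0.25, 0.1), growth_factor), 2)
)
```

A binwidth of 1 therefore covers the whole range in about four bins, each spanning a tenfold increase in sales, while a binwidth of 0.1 gives about forty bins that each span an increase of roughly 26%.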

These findings suggest that selecting the right market, genre, and platform is critical. A poorly positioned game could easily be lost in the vast number of low-selling titles.