1 Introduction

According to a research report by Market Research Future (MRFR), “Video Game Market information by Gaming Device, by Gaming Type, by End-user and Region – forecast to 2027”, the market was valued at 155.9 billion in 2019, and the industry is projected to grow at a compound annual growth rate of 14.5% through 2026.

Video games are gaining traction at a rapid pace due to the rise of online platforms and easier access to games through secure payment methods. The development of games with a focus on interactive experiences can further drive market demand.

In this report, Exploratory Data Analysis is performed to gain insights into video game sales. The process includes Data Input, Data Inspection, Data Cleansing & Coercion, Data Summary, Data Transformation & Visualization, and Data Explanation. The objective of this report is to provide insights and a possible business recommendation.

Reference: Article

2 Data Input

The data used in this report contains a list of video games with sales greater than 100,000 copies. It was generated by a scrape of vgchartz.com, combined with another web scrape from Metacritic. The sales figures were recorded up to December 2016.

The video games sales data used consists of several variables with the following details:

  • Rank : Ranking of overall sales
  • Name : The game’s name
  • Platform : Platform of the game’s release (e.g. PC, PS4, etc.)
  • Year : Year of the game’s release
  • Genre : Genre of the game
  • Publisher : Publisher of the game
  • NA_Sales : Sales in North America (in millions)
  • EU_Sales : Sales in Europe (in millions)
  • JP_Sales : Sales in Japan (in millions)
  • Other_Sales : Sales in the rest of the world (in millions)
  • Global_Sales : Total worldwide sales (in millions)
#Data input
games <- as.data.frame(read.csv(file = "data/vgsales.csv"))
#Library input
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(RColorBrewer)

3 Data Inspection

head(games)
##   Rank                     Name Platform Year        Genre Publisher NA_Sales
## 1    1               Wii Sports      Wii 2006       Sports  Nintendo    41.49
## 2    2        Super Mario Bros.      NES 1985     Platform  Nintendo    29.08
## 3    3           Mario Kart Wii      Wii 2008       Racing  Nintendo    15.85
## 4    4        Wii Sports Resort      Wii 2009       Sports  Nintendo    15.75
## 5    5 Pokemon Red/Pokemon Blue       GB 1996 Role-Playing  Nintendo    11.27
## 6    6                   Tetris       GB 1989       Puzzle  Nintendo    23.20
##   EU_Sales JP_Sales Other_Sales Global_Sales
## 1    29.02     3.77        8.46        82.74
## 2     3.58     6.81        0.77        40.24
## 3    12.88     3.79        3.31        35.82
## 4    11.01     3.28        2.96        33.00
## 5     8.89    10.22        1.00        31.37
## 6     2.26     4.22        0.58        30.26
dim(games)
## [1] 16598    11
names(games)
##  [1] "Rank"         "Name"         "Platform"     "Year"         "Genre"       
##  [6] "Publisher"    "NA_Sales"     "EU_Sales"     "JP_Sales"     "Other_Sales" 
## [11] "Global_Sales"

From our inspection we can conclude:

  • The data contain 16,598 rows and 11 columns.
  • The column names are: “Rank”, “Name”, “Platform”, “Year”, “Genre”, “Publisher”, “NA_Sales”, “EU_Sales”, “JP_Sales”, “Other_Sales”, “Global_Sales”.

4 Data Cleansing and Coercion

#Checking data types
str(games)
## 'data.frame':    16598 obs. of  11 variables:
##  $ Rank        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name        : chr  "Wii Sports" "Super Mario Bros." "Mario Kart Wii" "Wii Sports Resort" ...
##  $ Platform    : chr  "Wii" "NES" "Wii" "Wii" ...
##  $ Year        : chr  "2006" "1985" "2008" "2009" ...
##  $ Genre       : chr  "Sports" "Platform" "Racing" "Sports" ...
##  $ Publisher   : chr  "Nintendo" "Nintendo" "Nintendo" "Nintendo" ...
##  $ NA_Sales    : num  41.5 29.1 15.8 15.8 11.3 ...
##  $ EU_Sales    : num  29.02 3.58 12.88 11.01 8.89 ...
##  $ JP_Sales    : num  3.77 6.81 3.79 3.28 10.22 ...
##  $ Other_Sales : num  8.46 0.77 3.31 2.96 1 0.58 2.9 2.85 2.26 0.47 ...
##  $ Global_Sales: num  82.7 40.2 35.8 33 31.4 ...
# Data coercion

games[ , c("Platform", "Year", "Genre", "Publisher")] <- lapply(games[ , c("Platform", "Year", "Genre", "Publisher")], as.factor)

str(games)
## 'data.frame':    16598 obs. of  11 variables:
##  $ Rank        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name        : chr  "Wii Sports" "Super Mario Bros." "Mario Kart Wii" "Wii Sports Resort" ...
##  $ Platform    : Factor w/ 31 levels "2600","3DO","3DS",..: 26 12 26 26 6 6 5 26 26 12 ...
##  $ Year        : Factor w/ 40 levels "1980","1981",..: 27 6 29 30 17 10 27 27 30 5 ...
##  $ Genre       : Factor w/ 12 levels "Action","Adventure",..: 11 5 7 11 8 6 5 4 5 9 ...
##  $ Publisher   : Factor w/ 579 levels "10TACLE Studios",..: 369 369 369 369 369 369 369 369 369 369 ...
##  $ NA_Sales    : num  41.5 29.1 15.8 15.8 11.3 ...
##  $ EU_Sales    : num  29.02 3.58 12.88 11.01 8.89 ...
##  $ JP_Sales    : num  3.77 6.81 3.79 3.28 10.22 ...
##  $ Other_Sales : num  8.46 0.77 3.31 2.96 1 0.58 2.9 2.85 2.26 0.47 ...
##  $ Global_Sales: num  82.7 40.2 35.8 33 31.4 ...
#Checking missing value
games[games == "N/A"]<-NA
colSums(is.na(games))
##         Rank         Name     Platform         Year        Genre    Publisher 
##            0            0            0          271            0           58 
##     NA_Sales     EU_Sales     JP_Sales  Other_Sales Global_Sales 
##            0            0            0            0            0
#Calculating the percentage of missing values
colSums(is.na(games))/nrow(games)
##         Rank         Name     Platform         Year        Genre    Publisher 
##  0.000000000  0.000000000  0.000000000  0.016327268  0.000000000  0.003494397 
##     NA_Sales     EU_Sales     JP_Sales  Other_Sales Global_Sales 
##  0.000000000  0.000000000  0.000000000  0.000000000  0.000000000

Since missing values make up less than 5% of the observations (about 1.6% for Year and 0.3% for Publisher), rows with any missing value will be dropped.

#Dropping observation with missing values
games <- games %>% 
  drop_na(Year, Publisher)
anyNA(games)
## [1] FALSE

All columns have been converted to the desired data types, and there are no more missing values.
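
As a possible alternative to replacing "N/A" after the factor conversion, the marker can be declared as missing when the file is read, which avoids leaving an unused "N/A" factor level behind. The sketch below is only an illustration under stated assumptions: it uses a few made-up rows passed as inline text rather than data/vgsales.csv, and `toy`, `missing_share`, and `toy_clean` are hypothetical names.

```r
# Minimal sketch on made-up rows; the report itself reads data/vgsales.csv
csv_text <- "Name,Year,Publisher,Global_Sales
Game A,2006,Nintendo,82.74
Game B,N/A,Activision,1.20
Game C,2008,N/A,0.50
Game D,2009,Ubisoft,0.30"

# na.strings makes read.csv() treat the literal text "N/A" as NA
toy <- read.csv(text = csv_text, na.strings = "N/A", stringsAsFactors = FALSE)

# Share of missing values per column
missing_share <- colMeans(is.na(toy))
print(missing_share)

# Keep only complete rows (a base R counterpart of tidyr::drop_na())
toy_clean <- toy[complete.cases(toy), ]
print(nrow(toy_clean))
```

Because "N/A" never enters the data as a string, Year is parsed as a number here and can still be converted to a factor afterwards if the later grouping steps need it.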

5 Data Summary

#Data summary
summary(games)
##       Rank           Name              Platform         Year     
##  Min.   :    1   Length:16291       DS     :2131   2009   :1431  
##  1st Qu.: 4132   Class :character   PS2    :2127   2008   :1428  
##  Median : 8292   Mode  :character   PS3    :1304   2010   :1257  
##  Mean   : 8290                      Wii    :1290   2007   :1201  
##  3rd Qu.:12440                      X360   :1234   2011   :1136  
##  Max.   :16600                      PSP    :1197   2006   :1008  
##                                     (Other):7008   (Other):8830  
##           Genre                             Publisher        NA_Sales      
##  Action      :3251   Electronic Arts             : 1339   Min.   : 0.0000  
##  Sports      :2304   Activision                  :  966   1st Qu.: 0.0000  
##  Misc        :1686   Namco Bandai Games          :  928   Median : 0.0800  
##  Role-Playing:1470   Ubisoft                     :  918   Mean   : 0.2656  
##  Shooter     :1282   Konami Digital Entertainment:  823   3rd Qu.: 0.2400  
##  Adventure   :1274   THQ                         :  712   Max.   :41.4900  
##  (Other)     :5024   (Other)                     :10605                    
##     EU_Sales          JP_Sales         Other_Sales        Global_Sales    
##  Min.   : 0.0000   Min.   : 0.00000   Min.   : 0.00000   Min.   : 0.0100  
##  1st Qu.: 0.0000   1st Qu.: 0.00000   1st Qu.: 0.00000   1st Qu.: 0.0600  
##  Median : 0.0200   Median : 0.00000   Median : 0.01000   Median : 0.1700  
##  Mean   : 0.1477   Mean   : 0.07883   Mean   : 0.04843   Mean   : 0.5409  
##  3rd Qu.: 0.1100   3rd Qu.: 0.04000   3rd Qu.: 0.04000   3rd Qu.: 0.4800  
##  Max.   :29.0200   Max.   :10.22000   Max.   :10.57000   Max.   :82.7400  
## 

Data Summary:

  1. The most common video game genre is “Action.”
  2. The platform with the most games is DS.
  3. The year with the most game releases is 2009.
  4. The publisher with the most games is “Electronic Arts.”
  5. Average sales per game are around 266K copies in North America, around 148K in Europe, and around 79K in Japan. Globally, average sales per game are around 541K.
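
The per-game averages in thousands of copies follow directly from the means reported by summary(), which are expressed in millions. The short sketch below copies those means by hand from the output above (the vector names are only illustrative) and converts the units:

```r
# Mean sales per game in millions of copies, copied from the summary() output
mean_millions <- c(NA_Sales = 0.2656, EU_Sales = 0.1477, JP_Sales = 0.07883,
                   Other_Sales = 0.04843, Global_Sales = 0.5409)

# Convert millions to thousands of copies and round to whole thousands
mean_thousands <- round(mean_millions * 1000)
print(mean_thousands)
```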

6 Data Transformation & Visualization

6.1 What is the highest-selling game worldwide?

head(games[ order(games$Global_Sales, decreasing = TRUE), ],1)
##   Rank       Name Platform Year  Genre Publisher NA_Sales EU_Sales JP_Sales
## 1    1 Wii Sports      Wii 2006 Sports  Nintendo    41.49    29.02     3.77
##   Other_Sales Global_Sales
## 1        8.46        82.74

The highest-selling game worldwide is “Wii Sports”, with total global sales of 82.74 million.

6.2 What type of genre has the most sales globally?

total_sales_genre <- aggregate.data.frame(x = list(Total_Sales = games$Global_Sales),
                                 by = list(Genre = games$Genre),
                                 FUN = sum)

total_sales_genre <- total_sales_genre[order(total_sales_genre$Total_Sales, decreasing = T), ]
total_sales_genre
##           Genre Total_Sales
## 1        Action     1722.84
## 11       Sports     1309.24
## 9       Shooter     1026.20
## 8  Role-Playing      923.83
## 5      Platform      829.13
## 4          Misc      789.87
## 7        Racing      726.76
## 3      Fighting      444.05
## 10   Simulation      389.98
## 6        Puzzle      242.21
## 2     Adventure      234.59
## 12     Strategy      173.27

The highest-selling genre is “Action”, with total global sales of around 1723 million. The visualization of genres by total global sales can be seen below:

ggplot(data = total_sales_genre, mapping = aes(x = reorder(Genre, -Total_Sales), y = Total_Sales)) +
  geom_col(mapping = aes(fill = Genre)) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90))+
  theme(legend.position = "None") +
  labs(title =  "Genre by Total Global Sales",
       subtitle = "Video Games Sales Data",
       x = NULL,
       y = "Total Sales")+
  scale_fill_brewer(palette = "Spectral")
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Spectral is 11
## Returning the palette you asked for with that many colors

From the visualization, “Action” is the highest-selling genre, followed by “Sports” and “Shooter”, while “Strategy” ranks last.
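
The warning above appears because the "Spectral" palette defines at most 11 colours while the data has 12 genres, so one genre is left without a palette colour. One possible workaround, shown here only as a sketch, is to interpolate the palette with colorRampPalette() from base R; `n_genres` and `spectral_12` are hypothetical names introduced for the example.

```r
library(RColorBrewer)

# "Spectral" provides at most 11 colours; interpolate to get one per genre
n_genres <- 12
spectral_12 <- colorRampPalette(brewer.pal(11, "Spectral"))(n_genres)
print(spectral_12)

# In the plot, this vector could replace scale_fill_brewer(), for example:
#   + scale_fill_manual(values = spectral_12)
```

The interpolated colours are no longer the exact ColorBrewer swatches, which is a trade-off to weigh against simply choosing a palette designed for 12 classes.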

6.3 Who is the publisher with the most sales globally?

total_sales_publisher <- aggregate.data.frame(x = list(Total_Sales = games$Global_Sales),
                                 by = list(Publisher = games$Publisher),
                                 FUN = sum)

total_sales_publisher <- total_sales_publisher[order(total_sales_publisher$Total_Sales, decreasing = T), ]
head(total_sales_publisher, 10)
##                        Publisher Total_Sales
## 368                     Nintendo     1784.43
## 139              Electronic Arts     1093.39
## 17                    Activision      721.41
## 463  Sony Computer Entertainment      607.28
## 531                      Ubisoft      473.54
## 497         Take-Two Interactive      399.30
## 513                          THQ      340.44
## 282 Konami Digital Entertainment      278.56
## 451                         Sega      270.70
## 351           Namco Bandai Games      253.65

The publisher with the highest global sales is “Nintendo”, with total global sales of around 1784 million. The visualization of the top 10 publishers by total global sales can be seen below:

top_sales_publisher <- head(total_sales_publisher, 10)

ggplot(data = top_sales_publisher, mapping = aes(x = Total_Sales, y = reorder(Publisher, Total_Sales))) +
  geom_col(mapping = aes(fill = Publisher)) +
  theme_classic() +
  theme(legend.position = "None") +
  labs(title =  "Top 10 Publishers by Total Global Sales",
       subtitle = "Video Games Sales Data",
       x = "Total Sales",
       y = NULL)+
  scale_fill_brewer(palette = "Spectral")

From the visualization, the top 10 publishers based on their video game sales can be seen clearly. The first place goes to “Nintendo” followed by “Electronic Arts” and “Activision”.

6.4 Which platform has the most sales globally?

total_sales_platform <- aggregate.data.frame(x = list(Total_Sales = games$Global_Sales),
                                 by = list(Platform = games$Platform),
                                 FUN = sum)

total_sales_platform <- total_sales_platform[order(total_sales_platform$Total_Sales, decreasing = T), ]
head(total_sales_platform, 10)
##    Platform Total_Sales
## 17      PS2     1233.46
## 29     X360      969.60
## 18      PS3      949.35
## 26      Wii      909.81
## 5        DS      818.91
## 16       PS      727.39
## 7       GBA      305.62
## 20      PSP      291.71
## 19      PS4      278.10
## 14       PC      254.70

PS2 is the platform with the highest global video game sales, with its games selling around 1233 million copies in total. The visualization of the top 10 platforms by total global sales can be seen below:

top_sales_platform <- head(total_sales_platform, 10)

ggplot(data = top_sales_platform, mapping = aes(x = reorder(Platform, -Total_Sales), y = Total_Sales)) +
  geom_col(mapping = aes(fill = Platform)) +
  theme_classic() +
  theme(legend.position = "None") +
  labs(title =  "Top 10 Platforms by Total Global Sales",
       subtitle = "Video Games Sales Data",
       x = NULL,
       y = "Total Sales") +
  scale_fill_brewer(palette = "Spectral")

From the visualization, the top 10 platforms based on their video game sales can be seen clearly. First place goes to “PS2”, followed by “X360” and “PS3”, while “PC” ranks tenth.

6.5 Which year had the highest sales worldwide?

total_sales_year <- aggregate.data.frame(x = list(Total_Sales = games$Global_Sales),
                                 by = list(Year = games$Year),
                                 FUN = sum)

total_sales_year <- total_sales_year[order(total_sales_year$Total_Sales, decreasing = T), ]
head(total_sales_year, 10)
##    Year Total_Sales
## 29 2008      678.90
## 30 2009      667.30
## 28 2007      609.92
## 31 2010      600.29
## 27 2006      521.04
## 32 2011      515.80
## 26 2005      458.51
## 25 2004      414.01
## 23 2002      395.52
## 34 2013      368.11

2008 was the year with the highest sales of video games globally.

6.6 Top 5 years game release by genre.

According to the table from the previous question, the top 5 years by total worldwide sales are:

  • 2008, with total sales of around 679 million;
  • 2009, with total sales of around 667 million;
  • 2007, with total sales of around 610 million;
  • 2010, with total sales of around 600 million; and
  • 2006, with total sales of around 521 million.

Based on that information, we are going to look at the sales of each genre in each of those years.

top_year <- games %>%
  group_by(Year, Genre) %>%
  summarize(Total_Sales = sum(Global_Sales)) %>% 
  filter(Year %in% c("2008", "2009", "2007", "2010", "2006"))
## `summarise()` has grouped output by 'Year'. You can override using the `.groups` argument.
ggplot(data = top_year, aes(x = reorder(Year, Total_Sales), y = Total_Sales))+
  geom_col(aes(fill = Genre), position = "dodge")+
  labs(title = "Top 5 Years Game Release by Genre",
       subtitle = "Video Games Sales Data",
       x = NULL,
       y = "Total Sales",
       fill = NULL)+
  theme_minimal()+
  theme(legend.position = "bottom")+
  scale_fill_brewer(palette = "Spectral")
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Spectral is 11
## Returning the palette you asked for with that many colors

Based on the visualization, “Sports” and “Action” dominated sales in those 5 years, while “Strategy” often had the lowest sales.

6.7 Leading publisher of each year by sales.

publisher_year <- games %>% 
  group_by(Year, Publisher) %>% 
  summarise(Sales = sum(Global_Sales)) %>%
  top_n(1) %>% 
  ungroup() 
## `summarise()` has grouped output by 'Year'. You can override using the `.groups` argument.
## Selecting by Sales
publisher_year <- filter(publisher_year, !Year %in% c("2017", "2020"))
ggplot(data = publisher_year, aes(x = Year, y = Sales))+
  geom_col(aes(fill = Publisher), position = "dodge")+
  labs(title = "Leading Publisher of Each Year by Sales",
       subtitle = "Video Games Sales Data",
       x = "Year",
       y = "Total Sales",
       fill = NULL)+
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 90))+
  theme(legend.position = "right") +
  scale_fill_brewer(palette = "Spectral")

As seen from the plot, Nintendo dominated the sales for most years.
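
The chunk above relies on top_n(), which the dplyr documentation marks as superseded in favour of slice_max() and slice_min(). A minimal sketch of the same "leader per year" step with slice_max() is shown below; the data frame `toy` and its numbers are made up for illustration and are not taken from the report's data.

```r
library(dplyr)

# Hypothetical yearly totals per publisher (made-up numbers)
toy <- data.frame(
  Year      = c("2006", "2006", "2007", "2007"),
  Publisher = c("Nintendo", "Electronic Arts", "Nintendo", "Activision"),
  Sales     = c(200, 90, 100, 120)
)

# Keep the row with the highest Sales within each year
leaders <- toy %>%
  group_by(Year) %>%
  slice_max(order_by = Sales, n = 1) %>%  # ties are kept by default
  ungroup()

print(leaders)
```

Naming the ordering column explicitly also avoids the "Selecting by Sales" message, since nothing is left for dplyr to guess.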

6.8 Sales comparison by genre.

sales_comp_gen <- games %>%
  select(Genre, NA_Sales, EU_Sales, JP_Sales, Other_Sales) %>% 
  group_by(Genre) %>% 
  summarise(NA_Sales = sum(NA_Sales), 
            EU_Sales = sum(EU_Sales), 
            JP_Sales = sum(JP_Sales), 
            Other_Sales = sum(Other_Sales))

sales_comp_gen <- pivot_longer(data =  sales_comp_gen, 
             cols = c("NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"))

We can see the comparison of sales from each region by the genre below:

ggplot(data = sales_comp_gen, aes(x = name, y = Genre, fill = value))+
  geom_tile(aes(fill = value))+
  geom_text(aes(label = value), position = position_dodge(width = .1), color = "black")+
  labs(title = "Sales Comparison by Genre",
       subtitle = "Video Games Sales Data",
       x = "Region",
       y = NULL,
       fill = NULL)+
  theme_minimal()+
  theme(legend.position = "right")+
  scale_fill_distiller(palette = "Spectral")

For most regions, “Action” is the highest-selling video game genre, while “Strategy” remains the lowest. The three highest genre totals all come from North America (“Action”, “Sports”, and “Shooter”).

6.9 Total sales and comparison by region.

sales_region <- games %>%
  select(NA_Sales, EU_Sales, JP_Sales, Other_Sales) %>% 
  summarise(NA_Sales = sum(NA_Sales), 
            EU_Sales = sum(EU_Sales), 
            JP_Sales = sum(JP_Sales), 
            Other_Sales = sum(Other_Sales))

sales_region <- pivot_longer(data =  sales_region, 
             cols = c("NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"))
ggplot(data = sales_region, aes(x = reorder(name, -value), y = value))+
  geom_col(aes(fill = name), position = "dodge")+
  labs(title = "Total Sales by Region",
       subtitle = "Video Games Sales Data",
       x = NULL,
       y = "Total Sales",
       fill = NULL)+
  theme_minimal()+
  theme(legend.position = "right")+
  scale_fill_brewer(palette = "Spectral")

North America is the region with the most video game sales of all time (until 2016).
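
To compare regions on a relative scale, the regional totals can also be expressed as shares of the overall total. The sketch below uses hypothetical totals (not the report's values) purely to show the calculation; with the real data, the `value` column of `sales_region` would take the place of `region_totals`.

```r
# Hypothetical regional totals in millions of copies (not the report's values)
region_totals <- c(NA_Sales = 50, EU_Sales = 30, JP_Sales = 15, Other_Sales = 5)

# Share of each region in the overall total, in percent
region_share <- round(100 * prop.table(region_totals), 1)
print(sort(region_share, decreasing = TRUE))
```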

7 Data Explanation / Conclusion

Based on the data used in this report and the exploratory data analysis process that has been done, we can conclude that:

  • Based on total sales, North America is the region that bought the most video games.
  • Most video games sold were for the PS2 platform (based on the total sales of each game).
  • The most popular video game genre globally is “Action.” This genre also ranks #1 in almost all regions; the only exception is Japan, which seems to prefer the “Role-Playing” genre.
  • However, most video games released in 2007-2010 have “Action” as their genre, which could indicate high competition among games in the “Action” genre.
  • Nintendo has led the video game industry most of the time, as shown by its total sales, which ranked highest in most years. However, Electronic Arts also shows a promising future, having ranked highest in several years of the last decade.

If you are interested in selling or creating video games, here are a few recommendations:

  • Consider your target market. For example, if your target market is North America, the most popular genre there is “Action.” But if your target market is Japan, then the “Role-Playing” genre is a better choice (see Sales Comparison by Genre).
  • However, you might also need to consider the competition, since most video games have “Action” as their genre. Go for “Puzzle” if you want a less mainstream genre (see Data Summary).
  • If you want to sell through or invest in a video game publisher, consider “Nintendo” or “Electronic Arts”, since they are the leading publishers based on their total sales in past decades (see Leading Publisher of Each Year by Sales).