According to a research report by Market Research Future (MRFR), “Video Game Market information by Gaming Device, by Gaming Type, by End-user and Region – forecast to 2027”, the market was valued at 155.9 billion in 2019 and is expected to grow at a compound annual growth rate of 14.5% through 2026.
Video games are gaining traction rapidly thanks to the rise of online platforms and easier access to games through secure payment methods. The development of games that focus on interactive experiences can further drive market demand.
In this report, Exploratory Data Analysis is performed to gain more insight into video game sales. The process includes Data Input, Data Inspection, Data Cleansing & Coercion, Data Summary, Data Transformation & Visualization, and Data Explanation. The objective of this report is to provide insights and possible business recommendations.
Reference: Article
The data used in this report contains a list of video games with sales greater than 100,000 copies. It was generated by a scrape of vgchartz.com combined with another web scrape of Metacritic. The sales figures in this data were recorded up to December 2016.
The video games sales data used consists of several variables with the following details:
Rank : Ranking of overall sales
Name : The game's name
Platform : Platform of the game's release (e.g. PC, PS4)
Year : Year of the game's release
Genre : Genre of the game
Publisher : Publisher of the game
NA_Sales : Sales in North America (in millions)
EU_Sales : Sales in Europe (in millions)
JP_Sales : Sales in Japan (in millions)
Other_Sales : Sales in the rest of the world (in millions)
Global_Sales : Total worldwide sales (in millions)
#Data input
games <- as.data.frame(read.csv(file = "data/vgsales.csv"))
#Library input
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.1.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(RColorBrewer)
head(games)
## Rank Name Platform Year Genre Publisher NA_Sales
## 1 1 Wii Sports Wii 2006 Sports Nintendo 41.49
## 2 2 Super Mario Bros. NES 1985 Platform Nintendo 29.08
## 3 3 Mario Kart Wii Wii 2008 Racing Nintendo 15.85
## 4 4 Wii Sports Resort Wii 2009 Sports Nintendo 15.75
## 5 5 Pokemon Red/Pokemon Blue GB 1996 Role-Playing Nintendo 11.27
## 6 6 Tetris GB 1989 Puzzle Nintendo 23.20
## EU_Sales JP_Sales Other_Sales Global_Sales
## 1 29.02 3.77 8.46 82.74
## 2 3.58 6.81 0.77 40.24
## 3 12.88 3.79 3.31 35.82
## 4 11.01 3.28 2.96 33.00
## 5 8.89 10.22 1.00 31.37
## 6 2.26 4.22 0.58 30.26
dim(games)
## [1] 16598 11
names(games)
## [1] "Rank" "Name" "Platform" "Year" "Genre"
## [6] "Publisher" "NA_Sales" "EU_Sales" "JP_Sales" "Other_Sales"
## [11] "Global_Sales"
From our inspection we can conclude :
#Checking data types
str(games)
## 'data.frame': 16598 obs. of 11 variables:
## $ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Name : chr "Wii Sports" "Super Mario Bros." "Mario Kart Wii" "Wii Sports Resort" ...
## $ Platform : chr "Wii" "NES" "Wii" "Wii" ...
## $ Year : chr "2006" "1985" "2008" "2009" ...
## $ Genre : chr "Sports" "Platform" "Racing" "Sports" ...
## $ Publisher : chr "Nintendo" "Nintendo" "Nintendo" "Nintendo" ...
## $ NA_Sales : num 41.5 29.1 15.8 15.8 11.3 ...
## $ EU_Sales : num 29.02 3.58 12.88 11.01 8.89 ...
## $ JP_Sales : num 3.77 6.81 3.79 3.28 10.22 ...
## $ Other_Sales : num 8.46 0.77 3.31 2.96 1 0.58 2.9 2.85 2.26 0.47 ...
## $ Global_Sales: num 82.7 40.2 35.8 33 31.4 ...
# Data coercion: convert the categorical columns to factors
games[ , c("Platform", "Year", "Genre", "Publisher")] <- lapply(games[ , c("Platform", "Year", "Genre", "Publisher")], as.factor)
str(games)
## 'data.frame': 16598 obs. of 11 variables:
## $ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Name : chr "Wii Sports" "Super Mario Bros." "Mario Kart Wii" "Wii Sports Resort" ...
## $ Platform : Factor w/ 31 levels "2600","3DO","3DS",..: 26 12 26 26 6 6 5 26 26 12 ...
## $ Year : Factor w/ 40 levels "1980","1981",..: 27 6 29 30 17 10 27 27 30 5 ...
## $ Genre : Factor w/ 12 levels "Action","Adventure",..: 11 5 7 11 8 6 5 4 5 9 ...
## $ Publisher : Factor w/ 579 levels "10TACLE Studios",..: 369 369 369 369 369 369 369 369 369 369 ...
## $ NA_Sales : num 41.5 29.1 15.8 15.8 11.3 ...
## $ EU_Sales : num 29.02 3.58 12.88 11.01 8.89 ...
## $ JP_Sales : num 3.77 6.81 3.79 3.28 10.22 ...
## $ Other_Sales : num 8.46 0.77 3.31 2.96 1 0.58 2.9 2.85 2.26 0.47 ...
## $ Global_Sales: num 82.7 40.2 35.8 33 31.4 ...
#Checking missing value
games[games == "N/A"]<-NA
colSums(is.na(games))
## Rank Name Platform Year Genre Publisher
## 0 0 0 271 0 58
## NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales
## 0 0 0 0 0
#Calculating the percentage of missing values
colSums(is.na(games))/nrow(games)
## Rank Name Platform Year Genre Publisher
## 0.000000000 0.000000000 0.000000000 0.016327268 0.000000000 0.003494397
## NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
Since missing values account for less than 5% of all observations (about 1.6% for Year and 0.35% for Publisher), every row containing a missing value will be dropped.
#Dropping observation with missing values
games <- games %>%
drop_na(Year, Publisher)
anyNA(games)
## [1] FALSE
All columns have been converted to the desired data types, and there are no more missing values.
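The 5% rule applied above can be written as an explicit check. The following is a minimal sketch on a made-up data frame; `toy`, `threshold` and `toy_clean` are hypothetical names used only for illustration and are not part of the original analysis.

```r
# Toy data standing in for `games` (values are made up for illustration).
toy <- data.frame(
  Year      = c(NA, rep("2006", 39)),
  Publisher = c(rep("Nintendo", 38), NA, "Sega"),
  Sales     = seq_len(40) / 10
)

# Share of missing values per column (same as colSums(is.na(x)) / nrow(x)).
missing_share <- colMeans(is.na(toy))

# Assumed cut-off: drop incomplete rows only if every column is below 5% missing.
threshold <- 0.05
if (all(missing_share < threshold)) {
  toy_clean <- toy[complete.cases(toy), ]
} else {
  toy_clean <- toy
}

missing_share
nrow(toy_clean)
```

Keeping the threshold in a named variable makes the decision visible in the code instead of leaving it only in the surrounding text.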
#Data summary
summary(games)
## Rank Name Platform Year
## Min. : 1 Length:16291 DS :2131 2009 :1431
## 1st Qu.: 4132 Class :character PS2 :2127 2008 :1428
## Median : 8292 Mode :character PS3 :1304 2010 :1257
## Mean : 8290 Wii :1290 2007 :1201
## 3rd Qu.:12440 X360 :1234 2011 :1136
## Max. :16600 PSP :1197 2006 :1008
## (Other):7008 (Other):8830
## Genre Publisher NA_Sales
## Action :3251 Electronic Arts : 1339 Min. : 0.0000
## Sports :2304 Activision : 966 1st Qu.: 0.0000
## Misc :1686 Namco Bandai Games : 928 Median : 0.0800
## Role-Playing:1470 Ubisoft : 918 Mean : 0.2656
## Shooter :1282 Konami Digital Entertainment: 823 3rd Qu.: 0.2400
## Adventure :1274 THQ : 712 Max. :41.4900
## (Other) :5024 (Other) :10605
## EU_Sales JP_Sales Other_Sales Global_Sales
## Min. : 0.0000 Min. : 0.00000 Min. : 0.00000 Min. : 0.0100
## 1st Qu.: 0.0000 1st Qu.: 0.00000 1st Qu.: 0.00000 1st Qu.: 0.0600
## Median : 0.0200 Median : 0.00000 Median : 0.01000 Median : 0.1700
## Mean : 0.1477 Mean : 0.07883 Mean : 0.04843 Mean : 0.5409
## 3rd Qu.: 0.1100 3rd Qu.: 0.04000 3rd Qu.: 0.04000 3rd Qu.: 0.4800
## Max. :29.0200 Max. :10.22000 Max. :10.57000 Max. :82.7400
##
Data Summary:
head(games[ order(games$Global_Sales, decreasing = TRUE), ],1)
## Rank Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales
## 1 1 Wii Sports Wii 2006 Sports Nintendo 41.49 29.02 3.77
## Other_Sales Global_Sales
## 1 8.46 82.74
The best-selling game worldwide is “Wii Sports”, with total global sales of 82.74 million copies.
total_sales_genre <- aggregate.data.frame(x = list(Total_Sales = games$Global_Sales),
by = list(Genre = games$Genre),
FUN = sum)
total_sales_genre <- total_sales_genre[order(total_sales_genre$Total_Sales, decreasing = T), ]
total_sales_genre
## Genre Total_Sales
## 1 Action 1722.84
## 11 Sports 1309.24
## 9 Shooter 1026.20
## 8 Role-Playing 923.83
## 5 Platform 829.13
## 4 Misc 789.87
## 7 Racing 726.76
## 3 Fighting 444.05
## 10 Simulation 389.98
## 6 Puzzle 242.21
## 2 Adventure 234.59
## 12 Strategy 173.27
The best-selling genre is “Action”, with total global sales of around 1,723 million. The genres ranked by total global sales are visualized below:
ggplot(data = total_sales_genre, mapping = aes(x = reorder(Genre, -Total_Sales), y = Total_Sales)) +
geom_col(mapping = aes(fill = Genre)) +
theme_classic() +
theme(axis.text.x = element_text(angle = 90))+
theme(legend.position = "None") +
labs(title = "Genre by Total Global Sales",
subtitle = "Video Games Sales Data",
x = NULL,
y = "Total Sales")+
scale_fill_brewer(palette = "Spectral")
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Spectral is 11
## Returning the palette you asked for with that many colors
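The warning above appears because the "Spectral" palette offers at most 11 colours while `Genre` has 12 levels, so one genre may be left without a palette colour. One possible workaround, sketched here as an assumption rather than the author's method, is to interpolate the 11 base colours up to 12 with `colorRampPalette()` and pass them to `scale_fill_manual()`. The names `base_cols` and `genre_cols` are hypothetical.

```r
library(RColorBrewer)

n_levels <- 12                                   # number of Genre levels in the data

# Take the largest available Spectral palette and interpolate it to n_levels colours.
base_cols  <- brewer.pal(11, "Spectral")
genre_cols <- colorRampPalette(base_cols)(n_levels)

length(genre_cols)
```

In the plot, `scale_fill_brewer(palette = "Spectral")` could then be swapped for `scale_fill_manual(values = genre_cols)`.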
The visualization shows that “Action” is the best-selling genre, followed by “Sports” and “Shooter”, while “Strategy” comes last.
total_sales_publisher <- aggregate.data.frame(x = list(Total_Sales = games$Global_Sales),
by = list(Publisher = games$Publisher),
FUN = sum)
total_sales_publisher <- total_sales_publisher[order(total_sales_publisher$Total_Sales, decreasing = T), ]
head(total_sales_publisher, 10)
## Publisher Total_Sales
## 368 Nintendo 1784.43
## 139 Electronic Arts 1093.39
## 17 Activision 721.41
## 463 Sony Computer Entertainment 607.28
## 531 Ubisoft 473.54
## 497 Take-Two Interactive 399.30
## 513 THQ 340.44
## 282 Konami Digital Entertainment 278.56
## 451 Sega 270.70
## 351 Namco Bandai Games 253.65
The publisher with the highest global sales is “Nintendo”, with total global sales of around 1,784 million. The top 10 publishers by total global sales are visualized below:
top_sales_publisher <- head(total_sales_publisher, 10)
ggplot(data = top_sales_publisher, mapping = aes(x = Total_Sales, y = reorder(Publisher, Total_Sales))) +
geom_col(mapping = aes(fill = Publisher)) +
theme_classic() +
theme(legend.position = "None") +
labs(title = "Top 10 Publishers by Total Global Sales",
subtitle = "Video Games Sales Data",
x = "Total Sales",
y = NULL)+
scale_fill_brewer(palette = "Spectral")
From the visualization, the top 10 publishers based on their video game sales can be seen clearly. The first place goes to “Nintendo” followed by “Electronic Arts” and “Activision”.
total_sales_platform <- aggregate.data.frame(x = list(Total_Sales = games$Global_Sales),
by = list(Platform = games$Platform),
FUN = sum)
total_sales_platform <- total_sales_platform[order(total_sales_platform$Total_Sales, decreasing = T), ]
head(total_sales_platform, 10)
## Platform Total_Sales
## 17 PS2 1233.46
## 29 X360 969.60
## 18 PS3 949.35
## 26 Wii 909.81
## 5 DS 818.91
## 16 PS 727.39
## 7 GBA 305.62
## 20 PSP 291.71
## 19 PS4 278.10
## 14 PC 254.70
PS2 is the platform with the highest global video game sales, at around 1,233 million. The top 10 platforms by total global sales are visualized below:
top_sales_platform <- head(total_sales_platform, 10)
ggplot(data = top_sales_platform, mapping = aes(x = reorder(Platform, -Total_Sales), y = Total_Sales)) +
geom_col(mapping = aes(fill = Platform)) +
theme_classic() +
theme(legend.position = "None") +
labs(title = "Top 10 Platforms by Total Global Sales",
subtitle = "Video Games Sales Data",
x = NULL,
y = "Total Sales") +
scale_fill_brewer(palette = "Spectral")
The visualization shows the top 10 platforms by video game sales. First place goes to “PS2”, followed by “X360” and “PS3”, while “PC” ranks last among the top 10.
total_sales_year <- aggregate.data.frame(x = list(Total_Sales = games$Global_Sales),
by = list(Year = games$Year),
FUN = sum)
total_sales_year <- total_sales_year[order(total_sales_year$Total_Sales, decreasing = T), ]
head(total_sales_year, 10)
## Year Total_Sales
## 29 2008 678.90
## 30 2009 667.30
## 28 2007 609.92
## 31 2010 600.29
## 27 2006 521.04
## 32 2011 515.80
## 26 2005 458.51
## 25 2004 414.01
## 23 2002 395.52
## 34 2013 368.11
2008 was the year with the highest sales of video games globally.
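The same aggregate-then-sort pattern is used above for genre, publisher, platform and year. As a sketch only, it could be wrapped in a small helper; `total_sales_by()` and the toy data below are hypothetical and not part of the original report.

```r
# Hypothetical helper: sum a value column per group and sort in descending order.
total_sales_by <- function(df, group_col, value_col = "Global_Sales") {
  out <- aggregate(df[[value_col]], by = list(df[[group_col]]), FUN = sum)
  names(out) <- c(group_col, "Total_Sales")
  out[order(out$Total_Sales, decreasing = TRUE), ]
}

# Toy data (made-up values) to show the call.
toy <- data.frame(
  Genre        = c("Action", "Sports", "Action", "Puzzle"),
  Global_Sales = c(2, 3, 4, 1)
)

res <- total_sales_by(toy, "Genre")
res
```

With the real data, a call such as `total_sales_by(games, "Platform")` would be expected to reproduce the corresponding table.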
According to the table from the previous business question, the top 5 years by total worldwide sales are:
- 2008, with total sales of around 679 million;
- 2009, with total sales of around 667 million;
- 2007, with total sales of around 610 million;
- 2010, with total sales of around 600 million; and
- 2006, with total sales of around 521 million.
Based on that information, we are going to look at the sales of each genre in each of those years.
top_year <- games %>%
group_by(Year, Genre) %>%
summarize(Total_Sales = sum(Global_Sales)) %>%
filter(Year %in% c("2008", "2009", "2007", "2010", "2006"))
## `summarise()` has grouped output by 'Year'. You can override using the `.groups` argument.
ggplot(data = top_year, aes(x = reorder(Year, Total_Sales), y = Total_Sales))+
geom_col(aes(fill = Genre), position = "dodge")+
labs(title = "Top 5 Years Game Release by Genre",
subtitle = "Video Games Sales Data",
x = NULL,
y = "Total Sales",
fill = NULL)+
theme_minimal()+
theme(legend.position = "bottom")+
scale_fill_brewer(palette = "Spectral")
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Spectral is 11
## Returning the palette you asked for with that many colors
Based on the visualization, it is safe to say that “Sports” and “Action” dominated sales in those 5 years, while “Strategy” often had the lowest sales.
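The five years in the filter above are typed in by hand. A possible alternative, sketched here on made-up data, is to derive the top years with `slice_max()` and reuse them in the filter. The names `toy_sales`, `top_years` and `top_year_toy` are hypothetical, and the example keeps the top 2 years only because the toy data is small.

```r
library(dplyr)

# Toy data standing in for `games` (values are made up for illustration).
toy_sales <- data.frame(
  Year         = factor(c("2005", "2005", "2006", "2006", "2007", "2007", "2008", "2008")),
  Genre        = rep(c("Action", "Sports"), 4),
  Global_Sales = c(1, 2, 5, 4, 3, 3, 8, 7)
)

# Years with the highest total sales, as a character vector.
top_years <- toy_sales %>%
  group_by(Year) %>%
  summarise(Total_Sales = sum(Global_Sales), .groups = "drop") %>%
  slice_max(Total_Sales, n = 2) %>%
  pull(Year) %>%
  as.character()

# Sales per genre within those years only.
top_year_toy <- toy_sales %>%
  filter(Year %in% top_years) %>%
  group_by(Year, Genre) %>%
  summarise(Total_Sales = sum(Global_Sales), .groups = "drop")

top_years
top_year_toy
```

Setting `.groups = "drop"` also avoids the `summarise()` grouping message shown in the output above.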
publisher_year <- games %>%
group_by(Year, Publisher) %>%
summarise(Sales = sum(Global_Sales)) %>%
top_n(1) %>%
ungroup()
## `summarise()` has grouped output by 'Year'. You can override using the `.groups` argument.
## Selecting by Sales
publisher_year <- filter(publisher_year, !Year %in% c("2017", "2020"))
ggplot(data = publisher_year, aes(x = Year, y = Sales))+
geom_col(aes(fill = Publisher), position = "dodge")+
labs(title = "Leading Publisher of Each Year by Sales",
subtitle = "Video Games Sales Data",
x = NULL,
y = "Total Sales",
fill = NULL)+
theme_minimal()+
theme(axis.text.x = element_text(angle = 90))+
theme(legend.position = "right") +
scale_fill_brewer(palette = "Spectral")
As seen from the plot, Nintendo dominated the sales for most years.
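The claim that one publisher leads in most years can be backed by counting how many years each publisher tops. The sketch below uses made-up data; `toy_pub`, `leaders` and `years_led` are hypothetical names. It uses `slice_max()`, which supersedes `top_n()` in dplyr 1.0.0 and later.

```r
library(dplyr)

# Toy data standing in for `games` (values are made up for illustration).
toy_pub <- data.frame(
  Year         = c("2006", "2006", "2007", "2007", "2008", "2008"),
  Publisher    = rep(c("Nintendo", "Electronic Arts"), 3),
  Global_Sales = c(10, 4, 3, 6, 9, 5)
)

# Leading publisher per year: sum sales, then keep the top row within each year.
leaders <- toy_pub %>%
  group_by(Year, Publisher) %>%
  summarise(Sales = sum(Global_Sales), .groups = "drop_last") %>%
  slice_max(Sales, n = 1) %>%
  ungroup()

# Number of years in which each publisher was the leader.
years_led <- count(leaders, Publisher, sort = TRUE, name = "Years_Led")
years_led
```

Applied to `publisher_year`, the same `count()` call would give the number of years each publisher led.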
sales_comp_gen <- games %>%
select(Genre, NA_Sales, EU_Sales, JP_Sales, Other_Sales) %>%
group_by(Genre) %>%
summarise(NA_Sales = sum(NA_Sales),
EU_Sales = sum(EU_Sales),
JP_Sales = sum(JP_Sales),
Other_Sales = sum(Other_Sales))
sales_comp_gen <- pivot_longer(data = sales_comp_gen,
cols = c("NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"))
We can see the comparison of sales from each region by the genre below:
ggplot(data = sales_comp_gen, aes(x = name, y = Genre, fill = value))+
geom_tile(aes(fill = value))+
geom_text(aes(label = value), position = position_dodge(width = .1), color = "black")+
labs(title = "Sales Comparison by Genre",
subtitle = "Video Games Sales Data",
x = "Region",
y = NULL,
fill = NULL)+
theme_minimal()+
theme(legend.position = "right")+
scale_fill_distiller(palette = "Spectral")
For most regions, “Action” is the best-selling video game genre, while “Strategy” remains the lowest. The top 3 sales figures by genre all come from North America (“Action”, “Sports”, and “Shooter”).
sales_region <- games %>%
select(NA_Sales, EU_Sales, JP_Sales, Other_Sales) %>%
summarise(NA_Sales = sum(NA_Sales),
EU_Sales = sum(EU_Sales),
JP_Sales = sum(JP_Sales),
Other_Sales = sum(Other_Sales))
sales_region <- pivot_longer(data = sales_region,
cols = c("NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"))
ggplot(data = sales_region, aes(x = reorder(name, -value), y = value))+
geom_col(aes(fill = name), position = "dodge")+
labs(title = "Total Sales by Region",
subtitle = "Video Games Sales Data",
x = NULL,
y = "Total Sales",
fill = NULL)+
theme_minimal()+
theme(legend.position = "right")+
scale_fill_brewer(palette = "Spectral")
North America is the region with the highest all-time video game sales (up to 2016).
Based on the data used in this report and the exploratory data analysis process that has been done, we can conclude that:
If you are interested in selling or creating video games, here are a few recommendations: