Exploratory Data Analysis - Video Game Sales Dataset
1. Explanation
The beginning
Hello everyone , welcome to my first Rpubs. I made this rmd to fulfill my LBB assignment. Hope You like it :)
The Data About
This is Video Game Sales Dataset that I get from Kaggle. In this dataset contains of sales record that include the video game Platform, Genre, Year, and the game Name. We can see the competition from game publishers, when the video game sales started and when the video game sales increase significantly
2. Input Dataset
Dont’t forget to put the Dataset in the same folder and set the current folder as working directory
game_data <- read.csv("dataset/vgsales.csv")Done, let’s move to the next step
2.1 Dataset Inspection
Get first 5 rows
head(game_data, 5)## Rank Name Platform Year Genre Publisher NA_Sales
## 1 1 Wii Sports Wii 2006 Sports Nintendo 41.49
## 2 2 Super Mario Bros. NES 1985 Platform Nintendo 29.08
## 3 3 Mario Kart Wii Wii 2008 Racing Nintendo 15.85
## 4 4 Wii Sports Resort Wii 2009 Sports Nintendo 15.75
## 5 5 Pokemon Red/Pokemon Blue GB 1996 Role-Playing Nintendo 11.27
## EU_Sales JP_Sales Other_Sales Global_Sales
## 1 29.02 3.77 8.46 82.74
## 2 3.58 6.81 0.77 40.24
## 3 12.88 3.79 3.31 35.82
## 4 11.01 3.28 2.96 33.00
## 5 8.89 10.22 1.00 31.37
Get last 5 rows
tail(game_data, 5)## Rank Name Platform Year
## 16594 16596 Woody Woodpecker in Crazy Castle 5 GBA 2002
## 16595 16597 Men in Black II: Alien Escape GC 2003
## 16596 16598 SCORE International Baja 1000: The Official Game PS2 2008
## 16597 16599 Know How 2 DS 2010
## 16598 16600 Spirits & Spells GBA 2003
## Genre Publisher NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales
## 16594 Platform Kemco 0.01 0.00 0 0 0.01
## 16595 Shooter Infogrames 0.01 0.00 0 0 0.01
## 16596 Racing Activision 0.00 0.00 0 0 0.01
## 16597 Puzzle 7G//AMES 0.00 0.01 0 0 0.01
## 16598 Platform Wanadoo 0.01 0.00 0 0 0.01
Get total rows / observation
nrow(game_data)## [1] 16598
Get total columns
ncol(game_data)## [1] 11
Get all columns names
names(game_data)## [1] "Rank" "Name" "Platform" "Year" "Genre"
## [6] "Publisher" "NA_Sales" "EU_Sales" "JP_Sales" "Other_Sales"
## [11] "Global_Sales"
Get dimension of dataset
dim(game_data)## [1] 16598 11
From our inspection we can take few informations :
Video Games Sales dataset contains 16598 of rows and 11 columns
Each of columns names : Rank, Name, Platform, Year, Genre, Publisher, NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales (Total worldwide sales in millions)
2.2 Data Cleaning & Coertions
Check the data type for each column using str(structure)
str(game_data)## 'data.frame': 16598 obs. of 11 variables:
## $ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Name : chr "Wii Sports" "Super Mario Bros." "Mario Kart Wii" "Wii Sports Resort" ...
## $ Platform : chr "Wii" "NES" "Wii" "Wii" ...
## $ Year : chr "2006" "1985" "2008" "2009" ...
## $ Genre : chr "Sports" "Platform" "Racing" "Sports" ...
## $ Publisher : chr "Nintendo" "Nintendo" "Nintendo" "Nintendo" ...
## $ NA_Sales : num 41.5 29.1 15.8 15.8 11.3 ...
## $ EU_Sales : num 29.02 3.58 12.88 11.01 8.89 ...
## $ JP_Sales : num 3.77 6.81 3.79 3.28 10.22 ...
## $ Other_Sales : num 8.46 0.77 3.31 2.96 1 0.58 2.9 2.85 2.26 0.47 ...
## $ Global_Sales: num 82.7 40.2 35.8 33 31.4 ...
We can see from the result above, there are several data type that incorrect, let’s convert the data type to the correct type (Explicit Coertion)
game_data$Platform <- as.factor(game_data$Platform)
game_data$Year <- as.integer(game_data$Year)
game_data$Publisher <- as.factor(game_data$Publisher)
game_data$Genre <- as.factor(game_data$Genre)
str(game_data)## 'data.frame': 16598 obs. of 11 variables:
## $ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Name : chr "Wii Sports" "Super Mario Bros." "Mario Kart Wii" "Wii Sports Resort" ...
## $ Platform : Factor w/ 31 levels "2600","3DO","3DS",..: 26 12 26 26 6 6 5 26 26 12 ...
## $ Year : int 2006 1985 2008 2009 1996 1989 2006 2006 2009 1984 ...
## $ Genre : Factor w/ 12 levels "Action","Adventure",..: 11 5 7 11 8 6 5 4 5 9 ...
## $ Publisher : Factor w/ 579 levels "10TACLE Studios",..: 369 369 369 369 369 369 369 369 369 369 ...
## $ NA_Sales : num 41.5 29.1 15.8 15.8 11.3 ...
## $ EU_Sales : num 29.02 3.58 12.88 11.01 8.89 ...
## $ JP_Sales : num 3.77 6.81 3.79 3.28 10.22 ...
## $ Other_Sales : num 8.46 0.77 3.31 2.96 1 0.58 2.9 2.85 2.26 0.47 ...
## $ Global_Sales: num 82.7 40.2 35.8 33 31.4 ...
summary(game_data)## Rank Name Platform Year
## Min. : 1 Length:16598 DS :2163 Min. :1980
## 1st Qu.: 4151 Class :character PS2 :2161 1st Qu.:2003
## Median : 8300 Mode :character PS3 :1329 Median :2007
## Mean : 8301 Wii :1325 Mean :2006
## 3rd Qu.:12450 X360 :1265 3rd Qu.:2010
## Max. :16600 PSP :1213 Max. :2020
## (Other):7142 NA's :271
## Genre Publisher NA_Sales
## Action :3316 Electronic Arts : 1351 Min. : 0.0000
## Sports :2346 Activision : 975 1st Qu.: 0.0000
## Misc :1739 Namco Bandai Games : 932 Median : 0.0800
## Role-Playing:1488 Ubisoft : 921 Mean : 0.2647
## Shooter :1310 Konami Digital Entertainment: 832 3rd Qu.: 0.2400
## Adventure :1286 THQ : 715 Max. :41.4900
## (Other) :5113 (Other) :10872
## EU_Sales JP_Sales Other_Sales Global_Sales
## Min. : 0.0000 Min. : 0.00000 Min. : 0.00000 Min. : 0.0100
## 1st Qu.: 0.0000 1st Qu.: 0.00000 1st Qu.: 0.00000 1st Qu.: 0.0600
## Median : 0.0200 Median : 0.00000 Median : 0.01000 Median : 0.1700
## Mean : 0.1467 Mean : 0.07778 Mean : 0.04806 Mean : 0.5374
## 3rd Qu.: 0.1100 3rd Qu.: 0.04000 3rd Qu.: 0.04000 3rd Qu.: 0.4700
## Max. :29.0200 Max. :10.22000 Max. :10.57000 Max. :82.7400
##
Done. Each column type now are correct. We can see the summary that we get, a lot of information we can get (especially for the factor data type) because the data type now are correct
Check for the missing value
colSums(is.na(game_data))## Rank Name Platform Year Genre Publisher
## 0 0 0 271 0 0
## NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales
## 0 0 0 0 0
We have missing values in Year Column
anyNA(game_data)## [1] TRUE
head(game_data[which(is.na(game_data$Year)), c("Name", "Year")])## Name Year
## 180 Madden NFL 2004 NA
## 378 FIFA Soccer 2004 NA
## 432 LEGO Batman: The Videogame NA
## 471 wwe Smackdown vs. Raw 2006 NA
## 608 Space Invaders NA
## 625 Rock Band NA
Here is the 6 sample of missing value in Year Column by Name
From the result above, we can see the missing value for Year column, there are 271 rows. Let’s clean the missing value using complete_case() function in R
game_data <- game_data[complete.cases(game_data), ]nrow(game_data)## [1] 16327
anyNA(game_data)## [1] FALSE
Yeayyy, there is no missing value in our dataset right now. Don’t worry, we just remove 2% of rows from our dataset, that means we can still extract important information from our dataset
Let’s subsetting our dataset, because we are not use all of the column
game_data <- game_data[, !(colnames(game_data) %in% c("NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"))]head(game_data, 5)## Rank Name Platform Year Genre Publisher
## 1 1 Wii Sports Wii 2006 Sports Nintendo
## 2 2 Super Mario Bros. NES 1985 Platform Nintendo
## 3 3 Mario Kart Wii Wii 2008 Racing Nintendo
## 4 4 Wii Sports Resort Wii 2009 Sports Nintendo
## 5 5 Pokemon Red/Pokemon Blue GB 1996 Role-Playing Nintendo
## Global_Sales
## 1 82.74
## 2 40.24
## 3 35.82
## 4 33.00
## 5 31.37
Done, we success remove NA_Sales, EU_Sales, JP_Sales Column. Because we only focus in Global_Sales
3. Data Explanation
Let’s summarise our data with summary function in R
summary(game_data)## Rank Name Platform Year
## Min. : 1 Length:16327 DS :2133 Min. :1980
## 1st Qu.: 4136 Class :character PS2 :2127 1st Qu.:2003
## Median : 8295 Mode :character PS3 :1304 Median :2007
## Mean : 8293 Wii :1290 Mean :2006
## 3rd Qu.:12442 X360 :1235 3rd Qu.:2010
## Max. :16600 PSP :1197 Max. :2020
## (Other):7041
## Genre Publisher Global_Sales
## Action :3253 Electronic Arts : 1339 Min. : 0.0100
## Sports :2304 Activision : 966 1st Qu.: 0.0600
## Misc :1710 Namco Bandai Games : 928 Median : 0.1700
## Role-Playing:1471 Ubisoft : 918 Mean : 0.5402
## Shooter :1282 Konami Digital Entertainment: 823 3rd Qu.: 0.4800
## Adventure :1276 THQ : 712 Max. :82.7400
## (Other) :5031 (Other) :10641
Summary :
First Game Sales was in 1980
Games sales increase in 2006
Best selling Game by Genre is Action
DS is the Most Platform selling Games
The highest Sales of Global Sales was 82.7400
The lowest Sales of Global Sales was 0.0100
Let’s take a look, how the Global Sales from time to time increase significantly in 2006
library("scales")
# take data in year 2006, 2007 dan 2008
sales_game_one <- game_data[game_data$Year %in% c("2006", "2007", "2008") ,]
# calculate the mean of Global Sale by each Platform
sales_game_one_agg <- aggregate.data.frame(sales_game_one$Global_Sales, by = list(sales_game_one$Platform), mean)
# change column names
names(sales_game_one_agg) <- c("title", "mean")
# take only six highest Global Sales
sales_game_one_agg <- sales_game_one_agg[order(sales_game_one_agg$mean, decreasing = T),]
sales_game_one_agg <- sales_game_one_agg[1:6,]
ggplot(sales_game_one_agg, aes(x = title, y = mean)) +
geom_col(aes(fill = title), show.legend = F) +
labs(title = "Global Sales by Platfom", y = "Global Sales", x = "Platform", subtitle = "that Increase in 2006, 2007 and 2008") +
scale_y_continuous(labels = unit_format(scale = 100, suffix = " million")) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))From the result above, we can conclude 2006 is the high Global Sales from year to year
Let’s try to check the outlier in Global_Sales using boxplot, ggplot (scatter plot) and custome function
boxplot(game_data$Global_Sales, main = "Outlier Detection in Global Sales")ggplot(data = game_data, aes(x = Genre, y = Global_Sales)) +
labs(x = "Game Genre", y = "Global Sales", title = "Outlier Detection in Global Sales by Genre") +
geom_point(alpha = 0.5, position = 'jitter') +
theme(axis.text.x = element_text(angle = 90), plot.title = element_text(hjust = 0.5))outlier_detect <- function(sales) {
game_sd <- aggregate(sales ~ Genre, game_data, sd)
outlier <- data.frame(sales = game_sd$sales,
outlier = ifelse(game_sd$sales >= 2, TRUE, FALSE))
outlier
}
outlier_detect(game_data$Global_Sales)## sales outlier
## 1 1.1649482 FALSE
## 2 0.5105220 FALSE
## 3 0.9579153 FALSE
## 4 1.3232976 FALSE
## 5 2.5986424 TRUE
## 6 1.5762764 FALSE
## 7 1.6767708 FALSE
## 8 1.7169431 FALSE
## 9 1.8335782 FALSE
## 10 1.2055324 FALSE
## 11 2.1049264 TRUE
## 12 0.5241971 FALSE
From the result above, we can conclude :
The Global_Sales has a bit outlier base on the Boxplot and the Scatter Plot above, only a few values exceed the average of Global Sales
Beside that, from the outlier_detect function, we can see the average standard deviation of the Global Sales. The outlier is not much, only a few values exceed the average sd
Actually we can clean the outlier, but in this case, i think we don’t need to clean it. Because that can remove valuable information from Video Games Sales dataset.
4. Data Manipulation & Transformation
- Which Platrofm gives the lowest number of sales ?
game_sales <- game_data[order(game_data$Global_Sales == 0.01, decreasing = T),]
head(game_sales)## Rank
## 15981 15983
## 15982 15984
## 15983 15985
## 15984 15986
## 15985 15987
## 15986 15988
## Name
## 15981 Turok
## 15982 Coven and Labyrinth of Refrain
## 15983 Super Battle For Money Sentouchuu: Kyuukyoku no Shinobu to Battle Player Choujou Kessen!
## 15984 Dragon Zakura DS
## 15985 Chameleon: To Dye For!
## 15986 Hotel Giant DS
## Platform Year Genre Publisher Global_Sales
## 15981 PC 2008 Action Touchstone 0.01
## 15982 PSV 2016 Action Nippon Ichi Software 0.01
## 15983 3DS 2016 Action Namco Bandai Games 0.01
## 15984 DS 2007 Misc Electronic Arts 0.01
## 15985 DS 2006 Puzzle 505 Games 0.01
## 15986 DS 2008 Simulation Nobilis 0.01
From the result above, there are 6 Platform that have lowest Global Sales
- Which genre of the Game most users bought from all platform ?
round(prop.table(table(game_data$Genre)) * 100, 2)##
## Action Adventure Fighting Misc Platform Puzzle
## 19.92 7.82 5.12 10.47 5.37 3.50
## Racing Role-Playing Shooter Simulation Sports Strategy
## 7.51 9.01 7.85 5.21 14.11 4.11
Action is the genre of game that the most user bought
- Which year did game sales significantly increase ?
game_data_sales <- game_data[game_data$Global_Sales >= 40.00, ]
sort(table(game_data_sales$Year), decreasing = T)##
## 1985 2006 1980 1981 1982 1983 1984 1986 1987 1988 1989 1990 1991 1992 1993 1994
## 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2007 2008 2009 2010 2011
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2012 2013 2014 2015 2016 2017 2020
## 0 0 0 0 0 0 0
The game sales become significantly increase was in year 2006 & 1985
- Who was the publisher and the game sales that significantly increase ?
game <- data.frame(Publisher = game_data_sales$Publisher, Name = game_data_sales$Name)
game## Publisher Name
## 1 Nintendo Wii Sports
## 2 Nintendo Super Mario Bros.
The publisher that made game sales increase significantly in 2006 & 1985 was Nintendo
- How much total sales of top publisher each genre of game and platform
total_sales <- aggregate(Global_Sales ~ Platform + Genre + Publisher, game_data, sum)
head(total_sales[order(total_sales$Global_Sales, decreasing = T),])## Platform Genre Publisher Global_Sales
## 3199 Wii Sports Nintendo 169.19
## 1579 PS2 Sports Electronic Arts 126.39
## 298 X360 Shooter Activision 102.02
## 293 PS3 Shooter Activision 87.72
## 3167 DS Role-Playing Nintendo 85.88
## 3142 NES Platform Nintendo 81.25
We can see the top Publisher sales, there are Nintendo, Electronic Arts and Activision. Off course Nintendo still number one position of Sales :)
# Take just 6 top Publisher sales
total_sales <- head(total_sales[order(total_sales$Global_Sales, decreasing = T),!(colnames(total_sales) %in% c("Publisher"))])
total_sales## Platform Genre Global_Sales
## 3199 Wii Sports 169.19
## 1579 PS2 Sports 126.39
## 298 X360 Shooter 102.02
## 293 PS3 Shooter 87.72
## 3167 DS Role-Playing 85.88
## 3142 NES Platform 81.25
ggplot(data = total_sales, aes(fill = Platform, x = Global_Sales, y = Platform)) +
geom_bar(stat = "identity", show.legend = F) +
labs(x = "Global Sales", y = "Platfrom", title = "Top total sales of Platrfom by Global Sales ") +
scale_x_continuous(labels = unit_format(suffix = " million")) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))From the Bar Plot above, we can see Wii & PS 2 was the most high sales in Sports Genre and X360 was the most high sales in Shooter Genre
- Let’s dive deeper into PS 2 platform, because I want to nostalgic about my chillhood :). How much total sales of PS 2 Platform in 11 years (2000-2011) ?
total_sales_ps2 <- game_data[game_data$Platform == "PS2", c("Global_Sales", "Year")]
sum(total_sales_ps2$Global_Sales)## [1] 1233.46
Total sales of PS 2 Platform in 11 years was 1233.46
- What is Mean of total sales in PS 2 platform for every Year ?
aggregate(Global_Sales ~ Year, total_sales_ps2, mean)## Year Global_Sales
## 1 2000 0.47695122
## 2 2001 0.89962162
## 3 2002 0.73357143
## 4 2003 0.71988281
## 5 2004 0.81768340
## 6 2005 0.61788462
## 7 2006 0.39930502
## 8 2007 0.35514019
## 9 2008 0.28183246
## 10 2009 0.27552083
## 11 2010 0.14815789
## 12 2011 0.06714286
The highest mean of sales from PS 2 was in 2001
- Which is the Highest sales of PS 2 platform ?
highest_total_sales <- total_sales_ps2[total_sales_ps2$Global_Sales == 20.8100,]
head(highest_total_sales)## Global_Sales Year
## 18 20.81 2004
The highest sales of PS 2 was in 2004
- Which is the Lowest sales of PS 2 platform ?
lowest_total_sales <- total_sales_ps2[total_sales_ps2$Global_Sales == 0.01,]
head(lowest_total_sales)## Global_Sales Year
## 15995 0.01 2009
## 15998 0.01 2006
## 16030 0.01 2008
## 16034 0.01 2008
## 16035 0.01 2006
## 16039 0.01 2008
The lowest sales of PS 2 was from 2006 until 2009
- Which is the Highest Sales of PS 2 Platform for each Genre ?
top_genre <- game_data[game_data$Platform == "PS2", c("Global_Sales", "Genre")]
top_genre <- aggregate(Global_Sales ~ Genre, top_genre, sum)top_genre[order(top_genre$Global_Sales, decreasing = T),]## Genre Global_Sales
## 1 Action 272.43
## 11 Sports 262.64
## 7 Racing 154.21
## 9 Shooter 108.28
## 4 Misc 98.69
## 8 Role-Playing 91.55
## 3 Fighting 89.19
## 5 Platform 72.11
## 10 Simulation 42.26
## 2 Adventure 21.16
## 12 Strategy 15.04
## 6 Puzzle 5.90
ggplot(data = top_genre, aes(x = Global_Sales, y = Genre)) +
geom_bar(stat = "identity", fill = "steelblue", show.legend = F) +
labs(y = "Game Genre", x = "Global Sales", title = "Global sales of PS 2 by Genre") +
scale_x_continuous(labels = unit_format(suffix = " million")) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) From the Bar Plot above, we can see that Sports and Action Genre was the highest sales of PS 2 Platform
5. Explanatory Text & Business Recomendation
Video Game Sales was started in 1980 until now. In 40 Years (1980 - 2020) the total of Video Game sales was 8820.36. Each sales of video game have Genre. The most users like was Action and Sports. From all Platform, we can take a look, that Action and Sports have highest Proportion of Game Sales. Let’s take an example from PS 2 and Wii Platform. The most Genre that have highest sales was Action and Sports too. In 2004 was the year that PS 2 has highest sales and the genre was Action. Beside that, in 2006 Wii has highest Global Sales in the Sports Genre. That proved Action and Sports was the most like Game Genre.
Recomendations :
From Dateset Above we can see the most selling was Action and Sports Game Genre. So publisher can make that as reference for making future Game. PS 2 and Wii have proved it. Game Genre Action and Sports made PS 2 and Wii have the highest Global Sales.
Try to avoid Puzzle Game Genre. Because Puzzle Game Genre has the lowest Global Sales and Proportion. We can take a look in 2016 - 2020, every Publisher and Platform consist to made Action Game Genre. Top Publisher like Capcom, Ubisoft, Warner Bros and Sony competing to make Action Game. Overall we can conclude that the Puzzle Game Genre is not a trend anymore in 2016 - 2020.