Exploratory Data Analysis - Video Game Sales Dataset

1. Explanation

The beginning

Hello everyone , welcome to my first Rpubs. I made this rmd to fulfill my LBB assignment. Hope You like it :)

The Data About

This is Video Game Sales Dataset that I get from Kaggle. In this dataset contains of sales record that include the video game Platform, Genre, Year, and the game Name. We can see the competition from game publishers, when the video game sales started and when the video game sales increase significantly

2. Input Dataset

Dont’t forget to put the Dataset in the same folder and set the current folder as working directory

game_data <- read.csv("dataset/vgsales.csv")
Done, let’s move to the next step

2.1 Dataset Inspection

Get first 5 rows

head(game_data, 5)
##   Rank                     Name Platform Year        Genre Publisher NA_Sales
## 1    1               Wii Sports      Wii 2006       Sports  Nintendo    41.49
## 2    2        Super Mario Bros.      NES 1985     Platform  Nintendo    29.08
## 3    3           Mario Kart Wii      Wii 2008       Racing  Nintendo    15.85
## 4    4        Wii Sports Resort      Wii 2009       Sports  Nintendo    15.75
## 5    5 Pokemon Red/Pokemon Blue       GB 1996 Role-Playing  Nintendo    11.27
##   EU_Sales JP_Sales Other_Sales Global_Sales
## 1    29.02     3.77        8.46        82.74
## 2     3.58     6.81        0.77        40.24
## 3    12.88     3.79        3.31        35.82
## 4    11.01     3.28        2.96        33.00
## 5     8.89    10.22        1.00        31.37

Get last 5 rows

tail(game_data, 5)
##        Rank                                             Name Platform Year
## 16594 16596               Woody Woodpecker in Crazy Castle 5      GBA 2002
## 16595 16597                    Men in Black II: Alien Escape       GC 2003
## 16596 16598 SCORE International Baja 1000: The Official Game      PS2 2008
## 16597 16599                                       Know How 2       DS 2010
## 16598 16600                                 Spirits & Spells      GBA 2003
##          Genre  Publisher NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales
## 16594 Platform      Kemco     0.01     0.00        0           0         0.01
## 16595  Shooter Infogrames     0.01     0.00        0           0         0.01
## 16596   Racing Activision     0.00     0.00        0           0         0.01
## 16597   Puzzle   7G//AMES     0.00     0.01        0           0         0.01
## 16598 Platform    Wanadoo     0.01     0.00        0           0         0.01

Get total rows / observation

nrow(game_data)
## [1] 16598

Get total columns

ncol(game_data)
## [1] 11

Get all columns names

names(game_data)
##  [1] "Rank"         "Name"         "Platform"     "Year"         "Genre"       
##  [6] "Publisher"    "NA_Sales"     "EU_Sales"     "JP_Sales"     "Other_Sales" 
## [11] "Global_Sales"

Get dimension of dataset

dim(game_data)
## [1] 16598    11
From our inspection we can take few informations :
  • Video Games Sales dataset contains 16598 of rows and 11 columns
  • Each of columns names : Rank, Name, Platform, Year, Genre, Publisher, NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales (Total worldwide sales in millions)

2.2 Data Cleaning & Coertions

Check the data type for each column using str(structure)

str(game_data)
## 'data.frame':    16598 obs. of  11 variables:
##  $ Rank        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name        : chr  "Wii Sports" "Super Mario Bros." "Mario Kart Wii" "Wii Sports Resort" ...
##  $ Platform    : chr  "Wii" "NES" "Wii" "Wii" ...
##  $ Year        : chr  "2006" "1985" "2008" "2009" ...
##  $ Genre       : chr  "Sports" "Platform" "Racing" "Sports" ...
##  $ Publisher   : chr  "Nintendo" "Nintendo" "Nintendo" "Nintendo" ...
##  $ NA_Sales    : num  41.5 29.1 15.8 15.8 11.3 ...
##  $ EU_Sales    : num  29.02 3.58 12.88 11.01 8.89 ...
##  $ JP_Sales    : num  3.77 6.81 3.79 3.28 10.22 ...
##  $ Other_Sales : num  8.46 0.77 3.31 2.96 1 0.58 2.9 2.85 2.26 0.47 ...
##  $ Global_Sales: num  82.7 40.2 35.8 33 31.4 ...
We can see from the result above, there are several data type that incorrect, let’s convert the data type to the correct type (Explicit Coertion)
game_data$Platform <- as.factor(game_data$Platform)
game_data$Year <- as.integer(game_data$Year)
game_data$Publisher <- as.factor(game_data$Publisher)
game_data$Genre <- as.factor(game_data$Genre)

str(game_data)
## 'data.frame':    16598 obs. of  11 variables:
##  $ Rank        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name        : chr  "Wii Sports" "Super Mario Bros." "Mario Kart Wii" "Wii Sports Resort" ...
##  $ Platform    : Factor w/ 31 levels "2600","3DO","3DS",..: 26 12 26 26 6 6 5 26 26 12 ...
##  $ Year        : int  2006 1985 2008 2009 1996 1989 2006 2006 2009 1984 ...
##  $ Genre       : Factor w/ 12 levels "Action","Adventure",..: 11 5 7 11 8 6 5 4 5 9 ...
##  $ Publisher   : Factor w/ 579 levels "10TACLE Studios",..: 369 369 369 369 369 369 369 369 369 369 ...
##  $ NA_Sales    : num  41.5 29.1 15.8 15.8 11.3 ...
##  $ EU_Sales    : num  29.02 3.58 12.88 11.01 8.89 ...
##  $ JP_Sales    : num  3.77 6.81 3.79 3.28 10.22 ...
##  $ Other_Sales : num  8.46 0.77 3.31 2.96 1 0.58 2.9 2.85 2.26 0.47 ...
##  $ Global_Sales: num  82.7 40.2 35.8 33 31.4 ...
summary(game_data)
##       Rank           Name              Platform         Year     
##  Min.   :    1   Length:16598       DS     :2163   Min.   :1980  
##  1st Qu.: 4151   Class :character   PS2    :2161   1st Qu.:2003  
##  Median : 8300   Mode  :character   PS3    :1329   Median :2007  
##  Mean   : 8301                      Wii    :1325   Mean   :2006  
##  3rd Qu.:12450                      X360   :1265   3rd Qu.:2010  
##  Max.   :16600                      PSP    :1213   Max.   :2020  
##                                     (Other):7142   NA's   :271   
##           Genre                             Publisher        NA_Sales      
##  Action      :3316   Electronic Arts             : 1351   Min.   : 0.0000  
##  Sports      :2346   Activision                  :  975   1st Qu.: 0.0000  
##  Misc        :1739   Namco Bandai Games          :  932   Median : 0.0800  
##  Role-Playing:1488   Ubisoft                     :  921   Mean   : 0.2647  
##  Shooter     :1310   Konami Digital Entertainment:  832   3rd Qu.: 0.2400  
##  Adventure   :1286   THQ                         :  715   Max.   :41.4900  
##  (Other)     :5113   (Other)                     :10872                    
##     EU_Sales          JP_Sales         Other_Sales        Global_Sales    
##  Min.   : 0.0000   Min.   : 0.00000   Min.   : 0.00000   Min.   : 0.0100  
##  1st Qu.: 0.0000   1st Qu.: 0.00000   1st Qu.: 0.00000   1st Qu.: 0.0600  
##  Median : 0.0200   Median : 0.00000   Median : 0.01000   Median : 0.1700  
##  Mean   : 0.1467   Mean   : 0.07778   Mean   : 0.04806   Mean   : 0.5374  
##  3rd Qu.: 0.1100   3rd Qu.: 0.04000   3rd Qu.: 0.04000   3rd Qu.: 0.4700  
##  Max.   :29.0200   Max.   :10.22000   Max.   :10.57000   Max.   :82.7400  
## 
Done. Each column type now are correct. We can see the summary that we get, a lot of information we can get (especially for the factor data type) because the data type now are correct

Check for the missing value

colSums(is.na(game_data))
##         Rank         Name     Platform         Year        Genre    Publisher 
##            0            0            0          271            0            0 
##     NA_Sales     EU_Sales     JP_Sales  Other_Sales Global_Sales 
##            0            0            0            0            0
We have missing values in Year Column
anyNA(game_data)
## [1] TRUE
head(game_data[which(is.na(game_data$Year)),  c("Name", "Year")])
##                           Name Year
## 180            Madden NFL 2004   NA
## 378           FIFA Soccer 2004   NA
## 432 LEGO Batman: The Videogame   NA
## 471 wwe Smackdown vs. Raw 2006   NA
## 608             Space Invaders   NA
## 625                  Rock Band   NA

Here is the 6 sample of missing value in Year Column by Name

From the result above, we can see the missing value for Year column, there are 271 rows. Let’s clean the missing value using complete_case() function in R
game_data <- game_data[complete.cases(game_data), ]
nrow(game_data)
## [1] 16327
anyNA(game_data)
## [1] FALSE
Yeayyy, there is no missing value in our dataset right now. Don’t worry, we just remove 2% of rows from our dataset, that means we can still extract important information from our dataset

Let’s subsetting our dataset, because we are not use all of the column

game_data <- game_data[, !(colnames(game_data) %in% c("NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"))]
head(game_data, 5)
##   Rank                     Name Platform Year        Genre Publisher
## 1    1               Wii Sports      Wii 2006       Sports  Nintendo
## 2    2        Super Mario Bros.      NES 1985     Platform  Nintendo
## 3    3           Mario Kart Wii      Wii 2008       Racing  Nintendo
## 4    4        Wii Sports Resort      Wii 2009       Sports  Nintendo
## 5    5 Pokemon Red/Pokemon Blue       GB 1996 Role-Playing  Nintendo
##   Global_Sales
## 1        82.74
## 2        40.24
## 3        35.82
## 4        33.00
## 5        31.37
Done, we success remove NA_Sales, EU_Sales, JP_Sales Column. Because we only focus in Global_Sales

3. Data Explanation

Let’s summarise our data with summary function in R

summary(game_data)
##       Rank           Name              Platform         Year     
##  Min.   :    1   Length:16327       DS     :2133   Min.   :1980  
##  1st Qu.: 4136   Class :character   PS2    :2127   1st Qu.:2003  
##  Median : 8295   Mode  :character   PS3    :1304   Median :2007  
##  Mean   : 8293                      Wii    :1290   Mean   :2006  
##  3rd Qu.:12442                      X360   :1235   3rd Qu.:2010  
##  Max.   :16600                      PSP    :1197   Max.   :2020  
##                                     (Other):7041                 
##           Genre                             Publisher      Global_Sales    
##  Action      :3253   Electronic Arts             : 1339   Min.   : 0.0100  
##  Sports      :2304   Activision                  :  966   1st Qu.: 0.0600  
##  Misc        :1710   Namco Bandai Games          :  928   Median : 0.1700  
##  Role-Playing:1471   Ubisoft                     :  918   Mean   : 0.5402  
##  Shooter     :1282   Konami Digital Entertainment:  823   3rd Qu.: 0.4800  
##  Adventure   :1276   THQ                         :  712   Max.   :82.7400  
##  (Other)     :5031   (Other)                     :10641
Summary :
  • First Game Sales was in 1980
  • Games sales increase in 2006
  • Best selling Game by Genre is Action
  • DS is the Most Platform selling Games
  • The highest Sales of Global Sales was 82.7400
  • The lowest Sales of Global Sales was 0.0100

Let’s take a look, how the Global Sales from time to time increase significantly in 2006

library("scales")

# take data in year 2006, 2007 dan 2008
sales_game_one <- game_data[game_data$Year %in% c("2006", "2007", "2008") ,]

# calculate the mean of Global Sale by each Platform
sales_game_one_agg <- aggregate.data.frame(sales_game_one$Global_Sales, by = list(sales_game_one$Platform), mean)

# change column names
names(sales_game_one_agg) <- c("title", "mean")

# take only six highest Global Sales
sales_game_one_agg <- sales_game_one_agg[order(sales_game_one_agg$mean, decreasing = T),]
sales_game_one_agg <- sales_game_one_agg[1:6,]



ggplot(sales_game_one_agg, aes(x = title, y = mean)) +
  geom_col(aes(fill = title), show.legend = F) +
  labs(title = "Global Sales by Platfom", y = "Global Sales", x = "Platform", subtitle = "that Increase in 2006, 2007 and 2008") +
  scale_y_continuous(labels = unit_format(scale = 100, suffix = " million")) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

From the result above, we can conclude 2006 is the high Global Sales from year to year

Let’s try to check the outlier in Global_Sales using boxplot, ggplot (scatter plot) and custome function

boxplot(game_data$Global_Sales, main = "Outlier Detection in Global Sales")

ggplot(data = game_data, aes(x = Genre, y = Global_Sales)) +
  labs(x = "Game Genre", y = "Global Sales", title = "Outlier Detection in Global Sales by Genre") +
  geom_point(alpha = 0.5, position = 'jitter') +
  theme(axis.text.x = element_text(angle = 90), plot.title = element_text(hjust = 0.5))

outlier_detect <- function(sales) {
  game_sd <- aggregate(sales ~ Genre, game_data, sd)

  outlier <- data.frame(sales = game_sd$sales, 
                               outlier = ifelse(game_sd$sales >= 2, TRUE, FALSE)) 
  outlier
}

outlier_detect(game_data$Global_Sales)
##        sales outlier
## 1  1.1649482   FALSE
## 2  0.5105220   FALSE
## 3  0.9579153   FALSE
## 4  1.3232976   FALSE
## 5  2.5986424    TRUE
## 6  1.5762764   FALSE
## 7  1.6767708   FALSE
## 8  1.7169431   FALSE
## 9  1.8335782   FALSE
## 10 1.2055324   FALSE
## 11 2.1049264    TRUE
## 12 0.5241971   FALSE
From the result above, we can conclude :
  • The Global_Sales has a bit outlier base on the Boxplot and the Scatter Plot above, only a few values exceed the average of Global Sales
  • Beside that, from the outlier_detect function, we can see the average standard deviation of the Global Sales. The outlier is not much, only a few values exceed the average sd
  • Actually we can clean the outlier, but in this case, i think we don’t need to clean it. Because that can remove valuable information from Video Games Sales dataset.

4. Data Manipulation & Transformation

  1. Which Platrofm gives the lowest number of sales ?
game_sales <- game_data[order(game_data$Global_Sales == 0.01, decreasing = T),]
head(game_sales)
##        Rank
## 15981 15983
## 15982 15984
## 15983 15985
## 15984 15986
## 15985 15987
## 15986 15988
##                                                                                           Name
## 15981                                                                                    Turok
## 15982                                                           Coven and Labyrinth of Refrain
## 15983 Super Battle For Money Sentouchuu: Kyuukyoku no Shinobu to Battle Player Choujou Kessen!
## 15984                                                                         Dragon Zakura DS
## 15985                                                                   Chameleon: To Dye For!
## 15986                                                                           Hotel Giant DS
##       Platform Year      Genre            Publisher Global_Sales
## 15981       PC 2008     Action           Touchstone         0.01
## 15982      PSV 2016     Action Nippon Ichi Software         0.01
## 15983      3DS 2016     Action   Namco Bandai Games         0.01
## 15984       DS 2007       Misc      Electronic Arts         0.01
## 15985       DS 2006     Puzzle            505 Games         0.01
## 15986       DS 2008 Simulation              Nobilis         0.01
From the result above, there are 6 Platform that have lowest Global Sales
  1. Which genre of the Game most users bought from all platform ?
round(prop.table(table(game_data$Genre)) * 100, 2)
## 
##       Action    Adventure     Fighting         Misc     Platform       Puzzle 
##        19.92         7.82         5.12        10.47         5.37         3.50 
##       Racing Role-Playing      Shooter   Simulation       Sports     Strategy 
##         7.51         9.01         7.85         5.21        14.11         4.11
Action is the genre of game that the most user bought
  1. Which year did game sales significantly increase ?
game_data_sales <- game_data[game_data$Global_Sales >= 40.00, ]
sort(table(game_data_sales$Year), decreasing = T)
## 
## 1985 2006 1980 1981 1982 1983 1984 1986 1987 1988 1989 1990 1991 1992 1993 1994 
##    1    1    0    0    0    0    0    0    0    0    0    0    0    0    0    0 
## 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2007 2008 2009 2010 2011 
##    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0 
## 2012 2013 2014 2015 2016 2017 2020 
##    0    0    0    0    0    0    0
The game sales become significantly increase was in year 2006 & 1985
  1. Who was the publisher and the game sales that significantly increase ?
game <-  data.frame(Publisher = game_data_sales$Publisher, Name = game_data_sales$Name)
game
##   Publisher              Name
## 1  Nintendo        Wii Sports
## 2  Nintendo Super Mario Bros.
The publisher that made game sales increase significantly in 2006 & 1985 was Nintendo
  1. How much total sales of top publisher each genre of game and platform
total_sales  <- aggregate(Global_Sales ~ Platform + Genre + Publisher, game_data, sum)
head(total_sales[order(total_sales$Global_Sales, decreasing = T),])
##      Platform        Genre       Publisher Global_Sales
## 3199      Wii       Sports        Nintendo       169.19
## 1579      PS2       Sports Electronic Arts       126.39
## 298      X360      Shooter      Activision       102.02
## 293       PS3      Shooter      Activision        87.72
## 3167       DS Role-Playing        Nintendo        85.88
## 3142      NES     Platform        Nintendo        81.25
We can see the top Publisher sales, there are Nintendo, Electronic Arts and Activision. Off course Nintendo still number one position of Sales :)
# Take just 6 top Publisher sales
total_sales <- head(total_sales[order(total_sales$Global_Sales, decreasing = T),!(colnames(total_sales) %in% c("Publisher"))])
total_sales
##      Platform        Genre Global_Sales
## 3199      Wii       Sports       169.19
## 1579      PS2       Sports       126.39
## 298      X360      Shooter       102.02
## 293       PS3      Shooter        87.72
## 3167       DS Role-Playing        85.88
## 3142      NES     Platform        81.25
ggplot(data = total_sales, aes(fill = Platform, x = Global_Sales, y = Platform)) +
  geom_bar(stat = "identity", show.legend = F) +
  labs(x = "Global Sales", y = "Platfrom", title = "Top total sales of Platrfom by Global Sales ") +
  scale_x_continuous(labels = unit_format(suffix = " million")) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

From the Bar Plot above, we can see Wii & PS 2 was the most high sales in Sports Genre and X360 was the most high sales in Shooter Genre
  1. Let’s dive deeper into PS 2 platform, because I want to nostalgic about my chillhood :). How much total sales of PS 2 Platform in 11 years (2000-2011) ?
total_sales_ps2 <- game_data[game_data$Platform == "PS2", c("Global_Sales", "Year")]
sum(total_sales_ps2$Global_Sales)
## [1] 1233.46
Total sales of PS 2 Platform in 11 years was 1233.46
  1. What is Mean of total sales in PS 2 platform for every Year ?
aggregate(Global_Sales ~ Year, total_sales_ps2, mean)
##    Year Global_Sales
## 1  2000   0.47695122
## 2  2001   0.89962162
## 3  2002   0.73357143
## 4  2003   0.71988281
## 5  2004   0.81768340
## 6  2005   0.61788462
## 7  2006   0.39930502
## 8  2007   0.35514019
## 9  2008   0.28183246
## 10 2009   0.27552083
## 11 2010   0.14815789
## 12 2011   0.06714286
The highest mean of sales from PS 2 was in 2001
  1. Which is the Highest sales of PS 2 platform ?
highest_total_sales <- total_sales_ps2[total_sales_ps2$Global_Sales == 20.8100,]
head(highest_total_sales)
##    Global_Sales Year
## 18        20.81 2004
The highest sales of PS 2 was in 2004
  1. Which is the Lowest sales of PS 2 platform ?
lowest_total_sales <- total_sales_ps2[total_sales_ps2$Global_Sales == 0.01,]
head(lowest_total_sales)
##       Global_Sales Year
## 15995         0.01 2009
## 15998         0.01 2006
## 16030         0.01 2008
## 16034         0.01 2008
## 16035         0.01 2006
## 16039         0.01 2008
The lowest sales of PS 2 was from 2006 until 2009
  1. Which is the Highest Sales of PS 2 Platform for each Genre ?
top_genre <- game_data[game_data$Platform == "PS2", c("Global_Sales", "Genre")]
top_genre <- aggregate(Global_Sales ~ Genre, top_genre, sum)
top_genre[order(top_genre$Global_Sales, decreasing = T),]
##           Genre Global_Sales
## 1        Action       272.43
## 11       Sports       262.64
## 7        Racing       154.21
## 9       Shooter       108.28
## 4          Misc        98.69
## 8  Role-Playing        91.55
## 3      Fighting        89.19
## 5      Platform        72.11
## 10   Simulation        42.26
## 2     Adventure        21.16
## 12     Strategy        15.04
## 6        Puzzle         5.90
ggplot(data = top_genre, aes(x = Global_Sales, y = Genre)) +
  geom_bar(stat = "identity", fill = "steelblue", show.legend = F) +
  labs(y = "Game Genre", x = "Global Sales", title = "Global sales of PS 2 by Genre") +
  scale_x_continuous(labels = unit_format(suffix = " million")) +
  theme_minimal() + 
  theme(plot.title = element_text(hjust = 0.5)) 

From the Bar Plot above, we can see that Sports and Action Genre was the highest sales of PS 2 Platform

5. Explanatory Text & Business Recomendation

Video Game Sales was started in 1980 until now. In 40 Years (1980 - 2020) the total of Video Game sales was 8820.36. Each sales of video game have Genre. The most users like was Action and Sports. From all Platform, we can take a look, that Action and Sports have highest Proportion of Game Sales. Let’s take an example from PS 2 and Wii Platform. The most Genre that have highest sales was Action and Sports too. In 2004 was the year that PS 2 has highest sales and the genre was Action. Beside that, in 2006 Wii has highest Global Sales in the Sports Genre. That proved Action and Sports was the most like Game Genre.

Recomendations :

  • From Dateset Above we can see the most selling was Action and Sports Game Genre. So publisher can make that as reference for making future Game. PS 2 and Wii have proved it. Game Genre Action and Sports made PS 2 and Wii have the highest Global Sales.
  • Try to avoid Puzzle Game Genre. Because Puzzle Game Genre has the lowest Global Sales and Proportion. We can take a look in 2016 - 2020, every Publisher and Platform consist to made Action Game Genre. Top Publisher like Capcom, Ubisoft, Warner Bros and Sony competing to make Action Game. Overall we can conclude that the Puzzle Game Genre is not a trend anymore in 2016 - 2020.