Video Game sales

I’ll demonstrate the power of Tidyverse on a dataset from Kaggle: https://www.kaggle.com/gregorut/videogamesales

Fields in the dataset are:

Rank - Ranking of overall sales

Name - The games name

Platform - Platform of the games release (i.e. PC,PS4, etc.)

Year - Year of the game's release

Genre - Genre of the game

Publisher - Publisher of the game

NA_Sales - Sales in North America (in millions)

EU_Sales - Sales in Europe (in millions)

JP_Sales - Sales in Japan (in millions)

Other_Sales - Sales in the rest of the world (in millions)

Global_Sales - Total worldwide sales.

Data import

Following Wickham & Grolemund data science framework, this is the fist step in any data science project: Import.

library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.4
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
#importing and converting into a dataframe using piping (%>%), which is widely used in tidyverse.

sales_raw<-read.csv("https://raw.githubusercontent.com/bsvmelo/CUNY/master/vgsales.csv") %>% data.frame()

#quick display
head(sales_raw)
##   Rank                     Name Platform Year        Genre Publisher NA_Sales
## 1    1               Wii Sports      Wii 2006       Sports  Nintendo    41.49
## 2    2        Super Mario Bros.      NES 1985     Platform  Nintendo    29.08
## 3    3           Mario Kart Wii      Wii 2008       Racing  Nintendo    15.85
## 4    4        Wii Sports Resort      Wii 2009       Sports  Nintendo    15.75
## 5    5 Pokemon Red/Pokemon Blue       GB 1996 Role-Playing  Nintendo    11.27
## 6    6                   Tetris       GB 1989       Puzzle  Nintendo    23.20
##   EU_Sales JP_Sales Other_Sales Global_Sales
## 1    29.02     3.77        8.46        82.74
## 2     3.58     6.81        0.77        40.24
## 3    12.88     3.79        3.31        35.82
## 4    11.01     3.28        2.96        33.00
## 5     8.89    10.22        1.00        31.37
## 6     2.26     4.22        0.58        30.26
#summary stats
summary(sales_raw)
##       Rank                                Name          Platform   
##  Min.   :    1   Need for Speed: Most Wanted:   12   DS     :2163  
##  1st Qu.: 4151   FIFA 14                    :    9   PS2    :2161  
##  Median : 8300   LEGO Marvel Super Heroes   :    9   PS3    :1329  
##  Mean   : 8301   Madden NFL 07              :    9   Wii    :1325  
##  3rd Qu.:12450   Ratatouille                :    9   X360   :1265  
##  Max.   :16600   Angry Birds Star Wars      :    8   PSP    :1213  
##                  (Other)                    :16542   (Other):7142  
##       Year               Genre                             Publisher    
##  2009   :1431   Action      :3316   Electronic Arts             : 1351  
##  2008   :1428   Sports      :2346   Activision                  :  975  
##  2010   :1259   Misc        :1739   Namco Bandai Games          :  932  
##  2007   :1202   Role-Playing:1488   Ubisoft                     :  921  
##  2011   :1139   Shooter     :1310   Konami Digital Entertainment:  832  
##  2006   :1008   Adventure   :1286   THQ                         :  715  
##  (Other):9131   (Other)     :5113   (Other)                     :10872  
##     NA_Sales          EU_Sales          JP_Sales         Other_Sales      
##  Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.00000   Min.   : 0.00000  
##  1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.00000   1st Qu.: 0.00000  
##  Median : 0.0800   Median : 0.0200   Median : 0.00000   Median : 0.01000  
##  Mean   : 0.2647   Mean   : 0.1467   Mean   : 0.07778   Mean   : 0.04806  
##  3rd Qu.: 0.2400   3rd Qu.: 0.1100   3rd Qu.: 0.04000   3rd Qu.: 0.04000  
##  Max.   :41.4900   Max.   :29.0200   Max.   :10.22000   Max.   :10.57000  
##                                                                           
##   Global_Sales    
##  Min.   : 0.0100  
##  1st Qu.: 0.0600  
##  Median : 0.1700  
##  Mean   : 0.5374  
##  3rd Qu.: 0.4700  
##  Max.   :82.7400  
## 

Some statistics: Number of rows: 16598
Number of columns: 11
History size: 11
Number of unique games: 11493
Number of unique platforms: 31
Number of unique genres: 12
Number of unique publishers: 579
Number of NAs in each column: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Number of N/A in each column: 0, 0, 0, 271, 0, 58, 0, 0, 0, 0, 0

Despite the facts there are no NAs, there are N/A which will have to be dealt with in a later stage.

Visualization

Following Wickham & Grolemund data science framework, I’ll start with some basic visualization using ggplot.

Top 20 ranked games

top_20<-head(sales_raw,20)
# Simple Bar Plot
ggplot(top_20)+ geom_col(mapping=aes(x=reorder(Name,-Global_Sales),y=Global_Sales))+ theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))+labs(x="Games", y = "Global Sales in millions")

Platforms

#Number of games sold per platform

ggplot(sales_raw) + geom_bar(mapping=aes(x=Platform))+ theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))+labs(y = "# units sold in mn")

Genres

Types of Genres

ggplot(sales_raw) + geom_bar(mapping=aes(x=Genre))+ theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))+labs(y = "# of games")

Global Sales

Sales per year per platform

ggplot(sales_raw) + geom_col(mapping=aes(Year,Global_Sales,fill=Platform))+theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

Sales per year per platform - Facets

ggplot(sales_raw) + geom_col(mapping=aes(Year,Global_Sales,fill=Platform))+facet_wrap(~Platform)+theme(axis.text.x=element_blank(),axis.ticks.x=element_blank(), legend.position = "none")

Data Transformation

1: First thing to do is to take care of the N/A values in the dataset.
To recap, this is the number of N/A in each column: 0, 0, 0, 271, 0, 58, 0, 0, 0, 0, 0

There are N/A in the Year and Publisher columns.

I will delete all records in Year and Publisher columns that have an N/A.

# replacing N/A by NA
sales_df1<-sales_raw
sales_df1<-sales_df1[sales_df1$Year !="N/A",]
sales_df1<-sales_df1[sales_df1$Publisher !="N/A",]
# confirming that all records containing N/As are deleted
colSums(sales_df1=='N/A')
##         Rank         Name     Platform         Year        Genre    Publisher 
##            0            0            0            0            0            0 
##     NA_Sales     EU_Sales     JP_Sales  Other_Sales Global_Sales 
##            0            0            0            0            0

2:Filtering I will analyse only Sony Platform games, which I assume are: PS, PS2, PS3, PS4, PSP, PSV.

# Date frame with only games for Sony platforms
sales_sony<-drop_na(sales_df1)
sales_sony<-filter(na.omit(sales_df1), Platform %in% c('PS', 'PS2', 'PS3', 'PS4', 'PSP', 'PSV'))
head(sales_sony)
##   Rank                          Name Platform Year   Genre
## 1   17            Grand Theft Auto V      PS3 2013  Action
## 2   18 Grand Theft Auto: San Andreas      PS2 2004  Action
## 3   25   Grand Theft Auto: Vice City      PS2 2002  Action
## 4   29        Gran Turismo 3: A-Spec      PS2 2001  Racing
## 5   34     Call of Duty: Black Ops 3      PS4 2015 Shooter
## 6   35    Call of Duty: Black Ops II      PS3 2012 Shooter
##                     Publisher NA_Sales EU_Sales JP_Sales Other_Sales
## 1        Take-Two Interactive     7.01     9.27     0.97        4.14
## 2        Take-Two Interactive     9.43     0.40     0.41       10.57
## 3        Take-Two Interactive     8.41     5.49     0.47        1.78
## 4 Sony Computer Entertainment     6.85     5.09     1.87        1.16
## 5                  Activision     5.77     5.81     0.35        2.31
## 6                  Activision     4.99     5.88     0.65        2.52
##   Global_Sales
## 1        21.40
## 2        20.81
## 3        16.15
## 4        14.98
## 5        14.24
## 6        14.03

3:Selecting columns I will use the select() function to narrow down the analysis to games sold in Japan

# Date frame with only sales in Japan
sales_sony_jp<-select(sales_sony, -c(NA_Sales,EU_Sales,Other_Sales))
head(sales_sony_jp)
##   Rank                          Name Platform Year   Genre
## 1   17            Grand Theft Auto V      PS3 2013  Action
## 2   18 Grand Theft Auto: San Andreas      PS2 2004  Action
## 3   25   Grand Theft Auto: Vice City      PS2 2002  Action
## 4   29        Gran Turismo 3: A-Spec      PS2 2001  Racing
## 5   34     Call of Duty: Black Ops 3      PS4 2015 Shooter
## 6   35    Call of Duty: Black Ops II      PS3 2012 Shooter
##                     Publisher JP_Sales Global_Sales
## 1        Take-Two Interactive     0.97        21.40
## 2        Take-Two Interactive     0.41        20.81
## 3        Take-Two Interactive     0.47        16.15
## 4 Sony Computer Entertainment     1.87        14.98
## 5                  Activision     0.35        14.24
## 6                  Activision     0.65        14.03

4:Adding new variable with mutate() I will create a variable corresponding to the Global share of the Japanese sales

sales_sony_jp<-mutate(sales_sony_jp,Share=JP_Sales/Global_Sales)

5:Grouped Summaries with summarize() I will summarize the Japanese Share of total sales in aggregate and also by year and publisher.

sales_sony_jp_summ_1<-summarize(sales_sony_jp,sum(JP_Sales)/sum(Global_Sales))
sales_sony_jp_summ_2<-sales_sony_jp %>%
  group_by(Year) %>%
  summarize(Share1=sum(JP_Sales)/sum(Global_Sales)) 

Historical share of Japanese sales is: r sales_sony_jp_summ_1

Plotting share of Japanese sales over time:

ggplot(sales_sony_jp_summ_2)+ geom_col(mapping=aes(sales_sony_jp_summ_2$Year,sales_sony_jp_summ_2$Share1))+theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))+labs(y = "JP Share", x="Year")

It seems there’s an error with 2017 data. I will leave this to my colleagues to try to fix :-)

Aggregation by year and by Platform

(sales_sony_jp_summ_3<-sales_sony_jp %>%
  group_by(Year, Platform) %>%
  summarize(Share1=sum(JP_Sales)/sum(Global_Sales))) 
## # A tibble: 57 x 3
## # Groups:   Year [24]
##    Year  Platform Share1
##    <fct> <fct>     <dbl>
##  1 1994  PS        0.444
##  2 1995  PS        0.266
##  3 1996  PS        0.188
##  4 1997  PS        0.243
##  5 1998  PS        0.168
##  6 1999  PS        0.194
##  7 2000  PS        0.152
##  8 2000  PS2       0.227
##  9 2001  PS        0.109
## 10 2001  PS2       0.107
## # … with 47 more rows

Aggregation by year by Genre and by Platform, when Share is greater than 0.

(sales_sony_jp_summ_4<-sales_sony_jp %>%
  group_by(Year, Platform, Genre) %>%
  summarize(Share1=sum(JP_Sales)/sum(Global_Sales))) %>%
  filter(Share1>0)
## # A tibble: 501 x 4
## # Groups:   Year, Platform [56]
##    Year  Platform Genre        Share1
##    <fct> <fct>    <fct>         <dbl>
##  1 1994  PS       Fighting      0.384
##  2 1994  PS       Racing        0.930
##  3 1994  PS       Role-Playing  0.944
##  4 1994  PS       Shooter       0.327
##  5 1994  PS       Simulation    0.171
##  6 1994  PS       Sports        0.298
##  7 1994  PS       Strategy      0.96 
##  8 1995  PS       Action        0.196
##  9 1995  PS       Adventure     0.815
## 10 1995  PS       Fighting      0.366
## # … with 491 more rows

Aggregation by Platform, when Share is greater than 0.

(sales_sony_jp_summ_5<-sales_sony_jp %>%
  group_by(Platform) %>%
  summarize(Share1=sum(JP_Sales)/sum(Global_Sales))) %>%
  filter(Share1>0)
## # A tibble: 6 x 2
##   Platform Share1
##   <fct>     <dbl>
## 1 PS       0.192 
## 2 PS2      0.112 
## 3 PS3      0.0834
## 4 PS4      0.0514
## 5 PSP      0.260 
## 6 PSV      0.339

Conclusion

I have used a lot of different functions to transform, tidy, and present data. There’s still a lot analysis that can be done.