Visualizing Video Game Genre Sales as a Function of Region or Year

Setting the environment

These are the R libraries we will need for this demo:
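
The package list itself is not shown here. Judging only from the functions and packages named later in this demo (dplyr, tidyr, ggplot2), a minimal setup would presumably look like the sketch below; the exact set the author loaded is an assumption.

```r
# Assumed package set, inferred from the functions named later in this demo.
library(dplyr)    # select(), group_by(), summarise(), mutate()
library(tidyr)    # reshaping the wide regional sales columns into long form
library(ggplot2)  # stacked bar plots
```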



Accessing the data

This demo uses the Kaggle Video Game Sales 2019 dataset.
The data is openly available for the public to download as a .csv file. The .csv file was uploaded to the author’s GitHub, so it can be accessed directly via the raw link and loaded into an R data.frame:
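
The raw link is not reproduced in this text, so the sketch below uses a small temporary file as a stand-in. The variable names `csv_source` and `vgsales`, the game names, and the two rows are made up for illustration; only the column names follow the dataset.

```r
# Stand-in for the author's raw GitHub link, which is not shown in this text.
# A temporary file with two made-up rows keeps the sketch runnable offline.
csv_source <- tempfile(fileext = ".csv")
writeLines(c(
  "Rank,Name,Genre,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Year",
  "1,Game A,Puzzle,0.42,0.10,0.28,0.02,1985",
  "2,Game B,Shooter,0.27,0.08,,,1986"
), csv_source)

# read.csv() accepts a URL as well as a local path, so the same call
# should work with the raw link in place of csv_source.
vgsales <- read.csv(csv_source, stringsAsFactors = FALSE)
str(vgsales)  # empty numeric fields are read in as NA
```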

Inspecting the data.frame

Get a general feel for the data by inspecting some basics: the data.frame dimensions, feature names, and feature summaries.
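
These three checks map onto base R's `dim()`, `names()`, and `summary()`. The sketch below runs them on a three-row stand-in (the name `vgsales` and its values are assumptions), so its output will be much smaller than the printout in this section, which comes from the full dataset.

```r
# Toy stand-in; the real data.frame has 55,792 rows and 16 columns.
vgsales <- data.frame(
  Genre = c("Puzzle", "Shooter", "Action"),
  NA_Sales = c(0.42, 0.27, NA),
  Year = c(1985, 1986, 1990),
  stringsAsFactors = FALSE
)

dim(vgsales)      # number of rows and columns
names(vgsales)    # feature (column) names
summary(vgsales)  # per-column summaries, including NA counts
```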

## [1] 55792    16
##  [1] "Rank"          "Name"          "Genre"         "ESRB_Rating"  
##  [5] "Platform"      "Publisher"     "Developer"     "Critic_Score" 
##  [9] "User_Score"    "Total_Shipped" "Global_Sales"  "NA_Sales"     
## [13] "PAL_Sales"     "JP_Sales"      "Other_Sales"   "Year"
##       Rank           Name              Genre           ESRB_Rating       
##  Min.   :    1   Length:55792       Length:55792       Length:55792      
##  1st Qu.:13949   Class :character   Class :character   Class :character  
##  Median :27896   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :27896                                                           
##  3rd Qu.:41844                                                           
##  Max.   :55792                                                           
##                                                                          
##    Platform          Publisher          Developer          Critic_Score  
##  Length:55792       Length:55792       Length:55792       Min.   : 1.00  
##  Class :character   Class :character   Class :character   1st Qu.: 6.40  
##  Mode  :character   Mode  :character   Mode  :character   Median : 7.50  
##                                                           Mean   : 7.21  
##                                                           3rd Qu.: 8.30  
##                                                           Max.   :10.00  
##                                                           NA's   :49256  
##    User_Score    Total_Shipped    Global_Sales      NA_Sales    
##  Min.   : 2.00   Min.   : 0.03   Min.   : 0.00   Min.   :0.00   
##  1st Qu.: 7.80   1st Qu.: 0.20   1st Qu.: 0.03   1st Qu.:0.05   
##  Median : 8.50   Median : 0.59   Median : 0.12   Median :0.12   
##  Mean   : 8.25   Mean   : 1.89   Mean   : 0.37   Mean   :0.28   
##  3rd Qu.: 9.10   3rd Qu.: 1.80   3rd Qu.: 0.36   3rd Qu.:0.29   
##  Max.   :10.00   Max.   :82.86   Max.   :20.32   Max.   :9.76   
##  NA's   :55457   NA's   :53965   NA's   :36377   NA's   :42828  
##    PAL_Sales        JP_Sales      Other_Sales         Year     
##  Min.   :0.00    Min.   :0.00    Min.   :0.00    Min.   :1970  
##  1st Qu.:0.01    1st Qu.:0.02    1st Qu.:0.00    1st Qu.:2000  
##  Median :0.04    Median :0.05    Median :0.01    Median :2008  
##  Mean   :0.16    Mean   :0.11    Mean   :0.04    Mean   :2006  
##  3rd Qu.:0.14    3rd Qu.:0.12    3rd Qu.:0.04    3rd Qu.:2011  
##  Max.   :9.85    Max.   :2.69    Max.   :3.12    Max.   :2020  
##  NA's   :42603   NA's   :48749   NA's   :40270   NA's   :979

Tidying the data.frame with dplyr & tidyr methods

This data.frame is mercifully clean. We can simply select the features we are most interested in visualizing and ‘tidy’ them for use with ggplot2 plotting functions:
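
One way to get from the wide regional columns to a long Genre / Year / Region / Sales table is `select()` followed by `tidyr::pivot_longer()`. This is a sketch on made-up rows, not necessarily the author's code (an older `gather()` call would give the same shape); dropping missing sales with `values_drop_na` is also an assumption.

```r
library(dplyr)
library(tidyr)

# Toy stand-in with the same column names as the dataset; rows are made up.
vgsales <- data.frame(
  Name = c("Game A", "Game B", "Game C"),
  Genre = c("Puzzle", "Shooter", "Sports"),
  Year = c(1985, 1986, 2006),
  NA_Sales = c(0.42, 0.27, NA),
  PAL_Sales = c(0.10, 0.08, NA),
  JP_Sales = c(0.28, NA, NA),
  Other_Sales = c(0.02, NA, NA),
  stringsAsFactors = FALSE
)

tidy_sales <- vgsales %>%
  # keep only the features to be plotted
  select(Genre, Year, NA_Sales, PAL_Sales, JP_Sales, Other_Sales) %>%
  # one row per Genre / Year / Region combination
  pivot_longer(
    cols = ends_with("_Sales"),
    names_to = "Region",
    values_to = "Sales",
    values_drop_na = TRUE  # assumption: rows without sales figures are dropped
  )

head(tidy_sales)
```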

## # A tibble: 6 x 4
##   Genre    Year Region      Sales
##   <chr>   <dbl> <chr>       <dbl>
## 1 Puzzle   1985 NA_Sales     0.42
## 2 Puzzle   1985 PAL_Sales    0.1 
## 3 Puzzle   1985 JP_Sales     0.28
## 4 Puzzle   1985 Other_Sales  0.02
## 5 Shooter  1986 NA_Sales     0.27
## 6 Shooter  1986 PAL_Sales    0.08

Visualizing Video Game Genre Sales as a function of Sales Region

Great! The data.frame is well organized; now it needs just a little more tailoring to create specific plots. To visualize Genre sales by Region, let’s group by these features with group_by():
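
A sketch of this grouping step on toy numbers: total the sales per Genre and Region, then divide by each genre's overall total. The column names `n` and `proportion` follow the printed output; the input name `tidy_sales` and its values are assumptions.

```r
library(dplyr)

# Toy long-format input; values are made up.
tidy_sales <- data.frame(
  Genre = c("Action", "Action", "Action", "Puzzle", "Puzzle"),
  Region = c("NA_Sales", "NA_Sales", "JP_Sales", "NA_Sales", "JP_Sales"),
  Sales = c(2, 1, 1, 1, 3),
  stringsAsFactors = FALSE
)

genre_by_region <- tidy_sales %>%
  group_by(Genre, Region) %>%
  # summarise() drops the last grouping level, leaving the result grouped by Genre
  summarise(n = sum(Sales)) %>%
  # share of each region within a genre's total sales
  mutate(proportion = n / sum(n))

genre_by_region
```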

## # A tibble: 6 x 4
## # Groups:   Genre [2]
##   Genre            Region           n proportion
##   <chr>            <chr>        <dbl>      <dbl>
## 1 Action           JP_Sales     47.1      0.0779
## 2 Action           NA_Sales    276.       0.456 
## 3 Action           Other_Sales  72.6      0.120 
## 4 Action           PAL_Sales   209.       0.346 
## 5 Action-Adventure JP_Sales      4.42     0.0451
## 6 Action-Adventure NA_Sales     38.4      0.392

Visualize with a ggplot2 proportional stacked bar chart:
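
The plotting call is not shown, so the following is only a sketch of a proportional stacked bar chart on toy data. Putting Region on the x axis and filling by Genre is an assumption based on the closing discussion; `genre_by_region` and `region_plot` are illustrative names.

```r
library(ggplot2)

# Toy summary table; values are made up.
genre_by_region <- data.frame(
  Genre = rep(c("Action", "Puzzle"), each = 2),
  Region = rep(c("JP_Sales", "NA_Sales"), times = 2),
  n = c(1, 3, 3, 1)
)

# position = "fill" rescales every bar to height 1, so each segment shows
# a genre's share of that region's sales.
region_plot <- ggplot(genre_by_region, aes(x = Region, y = n, fill = Genre)) +
  geom_bar(stat = "identity", position = "fill") +
  labs(x = "Sales region", y = "Proportion of sales", fill = "Genre")

print(region_plot)
```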

Visualizing Video Game Genre Sales as a function of Year

Again, just a little more tailoring is needed to create specific plots. To visualize Genre sales by Year, let’s group by these features with group_by():
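
A sketch of the yearly version on toy numbers: sum sales per Genre and Year, then express each genre's total as a percentage of all sales in that year. The column names follow the printed output; the input values are made up, and the final grouping may differ from the author's result.

```r
library(dplyr)

# Toy long-format input; values are made up.
tidy_sales <- data.frame(
  Genre = c("Action", "Action", "Puzzle", "Action", "Puzzle"),
  Year = c(1990, 1990, 1990, 1992, 1992),
  Sales = c(2, 1, 1, 1, 3),
  stringsAsFactors = FALSE
)

genre_by_year <- tidy_sales %>%
  group_by(Genre, Year) %>%
  summarise(genreAnnualSales = sum(Sales)) %>%
  # regroup by Year so the denominator is the total across all genres
  group_by(Year) %>%
  mutate(
    annualSales = sum(genreAnnualSales),
    proportion = 100 * genreAnnualSales / annualSales  # percent of the year's sales
  ) %>%
  ungroup()

genre_by_year
```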

## # A tibble: 6 x 5
## # Groups:   Genre [1]
##   Genre   Year genreAnnualSales annualSales proportion
##   <chr>  <dbl>            <dbl>       <dbl>      <dbl>
## 1 Action  1990             4.83        7.2       67.1 
## 2 Action  1992             1.39        3.55      39.2 
## 3 Action  1994             1.42        7.87      18.0 
## 4 Action  1996             4.63       27.2       17.0 
## 5 Action  1997             1.31       46.0        2.85
## 6 Action  1998             6.54       54.3       12.1

Visualizing Video Game Genre Sales by percent of sales for each year of data collection with a ggplot2 proportional stacked bar plot:
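
A sketch of such a plot on toy data, assuming the percentages were already computed per year so that a plain stacked bar sums to 100. The names `genre_by_year` and `year_plot` are illustrative.

```r
library(ggplot2)

# Toy yearly percentages; values are made up.
genre_by_year <- data.frame(
  Genre = c("Action", "Action", "Puzzle", "Puzzle"),
  Year = c(1990, 1992, 1990, 1992),
  proportion = c(75, 25, 25, 75)
)

# The percentages within a year already sum to 100, so stacking them
# gives bars of equal height.
year_plot <- ggplot(genre_by_year, aes(x = Year, y = proportion, fill = Genre)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(x = "Year", y = "Percent of annual sales", fill = "Genre")

print(year_plot)
```

ggplot2 drops rows with missing values in a mapped column and reports them in a warning, which is likely the source of warnings like the ones printed here.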

## Warning: Removed 1 rows containing missing values (position_stack).
## Warning: Removed 1 rows containing missing values (geom_bar).

In Closing

In this demo we visualized the video game regional sales data that was made available on Kaggle. The data needed a few processing steps before meaningful visualizations could be generated. The R libraries ‘dplyr’ & ‘tidyr’ did most of our heavy lifting. Stacked bar plots were constructed to explore a few relationships between data features. First, the proportion of Genre sales was calculated and displayed as a function of sales region. The plot is quite busy, but close inspection reveals variation across the regions. For example, Visual Novel sales make up a relatively larger proportion of sales in JP_Sales than in any other region, whereas the PAL_Sales market has the highest proportion of sales for Boardgames. The next figure plots the proportion of genre sales for each year. This figure is interesting because it shows that more genre designations were added over the course of data collection.