These are the R libraries we will need for this demo:
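Judging from the functions used later in this demo, a minimal set of libraries would likely be the three below. This is an assumption inferred from the code that follows, not necessarily the exact list originally loaded.

```r
# Assumed package set, inferred from the functions used later in this demo
library(dplyr)    # %>%, select(), group_by(), summarise(), mutate(), arrange(), full_join()
library(tidyr)    # drop_na(), pivot_longer()
library(ggplot2)  # ggplot(), geom_bar(), and the title/axis/scale helpers
```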
This demo will utilize the Kaggle Video Game Sales 2019 dataset.
The data is openly available for the public to download as a .csv file. The .csv file was uploaded to the author’s GitHub, where it can be accessed directly via the raw link and loaded into an R data.frame:
Get a general feel for the data by inspecting some basics: the data.frame dimensions, feature names, and feature summaries.
## [1] 55792 16
## [1] "Rank" "Name" "Genre" "ESRB_Rating"
## [5] "Platform" "Publisher" "Developer" "Critic_Score"
## [9] "User_Score" "Total_Shipped" "Global_Sales" "NA_Sales"
## [13] "PAL_Sales" "JP_Sales" "Other_Sales" "Year"
## Rank Name Genre ESRB_Rating
## Min. : 1 Length:55792 Length:55792 Length:55792
## 1st Qu.:13949 Class :character Class :character Class :character
## Median :27896 Mode :character Mode :character Mode :character
## Mean :27896
## 3rd Qu.:41844
## Max. :55792
##
## Platform Publisher Developer Critic_Score
## Length:55792 Length:55792 Length:55792 Min. : 1.00
## Class :character Class :character Class :character 1st Qu.: 6.40
## Mode :character Mode :character Mode :character Median : 7.50
## Mean : 7.21
## 3rd Qu.: 8.30
## Max. :10.00
## NA's :49256
## User_Score Total_Shipped Global_Sales NA_Sales
## Min. : 2.00 Min. : 0.03 Min. : 0.00 Min. :0.00
## 1st Qu.: 7.80 1st Qu.: 0.20 1st Qu.: 0.03 1st Qu.:0.05
## Median : 8.50 Median : 0.59 Median : 0.12 Median :0.12
## Mean : 8.25 Mean : 1.89 Mean : 0.37 Mean :0.28
## 3rd Qu.: 9.10 3rd Qu.: 1.80 3rd Qu.: 0.36 3rd Qu.:0.29
## Max. :10.00 Max. :82.86 Max. :20.32 Max. :9.76
## NA's :55457 NA's :53965 NA's :36377 NA's :42828
## PAL_Sales JP_Sales Other_Sales Year
## Min. :0.00 Min. :0.00 Min. :0.00 Min. :1970
## 1st Qu.:0.01 1st Qu.:0.02 1st Qu.:0.00 1st Qu.:2000
## Median :0.04 Median :0.05 Median :0.01 Median :2008
## Mean :0.16 Mean :0.11 Mean :0.04 Mean :2006
## 3rd Qu.:0.14 3rd Qu.:0.12 3rd Qu.:0.04 3rd Qu.:2011
## Max. :9.85 Max. :2.69 Max. :3.12 Max. :2020
## NA's :42603 NA's :48749 NA's :40270 NA's :979
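Output of this kind is what base R's `dim()`, `names()`, and `summary()` return. As a minimal sketch, the same three calls are shown here on a small made-up table; `toy_csv` and `toy_df` are hypothetical names, and the values are invented rather than taken from the dataset.

```r
# Hypothetical toy data standing in for the full dataset (values are made up)
toy_csv <- "Name,Genre,NA_Sales,Year
Game A,Puzzle,0.42,1985
Game B,Shooter,0.27,1986
Game C,Puzzle,,1990"

# read.csv() also accepts a file path or URL; text= keeps this sketch self-contained
toy_df <- read.csv(text = toy_csv, stringsAsFactors = FALSE)

dim(toy_df)      # number of rows and columns
names(toy_df)    # feature (column) names
summary(toy_df)  # per-column summaries, including NA counts for numeric columns
```

The `NA's` rows in a summary are a quick way to see how sparsely each numeric feature is populated before deciding which columns to keep.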
This data.frame is mercifully clean. We can simply select the features we are most interested in visualizing and ‘tidy’ them for use with ggplot2 plotting functions:
genreVGS <- vgs_df %>%
  #Subset the data to include columns: Genre + regional sales columns + Year
  select( Genre, NA_Sales, PAL_Sales, JP_Sales, Other_Sales, Year ) %>%
  #Drop rows that have a missing value in any of the selected columns
  drop_na() %>%
  #Pivot the data.frame longer such that each regional sales observation has its own row
  pivot_longer( cols = NA_Sales:Other_Sales, names_to = 'Region', values_to = 'Sales' ) %>%
  arrange( Year )
head( genreVGS )
## # A tibble: 6 x 4
## Genre Year Region Sales
## <chr> <dbl> <chr> <dbl>
## 1 Puzzle 1985 NA_Sales 0.42
## 2 Puzzle 1985 PAL_Sales 0.1
## 3 Puzzle 1985 JP_Sales 0.28
## 4 Puzzle 1985 Other_Sales 0.02
## 5 Shooter 1986 NA_Sales 0.27
## 6 Shooter 1986 PAL_Sales 0.08
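To make the two tidying steps concrete, here is a minimal sketch on a made-up three-row table (`toy` and `toy_long` are hypothetical names, and the values are invented). It illustrates that `drop_na()` called without arguments removes a row if any of its columns is missing, and that `pivot_longer()` then turns each remaining sales column into its own row.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy rows (values are made up for illustration)
toy <- tibble(
  Genre    = c("Puzzle", "Shooter", "Racing"),
  NA_Sales = c(0.42, 0.27, 0.10),
  JP_Sales = c(0.28, NA, 0.05),
  Year     = c(1985, 1986, NA)
)

toy_long <- toy %>%
  # Drops Shooter (JP_Sales is NA) and Racing (Year is NA)
  drop_na() %>%
  # One row per regional sales value
  pivot_longer(cols = NA_Sales:JP_Sales, names_to = "Region", values_to = "Sales")

toy_long  # two rows remain, both for Puzzle: one per region
```

A title with sales recorded in only some regions is therefore excluded entirely, which is worth keeping in mind when reading the totals that follow.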
Great! The data.frame is well organized; now just a little more tailoring is needed to create specific plots. To visualize Genre sales by Region, let’s aggregate these features with group_by():
genre_byregionVGS <- genreVGS %>%
  #group_by() to aggregate the data by the features Genre & Region
  group_by( Genre, Region ) %>%
  #summarise() to find the total sales for the grouped features
  summarise( n = sum( Sales ) ) %>%
  #mutate() to create a new column holding each region's share of a genre's total sales
  #(summarise() leaves the result grouped by Genre, so sum( n ) is taken within each genre)
  mutate( proportion = n/sum( n ) )
head( genre_byregionVGS )
## # A tibble: 6 x 4
## # Groups: Genre [2]
## Genre Region n proportion
## <chr> <chr> <dbl> <dbl>
## 1 Action JP_Sales 47.1 0.0779
## 2 Action NA_Sales 276. 0.456
## 3 Action Other_Sales 72.6 0.120
## 4 Action PAL_Sales 209. 0.346
## 5 Action-Adventure JP_Sales 4.42 0.0451
## 6 Action-Adventure NA_Sales 38.4 0.392
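One detail worth checking in the aggregation above: after `summarise()`, dplyr drops only the last grouping variable, so the result stays grouped by Genre and `sum( n )` is taken within each genre (the four Action rows in the output add up to 1). The sketch below, on a made-up toy table, contrasts that with grouping by Region first, which yields each genre's share of a region's sales instead. The names `toy_long`, `by_genre`, and `by_region` are hypothetical.

```r
library(dplyr)

# Hypothetical toy totals (values are made up)
toy_long <- tibble(
  Genre  = c("Action", "Action", "Puzzle", "Puzzle"),
  Region = c("JP_Sales", "NA_Sales", "JP_Sales", "NA_Sales"),
  Sales  = c(1, 3, 2, 2)
)

by_genre <- toy_long %>%
  group_by(Genre, Region) %>%
  summarise(n = sum(Sales)) %>%      # result is still grouped by Genre
  mutate(proportion = n / sum(n))    # each region's share of a genre's sales

by_region <- toy_long %>%
  group_by(Region, Genre) %>%
  summarise(n = sum(Sales)) %>%      # result is still grouped by Region
  mutate(proportion = n / sum(n))    # each genre's share of a region's sales
```

Because `position="fill"` rescales every bar to a height of 1, which of these columns (or the raw totals `n`) is passed as `y` changes what the stacked segments represent.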
Visualize with a ggplot2 proportional stacked bar chart:
ggplot( genre_byregionVGS, aes( x = Region, y = proportion, fill = Genre ) ) +
  #position = "fill" rescales every bar to a total height of 1
  geom_bar( position = "fill", stat = "identity", color = 'black' ) +
  ggtitle( 'Genre Sales Proportion per Region' ) +
  xlab( 'Region' ) +
  ylab( 'Regional Sales Proportion' ) +
  scale_fill_discrete( name = "Genre" )
Again, just a little more tailoring is needed to create specific plots. To visualize Genre sales by Year, let’s aggregate these features with group_by():
genre_byyearVGS <- genreVGS %>%
  #group_by() Genre & Year
  group_by( Genre, Year ) %>%
  #summarise() to find the total sales for the grouped features
  summarise( genreAnnualSales = sum( Sales ) )
#The total annual sales differ from year to year, so to calculate the correct proportions we first need the total sales for each year. This is stored in its own data.frame:
yearTotal <- genreVGS %>%
  group_by( Year ) %>%
  summarise( annualSales = sum( Sales ) )
#Join the information in both data.frames so that each row of 'genre_byyearVGS' is associated with the correct total annual sales
genre_byyearVGS <- genre_byyearVGS %>%
  #full_join() to join the data.frames by 'Year'; the piped data.frame is already the first argument
  full_join( yearTotal, by = 'Year' ) %>%
  #mutate() to create a new column that holds each genre's percentage of that year's sales
  mutate( proportion = genreAnnualSales/annualSales*100 )
head( genre_byyearVGS )
## # A tibble: 6 x 5
## # Groups: Genre [1]
## Genre Year genreAnnualSales annualSales proportion
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Action 1990 4.83 7.2 67.1
## 2 Action 1992 1.39 3.55 39.2
## 3 Action 1994 1.42 7.87 18.0
## 4 Action 1996 4.63 27.2 17.0
## 5 Action 1997 1.31 46.0 2.85
## 6 Action 1998 6.54 54.3 12.1
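As a cross-check on the join-based calculation, the same percentages can be sketched with a grouped `mutate()` and no separate totals table. This is an alternative formulation on made-up data, not the approach used above; `toy_long` and `toy_byyear` are hypothetical names.

```r
library(dplyr)

# Hypothetical toy sales (values are made up)
toy_long <- tibble(
  Genre = c("Action", "Puzzle", "Action", "Puzzle"),
  Year  = c(1990, 1990, 1992, 1992),
  Sales = c(3, 1, 1, 1)
)

toy_byyear <- toy_long %>%
  group_by(Year, Genre) %>%
  summarise(genreAnnualSales = sum(Sales)) %>%   # still grouped by Year
  mutate(annualSales = sum(genreAnnualSales),    # total sales within each year
         proportion  = genreAnnualSales / annualSales * 100)

# Sanity check: the percentages within each year should add up to 100
toy_byyear %>% summarise(total = sum(proportion))
```

Grouping by Year first means the yearly total is available inside `mutate()`, which avoids the intermediate data.frame at the cost of a different row order.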
Visualize video game genre sales as a percentage of each year's total sales with a ggplot2 proportional stacked bar plot:
# Plot
ggplot( genre_byyearVGS, aes( x = Year, y = proportion, fill = Genre ) ) +
  #position = "fill" rescales every bar to a total height of 1
  geom_bar( position = "fill", stat = "identity", color = 'black' ) +
  xlim( 1985, 2019 ) +
  ggtitle( 'Genre Sales Proportion per Year' ) +
  xlab( 'Year' ) +
  ylab( 'Annual Sales Proportion' ) +
  scale_fill_discrete( name = "Genre" )
## Warning: Removed 1 rows containing missing values (position_stack).
## Warning: Removed 1 rows containing missing values (geom_bar).
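These warnings most likely come from `xlim()`, which discards observations that fall outside the requested limits before the bars are drawn. One way to make the restriction explicit is to filter the years before plotting, sketched below on made-up data with the hypothetical names `toy_byyear` and `plot_df`.

```r
library(dplyr)

# Hypothetical toy rows (values are made up); two years lie outside 1985-2019
toy_byyear <- tibble(
  Year       = c(1970, 1985, 2000, 2019, 2020),
  proportion = c(100, 100, 100, 100, 100)
)

# Keep only the years of interest; between() is inclusive at both ends
plot_df <- toy_byyear %>%
  filter(between(Year, 1985, 2019))

range(plot_df$Year)
```

Filtering up front keeps the choice of years visible in the data preparation rather than in the plot layers.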
In this demo we visualized the video game regional sales data that was made available on Kaggle. The data needed a few processing steps before meaningful visualizations could be generated. The R libraries ‘dplyr’ & ‘tidyr’ did most of our heavy lifting. Stacked bar plots were constructed to explore a few relationships between data features. First, the proportions of Genre sales were calculated and displayed as a function of sales region. The plot is quite busy, but close inspection shows variations across the regions. For example, Visual Novel sales make up a larger proportion of sales in JP_Sales than in any other region, whereas the PAL_Sales market has the highest proportion of sales for Boardgames. The next figure plots the proportion of genre sales for each year. This figure is interesting because it shows that more genre designations were added over the course of data collection.