Data Source: https://www.kaggle.com/rush4ratio/video-game-sales-with-ratings
The dataset chosen for the assignment relates to video game sales and their ratings (or scores) as recorded on the review aggregator website “Metacritic”.
The variables of interest for this visualization are as follows:
Within the video games industry, there has often been criticism of the ratings (or scores; the two terms are used interchangeably) that critics give to video games, with suspicion of a lack of independence. The task at hand is to create a visualization comparing critic reviews and user reviews, and to analyze whether any patterns can be observed between the two.
The potential challenges faced in accomplishing the task are as follows:
| Challenges | Solution |
|---|---|
| Finding a way to visualize a potential lack of independence in critic ratings. | Video game companies that make large sales often have influence, hence, global sales can be viewed as a proxy to influence. Working with the data on hand, a possible method is to plot Global Sales and the differences between Critic and User Scores. Should there be a pattern emerging in the plot, it is a hint that Global Sales and disparities in ratings are associated, which may indicate influence affecting independence of critic scores. |
| There are too many video game genres to effectively visualize them neatly. | The top video game genres (in terms of available records) can be chosen for visualization. Since the task is to analyze potential patterns of video game companies’ influence on critic ratings, using the more popular genres would make sense as big video game companies tend to produce games belonging to the more popular categories. |
| Many video games exist on multiple gaming platforms (e.g. PC, Xbox, PlayStation, Nintendo, etc.). This will be an issue in the analysis as the ratings can differ from platform to platform. | The ratings, however, tend to be fairly similar across platforms. It is assumed that a single record can be representative of the video game for all platforms. Hence, any duplicated records of the same video game title will be dropped in the visualization. |
Proposed sketch design:
The following packages are required:
packages = c('dplyr', 'ggplot2', 'ggpubr', 'lemon', 'readr', 'reshape2')
for(p in packages){
  if(!require(p, character.only = TRUE)){
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
Using the function read_csv(), the raw dataset is imported.
games <- read_csv("Video_Games_Sales_as_at_22_Dec_2016.csv")
Data pre-processing is required to ensure that the data are suitable as inputs for the graphing packages.
Before starting on pre-processing, the data types of the variables should be checked and converted to the correct types where necessary.
Using str(), the datatypes of the dataset can be checked. The output below shows that ‘Year_of_Release’ and ‘User_Score’ variables have been incorrectly classified to be of “character” datatype.
str(games, give.attr = FALSE)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 16719 obs. of 16 variables:
## $ Name : chr "Wii Sports" "Super Mario Bros." "Mario Kart Wii" "Wii Sports Resort" ...
## $ Platform : chr "Wii" "NES" "Wii" "Wii" ...
## $ Year_of_Release: chr "2006" "1985" "2008" "2009" ...
## $ Genre : chr "Sports" "Platform" "Racing" "Sports" ...
## $ Publisher : chr "Nintendo" "Nintendo" "Nintendo" "Nintendo" ...
## $ NA_Sales : num 41.4 29.1 15.7 15.6 11.3 ...
## $ EU_Sales : num 28.96 3.58 12.76 10.93 8.89 ...
## $ JP_Sales : num 3.77 6.81 3.79 3.28 10.22 ...
## $ Other_Sales : num 8.45 0.77 3.29 2.95 1 0.58 2.88 2.84 2.24 0.47 ...
## $ Global_Sales : num 82.5 40.2 35.5 32.8 31.4 ...
## $ Critic_Score : num 76 NA 82 80 NA NA 89 58 87 NA ...
## $ Critic_Count : num 51 NA 73 73 NA NA 65 41 80 NA ...
## $ User_Score : chr "8" NA "8.3" "8" ...
## $ User_Count : num 322 NA 709 192 NA NA 431 129 594 NA ...
## $ Developer : chr "Nintendo" NA "Nintendo" "Nintendo" ...
## $ Rating : chr "E" NA "E" "E" ...
To resolve the issue of incorrect datatypes, the as.numeric() function is used to convert the variables to be of numeric datatype.
games$Year_of_Release <- as.numeric(as.character(games$Year_of_Release))
games$User_Score <- as.numeric(as.character(games$User_Score))
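A point worth checking after this conversion is how many values were silently turned into NA. as.numeric() returns NA (with a warning) for any string that cannot be parsed as a number. The sketch below uses a made-up vector, and the "tbd" placeholder in it is an assumption for illustration rather than a statement about the dataset.

```r
# Toy vector standing in for a character score column (values are made up).
raw_scores <- c("8", "tbd", "8.3", NA)

# as.numeric() yields NA for strings that are not numbers;
# suppressWarnings() hides the "NAs introduced by coercion" warning.
numeric_scores <- suppressWarnings(as.numeric(raw_scores))

# Entries that were present as text but became NA during conversion.
lost_in_conversion <- sum(is.na(numeric_scores) & !is.na(raw_scores))

print(numeric_scores)
print(lost_in_conversion)
```

Running the same count on the real column would show how many records will later be dropped purely because of non-numeric placeholders.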
‘User_Score’ should be scaled to a range from 0 to 100, matching that of ‘Critic_Score’, so that both variables can be compared more easily. The values are multiplied by 10 to achieve this.
games$User_Score <- games$User_Score * 10
As the variables of interest detailed in section 1 are necessary for the visualization and analysis, records with a missing value in any of ‘Genre’, ‘Critic_Score’, ‘User_Score’ or ‘Global_Sales’ are removed from the dataframe. Removing these incomplete records leaves the dataset with 7,017 records for analysis.
games <- games[!(is.na(games$Genre)),]
games <- games[!(is.na(games$Critic_Score)),]
games <- games[!(is.na(games$User_Score)),]
games <- games[!(is.na(games$Global_Sales)),]
nrow(games)
## [1] 7017
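The four filtering steps above can also be written as a single step with complete.cases() from base R. This is an equivalent sketch on a made-up data frame, not the code used for the figures; the name `required` is introduced here for illustration.

```r
# Toy data frame with one missing value in each of three different rows.
toy <- data.frame(
  Genre        = c("Action", NA, "Sports", "Racing"),
  Critic_Score = c(80, 75, NA, 60),
  User_Score   = c(85, 70, 90, NA),
  Global_Sales = c(1.2, 0.5, 3.1, 0.8),
  stringsAsFactors = FALSE
)

# Columns that must all be present for a record to be kept.
required <- c("Genre", "Critic_Score", "User_Score", "Global_Sales")

# complete.cases() is TRUE only for rows with no NA in the given columns.
toy_clean <- toy[complete.cases(toy[, required]), ]

print(nrow(toy_clean))
```

Listing the required columns in one vector keeps the filter in sync with the variables actually used later.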
Some of the video games in the dataset are released on several platforms (PC, Xbox, PlayStation, Nintendo, etc.) and therefore have multiple records. The same video game title tends to have similar ratings and sales patterns regardless of platform. Hence, the assumption is made that a single record of a video game title is representative of all its platform releases, and only the first record of each title (in dataset order) is retained for analysis.
There are 4,471 records remaining for analysis after removal of the duplicate records.
games <- games[!(duplicated(games$Name)),]
nrow(games)
## [1] 4471
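To make the behaviour of duplicated() concrete: it flags every occurrence of a value after the first, so negating it keeps the first record per title. The sketch below uses invented titles and scores, and also shows one hypothetical way to check the "similar ratings across platforms" assumption by looking at the spread of scores per title.

```r
# Toy records: "Game A" appears on three platforms (all values are made up).
toy <- data.frame(
  Name         = c("Game A", "Game A", "Game B", "Game A"),
  Platform     = c("PS3", "X360", "PC", "PC"),
  Critic_Score = c(90, 88, 70, 85),
  stringsAsFactors = FALSE
)

# duplicated() is FALSE for the first occurrence and TRUE for later repeats.
dup_flags <- duplicated(toy$Name)
toy_unique <- toy[!dup_flags, ]

# Spread of critic scores across platforms for each title (max minus min).
score_range <- tapply(toy$Critic_Score, toy$Name, function(s) max(s) - min(s))

print(toy_unique)
print(score_range)
```

A small spread for most titles would support keeping a single record; a large spread would suggest averaging across platforms instead.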
The area of interest for this visualization is to check for potential influences on critic scores that affect their independence. To allow for that analysis, the differences between the 2 ratings (‘Critic_Score’ and ‘User_Score’) for each game are used, as mentioned in section 1. A new variable is thus required, where Critic_Score is subtracted from User_Score to give the paired differences, so a positive value means users rated the game higher than critics. The new variable is named ‘score_difference’.
games$score_difference <- games$User_Score - games$Critic_Score
The dataset has been cleaned and prepared for our analysis. This section aims to then extract the information we need and proceed with creating the final visualization by plotting the different types of charts.
The bar chart below indicates that the top 5 genres, after removing records with NA values in the selected variables, are Action, Sports, Role-Playing, Shooter and Racing.
To reduce the clutter in the visualization, the analysis will only be conducted for these top 5 genres.
count(games, Genre) %>%
  ggplot(aes(x = reorder(Genre, n), y = n)) +
  geom_col(color = "black", fill = "#296d98") +
  geom_text(aes(label = n), hjust = -0.15) +
  coord_flip() +
  theme_classic()
The top 5 genres in the remaining records are extracted from the dataset.
selected_games <- games[(games$Genre %in% c('Action','Sports', 'Role-Playing', 'Shooter', 'Racing')),]
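The genre names above are typed in by hand. As a hedged alternative, the most frequent genres can be derived from the data so the selection stays correct if the cleaning steps change. The sketch uses a made-up genre vector, and `n_top` is a name introduced here for illustration.

```r
# Toy genre column (values are made up).
toy_genres <- c("Action", "Sports", "Action", "Puzzle",
                "Sports", "Action", "Racing")

# Count records per genre and sort from most to least frequent.
genre_counts <- sort(table(toy_genres), decreasing = TRUE)

# Keep the names of the n_top most frequent genres.
n_top <- 2
top_genres <- names(genre_counts)[seq_len(n_top)]

# Logical filter that could be used to subset a data frame by genre.
keep <- toy_genres %in% top_genres

print(top_genres)
print(sum(keep))
```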
For each genre, boxplots are created for ‘Critic_Score’ and ‘User_Score’ individually. To do so, melt() from the reshape2 package is used to transform the dataset into long format with 3 columns: ‘Genre’, ‘variable’ (either ‘Critic_Score’ or ‘User_Score’) and ‘value’ (the corresponding score). The transformed dataset can then be used as an input in geom_boxplot() and in facet_rep_wrap() for the desired output.
selected_games.m <- selected_games %>%
  select(Genre, Critic_Score, User_Score) %>%
  melt(id.vars = "Genre")
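To illustrate the shape that melt() produces, the sketch below applies it to a two-row toy data frame (made-up values). Each score column is stacked into its own set of rows, with the original column name recorded in ‘variable’.

```r
library(reshape2)

# Toy wide-format data: one row per game, one column per score type.
toy <- data.frame(
  Genre        = c("Action", "Sports"),
  Critic_Score = c(80, 70),
  User_Score   = c(85, 65),
  stringsAsFactors = FALSE
)

# Long format: Genre is kept as the identifier, the score columns are stacked.
toy_long <- melt(toy, id.vars = "Genre")

print(toy_long)
```

The result has one row per game and score type, which is the layout ggplot2 needs to draw one box per ‘variable’ level.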
The boxplots are created with violin plots layered on top to give a better view of the distributions. Here, facet_rep_wrap() from the lemon package is used instead of facet_wrap() so that tick labels are repeated for each chart for clarity. annotate_figure() from the ggpubr package is also used to add a title to the faceted figure. The code is as follows:
comparison_boxplot <- ggplot(data = selected_games.m, aes(x = value, y = variable)) +
  geom_boxplot(aes(fill = variable), notch = TRUE) +
  scale_fill_manual(values = c("#296d98", "#45b6fe")) +
  scale_y_discrete(labels=c("Critic Score", "User Score")) +
  geom_violin(alpha=0.1) +
  coord_flip() +
  theme_classic() +
  theme(legend.position = "bottom", legend.direction = "vertical", legend.title = element_blank(), legend.key.size = unit(0.98, "cm"), strip.text = element_text(size = 11, face = "bold"), axis.title.x = element_blank()) +
  xlab('Scores') +
  facet_rep_wrap(~Genre, repeat.tick.labels = TRUE, ncol = 1)
comparison_boxplot <- annotate_figure(comparison_boxplot, top = text_grob("Boxplots of Critic and User Scores", face = "bold", size = 12, hjust = 0.4))
comparison_boxplot
The correlations are plotted for each genre, between ‘User_Score’ and ‘Critic_Score’.
Before starting on the plots, a new variable ‘diff_cat’ is created to categorize the ‘score_difference’ variable formed in 2.3.5. This allows for color-coding in the charts and provides clarity in visualizing if ‘User_Score’ is higher than ‘Critic_Score’. It also allows for quick visualization of the number of records that have higher/lower ‘User_Score’ than ‘Critic_Score’.
selected_games$diff_cat = NA
selected_games[selected_games$score_difference > 0, ]$diff_cat <- "Higher User Score"
selected_games[selected_games$score_difference < 0, ]$diff_cat <- "Lower User Score"
selected_games[selected_games$score_difference == 0, ]$diff_cat <- "User Score = Critic Score"
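The same categorization can be written as one vectorized expression with nested ifelse() calls. This is an alternative sketch on made-up differences, not the code used for the figures; it avoids assigning into row subsets, which can be fragile when one of the categories happens to have no rows.

```r
# Toy paired differences (User_Score - Critic_Score); values are made up.
score_difference <- c(5, -3, 0, 12)

# Positive: users scored higher; negative: users scored lower; zero: equal.
diff_cat <- ifelse(
  score_difference > 0, "Higher User Score",
  ifelse(score_difference < 0, "Lower User Score", "User Score = Critic Score")
)

# Number of records in each category.
print(table(diff_cat))
```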
comparison_corr <- ggplot(data = selected_games, aes(x = Critic_Score, y = User_Score)) +
  geom_point(aes(color = diff_cat)) +
  geom_smooth(method = lm, color = "red", size = 0.5) +
  scale_color_manual(values = c("#45b6fe", "#296d98", "#0e2433")) +
  geom_abline(intercept = 0, slope = 1) +
  coord_cartesian(xlim = c(0,100), ylim = c(0,100)) +
  theme_classic() +
  theme(legend.position = "bottom", legend.direction = "vertical", legend.title = element_blank(), legend.key.size = unit(0.5, "cm"), strip.text = element_text(size = 11, face = "bold")) +
  xlab('Critic Score') +
  ylab('User Score') +
  facet_rep_wrap(~Genre, repeat.tick.labels = TRUE, ncol = 1)
comparison_corr <- annotate_figure(comparison_corr, top = text_grob("Correlation of Critic and User Scores", face = "bold", size = 12, hjust = 0.4))
comparison_corr
The last set of charts to be visualized are scatter plots of the ‘Global_Sales’ variable against the ‘score differences’ (between Critics and Users).
sales_score_corr <- ggplot(data = selected_games, aes(x = score_difference, y = Global_Sales, color = diff_cat)) +
  geom_point() +
  scale_color_manual(values = c("#45b6fe", "#296d98", "#0e2433")) +
  geom_vline(xintercept = 0) +
  theme_classic() +
  theme(legend.position = "bottom", legend.direction = "vertical", legend.title = element_blank(), legend.key.size = unit(0.5, "cm"), strip.text = element_text(size = 11, face = "bold")) +
  ylab('Global Sales')+
  xlab('User Score - Critic Score') +
  facet_rep_wrap(~Genre, scales = "free_y", repeat.tick.labels = TRUE, ncol = 1)
sales_score_corr <- annotate_figure(sales_score_corr, top = text_grob("Global Sales vs (User-Critic Score)", face = "bold", size = 12, hjust = 0.4))
sales_score_corr
The last step in creating the final visualization is to combine all 3 groups of charts (boxplots, correlation, scatter). A plot title and lead-in text are also added (via annotate_figure()) to provide more information on the visualization.
visualization <- ggarrange(comparison_boxplot, comparison_corr, sales_score_corr, ncol = 3)
visualization <- annotate_figure(visualization, top = "Ratings between Critics and Users have often differed from one another. There appears to be more User Ratings that are\n higher than Critic Ratings for the top 5 video game genres. Video games with higher global sales also have lower User\n Ratings than Critic Ratings, especially so for Action, Role-Playing and Shooting games. \n")
visualization <- annotate_figure(visualization, top = text_grob("Discrepancies Between Critic and User Ratings in Video Game Genres", face = "bold", size = 16))
In the boxplots on the left, the distributions of both Critic Scores and User Scores can be analyzed with ease. The boxplots show a visualization of the 5-number summary, the notched portions indicate the 95% confidence interval of the medians, and the violin charts show the density plots. If the notched portions overlap, it means that there is not enough evidence to suggest that the medians differ.
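For reference, the ggplot2 documentation describes the notches as extending 1.58 * IQR / sqrt(n) on either side of the median, which gives a roughly 95% interval for comparing medians. The sketch below reproduces that calculation on two made-up score vectors; `notch_interval` is a helper name introduced here, not a ggplot2 function.

```r
# Notch limits as described in the geom_boxplot() documentation:
# median +/- 1.58 * IQR / sqrt(n).
notch_interval <- function(x) {
  x <- x[!is.na(x)]
  half_width <- 1.58 * IQR(x) / sqrt(length(x))
  c(lower = median(x) - half_width, upper = median(x) + half_width)
}

# Toy score vectors (values are made up).
critic <- c(60, 70, 75, 80, 85, 90, 95)
user   <- c(55, 65, 70, 72, 78, 88, 92)

critic_notch <- notch_interval(critic)
user_notch   <- notch_interval(user)

# Two intervals overlap when each one starts before the other ends.
notches_overlap <- critic_notch[["lower"]] <= user_notch[["upper"]] &&
  user_notch[["lower"]] <= critic_notch[["upper"]]

print(critic_notch)
print(user_notch)
print(notches_overlap)
```

Because the half-width shrinks with sqrt(n), genres with many records have narrow notches, so even small median differences can show up as non-overlapping.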
The correlation plots in the middle show whether there are strong correlations between User Scores and Critic Scores. A more important feature for the task at hand is the 45-degree line. Any point above it indicates that the User score is higher than the Critic score for that video game, and any point below it indicates the opposite, giving readers a quick overview of the ratings imbalance in each genre.
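What the middle panels show can also be summarized numerically. As a supplementary sketch on made-up data (not part of the original analysis), the code below computes, per genre, the Pearson correlation between the two scores and the share of games lying above the 45-degree line.

```r
# Toy scores for two genres (all values are made up).
toy <- data.frame(
  Genre        = c("Action", "Action", "Action", "Sports", "Sports", "Sports"),
  Critic_Score = c(60, 75, 90, 55, 70, 85),
  User_Score   = c(65, 70, 95, 50, 72, 80),
  stringsAsFactors = FALSE
)

# One data frame per genre.
by_genre <- split(toy, toy$Genre)

# For each genre: linear correlation, and the fraction of games where
# the user score exceeds the critic score (points above the 45-degree line).
genre_summary <- t(sapply(by_genre, function(d) c(
  pearson_r        = cor(d$Critic_Score, d$User_Score),
  share_above_line = mean(d$User_Score > d$Critic_Score)
)))

print(genre_summary)
```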
The scatter plots on the right attempt to visualize the relationship between global sales and the discrepancies between User and Critic scores. If a pattern is visible, it indicates that the discrepancies are associated with video game sales, which are treated here as a proxy for the influence of video game companies.
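A visual pattern in these panels can be cross-checked with a rank correlation, which does not assume a linear relationship and is less sensitive to a few very high-selling titles. The sketch below is a supplementary check on made-up data, not part of the original analysis. A non-zero value only indicates association between sales and score discrepancy; it does not by itself establish influence.

```r
# Toy data: larger negative differences paired with higher sales (made up).
toy <- data.frame(
  score_difference = c(-20, -10, -5, 0, 5, 10),
  Global_Sales     = c(8.0, 5.5, 3.0, 1.2, 0.9, 0.4)
)

# Spearman's rho compares the rank orderings of the two variables.
rho <- cor(toy$score_difference, toy$Global_Sales, method = "spearman")

print(rho)
```

In this toy example the sales fall steadily as the difference rises, so the ranks are perfectly reversed and rho equals -1.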
Useful Information: