Assignment 4

Data Source: https://www.kaggle.com/rush4ratio/video-game-sales-with-ratings

1. Major Data and Design Challenges

Describe the major data and design challenges faced in accomplishing the task, and how you plan to overcome these challenges with a proposed sketched design. (3 marks)

The dataset chosen for the assignment covers video game sales and the ratings (or scores) the games received on the video game review website “Metacritic”.

The variables of interest for this visualization are as follows:

Categorical Variable:
- Genre
Continuous Variables:
- Critic_Score
- User_Score
- Global_Sales

Within the video games industry, the ratings (or scores; the two terms are used interchangeably) that critics give to video games have often been criticized, with suspicion of a lack of independence. The task here is to create a visualization comparing critic reviews and user reviews, to analyze whether any patterns can be observed between the two.

The potential challenges faced in accomplishing the task are as follows:

Challenges	Solution
Finding a way to visualize a potential lack of independence in critic ratings.	Video game companies that make large sales often have influence, hence, global sales can be viewed as a proxy to influence. Working with the data on hand, a possible method is to plot Global Sales and the differences between Critic and User Scores. Should there be a pattern emerging in the plot, it is a hint that Global Sales and disparities in ratings are associated, which may indicate influence affecting independence of critic scores.
There are too many video game genres to effectively visualize them neatly.	The top video game genres (in terms of available records) can be chosen for visualization. Since the task is to analyze potential patterns of video game companies’ influence on critic ratings, using the more popular genres would make sense as big video game companies tend to produce games belonging to the more popular categories.
Many video games exist on multiple gaming platforms (e.g. PC, Xbox, Playstation, Nintendo, etc.). This will be an issue in the analysis as the ratings may differ from platform to platform.	The ratings, however, tend to be fairly similar across platforms. It is assumed that a single record can be representative of the video game for all platforms. Hence, any duplicated records of the same video game title will be dropped in the visualization.

Proposed sketch design:

2. Procedures in Creating Visualization

Provide step-by-step description on how the data visualization was prepared by using ggplot2 and other related R packages. (3 marks)

2.1. Importing Packages for Processing and Analysis

The following packages are required:

Importing Data:
- readr (For reading csv files)
Graph Plotting:
- dplyr (To allow counting of records to determine top genres in terms of records)
- ggplot2 (For plotting charts)
- ggpubr (To input titles for figures created via facet wrapping)
- lemon (Provides advanced features in facet_wrap())
- reshape2 (Helps in transformation of dataset structure)

packages <- c('dplyr', 'ggplot2', 'ggpubr', 'lemon', 'readr', 'reshape2')

# Install any package that is not yet available, then attach it.
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

2.2. Importing Data

Using the function read_csv(), the raw dataset is imported.

games <- read_csv("Video_Games_Sales_as_at_22_Dec_2016.csv")

2.3. Data Pre-Processing

Data pre-processing is required to ensure that the data are suitable as inputs for the graphing packages.

2.3.1. Checking Datatypes

Before starting on pre-processing, the datatypes of the variables should be checked and, where necessary, converted to the right types.

Using str(), the datatypes of the dataset can be checked. The output below shows that ‘Year_of_Release’ and ‘User_Score’ variables have been incorrectly classified to be of “character” datatype.

str(games, give.attr = FALSE)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 16719 obs. of  16 variables:
##  $ Name           : chr  "Wii Sports" "Super Mario Bros." "Mario Kart Wii" "Wii Sports Resort" ...
##  $ Platform       : chr  "Wii" "NES" "Wii" "Wii" ...
##  $ Year_of_Release: chr  "2006" "1985" "2008" "2009" ...
##  $ Genre          : chr  "Sports" "Platform" "Racing" "Sports" ...
##  $ Publisher      : chr  "Nintendo" "Nintendo" "Nintendo" "Nintendo" ...
##  $ NA_Sales       : num  41.4 29.1 15.7 15.6 11.3 ...
##  $ EU_Sales       : num  28.96 3.58 12.76 10.93 8.89 ...
##  $ JP_Sales       : num  3.77 6.81 3.79 3.28 10.22 ...
##  $ Other_Sales    : num  8.45 0.77 3.29 2.95 1 0.58 2.88 2.84 2.24 0.47 ...
##  $ Global_Sales   : num  82.5 40.2 35.5 32.8 31.4 ...
##  $ Critic_Score   : num  76 NA 82 80 NA NA 89 58 87 NA ...
##  $ Critic_Count   : num  51 NA 73 73 NA NA 65 41 80 NA ...
##  $ User_Score     : chr  "8" NA "8.3" "8" ...
##  $ User_Count     : num  322 NA 709 192 NA NA 431 129 594 NA ...
##  $ Developer      : chr  "Nintendo" NA "Nintendo" "Nintendo" ...
##  $ Rating         : chr  "E" NA "E" "E" ...

To resolve the issue of incorrect datatypes, the as.numeric() function is used to convert the variables to be of numeric datatype.

games$Year_of_Release <- as.numeric(as.character(games$Year_of_Release))
games$User_Score <- as.numeric(as.character(games$User_Score))
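
One thing to watch with this conversion is that as.numeric() turns any non-numeric string into NA and only raises a warning. The sketch below uses a made-up vector rather than the real column, and it assumes that a placeholder string such as "tbd" is what caused the column to be read as character. It counts how many values are lost in the coercion, which is worth checking before the NA records are removed later.

```r
# Toy vector standing in for games$User_Score (values are invented for illustration).
user_score_raw <- c("8", "8.3", "tbd", NA)

# Non-numeric strings become NA; the warning is suppressed because it is expected here.
user_score_num <- suppressWarnings(as.numeric(user_score_raw))

# Values that were present as text but could not be parsed as numbers.
n_coerced <- sum(is.na(user_score_num) & !is.na(user_score_raw))
n_coerced
```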

2.3.2. Scaling Variables

‘User_Score’ should be scaled to a range from 0 to 100, the same range as ‘Critic_Score’, so that both variables can be compared more easily. The values are multiplied by 10 to achieve this.

games$User_Score <- games$User_Score * 10

2.3.3. Removing Missing Values

As the 4 variables detailed in section 1 (‘Genre’, ‘Critic_Score’, ‘User_Score’ and ‘Global_Sales’) are necessary to proceed with our visualization and analysis, records with a missing value in any of these variables are removed from the dataframe. Removing all such records leaves the dataset with 7,017 records for analysis.

games <- games[!(is.na(games$Genre)),]
games <- games[!(is.na(games$Critic_Score)),]
games <- games[!(is.na(games$User_Score)),]
games <- games[!(is.na(games$Global_Sales)),]
nrow(games)

## [1] 7017
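
The four filters above can also be expressed as a single step with complete.cases() from base R. This is only an equivalent sketch on a small invented data frame (the name toy and its values are not from the dataset), shown to make the row-dropping rule explicit.

```r
# Toy data frame with one missing value in three of the four rows.
toy <- data.frame(
  Genre        = c("Action", NA, "Sports", "Racing"),
  Critic_Score = c(76, 80, NA, 58),
  User_Score   = c(80, 75, 70, 66),
  Global_Sales = c(1.2, 0.5, 0.3, NA),
  stringsAsFactors = FALSE
)

# Keep only rows where every required variable is present.
required  <- c("Genre", "Critic_Score", "User_Score", "Global_Sales")
toy_clean <- toy[complete.cases(toy[, required]), ]
nrow(toy_clean)
```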

2.3.4. Removing Duplicate Records

Some of the video games in the dataset are released on several platforms (PC, Xbox, Playstation, Nintendo, etc.) and therefore have multiple records. The same video game title tends to have similar ratings and sales patterns regardless of platform. Hence, the assumption is made that one record of a video game title is representative of all its platform releases. Since duplicated() flags every occurrence of a title after the first, only the first record of each title is retained in the dataset for analysis.

There are 4,471 records remaining for analysis after removal of the duplicate records.

games <- games[!(duplicated(games$Name)),]
nrow(games)

## [1] 4471
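
To make the effect of duplicated() concrete, the sketch below applies the same filter to a small invented data frame (titles, platforms and sales are made up). It illustrates that the record kept for each title is whichever one appears first, so the row order of the file decides which platform release survives.

```r
# Toy data: "Game A" and "Game B" each appear on two platforms.
toy <- data.frame(
  Name         = c("Game A", "Game B", "Game A", "Game C", "Game B"),
  Platform     = c("PS3", "X360", "X360", "PC", "PS3"),
  Global_Sales = c(5.0, 3.2, 2.1, 1.0, 0.8),
  stringsAsFactors = FALSE
)

# duplicated() is FALSE for the first occurrence of each name and TRUE afterwards.
toy_unique <- toy[!duplicated(toy$Name), ]
toy_unique
```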

2.3.5. Creating New Variable for Comparison

The area of interest for this visualization is to check for potential influences on critic scores that affect their independence. To allow for that analysis, the differences between the 2 ratings (‘Critic_Score’ and ‘User_Score’) for each game are used, as mentioned in section 1. A new variable is thus required, where we subtract ‘Critic_Score’ from ‘User_Score’ to find the paired differences. A positive value therefore means that users rated the game higher than critics did. The new variable is named ‘score_difference’.

games$score_difference <- games$User_Score - games$Critic_Score

2.4. Exploratory Data Analysis (EDA)

The dataset has been cleaned and prepared for our analysis. This section aims to then extract the information we need and proceed with creating the final visualization by plotting the different types of charts.

2.4.1. Identifying Top Genres

The bar chart below indicates that the top 5 genres, after removing records with NA values in the selected variables and duplicate titles, are:

Action: 885 records
Role-Playing: 592 records
Sports: 515 records
Shooter: 513 records
Racing: 361 records

To reduce the clutter in the visualization, the analysis will only be conducted for these top 5 genres.

count(games, Genre) %>%
  ggplot(aes(x = reorder(Genre, n), y = n)) +
  geom_col(color = "black", fill = "#296d98") +
  geom_text(aes(label=n, hjust=-0.15)) +
  coord_flip() +
  theme_classic()

2.4.2. Subsetting Top 5 Genres from Dataset

The top 5 genres in the remaining records are extracted from the dataset.

selected_games <- games[(games$Genre %in% c('Action','Sports', 'Role-Playing', 'Shooter', 'Racing')),]
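
The genre names above are typed in by hand from the bar chart. As a hedged alternative, the top genres could be derived from the counts directly, so that the subset stays correct if the cleaning steps change. The sketch below uses an invented genre vector and picks the top 2 rather than the top 5 to keep it short; the names toy_genres and top_genres are illustrative only.

```r
# Toy genre column (invented values).
toy_genres <- c("Action", "Action", "Action", "Sports", "Sports", "Racing",
                "Puzzle", "Shooter", "Shooter", "Shooter", "Shooter")

# Count records per genre and sort from most to least frequent.
genre_counts <- sort(table(toy_genres), decreasing = TRUE)

# Names of the most frequent genres (use 5 instead of 2 for the real analysis).
top_genres <- names(genre_counts)[1:2]

# Keep only the records that belong to those genres.
selected <- toy_genres[toy_genres %in% top_genres]
top_genres
```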

2.4.3. Boxplot of Critic Scores and User Scores

For each genre, boxplots are created for ‘Critic_Score’ and ‘User_Score’ individually. To do so, melt() from the reshape2 package is used to transform the dataset from wide to long format, leaving 3 columns: ‘Genre’, ‘variable’ (which holds the name of the score, ‘Critic_Score’ or ‘User_Score’) and ‘value’ (which holds the score itself). The transformed dataset can then be used as an input in geom_boxplot() and in facet_rep_wrap() for the desired output.

selected_games.m <- selected_games %>%
  select(Genre, Critic_Score, User_Score) %>%
  melt(id.var = "Genre")
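
For readers unfamiliar with the long format, the sketch below builds the same three-column layout by hand in base R on two invented rows. It is not how melt() is implemented, only an illustration of the resulting structure (melt() additionally stores ‘variable’ as a factor).

```r
# Toy wide-format data: one row per game, one column per score (invented values).
wide <- data.frame(
  Genre        = c("Action", "Sports"),
  Critic_Score = c(76, 82),
  User_Score   = c(80, 83),
  stringsAsFactors = FALSE
)

score_cols <- c("Critic_Score", "User_Score")

# Long format: one row per game and score type.
long <- data.frame(
  Genre    = rep(wide$Genre, times = length(score_cols)),
  variable = rep(score_cols, each = nrow(wide)),
  value    = unlist(wide[score_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
long
```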

The boxplots are created with violin charts layered on top to give a better view of the distributions. Here, facet_rep_wrap() from the lemon package is used instead of facet_wrap() so that tick labels are repeated for each chart for clarity. annotate_figure() from the ggpubr package is also used to add a title to the facet-wrapped figure. The code is as follows:

comparison_boxplot <- ggplot(data = selected_games.m, aes(x = value, y = variable)) + 
  geom_boxplot(aes(fill = variable), notch = TRUE) +
  scale_fill_manual(values = c("#296d98", "#45b6fe")) +
  scale_y_discrete(labels=c("Critic Score", "User Score")) +
  geom_violin(alpha=0.1) +
  coord_flip() +
  theme_classic() +
  theme(legend.position = "bottom", legend.direction = "vertical", legend.title = element_blank(), legend.key.size = unit(0.98, "cm"), strip.text = element_text(size = 11, face = "bold"), axis.title.x = element_blank()) +
  xlab('Scores') +
  facet_rep_wrap(~Genre, repeat.tick.labels = TRUE, ncol = 1)

comparison_boxplot <- annotate_figure(comparison_boxplot, top = text_grob("Boxplots of Critic and User Scores", face = "bold", size = 12, hjust = 0.4))

comparison_boxplot
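
The notches drawn by notch = TRUE extend to the median plus or minus 1.58 * IQR / sqrt(n), which gives an approximate 95% confidence interval for the median. The same interval is returned by boxplot.stats() in base R, so the overlap that the chart shows visually can also be checked numerically. The two score vectors below are invented for illustration.

```r
# Toy score vectors (invented values).
critic <- c(60, 65, 70, 72, 75, 78, 80, 82, 85, 90)
user   <- c(70, 74, 78, 80, 82, 84, 86, 88, 90, 95)

# $conf holds the lower and upper notch limits around the median.
notch_critic <- boxplot.stats(critic)$conf
notch_user   <- boxplot.stats(user)$conf

# Two intervals overlap when each one starts before the other one ends.
notches_overlap <- notch_critic[2] >= notch_user[1] &&
  notch_user[2] >= notch_critic[1]
notches_overlap
```

If notches_overlap is FALSE, the notches give rough evidence that the two medians differ.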

2.4.4. Correlation of Critic Scores and User Scores

The correlations are plotted for each genre, between ‘User_Score’ and ‘Critic_Score’.

Before starting on the plots, a new variable ‘diff_cat’ is created to categorize the ‘score_difference’ variable formed in 2.3.5. This allows for color-coding in the charts and provides clarity in visualizing if ‘User_Score’ is higher than ‘Critic_Score’. It also allows for quick visualization of the number of records that have higher/lower ‘User_Score’ than ‘Critic_Score’.

selected_games$diff_cat = NA
selected_games[selected_games$score_difference > 0, ]$diff_cat <- "Higher User Score"
selected_games[selected_games$score_difference < 0, ]$diff_cat <- "Lower User Score"
selected_games[selected_games$score_difference == 0, ]$diff_cat <- "User Score = Critic Score"
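
The same categorization can be written as one vectorized expression, which avoids creating an NA column first and then filling it in three passes. This is only an alternative sketch on an invented vector of differences, not the code used for the charts.

```r
# Toy score differences (User_Score - Critic_Score), invented values.
score_difference <- c(12, -5, 0, 3, -20)

# Nested ifelse() assigns one of the three labels to every element.
diff_cat <- ifelse(score_difference > 0, "Higher User Score",
            ifelse(score_difference < 0, "Lower User Score",
                   "User Score = Critic Score"))

# Number of records in each category.
table(diff_cat)
```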


comparison_corr <- ggplot(data = selected_games, aes(x = Critic_Score, y = User_Score)) +
  geom_point(aes(color = diff_cat)) +
  geom_smooth(method = lm, color = "red", size = 0.5) +
  scale_color_manual(values = c("#45b6fe", "#296d98", "#0e2433")) +
  geom_abline(intercept = 0, slope = 1) +
  coord_cartesian(xlim = c(0,100), ylim = c(0,100)) +
  theme_classic() +
  theme(legend.position = "bottom", legend.direction = "vertical", legend.title = element_blank(), legend.key.size = unit(0.5, "cm"), strip.text = element_text(size = 11, face = "bold")) +
  xlab('Critic Score') +
  ylab('User Score') +
  facet_rep_wrap(~Genre, repeat.tick.labels = TRUE, ncol = 1)

comparison_corr <- annotate_figure(comparison_corr, top = text_grob("Correlation of Critic and User Scores", face = "bold", size = 12, hjust = 0.4))

comparison_corr
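
The fitted line in each panel gives a visual impression of the relationship. If a number is wanted alongside it, the Pearson correlation per genre can be computed with base R as sketched below. The data frame is invented (the ‘Racing’ rows are deliberately constructed to be perfectly negatively correlated), so the values say nothing about the real dataset.

```r
# Toy data: two genres with four games each (invented values).
toy <- data.frame(
  Genre        = rep(c("Action", "Racing"), each = 4),
  Critic_Score = c(60, 70, 80, 90, 55, 65, 75, 85),
  User_Score   = c(65, 72, 78, 88, 80, 70, 60, 50),
  stringsAsFactors = FALSE
)

# Split the rows by genre and compute one correlation per group.
cor_by_genre <- sapply(split(toy, toy$Genre),
                       function(d) cor(d$Critic_Score, d$User_Score))
cor_by_genre
```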

2.4.5. Scatter Plots of Global Sales and Score Differences

The last set of charts consists of scatter plots of the ‘Global_Sales’ variable against ‘score_difference’ (User Score minus Critic Score). The y-axis is left free in each panel (scales = "free_y") because sales ranges differ between genres.

sales_score_corr <- ggplot(data = selected_games, aes(x = score_difference, y = Global_Sales, color = diff_cat)) +
  geom_point() +
  scale_color_manual(values = c("#45b6fe", "#296d98", "#0e2433")) +
  geom_vline(xintercept = 0) + 
  theme_classic() +
  theme(legend.position = "bottom", legend.direction = "vertical", legend.title = element_blank(), legend.key.size = unit(0.5, "cm"), strip.text = element_text(size = 11, face = "bold")) +
  ylab('Global Sales')+
  xlab('User Score - Critic Score') +
  facet_rep_wrap(~Genre, scales = "free_y", repeat.tick.labels = TRUE, ncol = 1)

sales_score_corr <- annotate_figure(sales_score_corr, top = text_grob("Global Sales vs (User-Critic Score)", face = "bold", size = 12, hjust = 0.4))

sales_score_corr

2.4.6. Final Visualization Preparation

The last step in creating the final visualization is to combine all 3 groups of charts (boxplots, correlation plots, scatter plots). A plot title and lead-in text are also added (via annotate_figure()) to provide more information on the visualization.

visualization <- ggarrange(comparison_boxplot, comparison_corr, sales_score_corr, ncol = 3) 

visualization <- annotate_figure(visualization, top = "Ratings between Critics and Users have often differed from one another. There appears to be more User Ratings that are\n higher than Critic Ratings for the top 5 video game genres. Video games with higher global sales also have lower User\n Ratings than Critic Ratings, especially so for Action, Role-Playing and Shooting games. \n")

visualization <- annotate_figure(visualization, top = text_grob("Discrepancies Between Critic and User Ratings in Video Game Genres", face = "bold", size = 16))

3. Final Visualization and Description

The final data visualization and a short description of not more than 350 words. The description must provide at least two useful information revealed by the data visualization. (4 marks)

In the boxplots on the left, the distributions of both Critic Scores and User Scores can be analyzed with ease. The boxplots visualize the 5-number summary, the notched portions indicate the approximate 95% confidence interval of the medians, and the violin charts show the density of the scores. If the notched portions overlap, there is not enough evidence to suggest that the medians differ.

The correlation plots in the middle show whether there are strong correlations between User Scores and Critic Scores. A more important feature for the task at hand is the 45-degree line. Any point above it indicates that the User score is higher than the Critic score for that video game, and any point below it indicates the opposite, giving readers a quick overview of the ratings imbalance in each genre.

The scatter plots on the right attempt to visualize the relationship between global sales and the discrepancies of User and Critic scores. If there are patterns, it is indicative that Critic scores can be associated with video game sales, which is treated as a proxy of the influence of video game companies.

Useful Information:

- In all genres, especially ‘Action’, ‘Role-Playing’ and ‘Shooter’, there appears to be an association between the sales of video games and the disagreement between User and Critic scores. Strong-selling video games (or video game companies) tend to have Critic scores that are much higher than User scores. This is possibly a hint that the independence of Critics has been affected.
- Conversely, in all genres, video games whose User scores are much higher than their Critic scores tend to have lower sales.
- Another interesting piece of information is that Users tend to give higher ratings than Critics, at least in the ‘Action’, ‘Racing’ and ‘Role-Playing’ genres. The notched portions of the User Score boxplots in these 3 genres do not overlap with those of the Critic Score boxplots, indicating strong evidence (at roughly the 95% confidence level) that the medians differ between the two ratings.

ISSS608 - Assignment 4

Dexter Tan

24/07/2020
