DATA 605_Week11_Assignment11

I sourced data on video game reviews and sales from Kaggle. It covers games with sales greater than 100,000 copies over the last 30+ years, and was scraped by vgchartz.com (https://www.kaggle.com/gregorut/videogamesales/home).

# Load and explore data
path <- file.path(getwd(), "Video_Games_Sales_as_at_22_Dec_2016.csv")
games_df <- read_csv(file = path, col_names = TRUE)

## Parsed with column specification:
## cols(
##   Name = col_character(),
##   Platform = col_character(),
##   Year_of_Release = col_character(),
##   Genre = col_character(),
##   Publisher = col_character(),
##   NA_Sales = col_double(),
##   EU_Sales = col_double(),
##   JP_Sales = col_double(),
##   Other_Sales = col_double(),
##   Global_Sales = col_double(),
##   Critic_Score = col_integer(),
##   Critic_Count = col_integer(),
##   User_Score = col_character(),
##   User_Count = col_integer(),
##   Developer = col_character(),
##   Rating = col_character()
## )

summary(games_df)

##      Name             Platform         Year_of_Release   
##  Length:16719       Length:16719       Length:16719      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##     Genre            Publisher            NA_Sales          EU_Sales     
##  Length:16719       Length:16719       Min.   : 0.0000   Min.   : 0.000  
##  Class :character   Class :character   1st Qu.: 0.0000   1st Qu.: 0.000  
##  Mode  :character   Mode  :character   Median : 0.0800   Median : 0.020  
##                                        Mean   : 0.2633   Mean   : 0.145  
##                                        3rd Qu.: 0.2400   3rd Qu.: 0.110  
##                                        Max.   :41.3600   Max.   :28.960  
##                                                                          
##     JP_Sales        Other_Sales        Global_Sales      Critic_Score  
##  Min.   : 0.0000   Min.   : 0.00000   Min.   : 0.0100   Min.   :13.00  
##  1st Qu.: 0.0000   1st Qu.: 0.00000   1st Qu.: 0.0600   1st Qu.:60.00  
##  Median : 0.0000   Median : 0.01000   Median : 0.1700   Median :71.00  
##  Mean   : 0.0776   Mean   : 0.04733   Mean   : 0.5335   Mean   :68.97  
##  3rd Qu.: 0.0400   3rd Qu.: 0.03000   3rd Qu.: 0.4700   3rd Qu.:79.00  
##  Max.   :10.2200   Max.   :10.57000   Max.   :82.5300   Max.   :98.00  
##                                                         NA's   :8582   
##   Critic_Count     User_Score          User_Count       Developer        
##  Min.   :  3.00   Length:16719       Min.   :    4.0   Length:16719      
##  1st Qu.: 12.00   Class :character   1st Qu.:   10.0   Class :character  
##  Median : 21.00   Mode  :character   Median :   24.0   Mode  :character  
##  Mean   : 26.36                      Mean   :  162.2                     
##  3rd Qu.: 36.00                      3rd Qu.:   81.0                     
##  Max.   :113.00                      Max.   :10665.0                     
##  NA's   :8582                        NA's   :9129                        
##     Rating         
##  Length:16719      
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

str(games_df)

## Classes 'tbl_df', 'tbl' and 'data.frame':    16719 obs. of  16 variables:
##  $ Name           : chr  "Wii Sports" "Super Mario Bros." "Mario Kart Wii" "Wii Sports Resort" ...
##  $ Platform       : chr  "Wii" "NES" "Wii" "Wii" ...
##  $ Year_of_Release: chr  "2006" "1985" "2008" "2009" ...
##  $ Genre          : chr  "Sports" "Platform" "Racing" "Sports" ...
##  $ Publisher      : chr  "Nintendo" "Nintendo" "Nintendo" "Nintendo" ...
##  $ NA_Sales       : num  41.4 29.1 15.7 15.6 11.3 ...
##  $ EU_Sales       : num  28.96 3.58 12.76 10.93 8.89 ...
##  $ JP_Sales       : num  3.77 6.81 3.79 3.28 10.22 ...
##  $ Other_Sales    : num  8.45 0.77 3.29 2.95 1 0.58 2.88 2.84 2.24 0.47 ...
##  $ Global_Sales   : num  82.5 40.2 35.5 32.8 31.4 ...
##  $ Critic_Score   : int  76 NA 82 80 NA NA 89 58 87 NA ...
##  $ Critic_Count   : int  51 NA 73 73 NA NA 65 41 80 NA ...
##  $ User_Score     : chr  "8" NA "8.3" "8" ...
##  $ User_Count     : int  322 NA 709 192 NA NA 431 129 594 NA ...
##  $ Developer      : chr  "Nintendo" NA "Nintendo" "Nintendo" ...
##  $ Rating         : chr  "E" NA "E" "E" ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 16
##   .. ..$ Name           : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Platform       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Year_of_Release: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Genre          : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Publisher      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ NA_Sales       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ EU_Sales       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ JP_Sales       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Other_Sales    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Global_Sales   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Critic_Score   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Critic_Count   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ User_Score     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ User_Count     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Developer      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Rating         : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

# Tidy data input
games_df$Platform <- as.factor(games_df$Platform)
games_df <- games_df %>% rename(Year = Year_of_Release)
games_df$Year <- as.integer(games_df$Year)

## Warning: NAs introduced by coercion

games_df$Genre <- as.factor(games_df$Genre)
games_df$Publisher <- as.factor(games_df$Publisher)
games_df$User_Score <- as.double(games_df$User_Score)

## Warning: NAs introduced by coercion

games_df$Developer <- as.factor(games_df$Developer)
games_df$Rating <- as.factor(games_df$Rating)

# Quantify unique values of each variable
games_df %>% 
  summarise_each(funs(n_distinct)) %>% 
  as.tibble()

## `summarise_each()` is deprecated.
## Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
## To map `funs` over all variables, use `summarise_all()`

## # A tibble: 1 x 16
##    Name Platform  Year Genre Publisher NA_Sales EU_Sales JP_Sales
##   <int>    <int> <int> <int>     <int>    <int>    <int>    <int>
## 1 11563       31    40    13       581      402      307      244
## # ... with 8 more variables: Other_Sales <int>, Global_Sales <int>,
## #   Critic_Score <int>, Critic_Count <int>, User_Score <int>,
## #   User_Count <int>, Developer <int>, Rating <int>

This games dataset includes over 16,000 observations covering over 11,000 game titles published over 40 years. These account for 31 plaforms, 13 genres, nearly 1,700 developers, and a third as many publishers. In addition to critic and user review scores (which we’ll only look at obliquely in this exercise), the dataset also include sales figures (in millions of USD) for North America, EU, Japan, and other regions, as well as a rolled-up global number.

Since this is such a rich dataset, before looking into the relationship between sales in different markets we’ll explore the scores reviewers have given game. Professional critics and regular players’s reviews are scaled differently: critic scores are on a 100-pt scale, and users on a 10pt scale.

# Explore score distribution
CScore_hist <- games_df %>% select(Name, Critic_Score) %>%
  filter(!is.na(Critic_Score)) %>% 
  group_by(Name) %>% 
  
  ggplot(aes(Critic_Score)) +
  geom_histogram(fill = "lightblue4", bins = 50) + # set bin size
  scale_x_continuous(limits = c(0, 100), breaks = seq(from = 0, to = 100, by = 25)) +
  scale_y_continuous(limits = c(0, 600), breaks = seq(from = 0, to = 600, by = 100)) +
  labs(title = "Critic Score Distribution", x = "Score", y = "Count") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5))

UScore_hist <- games_df %>% select(Name, User_Score) %>% 
  group_by(Name) %>% 
  filter(!is.na(User_Score)) %>% 
  
  ggplot(aes(User_Score)) +
  geom_histogram(fill = "lightsalmon2", bins = 50) +
  scale_x_continuous(limits = c(0, 10), breaks = seq(from = 0, to = 10, by = 2.5)) +
  scale_y_continuous(limits = c(0, 600), breaks = seq(from = 0, to = 600, by = 100)) +
  labs(title = "User Score Distribution", x = "Score", y = "Count") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5))

grid.arrange(CScore_hist, UScore_hist, nrow = 2)

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

Review scores follow a broadly similar distribution for both critics and users, though there is more positive skew for users - which is to say, critics seem to be harder on games; or alternatively, users with favorable experiences (or fanboys) are more likely to score games.

The video game market has grown dramatically over the last 30+ years - we can treat review volume as a signal of engagement, and since we have a robust longitudinal dataset, look at the trend of critic and user volume to see if they appear correlated.

# Boxplot of critic reviews
CReview_trend <- games_df %>% 
  select(Year, Critic_Count) %>% 
  filter(!is.na(Critic_Count)) %>% 
  group_by(Year) %>% 
  summarize(Total = sum(Critic_Count)) %>% 

  ggplot(aes(x = Year, y = Total)) +
  geom_area(fill = "lightblue4") +
  scale_y_continuous(limits = c(0, 150000), breaks = seq(from = 0, to = 150000, by = 25000)) +
  labs(title = "Critic Review Volume", x = "Year", y = "Reviews") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5))

# Boxplot of user reviews
UReview_trend <- games_df %>% 
  select(Year, User_Count) %>% 
  filter(!is.na(User_Count)) %>% 
  group_by(Year) %>% 
  summarize(Total = sum(User_Count)) %>% 
  
  ggplot(aes(x = Year, y = Total)) +
  geom_area(fill = "lightsalmon2") +
  scale_y_continuous(limits = c(0, 150000), breaks = seq(from = 0, to = 150000, by = 25000)) +
  labs(title = "User Review Volume", x = "Year", y = "Reviews") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5))

grid.arrange(CReview_trend, UReview_trend, nrow = 2)

## Warning: Removed 1 rows containing missing values (position_stack).

## Warning: Removed 2 rows containing missing values (position_stack).

Critic reviews don’t commence in 1997, while user reviews began in the 1980s. Both grow in the early aughts, but critic reviews peaked at a count of 20,000, stayed near there until 2008, and then declined; on the other hand, user reviews continued to rise after 2000, growing to a count of over 130,000 (or more than 6x since 2000) and declining after 2013.

We’ll next look at the whether there’s a correlation between critic review volume and global sales.

# Linear plot of critic reviews and global sales
score_trend <- games_df %>% 
  select(Year, Global_Sales, Critic_Score) %>% 
  filter(!is.na(Year)) %>% 
  filter(!is.na(Critic_Score)) %>% 
  group_by(Year) %>% 
  summarize(Sales = sum(Global_Sales), `Average Score` = mean(Critic_Score)) %>%
  arrange(Year, Sales, `Average Score`) %>% 
  gather(key = "Type", value = "Amount",  Sales:`Average Score`) %>% 
  
  ggplot(aes(x = Year, y = Amount, color = Type)) +
  geom_line(size = 1) + 
  scale_color_manual(values = c("lightblue4", "lightsalmon")) +
  labs(title = "Critic Review and Sales Trend", x = "Year", y = "Sales") +
#  sec_axis(trans = NULL, name = "Average Score") +  # scale down
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5))
  
score_trend

Switching gears, we’ll now turn to sales in the different regions (as a reminder, these are sales of video games that clear 100,000 copies). While the video game marketplace is regional with different consumer preferences and localized production, regions have grow in tandem with one another. We’ll explore whether there is a linear relationship between annual sales figures in two sets of markets: between North America and Japan, and between North America and Europe.

To start, a boxplot provides a sense of these each the distribution and central tendency of sales by title in each of the three markets.

# Boxplot of speed and distance
sales_box <- games_df %>% 
  select(NA_Sales, EU_Sales, JP_Sales) %>% 
  gather(key = "Variable", value = "Sales", NA_Sales:JP_Sales) %>% 
  
  ggplot(aes(x = Variable, y = Sales)) +
  geom_boxplot(notch = TRUE, outlier.color = "red", outlier.shape = 20) +
  coord_flip() +
  labs(title = "Distribution of Sales in Each Market", y = "", x = "") +
  scale_x_discrete(labels = c("North American Sales", "EU Sales", "Japan Sales")) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5))

sales_box

## notch went outside hinges. Try setting notch=FALSE.

Given the large number and spread of outliers (highlighted in red), it’s difficult to discern what the distribution really looks like. Suffice it to say, it’s apparent there are a large number of outliers on the high end in each market, particularly for Japan and North America. This is consistent with an entertainment marketplace driven by IP and network effects.

Scatterplots visually depict the overall relationship between 1) NA-Japan sales, and 2) NA-EU sales.

# Scatterplot of North America and Japan sales
NAJP_scatter <- ggplot(data = games_df, aes(x = JP_Sales, y = NA_Sales)) +
  geom_point(size = 1.5) +
  labs(title = "Sales: North America and Japan", x = "Japan Sales", y = "NA Sales") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5))

NAJP_scatter

# Scatterplot of North America and EU sales
NAEU_scatter <- ggplot(data = games_df, aes(x = EU_Sales, y = NA_Sales)) +
  geom_point(size = 1.5) +
  labs(title = "Sales: North America and EU", x = "EU Sales", y = "NA Sales") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5))

NAEU_scatter

The scatterplots reveal that the point density of the NA-EU sales appears to be clustered much tighter than NA-JP sales, suggesting that the linear relationship should be stronger. Accordingly, we’ll focus on the NA-EU relationship when developing a linear model.

# Fit linear model to NA and EU sales
NAEU_lm <- lm(data = games_df, formula = NA_Sales ~ EU_Sales)

summary(NAEU_lm)

## 
## Call:
## lm(formula = NA_Sales ~ EU_Sales, data = games_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6863 -0.0839 -0.0713  0.0129 26.0667 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.083920   0.004214   19.91   <2e-16 ***
## EU_Sales    1.237100   0.008047  153.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5236 on 16717 degrees of freedom
## Multiple R-squared:  0.5857, Adjusted R-squared:  0.5857 
## F-statistic: 2.364e+04 on 1 and 16717 DF,  p-value: < 2.2e-16

The residuals for the linear model do not exhibit a very normal distribution around the mean. While the mean is near zero, the interquartile range is skewed above it with many outliers.

The standard error for the EU sales variable is 150x smaller than its coefficient, and the standard error for the intercept is 20x smaller than its coefficient - this means very little variability in the estimate of either. P-values are below .05, which supports that these estimates are relevant to the model. Based on the reported $R^2$, the model explains nearly 59% of of the variation in the data.

We would define the regression model for this fit as:

$NA\_Sales = .08392 + 1.2371 * EU\_Sales$

We can interpret this formula to mean that for every additional $1.00 of EU sales we predict that NA sales increases by about $1.24. The y-intercept suggests that no sales in the EU would be accompanied by about $84,000 sales in NA, which is probably not practically meaningful.

We’ll augment the previous scatterplot with the regression line.

# Add regression line to scatterplot
NAEU_scatter +
  geom_smooth(method = lm, se = TRUE, fill = "gray95", color = "darkgray")

As anticipated, points hug the regression line pretty well, with some positive skew above the line but not too many apparent high leverage points.

Next we’ll take a look at the residuals:

# Append predicted value and residuals to original dataframe
games_df$NAEUpredicted <- predict(NAEU_lm)
games_df$NAEUresiduals <- residuals(NAEU_lm)

# Visualize residuals on scatterplot
games_residviz <- ggplot(data = games_df, aes(x = EU_Sales, y = NA_Sales)) +
  geom_segment(aes(xend = EU_Sales, yend = NAEUpredicted), alpha = .2) +
  geom_smooth(method = "lm", se = FALSE, color = "darkgray") +
  geom_point(aes(color = NAEUresiduals), size = 1.5) +  # plot actual points
  scale_color_gradient2(low = "midnightblue", mid = "white", high = "red2") + 
  guides(color = FALSE) +
  geom_point(aes(y = NAEUpredicted), size = 1.5, shape = 1) +  
  labs(title = "", x = "Japan Sales", y = "NA Sales") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5))

games_residviz

# Plot residuals
games_residplot <- ggplot(data = NAEU_lm, aes(x = .fitted, y = .resid)) +
  geom_point() +
  stat_smooth(method = "loess", se = TRUE, fill = "gray90", color = "darkgray") +
  geom_hline(yintercept = 0, col = "red1", linetype = "dashed", alpha = .8, size = .5) +
  labs(title = "Residual Plot", x = "Fitted Values", y = "Residuals") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5))

games_residplot

The first chart maps residuals to points on the regression line. Color intensity reflects magnitude of the residual - the greater the positive distance from the line, the redder the residual point; the greater the negative distance, the bluer the point. A quick look at the first chart shows more intense red than blues, which again reflects the wider spread, outliers, and leverage points above the mean.

The second chart is a proper residual plot, transforming the x-axis to the regression line to better depict the magnitude of residuals. A locally weighted scatterplot smoothed line has been fitted to the residuals. Its upward curve depicts the leverage exerted by the high leverage point to the right of the chart, for which NA sales were considerably higher than predicted.

Lastly, we’ll look at a quartile-quartile plot of residuals, which provides a visual indication of how normal their distribution is.

# QQplot 

ggplot(data = games_df, aes(sample = EU_Sales)) +
  stat_qq(size = 1.5) +
  stat_qq_line(color = "darkgray") +
  labs(title = "Quantile-Quantile Plot", x = "Theoretical Quantiles", y = "Standardized Residuals") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = .5))

Points follow the line pretty closely, but the plot evinces pronounced positive skew with a heavy upper tail, as we’ve seen elsewhere.

In conclusion, this general linear model seems to provide a reasonable prediction of North American sales based on EU sales that explains 59% of the variation in the data - not bad for a large, longitudinal set of real-world signals. This suggests that the cross-regional success of video game properties that achieve scale has some consistency in its pattern - perhaps driven by popular IP, generic trends, marketing and distribution practices, or other factors. If exploring further, we should examine collinearity and look at relationships between each of the regions with one another, correcting for outliers and discretizing by video game platform as well as time periods.

References:

DATA 605_Week11_Assignment11

Jeremy O’Brien

November 4, 2018