I sourced data on video game reviews and sales from Kaggle. It covers games with sales greater than 100,000 copies over the last 30+ years, and was scraped by vgchartz.com (https://www.kaggle.com/gregorut/videogamesales/home).
# Load and explore data
path <- file.path(getwd(), "Video_Games_Sales_as_at_22_Dec_2016.csv")
games_df <- read_csv(file = path, col_names = TRUE)
## Parsed with column specification:
## cols(
## Name = col_character(),
## Platform = col_character(),
## Year_of_Release = col_character(),
## Genre = col_character(),
## Publisher = col_character(),
## NA_Sales = col_double(),
## EU_Sales = col_double(),
## JP_Sales = col_double(),
## Other_Sales = col_double(),
## Global_Sales = col_double(),
## Critic_Score = col_integer(),
## Critic_Count = col_integer(),
## User_Score = col_character(),
## User_Count = col_integer(),
## Developer = col_character(),
## Rating = col_character()
## )
summary(games_df)
## Name Platform Year_of_Release
## Length:16719 Length:16719 Length:16719
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Genre Publisher NA_Sales EU_Sales
## Length:16719 Length:16719 Min. : 0.0000 Min. : 0.000
## Class :character Class :character 1st Qu.: 0.0000 1st Qu.: 0.000
## Mode :character Mode :character Median : 0.0800 Median : 0.020
## Mean : 0.2633 Mean : 0.145
## 3rd Qu.: 0.2400 3rd Qu.: 0.110
## Max. :41.3600 Max. :28.960
##
## JP_Sales Other_Sales Global_Sales Critic_Score
## Min. : 0.0000 Min. : 0.00000 Min. : 0.0100 Min. :13.00
## 1st Qu.: 0.0000 1st Qu.: 0.00000 1st Qu.: 0.0600 1st Qu.:60.00
## Median : 0.0000 Median : 0.01000 Median : 0.1700 Median :71.00
## Mean : 0.0776 Mean : 0.04733 Mean : 0.5335 Mean :68.97
## 3rd Qu.: 0.0400 3rd Qu.: 0.03000 3rd Qu.: 0.4700 3rd Qu.:79.00
## Max. :10.2200 Max. :10.57000 Max. :82.5300 Max. :98.00
## NA's :8582
## Critic_Count User_Score User_Count Developer
## Min. : 3.00 Length:16719 Min. : 4.0 Length:16719
## 1st Qu.: 12.00 Class :character 1st Qu.: 10.0 Class :character
## Median : 21.00 Mode :character Median : 24.0 Mode :character
## Mean : 26.36 Mean : 162.2
## 3rd Qu.: 36.00 3rd Qu.: 81.0
## Max. :113.00 Max. :10665.0
## NA's :8582 NA's :9129
## Rating
## Length:16719
## Class :character
## Mode :character
##
##
##
##
str(games_df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 16719 obs. of 16 variables:
## $ Name : chr "Wii Sports" "Super Mario Bros." "Mario Kart Wii" "Wii Sports Resort" ...
## $ Platform : chr "Wii" "NES" "Wii" "Wii" ...
## $ Year_of_Release: chr "2006" "1985" "2008" "2009" ...
## $ Genre : chr "Sports" "Platform" "Racing" "Sports" ...
## $ Publisher : chr "Nintendo" "Nintendo" "Nintendo" "Nintendo" ...
## $ NA_Sales : num 41.4 29.1 15.7 15.6 11.3 ...
## $ EU_Sales : num 28.96 3.58 12.76 10.93 8.89 ...
## $ JP_Sales : num 3.77 6.81 3.79 3.28 10.22 ...
## $ Other_Sales : num 8.45 0.77 3.29 2.95 1 0.58 2.88 2.84 2.24 0.47 ...
## $ Global_Sales : num 82.5 40.2 35.5 32.8 31.4 ...
## $ Critic_Score : int 76 NA 82 80 NA NA 89 58 87 NA ...
## $ Critic_Count : int 51 NA 73 73 NA NA 65 41 80 NA ...
## $ User_Score : chr "8" NA "8.3" "8" ...
## $ User_Count : int 322 NA 709 192 NA NA 431 129 594 NA ...
## $ Developer : chr "Nintendo" NA "Nintendo" "Nintendo" ...
## $ Rating : chr "E" NA "E" "E" ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 16
## .. ..$ Name : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Platform : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Year_of_Release: list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Genre : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Publisher : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ NA_Sales : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ EU_Sales : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ JP_Sales : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Other_Sales : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Global_Sales : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Critic_Score : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Critic_Count : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ User_Score : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ User_Count : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Developer : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Rating : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
# Tidy data input
games_df$Platform <- as.factor(games_df$Platform)
games_df <- games_df %>% rename(Year = Year_of_Release)
games_df$Year <- as.integer(games_df$Year)
## Warning: NAs introduced by coercion
games_df$Genre <- as.factor(games_df$Genre)
games_df$Publisher <- as.factor(games_df$Publisher)
games_df$User_Score <- as.double(games_df$User_Score)
## Warning: NAs introduced by coercion
games_df$Developer <- as.factor(games_df$Developer)
games_df$Rating <- as.factor(games_df$Rating)
# Quantify unique values of each variable
games_df %>%
summarise_each(funs(n_distinct)) %>%
as.tibble()
## `summarise_each()` is deprecated.
## Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
## To map `funs` over all variables, use `summarise_all()`
## # A tibble: 1 x 16
## Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales
## <int> <int> <int> <int> <int> <int> <int> <int>
## 1 11563 31 40 13 581 402 307 244
## # ... with 8 more variables: Other_Sales <int>, Global_Sales <int>,
## # Critic_Score <int>, Critic_Count <int>, User_Score <int>,
## # User_Count <int>, Developer <int>, Rating <int>
This games dataset includes over 16,000 observations covering over 11,000 game titles published over 40 years. These account for 31 plaforms, 13 genres, nearly 1,700 developers, and a third as many publishers. In addition to critic and user review scores (which we’ll only look at obliquely in this exercise), the dataset also include sales figures (in millions of USD) for North America, EU, Japan, and other regions, as well as a rolled-up global number.
Since this is such a rich dataset, before looking into the relationship between sales in different markets we’ll explore the scores reviewers have given game. Professional critics and regular players’s reviews are scaled differently: critic scores are on a 100-pt scale, and users on a 10pt scale.
# Explore score distribution
CScore_hist <- games_df %>% select(Name, Critic_Score) %>%
filter(!is.na(Critic_Score)) %>%
group_by(Name) %>%
ggplot(aes(Critic_Score)) +
geom_histogram(fill = "lightblue4", bins = 50) + # set bin size
scale_x_continuous(limits = c(0, 100), breaks = seq(from = 0, to = 100, by = 25)) +
scale_y_continuous(limits = c(0, 600), breaks = seq(from = 0, to = 600, by = 100)) +
labs(title = "Critic Score Distribution", x = "Score", y = "Count") +
theme_minimal() +
theme(plot.title = element_text(hjust = .5))
UScore_hist <- games_df %>% select(Name, User_Score) %>%
group_by(Name) %>%
filter(!is.na(User_Score)) %>%
ggplot(aes(User_Score)) +
geom_histogram(fill = "lightsalmon2", bins = 50) +
scale_x_continuous(limits = c(0, 10), breaks = seq(from = 0, to = 10, by = 2.5)) +
scale_y_continuous(limits = c(0, 600), breaks = seq(from = 0, to = 600, by = 100)) +
labs(title = "User Score Distribution", x = "Score", y = "Count") +
theme_minimal() +
theme(plot.title = element_text(hjust = .5))
grid.arrange(CScore_hist, UScore_hist, nrow = 2)
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
Review scores follow a broadly similar distribution for both critics and users, though there is more positive skew for users - which is to say, critics seem to be harder on games; or alternatively, users with favorable experiences (or fanboys) are more likely to score games.
The video game market has grown dramatically over the last 30+ years - we can treat review volume as a signal of engagement, and since we have a robust longitudinal dataset, look at the trend of critic and user volume to see if they appear correlated.
# Boxplot of critic reviews
CReview_trend <- games_df %>%
select(Year, Critic_Count) %>%
filter(!is.na(Critic_Count)) %>%
group_by(Year) %>%
summarize(Total = sum(Critic_Count)) %>%
ggplot(aes(x = Year, y = Total)) +
geom_area(fill = "lightblue4") +
scale_y_continuous(limits = c(0, 150000), breaks = seq(from = 0, to = 150000, by = 25000)) +
labs(title = "Critic Review Volume", x = "Year", y = "Reviews") +
theme_minimal() +
theme(plot.title = element_text(hjust = .5))
# Boxplot of user reviews
UReview_trend <- games_df %>%
select(Year, User_Count) %>%
filter(!is.na(User_Count)) %>%
group_by(Year) %>%
summarize(Total = sum(User_Count)) %>%
ggplot(aes(x = Year, y = Total)) +
geom_area(fill = "lightsalmon2") +
scale_y_continuous(limits = c(0, 150000), breaks = seq(from = 0, to = 150000, by = 25000)) +
labs(title = "User Review Volume", x = "Year", y = "Reviews") +
theme_minimal() +
theme(plot.title = element_text(hjust = .5))
grid.arrange(CReview_trend, UReview_trend, nrow = 2)
## Warning: Removed 1 rows containing missing values (position_stack).
## Warning: Removed 2 rows containing missing values (position_stack).
Critic reviews don’t commence in 1997, while user reviews began in the 1980s. Both grow in the early aughts, but critic reviews peaked at a count of 20,000, stayed near there until 2008, and then declined; on the other hand, user reviews continued to rise after 2000, growing to a count of over 130,000 (or more than 6x since 2000) and declining after 2013.
We’ll next look at the whether there’s a correlation between critic review volume and global sales.
# Linear plot of critic reviews and global sales
score_trend <- games_df %>%
select(Year, Global_Sales, Critic_Score) %>%
filter(!is.na(Year)) %>%
filter(!is.na(Critic_Score)) %>%
group_by(Year) %>%
summarize(Sales = sum(Global_Sales), `Average Score` = mean(Critic_Score)) %>%
arrange(Year, Sales, `Average Score`) %>%
gather(key = "Type", value = "Amount", Sales:`Average Score`) %>%
ggplot(aes(x = Year, y = Amount, color = Type)) +
geom_line(size = 1) +
scale_color_manual(values = c("lightblue4", "lightsalmon")) +
labs(title = "Critic Review and Sales Trend", x = "Year", y = "Sales") +
# sec_axis(trans = NULL, name = "Average Score") + # scale down
theme_minimal() +
theme(plot.title = element_text(hjust = .5))
score_trend
Switching gears, we’ll now turn to sales in the different regions (as a reminder, these are sales of video games that clear 100,000 copies). While the video game marketplace is regional with different consumer preferences and localized production, regions have grow in tandem with one another. We’ll explore whether there is a linear relationship between annual sales figures in two sets of markets: between North America and Japan, and between North America and Europe.
To start, a boxplot provides a sense of these each the distribution and central tendency of sales by title in each of the three markets.
# Boxplot of speed and distance
sales_box <- games_df %>%
select(NA_Sales, EU_Sales, JP_Sales) %>%
gather(key = "Variable", value = "Sales", NA_Sales:JP_Sales) %>%
ggplot(aes(x = Variable, y = Sales)) +
geom_boxplot(notch = TRUE, outlier.color = "red", outlier.shape = 20) +
coord_flip() +
labs(title = "Distribution of Sales in Each Market", y = "", x = "") +
scale_x_discrete(labels = c("North American Sales", "EU Sales", "Japan Sales")) +
theme_minimal() +
theme(plot.title = element_text(hjust = .5))
sales_box
## notch went outside hinges. Try setting notch=FALSE.
Given the large number and spread of outliers (highlighted in red), it’s difficult to discern what the distribution really looks like. Suffice it to say, it’s apparent there are a large number of outliers on the high end in each market, particularly for Japan and North America. This is consistent with an entertainment marketplace driven by IP and network effects.
Scatterplots visually depict the overall relationship between 1) NA-Japan sales, and 2) NA-EU sales.
# Scatterplot of North America and Japan sales
NAJP_scatter <- ggplot(data = games_df, aes(x = JP_Sales, y = NA_Sales)) +
geom_point(size = 1.5) +
labs(title = "Sales: North America and Japan", x = "Japan Sales", y = "NA Sales") +
theme_minimal() +
theme(plot.title = element_text(hjust = .5))
NAJP_scatter
# Scatterplot of North America and EU sales
NAEU_scatter <- ggplot(data = games_df, aes(x = EU_Sales, y = NA_Sales)) +
geom_point(size = 1.5) +
labs(title = "Sales: North America and EU", x = "EU Sales", y = "NA Sales") +
theme_minimal() +
theme(plot.title = element_text(hjust = .5))
NAEU_scatter
The scatterplots reveal that the point density of the NA-EU sales appears to be clustered much tighter than NA-JP sales, suggesting that the linear relationship should be stronger. Accordingly, we’ll focus on the NA-EU relationship when developing a linear model.
# Fit linear model to NA and EU sales
NAEU_lm <- lm(data = games_df, formula = NA_Sales ~ EU_Sales)
summary(NAEU_lm)
##
## Call:
## lm(formula = NA_Sales ~ EU_Sales, data = games_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6863 -0.0839 -0.0713 0.0129 26.0667
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.083920 0.004214 19.91 <2e-16 ***
## EU_Sales 1.237100 0.008047 153.74 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5236 on 16717 degrees of freedom
## Multiple R-squared: 0.5857, Adjusted R-squared: 0.5857
## F-statistic: 2.364e+04 on 1 and 16717 DF, p-value: < 2.2e-16
The residuals for the linear model do not exhibit a very normal distribution around the mean. While the mean is near zero, the interquartile range is skewed above it with many outliers.
The standard error for the EU sales variable is 150x smaller than its coefficient, and the standard error for the intercept is 20x smaller than its coefficient - this means very little variability in the estimate of either. P-values are below .05, which supports that these estimates are relevant to the model. Based on the reported \(R^2\), the model explains nearly 59% of of the variation in the data.
We would define the regression model for this fit as:
\(NA\_Sales = .08392 + 1.2371 * EU\_Sales\)
We can interpret this formula to mean that for every additional $1.00 of EU sales we predict that NA sales increases by about $1.24. The y-intercept suggests that no sales in the EU would be accompanied by about $84,000 sales in NA, which is probably not practically meaningful.
We’ll augment the previous scatterplot with the regression line.
# Add regression line to scatterplot
NAEU_scatter +
geom_smooth(method = lm, se = TRUE, fill = "gray95", color = "darkgray")
As anticipated, points hug the regression line pretty well, with some positive skew above the line but not too many apparent high leverage points.
Next we’ll take a look at the residuals:
# Append predicted value and residuals to original dataframe
games_df$NAEUpredicted <- predict(NAEU_lm)
games_df$NAEUresiduals <- residuals(NAEU_lm)
# Visualize residuals on scatterplot
games_residviz <- ggplot(data = games_df, aes(x = EU_Sales, y = NA_Sales)) +
geom_segment(aes(xend = EU_Sales, yend = NAEUpredicted), alpha = .2) +
geom_smooth(method = "lm", se = FALSE, color = "darkgray") +
geom_point(aes(color = NAEUresiduals), size = 1.5) + # plot actual points
scale_color_gradient2(low = "midnightblue", mid = "white", high = "red2") +
guides(color = FALSE) +
geom_point(aes(y = NAEUpredicted), size = 1.5, shape = 1) +
labs(title = "", x = "Japan Sales", y = "NA Sales") +
theme_minimal() +
theme(plot.title = element_text(hjust = .5))
games_residviz
# Plot residuals
games_residplot <- ggplot(data = NAEU_lm, aes(x = .fitted, y = .resid)) +
geom_point() +
stat_smooth(method = "loess", se = TRUE, fill = "gray90", color = "darkgray") +
geom_hline(yintercept = 0, col = "red1", linetype = "dashed", alpha = .8, size = .5) +
labs(title = "Residual Plot", x = "Fitted Values", y = "Residuals") +
theme_minimal() +
theme(plot.title = element_text(hjust = .5))
games_residplot
The first chart maps residuals to points on the regression line. Color intensity reflects magnitude of the residual - the greater the positive distance from the line, the redder the residual point; the greater the negative distance, the bluer the point. A quick look at the first chart shows more intense red than blues, which again reflects the wider spread, outliers, and leverage points above the mean.
The second chart is a proper residual plot, transforming the x-axis to the regression line to better depict the magnitude of residuals. A locally weighted scatterplot smoothed line has been fitted to the residuals. Its upward curve depicts the leverage exerted by the high leverage point to the right of the chart, for which NA sales were considerably higher than predicted.
Lastly, we’ll look at a quartile-quartile plot of residuals, which provides a visual indication of how normal their distribution is.
# QQplot
ggplot(data = games_df, aes(sample = EU_Sales)) +
stat_qq(size = 1.5) +
stat_qq_line(color = "darkgray") +
labs(title = "Quantile-Quantile Plot", x = "Theoretical Quantiles", y = "Standardized Residuals") +
theme_minimal() +
theme(plot.title = element_text(hjust = .5))
Points follow the line pretty closely, but the plot evinces pronounced positive skew with a heavy upper tail, as we’ve seen elsewhere.
In conclusion, this general linear model seems to provide a reasonable prediction of North American sales based on EU sales that explains 59% of the variation in the data - not bad for a large, longitudinal set of real-world signals. This suggests that the cross-regional success of video game properties that achieve scale has some consistency in its pattern - perhaps driven by popular IP, generic trends, marketing and distribution practices, or other factors. If exploring further, we should examine collinearity and look at relationships between each of the regions with one another, correcting for outliers and discretizing by video game platform as well as time periods.
References: