1 Introduction

Context of Data

In 1958, William Higinbotham developed the first video game.1 Since then, video games have become an integral part of online culture.2 They became especially popular when their medium began moving from large arcade machines (e.g., Pac-Man) to personal computers and consoles. Nintendo has been at the forefront of this online movement with the development of video game consoles like the Gameboy, with Microsoft and Sony following closely behind with the Xbox and Playstation respectively. Initially, many games were marketed toward adolescent boys (the name “Gameboy” is a perfect example of such a marketing strategy). Now, however, video games have a wider audience. There are multitudes of multiplayer and online games, with many gaming conventions across the globe.3

Source of Data

The data set is Rush Kirubi’s “Video Game Sales With Ratings” that was uploaded to Kaggle.com, with sales data from Vgchartz and ratings data from Metacritic.4 The data set contains 16 variables and 11,562 observations, each of which is a game released between 1980 and 2016.

We narrowed the data set’s 12 genres to 5. The number itself was somewhat arbitrary: any fewer would make the analysis too niche, while any more would make the model too complicated to read (our Professor also recommended the number). The choice of genres was not arbitrary, however. We explored the Shooter, Role-Playing, Action, Sports, and Racing genres for two reasons. First, these genres had the most complete observations. Second, and more importantly, it is more difficult to find free versions of games in these genres, i.e., there are no substitutes for buying the game (beyond pirating). This makes parsing relationships between variables a simpler process. The year 2007 was also chosen arbitrarily: two of our group members happened to play their first Pokemon games that year.

Research Question

Through our project, we want to investigate the relationship between critic score and the global sales of games. We also tested whether this relationship differs by the genre of the game. Our data frame specifically covers Shooter, Role-Playing, Action, Sports, and Racing games released in 2007.

Limitations of Data

As the data set was too large for us to create a meaningful trend analysis, we chose to narrow its scope. For the purposes of this project, we focused on how critic score and 5 of the 12 game genres together relate to Global Sales in 2007. At the same time, we took out games with incomplete observations (i.e., those containing NA values) using drop_na(Critic_Score) from the tidyr package, which is part of the tidyverse.

Additionally, the critic scores in our analysis are taken solely from Metacritic. It is unlikely that this single source is representative of all game reviews, so that needs to be taken into account when we consider the significance of our findings.

We understand that there is a tradeoff between a model’s simplicity and representativeness. For our project, we attempt to find a balance between the two, but we acknowledge that by narrowing the data’s scope, our model cannot predict Global Sales for genres outside the ones we’ve explored. Further, most of the NA values we dropped came from missing critic scores. By omitting them, we lose information about how well unreviewed video games sell and whether there is a trend among them too.
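The filtering described in this section can be sketched in a few lines. This is a plain-Python illustration under stated assumptions, not the R code used for the analysis: the helper name `filter_games` is hypothetical, the first two sample rows are copied from the Raw Data Glimpse table, and the last two rows are made up purely to exercise the filters.

```python
import math

# The five genres kept for the analysis
GENRES = {"Shooter", "Role-Playing", "Action", "Sports", "Racing"}


def filter_games(rows, year=2007, genres=GENRES):
    """Keep games from `year` in `genres` that have a critic score, and add log10 sales."""
    kept = []
    for row in rows:
        if row["Year_of_Release"] != year or row["Genre"] not in genres:
            continue
        if row["Critic_Score"] is None:  # plays the role of drop_na(Critic_Score)
            continue
        kept.append({**row, "log10_Global_Sales": math.log10(row["Global_Sales"])})
    return kept


rows = [
    {"Name": "Shrek the Third", "Year_of_Release": 2007, "Global_Sales": 0.48,
     "Critic_Score": 56, "Genre": "Action"},
    {"Name": "Skate", "Year_of_Release": 2007, "Global_Sales": 0.71,
     "Critic_Score": 86, "Genre": "Sports"},
    # Hypothetical rows (not from the data set), included only to show the filters
    {"Name": "Unreviewed Racer", "Year_of_Release": 2007, "Global_Sales": 0.10,
     "Critic_Score": None, "Genre": "Racing"},
    {"Name": "Some Puzzle Game", "Year_of_Release": 2007, "Global_Sales": 0.30,
     "Critic_Score": 75, "Genre": "Puzzle"},
]

games_2007 = filter_games(rows)
for game in games_2007:
    print(game["Name"], round(game["log10_Global_Sales"], 7))
```

Only the two reviewed games in the selected genres survive; the unreviewed game and the out-of-scope genre are dropped, which is exactly the information loss discussed above.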


2 Exploratory data analysis

Raw Data Glimpse

| Name | Year_of_Release | Global_Sales | Critic_Score | Genre | log10_Global_Sales |
| --- | --- | --- | --- | --- | --- |
| Shrek the Third | 2007 | 0.48 | 56 | Action | -0.3187588 |
| LEGO Star Wars: The Complete Saga | 2007 | 2.57 | 80 | Action | 0.4099331 |
| Skate | 2007 | 0.71 | 86 | Sports | -0.1487417 |
| Van Helsing | 2007 | 0.18 | 63 | Action | -0.7447275 |
| Ghost Rider | 2007 | 0.24 | 49 | Action | -0.6197888 |

Correlation Between Global Sales and Critic Score

correlation
0.401
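For readers who want to see how a correlation coefficient like this is computed, here is a minimal sketch of Pearson's r. The function name `pearson_r` is ours, and the inputs are only the five rows from the Raw Data Glimpse, so the result will not reproduce the 0.401 obtained from the full filtered data. Using the log10 sales column is also an assumption, since the heading does not say whether raw or logged sales were used.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Five rows copied from the Raw Data Glimpse table (illustration only)
critic_score = [56, 80, 86, 63, 49]
log10_sales = [-0.3187588, 0.4099331, -0.1487417, -0.7447275, -0.6197888]

r_sample = pearson_r(critic_score, log10_sales)
print(round(r_sample, 3))
```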

Regression Points

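| ID | log10_Global_Sales | Critic_Score | log10_Global_Sales_hat | residual |
| --- | --- | --- | --- | --- |
| 1 | 1.356 | 80 | -0.353 | 1.709 |
| 2 | 1.084 | 94 | -0.117 | 1.201 |
| 3 | 0.969 | 94 | -0.117 | 1.086 |
| 4 | 0.825 | 94 | -0.117 | 0.942 |
| 5 | 0.751 | 80 | -0.353 | 1.104 |
| 6 | 0.706 | 90 | -0.184 | 0.890 |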
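Each residual in this table is the observed value minus the fitted value. The short check below recomputes the residual column from the other two columns; the numbers are copied from the table, and the variable names are ours.

```python
# (observed log10 sales, fitted value, reported residual) from the Regression Points table
points = [
    (1.356, -0.353, 1.709),
    (1.084, -0.117, 1.201),
    (0.969, -0.117, 1.086),
    (0.825, -0.117, 0.942),
    (0.751, -0.353, 1.104),
    (0.706, -0.184, 0.890),
]

# residual = observed - fitted, rounded to the table's three decimals
residuals = [round(observed - fitted, 3) for observed, fitted, _ in points]

for (observed, fitted, reported), residual in zip(points, residuals):
    print(observed, fitted, residual, "matches" if abs(residual - reported) < 1e-9 else "differs")
```

All six residuals are positive, meaning these particular games sold more than the fitted line predicts for their critic scores.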

Exploratory Data Visualization

Initial Observations

The box plot shows the distribution of critic scores for each genre of video game. Action has the lowest median score, Sports has the highest, and the rest have a median score of roughly 70. It’s also interesting to note that the Action and Sports genres have the most outliers. Most scores fall in the 55 to 80 range.

This makes sense because the Action and Sports genres have the most data points, as the faceted scatterplot shows. All five genres have positive slopes. Role-Playing has the steepest slope, while Racing’s is the gentlest. The rest (Action, Shooter, and Sports) fall in between.


3 Multiple regression

Components of our Multiple Regression Model

In order to explore our numerical and categorical explanatory variables, we used an interaction model. The multiple regression lays out the relationship between Critic Score, our numerical explanatory variable; Genre (Shooter, Action, Role-Playing, Sports, and Racing), our categorical explanatory variable; and Global Sales, logged to the base 10, our outcome variable.

Interaction Model

| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
| --- | --- | --- | --- | --- | --- | --- |
| intercept | -1.507 | 0.193 | -7.822 | 0.000 | -1.886 | -1.128 |
| Critic_Score | 0.015 | 0.003 | 4.959 | 0.000 | 0.009 | 0.022 |
| GenreRacing | 0.230 | 0.444 | 0.519 | 0.604 | -0.642 | 1.102 |
| GenreRole-Playing | -1.187 | 0.457 | -2.599 | 0.010 | -2.085 | -0.289 |
| GenreShooter | -0.676 | 0.369 | -1.829 | 0.068 | -1.402 | 0.050 |
| GenreSports | -0.537 | 0.357 | -1.504 | 0.133 | -1.240 | 0.165 |
| Critic_Score:GenreRacing | -0.005 | 0.007 | -0.679 | 0.497 | -0.018 | 0.009 |
| Critic_Score:GenreRole-Playing | 0.015 | 0.007 | 2.200 | 0.028 | 0.002 | 0.028 |
| Critic_Score:GenreShooter | 0.006 | 0.005 | 1.223 | 0.222 | -0.004 | 0.017 |
| Critic_Score:GenreSports | 0.006 | 0.005 | 1.112 | 0.267 | -0.004 | 0.016 |

3.1 Statistical interpretation

Interpretation of the table (statistical language)

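The modeling equation for our regression, with (log10) Global Sales as the outcome, is

\[
\begin{aligned}
\widehat{\log_{10}(\text{Global Sales})} ={}& \beta_{\text{Action}} + \beta_{\text{Critic Score}} \cdot \text{Critic Score} \\
&+ \beta_{\text{Racing}} \cdot 1[x \in \text{Racing}] + \beta_{\text{Role-Playing}} \cdot 1[x \in \text{Role-Playing}] \\
&+ \beta_{\text{Shooter}} \cdot 1[x \in \text{Shooter}] + \beta_{\text{Sports}} \cdot 1[x \in \text{Sports}] \\
&+ \beta_{\text{Racing, Critic Score}} \cdot \text{Critic Score} \cdot 1[x \in \text{Racing}] \\
&+ \beta_{\text{Role-Playing, Critic Score}} \cdot \text{Critic Score} \cdot 1[x \in \text{Role-Playing}] \\
&+ \beta_{\text{Shooter, Critic Score}} \cdot \text{Critic Score} \cdot 1[x \in \text{Shooter}] \\
&+ \beta_{\text{Sports, Critic Score}} \cdot \text{Critic Score} \cdot 1[x \in \text{Sports}]
\end{aligned}
\]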

In the regression table, the baseline is Action video games. Therefore, the intercept term is the intercept of the regression line for Action games (-1.51). The \(\beta_{\text{Critic Score}}\) term is the slope for Action games: the average associated change in (log10) global sales of Action games for every 1-point increase in Critic Score. Terms 3 through 6 are the offsets of the intercepts of each genre’s regression line relative to Action (e.g., the intercept for Shooter games is -1.51 - 0.676 = -2.186). Terms 7 through 10 are the offsets of the slopes relative to Action (e.g., the associated average increase in (log10) global sales of Shooter games for every 1-point increase in Critic Score is 0.015 + 0.006 = 0.021).
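The offset arithmetic described above can be applied to every genre at once. The sketch below copies the estimates from the Interaction Model table and adds each genre's offsets to the Action baseline; the function name `genre_line` is ours. Note that it uses the table's intercept of -1.507 rather than the rounded -1.51, so the Shooter intercept comes out as -2.183 rather than -2.186.

```python
BASELINE = "Action"
GENRES = ["Action", "Racing", "Role-Playing", "Shooter", "Sports"]

# Estimates copied from the Interaction Model table
coef = {
    "intercept": -1.507,
    "Critic_Score": 0.015,
    "GenreRacing": 0.230,
    "GenreRole-Playing": -1.187,
    "GenreShooter": -0.676,
    "GenreSports": -0.537,
    "Critic_Score:GenreRacing": -0.005,
    "Critic_Score:GenreRole-Playing": 0.015,
    "Critic_Score:GenreShooter": 0.006,
    "Critic_Score:GenreSports": 0.006,
}


def genre_line(genre):
    """Return (intercept, slope) of one genre's line on the log10 sales scale."""
    intercept = coef["intercept"]
    slope = coef["Critic_Score"]
    if genre != BASELINE:
        # Non-baseline genres add their offsets to the Action values
        intercept += coef[f"Genre{genre}"]
        slope += coef[f"Critic_Score:Genre{genre}"]
    return round(intercept, 3), round(slope, 3)


lines = {genre: genre_line(genre) for genre in GENRES}
for genre, (intercept, slope) in lines.items():
    print(f"{genre}: intercept {intercept}, slope {slope}")
```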

Modeling Equation for Three Categories

Action

\(\widehat{\log_{10}(\text{Global Sales})} = -1.51 + 0.015 \cdot \text{Critic Score}\)

Role-Playing

\(\widehat{\log_{10}(\text{Global Sales})} = -2.7 + 0.03 \cdot \text{Critic Score}\)

Shooter

\(\widehat{\log_{10}(\text{Global Sales})} = -2.186 + 0.021 \cdot \text{Critic Score}\)
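Because the outcome is on a log10 scale, a prediction has to be raised to the power of 10 to get back to sales in the original units of Global_Sales. The sketch below does this for the three equations above at a critic score of 80, which is a hypothetical value chosen only for illustration. It also shows what a slope of 0.015 means multiplicatively.

```python
# (intercept, slope) pairs taken from the three modeling equations above
EQUATIONS = {
    "Action": (-1.51, 0.015),
    "Role-Playing": (-2.7, 0.03),
    "Shooter": (-2.186, 0.021),
}


def predict_log10_sales(intercept, slope, critic_score):
    """Predicted log10 global sales for a given critic score."""
    return intercept + slope * critic_score


critic_score = 80  # hypothetical critic score, for illustration only

for genre, (intercept, slope) in EQUATIONS.items():
    log10_sales = predict_log10_sales(intercept, slope, critic_score)
    sales = 10 ** log10_sales  # undo the log10 transform
    print(f"{genre}: log10 sales {log10_sales:.3f}, sales {sales:.3f}")

# A slope of 0.015 on the log10 scale multiplies predicted sales by 10**0.015
# for each additional critic-score point, roughly a 3.5% increase.
action_factor = 10 ** 0.015
print(round(action_factor, 4))
```

Reading the slope as a percentage change per point is often easier to communicate than a change in log10 units.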

Limitations of our regression analysis

The relationship between the critic score and genre of games and global sales may not be perfectly linear, but we have set the regression model as a linear model for simplicity.

Tying Together Results of Multiple Regression and Exploratory Data Analysis

The results of our multiple regression reflect those of our exploratory data analysis. All 5 genres have positive slopes. Role-Playing does have the steepest slope at 0.03, and Racing has the gentlest at 0.01. Interestingly, Shooter and Sports have the same slope (0.021), each with an offset of 0.006 from the baseline, Action.

3.2 Non-statistical interpretation

Based on the Interaction Model, all five genres showed a general increase in global sales as critic scores increase. Role-Playing games had the sharpest increase, while Racing games had the highest intercept, that is, the highest predicted sales at the lowest critic scores.


4 Inference for multiple regression

Because we’re interested in whether there is a relationship between Critic Scores and (log10) Global Sales for each genre, we will focus only on the slope terms of our Interaction Model, that is, Critic_Score, Critic_Score:GenreRacing, Critic_Score:GenreRole-Playing, Critic_Score:GenreShooter, and Critic_Score:GenreSports. Critic_Score is the slope for our baseline genre, Action; the other four terms are offsets from that slope. Our null hypothesis for each term is that its true value is 0: for Critic_Score, that there is no relationship between critic score and (log10) global sales for Action games, and for each offset, that the genre’s slope does not differ from Action’s. If we have sufficient evidence that a term is not equal to 0, then the data suggest that there is a relationship (or, for an offset, a difference in slope).

Interpretation of practical significance of Confidence intervals

We will use the baseline genre, Action, as an example. Here, the slope for the relationship between critic score and global sales for Action is represented in the table by Critic_Score. The estimate is 0.015, a positive relationship between critic score and global sales for Action video games. But real life is random, and the estimated slope for Action video games would not be exactly 0.015 in every sample of games. It varies.

The confidence intervals for our interaction model represent the plausible range of values for each slope term. The lower_ci and upper_ci columns of the table give the lower and upper bounds of the 95% confidence intervals. For the baseline, Action, the confidence interval for the slope is [0.009, 0.022], which means that we are 95% confident that the true slope lies between 0.009 and 0.022, after accounting for sampling variation. It is important to note that the entire interval lies above 0, that is, every plausible value of the slope for the Action genre is positive. Thus, the data suggest that there is reason to believe that there is a relationship between Critic Scores and (Log10) Global Sales for the Action genre.

The last 4 rows of the table represent statistics on the offsets of the slopes of each genre in comparison to that of Action. Their confidence intervals represent the range of plausible offsets in slope as compared to Action’s slope. It is interesting to note that apart from the slope offset for Role-Playing games, all the other genres under investigation had confidence intervals that included 0. This means that for Racing, Shooter, and Sports games, it is plausible that there is an offset of 0 from Action, that is, there isn’t enough evidence to suggest that these games have a different slope from that of Action games. On the other hand, the confidence interval for Role-Playing games is [0.002, 0.028], which is entirely positive. So for this set of data in 2007, there is reason to suggest that Role-Playing games’ slope has a positive offset as compared to that of Action games.
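The reasoning in this subsection comes down to one check per term: does the confidence interval contain 0? A minimal sketch of that check, using the lower_ci and upper_ci values copied from the Interaction Model table (the function name `excludes_zero` is ours):

```python
# (lower_ci, upper_ci) for the slope terms, copied from the Interaction Model table
slope_cis = {
    "Critic_Score": (0.009, 0.022),
    "Critic_Score:GenreRacing": (-0.018, 0.009),
    "Critic_Score:GenreRole-Playing": (0.002, 0.028),
    "Critic_Score:GenreShooter": (-0.004, 0.017),
    "Critic_Score:GenreSports": (-0.004, 0.016),
}


def excludes_zero(lower, upper):
    """True when every value in the interval has the same sign."""
    return lower > 0 or upper < 0


terms_excluding_zero = [
    term for term, (lower, upper) in slope_cis.items() if excludes_zero(lower, upper)
]

for term, (lower, upper) in slope_cis.items():
    verdict = "excludes 0" if excludes_zero(lower, upper) else "includes 0"
    print(f"{term}: [{lower}, {upper}] {verdict}")
```

Only the Action slope and the Role-Playing offset have intervals that exclude 0, which matches the interpretation above.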

Interpretation of p-values

Another way to investigate the validity of the null hypothesis is through the analysis of p-values.

For our project, our null hypothesis is that there is no relationship between our variables, Critic Score, (Log10) Global Sales, and Genre. We treat results as statistically significant at \(\alpha\) = 0.05. If a p-value is greater than 0.05, we do not have enough evidence to reject the null hypothesis; if it is less than 0.05, we reject it.

Thus, as the p-value is < 0.001 for the relationship between critic score and (log10) global sales of Action games (Critic_Score), we have enough evidence to reject the null hypothesis (that there is no relationship between the critic score of Action games and their (log10) global sales) and conclude that the critic scores of Action games do have a relationship with (log10) global sales.

The last four rows in the table are the offsets in the slope of the graph relationships of the remaining genres – Racing, Role-Playing, Shooter, and Sports. The conclusions drawn from the p-values in this case would show whether or not there is any significant difference between the slope of critic scores in the action genre and that of the comparing genre. Like the confidence intervals for these remaining genres, it is interesting to note that among them only Role-Playing games’ offset from Action has a p-value < \(\alpha\), that is, only in the Role-Playing genre is there substantial evidence to reject the null hypothesis. In other words, there isn’t enough evidence to suggest that there is a real difference in slope for the Racing, Shooter and Sports genres, but there is reason to believe that Role-Playing games have a different slope from that of Action games. This supports the conclusions that we can pull from the confidence interval data as well.
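The decision rule used in this subsection can be written out explicitly. The sketch below applies it to the p_value column for the slope terms; the values are copied from the Interaction Model table, where 0.000 means the p-value rounded to zero at three decimals (i.e., < 0.001). The labels in `decisions` are our own wording.

```python
ALPHA = 0.05  # significance level stated above

# p-values for the slope terms, copied from the Interaction Model table
p_values = {
    "Critic_Score": 0.000,  # reported as 0.000, i.e. < 0.001
    "Critic_Score:GenreRacing": 0.497,
    "Critic_Score:GenreRole-Playing": 0.028,
    "Critic_Score:GenreShooter": 0.222,
    "Critic_Score:GenreSports": 0.267,
}

decisions = {
    term: "reject H0" if p < ALPHA else "fail to reject H0"
    for term, p in p_values.items()
}

for term, decision in decisions.items():
    print(f"{term}: p = {p_values[term]:.3f}, {decision}")
```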

Residual analysis

Interpretation of Residual Analysis

Our residuals are fairly normally distributed: the histogram shows a near-normal distribution with a slight left skew. The spread is also almost constant. A scatterplot of the residuals exhibits a small amount of heteroskedasticity, but the variance is generally constant across the plot. Because these conditions are approximately met, the inference results for the regression can be treated as valid.


5 Conclusion

With our project, we wanted to see if there was a relationship between critic score and the global sales of games. Based on the multiple regression model, confidence intervals, and p-values, it seems that as critic scores increase, there is a general associated increase in (log10) global sales. Therefore we can conclude that there is reason to believe that there is a positive relationship between critic score and (log10) global sales.

We acknowledge that the claims of our analysis are limited. Because we narrowed down our data (for example, examining only games of five particular genres released in the year 2007), not all video games are represented in our analysis. Another limitation is that our dataset provides critic score data from a single web source, Metacritic. Since the scores here are aggregated from a select staff of critics, our claims may not generalize to all game review sources. Additionally, we examined the relationship between critic score and global sales in terms of genre, for which our baseline group for comparison is Action. Because all of our offsets are measured relative to the Action genre, the offset estimates and their p-values would change if we were to use another genre as our baseline group, even though the fitted line for each genre would stay the same. Thus we say that our claims are not absolute for all video game sales.
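One way to see what a different baseline would and would not change is to re-express the same fitted lines as offsets from another genre. The sketch below is an illustration we constructed from the point estimates in the Interaction Model table; the function names are ours. It cannot recover the standard errors or p-values of the new offsets, since those depend on information not shown in the table.

```python
GENRES = ["Action", "Racing", "Role-Playing", "Shooter", "Sports"]

# Estimates copied from the Interaction Model table (Action is the baseline)
coef = {
    "intercept": -1.507, "Critic_Score": 0.015,
    "GenreRacing": 0.230, "GenreRole-Playing": -1.187,
    "GenreShooter": -0.676, "GenreSports": -0.537,
    "Critic_Score:GenreRacing": -0.005, "Critic_Score:GenreRole-Playing": 0.015,
    "Critic_Score:GenreShooter": 0.006, "Critic_Score:GenreSports": 0.006,
}


def genre_line(genre):
    """(intercept, slope) of one genre's fitted line on the log10 sales scale."""
    intercept = coef["intercept"] + coef.get(f"Genre{genre}", 0.0)
    slope = coef["Critic_Score"] + coef.get(f"Critic_Score:Genre{genre}", 0.0)
    return intercept, slope


def rebaseline(new_baseline):
    """Express the same fitted lines as offsets from a different baseline genre."""
    base_intercept, base_slope = genre_line(new_baseline)
    table = {
        "intercept": round(base_intercept, 3),
        "Critic_Score": round(base_slope, 3),
    }
    for genre in GENRES:
        if genre == new_baseline:
            continue
        intercept, slope = genre_line(genre)
        table[f"Genre{genre}"] = round(intercept - base_intercept, 3)
        table[f"Critic_Score:Genre{genre}"] = round(slope - base_slope, 3)
    return table


rp_table = rebaseline("Role-Playing")
for term, estimate in rp_table.items():
    print(term, estimate)
```

With Role-Playing as the baseline, the Action slope offset becomes -0.015 instead of Role-Playing's +0.015: the comparison is the same, only the direction it is read in has changed.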

For video game developers, the results of this study suggest that a good critic score is associated with higher global sales. Because our data are observational, we cannot say that a higher score causes higher sales, but developers may still want to pay attention to how critics receive their games. The study also suggests that the relationship between critic score and global sales is not that different across genres (specifically, for Action, Shooter, Sports, and Racing games). The sales of Role-Playing games increase more than the others as the critic score increases. However, the global sales of video games of these different genres when they have the same critic score are generally similar.

In the future, it might be interesting to study how the sales of different genres of games change throughout different regions. The full dataset for this study included sales in North America, Japan, and Europe (as well as global sales), and it is possible that different genres sell better in different areas of the world. There are plenty of other interesting studies that can be done with this dataset. We leave it up to future statisticians and video game enthusiasts to explore them all.


  1. APS.org. (2008). October 1958: Physicist Invents First Video Game. [online] Available at: https://www.aps.org/publications/apsnews/200810/physicshistory.cfm [Accessed 11 Dec. 2018].

  2. Nickson, C. (2010). How Video Games Became Major Entertainment. [online] Atechnologysociety.co.uk. Available at: http://www.atechnologysociety.co.uk/how-video-games-became-major-entertainment.html [Accessed 11 Dec. 2018].

  3. VideoGameCons.com. (2018). 2018 Video Game Convention Calendar | VideoGameCons.com. [online] Available at: https://videogamecons.com/calendar/calendar.php?year=2018 [Accessed 11 Dec. 2018].

  4. Kirubi, R. (2016). Video Game Sales with Ratings. [online] Kaggle.com. Available at: https://www.kaggle.com/rush4ratio/video-game-sales-with-ratings [Accessed 11 Dec. 2018].