Project 3: Predicting Global Video Game Sales via Genre and User Scores

Project Background

The purpose of this research is to determine whether global video game sales (data acquired for approximately 16,720 games) can be predicted by user scores (ratings between 0 and 10 that users gave each game) and genre (Shooter = 1, Sports = 2, Platform = 3, Role-Play = 4, Racing = 5, Puzzle = 6, Action = 7, Simulation = 8, Strategy = 9, Adventure = 10, Fighting = 11, Misc = 12). The research also examines whether an interaction between user score and genre helps predict global video game sales.

Descriptives

Global Sales (in millions of dollars)

games <- read.csv("C:/Users/Admin/Downloads/games.csv")
attach(games)
summary(Global_Sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.0600  0.1700  0.5335  0.4700 82.5300
sd(Global_Sales)
## [1] 1.547935

According to the dataset, the average global sales across all included video games is 0.5335 million dollars, with a standard deviation of 1.5479 million dollars. The median is 0.1700 million dollars, well below the mean, which indicates a right-skewed distribution.

Genre

names(table(Genre))[table(Genre)==max(table(Genre))]
## [1] "7"

Since Genre is categorical, the mean is not useful for understanding the data, but frequency is important. The mode for genre is “7,” which corresponds to “Action.”
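As a minimal sketch of how the numeric mode could be reported with its label directly, the codes can be converted to a labeled factor before tabulating. The label order follows the coding scheme in the Project Background; `toy_codes` is a hypothetical stand-in for `games$Genre`, not the real data.

```r
# Genre labels in code order (1 = Shooter, ..., 12 = Misc), per the Project Background
genre_labels <- c("Shooter", "Sports", "Platform", "Role-Play", "Racing", "Puzzle",
                  "Action", "Simulation", "Strategy", "Adventure", "Fighting", "Misc")

# Hypothetical toy codes standing in for games$Genre
toy_codes <- c(7, 2, 7, 1, 12, 7, 4, 2)

# Convert the numeric codes to a labeled factor
toy_genre <- factor(toy_codes, levels = 1:12, labels = genre_labels)

# Frequency table and the most frequent label(s)
genre_counts <- table(toy_genre)
genre_mode <- names(genre_counts)[genre_counts == max(genre_counts)]
print(genre_mode)
```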

User scores

summary(User_Score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   6.400   7.500   7.125   8.200   9.700    9129
sd(User_Score)
## [1] NA

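The mean user score is 7.125 out of 10. The median is 7.5, with a minimum of 0 and a maximum of 9.7. The standard deviation is returned as NA because 9129 games have no user score and sd() does not drop missing values by default; calling sd() with na.rm = TRUE would compute it from the games that were rated.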
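The effect of missing values on `sd()` can be seen in a small sketch. The vector `toy_scores` is hypothetical and only illustrates the behavior; it is not drawn from the games data.

```r
# Hypothetical user scores with two missing values
toy_scores <- c(8.0, NA, 6.5, 7.5, NA, 9.0)

# By default sd() propagates missing values and returns NA
sd_default <- sd(toy_scores)

# na.rm = TRUE drops the missing values before computing the standard deviation
sd_complete <- sd(toy_scores, na.rm = TRUE)

# Count how many scores are missing
n_missing <- sum(is.na(toy_scores))

print(c(default = sd_default, complete = sd_complete, missing = n_missing))
```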

Visualization

Global Sales

library(ggplot2)
boxplot(Global_Sales, main="Global Video Game Sales",
xlab="Games", col = "cornflowerblue",
ylab="Sales in Millions of Dollars")

According to the boxplot, the distribution is strongly right-skewed: the mean (0.5335) is higher than the median (0.1700), largely because of outlier games that generated far more money than the others. Both the mean and the median fall well below 10 million dollars.

Genre

library(ggplot2)
boxplot(Genre, main="Video Game Genres",
xlab="Games", col = "cornflowerblue",
ylab="Genres (coded 1 through 12)")

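According to the boxplot, game genres appear to be fairly evenly distributed across the codes; however, because the genre codes are arbitrary labels, a boxplot is not well suited to this variable, and a bar chart of the number of games per genre would represent it better.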
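For a coded categorical variable, one alternative is a bar chart of counts per code. The sketch below uses base R's `barplot()` on a hypothetical vector `toy_codes`; with the real data, `Genre` would take its place.

```r
# Hypothetical genre codes standing in for games$Genre
toy_codes <- c(7, 7, 2, 1, 7, 12, 4, 2, 10)

# Tabulate counts for all 12 codes, keeping codes that have zero games
genre_counts <- table(factor(toy_codes, levels = 1:12))

# Draw one bar per genre code; barplot() returns the bar midpoints
bar_mids <- barplot(genre_counts, main = "Video Game Genres",
                    xlab = "Genre code", ylab = "Number of games",
                    col = "cornflowerblue")
```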

User Scores

boxplot(User_Score, main="Video Game User Scores",
xlab="Games", col = "cornflowerblue",
ylab="User Scores out of Ten")

User scores are skewed toward the upper end of the scale: per the summary above, the middle half of the scores lies between 6.4 and 8.2 out of 10, with a median of 7.5. This result indicates that the scores are somewhat inflated (as the average video game would ideally be rated 5 out of 10).

Multiple Linear Regression

Statement of Full Regression Line

model <- lm(Global_Sales ~ Genre + User_Score + Genre:User_Score, data = games)
summary(model)
## 
## Call:
## lm(formula = Global_Sales ~ Genre + User_Score + Genre:User_Score, 
##     data = games)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.025 -0.634 -0.399  0.017 81.630 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)   
## (Intercept)       0.297491   0.198915   1.496   0.1348   
## Genre            -0.061100   0.030304  -2.016   0.0438 * 
## User_Score        0.079706   0.027448   2.904   0.0037 **
## Genre:User_Score  0.005411   0.004194   1.290   0.1971   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.867 on 7586 degrees of freedom
##   (9129 observations deleted due to missingness)
## Multiple R-squared:  0.009626,   Adjusted R-squared:  0.009235 
## F-statistic: 24.58 on 3 and 7586 DF,  p-value: 8.041e-16

The full regression line is as follows, where \(\hat{Y}\) is the predicted Global Sales:

\(\hat{Y}=0.30-0.06X_{\mbox{Genre}}+0.08X_{\mbox{User Score}}+0.01X_{\mbox{Genre}}X_{\mbox{User Score}}\)

The p-value for the F-statistic for this model is nearly zero (\(8.041 \times 10^{-16}\)); the overall significance of the regression line is tested formally in the next section.

Test for Significance of Regression Line

modelsummary <- summary(model)
# format a p-value for reporting: values that round to 0 at four decimals are shown as "p < 0.0001"
p.value.string <- function(p.value){
  p.value <- round(p.value, digits=4)
  if (p.value == 0) {
    return("p < 0.0001")
  } else {
    return(paste0("p = ", format(p.value, scientific = FALSE)))
  }
}
# tibble provides as_tibble(), used below to tabulate the confidence intervals and coefficient table
library(tibble)
## Warning: package 'tibble' was built under R version 4.0.3
one_c1 <- coefficients(model)
one_s1 <- summary(model)
one_ci1 <- as_tibble(confint(model, level=0.95))
one_t1 <- as_tibble(one_s1[[4]])

Hypotheses

   \(H_0: \ \beta_1 = \beta_2 = \beta_3 = 0\)
   \(H_1:\) at least one \(\beta_i \ne 0\)

Test Statistic

   \(F_0 = 24.58\).

p-value

   \(p < 0.0001\).

Rejection Region

   Reject if \(p < \alpha\), where \(\alpha=0.05\).

Conclusion and Interpretation

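   Reject \(H_0\). There is sufficient evidence to suggest that the regression line is significant; that is, at least one of genre, user score, or their interaction is a significant predictor of global sales.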
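As a check on the test above, the p-value can be recovered from the reported F statistic and its degrees of freedom with `pf()`. This is a sketch using the rounded values printed in the model summary (F = 24.58 on 3 and 7586 degrees of freedom), so the result will agree with the summary only approximately.

```r
# Values copied from the model summary output
f_stat   <- 24.58
df_model <- 3
df_resid <- 7586

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_stat, df_model, df_resid, lower.tail = FALSE)

# Decision rule: reject H0 when p < alpha
alpha <- 0.05
reject_null <- p_value < alpha

print(p_value)
print(reject_null)
```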

Test for Significant Interaction

| Predictor | Estimate of \(\beta\) | 95% CI for \(\beta\) | p-value |
|---|---|---|---|
| Genre | -0.06 | (-0.12, 0.00) | p = 0.0438 |
| User Score | 0.08 | (0.03, 0.13) | p = 0.0037 |
| Interaction | 0.01 | (0.00, 0.01) | p = 0.1971 |

The interaction is not significant (p = 0.1971), so it should be removed from the model.
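The 95% confidence intervals reported for the three predictors can be reproduced from the estimates and standard errors in the model summary. The sketch below assumes the usual t-based interval, estimate plus or minus the critical t value times the standard error, with 7586 residual degrees of freedom.

```r
# Estimates and standard errors copied from the model summary output
est <- c(Genre = -0.061100, User_Score = 0.079706, Interaction = 0.005411)
se  <- c(Genre =  0.030304, User_Score = 0.027448, Interaction = 0.004194)

# Critical t value for a two-sided 95% interval
t_crit <- qt(0.975, df = 7586)

# Lower and upper bounds for each coefficient
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
ci_rounded <- round(ci, 2)
print(ci_rounded)
```

The unrounded interval for the interaction term spans zero, which agrees with its non-significant p-value.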

Statement and Interpretation of Model without Interaction Term

one_m2 <- lm(Global_Sales ~ Genre + User_Score, data=games)
one_c2 <- coefficients(one_m2)
one_s2 <- summary(one_m2)
one_ci2 <- as_tibble(confint(one_m2, level=0.95))
one_t2 <- as_tibble(one_s2[[4]])

The resulting regression model is \[ \hat{Y} = 0.08 -0.02 X_{\mbox{Genre}} + 0.11X_{\mbox{User Score}} \]

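The new model indicates that, holding genre constant, each one-point increase in User Score is associated with an increase of 0.11 million dollars in predicted Global Sales. Holding User Score constant, each one-unit increase in the Genre code is associated with a decrease of 0.02 million dollars in predicted Global Sales; because the genre codes are arbitrary labels, this slope should be interpreted with caution.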
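As a worked example of using the model without the interaction term, consider a hypothetical Action game (Genre = 7) with a User Score of 8. The game is made up for illustration; the coefficients are the rounded ones from the regression model above.

\[ \hat{Y} = 0.08 - 0.02(7) + 0.11(8) = 0.08 - 0.14 + 0.88 = 0.82 \]

Under these assumptions the model predicts roughly 0.82 million in global sales for such a game.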

Model Visualization

library(sjPlot)
## Warning: package 'sjPlot' was built under R version 4.0.3
## Install package "strengejacke" from GitHub (`devtools::install_github("strengejacke/strengejacke")`) to load all sj-packages at once!
library(sjmisc)
## Warning: package 'sjmisc' was built under R version 4.0.3
## 
## Attaching package: 'sjmisc'
## The following object is masked from 'package:tibble':
## 
##     add_case
library(ggplot2)
theme_set(theme_sjplot())

# treat Genre as a categorical variable (factor) rather than a numeric code
games$Genre <- to_factor(games$Genre)

# fit model with interaction; User_Score * Genre expands to both main effects plus their interaction
fit <- lm(Global_Sales ~ User_Score * Genre, data = games)

plot_model(fit, type = "int")
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set1 is 9
## Returning the palette you asked for with that many colors
## Warning: Removed 33 row(s) containing missing values (geom_path).

Because the Set1 color palette provides at most 9 colors, not all 12 genres are displayed (see the warnings above). Among the genres that are shown, the fitted lines for the majority do not cross one another, which is consistent with the absence of a significant interaction between User Score and Genre with regard to Global Sales.
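When Genre is a factor, the interaction consists of one term per non-reference genre, so it cannot be judged from a single coefficient. One common approach, sketched here on simulated data, is to compare the models with and without the interaction using `anova()`. The data frame `toy`, its three genre levels, and the simulated effect sizes are all hypothetical and chosen only so that the example runs on its own.

```r
# Simulated stand-in for the games data (hypothetical values)
set.seed(1)
n <- 300
toy <- data.frame(
  Genre      = factor(sample(1:3, n, replace = TRUE)),
  User_Score = round(runif(n, 0, 10), 1)
)
# Sales depend on user score only, so no true interaction is simulated
toy$Global_Sales <- 0.1 + 0.05 * toy$User_Score + rnorm(n, sd = 0.5)

# Nested models: main effects only versus main effects plus interaction
fit_main <- lm(Global_Sales ~ User_Score + Genre, data = toy)
fit_int  <- lm(Global_Sales ~ User_Score * Genre, data = toy)

# Partial F-test for all interaction terms at once
int_test <- anova(fit_main, fit_int)
print(int_test)
```

The second row of the printed table holds the F statistic and p-value for the interaction as a whole.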

Adjusted \(R^2\)

summary(model)$adj.r.squared 
## [1] 0.009234754
summary(one_m2)$adj.r.squared 
## [1] 0.009148023

The adjusted \(R^2\) value for the original model is \(0.009234754\).

The adjusted \(R^2\) value for the new model is \(0.009148023\).
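The adjusted \(R^2\) for the original model can be checked by hand from the summary output. The sketch below assumes the standard adjustment formula, with \(p = 3\) predictors and \(n = 7590\) observations used in the fit (the 7586 residual degrees of freedom plus the 4 estimated coefficients).

\[ R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1} = 1 - (1 - 0.009626)\frac{7589}{7586} \approx 0.0092 \]

This agrees with the value reported by R up to rounding of \(R^2\).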

Conclusion

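The original model indicated that the interaction between Genre and User Score was not statistically significant (p = 0.1971), and removing it leaves the adjusted \(R^2\) essentially unchanged (0.0092 versus 0.0091), so the simpler model without the interaction is preferred. Accordingly, global video game sales may be predicted by both genre and user scores, although the two predictors together explain less than 1% of the variance in sales, so their practical predictive value is limited.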

Animal Crossing

For Dr. Seals’s reference, I calculated the average Animal Crossing user scores: mean = 7.5 (minimum: Animal Crossing: Amiibo Festival (4.4), maximum: Animal Crossing for GameCube (8.9)).