The purpose of the present research is to determine whether global video game sales (data acquired for approximately 16,720 games) can be predicted by user scores (ratings between 0 and 10 that users gave each game) and genre (Shooter = 1, Sports = 2, Platform = 3, Role-Play = 4, Racing = 5, Puzzle = 6, Action = 7, Simulation = 8, Strategy = 9, Adventure = 10, Fighting = 11, Misc = 12). This research also seeks to determine whether there is an interaction between user scores and genre in predicting global video game sales.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.0600 0.1700 0.5335 0.4700 82.5300
## [1] 1.547935
According to the dataset, the average global sales across all included video games is 0.5335 million dollars, with a standard deviation of 1.5479 million dollars. The median is 0.1700 million dollars.
## [1] "7"
Since Genre is categorical, the mean is not useful for understanding the data, but frequency is important. The mode for genre is “7,” which corresponds to “Action.”
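The mode of a categorical variable can be read off a frequency table. The following is a minimal sketch of one way to do this; `genre_codes` is a small made-up vector used only for illustration, not the real data.

```r
# Hypothetical genre codes for illustration only (not the real data set)
genre_codes <- c(7, 2, 7, 1, 4, 7, 2)

# Count how often each code occurs
genre_counts <- table(genre_codes)

# The mode is the name of the most frequent category (returned as a string)
genre_mode <- names(genre_counts)[which.max(genre_counts)]
print(genre_mode)
```

Because the category names of a table are character strings, the mode comes back as a string such as "7" rather than a number.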
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 6.400 7.500 7.125 8.200 9.700 9129
## [1] NA
The mean user score (treated here as numeric) is 7.125 out of 10. The median is 7.5, with a minimum of 0 and a maximum of 9.7. The standard deviation is returned as NA because 9,129 games have no user score, and sd() returns NA unless missing values are excluded (na.rm = TRUE).
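In R, sd() returns NA whenever its input contains missing values, unless they are dropped with na.rm = TRUE. A small sketch of that behavior, using a made-up vector (`user_scores` is hypothetical, not the real column):

```r
# Hypothetical user scores with missing values (not the real data set)
user_scores <- c(8.0, 6.5, NA, 7.5, NA, 9.0)

# Default call: the missing values propagate and the result is NA
sd_default <- sd(user_scores)

# Excluding missing values: computed from the four observed scores
sd_no_na <- sd(user_scores, na.rm = TRUE)

print(sd_default)
print(sd_no_na)
```

Applied to the real data, the same na.rm = TRUE argument would give the standard deviation of the 7,590 games that do have a user score.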
library(ggplot2)
boxplot(Global_Sales, main="Global Video Game Sales",
xlab="Games", col = "cornflowerblue",
ylab="Sales in Millions of Dollars")
According to the boxplot, the distribution is strongly right-skewed: the mean is higher than the median, largely because of outlier games that generated significantly more money than the others. Both the mean and the median are well below 10 million dollars.
library(ggplot2)
boxplot(Genre, main="Video Game Genres",
xlab="Games", col = "cornflowerblue",
ylab="Genres (sorted 1 through 12)")
According to the boxplot, game genres appear to be fairly evenly distributed; however, boxplots are not well suited to categorical variables, since the numeric genre codes have no meaningful order.
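For a categorical variable, a bar chart of category frequencies is usually easier to read than a boxplot. A minimal sketch, again with a made-up vector of genre codes (`genre_codes` is hypothetical):

```r
# Hypothetical genre codes for illustration only (not the real data set)
genre_codes <- c(7, 2, 7, 1, 4, 7, 2, 12, 1)

# Frequency of each genre code
genre_counts <- table(genre_codes)

# One bar per genre; bar height is the number of games in that genre
barplot(genre_counts, main = "Video Game Genres",
        xlab = "Genre code", ylab = "Number of games",
        col = "cornflowerblue")
```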
boxplot(User_Score, main="Video Game User Scores",
xlab="Games", col = "cornflowerblue",
ylab="User Scores out of Ten")
According to the boxplot, user scores are skewed toward the high end of the scale: the middle half of the scores falls between 6.4 and 8.2, and scores below about 3.7 appear only as outliers. This result indicates that the scores are slightly inflated (as the average video game would ideally be rated 5 out of 10).
##
## Call:
## lm(formula = Global_Sales ~ Genre + User_Score + Genre:User_Score,
## data = games)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.025 -0.634 -0.399 0.017 81.630
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.297491 0.198915 1.496 0.1348
## Genre -0.061100 0.030304 -2.016 0.0438 *
## User_Score 0.079706 0.027448 2.904 0.0037 **
## Genre:User_Score 0.005411 0.004194 1.290 0.1971
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.867 on 7586 degrees of freedom
## (9129 observations deleted due to missingness)
## Multiple R-squared: 0.009626, Adjusted R-squared: 0.009235
## F-statistic: 24.58 on 3 and 7586 DF, p-value: 8.041e-16
The full regression line is as follows, such that \(Y={\mbox{Global Sales}}\):
\(\hat{Y}=0.30-0.06X_{\mbox{Genre}}+0.08X_{\mbox{User Score}}+0.01X_{\mbox{Genre}}X_{\mbox{User Score}}\)
The p-value for the F-statistic for this model is nearly zero (\(8.041 \times 10^{-16}\)), which motivates a formal test of the significance of the regression line.
modelsummary <- summary(model)
p.value.string = function(p.value){
  p.value <- round(p.value, digits=4)
  if (p.value == 0) {
    return("p < 0.0001")
  } else {
    return(paste0("p = ", format(p.value, scientific = FALSE)))
  }
}
# we specify an interaction between two variables by joining them with a colon (:)
library(tibble)
## Warning: package 'tibble' was built under R version 4.0.3
one_c1 <- coefficients(model)
one_s1 <- summary(model)
one_ci1 <- as_tibble(confint(model, level=0.95))
one_t1 <- as_tibble(one_s1[[4]])
Hypotheses
\(H_0: \ \beta_1 = \beta_2 = \beta_3 = 0\)
\(H_1:\) at least one \(\beta_i \ne 0\)
Test Statistic
\(F_0 = 24.58\).
p-value
\(p < 0.0001\).
Rejection Region
Reject if \(p < \alpha\), where \(\alpha=0.05\).
Conclusion and Interpretation
Reject \(H_0\). There is sufficient evidence to suggest that the regression line is significant; that is, at least one of the slopes is nonzero.
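The p-value for this test can be reproduced from the F-statistic and its degrees of freedom as reported in the regression output. A short sketch, where the variable names are chosen here only for illustration:

```r
# Values taken from the regression output for the full model
f_stat <- 24.58   # F-statistic
df1    <- 3       # numerator degrees of freedom (number of slopes tested)
df2    <- 7586    # denominator (residual) degrees of freedom
alpha  <- 0.05    # significance level

# Upper-tail probability of the F distribution
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)
print(p_value)

# Decision rule: reject H0 when p < alpha
reject_h0 <- p_value < alpha
print(reject_h0)
```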
| Predictor | Estimate of \(\beta\) | 95% CI for \(\beta\) | p-value |
|---|---|---|---|
| Genre | -0.06 | (-0.12, 0) | p = 0.0438 |
| User Score | 0.08 | (0.03, 0.13) | p = 0.0037 |
| Interaction | 0.01 | (0, 0.01) | p = 0.1971 |
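Because the table rounds to two decimals, some interval endpoints appear as 0. The intervals can be recomputed to more decimals from the estimates and standard errors in the regression output; the sketch below assumes the usual t-based interval (estimate plus or minus the critical t value times the standard error), and the object names are illustrative.

```r
# Estimates and standard errors copied from the coefficient table of the full model
est <- c(Genre = -0.061100, User_Score = 0.079706, Interaction = 0.005411)
se  <- c(Genre =  0.030304, User_Score = 0.027448, Interaction = 0.004194)

# Critical t value for a 95% interval with 7586 residual degrees of freedom
t_crit <- qt(0.975, df = 7586)

# Lower and upper limits for each coefficient
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci, 4))
```

With the extra decimals, the Genre interval lies just below zero while the Interaction interval straddles zero, which agrees with the p-values in the table.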
The interaction is not significant (p = 0.1971), so it should be removed from the model.
one_m2 <- lm(Global_Sales ~ Genre + User_Score, data=games)
one_c2 <- coefficients(one_m2)
one_s2 <- summary(one_m2)
one_ci2 <- as_tibble(confint(one_m2, level=0.95))
one_t2 <- as_tibble(one_s2[[4]])
The resulting regression model is
\[ \hat{Y} = 0.08 - 0.02 X_{\mbox{Genre}} + 0.11 X_{\mbox{User Score}} \]
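The new model indicates that, holding User Score constant, each one-unit increase in the Genre code is associated with a decrease of 0.02 million dollars in predicted Global Sales, and, holding Genre constant, each one-point increase in User Score is associated with an increase of 0.11 million dollars in predicted Global Sales. Because Genre is a categorical code, its slope reflects the arbitrary numbering of the genres rather than an ordered effect.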
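As a worked example of how the reduced model produces a prediction, the sketch below plugs values into the reported equation. The helper `predict_sales` is hypothetical and uses the rounded coefficients from the equation, so its results are approximate.

```r
# Coefficients as reported for the reduced model (rounded to two decimals)
b0      <- 0.08
b_genre <- -0.02
b_user  <- 0.11

# Hypothetical helper: predicted global sales (in millions) for a genre code and user score
predict_sales <- function(genre, user_score) {
  b0 + b_genre * genre + b_user * user_score
}

# Example: an Action game (Genre = 7) with a user score of 8
pred <- predict_sales(genre = 7, user_score = 8)
print(pred)  # 0.08 - 0.14 + 0.88 = 0.82
```

Raising the user score by one point while keeping the genre fixed increases the prediction by 0.11, which is the User Score slope.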
## Warning: package 'sjPlot' was built under R version 4.0.3
## Install package "strengejacke" from GitHub (`devtools::install_github("strengejacke/strengejacke")`) to load all sj-packages at once!
## Warning: package 'sjmisc' was built under R version 4.0.3
##
## Attaching package: 'sjmisc'
## The following object is masked from 'package:tibble':
##
## add_case
library(ggplot2)
library(sjPlot)   # plot_model(), theme_sjplot()
library(sjmisc)   # to_factor()
data(efc)         # example data set bundled with the sj-packages (not used below)
theme_set(theme_sjplot())
# make categorical
games$Genre <- to_factor(games$Genre)
# fit model with interaction
fit <- lm(Global_Sales ~ User_Score + Genre + User_Score * Genre, data = games)
plot_model(fit, type = "int")
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set1 is 9
## Returning the palette you asked for with that many colors
## Warning: Removed 33 row(s) containing missing values (geom_path).
Although the default color palette (Set1) supports at most 9 colors, so not all 12 categories of “Genre” are displayed, the graph shows that the lines for the majority of genres do not cross, which is consistent with no significant interaction effect between User Score and Genre with regard to Global Sales.
## [1] 0.009234754
## [1] 0.009148023
The adjusted \(R^2\) value for the original model is \(0.009234754\).
The adjusted \(R^2\) value for the new model is \(0.009148023\).
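The original model indicated that the interaction between Genre and User Score was not significant, and removing it changes the adjusted \(R^2\) only slightly (from 0.0092 to 0.0091), so the simpler model without the interaction is preferred. Accordingly, global video game sales may be predicted by both genre and user scores, although either model explains less than 1% of the variance in global sales.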
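One way to check formally whether dropping the interaction costs anything is a partial F-test between the two nested models. The sketch below is only an illustration of the technique: it uses simulated stand-in data (the data frame `sim` and its values are made up), not the games data analyzed above.

```r
set.seed(1)

# Simulated stand-in data (hypothetical; not the real games data)
n <- 200
sim <- data.frame(
  Genre      = sample(1:12, n, replace = TRUE),
  User_Score = round(runif(n, 0, 10), 1)
)
sim$Global_Sales <- 0.1 + 0.1 * sim$User_Score + rnorm(n)

# Nested models: without and with the interaction term
reduced <- lm(Global_Sales ~ Genre + User_Score, data = sim)
full    <- lm(Global_Sales ~ Genre * User_Score, data = sim)

# Partial F-test: does adding the interaction significantly improve the fit?
comparison <- anova(reduced, full)
print(comparison)
```

When the two models differ by a single coefficient, as here, this F-test is equivalent to the t-test on the interaction term in the full model.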
For Dr. Seals’s reference, I calculated the average Animal Crossing user scores: mean = 7.5 (minimum: Animal Crossing: Amiibo Festival (4.4), maximum: Animal Crossing for GameCube (8.9)).