The purpose of the present research is to continue the investigation regarding whether global video game sales (data acquired for approximately 16720 games) may be predicted by user scores (ratings between 0 and 10 that the users gave each game) and genre (Shooter = 1, Sports = 2, Platform = 3, Role-Play = 4, Racing = 5, Puzzle = 6, Action = 7, Simulation = 8, Strategy = 9, Adventure = 10, Fighting = 11, Misc = 12). This research also seeks to determine if multicollinearity or outliers that may affect the model are present in the data. The previous investigation determined no significant interaction effect between Genre and User Scores on the prediction of Global Sales.
library(tidyverse)
library(pander)
library(car)
library(lindia)

games <- read.csv("C:/Users/Admin/Downloads/games.csv")
attach(games)  # later chunks refer to Genre and User_Score directly

# Additive model: Global Sales predicted by Genre (numeric code) and User Score
one_m2 <- lm(Global_Sales ~ Genre + User_Score, data = games)
one_c2 <- coefficients(one_m2)
one_s2 <- summary(one_m2)
one_ci2 <- as_tibble(confint(one_m2, level = 0.95))
one_t2 <- as_tibble(one_s2[[4]])

# Format a p-value for reporting
p.value.string <- function(p.value) {
  p.value <- round(p.value, digits = 4)
  if (p.value == 0) {
    return("p < 0.0001")
  } else {
    return(paste0("p = ", format(p.value, scientific = FALSE)))
  }
}

# 2 x 2 panel of residual diagnostics for a fitted model
almost_sas <- function(aov.results) {
  par(mfrow = c(2, 2))
  plot(aov.results, which = 1)
  plot(aov.results, which = 2)
  aov_residuals <- residuals(aov.results)
  plot(density(aov_residuals))
  hist(aov_residuals)
}

# Add an observation index to the data
one <- mutate(games, obs = row_number())

The regression line is as follows, where \(Y={\mbox{Global Sales}}\):
\[ \hat{Y} = 0.08 -0.02 X_{\mbox{Genre}} + 0.11X_{\mbox{User Score}} \]
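The new model indicates that, holding User Score constant, each one-unit increase in the Genre code is associated with a decrease of 0.02 million in predicted Global Sales, and, holding Genre constant, each one-point increase in User Score is associated with an increase of 0.11 million in predicted Global Sales. Because the Genre codes are labels rather than an ordered scale, the Genre slope should be interpreted with caution.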
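To make the direction of this interpretation concrete, the fitted equation can be evaluated by hand. The sketch below hard-codes the rounded coefficients from the equation above; the helper name `predict_sales` and the example inputs are illustrative and are not part of the original analysis.

```r
# Rounded coefficients copied from the fitted equation above.
b0 <- 0.08        # intercept
b_genre <- -0.02  # slope for the numeric Genre code
b_score <- 0.11   # slope for User Score

# Hypothetical helper: predicted Global Sales for a genre code and user score.
predict_sales <- function(genre, user_score) {
  b0 + b_genre * genre + b_score * user_score
}

# Two games in the same genre (code 1) whose user scores differ by one point.
low  <- predict_sales(genre = 1, user_score = 7)
high <- predict_sales(genre = 1, user_score = 8)

# The difference in predictions equals the User Score slope.
high - low
```

The predictors are the inputs and Global Sales is the output, so a slope describes how the prediction moves when a predictor changes, not the reverse.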
library(ggplot2)
one_m2 <- lm(Global_Sales ~ Genre + User_Score, data=games)
Genre <- as.numeric(Genre)
User_Score <- as.numeric(User_Score)
cor(Genre, User_Score, use = "complete.obs")
## [1] -0.005386824
The correlation between Genre and User Score is approximately -0.005, indicating essentially no linear association between the two predictors. However, checking for homogeneity of variances is also important. For data that are not normally distributed, the nonparametric Fligner-Killeen test is appropriate.
Genre.num <- as.numeric(Genre)
User_Score.num <- as.numeric(User_Score)
fligner.test(Genre.num ~ User_Score.num)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: Genre.num by User_Score.num
## Fligner-Killeen:med chi-squared = 85.899, df = 94, p-value = 0.7121
Hypotheses
\(H_0: \ \sigma^2_1 = \sigma^2_2 = \cdots = \sigma^2_k\), where \(\sigma^2_j\) is the variance of Genre within the \(j\)th User Score group
\(H_1:\) at least one \(\sigma^2_j\) differs from the others
Test Statistic
\({\chi}^2 = 85.9\).
p-value
\(0.7121\).
Rejection Region
Reject if \(p < \alpha\), where \(\alpha=0.05\).
Conclusion and Interpretation
Fail to reject \(H_0\). There is insufficient evidence to suggest that the variances are unequal. Accordingly, the data are consistent with the assumption of equal variances.
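As a reference for how the test is called and read, the following minimal sketch runs `fligner.test()` on simulated data. The variable names, group labels, and sample sizes are made up for illustration and do not come from the games data set.

```r
set.seed(42)  # simulated data, not the games data set

# Three groups with the same spread but different means.
g <- factor(rep(c("a", "b", "c"), each = 40))
x <- rnorm(120, mean = rep(c(0, 1, 2), each = 40), sd = 1)

# Fligner-Killeen test: H0 is that the variance of x is the same in every group.
res <- fligner.test(x ~ g)

res$statistic  # chi-squared statistic
res$parameter  # degrees of freedom = number of groups - 1
res$p.value    # compare with alpha = 0.05
```

The degrees of freedom equal the number of groups minus one, so the df = 94 in the output above corresponds to 95 distinct User Score values acting as groups.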
To assess normality, QQ plots are used to indicate whether the data for Genre and User Score follow a normal distribution.
## Warning: Removed 2 rows containing non-finite values (stat_qq).
## Warning: Removed 2 rows containing non-finite values (stat_qq_line).
## Warning: Removed 9129 rows containing non-finite values (stat_qq).
## Warning: Removed 9129 rows containing non-finite values (stat_qq_line).
The QQ plots illustrate that both Genre and User Score appear to violate the normality assumption.
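The plotting code is not shown in this report. As a rough sketch of the same kind of check, base R's `qqnorm()` and `qqline()` can be used as below. The data are simulated and deliberately skewed, so this illustrates the technique rather than reproducing the plots for Genre and User Score.

```r
set.seed(1)  # simulated data, not the games data set

# A right-skewed sample, standing in for a variable that is not normal.
skewed <- rexp(200, rate = 1)

# qqnorm() can return the plotted coordinates without drawing anything.
qq <- qqnorm(skewed, plot.it = FALSE)
head(data.frame(theoretical = qq$x, sample = qq$y))

# Draw the plot with a reference line; points that bend away from the
# line suggest a departure from normality.
qqnorm(skewed, main = "Normal QQ plot (simulated skewed sample)")
qqline(skewed)
```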
| Genre | User_Score |
|---|---|
| 1 | 1 |
The VIFs for both Genre and User Score are 1, indicating that multicollinearity between Genre and User Score is unlikely.
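The VIF of 1 agrees with the near-zero correlation reported earlier. With exactly two predictors, the VIF of each one is \(1/(1-r^2)\), where \(r\) is the correlation between them. The short check below plugs in the reported correlation, assuming the correlation and the model were computed on essentially the same complete cases.

```r
# Correlation between Genre and User Score reported earlier in this report.
r <- -0.005386824

# With two predictors, R^2 from regressing one on the other equals r^2,
# so both predictors share the same VIF.
vif_two_predictors <- 1 / (1 - r^2)

vif_two_predictors         # very slightly above 1
round(vif_two_predictors)  # rounds to 1, matching the table above
```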
Since the number of observations is so high, the Cook’s D plot is difficult to read. However, it appears that Cook’s D is negligible for most observations beyond the first 1000, so a new plot is generated from the first 999 observations for better visibility, although the values may change slightly when later observations are omitted.
newgames <- games[1:999, ]  # keep only the first 999 observations
one_m3 <- lm(Global_Sales ~ Genre + User_Score, data = newgames)
gg_cooksd(one_m3) + theme_bw()
It appears that observations 1, 5, 9, 3, 4, and 23 are above the threshold and may be considered influential points. Next, studentized residuals are examined to identify outliers.
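The influential points can also be listed numerically instead of being read off the plot. The sketch below uses base R's `cooks.distance()` on simulated data with one planted influential point. The cutoff of 4/n is a common rule of thumb and is an assumption here; the report does not state which threshold the plot used.

```r
set.seed(7)  # simulated data, not the games data set

n <- 60
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)

# Plant one influential point: extreme x and a y far from the trend.
x[1] <- 6
y[1] <- -10

fit <- lm(y ~ x)
d <- cooks.distance(fit)

# Assumed rule of thumb: flag observations with Cook's D above 4 / n.
threshold <- 4 / n
flagged <- which(d > threshold)
flagged
```

Applied to a model such as `one_m3`, the same two lines (`cooks.distance()` followed by `which()`) would return the row numbers directly.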
library(pander)
gamesna <- na.omit(newgames)
one_m4 <- lm(Global_Sales ~ Genre + User_Score, data = gamesna)
# rstandard() returns internally studentized (standardized) residuals
pander(rstandard(one_m4), style = 'rmarkdown')
Because of the limitations of my processing system (and because the data set “games” has more than 16000 observations), studentized residuals were calculated only for the first 999 entries, which include the influential points flagged in the Cook’s D plot. Within this subset, observations 1, 3, 4, 7, 8, 9, 12, 14, 15, 16, 17, 18, and 20 appear to be extreme outliers, a result that is consistent with the Cook’s D plot. One reason these outliers have nearly sequential indices is that the data set is sorted according to User Scores, indicating that extremely high User Score values may produce extreme outliers.
According to both the Cook’s D plot and the studentized residuals, observations 1, 3, 4, 7, 8, 9, 12, 14, 15, 16, 17, 18, and 20 should be removed from the model. Observation 5 was a prominent peak on the Cook’s D plot but was not flagged by the studentized residuals. Furthermore, observation 1 stood out on both the Cook’s D plot and the studentized residuals (value of 16.08).
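The residual screening can be sketched in the same way. The example below uses simulated data with one planted outlier and compares `rstandard()` with `rstudent()`. The cutoff of 3 in absolute value is an assumed convention for an extreme outlier; the report does not state which cutoff was applied.

```r
set.seed(11)  # simulated data, not the games data set

n <- 80
x <- runif(n, 0, 10)
y <- 3 + 0.5 * x + rnorm(n, sd = 1)

# Plant one outlier in the response.
y[5] <- y[5] + 15

fit <- lm(y ~ x)

r_int <- rstandard(fit)  # internally studentized (standardized) residuals
r_ext <- rstudent(fit)   # externally studentized residuals

# Assumed cutoff: |residual| > 3 marks an extreme outlier.
outliers <- which(abs(r_int) > 3)
outliers
```

For a clear outlier the externally studentized value is larger in magnitude than the internally studentized one, because the outlier is left out when the residual variance is estimated.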
The final model, after removing outliers, follows:
# Row indices of the outliers identified above
outliers <- c(1, 3, 4, 7, 8, 9, 12, 14, 15, 16, 17, 18, 20)
games_clean <- games[-outliers, ]
m5 <- lm(Global_Sales ~ Genre + User_Score, data = games_clean)
summary(m5)
##
## Call:
## lm(formula = Global_Sales ~ Genre + User_Score, data = games_clean)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.9697 -0.5740 -0.3537  0.0379 15.5195
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.146614   0.074716   1.962   0.0498 *
## Genre       -0.020096   0.004382  -4.586 4.58e-06 ***
## User_Score   0.091923   0.009681   9.495  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.265 on 7574 degrees of freedom
## (9129 observations deleted due to missingness)
## Multiple R-squared: 0.01452, Adjusted R-squared: 0.01426
## F-statistic: 55.81 on 2 and 7574 DF, p-value: < 2.2e-16
The resulting regression model, such that \(Y={\mbox{Global Sales}}\) is:
\[ \hat{Y} = 0.15 -0.02 X_{\mbox{Genre}} + 0.09X_{\mbox{User Score}} \]
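The final model indicates that, holding User Score constant, each one-unit increase in the Genre code is associated with a decrease of 0.02 million in predicted Global Sales, and, holding Genre constant, each one-point increase in User Score is associated with an increase of 0.09 million in predicted Global Sales.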
The model changed slightly after the outliers were removed. The slope of Genre remained nearly the same (-0.02), while the slope of User Score decreased from 0.11 to 0.09 and the intercept rose from 0.08 to 0.15. This change is likely the result of removing influential points with extraordinarily high user scores and global sales. Further projects could consider additional patterns in video game data regarding user scores.