Project 4: Further Explorations of Video Game Global Sales as Predicted by Genre and User Scores

Data Description and Visualization

The purpose of the present research is to continue investigating whether global video game sales (data acquired for approximately 16,720 games) can be predicted by user scores (ratings between 0 and 10 that users gave each game) and genre (Shooter = 1, Sports = 2, Platform = 3, Role-Play = 4, Racing = 5, Puzzle = 6, Action = 7, Simulation = 8, Strategy = 9, Adventure = 10, Fighting = 11, Misc = 12). This research also seeks to determine whether the data contain multicollinearity or outliers that may affect the model. The previous investigation found no significant interaction effect between Genre and User Scores on the prediction of Global Sales.

Multiple Linear Regression

Regression Line

library(tibble)
library(tidyverse)
library(pander)
library(car)
library(lindia)

# Load the data and fit the additive model (no interaction term)
games <- read.csv("C:/Users/Admin/Downloads/games.csv")
attach(games)
one_m2 <- lm(Global_Sales ~ Genre + User_Score, data=games)
one_c2 <- coefficients(one_m2)
one_s2 <- summary(one_m2)
one_ci2 <- as_tibble(confint(one_m2, level=0.95))
one_t2 <- as_tibble(one_s2[[4]])

# Format a p-value for reporting, e.g. "p = 0.0312" or "p < 0.0001"
p.value.string <- function(p.value){
  p.value <- round(p.value, digits=4)
  if (p.value == 0) {
    return("p < 0.0001")
  } else {
    return(paste0("p = ", format(p.value, scientific = FALSE)))
  }
}

# Draw four residual diagnostics for a fitted model in a 2x2 grid
almost_sas <- function(aov.results){
  par(mfrow=c(2,2))
  plot(aov.results, which=1)
  plot(aov.results, which=2)
  aov_residuals <- residuals(aov.results)
  plot(density(aov_residuals))
  hist(aov_residuals)
}

# Add an observation index for later reference
one <- mutate(games, obs = row_number())

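The fitted regression line, where $Y = \mbox{Global Sales}$, is:

\[ \hat{Y} = 0.08 - 0.02 X_{\mbox{Genre}} + 0.11 X_{\mbox{User Score}} \]

The model indicates that, holding User Score constant, each one-unit increase in the Genre code is associated with a decrease of about 0.02 million in predicted Global Sales, and that, holding Genre constant, each one-point increase in User Score is associated with an increase of about 0.11 million in predicted Global Sales. Because the Genre codes are arbitrary labels rather than an ordered scale, the Genre slope should be interpreted with caution.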
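One way to read a fitted line of this kind is to plug values into it and compare predictions: with Genre held fixed, raising User Score by one point changes the prediction by exactly the User Score slope. The sketch below uses the rounded coefficients from the equation above. The helper `predict_sales` and the example inputs (Genre code 7, User Scores 8 and 9) are assumptions chosen for illustration and are not part of the original analysis.

```r
# Rounded coefficients taken from the fitted line reported above
b0 <- 0.08
b_genre <- -0.02
b_score <- 0.11

# Hypothetical helper: predicted Global Sales for a genre code and a user score
predict_sales <- function(genre, user_score) {
  b0 + b_genre * genre + b_score * user_score
}

# Same genre (Action = 7), user score raised from 8 to 9
low  <- predict_sales(7, 8)
high <- predict_sales(7, 9)

# The difference between the two predictions equals the User Score slope
score_effect <- high - low
print(c(low = low, high = high, difference = score_effect))
```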

Variance Assumption

library(pander)
library(sjPlot)
library(sjmisc)
library(ggplot2)
one_m2 <- lm(Global_Sales ~ Genre + User_Score, data=games)
Genre <- as.numeric(Genre)
User_Score <- as.numeric(User_Score)
cor(Genre, User_Score, use = "complete.obs")

## [1] -0.005386824

The correlation between Genre and User Score is -0.005386824, indicating essentially no linear relationship between the two predictors. Checking for homogeneity of variances is also important. Because the data are not normally distributed, the nonparametric Fligner-Killeen test is appropriate.

Fligner-Killeen Test for Homogeneity of Variances

games <- read.csv("C:/Users/Admin/Downloads/games.csv")

# Convert both variables to numeric, then test whether the variance of Genre
# is the same across the groups defined by each User Score value
Genre.num <- as.numeric(games$Genre)
User_Score.num <- as.numeric(games$User_Score)
fligner.test(Genre.num ~ User_Score.num)

## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  Genre.num by User_Score.num
## Fligner-Killeen:med chi-squared = 85.899, df = 94, p-value = 0.7121

Hypotheses

$H_0: \ \sigma^2_1 = \sigma^2_2 = \cdots = \sigma^2_{95}$ (the variance of Genre is the same in each of the 95 User Score groups)
$H_1:$ at least one $\sigma^2_i$ differs from the others

Test Statistic

${\chi}^2 = 85.9$.

p-value

$0.7121$.

Rejection Region

Reject if $p < \alpha$, where $\alpha=0.05$.

Conclusion and Interpretation

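Fail to reject $H_0$. There is insufficient evidence to suggest that the variances are unequal, so the data are consistent with the assumption of equal variances. Note that this test compares the spread of Genre across User Score groups; the constant-variance assumption of the regression itself concerns the residuals of the fitted model.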
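The same test can be pointed at the residuals of a fitted model, which is where the constant-variance assumption of a regression applies. The following is a minimal sketch on simulated data, not a rerun of the analysis: the data frame `sim`, its columns, and the choice of splitting the fitted values into four bins are all assumptions made for illustration.

```r
set.seed(1)

# Simulated stand-in for the games data (hypothetical values)
sim <- data.frame(
  genre = sample(1:12, 300, replace = TRUE),
  user_score = round(runif(300, 0, 10), 1)
)
sim$sales <- 0.1 - 0.02 * sim$genre + 0.1 * sim$user_score + rnorm(300)

fit <- lm(sales ~ genre + user_score, data = sim)

# Bin observations by fitted value (quartiles), then compare the spread of
# the residuals across the four bins
bins <- cut(fitted(fit),
            breaks = quantile(fitted(fit), 0:4 / 4),
            include.lowest = TRUE)
res_test <- fligner.test(residuals(fit), bins)
print(res_test)
```

A large p-value here would be consistent with residual spread that does not change with the fitted value; a residuals-versus-fitted plot gives the same information visually.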

Normality Assumption

To assess normality, QQ plots are used to show whether the data for Genre and User Score follow a normal distribution.

library("ggpubr")

ggqqplot(Genre.num)

## Warning: Removed 2 rows containing non-finite values (stat_qq).

## Warning: Removed 2 rows containing non-finite values (stat_qq_line).

ggqqplot(User_Score.num)

## Warning: Removed 9129 rows containing non-finite values (stat_qq).

## Warning: Removed 9129 rows containing non-finite values (stat_qq_line).

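The QQ plots indicate that neither Genre nor User Score appears to follow a normal distribution. Note that the normality assumption in linear regression concerns the model residuals rather than the predictors themselves.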
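A QQ check on the model residuals, which is where the regression normality assumption applies, could look like the following minimal sketch. It uses simulated data; the data frame `sim` and the summary `qq_cor` (the correlation between theoretical and sample quantiles) are assumptions introduced for illustration only.

```r
set.seed(2)

# Simulated data with normally distributed errors (hypothetical values)
sim <- data.frame(x = runif(200, 0, 10))
sim$y <- 1 + 0.5 * sim$x + rnorm(200)

fit <- lm(y ~ x, data = sim)
res <- residuals(fit)

# Compute the QQ coordinates without drawing the plot
qq <- qqnorm(res, plot.it = FALSE)

# A correlation near 1 means the points lie close to a straight line
qq_cor <- cor(qq$x, qq$y)
print(qq_cor)

# To draw the plot itself: qqnorm(res); qqline(res)
```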

Check for Collinearity

VIF

pander(vif(one_m2), style='rmarkdown')

Genre	User_Score
1	1

The VIFs for both Genre and User Score are 1, indicating that multicollinearity between Genre and User Score is unlikely.
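With only two predictors, each VIF can be recovered from their pairwise correlation as $1/(1-r^2)$. The short sketch below applies this to the correlation reported earlier. It assumes the correlation and the model were computed on the same complete cases, so the agreement with the `vif()` output is approximate.

```r
# Correlation between Genre and User Score reported earlier in the report
r <- -0.005386824

# With two predictors, the VIF of each is 1 / (1 - r^2)
vif_two <- 1 / (1 - r^2)
print(vif_two)
```

The result differs from 1 only in the fifth decimal place, which agrees with the rounded table above.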

Leverage and Distance (Cook’s Distance)

gg_cooksd(one_m2) + theme_bw()

Because the data set contains so many observations, the plot is difficult to read. Cook’s distances beyond roughly the first 1000 observations appear negligible, so a second plot is generated from the first 999 observations for better visibility. Note that the values may change slightly when the later observations are omitted.

# Keep only the first 999 rows (equivalent to dropping rows 1000 onward)
newgames <- games[1:999, ]
one_m3 <- lm(Global_Sales ~ Genre + User_Score, data=newgames)
gg_cooksd(one_m3) + theme_bw()

Observations 1, 3, 4, 5, 9, and 23 appear to exceed the threshold and may be considered influential points. Next, studentized residuals are examined in order to identify outliers.
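Influential points can also be listed numerically instead of being read off a crowded plot, which avoids the need to subset the data. The following is a minimal sketch on simulated data: the data frame `sim`, the planted extreme value, and the 4/n cutoff (a common rule of thumb, not necessarily the threshold drawn in the plots above) are all assumptions for illustration.

```r
set.seed(3)

# Simulated data with one planted extreme value (hypothetical values)
sim <- data.frame(x = runif(500, 0, 10))
sim$y <- 0.1 * sim$x + rnorm(500)
sim$y[1] <- 40

fit <- lm(y ~ x, data = sim)

# Cook's distance for every observation in the fitted model
cd <- cooks.distance(fit)

# Rule-of-thumb cutoff (assumption): 4 divided by the number of observations
threshold <- 4 / length(cd)
flagged <- which(cd > threshold)

# Show the largest flagged distances, labelled by row name
print(head(sort(cd[flagged], decreasing = TRUE)))
```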

Check for Outliers (Studentized Residuals)

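library(pander)
# Drop rows with missing values from the 999-row subset, then refit the model
gamesna <- na.omit(newgames)
one_m4 <- lm(Global_Sales ~ Genre + User_Score, data = gamesna)
# rstandard() returns standardized (internally studentized) residuals
pander(rstandard(one_m4), style='rmarkdown')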

Because of the limitations of my processing system (the “games” data set contains more than 16,000 observations), studentized residuals were calculated only for the first 999 entries in the data set, which include the influential points identified in the Cook’s D plot. Within this subset, observations 1, 3, 4, 7, 8, 9, 12, 14, 15, 16, 17, 18, and 20 appear to be extreme outliers, a result that is broadly consistent with the Cook’s D plot. One reason these outliers appear nearly sequential is that the data set is sorted according to User Scores, which suggests that extremely high User Score values may produce extreme outliers.
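Printing only the residuals that exceed a cutoff, instead of the full vector, keeps the output small even for a large model. The sketch below shows the idea on simulated data. The data frame `sim`, the two planted extreme values, and the cutoff of 3 in absolute value (a common convention) are assumptions made for illustration.

```r
set.seed(4)

# Simulated data with two planted extreme values (hypothetical values)
sim <- data.frame(x = runif(1000, 0, 10))
sim$y <- 0.1 * sim$x + rnorm(1000)
sim$y[c(1, 3)] <- c(30, 25)

fit <- lm(y ~ x, data = sim)

# Standardized residuals for every observation
std_res <- rstandard(fit)

# Keep only the observations beyond the cutoff (assumption: |residual| > 3)
outliers <- which(abs(std_res) > 3)
print(std_res[outliers])
```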

Sensitivity Analysis

According to both the Cook’s D plot and the studentized residuals, observations 1, 3, 4, 7, 8, 9, 12, 14, 15, 16, 17, 18, and 20 should be removed from the model. Observation 5 was a prominent peak on the Cook’s D plot but was not flagged by the studentized residuals. Observation 1 stood out on both, with a very high peak on the Cook’s D plot and a studentized residual of 16.08.
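A sensitivity analysis of this kind amounts to fitting the model with and without the flagged rows and comparing the coefficients side by side. The following is a minimal sketch on simulated data: the data frame `sim`, the planted extreme values, and the rows in `drop_rows` are hypothetical and do not reproduce the games data.

```r
set.seed(5)

# Simulated data with three planted extreme values (hypothetical values)
sim <- data.frame(x = runif(300, 0, 10))
sim$y <- 0.1 * sim$x + rnorm(300)
sim$y[1:3] <- c(35, 30, 28)

# Rows flagged as outliers (assumption for this example)
drop_rows <- 1:3

# Fit the model on all rows and again with the flagged rows removed
fit_all  <- lm(y ~ x, data = sim)
fit_trim <- lm(y ~ x, data = sim[-drop_rows, ])

# Side-by-side coefficients show how much the flagged rows move the fit
comparison <- cbind(all = coef(fit_all), trimmed = coef(fit_trim))
print(round(comparison, 3))
```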

Final Model

The final model, after removing outliers, follows:

# Rows identified as outliers in the sensitivity analysis
outlier_rows <- c(1, 3, 4, 7, 8, 9, 12, 14, 15, 16, 17, 18, 20)
games_trim <- games[-outlier_rows, ]
m5 <- lm(Global_Sales ~ Genre + User_Score, data = games_trim)
summary(m5)

## 
## Call:
## lm(formula = Global_Sales ~ Genre + User_Score, data = games_trim)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9697 -0.5740 -0.3537  0.0379 15.5195 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.146614   0.074716   1.962   0.0498 *  
## Genre       -0.020096   0.004382  -4.586 4.58e-06 ***
## User_Score   0.091923   0.009681   9.495  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.265 on 7574 degrees of freedom
##   (9129 observations deleted due to missingness)
## Multiple R-squared:  0.01452,    Adjusted R-squared:  0.01426 
## F-statistic: 55.81 on 2 and 7574 DF,  p-value: < 2.2e-16

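The resulting regression model, where $Y = \mbox{Global Sales}$, is:

\[ \hat{Y} = 0.15 - 0.02 X_{\mbox{Genre}} + 0.09 X_{\mbox{User Score}} \]

The final model indicates that, holding User Score constant, each one-unit increase in the Genre code is associated with a decrease of about 0.02 million in predicted Global Sales, and that, holding Genre constant, each one-point increase in User Score is associated with an increase of about 0.09 million in predicted Global Sales.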

Conclusion

After removal of the outliers, the model changed only slightly compared to the original. The slope of Genre remained nearly the same (about -0.02), while the slope of User Score decreased from 0.11 to 0.09. This change is likely the result of removing influential points with extraordinarily high user scores and global sales. Although both predictors are statistically significant, the final model explains only about 1.5% of the variance in Global Sales ($R^2 = 0.0145$). Further projects could consider additional patterns in video game data regarding user scores.