Introduction

load("~/Documents/Statistics Model/vgsales.RData")

The data I am using is about video game sales. It contains 500 video games with the variables Name, Platform, Genre, Year, Publisher and Sales, where Sales is measured in millions. While exploring this dataset I became curious about how the genre of a video game affects its sales depending on where it is sold. So my question is: does the genre of a video game affect its sales in NA, EU, JP and other regions?

This dataset came from Kaggle, where it was compiled from “VGChartz” (https://www.vgchartz.com/). I had to make some changes to the original dataset. It stored the sales locations as separate columns (NA, EU, JP and Other), so I converted them into rows under a single column, “Region”, and I removed Global sales since it was unnecessary. As a result the dataset has four times as many rows, since each game now appears once per region.
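The reshaping step described above can be sketched as follows. This is only an illustration on a made-up two-row table, not the code actually used for this report: the column names (NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales) are assumed to match the Kaggle file, the values are invented, and tidyr is assumed to be installed.

```r
library(tidyr)

# Toy stand-in for the wide table (invented values, assumed column names)
wide <- data.frame(
  Name         = c("Game A", "Game B"),
  Genre        = c("Sports", "RPG"),
  NA_Sales     = c(1.2, 0.4),
  EU_Sales     = c(0.8, 0.3),
  JP_Sales     = c(0.1, 0.9),
  Other_Sales  = c(0.2, 0.1),
  Global_Sales = c(2.3, 1.7)
)

# Drop the global total, then stack the four regional columns into rows
no_global <- wide[, names(wide) != "Global_Sales"]
long <- pivot_longer(
  no_global,
  cols          = ends_with("_Sales"),
  names_to      = "Region",
  names_pattern = "(.*)_Sales",  # keep only the region prefix
  values_to     = "Sales"
)

print(long)
```

Each game now occupies four rows, one per region, which is why the row count is multiplied by four.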

Here is what the dataset looks like after these changes.

head(vgsales2, n=8)

Methodology

I will use a generalized linear model (GLM) with a Gamma distribution to model Sales.

The assumptions of the model are:

The observations should be independent and randomly sampled

The response variable doesn’t need to be normally distributed, but its distribution must belong to the exponential family

The response variable itself doesn’t need to have a linear relationship with the independent variables; only the link-transformed mean does

The three components of the GLM with Gamma are:

\(Y \sim Gamma(\alpha, \beta)\)

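Linear predictor is \(\eta = \beta_0 + \beta_1 Genre + \beta_2 Region + \beta_3 (Genre \times Region)\), where the \(\beta_j\) are regression coefficients and are unrelated to the Gamma parameter \(\beta\)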

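Link function is \(\eta = 1/\mu\), the default inverse link of R's Gamma family, which is what the fitted model uses because no link is specified in the call

Inverse function \(\mu = 1/\eta\)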

I included the Genre-by-Region interaction term in the linear predictor because it lets the effect of each Genre differ from one Region to another.

I use Gamma rather than another exponential-family distribution because it is suited to a continuous response that is strictly positive and right-skewed. The density curve of Sales below shows this shape.
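A short derivation may help show why the Gamma family fits skewed, positive data. It assumes the shape-rate parameterization of \(Gamma(\alpha, \beta)\), which the text above does not specify, so treat the symbols as an assumption:

```latex
\[
E[Y] = \mu = \frac{\alpha}{\beta},
\qquad
\mathrm{Var}(Y) = \frac{\alpha}{\beta^{2}} = \frac{\mu^{2}}{\alpha}
\]
```

Under this parameterization the variance grows with the square of the mean, so games with higher expected sales are also allowed to vary more, which is the pattern a long right tail produces.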

ggplot(data = vgsales2, aes(x = Sales)) +
  geom_density(adjust=10)

Results and conclusions

To test whether the predictors matter, I use the aov function.

sales.aov <- aov(glm(Sales~Genre + Region + Genre:Region, data=vgsales2, family='Gamma'))
summary(sales.aov)
##                Df Sum Sq Mean Sq F value   Pr(>F)    
## Genre          11    109     9.9   2.156   0.0144 *  
## Region          3   1762   587.3 127.956  < 2e-16 ***
## Genre:Region   33    372    11.3   2.455 8.95e-06 ***
## Residuals    1859   8532     4.6                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA results show that all three terms are statistically significant at the 5% level. In particular, the interaction is significant (p = 8.95e-06), which supports keeping it in the model.
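One caveat about the table above: as far as I can tell, wrapping the glm call in aov() causes the model to be refitted by ordinary least squares, so the F tests come from a classical ANOVA, not from the Gamma model. A possible alternative is an analysis of deviance with anova() on the fitted GLM. The sketch below uses simulated toy data (not the video game data) so that it runs on its own; the data frame `toy` and its values are made up for illustration.

```r
set.seed(1)

# Simulated stand-in data: 3 genres x 3 regions, 30 observations per cell
toy <- expand.grid(
  Genre  = c("Action", "Sports", "Shooter"),
  Region = c("EU", "JP", "Other"),
  rep    = 1:30
)
toy$Sales <- rgamma(nrow(toy), shape = 2, rate = 2)  # positive, right-skewed

# Gamma GLM with the family's default inverse link, written out explicitly
fit <- glm(Sales ~ Genre + Region + Genre:Region,
           data = toy, family = Gamma(link = "inverse"))

# Sequential analysis of deviance with F tests (suitable when the
# dispersion is estimated, as it is for the Gamma family)
dev_table <- anova(fit, test = "F")
print(dev_table)

# On the real data the equivalent call would be:
#   anova(sales_model, test = "F")
```

The resulting table has one row per term, added in order, and tests each term's reduction in deviance instead of its sum of squares.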

To see which Genre and Region combinations differ the most, I fit the Gamma GLM, which produces a long list of coefficients.

sales_model <- glm(Sales~Genre + Region + Genre:Region, data=vgsales2, family='Gamma')
summary(sales_model)
## 
## Call:
## glm(formula = Sales ~ Genre + Region + Genre:Region, family = "Gamma", 
##     data = vgsales2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9961  -0.7615  -0.3106   0.1645   4.8308  
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  0.572800   0.063592   9.007  < 2e-16 ***
## GenreAdventure               0.135164   0.291761   0.463 0.643225    
## GenreFighting                0.284769   0.213130   1.336 0.181671    
## GenreMisc                   -0.084598   0.106714  -0.793 0.428021    
## GenrePlatform               -0.010892   0.105890  -0.103 0.918086    
## GenrePuzzle                  0.037633   0.218823   0.172 0.863474    
## GenreRacing                 -0.157558   0.102990  -1.530 0.126227    
## GenreRPG                    -0.076867   0.102029  -0.753 0.451314    
## GenreShooter                -0.050225   0.093574  -0.537 0.591511    
## GenreSimulation             -0.126057   0.132808  -0.949 0.342659    
## GenreSports                 -0.203987   0.084452  -2.415 0.015813 *  
## GenreStrategy                0.067405   0.331856   0.203 0.839068    
## RegionJP                     1.793612   0.286304   6.265 4.63e-10 ***
## RegionNA                    -0.176761   0.077194  -2.290 0.022142 *  
## RegionOther                  0.884445   0.173120   5.109 3.57e-07 ***
## GenreAdventure:RegionJP     -1.208473   0.723679  -1.670 0.095107 .  
## GenreFighting:RegionJP      -1.635953   0.428930  -3.814 0.000141 ***
## GenreMisc:RegionJP          -1.210863   0.368373  -3.287 0.001031 ** 
## GenrePlatform:RegionJP      -1.646397   0.318831  -5.164 2.68e-07 ***
## GenrePuzzle:RegionJP        -1.861157   0.404910  -4.596 4.59e-06 ***
## GenreRacing:RegionJP        -1.315199   0.350691  -3.750 0.000182 ***
## GenreRPG:RegionJP           -1.870838   0.304330  -6.147 9.61e-10 ***
## GenreShooter:RegionJP        3.194348   0.815227   3.918 9.24e-05 ***
## GenreSimulation:RegionJP    -1.514214   0.379875  -3.986 6.98e-05 ***
## GenreSports:RegionJP        -0.344866   0.407091  -0.847 0.397022    
## GenreStrategy:RegionJP      -1.571748   0.817885  -1.922 0.054794 .  
## GenreAdventure:RegionNA     -0.123249   0.337583  -0.365 0.715084    
## GenreFighting:RegionNA      -0.245868   0.239875  -1.025 0.305504    
## GenreMisc:RegionNA           0.008646   0.128596   0.067 0.946401    
## GenrePlatform:RegionNA      -0.142935   0.120248  -1.189 0.234722    
## GenrePuzzle:RegionNA        -0.201703   0.236916  -0.851 0.394675    
## GenreRacing:RegionNA         0.066945   0.126778   0.528 0.597531    
## GenreRPG:RegionNA            0.056203   0.125061   0.449 0.653195    
## GenreShooter:RegionNA       -0.076454   0.109117  -0.701 0.483600    
## GenreSimulation:RegionNA     0.190623   0.184401   1.034 0.301390    
## GenreSports:RegionNA         0.174294   0.109503   1.592 0.111628    
## GenreStrategy:RegionNA      -0.036823   0.398939  -0.092 0.926467    
## GenreAdventure:RegionOther   0.876726   1.047519   0.837 0.402725    
## GenreFighting:RegionOther    1.307541   0.756850   1.728 0.084224 .  
## GenreMisc:RegionOther        0.341639   0.357585   0.955 0.339496    
## GenrePlatform:RegionOther    0.958710   0.410449   2.336 0.019610 *  
## GenrePuzzle:RegionOther      1.454184   1.047381   1.388 0.165182    
## GenreRacing:RegionOther     -0.149873   0.294713  -0.509 0.611137    
## GenreRPG:RegionOther         0.658806   0.376636   1.749 0.080424 .  
## GenreShooter:RegionOther     0.119231   0.273638   0.436 0.663086    
## GenreSimulation:RegionOther  0.880686   0.613841   1.435 0.151537    
## GenreSports:RegionOther     -0.095714   0.249898  -0.383 0.701755    
## GenreStrategy:RegionOther    2.206694   1.933836   1.141 0.253977    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 1.294149)
## 
##     Null deviance: 2841.8  on 1906  degrees of freedom
## Residual deviance: 1790.3  on 1859  degrees of freedom
## AIC: 4784.4
## 
## Number of Fisher Scoring iterations: 8

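As an example of interpreting a coefficient, take GenreShooter:RegionJP (3.194). Because the model was fitted with family='Gamma' and no link argument, R used the Gamma family's default inverse link, so the coefficients act on \(1/\mu\) and not on Sales directly. A positive coefficient therefore lowers the expected sales. The baseline is the Genre and Region missing from the coefficient table (EU for Region, and presumably Action for Genre). Adding the intercept, GenreShooter, RegionJP and this interaction gives \(1/\mu \approx 5.51\), the largest value among the JP combinations, so Shooter is the genre with the lowest fitted sales in JP, not the highest.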
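To make the interpretation concrete, the arithmetic can be done directly from the coefficients printed in the summary above. This is a minimal sketch that assumes the model was fitted with R's default inverse link for the Gamma family and that the omitted baseline levels are Action (assumed) and EU; the variable names are my own.

```r
# Coefficients copied from the summary(sales_model) output above
b0         <- 0.572800   # (Intercept): baseline genre in the EU
shooter    <- -0.050225  # GenreShooter
jp         <- 1.793612   # RegionJP
shooter_jp <- 3.194348   # GenreShooter:RegionJP

# Linear predictor (eta = 1 / mu under the inverse link) for four cells
eta <- c(
  baseline_EU = b0,
  shooter_EU  = b0 + shooter,
  baseline_JP = b0 + jp,
  shooter_JP  = b0 + shooter + jp + shooter_jp
)

# Invert the link to get fitted mean sales in millions
mu <- 1 / eta
print(round(mu, 2))
```

The fitted means come out at about 1.75, 1.91, 0.42 and 0.18 million respectively, so the large positive interaction corresponds to much lower expected sales for shooters in JP than for shooters in the EU.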

Discussion and critique

I learned that there is evidence that the Genre of a video game can affect its Sales depending on the Region. One weakness of this analysis is that there wasn’t any data on other regions or continents: having Other as a category does not help in representing places like Africa or the rest of Asia. Another weakness is that the data wasn’t randomly collected, so there could be some bias. One strength of this analysis is the interaction term: the ANOVA test shows that it is an important predictor, and it lets the model show how the effect of each Genre differs by Region.