Week 8 Data Dive: Regression Modeling

The goal of this project is to demonstrate the use of ANOVA testing and linear regression modeling on game sales data to learn more about what impacts the global sales of a game.

library(tidyverse)  # loads readr, dplyr, and ggplot2, among others

# Load the data and keep an untouched copy of the original
game_sales <- read_csv("video_game_sales.csv")
game_sales_raw <- game_sales

Most Valuable Data

In this game sales data, the global sales column is the most valuable. Though understanding how well a game sells in individual markets around the world can be useful, a game's global sales are the clearest indicator of its overall success.

ANOVA Test

One could expect the genre of a video game to influence its global sales, since some genres seem to have more widespread popularity and success than others, like shooters and action games, but this is just anecdotal. Since there are 12 genres included in the data set, we need to consolidate them for more targeted analysis. A ‘miscellaneous’ genre is already one of the genre options, so for simplicity’s sake, the two least common genres (Strategy and Puzzle) have also been merged into Miscellaneous.

game_sales <- game_sales |>
  mutate(genre = ifelse(genre %in% c("Strategy","Puzzle"), "Misc", genre))
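
One way to confirm which genres are least common before merging them is to tabulate the genre column. The following is a minimal sketch on a hypothetical toy vector (`toy_genre` and its values are made up for illustration and are not taken from the data set):

```r
# Hypothetical toy genres standing in for game_sales$genre
toy_genre <- c("Action", "Action", "Action", "Sports", "Sports",
               "Misc", "Misc", "Strategy", "Puzzle")

# Count games per genre, least common first
genre_counts <- sort(table(toy_genre))

# Fold the chosen rare genres into "Misc", mirroring the mutate() call above
merged_genre <- ifelse(toy_genre %in% c("Strategy", "Puzzle"), "Misc", toy_genre)
merged_counts <- table(merged_genre)
```

On the real data, replacing `toy_genre` with `game_sales_raw$genre` would show the counts before consolidation.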

Our null hypothesis is that average global sales are equal across all game genres.

aov_result <- aov(global_sales ~ genre, data = game_sales)
summary(aov_result)

##                Df Sum Sq Mean Sq F value Pr(>F)    
## genre           9    458   50.86   21.26 <2e-16 ***
## Residuals   16588  39676    2.39                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F value is the ratio of the variance between genre means to the variance within genres, so a value as large as 21.26 means that genre accounts for far more variation in global sales than chance alone would produce. The very small p value (below 2e-16) indicates that a difference this large would be extremely unlikely if the null hypothesis were true. Thus, there is strong evidence against our null hypothesis, and we can conclude that average global sales are not equal across genres. We can use pairwise t-tests to get a better idea of which genres differ from one another.
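
The F value in the table can be reproduced from the mean squares: 50.86 / 2.39 is about 21.28, which matches the reported 21.26 up to rounding of the displayed values. The sketch below shows the same calculation in R. It runs on simulated toy data (the `toy` data frame is hypothetical and only borrows the column names used here), so its numbers will not match the output above.

```r
set.seed(1)
# Simulated, right-skewed toy data (hypothetical, not the real game sales)
toy <- data.frame(
  genre = rep(c("Action", "Adventure", "Platform"), each = 50),
  global_sales = c(rlnorm(50, -1.0), rlnorm(50, -2.0), rlnorm(50, -0.5))
)

fit <- aov(global_sales ~ genre, data = toy)
tab <- summary(fit)[[1]]  # the ANOVA table as a data frame

# F = (between-genre mean square) / (within-genre mean square)
f_manual <- tab[["Mean Sq"]][1] / tab[["Mean Sq"]][2]

# p value: upper tail of the F distribution with the table's degrees of freedom
p_manual <- pf(f_manual, tab[["Df"]][1], tab[["Df"]][2], lower.tail = FALSE)
```

Both `f_manual` and `p_manual` should agree with the `F value` and `Pr(>F)` columns of `tab`.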

pairwise.t.test(game_sales$global_sales, game_sales$genre, p.adjust.method = "bonferroni")

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  game_sales$global_sales and game_sales$genre 
## 
##              Action  Adventure Fighting Misc    Platform Racing  Role-Playing
## Adventure    7.6e-10 -         -        -       -        -       -           
## Fighting     1.00000 2.3e-05   -        -       -        -       -           
## Misc         0.10737 0.00064   1.00000  -       -        -       -           
## Platform     1.1e-10 < 2e-16   1.7e-06  < 2e-16 -        -       -           
## Racing       1.00000 3.4e-09   1.00000  0.03190 9.8e-06  -       -           
## Role-Playing 1.00000 5.2e-12   1.00000  0.00060 7.2e-05  1.00000 -           
## Shooter      7.9e-06 < 2e-16   0.00531  4.0e-12 1.00000  0.03458 0.18013     
## Simulation   1.00000 0.00398   1.00000  1.00000 2.2e-09  1.00000 0.43745     
## Sports       1.00000 5.5e-11   1.00000  0.00983 5.4e-08  1.00000 1.00000     
##              Shooter Simulation
## Adventure    -       -         
## Fighting     -       -         
## Misc         -       -         
## Platform     -       -         
## Racing       -       -         
## Role-Playing -       -         
## Shooter      -       -         
## Simulation   2.4e-05 -         
## Sports       0.00116 1.00000   
## 
## P value adjustment method: bonferroni

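Each value in this table is a Bonferroni-adjusted p value for the difference in mean global sales between the genres in that row and column. A value of 1 means the test found no evidence of a difference between the two genres (which is not proof that they are the same), while small values indicate a significant difference. Every genre differs significantly from at least one other genre, but the table suggests that adventure games are the most unlike the rest in terms of global sales: every comparison involving Adventure has an adjusted p value below 0.01.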
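
Rather than scanning the matrix by eye, the significant pairs can be pulled out of the test object. This is a minimal sketch on simulated toy data (the `toy` data frame and the 0.05 cutoff are assumptions for illustration); `pairwise.t.test()` stores the adjusted p values as a lower-triangular matrix in `$p.value`.

```r
set.seed(1)
# Simulated, right-skewed toy data (hypothetical, not the real game sales)
toy <- data.frame(
  genre = rep(c("Action", "Adventure", "Platform"), each = 50),
  global_sales = c(rlnorm(50, -1.0), rlnorm(50, -2.0), rlnorm(50, -0.5))
)

pw <- pairwise.t.test(toy$global_sales, toy$genre,
                      p.adjust.method = "bonferroni")
p_mat <- pw$p.value  # lower triangle holds adjusted p values, NA elsewhere

# Row/column positions of every pair below the (assumed) 0.05 threshold
idx <- which(!is.na(p_mat) & p_mat < 0.05, arr.ind = TRUE)

# One row per significantly different pair of genres
sig_pairs <- data.frame(
  genre_a = rownames(p_mat)[idx[, "row"]],
  genre_b = colnames(p_mat)[idx[, "col"]],
  p_adj   = p_mat[idx]
)
```

Applied to the real data, the same steps would list every genre pair whose adjusted p value falls below the chosen threshold.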

game_sales |>
  ggplot() +
  geom_boxplot(mapping = aes(x = genre, y = global_sales)) +
  labs(x = "Genre", y = "Global Sales (in Millions)")

Global sales are so heavily clustered around 0 that the boxplot is not a particularly useful visual. However, we can see that adventure games are even more clustered towards 0 than the rest of the genres, with the smallest sales outliers, so the pairwise t-test results make sense. Platform games have enough distance between the median and third quartile for some space to be visible between them on the visual, so they also appear to be quite dissimilar to the other genres, which is reflected in the t-test: Platform differs significantly from every genre except Shooter. Platform games are similar enough to shooters, the next-most varied genre, to get an adjusted p value of 1.
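
One common way to make a boxplot of heavily right-skewed values more readable is a log-scaled y axis. The sketch below is an optional variation, not part of the original analysis. It uses simulated toy data (the `toy` data frame is hypothetical) and assumes every sales value is strictly positive, since zeros cannot be shown on a log scale.

```r
library(ggplot2)

set.seed(3)
# Simulated, right-skewed toy data (hypothetical, not the real game sales)
toy <- data.frame(
  genre = rep(c("Action", "Adventure", "Platform"), each = 50),
  global_sales = c(rlnorm(50, -1.0), rlnorm(50, -2.0), rlnorm(50, -0.5))
)

# Same boxplot layout as above, but with a log10 y axis so the bulk of the
# distribution is spread out instead of being squashed against 0.
# The box statistics are computed on the log-transformed values.
p <- ggplot(toy) +
  geom_boxplot(mapping = aes(x = genre, y = global_sales)) +
  scale_y_log10() +
  labs(x = "Genre", y = "Global Sales (in Millions, log scale)")
```

Printing `p` draws the plot; swapping `toy` for `game_sales` would apply the same scale to the real data.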

Linear Regression Model

Unfortunately, there are not many continuous variables in this dataset, and the ones with an evident linear relationship to global sales are the regional sales columns. Looking specifically at shooter games, we can see a slight upward trend over the years, although it is not very strong. However, we can still attempt to build a linear regression model for demonstration.

shooter_sales <- game_sales |>
  filter(genre == "Shooter", !is.na(year))

shooter_sales |>
  ggplot() +
  geom_point(mapping = aes(x = year, y = global_sales)) +
  labs(x = "Year", y = "Global Sales (in Millions)")

linear_reg_model <- lm(global_sales ~ year, data = shooter_sales)
linear_reg_model$coefficients

##  (Intercept)         year 
## -33.35934195   0.01702951

linear_reg_model$coefficients[1] + 1980*(linear_reg_model$coefficients[2])

## (Intercept) 
##   0.3590823

Our model shows that, if the year is 0, the expected global sales value would be -33.4 million. Clearly, this is impossible and not relevant for video game sales, so we can instead use the coefficients to find the expected value in the first year in the dataset. In 1980, the expected value for the global sales of a shooter game would be about 0.36 million (roughly $360k), and each year, that amount increases by about 0.017 million ($17k).
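
The 1980 figure comes straight from the coefficients: -33.35934195 + 1980 × 0.01702951 ≈ 0.359. The same value can be obtained with `predict()`, or the model can be refit with years counted from 1980 so that the intercept itself is the 1980 estimate. This is a minimal sketch on simulated toy data (the `toy` data frame is hypothetical, so its coefficients will differ from those above):

```r
set.seed(2)
# Simulated toy data (hypothetical): one row per game, years 1980 to 2016
toy <- data.frame(year = sample(1980:2016, 200, replace = TRUE))
toy$global_sales <- rlnorm(200, meanlog = -1 + 0.02 * (toy$year - 1980))

# Model on the raw year: the intercept refers to year 0
raw_fit <- lm(global_sales ~ year, data = toy)

# Expected sales in 1980 from the raw model
pred_1980 <- predict(raw_fit, newdata = data.frame(year = 1980))

# Refit with years counted from 1980: the slope is unchanged and
# the intercept is now the expected sales in 1980
centered_fit <- lm(global_sales ~ I(year - 1980), data = toy)
```

Centering does not change the fitted line; it only moves the intercept to a year that actually appears in the data.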

shooter_sales |>
  ggplot() +
  geom_point(mapping = aes(x = year, y = global_sales)) +
  geom_smooth(mapping = aes(x = year, y = global_sales), method = "lm", se = FALSE, color = 'red') +
  labs(x = "Year", y = "Global Sales (in Millions)")

## `geom_smooth()` using formula = 'y ~ x'

With our linear model overlaid on the visual, we can see that it does indicate an increase in global sales over time, but the error is very large for many of our data points. The data is so heavily weighted towards 0 that the few high-selling games do not have much of an impact on the fitted line. This model is not a good fit for the data, which is probably better suited to a different kind of model altogether. Shooter games may be more likely to reach higher sales in current and future years, but this relationship is not linear.
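
How poorly a linear model fits can also be quantified instead of judged by eye. The sketch below pulls a few standard diagnostics from `summary()` of an `lm` fit. It uses simulated toy data (the `toy` data frame is hypothetical), so the values say nothing about the real shooter model.

```r
set.seed(2)
# Simulated toy data (hypothetical): one row per game, years 1980 to 2016
toy <- data.frame(year = sample(1980:2016, 200, replace = TRUE))
toy$global_sales <- rlnorm(200, meanlog = -1 + 0.02 * (toy$year - 1980))

fit <- lm(global_sales ~ year, data = toy)
fit_summary <- summary(fit)

# Share of the variance in global sales explained by year (0 to 1)
r_squared <- fit_summary$r.squared

# Typical size of a residual, in the same units as global_sales
resid_sd <- fit_summary$sigma

# p value for the slope on year
slope_p <- coef(fit_summary)["year", "Pr(>|t|)"]

# Residuals: a median well below the mean (which is 0 by construction)
# points to right-skewed errors
res <- residuals(fit)
resid_median <- median(res)
```

Running the same calls on `linear_reg_model` would give the corresponding figures for the shooter data.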