Week 9 Data Dive: Regression Diagnostics

The goal of this project is to demonstrate the use of linear regression modeling with multiple different terms as well as how to identify potential issues with a model using diagnostic plots.

library(readr)
library(tidyverse)
library(ggplot2)
library(ggrepel)
library(broom)
library(lindia)

game_sales <- read_csv("video_game_sales.csv")
game_sales_raw <- game_sales

Building a Regression Model

In the simple linear regression model from week 8, we plotted the global sales of games against the year, filtering down to just shooter games. By incorporating genre as another term into the model, we can create a regression model that could better model any game’s global sales with knowledge of its genre. To avoid too much complexity from the number of different genres in the data set, we can create a binary column that shows if a game is a shooter or not. Shooter games might be less or more popular in certain years compared to other game genres, so the categories might not be identically distributed, but it is unlikely for there to be a direct connection between genre and year, so there should not be any issues with collinearity.

game_sales <- game_sales |>
  filter(!is.na(year)) |>
  mutate(is_shooter = ifelse(genre == "Shooter", 1, 0))

game_sales |>
  ggplot() +
  geom_point(mapping = aes(x = year, y = global_sales, color = factor(is_shooter))) +
  labs(x = "Year", y = "Global Sales (in Millions)", color = "Is a shooter game?")

It is difficult to see in the visualization, but it seems that shooter games have a slight upward trend, as observed in last week’s model, while non-shooter games do not have much of a linear trend at all. However, we can assume this for the sake of building our model. An interaction term rather than a second variable would likely be a better choice since the slope does not look the same across the categories.

model <- lm(global_sales ~ year + is_shooter + year:is_shooter, game_sales)


model$coefficients

##     (Intercept)            year      is_shooter year:is_shooter 
##     47.77869591     -0.02355438    -81.13803785      0.04058389

model$coefficients[1] + 1980*(model$coefficients[2])

## (Intercept) 
##    1.141023

model$coefficients[1] + model$coefficients[3] + 1980*(model$coefficients[2] + model$coefficients[4])

## (Intercept) 
##   0.3590823

As with the previous model, the intercept is a bit difficult to interpret since the data obviously does not start at year 0. However, we can look at what the expected global sales value of shooter and non-shooter games in 1980 would be. Non-shooter games have a higher starting value of 1.14 million, while shooter games in 1980 are only expected to make 350 thousand. However, non-shooter games decrease by about $20k each year, while shooter games increase by $17k, as in the previous model of only shooter games.

Evaluating the Regression Model

Residuals vs. Fitted Values Plot

gg_resfitted(model) +
  geom_smooth(se=FALSE) +
  ylim(-3, 90)

This plot shows us that many of the data points’ real values are being greatly underestimated in our model, while almost none are being overestimated. It is difficult to tell how many are close to the regression line because the points are so packed in, but it is clear that several points are far above it. Since our model stays so close to 0, this makes sense, but it is not a good sign for our model being useful. I would expect that predicting the mean for every point would be about as effective as our model itself, as this would also be very close to the majority of the data but underestimate many apparent outliers.

Residual Histogram

gg_reshist(model)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The residuals are evidently not normally distributed around 0, but with data that cannot go below 0 like sales, it is expected for it to be skewed right. The histogram does look like the right half of a normal distribution for the most part, but there are a few residuals that are very large. This data is likely normally distributed enough for linear regression, but investigating the QQ-plot may make this clearer.

QQ-Plot

gg_qqplot(model)

Most of the residuals line up very closely with what a normal distribution would indicate. However, on the upper end of the quantiles, the expected residual stays low while the real residuals grow extremely large. This was also shown in the histogram, but it is much easier to see the points here than the extremely short columns in the histogram. The QQ-Plot emphasizes these issues much more, but I believe the residuals are normally distributed enough for our model to be somewhat useful. There are many data points that have unusually high global sales and thus high residuals, but for most of the data, the model is quite accurate.

Cook’s D by Observation Plot

gg_cooksd(model, threshold = 'matlab')

Since the data is sorted by the global sales ranking of each game, every game that has unusually high sales values that more heavily influence the model is at the beginning of the dataset, so this particular plot is impossible to visually interpret. However, we can add the Cook’s D values to each row in the dataset and show the values on a scatterplot instead.

sales_cooksd = cooks.distance(model)
outlier_point = 3*mean(sales_cooksd)
game_sales <- game_sales |>
  mutate(cooksd = sales_cooksd[rank], is_cooksd_outlier = ifelse(cooksd > outlier_point, TRUE, FALSE))
outlier_point

## [1] 0.0003747242

game_sales |>
  ggplot(mapping = aes(x = year, y = global_sales)) +
  geom_point(size = 1, mapping = aes(color = is_cooksd_outlier)) +
  geom_text_repel(data = filter(game_sales, is_cooksd_outlier), mapping = aes(label = paste("Cook's D", round(cooksd, 3)))) +
    labs(x = "Year", y = "Global Sales (in Millions)", color = "Is a highly influential point?")

## Warning: ggrepel: 223 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

The mean Cook’s D value is extremely small, about .0004, so over 200 of the over 16000 rows are considered highly influential points. We cannot fit all of the labels, but by color we can see that some games that are not considered highly influential sold better than games that were. More detailed investigation would likely show that shooter and non-shooter games have a different point at which sales heavily influence their modeling.

Residuals vs. X Values Plot

Unfortunately, this plot does not generate if an interaction term is used. So, to demonstrate the use of this plot, we will instead look at a simpler version of the model that uses is_shooter as an additional variable, but not an interaction term.

model2 <- lm(global_sales ~ year + is_shooter, game_sales)

model2$coefficients

## (Intercept)        year  is_shooter 
## 40.18190234 -0.01976819  0.27195192

model2$coefficients[1] + 1980*(model$coefficients[2])

## (Intercept) 
##   -6.455771

In this version of the model, games in the starting year are expected to make $1.04 million, and with each year, a game is expected to make $19k less. However, a game being a shooter adds $27k to its expected value.

resx_plots <- gg_resX(model2, plot.all = FALSE)

resx_plots$year

resx_plots$is_shooter

The expectation for this plot would be that, for each value of year or is_shooter, the residuals are normally distributed around 0. However, we can see that there are very few residuals slightly below 0 and many far above it for each value. As discussed repeatedly, this is not surprising for our data, since sales cannot go below 0 and our model generally predicts very close to it. However, even if we accept that we should only look for the positive half of a normal distribution, we do not find that in this plot. This breaks the assumptions we make when building a linear regression model, so this is not a good model for our data.

summary(model)$r.squared

## [1] 0.009696747

summary(model2)$r.squared

## [1] 0.007766937

summary(model)$adj.r.squared

## [1] 0.009514739

summary(model2)$adj.r.squared

## [1] 0.00764537

We can compare the usefulness of this simplified version of our model above and the main model using R-squared values. The model that incorporates an interaction term is more successful at modeling the data than the one without, shown by both the R-squared and adjusted R-squared values. This lines up with our understanding of the data, as the original scatterplot indicated that shooter and non-shooter games should have different slopes in our linear model.

Overall Evaluation

Though the data could be considered to be normally distributed as far as sales data that cannot go below 0 can be, this dataset is overall not a good candidate for linear regression modeling. Most of the games have global sales values very close to 0, so that is what our model predicts, but there are still over two hundred Cook’s D outliers that were very successful games. Both of these facts are both quite obvious–since the data consists of all the best selling games above a certain point, naturally that point is where we will find the most games, regardless of the year. Although linear regression is not very effective for inference in this case, it is still useful as an example to show less-than-ideal diagnostic plot conditions.