Week 11 Data Dive: Generalized Linear Models (Part 2)

The goal of this project is to demonstrate ways to diagnose issues with a generalized linear model and compare it with other models.

library(readr)
library(tidyverse)
library(ggplot2)
library(ggrepel)
library(broom)
library(lindia)
library(GGally)
library(patchwork)
library(car)

game_sales <- read_csv("video_game_sales.csv")
# keep an untouched copy of the raw data for reference
game_sales_raw <- game_sales

Creating a Logistic Model

In this scenario, the team from week 10 is looking for a better way to model the probability of a game being profitable. They decide to add the popularity of the platform a game was released on to the model. The platform ranking is based on how many games in the dataset were released on each platform: the platform with the most games is ranked 1, and the rank number increases as the platform's popularity decreases.

Mutating Platform Ranking Variable

# flag a game as a profitable investment if it made at least $5 million in global sales
game_sales <- game_sales |>
  mutate(profitable_investment = ifelse(global_sales >= 5, 1, 0))
# create new dataframe with total of number of games per platform and their popularity ranking
platform_ranks <- game_sales |>
  group_by(platform) |>
  summarize(total_platform_games = n()) |>
  mutate(platform_rank = rank(desc(total_platform_games)))

# merge total number of games per platform and platform ranking into main dataframe
game_sales <- game_sales |>
  left_join(platform_ranks, by = "platform") |>  # keep every game, matching on platform
  arrange(rank)  # keep rows ordered by the dataset's overall rank column

Plotting Platform Ranking Against Profitability

game_sales |>
  ggplot(mapping = aes(x = platform_rank, y = profitable_investment)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 1) +
  labs(x = "Popularity Ranking of Platform", y = "Profitable Investment?") +
  
  theme_minimal()

It looks like the likelihood of an investment being profitable may decrease slightly as the platforms become less popular. However, this could simply be because the less popular platforms host fewer games overall, and thus fewer profitable games in absolute terms. Despite this, the team decides to go ahead and create a new model with platform ranking as an additional variable so they can compare it to the original model.
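
One quick way to separate those two explanations is to compare the rate of profitable games at each rank rather than the raw counts. A minimal sketch using the columns built above:

# rate of profitable games per platform rank, which controls for
# the differing number of games on each platform
game_sales |>
  group_by(platform_rank) |>
  summarize(n_games = n(),
            prop_profitable = mean(profitable_investment)) |>
  arrange(platform_rank)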

Interpreting Coefficients

old_model <- glm(profitable_investment ~ jp_sales, data = game_sales, family = binomial(link = 'logit'))
new_model <- glm(profitable_investment ~ jp_sales + platform_rank, data = game_sales, family = binomial(link = 'logit'))

new_model$coefficients
##   (Intercept)      jp_sales platform_rank 
##   -4.85154214    2.43103216   -0.03716633
paste("Expected Profitability when parameters are 0:",round(exp(new_model$coefficients[1]),4))
## [1] "Expected Profitability when parameters are 0: 0.0078"
paste("Expected Profitability for $1 million made in Japan, rank 0:",round(exp(new_model$coefficients[1])*exp(new_model$coefficients[2]),4))
## [1] "Expected Profitability for $1 million made in Japan, rank 0: 0.0889"
paste("Expected Profitability for $0 million made in Japan, rank 15:",round(exp(new_model$coefficients[1])*(exp(new_model$coefficients[3]))^15,4))
## [1] "Expected Profitability for $0 million made in Japan, rank 15: 0.0045"
paste("Expected Profitability for $1 million made in Japan, rank 15:",round(exp(new_model$coefficients[1])*(exp(new_model$coefficients[2]))*(exp(new_model$coefficients[3]))^15,4))
## [1] "Expected Profitability for $1 million made in Japan, rank 15: 0.0509"

In this model, the coefficients tell us that the expected log odds of profitability for a game that makes less than $10k in Japan (recorded as 0 in the data) and is released on a platform ranked 0 (a rank that does not occur in this dataset, since the rankings start at 1) is -4.85. Exponentiating a log odds gives odds, not a probability: the expected odds of profitability are .0078, which corresponds to a probability of about .0077 (odds / (1 + odds)). As sales in Japan increase by $1 million, the log odds increase by 2.431, and as the platform ranking increases by 1 (meaning fewer games in the dataset were released on that platform), the log odds decrease by .037. As examples: a game that made $1 million in Japan on a rank-0 platform has expected odds of .0889, or about an 8.2% probability of being profitable; a game that made less than $10k in Japan on a rank-15 platform has expected odds of .0045, or about a 0.4% probability; and a game that made $1 million in Japan on a rank-15 platform has expected odds of .0509, or roughly a 4.8% probability.
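
Rather than multiplying exponentiated coefficients by hand, the same hypothetical games can be scored on the probability scale directly with predict() and type = "response". A sketch, where new_games is an illustrative data frame rather than part of the original analysis:

# four hypothetical games matching the combinations discussed above
new_games <- data.frame(jp_sales      = c(0, 1, 0, 1),
                        platform_rank = c(0, 0, 15, 15))
# type = "response" returns fitted probabilities instead of log odds
predict(new_model, newdata = new_games, type = "response")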

Comparing New and Old Models

Deviance

paste("Old Model (Sales in Japan Only) Deviance:", round(old_model$deviance,3))
## [1] "Old Model (Sales in Japan Only) Deviance: 1575.09"
paste("New Model (Sales in Japan + Platform Ranking) Deviance:", round(new_model$deviance,3))
## [1] "New Model (Sales in Japan + Platform Ranking) Deviance: 1568.821"

Both models have high deviance in absolute terms, but the raw value on its own is not a useful way to judge a model's quality, since it scales with the number of observations. What we can say is that the new model deviates less from the data than the old one, though only by a small amount, so it may not be much of an improvement.
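
One way to put the raw deviance on an interpretable scale is McFadden's pseudo-R-squared, which compares each model's deviance to that of an intercept-only model. A small sketch (pseudo_r2 is a helper defined here, not a library function):

# McFadden's pseudo-R^2: 1 - residual deviance / null deviance;
# values near 0 mean the predictors add little beyond the intercept
pseudo_r2 <- function(model) 1 - model$deviance / model$null.deviance
pseudo_r2(old_model)
pseudo_r2(new_model)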

Information Criteria

paste("Old Model AIC:", round(old_model$aic,3))
## [1] "Old Model AIC: 1579.09"
paste("New Model AIC:", round(new_model$aic,3))
## [1] "New Model AIC: 1574.821"

The Akaike Information Criterion values are very close to the deviances, which makes sense, since the AIC also measures how much information the model loses about the data. Once again the difference between the two models is small, but the new model has a slightly lower value and thus loses slightly less information than the old one.
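
That closeness is no coincidence. For a logistic model with binary outcomes the saturated log-likelihood is 0, so the deviance equals -2 times the log-likelihood, and the AIC reduces to the deviance plus 2 per estimated coefficient:

# AIC = deviance + 2 * (number of estimated coefficients)
old_model$deviance + 2 * length(old_model$coefficients)   # 1579.09
new_model$deviance + 2 * length(new_model$coefficients)   # 1574.821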

paste("Old Model BIC:", round(BIC(old_model),3))
## [1] "Old Model BIC: 1594.524"
paste("New Model BIC:", round(BIC(new_model),3))
## [1] "New Model BIC: 1597.972"

The Bayesian Information Criterion is calculated much like the AIC, but in this case the new model actually scores worse than the old one. The AIC charges a flat penalty of 2 per parameter, which the extra parameter's deviance reduction of about 6.3 easily covers. The BIC instead charges log(n) per parameter, and with roughly 16,600 rows in the game sales data that penalty is about 9.7 per parameter, more than the 6.3-point improvement the new parameter buys. By the BIC's stricter standard, then, platform ranking does not earn its place in the model.
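
The same decomposition reproduces the values above (n is taken from the fitted model; it matches the 16,596 residual degrees of freedom plus 2 coefficients in the ANOVA table below):

# BIC = deviance + log(n) * (number of estimated coefficients)
n <- nobs(old_model)   # number of observations used to fit the model
old_model$deviance + length(old_model$coefficients) * log(n)   # 1594.524
new_model$deviance + length(new_model$coefficients) * log(n)   # 1597.972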

ANOVA

anova(old_model, new_model, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: profitable_investment ~ jp_sales
## Model 2: profitable_investment ~ jp_sales + platform_rank
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1     16596     1575.1                       
## 2     16595     1568.8  1   6.2688  0.01229 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA test between nested models helps determine whether adding a variable meaningfully reduces the residual deviance or has little impact. Here, the new model reduces the residual deviance by about 6.3 relative to the model with only sales in Japan, and the p-value of .012 falls below the conventional .05 threshold, so the reduction is statistically significant. That said, with roughly 16,600 observations even small effects reach significance, and the absolute improvement is modest; taken together with the BIC result, this suggests platform ranking is not a particularly useful parameter for our model of profitability.
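
The p-value in the table is just the upper tail of a chi-squared distribution with 1 degree of freedom, evaluated at the drop in deviance, and can be reproduced by hand:

# drop in deviance between the nested models, tested against chi-squared(1)
dev_drop <- old_model$deviance - new_model$deviance   # 6.2688
pchisq(dev_drop, df = 1, lower.tail = FALSE)          # 0.01229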

Variance Inflation Factor

vif(new_model)
##      jp_sales platform_rank 
##      1.167675      1.167675

Finally, we can check for collinearity between our predictors using the Variance Inflation Factor. With only two predictors the VIFs are necessarily equal, since each one measures how well that predictor is explained by the other. A VIF of about 1.17 is far below the usual rule-of-thumb cutoffs of 5 or 10, so sales in Japan and platform ranking do not strongly influence each other, and we can safely keep both in the model without concern about collinearity. However, as we have seen from the other tests and comparisons, adding platform ranking does not improve the model by much at all, so it would be best for the team to keep using their original model.
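
As a final sanity check on those VIF values: for an ordinary linear model with exactly two predictors, both VIFs equal 1 / (1 - r^2), where r is the correlation between the predictors. car::vif() on a GLM works from the coefficient covariance matrix instead, so this sketch should land near, though not exactly on, the reported 1.17:

# raw correlation between the two predictors
r <- cor(game_sales$jp_sales, game_sales$platform_rank)
# VIF implied by the two-predictor linear-model formula; for a GLM the
# weighting in the fit means this is only an approximation of car::vif()
1 / (1 - r^2)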