Week 10 Data Dive: Generalized Linear Models

The goal of this project is to demonstrate how to use logistic regression to model a binary variable and how to interpret the results.

library(readr)
library(tidyverse)
library(ggplot2)
library(ggrepel)
library(broom)
library(lindia)

game_sales <- read_csv("video_game_sales.csv")
game_sales_raw <- game_sales

Creating a Binary Variable

Rather than modeling the exact amount a game made in sales, in this scenario we will model a binary column derived from global sales. A company would like to invest in an upcoming game project, but the investment will only turn a profit if the game makes at least $5 million in global sales. Sales in this dataset are recorded in millions of dollars, so the threshold corresponds to global_sales >= 5.

game_sales <- game_sales |>
  filter(!is.na(year)) |>
  mutate(profitable_investment = ifelse(global_sales >= 5, 1, 0))
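As a quick sanity check on a threshold-derived outcome like this, it can help to look at how many rows land in each class, since a rare positive class affects how the fitted curve should be read. The sketch below uses a small made-up data frame (toy_sales and its values are hypothetical, not taken from the dataset) and base R rather than the tidyverse so that it runs on its own.

```r
# Hypothetical toy data standing in for the real dataset (values are made up).
toy_sales <- data.frame(
  global_sales = c(0.4, 7.2, 1.1, 5.0, 0.9, 12.3, 2.5, 0.1)
)

# Same rule as above: 1 if global sales reach the $5 million threshold, else 0.
toy_sales$profitable_investment <- ifelse(toy_sales$global_sales >= 5, 1, 0)

# Count of rows in each class and the share of profitable rows.
class_counts <- table(toy_sales$profitable_investment)
share_profitable <- mean(toy_sales$profitable_investment)

print(class_counts)
print(share_profitable)
```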

Creating a Logistic Model

In this scenario, the company can conduct market research and get an accurate estimate of the game's sales, but it only has enough resources to conduct this research in Japan, not the whole world. Thus, it wants to build a model showing how sales in Japan relate to the investment being profitable overall. Clearly, a game that makes the minimum $5 million in Japan alone will be a profitable investment, but other games may have low sales in Japan and still be highly successful in other regions, so we would like to find out exactly how the two relate using logistic regression.

game_sales |>
  ggplot(mapping = aes(x = jp_sales, y = profitable_investment)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 1) +
  labs(x = "Sales in Japan (in Millions)", y = "Profitable Investment?") +
  theme_minimal()

Although a fair number of investments were profitable overall without selling well in Japan, the share of profitable investments does appear to increase as sales in Japan increase, so we can use logistic regression to model the probability of a profitable investment as a monotonic function of sales in Japan.

Interpreting Coefficients

model <- glm(profitable_investment ~ jp_sales, data = game_sales, family = binomial(link = 'logit'))

model$coefficients[1]

## (Intercept) 
##    -5.07563

exp(model$coefficients[1])

## (Intercept) 
## 0.006247147

The first coefficient of our model tells us that, if a game makes zero dollars in Japan (really less than $10k, as that is the minimum amount our dataset captures), the log odds of it being a profitable investment are about -5.08. Exponentiating, we see that the odds of it being a profitable investment are about .006, so it is .006 times as likely to be a profitable investment as an unprofitable one. There is a very high concentration of unprofitable games with low sales in Japan, so this aligns with our data.
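The chain from log odds to odds to probability can be sketched as follows, using the rounded intercept printed above rather than the fitted model object (b0 is a name introduced here for illustration). plogis() is base R's logistic function.

```r
# Intercept as printed above (rounded); b0 is an illustrative name.
b0 <- -5.07563

odds_at_zero <- exp(b0)                            # log odds -> odds
prob_at_zero <- odds_at_zero / (1 + odds_at_zero)  # odds -> probability

# plogis() applies the same log-odds-to-probability conversion in one step.
prob_via_plogis <- plogis(b0)

print(c(odds = odds_at_zero, probability = prob_at_zero))
```

Because the odds are so small here, the odds and the probability are nearly identical (both about .006); the two diverge as the odds grow.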

model$coefficients[2]

## jp_sales 
## 2.319616

exp(model$coefficients[1])*(exp(model$coefficients[2]))^2

## (Intercept) 
##   0.6463598

For every additional million dollars a game makes in sales in Japan, 2.32 gets added to the log odds, which is the same as the odds being multiplied by e^2.32, or roughly 10.2. As an example, we take the expected odds when sales are 0, which is .006, and multiply by e^2.32 twice (that is, by e^2.32 squared), which tells us that the expected odds of a game being profitable given that it made $2 million in Japan are about .65. Since these odds are below 1, it is still less likely at that point for a game to be profitable than unprofitable.
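Written out, the calculation above follows from the model equation, where p is the probability of a profitable investment, x is sales in Japan in millions, and \beta_0 and \beta_1 are the two fitted coefficients (this notation is introduced here for illustration and does not appear in the code):

```latex
\begin{align*}
\log\frac{p}{1-p} &= \beta_0 + \beta_1 x \\
\frac{p}{1-p} &= e^{\beta_0}\left(e^{\beta_1}\right)^{x} \\
\left.\frac{p}{1-p}\right|_{x=2} &= e^{-5.07563}\left(e^{2.319616}\right)^{2} \approx 0.646 \\
p\,\big|_{x=2} &= \frac{0.646}{1 + 0.646} \approx 0.39
\end{align*}
```

So at $2 million in Japanese sales the model gives roughly a 39% chance of a profitable investment.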

-model$coefficients[1]/model$coefficients[2]

## (Intercept) 
##    2.188134

Finally, using the coefficients, we can find the point at which the log odds are zero, which is where the model gives a game a 50% chance of being a profitable investment. If a game makes $2.188 million in Japan, the model considers it equally likely to make up the rest of the necessary $5 million in other regions as to fall short of our sales target.
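The same rearrangement works for any target probability, not just 50%: setting the log odds equal to qlogis(p) and solving for sales gives the level of Japanese sales at which the model predicts probability p. The helper below is a sketch with hypothetical names (sales_for_probability, b0, b1) and uses the rounded coefficients printed above.

```r
# Rounded coefficients as printed above; b0 and b1 are illustrative names.
b0 <- -5.07563
b1 <- 2.319616

# Hypothetical helper: sales in Japan (in millions) at which the model's
# predicted probability equals p. qlogis(p) is the log odds, log(p / (1 - p)).
sales_for_probability <- function(p) {
  (qlogis(p) - b0) / b1
}

x_50 <- sales_for_probability(0.5)  # log odds of 0, so this equals -b0 / b1
x_90 <- sales_for_probability(0.9)  # sales needed for a 90% predicted chance

print(c(x_50 = x_50, x_90 = x_90))
```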

sigmoid <- \(x) 1 / (1 + exp(-(model$coefficients[1] + model$coefficients[2] * x)))

game_sales |>
  ggplot(mapping = aes(x = jp_sales, y = profitable_investment)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 1) +
  geom_function(fun = sigmoid, color = 'blue') +
  labs(x = "Sales in Japan (in Millions)", y = "Profitable Investment?") +
  theme_minimal()

Since there is so much data at the low end of sales, the profitable investments with low sales in Japan are not captured very well by our model: the curve stays near zero there. However, it is a better fit for a binary outcome than a simple linear model would be, since its predictions always stay between 0 and 1.
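A hand-written sigmoid like the one above should agree with what predict() returns on the response scale. The sketch below checks this on simulated data so that it runs without the CSV; the data frame sim, the model fit, and the true coefficients (-2 and 1.5) are all made up for illustration.

```r
set.seed(42)

# Simulated stand-in data: x plays the role of jp_sales, y the binary outcome.
n <- 500
sim <- data.frame(x = runif(n, min = 0, max = 4))
sim$y <- rbinom(n, size = 1, prob = plogis(-2 + 1.5 * sim$x))

fit <- glm(y ~ x, data = sim, family = binomial(link = "logit"))

# Hand-written inverse logit built from the fitted coefficients.
sigmoid_fit <- function(x) {
  1 / (1 + exp(-(coef(fit)[[1]] + coef(fit)[[2]] * x)))
}

grid <- data.frame(x = seq(0, 4, by = 0.5))
p_manual  <- sigmoid_fit(grid$x)
p_predict <- predict(fit, newdata = grid, type = "response")

# The two columns should match up to floating-point error.
print(cbind(x = grid$x, manual = p_manual, predict = unname(p_predict)))
```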

Coefficient Confidence Interval

summary(model)

## 
## Call:
## glm(formula = profitable_investment ~ jp_sales, family = binomial(link = "logit"), 
##     data = game_sales)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.07563    0.09609  -52.82   <2e-16 ***
## jp_sales     2.31962    0.11011   21.07   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2210.9  on 16326  degrees of freedom
## Residual deviance: 1561.3  on 16325  degrees of freedom
## AIC: 1565.3
## 
## Number of Fisher Scoring iterations: 7

z_score <- qnorm(p = (1 - .95)/2, lower.tail = FALSE)
model_coefficient_z <- z_score * .11011
c(2.31962 - model_coefficient_z, 2.31962 + model_coefficient_z)

## [1] 2.103808 2.535432

Assuming our dataset is a representative sample of the entire population of video games, intervals constructed this way would contain the true coefficient for about 95% of the random samples we could take from this population; for our sample, the interval runs from 2.104 to 2.535. However, the dataset is meant to be a comprehensive list of games with a certain amount of success, so it is not actually a representative sample, and this conclusion should be treated with caution. Even so, we can use the interval to get a general idea of the true value of the coefficient. We would expect that, for a model based on the entire population, the odds of a game being profitable would increase by a factor somewhere between e^2.104 and e^2.535 (roughly 8.2 to 12.6) for every additional million dollars it made in Japan.
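The interval above was computed by typing in the estimate and standard error from the summary output. A sketch of the same Wald interval taken directly from a model object is shown below, on simulated data so that it runs without the CSV (sim, fit, and the true coefficients are made up). confint.default() is the stats function for intervals of the form estimate plus or minus z times the standard error; note that confint() on a glm uses a profile-likelihood method instead, so its endpoints can differ slightly.

```r
set.seed(42)

# Simulated stand-in data (hypothetical), as the CSV is not assumed here.
n <- 500
sim <- data.frame(x = runif(n, min = 0, max = 4))
sim$y <- rbinom(n, size = 1, prob = plogis(-2 + 1.5 * sim$x))
fit <- glm(y ~ x, data = sim, family = binomial(link = "logit"))

# Manual Wald interval for the slope: estimate +/- z * standard error.
coef_table <- coef(summary(fit))
z_score <- qnorm(0.975)
manual_ci <- coef_table["x", "Estimate"] +
  c(-1, 1) * z_score * coef_table["x", "Std. Error"]

# The same interval from the model object, on the log-odds scale.
wald_ci <- confint.default(fit, parm = "x", level = 0.95)

# Exponentiating the endpoints gives an interval for the odds multiplier.
odds_ratio_ci <- exp(wald_ci)

print(wald_ci)
print(odds_ratio_ci)
```

Reading the estimate and standard error from the model object avoids the rounding that comes with copying values from printed output.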