Generalized Linear Models Part II

Data and Libraries

library(tidyverse)
library(ggthemes)
library(ggrepel)
library(broom)
library(lindia)

data <- read_delim("./sports.csv", delim = ',')
data$total_athletes <- data$sum_partic_men + data$sum_partic_women

I’ll use total_athletes as the explanatory variable and total revenue (total_rev_menwomen) as the response variable. Let’s first plot them to see whether they have a linear relationship or whether they require a transformation.

data |>
  ggplot(mapping = aes(x = total_athletes, y = total_rev_menwomen)) +
  geom_point() +
  scale_y_log10() +
  scale_x_log10() +
  geom_smooth(method = 'lm', se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'

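After log-scaling both axes, the relationship looks quite linear. Strictly speaking, this means the logged variables are linearly related, which would argue for a log-log model. For this first pass I fit the model on the original scale and let the diagnostics below show how well that holds up.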
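To make the difference between the two scales concrete, here is a minimal sketch on simulated data (not sports.csv). The data frame `sim`, the exponent 1.1, and the noise level are all assumptions chosen for illustration; only the column names mirror the ones used above.

```r
# Simulated stand-in for the sports data (hypothetical values, for illustration only)
set.seed(1)
n <- 200
athletes <- round(exp(runif(n, log(5), log(500))))        # counts spread evenly on a log scale
revenue  <- 20000 * athletes^1.1 * exp(rnorm(n, sd = 0.4)) # power law with multiplicative noise

sim <- data.frame(total_athletes = athletes, total_rev_menwomen = revenue)

# Model on the original scale, as fitted in this write-up
raw_fit <- lm(total_rev_menwomen ~ total_athletes, data = sim)

# Model on the scale that the log-log plot actually displays
log_fit <- lm(log10(total_rev_menwomen) ~ log10(total_athletes), data = sim)

coef(raw_fit)
coef(log_fit)  # slope estimates the power-law exponent
```

In the log-log fit the slope is read as an elasticity: a 1% increase in athletes goes with roughly a slope-% change in revenue, which is a different statement from the dollars-per-athlete reading of the raw-scale fit.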

lm_model <- lm(total_rev_menwomen ~ total_athletes, data)
lm_model$coefficients
##    (Intercept) total_athletes 
##       24847.29       25182.31

Coefficient Interpretation

For each additional athlete, the model predicts an increase of about $25,000 in total revenue, on average.
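As a quick check of that reading, the sketch below plugs the printed coefficients into the fitted line by hand. The helper `predict_rev` and the athlete counts 100 and 101 are hypothetical, picked only to show that a one-unit step in total_athletes changes the prediction by exactly the slope.

```r
# Coefficients copied from the lm_model output above
intercept <- 24847.29
slope     <- 25182.31

# Hypothetical helper: predicted total revenue for a given number of athletes
predict_rev <- function(athletes) intercept + slope * athletes

# Difference between two programs that differ by one athlete
step <- predict_rev(101) - predict_rev(100)
step  # equals the slope, up to floating-point rounding
```

The same difference holds for any starting count, because a straight-line model assumes the per-athlete effect is constant across the whole range.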

Let’s now use diagnostic plots to check our model.

Diagnostic Plots

Residuals vs Fitted Values

gg_resfitted(lm_model) +
  geom_smooth(se=FALSE)

This is interesting. For the most part, the residuals sit close to the zero line. Part of this may be an effect of the wide range of values on the plot, which compresses most points toward the line. There are also abrupt jumps in the residuals at certain fitted values, which may simply be a feature of the data. Next, we look at the residuals vs. X values plot.
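Because a wide axis range can hide changes in spread, it can help to put a number on what the plot shows. The sketch below uses simulated data with multiplicative noise (an assumption, not the sports data) and compares the residual standard deviation in the lower and upper halves of the fitted values.

```r
# Simulated data where the noise grows with x (hypothetical, for illustration only)
set.seed(42)
x <- runif(300, 1, 100)
y <- 1000 * x * exp(rnorm(300, sd = 0.5))

fit <- lm(y ~ x)

# The same two quantities that a residuals-vs-fitted plot displays
diag_df <- data.frame(fitted = fitted(fit), resid = resid(fit))

# Split at the median fitted value and compare residual spread in each half
low    <- diag_df$fitted <= median(diag_df$fitted)
spread <- c(low = sd(diag_df$resid[low]), high = sd(diag_df$resid[!low]))
spread
```

If the two numbers differ a lot, the constant-variance assumption is in doubt even when most points look close to the zero line.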

plots <- gg_resX(lm_model, plot.all = FALSE)

plots$total_athletes +
  geom_smooth(se = FALSE)

This plot looks very similar to the last one: most of the residuals lie close to the line. From these two plots, I’d say the assumptions are not fully met, but they are close. The jumps in the residuals at certain values suggest that the residuals are not quite normally distributed. Next, we look at a histogram of the residuals.

gg_reshist(lm_model) +
  scale_x_log10() # log-scaled x axis for a closer view; non-positive residuals are dropped

Most of the distribution looks close to normal, but not quite. There are some extra bumps in the histogram. The QQ plot will show this more clearly.

gg_qqplot(lm_model)

The QQ plot matches the unscaled version of the histogram, which has only three bars. Even after log-scaling the x axis of the histogram, the right side still didn’t look normal, and the QQ plot confirms that. So the assumption of normally distributed residuals is violated.
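A QQ plot can be paired with a formal test. The following is a sketch on simulated, right-skewed data (an assumption, not the sports data) using base R's `qqnorm()` and `shapiro.test()`. Note that `shapiro.test()` accepts at most 5000 observations, so a larger data set would need a sample of the residuals.

```r
# Simulated data with strongly right-skewed noise (hypothetical, for illustration only)
set.seed(7)
x <- runif(200, 1, 100)
y <- 1000 * x * exp(rnorm(200, sd = 0.8))

fit <- lm(y ~ x)

# Standardized residuals, so the QQ comparison is on a common scale
std_res <- rstandard(fit)

# Coordinates behind a normal QQ plot: theoretical vs. sample quantiles
qq <- qqnorm(std_res, plot.it = FALSE)
head(data.frame(theoretical = qq$x, sample = qq$y))

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(std_res)
sw$p.value
```

With large samples the test rejects for even minor departures, so it is best read alongside the plot rather than instead of it.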

gg_cooksd(lm_model, threshold = 'matlab')

We have a number of highly influential points, which affect the fitted model.
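To list the influential points rather than only plot them, Cook's distance can be computed directly with base R's `cooks.distance()`. The sketch below uses simulated data with one planted outlier (an assumption, not the sports data). The cutoff of three times the mean Cook's distance is my understanding of what the 'matlab' threshold in `gg_cooksd()` corresponds to, so treat it as an assumption and check the lindia documentation.

```r
# Simulated data with one high-leverage, off-trend point (hypothetical, for illustration only)
set.seed(3)
x <- c(runif(99, 1, 50), 500)            # observation 100 sits far from the other x values
y <- 1000 * x + rnorm(100, sd = 5000)
y[100] <- y[100] * 3                     # push it well off the trend line

fit <- lm(y ~ x)

# Cook's distance for every observation
cd <- cooks.distance(fit)

# Assumed cutoff: three times the mean Cook's distance
cutoff  <- 3 * mean(cd)
flagged <- which(cd > cutoff)
flagged
```

Refitting the model without the flagged rows and comparing coefficients is one way to see how much those points actually move the line.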

Overall:

The main issue I noticed with the model is that the normality-of-residuals assumption is violated. In addition, many points in the data are highly influential and pull the fitted line.