Week 11 Data Dive - GLMs Part 2

For this week’s data dive I will be creating a linear model based on the following variables:

Response Variable: Playoffs

Explanatory Variables: FG_per_100, AST_per_100, TOV_per_100

After creating the linear model I will be evaluating the model with a series of 6 diagnostic plots.

Lastly, I will be interpreting the significant coefficients from the linear model.

As always I will be providing insights, significance, and potential questions for each part of the data dive.

Linear Model

Here is my linear model:

lm_playoffs <- lm(Playoffs ~ FG_per_100 + AST_per_100 + TOV_per_100,
                  data = NBA_lm)

summary(lm_playoffs)
## 
## Call:
## lm(formula = Playoffs ~ FG_per_100 + AST_per_100 + TOV_per_100, 
##     data = NBA_lm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1030 -0.4528  0.1887  0.4216  0.9469 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.524542   0.331941  -7.605 5.19e-14 ***
## FG_per_100   0.056589   0.008925   6.340 3.09e-10 ***
## AST_per_100  0.036420   0.007431   4.901 1.06e-06 ***
## TOV_per_100 -0.006382   0.007025  -0.908    0.364    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4704 on 1398 degrees of freedom
## Multiple R-squared:  0.1102, Adjusted R-squared:  0.1082 
## F-statistic: 57.69 on 3 and 1398 DF,  p-value: < 2.2e-16

Insights, Significance, and Questions

Some key insights I gathered:

The R-squared value is 0.1102, which means the model explains about 11% of the variation in playoff outcomes. That is a fairly low share, but it makes sense given how many factors go into making the playoffs and how poorly a linear model fits a binary response (explained in the diagnostic plots).

The overall F-test p-value is still very low (< 2.2e-16). This means the model as a whole is statistically significant even though it explains little of the variation in making the playoffs.
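As a quick check on where that p-value comes from, the F-statistic and degrees of freedom printed in the summary can be passed to pf(). This is only a sketch using the rounded values copied from the output above. The summary prints "< 2.2e-16" because the true value falls below R's machine precision, not because the p-value equals that number.

```r
# Values copied from the summary() output above (rounded as printed).
f_stat   <- 57.69
df_model <- 3
df_resid <- 1398

# Upper-tail probability of the F distribution = overall model p-value.
p_value <- pf(f_stat, df_model, df_resid, lower.tail = FALSE)
p_value

# summary() prints "< 2.2e-16" whenever the p-value is below this threshold.
.Machine$double.eps
p_value < .Machine$double.eps
```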

Some further questions I would have: What other variables should be considered? What other types of models would be good to test?

Diagnostics

Plot 1: Residuals vs Fitted Values

plot(lm_playoffs$fitted.values,
     lm_playoffs$residuals,
     xlab = "Fitted Values",
     ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0)

This plot shows a clear linear pattern rather than a random scatter. Because Playoffs only takes the values 0 and 1, the residuals fall along two parallel lines. This indicates violations of linearity and constant variance, which is expected when using a linear model with a binary outcome.
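The two-line pattern can be reproduced with a small sketch. The data below are simulated purely for illustration (they are not the NBA data), and the names x, y, and fit are hypothetical. Since a residual is the observed value minus the fitted value, a 0/1 response forces every residual onto one of two lines.

```r
# Simulated data for illustration only; not the NBA data set.
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(x))  # binary 0/1 outcome

fit <- lm(y ~ x)
res <- residuals(fit)
fv  <- fitted(fit)

# With a 0/1 response, residual = y - fitted, so every point lies on
# one of two lines: -fitted (when y = 0) or 1 - fitted (when y = 1).
all.equal(unname(res[y == 0]), unname(-fv[y == 0]))
all.equal(unname(res[y == 1]), unname(1 - fv[y == 1]))

plot(fv, res, col = ifelse(y == 1, "blue", "red"),
     xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0)
```

The same structure appears no matter which predictors are used, so the pattern says more about the binary response than about any one variable.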

Plot 2: Residuals vs Each X Variable

plot(NBA_lm$FG_per_100, lm_playoffs$residuals,
     main = "Residuals vs FG_per_100")

plot(NBA_lm$AST_per_100, lm_playoffs$residuals,
     main = "Residuals vs AST_per_100")

plot(NBA_lm$TOV_per_100, lm_playoffs$residuals,
     main = "Residuals vs TOV_per_100")

The residual plots for field goals made and assists show tight, linear bands rather than a random scatter. This suggests that the model might not fully capture the relationship between these predictors and playoff probability. The turnovers plot is also fairly clustered, but not nearly as linear or clustered as the other two. This may mean the linear form is less of a problem for turnovers, although turnovers is not a significant predictor in this model.

Plot 3: Correlation Heatmap

vars <- NBA_lm[, c("FG_per_100", "AST_per_100", "TOV_per_100")]

cor_matrix <- cor(vars)

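# symm = TRUE treats the matrix as symmetric; scale = "none" plots the raw
# correlations instead of heatmap()'s default row scaling
heatmap(cor_matrix, symm = TRUE, scale = "none")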

The heatmap shows some correlation between field goals made and assists, which points to possible multicollinearity. The other pairs of variables do not show strong correlations. This could mean there is some redundancy between made field goals and assists.
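A heatmap gives a visual impression, but a variance inflation factor (VIF) puts a number on it. The sketch below computes VIFs by hand in base R. The predictors are simulated for illustration only, with x1 and x2 deliberately correlated, and the names preds and vif_manual are hypothetical.

```r
# Simulated predictors for illustration only; x1 and x2 are built to be correlated.
set.seed(2)
n  <- 300
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.5)
x3 <- rnorm(n)
preds <- data.frame(x1, x2, x3)

# VIF for predictor j = 1 / (1 - R^2), where R^2 comes from regressing
# predictor j on all of the other predictors.
vif_manual <- sapply(names(preds), function(v) {
  others <- setdiff(names(preds), v)
  r2 <- summary(lm(reformulate(others, response = v), data = preds))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 2)

# Equivalent shortcut: the diagonal of the inverse correlation matrix.
round(diag(solve(cor(preds))), 2)
```

A common rule of thumb treats VIFs above about 5 as a warning sign. Applied to this data dive, the same shortcut would be diag(solve(cor_matrix)).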

Plot 4: Residual Histogram

hist(lm_playoffs$residuals,
     main = "Histogram of Residuals",
     xlab = "Residuals")

The residuals are not normally distributed because the response variable is binary. This violates the normality assumption of linear regression, but it is expected given that the response variable is binary.

Plot 5: QQ Plot

qqnorm(lm_playoffs$residuals)
qqline(lm_playoffs$residuals)

The points deviate from the reference line in the plot, which confirms that the residuals are not normally distributed. This further indicates that the linear model is not ideal for this type of binary outcome.

Plot 6: Cook’s Distance

plot(cooks.distance(lm_playoffs),
     type = "h",
     main = "Cook's Distance")

abline(h = 4/nrow(NBA_lm), col = "red")

Only a few observations rise above the red line, which marks the 4/n rule-of-thumb cutoff. One in particular is much higher and might have a disproportionate influence on the model, but in general the influence values appear reasonable.
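To follow up on the points above the line, the flagged rows can be pulled out by index instead of read off the plot. This is a minimal sketch on simulated data (not the NBA data) with one deliberately extreme point planted in it. The names fit, cd, cutoff, and flagged are hypothetical.

```r
# Simulated data for illustration only; one point is made deliberately extreme.
set.seed(3)
x <- rnorm(100)
y <- 1 + 0.5 * x + rnorm(100)
x[100] <- 6
y[100] <- -5                 # planted outlier with high leverage

fit    <- lm(y ~ x)
cd     <- cooks.distance(fit)
cutoff <- 4 / nobs(fit)      # same 4/n rule of thumb as the red line

# Row indices above the cutoff, sorted with the largest influence first.
flagged <- which(cd > cutoff)
sort(cd[flagged], decreasing = TRUE)
```

Refitting the model without the top flagged row and comparing coefficients is one way to see whether that observation actually changes the conclusions.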

Insights, Significance, and Questions

The insight I gathered from these diagnostic plots is that modeling a binary outcome with a linear model is not a great idea. It leads to non-normal residuals, heteroskedasticity, and the possibility of predictions outside the 0 to 1 range. This is significant because it shows me that I should avoid using linear models for this binary playoffs variable. That leads me to ask which other models would be a better fit. Perhaps logistic regression?
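The out-of-range issue and the logistic alternative can both be seen in a short sketch. The data are simulated for illustration only (not the NBA data), and the object names lpm and logit are hypothetical.

```r
# Simulated data for illustration only; not the NBA data set.
set.seed(4)
x <- rnorm(500, sd = 2)
y <- rbinom(500, size = 1, prob = plogis(1.5 * x))

lpm   <- lm(y ~ x)                       # linear model on a 0/1 response
logit <- glm(y ~ x, family = binomial)   # logistic regression

# The linear model can predict "probabilities" below 0 or above 1 ...
range(fitted(lpm))
# ... while logistic fitted values stay between 0 and 1.
range(fitted(logit))

# exp() of a logistic coefficient is an odds ratio per one-unit change in x.
exp(coef(logit))
```

For the playoff data, the analogous call would swap in the real formula and data, for example glm(Playoffs ~ FG_per_100 + AST_per_100 + TOV_per_100, family = binomial, data = NBA_lm).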

Interpreting Coefficients

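round(coef(lm_playoffs), 6)
## (Intercept)  FG_per_100 AST_per_100 TOV_per_100 
##   -2.524542    0.056589    0.036420   -0.006382

Because this is a linear model, the coefficients are read directly as changes in predicted playoff probability. Exponentiating them, as one would to get odds ratios from a logistic regression, does not apply here. I will provide an interpretation for the FG_per_100 and AST_per_100 coefficients as they are the only statistically significant predictors in this model.

The FG_per_100 coefficient shows that each additional made field goal per 100 possessions is associated with about a 5.7 percentage point increase in predicted playoff probability, holding assists and turnovers constant.

The AST_per_100 coefficient shows that each additional assist per 100 possessions is associated with about a 3.6 percentage point increase in predicted playoff probability, holding field goals made and turnovers constant.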

Insights, Significance, and Questions

The insight I gather is that in this model field goals made and assists are both significant and show signs of explaining why a team might make the playoffs. However, I also noticed that the turnovers variable is very significant in the logistic regression model but is not significant in this linear model. This matters for deciding which model might be a better fit for different variables in my data. My question would then be: Which variables work best with which models?