Data Dive Week 9 - Regression Diagnostic

Start by setting up the packages to manipulate data.

suppressPackageStartupMessages({
  library(tidyverse)
  library(rio)
  library(boot)
  library(broom)
  library(car)
  library(GGally)
  library(ggrepel)
  library(lindia)
  source("aptheme.R") #Code that helps format graphs
  })

Import data

data <- import("plays.csv") %>%
  filter(!is.na(dropbackDistance) & 
           !is.na(timeInTackleBox))

Start with the same model from last week:

model <- lm(yardsGained ~ dropbackDistance, data)
summary(model)
## 
## Call:
## lm(formula = yardsGained ~ dropbackDistance, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -75.097  -6.682  -2.348   3.664  91.016 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       5.93356    0.20744  28.603  < 2e-16 ***
## dropbackDistance  0.24535    0.05326   4.607 4.15e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.788 on 8915 degrees of freedom
## Multiple R-squared:  0.002375,   Adjusted R-squared:  0.002263 
## F-statistic: 21.22 on 1 and 8915 DF,  p-value: 4.147e-06

Adding in a binary variable for pass completion:

data <- data %>%
  mutate(completed_pass = passResult == "C")
model <- lm(yardsGained ~ dropbackDistance + completed_pass, data)
summary(model)
## 
## Call:
## lm(formula = yardsGained ~ dropbackDistance + completed_pass, 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.949  -4.252  -0.005   1.423  86.438 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -2.42266    0.21204  -11.43   <2e-16 ***
## dropbackDistance    0.53940    0.04389   12.29   <2e-16 ***
## completed_passTRUE 11.67598    0.17698   65.97   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.024 on 8914 degrees of freedom
## Multiple R-squared:  0.3297, Adjusted R-squared:  0.3295 
## F-statistic:  2192 on 2 and 8914 DF,  p-value: < 2.2e-16

Here we can see that adding completion percentages drastically increases the share of the variance in yardsGained that is explained by the model. We also now need to check that the dropback distance variable is actually explaining anything, so we run the model below with that term removed.

model <- lm(yardsGained ~ completed_pass, data)
summary(model)
## 
## Call:
## lm(formula = yardsGained ~ completed_pass, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -71.992  -3.992   0.463   0.463  87.008 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -0.4634     0.1410  -3.287  0.00102 ** 
## completed_passTRUE  11.4551     0.1775  64.521  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.091 on 8915 degrees of freedom
## Multiple R-squared:  0.3183, Adjusted R-squared:  0.3182 
## F-statistic:  4163 on 1 and 8915 DF,  p-value: < 2.2e-16

Here we can see that without the dropback distance variable, the model only explains 21% of the variance in yards gained, rather than 31%, so it does make sense to keep both variables. Next let’s look at the model with the down of a given play added in.

model <- lm(yardsGained ~ dropbackDistance + completed_pass + down, data)
summary(model)
## 
## Call:
## lm(formula = yardsGained ~ dropbackDistance + completed_pass + 
##     down, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.958  -4.251  -0.023   1.447  86.281 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -2.71105    0.29963  -9.048   <2e-16 ***
## dropbackDistance    0.54285    0.04396  12.350   <2e-16 ***
## completed_passTRUE 11.69431    0.17748  65.890   <2e-16 ***
## down                0.13731    0.10080   1.362    0.173    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.023 on 8913 degrees of freedom
## Multiple R-squared:  0.3298, Adjusted R-squared:  0.3296 
## F-statistic:  1462 on 3 and 8913 DF,  p-value: < 2.2e-16

Here we can see that the adjusted R squared does not increase, meaning that the down does not help to explain the number of yards gained, and there does not appear to be a significant effect either. The coefficents for dropback distance and completed pass aldo do not change much. For these reasons we will leave it out.

Here we also add the variable timeInTackelBox to the model:

model <- lm(yardsGained ~ dropbackDistance + completed_pass + timeInTackleBox, data)
summary(model)
## 
## Call:
## lm(formula = yardsGained ~ dropbackDistance + completed_pass + 
##     timeInTackleBox, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.996  -4.150  -0.277   1.932  85.646 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -6.05962    0.33364 -18.162  < 2e-16 ***
## dropbackDistance    0.20720    0.04946   4.189 2.83e-05 ***
## completed_passTRUE 12.18528    0.17880  68.150  < 2e-16 ***
## timeInTackleBox     1.68515    0.12022  14.018  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.937 on 8913 degrees of freedom
## Multiple R-squared:  0.3441, Adjusted R-squared:  0.3439 
## F-statistic:  1559 on 3 and 8913 DF,  p-value: < 2.2e-16

Here we can see an increase in the Adjsted R squared, telling us that time spent in the tackle box helps to further explain the yards gained. It’s a pretty small increasee, so there is an argument for leaving it out, however the longer a quarterback has in the tackle box, the longer a play has to develop, the more likely the pass will be completed. Since there is a theoretical reason to include it in the model, I’m going to leave it alone.

Finally we check for multicolinearity:

vif(model)
## dropbackDistance   completed_pass  timeInTackleBox 
##         1.311538         1.053942         1.367888

Here we can see a little bit of correlation between model terms, but it’s well below 5 and pretty close to 1, suggesting there is only weak correlation and multicolinearity isn’t teribly worysome here. At this point, we can begin model evaluation

Evaluation of Model

We start evaluation with the residual vs fitted values:

gg_resfitted(model) +
  geom_smooth(se=FALSE) + 
  theme_ap(family = "sans")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## â„ı Please use `linewidth` instead.
## â„ı The deprecated feature was likely used in the lindia package.
##   Please report the issue at <https://github.com/yeukyul/lindia/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Here we’ve got a little concern about the variance of residuals. Instead of a constant random scatter of points, we see two clumps of data, the second larger clump that is funnel shaped. This means there could be an issue with heteroscedasticity, since the eror is higher for cases where yards gained is more than 10.

Next we look at the residuals vs x values

plots <- gg_resX(model, plot.all = FALSE)
## Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
## â„ı Please use `broom::augment(<lm>)` instead.
## â„ı The deprecated feature was likely used in the lindia package.
##   Please report the issue at <https://github.com/yeukyul/lindia/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
plots$dropbackDistance +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

plots$completed_pass +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

plots$timeInTackleBox +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Here there is nothing overly concerning. For the most part, the points look plotted randomly. There are definitely some outliers to contend with, and the timeInTackleBox graph is showing a bit of a hump in the middle, but there’s nothing heinous here.

Now we check correlations between explanatory variables

ggcorr(select(data, 
              dropbackDistance, 
              timeInTackleBox), label = TRUE) +
  labs(title='Correlation Heatmap') 

This is reflective of what we saw from VIF. There is some slight correlation between dropbackDistance and timeInTackleBox, but not enough to be terribly problematic.

gg_reshist(model)
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Here we can see a roughly normal distribution. There’s a little concern with the longer right hand tail, but nothing too worrisome. Next we look at the QQ plot:

gg_qqplot(model)

This is showing what looks like some pretty strong deviation from the theoretical normal distribution. The QQ plot is incredibly sensitive, so it’s maybe to sensitive to completely throw out the model, but it does give caution for values at the more extreme values.

Last we look at Cooks d:

gg_cooksd(model, threshold = 'matlab')

This is potentially the most concerning plot. It is showing more than just a handful of observations that have significant influence on the model.

Model Conclusions

From the model we can see the effect that a quarterback’s drop back distance, and time in the tackle box influence yards gained on a given play when controlling for whether a pass is completed or not. Every foot more a quarterback drops back is associated with a .20 yard pickup and for ever second the quarterback can stay in the tackle box, it is associated with an additional 1.69 yards. This highlights the importance of the offensive line players. The better blocking they provide, the longer the play has to develop, and the longer a play can develop, the more yards get picked up.

However, there is a big caveat with the model we have built here. It appears that the model has problems with values that are further away from the mean. There appear to be a lot of outliers in the data, but so many outliers that we cannot simply remove them. These will be a trick to handle.