DS 2870: Homework 7 - Multiple Linear Regression

Data Description:

ESPN has a metric it uses to judge quarterback (QB) performance called Quarterback Rating, QBR, and how it is calculated is kept a secret. The qbr game stats.csv file has the QBR rating and game statistics for all quarterback and game performances.

The columns in the qbr_df data set are:

response variable - qbr: The quarterback rating assigned by ESPN and will be between 0 and 100 (Larger -> Better)

The potential explanatory variables are:

attempts: The number of passes thrown by the quarterback in the game
completions: The number of passes that were caught
incompletions: The number of passes that were not caught
td_pass: The number of touchdowns thrown
interceptions: The number of interceptions thrown
sacks: The number of times the quarterback was tackled before he could throw the ball
pass_yds: The number of total yards gained throughout the game from throwing the ball
sack_yds: The number of yards lost by being sacked (tackled before throwing the ball)
accuracy: The number of caught passes
yds_attempt: The average number of yards gained per attempt
yds_completion: The average number of yards gained per completion
fumbled: If the quarterback fumbled at least once in the game (“no” = did not fumble, “yes” = at least 1 fumble)

The data are split into two different data sets:

qbr_df: The data set you’ll use to build the different and diagnose the models in 2a) and 4)
qbr_test: A data set you’ll use to measure the accuracy of the model in question 2b)

Question 1) Multicollinearity Check

Question 1a) Potential Multicollinearity

Based on the description of the 12 explanatory variables, will multicollinearity be a problem? If so, which variables will cause the multicollinearity problem? Justify your answer without using any code!

Because completions + incompletions = accuracy, there is a perfect relationship between those three variables. Knowing 2 of the 3 will tell you the other:

attempts = completions + incompletions

completions = attempts - incompletions

incompletions = attempts - completions

Question 1b) Removing a predictor

The graph below shows the correlation between qbr and each numeric explanatory variable. From the variables listed in your answer in question 1a, which variable should be removed, even before any linear models are ran? Again, justify your answer!

Since we need to remove one of the three variables (attempts/completions/incompletions), we want to remove the one that has the lowest correlation with QBR (the response), which is attempts

Question 2) Finding a good model

Part 2a i) Fit four candidate models

In the code chunk below, fit the following four linear models with the corresponding names and explanatory variables listed:

qbr_lm_all = everything but attempts
qbr_lm7 = td_pass + accuracy + interceptions + yds_attempt + pass_yds + sack_yds + fumbled
qbr_lm3 = td_pass + accuracy + pass_yds
qbr_lm1 = td_pass

# qbr_lm_all
qbr_lm_all <- lm(formula = qbr ~ . - attempts, data = qbr_df)

# qbr_lm7
qbr_lm7 <- lm(formula = qbr ~ td_pass + accuracy + interceptions + yds_attempt +
                               pass_yds + sack_yds + fumbled, 
            data = qbr_df)

# qbr_lm3
qbr_lm3 <- lm(formula = qbr ~ td_pass + accuracy + pass_yds, 
              data = qbr_df)

# qbr_lm1
qbr_lm1 <- lm(formula = qbr ~ td_pass, 
              data = qbr_df)

If done properly, the code chunk below should run

model	n_predictors	r.squared	sigma
qbr_lm_1	1	0.2760108	21.00915
qbr_lm_3	3	0.3958766	19.19653
qbr_lm_7	7	0.5762457	16.08615
qbr_lm_all	11	0.5782819	16.05616

Part 2a ii) Best model of the four options

Using the output created in 2a i), which model should you use? Again, justify your answer!

We should use qbr_lm7 because it has a much higher \(R^2\) and lower \(sigma\) than qbr_lm1 and qbr_lm3.

We should use it over qbr_lm_all because it fits almost as well as the more complex model (the \(R^2\) values are almost identical), so it’s not worth adding the additional complexity of the 4 additional predictors.

Part 2b i) Making predictions with the models

Using the models you created in 2a i), predict the qbr for the games in the qbr_test data set. You can predict the results for a new data set using the predict() function, which requires 2 arguments:

object = the model used to make predictions (the different lm objects)
newdata = The data set you want to make predictions for.

Combine these predictions into a data set named qbr_pred that has 5 columns:

qbr: The actual qbr for the games in the qbr_test data set
qbr_all: The predicted qbr using the qbr_lm_all model
qbr_7: The predicted qbr using the qbr_lm7 model
qbr_3: The predicted qbr using the qbr_lm3 model
qbr_1: The predicted qbr using the qbr_lm1 model

qbr_pred <- 
  data.frame(
    qbr     = qbr_test$qbr,
    qbr_all = predict(object = qbr_lm_all, newdata = qbr_test),
    qbr_7   = predict(object = qbr_lm7   , newdata = qbr_test),
    qbr_3   = predict(object = qbr_lm3   , newdata = qbr_test),
    qbr_1   = predict(object = qbr_lm1   , newdata = qbr_test)
  )




tibble(qbr_pred)

## # A tibble: 369 × 5
##      qbr qbr_all qbr_7 qbr_3 qbr_1
##    <dbl>   <dbl> <dbl> <dbl> <dbl>
##  1  93.4    65.5  66.2  58.0  60.1
##  2  85.9    82.1  80.3  69.3  71.3
##  3  79.6    73.4  73.2  59.4  71.3
##  4  75.6    50.2  49.4  31.3  37.7
##  5  68.3    64.9  65.1  51.3  48.9
##  6  66.2    61.4  62.6  68.4  60.1
##  7  66.2    55.5  54.9  49.3  60.1
##  8  61.7    39.9  41.8  40.6  48.9
##  9  58.3    61.5  61.8  55.1  60.1
## 10  56.5    54.0  55.0  56.9  48.9
## # ℹ 359 more rows

Part 2b ii) Calculating the \(R^2\) and MAR for the test data

Calculate the \(R^2\), sigma, and mean absolute error (MAE) of the test predictions for each of the 4 models. You can either calculate them individual and put them together in a data set, or you can use pivot_longer() to “shorten” the code required!

sigma is: \[\textrm{sigma} = \sqrt{\frac{\sum(y - \hat{y})^2}{n}}\]

To calculate the MAE is: \[\textrm{MAE} = \frac{\sum|y - \hat{y}|}{n}\]

and the absolute function in R is abs()

qbr_pred |> 
  pivot_longer(
    cols = qbr_all:qbr_1,
    names_to = "model",
    values_to = "qbr_hat"
  ) |> 
  summarize(
    .by = model,
    r.squared = cor(qbr, qbr_hat)^2,
    sigma = sqrt(mean((qbr - qbr_hat)^2)),
    MAE = mean(abs(qbr - qbr_hat))
  )

## # A tibble: 4 × 4
##   model   r.squared sigma   MAE
##   <chr>       <dbl> <dbl> <dbl>
## 1 qbr_all     0.611  15.6  12.2
## 2 qbr_7       0.600  15.8  12.4
## 3 qbr_3       0.456  18.5  14.7
## 4 qbr_1       0.357  20.2  16.7

Using your results from the above code code chunk, which model should you use?

We have the same result as using the fit stats from the full model. The r.squared is much higher for qbr_7 than qbr_3 or qbr_1, and almost the same as qbr_all, indicating it is the best choice for accuarcy and simplicity.

Question 3) Using the model

Regardless of your answers in previous questions, the remaining questions will use the qbr_lm_all model!

The model estimates are displayed in the code chunk below and you’ll be using them to answer parts a) and b)

## # A tibble: 12 × 5
##    term           estimate std.error statistic p.value
##    <chr>             <dbl>     <dbl>     <dbl>   <dbl>
##  1 (Intercept)       3.80     11.2       0.339   0.735
##  2 completions      -0.315     0.262    -1.2     0.23 
##  3 incompletions     0.026     0.215     0.119   0.905
##  4 td_pass           5.80      0.273    21.3     0    
##  5 interceptions    -7.77      0.315   -24.7     0    
##  6 sacks            -1.44      0.355    -4.07    0    
##  7 pass_yds          0.051     0.021     2.41    0.016
##  8 sack_yds         -0.334     0.047    -7.04    0    
##  9 accuracy          0.634     0.191     3.32    0.001
## 10 yds_attempt       1.25      1.23      1.01    0.31 
## 11 yds_completion    0.11      0.668     0.164   0.87 
## 12 fumbledyes       -4.96      0.783    -6.33    0

Part 3a) Model interpretations: Accuracy

Interpret the model estimate for qbr and accuracy :

For every additional 1% in accuracy, the predicted QBR increases by 0.634, when all other variables are the same (held constant)

Part 3b) Model interpretations: Fumbled

Interpret the model estimate for qbr and fumbled :

QBR is 5 lower, on average, when the QB fumbles compared to when they do not fumble, when all over variables are the same

Question 4) Model diagnostics

You’ll be using the qbr_df data set for all parts of question 4.

Part 4a) Overall Residual Plot

Create just the residual plot for the qbr_lm_all model.

augment_columns(
  x = qbr_lm_all,
  data = qbr_df
) |> 
  ggplot(
    mapping = aes(
      x = .fitted,
      y = .resid
    )
  ) +
  
  geom_point() + 
  
  labs(
    x = "Predicted QBR",
    y = "Residuals"
  )+
  
  geom_hline(
    mapping = aes(yintercept = mean(.resid)),
    color = "red",
    linewidth = 1
  )

Using the residual plot you created, which assumptions about our linear model below appear to be violated? If they’ve been violated, justify your answer

Linear Assumption:

No, there is a clear downward trend in the residual plot, indicating that a line is not the best choice.

No outliers:

Yes, there are no clear outliers in the residual plot (No very high or very low values)

Equal Spread (homoscedasticity):

Yes, the spread of the points appears to be very consistent going from left to right of the residual plot

Part 4b) Individual Residual Plots

The residual plot for each variable (expect fumbled) is shown below. Is there evidence of any non-linear trends in any of the predictors? Justify your answer!

augment_columns(
  x = qbr_lm_all,
  data = qbr_df
) |> 
  dplyr::select(completions:yds_completion, .resid) |> 
  pivot_longer(
    cols = -.resid,
    names_to = "predictor",
    values_to = "stat"
  ) |> 
  
  mutate(predictor = as_factor(predictor)) |> 
  
  ggplot(
    mapping = aes(
      x = stat,
      y = .resid
    )
  ) +
  
  geom_point(alpha = 0.25) + 
  
  geom_hline(
    mapping = aes(yintercept = mean(.resid)),
    color = "red",
    linewidth = 1
  ) +
  
  geom_smooth(
    method = "loess",
    se = F,
    formula = y ~ x,
    color = "steelblue",
    linewidth = 1
  ) +
  
  facet_wrap(
    facets = vars(predictor),
    scales = "free_x",
    nrow = 5
  ) + 
  
  labs(
    x = NULL,
    y = "Residuals"
  )

No, the individual residual plots all look like what you’d expect to see when the linearity condition is met.

Not required:

The issue with using a linear model to predict QBR is that QBR is bounded (between 0 and 100), but the predicted QBR can be anything (can be negative, could be over 100). Because of boundedness of the response variable, our linear model is not appropriate, even tho none of the individuals variables appear to have a non-linear relationship.