ESPN judges quarterback (QB) performance with a metric called Quarterback Rating (QBR), and how it is calculated is kept secret. The qbr game stats.csv file has the QBR and game statistics for all quarterback and game performances.
The columns in the qbr_df data set are:
response variable - qbr: The quarterback rating assigned by ESPN, which is always between 0 and 100 (Larger -> Better)
The potential explanatory variables are:
attempts: The number of passes thrown by the quarterback in the game
completions: The number of passes that were caught
incompletions: The number of passes that were not caught
td_pass: The number of touchdowns thrown
interceptions: The number of interceptions thrown
sacks: The number of times the quarterback was tackled before he could throw the ball
pass_yds: The number of total yards gained throughout the game from throwing the ball
sack_yds: The number of yards lost by being sacked (tackled before throwing the ball)
accuracy: The percentage of passes that were caught
yds_attempt: The average number of yards gained per attempt
yds_completion: The average number of yards gained per completion
fumbled: If the quarterback fumbled at least once in the game (“no” = did not fumble, “yes” = at least 1 fumble)
The data are split into two different data sets:
qbr_df: The data set you’ll use to build and diagnose the different models in 2a) and 4)
qbr_test: A data set you’ll use to measure the accuracy of the model in question 2b)
Based on the description of the 12 explanatory variables, will multicollinearity be a problem? If so, which variables will cause the multicollinearity problem? Justify your answer without using any code!
Because completions + incompletions = attempts, there is a perfect linear relationship among those three variables. Knowing any 2 of the 3 will tell you the third:
attempts = completions + incompletions
completions = attempts - incompletions
incompletions = attempts - completions
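The effect of this perfect relationship can be seen directly in R. The sketch below uses a small simulated data set (toy_df and all of its values are made up for illustration and are not taken from qbr_df). When all three variables are included, lm() cannot estimate a separate coefficient for the last one and reports NA for it.

```r
# Simulated data (hypothetical values, not the real qbr_df)
set.seed(1)
n <- 50
completions   <- rpois(n, lambda = 20)
incompletions <- rpois(n, lambda = 12)
attempts      <- completions + incompletions   # exact linear dependence

qbr <- 50 + 0.5 * completions - 0.8 * incompletions + rnorm(n, sd = 5)
toy_df <- data.frame(qbr, attempts, completions, incompletions)

# All three predictors together: the third carries no new information
toy_lm <- lm(qbr ~ attempts + completions + incompletions, data = toy_df)

# The coefficient for incompletions is NA because it is fully determined
# by attempts and completions
coef(toy_lm)
```

Dropping any one of the three variables removes the exact dependence, which is why only one of them needs to be excluded from the models.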
The graph below shows the correlation between qbr and each numeric explanatory variable. From the variables listed in your answer in question 1a, which variable should be removed, even before any linear models are run? Again, justify your answer!
Since we need to remove one of the three variables (attempts/completions/incompletions), we want to remove the one that has the lowest correlation with QBR (the response), which is attempts.
In the code chunk below, fit the following four linear models with the corresponding names and explanatory variables listed:
qbr_lm_all = everything but attempts
qbr_lm7 = td_pass + accuracy + interceptions + yds_attempt + pass_yds + sack_yds + fumbled
qbr_lm3 = td_pass + accuracy + pass_yds
qbr_lm1 = td_pass
# qbr_lm_all
qbr_lm_all <- lm(formula = qbr ~ . - attempts, data = qbr_df)
# qbr_lm7
qbr_lm7 <- lm(formula = qbr ~ td_pass + accuracy + interceptions + yds_attempt +
pass_yds + sack_yds + fumbled,
data = qbr_df)
# qbr_lm3
qbr_lm3 <- lm(formula = qbr ~ td_pass + accuracy + pass_yds,
data = qbr_df)
# qbr_lm1
qbr_lm1 <- lm(formula = qbr ~ td_pass,
data = qbr_df)
If done properly, the code chunk below should run
model | n_predictors | r.squared | sigma |
---|---|---|---|
qbr_lm_1 | 1 | 0.2760108 | 21.00915 |
qbr_lm_3 | 3 | 0.3958766 | 19.19653 |
qbr_lm_7 | 7 | 0.5762457 | 16.08615 |
qbr_lm_all | 11 | 0.5782819 | 16.05616 |
Using the output created in 2a i), which model should you use? Again, justify your answer!
We should use qbr_lm7 because it has a much higher \(R^2\) and lower sigma than qbr_lm1 and qbr_lm3. We should use it over qbr_lm_all because it fits almost as well as the more complex model (the \(R^2\) values are almost identical), so it’s not worth adding the complexity of the 4 additional predictors.
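A comparison table like the one used above could be assembled along these lines. This is only a sketch using base R: the built-in mtcars data stands in for qbr_df, and the names models and fit_stats are hypothetical. The original chunk that produced the table is not shown, so it may well have been written differently.

```r
# mtcars stands in for qbr_df here; the model names are hypothetical
models <- list(
  lm_1 = lm(mpg ~ wt, data = mtcars),
  lm_3 = lm(mpg ~ wt + hp + qsec, data = mtcars)
)

fit_stats <- data.frame(
  model        = names(models),
  # Number of estimated slopes (coefficients other than the intercept)
  n_predictors = sapply(models, function(m) length(coef(m)) - 1),
  r.squared    = sapply(models, function(m) summary(m)$r.squared),
  # Residual standard error reported by summary()
  sigma        = sapply(models, function(m) summary(m)$sigma),
  row.names    = NULL
)
fit_stats
```

Because the second model contains every predictor of the first, its r.squared can never be lower on the data used for fitting. That is why a small gain in \(R^2\) alone is weak evidence for keeping extra predictors.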
Using the models you created in 2a i), predict the qbr for the games in the qbr_test data set. You can predict the results for a new data set using the predict() function, which requires 2 arguments:
object = the model used to make predictions (the different lm objects)
newdata = the data set you want to make predictions for
Combine these predictions into a data set named qbr_pred that has 5 columns:
qbr: The actual qbr for the games in the qbr_test data set
qbr_all: The predicted qbr using the qbr_lm_all model
qbr_7: The predicted qbr using the qbr_lm7 model
qbr_3: The predicted qbr using the qbr_lm3 model
qbr_1: The predicted qbr using the qbr_lm1 model
qbr_pred <-
data.frame(
qbr = qbr_test$qbr,
qbr_all = predict(object = qbr_lm_all, newdata = qbr_test),
qbr_7 = predict(object = qbr_lm7 , newdata = qbr_test),
qbr_3 = predict(object = qbr_lm3 , newdata = qbr_test),
qbr_1 = predict(object = qbr_lm1 , newdata = qbr_test)
)
tibble(qbr_pred)
## # A tibble: 369 × 5
## qbr qbr_all qbr_7 qbr_3 qbr_1
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 93.4 65.5 66.2 58.0 60.1
## 2 85.9 82.1 80.3 69.3 71.3
## 3 79.6 73.4 73.2 59.4 71.3
## 4 75.6 50.2 49.4 31.3 37.7
## 5 68.3 64.9 65.1 51.3 48.9
## 6 66.2 61.4 62.6 68.4 60.1
## 7 66.2 55.5 54.9 49.3 60.1
## 8 61.7 39.9 41.8 40.6 48.9
## 9 58.3 61.5 61.8 55.1 60.1
## 10 56.5 54.0 55.0 56.9 48.9
## # ℹ 359 more rows
Calculate the \(R^2\), sigma, and mean absolute error (MAE) of the test predictions for each of the 4 models. You can either calculate them individually and put them together in a data set, or you can use pivot_longer() to “shorten” the code required!
sigma is: \[\textrm{sigma} = \sqrt{\frac{\sum(y - \hat{y})^2}{n}}\]
The MAE is: \[\textrm{MAE} = \frac{\sum|y - \hat{y}|}{n}\]
and the absolute value function in R is abs()
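As a quick check of the two formulas, here is a small worked example on made-up numbers (the vectors y and y_hat are hypothetical, not values from qbr_test):

```r
# Hypothetical actual and predicted values
y     <- c(10, 20, 30, 40)
y_hat <- c(12, 18, 33, 35)

errors <- y - y_hat                 # -2, 2, -3, 5

# sigma: square root of the mean squared error
sigma_test <- sqrt(mean(errors^2))  # sqrt((4 + 4 + 9 + 25) / 4) = sqrt(10.5)

# MAE: mean of the absolute errors
mae_test <- mean(abs(errors))       # (2 + 2 + 3 + 5) / 4 = 3

c(sigma = sigma_test, MAE = mae_test)
```

sigma squares the errors before averaging, so the single error of 5 pulls it above the MAE. The two measures only agree when every error has the same size.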
qbr_pred |>
pivot_longer(
cols = qbr_all:qbr_1,
names_to = "model",
values_to = "qbr_hat"
) |>
summarize(
.by = model,
r.squared = cor(qbr, qbr_hat)^2,
sigma = sqrt(mean((qbr - qbr_hat)^2)),
MAE = mean(abs(qbr - qbr_hat))
)
## # A tibble: 4 × 4
## model r.squared sigma MAE
## <chr> <dbl> <dbl> <dbl>
## 1 qbr_all 0.611 15.6 12.2
## 2 qbr_7 0.600 15.8 12.4
## 3 qbr_3 0.456 18.5 14.7
## 4 qbr_1 0.357 20.2 16.7
Using your results from the above code chunk, which model should you use?
We reach the same conclusion as we did using the fit statistics from the training data (qbr_df). The r.squared is much higher for qbr_7 than for qbr_3 or qbr_1, and almost the same as for qbr_all, indicating it is the best choice for accuracy and simplicity.
Regardless of your answers in previous questions, the remaining questions will use the qbr_lm_all model!
The model estimates are displayed in the code chunk below and you’ll be using them to answer parts a) and b)
## # A tibble: 12 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.80 11.2 0.339 0.735
## 2 completions -0.315 0.262 -1.2 0.23
## 3 incompletions 0.026 0.215 0.119 0.905
## 4 td_pass 5.80 0.273 21.3 0
## 5 interceptions -7.77 0.315 -24.7 0
## 6 sacks -1.44 0.355 -4.07 0
## 7 pass_yds 0.051 0.021 2.41 0.016
## 8 sack_yds -0.334 0.047 -7.04 0
## 9 accuracy 0.634 0.191 3.32 0.001
## 10 yds_attempt 1.25 1.23 1.01 0.31
## 11 yds_completion 0.11 0.668 0.164 0.87
## 12 fumbledyes -4.96 0.783 -6.33 0
Interpret the model estimate for qbr and accuracy :
For every additional 1 percentage point of accuracy, the predicted QBR increases by 0.634, when all other variables are the same (held constant).
Interpret the model estimate for qbr and fumbled :
QBR is about 5 points (4.96) lower, on average, when the QB fumbles at least once compared to when they do not fumble, when all other variables are the same.
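The two interpretations can be turned into simple arithmetic. In the sketch below the estimates are copied from the coefficient table, while the two games being compared (60% versus 70% accuracy, fumble versus no fumble) are hypothetical and chosen only for illustration.

```r
# Estimates copied from the coefficient table above
b_accuracy <- 0.634
b_fumbled  <- -4.96

# Two hypothetical games that are identical except for accuracy (60% vs 70%)
accuracy_gain <- 70 - 60
diff_accuracy <- accuracy_gain * b_accuracy   # 10 * 0.634 = 6.34
diff_accuracy

# Two hypothetical games identical except that the QB fumbled in one.
# fumbledyes is a 0/1 indicator, so the difference is the estimate itself
diff_fumbled <- (1 - 0) * b_fumbled           # -4.96
diff_fumbled
```

In both cases every other explanatory variable is assumed to be the same in the two games, which is what "held constant" means in the interpretation.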
You’ll be using the qbr_df data set for all parts of question 4.
Create just the residual plot for the qbr_lm_all model.
augment_columns(
x = qbr_lm_all,
data = qbr_df
) |>
ggplot(
mapping = aes(
x = .fitted,
y = .resid
)
) +
geom_point() +
labs(
x = "Predicted QBR",
y = "Residuals"
) +
geom_hline(
mapping = aes(yintercept = mean(.resid)),
color = "red",
linewidth = 1
)
Using the residual plot you created, which assumptions about our linear model below appear to be violated? If they’ve been violated, justify your answer.
Linear Assumption:
Violated. There is a clear downward trend in the residual plot, indicating that a line is not the best choice.
No outliers:
Not violated. There are no clear outliers in the residual plot (no very high or very low values).
Equal Spread (homoscedasticity):
Not violated. The spread of the points appears to be very consistent going from left to right of the residual plot.
The residual plot for each variable (except fumbled) is shown below. Is there evidence of any non-linear trends in any of the predictors? Justify your answer!
augment_columns(
x = qbr_lm_all,
data = qbr_df
) |>
dplyr::select(completions:yds_completion, .resid) |>
pivot_longer(
cols = -.resid,
names_to = "predictor",
values_to = "stat"
) |>
mutate(predictor = as_factor(predictor)) |>
ggplot(
mapping = aes(
x = stat,
y = .resid
)
) +
geom_point(alpha = 0.25) +
geom_hline(
mapping = aes(yintercept = mean(.resid)),
color = "red",
linewidth = 1
) +
geom_smooth(
method = "loess",
se = FALSE,
formula = y ~ x,
color = "steelblue",
linewidth = 1
) +
facet_wrap(
facets = vars(predictor),
scales = "free_x",
nrow = 5
) +
labs(
x = NULL,
y = "Residuals"
)
No, the individual residual plots all look like what you’d expect to see when the linearity condition is met.
Not required:
The issue with using a linear model to predict QBR is that QBR is bounded (between 0 and 100), but the predicted QBR can be anything (it can be negative or over 100). Because of the boundedness of the response variable, our linear model is not appropriate, even though none of the individual variables appear to have a non-linear relationship.
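The boundedness problem can be illustrated with simulated data. This sketch does not use QBR at all: x and y are hypothetical, with y constructed so that it always stays strictly between 0 and 100, and a straight line is then fit to it.

```r
# Simulated bounded response (hypothetical data, not QBR)
x <- seq(-3, 3, by = 0.1)
y <- 100 * plogis(2 * x)     # S-shaped curve that stays inside (0, 100)

bounded_lm <- lm(y ~ x)
y_hat <- predict(bounded_lm)

range(y)       # stays inside 0 to 100
range(y_hat)   # the fitted line goes below 0 and above 100

# A blunt workaround: clamp predictions to the valid range
y_hat_clamped <- pmin(pmax(y_hat, 0), 100)
range(y_hat_clamped)
```

Clamping keeps the predictions valid but does not fix the underlying mismatch between a straight line and a bounded response. It is shown here only to make the issue concrete.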