DS 1870: Module 6 Homework - Multiple Linear Regression

Data description

ESPN judges a quarterback’s (QB) performance with a metric called Quarterback Rating (QBR), and how it is calculated is kept a secret. The qbr data.csv file has the QBR and game statistics for each quarterback’s game performances.

The columns in the csv file are:

qbr (response variable): The quarterback rating assigned by ESPN, between 0 and 100 (larger is better)
pts_added: An advanced metric that measures how many points the quarterback added compared to the “average” QB play. Measured in points; the higher the better. Negative values mean the QB performed below average, positive values mean above average.
intercepted: Whether the quarterback threw at least one interception during the game (no interceptions = “no”, at least 1 = “yes”)
yds_attempt: The number of yards gained from passes divided by the number of passes attempted. The larger the number, the better.

For this assignment, you’ll be using both points added and intercepted to predict ESPN’s QBR.

Question 1) Exploratory Data Analysis

Part 1a)

Create a single graph to display QBR, points added, and intercepted. Save it as gg_qbr. Describe any important characteristics for each variable.

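# Packages used throughout: ggplot2 and dplyr (via tidyverse), plus moderndive for
# get_regression_table(), get_regression_summaries(), and geom_parallel_slopes().
# qbr_df is assumed to already hold the contents of the csv file described above.
library(tidyverse)
library(moderndive)

gg_qbr <- 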
  ggplot(
    data = qbr_df,
    mapping = aes(
      x = pts_added,
      y = qbr,
      color = intercepted
    )
  ) + 
  
  geom_point(
    alpha = 0.5
  ) +
  
  labs(
    x = "Points added over expected",
    y = "ESPN's Quarterback Rating",
    color = "Did the QB throw\nan interception?"
  ) 

gg_qbr

The association between points added and QBR is strong and positive, without any obvious outliers.
Games where the quarterback didn’t throw an interception tend to have higher QBR and points added.

Part 1b) Correlation per intercepted group

Calculate the correlation for quarterbacks when they threw at least one interception and the correlation when they did not throw an interception. Are the correlations similar?

qbr_df |> 
  summarize(
    .by = intercepted,
    corr = cor(qbr, pts_added)
  )

##   intercepted      corr
## 1          no 0.9307264
## 2         yes 0.9400106

Yes, the correlations are both strong and nearly identical (about 0.93 and 0.94).

Question 2) The Interaction Model

Part 2a) Fitting the interaction model

Create the linear interaction model to predict QBR using points added and intercepted. Display the results using get_regression_table() or summary().

lm_int <- 
  lm(formula = qbr ~ pts_added * intercepted,
     data = qbr_df)

get_regression_table(lm_int)

## # A tibble: 4 × 7
##   term                    estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                      <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept                 50.6       0.209    243.     0       50.2     51    
## 2 pts_added                  6.16      0.054    115.     0        6.06     6.27 
## 3 intercepted: yes          -0.663     0.277     -2.39   0.017   -1.21    -0.119
## 4 pts_added:interceptedy…   -0.163     0.072     -2.25   0.025   -0.305   -0.021
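
One way to read this table, sketched with the rounded estimates shown above: since intercepted: yes acts as a 0/1 indicator, the interaction model amounts to one fitted line per group. The last two rows shift the intercept and the slope for games with at least one interception.

\[ \hat{qbr} = 50.6 + 6.16 \times \text{pts\_added} \quad \text{(no interception)} \]

\[ \hat{qbr} = (50.6 - 0.663) + (6.16 - 0.163) \times \text{pts\_added} = 49.937 + 5.997 \times \text{pts\_added} \quad \text{(at least one interception)} \]

The two slopes differ by only 0.163, which is worth keeping in mind when comparing this model to the additive one.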

Part 2b) Using the interaction model - prediction

Use the output from question 2a) to predict the QBR for Patrick Mahomes and Brock Purdy from the 2024 Super Bowl. You can round your answers to 1 decimal place.

Patrick Mahomes - 4.3 points added, 1 interception

\(\hat{qbr} = 50.6 + 6.2 \times 4.3 + (-0.663) + (-0.163) \times 4.3 =\) 75.9

Brock Purdy - 2.0 points added, 0 interceptions

\(\hat{qbr} = 50.6 + 6.2 \times 2.0 =\) 63
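
The same arithmetic can be checked in R. This is a minimal sketch that hard-codes the coefficients as they are rounded in the two calculations above; predict_qbr_int() is a hypothetical helper written for illustration, not part of any package.

```r
# Coefficients as rounded in the worked answers above (not the unrounded fit)
b0     <- 50.6    # intercept
b_pts  <- 6.2     # slope for pts_added
b_int  <- -0.663  # intercept shift when intercepted == "yes"
b_both <- -0.163  # slope change when intercepted == "yes"

# Hypothetical helper: predicted QBR from the interaction model
predict_qbr_int <- function(pts_added, intercepted) {
  yes <- as.numeric(intercepted == "yes")  # 0/1 indicator
  b0 + b_pts * pts_added + b_int * yes + b_both * pts_added * yes
}

mahomes_hat <- predict_qbr_int(4.3, "yes")
purdy_hat   <- predict_qbr_int(2.0, "no")

# Round to 1 decimal place, as the question allows
round(c(mahomes = mahomes_hat, purdy = purdy_hat), 1)
```

With the fitted model available, predict(lm_int, newdata = ...) would use the unrounded coefficients instead, so its results can differ slightly from the hand calculation. For example, using the tabled slope of 6.16 rather than 6.2 gives about 75.7 for Mahomes and 62.9 for Purdy.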

Part 2c) Mahomes’ and Purdy’s Residuals

Using your answers from question 2b), what are the residuals for Patrick Mahomes (actual QBR = 75.8) and Brock Purdy (actual QBR = 69.8)?

Patrick Mahomes:

\[ e = qbr - \hat{qbr} = 75.8 - 75.9 = -0.1\]

Brock Purdy:

\[e = qbr - \hat{qbr} = 69.8 - 63 = 6.8\]
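
As a quick check, the residuals can be computed in R as actual minus predicted. This sketch takes the predicted values from the answers to 2b) as given; the vector names are chosen for illustration only.

```r
# Actual QBR values stated in the question
actual <- c(mahomes = 75.8, purdy = 69.8)

# Predicted QBR values from the answers to 2b)
predicted <- c(mahomes = 75.9, purdy = 63.0)

# Residual = actual - predicted, rounded to 1 decimal place
residuals_qbr <- round(actual - predicted, 1)
residuals_qbr
```

A negative residual means the model over-predicted (Mahomes, very slightly), while a positive residual means the model under-predicted (Purdy, by 6.8 QBR points).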

Question 3) Additive Model

Part 3a) Fitting the additive model

Create the linear additive model to predict QBR using points added and intercepted. Display the results using get_regression_table() or summary().

lm_add <- 
  lm(formula = qbr ~ pts_added + intercepted,
     data = qbr_df)

get_regression_table(lm_add)

## # A tibble: 3 × 7
##   term             estimate std_error statistic p_value lower_ci upper_ci
##   <chr>               <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept          50.8       0.194    261.     0        50.4    51.1  
## 2 pts_added           6.07      0.036    168.     0         6.00    6.14 
## 3 intercepted: yes   -0.791     0.271     -2.92   0.004    -1.32   -0.259

Part 3b) Using the additive model - interpretation

Interpret the model terms in the context of the variables in the spaces between the hashtags below.

3b i) intercept

If a quarterback has 0 points added and throws 0 interceptions, the expected QBR is 50.8.

3b ii) pts_added

For every additional point added, the predicted QBR increases by 6.074 on average, keeping intercepted the same.

3b iii) intercepted: yes

If the quarterback throws at least one interception, the predicted QBR is about 0.8 lower on average than for a game with no interceptions, keeping points added the same.
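
These three interpretations can be summarized as two parallel lines, sketched here with the rounded estimates from the table in 3a) and treating intercepted: yes as a 0/1 indicator:

\[ \hat{qbr} = 50.8 + 6.07 \times \text{pts\_added} \quad \text{(no interception)} \]

\[ \hat{qbr} = (50.8 - 0.791) + 6.07 \times \text{pts\_added} = 50.009 + 6.07 \times \text{pts\_added} \quad \text{(at least one interception)} \]

Both lines share the slope 6.07, and the line for games with an interception sits 0.791 below the other at every value of points added.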

Question 4) Which Model: Interaction or Additive

Part 4a) Which Model: Graphs

Add the lines for the interaction model to gg_qbr. Do the same for the additive model. Which one (interaction or additive) appears to be the better choice? Justify your answer!

# Adding the lines for the interaction model:
gg_qbr +
  geom_smooth(
    method = "lm",
    se = FALSE, 
    formula = y ~ x
  )

# Adding the lines for the additive model
gg_qbr +
  
  geom_parallel_slopes(
    se = FALSE
  )

From the two graphs, the additive model appears to be the better choice: the interaction model’s lines are nearly parallel, so the two graphs look almost identical and the extra interaction term adds very little.

Part 4b) Which Model: Fit Statistics

Using the \(R^2\) values for both models, which one would you recommend, the interaction or the additive model? Make sure to justify your answer!

bind_rows(
  .id = "model",
  # Add your interaction model in the function below
  "interaction" = get_regression_summaries(lm_int),  
  # Add your additive model in the function below
  "additive" = get_regression_summaries(lm_add)
)

## # A tibble: 2 × 10
##   model  r_squared adj_r_squared   mse  rmse sigma statistic p_value    df  nobs
##   <chr>      <dbl>         <dbl> <dbl> <dbl> <dbl>     <dbl>   <dbl> <dbl> <dbl>
## 1 inter…     0.889         0.889  66.1  8.13  8.13    10788.       0     3  4047
## 2 addit…     0.889         0.889  66.2  8.13  8.14    16164.       0     2  4047

Since the \(R^2\) value is the same between the two models (to 3 decimal places), we should use the simpler additive model.
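
The adjusted \(R^2\) column tells the same story, and a short sketch shows why it matches \(R^2\) here. It uses the standard adjusted \(R^2\) formula with the rounded values from the output above, and it assumes the df column counts the predictor terms in each model. The function name adj_r2() is hypothetical.

```r
# Standard adjusted R^2: penalizes R^2 for the number of predictor terms p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n_obs <- 4047  # nobs column in the output above

# Rounded R^2 from the output; p taken from the df column (assumed meaning)
adj_interaction <- adj_r2(0.889, n_obs, p = 3)
adj_additive    <- adj_r2(0.889, n_obs, p = 2)

round(c(interaction = adj_interaction, additive = adj_additive), 3)
```

With more than 4,000 games, the penalty for one extra term is far too small to show up at 3 decimal places, so neither \(R^2\) nor adjusted \(R^2\) separates the two models.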