DS 1870: Module 5 Homework - Basic Regression

Data description

ESPN uses a metric called Quarterback Rating (QBR) to judge a quarterback’s (QB) performance, and how it is calculated is kept a secret. The qbr data.csv file has the QBR and game statistics for each quarterback's performance in each game.

The columns in the csv file are:

qbr (response variable): The quarterback rating assigned by ESPN. It ranges from 0 to 100 (larger is better).
pts_added: An advanced metric that measures how many points the quarterback added compared to the “average” QB play. Measured in points and the higher the better. Negative means performed below average, positive means performed above average
interceptions: How many interceptions the QB committed. The values are “none”, “one”, and “multiple”. More is bad, fewer interceptions is better.
yds_attempt: The number of yards gained from passes divided by the number of passes attempted. The larger the number, the better.

The games are split into two data sets:

qbr_df: The data set to use to build the linear regression models
qbr23: A data frame for the 2023 NFL season you’ll use to make predictions after building the model.

For this assignment, you’ll be using points added to predict ESPN’s QBR.

Question 1) Univariate Exploratory Data Analysis

Create the appropriate graphs to display QBR and points added individually. Describe any important characteristics for each variable.

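# Packages used throughout: tidyverse for wrangling and plotting,
# moderndive for the get_regression_*() helpers
library(tidyverse)
library(moderndive)

qbr_df |> 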
  pivot_longer(
    cols = c(qbr, pts_added),
    names_to = "stat",
    values_to = 'score'
  ) |> 
  ggplot(
    mapping = aes(
      x = score
    )
  ) + 
  
  geom_density(
    fill = "blue"
  ) +
  
  # Separate density plots for points added and qbr
  facet_wrap(
    facets = vars(stat),
    scales = "free"
  ) + 
  
  # Have the density curves sit on the x-axis
  scale_y_continuous(
    expand = c(0, 0, 0.05, 0)
  ) + 
  
  labs(x = NULL)

Both variables appear to be approximately unimodal and symmetric.

Question 2) Bivariate Exploratory Data Analysis

Part 2a) Graph

Create the appropriate graph to display the association between QBR and points added. Describe any important characteristics for each variable. Save the graph as gg_qbr and display it.

gg_qbr <- 
  ggplot(
    data = qbr_df,
    mapping = aes(
      x = pts_added,
      y = qbr
    )
  ) + 
  
  geom_point() + 
  
  labs(
    x = "Points Added Above the Average by the Quarterback",
    y = "ESPN's Quarterback Rating"
  )

gg_qbr

The association is positive and strong, with no clear outliers. The trend is roughly linear, though somewhat S-shaped.

Part 2b) Summary Stat

Calculate the appropriate summary of the association between points added and QBR. Does it indicate a strong association?

cor(
  x = qbr_df$pts_added,
  y = qbr_df$qbr
)

## [1] 0.9419234

Yes, the correlation is over 0.9, indicating a strong association!

Question 3) Fitting the model

Create the linear model to predict QBR using points added and display the results using get_regression_table() or summary(). Add the best fitting line to gg_qbr.

qbr_lm <- 
  lm(formula = qbr ~ pts_added,
     data = qbr_df)

get_regression_table(qbr_lm)

## # A tibble: 2 × 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept    50.4      0.137      367.       0    50.1     50.6 
## 2 pts_added     6.08     0.036      170.       0     6.01     6.15

# Adding the best fitting line to the scatterplot
gg_qbr +
  geom_smooth(
    method = "lm",
    se = FALSE,
    formula = y ~ x
  )

Question 4) Using the model

Part 4a) Interpreting the slope

Interpret the slope in context of the data:

For every additional point added above average, the predicted QBR increases by 6.08 points.

Part 4b) Interpreting the intercept

Interpret the intercept in context of the data:

If the points added above average is 0 (AKA, the average), the predicted QBR is 50.4
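As a rough sketch of how the slope and intercept combine into a prediction, the fitted line can be evaluated by hand. The coefficients below are the rounded values from the regression table. The pts_added values are hypothetical inputs chosen for illustration, not rows from the data.

```r
# Rounded coefficients copied from the get_regression_table() output
intercept <- 50.4
slope <- 6.08

# Hypothetical points-added values (illustrative only, not from qbr_df)
pts_added <- c(-2, 0, 2)

# Fitted line: predicted QBR = intercept + slope * points added
qbr_hat <- intercept + slope * pts_added
qbr_hat
```

Because the coefficients are rounded, these values will differ slightly from what predict() returns with the full-precision model.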

Part 4c) Using the model to predict the QBR for the 2023 season

Use the model you created to predict the QBR for the qbr23 data and create a data set that has the actual QBR and predicted QBR for the 2023 games. Display the first 10 rows.

qbr_pred <- 
  data.frame(
    qbr = qbr23$qbr,
    qbr_hat = predict(object = qbr_lm, newdata = qbr23)
  )

head(qbr_pred, n = 10)

##                      qbr  qbr_hat
## 2023_1_B.Purdy      93.4 84.42034
## 2023_1_T.Tagovailoa 85.9 90.50407
## 2023_1_J.Love       79.6 64.34402
## 2023_1_D.Prescott   75.6 60.69379
## 2023_1_J.Goff       68.3 57.65192
## 2023_1_T.Lawrence   66.2 59.47704
## 2023_1_P.Mahomes    66.2 74.68637
## 2023_1_D.Watson     61.7 63.12728
## 2023_1_B.Mayfield   58.3 57.04355
## 2023_1_D.Carr       56.5 61.30216

Create a graph that shows the predicted QBR on the x-axis and the actual QBR on the y-axis (this type of graph is called an R-squared plot). Add geom_smooth() to draw the line through the graph

ggplot(
  data = qbr_pred,
  mapping = aes(
    x = qbr_hat,
    y = qbr
  )
) + 
  geom_point() + 
  geom_smooth(
    method = "lm",
    se = FALSE,
    formula = y ~ x
  ) + 
  
  labs(
    x = "Predicted QBR",
    y = "Actual QBR",
    title = "The predicted and actual QBR for the 2023 season"
  )
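The plot shows the agreement between predicted and actual QBR visually. A minimal sketch of putting a number on it, assuming only the 10 rows printed by head(qbr_pred, n = 10), is below. The name qbr_top10 is hypothetical and introduced only for this illustration. These rows appear to be the highest-rated games of week 1, so they are not representative of the whole season.

```r
# The 10 rows shown by head(qbr_pred, n = 10), copied by hand
qbr_top10 <- data.frame(
  qbr = c(93.4, 85.9, 79.6, 75.6, 68.3, 66.2, 66.2, 61.7, 58.3, 56.5),
  qbr_hat = c(84.42034, 90.50407, 64.34402, 60.69379, 57.65192,
              59.47704, 74.68637, 63.12728, 57.04355, 61.30216)
)

# Prediction error = actual - predicted
pred_error <- qbr_top10$qbr - qbr_top10$qbr_hat

# Root mean squared error and squared correlation for these rows only
rmse_top10 <- sqrt(mean(pred_error^2))
r2_top10 <- cor(qbr_top10$qbr, qbr_top10$qbr_hat)^2

c(rmse = rmse_top10, r_squared = r2_top10)
```

Applying the same two calculations to the full qbr_pred data frame would summarize the whole 2023 season. Ten rows are too few to judge the model.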

Question 5) Model Diagnostics

Question 5a) Residual Plot

Create the residual plot for your linear model in the code chunk below

get_regression_points(qbr_lm) |> 
  ggplot(
    mapping = aes(
      x = pts_added,
      y = residual
    )
  ) +
  
  # Adding the points and making them somewhat see thru
  geom_point(alpha = 0.5) +
  
  # Adding a horizontal line at 0
  geom_hline(
    yintercept = 0,
    color = "red",
    linewidth = 1
  ) +
  
  labs(
    x = "Expected Points Added above Average",
    y = "Residuals",
    title = "Residual plot for QBR Model"
  )

Is the linear model appropriate to predict QBR from points added?

No, there is a noticeable curve in the residual plot.
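One possible contributor to the curve is that QBR is bounded between 0 and 100 while a straight line is not. A minimal sketch, using the rounded coefficients from the regression table, finds the points-added values at which the fitted line leaves that range. The variable names are hypothetical.

```r
# Rounded coefficients copied from the get_regression_table() output
intercept <- 50.4
slope <- 6.08

# Solve intercept + slope * x = bound for x at each end of the QBR scale
pts_at_qbr_0 <- (0 - intercept) / slope
pts_at_qbr_100 <- (100 - intercept) / slope

round(c(lower = pts_at_qbr_0, upper = pts_at_qbr_100), 2)
```

With these rounded coefficients, the line predicts a QBR below 0 for games under roughly -8.3 points added and above 100 for games over roughly 8.2. If any games fall outside that window, the line must miss them, which is consistent with an S-shaped trend.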

Part 5b) Fit Statistics

Calculate the two fit statistics for the linear model. How well does the model make predictions? Use the fit statistics to justify your answer

get_regression_summaries(qbr_lm)

## # A tibble: 1 × 9
##   r_squared adj_r_squared   mse  rmse sigma statistic p_value    df  nobs
##       <dbl>         <dbl> <dbl> <dbl> <dbl>     <dbl>   <dbl> <dbl> <dbl>
## 1     0.887         0.887  67.0  8.19  8.19    28950.       0     1  3682

The \(R^2\) value is high (about 0.89), meaning points added explains roughly 89% of the variation in QBR. The RMSE is about 8.2, so a typical prediction is about 8.2 points off the actual QBR. Overall the model predicts QBR fairly well.
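As a quick consistency check on the two fit statistics, the sketch below uses only numbers already reported in this document. In a simple linear regression with one predictor, \(R^2\) equals the squared correlation, and the RMSE is the square root of the MSE.

```r
# Correlation reported in Part 2b
r <- 0.9419234

# Squaring it should reproduce r_squared from get_regression_summaries()
r_squared <- r^2
round(r_squared, 3)

# MSE reported by get_regression_summaries(); its square root is the RMSE
mse <- 67.0
rmse <- sqrt(mse)
round(rmse, 2)
```

Both results agree with the reported values of 0.887 and 8.19.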