ESPN uses a metric called Quarterback Rating (QBR) to judge quarterback (QB) performance, and how it is calculated is kept secret. The qbr game stats.csv file has the QBR rating and game statistics for each quarterback's performance in each game.
The columns in the qbr_df data set are:
response variable - qbr: The quarterback rating assigned by ESPN; it is between 0 and 100 (larger = better)
The potential explanatory variables are:
td_pass: The number of touchdowns thrown
interceptions: The number of interceptions thrown
sacks: The number of times the quarterback was tackled before he could throw the ball
pass_yds: The number of total yards gained throughout the game from throwing the ball
sack_yds: The number of yards lost by being sacked (tackled before throwing the ball)
accuracy: The number of caught passes
yds_attempt: The average number of yards gained per attempt
yds_completion: The average number of yards gained per completion
fumbled: If the quarterback fumbled at least once in the game (“no” = did not fumble, “yes” = at least 1 fumble)
The data are split into two different data sets:
qbr_df: The data set you’ll use to train the different machine learning methods
qbr_test: A data set you’ll use to measure the accuracy of the methods
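The code in the remaining sections assumes the needed packages were loaded in a setup chunk that isn't shown here; a minimal sketch of that setup:
# Assumed setup chunk (not shown in the original document)
library(tidyverse)   # pivot_longer(), dplyr verbs, and ggplot2
library(FNN)         # knn.reg() for kNN regression
library(rpart)       # rpart() and prune() for regression trees
library(rpart.plot)  # rpart.plot() for drawing the pruned tree
# caret is called below with caret::varImp(), so attaching it is optional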
Create a graph of small multiples that has the QBR on the y-axis and the 8 numeric predictors on the x-axis. Which variables appear to have a somewhat strong association with QBR?
qbr_df |>
  # Stacking the 8 numeric explanatory columns
  pivot_longer(
    cols = td_pass:yds_completion,
    names_to = "expl",
    values_to = "value"
  ) |>
  # Creating the graph of small multiples
  ggplot(
    mapping = aes(
      x = value,
      y = qbr
    )
  ) +
  geom_point(
    alpha = 0.15
  ) +
  # Adding a loess trend line
  geom_smooth(
    method = "loess",
    se = FALSE,
    formula = y ~ x
  ) +
  # Creating the small multiples
  facet_wrap(
    facets = vars(expl),
    scales = "free_x",
    ncol = 4
  )
Most of the numeric variables appear to have at least some association with QBR.
Briefly explain why the fumbled variable can’t be used as a predictor of QBR using kNN regression.
Since kNN uses distance to determine which neighbors are the closest (nearest), we can only use numeric predictors, and fumbled is categorical (we can't subtract "yes" from "no").
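For reference, kNN regression typically measures "closest" with the Euclidean distance between the rows of predictor values,
\[ d(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{\sum_{p=1}^{P} \left(x_{ip} - x_{jp}\right)^2} \]
and a difference like \(x_{ip} - x_{jp}\) is undefined when the two values are "yes" and "no".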
Using all of the numeric predictor variables, find the best choice of \(k\) and the corresponding rescaling method using \(R^2\). Search k = 10 to k = 150. Display the results as a line graph of the \(R^2\) values, with one line for the normalized data and one for the standardized data.
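The two rescaling methods compared below are min-max normalization and z-score standardization; for a numeric column \(x\) they are
\[ x_{\text{norm}} = \frac{x - \min(x)}{\max(x) - \min(x)}, \qquad x_{\text{stan}} = \frac{x - \bar{x}}{s_x}, \]
where \(\bar{x}\) and \(s_x\) are the sample mean and standard deviation of the column.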
### Normalizing the data
qbr_norm <-
  qbr_df |>
  # Removing the categorical column (fumbled)
  dplyr::select(-fumbled) |>
  mutate(
    across(
      .cols = td_pass:yds_completion,
      .fns = ~ (. - min(.)) / (max(.) - min(.))
    )
  )
### Standardizing the data
qbr_stan <-
  qbr_df |>
  # Removing the categorical column (fumbled)
  dplyr::select(-fumbled) |>
  mutate(
    across(
      .cols = td_pass:yds_completion,
      .fns = ~ (. - mean(.)) / sd(.)
    )
  )
### for loop set up
# Vector of k values to search through
k_vec <- 10:150
# Preallocating a data frame to save the results in
fit_stats <-
  data.frame(
    k = k_vec,
    R2_norm = rep(-1, length(k_vec)),
    R2_stan = rep(-1, length(k_vec))
  )
# Conducting the for loop
for (i in 1:nrow(fit_stats)){
  # Finding R2 for the normalized data
  loop_norm <-
    knn.reg(
      train = qbr_norm |> dplyr::select(-qbr),
      y = qbr_norm$qbr,
      k = k_vec[i]
    )
  # Saving the R2 value
  fit_stats[i, "R2_norm"] <- loop_norm$R2Pred
  # Finding R2 for the standardized data
  loop_stan <-
    knn.reg(
      train = qbr_stan |> dplyr::select(-qbr),
      y = qbr_stan$qbr,
      k = k_vec[i]
    )
  # Saving the R2 value
  fit_stats[i, "R2_stan"] <- loop_stan$R2Pred
}
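A note on the fit statistic being saved: when no test set is supplied, knn.reg() performs leave-one-out cross-validation, and the R2Pred component is the predictive \(R^2\) based on the PRESS statistic,
\[ R^2_{\text{pred}} = 1 - \frac{\sum_{i}\left(y_i - \hat{y}_{(i)}\right)^2}{\sum_{i}\left(y_i - \bar{y}\right)^2}, \]
where \(\hat{y}_{(i)}\) is the prediction for game \(i\) when that game is left out of the training set.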
fit_stats |>
  pivot_longer(
    cols = -k,
    names_to = "rescale",
    values_to = "R2"
  ) |>
  # Making the graph
  ggplot(
    mapping = aes(
      x = k,
      y = R2,
      color = rescale
    )
  ) +
  geom_line() +
  theme(legend.position = c(0.9, 0.85))
# Finding the best choice of k
fit_stats |>
  pivot_longer(
    cols = -k,
    names_to = "rescale",
    values_to = "R2"
  ) |>
  slice_max(
    R2,
    n = 1
  )
## # A tibble: 1 × 3
## k rescale R2
## <int> <chr> <dbl>
## 1 35 R2_norm 0.563
The best choice is k = 35 when rescaling the data by normalizing, with an \(R^2\) value of 0.563.
Regardless of your answer in the previous question, predict the QBR for the 369 games in the qbr_test data set when normalizing the data with k = 40. Display the results using an R-squared plot.
# Normalizing the qbr_test (2023 season) data
qbr23_norm <-
  qbr_test |>
  dplyr::select(-fumbled) |>
  mutate(
    across(
      .cols = td_pass:yds_completion,
      .fns = ~ (. - min(.)) / (max(.) - min(.))
    )
  )
# Predicting the QBR for the games in the qbr_test data
qbr23_knn <-
  knn.reg(
    train = qbr_norm |> dplyr::select(-qbr),
    test = qbr23_norm |> dplyr::select(-qbr),
    y = qbr_norm$qbr,
    k = 40
  )
qbr23_norm |>
  mutate(qbr_hat = qbr23_knn$pred) |>
  ggplot(
    mapping = aes(
      x = qbr_hat,
      y = qbr
    )
  ) +
  geom_point() +
  labs(
    x = "Predicted QBR",
    y = "Actual QBR",
    title = "Predicted vs Actual QBR for the 2023 Season using kNN"
  )
Is kNN accurate for the 2023 season?
It's not a great predictor: the graph shows a lot of spread between the predicted values and the actual values.
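To put a number on that impression, we can also compute the test-set \(R^2\) as the squared correlation between the actual and predicted QBR (the value itself isn't reproduced here):
# Squared correlation between actual and predicted QBR for the test games
qbr23_norm |>
  mutate(qbr_hat = qbr23_knn$pred) |>
  summarize(R2_knn = cor(qbr, qbr_hat)^2)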
Using the results of kNN, can you estimate the effect of yds_completion on QBR? If yes, interpret the results. If not, briefly explain why.
No. kNN is a lazy learner, which doesn't build a model; without a model, we can't learn how the method used each variable to predict the QBR.
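Concretely, the kNN prediction for a new game is just the average QBR of its \(k\) nearest neighbors,
\[ \hat{y}(\mathbf{x}_0) = \frac{1}{k} \sum_{i \in N_k(\mathbf{x}_0)} y_i, \]
so there are no coefficients (like slopes in linear regression) that describe how yds_completion changes the predicted QBR.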
Create the full regression tree predicting QBR using all 9 predictors (the 8 numeric variables and fumbled). Display the last 10 rows of the CP table.
# Leave this at the top
set.seed(1234)
qbr_tree_full <-
  rpart(
    formula = qbr ~ .,
    data = qbr_df,
    method = "anova",
    minsplit = 2,
    minbucket = 1,
    cp = 0
  )
qbr_tree_full$cptable |>
  data.frame() |>
  tail(n = 10)
## CP nsplit rel.error xerror xstd
## 2389 8.873514e-09 3618 1.316238e-07 0.8943188 0.02186663
## 2390 6.655136e-09 3620 1.138768e-07 0.8943188 0.02186663
## 2391 2.957838e-09 3621 1.072216e-07 0.8943173 0.02186663
## 2392 2.218379e-09 3622 1.042638e-07 0.8943179 0.02186663
## 2393 2.218379e-09 3632 8.208001e-08 0.8943420 0.02186705
## 2394 2.218379e-09 3647 4.880433e-08 0.8943420 0.02186705
## 2395 2.218379e-09 3648 4.658595e-08 0.8943420 0.02186705
## 2396 2.218379e-09 3649 4.436757e-08 0.8943420 0.02186705
## 2397 2.218379e-09 3653 3.549406e-08 0.8943420 0.02186705
## 2398 0.000000e+00 3669 0.000000e+00 0.8943420 0.02186705
Find the cp value to prune the tree. Don’t round the actual results, but you can round to 4 decimal places when typing your answer.
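The code below applies the one-standard-error rule to the CP table: take the row with the smallest cross-validation error, form the cutoff
\[ \text{cutoff} = \min_j \left(\text{xerror}_j\right) + \text{xstd}_{j^*}, \]
where \(j^*\) is the row achieving that minimum, and then keep the CP value of the simplest tree whose xerror falls below the cutoff.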
# Finding the xerror cutoff: min(xerror) + xstd
xerror_cutoff <-
  qbr_tree_full$cptable |>
  data.frame() |>
  slice_min(xerror, n = 1, with_ties = FALSE) |>
  mutate(
    xerror_1sd = xerror + xstd
  ) |>
  pull(xerror_1sd)
# Finding the first (simplest) tree with xerror < xerror_cutoff
cp_prune <-
  qbr_tree_full$cptable |>
  data.frame() |>
  filter(xerror < xerror_cutoff) |>
  slice(1) |>
  pull(CP)
cp_prune
## [1] 0.002932506
The cp value used to prune the tree is approximately 0.0029.
Using your answer from the previous question, prune the tree, then use rpart.plot() to display the tree.
qbr_tree_pruned <-
  prune(
    tree = qbr_tree_full,
    cp = cp_prune
  )
rpart.plot(
  qbr_tree_pruned,
  type = 5,
  extra = 101
)
Using the pruned tree, which three variables are the most important in predicting the QBR?
caret::varImp(qbr_tree_pruned) |>
  arrange(-Overall)
## Overall
## accuracy 2.2300901
## sack_yds 1.9479121
## interceptions 1.9227610
## yds_attempt 1.6699656
## sacks 1.5601996
## td_pass 1.4781813
## pass_yds 0.6127660
## fumbled 0.1252413
## yds_completion 0.1109751
The three most important variables are accuracy, sack_yds, and interceptions.
Using both the full tree and the pruned tree separately, predict the QBR for the 369 games in the qbr_test data set. Calculate the \(R^2\) value for both the full tree and pruned tree.
# Predicting the QBR for the games in the qbr_test data
qbr_test |>
  mutate(
    qbr_hat_full = predict(qbr_tree_full,
                           newdata = qbr_test),
    qbr_hat_pruned = predict(qbr_tree_pruned,
                             newdata = qbr_test)
  ) |>
  # Calculating R^2 for each tree
  summarize(
    R2_full = cor(qbr, qbr_hat_full)^2,
    R2_pruned = cor(qbr, qbr_hat_pruned)^2
  )
## R2_full R2_pruned
## 1 0.3151501 0.4776423
Which model predicts the 2023 season better? Briefly explain why the outcome (full vs pruned) is not surprising.
The pruned model predicts the 2023 season better. This isn't surprising: the more complex the tree, the better it fits the training data, but the full tree is likely overfit, so it doesn't predict new data nearly as well as the simpler pruned tree.