Data Description

The nba_players data set contains overall advanced metrics for 566 NBA players from the 2024 - 2025 regular season (playoffs excluded).

One of these advanced metrics is efficient, a statistic that attempts to measure how effectively a player performs while on the court. We'll use several of basketball-reference.com's other advanced statistics to predict how efficient a player is.

The relevant columns are position, efficient, shooting, three_pt, free, off_reb, def_reb, tot_reb, assist, steal, block, turnover, and usage.
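
To get a quick sense of the data, a short peek at its structure (a sketch, assuming the nba_players data frame is already loaded):

# Quick look at the columns and their types
dplyr::glimpse(nba_players)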

Question 1) kNN Regression

Part 1a) position variable

Briefly explain why the position variable can’t be used as a predictor of efficient using kNN regression.

Since kNN uses distance to determine which neighbors are nearest, it can only use numeric predictors. position is categorical, so distances between its values aren't defined (you can't subtract "Guard" from "Forward").
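
For illustration only (a hypothetical sketch, not something the question asks for): a categorical variable like position could only enter a distance-based method after being converted to numeric dummy (indicator) columns, for example with model.matrix().

# Hypothetical: turning position into 0/1 indicator columns so that
# distances between players would be defined
position_dummies <- model.matrix(~ position - 1, data = nba_players)
head(position_dummies)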

Part 1b) Best choice of k

Using shooting, three_pt, free, tot_reb, assist, block, and usage as predictors, find the best choice of \(k\) to predict efficient. Report the value of \(k\), the rescaling method, and the resulting \(R^2\). Search from k = 5 to k = 200. (It's best to start with a smaller range of k until the loop works.)

Display your results using a line graph with two lines: the \(R^2\) values when normalizing the data and when standardizing the data.

# Leave this at the top
set.seed(2870)
### Normalizing the data
nba_norm <- 
  nba_players |> 
  # Picking the relevant columns
  dplyr::select(efficient, shooting:free, tot_reb, assist, block, usage) |> 
  mutate(
    across(
      .cols = -efficient,
      .fns = ~ (. - min(.)) / (max(.) - min(.))
    )
  )

### Standardizing the data
nba_stan <- 
  nba_players |> 
  # Picking the relevant columns
  dplyr::select(efficient, shooting:free, tot_reb, assist, block, usage) |> 
  mutate(
    across(
      .cols = -efficient,
      .fns = ~ (. - mean(.)) / (sd(.))
    )
  )
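
# (Optional sanity check, a sketch): after rescaling, every normalized
# predictor should range from 0 to 1, and every standardized predictor
# should have mean ~ 0 and sd ~ 1
nba_norm |> 
  summarise(across(.cols = -efficient, .fns = list(min = min, max = max)))

nba_stan |> 
  summarise(across(.cols = -efficient, .fns = list(mean = ~ round(mean(.), 3), sd = sd)))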

### for loop set up
# Vector of k values to search through
k_vec <- 5:200

# Preallocating a tibble to save the results in 
fit_stats <- 
  tibble(
    k = k_vec, 
    Normalize = -1,
    Standardize = -1
  )


# Conducting the for loop
for (i in 1:nrow(fit_stats)){
  # Finding R2 for the normalized data (no test set supplied, so knn.reg() performs leave-one-out CV and returns R2Pred)
  loop_norm <- 
    knn.reg(
      train = nba_norm |> dplyr::select(-efficient),
      y = nba_norm$efficient,
      k = fit_stats$k[i]
    )
  
  # Saving the R2 value
  fit_stats[i, "Normalize"] <- loop_norm$R2Pred
  
  
  # Finding R2 for the standardized data
  loop_stan <- 
    knn.reg(
      train = nba_stan |> dplyr::select(-efficient),
      y = nba_stan$efficient,
      k = fit_stats$k[i]
    )
  
  # Saving the R2 value
  fit_stats[i, "Standardize"] <- loop_stan$R2Pred
  
}

fit_stats |> 
  pivot_longer(
    cols = -k,
    names_to = "rescale",
    values_to = "R2"
  ) |> 
  # Making the graph
  ggplot(
    mapping = aes(
      x = k,
      y = R2,
      color = rescale
    )
  ) + 
  
  geom_line() + 
  theme_bw() +
  theme(
    legend.position = 'inside',
    legend.position.inside = c(0.9, 0.85)
  )

# Finding the best choice of k
fit_stats |> 
  pivot_longer(
    cols = -k,
    names_to = "rescale",
    values_to = "R2"
  ) |>
  slice_max(
    n = 1,
    R2
  )
## # A tibble: 1 × 3
##       k rescale        R2
##   <int> <chr>       <dbl>
## 1     5 Standardize 0.802

The best choice of k is 5 when rescaling the data by standardizing, with an \(R^2\) value of 0.802.
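
For comparison, a quick sketch (reusing the fit_stats tibble above) showing the best k under each rescaling method separately:

# Best k and R2 for each rescaling method
fit_stats |> 
  pivot_longer(
    cols = -k,
    names_to = "rescale",
    values_to = "R2"
  ) |> 
  group_by(rescale) |> 
  slice_max(R2, n = 1, with_ties = FALSE) |> 
  ungroup()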

Part 1c) Predicting efficient for the 2023 - 2024 season.

Regardless of your answer in the previous question, predict efficient for the players in the nba23 data set when standardizing the data with k = 6. Display the results using an R-squared plot.

Make sure to standardize nba23 before predicting efficient! The predictions are added below as a new pred_eff column in nba23.

nba23_stan <- 
  nba23 |> 
  # Picking the relevant columns
  dplyr::select(efficient, shooting:free, tot_reb, assist, block, usage) |> 
  mutate(
    across(
      .cols = -efficient,
      .fns = ~ (. - mean(.)) / (sd(.))
    )
  )
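
# (Alternative sketch, not what is done below): standardize the 2023-24
# predictors using the 2024-25 (training) means and sds, so both seasons
# are rescaled to exactly the same scale
train_means <- 
  nba_players |> 
  summarise(across(c(shooting:free, tot_reb, assist, block, usage), mean))

train_sds <- 
  nba_players |> 
  summarise(across(c(shooting:free, tot_reb, assist, block, usage), sd))

nba23_stan_alt <- 
  nba23 |> 
  dplyr::select(efficient, shooting:free, tot_reb, assist, block, usage) |> 
  mutate(
    across(
      .cols = -efficient,
      .fns = ~ (. - train_means[[cur_column()]]) / train_sds[[cur_column()]]
    )
  )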


# Predicting efficient for the 2023 season
nba23 <- 
  nba23 |> 
  mutate(
    pred_eff =
      knn.reg(
        train = nba_stan |> dplyr::select(-efficient),
        test  = nba23_stan |> dplyr::select(-efficient),
        y = nba_stan$efficient,
        k = 6
      )$pred
  )

Now create the \(R^2\) plot with predicted efficiency on the x-axis and actual efficiency on the y-axis.

ggplot(
  data = nba23,
  mapping = aes(
    x = pred_eff,
    y = efficient
  )
) + 
  # Adding the points and a trend line
  geom_point() + 
  geom_smooth(
    color = "red",
    method = "lm",
    se = F,
    formula = y ~ x
  ) +
  # Adding the R2 value to the graph
  annotate(
    geom = "text",
    x = 10,
    y = 30,
    label = paste("R-squared:", round((1 - sum((nba23$efficient - nba23$pred_eff)^2)/sum((nba23$efficient - mean(nba23$efficient))^2)), 3)),
    color = "red",
    size = 5
  ) +
  
  labs(
    x = "Predicted Player Efficiency",
    y = "Actual Player Efficiency",
    title = "Predicted vs Actual Player Efficiency for the 2023 - 2024 NBA Season"
  ) 

Is kNN accurate for the 2023-2024 NBA season?

It's a pretty good predictor: most of the players fall near the red line and the \(R^2\) is over 0.83, indicating reliable but not perfect predictions.
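
As a numeric complement to the plot, a short sketch computing the \(R^2\), RMSE, and MAE of the predictions from the pred_eff column created above:

# Accuracy summaries for the 2023 - 2024 predictions
nba23 |> 
  summarise(
    R2   = 1 - sum((efficient - pred_eff)^2) / sum((efficient - mean(efficient))^2),
    RMSE = sqrt(mean((efficient - pred_eff)^2)),
    MAE  = mean(abs(efficient - pred_eff))
  )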

Part 1d) The effect of shooting on efficient

Using the results of kNN, can you interpret the effect of shooting on the efficiency of a player? If yes, interpret the results. If not, briefly explain why.

No. kNN is a lazy learner, which doesn't build a model; it just averages the efficient values of the k nearest neighbors. Without a model, we can't learn how the method used any individual variable, like shooting, to predict efficiency.

Question 2) Regression trees

Part 2a) Fitting the full tree

Create the full regression tree predicting efficient using position, shooting, three_pt, free, off_reb, def_reb, tot_reb, assist, steal, block, turnover, and usage. Display the last 10 rows of the CP table.

# Leave this at the top
RNGversion('4.1.0'); set.seed(2870)
eff_tree_full <- 
  rpart(
    formula = efficient ~ position + shooting + three_pt + free + off_reb + def_reb + tot_reb + assist + steal + block + turnover + usage,
    data = nba_players,
    method = "anova",
    minsplit = 2,
    minbucket = 1,
    cp = 0
  )

eff_tree_full$cptable |> 
  data.frame() |> 
  tail(n = 10)
##               CP nsplit    rel.error    xerror       xstd
## 263 1.346944e-06    332 2.940827e-05 0.3514683 0.03167736
## 264 8.979623e-07    333 2.806132e-05 0.3514690 0.03167734
## 265 8.979623e-07    334 2.716336e-05 0.3514469 0.03167718
## 266 8.979623e-07    336 2.536744e-05 0.3514469 0.03167718
## 267 8.979623e-07    337 2.446947e-05 0.3514469 0.03167718
## 268 6.734718e-07    338 2.357151e-05 0.3514469 0.03167718
## 269 6.734718e-07    347 1.751027e-05 0.3512532 0.03167378
## 270 6.734718e-07    367 4.040831e-06 0.3512532 0.03167378
## 271 1.275061e-33    373 1.275061e-33 0.3512532 0.03167378
## 272 0.000000e+00    374 0.000000e+00 0.3512532 0.03167378

How many leaf (terminal) nodes are in the full tree? There are 375 leaf nodes: the last row of the CP table shows 374 splits, and a binary tree with 374 splits has 374 + 1 = 375 terminal nodes.
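
One way to check this count directly from the fitted object (a quick sketch): terminal nodes are the rows of the rpart frame whose var column is "<leaf>".

# Counting the terminal nodes in the full tree
sum(eff_tree_full$frame$var == "<leaf>")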

Part 2b) Finding the pruning point

Find the cp value to prune the tree. Don’t round the actual results, but you can round to 4 decimal places when typing your answer.

# Finding the xerror cutoff: min(xerror) + xstd
xerror_cutoff <- 
  eff_tree_full$cptable |> 
  data.frame() |> 
  slice_min(xerror, n = 1, with_ties = F) |> 
  mutate(
    xerror_1sd = xerror + xstd
  ) |> 
  pull(xerror_1sd)

# Finding the first (simplest) tree with xerror < xerror_cutoff
cp_prune <- 
  eff_tree_full$cptable |> 
  data.frame() |> 
  filter(xerror < xerror_cutoff) |> 
  slice(1) |> 
  pull(CP)

cp_prune
## [1] 0.005852058

The cp value is: 0.0059

Part 2c) Pruning and plotting the tree

Using your answer from the previous question, prune the tree, then use rpart.plot() to display the tree.

eff_tree_pruned <- 
  prune(tree = eff_tree_full,
        cp = cp_prune)

rpart.plot(
  eff_tree_pruned,
  type = 5,
  extra = 101
)

Part 2d) Variable Importance

Using the pruned tree, which six variables are the most important in predicting the efficiency of an NBA player?

caret::varImp(eff_tree_pruned) |> 
  arrange(-Overall)
##            Overall
## shooting 4.5464976
## tot_reb  3.9934850
## usage    2.9881671
## three_pt 2.9342991
## off_reb  2.4840411
## def_reb  2.4144033
## assist   1.4825977
## block    1.2674307
## turnover 1.1178790
## position 0.4642158
## steal    0.4625269
## free     0.2645668

The six most important variables are shooting, tot_reb, usage, three_pt, off_reb, and def_reb.
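
A bar chart can make the importance ranking easier to read; here is a brief sketch using the same varImp() results:

# Plotting the variable importance of the pruned tree
caret::varImp(eff_tree_pruned) |> 
  tibble::rownames_to_column(var = "variable") |> 
  ggplot(
    mapping = aes(
      x = Overall,
      y = reorder(variable, Overall)
    )
  ) + 
  geom_col() + 
  labs(
    x = "Importance",
    y = NULL,
    title = "Variable Importance for the Pruned Regression Tree"
  ) + 
  theme_bw()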

Part 2e) Which Method Is More Accurate: kNN or Regression Tree?

If you were to predict a player’s efficiency using either kNN or the Regression Tree, which would you choose? Make sure to justify your answer!
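
One way to ground the comparison (a sketch reusing objects created above): put both methods on a comparable cross-validated \(R^2\) scale. knn.reg()'s R2Pred is already a cross-validated \(R^2\), and because the tree's xerror is scaled by the root node's error, 1 - xerror plays the same role for the pruned tree. The two use different CV schemes (leave-one-out vs. rpart's k-fold), so treat this as a rough comparison.

# Cross-validated R2 for the best standardized kNN fit (Part 1b)
knn_r2 <- max(fit_stats$Standardize)

# Cross-validated R2 for the pruned tree at the chosen cp
tree_r2 <- 
  eff_tree_full$cptable |> 
  data.frame() |> 
  filter(xerror < xerror_cutoff) |> 
  slice(1) |> 
  mutate(R2 = 1 - xerror) |> 
  pull(R2)

c(kNN = knn_r2, tree = tree_r2)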