Data Description

The nba_players data set has the overall advanced metrics recorded for 566 players in the NBA for the 2023 - 2024 regular season (no playoffs).

One of the advanced metrics is efficient, which is a statistic that attempts to measure how efficient a player is while on the basketball court. We’ll be using some of basketball-reference.com’s other advanced statistics to predict how efficient a player is.

tibble(nba_players)
## # A tibble: 396 × 17
##      age team  position games_played games_started efficient shooting three_pt
##    <int> <chr> <chr>           <int>         <int>     <dbl>    <dbl>    <dbl>
##  1    31 2TM   SF                 84            42      13.3     57.8    0.668
##  2    27 SAC   C                  82            82      23.2     63.7    0.081
##  3    27 BRK   SF                 82            82      14.9     56      0.457
##  4    25 LAL   SG                 82            57      15.5     61.3    0.447
##  5    21 HOU   SG                 82            82      14.7     54.1    0.454
##  6    21 OKC   C                  82            82      20.4     63.2    0.366
##  7    31 SAC   PF                 82            82      11.9     61.1    0.526
##  8    28 MIL   PF                 82             4      17.5     58.1    0.268
##  9    31 NOP   C                  82            82      19.8     61.9    0.167
## 10    25 MIN   SG                 82            20      10.9     57.8    0.623
## # ℹ 386 more rows
## # ℹ 9 more variables: free <dbl>, off_reb <dbl>, def_reb <dbl>, tot_reb <dbl>,
## #   assist <dbl>, steal <dbl>, block <dbl>, turnover <dbl>, usage <dbl>

The relevant columns are:

Question 1) Exploratory data analysis

Question 1a) Scatter plots of efficiency by the relevant columns

Create a set of scatterplots with efficiency on the y-axis and the other numeric columns on the respective x-axes.

nba_players |> 
  # Placing the numeric predictors into the same column named value and predictors
  pivot_longer(
    cols = c(age, shooting:usage),
    names_to = "stat",
    values_to = "value"
  ) |> 
  mutate(stat = as_factor(stat)) |> 
  # Creating the set of scatterplots
  ggplot(
    mapping = aes(
      x = value,
      y = efficient
    )
  ) + 
  geom_point(alpha = 0.5) + 
  geom_smooth(
    method = "loess",
    se = F,
    formula = y ~ x
  ) +
  # Separating the plots with different x-axes for each statistic
  facet_wrap(
    facets = vars(stat),
    scales = "free_x"
  ) + 
  labs(
    x = NULL,
    y = 'Player Efficiency'
  )

Which two variables appear to have the strongest association with efficiency?

Any response with shooting, free, usage, assist, def / tot_reb is correct

Which two variables appear to have the weakest association with efficiency? Any response with steal, off_reb, turnover, or age is correct

Question 1b) Why can’t we use position to predict efficient using k-nearest neighbors?

We can’t use position because it is categorical, and since kNN calculates distances and we can’t subtract groups, we can’t use position with kNN

Question 2) Finding a good model

Part 2a) Tuning the hyper-parameter and best choice to rescale

Determine the \(k\) and rescaling method that minimizes \(SSE\) and the choice of \(k\) and rescaling method that minimizes \(MAE\). Make sure you doing it properly! Search from k = 2 to k = 390. Use a single loop!

If you want to make two separate data sets for the normalized results and standardized results, you can stack the rows together using bind_rows(.id = 'rescale', 'norm' = ..., 'stan' = ...)

**Regardless of you answer from 1a), only use shooting, three_pt, free, tot_reb, assist, and usage to predict efficient

# Normalizing and standardizing the data
nba_norm <- 
  nba_players |> 
  dplyr::select(shooting, three_pt, free, tot_reb, assist, usage) |> 
  mutate(
    across(
      .cols = everything(),
      .fns = ~ (. - min(.)) / (max(.) - min(.))
    )
  )

nba_stan <- 
  nba_players |> 
  dplyr::select(shooting, three_pt, free, tot_reb, assist, usage) |> 
  mutate(
    across(
      .cols = everything(),
      .fns = scale
    )
  )

# Data frames to save results for normalized and standardized data
k_norm_df <- 
  data.frame(
    k = 2:390,
    SSE = -1,
    MAE = -1
  )

k_stan_df <- 
  data.frame(
    k = 2:390,
    SSE = -1,
    MAE = -1
  )

# Performing the grid search
for (i in 1:nrow(k_stan_df)){
  # Saving the residuals for normalized data
  norm_error_loop <- 
    knn.reg(
      train = nba_norm,
      y = nba_players$efficient,
      k = k_stan_df$k[i]
    )$res
  
  # Saving SSE and MAE for normalized data
  k_norm_df[i, c('SSE', 'MAE')] <- c(
    sum(norm_error_loop ^ 2), # SSE
    sum(abs(norm_error_loop)) # MAE
  )
  
  ### Standardized data
  # Saving the predictions for normalized data
  stan_error_loop <- 
    knn.reg(
      train = nba_stan,
      y = nba_players$efficient,
      k = k_stan_df$k[i]
    )$res
  
  # Saving SSE and MAE for normalized data
  k_stan_df[i, c('SSE', 'MAE')] <- c(
    sum(stan_error_loop ^ 2), # SSE
    mean(abs(stan_error_loop)) # MAE
  )
  
}

k_search_results <- 
  bind_rows(
    .id = 'rescale',
    'norm' = k_norm_df,
    'stan' = k_stan_df
  )


tibble(k_search_results)
## # A tibble: 778 × 4
##    rescale     k   SSE   MAE
##    <chr>   <int> <dbl> <dbl>
##  1 norm        2 1886.  673.
##  2 norm        3 1715.  634.
##  3 norm        4 1632.  619 
##  4 norm        5 1636.  614.
##  5 norm        6 1676.  612.
##  6 norm        7 1780.  623.
##  7 norm        8 1795.  625.
##  8 norm        9 1781.  620.
##  9 norm       10 1786.  622.
## 10 norm       11 1781.  617.
## # ℹ 768 more rows

Use the chunk below after you’ve done the search to find the choice of k

# Best choice of SSE:
k_search_results |> 
  slice_min(SSE, n = 1)
##   rescale k      SSE      MAE
## 1    stan 4 1444.944 1.445518
# Best choice of MAE:
k_search_results |> 
  slice_min(MAE, n = 1)
##   rescale k      SSE      MAE
## 1    stan 4 1444.944 1.445518

Part 2C: Fit statistics for best k and rescale method

Using the choice of k and rescale method, calculate \(R^2\), \(\text{rmse}\), and \(\text{MAE}\) using cross-validation.

eff23_knn <- 
  knn.reg(
    train = nba_stan,
    y = nba_players$efficient,
    k = 4,
  )

## Rsquared
eff23_knn$R2Pred
## [1] 0.8220365
## rmse: sqrt(PRESS / (n - 1))
sqrt(eff23_knn$PRESS / (nrow(nba_players) - 1))
## [1] 1.912613
## MAE
mean(abs(nba_players$efficient - eff23_knn$pred))
## [1] 1.445518

Part 2D: Shooting percentage and efficiency

Using the results of kNN from part 2a) (and 2a only), how does shooting percentage affect efficiency?

We can’t determine how shooting effects efficiency since kNN doesn’t build a model, and we can use kNN alone to tell us how shooting percentage changes efficiency

Question 3) Players from the 2024 - 2025 NBA season

The code chunk below reads in the same data but from the 2024 - 2025 NBA regular season data set. For question 3, you’ll be predicting the players from the 2024-2025 season using kNN regression with k = 5 for the standardized data.

Part 3a) Making predictions for the 2024-2025 season

For the players from the 2024-2025 season, predict their efficiency. Make sure to standardize the data first!

When standardizing (or normalizing) the data for the data set being predicted, you want to standardize using the statistics from the training data. That is, when standardizing shooting for the 2024-2025 data set, you want to use the mean and standard deviation of shooting from the 2023-2024 data set! Repeat for all six predictors.

Display the predictions in a data frame that has two columns: efficient and predicted efficient

## Standardizing the nba24 data
nba24_stan <- 
  nba24 |> 
  dplyr::select(
    shooting, three_pt, free, tot_reb, assist, usage
  ) |> 
  mutate(
    shooting = (shooting - mean(nba_players$shooting)) / sd(nba_players$shooting),
    three_pt = (three_pt - mean(nba_players$three_pt)) / sd(nba_players$three_pt),
    free     = (free     - mean(nba_players$free)      / sd(nba_players$free)),
    tot_reb  = (tot_reb  - mean(nba_players$tot_reb))  / sd(nba_players$tot_reb),
    assist   = (assist   - mean(nba_players$assist))   / sd(nba_players$assist),
    usage    = (usage    - mean(nba_players$usage))    / sd(nba_players$usage)
  )



## Predicting the 2024 season efficiency
eff24_knn <- 
  knn.reg(
    train = nba_stan,
    test = nba24_stan,
    y = nba_players$efficient,
    k = 5
  )

data.frame(
  efficient =  nba24$efficient,
  efficient_hat = eff24_knn$pred
) |>
  tibble()
## # A tibble: 411 × 2
##    efficient efficient_hat
##        <dbl>         <dbl>
##  1      14            11.6
##  2      15.1          13.4
##  3      13.3          11.3
##  4       9            11.4
##  5      22.1          18.7
##  6      14.7          14.2
##  7      14.4          13.0
##  8      14.9          13.0
##  9      11.6          13  
## 10      12.2          11.7
## # ℹ 401 more rows

Part 3b) Fit statistics for NBA 24

Using the predictions from 3a), calculate \(R^2\), \(\text{rmse}\), and \(\text{MAE}\) for the 2024-2025 data.

## SSE for nba24
SSE24 <- sum((nba24$efficient -  eff24_knn$pred)^2)

## R^2:
Rsquared24 <- 1 - SSE24 / sum((nba24$efficient - mean(nba24$efficient))^2)

## rmse
rmse24 <- sqrt(SSE24 /(nrow(nba24) - 1))


## MAE
mae24 <- mean(abs(nba24$efficient - eff24_knn$pred))

c(
  'R-squared' = Rsquared24,
  'rmse24'    = rmse24,
  'mae24'     = mae24
)
## R-squared    rmse24     mae24 
## 0.7046397 2.3126475 1.7056448

How do the fit statistics compare to the ones calculated in part 2c (Better, worse)? Briefly explain why the results are not surprising

The fit statistics are all worse (Lower \(R^2\) and higher rmse and MAE) because we found the choice of \(k\) that minimize rmse (and MAE) for the training data. The tend to be a little worse for future data sets.

Part 3c: R-squared plot

Create a scatter plot to compare the efficiency for the 2024 - 2025 season to the predicted efficiency using an R-squared plot, which is a scatter plot with \(\hat{y}\) on the x-axis and \(y\) on the y-axis. Color each point by the player’s position. Make sure to make the graph look nice!

ggplot(
  data = data.frame(y = nba24$efficient,
                    y_hat = eff24_knn$pred,
                    position = nba24$position),
  mapping = aes(
    x = y_hat,
    y = y,
    color = position
  )
) + 
  geom_point() + 
  theme_bw()

Does it appear that kNN predicts efficiency well?