The nba_players data set has the overall advanced
metrics recorded for 566 players in the NBA for the 2023 - 2024 regular
season (no playoffs).
One of the advanced metrics is efficient, which is a
statistic that attempts to measure how efficient a player is while on
the basketball court. We’ll be using some of basketball-reference.com’s
other advanced statistics to predict how efficient a player is.
tibble(nba_players)
## # A tibble: 396 × 17
## age team position games_played games_started efficient shooting three_pt
## <int> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 31 2TM SF 84 42 13.3 57.8 0.668
## 2 27 SAC C 82 82 23.2 63.7 0.081
## 3 27 BRK SF 82 82 14.9 56 0.457
## 4 25 LAL SG 82 57 15.5 61.3 0.447
## 5 21 HOU SG 82 82 14.7 54.1 0.454
## 6 21 OKC C 82 82 20.4 63.2 0.366
## 7 31 SAC PF 82 82 11.9 61.1 0.526
## 8 28 MIL PF 82 4 17.5 58.1 0.268
## 9 31 NOP C 82 82 19.8 61.9 0.167
## 10 25 MIN SG 82 20 10.9 57.8 0.623
## # ℹ 386 more rows
## # ℹ 9 more variables: free <dbl>, off_reb <dbl>, def_reb <dbl>, tot_reb <dbl>,
## # assist <dbl>, steal <dbl>, block <dbl>, turnover <dbl>, usage <dbl>
The relevant columns are:
age: the age of the player at the start of the
seasonshooting: A measure of shot accuracy that combines 1-,
2-, and 3-point attempts and accounts for the difficulty of said
shotsthree_pt: The percentage of field goal attempts that
are three point shotsfree: Number of free throw attempts per field goal
attemptoff_reb: Percentage of offensive rebounds by the player
when the player was eligible for the rebounddef_reb: Percentage of defensive rebounds by the player
when the player was eligible for the reboundtot_reb: Percentage of total rebounds by the player
when the player was eligible for the reboundassist: Percentage of teammate’s field goals the player
assisted when playingsteal: Percentage of opponents’ possessions ended by
the player stealing the ball when said player was playingblock: Percentage of two-point field goal attempts
blocked by the player while they were playingturnover: Number of turnovers committed per 100
playsusage: Percentage of plays where the player was
involvedCreate a set of scatterplots with efficiency on
the y-axis and the other numeric columns on the respective
x-axes.
nba_players |>
# Placing the numeric predictors into the same column named value and predictors
pivot_longer(
cols = c(age, shooting:usage),
names_to = "stat",
values_to = "value"
) |>
mutate(stat = as_factor(stat)) |>
# Creating the set of scatterplots
ggplot(
mapping = aes(
x = value,
y = efficient
)
) +
geom_point(alpha = 0.5) +
geom_smooth(
method = "loess",
se = F,
formula = y ~ x
) +
# Separating the plots with different x-axes for each statistic
facet_wrap(
facets = vars(stat),
scales = "free_x"
) +
labs(
x = NULL,
y = 'Player Efficiency'
)
Which two variables appear to have the strongest association with efficiency?
Any response with shooting, free, usage, assist, def / tot_reb is correct
Which two variables appear to have the weakest association with efficiency? Any response with steal, off_reb, turnover, or age is correct
position to predict
efficient using k-nearest neighbors?We can’t use position because it is categorical, and since kNN calculates distances and we can’t subtract groups, we can’t use position with kNN
Determine the \(k\) and rescaling method that minimizes \(SSE\) and the choice of \(k\) and rescaling method that minimizes \(MAE\). Make sure you doing it properly! Search from k = 2 to k = 390. Use a single loop!
If you want to make two separate data sets for the normalized
results and standardized results, you can stack the rows together using
bind_rows(.id = 'rescale', 'norm' = ..., 'stan' = ...)
**Regardless of you answer from 1a), only use shooting,
three_pt, free, tot_reb,
assist, and usage to predict
efficient
# Normalizing and standardizing the data
nba_norm <-
nba_players |>
dplyr::select(shooting, three_pt, free, tot_reb, assist, usage) |>
mutate(
across(
.cols = everything(),
.fns = ~ (. - min(.)) / (max(.) - min(.))
)
)
nba_stan <-
nba_players |>
dplyr::select(shooting, three_pt, free, tot_reb, assist, usage) |>
mutate(
across(
.cols = everything(),
.fns = scale
)
)
# Data frames to save results for normalized and standardized data
k_norm_df <-
data.frame(
k = 2:390,
SSE = -1,
MAE = -1
)
k_stan_df <-
data.frame(
k = 2:390,
SSE = -1,
MAE = -1
)
# Performing the grid search
for (i in 1:nrow(k_stan_df)){
# Saving the residuals for normalized data
norm_error_loop <-
knn.reg(
train = nba_norm,
y = nba_players$efficient,
k = k_stan_df$k[i]
)$res
# Saving SSE and MAE for normalized data
k_norm_df[i, c('SSE', 'MAE')] <- c(
sum(norm_error_loop ^ 2), # SSE
sum(abs(norm_error_loop)) # MAE
)
### Standardized data
# Saving the predictions for normalized data
stan_error_loop <-
knn.reg(
train = nba_stan,
y = nba_players$efficient,
k = k_stan_df$k[i]
)$res
# Saving SSE and MAE for normalized data
k_stan_df[i, c('SSE', 'MAE')] <- c(
sum(stan_error_loop ^ 2), # SSE
mean(abs(stan_error_loop)) # MAE
)
}
k_search_results <-
bind_rows(
.id = 'rescale',
'norm' = k_norm_df,
'stan' = k_stan_df
)
tibble(k_search_results)
## # A tibble: 778 × 4
## rescale k SSE MAE
## <chr> <int> <dbl> <dbl>
## 1 norm 2 1886. 673.
## 2 norm 3 1715. 634.
## 3 norm 4 1632. 619
## 4 norm 5 1636. 614.
## 5 norm 6 1676. 612.
## 6 norm 7 1780. 623.
## 7 norm 8 1795. 625.
## 8 norm 9 1781. 620.
## 9 norm 10 1786. 622.
## 10 norm 11 1781. 617.
## # ℹ 768 more rows
Use the chunk below after you’ve done the search to find the choice of k
# Best choice of SSE:
k_search_results |>
slice_min(SSE, n = 1)
## rescale k SSE MAE
## 1 stan 4 1444.944 1.445518
# Best choice of MAE:
k_search_results |>
slice_min(MAE, n = 1)
## rescale k SSE MAE
## 1 stan 4 1444.944 1.445518
Graph both SSE and MAE for normalized and standardized data in the same graph (but not necessarily the same plot). Does it look like you found a true minimum?
k_search_results |>
pivot_longer(
cols = SSE:MAE,
names_to = 'metric',
values_to = 'value'
) |>
ggplot(
mapping = aes(
x = k,
y = value,
color = rescale
)
) +
geom_line() +
facet_wrap(
facets = vars(metric),
scales = 'free_y',
nrow = 2
)
Using the choice of k and rescale method, calculate \(R^2\), \(\text{rmse}\), and \(\text{MAE}\) using cross-validation.
eff23_knn <-
knn.reg(
train = nba_stan,
y = nba_players$efficient,
k = 4,
)
## Rsquared
eff23_knn$R2Pred
## [1] 0.8220365
## rmse: sqrt(PRESS / (n - 1))
sqrt(eff23_knn$PRESS / (nrow(nba_players) - 1))
## [1] 1.912613
## MAE
mean(abs(nba_players$efficient - eff23_knn$pred))
## [1] 1.445518
Using the results of kNN from part 2a) (and 2a only), how does shooting percentage affect efficiency?
We can’t determine how shooting effects efficiency since kNN doesn’t build a model, and we can use kNN alone to tell us how shooting percentage changes efficiency
The code chunk below reads in the same data but from the 2024 - 2025 NBA regular season data set. For question 3, you’ll be predicting the players from the 2024-2025 season using kNN regression with k = 5 for the standardized data.
For the players from the 2024-2025 season, predict their efficiency. Make sure to standardize the data first!
When standardizing (or normalizing) the data for the data set
being predicted, you want to standardize using the statistics from the
training data. That is, when standardizing shooting for the
2024-2025 data set, you want to use the mean and standard deviation of
shooting from the 2023-2024 data set! Repeat for all six
predictors.
Display the predictions in a data frame that has two columns: efficient and predicted efficient
## Standardizing the nba24 data
nba24_stan <-
nba24 |>
dplyr::select(
shooting, three_pt, free, tot_reb, assist, usage
) |>
mutate(
shooting = (shooting - mean(nba_players$shooting)) / sd(nba_players$shooting),
three_pt = (three_pt - mean(nba_players$three_pt)) / sd(nba_players$three_pt),
free = (free - mean(nba_players$free) / sd(nba_players$free)),
tot_reb = (tot_reb - mean(nba_players$tot_reb)) / sd(nba_players$tot_reb),
assist = (assist - mean(nba_players$assist)) / sd(nba_players$assist),
usage = (usage - mean(nba_players$usage)) / sd(nba_players$usage)
)
## Predicting the 2024 season efficiency
eff24_knn <-
knn.reg(
train = nba_stan,
test = nba24_stan,
y = nba_players$efficient,
k = 5
)
data.frame(
efficient = nba24$efficient,
efficient_hat = eff24_knn$pred
) |>
tibble()
## # A tibble: 411 × 2
## efficient efficient_hat
## <dbl> <dbl>
## 1 14 11.6
## 2 15.1 13.4
## 3 13.3 11.3
## 4 9 11.4
## 5 22.1 18.7
## 6 14.7 14.2
## 7 14.4 13.0
## 8 14.9 13.0
## 9 11.6 13
## 10 12.2 11.7
## # ℹ 401 more rows
Using the predictions from 3a), calculate \(R^2\), \(\text{rmse}\), and \(\text{MAE}\) for the 2024-2025 data.
## SSE for nba24
SSE24 <- sum((nba24$efficient - eff24_knn$pred)^2)
## R^2:
Rsquared24 <- 1 - SSE24 / sum((nba24$efficient - mean(nba24$efficient))^2)
## rmse
rmse24 <- sqrt(SSE24 /(nrow(nba24) - 1))
## MAE
mae24 <- mean(abs(nba24$efficient - eff24_knn$pred))
c(
'R-squared' = Rsquared24,
'rmse24' = rmse24,
'mae24' = mae24
)
## R-squared rmse24 mae24
## 0.7046397 2.3126475 1.7056448
How do the fit statistics compare to the ones calculated in part 2c (Better, worse)? Briefly explain why the results are not surprising
The fit statistics are all worse (Lower \(R^2\) and higher rmse and MAE) because we found the choice of \(k\) that minimize rmse (and MAE) for the training data. The tend to be a little worse for future data sets.
Create a scatter plot to compare the efficiency for the 2024 - 2025 season to the predicted efficiency using an R-squared plot, which is a scatter plot with \(\hat{y}\) on the x-axis and \(y\) on the y-axis. Color each point by the player’s position. Make sure to make the graph look nice!
ggplot(
data = data.frame(y = nba24$efficient,
y_hat = eff24_knn$pred,
position = nba24$position),
mapping = aes(
x = y_hat,
y = y,
color = position
)
) +
geom_point() +
theme_bw()
Does it appear that kNN predicts efficiency well?