The nba_players data set has the overall advanced metrics recorded for 566 players in the NBA for the 2024 - 2025 regular season (no playoffs).
One of the advanced metrics is efficient, which is a statistic that attempts to measure how efficient a player is while on the basketball court. We’ll be using some of basketball-reference.com’s other advanced statistics to predict how efficient a player is.
The relevant columns are:
age: The age of the player at the start of the season
shooting: A measure of shot accuracy that combines 1-, 2-, and 3-point attempts and accounts for the difficulty of said shots
three_pt: The percentage of field goal attempts that are three-point shots
free: Number of free throw attempts per field goal attempt
off_reb: Percentage of offensive rebounds by the player when the player was eligible for the rebound
def_reb: Percentage of defensive rebounds by the player when the player was eligible for the rebound
tot_reb: Percentage of total rebounds by the player when the player was eligible for the rebound
assist: Percentage of teammates’ field goals the player assisted when playing
steal: Percentage of opponents’ possessions ended by the player stealing the ball when said player was playing
block: Percentage of two-point field goal attempts blocked by the player while they were playing
turnover: Number of turnovers committed per 100 plays
usage: Percentage of plays where the player was involved
position variable
Briefly explain why the position variable can’t be used as a predictor of efficient using kNN regression.
Since kNN uses distance to determine which neighbors are the closest (nearest), we can only use numeric predictors, and position is categorical (we can’t subtract “Guard” from “Forward”).
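To make the distance argument concrete, here is a minimal base-R sketch with two made-up players. The values and the euclid helper are assumptions for illustration only, not taken from nba_players.

```r
# Hypothetical, already-rescaled values for two players (not real data)
player_a <- c(shooting = 0.60, usage = 0.25)
player_b <- c(shooting = 0.55, usage = 0.30)

# Euclidean distance (assumed helper): only defined for numeric vectors
euclid <- function(u, v) sqrt(sum((u - v)^2))
dist_ab <- euclid(player_a, player_b)

# The same arithmetic on position labels throws an error in R,
# which tryCatch() turns into NA here so the script keeps running
pos_diff <- tryCatch("Guard" - "Forward", error = function(e) NA)
```

The numeric columns give a usable distance, while the position labels give nothing to subtract.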
Using shooting, three_pt, free, tot_reb, assist, block, and usage as predictors, find the best choice of \(k\) to predict efficient. Report the value of \(k\), the rescaling method, and the resulting \(R^2\). Search k = 5 to k = 200. (It’s best to start with a smaller range of k until you get the loop to work.)
Display your results using a line graph with two lines: the \(R^2\) value when normalizing the data and the \(R^2\) value when standardizing the data.
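As a quick illustration of the two rescaling methods named in the question, the sketch below applies each to a made-up vector. The helper names normalize and standardize are assumptions for this example only.

```r
# Toy predictor values (made up for illustration)
x <- c(2, 4, 6, 8, 10)

# Normalizing (min-max): every value lands in [0, 1]
normalize <- function(v) (v - min(v)) / (max(v) - min(v))

# Standardizing (z-score): mean 0 and standard deviation 1
standardize <- function(v) (v - mean(v)) / sd(v)

x_norm <- normalize(x)
x_stan <- standardize(x)
```

Both versions put the predictors on a comparable scale, so no single predictor dominates the distance calculation just because of its units.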
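# Leave this at the top
set.seed(2870)
# Packages used below (assumed; skip if already loaded in a setup chunk)
library(tidyverse)   # dplyr, tidyr, tibble, ggplot2
library(FNN)         # knn.reg()
library(rpart)       # rpart(), prune()
library(rpart.plot)  # rpart.plot()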
### Normalizing the data
nba_norm <-
  nba_players |>
  # Picking the relevant columns
  dplyr::select(efficient, shooting:free, tot_reb, assist, block, usage) |>
  mutate(
    across(
      .cols = -efficient,
      .fns = ~ (. - min(.)) / (max(.) - min(.))
    )
  )
### Standardizing the data
nba_stan <-
  nba_players |>
  # Picking the relevant columns
  dplyr::select(efficient, shooting:free, tot_reb, assist, block, usage) |>
  mutate(
    across(
      .cols = -efficient,
      .fns = ~ (. - mean(.)) / (sd(.))
    )
  )
### for loop set up
# Tibble with the k values to search through and placeholder columns for the results
fit_stats <-
  tibble(
    k = 5:200,
    Normalize = -1,
    Standardize = -1
  )
# Conducting the for loop
for (i in seq_len(nrow(fit_stats))) {
  # Finding R2 for the normalized data
  # (no test set is supplied, so knn.reg() uses leave-one-out cross-validation)
  loop_norm <-
    knn.reg(
      train = nba_norm |> dplyr::select(-efficient),
      y = nba_norm$efficient,
      k = fit_stats$k[i]
    )
  # Saving the R2 value
  fit_stats[i, "Normalize"] <- loop_norm$R2Pred
  # Finding R2 for the standardized data
  loop_stan <-
    knn.reg(
      train = nba_stan |> dplyr::select(-efficient),
      y = nba_stan$efficient,
      k = fit_stats$k[i]
    )
  # Saving the R2 value
  fit_stats[i, "Standardize"] <- loop_stan$R2Pred
}
fit_stats |>
  pivot_longer(
    cols = -k,
    names_to = "rescale",
    values_to = "R2"
  ) |>
  # Making the graph
  ggplot(
    mapping = aes(
      x = k,
      y = R2,
      color = rescale
    )
  ) +
  geom_line() +
  theme_bw() +
  theme(
    legend.position = 'inside',
    legend.position.inside = c(0.9, 0.85)
  )
# Finding the best choice of k
fit_stats |>
  pivot_longer(
    cols = -k,
    names_to = "rescale",
    values_to = "R2"
  ) |>
  slice_max(
    n = 1,
    R2
  )
## # A tibble: 1 × 3
## k rescale R2
## <int> <chr> <dbl>
## 1 5 Standardize 0.802
The best choice of k is 5 when the data are rescaled by standardizing, with an \(R^2\) value of 0.802.
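The R2Pred values compared above come from knn.reg() called without a test set, in which case FNN predicts each row from its neighbors among the remaining rows (leave-one-out cross-validation). A rough base-R sketch of that idea is below. It assumes Euclidean distance and an unweighted mean of the k nearest responses; loo_knn_r2 is a hypothetical helper, not part of FNN, and it may handle ties differently than knn.reg() does.

```r
# Hypothetical helper: leave-one-out R-squared for kNN regression
loo_knn_r2 <- function(X, y, k) {
  D <- as.matrix(dist(as.matrix(X)))  # pairwise Euclidean distances
  diag(D) <- Inf                      # a row can't be its own neighbor
  pred <- vapply(
    seq_along(y),
    function(i) mean(y[order(D[i, ])[1:k]]),  # mean of the k nearest responses
    numeric(1)
  )
  # 1 - (leave-one-out squared error) / (total variation)
  1 - sum((y - pred)^2) / sum((y - mean(y))^2)
}

# Toy data (made up): y is roughly linear in x
set.seed(1)
toy_x <- data.frame(x = seq(0, 1, length.out = 50))
toy_y <- 3 * toy_x$x + rnorm(50, sd = 0.1)
toy_r2 <- loo_knn_r2(toy_x, toy_y, k = 5)
```

Because every prediction is made for a row the neighbors did not include, this \(R^2\) is an honest estimate of out-of-sample fit, which is why it can be used to compare choices of k.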
efficient for the 2023 - 2024 season
Regardless of your answer in the previous question, predict efficient for the players in the nba23 data set when standardizing the data with k = 6. Display the results using an R-squared plot.
Make sure to standardize nba23 before predicting efficient! The predictions are added to nba23 below as a new column, pred_eff.
nba23_stan <-
  nba23 |>
  # Picking the relevant columns
  dplyr::select(efficient, shooting:free, tot_reb, assist, block, usage) |>
  mutate(
    across(
      .cols = -efficient,
      .fns = ~ (. - mean(.)) / (sd(.))
    )
  )
# Predicting efficient for the 2023 season
nba23 <-
  nba23 |>
  mutate(
    pred_eff =
      knn.reg(
        train = nba_stan |> dplyr::select(-efficient),
        test = nba23_stan |> dplyr::select(-efficient),
        y = nba_stan$efficient,
        k = 6
      )$pred
  )
Now create the \(R^2\) plot with predicted efficiency on the x-axis and actual efficiency on the y-axis.
ggplot(
  data = nba23,
  mapping = aes(
    x = pred_eff,
    y = efficient
  )
) +
  # Adding the points and a trend line
  geom_point() +
  geom_smooth(
    color = "red",
    method = "lm",
    se = FALSE,
    formula = y ~ x
  ) +
  # Adding the R2 value to the graph: 1 - SSE / SST
  annotate(
    geom = "text",
    x = 10,
    y = 30,
    label = paste(
      "R-squared:",
      round(
        1 - sum((nba23$efficient - nba23$pred_eff)^2) /
          sum((nba23$efficient - mean(nba23$efficient))^2),
        3
      )
    ),
    color = "red",
    size = 5
  ) +
  labs(
    x = "Predicted Player Efficiency",
    y = "Actual Player Efficiency",
    title = "Predicted vs Actual Player Efficiency for the 2023 - 2024 NBA Season"
  )
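The R-squared label in the plot is computed as one minus the ratio of the squared prediction errors to the total variation in the actual values. For readability, that calculation could be pulled into a small helper. The sketch below uses made-up numbers, and r_squared is a hypothetical name rather than a function from any package used here.

```r
# Hypothetical helper: R-squared of predictions against actual values
r_squared <- function(actual, predicted) {
  sse <- sum((actual - predicted)^2)     # unexplained variation
  sst <- sum((actual - mean(actual))^2)  # total variation
  1 - sse / sst
}

# Made-up values for illustration
actual    <- c(10, 15, 20, 25)
predicted <- c(12, 14, 19, 26)
r2_toy <- r_squared(actual, predicted)   # 1 - 7 / 125
```

With such a helper, the label could be built as paste("R-squared:", round(r_squared(nba23$efficient, nba23$pred_eff), 3)).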
Is kNN accurate for the 2023-2024 NBA season?
It’s a pretty good predictor: most of the players are near the red line and the \(R^2\) is over 0.83, indicating reliable but not perfect predictions.
shooting on efficient
Using the results of kNN, can you interpret the effect of shooting on the efficiency of a player? If yes, interpret the results. If not, briefly explain why.
No. kNN is a lazy learner: it doesn’t build a model, it only stores the training data and looks up the nearest neighbors at prediction time. Without a fitted model, we can’t learn how each variable was used to predict efficiency.
Create the full regression tree predicting efficient using position, shooting, three_pt, free, off_reb, def_reb, tot_reb, assist, steal, block, turnover, and usage. Display the last 10 rows of the CP table.
# Leave this at the top
RNGversion('4.1.0'); set.seed(2870)
eff_tree_full <-
  rpart(
    formula = efficient ~ position + shooting + three_pt + free + off_reb +
      def_reb + tot_reb + assist + steal + block + turnover + usage,
    data = nba_players,
    method = "anova",
    minsplit = 2,
    minbucket = 1,
    cp = 0
  )
eff_tree_full$cptable |>
  data.frame() |>
  tail(n = 10)
## CP nsplit rel.error xerror xstd
## 263 1.346944e-06 332 2.940827e-05 0.3514683 0.03167736
## 264 8.979623e-07 333 2.806132e-05 0.3514690 0.03167734
## 265 8.979623e-07 334 2.716336e-05 0.3514469 0.03167718
## 266 8.979623e-07 336 2.536744e-05 0.3514469 0.03167718
## 267 8.979623e-07 337 2.446947e-05 0.3514469 0.03167718
## 268 6.734718e-07 338 2.357151e-05 0.3514469 0.03167718
## 269 6.734718e-07 347 1.751027e-05 0.3512532 0.03167378
## 270 6.734718e-07 367 4.040831e-06 0.3512532 0.03167378
## 271 1.275061e-33 373 1.275061e-33 0.3512532 0.03167378
## 272 0.000000e+00 374 0.000000e+00 0.3512532 0.03167378
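How many leaf (terminal) nodes are in the full tree? 375 leaf nodes. The last row of the CP table shows 374 splits, and a binary tree always has one more leaf than it has splits.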
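The relationship between the number of splits and the number of leaves can be checked on any rpart tree. The sketch below uses the built-in mtcars data as a stand-in for nba_players (an assumption made only so the example runs on its own) and counts the leaves two ways.

```r
library(rpart)

# Full tree on a built-in data set (mtcars stands in for nba_players)
toy_tree <- rpart(
  mpg ~ .,
  data = mtcars,
  method = "anova",
  minsplit = 2,
  minbucket = 1,
  cp = 0
)

# Leaves counted directly from the fitted tree
n_leaves <- sum(toy_tree$frame$var == "<leaf>")

# Leaves implied by the CP table: largest nsplit plus one
n_from_cp <- max(toy_tree$cptable[, "nsplit"]) + 1
```

The two counts agree, which is why the leaf count for the full tree can be read straight off the last row of the CP table.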
Find the cp value to prune the tree. Don’t round the actual results, but you can round to 4 decimal places when typing your answer.
# Finding the xerror cutoff: min(xerror) + xstd
xerror_cutoff <-
  eff_tree_full$cptable |>
  data.frame() |>
  slice_min(xerror, n = 1, with_ties = FALSE) |>
  mutate(
    xerror_1sd = xerror + xstd
  ) |>
  pull(xerror_1sd)
# Finding the first (simplest) tree with xerror < xerror_cutoff
cp_prune <-
  eff_tree_full$cptable |>
  data.frame() |>
  filter(xerror < xerror_cutoff) |>
  slice(1) |>
  pull(CP)
cp_prune
## [1] 0.005852058
The cp value is: 0.0059
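The pruning rule used above picks the simplest tree whose cross-validated error is within one standard error of the minimum. A small base-R sketch on a made-up CP table shows the same steps; toy_cp and its numbers are assumptions for illustration, not output from eff_tree_full.

```r
# Made-up CP table, ordered from the simplest to the most complex tree
toy_cp <- data.frame(
  CP     = c(0.50, 0.10, 0.03, 0.01, 0.00),
  nsplit = c(0, 1, 2, 4, 7),
  xerror = c(1.00, 0.60, 0.48, 0.45, 0.47),
  xstd   = c(0.05, 0.04, 0.04, 0.04, 0.04)
)

# Cutoff: smallest cross-validated error plus its standard error
best <- which.min(toy_cp$xerror)
cutoff <- toy_cp$xerror[best] + toy_cp$xstd[best]

# First (simplest) row whose xerror falls below the cutoff
toy_prune_cp <- toy_cp$CP[which(toy_cp$xerror < cutoff)[1]]
```

Here the minimum xerror is in the fourth row, but the third row is already below the cutoff, so the simpler tree with 2 splits is chosen.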
Using your answer from the previous question, prune the tree, then use rpart.plot() to display the tree.
eff_tree_pruned <-
  prune(
    tree = eff_tree_full,
    cp = cp_prune
  )
rpart.plot(
  eff_tree_pruned,
  type = 5,
  extra = 101
)
Using the pruned tree, which six variables are the most important in predicting the efficiency of an NBA player?
caret::varImp(eff_tree_pruned) |>
  arrange(desc(Overall))
## Overall
## shooting 4.5464976
## tot_reb 3.9934850
## usage 2.9881671
## three_pt 2.9342991
## off_reb 2.4840411
## def_reb 2.4144033
## assist 1.4825977
## block 1.2674307
## turnover 1.1178790
## position 0.4642158
## steal 0.4625269
## free 0.2645668
The six most important variables are shooting, tot_reb, usage, three_pt, off_reb, and def_reb.
If you were to predict a player’s efficiency using either kNN or the Regression Tree, which would you choose? Make sure to justify your answer!