NBAPlayersProject

Author

Yash Shah

NBA Player Performance Modeling (R)


Context: Resume-ready machine learning project using real NBA player-season data (12,844 rows)

1. Project Overview

This project applies regression and machine learning techniques in R to model NBA player performance using historical player-season data. The workflow mirrors standard data science pipelines:

  • Data cleaning & preprocessing

  • Exploratory data analysis (EDA)

  • Feature engineering

  • Multiple regression models

  • Regularization (Ridge & LASSO)

  • Tree-based models

  • Model evaluation & comparison

The goal is to predict player scoring output (PTS) and analyze which factors most influence offensive performance.

2. Libraries

Code
library(tidyverse) 
Warning: package 'tidyverse' was built under R version 4.4.3
Warning: package 'ggplot2' was built under R version 4.4.3
Warning: package 'tidyr' was built under R version 4.4.3
Warning: package 'purrr' was built under R version 4.4.3
Warning: package 'dplyr' was built under R version 4.4.3
Warning: package 'forcats' was built under R version 4.4.3
Warning: package 'lubridate' was built under R version 4.4.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(tidymodels) 
Warning: package 'tidymodels' was built under R version 4.4.3
── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──
✔ broom        1.0.9     ✔ rsample      1.3.1
✔ dials        1.4.2     ✔ tailor       0.1.0
✔ infer        1.1.0     ✔ tune         2.0.1
✔ modeldata    1.5.1     ✔ workflows    1.3.0
✔ parsnip      1.4.0     ✔ workflowsets 1.1.1
✔ recipes      1.3.1     ✔ yardstick    1.3.2
Warning: package 'broom' was built under R version 4.4.3
Warning: package 'dials' was built under R version 4.4.3
Warning: package 'scales' was built under R version 4.4.3
Warning: package 'infer' was built under R version 4.4.3
Warning: package 'modeldata' was built under R version 4.4.3
Warning: package 'parsnip' was built under R version 4.4.3
Warning: package 'recipes' was built under R version 4.4.3
Warning: package 'rsample' was built under R version 4.4.3
Warning: package 'tailor' was built under R version 4.4.3
Warning: package 'tune' was built under R version 4.4.3
Warning: package 'workflows' was built under R version 4.4.3
Warning: package 'workflowsets' was built under R version 4.4.3
Warning: package 'yardstick' was built under R version 4.4.3
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
Code
library(glmnet) 
Warning: package 'glmnet' was built under R version 4.4.3
Loading required package: Matrix

Attaching package: 'Matrix'

The following objects are masked from 'package:tidyr':

    expand, pack, unpack

Loaded glmnet 4.1-10
Code
library(rpart) 

Attaching package: 'rpart'

The following object is masked from 'package:dials':

    prune
Code
library(rpart.plot) 
Warning: package 'rpart.plot' was built under R version 4.4.3
Code
library(corrplot) 
Warning: package 'corrplot' was built under R version 4.4.3
corrplot 0.95 loaded

3. Data Loading

Code
nba <- read_csv("NBADataset.csv")  
New names:
• `` -> `...1`
Rows: 12844 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): player_name, team_abbreviation, college, country, draft_year, draf...
dbl (14): ...1, age, player_height, player_weight, gp, pts, reb, ast, net_ra...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
set.seed(1234567) #set seed for reproducibility

4. Data Cleaning

Code
nba_clean <- nba %>%
  mutate(
    draft_year = as.numeric(draft_year),
    draft_round = as.numeric(draft_round),
    draft_number = as.numeric(draft_number)
  ) %>%
  drop_na(pts, age, player_height, player_weight, gp)
Warning: There were 3 warnings in `mutate()`.
The first warning was:
ℹ In argument: `draft_year = as.numeric(draft_year)`.
Caused by warning:
! NAs introduced by coercion
ℹ Run `dplyr::last_dplyr_warnings()` to see the 2 remaining warnings.
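
The coercion warnings above arise because as.numeric() returns NA for any string that is not a number. A minimal sketch of that behavior follows. The "Undrafted" label is an assumption about how the dataset marks undrafted players; the output above does not show the offending values.

```r
# Hypothetical draft_year values; "Undrafted" is an assumed placeholder string.
draft_year_raw <- c("2003", "1996", "Undrafted", "2010")

# as.numeric() converts numeric strings and turns everything else into NA,
# emitting the "NAs introduced by coercion" warning (suppressed here).
draft_year_num <- suppressWarnings(as.numeric(draft_year_raw))

# Count how many non-missing inputs became NA through coercion.
n_coerced_na <- sum(is.na(draft_year_num) & !is.na(draft_year_raw))

print(draft_year_num)
print(n_coerced_na)
```

Because drop_na() is applied only to pts, age, player_height, player_weight, and gp, NAs in the draft columns do not remove any rows.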

5. Feature Selection

We restrict the predictors to numeric performance and physical attributes, leaving out identifier columns such as player_name and team_abbreviation, to avoid leakage.

Code
nba_model <- nba_clean %>%
  select(
   pts, age, player_height, player_weight, gp,
   reb, ast, net_rating, usg_pct, ts_pct, ast_pct
  )

6. Exploratory Data Analysis

Correlation Matrix

Code
corrplot(cor(nba_model), method = "color", type = "upper") 

Key observations:

  • Usage rate (usg_pct) strongly correlates with points

  • True shooting percentage (ts_pct) captures scoring efficiency

  • Games played (gp) captures opportunity
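
To read the same information as numbers rather than colors, the pts row of the correlation matrix can be sorted. The sketch below uses a small simulated data frame so that it runs on its own; the values are made up and are not the NBA data. With the real data, nba_model would take the place of toy.

```r
set.seed(1)

# Simulated stand-in for nba_model; values are illustrative only.
n       <- 200
usg_pct <- runif(n, 0.10, 0.35)
gp      <- sample(10:82, n, replace = TRUE)
eps     <- rnorm(n)

toy <- data.frame(
  pts       = 40 * usg_pct + 0.02 * gp + eps,
  usg_pct   = usg_pct,
  gp        = gp,
  unrelated = rnorm(n)  # column with no relationship to pts, for contrast
)

# Correlation of every predictor with pts, largest first.
pts_cor <- cor(toy)["pts", ]
pts_cor <- sort(pts_cor[names(pts_cor) != "pts"], decreasing = TRUE)

print(round(pts_cor, 2))
```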

7. Train-Test Split

Code
split <- initial_split(nba_model, prop = 0.75) 
train <- training(split) 
test <- testing(split) 
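
For intuition, the split amounts to drawing 75% of the row indices at random without replacement. The base-R sketch below approximates what initial_split() does and is not its actual implementation. It uses the row count reported by read_csv() above, and the index names are hypothetical.

```r
set.seed(1234567)

n_rows  <- 12844                 # row count reported when the CSV was loaded
prop    <- 0.75
n_train <- floor(n_rows * prop)  # number of rows assigned to training

# Draw training row indices without replacement; the rest form the test set.
train_idx <- sample(seq_len(n_rows), size = n_train)
test_idx  <- setdiff(seq_len(n_rows), train_idx)

print(c(train = length(train_idx), test = length(test_idx)))
```

The 9,633 training rows agree with the linear model output in the next section: 9,622 residual degrees of freedom plus 11 estimated coefficients.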

8. Multiple Linear Regression

Code
lm_fit <- lm(pts ~ ., data = train) 
summary(lm_fit) 

Call:
lm(formula = pts ~ ., data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-34.121  -1.253  -0.198   1.019  20.090 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     1.030897   0.894204   1.153   0.2490    
age             0.010774   0.005757   1.872   0.0613 .  
player_height  -0.030786   0.005199  -5.922 3.29e-09 ***
player_weight  -0.033639   0.003543  -9.496  < 2e-16 ***
gp              0.024878   0.001219  20.414  < 2e-16 ***
reb             0.827450   0.014531  56.944  < 2e-16 ***
ast             1.890306   0.028060  67.365  < 2e-16 ***
net_rating      0.005595   0.002173   2.575   0.0100 *  
usg_pct        45.621471   0.511484  89.194  < 2e-16 ***
ts_pct          5.723781   0.272493  21.005  < 2e-16 ***
ast_pct       -19.678337   0.514793 -38.226  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.377 on 9622 degrees of freedom
Multiple R-squared:  0.8451,    Adjusted R-squared:  0.8449 
F-statistic:  5248 on 10 and 9622 DF,  p-value: < 2.2e-16

Interpretation:

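  • Usage rate and true shooting are highly significant (p < 2e-16); every predictor except age (p = 0.061) is significant at the 5% level

  • Height and weight are also significant, but their coefficients are small and negative (about -0.03 points per unit each)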
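
The size of a coefficient depends on the units of its predictor, so the large usg_pct estimate does not by itself mean that it matters 45 times more than a predictor with a coefficient near 1. Assuming usg_pct, ts_pct, and ast_pct are stored as proportions between 0 and 1 (an assumption; the output above does not show their ranges), a more natural reading is the change in predicted pts per one-percentage-point increase:

```r
# Coefficient estimates copied from the summary(lm_fit) output above.
coefs <- c(usg_pct = 45.621471, ts_pct = 5.723781, ast_pct = -19.678337)

# Assumed scale: proportions in [0, 1], so one percentage point is 0.01.
one_pct_point <- 0.01

# Change in predicted pts for a one-percentage-point increase,
# holding the other predictors fixed.
effect_per_point <- coefs * one_pct_point

print(round(effect_per_point, 3))
```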

9. Ridge & LASSO Regression

Code
x_train <- model.matrix(pts ~ ., train)[,-1]
y_train <- train$pts
x_test <- model.matrix(pts ~ ., test)[,-1]
y_test <- test$pts


ridge <- cv.glmnet(x_train, y_train, alpha = 0)
lasso <- cv.glmnet(x_train, y_train, alpha = 1)

Coefficient Shrinkage

Code
coef(lasso, s = "lambda.min")
11 x 1 sparse Matrix of class "dgCMatrix"
                 lambda.min
(Intercept)     0.120056556
age             0.006409516
player_height  -0.026534735
player_weight  -0.031892552
gp              0.025267584
reb             0.822515578
ast             1.850455841
net_rating      0.004921158
usg_pct        45.245079706
ts_pct          5.683112308
ast_pct       -18.460495462

Result:

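  • At lambda.min, LASSO keeps all ten predictors; no coefficient is shrunk exactly to zero

  • Most coefficients are slightly smaller in magnitude than the OLS estimates (for example, usg_pct 45.25 vs. 45.62 and ast_pct -18.46 vs. -19.68)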
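
Whether LASSO sets a coefficient exactly to zero depends on the size of the penalty. In the simplified case of orthonormal predictors, the LASSO estimate is the soft-thresholded least-squares estimate. This is a standard textbook result and not something specific to this project. The sketch below uses made-up coefficient values to contrast a small and a large penalty:

```r
# Soft-thresholding operator: shrink toward zero by lambda, clip at zero.
soft_threshold <- function(beta, lambda) {
  sign(beta) * pmax(abs(beta) - lambda, 0)
}

# Made-up least-squares coefficients, for illustration only.
beta_ols <- c(a = 3.0, b = -1.5, c = 0.2, d = -0.05)

small_penalty <- soft_threshold(beta_ols, lambda = 0.01)
large_penalty <- soft_threshold(beta_ols, lambda = 0.5)

print(small_penalty)  # every coefficient stays nonzero
print(large_penalty)  # c and d are set exactly to zero
```

With glmnet, comparing coef(lasso, s = "lambda.min") against coef(lasso, s = "lambda.1se") is one way to see how a larger penalty changes the fitted coefficients.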

10. Regression Tree

Code
tree_fit <- rpart(pts ~ ., data = train, method = "anova") 
rpart.plot(tree_fit) 

Insight:

  • First split occurs on assists.

  • High-usage players form the highest scoring leaf nodes.
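
A regression tree with method = "anova" chooses each split to reduce the within-node sum of squared errors as much as possible. The sketch below searches a single predictor for the best threshold on simulated data. It illustrates the criterion only and is not rpart's implementation; the function name best_split is hypothetical.

```r
# Sum of squared deviations from the node mean.
sse <- function(y) sum((y - mean(y))^2)

# Try every midpoint between adjacent distinct x values and keep the
# threshold that minimizes left-node SSE + right-node SSE.
best_split <- function(x, y) {
  xs   <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  total <- vapply(
    cuts,
    function(thr) sse(y[x < thr]) + sse(y[x >= thr]),
    numeric(1)
  )
  list(cut = cuts[which.min(total)], sse = min(total))
}

set.seed(42)
# Simulated data with a jump in the response at x = 5.
x <- runif(300, 0, 10)
y <- ifelse(x < 5, 4, 12) + rnorm(300)

res <- best_split(x, y)
print(res$cut)
print(c(before = sse(y), after = res$sse))
```

A full tree repeats this search over every predictor and then recurses into each child node.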

11. Model Evaluation

Code
rmse <- function(actual, predicted) {
 sqrt(mean((actual - predicted)^2))
}


lm_rmse <- rmse(y_test, predict(lm_fit, test))
ridge_rmse <- rmse(y_test, predict(ridge, x_test, s = "lambda.min"))
lasso_rmse <- rmse(y_test, predict(lasso, x_test, s = "lambda.min"))
tree_rmse <- rmse(y_test, predict(tree_fit, test))


results <- tibble(
 Model = c("Linear", "Ridge", "LASSO", "Tree"),
 RMSE = c(lm_rmse, ridge_rmse, lasso_rmse, tree_rmse)
)


results
# A tibble: 4 × 2
  Model   RMSE
  <chr>  <dbl>
1 Linear  2.43
2 Ridge   2.49
3 LASSO   2.44
4 Tree    2.74

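RMSE measures the typical size of the prediction error on the held-out test set, in the same units as pts; lower is better. The linear model (2.43) and LASSO (2.44) are nearly tied, Ridge is slightly worse (2.49), and the regression tree has the largest error (2.74). Differences between models reflect, among other things, the bias-variance tradeoff: the penalized models accept a little bias in exchange for lower variance, which did not pay off on this dataset.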
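
An RMSE value is easier to judge next to a naive baseline, such as always predicting the training-set mean. The sketch below shows that comparison on simulated numbers, not the NBA data. With the real objects, y_train and y_test from the earlier sections would be used instead.

```r
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

set.seed(7)
# Simulated stand-ins for one predictor and the training/test responses.
x_tr <- runif(500)
y_tr <- 2 + 20 * x_tr + rnorm(500, sd = 2)
x_te <- runif(200)
y_te <- 2 + 20 * x_te + rnorm(200, sd = 2)

# Baseline: predict the training mean for every test row.
baseline_rmse <- rmse(y_te, rep(mean(y_tr), length(y_te)))

# Simple linear model fitted on the training data only.
fit <- lm(y ~ x, data = data.frame(x = x_tr, y = y_tr))
model_rmse <- rmse(y_te, predict(fit, newdata = data.frame(x = x_te)))

print(c(baseline = baseline_rmse, model = model_rmse))
```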

12. Conclusion

  • The linear model had the lowest test RMSE (2.43), with LASSO essentially tied (2.44); regularization did not improve accuracy on this data

  • Usage rate and shooting efficiency are dominant predictors

  • The single regression tree is easy to interpret but had the highest test RMSE (2.74)

This project demonstrates practical experience with regression, regularization, and model evaluation in R using real-world sports data.