Introduction

  • Dataset: NBA 2023 Player Statistics
  • Problem Definition: Analyze the performance metrics of NBA players for the 2023 season.
  • Goals:
    • Identify top players based on per-game stats
    • Find relationships between stats such as shooting efficiency, scoring, assists, and rebounds
    • Predict future points per game based on current stats and age

Data Overview

## 'data.frame':    539 obs. of  30 variables:
##  $ PName: chr  "Jayson Tatum" "Joel Embiid" "Luka Doncic" "Shai Gilgeous-Alexander" ...
##  $ POS  : chr  "SF" "C" "PG" "PG" ...
##  $ Team : chr  "BOS" "PHI" "DAL" "OKC" ...
##  $ Age  : int  25 29 24 24 28 21 28 26 24 28 ...
##  $ GP   : int  74 66 66 68 63 79 77 68 73 77 ...
##  $ W    : int  52 43 33 33 47 40 44 44 38 38 ...
##  $ L    : int  22 23 33 35 16 39 33 24 35 39 ...
##  $ Min  : num  2732 2284 2390 2416 2024 ...
##  $ PTS  : int  2225 2183 2138 2135 1959 1946 1936 1922 1914 1913 ...
##  $ FGM  : int  727 728 719 704 707 707 658 679 597 673 ...
##  $ FGA  : int  1559 1328 1449 1381 1278 1541 1432 1402 1390 1388 ...
##  $ FG.  : num  46.6 54.8 49.6 51 55.3 45.9 45.9 48.4 42.9 48.5 ...
##  $ X3PM : int  240 66 185 58 47 213 218 245 154 204 ...
##  $ X3PA : int  686 200 541 168 171 578 636 635 460 544 ...
##  $ X3P. : num  35 33 34.2 34.5 27.5 36.9 34.3 38.6 33.5 37.5 ...
##  $ FTM  : int  531 661 515 669 498 319 402 319 566 363 ...
##  $ FTA  : int  622 771 694 739 772 422 531 368 639 428 ...
##  $ FT.  : num  85.4 85.7 74.2 90.5 64.5 75.6 75.7 86.7 88.6 84.8 ...
##  $ OREB : int  78 113 54 59 137 47 141 63 56 42 ...
##  $ DREB : int  571 557 515 270 605 411 626 226 161 303 ...
##  $ REB  : int  649 670 569 329 742 458 767 289 217 345 ...
##  $ AST  : int  342 274 529 371 359 350 316 301 741 327 ...
##  $ TOV  : int  213 226 236 192 246 259 216 180 300 194 ...
##  $ STL  : int  78 66 90 112 52 125 49 99 80 69 ...
##  $ BLK  : int  51 112 33 65 51 58 21 27 9 18 ...
##  $ PF   : int  160 205 166 192 197 186 233 168 104 159 ...
##  $ FP   : int  3691 3706 3747 3425 3451 3311 3324 2918 3253 2885 ...
##  $ DD2  : int  31 39 36 3 46 9 40 5 40 2 ...
##  $ TD3  : int  1 1 10 0 6 0 0 0 0 0 ...
##  $ X... : int  470 424 128 149 341 97 170 338 100 18 ...
  • The dataset covers 539 players and 30 variables, including Points (PTS), Assists (AST), Rebounds (REB), and Field Goal Percentage (FG%, stored as FG.).

Data Cleaning

# libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
missing_values <- colSums(is.na(nba_data))
missing_values[missing_values > 0]
## named numeric(0)
  • No column contains missing values (the check returns an empty result), so we proceed with exploratory data analysis.

Exploratory Data Analysis

Per-Game Summary Stats - Code

# per-game stats
nba_data <- nba_data %>%
  mutate(PPG = PTS / GP,
         RPG = REB / GP,
         APG = AST / GP,
         TOPG = TOV / GP,
         SPG = STL / GP,
         BPG = BLK / GP)
colnames(nba_data) <- make.names(colnames(nba_data))

Per-Game Summary Statistics

##       PPG              RPG              APG               TOPG       
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.0000   Min.   :0.0000  
##  1st Qu.: 4.153   1st Qu.: 1.831   1st Qu.: 0.7871   1st Qu.:0.5185  
##  Median : 7.033   Median : 3.033   Median : 1.3704   Median :0.8800  
##  Mean   : 9.119   Mean   : 3.541   Mean   : 2.0681   Mean   :1.1039  
##  3rd Qu.:12.092   3rd Qu.: 4.521   3rd Qu.: 2.7448   3rd Qu.:1.5000  
##  Max.   :33.076   Max.   :12.536   Max.   :10.6552   Max.   :4.1096  
##       SPG              BPG        
##  Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.3211   1st Qu.:0.1389  
##  Median :0.5571   Median :0.2727  
##  Mean   :0.6075   Mean   :0.3801  
##  3rd Qu.:0.8333   3rd Qu.:0.4968  
##  Max.   :3.0000   Max.   :3.0000
  • Per-game averages put players who appeared in different numbers of games on a common scale, so comparisons are not skewed by games missed.
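
The call that printed the per-game summary is not shown in the source. A minimal sketch of how such a table could be produced in base R follows; the data frame `toy` is a stand-in holding only the first three players from the data overview, not the full `nba_data`.

```r
# Stand-in for nba_data: season totals for the first three players
# listed in the data overview (Tatum, Embiid, Doncic).
toy <- data.frame(
  PName = c("Jayson Tatum", "Joel Embiid", "Luka Doncic"),
  GP    = c(74, 66, 66),
  PTS   = c(2225, 2183, 2138),
  REB   = c(649, 670, 569),
  AST   = c(342, 274, 529)
)

# Convert season totals to per-game rates.
toy$PPG <- toy$PTS / toy$GP
toy$RPG <- toy$REB / toy$GP
toy$APG <- toy$AST / toy$GP

# Min, quartiles, median, mean, and max for each per-game column.
per_game_summary <- summary(toy[, c("PPG", "RPG", "APG")])
print(per_game_summary)
```

On the full dataset the same `summary()` call would be applied to all six per-game columns (PPG, RPG, APG, TOPG, SPG, BPG).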

Summary Statistics for Key Metrics

# Summary stats
key_metrics <- c("PTS", "AST", "REB", "FG.", "X3P.", "FT.", "TOV", "STL", "BLK", "PF")
nba_statistics <- summary(nba_data[, key_metrics])
print(nba_statistics)
##       PTS              AST             REB             FG.        
##  Min.   :   0.0   Min.   :  0.0   Min.   :  0.0   Min.   :  0.00  
##  1st Qu.: 120.5   1st Qu.: 22.0   1st Qu.: 50.5   1st Qu.: 41.65  
##  Median : 374.0   Median : 69.0   Median :159.0   Median : 45.50  
##  Mean   : 523.4   Mean   :115.5   Mean   :198.3   Mean   : 46.33  
##  3rd Qu.: 769.5   3rd Qu.:162.5   3rd Qu.:286.0   3rd Qu.: 50.60  
##  Max.   :2225.0   Max.   :741.0   Max.   :973.0   Max.   :100.00  
##       X3P.             FT.              TOV             STL        
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.0   Min.   :  0.00  
##  1st Qu.: 28.10   1st Qu.: 66.70   1st Qu.: 14.5   1st Qu.:  8.50  
##  Median : 34.20   Median : 76.30   Median : 44.0   Median : 28.00  
##  Mean   : 31.53   Mean   : 71.99   Mean   : 61.3   Mean   : 33.27  
##  3rd Qu.: 38.50   3rd Qu.: 84.10   3rd Qu.: 92.5   3rd Qu.: 51.00  
##  Max.   :100.00   Max.   :100.00   Max.   :300.0   Max.   :128.00  
##       BLK               PF        
##  Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:  5.00   1st Qu.: 32.00  
##  Median : 13.00   Median : 86.00  
##  Mean   : 21.24   Mean   : 91.18  
##  3rd Qu.: 28.00   3rd Qu.:140.00  
##  Max.   :193.00   Max.   :279.00
  • Descriptive statistics give us an overview of the player metrics, including mean, median, and ranges.

Questions

  1. Who is the best player based on overall stats?
  2. Which teams had the highest average points per game among all players?
  3. What is the relationship between a player’s position and PPG?
  4. How do FG% and 3PT% affect a player’s scoring?
  5. How does a player’s age affect performance metrics?
  6. Can a player’s future PPG be predicted based on current stats and age?

Best Player Analysis (Based on Efficiency)

# efficiency formula
nba_data <- nba_data %>%
  mutate(Efficiency = (PTS + REB + AST + STL + BLK) - ((FGA - FGM) + TOV))
##              PName Team Efficiency      PPG      APG      RPG  FG. X3P.  FT.
## 1     Nikola Jokic  DEN       2696 24.49275 9.826087 11.84058 63.2 38.3 82.2
## 2 Domantas Sabonis  SAC       2569 19.11392 7.253165 12.31646 61.5 37.3 74.2
## 3      Joel Embiid  PHI       2479 33.07576 4.151515 10.15152 54.8 33.0 85.7
  • The top 3 players are identified using a custom efficiency metric, which considers scoring, assists, rebounds, steals, blocks, missed shots, and turnovers.
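
The ranking step that produced the table is not shown. As a sketch, the efficiency formula can be applied and sorted in base R as below; `toy` is a three-row stand-in built from the season totals in the data overview, so the ranking it prints covers only those three players.

```r
# Season totals for the first three players in the data overview.
toy <- data.frame(
  PName = c("Jayson Tatum", "Joel Embiid", "Luka Doncic"),
  PTS = c(2225, 2183, 2138),
  REB = c(649, 670, 569),
  AST = c(342, 274, 529),
  STL = c(78, 66, 90),
  BLK = c(51, 112, 33),
  FGA = c(1559, 1328, 1449),
  FGM = c(727, 728, 719),
  TOV = c(213, 226, 236)
)

# Positive contributions minus missed field goals and turnovers,
# the same formula as in the mutate() call above.
toy$Efficiency <- with(toy, (PTS + REB + AST + STL + BLK) - ((FGA - FGM) + TOV))

# Sort from highest to lowest efficiency and show the leaders.
ranked <- toy[order(-toy$Efficiency), c("PName", "Efficiency")]
print(head(ranked, 3))
```

For Joel Embiid this gives (2183 + 670 + 274 + 66 + 112) - ((1328 - 728) + 226) = 2479, which matches the value in the table.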

Team-wise Performance

# team avg ppg
team_avg_pts <- nba_data %>%
  group_by(Team) %>%
  summarize(avg_ppg = mean(PPG, na.rm = TRUE)) %>%
  arrange(desc(avg_ppg)) %>%
  head(5)
print(team_avg_pts)
## # A tibble: 5 × 2
##   Team  avg_ppg
##   <chr>   <dbl>
## 1 PHX     10.8 
## 2 NOP     10.5 
## 3 PHI     10.4 
## 4 LAC     10.4 
## 5 LAL      9.92
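  • The Phoenix Suns had the highest average PPG per player (10.8). This may indicate that scoring is spread across the roster, but the average is not weighted by games or minutes played and depends on roster size, so it is not direct evidence of ball distribution.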

Player Position vs Points Scored

  • The box plot shows that point guards had the highest median PPG, while centers typically scored the least. However, the league leader in PPG was a center, who appears as an outlier.
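
The plotting code is not included in the source. The sketch below shows one way to get the per-position medians behind such a box plot in base R. It uses only the four players whose positions appear in the data overview, so it illustrates the call and does not reproduce the league-wide finding.

```r
# First four players from the data overview, with positions.
toy <- data.frame(
  PName = c("Jayson Tatum", "Joel Embiid", "Luka Doncic",
            "Shai Gilgeous-Alexander"),
  POS = c("SF", "C", "PG", "PG"),
  PTS = c(2225, 2183, 2138, 2135),
  GP  = c(74, 66, 66, 68)
)
toy$PPG <- toy$PTS / toy$GP

# Compute box plot statistics per position without drawing;
# change plot = FALSE to plot = TRUE to draw the figure.
bp <- boxplot(PPG ~ POS, data = toy, plot = FALSE)

# Row 3 of bp$stats holds the median of each group.
medians <- setNames(bp$stats[3, ], bp$names)
print(medians)
```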

Field Goal Percentage vs Points Scored

  • The graph shows that players who average at least 20 PPG tend to shoot between 40% and 60% from the field. The median FG% in this dataset is 45.5%, so this range runs from slightly below the median to well above it rather than being uniformly elite.

3PT Percentage vs Points Scored - Graph

3PT Percentage vs. Points Scored - Reasoning

  • Similarly, players who average at least 20 PPG tend to shoot between 30% and 45% from the three-point line. This range sits on both sides of the dataset median of 34.2%, which suggests that three-point shooting is a somewhat less consistent skill among these scorers.
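
These ranges can be checked numerically. The sketch below does so in base R for the ten players whose values appear in the data overview (the ten highest point totals). That sample is an assumption of convenience: it is far too small and too selective to stand in for every 20-PPG scorer.

```r
# First ten rows of the data overview (values copied from the str() output).
top10 <- data.frame(
  PTS  = c(2225, 2183, 2138, 2135, 1959, 1946, 1936, 1922, 1914, 1913),
  GP   = c(74, 66, 66, 68, 63, 79, 77, 68, 73, 77),
  FG.  = c(46.6, 54.8, 49.6, 51.0, 55.3, 45.9, 45.9, 48.4, 42.9, 48.5),
  X3P. = c(35.0, 33.0, 34.2, 34.5, 27.5, 36.9, 34.3, 38.6, 33.5, 37.5)
)
top10$PPG <- top10$PTS / top10$GP

# Keep players averaging at least 20 points per game.
scorers <- top10[top10$PPG >= 20, ]

# Count how many fall inside the ranges quoted in the text.
in_fg_range  <- scorers$FG.  >= 40 & scorers$FG.  <= 60
in_3pt_range <- scorers$X3P. >= 30 & scorers$X3P. <= 45
cat("FG% within 40-60: ", sum(in_fg_range),  "of", nrow(scorers), "\n")
cat("3PT% within 30-45:", sum(in_3pt_range), "of", nrow(scorers), "\n")

# Pearson correlation between each shooting percentage and PPG.
cor_fg  <- cor(scorers$FG.,  scorers$PPG)
cor_3pt <- cor(scorers$X3P., scorers$PPG)
print(c(FG = cor_fg, ThreePT = cor_3pt))
```

In this sample all ten players fall inside the FG% range, and nine of ten fall inside the three-point range (the exception shoots 27.5%).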

Age vs. Performance Graphs

Age vs. Performance

  • In all three graphs, PPG, APG, and RPG tend to decrease as age increases.
  • However, LeBron James, one of the league's oldest players at 38, appears as an outlier on all of the graphs.
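
A trend like this can be quantified with a simple linear fit. The sketch below fits PPG against age in base R, using the ten players visible in the data overview. Because that sample contains only top scorers, the slope it produces says nothing reliable about the league-wide trend; on the full dataset the same call would use `nba_data`.

```r
# Age, points, and games played for the first ten rows of the data overview.
top10 <- data.frame(
  Age = c(25, 29, 24, 24, 28, 21, 28, 26, 24, 28),
  PTS = c(2225, 2183, 2138, 2135, 1959, 1946, 1936, 1922, 1914, 1913),
  GP  = c(74, 66, 66, 68, 63, 79, 77, 68, 73, 77)
)
top10$PPG <- top10$PTS / top10$GP

# Ordinary least squares: expected PPG as a linear function of age.
fit <- lm(PPG ~ Age, data = top10)

# The Age coefficient is the estimated change in PPG per extra year of age.
slope <- coef(fit)[["Age"]]
print(slope)
```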

Predicting Points Per Game Based on Age

# library
library(caret)

# label each player as above or below the league median PPG
nba_data <- nba_data %>%
  mutate(Above_Current_PPG = ifelse(PPG > median(PPG, na.rm = TRUE), 'Above', 'Below'))

set.seed(123)
nba_data_ml <- nba_data %>%
  select(Above_Current_PPG, Age)

data_ml <- na.omit(nba_data_ml)
trainIndex <- createDataPartition(data_ml$Above_Current_PPG, p = .8, list = FALSE)
trainData <- data_ml[trainIndex,]
testData <- data_ml[-trainIndex,]

# fit a logistic regression of the Above/Below label on Age
model_logistic <- train(Above_Current_PPG ~ Age, data = trainData, method = "glm", family = "binomial")

# predict labels for the held-out test set
predictions <- predict(model_logistic, newdata = testData)

predictions <- factor(predictions, levels = c("Above", "Below"))
testData$Above_Current_PPG <- factor(testData$Above_Current_PPG, levels = c("Above", "Below"))

confusionMatrix(predictions, testData$Above_Current_PPG)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Above Below
##      Above    26    22
##      Below    27    32
##                                          
##                Accuracy : 0.5421         
##                  95% CI : (0.443, 0.6388)
##     No Information Rate : 0.5047         
##     P-Value [Acc > NIR] : 0.2494         
##                                          
##                   Kappa : 0.0832         
##                                          
##  Mcnemar's Test P-Value : 0.5677         
##                                          
##             Sensitivity : 0.4906         
##             Specificity : 0.5926         
##          Pos Pred Value : 0.5417         
##          Neg Pred Value : 0.5424         
##              Prevalence : 0.4953         
##          Detection Rate : 0.2430         
##    Detection Prevalence : 0.4486         
##       Balanced Accuracy : 0.5416         
##                                          
##        'Positive' Class : Above          
## 
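
The headline figures in this output follow directly from the four cell counts. As a sketch, they can be recomputed in base R as below. The `binom.test` call is an assumption about how the accuracy is compared with the no-information rate, so its p-value is meant to be compared with the reported one, not taken as identical by construction.

```r
# Confusion matrix from the output above: rows are predictions,
# columns are the true (reference) labels.
cm <- matrix(c(26, 27, 22, 32), nrow = 2,
             dimnames = list(Prediction = c("Above", "Below"),
                             Reference  = c("Above", "Below")))
n <- sum(cm)

accuracy    <- sum(diag(cm)) / n                           # correct / total
sensitivity <- cm["Above", "Above"] / sum(cm[, "Above"])   # true Above found
specificity <- cm["Below", "Below"] / sum(cm[, "Below"])   # true Below found
nir         <- max(colSums(cm)) / n                        # always guess majority

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, nir = nir), 4))

# One-sided exact binomial test: is accuracy greater than the NIR?
p_value <- binom.test(sum(diag(cm)), n, p = nir,
                      alternative = "greater")$p.value
print(p_value)
```

The model classifies 58 of 107 test players correctly (54.2%), only slightly above the 50.5% obtained by always guessing the larger class, so age alone is a weak predictor here.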

Player predictions

##                     PName Age      PPG Predicted_Above_Current_PPG
## 1             Joel Embiid  29 33.07576                       Above
## 2 Shai Gilgeous-Alexander  24 31.39706                       Below
## 3            Nikola Jokic  28 24.49275                       Above
## 4           Jalen Brunson  26 24.01471                       Above
## 5            LeBron James  38 28.90909                       Above
##   Predicted_PPG_Avg
## 1          36.38333
## 2          28.25735
## 3          26.94203
## 4          26.41618
## 5          31.80000
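
The source does not show how Predicted_PPG_Avg is computed. The printed values are consistent with scaling current PPG up by 10% for players predicted Above and down by 10% for players predicted Below. The sketch below reproduces the column under that assumption; the 1.1 and 0.9 factors are inferred from the numbers, not confirmed by the author.

```r
# Values copied from the table above.
preds <- data.frame(
  PName = c("Joel Embiid", "Shai Gilgeous-Alexander", "Nikola Jokic",
            "Jalen Brunson", "LeBron James"),
  PPG = c(33.07576, 31.39706, 24.49275, 24.01471, 28.90909),
  Predicted_Above_Current_PPG = c("Above", "Below", "Above", "Above", "Above")
)

# Assumed rule (inferred, not from the source): +10% if predicted Above,
# -10% if predicted Below.
factor_10pct <- ifelse(preds$Predicted_Above_Current_PPG == "Above", 1.1, 0.9)
preds$Predicted_PPG_Avg <- preds$PPG * factor_10pct

print(preds[, c("PName", "PPG", "Predicted_PPG_Avg")])
```

If this reading is right, the predicted PPG is a fixed adjustment of the current value and not an output of the logistic model itself.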

Conclusion

  • The best player was identified with a custom efficiency metric that combines scoring, assists, rebounds, steals, and blocks, and subtracts missed field goals and turnovers, giving a single measure of a player's impact.
  • Analyzing a player’s position and scoring can help create lineups and offensive sets for teams.
  • The scatter plots suggest that high scorers cluster within particular ranges of field goal and three-point percentage, although no correlation coefficient was computed to quantify the effect of shooting percentages on PPG.
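  • Comparing a player's age with their averages can inform contract decisions for older players. The age-only logistic model, however, reached 54.2% accuracy against a no-information rate of 50.5% (p = 0.25), so age alone does not reliably predict whether a player will score above the median.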