NBA Analysis

Author

Red Pandas: Daniel Bowen, Samantha Sanchez, Anaya Tention


# Package imports
library(tidyverse)
library(ggplot2)
library(ggrepel)
library(tidyr)
library(car)
library(knitr)
library(dplyr)
library(DT)
library(leaflet)
library(patchwork)
library(plotly)
# install.packages("packrat")
# install.packages("rsconnect")
# install.packages("qrcode")
library(rsconnect)
library(packrat)
library(qrcode)

# Color palette
nba_red    <- "#b01020"
nba_blue   <- "#1e4070"
nba_gold   <- "#e8a020"
nba_white  <- "#f0f4f8"

pos_colors <- c(
  "PG" = "#b01020",
  "SG" = "#1D4289",
  "SF" = "#848484",
  "PF" = "#116e14",
  "C"  = "#e8a020"
)

# Data import + cleaning
df_raw <- read.csv('nba_data_processed.csv', header = TRUE, stringsAsFactors = TRUE)

df <- df_raw |> 
    drop_na()

Introduction

The National Basketball Association (NBA) was founded in 1946 as the Basketball Association of America (BAA) and adopted its current name in 1949. The late Bill Russell won the most NBA championships of any player: 11, all with the Boston Celtics. NBA player performance is influenced by a combination of role, playing time, age, and statistical contribution. Common metrics like points, rebounds, and assists are widely used, but they reflect different responsibilities depending on a player's position and career stage.

Project Goal

This project analyzes NBA per-game player data to examine how performance varies across position, age, and scoring. We use exploratory data analysis (EDA) to identify patterns in the data and regression to explore which statistics are most closely associated with points per game.

Dataset

The dataset comes from Kaggle and was originally scraped from Basketball Reference. It has 29 variables and 649 observations covering per-game averages for one NBA season. We dropped rows with missing values before running any analysis.
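
Before dropping incomplete rows, it can help to check how many rows are affected and which columns contain the gaps. The sketch below uses base R on a small made-up data frame. The object `toy` and its values are hypothetical and are not taken from the NBA file. `na.omit()` stands in for `drop_na()` and gives the same rows in this case.

```r
# Hypothetical toy data: two players have a missing shooting percentage
toy <- data.frame(
  Player = c("A", "B", "C", "D"),
  FG.    = c(0.45, NA, 0.51, 0.38),
  X3P.   = c(0.36, 0.33, NA, 0.40)
)

# Count missing values per column
na_per_column <- colSums(is.na(toy))

# Keep only rows with no missing values (base R analogue of tidyr::drop_na)
toy_clean <- na.omit(toy)

# Number of rows removed by the filter
rows_dropped <- nrow(toy) - nrow(toy_clean)

print(na_per_column)
print(rows_dropped)
```

Running the same two checks on `df_raw` would show how much of the original data the cleaning step removes.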

These are the variables (columns) in the NBA dataset.


# List the column names as a one-column table
data.frame(Variable_Names = names(df)) |>
  knitr::kable(
    caption = "Variable Names in NBA Dataset"
  )

Variable Names in NBA Dataset
Variable_Names
Player
Pos
Age
Tm
G
GS
MP
FG
FGA
FG.
X3P
X3PA
X3P.
X2P
X2PA
X2P.
eFG.
FT
FTA
FT.
ORB
DRB
TRB
AST
STL
BLK
TOV
PF
PTS

You can explore the full cleaned dataset in the table below.


# Interactive data table
datatable(df)

Analysis

We broke the analysis into three parts:

Position behavior
Age distributions
Scoring analysis with a regression model

Player Position Behavior

First, we looked at how positions are distributed in the dataset, then used assists, rebounds, steals, and blocks to see how roles separate by position.

Position Distribution - Interactive Plot


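# Collapse hybrid position labels into a single primary position
df_pos <- df |>
  mutate(
    Pos = case_when(
      Pos == "SG-PG" ~ "SG",
      Pos == "SF-SG" ~ "SF",
      Pos == "PF-SF" ~ "PF",
      TRUE ~ as.character(Pos)  # Pos is read in as a factor, so convert it to character
    )
  )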

pos_counts <- df_pos |>
  count(Pos)

plot_ly(
  pos_counts,
  x = ~Pos,
  y = ~n,
  type = "bar",
  color = I(nba_red),
  text = ~paste("Count:", n),
  hoverinfo = "text"
) |>
  layout(
    title = "Player Position Counts",
    xaxis = list(title = "Positions"),
    yaxis = list(title = "Count")
  )

Point guards and shooting guards make up the largest share of the dataset, which is consistent with most NBA rosters carrying more perimeter players than bigs (forwards and centers).

Assists vs. Rebounds by Position


# Left panel: raw scatter, legend hidden to avoid duplicating it
p1 <- df_pos |>
  ggplot(aes(x = AST, y = TRB, color = Pos)) +
  geom_point(size = 1.75) +
  scale_color_manual(values = pos_colors) +
  labs(
    title = "Assists vs Rebounds",
    x = "Assists",
    y = "Rebounds"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

# Right panel: same scatter with a smoothed trend line per position
p2 <- df_pos |>
  ggplot(aes(x = AST, y = TRB, color = Pos)) +
  geom_point(size = 1.75) +
  geom_smooth(se = FALSE) +
  scale_color_manual(values = pos_colors) +
  labs(
    title = "Assists vs Rebounds (with Trend Lines)",
    x = "Assists",
    y = "Rebounds",
    color = "Position"
  ) +
  theme_minimal()

p1 | p2

We can see a distinct split across positions. Point guards tend to have higher assists and lower rebounds, while centers show the opposite. Forwards fall in between, showing the flexibility in their positions. Overall, position is a good indicator of how players contribute on the court.

Steals vs. Blocks by Position


df_pos |>
  ggplot(aes(x = STL, y = BLK, color = Pos)) +
  geom_point(size = 1.75) +
  scale_color_manual(values = pos_colors) +
  labs(
    title = "Steals vs Blocks",
    x = "Steals",
    y = "Blocks",
    color = "Position"
  ) +
  theme_minimal()

Steals and blocks capture different types of defense: perimeter versus interior. Guards tend to generate more steals, while centers and forwards lead in blocks. Most players cluster at low values for both, so standouts in either category are rare.


# Average per-game stats by position
position_stats <- df_pos |>
  group_by(Pos) |>
  summarize(
    PTS = mean(PTS, na.rm = TRUE),
    AST = mean(AST, na.rm = TRUE),
    TRB = mean(TRB, na.rm = TRUE),
    STL = mean(STL, na.rm = TRUE),
    BLK = mean(BLK, na.rm = TRUE)
  )

metrics <- c("PTS", "AST", "TRB", "STL", "BLK")

# Build one bar chart per metric; .data[[m]] replaces the deprecated aes_string()
plots <- lapply(metrics, function(m) {
  ggplot(position_stats, aes(x = Pos, y = .data[[m]], fill = Pos)) +
    geom_col() +
    scale_fill_manual(values = pos_colors) +
    labs(
      title = m,
      x = "Position",
      y = "Average"
    ) +
    theme_minimal() +
    theme(legend.position = "none")
})

# Arrange the panels side by side with patchwork
wrap_plots(plots, nrow = 1)

Player Age Analysis

This section will examine how age relates to key per-game metrics.

Player Scoring Analysis + Model

Scoring is a key measure of offensive contribution, but it is not evenly distributed across players. This section looks at the distribution of points, applies transformations where needed, and uses regression to identify which metrics are most related to points per game.

Distribution of Points Per Game & Field Goal Attempts


# Points per game (PTS)
p1 <- ggplot(df, aes(x = PTS)) +
  geom_histogram(bins = 25, fill = nba_blue, color = "white") +
  labs(
    title = "Histogram of Points Per Game (PTS)",
    x = "Points per Game (PTS)",
    y = "Count"
  ) +
  theme_minimal()

p2 <- ggplot(df, aes(x = PTS)) +
  geom_boxplot(fill = nba_red) +
  labs(
    title = "Boxplot of Points Per Game (PTS)",
    x = "Points per Game (PTS)"
  ) +
  theme_minimal()

# Field goal attempts per game (FGA)
p3 <- ggplot(df, aes(x = FGA)) +
  geom_density(fill = nba_blue) +
  labs(
    title = "Density Plot of Field Goal Attempts (FGA)",
    x = "Field Goal Attempts (FGA)",
    y = "Density"
  ) +
  theme_minimal()

p4 <- ggplot(df, aes(x = FGA)) +
  geom_boxplot(fill = nba_red) +
  labs(
    title = "Boxplot of Field Goal Attempts (FGA)",
    x = "Field Goal Attempts (FGA)"
  ) +
  theme_minimal()

# Top row: PTS panels; bottom row: FGA panels
(p1 | p2) / (p3 | p4)

Transformations


# Transform data
df_sqrt <- df |> 
  mutate(across(-c(Age, G, Pos, Player, Tm), sqrt))

# Define the variables you want in your dashboard
vars_to_plot <- c("PTS", "FGA", "FTA", "MP", "AST", "TRB", "STL", "BLK", "TOV")

# Pivot the data to a long format
df_dashboard <- df_sqrt |>
  select(all_of(vars_to_plot)) |>
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

# Create the faceted dashboard
ggplot(df_dashboard, aes(x = Value)) +
  geom_histogram(bins = 20, fill = nba_blue, color = nba_white) +
  facet_wrap(~ Variable, scales = "free", ncol = 3) + 
  labs(
    title = "Dashboard: Distributions of Square Root Transformed Variables",
    x = "Square Root Value",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(strip.text = element_text(face = "bold", size = 10))

Several of the original variables showed right-skewed distributions, especially lower-frequency stats like blocks and steals. After applying a square root transformation, the distributions are more balanced and less heavily concentrated near zero. This makes the variables more comparable and better suited for modeling.
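
One way to put a number on "more balanced" is to compare sample skewness before and after the transformation. The sketch below is illustrative only: the `skewness()` helper is written out by hand rather than taken from a package, and the `blk` values are invented to mimic a stat where most players sit near zero.

```r
# Sample skewness (third standardized moment), written out to avoid extra packages
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Hypothetical per-game block averages: most players near zero, a few high
blk <- c(0, 0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.8, 1.2, 2.5)

skew_raw  <- skewness(blk)
skew_sqrt <- skewness(sqrt(blk))

# Both values are positive (right skew); the transformed one is closer to zero
print(c(raw = skew_raw, sqrt = skew_sqrt))
```

Applying the same helper to columns of `df` and `df_sqrt` would show which variables benefit most from the square root.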

Multiple Linear Regression — Full Model

We started with a full model using all numeric predictors to see which ones are statistically significant.


# Create a dataset with only numbers
df_model <- df |> 
  select(where(is.numeric))

# Fit the full model: PTS regressed on every other numeric variable
mlr <- lm(PTS ~ ., data = df_model)
summary(mlr)


Call:
lm(formula = PTS ~ ., data = df_model)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.198345 -0.054386  0.002792  0.051735  0.204425 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.0338178  0.0430185  -0.786  0.43215    
Age         -0.0007865  0.0007995  -0.984  0.32572    
G            0.0006419  0.0002459   2.611  0.00929 ** 
GS          -0.0002373  0.0002933  -0.809  0.41894    
MP          -0.0005837  0.0012623  -0.462  0.64399    
FG           1.5989764  0.0684836  23.348  < 2e-16 ***
FGA         -0.0269209  0.0672518  -0.400  0.68910    
FG.         -0.3409268  0.1590474  -2.144  0.03253 *  
X3P          1.3316071  0.0695742  19.139  < 2e-16 ***
X3PA         0.0518628  0.0684292   0.758  0.44885    
X3P.        -0.0558905  0.0362552  -1.542  0.12378    
X2P          0.3981946  0.0683181   5.829 9.76e-09 ***
X2PA         0.0295982  0.0680930   0.435  0.66398    
X2P.        -0.0336538  0.0489776  -0.687  0.49231    
eFG.         0.4106422  0.1455912   2.821  0.00498 ** 
FT           0.9927641  0.0180146  55.109  < 2e-16 ***
FTA          0.0149571  0.0147759   1.012  0.31188    
FT.         -0.0141833  0.0279741  -0.507  0.61236    
ORB         -0.0458122  0.0661251  -0.693  0.48873    
DRB         -0.0586229  0.0661230  -0.887  0.37571    
TRB          0.0578476  0.0659804   0.877  0.38103    
AST          0.0043846  0.0036584   1.198  0.23126    
STL          0.0011267  0.0116043   0.097  0.92269    
BLK         -0.0079974  0.0117833  -0.679  0.49762    
TOV         -0.0184065  0.0104281  -1.765  0.07813 .  
PF           0.0023310  0.0071511   0.326  0.74458    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.07516 on 526 degrees of freedom
Multiple R-squared:  0.9999,    Adjusted R-squared:  0.9999 
F-statistic: 1.766e+05 on 25 and 526 DF,  p-value: < 2.2e-16
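
An R-squared of 0.9999 is likely a sign that the predictors already contain the response. Points are, by the rules of the game, two per made two-pointer, three per made three-pointer, and one per made free throw, and made field goals (FG) are the sum of X2P and X3P. The short check below, a sketch that ignores the smaller coefficients on attempts and percentages, combines the coefficients reported in the summary above to see whether the model has recovered that scoring rule.

```r
# Coefficients copied from the full-model summary above
b_FG  <- 1.5989764
b_X2P <- 0.3981946
b_X3P <- 1.3316071
b_FT  <- 0.9927641

# Because FG = X2P + X3P, the FG coefficient applies to both shot types
effective_2p <- b_FG + b_X2P
effective_3p <- b_FG + b_X3P

print(c(two_pointer = effective_2p, three_pointer = effective_3p, free_throw = b_FT))
```

The combined values land close to 2, 3, and 1, which suggests the full model is mostly restating how points are counted rather than revealing which skills drive scoring.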

Multicollinearity Check and Refined Model

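Many predictors are highly correlated or are direct functions of one another. For example, field goal attempts track minutes played, made field goals (FG) are the sum of two-point and three-point makes (X2P + X3P), and total rebounds (TRB) are the sum of offensive and defensive rebounds (ORB + DRB). The Variance Inflation Factor (VIF) quantifies this multicollinearity: it measures how much the variance of a coefficient estimate is inflated because that predictor can be explained by the others. We then fit a reduced model on a subset of predictors with low VIF values.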


vif(mlr)

         Age            G           GS           MP           FG          FGA 
    1.176725     1.813753     3.767300    13.042509  2625.525369 10693.712059 
         FG.          X3P         X3PA         X3P.          X2P         X2PA 
   19.494380   364.600150  2384.971875     2.184929  1746.958593  5505.894393 
        X2P.         eFG.           FT          FTA          FT.          ORB 
    3.123517    15.653389    74.570198    73.385560     1.855610   213.759916 
         DRB          TRB          AST          STL          BLK          TOV 
 1350.347497  2290.255889     4.826522     2.322676     1.892363     7.292080 
          PF 
    2.913511
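
For intuition about where these numbers come from, the VIF of predictor j equals 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing that predictor on all the others. The sketch below computes it by hand with base R on simulated data. The variables `x1`, `x2`, and `x3` are hypothetical and unrelated to the NBA dataset; `x2` is built as a near copy of `x1` to force collinearity.

```r
set.seed(42)  # reproducible toy data
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # nearly a copy of x1 (collinear)
x3 <- rnorm(n)                 # unrelated to the others
toy <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing x_j on the rest
manual_vif <- sapply(names(toy), function(v) {
  others <- setdiff(names(toy), v)
  fit <- lm(reformulate(others, response = v), data = toy)
  1 / (1 - summary(fit)$r.squared)
})

print(round(manual_vif, 2))
```

The two collinear variables get very large values while the independent one stays near 1, the same pattern that separates FG, FGA, and TRB from Age, STL, and BLK in the output above.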


mlr2 <- lm(PTS ~ Age + G + GS + AST + STL + BLK + PF, data = df_model)
summary(mlr2)


Call:
lm(formula = PTS ~ Age + G + GS + AST + STL + BLK + PF, data = df_model)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.3940  -2.0299  -0.4176   1.6884  13.5957 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.366625   1.123488   2.997  0.00285 ** 
Age         -0.059924   0.038326  -1.564  0.11851    
G           -0.006179   0.011774  -0.525  0.59994    
GS           0.118135   0.012240   9.651  < 2e-16 ***
AST          1.562581   0.120853  12.930  < 2e-16 ***
STL          0.650518   0.534379   1.217  0.22401    
BLK          1.511576   0.521780   2.897  0.00392 ** 
PF           0.875314   0.301287   2.905  0.00382 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.818 on 544 degrees of freedom
Multiple R-squared:  0.6821,    Adjusted R-squared:  0.678 
F-statistic: 166.8 on 7 and 544 DF,  p-value: < 2.2e-16

In the reduced model, games started, assists, blocks, and personal fouls are significant positive predictors of scoring, while age, games played, and steals are not significant at the 0.05 level. Even without any shooting statistics, the remaining variables explain about two-thirds of the variation in points per game (R² ≈ 0.68).
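
To see how the reduced-model coefficients translate into a prediction, the sketch below applies them to a single made-up player. The coefficients are copied from the summary above; the player's values are invented for illustration and do not describe anyone in the dataset.

```r
# Coefficients copied from the reduced-model summary above
coefs <- c(
  "(Intercept)" = 3.366625, Age = -0.059924, G = -0.006179, GS = 0.118135,
  AST = 1.562581, STL = 0.650518, BLK = 1.511576, PF = 0.875314
)

# Hypothetical player (values invented for illustration)
player <- c(
  "(Intercept)" = 1, Age = 25, G = 70, GS = 70,
  AST = 5, STL = 1, BLK = 0.5, PF = 2
)

# Predicted points per game is the sum of coefficient * value
predicted_pts <- sum(coefs * player[names(coefs)])
print(predicted_pts)
```

With the fitted object available, `predict(mlr2, newdata = ...)` would give the same kind of estimate directly. The residual standard error of about 3.8 points is a reminder that any single prediction carries substantial uncertainty.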

Conclusion

Contact Information

Daniel Bowen: dbowen26@students.kennesaw.edu
Samantha Sanchez: ssanch53@students.kennesaw.edu
Anaya Tention: atention@students.kennesaw.edu