The National Basketball Association (NBA) was founded in 1946 as the Basketball Association of America (BAA) and took its current name in 1949. The late Bill Russell holds the record for the most NBA championships won as a player, with 11, all with the Boston Celtics. NBA player performance is shaped by a combination of role, playing time, age, and statistical contribution. Common metrics like points, rebounds, and assists are widely used, but they reflect different responsibilities depending on a player’s position and career stage.
Project Goal
This project analyzes NBA per-game player data to examine how performance varies across position, age, and scoring. We use exploratory data analysis (EDA) to identify patterns in the data and regression to explore which statistics are most closely associated with points per game.
Dataset
The dataset comes from Kaggle and was originally scraped from Basketball Reference. It has 29 variables and 649 observations covering per-game averages for one NBA season. We dropped rows with missing values before running any analysis.
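The missing-value step can be sketched as follows. This is a minimal illustration on a made-up three-row data frame (`toy` is hypothetical, not the Kaggle data) using base R's `complete.cases()`. The report's own cleaning code is not shown, so the exact call it used is an assumption.

```r
# Hypothetical toy data standing in for the per-game table (not the real dataset)
toy <- data.frame(
  Player = c("A", "B", "C"),
  PTS    = c(12.4, NA, 7.1),
  FG.    = c(0.45, 0.51, NA)
)

# Keep only rows with no missing value in any column
toy_clean <- toy[complete.cases(toy), ]

# Compare row counts before and after to see how many rows were dropped
rows_dropped <- nrow(toy) - nrow(toy_clean)
rows_dropped
```

Comparing row counts before and after is a cheap way to check how much data the filter removes.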
These are the variables (columns) in the NBA dataset.
library(knitr)

# List the dataset's column names as a one-column table
data.frame(Variable_Names = names(df)) %>%
  knitr::kable(caption = "Variable Names in NBA Dataset")
Variable Names in NBA Dataset
Variable_Names
Player
Pos
Age
Tm
G
GS
MP
FG
FGA
FG.
X3P
X3PA
X3P.
X2P
X2PA
X2P.
eFG.
FT
FTA
FT.
ORB
DRB
TRB
AST
STL
BLK
TOV
PF
PTS
You can explore the full cleaned dataset in the table below.
# Interactive data table
datatable(df)
Analysis
We broke the analysis into three parts:
Position behavior
Age distributions
Scoring analysis with a regression model
Player Position Behavior
First, we looked at how positions are distributed in the dataset, then used assists, rebounds, steals, and blocks to see how roles separate by position.
Point guards and shooting guards make up the largest share of the dataset, which is consistent with most NBA rosters carrying more perimeter players than bigs (forwards and centers).
Assists vs. Rebounds by Position
# Scatter plot of assists vs. rebounds, colored by position (legend hidden)
p1 <- df_pos |>
  ggplot(aes(x = AST, y = TRB, color = Pos)) +
  geom_point(size = 1.75) +
  scale_color_manual(values = pos_colors) +
  labs(title = "Assists vs Rebounds", x = "Assists", y = "Rebounds") +
  theme_minimal() +
  theme(legend.position = "none")

# Same plot with a smoothed trend line per position
p2 <- df_pos |>
  ggplot(aes(x = AST, y = TRB, color = Pos)) +
  geom_point(size = 1.75) +
  geom_smooth(se = FALSE) +
  scale_color_manual(values = pos_colors) +
  labs(title = "Assists vs Rebounds", x = "Assists", y = "Rebounds", color = "Position") +
  theme_minimal()

# Show both panels side by side
p1 | p2
We can see a distinct split across positions. Point guards tend to have higher assists and lower rebounds, while centers show the opposite. Forwards fall in between, showing the flexibility in their positions. Overall, position is a good indicator of how players contribute on the court.
Steals vs. Blocks by Position
# Scatter plot of steals vs. blocks, colored by position
df_pos |>
  ggplot(aes(x = STL, y = BLK, color = Pos)) +
  geom_point(size = 1.75) +
  scale_color_manual(values = pos_colors) +
  labs(title = "Steals vs Blocks", x = "Steals", y = "Blocks", color = "Position") +
  theme_minimal()
Steals and blocks capture different types of defense: perimeter versus interior. Guards tend to generate more steals, while centers and forwards lead in blocks. Most players are clustered at low values for both, so players who stand out in either category are uncommon.
# Summarize average per-game stats by position
position_stats <- df_pos |>
  group_by(Pos) |>
  summarize(
    PTS = mean(PTS, na.rm = TRUE),
    AST = mean(AST, na.rm = TRUE),
    TRB = mean(TRB, na.rm = TRUE),
    STL = mean(STL, na.rm = TRUE),
    BLK = mean(BLK, na.rm = TRUE)
  )

metrics <- c("PTS", "AST", "TRB", "STL", "BLK")

# Build one bar chart per metric and join them side by side
p <- NULL
for (i in seq_along(metrics)) {
  m <- metrics[i]
  # .data[[m]] replaces the deprecated aes_string()
  new_plot <- ggplot(position_stats, aes(x = Pos, y = .data[[m]], fill = Pos)) +
    geom_col() +
    scale_fill_manual(values = pos_colors) +
    labs(title = m, x = "Position", y = "Average") +
    theme_minimal() +
    theme(legend.position = "none")
  if (i == 1) {
    p <- new_plot
  } else {
    p <- p | new_plot
  }
}
p
Player Age Analysis
This section will examine how age relates to key per-game metrics.
Player Scoring Analysis + Model
Scoring is a key measure of offensive contribution, but it is not evenly distributed across players. This section looks at the distribution of points, applies transformations where needed, and uses regression to identify which metrics are most related to points per game.
Distribution of Points Per Game & Field Goal Attempts
# PTS (points per game)
p1 <- ggplot(df, aes(x = PTS)) +
  geom_histogram(bins = 25, fill = nba_blue, color = "white") +
  labs(title = "Histogram of Points Per Game (PTS)",
       x = "Points per Game (PTS)", y = "Count") +
  theme_minimal()

p2 <- ggplot(df, aes(x = PTS)) +
  geom_boxplot(fill = nba_red) +
  labs(title = "Boxplot of Points Per Game (PTS)",
       x = "Points per Game (PTS)") +
  theme_minimal()

# FGA (field goal attempts per game)
p3 <- ggplot(df, aes(x = FGA)) +
  geom_density(fill = nba_blue) +
  labs(title = "Density Plot of Field Goal Attempts (FGA)",
       x = "Field Goal Attempts (FGA)", y = "Density") +
  theme_minimal()

p4 <- ggplot(df, aes(x = FGA)) +
  geom_boxplot(fill = nba_red) +
  labs(title = "Boxplot of Field Goal Attempts (FGA)",
       x = "Field Goal Attempts (FGA)") +
  theme_minimal()

# Arrange as a 2 x 2 grid: PTS on top, FGA below
(p1 | p2) / (p3 | p4)
Transformations
# Transform data
df_sqrt <- df |>
  mutate(across(-c(Age, G, Pos, Player, Tm), sqrt))

# Define the variables you want in your dashboard
vars_to_plot <- c("PTS", "FGA", "FTA", "MP", "AST", "TRB", "STL", "BLK", "TOV")

# Pivot the data to a long format
df_dashboard <- df_sqrt |>
  select(all_of(vars_to_plot)) |>
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

# Create the faceted dashboard
ggplot(df_dashboard, aes(x = Value)) +
  geom_histogram(bins = 20, fill = nba_blue, color = nba_white) +
  facet_wrap(~ Variable, scales = "free", ncol = 3) +
  labs(
    title = "Dashboard: Distributions of Square Root Transformed Variables",
    x = "Square Root Value",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(strip.text = element_text(face = "bold", size = 10))
Several of the original variables showed right-skewed distributions, especially lower-frequency stats like blocks and steals. After applying a square root transformation, the distributions are more balanced and less heavily concentrated near zero. This makes the variables more comparable and better suited for modeling.
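To make the effect of the square root transformation concrete, the sketch below compares sample skewness before and after the transform. It uses synthetic exponential data as a stand-in for a right-skewed stat such as blocks per game. The `skewness()` helper and the simulated values are illustrative assumptions, not part of the report's code or data.

```r
# Sample skewness: third standardized moment (simple moment estimator)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(42)
# Synthetic right-skewed values standing in for a low-frequency stat
x <- rexp(500, rate = 2)

skew_raw  <- skewness(x)
skew_sqrt <- skewness(sqrt(x))

# Skewness closer to zero means a more symmetric distribution
c(raw = skew_raw, sqrt = skew_sqrt)
```

The same helper could be applied column by column to the real data to check which variables benefit most from the transform.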
Multiple Linear Regression — Full Model
We started with a full model using all numeric predictors to see which ones are statistically significant.
# Create a dataset with only numeric columns
df_model <- df |>
  select(where(is.numeric))

# Fit the full model: PTS regressed on every other numeric variable
mlr <- lm(PTS ~ ., data = df_model)
summary(mlr)
Some predictors are highly correlated, such as field goal attempts and minutes played. Variance Inflation Factor (VIF) is used to detect multicollinearity, and a reduced model is fit using only variables with acceptable VIF values.
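The report does not show how VIF was computed. As a hedged sketch, the VIF for a predictor can be obtained from its definition: regress that predictor on all the others and take 1 / (1 - R²). The `vif_manual()` helper and the simulated data frame `sim` below are hypothetical names introduced for illustration only.

```r
# VIF for each predictor: regress it on the other predictors, then 1 / (1 - R^2)
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

set.seed(1)
n <- 200
# Synthetic predictors: x2 is nearly a copy of x1, x3 is independent of both
sim <- data.frame(x1 = rnorm(n))
sim$x2 <- sim$x1 + rnorm(n, sd = 0.1)
sim$x3 <- rnorm(n)

vifs <- vif_manual(sim, c("x1", "x2", "x3"))
round(vifs, 1)
```

In this simulated setup, `x1` and `x2` receive very large VIFs while `x3` stays near 1. A common rule of thumb treats values above roughly 5 or 10 as a sign of problematic multicollinearity, though the cutoff applied in this analysis is not stated.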
mlr2 <- lm(PTS ~ Age + G + GS + AST + STL + BLK + PF, data = df_model)
summary(mlr2)
Call:
lm(formula = PTS ~ Age + G + GS + AST + STL + BLK + PF, data = df_model)

Residuals:
     Min       1Q   Median       3Q      Max
-14.3940  -2.0299  -0.4176   1.6884  13.5957

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.366625   1.123488   2.997  0.00285 **
Age         -0.059924   0.038326  -1.564  0.11851
G           -0.006179   0.011774  -0.525  0.59994
GS           0.118135   0.012240   9.651  < 2e-16 ***
AST          1.562581   0.120853  12.930  < 2e-16 ***
STL          0.650518   0.534379   1.217  0.22401
BLK          1.511576   0.521780   2.897  0.00392 **
PF           0.875314   0.301287   2.905  0.00382 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.818 on 544 degrees of freedom
Multiple R-squared: 0.6821, Adjusted R-squared: 0.678
F-statistic: 166.8 on 7 and 544 DF, p-value: < 2.2e-16
Games started, assists, blocks, and personal fouls are statistically significant positive predictors of scoring, while age, games played, and steals are not significant at the 0.05 level. Assists have the largest t value (12.93), followed by games started (9.65). The reduced model still explains a large portion of the variation in points per game (R² ≈ 0.68).
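As a worked example of reading the reduced model, the sketch below plugs a hypothetical player profile into the printed coefficients. The coefficient values are copied from the summary output above. The player values are made up for illustration and do not correspond to any row in the dataset.

```r
# Coefficients copied from the reduced-model summary above
coefs <- c(
  `(Intercept)` = 3.366625,
  Age = -0.059924, G = -0.006179, GS = 0.118135,
  AST = 1.562581, STL = 0.650518, BLK = 1.511576, PF = 0.875314
)

# Hypothetical player profile (illustrative values, not from the dataset);
# the intercept term is multiplied by 1
player <- c(`(Intercept)` = 1, Age = 25, G = 70, GS = 60,
            AST = 5, STL = 1, BLK = 0.5, PF = 2)

# Predicted points per game = intercept + sum of coefficient * value
pred_pts <- sum(coefs * player[names(coefs)])
pred_pts
```

With the fitted object available, `predict(mlr2, newdata = ...)` should give the same value up to rounding of the printed coefficients. Each coefficient is the expected change in points per game for a one-unit increase in that variable with the others held fixed, so one extra assist per game is associated with about 1.56 more points.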