NBA player performance is influenced by a combination of role, playing time, age, and statistical contribution. While common metrics like points, rebounds, and assists are widely used, they often reflect different responsibilities depending on position and stage of a player’s career.
Project Goal
This project analyzes NBA per-game player data to examine how performance varies across position, age, and scoring. We use exploratory data analysis (EDA) to identify patterns in the data and regression to explore which statistics are most closely associated with points per game.
Dataset
The dataset comes from Kaggle and was originally scraped from Basketball Reference. It has 29 variables and 649 observations covering per-game averages for one NBA season. We dropped rows with missing values before running any analysis.
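As a minimal sketch of the missing-value step, the snippet below drops incomplete rows from a small made-up table using base R. The `raw` data frame and its values are hypothetical stand-ins, and the original analysis may have used a different function (for example `tidyr::drop_na()`).

```r
# Toy stand-in for the raw per-game table; all values are invented for illustration.
raw <- data.frame(
  Player = c("A", "B", "C", "D"),
  Pos    = c("PG", "C", "SF", "SG"),
  PTS    = c(12.3, 8.1, NA, 20.4),
  FGpct  = c(0.45, NA, 0.51, 0.48)
)

# Keep only rows that have no missing value in any column.
df_clean <- raw[complete.cases(raw), ]

# Report how many rows the filter removed.
n_dropped <- nrow(raw) - nrow(df_clean)
cat("Rows before:", nrow(raw), "| after:", nrow(df_clean), "| dropped:", n_dropped, "\n")
```

Counting the dropped rows is a cheap check that the filter did not remove more of the data than expected.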
You can explore the full cleaned dataset in the table below.
View Code
# Interactive data table (datatable() comes from the DT package)
datatable(df)
Analysis
We broke the analysis into three parts:
Position behavior
Age distributions
Scoring analysis with a regression model
Player Position Behavior
First, we looked at how positions are distributed in the dataset, then used assists, rebounds, steals, and blocks to see how roles separate by position.
Point guards and shooting guards make up the largest share of the dataset, which is consistent with most NBA rosters carrying more perimeter players than bigs (forwards and centers).
Assists vs. Rebounds by Position
View Code
# Left panel: raw scatter with the legend hidden
p1 <- df_pos |>
  ggplot(aes(x = AST, y = TRB, color = Pos)) +
  geom_point(size = 1.75) +
  scale_color_manual(values = pos_colors) +
  labs(title = "Assists vs Rebounds", x = "Assists", y = "Rebounds") +
  theme_minimal() +
  theme(legend.position = "none")

# Right panel: same scatter with a smoothed trend per position
p2 <- df_pos |>
  ggplot(aes(x = AST, y = TRB, color = Pos)) +
  geom_point(size = 1.75) +
  geom_smooth(se = FALSE) +
  scale_color_manual(values = pos_colors) +
  labs(title = "Assists vs Rebounds", x = "Assists", y = "Rebounds", color = "Position") +
  theme_minimal()

# Place the panels side by side (the | operator comes from patchwork)
p1 | p2
We can see a distinct split across positions. Point guards tend to have higher assists and lower rebounds, while centers show the opposite. Forwards fall in between, showing the flexibility in their positions. Overall, position is a good indicator of how players contribute on the court.
Steals vs. Blocks by Position
View Code
df_pos |>
  ggplot(aes(x = STL, y = BLK, color = Pos)) +
  geom_point(size = 1.75) +
  scale_color_manual(values = pos_colors) +
  labs(title = "Steals vs Blocks", x = "Steals", y = "Blocks", color = "Position") +
  theme_minimal()
Steals and blocks capture different types of defense: perimeter versus interior. Guards tend to generate more steals, while centers and forwards lead in blocks. Most players are clustered at low values for both, so those who stand out in either category are less common.
View Code
# Summarize average stats by position
position_stats <- df_pos |>
  group_by(Pos) |>
  summarize(
    PTS = mean(PTS, na.rm = TRUE),
    AST = mean(AST, na.rm = TRUE),
    TRB = mean(TRB, na.rm = TRUE),
    STL = mean(STL, na.rm = TRUE),
    BLK = mean(BLK, na.rm = TRUE)
  )

metrics <- c("PTS", "AST", "TRB", "STL", "BLK")

# Build one bar chart per metric and join them side by side with patchwork
p <- NULL
for (i in seq_along(metrics)) {
  m <- metrics[i]
  # .data[[m]] replaces the deprecated aes_string() for column names held in a string
  new_plot <- ggplot(position_stats, aes(x = Pos, y = .data[[m]], fill = Pos)) +
    geom_col() +
    scale_fill_manual(values = pos_colors) +
    labs(title = m, x = "Position", y = "Average") +
    theme_minimal() +
    theme(legend.position = "none")
  if (i == 1) {
    p <- new_plot
  } else {
    p <- p | new_plot
  }
}
p
Player Age Analysis
This graph compares player age with points per game. It helps show whether scoring changes across different career stages and whether peak scoring tends to occur among younger, prime-age, or older players.
View Code
ggplot(df, aes(x = Age, y = PTS)) +
  geom_jitter(alpha = 0.4, width = 0.2) +
  geom_smooth(method = "loess", se = TRUE, color = "blue") +
  labs(
    title = "Age vs. Points Per Game",
    subtitle = "Scoring tends to peak in the late 20s",
    x = "Player Age",
    y = "Points Per Game"
  ) +
  theme_minimal()
The relationship between age and scoring is relatively weak but shows a slight upward trend into the late 20s, followed by a gradual decline. This suggests that players tend to reach peak scoring performance during their prime years, typically between ages 27 and 29. However, the wide spread of points at all ages indicates that age alone is not a strong predictor of scoring.
Correlation of Key Metrics
This correlation matrix shows how strongly the main performance variables are related to each other. This is useful before regression because highly correlated predictors, such as minutes played and field goal attempts, can create multicollinearity.
The correlation heatmap shows that scoring is most strongly associated with usage-based statistics such as field goal attempts, free throws, and minutes played. These usage statistics are also highly correlated with one another, which signals multicollinearity and justifies refining the model. In contrast, age shows weak relationships with most performance metrics, indicating that it is not a strong predictor of points. There is also a cluster of related statistics that reflects positional roles: guards contribute more in assists and steals, while bigs contribute more in rebounds and blocks.
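The correlation check described here can be sketched in base R as follows. The data are simulated, the column names merely mirror the dataset's abbreviations, and the 0.8 cutoff is an illustrative assumption rather than a value taken from the analysis.

```r
# Simulated stand-in data: FGA is built from MP, and PTS from FGA,
# so these three are strongly related by construction. Age is independent.
set.seed(1)
n   <- 200
MP  <- runif(n, 5, 36)
FGA <- 0.4 * MP + rnorm(n, sd = 1.5)
PTS <- 1.2 * FGA + rnorm(n, sd = 1.5)
Age <- sample(19:38, n, replace = TRUE)
toy <- data.frame(MP, FGA, PTS, Age)

# Pearson correlation matrix of the numeric columns.
cor_mat <- cor(toy, use = "complete.obs")
print(round(cor_mat, 2))

# List each pair once (upper triangle) whose |r| exceeds the assumed cutoff.
cutoff <- 0.8
high <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
print(high_pairs)
```

Pairs flagged this way are candidates for closer inspection before fitting a regression, since including both members of a pair can inflate coefficient standard errors.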
Player Scoring Analysis + Model
Scoring is a key measure of offensive contribution, but it is not evenly distributed across players. This section looks at the distribution of points, applies transformations where needed, and uses regression to identify which metrics are most related to points per game.
Distribution of Points Per Game & Field Goal Attempts
View Code
# PTS (Points)
p1 <- ggplot(df, aes(x = PTS)) +
  geom_histogram(bins = 25, fill = nba_blue, color = "white") +
  labs(title = "Histogram of Points Per Game (PTS)", x = "Points per Game (PTS)", y = "Count") +
  theme_minimal()

p2 <- ggplot(df, aes(x = PTS)) +
  geom_boxplot(fill = nba_red) +
  labs(title = "Boxplot of Points Per Game (PTS)", x = "Points per Game (PTS)") +
  theme_minimal()

# FGA (Field Goal Attempts)
p3 <- ggplot(df, aes(x = FGA)) +
  geom_density(fill = nba_blue) +
  labs(title = "Density Plot of Field Goal Attempts (FGA)", x = "Field Goal Attempts (FGA)", y = "Density") +
  theme_minimal()

p4 <- ggplot(df, aes(x = FGA)) +
  geom_boxplot(fill = nba_red) +
  labs(title = "Boxplot of Field Goal Attempts (FGA)", x = "Field Goal Attempts (FGA)") +
  theme_minimal()

# Arrange as a 2 x 2 grid with patchwork: | places side by side, / stacks
(p1 | p2) / (p3 | p4)
Transformations
View Code
# Square-root transform every column except identifiers, age, and games played
df_sqrt <- df |>
  mutate(across(-c(Age, G, Pos, Player, Tm), sqrt))

# Variables to show in the dashboard
vars_to_plot <- c("PTS", "FGA", "FTA", "MP", "AST", "TRB", "STL", "BLK", "TOV")

# Pivot the data to a long format
df_dashboard <- df_sqrt |>
  select(all_of(vars_to_plot)) |>
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

# Create the faceted dashboard
ggplot(df_dashboard, aes(x = Value)) +
  geom_histogram(bins = 20, fill = nba_blue, color = nba_white) +
  facet_wrap(~ Variable, scales = "free", ncol = 3) +
  labs(
    title = "Dashboard: Distributions of Square Root Transformed Variables",
    x = "Square Root Value",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(strip.text = element_text(face = "bold", size = 10))
Several of the original variables showed right-skewed distributions, especially lower-frequency stats like blocks and steals. After applying a square root transformation, the distributions are more balanced and less heavily concentrated near zero. This makes the variables more comparable and better suited for modeling.
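One way to quantify the effect of the transformation is to compare sample skewness before and after taking the square root. The sketch below uses simulated right-skewed values in place of a real stat such as blocks, and the `skewness()` helper is a hypothetical function written for this illustration, not one used in the original analysis.

```r
# Simulated right-skewed, non-negative values standing in for a low-frequency stat.
set.seed(42)
blk <- rexp(500, rate = 2)

# Moment-based sample skewness: third central moment divided by
# the second central moment raised to the power 1.5.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

skew_raw  <- skewness(blk)
skew_sqrt <- skewness(sqrt(blk))  # sqrt() requires non-negative input

cat("Skewness before:", round(skew_raw, 2), "| after sqrt:", round(skew_sqrt, 2), "\n")
```

A value closer to zero after the transform indicates a more symmetric distribution, which matches what the histograms show visually.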
Multiple Linear Regression — Full Model
We started with a full model using all numeric predictors to see which ones are statistically significant.
View Code
# Keep only the numeric columns
df_model <- df |>
  select(where(is.numeric))

# Fit the full model: PTS regressed on every other numeric variable
mlr <- lm(PTS ~ ., data = df_model)
summary(mlr)
Some predictors are highly correlated, such as field goal attempts and minutes played. VIF is used to detect multicollinearity, and a reduced model is fit using only variables with acceptable VIF values.
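The source does not show how the VIF values were computed. As a rough sketch, the variance inflation factor of a predictor can be obtained in base R by regressing it on the other predictors and taking 1 / (1 - R²). The `vif_manual()` helper, the simulated data, and the cutoff of 5 are all assumptions made for illustration; packages such as car offer a ready-made `vif()` function.

```r
# Hypothetical helper: VIF for each column of a data frame of predictors.
vif_manual <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    # Regress predictor v on all remaining predictors.
    fit <- lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Simulated predictors: FGA is built from MP, AST is independent of both.
set.seed(7)
n   <- 300
MP  <- runif(n, 5, 36)
FGA <- 0.4 * MP + rnorm(n, sd = 1)
AST <- rexp(n, rate = 0.5)
toy <- data.frame(MP, FGA, AST)

vifs <- vif_manual(toy)
print(round(vifs, 2))

# Flag predictors above an assumed rule-of-thumb cutoff of 5.
flagged <- names(vifs)[vifs > 5]
print(flagged)
```

Predictors flagged this way are the natural candidates to drop or combine when building a reduced model.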
mlr2 <- lm(PTS ~ Age + G + GS + AST + STL + BLK + PF, data = df_model)
summary(mlr2)
Call:
lm(formula = PTS ~ Age + G + GS + AST + STL + BLK + PF, data = df_model)
Residuals:
Min 1Q Median 3Q Max
-14.3940 -2.0299 -0.4176 1.6884 13.5957
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.366625 1.123488 2.997 0.00285 **
Age -0.059924 0.038326 -1.564 0.11851
G -0.006179 0.011774 -0.525 0.59994
GS 0.118135 0.012240 9.651 < 2e-16 ***
AST 1.562581 0.120853 12.930 < 2e-16 ***
STL 0.650518 0.534379 1.217 0.22401
BLK 1.511576 0.521780 2.897 0.00392 **
PF 0.875314 0.301287 2.905 0.00382 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.818 on 544 degrees of freedom
Multiple R-squared: 0.6821, Adjusted R-squared: 0.678
F-statistic: 166.8 on 7 and 544 DF, p-value: < 2.2e-16
Games started and assists are the strongest positive predictors of scoring (p < 2e-16), and blocks and personal fouls are also significant at the 0.01 level. Age, games played, and steals are not significant. After reducing the model, the remaining variables still explain a large portion of the variation in points per game (R² ≈ 0.68).
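To make the coefficients concrete, the sketch below plugs a made-up player profile into the fitted equation using the estimates from the output above. The player values are hypothetical and chosen only to show how each term contributes. With access to the fitted object, `predict(mlr2, newdata = ...)` would do the same job.

```r
# Coefficient estimates copied from the reduced-model summary.
coefs <- c(
  Intercept = 3.366625, Age = -0.059924, G = -0.006179, GS = 0.118135,
  AST = 1.562581, STL = 0.650518, BLK = 1.511576, PF = 0.875314
)

# Hypothetical player profile (per-game averages, except Age, G, and GS).
player <- c(
  Intercept = 1, Age = 27, G = 70, GS = 60,
  AST = 5, STL = 1, BLK = 0.5, PF = 2
)

# Predicted points per game: sum of coefficient times value for each term.
contributions <- coefs * player[names(coefs)]
pred_pts <- sum(contributions)

print(round(contributions, 2))
cat("Predicted PTS:", round(pred_pts, 2), "\n")

# Holding everything else fixed, one extra assist per game shifts the
# prediction by the AST coefficient.
player_plus <- player
player_plus["AST"] <- player["AST"] + 1
delta <- sum(coefs * player_plus[names(coefs)]) - pred_pts
```

The residual standard error of about 3.8 points means an individual prediction like this carries substantial uncertainty.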