NBA Analytics – Exploring Team Performance Through Reproducible Analysis

Introduction

The NBA season is about to begin, and I have just been hired as a Data Analyst for the National Basketball Association. Commissioner Adam Silver is asking me for insights into last year’s team performances — specifically, how offensive and defensive metrics relate to one another, and whether teams from the Eastern and Western Conferences differ in their overall performance.

Using real-style NBA data from all 30 teams, my job is to build a reproducible analysis in RMarkdown that loads, cleans, visualizes, and analyzes the data to uncover meaningful patterns.

Step 1: Loading & Preparing the Data

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(Hmisc)

## 
## Attaching package: 'Hmisc'
## 
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## 
## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(readxl)
library(skimr)
library(ggcorrplot)
library(GGally)
library(ppcor)

## Loading required package: MASS
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select

library(knitr)

bball_function <- function(x){
  basketball<- read_xlsx("NBA Team Total Data 2024-2025.xlsx", sheet = x)
  basketball$Team <- x
  basketball$Won_award<- ifelse(is.na(basketball$Awards),"No","Yes")
  basketball$PRA<- basketball$PTS + basketball$TRB + basketball$AST
  basketball$STOCKS<- basketball$STL+ basketball$BLK
  return(basketball)
}

team_names<- excel_sheets("NBA Team Total Data 2024-2025.xlsx")

all_data<- lapply(team_names, bball_function) %>% bind_rows()

all_data %>% head(n=5) %>% kable(caption = "NBA Team Data 2024-2025, first 5 rows")

NBA Team Data 2024-2025, first 5 rows
Rk	Player	Age	G	GS	MP	FG	FGA	FG%	3P	3PA	3P%	2P	2PA	2P%	eFG%	FT	FTA	FT%	ORB	DRB	TRB	AST	STL	BLK	TOV	PF	PTS	Awards	Team	Won_award	PRA	STOCKS	Pos
1	Jalen Wilson	24	79	22	2031	246	620	0.397	122	362	0.337	124	258	0.481	0.495	135	165	0.818	75	195	270	145	40	5	79	164	749	NA	Nets	No	1164	45	NA
2	Keon Johnson	22	79	56	1925	303	779	0.389	126	401	0.314	177	378	0.468	0.470	107	139	0.770	63	234	297	175	82	30	116	209	839	NA	Nets	No	1311	112	NA
3	Nic Claxton	25	70	62	1882	320	568	0.563	5	21	0.238	315	547	0.576	0.568	79	154	0.513	157	358	515	157	62	100	87	150	724	NA	Nets	No	1396	162	NA
4	Cameron Johnson	28	57	57	1800	355	747	0.475	159	408	0.390	196	339	0.578	0.582	201	225	0.893	54	193	247	194	53	25	99	104	1070	NA	Nets	No	1511	78	NA
5	Ziaire Williams	23	63	45	1541	214	520	0.412	103	302	0.341	111	218	0.509	0.511	101	123	0.821	61	226	287	84	62	28	67	149	632	NA	Nets	No	1003	90	NA

In this chunk, I created a function (bball_function) that loads one team’s sheet from the NBA 2024-2025 workbook, records the team name, creates a variable showing whether the player received an award (Won_award: ‘No’ or ‘Yes’), and creates two more variables: the player’s PRA (points + rebounds + assists) and their STOCKS (steals + blocks). Phew!

To actually load all 30 sheets, I created an object called ‘team_names’ that holds the name of every sheet in the workbook (one sheet per team), and used lapply() to call bball_function on each of those names. Finally, I bound all the rows together to make one large dataset of all 30 teams called ‘all_data’.
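The same read-one-sheet-then-stack pattern can be sketched without the workbook. This is only an illustration, not the code used above: the two tiny data frames, the `sheets` list, and the helper `add_features()` are made up for the example, and base R's `do.call(rbind, ...)` stands in for `bind_rows()` so the sketch runs with no extra packages.

```r
# Toy stand-ins for two workbook sheets (hypothetical numbers, not real data)
sheets <- list(
  Nets  = data.frame(Player = c("A", "B"), PTS = c(10, 20), TRB = c(5, 6),
                     AST = c(2, 3), STL = c(1, 2), BLK = c(0, 1),
                     Awards = c(NA_character_, "AS")),
  Bucks = data.frame(Player = "C", PTS = 30, TRB = 10,
                     AST = 5, STL = 2, BLK = 3,
                     Awards = NA_character_)
)

# Hypothetical helper mirroring bball_function(), minus the file read
add_features <- function(team) {
  df <- sheets[[team]]
  df$Team      <- team
  df$Won_award <- ifelse(is.na(df$Awards), "No", "Yes")
  df$PRA       <- df$PTS + df$TRB + df$AST
  df$STOCKS    <- df$STL + df$BLK
  df
}

# Apply the helper to every sheet name, then stack the per-team results
toy_all <- do.call(rbind, lapply(names(sheets), add_features))
rownames(toy_all) <- NULL
toy_all
```

Because the helper receives the sheet name, each row carries its team label before the rows are stacked, which is what makes the later merge on ‘Team’ possible.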

Step 2: Adding Conference Information

conference_data<- read_xlsx("Team Conferences.xlsx")

all_data<- merge(all_data, conference_data, by="Team")

all_data<- all_data %>%
  mutate(Conference = recode(Conference, "East" = 1, "West" = 0))

all_data %>% head(n=5) %>% kable(caption = "Updated NBA Team Data 2024-2025, first 5 rows")

Updated NBA Team Data 2024-2025, first 5 rows
Team	Rk	Player	Age	G	GS	MP	FG	FGA	FG%	3P	3PA	3P%	2P	2PA	2P%	eFG%	FT	FTA	FT%	ORB	DRB	TRB	AST	STL	BLK	TOV	PF	PTS	Trp-Dbl	Awards	Won_award	PRA	STOCKS	Pos	Conference
Bucks	1	Brook Lopez	36	80	80	2546	394	774	0.509	139	373	0.373	255	401	0.636	0.599	114	138	0.826	113	288	401	143	50	148	84	171	1041	0	NA	No	1585	198	NA	1
Bucks	2	Giannis Antetokounmpo	30	67	67	2289	793	1319	0.601	14	63	0.222	779	1256	0.620	0.607	436	707	0.617	147	651	798	433	58	78	206	155	2036	11	MVP-3,DPOY-8,AS,NBA1	Yes	3267	136	NA	1
Bucks	3	Taurean Prince	30	80	73	2166	235	514	0.457	147	335	0.439	88	179	0.492	0.600	39	48	0.813	34	253	287	155	76	15	82	165	656	0	NA	No	1098	91	NA	1
Bucks	4	Damian Lillard	34	58	58	2093	444	992	0.448	197	524	0.376	247	468	0.528	0.547	362	393	0.921	29	243	272	410	70	10	162	97	1447	2	AS	Yes	2129	80	NA	1
Bucks	5	Gary Trent Jr.	26	74	9	1893	283	657	0.431	180	433	0.416	103	224	0.460	0.568	78	92	0.848	20	148	168	87	72	4	42	124	824	0	NA	No	1079	76	NA	1

I mean, it’s basically in the name. This chunk shows how I added a variable for the conference that each team belongs to (East or West) by loading a dataset of teams and their conferences and merging it with all_data by the column named ‘Team’. And then, of course, I recoded it as a binary variable (East = 1, West = 0).
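One thing worth checking with merge() is that its default behaviour keeps only rows whose key appears in both tables, so a team name spelled differently in the two files would be dropped without a warning. Below is a minimal sketch of that check on made-up data; the `players` and `lookup` tables and the mismatched "Sixers"/"76ers" spelling are hypothetical and not taken from the real files.

```r
# Toy player table and conference lookup (hypothetical values)
players <- data.frame(Team = c("Nets", "Nets", "Lakers", "Sixers"),
                      PRA  = c(100, 200, 300, 400))
lookup  <- data.frame(Team = c("Nets", "Lakers", "76ers"),
                      Conference = c("East", "West", "East"))

# Teams that merge() would silently drop because the names do not match
unmatched <- setdiff(unique(players$Team), lookup$Team)
unmatched

# Default merge() keeps only matching teams (an inner join)
merged <- merge(players, lookup, by = "Team")

# Binary coding used in the analysis: East = 1, West = 0
merged$Conference <- ifelse(merged$Conference == "East", 1, 0)
merged
```

An empty `unmatched` vector on the real data would confirm that every player row survives the merge.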

Step 3: Visual Exploration

Plot 1: PRA vs. STOCKS

p1<- ggplot(all_data, aes(x = PRA, y = STOCKS)) +
  geom_point(aes(color = factor(Conference)), size = 3, alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE, color = "darkgreen") +
  theme_minimal() +
  labs(color = "Conference",
    title = "PRA vs STOCKS",
    subtitle = "Are players' PRA and STOCKS related?",
    x = "PRA", y = "STOCKS") +
  scale_color_manual(
    name = "Conference",
    values = c("0" = "orange", "1" = "skyblue3"),
    labels = c("0" = "West", "1" = "East")
  )
p1

## `geom_smooth()` using formula = 'y ~ x'

Let’s take a look at the relationship between a player’s PRA and their STOCKS. The two are closely related: as PRA goes up, so does STOCKS. However, this may be due to other factors, such as minutes played (i.e., the more time a player gets on the court, the more PRA and STOCKS they are likely to accumulate). The pattern looks much the same in both conferences; neither one clearly outperforms the other.

Plot 2: Minutes Played vs. Age (by Conference)

p2<- ggplot(all_data, aes(x = Age, y = MP, color = factor(Conference))) +
  # Conference is stored as 0/1, so it is converted to a factor here; that way
  # the points and the fitted lines are both grouped and colored by conference
  geom_point(size = 3, alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(title = "Minutes Played vs. Age by Conference",
       x = "Age",
       y = "Minutes Played") +
  scale_color_manual(
    name = "Conference",
    values = c("0" = "orange", "1" = "skyblue3"),
    labels = c("0" = "West", "1" = "East")
  )
p2

## `geom_smooth()` using formula = 'y ~ x'

For my second visualization, I chose to look at Minutes Played vs. Age (grouped by conference, just to see whether the groups differ). Here’s what we see: older players tend to get somewhat more minutes, but there is a lot of variation in minutes played at any given age. It also looks as though the older players in the Western Conference get more playing time, on average, than the older players in the Eastern Conference, though this is a visual impression rather than a tested difference.
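A visual impression like this could be checked numerically. The sketch below shows one possible way, using simulated data only: the `toy` data frame, its coefficients, and the noise level are all invented for illustration, so the printed numbers say nothing about the real league.

```r
set.seed(1)  # simulated data only; these are not NBA numbers

# Hypothetical players: Age, Minutes Played, and the 0/1 conference code
toy <- data.frame(
  Age        = sample(19:38, 200, replace = TRUE),
  Conference = rep(c(0, 1), each = 100)
)
toy$MP <- 500 + 20 * toy$Age + rnorm(200, sd = 600)

# Age-vs-MP correlation computed separately within each conference
by_conf <- sapply(split(toy, toy$Conference),
                  function(d) cor(d$Age, d$MP))
by_conf

# One model with an interaction: the Age:Conference term estimates how much
# the Age slope differs between the two conferences
fit <- lm(MP ~ Age * Conference, data = toy)
coef(summary(fit))
```

On the real data, a small and non-significant Age:Conference coefficient would suggest that the apparent West/East gap among older players is within noise.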

Step 4: Correlation Analyses

Is Conference related to PRA?

cor.test(all_data$Conference, all_data$PRA)

## 
##  Pearson's product-moment correlation
## 
## data:  all_data$Conference and all_data$PRA
## t = -1.8195, df = 650, p-value = 0.0693
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.147164250  0.005629906
## sample estimates:
##         cor 
## -0.07118475

Strength: very weak (-0.07)
Direction: negative
Significance: not significant (p>0.05)

This correlation shows a slight negative trend between Conference and PRA (with East coded as 1, Eastern Conference players have marginally lower PRA), but the effect is so small, and not statistically significant, that it means virtually nothing. There is no evidence of a linear relationship between Conference and PRA.
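To see where the t statistic and p-value come from, they can be recomputed by hand from the correlation and the degrees of freedom reported above. This is a small worked check using the standard formula for testing a Pearson correlation; the variable names are just for the example.

```r
# Values copied from the cor.test() output above
r  <- -0.07118475
df <- 650

# Test statistic for a Pearson correlation: t = r * sqrt(df) / sqrt(1 - r^2)
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with df degrees of freedom
p_val <- 2 * pt(-abs(t_stat), df)

round(t_stat, 4)
round(p_val, 4)
```

With 650 degrees of freedom, even a correlation this small sits close to the 0.05 threshold, which is why the size of the effect matters more here than the p-value.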

Is Conference related to STOCKS?

cor.test(all_data$Conference, all_data$STOCKS)

## 
##  Pearson's product-moment correlation
## 
## data:  all_data$Conference and all_data$STOCKS
## t = -2.094, df = 650, p-value = 0.03665
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.157650363 -0.005105577
## sample estimates:
##         cor 
## -0.08185737

Strength: very weak (-0.08)
Direction: negative
Significance: significant (p<0.05)

This means that there is a very small but statistically significant negative correlation between Conference and STOCKS: with East coded as 1, Eastern Conference players have slightly lower STOCKS on average. However, the effect is so small that it is most likely not practically meaningful.
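Because Conference is a 0/1 variable, these Pearson correlations are point-biserial correlations, and the test is equivalent to a two-sample t-test that assumes equal variances. The sketch below illustrates that equivalence on simulated data; `group` and `score` are hypothetical stand-ins, not the NBA variables.

```r
set.seed(42)  # simulated data only; these are not NBA numbers

group <- rep(c(0, 1), times = c(60, 40))   # hypothetical 0/1 conference code
score <- rnorm(100, mean = 50 - 2 * group, sd = 10)

ct <- cor.test(group, score)
tt <- t.test(score ~ group, var.equal = TRUE)

# Same magnitude of t and the same p-value; only the sign convention differs
c(cor_t = unname(ct$statistic), ttest_t = unname(tt$statistic))
c(cor_p = ct$p.value, ttest_p = tt$p.value)
```

Framed this way, the question in this section is simply whether the average PRA or STOCKS differs between the two conferences.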

Correlation Matrix: Age, PRA, STOCKS, & Won_award_N

all_data <- all_data %>%
  mutate(Won_award_N = case_when(
    Won_award == "No" ~ 0,
    Won_award == "Yes" ~ 1)
  )
rcorr(as.matrix(all_data[, c("PRA", "STOCKS", "Age", "Won_award_N")]))

##              PRA STOCKS  Age Won_award_N
## PRA         1.00   0.84 0.12        0.54
## STOCKS      0.84   1.00 0.08        0.47
## Age         0.12   0.08 1.00        0.06
## Won_award_N 0.54   0.47 0.06        1.00
## 
## n= 652 
## 
## 
## P
##             PRA    STOCKS Age    Won_award_N
## PRA                0.0000 0.0015 0.0000     
## STOCKS      0.0000        0.0484 0.0000     
## Age         0.0015 0.0484        0.1501     
## Won_award_N 0.0000 0.0000 0.1501

colnames(all_data)

##  [1] "Team"        "Rk"          "Player"      "Age"         "G"          
##  [6] "GS"          "MP"          "FG"          "FGA"         "FG%"        
## [11] "3P"          "3PA"         "3P%"         "2P"          "2PA"        
## [16] "2P%"         "eFG%"        "FT"          "FTA"         "FT%"        
## [21] "ORB"         "DRB"         "TRB"         "AST"         "STL"        
## [26] "BLK"         "TOV"         "PF"          "PTS"         "Trp-Dbl"    
## [31] "Awards"      "Won_award"   "PRA"         "STOCKS"      "Pos"        
## [36] "Conference"  "Won_award_N"

all_data_num<- all_data %>% dplyr::select(Age, PRA, STOCKS, Won_award_N)

corr_matrix <- cor(all_data_num, use = "pairwise.complete.obs")
corr_matrix

##                    Age       PRA     STOCKS Won_award_N
## Age         1.00000000 0.1238926 0.07734898   0.0564231
## PRA         0.12389260 1.0000000 0.84021798   0.5361470
## STOCKS      0.07734898 0.8402180 1.00000000   0.4722913
## Won_award_N 0.05642310 0.5361470 0.47229129   1.0000000

Before creating the matrix, I had to create a numeric binary version of Won_award, which I named Won_award_N, where 0 = ‘No’ and 1 = ‘Yes’.

Let’s visualize it:

ggcorrplot(corr_matrix, lab = TRUE, type = "lower") +
    labs(title = "Correlation Matrix: PRA, STOCKS, Age, Won_award_N")

## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggcorrplot package.
##   Please report the issue at <https://github.com/kassambara/ggcorrplot/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Consistent with the scatterplot above, PRA and STOCKS have the strongest relationship (r = 0.84). As mentioned before, this is likely because players with a higher PRA also have more playing time, which would make them more likely to have a higher number of STOCKS as well. After that, the next strongest relationships are those of Won_award_N with PRA (r = 0.54) and with STOCKS (r = 0.47). This makes sense, because I would expect the players who won an award to have performed better, either offensively or defensively, than those who did not. Age is only weakly related to everything else (r of 0.12 or less).

Partial Correlation

PRA vs. STOCKS, Controlling for Minutes Played

pcor.test(all_data$PRA, all_data$STOCKS, all_data$MP)

##     estimate    p.value statistic   n gp  Method
## 1 0.07911891 0.04359333  2.021931 652  1 pearson

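PRA and STOCKS are closely related on their own (r = 0.84), but once we control for Minutes Played the correlation shrinks to about 0.08. It is still statistically significant (p = 0.044), but only barely, and it is very weak. This just about confirmed my suspicions: PRA and STOCKS are both likely to be higher when a player has more time on the court, not necessarily because being good at one means being good at the other. So, most of the link between PRA and STOCKS runs through playing time rather than being direct.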
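For intuition about what "controlling for" a variable does, a first-order partial correlation can be computed two equivalent ways in base R. The sketch below uses simulated data in which minutes drive both stats; the variables `mp`, `pra`, and `stocks` and all coefficients are invented for illustration and are not the NBA data.

```r
set.seed(7)  # simulated data only; these are not NBA numbers

# Hypothetical setup: minutes drive both stats, which are otherwise unrelated
mp     <- runif(300, 100, 2500)
pra    <- 0.6  * mp + rnorm(300, sd = 150)
stocks <- 0.05 * mp + rnorm(300, sd = 20)

r_raw <- cor(pra, stocks)   # inflated by the shared dependence on mp

# Route 1: correlate what is left of each stat after regressing out minutes
r_resid <- cor(resid(lm(pra ~ mp)), resid(lm(stocks ~ mp)))

# Route 2: closed-form first-order partial correlation
r_xz <- cor(pra, mp)
r_yz <- cor(stocks, mp)
r_formula <- (r_raw - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))

c(raw = r_raw, residuals = r_resid, formula = r_formula)
```

The two routes agree, and both show the same pattern as the real analysis: a large raw correlation that mostly vanishes once the shared driver is removed.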

Step 5: Communicating my Findings

Dear Adam Silver,

Thank you for choosing me to run analytics on all of your teams’ performances! Unfortunately, I did not find anything too groundbreaking, but hopefully this provides you with some insights you may not have had before.

First, I looked at the relationship between PRA and STOCKS and found a strong positive linear relationship between the two (r = 0.84). Upon further probing (partial correlation), that relationship all but disappears when we control for Minutes Played (partial r = 0.08), suggesting that having a higher PRA does not make you a better defensive player; rather, more minutes on the court make you more likely to have accumulated both higher PRA and higher STOCKS.

I then took a look at how Minutes Played varied by Age, and the plot suggests that older players get somewhat more minutes on average. Overall, this pattern was pretty consistent across conferences, but I noticed that the older players in the Western Conference tended to get more playing time than the older players in the Eastern Conference.

Next, I ran some correlations on the players’ performances, which revealed that there are no practical differences between conferences when it comes to PRA or STOCKS. You might be able to make the case for Conference negatively predicting STOCKS (r = -0.08, p = 0.037), but the relationship is so weak that it does not really mean much in the real world. My final takeaway is that there was a moderate positive correlation between performance and awards: those who performed better throughout the season (higher PRA and STOCKS) were more likely to have received an award at the end of the season.

So, were there differences between the Eastern and Western Conferences? Not particularly. At least, nothing to write home about. And as far as PRA and STOCKS moving together, they do so almost entirely because both pile up with playing time.

A limitation of my analysis is that I mostly ran correlations at the player level; it would be interesting to see how teams within the same conference varied. Anyway, I hope you choose me again (for some reason) to run analyses for your team performances next year, even though I know nothing about basketball!

Sincerely,

Shannon Joyce

NBA Homework