Final Project

Statistics Final Project

Created by Nathanael Jeffries

For my project I will be completing these tasks:

A description of my audience
Providing background info about my data
Providing a problem statement/goal
Showing my initial EDA (Exploratory Data Analysis)
Explaining what assumptions were made and why they were made
Showing my analysis and support
Showing hypothesis testing
Showing regression models
Conclusions that I’ve come to
Providing actionable outcomes with recommendations
Linking to a companion slide deck and video

My Audience:

My audience for this project is the front office of an NBA team. This could include coaches, scouts, player development, and the general manager. To make it simpler I will say that the audience could specifically be the front office of the Indiana Pacers.

Background Information:

Data: NBA Teams per 100 Possession Stats

Using per 100 possession stats helps to normalize the data across different eras as it accounts for a difference in pace and number of possessions.
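As a small illustration of why this normalization matters, the sketch below converts season totals into per 100 possession rates. The team labels, point totals, and possession counts are made up for the example and are not taken from the data set.

```r
# Hypothetical season totals for two teams that play at different paces
teams <- data.frame(
  Team        = c("Fast-paced team", "Slow-paced team"),
  PTS         = c(9430, 8200),
  Possessions = c(8610, 7380)
)

# Points per 100 possessions = 100 * points / possessions
teams$PTS_per_100 <- 100 * teams$PTS / teams$Possessions

print(teams)
```

In this toy case the fast-paced team scores more total points, but the slow-paced team scores more per possession, which is the comparison the per 100 columns are meant to capture.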

1403 rows of NBA team data from 1973-2024

28 columns including “Season”, “League”, “Team”, “Abbreviation”, “Playoffs”, “Games”, “Minutes_Played”, “FG_per_100”, “FGA_per_100”, “FG_Percent”, “X3p_per_100”, “X3pa_per_100”, “X3p_Percent”, “X2p_per_100”, “X2pa_per_100”, “X2p_Percent”, “FT_per_100”, “FTA_per_100”, “FT_Percent”, “ORB_per_100”, “DRB_per_100”, “TRB_per_100”, “AST_per_100”, “STL_per_100”, “BLK_per_100”, “TOV_per_100”, “PF_per_100”, and “PTS_per_100”.

Main Objective:

Using historical data, what can we identify as the driving factors that lead an NBA team to make the playoffs?

Initial EDA (Exploratory Data Analysis):

Setup:
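The setup code is not shown in this report. As a rough sketch only: the later code relies on a data frame called NBA, a cleaned copy called NBA_clean, and a character vector called vars, and it calls ggplot() and vif(), which come from the ggplot2 and car packages. The toy example below shows one way objects like these could be prepared. The values, the columns placed in vars, and the cleaning rule are all assumptions for illustration, not the actual setup used for the results below.

```r
# Toy stand-in for the real data set (made-up values, only a few columns)
NBA <- data.frame(
  Season      = c("2010-11", "2010-11", "2011-12", "2011-12"),
  Playoffs    = c(TRUE, FALSE, TRUE, FALSE),
  FG_Percent  = c(0.48, 0.47, 0.49, 0.46),
  X3p_Percent = c(NA, 0.28, 0.30, 0.25),
  PTS_per_100 = c(105.1, 103.8, 108.2, 102.5)
)

# Assumed: vars holds the names of the stat columns to test
vars <- c("FG_Percent", "X3p_Percent", "PTS_per_100")

# Assumed cleaning rule: keep only rows with no missing stat values
NBA_clean <- NBA[complete.cases(NBA[, vars]), ]

# Assumed coding: 1 = made the playoffs, 0 = did not
NBA_clean$Playoffs <- as.integer(NBA_clean$Playoffs)

print(NBA_clean)
```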

Average Field Goal Percentage over time:

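# Extract starting year (e.g., "2010-11" -> 2010)
NBA_ts <- NBA

NBA_ts$Year <- as.numeric(substr(NBA_ts$Season, 1, 4))

NBA_ts <- NBA_ts[, c("Year", "FG_Percent")]

NBA_ts <- NBA_ts[complete.cases(NBA_ts), ]

# Average across teams so each year has a single point on the line
# (aggregate() returns the rows sorted by Year)
NBA_ts <- aggregate(FG_Percent ~ Year, data = NBA_ts, FUN = mean)

plot(NBA_ts$Year, NBA_ts$FG_Percent,
     type = "l",
     xlab = "Season (Year)",
     ylab = "Field Goal Percentage",
     main = "FG% Over Time")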

The plot above shows how field goal percentage has changed over time. Seeing that trend gives us a better idea of how different play styles and rule changes affected shooting in separate eras of the NBA’s history.

Box plots by Playoff Status:

boxplot(PTS_per_100 ~ Playoffs, data = NBA_clean,
        main = "Scoring by Playoff Status",
        xlab = "Playoffs (0 = No, 1 = Yes)",
        ylab = "PTS per 100")

boxplot(FG_Percent ~ Playoffs, data = NBA_clean,
        main = "FG% by Playoff Status")

boxplot(TOV_per_100 ~ Playoffs, data = NBA_clean,
        main = "Turnovers by Playoff Status")

These box plots above give us a good idea of how points, field goal percentage, and turnovers differ for playoff and non-playoff teams.

Correlation Heat map for key variables:

cleanvars <- NBA_clean[, c("PTS_per_100", "FG_Percent", "X3p_Percent",
                     "TOV_per_100", "ORB_per_100", "FTA_per_100")]

cor_matrix <- cor(cleanvars)

heatmap(cor_matrix, main = "Correlation Heatmap")

The heat map above shows the pairwise correlations between some of the key variables. Checking these helps us avoid multicollinearity, which happens when two strongly correlated predictors are placed in the same regression model.
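A heat map can be hard to read precisely, so one option is to also list the variable pairs whose correlation exceeds a chosen cutoff. The sketch below does this on simulated data. The data, the 0.7 cutoff, and the object names toy and high_pairs are assumptions for illustration and are not results from the NBA data.

```r
# Simulated data (made-up values): PTS is built to correlate with FG%
set.seed(1)
n   <- 200
fg  <- rnorm(n, mean = 0.46, sd = 0.02)
pts <- 60 + 100 * fg + rnorm(n, mean = 0, sd = 1)
tov <- rnorm(n, mean = 16, sd = 1.5)   # unrelated to the other two
toy <- data.frame(FG_Percent = fg, PTS_per_100 = pts, TOV_per_100 = tov)

cor_matrix <- cor(toy)

# Look at each pair once (upper triangle) and flag |r| above the cutoff
cutoff <- 0.7
idx <- which(abs(cor_matrix) > cutoff & upper.tri(cor_matrix), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = cor_matrix[idx]
)

print(high_pairs)
```

Pairs that show up in a list like this are candidates to keep out of the same model, or at least to check with a VIF afterwards.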

Assumptions Made and Explanations for why:

Each team’s season is treated as an independent observation. Despite the fact that teams could have very similar rosters or coaching systems to previous seasons, the data entries will be considered as unique values. This assumption was made to allow for analysis like the regression model to provide insight.
Some prior knowledge was necessary to conduct this project. I had to utilize some of my prior knowledge of the NBA and the sport of basketball to better understand the results I was getting. This assumption was made in order to have more meaningful analysis and more actionable results.
A binary outcome is appropriate for the Playoffs variable. This aligns directly with the goal of determining what factors lead to a team making the playoffs. This assumption was made to make the analysis possible. Unfortunately, it means that the data doesn’t capture how far teams go in the playoffs.
The historical data will reflect future behavior. This helps give value to the data we have in every era of NBA history. This assumption was made to increase the amount of data we have to go off of and subsequently improve the results.

Analyses and Support:

I am going to run a two-sample t-test on each numeric variable in the data set, comparing playoff and non-playoff teams, and then rank the variables by the size of the difference in group means.
results <- data.frame(Variable = character(),
                      Mean_Playoff = numeric(),
                      Mean_NonPlayoff = numeric(),
                      Difference = numeric(),
                      P_Value = numeric(),
                      stringsAsFactors = FALSE)

for (v in vars) {
  
  NBA[[v]] <- as.numeric(as.character(NBA[[v]]))
  
  temp <- NBA[!is.na(NBA[[v]]) & !is.na(NBA$Playoffs), ]
  
  playoff_vals <- temp[[v]][temp$Playoffs == 1]
  non_vals <- temp[[v]][temp$Playoffs == 0]
  
  mean_playoff <- mean(playoff_vals, na.rm = TRUE)
  mean_non <- mean(non_vals, na.rm = TRUE)
  
  diff <- mean_playoff - mean_non
  
  test <- t.test(playoff_vals, non_vals)
  
  results <- rbind(results, data.frame(
    Variable = v,
    Mean_Playoff = mean_playoff,
    Mean_NonPlayoff = mean_non,
    Difference = diff,
    P_Value = test$p.value
  ))
}

# Sort by absolute difference in means (raw, not standardized)
results$Abs_Diff <- abs(results$Difference)

results <- results[order(-results$Abs_Diff), ]

# View top variables
head(results, 21)

       Variable Mean_Playoff Mean_NonPlayoff   Difference      P_Value
21  PTS_per_100  107.9313725     104.9846154  2.946757164 4.481852e-32
5  X3pa_per_100   16.1965763      17.9134021 -1.716825742 3.486285e-03
7   X2p_per_100   35.7903268      34.2863422  1.503984568 1.330929e-09
8  X2pa_per_100   72.7389542      71.4616954  1.277258801 2.825290e-02
16  AST_per_100   24.6366013      23.3989011  1.237700208 2.868554e-28
11  FTA_per_100   26.8938562      25.6623234  1.231532818 2.031085e-13
1    FG_per_100   41.0994771      39.9836735  1.115803655 4.991180e-32
10   FT_per_100   20.4271895      19.3193093  1.107880280 7.749641e-19
15  TRB_per_100   45.0732026      44.0772370  0.995965566 9.092024e-18
14  DRB_per_100   32.2013072      31.3089482  0.892358995 2.171911e-11
17  STL_per_100    8.5490196       8.0974882  0.451531382 2.162129e-16
4   X3p_per_100    5.7932953       6.2365979 -0.443302646 4.203907e-02
18  BLK_per_100    5.3711111       4.9583987  0.412712367 4.919850e-15
19  TOV_per_100   16.1082353      16.4113030 -0.303067689 1.980641e-03
2   FGA_per_100   87.5827451      87.8284144 -0.245669345 4.361643e-02
20   PF_per_100   22.8773856      23.0284144 -0.151028822 2.265961e-01
13  ORB_per_100   12.8734641      12.7678179  0.105646156 3.255683e-01
3    FG_Percent    0.4694248       0.4553658  0.014059060 2.614627e-39
9   X2p_Percent    0.4938510       0.4825997  0.011251294 5.238812e-14
12   FT_Percent    0.7604575       0.7535228  0.006934753 8.787273e-06
6   X3p_Percent    0.3361712       0.3320584  0.004112765 1.112162e-01
      Abs_Diff
21 2.946757164
5  1.716825742
7  1.503984568
8  1.277258801
16 1.237700208
11 1.231532818
1  1.115803655
10 1.107880280
15 0.995965566
14 0.892358995
17 0.451531382
4  0.443302646
18 0.412712367
19 0.303067689
2  0.245669345
20 0.151028822
13 0.105646156
3  0.014059060
9  0.011251294
12 0.006934753
6  0.004112765

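Ranked by the raw difference in group means, points per 100 possessions shows the largest gap between playoff and non-playoff teams. Several other variables also show statistically significant differences. The raw difference depends on each variable's scale, though: the percentage columns are on a 0 to 1 scale, so they fall to the bottom of this ranking even though FG_Percent has the smallest p-value in the table. This loop also runs on the full NBA data frame and drops missing values one variable at a time, so its group means differ slightly from the t-tests on NBA_clean later in the report.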
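One way to compare variables that sit on different scales is a standardized effect size such as Cohen's d, which divides the difference in means by the pooled standard deviation. The sketch below is a swapped-in alternative to the raw-difference ranking, not the method used for the table above. The function cohens_d and all of the values are made up for illustration.

```r
# Cohen's d: difference in means divided by the pooled standard deviation
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Made-up values for two variables on very different scales
playoff_pts <- c(108, 110, 107, 109, 111)
non_pts     <- c(104, 106, 105, 103, 107)
playoff_fg  <- c(0.47, 0.48, 0.46, 0.47, 0.49)
non_fg      <- c(0.45, 0.46, 0.45, 0.44, 0.46)

d_pts <- cohens_d(playoff_pts, non_pts)
d_fg  <- cohens_d(playoff_fg, non_fg)

# Raw differences are far apart, but the standardized ones are comparable
print(c(raw_pts = mean(playoff_pts) - mean(non_pts),
        raw_fg  = mean(playoff_fg) - mean(non_fg)))
print(c(d_pts = d_pts, d_fg = d_fg))
```

In the loop above, the same idea could be applied by storing a standardized difference for each variable and sorting on its absolute value instead of Abs_Diff.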

Visuals

Now I will showcase some interesting and insightful visualizations:

top_vars <- results[1:10, ]

short_names <- gsub("_per_100", "", top_vars$Variable)
short_names <- gsub("X3p", "3P", short_names)
short_names <- gsub("X2p", "2P", short_names)

ggplot(top_vars, aes(x = reorder(short_names, Difference), y = Difference)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Top Factors Impacting Playoff Teams",
       x = "",
       y = "Difference") +
  theme_minimal()

The chart above ranks the ten variables with the largest gaps between playoff and non-playoff teams. The difference axis is the playoff mean minus the non-playoff mean, so a bar below zero means playoff teams average less of that stat.

NBA_clean$Playoffs <- factor(NBA_clean$Playoffs, labels = c("No Playoffs", "Playoffs"))

ggplot(NBA_clean, aes(x = Playoffs, y = FG_Percent, fill = Playoffs)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "Shooting Efficiency Drives Playoff Success",
       x = "",
       y = "FG%") +
  theme_minimal() +
  theme(legend.position = "none")

ggplot(NBA_clean, aes(x = Playoffs, y = TOV_per_100, fill = Playoffs)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "Turnovers Reduce Playoff Chances",
       x = "",
       y = "Turnovers / 100") +
  theme_minimal() +
  theme(legend.position = "none")

These two box plots highlight how playoff teams have a higher field goal percentage and fewer turnovers than non-playoff teams.

ggplot(NBA_clean, aes(x = AST_per_100, y = TOV_per_100, color = Playoffs)) +
  geom_point(alpha = 0.6) +
  labs(title = "Ball Movement vs Turnovers",
       x = "Assists / 100",
       y = "Turnovers / 100") +
  theme_minimal()

ggplot(NBA_clean, aes(x = X3pa_per_100, y = FG_Percent, color = Playoffs)) +
  geom_point(alpha = 0.6) +
  labs(title = "3PT Volume vs Efficiency",
       x = "3PA / 100",
       y = "FG%") +
  theme_minimal()

ggplot(NBA_clean, aes(x = STL_per_100, y = BLK_per_100, color = Playoffs)) +
  geom_point(alpha = 0.6) +
  labs(title = "Defensive Playmaking Matters",
       x = "Steals / 100",
       y = "Blocks / 100") +
  theme_minimal()

These visuals highlight the value of taking care of the ball, 3-point shooting, and defensive playmaking. The first visual shows us that many playoff teams are very good at assisting on their shots and limiting their turnovers. The second visual tells us that field goal percentage seems to matter more than simply taking more 3-point shots. The final visual shows us that many playoff teams are great at getting steals and blocking shots.

ggplot(NBA_clean, aes(x = FTA_per_100, fill = Playoffs)) +
  geom_density(alpha = 0.5) +
  labs(title = "Playoff Teams Get to the Line More",
       x = "FT Attempts / 100",
       y = "Density") +
  theme_minimal()

The density plot above shows us that playoff teams tend to attempt more free throws than non-playoff teams.

Hypothesis Tests:

Null hypotheses:

Playoff teams and non-playoff teams have the same scoring (PTS_per_100)

Playoff teams and non-playoff teams have the same field goal percentage (FG_Percent)

Playoff teams and non-playoff teams have the same three point percentage (X3p_Percent)

Playoff teams and non-playoff teams have the same turnovers (TOV_per_100)

Playoff teams and non-playoff teams have the same offensive rebounds (ORB_per_100)

Playoff teams and non-playoff teams have the same free throw attempts (FTA_per_100)

T-Tests:

t.test(PTS_per_100 ~ Playoffs, data = NBA_clean)


    Welch Two Sample t-test

data:  PTS_per_100 by Playoffs
t = -13.029, df = 1122.7, p-value < 2.2e-16
alternative hypothesis: true difference in means between group No Playoffs and group Playoffs is not equal to 0
95 percent confidence interval:
 -3.483341 -2.571531
sample estimates:
mean in group No Playoffs    mean in group Playoffs 
                 105.5644                  108.5919

t.test(FG_Percent ~ Playoffs, data = NBA_clean)


    Welch Two Sample t-test

data:  FG_Percent by Playoffs
t = -13.274, df = 1256.4, p-value < 2.2e-16
alternative hypothesis: true difference in means between group No Playoffs and group Playoffs is not equal to 0
95 percent confidence interval:
 -0.01665629 -0.01236676
sample estimates:
mean in group No Playoffs    mean in group Playoffs 
                0.4547766                 0.4692882

t.test(X3p_Percent ~ Playoffs, data = NBA_clean)


    Welch Two Sample t-test

data:  X3p_Percent by Playoffs
t = -1.5939, df = 1280, p-value = 0.1112
alternative hypothesis: true difference in means between group No Playoffs and group Playoffs is not equal to 0
95 percent confidence interval:
 -0.0091750311  0.0009495016
sample estimates:
mean in group No Playoffs    mean in group Playoffs 
                0.3320584                 0.3361712

t.test(TOV_per_100 ~ Playoffs, data = NBA_clean)


    Welch Two Sample t-test

data:  TOV_per_100 by Playoffs
t = 3.3604, df = 1208.8, p-value = 0.0008027
alternative hypothesis: true difference in means between group No Playoffs and group Playoffs is not equal to 0
95 percent confidence interval:
 0.1314588 0.5003171
sample estimates:
mean in group No Playoffs    mean in group Playoffs 
                 16.17595                  15.86006

t.test(ORB_per_100 ~ Playoffs, data = NBA_clean)


    Welch Two Sample t-test

data:  ORB_per_100 by Playoffs
t = -1.1534, df = 1236, p-value = 0.249
alternative hypothesis: true difference in means between group No Playoffs and group Playoffs is not equal to 0
95 percent confidence interval:
 -0.35491791  0.09211612
sample estimates:
mean in group No Playoffs    mean in group Playoffs 
                 12.66460                  12.79601

t.test(FTA_per_100 ~ Playoffs, data = NBA_clean)


    Welch Two Sample t-test

data:  FTA_per_100 by Playoffs
t = -7.4419, df = 1260.6, p-value = 1.831e-13
alternative hypothesis: true difference in means between group No Playoffs and group Playoffs is not equal to 0
95 percent confidence interval:
 -1.6549848 -0.9644425
sample estimates:
mean in group No Playoffs    mean in group Playoffs 
                 25.70412                  27.01384

Four of these six t-tests show a statistically significant difference between playoff and non-playoff teams at the 0.05 level. The differences in three-point percentage and offensive rebounds are not statistically significant.

PTS_per_100: There is just over a 3 point difference in the means for playoff teams and non-playoff teams. This tells us that playoff teams on average score about 3 more points per 100 possessions than non-playoff teams. The p-value (below 2.2e-16) tells us this is significant.

FG_Percent: There is about a 1.45 percentage point difference in field goal percentage for playoff teams and non-playoff teams (0.469 versus 0.455). This tells us that playoff teams on average shoot about 1.45 percentage points better than non-playoff teams. The p-value (below 2.2e-16) tells us this is significant.

X3p_Percent: There is only about a 0.4 percentage point difference in three-point percentage for playoff teams and non-playoff teams. The p-value of 0.1112 is above 0.05 and the confidence interval includes 0, so this difference is not statistically significant.

TOV_per_100: There is about a -0.32 difference in turnovers for playoff and non-playoff teams. This tells us that playoff teams on average turn the ball over about 0.32 fewer times per 100 possessions than non-playoff teams. The p-value of 0.0008027 tells us that this is significant.

ORB_per_100: There is about a 0.13 difference in offensive rebounds for playoff teams and non-playoff teams. Playoff teams on average grab about 0.13 more offensive rebounds per 100 possessions, but the p-value of 0.249 tells us that this difference is not statistically significant.

FTA_per_100: There is about a 1.3 attempt difference in free throw attempts for playoff and non-playoff teams. This tells us that playoff teams on average shoot about 1.3 more free throws per 100 possessions than non-playoff teams. The p-value of 1.831e-13 tells us this is significant.

Statistically significant: Points per 100 possessions, Field goal percentage, Turnovers per 100 possessions, Free throw attempts per 100 possessions

Not statistically significant: Offensive rebounds per 100 possessions, 3-point percentage
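Because six tests are run on the same data, it can be worth checking whether the conclusions hold after a multiple-comparison adjustment. The sketch below applies a Holm correction with p.adjust() to the six p-values reported above. Two of them are only reported as "< 2.2e-16", so the value 2.2e-16 is used as an upper bound, which is an assumption on my part.

```r
# p-values copied from the t-test output above
p_values <- c(
  PTS_per_100 = 2.2e-16,    # reported as "< 2.2e-16"; upper bound used
  FG_Percent  = 2.2e-16,    # reported as "< 2.2e-16"; upper bound used
  X3p_Percent = 0.1112,
  TOV_per_100 = 0.0008027,
  ORB_per_100 = 0.249,
  FTA_per_100 = 1.831e-13
)

# Holm adjustment controls the family-wise error rate across all six tests
p_holm <- p.adjust(p_values, method = "holm")

significant <- names(p_holm)[p_holm < 0.05]

print(p_holm)
print(significant)
```

With these inputs the same four variables stay below 0.05 after adjustment, so the correction does not change which differences are treated as significant.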

Regression Models:

I made two logistic regression models. The first model uses FG_Percent, TOV_per_100, and FTA_per_100 from the t-tests above, along with AST_per_100 and X3pa_per_100 from the earlier ranking. The second model uses PTS_per_100, TOV_per_100, FTA_per_100, STL_per_100, and BLK_per_100.

To emphasize a specific result I included a visual with the first model.

Here is the first model that excludes points:

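# NBA_clean$Playoffs is a two-level factor ("No Playoffs", "Playoffs") at this point.
# glm() treats the first level as failure and the second level as success,
# so no separate 0/1 recoding is needed before fitting.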

model1 <- glm(Playoffs ~ FG_Percent + TOV_per_100 + 
              FTA_per_100 + AST_per_100 + X3pa_per_100,
              data = NBA_clean,
              family = "binomial")

summary(model1)


Call:
glm(formula = Playoffs ~ FG_Percent + TOV_per_100 + FTA_per_100 + 
    AST_per_100 + X3pa_per_100, family = "binomial", data = NBA_clean)

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -16.939561   2.048720  -8.268  < 2e-16 ***
FG_Percent    27.157541   4.232403   6.417 1.39e-10 ***
TOV_per_100   -0.379300   0.054445  -6.967 3.24e-12 ***
FTA_per_100    0.236471   0.027122   8.719  < 2e-16 ***
AST_per_100    0.182496   0.039622   4.606 4.11e-06 ***
X3pa_per_100   0.001518   0.009182   0.165    0.869    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1767.6  on 1282  degrees of freedom
Residual deviance: 1471.4  on 1277  degrees of freedom
AIC: 1483.4

Number of Fisher Scoring iterations: 4

NBA_clean$prob <- predict(model1, newdata = NBA_clean, type = "response")

ggplot(NBA_clean, aes(x = FG_Percent, y = prob, color = Playoffs)) +
  geom_point(alpha = 0.5) +
  labs(title = "Higher FG% Increases Playoff Probability",
       x = "FG%",
       y = "Predicted Probability") +
  theme_minimal()

The plot above shows that the predicted probability of making the playoffs rises as field goal percentage increases.

This first model tells us that field goal percentage is a strong predictor. Its coefficient (27.16) describes a change in FG_Percent from 0 to 1, so it is easier to read per percentage point: an increase of one percentage point in FG% multiplies a team's playoff odds by about 1.31, holding the other variables constant.

The model also tells us that turnovers are very important. Each additional turnover per 100 possessions reduces playoff odds by about 32%, holding the other variables constant.

We also see that assists and free throw attempts increase playoff odds. Each additional assist per 100 possessions raises the odds by about 20%, and each additional free throw attempt per 100 possessions raises them by about 27%.

Lastly, we see that 3-point attempts are not significant. The p-value of 0.869 tells us that, once the other variables are accounted for, 3-point volume shows no detectable association with making the playoffs.
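The arithmetic behind these odds statements can be checked directly from the model 1 coefficients. The sketch below copies two estimates from the summary output above and converts them into odds ratios. The helper pct_change is a name I made up for the example.

```r
# Coefficients copied from summary(model1)
b_fg  <- 27.157541   # per unit of FG_Percent (a change from 0 to 1)
b_tov <- -0.379300   # per additional turnover per 100 possessions

# Odds ratio for a one percentage point (0.01) increase in FG%
or_fg_one_point <- exp(b_fg * 0.01)

# Odds ratio for one additional turnover per 100 possessions
or_tov <- exp(b_tov)

# Convert an odds ratio into a percent change in the odds
pct_change <- function(odds_ratio) 100 * (odds_ratio - 1)

print(round(c(FG_one_point = or_fg_one_point, TOV = or_tov), 3))
print(round(c(FG_one_point = pct_change(or_fg_one_point),
              TOV          = pct_change(or_tov)), 1))
```

Scaling the FG% coefficient this way avoids reading the very large value of exp(27.16), which describes an impossible jump from 0% to 100% shooting.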

Here is the second model that includes scoring and defensive metrics (PTS_per_100, STL_per_100, and BLK_per_100):

model2 <- glm(Playoffs ~ PTS_per_100 + TOV_per_100 + FTA_per_100 + STL_per_100 + BLK_per_100,
              data = NBA_clean,
              family = "binomial")

summary(model2)


Call:
glm(formula = Playoffs ~ PTS_per_100 + TOV_per_100 + FTA_per_100 + 
    STL_per_100 + BLK_per_100, family = "binomial", data = NBA_clean)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -33.24152    2.81349 -11.815  < 2e-16 ***
PTS_per_100   0.22358    0.02062  10.844  < 2e-16 ***
TOV_per_100  -0.17066    0.05746  -2.970  0.00298 ** 
FTA_per_100   0.15532    0.02605   5.962 2.50e-09 ***
STL_per_100   0.66155    0.07331   9.024  < 2e-16 ***
BLK_per_100   0.50474    0.07256   6.956 3.51e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1767.6  on 1282  degrees of freedom
Residual deviance: 1378.9  on 1277  degrees of freedom
AIC: 1390.9

Number of Fisher Scoring iterations: 4

This second model tells us points and free throw attempts per 100 possessions are both significant and have an impact on the chances of a team making the playoffs.

However, in this second model we see that turnovers lose some of their significance. The p-value rises from 3.24e-12 in the first model to 0.00298, which is still significant at the 0.01 level. One possible explanation is that turnovers already reduce scoring, so part of their effect is absorbed by points per 100.

The steals and blocks per 100 possessions both have significant effects on the chances of a team making the playoffs. Steals appear to be more significant than blocks, and this could be because a steal guarantees possession while a block might just leave the ball up for grabs for either team. To further explain the value of steals, each additional steal per 100 possessions nearly doubles a team's playoff odds, since the STL_per_100 odds ratio is 1.94.

Model Comparison

Now I will compare the two models:

exp(coef(model1))

 (Intercept)   FG_Percent  TOV_per_100  FTA_per_100  AST_per_100 X3pa_per_100 
4.397868e-08 6.228312e+11 6.843400e-01 1.266771e+00 1.200210e+00 1.001519e+00

exp(coef(model2))

 (Intercept)  PTS_per_100  TOV_per_100  FTA_per_100  STL_per_100  BLK_per_100 
3.659239e-15 1.250551e+00 8.431087e-01 1.168032e+00 1.937789e+00 1.656547e+00

pred_probs <- predict(model1, type = "response")

pred_class <- ifelse(pred_probs > 0.5, 1, 0)


vif(model1)

  FG_Percent  TOV_per_100  FTA_per_100  AST_per_100 X3pa_per_100 
    1.566829     2.084611     1.758862     1.529595     2.319447

vif(model2)

PTS_per_100 TOV_per_100 FTA_per_100 STL_per_100 BLK_per_100 
   1.526363    2.116531    1.471152    1.218656    1.062597

We can see that model 1 emphasizes offensive efficiency and decision making: shooting accuracy, ball movement, and limiting turnovers are all associated with team success. In contrast, model 2 combines scoring with defensive stats to give a two-way assessment of teams. It tells us that scoring, steals, and blocks are strongly associated with teams making the playoffs. Model 2 also fits the data better, with a lower AIC (1390.9 versus 1483.4) and a lower residual deviance (1378.9 versus 1471.4).

Combining the insight from both models tells us that successful teams are not only excellent when it comes to offensive efficiency, but they also need to have a strong defensive presence and force turnovers.

It is also worth noting that the VIF values at the end of the output above are all less than 5. This means that there are no serious multicollinearity concerns for either model.
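The pred_probs and pred_class objects created earlier in this section are not summarized anywhere in the output. One way to use predicted classes is a confusion matrix and an accuracy figure. The sketch below shows the idea on simulated data, so the numbers it prints say nothing about the real models. The data, the object names toy and toy_model, and the 0.5 cutoff are assumptions for illustration.

```r
# Simulated data (made-up): higher FG% raises the chance of a 1
set.seed(42)
n    <- 300
fg   <- rnorm(n, mean = 0.46, sd = 0.02)
made <- rbinom(n, size = 1, prob = plogis(40 * (fg - 0.46)))
toy  <- data.frame(Playoffs = made, FG_Percent = fg)

toy_model <- glm(Playoffs ~ FG_Percent, data = toy, family = "binomial")

# Predicted probabilities, then classes using a 0.5 cutoff
pred_probs <- predict(toy_model, type = "response")
pred_class <- ifelse(pred_probs > 0.5, 1, 0)

# Confusion matrix: rows are predictions, columns are actual outcomes
conf_mat <- table(
  Predicted = factor(pred_class, levels = c(0, 1)),
  Actual    = factor(toy$Playoffs, levels = c(0, 1))
)

# Share of cases on the diagonal (correct predictions)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

print(conf_mat)
print(accuracy)
```

Accuracy measured on the same data used to fit the model tends to be optimistic, so a held-out set of seasons would give a fairer comparison between the two models.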

Conclusions:

Scoring efficiency is king. Being able to score points efficiently is by far the most valuable skill an NBA team can have.
Taking care of the ball matters. Limiting turnovers and maximizing ball movement opportunities on offense are important. We see that assists can improve playoff odds, while turnovers can greatly reduce playoff odds.
Free throws might be more impactful than people think. Getting to the free throw line helps teams improve their scoring output, so it is important to get chances at the line and to capitalize on them.
3-point shooting may be overrated. Neither 3-point volume nor 3-point percentage showed a statistically significant link to making the playoffs in this analysis. This tells us that it is more important to simply take efficient shots instead of looking to specifically shoot 3s.
Defensive presence is extremely important. Steals and blocks both have a major impact on increasing the chances of making the playoffs. It is massively important to win the possession battle.

Actionable Outcomes and Recommendations:

Prioritize shot quality over shot type. Since we found that shooting efficiency matters more than shot volume or shot type, we should focus on finding the best shots for our players. This could mean that every player will have their own unique shot profile showing which shots are best for them.
Reduce turnovers by any means necessary. Turnovers have a negative impact on playoff odds, so it is important to acquire players with high assist-to-turnover ratios. Additionally, we should teach better decision making in our player development.
Increase free throw opportunities. Free throws are an underrated but impactful source of scoring. This means we should acquire players who can attack the rim and draw fouls consistently. We could also work to incorporate offensive schemes that create more opportunities for contact and subsequently more free throw attempts.
Build the team around efficient scorers. Field goal percentage was one of the strongest predictors of making the playoffs. This means that we need to prioritize acquiring efficient shooters who have many strong shooting spots in their shot profile.
Add defensive playmakers. Steals and blocks both strongly increase playoff probability. This means we should work to acquire players who force turnovers by getting steals in passing lanes, picking the pocket of ball handlers, or shot blockers who can protect the rim. This may also mean that we need to develop a defensive scheme that focuses on creating turnover opportunities.