Predicting the next NBA Finals match-up

Created by: Patrick Ingelmo

www.patrickingelmo.com

Overview

Today is March 5, 2015. We are 60 games, or about 73%, through the regular season. While team trends can still change as the season nears its end, the available data can help create a snapshot of which teams are best positioned to make the NBA finals.

There are two separate datasets being analyzed. The first comes from basketball-reference.com (http://bit.ly/1AMgKJR) and contains every team and its associated stats from the 1970 season through the present season (2014-15). Some of these stats include total points scored, assists, turnovers, rebounds, effective field goal precentage, offensive rating and defensive rating. The other data file is a simple csv file containing each year and the corresponding finals teams.

Load and Clean Data

#load libraries
library(ggplot2)
library(ggthemes)
library(dplyr)
library(knitr)

#set directory to where csv files are located
setwd("/Users/pingelmo/Dropbox/data/nba/teams")
#initialize data frame
nba <- data.frame()
#assign files to list
files_list <- list.files()
#read in csv and row bind into one data frame
for(file in files_list){nba<-rbind(nba,read.table(file,header=T,sep=",",skip=2,stringsAsFactors=F))}
#remove rows not containing team data
nba <- nba[!nba$Season=="Season",]
rownames(nba) <- NULL
# convert characters to numeric
nba[,c(1,5:40)] <- sapply(nba[,c(1,5:40)],as.numeric)
#convert character to factor
nba$Season <- as.factor(nba$Season)
#remove asterisk from Team Name
nba$Tm <- substr(nba[,3],1,3)
#season end
nba$season_end <- as.numeric(substr(nba[,2],1,4))+1
options("scipen"=100, "digits"=4)

#load NBA champions data
finals <- read.csv("/Users/pingelmo/Dropbox/data/nba/finals.csv",header=T)
#create new binary championship variable
finals$finals <- 1
#rename to match NBA dataset
names(finals) <- c("Season","Tm","finals")
#join
nba <- left_join(nba,finals)
#remove NAs
nba$finals <- ifelse(is.na(nba$finals),0,1)

#rename fields
names(nba)[8] <- "win_pct"
names(nba)[33] <- "eFG"
names(nba)[34] <- "TOV_rate"
names(nba)[35] <- "ORB_rate"
names(nba)[36] <- "FT_FGA"
names(nba)[37] <- "opp_eFG"
names(nba)[38] <- "opp_TOV"
names(nba)[39] <- "opp_ORB"
names(nba)[40] <- "opp_FT_FGA"


# get rid of NAs, replace with 0
nba$X2P <- ifelse(is.na(nba$X2P),0,nba$X2P)
nba$X2PA <- ifelse(is.na(nba$X2PA),0,nba$X2PA)
nba$X3P <- ifelse(is.na(nba$X3P),0,nba$X3P)
nba$X3PA <- ifelse(is.na(nba$X3PA),0,nba$X3PA)

#create percentage variables
nba$netRtg <- nba$ORtg - nba$DRtg
nba$fg_per <- nba$FG/nba$FGA
nba$x2p_per <- nba$X2P/nba$X2PA
nba$x3p_per <- nba$X3P/nba$X3PA
nba$ft_per <- nba$FT/nba$FTA
nba$ast_turn <- nba$AST/nba$TOV
nba$pts_per_fga <- nba$PTS/nba$FGA

Explore data

Now that the data has been loaded and cleaned, we can begin to explore it and build some interesting models that can help us determine estimated Wins and the probability of reaching the NBA championship.

I hypothesized that Offensive Rating, Defensive Rating, Wins, Offensive Rebounds, Assists and Effective Field Goal Percentage were most responsible for determining whether a team reached the finals. Before we get into some regressions to test that hypothesis, let’s plot some historical data to see where past finals teams have landed compared to their peers that season.

In the below charts, the larger, light blue dots represent the teams that reached the NBA finals each season. The two most recent lockout shortened seasons (1998-99,2011-12) will fall well below the Loess Regression line when not plotting an efficiency metric.

BLUE DOTS ARE FINALS TEAMS

# Offensive Rating
nba %>% ggplot(aes(x=season_end,y=ORtg,size=finals,col=finals))+geom_point()+geom_smooth()+scale_size_continuous(range=c(1,4))+ggtitle("Offensive Rating")+theme_fivethirtyeight()+theme(legend.position="none",plot.title=element_text(size=rel(1.6),hjust=0))+xlab("Season End")+ylab("Offensive Rating")+annotate("text",x=1977,y=116,label="Made the NBA Finals",col="#1F93F2")

# Defensive Rating - the lower the better
nba %>% ggplot(aes(x=season_end,y=DRtg,size=finals,col=finals))+geom_point()+geom_smooth()+scale_size_continuous(range=c(1,4))+ggtitle("Defensive Rating")+theme_fivethirtyeight()+theme(legend.position="none",plot.title=element_text(size=rel(1.6)))+xlab("Season End")+ylab("Defensive Rating")+annotate("text",x=1977,y=116,label="Made the NBA Finals",col="#1F93F2")

# Wins
nba %>% ggplot(aes(x=season_end,y=W,size=finals,col=finals))+geom_point()+geom_smooth()+scale_size_continuous(range=c(1,4))+ggtitle("Wins")+theme_fivethirtyeight()+theme(legend.position="none",plot.title=element_text(size=rel(1.6)))+xlab("Season End")+ylab("Wins")+annotate("text",x=1976.6,y=75,label="Made the NBA Finals",col="#1F93F2")

# Offensive Rebounds
nba %>% ggplot(aes(x=season_end,y=ORB,size=finals,col=finals))+geom_point()+geom_smooth()+scale_size_continuous(range=c(1,4))+ggtitle("Offensive Rebounds")+theme_fivethirtyeight()+theme(legend.position="none",plot.title=element_text(size=rel(1.6)))+xlab("Season End")+ylab("ORB")+annotate("text",x=1976.5,y=1535,label="Made the NBA Finals",col="#1F93F2")

# Assists
nba %>% ggplot(aes(x=season_end,y=AST,size=finals,col=finals))+geom_point()+geom_smooth()+scale_size_continuous(range=c(1,4))+ggtitle("Assists")+theme_fivethirtyeight()+theme(legend.position="none",plot.title=element_text(size=rel(1.6)))+xlab("Season End")+ylab("Assists")+annotate("text",x=1976.5,y=2600,label="Made the NBA Finals",col="#1F93F2")

# Effective Field Goal
nba %>% ggplot(aes(x=season_end,y=eFG,size=finals,col=finals))+geom_point()+geom_smooth()+scale_size_continuous(range=c(1,4))+ggtitle("Effective Field Goal")+theme_fivethirtyeight()+theme(legend.position="none",plot.title=element_text(size=rel(1.6)))+xlab("Season End")+ylab("eFG")+annotate("text",x=1976.5,y=.57,label="Made the NBA Finals",col="#1F93F2")

A few interesting high-level observations:

Finals teams usually fall above the mean with regard to Offensive Rating, Defensive Rating, Effective Field Goal and Wins
Assists and Offensive Rebounds have been falling league-wide since the 1980s
Win totals seem to fall highest above the mean compared to the other variables

Logistic Regression

In order to determine the probability from a binary observation (Reach Finals or Not), we must use a logit regression. After running the regression a few times, Wins, Offensive Rating and Defensive Rating proved to be the most statistically significant in determining finals teams.

summary(glm(finals~W+DRtg+ORtg,nba,family="binomial"))

## 
## Call:
## glm(formula = finals ~ W + DRtg + ORtg, family = "binomial", 
##     data = nba)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.573  -0.291  -0.124  -0.038   3.279  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.0415     4.3081   -0.47  0.63559    
## W             0.1221     0.0342    3.57  0.00036 ***
## DRtg         -0.2780     0.0937   -2.97  0.00301 ** 
## ORtg          0.2077     0.0905    2.30  0.02168 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 563.17  on 1099  degrees of freedom
## Residual deviance: 362.34  on 1096  degrees of freedom
## AIC: 370.3
## 
## Number of Fisher Scoring iterations: 7

Projecting Wins

If we are going to consider wins a major input into our logistic regression, then we must first project how many wins each team will finish with this year. We can do this with a multiple linear regression. I will only use Offensive Rating and Defensive Rating in order to avoid any multicollinearity issues.

summary(lm(W~DRtg+ORtg,nba))

## 
## Call:
## lm(formula = W ~ DRtg + ORtg, data = nba)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.468  -1.866   0.594   2.927   9.612 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  25.1652     4.4039    5.71          0.000000014 ***
## DRtg         -2.4062     0.0375  -64.19 < 0.0000000000000002 ***
## ORtg          2.5471     0.0357   71.41 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.65 on 1097 degrees of freedom
## Multiple R-squared:  0.862,  Adjusted R-squared:  0.862 
## F-statistic: 3.42e+03 on 2 and 1097 DF,  p-value: <0.0000000000000002

This simple model is very effective in determining the projected wins of teams with a R-Squared of .86.

Below are the projected wins for every team this season. Unsurprisingly, the Golden State Warriors sit atop the rankings with a projected 66 wins and gaudy 10.5 point Net rating (Offensive rating - Defensive Rating).

nba %>% filter(season_end==2015) %>% mutate(projected_wins=25.165+2.547*ORtg-2.406*DRtg) %>% select(Tm,ORtg,DRtg,projected_wins) %>% arrange(desc(projected_wins)) %>% kable()

Tm	ORtg	DRtg	projected_wins
GSW	111.3	100.8	66.12
LAC	112.6	106.0	56.92
ATL	108.9	102.3	56.40
POR	107.6	102.4	52.85
DAL	110.3	105.5	52.27
MEM	106.7	102.3	50.80
TOR	110.5	106.5	50.37
CLE	110.4	106.5	50.11
HOU	106.5	102.8	49.08
SAS	105.8	102.2	48.74
CHI	107.7	104.8	47.33
OKC	106.1	103.3	46.86
MIL	103.1	101.5	43.55
PHO	107.7	107.0	42.03
NOP	108.5	108.0	41.67
WAS	104.2	103.5	41.54
IND	102.2	103.0	37.65
BOS	104.5	105.6	37.25
DET	104.8	106.1	36.81
UTA	105.6	107.0	36.69
CHO	100.4	102.9	33.31
BRK	103.2	106.5	31.78
MIA	103.8	107.2	31.62
SAC	104.3	108.8	29.04
DEN	103.0	108.1	27.42
ORL	101.2	107.2	25.00
LAL	104.0	110.8	23.47
MIN	102.6	110.8	19.90
NYK	101.6	110.3	18.56
PHI	93.7	104.7	11.91

And the most likely NBA Finals matchup is…

# Projected Wins and Champsionship probability
nba %>% filter(season_end==2015) %>% mutate(projected_wins=25.165+2.547*ORtg-2.406*DRtg,finals_prob=exp(-2.0415+.1221*projected_wins+.2077*ORtg-.278*DRtg)/(1+exp(-2.0415+.1221*projected_wins+.2077*ORtg-.278*DRtg))*100) %>% select(Tm,ORtg,DRtg,projected_wins,finals_prob) %>% arrange(desc(finals_prob)) %>% kable()

Tm	ORtg	DRtg	projected_wins	finals_prob
GSW	111.3	100.8	66.12	75.5196
ATL	108.9	102.3	56.40	27.3688
LAC	112.6	106.0	56.92	23.6421
POR	107.6	102.4	52.85	15.3493
DAL	110.3	105.5	52.27	11.1105
MEM	106.7	102.3	50.80	10.7445
HOU	106.5	102.8	49.08	7.5388
SAS	105.8	102.2	48.74	7.4004
TOR	110.5	106.5	50.37	7.2592
CLE	110.4	106.5	50.11	6.9177
OKC	106.1	103.3	46.86	4.7422
CHI	107.7	104.8	47.33	4.6183
MIL	103.1	101.5	43.55	2.8555
WAS	104.2	103.5	41.54	1.6303
PHO	107.7	107.0	42.03	1.3576
NOP	108.5	108.0	41.67	1.1629
IND	102.2	103.0	37.65	0.7756
BOS	104.5	105.6	37.25	0.5794
DET	104.8	106.1	36.81	0.5090
UTA	105.6	107.0	36.69	0.4610
CHO	100.4	102.9	33.31	0.3243
BRK	103.2	106.5	31.78	0.1772
MIA	103.8	107.2	31.62	0.1621
SAC	104.3	108.8	29.04	0.0842
DEN	103.0	108.1	27.42	0.0641
ORL	101.2	107.2	25.00	0.0421
LAL	104.0	110.8	23.47	0.0230
MIN	102.6	110.8	19.90	0.0111
NYK	101.6	110.3	18.56	0.0088
PHI	93.7	104.7	11.91	0.0036

Model Limitations

This model gives Golden State an incredible 75% probability to reach the NBA Championship. This is driven by GSW’s historically great Net Rating. In a distant second are the Atlanta Hawks, with a 27% probability. Of course there are some limitations with this model. First, it does not take into account the incredibly difficult journey Golden State must take through the Western Conference playoffs. Of the top 10 teams listed above, 7 of them are in the Western Conference. Atlanta’s journey will prove much easier, likely facing the Toronto Raptors or Cleveland Cavaliers in the Eastern Conference Finals. Another limitation with this model is its non-time-sensititve ratings. For instance, the Clevelend Cavaliers began the season at an under .500 clip, at one point sitting at 19-20. Since that moment, the team finally began to jell, improved both their Offensive and Defensive Ratings and is now 2nd in the East at 37-24. This model does not take into account this increased play, which likely would contribute to a higher probability of making the NBA finals.

Looking at past finals probabilities

Below are the probabilities for the every finals team since 1980.

Some observations:

The last team with a near 75% probability to make the finals were the 2008 Boston Celtics, who eventually won the Championship.
The 1997 72-win Chicago Bulls had the highest probability in the set, with a 92% chance
The least likely contender? The 1999 New York Knicks, who would lose 4-1 to the San Antonio Spurs
Be aware that lockout shortened seasons (1998-99,2011-12) skew some the probabilities for those finals teams, as wins are a major input into the model

nba %>% filter(finals==1,season_end>=1980) %>% mutate(finals_prob=exp(-2.0415+.1221*W+.2077*ORtg-.278*DRtg)/(1+exp(-2.0415+.1221*W+.2077*ORtg-.278*DRtg))*100) %>% select(season_end,Tm,ORtg,DRtg,W,finals_prob) %>% arrange(desc(season_end)) %>% kable()

season_end	Tm	ORtg	DRtg	W	finals_prob
2014	SAS	110.5	102.4	62	50.3087
2014	MIA	110.9	105.8	54	13.8647
2013	MIA	112.3	103.7	66	62.5559
2013	SAS	108.3	101.6	58	32.9468
2012	OKC	109.8	103.2	47	10.0929
2012	MIA	106.6	100.2	46	10.5300
2011	MIA	111.7	103.5	58	36.9908
2011	DAL	109.7	105.0	57	18.4356
2010	LAL	108.8	103.7	57	21.2045
2010	BOS	107.7	103.8	50	8.1390
2009	LAL	112.8	104.7	65	55.4028
2009	ORL	109.2	101.9	59	38.1082
2008	BOS	110.2	98.9	66	80.3993
2008	LAL	113.0	105.5	57	28.0759
2007	SAS	109.2	99.9	58	48.7238
2007	CLE	105.5	101.3	50	10.1056
2006	DAL	111.8	105.0	60	33.5226
2006	MIA	108.7	104.5	52	10.2817
2005	SAS	107.5	98.8	59	50.5937
2005	DET	105.6	101.2	54	16.1301
2004	LAL	105.5	101.3	56	18.9548
2004	DET	102.0	95.4	54	31.3480
2003	SAS	105.6	99.7	60	37.7780
2003	NJN	103.8	98.1	49	14.5401
2002	LAL	109.4	101.7	58	37.5212
2002	NJN	104.0	99.5	52	14.7732
2001	LAL	108.4	104.8	56	13.9000
2001	PHI	103.6	98.9	56	23.4984
2000	LAL	107.3	98.2	67	75.5065
2000	IND	108.5	103.6	56	18.7057
1999	SAS	104.0	95.0	37	8.8427
1999	NYK	98.6	97.5	27	0.4630
1998	CHI	107.7	99.8	62	53.8322
1998	UTA	112.7	105.4	62	40.9820
1997	CHI	114.4	102.4	69	84.2517
1997	UTA	113.6	104.0	64	61.1975
1996	CHI	115.2	101.8	72	91.5005
1996	SEA	110.3	102.1	64	57.4054
1995	ORL	115.1	107.8	57	24.1600
1995	HOU	109.7	107.4	47	3.3076
1994	HOU	105.9	101.4	58	23.9856
1994	NYK	105.7	98.2	57	39.4721
1993	PHO	113.3	106.7	62	35.4003
1993	CHI	112.9	106.1	57	24.4481
1992	CHI	115.5	104.5	67	74.6030
1992	POR	111.4	104.2	57	28.6668
1991	CHI	114.6	105.2	61	49.0856
1991	LAL	112.1	105.0	58	29.5977
1990	DET	109.9	103.5	59	31.3379
1990	POR	110.5	104.4	59	28.7010
1989	DET	110.8	104.7	63	39.1112
1989	LAL	113.8	106.7	57	24.8218
1988	LAL	113.1	107.3	62	30.7927
1988	DET	110.5	105.3	54	14.5462
1987	LAL	115.6	106.5	65	57.3983
1987	BOS	113.5	106.8	59	27.8071
1986	BOS	111.8	102.6	67	69.7888
1986	HOU	110.1	107.6	51	5.4196
1985	BOS	112.8	106.3	63	38.4134
1985	LAL	114.1	107.0	62	37.3149
1984	BOS	110.9	104.4	62	38.6851
1984	LAL	110.9	107.3	54	9.5906
1983	PHI	108.3	100.9	65	58.3877
1983	LAL	110.5	105.2	58	22.1931
1982	PHI	109.6	103.9	58	25.3510
1982	LAL	110.2	105.5	57	17.9129
1981	BOS	108.4	102.6	62	38.2389
1981	HOU	107.0	106.7	40	0.9989
1980	LAL	109.5	103.9	60	29.8061
1980	PHI	105.0	101.0	59	24.8412