Part 1 Basic Analytics of NBA Metrics

1.1 Data Source

I chose the NBA package nbastatR (version 0.1.151, released under the MIT license) for the analysis.

The package covers almost all NBA box-score metrics from 1946 to the present.

The data are obtained from:

  • NBA Stats API/Basketball-Reference

  • HoopsHype

  • nbadraft.net

  • realgm

  • Basketball Insiders

1.2 Load the data

Because the start of the small-ball era is usually dated to the 2014 season, I selected the 2014 and 2022 seasons for the basic analysis, to find the trend in offensive style and the factors most strongly related to winning in the regular season (one phase of the NBA season).

1.21 Data set variables

Variables_name   explanation
teamName         team name
slugSeason       season (year)
gp               regular-season games played
fgm              field goals made
fga              field goals attempted
pctFG            field goal percentage
fg3m             three-pointers made
fg3a             three-pointers attempted
pctFG3           three-point percentage
ftm              free throws made
fta              free throws attempted
pctFT            free throw percentage
oreb             offensive rebounds
dreb             defensive rebounds
treb             total rebounds
ast              assists
pf               personal fouls
stl              steals
tov              turnovers
blk              blocks
pts              points scored

1.3 The trend in NBA offensive style

1.31 How NBA teams played and won games in the 2014 season

As mentioned above, the 2014 season is called the start of the small-ball era. Let’s see what happened in that season.

library(dplyr)    # provides %>% and mutate()
library(ggplot2)
library(scales)
# Bar height: three-point attempts; label: their share of all field-goal attempts
season2014_team_box_score %>% 
  mutate(three_pa_cover = fg3a / fga) %>% 
  ggplot(aes(x = teamName, y = fg3a)) +
  geom_col() +
  geom_text(aes(label = scales::percent(three_pa_cover), y = fg3a),
            position = position_stack(vjust = 0.2))

In the 2014 season, none of the top six regular-season teams attempted more than 2,000 three-pointers, and threes accounted for only 23.46% to 29.08% of their total shot attempts. The other 70%+ were two-point shots.

Let’s see how many of those three-point attempts were actually converted into points.
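The share shown on the bar labels is simply fg3a divided by fga. A minimal sketch of that calculation in base R, assuming a small made-up data frame (the team names and totals below are illustrative, not taken from nbastatR):

```r
# Toy team totals; the numbers are made up for illustration only.
teams <- data.frame(
  teamName = c("Team A", "Team B", "Team C"),
  fga  = c(7000, 6800, 7100),   # all field-goal attempts
  fg3a = c(1750, 1900, 1600)    # three-point attempts
)

# Share of all field-goal attempts that were three-point attempts.
teams$three_pa_cover <- teams$fg3a / teams$fga

# Percentage string, similar in spirit to the scales::percent() labels on the chart.
teams$label <- sprintf("%.2f%%", 100 * teams$three_pa_cover)

print(teams)
```

The same column could be added with mutate() as in the report's own code; base R is used here only to keep the sketch free of dependencies.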

season2014_team_box_score %>% 
  group_by(teamName) %>% 
  ggplot(aes(x=teamName,y=fg3m))+
  geom_col()+geom_text(aes(label = scales::percent(pctFG3), y = pctFG3*1000 ),
            position = position_stack(vjust =0.5)) 

There doesn’t seem to be much difference in three-point percentage, with a maximum difference of 4.5 percentage points. This means that a team that shoots more threes makes more threes. So why didn’t they shoot more three-pointers?

Let’s have a look at two-pointers.

season2014_team_box_score %>% 
  group_by(teamName) %>% mutate(fg2m=fgm-fg3m) %>% 
  ggplot(aes(x=teamName,y=fg2m))+
  geom_col()+geom_text(aes(label =scales::percent (round((fgm-fg3m)/(fga-fg3a),3)), y = round((fgm-fg3m)/(fga-fg3a),2)*1000 ),
            position = position_stack(vjust =2.5)) 

All of the teams’ two-point percentages are higher than 47%, and the largest difference between teams is 8.3%.
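One way to frame the question of why teams did not shoot more threes is expected points per attempt: the shot value multiplied by the probability of making it. The sketch below derives two-point makes and attempts the same way the report does (fgm - fg3m and fga - fg3a) and then compares the two shot types. All totals are hypothetical, so the comparison only illustrates the arithmetic and says nothing about any real team.

```r
# Hypothetical season totals for one team (illustrative values, not real data).
fgm  <- 3100; fga  <- 6800   # all field goals made / attempted
fg3m <- 650;  fg3a <- 1800   # three-pointers made / attempted

# Two-pointers are what is left after removing threes.
fg2m <- fgm - fg3m
fg2a <- fga - fg3a

pct_fg2 <- fg2m / fg2a
pct_fg3 <- fg3m / fg3a

# Expected points per attempt: shot value times make probability.
exp_pts_2 <- 2 * pct_fg2
exp_pts_3 <- 3 * pct_fg3

round(c(pct_fg2 = pct_fg2, pct_fg3 = pct_fg3,
        exp_pts_2 = exp_pts_2, exp_pts_3 = exp_pts_3), 3)
```

Because a three is worth 1.5 times a two, a three-point percentage only needs to be two thirds of the two-point percentage for the two shots to be worth the same per attempt.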

season2014_team_box_score %>% 
  group_by(teamName) %>% 
  mutate(three_pct= (fg3m*3)/pts,
         two_pct=(fgm-fg3m)*2/pts,
         freethrow_pct=ftm/pts) %>% 
  ggplot(aes(x=teamName))+
  geom_line(aes(y=three_pct,color="three_pct",group=1))+
  geom_line(aes(y=two_pct,color="two_pct",group=2))+
  geom_line(aes(y=freethrow_pct,color="freethrow_pct",group=3))

  • From the chart above, it can be concluded that two-pointers are the main scoring method (55%-60% of the total), three-pointers are second (20%-25%) and free throws are the lowest (15%-20%).

  • We can also see a peculiar phenomenon: the Raptors seem to be the most willing of the six teams to try new attacking patterns. They have the highest share of threes among these teams. Although the Spurs made a higher percentage of their threes and the Clippers attempted more threes in total, the Raptors had the highest share of field goal attempts from threes as well as the highest share of points from threes, so it is clear that they relied more on threes and got some results.
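The three shares plotted above come from splitting total points by shot type. Since every point is scored on a two, a three or a free throw, the shares should add up to 1. A minimal sketch with hypothetical totals (not real team data):

```r
# Hypothetical season totals (illustrative only).
fgm  <- 3100   # all field goals made
fg3m <- 650    # three-pointers made
ftm  <- 1450   # free throws made

# Total points implied by the makes: 2 per two-pointer, 3 per three, 1 per free throw.
pts <- 2 * (fgm - fg3m) + 3 * fg3m + ftm

three_pct     <- 3 * fg3m / pts
two_pct       <- 2 * (fgm - fg3m) / pts
freethrow_pct <- ftm / pts

shares <- c(two_pct = two_pct, three_pct = three_pct, freethrow_pct = freethrow_pct)
round(shares, 3)

# Sanity check: the three shares partition the total.
sum(shares)
```

With real data, a sum that differs from 1 would point to a mismatch between the pts column and the makes columns.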

1.32 Looking into the 2022 season (last season)

library(ggpubr)
s2022_1<-season2022_team_box_score %>% 
  group_by(teamName) %>% 
  mutate(three_pa_cover=fg3a/fga ) %>% 
  ggplot(aes(x=teamName,y=fg3a))+
  geom_col()+geom_text(aes(label = scales::percent(three_pa_cover), 
                y = fg3a
               ),
            position = position_stack(vjust =0.2)) 

s2014_1 <- season2014_team_box_score %>% 
  group_by(teamName) %>% 
  mutate(three_pa_cover=fg3a/fga ) %>% 
  ggplot(aes(x=teamName,y=fg3a))+
  geom_col()+geom_text(aes(label = scales::percent(three_pa_cover), 
                y = fg3a
               ),
            position = position_stack(vjust =0.2)) 
ggarrange(s2022_1,s2014_1,ncol = 1,labels = c("2022 3-points attempt and pct of total shot attempt","2014 3-points attempt and pct of total shot attempt"))

In the 2022 regular season, none of the top six teams attempted fewer than 2,500 three-pointers, and threes accounted for at least a third of each team’s total attempts. This is a significant increase in both the number of three-point shots taken and their share of the offense compared to the 2014 season.

s2014_2 <- season2014_team_box_score %>% 
  group_by(teamName) %>% 
  ggplot(aes(x=teamName,y=fg3m))+
  geom_col()+geom_text(aes(label = scales::percent(pctFG3), y = pctFG3*1000 ),
            position = position_stack(vjust =0.7)) 
s2022_2 <- season2022_team_box_score %>% 
  group_by(teamName) %>% 
  ggplot(aes(x=teamName,y=fg3m))+
  geom_col()+geom_text(aes(label = scales::percent(pctFG3), y = pctFG3*1000 ),
            position = position_stack(vjust =0.7)) 

ggarrange(s2022_2,s2014_2,ncol = 1,labels = c("2022 3-points made and field goal pct ","2014 3-points made and field goal pct "))

There is no big difference in three-point percentage. The total number of three-pointers made, on the other hand, has increased substantially. The reason for this is also very clear - a significant increase in the number of three-pointers attempted.

s2014_3 <- season2014_team_box_score %>% 
  group_by(teamName) %>% mutate(fg2m=fgm-fg3m) %>% 
  ggplot(aes(x=teamName,y=fg2m))+
  geom_col()+geom_text(aes(label =scales::percent (round((fgm-fg3m)/(fga-fg3a),3)), y = round((fgm-fg3m)/(fga-fg3a),2)*1000 ),
            position = position_stack(vjust =2.5)) 

s2022_3 <- season2022_team_box_score %>% 
  group_by(teamName) %>% mutate(fg2m=fgm-fg3m) %>% 
  ggplot(aes(x=teamName,y=fg2m))+
  geom_col()+geom_text(aes(label =scales::percent (round((fgm-fg3m)/(fga-fg3a),3)), y = round((fgm-fg3m)/(fga-fg3a),2)*1000 ),
            position = position_stack(vjust =2.5)) 

ggarrange(s2022_3,s2014_3,ncol = 1,labels = c("2022 2-points made and field goal pct ","2014 2-points made and field goal pct "))

  • There may be a slight increase in two-point percentage, and a small decrease in total two-pointers made.

s2014_4 <- season2014_team_box_score %>% 
  group_by(teamName) %>% 
  mutate(three_pct= (fg3m*3)/pts,
         two_pct=(fgm-fg3m)*2/pts,
         freethrow_pct=ftm/pts) %>% 
  ggplot(aes(x=teamName))+
  geom_line(aes(y=three_pct,color="three_pct",group=1))+
  geom_line(aes(y=two_pct,color="two_pct",group=2))+
  geom_line(aes(y=freethrow_pct,color="freethrow_pct",group=3))

s2022_4 <- season2022_team_box_score %>% 
  group_by(teamName) %>% 
  mutate(three_pct= (fg3m*3)/pts,
         two_pct=(fgm-fg3m)*2/pts,
         freethrow_pct=ftm/pts) %>% 
  ggplot(aes(x=teamName))+
  geom_line(aes(y=three_pct,color="three_pct",group=1))+
  geom_line(aes(y=two_pct,color="two_pct",group=2))+
  geom_line(aes(y=freethrow_pct,color="freethrow_pct",group=3))

ggarrange(s2022_4,s2014_4,ncol = 1,labels = c("2022 Composition of the total score ","2014 Composition of the total score"))

These two charts compare the composition of scoring. We can see that:

  • The share of points from free throws dropped from 20% to about 10%: free throws decreased, without a significant increase in the number of shots taken.

  • The share of points from three-pointers has increased substantially, from 20%~25% to 30%~40%. It is indeed a big change and can be called a REVOLUTION.

  • There was also a drop of almost 10% in the share of two-pointers.

1.33 Boxplots of all teams in 2014 and 2022

Get the data set

1.331 3-pointers

The chart shows no great difference in three-point percentage between the two seasons; it is even slightly lower in 2022. However, the variability between teams in three-point shooting is narrowing.
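A comparison like this can be sketched by stacking both seasons into one data frame and drawing one box per season. The values below are simulated purely so the example runs; they are not NBA data, and the column names pctFG3 and slugSeason are borrowed from the variable table. Base R boxplot() is used here to avoid dependencies, although geom_boxplot() from ggplot2 would match the rest of the report.

```r
set.seed(42)  # simulated values only; not real NBA data

# One row per team and season, 30 teams each.
both <- data.frame(
  slugSeason = rep(c("2014", "2022"), each = 30),
  pctFG3 = c(rnorm(30, mean = 0.360, sd = 0.020),
             rnorm(30, mean = 0.355, sd = 0.012))
)

# One box per season; the return value holds the five-number summaries.
bp <- boxplot(pctFG3 ~ slugSeason, data = both,
              xlab = "Season", ylab = "Three-point percentage")

# Interquartile range per season as a simple measure of spread between teams.
spread <- tapply(both$pctFG3, both$slugSeason, IQR)
spread
```

A smaller interquartile range for the later season would correspond to the narrowing variability described above.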

1.332 2-pointers

In keeping with what we observed for the top 6 regular-season teams, teams are attempting fewer two-pointers and the number made has dropped. Let’s look at the two-point percentage again.

The 2022 season looks better than 2014 in terms of two-point shooting.

  • That may have something to do with fewer mid-range and long-range two-pointers being taken.

  • Everyone is now more willing to shoot inside the paint, where they are more likely to score or draw a foul.

1.333 Free throw

  • Free-throw accuracy has improved.

  • The total number of free throws has decreased slightly.

1.334 Total score

Earlier we saw no significant increase in either two-point or three-point percentage, while total points scored rose dramatically. Let’s look at the points per game.

Several reasons for this:

  • Increased pace of play and more offensive possessions

  • Three-point attempts as well as makes increased, with some of the previous two-point attempts becoming threes, leading to an overall increase in scoring
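Both reasons can be checked from box-score totals. Points per game is pts / gp. Pace is not in the data set, so the sketch below uses a common box-score approximation of possessions, fga + 0.44 * fta - oreb + tov. That formula and its 0.44 coefficient are an assumption brought in from general basketball analytics, not something taken from this report or from nbastatR, and the totals are hypothetical.

```r
# Hypothetical team-season totals (illustrative only).
team <- data.frame(
  slugSeason = c("2014", "2022"),
  gp   = c(82, 82),
  pts  = c(8200, 9020),
  fga  = c(6800, 7200),
  fta  = c(1900, 1800),
  oreb = c(900, 850),
  tov  = c(1150, 1100)
)

# Scoring average.
team$pts_per_game <- team$pts / team$gp

# Assumed approximation of possessions from box-score totals.
team$poss <- team$fga + 0.44 * team$fta - team$oreb + team$tov
team$poss_per_game <- team$poss / team$gp

# Scoring efficiency, which separates "more possessions" from "better possessions".
team$pts_per_100_poss <- 100 * team$pts / team$poss

team[, c("slugSeason", "pts_per_game", "poss_per_game", "pts_per_100_poss")]
```

If points per game rises while points per 100 possessions stays flat, pace explains the increase; if both rise, shot selection or accuracy is contributing as well.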

1.34 Conclusion

  • Compared with the past, NBA teams are now far more willing to shoot threes, which have become a regular weapon and account for over a third of points, a substantial increase. It also suggests that teams are now more confident than ever in making the three-point shot.

  • Two-point attempts are down, but the overall two-point percentage is up. Setting aside players’ shooting ability, there is a good chance that teams are attempting closer shots.

  • Free throw totals are down, but only slightly. One possible reason is that the intensity of physical play is decreasing, as is the quality of the defence.

  • Average scoring has risen, but this has been a general trend since the start of the 21st century.

  • The overall shooting ability of players has gone up, and shots are being made at an increased rate regardless of the type of scoring, most notably the three-point shot.

  • Because shot-location coordinates are not available, it is not possible to visualize shots further or to calculate shooting distance. If shot-clock timing data were available, there would also be further insight into offensive choices, which play a large role in interpreting the offensive model.

1.4 The key factors in winning

1.41 For the 2014 season

library(tidyverse)
library(GGally) 
small_ball_era_start <- small_ball_era_start%>% 
  group_by(dateGame,nameTeam) %>% 
  select(5,8,21,28,29,31,32,36,37,40:50) %>% 
  mutate(
         Gameresult=case_when((outcomeGame=="W"~1),
                              (outcomeGame=="L"~0)),
         tfgm=sum(fgm),tfga=sum(fga),tfg3m=sum(fg3m),
         tfg3a=sum(fg3a),tfg2m=sum(fg2m),tfg2a=sum(fg2a),
         tftm=sum(ftm),tfta=sum(fta),toreb=sum(oreb),tdreb=sum(dreb),ttreb=sum(treb),
         tast=sum(ast),tstl=sum(stl),tblk=sum(blk),ttov=sum(tov),tpf=sum(pf),tpts=sum(pts)
         ) 
  

small_ball_era_start <- small_ball_era_start%>% 
  select(-c(3:20))%>% 
  distinct_all()
# Build the team box score for each game in the season, then compute the correlations.
# First, remove some highly correlated variables.
c <- small_ball_era_start %>% 
  ungroup() %>% 
  select(-c(1,2,3))
    
ggcorr(data = c)

# Then narrow down to the variables that may be highly correlated with winning.
g <- small_ball_era_start %>% 
  ungroup() %>% 
  select(-c(1,2,4,5,14,10))

ggcorr(data = g)

g <- g%>% 
  select(Gameresult,tpts,tdreb,tfg2m,tfg3m,tast,tstl) 

g %>% ggpairs(.,
                 title = "Factors most correlated with winning, compared with each other", 
                 mapping = ggplot2::aes(colour=as.factor(Gameresult)), 
                 lower = list(continuous = wrap("smooth", alpha = 0.3, size=0.1), 
                              discrete = "blank", combo="blank"), 
                 diag = list(discrete="barDiag", 
                             continuous = wrap("densityDiag", alpha=0.5 )), 
                 upper = list(combo = wrap("box_no_facet", alpha=0.5),
                              continuous = wrap("cor", size=4, alignPercent=0.8))) + 
  theme(panel.grid.major = element_blank()) 

  • In the 2014 season, total points, defensive rebounds and assists are the three factors most strongly related to winning.

  • It is indeed difficult to identify the differences and further modelling may be required to get a closer answer.
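The preparation step above sums player rows into one row per team and game, then correlates the 0/1 game result with each team total. A minimal sketch of that idea on a tiny made-up game log, using base R aggregate() in place of the mutate() plus distinct_all() combination (the column names follow the report, the values are invented):

```r
# Toy player-level game log: two games, two teams, two players per team.
logs <- data.frame(
  dateGame    = rep(c("2014-01-01", "2014-01-02"), each = 4),
  nameTeam    = rep(c("A", "B", "A", "B"), each = 2),
  outcomeGame = rep(c("W", "L", "L", "W"), each = 2),
  pts = c(20, 15, 12, 10, 9, 14, 22, 18),
  ast = c(5, 7, 3, 2, 2, 4, 6, 8)
)

# One row per team and game, with player stats summed to team totals.
team_games <- aggregate(cbind(tpts = pts, tast = ast) ~ dateGame + nameTeam + outcomeGame,
                        data = logs, FUN = sum)

# Encode the result as 1 (win) or 0 (loss), as in the report.
team_games$Gameresult <- ifelse(team_games$outcomeGame == "W", 1, 0)

# Pearson correlation of each team total with the game result.
cors <- sapply(team_games[, c("tpts", "tast")], cor, y = team_games$Gameresult)
cors
```

With only four team-games the correlations here mean nothing; the point is the shape of the pipeline, which collapses to one row per team-game in a single step.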

library(rpart)
library(rpart.plot)
w <- rpart(Gameresult~., data=g, method="anova")
w$variable.importance
##      tpts     tdreb      tast     tfg2m     tfg3m      tstl 
## 126.02848  59.39596  40.34308  35.77511  35.54322  12.29514
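The raw importance values depend on the size and variance of the data set, so they are easier to compare across seasons when rescaled to percentages. The sketch below fits the same kind of rpart model on simulated data and normalises the result. The data-generating rule (wins driven mostly by tpts) is an assumption made only so the example runs.

```r
library(rpart)

set.seed(1)  # simulated team-game data; not real NBA results
n <- 300
g_toy <- data.frame(
  tpts = rnorm(n, mean = 100, sd = 10),
  tast = rnorm(n, mean = 22,  sd = 4),
  tstl = rnorm(n, mean = 8,   sd = 2)
)
# Assumed rule for the toy data: scoring above 100 (plus noise) means a win.
g_toy$Gameresult <- as.integer(g_toy$tpts + rnorm(n, sd = 5) > 100)

# Regression tree on the 0/1 outcome, as in the report.
fit <- rpart(Gameresult ~ ., data = g_toy, method = "anova")

# Importance sums the split improvements credited to each variable.
imp <- fit$variable.importance

# Rescale to percentages so two seasons can be compared on the same footing.
round(100 * imp / sum(imp), 1)
```

Because the outcome is binary, method = "class" would also be a reasonable choice; the report's method = "anova" is kept here for consistency.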

1.42 For the 2022 season

last_season <- last_season%>% 
  group_by(dateGame,nameTeam) %>% 
  select(5,8,21,28,29,31,32,36,37,40:50) %>% 
  mutate(
         Gameresult=case_when((outcomeGame=="W"~1),
                              (outcomeGame=="L"~0)),
         tfgm=sum(fgm),tfga=sum(fga),tfg3m=sum(fg3m),
         tfg3a=sum(fg3a),tfg2m=sum(fg2m),tfg2a=sum(fg2a),
         tftm=sum(ftm),tfta=sum(fta),toreb=sum(oreb),tdreb=sum(dreb),ttreb=sum(treb),
         tast=sum(ast),tstl=sum(stl),tblk=sum(blk),ttov=sum(tov),tpf=sum(pf),tpts=sum(pts)
         ) 
  

last_season <- last_season%>% 
  select(-c(3:20))%>% 
  distinct_all()
# Build the team box score for each game in the season, then compute the correlations.
# First, remove some highly correlated variables.
h <- last_season %>% 
  ungroup() %>% 
  select(-c(1,2,3))
    
ggcorr(data = h)

# Then narrow down to the variables that may be highly correlated with winning.
j <- last_season %>% 
  ungroup() %>% 
  select(-c(1,2,4,5,14,10))

ggcorr(data = j)

k <- j%>% 
  select(Gameresult,tpts,tdreb,tfg2m,tfg3m,tast,tblk) 

k %>% ggpairs(.,
                 title = "Factors most correlated with winning, compared with each other", 
                 mapping = ggplot2::aes(colour=as.factor(Gameresult)), 
                 lower = list(continuous = wrap("smooth", alpha = 0.3, size=0.1), 
                              discrete = "blank", combo="blank"), 
                 diag = list(discrete="barDiag", 
                             continuous = wrap("densityDiag", alpha=0.5 )), 
                 upper = list(combo = wrap("box_no_facet", alpha=0.5),
                              continuous = wrap("cor", size=4, alignPercent=0.8))) + 
  theme(panel.grid.major = element_blank()) 

q <- rpart(Gameresult~., data=k, method="anova")
q$variable.importance
##      tpts     tdreb      tast     tfg3m     tfg2m 
## 140.71582  65.55712  50.97707  39.07252  33.81742

  • The top three are still total points, defensive rebounds and assists.

  • The importance of defensive rebounds, three-pointers made and assists for winning increased slightly. This suggests that these three abilities matter more than in the past.

Part 2 Article analysis (Estimating an NBA player’s impact on his team’s chances of winning)

https://www-degruyter-com.ez.library.latrobe.edu.au/document/doi/10.1515/jqas-2015-0027/html

2.1 Research question and study aims

The research question is how to use a model to fairly evaluate the impact of players on team wins.

The study aims:

  • to identify highly paid players with low impact relative to their teammates

  • to identify players whose high impact is not captured by existing metrics

  • to estimate an individual player’s impact, after controlling for the other players on the court.

2.2 Data (period and source)

Play-by-play data were obtained from ESPN for 8,365 of the 9,840 (85%) scheduled regular-season games across the eight seasons between 2006 and 2014. The other 15% had missing data.

2.3 Analysis methods

  • The authors used a regression model to estimate the change in win rate and to evaluate the model veracity.

  • The authors used a Bayesian linear regression model to evaluate the influence of players on the win probability of games.

2.4 Major findings
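To make the second method concrete, here is a minimal sketch of a Bayesian linear regression of win-probability change on on-court indicators. It assumes a zero-mean Gaussian prior on player effects and a known noise level, which gives a closed-form posterior. This is an illustration of the general technique only: the prior, the 0/1 coding of the design matrix, and all numbers are assumptions, not the specification used in the article.

```r
set.seed(7)  # simulated shifts; not data from the article

n_shifts <- 200
players  <- c("p1", "p2", "p3", "p4")   # hypothetical players

# X[i, j] = 1 if player j was on the court during shift i (assumed coding).
X <- matrix(rbinom(n_shifts * length(players), size = 1, prob = 0.5),
            nrow = n_shifts, dimnames = list(NULL, players))

# Simulated change in win probability per shift.
beta_true <- c(0.010, 0.000, -0.005, 0.020)
sigma <- 0.05                            # noise sd, treated as known
y <- as.vector(X %*% beta_true + rnorm(n_shifts, sd = sigma))

# Prior: each player effect ~ Normal(0, tau^2).
tau <- 0.02

# Conjugate posterior for the effects: Normal(post_mean, cov_post).
precision_post <- crossprod(X) / sigma^2 + diag(length(players)) / tau^2
cov_post  <- solve(precision_post)
post_mean <- as.vector(cov_post %*% crossprod(X, y) / sigma^2)
names(post_mean) <- players
post_sd   <- sqrt(diag(cov_post))

# Ordinary least squares for comparison (no prior).
ols <- as.vector(solve(crossprod(X), crossprod(X, y)))

round(cbind(ols = ols, post_mean = post_mean, post_sd = post_sd), 4)
```

The posterior means are pulled toward zero relative to the least-squares estimates, and the posterior standard deviations quantify how uncertain each effect is.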

  • Impact scores are environment dependent, and a player’s impact score from one season can help predict his impact score in the following season because he is more consistent unless his playing environment (including team changes, injuries, playing rules, etc.) changes significantly.

  • A player’s Impact Score is not affected by year, and when combined with PER (a statistical metric that evaluates personal performance without considering any conditions), it gives more realistic feedback on a player’s performance. When a player has a high PER as well as a high Impact Score, he must be a player who plays a role in his team’s winning (e.g. Durant, James, Nowitzki).

  • A unit change in time corresponds to a smaller change in probability of winning than a unit change in lead, especially near the end of the game. This may introduce a slight bias against players who are often subbed in on defense and subbed out on offense, since such players would not be associated with a large change in win probability.

  • The impact of any one player in a single substitution on his team’s chances of winning is small, usually less than 1%.

  • No player is more likely to significantly improve his team’s chances of winning than any other player.

2.5 Comments

  • This table shows the 2008–2009 to 2010–2011 Impact Scores. Taking minutes into account, I list the teams of the top 4 by impact (players who were regularly used in games): Warriors, Pacers, Rockets and Wizards. It is questionable that none of the championship teams from those years appears in the table. A championship team should be a team that wins more, so its Impact Scores should be high.

  • In Table 4, I found some interesting phenomena. In 2008-2010 LeBron led a one-man team and won back-to-back MVPs. In 2011-2014 he was with the Heat, forming the Big Three (three All-Star players playing together), and his influence must have declined compared to his previous stint with the Cavs, both offensively and defensively. You can also observe the disappearance of Wade from the list, even though Wade was still in the prime of his career in those years. This evaluation model is not so friendly to teams with more star players, and it gives some advantage to the leading players of weaker teams.

  • The Impact Score does play a role in the overall assessment. Defensive bruisers have been underrated for a long time because there are very few box-score metrics on defense (few direct metrics: steals, blocks, defensive rebounds). They really do influence the game.

  • This individual Impact Score is very interesting. Its starting point and the results of the final simulation are also very close to the results of the MVP ballot at that time. We have been struggling with LeBron’s impact not matching his stats at times (and of course other players such as Harden this season). I think this is a solution.

  • While it’s fair as well as necessary to assess the impact of individuals on winning, basketball is ultimately a team sport. A lot of GMs have been talking about team chemistry, and a lot of NCAA coaches are talking about a winning culture, all suggesting that evaluating a group of players (three or four) is important to winning games. The Lakers changed their starting lineup 37 times last season, more times than they won, because they couldn’t find the right mix of players.

2.6 Possible improvements

  • The win probability estimation is admittedly simplistic, and designing a more sophisticated procedure is an area for future work.

  • Entertain two-way or three-way player interactions, in case there are any on-court synergies or mismatches amongst small groups of players.

  • The variable σ, which measures the uncertainty of the change in win probability by shift and by performance, needs to be segmented.

  • Change from evaluating one person to evaluating a group of people (three or four). It adds a lot of computational work, but the group plays a very important role in the team’s ability to win games.
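The interaction and group-evaluation ideas above amount to adding product columns to the design matrix, one per pair (or triple) of players who share the court. A minimal sketch with three hypothetical on-court indicators, using the formula interface in base R:

```r
# Toy on-court indicators for six shifts; p1..p3 are hypothetical players.
shifts <- data.frame(
  p1 = c(1, 1, 0, 1, 0, 1),
  p2 = c(1, 0, 1, 1, 0, 0),
  p3 = c(0, 1, 1, 1, 1, 0)
)

# (p1 + p2 + p3)^2 expands to main effects plus all two-way products;
# "- 1" drops the intercept column.
X <- model.matrix(~ (p1 + p2 + p3)^2 - 1, data = shifts)
colnames(X)

# A pair column is 1 only on shifts where both players are on the court.
X[, "p1:p2"]

# The number of pair columns grows quickly with roster size.
choose(15, 2)
```

Replacing ^2 with ^3 would add three-way terms as well, which is where the extra computational cost mentioned above comes from.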