library('ggplot2')
library('dplyr')
library('GGally')
library('lubridate')

Free shots analysis

Task: Read in the csv containing NBA free throws. Visualise some aspect of the data you find interesting, e.g., the average number of free throws per period for the regular season and the playoffs.

Domain Background

NBA Basketball has the following characteristics:

  • Each of the free shots adds one point to the score

  • Games are played by two teams, of 5 payers each.

  • One game is divided in 4 quarters of 12 minutes of duration

  • Free shots are shots throuwn by a player without any opposition. A free shot happen as a consequence of a foul done over the player that is throwing the free shot. If the foul happen within the 2 points area, the player is given 2 free shots. If the foul happen in the three points area, then the player has 3 free shots. One free shot is given when there was a foul when the player was shoting and he converted the shot.

  • A seasson of the league has two different periods:

    • Regular season: each team plays 82 games, 41 each home and away. All teams play against each other, but in a unbalanced manner
    • Playoffs: The playoffs follow a tournament format. Each team plays an opponent in a best-of-seven series, with the first team to win four games advancing into the next round, while the other team is eliminated from the playoffs
  • There is one season per year

Exploring the data

df <- read.csv('data/free_throws.csv', header = TRUE, stringsAsFactors = FALSE)
head(df)

Variable summary

Let’s have a look to the variables

str(df)
## 'data.frame':    618019 obs. of  11 variables:
##  $ end_result: chr  "106 - 114" "106 - 114" "106 - 114" "106 - 114" ...
##  $ game      : chr  "PHX - LAL" "PHX - LAL" "PHX - LAL" "PHX - LAL" ...
##  $ game_id   : num  2.61e+08 2.61e+08 2.61e+08 2.61e+08 2.61e+08 ...
##  $ period    : num  1 1 1 1 1 1 1 2 2 2 ...
##  $ play      : chr  "Andrew Bynum makes free throw 1 of 2" "Andrew Bynum makes free throw 2 of 2" "Andrew Bynum makes free throw 1 of 2" "Andrew Bynum misses free throw 2 of 2" ...
##  $ player    : chr  "Andrew Bynum" "Andrew Bynum" "Andrew Bynum" "Andrew Bynum" ...
##  $ playoffs  : chr  "regular" "regular" "regular" "regular" ...
##  $ score     : chr  "0 - 1" "0 - 2" "18 - 12" "18 - 12" ...
##  $ season    : chr  "2006 - 2007" "2006 - 2007" "2006 - 2007" "2006 - 2007" ...
##  $ shot_made : int  1 1 1 0 1 1 1 0 1 1 ...
##  $ time      : chr  "11:45" "11:45" "7:26" "7:26" ...

Variable definitions

  • end_result: final score of the game
  • game: Teams that played
  • game_id: id reference for each of the games
  • period: Period of the game the free shot happened
  • play: Phrase with the name of the player that is throwing the free shot, if he made the free shot and the order of the free shots.
  • player: Name of the player
  • playoffs: The game was in the regular season or in playoffs
  • score: The score after the free shot was made
  • season: The season the game was played
  • shot_made: did the ball got in the basket?
  • time: minutes and seconds left to the end of the period

Time

Are times the actual time of the day or are they times since the game start?

They could be times in the format hh:mm or times from start of the game in the format of mm:ss

Each of the quarters in a basquetball game last 12 min, so we can’t really tell those options apart looking at the data. Common sense tells me that they are times since start, as time of the day is not very informative in this context. The dataset has the period also so that would give us when on the game the free shot was done.

Let’s have a look to the distribution of the times

times = as.integer(sapply(df$time, function(x) unlist(strsplit(x, ':'))[1]))
hist(times)

Let’s assume the format is ‘mm:ss’

Having a look to the data, it’s clear that the times provided is the time left in each period.

Shot_made

I don’t know what shot_made really is. Let’s have a look

df[,c("shot_made", 'play')]

So there is a row for echa free shot. Even when there are two free shots because of a foul there will be two rows, one for each of the free shots. So shot_made tell is if every single shot was made or not

NA’s

sum(is.na(df))
## [1] 0

There si no missing data.

Transforming the data

Data wrangling

Trasnformations required

  • end_result: chr, individual scores should be extracted
  • game: chr, individual scores should be extracted
  • period: numeric, it should be factor
  • playoffs: chr, should be factor
  • score: chr, individual scores should be extracted
  • season: chr, should be factor
  • time: chr, should be time type of data
  • time: create a new variable with elapsed time of each period
df <- df %>%
  mutate(
      ## Trasform variables
      game_id = as.character(game_id),
      period = as.factor(period),
      playoffs = as.factor(playoffs),
      season = as.factor(season),
      shot_made = as.factor(shot_made),
      
      ## New variables
      # split the scores
      end_home_score = as.integer(sapply(end_result, function(x) unlist(strsplit(x, ' - '))[1])),
      end_away_score = as.integer(sapply(end_result, function(x) unlist(strsplit(x, ' - '))[2])),
      # split the teams
      home_team = as.character(sapply(game, function(x) unlist(strsplit(x, ' - '))[1])),
      away_team = as.character(sapply(game, function(x) unlist(strsplit(x, ' - '))[2])),
      # split the scores
      elapsed_home_score = as.integer(sapply(score, function(x) unlist(strsplit(x, ' - '))[1])),
      elapsed_away_score = as.integer(sapply(score, function(x) unlist(strsplit(x, ' - '))[2])),
      # time
      time_min_sec = ms(time),
      time_dur_sec = dminutes(12) - as.duration(time_min_sec),
      time_elapsed_min_sec = as.period(time_dur_sec),
      time_elapsed_min = as.numeric(time_elapsed_min_sec, "minutes"),
      time_elapsed_sec = as.numeric(time_elapsed_min_sec, "seconds"),
  )
str(df)
## 'data.frame':    618019 obs. of  22 variables:
##  $ end_result          : chr  "106 - 114" "106 - 114" "106 - 114" "106 - 114" ...
##  $ game                : chr  "PHX - LAL" "PHX - LAL" "PHX - LAL" "PHX - LAL" ...
##  $ game_id             : chr  "261031013" "261031013" "261031013" "261031013" ...
##  $ period              : Factor w/ 8 levels "1","2","3","4",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ play                : chr  "Andrew Bynum makes free throw 1 of 2" "Andrew Bynum makes free throw 2 of 2" "Andrew Bynum makes free throw 1 of 2" "Andrew Bynum misses free throw 2 of 2" ...
##  $ player              : chr  "Andrew Bynum" "Andrew Bynum" "Andrew Bynum" "Andrew Bynum" ...
##  $ playoffs            : Factor w/ 2 levels "playoffs","regular": 2 2 2 2 2 2 2 2 2 2 ...
##  $ score               : chr  "0 - 1" "0 - 2" "18 - 12" "18 - 12" ...
##  $ season              : Factor w/ 10 levels "2006 - 2007",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ shot_made           : Factor w/ 2 levels "0","1": 2 2 2 1 2 2 2 1 2 2 ...
##  $ time                : chr  "11:45" "11:45" "7:26" "7:26" ...
##  $ end_home_score      : int  106 106 106 106 106 106 106 106 106 106 ...
##  $ end_away_score      : int  114 114 114 114 114 114 114 114 114 114 ...
##  $ home_team           : chr  "PHX" "PHX" "PHX" "PHX" ...
##  $ away_team           : chr  "LAL" "LAL" "LAL" "LAL" ...
##  $ elapsed_home_score  : int  0 0 18 18 21 33 34 43 44 44 ...
##  $ elapsed_away_score  : int  1 2 12 12 12 20 20 29 29 30 ...
##  $ time_min_sec        :Formal class 'Period' [package "lubridate"] with 6 slots
##   .. ..@ .Data : num  45 45 26 26 18 15 15 52 52 37 ...
##   .. ..@ year  : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ month : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ day   : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ hour  : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ minute: num  11 11 7 7 7 3 3 10 10 10 ...
##  $ time_dur_sec        :Formal class 'Duration' [package "lubridate"] with 1 slot
##   .. ..@ .Data: num  15 15 274 274 282 525 525 68 68 83 ...
##  $ time_elapsed_min_sec:Formal class 'Period' [package "lubridate"] with 6 slots
##   .. ..@ .Data : num  15 15 34 34 42 45 45 8 8 23 ...
##   .. ..@ year  : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ month : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ day   : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ hour  : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ minute: num  0 0 4 4 4 8 8 1 1 1 ...
##  $ time_elapsed_min    : num  0.25 0.25 4.57 4.57 4.7 ...
##  $ time_elapsed_sec    : num  15 15 274 274 282 525 525 68 68 83 ...
head(df)

visual check that the scores were extracted correctly

df[1:100,'end_home_score']
##   [1] 106 106 106 106 106 106 106 106 106 106 106 106 106 106 106 106 106
##  [18] 106 106 106 106 106 106 106 106 106 106 106 106 106 106 106 106 106
##  [35] 106 106 106 106 106 106 106 108 108 108 108 108 108 108 108 108 108
##  [52] 108 108 108 108 108 108 108 108 108 108 108 108 108 108 108 108 108
##  [69] 108 108 108 108 108 108 108 108 108 108 108 108 108 108 108 108 108
##  [86] 108 108 108 108 108 108 108 108 108 108  91  91  91  91  91
head(df$end_result, 100)
##   [1] "106 - 114" "106 - 114" "106 - 114" "106 - 114" "106 - 114"
##   [6] "106 - 114" "106 - 114" "106 - 114" "106 - 114" "106 - 114"
##  [11] "106 - 114" "106 - 114" "106 - 114" "106 - 114" "106 - 114"
##  [16] "106 - 114" "106 - 114" "106 - 114" "106 - 114" "106 - 114"
##  [21] "106 - 114" "106 - 114" "106 - 114" "106 - 114" "106 - 114"
##  [26] "106 - 114" "106 - 114" "106 - 114" "106 - 114" "106 - 114"
##  [31] "106 - 114" "106 - 114" "106 - 114" "106 - 114" "106 - 114"
##  [36] "106 - 114" "106 - 114" "106 - 114" "106 - 114" "106 - 114"
##  [41] "106 - 114" "108 - 66"  "108 - 66"  "108 - 66"  "108 - 66" 
##  [46] "108 - 66"  "108 - 66"  "108 - 66"  "108 - 66"  "108 - 66" 
##  [51] "108 - 66"  "108 - 66"  "108 - 66"  "108 - 66"  "108 - 66" 
##  [56] "108 - 66"  "108 - 66"  "108 - 66"  "108 - 66"  "108 - 66" 
##  [61] "108 - 66"  "108 - 66"  "108 - 66"  "108 - 66"  "108 - 66" 
##  [66] "108 - 66"  "108 - 66"  "108 - 66"  "108 - 66"  "108 - 66" 
##  [71] "108 - 66"  "108 - 66"  "108 - 66"  "108 - 66"  "108 - 66" 
##  [76] "108 - 66"  "108 - 66"  "108 - 66"  "108 - 66"  "108 - 66" 
##  [81] "108 - 66"  "108 - 66"  "108 - 66"  "108 - 66"  "108 - 66" 
##  [86] "108 - 66"  "108 - 66"  "108 - 66"  "108 - 66"  "108 - 66" 
##  [91] "108 - 66"  "108 - 66"  "108 - 66"  "108 - 66"  "108 - 66" 
##  [96] "91 - 87"   "91 - 87"   "91 - 87"   "91 - 87"   "91 - 87"

Ideas to explore

Exploratory Data Analysis

Summary statistics

summary(df)
##   end_result            game             game_id              period      
##  Length:618019      Length:618019      Length:618019      4      :186280  
##  Class :character   Class :character   Class :character   2      :154218  
##  Mode  :character   Mode  :character   Mode  :character   3      :154108  
##                                                           1      :116962  
##                                                           5      :  5439  
##                                                           6      :   838  
##                                                           (Other):   174  
##      play              player              playoffs     
##  Length:618019      Length:618019      playoffs: 42126  
##  Class :character   Class :character   regular :575893  
##  Mode  :character   Mode  :character                    
##                                                         
##                                                         
##                                                         
##                                                         
##     score                   season       shot_made      time          
##  Length:618019      2006 - 2007: 67612   0:150330   Length:618019     
##  Class :character   2007 - 2008: 65500   1:467689   Class :character  
##  Mode  :character   2008 - 2009: 65355              Mode  :character  
##                     2009 - 2010: 64804                                
##                     2010 - 2011: 64137                                
##                     2013 - 2014: 62290                                
##                     (Other)    :228321                                
##  end_home_score   end_away_score  home_team          away_team        
##  Min.   : 54.00   Min.   : 59    Length:618019      Length:618019     
##  1st Qu.: 91.00   1st Qu.: 94    Class :character   Class :character  
##  Median : 99.00   Median :102    Mode  :character   Mode  :character  
##  Mean   : 98.99   Mean   :102                                         
##  3rd Qu.:107.00   3rd Qu.:110                                         
##  Max.   :161.00   Max.   :168                                         
##                                                                       
##  elapsed_home_score elapsed_away_score  time_min_sec                 
##  Min.   :  0.00     Min.   :  0.00     Min.   :0S                    
##  1st Qu.: 31.00     1st Qu.: 33.00     1st Qu.:1M 51S                
##  Median : 56.00     Median : 58.00     Median :4M 31S                
##  Mean   : 55.48     Mean   : 57.46     Mean   :4M 57.4996399787062S  
##  3rd Qu.: 79.00     3rd Qu.: 82.00     3rd Qu.:7M 50S                
##  Max.   :161.00     Max.   :166.00     Max.   :12M 0S                
##                                                                      
##   time_dur_sec                              
##  Min.   :0s                                 
##  1st Qu.:250s (~4.17 minutes)               
##  Median :449s (~7.48 minutes)               
##  Mean   :422.500360021294s (~7.04 minutes)  
##  3rd Qu.:609s (~10.15 minutes)              
##  Max.   :720s (~12 minutes)                 
##                                             
##  time_elapsed_min_sec           time_elapsed_min time_elapsed_sec
##  Min.   :0S                     Min.   : 0.000   Min.   :  0.0   
##  1st Qu.:4M 10S                 1st Qu.: 4.167   1st Qu.:250.0   
##  Median :7M 29S                 Median : 7.483   Median :449.0   
##  Mean   :7M 2.50036002129383S   Mean   : 7.042   Mean   :422.5   
##  3rd Qu.:10M 9S                 3rd Qu.:10.150   3rd Qu.:609.0   
##  Max.   :12M 0S                 Max.   :12.000   Max.   :720.0   
## 
  • There are 618.019 records
  • There is a game with 113 shots. That’s weird
  • there are records with periods > 4 which is werid. That must be wrong
  • player with most free shots: Lebron, Dwight, Kevin Durant
  • season 2006-07 had the max of free shots
  • More than 75% of the shots were made.
  • Over 50% of free shots happend in the first 5 min of each period, 25% of them happend aprox in the first 2 min of each period

general overview of the relationship of the variables

Scores by season and playoffs

df_pair_end_score <- df[,c("playoffs", "season", "end_home_score", "end_away_score")]
ggpairs(df_pair_end_score, labeller = "label_context") + theme_grey(base_size = 30)

  • The sixth season (2011 - 2012) has a considerable drop in the number of free shots in the regular season.
  • The mean and median of the home_end_score and away_end_score are at around 100 points
  • home_end_score and away_end_score look like they are correlarted
  • Not much variation in the end scores among seassons, but again the end scores of the sixth season (2011 - 2012), lower than the rest

Shots made by factors and time

df_pair_shot_made <- df[,c("period", "playoffs", "season", "time_elapsed_min", "shot_made")]
ggpairs(df_pair_shot_made) + theme_grey(base_size = 30)

  • Free shots are more frequent at the end of the period and the frequency increase over time within the same period.
  • The 4th period has more fres shots than any other periods. The number of free shots increases from 1st to the 4th period.

Hack-A-Shack strategy

Are there more free shots at the end of a game, more fouls at the end of a game?

Hack-A-Shack was a common strategy in the NBA used in the last minutes of the game when a team was behind in the score. The strategy consists in committing intentional fouls throughout the game against selected opponents who shot free throws poorly. The trailing team fouls intentionally to end the opponents’ possession as soon as possible. More info following this link https://en.wikipedia.org/wiki/Hack-a-Shaq

Let’s have a look at the distributions of times of the free shots in the 4 quarter

ggplot(data = df[df$period == 4,]) +
  geom_bar(aes(x=floor(time_elapsed_min)))

Interestingly, as we saw before, there is a spike on the number of the free shots in the last minute of the 4th period. There are free shots that happen when there is 0 seconds left (the 12 min bar). This free shots should really be in the 11th bar.

Let’s transform the 00:00 times into 00:01 so they belong to the 11th minute

df[df$time_min_sec == period(0), "time_min_sec"] <- duration(1, "second")
# Check there is not more rows with period = 0
# df[df$time_min_sec == period(0),]

Modify the rest of the time variables accordingly

df <- df%>%
    mutate(
      time_dur_sec = dminutes(12) - as.duration(time_min_sec),
      time_elapsed_min_sec = as.period(time_dur_sec),
      time_elapsed_min = as.numeric(time_elapsed_min_sec, "minutes"),
      time_elapsed_sec = as.numeric(time_elapsed_min_sec, "seconds"),
    )

Same graph without the 12th minute bin

ggplot(data = df[df$period == 4,]) +
  geom_bar(aes(x=floor(time_elapsed_min)))

Let’s have a look to the other periods as well

ggplot(data = df[df$period %in% c(1,2,3,4),]) +
  geom_bar(aes(x=floor(time_elapsed_min))) +
  facet_grid(rows = vars(period))

There is masive jump in the 11th minute of the 4th period. It definitely shows that teams are using hack-a-shack strategy quite a lot.

There has been numerous changes in the NBA rules to disencourage deliberate fouling. So the general pattern we see here might be caused by a change in the rules. This link follows the changes in the rules of the NBA with a brief commentary

Let’s have a look at the same histogram by season

ggplot(data = df[df$period == 4,]) +
  geom_bar(aes(x=floor(time_elapsed_min))) +
  facet_grid(rows = vars(season))

All the sessions follow the general pattern. Checking the rules changes during the the period covered by the data, there was not changes in the rules affecting the hack-a-shack strategy.

The strategy is only useful for games that are very close in their score. We can check if the games that have a closer score had more free shots at the end of the 4th period.

ggplot(data = df[abs(df$end_home_score-df$end_away_score) < 12  & df$period == 4,]) +
  geom_bar(aes(x=floor(time_elapsed_min)))

It looks like the tighter the final score was the more fouls were committed in the last minute of the game.

We can check as well if this effect is observed in both the regular season and the playoffs

ggplot(data = df[df$period %in% c(1,2,3,4),]) +
  geom_bar(aes(x=floor(time_elapsed_min))) +
  facet_grid(cols = vars(period), rows = vars(playoffs))

It’s hard to see the effect on the plays offs because of the scale of the y axis.

ggplot(data = df[df$period %in% c(1,2,3,4) & df$playoffs == 'playoffs',]) +
  geom_bar(aes(x=floor(time_elapsed_min))) +
  facet_grid(cols = vars(period))

There is a sharp increase in the number of free shots in the 11th minute of the 4th period.

Conclusions

The hack-a-shack strategy seems to be widely used in the NBA. We have shown that it has been used in all the seasons we have data for, and it is used in both the regular season and the playoffs.

The data shows that there is an increase in the number of free shots as time goes on in each of the periods. It’s well known in basketball that trailing teams tend commit more fouls to increase the number of possessions.

Future improvements

It would be interesting to look at if there’s any relation between how close the end score was and how many fouls were commited in the last minute of the game

# Select the free shots from the 11th minute of the 4th period. Our study object is the game for this hypothesis
last_min <- df[floor(df$time_elapsed_min) == 11 & df$period == 4, c("end_home_score", "end_away_score", "game_id")]
last_min_game <- last_min %>%
  group_by(game_id, end_home_score, end_away_score) %>%
  summarise(total_free_shots = n())
last_min_game <- last_min_game %>%
  mutate(
    end_score_diff = abs(end_home_score - end_away_score)
  )
head(last_min_game)

Let’s have a look to the distributins of the variables

breaks_free_shots <- seq(1,max(last_min_game$total_free_shots),1)
hist(last_min_game$total_free_shots, breaks = breaks_free_shots)

It’s more of a poisson distribution. The pair numbers are much more frecuent because two-free-shot fouls are much more frecuent than one or three-free-shot fouls.

breaks_end_score_diff <- seq(1,max(last_min_game$end_score_diff),1)
hist(last_min_game$end_score_diff, breaks = breaks_end_score_diff)

#hist(log(last_min_game$end_score_diff))

So the distribution of the two variables are not normal. I shouldn’t use normal linear regression because the response variable is not normally distributed. The number of free shots in the 11th minute of the 4th period is a count variable which probably follows a poisson distribution. So let’s try a poisson regression

# poisson_regression <- glm(total_free_shots ~ end_score_diff, data = last_min_game, family = poisson(link = 'log'))
# summary(poisson_regression)

OTHER VISUALIZATIONS - PLAYERS

Once a put my hands on this data, I was interested to look at players as the focus of the study.

This graph shows all the players showing the total number of shots and how many of them were scored.

ggplot(data = df) +
  geom_bar(aes(x = player, fill = shot_made))+
  theme(axis.text.x=element_text(angle=20, hjust=1))

There are too many players to display them nicely. We can see that there is a huge variability among players in the number of free shots taken

Create a new dataframe with one row per player and how many shots were scored and missed.

player_shots <- df %>%
  select(player, shot_made) %>%
  group_by(player, shot_made) %>%
  summarise(count=n())

player_shots <- player_shots %>%
  group_by(player) %>%
  mutate(
    total = sum(count),
    rate = count/sum(count)
    ) %>%
  arrange(desc(total))
head(player_shots)

This graph shows the first 25 players by total number of free shots taken

ggplot(data = player_shots[1:50,]) +
  geom_bar(aes(x = reorder(player, desc(total)), y=count,fill = shot_made), stat='identity')+
  theme(axis.text.x=element_text(angle=45, hjust=1))

LeBron James and Dwight Howard are the two players that clearly stand out over the rest. But there is clear difference in the rate at which both make free shots, being LeBron James’ much higher than Dwight Howard’s rate.

This graph shows the percentage of free shots made by the top 15 players with more than 500 free shots taken in the analysed study.

player_shots_made <- player_shots[player_shots$shot_made == 1 & player_shots$count > 500,]
ggplot(data = arrange(player_shots_made, desc(player_shots_made$rate))[1:30,], aes(x = reorder(player, desc(rate)), y=rate, fill = shot_made)) +
  geom_bar(stat='identity')+
  theme(axis.text.x=element_text(angle=45, hjust=1))+
  geom_text(aes(x = reorder(player, desc(rate)), y = rate,
                                             label = paste0(as.integer(rate*100),"%")), size=2)

These are very high numbers being the average for the NBA around 75%

And this graph shows the top 15 players by number of free shots made

ggplot(data = arrange(player_shots_made, desc(player_shots_made$rate))[1:30,], aes(x = reorder(player, desc(count)), y=count)) +
  geom_bar(stat='identity')+
  theme(axis.text.x=element_text(angle=45, hjust=1))