library('ggplot2')
library('dplyr')
library('GGally')
library('lubridate')
Task: Read in the csv containing NBA free throws. Visualise some aspect of the data you find interesting, e.g., the average number of free throws per period for the regular season and the playoffs.
NBA Basketball has the following characteristics:
Each of the free shots adds one point to the score
Games are played by two teams, of 5 payers each.
One game is divided in 4 quarters of 12 minutes of duration
Free shots are shots throuwn by a player without any opposition. A free shot happen as a consequence of a foul done over the player that is throwing the free shot. If the foul happen within the 2 points area, the player is given 2 free shots. If the foul happen in the three points area, then the player has 3 free shots. One free shot is given when there was a foul when the player was shoting and he converted the shot.
A seasson of the league has two different periods:
There is one season per year
df <- read.csv('data/free_throws.csv', header = TRUE, stringsAsFactors = FALSE)
head(df)
Let’s have a look to the variables
str(df)
## 'data.frame': 618019 obs. of 11 variables:
## $ end_result: chr "106 - 114" "106 - 114" "106 - 114" "106 - 114" ...
## $ game : chr "PHX - LAL" "PHX - LAL" "PHX - LAL" "PHX - LAL" ...
## $ game_id : num 2.61e+08 2.61e+08 2.61e+08 2.61e+08 2.61e+08 ...
## $ period : num 1 1 1 1 1 1 1 2 2 2 ...
## $ play : chr "Andrew Bynum makes free throw 1 of 2" "Andrew Bynum makes free throw 2 of 2" "Andrew Bynum makes free throw 1 of 2" "Andrew Bynum misses free throw 2 of 2" ...
## $ player : chr "Andrew Bynum" "Andrew Bynum" "Andrew Bynum" "Andrew Bynum" ...
## $ playoffs : chr "regular" "regular" "regular" "regular" ...
## $ score : chr "0 - 1" "0 - 2" "18 - 12" "18 - 12" ...
## $ season : chr "2006 - 2007" "2006 - 2007" "2006 - 2007" "2006 - 2007" ...
## $ shot_made : int 1 1 1 0 1 1 1 0 1 1 ...
## $ time : chr "11:45" "11:45" "7:26" "7:26" ...
Are times the actual time of the day or are they times since the game start?
They could be times in the format hh:mm or times from start of the game in the format of mm:ss
Each of the quarters in a basquetball game last 12 min, so we can’t really tell those options apart looking at the data. Common sense tells me that they are times since start, as time of the day is not very informative in this context. The dataset has the period also so that would give us when on the game the free shot was done.
Let’s have a look to the distribution of the times
times = as.integer(sapply(df$time, function(x) unlist(strsplit(x, ':'))[1]))
hist(times)
Let’s assume the format is ‘mm:ss’
Having a look to the data, it’s clear that the times provided is the time left in each period.
Shot_madeI don’t know what shot_made really is. Let’s have a look
df[,c("shot_made", 'play')]
So there is a row for echa free shot. Even when there are two free shots because of a foul there will be two rows, one for each of the free shots. So shot_made tell is if every single shot was made or not
sum(is.na(df))
## [1] 0
There si no missing data.
Data wrangling
df <- df %>%
mutate(
## Trasform variables
game_id = as.character(game_id),
period = as.factor(period),
playoffs = as.factor(playoffs),
season = as.factor(season),
shot_made = as.factor(shot_made),
## New variables
# split the scores
end_home_score = as.integer(sapply(end_result, function(x) unlist(strsplit(x, ' - '))[1])),
end_away_score = as.integer(sapply(end_result, function(x) unlist(strsplit(x, ' - '))[2])),
# split the teams
home_team = as.character(sapply(game, function(x) unlist(strsplit(x, ' - '))[1])),
away_team = as.character(sapply(game, function(x) unlist(strsplit(x, ' - '))[2])),
# split the scores
elapsed_home_score = as.integer(sapply(score, function(x) unlist(strsplit(x, ' - '))[1])),
elapsed_away_score = as.integer(sapply(score, function(x) unlist(strsplit(x, ' - '))[2])),
# time
time_min_sec = ms(time),
time_dur_sec = dminutes(12) - as.duration(time_min_sec),
time_elapsed_min_sec = as.period(time_dur_sec),
time_elapsed_min = as.numeric(time_elapsed_min_sec, "minutes"),
time_elapsed_sec = as.numeric(time_elapsed_min_sec, "seconds"),
)
str(df)
## 'data.frame': 618019 obs. of 22 variables:
## $ end_result : chr "106 - 114" "106 - 114" "106 - 114" "106 - 114" ...
## $ game : chr "PHX - LAL" "PHX - LAL" "PHX - LAL" "PHX - LAL" ...
## $ game_id : chr "261031013" "261031013" "261031013" "261031013" ...
## $ period : Factor w/ 8 levels "1","2","3","4",..: 1 1 1 1 1 1 1 2 2 2 ...
## $ play : chr "Andrew Bynum makes free throw 1 of 2" "Andrew Bynum makes free throw 2 of 2" "Andrew Bynum makes free throw 1 of 2" "Andrew Bynum misses free throw 2 of 2" ...
## $ player : chr "Andrew Bynum" "Andrew Bynum" "Andrew Bynum" "Andrew Bynum" ...
## $ playoffs : Factor w/ 2 levels "playoffs","regular": 2 2 2 2 2 2 2 2 2 2 ...
## $ score : chr "0 - 1" "0 - 2" "18 - 12" "18 - 12" ...
## $ season : Factor w/ 10 levels "2006 - 2007",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ shot_made : Factor w/ 2 levels "0","1": 2 2 2 1 2 2 2 1 2 2 ...
## $ time : chr "11:45" "11:45" "7:26" "7:26" ...
## $ end_home_score : int 106 106 106 106 106 106 106 106 106 106 ...
## $ end_away_score : int 114 114 114 114 114 114 114 114 114 114 ...
## $ home_team : chr "PHX" "PHX" "PHX" "PHX" ...
## $ away_team : chr "LAL" "LAL" "LAL" "LAL" ...
## $ elapsed_home_score : int 0 0 18 18 21 33 34 43 44 44 ...
## $ elapsed_away_score : int 1 2 12 12 12 20 20 29 29 30 ...
## $ time_min_sec :Formal class 'Period' [package "lubridate"] with 6 slots
## .. ..@ .Data : num 45 45 26 26 18 15 15 52 52 37 ...
## .. ..@ year : num 0 0 0 0 0 0 0 0 0 0 ...
## .. ..@ month : num 0 0 0 0 0 0 0 0 0 0 ...
## .. ..@ day : num 0 0 0 0 0 0 0 0 0 0 ...
## .. ..@ hour : num 0 0 0 0 0 0 0 0 0 0 ...
## .. ..@ minute: num 11 11 7 7 7 3 3 10 10 10 ...
## $ time_dur_sec :Formal class 'Duration' [package "lubridate"] with 1 slot
## .. ..@ .Data: num 15 15 274 274 282 525 525 68 68 83 ...
## $ time_elapsed_min_sec:Formal class 'Period' [package "lubridate"] with 6 slots
## .. ..@ .Data : num 15 15 34 34 42 45 45 8 8 23 ...
## .. ..@ year : num 0 0 0 0 0 0 0 0 0 0 ...
## .. ..@ month : num 0 0 0 0 0 0 0 0 0 0 ...
## .. ..@ day : num 0 0 0 0 0 0 0 0 0 0 ...
## .. ..@ hour : num 0 0 0 0 0 0 0 0 0 0 ...
## .. ..@ minute: num 0 0 4 4 4 8 8 1 1 1 ...
## $ time_elapsed_min : num 0.25 0.25 4.57 4.57 4.7 ...
## $ time_elapsed_sec : num 15 15 274 274 282 525 525 68 68 83 ...
head(df)
visual check that the scores were extracted correctly
df[1:100,'end_home_score']
## [1] 106 106 106 106 106 106 106 106 106 106 106 106 106 106 106 106 106
## [18] 106 106 106 106 106 106 106 106 106 106 106 106 106 106 106 106 106
## [35] 106 106 106 106 106 106 106 108 108 108 108 108 108 108 108 108 108
## [52] 108 108 108 108 108 108 108 108 108 108 108 108 108 108 108 108 108
## [69] 108 108 108 108 108 108 108 108 108 108 108 108 108 108 108 108 108
## [86] 108 108 108 108 108 108 108 108 108 108 91 91 91 91 91
head(df$end_result, 100)
## [1] "106 - 114" "106 - 114" "106 - 114" "106 - 114" "106 - 114"
## [6] "106 - 114" "106 - 114" "106 - 114" "106 - 114" "106 - 114"
## [11] "106 - 114" "106 - 114" "106 - 114" "106 - 114" "106 - 114"
## [16] "106 - 114" "106 - 114" "106 - 114" "106 - 114" "106 - 114"
## [21] "106 - 114" "106 - 114" "106 - 114" "106 - 114" "106 - 114"
## [26] "106 - 114" "106 - 114" "106 - 114" "106 - 114" "106 - 114"
## [31] "106 - 114" "106 - 114" "106 - 114" "106 - 114" "106 - 114"
## [36] "106 - 114" "106 - 114" "106 - 114" "106 - 114" "106 - 114"
## [41] "106 - 114" "108 - 66" "108 - 66" "108 - 66" "108 - 66"
## [46] "108 - 66" "108 - 66" "108 - 66" "108 - 66" "108 - 66"
## [51] "108 - 66" "108 - 66" "108 - 66" "108 - 66" "108 - 66"
## [56] "108 - 66" "108 - 66" "108 - 66" "108 - 66" "108 - 66"
## [61] "108 - 66" "108 - 66" "108 - 66" "108 - 66" "108 - 66"
## [66] "108 - 66" "108 - 66" "108 - 66" "108 - 66" "108 - 66"
## [71] "108 - 66" "108 - 66" "108 - 66" "108 - 66" "108 - 66"
## [76] "108 - 66" "108 - 66" "108 - 66" "108 - 66" "108 - 66"
## [81] "108 - 66" "108 - 66" "108 - 66" "108 - 66" "108 - 66"
## [86] "108 - 66" "108 - 66" "108 - 66" "108 - 66" "108 - 66"
## [91] "108 - 66" "108 - 66" "108 - 66" "108 - 66" "108 - 66"
## [96] "91 - 87" "91 - 87" "91 - 87" "91 - 87" "91 - 87"
summary(df)
## end_result game game_id period
## Length:618019 Length:618019 Length:618019 4 :186280
## Class :character Class :character Class :character 2 :154218
## Mode :character Mode :character Mode :character 3 :154108
## 1 :116962
## 5 : 5439
## 6 : 838
## (Other): 174
## play player playoffs
## Length:618019 Length:618019 playoffs: 42126
## Class :character Class :character regular :575893
## Mode :character Mode :character
##
##
##
##
## score season shot_made time
## Length:618019 2006 - 2007: 67612 0:150330 Length:618019
## Class :character 2007 - 2008: 65500 1:467689 Class :character
## Mode :character 2008 - 2009: 65355 Mode :character
## 2009 - 2010: 64804
## 2010 - 2011: 64137
## 2013 - 2014: 62290
## (Other) :228321
## end_home_score end_away_score home_team away_team
## Min. : 54.00 Min. : 59 Length:618019 Length:618019
## 1st Qu.: 91.00 1st Qu.: 94 Class :character Class :character
## Median : 99.00 Median :102 Mode :character Mode :character
## Mean : 98.99 Mean :102
## 3rd Qu.:107.00 3rd Qu.:110
## Max. :161.00 Max. :168
##
## elapsed_home_score elapsed_away_score time_min_sec
## Min. : 0.00 Min. : 0.00 Min. :0S
## 1st Qu.: 31.00 1st Qu.: 33.00 1st Qu.:1M 51S
## Median : 56.00 Median : 58.00 Median :4M 31S
## Mean : 55.48 Mean : 57.46 Mean :4M 57.4996399787062S
## 3rd Qu.: 79.00 3rd Qu.: 82.00 3rd Qu.:7M 50S
## Max. :161.00 Max. :166.00 Max. :12M 0S
##
## time_dur_sec
## Min. :0s
## 1st Qu.:250s (~4.17 minutes)
## Median :449s (~7.48 minutes)
## Mean :422.500360021294s (~7.04 minutes)
## 3rd Qu.:609s (~10.15 minutes)
## Max. :720s (~12 minutes)
##
## time_elapsed_min_sec time_elapsed_min time_elapsed_sec
## Min. :0S Min. : 0.000 Min. : 0.0
## 1st Qu.:4M 10S 1st Qu.: 4.167 1st Qu.:250.0
## Median :7M 29S Median : 7.483 Median :449.0
## Mean :7M 2.50036002129383S Mean : 7.042 Mean :422.5
## 3rd Qu.:10M 9S 3rd Qu.:10.150 3rd Qu.:609.0
## Max. :12M 0S Max. :12.000 Max. :720.0
##
df_pair_end_score <- df[,c("playoffs", "season", "end_home_score", "end_away_score")]
ggpairs(df_pair_end_score, labeller = "label_context") + theme_grey(base_size = 30)
df_pair_shot_made <- df[,c("period", "playoffs", "season", "time_elapsed_min", "shot_made")]
ggpairs(df_pair_shot_made) + theme_grey(base_size = 30)
Hack-A-Shack was a common strategy in the NBA used in the last minutes of the game when a team was behind in the score. The strategy consists in committing intentional fouls throughout the game against selected opponents who shot free throws poorly. The trailing team fouls intentionally to end the opponents’ possession as soon as possible. More info following this link https://en.wikipedia.org/wiki/Hack-a-Shaq
Let’s have a look at the distributions of times of the free shots in the 4 quarter
ggplot(data = df[df$period == 4,]) +
geom_bar(aes(x=floor(time_elapsed_min)))
Interestingly, as we saw before, there is a spike on the number of the free shots in the last minute of the 4th period. There are free shots that happen when there is 0 seconds left (the 12 min bar). This free shots should really be in the 11th bar.
Let’s transform the 00:00 times into 00:01 so they belong to the 11th minute
df[df$time_min_sec == period(0), "time_min_sec"] <- duration(1, "second")
# Check there is not more rows with period = 0
# df[df$time_min_sec == period(0),]
Modify the rest of the time variables accordingly
df <- df%>%
mutate(
time_dur_sec = dminutes(12) - as.duration(time_min_sec),
time_elapsed_min_sec = as.period(time_dur_sec),
time_elapsed_min = as.numeric(time_elapsed_min_sec, "minutes"),
time_elapsed_sec = as.numeric(time_elapsed_min_sec, "seconds"),
)
Same graph without the 12th minute bin
ggplot(data = df[df$period == 4,]) +
geom_bar(aes(x=floor(time_elapsed_min)))
Let’s have a look to the other periods as well
ggplot(data = df[df$period %in% c(1,2,3,4),]) +
geom_bar(aes(x=floor(time_elapsed_min))) +
facet_grid(rows = vars(period))
There is masive jump in the 11th minute of the 4th period. It definitely shows that teams are using hack-a-shack strategy quite a lot.
There has been numerous changes in the NBA rules to disencourage deliberate fouling. So the general pattern we see here might be caused by a change in the rules. This link follows the changes in the rules of the NBA with a brief commentary
Let’s have a look at the same histogram by season
ggplot(data = df[df$period == 4,]) +
geom_bar(aes(x=floor(time_elapsed_min))) +
facet_grid(rows = vars(season))
All the sessions follow the general pattern. Checking the rules changes during the the period covered by the data, there was not changes in the rules affecting the hack-a-shack strategy.
The strategy is only useful for games that are very close in their score. We can check if the games that have a closer score had more free shots at the end of the 4th period.
ggplot(data = df[abs(df$end_home_score-df$end_away_score) < 12 & df$period == 4,]) +
geom_bar(aes(x=floor(time_elapsed_min)))
It looks like the tighter the final score was the more fouls were committed in the last minute of the game.
We can check as well if this effect is observed in both the regular season and the playoffs
ggplot(data = df[df$period %in% c(1,2,3,4),]) +
geom_bar(aes(x=floor(time_elapsed_min))) +
facet_grid(cols = vars(period), rows = vars(playoffs))
It’s hard to see the effect on the plays offs because of the scale of the y axis.
ggplot(data = df[df$period %in% c(1,2,3,4) & df$playoffs == 'playoffs',]) +
geom_bar(aes(x=floor(time_elapsed_min))) +
facet_grid(cols = vars(period))
There is a sharp increase in the number of free shots in the 11th minute of the 4th period.
The hack-a-shack strategy seems to be widely used in the NBA. We have shown that it has been used in all the seasons we have data for, and it is used in both the regular season and the playoffs.
The data shows that there is an increase in the number of free shots as time goes on in each of the periods. It’s well known in basketball that trailing teams tend commit more fouls to increase the number of possessions.
It would be interesting to look at if there’s any relation between how close the end score was and how many fouls were commited in the last minute of the game
# Select the free shots from the 11th minute of the 4th period. Our study object is the game for this hypothesis
last_min <- df[floor(df$time_elapsed_min) == 11 & df$period == 4, c("end_home_score", "end_away_score", "game_id")]
last_min_game <- last_min %>%
group_by(game_id, end_home_score, end_away_score) %>%
summarise(total_free_shots = n())
last_min_game <- last_min_game %>%
mutate(
end_score_diff = abs(end_home_score - end_away_score)
)
head(last_min_game)
Let’s have a look to the distributins of the variables
breaks_free_shots <- seq(1,max(last_min_game$total_free_shots),1)
hist(last_min_game$total_free_shots, breaks = breaks_free_shots)
It’s more of a poisson distribution. The pair numbers are much more frecuent because two-free-shot fouls are much more frecuent than one or three-free-shot fouls.
breaks_end_score_diff <- seq(1,max(last_min_game$end_score_diff),1)
hist(last_min_game$end_score_diff, breaks = breaks_end_score_diff)
#hist(log(last_min_game$end_score_diff))
So the distribution of the two variables are not normal. I shouldn’t use normal linear regression because the response variable is not normally distributed. The number of free shots in the 11th minute of the 4th period is a count variable which probably follows a poisson distribution. So let’s try a poisson regression
# poisson_regression <- glm(total_free_shots ~ end_score_diff, data = last_min_game, family = poisson(link = 'log'))
# summary(poisson_regression)
Once a put my hands on this data, I was interested to look at players as the focus of the study.
This graph shows all the players showing the total number of shots and how many of them were scored.
ggplot(data = df) +
geom_bar(aes(x = player, fill = shot_made))+
theme(axis.text.x=element_text(angle=20, hjust=1))
There are too many players to display them nicely. We can see that there is a huge variability among players in the number of free shots taken
Create a new dataframe with one row per player and how many shots were scored and missed.
player_shots <- df %>%
select(player, shot_made) %>%
group_by(player, shot_made) %>%
summarise(count=n())
player_shots <- player_shots %>%
group_by(player) %>%
mutate(
total = sum(count),
rate = count/sum(count)
) %>%
arrange(desc(total))
head(player_shots)
This graph shows the first 25 players by total number of free shots taken
ggplot(data = player_shots[1:50,]) +
geom_bar(aes(x = reorder(player, desc(total)), y=count,fill = shot_made), stat='identity')+
theme(axis.text.x=element_text(angle=45, hjust=1))
LeBron James and Dwight Howard are the two players that clearly stand out over the rest. But there is clear difference in the rate at which both make free shots, being LeBron James’ much higher than Dwight Howard’s rate.
This graph shows the percentage of free shots made by the top 15 players with more than 500 free shots taken in the analysed study.
player_shots_made <- player_shots[player_shots$shot_made == 1 & player_shots$count > 500,]
ggplot(data = arrange(player_shots_made, desc(player_shots_made$rate))[1:30,], aes(x = reorder(player, desc(rate)), y=rate, fill = shot_made)) +
geom_bar(stat='identity')+
theme(axis.text.x=element_text(angle=45, hjust=1))+
geom_text(aes(x = reorder(player, desc(rate)), y = rate,
label = paste0(as.integer(rate*100),"%")), size=2)
These are very high numbers being the average for the NBA around 75%
And this graph shows the top 15 players by number of free shots made
ggplot(data = arrange(player_shots_made, desc(player_shots_made$rate))[1:30,], aes(x = reorder(player, desc(count)), y=count)) +
geom_bar(stat='identity')+
theme(axis.text.x=element_text(angle=45, hjust=1))