Data Obtain
1 Getting the data
df2016 <- nbastatR::teams_shots(all_active_teams = T, seasons = 2016, measures = "FGA")
df2017 <- nbastatR::teams_shots(all_active_teams = T, seasons = 2017, measures = "FGA")
df2018 <- nbastatR::teams_shots(all_active_teams = T, seasons = 2018, measures = "FGA")
df2019 <- nbastatR::teams_shots(all_active_teams = T, seasons = 2019, measures = "FGA")
df2020 <- nbastatR::teams_shots(all_active_teams = T, seasons = 2020, measures = "FGA")# This won't load unless I do it season by season so this is the best solution I could find, not that this is cumulatively 362.3 megabytes so takes a bit
bigdf <- do.call("rbind", list(df2016, df2017, df2018, df2019, df2020))
write_rds(bigdf, here("Data", "shot_player_data.rds")) # The data folder is tracked by git lfsThe dataframe I derived from the above code has 1,037,103 observations (or shots taken in the seasons including and between 2015-2016 and 2019-2020), but needs a lot of cleaning as the dates and time frames in the game make it difficult to tell when a shooter makes or misses consecutively. Below I begin cleaning the data by first making shot times and specific games more clear:
bigdf <- read_rds(here("Data/shot_player_data.rds")) # Read in the dataframe composed aboveFirst I select relevant columns to make a date and time for each shot, whether the shot is a make or miss, and including players names and ids to keep track of who is shooting. The table notes the number of games containing each period number, interestingly the number of 1st, 2nd, 3rd, and 4th quarters consecutively trend downwards - I guess this means cancelled games? Quadruple overtime also appears to be the max that has occurred with 57 observations as there are 4 quarters in NBA basketball.
cleandf <- bigdf %>% select(yearSeason, idTeam, idPlayer, namePlayer, typeEvent, typeShot, dateGame, numberPeriod, minutesRemaining, secondsRemaining, isShotAttempted, isShotMade, distanceShot)
cleandf %>% group_by(numberPeriod) %>% summarize(count = n())## # A tibble: 8 x 2
## numberPeriod count
## <dbl> <int>
## 1 1 269132
## 2 2 259878
## 3 3 254826
## 4 4 246431
## 5 5 5983
## 6 6 686
## 7 7 110
## 8 8 57
Next I start cleaning the time part of the data, what game is it, and what period/time in the period is it? The last part of the pipeline uses the streak_length function which is documented in the package to find what the number of consecutive makes is, and then relocates that to the front of the dataframe for ease of use.
cleandf <- cleandf %>% mutate(shotTime = paste0(numberPeriod,":",minutesRemaining,":",secondsRemaining), dateGame = gsub("(\\d{4})(\\d{2})", "\\1-\\2-", dateGame)) # Here I make the shotTime variable which will be useful after grouping by dateGame, the date of the game played which had some weird date formatting I fix with gsub
cleandf <- cleandf %>% group_by(dateGame, namePlayer) %>% mutate(streakLength = streak_length(isShotMade)) %>% ungroup() %>% arrange(namePlayer) %>% relocate(streakLength, .after = typeEvent)
# write_rds(cleandf, here("Data", "shot_streaks_data.rds")) # Creating the dataframe where I run Simulations.rmd from2 Testing out on several players
Looks like Klay and Ibaka had some pretty crazy outlier nights on their 14 consecutive make nights. LeBron is of course historically efficient compared to anyone else, especially keeping in mind the scales are different on the x axis and y axis.
p1 <- player_streaks_graph("LeBron James", cleandf)
p2 <- player_streaks_graph("Klay Thompson", cleandf)
p3 <- player_streaks_graph("Damian Lillard", cleandf)
p4 <- player_streaks_graph("Serge Ibaka", cleandf)
gridExtra::grid.arrange(p1, p2, p3, p4, nrow = 2)player_streaks_graph("Kyle Korver", cleandf)A work by Duncan Gates
gatesdu@oregonstate.edu