Stephen Curry is an NBA superstar and a virtuoso when it comes to making difficult shots. Whether it’s a flashy 30-footer or a crucial buzzer-beater, time after time Steph Curry hits shots that seem utterly impossible. But just how improbable are the shots that he makes? We gathered data on over 125,000 shots from the 2014-2015 NBA season to get an idea of how difficult an NBA shot is in different situations. The data includes variables such as the distance of the shot, the distance from the shooter to the closest defender, and the number of dribbles the shooter took before shooting. Using this data, we built a logistic regression model that computes the probability of a given shot going in. We then used this model to estimate the probability of three of Stephen Curry’s most amazing shots.
Data Cleaning & Variable Description
We obtained our dataset from https://www.kaggle.com/dansbecker/nba-shot-logs. The dataset originally contained 128,069 observations and 21 variables. We dropped observations with missing values in the SHOT_CLOCK variable. We also removed observations that contained nonsensical, negative time data. For example, one observation reported -100 seconds in the variable TOUCH_TIME, which is impossible and clearly an error, so this and others like it were removed. We also removed variables that we felt would not help our model predict the result of a shot, such as player_id and player_name. In addition, we created a new variable called result, which became our dependent variable. It takes a value of 1 for a made shot and 0 for a missed shot.
Variable | Description |
---|---|
TOUCH_TIME | Number of seconds the player had the ball before shooting (seconds) |
SHOT_DIST | Distance of shot from the basket (feet) |
CLOSE_DEF_DIST | Distance from the shooter to the closest defender (feet) |
PTS_TYPE | Number of points the shot was worth (2 or 3) |
FINAL_MARGIN | The final score margin |
SHOT_CLOCK | Number of seconds left on the shot clock at the time of the shot (seconds) |
GAME_CLOCK | Amount of time left on the game clock (MM:SS:00) |
DRIBBLES | How many dribbles the player took before shooting |
SHOT_NUMBER | The number of shots the player had taken in the game at the time of the shot |
PERIOD | The period the shot was taken in |
LOCATION | H or A for home game or away game |
W | Final result of the game: W for win, L for loss |
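The cleaning steps described in this section can be sketched in a few lines. This is only an illustration in plain Python, not the code used for the analysis (the stepAIC output in the appendix suggests the original work was done in R). The sample rows are made up, and the column name SHOT_RESULT with values "made" and "missed" is an assumption about how the raw file records the outcome.

```python
# Hypothetical miniature of the shot log; the real file has 128,069 rows.
# SHOT_RESULT ("made"/"missed") is an assumed column name for the raw outcome.
rows = [
    {"SHOT_CLOCK": 10.8, "TOUCH_TIME": 1.9, "SHOT_RESULT": "made"},
    {"SHOT_CLOCK": None, "TOUCH_TIME": 2.7, "SHOT_RESULT": "missed"},    # missing shot clock
    {"SHOT_CLOCK": 3.4, "TOUCH_TIME": -100.0, "SHOT_RESULT": "missed"},  # impossible touch time
    {"SHOT_CLOCK": 15.0, "TOUCH_TIME": 0.8, "SHOT_RESULT": "missed"},
]


def clean(shots):
    """Drop rows with a missing shot clock or a negative touch time,
    then add the 0/1 dependent variable `result`."""
    cleaned = []
    for shot in shots:
        if shot["SHOT_CLOCK"] is None:
            continue  # missing SHOT_CLOCK value
        if shot["TOUCH_TIME"] < 0:
            continue  # negative time is a recording error
        out = dict(shot)
        out["result"] = 1 if shot["SHOT_RESULT"] == "made" else 0
        cleaned.append(out)
    return cleaned


cleaned = clean(rows)
print(len(cleaned), [shot["result"] for shot in cleaned])
```

Only the first and last sample rows survive the two filters, and each surviving row gains the result variable used as the response in the model.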
Variable Selection
To determine which variables to keep in our final model, we first ran a logistic regression model containing all of the above variables. We took note of the AIC and of the fact that not all of the variables were significant at even the 5% level. To trim the model, we used the stepAIC command, which performs stepwise model selection based on AIC (our detailed AIC results are in the appendix). We ran it in the “forward” and “backward” directions as well as the “both” direction, which can both drop and re-add variables at each step. The “both” search reached the lowest AIC we found (161415.9), and every variable that it kept was significant at the .001 level. The variables that we included in the model are listed in the table below.
The three variables that were dropped were W, LOCATION, and GAME_CLOCK. These variables were statistically insignificant in the first model, so it is reasonable that the stepAIC function would suggest removing them. This also makes sense intuitively. A player rarely knows whether their team will win or lose when they take a shot, so the variable W isn’t likely to affect the probability of the shot going in. Location may seem like it would make a big difference, but professional basketball players are unlikely to be fazed by playing in a different arena, so the variable LOCATION also doesn’t have a significant effect on the shot probability. The GAME_CLOCK variable was the only surprise. One would think that it would have some kind of effect, but it was statistically insignificant when we tested it.
Variable | Description |
---|---|
TOUCH_TIME | Number of seconds the player had the ball before shooting (seconds) |
SHOT_DIST | Distance of shot from the basket (feet) |
CLOSE_DEF_DIST | Distance from the shooter to the closest defender (feet) |
PTS_TYPE | Number of points the shot was worth (2 or 3) |
FINAL_MARGIN | The final score margin |
SHOT_CLOCK | Number of seconds left on the shot clock at the time of the shot (seconds) |
DRIBBLES | How many dribbles the player took before shooting |
SHOT_NUMBER | The number of shots the player had taken in the game at the time of the shot |
PERIOD | The period the shot was taken in |
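The AIC values that drive the selection can be checked by hand. For a logistic regression on 0/1 outcomes, the residual deviance equals minus twice the log-likelihood, so AIC = deviance + 2k, where k counts the estimated parameters (predictors plus the intercept). The sketch below, written in Python for illustration, applies this to the rounded deviances printed in the appendix. The parameter counts assume one coefficient per predictor, which matches the Df column in that output.

```python
def aic_from_deviance(deviance, n_params):
    """AIC for a 0/1-outcome logistic model: deviance + 2 * number of parameters."""
    return deviance + 2 * n_params


# Rounded residual deviances taken from the stepAIC output in the appendix.
full_model_aic = aic_from_deviance(161394, n_params=13)   # 12 predictors + intercept
final_model_aic = aic_from_deviance(161396, n_params=10)  # 9 predictors + intercept

print(full_model_aic, final_model_aic)
print("AIC improvement:", full_model_aic - final_model_aic)
```

Up to rounding, these match the reported starting AIC of 161419.5 and final AIC of 161415.9. Dropping the three variables raises the deviance only slightly, while the penalty term falls by 6, which is why the smaller model wins.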
The Final Model
Based on our model’s coefficients, the four strongest drivers of the log shot odds are TOUCH_TIME, SHOT_DIST, CLOSE_DEF_DIST, and PTS_TYPE (see Table 2). It is reasonable to use the coefficients to measure this because the input variables are all similar in size; none of the variable inputs are much bigger than others. All of the variables in our model are highly significant, so the coefficients are the best way to compare them.
As one would expect, the relationship between the independent variables TOUCH_TIME and SHOT_DIST and the log odds of a shot going in is a negative one. The longer a player has the ball and the farther away they are from the basket, the lower their chance of making the shot. Longer shots are clearly harder, and it is common knowledge among basketball players that catching and immediately shooting is usually more successful than holding onto the ball before shooting. There is, surprisingly, a positive relationship between PTS_TYPE and the log shot odds, meaning that, with shot distance and the other variables held fixed, a 3-pointer is more likely to go in than a 2-pointer. This is possibly because players are more focused when taking more valuable shots, and they’re probably only shooting 3-pointers when they’re wide open. Finally, there was a positive relationship between the distance of the shooter from the closest defender and the log shot odds. This is intuitive; the more unguarded a player is, the more likely they are to make the shot.
The model gives us the following equation: Y = 0.0179 - (0.0691 × TOUCH_TIME) - (0.0633 × SHOT_DIST) + (0.1012 × CLOSE_DEF_DIST) + (0.0818 × PTS_TYPE) + (0.0091 × FINAL_MARGIN) + (0.0149 × SHOT_CLOCK) + (0.0342 × DRIBBLES) + (0.0076 × SHOT_NUMBER) - (0.0265 × PERIOD)
Variable | Coefficients | P - Values |
---|---|---|
Intercept | 0.0178883 | 0.681220 |
TOUCH_TIME | -0.0690932 | < 2e-16 |
SHOT_DIST | -0.0632707 | < 2e-16 |
CLOSE_DEF_DIST | 0.1011608 | < 2e-16 |
PTS_TYPE | 0.0817515 | 8.01e-05 |
FINAL_MARGIN | 0.0091040 | < 2e-16 |
SHOT_CLOCK | 0.0148791 | < 2e-16 |
DRIBBLES | 0.0342039 | 8.06e-13 |
SHOT_NUMBER | 0.0076131 | 7.17e-06 |
PERIOD | -0.0264933 | 0.000129 |
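Because the model is linear in the log odds, exponentiating a coefficient gives the factor by which the odds of a made shot are multiplied for a one-unit increase in that variable, with the other variables held fixed. The short Python sketch below does this for three of the coefficients in the table above; it is an illustration of how to read the table, not part of the original analysis.

```python
import math

# Coefficients copied from the table above.
coefficients = {
    "TOUCH_TIME": -0.0690932,
    "SHOT_DIST": -0.0632707,
    "CLOSE_DEF_DIST": 0.1011608,
}

# exp(coefficient) is the odds ratio for a one-unit increase in the variable.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds multiplied by {ratio:.3f} per one-unit increase")
```

A ratio below 1 (TOUCH_TIME, SHOT_DIST) means each extra second or foot lowers the odds, while a ratio above 1 (CLOSE_DEF_DIST) means each extra foot of separation from the defender raises them.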
Since this model is logistic, the equation computes the log odds of a shot going in. To find the actual odds, we use the formula odds = exp(Y), where Y is our estimated log odds of a made shot. To compute the probability, we use the formula probability = 1/(1+exp(-Y)).
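As a worked example of these formulas, the Python sketch below plugs the inputs for the “Half-Court Shot” (the first row of the shot table later in this article) into the fitted equation, using the full-precision coefficients from the coefficient table. The function names are illustrative and not taken from the original analysis.

```python
import math

INTERCEPT = 0.0178883
COEFFICIENTS = {
    "TOUCH_TIME": -0.0690932,
    "SHOT_DIST": -0.0632707,
    "CLOSE_DEF_DIST": 0.1011608,
    "PTS_TYPE": 0.0817515,
    "FINAL_MARGIN": 0.0091040,
    "SHOT_CLOCK": 0.0148791,
    "DRIBBLES": 0.0342039,
    "SHOT_NUMBER": 0.0076131,
    "PERIOD": -0.0264933,
}


def log_odds(shot):
    """Linear predictor Y: intercept plus coefficient times input, summed."""
    return INTERCEPT + sum(COEFFICIENTS[name] * shot[name] for name in COEFFICIENTS)


def probability(y):
    """Convert log odds Y to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-y))


# Inputs for the Half-Court Shot, as listed in the shot table.
half_court_shot = {
    "TOUCH_TIME": 3.6, "SHOT_DIST": 51, "CLOSE_DEF_DIST": 5, "PTS_TYPE": 3,
    "FINAL_MARGIN": 46, "SHOT_CLOCK": 1.8, "DRIBBLES": 3, "SHOT_NUMBER": 12,
    "PERIOD": 2,
}

y = log_odds(half_court_shot)
p = probability(y)
print(f"log odds = {y:.2f}, odds = {math.exp(y):.3f}, probability = {p:.1%}")
```

This reproduces the log odds of -2.12 and the probability of 10.7% reported for that shot.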
We were able to use this model to create the graph below. This graph consists of points that are either blue or red. Each point represents a shot from our dataset. A blue point indicates a missed shot and a red point indicates a made shot. The y-axis displays our predicted probability of each shot going in. For this reason, we see mostly red points above the ~70% predicted probability and mostly blue dots below the ~30% predicted probability. This indicates that our model is quite effective at predicting the likelihood of a shot going in.
We do see a significant number of points that indicate a made shot where the probability of making the shot was low. In contrast, we don’t see a great number of missed shots that had a high probability of going in. This may be confusing, but in reality, it makes sense. These are professional basketball players, and they’re going to make really difficult shots occasionally, but they will rarely miss easy shots.
Finally, we wanted to assess the difficulty of three of Steph Curry’s most mind-boggling shots. Here are videos of the three shots we chose to analyze:
https://www.businessinsider.com/stephen-curry-half-court-shot-against-clippers-2017-1 We will call this shot: “Half-Court Shot”
https://www.youtube.com/watch?v=X9necA_prVM We will call this shot: “3/4-Court Shot”
https://www.youtube.com/watch?v=cP54guVj_XE We will call this shot: “Contested 3-Pointer”
The data for each of these shots is in the table below (see Table 4). We computed Steph Curry’s probability of making the Half-Court Shot to be 10.7%, which means it was quite unlikely. We believe this percentage was inflated by the extreme final margin; Curry’s team won the game by 46 points. Even more unlikely was the 3/4-Court Shot, which had a mere 4.2% chance of going in. This is because the shot was taken from an incredibly long distance with just 1 second left on the shot clock. The final shot, the Contested 3-Pointer, was difficult mainly because of how close the nearest defender was and how long Curry held the ball before shooting; in our model the DRIBBLES coefficient is positive, so the six dribbles he took do not lower the estimate. It had a 25.4% probability of going in, which is much lower than a standard uncontested 3-pointer would have been for one of the greatest shooters of all time.
Shot | TOUCH_TIME | SHOT_DIST | CLOSE_DEF_DIST | PTS_TYPE | FINAL_MARGIN | SHOT_CLOCK | DRIBBLES | SHOT_NUMBER | PERIOD | log(odds) | Probability of Making Shot |
---|---|---|---|---|---|---|---|---|---|---|---|
Half-Court Shot | 3.6 | 51 | 5 | 3 | 46 | 1.8 | 3 | 12 | 2 | -2.12 | 10.7% |
3/4-Court Shot | 0.73 | 60 | 4 | 3 | 13 | 1 | 0 | 17 | 3 | -3.12 | 4.2% |
Contested 3-Pointer | 3.1 | 24 | 1 | 3 | 8 | 10 | 6 | 1 | 3 | -1.08 | 25.4% |
This model helps to capture the level of difficulty of some of Steph Curry’s most amazing shots. There were certain variables that we did not have that we feel would have improved our model. Variables such as the height of the nearest defender and the field goal percentage of a given player up to the time that they took a given shot would more than likely have made our model even better at predicting.
In this project, we were able to create a model that computes the probability of a given shot in the NBA. We were able to use this model to assess three of Stephen Curry’s most amazing shots and in doing so were able to gain an even greater understanding of how amazingly he performs on the basketball court.
Information regarding Stephen Curry’s three improbable shots was obtained from basketball-reference.com. The distance from the nearest defender for each of these three shots was estimated by the authors.
Appendix
## Start: AIC=161419.5
## result ~ TOUCH_TIME + SHOT_DIST + CLOSE_DEF_DIST + PTS_TYPE +
## FINAL_MARGIN + SHOT_CLOCK + GAME_CLOCK + DRIBBLES + LOCATION +
## W + SHOT_NUMBER + PERIOD
##
## Df Deviance AIC
## - LOCATION 1 161394 161418
## - GAME_CLOCK 1 161394 161418
## - W 1 161395 161419
## <none> 161394 161420
## - PERIOD 1 161406 161430
## - PTS_TYPE 1 161409 161433
## - SHOT_NUMBER 1 161410 161434
## - DRIBBLES 1 161444 161468
## - FINAL_MARGIN 1 161517 161541
## - TOUCH_TIME 1 161544 161568
## - SHOT_CLOCK 1 161582 161606
## - CLOSE_DEF_DIST 1 162763 162787
## - SHOT_DIST 1 164574 164598
##
## Step: AIC=161417.9
## result ~ TOUCH_TIME + SHOT_DIST + CLOSE_DEF_DIST + PTS_TYPE +
## FINAL_MARGIN + SHOT_CLOCK + GAME_CLOCK + DRIBBLES + W + SHOT_NUMBER +
## PERIOD
##
## Df Deviance AIC
## - GAME_CLOCK 1 161395 161417
## - W 1 161395 161417
## <none> 161394 161418
## + LOCATION 1 161394 161420
## - PERIOD 1 161407 161429
## - PTS_TYPE 1 161409 161431
## - SHOT_NUMBER 1 161410 161432
## - DRIBBLES 1 161445 161467
## - FINAL_MARGIN 1 161517 161539
## - TOUCH_TIME 1 161545 161567
## - SHOT_CLOCK 1 161583 161605
## - CLOSE_DEF_DIST 1 162763 162785
## - SHOT_DIST 1 164575 164597
##
## Step: AIC=161416.7
## result ~ TOUCH_TIME + SHOT_DIST + CLOSE_DEF_DIST + PTS_TYPE +
## FINAL_MARGIN + SHOT_CLOCK + DRIBBLES + W + SHOT_NUMBER +
## PERIOD
##
## Df Deviance AIC
## - W 1 161396 161416
## <none> 161395 161417
## + GAME_CLOCK 1 161394 161418
## + LOCATION 1 161394 161418
## - PERIOD 1 161409 161429
## - PTS_TYPE 1 161410 161430
## - SHOT_NUMBER 1 161415 161435
## - DRIBBLES 1 161446 161466
## - FINAL_MARGIN 1 161518 161538
## - TOUCH_TIME 1 161546 161566
## - SHOT_CLOCK 1 161583 161603
## - CLOSE_DEF_DIST 1 162765 162785
## - SHOT_DIST 1 164589 164609
##
## Step: AIC=161415.9
## result ~ TOUCH_TIME + SHOT_DIST + CLOSE_DEF_DIST + PTS_TYPE +
## FINAL_MARGIN + SHOT_CLOCK + DRIBBLES + SHOT_NUMBER + PERIOD
##
## Df Deviance AIC
## <none> 161396 161416
## + W 1 161395 161417
## + GAME_CLOCK 1 161395 161417
## + LOCATION 1 161395 161417
## - PERIOD 1 161411 161429
## - PTS_TYPE 1 161411 161429
## - SHOT_NUMBER 1 161416 161434
## - DRIBBLES 1 161447 161465
## - TOUCH_TIME 1 161547 161565
## - SHOT_CLOCK 1 161583 161601
## - FINAL_MARGIN 1 161814 161832
## - CLOSE_DEF_DIST 1 162767 162785
## - SHOT_DIST 1 164590 164608
## Start: AIC=161419.5
## result ~ TOUCH_TIME + SHOT_DIST + CLOSE_DEF_DIST + PTS_TYPE +
## FINAL_MARGIN + SHOT_CLOCK + GAME_CLOCK + DRIBBLES + LOCATION +
## W + SHOT_NUMBER + PERIOD
## Start: AIC=161419.5
## result ~ TOUCH_TIME + SHOT_DIST + CLOSE_DEF_DIST + PTS_TYPE +
## FINAL_MARGIN + SHOT_CLOCK + GAME_CLOCK + DRIBBLES + LOCATION +
## W + SHOT_NUMBER + PERIOD
##
## Df Deviance AIC
## - LOCATION 1 161394 161418
## - GAME_CLOCK 1 161394 161418
## - W 1 161395 161419
## <none> 161394 161420
## - PERIOD 1 161406 161430
## - PTS_TYPE 1 161409 161433
## - SHOT_NUMBER 1 161410 161434
## - DRIBBLES 1 161444 161468
## - FINAL_MARGIN 1 161517 161541
## - TOUCH_TIME 1 161544 161568
## - SHOT_CLOCK 1 161582 161606
## - CLOSE_DEF_DIST 1 162763 162787
## - SHOT_DIST 1 164574 164598
##
## Step: AIC=161417.9
## result ~ TOUCH_TIME + SHOT_DIST + CLOSE_DEF_DIST + PTS_TYPE +
## FINAL_MARGIN + SHOT_CLOCK + GAME_CLOCK + DRIBBLES + W + SHOT_NUMBER +
## PERIOD
##
## Df Deviance AIC
## - GAME_CLOCK 1 161395 161417
## - W 1 161395 161417
## <none> 161394 161418
## - PERIOD 1 161407 161429
## - PTS_TYPE 1 161409 161431
## - SHOT_NUMBER 1 161410 161432
## - DRIBBLES 1 161445 161467
## - FINAL_MARGIN 1 161517 161539
## - TOUCH_TIME 1 161545 161567
## - SHOT_CLOCK 1 161583 161605
## - CLOSE_DEF_DIST 1 162763 162785
## - SHOT_DIST 1 164575 164597
##
## Step: AIC=161416.7
## result ~ TOUCH_TIME + SHOT_DIST + CLOSE_DEF_DIST + PTS_TYPE +
## FINAL_MARGIN + SHOT_CLOCK + DRIBBLES + W + SHOT_NUMBER +
## PERIOD
##
## Df Deviance AIC
## - W 1 161396 161416
## <none> 161395 161417
## - PERIOD 1 161409 161429
## - PTS_TYPE 1 161410 161430
## - SHOT_NUMBER 1 161415 161435
## - DRIBBLES 1 161446 161466
## - FINAL_MARGIN 1 161518 161538
## - TOUCH_TIME 1 161546 161566
## - SHOT_CLOCK 1 161583 161603
## - CLOSE_DEF_DIST 1 162765 162785
## - SHOT_DIST 1 164589 164609
##
## Step: AIC=161415.9
## result ~ TOUCH_TIME + SHOT_DIST + CLOSE_DEF_DIST + PTS_TYPE +
## FINAL_MARGIN + SHOT_CLOCK + DRIBBLES + SHOT_NUMBER + PERIOD
##
## Df Deviance AIC
## <none> 161396 161416
## - PERIOD 1 161411 161429
## - PTS_TYPE 1 161411 161429
## - SHOT_NUMBER 1 161416 161434
## - DRIBBLES 1 161447 161465
## - TOUCH_TIME 1 161547 161565
## - SHOT_CLOCK 1 161583 161601
## - FINAL_MARGIN 1 161814 161832
## - CLOSE_DEF_DIST 1 162767 162785
## - SHOT_DIST 1 164590 164608