title: “NBA Curry shot analysis”

author: “Omar R. Lanneau”

date: “2026-05-31”

output: html_document

Introduction

This project analyzes Stephen Curry’s shot efficiency during his 2014-15 MVP season using multiple linear regression in R.

Load Data

shots <- read.csv("shot_logs copy.csv")
steph <- shots[shots$player_name == "stephen curry", ]
nrow(steph)

## [1] 968

The dataset used in this analysis is the NBA Shot Logs from the 2014-15 season, sourced from Kaggle. It contains detailed tracking data on every shot attempt across the league including shot distance, closest defender, defender distance, shot clock remaining, dribbles taken, and touch time among other variables.

From the full dataset, I filtered down to Stephen Curry alone. The dataset records 968 of his shot attempts from the 2014-15 season, in which he won MVP and the Warriors won the title. That gave me a high-volume, single-player sample for examining which conditions best predict whether Curry makes or misses a given shot. By testing shot outcome against several variables, I could estimate which factors his shooting efficiency is most sensitive to.

EDA - Shot Types

table(steph$PTS_TYPE, steph$SHOT_RESULT)

##    
##     made missed
##   2  280    232
##   3  190    266

Breaking down Curry’s 968 shots by type reveals an interesting split. He attempted more 2-pointers than 3-pointers on the season, making 280 and missing 232 from inside the arc, while making 190 and missing 266 from three point range.
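The make rate implied by these counts can be read off by converting each row of the table to proportions. This is a minimal sketch that re-enters the four counts by hand rather than reading the CSV; the object names `shot_counts`, `make_rates`, and `attempts` are my own and not part of the original analysis.

```r
# Counts copied from the table above: rows are shot type, columns are result.
shot_counts <- matrix(
  c(280, 232,
    190, 266),
  nrow = 2, byrow = TRUE,
  dimnames = list(PTS_TYPE = c("2", "3"), SHOT_RESULT = c("made", "missed"))
)

# Row-wise proportions: share of made and missed shots within each shot type.
make_rates <- prop.table(shot_counts, margin = 1)
round(make_rates, 3)

# Attempts per shot type (row totals).
attempts <- rowSums(shot_counts)
attempts
```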

At first glance this might suggest Curry should stick to 2-pointers given the higher make rate. However, raw counts alone can be misleading, and a made 3-pointer is worth 50% more than a made 2-pointer. To compare the value of each shot type, field goal percentage and points generated per attempt must be analyzed, which I cover in more detail in the following sections.

FG% by Shot Type

tapply(steph$FGM, steph$PTS_TYPE, mean)

##         2         3 
## 0.5468750 0.4166667

Curry shot 54.7% on 2-pointers and 41.7% on 3-pointers. Points per attempt, however, tells a different story: although his 3-point percentage is noticeably lower, the extra point awarded for each make more than compensates for the lower conversion rate. This is a large part of why modern NBA offenses have shifted away from mid-range 2-pointers. The math favors the 3, and Curry was among the players who demonstrated it most convincingly during this era.

Points Per Attempt

tapply(steph$PTS, steph$PTS_TYPE, mean)

##       2       3 
## 1.09375 1.25000

The numbers confirm what the FG% section suggested. Every 2-point attempt generated an average of 1.09 points while every 3-point attempt generated 1.25 points — a difference of 0.16 points per shot. Across hundreds of attempts in a season that gap compounds into a significant scoring advantage.
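The link between FG% and points per attempt is simple arithmetic: expected points equal make probability times the points awarded. The sketch below reproduces the two averages from the FG% values reported earlier and also computes the break-even 3-point percentage, which is my own addition and not something the original analysis calculated. All variable names are illustrative.

```r
# Make rates taken from the FG% output above.
fg_pct <- c("2" = 0.5468750, "3" = 0.4166667)
point_value <- c("2" = 2, "3" = 3)

# Expected points per attempt = make probability x points awarded.
pts_per_attempt <- fg_pct * point_value
round(pts_per_attempt, 3)

# Break-even 3-point percentage: the rate at which a 3 matches the 2-point return.
break_even_3pt <- pts_per_attempt[["2"]] / 3
round(break_even_3pt, 3)
```

Under these numbers, any 3-point percentage above the break-even value yields more points per attempt than Curry's 2-point shots did.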

To understand this statistic better myself, I thought of the numbers as two slot machines, where one pays out an average of $1.09 per pull and the other pays an average of $1.25. Even if the second machine fails more often, a player maximizing expected return would choose it. This is the core principle behind the 3-point revolution in the NBA, and Curry’s 2014-15 season is one of the clearest statistical demonstrations of why being a high-efficiency 3-point scorer is so effective.

Defender Analysis

# Attempts and FG% allowed per closest defender (both ordered by defender name)
defender_attempts <- table(steph$CLOSEST_DEFENDER)
defender_fgpct <- tapply(steph$FGM, steph$CLOSEST_DEFENDER, mean)

defender_df <- data.frame(
  defender = names(defender_fgpct),
  attempts = as.numeric(defender_attempts),
  fg_pct = as.numeric(defender_fgpct)
)

# Keep defenders with at least 5 attempts, then sort from lowest to highest FG% allowed
defender_df <- defender_df[defender_df$attempts >= 5, ]
print(defender_df[order(defender_df$fg_pct), ], row.names = FALSE)

##                  defender attempts    fg_pct
##            Farmar, Jordan        7 0.0000000
##                 Gay, Rudy        5 0.2000000
##             McLemore, Ben        8 0.2500000
##              Ibaka, Serge       18 0.2777778
##           Morrow, Anthony        7 0.2857143
##            Anderson, Ryan        9 0.3333333
##               Diaw, Boris        6 0.3333333
##             Jack, Jarrett        9 0.3333333
##             Rose, Derrick       21 0.3333333
##              Teague, Jeff        6 0.3333333
##           Young, Thaddeus       12 0.3333333
##          Barea, Jose Juan       11 0.3636364
##               Exum, Dante       11 0.3636364
##             Price, Ronnie       13 0.3846154
##             Adams, Steven        5 0.4000000
##             Bledsoe, Eric       15 0.4000000
##           Jordan, DeAndre        5 0.4000000
##              Kanter, Enes        5 0.4000000
##             Sloan, Donald        5 0.4000000
##               Smith, J.R.        5 0.4000000
##            Thomas, Isaiah       10 0.4000000
##              LaVine, Zach       17 0.4117647
##              Conley, Mike       12 0.4166667
##             Brewer, Corey        7 0.4285714
##          Collison, Darren       21 0.4285714
##         Cousins, DeMarcus        7 0.4285714
##           Napier, Shabazz       14 0.4285714
##              Rubio, Ricky        7 0.4285714
##                Wall, John        7 0.4285714
##  Carter-Williams, Michael        9 0.4444444
##              Cole, Norris        9 0.4444444
##        Westbrook, Russell        9 0.4444444
##              Parker, Tony       11 0.4545455
##             Walker, Kemba       23 0.4782609
##            Canaan, Isaiah        6 0.5000000
##           Chandler, Tyson        6 0.5000000
##              Hill, Jordan       12 0.5000000
##             Irving, Kyrie       16 0.5000000
##           Lillard, Damian       10 0.5000000
##          Vasquez, Greivis        6 0.5000000
##            Bradley, Avery       13 0.5384615
##             Holiday, Jrue        9 0.5555556
##         Beverley, Patrick       14 0.5714286
##                Lawson, Ty        7 0.5714286
##               Paul, Chris        7 0.5714286
##           Williams, Deron        7 0.5714286
##                Asik, Omer        5 0.6000000
##               Ingles, Joe        5 0.6000000
##               Burke, Trey       16 0.6250000
##           Jackson, Reggie       11 0.6363636
##         Jennings, Brandon        6 0.6666667
##       Motiejunas, Donatas        6 0.6666667
##           Oladipo, Victor       15 0.6666667
##            Plumlee, Miles        6 0.6666667
##             Zeller, Tyler        6 0.6666667
##               Lin, Jeremy       13 0.6923077
##             Harden, James       14 0.7142857
##             Millsap, Paul        5 0.8000000
##            Payton, Elfrid        5 0.8000000
##           Wiggins, Andrew        5 0.8000000
##           Sessions, Ramon        6 0.8333333
##             Harris, Devin        8 0.8750000

To fairly assess which defenders gave Curry the most trouble, I calculated field goal percentage allowed rather than raw makes and misses. Raw counts can be misleading: a defender who guarded Curry 20 times will usually have allowed more makes than one who guarded him twice, even if the latter gave up easier looks.

To keep the analysis meaningful, I filtered to defenders who guarded Curry on at least 5 attempts. Jordan Farmar had the lowest percentage allowed, holding Curry to 0% on 7 attempts. On the other end, Devin Harris allowed Curry to shoot 87.5% on 8 attempts, which made Curry seem nearly unguardable.

It is worth noting that even with the minimum-attempts filter, sample sizes remain small for individual matchups. No single defender guarded Curry enough times in one season to draw definitive conclusions. Individual matchups are telling, but because of the small sample size, I left defender identity out of my regressions and used defender distance instead.
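One way to see how little 5 to 10 attempts can tell us is to put a confidence interval around a single matchup. The sketch below uses base R's `binom.test()` on the two extreme rows of the table. It assumes each attempt is an independent trial with a constant make probability, which is my simplification and not something the original analysis tested.

```r
# Makes and attempts copied from the defender table above.
farmar <- binom.test(x = 0, n = 7)   # 0 makes on 7 attempts
harris <- binom.test(x = 7, n = 8)   # 7 makes on 8 attempts

# 95% Clopper-Pearson (exact) confidence intervals for the true make rate.
round(farmar$conf.int, 2)
round(harris$conf.int, 2)
```

Both intervals span tens of percentage points, which is the quantitative version of the caution above about small samples.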

Model 1 - Simple Regression

Field Goals Made tested against Shot Distance

model1 <- lm(FGM ~ SHOT_DIST, data = steph)
summary(model1)

## 
## Call:
## lm(formula = FGM ~ SHOT_DIST, data = steph)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6745 -0.4213 -0.3567  0.5072  0.6759 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.68241    0.03360   20.31  < 2e-16 ***
## SHOT_DIST   -0.01127    0.00170   -6.63 5.58e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4893 on 966 degrees of freedom
## Multiple R-squared:  0.04352,    Adjusted R-squared:  0.04253 
## F-statistic: 43.95 on 1 and 966 DF,  p-value: 5.583e-11

We start with the simplest possible question: does shot distance alone predict whether Curry makes a shot? The model confirms that distance is highly significant (p < 0.001), with each additional foot reducing his predicted make probability by about 1.1 percentage points. At the rim he is predicted to make 68% of shots, dropping to roughly 40% at three-point range, closely matching his actual 41.7% from three.
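The predictions quoted above come straight from the fitted line. As a worked example, the sketch below plugs a few distances into the Model 1 coefficients. The helper `predict_make()` is a name I made up for illustration, and 23.75 feet is the NBA three-point distance above the break.

```r
# Coefficients copied from the Model 1 summary above.
intercept <- 0.68241
slope_per_foot <- -0.01127

# Predicted make probability under the linear model at a given distance (feet).
predict_make <- function(shot_dist) intercept + slope_per_foot * shot_dist

# 0 ft (at the rim), 23.75 ft (three-point line above the break), 25 ft.
distances <- c(0, 23.75, 25)
round(predict_make(distances), 3)
```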

However, the R-squared of about 4.4% (4.3% adjusted) tells an important story: distance alone explains very little of the variation in whether Curry’s shots go in. There is clearly much more going on, which motivates adding more predictors in Model 2.

Model 2 - Multiple Regression

Field Goals Made tested against Shot Distance, Closest Defender Distance, and Shot Clock

model2 <- lm(FGM ~ SHOT_DIST + CLOSE_DEF_DIST + SHOT_CLOCK, data = steph)
summary(model2)

## 
## Call:
## lm(formula = FGM ~ SHOT_DIST + CLOSE_DEF_DIST + SHOT_CLOCK, data = steph)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8210 -0.4343 -0.3040  0.4730  0.7107 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.595457   0.060869   9.783  < 2e-16 ***
## SHOT_DIST      -0.014004   0.001883  -7.435 2.35e-13 ***
## CLOSE_DEF_DIST  0.025379   0.006305   4.025 6.15e-05 ***
## SHOT_CLOCK      0.001695   0.003290   0.515    0.607    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4861 on 937 degrees of freedom
##   (27 observations deleted due to missingness)
## Multiple R-squared:  0.05856,    Adjusted R-squared:  0.05554 
## F-statistic: 19.43 on 3 and 937 DF,  p-value: 3.173e-12

Adding defender distance and shot clock raises the adjusted R-squared to 5.6%. Both shot distance and defender distance are highly significant: every extra foot of space increases Curry’s predicted make probability by about 2.5 percentage points, confirming that even the greatest shooter alive benefits from open looks.
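The Model 2 summary above reports 27 observations deleted due to missingness, so the model is fit on 941 of the 968 shots. A minimal sketch of how `lm()` drops incomplete rows is shown below. The data frame `toy` is made up for illustration, and placing the missing values in `SHOT_CLOCK` is an assumption on my part, since the summary does not say which column has them.

```r
# Made-up data for illustration only; the NA pattern is an assumption.
toy <- data.frame(
  FGM        = c(1, 0, 1, 0, 1, 0, 0, 1),
  SHOT_DIST  = c(3, 25, 18, 26, 5, 24, 22, 10),
  SHOT_CLOCK = c(12, NA, 8, 20, NA, 15, 3, 10)
)

# Count missing values per column.
colSums(is.na(toy))

# lm() uses na.action = na.omit by default, so incomplete rows are dropped.
fit <- lm(FGM ~ SHOT_DIST + SHOT_CLOCK, data = toy)
nobs(fit)                  # rows actually used in the fit
sum(complete.cases(toy))   # same count, computed directly
```

Running `colSums(is.na(steph))` on the real data would show which columns account for the dropped rows.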

Most surprisingly, shot clock showed no significant effect (p = 0.607). After accounting for shot distance and defender distance, the time remaining on the shot clock had no detectable linear effect on his efficiency, a finding that is consistent with composure and shot readiness under pressure, though a non-significant coefficient cannot establish it.

Train Test Split

set.seed(42)
train_index <- sample(1:nrow(steph), 0.8 * nrow(steph))
train <- steph[train_index, ]
test <- steph[-train_index, ]

To evaluate the model honestly, I split Curry’s 968 shots into a training set (80%) and a test set (20%). The model is built exclusively on the training data and then evaluated on the test set, that is, on shots it has never seen before.
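As a quick check on the split sizes, the sketch below repeats the same sampling step on row indices alone, without the data. It assumes only that there are 968 rows; `n_train` and `test_index` are illustrative names. The resulting 774 training rows agree with the final model summary, which uses 753 rows and drops 21 for missingness.

```r
set.seed(42)
n <- 968                                   # number of Curry shots in the dataset
n_train <- floor(0.8 * n)                  # 0.8 * 968 = 774.4; floor() makes the rounding explicit
train_index <- sample(seq_len(n), n_train)
test_index <- setdiff(seq_len(n), train_index)

length(train_index)                         # training rows
length(test_index)                          # held-out rows
length(intersect(train_index, test_index))  # 0 when the two sets are disjoint
```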

This is a critical step in any machine learning or statistical modeling project. Without a train/test split a model could simply memorize the data it was trained on and appear to perform well without actually learning any meaningful patterns. Testing on unseen shots provides a true measure of how well findings generalize beyond the data we used to build the model.

Model Final

model_final <- lm(FGM ~ SHOT_DIST + CLOSE_DEF_DIST + SHOT_CLOCK, data = train)
summary(model_final)

## 
## Call:
## lm(formula = FGM ~ SHOT_DIST + CLOSE_DEF_DIST + SHOT_CLOCK, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8308 -0.4251 -0.3029  0.4776  0.7333 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.5984726  0.0674386   8.874  < 2e-16 ***
## SHOT_DIST      -0.0141583  0.0020976  -6.750 2.97e-11 ***
## CLOSE_DEF_DIST  0.0272263  0.0068073   4.000 6.98e-05 ***
## SHOT_CLOCK      0.0003207  0.0036766   0.087    0.931    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4855 on 749 degrees of freedom
##   (21 observations deleted due to missingness)
## Multiple R-squared:  0.06074,    Adjusted R-squared:  0.05698 
## F-statistic: 16.15 on 3 and 749 DF,  p-value: 3.535e-10

My final model is trained on 80% of Curry’s shots using the same three predictors as Model 2: shot distance, defender distance, and shot clock remaining.

The results are consistent with what I found in Model 2. Shot distance remains the strongest predictor (p < 0.001), followed by defender distance (p < 0.001). Shot clock again shows no significant effect (p = 0.931), again offering no evidence that time pressure changes Curry’s efficiency.
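Another way to read these p-values is through confidence intervals. The sketch below rebuilds approximate 95% intervals from the estimates and standard errors printed in the summary, using the t distribution with 749 residual degrees of freedom. The data frame `coefs` is re-entered by hand for illustration; on the fitted object itself, `confint(model_final)` would give the same intervals directly.

```r
# Estimates and standard errors copied from the final model summary above.
coefs <- data.frame(
  term     = c("SHOT_DIST", "CLOSE_DEF_DIST", "SHOT_CLOCK"),
  estimate = c(-0.0141583, 0.0272263, 0.0003207),
  std_err  = c(0.0020976, 0.0068073, 0.0036766)
)

# 95% interval: estimate +/- t critical value (749 residual df) x standard error.
t_crit <- qt(0.975, df = 749)
coefs$lower <- coefs$estimate - t_crit * coefs$std_err
coefs$upper <- coefs$estimate + t_crit * coefs$std_err

# An interval that contains zero corresponds to a non-significant term at the 5% level.
coefs$contains_zero <- coefs$lower < 0 & coefs$upper > 0
coefs
```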

The adjusted R-squared of 5.7% is modest but honest. In basketball, shot outcomes are inherently unpredictable. What I learned through this dataset and testing is that even a good model cannot tell whether any single shot will go in. What this model does capture is the underlying trend: distance and spacing are the two factors that consistently and significantly shape Curry’s shooting efficiency, regardless of which model iteration we test.

Diagnostics

par(mar = c(4,4,2,1))
plot(model_final)

Four diagnostic plots were examined to assess whether our model assumptions were met. The Residuals vs Fitted plot revealed two distinct bands rather than a random scatter — a direct consequence of FGM being a binary outcome, much like how a shot in basketball has only two results: it either goes in or it doesn’t. There is no partial credit.
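The two bands can be reproduced on simulated data, which shows that they come from the 0/1 outcome and not from anything specific to Curry. The sketch below uses made-up shots with an assumed linear make probability; none of the numbers come from the real dataset.

```r
set.seed(1)
# Simulated data for illustration only (not Curry's shots).
shot_dist <- runif(200, min = 0, max = 30)
made <- rbinom(200, size = 1, prob = 0.7 - 0.01 * shot_dist)

fit <- lm(made ~ shot_dist)
res <- residuals(fit)
fitted_vals <- fitted(fit)

# Every residual equals either (1 - fitted) for a make or (0 - fitted) for a miss,
# which is what produces the two parallel bands in Residuals vs Fitted.
all.equal(unname(res[made == 1]), unname(1 - fitted_vals[made == 1]))
all.equal(unname(res[made == 0]), unname(0 - fitted_vals[made == 0]))
```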

The Q-Q plot showed an S-curve deviation from normality at both ends. This is not a statement about Curry relative to an average shooter: with a 0/1 outcome, each residual can only be 1 minus the fitted value (a make) or 0 minus the fitted value (a miss), so the residuals cluster into two groups and cannot follow a normal distribution.

The Scale-Location plot confirmed uneven error variance across predictions. This too follows from the binary outcome: the variance of a make-or-miss result is p(1 - p), which is largest when the predicted probability p is near 0.5 and shrinks as p moves toward 0 or 1. In that sense the model is least certain on shots it rates as close to a coin flip.

These patterns are not a coding error; they are expected when applying linear regression to a binary outcome variable. A made shot is always 1 and a missed shot is always 0, leaving the model working against a fundamental constraint, similar to trying to predict a coin flip using only context clues. Logistic regression, which is specifically designed for binary outcomes, would be the more statistically appropriate technique and is a natural next step for this analysis.

Outlier Investigation

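# The data has no X column, so this filter matches no rows; steph["14110", ] would look the shot up by row name.
steph[steph$X == 14110, ]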

##  [1] GAME_ID                    MATCHUP                   
##  [3] LOCATION                   W                         
##  [5] FINAL_MARGIN               SHOT_NUMBER               
##  [7] PERIOD                     GAME_CLOCK                
##  [9] SHOT_CLOCK                 DRIBBLES                  
## [11] TOUCH_TIME                 SHOT_DIST                 
## [13] PTS_TYPE                   SHOT_RESULT               
## [15] CLOSEST_DEFENDER           CLOSEST_DEFENDER_PLAYER_ID
## [17] CLOSE_DEF_DIST             FGM                       
## [19] PTS                        player_name               
## [21] player_id                 
## <0 rows> (or 0-length row.names)

The Residuals vs Leverage plot flagged row 14110 as the single most influential observation in Curry’s entire season. Investigating this shot reveals exactly why the model found it so unusual.
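The empty result printed above is consistent with the data frame having no `X` column among its 21 columns. By default `plot()` on an `lm` object labels points with their row names, so 14110 is presumably a row name carried over from the full shot log. The sketch below shows the difference on a small made-up data frame; `shots_toy` and `steph_toy` are hypothetical stand-ins, not the real data.

```r
# Made-up stand-in for the full shot log; values are illustrative only.
shots_toy <- data.frame(
  player_name = c("a", "stephen curry", "b", "stephen curry", "stephen curry"),
  SHOT_DIST   = c(10.2, 24.1, 3.5, 6.0, 26.3)
)

# Subsetting keeps the original row names ("2", "4", "5"), not 1, 2, 3.
steph_toy <- shots_toy[shots_toy$player_name == "stephen curry", ]
rownames(steph_toy)

# There is no X column, so a filter on it selects nothing.
nrow(steph_toy[steph_toy$X == 4, ])

# Looking up by row name (a character string) returns the flagged row.
steph_toy["4", ]
```

On the real data the equivalent lookup would be `steph["14110", ]`, assuming the row names were left at their defaults when the CSV was read.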

The shot occurred on February 27th, 2015 against the Toronto Raptors with DeMar DeRozan listed as the nearest defender — despite being 28.7 feet away from Curry at the time of the shot. Combined with 20.4 seconds remaining on the shot clock and no meaningful defensive pressure, this was one of the most wide open looks Curry had all season.

What makes it truly unusual is that despite all that space and time, Curry opted for a 2-pointer rather than stepping back for three. This combination of maximum space, no shot clock pressure, and a short-distance shot was so rare in his season profile that the model had virtually no similar shots to reference, which is what gives the observation such high leverage.

RMSE Evaluation

pred_final <- predict(model_final, newdata = test)
rmse <- sqrt(mean((test$FGM - pred_final)^2, na.rm = TRUE))
rmse

## [1] 0.4898136

Evaluated on the unseen test set, the final model produced an RMSE of 0.49, meaning the typical (root-mean-square) prediction error was 0.49 on a 0 to 1 scale. While this may seem high, it reflects the inherent unpredictability of individual shot outcomes more than a flaw specific to this model. Even the most sophisticated NBA analytics systems cannot reliably predict whether any single shot will go in.
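To put 0.49 in context, it helps to compare it with a baseline that ignores shot conditions entirely and predicts the same make probability for every shot. The sketch below computes that baseline from the overall make rate in the shot-type table. It is only a rough reference, because it uses all 968 shots, and the test set's own make rate, which the analysis does not report, may differ.

```r
# Overall make rate from the shot-type table above: (280 + 190) makes on 968 attempts.
p_make <- (280 + 190) / 968

# RMSE of a baseline that predicts the same probability p for every shot,
# evaluated on outcomes with make rate p: sqrt(p * (1 - p)).
baseline_rmse <- sqrt(p_make * (1 - p_make))
round(baseline_rmse, 4)

# Reported test-set RMSE of the final model, for comparison.
model_rmse <- 0.4898136
baseline_rmse - model_rmse
```

The gap between the two numbers is small, which matches the modest R-squared values reported for the models.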

The value of this model lies not in predicting individual shots but in quantifying the relationships between shot conditions and efficiency at scale, indicating that distance and defensive spacing are the two factors that most consistently shape Curry’s shooting success across a full season.
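The Diagnostics section named logistic regression as the natural next step. A minimal sketch of what that might look like is given below on simulated data, so the coefficients and outputs are not results for Curry. The variable names and the true effect sizes are assumptions chosen only to make the example run.

```r
set.seed(1)
# Simulated shots for illustration only; the coefficients below are made up.
n <- 500
shot_dist <- runif(n, 0, 30)
def_dist  <- runif(n, 0, 10)
true_prob <- plogis(0.6 - 0.06 * shot_dist + 0.10 * def_dist)
made      <- rbinom(n, size = 1, prob = true_prob)

# Logistic regression: models the log-odds of a make as a linear function.
fit_logit <- glm(made ~ shot_dist + def_dist, family = binomial)

# Predicted probabilities always fall strictly between 0 and 1.
p_hat <- predict(fit_logit, type = "response")
range(p_hat)

# exp(coefficient) is the multiplicative change in the odds of a make per extra foot.
exp(coef(fit_logit))
```

On the real data the analogous call would be `glm(FGM ~ SHOT_DIST + CLOSE_DEF_DIST + SHOT_CLOCK, family = binomial, data = train)`, with coefficients read as changes in log-odds instead of percentage points.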