NBA Dataset

Loading in the data
Filtering out unnecessary games for this analysis

Identifying the Question: How do 3‑point shooting, paint scoring, and turnovers influence total points scored at home?

Creating the model

model <- lm(
  pts_home ~ fg3m_home + pts_paint_home + tov_home,
  data = NBA_Data
)

summary(model)
## 
## Call:
## lm(formula = pts_home ~ fg3m_home + pts_paint_home + tov_home, 
##     data = NBA_Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.686  -5.818  -0.301   5.481  42.855 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    64.364455   0.430752  149.42   <2e-16 ***
## fg3m_home       1.810308   0.016982  106.60   <2e-16 ***
## pts_paint_home  0.647570   0.006905   93.79   <2e-16 ***
## tov_home       -0.240682   0.018320  -13.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.502 on 13872 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6058 
## F-statistic:  7109 on 3 and 13872 DF,  p-value: < 2.2e-16

Model Explanation

This is a linear regression model that predicts pts_home (points scored by the home team in an NBA game) from three predictors: fg3m_home (three‑point shots made by the home team), pts_paint_home (points scored in the key/paint by the home team), and tov_home (turnovers committed by the home team).

Interpreting the Coefficients

Intercept (64.36)

This is the expected number of points a home team would score if:

  • they made 0 threes

  • scored 0 points in the paint

  • committed 0 turnovers

This baseline absorbs every source of scoring the model leaves out, mainly free throws and two‑point shots made outside the paint (mid‑range). It should not be read as a realistic prediction: no team finishes a game with zero threes, zero paint points, and zero turnovers, so the intercept is an extrapolation far outside the observed data.
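To make the intercept's role concrete, the fitted equation can be evaluated by hand from the estimates in the summary output. The sketch below hard-codes those estimates; the helper name predict_home_pts and the example stat line (12 threes, 48 paint points, 14 turnovers) are made up for illustration and are not taken from the data.

```r
# Coefficient estimates copied from the summary(model) output
coefs <- c(
  intercept      = 64.364455,
  fg3m_home      = 1.810308,
  pts_paint_home = 0.647570,
  tov_home       = -0.240682
)

# Hypothetical helper: evaluates the fitted linear equation by hand
predict_home_pts <- function(fg3m, pts_paint, tov) {
  unname(
    coefs["intercept"] +
      coefs["fg3m_home"] * fg3m +
      coefs["pts_paint_home"] * pts_paint +
      coefs["tov_home"] * tov
  )
}

baseline <- predict_home_pts(0, 0, 0)     # all predictors at zero: the intercept
example  <- predict_home_pts(12, 48, 14)  # illustrative, made-up stat line

round(c(baseline = baseline, example = example), 1)
```

The made-up stat line works out to about 113.8 points, against a baseline of 64.4. With the fitted model object available, predict(model, newdata = ...) on a one-row data frame with the same three columns should give the same kind of result without retyping the coefficients.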

Three Point Shots (1.81)

This is the largest coefficient, but the predictors are on different scales (made shots versus points), so its size alone does not make threes the most important input to the model.

Each additional made 3‑pointer increases expected home points by about 1.81 points, holding everything else constant.

Why isn’t it exactly 3 points?
Because:

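  • a made three uses a possession that would otherwise have produced some points anyway (teams that make more threes also attempt more of them, leaving fewer chances for mid‑range twos and free throws, neither of which is in the model)

  • pace and possessions vary from game to game

  • paint scoring is held constant, so the coefficient only compares games with the same paint points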

Still, 1.81 is a big effect — threes matter.
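Because a made three is a shot and pts_paint_home is measured in points, it can help to put both coefficients on a per-point scale before comparing them. The short calculation below is only arithmetic on the reported estimates; the variable names are illustrative.

```r
# Estimates copied from the summary(model) output
b_fg3m  <- 1.810308  # expected points per made three
b_paint <- 0.647570  # expected points per paint point

per_three_point <- b_fg3m / 3  # expected total points per point scored on threes
shortfall       <- 3 - b_fg3m  # gap between a three's face value and its estimate

round(c(
  per_three_point = per_three_point,
  per_paint_point = b_paint,
  shortfall       = shortfall
), 3)
```

On this scale a point from a three (about 0.60) and a point in the paint (about 0.65) carry similar weight. The shortfall of roughly 1.19 points per made three is in the range of what a typical possession yields, which is consistent with the possession-replacement explanation, though the model itself cannot confirm that reading.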

Paint Scoring (0.648)

Each additional point scored in the paint increases total points by about 0.65 points, controlling for threes and turnovers.

Why less than 1.0?
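Because paint scoring is correlated with factors that are not in the model, such as:

  • pace

  • offensive rebounds

  • free throws

  • 2‑point attempts outside the paint

Since those factors are left out, the paint coefficient absorbs their relationship with paint scoring. A possession that ends in a paint basket cannot also end in a mid‑range shot or a trip to the line, so each paint point partly replaces points that would have come from elsewhere, and the estimated contribution falls well below 1.0.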

Turnovers (-0.241)

The negative sign is exactly what you’d expect.

Each turnover reduces expected home points by about 0.24 points, holding other variables constant.

Turnovers kill possessions.
Even a quarter‑point per turnover adds up fast — 15 turnovers cost you ~3.6 points.

Model Fit

R² = 0.606

The model explains about 61% of the variation in home scoring.

Residual Standard Error = 8.502

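This is the estimated standard deviation of the residuals, not the average miss: a typical prediction is off by roughly 8.5 points, and if the residuals were approximately normal, about two‑thirds of games would fall within ±8.5 points of the predicted score.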

All predictors are highly significant.

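P-values below 2e-16 mean that each predictor’s association with scoring is very unlikely to be due to chance. With nearly 14,000 games, though, even small effects reach significance, so the size of each coefficient is more informative than its p-value.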
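The fit statistics quoted here can be pulled out of a fitted lm object directly instead of being read off the printed summary. The sketch below uses a small simulated data set as a stand-in for NBA_Data (the values are made up, only the column names match), so its numbers will not reproduce the ones reported above.

```r
set.seed(1)
n <- 500

# Simulated stand-in for NBA_Data: made-up values, NOT the real games
sim <- data.frame(
  fg3m_home      = rpois(n, 12),
  pts_paint_home = 2 * rpois(n, 24),
  tov_home       = rpois(n, 14)
)
sim$pts_home <- 64 + 1.8 * sim$fg3m_home + 0.65 * sim$pts_paint_home -
  0.24 * sim$tov_home + rnorm(n, sd = 8.5)

fit <- lm(pts_home ~ fg3m_home + pts_paint_home + tov_home, data = sim)

r2  <- summary(fit)$r.squared       # share of variance explained
rse <- sigma(fit)                   # residual standard error
mae <- mean(abs(residuals(fit)))    # average absolute miss, for comparison
ci  <- confint(fit, level = 0.95)   # 95% confidence intervals for coefficients

round(c(r_squared = r2, rse = rse, mae = mae), 3)
ci
```

The mean absolute error always comes out smaller than the residual standard error, which is why the two should not be used interchangeably. The confidence intervals give a sense of effect size and precision that the p-values alone do not.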

Plotting the Model

par(mfrow = c(2,2))
plot(model)

par(mfrow = c(1,1))

Residuals vs Fitted

The Residuals vs Fitted plot shows that the residuals are generally centered around zero, indicating that the model does not systematically over‑ or under‑predict home scoring. However, the spread of residuals increases at higher fitted values, producing a mild funnel shape. This suggests heteroskedasticity, meaning the model’s prediction error grows for higher‑scoring games. This is common in NBA data, as high‑scoring games tend to be more variable. There is no strong nonlinear pattern, so the linearity assumption appears reasonable. A few large residuals likely correspond to blowouts or overtime games, but they do not indicate a structural problem with the model.

Q-Q Plot

The Q–Q plot shows noticeable deviations from the straight reference line, particularly in the tails. This indicates that the residuals are not perfectly normally distributed. The center of the distribution aligns reasonably well with the theoretical quantiles, but the curvature at the extremes suggests heavy‑tailed behavior. In practical terms, this means the model struggles more with unusually high‑ or low‑scoring games, which is expected in NBA data where blowouts, overtime games, and pace variability create extreme outcomes. Normality is not required for unbiased coefficient estimates, and with nearly 14,000 games the t- and F-tests remain approximately valid, but prediction intervals for individual games will be less reliable in the tails, so further modeling refinements could be explored.

Scale Location Plot

The Scale–Location plot shows a clear upward trend in the red smoothing line, indicating that the variance of the residuals increases as fitted values increase. This suggests heteroskedasticity: the model’s prediction error is smaller for average‑scoring games and larger for high‑scoring games. This pattern is expected in NBA data, where blowouts and pace variability create more extreme outcomes. It violates the constant‑variance assumption of linear regression. The coefficient estimates remain unbiased and the effect is not severe enough to invalidate the model, but the reported standard errors may be somewhat off, so alternative approaches (e.g., heteroskedasticity‑robust standard errors, weighted least squares, or including pace‑related predictors) could improve the analysis.
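The visual impression of heteroskedasticity can be backed up with a formal check. A minimal sketch of the studentized (Koenker) Breusch-Pagan test in base R follows: regress the squared residuals on the predictors and compare n times the R² of that auxiliary regression with a chi-squared distribution. The data are simulated with noise that grows with the expected score, purely as an assumption for illustration; they are not the NBA games.

```r
set.seed(2)
n <- 1000

# Simulated stand-in for NBA_Data: made-up values, NOT the real games
sim <- data.frame(
  fg3m_home      = rpois(n, 12),
  pts_paint_home = 2 * rpois(n, 24),
  tov_home       = rpois(n, 14)
)
mu <- 64 + 1.8 * sim$fg3m_home + 0.65 * sim$pts_paint_home -
  0.24 * sim$tov_home
# Noise standard deviation grows with the expected score (heteroskedastic)
sim$pts_home <- mu + rnorm(n, sd = 0.08 * mu)

fit <- lm(pts_home ~ fg3m_home + pts_paint_home + tov_home, data = sim)

# Auxiliary regression: squared residuals on the same predictors
sim$resid_sq <- residuals(fit)^2
aux <- lm(resid_sq ~ fg3m_home + pts_paint_home + tov_home, data = sim)

bp_stat <- n * summary(aux)$r.squared   # test statistic: n * R^2
bp_df   <- length(coef(aux)) - 1        # number of predictors, no intercept
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(statistic = bp_stat, df = bp_df, p_value = bp_p)
```

A small p-value points to non-constant variance. If the lmtest package is installed, bptest() applied to the fitted model should perform the same studentized test in one call.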

Residuals vs Leverage

The Residuals vs Leverage plot shows that most observations have low leverage, indicating that the majority of games fall within typical ranges of 3‑point makes, paint scoring, and turnovers. A small number of games appear as potential influential points, as indicated by their labels, but none exceed the Cook’s distance threshold. These games likely represent blowouts or unusually paced matchups. While they have some influence on the model, they do not distort the regression results. Overall, the model appears stable, with no single observation exerting undue influence.
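The quantities behind this plot can also be inspected numerically. The sketch below computes leverage and Cook's distance for a model fitted to simulated stand-in data (made-up values, not the real games) and flags observations using two common rule-of-thumb cutoffs, 4/n for Cook's distance and 2p/n for leverage. These cutoffs are conventions chosen here for illustration, not thresholds used in the analysis above.

```r
set.seed(3)
n <- 500

# Simulated stand-in for NBA_Data: made-up values, NOT the real games
sim <- data.frame(
  fg3m_home      = rpois(n, 12),
  pts_paint_home = 2 * rpois(n, 24),
  tov_home       = rpois(n, 14)
)
sim$pts_home <- 64 + 1.8 * sim$fg3m_home + 0.65 * sim$pts_paint_home -
  0.24 * sim$tov_home + rnorm(n, sd = 8.5)

fit <- lm(pts_home ~ fg3m_home + pts_paint_home + tov_home, data = sim)

cooks <- cooks.distance(fit)  # influence of each game on the fitted coefficients
lev   <- hatvalues(fit)       # leverage: how unusual each game's predictors are
p     <- length(coef(fit))    # number of estimated coefficients

flag_cook <- which(cooks > 4 / n)    # rule-of-thumb influence cutoff
flag_lev  <- which(lev > 2 * p / n)  # rule-of-thumb leverage cutoff

head(sort(cooks, decreasing = TRUE), 5)  # the five most influential games
```

The dashed contours that plot(model) draws are at Cook's distances of 0.5 and 1 by default, which is far stricter than 4/n, so a game can be flagged by the rule of thumb here and still sit comfortably inside the contours.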

Variance Inflation Factor

library(car)  # vif() is assumed here to come from the car package
vif(model)
##      fg3m_home pts_paint_home       tov_home 
##       1.003835       1.002541       1.004191

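VIF values this close to 1 mean each predictor is almost uncorrelated with the other two, so multicollinearity is not a concern and the coefficient standard errors are not inflated by overlap among the predictors.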
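For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes it by hand in base R on simulated, independently drawn predictors (made-up values standing in for the real columns), so the results should land near 1 for the same reason as above.

```r
set.seed(4)
n <- 500

# Simulated stand-in predictors: made-up values, NOT the real games
sim <- data.frame(
  fg3m_home      = rpois(n, 12),
  pts_paint_home = 2 * rpois(n, 24),
  tov_home       = rpois(n, 14)
)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the rest
vif_manual <- sapply(names(sim), function(v) {
  others <- setdiff(names(sim), v)
  r2 <- summary(lm(reformulate(others, response = v), data = sim))$r.squared
  1 / (1 - r2)
})

round(vif_manual, 3)
```

A VIF of 1 is the floor (R² of zero); values above roughly 5 to 10 are the range usually treated as a warning sign.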

Model Insights

This model shows that both 3‑point shooting and paint scoring are strong, positive predictors of total points scored at home. Turnovers have a negative effect, reducing scoring by eliminating possessions. The model explains about 61% of the variation in scoring, which is strong for a three‑predictor model, though the remaining 39% leaves clear room for improvement (for example, free throws and pace are not included). Diagnostics reveal heteroskedasticity, heavy‑tailed residuals, and a few higher‑leverage games, but no single observation distorts the fit and there are no severe violations of linear model assumptions. These results reinforce common basketball intuition: efficient scoring inside and outside, combined with ball security, drives offensive output.