Week 9 Data Dive

NBA Dataset

Loading in the data:

Filtering out unnecessary games:

Revisiting Last Week’s Model

What variable would have a linear relationship to points scored?

In theory, a team has to make more shots in order to score more points. My proposed variable for a linear relationship with points scored is therefore field goals made by the home team (fgm_home).

Is there a linear relationship?

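# Refit last week's model (formula taken from the Call shown in the output)
lm_model <- lm(pts_home ~ fgm_home, data = NBA_Data)
summary(lm_model)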

## 
## Call:
## lm(formula = pts_home ~ fgm_home, data = NBA_Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.833  -4.924  -0.438   4.592  36.863 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27.047833   0.239643   112.9   <2e-16 ***
## fgm_home     1.969632   0.006079   324.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.124 on 39829 degrees of freedom
## Multiple R-squared:  0.7249, Adjusted R-squared:  0.7249 
## F-statistic: 1.05e+05 on 1 and 39829 DF,  p-value: < 2.2e-16

An R-squared of about 0.72 is well short of 1, but it still indicates a strong positive linear relationship: field goals made explain roughly 72% of the variance in home points. Plotting the two variables should show just how linear this relationship is.

Visualizing Points Scored and Field Goals Made

ggplot(NBA_Data, aes(x = fgm_home, y = pts_home)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Home Points vs Field Goals Made",
    x = "Field Goals Made",
    y = "Home Points"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

lm_model$coefficients

## (Intercept)    fgm_home 
##   27.047833    1.969632

This is a pretty good fit for a one-variable linear regression, with a residual standard error of about 7.1 points. The coefficients tell me that a good number of points are not accounted for by field goals made: the model predicts about 27 points before any field goal is counted (most likely free throws), and a single slope of about 1.97 points per field goal cannot tell 2s from 3s. I hypothesized (in my TA meeting) that this model could be improved by adding fields that track points scored more closely, namely splitting field goals into 2 point and 3 point shots and adding free throws. For this week’s exercise, let’s add those.
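To make the coefficients concrete, here is a small worked example of the prediction the one-variable model would give. The coefficient values are copied from the output above; the 40 made field goals is a hypothetical input chosen only for illustration.

```r
# Coefficients copied from the lm_model output above
intercept <- 27.047833
slope     <- 1.969632

# Hypothetical game: the home team makes 40 field goals (assumed value)
fgm <- 40

# Prediction from the fitted line: intercept + slope * fgm
predicted_pts <- intercept + slope * fgm
predicted_pts
```

Each additional made field goal moves the prediction by the same 1.97 points whether it was a 2 or a 3, which is the limitation the new model is meant to address.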

Assembling the New Model

point_model <- NBA_Data |>
  mutate(fg2m_home = fgm_home - fg3m_home) |>
  select(game_id,fgm_home,fg2m_home,fg3m_home,ftm_home,pts_home)

point_model <- point_model |>
  filter(!is.na(fg2m_home),
         !is.na(fg3m_home),
         !is.na(ftm_home),
         !is.na(pts_home))

head(point_model)

## # A tibble: 6 × 6
##   game_id    fgm_home fg2m_home fg3m_home ftm_home pts_home
##   <chr>         <dbl>     <dbl>     <dbl>    <dbl>    <dbl>
## 1 0047900043       48        48         0       13      109
## 2 0047900044       48        48         0        8      104
## 3 0047900046       41        41         0       23      105
## 4 0047900047       41        41         0       26      108
## 5 0047900048       47        47         0       13      107
## 6 0048000048       41        41         0       16       98

model <- lm(pts_home ~ fg2m_home + fg3m_home + ftm_home, data = point_model)
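Before evaluating the model, it can help to check by hand what the regression should recover. The sketch below rebuilds the six rows printed by head(point_model) above and tests the scoring rule directly (2 points per two, 3 per three, 1 per free throw). The data frame name first_games and the column pts_check are made up for this illustration.

```r
# The six rows shown in head(point_model), typed in by hand
first_games <- data.frame(
  fg2m_home = c(48, 48, 41, 41, 47, 41),
  fg3m_home = c(0, 0, 0, 0, 0, 0),
  ftm_home  = c(13, 8, 23, 26, 13, 16),
  pts_home  = c(109, 104, 105, 108, 107, 98)
)

# Points implied by the scoring rule
first_games$pts_check <- 2 * first_games$fg2m_home +
  3 * first_games$fg3m_home +
  first_games$ftm_home

# TRUE when every row's recorded points match the scoring rule
rule_holds <- all(first_games$pts_check == first_games$pts_home)
rule_holds
```

If the rule holds for every game in the data, the regression has an exact linear relationship to find, so coefficients near 2, 3, and 1 are the expected result.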

Multicollinearity Concerns?

Each variable covers a different way of scoring, so I am not very concerned about multicollinearity within the model. For example, no shot can be both a 2 pointer and a free throw, or both a three pointer and a two pointer, so no point is counted twice. Mutually exclusive categories can still have correlated counts, though, so I check the correlations between the predictors in the heat map later on.
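One common way to put a number on multicollinearity is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R²). The sketch below does this in base R on simulated data, not on the NBA data; the helper vif_manual and the Poisson means are assumptions made for the illustration.

```r
set.seed(1)
n <- 500

# Simulated, independent predictor counts (hypothetical values, not NBA_Data)
toy <- data.frame(
  fg2m_home = rpois(n, 30),
  fg3m_home = rpois(n, 10),
  ftm_home  = rpois(n, 18)
)

# Hypothetical helper: VIF for each column, from regressing it on the others
vif_manual <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(toy)
vifs
```

A VIF close to 1 means a predictor is barely explained by the others. Running the same helper on the three predictor columns of point_model would show how much the real counts overlap.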

Evaluating the Model

Residuals vs. Fitted Values

gg_resfitted(model) +
  geom_smooth(se=FALSE)

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

This is extremely promising, as there is no fanning effect or changing variance in the residuals. In fact, the residuals are essentially zero, with nearly every prediction matching the actual value, a strong indication that the model fits well.

Residuals vs. X Values

plots <- gg_resX(model, plot.all = FALSE)

plots

## $fg2m_home

## 
## $fg3m_home

## 
## $ftm_home

Consistent with the residuals vs. fitted values plot, the model appears to fit the values very well, with minimal residuals across each X variable. There are no clear patterns across any of the inputs to the model, which further bolsters confidence in its output.

Correlation Heatmap

ggcorr(select(point_model,
              fg2m_home,
              fg3m_home,
              ftm_home), label = TRUE) +
  labs(title='Correlation Heatmap')  # we can add ggplot elements

In this case, there is a fairly strong negative relationship between two point and three point shots made. That makes sense considering we derived the number of two point shots by subtracting three point shots from total field goals, so every made shot that counts as a three is one that does not count as a two. The two events are mutually exclusive, but the negative correlation shows that their counts are not independent. Even so, I don’t have many concerns about these two variables for the analysis, considering the results and the domain knowledge addressed earlier: each made shot falls into exactly one category with its own fixed point value.
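The negative correlation can be reproduced with a small simulation. This is only a sketch on made-up numbers: total field goals are drawn from a narrow range, a share of them are threes, and twos are derived by subtraction, as in the mutate() step earlier. The range 36:44 and the 30% three-point share are assumptions, not values from the NBA data.

```r
set.seed(42)
n <- 1000

# Hypothetical totals: made field goals vary only a little from game to game
fgm <- sample(36:44, n, replace = TRUE)

# Hypothetical split: each made field goal is a three with probability 0.3
fg3m <- rbinom(n, size = fgm, prob = 0.3)

# Twos are derived from the total, mirroring fg2m_home = fgm_home - fg3m_home
fg2m <- fgm - fg3m

# Correlation between the derived twos and the threes
r <- cor(fg2m, fg3m)
r
```

When the total is fairly stable, more threes leave fewer twos, so the two counts move in opposite directions even though no single shot belongs to both categories.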

Residual Histogram

gg_reshist(model)

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

When we look at the histogram of residuals, we want to see a distribution that is roughly normal. There is a slight concern here, as this distribution is far more concentrated than a normal curve. However, it is concentrated around 0 with very little spread, meaning the model might just fit too well for a distribution of residuals to be visible. I don’t see this as too big of a concern, and actually take it as a compliment to a well-built model.

QQ Plot

gg_qqplot(model)

This is once again extremely reassuring, with the points falling on basically a straight line. The model shows very small residuals, which further backs up the hypothesis that the model closely predicts points scored at home.

Summary Statistics

summary(model)

## 
## Call:
## lm(formula = pts_home ~ fg2m_home + fg3m_home + ftm_home, data = point_model)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -9.815e-11 -6.000e-14 -2.000e-14  2.000e-14  9.708e-10 
## 
## Coefficients:
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) 6.444e-12  1.905e-13 3.382e+01   <2e-16 ***
## fg2m_home   2.000e+00  4.221e-15 4.739e+14   <2e-16 ***
## fg3m_home   3.000e+00  6.837e-15 4.388e+14   <2e-16 ***
## ftm_home    1.000e+00  3.749e-15 2.668e+14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.891e-12 on 39827 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.024e+29 on 3 and 39827 DF,  p-value: < 2.2e-16

While not a plot, I think it is important to verify what the plots are telling us against the cold hard data. The R-squared of 1 is strongly reflected in the plots, and it aligns with conventional wisdom as well. There are only three ways to score points in an NBA game (two point shots, three point shots, and free throws), so when we account for all of them, we should account for every point in a game. The coefficients of 2, 3, and 1 match those point values, the intercept is effectively zero, and the residuals are smaller than 1e-9 in absolute value, which is the scale of floating-point rounding error. This exercise, while a good way to expose myself to linear modeling, was also a great way to validate my data set and verify that the inputs were correct, which we have done. With the coefficients reflecting the correct point assignments, the model and the data are validated, and the model can confidently “predict” how many points a team will score based on the shots made in each category.
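Because the fit is exact up to rounding, the same validation can be written as a pass/fail check instead of being read off the summary. The sketch below uses simulated counts whose points follow the scoring rule by construction; toy and toy_model are hypothetical stand-ins for point_model and model, and the 1e-8 tolerance is an assumed threshold.

```r
set.seed(7)
n <- 200

# Simulated counts (hypothetical values, not NBA_Data)
toy <- data.frame(
  fg2m_home = rpois(n, 30),
  fg3m_home = rpois(n, 10),
  ftm_home  = rpois(n, 18)
)

# Points follow the scoring rule exactly
toy$pts_home <- 2 * toy$fg2m_home + 3 * toy$fg3m_home + toy$ftm_home

toy_model <- lm(pts_home ~ fg2m_home + fg3m_home + ftm_home, data = toy)

# Largest residual in absolute value; expected to be rounding error only
max_abs_resid <- max(abs(residuals(toy_model)))

# Compare the estimates to (intercept 0, then 2, 3, 1) within numeric tolerance
coef_ok <- isTRUE(all.equal(unname(coef(toy_model)), c(0, 2, 3, 1)))

max_abs_resid < 1e-8
coef_ok
```

Applied to the real model, a failed check would point to games where the recorded points do not equal the sum of the scoring categories, which is a data-quality issue and not a modeling one.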