In theory, a team has to make more shots in order to score more points. My proposed explanatory variable for a linear relationship is therefore field goals made (at home).
summary(lm_model)
##
## Call:
## lm(formula = pts_home ~ fgm_home, data = NBA_Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.833 -4.924 -0.438 4.592 36.863
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.047833 0.239643 112.9 <2e-16 ***
## fgm_home 1.969632 0.006079 324.0 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.124 on 39829 degrees of freedom
## Multiple R-squared: 0.7249, Adjusted R-squared: 0.7249
## F-statistic: 1.05e+05 on 1 and 39829 DF, p-value: < 2.2e-16
While the R-squared of 0.72 is not quite 1, it indicates a strong positive linear relationship, and the slope of about 1.97 points per field goal made is close to what we would expect when most made shots are worth two points. I think plotting this will show just how linear the relationship is.
ggplot(NBA_Data, aes(x = fgm_home, y = pts_home)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Home Points vs Field Goals Made",
    x = "Field Goals Made",
    y = "Home Points"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
lm_model$coefficients
## (Intercept) fgm_home
## 27.047833 1.969632
This is a pretty good fit for a linear regression, with a relatively low residual standard error (about 7.1 points). The intercept of about 27 points tells me that a good number of points are not accounted for by the count of field goals made, most likely because field goals differ in value (2s vs. 3s) and free throws are left out entirely. I hypothesized (in my TA meeting) that this model could be improved by adding fields that track scoring more closely: splitting field goals into two-point and three-point shots, and adding free throws. For this week’s exercise, let’s add those.
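To make that hypothesis concrete, here is a minimal sketch on simulated data (the Poisson rates and the short variable names are my own assumptions, not values taken from NBA_Data). It shows that points can be rewritten as two times total field goals plus a leftover of three-pointers and free throws, which is the part a one-predictor model has to push into its intercept and residuals.

```r
# Simulated box scores; every number here is made up for illustration.
set.seed(1)
n <- 500
fg2m <- rpois(n, 30)   # two-point field goals made
fg3m <- rpois(n, 8)    # three-point field goals made
ftm  <- rpois(n, 18)   # free throws made
fgm  <- fg2m + fg3m    # total field goals made
pts  <- 2 * fg2m + 3 * fg3m + ftm

# The scoring rule rewritten in terms of total field goals.
identity_holds <- all(pts == 2 * fgm + fg3m + ftm)

# What a model using only fgm cannot see: one extra point per three, plus free throws.
leftover <- pts - 2 * fgm

# One-predictor fit, analogous to pts_home ~ fgm_home.
simple_fit <- lm(pts ~ fgm)
coef(simple_fit)
mean(leftover)
```

In this toy setup the fitted slope lands near 2 and the intercept is roughly the average leftover, which is consistent with the slope of 1.97 and the large intercept in the real fit.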
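# Derive two-point makes (all field goals minus threes) and keep only the scoring columns
point_model <- NBA_Data |>
  mutate(fg2m_home = fgm_home - fg3m_home) |>
  select(game_id, fgm_home, fg2m_home, fg3m_home, ftm_home, pts_home)
# Drop games missing any of the inputs or the response
point_model <- point_model |>
  filter(!is.na(fg2m_home),
         !is.na(fg3m_home),
         !is.na(ftm_home),
         !is.na(pts_home))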
head(point_model)
## # A tibble: 6 × 6
## game_id fgm_home fg2m_home fg3m_home ftm_home pts_home
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0047900043 48 48 0 13 109
## 2 0047900044 48 48 0 8 104
## 3 0047900046 41 41 0 23 105
## 4 0047900047 41 41 0 26 108
## 5 0047900048 47 47 0 13 107
## 6 0048000048 41 41 0 16 98
model <- lm(pts_home ~ fg2m_home + fg3m_home + ftm_home, data = point_model)
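Each variable counts a different kind of scoring, so I am not very concerned about multicollinearity within the model. For example, no shot can be both a two-pointer and a free throw, or a three-pointer and a two-pointer, so no made shot is counted in more than one variable. The counts can still be correlated across games, which I check with a correlation heatmap further down, but I think the data is partitioned cleanly enough to avoid serious issues.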
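One way to put a number on the multicollinearity question is a variance inflation factor (VIF) for each predictor. The sketch below computes it by hand in base R on simulated, independent counts; `vif_by_hand` is a hypothetical helper written for this example, and the simulated rates are assumptions, so the real columns in point_model may well give larger values.

```r
# Hypothetical helper: VIF for each column of a data frame of predictors,
# computed as 1 / (1 - R^2) from regressing that column on all the others.
vif_by_hand <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    fit <- lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Simulated stand-in for the three predictors; values are made up.
set.seed(42)
n <- 1000
sim <- data.frame(
  fg2m_home = rpois(n, 30),
  fg3m_home = rpois(n, 8),
  ftm_home  = rpois(n, 18)
)

vifs <- vif_by_hand(sim)
vifs
```

A VIF of 1 means a predictor is not explained at all by the others; a common rule of thumb treats values above roughly 5 to 10 as a warning sign. Passing `select(point_model, fg2m_home, fg3m_home, ftm_home)` to the same helper would give the figures for the actual data.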
gg_resfitted(model) +
geom_smooth(se=FALSE)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the lindia package.
## Please report the issue at <https://github.com/yeukyul/lindia/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
This is extremely promising, as there is no fanning effect and no visible trend in the residuals. In fact, the residuals are essentially zero, with nearly every prediction matching the actual value, which is a strong indication that the model fits well.
plots <- gg_resX(model, plot.all = FALSE)
## Warning: `fortify(<lm>)` was deprecated in ggplot2 3.6.0.
## ℹ Please use `broom::augment(<lm>)` instead.
## ℹ The deprecated feature was likely used in the lindia package.
## Please report the issue at <https://github.com/yeukyul/lindia/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
plots
## $fg2m_home
##
## $fg3m_home
##
## $ftm_home
Consistent with the residuals-versus-fitted plot, the model appears to be doing a great job, with minimal residuals across the range of each predictor. There are no clear patterns against any of the model inputs, which further bolsters confidence in its output.
ggcorr(select(point_model,
fg2m_home,
fg3m_home,
ftm_home), label = TRUE) +
labs(title='Correlation Heatmap') # we can add ggplot elements
In this case, there is a fairly strong negative relationship between two-point and three-point shots made (which makes sense, since we derived the number of two-point shots by subtracting the number of three-point shots from total field goals). However, I don’t have too many fears about this correlation affecting the analysis, given the results and the domain knowledge addressed earlier: the two shot types are mutually exclusive, so each made basket is counted in exactly one variable.
gg_reshist(model)
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
When we look at the histogram of residuals, we want to see a distribution that is roughly normal. There is a slight concern here, as the distribution is a narrow spike rather than a bell shape. However, the spike sits at 0 with very little spread, so the model may simply fit too well for the residuals to show any visible distribution. I don’t see this as a big concern; if anything, it is a compliment to a well-built model.
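To get a feel for how small "very little spread" can be, here is a rough sketch on simulated data where the response is built exactly from the three predictors (the rates and short variable names are assumptions for illustration). In that situation the only residuals left should be floating-point rounding noise.

```r
# Simulated counts; pts is an exact linear function of the predictors.
set.seed(7)
n <- 1000
fg2m <- rpois(n, 30)
fg3m <- rpois(n, 8)
ftm  <- rpois(n, 18)
pts  <- 2 * fg2m + 3 * fg3m + ftm

exact_fit <- lm(pts ~ fg2m + fg3m + ftm)

# Largest residual in absolute value; expected to be tiny compared with
# a single point on the scoreboard.
max_abs_resid <- max(abs(resid(exact_fit)))
max_abs_resid < 1e-8
```

If the same comparison holds for `resid(model)`, a histogram with the default 30 bins has essentially nothing but rounding error to display, so its shape says little about normality.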
gg_qqplot(model)
This is once again extremely reassuring, with the points falling essentially along a straight line. The model shows very small residuals, which further backs up the hypothesis that it accurately predicts points scored at home.
summary(model)
##
## Call:
## lm(formula = pts_home ~ fg2m_home + fg3m_home + ftm_home, data = point_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.815e-11 -6.000e-14 -2.000e-14 2.000e-14 9.708e-10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.444e-12 1.905e-13 3.382e+01 <2e-16 ***
## fg2m_home 2.000e+00 4.221e-15 4.739e+14 <2e-16 ***
## fg3m_home 3.000e+00 6.837e-15 4.388e+14 <2e-16 ***
## ftm_home 1.000e+00 3.749e-15 2.668e+14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.891e-12 on 39827 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.024e+29 on 3 and 39827 DF, p-value: < 2.2e-16
While not a plot, I think it is important to check what the plots are telling us against the cold hard numbers. The R-squared of 1 is strongly reflected in the plots, and it aligns with conventional wisdom as well. There are only a few ways to score points in an NBA game (two-point shots, three-point shots, and free throws), so when we account for all of them, we should in theory account for every point in a game. The coefficients of 2, 3, and 1 match the point value of each shot type exactly, and the intercept and residuals are effectively zero (on the order of 1e-12 to 1e-10). This exercise, while a good way to expose myself to linear modeling, was also a great way to validate my data set and verify that the inputs were correct, which we have done. As evidenced by the coefficients reflecting the correct point values, the model and the data are validated, and the model can confidently “predict” how many points a team will score based on the shots made in each category.
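Since the goal here is really data validation, the same check can also be sketched without a regression at all: compute the expected points from the scoring rule and flag any game where the recorded total disagrees. The toy data frame below is invented for illustration (including its `game_id` values), with one row deliberately off by a point.

```r
# Toy stand-in for point_model; all values are made up.
toy <- data.frame(
  game_id   = c("g1", "g2", "g3"),
  fg2m_home = c(48, 30, 41),
  fg3m_home = c(0, 10, 5),
  ftm_home  = c(13, 20, 16),
  pts_home  = c(109, 110, 114),  # third row is deliberately wrong by one point
  stringsAsFactors = FALSE
)

# Points implied by the scoring rule: 2 per two, 3 per three, 1 per free throw.
expected <- with(toy, 2 * fg2m_home + 3 * fg3m_home + ftm_home)

# Rows where the recorded points do not match the rule.
mismatches <- toy[toy$pts_home != expected, ]
mismatches$game_id
```

Here only the third game is flagged. Run against point_model, an empty result would confirm row by row what the regression coefficients suggest in aggregate, and a non-empty one would point to the exact games to investigate.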