For this week’s data dive I will be accomplishing two main tasks.
First, I will be taking my linear regression model from last week and adding variables to improve it.
Second, I will be evaluating the improved regression model using six diagnostic plots.
I will break down each step of both parts and analyze my work. For each part I will also include my insights, their significance, and some potential questions.
The original regression model was built with the PTS_per_100 and FG_Percent variables. This new regression model is going to include an additional binary term (Playoffs) and then two more continuous variables (X3p_Percent and TOV_per_100).
Should the Playoffs variable be included?
- I would say yes, as it gives a reference for team quality/level of success. Playoff teams typically have better metrics and are often treated as models for other NBA teams trying to become more successful. Including the Playoffs variable helps distinguish between good and bad teams.
Should the X3p_Percent (3pt Percentage) variable be included?
- I would say yes for this variable as well. The 3pt. percentage functions very similarly to the FG_Percent variable, except that it accounts specifically for 3pt. shots. In an ever-evolving modern NBA that emphasizes the 3pt. shot more than ever, it is important to include 3pt. efficiency in the analysis.
Should the TOV_per_100 (Turnovers) variable be included?
- I would say yes for this variable as well. The number of turnovers a team commits affects how many of its possessions end in a shot, and therefore how many total shots it takes. Turnovers are a very important metric to account for when analyzing team success.
Should the ORB_per_100 (Offensive Rebounds) variable be included?
- We could include this, but I hesitate to do so. Offensive rebounds can be very important for successful teams since they give extra opportunities to score. However, an offensive rebound can only happen after a missed shot, so more offensive rebounds also means more misses. That overlap with shooting efficiency could make the analysis harder to interpret.
After consideration, the new model will predict PTS_per_100 from the FG_Percent, X3p_Percent, TOV_per_100, and Playoffs variables.
lm_model2 <- lm(PTS_per_100 ~ FG_Percent + X3p_Percent + TOV_per_100 + Playoffs,
data = NBA_model)
summary(lm_model2)
##
## Call:
## lm(formula = PTS_per_100 ~ FG_Percent + X3p_Percent + TOV_per_100 +
## Playoffs, data = NBA_model)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.8871 -1.3535 -0.0017  1.2092  7.4107
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)   52.32701    1.77443  29.490  < 2e-16 ***
## FG_Percent   147.23653    3.00717  48.962  < 2e-16 ***
## X3p_Percent   21.38238    1.44271  14.821  < 2e-16 ***
## TOV_per_100   -1.28725    0.03898 -33.020  < 2e-16 ***
## PlayoffsTRUE   0.39624    0.12195   3.249  0.00119 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.011 on 1278 degrees of freedom
## Multiple R-squared: 0.7858, Adjusted R-squared: 0.7851
## F-statistic: 1172 on 4 and 1278 DF, p-value: < 2.2e-16
Results I would like to highlight about the improved regression model:
The R-squared value is 0.7858, which means that the model explains about 78.6% of the variation in scoring (PTS_per_100). This is much higher than the 28.7% that was explained by FG percentage alone in the previous regression model. Together, FG%, 3P%, turnovers, and playoff status account for most of the team-to-team variation in scoring.
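As a quick sanity check, the adjusted R-squared and the F-statistic can be rebuilt from the reported R-squared and degrees of freedom. This is only a sketch using the rounded numbers printed in the summary, so small rounding differences are possible.

```r
# Values copied from the summary() output above
r2     <- 0.7858   # Multiple R-squared
df_mod <- 4        # number of predictors (numerator df of the F-test)
df_res <- 1278     # residual degrees of freedom
n      <- df_res + df_mod + 1   # observations used in the fit

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_res

# Overall F-statistic written in terms of R-squared
f_stat <- (r2 / df_mod) / ((1 - r2) / df_res)

round(adj_r2, 4)  # 0.7851
round(f_stat)     # 1172
```

Both values match the printed output, which also shows that 1283 rows were used in the fit.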
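The FG_Percent variable still has the largest coefficient (147.23653), but raw coefficients cannot be compared directly because the predictors are measured on different scales. Assuming FG_Percent is stored as a proportion between 0 and 1, which the size of the coefficient suggests, a one-percentage-point increase in FG% is associated with about 1.47 more points per 100 possessions, holding the other variables fixed. The t value is a better guide to relative strength, and FG_Percent has the largest one (48.96), so it is still the strongest predictor of scoring. The 3PT% variable has the next largest coefficient at 21.38238 (t = 14.82), so it is also a clear predictor of scoring, which is not very surprising.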
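To put the coefficients on a more comparable footing, they can be converted into the change in PTS_per_100 for one realistic step in each predictor. The step sizes below are my own assumption: one percentage point (0.01) for the shooting percentages, which assumes they are stored as proportions, and one turnover per 100 possessions.

```r
# Coefficients copied from the summary() output
coefs <- c(FG_Percent = 147.23653, X3p_Percent = 21.38238, TOV_per_100 = -1.28725)

# Assumed step for each predictor (hypothetical choice, not from the model)
step <- c(FG_Percent = 0.01, X3p_Percent = 0.01, TOV_per_100 = 1)

# Change in PTS_per_100 associated with one step, other variables held fixed
effect <- coefs * step
round(effect, 2)
# FG_Percent X3p_Percent TOV_per_100
#       1.47        0.21       -1.29
```

On this scale, one point of FG% and one turnover move predicted scoring by similar amounts in opposite directions.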
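Another result that I found intriguing was that turnovers have a strong negative association with scoring (-1.287, t = -33.02). Each additional turnover per 100 possessions goes with about 1.29 fewer points per 100, holding the other variables fixed. This makes a lot of sense, as losing possessions leads to fewer opportunities to score.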
The new regression model is still statistically significant overall: the p-value of the F-test is below 2.2e-16, far under 0.05. Additionally, the residual standard error is 2.011, which means that predictions are typically within about 2 points of the actual PTS_per_100. This shows a fairly accurate prediction.
I highlighted some detailed insights above, but my main takeaway from this new regression model is that offense in the NBA is multi-dimensional and very complex. Accounting for FG%, 3PT%, turnovers, and playoff status gives a much better understanding of a team’s scoring than FG% alone. These insights are significant because they suggest that small improvements in several aspects of the game can add up to meaningful changes in scoring output and, in turn, team success. I would be curious whether there are other variables that are important to account for. Perhaps offensive rebounds, free throw rate, and pace of play also factor into scoring output.
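One way to answer the question about extra variables is a nested-model F-test with `anova()`. The sketch below uses simulated data, not NBA_model: the data frame `sim_nba` and every number in it are made up, and only the column names are borrowed from this analysis.

```r
set.seed(1)
n <- 500

# Hypothetical stand-in data; values are invented for illustration
sim_nba <- data.frame(
  FG_Percent  = rnorm(n, 0.46, 0.02),
  TOV_per_100 = rnorm(n, 14, 1.5),
  ORB_per_100 = rnorm(n, 10, 1.5)
)
sim_nba$PTS_per_100 <- 60 + 147 * sim_nba$FG_Percent -
  1.3 * sim_nba$TOV_per_100 + 0.5 * sim_nba$ORB_per_100 + rnorm(n, sd = 2)

# Smaller model, then the same model with the candidate variable added
base_fit <- lm(PTS_per_100 ~ FG_Percent + TOV_per_100, data = sim_nba)
big_fit  <- update(base_fit, . ~ . + ORB_per_100)

# F-test of whether the added term reduces the residual sum of squares
cmp <- anova(base_fit, big_fit)
cmp
```

A small p-value in the second row would suggest the added variable explains variation that the smaller model misses. The same two-model pattern would apply to lm_model2 and any candidate column in NBA_model.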
In this part of the data dive I am going to go through each of the six diagnostic plots one by one. For each plot I will point out any indications of potential issues or explain how the plot supports the evidence that an assumption is met.
Plot 1: Residuals vs Fitted Values
plot(lm_model2$fitted.values,
lm_model2$residuals,
xlab = "Fitted Values",
ylab = "Residuals",
main = "Residuals vs Fitted")
abline(h = 0)
Since the Residuals vs Fitted plot shows a random scatter around 0, the linearity and constant variance assumptions are reasonably satisfied. This plot supports both of those assumptions.
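A visual read of constant variance can be backed up with a numeric check. The sketch below computes the studentized (Koenker) form of the Breusch-Pagan test by hand in base R: regress the squared residuals on the predictors and compare n times that R-squared to a chi-squared distribution. It runs on simulated data (`sim_nba` is hypothetical), so the result says nothing about the real model.

```r
set.seed(2)
n <- 500

# Hypothetical stand-in data with constant error variance
sim_nba <- data.frame(
  FG_Percent  = rnorm(n, 0.46, 0.02),
  TOV_per_100 = rnorm(n, 14, 1.5)
)
sim_nba$PTS_per_100 <- 60 + 147 * sim_nba$FG_Percent -
  1.3 * sim_nba$TOV_per_100 + rnorm(n, sd = 2)

fit <- lm(PTS_per_100 ~ FG_Percent + TOV_per_100, data = sim_nba)

# Auxiliary regression: do the predictors explain the squared residuals?
sim_nba$sq_resid <- resid(fit)^2
aux <- lm(sq_resid ~ FG_Percent + TOV_per_100, data = sim_nba)

bp_stat <- n * summary(aux)$r.squared   # test statistic
bp_df   <- length(coef(aux)) - 1        # one df per predictor
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(statistic = bp_stat, df = bp_df, p.value = bp_p)
```

A large p-value is consistent with constant variance, while a small one would suggest the spread of the residuals changes with the predictors.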
Plot 2: Residuals vs Each X Variable
plot(NBA_model$FG_Percent, lm_model2$residuals,
xlab = "FG_Percent", ylab = "Residuals",
main = "Residuals vs FG%")
plot(NBA_model$X3p_Percent, lm_model2$residuals,
xlab = "3P%", ylab = "Residuals",
main = "Residuals vs 3P%")
plot(NBA_model$TOV_per_100, lm_model2$residuals,
xlab = "Turnovers", ylab = "Residuals",
main = "Residuals vs Turnovers")
The Residuals vs Each X Variable plots show no clear trend for any of the three continuous variables (FG%, 3PT%, and Turnovers). However, there is some clustering in the 3PT% plot, which could mean that there is a missing variable to account for. Overall, the issues are minimal and the plots mostly support the linearity assumption for each predictor.
Plot 3: Correlation Heatmap
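vars <- NBA_model[, c("PTS_per_100", "FG_Percent", "X3p_Percent", "TOV_per_100")]
cor_matrix <- cor(vars, use = "complete.obs")
# Plot the raw correlations in the original variable order;
# by default heatmap() reorders by clustering and rescales each row
heatmap(cor_matrix, Rowv = NA, Colv = NA, scale = "none", symm = TRUE)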
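The heatmap shows some correlation among the predictors, but in general none of it is strong (apart from the diagonal, where each variable is compared to itself). The PTS_per_100 row shows correlation with the response, which is expected and is not multicollinearity. This means that multicollinearity is not a serious concern and that the visual supports the assumption that the predictors are not strongly related to each other.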
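A heatmap only shows pairwise correlations, so a variance inflation factor (VIF) for each predictor is a useful companion. The sketch below computes VIF by hand as 1 / (1 - R²), where R² comes from regressing each predictor on the others. The data are simulated (`sim_nba` is hypothetical), with FG% and 3P% deliberately made to move together.

```r
set.seed(3)
n <- 500

# Hypothetical stand-in data; X3p_Percent is built to correlate with FG_Percent
fg <- rnorm(n, 0.46, 0.02)
sim_nba <- data.frame(
  FG_Percent  = fg,
  X3p_Percent = 0.35 + 0.5 * (fg - 0.46) + rnorm(n, 0, 0.015),
  TOV_per_100 = rnorm(n, 14, 1.5)
)

preds <- c("FG_Percent", "X3p_Percent", "TOV_per_100")

# VIF for one predictor: regress it on the remaining predictors
vif <- sapply(preds, function(p) {
  others <- setdiff(preds, p)
  r2 <- summary(lm(reformulate(others, response = p), data = sim_nba))$r.squared
  1 / (1 - r2)
})

round(vif, 2)
```

Values near 1 mean a predictor is nearly unrelated to the others. A common rule of thumb treats values above 5 or 10 as a warning sign.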
Plot 4: Residual Histogram
hist(lm_model2$residuals,
main = "Histogram of Residuals",
xlab = "Residuals")
The residual histogram is very bell-shaped and centered near 0, which suggests that the residuals are approximately normal. This visual supports the normality assumption.
Plot 5: QQ Plot
qqnorm(lm_model2$residuals)
qqline(lm_model2$residuals)
The QQ Plot shows the points following the line very closely. The ends do deviate some, but not significantly. This implies there aren’t too many outliers and that normality holds. This visual mostly supports the normality assumption.
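The histogram and QQ plot can be paired with a formal test such as Shapiro-Wilk, available in base R as `shapiro.test()` for samples of 3 to 5000 values. The sketch below uses simulated data (`sim_nba` is hypothetical) with normal errors, so it only illustrates the mechanics.

```r
set.seed(4)
n <- 500

# Hypothetical stand-in data with normally distributed errors
sim_nba <- data.frame(
  FG_Percent  = rnorm(n, 0.46, 0.02),
  TOV_per_100 = rnorm(n, 14, 1.5)
)
sim_nba$PTS_per_100 <- 60 + 147 * sim_nba$FG_Percent -
  1.3 * sim_nba$TOV_per_100 + rnorm(n, sd = 2)

fit <- lm(PTS_per_100 ~ FG_Percent + TOV_per_100, data = sim_nba)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(resid(fit))
sw$statistic   # W close to 1 is consistent with normality
sw$p.value
```

With a sample as large as the one in this analysis, the test can flag very small departures from normality, so it is best read alongside the QQ plot rather than in place of it.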
Plot 6: Cook’s Distance
plot(cooks.distance(lm_model2),
type = "h",
main = "Cook's Distance")
abline(h = 4/nrow(NBA_model), col = "red")
In this Cook’s Distance plot there are multiple points above the red 4/n line, which flags them as potentially influential observations. The 4/n cutoff is a strict rule of thumb, so with a data set this large some points above it are to be expected, but they are still worth checking. This plot raises more concern than the others, although it does not by itself show that the model is invalid.
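A natural follow-up is to list the flagged observations and see how much the coefficients move when they are left out. The sketch below does this on simulated data (`sim_nba` is hypothetical), with one unusual row planted on purpose so that something gets flagged.

```r
set.seed(5)
n <- 500

# Hypothetical stand-in data
sim_nba <- data.frame(
  FG_Percent  = rnorm(n, 0.46, 0.02),
  TOV_per_100 = rnorm(n, 14, 1.5)
)
sim_nba$FG_Percent[1] <- 0.55   # planted high-leverage row
sim_nba$PTS_per_100 <- 60 + 147 * sim_nba$FG_Percent -
  1.3 * sim_nba$TOV_per_100 + rnorm(n, sd = 2)
sim_nba$PTS_per_100[1] <- sim_nba$PTS_per_100[1] + 15   # and a large residual

fit <- lm(PTS_per_100 ~ FG_Percent + TOV_per_100, data = sim_nba)

cd     <- cooks.distance(fit)
cutoff <- 4 / nobs(fit)   # same rule of thumb as the red line
keep   <- cd <= cutoff

which(!keep)   # row numbers of the flagged observations

# Refit without the flagged rows and compare coefficients side by side
refit <- lm(formula(fit), data = sim_nba[keep, ])
cmp <- cbind(all_rows = coef(fit), trimmed = coef(refit))
round(cmp, 3)
```

If the two columns are close, the flagged points are not driving the conclusions. If they differ a lot, those rows deserve a closer look before anything is dropped.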
The insight that I gather from the diagnostic plots is that the regression model has some issues, but the visuals mostly support the linearity, constant variance, normality, and low multicollinearity assumptions. After assessing each diagnostic plot, I would put my confidence in the regression model at moderate to high. This is significant because it means the conclusions drawn from the model are reasonably trustworthy and can provide real insight. My only question for further research is: what is causing so many observations to exceed the line in the Cook’s Distance diagnostic plot?