shooting_levels <- NBA_Data |>
  select(fg_pct_home, shooting_level) |>
  unique()
head(shooting_levels, 10)
## # A tibble: 10 × 2
## fg_pct_home shooting_level
## <dbl> <ord>
## 1 0.539 very good
## 2 0.505 very good
## 3 0.519 very good
## 4 0.451 good
## 5 0.528 very good
## 6 0.432 average
## 7 0.5 very good
## 8 0.304 bad
## 9 0.359 bad
## 10 0.436 average
I chose these variables because, while many factors matter in basketball, the team that scores more points wins the game. The home team's score directly reflects in-game performance and is closely tied to the outcome the organization and its fans care most about: wins. Points scored is also shaped by other factors, such as the pace of the game and the defensive strength of the opponent, so it works as a single summary of what is happening in the game without going too deep into any one phase.
Shooting level makes sense as the categorical variable because it ties directly to team performance and has a natural ordering. Its small number of categories, and the fact that it is not derived from points scored, make it a good fit for ANOVA.
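For readers who want to build an ordered category like this, one option is `cut()` with `ordered_result = TRUE`. This is only a sketch: the break points and the two outer labels (`very bad`, `excellent`) are hypothetical, chosen so that the sample rows printed earlier land in the categories shown there. The cutoffs actually used for shooting_level are not shown in this report.

```r
# A few fg_pct_home values taken from the sample rows printed above
fg_pct <- c(0.539, 0.500, 0.451, 0.436, 0.304)

# Hypothetical break points and labels (assumptions, not the original cutoffs)
breaks <- c(0, 0.30, 0.40, 0.45, 0.50, 0.55, 1)
labels <- c("very bad", "bad", "average", "good", "very good", "excellent")

# right = FALSE closes each interval on the left, so 0.500 falls in "very good";
# ordered_result = TRUE returns an ordered factor (<ord>)
shooting_level_demo <- cut(
  fg_pct,
  breaks = breaks,
  labels = labels,
  right = FALSE,
  include.lowest = TRUE,
  ordered_result = TRUE
)

as.character(shooting_level_demo)
is.ordered(shooting_level_demo)
```

An ordered factor keeps the levels in their logical order, which is what arranges the categories from worst to best in plots and summaries.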
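anova_model <- aov(pts_home ~ shooting_level, data = NBA_Data)  # assumed fit; the original call is not shown
summary(anova_model)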
## Df Sum Sq Mean Sq F value Pr(>F)
## shooting_level 5 3195411 639082 6127 <2e-16 ***
## Residuals 39825 4154259 104
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
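With a p-value below 2e-16 and an F statistic of 6127, we reject the null hypothesis that mean home points are equal across all shooting levels; at least one level's mean differs from the others. With nearly 40,000 games, even small differences would be statistically significant, so the size of the gaps matters too. Let's see if that shows up in the visualization.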
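Since a sample this large can make even small differences significant, it helps to look at effect size alongside the p-value. The sketch below recomputes the F statistic and eta-squared (the share of total variance explained by shooting level) from the sums of squares printed in the ANOVA table above. The variable names are my own, and only the printed values are used, so it runs without NBA_Data.

```r
# Values copied from the summary(anova_model) table above
ss_between <- 3195411   # Sum Sq for shooting_level
ss_within  <- 4154259   # Sum Sq for Residuals
df_between <- 5
df_within  <- 39825

# F statistic: between-group mean square over within-group mean square
f_stat <- (ss_between / df_between) / (ss_within / df_within)

# Eta-squared: proportion of total variance explained by the grouping
eta_sq <- ss_between / (ss_between + ss_within)

round(f_stat)
round(eta_sq, 3)
```

The recomputed F matches the 6127 in the table, and eta-squared comes out near 0.43, meaning shooting level accounts for roughly 43% of the variance in home points.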
ggplot(NBA_Data, aes(x = shooting_level, y = pts_home, fill = shooting_level)) +
  geom_boxplot(alpha = 0.8) +
  scale_fill_brewer(palette = "Blues") +
  labs(
    title = "Home Points by Shooting Level",
    x = "Shooting Level",
    y = "Home Points"
  ) +
  theme_minimal()
The clear stair-step pattern across the box plots backs up the ANOVA finding that mean home points differ across shooting levels. This makes logical sense as well: a team that shoots poorly would not be expected to score as many points as one that shoots exceedingly well. Free throws can make up some of the difference at times, which may be what the upper outliers show, but not enough to even out the distributions.
In theory, to score more points a team has to make more shots. My proposed variable for a linear relationship is therefore field goals made by the home team (fgm_home).
lm_model <- lm(pts_home ~ fgm_home, data = NBA_Data)  # fit as given in the Call line of the output
summary(lm_model)
##
## Call:
## lm(formula = pts_home ~ fgm_home, data = NBA_Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.833 -4.924 -0.438 4.592 36.863
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.047833 0.239643 112.9 <2e-16 ***
## fgm_home 1.969632 0.006079 324.0 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.124 on 39829 degrees of freedom
## Multiple R-squared: 0.7249, Adjusted R-squared: 0.7249
## F-statistic: 1.05e+05 on 1 and 39829 DF, p-value: < 2.2e-16
An R-squared of 0.72 is not quite 1, but it still indicates a strong positive linear relationship: field goals made explains about 72% of the variance in home points. Plotting the two variables should show how linear the relationship is.
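To put the R-squared on the more familiar correlation scale, recall that in a regression with a single predictor, R-squared is the square of the Pearson correlation, and the correlation takes the sign of the slope. The short sketch below applies that to the values printed in the model summary; the variable names are my own.

```r
# Values copied from the summary(lm_model) output above
r_squared <- 0.7249
slope     <- 1.969632

# With one predictor, R-squared is the squared Pearson correlation,
# and the correlation has the same sign as the slope
r <- sign(slope) * sqrt(r_squared)
round(r, 3)
```

This gives a correlation of about 0.85 between field goals made and home points.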
ggplot(NBA_Data, aes(x = fgm_home, y = pts_home)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Home Points vs Field Goals Made",
    x = "Field Goals Made",
    y = "Home Points"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
lm_model$coefficients
## (Intercept) fgm_home
## 27.047833 1.969632
This is a fairly good fit for a simple linear regression: the residual standard error is about 7.1 points, and the standard error on the slope is small (0.006). The slope of 1.97 says each additional made field goal is associated with about two more home points. The intercept of 27.05 shows that a good share of points is not accounted for by the count of field goals made, most likely free throws and the extra point on each three-pointer (2s vs 3s). The model could likely be improved by adding more variables, but for a simple linear regression with a single input, I think it would be tough to do better than field goals made.
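To make the coefficients concrete, the sketch below plugs a few field-goal totals into the fitted line using the estimates printed above. The helper `predict_pts_home()` and the example inputs are my own illustration and are not part of the original analysis.

```r
# Coefficients copied from lm_model$coefficients above
intercept <- 27.047833
slope     <- 1.969632

# Hypothetical helper: predicted home points for a given number of made field goals
predict_pts_home <- function(fgm) {
  intercept + slope * fgm
}

# Example inputs (illustrative values, not taken from NBA_Data)
fgm_examples <- c(30, 40, 50)
round(predict_pts_home(fgm_examples), 1)
```

For 40 made field goals this predicts roughly 106 points. With the model object in memory, `predict(lm_model, newdata = data.frame(fgm_home = c(30, 40, 50)))` should give the same values.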