Week 8 Data Dive

NBA Dataset

Loading in the data:

Filtering out unnecessary games:

Creating Variables

Shooting Percentage and Shooting Level

shooting_levels <- NBA_Data |>
  select(fg_pct_home, shooting_level) |>
  unique()

head(shooting_levels, 10)

## # A tibble: 10 × 2
##    fg_pct_home shooting_level
##          <dbl> <ord>         
##  1       0.539 very good     
##  2       0.505 very good     
##  3       0.519 very good     
##  4       0.451 good          
##  5       0.528 very good     
##  6       0.432 average       
##  7       0.5   very good     
##  8       0.304 bad           
##  9       0.359 bad           
## 10       0.436 average
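
The step that creates shooting_level is not shown in this write-up, so here is a minimal sketch of one way an ordered factor like it could be built from shooting percentage with base R's cut(). The cut points and the labels "very bad" and "excellent" are assumptions chosen only to be consistent with the ten rows printed above; the real thresholds may differ.

```r
# Hypothetical cut points and labels: the actual thresholds are not given in this report.
breaks <- c(-Inf, 0.30, 0.40, 0.45, 0.50, 0.55, Inf)
labels <- c("very bad", "bad", "average", "good", "very good", "excellent")

# A few shooting percentages taken from the table above
fg_pct_home <- c(0.539, 0.451, 0.432, 0.500, 0.304)

shooting_level <- cut(
  fg_pct_home,
  breaks = breaks,
  labels = labels,
  right = FALSE,          # intervals are [lower, upper), so 0.500 falls in "very good"
  ordered_result = TRUE   # ordered factor, matching the <ord> column type
)

data.frame(fg_pct_home, shooting_level)
```

Using an ordered factor keeps the levels in a sensible order on the x-axis of any plot grouped by shooting level.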

Setting Up the ANOVA

Variables of Interest:

Continuous - Points Home (how many points the home team scored)

Categorical - Shooting Level (how well the home team shot)

I chose these variables because, while many variables matter in basketball, the team that scores more points wins the game. The home team’s score directly reflects in-game performance and is closely tied to the outcome the organization and its fans care most about: wins. Points scored is also influenced by other factors, such as the pace of the game and the defensive prowess of the opponent, so it summarizes what is happening in the game without going too deep into any one phase of it.

Shooting level makes sense as the categorical variable because it ties directly to team performance and has a logical ordering. Its small number of categories, and the fact that it is not derived from points scored, make it a strong choice for the ANOVA.

Null Hypothesis - average points home is equal across all shooting levels

Alternative Hypothesis - average points home differs for at least one shooting level

Running the ANOVA

summary(anova_model)

##                   Df  Sum Sq Mean Sq F value Pr(>F)    
## shooting_level     5 3195411  639082    6127 <2e-16 ***
## Residuals      39825 4154259     104                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With a p-value this small (below 2e-16) and an F value of 6127 on roughly 40,000 games, we reject the null hypothesis: it is very unlikely that the means are all equal, so at least one shooting level has a different mean from the rest. Let’s see if that shows up in the visualization.
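
With a dataset this large, even small differences can come out as significant, so an effect size is a useful companion to the p-value. The sketch below computes eta squared (the share of total variation in home points explained by shooting level) directly from the sums of squares in the ANOVA table. The variable names are illustrative only.

```r
# Sums of squares copied from the ANOVA table above
ss_between <- 3195411   # shooting_level
ss_within  <- 4154259   # Residuals

# Eta squared: between-group sum of squares over total sum of squares
eta_squared <- ss_between / (ss_between + ss_within)
round(eta_squared, 3)
# 0.435
```

By this measure, shooting level accounts for a bit over 43% of the variation in home points, so the result is large in practical terms and not just statistically significant.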

Visualizing the ANOVA

ggplot(NBA_Data, aes(x = shooting_level, y = pts_home, fill = shooting_level)) +
  geom_boxplot(alpha = 0.8) +
  scale_fill_brewer(palette = "Blues") +
  labs(
    title = "Home Points by Shooting Level",
    x = "Shooting Level",
    y = "Home Points"
  ) +
  theme_minimal()

The clear stair-step pattern across the box plots backs up the ANOVA finding that the means are not equal across the shooting levels. This makes logical sense as well: a team that is shooting poorly would not be expected to score the same number of points as one that is shooting exceedingly well. Free throw shooting can supplement a poor shooting night at times, which may account for the upper outliers, but it is not enough to completely even out the distributions.
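
The ANOVA only says that at least one mean differs; it does not say which ones. One common follow-up is Tukey's HSD, available in base R as TukeyHSD(). The sketch below runs it on simulated data, since NBA_Data is not reproduced here; the three levels, group means, and the sim and sim_anova names are assumptions for illustration only.

```r
set.seed(42)

# Simulated stand-in for NBA_Data: three shooting levels with different mean scores
sim <- data.frame(
  shooting_level = factor(
    rep(c("bad", "average", "good"), each = 200),
    levels = c("bad", "average", "good")
  ),
  pts_home = c(rnorm(200, mean = 90,  sd = 10),
               rnorm(200, mean = 100, sd = 10),
               rnorm(200, mean = 110, sd = 10))
)

sim_anova <- aov(pts_home ~ shooting_level, data = sim)

# Pairwise differences in means with family-wise adjusted p-values
tukey <- TukeyHSD(sim_anova)
tukey$shooting_level
```

Each row of the result is one pair of levels, with the estimated difference in means, a confidence interval, and an adjusted p-value. On the real data, the same call on the fitted ANOVA model would show whether every step of the stair is a significant one.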

Linear Relationship to Points Scored

What variable would have a linear relationship to points scored?

In theory, to score more points a team has to make more shots. My proposed variable for a linear relationship is therefore field goals made by the home team.

Is there a linear relationship?

summary(lm_model)

## 
## Call:
## lm(formula = pts_home ~ fgm_home, data = NBA_Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.833  -4.924  -0.438   4.592  36.863 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27.047833   0.239643   112.9   <2e-16 ***
## fgm_home     1.969632   0.006079   324.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.124 on 39829 degrees of freedom
## Multiple R-squared:  0.7249, Adjusted R-squared:  0.7249 
## F-statistic: 1.05e+05 on 1 and 39829 DF,  p-value: < 2.2e-16

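While the R-squared of 0.725 is not quite 1, it still indicates a strong positive linear relationship: field goals made alone accounts for about 72% of the variance in home points, and each additional made field goal is associated with about 1.97 more points. I think the visualization will show just how linear this relationship is.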
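
For a regression with a single predictor and an intercept, R-squared is simply the squared correlation between the two variables. The sketch below checks this on simulated data (the simulation parameters and the sim and sim_lm names are assumptions, not the real dataset) and then converts the reported R-squared into the implied correlation.

```r
set.seed(1)

# Simulated stand-in for NBA_Data
fgm_home <- rpois(500, lambda = 38)
pts_home <- 27 + 2 * fgm_home + rnorm(500, sd = 7)
sim <- data.frame(fgm_home, pts_home)

# Same formula as in the Call line of the summary above
sim_lm <- lm(pts_home ~ fgm_home, data = sim)

r_squared <- summary(sim_lm)$r.squared
r <- cor(sim$fgm_home, sim$pts_home)

# R-squared and squared correlation agree for a one-predictor model
all.equal(r_squared, r^2)

# Correlation implied by the R-squared reported above (about 0.85)
sqrt(0.7249)
```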

Visualizing Points Scored and Field Goals Made

ggplot(NBA_Data, aes(x = fgm_home, y = pts_home)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Home Points vs Field Goals Made",
    x = "Field Goals Made",
    y = "Home Points"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

lm_model$coefficients

## (Intercept)    fgm_home 
##   27.047833    1.969632
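
To make the coefficients concrete, the fitted line can be used to predict home points for a given number of made field goals. The field goal totals below are hypothetical values picked for illustration; the intercept and slope are copied from the output above.

```r
# Coefficients copied from the output above
intercept <- 27.047833
slope     <- 1.969632

# Hypothetical field goal totals chosen for illustration
fgm <- c(35, 40, 45)

# Fitted line: predicted points = intercept + slope * field goals made
predicted_pts <- intercept + slope * fgm
round(predicted_pts, 1)
# 96.0 105.8 115.7
```

With the fitted model object available, predict(lm_model, newdata = data.frame(fgm_home = fgm)) should give the same values.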

This is a pretty good fit for a linear regression, with a residual standard error of about 7.1 points. The intercept of roughly 27 points and the slope of just under 2 tell me that a good number of points are not accounted for by field goals made alone, and the same is true of the vertical spread at any given number of field goals made (most likely due to 2s vs 3s). I definitely think this model could be improved by adding more fields to predict points scored more closely, but for a simple linear regression with a single input variable, I think it would be tough to perform better than field goals (shots) made.
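
As a rough illustration of how additional fields could tighten the model, note that points are fully determined by the box score: every made field goal is worth 2 points, a three-pointer adds 1 more, and each free throw adds 1. The sketch below uses simulated counts, and the column names fg2m_home, fg3m_home, and ftm_home are hypothetical (they may not match the columns in NBA_Data).

```r
set.seed(7)
n <- 500

# Simulated box-score counts; column names other than fgm_home and pts_home are hypothetical
fg2m_home <- rpois(n, lambda = 28)   # two-point field goals made
fg3m_home <- rpois(n, lambda = 10)   # three-point field goals made
ftm_home  <- rpois(n, lambda = 18)   # free throws made

sim <- data.frame(
  fgm_home  = fg2m_home + fg3m_home,
  fg3m_home = fg3m_home,
  ftm_home  = ftm_home,
  pts_home  = 2 * fg2m_home + 3 * fg3m_home + ftm_home
)

simple <- lm(pts_home ~ fgm_home, data = sim)
full   <- lm(pts_home ~ fgm_home + fg3m_home + ftm_home, data = sim)

# The one-variable model leaves variance unexplained
summary(simple)$r.squared

# The three-variable model recovers the scoring rule: intercept 0, then 2, 1, 1
round(coef(full), 6)
max(abs(residuals(full)))
```

In this simulation the fuller model fits exactly because points are an arithmetic identity of the three inputs, which is also why it is less interesting as a prediction exercise than the single-variable model above.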