NBA Dataset

Loading in the data:
Filtering out unnecessary games:

Hyp 1: Home Teams Score More In High Attendance Games?

Null Hypothesis: There is no difference in average home team score between “low” and “high” attendance games.
Alternative Hypothesis: There is a difference in average home team score between “low” and “high” attendance games.

Setting Up Attendance Groups

Seeing the Distribution

score_by_attendance_grp
## # A tibble: 2 × 3
##   attendance_group avg_home_team_score cnt_recs
##   <chr>                          <dbl>    <int>
## 1 High                            103.    20199
## 2 Low                             104.    19632

Calculating Sample Size

NBA_Data |>
  group_by(attendance_group) |>
  summarize(sd = sd(pts_home),
            mean = mean(pts_home))
## # A tibble: 2 × 3
##   attendance_group    sd  mean
##   <chr>            <dbl> <dbl>
## 1 High              13.3  103.
## 2 Low               13.8  104.

I have selected an alpha of 0.1 because I am ok with risking a 10% false negative rate. This analysis is not in the scope of business, like costing a team money investing in my findings one way or another, so I think 10% is a good rate to risk. Teams will attempt to fill stadiums regardless of what I conclude - if there is a difference they will still sell tickets and if there is not they will still sell tickets (as many as they possibly can). That being said, I don’t want to go any higher than 10% and risk the integrity of my analysis with a higher figure, even though the business implications are minor.

Similar to alpha, I chose to stick with the default beta figure. There are minimal business consequences associated with this figure, but for the integrity of my analysis I think it makes the most sense to stay true to what I know to be fair.

pwr_result
## 
##      Two-sample t test power calculation 
## 
##               n = 295.4744
##               d = 0.2208475
##       sig.level = 0.1
##           power = 0.85
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

There needs to be at least 295 in each group. With over 19,000 in each group, we exceed the sample size necessary to execute the test.

Running the Test

t_test_np <- t.test(
  pts_home ~ attendance_group,
  data = NBA_Data,
  conf.level = 0.9,
  alternative = "two.sided",
  var.equal = FALSE
)
t_test_np
## 
##  Welch Two Sample t-test
## 
## data:  pts_home by attendance_group
## t = -7.4769, df = 39655, p-value = 7.766e-14
## alternative hypothesis: true difference in means between group High and group Low is not equal to 0
## 90 percent confidence interval:
##  -1.2416723 -0.7938622
## sample estimates:
## mean in group High  mean in group Low 
##           103.3242           104.3419

We reject the null hypothesis and assume that there is a difference in average home team score between “low” and “high” attendance games. The low, near zero, p-value suggests that we should reject the null hypothesis.

Visualization

NBA_Data |>
  ggplot(aes(x = attendance_group, y = pts_home, fill = attendance_group)) +
  geom_boxplot(alpha = 0.7) +
  scale_fill_manual(values = c("Low" = "gray70", "High" = "steelblue")) +
  labs(
    title = "Home Points by Attendance Group",
    x = "Attendance Group",
    y = "Home Points",
    fill = "Group"
  ) +
  theme_minimal()

Hyp 2: Elite Shooters Win More at Home

Null Hypothesis: There is no difference in the percent of games won at home between elite shooting teams and other teams.
Alternative Hypothesis: Elite shooting teams win more games at home than others.

Creating Shooting Levels

NBA_Data <- NBA_Data |>
  mutate(
    fg_pct_home = fgm_home / fga_home,
    shooting_level = case_when(
      fg_pct_home > 0.55 ~ "elite",
      fg_pct_home >= 0.50 ~ "very good",
      fg_pct_home >= 0.45 ~ "good",
      fg_pct_home >= 0.42 ~ "average",
      fg_pct_home >= 0.38 ~ "below average",
      fg_pct_home < 0.38 ~ "bad"
    ),
    shooting_level = factor(
      shooting_level,
      levels = c("bad", "below average", "average", "good", "very good", "elite"),
      ordered = TRUE
    ),
    home_win = ifelse(pts_home > pts_away, 1, 0)
  )

Specifying “Elite” Group

Looking at Proportions

tab
##            
##                 0     1
##   Elite       198  2956
##   Non-elite 15427 21250
prop_test
## 
##  2-sample test for equality of proportions without continuity correction
## 
## data:  c(tab["Elite", "1"], tab["Non-elite", "1"]) out of c(sum(tab["Elite", ]), sum(tab["Non-elite", ]))
## X-squared = 1560, df = 1, p-value < 2.2e-16
## alternative hypothesis: greater
## 95 percent confidence interval:
##  0.3495671 1.0000000
## sample estimates:
##    prop 1    prop 2 
## 0.9372226 0.5793822

I chose this test (the two sample proportion test) because it made the most sense to compare proportions for this particular case study. From this two sample proportion test, we can see that the p-value is, like the previous test, almost zero. This leads us to reject the null hypothesis, with an alpha of 0.05 as the default in the test, and assume that elite shooting teams win more games at home than others. This is also pretty sensible on a realistic level, but it is nice to see that the data aligns with what conventional basketball wisdom would tell you, especially since that has not always been the case in these data dives.

Visualization

win_rates <- nba_fisher |>
  group_by(elite_group) |>
  summarise(
    win_rate = mean(home_win),
    n = n()
  )

win_rates |>
  ggplot(aes(x = elite_group, y = win_rate, fill = elite_group)) +
  geom_col(alpha = 0.8) +
  geom_text(aes(label = scales::percent(win_rate, accuracy = 0.1)),
            vjust = -0.5) +
  scale_fill_manual(values = c("Elite" = "steelblue", "Non-elite" = "gray70")) +
  scale_y_continuous(labels = scales::percent_format(), limits = c(0, 1)) +
  labs(
    title = "Home Win Rate: Elite vs Non-elite Shooting Games",
    x = "Group",
    y = "Home Win Rate",
    fill = "Group"
  ) +
  theme_minimal()

This visualization does a nice job of illustrating just how steep the difference in between the percent of games won between “elite” shooting teams and everyone else - not a surprise that the test rejected the null hypothesis.