score_by_attendance_grp
## # A tibble: 2 × 3
## attendance_group avg_home_team_score cnt_recs
## <chr> <dbl> <int>
## 1 High 103. 20199
## 2 Low 104. 19632
NBA_Data |>
group_by(attendance_group) |>
summarize(sd = sd(pts_home),
mean = mean(pts_home))
## # A tibble: 2 × 3
## attendance_group sd mean
## <chr> <dbl> <dbl>
## 1 High 13.3 103.
## 2 Low 13.8 104.
I have selected an alpha of 0.1 because I am ok with risking a 10% false negative rate. This analysis is not in the scope of business, like costing a team money investing in my findings one way or another, so I think 10% is a good rate to risk. Teams will attempt to fill stadiums regardless of what I conclude - if there is a difference they will still sell tickets and if there is not they will still sell tickets (as many as they possibly can). That being said, I don’t want to go any higher than 10% and risk the integrity of my analysis with a higher figure, even though the business implications are minor.
Similar to alpha, I chose to stick with the default beta figure. There are minimal business consequences associated with this figure, but for the integrity of my analysis I think it makes the most sense to stay true to what I know to be fair.
pwr_result
##
## Two-sample t test power calculation
##
## n = 295.4744
## d = 0.2208475
## sig.level = 0.1
## power = 0.85
## alternative = two.sided
##
## NOTE: n is number in *each* group
There needs to be at least 295 in each group. With over 19,000 in each group, we exceed the sample size necessary to execute the test.
t_test_np <- t.test(
pts_home ~ attendance_group,
data = NBA_Data,
conf.level = 0.9,
alternative = "two.sided",
var.equal = FALSE
)
t_test_np
##
## Welch Two Sample t-test
##
## data: pts_home by attendance_group
## t = -7.4769, df = 39655, p-value = 7.766e-14
## alternative hypothesis: true difference in means between group High and group Low is not equal to 0
## 90 percent confidence interval:
## -1.2416723 -0.7938622
## sample estimates:
## mean in group High mean in group Low
## 103.3242 104.3419
We reject the null hypothesis and assume that there is a difference in average home team score between “low” and “high” attendance games. The low, near zero, p-value suggests that we should reject the null hypothesis.
NBA_Data |>
ggplot(aes(x = attendance_group, y = pts_home, fill = attendance_group)) +
geom_boxplot(alpha = 0.7) +
scale_fill_manual(values = c("Low" = "gray70", "High" = "steelblue")) +
labs(
title = "Home Points by Attendance Group",
x = "Attendance Group",
y = "Home Points",
fill = "Group"
) +
theme_minimal()
NBA_Data <- NBA_Data |>
mutate(
fg_pct_home = fgm_home / fga_home,
shooting_level = case_when(
fg_pct_home > 0.55 ~ "elite",
fg_pct_home >= 0.50 ~ "very good",
fg_pct_home >= 0.45 ~ "good",
fg_pct_home >= 0.42 ~ "average",
fg_pct_home >= 0.38 ~ "below average",
fg_pct_home < 0.38 ~ "bad"
),
shooting_level = factor(
shooting_level,
levels = c("bad", "below average", "average", "good", "very good", "elite"),
ordered = TRUE
),
home_win = ifelse(pts_home > pts_away, 1, 0)
)
tab
##
## 0 1
## Elite 198 2956
## Non-elite 15427 21250
prop_test
##
## 2-sample test for equality of proportions without continuity correction
##
## data: c(tab["Elite", "1"], tab["Non-elite", "1"]) out of c(sum(tab["Elite", ]), sum(tab["Non-elite", ]))
## X-squared = 1560, df = 1, p-value < 2.2e-16
## alternative hypothesis: greater
## 95 percent confidence interval:
## 0.3495671 1.0000000
## sample estimates:
## prop 1 prop 2
## 0.9372226 0.5793822
I chose this test (the two sample proportion test) because it made the most sense to compare proportions for this particular case study. From this two sample proportion test, we can see that the p-value is, like the previous test, almost zero. This leads us to reject the null hypothesis, with an alpha of 0.05 as the default in the test, and assume that elite shooting teams win more games at home than others. This is also pretty sensible on a realistic level, but it is nice to see that the data aligns with what conventional basketball wisdom would tell you, especially since that has not always been the case in these data dives.
win_rates <- nba_fisher |>
group_by(elite_group) |>
summarise(
win_rate = mean(home_win),
n = n()
)
win_rates |>
ggplot(aes(x = elite_group, y = win_rate, fill = elite_group)) +
geom_col(alpha = 0.8) +
geom_text(aes(label = scales::percent(win_rate, accuracy = 0.1)),
vjust = -0.5) +
scale_fill_manual(values = c("Elite" = "steelblue", "Non-elite" = "gray70")) +
scale_y_continuous(labels = scales::percent_format(), limits = c(0, 1)) +
labs(
title = "Home Win Rate: Elite vs Non-elite Shooting Games",
x = "Group",
y = "Home Win Rate",
fill = "Group"
) +
theme_minimal()
This visualization does a nice job of illustrating just how steep the difference in between the percent of games won between “elite” shooting teams and everyone else - not a surprise that the test rejected the null hypothesis.