week7datadive

Hypothesis 1 (Neyman-Pearson Framework)

# Simplifying positions into guards (PG & SG) and non-guards (SF, PF, & C)
# Dropping multi-team total rows (2TM, 3TM, 4TM) and rows with missing FG% or position
df_h1 <-
  df |>
  filter(!team %in% c("2TM", "3TM", "4TM")) |>
  filter(!is.na(fg_percent), !is.na(pos)) |>
  mutate(
    role_group = if_else(pos %in% c("PG", "SG"), "Guard", "Non-Guard")
  )

\[ \begin{align} H_0 &: \text{The mean FG% is equal between guards and non-guards.} \\ H_1 &: \text{The mean FG% is different between guards and non-guards.} \end{align} \]

Alpha = 0.05, the conventional threshold, caps the Type I error rate at 5%. Power = 0.85 limits the Type II error (false negative) rate to 15%. Minimum effect size = 0.02: in the NBA, even a 2 percentage point difference in FG% can reflect a meaningful change in shot quality over a large sample.

# Calculating the pooled SD of FG% across guards and non-guards
sd_pooled <-
  df_h1 |>
  group_by(role_group) |>
  summarise(sd = sd(fg_percent, na.rm = TRUE), n = n(), .groups = "drop") |>
  summarise(
    sd_pooled = sqrt(sum((n - 1) * sd^2) / sum(n - 1))
  ) |>
  pull(sd_pooled)

sd_pooled

## [1] 0.119329
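
For reference, the pipeline above corresponds to the usual two-group pooled standard deviation. The symbols here are introduced only for illustration: \(s_1, s_2\) are the group standard deviations of FG% and \(n_1, n_2\) are the group sizes.

\[ s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}} \]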

# Calculating the required n of data for a two-sample t-test
ss_calc <-
  power.t.test(
    delta = 0.02,          
    sd = sd_pooled,
    sig.level = 0.05,
    power = 0.85,
    type = "two.sample",
    alternative = "two.sided"
  )

ss_calc

## 
##      Two-sample t test power calculation 
## 
##               n = 640.197
##           delta = 0.02
##              sd = 0.119329
##       sig.level = 0.05
##           power = 0.85
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
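
As a rough cross-check on the `power.t.test()` result, the per-group sample size can be approximated with the standard normal-approximation formula n ≈ 2(z_{1-α/2} + z_{power})² σ² / δ². The sketch below plugs in the values reported above. The variable names are illustrative, and because it uses normal rather than t quantiles, it should land slightly below the t-based answer.

```r
# Normal-approximation cross-check of the per-group sample size.
# Inputs are copied from the design criteria and the output above.
alpha     <- 0.05
power     <- 0.85
delta     <- 0.02
sd_pooled <- 0.119329

z_alpha <- qnorm(1 - alpha / 2)  # two-sided critical value
z_power <- qnorm(power)          # quantile matching the target power

n_approx <- 2 * (z_alpha + z_power)^2 * sd_pooled^2 / delta^2
n_approx
```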

# Counting the actual number of observations in each group
group_counts <-
  df_h1 |>
  count(role_group, name = "n_actual")

group_counts

## # A tibble: 2 × 2
##   role_group n_actual
##   <chr>         <int>
## 1 Guard          1403
## 2 Non-Guard      1818

Both groups (1,403 guards and 1,818 non-guards) exceed the required sample size of about 641 per group (640.197 rounded up), so there is enough data to run the test under the Neyman-Pearson framework.
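
One way to make this check explicit is to round the required n up to a whole number, compare it with each group's count, and then ask what power the smaller group size would give. This is only a sketch using the figures printed above. `n_required`, `n_actual`, and `achieved` are illustrative names, not objects from the original analysis, and since `power.t.test()` assumes equal group sizes, plugging in the smaller group is a conservative choice.

```r
# Required per-group n, rounded up from the power calculation output.
n_required <- ceiling(640.197)

# Group sizes copied from the count table above.
n_actual <- c("Guard" = 1403, "Non-Guard" = 1818)

# TRUE for each group that meets the requirement.
enough <- n_actual >= n_required
enough

# Conservative achieved power: treat both groups as if they had the smaller n.
achieved <- power.t.test(
  n = min(n_actual),
  delta = 0.02,
  sd = 0.119329,
  sig.level = 0.05,
  type = "two.sample",
  alternative = "two.sided"
)
achieved$power
```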

# Performing the hypothesis test
test_h1 <-
  t.test(
    fg_percent ~ role_group,
    data = df_h1,
    alternative = "two.sided"
  )

test_h1

## 
##  Welch Two Sample t-test
## 
## data:  fg_percent by role_group
## t = -13.748, df = 3145.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Guard and group Non-Guard is not equal to 0
## 95 percent confidence interval:
##  -0.06570514 -0.04930286
## sample estimates:
##     mean in group Guard mean in group Non-Guard 
##               0.4146066               0.4721106

Since the p-value < 0.05, the null hypothesis is rejected: the mean FG% differs between guards and non-guards. The estimated mean difference (guards minus non-guards) is about -0.0575, with a 95% CI of [-0.0657, -0.0493]. The entire interval lies beyond the minimum effect size (delta = 0.02) in absolute value, so the result is both statistically significant and meaningful under the design criteria.
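
The comparison against delta can be checked directly from the printed output. The sketch below recomputes the point estimate from the two group means, tests whether both CI bounds sit beyond the minimum effect size on the same side of zero, and expresses the gap in pooled-SD units. All variable names are illustrative, and the standardized effect is an added convenience rather than part of the original design.

```r
# Values copied from the t-test output and the design criteria above.
mean_guard     <- 0.4146066
mean_non_guard <- 0.4721106
ci             <- c(-0.06570514, -0.04930286)
delta          <- 0.02
sd_pooled      <- 0.119329

# Point estimate of the difference (guards minus non-guards).
diff_est <- mean_guard - mean_non_guard

# TRUE only if both CI bounds share a sign and both exceed delta in magnitude.
exceeds_delta <- prod(sign(ci)) > 0 && all(abs(ci) > delta)

# Standardized effect size: the difference in pooled-SD units.
cohens_d <- diff_est / sd_pooled

diff_est
exceeds_delta
cohens_d
```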

# Creating a boxplot to breakdown FG% by guards and non-guards
df_h1 |>
  ggplot(aes(x = role_group, y = fg_percent)) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(
    title = "FG% by Position",
    x = "Position",
    y = "FG%"
  ) +
  theme_minimal()

Insights: Guards have a lower mean FG% than non-guards, and the 95% confidence interval for the mean difference is entirely negative, so the gap consistently points in the same direction. Significance: This is consistent with the shot profile of non-guards (particularly centers), who attempt more shots at the rim and so tend to have a higher FG%, while guards shoot from farther out, which lowers theirs. Questions: Centers generally take shots closer to the basket, so if they were removed from the non-guard group, how different would the mean FG% be between guards and the remaining non-guards?

Hypothesis 2 (Fisher Significance Framework)

\[ \begin{align} H_0 &: \text{The average shot distance is equal between guards and non-guards.} \\ H_1 &: \text{The average shot distance is different between guards and non-guards.} \end{align} \]

# Performing a Welch two-sample t-test on average shot distance
test_fisher <-
  t.test(dist ~ role_group, data = df_h1)

test_fisher

## 
##  Welch Two Sample t-test
## 
## data:  dist by role_group
## t = 21.235, df = 3192, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Guard and group Non-Guard is not equal to 0
## 95 percent confidence interval:
##  3.197893 3.848498
## sample estimates:
##     mean in group Guard mean in group Non-Guard 
##                15.84426                12.32107

The very small p-value (< 2.2e-16) provides strong evidence against the null hypothesis that guards and non-guards have equal mean shot distance.
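
R prints `p-value < 2.2e-16` because the print method reports anything below machine epsilon that way. Under the Fisher framework the size of the p-value is itself the measure of evidence, so it can help to recover an approximate value from the reported t statistic and degrees of freedom. The sketch below uses the rounded values printed above, so the result is approximate. With the fitted object available, `test_fisher$p.value` gives the stored value directly.

```r
# Rounded values copied from the Welch t-test output above.
t_stat   <- 21.235
df_welch <- 3192

# Two-sided p-value from the lower tail of the t distribution.
p_two_sided <- 2 * pt(-abs(t_stat), df = df_welch)
p_two_sided

# Confirms why the printed output shows "< 2.2e-16".
p_two_sided < .Machine$double.eps
```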

# Creating a boxplot to break down average shot distance by guards and non-guards
df_h1 |>
  ggplot(aes(x = role_group, y = dist)) +
  geom_boxplot() +
  labs(
    title = "Average Shot Distance by Position",
    x = "Position",
    y = "Average Shot Distance (feet)"
  ) +
  theme_minimal()

There are two reasons to be confident in this conclusion: the sample is large (1,403 guards and 1,818 non-guards in df_h1), and the boxplot shows a clear separation in shot distance between the two groups. The results also align with NBA trends: guards take more perimeter shots (farther from the rim), while non-guards (mainly centers) take shots closer to the basket.

Insights: Guards attempt shots from much farther out on average than non-guards (about 15.8 feet versus 12.3 feet), so position strongly shapes shot profile. Significance: This is consistent with the conclusion from the Neyman-Pearson framework that guards have a lower FG% because they take more difficult (longer) shots. Evaluating players without positional context can lead to misinterpretation. Questions: Is there a difference between SGs and SFs when it comes to shot profile?

2026-03-02
