We will be focusing on the variable drive_end
There are 4 ways in the data that a touchdown can end: Touchdown, Field Goal, Punt, Turnover
Let’s perform a hypothesis test to answer the question “Do 20% of all drives end in touchdowns?”
\[H_0: \pi = 0.20 \\ H_a: \pi \ne 0.20\]
For a single parameter, test statistics follow the same general pattern:
\[z = \frac{\textrm{statistic} - \textrm{null value}}{\textrm{standard error}}\]
If we are interested in learning about a single proportion, our test statistic is:
\[z = \frac{p - \pi_0}{\sqrt{\frac{\pi(1-\pi)}{n}}}\]
We have \(\pi_0\), but need to calculate our sample proportion:
# Calculating the sample proportion using mean(drives$drive_end = "Touchdown")
td_prop <- mean(drives$drive_end == "Touchdown")
td_prop
## [1] 0.24
# And the sample size
n = nrow(drives)
# and our null value
pi0 = 0.20
We have a sample proportion of \(p = 0.24\) (24%). Our null hypothesis is 0.20 (\(\pi_0 = 0.20\)).
So we have the top of our test statistic fraction, but what about the denominator:
\[SE = \sqrt{\frac{\pi(1-\pi)}{n}}\]
We don’t have \(\pi\) (hence the hypothesis test). So what do we replace it with?
There are two choices:
For larger sample sizes, the two won’t be that different. But for smaller sample sizes, the standard errors can be very different.
When performing a hypothesis test, a Score standard error is typically used and the test is then called a Score Test (unsurprising!)
Let’s calculate the standard error for both Wald and Score methods:
td_se <-
c(
Wald = sqrt(td_prop * (1-td_prop)/n),
Score = sqrt(pi0 * (1-pi0)/n)
)
td_se |> round(digits = 4)
## Wald Score
## 0.0121 0.0113
By storing them in the same vector, we can quickly calculate our test statistics (and subsequent p-values)
# Test statistic: (p - pi0)/se
test_stat <- (td_prop - pi0)/td_se
test_stat |> round(digits = 2)
## Wald Score
## 3.31 3.54
# P-value: 2P(z > |test stat|)
td_pval <- 2*pnorm(abs(test_stat), lower = F)
td_pval |> signif(digits = 3)
## Wald Score
## 0.000929 0.000407
While our Wald p-value is over 2 times as large as our score p-value, both agree that there is strong evidence that the proportion of all NFL drives that end in a touchdown is not 20%
We can also calculate a confidence interval using the formula:
\[p \pm z_{\alpha/2} \sqrt{\frac{\pi(1-\pi)}{n}}\]
When calculating a confidence interval, we typically use the Wald SE since we don’t have an explicit \(\pi_0\) to use in it’s place, but we do have a sample statistic:
# Start by finding the critical value of a 95% CI:
z95 <- qnorm(1 - 0.05/2)
# Next the confidence interval!
wald_ci_td <-
c(
lower = td_prop - z95*sqrt(td_prop*(1-td_prop)/n),
upper = td_prop + z95*sqrt(td_prop*(1-td_prop)/n)
)
wald_ci_td |> signif(digits = 3)
## lower upper
## 0.216 0.264
Using a Wald confidence interval, we can be 95% confident that between 21.6% to 26.4% of all NFL drives end in touchdowns.
How do we calculate a Score confidence interval?
We find the values of \(\pi_0\) that would cause us to not reject the null hypothesis. The easiest way to find that is find the values of \(\pi_0\) that give a test stat less than or equal to our critical value we found earlier!
score_df <-
tibble(
pi0 = seq(0, 1, by = 0.001), # Create a range of pi_0 values to test
se0 = sqrt(pi0*(1-pi0)/n), # Calculating the standard errors for the different pi0
z = (td_prop - pi0)/se0 # Finding the Score test stat for the various pi0
)
score_df
## # A tibble: 1,001 × 3
## pi0 se0 z
## <dbl> <dbl> <dbl>
## 1 0 0 Inf
## 2 0.001 0.000894 267.
## 3 0.002 0.00126 188.
## 4 0.003 0.00155 153.
## 5 0.004 0.00179 132.
## 6 0.005 0.00199 118.
## 7 0.006 0.00218 107.
## 8 0.007 0.00236 98.8
## 9 0.008 0.00252 92.1
## 10 0.009 0.00267 86.5
## # ℹ 991 more rows
# Next we want to find the values of pi0 that have a test stat below z95
score_df |>
filter(abs(z) <= z95) |>
summarize(
lower = min(pi0),
upper = max(pi0)
)
## # A tibble: 1 × 2
## lower upper
## <dbl> <dbl>
## 1 0.218 0.264
When rounding to 3 decimal places, we get the exact same confidence interval using the Wald or score methods!
There is also a formula we can use in place of conducting a grid search:
\[\frac{y +\left(z^*\right)^2/2}{n + \left(z^*\right)^2} \pm \frac{z^*}{n + \left(z^*\right)^2} \sqrt{\frac{y(n - y)}{n} + \frac{\left(z^*\right)^2}{4}}\]
where \(y\) is the count, not proportion
# td_tot = y (total number of Touchdowns)
td_tot <- sum(drives$drive_end == "Touchdown")
td_tot
## [1] 300
# First, find the "statistic"
score_stat <- (td_tot + z95^2/2)/(n + z95^2)
score_stat
## [1] 0.2407966
# Next find the standard error
se_formula <- sqrt(td_tot *(n - td_tot)/n + z95^2/4)
se_formula
## [1] 15.13144
# Then margin of error
me_formula <- z95/(n + z95^2) *se_formula
me_formula
## [1] 0.02365297
# Then interval:
c(
lower = score_stat - me_formula,
upper = score_stat + me_formula
) |>
signif(digits = 4)
## lower upper
## 0.2171 0.2644
Our two answers are slight different due to rounding and only searching to the nearest 0.001. If we used enough digits in our “search”, the answer will be identical!
We can use the R function prop_test()
in the
rstatix
package to conduct a similar test:
x =
number of successn =
total sample sizep =
the null hypothesis valueconf.level =
desired confidence levelcorrect = F
to not use the continuity correctpacman::p_load(rstatix)
prop_test(
x = td_tot,
n = n,
p = 0.20,
conf.level = 0.95,
correct = F,
detailed = T
) |>
dplyr::select(n:p, conf.low, conf.high)
## # A tibble: 1 × 7
## n n1 estimate statistic p conf.low conf.high
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1250 300 0.24 12.5 0.000407 0.217 0.264
The statistic it returns is a \(\chi^2-\)test statistic, which is just a z test statistic squared:
3.5355339
The results aren’t exactly the same as our test results from earlier,
and that’s because prop_test()
uses a likelihood
ratio test (LRT) instead of either a score or
Wald test.
For larger sample sizes, the results should be very similar, but the numbers can be a little different than if we conducted it by hand.
All 3 tests we’ve seen so far are approximate tests. They rely on the result that the sample proportions become more and more Normal as the sample size increases.
But we don’t necessarily need to use an approximate test when we have the binomial distribution to help find our p-value!
We can use the binom_test()
in the rstatix
package. It has the same arguments as prop_test()
except we
don’t need the correct
argument
binom_test_results <-
binom_test(x = td_tot,
n = n,
p = 0.20,
conf.level = 0.95,
detailed = T)
binom_test_results |>
dplyr::select(n:p, conf.low, conf.high)
## # A tibble: 1 × 6
## n estimate statistic p conf.low conf.high
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1250 0.24 300 0.000525 0.217 0.265
The binomial test finds the probability of getting 300 or more touchdowns or the probability of getting 200 or fewer touchdowns if the probability of a drive ending in a touchdown is \(\pi_0 = 0.20\)
Why 200 or fewer?
If the null hypothesis is true, then we would expect to see 250 touchdowns in our sample (n = 1250, \(\pi_0 = 0.20\))
So 300 is 50 over what we would expect, so we also need to find the probability of getting 50 or fewer below what we expected: \(250 - 50 = 200\)
tibble(
Y = 0:n,
P_Y = dbinom(x = Y, size = n, prob = 0.20)
) |>
# Remove the rows from 201 to 299
filter(!between(Y, 201, 299)) |>
summarize(p_val = sum(P_Y))
## # A tibble: 1 × 1
## p_val
## <dbl>
## 1 0.000471
# 200 or less
prob_200 <-
pbinom(q = 200,
size = 1250,
prob = 0.20)
# 300 or more
prob_300 <-
pbinom(q = 300-1,
size = 1250,
prob = 0.20,
lower = F)
# Pvalue = P(Y <= 200) + P(Y >= 300)
prob_200 + prob_300
## [1] 0.0004706629