We will be focusing on the variable drive_end

There are 4 ways in the data that a touchdown can end: Touchdown, Field Goal, Punt, Turnover

Single Proportion

Hypothesis Test

Let’s perform a hypothesis test to answer the question “Do 20% of all drives end in touchdowns?”

\[H_0: \pi = 0.20 \\ H_a: \pi \ne 0.20\]

For a single parameter, test statistics follow the same general pattern:

\[z = \frac{\textrm{statistic} - \textrm{null value}}{\textrm{standard error}}\]

If we are interested in learning about a single proportion, our test statistic is:

\[z = \frac{p - \pi_0}{\sqrt{\frac{\pi(1-\pi)}{n}}}\]

We have \(\pi_0\), but need to calculate our sample proportion:

# Calculating the sample proportion using mean(drives$drive_end = "Touchdown")
td_prop <- mean(drives$drive_end == "Touchdown")

td_prop
## [1] 0.24
# And the sample size
n = nrow(drives)

# and our null value
pi0 = 0.20

We have a sample proportion of \(p = 0.24\) (24%). Our null hypothesis is 0.20 (\(\pi_0 = 0.20\)).

So we have the top of our test statistic fraction, but what about the denominator:

\[SE = \sqrt{\frac{\pi(1-\pi)}{n}}\]

We don’t have \(\pi\) (hence the hypothesis test). So what do we replace it with?

Wald Test and Score Tests

There are two choices:

  1. Replace the unknown parameter with the sample statistic: \(\pi \rightarrow p\)
  • This is our Wald standard error
  1. Replace the unknown parameter with the null hypothesis value: \(\pi \rightarrow \pi_0\)$
  • This is our Score standard error

For larger sample sizes, the two won’t be that different. But for smaller sample sizes, the standard errors can be very different.

When performing a hypothesis test, a Score standard error is typically used and the test is then called a Score Test (unsurprising!)

Let’s calculate the standard error for both Wald and Score methods:

td_se <- 
  c(
    Wald = sqrt(td_prop * (1-td_prop)/n),
    Score = sqrt(pi0 * (1-pi0)/n)
  )

td_se |> round(digits = 4)
##   Wald  Score 
## 0.0121 0.0113

By storing them in the same vector, we can quickly calculate our test statistics (and subsequent p-values)

# Test statistic: (p - pi0)/se
test_stat <- (td_prop - pi0)/td_se

test_stat |> round(digits = 2)
##  Wald Score 
##  3.31  3.54
# P-value: 2P(z > |test stat|)
td_pval <- 2*pnorm(abs(test_stat), lower = F)

td_pval |> signif(digits = 3)
##     Wald    Score 
## 0.000929 0.000407

While our Wald p-value is over 2 times as large as our score p-value, both agree that there is strong evidence that the proportion of all NFL drives that end in a touchdown is not 20%

Confidence Intervals

We can also calculate a confidence interval using the formula:

\[p \pm z_{\alpha/2} \sqrt{\frac{\pi(1-\pi)}{n}}\]

When calculating a confidence interval, we typically use the Wald SE since we don’t have an explicit \(\pi_0\) to use in it’s place, but we do have a sample statistic:

Wald Confidence Interval

# Start by finding the critical value of a 95% CI:
z95 <- qnorm(1 - 0.05/2)

# Next the confidence interval!
wald_ci_td <- 
  c(
    lower = td_prop - z95*sqrt(td_prop*(1-td_prop)/n),
    upper = td_prop + z95*sqrt(td_prop*(1-td_prop)/n)
  )

wald_ci_td |> signif(digits = 3)
## lower upper 
## 0.216 0.264

Using a Wald confidence interval, we can be 95% confident that between 21.6% to 26.4% of all NFL drives end in touchdowns.

Score CI

How do we calculate a Score confidence interval?

We find the values of \(\pi_0\) that would cause us to not reject the null hypothesis. The easiest way to find that is find the values of \(\pi_0\) that give a test stat less than or equal to our critical value we found earlier!

score_df <- 
  tibble(
    pi0 = seq(0, 1, by = 0.001),    # Create a range of pi_0 values to test
    se0 = sqrt(pi0*(1-pi0)/n), # Calculating the standard errors for the different pi0
    z = (td_prop - pi0)/se0    # Finding the Score test stat for the various pi0
  )        

score_df
## # A tibble: 1,001 × 3
##      pi0      se0     z
##    <dbl>    <dbl> <dbl>
##  1 0     0        Inf  
##  2 0.001 0.000894 267. 
##  3 0.002 0.00126  188. 
##  4 0.003 0.00155  153. 
##  5 0.004 0.00179  132. 
##  6 0.005 0.00199  118. 
##  7 0.006 0.00218  107. 
##  8 0.007 0.00236   98.8
##  9 0.008 0.00252   92.1
## 10 0.009 0.00267   86.5
## # ℹ 991 more rows
# Next we want to find the values of pi0 that have a test stat below z95
score_df |> 
  filter(abs(z) <= z95) |> 
  summarize(
    lower = min(pi0),
    upper = max(pi0)
  )
## # A tibble: 1 × 2
##   lower upper
##   <dbl> <dbl>
## 1 0.218 0.264

When rounding to 3 decimal places, we get the exact same confidence interval using the Wald or score methods!

There is also a formula we can use in place of conducting a grid search:

\[\frac{y +\left(z^*\right)^2/2}{n + \left(z^*\right)^2} \pm \frac{z^*}{n + \left(z^*\right)^2} \sqrt{\frac{y(n - y)}{n} + \frac{\left(z^*\right)^2}{4}}\]

where \(y\) is the count, not proportion

# td_tot = y (total number of Touchdowns)
td_tot <- sum(drives$drive_end == "Touchdown")

td_tot
## [1] 300
# First, find the "statistic"
score_stat <- (td_tot + z95^2/2)/(n + z95^2)
score_stat
## [1] 0.2407966
# Next find the standard error
se_formula <- sqrt(td_tot *(n - td_tot)/n + z95^2/4)
se_formula
## [1] 15.13144
# Then margin of error
me_formula <- z95/(n + z95^2) *se_formula
me_formula
## [1] 0.02365297
# Then interval:
c(
  lower = score_stat - me_formula,
  upper = score_stat + me_formula
) |> 
  
  signif(digits = 4)
##  lower  upper 
## 0.2171 0.2644

Our two answers are slight different due to rounding and only searching to the nearest 0.001. If we used enough digits in our “search”, the answer will be identical!

Canned R Function

We can use the R function prop_test() in the rstatix package to conduct a similar test:

  1. x = number of success
  2. n = total sample size
  3. p = the null hypothesis value
  4. conf.level = desired confidence level
  5. correct = F to not use the continuity correct
pacman::p_load(rstatix)

prop_test(
  x = td_tot,
  n = n,
  p = 0.20,
  conf.level = 0.95,
  correct = F,
  detailed = T
) |> 
  
  dplyr::select(n:p, conf.low, conf.high)
## # A tibble: 1 × 7
##       n    n1 estimate statistic        p conf.low conf.high
##   <int> <int>    <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1  1250   300     0.24      12.5 0.000407    0.217     0.264

The statistic it returns is a \(\chi^2-\)test statistic, which is just a z test statistic squared:

3.5355339

The results aren’t exactly the same as our test results from earlier, and that’s because prop_test() uses a likelihood ratio test (LRT) instead of either a score or Wald test.

For larger sample sizes, the results should be very similar, but the numbers can be a little different than if we conducted it by hand.

Exact Binomial Test

All 3 tests we’ve seen so far are approximate tests. They rely on the result that the sample proportions become more and more Normal as the sample size increases.

But we don’t necessarily need to use an approximate test when we have the binomial distribution to help find our p-value!

We can use the binom_test() in the rstatix package. It has the same arguments as prop_test() except we don’t need the correct argument

binom_test_results <- 
  binom_test(x = td_tot,
             n = n,
             p = 0.20,
             conf.level = 0.95,
             detailed = T) 

binom_test_results |> 
  dplyr::select(n:p, conf.low, conf.high)
## # A tibble: 1 × 6
##       n estimate statistic        p conf.low conf.high
##   <int>    <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1  1250     0.24       300 0.000525    0.217     0.265

The binomial test finds the probability of getting 300 or more touchdowns or the probability of getting 200 or fewer touchdowns if the probability of a drive ending in a touchdown is \(\pi_0 = 0.20\)

Why 200 or fewer?

If the null hypothesis is true, then we would expect to see 250 touchdowns in our sample (n = 1250, \(\pi_0 = 0.20\))

So 300 is 50 over what we would expect, so we also need to find the probability of getting 50 or fewer below what we expected: \(250 - 50 = 200\)

tibble(
  Y = 0:n,
  P_Y = dbinom(x = Y, size = n, prob = 0.20)
) |> 
  
  # Remove the rows from 201 to 299
  filter(!between(Y, 201, 299)) |> 
  
  summarize(p_val = sum(P_Y))
## # A tibble: 1 × 1
##      p_val
##      <dbl>
## 1 0.000471
# 200 or less
prob_200 <- 
  pbinom(q = 200,
         size = 1250,
         prob = 0.20)

# 300 or more
prob_300 <- 
  pbinom(q = 300-1,
         size = 1250,
         prob = 0.20,
         lower = F)

# Pvalue = P(Y <= 200) + P(Y >= 300)
prob_200 + prob_300
## [1] 0.0004706629