Creating a summarized data set of field goal attempts that we can use to fit the logistic regression model:

fg_sum  <- 
  fg_df |> 
  count(distance, result) |> 
  # Conditional prop: success by distance
  mutate(
    .by = distance,
    dist_n = sum(n),
    result_prob = n/sum(n)
  ) |> 
  filter(
    dist_n >= 10,
    result == "made"
  ) |> 
  rename(made_n = n) 

tibble(fg_sum)
## # A tibble: 40 × 5
##    distance result made_n dist_n result_prob
##       <int> <fct>   <int>  <int>       <dbl>
##  1       19 made       19     19       1    
##  2       20 made       36     36       1    
##  3       21 made       34     36       0.944
##  4       22 made       44     44       1    
##  5       23 made       54     54       1    
##  6       24 made       49     51       0.961
##  7       25 made       49     49       1    
##  8       26 made       41     43       0.953
##  9       27 made       41     43       0.953
## 10       28 made       54     54       1    
## # ℹ 30 more rows

Logistic Regression Model

If we want to specify which link to use, we add an additional argument to the family = binomial part of the glm() function:

# GLM with a logit link
fg_logit <- 
  glm(
    formula = result_prob ~ distance,
    weights = dist_n,
    family = binomial(link = "logit"),
    data = fg_sum
  )

tidy(fg_logit) |> 
  mutate(
    across(
      .cols = -term,
      .fns = ~ round(., 3)
    )
  )
## # A tibble: 2 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    5.92      0.36       16.5       0
## 2 distance      -0.103     0.008     -12.8       0

Using our data, our estimated model is:

\[\log \left( \frac{\hat{\pi}_i}{1-\hat{\pi}_i} \right) = \hat{\alpha} + \hat{\beta}x_i\]

\[\log \left( \frac{\hat{\pi}_i}{1-\hat{\pi}_i} \right) = 5.93 -0.103 x_i\]
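
For example, a quick sketch of the estimated probability of a successful 40-yard attempt (the 40-yard distance is just an illustrative value, and the rounded coefficients make the numbers approximate):

# Log odds of success at 40 yards (rounded coefficients, so approximate)
log_odds_40 <- 5.93 - 0.103 * 40

# Converting the log odds to a probability (roughly 0.86)
exp(log_odds_40) / (1 + exp(log_odds_40))

# The fitted model gives (essentially) the same answer via predict()
predict(fg_logit, newdata = data.frame(distance = 40), type = "response")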

Hypothesis Testing

Individual Model Terms

Using the output of tidy() we can do a quick hypothesis test to see if the explanatory variable (distance) has a linear association with the log odds of success for the response variable (field goal result).

\[H_0: \beta = 0 \\ H_a: \beta \ne 0\]

tidy(fg_logit)|> 
  mutate(
    across(
      .cols = -term,
      .fns = ~ round(., 3)
    )
  )
## # A tibble: 2 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    5.92      0.36       16.5       0
## 2 distance      -0.103     0.008     -12.8       0

Since the p-value is very close to 0, we can claim that there is very strong evidence that the kicking distance has a linear association with the log odds of success for the field goal attempt.

We can also have tidy() calculate a confidence interval by including conf.int = T and setting conf.level to our desired confidence level (the default is 0.95):

tidy(fg_logit, conf.int = T, conf.level = 0.99)|> 
  mutate(
    across(
      .cols = -term,
      .fns = ~ round(., 3)
    )
  )
## # A tibble: 2 × 7
##   term        estimate std.error statistic p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 (Intercept)    5.92      0.36       16.5       0    5.03      6.89 
## 2 distance      -0.103     0.008     -12.8       0   -0.124    -0.083

Like the last time we had a confidence interval for the odds ratio, we need to exponentiate both ends:

tidy(
  fg_logit,
  conf.int = T,
  conf.level = 0.99,
  exponentiate = T
)|> 
  mutate(
    across(
      .cols = -term,
      .fns = ~ round(., 3)
    )
  )
## # A tibble: 2 × 7
##   term        estimate std.error statistic p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 (Intercept)  374.        0.36       16.5       0  153.      980.   
## 2 distance       0.902     0.008     -12.8       0    0.883     0.921

We can interpret our confidence interval as

“We are 99% confident that the odds of a successful field goal are reduced by 8% to 12% for each additional yard in distance”
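
As a quick check of that interpretation, a sketch exponentiating the 99% confidence limits for distance by hand (values taken from the un-exponentiated output above):

# Exponentiating the confidence limits for the distance slope
exp(c(-0.124, -0.083))
# roughly 0.88 to 0.92, i.e., an 8% to 12% reduction in the odds per extra yard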

There are more complicated hypothesis tests that we can perform once we have more terms in the model!

Checking model fit

If the data are summarized (at least 5 observations for each combination of ALL explanatory variables), then there are a couple of tools we can use to determine whether our model is appropriate for the data.
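
A quick sketch verifying that condition for our summarized data (by construction, every distance kept in fg_sum has at least 10 attempts, since we filtered on dist_n >= 10):

# Smallest number of attempts at any distance kept in fg_sum
fg_sum |> 
  summarize(min_attempts = min(dist_n))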

Deviance

In ordinary least squares (OLS) regression, one summary we can use to measure how well the model fits the data is the residual sum of squares (SSE):

\[SSE = \sum_i (y_i - \hat{y}_i)^2\]

Unfortunately, there isn’t a good equivalent for SSE with generalized linear models. Instead, we rely on the deviance. So what is the deviance?

We need two values:

  1. \(L_m\): The binomial probabilities of our observations using the model probabilities, \(\hat{\pi}_i\)

\[L_m = \prod_i \left(\hat{\pi}_i \right)^{y_i} \left(1- \hat{\pi}_i\right)^{n_i - y_i} \]

  2. \(L_s\): The binomial probabilities of our observations using the sample probabilities, \(p_i\)

\[L_s = \prod_i \left(p_i \right)^{y_i} \left(1-p_i\right)^{n_i - y_i}\]

The deviance will be 2 times the difference of their natural logs:

\[D = 2\left[\log(L_s) - \log(L_m)\right]\]

The code chunk below will calculate the two \(L\) values and the deviance:

fg_dev <- 
  fg_sum |> 
  # Adding the model probabilities column using predict()
  add_column(
    model_prob = predict(object = fg_logit, type = "response")
  ) |> 
  
  # Using dbinom() to calculate the binomial probabilities for each distance group
  # with the model probabilities and the sample probabilities
  # The log = T will take the log of the probabilities, saving us the step of taking the log later
  mutate(
    binom_model  = dbinom(x = made_n, size = dist_n, prob = model_prob, log = T),
    binom_sample = dbinom(x = made_n, size = dist_n, prob = result_prob, log = T)
  )  |> 
  
  # Adding up the log probabilities of the binomial using the model and sample probabilities
  summarize(
    L_m = sum(binom_model), 
    L_s = sum(binom_sample)
  ) |> 
  
  # Calculating the deviance:
  mutate(Dev = 2*(L_s - L_m))


fg_dev
##         L_m       L_s      Dev
## 1 -82.93149 -60.09996 45.66307

The deviance is a way to measure how much our model probabilities, \(\hat{\pi}_i\), differ from the best possible probabilities: the sample proportions, \(p_i = n_{ij}/n_{i}\).

If the logistic regression model is adequate for the data, then the deviance will follow a \(\chi^2\) distribution with degrees of freedom equal to \(r = r_1 - r_0\) where:

  • \(r_0\) = the number of terms in the logistic regression model
    • our model has 2 terms, \(\alpha\) and \(\beta\)
  • \(r_1\) = the number of sample proportions = the number of groups in the data
    • will always be the number of rows in the data set used to fit the logistic regression model!

Our deviance will follow a \(\chi^2\) with \(r = 40 - 2 = 38\) degrees of freedom.

Instead of calculating all of those values by hand, we can just use the glance() function:

glance(fg_logit) 
## # A tibble: 1 × 8
##   null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
## 1          255.      39  -82.9  170.  173.     45.7          38    40
# Our results from earlier:
fg_dev
##         L_m       L_s      Dev
## 1 -82.93149 -60.09996 45.66307

The values of interest are logLik, deviance, df.residual, and nobs (number of observations = number of sample proportions), all of which agree with what we found earlier!

We can find the p-value of a hypothesis test whose null hypothesis is that our model probabilities are accurate estimates of the true probabilities:

pchisq(q = fg_dev$Dev, df = 38, lower.tail = F)
## [1] 0.1836896

With a p-value of about 0.18, we don’t have evidence that our model is not adequate for the population probabilities!
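
Equivalently, a small sketch computing the same p-value straight from the glance() output, so we don't have to hard-code the degrees of freedom:

glance(fg_logit) |> 
  mutate(p_val = pchisq(q = deviance, df = df.residual, lower.tail = F))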

Pseudo R-Squared

We can calculate a similar summary for our GLM as we can for OLS regression: a pseudo-\(R^2\).

The \(R^2\) for OLS is:

\[R^2 = 1-\frac{SSE}{SST} = 1-\frac{\sum(y_i-\hat{y}_i)^2}{\sum(y_i-\bar{y})^2}\]

Earlier, we used the model deviance as the equivalent of \(SSE\). We'll use the model deviance again as our facsimile of \(SSE\) and use the null deviance as the replacement for \(SST\).

\[\tilde{R}^2 = 1-\frac{D_m}{D_0}\]

The tilde (~) indicates that it is a pseudo-\(R^2\).

So what is the null deviance?

The null deviance is similar to the model deviance, but we use the overall probability of success, \(p = n_{\bullet 1}/N\), instead of the model probabilities of success.
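
For illustration, a sketch computing the null deviance by hand from fg_sum; the result should (approximately) match the null.deviance reported by glance() below:

fg_sum |> 
  # Overall probability of success across all attempts
  mutate(overall_p = sum(made_n) / sum(dist_n)) |> 
  summarize(
    # Log-likelihood using the sample proportions
    L_s = sum(dbinom(x = made_n, size = dist_n, prob = result_prob, log = T)),
    # Log-likelihood using the single overall probability
    L_0 = sum(dbinom(x = made_n, size = dist_n, prob = overall_p, log = T))
  ) |> 
  # Null deviance: 2 times the difference of the log-likelihoods
  mutate(null_dev = 2 * (L_s - L_0))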

We can also find the null deviance with glance() using the model we already fit!

glance(fg_logit) |> 
  mutate(R2 = 1-deviance/null.deviance)
## # A tibble: 1 × 9
##   null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs    R2
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int> <dbl>
## 1          255.      39  -82.9  170.  173.     45.7          38    40 0.821

The pseudo-\(R^2\) works similarly to the ordinary \(R^2\): the closer to 1, the better the model fits the data.

Using residuals to assess model fit

We can find the Pearson residuals and the standardized residuals using the augment_columns() function:

augment_columns(x = fg_logit,
                data = fg_sum)
## # A tibble: 40 × 12
##    distance result made_n dist_n result_prob .fitted .se.fit .resid   .hat
##       <int> <fct>   <int>  <int>       <dbl>   <dbl>   <dbl>  <dbl>  <dbl>
##  1       19 made       19     19       1        3.98   0.212  0.841 0.0155
##  2       20 made       36     36       1        3.87   0.205  1.22  0.0301
##  3       21 made       34     36       0.944    3.77   0.197 -1.13  0.0308
##  4       22 made       44     44       1        3.67   0.190  1.49  0.0384
##  5       23 made       54     54       1        3.56   0.182  1.74  0.0479
##  6       24 made       49     51       0.961    3.46   0.175 -0.351 0.0459
##  7       25 made       49     49       1        3.36   0.167  1.83  0.0445
##  8       26 made       41     43       0.953    3.26   0.160 -0.315 0.0393
##  9       27 made       41     43       0.953    3.15   0.153 -0.181 0.0394
## 10       28 made       54     54       1        3.05   0.146  2.23  0.0494
## # ℹ 30 more rows
## # ℹ 3 more variables: .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>

If we want to find the observations where our model doesn’t fit well, we can look for large residuals using a scatterplot:

augment_columns(x = fg_logit,
                data = fg_sum) |> 
  ggplot(mapping = aes(x = .fitted,
                       y = .std.resid)) + 
  
  geom_point() + 
  
  labs(y = "Standardized Residuals",
       x = "Estimated log odds")

We want to be on the lookout for any groups that have a standardized residual of at least 3 in absolute value. Our largest residual is a little larger than 2, so there aren't any distances where our model estimated the probability poorly!
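
A quick sketch to list any distances with a standardized residual of at least 3 in absolute value (for these data it should return zero rows):

augment_columns(x = fg_logit,
                data = fg_sum) |> 
  filter(abs(.std.resid) >= 3)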

Working with ungrouped data

What if \(X\) is too continuous to group the data (i.e., each value of \(x\) is unique), or if the sample size isn't large enough to group the data?

We can still fit the model; we just don't need to specify the weights:

# Getting the same data set:
fg_df2 <- 
  fg_df |> 
  # Keeping the same rows that are represented in fg_sum
  filter(
    distance %in% fg_sum$distance
  ) |> 
  # Dummy coding result
  mutate(result = (result == 'made') * 1)

fg_logit2 <- 
  glm(
    formula = result ~ distance,
    data = fg_df2,
    family = binomial
  )

tidy(fg_logit2)
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    5.93    0.360        16.5 5.83e-61
## 2 distance      -0.103   0.00801     -12.8 1.39e-37

The parameter estimates will be the same and the standard errors will (essentially) be the same!

bind_rows(
  'grouped'   = tidy(fg_logit),
  'ungrouped' = tidy(fg_logit2),
  .id = 'data'
) |> 
  arrange(term)
## # A tibble: 4 × 6
##   data      term        estimate std.error statistic  p.value
##   <chr>     <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 grouped   (Intercept)    5.93    0.360        16.5 5.86e-61
## 2 ungrouped (Intercept)    5.93    0.360        16.5 5.83e-61
## 3 grouped   distance      -0.103   0.00801     -12.8 1.39e-37
## 4 ungrouped distance      -0.103   0.00801     -12.8 1.39e-37

So what’s the benefit of grouped data?

We can use the deviance!

If we have ungrouped data, any method that relies on the deviance shouldn’t be used, like our fit statistics :(

# Comparing the fit statistics
bind_rows(
  'grouped'   = glance(fg_logit),
  'ungrouped' = glance(fg_logit2),
  .id = 'data'
)
## # A tibble: 2 × 9
##   data      null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
##   <chr>             <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
## 1 grouped            255.      39  -82.9  170.  173.     45.7          38    40
## 2 ungrouped         1634.    1891 -713.  1429. 1440.   1425.         1890  1892

So how do we calculate a goodness-of-fit test for ungrouped data? That’s where the Hosmer-Lemeshow test comes in handy!

See the slides for a more in depth explanation of the Hosmer-Lemeshow test.

Hosmer-Lemeshow test

Step 1: Group the data

We get around the limit of having ungrouped data by grouping the data (so we’re kinda cheating).

We group the data by dividing it into \(G\) equal-ish sized groups. Typically we choose \(G = 10\).

We group the data based on the predicted probability. The rows with the lowest 10% of estimated probabilities get placed into group 1. The second lowest 10% of estimated probabilities (10.1% - 20%) get placed into group 2, …, the largest 10% get placed into group 10.

The easiest way is to use cut_number(col, G), where cut_number() will divide the rows into G groups based on the values in col:

fg_hl <- 
  fg_df2 |>
  dplyr::select(distance, result) |> 
  # Adding the estimated probabilities
  mutate(
    est_prob = predict(fg_logit2, fg_df2, type = 'response')
  ) |> 
  # Arranging from lowest to highest prob (not necessary, but helps visualize)
  arrange(est_prob) |> 
  # Creating the groups using cut_number()
  mutate(
    decile_group = cut_number(est_prob, 10, labels = 1:10)
  )

tibble(fg_hl)
## # A tibble: 1,892 × 4
##    distance result est_prob decile_group
##       <int>  <dbl>    <dbl> <fct>       
##  1       58      1    0.493 1           
##  2       58      0    0.493 1           
##  3       58      0    0.493 1           
##  4       58      1    0.493 1           
##  5       58      0    0.493 1           
##  6       58      1    0.493 1           
##  7       58      0    0.493 1           
##  8       58      1    0.493 1           
##  9       58      1    0.493 1           
## 10       58      0    0.493 1           
## # ℹ 1,882 more rows

From here, we calculate the successes for each group, \(n_{1g} = \sum_{i} y_{ig}\), the failures for each group, \(n_{2g} = \sum_{i} (1-y_{ig})\), the expected successes, \(\hat{\mu}_{1g} = \sum_{i} \hat{\pi}_{ig}\), and the expected failures, \(\hat{\mu}_{2g} = \sum_{i} (1-\hat{\pi}_{ig})\).

fg_hl_sum <- 
  fg_hl |> 
  summarize(
    .by = decile_group,
    size = n(),
    obs_success = sum(result),
    exp_success = sum(est_prob),
    obs_fail    = sum(1 - result),
    exp_fail    = sum(1 - est_prob)
  )

fg_hl_sum
##    decile_group size obs_success exp_success obs_fail  exp_fail
## 1             1  235         146    142.8211       89 92.178895
## 2             2  176         128    125.4859       48 50.514108
## 3             3  172         129    132.4346       43 39.565432
## 4             4  185         152    151.7085       33 33.291489
## 5             5  214         184    185.6049       30 28.395094
## 6             6  174         152    157.1053       22 16.894678
## 7             7  200         185    185.9137       15 14.086289
## 8             8  161         155    152.9547        6  8.045269
## 9             9  186         180    179.4487        6  6.551258
## 10           10  189         187    184.5225        2  4.477489

From there, we can conduct a goodness of fit test with

\[\chi^2 = \sum_{g = 1}^G\sum_{i=1}^2\frac{(n_{ig} - \hat{\mu}_{ig})^2}{\hat{\mu}_{ig}}\]

which will follow a \(\chi^2\) distribution with \(\text{df} = G-2\).

fg_hl_sum |> 
  # Calculating each row's success and fail component
  mutate(
    success_chi2 = (obs_success - exp_success)^2/exp_success,
    fail_chi2    = (obs_fail - exp_fail)^2/exp_fail
  ) |> 
  # Finding the test stat and df
  summarize(
    test_stat = sum(success_chi2) + sum(fail_chi2),
    df = n() - 2
  ) |> 
  # Adding the p-value
  mutate(
    p_val = pchisq(q = test_stat, df = df, lower.tail = F)
  )
##   test_stat df     p_val
## 1  4.622702  8 0.7970353

So we would fail to reject the null hypothesis that the model fits the data!

If we want to use a function instead of doing it by hand, we can use hltest(model) from the glmtoolbox package:

pacman::p_load(glmtoolbox)

hltest(model = fg_logit2)
## 
##    The Hosmer-Lemeshow goodness-of-fit test
## 
##  Group Size Observed Expected
##      1  186      109 110.1776
##      2  225      165 158.1294
##      3  172      129 132.4346
##      4  185      152 151.7085
##      5  214      184 185.6049
##      6  174      152 157.1053
##      7  200      185 185.9137
##      8  204      196 194.1948
##      9  197      193 190.7222
##     10  135      133 132.0090
## 
##          Statistic =  4.84127 
## degrees of freedom =  8 
##            p-value =  0.7744

or hoslem.test(x, y, g) in the ResourceSelection package.

Note:

  • x is the binary success dummy coded as 0/1
  • y is the estimated probabilities
  • g is the number of groups to form

pacman::p_load(ResourceSelection)

hl_test <- 
  hoslem.test(
    x = fg_df2$result,
    y = predict(fg_logit2, type = 'response'),
    g = 10
  )

# Test results
hl_test
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  fg_df2$result, predict(fg_logit2, type = "response")
## X-squared = 4.6227, df = 8, p-value = 0.797
# Expected counts
hl_test$expected
##                
## cutyhat              yhat0      yhat1
##   [0.493,0.666]  92.178895 142.821105
##   (0.666,0.731]  50.514108 125.485892
##   (0.731,0.787]  39.565432 132.434568
##   (0.787,0.834]  33.291489 151.708511
##   (0.834,0.883]  28.395094 185.604906
##   (0.883,0.912]  16.894678 157.105322
##   (0.912,0.94]   14.086289 185.913711
##   (0.94,0.955]    8.045269 152.954731
##   (0.955,0.97]    6.551258 179.448742
##   (0.97,0.982]    4.477489 184.522511
# Observed counts
hl_test$observed
##                
## cutyhat          y0  y1
##   [0.493,0.666]  89 146
##   (0.666,0.731]  48 128
##   (0.731,0.787]  43 129
##   (0.787,0.834]  33 152
##   (0.834,0.883]  30 184
##   (0.883,0.912]  22 152
##   (0.912,0.94]   15 185
##   (0.94,0.955]    6 155
##   (0.955,0.97]    6 180
##   (0.97,0.982]    2 187

As long as all the expected counts are at least 5 (like any \(\chi^2\) test), we can use the p-value to decide whether the model fits poorly (\(H_a\)) or doesn't fit poorly (\(H_0\)).

Since the p-value is large for all three implementations, we’ll fail to reject the null hypothesis!