We want to see if successfully kicking a field goal is easier in domes (closed roofs) than stadiums (open roofs).
Let’s look at a table and bar chart of field goal attempts based on the roof type of the stadium:
fg_summary <-
  fg_df |>
  # Counting the field & result combos
  count(field, result, name = 'made') |>
  # Calculating the number of attempts per field and success rate
  mutate(
    .by = field,
    attempts = sum(made),
    success_rate = round(made/attempts, digits = 3)
  ) |>
  # Only keeping rows that correspond to a success
  filter(result == "made") |>
  dplyr::select(field, attempts, made, success_rate)

tibble(fg_summary)
## # A tibble: 2 × 4
## field attempts made success_rate
## <fct> <int> <int> <dbl>
## 1 dome 467 397 0.85
## 2 stadium 1451 1211 0.835
Bar Chart:
ggplot(
  data = fg_summary,
  mapping = aes(
    x = field,
    y = success_rate
  )
) +
  geom_col(
    fill = c('tan4', 'forestgreen')
  ) +
  ggfittext::geom_bar_text(
    mapping = aes(label = scales::percent(success_rate))
  ) +
  scale_y_continuous(
    expand = c(0, 0, 0.05, 0),
    labels = scales::label_percent()
  ) +
  labs(
    x = "Field Type",
    y = NULL,
    title = 'Field Goal Success Rate by Location'
  ) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5, size = 16))
Doesn’t appear to be much of a difference between the two field types!
Since both variables are binary, we typically code the outcomes of each as either 0 or 1.
Field Type: Stadium -> 0, Dome -> 1
Field Goal Result: missed -> 0, made -> 1
Let’s recode them as 0, 1 in the data as well:
fg_dummy <-
  fg_df |>
  mutate(
    # Recoding field as a dummy variable: dome -> 1, stadium -> 0
    field_dummy = (field == "dome") * 1,
    # Recoding the field goal result as a dummy variable: made -> 1, missed -> 0
    result_dummy = (result == "made") * 1
  )

tibble(fg_dummy)
## # A tibble: 1,918 × 7
## kicker kick_distance roof result field field_dummy result_dummy
## <fct> <int> <fct> <fct> <fct> <dbl> <dbl>
## 1 A.Vinatieri 26 closed made dome 1 1
## 2 A.Vinatieri 50 outdoors made stadium 0 1
## 3 A.Vinatieri 50 outdoors made stadium 0 1
## 4 A.Vinatieri 43 outdoors made stadium 0 1
## 5 A.Vinatieri 41 closed made dome 1 1
## 6 A.Vinatieri 27 open made stadium 0 1
## 7 A.Vinatieri 50 dome made dome 1 1
## 8 A.Vinatieri 30 open made stadium 0 1
## 9 A.Vinatieri 41 outdoors made stadium 0 1
## 10 A.Vinatieri 48 closed missed dome 1 0
## # ℹ 1,908 more rows
Since the two variables we’re analyzing are both categorical, we can use the odds ratio to summarize the relationship and conduct statistical inference!
fg_tab <-
  fg_dummy |>
  dplyr::select(field_dummy, result_dummy) |>
  table()

fg_tab
## result_dummy
## field_dummy 0 1
## 0 240 1211
## 1 70 397
Reminder: \(\hat{\theta} = \frac{n_{11}n_{22}}{n_{12}n_{21}}\)
odds_ratio_test <-
  fg_dummy |>
  summarize(
    # n11
    made_dome = sum(field_dummy * result_dummy),
    # n22
    missed_field = sum((1 - field_dummy) * (1 - result_dummy)),
    # n12
    made_field = sum((1 - field_dummy) * result_dummy),
    # n21
    missed_dome = sum(field_dummy * (1 - result_dummy))
  ) |>
  # Calculating the odds ratio
  mutate(
    odds_ratio = made_dome * missed_field / (made_field * missed_dome),
    # SE of the log odds ratio
    SE_LOR = sqrt(1/made_dome + 1/missed_field + 1/made_field + 1/missed_dome),
    # Test statistic = log(theta-hat) / SE
    test_stat = log(odds_ratio) / SE_LOR,
    # Two-sided p-value
    p_val = 2 * pnorm(abs(test_stat), lower.tail = F)
  )

odds_ratio_test |>
  signif(digits = 4)
## made_dome missed_field made_field missed_dome odds_ratio SE_LOR test_stat
## 1 397 240 1211 70 1.124 0.1476 0.7917
## p_val
## 1 0.4286
From the output, we can see that there isn’t a significant difference between kicking in a dome and kicking in a stadium!
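We can also turn the same quantities into a 95% confidence interval for the odds ratio (a minimal sketch: build the Wald interval on the log scale, then exponentiate back):

# 95% Wald CI for the odds ratio
odds_ratio_test |>
  mutate(
    OR_lower = exp(log(odds_ratio) - qnorm(0.975) * SE_LOR),
    OR_upper = exp(log(odds_ratio) + qnorm(0.975) * SE_LOR)
  ) |>
  dplyr::select(odds_ratio, OR_lower, OR_upper)

An interval that contains 1 agrees with the non-significant p-value above.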
While the odds ratio is easy enough to use, we can “formulate” our model a little differently. Instead of calculating the odds ratio, we can form a logistic regression model!
So what is a logistic regression model?
It is similar to an ordinary linear model:
\[y_i = \alpha + \beta_1 x_i\]
except instead of modelling \(y\), we model the probability of success, \(\pi\). But there is one major problem: probabilities can only be between 0 and 1, whereas a line can give any value!
So what do we do instead? We find a link function, \(g()\), that converts \(\pi_i\) from a value between 0 and 1 to any possible real value, so the left and right sides of the equation can actually be equal:
\[0 \le \pi \le 1 \rightarrow -\infty \le g(\pi) \le \infty\]
\[g(\pi_i) = \alpha + \beta_1 x_i\]
But what function can we use to do that?
There are 3 typical options:

the logit link:

\[g(\pi) = \ln \left( \frac{\pi}{1-\pi} \right)\]

the probit link, where \(\Phi^{-1}\) is the inverse of the standard normal CDF:

\[g(\pi) = \Phi^{-1}(\pi)\]

and the complementary log-log link:

\[g(\pi) = \ln \left( -\ln \left(1-\pi \right) \right)\]
The most common of these 3 choices is the logit link. Why? There are a few different reasons, but the easiest to explain is that it is the easiest to interpret!
Why is it easy to interpret?
Because it is the log of the odds, which we’ve seen quite a bit in this class, and we’ve interpreted odds already!
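As a quick numeric illustration (a small sketch using base R’s qlogis() and plogis(), the logit and inverse-logit functions), we can check that the logit maps probabilities onto the whole real line and back:

# Probabilities strictly between 0 and 1
p <- c(0.01, 0.25, 0.50, 0.75, 0.99)

# qlogis() is the logit: log(p / (1 - p)) can land anywhere on the real line
qlogis(p)

# plogis() is the inverse logit, recovering the original probabilities
plogis(qlogis(p))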
We can fit the logistic regression model using the glm() function. It can work with either the “raw”, unsummarized data or the summarized version of the data.

Let’s start with the unsummarized version, fg_df, for which glm() needs 3 arguments:

- formula = the formula we are going to use: response_var ~ explanatory_vars
- family = binomial to indicate we want to use the binomial distribution (the family argument accepts other values as well)
- data = the data set you want to analyze

First, check to make sure that the levels of the two variables are in the proper order. The glm() function wants the levels of the response variable to be in failure/success order and the explanatory variable to be in group 2/group 1 order.
levels(fg_df$result)
## [1] "made" "missed"
levels(fg_df$field)
## [1] "dome" "stadium"
Ours are currently in reverse order. If they are reversed, the easiest way to flip the order of a factor’s levels is with fct_rev(), as seen below:
fg_df <-
  fg_df |>
  # Flipping the order of result and field
  mutate(
    result = fct_rev(result),
    field = fct_rev(field)
  )

fg_df |>
  # Checking the results (not needed to flip)
  reframe(
    result_levels = levels(result),
    field_levels = levels(field)
  )
## result_levels field_levels
## 1 missed stadium
## 2 made dome
fg_logit <-
  glm(
    formula = result ~ field,
    family = binomial,
    data = fg_df
  )

# Basic display of the results
summary(fg_logit)
##
## Call:
## glm(formula = result ~ field, family = binomial, data = fg_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.61856 0.07066 22.907 <2e-16 ***
## fielddome 0.11688 0.14762 0.792 0.429
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1696.9 on 1917 degrees of freedom
## Residual deviance: 1696.2 on 1916 degrees of freedom
## AIC: 1700.2
##
## Number of Fisher Scoring iterations: 3
# Display the estimates using the tidy() function from the broom package
tidy(fg_logit)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.62 0.0707 22.9 3.91e-116
## 2 fielddome 0.117 0.148 0.792 4.29e- 1
Our logistic regression model is:
\[\log \left ( \frac{\hat{\pi}_i}{1-\hat{\pi}_i} \right ) = \hat{\alpha} + \hat{\beta}_1 x_i \]
where \(x_i = 1\) if the kick occurred in a dome (group 1) and \(x_i = 0\) if it occurred in a stadium (group 2):
\[\log \left ( \frac{\hat{\pi}_i}{1-\hat{\pi}_i} \right ) = 1.62 + 0.117 x_i\]
What do the estimates in the logistic regression model indicate about the chance a field goal is successful if it is attempted in a stadium vs a dome?
Since made is the outcome coded as the success, the estimates reflect the chance that a field goal is successful.
The intercept, \(\hat{\alpha}\), is the estimated/predicted value when \(x_i = 0\). So what does it mean when \(x_i = 0\)? It means the kick occurred in a stadium!
So \(\hat{\alpha}\) is the estimated log odds that a field goal attempt is successful when it is in a stadium. But what do the log odds indicate?
Not much (at least to us).
Instead, we can convert it back to the normal odds:
\[ e^{\hat{\alpha}} = e^{1.62} \approx 5 \]
So the odds a field goal attempt is successful when it is attempted in a stadium are about 5. Or, for every missed field goal in a stadium, we estimate that 5 field goals are made.
We can also use the odds to estimate the probability of a success:
\[\hat{\pi} = \frac{\textrm{odds}}{1+\textrm{odds}} = \frac{5.05}{6.05} \approx 0.835\]
which is the same as the sample probability of a successful stadium field goal,
\[p = \frac{1211}{1211 + 240} = 0.835\]
If you want to calculate the estimated probability from a logistic regression model, the formula is:
\[\hat{\pi}_i = \frac{e^{\hat{\alpha} + \hat{\beta}_1 x_i}}{1+e^{\hat{\alpha} + \hat{\beta}_1 x_i}}\]
or, since it’s sometimes simpler, find the complement:
\[1 - \hat{\pi}_i = \frac{1}{1+e^{\hat{\alpha} + \hat{\beta}_1 x_i}}\]
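We don’t have to do this conversion by hand: predict() with type = "response" applies the inverse logit for us. A minimal sketch (the newdata values are just the two levels of field in fg_df):

# Estimated success probabilities for each field type
predict(
  fg_logit,
  newdata = data.frame(field = c("stadium", "dome")),
  type = "response"
)

The results should match the sample success rates from the summary table (0.835 and 0.850).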
The slope estimate, \(\hat{\beta}_1 = 0.117\), tells us about the odds of successfully kicking a field goal in a dome rather than a stadium (dome vs stadium, since dome -> 1 and stadium -> 0).
The estimate itself measures the difference in the log odds of success between the two stadium types:
\[\log \left ( \frac{\hat{\pi}_1}{1-\hat{\pi}_1} \right ) - \log \left ( \frac{\hat{\pi}_0}{1-\hat{\pi}_0} \right )= 0.117 \]
Similar to interpreting the intercept, we don’t really interpret the slope estimate as is, but we exponentiate it first, then interpret the result.
What that means is we can use \(\hat{\beta}_1\) to estimate the odds ratio of success between domes vs stadiums by exponentiating it!
\[\hat{\theta} = \exp\left(\hat{\beta}_1\right) = e^{\hat{\beta}_1}\]
For our example:
\[\hat{\theta} = e^{0.1168782} = 1.12389\]
So the estimated odds a field goal attempt is successful are 1.124 times higher when it is attempted in a dome compared to a stadium.
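We can pull that same number straight from the model object by exponentiating the slope coefficient ourselves:

# The estimated odds ratio, computed directly from the fitted model
exp(coef(fg_logit))["fielddome"]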
The tidy() function will exponentiate the estimates for us if we include exponentiate = T:
tidy(fg_logit, exponentiate = T) |>
  # Rounding all non-term columns
  mutate(
    across(
      .cols = -term,
      .fns = ~ round(., 3)
    )
  )
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.05 0.071 22.9 0
## 2 fielddome 1.12 0.148 0.792 0.429
Those values look pretty similar…
odds_ratio_test |>
round(3)
## made_dome missed_field made_field missed_dome odds_ratio SE_LOR test_stat
## 1 397 240 1211 70 1.124 0.148 0.792
## p_val
## 1 0.429
That’s because logistic regression is just another way to estimate and test the odds ratio!
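As one more check of that equivalence, the model-based interval for the odds ratio should agree with the interval we computed by hand earlier. A small sketch using base R’s confint.default(), which builds Wald (estimate ± z × SE) intervals:

# Wald 95% CI for the odds ratio, taken from the logistic regression model
exp(confint.default(fg_logit))["fielddome", ]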
We can also fit a logistic regression model using the summarized data.

To do so, we need a data set with 3 columns:

- the explanatory variable outcomes (one row per group)
- the proportion of successes for each group of the explanatory variable
- the sample size for each group of the explanatory variable

We can get a data set with the summarized data below:
fg_summary <-
  fg_summary |>
  mutate(field = fct_rev(field))

fg_summary |>
  arrange(field)
## field attempts made success_rate
## 1 stadium 1451 1211 0.835
## 2 dome 467 397 0.850
Once we have the summarized data set, we can use glm() again, but with a few updates to the arguments:

- formula = y_prop ~ explanatory_var: the response variable is replaced with the proportion of successes
- weights = the group sample sizes of the explanatory variable: \(n_1\) and \(n_2\)
- the family and data arguments remain unchanged.
# Fitting the model
fg_sum_lr <-
  glm(
    formula = success_rate ~ field,
    weights = attempts,
    family = binomial,
    data = fg_summary
  )

# Displaying the model estimates
tidy(fg_sum_lr, exponentiate = T)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.06 0.0707 22.9 2.55e-116
## 2 fielddome 1.12 0.148 0.766 4.44e- 1
We get the same results as our model using the unsummarized data!
So why fit the model using the summarized version instead of the unsummarized version?
- Sometimes the data comes summarized already!
- There are statistics we can use later to judge how well the model fits the data that we shouldn’t use for models fitted with unsummarized data.

While they aren’t relevant now, we can find the fit statistics using glance(logreg_model) (glance() is also in the broom package):
bind_rows(
  "raw_data" = glance(fg_logit),
  "sum_data" = glance(fg_sum_lr),
  .id = "data_type"
)
## # A tibble: 2 × 9
## data_type null.deviance df.null logLik AIC BIC deviance df.residual
## <chr> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 raw_data 1697. 1917 -848. 1700. 1711. 1696. 1916
## 2 sum_data 0.595 1 -6.53 17.1 14.4 0 0
## # ℹ 1 more variable: nobs <int>
Let’s look at the probability a passenger survived the sinking of the Titanic based on what type of passenger they were.
The code below will create a summarized data set we can use:
t_df <-
  Titanic |>
  data.frame() |>
  # Calculating the joint counts of survived and class
  summarize(
    .by = c(Class, Survived),
    counts = sum(Freq)
  ) |>
  # Calculating the conditional proportions
  mutate(
    .by = Class,
    class_count = sum(counts),
    class_survived = counts / sum(counts)
  ) |>
  # Only the survived rows
  filter(Survived == "Yes") |>
  # Relevant columns
  dplyr::select(Class, class_count, class_survived)

t_df
## Class class_count class_survived
## 1 1st 325 0.6246154
## 2 2nd 285 0.4140351
## 3 3rd 706 0.2521246
## 4 Crew 885 0.2395480
The table shows us the survival rates of passengers depending on what class they were in.
Unlike the previous example, we won’t convert these to dummy variables. The first group listed in the table above will be our reference group (like the choice of 0 for a dummy variable).
The model for a non-binary explanatory variable is:
\[\log\left(\frac{\pi}{1-\pi}\right) = \alpha + \sum_{k = 2}^K\beta_kx_{ik}\]
where \(x_{ik} = 1\) if \(X = k\) and 0 otherwise. For the above model, \(X = 1\) is the reference group.
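We can peek at the dummy coding R will create with model.matrix(): each non-reference level gets its own 0/1 column, and the reference group (First class here) is absorbed into the intercept. A quick sketch:

# The design matrix R builds behind the scenes for Class
model.matrix(~ Class, data = t_df)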
Let’s fit a logistic regression model for the probability a passenger survived:
t_lr <-
  glm(
    formula = class_survived ~ Class,
    weights = class_count,
    family = binomial,
    data = t_df
  )

tidy(t_lr) |>
  dplyr::select(term, estimate) |>
  # Adding the exponentiated values to the table
  add_column(odds_est = tidy(t_lr, exponentiate = T)$estimate)
## # A tibble: 4 × 3
## term estimate odds_est
## <chr> <dbl> <dbl>
## 1 (Intercept) 0.509 1.66
## 2 Class2nd -0.856 0.425
## 3 Class3rd -1.60 0.203
## 4 ClassCrew -1.66 0.189
Notice that First class isn’t listed in the term column. That’s because it is being used as our reference group. What does that mean?

The intercept estimate is the log odds of survival for First class passengers (or the odds, once exponentiated). We can say that the odds a First class passenger survived are 1.664, which corresponds to a survival probability of \(1.664/(1 + 1.664) \approx 0.625\), matching the table above.
The other estimates are the (log) odds that a passenger in that class survived COMPARED to a First class passenger.
They are the odds ratios of each group compared to First class. Since all of the slopes are below 0, the odds of survival for the other passenger types are lower than the odds for a First class passenger.
“The odds a Second class passenger survived is 0.42 times the odds a First class passenger survived”
or
“The odds of survival are 1/0.4246 = 2.355 times higher for a First Class passenger compared to a second class passenger”
It’s important to understand all the estimates are comparing the odds of that class to the odds of a first class passenger (our reference group).
What if we wanted to compare the odds of survival of the 3 passenger types to Crew members?
We can use relevel(x = Factor, ref = "ref group") to change what the reference group is. If you do, you’ll need to rerun the model for the estimates to update:
t_df <-
  t_df |>
  mutate(class = relevel(x = Class, ref = "Crew"))

# Rerunning the model:
t_lr2 <-
  glm(
    formula = class_survived ~ class,
    weights = class_count,
    family = binomial,
    data = t_df
  )

tidy(t_lr2) |>
  dplyr::select(term, estimate) |>
  add_column(odds_est = tidy(t_lr2, exponentiate = T)$estimate)
## # A tibble: 4 × 3
## term estimate odds_est
## <chr> <dbl> <dbl>
## 1 (Intercept) -1.16 0.315
## 2 class1st 1.66 5.28
## 3 class2nd 0.808 2.24
## 4 class3rd 0.0678 1.07
Alternatively, we can calculate the new estimates by hand from the original model:

- new intercept: old intercept + old slope for the new reference category = \(0.509 + (-1.664) = -1.155\)
- new slope: old slope for that category - old slope for the new reference category = \(-0.856 - (-1.664) = 0.808\)
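We can verify this arithmetic directly with the original model’s coefficients (a quick sketch using coef(); the coefficient names match the tidy() output shown earlier):

b <- coef(t_lr)

# New intercept after releveling to Crew: old intercept + old Crew slope
b[["(Intercept)"]] + b[["ClassCrew"]]

# New slope for 2nd class: old 2nd class slope - old Crew slope
b[["Class2nd"]] - b[["ClassCrew"]]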
How we interpret the results remains the same; we’ve just changed the reference group (from First class to Crew).
# Using the tidy() function to calculate the 95% confidence intervals
tidy(
  t_lr2,
  conf.int = TRUE,
  conf.level = 0.95,
  exponentiate = T
) |>
  # Removing 'class' from the term column
  mutate(
    term = str_remove(tolower(term), "class")
  ) |>
  # Removing the intercept row
  filter(term != "(intercept)") |>
  # Creating a graph to display the confidence intervals for the odds ratios
  ggplot(mapping = aes(y = term)) +
  geom_errorbar(
    mapping = aes(xmin = conf.low, xmax = conf.high),
    width = 0.2
  ) +
  geom_point(
    mapping = aes(
      x = estimate,
      color = term
    ),
    show.legend = F,
    size = 3
  ) +
  geom_vline(
    xintercept = 1,
    linewidth = 1,
    linetype = 'dashed'
  ) +
  labs(
    y = "Passenger Class",
    x = "Odds Ratio of Survival vs Crew Members"
  )