Previously, we examined whether successfully kicking a field goal is easier in domes (closed roofs) than in stadiums (open roofs) and found that the difference in success probability is not statistically significant.

Now, let's look at the other explanatory variable: distance.

We’ll start by looking at a graph of the success probability for each distance:

fg_sum <- 
  fg_df |> 
  # Count each distance & result combination
  count(distance, result) |> 
  # Conditional proportion: P(Y = 1 | x = i)
  mutate(
    .by = distance,
    dist_n = sum(n),
    result_prob = n/sum(n)
  ) |> 
  # Keep distances with at least 10 attempts and only the "made" rows
  filter(
    dist_n >= 10,
    result == "made"
  )

gg_fg <- 
  ggplot(
    data = fg_sum,
    mapping = aes(
      x = distance,
      y = result_prob
    )
  ) + 
  labs(
    y = "Field Goal Probability",
    x = "Field Goal Attempt Distance",
    size = "Distance \nCount"
  )
  
# Bar chart of success
gg_fg +
  geom_col(
    fill = "#825736",
    color = "black"
  ) + 
  scale_y_continuous(expand = c(0, 0, 0.05, 0))

# Scatterplot of success
gg_fg +
  geom_point(
    color = "#825736",
    size = 2
  ) + 
  scale_y_continuous(minor_breaks = NULL)

Why don’t we try fitting a straight line to the data?

# Scatterplot of success
gg_fg +
  geom_point(
    color = "#825736"
  ) + 
  # Adding a straight trend line
  geom_smooth(
    formula = y ~ x,
    method = "lm",
    se = FALSE
  ) +
  # Adding tick marks every 20%
  scale_y_continuous(
    breaks = 0:5/5,
    minor_breaks = NULL
  )

We have a “slight” problem when using a straight line to model the probability a field goal attempt is successful: When the kicking distance is less than 25 yards, we estimate a probability of success over 1!
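To see the problem concretely, here's a quick check (a sketch, jumping ahead to the weighted linear model we fit later in this section) of what the straight line predicts for a short attempt:

# Fitting the straight line and predicting an 18-yard attempt
fg_lm <- 
  lm(
    formula = result_prob ~ distance,
    weights = dist_n,
    data = fg_sum
  )

# 1.2743 - 0.01135*18 = 1.07: an impossible probability!
predict(fg_lm, newdata = data.frame(distance = 18))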

Like we did when the explanatory variable was categorical, we’ll fit a generalized linear model!

Let’s look at the model using the three main link functions:

Generalized Linear Models
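As a quick reference, each link transforms the success probability \(\pi\) onto the whole real line, where we set it equal to \(\alpha + \beta x\):

\[\textrm{logit: } \log\left(\frac{\pi}{1-\pi}\right) \qquad \textrm{probit: } \Phi^{-1}(\pi) \qquad \textrm{cloglog: } \log\left(-\log(1-\pi)\right)\]

where \(\Phi\) is the standard normal CDF.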

If we want to specify which link to use, we add an additional argument inside the family = binomial() part of the glm() function:

# GLM with a logit link
fg_logit <- 
  glm(
    formula = result_prob ~ distance,
    weights = dist_n,
    family = binomial(link = "logit"),
    data = fg_sum
  )

# GLM with a probit link
fg_probit <- 
  glm(
    formula = result_prob ~ distance,
    weights = dist_n,
    family = binomial(link = "probit"),
    data = fg_sum
  )

# GLM with a cloglog link
fg_cll <- 
  glm(
    formula = result_prob ~ distance,
    weights = dist_n,
    family = binomial(link = "cloglog"),
    data = fg_sum
  )

Now that we have our models, we’ll compare the predictions on the graph.

To do so, how do we get predictions of the probability of success from our GLMs?

We use predict() with three arguments:

  1. object = our glm object

  2. newdata = the data set we want predictions for (we can omit this to get predictions for the original data)

  3. type = "response" if we want the probability of success (the default, "link", returns the log odds of success when using the logit link)

fg_glm <- 
  data.frame(
    fg_sum,
    logit   = predict(object = fg_logit,  type = "response"),
    probit  = predict(object = fg_probit, type = "response"),
    cloglog = predict(object = fg_cll,    type = "response")
  )

# Comparing the lines using each of the models:
gg_link_curves <- 
  fg_glm |> 
  pivot_longer(
    cols = logit:cloglog,
    names_to = "link",
    values_to = "probs"
  ) |> 
  mutate(
    link = as_factor(link)
  ) |> 
  
  ggplot(
    mapping = aes(x = distance)
  ) + 
  
  geom_point(
    data = fg_sum,
    mapping = aes(y = result_prob),
    color = "#825736",
    size = 2
  ) +
  
  geom_line(
    mapping = aes(
      y = probs,
      color = link
    ),
    linewidth = 1
  ) + 
  labs(y = NULL) +
  theme(
    legend.position = 'inside',
    legend.position.inside = c(0.9, 0.85)
  ) 

gg_link_curves

While there aren't any major differences within the range of the data, let's compare across the full range of possibilities (hypothetically, field goals could be attempted from 11 yards to 109 yards!).

gg_link_curves_full <- 
  tibble(
    distance = 11:109,
    # Logit probability
    logit   = predict(
      object = fg_logit,
      newdata = data.frame(distance),
      type = "response"
    ),
    # probit probability
    probit  = predict(
      object = fg_probit,
      newdata = data.frame(distance),
      type = "response"
    ),
    # Complementary loglog
    cloglog = predict(
      object = fg_cll,
      newdata = data.frame(distance),
      type = "response")
  ) |> 
  
  pivot_longer(
    cols = logit:cloglog,
    values_to = "probs",
    names_to = "link"
  ) |> 
  
  mutate(link = as_factor(link)) |> 
  
  ggplot(mapping = aes(x = distance)) + 
  
  geom_line(
    mapping = aes(
      y = probs,
      color = link
    ),
    linewidth = 1
  ) + 
  
  scale_y_continuous(expand = c(0,0.01)) + 
  
  scale_x_continuous(
    breaks = 1:11*10,
    minor_breaks = NULL
  ) + 
  
  theme(
    legend.position = 'inside',
    legend.position.inside = c(0.1, 0.2)
  )
gg_link_curves_full

The logit and probit links give similar results, but the predictions from the cloglog link start diverging from the other two after a distance of about 50 yards.

That's because the cloglog link has slopes closer to 0 than the other two links at longer distances, which causes its curve to change more slowly.

So when do we use the cloglog link? When the probability of success across the range of values is low.

Basically, if we are modelling a rare event, cloglog works great. If not, we typically use the logit link since it is similar to the probit link and has better theoretical properties!
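A brief aside on why: when \(\pi\) is small, \(\log(1-\pi) \approx -\pi\), so

\[\log\left(-\log(1-\pi)\right) \approx \log(\pi)\]

which means that for rare events the cloglog model behaves like a log-linear model, and \(e^{\beta}\) approximates a relative risk.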

For the remainder of the course, we’ll be using the logit link (unless specified otherwise).

Interpreting the model estimates

Our model using logistic regression is:

\[\log \left(\frac{\hat{\pi}(x_i)}{1-\hat{\pi}(x_i)} \right) = \hat{\alpha} + \hat{\beta} x_i\]
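Solving for \(\hat{\pi}(x_i)\) gives us the s-shaped curve we saw in the graphs:

\[\hat{\pi}(x_i) = \frac{e^{\hat{\alpha} + \hat{\beta} x_i}}{1 + e^{\hat{\alpha} + \hat{\beta} x_i}}\]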

We can get our model estimates using tidy() from the broom package:

tidy(fg_logit) |>
  mutate(
    across(
      .cols = -term,
      .fns = ~round(., 3)
    )
  )
## # A tibble: 2 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    5.92      0.36       16.5       0
## 2 distance      -0.103     0.008     -12.8       0

Intercept interpretation

When x is quantitative, we interpret the intercept similarly to ordinary linear regression:

"When the attempt distance is 0, we predict the log odds that a field goal is successful are 5.925"

However, log odds aren't intuitive, so we can convert them to odds and then to a probability:

tidy(
  fg_logit,
  exponentiate = TRUE
) |>
  mutate(
    across(
      .cols = -term,
      .fns = ~round(., 3)
    )
  )
## # A tibble: 2 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)  374.        0.36       16.5       0
## 2 distance       0.902     0.008     -12.8       0

"When the attempt distance is 0, the odds a field goal attempt is successful are about 374"

which we can also interpret as a probability:

“When the attempt distance is 0, the probability a field goal attempt is successful is about 374/(374+1) = 99.7%”
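We can verify both conversions in R using the rounded estimate (plogis() is the inverse of the logit function):

# Odds when distance = 0: e^alpha = about 374
exp(5.925)

# Probability when distance = 0: e^alpha/(1 + e^alpha) = about 0.997
plogis(5.925)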

Slope Interpretation

When using ordinary linear regression, interpreting the slope is straightforward:

“We predict the response to increase by {slope amount} for each 1 increase in x”

If we were to use a linear model on the probability of success:

lm(result_prob ~ distance,
   weights = dist_n,
   data = fg_sum)
## 
## Call:
## lm(formula = result_prob ~ distance, data = fg_sum, weights = dist_n)
## 
## Coefficients:
## (Intercept)     distance  
##     1.27430     -0.01135

“For each 1 additional yard for the attempt, the probability the field goal is made decreases by about 1.1%”

or

"For each additional 5 yards for the attempt, the probability the field goal is made decreases by 0.01135*5 ≈ 5.7%"

The big benefit of linear models is that the slope's interpretation doesn't change depending on the value of x: the change in y is constant. If the attempt changes from 15 to 20 yards or from 50 to 55 yards, our line predicts the success probability to decrease by about 5.7%.

However, the change in probabilities usually depends on what x is AND how much x is changing :(

If an attempt is moved 5 yards further back, it will likely have a larger impact going from 50 to 55 than from 15 to 20.

Since our logistic regression line is curved, there isn't a nice, neat interpretation of the slope in terms of probabilities: how much we predict \(\pi\) to change depends on x.
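We can see this with our fitted model by comparing the predicted change in probability for a 5-yard move at a short distance versus a long one:

# Predicted success probabilities at 15, 20, 50, and 55 yards
p_hat <- 
  predict(
    object = fg_logit,
    newdata = data.frame(distance = c(15, 20, 50, 55)),
    type = "response"
  )

# Change from 15 to 20 yards: small (both probabilities are near 1)
p_hat[2] - p_hat[1]

# Change from 50 to 55 yards: much larger
p_hat[4] - p_hat[3]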

So how do we interpret \(\hat{\beta}\) for logistic regression?

Fortunately, the ratio of the odds of success depends only on the change in x, not the value of x itself! So our interpretations will focus on how a change in x changes the odds of success by a factor of \(e^{\hat{\beta}}\)!

tidy(
  fg_logit,
  exponentiate = TRUE
)
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  374.      0.360        16.5 5.86e-61
## 2 distance       0.902   0.00801     -12.8 1.39e-37

"For each additional yard of kicking distance, the odds the field goal is made are 0.90 times what they would be if the kick weren't moved"

If the slope is negative, the odds ratio will be below 1 (as we see here). If the odds ratio is below one, we can interpret the results as:

"The odds a field goal attempt is successful are reduced by about 10% (1 - 0.90 = 0.10) for each additional yard increase in the kicking distance."
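We can confirm the odds ratio doesn't depend on the starting distance with a quick check:

# Converting probabilities to odds
odds <- function(p){ p/(1 - p) }

p_hat <- 
  predict(
    fg_logit,
    newdata = data.frame(distance = c(20, 21, 50, 51)),
    type = "response"
  )

# Both ratios equal exp(coef(fg_logit)["distance"]) = about 0.902
odds(p_hat[2])/odds(p_hat[1])
odds(p_hat[4])/odds(p_hat[3])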

Effect of the estimates on the curve

How do the terms (\(\alpha, \beta\)) in the model affect the curve we see in the graph?

Let’s look at the example in the graph below:

As \(|\beta|\) increases, the s-curve becomes steeper (the model probabilities approach 0 and 1 faster as x changes).

As \(\alpha\) increases, the inflection point in the graph (at \(\pi = 0.5\), represented by the dots in the graph) moves to the left when \(\beta > 0\): since the inflection point occurs where \(\alpha + \beta x = 0\), i.e., at \(x = -\alpha/\beta\), a higher \(\alpha\) means that lower values of \(x\) are needed to reach the 50% probability mark.
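A graph like this can be recreated with a short sketch; the \(\alpha\) and \(\beta\) values below are hypothetical, chosen just to illustrate the two effects (plogis() computes the logistic curve):

# Hypothetical intercepts and slopes to illustrate the shapes
curve_pars <- 
  expand_grid(
    alpha = c(-2, 2),
    beta = c(0.5, 2)
  ) |> 
  mutate(curve = paste0("a = ", alpha, ", b = ", beta))

curve_pars |> 
  expand_grid(x = seq(-10, 10, by = 0.1)) |> 
  mutate(pi_hat = plogis(alpha + beta*x)) |> 
  
  ggplot(mapping = aes(x = x, color = curve)) + 
  
  geom_line(
    mapping = aes(y = pi_hat),
    linewidth = 1
  ) + 
  
  # Dots at the inflection points: pi = 0.5 when x = -alpha/beta
  geom_point(
    data = curve_pars,
    mapping = aes(x = -alpha/beta, y = 0.5),
    size = 2
  ) +
  
  labs(y = "Probability")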