We want to see if successfully kicking a field goal is easier in domes (closed roofs) than stadiums (open roofs).
Let’s look at a table and bar chart of field goal attempts based on the roof type of the stadium:
fg_summary <-
  fg_df |>
  # Counting the field & result combos
  count(field, result, name = 'made') |>
  # Calculating the number of attempts per field and success rate
  mutate(
    .by = field,
    attempts = sum(made),
    success_rate = round(made/attempts, digits = 3)
  ) |>
  # Only keeping rows that correspond to a success
  filter(result == "made") |>
  dplyr::select(field, attempts, made, success_rate)

tibble(fg_summary)
## # A tibble: 2 × 4
## field attempts made success_rate
## <fct> <int> <int> <dbl>
## 1 dome 467 397 0.85
## 2 stadium 1451 1211 0.835
Bar Chart:
ggplot(
  data = fg_summary,
  mapping = aes(
    x = field,
    y = success_rate
  )
) +
  geom_col(
    fill = c('tan4', 'forestgreen')
  ) +
  ggfittext::geom_bar_text(
    mapping = aes(label = scales::percent(success_rate))
  ) +
  scale_y_continuous(
    expand = c(0, 0, 0.05, 0),
    labels = scales::label_percent()
  ) +
  labs(
    x = "Field Type",
    y = NULL,
    title = 'Field Goal Success Rate by Location'
  ) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5, size = 16))
Doesn’t appear to be much of a difference between the two field types!
Since both variables are binary, we typically code the outcomes of each as either 0 or 1.
Field Type: Stadium -> 0, Dome -> 1
Field Goal Result: missed -> 0, made -> 1
Let’s recode them as 0, 1 in the data as well:
fg_dummy <-
  fg_df |>
  mutate(
    # Recoding field as a dummy variable: dome -> 1, stadium -> 0
    field_dummy = (field == "dome") * 1,
    # Recoding the field goal result as a dummy variable: made -> 1, missed -> 0
    result_dummy = (result == "made") * 1
  )

tibble(fg_dummy)
## # A tibble: 1,918 × 7
## kicker kick_distance roof result field field_dummy result_dummy
## <fct> <int> <fct> <fct> <fct> <dbl> <dbl>
## 1 A.Vinatieri 26 closed made dome 1 1
## 2 A.Vinatieri 50 outdoors made stadium 0 1
## 3 A.Vinatieri 50 outdoors made stadium 0 1
## 4 A.Vinatieri 43 outdoors made stadium 0 1
## 5 A.Vinatieri 41 closed made dome 1 1
## 6 A.Vinatieri 27 open made stadium 0 1
## 7 A.Vinatieri 50 dome made dome 1 1
## 8 A.Vinatieri 30 open made stadium 0 1
## 9 A.Vinatieri 41 outdoors made stadium 0 1
## 10 A.Vinatieri 48 closed missed dome 1 0
## # ℹ 1,908 more rows
Since the two variables we’re analyzing are both categorical, we can use the odds ratio to summarize the relationship and conduct statistical inference!
fg_tab <-
  fg_dummy |>
  dplyr::select(field_dummy, result_dummy) |>
  table()

fg_tab
## result_dummy
## field_dummy 0 1
## 0 240 1211
## 1 70 397
Reminder: \(\hat{\theta} = \frac{n_{11}n_{22}}{n_{12}n_{21}}\)
odds_ratio_test <-
  fg_dummy |>
  summarize(
    # n11
    made_dome = sum(field_dummy * result_dummy),
    # n22
    missed_field = sum((1 - field_dummy) * (1 - result_dummy)),
    # n12
    made_field = sum((1 - field_dummy) * result_dummy),
    # n21
    missed_dome = sum(field_dummy * (1 - result_dummy))
  ) |>
  # Calculating the odds ratio
  mutate(
    odds_ratio = made_dome * missed_field / (made_field * missed_dome),
    # SE of the log odds ratio
    SE_LOR = sqrt(1/made_dome + 1/missed_field + 1/made_field + 1/missed_dome),
    # Test statistic = log(theta-hat) / SE
    test_stat = log(odds_ratio) / SE_LOR,
    # Two-sided p-value
    p_val = 2 * pnorm(abs(test_stat), lower.tail = F)
  )

odds_ratio_test |>
  signif(digits = 4)
## made_dome missed_field made_field missed_dome odds_ratio SE_LOR test_stat
## 1 397 240 1211 70 1.124 0.1476 0.7917
## p_val
## 1 0.4286
From the output, we can see that there isn’t a significant difference between kicking in a dome and kicking in a stadium!
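We can also turn the same quantities into a 95% confidence interval for the odds ratio (a minimal sketch: build the Wald interval on the log scale, then exponentiate back):

# 95% Wald CI for the odds ratio
odds_ratio_test |>
  mutate(
    OR_lower = exp(log(odds_ratio) - qnorm(0.975) * SE_LOR),
    OR_upper = exp(log(odds_ratio) + qnorm(0.975) * SE_LOR)
  ) |>
  dplyr::select(odds_ratio, OR_lower, OR_upper)

An interval that contains 1 agrees with the non-significant p-value above.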
While the odds ratio is easy enough to use, we can “formulate” our model a little differently. Instead of calculating the odds ratio, we can form a logistic regression model!
So what is a logistic regression model?
It is similar to an ordinary linear model:
\[y_i = \alpha + \beta_1 x_i\]
except instead of modelling \(y\), we model the probability of success, \(\pi\). But there is one major problem: probabilities can only be between 0 and 1, whereas a line can give any value!
So what do we do instead? We find a link function, \(g()\), that converts \(\pi_i\) from a value between 0 and 1 to any possible real value, so the left and right sides of the equation can actually be equal:
\[0 \le \pi \le 1 \rightarrow -\infty \le g(\pi) \le \infty\]
\[g(\pi_i) = \alpha + \beta_1 x_i\]
But what function can we use to do that?
There are 3 typical options:

the logit link:

\[g(\pi) = \ln \left( \frac{\pi}{1-\pi} \right)\]

the probit link, where \(\Phi^{-1}\) is the inverse of the standard normal CDF:

\[g(\pi) = \Phi^{-1}(\pi)\]

and the complementary log-log link:

\[g(\pi) = \ln \left( -\ln \left(1-\pi \right) \right)\]
The most common of these 3 choices is the logit link. Why? There are a few different reasons, but the easiest to explain is that it is the easiest to interpret!
Why is it easy to interpret?
Because it is the log of the odds, which we’ve seen quite a bit in this class, and we’ve interpreted odds already!
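As a quick numeric illustration (a small sketch using base R’s qlogis() and plogis(), the logit and inverse-logit functions), we can check that the logit maps probabilities onto the whole real line and back:

# Probabilities strictly between 0 and 1
p <- c(0.01, 0.25, 0.50, 0.75, 0.99)

# qlogis() is the logit: log(p / (1 - p)) can land anywhere on the real line
qlogis(p)

# plogis() is the inverse logit, recovering the original probabilities
plogis(qlogis(p))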
We can fit the logistic regression model using the glm() function. It can work with either the “raw”, unsummarized data or the summarized version of the data.

Let’s start with the unsummarized version, fg_df, for which glm() needs 3 arguments:

- formula = the formula we are going to use: response_var ~ explanatory_vars
- family = binomial to indicate we want to use the binomial distribution (the family argument accepts other values as well)
- data = the data set you want to analyze

First, check to make sure that the levels of the two variables are in the proper order. The glm() function wants the levels of the response variable to be in failure/success order and the explanatory variable to be in group 2/group 1 order.
levels(fg_df$result)
## [1] "made" "missed"
levels(fg_df$field)
## [1] "dome" "stadium"
Ours are currently in reverse order. If they are reversed, the easiest way to flip the order of a factor’s levels is with fct_rev(), as seen below:
fg_df <-
  fg_df |>
  # Flipping the order of result and field
  mutate(
    result = fct_rev(result),
    field = fct_rev(field)
  )

fg_df |>
  # Checking the results (not needed to flip)
  reframe(
    result_levels = levels(result),
    field_levels = levels(field)
  )
## result_levels field_levels
## 1 missed stadium
## 2 made dome
fg_logit <-
  glm(
    formula = result ~ field,
    family = binomial,
    data = fg_df
  )

# Basic display of the results
summary(fg_logit)
##
## Call:
## glm(formula = result ~ field, family = binomial, data = fg_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.61856 0.07066 22.907 <2e-16 ***
## fielddome 0.11688 0.14762 0.792 0.429
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1696.9 on 1917 degrees of freedom
## Residual deviance: 1696.2 on 1916 degrees of freedom
## AIC: 1700.2
##
## Number of Fisher Scoring iterations: 3
# Display the estimates using the tidy() function from the broom package
tidy(fg_logit)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.62 0.0707 22.9 3.91e-116
## 2 fielddome 0.117 0.148 0.792 4.29e- 1
Our logistic regression model is:
\[\log \left ( \frac{\hat{\pi}_i}{1-\hat{\pi}_i} \right ) = \hat{\alpha} + \hat{\beta}_1 x_i \]
where \(x_i = 1\) if the kick occurred in a dome (group 1) and \(x_i = 0\) if it occurred in a stadium (group 2):
\[\log \left ( \frac{\hat{\pi}_i}{1-\hat{\pi}_i} \right ) = 1.62 + 0.117 x_i\]
What do the estimates in the logistic regression model indicate about the chance a field goal is successful if it is attempted in a stadium vs a dome?
Since made is the outcome coded as the success, the estimates reflect the chance that a field goal is successful.
The intercept, \(\hat{\alpha}\), is the estimated/predicted value when \(x_i = 0\). So what does it mean when \(x_i = 0\)? It means the kick occurred in a stadium!
So \(\hat{\alpha}\) is the estimated log odds that a field goal attempt is successful when it is in a stadium. But what do the log odds indicate?
Not much (at least to us).
Instead, we can convert it back to the normal odds:
\[ e^{\hat{\alpha}} = e^{1.62} \approx 5 \]
So the odds a field goal attempt is successful when it is attempted in a stadium are about 5. Or, for every missed field goal in a stadium, we estimate that 5 field goals are made.
We can also use the odds to estimate the probability of a success:
\[\hat{\pi} = \frac{\textrm{odds}}{1+\textrm{odds}} = \frac{5.05}{6.05} \approx 0.835\]
which is the same as the sample probability of a successful stadium field goal,
\[p = \frac{1211}{1211 + 240} = 0.835\]
If you want to calculate the estimated probability from a logistic regression model, the formula is:
\[\hat{\pi}_i = \frac{e^{\hat{\alpha} + \hat{\beta}_1 x_i}}{1+e^{\hat{\alpha} + \hat{\beta}_1 x_i}}\]
or, since it’s sometimes simpler, find the complement:
\[1 - \hat{\pi}_i = \frac{1}{1+e^{\hat{\alpha} + \hat{\beta}_1 x_i}}\]
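We don’t have to do this conversion by hand: predict() with type = "response" applies the inverse logit for us. A minimal sketch (the newdata values are just the two levels of field in fg_df):

# Estimated success probabilities for each field type
predict(
  fg_logit,
  newdata = data.frame(field = c("stadium", "dome")),
  type = "response"
)

The results should match the sample success rates from the summary table (0.835 and 0.850).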
The slope estimate, \(\hat{\beta}_1 = 0.117\), tells us about the odds of successfully kicking a field goal in a dome rather than a stadium (dome vs stadium, since dome -> 1 and stadium -> 0).
The estimate itself measures the difference in the log odds of success between the two stadium types:
\[\log \left ( \frac{\hat{\pi}_1}{1-\hat{\pi}_1} \right ) - \log \left ( \frac{\hat{\pi}_0}{1-\hat{\pi}_0} \right )= 0.117 \]
Similar to interpreting the intercept, we don’t really interpret the slope estimate as is, but we exponentiate it first, then interpret the result.
What that means is we can use \(\hat{\beta}_1\) to estimate the odds ratio of success between domes vs stadiums by exponentiating it!
\[\hat{\theta} = \exp\left(\hat{\beta}_1\right) = e^{\hat{\beta}_1}\]
For our example:
\[\hat{\theta} = e^{0.1168782} = 1.12389\]
So the estimated odds a field goal attempt is successful are 1.124 times higher when it is attempted in a dome compared to a stadium.
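We can pull that same number straight from the model object by exponentiating the slope coefficient ourselves:

# The estimated odds ratio, computed directly from the fitted model
exp(coef(fg_logit))["fielddome"]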
The tidy() function will exponentiate the estimates for us if we include exponentiate = T:
tidy(fg_logit, exponentiate = T) |>
  # Rounding all non-term columns
  mutate(
    across(
      .cols = -term,
      .fns = ~ round(., 3)
    )
  )
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.05 0.071 22.9 0
## 2 fielddome 1.12 0.148 0.792 0.429
Those values look pretty similar…
odds_ratio_test |>
round(3)
## made_dome missed_field made_field missed_dome odds_ratio SE_LOR test_stat
## 1 397 240 1211 70 1.124 0.148 0.792
## p_val
## 1 0.429
That’s because logistic regression is just another way to estimate and test the odds ratio!
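As one more check of that equivalence, the model-based interval for the odds ratio should agree with the interval we computed by hand earlier. A small sketch using base R’s confint.default(), which builds Wald (estimate ± z × SE) intervals:

# Wald 95% CI for the odds ratio, taken from the logistic regression model
exp(confint.default(fg_logit))["fielddome", ]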
We can also fit a logistic regression model using the summarized data.

To do so, we need a data set with 3 columns:

- the explanatory variable outcomes (one row per group)
- the proportion of successes for each group of the explanatory variable
- the sample size for each group of the explanatory variable

We can get a data set with the summarized data below:
fg_summary <-
  fg_summary |>
  mutate(field = fct_rev(field))

fg_summary |>
  arrange(field)
## field attempts made success_rate
## 1 stadium 1451 1211 0.835
## 2 dome 467 397 0.850
Once we have the summarized data set, we can use glm() again, but with a few updates to the arguments:

- formula = y_prop ~ explanatory_var: the response variable is replaced with the proportion of successes
- weights = the group sample sizes of the explanatory variable: \(n_1\) and \(n_2\)
- the family and data arguments remain unchanged.
# Fitting the model
fg_sum_lr <-
  glm(
    formula = success_rate ~ field,
    weights = attempts,
    family = binomial,
    data = fg_summary
  )

# Displaying the model estimates
tidy(fg_sum_lr, exponentiate = T)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.06 0.0707 22.9 2.55e-116
## 2 fielddome 1.12 0.148 0.766 4.44e- 1
We get the same results as our model using the unsummarized data!
So why fit the model using the summarized version instead of the unsummarized version?
- Sometimes the data comes summarized already!
- There are statistics we can use later to judge how well the model fits the data that we shouldn’t use for models fitted with unsummarized data.

While they aren’t relevant now, we can find the fit statistics using glance(logreg_model) (glance() is also in the broom package):
bind_rows(
  "raw_data" = glance(fg_logit),
  "sum_data" = glance(fg_sum_lr),
  .id = "data_type"
)
## # A tibble: 2 × 9
## data_type null.deviance df.null logLik AIC BIC deviance df.residual
## <chr> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 raw_data 1697. 1917 -848. 1700. 1711. 1696. 1916
## 2 sum_data 0.595 1 -6.53 17.1 14.4 0 0
## # ℹ 1 more variable: nobs <int>
Let’s look at the probability a passenger survived the sinking of the Titanic based on what type of passenger they were.
The code below will create a summarized data set we can use:
t_df <-
  Titanic |>
  data.frame() |>
  # Calculating the joint counts of survived and class
  summarize(
    .by = c(Class, Survived),
    counts = sum(Freq)
  ) |>
  # Calculating the conditional proportions
  mutate(
    .by = Class,
    class_count = sum(counts),
    class_survived = counts / sum(counts)
  ) |>
  # Only the survived rows
  filter(Survived == "Yes") |>
  # Relevant columns
  dplyr::select(Class, class_count, class_survived)

t_df
## Class class_count class_survived
## 1 1st 325 0.6246154
## 2 2nd 285 0.4140351
## 3 3rd 706 0.2521246
## 4 Crew 885 0.2395480
The table shows us the survival rates of passengers depending on what class they were in.
Unlike the previous example, we won’t convert these to dummy variables. The first group listed in the table above will be our reference group (like the choice of 0 for a dummy variable).
The model for a non-binary explanatory variable is:
\[\log\left(\frac{\pi}{1-\pi}\right) = \alpha + \sum_{k = 2}^K\beta_kx_{ik}\]
where \(x_{ik} = 1\) if \(X = k\) and 0 otherwise. For the above model, \(X = 1\) is the reference group.
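We can peek at the dummy coding R will create with model.matrix(): each non-reference level gets its own 0/1 column, and the reference group (First class here) is absorbed into the intercept. A quick sketch:

# The design matrix R builds behind the scenes for Class
model.matrix(~ Class, data = t_df)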
Let’s fit a logistic regression model for the probability a passenger survived:
t_lr <-
  glm(
    formula = class_survived ~ Class,
    weights = class_count,
    family = binomial,
    data = t_df
  )

tidy(t_lr) |>
  dplyr::select(term, estimate) |>
  # Adding the exponentiated values to the table
  add_column(odds_est = tidy(t_lr, exponentiate = T)$estimate)
## # A tibble: 4 × 3
## term estimate odds_est
## <chr> <dbl> <dbl>
## 1 (Intercept) 0.509 1.66
## 2 Class2nd -0.856 0.425
## 3 Class3rd -1.60 0.203
## 4 ClassCrew -1.66 0.189
Notice that First class isn’t listed in the term column. That’s because it is being used as our reference group. What does that mean?

The intercept estimate is the log odds of survival for First class passengers (or the odds, once exponentiated). We can say that the odds a First class passenger survived are 1.664, which corresponds to a survival probability of \(1.664/(1 + 1.664) \approx 0.625\), matching the table above.
The other estimates are the (log) odds that a passenger in that class survived COMPARED to a First class passenger.
They are the odds ratios of each group compared to First class. Since all of the slopes are below 0, the odds of survival for the other passenger types are lower than the odds for a First class passenger.
“The odds a Second class passenger survived is 0.42 times the odds a First class passenger survived”
or
“The odds of survival are 1/0.4246 = 2.355 times higher for a First Class passenger compared to a second class passenger”
It’s important to understand all the estimates are comparing the odds of that class to the odds of a first class passenger (our reference group).
What if we wanted to compare the odds of survival of the 3 passenger types to Crew members?
We can use relevel(x = Factor, ref = "ref group") to change what the reference group is. If you do, you’ll need to rerun the model for the estimates to update:
t_df <-
  t_df |>
  mutate(class = relevel(x = Class, ref = "Crew"))

# Rerunning the model:
t_lr2 <-
  glm(
    formula = class_survived ~ class,
    weights = class_count,
    family = binomial,
    data = t_df
  )

tidy(t_lr2) |>
  dplyr::select(term, estimate) |>
  add_column(odds_est = tidy(t_lr2, exponentiate = T)$estimate)
## # A tibble: 4 × 3
## term estimate odds_est
## <chr> <dbl> <dbl>
## 1 (Intercept) -1.16 0.315
## 2 class1st 1.66 5.28
## 3 class2nd 0.808 2.24
## 4 class3rd 0.0678 1.07
Alternatively, we can calculate the new estimates by hand from the original model:

- new intercept: old intercept + old slope for the new reference category = \(0.509 + (-1.664) = -1.155\)
- new slope: old slope for that category - old slope for the new reference category = \(-0.856 - (-1.664) = 0.808\)
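We can verify this arithmetic directly with the original model’s coefficients (a quick sketch using coef(); the coefficient names match the tidy() output shown earlier):

b <- coef(t_lr)

# New intercept after releveling to Crew: old intercept + old Crew slope
b[["(Intercept)"]] + b[["ClassCrew"]]

# New slope for 2nd class: old 2nd class slope - old Crew slope
b[["Class2nd"]] - b[["ClassCrew"]]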
How we interpret the results remains the same; we’ve just changed the reference group (from First class to Crew).
# Using the tidy() function to calculate the 95% confidence intervals
tidy(
  t_lr2,
  conf.int = TRUE,
  conf.level = 0.95,
  exponentiate = T
) |>
  # Removing 'class' from the term column
  mutate(
    term = str_remove(tolower(term), "class")
  ) |>
  # Removing the intercept row
  filter(term != "(intercept)") |>
  # Creating a graph to display the confidence intervals for the odds ratios
  ggplot(mapping = aes(y = term)) +
  geom_errorbar(
    mapping = aes(xmin = conf.low, xmax = conf.high),
    width = 0.2
  ) +
  geom_point(
    mapping = aes(
      x = estimate,
      color = term
    ),
    show.legend = F,
    size = 3
  ) +
  geom_vline(
    xintercept = 1,
    linewidth = 1,
    linetype = 'dashed'
  ) +
  labs(
    y = "Passenger Class",
    x = "Odds Ratio of Survival vs Crew Members"
  )