Creating a summarized data set of field goal attempts that we can use to fit the logistic regression model:
fg_sum <-
  fg_df |>
  count(distance, result) |>
  # Conditional prop: success by distance
  mutate(
    .by = distance,
    dist_n = sum(n),
    result_prob = n / sum(n)
  ) |>
  filter(
    dist_n >= 10,
    result == "made"
  ) |>
  rename(made_n = n)

tibble(fg_sum)
## # A tibble: 40 × 5
## distance result made_n dist_n result_prob
## <int> <fct> <int> <int> <dbl>
## 1 19 made 19 19 1
## 2 20 made 36 36 1
## 3 21 made 34 36 0.944
## 4 22 made 44 44 1
## 5 23 made 54 54 1
## 6 24 made 49 51 0.961
## 7 25 made 49 49 1
## 8 26 made 41 43 0.953
## 9 27 made 41 43 0.953
## 10 28 made 54 54 1
## # ℹ 30 more rows
If we want to specify the link to use, we add an additional argument in the family = binomial part of the glm() function:
# GLM with a logit link
fg_logit <-
  glm(
    formula = result_prob ~ distance,
    weights = dist_n,
    family = binomial(link = "logit"),
    data = fg_sum
  )
tidy(fg_logit) |>
  mutate(
    across(
      .cols = -term,
      .fns = ~ round(., 3)
    )
  )
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.92 0.36 16.5 0
## 2 distance -0.103 0.008 -12.8 0
Using our data, our estimated model is:
\[\log \left( \frac{\hat{\pi}_i}{1-\hat{\pi}_i} \right) = \hat{\alpha} + \hat{\beta}x_i\]
\[\log \left( \frac{\hat{\pi}_i}{1-\hat{\pi}_i} \right) = 5.93 -0.103 x_i\]
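To use the fitted equation, we plug a distance into the inverse logit. A minimal sketch (45 yards is just an illustrative distance; the by-hand line uses the rounded coefficients):

# Estimated probability of making a 45-yard attempt, straight from the fitted model
predict(fg_logit, newdata = tibble(distance = 45), type = "response")

# The same calculation by hand with the rounded coefficients; plogis() is the inverse logit
plogis(5.93 - 0.103 * 45)

Both should land somewhere around 0.78 to 0.79.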
Using the output of tidy(), we can do a quick hypothesis test to see if the explanatory variable (distance) has a linear association with the log odds of success for the response variable (field goal result).
\[H_0: \beta = 0 \\ H_a: \beta \ne 0\]
tidy(fg_logit) |>
  mutate(
    across(
      .cols = -term,
      .fns = ~ round(., 3)
    )
  )
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.92 0.36 16.5 0
## 2 distance -0.103 0.008 -12.8 0
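The statistic column is just the estimate divided by its standard error (a Wald z-statistic), and the p-value is the corresponding two-sided normal tail probability. A quick sketch of that calculation from the unrounded tidy() output:

# Wald z-statistic and two-sided p-value computed by hand from the tidy() columns
tidy(fg_logit) |>
  mutate(
    z = estimate / std.error,
    p_value_by_hand = 2 * pnorm(abs(z), lower.tail = F)
  ) |>
  dplyr::select(term, estimate, std.error, z, p_value_by_hand)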
Since the p-value is very close to 0, we can claim that there is very strong evidence that the kicking distance has a linear association with the log odds of success for the field goal attempt.
We can also have tidy() calculate a confidence interval by including conf.int = T and setting conf.level to our desired confidence level (the default is 0.95):
tidy(fg_logit, conf.int = T, conf.level = 0.99) |>
  mutate(
    across(
      .cols = -term,
      .fns = ~ round(., 3)
    )
  )
## # A tibble: 2 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.92 0.36 16.5 0 5.03 6.89
## 2 distance -0.103 0.008 -12.8 0 -0.124 -0.083
Like the last time we had a confidence interval for an odds ratio, we need to exponentiate both ends:
tidy(
  fg_logit,
  conf.int = T,
  conf.level = 0.99,
  exponentiate = T
) |>
  mutate(
    across(
      .cols = -term,
      .fns = ~ round(., 3)
    )
  )
## # A tibble: 2 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 374. 0.36 16.5 0 153. 980.
## 2 distance 0.902 0.008 -12.8 0 0.883 0.921
We can interpret our confidence interval as:
“We are 99% confident that the odds of a successful field goal are reduced by about 8% to 12% for each additional yard of distance.”
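The 8% and 12% come straight from the exponentiated endpoints. A minimal sketch of the arithmetic (the reduction_* columns are just names for this example):

# Percent reduction in the odds per additional yard, from the exponentiated 99% CI
tidy(fg_logit, conf.int = T, conf.level = 0.99, exponentiate = T) |>
  filter(term == "distance") |>
  mutate(
    reduction_low = (1 - conf.high) * 100,  # smallest reduction, about 8%
    reduction_high = (1 - conf.low) * 100   # largest reduction, about 12%
  ) |>
  dplyr::select(term, conf.low, conf.high, reduction_low, reduction_high)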
There are more complicated hypothesis tests that we can perform once we have more terms in the model!
If the data are summarized (multiple observations, at least 5 or so, for each combination of ALL explanatory variables), then there are a couple of tools we can use to determine if our model is appropriate for the data.
In ordinary least squares (OLS) regression, one summary we can use to measure how well the model fits the data is the residual sum of squares (SSE):
\[SSE = \sum_i (y_i - \hat{y}_i)^2\]
Unfortunately, there isn’t a good equivalent for SSE with generalized linear models. Instead, we rely on the deviance. So what is the deviance?
We need two values: \(L_m\), the likelihood of the observed counts using the model's estimated probabilities, and \(L_s\), the likelihood using the sample proportions (the saturated model):
\[L_m = \prod_i \left(\hat{\pi}_i \right)^{y_i} \left(1- \hat{\pi}_i\right)^{n_i - y_i} \]
\[L_s = \prod_i \left(p_i \right)^{y_i} \left(1-p_i\right)^{n_i - y_i}\]
The deviance is 2 times the difference of their natural logs:
\[D = 2\left[\log(L_s) - \log(L_m)\right]\]
The code chunk below will calculate the two log-likelihoods and the deviance:
fg_dev <-
  fg_sum |>
  # Adding the model probabilities column using predict()
  add_column(
    model_prob = predict(object = fg_logit, type = "response")
  ) |>
  # Using dbinom() to calculate the binomial probabilities for each distance group
  # with the model probabilities and the sample probabilities.
  # log = T will take the log of the probabilities, saving us the step of taking the log later
  mutate(
    binom_model = dbinom(x = made_n, size = dist_n, prob = model_prob, log = T),
    binom_sample = dbinom(x = made_n, size = dist_n, prob = result_prob, log = T)
  ) |>
  # Adding up the log probabilities of the binomial using the model and sample probabilities
  summarize(
    L_m = sum(binom_model),
    L_s = sum(binom_sample)
  ) |>
  # Calculating the deviance:
  mutate(Dev = 2*(L_s - L_m))

fg_dev
## L_m L_s Dev
## 1 -82.93149 -60.09996 45.66307
The deviance is a way to measure how much our model probabilities, \(\hat{\pi}_i\), differ from the best possible probabilities: the sample proportions, \(p_i = n_{i1}/n_{i}\).
If the logistic regression model is adequate for the data, then the deviance will follow a \(\chi^2\) distribution with degrees of freedom equal to \(r = r_1 - r_0\), where \(r_1\) is the number of sample proportions (groups) and \(r_0\) is the number of parameters in our model.
Our deviance will follow a \(\chi^2\) with \(r = 40 - 2 = 38\) degrees of freedom.
Instead of calculating all those by hand, we can just use the glance() function:
glance(fg_logit)
## # A tibble: 1 × 8
## null.deviance df.null logLik AIC BIC deviance df.residual nobs
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 255. 39 -82.9 170. 173. 45.7 38 40
# Our results from earlier:
fg_dev
## L_m L_s Dev
## 1 -82.93149 -60.09996 45.66307
The values of interest are logLik, deviance, df.residual, and nobs (number of observations = number of sample proportions), all of which agree with what we found earlier!
We can find the p-value of a hypothesis test that has a null hypothesis that our model probabilities are accurate estimates of the true probabilities:
pchisq(q = fg_dev$Dev, df = 38, lower.tail = F)
## [1] 0.1836896
With a p-value of about 0.18, we don’t have evidence that our model is not adequate for the population probabilities!
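The same test can also be run straight from the glance() columns instead of our hand-built fg_dev; a minimal sketch:

# Goodness-of-fit p-value using the deviance and residual df reported by glance()
glance(fg_logit) |>
  mutate(p_val = pchisq(q = deviance, df = df.residual, lower.tail = F)) |>
  dplyr::select(deviance, df.residual, p_val)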
We can calculate a summary of our GLM that is similar to the \(R^2\) from OLS: a pseudo-\(R^2\).
The \(R^2\) for OLS is:
\[R^2 = 1-\frac{SSE}{SST} = 1-\frac{\sum(y_i-\hat{y}_i)^2}{\sum(y_i-\bar{y})^2}\]
Earlier, we used the model deviance as the equivalent of the \(SSE\). We'll use the model deviance as our facsimile of \(SSE\) again and use the null deviance as the replacement for \(SST\):
\[\tilde{R}^2 = 1-\frac{D_m}{D_0}\]
The tilde indicates that it is a pseudo-\(R^2\).
So what is the null deviance?
The null deviance is similar to the model deviance, but we use the overall probability of success, \(p = n_{\bullet 1}/N\), instead of the model probabilities of success.
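To see where that number comes from, here is a sketch that mirrors the fg_dev calculation but swaps in the overall success proportion (overall_prob and binom_null are just names for this example); it should reproduce the null.deviance that glance() reports below:

# Overall probability of success across all attempts in fg_sum
overall_prob <- sum(fg_sum$made_n) / sum(fg_sum$dist_n)

fg_sum |>
  mutate(
    binom_null = dbinom(x = made_n, size = dist_n, prob = overall_prob, log = T),
    binom_sample = dbinom(x = made_n, size = dist_n, prob = result_prob, log = T)
  ) |>
  # Null deviance = 2 * (log-lik of sample proportions - log-lik of overall proportion)
  summarize(null_dev = 2 * (sum(binom_sample) - sum(binom_null)))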
We can find the null deviance using glance() on the model we already fit!
glance(fg_logit) |>
  mutate(R2 = 1 - deviance/null.deviance)
## # A tibble: 1 × 9
## null.deviance df.null logLik AIC BIC deviance df.residual nobs R2
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int> <dbl>
## 1 255. 39 -82.9 170. 173. 45.7 38 40 0.821
The pseudo-\(R^2\) works similarly to the normal \(R^2\): the closer to 1, the better the model fits the data.
We can find the Pearson residuals and the standardized residuals using the augment_columns() function:
augment_columns(x = fg_logit, data = fg_sum)
## # A tibble: 40 × 12
## distance result made_n dist_n result_prob .fitted .se.fit .resid .hat
## <int> <fct> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 19 made 19 19 1 3.98 0.212 0.841 0.0155
## 2 20 made 36 36 1 3.87 0.205 1.22 0.0301
## 3 21 made 34 36 0.944 3.77 0.197 -1.13 0.0308
## 4 22 made 44 44 1 3.67 0.190 1.49 0.0384
## 5 23 made 54 54 1 3.56 0.182 1.74 0.0479
## 6 24 made 49 51 0.961 3.46 0.175 -0.351 0.0459
## 7 25 made 49 49 1 3.36 0.167 1.83 0.0445
## 8 26 made 41 43 0.953 3.26 0.160 -0.315 0.0393
## 9 27 made 41 43 0.953 3.15 0.153 -0.181 0.0394
## 10 28 made 54 54 1 3.05 0.146 2.23 0.0494
## # ℹ 30 more rows
## # ℹ 3 more variables: .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>
If we want to find the observations where our model doesn’t fit well, we can look for large residuals using a scatterplot:
augment_columns(x = fg_logit, data = fg_sum) |>
  ggplot(mapping = aes(x = .fitted, y = .std.resid)) +
  geom_point() +
  labs(
    y = "Standardized Residuals",
    x = "Estimated log odds"
  )
We want to be on the lookout for any groups that have a standardized residual of at least 3 (in absolute value). Our largest residual is a little larger than 2, so there aren't any distances where our model estimated the probability poorly!
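If we'd rather list the groups than eyeball the plot, a quick sketch (using 2 as the cutoff just to see the biggest residuals) pulls out the distances with the largest standardized residuals:

# Any distance groups with a standardized residual above 2 in absolute value
augment_columns(x = fg_logit, data = fg_sum) |>
  filter(abs(.std.resid) > 2) |>
  dplyr::select(distance, made_n, dist_n, result_prob, .fitted, .std.resid)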
What if \(X\) is too continuous to group the data (i.e., each value of \(x\) is unique), or if the sample size isn't large enough to group the data?
We can still fit the model; we just don't need to specify the weights:
# Getting the same data set:
fg_df2 <-
  fg_df |>
  # Keeping the same rows that are represented in fg_sum
  filter(
    distance %in% fg_sum$distance
  ) |>
  # Dummy coding result
  mutate(result = (result == 'made') * 1)

fg_logit2 <-
  glm(
    formula = result ~ distance,
    data = fg_df2,
    family = binomial
  )

tidy(fg_logit2)
tidy(fg_logit2)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.93 0.360 16.5 5.83e-61
## 2 distance -0.103 0.00801 -12.8 1.39e-37
The parameter estimates will be the same and the standard errors will (essentially) be the same!
bind_rows(
  'grouped' = tidy(fg_logit),
  'ungrouped' = tidy(fg_logit2),
  .id = 'data'
) |>
  arrange(term)
## # A tibble: 4 × 6
## data term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 grouped (Intercept) 5.93 0.360 16.5 5.86e-61
## 2 ungrouped (Intercept) 5.93 0.360 16.5 5.83e-61
## 3 grouped distance -0.103 0.00801 -12.8 1.39e-37
## 4 ungrouped distance -0.103 0.00801 -12.8 1.39e-37
So what’s the benefit of grouped data?
We can use the deviance!
If we have ungrouped data, any method that relies on the deviance shouldn’t be used, like our fit statistics :(
# Comparing the fit statistics
bind_rows(
  'grouped' = glance(fg_logit),
  'ungrouped' = glance(fg_logit2),
  .id = 'data'
)
## # A tibble: 2 × 9
## data null.deviance df.null logLik AIC BIC deviance df.residual nobs
## <chr> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 grouped 255. 39 -82.9 170. 173. 45.7 38 40
## 2 ungrouped 1634. 1891 -713. 1429. 1440. 1425. 1890 1892
So how do we calculate a goodness-of-fit test for ungrouped data? That’s where the Hosmer-Lemeshow test comes in handy!
See the slides for a more in depth explanation of the Hosmer-Lemeshow test.
We get around the limit of having ungrouped data by grouping the data (so we’re kinda cheating).
We group the data by dividing it up into \(G\) roughly equal-sized groups. Typically we choose \(G = 10\).
We group the data based on the predicted probability: the rows with the lowest 10% of estimated probabilities get placed into group 1, the next lowest 10% get placed into group 2, …, and the rows with the largest 10% get placed into group 10.
The easiest way is to use cut_number(col, G), where cut_number() will divide the rows into G groups based on the values in col:
fg_hl <-
  fg_df2 |>
  dplyr::select(distance, result) |>
  # Adding the estimated probabilities
  mutate(
    est_prob = predict(fg_logit2, fg_df2, type = 'response')
  ) |>
  # Arranging from lowest to highest prob (not necessary, but helps visualize)
  arrange(est_prob) |>
  # Creating the groups using cut_number()
  mutate(
    decile_group = cut_number(est_prob, 10, labels = 1:10)
  )

tibble(fg_hl)
## # A tibble: 1,892 × 4
## distance result est_prob decile_group
## <int> <dbl> <dbl> <fct>
## 1 58 1 0.493 1
## 2 58 0 0.493 1
## 3 58 0 0.493 1
## 4 58 1 0.493 1
## 5 58 0 0.493 1
## 6 58 1 0.493 1
## 7 58 0 0.493 1
## 8 58 1 0.493 1
## 9 58 1 0.493 1
## 10 58 0 0.493 1
## # ℹ 1,882 more rows
From here, we calculate the number of successes in each group, \(n_{1g} = \sum_{i} y_{ig}\), the number of failures, \(n_{2g} = \sum_{i} (1-y_{ig})\), the expected successes, \(\hat{\mu}_{1g} = \sum_{i} \hat{\pi}_{ig}\), and the expected failures, \(\hat{\mu}_{2g} = \sum_{i} (1-\hat{\pi}_{ig})\):
fg_hl_sum <-
  fg_hl |>
  summarize(
    .by = decile_group,
    size = n(),
    obs_success = sum(result),
    exp_success = sum(est_prob),
    obs_fail = sum(1 - result),
    exp_fail = sum(1 - est_prob)
  )

fg_hl_sum
## decile_group size obs_success exp_success obs_fail exp_fail
## 1 1 235 146 142.8211 89 92.178895
## 2 2 176 128 125.4859 48 50.514108
## 3 3 172 129 132.4346 43 39.565432
## 4 4 185 152 151.7085 33 33.291489
## 5 5 214 184 185.6049 30 28.395094
## 6 6 174 152 157.1053 22 16.894678
## 7 7 200 185 185.9137 15 14.086289
## 8 8 161 155 152.9547 6 8.045269
## 9 9 186 180 179.4487 6 6.551258
## 10 10 189 187 184.5225 2 4.477489
From there, we can conduct a goodness-of-fit test with
\[\chi^2 = \sum_{g = 1}^G\sum_{i=1}^2\frac{(n_{ig} - \hat{\mu}_{ig})^2}{\hat{\mu}_{ig}}\]
which will follow a \(\chi^2\) distribution with \(\text{df} = G-2\):
fg_hl_sum |>
  # Calculating each row's success and fail component
  mutate(
    success_chi2 = (obs_success - exp_success)^2/exp_success,
    fail_chi2 = (obs_fail - exp_fail)^2/exp_fail
  ) |>
  # Finding the test stat and df
  summarize(
    test_stat = sum(success_chi2) + sum(fail_chi2),
    df = n() - 2
  ) |>
  # Adding the p-value
  mutate(
    p_val = pchisq(q = test_stat, df = df, lower.tail = F)
  )
## test_stat df p_val
## 1 4.622702 8 0.7970353
So we would fail to reject the null hypothesis that the model fits the data!
If we want to use a function to do it instead of doing it by hand, we can use hltest(model) from the glmtoolbox package:
pacman::p_load(glmtoolbox)
hltest(model = fg_logit2)
##
## The Hosmer-Lemeshow goodness-of-fit test
##
## Group Size Observed Expected
## 1 186 109 110.1776
## 2 225 165 158.1294
## 3 172 129 132.4346
## 4 185 152 151.7085
## 5 214 184 185.6049
## 6 174 152 157.1053
## 7 200 185 185.9137
## 8 204 196 194.1948
## 9 197 193 190.7222
## 10 135 133 132.0090
##
## Statistic = 4.84127
## degrees of freedom = 8
## p-value = 0.7744
or hoslem.test(x, y, g) in the ResourceSelection package. Note:
x is the binary success variable, dummy coded as 0/1
y is the estimated probabilities
g is the number of groups to form

pacman::p_load(ResourceSelection)

hl_test <-
  hoslem.test(
    x = fg_df2$result,
    y = predict(fg_logit2, type = 'response'),
    g = 10
  )
# Test results
hl_test
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: fg_df2$result, predict(fg_logit2, type = "response")
## X-squared = 4.6227, df = 8, p-value = 0.797
# Expected counts
hl_test$expected
##
## cutyhat yhat0 yhat1
## [0.493,0.666] 92.178895 142.821105
## (0.666,0.731] 50.514108 125.485892
## (0.731,0.787] 39.565432 132.434568
## (0.787,0.834] 33.291489 151.708511
## (0.834,0.883] 28.395094 185.604906
## (0.883,0.912] 16.894678 157.105322
## (0.912,0.94] 14.086289 185.913711
## (0.94,0.955] 8.045269 152.954731
## (0.955,0.97] 6.551258 179.448742
## (0.97,0.982] 4.477489 184.522511
# Observed counts
hl_test$observed
##
## cutyhat y0 y1
## [0.493,0.666] 89 146
## (0.666,0.731] 48 128
## (0.731,0.787] 43 129
## (0.787,0.834] 33 152
## (0.834,0.883] 30 184
## (0.883,0.912] 22 152
## (0.912,0.94] 15 185
## (0.94,0.955] 6 155
## (0.955,0.97] 6 180
## (0.97,0.982] 2 187
As long as all the expected counts are at least 5 (like any \(\chi^2\) test), we can use the p-value to decide whether the model fits poorly (\(H_a\)) or doesn't fit poorly (\(H_0\)).
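A quick way to scan the expected counts against that guideline, using the hl_test object from above:

# Smallest expected cell count, and a check that every cell meets the "at least 5" guideline
min(hl_test$expected)
all(hl_test$expected >= 5)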
Since the p-value is large for all three implementations, we’ll fail to reject the null hypothesis!