Titanic Data

Logistic Regression: Non-binary response

Let’s look at the probability a passenger survived the sinking of the Titanic based on what type of passenger they were.

The code below will create a summarized data set we can use:

t_df <- 
  Titanic |> 
  data.frame() |> 
  # Calculating the joint counts of survived and class
  summarize(
    .by = c(Class, Survived),
    survived = sum(Freq)
  ) |> 
  # Calculating the conditional proportions
  mutate(
    .by = Class,
    class_count = sum(survived),
    class_survival = survived/sum(survived),
    class = relevel(x = Class, ref = "Crew")
  ) |> 
  # Only the survived rows
  filter(Survived == "Yes") |> 
  # Relevant columns
  dplyr::select(class, class_count, survived, class_survival)

t_df
##   class class_count survived class_survival
## 1   1st         325      203      0.6246154
## 2   2nd         285      118      0.4140351
## 3   3rd         706      178      0.2521246
## 4  Crew         885      212      0.2395480

The table shows us the survival rates of passengers depending on what class they were in.

Unlike the previous version, we won’t convert these to dummy variables. The first level listed in the table above will be our reference group (like the choice of 0 for a dummy variable)

levels(t_df$class)
## [1] "Crew" "1st"  "2nd"  "3rd"

Fitting the model

The model for a non-binary explanatory variable is:

\[\log\left(\frac{\pi}{1-\pi}\right) = \alpha + \sum_{k = 2}^K\beta_kx_{ik}\]

where \(x_{ik} = 1\) if \(X = k\) and 0 otherwise. For the above model, \(X = 1\) is the reference group.

Let’s fit a logistic regression model for the probability a passenger survived:

t_lr <- 
  glm(
    formula = class_survival ~ class,
      weights = class_count,
      family = binomial,
      data = t_df
    )

tidy(t_lr, exponentiate = T)
## # A tibble: 4 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    0.315    0.0788   -14.7   1.05e-48
## 2 class1st       5.28     0.139     12.0   4.97e-33
## 3 class2nd       2.24     0.144      5.62  1.91e- 8
## 4 class3rd       1.07     0.117      0.579 5.62e- 1

Testing if survival rates differ by passenger class

If \(X\) is quantitative or binary, we can use the p-value returned by tidy() to see if there is a statistically significant relationship between \(X\) and \(Y\).

In the Titanic example, \(X\) is a multinomial variable with four groups. With four groups, our model with have 3 \(\hat{\beta}\)s and each \(\hat{\beta}\) will have a p-value. So which one do we use for our significance test?

None of them! Instead, we need to do a comparison of the factor in the model vs the factor excluded from the model:

\[(XY) \text{ vs } (X,Y)\]

For our models, it is comparing \(\hat{\pi}_i\) vs \(\hat{\pi}\): a model with 4 different probabilities \((XY)\) vs a null model with a constant probability \((X, Y)\).

We compare them with:

\[-2\left[\ell(\pi)) - \ell(\pi_i|x) \right]\]

# Calculating the overall survivor rate:
overall_survival <- sum(t_df$survived) / sum(t_df$class_count)
  
overall_survival
## [1] 0.323035

Now that we have the estimated probabilities for both models, we can calculate the loglikelihoods by hand:

\[\sum(\log(p^y (1-p)^{n-y}))\]

t_df |> 
  mutate(
    # Log prob of each class using overall survival rate: p^y( 1-p)^(n-y)
    null_log_prob = log(
      overall_survival^survived * (1-overall_survival)^(class_count - survived)
    ),
    
    # Log prob of each class using class survival rate: p_i^y (1-p_i)^(n-y)
    model_log_prob = log(
      class_survival^survived * (1-class_survival)^(class_count - survived)
    )
  ) |> 
  
  # Finding the log-likelihood
  summarize(
    ell_null  = sum(null_log_prob),
    ell_model = sum(model_log_prob)
  ) |> 
  # Calculating the deviance
  mutate(
    loglik_diff = (ell_null - ell_model)
  )
##    ell_null ell_model loglik_diff
## 1 -1384.728 -1294.278   -90.45068

Alternatively, you can calculate the loglikelihood of a model using logLik(object)

# Null model: Need to fit a logit reg model with only an intercept
null_logit <- 
  glm(
    formula = class_survival ~ 1,
    data = t_df,
    weights = class_count,
    family = binomial
  )

# Loglikelihood of null model
null_loglik <- logLik(null_logit)

# Fitted model loglik
fitted_loglik <- logLik(t_lr)

c(
  'null' = null_loglik, 
  'model' = fitted_loglik, 
  "difference" = null_loglik - fitted_loglik
)
##       null      model difference 
## -103.40092  -12.95023  -90.45068

While the values for the loglikelihoods are different, the difference between them is the same!

Test: difference of deviances

A different way we can calculate the test statistic for two models (null and fitted models) is by using the difference in the null deviance and model deviance:

glance(t_lr) |> 
  # null dev - model dev
  mutate(test_stat = null.deviance - deviance) |> 
  dplyr::select(null.deviance, deviance, test_stat)
## # A tibble: 1 × 3
##   null.deviance deviance test_stat
##           <dbl>    <dbl>     <dbl>
## 1          181. 2.62e-14      181.

This one is positive and twice the size as the difference in loglikelihoods. Why? Because the deviance is:

\[-2\left[ \ell(\hat{\pi}_i|x) - \ell(\hat{p}_i|x) \right]\]

where \(\hat{\pi}_i\) is the estimated probability using the model (either null or fitted model).

We have a test statistic of about 181. How do we find the p-value?

The difference will follow a \(\chi^2\) distribution with \(df = k - 1\), where \(k\) is the number of parameters in the fitted model. The reason is that there are \(k\) parameters in the fitted model and 1 parameter in the null model, so the degrees of freedom are the difference in the parameters!

glance(t_lr) |> 
  mutate(
    test_stat = null.deviance - deviance,
    p_val = pchisq(test_stat, 3, lower.tail = F)
  ) |> 
  dplyr::select(null.deviance, deviance, test_stat, p_val)
## # A tibble: 1 × 4
##   null.deviance deviance test_stat    p_val
##           <dbl>    <dbl>     <dbl>    <dbl>
## 1          181. 2.62e-14      181. 5.63e-39

Testing for significance with anova()

If \(X\) is a multinomial variable, you can use the anova() function to find the overall p-value for \((XY) \text{ vs } (X,Y)\)

anova(t_lr)
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: class_survival
## 
## Terms added sequentially (first to last)
## 
## 
##       Df Deviance Resid. Df Resid. Dev
## NULL                      3      180.9
## class  3    180.9         0        0.0

Note: It doesn’t give you a p-value, but it does supply the test statistic and degrees of freedom that we can use to find a p-value!

Goodness of fit text

We could test to see how well the model fits by finding the loglikelihood of the fitted model and saturated model using glance() again:

glance(t_lr)
## # A tibble: 1 × 8
##   null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
## 1          181.       3  -13.0  33.9  31.4 2.62e-14           0     4

The deviance is essentially 0. But why?

If we are working with grouped data and \(X\) is a factor, then each row has a unique probability: \(p_i = \frac{n_i}{N}\)

The saturated model using each row’s proportion: \(p_i = \frac{n_i}{N}\) which are the same!

From chapter 2, if \(X\) and \(Y\) are both categorical, there are only two models: independent \((X,Y)\) and saturated \((XY)\)!