STA 112 Lab 5

Goal

It has been a while since we did logistic regression!! We are going to use our time today to review concepts for logistic regression to help prepare you for the final exam. If you have questions about anything, please let me know!

The Data

In this lab, we will return to the data set about the Titanic. Recall that the RMS Titanic was a huge, luxury passenger liner designed and built in the early 20th century. Despite the fact that the ship was believed to be unsinkable, during her maiden voyage the Titanic collided with an iceberg late on April 14, 1912, and sank in the early hours of April 15.

We have information on \(n = 714\) passengers from the Titanic. Unlike last time, today we are interested in modeling \(Y =\):

  • Survived: Whether or not each passenger survived (0 = no, 1 = yes)

In addition to this \(Y\) variable, we have 9 possible explanatory variables:

  • PassengerId: a unique number assigned to each passenger to identify them.
  • Pclass: The class of the ticket held by the passenger; 1 = 1st class, 2 = 2nd class, 3 = 3rd class.
  • Name: the name of each passenger.
  • Sex: the sex of the passenger, limited to male and female.
  • Age: The age of the passenger in years. Note: Decimals are for children under a year in age.
  • SibSp: The number of spouses and/or siblings the passenger had on board.
  • Parch: The number of parents and/or children the passenger had on board.
  • Ticket: the number of their ticket (unique to each passenger)
  • Fare: the cost of the ticket in US dollars.

To load the data, copy and paste the following into a code chunk and press play:

# Load the data 
Titanic <- read.csv("https://www.dropbox.com/scl/fi/26vhy0o0n4zykjvpjgolo/Titanic.csv?rlkey=b4a6ewvjrgvd25gr42tuayoyh&st=yppg7u63&dl=1")

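# Some data cleaning
Titanic$Sex <- as.factor(Titanic$Sex)        # treat Sex as a categorical variable
Titanic$Pclass <- as.factor(Titanic$Pclass)  # treat ticket class as a categorical variable
Titanic <- na.omit(Titanic)                  # drop rows with missing values
Titanic <- Titanic[, -c(11, 12)]             # drop columns 11 and 12, which we do not use in this lab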

Question 1

Which of the following models is possible for us to use given our data and stated research question? Choose all that apply.

  1. an LSLR model
  2. a MR model
  3. a logistic regression model
  4. a multiple logistic regression model

Question 2

We have 9 possible explanatory variables in the data set. However, 3 of these variables cannot be used as explanatory variables in the model.

  1. Identify which 3 variables cannot be used as explanatory variables (do not report the \(Y\) variable!).

  2. Explain why we cannot use these three as explanatory variables. Hint: The reason is the same for all 3 variables.

Model 1: One Categorical X

We are approached by an historian, Historian 1. Historian 1 knows from their work that in the era when the Titanic sank, social conventions would have indicated that women were prioritized over men for space on the life boats. They therefore suggest you start your work in modeling \(Y\) = survival by building a model using \(X\) = whether the passenger was male or female. Historian 1 believes women were more likely to have survived than men.

To build the appropriate regression model in R, we use:

model1 <- glm(Survived ~ Sex, data = Titanic, family = "binomial")

To see the coefficients of the model and format them neatly, we use:

knitr::kable( summary( model1 )$coefficients )
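
As a minimal sketch of how to read coefficients like these, the chunk below fits the same kind of model to a small made-up data set (not the Titanic data). The names `toy`, `group`, and `y` are hypothetical. It shows how R picks a baseline level and how to move from log odds to odds to probabilities.

```r
# Hypothetical toy data: 10 people in group "a" and 10 in group "b"
toy <- data.frame(
  group = factor(rep(c("a", "b"), each = 10)),
  y = c(rep(1, 7), rep(0, 3),   # group a: 7 of 10 have y = 1
        rep(1, 2), rep(0, 8))   # group b: 2 of 10 have y = 1
)

toy_fit <- glm(y ~ group, data = toy, family = "binomial")
b <- coef(toy_fit)

# The level missing from the coefficient names ("a") is the baseline
names(b)

# Log odds for each group
log_odds_a <- unname(b[1])
log_odds_b <- unname(b[1] + b[2])

# Exponentiating the slope gives the multiplicative change in the odds
odds_ratio <- unname(exp(b[2]))

# Convert log odds to probabilities with e^x / (1 + e^x)
p_a <- exp(log_odds_a) / (1 + exp(log_odds_a))
p_b <- exp(log_odds_b) / (1 + exp(log_odds_b))
c(p_a = p_a, p_b = p_b, odds_ratio = odds_ratio)
```

With a single categorical X, the predicted probabilities match the observed proportions in each group (7/10 and 2/10 in this toy data), which makes a handy check on hand calculations.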

Question 3

What is the baseline for sex? How can you tell?

Question 4

Write out the fitted model using appropriate notation. Round to 2 decimal places.

Hint: To format the model properly, copy and paste the following into WHITE SPACE (not a chunk!!!!) in your Markdown file. Fill in what you need after the = but before the $$.

$$\log \left( \frac{\hat{\pi}}{1-\hat{\pi}} \right) = $$

Question 5

Interpret the slope in terms of the log odds.

Question 6

Interpret the slope in terms of the odds.

Question 7

What is the predicted probability that a man survived the Titanic disaster? Show your work.

Question 8

What is the predicted probability that a woman survived the Titanic disaster? Show your work.

Question 9

The historian believed that women were more likely to have survived than men on the Titanic. Based on the model, does it seem as though they are correct?

Model Fit

Once we have a model, the next step is to assess model fit, or how well the model is able to match the patterns in \(Y\) in the data set. With logistic regression, we do this using something called the drop in deviance.

Remember that the deviance essentially measures how poorly the patterns reflected in the model match the patterns in \(Y\). This means a model with a smaller deviance is a better fit to the data.

The largest (worst) deviance we will consider comes from a model using only the intercept:

\[\log \left( \frac{\hat{\pi}}{1-\hat{\pi}} \right) = \hat{\beta}_0\]

Using this model means we assume that no explanatory variables matter - everyone had the same chances of surviving. In other words, this model assumes that the probability of survival for all passengers on the Titanic was:

\[\hat{\pi} = \frac{e^{\hat{\beta}_0}}{1+e^{\hat{\beta}_0}}\]

The deviance of this model is called the null deviance. To find this value in R, we use:

summary(model1)$null.deviance
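
To see what an intercept-only model does, here is a small sketch on made-up 0/1 outcomes (not the Titanic data). The names `y` and `null_fit` are hypothetical. It suggests that the single predicted probability is just the overall proportion of 1s, and that the deviance of this model is the null deviance.

```r
# Hypothetical 0/1 outcomes: 6 of 20 are 1
y <- c(rep(1, 6), rep(0, 14))

# "~ 1" asks for a model with only the intercept
null_fit <- glm(y ~ 1, family = "binomial")
b0 <- unname(coef(null_fit)[1])

# Same formula as above: e^b0 / (1 + e^b0)
pi_hat <- exp(b0) / (1 + exp(b0))
c(pi_hat = pi_hat, sample_proportion = mean(y))

# For this model, the deviance and the null deviance are the same number
c(deviance = summary(null_fit)$deviance,
  null_deviance = summary(null_fit)$null.deviance)
```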

Instead of using a model with only the intercept, we are working with Model 1. This means that we assume the probability of survival is different for men and women. To find the deviance of Model 1, we can use:

summary(model1)$deviance

Question 10

Is the model with only the intercept or Model 1 a better fit to the data? Explain your choice.

If we want to measure how much better a fit one model is over another, we often use the drop in deviance:

\[\frac{\text{Null Deviance} - \text{Deviance(Model 1)}}{\text{Null Deviance}} \times 100\]

The drop in deviance measures the percent improvement in the deviance using Model 1 instead of the model with just the intercept. In other words, it is the improvement in the deviance when we add the sex of the passenger into the model. For example, if the drop in deviance is 30, this means that the deviance improves by 30% when we add sex into the model.
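
The formula above can be sketched in R as follows. This uses a small made-up data set rather than the Titanic data, and the names `toy`, `group`, `y`, and `drop_pct` are hypothetical.

```r
# Hypothetical toy data: 10 people in group "a" and 10 in group "b"
toy <- data.frame(
  group = factor(rep(c("a", "b"), each = 10)),
  y = c(rep(1, 7), rep(0, 3),   # group a: 7 of 10 have y = 1
        rep(1, 2), rep(0, 8))   # group b: 2 of 10 have y = 1
)

toy_fit <- glm(y ~ group, data = toy, family = "binomial")

null_dev  <- summary(toy_fit)$null.deviance  # intercept-only model
model_dev <- summary(toy_fit)$deviance       # model that uses group

# Percent improvement in the deviance from adding group to the model
drop_pct <- (null_dev - model_dev) / null_dev * 100
drop_pct
```

The same three lines work for any fitted `glm` object, so only the model name needs to change when you adapt this.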

Question 11

What is the drop in deviance for Model 1?

Question 12

Interpret the drop in deviance you got in Question 11. Hint: Look right above Question 11 for a template for how to do this.

Model 2: Adding Age

We have completed everything Historian 1 asked of us, and now we are approached by Historian 2. Historian 2 agrees with Historian 1 that social conventions in 1912 would have prioritized women over men for space in the life boats. However, Historian 2 believes the age of the passenger would also have mattered. They request that we add age into the model, too.

  • Age: The age of the passenger in years. Note: Decimals are for children under a year in age.

Historian 2 is asking for a multiple logistic regression model, which just means more than one X variable is in the model. To add a new variable into the model, we just use +.

model2 <- glm(Survived ~ Sex + Age, data = Titanic, family = "binomial")
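
When a model has more than one X, each slope is interpreted holding the other variables fixed. The sketch below illustrates this on simulated data (not the Titanic data); the names `toy`, `group`, `x`, and `fit2` and the numbers used to simulate the data are all made up for illustration.

```r
set.seed(112)  # arbitrary seed so the made-up data are reproducible

# Hypothetical data: one categorical X (group) and one numeric X (x)
n <- 200
toy <- data.frame(
  group = factor(sample(c("a", "b"), n, replace = TRUE)),
  x = round(runif(n, min = 1, max = 70))
)
toy$y <- rbinom(n, size = 1,
                prob = plogis(0.5 - 1.5 * (toy$group == "b") - 0.01 * toy$x))

fit2 <- glm(y ~ group + x, data = toy, family = "binomial")

# e^(coefficient): the multiplier on the odds for a one-unit increase
odds_mult <- exp(coef(fit2))
odds_mult

# Check: same group, x = 30 versus x = 31
nd <- data.frame(group = factor(c("a", "a"), levels = levels(toy$group)),
                 x = c(30, 31))
log_odds <- predict(fit2, newdata = nd, type = "link")
ratio_of_odds <- unname(exp(log_odds[2] - log_odds[1]))
c(ratio_of_odds = ratio_of_odds, exp_slope = unname(odds_mult["x"]))
```

The two numbers in the last line agree because the group is held fixed, so the only thing that changes between the two rows is a one-unit increase in `x`.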

Question 13

Interpret the coefficient for age in terms of the odds.

Question 14

Compute the drop in deviance for Model 2 (versus the intercept only model) and show your work.

It turns out that the deviance for Model 2 is only a tiny bit smaller than the deviance of Model 1. This means that when we add age into the model, we only get a small improvement in the deviance. Is this enough improvement to suggest that adding age into the model improves the model fit?

When we compare logistic regression models with different numbers of coefficients, we use something called the AIC to help us answer this question:

\[\text{AIC} = \text{Deviance(Model)} + 2p\] where \(p\) is the number of \(\beta\) terms in the model, including the intercept.

The key here is the \(+2p\). Because the AIC is based on the deviance, we prefer models with small AIC values. The penalty \(+2p\) means that in order for the AIC to decrease, adding a new \(\beta\) into the model must result in a decrease in the deviance of more than 2. In other words, in order for us to claim adding a new variable into the model improves model fit, the deviance must improve by more than 2.

Question 15

Based on what we have so far, do you think the AIC for Model 1 or Model 2 will be larger?

To check, we can compute the AIC using the following, where we replace p with the number of \(\beta\) terms in the model:

summary(model1)$deviance + 2*p
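
As a hedged sketch of this calculation, the chunk below computes the AIC by hand on simulated data (not the Titanic data) and compares it with R's built-in `AIC()` function. The names `toy`, `x1`, `x2`, `small`, `big`, and `aic_by_hand` are hypothetical. With a 0/1 response fit this way, the hand calculation should agree with `AIC()`.

```r
set.seed(112)  # arbitrary seed so the made-up data are reproducible

# Hypothetical data: y depends on x1, while x2 is pure noise
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, size = 1, prob = plogis(-0.5 + toy$x1))

small <- glm(y ~ x1, data = toy, family = "binomial")
big   <- glm(y ~ x1 + x2, data = toy, family = "binomial")

# AIC by hand: deviance + 2 * (number of beta terms, including the intercept)
aic_by_hand <- function(fit) summary(fit)$deviance + 2 * length(coef(fit))

# Compare with R's built-in AIC() function
c(by_hand = aic_by_hand(small), built_in = AIC(small))
c(by_hand = aic_by_hand(big), built_in = AIC(big))

# How much the deviance fell when x2 was added
# (the AIC only falls if this is more than 2)
dev_drop <- summary(small)$deviance - summary(big)$deviance
dev_drop
```

Here `length(coef(fit))` counts the \(\beta\) terms for you, which can help avoid miscounting when a categorical variable contributes more than one coefficient.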

Question 16

Compute and state the AIC for Model 1 and Model 2. Which model is a better fit to the data?

Model 3: Using Child

Now Historian 3 enters the conversation. They argue that age is important, but specifically think that women and children would have been prioritized in the life boats. They ask us to build a model that uses (1) whether or not a passenger is male and (2) whether or not a passenger is a young child as explanatory variables. They define a young child as being under 10 years in age.

We build their requested Model 3 as follows:

model3 <- glm(Survived ~ Sex + (Age < 10), data = Titanic, family = "binomial")

Note that Age < 10 creates an indicator (TRUE/FALSE) variable: TRUE, treated as 1, means the passenger was a child under 10 years of age, and FALSE, treated as 0, means the passenger was at least 10 years of age.
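
To see what a comparison like this does inside a model formula, here is a small sketch on made-up data (not the Titanic data). The names `toy`, `age`, `y`, and `toy_fit` are hypothetical.

```r
# Hypothetical toy data: 10 people with ages and 0/1 outcomes
toy <- data.frame(
  age = c(0.5, 4, 9, 8, 2, 10, 25, 40, 63, 31),
  y   = c(1,   1, 0, 1, 1, 1,  0,  0,  1,  0)
)

# The comparison returns TRUE/FALSE for each row; age 10 itself is FALSE
is_young <- toy$age < 10
is_young
as.numeric(is_young)  # TRUE becomes 1 and FALSE becomes 0

# Using the comparison directly in the formula
toy_fit <- glm(y ~ (age < 10), data = toy, family = "binomial")
b <- coef(toy_fit)
names(b)  # the slope's name ends in TRUE, so FALSE (age 10 or older) is the baseline

# Predicted probability for someone under 10: intercept plus slope
log_odds_young <- unname(b[1] + b[2])
p_young <- exp(log_odds_young) / (1 + exp(log_odds_young))
p_young
```

In this toy data, 4 of the 5 people under 10 have `y = 1`, so the predicted probability for that group should come out at about 0.8.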

Question 17

Interpret the coefficient for young child (age less than 10 years) in terms of the odds.

Question 18

What is the predicted probability of survival for a female child of age 4? Note: Round all coefficients to 2 decimal places during your calculations.

Question 19

Which model is a better fit to the data: Model 1, Model 2, or Model 3? Justify your response.

Model 4: All The Features

A fourth historian joins the discussion and says that since we have decided on how to handle age, we really should go ahead and put all the explanatory variables we can into the model!

Question 20

Build Historian 4’s requested model by adapting the code below. Remember, just use + to add new variables. Note: You should NOT include age again since we already have Age<10 in the model.

model4 <- glm( Survived ~ Sex + (Age < 10) + ADD MORE HERE, data = Titanic, family = "binomial")

As an answer to this question, show your fitted model output:

knitr::kable( summary(model4)$coefficients )

Question 21

Interpret the coefficient for fare in terms of the odds.

Question 22

Interpret the coefficient for 3rd class in terms of the odds.

Question 23

Historian 4 asks us to describe in clear language what the model tells us about survival on the Titanic. This means we need to say something like: “Based on the model, survival chances were higher for women than for men.” You need to include every variable in the model in your answer.

Question 24

Which model is the best fit to the data: Model 1 (male/female), Model 2 (male/female and age), Model 3 (male/female and whether or not a passenger is a child), or Model 4 (all variables)? Justify your response.

Question 25

Fill out the quick form listed here. You get full points on this question just for filling it out!

https://docs.google.com/forms/d/e/1FAIpQLSdudacOBG-fYLBIG8wEgDoWiy_q_aBJLtzYVXb_OHQCfyWKmA/viewform?usp=publish-editor

References

Creative Commons License
This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2025 December 1.

The data set used in this lab is the Titanic data set, downloaded from Kaggle. Citation: Kaggle. Titanic: Machine Learning from Disaster. Retrieved December 20, 2018, from https://www.kaggle.com/c/titanic/data.