STA 214 Lab 2

Complete all Questions and submit your final PDF or html (either works!) under Assignments in Canvas.

The Goal

In class we have started working with logistic regression models. We have been discussing how to interpret coefficients and how to check conditions. We are going to practice these skills today and learn how to build logistic regression models in R.

The Data

The RMS Titanic was a huge, luxury passenger liner designed and built in the early 20th century. Despite the fact that the ship was believed to be unsinkable, during her maiden voyage on April 15, 1912, the Titanic collided with an iceberg and sank.

The loss of life during the Titanic tragedy was enormous, but there were survivors. Was it random chance that these particular people survived? Or were there some specific characteristics of these people that led to their positions in the life boats? Today, we’re going to build a model to find out.

The data for today can be downloaded from Canvas. We have $n = 891$ rows and 12 different variables. Some of the variables are categorical and some are numeric. The meaning of each variable is as follows:

Passenger: A unique ID number for each passenger.
Survived: An indicator for whether the passenger survived (1) or perished (0) during the disaster.
Pclass: Indicator for the class of the ticket held by the passenger; 1 = 1st class, 2 = 2nd class, 3 = 3rd class.
Name: The name of the passenger.
Sex: Binary Indicator for the biological sex of the passenger.
Age: Age of the passenger in years; Age is fractional if the passenger was less than 1 year old.
SibSp: number of siblings/spouses the passenger had aboard the Titanic. Here, siblings are defined as brother, sister, stepbrother, and stepsister. Spouses are defined as husband and wife.
Parch: number of parents/children the passenger had aboard the Titanic. Here, parent is defined as mother/father and child is defined as daughter,son, stepdaughter or stepson. NOTE: Some children traveled only with a nanny, therefore parch=0 for them. There were no parents aboard for these children.
Ticket: The unique ticket number for each passenger.
Fare: How much the ticket cost in US dollars.
Cabin: The cabin number assigned to each passenger. Some cabins hold more than one passenger.
Embarked: Port where the passenger boarded the ship; C = Cherbourg, Q = Queenstown, S = Southampton

Our goal is to build a model to help explore what characteristics were related to whether or not a passenger survived the disaster, so our $Y$ variable is Survived.

Model 1

For our first model, we are going to see if there is a relationship between the sex of the passenger ($X_i$) and survival ($Y_i$). In this data set, sex is recorded as male or female. We want to know if male or female individuals were more likely to survive, and if there is a difference, how big is the difference?

Note that $Y_i = 1$ indicates survival.

Checking Conditions

Question 1

There are three conditions we need to check before we can use a logistic regression model with these two variables. What are they?

Survival is a random variable and it is binary, so the first condition is met. What about the second and third?

Question 2

To check the second condition, we need to make a table. Create and show a professionally formatted table to check the 2nd condition. Based on your table, is the second condition met? Explain.

Question 3

Create and show a professionally formatted table to check the 3rd condition. Based on your table, is the third condition met? Explain.

Visualizing the relationship

One other step we might consider before building a model is to visualize the relationship between X and Y. This is not strictly necessary, especially when we are building a model with only one X. However, when we start working with data with multiple X variables, it is very helpful to explore visually whether there seems to be a relationship between X and Y. This can help us decide which variables we want might want to include in the model.

With a categorical X and a categorical Y, we use a mosaic plot to visualize the relationship between the two variables. We talked about these in class, but here is the code:

We first need to load two libraries.

library(ggplot2)
library(ggmosaic)

Question 4

When you try to run the code above, you will likely get an error message that says:

Error in library(ggmosaic) : there is no package called ‘ggmosaic’

Seeing an error message that says there is no package called means that this function is not something installed into your RStudio. You need to therefore install the package.

Briefly, describe what steps you need to take to install a package in R.

Once you have the libraries loaded, there is one more step you need to take before you can create the mosaic plot. Our Y variable is recorded in our data set as 0s and 1s. This is fine, but mosaic plots require two categorical variables to work. Our brains know that survival is categorical, but the computer sees 0s and 1s and thinks that the variable is numeric. To check this, run this code:

class(Titanic$Survived)

This returns integer, which means the computer thinks the variable is a numeric variable. To convert to a categorical variable, we can use the code as.factor():

Survival <- as.factor(Titanic$Survived)

Now, we are ready to create the plot!

# Create the plot 
ggplot(data = Titanic) +
  geom_mosaic(aes(x = product(Survival, Sex), fill = Survival)) +   
  labs(x="Passenger Sex", y="Survival", title = "Figure 1",
  caption = "Mosaic plot of survival by sex") +
  theme_mosaic() +
  theme(legend.position = "none")

Question 5

Based on Figure 1, does it look like there is a relationship between sex and survival? Explain.

Fitting Model 1

Now that we have checked all of our conditions, it is time to actually fit the model. Remember, fitting the model means we need to estimate $\beta_0$ and $\beta_1$.

The code we need to fit a logistic regression model is:

Model1 <- glm( Survived ~ Sex, data = Titanic, family = "binomial" )

To see the estimated coefficients, we use:

summary(Model1)$coefficients

Question 6

This output is raw R output. Format the output professionally.

Note: You do not need to change the column names.

Question 7

Write down the fitted model in log odds form. Round to 2 decimal places.

To do this, paste the following into the white space in RMarkdown, and adapt to reflect your fitted model : $$log \left( \frac{\hat{\pi_i}}{ 1- \hat{\pi}_i} \right) =$$.

Question 8

Interpret $\hat{\beta}_1$ in log odds form.

Question 9

Interpret $\hat{\beta}_1$ in odds form.

Question 10

Look at the first row in the data set. For this individual, what is their predicted probability of survival $\hat{\pi}_i$? Show your work.

Hint: To do $e^4$ in R, we use exp(4). To do $\frac{e^{4}}{1 + e^4}$, we use exp(4)/( 1 + exp(4)).

At this point, we have practiced everything we have learned so far for logistic regression with a categorical X. Let’s try using a numeric X.

Model 2

For our second model, we are going to see if there is a relationship between the cost of the ticket ($X_i$) and survival ($Y_i$). In other words, were individuals who paid more for their ticket more likely to survive, and if so, by how much?

Question 11

There are three conditions we need to check before we can use a logistic regression model with a numeric X. Which of these are the same conditions we use for a model with a categorical X? In other words, which conditions have we already checked?

Creating an Empirical Log Odds Plot

One of the conditions we need when X is numeric is the shape condition. This means that we assume a certain shape of the relationship between X and the log odds of survival. To see what shape might be reasonable, we need to build an empirical log odds plot.

We are going to use a function in R to produce the empirical log odds plot. To do so, create an R chunk, copy the LONG function below, paste it into an R chunk in your Markdown, and run it. You will notice that nothing seems to happen, but if you check the top right panel of your RStudio session, you should notice that a new function called get_logOddsPlot has appeared. R allows individual users to create functions as needed, and in essence you have just “taught” R a new function.

get_logOddsPlot <- function(x, y, xname, yname, chosentitle, formulahere){
  sort = order(x)
  x = x[sort]
  y = y[sort]
  a = seq(from = 1, to = length(x), by=.05*length(y))
  a = round(a)
  b = c(a[-1] - 1, length(x))
  
  prob = xmean = rep(0, length(a))
  for (i in 1:length(a)){
    range = (a[i]):(b[i])
    prob[i] = mean(y[range])
    xmean[i] = mean(x[range])
  }
  
  prob[prob == 0] = .001
  prob[prob == 1] = .99

  g = log(prob/(1-prob))
  
  dataHere <- data.frame("x" = b, "LogOdds" = g)
  
  suppressMessages(library(ggplot2))
  
  ggplot(dataHere, aes(x =xmean, y = LogOdds)) +
    geom_point() + 
    geom_smooth(formula = formulahere, method = "lm", se = FALSE) + 
    labs(x = xname, y = yname, title = chosentitle)
}

To use the function, you will need to specify the inputs, meaning the pieces of information the computer needs in order to run the code.

x: the column containing the x variable (Titanic$Fare)
y: the column containing the y variable (Titanic$Survived)
xname: The label you want on the x axis of the graph; ex: “Fare”
yname: The label you want on the y axis of the graph; ex: “Log Odds of Survival”
chosentitle: The title you want on the graph
formula: The shape of the relationship you want to plot. If you want a line, this is y ~ x. If you want a 3rd order polynomial, you use y ~ poly(x,3)

Here is some of the code to create the plot. You will need to fill in the gaps.

get_logOddsPlot(x= , y=, xname = "Ticket Cost",
  yname = "Log Odds of Survival", chosentitle = " ",
  formulahere = y ~ x)

Question 12

Create the empirical log odds plot by completing the code above. Do you feel comfortable claiming the shape of the relationship between ticket price and the log odds of survival is linear?

Non-linear shapes

If the empirical log odds plot indicates that the relationship is not linear, this does not mean we cannot use logistic regression. It just means we have to change the right hand side of our logistic regression model to reflect a shape other than a line.

There are two major non-linear shapes we will see in this course:

1. Natural Logs
1. Polynomials

We choose a natural log of X if the relationship between $X$ and the log odds of survival does not change direction. This means that relationship is always increasing or always decreasing, but the steepness of that increase or decrease changes across the $X$ axis.

We choose a polynomial for X if the relationship between $X$ and the log odds of survival changes direction. This means that relationship may start off as increasing and then changes to decreasing at some point on the X axis, or vice versa. We need one power added to our polynomial for every time the direction changes. If it changes once, we need a 2nd order polynomial (a quadratic). If the direction changes twice, we need a 3rd order polynomial (a cubic).

In this case, the relationship does not change direction, so we want to consider using the following model:

\[Y_i \sim Bernoulli(\pi_i)\] \[log \left( \frac{{\pi_i}}{ 1- {\pi}_i} \right) = \beta_0 + \beta_1 log(Fare_i)\]

Question 13

Adapt the formula part of your empirical log odds plot to reflect the new choice of shape. Adapt any labels as necessary. Show the plot.

Hint: This just means we want to use log(x) instead of x.

Question 14

Does using the log of Fare seem to be a better shape for the fit of the relationship than the line?

Now ideally, we want to be able to determine what is a better fit by using more than just our eyes. In other words, we want to have some kind of numeric quantity that allows us to compare models. We will get to that shortly!

Fitting Model 2

Now that we have chosen the shape of our model, let’s fit the model in R!

To fit this model, we need to adapt our code slightly:

Model2 <- glm( Survived ~ log(Fare), data = Titanic, family = "binomial")

Notice where the log is placed!!

If you try to run the code above, you will get an error:

Error in glm.fit(x = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, : NA/NaN/Inf in 'x'

This is essentially telling us we have done something impossible. But what??

Question 15

We are taking the log of ticket price in this model. We can only take the log of values that are greater than what value?

Question 16

Do we have any ticket prices equal to or below that value? To check, run a summary.

summary(Titanic$Fare)

As we will start to see in this course, these details about the data have large impacts on the types of models that we can fit. This is why taking the time to explore the data using tables and summaries and graphs before we fit any models is so important!!!

Luckily, in this case, the fix is easy. We are going to add one penny ($.01) to all the ticket prices that are 0. Basically, we make their price a penny rather than 0, since 0 will not work with our model.

Titanic$Fare <- ifelse(Titanic$Fare == 0, .01, Titanic$Fare)

With this corrected, we can now fit the model.

Model2 <- glm( Survived ~ log(Fare), data = Titanic, family = "binomial")

And the error is gone!

Question 17

Write down the fitted model in log odds form. Round to 2 decimal places.

To do this, paste the following into the white space in RMarkdown, and adapt to reflect your fitted model : $$log \left( \frac{\hat{\pi_i}}{ 1- \hat{\pi}_i} \right) =$$.

Question 18

Interpret $\hat{\beta}_1$ in log odds form.

Interpreting with Logs: Odds Form

With our new model, we have logs on both sides of the fitted model.

\[log \left( \frac{{\hat{\pi}_i}}{ 1- \hat{\pi}_i} \right) = \hat{\beta}_0 + \hat{\beta}_1 log(X_i)\]

When we have this, we actually get a really nice interpretation of $\hat{\beta}_1$: If $X_i$ increases by 1%, we predict the odds that $Y_i = 1$ increases by $\hat{\beta}_1$%.

Question 19

Using the statement above as a template, interpret $\hat{\beta}_1$ in odds form.

Question 20

Look at the first row in the data set. For this individual, what is their predicted probability of survival $\hat{\pi}_i$ based on Model 2? Show your work.

Hint: To do $e^4$ in R, we use exp(4). To do $\frac{e^{4}}{1 + e^4}$, we use exp(4)/( 1 + exp(4)).

Next Steps

Today we have practiced using R to build logistic regression models. We have also practiced checking conditions and interpreting our coefficients when our X is categorical or when X is numeric.

However, are we stuck with just using one X? And can we actually use this model to predict whether a particular person survived on the Titanic? We’ll explore all of this coming up.

References

This work was created by Nicole Dalzell is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2024 January 24.

The data set used in this lab is the Titanic data set, downloaded from Kaggle. Citation: Kaggle. Titanic: Machine Learning from Disaster Retrieved December 20, 2018 from https://www.kaggle.com/c/titanic/data.