STA 214 Lab 4

Complete all Questions and submit your final PDF or html (either works!) under Assignments in Canvas.

The Goal

We have been working on logistic regression models in class, and today we are going to practice working with them!

The Data

The RMS Titanic was a huge, luxury passenger liner designed and built in the early 20th century. Despite the fact that the ship was believed to be unsinkable, during her maiden voyage on April 15, 1912, the Titanic collided with an iceberg and sank.

The loss of life during the Titanic tragedy was enormous, but there were survivors. Was it random chance that these particular people survived? Or were there some specific characteristics of these people that led to their positions in the life boats? Today, we’re going to build a model to find out.

The data for today can be downloaded from Canvas. We have $n = 891$ rows and 12 different variables. Some of the variables are categorical and some are numeric. The meaning of each variable is as follows:

Passenger: A unique ID number for each passenger.
Survived: An indicator for whether the passenger survived (1) or perished (0) during the disaster.
Pclass: Indicator for the class of the ticket held by the passenger; 1 = 1st class, 2 = 2nd class, 3 = 3rd class.
Name: The name of the passenger.
Sex: Binary Indicator for the biological sex of the passenger.
Age: Age of the passenger in years; Age is fractional if the passenger was less than 1 year old.
SibSp: number of siblings/spouses the passenger had aboard the Titanic. Here, siblings are defined as brother, sister, stepbrother, and stepsister. Spouses are defined as husband and wife.
Parch: number of parents/children the passenger had aboard the Titanic. Here, parent is defined as mother/father and child is defined as daughter,son, stepdaughter or stepson. NOTE: Some children traveled only with a nanny, therefore parch=0 for them. There were no parents aboard for these children.
Ticket: The unique ticket number for each passenger.
Fare: How much the ticket cost in US dollars.
Cabin: The cabin number assigned to each passenger. Some cabins hold more than one passenger.
Embarked: Port where the passenger boarded the ship; C = Cherbourg, Q = Queenstown, S = Southampton

Our goal is to build a model to help explore what characteristics were related to whether or not a passenger survived the disaster, so our $Y$ variable is Survived. Because $Y$ is a binary random variable, we will be building a logistic regression model.

As we discussed in class, there are three steps to building a logistic regression model.

Step 1: Choose a reasonable distribution for $Y_i$
Step 2: Build a model for the parameters of interest
Step 3: Estimate the parameters of interest

We will work through each of these steps with these data.

EDA

Whenever we have a variable that we want to model, it is helpful to begin by visualizing the variable in question. This helps us to check for missing data and it also helps us see what values are possible for the variable. Because we are dealing with a binary random variable, we tend to use a bar plot or a table.

There are several ways to make tables in R, but we will discuss two. The first is very direct. We tell R we want to use the Titanic data set and the variable Survived, by using the code Titanic$Survived. Then, we use the table(whatWeWantToMakeATableWith) command to actually make the table.

table(Titanic$Survived)

However, this makes a table that is not particularly pretty or professional when you knit. A second option that does create professional tables is:

knitr::kable(table(Titanic$Survived), col.names=c("Survival", "Number of People") )

The code is more complex, but the heart of it is the same table. This table will not look very pretty when you press play, but go ahead and knit. See how nicely the table gets formatted?

Question 1

When we model a binary random variable, it is helpful if at least 10% of the data is in each category. In other words, we do not want 99% of the Y variable to be 0 and 1% to be 1. Is this condition satisfied for the variable Survived?

Question 2

Create a bar graph (bar plot) to explore the distribution of Survived. Make sure to label your axes and title your plot Figure 1.

Hint: We learned about plots in Lab 1, so flip back to that if you need a refresher.

Step 1: Distribution of the Response variable

Now that we have done our EDA, we are ready to start modeling!

The first step in building our logistic regression model is deciding on a distribution that is reasonable to use to model the response variable $Y$. For us, this variable indicates whether or not an individual survived the disaster.

Doing things like writing down distributions often involve parameters. Parameters in statistics are generally represented as Greek letters. How can we write Greek letters and other mathematical notation in our Markdown file so they show up when we knit?

If you want to write mathematical notation, we need to tell Markdown, “Hey, we’re going to make a math symbol!” To do that, you use dollar signs. For instance, to make $\hat{\beta}_1$, you simply put $\hat{\beta}_1$ into the white space (not a chunk) in your Markdown.

Go ahead and do that. See how the dollar signs change colors? Also note that if you hover your mouse over what you just pasted, the mathematical symbol we want will appear.

If you want the symbol to appear on its own line in your Markdown, you need to put two $ signs at the beginning and end of the line (so $$). Try that now.

The same thing works for other mathematical symbols. Let’s say I want to write out a regression line. The code is $\widehat{Y_i} = \hat{\beta}_0 + \hat{\beta}_1 X_i$ .

We’ll notice that we used \hat for the $\hat{\beta}_1$ but \widehat over $Y_i$. Why? Because we usually replace $Y_i$ with a longer word, so it needs a bigger (wider) hat.

A word of caution. You must make sure that both the dollar sign at the beginning and end of your mathematical expression is touching text. In other words, $\hat{y}$ will knit just fine but $\hat{y} $ will yield an error. This is important. Your document will not knit if you forget! Now, you can put spaces inside, meaning that $\hat{y} = 4$ is fine, but the beginning and end can have NO spaces.

Question 3

Write down the distribution that might be appropriate to use to model $Y_i$ using mathematical notation. To get started, copy and paste into the white space $$Y_i \sim$$. The \sim make the ~ shape. Now, all you need to do is add in the distribution you have chosen.

Example: $Y_i \sim Normal( \mu, \sigma).$

Exploring the Likelihood

A distribution tells us what values of $Y_i$ are possible and how often we expect those values to occur. Describing how often we expect values to occur usually involves some sort of parameter. Some distributions have one parameter, and some have more than one.

Question 4

How many parameters do we need to describe a normal distribution? What does each parameter tell us about the distribution?

Question 5

How many parameters do we need to describe the distribution you have selected for $Y_i$? What does each parameter tell us about the distribution?

Once we have parameters, we can do things like interpret, and predict, and compute likelihoods. Because they are so useful, a huge focus in statistics is on estimating these parameters as accurately as possible.

We are going to talk about how to estimate parameters in future classes. For now, I’m going to give you a few choices and we’ll explore.

Question 6

Look at the first 5 rows in the data set. Compute the likelihood that $\pi_i =.4$ for the first five rows and show your work. Interpret the value you get (meaning tell me what it means in a complete sentence).

Question 7

Now, suppose $\pi_i =.1$ for all i. Compute the likelihood for the first five rows, and show your work.

Question 8

Which of these two choices of $\pi_i$ do you think is the most reasonable for these data? Explain.

Question 9

Based on your choice of $\pi_i$ , what is the likelihood of the entire data set? Hint: The table of the response variable you made earlier will be helpful here. (Note: Your answer will be very small!)

Hmm. We have done our calculations correctly, but we notice that our likelihood is very tiny. Why??

Our likelihood is made up of multiplying together 891 values that are between 0 and 1. Doing this produces a tiny number!! Because of this, unless the data set is very small, we generally get very small values for the likelihood.

But didn’t we define the likelihood because it will be a cool tool to help us evaluate how reasonable a parameter estimate might be? Yes, and we will still use it that way. However, to combat the fact that likelihoods tend to be very small for all but the smallest data sets, we often take the natural log of the likelihood to make the values easier to compare. This is called the log-likelihood.

Question 10

Look back at Questions 6 and 7. What are the log-likelihood values for each of these choices of $\pi_i$?

You will notice that the smallest of the two likelihoods is also the smallest of the two log-likelihoods. This means that we can use the log-likelihood to compare estimates, or use the likelihood, and we will come to the same conclusion.

Question 11

In Question 9 you computed the likelihood of your choice $\pi_i$ using the entire data set. What is the corresponding log-likelihood?

So, in practice we usually use the log-likelihood rather than the likelihood. This is because by taking the natural log we get values that are larger and easier to compare.

There is also another tool that we use, and this is called the deviance. The deviance is

\[Deviance = -2~log(likelihood) = -2 ~ log(\mathcal{L}(\pi|Y))\]

Question 12

Look back at Questions 6 and 7. What is the deviance for each of these choices of $\pi_i$?

You will notice that the largest of the two likelihoods is associated with the smallest deviance. This means that while we prefer values with large likelihoods and log likelihoods, we prefer values with smaller deviance values.

Now, so far we have only explored two possible choices for the probability of survival. We can figure out which of the two are more likely, but there may be other choices of $\pi_i$ that give us an even higher likelihood. We could keep trying different values for the parameter, to see what the most reasonable value might be. However, there is a more streamlined way to do this. This is something we will be working on next class!

Step 2: Construct a Model for the Parameters

Okay, we have chosen a distribution for $Y_i$. Now, let’s move on Step 2: Constructing a model for the parameters.

When we were computing the likelihoods in the previous section, we chose a single value of $\pi_i$ to represent the probability of survival for everyone in the data set. This means that we assume that every single person on the ship has the same chances of surviving the disaster.

Question 13

Think about the realities of trying to get to life boats during a ship disaster. Do you think it is reasonable to assume that the probability of surviving was the same for every person on the ship? If not, explain what might have contributed to the chances of surviving.

Note: There is no right answer for this second part, we are just brainstorming right now. Be creative!

Instead of assuming every person has the same probability of success, let’s choose a model for $\pi_i$ that allows the probability to be different for different people, depending on some characteristic of that person. There are a lot of possible explanatory variables we could use, but let’s start with $X_i = Fare_i$, the cost of the ticket for each passenger.

If we assume that the probability differs depending on fare, our next step is to see how these two things (probability of survival and fare) might be connected so we can choose a reasonable model.

EDA: Fare vs. Survival

Since we are choosing $X_i = Fare_i$, our next step is going to be to explore the relationship between fare and the log odds of survival. We are going to do that using a new kind of plot called an empirical log odds plot.

We are going to use a function in R to produce the empirical log odds plot. To do so, create an R chunk, copy the LONG function below, paste it into an R chunk in your Markdown, and run it. You will notice that nothing seems to happen, but if you check the top right panel of your RStudio session, you should notice that a new function called get_logOddsPlot has appeared. R allows individual users to create functions as needed, and in essence you have just ``taught” R a new function.

get_logOddsPlot <- function(x, y, xname, formulahere){
  sort = order(x)
  x = x[sort]
  y = y[sort]
  a = seq(1, length(x), by=.05*length(y))
  b = c(a[-1] - 1, length(x))
  
  prob = xmean = ns = rep(0, length(a)) # ns is for CIs
  for (i in 1:length(a)){
    range = (a[i]):(b[i])
    prob[i] = mean(y[range])
    xmean[i] = mean(x[range])
    ns[i] = b[i] - a[i] + 1 # for CI 
  }
  
  extreme = (prob == 1 | prob == 0)
  #prob[prob == 0] = min(prob[!extreme])
  #prob[prob == 1] = max(prob[!extreme])
  prob[prob == 0] = .01
  prob[prob == 1] = .99
  
  g = log(prob/(1-prob))
  
  dataHere <- data.frame("x" = b, "LogOdds" = g)
  
  suppressMessages(library(ggplot2))
  
  ggplot(dataHere, aes(x =xmean, y = LogOdds)) + geom_point() + geom_smooth(formula = formulahere, method = "lm", se = FALSE) + labs(x = xname, y = "Log Odds")
}

To use the function, you will need to specify the inputs, meaning the pieces of information the computer needs in order to run the code.

x: the column containing the x variable (Titanic$Fare)
y: the column containing the y variable (Titanic$Survived)
xname: This is just what you want on the x axis of the graph; ex: “Fare”
formula: This is the shape of the relationship you want to plot. If you want a line, this is y ~ x. If you want a 3rd order polynomial, you use y ~ poly(x,3)

Here is some of the code to create the plot. You will need to fill in the gaps.

get_logOddsPlot(x= , y=, xname = "Fare", formulahere = y ~ x)`

Question 14

Create the empirical log odds plot by completing the code above. Do you feel comfortable claiming the shape is linear?

If we do not see a line, that doesn’t mean we cannot use logistic regression. It just means we have to change the right hand side of our logistic regression model to reflect a shape other than a line.

Question 15

Adapt the formula part of the code to try and see if a non-linear shape is a better fit for this relationship. Show the plot you choose.

Question 16

Let’s write down the second part of our model. Copy and paste the following into the white space in RMarkdown: $$log\left(\frac{\pi_i}{1-\pi_i}\right) = $$.

Replace your X with a word that describes your explanatory variable, and make any other changes needed to reflect the shape of the relationship you have chosen.

We are now done with Step 1 and Step 2 of the parametric model building process!! Time for Step 3.

Step 3: Estimate the Parameters of Interest

Our next step is to estimate the parameters in involved in the chosen model. The parameters we need to estimate are the $\beta$ terms.

To estimate these parameters, we will be using R. This is done using very similar code to LSLR, except we use the glm function instead of lm. Additionally, we have one extra argument (or input) for our code. We must indicate that we wish to fit a logistic model by including the argument family=binomial.

To examine the coefficient estimates from our logistic regression model, we use the summary command.

Question 17

Build the logistic regression model in R. Write down the fitted model (regression line) in logistic form as the answer to this question.

To do this, paste the following into the white space in RMarkdown and adapt to reflect the numeric values and shape of your fitted model: $$log\left(\frac{\hat{\pi_i}}{1-\hat{\pi_i}}\right) = $$.

Question 18

Write down the fitted model (regression line) in probability form.

To do this, paste the following into the white space in RMarkdown, and adapt to reflect your line : $$\hat{\pi_i} = \frac{e^{ Stuff}}{1+ e^{ Stuff}}$$.

Question 19

Look at the first row in the data set. For this individual, using your fitted model what is the predicted probability of survival? Show your work and use appropriate notation in your answer.

Wrapping Up

We have now seen how to build a logistic regression model using the three-step process of building a parametric model. We have also practiced using the estimates of the parameters we have obtained using R for interpretation and prediction.

Our next steps are to remind ourselves of how to perform inference using hypothesis tests and confidence intervals for these parameters. We also want to explore how R is actually computing the estimates of the parameters. That is where we are headed!

This work was created by Nicole Dalzell is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2023 October 8.

The data set used in this lab is the Titanic data set, downloaded from Kaggle. Citation: Kaggle. Titanic: Machine Learning from Disaster Retrieved December 20, 2018 from https://www.kaggle.com/c/titanic/data.