STA 112 Lab 4
The Data
The RMS Titanic was a huge, luxury passenger liner designed and built in the early 20th century. Although the ship was believed to be unsinkable, during her maiden voyage the Titanic collided with an iceberg late on April 14, 1912, and sank in the early hours of April 15.
We have information on \(n = 891\) passengers from the Titanic, and for today we are interested in modeling:
Fare
: How much each passenger’s ticket cost in US dollars.
We will build a few different models using these explanatory variables:
Survived
: An indicator for whether the passenger survived (1) or perished (0) during the disaster.
Pclass
: The class of the ticket held by the passenger; 1 = 1st class, 2 = 2nd class, 3 = 3rd class.
Parch
: number of parents/children the passenger had aboard the Titanic. Here, parent is defined as mother/father and child is defined as daughter, son, stepdaughter or stepson. NOTE: Some children traveled only with a nanny, therefore parch=0 for them. There were no parents aboard for these children.
To load the data, copy and paste the following into a code chunk and press play:
Model 1: One Categorical X
The first question we are going to explore is the relationship between \(X\) = survival and \(Y\) = ticket cost. In other words, can we find a relationship between whether or not someone survived the Titanic and how much they paid for their ticket? We are going to use a regression model to explore the relationship between the two variables.
The Shape Assumption
We know that when we build regression models, one assumption we have to make is called the shape assumption. This assumption means that the shape of the relationship we see in the scatter plot between \(X\) and \(Y\) is reflected in the shape we choose in the model.
However, when \(X\) is categorical, there is no shape assumption to check. All we are assuming is that there might be a difference in \(Y\) = ticket cost based on \(X\) = survival. We do not make any assumptions about a shape, and because \(X\) is categorical we cannot make a scatter plot!
Even though we do not need to check shape, if we want to look at the relationship between a categorical variable and a numeric variable, one handy plot is a side-by-side box plot.
To make any plot in R, we know the first step is to load the ggplot2 library.
Then, to create the plot, we use one of the following:
Horizontal
Vertical
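As a rough sketch of what these two versions might look like, the code below draws both on a small made-up data frame (toy, invented here for illustration and not the real Titanic data). Wrapping Survived in factor() is an assumption about how the 0/1 indicator is stored; in the lab you would use the Titanic data frame instead.

```r
library(ggplot2)

# Made-up stand-in for the lab data (NOT the real Titanic values)
toy <- data.frame(
  Fare     = c(7.25, 71.28, 7.93, 53.10, 8.05, 13.00, 26.00, 30.00),
  Survived = c(0, 1, 1, 1, 0, 0, 1, 0)
)

# Vertical: one box per survival group, Fare on the y axis
p_vertical <- ggplot(toy, aes(x = factor(Survived), y = Fare)) +
  geom_boxplot() +
  labs(x = "Survived", y = "Fare")

# Horizontal: the same plot with the axes flipped
p_horizontal <- p_vertical + coord_flip()

p_vertical
p_horizontal
```

Both plots show the same information; the choice between them is only about which orientation is easier to read.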
Question 1
Show whichever plot you prefer (horizontal side by side boxplot or vertical side by side boxplot).
Based on the plot you created above, describe any differences you see in the two boxplots. In other words, describe any differences in the distribution of ticket price based on survival.
Building the Model
Now that we have looked at the relationship between the two variables, let’s build a regression model! Remember that the code to build a regression model in R is:
The model we have built is called model_survived. Once the model is built, we need to see the coefficients, meaning the \(\beta\) terms, we have estimated. To get the table we have been using for this in class, we use:
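A minimal sketch of these two steps (fitting the model, then printing the coefficient table) is shown here on a small made-up data frame called toy, which is invented for illustration and is not the real Titanic data. The use of factor() and of summary()$coefficients are assumptions about one reasonable way to do this; the chunks used in class may differ.

```r
# Made-up stand-in for the lab data (NOT the real Titanic values)
toy <- data.frame(
  Fare     = c(7.25, 71.28, 7.93, 53.10, 8.05, 13.00, 26.00, 30.00),
  Survived = c(0, 1, 1, 1, 0, 0, 1, 0)
)

# Fit the regression of Fare on the survival indicator
model_survived <- lm(Fare ~ factor(Survived), data = toy)

# Coefficient table: estimates, standard errors, t values, p-values
coef_table <- summary(model_survived)$coefficients
coef_table
```

In the lab itself, data = toy would be replaced by data = Titanic.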
However, before we proceed, let’s see if we can make this nicer, because this output looks pretty messy.
Question 2
To professionally format the output from a regression model, we use knitr::kable.
If you run the chunk above, it will not look very pretty right now, and you will get a warning. To remove the warning, change the header of your chunk from {r} to {r, warning = FALSE, message = FALSE}.
Once you have done that, go ahead and knit the document. This is all you need as an answer to this question! You will see how nice everything looks!
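For reference, here is a hedged sketch of one way knitr::kable can format a coefficient table. It assumes the table passed to kable is the matrix from summary()$coefficients and uses a made-up data frame (toy, not the real Titanic data); the chunk in the lab may pass a different object or load additional packages.

```r
library(knitr)

# Made-up stand-in for the lab data (NOT the real Titanic values)
toy <- data.frame(
  Fare     = c(7.25, 71.28, 7.93, 53.10, 8.05, 13.00, 26.00, 30.00),
  Survived = c(0, 1, 1, 1, 0, 0, 1, 0)
)

model_survived <- lm(Fare ~ factor(Survived), data = toy)

# Format the coefficient table, rounding every column to 2 decimal places
formatted <- kable(summary(model_survived)$coefficients, digits = 2)
formatted
```

The digits argument only changes how the numbers are displayed; the fitted model itself is not rounded.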
Question 3
Write out the fitted model using appropriate notation. Round to 2 decimal places.
Hint: Remember that to make mathematical notation in R, we copy and paste the following into WHITE SPACE (not a chunk!!!!) in your Markdown file.
$$\widehat{Fare} = $$
Question 4
What is the baseline for survived?
How can you tell?
Question 5
Interpret the coefficient for survived in the context of the data.
Question 6
Interpret the intercept in the context of the data.
Model 2: Two Explanatory Variables
Now that we have explored a model with one explanatory variable, let’s add a second variable. When we add more than one explanatory variable to the model, the model is called a multiple regression (MR) model.
Why might we want to include multiple X variables in a model? One reason is that putting two or more X variables together in the model allows us to build models that account for more sources of variation in \(Y\). It is reasonable to assume that more than one thing is related to \(Y\)!
By including more explanatory variables, we may be able to improve our model fit (how well we are capturing the patterns in \(Y\)) and/or provide more accurate predictions for \(Y\). Now, this isn’t always true - we have to choose the \(X\) variables we add into the model wisely. Our goal is to start to explore models with multiple \(X\)s to see how we can work with them and evaluate them.
For now, in addition to \(X_1 =\) survival, we will include:
Parch
: number of parents/children the passenger had aboard the Titanic. Here, parent is defined as mother/father and child is defined as daughter, son, stepdaughter or stepson. NOTE: Some children traveled only with a nanny, therefore parch=0 for them. There were no parents aboard for these children.
The Shape Assumption (again)
Whenever we add a new variable into the model, we have to revisit the shape assumption. With multiple regression, the shape assumption says that for each variable in the model, the shape we choose for the model matches the shape of the relationship between the variable and \(Y\) in a scatter plot.
We do not need to check shape for survival because it is categorical. However, the number of parents and children is numeric. This means we DO need to check the shape condition for this variable.
To check shape with a numeric \(X\), we know we need to create a scatter plot:
ggplot( Titanic, aes( x = Parch, y = Fare)) +
geom_point() +
stat_smooth(formula = y~x, method = "lm", se = FALSE)
Question 7
Take a look at the plot above. Can we argue that the shape condition is met?
Building the model
When it comes time to build a multiple regression model, our code has to change a little. To add a second variable into a regression model, the only change we need to make to our code is to add (+) a new variable.
You will also note that I changed the name of our model object from model_survived to model_survived_parch. This allows us to keep track of our models as we proceed through the analysis.
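As a sketch of what this change might look like, the code below fits both models on a small made-up data frame (toy, invented for illustration and not the real Titanic data). The factor() wrapper is an assumption about how Survived is handled; the only difference between the two calls is the added + Parch.

```r
# Made-up stand-in for the lab data (NOT the real Titanic values)
toy <- data.frame(
  Fare     = c(7.25, 71.28, 7.93, 53.10, 8.05, 13.00, 26.00, 30.00),
  Survived = c(0, 1, 1, 1, 0, 0, 1, 0),
  Parch    = c(0, 1, 0, 1, 0, 0, 2, 2)
)

# One explanatory variable
model_survived <- lm(Fare ~ factor(Survived), data = toy)

# Two explanatory variables: same call, with "+ Parch" added to the formula
model_survived_parch <- lm(Fare ~ factor(Survived) + Parch, data = toy)

summary(model_survived_parch)$coefficients
```

In the lab itself, data = toy would be replaced by data = Titanic.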
Question 8
Write out the fitted model using appropriate notation.
Hint: Remember that to make mathematical notation in R, we copy and paste the following into WHITE SPACE (not a chunk!!!!) in your Markdown file.
$$\widehat{Fare} = $$
Question 9
Interpret the intercept of the fitted model.
Is this one of those cases when interpreting the intercept is meaningful or not meaningful? Explain in 1 sentence.
Question 10
Interpret the coefficient for Parch in the fitted model.
Question 11
If a passenger survived and had 2 parents/children on board, what is the predicted cost of a ticket?
Question 12
If a passenger did NOT survive and had 2 parents/children on board, what is the predicted cost of a ticket?
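Predictions like these are usually computed by hand from the fitted model, but predict() offers one way to check the arithmetic. The sketch below uses a made-up data frame (toy, not the real Titanic data), so its numbers will not match the lab's answers; new_passengers is a hypothetical name for the rows we want predictions for.

```r
# Made-up stand-in for the lab data (NOT the real Titanic values)
toy <- data.frame(
  Fare     = c(7.25, 71.28, 7.93, 53.10, 8.05, 13.00, 26.00, 30.00),
  Survived = c(0, 1, 1, 1, 0, 0, 1, 0),
  Parch    = c(0, 1, 0, 1, 0, 0, 2, 2)
)

model_survived_parch <- lm(Fare ~ factor(Survived) + Parch, data = toy)

# Two hypothetical passengers: one survived, one did not, each with Parch = 2
new_passengers <- data.frame(Survived = c(1, 0), Parch = c(2, 2))
preds <- predict(model_survived_parch, newdata = new_passengers)
preds

# The same predictions "by hand": intercept + survival term + slope * Parch
b <- coef(model_survived_parch)
by_hand <- c(b[1] + b[2] * 1 + b[3] * 2,
             b[1] + b[2] * 0 + b[3] * 2)
```

The by-hand version mirrors what you do when you plug values into the fitted model you wrote out.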
At this point, we can start to see that our conclusions can be more specific as we add more variables into the model. One thing we have not done yet is to evaluate in some way if adding in the new variable improved our model. We will get to that in our next class!
Model 3: Categorical X with 3 levels
So far we have built (1) a model with a categorical X with 2 levels and (2) a model with a numeric X and a categorical X with 2 levels. However, categorical variables can have more than 2 levels. How does the model work in that case?
Consider the variable \(X\) = Pclass, where:
Pclass
: The class of the ticket held by the passenger; 1 = 1st class, 2 = 2nd class, 3 = 3rd class.
Question 13
Consider the variable passenger class.
How many levels (different values) does this variable have?
What are these levels?
Which of these levels will R treat as the baseline? Hint: To find this out, you can run table( Titanic$Pclass ) and look at which level comes first!
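To illustrate the idea behind this hint, here is a small sketch on a made-up vector of letter categories (deck is a hypothetical variable, not part of the lab data). By default, R sorts the levels of a factor, and lm() uses the first level as the baseline.

```r
# Hypothetical categorical variable with three levels
deck <- c("C", "A", "B", "A", "C", "B")

# table() lists the levels in sorted order, with their counts
counts <- table(deck)
counts

# The first level of the factor is the one lm() would use as the baseline
baseline <- levels(factor(deck))[1]
baseline
```

The same logic applies to table( Titanic$Pclass ).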
Question 14
How many coefficients do we need in the model for Pclass? Explain briefly how you know.
Question 15
Do we need to check the shape condition with \(X\) = passenger class? Why or why not?
To build a regression model for Y = ticket price and X = class level, we can use the following:
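A minimal sketch of such a model follows, again on a made-up data frame (toy, not the real Titanic data). The name model_pclass is hypothetical. The factor() wrapper is an assumption: if Pclass is stored as the numbers 1, 2, 3, then without it lm() would treat class as a numeric variable and fit a single slope rather than treating it as categorical.

```r
# Made-up stand-in for the lab data (NOT the real Titanic values)
toy <- data.frame(
  Fare   = c(7.25, 71.28, 7.93, 53.10, 8.05, 13.00, 26.00, 30.00),
  Pclass = c(3, 1, 3, 1, 3, 2, 2, 1)
)

# Treat passenger class as categorical by wrapping it in factor()
model_pclass <- lm(Fare ~ factor(Pclass), data = toy)

summary(model_pclass)$coefficients
```

In the lab itself, data = toy would be replaced by data = Titanic.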
Question 16
Write out the fitted model using appropriate notation.
Hint: Remember that to make mathematical notation in R, we copy and paste the following into WHITE SPACE (not a chunk!!!!) in your Markdown file.
$$\widehat{Fare}= $$
Question 17
Interpret the intercept in the model.
Question 18
Interpret the coefficient for 2nd class.
Model 4: Putting Them Together
We are going to build a model with \(X_1\) = passenger class and \(X_2\) = Parch (parent/child count).
Question 19
Build a multiple regression model in R with \(X_1\) = passenger class and \(X_2\) = Parch (parent/child count) and then:
Write down the fitted model for (1) first class passengers, (2) second class passengers, and (3) third class passengers. This means you should have 3 different fitted models. Simplify fully - this means all constants should be combined.
Is the slope different across these three fitted models?
What about the intercept?
Question 20
Interpret the coefficient for second class.
Question 21
Suppose a passenger had two parents/children on board. If we wanted to rank prices from highest predicted price to lowest predicted price, the correct rank is:
- First, Second, Third
- First, Third, Second
- Second, First, Third
- Second, Third, First
- Third, First, Second
- Third, Second, First
Next Steps
As we move forward in the course, we will start to learn how we can use this idea of using multiple explanatory variables to build stronger models. This means we want to build models that are a better reflection of the variability in \(Y\) and the relationships between \(Y\) and other variables. We also need to learn how we can compare models, and how we might decide whether including more explanatory variables in the model is effective. All of that is coming up!
References
This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2025 October 15.
The data set used in this lab is the Titanic data set, downloaded from Kaggle. Citation: Kaggle. Titanic: Machine Learning from Disaster. Retrieved December 20, 2018, from https://www.kaggle.com/c/titanic/data.