STA 112 Lab 4
The Data
The RMS Titanic was a huge luxury passenger liner designed and built in the early 20th century. Although the ship was believed to be unsinkable, during her maiden voyage she struck an iceberg late on April 14, 1912, and sank in the early hours of April 15.
We have information on \(n = 714\) passengers from the Titanic, and for today we are interested in modeling:
Fare: How much each passenger’s ticket cost in US dollars.
We will build a few different models using these explanatory variables:
Sex: The sex of the passenger, limited to male and female.
Pclass: The class of the ticket held by the passenger; 1 = 1st class, 2 = 2nd class, 3 = 3rd class.
Age: The age of the passenger in years. Note: Decimals are for children under a year in age.
To load the data, copy and paste the following into a code chunk and press play:
Model 1: One Categorical X
The first question we are going to explore is the relationship between \(X\) = sex and \(Y\) = ticket cost. In other words, were ticket prices different for men versus women? We are going to use a regression model to explore the relationship between the two variables.
The Shape Assumption
We know that when we build regression models, one assumption we have to make is called the shape assumption. This assumption means that the shape of the relationship we see in the scatter plot between \(X\) and \(Y\) is reflected in the shape we choose in the model.
However, when \(X\) is categorical, there is no shape assumption to check. All we are assuming is that there might be a difference in \(Y\) = ticket cost based on \(X\) = sex. We do not make any assumptions about a shape, and because \(X\) is categorical we cannot make a scatter plot!
Even though we do not need to check shape, if we want to look at the relationship between a categorical variable and a numeric variable, one handy plot is a side-by-side box plot.
To make any plot in R, we know the first step is to load the ggplot2 library.
Then, to create the plot, we use one of the following:
Horizontal
Vertical
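As a rough sketch of what the two versions can look like, here is an example on a tiny made-up data set. The data frame `toy_titanic` and the object names `box_horizontal` and `box_vertical` are hypothetical and are not the lab data or the lab's own code. The horizontal version assumes ggplot2 version 3.3.0 or later, which can work out the orientation from the variables it is given.

```r
library(ggplot2)

# Hypothetical toy data (NOT the lab data set)
toy_titanic <- data.frame(
  Sex  = c("female", "female", "female", "male", "male", "male"),
  Fare = c(80, 26, 16, 52, 13, 8)
)

# Horizontal: the numeric variable goes on the x axis
box_horizontal <- ggplot(toy_titanic, aes(x = Fare, y = Sex)) +
  geom_boxplot() +
  labs(x = "Fare", y = "Sex")

# Vertical: the numeric variable goes on the y axis
box_vertical <- ggplot(toy_titanic, aes(x = Sex, y = Fare)) +
  geom_boxplot() +
  labs(x = "Sex", y = "Fare")

# Typing the name of a plot displays it
box_horizontal
box_vertical
```

The only difference between the two is which variable is mapped to `x` and which to `y`.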
Question 1
Show whichever plot you prefer (horizontal side-by-side boxplot or vertical side-by-side boxplot). Either is fine; this just lets me see what you all prefer!
Based on the plot you created above, describe any differences you see in the two boxplots. In other words, describe any differences in the distribution of ticket price based on sex.
Building the Model
Now that we have looked at the relationship between the two variables, let’s build a regression model! Remember that the code to build a regression model in R is:
The model we have built is called model_sex. Once the model is built, we need to see the coefficients, meaning the \(\beta\) terms, we have estimated. To get the table we have been using for this in class, we use
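For reference, here is a minimal sketch of both steps on a tiny made-up data set. The names `toy_titanic`, `toy_model_sex`, and `toy_coef_table` are hypothetical, and `summary(...)$coefficients` is one common way to get a coefficient table; it may not be exactly the command used in class.

```r
# Hypothetical toy data (NOT the lab data set)
toy_titanic <- data.frame(
  Sex  = c("female", "female", "female", "male", "male", "male"),
  Fare = c(80, 26, 16, 52, 13, 8)
)

# Build the model: Y on the left of the ~, X on the right
toy_model_sex <- lm(Fare ~ Sex, data = toy_titanic)

# Pull out the table of estimated coefficients
toy_coef_table <- summary(toy_model_sex)$coefficients
toy_coef_table
```

The table has one row for the intercept and one row for each coefficient in the model.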
However, before we proceed, let’s see if we can make this nicer, because this output looks pretty messy.
Question 2
To professionally format the output from a regression model, we use knitr::kable.
If you run the chunk above, you may get a warning. If you don’t, great. If you do, change the header of your chunk from {r} to {r, warning = FALSE, message = FALSE}.
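If it helps to see the idea in isolation, here is a small sketch on made-up data. The names `toy_titanic`, `toy_model_sex`, and `toy_table` are hypothetical, and passing `summary(...)$coefficients` to `knitr::kable` is an assumption about how the table is produced; the chunk in the lab may differ. The `digits` argument of `kable` controls rounding.

```r
# Hypothetical toy data (NOT the lab data set)
toy_titanic <- data.frame(
  Sex  = c("female", "female", "female", "male", "male", "male"),
  Fare = c(80, 26, 16, 52, 13, 8)
)

toy_model_sex <- lm(Fare ~ Sex, data = toy_titanic)

# Format the coefficient table, rounded to 2 decimal places
toy_table <- knitr::kable(summary(toy_model_sex)$coefficients, digits = 2)
toy_table
```

When the document is knit, the table appears as a formatted table rather than raw console output.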
At this point, go ahead and knit the document. This is all you need as an answer to this question! You will see how nice everything looks!
Question 3
Write out the fitted model using appropriate notation. Round to 2 decimal places.
Hint: Remember that to make mathematical notation in R, we copy and paste the following into WHITE SPACE (not a chunk!!!!) in your Markdown file.
$$\widehat{Fare} = $$ or $$\hat{y} = $$
Question 4
What is the baseline for sex?
How can you tell?
Question 5
Interpret the coefficient for male in the context of the data.
Question 6
Interpret the intercept in the context of the data.
Model 2: Multiple Regression Model 1
Now that we have explored a model with one explanatory variable, let’s add a second variable. When we add more than one explanatory variable to the model, the model is called a multiple regression (MR) model.
Why might we want to include multiple X variables in a model? Recall that our goal in modeling is to try to capture patterns in \(Y\). Some passengers have low fares, some have high fares, and some are in the middle. What is different about the passengers that might relate to these differences in prices? It makes sense that more than one thing might be related to differences in prices, so by adding in more than one \(X\), we can include more things that might be related to higher/lower prices.
In addition to \(X_1 =\) sex, we will include:
Age: The age of the passenger in years. Note: Decimals are for children under a year in age.
The Shape Assumption (again)
Whenever we add a new variable into the model, we have to revisit the shape assumption. With multiple regression, the shape assumption says that for each variable in the model, the shape we choose for the model matches the shape of the relationship between the variable and \(Y\) in a scatter plot.
We do not need to check shape for sex because it is categorical. However, age is numeric. This means we DO need to check the shape assumption for this variable.
To check shape with a numeric \(X\), we know we need to create a scatter plot:
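As a sketch of the general pattern, here is a scatter plot built on a tiny made-up data set. The names `toy_titanic` and `toy_scatter` are hypothetical and the numbers are not from the lab data.

```r
library(ggplot2)

# Hypothetical toy data (NOT the lab data set)
toy_titanic <- data.frame(
  Age  = c(22, 35, 4, 30, 54, 19),
  Fare = c(80, 26, 16, 52, 13, 8)
)

# Numeric X on the x axis, Y on the y axis, one point per passenger
toy_scatter <- ggplot(toy_titanic, aes(x = Age, y = Fare)) +
  geom_point() +
  labs(x = "Age (years)", y = "Fare")

toy_scatter
```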
Question 7
Take a look at the plot above. Can we argue that the shape assumption is met?
Hint: It’s okay if the slope isn’t exciting! We just need to be able to say that using a line, however flat, makes the most sense. We don’t need a curve or another shape.
Building the model
When it comes time to build a multiple regression model, our code has to change a little. To add a second variable into a regression model, the only change we need to make to our code is to add (+) a new variable:
You will also note that I changed the name of our model object from model_sex to model_sex_age. This allows us to keep track of our models as we proceed through the analysis.
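Here is a minimal sketch of the pattern on a tiny made-up data set. The names `toy_titanic`, `toy_model_sex_age`, and `toy_prediction` are hypothetical. The sketch also shows `predict()`, which is one way (an assumption, not necessarily the method used in class) to get a predicted value without plugging numbers into the fitted model by hand.

```r
# Hypothetical toy data (NOT the lab data set)
toy_titanic <- data.frame(
  Sex  = c("female", "female", "female", "male", "male", "male"),
  Age  = c(22, 35, 4, 30, 54, 19),
  Fare = c(80, 26, 16, 52, 13, 8)
)

# Add a second explanatory variable with +
toy_model_sex_age <- lm(Fare ~ Sex + Age, data = toy_titanic)
summary(toy_model_sex_age)$coefficients

# Predicted fare for a hypothetical 40-year-old male passenger
toy_prediction <- predict(toy_model_sex_age,
                          newdata = data.frame(Sex = "male", Age = 40))
toy_prediction
```

The prediction is the intercept, plus the coefficient for male, plus 40 times the coefficient for age.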
Question 8
Write out the fitted model using appropriate notation.
Hint: Remember that to make mathematical notation in R, we copy and paste the following into WHITE SPACE (not a chunk!!!!) in your Markdown file.
$$\widehat{Fare} = $$ or $$\hat{y} = $$
Question 9
Interpret the coefficient for age in the fitted model.
Question 10
If a passenger was male and was age 19, what is the predicted cost of a ticket?
Question 11
If a passenger was female and was age 19, what is the predicted cost of a ticket?
As we can tell from Questions 10 and 11, when we combine one numeric explanatory variable with one categorical explanatory variable, we end up with parallel lines. Specifically, we get one line for each level (option) for our categorical \(X\). The lines have the same slope, but different intercepts.
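To see why the lines are parallel, it can help to write the model out in symbols. In the sketch below, \(I\) is a hypothetical name for the indicator variable that equals 1 for the non-baseline level of the categorical \(X\) and 0 for the baseline level; the symbols are generic and are not the estimates from our data.

```latex
\begin{aligned}
\text{Full model:} \quad \hat{y} &= \hat{\beta}_0 + \hat{\beta}_1 I + \hat{\beta}_2 \, \text{Age} \\
I = 0 \text{ (baseline level):} \quad \hat{y} &= \hat{\beta}_0 + \hat{\beta}_2 \, \text{Age} \\
I = 1 \text{ (other level):} \quad \hat{y} &= (\hat{\beta}_0 + \hat{\beta}_1) + \hat{\beta}_2 \, \text{Age}
\end{aligned}
```

Both lines share the slope \(\hat{\beta}_2\), and their intercepts differ by exactly \(\hat{\beta}_1\).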
Is there a way to visualize these lines? Yes indeed! The code required is a little tricky, so I have done it for you. To start, copy and paste the code below into a chunk and press play.
plot_MRmodels <- function(numericX, categoricalX, Y, dataset) {
  # Save the variable names to use as axis labels
  xlab <- numericX
  ylab <- Y

  # Pull the columns out of the data set by name
  Y <- dataset[[Y]]
  numericX <- dataset[[numericX]]
  # factor() makes sure the levels exist even if the column is stored as text or numbers
  categoricalX <- factor(dataset[[categoricalX]])

  # One color per level, in level order (the baseline level is black)
  colorList <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
  nLevels <- nlevels(categoricalX)
  colors <- colorList[as.integer(categoricalX)]

  # Build the model using the columns pulled out above
  model <- lm(Y ~ numericX + categoricalX)
  coefs <- model$coefficients

  # One intercept per level: the baseline intercept, then baseline + that level's coefficient
  intercepts <- unname(coefs[1] + c(0, coefs[-(1:2)]))

  # Create the lines (every level has the same slope)
  ggplot(dataset, aes(x = numericX, y = Y, group = categoricalX)) +
    geom_point(col = colors) +
    geom_abline(intercept = intercepts, slope = unname(coefs[2]), color = colorList[seq_len(nLevels)], lwd = 1) +
    labs(x = xlab, y = ylab)
}
When you press play, it will look like nothing has happened…but it has. We just taught R a function, which just means a specific set of commands we want R to perform.
The function takes 4 inputs:
- The name of our numeric X
- The name of our categorical X
- The name of our Y variable
- The name of our data set
This means that to use the function, copy and paste the following into R and press play!
Question 12
In the plot, one of the lines is for the male passengers and the other is for female passengers. Which line is for the male passengers, the gold line (on the bottom) or the black line (on the top)?
Hint: Your answers to Question 10 and 11 will help you here!
Model 3: Categorical X with 3 levels
So far we have built (1) a model with a categorical X with 2 levels and (2) a model with a numeric X and a categorical X with 2 levels. However, categorical variables can have more than 2 levels. How does the model work in that case?
Consider the variable \(X\) = Pclass, where:
Pclass: The class of the ticket held by the passenger; 1 = 1st class, 2 = 2nd class, 3 = 3rd class.
Question 13
Consider the variable passenger class.
How many levels (different values) does this variable have?
What are these levels?
Which of these levels will R treat as the baseline? Hint: R likes to go in alphabetic or numeric order, with the first being the baseline.
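If you want to see the ordering rule in action, here is a small sketch using two made-up variables. The names `port_text` and `deck_code` and their values are hypothetical and are not columns in the lab data. `levels()` lists the levels in the order R will use, and the first one listed is the baseline.

```r
# Hypothetical variables (NOT from the lab data set)
port_text <- c("Southampton", "Cherbourg", "Queenstown", "Cherbourg")
deck_code <- c(30, 10, 20, 10)

# Text values are sorted alphabetically
text_levels <- levels(factor(port_text))
text_levels

# Number values are sorted from smallest to largest
number_levels <- levels(factor(deck_code))
number_levels
```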
Question 14
How many coefficients (not including the intercept!) do we need in the model for passenger class?
Question 15
Do we need to check the shape assumption with \(X\) = passenger class? Why or why not?
To build a regression model for Y = ticket price and X = passenger class, we can use the following:
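Here is a sketch of the pattern on a tiny made-up data set. The names `toy_titanic` and `toy_model_class` are hypothetical. The sketch assumes the class column is stored as the numbers 1, 2, and 3, so it wraps the variable in `factor()` to make sure R treats it as categorical; if the column in your data is already stored as a factor, the wrapper is not needed.

```r
# Hypothetical toy data (NOT the lab data set)
toy_titanic <- data.frame(
  Pclass = c(1, 1, 2, 2, 3, 3),
  Fare   = c(80, 52, 26, 13, 16, 8)
)

# factor() tells R to treat the class codes as categories, not as a numeric X
toy_model_class <- lm(Fare ~ factor(Pclass), data = toy_titanic)
summary(toy_model_class)$coefficients
```

Without `factor()`, R would treat a numeric class code as a numeric \(X\) and fit a single slope instead of one coefficient per non-baseline level.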
Question 16
Write out the fitted model using appropriate notation.
Hint: Remember that to make mathematical notation in R, we copy and paste the following into WHITE SPACE (not a chunk!!!!) in your Markdown file.
$$\widehat{Fare}= $$ or $$\hat{y}= $$
Question 17
Interpret the intercept in the model.
Question 18
Interpret the coefficient for 3rd class.
Model 4: Multiple Regression Model 2
Now that we have explored a categorical X with 3 levels, let’s build another multiple regression model. This time, we are going to build a model with \(X_1\) = passenger class and \(X_2\) = age.
Question 19
Build a multiple regression model in R with \(X_1\) = passenger class and \(X_2\) = age and then:
Write down the fitted model for (1) first class passengers, (2) second class passengers, and (3) third class passengers. This means you should have 3 different fitted models. Simplify fully - this means all constants should be combined.
Is the slope different across these three fitted models?
What about the intercept?
Question 20
Interpret the coefficient for second class.
When we had our model with age and sex, we graphed our results using:
The only change we have now is that instead of “Sex”, our categorical explanatory variable is “Pclass”.
Question 21
Make a plot to show the 3 parallel lines created by our model.
Question 22
The plot in Question 21 has 3 lines: black (top), gold (middle), and blue (bottom). Which passenger class is associated with each line color?
Next Steps
As we move forward in the course, we will start to learn how we can use this idea of using multiple explanatory variables to build stronger models. We also need to learn how we can compare models, and how we might decide whether including more explanatory variables in the model is effective. All of that is coming up!
References
This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2025 October 21.
The data set used in this lab is the Titanic data set, downloaded from Kaggle. Citation: Kaggle. Titanic: Machine Learning from Disaster. Retrieved December 20, 2018 from https://www.kaggle.com/c/titanic/data.