Logistic regression is used to predict the outcome of a binary categorical variable from one or more predictor variables, which can be either numerical or categorical.
Let’s take a look at how we can use a logistic regression model to estimate the probability of survival for a passenger aboard the Titanic given their passenger class, sex, and age.
Let’s first load our libraries, then import and tidy the data.
# Load the necessary packages (install them first with install.packages() if needed)
library(titanic)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
# Load the Titanic dataset
data("titanic_train")
titanic_data <- titanic_train
# Prepare the data
titanic_data$Survived <- as.factor(titanic_data$Survived)
levels(titanic_data$Survived) <- c("No", "Yes") # Change levels to "No" and "Yes"
titanic_data$Pclass <- as.factor(titanic_data$Pclass)
titanic_data$Sex <- as.factor(titanic_data$Sex)
titanic_data <- titanic_data[!is.na(titanic_data$Age), ] # Remove rows with missing Age
Let’s visualize some of the relationships that might help us determine good predictors for survival.
# Plot Survived vs. Pclass
pclass_plot <- ggplot(titanic_data, aes(x = Pclass, fill = Survived)) +
geom_bar(position = "fill") +
labs(title = "Survival Rate by Passenger Class", y = "Proportion") +
scale_fill_manual(values = c("orange", "yellow")) +
theme_minimal()
print(pclass_plot)
# Plot Survived vs. Sex
sex_plot <- ggplot(titanic_data, aes(x = Sex, fill = Survived)) +
geom_bar(position = "fill") +
labs(title = "Survival Rate by Sex", y = "Proportion") +
scale_fill_manual(values = c("orange", "yellow")) +
theme_minimal()
print(sex_plot)
# Plot Survived vs. Age using boxplots overlaid with violin plots
age_violin_boxplot <- ggplot(titanic_data, aes(x = Survived, y = Age, fill = Survived)) +
geom_violin(alpha = 0.5) +
geom_boxplot(width = 0.1, position = position_dodge(width = 0.9)) +
labs(title = "Age Distribution by Survival Status", y = "Age", x = "Survived") +
scale_fill_manual(values = c("orange", "yellow")) +
facet_wrap(~Sex) +
theme_minimal() +
theme(legend.position = "none") # Set after theme_minimal(), which would otherwise reset the legend
print(age_violin_boxplot)
Based on these visualizations, passenger class and sex appear to be strong predictors of survival, while age appears to be a weaker one.
Let’s start building our logistic regression model by splitting the dataset into “train” and “test” partitions.
# Set a random seed for reproducibility
set.seed(123)
# Split the data into training (70%) and testing (30%) sets
trainIndex <- createDataPartition(titanic_data$Survived, p = 0.7, list = FALSE)
trainData <- titanic_data[trainIndex, ]
testData <- titanic_data[-trainIndex, ]
Now let’s fit our model using the glm() function in R, setting the family to “binomial” for logistic regression. The first argument is a formula: the binary categorical variable whose probability we want to predict goes on the left of the ~, and the predictor variables, separated by +, go on the right. The formula is followed by the name of the dataset and the family type.
# Fit the logistic regression model on the training data
logistic_model <- glm(Survived ~ Pclass + Sex + Age, data = trainData, family = binomial)
# Summarize the model
summary(logistic_model)
##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age, family = binomial,
## data = trainData)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.854315 0.482219 7.993 1.32e-15 ***
## Pclass2 -1.373890 0.335052 -4.101 4.12e-05 ***
## Pclass3 -2.795040 0.347798 -8.036 9.25e-16 ***
## Sexmale -2.364198 0.245591 -9.627 < 2e-16 ***
## Age -0.038481 0.009067 -4.244 2.19e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 675.37 on 499 degrees of freedom
## Residual deviance: 452.81 on 495 degrees of freedom
## AIC: 462.81
##
## Number of Fisher Scoring iterations: 5
How do we interpret this information?
The logistic regression equation is:
log(p / (1 − p)) = β0 + β1 × Pclass2 + β2 × Pclass3 + β3 × Sexmale + β4 × Age
Where:
log(p / (1 − p)) is the log-odds of the outcome.
β0 is the intercept.
β1,β2,β3,β4 are the coefficients for the predictors Pclass2, Pclass3, Sexmale, and Age, respectively.
p is the probability of the outcome (e.g., survival in this case).
Intercept (β0): The log-odds of the outcome (survival) when all predictors are at their reference level (e.g., 1st class, female, and Age = 0). For instance, if the intercept is 4.028, it means that the log-odds of survival for a 1st class female with an age of 0 is 4.028.
Pclass2 (β1): The change in log-odds of the outcome for being in 2nd class compared to 1st class. For example, if β1 is -1.303, it means that being in 2nd class decreases the log-odds of survival by 1.303 compared to 1st class.
Pclass3 (β2): The change in log-odds of the outcome for being in 3rd class compared to 1st class. For example, if β2 is -2.310, it means that being in 3rd class decreases the log-odds of survival by 2.310 compared to 1st class.
Sexmale (β3): The change in log-odds of the outcome for being male compared to female. For example, if β3 is -2.663, it means that being male decreases the log-odds of survival by 2.663 compared to being female.
Age (β4): The change in log-odds of the outcome for each additional year of age. For example, if β4 is -0.036, it means that each additional year of age decreases the log-odds of survival by 0.036.
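Changes in log-odds are hard to reason about directly, so a common follow-up is to exponentiate each coefficient, which turns it into an odds ratio. The sketch below does this for the estimates printed in the summary output above. The vector name `coefs` is introduced here purely for illustration; with the fitted model in memory, `exp(coef(logistic_model))` should give the same values.

```r
# Coefficient estimates copied from the summary() output above
coefs <- c(
  "(Intercept)" = 3.854315,
  Pclass2       = -1.373890,
  Pclass3       = -2.795040,
  Sexmale       = -2.364198,
  Age           = -0.038481
)

# Exponentiating a log-odds coefficient gives an odds ratio
# (for the intercept, it gives the baseline odds instead)
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))
```

An odds ratio below 1 means the predictor lowers the odds of survival relative to its reference level. The value for Sexmale is roughly 0.09, so the odds of survival for a male passenger are about a tenth of those for a female passenger of the same class and age.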
To convert the log-odds to a probability, you use the logistic function: p = 1 / (1 + e^(−z)), where z is the log-odds.
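Suppose we have the illustrative coefficients used in the examples above (they differ slightly from the fitted estimates in the summary output):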
Intercept (β0) = 4.028
Pclass2 (β1) = -1.303
Pclass3 (β2) = -2.310
Sexmale (β3) = -2.663
Age (β4) = -0.036
For a 3rd class male passenger who is 25 years old, the log-odds of survival would be 4.028 - 2.310 - 2.663 - (0.036 × 25) = -1.845, which corresponds to a survival probability of approximately 13.6%.
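The arithmetic behind this example can be checked in a few lines of base R. This is a sketch using the illustrative coefficients listed above, not the fitted estimates from the summary output. The names `beta` and `x` are introduced here for illustration, and `plogis()` is base R's logistic function.

```r
# Illustrative coefficients from the list above
beta <- c(intercept = 4.028, pclass2 = -1.303, pclass3 = -2.310,
          sexmale = -2.663, age = -0.036)

# A 25-year-old male in 3rd class: each dummy variable is 1 when the
# category applies and 0 otherwise; the intercept always gets a 1
x <- c(intercept = 1, pclass2 = 0, pclass3 = 1, sexmale = 1, age = 25)

log_odds    <- sum(beta * x)     # linear predictor, about -1.845
probability <- plogis(log_odds)  # same as 1 / (1 + exp(-log_odds))

print(log_odds)
print(probability)
```

The log-odds are negative, so the predicted probability is below 50%; here it comes out at a little under 14%.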
Key Words:
Log-Odds: The logarithm of the odds of the event happening, log(p / (1 − p)). In logistic regression, the log-odds are modeled as a linear combination of the predictors.
Change in Log-Odds: The coefficients represent the change in the log-odds of the outcome for a one-unit increase in the predictor, holding all other predictors constant.
Probability: To find the probability of the outcome, apply the logistic function to the log-odds. Equivalently, exponentiate the log-odds to get the odds, then divide the odds by one plus the odds.
Lastly, we can test our regression model on the test data to see how accurately it predicts the survival status of passengers:
# Predict probabilities on the test data
predicted_probabilities <- predict(logistic_model, newdata = testData, type = "response")
# Convert predicted probabilities to binary outcomes
predicted_classes <- ifelse(predicted_probabilities > 0.5, 1, 0)
# Create a confusion matrix
confusion_matrix <- table(Predicted = predicted_classes, Actual = testData$Survived)
# Print the confusion matrix
print(confusion_matrix)
## Actual
## Predicted No Yes
## 0 109 21
## 1 18 66
The confusion matrix is a table comparing the predicted outcomes to the actual outcomes. Treating survival ("Yes", coded as 1) as the positive class, the values in the top left and bottom right are the true negatives (109) and true positives (66), respectively, and are good (we want these to be as large as possible). The values in the top right and bottom left are the false negatives (21) and false positives (18), and we want to minimize these (see image below).
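The four cells of the confusion matrix can be condensed into a few summary measures. The sketch below rebuilds the matrix from the counts printed above and computes accuracy, sensitivity, and specificity, treating survival as the positive class. The names `cm`, `tn`, `fn`, `fp`, and `tp` are introduced here for illustration.

```r
# Counts copied from the confusion matrix printed above
# (rows = predicted class, columns = actual class; filled column by column)
cm <- matrix(c(109, 18, 21, 66), nrow = 2,
             dimnames = list(Predicted = c("0", "1"),
                             Actual = c("No", "Yes")))

tn <- cm["0", "No"]   # predicted not to survive, did not survive
fn <- cm["0", "Yes"]  # predicted not to survive, survived
fp <- cm["1", "No"]   # predicted to survive, did not survive
tp <- cm["1", "Yes"]  # predicted to survive, survived

accuracy    <- (tp + tn) / sum(cm)  # share of all predictions that are correct
sensitivity <- tp / (tp + fn)       # share of actual survivors correctly identified
specificity <- tn / (tn + fp)       # share of actual non-survivors correctly identified

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity), 3))
```

With these counts, 175 of the 214 test passengers are classified correctly, an accuracy of about 82%. With the model objects in memory, caret's confusionMatrix() reports similar statistics, provided the predictions are first converted to a factor with the same levels as testData$Survived.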