In this lab, you will apply logistic regression to a new dataset to predict binary outcomes. You will fit logistic regression models, interpret the coefficients and odds ratios, visualize the results, and assess the model’s fit. You will also explore how different predictors impact the probability of the outcome.
.Rmd file to HTML before submitting.Dataset: You will predict whether a customer will
buy a product (Purchase, 0 = No, 1 = Yes) based on
predictors Income, Age, and
MaritalStatus.
# Load necessary packages
library(dplyr)
library(broom)
# Generate a new dataset
set.seed(123)
customer_data <- data.frame(
Purchase = rbinom(1000, 1, prob = 0.5),
Income = round(rnorm(1000, mean = 50000, sd = 20000)),
Age = sample(18:70, 1000, replace = TRUE),
MaritalStatus = factor(sample(c("Single", "Married"), 1000, replace = TRUE))
)
customer_data$Purchase[customer_data$Income > 60000] <- rbinom(sum(customer_data$Income > 60000), 1, prob = 0.85)
customer_data$Purchase[customer_data$Age < 40] <- rbinom(sum(customer_data$Age < 40), 1, prob = 0.88)
customer_data$Purchase[customer_data$MaritalStatus == "Married"] <- rbinom(sum(customer_data$MaritalStatus == "Married"), 1, prob = 0.8)# Fit the logistic regression model
model.1 <- glm(Purchase ~ Income + Age + MaritalStatus,
data = customer_data,
family = "binomial")
# Name it model.1
summary(model.1)##
## Call:
## glm(formula = Purchase ~ Income + Age + MaritalStatus, family = "binomial",
## data = customer_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.458e+00 3.065e-01 4.756 1.97e-06 ***
## Income 1.136e-05 3.834e-06 2.963 0.00305 **
## Age -1.335e-02 5.072e-03 -2.633 0.00847 **
## MaritalStatusSingle -4.331e-01 1.520e-01 -2.849 0.00438 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1081.0 on 999 degrees of freedom
## Residual deviance: 1057.6 on 996 degrees of freedom
## AIC: 1065.6
##
## Number of Fisher Scoring iterations: 4
Task:
1. Exponentiate the coefficients from the logistic regression model to obtain odds ratios.
2. Interpret the odds ratios. What does an odds ratio greater than 1 signify for each predictor? What about less than 1?
## (Intercept) Income Age MaritalStatusSingle
## 4.2959778 1.0000114 0.9867365 0.6485186
# Display odds ratios and confidence intervals
exp(cbind(OddsRatio = coef(model.1), confint(model.1)))## Waiting for profiling to be done...
## OddsRatio 2.5 % 97.5 %
## (Intercept) 4.2959778 2.3698526 7.8885565
## Income 1.0000114 1.0000039 1.0000189
## Age 0.9867365 0.9769158 0.9965514
## MaritalStatusSingle 0.6485186 0.4809486 0.8731176
Task:
1. Create a plot to visualize the predicted probability of
Purchase based on Income.
2. Use ggplot2 to generate the logistic regression
curve.
# Load ggplot2
library(ggplot2)
# Generate predicted probabilities
customer_data$predicted_prob <- predict(model.1, newdata = customer_data, type = "response")
# Plot the the predicted probability of `Purchase` based on `Income`.
ggplot(customer_data, aes(x = Income, y = predicted_prob)) +
geom_point() +
geom_smooth(method = "glm", method.args = list(family = "binomial")) +
labs(x = "Income", y = "Predicted Probability",
title = "Predicted Purchase Probability by Income")Task:
1. Generate the ROC curve for the logistic regression model to assess its fit.
2. Calculate and interpret the AUC value.
## [1] 0.7299340 0.8062799 0.7729388 0.7851370 0.8067591 0.7433706
# Generate ROC curve for Purchase vs predicted_prob
roc_result <- roc(customer_data$Purchase, customer_data$predicted_prob)
# save and name the object so you can plot it.
# Plot ROC curve using plot()
plot(roc_result, main = "ROC Curve")## Area under the curve: 0.6028
Task:
1. Create a plot to visualize the odds ratio for
MaritalStatus.
2. Discuss how marital status affects the likelihood of purchasing the product.
# Create a data frame with the odds ratios for Marital Status only
odds_ratio_marital <- 0.6485186
marital_status_odds <- data.frame(
OddsRatio = odds_ratio_marital,
Predictor = "MaritalStatus",
row.names = NULL
)
# Create a plot for the odds ratios of Marital Status
ggplot(marital_status_odds, aes(x = Predictor, y = OddsRatio)) +
geom_col(fill = "lightblue") +
geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
ylim(0, 2) +
labs(title = "Odds Ratio for Marital Status",
y = "Odds Ratio")