Replace “Your Name” with your actual name.
In this lab, you will apply logistic regression to a new dataset to predict binary outcomes. You will fit logistic regression models, interpret the coefficients and odds ratios, visualize the results, and assess the model’s fit. You will also explore how different predictors impact the probability of the outcome.
Knit your .Rmd file to HTML before submitting.
Dataset: You will predict whether a customer will buy a product (Purchase, 0 = No, 1 = Yes) based on the predictors Income, Age, and MaritalStatus.
# Load necessary packages
library(dplyr)
library(broom)
# Generate a new dataset
set.seed(123)
customer_data <- data.frame(
Purchase = rbinom(1000, 1, prob = 0.5),
Income = round(rnorm(1000, mean = 50000, sd = 20000)),
Age = sample(18:70, 1000, replace = TRUE),
MaritalStatus = factor(sample(c("Single", "Married"), 1000, replace = TRUE))
)
customer_data$Purchase[customer_data$Income > 60000] <- rbinom(sum(customer_data$Income > 60000), 1, prob = 0.85)
customer_data$Purchase[customer_data$Age < 40] <- rbinom(sum(customer_data$Age < 40), 1, prob = 0.88)
customer_data$Purchase[customer_data$MaritalStatus == "Married"] <- rbinom(sum(customer_data$MaritalStatus == "Married"), 1, prob = 0.8)
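Before modeling, a quick check of the simulated data can confirm it looks as intended (an optional sketch, not required by the lab):
# Optional check: share of purchasers and a summary of the predictors
prop.table(table(customer_data$Purchase))
summary(customer_data[, c("Income", "Age", "MaritalStatus")])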
# Fit the logistic regression model
# Name it model.1
model.1 <- glm(Purchase ~ Income + Age + MaritalStatus, data = customer_data, family = "binomial")
summary(model.1)
##
## Call:
## glm(formula = Purchase ~ Income + Age + MaritalStatus, family = "binomial",
## data = customer_data)
##
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)
## (Intercept)          1.458e+00  3.065e-01   4.756 1.97e-06 ***
## Income               1.136e-05  3.834e-06   2.963  0.00305 **
## Age                 -1.335e-02  5.072e-03  -2.633  0.00847 **
## MaritalStatusSingle -4.331e-01  1.520e-01  -2.849  0.00438 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1081.0 on 999 degrees of freedom
## Residual deviance: 1057.6 on 996 degrees of freedom
## AIC: 1065.6
##
## Number of Fisher Scoring iterations: 4
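Since the lab also asks you to assess model fit, one quick check (a sketch using the null and residual deviances reported above) is a likelihood-ratio test comparing model.1 to the intercept-only model:
# Likelihood-ratio test: does the full model improve on the null model?
lr_stat <- model.1$null.deviance - model.1$deviance
lr_df <- model.1$df.null - model.1$df.residual
pchisq(lr_stat, df = lr_df, lower.tail = FALSE)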
Task:
Exponentiate the coefficients from the logistic regression model to obtain odds ratios.
Interpret the odds ratios. What does an odds ratio greater than 1 signify for each predictor? What about less than 1?
# Exponentiate the coefficients to obtain odds ratios
odds_ratios <- exp(coef(model.1))
conf_int <- exp(confint(model.1))
## Waiting for profiling to be done...
# Display odds ratios and confidence intervals
tidy_model <- broom::tidy(model.1, exponentiate = TRUE, conf.int = TRUE)
tidy_model
## # A tibble: 4 × 7
##   term                estimate  std.error statistic p.value conf.low conf.high
##   <chr>                  <dbl>      <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 (Intercept)            4.30  0.306           4.76  1.97e-6    2.37     7.89
## 2 Income                 1.00  0.00000383      2.96  3.05e-3    1.00     1.00
## 3 Age                    0.987 0.00507        -2.63  8.47e-3    0.977    0.997
## 4 MaritalStatusSingle    0.649 0.152          -2.85  4.38e-3    0.481    0.873
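Because Income is recorded in dollars, its per-dollar odds ratio rounds to 1.00 in the table above. One way to make it easier to interpret (a sketch using an illustrative $10,000 increment) is to rescale the coefficient before exponentiating:
# Odds ratio for a $10,000 increase in Income (increment chosen for illustration)
exp(coef(model.1)["Income"] * 10000)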
Task:
Create a plot to visualize the predicted probability of Purchase based on Income.
Use ggplot2 to generate the logistic regression curve.
# Load ggplot2
library(ggplot2)
# Generate predicted probabilities
customer_data$predicted <- predict(model.1, type = "response")
# Plot predicted probability vs Income
ggplot(customer_data, aes(x = Income, y = predicted)) +
geom_point(alpha = 0.2) +
geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE) +
labs(title = "Predicted Probability of Purchase vs. Income",
x = "Income",
y = "Predicted Probability of Purchase")
Submission Instructions:
Be sure to knit your document to HTML format and check that all content displays correctly before submission.