Replace “Your Name” with your actual name.

Instructions

In this lab, you will apply logistic regression to a new dataset to predict binary outcomes. You will fit logistic regression models, interpret the coefficients and odds ratios, visualize the results, and assess the model’s fit. You will also explore how different predictors impact the probability of the outcome.

  1. Fit a Logistic Regression Model: Use the provided dataset to predict the binary outcome.
  2. Interpret the Coefficients: Interpret the logistic regression coefficients by exponentiating them to obtain odds ratios.
  3. Visualize the Results: Create visualizations to display the relationship between the predictors and the outcome.
  4. Assess Model Fit: Use the ROC curve to assess the fit of your model.
  5. Submit Your Work: Make sure to knit your .Rmd file to HTML before submitting.

Exercise 1: Fitting a Logistic Regression Model

Dataset: You will predict whether a customer will buy a product (Purchase, 0 = No, 1 = Yes) based on predictors Income, Age, and MaritalStatus.

# Load necessary packages
library(dplyr)
library(broom)
# Generate a new dataset
set.seed(123)
customer_data <- data.frame(
  Purchase = rbinom(1000, 1, prob = 0.5),
  Income = round(rnorm(1000, mean = 50000, sd = 20000)),
  Age = sample(18:70, 1000, replace = TRUE),
  MaritalStatus = factor(sample(c("Single", "Married"), 1000, replace = TRUE))
)

customer_data$Purchase[customer_data$Income > 60000] <- rbinom(sum(customer_data$Income > 60000), 1, prob = 0.85)
customer_data$Purchase[customer_data$Age < 40] <- rbinom(sum(customer_data$Age < 40), 1, prob = 0.88)
customer_data$Purchase[customer_data$MaritalStatus == "Married"] <- rbinom(sum(customer_data$MaritalStatus == "Married"), 1, prob = 0.8)
# Fit the logistic regression model
# Name it model.1
# Fit the logistic regression model
model.1 <- glm(Purchase ~ Income + Age + MaritalStatus, data = customer_data, family = "binomial")
summary(model.1)
## 
## Call:
## glm(formula = Purchase ~ Income + Age + MaritalStatus, family = "binomial", 
##     data = customer_data)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          1.458e+00  3.065e-01   4.756 1.97e-06 ***
## Income               1.136e-05  3.834e-06   2.963  0.00305 ** 
## Age                 -1.335e-02  5.072e-03  -2.633  0.00847 ** 
## MaritalStatusSingle -4.331e-01  1.520e-01  -2.849  0.00438 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1081.0  on 999  degrees of freedom
## Residual deviance: 1057.6  on 996  degrees of freedom
## AIC: 1065.6
## 
## Number of Fisher Scoring iterations: 4

Exercise 2: Interpreting the Coefficients and Odds Ratios

Task:

1. Exponentiate the coefficients from the logistic regression model to obtain odds ratios.

2. Interpret the odds ratios. What does an odds ratio greater than 1 signify for each predictor? What about less than 1?

# Get tidy model estimates
model_tidy <- tidy(model.1)

# Get 95% confidence intervals
confint_vals <- confint(model.1)
## Waiting for profiling to be done...
confint_exp <- exp(confint_vals)

# Add exponentiated coefficients (odds ratios)
model_tidy <- model_tidy %>%
  mutate(OddsRatio = exp(estimate),
         LowerCI = confint_exp[, 1],
         UpperCI = confint_exp[, 2])

model_tidy
## # A tibble: 4 × 8
##   term            estimate std.error statistic p.value OddsRatio LowerCI UpperCI
##   <chr>              <dbl>     <dbl>     <dbl>   <dbl>     <dbl>   <dbl>   <dbl>
## 1 (Intercept)      1.46e+0   3.06e-1      4.76 1.97e-6     4.30    2.37    7.89 
## 2 Income           1.14e-5   3.83e-6      2.96 3.05e-3     1.00    1.00    1.00 
## 3 Age             -1.34e-2   5.07e-3     -2.63 8.47e-3     0.987   0.977   0.997
## 4 MaritalStatusS… -4.33e-1   1.52e-1     -2.85 4.38e-3     0.649   0.481   0.873
  • Income: If OddsRatio > 1, higher income increases the odds of purchasing. If OddsRatio < 1, higher income decreases the odds.

Age: Similar logic: odds ratio > 1 means older customers are more likely to purchase.

MaritalStatus (Married vs. Single): If odds ratio < 1, being married reduces the odds of purchasing compared to being single.

Exercise 3: Visualizing the Logistic Regression Results

Task:

1. Create a plot to visualize the predicted probability of Purchase based on Income.

2. Use ggplot2 to generate the logistic regression curve.

# Load ggplot2
library(ggplot2)

# Generate predicted probabilities
customer_data$predicted_prob <- predict(model.1, newdata = customer_data, type = "response")

# Plot the the predicted probability of `Purchase` based on `Income`.
# Generate predicted probabilities
customer_data$predicted_prob <- predict(model.1, newdata = customer_data, type = "response")

# Plot predicted probability vs. Income
ggplot(customer_data, aes(x = Income, y = predicted_prob)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE, color = "blue") +
  labs(title = "Predicted Probability of Purchase vs. Income",
       x = "Income",
       y = "Predicted Probability of Purchase") +
  theme_minimal()

Exercise 4: Checking Model Fit with ROC Curve

Task:

1. Generate the ROC curve for the logistic regression model to assess its fit.

2. Calculate and interpret the AUC value.

#INSTALL pROC
# Load pROC package
library(pROC)

# Generate ROC curve for Purchase vs predicted_prob
# save and name the object so you can plot it.

# Plot ROC curve using plot()

# Display the AUC value
# Load pROC package
library(pROC)

# Generate ROC curve for Purchase vs predicted_prob
roc_obj <- roc(customer_data$Purchase, customer_data$predicted_prob)

# Plot ROC curve
plot(roc_obj, col = "darkorange", main = "ROC Curve for Logistic Regression Model")

# Display AUC value
auc(roc_obj)
## Area under the curve: 0.6028
  • AUC: Ranges from 0.5 (no predictive power) to 1.0 (perfect prediction). Values > 0.7 are generally considered acceptable, > 0.8 good, and > 0.9 excellent.

Exercise 5: Exploring the Impact of Marital Status

Task:

1. Create a plot to visualize the odds ratio for MaritalStatus.

2. Discuss how marital status affects the likelihood of purchasing the product.

# Create a data frame with the odds ratios for Marital Status only
odds_ratio_marital <- 0.6485186
marital_status_odds <- data.frame(
  OddsRatio = odds_ratio_marital,
  Predictor = "MaritalStatus",
  row.names = NULL
)

# Create a plot for the odds ratios of Marital Status
# Create a data frame with the odds ratios for Marital Status only
odds_ratio_marital <- exp(coef(model.1)["MaritalStatusMarried"])
marital_status_odds <- data.frame(
  Predictor = "MaritalStatus (Married vs. Single)",
  OddsRatio = odds_ratio_marital
)
## Error in data.frame(Predictor = "MaritalStatus (Married vs. Single)", : row names contain missing values
# Create a plot for the odds ratio
ggplot(marital_status_odds, aes(x = Predictor, y = OddsRatio)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
  labs(title = "Odds Ratio for Marital Status",
       y = "Odds Ratio",
       x = "") +
  theme_minimal()

  • f the odds ratio is below 1, being married reduces the likelihood of purchase compared to being single.

If it’s above 1, it increases the likelihood.

Submission Instructions:

Ensure to knit your document to HTML format, checking that all content is correctly displayed before submission.