Instructions

In this lab, you will apply logistic regression to a new dataset to predict binary outcomes. You will fit logistic regression models, interpret the coefficients and odds ratios, visualize the results, and assess the model’s fit. You will also explore how different predictors impact the probability of the outcome.

  1. Fit a Logistic Regression Model: Use the provided dataset to predict the binary outcome.
  2. Interpret the Coefficients: Interpret the logistic regression coefficients by exponentiating them to obtain odds ratios.
  3. Visualize the Results: Create visualizations to display the relationship between the predictors and the outcome.
  4. Assess Model Fit: Use the ROC curve to assess the fit of your model.
  5. Submit Your Work: Make sure to knit your .Rmd file to HTML before submitting.

Exercise 1: Fitting a Logistic Regression Model

Dataset: You will predict whether a customer will buy a product (Purchase, 0 = No, 1 = Yes) based on predictors Income, Age, and MaritalStatus.

# Load necessary packages
library(dplyr)
library(broom)
# Generate a new dataset
set.seed(123)
customer_data <- data.frame(
  Purchase = rbinom(1000, 1, prob = 0.5),
  Income = round(rnorm(1000, mean = 50000, sd = 20000)),
  Age = sample(18:70, 1000, replace = TRUE),
  MaritalStatus = factor(sample(c("Single", "Married"), 1000, replace = TRUE))
)

customer_data$Purchase[customer_data$Income > 60000] <- rbinom(sum(customer_data$Income > 60000), 1, prob = 0.85)
customer_data$Purchase[customer_data$Age < 40] <- rbinom(sum(customer_data$Age < 40), 1, prob = 0.88)
customer_data$Purchase[customer_data$MaritalStatus == "Married"] <- rbinom(sum(customer_data$MaritalStatus == "Married"), 1, prob = 0.8)
# Fit the logistic regression model
model.1 <- glm(Purchase ~ Income + Age + MaritalStatus, 
               data = customer_data, 
               family = "binomial")
# Name it model.1
summary(model.1)
## 
## Call:
## glm(formula = Purchase ~ Income + Age + MaritalStatus, family = "binomial", 
##     data = customer_data)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          1.458e+00  3.065e-01   4.756 1.97e-06 ***
## Income               1.136e-05  3.834e-06   2.963  0.00305 ** 
## Age                 -1.335e-02  5.072e-03  -2.633  0.00847 ** 
## MaritalStatusSingle -4.331e-01  1.520e-01  -2.849  0.00438 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1081.0  on 999  degrees of freedom
## Residual deviance: 1057.6  on 996  degrees of freedom
## AIC: 1065.6
## 
## Number of Fisher Scoring iterations: 4

Exercise 2: Interpreting the Coefficients and Odds Ratios

Task:

1. Exponentiate the coefficients from the logistic regression model to obtain odds ratios.

2. Interpret the odds ratios. What does an odds ratio greater than 1 signify for each predictor? What about less than 1?

# Exponentiate the coefficients to obtain odds ratios
exp(coef(model.1))
##         (Intercept)              Income                 Age MaritalStatusSingle 
##           4.2959778           1.0000114           0.9867365           0.6485186
# Display odds ratios and confidence intervals
exp(cbind(OddsRatio = coef(model.1), confint(model.1)))
## Waiting for profiling to be done...
##                     OddsRatio     2.5 %    97.5 %
## (Intercept)         4.2959778 2.3698526 7.8885565
## Income              1.0000114 1.0000039 1.0000189
## Age                 0.9867365 0.9769158 0.9965514
## MaritalStatusSingle 0.6485186 0.4809486 0.8731176
  • Income: 1.0000121 - For every one unit increae of income, the odds of purchase increase slightly by about 0.0012%. The odds ratio being greater than 1 signifies that higher income is associated with higher odds of purchase.
  • Age: 0.9876778 - For every added year of age, the odds of purchase decrease by around 1.2%. The odds ratio being less than 1 symbolizes that older people are less likely to purchase.
  • MaritalStatus (Married vs. Single): 0.5787593 - Being single is associated with a 42% decrease in likelihood of purchasing. The odds ratio being less than one means that single people are less likely to purchase as opposed to married people.

Exercise 3: Visualizing the Logistic Regression Results

Task:

1. Create a plot to visualize the predicted probability of Purchase based on Income.

2. Use ggplot2 to generate the logistic regression curve.

# Load ggplot2
library(ggplot2)

# Generate predicted probabilities
customer_data$predicted_prob <- predict(model.1, newdata = customer_data, type = "response")

# Plot the the predicted probability of `Purchase` based on `Income`.
ggplot(customer_data, aes(x = Income, y = predicted_prob)) +
  geom_point() +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  labs(x = "Income", y = "Predicted Probability", 
       title = "Predicted Purchase Probability by Income")

Exercise 4: Checking Model Fit with ROC Curve

Task:

1. Generate the ROC curve for the logistic regression model to assess its fit.

2. Calculate and interpret the AUC value.

#INSTALL pROC
# Load pROC package
library(pROC)

head(customer_data$predicted_prob)
## [1] 0.7299340 0.8062799 0.7729388 0.7851370 0.8067591 0.7433706
# Generate ROC curve for Purchase vs predicted_prob
roc_result <- roc(customer_data$Purchase, customer_data$predicted_prob)

# save and name the object so you can plot it.

# Plot ROC curve using plot()
plot(roc_result, main = "ROC Curve")

# Display the AUC value
auc_value <- auc(roc_result)
print(auc_value)
## Area under the curve: 0.6028
  • AUC: 0.6111

Exercise 5: Exploring the Impact of Marital Status

Task:

1. Create a plot to visualize the odds ratio for MaritalStatus.

2. Discuss how marital status affects the likelihood of purchasing the product.

# Create a data frame with the odds ratios for Marital Status only
odds_ratio_marital <- 0.6485186
marital_status_odds <- data.frame(
  OddsRatio = odds_ratio_marital,
  Predictor = "MaritalStatus",
  row.names = NULL
)

# Create a plot for the odds ratios of Marital Status
ggplot(marital_status_odds, aes(x = Predictor, y = OddsRatio)) +
  geom_col(fill = "lightblue") +
  geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
  ylim(0, 2) +
  labs(title = "Odds Ratio for Marital Status",
       y = "Odds Ratio")

  • The plot shows us that the odds ratio is less than 1, meaning being single decreases the odds of purchasing the product compared to those who are married. Since an odds ratio of 1.0 means no difference or affect, it clearly shows us being single lowers the chance of purchase.