1. Purpose of the Quiz
Quiz 4 focuses on logistic regression: building a model, interpreting results, and giving managerial recommendations.
We will follow these steps:
- Step 1: Model construction. Define a binary outcome (Subscribe: Yes=1) and choose two predictors (Age, Gender).
- Step 2: Interpretation. Interpret coefficients using odds ratios and predicted probabilities.
- Step 3: Managerial recommendations. Translate results into clear business actions.
- Step 4: Value add. A simple predicted probability plot and example profiles.
Important note: The goal is not advanced coding.
The goal is correct interpretation and manager-friendly insights.
2. Load Libraries & Read Data
# ==========================================================
# 2.1 Libraries
# - readxl: to read Excel
# - dplyr: data manipulation
# - ggplot2: visualization
# - broom: tidy model output into clean tables
# - tidyr: reshape data for plotting
# ==========================================================
library(readxl)
library(dplyr)
library(ggplot2)
library(broom)
library(tidyr)
# ==========================================================
# 2.2 Read Excel Data
# - We read the sheet named "data"
# ==========================================================
df_raw <- read_excel("Logitsubscribedata.xlsx", sheet = "data")
# Quick look at structure and first rows
str(df_raw)
## tibble [1,345 × 3] (S3: tbl_df/tbl/data.frame)
## $ Age : num [1:1345] 33 45 57 32 56 60 40 55 27 48 ...
## $ Gender (W=1) : num [1:1345] 0 1 0 1 0 1 0 0 0 1 ...
## $ Subscribe? (Yes=1): num [1:1345] 0 0 0 0 0 1 0 0 0 0 ...
head(df_raw)
## # A tibble: 6 × 3
## Age `Gender (W=1)` `Subscribe? (Yes=1)`
## <dbl> <dbl> <dbl>
## 1 33 0 0
## 2 45 1 0
## 3 57 0 0
## 4 32 1 0
## 5 56 0 0
## 6 60 1 1
Interpretation (Data check):
We confirm we have 1,345 rows and 3 columns: Age, Gender (W=1), and Subscribe? (Yes=1).
This is a binary classification problem because Subscribe is coded 0/1.
3. Data Preparation (Clean names + types)
# ==========================================================
# 3.1 Rename columns to simpler names
# - gender_w: 1 if Woman, 0 if Man (based on the column name)
# - subscribe: 1 if Yes, 0 if No
# ==========================================================
df <- df_raw %>%
  rename(
    gender_w = `Gender (W=1)`,
    subscribe = `Subscribe? (Yes=1)`
  )
# ==========================================================
# 3.2 Ensure correct data types
# - glm() with family = binomial works with a 0/1 numeric outcome
# ==========================================================
df <- df %>%
  mutate(
    gender_w = as.integer(gender_w),
    subscribe = as.integer(subscribe),
    Age = as.numeric(Age)
  )
# Confirm again
str(df)
## tibble [1,345 × 3] (S3: tbl_df/tbl/data.frame)
## $ Age : num [1:1345] 33 45 57 32 56 60 40 55 27 48 ...
## $ gender_w : int [1:1345] 0 1 0 1 0 1 0 0 0 1 ...
## $ subscribe: int [1:1345] 0 0 0 0 0 1 0 0 0 0 ...
summary(df)
## Age gender_w subscribe
## Min. :20.00 Min. :0.0000 Min. :0.0000
## 1st Qu.:29.00 1st Qu.:0.0000 1st Qu.:0.0000
## Median :40.00 Median :1.0000 Median :0.0000
## Mean :39.65 Mean :0.5078 Mean :0.2379
## 3rd Qu.:50.00 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :60.00 Max. :1.0000 Max. :1.0000
Interpretation (Cleaning):
We renamed columns for easier coding and ensured Age is numeric and the
binary variables are integers (0/1).
This helps avoid common R issues (e.g., numbers read as text).
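As a small illustration of the "numbers read as text" issue, the sketch below checks that a column really contains only 0/1 values. The helper `is_binary01` and the toy vectors are hypothetical and not part of the quiz code; they only show the kind of check that could be run on `gender_w` and `subscribe`.

```r
# Hypothetical helper (not part of the quiz code): TRUE only if every value
# is 0 or 1 and nothing is missing
is_binary01 <- function(x) {
  !anyNA(x) && all(x %in% c(0, 1))
}

# Toy vectors standing in for columns read from Excel (assumed values)
ok_col   <- c(0, 1, 0, 1)
text_col <- c("0", "1", "yes")   # what a mis-typed column might look like

is_binary01(ok_col)                                   # TRUE
is_binary01(suppressWarnings(as.integer(text_col)))   # FALSE: "yes" became NA
```

If such a check fails on the real data, the offending rows should be inspected before fitting the model.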
4. Quick Descriptive Statistics
# ==========================================================
# 4.1 Outcome distribution: how many subscribed vs not
# ==========================================================
table(df$subscribe)
##
## 0 1
## 1025 320
# ==========================================================
# 4.2 Age summary
# ==========================================================
summary(df$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.00 29.00 40.00 39.65 50.00 60.00
# ==========================================================
# 4.3 Gender distribution
# ==========================================================
table(df$gender_w)
##
## 0 1
## 662 683
# Subscription rate (mean of 0/1)
mean(df$subscribe)
## [1] 0.2379182
Interpretation (Descriptives):
- The dataset has 1,345 observations.
- The subscription rate is about 24% (320 of 1,345; the mean of a 0/1 variable is its share of 1s).
- Age ranges from 20 to 60 (median 40, mean 39.65).
- Gender is balanced (662 men vs 683 women).
These statistics provide context before modeling.
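Because the model that follows works on the odds scale, it can help to see how the overall subscription rate translates into odds and log-odds. This is a minimal sketch using only the counts printed above; the variable names are illustrative.

```r
# Counts copied from the table(df$subscribe) output above
counts <- c(no = 1025, yes = 320)

rate <- counts[["yes"]] / sum(counts)      # share of subscribers
odds <- counts[["yes"]] / counts[["no"]]   # subscribers per non-subscriber

round(rate, 4)   # 0.2379, matches mean(df$subscribe)
round(odds, 4)
log(odds)        # log-odds: the scale on which logistic regression is linear
```

A rate of about 24% corresponds to odds of roughly 0.31, that is, about 3 subscribers for every 10 non-subscribers.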
5. Step 1: Logistic Regression Model
We model subscription likelihood using two predictors:
- Age (continuous)
- Gender (Woman=1)
Logistic regression is appropriate because the outcome is binary
(0/1).
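For reference, the model can be written out as follows. Here $p$ is notation introduced for the probability that subscribe = 1, and $b_0, b_1, b_2$ are the intercept and the coefficients on Age and gender_w. The second expression is the same equation solved for $p$.

```latex
\log\frac{p}{1-p} = b_0 + b_1\,\mathrm{Age} + b_2\,\mathrm{gender\_w}
\qquad\Longleftrightarrow\qquad
p = \frac{1}{1 + e^{-\left(b_0 + b_1\,\mathrm{Age} + b_2\,\mathrm{gender\_w}\right)}}
```

The left-hand form is what the coefficients describe (log-odds); the right-hand form is what is needed to report probabilities.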
# ==========================================================
# 5.1 Logistic regression model
# glm(..., family = binomial) fits a logistic regression
#
# Model form:
# log(odds(subscribe)) = b0 + b1*Age + b2*gender_w
# ==========================================================
m1 <- glm(subscribe ~ Age + gender_w, data = df, family = binomial)
# Main model output
summary(m1)
##
## Call:
## glm(formula = subscribe ~ Age + gender_w, family = binomial,
## data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.597628 0.230870 2.589 0.00964 **
## Age -0.052399 0.005895 -8.888 < 2e-16 ***
## gender_w 0.407014 0.133725 3.044 0.00234 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1475.9 on 1344 degrees of freedom
## Residual deviance: 1381.1 on 1342 degrees of freedom
## AIC: 1387.1
##
## Number of Fisher Scoring iterations: 4
Interpretation (Model output – high level):
- Check the sign of coefficients (positive/negative).
- Check p-values to see which predictors are statistically significant.
From the output:
- Age coefficient is negative (-0.052) and highly significant (p < 2e-16) → older customers are less likely to subscribe, holding gender constant.
- Gender (Woman=1) coefficient is positive (0.407) and significant (p = 0.002) → women are more likely to subscribe than men of the same age.
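The deviance lines in the output can also be used for a quick overall check of whether the two predictors together improve on an intercept-only model. The sketch below is an optional likelihood-ratio test computed by hand from the printed deviances; it is not part of the original quiz steps.

```r
# Deviances and degrees of freedom copied from the summary(m1) output
null_dev  <- 1475.9; null_df  <- 1344
resid_dev <- 1381.1; resid_df <- 1342

lr_stat <- null_dev - resid_dev   # drop in deviance from adding Age and gender_w
lr_df   <- null_df - resid_df     # number of predictors added

# Upper-tail chi-squared probability for the drop in deviance
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
c(lr_stat = lr_stat, df = lr_df, p_value = p_value)
```

A drop of about 94.8 on 2 degrees of freedom gives a very small p-value, consistent with the individual coefficient tests.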
6. Step 2: Odds Ratios
Logistic coefficients are on the log-odds scale. To make them easier to interpret, we exponentiate them to obtain odds ratios (OR):
- OR > 1 → increases odds of subscribing
- OR < 1 → decreases odds of subscribing
# ==========================================================
# 6.1 Odds Ratios + Confidence Intervals
# exponentiate = TRUE means exp(coef) -> odds ratio
# ==========================================================
or_table <- tidy(m1, conf.int = TRUE, exponentiate = TRUE)
or_table
## # A tibble: 3 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.82 0.231 2.59 9.64e- 3 1.16 2.86
## 2 Age 0.949 0.00590 -8.89 6.22e-19 0.938 0.960
## 3 gender_w 1.50 0.134 3.04 2.34e- 3 1.16 1.95
Interpretation (Odds ratios):
- Age OR ≈ 0.95 (95% CI 0.94 to 0.96): each additional year of age lowers the odds of subscribing by about 5% (1 - 0.95), holding gender constant.
- Gender (Woman) OR ≈ 1.50 (95% CI 1.16 to 1.95): women have about 50% higher odds of subscribing than men of the same age.
Both predictors are statistically significant (p < 0.01, and neither confidence interval includes 1).
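The odds ratios can be reproduced by hand from the coefficients, which also makes it easy to express the age effect over a more meaningful span than one year. The sketch below uses the estimates printed by `summary(m1)`; the 10-year span is an illustrative choice, not something specified in the quiz.

```r
# Coefficients (log-odds scale) copied from the summary(m1) output
b <- c(Age = -0.052399, gender_w = 0.407014)

or_1unit <- exp(b)                  # odds ratio per one-unit change
or_10yr  <- exp(10 * b[["Age"]])    # odds ratio for a 10-year age difference

round(or_1unit, 3)   # Age 0.949, gender_w 1.502
round(or_10yr, 3)
```

Because the effects multiply on the odds scale, a 10-year age difference corresponds to roughly 41% lower odds, not 10 × 5% = 50%.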
7. Predicted Probabilities
Probabilities are usually easier for managers to act on than odds. We compute predicted probabilities for example profiles:
- Ages: 25, 45, 65 (65 lies above the observed maximum age of 60, so that profile is an extrapolation)
- Gender: Man (0) vs Woman (1)
# ==========================================================
# 7.1 Example profiles
# predict(..., type="response") returns predicted probability (0-1)
# ==========================================================
profiles <- data.frame(
  Age = c(25, 25, 45, 45, 65, 65),
  gender_w = c(0, 1, 0, 1, 0, 1)
)
profiles$pred_prob <- predict(m1, newdata = profiles, type = "response")
profiles
## Age gender_w pred_prob
## 1 25 0 0.32908212
## 2 25 1 0.42425609
## 3 45 0 0.14675107
## 4 45 1 0.20533140
## 5 65 0 0.05687796
## 6 65 1 0.08307560
Interpretation (Predicted probabilities):
- Younger customers have a much higher predicted subscription probability.
- At every age, women have a higher predicted probability than men. The model has no Age × Gender interaction, so this ordering holds by construction.
Example (from our output):
- Age 25: ~0.33 (men) vs ~0.42 (women)
- Age 45: ~0.15 (men) vs ~0.21 (women)
- Age 65: ~0.06 (men) vs ~0.08 (women)
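To show what `predict(..., type = "response")` does under the hood, the sketch below recomputes the same probabilities directly from the printed coefficients using the inverse logit (`plogis()` in base R). The object name `profiles_check` is illustrative; small differences from the table above would come only from coefficient rounding.

```r
# Coefficients copied from the summary(m1) output
b0 <- 0.597628; b_age <- -0.052399; b_gender <- 0.407014

profiles_check <- data.frame(
  Age      = c(25, 25, 45, 45, 65, 65),
  gender_w = c(0, 1, 0, 1, 0, 1)
)

# Linear predictor (log-odds), then inverse logit via plogis()
eta <- b0 + b_age * profiles_check$Age + b_gender * profiles_check$gender_w
profiles_check$pred_prob <- plogis(eta)

# Women-minus-men gap in probability at ages 25, 45, 65
gap <- with(profiles_check, pred_prob[gender_w == 1] - pred_prob[gender_w == 0])

round(profiles_check$pred_prob, 3)
round(gap, 3)
```

The gender odds ratio is the same at every age, but the gap in probability is not: it is widest among younger customers and shrinks as the baseline probability falls.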
8. Visualization
We create a smooth probability curve by age for men vs women.
# ==========================================================
# 8.1 Create an age grid for predictions
# ==========================================================
age_grid_m <- data.frame(
  Age = seq(min(df$Age), max(df$Age), by = 1),
  gender_w = 0
)
age_grid_m$pred_prob <- predict(m1, newdata = age_grid_m, type = "response")
age_grid_m$Group <- "Men"
age_grid_w <- data.frame(
  Age = seq(min(df$Age), max(df$Age), by = 1),
  gender_w = 1
)
age_grid_w$pred_prob <- predict(m1, newdata = age_grid_w, type = "response")
age_grid_w$Group <- "Women"
plot_df <- bind_rows(age_grid_m, age_grid_w)
# ==========================================================
# 8.2 Plot
# ==========================================================
ggplot(plot_df, aes(x = Age, y = pred_prob, linetype = Group)) +
  geom_line(linewidth = 1) +
  labs(
    title = "Predicted Subscription Probability by Age and Gender",
    x = "Age",
    y = "Predicted probability of subscribing",
    linetype = "Group"
  ) +
  theme_minimal()

Interpretation (Plot):
- The curve decreases with age → subscription probability decreases as
customers get older.
- The women curve is above the men curve → women are more likely to
subscribe at each age.
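The two age grids can also be built in one step. The sketch below is one possible alternative using base R's `expand.grid()`; to stay self-contained it computes the probabilities from the printed coefficients rather than from `m1`, and the object name `grid` is illustrative.

```r
# Coefficients copied from the summary(m1) output
b0 <- 0.597628; b_age <- -0.052399; b_gender <- 0.407014

# All Age x gender combinations over the observed age range (20 to 60)
grid <- expand.grid(Age = 20:60, gender_w = c(0, 1))
grid$pred_prob <- plogis(b0 + b_age * grid$Age + b_gender * grid$gender_w)
grid$Group <- ifelse(grid$gender_w == 1, "Women", "Men")

head(grid)
```

With the fitted model in the session, `predict(m1, newdata = grid, type = "response")` would give the same column, and `grid` could be passed to the `ggplot()` call in place of `plot_df`.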
9. Step 3: Managerial Recommendations
Based on the model:
- Focus acquisition campaigns on younger customers
  - Age has the largest effect over its observed range (20 to 60). Younger customers have the highest conversion potential.
- Women are a higher-likelihood segment
  - Women have ~50% higher odds of subscribing than men of the same age (OR ≈ 1.50).
  - Use channels and messaging that perform well for this segment.
- Improve conversion for older customers with targeted actions
  - For older customers, test clearer value communication, simpler onboarding, and tailored offers.
Manager-friendly conclusion:
Older age is associated with a significantly lower subscription likelihood, and being female with a higher one. Therefore, marketing should prioritize younger segments (especially women) for acquisition, and develop specific strategies to improve conversion among older customers.
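One way to compare the two predictors on a common footing is to look at the odds ratio across each predictor's full observed range, since a per-year coefficient and a 0/1 coefficient are not directly comparable. This is a rough sketch using the printed coefficients; comparing over the 20 to 60 span is a choice made here for illustration.

```r
# Coefficients copied from the summary(m1) output
b_age <- -0.052399; b_gender <- 0.407014

age_span    <- 60 - 20                  # observed age range in the data
or_age_span <- exp(b_age * age_span)    # odds at age 60 relative to age 20, same gender
or_gender   <- exp(b_gender)            # odds for women relative to men, same age

round(c(age_60_vs_20 = or_age_span, women_vs_men = or_gender), 3)
```

Across the observed range, the oldest customers have roughly one eighth the odds of the youngest, which is a far larger shift than the 1.5-fold difference between women and men.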
10. Limitations
- We used only two predictors (as required by the quiz). Real-world models may need more variables.
- The model shows association, not causation.
- We did not test interactions (e.g., Age × Gender). This can be explored if required.
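If the interaction is explored, one common approach is to fit a second model with an Age × Gender term and compare it with the main-effects model using a likelihood-ratio test. The sketch below shows the mechanics only. It runs on simulated stand-in data (generated from the reported coefficients, with no true interaction), not on the quiz dataset, so its output says nothing about the real data.

```r
# Simulated stand-in data: NOT the quiz dataset (assumption for illustration)
set.seed(42)
n <- 1345
sim <- data.frame(
  Age      = sample(20:60, n, replace = TRUE),
  gender_w = rbinom(n, 1, 0.5)
)
p_true <- plogis(0.597628 - 0.052399 * sim$Age + 0.407014 * sim$gender_w)
sim$subscribe <- rbinom(n, 1, p_true)

# Main-effects model vs model with an Age x gender_w interaction
m_main <- glm(subscribe ~ Age + gender_w, data = sim, family = binomial)
m_int  <- glm(subscribe ~ Age * gender_w, data = sim, family = binomial)

# Likelihood-ratio test of the added interaction term
lr <- anova(m_main, m_int, test = "Chisq")
lr
```

On the real data, the same two `glm()` calls with `data = df` would apply; a small p-value in the comparison would suggest that the age effect differs between men and women.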
