W04-1: Reflection on Logistic Regression and Ch8 (Preprocessing with Recipes)

Author

Leeann Lewis-Waddell

Published

July 1, 2026

1 Video Quizzes

Code

library(tidyverse)


# Recreate the direct-mail dataset used throughout the course
set.seed(123)
n <- 3000

mail_data <- tibble(
  customer_id = paste0("C", str_pad(1:n, 5, pad = "0")),
  age = round(rnorm(n, mean = 45, sd = 12)),
  income = round(rlnorm(n, meanlog = 10.8, sdlog = 0.6)),
  recency_days = round(rexp(n, rate = 1 / 60)),
  freq_12mo = rpois(n, lambda = 3),
  avg_order_amt = round(rlnorm(n, meanlog = 4.2, sdlog = 0.5), 2),
  channel = sample(
    c("email", "direct_mail", "digital"),
    n,
    replace = TRUE,
    prob = c(0.5, 0.3, 0.2)
  ),
  region = sample(
    c("West", "South", "Midwest", "Northeast"),
    n,
    replace = TRUE
  ),
  loyalty_tier = sample(
    c("Bronze", "Silver", "Gold", "Platinum"),
    n,
    replace = TRUE,
    prob = c(0.4, 0.3, 0.2, 0.1)
  ),
  responded = rbinom(
    n,
    1,
    prob = plogis(
      -3 +
        0.02 * (age - 45) +
        0.3 * log(income / 50000) +
        0.1 * freq_12mo -
        0.005 * recency_days
    )
  )
) |>
  mutate(
    income = if_else(runif(n) < 0.08, NA_real_, income),
    avg_order_amt = if_else(runif(n) < 0.05, NA_real_, avg_order_amt),
    responded = factor(responded, levels = c(1, 0), labels = c("yes", "no"))
  )

1.1 Step 1-1 — Logarithms

Video: Logs (Logarithms), Clearly Explained!!! https://www.youtube.com/watch?v=VSi0Z04fWj0

1.1.1 Q1 — What Does a Log Do?

The video says the core job of a logarithm is to isolate the exponent.

Complete the statement: “If $2^5 = 32$, then $\log_2(32) =$ ____ because…”
- 5 because log functions isolate the exponent. log₂(32) = log₂(2⁵)
Verify in R: log(32, base = 2)

log(32, base = 2)

[1] 5
Why is this property useful when income values range from $7,000 to $455,000 in your dataset?
- Fold changes become symmetric.

Code

# Your code here
log(32, base = 2)

1.1.2 Q2 — Multiplication Becomes Addition

The video demonstrates: $\log(a \times b) = \log(a) + \log(b)$.

Verify using a = 4, b = 8 (base 2) — compute both sides and confirm they are equal
- log₂( 4 x 8) = log₂(32) = 5
- log₂(4) + log₂(8) = log₂(2²) + log₂(2³) = 2 + 3 = 5
In logistic regression, we multiply many small probabilities together. Explain in one sentence why the log property above makes this computationally safer.
- Adding logs makes the the small numbers produced by probabilities that would potentially become zero more manageable to account for.

Code

log(4 * 8, base = 2)
log(4, base = 2) + log(8, base = 2)

1.1.3 Q3 — Geometric Mean in Marketing

The video shows the geometric mean is better than the arithmetic mean for data that multiplies rather than adds.

Five customers’ order values: $10, $20, $80, $40, $1,000.

Compute the arithmetic mean with mean()
- orders <- c(10, 20, 80, 40, 1000)
  mean(orders)
  230
Compute the geometric mean: exp(mean(log(x)))
- exp(mean(log(orders)))
  57.708
Which better represents the “typical” customer — and why does the $1,000 order distort the arithmetic mean more?
Geometric mean represents the “typical” customer more accurately. The $1,000 inflates the average because it is an outlier, essentially dominating the scale.

Code

orders <- c(10, 20, 80, 40, 1000)
mean(orders)
exp(mean(log(orders)))

1.1.4 Q4 — Log(0) and the Offset

The video explains $\log(0)$ is undefined (negative infinity).

Run log(0) — what does R return?

log(0) [1]

-Inf
Your Week 2 recipe uses step_log(..., offset = 1). Explain in one sentence why this offset is necessary.
Adding 1 ensures the minimum input is log(1) = 0 instead of crashing because using zero is not possible.
What is log(0 + 1)? Why is adding 1 a safe lower bound?

log(0 + 1)

[1] 0

Adding 1 is a safe lower bound because log(1) = 0, which is a defined value as opposed to infinity.

Code

log(0)
log(0 + 1)

1.1.5 Q5 — Base Conversion

The video notes that base 2, base 10, and natural log all follow the same rules and differ only by a constant factor.

For income = $49,021, compute log2(), log10(), and log() (natural)

income_val <- 49021

log(income_val, base = 2)

[1] 15.58111

log10(income_val)

[1] 4.690382

log(income_val)

[1] 10.8
Divide log() by log10() — what constant do you get?

log(income_val) / log10(income_val)

[1] 2.302585
After step_normalize() is applied in the recipe, does it matter which base was used for step_log()? Explain why or why not.
No, after normalization, all log bases produce the same scaled feature.

Code

income_val <- 49021
log(income_val, base = 2)
log10(income_val)
log(income_val)

# Ratio
log(income_val) / log10(income_val)

1.2 Step 1-2 — Odds and Log(Odds)

Video: Odds and Log(Odds), Clearly Explained!!! https://www.youtube.com/watch?v=ARfXDSkQf1Y

1.2.1 Q6 — Probability vs. Odds

The video distinguishes probability from odds.

In mail_data, 142 customers responded (“yes”) out of 3,000 total.

Compute the probability of responding: $P(\text{respond}) = \frac{\text{yes}}{\text{total}}$
yes = 142/3,000
yes = 0.0473

yes <- 142

no <- 3000 - 142

p <- yes / (yes + no)

p [1] 0.04733333
Compute the odds of responding: $\text{Odds} = \frac{\text{yes}}{\text{no}} = \frac{P}{1 - P}$

odds = 0.0473/(1-0.0473)
odds = 0.0473/.9527

odds = 0.0496

odds <- yes / no

odds [1] 0.04968509
In plain English, what does an odds of 0.05 mean for a marketer trying to interpret campaign response?

For every one customer that responds, there are 20 who won’t.

Code

yes <- 142
no <- 3000 - 142

# Probability
p <- yes / (yes + no)
p

# Odds
odds <- yes / no
odds

# Or equivalently
p / (1 - p)

1.2.2 Q7 — Converting Between Probability and Odds

The video shows you can derive odds from probabilities using $\text{Odds} = \frac{P}{1 - P}$.

If a customer has a 20% chance of responding, what are their odds?
odds = .2/(1-.2)
odds = .25
If a customer has a 50% chance, what are their odds?
odds = .5/(1-.5)
odds = 1
If a customer has an 80% chance, what are their odds?
odds = .8/(1-.8)
odds = 4
What pattern do you notice? What happens to odds as probability approaches 1?
As probability approaches 1, odds increase in favor of the predicted outcome.

Code

prob_to_odds <- function(p) p / (1 - p)

prob_to_odds(0.20)
prob_to_odds(0.50)
prob_to_odds(0.80)

1.2.3 Q8 — Why Log(Odds)?

The video explains that odds are asymmetric: odds against an event range 0 to 1, while odds in favor range 1 to infinity. Taking the log fixes this.

Compute log(odds) for probabilities 0.10, 0.50, and 0.90
log_odds(.10) = log(.1/.9) = -2.197225
log_odds(.50) = log(.5/.5) = 0

log_odds(.90) = log(.9/.1) = 2.197225
Verify that log(odds) at $P = 0.50$ equals exactly 0 — why does this make sense intuitively?
log_odds(.50) = log(.5/.5) = log(1) =0
Are the log(odds) values symmetric around 0? What does this symmetry mean for modeling?
Yes. With respect to modeling, it makes the relationships symmetric and allows for easy comparisons.

Code

probs <- c(0.10, 0.50, 0.90)
odds <- probs / (1 - probs)
log_odds <- log(odds)

tibble(probability = probs, odds = odds, log_odds = log_odds)

1.2.4 *Q9 — The Logit Function

The video introduces the logit as the log of the odds ratio, central to logistic regression.

\[\text{logit}(P) = \log\left(\frac{P}{1-P}\right)\]

Using mail_data, compute the overall response rate (probability)

p_respond <- mean(mail_data$responded == “yes”)

p_respond

[1] 0.04733333
Convert it to log(odds) using the formula above

log(p_respond / (1 - p_respond))

[1] -3.00205
In your Week 2 recipe, the model output .pred_yes is a probability. Write the R code to convert it to log(odds).

Code

# Overall response rate
p_respond <- mean(mail_data$responded == "yes")
p_respond

# Convert to log(odds)
log(p_respond / (1 - p_respond))

1.2.5 Q10 — Log(Odds) and the Normal Distribution

The video notes that log(odds) values often follow a normal distribution, making them useful for modeling binary outcomes.

Written question — no code required:

Your income variable was log-transformed in the recipe because raw income is right-skewed. The video suggests log(odds) of a binary outcome often follow a normal distribution. Why might this property make logistic regression a natural choice for modeling responded (yes/no)?

Because log-odds are normally distributed and unbounded, a linear model can fit them directly.
In 2–3 sentences, explain why working in log(odds) space is more mathematically convenient than working in probability space for a linear model.

Log odds range from negative infinity to positive infinity. Coefficients represent consistent, additive shifts in the log odds for every unit change in a predictor. It also keeps the math clean because multiplying probabilities becomes addition in log odds space, which is far easier to work with algebraically.

1.3 Step 1-3 — Odds Ratios and Log(Odds Ratios)

Video: Odds Ratios and Log(Odds Ratios), Clearly Explained!!! https://www.youtube.com/watch?v=8nm0G-1uJzA

1.3.1 Q11 — What Is an Odds Ratio?

The video defines an odds ratio (OR) as the ratio of two odds.

In mail_data, compare the odds of responding for Gold loyalty tier customers vs. Bronze loyalty tier customers.

Compute the odds of responding for Gold customers

gold_odds <- tier_summary |>

filter(loyalty_tier == “Gold”) |>

pull(odds)

gold_odds

[1] 0.05
Compute the odds of responding for Bronze customers

bronze_odds <- tier_summary |>

filter(loyalty_tier == “Bronze”) |>

pull(odds)

bronze_odds

[1] 0.05074875
Compute the odds ratio: Gold odds ÷ Bronze odds

OR <- gold_odds / bronze_odds

OR

[1] 0.9852459
Is Gold tier a stronger or weaker predictor of response than Bronze?
Yes

Code

tier_summary <- mail_data |>
  group_by(loyalty_tier) |>
  summarise(
    yes = sum(responded == "yes"),
    no = sum(responded == "no"),
    odds = yes / no
  )

tier_summary

# Odds ratio: Gold vs Bronze
gold_odds <- tier_summary |> filter(loyalty_tier == "Gold") |> pull(odds)
bronze_odds <- tier_summary |> filter(loyalty_tier == "Bronze") |> pull(odds)

OR <- gold_odds / bronze_odds
OR

1.3.2 Q12 — Asymmetry and Log(Odds Ratio)

The video explains that odds ratios are asymmetric: values below 1 (negative association) are compressed into 0–1, while values above 1 (positive association) range from 1 to infinity.

Compute log(OR) from Q11. Is it positive or negative?

log(OR)

[1] -0.01486402

negative
Compute the odds ratio in the reverse direction: Bronze ÷ Gold. Then take log() of that. What do you notice?

OR_reverse <- bronze_odds / gold_odds

log(OR_reverse)

[1] 0.01486402

Odds are now positive, meaning these are the chances respondents will not respond
Why is log(OR) = 0 the reference point for “no relationship”?
Because a log odds ratio of 0 means the odds ratio itself equals 1, and an odds ratio of 1 means the odds of the outcome are identical in both groups, so the predictor has no effect on the response.

Code

# Log odds ratio
log(OR)

# Reverse direction
OR_reverse <- bronze_odds / gold_odds
log(OR_reverse)

1.3.3 Q13 — Effect Size Interpretation

The video compares the odds ratio to R² as a measure of effect size — the larger the magnitude, the stronger the relationship.

Compute odds ratios for all four loyalty tiers against Bronze as the reference, then answer:

Which tier has the strongest association with response?
Platinum
Which has the weakest?
Gold
How would you communicate the strongest result to a non-technical marketing manager in one sentence?
Platinum loyalty members are about 1.5 times more likely to respond to our campaign than Bronze members, making them our highest-priority segment to target.

Code

tier_summary |>
  mutate(OR_vs_bronze = odds / bronze_odds, log_OR = log(OR_vs_bronze)) |>
  arrange(desc(abs(log_OR)))

1.3.4 *Q14 — Fisher’s Exact Test

The video introduces Fisher’s Exact Test for small samples to determine if an observed association is statistically significant.

Using Gold vs. Bronze customers from mail_data:

Build a 2×2 contingency table using table()
Run fisher.test() on it
What does the p-value tell you? Is the Gold/Bronze difference statistically significant at the 0.05 level?
A p-value of 1 means the difference is not statistically significant
When would you prefer Fisher’s Exact Test over a chi-square test?
When the dataset is small

Code

gold_bronze <- mail_data |>
  filter(loyalty_tier %in% c("Gold", "Bronze"))

ct <- table(gold_bronze$loyalty_tier, gold_bronze$responded)
ct

stats::fisher.test
conflicted::conflicts_prefer(janitor::fisher.test)
fisher.test(ct)

1.3.5 Q15 — Chi-Square Test

The video explains the chi-square test compares observed values to expected values assuming no relationship exists.

Run chisq.test() on the same 2×2 table from Q14
Compare the p-value to Fisher’s Exact Test — are the conclusions the same?
The p-value is the same
The video notes chi-square works well for larger datasets. With 3,000 rows in mail_data, which test is more appropriate — and why?
Chi-square; it also compared observed values to expected values.

Code

conflicted::conflicts_prefer(janitor::chisq.test)
chisq.test(ct)

1.3.6 Q16 — Wald Test Concept

The video explains the Wald Test evaluates how many standard deviations a log(odds ratio) is from zero, assuming a normal distribution.

Written question — no code required:

In your own words, what does it mean for a log(odds ratio) to be “many standard deviations from zero”?
It means that the log(odds ratio) represents a real relationship rather than random noise. The further from zero, the stronger the evidence that the predictor actually matters.
The Wald test is used inside logistic regression to assess individual predictors. In the Week 2 notebook, you fitted a Lasso logistic regression. How does Lasso’s penalty relate to the Wald test’s goal of identifying which predictors are meaningfully different from zero?
Both are trying to determine which predictors actually matter in different ways. The Wald test does it after fitting by checking whether each coefficient is far enough from zero to be considered meaningful. Lasso does it during fitting by penalizing coefficients and shrinking unimportant ones all the way to exactly zero, effectively removing them from the model.
The video says there is no single consensus on which test (Fisher, chi-square, Wald) is best. What practical guideline does it suggest?
The video suggests relying on the technique most related to the respective field of study.

1.3.7 Q17 — Connecting Odds Ratios to the Course

Written question:

In logistic regression, the model coefficients are log(odds ratios) — each coefficient tells you how much the log(odds) of the outcome changes for a one-unit increase in a predictor.

In your Week 2 model, freq_12mo is a predictor. If its coefficient is positive, what does that imply about the odds of a customer responding as their purchase frequency increases?
A positive coefficient means that as purchase frequency increases, the odds of a customer responding to the campaign also increase.
After step_normalize(), predictors are on the same scale. Why does this make it easier to compare coefficients (and therefore odds ratios) across predictors with very different original units (e.g., income in dollars vs. frequency in counts)?
Normalization allows for every predictor to be measured in standard deviations, so a one unit increase means the same thing across all predictors. Coefficients can be compared side by side and directly rank which predictors have the strongest association with response, no matter what their original units were.

1.4 Step 2-1 — Logistic Regression

Video: Logistic Regression, Clearly Explained!!! https://www.youtube.com/watch?v=yIYKR4sgzI8

1.4.1 Q18 — Linear vs. Logistic Regression

The video begins by reviewing linear regression, then explains why it fails for binary outcomes.

What does linear regression predict? What does logistic regression predict?
Linear regression predicts a continuous numeric value with no bounds.
Logistic regression predicts the probability of belonging to one of two classes, always staying between 0 and 1.
Why can’t you use a straight line to model a yes/no outcome like responded? What goes wrong at the extremes?
A straight line keeps going in both directions forever, so at the extremes it will predict values below 0 or above 1, which are impossible for a probability.
In mail_data, the response rate is about 4.7%. If you fitted a linear regression to predict responded (as 0/1), what value might it predict for a very frequent, high-income customer — and why is that problematic?
For a very frequent, high income customer, a linear regression might predict something like 1.3 or higher, since those strong predictor values push the line upward with no ceiling to stop it. This is problematic because a response probability cannot exceed 1, and feeding impossible predictions into any downstream decision making, such as expected revenue calculations, would produce nonsense results.

Written answer — no code required.

1.4.2 Q19 — The S-Shaped Logistic Curve

The video shows that logistic regression fits an S-shaped curve (sigmoid function) that maps any input to a probability between 0 and 1.

\[P = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}\]

In R, plogis() computes this function. Evaluate it at inputs of −3, 0, and 3:

plogis(-3)

[1] 0.04742587: negative linear predictor

plogis(0)

[1] 0.5: zero

plogis(3)

[1] 0.9525741: positive linear predictor

Code

plogis(-3) # very negative linear predictor
plogis(0) # at zero
plogis(3) # very positive linear predictor

What probability does plogis(0) return — and why does that make sense from the formula?

plogis(0)

[1] 0.5

It computes 1 / (1 + e^₀), and since e⁰ = 1, which simplifies to 1 / (1 + 1) = 1 / 2 = 0.5. This makes sense because a log odds of 0 means the odds are exactly 1, meaning the outcome is equally likely to happen or not happen.
Notice that your dataset’s responded column was generated using plogis(...) — find that line in the setup chunk above and explain what the linear predictor inside plogis() represents in marketing terms.

responded = rbinom( n, 1, prob = plogis( -3 + 0.02 * (age - 45) + 0.3 * log(income / 50000) + 0.1 * freq_12mo - 0.005 * recency_days

This is RFM - recency, frequency, monetary pattern.

1.4.3 Q20 — Maximum Likelihood vs. Least Squares

The video explains a key difference in how the two models are fitted:

Linear regression: minimizes the sum of squared residuals (least squares)
Logistic regression: maximizes the likelihood of the observed data

Written question:

Why can’t logistic regression use least squares? (Hint: what is the “residual” when the outcome is 0 or 1?)

When the outcome is only ever 0 or 1, errors are not normally distributed and the relationship is not linear, so minimizing squared residuals produces a nonsensical result because the math does not apply to binary data.
In plain English, what does it mean for logistic regression to find the curve that “maximizes the likelihood” of the observed responses?
The model finds the curve that assigns high probabilities to customers who actually responded and low probabilities to those who did not, across all customers at once.
The video mentions that logistic regression does not have the same concept of a residual. How does this affect how we assess model fit — and what metrics do we use instead? (Reference your Week 3 notebook.)
Since residuals are not meaningful, we use metrics like AUC ROC, sensitivity, specificity, and the confusion matrix to measure how well the model separates responders from non-responders.

1.4.4 Q21 — Predicting Probability vs. Class

In the Week 2 notebook, augment() returns both .pred_yes (a probability) and .pred_class (a class label).

Run the code below to see the distribution of .pred_yes. What is the typical predicted probability for a customer in this dataset?
The predicted probabilities cluster between 0.04 and 0.07, with most customers sitting right around 0.05.
The default threshold for .pred_class is 0.5. Given the 4.7% response rate, why might 0.5 be a poor threshold for a direct-mail campaign?
None of the customers in the dataset has a predicted probability anywhere near 0.5. Using that threshold would classify every single customer as “no” and the model would never identify anyone to mail.
What threshold might make more business sense — and what metric from Week 3 would you use to evaluate the impact of changing it?
A threshold around 0.055 to 0.06 would make more sense

Code

library(tidymodels)
library(tidyverse)
library(glmnet)

tidymodels_prefer()

mail_data <- tibble(
  customer_id = paste0("C", str_pad(1:n, 5, pad = "0")),
  age = round(rnorm(n, mean = 45, sd = 12)),
  income = round(rlnorm(n, meanlog = 10.8, sdlog = 0.6)),
  recency_days = round(rexp(n, rate = 1 / 60)),
  freq_12mo = rpois(n, lambda = 3),
  avg_order_amt = round(rlnorm(n, meanlog = 4.2, sdlog = 0.5), 2),
  channel = sample(
    c("email", "direct_mail", "digital"),
    n,
    replace = TRUE,
    prob = c(0.5, 0.3, 0.2)
  ),
  region = sample(
    c("West", "South", "Midwest", "Northeast"),
    n,
    replace = TRUE
  ),
  loyalty_tier = sample(
    c("Bronze", "Silver", "Gold", "Platinum"),
    n,
    replace = TRUE,
    prob = c(0.4, 0.3, 0.2, 0.1)
  ),
  responded = rbinom(
    n,
    1,
    prob = plogis(
      -3 +
        0.02 * (age - 45) +
        0.3 * log(income / 50000) +
        0.1 * freq_12mo -
        0.005 * recency_days
    )
  )
) |>
  mutate(
    income = if_else(runif(n) < 0.08, NA_real_, income),
    avg_order_amt = if_else(runif(n) < 0.05, NA_real_, avg_order_amt),
    responded = factor(responded, levels = c(1, 0), labels = c("yes", "no"))
  )


mail_split <- initial_split(mail_data, prop = 0.80, strata = responded)
mail_train <- training(mail_split)
mail_test <- testing(mail_split)

mail_rec <- recipe(responded ~ ., data = mail_train) |>
  update_role(customer_id, new_role = "ID") |>
  step_impute_median(all_numeric_predictors()) |>
  step_log(income, avg_order_amt, recency_days, base = 10, offset = 1) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())

lr_spec <- logistic_reg(penalty = 0.01, mixture = 1) |>
  set_engine("glmnet") |>
  set_mode("classification")

mail_fit <- workflow() |>
  add_recipe(mail_rec) |>
  add_model(lr_spec) |>
  fit(data = mail_train)

preds <- augment(mail_fit, new_data = mail_test)

# Distribution of predicted probabilities
ggplot(preds, aes(x = .pred_yes, fill = responded)) +
  geom_histogram(bins = 40, alpha = 0.7, position = "identity") +
  labs(
    title = "Distribution of Predicted Probabilities",
    x = "P(responded = yes)",
    y = "Count"
  ) +
  theme_minimal()

1.4.5 Q22 — Wald’s Test and Variable Selection

The video mentions that logistic regression uses Wald’s tests to evaluate whether individual predictors contribute meaningfully to the model.

In the Week 2 notebook, you used a Lasso penalty (mixture = 1). How does Lasso serve a similar purpose to Wald’s test — identifying which predictors are not useful?
The Wald test identifies useless predictors by checking if their coefficient is close enough to zero to be considered meaningless. Lasso shrinks weak predictors all the way to exactly zero during fitting, removing them from the model automatically.
Run the code below to extract the model coefficients. Which predictors were shrunk to exactly zero by the Lasso?
recency_days, freq_12mo, avg_order_amt, channel_direct_mail, channel_email, region_Northeast, region_South, and region_West
Based on the video’s explanation of what logistic regression coefficients represent (log odds ratios), interpret the sign of the freq_12mo coefficient in plain English.
freq_12mo was shrunk to zero, meaning the model found no meaningful relationship between purchase frequency and response after accounting for the other predictors.

Code

# Extract Lasso coefficients
mail_fit |>
  extract_fit_parsnip() |>
  tidy() |>
  filter(term != "(Intercept)") |>
  arrange(desc(abs(estimate)))

1.4.6 Q23 — Synthesis: From Logs to Logistic Regression

This final question ties all four videos together.

Written question (3–5 sentences):

The four videos form a conceptual chain: Logs → Odds → Odds Ratios → Logistic Regression.

Explain in your own words how each step connects to the next:
- Why do we take the log of odds (not just use odds directly)?
  Taking the log stretches odds to an unbounded, symmetric scale that a linear model can handle cleanly.
- Why do we compare odds as a ratio rather than a difference?
  A difference in odds depends on the baseline, making it difficult to compare across groups. A ratio is scale-free.
- How does the logit (log odds) become the linear predictor inside logistic regression?
  Once odds are logged, the result ranges from negative infinity to positive infinity, exactly matching what a linear combination of predictors produces. Logistic regression sets the linear combination equal to the log odds, then uses plogis to convert back to a probability for the final prediction.
In the context of the direct-mail campaign in mail_data, describe in one sentence what the logistic regression model is actually computing when it produces .pred_yes = 0.08 for a specific customer.
The model is taking that customer’s age, income, frequency, and other features, combining them into a single log odds score using the fitted coefficients, and then converting that score into a probability, concluding that this particular customer has a 1 in 12 chance of responding to the campaign.

2 Log Assignments

Run this chunk first — it loads the packages and recreates the direct-mail dataset used throughout the course.

Code

library(tidyverse)

# ── Recreate the direct-mail dataset ─────────────────────────────────────────
set.seed(123)
n <- 3000

mail_data <- tibble(
  customer_id = paste0("C", str_pad(1:n, 5, pad = "0")),
  age = round(rnorm(n, mean = 45, sd = 12)),
  income = round(rlnorm(n, meanlog = 10.8, sdlog = 0.6)),
  recency_days = round(rexp(n, rate = 1 / 60)),
  freq_12mo = rpois(n, lambda = 3),
  avg_order_amt = round(rlnorm(n, meanlog = 4.2, sdlog = 0.5), 2),
  channel = sample(
    c("email", "direct_mail", "digital"),
    n,
    replace = TRUE,
    prob = c(0.5, 0.3, 0.2)
  ),
  region = sample(
    c("West", "South", "Midwest", "Northeast"),
    n,
    replace = TRUE
  ),
  loyalty_tier = sample(
    c("Bronze", "Silver", "Gold", "Platinum"),
    n,
    replace = TRUE,
    prob = c(0.4, 0.3, 0.2, 0.1)
  ),
  responded = rbinom(
    n,
    1,
    prob = plogis(
      -3 +
        0.02 * (age - 45) +
        0.3 * log(income / 50000) +
        0.1 * freq_12mo -
        0.005 * recency_days
    )
  )
) |>
  mutate(
    income = if_else(runif(n) < 0.08, NA_real_, income),
    avg_order_amt = if_else(runif(n) < 0.05, NA_real_, avg_order_amt),
    responded = factor(responded, levels = c(1, 0), labels = c("yes", "no"))
  )

glimpse(mail_data)

Rows: 3,000
Columns: 10
$ customer_id   <chr> "C00001", "C00002", "C00003", "C00004", "C00005", "C0000…
$ age           <dbl> 38, 42, 64, 46, 47, 66, 51, 30, 37, 40, 60, 49, 50, 46, …
$ income        <dbl> 44793, 40269, 20560, 32261, 233070, 47933, 84804, 43883,…
$ recency_days  <dbl> 103, 102, 27, 115, 47, 106, 4, 62, 13, 21, 139, 100, 9, …
$ freq_12mo     <int> 4, 3, 0, 3, 1, 3, 5, 2, 1, 6, 1, 3, 3, 3, 6, 1, 1, 1, 6,…
$ avg_order_amt <dbl> 80.19, NA, 54.80, 46.49, 64.86, 78.43, 39.45, NA, 68.26,…
$ channel       <chr> "email", "digital", "digital", "email", "digital", "dire…
$ region        <chr> "West", "South", "Northeast", "Northeast", "Midwest", "W…
$ loyalty_tier  <chr> "Bronze", "Bronze", "Bronze", "Bronze", "Gold", "Silver"…
$ responded     <fct> no, no, yes, no, yes, yes, no, no, no, no, no, no, no, n…

2.1 Assignment L.1 — Dollar to Log

Customer C00005 has an income of $233,070.

Convert this value to the natural log: log(233070)12.359
Convert this value to log base 10: log10(233070)

log₁₀(233070) = log(10^5.367) = 5.367
Verify both results using R
In your own words, why does the recipe use base = 10 rather than the natural log?
Base 10 is easier to read. A log10 value of 5 means the original number was 100,000, which most people can follow. Natural log values do not have that same intuitive meaning, so base 10 is just a more human friendly choice when you need to explain the numbers to someone outside of data science.

Hint

In R, log(x) computes the natural log ($\ln x$) and log10(x) computes log base 10. You can also write log(x, base = 10) — this is exactly what step_log(..., base = 10) uses internally.

Code

# Your code here
log(233070)
log10(233070)

2.2 Assignment L.2 — Log Back to Dollar (Median and Mean)

The income variable was generated with meanlog = 10.8 and sdlog = 0.6. These parameters describe the distribution of log(income), not income itself.

Convert the log-scale median (10.8) back to dollars using exp()
Convert the log-scale mean (10.98) back to dollars using exp()
Verify these match the formulas from the Week 2 notebook:

\[\text{Median} = e^{\mu} = e^{10.8}\]

\[\text{Mean} = e^{\mu + \sigma^2/2} = e^{10.8 + 0.6^2/2}\]
yes

Why is the mean higher than the median on the dollar scale?
Income is right-skewed — a small number of high earners pull the mean above the median.

Hint

exp(x) is the inverse of log(x). If log(income) = 10.8, then exp(10.8) gives you income back in dollars. For the mean, compute 10.8 + 0.6^2 / 2 first, then apply exp().

Code

# Your code here
median_income <- exp(10.8)
# [1] 49020.8
mean_income <- exp(10.98)
#[1] 58688.55

2.3 Assignment L.3 — Geometric Standard Deviation

The geometric standard deviation (GSD) describes the typical spread of income around the median on the original dollar scale.

\[\text{GSD} = e^{\sigma} = e^{0.6}\]

Compute the GSD using exp(0.6)``> exp(.6)

[1] 1.822119
Using the median income from L.2 (~$49,021), compute the typical range:
- Lower bound: median_income / GSD``> median_income / GSD
  
  [1] 26903.19
- Upper bound: median_income * GSD``> median_income * GSD
  
  [1] 89321.72
Verify this matches the shaded green band in the Week 2 plot (~$26,935 to ~$89,218) yes
Interpret this range in plain English: what does it tell a marketer about the “typical” customer in this dataset?
Even though the income spread is between $26k to $89k, and the average earning is $59k, marketers should expect a drop in salaries starting after $49k.

Hint

Unlike the regular standard deviation (which adds and subtracts), the GSD multiplies and divides because we are on a multiplicative (log) scale.

Code

# Your code here
GSD <- exp(.6)
#lower bound
median_income / GSD
#upper bound
median_income * GSD

2.4 Assignment L.4 — Confirm with Real Data

Now verify the theoretical values from L.2 and L.3 against the actual mail_data dataset.

Compute the median and mean of income, removing NAs with na.rm = TRUE. How close are they to the theoretical values?

median_income

<dbl>

mean_income

<dbl>

49331.5 58349.86

These values are only about $300 less than the theoretical values.
Create a new column called log_income using mutate() and log(). What are the median and mean of log_income?

median_log_income

<dbl>

mean_log_income

<dbl>

10.80632 10.79787
Apply exp() to the median and mean of log_income. Do you recover the original dollar-scale values? Explain why or why not.

Only median income is correctly reversed since the axis is even, but mean is incorrect since income average is right-skewed due to the inflation of the large amounts in the tail.

exp_median_log_income

<dbl>

exp_mean_log_income

<dbl>

49331.49 48916.37

median_income <dbl>	mean_income <dbl>
49331.5	58349.86

median_log_income <dbl>	mean_log_income <dbl>
10.80632	10.79787

exp_median_log_income <dbl>	exp_mean_log_income <dbl>
49331.49	48916.37

Hint

Use summarise() with median() and mean() to compute summary statistics. The theoretical and empirical values may differ slightly because mail_data is a finite random sample — not the infinite population the parameters describe.

Code

# Step 1: median and mean of income (removing NAs)
mail_data |>
  summarise(
    median_income = median(income, na.rm = TRUE),
    mean_income = mean(income, na.rm = TRUE)
  )
# Step 2: create log_income and summarise
mail_data <- mail_data |>
  mutate(log_income = log(income))
mail_data |>
  summarise(
    median_log_income = median(log_income, na.rm = TRUE),
    mean_log_income = mean(log_income, na.rm = TRUE)
  )
# Step 3: back-transform with exp()
mail_data |>
  summarise(
    exp_median_log_income = exp(median(log_income, na.rm = TRUE)),
    exp_mean_log_income = exp(mean(log_income, na.rm = TRUE))
  )

2.5 Assignment L.5 — Base 10 vs. Natural Log

Your recipe uses step_log(..., base = 10). Does the choice of base actually matter for modeling?
Not necessarily, but the base does change the scale of the coefficients

For income values of $10,000, $100,000, and $1,000,000, compute both log() (natural) and log10() for each value
log₁₀(10000) = 4; log(10000) = 9.2103

log₁₀(100000) = 5 ; log(100000) = 11.5129

log₁₀(1000000) = 6 ; log(1000000) = 13.8155
Present the results as a tibble with columns: income, log_natural, log_base10

ncome

<dbl>

log_natural

<dbl>

log_base10

<dbl>

1e+04 9.21034 4

1e+05 11.51293 5

1e+06 13.81551 6
Divide log_natural by log_base10 for each row. What constant do you get? (Hint: this constant is log(10))

income

<dbl>

log_natural

<dbl>

log_base10

<dbl>
log-natural/log_base10

1e+04 9.21034 4 2.3026

1e+05 11.51293 5 2.3026

1e+06 13.81551 6 2.3026
Since the two columns differ only by a constant, does it matter which base we use for preprocessing? Explain why or why not in terms of what step_normalize() does afterward.
No, log bases only change the logged values by a constant multiplier.

ncome <dbl>	log_natural <dbl>	log_base10 <dbl>
1e+04	9.21034	4
1e+05	11.51293	5
1e+06	13.81551	6

income <dbl>	log_natural <dbl>	log_base10 <dbl>	log-natural/log_base10
1e+04	9.21034	4	2.3026
1e+05	11.51293	5	2.3026
1e+06	13.81551	6	2.3026

Hint

log(10) ≈ 2.303. Multiplying or dividing by a constant shifts and scales values — but step_normalize() centers and scales anyway, so any constant difference between bases gets absorbed.

Code

# Step 1 & 2: build the tibble
tibble(
  income = c(10000, 100000, 1000000)
) |>
  mutate(
    log_natural = log(income),
    log_base10 = log10(income)
  )

# Step 3: compute the ratio

# Step 4: your written explanation here (as a comment)
# No, log bases only change the logged values by a constant multiplier.

2.5.1 Summary

Assignment	Key operation	R function
L.1	Dollar → log	`log()`, `log10()`
L.2	Log → dollar (median & mean)	`exp()`
L.3	Geometric standard deviation	`exp(sdlog)`
L.4	Empirical vs. theoretical values	`mutate()`, `summarise()`
L.5	Base 10 vs. natural log	`log()` / `log10()` ratio

3 Exercises

Code

library(tidymodels) # umbrella: rsample, recipes, parsnip, workflows, yardstick
library(tidyverse)
library(janitor) # clean_names()
library(skimr) # skim()

tidymodels_prefer() # resolve function-name conflicts
set.seed(2024)

# ── Synthetic direct-mail dataset ──────────────────────────────────────────────
set.seed(123)
n <- 3000

mail_data <- tibble(
  customer_id = paste0("C", str_pad(1:n, 5, pad = "0")),
  age = round(rnorm(n, mean = 45, sd = 12)),
  income = round(rlnorm(n, meanlog = 10.8, sdlog = 0.6)), # right-skewed
  recency_days = round(rexp(n, rate = 1 / 60)), # days since last purchase
  freq_12mo = rpois(n, lambda = 3), # purchases in 12 months
  avg_order_amt = round(rlnorm(n, meanlog = 4.2, sdlog = 0.5), 2),
  channel = sample(
    c("email", "direct_mail", "digital"),
    n,
    replace = TRUE,
    prob = c(0.5, 0.3, 0.2)
  ),
  region = sample(
    c("West", "South", "Midwest", "Northeast"),
    n,
    replace = TRUE
  ),
  loyalty_tier = sample(
    c("Bronze", "Silver", "Gold", "Platinum"),
    n,
    replace = TRUE,
    prob = c(0.4, 0.3, 0.2, 0.1)
  ),
  responded = rbinom(
    n,
    1,
    prob = plogis(
      -3 +
        0.02 * (age - 45) +
        0.3 * log(income / 50000) +
        0.1 * freq_12mo -
        0.005 * recency_days
    )
  )
) |>
  mutate(
    income = if_else(runif(n) < 0.08, NA_real_, income),
    avg_order_amt = if_else(runif(n) < 0.05, NA_real_, avg_order_amt),
    responded = factor(responded, levels = c(1, 0), labels = c("yes", "no"))
  )
# introduce ~8 % missingness in income and avg_order_amt

glimpse(mail_data)

Rows: 3,000
Columns: 10
$ customer_id   <chr> "C00001", "C00002", "C00003", "C00004", "C00005", "C0000…
$ age           <dbl> 38, 42, 64, 46, 47, 66, 51, 30, 37, 40, 60, 49, 50, 46, …
$ income        <dbl> 44793, 40269, 20560, 32261, 233070, 47933, 84804, 43883,…
$ recency_days  <dbl> 103, 102, 27, 115, 47, 106, 4, 62, 13, 21, 139, 100, 9, …
$ freq_12mo     <int> 4, 3, 0, 3, 1, 3, 5, 2, 1, 6, 1, 3, 3, 3, 6, 1, 1, 1, 6,…
$ avg_order_amt <dbl> 80.19, NA, 54.80, 46.49, 64.86, 78.43, 39.45, NA, 68.26,…
$ channel       <chr> "email", "digital", "digital", "email", "digital", "dire…
$ region        <chr> "West", "South", "Northeast", "Northeast", "Midwest", "W…
$ loyalty_tier  <chr> "Bronze", "Bronze", "Bronze", "Bronze", "Gold", "Silver"…
$ responded     <fct> no, no, yes, no, yes, yes, no, no, no, no, no, no, no, n…

3.1 ✏️ Exercise 3.1

Change prop to 0.70 and re-run the split. How many records move from training to test?
900
How does the response rate change?

Training

0.04333333

Test

0.05666667

Code

# Your code here
set.seed(617)
mail_split <- initial_split(
  mail_data,
  prop = 0.70, 
  strata = responded 
)

mail_train <- training(mail_split)
mail_test <- testing(mail_split)

cat("Training rows:", nrow(mail_train), "\n")
cat("Test rows    :", nrow(mail_test), "\n")

bind_rows(
  mail_train |> summarise(split = "Training", response_rate = mean(responded == "yes")),
  mail_test |> summarise(split = "Test", response_rate = mean(responded == "yes"))
)

# Training rows: 2100 

# Test rows    : 900

3.2 ✏️ Exercise 8.1 — Alternative Imputation

Replace step_impute_median() with step_impute_mean(). Re-prep the recipe and compare the imputed values. Which would you prefer for income and why? Median because mean is always right skewed due to inflation by the larger earners.

Code

# Your code here

rec_partial <- recipe(
  responded ~ income + freq_12mo + age,
  data = mail_train
) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_log(income, base = 10, offset = 1) |>
  step_normalize(all_numeric_predictors())

prep(rec_partial) |>
  bake(new_data = mail_train) |>
  summary()

3.3 ✏️ Exercise 8.2 — Different Encoding

loyalty_tier is ordinal (Bronze < Silver < Gold < Platinum). Replace step_dummy() with step_ordinalscore() for loyalty_tier only. Does this change the number of columns in the baked data? yes

Code

# Hint: use step_ordinalscore() from the recipes package
# loyalty_tier is currently character. To be converted to ordinal scale, it must be an ordered factor first.
mail_data <- mail_data |>
  mutate(
    loyalty_tier = factor(
      loyalty_tier,
      levels = c("Bronze", "Silver", "Gold", "Platinum"),
      ordered = TRUE
    )
  )


# Your code here
mail_data <- mail_data |>
  mutate(
    loyalty_tier = ordered(
      loyalty_tier,
      levels = c("Bronze", "Silver", "Gold", "Platinum")
    )
  )

set.seed(617)

mail_split <- initial_split(
  mail_data,
  prop = 0.70,
  strata = responded
)

mail_train <- training(mail_split)
mail_test <- testing(mail_split)

is.ordered(mail_train$loyalty_tier)

rec_ordinal <- recipe(responded ~ ., data = mail_train) |>
  update_role(customer_id, new_role = "ID") |>
  step_impute_median(all_numeric_predictors()) |>
  step_log(income, base = 10, offset = 1) |>
  step_ordinalscore(loyalty_tier) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())

prep(rec_ordinal) |>
  bake(new_data = mail_train) |>
  ncol()

3.4 ✏️ Exercise 8.3 — Interaction Term

Add an interaction between freq_12mo and log10(income) using step_interact(). What is the marketing intuition for this interaction?
A person with high income and past purchases may be likely to respond.

Code

# Your code here
rec_interact <- recipe(responded ~ ., data = mail_train) |>
  update_role(customer_id, new_role = "ID") |>
  step_impute_median(all_numeric_predictors()) |>
  step_log(income, base = 10, offset = 1) |>
  step_interact(terms = ~ freq_12mo:income) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())

prep(rec_interact) |>
  bake(new_data = mail_train) |>
  glimpse()

3.5 ✏️ Exercise 8.4 — Remove a Step

Remove step_normalize() from the recipe, retrain the Lasso, and compare roc_auc before and after. What happens to model performance?

Code

# Your code here
rec_no_norm <- recipe(responded ~ ., data = mail_train) |>
  update_role(customer_id, new_role = "ID") |>
  step_impute_median(all_numeric_predictors()) |>
  step_log(income, base = 10, offset = 1) |>
  step_dummy(all_nominal_predictors())

lasso_no_norm_fit <- workflow() |>
  add_recipe(rec_no_norm) |>
  add_model(lr_spec) |>
  fit(data = mail_train)

lasso_no_norm_preds <- predict(lasso_no_norm_fit, mail_test, type = "prob") |>
  bind_cols(mail_test |> 
              select(responded))

roc_auc(
  lasso_no_norm_preds,
  truth = responded,
  .pred_yes
)

3.6 ✏️ Exercise 8.5 — Stratification Check

The split used strata = responded. Remove stratification and re-split five times. Calculate the standard deviation of the response rate in the test sets. Does stratification reduce variance?
yes

Code

# Your code here
set.seed(617)
mail_split <- initial_split(
  mail_data,
  prop = 0.80, # 80 % train, 20 % test
  strata = responded # preserve class balance in both sets
)

mail_train <- training(mail_split)
mail_test <- testing(mail_split)

cat("Training rows:", nrow(mail_train), "\n")

no_strata_rates <- tibble(split_num = 1:5) |>
  mutate(
    split = map(split_num, ~ {
      set.seed(617 + .x)
      initial_split(
        mail_data,
        prop = 0.80
      )
    }),
    test_data = map(split, testing),
    test_response_rate = map_dbl(
      test_data,
      ~ mean(.x$responded == "yes")
    )
  )

no_strata_rates

no_strata_rates |>
  summarise(
    sd_test_response_rate = sd(test_response_rate)
  )

3.7 ✏️ Exercise 8.6 — Data Leakage

Prep the recipe correctly using mail_train, then bake mail_test. Next, incorrectly prep the same recipe using mail_test. Compare the normalization statistics for income. Why is the second approach considered data leakage?

Code

# Your code here

3.8 ✏️ Exercise 8.7 — Inspecting a Recipe

Use tidy(mail_prep) to list the recipe steps. Then inspect the imputation values and normalization statistics using tidy(mail_prep, number = 1) and tidy(mail_prep, number = 3). What values were learned during prep()?

Code

# Your code here

3.9 ✏️ Exercise 8.8 — Variable Roles

Remove update_role(customer_id, new_role = "ID") from the recipe and re-prep it. What happens to customer_id in the baked data? Why should an ID variable not be used as a predictor?

Code

# Your code here

3.9.1 Summary

Concept	Function	Key Argument
Data split	`initial_split()`	`prop`, `strata`
Training set	`training()`	—
Test set	`testing()`	—
Define recipe	`recipe()`	formula, `data = train`
Log transform	`step_log()`	`base`, `offset`
Median impute	`step_impute_median()`	selector
Normalize	`step_normalize()`	selector
Dummy encode	`step_dummy()`	`all_nominal_predictors()`
Estimate steps	`prep()`	`training = train`
Apply steps	`bake()`	`new_data = test`
Bundle recipe + model	`workflow()`	—

Code

sessionInfo()

R version 4.5.1 (2025-06-13)
Platform: aarch64-apple-darwin20
Running under: macOS Tahoe 26.5.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] skimr_2.2.1        janitor_2.2.1      glmnet_5.0         Matrix_1.7-4      
 [5] yardstick_1.3.2    workflowsets_1.1.1 workflows_1.3.0    tune_2.0.0        
 [9] tailor_0.1.0       rsample_1.3.1      recipes_1.3.1      parsnip_1.3.3     
[13] modeldata_1.5.1    infer_1.0.9        dials_1.4.2        scales_1.4.0      
[17] broom_1.0.10       tidymodels_1.4.1   lubridate_1.9.4    forcats_1.0.1     
[21] stringr_1.6.0      dplyr_1.1.4        purrr_1.2.1        readr_2.1.6       
[25] tidyr_1.3.1        tibble_3.3.0       ggplot2_4.0.0      tidyverse_2.0.0   

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1    timeDate_4041.110   farver_2.1.2       
 [4] S7_0.2.0            fastmap_1.2.0       digest_0.6.37      
 [7] rpart_4.1.24        timechange_0.3.0    lifecycle_1.0.4    
[10] survival_3.8-3      magrittr_2.0.4      compiler_4.5.1     
[13] rlang_1.1.6         tools_4.5.1         yaml_2.3.10        
[16] data.table_1.17.8   knitr_1.50          labeling_0.4.3     
[19] htmlwidgets_1.6.4   repr_1.1.7          DiceDesign_1.10    
[22] RColorBrewer_1.1-3  withr_3.0.2         nnet_7.3-20        
[25] grid_4.5.1          sparsevctrs_0.3.4   future_1.67.0      
[28] iterators_1.0.14    globals_0.18.0      MASS_7.3-65        
[31] cli_3.6.5           rmarkdown_2.29      generics_0.1.4     
[34] rstudioapi_0.17.1   future.apply_1.20.0 tzdb_0.5.0         
[37] cachem_1.1.0        splines_4.5.1       parallel_4.5.1     
[40] base64enc_0.1-3     vctrs_0.6.5         hardhat_1.4.2      
[43] jsonlite_2.0.0      hms_1.1.3           listenv_0.9.1      
[46] foreach_1.5.2       gower_1.0.2         glue_1.8.0         
[49] parallelly_1.45.1   codetools_0.2-20    shape_1.4.6.1      
[52] stringi_1.8.7       gtable_0.3.6        GPfit_1.0-9        
[55] pillar_1.11.1       furrr_0.3.1         htmltools_0.5.8.1  
[58] ipred_0.9-15        lava_1.8.1          R6_2.6.1           
[61] lhs_1.2.0           conflicted_1.2.0    evaluate_1.0.5     
[64] lattice_0.22-7      backports_1.5.0     snakecase_0.11.1   
[67] memoise_2.0.1       class_7.3-23        Rcpp_1.1.0         
[70] prodlim_2025.04.28  xfun_0.53           pkgconfig_2.0.3

4 Appendix

--- title: "W04-1: Reflection on Logistic Regression and Ch8 (Preprocessing with Recipes)" author: "Leeann Lewis-Waddell" date: today format: html: toc: true toc-depth: 4 toc-expand: 1 toc-location: right-body toc-title: "Contents" number-sections: true code-fold: show code-tools: true theme: cosmo highlight-style: github df-print: paged embed-resources: true execute: warning: false message: false freeze: true --- # Video Quizzes ```{r setup} library(tidyverse) # Recreate the direct-mail dataset used throughout the course set.seed(123) n <- 3000 mail_data <- tibble( customer_id = paste0("C", str_pad(1:n, 5, pad = "0")), age = round(rnorm(n, mean = 45, sd = 12)), income = round(rlnorm(n, meanlog = 10.8, sdlog = 0.6)), recency_days = round(rexp(n, rate = 1 / 60)), freq_12mo = rpois(n, lambda = 3), avg_order_amt = round(rlnorm(n, meanlog = 4.2, sdlog = 0.5), 2), channel = sample( c("email", "direct_mail", "digital"), n, replace = TRUE, prob = c(0.5, 0.3, 0.2) ), region = sample( c("West", "South", "Midwest", "Northeast"), n, replace = TRUE ), loyalty_tier = sample( c("Bronze", "Silver", "Gold", "Platinum"), n, replace = TRUE, prob = c(0.4, 0.3, 0.2, 0.1) ), responded = rbinom( n, 1, prob = plogis( -3 + 0.02 * (age - 45) + 0.3 * log(income / 50000) + 0.1 * freq_12mo - 0.005 * recency_days ) ) ) |> mutate( income = if_else(runif(n) < 0.08, NA_real_, income), avg_order_amt = if_else(runif(n) < 0.05, NA_real_, avg_order_amt), responded = factor(responded, levels = c(1, 0), labels = c("yes", "no")) ) ``` ------------------------------------------------------------------------ ## Step 1-1 — Logarithms {#sec-logs} > **Video:** Logs (Logarithms), Clearly Explained!!! <https://www.youtube.com/watch?v=VSi0Z04fWj0> ### Q1 — What Does a Log Do? The video says the core job of a logarithm is to **isolate the exponent**. 1. Complete the statement: "If $2^5 = 32$, then $\log_2(32) =$ \_\_\_\_ because..." - 5 because log functions isolate the exponent. log~2~(32) = log~2~(2^5^) 2. Verify in R: `log(32, base = 2)` > log(32, base = 2) > > \[1\] 5 3. Why is this property useful when income values range from \$7,000 to \$455,000 in your dataset? - Fold changes become symmetric. ```{r q1-1, eval=FALSE} # Your code here log(32, base = 2) ``` ### Q2 — Multiplication Becomes Addition The video demonstrates: $\log(a \times b) = \log(a) + \log(b)$. 1. Verify using `a = 4`, `b = 8` (base 2) — compute both sides and confirm they are equal - log~2~( 4 x 8) = log~2~(32) = 5 - log~2~(4) + log~2~(8) = log~2~(2^2^) + log~2~(2^3^) = 2 + 3 = 5 2. In logistic regression, we multiply many small probabilities together. Explain in one sentence why the log property above makes this computationally safer. - Adding logs makes the the small numbers produced by probabilities that would potentially become zero more manageable to account for. ```{r q1-2, eval=FALSE} log(4 * 8, base = 2) log(4, base = 2) + log(8, base = 2) ``` ### Q3 — Geometric Mean in Marketing The video shows the geometric mean is better than the arithmetic mean for data that **multiplies** rather than adds. Five customers' order values: \$10, \$20, \$80, \$40, \$1,000. 1. Compute the arithmetic mean with `mean()` - orders \<- c(10, 20, 80, 40, 1000) \ mean(orders)\ 230 2. Compute the geometric mean: `exp(mean(log(x)))` - exp(mean(log(orders)))\ 57.708 3. Which better represents the "typical" customer — and why does the \$1,000 order distort the arithmetic mean more?\ Geometric mean represents the "typical" customer more accurately. The \$1,000 inflates the average because it is an outlier, essentially dominating the scale. ```{r q1-3, eval=FALSE} orders <- c(10, 20, 80, 40, 1000) mean(orders) exp(mean(log(orders))) ``` ### Q4 — Log(0) and the Offset The video explains $\log(0)$ is undefined (negative infinity). 1. Run `log(0)` — what does R return? > log(0) \[1\] > > -Inf 2. Your Week 2 recipe uses `step_log(..., offset = 1)`. Explain in one sentence why this offset is necessary.\ Adding 1 ensures the minimum input is log(1) = 0 instead of crashing because using zero is not possible. 3. What is `log(0 + 1)`? Why is adding 1 a safe lower bound? > log(0 + 1) > > \[1\] 0 Adding 1 is a safe lower bound because log(1) = 0, which is a defined value as opposed to infinity. ```{r q1-4, eval=FALSE} log(0) log(0 + 1) ``` ### Q5 — Base Conversion The video notes that base 2, base 10, and natural log all follow the same rules and differ only by a constant factor. 1. For income = \$49,021, compute `log2()`, `log10()`, and `log()` (natural) > income_val \<- 49021 > > log(income_val, base = 2) > > \[1\] 15.58111 > > log10(income_val) > > \[1\] 4.690382 > > log(income_val) > > \[1\] 10.8 2. Divide `log()` by `log10()` — what constant do you get? > log(income_val) / log10(income_val) > > \[1\] 2.302585\ 3. After `step_normalize()` is applied in the recipe, does it matter which base was used for `step_log()`? Explain why or why not.\ No, after normalization, all log bases produce the same scaled feature. ```{r q1-5, eval=FALSE} income_val <- 49021 log(income_val, base = 2) log10(income_val) log(income_val) # Ratio log(income_val) / log10(income_val) ``` ------------------------------------------------------------------------ ## Step 1-2 — Odds and Log(Odds) {#sec-odds} > **Video:** Odds and Log(Odds), Clearly Explained!!! <https://www.youtube.com/watch?v=ARfXDSkQf1Y> ### Q6 — Probability vs. Odds The video distinguishes probability from odds. In `mail_data`, 142 customers responded ("yes") out of 3,000 total. 1. Compute the **probability** of responding: $P(\text{respond}) = \frac{\text{yes}}{\text{total}}$\ yes = 142/3,000\ yes = 0.0473 > yes \<- 142 > > no \<- 3000 - 142 > > p \<- yes / (yes + no) > > p \[1\] 0.04733333 2. Compute the **odds** of responding: $\text{Odds} = \frac{\text{yes}}{\text{no}} = \frac{P}{1 - P}$ odds = 0.0473/(1-0.0473)\ odds = 0.0473/.9527 odds = 0.0496 > odds \<- yes / no > > odds \[1\] 0.04968509 3. In plain English, what does an odds of 0.05 mean for a marketer trying to interpret campaign response? For every one customer that responds, there are 20 who won't. ```{r q2-1, eval=FALSE} yes <- 142 no <- 3000 - 142 # Probability p <- yes / (yes + no) p # Odds odds <- yes / no odds # Or equivalently p / (1 - p) ``` ### Q7 — Converting Between Probability and Odds The video shows you can derive odds from probabilities using $\text{Odds} = \frac{P}{1 - P}$. 1. If a customer has a 20% chance of responding, what are their odds?\ odds = .2/(1-.2)\ odds = .25\ 2. If a customer has a 50% chance, what are their odds?\ odds = .5/(1-.5)\ odds = 1\ 3. If a customer has an 80% chance, what are their odds?\ odds = .8/(1-.8)\ odds = 4\ 4. What pattern do you notice? What happens to odds as probability approaches 1?\ As probability approaches 1, odds increase in favor of the predicted outcome. ```{r q2-2, eval=FALSE} prob_to_odds <- function(p) p / (1 - p) prob_to_odds(0.20) prob_to_odds(0.50) prob_to_odds(0.80) ``` ### Q8 — Why Log(Odds)? The video explains that odds are **asymmetric**: odds against an event range 0 to 1, while odds in favor range 1 to infinity. Taking the log fixes this. 1. Compute `log(odds)` for probabilities 0.10, 0.50, and 0.90\ log_odds(.10) = log(.1/.9) = -2.197225\ log_odds(.50) = log(.5/.5) = 0 log_odds(.90) = log(.9/.1) = 2.197225\ 2. Verify that `log(odds)` at $P = 0.50$ equals exactly 0 — why does this make sense intuitively?\ log_odds(.50) = log(.5/.5) = log(1) =0 3. Are the log(odds) values symmetric around 0? What does this symmetry mean for modeling?\ Yes. With respect to modeling, it makes the relationships symmetric and allows for easy comparisons. ```{r q2-3, eval=FALSE} probs <- c(0.10, 0.50, 0.90) odds <- probs / (1 - probs) log_odds <- log(odds) tibble(probability = probs, odds = odds, log_odds = log_odds) ``` ### \*Q9 — The Logit Function The video introduces the **logit** as the log of the odds ratio, central to logistic regression. $$\text{logit}(P) = \log\left(\frac{P}{1-P}\right)$$ 1. Using `mail_data`, compute the overall response rate (probability) > p_respond \<- mean(mail_data\$responded == "yes") > > p_respond > > \[1\] 0.04733333 2. Convert it to log(odds) using the formula above > log(p_respond / (1 - p_respond)) > > \[1\] -3.00205 3. In your Week 2 recipe, the model output `.pred_yes` is a probability. Write the R code to convert it to log(odds).\ ```{r q2-4, eval=FALSE} # Overall response rate p_respond <- mean(mail_data$responded == "yes") p_respond # Convert to log(odds) log(p_respond / (1 - p_respond)) ``` ### Q10 — Log(Odds) and the Normal Distribution The video notes that log(odds) values often follow a **normal distribution**, making them useful for modeling binary outcomes. Written question — no code required: 1. Your income variable was log-transformed in the recipe because raw income is right-skewed. The video suggests log(odds) of a binary outcome often follow a normal distribution. Why might this property make logistic regression a natural choice for modeling `responded` (yes/no)?\ \ Because log-odds are normally distributed and unbounded, a linear model can fit them directly.\ 2. In 2–3 sentences, explain why working in log(odds) space is more mathematically convenient than working in probability space for a linear model.\ \ Log odds range from negative infinity to positive infinity. Coefficients represent consistent, additive shifts in the log odds for every unit change in a predictor. It also keeps the math clean because multiplying probabilities becomes addition in log odds space, which is far easier to work with algebraically. ------------------------------------------------------------------------ ## Step 1-3 — Odds Ratios and Log(Odds Ratios) {#sec-or} > **Video:** Odds Ratios and Log(Odds Ratios), Clearly Explained!!! <https://www.youtube.com/watch?v=8nm0G-1uJzA> ### Q11 — What Is an Odds Ratio? The video defines an **odds ratio (OR)** as the ratio of two odds. In `mail_data`, compare the odds of responding for **Gold** loyalty tier customers vs. **Bronze** loyalty tier customers. 1. Compute the odds of responding for Gold customers > gold_odds \<- tier_summary \|\> > > filter(loyalty_tier == "Gold") \|\> > > pull(odds) > > gold_odds > > \[1\] 0.05 2. Compute the odds of responding for Bronze customers > bronze_odds \<- tier_summary \|\> > > filter(loyalty_tier == "Bronze") \|\> > > pull(odds) > > bronze_odds > > \[1\] 0.05074875 3. Compute the odds ratio: Gold odds ÷ Bronze odds > OR \<- gold_odds / bronze_odds > > OR > > \[1\] 0.9852459 4. Is Gold tier a stronger or weaker predictor of response than Bronze?\ Yes ```{r q3-1, eval=FALSE} tier_summary <- mail_data |> group_by(loyalty_tier) |> summarise( yes = sum(responded == "yes"), no = sum(responded == "no"), odds = yes / no ) tier_summary # Odds ratio: Gold vs Bronze gold_odds <- tier_summary |> filter(loyalty_tier == "Gold") |> pull(odds) bronze_odds <- tier_summary |> filter(loyalty_tier == "Bronze") |> pull(odds) OR <- gold_odds / bronze_odds OR ``` ### Q12 — Asymmetry and Log(Odds Ratio) The video explains that odds ratios are asymmetric: values below 1 (negative association) are compressed into 0–1, while values above 1 (positive association) range from 1 to infinity. 1. Compute `log(OR)` from Q11. Is it positive or negative? > log(OR) > > \[1\] -0.01486402 negative 2. Compute the odds ratio in the **reverse direction**: Bronze ÷ Gold. Then take `log()` of that. What do you notice? > OR_reverse \<- bronze_odds / gold_odds > > log(OR_reverse) > > \[1\] 0.01486402 Odds are now positive, meaning these are the chances respondents will not respond 3. Why is log(OR) = 0 the reference point for "no relationship"?\ Because a log odds ratio of 0 means the odds ratio itself equals 1, and an odds ratio of 1 means the odds of the outcome are identical in both groups, so the predictor has no effect on the response. ```{r q3-2, eval=FALSE} # Log odds ratio log(OR) # Reverse direction OR_reverse <- bronze_odds / gold_odds log(OR_reverse) ``` ### Q13 — Effect Size Interpretation The video compares the odds ratio to R² as a measure of **effect size** — the larger the magnitude, the stronger the relationship. Compute odds ratios for all four loyalty tiers against Bronze as the reference, then answer: 1. Which tier has the strongest association with response?\ Platinum 2. Which has the weakest?\ Gold 3. How would you communicate the strongest result to a non-technical marketing manager in one sentence?\ Platinum loyalty members are about 1.5 times more likely to respond to our campaign than Bronze members, making them our highest-priority segment to target.\ ```{r q3-3, eval=FALSE} tier_summary |> mutate(OR_vs_bronze = odds / bronze_odds, log_OR = log(OR_vs_bronze)) |> arrange(desc(abs(log_OR))) ``` ### \*Q14 — Fisher's Exact Test The video introduces **Fisher's Exact Test** for small samples to determine if an observed association is statistically significant. Using Gold vs. Bronze customers from `mail_data`: 1. Build a 2×2 contingency table using `table()` 2. Run `fisher.test()` on it 3. What does the p-value tell you? Is the Gold/Bronze difference statistically significant at the 0.05 level?\ A p-value of 1 means the difference is not statistically significant 4. When would you prefer Fisher's Exact Test over a chi-square test?\ When the dataset is small ```{r q3-4, eval=FALSE} gold_bronze <- mail_data |> filter(loyalty_tier %in% c("Gold", "Bronze")) ct <- table(gold_bronze$loyalty_tier, gold_bronze$responded) ct stats::fisher.test conflicted::conflicts_prefer(janitor::fisher.test) fisher.test(ct) ``` ### Q15 — Chi-Square Test The video explains the **chi-square test** compares observed values to expected values assuming no relationship exists. 1. Run `chisq.test()` on the same 2×2 table from Q14 2. Compare the p-value to Fisher's Exact Test — are the conclusions the same?\ The p-value is the same 3. The video notes chi-square works well for **larger datasets**. With 3,000 rows in `mail_data`, which test is more appropriate — and why?\ Chi-square; it also compared observed values to expected values. ```{r q3-5, eval=FALSE} conflicted::conflicts_prefer(janitor::chisq.test) chisq.test(ct) ``` ### Q16 — Wald Test Concept The video explains the **Wald Test** evaluates how many standard deviations a log(odds ratio) is from zero, assuming a normal distribution. Written question — no code required: 1. In your own words, what does it mean for a log(odds ratio) to be "many standard deviations from zero"?\ It means that the log(odds ratio) represents a real relationship rather than random noise. The further from zero, the stronger the evidence that the predictor actually matters.\ 2. The Wald test is used inside logistic regression to assess individual predictors. In the Week 2 notebook, you fitted a Lasso logistic regression. How does Lasso's penalty relate to the Wald test's goal of identifying which predictors are meaningfully different from zero?\ Both are trying to determine which predictors actually matter in different ways. The Wald test does it after fitting by checking whether each coefficient is far enough from zero to be considered meaningful. Lasso does it during fitting by penalizing coefficients and shrinking unimportant ones all the way to exactly zero, effectively removing them from the model.\ 3. The video says there is no single consensus on which test (Fisher, chi-square, Wald) is best. What practical guideline does it suggest?\ The video suggests relying on the technique most related to the respective field of study. ### Q17 — Connecting Odds Ratios to the Course Written question: In logistic regression, the model coefficients are log(odds ratios) — each coefficient tells you how much the log(odds) of the outcome changes for a one-unit increase in a predictor. 1. In your Week 2 model, `freq_12mo` is a predictor. If its coefficient is positive, what does that imply about the odds of a customer responding as their purchase frequency increases?\ A positive coefficient means that as purchase frequency increases, the odds of a customer responding to the campaign also increase.\ 2. After `step_normalize()`, predictors are on the same scale. Why does this make it easier to compare coefficients (and therefore odds ratios) across predictors with very different original units (e.g., income in dollars vs. frequency in counts)?\ Normalization allows for every predictor to be measured in standard deviations, so a one unit increase means the same thing across all predictors. Coefficients can be compared side by side and directly rank which predictors have the strongest association with response, no matter what their original units were. ------------------------------------------------------------------------ ## Step 2-1 — Logistic Regression {#sec-logit} > **Video:** Logistic Regression, Clearly Explained!!! <https://www.youtube.com/watch?v=yIYKR4sgzI8> ### Q18 — Linear vs. Logistic Regression The video begins by reviewing linear regression, then explains why it fails for binary outcomes. 1. What does linear regression predict? What does logistic regression predict?\ Linear regression predicts a continuous numeric value with no bounds. \ Logistic regression predicts the probability of belonging to one of two classes, always staying between 0 and 1.\ 2. Why can't you use a straight line to model a yes/no outcome like `responded`? What goes wrong at the extremes?\ A straight line keeps going in both directions forever, so at the extremes it will predict values below 0 or above 1, which are impossible for a probability. 3. In `mail_data`, the response rate is about 4.7%. If you fitted a linear regression to predict `responded` (as 0/1), what value might it predict for a very frequent, high-income customer — and why is that problematic?\ For a very frequent, high income customer, a linear regression might predict something like 1.3 or higher, since those strong predictor values push the line upward with no ceiling to stop it. This is problematic because a response probability cannot exceed 1, and feeding impossible predictions into any downstream decision making, such as expected revenue calculations, would produce nonsense results. Written answer — no code required. ### Q19 — The S-Shaped Logistic Curve The video shows that logistic regression fits an **S-shaped curve** (sigmoid function) that maps any input to a probability between 0 and 1. $$P = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$ 1. In R, `plogis()` computes this function. Evaluate it at inputs of −3, 0, and 3: > plogis(-3) > > \[1\] 0.04742587: negative linear predictor > > plogis(0) > > \[1\] 0.5: zero > > plogis(3) > > \[1\] 0.9525741: positive linear predictor ```{r q4-2, eval=FALSE} plogis(-3) # very negative linear predictor plogis(0) # at zero plogis(3) # very positive linear predictor ``` 2. What probability does `plogis(0)` return — and why does that make sense from the formula? > plogis(0) > > \[1\] 0.5 It computes 1 / (1 + e^~0~^), and since e^0^ = 1, which simplifies to 1 / (1 + 1) = 1 / 2 = 0.5. This makes sense because a log odds of 0 means the odds are exactly 1, meaning the outcome is equally likely to happen or not happen. 3. Notice that your dataset's `responded` column was generated using `plogis(...)` — find that line in the setup chunk above and explain what the linear predictor inside `plogis()` represents in marketing terms. responded = rbinom( n, 1, prob = plogis( -3 + 0.02 \* (age - 45) + 0.3 \* log(income / 50000) + 0.1 \* freq_12mo - 0.005 \* recency_days This is RFM - recency, frequency, monetary pattern. ### Q20 — Maximum Likelihood vs. Least Squares The video explains a key difference in how the two models are fitted: - **Linear regression**: minimizes the sum of squared residuals (least squares) - **Logistic regression**: maximizes the likelihood of the observed data Written question: 1. Why can't logistic regression use least squares? (Hint: what is the "residual" when the outcome is 0 or 1?) When the outcome is only ever 0 or 1, errors are not normally distributed and the relationship is not linear, so minimizing squared residuals produces a nonsensical result because the math does not apply to binary data.\ 2. In plain English, what does it mean for logistic regression to find the curve that "maximizes the likelihood" of the observed responses?\ The model finds the curve that assigns high probabilities to customers who actually responded and low probabilities to those who did not, across all customers at once. 3. The video mentions that logistic regression does not have the same concept of a residual. How does this affect how we assess model fit — and what metrics do we use instead? (Reference your Week 3 notebook.)\ Since residuals are not meaningful, we use metrics like AUC ROC, sensitivity, specificity, and the confusion matrix to measure how well the model separates responders from non-responders. ### Q21 — Predicting Probability vs. Class In the Week 2 notebook, `augment()` returns both `.pred_yes` (a probability) and `.pred_class` (a class label). 1. Run the code below to see the distribution of `.pred_yes`. What is the typical predicted probability for a customer in this dataset?\ The predicted probabilities cluster between 0.04 and 0.07, with most customers sitting right around 0.05.\ 2. The default threshold for `.pred_class` is 0.5. Given the 4.7% response rate, why might 0.5 be a poor threshold for a direct-mail campaign?\ None of the customers in the dataset has a predicted probability anywhere near 0.5. Using that threshold would classify every single customer as "no" and the model would never identify anyone to mail.\ 3. What threshold might make more business sense — and what metric from Week 3 would you use to evaluate the impact of changing it?\ A threshold around 0.055 to 0.06 would make more sense\ ```{r q4-4, eval=TRUE} library(tidymodels) library(tidyverse) library(glmnet) tidymodels_prefer() mail_data <- tibble( customer_id = paste0("C", str_pad(1:n, 5, pad = "0")), age = round(rnorm(n, mean = 45, sd = 12)), income = round(rlnorm(n, meanlog = 10.8, sdlog = 0.6)), recency_days = round(rexp(n, rate = 1 / 60)), freq_12mo = rpois(n, lambda = 3), avg_order_amt = round(rlnorm(n, meanlog = 4.2, sdlog = 0.5), 2), channel = sample( c("email", "direct_mail", "digital"), n, replace = TRUE, prob = c(0.5, 0.3, 0.2) ), region = sample( c("West", "South", "Midwest", "Northeast"), n, replace = TRUE ), loyalty_tier = sample( c("Bronze", "Silver", "Gold", "Platinum"), n, replace = TRUE, prob = c(0.4, 0.3, 0.2, 0.1) ), responded = rbinom( n, 1, prob = plogis( -3 + 0.02 * (age - 45) + 0.3 * log(income / 50000) + 0.1 * freq_12mo - 0.005 * recency_days ) ) ) |> mutate( income = if_else(runif(n) < 0.08, NA_real_, income), avg_order_amt = if_else(runif(n) < 0.05, NA_real_, avg_order_amt), responded = factor(responded, levels = c(1, 0), labels = c("yes", "no")) ) mail_split <- initial_split(mail_data, prop = 0.80, strata = responded) mail_train <- training(mail_split) mail_test <- testing(mail_split) mail_rec <- recipe(responded ~ ., data = mail_train) |> update_role(customer_id, new_role = "ID") |> step_impute_median(all_numeric_predictors()) |> step_log(income, avg_order_amt, recency_days, base = 10, offset = 1) |> step_normalize(all_numeric_predictors()) |> step_dummy(all_nominal_predictors()) |> step_zv(all_predictors()) lr_spec <- logistic_reg(penalty = 0.01, mixture = 1) |> set_engine("glmnet") |> set_mode("classification") mail_fit <- workflow() |> add_recipe(mail_rec) |> add_model(lr_spec) |> fit(data = mail_train) preds <- augment(mail_fit, new_data = mail_test) # Distribution of predicted probabilities ggplot(preds, aes(x = .pred_yes, fill = responded)) + geom_histogram(bins = 40, alpha = 0.7, position = "identity") + labs( title = "Distribution of Predicted Probabilities", x = "P(responded = yes)", y = "Count" ) + theme_minimal() ``` ### Q22 — Wald's Test and Variable Selection The video mentions that logistic regression uses **Wald's tests** to evaluate whether individual predictors contribute meaningfully to the model. 1. In the Week 2 notebook, you used a **Lasso** penalty (`mixture = 1`). How does Lasso serve a similar purpose to Wald's test — identifying which predictors are not useful?\ The Wald test identifies useless predictors by checking if their coefficient is close enough to zero to be considered meaningless. Lasso shrinks weak predictors all the way to exactly zero during fitting, removing them from the model automatically.\ 2. Run the code below to extract the model coefficients. Which predictors were shrunk to exactly zero by the Lasso?\ recency_days, freq_12mo, avg_order_amt, channel_direct_mail, channel_email, region_Northeast, region_South, and region_West\ 3. Based on the video's explanation of what logistic regression coefficients represent (log odds ratios), interpret the sign of the `freq_12mo` coefficient in plain English.\ freq_12mo was shrunk to zero, meaning the model found no meaningful relationship between purchase frequency and response after accounting for the other predictors. ```{r q4-5, eval=FALSE} # Extract Lasso coefficients mail_fit |> extract_fit_parsnip() |> tidy() |> filter(term != "(Intercept)") |> arrange(desc(abs(estimate))) ``` ### Q23 — Synthesis: From Logs to Logistic Regression This final question ties all four videos together. Written question (3–5 sentences): The four videos form a conceptual chain: **Logs → Odds → Odds Ratios → Logistic Regression**. 1. Explain in your own words how each step connects to the next: - Why do we take the **log** of odds (not just use odds directly)?\ Taking the log stretches odds to an unbounded, symmetric scale that a linear model can handle cleanly. - Why do we compare odds as a **ratio** rather than a difference?\ A difference in odds depends on the baseline, making it difficult to compare across groups. A ratio is scale-free. - How does the logit (log odds) become the linear predictor inside logistic regression?\ Once odds are logged, the result ranges from negative infinity to positive infinity, exactly matching what a linear combination of predictors produces. Logistic regression sets the linear combination equal to the log odds, then uses plogis to convert back to a probability for the final prediction. 2. In the context of the direct-mail campaign in `mail_data`, describe in one sentence what the logistic regression model is actually computing when it produces `.pred_yes = 0.08` for a specific customer.\ The model is taking that customer's age, income, frequency, and other features, combining them into a single log odds score using the fitted coefficients, and then converting that score into a probability, concluding that this particular customer has a 1 in 12 chance of responding to the campaign. # Log Assignments Run this chunk first — it loads the packages and recreates the direct-mail dataset used throughout the course. ```{r setup-log} library(tidyverse) # ── Recreate the direct-mail dataset ───────────────────────────────────────── set.seed(123) n <- 3000 mail_data <- tibble( customer_id = paste0("C", str_pad(1:n, 5, pad = "0")), age = round(rnorm(n, mean = 45, sd = 12)), income = round(rlnorm(n, meanlog = 10.8, sdlog = 0.6)), recency_days = round(rexp(n, rate = 1 / 60)), freq_12mo = rpois(n, lambda = 3), avg_order_amt = round(rlnorm(n, meanlog = 4.2, sdlog = 0.5), 2), channel = sample( c("email", "direct_mail", "digital"), n, replace = TRUE, prob = c(0.5, 0.3, 0.2) ), region = sample( c("West", "South", "Midwest", "Northeast"), n, replace = TRUE ), loyalty_tier = sample( c("Bronze", "Silver", "Gold", "Platinum"), n, replace = TRUE, prob = c(0.4, 0.3, 0.2, 0.1) ), responded = rbinom( n, 1, prob = plogis( -3 + 0.02 * (age - 45) + 0.3 * log(income / 50000) + 0.1 * freq_12mo - 0.005 * recency_days ) ) ) |> mutate( income = if_else(runif(n) < 0.08, NA_real_, income), avg_order_amt = if_else(runif(n) < 0.05, NA_real_, avg_order_amt), responded = factor(responded, levels = c(1, 0), labels = c("yes", "no")) ) glimpse(mail_data) ``` ------------------------------------------------------------------------ ## Assignment L.1 — Dollar to Log Customer `C00005` has an income of \$233,070. 1. Convert this value to the **natural log**: `log(233070)12.359` 2. Convert this value to **log base 10**: `log10(233070)` log~10~(233070) = log(10^5.367^) = 5.367 3. Verify both results using R 4. In your own words, why does the recipe use `base = 10` rather than the natural log?\ Base 10 is easier to read. A log10 value of 5 means the original number was 100,000, which most people can follow. Natural log values do not have that same intuitive meaning, so base 10 is just a more human friendly choice when you need to explain the numbers to someone outside of data science. ::: {.callout-tip title="Hint"} In R, `log(x)` computes the natural log ($\ln x$) and `log10(x)` computes log base 10. You can also write `log(x, base = 10)` — this is exactly what `step_log(..., base = 10)` uses internally. ::: ```{r assignment-L1, eval=FALSE} # Your code here log(233070) log10(233070) ``` ------------------------------------------------------------------------ ## Assignment L.2 — Log Back to Dollar (Median and Mean) The income variable was generated with `meanlog = 10.8` and `sdlog = 0.6`. These parameters describe the distribution of **log(income)**, not income itself. 1. Convert the log-scale **median** (10.8) back to dollars using `exp()` 2. Convert the log-scale **mean** (10.98) back to dollars using `exp()` 3. Verify these match the formulas from the Week 2 notebook: $$\text{Median} = e^{\mu} = e^{10.8}$$ $$\text{Mean} = e^{\mu + \sigma^2/2} = e^{10.8 + 0.6^2/2}$$\ yes![](images/clipboard-627223023.png) 4. Why is the mean higher than the median on the dollar scale?\ Income is right-skewed — a small number of high earners pull the mean above the median. ::: {.callout-tip title="Hint"} `exp(x)` is the inverse of `log(x)`. If `log(income) = 10.8`, then `exp(10.8)` gives you income back in dollars. For the mean, compute `10.8 + 0.6^2 / 2` first, then apply `exp()`. ::: ```{r assignment-L2, eval=FALSE} # Your code here median_income <- exp(10.8) # [1] 49020.8 mean_income <- exp(10.98) #[1] 58688.55 ``` ------------------------------------------------------------------------ ## Assignment L.3 — Geometric Standard Deviation The **geometric standard deviation (GSD)** describes the typical spread of income around the median on the original dollar scale. $$\text{GSD} = e^{\sigma} = e^{0.6}$$ 1. Compute the GSD using `exp(0.6)``> exp(.6)` `[1] 1.822119` 2. Using the median income from L.2 (\~\$49,021), compute the typical range: - Lower bound: `median_income / GSD``> median_income / GSD` `[1] 26903.19` - Upper bound: `median_income * GSD``> median_income * GSD` `[1] 89321.72` 3. Verify this matches the shaded green band in the Week 2 plot (\~\$26,935 to \~\$89,218) yes ![](images/clipboard-3281195122.png) 4. Interpret this range in plain English: what does it tell a marketer about the "typical" customer in this dataset?\ Even though the income spread is between \$26k to \$89k, and the average earning is \$59k, marketers should expect a drop in salaries starting after \$49k. ::: {.callout-tip title="Hint"} Unlike the regular standard deviation (which adds and subtracts), the GSD **multiplies and divides** because we are on a multiplicative (log) scale. ::: ```{r assignment-L3, eval=FALSE} # Your code here GSD <- exp(.6) #lower bound median_income / GSD #upper bound median_income * GSD ``` ------------------------------------------------------------------------ ## Assignment L.4 — Confirm with Real Data Now verify the theoretical values from L.2 and L.3 against the actual `mail_data` dataset. 1. Compute the **median** and **mean** of `income`, removing `NA`s with `na.rm = TRUE`. How close are they to the theoretical values? +---------------+-------------+---+---+---+ | median_income | mean_income | | | | | | | | | | | \<dbl\> | \<dbl\> | | | | +==============:+============:+===+===+===+ | 49331.5 | 58349.86 | | | | +---------------+-------------+---+---+---+ These values are only about \$300 less than the theoretical values.\ 2. Create a new column called `log_income` using `mutate()` and `log()`. What are the median and mean of `log_income`? +-------------------+-----------------+---+---+---+ | median_log_income | mean_log_income | | | | | | | | | | | \<dbl\> | \<dbl\> | | | | +==================:+================:+===+===+===+ | 10.80632 | 10.79787 | | | | +-------------------+-----------------+---+---+---+ 3. Apply `exp()` to the median and mean of `log_income`. Do you recover the original dollar-scale values? Explain why or why not. Only median income is correctly reversed since the axis is even, but mean is incorrect since income average is right-skewed due to the inflation of the large amounts in the tail. +-----------------------+---------------------+---+---+---+ | exp_median_log_income | exp_mean_log_income | | | | | | | | | | | \<dbl\> | \<dbl\> | | | | +======================:+====================:+===+===+===+ | 49331.49 | 48916.37 | | | | +-----------------------+---------------------+---+---+---+ ::: {.callout-tip title="Hint"} Use `summarise()` with `median()` and `mean()` to compute summary statistics. The theoretical and empirical values may differ slightly because `mail_data` is a finite random sample — not the infinite population the parameters describe. ::: ```{r assignment-L4, eval=FALSE} # Step 1: median and mean of income (removing NAs) mail_data |> summarise( median_income = median(income, na.rm = TRUE), mean_income = mean(income, na.rm = TRUE) ) # Step 2: create log_income and summarise mail_data <- mail_data |> mutate(log_income = log(income)) mail_data |> summarise( median_log_income = median(log_income, na.rm = TRUE), mean_log_income = mean(log_income, na.rm = TRUE) ) # Step 3: back-transform with exp() mail_data |> summarise( exp_median_log_income = exp(median(log_income, na.rm = TRUE)), exp_mean_log_income = exp(mean(log_income, na.rm = TRUE)) ) ``` ------------------------------------------------------------------------ ## Assignment L.5 — Base 10 vs. Natural Log Your recipe uses `step_log(..., base = 10)`. Does the choice of base actually matter for modeling?\ Not necessarily, but the base does change the scale of the coefficients 1. For income values of \$10,000, \$100,000, and \$1,000,000, compute both `log()` (natural) and `log10()` for each value\ log~10~(10000) = 4; log(10000) = 9.2103 log~10~(100000) = 5 ; log(100000) = 11.5129 log~10~(1000000) = 6 ; log(1000000) = 13.8155 2. Present the results as a tibble with columns: `income`, `log_natural`, `log_base10` +---------+-------------+------------+---+---+ | ncome | log_natural | log_base10 | | | | | | | | | | \<dbl\> | \<dbl\> | \<dbl\> | | | +========:+============:+===========:+===+===+ | 1e+04 | 9.21034 | 4 | | | +---------+-------------+------------+---+---+ | 1e+05 | 11.51293 | 5 | | | +---------+-------------+------------+---+---+ | 1e+06 | 13.81551 | 6 | | | +---------+-------------+------------+---+---+ 3. Divide `log_natural` by `log_base10` for each row. What constant do you get? (Hint: this constant is `log(10)`)\ +---------+-------------+------------+------------------------+---+ | income | log_natural | log_base10 | log-natural/log_base10 | | | | | | | | | \<dbl\> | \<dbl\> | \<dbl\> | | | +========:+============:+===========:+========================+===+ | 1e+04 | 9.21034 | 4 | 2.3026 | | +---------+-------------+------------+------------------------+---+ | 1e+05 | 11.51293 | 5 | 2.3026 | | +---------+-------------+------------+------------------------+---+ | 1e+06 | 13.81551 | 6 | 2.3026 | | +---------+-------------+------------+------------------------+---+ 4. Since the two columns differ only by a constant, does it matter which base we use for preprocessing? Explain why or why not in terms of what `step_normalize()` does afterward.\ No, log bases only change the logged values by a constant multiplier. ::: {.callout-tip title="Hint"} `log(10)` ≈ 2.303. Multiplying or dividing by a constant shifts and scales values — but `step_normalize()` centers and scales anyway, so any constant difference between bases gets absorbed. ::: ```{r assignment-L5, eval=FALSE} # Step 1 & 2: build the tibble tibble( income = c(10000, 100000, 1000000) ) |> mutate( log_natural = log(income), log_base10 = log10(income) ) # Step 3: compute the ratio # Step 4: your written explanation here (as a comment) # No, log bases only change the logged values by a constant multiplier. ``` ------------------------------------------------------------------------ ### Summary +--------------+----------------------------------+---------------------------+ | Assignment | Key operation | R function | +==============+==================================+===========================+ | L.1 | Dollar → log | `log()`, `log10()` | +--------------+----------------------------------+---------------------------+ | L.2 | Log → dollar (median & mean) | `exp()` | +--------------+----------------------------------+---------------------------+ | L.3 | Geometric standard deviation | `exp(sdlog)` | +--------------+----------------------------------+---------------------------+ | L.4 | Empirical vs. theoretical values | `mutate()`, `summarise()` | +--------------+----------------------------------+---------------------------+ | L.5 | Base 10 vs. natural log | `log()` / `log10()` ratio | +--------------+----------------------------------+---------------------------+ ------------------------------------------------------------------------ # Exercises ```{r setup-exercises} library(tidymodels) # umbrella: rsample, recipes, parsnip, workflows, yardstick library(tidyverse) library(janitor) # clean_names() library(skimr) # skim() tidymodels_prefer() # resolve function-name conflicts set.seed(2024) # ── Synthetic direct-mail dataset ────────────────────────────────────────────── set.seed(123) n <- 3000 mail_data <- tibble( customer_id = paste0("C", str_pad(1:n, 5, pad = "0")), age = round(rnorm(n, mean = 45, sd = 12)), income = round(rlnorm(n, meanlog = 10.8, sdlog = 0.6)), # right-skewed recency_days = round(rexp(n, rate = 1 / 60)), # days since last purchase freq_12mo = rpois(n, lambda = 3), # purchases in 12 months avg_order_amt = round(rlnorm(n, meanlog = 4.2, sdlog = 0.5), 2), channel = sample( c("email", "direct_mail", "digital"), n, replace = TRUE, prob = c(0.5, 0.3, 0.2) ), region = sample( c("West", "South", "Midwest", "Northeast"), n, replace = TRUE ), loyalty_tier = sample( c("Bronze", "Silver", "Gold", "Platinum"), n, replace = TRUE, prob = c(0.4, 0.3, 0.2, 0.1) ), responded = rbinom( n, 1, prob = plogis( -3 + 0.02 * (age - 45) + 0.3 * log(income / 50000) + 0.1 * freq_12mo - 0.005 * recency_days ) ) ) |> mutate( income = if_else(runif(n) < 0.08, NA_real_, income), avg_order_amt = if_else(runif(n) < 0.05, NA_real_, avg_order_amt), responded = factor(responded, levels = c(1, 0), labels = c("yes", "no")) ) # introduce ~8 % missingness in income and avg_order_amt glimpse(mail_data) ``` ------------------------------------------------------------------------ ## ✏️ Exercise 3.1 Change `prop` to `0.70` and re-run the split. How many records move from training to test? \ 900\ How does the response rate change?\ +----------+---------------+---+---+---+ | split | response_rate | | | | | | | | | | | \<chr\> | \<dbl\> | | | | +:=========+==============:+===+===+===+ | Training | 0.04333333 | | | | +----------+---------------+---+---+---+ | Test | 0.05666667 | | | | +----------+---------------+---+---+---+ ```{r ex3-1, eval=FALSE} # Your code here set.seed(617) mail_split <- initial_split( mail_data, prop = 0.70, strata = responded ) mail_train <- training(mail_split) mail_test <- testing(mail_split) cat("Training rows:", nrow(mail_train), "\n") cat("Test rows :", nrow(mail_test), "\n") bind_rows( mail_train |> summarise(split = "Training", response_rate = mean(responded == "yes")), mail_test |> summarise(split = "Test", response_rate = mean(responded == "yes")) ) # Training rows: 2100 # Test rows : 900 ``` ------------------------------------------------------------------------ ## ✏️ Exercise 8.1 — Alternative Imputation Replace `step_impute_median()` with `step_impute_mean()`. Re-prep the recipe and compare the imputed values. Which would you prefer for `income` and why? Median because mean is always right skewed due to inflation by the larger earners. ```{r ex8-1, eval=FALSE} # Your code here rec_partial <- recipe( responded ~ income + freq_12mo + age, data = mail_train ) |> step_impute_mean(all_numeric_predictors()) |> step_log(income, base = 10, offset = 1) |> step_normalize(all_numeric_predictors()) prep(rec_partial) |> bake(new_data = mail_train) |> summary() ``` ## ✏️ Exercise 8.2 — Different Encoding `loyalty_tier` is ordinal (Bronze \< Silver \< Gold \< Platinum). Replace `step_dummy()` with `step_ordinalscore()` for `loyalty_tier` only. Does this change the number of columns in the baked data? yes ```{r ex8-2, eval=FALSE} # Hint: use step_ordinalscore() from the recipes package # loyalty_tier is currently character. To be converted to ordinal scale, it must be an ordered factor first. mail_data <- mail_data |> mutate( loyalty_tier = factor( loyalty_tier, levels = c("Bronze", "Silver", "Gold", "Platinum"), ordered = TRUE ) ) # Your code here mail_data <- mail_data |> mutate( loyalty_tier = ordered( loyalty_tier, levels = c("Bronze", "Silver", "Gold", "Platinum") ) ) set.seed(617) mail_split <- initial_split( mail_data, prop = 0.70, strata = responded ) mail_train <- training(mail_split) mail_test <- testing(mail_split) is.ordered(mail_train$loyalty_tier) rec_ordinal <- recipe(responded ~ ., data = mail_train) |> update_role(customer_id, new_role = "ID") |> step_impute_median(all_numeric_predictors()) |> step_log(income, base = 10, offset = 1) |> step_ordinalscore(loyalty_tier) |> step_dummy(all_nominal_predictors()) |> step_normalize(all_numeric_predictors()) prep(rec_ordinal) |> bake(new_data = mail_train) |> ncol() ``` ## ✏️ Exercise 8.3 — Interaction Term Add an interaction between `freq_12mo` and `log10(income)` using `step_interact()`. What is the marketing intuition for this interaction?\ A person with high income and past purchases may be likely to respond. ```{r ex8-3, eval=FALSE} # Your code here rec_interact <- recipe(responded ~ ., data = mail_train) |> update_role(customer_id, new_role = "ID") |> step_impute_median(all_numeric_predictors()) |> step_log(income, base = 10, offset = 1) |> step_interact(terms = ~ freq_12mo:income) |> step_dummy(all_nominal_predictors()) |> step_normalize(all_numeric_predictors()) prep(rec_interact) |> bake(new_data = mail_train) |> glimpse() ``` ## ✏️ Exercise 8.4 — Remove a Step Remove `step_normalize()` from the recipe, retrain the Lasso, and compare `roc_auc` before and after. What happens to model performance? ```{r ex8-4, eval=FALSE} # Your code here rec_no_norm <- recipe(responded ~ ., data = mail_train) |> update_role(customer_id, new_role = "ID") |> step_impute_median(all_numeric_predictors()) |> step_log(income, base = 10, offset = 1) |> step_dummy(all_nominal_predictors()) lasso_no_norm_fit <- workflow() |> add_recipe(rec_no_norm) |> add_model(lr_spec) |> fit(data = mail_train) lasso_no_norm_preds <- predict(lasso_no_norm_fit, mail_test, type = "prob") |> bind_cols(mail_test |> select(responded)) roc_auc( lasso_no_norm_preds, truth = responded, .pred_yes ) ``` ## ✏️ Exercise 8.5 — Stratification Check The split used `strata = responded`. Remove stratification and re-split five times. Calculate the standard deviation of the response rate in the test sets. Does stratification reduce variance? \ yes ```{r ex8-5, eval=FALSE} # Your code here set.seed(617) mail_split <- initial_split( mail_data, prop = 0.80, # 80 % train, 20 % test strata = responded # preserve class balance in both sets ) mail_train <- training(mail_split) mail_test <- testing(mail_split) cat("Training rows:", nrow(mail_train), "\n") no_strata_rates <- tibble(split_num = 1:5) |> mutate( split = map(split_num, ~ { set.seed(617 + .x) initial_split( mail_data, prop = 0.80 ) }), test_data = map(split, testing), test_response_rate = map_dbl( test_data, ~ mean(.x$responded == "yes") ) ) no_strata_rates no_strata_rates |> summarise( sd_test_response_rate = sd(test_response_rate) ) ``` ## ✏️ Exercise 8.6 — Data Leakage Prep the recipe correctly using `mail_train`, then bake `mail_test`. Next, incorrectly prep the same recipe using `mail_test`. Compare the normalization statistics for `income`. Why is the second approach considered data leakage? ```{r ex8-6, eval=FALSE} # Your code here ``` ## ✏️ Exercise 8.7 — Inspecting a Recipe Use `tidy(mail_prep)` to list the recipe steps. Then inspect the imputation values and normalization statistics using `tidy(mail_prep, number = 1)` and `tidy(mail_prep, number = 3)`. What values were learned during `prep()`? ```{r ex8-7, eval=FALSE} # Your code here ``` ## ✏️ Exercise 8.8 — Variable Roles Remove `update_role(customer_id, new_role = "ID")` from the recipe and re-prep it. What happens to `customer_id` in the baked data? Why should an ID variable not be used as a predictor? ```{r ex8-8, eval=FALSE} # Your code here ``` ------------------------------------------------------------------------ ### Summary +-----------------------+------------------------+----------------------------+ | Concept | Function | Key Argument | +=======================+========================+============================+ | Data split | `initial_split()` | `prop`, `strata` | +-----------------------+------------------------+----------------------------+ | Training set | `training()` | — | +-----------------------+------------------------+----------------------------+ | Test set | `testing()` | — | +-----------------------+------------------------+----------------------------+ | Define recipe | `recipe()` | formula, `data = train` | +-----------------------+------------------------+----------------------------+ | Log transform | `step_log()` | `base`, `offset` | +-----------------------+------------------------+----------------------------+ | Median impute | `step_impute_median()` | selector | +-----------------------+------------------------+----------------------------+ | Normalize | `step_normalize()` | selector | +-----------------------+------------------------+----------------------------+ | Dummy encode | `step_dummy()` | `all_nominal_predictors()` | +-----------------------+------------------------+----------------------------+ | Estimate steps | `prep()` | `training = train` | +-----------------------+------------------------+----------------------------+ | Apply steps | `bake()` | `new_data = test` | +-----------------------+------------------------+----------------------------+ | Bundle recipe + model | `workflow()` | — | +-----------------------+------------------------+----------------------------+ ```{r session-info} sessionInfo() ``` # Appendix - [GitHub Repo](https://github.com/lewiswaddell/W04-1.git) - [GitHub Website](https://lewiswaddell.github.io/W04-1/)