This week I’m building a logistic regression model to predict a binary outcome. Unlike linear regression (which predicts continuous values like price), logistic regression predicts probabilities and classifications for yes/no outcomes.
Why logistic regression matters: Many real-world questions are binary Will this home sell quickly? Is this customer likely to default? Should we approve this loan? Understanding what factors influence these binary outcomes helps with decision-making and risk assessment.
ames <- read.csv("ames.csv", stringsAsFactors = FALSE)
# Examine existing binary variables
cat("Existing binary variables:\n\n")
## Existing binary variables:
cat("Central.Air:\n")
## Central.Air:
table(ames$Central.Air)
##
## N Y
## 196 2734
cat("\nProportion with AC:", mean(ames$Central.Air == "Y"), "\n")
##
## Proportion with AC: 0.9331058
Issue with existing binary variables:
Central Air is too imbalanced (93% have AC, only 7% don’t). When one outcome is very rare, logistic regression struggles to find patterns and predictions become less reliable.
I’ll explore which features differentiate premium from non-premium homes.
# Compare premium vs non-premium homes across potential predictors
comparison <- ames |>
group_by(Is_Premium_Quality) |>
summarise(
Count = n(),
Mean_Living_Area = mean(Gr.Liv.Area),
Mean_Year_Built = mean(Year.Built),
Mean_Basement_SF = mean(Total.Bsmt.SF),
Mean_Garage_Area = mean(Garage.Area),
Pct_With_AC = mean(Central.Air == "Y") * 100
)
kable(comparison,
col.names = c("Premium?", "Count", "Living Area (sq ft)", "Year Built",
"Basement (sq ft)", "Garage (sq ft)", "% with AC"),
caption = "Comparing Premium vs. Non-Premium Homes",
digits = 0)
| Premium? | Count | Living Area (sq ft) | Year Built | Basement (sq ft) | Garage (sq ft) | % with AC |
|---|---|---|---|---|---|---|
| 0 | 2442 | 1402 | 1966 | 92 | ||
| 1 | 488 | 1989 | 1999 | 1530 | 708 | 99 |
What stands out:
# Check correlations among potential predictors
predictors_to_check <- ames |>
select(Gr.Liv.Area, Total.Bsmt.SF, Garage.Area, Year.Built)
cor_matrix <- cor(predictors_to_check, use = "complete.obs")
kable(cor_matrix,
caption = "Correlation Matrix of Potential Predictors",
digits = 3)
| Gr.Liv.Area | Total.Bsmt.SF | Garage.Area | Year.Built | |
|---|---|---|---|---|
| Gr.Liv.Area | 1.000 | 0.445 | 0.485 | 0.242 |
| Total.Bsmt.SF | 0.445 | 1.000 | 0.486 | 0.407 |
| Garage.Area | 0.485 | 0.486 | 1.000 | 0.480 |
| Year.Built | 0.242 | 0.407 | 0.480 | 1.000 |
Multicollinearity concerns:
Final predictor selection:
These 4 predictors have low to moderate correlations with each other (all < 0.6) and collectively capture size, age, and features.
# Convert Central.Air to numeric for model
ames <- ames |>
mutate(Has_Central_Air = if_else(Central.Air == "Y", 1, 0))
# Remove row with missing Garage.Area (1 row)
ames_complete <- ames |>
filter(!is.na(Garage.Area))
cat("Note: Removed", nrow(ames) - nrow(ames_complete),
"row(s) with missing Garage.Area\n")
## Note: Removed 1 row(s) with missing Garage.Area
cat("Final sample size:", nrow(ames_complete), "homes\n\n")
## Final sample size: 2929 homes
# Fit logistic regression
model <- glm(Is_Premium_Quality ~ Gr.Liv.Area + Year.Built + Garage.Area + Has_Central_Air,
data = ames_complete,
family = binomial(link = "logit"))
# Display summary
summary(model)
##
## Call:
## glm(formula = Is_Premium_Quality ~ Gr.Liv.Area + Year.Built +
## Garage.Area + Has_Central_Air, family = binomial(link = "logit"),
## data = ames_complete)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -122.3098398 8.9951944 -13.597 <2e-16 ***
## Gr.Liv.Area 0.0016463 0.0001625 10.133 <2e-16 ***
## Year.Built 0.0586044 0.0045935 12.758 <2e-16 ***
## Garage.Area 0.0053910 0.0004470 12.059 <2e-16 ***
## Has_Central_Air -1.6675640 0.6783577 -2.458 0.014 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2638.9 on 2928 degrees of freedom
## Residual deviance: 1434.1 on 2924 degrees of freedom
## AIC: 1444.1
##
## Number of Fisher Scoring iterations: 7
Model equation (log-odds scale):
coeffs <- coef(model)
cat("log(odds of premium quality) = \n")
## log(odds of premium quality) =
cat(" ", round(coeffs[1], 4), "\n")
## -122.3098
cat(" + ", round(coeffs[2], 6), " × Gr.Liv.Area\n")
## + 0.001646 × Gr.Liv.Area
cat(" + ", round(coeffs[3], 6), " × Year.Built\n")
## + 0.058604 × Year.Built
cat(" + ", round(coeffs[4], 6), " × Garage.Area\n")
## + 0.005391 × Garage.Area
cat(" + ", round(coeffs[5], 4), " × Has_Central_Air\n")
## + -1.6676 × Has_Central_Air
Logistic regression coefficients are trickier to interpret than linear regression because they represent changes in log-odds, not probabilities directly. I’ll interpret them in three ways:
coef_summary <- tidy(model, conf.int = TRUE)
kable(coef_summary,
digits = c(0, 6, 6, 3, 4, 6, 6),
caption = "Logistic Regression Coefficients")
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -122.309840 | 8.995194 | -13.597 | 0.000 | -140.644347 | -105.345048 |
| Gr.Liv.Area | 0.001646 | 0.000162 | 10.133 | 0.000 | 0.001333 | 0.001970 |
| Year.Built | 0.058604 | 0.004594 | 12.758 | 0.000 | 0.049930 | 0.067956 |
| Garage.Area | 0.005391 | 0.000447 | 12.059 | 0.000 | 0.004524 | 0.006277 |
| Has_Central_Air | -1.667564 | 0.678358 | -2.458 | 0.014 | -2.872003 | -0.133953 |
What the signs tell us:
All predictors are statistically significant and point in the expected direction.
The coefficient on the log-odds scale is hard to interpret, but when we exponentiate it, we get an odds ratio that’s much more intuitive.
# Calculate odds ratios
or_table <- coef_summary |>
mutate(
Odds_Ratio = exp(estimate),
OR_Lower = exp(conf.low),
OR_Upper = exp(conf.high)
) |>
select(term, estimate, Odds_Ratio, OR_Lower, OR_Upper, p.value)
kable(or_table,
digits = c(0, 6, 4, 4, 4, 4),
col.names = c("Variable", "Log-Odds Coefficient", "Odds Ratio",
"OR 95% Lower", "OR 95% Upper", "P-value"),
caption = "Odds Ratios and Confidence Intervals")
| Variable | Log-Odds Coefficient | Odds Ratio | OR 95% Lower | OR 95% Upper | P-value |
|---|---|---|---|---|---|
| (Intercept) | -122.309840 | 0.0000 | 0.0000 | 0.0000 | 0.000 |
| Gr.Liv.Area | 0.001646 | 1.0016 | 1.0013 | 1.0020 | 0.000 |
| Year.Built | 0.058604 | 1.0604 | 1.0512 | 1.0703 | 0.000 |
| Garage.Area | 0.005391 | 1.0054 | 1.0045 | 1.0063 | 0.000 |
| Has_Central_Air | -1.667564 | 0.1887 | 0.0566 | 0.8746 | 0.014 |
How to interpret odds ratios:
Practical interpretations:
While odds ratios are useful, probabilities are what people actually understand. Let me show how different features affect the probability of premium quality.
# Create example homes with varying living areas
example_homes <- expand.grid(
Gr.Liv.Area = seq(1000, 3000, by = 500),
Year.Built = 2000, # Recent home
Garage.Area = 500, # Average garage
Has_Central_Air = 1 # Has AC
)
# Predict probabilities
example_homes$Predicted_Prob <- predict(model, newdata = example_homes, type = "response")
kable(example_homes,
col.names = c("Living Area", "Year Built", "Garage Area", "Has AC",
"Probability of Premium"),
caption = "Example: How Living Area Affects Premium Quality Probability",
digits = c(0, 0, 0, 0, 3))
| Living Area | Year Built | Garage Area | Has AC | Probability of Premium |
|---|---|---|---|---|
| 1000 | 2000 | 500 | 1 | 0.081 |
| 1500 | 2000 | 500 | 1 | 0.167 |
| 2000 | 2000 | 500 | 1 | 0.314 |
| 2500 | 2000 | 500 | 1 | 0.511 |
| 3000 | 2000 | 500 | 1 | 0.704 |
# Visualize
ggplot(example_homes, aes(x = Gr.Liv.Area, y = Predicted_Prob)) +
geom_line(color = "#3498db", linewidth = 1.5) +
geom_point(size = 3, color = "#3498db") +
scale_y_continuous(labels = percent_format(), limits = c(0, 1)) +
scale_x_continuous(labels = comma_format()) +
labs(title = "How Home Size Affects Probability of Premium Quality",
subtitle = "For a home built in 2000, 500 sq ft garage, with central air",
x = "Living Area (square feet)",
y = "Predicted Probability of Premium Quality") +
theme_minimal()
Key insight from this visualization:
The relationship is non-linear (S-shaped). Going from 1,000 to 1,500 sq ft increases probability from 8% to 17% (a 8.6 percentage point increase). But going from 2,500 to 3,000 sq ft increases it from 51% to 70% (a 19.3 percentage point increase).
Size matters most in the middle range where the curve is steepest.
Let me focus on the Gr.Liv.Area coefficient and build a 95% confidence interval using its standard error.
# Extract coefficient and standard error for Gr.Liv.Area
coef_area <- coeffs["Gr.Liv.Area"]
se_area <- summary(model)$coefficients["Gr.Liv.Area", "Std. Error"]
# Build 95% CI on log-odds scale
z_critical <- qnorm(0.975) # 1.96 for 95% CI
ci_lower_logodds <- coef_area - z_critical * se_area
ci_upper_logodds <- coef_area + z_critical * se_area
cat("=== 95% CONFIDENCE INTERVAL FOR GR.LIV.AREA ===\n\n")
## === 95% CONFIDENCE INTERVAL FOR GR.LIV.AREA ===
cat("Coefficient (log-odds scale):", round(coef_area, 6), "\n")
## Coefficient (log-odds scale): 0.001646
cat("Standard Error:", round(se_area, 6), "\n")
## Standard Error: 0.000162
cat("Z-critical (95%):", round(z_critical, 3), "\n\n")
## Z-critical (95%): 1.96
cat("95% CI on log-odds scale:\n")
## 95% CI on log-odds scale:
cat(" Lower bound:", round(ci_lower_logodds, 6), "\n")
## Lower bound: 0.001328
cat(" Upper bound:", round(ci_upper_logodds, 6), "\n\n")
## Upper bound: 0.001965
# Convert to odds ratios
or_area <- exp(coef_area)
or_ci_lower <- exp(ci_lower_logodds)
or_ci_upper <- exp(ci_upper_logodds)
cat("95% CI on odds ratio scale (per 1 sq ft):\n")
## 95% CI on odds ratio scale (per 1 sq ft):
cat(" Odds Ratio:", round(or_area, 6), "\n")
## Odds Ratio: 1.001648
cat(" Lower bound:", round(or_ci_lower, 6), "\n")
## Lower bound: 1.001329
cat(" Upper bound:", round(or_ci_upper, 6), "\n\n")
## Upper bound: 1.001967
# More interpretable: per 100 sq ft
or_100 <- exp(coef_area * 100)
or_100_ci_lower <- exp(ci_lower_logodds * 100)
or_100_ci_upper <- exp(ci_upper_logodds * 100)
cat("95% CI on odds ratio scale (per 100 sq ft):\n")
## 95% CI on odds ratio scale (per 100 sq ft):
cat(" Odds Ratio:", round(or_100, 3), "\n")
## Odds Ratio: 1.179
cat(" Lower bound:", round(or_100_ci_lower, 3), "\n")
## Lower bound: 1.142
cat(" Upper bound:", round(or_100_ci_upper, 3), "\n")
## Upper bound: 1.217
What this confidence interval means:
We are 95% confident that the true effect of living area on premium quality falls within this range:
Per 100 square feet:
In practical terms:
If we compared two otherwise identical homes where one is 100 sq ft larger, we’re 95% confident the larger home has between 14.2% and 21.7% higher odds of being premium quality.
Why this interval doesn’t include 1.0:
The entire confidence interval is above 1.0 (no effect), which confirms that living area has a statistically significant positive effect on premium quality. If the CI included 1.0, we couldn’t rule out the possibility of no effect.
Statistical vs. Practical significance:
The interval is narrow (from 1.142 to 1.217), meaning we have precise estimates thanks to our large sample size (n = 2929). Even the lower bound represents a substantial effect, so this is both statistically and practically significant.
# Add predictions to dataset
ames_complete <- ames_complete |>
mutate(
Predicted_Prob = predict(model, type = "response"),
Predicted_Class = if_else(Predicted_Prob > 0.5, 1, 0)
)
# Confusion matrix
confusion <- table(Actual = ames_complete$Is_Premium_Quality,
Predicted = ames_complete$Predicted_Class)
cat("=== CONFUSION MATRIX ===\n")
## === CONFUSION MATRIX ===
print(confusion)
## Predicted
## Actual 0 1
## 0 2368 73
## 1 190 298
# Calculate accuracy metrics
accuracy <- sum(diag(confusion)) / sum(confusion)
sensitivity <- confusion[2,2] / sum(confusion[2,]) # True positive rate
specificity <- confusion[1,1] / sum(confusion[1,]) # True negative rate
precision <- confusion[2,2] / sum(confusion[,2]) # Positive predictive value
cat("\n=== MODEL PERFORMANCE ===\n")
##
## === MODEL PERFORMANCE ===
cat("Overall Accuracy:", percent(accuracy), "\n")
## Overall Accuracy: 91%
cat("Sensitivity (correctly ID premium):", percent(sensitivity), "\n")
## Sensitivity (correctly ID premium): 61%
cat("Specificity (correctly ID non-premium):", percent(specificity), "\n")
## Specificity (correctly ID non-premium): 97%
cat("Precision (of predicted premium, % actually premium):", percent(precision), "\n")
## Precision (of predicted premium, % actually premium): 80%
Interpreting the confusion matrix:
Model performance assessment:
The model achieves 91% overall accuracy, which is good but not perfect. However, accuracy can be misleading when classes are imbalanced (we have many more non-premium homes).
More informative metrics:
The model is much better at identifying non-premium homes (high specificity) than premium homes (moderate sensitivity). This makes sense premium quality depends on many subjective factors beyond just size, age, garage, and AC.
ggplot(ames_complete, aes(x = Gr.Liv.Area, y = Predicted_Prob,
color = factor(Is_Premium_Quality))) +
geom_point(alpha = 0.4, size = 2) +
geom_hline(yintercept = 0.5, linetype = "dashed", color = "black") +
scale_color_manual(values = c("0" = "#e74c3c", "1" = "#27ae60"),
labels = c("Not Premium (Actual)", "Premium (Actual)"),
name = "Actual Quality") +
scale_x_continuous(labels = comma_format()) +
scale_y_continuous(labels = percent_format()) +
labs(title = "Predicted Probabilities vs. Living Area",
subtitle = "Dashed line at 50% = classification threshold",
x = "Living Area (square feet)",
y = "Predicted Probability of Premium Quality") +
theme_minimal()
What this plot reveals:
There’s substantial overlap in the middle range (1,200-2,000 sq ft) where size alone doesn’t determine quality some small homes are premium (exceptional finishes, location) and some large homes aren’t (older, needs work).
Size matters most: Living area is the strongest predictor. A 500 sq ft larger home has 127.8% higher odds of being premium quality.
Newer is better: Each decade newer increases odds by 79.7%. Recent construction correlates with premium quality.
Garages add value: A 2-car garage (~400 sq ft) vs. a 1-car garage (~200 sq ft) increases odds by 193.9%.
AC is expected: Having central air gives -81.1% higher odds. It’s nearly required for premium status 99.8% of premium homes have it.
Non-linear effects: The probability curve is S-shaped. Size increases matter most in the middle range (1,500-2,000 sq ft).
For buyers:
For sellers:
For appraisers:
Missing predictors: Quality also depends on finishes, location, lot size, and subjective factors we haven’t captured
Correlation ≠ causation: Making a home bigger doesn’t automatically make it premium quality both might be caused by a third factor (builder quality, neighborhood standards)
Threshold is arbitrary: We chose 8+ as “premium” but 7.5 might be just as good. The cutoff affects our results.
Imbalanced classes: With only 17% premium homes, the model is biased toward predicting “not premium”
Does location matter? Would adding Neighborhood improve predictions? Some neighborhoods might have mostly premium homes regardless of size.
Are there interaction effects? Maybe garage size matters more for newer homes than older ones, or AC matters more for large homes.
What about quality 9-10 homes? If we redefined premium as only the top tier (9-10), would the same predictors matter?
Can we improve the model? What if we added Lot Area, Number of Bathrooms, or Fireplace Quality?
What’s the optimal threshold? Instead of 0.5, should we use 0.3 or 0.7 to balance sensitivity and specificity?
Do these patterns hold over time? Were the same factors important in 2006 vs. 2010?
Logistic regression allowed me to model a yes/no outcome (premium quality) and quantify how different features influence the probability of that outcome. The model revealed that size, age, garage area, and modern amenities all significantly predict premium quality, with living area being the dominant factor.
The key insight: A home’s path to premium quality is mostly determined by its size and when it was built. Features like garages and AC matter, but they’re secondary to getting the basics (square footage, construction era) right.
Model validation: With 91% accuracy and high specificity (97%), the model is reliable for identifying non-premium homes. However, moderate sensitivity (61%) means it misses some premium homes quality has subjective elements that numbers alone can’t fully capture.
This analysis demonstrates the power of logistic regression for understanding binary outcomes and provides actionable insights for real estate stakeholders navigating the Ames housing market.