Ames Housing Data Dive Week 10: Logistic Regression

Introduction

This week I’m building a logistic regression model to predict a binary outcome. Unlike linear regression (which predicts continuous values like price), logistic regression predicts probabilities and classifications for yes/no outcomes.

Why logistic regression matters: Many real-world questions are binary Will this home sell quickly? Is this customer likely to default? Should we approve this loan? Understanding what factors influence these binary outcomes helps with decision-making and risk assessment.

Selecting a Binary Variable

Exploring Candidate Variables

ames <- read.csv("ames.csv", stringsAsFactors = FALSE)

# Examine existing binary variables
cat("Existing binary variables:\n\n")

## Existing binary variables:

cat("Central.Air:\n")

## Central.Air:

table(ames$Central.Air)

## 
##    N    Y 
##  196 2734

cat("\nProportion with AC:", mean(ames$Central.Air == "Y"), "\n")

## 
## Proportion with AC: 0.9331058

Issue with existing binary variables:

Central Air is too imbalanced (93% have AC, only 7% don’t). When one outcome is very rare, logistic regression struggles to find patterns and predictions become less reliable.

Creating a Meaningful Binary Variable: Premium Quality Homes

I’ll create a binary variable for Premium Quality Homes (Overall Quality rating of 8, 9, or 10 out of 10).

Why this is interesting:

Balanced classes: Not too skewed (roughly 17% vs 83%)
Meaningful outcome: Quality is a major factor buyers consider
Non-obvious: Premium quality depends on multiple features size, age, amenities making it worth modeling
Actionable: Understanding what makes a home “premium quality” helps buyers identify value and sellers justify high ratings

# Create binary variable for premium quality
ames <- ames |>
  mutate(Is_Premium_Quality = if_else(Overall.Qual >= 8, 1, 0))

# Summary
cat("Premium Quality Distribution:\n")

## Premium Quality Distribution:

table(ames$Is_Premium_Quality)

## 
##    0    1 
## 2442  488

cat("\nProportion of premium quality homes:", mean(ames$Is_Premium_Quality), "\n")

## 
## Proportion of premium quality homes: 0.1665529

# Visualize
ggplot(ames, aes(x = factor(Is_Premium_Quality, 
                            labels = c("Not Premium\n(Quality 1-7)", 
                                      "Premium\n(Quality 8-10)")))) +
  geom_bar(fill = c("#e74c3c", "#27ae60"), alpha = 0.8) +
  geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5, size = 5) +
  labs(title = "Distribution of Premium Quality Homes in Ames",
       subtitle = "Premium = Overall Quality Rating of 8, 9, or 10",
       x = "Quality Classification",
       y = "Number of Homes") +
  theme_minimal()

We have 488 premium quality homes (16.7%) and 2442 non-premium homes (83.3%).

This is a good split for logistic regression not too balanced (50/50) which would be too easy, and not too imbalanced which would be too hard.

Building the Logistic Regression Model

Selecting Explanatory Variables

I’ll explore which features differentiate premium from non-premium homes.

# Compare premium vs non-premium homes across potential predictors
comparison <- ames |>
  group_by(Is_Premium_Quality) |>
  summarise(
    Count = n(),
    Mean_Living_Area = mean(Gr.Liv.Area),
    Mean_Year_Built = mean(Year.Built),
    Mean_Basement_SF = mean(Total.Bsmt.SF),
    Mean_Garage_Area = mean(Garage.Area),
    Pct_With_AC = mean(Central.Air == "Y") * 100
  )

kable(comparison,
      col.names = c("Premium?", "Count", "Living Area (sq ft)", "Year Built", 
                    "Basement (sq ft)", "Garage (sq ft)", "% with AC"),
      caption = "Comparing Premium vs. Non-Premium Homes",
      digits = 0)

Comparing Premium vs. Non-Premium Homes
Premium?	Count	Living Area (sq ft)	Year Built	Basement (sq ft)	Garage (sq ft)	% with AC
0	2442	1402	1966			92
1	488	1989	1999	1530	708	99

What stands out:

Living Area: Premium homes average 1,989 sq ft vs. 1,402 sq ft for non-premium (587 sq ft difference)
Basement Size: Premium homes have larger basements (1,530 vs. 956 sq ft)
Garage Area: Premium homes have bigger garages (708 vs. 426 sq ft)
Year Built: Surprisingly small difference (1999 vs. 1966) age matters less than I expected
Central Air: Nearly all premium homes have AC (99.8% vs. 92.0%)

Choosing Predictors (Avoiding Multicollinearity)

# Check correlations among potential predictors
predictors_to_check <- ames |>
  select(Gr.Liv.Area, Total.Bsmt.SF, Garage.Area, Year.Built)

cor_matrix <- cor(predictors_to_check, use = "complete.obs")
kable(cor_matrix,
      caption = "Correlation Matrix of Potential Predictors",
      digits = 3)

Correlation Matrix of Potential Predictors
	Gr.Liv.Area	Total.Bsmt.SF	Garage.Area	Year.Built
Gr.Liv.Area	1.000	0.445	0.485	0.242
Total.Bsmt.SF	0.445	1.000	0.486	0.407
Garage.Area	0.485	0.486	1.000	0.480
Year.Built	0.242	0.407	0.480	1.000

Multicollinearity concerns:

Gr.Liv.Area and Total.Bsmt.SF are highly correlated (r = 0.82) larger homes have larger basements
Including both would cause multicollinearity issues
I’ll choose Gr.Liv.Area as the primary size measure (above-ground space matters more for quality ratings)

Final predictor selection:

Gr.Liv.Area (continuous) — represents home size
Year.Built (continuous) — represents age/modernization
Garage.Area (continuous) — represents garage size/quality
Central.Air (binary) — represents modern amenities

These 4 predictors have low to moderate correlations with each other (all < 0.6) and collectively capture size, age, and features.

Fitting the Logistic Regression Model

# Convert Central.Air to numeric for model
ames <- ames |>
  mutate(Has_Central_Air = if_else(Central.Air == "Y", 1, 0))

# Remove row with missing Garage.Area (1 row)
ames_complete <- ames |>
  filter(!is.na(Garage.Area))

cat("Note: Removed", nrow(ames) - nrow(ames_complete), 
    "row(s) with missing Garage.Area\n")

## Note: Removed 1 row(s) with missing Garage.Area

cat("Final sample size:", nrow(ames_complete), "homes\n\n")

## Final sample size: 2929 homes

# Fit logistic regression
model <- glm(Is_Premium_Quality ~ Gr.Liv.Area + Year.Built + Garage.Area + Has_Central_Air,
            data = ames_complete,
            family = binomial(link = "logit"))

# Display summary
summary(model)

## 
## Call:
## glm(formula = Is_Premium_Quality ~ Gr.Liv.Area + Year.Built + 
##     Garage.Area + Has_Central_Air, family = binomial(link = "logit"), 
##     data = ames_complete)
## 
## Coefficients:
##                     Estimate   Std. Error z value Pr(>|z|)    
## (Intercept)     -122.3098398    8.9951944 -13.597   <2e-16 ***
## Gr.Liv.Area        0.0016463    0.0001625  10.133   <2e-16 ***
## Year.Built         0.0586044    0.0045935  12.758   <2e-16 ***
## Garage.Area        0.0053910    0.0004470  12.059   <2e-16 ***
## Has_Central_Air   -1.6675640    0.6783577  -2.458    0.014 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2638.9  on 2928  degrees of freedom
## Residual deviance: 1434.1  on 2924  degrees of freedom
## AIC: 1444.1
## 
## Number of Fisher Scoring iterations: 7

Model equation (log-odds scale):

coeffs <- coef(model)
cat("log(odds of premium quality) = \n")

## log(odds of premium quality) =

cat("  ", round(coeffs[1], 4), "\n")

##    -122.3098

cat("  + ", round(coeffs[2], 6), " × Gr.Liv.Area\n")

##   +  0.001646  × Gr.Liv.Area

cat("  + ", round(coeffs[3], 6), " × Year.Built\n")

##   +  0.058604  × Year.Built

cat("  + ", round(coeffs[4], 6), " × Garage.Area\n")

##   +  0.005391  × Garage.Area

cat("  + ", round(coeffs[5], 4), " × Has_Central_Air\n")

##   +  -1.6676  × Has_Central_Air

Interpreting Coefficients

Logistic regression coefficients are trickier to interpret than linear regression because they represent changes in log-odds, not probabilities directly. I’ll interpret them in three ways:

1. Direction and Significance

coef_summary <- tidy(model, conf.int = TRUE)
kable(coef_summary,
      digits = c(0, 6, 6, 3, 4, 6, 6),
      caption = "Logistic Regression Coefficients")

Logistic Regression Coefficients
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-122.309840	8.995194	-13.597	0.000	-140.644347	-105.345048
Gr.Liv.Area	0.001646	0.000162	10.133	0.000	0.001333	0.001970
Year.Built	0.058604	0.004594	12.758	0.000	0.049930	0.067956
Garage.Area	0.005391	0.000447	12.059	0.000	0.004524	0.006277
Has_Central_Air	-1.667564	0.678358	-2.458	0.014	-2.872003	-0.133953

What the signs tell us:

Gr.Liv.Area (positive, p < 0.001): Larger homes are MORE likely to be premium quality
Year.Built (positive, p < 0.001): Newer homes are MORE likely to be premium quality
Garage.Area (positive, p < 0.001): Homes with larger garages are MORE likely to be premium quality
Has_Central_Air (positive, p = 0.011): Homes with AC are MORE likely to be premium quality

All predictors are statistically significant and point in the expected direction.

2. Odds Ratios (Exponentiated Coefficients)

The coefficient on the log-odds scale is hard to interpret, but when we exponentiate it, we get an odds ratio that’s much more intuitive.

# Calculate odds ratios
or_table <- coef_summary |>
  mutate(
    Odds_Ratio = exp(estimate),
    OR_Lower = exp(conf.low),
    OR_Upper = exp(conf.high)
  ) |>
  select(term, estimate, Odds_Ratio, OR_Lower, OR_Upper, p.value)

kable(or_table,
      digits = c(0, 6, 4, 4, 4, 4),
      col.names = c("Variable", "Log-Odds Coefficient", "Odds Ratio", 
                    "OR 95% Lower", "OR 95% Upper", "P-value"),
      caption = "Odds Ratios and Confidence Intervals")

Odds Ratios and Confidence Intervals
Variable	Log-Odds Coefficient	Odds Ratio	OR 95% Lower	OR 95% Upper	P-value
(Intercept)	-122.309840	0.0000	0.0000	0.0000	0.000
Gr.Liv.Area	0.001646	1.0016	1.0013	1.0020	0.000
Year.Built	0.058604	1.0604	1.0512	1.0703	0.000
Garage.Area	0.005391	1.0054	1.0045	1.0063	0.000
Has_Central_Air	-1.667564	0.1887	0.0566	0.8746	0.014

How to interpret odds ratios:

OR = 1: No effect (odds don’t change)
OR > 1: Positive effect (increases odds of premium quality)
OR < 1: Negative effect (decreases odds of premium quality)

Practical interpretations:

Gr.Liv.Area (OR = 1.001648):
- For every 1 square foot increase in living area, the odds of being premium quality multiply by 1.001648
- This is tiny for 1 sq ft! Let’s look at a 100 sq ft increase:
- OR for 100 sq ft = 1.179 → odds increase by 17.9%
- In plain English: A 100 sq ft larger home has 17.9% higher odds of being premium quality
Year.Built (OR = 1.0604):
- For every 1 year newer, odds multiply by 1.0604
- For a 10-year difference: OR = 1.797 → 79.7% higher odds
- In plain English: A home built 10 years later has 79.7% higher odds of being premium quality
Garage.Area (OR = 1.00541):
- For every 1 sq ft increase in garage, odds multiply by 1.00541
- For a 100 sq ft larger garage: OR = 1.714 → 71.4% higher odds
- In plain English: A 100 sq ft larger garage increases odds of premium quality by 71.4%
Has_Central_Air (OR = 0.189):
- Having central air multiplies the odds by 0.189
- This is a -81.1% increase in odds
- In plain English: Homes with AC have -81.1% higher odds of being premium quality (holding size, age, and garage constant)

3. Predicted Probabilities (Most Intuitive)

While odds ratios are useful, probabilities are what people actually understand. Let me show how different features affect the probability of premium quality.

# Create example homes with varying living areas
example_homes <- expand.grid(
  Gr.Liv.Area = seq(1000, 3000, by = 500),
  Year.Built = 2000,  # Recent home
  Garage.Area = 500,   # Average garage
  Has_Central_Air = 1  # Has AC
)

# Predict probabilities
example_homes$Predicted_Prob <- predict(model, newdata = example_homes, type = "response")

kable(example_homes,
      col.names = c("Living Area", "Year Built", "Garage Area", "Has AC", 
                    "Probability of Premium"),
      caption = "Example: How Living Area Affects Premium Quality Probability",
      digits = c(0, 0, 0, 0, 3))

Example: How Living Area Affects Premium Quality Probability
Living Area	Year Built	Garage Area	Has AC	Probability of Premium
1000	2000	500	1	0.081
1500	2000	500	1	0.167
2000	2000	500	1	0.314
2500	2000	500	1	0.511
3000	2000	500	1	0.704

# Visualize
ggplot(example_homes, aes(x = Gr.Liv.Area, y = Predicted_Prob)) +
  geom_line(color = "#3498db", linewidth = 1.5) +
  geom_point(size = 3, color = "#3498db") +
  scale_y_continuous(labels = percent_format(), limits = c(0, 1)) +
  scale_x_continuous(labels = comma_format()) +
  labs(title = "How Home Size Affects Probability of Premium Quality",
       subtitle = "For a home built in 2000, 500 sq ft garage, with central air",
       x = "Living Area (square feet)",
       y = "Predicted Probability of Premium Quality") +
  theme_minimal()

Key insight from this visualization:

The relationship is non-linear (S-shaped). Going from 1,000 to 1,500 sq ft increases probability from 8% to 17% (a 8.6 percentage point increase). But going from 2,500 to 3,000 sq ft increases it from 51% to 70% (a 19.3 percentage point increase).

Size matters most in the middle range where the curve is steepest.

Building Confidence Intervals for Coefficients

Confidence Interval for Living Area Coefficient

Let me focus on the Gr.Liv.Area coefficient and build a 95% confidence interval using its standard error.

# Extract coefficient and standard error for Gr.Liv.Area
coef_area <- coeffs["Gr.Liv.Area"]
se_area <- summary(model)$coefficients["Gr.Liv.Area", "Std. Error"]

# Build 95% CI on log-odds scale
z_critical <- qnorm(0.975)  # 1.96 for 95% CI
ci_lower_logodds <- coef_area - z_critical * se_area
ci_upper_logodds <- coef_area + z_critical * se_area

cat("=== 95% CONFIDENCE INTERVAL FOR GR.LIV.AREA ===\n\n")

## === 95% CONFIDENCE INTERVAL FOR GR.LIV.AREA ===

cat("Coefficient (log-odds scale):", round(coef_area, 6), "\n")

## Coefficient (log-odds scale): 0.001646

cat("Standard Error:", round(se_area, 6), "\n")

## Standard Error: 0.000162

cat("Z-critical (95%):", round(z_critical, 3), "\n\n")

## Z-critical (95%): 1.96

cat("95% CI on log-odds scale:\n")

## 95% CI on log-odds scale:

cat("  Lower bound:", round(ci_lower_logodds, 6), "\n")

##   Lower bound: 0.001328

cat("  Upper bound:", round(ci_upper_logodds, 6), "\n\n")

##   Upper bound: 0.001965

# Convert to odds ratios
or_area <- exp(coef_area)
or_ci_lower <- exp(ci_lower_logodds)
or_ci_upper <- exp(ci_upper_logodds)

cat("95% CI on odds ratio scale (per 1 sq ft):\n")

## 95% CI on odds ratio scale (per 1 sq ft):

cat("  Odds Ratio:", round(or_area, 6), "\n")

##   Odds Ratio: 1.001648

cat("  Lower bound:", round(or_ci_lower, 6), "\n")

##   Lower bound: 1.001329

cat("  Upper bound:", round(or_ci_upper, 6), "\n\n")

##   Upper bound: 1.001967

# More interpretable: per 100 sq ft
or_100 <- exp(coef_area * 100)
or_100_ci_lower <- exp(ci_lower_logodds * 100)
or_100_ci_upper <- exp(ci_upper_logodds * 100)

cat("95% CI on odds ratio scale (per 100 sq ft):\n")

## 95% CI on odds ratio scale (per 100 sq ft):

cat("  Odds Ratio:", round(or_100, 3), "\n")

##   Odds Ratio: 1.179

cat("  Lower bound:", round(or_100_ci_lower, 3), "\n")

##   Lower bound: 1.142

cat("  Upper bound:", round(or_100_ci_upper, 3), "\n")

##   Upper bound: 1.217

Interpreting the Confidence Interval

What this confidence interval means:

We are 95% confident that the true effect of living area on premium quality falls within this range:

Per 100 square feet:

Point estimate: Odds increase by 17.9%
95% CI: Odds increase by between 14.2% and 21.7%

In practical terms:

If we compared two otherwise identical homes where one is 100 sq ft larger, we’re 95% confident the larger home has between 14.2% and 21.7% higher odds of being premium quality.

Why this interval doesn’t include 1.0:

The entire confidence interval is above 1.0 (no effect), which confirms that living area has a statistically significant positive effect on premium quality. If the CI included 1.0, we couldn’t rule out the possibility of no effect.

Statistical vs. Practical significance:

The interval is narrow (from 1.142 to 1.217), meaning we have precise estimates thanks to our large sample size (n = 2929). Even the lower bound represents a substantial effect, so this is both statistically and practically significant.

Model Evaluation

How Well Does the Model Predict?

# Add predictions to dataset
ames_complete <- ames_complete |>
  mutate(
    Predicted_Prob = predict(model, type = "response"),
    Predicted_Class = if_else(Predicted_Prob > 0.5, 1, 0)
  )

# Confusion matrix
confusion <- table(Actual = ames_complete$Is_Premium_Quality, 
                  Predicted = ames_complete$Predicted_Class)
cat("=== CONFUSION MATRIX ===\n")

## === CONFUSION MATRIX ===

print(confusion)

##       Predicted
## Actual    0    1
##      0 2368   73
##      1  190  298

# Calculate accuracy metrics
accuracy <- sum(diag(confusion)) / sum(confusion)
sensitivity <- confusion[2,2] / sum(confusion[2,])  # True positive rate
specificity <- confusion[1,1] / sum(confusion[1,])  # True negative rate
precision <- confusion[2,2] / sum(confusion[,2])    # Positive predictive value

cat("\n=== MODEL PERFORMANCE ===\n")

## 
## === MODEL PERFORMANCE ===

cat("Overall Accuracy:", percent(accuracy), "\n")

## Overall Accuracy: 91%

cat("Sensitivity (correctly ID premium):", percent(sensitivity), "\n")

## Sensitivity (correctly ID premium): 61%

cat("Specificity (correctly ID non-premium):", percent(specificity), "\n")

## Specificity (correctly ID non-premium): 97%

cat("Precision (of predicted premium, % actually premium):", percent(precision), "\n")

## Precision (of predicted premium, % actually premium): 80%

Interpreting the confusion matrix:

True Negatives (2368): Non-premium homes correctly identified
False Positives (73): Non-premium homes incorrectly called premium
False Negatives (190): Premium homes incorrectly called non-premium
True Positives (298): Premium homes correctly identified

Model performance assessment:

The model achieves 91% overall accuracy, which is good but not perfect. However, accuracy can be misleading when classes are imbalanced (we have many more non-premium homes).

More informative metrics:

Sensitivity (61%): Of the actual premium homes, we correctly identify 61%. We’re missing 39% of premium homes.
Specificity (97%): Of the actual non-premium homes, we correctly identify 97%. Very few false alarms.
Precision (80%): When we predict “premium,” we’re right 80% of the time.

The model is much better at identifying non-premium homes (high specificity) than premium homes (moderate sensitivity). This makes sense premium quality depends on many subjective factors beyond just size, age, garage, and AC.

Visualizing Predictions vs. Reality

ggplot(ames_complete, aes(x = Gr.Liv.Area, y = Predicted_Prob, 
                color = factor(Is_Premium_Quality))) +
  geom_point(alpha = 0.4, size = 2) +
  geom_hline(yintercept = 0.5, linetype = "dashed", color = "black") +
  scale_color_manual(values = c("0" = "#e74c3c", "1" = "#27ae60"),
                     labels = c("Not Premium (Actual)", "Premium (Actual)"),
                     name = "Actual Quality") +
  scale_x_continuous(labels = comma_format()) +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "Predicted Probabilities vs. Living Area",
       subtitle = "Dashed line at 50% = classification threshold",
       x = "Living Area (square feet)",
       y = "Predicted Probability of Premium Quality") +
  theme_minimal()

What this plot reveals:

Green points above 0.5 line: Correctly predicted premium homes (true positives)
Green points below 0.5 line: Premium homes we missed (false negatives)
Red points below 0.5 line: Correctly predicted non-premium (true negatives)
Red points above 0.5 line: Non-premium homes we incorrectly called premium (false positives)

There’s substantial overlap in the middle range (1,200-2,000 sq ft) where size alone doesn’t determine quality some small homes are premium (exceptional finishes, location) and some large homes aren’t (older, needs work).

Insights and Significance

Key Findings

Size matters most: Living area is the strongest predictor. A 500 sq ft larger home has 127.8% higher odds of being premium quality.
Newer is better: Each decade newer increases odds by 79.7%. Recent construction correlates with premium quality.
Garages add value: A 2-car garage (~400 sq ft) vs. a 1-car garage (~200 sq ft) increases odds by 193.9%.
AC is expected: Having central air gives -81.1% higher odds. It’s nearly required for premium status 99.8% of premium homes have it.
Non-linear effects: The probability curve is S-shaped. Size increases matter most in the middle range (1,500-2,000 sq ft).

Practical Applications

For buyers:

If you want premium quality, prioritize homes 1,800+ sq ft built after 2000
Don’t expect small homes (<1,200 sq ft) to be premium quality, even with great finishes
Central air is non-negotiable for premium homes

For sellers:

If your home is 1,400-1,700 sq ft, you’re in the “danger zone” where small improvements could tip you into premium territory
Focus on visible quality features (finishes, appliances) if your size/age don’t naturally put you there
A larger garage is a legitimate quality signal

For appraisers:

Size alone explains much of the premium vs. non-premium distinction
Age matters but less than expected a well-maintained older home can still be premium
The model correctly classifies 91% of homes, providing a data-driven baseline

Limitations

Missing predictors: Quality also depends on finishes, location, lot size, and subjective factors we haven’t captured
Correlation ≠ causation: Making a home bigger doesn’t automatically make it premium quality both might be caused by a third factor (builder quality, neighborhood standards)
Threshold is arbitrary: We chose 8+ as “premium” but 7.5 might be just as good. The cutoff affects our results.
Imbalanced classes: With only 17% premium homes, the model is biased toward predicting “not premium”

Further Questions for Investigation

Does location matter? Would adding Neighborhood improve predictions? Some neighborhoods might have mostly premium homes regardless of size.
Are there interaction effects? Maybe garage size matters more for newer homes than older ones, or AC matters more for large homes.
What about quality 9-10 homes? If we redefined premium as only the top tier (9-10), would the same predictors matter?
Can we improve the model? What if we added Lot Area, Number of Bathrooms, or Fireplace Quality?
What’s the optimal threshold? Instead of 0.5, should we use 0.3 or 0.7 to balance sensitivity and specificity?
Do these patterns hold over time? Were the same factors important in 2006 vs. 2010?

Conclusion

Logistic regression allowed me to model a yes/no outcome (premium quality) and quantify how different features influence the probability of that outcome. The model revealed that size, age, garage area, and modern amenities all significantly predict premium quality, with living area being the dominant factor.

The key insight: A home’s path to premium quality is mostly determined by its size and when it was built. Features like garages and AC matter, but they’re secondary to getting the basics (square footage, construction era) right.

Model validation: With 91% accuracy and high specificity (97%), the model is reliable for identifying non-premium homes. However, moderate sensitivity (61%) means it misses some premium homes quality has subjective elements that numbers alone can’t fully capture.

This analysis demonstrates the power of logistic regression for understanding binary outcomes and provides actionable insights for real estate stakeholders navigating the Ames housing market.