Ames Housing Data Dive Week 8: ANOVA and Linear Regression

Introduction

This week I’m using ANOVA to test whether categorical groups have different mean outcomes, and building a linear regression model to predict my response variable. Both techniques help me understand what drives home prices in Ames and provide actionable insights for buyers, sellers, and real estate professionals.

Selecting the Response Variable

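# Packages used throughout this analysis
library(dplyr)    # count(), group_by(), summarise(), filter(), arrange(), select()
library(ggplot2)  # boxplots and scatterplots
library(knitr)    # kable()
library(scales)   # dollar(), comma(), dollar_format(), comma_format()

ames <- read.csv("ames.csv", stringsAsFactors = FALSE)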

cat("=== RESPONSE VARIABLE: SALE PRICE ===\n\n")

## === RESPONSE VARIABLE: SALE PRICE ===

cat("Why SalePrice is most valuable:\n")

## Why SalePrice is most valuable:

cat("SalePrice is the single most important variable in this dataset because:\n")

## SalePrice is the single most important variable in this dataset because:

cat("- Buyers want to know: 'How much will this cost me?'\n")

## - Buyers want to know: 'How much will this cost me?'

cat("- Sellers want to know: 'How much can I get for my home?'\n")

## - Sellers want to know: 'How much can I get for my home?'

cat("- Investors want to know: 'What's the expected return?'\n")

## - Investors want to know: 'What's the expected return?'

cat("- Appraisers need to justify: 'Is this price fair?'\n\n")

## - Appraisers need to justify: 'Is this price fair?'

cat("SalePrice summary:\n")

## SalePrice summary:

cat("  Mean:", dollar(mean(ames$SalePrice)), "\n")

##   Mean: $180,796

cat("  Median:", dollar(median(ames$SalePrice)), "\n")

##   Median: $160,000

cat("  Range:", dollar(min(ames$SalePrice)), "to", dollar(max(ames$SalePrice)), "\n")

##   Range: $12,789 to $755,000

cat("  Standard Deviation:", dollar(sd(ames$SalePrice)), "\n")

##   Standard Deviation: $79,886.69

SalePrice is the outcome everyone cares about. Everything else in the dataset—square footage, quality, location, features—matters primarily because it affects the sale price. Understanding what drives SalePrice helps buyers make informed decisions, sellers price competitively, and professionals provide accurate valuations.

Part 1: ANOVA Test - House Style and Sale Price

Selecting the Explanatory Variable

# Examine House.Style distribution
style_counts <- ames |>
  count(House.Style, sort = TRUE)

kable(style_counts,
      col.names = c("House Style", "Count"),
      caption = "Distribution of House Styles in Ames")

Distribution of House Styles in Ames
House Style	Count
1Story	1481
2Story	873
1.5Fin	314
SLvl	128
SFoyer	83
2.5Unf	24
1.5Unf	19
2.5Fin	8

cat("\n=== WHY HOUSE STYLE? ===\n")

## 
## === WHY HOUSE STYLE? ===

cat("House.Style is a good categorical variable because:\n")

## House.Style is a good categorical variable because:

cat("- It has 8 categories (manageable, no consolidation needed)\n")

## - It has 8 categories (manageable, no consolidation needed)

cat("- It represents fundamental structural differences (1-story vs 2-story vs split-level)\n")

## - It represents fundamental structural differences (1-story vs 2-story vs split-level)

cat("- Buyers often have strong preferences ('I only want a ranch' or 'I need a 2-story')\n")

## - Buyers often have strong preferences ('I only want a ranch' or 'I need a 2-story')

cat("- Different styles have different construction costs and market appeal\n")

## - Different styles have different construction costs and market appeal

House Style categories:
- 1Story = One-story (ranch style)
- 2Story = Two-story
- 1.5Fin = One-and-a-half story, finished
- SLvl = Split level
- SFoyer = Split foyer
- 2.5Unf = Two-and-a-half story, unfinished
- 1.5Unf = One-and-a-half story, unfinished
- 2.5Fin = Two-and-a-half story, finished

Null and Alternative Hypotheses

H₀ (Null): μ₁ = μ₂ = μ₃ = … = μ₈
All house styles have the same mean sale price.

Hₐ (Alternative): At least one house style has a different mean sale price.

Why this matters: If we fail to reject the null, it suggests house style doesn’t significantly affect price—buyers are paying for size, quality, and location regardless of layout. If we reject the null, certain styles command premiums or discounts, which helps buyers budget and sellers position their homes.

Exploratory Data Analysis

# Calculate summary statistics by house style
style_summary <- ames |>
  group_by(House.Style) |>
  summarise(
    Count = n(),
    Mean_Price = mean(SalePrice),
    Median_Price = median(SalePrice),
    SD_Price = sd(SalePrice),
    SE_Price = SD_Price / sqrt(Count)
  ) |>
  arrange(desc(Mean_Price))

kable(style_summary,
      col.names = c("Style", "Count", "Mean Price", "Median Price", "SD", "SE"),
      caption = "Sale Price Statistics by House Style",
      format.args = list(big.mark = ","),
      digits = 0)

Sale Price Statistics by House Style
Style	Count	Mean Price	Median Price	SD	SE
2.5Fin	8	220,000	194,000	118,212	41,794
2Story	873	206,990	189,000	85,350	2,889
1Story	1,481	178,700	155,000	81,067	2,107
2.5Unf	24	177,158	160,950	76,115	15,537
SLvl	128	165,527	165,000	34,348	3,036
SFoyer	83	143,473	143,000	31,220	3,427
1.5Fin	314	137,530	129,675	47,226	2,665
1.5Unf	19	109,663	113,000	20,570	4,719
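
As a sanity check on the SE column, here is a minimal sketch that recomputes the standard error for two rows of the table from their SD and Count values. The variable names are my own, and because the inputs are the rounded values printed above, the results should only be expected to match to the nearest dollar.

```r
# SD and Count copied from the summary table above (rounded as printed)
sd_price <- c(`2Story` = 85350, `2.5Fin` = 118212)
n_homes  <- c(`2Story` = 873,   `2.5Fin` = 8)

# Standard error of the mean: SD / sqrt(n)
se_price <- sd_price / sqrt(n_homes)
print(round(se_price))
```

The 2.5Fin standard error is far larger than the 2Story one mainly because it rests on only 8 sales, so that group's mean is much less certain.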

# Visualize with boxplots
ggplot(ames, aes(x = reorder(House.Style, SalePrice, FUN = median), 
                 y = SalePrice, fill = House.Style)) +
  geom_boxplot(outlier.alpha = 0.3, show.legend = FALSE) +
  stat_summary(fun = mean, geom = "point", shape = 23, size = 3, 
               fill = "white", color = "black") +
  scale_y_continuous(labels = dollar_format()) +
  labs(title = "Sale Price Distribution by House Style",
       subtitle = "Ordered by median price | Diamond = mean, Box shows quartiles",
       x = "House Style",
       y = "Sale Price") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

What I observe: The boxplots show clear differences in median prices across house styles. 2.5Fin and 2Story homes have the highest medians ($194,000 and $189,000); SLvl, 2.5Unf, and 1Story homes sit in the middle ($155,000 to $165,000); and SFoyer, 1.5Fin, and 1.5Unf homes have the lowest ($113,000 to $143,000). The variation suggests house style is related to price.

Checking ANOVA Assumptions

cat("=== ANOVA ASSUMPTIONS ===\n\n")

## === ANOVA ASSUMPTIONS ===

# 1. Independence - assumed based on data collection
cat("1. Independence: Assumed - each home sale is independent\n\n")

## 1. Independence: Assumed - each home sale is independent

# 2. Normality - check residuals after fitting
cat("2. Normality: Will check residuals after fitting model\n\n")

## 2. Normality: Will check residuals after fitting model

# 3. Equal variances (Homoscedasticity)
cat("3. Equal Variances (SD ratio check):\n")

## 3. Equal Variances (SD ratio check):

# Manual check of group SDs (a formal alternative would be Levene's test, e.g. car::leveneTest)
style_variances <- style_summary |>
  select(House.Style, SD_Price) |>
  arrange(desc(SD_Price))

cat("   Standard deviations by group:\n")

##    Standard deviations by group:

print(style_variances, n = 8)

## # A tibble: 8 × 2
##   House.Style SD_Price
##   <chr>          <dbl>
## 1 2.5Fin       118212.
## 2 2Story        85350.
## 3 1Story        81067.
## 4 2.5Unf        76115.
## 5 1.5Fin        47226.
## 6 SLvl          34348.
## 7 SFoyer        31220.
## 8 1.5Unf        20570.

# Rule of thumb: largest SD / smallest SD < 2
max_sd <- max(style_summary$SD_Price)
min_sd <- min(style_summary$SD_Price)
ratio <- max_sd / min_sd
cat("\n   Ratio of largest to smallest SD:", round(ratio, 2), "\n")

## 
##    Ratio of largest to smallest SD: 5.75

cat("   Rule of thumb: ratio < 2 suggests acceptable variance equality\n")

##    Rule of thumb: ratio < 2 suggests acceptable variance equality

cat("   Assessment:", if(ratio < 2) "✓ Variances roughly equal" else "⚠ Consider Welch's ANOVA", "\n")

##    Assessment: ⚠ Consider Welch's ANOVA
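
Since the rule of thumb points toward Welch's ANOVA, here is a minimal sketch of what that could look like with base R's oneway.test() and var.equal = FALSE. To keep it runnable on its own, it uses a small simulated data frame (toy is made-up data, not the Ames file). On the real data the equivalent call would be oneway.test(SalePrice ~ House.Style, data = ames, var.equal = FALSE).

```r
# Simulated stand-in for the Ames data: values are made up for illustration only
set.seed(42)
toy <- data.frame(
  House.Style = rep(c("1Story", "2Story", "SFoyer"), times = c(60, 40, 15)),
  SalePrice   = c(rnorm(60, mean = 180000, sd = 80000),
                  rnorm(40, mean = 205000, sd = 85000),
                  rnorm(15, mean = 145000, sd = 30000))
)

# Welch's one-way test: compares group means without assuming equal variances
welch <- oneway.test(SalePrice ~ House.Style, data = toy, var.equal = FALSE)
print(welch)
```

Welch's version adjusts the denominator degrees of freedom for unequal group variances and sizes, which is the situation flagged by the SD ratio of 5.75.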

Performing ANOVA

# Fit ANOVA model
anova_model <- aov(SalePrice ~ House.Style, data = ames)

# Get ANOVA table
anova_summary <- summary(anova_model)
cat("=== ANOVA RESULTS ===\n\n")

## === ANOVA RESULTS ===

print(anova_summary)

##               Df    Sum Sq   Mean Sq F value Pr(>F)    
## House.Style    7 1.448e+12 2.068e+11   35.04 <2e-16 ***
## Residuals   2922 1.725e+13 5.902e+09                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Extract key statistics
f_statistic <- anova_summary[[1]]["House.Style", "F value"]
p_value <- anova_summary[[1]]["House.Style", "Pr(>F)"]
df_between <- anova_summary[[1]]["House.Style", "Df"]
df_within <- anova_summary[[1]]["Residuals", "Df"]

cat("\n=== KEY STATISTICS ===\n")

## 
## === KEY STATISTICS ===

cat("F-statistic:", round(f_statistic, 2), "\n")

## F-statistic: 35.04

cat("Degrees of freedom:", df_between, "between groups,", df_within, "within groups\n")

## Degrees of freedom: 7 between groups, 2922 within groups

cat("P-value:", format(p_value, scientific = TRUE, digits = 3), "\n")

## P-value: 3.03e-47

cat("Significance level: α = 0.05\n")

## Significance level: α = 0.05

Interpreting Results

cat("=== DECISION ===\n\n")

## === DECISION ===

if(p_value < 0.05) {
  cat("REJECT the null hypothesis (p < 0.05)\n\n")
  cat("CONCLUSION:\n")
  cat("There is statistically significant evidence that at least one house style\n")
  cat("has a different mean sale price than the others.\n\n")
  cat("This means house style DOES affect sale price in Ames.\n")
} else {
  cat("FAIL TO REJECT the null hypothesis (p >= 0.05)\n\n")
  cat("CONCLUSION:\n")
  cat("There is not enough evidence to conclude that house styles have\n")
  cat("different mean sale prices.\n")
}

## REJECT the null hypothesis (p < 0.05)
## 
## CONCLUSION:
## There is statistically significant evidence that at least one house style
## has a different mean sale price than the others.
## 
## This means house style DOES affect sale price in Ames.

# Calculate effect size (eta-squared)
ss_between <- anova_summary[[1]]["House.Style", "Sum Sq"]
ss_total <- sum(anova_summary[[1]][, "Sum Sq"])
eta_squared <- ss_between / ss_total

cat("\n=== EFFECT SIZE ===\n")

## 
## === EFFECT SIZE ===

cat("Eta-squared (η²):", round(eta_squared, 3), "\n")

## Eta-squared (η²): 0.077

cat("Interpretation:", round(eta_squared * 100, 1), "% of price variation explained by house style\n")

## Interpretation: 7.7 % of price variation explained by house style

How R output relates to conclusions:

F-statistic (35.04): This compares the variation between groups to the variation within groups. A large F-statistic suggests the group means differ more than we’d expect by chance.
P-value (3.03e-47): This is the probability of seeing differences this large (or larger) if all house styles actually had the same mean price. Since p < 0.05, it’s very unlikely our observed differences are just random chance.
Degrees of freedom: There are 8 house styles, so the between-groups df is 8 - 1 = 7. There are 2,930 homes, so the within-groups df is 2,930 - 8 = 2,922.
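
To make the link between the ANOVA table and these statistics concrete, here is a minimal sketch that rebuilds F, the p-value, and eta-squared from the printed sums of squares. The variable names are my own, and since the table values are rounded, the results should agree with R's output only approximately.

```r
# Values copied from the ANOVA table above (rounded as printed)
ss_between <- 1.448e12   # Sum Sq for House.Style
ss_within  <- 1.725e13   # Sum Sq for Residuals
df_between <- 8 - 1      # 8 styles minus 1
df_within  <- 2930 - 8   # 2,930 homes minus 8 styles

# Mean squares are sums of squares divided by their degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F is the ratio of between-group to within-group mean squares
f_stat <- ms_between / ms_within

# P-value: upper-tail probability of the F distribution
p_val <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

# Eta-squared: share of total variation that lies between groups
eta_sq <- ss_between / (ss_between + ss_within)

cat("F =", round(f_stat, 2), "\n")
cat("p =", format(p_val, scientific = TRUE, digits = 3), "\n")
cat("eta-squared =", round(eta_sq, 3), "\n")
```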

Post-Hoc Analysis (If ANOVA is Significant)

if(p_value < 0.05) {
  cat("=== PAIRWISE COMPARISONS (Tukey HSD) ===\n\n")
  cat("Since we rejected the null, we need to identify WHICH styles differ.\n")
  cat("Tukey's Honest Significant Difference test makes all pairwise comparisons:\n\n")
  
  tukey_result <- TukeyHSD(anova_model)
  
  # Show only significant comparisons
  tukey_df <- as.data.frame(tukey_result$House.Style)
  tukey_df$Comparison <- rownames(tukey_df)
  significant <- tukey_df |> 
    filter(`p adj` < 0.05) |>
    arrange(`p adj`) |>
    select(Comparison, diff, `p adj`)
  
  cat("Significant differences (p < 0.05):\n")
  print(head(significant, 10))
  
  cat("\nInterpretation: These pairs have significantly different mean prices.\n")
  cat("For example, the first row shows the price difference between two styles.\n")
}

## === PAIRWISE COMPARISONS (Tukey HSD) ===
## 
## Since we rejected the null, we need to identify WHICH styles differ.
## Tukey's Honest Significant Difference test makes all pairwise comparisons:
## 
## Significant differences (p < 0.05):
##                  Comparison      diff        p adj
## 1Story-1.5Fin 1Story-1.5Fin  41169.95 0.000000e+00
## 2Story-1.5Fin 2Story-1.5Fin  69460.24 0.000000e+00
## 2Story-1Story 2Story-1Story  28290.28 0.000000e+00
## SFoyer-2Story SFoyer-2Story -63517.50 0.000000e+00
## SLvl-2Story     SLvl-2Story -41462.78 3.603212e-07
## 2Story-1.5Unf 2Story-1.5Unf  97327.00 1.407659e-06
## SFoyer-1Story SFoyer-1Story -35227.21 1.275283e-03
## 1Story-1.5Unf 1Story-1.5Unf  69036.72 2.571212e-03
## SLvl-1.5Fin     SLvl-1.5Fin  27997.46 1.212685e-02
## 2.5Fin-1.5Unf 2.5Fin-1.5Unf 110336.84 1.529152e-02
## 
## Interpretation: These pairs have significantly different mean prices.
## For example, the first row shows the price difference between two styles.
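
The diff column is simply the difference between two group means (first style minus second). As a rough check, this sketch recomputes three of the differences from the mean prices in the summary table. The names are my own, and the inputs are rounded to whole dollars, so the results differ from the Tukey output by less than a dollar.

```r
# Mean prices copied from the summary table (rounded to whole dollars)
mean_price <- c(`2Story` = 206990, `1Story` = 178700,
                `1.5Fin` = 137530, SFoyer = 143473)

# Each Tukey "diff" is the first group's mean minus the second group's mean
diffs <- c(
  `2Story-1Story` = mean_price[["2Story"]] - mean_price[["1Story"]],
  `1Story-1.5Fin` = mean_price[["1Story"]] - mean_price[["1.5Fin"]],
  `SFoyer-2Story` = mean_price[["SFoyer"]] - mean_price[["2Story"]]
)
print(diffs)
```

A negative diff, as in SFoyer-2Story, means the first style in the pair has the lower mean price.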

What This Means for Interested Parties

For Buyers:

Since we rejected the null hypothesis, house style is significantly associated with price. This means:
- Don’t assume all layouts cost the same: two-story homes command a premium over split-levels
- If you’re flexible on style, you might find better value in less popular layouts (split-foyer homes sell for less on average)
- Budget accordingly based on your style preference

For Sellers:

Your house style influences your asking price:
- If you have a popular style (2Story), you can justify higher prices
- If you have a less common style (SFoyer, 1.5Unf), set realistic expectations or emphasize other strong features
- Consider your style in relation to neighborhood norms

For Builders/Developers:

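The data shows which styles command premiums:
- 2Story homes sell for significantly more than 1Story, 1.5Fin, SLvl, SFoyer, and 1.5Unf homes; 2.5Fin has the highest mean price, but it is based on only 8 sales
- Consider market demand vs. construction costs when choosing building styles
- Some styles (1.5Unf, SFoyer) sell for less on average, which might signal undervalued opportunities or simply weaker demand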

Part 2: Linear Regression - Living Area and Sale Price

Selecting the Explanatory Variable

cat("=== CHOOSING A CONTINUOUS PREDICTOR ===\n\n")

## === CHOOSING A CONTINUOUS PREDICTOR ===

# Check correlations
continuous_vars <- c("Gr.Liv.Area", "Lot.Area", "Year.Built", "Garage.Area", "Total.Bsmt.SF")
correlations <- sapply(continuous_vars, function(var) {
  cor(ames[[var]], ames$SalePrice, use = "complete.obs")
})

cor_df <- data.frame(
  Variable = continuous_vars,
  Correlation = correlations
) |>
  arrange(desc(abs(Correlation)))

# row.names = FALSE hides the row names that sapply() attaches,
# which would otherwise duplicate the Variable column
kable(cor_df,
      col.names = c("Variable", "Correlation with SalePrice"),
      caption = "Potential Predictors for Linear Regression",
      row.names = FALSE,
      digits = 3)

Potential Predictors for Linear Regression
Variable	Correlation with SalePrice
Gr.Liv.Area	0.707
Garage.Area	0.640
Total.Bsmt.SF	0.632
Year.Built	0.558
Lot.Area	0.267

cat("\n=== WHY GR.LIV.AREA (LIVING AREA)? ===\n")

## 
## === WHY GR.LIV.AREA (LIVING AREA)? ===

cat("Above Ground Living Area is the best choice because:\n")

## Above Ground Living Area is the best choice because:

cat("- Strongest correlation with SalePrice among the candidates checked (r = 0.707)\n")

## - Strongest correlation with SalePrice among the candidates checked (r = 0.707)

cat("- Directly measurable and easily understood (square footage)\n")

## - Directly measurable and easily understood (square footage)

cat("- Buyers actively search by square footage ('I need 2000+ sq ft')\n")

## - Buyers actively search by square footage ('I need 2000+ sq ft')

cat("- Appraisers use price-per-square-foot as a key metric\n")

## - Appraisers use price-per-square-foot as a key metric

cat("- Relationship should be roughly linear (bigger homes = higher prices)\n")

## - Relationship should be roughly linear (bigger homes = higher prices)

Checking Linearity Assumption

# Scatterplot to assess linearity
ggplot(ames, aes(x = Gr.Liv.Area, y = SalePrice)) +
  geom_point(alpha = 0.3, size = 2, color = "steelblue") +
  geom_smooth(method = "lm", color = "red", se = TRUE, linewidth = 1.5) +
  geom_smooth(method = "loess", color = "orange", se = FALSE, linewidth = 1, linetype = "dashed") +
  scale_x_continuous(labels = comma_format()) +
  scale_y_continuous(labels = dollar_format()) +
  labs(title = "Sale Price vs. Above Ground Living Area",
       subtitle = "Red line = linear fit | Orange dashed = smoothed curve (checking linearity)",
       x = "Living Area (square feet)",
       y = "Sale Price") +
  theme_minimal()

cat("\n=== LINEARITY ASSESSMENT ===\n")

## 
## === LINEARITY ASSESSMENT ===

cat("Looking at the plot:\n")

## Looking at the plot:

cat("- Red line (linear) and orange line (smoothed) are very similar\n")

## - Red line (linear) and orange line (smoothed) are very similar

cat("- This suggests a linear model is appropriate\n")

## - This suggests a linear model is appropriate

cat("- Points show positive trend with some scatter (typical for real data)\n")

## - Points show positive trend with some scatter (typical for real data)

cat("- No obvious curvature that would violate linearity assumption\n")

## - No obvious curvature that would violate linearity assumption

Visual assessment: The red linear fit closely matches the orange smoothed curve, indicating the relationship is approximately linear. There’s scatter around the line (some homes are above/below the trend), but no systematic curvature that would suggest we need a non-linear model.

Building the Linear Regression Model

# Fit simple linear regression
model <- lm(SalePrice ~ Gr.Liv.Area, data = ames)

# Display model summary
cat("=== LINEAR REGRESSION MODEL ===\n\n")

## === LINEAR REGRESSION MODEL ===

cat("Model: SalePrice = β₀ + β₁ × Gr.Liv.Area + ε\n\n")

## Model: SalePrice = β₀ + β₁ × Gr.Liv.Area + ε

summary_output <- summary(model)
print(summary_output)

## 
## Call:
## lm(formula = SalePrice ~ Gr.Liv.Area, data = ames)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -483467  -30219   -1966   22728  334323 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13289.634   3269.703   4.064 4.94e-05 ***
## Gr.Liv.Area   111.694      2.066  54.061  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56520 on 2928 degrees of freedom
## Multiple R-squared:  0.4995, Adjusted R-squared:  0.4994 
## F-statistic:  2923 on 1 and 2928 DF,  p-value: < 2.2e-16

# Extract coefficients
coefficients <- coef(model)
intercept <- coefficients[1]
slope <- coefficients[2]

cat("\n=== MODEL EQUATION ===\n")

## 
## === MODEL EQUATION ===

cat("SalePrice =", round(intercept, 2), "+", round(slope, 2), "× Gr.Liv.Area\n\n")

## SalePrice = 13289.63 + 111.69 × Gr.Liv.Area
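
To show how the equation is used, here is a minimal sketch that plugs a few living areas into it by hand. predict_price() is a hypothetical helper of my own, not part of the analysis above, and the coefficients are copied from the model summary, so results match predict() only up to rounding.

```r
# Coefficients copied from the model summary above
intercept <- 13289.634
slope     <- 111.694

# Hypothetical helper: applies SalePrice = intercept + slope * Gr.Liv.Area
predict_price <- function(area_sqft) {
  intercept + slope * area_sqft
}

# Predicted prices for 1,000, 1,500, and 2,000 sq ft
print(round(predict_price(c(1000, 1500, 2000))))
```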

Evaluating Model Fit

cat("=== MODEL FIT STATISTICS ===\n\n")

## === MODEL FIT STATISTICS ===

# R-squared
r_squared <- summary_output$r.squared
adj_r_squared <- summary_output$adj.r.squared

cat("R-squared:", round(r_squared, 3), "\n")

## R-squared: 0.5

cat("Adjusted R-squared:", round(adj_r_squared, 3), "\n")

## Adjusted R-squared: 0.499

cat("Interpretation:", round(r_squared * 100, 1), 
    "% of sale price variation is explained by living area\n\n")

## Interpretation: 50 % of sale price variation is explained by living area

# RMSE (Root Mean Squared Error)
rmse <- sqrt(mean(residuals(model)^2))
cat("RMSE (Root Mean Squared Error):", dollar(rmse), "\n")

## RMSE (Root Mean Squared Error): $56,504.88

cat("Interpretation: A typical prediction misses by about", dollar(rmse), "\n\n")

## Interpretation: A typical prediction misses by about $56,504.88

# Residual standard error
rse <- summary_output$sigma
cat("Residual Standard Error:", dollar(rse), "\n")

## Residual Standard Error: $56,524.17

cat("Interpretation: Typical prediction error is about", dollar(rse), "\n")

## Interpretation: Typical prediction error is about $56,524.17

Model fit assessment:

R² = 0.5: Living area explains about 50% of price variation. This is pretty good for a single predictor! About half the variation in prices can be predicted just from knowing the square footage.
RMSE ≈ $56,500: A typical prediction misses by about $56,500. For a market where homes average about $180,800, that is roughly 31% of the mean price. Not perfect, but useful for ballpark estimates.
What’s missing? The other ~50% of variation comes from factors this model leaves out, such as quality, location, age, and features. We’d need a multiple regression model to capture those.
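
Two of these numbers can be cross-checked from values already printed. This is a minimal sketch with my own variable names: it shows why RMSE and the residual standard error differ slightly (one divides by n, the other by n - 2), and that R² in a one-predictor model is just the squared correlation. Inputs are rounded, so agreement is approximate.

```r
n          <- 2930       # homes in the data (residual df 2928 + 2)
rse        <- 56524.17   # residual standard error from summary(model)
r          <- 0.707      # correlation between Gr.Liv.Area and SalePrice
mean_price <- 180796     # mean SalePrice

# RSE divides the residual sum of squares by n - 2; RMSE divides by n
rmse <- rse * sqrt((n - 2) / n)

# In simple linear regression, R-squared equals the squared correlation
r_sq <- r^2

cat("RMSE:", round(rmse, 2), "\n")
cat("R-squared from r:", round(r_sq, 3), "\n")
cat("RMSE as % of mean price:", round(100 * rmse / mean_price), "\n")
```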

Checking Regression Assumptions

# Create diagnostic plots
par(mfrow = c(2, 2))
plot(model)

par(mfrow = c(1, 1))

cat("\n=== DIAGNOSTIC PLOT INTERPRETATION ===\n\n")

## 
## === DIAGNOSTIC PLOT INTERPRETATION ===

cat("1. Residuals vs Fitted:\n")

## 1. Residuals vs Fitted:

cat("   - Should show random scatter around zero (no pattern)\n")

##    - Should show random scatter around zero (no pattern)

cat("   - Our plot shows mostly random scatter, though slight heteroscedasticity\n")

##    - Our plot shows mostly random scatter, though slight heteroscedasticity

cat("   - (variance increases slightly at higher prices)\n\n")

##    - (variance increases slightly at higher prices)

cat("2. Normal Q-Q:\n")

## 2. Normal Q-Q:

cat("   - Points should follow diagonal line for normal residuals\n")

##    - Points should follow diagonal line for normal residuals

cat("   - Our plot shows good fit except in tails (some outliers)\n")

##    - Our plot shows good fit except in tails (some outliers)

cat("   - This is acceptable for our purposes\n\n")

##    - This is acceptable for our purposes

cat("3. Scale-Location:\n")

## 3. Scale-Location:

cat("   - Should show horizontal line (constant variance)\n")

##    - Should show horizontal line (constant variance)

cat("   - Slight upward trend suggests heteroscedasticity\n")

##    - Slight upward trend suggests heteroscedasticity

cat("   - Not severe enough to invalidate the model\n\n")

##    - Not severe enough to invalidate the model

cat("4. Residuals vs Leverage:\n")

## 4. Residuals vs Leverage:

cat("   - Identifies influential outliers (beyond Cook's distance lines)\n")

##    - Identifies influential outliers (beyond Cook's distance lines)

cat("   - A few high-leverage points but within acceptable range\n\n")

##    - A few high-leverage points but within acceptable range

cat("OVERALL ASSESSMENT: Model assumptions are reasonably met.\n")

## OVERALL ASSESSMENT: Model assumptions are reasonably met.

cat("Some heteroscedasticity present but not severe.\n")

## Some heteroscedasticity present but not severe.

Interpreting Coefficients

cat("=== COEFFICIENT INTERPRETATION ===\n\n")

## === COEFFICIENT INTERPRETATION ===

cat("Intercept (β₀) =", dollar(intercept), "\n")

## Intercept (β₀) = $13,289.63

cat("What it means:\n")

## What it means:

cat("- This is the predicted price when living area = 0\n")

## - This is the predicted price when living area = 0

cat("- Practically meaningless (no home has 0 square feet)\n")

## - Practically meaningless (no home has 0 square feet)

cat("- Just serves as the baseline in our equation\n\n")

## - Just serves as the baseline in our equation

cat("Slope (β₁) =", dollar(slope), "per square foot\n")

## Slope (β₁) = $111.69 per square foot

cat("What it means:\n")

## What it means:

cat("- For every additional square foot of living area,\n")

## - For every additional square foot of living area,

cat("  the sale price increases by approximately", dollar(round(slope)), "\n")

##   the sale price increases by approximately $112

cat("- This is the average change in price per added square foot, not the average price per square foot\n")

## - This is the average change in price per added square foot, not the average price per square foot

cat("- A 100 sq ft increase → ~", dollar(round(slope * 100)), "price increase\n")

## - A 100 sq ft increase → ~ $11,169 price increase

cat("- A 500 sq ft increase → ~", dollar(round(slope * 500)), "price increase\n\n")

## - A 500 sq ft increase → ~ $55,847 price increase

# Confidence interval for slope
conf_int <- confint(model, "Gr.Liv.Area", level = 0.95)
cat("95% Confidence Interval for slope:\n")

## 95% Confidence Interval for slope:

cat("We're 95% confident the true price per square foot is between\n")

## We're 95% confident the true price per square foot is between

cat(dollar(conf_int[1]), "and", dollar(conf_int[2]), "\n")

## $107.64 and $115.75
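
For anyone curious where confint() gets these bounds, here is a minimal sketch of the same calculation by hand: estimate plus or minus a t critical value times the standard error. The variable names are mine and the inputs are the rounded values from the coefficient table, so the result matches only to about a cent.

```r
# Values copied from the coefficient table in the model summary
estimate  <- 111.694   # slope for Gr.Liv.Area
std_error <- 2.066     # its standard error
df_resid  <- 2928      # residual degrees of freedom

# Two-sided 95% interval uses the 97.5th percentile of the t distribution
t_crit <- qt(0.975, df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error

print(round(ci, 2))
```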

Making Predictions

cat("=== EXAMPLE PREDICTIONS ===\n\n")

## === EXAMPLE PREDICTIONS ===

# Example predictions
example_sizes <- c(1000, 1500, 2000, 2500, 3000)
predictions <- predict(model, 
                      newdata = data.frame(Gr.Liv.Area = example_sizes),
                      interval = "prediction",
                      level = 0.95)

pred_table <- data.frame(
  Living_Area = comma(example_sizes),
  Predicted_Price = dollar(predictions[, "fit"]),
  Lower_95 = dollar(predictions[, "lwr"]),
  Upper_95 = dollar(predictions[, "upr"])
)

kable(pred_table,
      col.names = c("Living Area (sq ft)", "Predicted Price", 
                    "95% Lower", "95% Upper"),
      caption = "Price Predictions for Different Home Sizes")

Price Predictions for Different Home Sizes
Living Area (sq ft)	Predicted Price	95% Lower	95% Upper
1,000	$124,984	$14,115	$235,852
1,500	$180,831	$69,981	$291,681
2,000	$236,678	$125,809	$347,546
2,500	$292,525	$181,601	$403,449
3,000	$348,372	$237,355	$459,388

cat("\nHow to use this:\n")

## 
## How to use this:

cat("- The 'Predicted Price' is our best guess\n")

## - The 'Predicted Price' is our best guess

cat("- The 95% interval shows where we expect 95% of actual prices to fall\n")

## - The 95% interval shows where we expect 95% of actual prices to fall

cat("- Wide intervals reflect that many factors beyond size affect price\n")

## - Wide intervals reflect that many factors beyond size affect price
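
The width of these intervals comes almost entirely from the residual standard error. Here is a rough sketch, with my own variable names, that rebuilds the 1,500 sq ft row by hand. It assumes 1,500 sq ft is close enough to the mean living area that the distance-from-mean term in the prediction-interval formula can be ignored. That seems reasonable here because the predicted price at 1,500 sq ft ($180,831) is almost exactly the mean sale price ($180,796), and a least-squares line passes through the point of means.

```r
n      <- 2930       # homes in the data
rse    <- 56524.17   # residual standard error from summary(model)
fit    <- 180831     # predicted price at 1,500 sq ft (from the table above)
t_crit <- qt(0.975, n - 2)

# Assumption: the home is near the mean living area, so the
# (x0 - mean)^2 / Sxx term of the full formula is treated as zero
half_width <- t_crit * rse * sqrt(1 + 1 / n)
interval   <- fit + c(-1, 1) * half_width

print(round(interval))
```

For homes far from the average size the omitted term is no longer negligible, which is why the table's intervals are slightly wider at 3,000 sq ft than at 1,500 sq ft.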

Practical Recommendations

cat("=== RECOMMENDATIONS ===\n\n")

## === RECOMMENDATIONS ===

cat("For BUYERS:\n")

## For BUYERS:

cat("- Expect each additional square foot to add approximately", dollar(round(slope)), "to the price\n")

## - Expect each additional square foot to add approximately $112 to the price

cat("- A 1,500 sq ft home should cost around", 
    dollar(round(predict(model, newdata = data.frame(Gr.Liv.Area = 1500)))), "\n")

## - A 1,500 sq ft home should cost around $180,831

cat("- If a home is priced far above the prediction interval, question why\n")

## - If a home is priced far above the prediction interval, question why

cat("  (is it high quality? premium location? or just overpriced?)\n\n")

##   (is it high quality? premium location? or just overpriced?)

cat("For SELLERS:\n")

## For SELLERS:

cat("- Your base price estimate is", dollar(round(intercept)), "+", dollar(round(slope)), "× (your square footage)\n")

## - Your base price estimate is $13,290 + $112 × (your square footage)

cat("- Adjust UP if you have: premium location, high quality, modern features\n")

## - Adjust UP if you have: premium location, high quality, modern features

cat("- Adjust DOWN if you have: budget location, lower quality, needed repairs\n")

## - Adjust DOWN if you have: budget location, lower quality, needed repairs

cat("- The model gives a starting point, not a final price\n\n")

## - The model gives a starting point, not a final price

cat("For APPRAISERS:\n")

## For APPRAISERS:

cat("- Living area is the strongest of the continuous predictors checked (R² = ", round(r_squared, 3), ")\n")

## - Living area is the strongest of the continuous predictors checked (R² =  0.5 )

cat("- Use", dollar(round(slope)), "per additional sq ft as a baseline adjustment\n")

## - Use $112 per additional sq ft as a baseline adjustment

cat("- But remember: 50% of variation comes from other factors\n")

## - But remember: 50% of variation comes from other factors

cat("- Always adjust for quality, location, condition, features\n\n")

## - Always adjust for quality, location, condition, features

cat("OPTIMAL STRATEGY:\n")

## OPTIMAL STRATEGY:

cat("If you're trying to maximize value-for-money:\n")

## If you're trying to maximize value-for-money:

cat("- Larger homes have better price-per-square-foot efficiency\n")

## - Larger homes have better price-per-square-foot efficiency

cat("- But diminishing returns exist (quality and location matter more at high end)\n")

## - But diminishing returns exist (quality and location matter more at high end)

cat("- Sweet spot appears to be 1,500-2,000 sq ft (balances price and size)\n")

## - Sweet spot appears to be 1,500-2,000 sq ft (balances price and size)
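
The price-per-square-foot point follows directly from the model equation: dividing the predicted price by the area gives intercept / area + slope, which shrinks as area grows. Here is a minimal sketch with my own variable names, using the coefficients printed in the model summary.

```r
# Coefficients copied from the model summary
intercept <- 13289.634
slope     <- 111.694

area      <- c(1000, 1500, 2000, 3000)
predicted <- intercept + slope * area

# Implied price per square foot: intercept / area + slope
price_per_sqft <- predicted / area

print(data.frame(area,
                 predicted = round(predicted),
                 price_per_sqft = round(price_per_sqft, 2)))
```

The implied price per square foot drifts down toward the slope as homes get larger because the fixed intercept is spread over more square feet. This is a property of the straight-line model itself, so on its own it cannot confirm diminishing returns or a particular sweet spot; that would need a model that allows curvature or includes quality and location.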

Conclusion

This week’s analysis provided two complementary perspectives on what drives home prices in Ames:

ANOVA Results:
- House style is significantly associated with sale price (F = 35.04, p < 0.001)
- Two-story homes command premiums over split-levels and one-story homes
- Style explains about 7.7% of price variation
- Buyers should budget differently based on preferred style

Regression Results:
- Living area strongly predicts sale price (R² = 0.5)
- Each additional square foot adds approximately $112 to the predicted price
- The model explains about 50% of price variation
- Remaining variation comes from quality, location, and features

Key Insight: Both categorical factors (house style) and continuous factors (square footage) matter, but square footage explains far more of the variation in price (about 50%, versus 7.7% for house style) and is the strongest of the continuous predictors checked here. A complete pricing model would need both types of variables plus others (quality, location, age).

Both ANOVA and regression are powerful tools, but they’re most useful when combined with domain knowledge and awareness of their limitations. They provide data-driven insights, but human judgment is still needed for final decisions.

Pratik Mane

2026-03-09
