This week I’m using ANOVA to test whether categorical groups have different mean outcomes, and building a linear regression model to predict my response variable. Both techniques help me understand what drives home prices in Ames and provide actionable insights for buyers, sellers, and real estate professionals.
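# Packages used throughout (the native pipe |> needs R 4.1 or later)
library(dplyr)    # count(), group_by(), summarise(), arrange(), filter(), select()
library(ggplot2)  # boxplots and scatterplots
library(knitr)    # kable()
library(scales)   # dollar(), comma(), dollar_format(), comma_format()
ames <- read.csv("ames.csv", stringsAsFactors = FALSE)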
cat("=== RESPONSE VARIABLE: SALE PRICE ===\n\n")
## === RESPONSE VARIABLE: SALE PRICE ===
cat("Why SalePrice is most valuable:\n")
## Why SalePrice is most valuable:
cat("SalePrice is the single most important variable in this dataset because:\n")
## SalePrice is the single most important variable in this dataset because:
cat("- Buyers want to know: 'How much will this cost me?'\n")
## - Buyers want to know: 'How much will this cost me?'
cat("- Sellers want to know: 'How much can I get for my home?'\n")
## - Sellers want to know: 'How much can I get for my home?'
cat("- Investors want to know: 'What's the expected return?'\n")
## - Investors want to know: 'What's the expected return?'
cat("- Appraisers need to justify: 'Is this price fair?'\n\n")
## - Appraisers need to justify: 'Is this price fair?'
cat("SalePrice summary:\n")
## SalePrice summary:
cat(" Mean:", dollar(mean(ames$SalePrice)), "\n")
## Mean: $180,796
cat(" Median:", dollar(median(ames$SalePrice)), "\n")
## Median: $160,000
cat(" Range:", dollar(min(ames$SalePrice)), "to", dollar(max(ames$SalePrice)), "\n")
## Range: $12,789 to $755,000
cat(" Standard Deviation:", dollar(sd(ames$SalePrice)), "\n")
## Standard Deviation: $79,886.69
SalePrice is the outcome everyone cares about. Everything else in the dataset—square footage, quality, location, features—matters primarily because it affects the sale price. Understanding what drives SalePrice helps buyers make informed decisions, sellers price competitively, and professionals provide accurate valuations.
# Examine House.Style distribution
style_counts <- ames |>
count(House.Style, sort = TRUE)
kable(style_counts,
col.names = c("House Style", "Count"),
caption = "Distribution of House Styles in Ames")
| House Style | Count |
|---|---|
| 1Story | 1481 |
| 2Story | 873 |
| 1.5Fin | 314 |
| SLvl | 128 |
| SFoyer | 83 |
| 2.5Unf | 24 |
| 1.5Unf | 19 |
| 2.5Fin | 8 |
cat("\n=== WHY HOUSE STYLE? ===\n")
##
## === WHY HOUSE STYLE? ===
cat("House.Style is a good categorical variable because:\n")
## House.Style is a good categorical variable because:
cat("- It has 8 categories (manageable, no consolidation needed)\n")
## - It has 8 categories (manageable, no consolidation needed)
cat("- It represents fundamental structural differences (1-story vs 2-story vs split-level)\n")
## - It represents fundamental structural differences (1-story vs 2-story vs split-level)
cat("- Buyers often have strong preferences ('I only want a ranch' or 'I need a 2-story')\n")
## - Buyers often have strong preferences ('I only want a ranch' or 'I need a 2-story')
cat("- Different styles have different construction costs and market appeal\n")
## - Different styles have different construction costs and market appeal
House Style categories:
- 1Story = One-story (ranch style)
- 2Story = Two-story
- 1.5Fin = One-and-a-half story, finished
- SLvl = Split level
- SFoyer = Split foyer
- 2.5Unf = Two-and-a-half story, unfinished
- 1.5Unf = One-and-a-half story, unfinished
- 2.5Fin = Two-and-a-half story, finished
H₀ (Null): μ₁ = μ₂ = μ₃ = … = μ₈
All house styles have the same mean sale price.
Hₐ (Alternative): At least one house style has a different mean sale price.
Why this matters: If we fail to reject the null, it suggests house style doesn’t significantly affect price—buyers are paying for size, quality, and location regardless of layout. If we reject the null, certain styles command premiums or discounts, which helps buyers budget and sellers position their homes.
# Calculate summary statistics by house style
style_summary <- ames |>
group_by(House.Style) |>
summarise(
Count = n(),
Mean_Price = mean(SalePrice),
Median_Price = median(SalePrice),
SD_Price = sd(SalePrice),
SE_Price = SD_Price / sqrt(Count)
) |>
arrange(desc(Mean_Price))
kable(style_summary,
col.names = c("Style", "Count", "Mean Price", "Median Price", "SD", "SE"),
caption = "Sale Price Statistics by House Style",
format.args = list(big.mark = ","),
digits = 0)
| Style | Count | Mean Price | Median Price | SD | SE |
|---|---|---|---|---|---|
| 2.5Fin | 8 | 220,000 | 194,000 | 118,212 | 41,794 |
| 2Story | 873 | 206,990 | 189,000 | 85,350 | 2,889 |
| 1Story | 1,481 | 178,700 | 155,000 | 81,067 | 2,107 |
| 2.5Unf | 24 | 177,158 | 160,950 | 76,115 | 15,537 |
| SLvl | 128 | 165,527 | 165,000 | 34,348 | 3,036 |
| SFoyer | 83 | 143,473 | 143,000 | 31,220 | 3,427 |
| 1.5Fin | 314 | 137,530 | 129,675 | 47,226 | 2,665 |
| 1.5Unf | 19 | 109,663 | 113,000 | 20,570 | 4,719 |
# Visualize with boxplots
ggplot(ames, aes(x = reorder(House.Style, SalePrice, FUN = median),
y = SalePrice, fill = House.Style)) +
geom_boxplot(outlier.alpha = 0.3, show.legend = FALSE) +
stat_summary(fun = mean, geom = "point", shape = 23, size = 3,
fill = "white", color = "black") +
scale_y_continuous(labels = dollar_format()) +
labs(title = "Sale Price Distribution by House Style",
subtitle = "Ordered by median price | Diamond = mean, Box shows quartiles",
x = "House Style",
y = "Sale Price") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
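What I observe: The boxplots show clear differences in median prices across house styles. 2.5Fin and 2Story homes have the highest medians (about $194,000 and $189,000). Split-level, 2.5Unf, and one-story homes sit in the middle ($155,000 to $165,000), and SFoyer, 1.5Fin, and 1.5Unf homes have the lowest medians. The spread also differs a lot between styles, and 2.5Fin has only 8 sales, so its summary is noisy. The variation suggests house style is associated with price.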
cat("=== ANOVA ASSUMPTIONS ===\n\n")
## === ANOVA ASSUMPTIONS ===
# 1. Independence - assumed based on data collection
cat("1. Independence: Assumed - each home sale is independent\n\n")
## 1. Independence: Assumed - each home sale is independent
# 2. Normality - check residuals after fitting
cat("2. Normality: Will check residuals after fitting model\n\n")
## 2. Normality: Will check residuals after fitting model
# 3. Equal variances (Homoscedasticity)
cat("3. Equal Variances (SD ratio check):\n")
## 3. Equal Variances (SD ratio check):
# Rule-of-thumb check on group SDs; a formal Levene's test (e.g. car::leveneTest) is not run here
style_variances <- style_summary |>
select(House.Style, SD_Price) |>
arrange(desc(SD_Price))
cat(" Standard deviations by group:\n")
## Standard deviations by group:
print(style_variances, n = 8)
## # A tibble: 8 × 2
## House.Style SD_Price
## <chr> <dbl>
## 1 2.5Fin 118212.
## 2 2Story 85350.
## 3 1Story 81067.
## 4 2.5Unf 76115.
## 5 1.5Fin 47226.
## 6 SLvl 34348.
## 7 SFoyer 31220.
## 8 1.5Unf 20570.
# Rule of thumb: largest SD / smallest SD < 2
max_sd <- max(style_summary$SD_Price)
min_sd <- min(style_summary$SD_Price)
ratio <- max_sd / min_sd
cat("\n Ratio of largest to smallest SD:", round(ratio, 2), "\n")
##
## Ratio of largest to smallest SD: 5.75
cat(" Rule of thumb: ratio < 2 suggests acceptable variance equality\n")
## Rule of thumb: ratio < 2 suggests acceptable variance equality
cat(" Assessment:", if(ratio < 2) "✓ Variances roughly equal" else "⚠ Consider Welch's ANOVA", "\n")
## Assessment: ⚠ Consider Welch's ANOVA
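Because the SD ratio is well above 2, a formal variance check and Welch's ANOVA are natural follow-ups. The sketch below is only an illustration on simulated data: the `toy` data frame, its group labels, and all of its numbers are made up and are not Ames results. It runs a Brown-Forsythe style check (a one-way ANOVA on absolute deviations from each group's median, the median-centered form of Levene's test) and then base R's `oneway.test()` with `var.equal = FALSE`.

```r
# Simulated stand-in data: three groups with deliberately unequal spread.
# Group names and values are hypothetical, not taken from ames.csv.
set.seed(42)
toy <- data.frame(
  style = rep(c("A", "B", "C"), times = c(200, 150, 50)),
  price = c(rnorm(200, mean = 180000, sd = 80000),
            rnorm(150, mean = 200000, sd = 40000),
            rnorm(50,  mean = 150000, sd = 20000))
)

# Brown-Forsythe check: ANOVA on absolute deviations from group medians.
toy$abs_dev <- abs(toy$price - ave(toy$price, toy$style, FUN = median))
bf_table <- anova(lm(abs_dev ~ style, data = toy))
bf_p <- bf_table[["Pr(>F)"]][1]

# Welch's ANOVA: compares group means without assuming equal variances.
welch <- oneway.test(price ~ style, data = toy, var.equal = FALSE)

cat("Brown-Forsythe p-value:", format(bf_p, digits = 3), "\n")
cat("Welch F:", round(unname(welch$statistic), 2),
    "| p-value:", format(welch$p.value, digits = 3), "\n")
```

On the real data the equivalent call would presumably be `oneway.test(SalePrice ~ House.Style, data = ames, var.equal = FALSE)`. With groups as small as 8 and 19 homes, its result is worth comparing against the classic ANOVA.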
# Fit ANOVA model
anova_model <- aov(SalePrice ~ House.Style, data = ames)
# Get ANOVA table
anova_summary <- summary(anova_model)
cat("=== ANOVA RESULTS ===\n\n")
## === ANOVA RESULTS ===
print(anova_summary)
## Df Sum Sq Mean Sq F value Pr(>F)
## House.Style 7 1.448e+12 2.068e+11 35.04 <2e-16 ***
## Residuals 2922 1.725e+13 5.902e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Extract key statistics
f_statistic <- anova_summary[[1]]["House.Style", "F value"]
p_value <- anova_summary[[1]]["House.Style", "Pr(>F)"]
df_between <- anova_summary[[1]]["House.Style", "Df"]
df_within <- anova_summary[[1]]["Residuals", "Df"]
cat("\n=== KEY STATISTICS ===\n")
##
## === KEY STATISTICS ===
cat("F-statistic:", round(f_statistic, 2), "\n")
## F-statistic: 35.04
cat("Degrees of freedom:", df_between, "between groups,", df_within, "within groups\n")
## Degrees of freedom: 7 between groups, 2922 within groups
cat("P-value:", format(p_value, scientific = TRUE, digits = 3), "\n")
## P-value: 3.03e-47
cat("Significance level: α = 0.05\n")
## Significance level: α = 0.05
cat("=== DECISION ===\n\n")
## === DECISION ===
if(p_value < 0.05) {
cat("REJECT the null hypothesis (p < 0.05)\n\n")
cat("CONCLUSION:\n")
cat("There is statistically significant evidence that at least one house style\n")
cat("has a different mean sale price than the others.\n\n")
cat("This means mean sale price differs by house style in Ames.\n")
} else {
cat("FAIL TO REJECT the null hypothesis (p >= 0.05)\n\n")
cat("CONCLUSION:\n")
cat("There is not enough evidence to conclude that house styles have\n")
cat("different mean sale prices.\n")
}
## REJECT the null hypothesis (p < 0.05)
##
## CONCLUSION:
## There is statistically significant evidence that at least one house style
## has a different mean sale price than the others.
##
## This means mean sale price differs by house style in Ames.
# Calculate effect size (eta-squared)
ss_between <- anova_summary[[1]]["House.Style", "Sum Sq"]
ss_total <- sum(anova_summary[[1]][, "Sum Sq"])
eta_squared <- ss_between / ss_total
cat("\n=== EFFECT SIZE ===\n")
##
## === EFFECT SIZE ===
cat("Eta-squared (η²):", round(eta_squared, 3), "\n")
## Eta-squared (η²): 0.077
cat("Interpretation:", round(eta_squared * 100, 1), "% of price variation explained by house style\n")
## Interpretation: 7.7 % of price variation explained by house style
How R output relates to conclusions:
F-statistic (35.04): This compares the variation between groups to variation within groups. A large F-statistic suggests groups differ more than we’d expect by chance.
P-value (3.03e-47): This is the probability of seeing differences this large (or larger) if all house styles actually had the same mean price. Since p < 0.05, it’s very unlikely our observed differences are just random chance.
Degrees of freedom: There are 8 house styles, so the between-groups df is 8 − 1 = 7. With 2,930 homes in total, the within-groups df is 2,930 − 8 = 2,922 (total homes minus number of groups).
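To make the link between the ANOVA table and these statistics concrete, here is a small sketch that recomputes F, the p-value, and eta-squared by hand. The inputs are the rounded sums of squares and degrees of freedom printed in the ANOVA table, so the results should agree with the reported values only up to rounding. The `_tab` and `_by_hand` variable names are mine.

```r
# Values copied from the printed ANOVA table (rounded as shown there).
ss_between_tab <- 1.448e12   # Sum Sq for House.Style
ss_within_tab  <- 1.725e13   # Sum Sq for Residuals
df_between_tab <- 7          # 8 styles - 1
df_within_tab  <- 2922       # 2930 homes - 8 styles

# Mean squares are sums of squares divided by their degrees of freedom.
ms_between <- ss_between_tab / df_between_tab
ms_within  <- ss_within_tab / df_within_tab

# F is the ratio of between-group to within-group mean squares.
f_by_hand <- ms_between / ms_within

# Upper-tail probability of the F distribution gives the p-value.
p_by_hand <- pf(f_by_hand, df_between_tab, df_within_tab, lower.tail = FALSE)

# Eta-squared is the share of total variation attributed to the groups.
eta_sq_by_hand <- ss_between_tab / (ss_between_tab + ss_within_tab)

cat("F by hand:", round(f_by_hand, 2), "\n")
cat("p-value by hand:", format(p_by_hand, scientific = TRUE, digits = 3), "\n")
cat("Eta-squared by hand:", round(eta_sq_by_hand, 3), "\n")
```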
if(p_value < 0.05) {
cat("=== PAIRWISE COMPARISONS (Tukey HSD) ===\n\n")
cat("Since we rejected the null, we need to identify WHICH styles differ.\n")
cat("Tukey's Honest Significant Difference test makes all pairwise comparisons:\n\n")
tukey_result <- TukeyHSD(anova_model)
# Show only significant comparisons
tukey_df <- as.data.frame(tukey_result$House.Style)
tukey_df$Comparison <- rownames(tukey_df)
significant <- tukey_df |>
filter(`p adj` < 0.05) |>
arrange(`p adj`) |>
select(Comparison, diff, `p adj`)
cat("Significant differences (p < 0.05):\n")
print(head(significant, 10))
cat("\nInterpretation: These pairs have significantly different mean prices.\n")
cat("For example, the first row shows the price difference between two styles.\n")
}
## === PAIRWISE COMPARISONS (Tukey HSD) ===
##
## Since we rejected the null, we need to identify WHICH styles differ.
## Tukey's Honest Significant Difference test makes all pairwise comparisons:
##
## Significant differences (p < 0.05):
## Comparison diff p adj
## 1Story-1.5Fin 1Story-1.5Fin 41169.95 0.000000e+00
## 2Story-1.5Fin 2Story-1.5Fin 69460.24 0.000000e+00
## 2Story-1Story 2Story-1Story 28290.28 0.000000e+00
## SFoyer-2Story SFoyer-2Story -63517.50 0.000000e+00
## SLvl-2Story SLvl-2Story -41462.78 3.603212e-07
## 2Story-1.5Unf 2Story-1.5Unf 97327.00 1.407659e-06
## SFoyer-1Story SFoyer-1Story -35227.21 1.275283e-03
## 1Story-1.5Unf 1Story-1.5Unf 69036.72 2.571212e-03
## SLvl-1.5Fin SLvl-1.5Fin 27997.46 1.212685e-02
## 2.5Fin-1.5Unf 2.5Fin-1.5Unf 110336.84 1.529152e-02
##
## Interpretation: These pairs have significantly different mean prices.
## For example, the first row shows the price difference between two styles.
For Buyers:
Since we rejected the null hypothesis, mean sale price differs significantly by house style. This means:
- Don’t assume all layouts cost the same: two-story homes sell for about $41,000 more on average than split-levels
- If you’re flexible on style, you might find better value in lower-priced layouts (split-foyer homes sell for less on average)
- Budget accordingly based on your style preference
For Sellers:
Your house style influences your asking price:
- If you have a popular style (2Story), you can justify higher prices
- If you have a less common style (SFoyer, 1.5Unf), set realistic expectations or emphasize other strong features
- Consider your style in relation to neighborhood norms
For Builders/Developers:
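The data shows which styles command premiums:
- 2Story homes sell for significantly more than most other styles; 2.5Fin has the highest mean, but with only 8 sales that estimate is very uncertain
- Consider market demand vs. construction costs when choosing building styles
- Some styles (1.5Unf, SFoyer) sell for less on average, but this comparison does not control for size or quality, so a lower mean alone does not show they are undervalued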
cat("=== CHOOSING A CONTINUOUS PREDICTOR ===\n\n")
## === CHOOSING A CONTINUOUS PREDICTOR ===
# Check correlations
continuous_vars <- c("Gr.Liv.Area", "Lot.Area", "Year.Built", "Garage.Area", "Total.Bsmt.SF")
correlations <- sapply(continuous_vars, function(var) {
cor(ames[[var]], ames$SalePrice, use = "complete.obs")
})
cor_df <- data.frame(
Variable = continuous_vars,
Correlation = correlations
) |>
arrange(desc(abs(Correlation)))
# Drop row names so the variable name is not printed twice
kable(cor_df,
row.names = FALSE,
col.names = c("Variable", "Correlation with SalePrice"),
caption = "Potential Predictors for Linear Regression",
digits = 3)
| Variable | Correlation with SalePrice |
|---|---|
| Gr.Liv.Area | 0.707 |
| Garage.Area | 0.640 |
| Total.Bsmt.SF | 0.632 |
| Year.Built | 0.558 |
| Lot.Area | 0.267 |
cat("\n=== WHY GR.LIV.AREA (LIVING AREA)? ===\n")
##
## === WHY GR.LIV.AREA (LIVING AREA)? ===
cat("Above Ground Living Area is the best choice because:\n")
## Above Ground Living Area is the best choice because:
cat("- Strongest correlation with SalePrice (r = 0.707)\n")
## - Strongest correlation with SalePrice (r = 0.707)
cat("- Directly measurable and easily understood (square footage)\n")
## - Directly measurable and easily understood (square footage)
cat("- Buyers actively search by square footage ('I need 2000+ sq ft')\n")
## - Buyers actively search by square footage ('I need 2000+ sq ft')
cat("- Appraisers use price-per-square-foot as a key metric\n")
## - Appraisers use price-per-square-foot as a key metric
cat("- Relationship should be roughly linear (bigger homes = higher prices)\n")
## - Relationship should be roughly linear (bigger homes = higher prices)
# Scatterplot to assess linearity
ggplot(ames, aes(x = Gr.Liv.Area, y = SalePrice)) +
geom_point(alpha = 0.3, size = 2, color = "steelblue") +
geom_smooth(method = "lm", color = "red", se = TRUE, linewidth = 1.5) +
geom_smooth(method = "loess", color = "orange", se = FALSE, linewidth = 1, linetype = "dashed") +
scale_x_continuous(labels = comma_format()) +
scale_y_continuous(labels = dollar_format()) +
labs(title = "Sale Price vs. Above Ground Living Area",
subtitle = "Red line = linear fit | Orange dashed = smoothed curve (checking linearity)",
x = "Living Area (square feet)",
y = "Sale Price") +
theme_minimal()
cat("\n=== LINEARITY ASSESSMENT ===\n")
##
## === LINEARITY ASSESSMENT ===
cat("Looking at the plot:\n")
## Looking at the plot:
cat("- Red line (linear) and orange line (smoothed) are very similar\n")
## - Red line (linear) and orange line (smoothed) are very similar
cat("- This suggests a linear model is appropriate\n")
## - This suggests a linear model is appropriate
cat("- Points show positive trend with some scatter (typical for real data)\n")
## - Points show positive trend with some scatter (typical for real data)
cat("- No obvious curvature that would violate linearity assumption\n")
## - No obvious curvature that would violate linearity assumption
Visual assessment: The red linear fit closely matches the orange smoothed curve, indicating the relationship is approximately linear. There’s scatter around the line (some homes are above/below the trend), but no systematic curvature that would suggest we need a non-linear model.
# Fit simple linear regression
model <- lm(SalePrice ~ Gr.Liv.Area, data = ames)
# Display model summary
cat("=== LINEAR REGRESSION MODEL ===\n\n")
## === LINEAR REGRESSION MODEL ===
cat("Model: SalePrice = β₀ + β₁ × Gr.Liv.Area + ε\n\n")
## Model: SalePrice = β₀ + β₁ × Gr.Liv.Area + ε
summary_output <- summary(model)
print(summary_output)
##
## Call:
## lm(formula = SalePrice ~ Gr.Liv.Area, data = ames)
##
## Residuals:
## Min 1Q Median 3Q Max
## -483467 -30219 -1966 22728 334323
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13289.634 3269.703 4.064 4.94e-05 ***
## Gr.Liv.Area 111.694 2.066 54.061 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56520 on 2928 degrees of freedom
## Multiple R-squared: 0.4995, Adjusted R-squared: 0.4994
## F-statistic: 2923 on 1 and 2928 DF, p-value: < 2.2e-16
# Extract coefficients
coefficients <- coef(model)
intercept <- coefficients[1]
slope <- coefficients[2]
cat("\n=== MODEL EQUATION ===\n")
##
## === MODEL EQUATION ===
cat("SalePrice =", round(intercept, 2), "+", round(slope, 2), "× Gr.Liv.Area\n\n")
## SalePrice = 13289.63 + 111.69 × Gr.Liv.Area
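As a quick check on how the equation is used, the sketch below plugs a few living areas into it by hand. The coefficients are the rounded estimates printed in the model summary, so results may differ from `predict()` by a dollar or so. The names `b0`, `b1`, and `pred_by_hand` are mine.

```r
# Coefficients copied from the printed model summary (rounded as shown).
b0 <- 13289.634   # intercept
b1 <- 111.694     # slope: dollars per additional square foot

# Living areas to evaluate, in square feet.
sizes <- c(1000, 1500, 2000, 2500, 3000)

# Apply the fitted line: price = b0 + b1 * area.
pred_by_hand <- b0 + b1 * sizes

print(data.frame(Living_Area = sizes,
                 Predicted_Price = round(pred_by_hand)))
```

For a 1,500 sq ft home this gives 13,289.634 + 111.694 × 1,500 ≈ $180,831. Note that the intercept is part of every prediction, so the predicted price is not simply the slope times the area.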
cat("=== MODEL FIT STATISTICS ===\n\n")
## === MODEL FIT STATISTICS ===
# R-squared
r_squared <- summary_output$r.squared
adj_r_squared <- summary_output$adj.r.squared
cat("R-squared:", round(r_squared, 3), "\n")
## R-squared: 0.5
cat("Adjusted R-squared:", round(adj_r_squared, 3), "\n")
## Adjusted R-squared: 0.499
cat("Interpretation:", round(r_squared * 100, 1),
"% of sale price variation is explained by living area\n\n")
## Interpretation: 50 % of sale price variation is explained by living area
# RMSE (Root Mean Squared Error)
rmse <- sqrt(mean(residuals(model)^2))
cat("RMSE (Root Mean Squared Error):", dollar(rmse), "\n")
## RMSE (Root Mean Squared Error): $56,504.88
cat("Interpretation: A typical prediction misses by about", dollar(rmse), "\n\n")
## Interpretation: A typical prediction misses by about $56,504.88
# Residual standard error
rse <- summary_output$sigma
cat("Residual Standard Error:", dollar(rse), "\n")
## Residual Standard Error: $56,524.17
cat("Interpretation: Typical prediction error is about", dollar(rse), "\n")
## Interpretation: Typical prediction error is about $56,524.17
Model fit assessment:
R² = 0.5: Living area explains about 50% of price variation. This is pretty good for a single predictor! About half the variation in prices can be predicted just from knowing the square footage.
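RMSE ≈ $56,500: A typical prediction misses the actual price by about $56,500. (RMSE is a root-mean-square error, so large misses count more than they would in a simple average of absolute errors.) Relative to the mean sale price of about $180,800, that is roughly 31%. Not perfect, but useful for ballpark estimates.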
What’s missing? The other ~50% of variation comes from quality, location, age, features, etc. We’d need a multiple regression model to capture those.
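The fit statistics above are closely related to one another. The sketch below uses only numbers already reported in this analysis. It shows that the residual standard error is the RMSE rescaled for the two estimated coefficients, and that in simple regression R² is the squared correlation. The sample size of 2,930 is inferred from the 2,928 residual degrees of freedom plus 2 coefficients.

```r
# Reported values from the output above.
n_homes   <- 2930        # inferred: 2928 residual df + 2 coefficients
rmse_rep  <- 56504.88    # sqrt(mean(residuals^2))
r_rep     <- 0.707       # correlation of Gr.Liv.Area with SalePrice
mean_rep  <- 180796      # mean SalePrice

# RSE divides the squared residuals by (n - 2) instead of n.
rse_by_hand <- rmse_rep * sqrt(n_homes / (n_homes - 2))

# With one predictor, R-squared equals the squared correlation.
r_sq_by_hand <- r_rep^2

# RMSE as a fraction of the mean sale price.
rmse_share <- rmse_rep / mean_rep

cat("RSE by hand:", round(rse_by_hand, 2), "\n")
cat("R-squared from r:", round(r_sq_by_hand, 3), "\n")
cat("RMSE / mean price:", round(rmse_share, 3), "\n")
```

Because the correlation is entered as the rounded 0.707, the squared value matches the reported R² of 0.4995 only to about three decimal places.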
# Create diagnostic plots
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))
cat("\n=== DIAGNOSTIC PLOT INTERPRETATION ===\n\n")
##
## === DIAGNOSTIC PLOT INTERPRETATION ===
cat("1. Residuals vs Fitted:\n")
## 1. Residuals vs Fitted:
cat(" - Should show random scatter around zero (no pattern)\n")
## - Should show random scatter around zero (no pattern)
cat(" - Our plot shows mostly random scatter, though slight heteroscedasticity\n")
## - Our plot shows mostly random scatter, though slight heteroscedasticity
cat(" - (variance increases slightly at higher prices)\n\n")
## - (variance increases slightly at higher prices)
cat("2. Normal Q-Q:\n")
## 2. Normal Q-Q:
cat(" - Points should follow diagonal line for normal residuals\n")
## - Points should follow diagonal line for normal residuals
cat(" - Our plot shows good fit except in tails (some outliers)\n")
## - Our plot shows good fit except in tails (some outliers)
cat(" - This is acceptable for our purposes\n\n")
## - This is acceptable for our purposes
cat("3. Scale-Location:\n")
## 3. Scale-Location:
cat(" - Should show horizontal line (constant variance)\n")
## - Should show horizontal line (constant variance)
cat(" - Slight upward trend suggests heteroscedasticity\n")
## - Slight upward trend suggests heteroscedasticity
cat(" - Not severe enough to invalidate the model\n\n")
## - Not severe enough to invalidate the model
cat("4. Residuals vs Leverage:\n")
## 4. Residuals vs Leverage:
cat(" - Identifies influential outliers (beyond Cook's distance lines)\n")
## - Identifies influential outliers (beyond Cook's distance lines)
cat(" - A few high-leverage points but within acceptable range\n\n")
## - A few high-leverage points but within acceptable range
cat("OVERALL ASSESSMENT: Model assumptions are reasonably met.\n")
## OVERALL ASSESSMENT: Model assumptions are reasonably met.
cat("Some heteroscedasticity present but not severe.\n")
## Some heteroscedasticity present but not severe.
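The heteroscedasticity noted above was judged by eye. One way to put a number on it is a Breusch-Pagan style test. The sketch below hand-rolls the studentized (Koenker) version in base R: regress the squared residuals on the predictor and compare n × R² to a chi-squared distribution. It runs on simulated data where the error spread grows with living area, so `area`, `price`, and every number here are hypothetical and not Ames results.

```r
# Simulated data with error spread that grows with area (hypothetical values).
set.seed(1)
n_sim <- 500
area  <- runif(n_sim, min = 500, max = 3500)
price <- 13000 + 112 * area + rnorm(n_sim, sd = 20 * area)

# Fit the same kind of simple linear model as in the analysis.
fit_sim <- lm(price ~ area)

# Auxiliary regression: do squared residuals depend on the predictor?
sq_resid <- residuals(fit_sim)^2
aux_fit  <- lm(sq_resid ~ area)

# Studentized Breusch-Pagan statistic: n * R^2, chi-squared with 1 df here.
bp_stat <- n_sim * summary(aux_fit)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

cat("BP statistic:", round(bp_stat, 2), "\n")
cat("BP p-value:", format(bp_p, digits = 3), "\n")
```

A small p-value suggests non-constant variance. If the same check on the Ames model flags a problem, modeling `log(SalePrice)` is one common remedy to consider.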
cat("=== COEFFICIENT INTERPRETATION ===\n\n")
## === COEFFICIENT INTERPRETATION ===
cat("Intercept (β₀) =", dollar(intercept), "\n")
## Intercept (β₀) = $13,289.63
cat("What it means:\n")
## What it means:
cat("- This is the predicted price when living area = 0\n")
## - This is the predicted price when living area = 0
cat("- Practically meaningless (no home has 0 square feet)\n")
## - Practically meaningless (no home has 0 square feet)
cat("- Just serves as the baseline in our equation\n\n")
## - Just serves as the baseline in our equation
cat("Slope (β₁) =", dollar(slope), "per square foot\n")
## Slope (β₁) = $111.69 per square foot
cat("What it means:\n")
## What it means:
cat("- For every additional square foot of living area,\n")
## - For every additional square foot of living area,
cat(" the sale price increases by approximately", dollar(round(slope)), "\n")
## the sale price increases by approximately $112
cat("- This is the model's price change per extra square foot, not the average price per square foot\n")
## - This is the model's price change per extra square foot, not the average price per square foot
cat("- A 100 sq ft increase → ~", dollar(round(slope * 100)), "price increase\n")
## - A 100 sq ft increase → ~ $11,169 price increase
cat("- A 500 sq ft increase → ~", dollar(round(slope * 500)), "price increase\n\n")
## - A 500 sq ft increase → ~ $55,847 price increase
# Confidence interval for slope
conf_int <- confint(model, "Gr.Liv.Area", level = 0.95)
cat("95% Confidence Interval for slope:\n")
## 95% Confidence Interval for slope:
cat("We're 95% confident the true slope (price change per extra square foot) is between\n")
## We're 95% confident the true slope (price change per extra square foot) is between
cat(dollar(conf_int[1]), "and", dollar(conf_int[2]), "\n")
## $107.64 and $115.75
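The interval from `confint()` can be reproduced by hand as estimate ± critical t-value × standard error. The sketch below uses the rounded slope estimate and standard error printed in the model summary, so it should match the reported interval to within about a cent. The variable names are mine.

```r
# Values copied from the printed coefficient table (rounded as shown).
slope_est <- 111.694   # estimated slope
slope_se  <- 2.066     # standard error of the slope
df_resid  <- 2928      # residual degrees of freedom

# Two-sided 95% interval uses the 97.5th percentile of the t distribution.
t_crit <- qt(0.975, df = df_resid)

# Estimate plus/minus the margin of error.
ci_by_hand <- slope_est + c(-1, 1) * t_crit * slope_se

cat("Critical t:", round(t_crit, 4), "\n")
cat("95% CI by hand:", round(ci_by_hand, 2), "\n")
```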
cat("=== EXAMPLE PREDICTIONS ===\n\n")
## === EXAMPLE PREDICTIONS ===
# Example predictions
example_sizes <- c(1000, 1500, 2000, 2500, 3000)
predictions <- predict(model,
newdata = data.frame(Gr.Liv.Area = example_sizes),
interval = "prediction",
level = 0.95)
pred_table <- data.frame(
Living_Area = comma(example_sizes),
Predicted_Price = dollar(predictions[, "fit"]),
Lower_95 = dollar(predictions[, "lwr"]),
Upper_95 = dollar(predictions[, "upr"])
)
kable(pred_table,
col.names = c("Living Area (sq ft)", "Predicted Price",
"95% Lower", "95% Upper"),
caption = "Price Predictions for Different Home Sizes")
| Living Area (sq ft) | Predicted Price | 95% Lower | 95% Upper |
|---|---|---|---|
| 1,000 | $124,984 | $14,115 | $235,852 |
| 1,500 | $180,831 | $69,981 | $291,681 |
| 2,000 | $236,678 | $125,809 | $347,546 |
| 2,500 | $292,525 | $181,601 | $403,449 |
| 3,000 | $348,372 | $237,355 | $459,388 |
cat("\nHow to use this:\n")
##
## How to use this:
cat("- The 'Predicted Price' is our best guess\n")
## - The 'Predicted Price' is our best guess
cat("- The 95% prediction interval is where we expect a single home of that size to sell about 95% of the time\n")
## - The 95% prediction interval is where we expect a single home of that size to sell about 95% of the time
cat("- Wide intervals reflect that many factors beyond size affect price\n")
## - Wide intervals reflect that many factors beyond size affect price
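To see where the width of these intervals comes from, the sketch below rebuilds the interval for a 1,500 sq ft home from reported values. It uses the approximation half-width ≈ t × RSE × sqrt(1 + 1/n), which ignores the leverage term. That simplification is my assumption, and it is reasonable only for homes near the average living area.

```r
# Reported values from the model output above.
rse_rep  <- 56524.17   # residual standard error
n_homes  <- 2930       # inferred: 2928 residual df + 2 coefficients
df_resid <- 2928
fit_1500 <- 180831     # predicted price at 1,500 sq ft

# Critical value for a two-sided 95% interval.
t_crit <- qt(0.975, df = df_resid)

# Approximate half-width, ignoring the leverage term (assumption).
half_width <- t_crit * rse_rep * sqrt(1 + 1 / n_homes)

pi_by_hand <- c(lower = fit_1500 - half_width,
                upper = fit_1500 + half_width)
print(round(pi_by_hand))
```

Nearly all of the width comes from the residual standard error, which is why the interval spans roughly ±$111,000 at every size in the table. The intervals for the largest homes are slightly wider because the leverage term grows away from the center of the data.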
cat("=== RECOMMENDATIONS ===\n\n")
## === RECOMMENDATIONS ===
cat("For BUYERS:\n")
## For BUYERS:
cat("- Expect each additional square foot to add approximately", dollar(round(slope)), "to the price\n")
## - Expect each additional square foot to add approximately $112 to the price
cat("- A 1,500 sq ft home should cost around",
dollar(round(predict(model, newdata = data.frame(Gr.Liv.Area = 1500)))), "\n")
## - A 1,500 sq ft home should cost around $180,831
cat("- If a home is priced far above the prediction interval, question why\n")
## - If a home is priced far above the prediction interval, question why
cat(" (is it high quality? premium location? or just overpriced?)\n\n")
## (is it high quality? premium location? or just overpriced?)
cat("For SELLERS:\n")
## For SELLERS:
cat("- Your base price estimate is", dollar(round(intercept)), "+", dollar(round(slope)), "× (your square footage)\n")
## - Your base price estimate is $13,290 + $112 × (your square footage)
cat("- Adjust UP if you have: premium location, high quality, modern features\n")
## - Adjust UP if you have: premium location, high quality, modern features
cat("- Adjust DOWN if you have: budget location, lower quality, needed repairs\n")
## - Adjust DOWN if you have: budget location, lower quality, needed repairs
cat("- The model gives a starting point, not a final price\n\n")
## - The model gives a starting point, not a final price
cat("For APPRAISERS:\n")
## For APPRAISERS:
cat("- Living area is the single strongest predictor (R² = ", round(r_squared, 3), ")\n")
## - Living area is the single strongest predictor (R² = 0.5 )
cat("- Use", dollar(round(slope)), "per sq ft as a baseline\n")
## - Use $112 per sq ft as a baseline
cat("- But remember: 50% of variation comes from other factors\n")
## - But remember: 50% of variation comes from other factors
cat("- Always adjust for quality, location, condition, features\n\n")
## - Always adjust for quality, location, condition, features
cat("OPTIMAL STRATEGY:\n")
## OPTIMAL STRATEGY:
cat("If you're trying to maximize value-for-money:\n")
## If you're trying to maximize value-for-money:
cat("- Larger homes have better price-per-square-foot efficiency\n")
## - Larger homes have better price-per-square-foot efficiency
cat("- A straight-line model assumes a constant price per extra square foot, so it cannot show diminishing returns\n")
## - A straight-line model assumes a constant price per extra square foot, so it cannot show diminishing returns
cat("- The 1,500-2,000 sq ft range is a rule of thumb that balances price and size, not a result of this model\n")
## - The 1,500-2,000 sq ft range is a rule of thumb that balances price and size, not a result of this model
This week’s analysis provided two complementary perspectives on what drives home prices in Ames:
ANOVA Results:
- House style significantly affects sale price (F = 35.04, p < 0.001)
- Two-story homes command premiums over split-levels and one-story homes
- Style explains about 7.7% of price variation
- Buyers should budget differently based on preferred style
Regression Results:
- Living area strongly predicts sale price (R² = 0.5)
- Each square foot adds approximately $112 to price
- The model explains about 50% of price variation
- Remaining variation comes from quality, location, and features
Key Insight: Both categorical factors (house style) and continuous factors (square footage) matter, but square footage is the strongest single predictor among those examined here (R² ≈ 0.50, versus η² ≈ 0.077 for house style). A complete pricing model would need both types of variables plus others (quality, location, age).
Both ANOVA and regression are powerful tools, but they’re most useful when combined with domain knowledge and awareness of their limitations. They provide data-driven insights, but human judgment is still needed for final decisions.