The target audience for this analysis is a real estate analytics firm operating in Ames, Iowa. The firm’s clients include:
Homebuyers: Understand pricing dynamics across different neighborhoods to identify affordable homes of high quality.
Investors: Make data-driven decisions on which properties are most likely to appreciate in value based on factors such as size, quality, and location.
Property Managers: Set competitive rent prices by analyzing property value fluctuations in different areas.
The firm needs to identify key factors that influence housing prices in Ames, Iowa, to help clients (homebuyers, investors, and property managers) make better decisions about home buying, property investments, and setting rental rates.
For example:
“Should we invest in larger homes with bigger lots, or are smaller homes in premium neighborhoods more likely to appreciate?”
“What combination of features would yield the highest return on investment in terms of sale price?”
The data used in the analysis includes crucial variables like sale_price, overall_qual, year_built, gr_liv_area, and neighborhood. The lab focuses on analyzing these factors using linear regression to predict sale prices and ANOVA to assess the price variations across neighborhoods or other categorical variables. The assumption is that macroeconomic factors, like interest rates, remain constant and do not dramatically influence housing prices during this analysis.
We should also consider the limitations of these assumptions:
The effects of external factors such as interest rates, new infrastructure projects, and economic recessions may not be captured in this model.
We assume homogeneity within neighborhoods, but some areas may be experiencing rapid gentrification or changes in demand that are not reflected in the data.
The primary goal of this analysis is to identify the most significant factors that influence housing prices. The success criteria for this project will be:
Achieving an R-squared value of 0.85 or higher to ensure that the model accounts for a large proportion of the variance in home prices.
Clear interpretability of the variables involved, ensuring that clients can confidently understand how different property features affect prices.
Actionable insights, such as identifying the most valuable property features (e.g., quality, square footage, or neighborhood).
1. Which features influence home prices the most?
Answer: Overall quality (overall_qual) and living area (gr_liv_area) are the most significant predictors of home prices. These features strongly correlate with sale prices, with high-quality homes and larger living spaces fetching higher prices.
Enhancement: A variable importance plot can be created to show the relative significance of each feature.
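One hedged way to build such a plot for a linear model is to rank terms by the F statistic from drop1(), which refits the model without each term in turn. The sketch below runs on a small simulated data frame (toy, a hypothetical stand-in for ames); the column names mirror the report, but every value is made up.

```r
# Hypothetical toy data; column names mirror the Ames variables, values are simulated
set.seed(1)
n <- 300
toy <- data.frame(
  gr_liv_area  = runif(n, 600, 3000),
  year_built   = sample(1920:2010, n, replace = TRUE),
  overall_qual = factor(sample(c("Average", "Good", "Excellent"), n, replace = TRUE))
)
toy$sale_price <- 50 * toy$gr_liv_area + 500 * toy$year_built +
  40000 * (toy$overall_qual == "Excellent") + rnorm(n, sd = 20000)

fit <- lm(sale_price ~ gr_liv_area + year_built + overall_qual, data = toy)

# drop1() refits the model without each term; a larger F statistic means
# more explained variance is lost when that term is removed
term_tests <- drop1(fit, test = "F")
f_stats <- setNames(term_tests[["F value"]], rownames(term_tests))
importance <- sort(f_stats[!is.na(f_stats)], decreasing = TRUE)

# Horizontal bar chart with the most important term at the top
barplot(rev(importance), horiz = TRUE, las = 1,
        xlab = "F statistic when term is dropped",
        main = "Term importance (toy data)")
```

On the real data the same calls could be applied to the fitted lm object. Because the F statistic accounts for each term's degrees of freedom, factor terms with many levels are compared on a fairer footing than by counting significant coefficients.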
2. Do high-quality neighborhoods always command higher prices?
Answer: Not always. The ANOVA results show that average sale prices differ significantly across neighborhoods, but the dataset does not record amenities or crime rates, so it cannot show why some neighborhoods command a premium. Neighborhood alone is also not sufficient to explain price; overall_qual and gr_liv_area play major roles.
Enhancement: Heatmaps and boxplots can be used to visualize price variations by neighborhood.
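A minimal sketch of the boxplot idea, using base graphics and a hypothetical toy data frame so that it runs on its own. The three neighborhood names exist in the Ames data, but the prices below are simulated, not taken from it.

```r
# Hypothetical toy data: three neighborhoods with simulated prices
set.seed(2)
toy <- data.frame(
  neighborhood = factor(rep(c("North_Ames", "Edwards", "Northridge"), each = 50)),
  sale_price   = c(rnorm(50, 145000, 20000),
                   rnorm(50, 130000, 25000),
                   rnorm(50, 330000, 60000))
)

# Order the factor levels by median price so the plot reads from low to high
toy$neighborhood <- reorder(toy$neighborhood, toy$sale_price, FUN = median)

boxplot(sale_price ~ neighborhood, data = toy, las = 2,
        xlab = "", ylab = "Sale price",
        main = "Sale price by neighborhood (toy data)")
```

Sorting by median makes it easier to see both the price ranking and how much the spread differs between groups, which matters for the equal-variance assumption of ANOVA.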
3. Are larger homes always better investments?
Answer: Larger homes (gr_liv_area) typically have higher sale prices, but the relationship isn’t purely linear. After a certain point, increasing square footage may not lead to proportional increases in price, especially if the quality of the home isn’t aligned.
Enhancement: Interaction terms can be introduced to model the diminishing returns of larger homes in high-quality areas.
4. How do outliers impact the model?
Answer: Extreme values, like luxury homes or significant renovations, can heavily influence the model’s coefficients. Using Cook’s Distance, we can identify these outliers and evaluate their impact on the model’s predictive accuracy.
Enhancement: A comparison of model performance with and without outliers can highlight their effect on predictions.
5. Are assumptions of normality and constant variance met?
Answer: The residuals of the model slightly deviate from normality, and some variance heterogeneity is observed. This is typical in real estate pricing, where outliers and skewed distributions are common.
Enhancement: We might apply log transformations to the dependent variable (sale_price) or use robust regression techniques to address these issues.
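The effect of a log transformation can be sketched as follows. The data are simulated with multiplicative noise (an assumption chosen to mimic right-skewed prices), and toy, raw_fit and log_fit are hypothetical names.

```r
# Hypothetical toy data with multiplicative noise, which makes raw prices right-skewed
set.seed(3)
n <- 300
toy <- data.frame(gr_liv_area = runif(n, 600, 3000))
toy$sale_price <- exp(11 + 0.0005 * toy$gr_liv_area + rnorm(n, sd = 0.25))

raw_fit <- lm(sale_price ~ gr_liv_area, data = toy)
log_fit <- lm(log(sale_price) ~ gr_liv_area, data = toy)

# Shapiro-Wilk W closer to 1 means the residuals look more normal
w_raw <- unname(shapiro.test(residuals(raw_fit))$statistic)
w_log <- unname(shapiro.test(residuals(log_fit))$statistic)
c(raw = w_raw, log = w_log)

# In a log-price model, 100 * (exp(b * k) - 1) is the percent change in price
# implied by a k-unit increase in the predictor (here k = 100 square feet)
pct_per_100_sqft <- unname(100 * (exp(100 * coef(log_fit)["gr_liv_area"]) - 1))
pct_per_100_sqft
```

Note that coefficients of the log model describe percentage changes rather than dollar amounts, which changes how results are communicated to clients. For the robust regression route, MASS::rlm() is one commonly used option.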
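library(AmesHousing) # make_ames()
library(tidyverse)   # rename_with(), ggplot(), %>%
library(car)         # vif(), leveneTest()
# make a simpler name for (and a copy of) the Ames data
ames <- make_ames()
ames <- ames |> rename_with(tolower)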
Current Problem: Some predictors, such as square footage and building type, may have high correlation. Multicollinearity can distort the interpretation of individual coefficients in the linear regression model.
Proposed Improvement: Use Variance Inflation Factor (VIF) to detect multicollinearity. If variables exhibit high multicollinearity, we can remove them or combine them for better interpretation.
# Initial model using a subset of important variables
model <- lm(sale_price ~ lot_area + overall_qual + year_built + neighborhood + gr_liv_area, data = ames)
summary(model)
##
## Call:
## lm(formula = sale_price ~ lot_area + overall_qual + year_built +
## neighborhood + gr_liv_area, data = ames)
##
## Residuals:
## Min 1Q Median 3Q Max
## -433945 -14128 -4 13029 220440
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -985130.24078 87233.01744
## lot_area 0.85973 0.08526
## overall_qualPoor 27309.33281 18353.34931
## overall_qualFair 37759.90434 16816.65335
## overall_qualBelow_Average 50901.12299 16201.98462
## overall_qualAverage 61850.89077 16111.82067
## overall_qualAbove_Average 72500.03933 16147.60278
## overall_qualGood 86612.66728 16231.56696
## overall_qualVery_Good 122675.11905 16364.90272
## overall_qualExcellent 194712.85345 16685.64248
## overall_qualVery_Excellent 240489.25149 17486.35502
## year_built 505.73848 43.76797
## neighborhoodCollege_Creek 4149.52255 3030.96536
## neighborhoodOld_Town -7787.25970 3099.84543
## neighborhoodEdwards -14280.93201 2802.84641
## neighborhoodSomerset 6988.98348 3566.05614
## neighborhoodNorthridge_Heights 36165.39073 3979.12108
## neighborhoodGilbert -7368.29717 3421.43214
## neighborhoodSawyer -1525.07862 3041.00079
## neighborhoodNorthwest_Ames 1484.75499 3361.07945
## neighborhoodSawyer_West -5376.78353 3542.11251
## neighborhoodMitchell -919.39583 3499.46067
## neighborhoodBrookside 1857.69642 3668.38059
## neighborhoodCrawford 31126.88088 3665.77660
## neighborhoodIowa_DOT_and_Rail_Road -12399.82709 3976.39974
## neighborhoodTimberland 14478.49148 4519.05412
## neighborhoodNorthridge 44901.29315 4810.12730
## neighborhoodStone_Brook 41243.38019 5428.86970
## neighborhoodSouth_and_West_of_Iowa_State_University -11411.06633 5078.16322
## neighborhoodClear_Creek 16425.40788 5251.25950
## neighborhoodMeadow_Village -26789.15023 5728.06972
## neighborhoodBriardale -33560.75870 6121.96694
## neighborhoodBloomington_Heights 2757.27173 6650.72479
## neighborhoodVeenker 21184.89146 6901.45629
## neighborhoodNorthpark_Villa -14477.30364 6954.78656
## neighborhoodBlueste -21577.25985 10342.49128
## neighborhoodGreens -6326.75819 11734.74987
## neighborhoodGreen_Hills 93344.12831 22762.73479
## neighborhoodLandmark -27486.45646 32075.13584
## gr_liv_area 50.05646 1.60612
## t value Pr(>|t|)
## (Intercept) -11.293 < 2e-16 ***
## lot_area 10.084 < 2e-16 ***
## overall_qualPoor 1.488 0.136866
## overall_qualFair 2.245 0.024819 *
## overall_qualBelow_Average 3.142 0.001697 **
## overall_qualAverage 3.839 0.000126 ***
## overall_qualAbove_Average 4.490 7.41e-06 ***
## overall_qualGood 5.336 1.02e-07 ***
## overall_qualVery_Good 7.496 8.68e-14 ***
## overall_qualExcellent 11.669 < 2e-16 ***
## overall_qualVery_Excellent 13.753 < 2e-16 ***
## year_built 11.555 < 2e-16 ***
## neighborhoodCollege_Creek 1.369 0.171092
## neighborhoodOld_Town -2.512 0.012054 *
## neighborhoodEdwards -5.095 3.71e-07 ***
## neighborhoodSomerset 1.960 0.050108 .
## neighborhoodNorthridge_Heights 9.089 < 2e-16 ***
## neighborhoodGilbert -2.154 0.031356 *
## neighborhoodSawyer -0.502 0.616054
## neighborhoodNorthwest_Ames 0.442 0.658704
## neighborhoodSawyer_West -1.518 0.129134
## neighborhoodMitchell -0.263 0.792781
## neighborhoodBrookside 0.506 0.612609
## neighborhoodCrawford 8.491 < 2e-16 ***
## neighborhoodIowa_DOT_and_Rail_Road -3.118 0.001837 **
## neighborhoodTimberland 3.204 0.001371 **
## neighborhoodNorthridge 9.335 < 2e-16 ***
## neighborhoodStone_Brook 7.597 4.07e-14 ***
## neighborhoodSouth_and_West_of_Iowa_State_University -2.247 0.024710 *
## neighborhoodClear_Creek 3.128 0.001778 **
## neighborhoodMeadow_Village -4.677 3.05e-06 ***
## neighborhoodBriardale -5.482 4.57e-08 ***
## neighborhoodBloomington_Heights 0.415 0.678479
## neighborhoodVeenker 3.070 0.002163 **
## neighborhoodNorthpark_Villa -2.082 0.037464 *
## neighborhoodBlueste -2.086 0.037041 *
## neighborhoodGreens -0.539 0.589827
## neighborhoodGreen_Hills 4.101 4.23e-05 ***
## neighborhoodLandmark -0.857 0.391549
## gr_liv_area 31.166 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31990 on 2890 degrees of freedom
## Multiple R-squared: 0.8418, Adjusted R-squared: 0.8396
## F-statistic: 394.2 on 39 and 2890 DF, p-value: < 2.2e-16
# Checking multicollinearity using VIF
vif_values <- vif(model)
vif_values
## GVIF Df GVIF^(1/(2*Df))
## lot_area 1.291725 1 1.136541
## overall_qual 6.016995 9 1.104839
## year_built 5.015338 1 2.239495
## neighborhood 17.765211 27 1.054727
## gr_liv_area 1.886614 1 1.373541
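# Rule of thumb: a VIF above 5 suggests high multicollinearity; for factor terms (Df > 1), square GVIF^(1/(2*Df)) before comparing
Outcome: year_built sits right at the threshold (GVIF of about 5.02 with 1 Df), plausibly because construction year tracks neighborhood and quality. The large raw GVIF for neighborhood (17.77) reflects its 27 degrees of freedom; its adjusted value GVIF^(1/(2*Df)) is only 1.05, and the adjusted value for overall_qual is 1.10. Removing or combining variables with high multicollinearity improves the stability and interpretability of the model, but nothing here clearly calls for it.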
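Because vif() reports generalized VIFs for factor terms, the raw GVIF column is not comparable across terms with different degrees of freedom. The sketch below copies the values from the vif() output above into a data frame (gvif and its derived columns are hypothetical names) and puts every term on the usual VIF scale by squaring GVIF^(1/(2*Df)).

```r
# GVIF values copied from the vif(model) output printed in this report
gvif <- data.frame(
  term = c("lot_area", "overall_qual", "year_built", "neighborhood", "gr_liv_area"),
  GVIF = c(1.291725, 6.016995, 5.015338, 17.765211, 1.886614),
  Df   = c(1, 9, 1, 27, 1)
)

# GVIF^(1/(2*Df)) is comparable to sqrt(VIF), so its square is
# comparable to an ordinary VIF and to the rule-of-thumb cutoff of 5
gvif$adjusted    <- gvif$GVIF^(1 / (2 * gvif$Df))
gvif$adjusted_sq <- gvif$adjusted^2
gvif$above_5     <- gvif$adjusted_sq > 5

gvif[order(-gvif$adjusted_sq), ]
```

On this scale only year_built exceeds 5, and only marginally, while neighborhood falls to about 1.11 despite having the largest raw GVIF.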
Current Problem: The current linear model assumes that all relationships between the predictors and sale_price are purely additive. However, some features, such as overall_qual and gr_liv_area, likely have interactive effects on the sale price.
Proposed Solution: Introduce interaction terms to better understand how combinations of factors influence the sale price. This would better model the reality of home pricing, where high-quality larger homes fetch a premium.
# Adding interaction terms
interaction_model <- lm(sale_price ~ gr_liv_area * overall_qual + year_built + lot_area, data = ames)
summary(interaction_model)
##
## Call:
## lm(formula = sale_price ~ gr_liv_area * overall_qual + year_built +
## lot_area, data = ames)
##
## Residuals:
## Min 1Q Median 3Q Max
## -302492 -15189 -411 13795 337746
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -1042240.79802 60607.09784 -17.197
## gr_liv_area 16.71845 35.34069 0.473
## overall_qualPoor 25701.43607 47849.29710 0.537
## overall_qualFair 30503.59136 38842.08207 0.785
## overall_qualBelow_Average 29700.06108 36173.19576 0.821
## overall_qualAverage 56750.72908 35639.54304 1.592
## overall_qualAbove_Average 35075.39015 35725.76194 0.982
## overall_qualGood 38972.23249 35929.40330 1.085
## overall_qualVery_Good 54131.22365 36192.98044 1.496
## overall_qualExcellent 93237.80568 38309.92115 2.434
## overall_qualVery_Excellent 450931.25860 39510.13922 11.413
## year_built 545.16616 25.34668 21.508
## lot_area 1.11209 0.07987 13.924
## gr_liv_area:overall_qualPoor -9.89282 58.52418 -0.169
## gr_liv_area:overall_qualFair 13.06821 38.09593 0.343
## gr_liv_area:overall_qualBelow_Average 25.45885 35.84575 0.710
## gr_liv_area:overall_qualAverage 17.65133 35.44907 0.498
## gr_liv_area:overall_qualAbove_Average 42.18743 35.45511 1.190
## gr_liv_area:overall_qualGood 50.72257 35.48477 1.429
## gr_liv_area:overall_qualVery_Good 67.35376 35.52197 1.896
## gr_liv_area:overall_qualExcellent 83.95258 35.97132 2.334
## gr_liv_area:overall_qualVery_Excellent -40.45491 35.78317 -1.131
## Pr(>|t|)
## (Intercept) <2e-16 ***
## gr_liv_area 0.6362
## overall_qualPoor 0.5912
## overall_qualFair 0.4323
## overall_qualBelow_Average 0.4117
## overall_qualAverage 0.1114
## overall_qualAbove_Average 0.3263
## overall_qualGood 0.2781
## overall_qualVery_Good 0.1349
## overall_qualExcellent 0.0150 *
## overall_qualVery_Excellent <2e-16 ***
## year_built <2e-16 ***
## lot_area <2e-16 ***
## gr_liv_area:overall_qualPoor 0.8658
## gr_liv_area:overall_qualFair 0.7316
## gr_liv_area:overall_qualBelow_Average 0.4776
## gr_liv_area:overall_qualAverage 0.6186
## gr_liv_area:overall_qualAbove_Average 0.2342
## gr_liv_area:overall_qualGood 0.1530
## gr_liv_area:overall_qualVery_Good 0.0580 .
## gr_liv_area:overall_qualExcellent 0.0197 *
## gr_liv_area:overall_qualVery_Excellent 0.2583
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32320 on 2908 degrees of freedom
## Multiple R-squared: 0.8375, Adjusted R-squared: 0.8363
## F-statistic: 713.7 on 21 and 2908 DF, p-value: < 2.2e-16
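Outcome: Including interaction terms can capture non-additive relationships that more accurately reflect how property features combine to affect price. In this fit, only the gr_liv_area:overall_qualExcellent interaction is significant at the 0.05 level, and because the model omits neighborhood, its adjusted R-squared (0.8363) is not directly comparable with the first model’s (0.8396).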
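Individual interaction coefficients are hard to judge one at a time, so a joint test is often more informative. A minimal sketch, assuming simulated data in which price per square foot rises with quality: fit the additive and interaction models on the same predictors and compare them with a partial F-test via anova(). All names and values below are hypothetical.

```r
# Hypothetical toy data with a built-in interaction between size and quality
set.seed(4)
n <- 400
toy <- data.frame(
  gr_liv_area  = runif(n, 600, 3000),
  overall_qual = factor(sample(c("Average", "Good", "Excellent"), n, replace = TRUE),
                        levels = c("Average", "Good", "Excellent"))
)
slope <- c(Average = 40, Good = 60, Excellent = 90)[as.character(toy$overall_qual)]
toy$sale_price <- 50000 + slope * toy$gr_liv_area + rnorm(n, sd = 20000)

# The two models must be nested: same predictors, with and without the interaction
additive_fit    <- lm(sale_price ~ gr_liv_area + overall_qual, data = toy)
interaction_fit <- lm(sale_price ~ gr_liv_area * overall_qual, data = toy)

# Partial F-test: do the interaction terms jointly improve on the additive model?
comparison <- anova(additive_fit, interaction_fit)
comparison
p_value <- comparison[["Pr(>F)"]][2]
```

Applied to this report, the additive counterpart would need the same predictors as interaction_model (that is, without neighborhood) for the comparison to be valid.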
Current Problem: Outliers, such as luxury homes or properties with significant renovations, can distort regression results, leading to biased coefficient estimates. It’s essential to assess the impact of these outliers.
Proposed Solution: We’ll use Cook’s Distance to identify influential observations and outliers that may unduly affect the model. This will ensure a robust pricing model by mitigating the impact of outliers.
# Cook's Distance for detecting outliers
cooks_d <- cooks.distance(model)
plot(cooks_d, type = "h", main = "Cook's Distance", ylab = "Cook's Distance")
abline(h = 4/nrow(ames), col = "red")
Outcome: This approach helps detect extreme cases that might distort the model’s predictive power and allows us to refine the dataset for better accuracy.
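To see how much the flagged observations matter, one option is to refit the model without them and compare coefficients. The sketch below does this on simulated data with three planted outliers; toy, planted and the 4/n cutoff applied here are illustrative assumptions, and dropping rows is shown only as a sensitivity check, not as a recommendation to delete real sales.

```r
# Hypothetical toy data: a clean linear trend plus three planted outliers
set.seed(5)
n <- 200
toy <- data.frame(gr_liv_area = runif(n, 600, 3000))
toy$sale_price <- 20000 + 50 * toy$gr_liv_area + rnorm(n, sd = 15000)

# Three large homes that sold far below the trend (rows n+1 to n+3)
planted <- (n + 1):(n + 3)
toy <- rbind(toy, data.frame(gr_liv_area = rep(5000, 3),
                             sale_price  = rep(100000, 3)))

fit  <- lm(sale_price ~ gr_liv_area, data = toy)
d    <- cooks.distance(fit)
keep <- d <= 4 / nobs(fit)   # same 4/n rule of thumb as the red line in the plot

# Refit on the retained rows and compare coefficients side by side
refit <- lm(sale_price ~ gr_liv_area, data = toy[keep, ])
rbind(all_rows = coef(fit), without_flagged = coef(refit))
```

A large shift in a coefficient between the two fits signals that a handful of sales is driving the estimate and deserves a closer look.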
Current Problem: Assuming homoscedasticity (constant variance of residuals) may lead to inaccurate estimates if violated. It’s essential to check whether residuals exhibit heteroscedasticity, which could invalidate inference.
Proposed Solution: We’ll perform a Breusch-Pagan test for heteroscedasticity and assess the residual distribution with Q-Q plots to ensure that assumptions of linear regression hold.
# Checking for homoscedasticity
library(lmtest)
bptest(model) # Breusch-Pagan Test
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 804.45, df = 39, p-value < 2.2e-16
# Plot residuals
plot(model, which = 1) # Residuals vs Fitted
plot(model, which = 2) # Q-Q plot for normality
## Warning: not plotting observations with leverage one:
## 2789
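Outcome: The Breusch-Pagan test rejects constant variance (BP = 804.45, df = 39, p < 2.2e-16), so the standard errors and p-values from summary(model) should be treated with caution until the heteroscedasticity is addressed, for example with a log transformation of sale_price or heteroscedasticity-robust standard errors. Ensuring homoscedasticity and normality of residuals improves the validity of statistical tests, allowing us to trust the results.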
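When constant variance fails, heteroscedasticity-robust (White, HC0) standard errors are one common remedy. The sketch below computes them by hand in base R on simulated data so that the sandwich formula is visible; toy and the other names are hypothetical, and in practice a packaged implementation such as sandwich::vcovHC() would normally be used instead.

```r
# Hypothetical toy data: noise grows with living area, so variance is not constant
set.seed(6)
n <- 300
toy <- data.frame(gr_liv_area = runif(n, 600, 3000))
toy$sale_price <- 20000 + 50 * toy$gr_liv_area +
  rnorm(n, sd = 10 * toy$gr_liv_area)

fit <- lm(sale_price ~ gr_liv_area, data = toy)

# HC0 sandwich estimator: (X'X)^-1 [ X' diag(e^2) X ] (X'X)^-1
X     <- model.matrix(fit)
e     <- residuals(fit)
bread <- solve(crossprod(X))
meat  <- crossprod(X * e)    # row i of X is scaled by e[i]
vcov_hc0 <- bread %*% meat %*% bread

# Classical versus robust standard errors, side by side
se_table <- cbind(classical  = sqrt(diag(vcov(fit))),
                  robust_hc0 = sqrt(diag(vcov_hc0)))
se_table
```

The coefficient estimates themselves do not change; only the standard errors, and therefore the t statistics and p-values, are adjusted.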
ANOVA (Analysis of Variance) is a powerful statistical technique used to compare the means of multiple groups and determine if at least one group mean is significantly different from others. In the context of analyzing sale prices in the Ames dataset, ANOVA helps us identify variations due to factors like neighborhood or building type.
The key assumptions for ANOVA include:
Normality: The data within each group should follow a normal distribution.
Homoscedasticity: The variance across groups should be roughly equal.
Independence: Observations within and across groups must be independent.
Improvement Suggestions for ANOVA Analysis:
Provide a clear explanation of group definitions, such as building types or neighborhoods, to contextualize the analysis.
Include detailed calculations for total variance, group variance, and error variance to enhance transparency.
Offer multiple ways to test ANOVA assumptions and explore corrective measures for violations.
Address non-normal distributions by applying transformations or alternative methods.
Consider additional variables, like zoning or proximity to amenities, to enrich insights.
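The variance calculations mentioned above can be made explicit. The sketch below splits the total sum of squares into between-group and within-group parts by hand and checks the resulting F statistic against aov(), using a hypothetical three-neighborhood toy data frame with simulated prices.

```r
# Hypothetical toy data: three neighborhoods with simulated prices
set.seed(7)
toy <- data.frame(
  neighborhood = factor(rep(c("A", "B", "C"), times = c(60, 40, 20))),
  sale_price   = c(rnorm(60, 150000, 20000),
                   rnorm(40, 180000, 25000),
                   rnorm(20, 260000, 40000))
)

grand_mean <- mean(toy$sale_price)
group_mean <- ave(toy$sale_price, toy$neighborhood)   # each row's own group mean

ss_total   <- sum((toy$sale_price - grand_mean)^2)    # total variation
ss_between <- sum((group_mean - grand_mean)^2)        # explained by neighborhood
ss_within  <- sum((toy$sale_price - group_mean)^2)    # residual (error) variation

k <- nlevels(toy$neighborhood)
n <- nrow(toy)
f_manual <- (ss_between / (k - 1)) / (ss_within / (n - k))

# The same F statistic as reported by aov()
f_aov <- summary(aov(sale_price ~ neighborhood, data = toy))[[1]][["F value"]][1]
c(manual = f_manual, aov = f_aov)
```

The between and within sums of squares correspond to the Sum Sq column of an aov() summary, and each Mean Sq entry is the sum of squares divided by its Df.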
Step 1: ANOVA for Neighborhood Effects on Sale Prices
# Perform ANOVA for neighborhood effect on sale prices
anova_model <- aov(sale_price ~ neighborhood, data = ames)
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## neighborhood 27 10716004520761 396889056324 144.4 <2e-16 ***
## Residuals 2902 7976532589590 2748632870
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Check for homoscedasticity using Levene's Test
leveneTest(sale_price ~ neighborhood, data = ames)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 27 25.225 < 2.2e-16 ***
## 2902
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Diagnostic plots for residual analysis
par(mfrow = c(2, 2))
plot(anova_model)
## Warning: not plotting observations with leverage one:
## 2789
Explanation:
The ANOVA output provides a summary of the variation in sale prices explained by neighborhood differences.
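A significant p-value (p < 0.05) in the ANOVA summary indicates statistically significant price differences between neighborhoods. Here F = 144.4 on 27 and 2902 degrees of freedom with p < 2e-16, so at least one neighborhood mean differs from the others.
Levene’s Test checks for equal variances (homoscedasticity). A p-value > 0.05 would indicate that the assumption of equal variances holds; here p < 2.2e-16, so the assumption is violated and the ANOVA F-test should be interpreted with caution.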
Diagnostic plots examine residual patterns and normality visually.
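When Levene’s test rejects equal variances, two standard alternatives in base R are Welch’s one-way test, which does not assume equal variances, and the rank-based Kruskal-Wallis test. The sketch below runs both on simulated groups with deliberately unequal spread; toy and toy_ok are hypothetical names. Welch’s test needs at least two observations per group, so very small groups are dropped first.

```r
# Hypothetical toy data: group means and variances both differ
set.seed(8)
toy <- data.frame(
  neighborhood = factor(rep(c("A", "B", "C"), times = c(80, 40, 15))),
  sale_price   = c(rnorm(80, 145000, 15000),
                   rnorm(40, 190000, 40000),
                   rnorm(15, 320000, 90000))
)

# Keep only groups with at least two observations
group_n <- table(toy$neighborhood)
keep    <- toy$neighborhood %in% names(group_n)[group_n >= 2]
toy_ok  <- droplevels(toy[keep, ])

# Welch's ANOVA: compares means without assuming equal variances
welch <- oneway.test(sale_price ~ neighborhood, data = toy_ok, var.equal = FALSE)

# Kruskal-Wallis: rank-based, no normality assumption
kw <- kruskal.test(sale_price ~ neighborhood, data = toy_ok)

c(welch_p = welch$p.value, kruskal_p = kw$p.value)
```

If both tests agree with the classical ANOVA, the conclusion that prices differ by neighborhood does not hinge on the violated assumption.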
1. Normality Check for Overall Sale Prices
# Visualize overall sale price distribution
ames %>%
  ggplot(aes(x = sale_price)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = 'lightblue', color = 'white') +
  geom_density(color = 'red', linewidth = 1) +
  labs(title = "Distribution of Sale Prices", x = "Sale Price", y = "Density")
Explanation:
The histogram shows a right-skewed distribution, which
is common for price data as it is bounded by zero. Skewness can affect
normality, necessitating transformations like logarithms or
non-parametric approaches.
# Visualize sale price distributions by building type
ames %>%
  ggplot(aes(x = sale_price)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20, fill = 'lightblue', color = 'white') +
  geom_density(color = 'red', linewidth = 1) +
  facet_wrap(~ bldg_type, scales = "free_y") +
  labs(title = "Sale Price Distribution by Building Type", x = "Sale Price", y = "Density")
Explanation:
Each building type’s histogram shows non-normal distributions for sale
prices, indicating a potential violation of the normality assumption.
This further justifies applying data transformations or exploring robust
alternatives.
Shapiro-Wilk Test:
# Shapiro-Wilk test for overall sale prices
shapiro.test(ames$sale_price)
##
## Shapiro-Wilk normality test
##
## data: ames$sale_price
## W = 0.87626, p-value < 2.2e-16
# Group-specific Shapiro-Wilk test
ames %>%
group_by(bldg_type) %>%
summarise(
p_value = shapiro.test(sale_price)$p.value
)
## # A tibble: 5 × 2
## bldg_type p_value
## <fct> <dbl>
## 1 OneFam 3.05e-40
## 2 TwoFmCon 1.36e- 1
## 3 Duplex 3.04e- 5
## 4 Twnhs 1.28e- 3
## 5 TwnhsE 7.05e- 8
Interpretation:
The p-values from the Shapiro-Wilk test indicate whether the data are consistent with a normal distribution:
A p-value < 0.05 means the null hypothesis of normality is rejected. This is the case for the overall sample and for every building type except TwoFmCon (p = 0.136), where normality cannot be rejected.
This result strengthens the case for applying transformations or using non-parametric tests.
Log Transformation: A log transformation compresses the range of values, reducing skewness and improving normality.
# Log-transform the sale price
ames <- ames %>%
mutate(log_sale_price = log(sale_price))
# Recheck normality with transformed data
ggplot(ames, aes(x = log_sale_price)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = 'lightgreen', color = 'white') +
  geom_density(color = 'blue', linewidth = 1) +
  labs(title = "Log-Transformed Sale Price Distribution", x = "Log Sale Price", y = "Density")
Explanation:
The log transformation addresses skewness, making the distribution
closer to normal. Improved normality ensures better adherence to ANOVA
assumptions.
Bootstrap Sampling: Bootstrap sampling provides a robust alternative when assumptions are difficult to meet.
# Bootstrap the mean sale price: resample with replacement and record each sample mean
set.seed(123)
bootstrap_samples <- replicate(1000, mean(sample(ames$sale_price, replace = TRUE)))
# Visualize the bootstrap distribution
ggplot(data.frame(mean_price = bootstrap_samples), aes(x = mean_price)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = 'orange', color = 'black') +
  geom_density(color = 'blue', linewidth = 1) +
  labs(title = "Bootstrap Distribution of Mean Sale Prices", x = "Bootstrap Mean", y = "Density")
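Explanation:
Bootstrap sampling generates a distribution of means through resampling, which approximates the sampling distribution of the mean sale price rather than the distribution of individual prices. It supports confidence intervals for the mean without assuming normal data, but it does not make the underlying prices normal.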
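A bootstrap distribution is typically summarized with a confidence interval. The sketch below computes a 95% percentile interval for the mean on simulated right-skewed prices (prices and boot_means are hypothetical names; the lognormal parameters are arbitrary) and compares the bootstrap spread with the textbook standard error.

```r
# Hypothetical right-skewed prices drawn from a lognormal distribution
set.seed(123)
prices <- rlnorm(500, meanlog = 12, sdlog = 0.4)

# Resample with replacement and record the mean of each resample
boot_means <- replicate(2000, mean(sample(prices, replace = TRUE)))

# 95% percentile confidence interval for the mean price
ci <- quantile(boot_means, c(0.025, 0.975))
ci

# The bootstrap standard error should be close to sd / sqrt(n)
c(bootstrap_se = sd(boot_means),
  formula_se   = sd(prices) / sqrt(length(prices)))
```

With the report’s data, the same quantile() call could be applied to bootstrap_samples to obtain an interval for the mean sale price.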
Certain neighborhoods may be overrepresented or underrepresented in the dataset. This could bias predictions and lead to inaccurate pricing recommendations, especially if the model overfits to wealthier areas.
Proposed Solution:
Improper use of this model could result in inflated home prices, exacerbating gentrification or making housing less affordable for lower-income families.
Proposed Solution:
Several features, like kitchen_quality or basement_exposure, are missing for many homes, which could influence the reliability of the model.
Proposed Solution: