Goal 1: Business Scenario

Customer or Audience:

The target audience for this analysis is a real estate analytics firm operating in Ames, Iowa. The firm’s clients include:

  • Homebuyers: Understand pricing dynamics across different neighborhoods to identify affordable homes of high quality.

  • Investors: Make data-driven decisions on which properties are most likely to appreciate in value based on factors such as size, quality, and location.

  • Property Managers: Set competitive rent prices by analyzing property value fluctuations in different areas.

Problem Statement:

The firm needs to identify key factors that influence housing prices in Ames, Iowa, to help clients (homebuyers, investors, and property managers) make better decisions about home buying, property investments, and setting rental rates.

For example:

  • “Should we invest in larger homes with bigger lots, or are smaller homes in premium neighborhoods more likely to appreciate?”

  • “What combination of features would yield the highest return on investment in terms of sale price?”

Scope:

The data used in the analysis includes crucial variables like sale_price, overall_qual, year_built, gr_liv_area, and neighborhood. The lab focuses on analyzing these factors using linear regression to predict sale prices and ANOVA to assess the price variations across neighborhoods or other categorical variables. The assumption is that macroeconomic factors, like interest rates, remain constant and do not dramatically influence housing prices during this analysis.

We should also consider the limitations of these assumptions:

  • The effects of external factors such as interest rates, new infrastructure projects, and economic recessions may not be captured in this model.

  • We assume homogeneity within neighborhoods, but some areas may be experiencing rapid gentrification or changes in demand that are not reflected in the data.

Objective:

The primary goal of this analysis is to identify the most significant factors that influence housing prices. The success criteria for this project will be:

  • Achieving an R-squared value of 0.85 or higher to ensure that the model accounts for a large proportion of the variance in home prices.

  • Clear interpretability of the variables involved, ensuring that clients can confidently understand how different property features affect prices.

  • Actionable insights, such as identifying the most valuable property features (e.g., quality, square footage, or neighborhood).

Business Questions and Answers:

1. Which features influence home prices the most?

  • Answer: Overall quality (overall_qual) and living area (gr_liv_area) are the most significant predictors of home prices. These features strongly correlate with sale prices, with high-quality homes and larger living spaces fetching higher prices.

  • Enhancement: A variable importance plot can be created to show the relative significance of each feature.

2. Do high-quality neighborhoods always command higher prices?

  • Answer: The ANOVA results show that average sale prices differ significantly across neighborhoods. The dataset does not record amenities or crime rates, so the analysis cannot attribute those differences to specific neighborhood characteristics. Neighborhood alone is also not sufficient to explain price; overall_qual and gr_liv_area play major roles.

  • Enhancement: Heatmaps and boxplots can be used to visualize price variations by neighborhood.

3. Are larger homes always better investments?

  • Answer: Larger homes (gr_liv_area) typically have higher sale prices, but the relationship isn’t purely linear. After a certain point, increasing square footage may not lead to proportional increases in price, especially if the quality of the home isn’t aligned.

  • Enhancement: Interaction terms (e.g., gr_liv_area with overall_qual) can model how the value of extra space depends on quality, and a log or polynomial term for gr_liv_area can capture diminishing returns to size.

4. How do outliers impact the model?

  • Answer: Extreme values, like luxury homes or significant renovations, can heavily influence the model’s coefficients. Using Cook’s Distance, we can identify these outliers and evaluate their impact on the model’s predictive accuracy.

  • Enhancement: A comparison of model performance with and without outliers can highlight their effect on predictions.

5. Are assumptions of normality and constant variance met?

  • Answer: The residuals of the model slightly deviate from normality, and some variance heterogeneity is observed. This is typical in real estate pricing, where outliers and skewed distributions are common.

  • Enhancement: We might apply log transformations to the dependent variable (sale_price) or use robust regression techniques to address these issues.


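# load packages, then make a simpler name for (and a copy of) the Ames data
library(AmesHousing)  # make_ames()
library(tidyverse)    # dplyr verbs and ggplot2
library(car)          # vif(), leveneTest()
ames <- make_ames()
ames <- ames |> rename_with(tolower)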

Goal 2: Model Critique

Linear Regression

  • It would help to diagnose the linear regression model (e.g., creating diagnostic plots such as residuals vs. fitted values, etc.) to verify the validity of the model and provide any necessary caveats about application of the model (e.g., model may be less valid for price points above certain values, etc.)

Issue 1: Multicollinearity

Current Problem: Some predictors, such as living area and lot size, or construction year and neighborhood, may be highly correlated. Multicollinearity inflates the standard errors of the affected coefficients and can distort their interpretation in the linear regression model.

Proposed Improvement: Use Variance Inflation Factor (VIF) to detect multicollinearity. If variables exhibit high multicollinearity, we can remove them or combine them for better interpretation.

# Initial model using a subset of important variables
model <- lm(sale_price ~ lot_area + overall_qual + year_built + neighborhood + gr_liv_area, data = ames)
summary(model)
## 
## Call:
## lm(formula = sale_price ~ lot_area + overall_qual + year_built + 
##     neighborhood + gr_liv_area, data = ames)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -433945  -14128      -4   13029  220440 
## 
## Coefficients:
##                                                          Estimate    Std. Error
## (Intercept)                                         -985130.24078   87233.01744
## lot_area                                                  0.85973       0.08526
## overall_qualPoor                                      27309.33281   18353.34931
## overall_qualFair                                      37759.90434   16816.65335
## overall_qualBelow_Average                             50901.12299   16201.98462
## overall_qualAverage                                   61850.89077   16111.82067
## overall_qualAbove_Average                             72500.03933   16147.60278
## overall_qualGood                                      86612.66728   16231.56696
## overall_qualVery_Good                                122675.11905   16364.90272
## overall_qualExcellent                                194712.85345   16685.64248
## overall_qualVery_Excellent                           240489.25149   17486.35502
## year_built                                              505.73848      43.76797
## neighborhoodCollege_Creek                              4149.52255    3030.96536
## neighborhoodOld_Town                                  -7787.25970    3099.84543
## neighborhoodEdwards                                  -14280.93201    2802.84641
## neighborhoodSomerset                                   6988.98348    3566.05614
## neighborhoodNorthridge_Heights                        36165.39073    3979.12108
## neighborhoodGilbert                                   -7368.29717    3421.43214
## neighborhoodSawyer                                    -1525.07862    3041.00079
## neighborhoodNorthwest_Ames                             1484.75499    3361.07945
## neighborhoodSawyer_West                               -5376.78353    3542.11251
## neighborhoodMitchell                                   -919.39583    3499.46067
## neighborhoodBrookside                                  1857.69642    3668.38059
## neighborhoodCrawford                                  31126.88088    3665.77660
## neighborhoodIowa_DOT_and_Rail_Road                   -12399.82709    3976.39974
## neighborhoodTimberland                                14478.49148    4519.05412
## neighborhoodNorthridge                                44901.29315    4810.12730
## neighborhoodStone_Brook                               41243.38019    5428.86970
## neighborhoodSouth_and_West_of_Iowa_State_University  -11411.06633    5078.16322
## neighborhoodClear_Creek                               16425.40788    5251.25950
## neighborhoodMeadow_Village                           -26789.15023    5728.06972
## neighborhoodBriardale                                -33560.75870    6121.96694
## neighborhoodBloomington_Heights                        2757.27173    6650.72479
## neighborhoodVeenker                                   21184.89146    6901.45629
## neighborhoodNorthpark_Villa                          -14477.30364    6954.78656
## neighborhoodBlueste                                  -21577.25985   10342.49128
## neighborhoodGreens                                    -6326.75819   11734.74987
## neighborhoodGreen_Hills                               93344.12831   22762.73479
## neighborhoodLandmark                                 -27486.45646   32075.13584
## gr_liv_area                                              50.05646       1.60612
##                                                     t value Pr(>|t|)    
## (Intercept)                                         -11.293  < 2e-16 ***
## lot_area                                             10.084  < 2e-16 ***
## overall_qualPoor                                      1.488 0.136866    
## overall_qualFair                                      2.245 0.024819 *  
## overall_qualBelow_Average                             3.142 0.001697 ** 
## overall_qualAverage                                   3.839 0.000126 ***
## overall_qualAbove_Average                             4.490 7.41e-06 ***
## overall_qualGood                                      5.336 1.02e-07 ***
## overall_qualVery_Good                                 7.496 8.68e-14 ***
## overall_qualExcellent                                11.669  < 2e-16 ***
## overall_qualVery_Excellent                           13.753  < 2e-16 ***
## year_built                                           11.555  < 2e-16 ***
## neighborhoodCollege_Creek                             1.369 0.171092    
## neighborhoodOld_Town                                 -2.512 0.012054 *  
## neighborhoodEdwards                                  -5.095 3.71e-07 ***
## neighborhoodSomerset                                  1.960 0.050108 .  
## neighborhoodNorthridge_Heights                        9.089  < 2e-16 ***
## neighborhoodGilbert                                  -2.154 0.031356 *  
## neighborhoodSawyer                                   -0.502 0.616054    
## neighborhoodNorthwest_Ames                            0.442 0.658704    
## neighborhoodSawyer_West                              -1.518 0.129134    
## neighborhoodMitchell                                 -0.263 0.792781    
## neighborhoodBrookside                                 0.506 0.612609    
## neighborhoodCrawford                                  8.491  < 2e-16 ***
## neighborhoodIowa_DOT_and_Rail_Road                   -3.118 0.001837 ** 
## neighborhoodTimberland                                3.204 0.001371 ** 
## neighborhoodNorthridge                                9.335  < 2e-16 ***
## neighborhoodStone_Brook                               7.597 4.07e-14 ***
## neighborhoodSouth_and_West_of_Iowa_State_University  -2.247 0.024710 *  
## neighborhoodClear_Creek                               3.128 0.001778 ** 
## neighborhoodMeadow_Village                           -4.677 3.05e-06 ***
## neighborhoodBriardale                                -5.482 4.57e-08 ***
## neighborhoodBloomington_Heights                       0.415 0.678479    
## neighborhoodVeenker                                   3.070 0.002163 ** 
## neighborhoodNorthpark_Villa                          -2.082 0.037464 *  
## neighborhoodBlueste                                  -2.086 0.037041 *  
## neighborhoodGreens                                   -0.539 0.589827    
## neighborhoodGreen_Hills                               4.101 4.23e-05 ***
## neighborhoodLandmark                                 -0.857 0.391549    
## gr_liv_area                                          31.166  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31990 on 2890 degrees of freedom
## Multiple R-squared:  0.8418, Adjusted R-squared:  0.8396 
## F-statistic: 394.2 on 39 and 2890 DF,  p-value: < 2.2e-16
# Checking multicollinearity using VIF
vif_values <- vif(model)
vif_values
##                   GVIF Df GVIF^(1/(2*Df))
## lot_area      1.291725  1        1.136541
## overall_qual  6.016995  9        1.104839
## year_built    5.015338  1        2.239495
## neighborhood 17.765211 27        1.054727
## gr_liv_area   1.886614  1        1.373541
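# Rule of thumb: VIF > 5 flags a predictor. For multi-df factors, compare
# GVIF^(1/(2*Df)) with sqrt(5) (about 2.24) rather than the raw GVIF.

Outcome: On the adjusted scale, every predictor is at or below the threshold. year_built (2.24, i.e. a VIF of about 5.0) is borderline, and the large raw GVIF for neighborhood (17.8) mostly reflects its 27 degrees of freedom. No predictor needs to be removed here; if one did, removing or combining collinear variables would improve the stability and interpretability of the coefficients.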
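
To make the VIF numbers less of a black box, the sketch below computes VIFs by hand from auxiliary regressions, using VIF_j = 1 / (1 - R_j^2). It is a minimal illustration on synthetic data: the data frame toy and the helper manual_vif() are hypothetical and not part of the report's analysis, and the sketch covers numeric predictors only (it does not reproduce the GVIF adjustment for factors).

```r
# Synthetic stand-in data; column names mirror the report, values are made up.
set.seed(42)
n <- 500
toy <- data.frame(
  gr_liv_area = rnorm(n, mean = 1500, sd = 400),
  year_built  = round(runif(n, 1900, 2010))
)
# lot_area is deliberately built to be correlated with gr_liv_area
toy$lot_area <- 5 * toy$gr_liv_area + rnorm(n, sd = 1500)

# Hypothetical helper: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from
# regressing predictor j on all the other predictors.
manual_vif <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    aux <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

vif_toy <- manual_vif(toy)
print(round(vif_toy, 2))
```

The two correlated columns should show clearly higher values than year_built, which is generated independently of them.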

Issue 2: Adding Interaction Terms

Current Problem: The current linear model assumes that each predictor's effect on sale_price is purely additive. However, some features, such as overall_qual and gr_liv_area, likely interact in their effect on the sale price.

Proposed Solution: Introduce interaction terms to better understand how combinations of factors influence the sale price. This would better model the reality of home pricing, where high-quality larger homes fetch a premium.

# Adding interaction terms
interaction_model <- lm(sale_price ~ gr_liv_area * overall_qual + year_built + lot_area, data = ames)
summary(interaction_model)
## 
## Call:
## lm(formula = sale_price ~ gr_liv_area * overall_qual + year_built + 
##     lot_area, data = ames)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -302492  -15189    -411   13795  337746 
## 
## Coefficients:
##                                              Estimate     Std. Error t value
## (Intercept)                            -1042240.79802    60607.09784 -17.197
## gr_liv_area                                  16.71845       35.34069   0.473
## overall_qualPoor                          25701.43607    47849.29710   0.537
## overall_qualFair                          30503.59136    38842.08207   0.785
## overall_qualBelow_Average                 29700.06108    36173.19576   0.821
## overall_qualAverage                       56750.72908    35639.54304   1.592
## overall_qualAbove_Average                 35075.39015    35725.76194   0.982
## overall_qualGood                          38972.23249    35929.40330   1.085
## overall_qualVery_Good                     54131.22365    36192.98044   1.496
## overall_qualExcellent                     93237.80568    38309.92115   2.434
## overall_qualVery_Excellent               450931.25860    39510.13922  11.413
## year_built                                  545.16616       25.34668  21.508
## lot_area                                      1.11209        0.07987  13.924
## gr_liv_area:overall_qualPoor                 -9.89282       58.52418  -0.169
## gr_liv_area:overall_qualFair                 13.06821       38.09593   0.343
## gr_liv_area:overall_qualBelow_Average        25.45885       35.84575   0.710
## gr_liv_area:overall_qualAverage              17.65133       35.44907   0.498
## gr_liv_area:overall_qualAbove_Average        42.18743       35.45511   1.190
## gr_liv_area:overall_qualGood                 50.72257       35.48477   1.429
## gr_liv_area:overall_qualVery_Good            67.35376       35.52197   1.896
## gr_liv_area:overall_qualExcellent            83.95258       35.97132   2.334
## gr_liv_area:overall_qualVery_Excellent      -40.45491       35.78317  -1.131
##                                        Pr(>|t|)    
## (Intercept)                              <2e-16 ***
## gr_liv_area                              0.6362    
## overall_qualPoor                         0.5912    
## overall_qualFair                         0.4323    
## overall_qualBelow_Average                0.4117    
## overall_qualAverage                      0.1114    
## overall_qualAbove_Average                0.3263    
## overall_qualGood                         0.2781    
## overall_qualVery_Good                    0.1349    
## overall_qualExcellent                    0.0150 *  
## overall_qualVery_Excellent               <2e-16 ***
## year_built                               <2e-16 ***
## lot_area                                 <2e-16 ***
## gr_liv_area:overall_qualPoor             0.8658    
## gr_liv_area:overall_qualFair             0.7316    
## gr_liv_area:overall_qualBelow_Average    0.4776    
## gr_liv_area:overall_qualAverage          0.6186    
## gr_liv_area:overall_qualAbove_Average    0.2342    
## gr_liv_area:overall_qualGood             0.1530    
## gr_liv_area:overall_qualVery_Good        0.0580 .  
## gr_liv_area:overall_qualExcellent        0.0197 *  
## gr_liv_area:overall_qualVery_Excellent   0.2583    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32320 on 2908 degrees of freedom
## Multiple R-squared:  0.8375, Adjusted R-squared:  0.8363 
## F-statistic: 713.7 on 21 and 2908 DF,  p-value: < 2.2e-16

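Outcome: Interaction terms let the price per square foot vary by quality level. In this fit, only the gr_liv_area:overall_qualExcellent term is significant at the 5% level (p = 0.0197), with the Very_Good term marginal (p = 0.0580). Because this model omits neighborhood, its adjusted R-squared (0.8363) is not directly comparable with that of the additive model (0.8396).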
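
One way to judge whether interaction terms earn their place is a partial F-test between nested models that differ only in those terms. The sketch below shows the mechanics on synthetic data in which the slope on living area really does depend on quality; the data frame toy and the model names additive and with_int are illustrative assumptions, not objects from the report.

```r
# Synthetic data with a true interaction by construction
set.seed(1)
n <- 600
toy <- data.frame(
  gr_liv_area  = runif(n, 800, 3000),
  overall_qual = factor(sample(c("Average", "Good", "Excellent"), n, replace = TRUE),
                        levels = c("Average", "Good", "Excellent"))
)
# Price per square foot rises with quality level
slope <- unname(c(Average = 50, Good = 70, Excellent = 110)[as.character(toy$overall_qual)])
toy$sale_price <- 40000 + slope * toy$gr_liv_area + rnorm(n, sd = 20000)

# Nested models: same predictors, with and without the interaction
additive <- lm(sale_price ~ gr_liv_area + overall_qual, data = toy)
with_int <- lm(sale_price ~ gr_liv_area * overall_qual, data = toy)

# Partial F-test: do the interaction terms jointly improve the fit?
cmp <- anova(additive, with_int)
print(cmp)
```

For the Ames models, the same comparison would require both fits to contain the same remaining predictors, so that the interaction is the only difference.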

Issue 3: Outlier Detection

Current Problem: Outliers, such as luxury homes or properties with significant renovations, can distort regression results, leading to biased coefficient estimates. It’s essential to assess the impact of these outliers.

Proposed Solution: We’ll use Cook’s Distance to identify influential observations and outliers that may unduly affect the model. This will ensure a robust pricing model by mitigating the impact of outliers.

# Cook's Distance for detecting outliers
cooks_d <- cooks.distance(model)
plot(cooks_d, type = "h", main = "Cook's Distance", ylab = "Cook's Distance")
abline(h = 4/nrow(ames), col = "red")

Outcome: This approach helps detect extreme cases that might distort the model’s predictive power and allow us to refine the dataset for better accuracy.
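
A natural follow-up is to refit the model without the flagged observations and compare coefficients. The sketch below does this on synthetic data with one planted influential point; toy, fit_all, and fit_trim are hypothetical names, and the 4/n cutoff is the same rule of thumb used in the plot above.

```r
# Synthetic data: a clean linear relationship (true slope 90) plus one planted point
set.seed(7)
n <- 200
toy <- data.frame(gr_liv_area = runif(n, 800, 2500))
toy$sale_price <- 30000 + 90 * toy$gr_liv_area + rnorm(n, sd = 15000)
# One very large, underpriced home: high leverage and a large residual
toy <- rbind(toy, data.frame(gr_liv_area = 6000, sale_price = 150000))

fit_all <- lm(sale_price ~ gr_liv_area, data = toy)
cd      <- cooks.distance(fit_all)
flagged <- which(cd > 4 / nrow(toy))          # 4/n rule of thumb

# Refit on the rows that were not flagged
keep_rows <- setdiff(seq_len(nrow(toy)), flagged)
fit_trim  <- lm(sale_price ~ gr_liv_area, data = toy[keep_rows, ])

# Side-by-side coefficients: a large shift means the flagged rows drive the fit
print(rbind(all = coef(fit_all), trimmed = coef(fit_trim)))
```

Flagged rows are candidates for inspection rather than automatic deletion; a luxury home may be a valid sale that the model simply describes poorly.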

Issue 4: Heteroscedasticity and Residual Analysis

Current Problem: Assuming homoscedasticity (constant variance of residuals) may lead to inaccurate estimates if violated. It’s essential to check whether residuals exhibit heteroscedasticity, which could invalidate inference.

Proposed Solution: We’ll perform a Breusch-Pagan test for heteroscedasticity and assess the residual distribution with Q-Q plots to ensure that assumptions of linear regression hold.

# Checking for homoscedasticity
library(lmtest)
bptest(model)  # Breusch-Pagan Test
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 804.45, df = 39, p-value < 2.2e-16
# Plot residuals
plot(model, which = 1)  # Residuals vs Fitted

plot(model, which = 2)  # Q-Q plot for normality
## Warning: not plotting observations with leverage one:
##   2789

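Outcome: The Breusch-Pagan test rejects constant variance (BP = 804.45, df = 39, p < 2.2e-16), so the standard errors and p-values reported for this model should be treated with caution. Remedies include log-transforming sale_price or using heteroscedasticity-robust standard errors. Once homoscedasticity and approximate normality of the residuals hold, the usual statistical tests can be trusted.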
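
To show what the studentized Breusch-Pagan statistic measures, and how a log response can help, the sketch below computes it by hand as n times the R-squared from regressing the squared residuals on the predictors. The data are synthetic with multiplicative noise, and bp_by_hand() is a hypothetical helper written for this illustration, not a function from lmtest.

```r
# Synthetic data: multiplicative noise, so the spread of price grows with its level
set.seed(11)
n <- 500
toy <- data.frame(gr_liv_area = runif(n, 800, 3000))
toy$sale_price <- exp(11 + 0.0005 * toy$gr_liv_area + rnorm(n, sd = 0.25))

# Hypothetical helper: studentized (Koenker) Breusch-Pagan statistic,
# n * R^2 from regressing squared residuals on the model's predictors.
bp_by_hand <- function(fit) {
  mf       <- model.frame(fit)
  aux_data <- mf[, -1, drop = FALSE]          # predictors only
  aux_data$e2 <- residuals(fit)^2
  aux  <- lm(e2 ~ ., data = aux_data)
  stat <- nrow(mf) * summary(aux)$r.squared
  df   <- length(coef(aux)) - 1
  c(BP = stat, df = df, p_value = pchisq(stat, df, lower.tail = FALSE))
}

fit_raw <- lm(sale_price ~ gr_liv_area, data = toy)
fit_log <- lm(log(sale_price) ~ gr_liv_area, data = toy)

bp_raw <- bp_by_hand(fit_raw)
bp_log <- bp_by_hand(fit_log)
print(rbind(raw = bp_raw, log = bp_log))
```

On data like these, the statistic is large on the raw scale and much smaller on the log scale. Whether the same holds for the Ames model would need to be checked by refitting it with log(sale_price).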

ANOVA Explanation:

ANOVA (Analysis of Variance) is a powerful statistical technique used to compare the means of multiple groups and determine if at least one group mean is significantly different from others. In the context of analyzing sale prices in the Ames dataset, ANOVA helps us identify variations due to factors like neighborhood or building type.

The key assumptions for ANOVA include:

  1. Normality: The data within each group should follow a normal distribution.

  2. Homoscedasticity: The variance across groups should be roughly equal.

  3. Independence: Observations within and across groups must be independent.

Improvement Suggestions for ANOVA Analysis:

  1. Provide a clear explanation of group definitions, such as building types or neighborhoods, to contextualize the analysis.

  2. Include detailed calculations for total variance, group variance, and error variance to enhance transparency.

  3. Offer multiple ways to test ANOVA assumptions and explore corrective measures for violations.

  4. Address non-normal distributions by applying transformations or alternative methods.

  5. Consider additional variables, like zoning or proximity to amenities, to enrich insights.
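
The variance decomposition mentioned in suggestion 2 can be written out directly. The sketch below computes the total, between-group, and within-group sums of squares by hand and checks the resulting F statistic against aov(). It uses a small synthetic data frame (toy, with three made-up neighborhoods) as a stand-in for the Ames data.

```r
# Synthetic data: three neighborhoods with different mean prices
set.seed(3)
toy <- data.frame(
  neighborhood = factor(rep(c("A", "B", "C"), times = c(40, 60, 50))),
  sale_price   = c(rnorm(40, 150000, 20000),
                   rnorm(60, 180000, 20000),
                   rnorm(50, 240000, 20000))
)

grand_mean  <- mean(toy$sale_price)
group_means <- tapply(toy$sale_price, toy$neighborhood, mean)
group_n     <- tapply(toy$sale_price, toy$neighborhood, length)

# Total = between-group + within-group (error) sums of squares
ss_total   <- sum((toy$sale_price - grand_mean)^2)
ss_between <- sum(group_n * (group_means - grand_mean)^2)
ss_within  <- sum((toy$sale_price - group_means[as.character(toy$neighborhood)])^2)

k <- nlevels(toy$neighborhood)   # number of groups
n <- nrow(toy)                   # number of observations
f_stat <- (ss_between / (k - 1)) / (ss_within / (n - k))

# Cross-check against the built-in ANOVA table
aov_f <- summary(aov(sale_price ~ neighborhood, data = toy))[[1]][["F value"]][1]
print(c(by_hand = f_stat, aov = aov_f))
```

The two F values should agree up to floating-point error, which confirms that the ANOVA table is just this decomposition divided by the corresponding degrees of freedom.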

Performing ANOVA

Step 1: ANOVA for Neighborhood Effects on Sale Prices

# Perform ANOVA for neighborhood effect on sale prices
anova_model <- aov(sale_price ~ neighborhood, data = ames)
summary(anova_model)
##                Df         Sum Sq      Mean Sq F value Pr(>F)    
## neighborhood   27 10716004520761 396889056324   144.4 <2e-16 ***
## Residuals    2902  7976532589590   2748632870                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Check for homoscedasticity using Levene's Test
leveneTest(sale_price ~ neighborhood, data = ames)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value    Pr(>F)    
## group   27  25.225 < 2.2e-16 ***
##       2902                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Diagnostic plots for residual analysis
par(mfrow = c(2, 2))
plot(anova_model)
## Warning: not plotting observations with leverage one:
##   2789

Explanation:

  1. The ANOVA output provides a summary of the variation in sale prices explained by neighborhood differences.

  2. A significant p-value (p < 0.05) in the ANOVA summary indicates statistically significant price differences between neighborhoods.

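  3. Levene’s Test checks for equal variances (homoscedasticity). A p-value > 0.05 would indicate that the assumption of equal variances is plausible. Here p < 2.2e-16, so the assumption is violated and the ANOVA F-test should be interpreted with caution.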

  4. Diagnostic plots examine residual patterns and normality visually.
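
When group variances are unequal, Welch's one-way test and the Kruskal-Wallis rank test are two standard alternatives available in base R. The sketch below runs both on synthetic data with unequal spreads. The data frame toy is made up, and the filtering step is an assumption about how to handle single-sale groups: oneway.test() needs at least two observations per group, and the leverage-one warning above suggests at least one neighborhood has a single sale.

```r
# Synthetic data: unequal variances across groups, plus one single-sale group "D"
set.seed(5)
toy <- data.frame(
  neighborhood = factor(c(rep(c("A", "B", "C"), times = c(80, 80, 80)), "D")),
  sale_price   = c(rnorm(80, 150000, 10000),
                   rnorm(80, 170000, 30000),
                   rnorm(80, 230000, 60000),
                   300000)
)

# Drop groups with fewer than two sales before running Welch's test
counts <- table(toy$neighborhood)
keep   <- toy$neighborhood %in% names(counts[counts >= 2])
toy2   <- droplevels(toy[keep, ])

# Welch's ANOVA: does not assume equal variances
welch <- oneway.test(sale_price ~ neighborhood, data = toy2, var.equal = FALSE)

# Kruskal-Wallis: rank-based, does not assume normality
kruskal <- kruskal.test(sale_price ~ neighborhood, data = toy2)

print(welch)
print(kruskal)
```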

Assumption Checks:

1. Normality Check for Overall Sale Prices

# Visualize overall sale price distribution
ames %>%
  ggplot(aes(x = sale_price)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = 'lightblue', color = 'white') +
  geom_density(color = 'red', linewidth = 1) +
  labs(title = "Distribution of Sale Prices", x = "Sale Price", y = "Density")

Explanation:
The histogram shows a right-skewed distribution, which is common for price data as it is bounded by zero. Skewness can affect normality, necessitating transformations like logarithms or non-parametric approaches.

2. Group-Level Normality Check

# Visualize sale price distributions by building type
ames %>%
  ggplot(aes(x = sale_price)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20, fill = 'lightblue', color = 'white') +
  geom_density(color = 'red', linewidth = 1) +
  facet_wrap(~ bldg_type, scales = "free_y") +
  labs(title = "Sale Price Distribution by Building Type", x = "Sale Price", y = "Density")

Explanation:
Each building type’s histogram shows non-normal distributions for sale prices, indicating a potential violation of the normality assumption. This further justifies applying data transformations or exploring robust alternatives.

Shapiro-Wilk Test:

# Shapiro-Wilk test for overall sale prices
shapiro.test(ames$sale_price)
## 
##  Shapiro-Wilk normality test
## 
## data:  ames$sale_price
## W = 0.87626, p-value < 2.2e-16
# Group-specific Shapiro-Wilk test
ames %>%
  group_by(bldg_type) %>%
  summarise(
    p_value = shapiro.test(sale_price)$p.value
  )
## # A tibble: 5 × 2
##   bldg_type  p_value
##   <fct>        <dbl>
## 1 OneFam    3.05e-40
## 2 TwoFmCon  1.36e- 1
## 3 Duplex    3.04e- 5
## 4 Twnhs     1.28e- 3
## 5 TwnhsE    7.05e- 8

Interpretation:
The p-values from the Shapiro-Wilk test indicate whether the data are consistent with a normal distribution:

  • A p-value < 0.05 rejects the null hypothesis of normality. This holds for the overall sale prices and for four of the five building types; TwoFmCon (p = 0.136) is the exception.

  • This result strengthens the case for applying transformations or using non-parametric tests.

Handling Violations of Normality

Log Transformation: A log transformation compresses the range of values, reducing skewness and improving normality.

# Log-transform the sale price
ames <- ames %>%
  mutate(log_sale_price = log(sale_price))

# Recheck normality with transformed data
ggplot(ames, aes(x = log_sale_price)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = 'lightgreen', color = 'white') +
  geom_density(color = 'blue', linewidth = 1) +
  labs(title = "Log-Transformed Sale Price Distribution", x = "Log Sale Price", y = "Density")

Explanation:
The log transformation addresses skewness, making the distribution closer to normal. Improved normality ensures better adherence to ANOVA assumptions.
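
The reduction in skewness can be quantified as well as plotted. The sketch below computes the sample skewness before and after a log transform on synthetic right-skewed prices; skewness() is a small hypothetical helper (the moment-based definition) and toy_price is made-up data, so the numbers say nothing about the Ames values themselves.

```r
# Hypothetical helper: moment-based sample skewness (0 for a symmetric distribution)
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

# Synthetic right-skewed prices (log-normal by construction)
set.seed(9)
toy_price <- exp(rnorm(2000, mean = 12, sd = 0.4))

skew_raw <- skewness(toy_price)
skew_log <- skewness(log(toy_price))
print(c(raw = skew_raw, log = skew_log))
```

Applying the same helper to ames$sale_price and ames$log_sale_price would give a one-number summary to accompany the two histograms.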

Bootstrap Sampling: Bootstrap resampling estimates the sampling distribution of a statistic, such as the mean sale price, without relying on a normality assumption.

# Bootstrap the mean sale price (resampling with replacement)
set.seed(123)
bootstrap_samples <- replicate(1000, mean(sample(ames$sale_price, replace = TRUE)))

# Visualize the bootstrap distribution
ggplot(data.frame(mean_price = bootstrap_samples), aes(x = mean_price)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = 'orange', color = 'black') +
  geom_density(color = 'blue', linewidth = 1) +
  labs(title = "Bootstrap Distribution of Mean Sale Prices", x = "Bootstrap Mean", y = "Density")

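Explanation:
Bootstrap sampling generates a distribution of means through resampling. This distribution approximates the sampling distribution of the mean, not the distribution of individual sale prices, and can be used to obtain standard errors and confidence intervals. It is particularly useful when distributional assumptions are in doubt.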
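
A common use of the bootstrap distribution is a percentile confidence interval. The sketch below builds a 95% interval for the mean from resampled means, using synthetic right-skewed prices (toy_price) as a stand-in for ames$sale_price. The 2000 resamples and the percentile method are illustrative choices, not requirements.

```r
# Synthetic right-skewed prices as a stand-in for the Ames sale prices
set.seed(123)
toy_price <- exp(rnorm(300, mean = 12, sd = 0.4))

# Resample with replacement and record the mean of each resample
boot_means <- replicate(2000, mean(sample(toy_price, replace = TRUE)))

# 95% percentile interval for the mean sale price
ci <- quantile(boot_means, probs = c(0.025, 0.975))
print(ci)
```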


Goal 3: Ethical and Epistemological Concerns

Bias in Neighborhood Representation:

Certain neighborhoods may be overrepresented or underrepresented in the dataset. This could bias predictions and lead to inaccurate pricing recommendations, especially if the model overfits to wealthier areas.

Proposed Solution:

  • Balance the Dataset: We can either oversample underrepresented neighborhoods or adjust for the population imbalances using weighted regression techniques.
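
One simple form of the weighting idea is to give each sale a weight inversely proportional to the number of sales in its neighborhood, so that every neighborhood contributes the same total weight to the fit. The sketch below is a minimal illustration on synthetic data; the data frame toy, the column w, and the choice of inverse-frequency weights are all assumptions, and reweighting changes which population the estimates describe, so it should be a deliberate choice.

```r
# Synthetic data: one heavily represented and one sparsely represented neighborhood
set.seed(21)
toy <- data.frame(
  neighborhood = factor(rep(c("Large", "Small"), times = c(450, 50))),
  gr_liv_area  = runif(500, 800, 3000)
)
toy$sale_price <- 50000 + 80 * toy$gr_liv_area + rnorm(500, sd = 20000)

# Weight each sale by 1 / (number of sales in its neighborhood)
n_by_hood <- table(toy$neighborhood)
toy$w <- 1 / as.numeric(n_by_hood[as.character(toy$neighborhood)])

# Total weight per neighborhood is now equal
weight_totals <- tapply(toy$w, toy$neighborhood, sum)
print(weight_totals)

# Weighted least squares using the inverse-frequency weights
weighted_fit <- lm(sale_price ~ gr_liv_area + neighborhood, data = toy, weights = w)
print(coef(weighted_fit))
```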

Impact on the Housing Market:

Improper use of this model could result in inflated home prices, exacerbating gentrification or making housing less affordable for lower-income families.

Proposed Solution:

  • Price Cap: Implement caps on predicted home prices in vulnerable neighborhoods to avoid pushing prices beyond what’s affordable for the community.

Data Completeness:

Several features, like kitchen_quality or basement_exposure, are missing for many homes, which could influence the reliability of the model.

Proposed Solution:

  • Imputation: Use imputation techniques like k-nearest neighbors (KNN) or multiple imputation to fill in missing values, ensuring more comprehensive and robust predictions.