library(ggthemes)
library(ggrepel)
## Loading required package: ggplot2
library(AmesHousing)
library(boot)
library(lindia)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ lubridate 1.9.4 ✔ tibble 3.3.1
## ✔ purrr 1.2.1 ✔ tidyr 1.3.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(broom)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
##
## The following object is masked from 'package:boot':
##
## logit
library(knitr)
library(GGally)
Define Audience, Problem Statement, Scope, Analysis, etc
Hermes Real Estate, a housing developer that is seeking to purchase investment properties in Ames, IA. They get things done quick.
Hermes Real Estate LLC needs to identify 10 houses with an asking sale price undervalued by at least $30,000 by May 31st, 2026. To retain majority shares, Hermes’ owner, Paul, must repay a previous loan, for which he is $250,000 short. Identifying properties by May 31, 2026, allows for purchasing, renovating, and flipping before the repayment due date of May 31, 2027.
Response Variable: Sale Price (log-converted)
Explanatory Variables: (See Model Critique)
Linear Regression Model to determine the predicted sale price. Houses for sale at $30k or less than the model’s predicted price are a good option for our client.
Regression is chosen given the goal of predicting Sale Price based on a set of measurable variables. When compared to the listed sale price, we are able to evaluate which houses are undervalued. In a real-world situation, the regression model would also be easy to fit to other cities or situations where predicting the estimated marke-value of a house is useful.
The data is complete, i.e., there are none (or very few) missing or NA values. The linear and independence assumptions are met. (to be checked with a Correlation Matrix) The neighbors variable is worth including in the regression model (to be checked with ANOVA). Errors on ‘predicted values’ are distributed heterogenously (at random).
Were we able to provide the client with a model that accurately estimates the market value of the house(s) of interest?
To improve the model, our first step is improving variable selection with what real-world observations can tell us may be most relevant.
The selected variables are: Square Footage Year Remodeled Lot Area Overall Condition (1-10 ordinal) Overall Quality (1-10 ordinal) Number of Cars Garage can fit Neighborhood Year Built Electrical (to ordinal variable) Year Sold (last time sold)
We also selected variables based on what may be most useful for the client. For example, a house that is very cheap but needs the entire electrical rewired will cost a lot more to renovate before re-selling, therefore ultimately reducing profit. Knowing how that affects price can help the client make a decision the “sweet spot” of price-versus-condition.
Note: By applying categorical data, R will approach this data by constructing a dummy binary variable for each category. This will be applied specifically on the neighborhoods and electrical variables.
# Import data
ames <- make_ames() |>
rename_with(tolower)
# Prepare model data
ames_model <- ames |>
mutate(
log_sale_price = log(sale_price),
age_2026 = 2026 - year_built,
overall_cond_num = as.numeric(overall_cond),
overall_qual_num = as.numeric(overall_qual)
) |>
select(
sale_price,
log_sale_price,
gr_liv_area,
total_bsmt_sf,
lot_area,
year_remod_add,
age_2026,
overall_cond_num,
overall_qual_num,
garage_cars,
neighborhood,
electrical,
year_sold
) |>
na.omit()
# View data
glimpse(ames_model)
## Rows: 2,930
## Columns: 13
## $ sale_price <int> 215000, 105000, 172000, 244000, 189900, 195500, 21350…
## $ log_sale_price <dbl> 12.27839, 11.56172, 12.05525, 12.40492, 12.15425, 12.…
## $ gr_liv_area <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616, …
## $ total_bsmt_sf <dbl> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1595, 99…
## $ lot_area <int> 31770, 11622, 14267, 11160, 13830, 9978, 4920, 5005, …
## $ year_remod_add <int> 1960, 1961, 1958, 1968, 1998, 1998, 2001, 1992, 1996,…
## $ age_2026 <dbl> 66, 65, 68, 58, 29, 28, 25, 34, 31, 27, 33, 34, 28, 3…
## $ overall_cond_num <dbl> 5, 6, 6, 5, 5, 6, 5, 5, 5, 5, 5, 7, 5, 5, 5, 5, 7, 2,…
## $ overall_qual_num <dbl> 6, 5, 6, 7, 5, 6, 8, 8, 8, 7, 6, 6, 6, 7, 8, 8, 8, 9,…
## $ garage_cars <dbl> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 3,…
## $ neighborhood <fct> North_Ames, North_Ames, North_Ames, North_Ames, Gilbe…
## $ electrical <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrk…
## $ year_sold <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,…
cor_matrix <- ames_model |>
select(
sale_price,
gr_liv_area,
total_bsmt_sf,
lot_area,
year_remod_add,
age_2026,
overall_cond_num,
overall_qual_num,
garage_cars,
year_sold
) |>
cor(use = "pairwise.complete.obs") |>
round(2)
cor_matrix
## sale_price gr_liv_area total_bsmt_sf lot_area year_remod_add
## sale_price 1.00 0.71 0.63 0.27 0.53
## gr_liv_area 0.71 1.00 0.45 0.29 0.32
## total_bsmt_sf 0.63 0.45 1.00 0.25 0.30
## lot_area 0.27 0.29 0.25 1.00 0.02
## year_remod_add 0.53 0.32 0.30 0.02 1.00
## age_2026 -0.56 -0.24 -0.41 -0.02 -0.61
## overall_cond_num -0.10 -0.12 -0.17 -0.03 0.05
## overall_qual_num 0.80 0.57 0.55 0.10 0.57
## garage_cars 0.65 0.49 0.44 0.18 0.42
## year_sold -0.03 -0.03 -0.01 -0.02 0.03
## age_2026 overall_cond_num overall_qual_num garage_cars
## sale_price -0.56 -0.10 0.80 0.65
## gr_liv_area -0.24 -0.12 0.57 0.49
## total_bsmt_sf -0.41 -0.17 0.55 0.44
## lot_area -0.02 -0.03 0.10 0.18
## year_remod_add -0.61 0.05 0.57 0.42
## age_2026 1.00 0.37 -0.60 -0.54
## overall_cond_num 0.37 1.00 -0.09 -0.18
## overall_qual_num -0.60 -0.09 1.00 0.60
## garage_cars -0.54 -0.18 0.60 1.00
## year_sold 0.01 0.03 -0.02 -0.02
## year_sold
## sale_price -0.03
## gr_liv_area -0.03
## total_bsmt_sf -0.01
## lot_area -0.02
## year_remod_add 0.03
## age_2026 0.01
## overall_cond_num 0.03
## overall_qual_num -0.02
## garage_cars -0.02
## year_sold 1.00
ggcorr(
select(ames_model,
log_sale_price,
gr_liv_area,
total_bsmt_sf,
lot_area,
year_remod_add,
age_2026,
overall_cond_num,
overall_qual_num,
garage_cars,
year_sold),
label = TRUE,
label_round = 2,
label_size = 3,
hjust = 0.9,
layout.exp = 2,
) +
labs(title = "Correlation Heatmap of Model Predictors") +
theme_minimal()
While some variables have notable correlations (e.g. around ~ 0.6 or -0.6), suggesting a moderate linear relationship among some of our independent variables (e.g., age_2026 adjusted and year remodeled) none of the correlation values are directly worrisome by themselves (e.g., close to 1 or -1). Further investigation will be needed by applying a Variance Inflation Factor (VIFs) approach below to diagnose the models outputs for any apparent multicolinearity affecting regression coefficients.
We can also see from our correlation heatmap that our response variable, the logarithm of sale_price has a linear relationship with most of our independent variables aside from overall_cond_num and year_sold which have correlation values less than \(\pm0.05\). We may opt to remove these variables from the model in the near future, but for the time being we will keep them in with an understanding that these variables are unlikely to be significant and may be removed in the future.
lm_model <- lm(
log_sale_price ~ gr_liv_area +
total_bsmt_sf +
lot_area +
year_remod_add +
age_2026 +
overall_cond_num +
overall_qual_num +
garage_cars +
neighborhood +
electrical +
year_sold,
data = ames_model
)
summary(lm_model)
##
## Call:
## lm(formula = log_sale_price ~ gr_liv_area + total_bsmt_sf + lot_area +
## year_remod_add + age_2026 + overall_cond_num + overall_qual_num +
## garage_cars + neighborhood + electrical + year_sold, data = ames_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.10088 -0.06437 0.00511 0.07777 0.59831
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 1.798e+01 4.139e+00
## gr_liv_area 2.368e-04 7.671e-06
## total_bsmt_sf 1.166e-04 8.247e-06
## lot_area 2.575e-06 3.940e-07
## year_remod_add 9.020e-04 2.012e-04
## age_2026 -2.594e-03 2.202e-04
## overall_cond_num 5.591e-02 3.039e-03
## overall_qual_num 8.138e-02 3.532e-03
## garage_cars 6.109e-02 5.011e-03
## neighborhoodCollege_Creek 1.727e-02 1.410e-02
## neighborhoodOld_Town -1.032e-01 1.415e-02
## neighborhoodEdwards -6.439e-02 1.286e-02
## neighborhoodSomerset 4.916e-02 1.660e-02
## neighborhoodNorthridge_Heights 1.307e-01 1.772e-02
## neighborhoodGilbert 1.042e-03 1.625e-02
## neighborhoodSawyer -6.357e-03 1.385e-02
## neighborhoodNorthwest_Ames -1.915e-02 1.511e-02
## neighborhoodSawyer_West -1.870e-02 1.643e-02
## neighborhoodMitchell -5.398e-03 1.607e-02
## neighborhoodBrookside -2.993e-02 1.667e-02
## neighborhoodCrawford 1.415e-01 1.661e-02
## neighborhoodIowa_DOT_and_Rail_Road -1.461e-01 1.802e-02
## neighborhoodTimberland 6.752e-02 2.074e-02
## neighborhoodNorthridge 1.082e-01 2.153e-02
## neighborhoodStone_Brook 1.487e-01 2.415e-02
## neighborhoodSouth_and_West_of_Iowa_State_University -1.086e-02 2.329e-02
## neighborhoodClear_Creek 1.098e-01 2.387e-02
## neighborhoodMeadow_Village -1.968e-01 2.569e-02
## neighborhoodBriardale -2.089e-01 2.812e-02
## neighborhoodBloomington_Heights -2.972e-02 3.027e-02
## neighborhoodVeenker 6.601e-02 3.123e-02
## neighborhoodNorthpark_Villa -9.874e-02 3.157e-02
## neighborhoodBlueste -1.201e-01 4.716e-02
## neighborhoodGreens 6.188e-02 5.287e-02
## neighborhoodGreen_Hills 5.038e-01 1.037e-01
## neighborhoodLandmark -1.346e-01 1.461e-01
## electricalFuseF -4.097e-02 2.347e-02
## electricalFuseP -7.116e-02 5.273e-02
## electricalMix -1.481e-01 1.471e-01
## electricalSBrkr 6.719e-03 1.187e-02
## electricalUnknown 4.597e-02 1.474e-01
## year_sold -4.500e-03 2.063e-03
## t value Pr(>|t|)
## (Intercept) 4.346 1.44e-05 ***
## gr_liv_area 30.870 < 2e-16 ***
## total_bsmt_sf 14.143 < 2e-16 ***
## lot_area 6.536 7.42e-11 ***
## year_remod_add 4.484 7.62e-06 ***
## age_2026 -11.783 < 2e-16 ***
## overall_cond_num 18.395 < 2e-16 ***
## overall_qual_num 23.040 < 2e-16 ***
## garage_cars 12.192 < 2e-16 ***
## neighborhoodCollege_Creek 1.225 0.22079
## neighborhoodOld_Town -7.291 3.95e-13 ***
## neighborhoodEdwards -5.007 5.85e-07 ***
## neighborhoodSomerset 2.961 0.00309 **
## neighborhoodNorthridge_Heights 7.375 2.14e-13 ***
## neighborhoodGilbert 0.064 0.94884
## neighborhoodSawyer -0.459 0.64637
## neighborhoodNorthwest_Ames -1.267 0.20529
## neighborhoodSawyer_West -1.138 0.25512
## neighborhoodMitchell -0.336 0.73692
## neighborhoodBrookside -1.796 0.07264 .
## neighborhoodCrawford 8.516 < 2e-16 ***
## neighborhoodIowa_DOT_and_Rail_Road -8.109 7.47e-16 ***
## neighborhoodTimberland 3.255 0.00115 **
## neighborhoodNorthridge 5.027 5.28e-07 ***
## neighborhoodStone_Brook 6.159 8.33e-10 ***
## neighborhoodSouth_and_West_of_Iowa_State_University -0.467 0.64084
## neighborhoodClear_Creek 4.602 4.36e-06 ***
## neighborhoodMeadow_Village -7.659 2.54e-14 ***
## neighborhoodBriardale -7.431 1.41e-13 ***
## neighborhoodBloomington_Heights -0.982 0.32629
## neighborhoodVeenker 2.114 0.03461 *
## neighborhoodNorthpark_Villa -3.127 0.00178 **
## neighborhoodBlueste -2.547 0.01092 *
## neighborhoodGreens 1.170 0.24194
## neighborhoodGreen_Hills 4.858 1.25e-06 ***
## neighborhoodLandmark -0.921 0.35700
## electricalFuseF -1.746 0.08100 .
## electricalFuseP -1.350 0.17727
## electricalMix -1.007 0.31410
## electricalSBrkr 0.566 0.57145
## electricalUnknown 0.312 0.75525
## year_sold -2.182 0.02920 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1456 on 2888 degrees of freedom
## Multiple R-squared: 0.8742, Adjusted R-squared: 0.8724
## F-statistic: 489.4 on 41 and 2888 DF, p-value: < 2.2e-16
The original model (lab) did not have diagnostic plots.
par(mfrow = c(2, 2))
plot(lm_model)
## Warning: not plotting observations with leverage one:
## 1578, 2240, 2789
par(mfrow = c(1, 1))
Based on the following diagnostic plots we have observed:
Residuals vs Fitted: Most of the residuals appear centered around 0 and randomly distributed, but there are a few negative residuals at high fitted values that stand out. This indicates that our model is predicting some houses will sell for significantly more than they actually did. In that sense, there may be some omitted variables that are not being accounted for that may otherwise significantly tank the value of a houses market price.
Q-Q Residuals: We can see that the residuals do not follow the diagonal line well, especially in the lower tail (theoretical quantile \(< 2\)), which means are residuals are not normally distributed. Again, this over emphasize of negative residuals means that our model over predicts the value of a home ~ which violates the normality assumption in how are residuals are distributed. Again, this may point to special conditions present within these homes that have sold that have caused them to be under valued (e.g., foundation issues, nearby cemetery, shady history, etc.).
Scale-Location: Again this displays a few observations within our dataset that violate a normal assumption and have rather large residuals relative to their fitted values – emphasizing again points found in plots #1 and #2.
Residuals vs Leverage: This plot is a little tricky to interpret. It seems to be telling us which points have high leverage (e.g., large x values) relative to the mean and how they map against our standardized residuals. As we can see most points have relatively low leverage – which is preferably aside from one point on the very far right. It may be worthwhile to investigate the leverage on those two observations in the far right to better understand what may be causing that.
Overall, the model captures a general relationship between our explanatory and response variable – with some potential room for improvement / non-normal residuals occurring on over predicting the value. Since the business context is to identify houses that are currently under valued, we may want to investigate the specific houses in our model that were over valued and sold for significantly less to identify factors that may have contributed so that when Hermes Real Estate LLC takes to the field, they are screening for these potentially catastrophic scenarios.
vif(lm_model)
## GVIF Df GVIF^(1/(2*Df))
## gr_liv_area 2.077510 1 1.441357
## total_bsmt_sf 1.827313 1 1.351782
## lot_area 1.331699 1 1.153993
## year_remod_add 2.433631 1 1.560010
## age_2026 6.127740 1 2.475427
## overall_cond_num 1.576803 1 1.255708
## overall_qual_num 3.432123 1 1.852599
## garage_cars 2.009661 1 1.417625
## neighborhood 15.127237 27 1.051592
## electrical 1.325861 5 1.028608
## year_sold 1.019014 1 1.009462
The variance inflation factor results indicate that multicollinearity is not a major concern in the model. Most adjusted GVIF values are below 2, suggesting low correlation among predictors. The largest adjusted GVIF is for age_2026 at approximately 2.48, indicating moderate multicollinearity, likely due to its relationship with time-related variables such as year_remod_add and possibly year_sold. However, this value is still below common concern thresholds such as 5 or 10. Although neighborhood has a large raw GVIF, its adjusted GVIF is only 1.05 due to its many degrees of freedom, so it does not appear problematic.
As a reminder, some of the original models were:
OgModel1 <- lm(sale_price ~ year_remod_add + great_qual
+ year_remod_add:great_qual, ames_basic)
OgModel2 <- lm(sale_price ~ first_flr_sf + great_qual, ames_basic)
OgModel3 <- lm(sale_price ~ year_remod_add, ames_basic)
r2_table <- tibble::tibble(
model = c("OgModel1 (interaction)", "OgModel2", "OgModel3", "Full Model"),
r_squared = c(
summary(OgModel1)$r.squared,
summary(OgModel2)$r.squared,
summary(OgModel3)$r.squared,
summary(lm_model)$r.squared
),
adj_r_squared = c(
summary(OgModel1)$adj.r.squared,
summary(OgModel2)$adj.r.squared,
summary(OgModel3)$adj.r.squared,
summary(lm_model)$adj.r.squared
)
)
r2_table
## # A tibble: 4 × 3
## model r_squared adj_r_squared
## <chr> <dbl> <dbl>
## 1 OgModel1 (interaction) 0.407 0.401
## 2 OgModel2 0.616 0.614
## 3 OgModel3 0.0360 0.0330
## 4 Full Model 0.874 0.872
As seen based on the Adjusted R-squared values, the new model has an improved fit to the sample. While that is expected as the new model includes more variables, the adjusted-R-Squared does account for over fitting, and the penalization there (compared to R-squared) was minimal (0.00178623).
We also propose a new model removing Neighborhood as a variable. We then compare it to the first proposed model.
model_without_neighborhood <- lm(
log_sale_price ~ gr_liv_area +
total_bsmt_sf +
lot_area +
year_remod_add +
age_2026 +
overall_cond_num +
overall_qual_num +
garage_cars +
electrical +
year_sold,
data = ames_model
)
anova(model_without_neighborhood, lm_model)
## Analysis of Variance Table
##
## Model 1: log_sale_price ~ gr_liv_area + total_bsmt_sf + lot_area + year_remod_add +
## age_2026 + overall_cond_num + overall_qual_num + garage_cars +
## electrical + year_sold
## Model 2: log_sale_price ~ gr_liv_area + total_bsmt_sf + lot_area + year_remod_add +
## age_2026 + overall_cond_num + overall_qual_num + garage_cars +
## neighborhood + electrical + year_sold
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2915 72.045
## 2 2888 61.221 27 10.825 18.912 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this case, ANOVA is being used to test whether the addition of a parameter (or, in our case, whether keeping it in) improves model fit. Our P value is extremely low, hence, the conclusion that: the model with Neighborhood is better (has a better fit).
ames_results <- ames_model |>
mutate(
predicted_log_price = predict(lm_model, newdata = ames_model),
predicted_price = exp(predicted_log_price),
undervalued_amount = predicted_price - sale_price,
undervalued_flag = undervalued_amount >= 30000
) |>
arrange(desc(undervalued_amount))
top_10_undervalued <- ames_results |>
filter(undervalued_flag == TRUE) |>
select(
sale_price,
predicted_price,
undervalued_amount,
gr_liv_area,
total_bsmt_sf,
lot_area,
overall_qual_num,
overall_cond_num,
garage_cars,
neighborhood,
year_sold
) |>
slice_head(n = 10)
top_10_undervalued
## # A tibble: 10 × 11
## sale_price predicted_price undervalued_amount gr_liv_area total_bsmt_sf
## <int> <dbl> <dbl> <int> <dbl>
## 1 160000 1307737. 1147737. 5642 6110
## 2 183850 1023814. 839964. 5095 5095
## 3 184750 736854. 552104. 4676 3138
## 4 150000 307701. 157701. 2944 994
## 5 147000 279581. 132581. 1795 1795
## 6 82500 177241. 94741. 1411 1386
## 7 270000 359183. 89183. 2687 2062
## 8 130000 217538. 87538. 1204 1191
## 9 277000 358504. 81504. 2144 1444
## 10 253293 332664. 79371. 2042 2042
## # ℹ 6 more variables: lot_area <int>, overall_qual_num <dbl>,
## # overall_cond_num <dbl>, garage_cars <dbl>, neighborhood <fct>,
## # year_sold <int>
Based on our success criteria, yes. Our model actually helped identify nearly 200 undervalued houses. Although, it’s important to keep in mind that our model does exhibit some potential to over value houses – thus it’s important for Hermes Real Estate LLC to do due diligence before making a purchase decision.
Reflecting on the model, some Epistemological concerns may be:
Data Quality Crucial Issues that cannot be measured.In the real-world, the final success measure would depend on whether human analysis concluded these are, in fact, good investment house. We cannot account for things like data entry errors, either genuine or bad-faith. For example, someone trying to sell their house will have in their best interest to sugar-coat issues and exaggerate quality. The top undervalued house may have hidden structural issues that the model cannot predict. Ultimately, we can only provide a list of potential good choices, but not a definitive “buy this” model. Nonethless, the client may use the model to guide strategy and further research only the houses that suggest highest profits, saving time and money.
Ethical Concern: is the client being unethical in their search to buy houses below fair-market value? In the current scenario of housing insecurity and gentrification, what are the ethical concerns with generating a model that will be used to determine which houses are being under-sold? Ultimately, the answer will depend heavily on the situation of the town, the state of the housing market, etc.