library(ggthemes)
library(ggrepel)
## Loading required package: ggplot2
library(AmesHousing)
library(boot)
library(lindia)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ lubridate 1.9.4 ✔ tibble 3.3.1
## ✔ purrr 1.2.1 ✔ tidyr 1.3.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(broom)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
##
## The following object is masked from 'package:boot':
##
## logit
library(knitr)
Define Audience, Problem Statement, Scope, Analysis, etc
Hermes Real Estate, a housing developer that is seeking to purchase investment properties in Ames, IA. They get things done quick.
Hermes Real Estate LLC needs to identify 10 houses with an asking sale price undervalued by at least $30,000 by May 31st, 2026. To retain majority shares, Hermes’ owner, Paul, must repay a previous loan, for which he is $250,000 short. Identifying properties by May 31, 2026, allows for purchasing, renovating, and flipping before the repayment due date of May 31, 2027.
Response Variable: Sale Price (log-converted)
Explanatory Variables: (See Model Critique)
Linear Regression Model to determine the predicted sale price. Houses for sale at $30k or less than the model’s predicted price are a good option for our client.
Regression is chosen given the goal of predicting Sale Price based on a set of measurable variables. When compared to the listed sale price, we are able to evaluate which houses are undervalued. In a real-world situation, the regression model would also be easy to fit to other cities or situations where predicting the estimated marke-value of a house is useful.
The data is complete, i.e., there are none (or very few) missing or NA values. The linear and independence assumptions are met. (to be checked with a Correlation Matrix) The neighbors variable is worth including in the regression model (to be checked with ANOVA). Errors on ‘predicted values’ are distributed heterogenously (at random).
Were we able to provide the client with a model that accurately estimates the market value of the house(s) of interest?
To improve the model, our first step is improving variable selection with what real-world observations can tell us may be most relevant.
The selected variables are: Square Footage Year Remodeled Lot Area Overall Condition (1-10 ordinal) Overall Quality (1-10 ordinal) Number of Cars Garage can fit Neighborhood Year Built Electrical (to ordinal variable) Year Sold (last time sold)
We also selected variables based on what may be most useful for the client. For example, a house that is very cheap but needs the entire electrical rewired will cost a lot more to renovate before re-selling, therefore ultimately reducing profit. Knowing how that affects price can help the client make a decision the “sweet spot” of price-versus-condition.
# Import data
ames <- make_ames() |>
rename_with(tolower)
# Prepare model data
ames_model <- ames |>
mutate(
log_sale_price = log(sale_price),
age_2026 = 2026 - year_built,
overall_cond_num = as.numeric(overall_cond),
overall_qual_num = as.numeric(overall_qual)
) |>
select(
sale_price,
log_sale_price,
gr_liv_area,
total_bsmt_sf,
lot_area,
year_remod_add,
age_2026,
overall_cond_num,
overall_qual_num,
garage_cars,
neighborhood,
electrical,
year_sold
) |>
na.omit()
# View data
glimpse(ames_model)
## Rows: 2,930
## Columns: 13
## $ sale_price <int> 215000, 105000, 172000, 244000, 189900, 195500, 21350…
## $ log_sale_price <dbl> 12.27839, 11.56172, 12.05525, 12.40492, 12.15425, 12.…
## $ gr_liv_area <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616, …
## $ total_bsmt_sf <dbl> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1595, 99…
## $ lot_area <int> 31770, 11622, 14267, 11160, 13830, 9978, 4920, 5005, …
## $ year_remod_add <int> 1960, 1961, 1958, 1968, 1998, 1998, 2001, 1992, 1996,…
## $ age_2026 <dbl> 66, 65, 68, 58, 29, 28, 25, 34, 31, 27, 33, 34, 28, 3…
## $ overall_cond_num <dbl> 5, 6, 6, 5, 5, 6, 5, 5, 5, 5, 5, 7, 5, 5, 5, 5, 7, 2,…
## $ overall_qual_num <dbl> 6, 5, 6, 7, 5, 6, 8, 8, 8, 7, 6, 6, 6, 7, 8, 8, 8, 9,…
## $ garage_cars <dbl> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 3,…
## $ neighborhood <fct> North_Ames, North_Ames, North_Ames, North_Ames, Gilbe…
## $ electrical <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrk…
## $ year_sold <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,…
cor_matrix <- ames_model |>
select(
sale_price,
gr_liv_area,
total_bsmt_sf,
lot_area,
year_remod_add,
age_2026,
overall_cond_num,
overall_qual_num,
garage_cars,
year_sold
) |>
cor(use = "pairwise.complete.obs") |>
round(2)
cor_matrix
## sale_price gr_liv_area total_bsmt_sf lot_area year_remod_add
## sale_price 1.00 0.71 0.63 0.27 0.53
## gr_liv_area 0.71 1.00 0.45 0.29 0.32
## total_bsmt_sf 0.63 0.45 1.00 0.25 0.30
## lot_area 0.27 0.29 0.25 1.00 0.02
## year_remod_add 0.53 0.32 0.30 0.02 1.00
## age_2026 -0.56 -0.24 -0.41 -0.02 -0.61
## overall_cond_num -0.10 -0.12 -0.17 -0.03 0.05
## overall_qual_num 0.80 0.57 0.55 0.10 0.57
## garage_cars 0.65 0.49 0.44 0.18 0.42
## year_sold -0.03 -0.03 -0.01 -0.02 0.03
## age_2026 overall_cond_num overall_qual_num garage_cars
## sale_price -0.56 -0.10 0.80 0.65
## gr_liv_area -0.24 -0.12 0.57 0.49
## total_bsmt_sf -0.41 -0.17 0.55 0.44
## lot_area -0.02 -0.03 0.10 0.18
## year_remod_add -0.61 0.05 0.57 0.42
## age_2026 1.00 0.37 -0.60 -0.54
## overall_cond_num 0.37 1.00 -0.09 -0.18
## overall_qual_num -0.60 -0.09 1.00 0.60
## garage_cars -0.54 -0.18 0.60 1.00
## year_sold 0.01 0.03 -0.02 -0.02
## year_sold
## sale_price -0.03
## gr_liv_area -0.03
## total_bsmt_sf -0.01
## lot_area -0.02
## year_remod_add 0.03
## age_2026 0.01
## overall_cond_num 0.03
## overall_qual_num -0.02
## garage_cars -0.02
## year_sold 1.00
While some variables have notable correlations, none present a worrisome
correlation or multicollineary concern
lm_model <- lm(
log_sale_price ~ gr_liv_area +
total_bsmt_sf +
lot_area +
year_remod_add +
age_2026 +
overall_cond_num +
overall_qual_num +
garage_cars +
neighborhood +
electrical +
year_sold,
data = ames_model
)
summary(lm_model)
##
## Call:
## lm(formula = log_sale_price ~ gr_liv_area + total_bsmt_sf + lot_area +
## year_remod_add + age_2026 + overall_cond_num + overall_qual_num +
## garage_cars + neighborhood + electrical + year_sold, data = ames_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.10088 -0.06437 0.00511 0.07777 0.59831
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 1.798e+01 4.139e+00
## gr_liv_area 2.368e-04 7.671e-06
## total_bsmt_sf 1.166e-04 8.247e-06
## lot_area 2.575e-06 3.940e-07
## year_remod_add 9.020e-04 2.012e-04
## age_2026 -2.594e-03 2.202e-04
## overall_cond_num 5.591e-02 3.039e-03
## overall_qual_num 8.138e-02 3.532e-03
## garage_cars 6.109e-02 5.011e-03
## neighborhoodCollege_Creek 1.727e-02 1.410e-02
## neighborhoodOld_Town -1.032e-01 1.415e-02
## neighborhoodEdwards -6.439e-02 1.286e-02
## neighborhoodSomerset 4.916e-02 1.660e-02
## neighborhoodNorthridge_Heights 1.307e-01 1.772e-02
## neighborhoodGilbert 1.042e-03 1.625e-02
## neighborhoodSawyer -6.357e-03 1.385e-02
## neighborhoodNorthwest_Ames -1.915e-02 1.511e-02
## neighborhoodSawyer_West -1.870e-02 1.643e-02
## neighborhoodMitchell -5.398e-03 1.607e-02
## neighborhoodBrookside -2.993e-02 1.667e-02
## neighborhoodCrawford 1.415e-01 1.661e-02
## neighborhoodIowa_DOT_and_Rail_Road -1.461e-01 1.802e-02
## neighborhoodTimberland 6.752e-02 2.074e-02
## neighborhoodNorthridge 1.082e-01 2.153e-02
## neighborhoodStone_Brook 1.487e-01 2.415e-02
## neighborhoodSouth_and_West_of_Iowa_State_University -1.086e-02 2.329e-02
## neighborhoodClear_Creek 1.098e-01 2.387e-02
## neighborhoodMeadow_Village -1.968e-01 2.569e-02
## neighborhoodBriardale -2.089e-01 2.812e-02
## neighborhoodBloomington_Heights -2.972e-02 3.027e-02
## neighborhoodVeenker 6.601e-02 3.123e-02
## neighborhoodNorthpark_Villa -9.874e-02 3.157e-02
## neighborhoodBlueste -1.201e-01 4.716e-02
## neighborhoodGreens 6.188e-02 5.287e-02
## neighborhoodGreen_Hills 5.038e-01 1.037e-01
## neighborhoodLandmark -1.346e-01 1.461e-01
## electricalFuseF -4.097e-02 2.347e-02
## electricalFuseP -7.116e-02 5.273e-02
## electricalMix -1.481e-01 1.471e-01
## electricalSBrkr 6.719e-03 1.187e-02
## electricalUnknown 4.597e-02 1.474e-01
## year_sold -4.500e-03 2.063e-03
## t value Pr(>|t|)
## (Intercept) 4.346 1.44e-05 ***
## gr_liv_area 30.870 < 2e-16 ***
## total_bsmt_sf 14.143 < 2e-16 ***
## lot_area 6.536 7.42e-11 ***
## year_remod_add 4.484 7.62e-06 ***
## age_2026 -11.783 < 2e-16 ***
## overall_cond_num 18.395 < 2e-16 ***
## overall_qual_num 23.040 < 2e-16 ***
## garage_cars 12.192 < 2e-16 ***
## neighborhoodCollege_Creek 1.225 0.22079
## neighborhoodOld_Town -7.291 3.95e-13 ***
## neighborhoodEdwards -5.007 5.85e-07 ***
## neighborhoodSomerset 2.961 0.00309 **
## neighborhoodNorthridge_Heights 7.375 2.14e-13 ***
## neighborhoodGilbert 0.064 0.94884
## neighborhoodSawyer -0.459 0.64637
## neighborhoodNorthwest_Ames -1.267 0.20529
## neighborhoodSawyer_West -1.138 0.25512
## neighborhoodMitchell -0.336 0.73692
## neighborhoodBrookside -1.796 0.07264 .
## neighborhoodCrawford 8.516 < 2e-16 ***
## neighborhoodIowa_DOT_and_Rail_Road -8.109 7.47e-16 ***
## neighborhoodTimberland 3.255 0.00115 **
## neighborhoodNorthridge 5.027 5.28e-07 ***
## neighborhoodStone_Brook 6.159 8.33e-10 ***
## neighborhoodSouth_and_West_of_Iowa_State_University -0.467 0.64084
## neighborhoodClear_Creek 4.602 4.36e-06 ***
## neighborhoodMeadow_Village -7.659 2.54e-14 ***
## neighborhoodBriardale -7.431 1.41e-13 ***
## neighborhoodBloomington_Heights -0.982 0.32629
## neighborhoodVeenker 2.114 0.03461 *
## neighborhoodNorthpark_Villa -3.127 0.00178 **
## neighborhoodBlueste -2.547 0.01092 *
## neighborhoodGreens 1.170 0.24194
## neighborhoodGreen_Hills 4.858 1.25e-06 ***
## neighborhoodLandmark -0.921 0.35700
## electricalFuseF -1.746 0.08100 .
## electricalFuseP -1.350 0.17727
## electricalMix -1.007 0.31410
## electricalSBrkr 0.566 0.57145
## electricalUnknown 0.312 0.75525
## year_sold -2.182 0.02920 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1456 on 2888 degrees of freedom
## Multiple R-squared: 0.8742, Adjusted R-squared: 0.8724
## F-statistic: 489.4 on 41 and 2888 DF, p-value: < 2.2e-16
The original model (lab) did not have diagnostic plots.
par(mfrow = c(2, 2))
plot(lm_model)
## Warning: not plotting observations with leverage one:
## 1578, 2240, 2789
par(mfrow = c(1, 1))
Diagnostic plots indicate little issues in the model.
vif(lm_model)
## GVIF Df GVIF^(1/(2*Df))
## gr_liv_area 2.077510 1 1.441357
## total_bsmt_sf 1.827313 1 1.351782
## lot_area 1.331699 1 1.153993
## year_remod_add 2.433631 1 1.560010
## age_2026 6.127740 1 2.475427
## overall_cond_num 1.576803 1 1.255708
## overall_qual_num 3.432123 1 1.852599
## garage_cars 2.009661 1 1.417625
## neighborhood 15.127237 27 1.051592
## electrical 1.325861 5 1.028608
## year_sold 1.019014 1 1.009462
We see no multicolllinearity issues that would invalidate the model.
As a reminder, some of the original models were:
OgModel1 <- lm(sale_price ~ year_remod_add + great_qual
+ year_remod_add:great_qual, ames_basic)
OgModel2 <- lm(sale_price ~ first_flr_sf + great_qual, ames_basic)
OgModel3 <- lm(sale_price ~ year_remod_add, ames_basic)
r2_table <- tibble::tibble(
model = c("OgModel1 (interaction)", "OgModel2", "OgModel3", "Full Model"),
r_squared = c(
summary(OgModel1)$r.squared,
summary(OgModel2)$r.squared,
summary(OgModel3)$r.squared,
summary(lm_model)$r.squared
),
adj_r_squared = c(
summary(OgModel1)$adj.r.squared,
summary(OgModel2)$adj.r.squared,
summary(OgModel3)$adj.r.squared,
summary(lm_model)$adj.r.squared
)
)
r2_table
## # A tibble: 4 × 3
## model r_squared adj_r_squared
## <chr> <dbl> <dbl>
## 1 OgModel1 (interaction) 0.407 0.401
## 2 OgModel2 0.616 0.614
## 3 OgModel3 0.0360 0.0330
## 4 Full Model 0.874 0.872
As seen based on the Adjusted R-squared values, the new model has an improved fit to the sample. While that is expected as the new model includes more variables, the adjusted-R-Squared does account for over fitting, and the penalization there (compared to R-squared) was minimal (0.00178623).
We also propose a new model removing Neighborhood as a variable. We then compare it to the first proposed model.
model_without_neighborhood <- lm(
log_sale_price ~ gr_liv_area +
total_bsmt_sf +
lot_area +
year_remod_add +
age_2026 +
overall_cond_num +
overall_qual_num +
garage_cars +
electrical +
year_sold,
data = ames_model
)
anova(model_without_neighborhood, lm_model)
## Analysis of Variance Table
##
## Model 1: log_sale_price ~ gr_liv_area + total_bsmt_sf + lot_area + year_remod_add +
## age_2026 + overall_cond_num + overall_qual_num + garage_cars +
## electrical + year_sold
## Model 2: log_sale_price ~ gr_liv_area + total_bsmt_sf + lot_area + year_remod_add +
## age_2026 + overall_cond_num + overall_qual_num + garage_cars +
## neighborhood + electrical + year_sold
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2915 72.045
## 2 2888 61.221 27 10.825 18.912 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this case, ANOVA is being used to test whether the addition of a parameter (or, in our case, whether keeping it in) improves model fit. Our P value is extremely low, hence, the conclusion that: the model with Neighborhood is better (has a better fit).
ames_results <- ames_model |>
mutate(
predicted_log_price = predict(lm_model, newdata = ames_model),
predicted_price = exp(predicted_log_price),
undervalued_amount = predicted_price - sale_price,
undervalued_flag = undervalued_amount >= 30000
) |>
arrange(desc(undervalued_amount))
top_10_undervalued <- ames_results |>
filter(undervalued_flag == TRUE) |>
select(
sale_price,
predicted_price,
undervalued_amount,
gr_liv_area,
total_bsmt_sf,
lot_area,
overall_qual_num,
overall_cond_num,
garage_cars,
neighborhood,
year_sold
) |>
slice_head(n = 10)
top_10_undervalued
## # A tibble: 10 × 11
## sale_price predicted_price undervalued_amount gr_liv_area total_bsmt_sf
## <int> <dbl> <dbl> <int> <dbl>
## 1 160000 1307737. 1147737. 5642 6110
## 2 183850 1023814. 839964. 5095 5095
## 3 184750 736854. 552104. 4676 3138
## 4 150000 307701. 157701. 2944 994
## 5 147000 279581. 132581. 1795 1795
## 6 82500 177241. 94741. 1411 1386
## 7 270000 359183. 89183. 2687 2062
## 8 130000 217538. 87538. 1204 1191
## 9 277000 358504. 81504. 2144 1444
## 10 253293 332664. 79371. 2042 2042
## # ℹ 6 more variables: lot_area <int>, overall_qual_num <dbl>,
## # overall_cond_num <dbl>, garage_cars <dbl>, neighborhood <fct>,
## # year_sold <int>
Based on our success criteria, yes. Our model actually helped identify nearly 200 undervalued houses.
Reflecting on the model, some Epistemological concerns may be:
Data Quality Crucial Issues that cannot be measured.In the real-world, the final success measure would depend on whether human analysis concluded these are, in fact, good investment house. We cannot account for things like data entry errors, either genuine or bad-faith. For example, someone trying to sell their house will have in their best interest to sugar-coat issues and exaggerate quality. The top undervalued house may have hidden structural issues that the model cannot predict. Ultimately, we can only provide a list of potential good choices, but not a definitive “buy this” model. Nonethless, the client may use the model to guide strategy and further research only the houses that suggest highest profits, saving time and money.
Ethical Concern: is the client being unethical in their search to buy houses below fair-market value? In the current scenario of housing insecurity and gentrification, what are the ethical concerns with generating a model that will be used to determine which houses are being under-sold? Ultimately, the answer will depend heavily on the situation of the town, the state of the housing market, etc.