library(ggthemes)
library(ggrepel)

## Loading required package: ggplot2

library(AmesHousing)
library(boot)
library(lindia)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ lubridate 1.9.4     ✔ tibble    3.3.1
## ✔ purrr     1.2.1     ✔ tidyr     1.3.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(ggplot2)
library(broom)
library(car)

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
## 
## The following object is masked from 'package:boot':
## 
##     logit

library(knitr)

Goal 1: The Problem

Define Audience, Problem Statement, Scope, Analysis, etc

Customer or Audience:

Hermes Real Estate, a housing developer that is seeking to purchase investment properties in Ames, IA. They get things done quick.

Problem Statement:

Hermes Real Estate LLC needs to identify 10 houses with an asking sale price undervalued by at least $30,000 by May 31st, 2026. To retain majority shares, Hermes’ owner, Paul, must repay a previous loan, for which he is $250,000 short. Identifying properties by May 31, 2026, allows for purchasing, renovating, and flipping before the repayment due date of May 31, 2027.

Scope

What Variables?

Response Variable: Sale Price (log-converted)

Explanatory Variables: (See Model Critique)

What Analyses

Linear Regression Model to determine the predicted sale price. Houses for sale at $30k or less than the model’s predicted price are a good option for our client.

Regression is chosen given the goal of predicting Sale Price based on a set of measurable variables. When compared to the listed sale price, we are able to evaluate which houses are undervalued. In a real-world situation, the regression model would also be easy to fit to other cities or situations where predicting the estimated marke-value of a house is useful.

What Assumptions

The data is complete, i.e., there are none (or very few) missing or NA values. The linear and independence assumptions are met. (to be checked with a Correlation Matrix) The neighbors variable is worth including in the regression model (to be checked with ANOVA). Errors on ‘predicted values’ are distributed heterogenously (at random).

Success Crieria

Were we able to provide the client with a model that accurately estimates the market value of the house(s) of interest?

Goal 2 - Model Critique

Improvement 1 - Variable Selection

To improve the model, our first step is improving variable selection with what real-world observations can tell us may be most relevant.

The selected variables are: Square Footage Year Remodeled Lot Area Overall Condition (1-10 ordinal) Overall Quality (1-10 ordinal) Number of Cars Garage can fit Neighborhood Year Built Electrical (to ordinal variable) Year Sold (last time sold)

We also selected variables based on what may be most useful for the client. For example, a house that is very cheap but needs the entire electrical rewired will cost a lot more to renovate before re-selling, therefore ultimately reducing profit. Knowing how that affects price can help the client make a decision the “sweet spot” of price-versus-condition.

# Import data
ames <- make_ames() |>
  rename_with(tolower)

# Prepare model data
ames_model <- ames |>
  mutate(
    log_sale_price = log(sale_price),
    age_2026 = 2026 - year_built,
    overall_cond_num = as.numeric(overall_cond),
    overall_qual_num = as.numeric(overall_qual)
  ) |>
  select(
    sale_price,
    log_sale_price,
    gr_liv_area,
    total_bsmt_sf,
    lot_area,
    year_remod_add,
    age_2026,
    overall_cond_num,
    overall_qual_num,
    garage_cars,
    neighborhood,
    electrical,
    year_sold
  ) |>
  na.omit()

# View data
glimpse(ames_model)

## Rows: 2,930
## Columns: 13
## $ sale_price       <int> 215000, 105000, 172000, 244000, 189900, 195500, 21350…
## $ log_sale_price   <dbl> 12.27839, 11.56172, 12.05525, 12.40492, 12.15425, 12.…
## $ gr_liv_area      <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616, …
## $ total_bsmt_sf    <dbl> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1595, 99…
## $ lot_area         <int> 31770, 11622, 14267, 11160, 13830, 9978, 4920, 5005, …
## $ year_remod_add   <int> 1960, 1961, 1958, 1968, 1998, 1998, 2001, 1992, 1996,…
## $ age_2026         <dbl> 66, 65, 68, 58, 29, 28, 25, 34, 31, 27, 33, 34, 28, 3…
## $ overall_cond_num <dbl> 5, 6, 6, 5, 5, 6, 5, 5, 5, 5, 5, 7, 5, 5, 5, 5, 7, 2,…
## $ overall_qual_num <dbl> 6, 5, 6, 7, 5, 6, 8, 8, 8, 7, 6, 6, 6, 7, 8, 8, 8, 9,…
## $ garage_cars      <dbl> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 3,…
## $ neighborhood     <fct> North_Ames, North_Ames, North_Ames, North_Ames, Gilbe…
## $ electrical       <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrk…
## $ year_sold        <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,…

Improvement 2 - Assumption Verifications

Multi-Colinearity

cor_matrix <- ames_model |>
  select(
    sale_price,
    gr_liv_area,
    total_bsmt_sf,
    lot_area,
    year_remod_add,
    age_2026,
    overall_cond_num,
    overall_qual_num,
    garage_cars,
    year_sold
  ) |>
  cor(use = "pairwise.complete.obs") |>
  round(2)

cor_matrix

##                  sale_price gr_liv_area total_bsmt_sf lot_area year_remod_add
## sale_price             1.00        0.71          0.63     0.27           0.53
## gr_liv_area            0.71        1.00          0.45     0.29           0.32
## total_bsmt_sf          0.63        0.45          1.00     0.25           0.30
## lot_area               0.27        0.29          0.25     1.00           0.02
## year_remod_add         0.53        0.32          0.30     0.02           1.00
## age_2026              -0.56       -0.24         -0.41    -0.02          -0.61
## overall_cond_num      -0.10       -0.12         -0.17    -0.03           0.05
## overall_qual_num       0.80        0.57          0.55     0.10           0.57
## garage_cars            0.65        0.49          0.44     0.18           0.42
## year_sold             -0.03       -0.03         -0.01    -0.02           0.03
##                  age_2026 overall_cond_num overall_qual_num garage_cars
## sale_price          -0.56            -0.10             0.80        0.65
## gr_liv_area         -0.24            -0.12             0.57        0.49
## total_bsmt_sf       -0.41            -0.17             0.55        0.44
## lot_area            -0.02            -0.03             0.10        0.18
## year_remod_add      -0.61             0.05             0.57        0.42
## age_2026             1.00             0.37            -0.60       -0.54
## overall_cond_num     0.37             1.00            -0.09       -0.18
## overall_qual_num    -0.60            -0.09             1.00        0.60
## garage_cars         -0.54            -0.18             0.60        1.00
## year_sold            0.01             0.03            -0.02       -0.02
##                  year_sold
## sale_price           -0.03
## gr_liv_area          -0.03
## total_bsmt_sf        -0.01
## lot_area             -0.02
## year_remod_add        0.03
## age_2026              0.01
## overall_cond_num      0.03
## overall_qual_num     -0.02
## garage_cars          -0.02
## year_sold             1.00

While some variables have notable correlations, none present a worrisome correlation or multicollineary concern

The new Model

lm_model <- lm(
  log_sale_price ~ gr_liv_area +
    total_bsmt_sf +
    lot_area +
    year_remod_add +
    age_2026 +
    overall_cond_num +
    overall_qual_num +
    garage_cars +
    neighborhood +
    electrical +
    year_sold,
  data = ames_model
)

summary(lm_model)

## 
## Call:
## lm(formula = log_sale_price ~ gr_liv_area + total_bsmt_sf + lot_area + 
##     year_remod_add + age_2026 + overall_cond_num + overall_qual_num + 
##     garage_cars + neighborhood + electrical + year_sold, data = ames_model)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.10088 -0.06437  0.00511  0.07777  0.59831 
## 
## Coefficients:
##                                                       Estimate Std. Error
## (Intercept)                                          1.798e+01  4.139e+00
## gr_liv_area                                          2.368e-04  7.671e-06
## total_bsmt_sf                                        1.166e-04  8.247e-06
## lot_area                                             2.575e-06  3.940e-07
## year_remod_add                                       9.020e-04  2.012e-04
## age_2026                                            -2.594e-03  2.202e-04
## overall_cond_num                                     5.591e-02  3.039e-03
## overall_qual_num                                     8.138e-02  3.532e-03
## garage_cars                                          6.109e-02  5.011e-03
## neighborhoodCollege_Creek                            1.727e-02  1.410e-02
## neighborhoodOld_Town                                -1.032e-01  1.415e-02
## neighborhoodEdwards                                 -6.439e-02  1.286e-02
## neighborhoodSomerset                                 4.916e-02  1.660e-02
## neighborhoodNorthridge_Heights                       1.307e-01  1.772e-02
## neighborhoodGilbert                                  1.042e-03  1.625e-02
## neighborhoodSawyer                                  -6.357e-03  1.385e-02
## neighborhoodNorthwest_Ames                          -1.915e-02  1.511e-02
## neighborhoodSawyer_West                             -1.870e-02  1.643e-02
## neighborhoodMitchell                                -5.398e-03  1.607e-02
## neighborhoodBrookside                               -2.993e-02  1.667e-02
## neighborhoodCrawford                                 1.415e-01  1.661e-02
## neighborhoodIowa_DOT_and_Rail_Road                  -1.461e-01  1.802e-02
## neighborhoodTimberland                               6.752e-02  2.074e-02
## neighborhoodNorthridge                               1.082e-01  2.153e-02
## neighborhoodStone_Brook                              1.487e-01  2.415e-02
## neighborhoodSouth_and_West_of_Iowa_State_University -1.086e-02  2.329e-02
## neighborhoodClear_Creek                              1.098e-01  2.387e-02
## neighborhoodMeadow_Village                          -1.968e-01  2.569e-02
## neighborhoodBriardale                               -2.089e-01  2.812e-02
## neighborhoodBloomington_Heights                     -2.972e-02  3.027e-02
## neighborhoodVeenker                                  6.601e-02  3.123e-02
## neighborhoodNorthpark_Villa                         -9.874e-02  3.157e-02
## neighborhoodBlueste                                 -1.201e-01  4.716e-02
## neighborhoodGreens                                   6.188e-02  5.287e-02
## neighborhoodGreen_Hills                              5.038e-01  1.037e-01
## neighborhoodLandmark                                -1.346e-01  1.461e-01
## electricalFuseF                                     -4.097e-02  2.347e-02
## electricalFuseP                                     -7.116e-02  5.273e-02
## electricalMix                                       -1.481e-01  1.471e-01
## electricalSBrkr                                      6.719e-03  1.187e-02
## electricalUnknown                                    4.597e-02  1.474e-01
## year_sold                                           -4.500e-03  2.063e-03
##                                                     t value Pr(>|t|)    
## (Intercept)                                           4.346 1.44e-05 ***
## gr_liv_area                                          30.870  < 2e-16 ***
## total_bsmt_sf                                        14.143  < 2e-16 ***
## lot_area                                              6.536 7.42e-11 ***
## year_remod_add                                        4.484 7.62e-06 ***
## age_2026                                            -11.783  < 2e-16 ***
## overall_cond_num                                     18.395  < 2e-16 ***
## overall_qual_num                                     23.040  < 2e-16 ***
## garage_cars                                          12.192  < 2e-16 ***
## neighborhoodCollege_Creek                             1.225  0.22079    
## neighborhoodOld_Town                                 -7.291 3.95e-13 ***
## neighborhoodEdwards                                  -5.007 5.85e-07 ***
## neighborhoodSomerset                                  2.961  0.00309 ** 
## neighborhoodNorthridge_Heights                        7.375 2.14e-13 ***
## neighborhoodGilbert                                   0.064  0.94884    
## neighborhoodSawyer                                   -0.459  0.64637    
## neighborhoodNorthwest_Ames                           -1.267  0.20529    
## neighborhoodSawyer_West                              -1.138  0.25512    
## neighborhoodMitchell                                 -0.336  0.73692    
## neighborhoodBrookside                                -1.796  0.07264 .  
## neighborhoodCrawford                                  8.516  < 2e-16 ***
## neighborhoodIowa_DOT_and_Rail_Road                   -8.109 7.47e-16 ***
## neighborhoodTimberland                                3.255  0.00115 ** 
## neighborhoodNorthridge                                5.027 5.28e-07 ***
## neighborhoodStone_Brook                               6.159 8.33e-10 ***
## neighborhoodSouth_and_West_of_Iowa_State_University  -0.467  0.64084    
## neighborhoodClear_Creek                               4.602 4.36e-06 ***
## neighborhoodMeadow_Village                           -7.659 2.54e-14 ***
## neighborhoodBriardale                                -7.431 1.41e-13 ***
## neighborhoodBloomington_Heights                      -0.982  0.32629    
## neighborhoodVeenker                                   2.114  0.03461 *  
## neighborhoodNorthpark_Villa                          -3.127  0.00178 ** 
## neighborhoodBlueste                                  -2.547  0.01092 *  
## neighborhoodGreens                                    1.170  0.24194    
## neighborhoodGreen_Hills                               4.858 1.25e-06 ***
## neighborhoodLandmark                                 -0.921  0.35700    
## electricalFuseF                                      -1.746  0.08100 .  
## electricalFuseP                                      -1.350  0.17727    
## electricalMix                                        -1.007  0.31410    
## electricalSBrkr                                       0.566  0.57145    
## electricalUnknown                                     0.312  0.75525    
## year_sold                                            -2.182  0.02920 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1456 on 2888 degrees of freedom
## Multiple R-squared:  0.8742, Adjusted R-squared:  0.8724 
## F-statistic: 489.4 on 41 and 2888 DF,  p-value: < 2.2e-16

Improvement 3 - Diagnostics

The original model (lab) did not have diagnostic plots.

Diagnostic Plots

par(mfrow = c(2, 2))
plot(lm_model)

## Warning: not plotting observations with leverage one:
##   1578, 2240, 2789

par(mfrow = c(1, 1))

Diagnostic plots indicate little issues in the model.

VIF

vif(lm_model)

##                       GVIF Df GVIF^(1/(2*Df))
## gr_liv_area       2.077510  1        1.441357
## total_bsmt_sf     1.827313  1        1.351782
## lot_area          1.331699  1        1.153993
## year_remod_add    2.433631  1        1.560010
## age_2026          6.127740  1        2.475427
## overall_cond_num  1.576803  1        1.255708
## overall_qual_num  3.432123  1        1.852599
## garage_cars       2.009661  1        1.417625
## neighborhood     15.127237 27        1.051592
## electrical        1.325861  5        1.028608
## year_sold         1.019014  1        1.009462

We see no multicolllinearity issues that would invalidate the model.

Model comparisons

As a reminder, some of the original models were:

OgModel1 <- lm(sale_price ~ year_remod_add + great_qual 
            + year_remod_add:great_qual, ames_basic)

OgModel2 <- lm(sale_price ~ first_flr_sf + great_qual, ames_basic)

OgModel3 <- lm(sale_price ~ year_remod_add, ames_basic)

r2_table <- tibble::tibble(
  model = c("OgModel1 (interaction)", "OgModel2", "OgModel3", "Full Model"),
  r_squared = c(
    summary(OgModel1)$r.squared,
    summary(OgModel2)$r.squared,
    summary(OgModel3)$r.squared,
    summary(lm_model)$r.squared
  ),
  adj_r_squared = c(
    summary(OgModel1)$adj.r.squared,
    summary(OgModel2)$adj.r.squared,
    summary(OgModel3)$adj.r.squared,
    summary(lm_model)$adj.r.squared
  )
)

r2_table

## # A tibble: 4 × 3
##   model                  r_squared adj_r_squared
##   <chr>                      <dbl>         <dbl>
## 1 OgModel1 (interaction)    0.407         0.401 
## 2 OgModel2                  0.616         0.614 
## 3 OgModel3                  0.0360        0.0330
## 4 Full Model                0.874         0.872

As seen based on the Adjusted R-squared values, the new model has an improved fit to the sample. While that is expected as the new model includes more variables, the adjusted-R-Squared does account for over fitting, and the penalization there (compared to R-squared) was minimal (0.00178623).

2nd proposed model

We also propose a new model removing Neighborhood as a variable. We then compare it to the first proposed model.

model_without_neighborhood <- lm(
  log_sale_price ~ gr_liv_area +
    total_bsmt_sf +
    lot_area +
    year_remod_add +
    age_2026 +
    overall_cond_num +
    overall_qual_num +
    garage_cars +
    electrical +
    year_sold,
  data = ames_model
)

anova(model_without_neighborhood, lm_model)

## Analysis of Variance Table
## 
## Model 1: log_sale_price ~ gr_liv_area + total_bsmt_sf + lot_area + year_remod_add + 
##     age_2026 + overall_cond_num + overall_qual_num + garage_cars + 
##     electrical + year_sold
## Model 2: log_sale_price ~ gr_liv_area + total_bsmt_sf + lot_area + year_remod_add + 
##     age_2026 + overall_cond_num + overall_qual_num + garage_cars + 
##     neighborhood + electrical + year_sold
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   2915 72.045                                  
## 2   2888 61.221 27    10.825 18.912 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this case, ANOVA is being used to test whether the addition of a parameter (or, in our case, whether keeping it in) improves model fit. Our P value is extremely low, hence, the conclusion that: the model with Neighborhood is better (has a better fit).

Model Results

ames_results <- ames_model |>
  mutate(
    predicted_log_price = predict(lm_model, newdata = ames_model),
    predicted_price = exp(predicted_log_price),
    undervalued_amount = predicted_price - sale_price,
    undervalued_flag = undervalued_amount >= 30000
  ) |>
  arrange(desc(undervalued_amount))

top_10_undervalued <- ames_results |>
  filter(undervalued_flag == TRUE) |>
  select(
    sale_price,
    predicted_price,
    undervalued_amount,
    gr_liv_area,
    total_bsmt_sf,
    lot_area,
    overall_qual_num,
    overall_cond_num,
    garage_cars,
    neighborhood,
    year_sold
  ) |>
  slice_head(n = 10)

top_10_undervalued

## # A tibble: 10 × 11
##    sale_price predicted_price undervalued_amount gr_liv_area total_bsmt_sf
##         <int>           <dbl>              <dbl>       <int>         <dbl>
##  1     160000        1307737.           1147737.        5642          6110
##  2     183850        1023814.            839964.        5095          5095
##  3     184750         736854.            552104.        4676          3138
##  4     150000         307701.            157701.        2944           994
##  5     147000         279581.            132581.        1795          1795
##  6      82500         177241.             94741.        1411          1386
##  7     270000         359183.             89183.        2687          2062
##  8     130000         217538.             87538.        1204          1191
##  9     277000         358504.             81504.        2144          1444
## 10     253293         332664.             79371.        2042          2042
## # ℹ 6 more variables: lot_area <int>, overall_qual_num <dbl>,
## #   overall_cond_num <dbl>, garage_cars <dbl>, neighborhood <fct>,
## #   year_sold <int>

Were we successfull?

Based on our success criteria, yes. Our model actually helped identify nearly 200 undervalued houses.

Goal 3 - Epistemological Components

Reflecting on the model, some Epistemological concerns may be:

Data Quality Crucial Issues that cannot be measured.In the real-world, the final success measure would depend on whether human analysis concluded these are, in fact, good investment house. We cannot account for things like data entry errors, either genuine or bad-faith. For example, someone trying to sell their house will have in their best interest to sugar-coat issues and exaggerate quality. The top undervalued house may have hidden structural issues that the model cannot predict. Ultimately, we can only provide a list of potential good choices, but not a definitive “buy this” model. Nonethless, the client may use the model to guide strategy and further research only the houses that suggest highest profits, saving time and money.
Ethical Concern: is the client being unethical in their search to buy houses below fair-market value? In the current scenario of housing insecurity and gentrification, what are the ethical concerns with generating a model that will be used to determine which houses are being under-sold? Ultimately, the answer will depend heavily on the situation of the town, the state of the housing market, etc.

Data Dive 14 – Group Work

2026-04-27