Matthew, Brody, Nathanael - Critiquing Week 9

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.5.2

## Warning: package 'ggplot2' was built under R version 4.5.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)

## Warning: package 'ggthemes' was built under R version 4.5.2

library(ggrepel)

## Warning: package 'ggrepel' was built under R version 4.5.2

library(AmesHousing)

## Warning: package 'AmesHousing' was built under R version 4.5.2

library(boot)
library(broom)
library(lindia)

## Warning: package 'lindia' was built under R version 4.5.2

library(GGally)

## Warning: package 'GGally' was built under R version 4.5.2

# remove scientific notation
options(scipen = 6)

# default theme, unless otherwise noted
theme_set(theme_minimal())

ames <- make_ames()

ames_basic <- ames |>
  rename_with(tolower) |>
  filter(bldg_type == "OneFam",
         house_style == "One_Story",
         year_built >= 2000) |>
  mutate(great_qual = ifelse(overall_qual %in%
           c("Very_Excellent", "Excellent", "Very_Good"),
           1, 0))

Business Question

The MBN Real Estate Group wants to know if the year in which the property was remodeled has an effect on the sales price of the property so that we can appropriately adjust housing prices for different properties.

We will define success as determining whether there is a significant relationship between when the property was remodeled and the total sales price. The model should also account for a few other variables (such as house quality) so that the remodeling effect is not confused with them. We want to improve our sales strategies and see how we may need to adjust pitches and asking prices based on remodeling year. We will use a linear regression model of sales price on remodeling year and housing quality, and use t-tests on the coefficients to assess whether each relationship is significant.

Our audience would consist of the various real estate agents in the firm who are looking to boost and evaluate their sales and sales strategies.

Model Evaluation

The model used in Week 9 has Sales Price as the response variable, with three predictors: year remodeled, an indicator for whether the house is in “great quality”, and an interaction term between the two. The model and its associated t-tests are shown below:

model <- lm(sale_price ~ year_remod_add + great_qual 
            + year_remod_add:great_qual, ames_basic)

tidy(model, conf.int = TRUE)

## # A tibble: 4 × 7
##   term                   estimate std.error statistic p.value conf.low conf.high
##   <chr>                     <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 (Intercept)             -8.30e5  5327026.    -0.156   0.876  -1.13e7  9650105.
## 2 year_remod_add           5.13e2     2656.     0.193   0.847  -4.71e3     5739.
## 3 great_qual              -1.01e7  7545297.    -1.34    0.182  -2.49e7  4755240.
## 4 year_remod_add:great_…   5.09e3     3762.     1.35    0.177  -2.31e3    12488.

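Based on this output, the year the house was remodeled does NOT have a significant relationship with Sales Price when accounting for the house being in “great” quality: the p-value for `year_remod_add` is very high (0.847). Because the model includes an interaction, that coefficient describes only houses that are not in “great” quality (`great_qual` = 0). The interaction term, which captures how the slope differs for “great” quality houses, is not significant either (p = 0.177). However, there are a few issues with this model and process.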
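Since the remodel year enters the model through two terms (the main effect and the interaction), one way to check the conclusion without relying on a single coefficient is a partial F-test that compares the Week 9 model against a reduced model with no remodel-year terms. The following is a minimal sketch, not part of the original Week 9 analysis; the names `model_reduced`, `model_full`, and `joint_test` are ours, and the data preparation is repeated only so the chunk runs on its own.

```r
library(AmesHousing)
library(dplyr)

# Rebuild the filtered data so this chunk runs on its own
ames_basic <- make_ames() |>
  rename_with(tolower) |>
  filter(bldg_type == "OneFam",
         house_style == "One_Story",
         year_built >= 2000) |>
  mutate(great_qual = ifelse(overall_qual %in%
           c("Very_Excellent", "Excellent", "Very_Good"),
           1, 0))

# Reduced model: quality only, no remodel-year terms
model_reduced <- lm(sale_price ~ great_qual, data = ames_basic)

# Full model: the Week 9 specification
model_full <- lm(sale_price ~ year_remod_add + great_qual
                 + year_remod_add:great_qual, data = ames_basic)

# Partial F-test of the two remodel-year terms taken together
joint_test <- anova(model_reduced, model_full)
joint_test
```

A large p-value in the second row of `joint_test` would support the reading that remodel year adds little once “great” quality is in the model; a small one would suggest the individual t-tests are hiding a joint effect.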

Problems

1. This model uses the year itself as a variable when it makes more sense to use years between when the unit was sold and the year it was last remodeled.

This would be a better way of using the year as a variable, especially since properties were sold in different years. For example, a house sold in 2010 that was remodeled in 2005 was remodeled 5 years before the sale. A house sold in 2008 and remodeled in 2003 was also remodeled 5 years before the sale, but the original model would treat them differently:

# Years since remodeled
ames_basic <- ames_basic |>
  mutate(years_since_remod = year_sold - year_remod_add)

# New model and t-tests
model_updated <- lm(sale_price ~ years_since_remod + great_qual +
                    years_since_remod:great_qual,
                    data = ames_basic)

tidy(model_updated, conf.int = TRUE)

## # A tibble: 4 × 7
##   term                  estimate std.error statistic  p.value conf.low conf.high
##   <chr>                    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)            198224.     8272.    24.0   1.21e-73  181950.   214497.
## 2 years_since_remod         552.     2438.     0.226 8.21e- 1   -4245.     5348.
## 3 great_qual             128270.    10835.    11.8   4.21e-27  106954.   149586.
## 4 years_since_remod:gr…   -5666.     3606.    -1.57  1.17e- 1  -12759.     1428.

With the updated model, we still don’t see a significant relationship (p = 0.821 for `years_since_remod` and p = 0.117 for the interaction), but the predictor now captures what we were actually looking for: time since the house was remodeled.
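Before leaning on `years_since_remod`, it may be worth checking what the variable actually contains in this subset. As far as we understand the Ames data dictionary, the remodel year is set equal to the construction year when no remodeling or additions took place, so for houses built in 2000 or later many "remodel" years may simply be build years. The sketch below counts those cases and any negative gaps; `remod_check` and its column names are our own labels, and the data preparation is repeated only so the chunk runs on its own.

```r
library(AmesHousing)
library(dplyr)

# Rebuild the filtered data so this chunk runs on its own
ames_basic <- make_ames() |>
  rename_with(tolower) |>
  filter(bldg_type == "OneFam",
         house_style == "One_Story",
         year_built >= 2000) |>
  mutate(years_since_remod = year_sold - year_remod_add)

# Sanity checks on the new variable
remod_check <- ames_basic |>
  summarize(
    n_houses        = n(),
    # remodel year identical to build year (possibly never remodeled)
    n_same_as_built = sum(year_remod_add == year_built),
    # remodel year recorded after the sale year
    n_negative_gap  = sum(years_since_remod < 0),
    min_gap         = min(years_since_remod),
    max_gap         = max(years_since_remod)
  )

remod_check
```

If `n_same_as_built` is close to `n_houses`, the variable is mostly measuring house age at sale rather than time since a remodel, which would change how the results should be pitched to the sales team.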

2. We can’t look at more nuanced quality differences (`great_qual` looks at Very Excellent, Excellent, and Very Good) in this model. We should keep more quality differences.

ames_basic <- ames_basic |>
  mutate(overall_qual = factor(overall_qual))

model_updated2 <- lm(sale_price ~ years_since_remod * overall_qual,
                     data = ames_basic)

tidy(model_updated2, conf.int = TRUE)

## # A tibble: 14 × 7
##    term                  estimate std.error statistic p.value conf.low conf.high
##    <chr>                    <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
##  1 (Intercept)            212000.   122140.   1.74     0.0836  -28319.   452319.
##  2 years_since_remod      -12000.    25749.  -0.466    0.642   -62664.    38664.
##  3 overall_qualAverage    -76791.   124670.  -0.616    0.538  -322088.   168507.
##  4 overall_qualAbove_Av…  -44634.   123113.  -0.363    0.717  -286868.   197600.
##  5 overall_qualGood         -805.   122350.  -0.00658  0.995  -241536.   239927.
##  6 overall_qualVery_Good   71619.   122327.   0.585    0.559  -169068.   312305.
##  7 overall_qualExcellent  170919.   122567.   1.39     0.164   -70241.   412079.
##  8 overall_qualVery_Exc…  204513.   123242.   1.66     0.0980  -37974.   447000.
##  9 years_since_remod:ov…   18918.    26787.   0.706    0.481   -33788.    71623.
## 10 years_since_remod:ov…   15587.    26376.   0.591    0.555   -36310.    67484.
## 11 years_since_remod:ov…   11662.    25831.   0.451    0.652   -39162.    62486.
## 12 years_since_remod:ov…    7270.    25859.   0.281    0.779   -43609.    58149.
## 13 years_since_remod:ov…    6806.    26231.   0.259    0.795   -44805.    58417.
## 14 years_since_remod:ov…   33284.    26464.   1.26     0.209   -18785.    85354.

Now, we can account for more nuanced quality differences. None of the coefficients are statistically significant at the 0.05 level, and the standard errors are very large, so we are still leaning towards saying that the time since remodeling does not have a significant relationship with sales price.
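The very large standard errors on every quality term are consistent with a baseline quality level that contains only a handful of houses, although that is our guess rather than something the output confirms. A minimal sketch for checking this, and for testing the six interaction terms jointly instead of one at a time, is shown below. The names `qual_counts`, `model_additive`, `model_interact`, and `interaction_test` are ours, and the data preparation is repeated only so the chunk runs on its own.

```r
library(AmesHousing)
library(dplyr)

# Rebuild the filtered data so this chunk runs on its own
ames_basic <- make_ames() |>
  rename_with(tolower) |>
  filter(bldg_type == "OneFam",
         house_style == "One_Story",
         year_built >= 2000) |>
  mutate(years_since_remod = year_sold - year_remod_add,
         overall_qual = factor(overall_qual))  # drops unused levels

# How many houses fall in each quality level?
qual_counts <- count(ames_basic, overall_qual)
qual_counts

# Additive model (one common slope) vs. interaction model (slope per level)
model_additive <- lm(sale_price ~ years_since_remod + overall_qual,
                     data = ames_basic)
model_interact <- lm(sale_price ~ years_since_remod * overall_qual,
                     data = ames_basic)

# Partial F-test: do the interaction terms improve the fit as a group?
interaction_test <- anova(model_additive, model_interact)
interaction_test
```

If the first row of `qual_counts` is tiny, collapsing sparse levels or choosing a better-populated reference level should give more stable estimates. If the F-test is not significant, the simpler additive model is easier to explain to the sales team.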

3. We lack good visuals to present to the real estate sales team

We will present a line chart of average sales price against years since remodeling, colored by broader housing quality groups:

ames_basic <- ames_basic |>
  mutate(qual_group = case_when(
    overall_qual %in% c("Very_Excellent", "Excellent") ~ "High",
    overall_qual %in% c("Very_Good", "Good") ~ "Above Avg",
    overall_qual %in% c("Above_Average", "Average") ~ "Average",
    TRUE ~ "Below Avg"
  ))

plot_df <- ames_basic |>
  group_by(years_since_remod, qual_group) |>
  summarize(avg_price = mean(sale_price, na.rm = TRUE),
            .groups = "drop")

ggplot(plot_df, aes(x = years_since_remod, y = avg_price, color = qual_group)) +
  geom_line(linewidth = 1.2) +
  geom_point() +
  labs(
    title = "Average Sale Price vs. Years Since Remodel",
    subtitle = "Grouped by Housing Quality",
    x = "Years Since Remodel",
    y = "Average Sale Price",
    color = "Quality Group"
  ) +
  theme_minimal() +
  theme(text = element_text(size = 14))

This provides a clean look at how sales price changes with years since remodeling in different housing quality groups. Again, there doesn’t seem to be a strong relationship, though we do see slight decreases for “Above Avg” (“Very Good” and “Good”) and “Below Avg” houses. This communicates a similar message to our linear model but has better visual appeal for our sales team.
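Each point on the chart is an average, and some averages may rest on only one or two sales. A possible refinement, sketched below under that assumption, is to carry the number of houses behind each average and map it to point size, so the sales team can see which parts of each line are well supported. The column `n_houses` and the object `price_plot` are our own names, and the data preparation is repeated only so the chunk runs on its own.

```r
library(AmesHousing)
library(dplyr)
library(ggplot2)

# Rebuild the filtered data so this chunk runs on its own
ames_basic <- make_ames() |>
  rename_with(tolower) |>
  filter(bldg_type == "OneFam",
         house_style == "One_Story",
         year_built >= 2000) |>
  mutate(years_since_remod = year_sold - year_remod_add,
         qual_group = case_when(
           overall_qual %in% c("Very_Excellent", "Excellent") ~ "High",
           overall_qual %in% c("Very_Good", "Good") ~ "Above Avg",
           overall_qual %in% c("Above_Average", "Average") ~ "Average",
           TRUE ~ "Below Avg"
         ))

# Average price per group, plus how many sales each average is based on
plot_df <- ames_basic |>
  group_by(years_since_remod, qual_group) |>
  summarize(avg_price = mean(sale_price, na.rm = TRUE),
            n_houses = n(),
            .groups = "drop")

price_plot <- ggplot(plot_df, aes(x = years_since_remod, y = avg_price,
                                  color = qual_group)) +
  geom_line(linewidth = 1.2) +
  geom_point(aes(size = n_houses)) +  # larger points = more sales
  labs(
    title = "Average Sale Price vs. Years Since Remodel",
    subtitle = "Grouped by Housing Quality; point size shows number of sales",
    x = "Years Since Remodel",
    y = "Average Sale Price",
    color = "Quality Group",
    size = "Number of Sales"
  ) +
  theme_minimal() +
  theme(text = element_text(size = 14))

price_plot
```

Small points at the ends of a line would suggest that an apparent rise or dip there reflects a few individual houses rather than a trend.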

4. Ethical Concerns

One major issue with this dataset is that it only contains data for Ames, Iowa, so we can’t necessarily generalize to real estate markets outside of this region, which have different socioeconomic factors. The models also leave out things like neighborhood and house size, so we could be incorrectly attributing price differences to remodeling or quality. We also filter to only single-story, one-family buildings built in 2000 or later, so we can’t generalize to other building types or older homes either.

We also have the issue of trying to explain a complex system of housing prices with a simple model, when there are countless outside factors that influence housing prices. Some of these, such as buyer preference, we can’t capture at all. If we try to include everything, we instead run the risk of an overcomplicated model that is very hard to interpret. If we rely too heavily on this dataset, we run the risk of negatively affecting the housing market through our pricing if we don’t include proper analysis and context with our models.

Home buyers and sellers, real estate agents, and housing developers could be affected by the findings. If results are misinterpreted or taken out of context, it can have significant financial consequences.

Week14_DataDive

2026-04-21
