library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(AmesHousing)
library(boot)
library(broom)
library(lindia)
library(GGally)
# remove scientific notation
options(scipen = 6)
# default theme, unless otherwise noted
theme_set(theme_minimal())
ames <- make_ames()
ames_basic <- ames |>
  rename_with(tolower) |>
  filter(bldg_type == "OneFam",
         house_style == "One_Story",
         year_built >= 2000) |>
  mutate(great_qual = ifelse(overall_qual %in%
                               c("Very_Excellent", "Excellent", "Very_Good"),
                             1, 0))
The MBN Real Estate Group wants to know whether the year in which a property was remodeled is associated with its sale price, so that we can adjust asking prices appropriately for different properties.
We will consider the analysis successful if it establishes whether there is a statistically significant relationship between when a property was remodeled and its sale price. The model should also account for other variables, such as house quality, so that the comparison holds across different kinds of properties. The goal is to improve our sales strategies and to see how pitches and asking prices may need to change with remodeling year. We would use a linear regression model of sale price on remodel year and housing quality, and use t-tests on the coefficients to assess the significance of the relationship.
Our audience consists of the real estate agents in the firm who want to evaluate and improve their sales and sales strategies.
The model used in Week 9 has sale price as the response variable, with the year remodeled, an indicator for whether the house is of “great” quality, and an interaction term between the two as predictors. The model and the associated t-tests are shown below:
model <- lm(sale_price ~ year_remod_add + great_qual
            + year_remod_add:great_qual,
            data = ames_basic)
tidy(model, conf.int = TRUE)
## # A tibble: 4 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -8.30e5 5327026. -0.156 0.876 -1.13e7 9650105.
## 2 year_remod_add 5.13e2 2656. 0.193 0.847 -4.71e3 5739.
## 3 great_qual -1.01e7 7545297. -1.34 0.182 -2.49e7 4755240.
## 4 year_remod_add:great_… 5.09e3 3762. 1.35 0.177 -2.31e3 12488.
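Based on this output, the year the house was remodeled does not have a statistically significant relationship with sale price once the “great” quality indicator is accounted for: the p-value for year_remod_add is very high (0.847), and the interaction term is not significant either (p = 0.177). Note that, because of the interaction, the year_remod_add coefficient describes only houses that are not of “great” quality. However, there are a few issues with this model and process.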
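To make the reading of the interaction model concrete: the year_remod_add coefficient is the slope for houses with great_qual = 0, and the slope for “great” quality houses is that coefficient plus the interaction coefficient. With the estimates in the table above, that sum is roughly 513 + 5,090, or about 5,600 dollars per year. The sketch below illustrates the same arithmetic on simulated data; the toy data frame, its simulated values, and the helper name group_slopes are assumptions made for illustration and are not part of the original analysis.

```r
# Toy data standing in for ames_basic (simulated, NOT the Ames data)
set.seed(1)
n <- 200
toy <- data.frame(
  year_remod_add = sample(2000:2010, n, replace = TRUE),
  great_qual     = rbinom(n, 1, 0.5)
)
toy$sale_price <- 200000 +
  500  * (toy$year_remod_add - 2000) +
  5000 * (toy$year_remod_add - 2000) * toy$great_qual +
  rnorm(n, sd = 20000)

# Same model structure as in the report: main effects plus interaction
fit <- lm(sale_price ~ year_remod_add * great_qual, data = toy)

# Hypothetical helper: slope of price on remodel year within each group
group_slopes <- function(model) {
  b <- coef(model)
  c(not_great = unname(b["year_remod_add"]),
    great     = unname(b["year_remod_add"] +
                         b["year_remod_add:great_qual"]))
}

group_slopes(fit)
```

The t-test on year_remod_add alone therefore speaks only to the non-great houses; a statement about great-quality houses has to combine both coefficients.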
Problems
1. This model uses the year itself as a variable, when it makes more sense to use the number of years between the last remodel and the sale.
This is a better way to use the year as a variable, especially since properties were sold in different years. For example, a house sold in 2010 that was remodeled in 2005 was remodeled 5 years before the sale. A house sold in 2008 and remodeled in 2003 was also remodeled 5 years before the sale, but the original model would treat the two differently:
# Years since remodeled
ames_basic <- ames_basic |>
  mutate(years_since_remod = year_sold - year_remod_add)
# New model and t-tests
model_updated <- lm(sale_price ~ years_since_remod + great_qual +
                      years_since_remod:great_qual,
                    data = ames_basic)
tidy(model_updated, conf.int = TRUE)
## # A tibble: 4 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 198224. 8272. 24.0 1.21e-73 181950. 214497.
## 2 years_since_remod 552. 2438. 0.226 8.21e- 1 -4245. 5348.
## 3 great_qual 128270. 10835. 11.8 4.21e-27 106954. 149586.
## 4 years_since_remod:gr… -5666. 3606. -1.57 1.17e- 1 -12759. 1428.
With the adjusted model we still don’t see a statistically significant relationship (p = 0.821 for years_since_remod and p = 0.117 for the interaction), but the new variable captures what we were actually looking for: time since the house was remodeled.
2. Quality is reduced to a single indicator (great_qual looks at Very Excellent, Excellent, and Very Good) in this model. We should keep more quality differences.
# Re-create the factor so quality levels absent from the filtered data are dropped
ames_basic <- ames_basic |>
  mutate(overall_qual = factor(overall_qual))
model_updated2 <- lm(sale_price ~ years_since_remod * overall_qual,
                     data = ames_basic)
tidy(model_updated2, conf.int = TRUE)
## # A tibble: 14 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 212000. 122140. 1.74 0.0836 -28319. 452319.
## 2 years_since_remod -12000. 25749. -0.466 0.642 -62664. 38664.
## 3 overall_qualAverage -76791. 124670. -0.616 0.538 -322088. 168507.
## 4 overall_qualAbove_Av… -44634. 123113. -0.363 0.717 -286868. 197600.
## 5 overall_qualGood -805. 122350. -0.00658 0.995 -241536. 239927.
## 6 overall_qualVery_Good 71619. 122327. 0.585 0.559 -169068. 312305.
## 7 overall_qualExcellent 170919. 122567. 1.39 0.164 -70241. 412079.
## 8 overall_qualVery_Exc… 204513. 123242. 1.66 0.0980 -37974. 447000.
## 9 years_since_remod:ov… 18918. 26787. 0.706 0.481 -33788. 71623.
## 10 years_since_remod:ov… 15587. 26376. 0.591 0.555 -36310. 67484.
## 11 years_since_remod:ov… 11662. 25831. 0.451 0.652 -39162. 62486.
## 12 years_since_remod:ov… 7270. 25859. 0.281 0.779 -43609. 58149.
## 13 years_since_remod:ov… 6806. 26231. 0.259 0.795 -44805. 58417.
## 14 years_since_remod:ov… 33284. 26464. 1.26 0.209 -18785. 85354.
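Now we can account for more nuanced quality differences. None of the 14 coefficients is statistically significant at the 0.05 level, so we are still leaning towards saying that the time since remodeling does not have a significant relationship with sales price. The very large standard errors (around 122,000 on the intercept and quality terms) suggest that the reference quality level contains very few houses, so the individual estimates are imprecise.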
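With this many coefficients, individual t-tests are hard to read together. One option, offered here only as a sketch and not as something the original analysis ran, is a single nested-model F-test asking whether all of the years_since_remod terms jointly add anything beyond quality alone. The example uses simulated data (three made-up quality levels and no true time effect); the object names reduced, full, and joint_test are illustrative.

```r
# Simulated stand-in for ames_basic (NOT the Ames data); price depends on
# quality only, so the time terms should usually not test as significant
set.seed(2)
n <- 300
toy <- data.frame(
  years_since_remod = sample(0:8, n, replace = TRUE),
  overall_qual = factor(sample(c("Average", "Good", "Excellent"),
                               n, replace = TRUE))
)
qual_effect <- c(Average = 0, Good = 40000, Excellent = 120000)
toy$sale_price <- 180000 +
  unname(qual_effect[as.character(toy$overall_qual)]) +
  rnorm(n, sd = 25000)

# Reduced model: quality only. Full model: quality, time, and interactions.
reduced <- lm(sale_price ~ overall_qual, data = toy)
full    <- lm(sale_price ~ years_since_remod * overall_qual, data = toy)

# F-test of the full model against the reduced model
joint_test <- anova(reduced, full)
joint_test
```

Applied to the real data, the analogous comparison would be between lm(sale_price ~ overall_qual) and model_updated2, giving one p-value for the whole question of whether time since remodeling matters.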
We will present the results as a line chart of average sale price against years since remodeling, with one colored line per (coarser) housing-quality group:
ames_basic <- ames_basic |>
  mutate(qual_group = case_when(
    overall_qual %in% c("Very_Excellent", "Excellent") ~ "High",
    overall_qual %in% c("Very_Good", "Good") ~ "Above Avg",
    overall_qual %in% c("Above_Average", "Average") ~ "Average",
    TRUE ~ "Below Avg"
  ))
plot_df <- ames_basic |>
  group_by(years_since_remod, qual_group) |>
  summarize(avg_price = mean(sale_price, na.rm = TRUE),
            .groups = "drop")
ggplot(plot_df, aes(x = years_since_remod, y = avg_price, color = qual_group)) +
  # linewidth replaces the size aesthetic for lines (deprecated in ggplot2 3.4.0)
  geom_line(linewidth = 1.2) +
  geom_point() +
  labs(
    title = "Average Sale Price vs. Years Since Remodel",
    subtitle = "Grouped by Housing Quality",
    x = "Years Since Remodel",
    y = "Average Sale Price",
    color = "Quality Group"
  ) +
  theme_minimal() +
  theme(text = element_text(size = 14))
This provides a clean look at how average sale price varies with years since remodeling within each housing-quality group. Again, there doesn’t seem to be a strong relationship, though we do see slight decreases for “Above Avg” (“Very Good” and “Good”) and “Below Avg” houses. This communicates a similar message to our linear model in a form that is more visually appealing for our sales team.
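Each point on the chart is a mean over however many houses fall into that combination of years since remodel and quality group, so some points may rest on only one or two sales. A small sketch of how the cell sizes could be checked alongside the averages follows; it uses simulated data, and the n_houses column and the threshold of 3 are arbitrary assumptions for illustration.

```r
library(dplyr)

# Simulated stand-in for ames_basic (NOT the Ames data)
set.seed(3)
toy <- tibble(
  years_since_remod = sample(0:6, 60, replace = TRUE),
  qual_group = sample(c("High", "Above Avg", "Average"), 60, replace = TRUE),
  sale_price = rnorm(60, mean = 250000, sd = 40000)
)

# Same summary as plot_df, plus the number of houses behind each average
plot_df_n <- toy |>
  group_by(years_since_remod, qual_group) |>
  summarize(avg_price = mean(sale_price),
            n_houses  = n(),
            .groups = "drop")

# Cells backed by fewer than 3 sales (threshold is an arbitrary assumption)
filter(plot_df_n, n_houses < 3)
```

Points from very small cells could then be flagged or dropped before the chart is shown to the sales team, so that apparent dips are not read as trends.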
One major issue with this dataset is that it only contains data for Ames, Iowa, so we can’t necessarily generalize to real estate markets outside this region, where socioeconomic factors differ. The models also leave out variables such as neighborhood and house size, so we could be incorrectly attributing price differences to remodeling. We also filter to single-story, one-family homes built in 2000 or later, so we can’t generalize to other building types or older homes either.
We also face the issue of trying to explain a complex system of housing prices with a simple model, when many outside factors influence prices. Some of these, such as buyer preference, we can’t really capture at all. At the other extreme, we run the risk of an overcomplicated model that is very hard to interpret. If we rely too heavily on this dataset and set prices without proper analysis and context for our models, we risk negatively affecting the housing market.
Home buyers and sellers, real estate agents, and housing developers could all be affected by the findings. If results are misinterpreted or taken out of context, they can have significant financial consequences.