Week 14 Data Dive - Model Critique

For this week’s data dive, I will be collaborating with two classmates, Matthew and Broderick, during our weekly lab time to complete a model critique of the week 9 lab notes.

This data dive is broken into three parts: a business scenario, a model critique, and a discussion of ethical and epistemological concerns.

Goal 1 - Business Scenario

For the sake of this scenario, we are a real estate company aiming to reorganize our dataset of houses in the Ames, Iowa area.

Our Audience: The team of real estate agents that work at our firm/company.

Our Problem Statement: The real estate firm/company needs to better understand aspects of the data such as remodeling year and house quality in order to optimize pricing.

The scope: The house quality and remodel year columns are the key variables we want to focus on for this business scenario. We assume that the data is accurate and usable, and that all of the houses are in the Ames, Iowa area.

Objective: Our success criterion is determining whether there is a significant connection between remodel year, house quality, and the sale price of the house.

Goal 2 - Model Critique

Improved Analyses:

  1. The first thing we would like to do is replace the remodel year column with a years-since-remodel column.
library(dplyr)  # provides %>% and mutate()
library(broom)  # provides tidy()

# Create a "years_since_remodel" variable
current_year <- as.numeric(format(Sys.Date(), "%Y"))

ames_basic <- ames_basic %>%
  mutate(years_since_remodel = current_year - year_remod_add)

# Refit the model with the new variable
model_updated_1 <- lm(sale_price ~ years_since_remodel + great_qual +
                        years_since_remodel:great_qual, data = ames_basic)

tidy(model_updated_1, conf.int = TRUE)
## # A tibble: 4 × 7
##   term                   estimate std.error statistic p.value conf.low conf.high
##   <chr>                     <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 (Intercept)             210088.    55047.     3.82  1.62e-4  101792.   318384.
## 2 years_since_remodel       -513.     2656.    -0.193 8.47e-1   -5739.     4713.
## 3 great_qual              218417.    76353.     2.86  4.50e-3   68204.   368629.
## 4 years_since_remodel:g…   -5088.     3762.    -1.35  1.77e-1  -12488.     2313.
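
One caveat with the variable above: subtracting from the current calendar year measures the time since the remodel as of today, not as of the sale. A minimal sketch of an alternative is shown below. It assumes the dataset carries a sale-year column, called year_sold here (an assumed name, not confirmed by the lab notes), and it uses a made-up toy data frame rather than the Ames data.

```r
library(dplyr)

# Toy data for illustration only; these values are not from the Ames dataset.
# "year_sold" is an assumed column name.
toy_houses <- data.frame(
  year_remod_add = c(1995, 2003, 2008),
  year_sold      = c(2007, 2009, 2008)
)

# Years between the remodel and the sale, rather than between the remodel and today
toy_houses <- toy_houses %>%
  mutate(years_since_remodel_at_sale = year_sold - year_remod_add)

toy_houses
```

Measuring the gap at the time of sale keeps the variable stable no matter when the analysis is rerun.
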
  2. The second thing we would like to do is replace the “great_qual” indicator with the full “overall_qual” rating, so that houses at every quality level are represented. This will widen the reach of our work and give our real estate agents a greater understanding of every house in the area instead of just the higher quality ones.
# Use all quality levels (overall_qual) in place of great_qual
model_updated_2 <- lm(sale_price ~ years_since_remodel + overall_qual +
                        years_since_remodel:overall_qual, data = ames_basic)

tidy(model_updated_2, conf.int = TRUE)
## # A tibble: 14 × 7
##    term                  estimate std.error statistic p.value conf.low conf.high
##    <chr>                    <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
##  1 (Intercept)            932000.  1676385.     0.556   0.579  -2.37e6  4230408.
##  2 years_since_remodel    -36000.    77950.    -0.462   0.645  -1.89e5   117373.
##  3 overall_qualAverage   -956944.  1688835.    -0.567   0.571  -4.28e6  2365960.
##  4 overall_qualAbove_Av… -743190.  1680536.    -0.442   0.659  -4.05e6  2563385.
##  5 overall_qualGood      -710703.  1677022.    -0.424   0.672  -4.01e6  2588959.
##  6 overall_qualVery_Good -583863.  1677144.    -0.348   0.728  -3.88e6  2716038.
##  7 overall_qualExcellent -475796.  1678461.    -0.283   0.777  -3.78e6  2826696.
##  8 overall_qualVery_Exc… -877760.  1681800.    -0.522   0.602  -4.19e6  2431303.
##  9 years_since_remodel:…   44539.    78564.     0.567   0.571  -1.10e5   199119.
## 10 years_since_remodel:…   35277.    78159.     0.451   0.652  -1.19e5   189060.
## 11 years_since_remodel:…   35469.    77982.     0.455   0.650  -1.18e5   188905.
## 12 years_since_remodel:…   32331.    77991.     0.415   0.679  -1.21e5   185783.
## 13 years_since_remodel:…   31825.    78070.     0.408   0.684  -1.22e5   185433.
## 14 years_since_remodel:…   56023.    78248.     0.716   0.475  -9.79e4   209980.
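
To judge whether interaction terms like the ones above are worth keeping, one option is a nested-model F-test. The sketch below is only an illustration of that idea: it uses simulated data (every name and value is invented) rather than the Ames data, and compares a main-effects model to one with the interaction using anova().

```r
set.seed(42)

# Simulated stand-in data; all values are invented for illustration.
n <- 200
sim_houses <- data.frame(
  years_since_remodel = runif(n, 0, 60),
  quality = factor(
    sample(c("Average", "Good", "Excellent"), n, replace = TRUE),
    levels = c("Average", "Good", "Excellent")
  )
)
sim_houses$sale_price <- 150000 -
  800 * sim_houses$years_since_remodel +
  40000 * (as.integer(sim_houses$quality) - 1) +
  rnorm(n, sd = 20000)

# Main effects only vs. main effects plus the interaction
fit_main <- lm(sale_price ~ years_since_remodel + quality, data = sim_houses)
fit_int  <- lm(sale_price ~ years_since_remodel * quality, data = sim_houses)

# F-test: does the interaction explain additional variance?
model_comparison <- anova(fit_main, fit_int)
model_comparison
```

A single test on the whole block of interaction terms is usually easier to explain to a non-technical audience than a dozen separate p-values.
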
  3. The final change we would like to make is to add visualizations that convey our message more effectively to the real estate agents.
library(ggplot2)

ggplot(ames_basic, aes(x = years_since_remodel, y = sale_price, color = factor(overall_qual))) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Impact of Remodel Recency on Sale Price by House Quality",
    x = "Years Since Remodel",
    y = "Sale Price",
    color = "Quality Rating"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

ggplot(ames_basic, aes(x = factor(overall_qual), y = sale_price)) +
  geom_boxplot(fill = "steelblue", alpha = 0.7) +
  labs(
    title = "Sale Price Distribution by House Quality",
    x = "Overall Quality",
    y = "Sale Price"
  ) +
  theme_minimal()
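
Since the audience is real estate agents, it may also help to show prices in dollar format on the axis. The following is a small sketch of that idea using toy data (the values are invented) and scales::label_dollar() from the scales package, which would need to be installed.

```r
library(ggplot2)

# Toy data for illustration only; these values are not from the Ames dataset.
toy_prices <- data.frame(
  quality    = factor(c("Average", "Average", "Good", "Good"),
                      levels = c("Average", "Good")),
  sale_price = c(120000, 150000, 210000, 260000)
)

# Same boxplot style as above, with the y axis labelled in dollars
price_plot <- ggplot(toy_prices, aes(x = quality, y = sale_price)) +
  geom_boxplot(fill = "steelblue", alpha = 0.7) +
  scale_y_continuous(labels = scales::label_dollar()) +
  labs(x = "Overall Quality", y = "Sale Price") +
  theme_minimal()

price_plot
```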

Goal 3 - Ethical and Epistemological Concerns

Overcoming biases:

The main bias present is in the quality of houses. The current quality variable is subjective and assigned by human judgement. Additionally, the remodel year may not reflect a significant remodel; it could record a minor update rather than a major renovation.

Possible risks or societal implications:

One risk is systematically undervaluing houses in lower-quality categories. Additionally, placing too much weight on the quality rating could lead to inflated prices.

Crucial Issues which might not be measurable:

The data does not account for many factors because they are difficult to measure. The value of a location, for example, is very hard to quantify. Other examples include proximity to certain amenities and the emotional appeal of a home.

Who would be affected by this project, and how does that affect our critique:

Our primary audience, the real estate agents, benefits from the pricing guidance that our model and analysis provide. The main risk is that they could become too confident in the output of the model.
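
One way to guard against that overconfidence is to report a range alongside each predicted price. The sketch below is an illustration on simulated data (all values are invented, and the model is deliberately simpler than the ones above); it uses predict() with interval = "prediction" to get a 95% prediction interval for a single house.

```r
set.seed(1)

# Simulated stand-in data; all values are invented for illustration.
sim <- data.frame(years_since_remodel = runif(100, 0, 60))
sim$sale_price <- 200000 - 1000 * sim$years_since_remodel +
  rnorm(100, sd = 25000)

fit <- lm(sale_price ~ years_since_remodel, data = sim)

# Point estimate plus a 95% prediction interval for one hypothetical house
new_house <- data.frame(years_since_remodel = 10)
pred <- predict(fit, newdata = new_house, interval = "prediction", level = 0.95)
pred
```

The lwr and upr columns give agents a plausible price range instead of a single number.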

Home buyers are also affected, since more standardized pricing could either raise or lower housing prices in the area.