For this week’s data dive, I am collaborating with two classmates, Matthew and Broderick, during our weekly lab time to complete a model critique of the week 9 lab notes.
This data dive is broken into three parts: a business scenario, a model critique, and a discussion of ethical and epistemological concerns.
For this scenario, we are acting as a real estate company that aims to reorganize its dataset of houses in the Ames, Iowa area.
Our Audience: The team of real estate agents that work at our firm/company.
Our Problem Statement: The real estate firm/company needs to better understand aspects of the data such as remodeling year and house quality in order to optimize pricing.
The scope: The house quality and remodel year columns are the key variables we want to focus on for this business scenario. The main assumption is that the data is accurate and usable, which also means assuming that every house is in the Ames, Iowa area.
Objective: Our success criterion is determining whether remodel year and house quality have a significant connection to the price of the house.
Improved Analyses:
library(dplyr)  # provides %>% and mutate()
library(broom)  # provides tidy()

# Create a "years_since_remodel" variable, measured from the current calendar year
current_year <- as.numeric(format(Sys.Date(), "%Y"))
ames_basic <- ames_basic %>%
  mutate(years_since_remodel = current_year - year_remod_add)

# Refit the model with the new variable
model_updated_1 <- lm(sale_price ~ years_since_remodel + great_qual +
                        years_since_remodel:great_qual, data = ames_basic)
tidy(model_updated_1, conf.int = TRUE)
## # A tibble: 4 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 210088. 55047. 3.82 1.62e-4 101792. 318384.
## 2 years_since_remodel -513. 2656. -0.193 8.47e-1 -5739. 4713.
## 3 great_qual 218417. 76353. 2.86 4.50e-3 68204. 368629.
## 4 years_since_remodel:g… -5088. 3762. -1.35 1.77e-1 -12488. 2313.
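The variable above is measured from today's date, which adds the same constant to every house. That leaves the slope and interaction estimates unchanged but shifts the intercept and the quality main effect. One possible refinement is to measure remodel age at the time of sale instead. The sketch below assumes a `year_sold` column, which does not appear in the code above and may be named differently in `ames_basic`; the rows are made up for illustration.

```r
# Hypothetical example rows, not taken from ames_basic.
# year_sold is an ASSUMED column name; check the real dataset before using it.
example_sales <- data.frame(
  year_sold      = c(2006, 2008, 2010),
  year_remod_add = c(1995, 2008, 1970)
)

# Remodel age as the buyer would have seen it on the sale date
example_sales$years_since_remodel_at_sale <-
  example_sales$year_sold - example_sales$year_remod_add

print(example_sales)
```

Because the sale year varies by house, this version can change every coefficient, so the model would need to be refit before comparing results.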
# Use full dataset with all quality levels
model_updated_2 <- lm(sale_price ~ years_since_remodel + overall_qual +
                        years_since_remodel:overall_qual, data = ames_basic)
tidy(model_updated_2, conf.int = TRUE)
## # A tibble: 14 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 932000. 1676385. 0.556 0.579 -2.37e6 4230408.
## 2 years_since_remodel -36000. 77950. -0.462 0.645 -1.89e5 117373.
## 3 overall_qualAverage -956944. 1688835. -0.567 0.571 -4.28e6 2365960.
## 4 overall_qualAbove_Av… -743190. 1680536. -0.442 0.659 -4.05e6 2563385.
## 5 overall_qualGood -710703. 1677022. -0.424 0.672 -4.01e6 2588959.
## 6 overall_qualVery_Good -583863. 1677144. -0.348 0.728 -3.88e6 2716038.
## 7 overall_qualExcellent -475796. 1678461. -0.283 0.777 -3.78e6 2826696.
## 8 overall_qualVery_Exc… -877760. 1681800. -0.522 0.602 -4.19e6 2431303.
## 9 years_since_remodel:… 44539. 78564. 0.567 0.571 -1.10e5 199119.
## 10 years_since_remodel:… 35277. 78159. 0.451 0.652 -1.19e5 189060.
## 11 years_since_remodel:… 35469. 77982. 0.455 0.650 -1.18e5 188905.
## 12 years_since_remodel:… 32331. 77991. 0.415 0.679 -1.21e5 185783.
## 13 years_since_remodel:… 31825. 78070. 0.408 0.684 -1.22e5 185433.
## 14 years_since_remodel:… 56023. 78248. 0.716 0.475 -9.79e4 209980.
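None of the individual terms in this table has a small p-value, but term-by-term tests are not the same as asking whether the interaction terms matter as a group. One way to check this is a partial F-test comparing the model with and without the interaction. The sketch below is only an illustration on simulated data: `sim_homes`, its three quality levels, and all the numbers used to generate it are hypothetical stand-ins, not the real `ames_basic` values.

```r
# Simulated stand-in for ames_basic (hypothetical values, for illustration only)
set.seed(42)
n <- 300
quality_levels <- c("Average", "Good", "Excellent")
sim_homes <- data.frame(
  years_since_remodel = sample(15:75, n, replace = TRUE),
  overall_qual = factor(sample(quality_levels, n, replace = TRUE),
                        levels = quality_levels)
)

# Made-up price process: older remodels sell for less, higher quality for more
qual_premium <- c(Average = 0, Good = 40000, Excellent = 90000)
sim_homes$sale_price <- 180000 -
  800 * sim_homes$years_since_remodel +
  unname(qual_premium[as.character(sim_homes$overall_qual)]) +
  rnorm(n, mean = 0, sd = 25000)

# Reduced model: main effects only
model_main <- lm(sale_price ~ years_since_remodel + overall_qual,
                 data = sim_homes)

# Full model: main effects plus the interaction
model_interaction <- lm(sale_price ~ years_since_remodel * overall_qual,
                        data = sim_homes)

# Partial F-test: do the interaction terms jointly improve the fit?
comparison <- anova(model_main, model_interaction)
print(comparison)
```

Applied to the real data, the same two-model comparison would replace `sim_homes` with `ames_basic`. A large p-value in the second row would suggest that the simpler main-effects model is adequate.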
library(ggplot2)
ggplot(ames_basic, aes(x = years_since_remodel, y = sale_price, color = factor(overall_qual))) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Impact of Remodel Recency on Sale Price by House Quality",
    x = "Years Since Remodel",
    y = "Sale Price",
    color = "Quality Rating"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
ggplot(ames_basic, aes(x = factor(overall_qual), y = sale_price)) +
  geom_boxplot(fill = "steelblue", alpha = 0.7) +
  labs(
    title = "Sale Price Distribution by House Quality",
    x = "Overall Quality",
    y = "Sale Price"
  ) +
  theme_minimal()
Overcoming biases:
The main bias lies in the house quality variable, which is subjective and assigned by human judgement. Additionally, the remodel year may not reflect a significant remodel; it could record a small change rather than a major renovation of the house.
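A related wrinkle is that the Ames data documentation describes the remodel date as equal to the construction date when no remodeling or additions took place, so some "remodel years" are not remodels at all. A minimal sketch of flagging this is shown below. It assumes a `year_built` column, which is not used elsewhere in this analysis, and the rows are made up for illustration.

```r
# Hypothetical example rows, not taken from ames_basic.
# year_built is an ASSUMED column name for the construction year.
example_homes <- data.frame(
  year_built     = c(1960, 1995, 2003, 1978),
  year_remod_add = c(1960, 2005, 2003, 1999)
)

# TRUE only when the recorded remodel year is later than the build year
example_homes$was_remodeled <-
  example_homes$year_remod_add > example_homes$year_built

print(example_homes)
```

A flag like this would let the model separate houses that were actually remodeled from houses that are simply newer, although it still says nothing about how large each remodel was.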
Possible risks or societal implications:
One risk is systematically undervaluing houses in lower-quality categories. Additionally, placing too much weight on the quality rating could lead to substantial price inflation.
Crucial Issues which might not be measurable:
Many factors are missing from the data because they are difficult to measure. Location, for example, would be very difficult to value. Other examples are proximity to certain amenities and the emotional appeal of a home.
Who would be affected by this project, and how does that affect our critique:
Our primary audience, the real estate agents, benefits from the pricing guidance that our model and analysis provide. The main risk is that they could become too confident in the output of the model.
Home buyers are also affected, as they would face standardized prices that could either raise or lower housing prices in the area.