Model Critique

For this lab, you’ll be working with a group of other classmates, and each group will be assigned to critique a lab from a previous week.

Your group will have three goals:

- Create an explicit business scenario which might leverage the data (and methods) used in the lab.
- Critique the models (or analyses) present in the lab based on this scenario.
- Devise a list of ethical and epistemological concerns that might pertain to this lab in the context of your business scenario.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)
library(ggrepel)
library(AmesHousing)
library(boot)
library(broom)
library(lindia)

ames <- make_ames()
ames <- ames |> rename_with(tolower)

Goal 1: Business Scenario

First, create your own context for the lab. This should be a business use-case such as “a real estate firm aims to present housing trends (and recommendations) for their clients in Ames, Iowa”.

You do not need to solve the problem, you only need to define it.

Your scenario should include the following:

- Customer or Audience: who exactly will use your results?
- Problem Statement: reference this article to help you write a SMART problem statement.
  - E.g., the statement “we need to analyze sales data” is not a good problem summary/statement, but “for <this> reason, the company needs to know if they should stop selling product A …” is on a better track.
- Scope: What variables from the data (in the lab) can address the issue presented in your problem statement? What analyses would you use? You’ll need to define any assumptions you feel need to be made before you move forward.
  - If you feel the content in the lab cannot sufficiently address the issue, try to devise a more applicable problem statement.
- Objective: Define your success criteria. In other words, suppose you started working on this problem in earnest; how will you know when you are done? For example, you might want to “identify the factors that most influence <some variable>.”
  - Note: words like “identify”, “maximize”, “determine”, etc. could be useful here. Feel free to find the right action verbs that work for you!

Lab 8 Business Scenario

The second half of Lab 8 introduces linear regression using the Ames housing data set. A real estate firm could use the same data and methods to reduce the time its agents spend discussing basic price estimates with clients: the firm would build a model that people looking into selling their house can use to generate an estimate themselves. Because users should be able to get an estimate without agent assistance, the model should be as simple as possible and use only a few variables. The objective in this scenario is a useful linear regression model of sale price that does not stray too far from the necessary assumptions of linear regression.

Goal 2: Model Critique

Since this is a class, and not a workplace, we need to be careful not to present too much information to you all at once. For this reason, our labs are often not as analytically rigorous or thorough as they might be in practice … So here, your goal is to:

Present a list of at least 3 (improved) analyses you would recommend for your business scenario. Each proposed analysis should be accompanied by a “proof of concept” R implementation. (As usual, execute R code blocks here in the RMarkdown file.)

In the lab your group has been assigned, consider issues with models, statistical improvements, interpretations, analyses, visualizations, etc. Use this notebook as a sandbox for trying out different code, and investigating the data from a different perspective. Take notes on all the issues you see, and propose your solutions (even if you might need to request more data or resources to accomplish those solutions).

You’ll want to consider the following:

- Analytical issues, such as the current model assumptions.
- Issues with the data itself.
- Statistical improvements; what do we know now that we didn’t know (or at least didn’t use) then? Are there other methods that would be appropriate?
- Are there better visualizations which could have been used?

Feel free to use the reading for the week associated with your assigned lab to help refresh your memory on the concepts presented.

Lab 8 Model Critique

In the lab notebook, the Ames dataset is filtered by building type, house style, and year built before the model is created, and no explanation is given for these filters. In a deeper analysis of this dataset, we would examine these variables first to decide whether to subset the data or to include them in the model as predictors.

ames |>
  ggplot() +
  geom_bar(mapping = aes(x = house_style)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "House Style", y = "Count")

ames |>
  group_by(house_style) |>
  summarize(mean_price = mean(sale_price)) |>
  ggplot() +
  geom_col(mapping = aes(x = house_style, y = mean_price)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  labs(x = "House Style", y = "Mean Price")

summary(aov(sale_price ~ house_style, data = ames))

##               Df    Sum Sq   Mean Sq F value Pr(>F)    
## house_style    7 1.448e+12 2.068e+11   35.04 <2e-16 ***
## Residuals   2922 1.725e+13 5.902e+09                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This analysis is a first step in deciding how to handle house style in our model. The classes are heavily imbalanced, with one-story homes by far the most common. In addition, the bar chart of mean price by house style and the ANOVA test (F = 35.04, p < 2e-16) both indicate that the expected price of a house differs by style. If we expect most of the houses that people submit for an estimate to be one-story, as they are in the data, it could make sense to remove the other house styles from the dataset before building the model. However, this decision would require further investigation, as would the filters on building type and year built.
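As a possible next step, the class imbalance and the price differences can be put side by side in one table, and a boxplot can show the spread within each style that a bar chart of means hides. This is only a sketch: the summary column names (`n`, `share`, `median_price`, and so on) are our own choices, and the log scale on the price axis is an assumption made to keep the expensive houses from dominating the plot.

```r
library(tidyverse)
library(AmesHousing)

ames <- make_ames() |> rename_with(tolower)

# Count, share of all sales, and price summaries for each house style
style_summary <- ames |>
  group_by(house_style) |>
  summarize(
    n = n(),
    share = n / nrow(ames),
    median_price = median(sale_price),
    mean_price = mean(sale_price),
    sd_price = sd(sale_price)
  ) |>
  arrange(desc(n))

style_summary

# Boxplots ordered from most to least common style; log scale is an assumption
ames |>
  ggplot() +
  geom_boxplot(mapping = aes(x = fct_infreq(house_style), y = sale_price)) +
  scale_y_log10() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "House Style", y = "Sale Price (log scale)")
```

Styles with very small counts deserve extra caution, since their means and medians rest on only a handful of sales.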

Many other variables in the Ames dataset were also left out of the lab's model. We would want to investigate the correlations between these explanatory variables and our response variable, sale price, to decide whether any of them should be included.

ames |>
  ggplot() +
  geom_point(mapping = aes(x = year_built, y = sale_price)) +
  labs(x = "Year Built", y = "Sale Price")

round(cor(ames$year_built, ames$sale_price), 3)

## [1] 0.558

Here, we can see a moderate positive correlation (r = 0.558) between the year a house was built and its sale price, although the trend is clearly not linear, and the variance seems to change over time rather than staying constant. Additionally, the model built in the lab used only houses built after a certain point, so the influence of year built on price may not be as notable in that model. However, investigations like this one across many variables in the Ames dataset would allow us to make a better-informed decision about what to include in and exclude from the model.
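One way to extend this check to many variables at once is to compute the correlation of every numeric column with sale price and rank the results. The sketch below assumes Pearson correlation is an acceptable first screen; as the year-built plot shows, it only captures linear association, so a high-ranking variable still needs a scatterplot before it goes into the model. The name `price_cors` is our own.

```r
library(tidyverse)
library(AmesHousing)

ames <- make_ames() |> rename_with(tolower)

# Pearson correlation of every numeric column with sale price,
# sorted by absolute strength
price_cors <- ames |>
  select(where(is.numeric)) |>
  cor() |>
  as_tibble(rownames = "variable") |>
  select(variable, r = sale_price) |>
  filter(variable != "sale_price") |>
  arrange(desc(abs(r)))

head(price_cors, 10)
```

Categorical variables such as house style are not covered by this screen and would need a separate comparison, for example the ANOVA approach used earlier.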

The notebook did not produce any diagnostic plots for its models, so it never checked whether the necessary assumptions of linear regression were adequately met.

ames_basic <- ames |>
  filter(bldg_type == "OneFam",
         house_style == "One_Story",
         year_built >= 2000) |>
  mutate(great_qual = ifelse(overall_qual %in%
           c("Very_Excellent", "Excellent", "Very_Good"),
           1, 0))

model <- lm(sale_price ~ first_flr_sf + great_qual, ames_basic)

gg_qqplot(model)

From the QQ-plot, we can see that the residuals are roughly normally distributed in the center but deviate greatly at the upper end, where the observed residuals are much larger than a normal distribution would predict. A QQ-plot should not be the only diagnostic used, especially because it tends to make departures look more severe than they appear in a histogram of the residuals, but it serves as one example of how to diagnose issues with the model and the assumptions we made when creating it.
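To complement the QQ-plot, the same `lindia` package offers a residual histogram, a residuals-versus-fitted plot, and a scale-location plot. The sketch below refits the model from above so that it runs on its own; the plot object names are our own, and which diagnostics matter most is a judgment call rather than a fixed recipe.

```r
library(tidyverse)
library(AmesHousing)
library(lindia)

ames <- make_ames() |> rename_with(tolower)

# Same subset and model as in the critique above
ames_basic <- ames |>
  filter(bldg_type == "OneFam",
         house_style == "One_Story",
         year_built >= 2000) |>
  mutate(great_qual = ifelse(overall_qual %in%
           c("Very_Excellent", "Excellent", "Very_Good"),
           1, 0))

model <- lm(sale_price ~ first_flr_sf + great_qual, data = ames_basic)

# Histogram of residuals: checks the shape that the QQ-plot summarizes
p_hist <- gg_reshist(model)

# Residuals vs. fitted values: checks linearity and constant variance
p_fitted <- gg_resfitted(model)

# Scale-location plot: a second look at whether the spread changes with the fit
p_scale <- gg_scalelocation(model)

p_hist
p_fitted
p_scale
```

A funnel shape in the residuals-versus-fitted plot would suggest non-constant variance, and a curved pattern would suggest that the relationship is not linear.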

Goal 3: Ethical and Epistemological Concerns

Review the materials from the Week 5 lesson on Ethics and Epistemology. This includes lecture slides, the lecture video, or the reading. You should also consider doing supplementary research on the topic at hand (e.g., news outlets, historical articles, etc.). Some issues you might want to consider include:

- Overcoming biases (existing or potential).
- Possible risks or societal implications.
- Crucial issues which might not be measurable.
- Who would be affected by this project, and how does that affect your critique?

Lab 8 Ethical and Epistemological Concerns

One issue with using this model is that the housing market changes greatly over time and is certainly far different today than it was almost twenty years ago, when the Ames housing dataset starts. The dataset also covers the period before and during the Great Recession, so much of the variation in our data may be caused by the housing market crash rather than by any actual difference in the values of the homes. Much more recent data would be needed to build an accurate model of today’s housing prices.
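A quick way to gauge how much sale prices shift across the years in the data is to summarize them by year of sale. This is a rough sketch that assumes the `year_sold` column from `make_ames()` (after lowercasing) records the sale year; the summary names are our own, and a change in the median could also reflect a change in which houses sold rather than in the market alone.

```r
library(tidyverse)
library(AmesHousing)

ames <- make_ames() |> rename_with(tolower)

# Number of sales and typical price for each year of sale
yearly <- ames |>
  group_by(year_sold) |>
  summarize(
    n_sales = n(),
    median_price = median(sale_price),
    mean_price = mean(sale_price)
  )

yearly

# Median sale price over time
yearly |>
  ggplot(mapping = aes(x = year_sold, y = median_price)) +
  geom_line() +
  geom_point() +
  labs(x = "Year Sold", y = "Median Sale Price")
```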

If the real estate agency were only using this data to help inform their local clients, location would not be much of an issue. However, if they make a model based on this data publicly available on the web, people from places other than Ames may try to use it to approximate their house’s value and get a very misleading answer.

Another issue could be the terminology used for the different metrics that are combined to estimate house price. ‘Overall Quality’ is used in both models made in the lab notebook, and according to the documentation, it refers to the overall material and finish of the house, rather than its overall condition. However, users of this tool could easily conflate the two concepts, and even if they knew what the term referred to, giving their house an accurate quality rating on a scale of one to ten would be very difficult for someone without real estate knowledge.
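To see how much it would matter if a user confused quality with condition, the two ratings can be compared directly in the data. The sketch below assumes that `overall_qual` and `overall_cond` are factors whose levels are stored in rating order, so that their integer codes can be rank-correlated; the object names are our own.

```r
library(tidyverse)
library(AmesHousing)

ames <- make_ames() |> rename_with(tolower)

# Cross-tabulation of the two ratings; each row is one combination of labels
qual_vs_cond <- ames |>
  count(overall_qual, overall_cond)

# Share of houses that carry the same label on both scales
same_label <- mean(
  as.character(ames$overall_qual) == as.character(ames$overall_cond)
)

# Spearman rank correlation between the two ratings
# (assumes factor levels follow the rating order)
rank_cor <- cor(
  as.integer(ames$overall_qual),
  as.integer(ames$overall_cond),
  method = "spearman"
)

same_label
rank_cor
```

If only a small share of houses receive the same label on both scales, a user who enters condition in place of quality would often be feeding the model the wrong input.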

Overall, the damage caused by the flaws in this model depends on how much people end up relying on it when making decisions about selling their home. If the model is quite inaccurate for most users, whether because of problems in the model itself or because users mistakenly enter information that does not apply, and those users base major financial decisions on its output, it could be very harmful.
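One way to make the model's uncertainty visible to users is to report a prediction interval alongside the point estimate. The sketch below refits the simple model from the critique and uses a hypothetical house (1,500 square feet on the first floor, rated as great quality) that we made up for illustration. The interval is only trustworthy to the extent that the regression assumptions hold, which the diagnostics above call into question.

```r
library(tidyverse)
library(AmesHousing)

ames <- make_ames() |> rename_with(tolower)

# Same subset and model as in the model critique
ames_basic <- ames |>
  filter(bldg_type == "OneFam",
         house_style == "One_Story",
         year_built >= 2000) |>
  mutate(great_qual = ifelse(overall_qual %in%
           c("Very_Excellent", "Excellent", "Very_Good"),
           1, 0))

model <- lm(sale_price ~ first_flr_sf + great_qual, data = ames_basic)

# Hypothetical house entered by a user (values invented for illustration)
new_house <- tibble(first_flr_sf = 1500, great_qual = 1)

# Point estimate with a 95% prediction interval
estimate <- predict(model, newdata = new_house,
                    interval = "prediction", level = 0.95)

estimate
```

Showing the lower and upper bounds next to the estimate may discourage users from treating a single number as the true value of their home.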

Share your model critique in this notebook as your data dive submission for the week. Make sure to include your own R code which executes suggested routines.