Model Critique

For this lab, you’ll be working with a group of other classmates, and each group will be assigned to critique a lab from a previous week.

Your group will have three goals:

Create an explicit business scenario which might leverage the data (and methods) used in the lab.
Critique the models (or analyses) present in the lab based on this scenario.
Devise a list of ethical and epistemological concerns that might pertain to this lab in the context of your business scenario.

Goal 1: Business Scenario

First, create your own context for the lab. This should be a business use-case such as “a real estate firm aims to present housing trends (and recommendations) for their clients in Ames, Iowa”.

You do not need to solve the problem, you only need to define it.

Your scenario should include the following:

Customer or Audience: who exactly will use your results?
Problem Statement: reference this article to help you write a SMART problem statement.
- E.g., the statement “we need to analyze sales data” is not a good problem summary/statement, but “for <this> reason, the company needs to know if they should stop selling product A …” is on a better track.
Scope: What variables from the data (in the lab) can address the issue presented in your problem statement? What analyses would you use? You’ll need to define any assumptions you feel need to be made before you move forward.
- If you feel the content in the lab cannot sufficiently address the issue, try to devise a more applicable problem statement.
Objective: Define your success criteria. In other words, suppose you started working on this problem in earnest; how will you know when you are done? For example, you might write “the factors that most influence <some variable> will be identified.”
- Note: past tense words like “identified”, “maximized”, “determined”, “found”, etc. could be useful here. Feel free to find the right action verbs that work for you!

Goal 2: Model Critique

Since this is a class, and not a workplace, we need to be careful not to present information to you too quickly or all at once. For this reason, our labs are often not as analytically rigorous or thorough as they might be in practice … Now that you have a more informed knowledge of statistics, your goal is to:

Present a list of at least 3 (improved) analyses you would recommend for your business scenario. Each proposed analysis should be accompanied by a “proof of concept” R implementation. (As usual, execute R code blocks here in the RMarkdown file.)

In the lab your group has been assigned, consider issues with models, statistical improvements, interpretations, analyses, visualizations, etc. Use this notebook as a sandbox for trying out different code, and investigating the data from a different perspective. Take notes on all the issues you see, and propose your solutions (even if you might need to request more data or resources to accomplish those solutions).

You’ll want to consider the following:

Analytical issues, such as the current model assumptions.
Issues with the data itself.
Statistical improvements; what do we know now that we didn’t know (or at least didn’t use) then? Are there other methods that would be appropriate?
Are there better visualizations which could have been used?

Feel free to use the reading for the week associated with your assigned lab to help refresh your memory on the concepts presented.

Goal 3: Ethical and Epistemological Concerns

Review the materials from the Week 5 lesson on Ethics and Epistemology. This includes lecture slides, the lecture video, or the reading. You should also consider doing supplementary research on the topic at hand (e.g., news outlets, historical articles, etc.). Some issues you might want to consider include:

Overcoming biases (existing or potential).
Possible risks or societal implications.
Crucial issues which might not be measurable.
Who would be affected by this project, and how does that affect your critique?

Example

For example, in Week 10-11, we used the year built, square footage, elevation, and the number of bedrooms to determine the price of an apartment. A few questions you might ask are:

Is this a “good” selection of variables? What could we be missing, or are there potential biases inherent in the groups of apartments here?
Nowhere in the lab do we investigate the assumptions of a linear model. Is the relationship between the response (i.e., \(\log(\text{price})\)) and each of these variables linear? Are the error terms evenly distributed?
Is it possible that our conclusions are more appropriate for some group(s) of the data and not others?
What if assumptions are not met? What could happen to this model if it were deployed on a platform like Zillow?
Consider different evaluation metrics between models. What is a practical use for these values?

Share your model critique in this notebook as your data dive submission for the week. Make sure to include your own R code which executes suggested routines.

WK14 Data Dive

Week 14 Data Dive: Woods, Guyon, Grant

1. Business Scenario

Context

Linear regression model analyzing the Ames housing dataset

# Loading in data
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)

## Warning: package 'ggthemes' was built under R version 4.5.2

library(ggrepel)
library(AmesHousing)
library(boot)
library(broom)
library(lindia)

# remove scientific notation
options(scipen = 6)

# default theme, unless otherwise noted
theme_set(theme_minimal())
# Code we are analyzing
ames <- make_ames()
ames <- ames |> rename_with(tolower)

Problem statement / Audience

This business analysis is for real estate agents and their clients (buyers and sellers) who want to understand which factors significantly impact the sale price of houses in Ames, Iowa, so they can build a fuller picture of the housing market. The Ames housing dataset covers sales recorded between 2006 and 2010.

Scope / Important variables

sale_price, year_remod_add, great_qual

Our analysis runs a complete linear regression model and reviews its summary statistics to identify which variables most significantly impact or predict sale price. The output shown in the Week 8 lab is incomplete: it displays only the coefficient estimates and omits important statistics such as t-values and p-values.

Linear regression model

ames_basic <- ames |>
  filter(bldg_type == "OneFam",
         house_style == "One_Story",
         year_built >= 2000) |>
  mutate(great_qual = ifelse(overall_qual %in%
           c("Very_Excellent", "Excellent", "Very_Good"),
           1, 0))

ames_basic |>
  group_by(great_qual) |>
  summarize(num = n())

## # A tibble: 2 × 2
##   great_qual   num
##        <dbl> <int>
## 1          0   145
## 2          1   182

# include all variables and their interaction
model <- lm(sale_price ~ year_remod_add + great_qual 
            + year_remod_add:great_qual, ames_basic)

# to view more coefficients a bit easier
tidy(model) |>
  select(term, estimate) |>
  mutate(estimate = round(estimate, 1))

## # A tibble: 4 × 2
##   term                        estimate
##   <chr>                          <dbl>
## 1 (Intercept)                 -829942.
## 2 year_remod_add                  513.
## 3 great_qual                -10088892.
## 4 year_remod_add:great_qual      5088.
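The standard errors, t-values, and p-values are already stored in the lab’s fitted model; the select() call above simply hides them. The following is a minimal sketch that rebuilds ames_basic so it runs on its own, and the names full_coefs and model_stats are ours, used only for illustration.

```r
library(AmesHousing)
library(dplyr)
library(broom)

# rebuild the lab's subset so this block runs on its own
ames_basic <- make_ames() |>
  rename_with(tolower) |>
  filter(bldg_type == "OneFam",
         house_style == "One_Story",
         year_built >= 2000) |>
  mutate(great_qual = ifelse(overall_qual %in%
           c("Very_Excellent", "Excellent", "Very_Good"),
           1, 0))

model <- lm(sale_price ~ year_remod_add + great_qual
            + year_remod_add:great_qual, ames_basic)

# keep every column tidy() returns, plus 95% confidence intervals
full_coefs <- tidy(model, conf.int = TRUE)
full_coefs

# model-level statistics (R-squared, F-statistic, AIC, ...) in one row
model_stats <- glance(model)
model_stats
```

Keeping the output as a tibble makes it easy to filter or sort terms by p.value later, which summary() output does not allow directly.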

Objective

Refit the linear regression model with the same variables and report its full summary. The summary gives the distribution of the residuals (minimum, quartiles, and maximum). More importantly, it reports the standard error, t-value, and p-value for each coefficient, giving customers a statistical basis for judging which variables matter.

new_lm_model <- lm(sale_price ~ year_remod_add + great_qual + year_remod_add:great_qual, data = ames_basic)

summary(new_lm_model)

## 
## Call:
## lm(formula = sale_price ~ year_remod_add + great_qual + year_remod_add:great_qual, 
##     data = ames_basic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -175088  -50737   -7689   32501  315315 
## 
## Coefficients:
##                              Estimate  Std. Error t value Pr(>|t|)
## (Intercept)                 -829941.8   5327025.6  -0.156    0.876
## year_remod_add                  513.3      2656.3   0.193    0.847
## great_qual                -10088892.4   7545297.2  -1.337    0.182
## year_remod_add:great_qual      5087.5      3761.7   1.352    0.177
## 
## Residual standard error: 72100 on 323 degrees of freedom
## Multiple R-squared:  0.407,  Adjusted R-squared:  0.4015 
## F-statistic:  73.9 on 3 and 323 DF,  p-value: < 2.2e-16
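In the summary above, the intercept and the great_qual coefficient describe a house remodeled in year 0, far outside the data, which is one reason their estimates and standard errors run into the millions. One possible remedy, sketched here under our own assumptions, is to center year_remod_add at its mean before fitting. The column remod_centered and the object names are hypothetical and introduced only for this illustration.

```r
library(AmesHousing)
library(dplyr)
library(broom)

# rebuild the lab's subset so this block runs on its own
ames_basic <- make_ames() |>
  rename_with(tolower) |>
  filter(bldg_type == "OneFam",
         house_style == "One_Story",
         year_built >= 2000) |>
  mutate(great_qual = ifelse(overall_qual %in%
           c("Very_Excellent", "Excellent", "Very_Good"),
           1, 0))

# center the remodel year at its sample mean (hypothetical column name)
ames_centered <- ames_basic |>
  mutate(remod_centered = year_remod_add - mean(year_remod_add))

# a * b expands to a + b + a:b, matching the model fitted above
original_model <- lm(sale_price ~ year_remod_add * great_qual,
                     data = ames_basic)
centered_model <- lm(sale_price ~ remod_centered * great_qual,
                     data = ames_centered)

# great_qual now estimates the quality gap at the average remodel year
tidy(centered_model)

# centering only reparameterizes the model, so the fitted values
# should agree up to numerical tolerance
same_fit <- isTRUE(all.equal(unname(fitted(original_model)),
                             unname(fitted(centered_model)),
                             tolerance = 1e-6))
same_fit
```

Centering does not change R-squared or the interaction slope; it only moves the reference point so that the main-effect coefficients describe a house that actually resembles the data.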

2. Model Critique

Improvements

First, we refit the linear regression model and reported its full summary in order to evaluate the explanatory variables in predicting price. The primary statistical improvement is that the summary adds standard errors, t-values, and p-values, which provide further insight into the significance of year_remod_add, great_qual, and year_remod_add:great_qual.
Second, we recommend a residuals-vs-fitted plot, which we can build from the residuals of our linear regression model. This diagnostic plot indicates whether the relationship between sale price and the independent variables is approximately linear.

# Residuals vs Fitted plot (basic)
plot(new_lm_model, which = 1)

Based on this residuals-vs-fitted plot and the regression summary, it may be worth redirecting our analysis toward logistic regression models, which would require recasting sale price as a binary outcome. In the residuals-vs-fitted plot, the smoothed trend line curves only slightly around 0, suggesting a reasonable overall linear fit. Even so, we would explore logistic methods to evaluate the relationship between the variables and to further investigate what drives decisions for real estate agents and clients in the process of buying and selling houses.
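The residuals-vs-fitted plot mainly addresses linearity. As a sketch of how the remaining linear model assumptions might be checked, the block below draws the normal Q-Q and scale-location plots with base R, then refits the model with a log-transformed response for comparison. The log model is our own assumption about a possible remedy, not something taken from the lab, and log_model is a hypothetical name.

```r
library(AmesHousing)
library(dplyr)

# rebuild the lab's subset so this block runs on its own
ames_basic <- make_ames() |>
  rename_with(tolower) |>
  filter(bldg_type == "OneFam",
         house_style == "One_Story",
         year_built >= 2000) |>
  mutate(great_qual = ifelse(overall_qual %in%
           c("Very_Excellent", "Excellent", "Very_Good"),
           1, 0))

new_lm_model <- lm(sale_price ~ year_remod_add + great_qual
                   + year_remod_add:great_qual, data = ames_basic)

# side-by-side diagnostics
old_par <- par(mfrow = c(1, 2))
plot(new_lm_model, which = 2)  # normal Q-Q: are residuals roughly normal?
plot(new_lm_model, which = 3)  # scale-location: is the spread constant?
par(old_par)

# possible remedy (an assumption): model log(price) instead of price
log_model <- lm(log(sale_price) ~ year_remod_add + great_qual
                + year_remod_add:great_qual, data = ames_basic)
plot(log_model, which = 1)     # residuals vs fitted on the log scale
```

If the Q-Q points bend away from the line in the upper tail, or the scale-location trend rises with the fitted values, the log-scale model is one candidate worth comparing before moving to a different model family.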

Third, we recommend running a correlation matrix to understand any overlap among the independent variables. Thus far, we have focused on the relationship between sale price and the independent variables. This heat map instead highlights relationships among the independent variables themselves. This will be key in informing real estate agents and clients on how to weigh the overall impact of the explanatory variables. It can also help isolate the variables that affect sale price the most on their own.

# Correlation Heatmap

library(GGally)
library(dplyr)

ames_basic_corr <- ames_basic %>%
  mutate(interaction = year_remod_add * great_qual)

ggcorr(
  select(ames_basic_corr,
         year_remod_add,
         great_qual,
         interaction),
  label = TRUE
) +
  labs(title = "Correlation Heatmap for Ames Housing Data")
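The heat map can be backed by the underlying numbers. The sketch below computes the same correlations with base R’s cor(); the names corr_inputs and corr_matrix are ours. Because the interaction column is the product of great_qual and a remodel year that varies only within a narrow range, its correlation with great_qual is expected to be very close to 1, which helps explain the large standard errors in the regression summary.

```r
library(AmesHousing)
library(dplyr)

# rebuild the lab's subset so this block runs on its own
ames_basic <- make_ames() |>
  rename_with(tolower) |>
  filter(bldg_type == "OneFam",
         house_style == "One_Story",
         year_built >= 2000) |>
  mutate(great_qual = ifelse(overall_qual %in%
           c("Very_Excellent", "Excellent", "Very_Good"),
           1, 0))

# same three columns that feed the heat map
corr_inputs <- ames_basic |>
  mutate(interaction = year_remod_add * great_qual) |>
  select(year_remod_add, great_qual, interaction)

# Pearson correlations as a plain numeric matrix
corr_matrix <- cor(corr_inputs)
round(corr_matrix, 3)
```

A near-perfect correlation between two model terms means their individual coefficients are hard to separate, even when the model as a whole fits well.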

3. Ethical Concerns

This dataset may not fully represent the entire housing market in Ames, Iowa.
There may be some measurement bias if the recorded sale prices are rounded estimates rather than exact amounts, which would limit the precision of the analysis.
There is some historical bias: the dataset covers 2006-2010, so it would not accurately represent the current Ames, Iowa housing market for buyers and real estate agents making comparisons to today.
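Two of these concerns can be partly checked against the data itself. The sketch below, using our own illustrative names sales_by_year and rounding_check, counts sales per year and measures how many sale prices fall on round numbers. A high share of round prices would be consistent with rounding but would not prove it, since sellers often choose round prices.

```r
library(AmesHousing)
library(dplyr)

ames <- make_ames() |>
  rename_with(tolower)

# which sale years does the dataset actually cover?
sales_by_year <- ames |>
  count(year_sold)
sales_by_year

# share of sale prices that are exact multiples of 1000 or 100
rounding_check <- ames |>
  summarize(share_multiple_of_1000 = mean(sale_price %% 1000 == 0),
            share_multiple_of_100  = mean(sale_price %% 100 == 0))
rounding_check
```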