Model Critique

For this lab, you’ll be working with a group of other classmates, and each group will be assigned to critique a lab from a previous week.

Your group will have three goals:

  1. Create an explicit business scenario which might leverage the data (and methods) used in the lab.
  2. Critique the models (or analyses) present in the lab based on this scenario.
  3. Devise a list of ethical and epistemological concerns that might pertain to this lab in the context of your business scenario.

First, choose a previous lab: Week 6.

Goal 1: Business Scenario

A real estate firm in Ames, Iowa, wants to understand how housing prices vary across different building types and within similar home categories. By analyzing these trends, they aim to provide better pricing guidance and recommendations to clients who are buying or selling homes. The purpose of this study is to better gauge housing prices on the market and identify how prices fluctuate across different types of homes.

  • Customer or Audience: The audience includes real estate agents and homeowners who want to better understand pricing and market trends.

  • Problem Statement: Real estate agents need a clear understanding of how building type affects housing prices so they can accurately price homes and advise clients. Inaccurately priced homes tend to stay on the market longer or fail to sell, creating financial and operational challenges. To address this, the firm will analyze the Ames Housing Dataset to identify pricing patterns across different home types, with the goal of reducing pricing uncertainty and improving decision-making for both buyers and sellers.

  • Scope: The variables that can be used from the Ames Housing Dataset to address this problem include sale_price and bldg_type. Additional variables such as overall_qual, first_flr_sf, and year_built can also be included to better explain differences in housing prices. To analyze this, an ANOVA test will be used to determine if there are significant differences in average sale prices across different building types. A linear regression model will also be used to examine how building type and other features impact housing prices and to better understand relationships between variables.

    • One assumption is that the Ames Housing Dataset may not reflect current housing market trends. Since the data is from earlier years, it may not capture recent increases in housing prices or changes in market conditions. This means the results may be more useful for identifying general patterns than for setting exact current prices.

  • Objective: The goal of this analysis is to determine whether building type significantly affects housing prices and to identify the key factors that influence price differences. Success will be achieved when pricing patterns across building types have been identified and the most influential variables affecting housing prices have been determined, allowing for more accurate pricing recommendations.

Goal 2: Model Critique

Load libraries

library(tidyverse)  
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(AmesHousing) 

Issue 1: Using a log transformation to fix skewed data

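# Load the Ames data and convert column names to lower case
ames <- make_ames()
ames <- ames |> rename_with(tolower)

# Log-transform sale price to reduce right skew, then run a one-way ANOVA by building type
ames <- ames |> mutate(log_price = log(sale_price))
ml <- aov(log_price ~ bldg_type, data = ames)
summary(ml)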
##               Df Sum Sq Mean Sq F value Pr(>F)    
## bldg_type      4   19.7    4.93   30.89 <2e-16 ***
## Residuals   2925  466.9    0.16                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggplot(ames, aes(x = bldg_type, y = log_price)) +
  geom_boxplot()
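
The ANOVA above shows that mean log price differs across building types, but not which types differ, and the lab does not show that the log transformation actually reduces skew. A minimal sketch of both checks follows, assuming the same `make_ames()` data. `skewness()` is a hypothetical helper defined here for illustration, and `fit_log` and `tukey` are illustrative names, not objects from the original lab.

```r
library(AmesHousing)

# Rebuild the data as in the lab: lower-case names plus a log-price column.
ames <- make_ames()
names(ames) <- tolower(names(ames))
ames$log_price <- log(ames$sale_price)

# Hypothetical helper: sample skewness as the mean cubed z-score.
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

# Compare skewness before and after the log transformation.
skew_raw <- skewness(ames$sale_price)
skew_log <- skewness(ames$log_price)
c(raw = skew_raw, log = skew_log)

# Group sizes matter for ANOVA: very unequal groups make it more
# sensitive to unequal variances.
table(ames$bldg_type)

# Tukey's HSD gives pairwise differences in mean log price with
# family-wise adjusted p-values.
fit_log <- aov(log_price ~ bldg_type, data = ames)
tukey <- TukeyHSD(fit_log)
tukey$bldg_type
```

Each row of `tukey$bldg_type` is one pair of building types, so the firm could read off which specific categories are priced differently rather than relying on the overall F-test alone.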

Issue 2: Multiple Linear Regression Model

model_multi <- lm(sale_price ~ bldg_type + overall_qual + first_flr_sf, data = ames)
summary(model_multi) 
## 
## Call:
## lm(formula = sale_price ~ bldg_type + overall_qual + first_flr_sf, 
##     data = ames)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -415328  -21767   -2121   18148  291807 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  4192.718  19793.208   0.212  0.83226    
## bldg_typeTwoFmCon          -11473.824   5111.105  -2.245  0.02485 *  
## bldg_typeDuplex            -16726.811   3961.492  -4.222 2.49e-05 ***
## bldg_typeTwnhs             -23420.004   4120.805  -5.683 1.45e-08 ***
## bldg_typeTwnhsE            -23929.560   2755.176  -8.685  < 2e-16 ***
## overall_qualPoor            17286.241  22521.792   0.768  0.44283    
## overall_qualFair            36172.361  20653.519   1.751  0.07998 .  
## overall_qualBelow_Average   60180.688  19867.442   3.029  0.00247 ** 
## overall_qualAverage         79586.442  19741.832   4.031 5.69e-05 ***
## overall_qualAbove_Average  108545.238  19750.398   5.496 4.22e-08 ***
## overall_qualGood           146126.421  19767.315   7.392 1.88e-13 ***
## overall_qualVery_Good      198641.982  19849.771  10.007  < 2e-16 ***
## overall_qualExcellent      281492.678  20159.027  13.964  < 2e-16 ***
## overall_qualVery_Excellent 337087.996  21133.233  15.951  < 2e-16 ***
## first_flr_sf                   49.882      2.293  21.752  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39370 on 2915 degrees of freedom
## Multiple R-squared:  0.7582, Adjusted R-squared:  0.7571 
## F-statistic:   653 on 14 and 2915 DF,  p-value: < 2.2e-16
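
The regression above models `sale_price` on its original scale, even though Issue 1 motivates working with log price. One way to examine that choice is to fit the same predictors on both scales and compare the residual plots. This is only a sketch: `model_raw`, `model_log`, and `pct_effect` are illustrative names, and whether the log model is preferable depends on what the diagnostics show.

```r
library(AmesHousing)

ames <- make_ames()
names(ames) <- tolower(names(ames))

# Same predictors, two response scales.
model_raw <- lm(sale_price ~ bldg_type + overall_qual + first_flr_sf,
                data = ames)
model_log <- lm(log(sale_price) ~ bldg_type + overall_qual + first_flr_sf,
                data = ames)

# Residuals vs fitted values, side by side. A funnel shape suggests
# non-constant variance on that scale.
op <- par(mfrow = c(1, 2))
plot(model_raw, which = 1, main = "sale_price")
plot(model_log, which = 1, main = "log(sale_price)")
par(op)

# On the log scale, 100 * (exp(b) - 1) is the approximate percentage
# difference from the reference building type (OneFam), holding
# quality and first-floor area fixed.
pct_effect <- 100 * (exp(coef(model_log)) - 1)
round(pct_effect[grep("^bldg_type", names(pct_effect))], 1)
```

Percentage differences may also be easier for agents to communicate to clients than fixed dollar offsets, since a fixed dollar gap means something different for a low-priced and a high-priced home.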

Issue 3: Better Visualization

ggplot(ames, aes(x = first_flr_sf, y = sale_price, color = bldg_type)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
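
With all five building types overlaid in one panel, the single-family points can hide the smaller groups. A possible alternative, sketched here under the assumption that a log price axis is acceptable to the audience, is to give each building type its own panel. The axis labels and the object name `p` are illustrative choices.

```r
library(ggplot2)
library(AmesHousing)

ames <- make_ames()
names(ames) <- tolower(names(ames))

# One panel per building type, with price on a log10 axis so that
# cheaper and more expensive homes are both readable.
p <- ggplot(ames, aes(x = first_flr_sf, y = sale_price)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  scale_y_log10(labels = scales::label_dollar()) +
  facet_wrap(~ bldg_type) +
  labs(x = "First-floor area (sq ft)", y = "Sale price (log scale)")
p
```

Faceting trades direct overlay comparison for legibility. Keeping the shared axes that `facet_wrap()` uses by default still lets the slopes be compared across panels.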

Goal 3: Ethical and Epistemological Concerns

One concern is that the Ames Housing Dataset may not reflect current housing market conditions. Housing prices have changed significantly in recent years due to factors such as inflation and shifts in demand following COVID-19. Older trends in the data may not hold the same weight today, which could affect the accuracy and relevance of the analysis.
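
The age of the data can be checked directly from the dataset rather than assumed. A short sketch, assuming the `year_sold` column produced by lower-casing the `make_ames()` names; `sale_years` is an illustrative name.

```r
library(AmesHousing)

ames <- make_ames()
names(ames) <- tolower(names(ames))

# First and last sale year in the data.
sale_years <- range(ames$year_sold)
sale_years

# Number of sales per year, to see whether any year is thinly covered.
table(ames$year_sold)
```

Reporting this range alongside any pricing recommendation would make the time gap explicit to agents and clients.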

Another concern is potential bias in the data. The dataset only includes housing data from Ames, Iowa, which makes it useful for understanding that specific market, but limits how well the results can be applied to other cities. If these findings were used to make decisions in different housing markets, it could lead to biased or misleading conclusions.

There are also important factors that may not be fully captured in the dataset. For example, external influences such as interest rates, economic conditions, neighborhood development, and buyer preferences can all impact housing prices but may not be directly measured. This limits the ability of the model to fully explain price variation.

Finally, different groups are affected by this analysis, including real estate agents, homeowners, and potential buyers. If the model provides inaccurate or outdated recommendations, it could lead to homes being overpriced or underpriced, which may negatively impact financial outcomes and decision-making for these groups.