Model Critique

For this lab, you’ll be working with a group of other classmates, and each group will be assigned to critique a lab from a previous week.

Your group will have three goals:

  1. Create an explicit business scenario which might leverage the data (and methods) used in the lab.
  2. Critique the models (or analyses) present in the lab based on this scenario.
  3. Devise a list of ethical and epistemological concerns that might pertain to this lab in the context of your business scenario.

Goal 1: Business Scenario

First, create your own context for the lab. This should be a business use-case such as “a real estate firm aims to present housing trends (and recommendations) for their clients in Ames, Iowa”.

You do not need to solve the problem; you only need to define it.

Your scenario should include the following:

  • Customer or Audience: who exactly will use your results?
  • Problem Statement: identify a business need or a possible customer request. This should be actionable, in that it should call for an action taken.
    • E.g., the statement “we need to analyze sales data” is not a good problem statement, but “the company needs to know if they should stop selling product A” is better.
  • Scope: What variables from the data (in the lab) can address the issue presented in your problem statement? What analyses would you use? You’ll need to define any assumptions you feel need to be made before you move forward.
    • If you feel the content in the lab cannot sufficiently address the issue, try to devise a more applicable problem statement.
  • Objective: Define your success criteria. In other words, suppose you started working on this problem in earnest; how will you know when you are done? For example, you might want to “identify the factors that most influence <some variable>.”
    • Note: words like “identify”, “maximize”, “determine”, etc. could be useful here. Feel free to find the right action verbs that work for you!

Goal 2: Model Critique

Since this is a class, and not a workplace, we need to be careful not to present too much information to you all at once. For this reason, our labs are often not as analytically rigorous or thorough as they might be in practice … So here, your goal is to:

Present a list of at least 3 (improved) analyses you would recommend for your business scenario. Each proposed analysis should be accompanied by a “proof of concept” R implementation. (As usual, execute R code blocks here in the RMarkdown file.)

In the lab your group has been assigned, consider issues with models, statistical improvements, interpretations, analyses, visualizations, etc. Use this notebook as a sandbox for trying out different code, and investigating the data from a different perspective. Take notes on all the issues you see, and propose your solutions (even if you might need to request more data or resources to accomplish those solutions).

You’ll want to consider the following:

  • Analytical issues, such as the current model assumptions.
  • Issues with the data itself.
  • Statistical improvements; what do we know now that we didn’t know (or at least didn’t use) then? Are there other methods that would be appropriate?
  • Are there better visualizations which could have been used?

Feel free to use the reading for the week associated with your assigned lab to help refresh your memory on the concepts presented.

Goal 3: Ethical and Epistemological Concerns

Review the materials from the Week 5 lesson on Ethics and Epistemology. This includes the lecture slides, the lecture video, and the reading. You should also consider doing supplementary research on the topic at hand (e.g., news outlets, historical articles, etc.). Some issues you might want to consider include:

  • Overcoming biases (existing or potential).
  • Possible risks or societal implications.
  • Crucial issues which might not be measurable.
  • Who would be affected by this project, and how does that affect your critique?

Example

For example, in Week 10-11, we used the year built, square footage, elevation, and the number of bedrooms to determine the price of an apartment. A few questions you might ask are:

  • Is this a “good” selection of variables? What could we be missing, or are there potential biases inherent in the groups of apartments here?
  • Nowhere in the lab do we investigate the assumptions of a linear model. Is the relationship between the response (i.e., \(\log(\text{price})\)) and each of these variables linear? Are the error terms evenly distributed?
  • Is it possible that our conclusions are more appropriate for some group(s) of the data and not others?
  • What if assumptions are not met? What could happen to this model if it were deployed on a platform like Zillow?
  • Consider different evaluation metrics between models. What is a practical use for these values?

Share your model critique in this notebook as your data dive submission for the week. Make sure to include your own R code which executes suggested routines.

The critique below follows the same structure for the Week 8 lab, which is taken here to focus on predicting house prices using features like Lot Area, Overall Quality, Garage Area, and Year Built.

Goal 1: Business Scenario

Customer or Audience: A real estate investment firm seeking to identify lucrative property investments in Ames, Iowa. Their clients consist of individual homeowners seeking to buy or sell properties, as well as investors aiming to maximize returns.

Problem Statement: The firm must identify the key factors that affect home prices in order to provide recommendations to clients. Specifically, they aim to assess how factors such as house quality, size, and location influence pricing trends.

Scope:

Key variables from the dataset include:

  • Sale_Price (response variable)

  • Gr_Liv_Area, Lot_Area, Overall_Qual, and Year_Remod_Add (predictors)

Analyses to address this:

  1. Regression analysis to identify the relationship between home attributes and price.

  2. Interaction terms to explore how combinations of factors (e.g., size and quality) affect price.

  3. Visualization of trends to effectively communicate findings to clients.

Assumptions:

  • Home quality and size exhibit a linear relationship with sale price.

  • Errors are normally distributed and homoscedastic.

  • No significant multicollinearity exists among predictors.

Objective:

The analysis is considered complete when we:

  1. Identify the primary predictors of sale price.

  2. Provide actionable insights for clients, detailing which home attributes yield the highest ROI.

  3. Quantify the impact of quality and remodeling on home value.

Goal 2: Model Critique

1. Analytical Improvements
Issue:
The original lab assumes linear relationships without assessing this assumption.

Recommendation:
Utilize diagnostic plots to evaluate linearity, normality of errors, and homoscedasticity. 

# Load required libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(AmesHousing)
library(boot)
library(broom)
library(lindia)

# Load Ames Housing dataset
ames <- make_ames()

# Quick look at the dataset
glimpse(ames)
## Rows: 2,930
## Columns: 81
## $ MS_SubClass        <fct> One_Story_1946_and_Newer_All_Styles, One_Story_1946…
## $ MS_Zoning          <fct> Residential_Low_Density, Residential_High_Density, …
## $ Lot_Frontage       <dbl> 141, 80, 81, 93, 74, 78, 41, 43, 39, 60, 75, 0, 63,…
## $ Lot_Area           <int> 31770, 11622, 14267, 11160, 13830, 9978, 4920, 5005…
## $ Street             <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pav…
## $ Alley              <fct> No_Alley_Access, No_Alley_Access, No_Alley_Access, …
## $ Lot_Shape          <fct> Slightly_Irregular, Regular, Slightly_Irregular, Re…
## $ Land_Contour       <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, HLS, Lvl, Lvl, L…
## $ Utilities          <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, All…
## $ Lot_Config         <fct> Corner, Inside, Corner, Corner, Inside, Inside, Ins…
## $ Land_Slope         <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, G…
## $ Neighborhood       <fct> North_Ames, North_Ames, North_Ames, North_Ames, Gil…
## $ Condition_1        <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, Norm, No…
## $ Condition_2        <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Nor…
## $ Bldg_Type          <fct> OneFam, OneFam, OneFam, OneFam, OneFam, OneFam, Twn…
## $ House_Style        <fct> One_Story, One_Story, One_Story, One_Story, Two_Sto…
## $ Overall_Qual       <fct> Above_Average, Average, Above_Average, Good, Averag…
## $ Overall_Cond       <fct> Average, Above_Average, Above_Average, Average, Ave…
## $ Year_Built         <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, 199…
## $ Year_Remod_Add     <int> 1960, 1961, 1958, 1968, 1998, 1998, 2001, 1992, 199…
## $ Roof_Style         <fct> Hip, Gable, Hip, Hip, Gable, Gable, Gable, Gable, G…
## $ Roof_Matl          <fct> CompShg, CompShg, CompShg, CompShg, CompShg, CompSh…
## $ Exterior_1st       <fct> BrkFace, VinylSd, Wd Sdng, BrkFace, VinylSd, VinylS…
## $ Exterior_2nd       <fct> Plywood, VinylSd, Wd Sdng, BrkFace, VinylSd, VinylS…
## $ Mas_Vnr_Type       <fct> Stone, None, BrkFace, None, None, BrkFace, None, No…
## $ Mas_Vnr_Area       <dbl> 112, 0, 108, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6…
## $ Exter_Qual         <fct> Typical, Typical, Typical, Good, Typical, Typical, …
## $ Exter_Cond         <fct> Typical, Typical, Typical, Typical, Typical, Typica…
## $ Foundation         <fct> CBlock, CBlock, CBlock, CBlock, PConc, PConc, PConc…
## $ Bsmt_Qual          <fct> Typical, Typical, Typical, Typical, Good, Typical, …
## $ Bsmt_Cond          <fct> Good, Typical, Typical, Typical, Typical, Typical, …
## $ Bsmt_Exposure      <fct> Gd, No, No, No, No, No, Mn, No, No, No, No, No, No,…
## $ BsmtFin_Type_1     <fct> BLQ, Rec, ALQ, ALQ, GLQ, GLQ, GLQ, ALQ, GLQ, Unf, U…
## $ BsmtFin_SF_1       <dbl> 2, 6, 1, 1, 3, 3, 3, 1, 3, 7, 7, 1, 7, 3, 3, 1, 3, …
## $ BsmtFin_Type_2     <fct> Unf, LwQ, Unf, Unf, Unf, Unf, Unf, Unf, Unf, Unf, U…
## $ BsmtFin_SF_2       <dbl> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1120, 0…
## $ Bsmt_Unf_SF        <dbl> 441, 270, 406, 1045, 137, 324, 722, 1017, 415, 994,…
## $ Total_Bsmt_SF      <dbl> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1595, …
## $ Heating            <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, Gas…
## $ Heating_QC         <fct> Fair, Typical, Typical, Excellent, Good, Excellent,…
## $ Central_Air        <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, …
## $ Electrical         <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SB…
## $ First_Flr_SF       <int> 1656, 896, 1329, 2110, 928, 926, 1338, 1280, 1616, …
## $ Second_Flr_SF      <int> 0, 0, 0, 0, 701, 678, 0, 0, 0, 776, 892, 0, 676, 0,…
## $ Low_Qual_Fin_SF    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Gr_Liv_Area        <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616…
## $ Bsmt_Full_Bath     <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, …
## $ Bsmt_Half_Bath     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Full_Bath          <int> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3, 2, …
## $ Half_Bath          <int> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, …
## $ Bedroom_AbvGr      <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4, 4, …
## $ Kitchen_AbvGr      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Kitchen_Qual       <fct> Typical, Typical, Good, Excellent, Typical, Good, G…
## $ TotRms_AbvGrd      <int> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, 7, 6, 7, 5, 4, 12, 8,…
## $ Functional         <fct> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, T…
## $ Fireplaces         <int> 2, 0, 0, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, …
## $ Fireplace_Qu       <fct> Good, No_Fireplace, No_Fireplace, Typical, Typical,…
## $ Garage_Type        <fct> Attchd, Attchd, Attchd, Attchd, Attchd, Attchd, Att…
## $ Garage_Finish      <fct> Fin, Unf, Unf, Fin, Fin, Fin, Fin, RFn, RFn, Fin, F…
## $ Garage_Cars        <dbl> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, …
## $ Garage_Area        <dbl> 528, 730, 312, 522, 482, 470, 582, 506, 608, 442, 4…
## $ Garage_Qual        <fct> Typical, Typical, Typical, Typical, Typical, Typica…
## $ Garage_Cond        <fct> Typical, Typical, Typical, Typical, Typical, Typica…
## $ Paved_Drive        <fct> Partial_Pavement, Paved, Paved, Paved, Paved, Paved…
## $ Wood_Deck_SF       <int> 210, 140, 393, 0, 212, 360, 0, 0, 237, 140, 157, 48…
## $ Open_Porch_SF      <int> 62, 0, 36, 0, 34, 36, 0, 82, 152, 60, 84, 21, 75, 0…
## $ Enclosed_Porch     <int> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Three_season_porch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Screen_Porch       <int> 0, 120, 0, 0, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, 140, …
## $ Pool_Area          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Pool_QC            <fct> No_Pool, No_Pool, No_Pool, No_Pool, No_Pool, No_Poo…
## $ Fence              <fct> No_Fence, Minimum_Privacy, No_Fence, No_Fence, Mini…
## $ Misc_Feature       <fct> None, None, Gar2, None, None, None, None, None, Non…
## $ Misc_Val           <int> 0, 0, 12500, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, …
## $ Mo_Sold            <int> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, 4, 3, 5, 2, 6, 6, 6, …
## $ Year_Sold          <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 201…
## $ Sale_Type          <fct> WD , WD , WD , WD , WD , WD , WD , WD , WD , WD , W…
## $ Sale_Condition     <fct> Normal, Normal, Normal, Normal, Normal, Normal, Nor…
## $ Sale_Price         <int> 215000, 105000, 172000, 244000, 189900, 195500, 213…
## $ Longitude          <dbl> -93.61975, -93.61976, -93.61939, -93.61732, -93.638…
## $ Latitude           <dbl> 42.05403, 42.05301, 42.05266, 42.05125, 42.06090, 4…
head(ames)
## # A tibble: 6 × 81
##   MS_SubClass             MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape
##   <fct>                   <fct>            <dbl>    <int> <fct>  <fct> <fct>    
## 1 One_Story_1946_and_New… Resident…          141    31770 Pave   No_A… Slightly…
## 2 One_Story_1946_and_New… Resident…           80    11622 Pave   No_A… Regular  
## 3 One_Story_1946_and_New… Resident…           81    14267 Pave   No_A… Slightly…
## 4 One_Story_1946_and_New… Resident…           93    11160 Pave   No_A… Regular  
## 5 Two_Story_1946_and_New… Resident…           74    13830 Pave   No_A… Slightly…
## 6 Two_Story_1946_and_New… Resident…           78     9978 Pave   No_A… Slightly…
## # ℹ 74 more variables: Land_Contour <fct>, Utilities <fct>, Lot_Config <fct>,
## #   Land_Slope <fct>, Neighborhood <fct>, Condition_1 <fct>, Condition_2 <fct>,
## #   Bldg_Type <fct>, House_Style <fct>, Overall_Qual <fct>, Overall_Cond <fct>,
## #   Year_Built <int>, Year_Remod_Add <int>, Roof_Style <fct>, Roof_Matl <fct>,
## #   Exterior_1st <fct>, Exterior_2nd <fct>, Mas_Vnr_Type <fct>,
## #   Mas_Vnr_Area <dbl>, Exter_Qual <fct>, Exter_Cond <fct>, Foundation <fct>,
## #   Bsmt_Qual <fct>, Bsmt_Cond <fct>, Bsmt_Exposure <fct>, …
# Fit an initial linear model
model <- lm(Sale_Price ~ Gr_Liv_Area + Lot_Area + Overall_Qual + Year_Remod_Add, data = ames)

# Summary of the model
summary(model)
## 
## Call:
## lm(formula = Sale_Price ~ Gr_Liv_Area + Lot_Area + Overall_Qual + 
##     Year_Remod_Add, data = ames)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -479911  -17893     452   15979  233947 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -1.085e+06  7.764e+04 -13.979  < 2e-16 ***
## Gr_Liv_Area                 4.918e+01  1.662e+00  29.590  < 2e-16 ***
## Lot_Area                    1.098e+00  8.728e-02  12.578  < 2e-16 ***
## Overall_QualPoor            2.076e+04  2.025e+04   1.025 0.305315    
## Overall_QualFair            2.572e+04  1.858e+04   1.384 0.166494    
## Overall_QualBelow_Average   4.312e+04  1.789e+04   2.411 0.015984 *  
## Overall_QualAverage         6.098e+04  1.779e+04   3.428 0.000616 ***
## Overall_QualAbove_Average   7.511e+04  1.782e+04   4.214 2.58e-05 ***
## Overall_QualGood            9.832e+04  1.791e+04   5.491 4.34e-08 ***
## Overall_QualVery_Good       1.509e+05  1.800e+04   8.383  < 2e-16 ***
## Overall_QualExcellent       2.335e+05  1.827e+04  12.783  < 2e-16 ***
## Overall_QualVery_Excellent  2.727e+05  1.919e+04  14.210  < 2e-16 ***
## Year_Remod_Add              5.503e+02  3.872e+01  14.211  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35400 on 2917 degrees of freedom
## Multiple R-squared:  0.8044, Adjusted R-squared:  0.8036 
## F-statistic: 999.7 on 12 and 2917 DF,  p-value: < 2.2e-16
# Diagnostic plots
png("diagnostic_plots.png", width = 800, height = 800)
par(mfrow = c(2, 2))
plot(model)
dev.off()
## png 
##   2

Expected Improvement:
Ensures the linear model is suitable or indicates where transformations are required.
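
The diagnostic plots are read by eye, so a numeric check can complement them. The sketch below is one possible approach and is not part of the original lab: it computes a studentized (Koenker-style) Breusch-Pagan statistic by hand for the model on the raw price scale and for the same model with a log-transformed response. The helper name `koenker_bp` and the choice of a log transformation are assumptions made for illustration.

```r
library(AmesHousing)

ames <- make_ames()

# Same predictors as the initial model, on the raw and log price scales
model_raw <- lm(Sale_Price ~ Gr_Liv_Area + Lot_Area + Overall_Qual + Year_Remod_Add,
                data = ames)
model_log <- lm(log(Sale_Price) ~ Gr_Liv_Area + Lot_Area + Overall_Qual + Year_Remod_Add,
                data = ames)

# Hypothetical helper: studentized Breusch-Pagan (Koenker) statistic.
# Squared residuals are regressed on the model's predictors; n * R^2 of
# that auxiliary regression is compared to a chi-squared distribution.
koenker_bp <- function(fit) {
  X <- model.matrix(fit)[, -1, drop = FALSE]  # predictors without the intercept column
  e2 <- resid(fit)^2
  aux <- lm(e2 ~ X)
  stat <- length(e2) * summary(aux)$r.squared
  df <- ncol(X)
  c(statistic = stat, df = df, p_value = pchisq(stat, df, lower.tail = FALSE))
}

bp_raw <- koenker_bp(model_raw)
bp_log <- koenker_bp(model_log)
print(rbind(raw = bp_raw, log = bp_log))
```

A small p-value suggests non-constant error variance. If the statistic is much smaller on the log scale, that would support modeling log price, at the cost of interpreting coefficients as approximate percentage changes rather than dollar amounts.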

2. Handling Multicollinearity
Issue:
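Variables such as Gr_Liv_Area and Lot_Area may exhibit high correlation, which could lead to distorted coefficient estimates.
Recommendation:
Compute Variance Inflation Factors (VIF) to identify multicollinearity. If the VIF exceeds 5 for any predictor, consider removing or combining variables. For a multi-level factor such as Overall_Qual, car::vif() reports a generalized VIF (GVIF); the square of the GVIF^(1/(2*Df)) column is the value to compare against that threshold.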

# Check for multicollinearity using Variance Inflation Factors (VIF)
car::vif(model)
##                    GVIF Df GVIF^(1/(2*Df))
## Gr_Liv_Area    1.649538  1        1.284343
## Lot_Area       1.105341  1        1.051352
## Overall_Qual   2.108756  9        1.042321
## Year_Remod_Add 1.524842  1        1.234845
# All GVIF values above are well below 5, so no predictor needs to be dropped here.
# For illustration only: refit without Lot_Area, as one would if its VIF exceeded 5
model_no_lot <- lm(Sale_Price ~ Gr_Liv_Area + Overall_Qual + Year_Remod_Add, data = ames)

# Summary of the modified model
summary(model_no_lot)
## 
## Call:
## lm(formula = Sale_Price ~ Gr_Liv_Area + Overall_Qual + Year_Remod_Add, 
##     data = ames)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -445640  -18950     444   16722  229136 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -1.042e+06  7.963e+04 -13.087  < 2e-16 ***
## Gr_Liv_Area                 5.493e+01  1.640e+00  33.484  < 2e-16 ***
## Overall_QualPoor            1.564e+04  2.078e+04   0.753 0.451692    
## Overall_QualFair            1.864e+04  1.907e+04   0.978 0.328337    
## Overall_QualBelow_Average   3.448e+04  1.835e+04   1.879 0.060298 .  
## Overall_QualAverage         5.352e+04  1.825e+04   2.933 0.003385 ** 
## Overall_QualAbove_Average   6.644e+04  1.829e+04   3.633 0.000285 ***
## Overall_QualGood            8.920e+04  1.837e+04   4.856 1.26e-06 ***
## Overall_QualVery_Good       1.409e+05  1.846e+04   7.636 3.01e-14 ***
## Overall_QualExcellent       2.248e+05  1.874e+04  11.999  < 2e-16 ***
## Overall_QualVery_Excellent  2.655e+05  1.969e+04  13.482  < 2e-16 ***
## Year_Remod_Add              5.340e+02  3.973e+01  13.442  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36340 on 2918 degrees of freedom
## Multiple R-squared:  0.7938, Adjusted R-squared:  0.793 
## F-statistic:  1021 on 11 and 2918 DF,  p-value: < 2.2e-16


Interpretation of the reduced model (without Lot_Area):

Overall Model

  • The model predicts house prices based on size (living area), renovation year, and overall quality.

  • It does a good job: it explains about 79% of the variation in house prices, which is strong.

  • The residual standard error is about $36,340, so a typical prediction misses the actual sale price by roughly that amount.
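
The fit statistics above are computed on the same data used to fit the model, so they can be optimistic. As a hedged sketch (not part of the original lab), the code below estimates out-of-sample error with 10-fold cross-validation via `boot::cv.glm()`, comparing the model with and without Lot_Area. The seed value and the choice of K = 10 are arbitrary assumptions.

```r
library(AmesHousing)
library(boot)

ames <- as.data.frame(make_ames())

# glm() with the default gaussian family fits the same model as lm(),
# but returns an object that boot::cv.glm() can refit on each fold
fit_full <- glm(Sale_Price ~ Gr_Liv_Area + Lot_Area + Overall_Qual + Year_Remod_Add,
                data = ames)
fit_reduced <- glm(Sale_Price ~ Gr_Liv_Area + Overall_Qual + Year_Remod_Add,
                   data = ames)

set.seed(1)  # fold assignment is random; the seed value is arbitrary
cv_full <- cv.glm(ames, fit_full, K = 10)
cv_reduced <- cv.glm(ames, fit_reduced, K = 10)

# delta[1] is the cross-validated mean squared error; its root is in dollars
cv_rmse <- c(full = sqrt(cv_full$delta[1]), reduced = sqrt(cv_reduced$delta[1]))
print(round(cv_rmse))
```

The cross-validated RMSE is directly comparable across the two models and gives the client a more realistic sense of how far a price prediction for a new listing might be off.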

Important Findings

  • Living Area: For every extra square foot of living space, the price increases by about $54.93, holding the other predictors fixed. This effect is highly statistically significant (p < 2e-16).

  • Renovation Year: Each one-year increase in the remodel year is associated with about $534 more in sale price.

  • Overall Quality: The better the house quality, the higher the price. Each figure below is relative to the baseline level, Very_Poor:

    • Average quality: adds about $53,521.

    • Excellent quality: adds about $224,849.

    • Top quality (Very Excellent): adds about $265,478.

    • Lower quality levels (Poor or Fair) are not significantly different from the Very_Poor baseline.

Model Strengths

The model shows which factors matter most for pricing:

  1. Quality of the house is hugely important, especially for higher levels.

  2. Larger houses are worth more.

  3. Recent renovations add value.

Additional Insights

  • The VIF values are all low, so there are no major problems with overlapping or redundant predictors.

  • Still, the model could improve by exploring other factors like neighborhood, house age, or specific features.

This analysis can help builders, sellers, or buyers understand what drives house prices.

Expected Improvement:
Checking for multicollinearity, and removing it where it is found, supports reliable coefficient estimates and interpretation.
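
Because Overall_Qual is a factor with several levels, `car::vif()` returns generalized VIFs rather than ordinary ones. The sketch below, offered as one reasonable way to apply the VIF > 5 rule of thumb in that situation, squares the `GVIF^(1/(2*Df))` column so that every term is on the ordinary VIF scale before comparing against the threshold. The variable names `vif_scale` and `flagged` are illustrative.

```r
library(AmesHousing)

ames <- make_ames()
model <- lm(Sale_Price ~ Gr_Liv_Area + Lot_Area + Overall_Qual + Year_Remod_Add,
            data = ames)

gvif <- car::vif(model)  # a matrix, because the model contains a factor term

# Squaring GVIF^(1/(2*Df)) puts every term on the ordinary VIF scale
vif_scale <- gvif[, "GVIF^(1/(2*Df))"]^2
print(round(vif_scale, 2))

# Terms exceeding the rule-of-thumb threshold of 5
flagged <- names(vif_scale)[vif_scale > 5]
print(flagged)
```

With the GVIF values reported above, no term comes close to the threshold, which is why dropping Lot_Area is not actually required for this model.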



3. Interaction Terms
Issue:

The relationship between house size and quality has not been thoroughly examined.
Recommendation:
Incorporate interaction terms and evaluate their significance:

# Model with interaction between Gr_Liv_Area and Overall_Qual
model_interaction <- lm(Sale_Price ~ Gr_Liv_Area * Overall_Qual + Lot_Area + Year_Remod_Add, data = ames)

# Summary of interaction model
summary(model_interaction)
## 
## Call:
## lm(formula = Sale_Price ~ Gr_Liv_Area * Overall_Qual + Lot_Area + 
##     Year_Remod_Add, data = ames)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -302147  -16143    -218   15052  343845 
## 
## Coefficients:
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                            -1.077e+06  8.038e+04 -13.403   <2e-16
## Gr_Liv_Area                             1.701e+01  3.661e+01   0.465   0.6422
## Overall_QualPoor                        3.203e+04  4.956e+04   0.646   0.5182
## Overall_QualFair                        1.762e+04  4.024e+04   0.438   0.6614
## Overall_QualBelow_Average               3.392e+04  3.747e+04   0.905   0.3653
## Overall_QualAverage                     5.909e+04  3.692e+04   1.601   0.1096
## Overall_QualAbove_Average               3.886e+04  3.701e+04   1.050   0.2938
## Overall_QualGood                        4.723e+04  3.722e+04   1.269   0.2046
## Overall_QualVery_Good                   6.139e+04  3.750e+04   1.637   0.1017
## Overall_QualExcellent                   9.740e+04  3.970e+04   2.453   0.0142
## Overall_QualVery_Excellent              4.553e+05  4.094e+04  11.122   <2e-16
## Lot_Area                                1.159e+00  8.278e-02  13.999   <2e-16
## Year_Remod_Add                          5.604e+02  3.667e+01  15.282   <2e-16
## Gr_Liv_Area:Overall_QualPoor           -2.771e+01  6.062e+01  -0.457   0.6477
## Gr_Liv_Area:Overall_QualFair            1.286e+01  3.946e+01   0.326   0.7445
## Gr_Liv_Area:Overall_QualBelow_Average   1.545e+01  3.713e+01   0.416   0.6772
## Gr_Liv_Area:Overall_QualAverage         1.092e+01  3.672e+01   0.297   0.7662
## Gr_Liv_Area:Overall_QualAbove_Average   3.738e+01  3.672e+01   1.018   0.3089
## Gr_Liv_Area:Overall_QualGood            4.545e+01  3.675e+01   1.237   0.2163
## Gr_Liv_Area:Overall_QualVery_Good       6.431e+01  3.679e+01   1.748   0.0806
## Gr_Liv_Area:Overall_QualExcellent       8.342e+01  3.726e+01   2.239   0.0252
## Gr_Liv_Area:Overall_QualVery_Excellent -4.238e+01  3.706e+01  -1.143   0.2530
##                                           
## (Intercept)                            ***
## Gr_Liv_Area                               
## Overall_QualPoor                          
## Overall_QualFair                          
## Overall_QualBelow_Average                 
## Overall_QualAverage                       
## Overall_QualAbove_Average                 
## Overall_QualGood                          
## Overall_QualVery_Good                     
## Overall_QualExcellent                  *  
## Overall_QualVery_Excellent             ***
## Lot_Area                               ***
## Year_Remod_Add                         ***
## Gr_Liv_Area:Overall_QualPoor              
## Gr_Liv_Area:Overall_QualFair              
## Gr_Liv_Area:Overall_QualBelow_Average     
## Gr_Liv_Area:Overall_QualAverage           
## Gr_Liv_Area:Overall_QualAbove_Average     
## Gr_Liv_Area:Overall_QualGood              
## Gr_Liv_Area:Overall_QualVery_Good      .  
## Gr_Liv_Area:Overall_QualExcellent      *  
## Gr_Liv_Area:Overall_QualVery_Excellent    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33480 on 2908 degrees of freedom
## Multiple R-squared:  0.8257, Adjusted R-squared:  0.8244 
## F-statistic: 655.8 on 21 and 2908 DF,  p-value: < 2.2e-16
# Visualize interaction effect
ggplot(ames, aes(x = Gr_Liv_Area, y = Sale_Price, color = factor(Overall_Qual))) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Interaction Between Living Area and Overall Quality",
       x = "Above Ground Living Area (sq. ft.)", y = "Sale Price") +
  theme_minimal() +
  theme(legend.position = "bottom")
## `geom_smooth()` using formula = 'y ~ x'

This plot illustrates how a house’s size (living area) and quality (overall quality) influence its price:

  1. Larger houses typically have higher prices, as indicated by the upward trends.
  2. Houses of higher quality (e.g., “Excellent”) are valued more than those of lower quality, even when their sizes are comparable.
  3. For very large homes, the price growth diminishes, particularly for top-quality houses.
  4. Quality and size appear to interact: the fitted lines are generally steeper at higher quality levels, although most individual interaction coefficients in the model summary are not statistically significant.
  5. Some points that deviate from the trends may be outliers, such as unique homes or special locations.
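
To follow up on possible outliers, influential observations can be flagged numerically rather than by inspecting the plot. The sketch below uses Cook's distance on the initial main-effects model (not the interaction model) with the common 4/n rule of thumb; that cutoff is a convention assumed here, not a formal test, and the columns shown are chosen only for illustration.

```r
library(AmesHousing)

ames <- as.data.frame(make_ames())
model <- lm(Sale_Price ~ Gr_Liv_Area + Lot_Area + Overall_Qual + Year_Remod_Add,
            data = ames)

cooks_d <- cooks.distance(model)
cutoff <- 4 / nrow(ames)  # rule-of-thumb cutoff, assumed for illustration

# Row positions of observations whose Cook's distance exceeds the cutoff
idx <- unname(which(cooks_d > cutoff))

influential <- ames[idx, c("Neighborhood", "Gr_Liv_Area", "Lot_Area",
                           "Overall_Qual", "Sale_Price")]
influential$cooks_d <- unname(cooks_d[idx])

# Most influential homes first
influential <- influential[order(-influential$cooks_d), ]
print(head(influential, 10))
```

Flagged homes are candidates for closer review (for example, unusually large lots or atypical sale conditions), not automatic removals; refitting with and without them shows how much the conclusions depend on a few sales.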


Expected Improvement:
Addresses the combined effects of variables, offering enhanced understanding of feature interactions.
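
Individual interaction coefficients can look insignificant even when the interaction matters as a whole, because each one compares a single quality level against the baseline. A minimal sketch of a joint test, assuming the same two model specifications used in this critique, compares the nested models with a partial F-test:

```r
library(AmesHousing)

ames <- make_ames()

model <- lm(Sale_Price ~ Gr_Liv_Area + Lot_Area + Overall_Qual + Year_Remod_Add,
            data = ames)
model_interaction <- lm(Sale_Price ~ Gr_Liv_Area * Overall_Qual + Lot_Area + Year_Remod_Add,
                        data = ames)

# Partial F-test: do the nine interaction terms jointly improve the fit?
interaction_test <- anova(model, model_interaction)
print(interaction_test)

# AIC penalizes the extra parameters; lower is better
print(AIC(model, model_interaction))
```

A small p-value in the second row of the ANOVA table indicates that the slope of price on living area differs across quality levels, even if few of the individual terms stand out.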

Goal 3: Ethical and Epistemological Concerns

Overcoming Biases (Existing or Potential)

Existing Biases:

  • Data Bias: The data utilized may encapsulate historical biases in housing prices, including redlining, racial discrimination, or socioeconomic disparities. If specific regions are historically underfunded or undervalued, these biases may be reflected in the model, resulting in unequal predictions across various demographics or geographic areas. Example: If the dataset indicates lower house prices in certain zip codes due to previous redlining practices, the model may unintentionally imply that houses in those areas are less valuable, thereby perpetuating historical inequities.

  • Selection Bias: The dataset might not accurately represent the entire population of homes or housing transactions, particularly if certain areas or types of homes are inadequately represented. For instance, if the dataset comprises solely new or renovated homes, the model may not effectively generalize to older or less renovated properties.

Potential Biases:

  • Feature Bias: The selected features for modeling, such as square footage, number of bedrooms, or overall quality, may not encompass all the factors that affect housing prices. Critical factors such as neighborhood safety, access to public services, and local economic conditions may be overlooked.

  • Example: A model that fails to account for proximity to schools or transportation networks may produce inaccurate predictions of housing prices for properties located near these essential amenities.

Solutions:

  • Data Scrubbing: Ensure diverse data representation, encompassing all demographic, geographic, and economic groups. Examine and rectify potential data imbalances.

  • Regular Audits: Conduct periodic evaluations of model outcomes for fairness and detect any discriminatory patterns.
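
One concrete form an audit could take is checking whether the model systematically over- or under-predicts prices in particular areas. The sketch below averages residuals from the initial model by Neighborhood. Using Neighborhood as the grouping variable is an assumption made for illustration: the Ames data contain no demographic variables, so this is only a rough geographic proxy.

```r
library(AmesHousing)

ames <- as.data.frame(make_ames())
ames$Neighborhood <- droplevels(ames$Neighborhood)  # drop any unused levels

model <- lm(Sale_Price ~ Gr_Liv_Area + Lot_Area + Overall_Qual + Year_Remod_Add,
            data = ames)
ames$resid <- resid(model)  # positive = sold for more than the model predicted

mean_resid <- tapply(ames$resid, ames$Neighborhood, mean)
n_homes <- tapply(ames$resid, ames$Neighborhood, length)

audit <- data.frame(
  neighborhood = names(mean_resid),
  n = as.vector(n_homes),
  mean_resid = as.vector(mean_resid)
)

# Sort so the most over-predicted neighborhoods come first
audit <- audit[order(audit$mean_resid), ]
print(head(audit, 5))
print(tail(audit, 5))
```

Neighborhoods with large negative mean residuals are ones where the model tends to predict more than homes actually sold for, and large positive values indicate the opposite. Groups with very few homes should be read with caution.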

Possible Risks or Societal Implications

  • Impact on Housing Markets: Utilizing this model for real estate investment or pricing strategies may affect housing demand in specific neighborhoods, potentially leading to inflated or deflated property prices due to biased predictions. For instance, real estate investors might refrain from investing in specific areas due to anticipated lower property values, thereby limiting investment and development opportunities in those neighborhoods.

  • Access to Housing: Homebuyers in historically marginalized communities may face disadvantages due to models that undervalue their properties. This can contribute to wealth inequality if the model is utilized for determining tax assessments, insurance rates, or home loan values.

  • Gentrification Risks: If real estate developers leverage the model to identify “up-and-coming” neighborhoods based on anticipated future price increases, it may result in gentrification, displacing long-term residents and diminishing affordable housing.

Mitigation Strategies:

  • Transparency: Ensure that the modeling process and its predictions are transparent and easy to understand to prevent unintended consequences.

  • Inclusive Modeling: Integrate feedback from affected communities to guarantee the model addresses the needs of diverse groups.

Crucial Issues That May Not Be Quantifiable

  • Quality of Life Factors: The model may fail to adequately consider intangible elements influencing housing value, such as neighborhood quality of life (e.g., social cohesion, safety, or environmental conditions). These factors are difficult to quantify yet are essential in shaping perceptions of a neighborhood’s desirability.

  • Psychological and Social Value: The emotional or social significance of a home or neighborhood may not be reflected in statistical variables. For example, a neighborhood may possess historical or cultural significance that cannot be quantified by square footage or amenities, yet holds great importance for its residents.

  • Environmental Concerns: Factors such as climate change, risks associated with natural disasters, and sustainability aspects (e.g., energy efficiency or environmental hazards) may not be directly reflected in the existing data, but are becoming increasingly significant to home buyers and investors.

Mitigation Strategies:

  • Holistic Approach: Enhance quantitative models by integrating qualitative assessments (e.g., community surveys, social factors).

  • Model Updates: Consistently revise the model to include emerging trends such as environmental factors or social values, even if these are more challenging to quantify.

Who Would Be Affected by This Project, and How Does That Affect Your Critique?

Affected Stakeholders:

  • Home Buyers: Their experience will be directly influenced by the model’s accuracy in predicting housing prices, which will affect their ability to make informed purchasing decisions.

  • Real Estate Agents and Investors: They will rely on this model for guiding investment decisions, evaluating properties, and developing pricing strategies. Misguided predictions may lead to financial losses or missed opportunities.

  • Low-Income and Marginalized Communities: These groups could suffer adverse effects if the model results in decreased investments or lowered property values in their neighborhoods, potentially worsening socioeconomic disparities.

  • Local Governments and Urban Planners: These stakeholders may depend on housing price data for policy-making, zoning decisions, and tax assessments, which in turn influences community development and urban planning.

Impact on Critique:

  • Inclusivity: The model should be assessed for its inclusivity. It is essential to ensure that marginalized communities are not adversely impacted by biased predictions.

  • Social Responsibility: As a data scientist or analyst, it is imperative to uphold social responsibility and consider the wider implications of implementing models that affect public and private decisions.

  • Accountability: Should the model produce unintended negative consequences (e.g., escalating housing prices in low-income neighborhoods), there must be mechanisms for accountability and correction.