For this lab, you’ll be working with a group of other classmates, and each group will be assigned to critique a lab from a previous week.
Your group will have three goals:
First, create your own context for the lab. This should be a business use-case such as “a real estate firm aims to present housing trends (and recommendations) for their clients in Ames, Iowa”.
You do not need to solve the problem, you only need to define it.
Your scenario should include the following:
Since this is a class, and not a workplace, we need to be careful not to present too much information to you all at once. For this reason, our labs are often not as analytically rigorous or thorough as they might be in practice … So here, your goal is to:
Present a list of at least 3 (improved) analyses you would recommend for your business scenario. Each proposed analysis should be accompanied by a “proof of concept” R implementation. (As usual, execute R code blocks here in the RMarkdown file.)
In the lab your group has been assigned, consider issues with models, statistical improvements, interpretations, analyses, visualizations, etc. Use this notebook as a sandbox for trying out different code, and investigating the data from a different perspective. Take notes on all the issues you see, and propose your solutions (even if you might need to request more data or resources to accomplish those solutions).
You’ll want to consider the following:
Feel free to use the reading for the week associated with your assigned lab to help refresh your memory on the concepts presented.
Review the materials from the Week 5 lesson on Ethics and Epistemology. This includes lecture slides, the lecture video, or the reading. You should also consider doing supplementary research on the topic at hand (e.g., news outlets, historical articles, etc.). Some issues you might want to consider include:
For example, in Week 10-11, we used the year built, square footage, elevation, and the number of bedrooms to determine the price of an apartment. A few questions you might ask are:
Share your model critique in this notebook as your data dive submission for the week. Make sure to include your own R code which executes suggested routines.
For Week 8, let’s structure the example and critique in the same way, tailored to the lab content for that week. Suppose Week 8 focuses on predicting house prices using features like Lot Area, Overall Quality, Garage Area, and Year Built. Here’s how the critique might look:
Customer or Audience: A real estate investment firm seeking to identify lucrative property investments in Ames, Iowa. Their clients consist of individual homeowners seeking to buy or sell properties, as well as investors aiming to maximize returns.
Problem Statement: The firm must identify the key factors that affect home prices in order to provide recommendations to clients. Specifically, they aim to assess how factors such as house quality, size, and location influence pricing trends.
Scope:
Key variables from the dataset include:
Sale_Price (response variable)
First_Flr_SF, Lot_Area, Gr_Liv_Area, Year_Remod_Add, and Overall_Qual (predictors)
Analyses to address this:
Regression analysis to identify the relationship between home attributes and price.
Interaction terms to explore how combinations of factors (e.g., size and quality) affect price.
Visualization of trends to effectively communicate findings to clients.
Assumptions:
Home quality and size exhibit a linear relationship with sale price.
Errors are normally distributed and homoscedastic.
No significant multicollinearity exists among predictors.
Objective:
The analysis is considered complete when we:
Identify the primary predictors of sale price.
Provide actionable insights for clients, detailing which home attributes yield the highest ROI.
Quantify the impact of quality and remodeling on home value.
1. Analytical Improvements
Issue:
The original lab assumes linear relationships without assessing this
assumption.
Recommendation:
Utilize diagnostic plots to evaluate linearity, normality of errors, and
homoscedasticity.
# Load required libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(AmesHousing)
library(boot)
library(broom)
library(lindia)
# Load Ames Housing dataset
ames <- make_ames()
# Quick look at the dataset
glimpse(ames)
## Rows: 2,930
## Columns: 81
## $ MS_SubClass <fct> One_Story_1946_and_Newer_All_Styles, One_Story_1946…
## $ MS_Zoning <fct> Residential_Low_Density, Residential_High_Density, …
## $ Lot_Frontage <dbl> 141, 80, 81, 93, 74, 78, 41, 43, 39, 60, 75, 0, 63,…
## $ Lot_Area <int> 31770, 11622, 14267, 11160, 13830, 9978, 4920, 5005…
## $ Street <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pav…
## $ Alley <fct> No_Alley_Access, No_Alley_Access, No_Alley_Access, …
## $ Lot_Shape <fct> Slightly_Irregular, Regular, Slightly_Irregular, Re…
## $ Land_Contour <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, HLS, Lvl, Lvl, L…
## $ Utilities <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, All…
## $ Lot_Config <fct> Corner, Inside, Corner, Corner, Inside, Inside, Ins…
## $ Land_Slope <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, G…
## $ Neighborhood <fct> North_Ames, North_Ames, North_Ames, North_Ames, Gil…
## $ Condition_1 <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, Norm, No…
## $ Condition_2 <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Nor…
## $ Bldg_Type <fct> OneFam, OneFam, OneFam, OneFam, OneFam, OneFam, Twn…
## $ House_Style <fct> One_Story, One_Story, One_Story, One_Story, Two_Sto…
## $ Overall_Qual <fct> Above_Average, Average, Above_Average, Good, Averag…
## $ Overall_Cond <fct> Average, Above_Average, Above_Average, Average, Ave…
## $ Year_Built <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, 199…
## $ Year_Remod_Add <int> 1960, 1961, 1958, 1968, 1998, 1998, 2001, 1992, 199…
## $ Roof_Style <fct> Hip, Gable, Hip, Hip, Gable, Gable, Gable, Gable, G…
## $ Roof_Matl <fct> CompShg, CompShg, CompShg, CompShg, CompShg, CompSh…
## $ Exterior_1st <fct> BrkFace, VinylSd, Wd Sdng, BrkFace, VinylSd, VinylS…
## $ Exterior_2nd <fct> Plywood, VinylSd, Wd Sdng, BrkFace, VinylSd, VinylS…
## $ Mas_Vnr_Type <fct> Stone, None, BrkFace, None, None, BrkFace, None, No…
## $ Mas_Vnr_Area <dbl> 112, 0, 108, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6…
## $ Exter_Qual <fct> Typical, Typical, Typical, Good, Typical, Typical, …
## $ Exter_Cond <fct> Typical, Typical, Typical, Typical, Typical, Typica…
## $ Foundation <fct> CBlock, CBlock, CBlock, CBlock, PConc, PConc, PConc…
## $ Bsmt_Qual <fct> Typical, Typical, Typical, Typical, Good, Typical, …
## $ Bsmt_Cond <fct> Good, Typical, Typical, Typical, Typical, Typical, …
## $ Bsmt_Exposure <fct> Gd, No, No, No, No, No, Mn, No, No, No, No, No, No,…
## $ BsmtFin_Type_1 <fct> BLQ, Rec, ALQ, ALQ, GLQ, GLQ, GLQ, ALQ, GLQ, Unf, U…
## $ BsmtFin_SF_1 <dbl> 2, 6, 1, 1, 3, 3, 3, 1, 3, 7, 7, 1, 7, 3, 3, 1, 3, …
## $ BsmtFin_Type_2 <fct> Unf, LwQ, Unf, Unf, Unf, Unf, Unf, Unf, Unf, Unf, U…
## $ BsmtFin_SF_2 <dbl> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1120, 0…
## $ Bsmt_Unf_SF <dbl> 441, 270, 406, 1045, 137, 324, 722, 1017, 415, 994,…
## $ Total_Bsmt_SF <dbl> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1595, …
## $ Heating <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, Gas…
## $ Heating_QC <fct> Fair, Typical, Typical, Excellent, Good, Excellent,…
## $ Central_Air <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, …
## $ Electrical <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SB…
## $ First_Flr_SF <int> 1656, 896, 1329, 2110, 928, 926, 1338, 1280, 1616, …
## $ Second_Flr_SF <int> 0, 0, 0, 0, 701, 678, 0, 0, 0, 776, 892, 0, 676, 0,…
## $ Low_Qual_Fin_SF <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Gr_Liv_Area <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616…
## $ Bsmt_Full_Bath <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, …
## $ Bsmt_Half_Bath <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Full_Bath <int> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3, 2, …
## $ Half_Bath <int> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, …
## $ Bedroom_AbvGr <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4, 4, …
## $ Kitchen_AbvGr <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Kitchen_Qual <fct> Typical, Typical, Good, Excellent, Typical, Good, G…
## $ TotRms_AbvGrd <int> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, 7, 6, 7, 5, 4, 12, 8,…
## $ Functional <fct> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, T…
## $ Fireplaces <int> 2, 0, 0, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, …
## $ Fireplace_Qu <fct> Good, No_Fireplace, No_Fireplace, Typical, Typical,…
## $ Garage_Type <fct> Attchd, Attchd, Attchd, Attchd, Attchd, Attchd, Att…
## $ Garage_Finish <fct> Fin, Unf, Unf, Fin, Fin, Fin, Fin, RFn, RFn, Fin, F…
## $ Garage_Cars <dbl> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, …
## $ Garage_Area <dbl> 528, 730, 312, 522, 482, 470, 582, 506, 608, 442, 4…
## $ Garage_Qual <fct> Typical, Typical, Typical, Typical, Typical, Typica…
## $ Garage_Cond <fct> Typical, Typical, Typical, Typical, Typical, Typica…
## $ Paved_Drive <fct> Partial_Pavement, Paved, Paved, Paved, Paved, Paved…
## $ Wood_Deck_SF <int> 210, 140, 393, 0, 212, 360, 0, 0, 237, 140, 157, 48…
## $ Open_Porch_SF <int> 62, 0, 36, 0, 34, 36, 0, 82, 152, 60, 84, 21, 75, 0…
## $ Enclosed_Porch <int> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Three_season_porch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Screen_Porch <int> 0, 120, 0, 0, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, 140, …
## $ Pool_Area <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Pool_QC <fct> No_Pool, No_Pool, No_Pool, No_Pool, No_Pool, No_Poo…
## $ Fence <fct> No_Fence, Minimum_Privacy, No_Fence, No_Fence, Mini…
## $ Misc_Feature <fct> None, None, Gar2, None, None, None, None, None, Non…
## $ Misc_Val <int> 0, 0, 12500, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, …
## $ Mo_Sold <int> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, 4, 3, 5, 2, 6, 6, 6, …
## $ Year_Sold <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 201…
## $ Sale_Type <fct> WD , WD , WD , WD , WD , WD , WD , WD , WD , WD , W…
## $ Sale_Condition <fct> Normal, Normal, Normal, Normal, Normal, Normal, Nor…
## $ Sale_Price <int> 215000, 105000, 172000, 244000, 189900, 195500, 213…
## $ Longitude <dbl> -93.61975, -93.61976, -93.61939, -93.61732, -93.638…
## $ Latitude <dbl> 42.05403, 42.05301, 42.05266, 42.05125, 42.06090, 4…
head(ames)
## # A tibble: 6 × 81
## MS_SubClass MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape
## <fct> <fct> <dbl> <int> <fct> <fct> <fct>
## 1 One_Story_1946_and_New… Resident… 141 31770 Pave No_A… Slightly…
## 2 One_Story_1946_and_New… Resident… 80 11622 Pave No_A… Regular
## 3 One_Story_1946_and_New… Resident… 81 14267 Pave No_A… Slightly…
## 4 One_Story_1946_and_New… Resident… 93 11160 Pave No_A… Regular
## 5 Two_Story_1946_and_New… Resident… 74 13830 Pave No_A… Slightly…
## 6 Two_Story_1946_and_New… Resident… 78 9978 Pave No_A… Slightly…
## # ℹ 74 more variables: Land_Contour <fct>, Utilities <fct>, Lot_Config <fct>,
## # Land_Slope <fct>, Neighborhood <fct>, Condition_1 <fct>, Condition_2 <fct>,
## # Bldg_Type <fct>, House_Style <fct>, Overall_Qual <fct>, Overall_Cond <fct>,
## # Year_Built <int>, Year_Remod_Add <int>, Roof_Style <fct>, Roof_Matl <fct>,
## # Exterior_1st <fct>, Exterior_2nd <fct>, Mas_Vnr_Type <fct>,
## # Mas_Vnr_Area <dbl>, Exter_Qual <fct>, Exter_Cond <fct>, Foundation <fct>,
## # Bsmt_Qual <fct>, Bsmt_Cond <fct>, Bsmt_Exposure <fct>, …
# Fit an initial linear model
model <- lm(Sale_Price ~ Gr_Liv_Area + Lot_Area + Overall_Qual + Year_Remod_Add, data = ames)
# Summary of the model
summary(model)
##
## Call:
## lm(formula = Sale_Price ~ Gr_Liv_Area + Lot_Area + Overall_Qual +
## Year_Remod_Add, data = ames)
##
## Residuals:
## Min 1Q Median 3Q Max
## -479911 -17893 452 15979 233947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.085e+06 7.764e+04 -13.979 < 2e-16 ***
## Gr_Liv_Area 4.918e+01 1.662e+00 29.590 < 2e-16 ***
## Lot_Area 1.098e+00 8.728e-02 12.578 < 2e-16 ***
## Overall_QualPoor 2.076e+04 2.025e+04 1.025 0.305315
## Overall_QualFair 2.572e+04 1.858e+04 1.384 0.166494
## Overall_QualBelow_Average 4.312e+04 1.789e+04 2.411 0.015984 *
## Overall_QualAverage 6.098e+04 1.779e+04 3.428 0.000616 ***
## Overall_QualAbove_Average 7.511e+04 1.782e+04 4.214 2.58e-05 ***
## Overall_QualGood 9.832e+04 1.791e+04 5.491 4.34e-08 ***
## Overall_QualVery_Good 1.509e+05 1.800e+04 8.383 < 2e-16 ***
## Overall_QualExcellent 2.335e+05 1.827e+04 12.783 < 2e-16 ***
## Overall_QualVery_Excellent 2.727e+05 1.919e+04 14.210 < 2e-16 ***
## Year_Remod_Add 5.503e+02 3.872e+01 14.211 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35400 on 2917 degrees of freedom
## Multiple R-squared: 0.8044, Adjusted R-squared: 0.8036
## F-statistic: 999.7 on 12 and 2917 DF, p-value: < 2.2e-16
# Diagnostic plots
png("diagnostic_plots.png", width = 800, height = 800)
par(mfrow = c(2, 2))
plot(model)
dev.off()
## png
## 2
Expected Improvement:
Ensures the linear model is suitable or indicates where transformations
are required.
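The visual diagnostics above can be complemented with a numeric comparison. The following is a minimal sketch, not part of the original lab: it assumes a log transformation of `Sale_Price` is one candidate remedy, and the helper names `spread_trend` and `normality_w` are hypothetical. It summarizes heteroscedasticity as the rank correlation between fitted values and absolute residuals, and normality as the Shapiro-Wilk W statistic.

```r
library(AmesHousing)

ames <- make_ames()

# Same predictors as the initial model, on the raw and log price scales
fit_raw <- lm(Sale_Price ~ Gr_Liv_Area + Lot_Area + Overall_Qual + Year_Remod_Add,
              data = ames)
fit_log <- lm(log(Sale_Price) ~ Gr_Liv_Area + Lot_Area + Overall_Qual + Year_Remod_Add,
              data = ames)

# Rough heteroscedasticity check (hypothetical helper): rank correlation
# between fitted values and absolute residuals; values near 0 suggest
# roughly constant spread
spread_trend <- function(fit) {
  cor(fitted(fit), abs(resid(fit)), method = "spearman")
}

# Rough normality check (hypothetical helper): Shapiro-Wilk W statistic on
# the residuals; closer to 1 is closer to normal
normality_w <- function(fit) {
  unname(shapiro.test(resid(fit))$statistic)
}

diagnostics <- data.frame(
  model = c("raw price", "log price"),
  spread_trend = c(spread_trend(fit_raw), spread_trend(fit_log)),
  shapiro_w = c(normality_w(fit_raw), normality_w(fit_log))
)
print(diagnostics)
```

If the log-scale row shows a spread trend closer to 0 and a W closer to 1, that would support modeling log price. Its coefficients would then be read as approximate percentage effects rather than dollar amounts.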
2. Handling Multicollinearity
Issue:
Variables such as Gr_Liv_Area and Lot_Area may exhibit high
correlation, which could lead to distorted coefficient estimates.
Recommendation:
Compute Variance Inflation Factors (VIF) to identify
multicollinearity. If VIF > 5 for any predictor, consider removing or
combining variables.
# Check for multicollinearity using Variance Inflation Factor (VIF)
car::vif(model)
## GVIF Df GVIF^(1/(2*Df))
## Gr_Liv_Area 1.649538 1 1.284343
## Lot_Area 1.105341 1 1.051352
## Overall_Qual 2.108756 9 1.042321
## Year_Remod_Add 1.524842 1 1.234845
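# Address high VIFs (if any) by dropping or combining variables
# Illustration only: every GVIF above is well below 5, so nothing needs to be
# dropped here; this refit shows what removing Lot_Area would look like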
model_no_lot <- lm(Sale_Price ~ Gr_Liv_Area + Overall_Qual + Year_Remod_Add, data = ames)
# Summary of the modified model
summary(model_no_lot)
##
## Call:
## lm(formula = Sale_Price ~ Gr_Liv_Area + Overall_Qual + Year_Remod_Add,
## data = ames)
##
## Residuals:
## Min 1Q Median 3Q Max
## -445640 -18950 444 16722 229136
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.042e+06 7.963e+04 -13.087 < 2e-16 ***
## Gr_Liv_Area 5.493e+01 1.640e+00 33.484 < 2e-16 ***
## Overall_QualPoor 1.564e+04 2.078e+04 0.753 0.451692
## Overall_QualFair 1.864e+04 1.907e+04 0.978 0.328337
## Overall_QualBelow_Average 3.448e+04 1.835e+04 1.879 0.060298 .
## Overall_QualAverage 5.352e+04 1.825e+04 2.933 0.003385 **
## Overall_QualAbove_Average 6.644e+04 1.829e+04 3.633 0.000285 ***
## Overall_QualGood 8.920e+04 1.837e+04 4.856 1.26e-06 ***
## Overall_QualVery_Good 1.409e+05 1.846e+04 7.636 3.01e-14 ***
## Overall_QualExcellent 2.248e+05 1.874e+04 11.999 < 2e-16 ***
## Overall_QualVery_Excellent 2.655e+05 1.969e+04 13.482 < 2e-16 ***
## Year_Remod_Add 5.340e+02 3.973e+01 13.442 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36340 on 2918 degrees of freedom
## Multiple R-squared: 0.7938, Adjusted R-squared: 0.793
## F-statistic: 1021 on 11 and 2918 DF, p-value: < 2.2e-16
Here’s a simpler explanation of results:
Overall Model
The model predicts house prices based on size (living area), renovation year, and overall quality.
It fits well: it explains about 79% of the variation in house prices (R-squared = 0.7938).
The residual standard error is about $36,340, so a typical prediction misses the sale price by roughly that much.
Important Findings
Living Area: Holding the other predictors fixed, each extra square foot of living space is associated with a price about $54.93 higher (p < 2e-16).
Renovation Year: Each additional year of recency in the remodel date is associated with a price about $534 higher.
Overall Quality: The better the house quality, the higher the price. Each figure below is relative to the Very_Poor baseline level:
Average quality: adds about $53,521.
Excellent quality: adds about $224,849.
Top quality (Very Excellent): adds about $265,478.
The Poor and Fair levels are not statistically distinguishable from the baseline (p > 0.3).
Model Strengths
The model shows which factors matter most for pricing:
Quality of the house is hugely important, especially for higher levels.
Larger houses are worth more.
Recent renovations add value.
Additional Insights
The relationships in the data are clear, and there are no major problems with overlapping or redundant predictors.
Still, the model could improve by exploring other factors like neighborhood, house age, or specific features.
This analysis can help builders, sellers, or buyers understand what
drives house prices!
Expected Improvement:
Checking for multicollinearity confirms that the coefficient estimates are
stable enough to interpret.
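To make the VIF threshold concrete, the sketch below computes the statistic by hand for the three numeric predictors using the textbook definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. The helper `manual_vif` is a hypothetical name introduced for illustration. For single-degree-of-freedom terms these values should agree with the GVIF column reported by `car::vif()`.

```r
library(AmesHousing)

ames <- make_ames()

numeric_predictors <- c("Gr_Liv_Area", "Lot_Area", "Year_Remod_Add")
all_predictors <- c(numeric_predictors, "Overall_Qual")

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j
# on all of the other predictors in the model (hypothetical helper)
manual_vif <- function(target, predictors, data) {
  others <- setdiff(predictors, target)
  aux_fit <- lm(reformulate(others, response = target), data = data)
  1 / (1 - summary(aux_fit)$r.squared)
}

vifs <- sapply(numeric_predictors, manual_vif,
               predictors = all_predictors, data = ames)
print(round(vifs, 3))
```

For a multi-level factor such as `Overall_Qual`, `car::vif()` reports GVIF^(1/(2*Df)) instead. As far as I understand the `car` documentation, that quantity is on a square-root scale, so it is usually squared before being compared against a cutoff like 5.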
3. Interaction Terms
Issue:
The relationship between house size and quality has not been thoroughly
examined.
Recommendation:
Incorporate interaction terms and evaluate their significance:
# Model with interaction between Gr_Liv_Area and Overall_Qual
model_interaction <- lm(Sale_Price ~ Gr_Liv_Area * Overall_Qual + Lot_Area + Year_Remod_Add, data = ames)
# Summary of interaction model
summary(model_interaction)
##
## Call:
## lm(formula = Sale_Price ~ Gr_Liv_Area * Overall_Qual + Lot_Area +
## Year_Remod_Add, data = ames)
##
## Residuals:
## Min 1Q Median 3Q Max
## -302147 -16143 -218 15052 343845
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.077e+06 8.038e+04 -13.403 <2e-16
## Gr_Liv_Area 1.701e+01 3.661e+01 0.465 0.6422
## Overall_QualPoor 3.203e+04 4.956e+04 0.646 0.5182
## Overall_QualFair 1.762e+04 4.024e+04 0.438 0.6614
## Overall_QualBelow_Average 3.392e+04 3.747e+04 0.905 0.3653
## Overall_QualAverage 5.909e+04 3.692e+04 1.601 0.1096
## Overall_QualAbove_Average 3.886e+04 3.701e+04 1.050 0.2938
## Overall_QualGood 4.723e+04 3.722e+04 1.269 0.2046
## Overall_QualVery_Good 6.139e+04 3.750e+04 1.637 0.1017
## Overall_QualExcellent 9.740e+04 3.970e+04 2.453 0.0142
## Overall_QualVery_Excellent 4.553e+05 4.094e+04 11.122 <2e-16
## Lot_Area 1.159e+00 8.278e-02 13.999 <2e-16
## Year_Remod_Add 5.604e+02 3.667e+01 15.282 <2e-16
## Gr_Liv_Area:Overall_QualPoor -2.771e+01 6.062e+01 -0.457 0.6477
## Gr_Liv_Area:Overall_QualFair 1.286e+01 3.946e+01 0.326 0.7445
## Gr_Liv_Area:Overall_QualBelow_Average 1.545e+01 3.713e+01 0.416 0.6772
## Gr_Liv_Area:Overall_QualAverage 1.092e+01 3.672e+01 0.297 0.7662
## Gr_Liv_Area:Overall_QualAbove_Average 3.738e+01 3.672e+01 1.018 0.3089
## Gr_Liv_Area:Overall_QualGood 4.545e+01 3.675e+01 1.237 0.2163
## Gr_Liv_Area:Overall_QualVery_Good 6.431e+01 3.679e+01 1.748 0.0806
## Gr_Liv_Area:Overall_QualExcellent 8.342e+01 3.726e+01 2.239 0.0252
## Gr_Liv_Area:Overall_QualVery_Excellent -4.238e+01 3.706e+01 -1.143 0.2530
##
## (Intercept) ***
## Gr_Liv_Area
## Overall_QualPoor
## Overall_QualFair
## Overall_QualBelow_Average
## Overall_QualAverage
## Overall_QualAbove_Average
## Overall_QualGood
## Overall_QualVery_Good
## Overall_QualExcellent *
## Overall_QualVery_Excellent ***
## Lot_Area ***
## Year_Remod_Add ***
## Gr_Liv_Area:Overall_QualPoor
## Gr_Liv_Area:Overall_QualFair
## Gr_Liv_Area:Overall_QualBelow_Average
## Gr_Liv_Area:Overall_QualAverage
## Gr_Liv_Area:Overall_QualAbove_Average
## Gr_Liv_Area:Overall_QualGood
## Gr_Liv_Area:Overall_QualVery_Good .
## Gr_Liv_Area:Overall_QualExcellent *
## Gr_Liv_Area:Overall_QualVery_Excellent
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33480 on 2908 degrees of freedom
## Multiple R-squared: 0.8257, Adjusted R-squared: 0.8244
## F-statistic: 655.8 on 21 and 2908 DF, p-value: < 2.2e-16
# Visualize interaction effect
ggplot(ames, aes(x = Gr_Liv_Area, y = Sale_Price, color = factor(Overall_Qual))) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Interaction Between Living Area and Overall Quality",
x = "Above Ground Living Area (sq. ft.)", y = "Sale Price") +
theme_minimal() +
theme(legend.position = "bottom")
## `geom_smooth()` using formula = 'y ~ x'
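This plot illustrates how a house’s size (living area) and quality (overall quality) jointly influence its price. Each quality level gets its own fitted line, so differing slopes across levels indicate an interaction.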
Expected Improvement:
Addresses the combined effects of variables, offering enhanced
understanding of feature interactions.
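Because the interaction adds nine coefficients at once, individual t-tests are a weak guide to whether it matters overall. A minimal sketch of a joint test follows. It assumes a nested-model partial F-test via `anova()` is an acceptable way to "evaluate their significance"; the object names are illustrative.

```r
library(AmesHousing)

ames <- make_ames()

# Additive model and the same model with a size-by-quality interaction
fit_additive <- lm(Sale_Price ~ Gr_Liv_Area + Overall_Qual + Lot_Area + Year_Remod_Add,
                   data = ames)
fit_interaction <- lm(Sale_Price ~ Gr_Liv_Area * Overall_Qual + Lot_Area + Year_Remod_Add,
                      data = ames)

# Partial F-test: do the interaction coefficients jointly improve the fit?
comparison <- anova(fit_additive, fit_interaction)
print(comparison)

# Adjusted R-squared for each model, for a quick effect-size comparison
adj_r2 <- c(additive = summary(fit_additive)$adj.r.squared,
            interaction = summary(fit_interaction)$adj.r.squared)
print(round(adj_r2, 4))
```

A small p-value in the second row of the `anova()` table would indicate that the slope of price on living area differs across quality levels, even when few of the individual interaction terms are significant on their own.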
Overcoming Biases (Existing or Potential)
Existing Biases:
Data Bias: The data utilized may encapsulate historical biases in housing prices, including redlining, racial discrimination, or socioeconomic disparities. If specific regions are historically underfunded or undervalued, these biases may be reflected in the model, resulting in unequal predictions across various demographics or geographic areas.
Example: If the dataset indicates lower house prices in certain zip codes due to previous redlining practices, the model may unintentionally imply that houses in those areas are less valuable, thereby perpetuating historical inequities.
Selection Bias: The dataset might not accurately represent the entire population of homes or housing transactions, particularly if certain areas or types of homes are inadequately represented. For instance, if the dataset comprises solely new or renovated homes, the model may not effectively generalize to older or less renovated properties.
Potential Biases:
Feature Bias: The selected features for modeling, such as square footage, number of bedrooms, or overall quality, may not encompass all the factors that affect housing prices. Critical factors such as neighborhood safety, access to public services, and local economic conditions may be overlooked.
Example: A model that fails to account for proximity to schools or transportation networks may produce inaccurate predictions of housing prices for properties located near these essential amenities.
Solutions:
Data Scrubbing: Ensure diverse data representation, encompassing all demographic, geographic, and economic groups. Examine and rectify potential data imbalances.
Regular Audits: Conduct periodic evaluations of model outcomes for fairness and detect any discriminatory patterns.
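One simple form such an audit could take is a check for whether the model systematically over- or under-predicts in particular neighborhoods. The sketch below is an assumption about how to operationalize the audit, not a complete fairness analysis: it refits the initial model and averages the residuals by `Neighborhood`. The Ames data contains no demographic variables, so geography serves only as a rough proxy here.

```r
library(AmesHousing)

ames <- make_ames()

fit <- lm(Sale_Price ~ Gr_Liv_Area + Lot_Area + Overall_Qual + Year_Remod_Add,
          data = ames)

# Attach residuals (actual minus predicted price) to each sale;
# droplevels() removes neighborhoods that have no sales in the data
audit <- data.frame(
  Neighborhood = droplevels(ames$Neighborhood),
  residual = resid(fit)
)

# Mean residual and number of sales per neighborhood
by_hood <- data.frame(
  Neighborhood = levels(audit$Neighborhood),
  mean_residual = as.vector(tapply(audit$residual, audit$Neighborhood, mean)),
  n_sales = as.vector(table(audit$Neighborhood))
)

# Sort from most over-predicted (negative) to most under-predicted (positive)
by_hood <- by_hood[order(by_hood$mean_residual), ]
print(head(by_hood, 5))
print(tail(by_hood, 5))
```

Neighborhoods with large mean residuals and a reasonable number of sales are candidates for closer review, for example by adding `Neighborhood` as a predictor or by collecting additional local data.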
Possible Risks or Societal Implications
Impact on Housing Markets: Utilizing this model for real estate investment or pricing strategies may affect housing demand in specific neighborhoods, potentially leading to inflated or deflated property prices due to biased predictions. For instance, real estate investors might refrain from investing in specific areas due to anticipated lower property values, thereby limiting investment and development opportunities in those neighborhoods.
Access to Housing: Homebuyers in historically marginalized communities may face disadvantages due to models that undervalue their properties. This can contribute to wealth inequality if the model is utilized for determining tax assessments, insurance rates, or home loan values.
Gentrification Risks: If real estate developers leverage the model to identify “up-and-coming” neighborhoods based on anticipated future price increases, it may result in gentrification, displacing long-term residents and diminishing affordable housing.
Mitigation Strategies:
Transparency: Ensure that the modeling process and its predictions are transparent and easy to understand to prevent unintended consequences.
Inclusive Modeling: Integrate feedback from affected communities to guarantee the model addresses the needs of diverse groups.
Crucial Issues That May Not Be Quantifiable
Quality of Life Factors: The model may fail to adequately consider intangible elements influencing housing value, such as neighborhood quality of life (e.g., social cohesion, safety, or environmental conditions). These factors are difficult to quantify yet are essential in shaping perceptions of a neighborhood’s desirability.
Psychological and Social Value: The emotional or social significance of a home or neighborhood may not be reflected in statistical variables. For example, a neighborhood may possess historical or cultural significance that cannot be quantified by square footage or amenities, yet holds great importance for its residents.
Environmental Concerns: Factors such as climate change, risks associated with natural disasters, and sustainability aspects (e.g., energy efficiency or environmental hazards) may not be directly reflected in the existing data, but are becoming increasingly significant to home buyers and investors.
Mitigation Strategies:
Holistic Approach: Enhance quantitative models by integrating qualitative assessments (e.g., community surveys, social factors).
Model Updates: Consistently revise the model to include emerging trends such as environmental factors or social values, even if these are more challenging to quantify.
Who Would Be Affected by This Project, and How Does That Affect Your Critique?
Affected Stakeholders:
Home Buyers: Their experience will be directly influenced by the model’s accuracy in predicting housing prices, which will affect their ability to make informed purchasing decisions.
Real Estate Agents and Investors: They will rely on this model for guiding investment decisions, evaluating properties, and developing pricing strategies. Misguided predictions may lead to financial losses or missed opportunities.
Low-Income and Marginalized Communities: These groups could suffer adverse effects if the model results in decreased investments or lowered property values in their neighborhoods, potentially worsening socioeconomic disparities.
Local Governments and Urban Planners: These stakeholders may depend on housing price data for policy-making, zoning decisions, and tax assessments, which in turn influences community development and urban planning.
Impact on Critique:
Inclusivity: The model should be assessed for its inclusivity. It is essential to ensure that marginalized communities are not adversely impacted by biased predictions.
Social Responsibility: As a data scientist or analyst, it is imperative to uphold social responsibility and consider the wider implications of implementing models that affect public and private decisions.
Accountability: Should the model produce unintended negative consequences (e.g., escalating housing prices in low-income neighborhoods), there must be mechanisms for accountability and correction.