For this lab, you’ll be working with a group of other classmates, and each group will be assigned to critique a lab from a previous week.
Your group will have three goals:
Create an explicit business scenario which might leverage the data (and methods) used in the lab.
Critique the models (or analyses) present in the lab based on this scenario.
Devise a list of ethical and epistemological concerns that might pertain to this lab in the context of your business scenario.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(AmesHousing)
library(boot)
library(broom)
library(lindia)
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ dials 1.3.0 ✔ rsample 1.2.1
## ✔ infer 1.0.7 ✔ tune 1.2.1
## ✔ modeldata 1.4.0 ✔ workflows 1.1.4
## ✔ parsnip 1.2.1 ✔ workflowsets 1.1.0
## ✔ recipes 1.1.0 ✔ yardstick 1.3.1
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
# Importing the data
ames <- make_ames()

ames_basic <- ames |>
  rename_with(tolower) |>
  filter(bldg_type == "OneFam",
         house_style == "One_Story",
         year_built >= 2000) |>
  mutate(great_qual = ifelse(overall_qual %in%
                               c("Very_Excellent", "Excellent", "Very_Good"),
                             1, 0))
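A quick sanity check (added here, not part of the original import chunk) of how many of the filtered homes fall into the great_qual = 1 group:
# How many one-story, post-2000 single-family homes are rated Very_Good or better?
ames_basic |> count(great_qual)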
First, create your own context for the lab. This should be a business use-case such as “a real estate firm aims to present housing trends (and recommendations) for their clients in Ames, Iowa”.
You do not need to solve the problem, you only need to define it.
Your scenario should include the following:
<some variable>
Note: words like “identify”, “maximize”, “determine”, etc. could be useful here. Feel free to find the right action verbs that work for you!
Since this is a class, and not a workplace, we need to be careful not to present too much information to you all at once. For this reason, our labs are often not as analytically rigorous or thorough as they might be in practice … So here, your goal is to:
Present a list of at least 3 (improved) analyses you would recommend for your business scenario. Each proposed analysis should be accompanied by a “proof of concept” R implementation. (As usual, execute R code blocks here in the RMarkdown file.)
In the lab your group has been assigned, consider issues with models, statistical improvements, interpretations, analyses, visualizations, etc. Use this notebook as a sandbox for trying out different code, and investigating the data from a different perspective. Take notes on all the issues you see, and propose your solutions (even if you might need to request more data or resources to accomplish those solutions).
You’ll want to consider the following:
Feel free to use the reading for the week associated with your assigned lab to help refresh your memory on the concepts presented.
1.) Adding an interaction term to the linear regression model provides a deeper understanding of how these variables jointly affect the sale price of a property. Currently the model treats the variables year_remod_add and lot_area independently; the interaction gives us a better glimpse of the bigger picture of how these variables come together to affect sale price. The fitted model changes only slightly and suggests the same findings as the original model, but we now know with a better degree of certainty and accuracy that these two factors do have a significant effect on sale price. A partial dependence plot would be a useful visualization here, since it would show the effect of our two variables while taking other influences into account; this matters because these two variables aren’t the only factors that affect sale price (a minimal sketch of such a plot appears after the model summary below). The improved fit of the new model suggests that the interaction term carries some explanatory power, and the negative interaction coefficient points to a diminishing-return effect when both factors are high. (Nathan)
model <- lm(sale_price ~ year_remod_add * lot_area, data = ames_basic)
summary(model)
##
## Call:
## lm(formula = sale_price ~ year_remod_add * lot_area, data = ames_basic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -305878 -44452 -11059 36263 340697
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.449e+07 1.450e+07 -3.757 0.000204 ***
## year_remod_add 2.723e+04 7.227e+03 3.767 0.000196 ***
## lot_area 3.684e+03 1.280e+03 2.878 0.004262 **
## year_remod_add:lot_area -1.830e+00 6.377e-01 -2.870 0.004380 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 79970 on 323 degrees of freedom
## Multiple R-squared: 0.2706, Adjusted R-squared: 0.2638
## F-statistic: 39.93 on 3 and 323 DF, p-value: < 2.2e-16
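To go with the partial dependence idea above, here is a minimal sketch (not part of the original submission) that visualizes the fitted interaction by predicting sale price over a grid of lot_area values at a few representative remodel years; the grid and the chosen years are illustrative assumptions.
# Illustrative sketch: predicted sale price across lot_area for a few
# remodel years, to show the shape of the fitted interaction.
pred_grid <- expand_grid(
  lot_area = seq(min(ames_basic$lot_area), max(ames_basic$lot_area), length.out = 100),
  year_remod_add = c(2000, 2005, 2010)  # example years, chosen for illustration
)
pred_grid$pred_price <- predict(model, newdata = pred_grid)

ggplot(pred_grid, aes(x = lot_area, y = pred_price,
                      color = factor(year_remod_add))) +
  geom_line() +
  labs(x = "Lot area (sq ft)", y = "Predicted sale price",
       color = "Year remodeled")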
2.) Multicollinearity: size-related variables (e.g., total living area, basement area) could be highly correlated with each other, leading to multicollinearity issues if multiple size-related variables are included in the model without proper handling (a VIF check is sketched after the model code below). (Akshay)
# Log-transform Sale_Price to handle skewness
ames <- ames %>%
  mutate(Sale_Price_Log = log10(Sale_Price))

# Fit a linear regression model with an interaction between Gr_Liv_Area (size)
# and Year_Remod_Add (year last remodeled)
model <- lm(Sale_Price_Log ~ Gr_Liv_Area * Year_Remod_Add + Neighborhood +
              Overall_Qual + Overall_Cond, data = ames)

# Summary of the model
summary(model)
##
## Call:
## lm(formula = Sale_Price_Log ~ Gr_Liv_Area * Year_Remod_Add +
## Neighborhood + Overall_Qual + Overall_Cond, data = ames)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.65012 -0.03307 0.00214 0.03793 0.26801
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 2.144e+00 4.492e-01
## Gr_Liv_Area 5.641e-04 3.058e-04
## Year_Remod_Add 1.200e-03 2.264e-04
## NeighborhoodCollege_Creek 3.650e-02 6.280e-03
## NeighborhoodOld_Town -9.532e-02 5.727e-03
## NeighborhoodEdwards -4.594e-02 6.018e-03
## NeighborhoodSomerset 4.394e-02 7.484e-03
## NeighborhoodNorthridge_Heights 8.614e-02 8.362e-03
## NeighborhoodGilbert 1.874e-02 7.146e-03
## NeighborhoodSawyer -6.102e-04 6.529e-03
## NeighborhoodNorthwest_Ames 1.528e-02 7.142e-03
## NeighborhoodSawyer_West 6.638e-03 7.481e-03
## NeighborhoodMitchell 2.383e-02 7.349e-03
## NeighborhoodBrookside -5.600e-02 7.422e-03
## NeighborhoodCrawford 4.345e-02 7.737e-03
## NeighborhoodIowa_DOT_and_Rail_Road -1.110e-01 8.003e-03
## NeighborhoodTimberland 6.865e-02 9.400e-03
## NeighborhoodNorthridge 7.109e-02 1.013e-02
## NeighborhoodStone_Brook 8.214e-02 1.152e-02
## NeighborhoodSouth_and_West_of_Iowa_State_University -6.078e-02 1.054e-02
## NeighborhoodClear_Creek 7.225e-02 1.095e-02
## NeighborhoodMeadow_Village -1.086e-01 1.213e-02
## NeighborhoodBriardale -1.229e-01 1.301e-02
## NeighborhoodBloomington_Heights 2.959e-02 1.413e-02
## NeighborhoodVeenker 6.591e-02 1.468e-02
## NeighborhoodNorthpark_Villa -3.977e-02 1.479e-02
## NeighborhoodBlueste -5.835e-02 2.210e-02
## NeighborhoodGreens 1.771e-02 2.506e-02
## NeighborhoodGreen_Hills 2.025e-01 4.854e-02
## NeighborhoodLandmark -5.891e-02 6.839e-02
## Overall_QualPoor 1.253e-01 3.969e-02
## Overall_QualFair 2.387e-01 3.660e-02
## Overall_QualBelow_Average 2.841e-01 3.559e-02
## Overall_QualAverage 3.313e-01 3.551e-02
## Overall_QualAbove_Average 3.668e-01 3.563e-02
## Overall_QualGood 4.013e-01 3.579e-02
## Overall_QualVery_Good 4.652e-01 3.602e-02
## Overall_QualExcellent 5.562e-01 3.664e-02
## Overall_QualVery_Excellent 5.541e-01 3.836e-02
## Overall_CondPoor -4.064e-03 3.413e-02
## Overall_CondFair 3.117e-02 2.805e-02
## Overall_CondBelow_Average 1.011e-01 2.744e-02
## Overall_CondAverage 1.340e-01 2.691e-02
## Overall_CondAbove_Average 1.480e-01 2.696e-02
## Overall_CondGood 1.560e-01 2.707e-02
## Overall_CondVery_Good 1.498e-01 2.749e-02
## Overall_CondExcellent 1.562e-01 2.911e-02
## Gr_Liv_Area:Year_Remod_Add -2.228e-07 1.540e-07
## t value Pr(>|t|)
## (Intercept) 4.773 1.91e-06 ***
## Gr_Liv_Area 1.845 0.065201 .
## Year_Remod_Add 5.301 1.24e-07 ***
## NeighborhoodCollege_Creek 5.811 6.88e-09 ***
## NeighborhoodOld_Town -16.645 < 2e-16 ***
## NeighborhoodEdwards -7.633 3.10e-14 ***
## NeighborhoodSomerset 5.871 4.83e-09 ***
## NeighborhoodNorthridge_Heights 10.300 < 2e-16 ***
## NeighborhoodGilbert 2.623 0.008770 **
## NeighborhoodSawyer -0.093 0.925548
## NeighborhoodNorthwest_Ames 2.140 0.032443 *
## NeighborhoodSawyer_West 0.887 0.374943
## NeighborhoodMitchell 3.243 0.001198 **
## NeighborhoodBrookside -7.546 5.98e-14 ***
## NeighborhoodCrawford 5.616 2.15e-08 ***
## NeighborhoodIowa_DOT_and_Rail_Road -13.875 < 2e-16 ***
## NeighborhoodTimberland 7.303 3.62e-13 ***
## NeighborhoodNorthridge 7.020 2.76e-12 ***
## NeighborhoodStone_Brook 7.132 1.25e-12 ***
## NeighborhoodSouth_and_West_of_Iowa_State_University -5.768 8.88e-09 ***
## NeighborhoodClear_Creek 6.599 4.90e-11 ***
## NeighborhoodMeadow_Village -8.946 < 2e-16 ***
## NeighborhoodBriardale -9.447 < 2e-16 ***
## NeighborhoodBloomington_Heights 2.094 0.036321 *
## NeighborhoodVeenker 4.489 7.45e-06 ***
## NeighborhoodNorthpark_Villa -2.689 0.007198 **
## NeighborhoodBlueste -2.640 0.008325 **
## NeighborhoodGreens 0.707 0.479838
## NeighborhoodGreen_Hills 4.171 3.12e-05 ***
## NeighborhoodLandmark -0.861 0.389133
## Overall_QualPoor 3.157 0.001609 **
## Overall_QualFair 6.523 8.12e-11 ***
## Overall_QualBelow_Average 7.982 2.06e-15 ***
## Overall_QualAverage 9.331 < 2e-16 ***
## Overall_QualAbove_Average 10.295 < 2e-16 ***
## Overall_QualGood 11.214 < 2e-16 ***
## Overall_QualVery_Good 12.913 < 2e-16 ***
## Overall_QualExcellent 15.179 < 2e-16 ***
## Overall_QualVery_Excellent 14.444 < 2e-16 ***
## Overall_CondPoor -0.119 0.905215
## Overall_CondFair 1.111 0.266604
## Overall_CondBelow_Average 3.686 0.000232 ***
## Overall_CondAverage 4.981 6.69e-07 ***
## Overall_CondAbove_Average 5.487 4.44e-08 ***
## Overall_CondGood 5.761 9.25e-09 ***
## Overall_CondVery_Good 5.448 5.51e-08 ***
## Overall_CondExcellent 5.366 8.71e-08 ***
## Gr_Liv_Area:Year_Remod_Add -1.447 0.147967
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06821 on 2882 degrees of freedom
## Multiple R-squared: 0.8539, Adjusted R-squared: 0.8515
## F-statistic: 358.4 on 47 and 2882 DF, p-value: < 2.2e-16
# Predicting sale prices on new data (example; Neighborhood, Overall_Qual, and
# Overall_Cond must use the factor labels from make_ames(), chosen here for illustration)
new_data <- data.frame(Gr_Liv_Area = c(1500, 2000),
                       Year_Remod_Add = c(2010, 2005),
                       Neighborhood = c("North_Ames", "Edwards"),
                       Overall_Qual = c("Good", "Very_Good"),
                       Overall_Cond = c("Average", "Good"))
10^predict(model, newdata = new_data)  # back-transform from log10 to dollars
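One way to actually diagnose the multicollinearity concern raised above is a variance inflation factor (VIF) check. The sketch below is not part of the original analysis: the particular size-related columns are an illustrative choice, and it relies on car::vif() (the car package is loaded for analysis 3 further down).
# Minimal multicollinearity check (illustrative variable choice):
# VIF values well above ~5 flag strongly correlated size predictors.
library(car)
size_model <- lm(Sale_Price_Log ~ Gr_Liv_Area + Total_Bsmt_SF +
                   First_Flr_SF + TotRms_AbvGrd, data = ames)
vif(size_model)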
3.) The residuals for lot_area in the basic model are noticeably higher at larger values, causing overestimation for large lots. To remedy this, a Box-Cox transformation is needed. After applying the transformation to the sale price, we can see much lower residuals for the lot_area variable. (Wyatt)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:boot':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
model <- lm(sale_price ~ year_remod_add + great_qual + lot_area, ames_basic)
pT <- powerTransform(model, family = "bcnPower")
## Warning in sqrt(diag(res$invHess[2, 2, drop = FALSE])): NaNs produced
pT$lambda
## [1] -0.4995754
# Since lambda is close to -0.5, we apply a 1/sqrt(x) transformation to sale_price
ames_basic$bc_price <- 1/sqrt(ames_basic$sale_price)
bc_model <- lm(bc_price ~ year_remod_add + great_qual + lot_area, ames_basic)
plots <- gg_resX(bc_model, plot.all = FALSE)
plots$lot_area
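As a quick check on the claim that the transformation shrinks the lot_area residuals, the following sketch (not in the original submission) produces the same residual plot for the untransformed model so the two can be compared.
# For comparison: residuals vs lot_area for the untransformed model,
# where the residuals fan out at larger lot sizes.
orig_plots <- gg_resX(model, plot.all = FALSE)
orig_plots$lot_area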
Review the materials from the Week 5 lesson on Ethics and Epistemology. This includes lecture slides, the lecture video, or the reading. You should also consider doing supplementary research on the topic at hand (e.g., news outlets, historical articles, etc.). Some issues you might want to consider include:
Socioeconomic bias: this model may inadvertently capture biases related to race, status, or other factors. For example, certain neighborhoods may have been historically undervalued and impacted by systemic issues such as redlining. If that is the case and the model cannot account for it, the model could end up reinforcing those inequalities. Building off this, the fairness of the model must be critiqued: as constructed, it risks further inflating valuations in already high-value communities while keeping low-value ones depressed.
Inflation: since the goal of this model is to maximize price for a real estate company, using it at large scale (god forbid) would more than likely inflate the market, which could contribute to homelessness and create more affordability problems in real estate, shrinking the already small population that can easily buy a home.
Omission of relevant variables: drawing such an important and influential conclusion about a property from only two variables is crazy. This model was created for the purposes of this exercise and only uses the variables within it, omitting many relevant factors such as location, quality of local schools, etc. This must be made clear to any user before the model is applied; otherwise it could end up serving some demographics better than others.
For example, in Week 10-11, we used the year built, square footage, elevation, and the number of bedrooms to determine the price of an apartment. A few questions you might ask are:
Share your model critique in this notebook as your data dive submission for the week. Make sure to include your own R code which executes suggested routines.