For this lab, you’ll be working with a group of other classmates, and each group will be assigned a lab from a previous week. Your goal is to critique the models (or analyses) present in the lab.
First, review the materials from the Lesson on Ethics and Epistemology (week 5?). This includes the lecture slides, the lecture video, and the reading. You can use these as reference materials for this lab. You may also consider the reading for the week associated with the lab, or supplementary research on the topic at hand (e.g., news outlets, historical articles, etc.).
For the lab your group has been assigned, consider issues with models, interpretations, analyses, visualizations, etc. Use this notebook as a sandbox for trying out different code, and investigating the data from a different perspective. Take notes on all the issues you see, and possible solutions (even if you would need to request more data or resources to accomplish those solutions).
Share your model critique in this notebook as your data dive submission for the week.
As a start, think about the context of the lab and consider the following:
Analytical issues, such as model assumptions
Overcoming biases (existing or potential)
Possible risks or societal implications
Crucial issues which might not be measurable
Treat this exercise as if the analyses in your assigned lab (i.e., the one you are critiquing) were to be published, made available to the public in a press release, or used at some large company (e.g., for mpg data, imagine if Toyota used the conclusions to drive strategic decisions).
# your code here
If you were unable to attend class, select a notes_*.Rmd file from a previous week (not including weeks 1 or 3), and complete the analysis above. Share your critique below.
For example, in Week 11, we used the year built, square footage, elevation, and the number of bedrooms to determine the price of an apartment.
We will critique the linear regression model from the Week 8 lab, which uses the Ames Housing dataset. We aim to understand how various factors, such as first-floor square footage and quality metrics, influence the sale prices of houses.
We examine two main types of linear regression models: basic linear models and models with interaction terms.
# Load necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
# Load the Ames Housing dataset
ames <- AmesHousing::make_ames()
# Basic Linear Model
basic_model <- lm(Sale_Price ~ First_Flr_SF + Overall_Qual, data = ames)
summary(basic_model)
##
## Call:
## lm(formula = Sale_Price ~ First_Flr_SF + Overall_Qual, data = ames)
##
## Residuals:
## Min 1Q Median 3Q Max
## -419854 -21359 -1972 18099 292059
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.295e+03 2.017e+04 0.114 0.909426
## First_Flr_SF 5.201e+01 2.251e+00 23.105 < 2e-16 ***
## Overall_QualPoor 1.787e+04 2.296e+04 0.778 0.436549
## Overall_QualFair 3.390e+04 2.105e+04 1.610 0.107443
## Overall_QualBelow_Average 5.579e+04 2.025e+04 2.755 0.005905 **
## Overall_QualAverage 7.683e+04 2.013e+04 3.818 0.000138 ***
## Overall_QualAbove_Average 1.039e+05 2.013e+04 5.160 2.64e-07 ***
## Overall_QualGood 1.418e+05 2.015e+04 7.037 2.44e-12 ***
## Overall_QualVery_Good 1.930e+05 2.023e+04 9.542 < 2e-16 ***
## Overall_QualExcellent 2.750e+05 2.054e+04 13.389 < 2e-16 ***
## Overall_QualVery_Excellent 3.335e+05 2.153e+04 15.490 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40150 on 2919 degrees of freedom
## Multiple R-squared: 0.7483, Adjusted R-squared: 0.7474
## F-statistic: 867.7 on 10 and 2919 DF, p-value: < 2.2e-16
# Interaction
interaction_model <- lm(Sale_Price ~ Year_Remod_Add + Overall_Qual + Year_Remod_Add:Overall_Qual, data = ames)
summary(interaction_model)
##
## Call:
## lm(formula = Sale_Price ~ Year_Remod_Add + Overall_Qual + Year_Remod_Add:Overall_Qual,
## data = ames)
##
## Residuals:
## Min 1Q Median 3Q Max
## -236778 -22961 -2613 17809 264225
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27341491 49301533 0.555 0.579
## Year_Remod_Add -13991 25273 -0.554 0.580
## Overall_QualPoor -22745732 49481273 -0.460 0.646
## Overall_QualFair -27782545 49305417 -0.563 0.573
## Overall_QualBelow_Average -28039143 49302363 -0.569 0.570
## Overall_QualAverage -27971042 49301753 -0.567 0.571
## Overall_QualAbove_Average -28425742 49301784 -0.577 0.564
## Overall_QualGood -28365659 49302152 -0.575 0.565
## Overall_QualVery_Good -29937368 49304493 -0.607 0.544
## Overall_QualExcellent -34815347 49335337 -0.706 0.480
## Overall_QualVery_Excellent 5040490 49419699 0.102 0.919
## Year_Remod_Add:Overall_QualPoor 11663 25365 0.460 0.646
## Year_Remod_Add:Overall_QualFair 14258 25275 0.564 0.573
## Year_Remod_Add:Overall_QualBelow_Average 14400 25274 0.570 0.569
## Year_Remod_Add:Overall_QualAverage 14378 25273 0.569 0.569
## Year_Remod_Add:Overall_QualAbove_Average 14620 25273 0.578 0.563
## Year_Remod_Add:Overall_QualGood 14607 25273 0.578 0.563
## Year_Remod_Add:Overall_QualVery_Good 15424 25275 0.610 0.542
## Year_Remod_Add:Overall_QualExcellent 17902 25290 0.708 0.479
## Year_Remod_Add:Overall_QualVery_Excellent -1938 25331 -0.077 0.939
##
## Residual standard error: 41910 on 2910 degrees of freedom
## Multiple R-squared: 0.7266, Adjusted R-squared: 0.7248
## F-statistic: 406.9 on 19 and 2910 DF, p-value: < 2.2e-16
# Residuals vs Fitted for Basic Model
plot(basic_model$fitted.values, residuals(basic_model), xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red")
# Normal Q-Q for Basic Model
qqnorm(residuals(basic_model))
qqline(residuals(basic_model), col = "red")
# Scale-Location for Basic Model
plot(basic_model$fitted.values, sqrt(abs(residuals(basic_model))), xlab = "Fitted Values", ylab = "Sqrt(|Residuals|)")
abline(h = 0, col = "red")
# Residuals vs Leverage for Basic Model
plot(hatvalues(basic_model), residuals(basic_model), xlab = "Leverage", ylab = "Residuals")
abline(h = 0, col = "red")
Basic Linear Model Summary
What the Model Shows: The model looks at how first-floor square footage and the overall quality of a house (poor, fair, average, etc.) affect its sale price.
Key Findings:
Quality Impact: The quality of a house has a clear effect on its price. Higher quality generally means a higher price, except for the ‘Poor’ and ‘Fair’ categories, which don’t show a strong effect.
Residuals: There’s a big range in the residuals. This might mean there are outliers or the model is missing some important factors.
Concerns: The model might not fully capture the true relationship since it assumes a simple straight-line effect. Real-life relationships might be more complex.
The big range in residuals suggests that the model might be oversimplified or missing some key details.
Overall Fit: The model explains about 74.74% of the variation in house prices, which is pretty good but not perfect.
Interaction Model Summary
What the Model Shows: This model tries to understand how the year of remodeling and the overall quality together influence the sale price.
Key Findings:
Interaction Terms: The model includes terms that combine remodeling year and quality, but these don’t show a strong effect on price.
Residuals: Like the basic model, there’s a big range in residuals here too.
Concerns:
The interaction (combined effect) terms are not showing significant results, suggesting that the combined effect of remodeling year and quality on price might not be as expected or is too varied.
Similar to the basic model, this model might not capture all the complexities of real-life pricing.
Overall Fit: This model explains about 72.48% of the variation in house prices, slightly less than the basic model despite being more complex.
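Since the two models use different predictors and are not nested, a quick supplementary way to compare them (not part of the original lab) is an information criterion such as AIC:
# Compare the basic and interaction models by AIC (lower is better)
AIC(basic_model, interaction_model)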
Improvements:
Adding more details: If the relationship between square footage and price isn't a straight line, consider adding more flexible terms (e.g., a squared square-footage term); a brief sketch follows below.
Managing overfitting: If the model is too closely tailored to this specific dataset (overfitting), techniques like ridge or lasso regression can help.
Looking at outliers: Investigate unusual cases that might be throwing off the model.
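As a minimal sketch of the first suggestion (illustrative only, using the same ames data as above), a quadratic term for first-floor square footage relaxes the straight-line assumption:
# Allow a curved relationship between first-floor area and price
poly_model <- lm(Sale_Price ~ poly(First_Flr_SF, 2) + Overall_Qual, data = ames)
# Compare the adjusted R-squared against the basic model's 0.7474
summary(poly_model)$adj.r.squared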
Diagnostic plots help identify issues like non-linearity and heteroscedasticity. The Q-Q plot checks for the normality of residuals. A high VIF score would indicate potential multicollinearity problems.
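For the multicollinearity point, a common check is car::vif(); this assumes the car package is installed, since it is not loaded anywhere in the original lab:
# Generalized variance inflation factors for the basic model; values far above
# roughly 5-10 would suggest problematic multicollinearity
car::vif(basic_model)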
Cross-validation is a method to assess the predictive performance of our models. Here's a simple example of k-fold cross-validation using the caret package in R, with k = 10:
library(caret)
## Warning: package 'caret' was built under R version 4.3.2
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
set.seed(250) # set seed for reproducibility
fold <- createFolds(ames$Sale_Price, k = 10)
cv_model <- lapply(fold, function(x) {
train_data <- ames[-x,]
test_data <- ames[x,]
model <- lm(Sale_Price ~ First_Flr_SF + Overall_Qual, data = train_data)
predictions <- predict(model, test_data)
data.frame(observed = test_data$Sale_Price, predicted = predictions)
})
# Calculate RMSE for each fold
rmse_values <- sapply(cv_model, function(x) {
sqrt(mean((x$observed - x$predicted)^2))
})
mean_rmse <- mean(rmse_values)
mean_rmse
## [1] 40242.65
An RMSE of 40,242.65 sounds like a big number, but its meaning depends on the scale of house prices in our data. If prices are generally high and vary a lot, this RMSE might be acceptable; if prices are lower on average, it could mean our model's predictions are often quite far off.
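To put that number in context, a simple supplementary check is to compare the cross-validated RMSE (mean_rmse from above) with the scale of the sale prices themselves:
# Cross-validated RMSE relative to typical sale prices
mean(ames$Sale_Price)              # average sale price
sd(ames$Sale_Price)                # spread of sale prices
mean_rmse / mean(ames$Sale_Price)  # RMSE as a fraction of the average price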
# Feature engineering: we can create new features that might have a significant
# impact on the sale price, e.g., the age of the house at the time of sale.
ames$House_Age = ames$Year_Sold - ames$Year_Built
# Re-running the linear model with the new feature
model_fe <- lm(Sale_Price ~ First_Flr_SF + Overall_Qual + House_Age, data = ames)
summary(model_fe)
##
## Call:
## lm(formula = Sale_Price ~ First_Flr_SF + Overall_Qual + House_Age,
## data = ames)
##
## Residuals:
## Min 1Q Median 3Q Max
## -418999 -20070 -2528 16311 293537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26883.322 19833.683 1.355 0.175383
## First_Flr_SF 50.230 2.206 22.773 < 2e-16 ***
## Overall_QualPoor 21050.203 22450.338 0.938 0.348510
## Overall_QualFair 37980.794 20584.489 1.845 0.065122 .
## Overall_QualBelow_Average 54800.153 19796.965 2.768 0.005674 **
## Overall_QualAverage 72859.236 19678.008 3.703 0.000217 ***
## Overall_QualAbove_Average 95365.858 19694.872 4.842 1.35e-06 ***
## Overall_QualGood 126142.598 19743.836 6.389 1.94e-10 ***
## Overall_QualVery_Good 174716.439 19835.997 8.808 < 2e-16 ***
## Overall_QualExcellent 254545.221 20152.114 12.631 < 2e-16 ***
## Overall_QualVery_Excellent 316435.193 21099.450 14.997 < 2e-16 ***
## House_Age -355.233 30.395 -11.687 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39250 on 2918 degrees of freedom
## Multiple R-squared: 0.7595, Adjusted R-squared: 0.7586
## F-statistic: 837.9 on 11 and 2918 DF, p-value: < 2.2e-16
What the Model Says:
First floor size (First_Flr_SF): Bigger first floors increase house prices; each extra square foot adds about $50 to the sale price.
House quality (Overall_Qual): Better-quality houses sell for more, but the 'Poor' category doesn't change the price much relative to the baseline, so it may not be a helpful factor on its own.
House age (House_Age): Older houses usually sell for less; each additional year of age reduces the price by about $355.
How Well the Model Works:
The model explains about 76% of the variation in house prices, which is pretty good. The F-statistic, which checks whether the model is useful overall, indicates that it is.
Residuals:
There’s still a big range between what the model predicts and the actual sale prices, which could mean there are some unusual cases in the data or other factors we haven’t considered.
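To make these coefficients concrete, here is a small supplementary sketch (not part of the original lab) that predicts the price of a hypothetical house; the input values (1,200 sq ft first floor, 'Good' quality, 20 years old) are made up for illustration.
# Hypothetical example house; the values below are illustrative assumptions
example_house <- data.frame(
  First_Flr_SF = 1200,
  Overall_Qual = factor("Good", levels = levels(ames$Overall_Qual)),
  House_Age = 20
)
# Point prediction plus a prediction interval from the feature-engineered model
predict(model_fe, newdata = example_house, interval = "prediction")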
# Outlier handling
# Calculate IQR
Q1 <- quantile(ames$Sale_Price, 0.25)
Q3 <- quantile(ames$Sale_Price, 0.75)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
# Identifying outliers
outliers <- which(ames$Sale_Price < lower_bound | ames$Sale_Price > upper_bound)
ames_outliers <- ames[outliers, ]
# Removing outliers from the dataset
ames_clean <- ames[-outliers, ]
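# Supplementary check (not in the original lab): ames_clean is created above but
# never used -- the robust model below is still fit on the full data set. One
# could refit the same specification on the trimmed data for comparison:
trimmed_model <- lm(Sale_Price ~ First_Flr_SF + Overall_Qual + House_Age, data = ames_clean)
summary(trimmed_model)$adj.r.squared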
# Load the MASS package for robust regression
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# Fitting a robust linear model
robust_model <- rlm(Sale_Price ~ First_Flr_SF + Overall_Qual + House_Age, data = ames)
summary(robust_model)
##
## Call: rlm(formula = Sale_Price ~ First_Flr_SF + Overall_Qual + House_Age,
## data = ames)
## Residuals:
## Min 1Q Median 3Q Max
## -428017 -17978 -688 18268 285277
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 28151.7611 15429.6130 1.8245
## First_Flr_SF 50.4378 1.7159 29.3942
## Overall_QualPoor 22136.9710 17465.2395 1.2675
## Overall_QualFair 38662.1737 16013.7025 2.4143
## Overall_QualBelow_Average 54122.5409 15401.0479 3.5142
## Overall_QualAverage 71988.5284 15308.5057 4.7025
## Overall_QualAbove_Average 92626.3216 15321.6251 6.0455
## Overall_QualGood 121643.2397 15359.7164 7.9196
## Overall_QualVery_Good 167604.2661 15431.4129 10.8612
## Overall_QualExcellent 244454.4524 15677.3366 15.5929
## Overall_QualVery_Excellent 323211.4670 16414.3166 19.6908
## House_Age -377.6847 23.6456 -15.9727
##
## Residual standard error: 26940 on 2918 degrees of freedom
Size of the first floor: Each extra square foot on the first floor increases the house price by about $50.44.
Quality of the house: Higher-quality houses sell for more; this pattern is clear and strong across quality levels.
Age of the house: Older houses generally sell for less; each additional year of age drops the price by around $377.68.
Model Fit:
Our model is doing a pretty decent job. The numbers (t values) show that the size, quality, and age are all important factors affecting house prices.
The reported residual standard error (26,940) is smaller than in our earlier model (39,250), suggesting a better fit, though rlm reports a robust scale estimate, so the two numbers are not directly comparable.
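Because of that, a more like-for-like check (added here as a supplementary sketch, not part of the original lab) is to compare the in-sample root mean squared residuals of the two fits:
# In-sample root mean squared residuals of the lm fit vs. the robust fit
sqrt(mean(residuals(model_fe)^2))
sqrt(mean(residuals(robust_model)^2))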
Handling Outliers:
Robust regression downweights unusual or extreme observations, so they have less influence on the fitted coefficients; the bulk of the residuals (the interquartile range) is slightly tighter than before, even though the most extreme residuals remain large.
Conclusion
Our robust regression model has improved our understanding of what affects house prices. It's particularly good at considering the unusual or extreme cases without letting them skew the results too much. This model gives us a reliable way to predict house prices based on size, quality, and age while dealing well with the variety in the data.