For this lab, you’ll be working with a group of other classmates, and each group will be assigned a lab from a previous week. Your goal is to critique the models (or analyses) present in the lab.

First, review the materials from the Lesson on Ethics and Epistemology (week 5?). These include the lecture slides, the lecture video, and the reading, and you can use them as reference materials for this lab. You may also consider the reading for the week associated with your assigned lab, or supplementary research on the topic at hand (e.g., news outlets, historical articles, etc.).

For the lab your group has been assigned, consider issues with models, interpretations, analyses, visualizations, etc. Use this notebook as a sandbox for trying out different code, and investigating the data from a different perspective. Take notes on all the issues you see, and possible solutions (even if you would need to request more data or resources to accomplish those solutions).

Share your model critique in this notebook as your data dive submission for the week.

As a start, think about the context of the lab and consider the following:

  • Analytical issues, such as model assumptions

  • Overcoming biases (existing or potential)

  • Possible risks or societal implications

  • Crucial issues which might not be measurable

Treat this exercise as if the analyses in your assigned lab (i.e., the one you are critiquing) were to be published, made available to the public in a press release, or used at some large company (e.g., for mpg data, imagine if Toyota used the conclusions to drive strategic decisions).

# your code here

If you were unable to attend class, select a notes_*.Rmd file from a previous week (not including weeks 1 or 3), and complete the analysis above. Share your critique below.

Example

For example, in Week 11, we used the year built, square footage, elevation, and the number of bedrooms to determine the price of an apartment.

  • Is this a “good” selection of variables? What could we be missing, or are there potential biases inherent in the groups of apartments here?
  • Nowhere in the lab do we investigate the assumptions of a linear model. Is the relationship between the response (i.e., \(\log(\text{price})\)) and each of these variables linear? Are the error terms evenly distributed?
  • Is it possible that our conclusions are more appropriate for some group(s) of the data and not others?
  • What if assumptions are not met? What could happen to this model if it were deployed on a platform like Zillow?
  • Consider different evaluation metrics between models. What is a practical use for these values?

We will critique the linear regression model from the Week 8 lab, which uses the Ames Housing dataset. We aim to understand how various factors, such as first-floor square footage and quality ratings, influence house sale prices.

Model Summaries

We examine two main types of linear regression models: basic linear models and models with interaction terms.

# Load necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)

# Load the Ames Housing dataset
ames <- AmesHousing::make_ames()

# Basic Linear Model
basic_model <- lm(Sale_Price ~ First_Flr_SF + Overall_Qual, data = ames)
summary(basic_model)
## 
## Call:
## lm(formula = Sale_Price ~ First_Flr_SF + Overall_Qual, data = ames)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -419854  -21359   -1972   18099  292059 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                2.295e+03  2.017e+04   0.114 0.909426    
## First_Flr_SF               5.201e+01  2.251e+00  23.105  < 2e-16 ***
## Overall_QualPoor           1.787e+04  2.296e+04   0.778 0.436549    
## Overall_QualFair           3.390e+04  2.105e+04   1.610 0.107443    
## Overall_QualBelow_Average  5.579e+04  2.025e+04   2.755 0.005905 ** 
## Overall_QualAverage        7.683e+04  2.013e+04   3.818 0.000138 ***
## Overall_QualAbove_Average  1.039e+05  2.013e+04   5.160 2.64e-07 ***
## Overall_QualGood           1.418e+05  2.015e+04   7.037 2.44e-12 ***
## Overall_QualVery_Good      1.930e+05  2.023e+04   9.542  < 2e-16 ***
## Overall_QualExcellent      2.750e+05  2.054e+04  13.389  < 2e-16 ***
## Overall_QualVery_Excellent 3.335e+05  2.153e+04  15.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40150 on 2919 degrees of freedom
## Multiple R-squared:  0.7483, Adjusted R-squared:  0.7474 
## F-statistic: 867.7 on 10 and 2919 DF,  p-value: < 2.2e-16
# Interaction
interaction_model <- lm(Sale_Price ~ Year_Remod_Add + Overall_Qual + Year_Remod_Add:Overall_Qual, data = ames)
summary(interaction_model)
## 
## Call:
## lm(formula = Sale_Price ~ Year_Remod_Add + Overall_Qual + Year_Remod_Add:Overall_Qual, 
##     data = ames)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -236778  -22961   -2613   17809  264225 
## 
## Coefficients:
##                                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)                                27341491   49301533   0.555    0.579
## Year_Remod_Add                               -13991      25273  -0.554    0.580
## Overall_QualPoor                          -22745732   49481273  -0.460    0.646
## Overall_QualFair                          -27782545   49305417  -0.563    0.573
## Overall_QualBelow_Average                 -28039143   49302363  -0.569    0.570
## Overall_QualAverage                       -27971042   49301753  -0.567    0.571
## Overall_QualAbove_Average                 -28425742   49301784  -0.577    0.564
## Overall_QualGood                          -28365659   49302152  -0.575    0.565
## Overall_QualVery_Good                     -29937368   49304493  -0.607    0.544
## Overall_QualExcellent                     -34815347   49335337  -0.706    0.480
## Overall_QualVery_Excellent                  5040490   49419699   0.102    0.919
## Year_Remod_Add:Overall_QualPoor               11663      25365   0.460    0.646
## Year_Remod_Add:Overall_QualFair               14258      25275   0.564    0.573
## Year_Remod_Add:Overall_QualBelow_Average      14400      25274   0.570    0.569
## Year_Remod_Add:Overall_QualAverage            14378      25273   0.569    0.569
## Year_Remod_Add:Overall_QualAbove_Average      14620      25273   0.578    0.563
## Year_Remod_Add:Overall_QualGood               14607      25273   0.578    0.563
## Year_Remod_Add:Overall_QualVery_Good          15424      25275   0.610    0.542
## Year_Remod_Add:Overall_QualExcellent          17902      25290   0.708    0.479
## Year_Remod_Add:Overall_QualVery_Excellent     -1938      25331  -0.077    0.939
## 
## Residual standard error: 41910 on 2910 degrees of freedom
## Multiple R-squared:  0.7266, Adjusted R-squared:  0.7248 
## F-statistic: 406.9 on 19 and 2910 DF,  p-value: < 2.2e-16
# Residuals vs Fitted for Basic Model
plot(basic_model$fitted.values, residuals(basic_model), xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red")

# Normal Q-Q for Basic Model
qqnorm(residuals(basic_model))
qqline(residuals(basic_model), col = "red")

# Scale-Location for Basic Model
plot(basic_model$fitted.values, sqrt(abs(residuals(basic_model))), xlab = "Fitted Values", ylab = "Sqrt(|Residuals|)")
abline(h = 0, col = "red")

# Residuals vs Leverage for Basic Model
plot(hatvalues(basic_model), residuals(basic_model), xlab = "Leverage", ylab = "Residuals")
abline(h = 0, col = "red")

Basic Linear Model Summary

What the Model Shows: The model estimates how the first-floor square footage and the overall quality of a house (Poor, Fair, Average, and so on) affect its sale price.

Key Findings:

Quality Impact: The quality of a house has a clear effect on its price. Higher quality generally means a higher price, except for the ‘Poor’ and ‘Fair’ categories, whose coefficients are not statistically significant.

Residuals: The residuals span a very wide range (roughly -420,000 to +292,000). This may mean there are outliers, or that the model is missing important factors.

Concerns: The model assumes a simple straight-line effect of square footage on price, so it may not capture the true relationship; real-life pricing is likely more complex.

The wide spread of the residuals also suggests the model is oversimplified or missing key details.

Overall Fit: The model explains about 75% of the variation in house prices (adjusted R-squared of 0.747), which is reasonably good but far from perfect.
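As a quick check on the straight-line concern above, here is a minimal sketch (not part of the original lab) that refits the same predictors with a log-transformed response, which often tames a residual spread this wide:

# Sketch (assumption: a multiplicative model is plausible for prices):
# same predictors, log-transformed response
log_model <- lm(log(Sale_Price) ~ First_Flr_SF + Overall_Qual, data = ames)

# Residuals vs fitted on the log scale, for comparison with the raw-scale plot above
plot(log_model$fitted.values, residuals(log_model),
     xlab = "Fitted Values (log scale)", ylab = "Residuals")
abline(h = 0, col = "red")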

Interaction Model Summary

What the Model Shows: This model tries to understand how the year of remodeling and the overall quality together influence the sale price.

Key Findings:

Interaction Terms: The model includes terms that combine remodeling year and quality, but none of them shows a statistically significant effect on price.

Residuals: Like the basic model, there’s a big range in residuals here too.

Concerns:

None of the interaction terms is statistically significant, suggesting that the effect of remodeling year on price does not differ meaningfully across quality levels. The enormous coefficients and standard errors on the main effects are also a warning sign: interacting quality with the raw (uncentered) year introduces heavy collinearity, and centering Year_Remod_Add would make these estimates far easier to interpret.

Similar to the basic model, this one may also fail to capture the full complexity of real-life pricing.

Overall Fit: This model explains about 72% of the variation in house prices (adjusted R-squared of 0.725), slightly worse than the basic model despite being more complex.
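A quick way to test this formally is a nested-model F-test; the sketch below (not part of the original lab) compares the interaction model against a main-effects-only version:

# Sketch: do the interaction terms add anything beyond the main effects?
main_effects_model <- lm(Sale_Price ~ Year_Remod_Add + Overall_Qual, data = ames)
anova(main_effects_model, interaction_model)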

Improvements:

  • Adding more flexibility: if the relationship between square footage and price isn’t a straight line, consider adding polynomial terms, such as a squared square-footage term (sketched below).
  • Managing overfitting: if the model is too closely tailored to this particular dataset (overfitting), regularization techniques such as ridge or lasso regression can help.
  • Looking at outliers: unusual cases may be throwing off the model.

Diagnostic plots help identify issues like non-linearity and heteroscedasticity. The Q-Q plot checks for the normality of residuals. A high VIF score would indicate potential multicollinearity problems.
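Two of these checks are easy to sketch (assuming the car package is installed; it is not loaded in the original lab):

# Sketch: quadratic term for first-floor square footage, in case the effect is curved
quad_model <- lm(Sale_Price ~ poly(First_Flr_SF, 2) + Overall_Qual, data = ames)
summary(quad_model)$adj.r.squared

# Sketch: generalized variance inflation factors for the basic model;
# values well above ~5 would flag multicollinearity
car::vif(basic_model)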

Cross-Validation

Cross-validation is a method to assess the predictive performance of a model on data it was not fit to. Below is a simple example of 10-fold cross-validation, using createFolds() from the caret package to split the data and refitting the basic model on each training fold:

library(caret)
## Warning: package 'caret' was built under R version 4.3.2
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
set.seed(250)  # seed set for reproducibility
fold <- createFolds(ames$Sale_Price, k = 10)
cv_model <- lapply(fold, function(x) {
  train_data <- ames[-x,]
  test_data <- ames[x,]
  
  model <- lm(Sale_Price ~ First_Flr_SF + Overall_Qual, data = train_data)
  predictions <- predict(model, test_data)
  data.frame(observed = test_data$Sale_Price, predicted = predictions)
})

# Calculate RMSE for each fold
rmse_values <- sapply(cv_model, function(x) {
  sqrt(mean((x$observed - x$predicted)^2))
})
mean_rmse <- mean(rmse_values)
mean_rmse
## [1] 40242.65

An RMSE of about $40,000 sounds large, but it has to be judged relative to the sale prices in the data. If prices are high and vary widely, an error of this size may be acceptable; if typical prices are low, predictions this far off would be a real problem.
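A quick sketch for putting the error in context is to scale it by a typical sale price:

# Sketch: compare the cross-validated RMSE to typical sale prices
mean(ames$Sale_Price)
median(ames$Sale_Price)
mean_rmse / mean(ames$Sale_Price)  # RMSE as a fraction of the average price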

Feature Engineering

We can create new features that might have a significant impact on the sale price, for example a feature representing the age of the house at the time of sale:

ames$House_Age = ames$Year_Sold - ames$Year_Built

# Re-running the linear model with the new feature
model_fe <- lm(Sale_Price ~ First_Flr_SF + Overall_Qual + House_Age, data = ames)
summary(model_fe)
## 
## Call:
## lm(formula = Sale_Price ~ First_Flr_SF + Overall_Qual + House_Age, 
##     data = ames)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -418999  -20070   -2528   16311  293537 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 26883.322  19833.683   1.355 0.175383    
## First_Flr_SF                   50.230      2.206  22.773  < 2e-16 ***
## Overall_QualPoor            21050.203  22450.338   0.938 0.348510    
## Overall_QualFair            37980.794  20584.489   1.845 0.065122 .  
## Overall_QualBelow_Average   54800.153  19796.965   2.768 0.005674 ** 
## Overall_QualAverage         72859.236  19678.008   3.703 0.000217 ***
## Overall_QualAbove_Average   95365.858  19694.872   4.842 1.35e-06 ***
## Overall_QualGood           126142.598  19743.836   6.389 1.94e-10 ***
## Overall_QualVery_Good      174716.439  19835.997   8.808  < 2e-16 ***
## Overall_QualExcellent      254545.221  20152.114  12.631  < 2e-16 ***
## Overall_QualVery_Excellent 316435.193  21099.450  14.997  < 2e-16 ***
## House_Age                    -355.233     30.395 -11.687  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39250 on 2918 degrees of freedom
## Multiple R-squared:  0.7595, Adjusted R-squared:  0.7586 
## F-statistic: 837.9 on 11 and 2918 DF,  p-value: < 2.2e-16

What the Model Says:

  • First floor size (First_Flr_SF): bigger first floors increase house prices; each extra square foot adds about $50 to the sale price.
  • House quality (Overall_Qual): better-quality houses sell for more, but the ‘Poor’ category doesn’t significantly change the price, so it might not be a helpful factor to consider.
  • House age (House_Age): older houses usually sell for less; each additional year of age drops the predicted price by about $355.

How Well the Model Works:

The model explains about 76% of the variation in house prices (adjusted R-squared of 0.759), and the overall F-test indicates the model is statistically useful.

Residuals:

There’s still a big range between what the model predicts and the actual sale prices, which could mean there are some unusual cases in the data or other factors we haven’t considered.
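To make the coefficient interpretation above concrete, here is a small sketch (the two houses are hypothetical) predicting prices for otherwise identical houses of different ages:

# Sketch: two hypothetical houses that differ only in age, illustrating the
# roughly $355-per-year House_Age effect
new_houses <- data.frame(
  First_Flr_SF = c(1200, 1200),
  Overall_Qual = factor(c("Good", "Good"), levels = levels(ames$Overall_Qual)),
  House_Age    = c(10, 40)
)
predict(model_fe, newdata = new_houses)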

Outlier Handling

# Calculate IQR
Q1 <- quantile(ames$Sale_Price, 0.25)
Q3 <- quantile(ames$Sale_Price, 0.75)
iqr <- Q3 - Q1  # avoid masking the base IQR() function
lower_bound <- Q1 - 1.5 * iqr
upper_bound <- Q3 + 1.5 * iqr

# Identify outliers (more than 1.5 * IQR beyond the quartiles)
outliers <- which(ames$Sale_Price < lower_bound | ames$Sale_Price > upper_bound)

ames_outliers <- ames[outliers, ]

# Removing outliers from the dataset
ames_clean <- ames[-outliers, ]
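# Sketch (not in the original lab): ames_clean is not otherwise used below, so as a
# quick check, refit the feature-engineered model without the IQR outliers and
# compare its fit to model_fe above
model_clean <- lm(Sale_Price ~ First_Flr_SF + Overall_Qual + House_Age, data = ames_clean)
summary(model_clean)$adj.r.squared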
# Load the MASS package (provides rlm for robust regression)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# Fitting a robust linear model
robust_model <- rlm(Sale_Price ~ First_Flr_SF + Overall_Qual + House_Age, data = ames)
summary(robust_model)
## 
## Call: rlm(formula = Sale_Price ~ First_Flr_SF + Overall_Qual + House_Age, 
##     data = ames)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -428017  -17978    -688   18268  285277 
## 
## Coefficients:
##                            Value       Std. Error  t value    
## (Intercept)                 28151.7611  15429.6130      1.8245
## First_Flr_SF                   50.4378      1.7159     29.3942
## Overall_QualPoor            22136.9710  17465.2395      1.2675
## Overall_QualFair            38662.1737  16013.7025      2.4143
## Overall_QualBelow_Average   54122.5409  15401.0479      3.5142
## Overall_QualAverage         71988.5284  15308.5057      4.7025
## Overall_QualAbove_Average   92626.3216  15321.6251      6.0455
## Overall_QualGood           121643.2397  15359.7164      7.9196
## Overall_QualVery_Good      167604.2661  15431.4129     10.8612
## Overall_QualExcellent      244454.4524  15677.3366     15.5929
## Overall_QualVery_Excellent 323211.4670  16414.3166     19.6908
## House_Age                    -377.6847     23.6456    -15.9727
## 
## Residual standard error: 26940 on 2918 degrees of freedom

  • Size of the first floor: each extra square foot on the first floor increases the predicted price by about $50.44.
  • Quality of the house: higher-quality houses sell for more, and this pattern is clear and strong across quality levels.
  • Age of the house: older houses generally sell for less; each additional year of age drops the predicted price by around $377.68.

Model Fit:

The model does a reasonable job: the t values show that size, quality, and age are all strong predictors of house prices.

The reported residual standard error (about 26,940, versus 39,250 earlier) is smaller, but the two numbers are not directly comparable: rlm reports a robust scale estimate and downweights extreme observations rather than fitting them.
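A fairer check is to compare the in-sample prediction error of the two fits directly, keeping in mind that least squares minimizes this quantity by construction, so the point of rlm is resistance to outliers rather than a smaller RMSE. A sketch:

# Sketch: in-sample RMSE of the least-squares and robust fits on the same data
sqrt(mean(residuals(model_fe)^2))
sqrt(mean(residuals(robust_model)^2))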

Handling Outliers:

Robust regression downweights unusual or extreme sales rather than letting them dominate the fit, so the bulk of the residuals is slightly tighter even though the most extreme residuals remain large.
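One way to see this concretely is to inspect the IRLS weights that rlm assigned; heavily downweighted observations are the sales the robust fit treated as outliers. A sketch:

# Sketch: the ten observations rlm downweighted the most
low_w <- order(robust_model$w)[1:10]
data.frame(weight = round(robust_model$w[low_w], 2),
           Sale_Price = ames$Sale_Price[low_w],
           First_Flr_SF = ames$First_Flr_SF[low_w])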

Conclusion

Our robust regression model has improved our understanding of what affects house prices. It is particularly good at accommodating unusual or extreme cases without letting them skew the results too much. This model gives us a reliable way to predict house prices based on size, quality, and age while dealing well with the variety in the data.