Model Critique

For this lab, you’ll be working with a group of other classmates, and each group will be assigned to critique a lab from a previous week.

Your group will have three goals:

Create an explicit business scenario which might leverage the data (and methods) used in the lab.
Critique the models (or analyses) present in the lab based on this scenario.
Devise a list of ethical and epistemological concerns that might pertain to this lab in the context of your business scenario.

Goal 1: Business Scenario

First, create your own context for the lab. This should be a business use-case such as “a real estate firm aims to present housing trends (and recommendations) for their clients in Ames, Iowa”.

You do not need to solve the problem; you only need to define it.

Your scenario should include the following:

Customer or Audience: who exactly will use your results?
Problem Statement: reference this article to help you write a SMART problem statement.
- E.g., the statement “we need to analyze sales data” is not a good problem summary/statement, but “for <this> reason, the company needs to know if they should stop selling product A …” is on a better track.
Scope: What variables from the data (in the lab) can address the issue presented in your problem statement? What analyses would you use? You’ll need to define any assumptions you feel need to be made before you move forward.
- If you feel the content in the lab cannot sufficiently address the issue, try to devise a more applicable problem statement.
Objective: Define your success criteria. In other words, suppose you started working on this problem in earnest; how will you know when you are done? For example, you might want to “identify the factors that most influence <some variable>.”
- Note: words like “identify”, “maximize”, “determine”, etc. could be useful here. Feel free to find the right action verbs that work for you!

Goal 1: Business Scenario – Logan Johnson

Business Use-Case: A real estate investment firm aims to identify undervalued single-family homes built after the year 2000 in Ames, Iowa, that have the potential for significant appreciation after minor renovations. They want to develop a model to predict sale prices based on property characteristics and use this model to flag properties currently listed below their predicted value.

Customer or Audience: Real estate investment analysts at the firm.

Problem Statement: For a real estate investment firm in Ames, Iowa, the challenge is to efficiently identify single-family homes built after 2000 that are currently undervalued relative to their intrinsic characteristics, hindering the firm’s ability to make profitable acquisition decisions and maximize return on investment.

SMART Goals:
- Specific: Identify undervalued single-family homes built after 2000 in Ames, Iowa.
- Measurable: Develop a model that predicts sale prices with a Mean Absolute Error (MAE) of less than $15,000.
- Achievable: Utilize publicly available data from the Ames Housing dataset, focusing on relevant features.
- Relevant: Directly addresses the firm’s need to identify profitable investment opportunities.
- Time-bound: Develop an initial model and identification strategy within one month.
Scope:
- Variables: Based on the lab, relevant variables could include first_flr_sf, year_built, year_remod_add, overall_qual, and potentially interactions between these. We might also consider other relevant features available in the full Ames Housing dataset, such as lot_area, total_bsmt_sf, gr_liv_area, garage_area, and neighborhood indicators.
- Analyses: We would start with linear regression modeling, potentially exploring polynomial terms or other transformations to capture non-linear relationships. We would also need to perform feature selection to identify the most impactful variables and assess multicollinearity. Model evaluation using metrics like R-squared, MAE, and RMSE would be crucial.
- Assumptions: We would assume that historical sale prices and property characteristics are indicative of future values, and that the relationships between these variables remain relatively stable over the short to medium term. We also assume the data is accurate and representative of the market segment of interest.
Objective: Determine a robust linear regression model that accurately predicts the sale price of single-family homes built after 2000 in Ames, Iowa, to identify properties listed below their predicted value by a statistically significant margin.
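
One possible way to make "below their predicted value by a statistically significant margin" operational is to compare each list price with the lower bound of a prediction interval. The sketch below assumes that reading. The two predictors, the 90% level, and the two listings (including their `list_price` values) are hypothetical placeholders, since the Ames data records sale prices rather than current list prices.

```r
library(AmesHousing)

ames <- make_ames()
homes <- subset(ames, Bldg_Type == "OneFam" & Year_Built >= 2000)

# Deliberately small model, for illustration only
fit <- lm(Sale_Price ~ First_Flr_SF + Lot_Area, data = homes)

# Hypothetical current listings (not part of the Ames data)
listings <- data.frame(
  First_Flr_SF = c(1200, 1600),
  Lot_Area     = c(9000, 11000),
  list_price   = c(150000, 260000)
)

# 90% prediction interval for the sale price of each listing
pred_int <- predict(fit, newdata = listings, interval = "prediction", level = 0.90)

listings$predicted <- pred_int[, "fit"]
listings$lower     <- pred_int[, "lwr"]

# Flag a listing only when its price falls below the interval's lower bound
listings$flag_undervalued <- listings$list_price < listings$lower
listings
```

Flagging against the lower bound rather than the point prediction means that a noisier model flags fewer homes, which fits the firm's need to avoid acting on weak signals.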

Goal 2: Model Critique

Since this is a class, and not a workplace, we need to be careful not to present too much information to you all at once. For this reason, our labs are often not as analytically rigorous or thorough as they might be in practice … So here, your goal is to:

Present a list of at least 3 (improved) analyses you would recommend for your business scenario. Each proposed analysis should be accompanied by a “proof of concept” R implementation. (As usual, execute R code blocks here in the RMarkdown file.)

In the lab your group has been assigned, consider issues with models, statistical improvements, interpretations, analyses, visualizations, etc. Use this notebook as a sandbox for trying out different code, and investigating the data from a different perspective. Take notes on all the issues you see, and propose your solutions (even if you might need to request more data or resources to accomplish those solutions).

You’ll want to consider the following:

Analytical issues, such as the current model assumptions.
Issues with the data itself.
Statistical improvements; what do we know now that we didn’t know (or at least didn’t use) then? Are there other methods that would be appropriate?
Are there better visualizations which could have been used?

Feel free to use the reading for the week associated with your assigned lab to help refresh your memory on the concepts presented.

Goal 2: Model Critique – Logan Johnson

Based on the business scenario above, here are at least three improved analyses I would recommend:

More Comprehensive Feature Engineering and Selection:
- Critique of Lab: The lab uses a limited set of features (first_flr_sf, great_qual, year_remod_add) and a simple transformation of overall_qual. It doesn’t explore other potentially important features or consider non-linear relationships.
- Improved Analysis: I would perform more extensive feature engineering using the full Ames Housing dataset. This could include:
  - Creating polynomial features for numerical variables like square footage to capture potential non-linear effects on price.
  - Encoding categorical variables (e.g., neighborhood, house style, material quality) using one-hot encoding or other appropriate methods.
  - Calculating new features such as the age of the house (current year - year_built) and the time since the last renovation (current year - year_remod_add).
  - Exploring interaction terms between a wider range of variables that might logically influence price together (e.g., neighborhood and square footage, quality and age).

library(dplyr)

library(caret)

library(AmesHousing)

# Load the Ames Housing dataset with correct naming
ames <- make_ames()

ames_fe_initial <- ames %>%
  filter(Bldg_Type == "OneFam", Year_Built >= 2000) %>%
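  # Ages are measured relative to 2025; the sales in this dataset date from 2006 to 2010,
  # so Year_Sold - Year_Built would give the age at the time of sale instead
  mutate(House_Age = 2025 - Year_Built,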
         Remod_Age = 2025 - Year_Remod_Add,
         Great_Qual = ifelse(Overall_Qual %in% c("Very_Excellent", "Excellent", "Very_Good"), 1, 0)) %>%
  select(Sale_Price, First_Flr_SF, House_Age, Remod_Age, Great_Qual, Lot_Area, Total_Bsmt_SF, Gr_Liv_Area, Garage_Area, Neighborhood, House_Style, Overall_Qual)

# Ensure 'Neighborhood' is a factor
ames_fe_initial$Neighborhood <- as.factor(ames_fe_initial$Neighborhood)

# Create dummy variables for Neighborhood (fullRank = TRUE drops the first level as the reference category)
dmy <- dummyVars(" ~ Neighborhood", data = ames_fe_initial, fullRank = TRUE)
ames_fe_encoded <- predict(dmy, newdata = ames_fe_initial) %>%
  as.data.frame() %>%
  bind_cols(ames_fe_initial %>% select(-Neighborhood))

# Example of polynomial feature
ames_fe_poly <- ames_fe_encoded %>%
  mutate(First_Flr_SF_Sq = First_Flr_SF^2)

# dummyVars() names the new columns 'Neighborhood.<level>' (e.g., 'Neighborhood.College_Creek')
# Check the column names to be sure before selecting one for the model
names(ames_fe_poly)

##  [1] "Neighborhood.College_Creek"                          
##  [2] "Neighborhood.Old_Town"                               
##  [3] "Neighborhood.Edwards"                                
##  [4] "Neighborhood.Somerset"                               
##  [5] "Neighborhood.Northridge_Heights"                     
##  [6] "Neighborhood.Gilbert"                                
##  [7] "Neighborhood.Sawyer"                                 
##  [8] "Neighborhood.Northwest_Ames"                         
##  [9] "Neighborhood.Sawyer_West"                            
## [10] "Neighborhood.Mitchell"                               
## [11] "Neighborhood.Brookside"                              
## [12] "Neighborhood.Crawford"                               
## [13] "Neighborhood.Iowa_DOT_and_Rail_Road"                 
## [14] "Neighborhood.Timberland"                             
## [15] "Neighborhood.Northridge"                             
## [16] "Neighborhood.Stone_Brook"                            
## [17] "Neighborhood.South_and_West_of_Iowa_State_University"
## [18] "Neighborhood.Clear_Creek"                            
## [19] "Neighborhood.Meadow_Village"                         
## [20] "Neighborhood.Briardale"                              
## [21] "Neighborhood.Bloomington_Heights"                    
## [22] "Neighborhood.Veenker"                                
## [23] "Neighborhood.Northpark_Villa"                        
## [24] "Neighborhood.Blueste"                                
## [25] "Neighborhood.Greens"                                 
## [26] "Neighborhood.Green_Hills"                            
## [27] "Neighborhood.Landmark"                               
## [28] "Neighborhood.Hayden_Lake"                            
## [29] "Sale_Price"                                          
## [30] "First_Flr_SF"                                        
## [31] "House_Age"                                           
## [32] "Remod_Age"                                           
## [33] "Great_Qual"                                          
## [34] "Lot_Area"                                            
## [35] "Total_Bsmt_SF"                                       
## [36] "Gr_Liv_Area"                                         
## [37] "Garage_Area"                                         
## [38] "House_Style"                                         
## [39] "Overall_Qual"                                        
## [40] "First_Flr_SF_Sq"

# Select one Neighborhood dummy for the model: prefer 'Neighborhood.College_Creek',
# otherwise fall back to the first Neighborhood dummy column (NULL if none exist)
neighborhood_cols <- grep("^Neighborhood", names(ames_fe_poly), value = TRUE)
neighborhood_col <- if ("Neighborhood.College_Creek" %in% neighborhood_cols) {
  "Neighborhood.College_Creek"
} else if (length(neighborhood_cols) > 0) {
  neighborhood_cols[1]
} else {
  NULL
}

# Model with engineered features (example - you'd likely do more feature selection)
formula_str <- paste("Sale_Price ~ First_Flr_SF + House_Age + Great_Qual + Lot_Area + Total_Bsmt_SF + First_Flr_SF_Sq", ifelse(!is.null(neighborhood_col), paste("+", neighborhood_col), ""))

model_fe <- lm(formula_str, data = ames_fe_poly)
summary(model_fe)

## 
## Call:
## lm(formula = formula_str, data = ames_fe_poly)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -277451  -36195   -2696   31835  259753 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 3.252e+04  2.692e+04   1.208  0.22752    
## First_Flr_SF                2.044e+02  2.557e+01   7.993 6.59e-15 ***
## House_Age                  -8.016e+02  1.048e+03  -0.765  0.44475    
## Great_Qual                  6.649e+04  5.872e+03  11.322  < 2e-16 ***
## Lot_Area                    4.204e+00  6.780e-01   6.201 1.03e-09 ***
## Total_Bsmt_SF              -2.811e+01  2.139e+01  -1.314  0.18943    
## First_Flr_SF_Sq            -3.495e-02  4.074e-03  -8.579  < 2e-16 ***
## Neighborhood.College_Creek -1.673e+04  5.711e+03  -2.929  0.00353 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 57990 on 612 degrees of freedom
## Multiple R-squared:  0.5822, Adjusted R-squared:  0.5774 
## F-statistic: 121.8 on 7 and 612 DF,  p-value: < 2.2e-16

# Further feature selection would be done using methods like stepwise regression,
# LASSO, or assessing feature importance from tree-based models.
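
As a rough illustration of the feature selection step mentioned in the comment above, backward stepwise selection by AIC can be run with base R's `step()`. The candidate predictor set below is an assumption chosen for illustration, not the set used in the lab, and the result should be read as a screening aid rather than a final model.

```r
library(AmesHousing)

ames <- make_ames()
homes <- subset(ames, Bldg_Type == "OneFam" & Year_Built >= 2000)

# Start from a larger candidate model (illustrative choice of predictors)
full_fit <- lm(
  Sale_Price ~ First_Flr_SF + I(First_Flr_SF^2) + Year_Built + Lot_Area +
    Total_Bsmt_SF + Gr_Liv_Area + Garage_Area,
  data = homes
)

# Backward elimination: drop terms one at a time while AIC keeps improving
step_fit <- step(full_fit, direction = "backward", trace = 0)

formula(step_fit)  # terms that survived selection
AIC(full_fit)
AIC(step_fit)
```

P-values reported for a model chosen this way tend to be optimistic, so any selected model should still be checked on held-out data.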

Rigorous Assessment of Model Assumptions:

**Critique of Lab:** The lab briefly mentions the assumptions of linear regression (linearity, constant variance, independence of errors, normality of errors) but doesn't explicitly check them using the data.

**Improved Analysis:** I would thoroughly assess these assumptions for any developed model:

-   **Linearity:** Examine scatter plots of residuals vs. predicted values and partial residual plots. Consider transformations of variables if non-linear patterns are evident.

-   **Homoscedasticity:** Use plots of residuals vs. predicted values to check for funnel shapes (indicating non-constant variance). Employ statistical tests like the Breusch-Pagan test. Consider weighted least squares if heteroscedasticity is a problem.

-   **Independence of Errors:** This is harder to test directly with cross-sectional data but should be considered in the context of the data collection process. For example, sales in the same neighborhood at around the same time may not be truly independent.

-   **Normality of Errors:** Examine histograms, Q-Q plots of the residuals, and perform statistical tests like the Shapiro-Wilk test. Transformations of the response variable (e.g., log transformation as hinted in the Week 10-11 example) might be necessary if the errors are not normally distributed.

library(lmtest)

# For this proof of concept, I'll use the 'model_fe' created above.
# In a real scenario, you would build a model and then assess its assumptions.

residuals_fe <- residuals(model_fe)
fitted_values_fe <- fitted(model_fe)

# Check linearity and homoscedasticity
plot(fitted_values_fe, residuals_fe, main = "Residuals vs. Fitted Values (Engineered Features)")
abline(h = 0, lty = 2)

# Check normality of residuals
hist(residuals_fe, main = "Histogram of Residuals (Engineered Features)")

qqnorm(residuals_fe, main = "Q-Q Plot of Residuals (Engineered Features)")
qqline(residuals_fe)

# Breusch-Pagan test for homoscedasticity
bptest(model_fe)

## 
##  studentized Breusch-Pagan test
## 
## data:  model_fe
## BP = 127.93, df = 7, p-value < 2.2e-16

# Note: This section focuses on assessing assumptions of an *existing* model.
# The 'lm()' call happened in the previous section.
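
The normality bullet above also mentions the Shapiro-Wilk test and a log transformation of the response, neither of which appears in the proof of concept. A minimal sketch of both follows, assuming a small illustrative set of predictors rather than the exact `model_fe` specification.

```r
library(AmesHousing)

ames <- make_ames()
homes <- subset(ames, Bldg_Type == "OneFam" & Year_Built >= 2000)

# Same predictors, raw vs. log-transformed response
fit_raw <- lm(Sale_Price ~ First_Flr_SF + Lot_Area + Total_Bsmt_SF, data = homes)
fit_log <- lm(log(Sale_Price) ~ First_Flr_SF + Lot_Area + Total_Bsmt_SF, data = homes)

# Shapiro-Wilk test on the residuals of each fit (W closer to 1 means closer to normal)
sw_raw <- shapiro.test(residuals(fit_raw))
sw_log <- shapiro.test(residuals(fit_log))

data.frame(
  response = c("Sale_Price", "log(Sale_Price)"),
  W        = unname(c(sw_raw$statistic, sw_log$statistic)),
  p_value  = c(sw_raw$p.value, sw_log$p.value)
)
```

With several hundred observations the test can reject normality for departures that are too small to matter in practice, so it is best read alongside the Q-Q plot rather than on its own.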

Model Evaluation and Validation:

Critique of Lab: The lab focuses on interpreting coefficients but doesn’t discuss model evaluation or validation to assess its predictive performance on unseen data.

Improved Analysis: I would split the data into training and testing sets to evaluate the model’s generalization ability. I would use various evaluation metrics relevant to the business problem, such as:
- Mean Absolute Error (MAE): The average absolute difference between predicted and actual prices (directly interpretable in dollars).
- Root Mean Squared Error (RMSE): Similar to MAE but penalizes larger errors more heavily.
- R-squared: The proportion of variance in the sale price explained by the model.

library(dplyr)
library(caret)
library(AmesHousing)

# Load the Ames Housing dataset with correct naming
ames <- make_ames()

ames_fe <- ames %>%
  filter(Bldg_Type == "OneFam", Year_Built >= 2000) %>%
  mutate(House_Age = 2025 - Year_Built,
         Remod_Age = 2025 - Year_Remod_Add,
         # Ensure Great_Qual is treated as numeric
         Great_Qual = as.numeric(ifelse(Overall_Qual %in% c("Very_Excellent", "Excellent", "Very_Good"), 1, 0))) %>%
  select(Sale_Price, First_Flr_SF, House_Age, Remod_Age, Great_Qual, Lot_Area, Total_Bsmt_SF, Gr_Liv_Area, Garage_Area, House_Style, Overall_Qual)

# Example of polynomial feature
ames_fe_poly <- ames_fe %>%
  mutate(First_Flr_SF_Sq = First_Flr_SF^2)

# Split data into training and testing sets
set.seed(123)
train_index_val <- createDataPartition(ames_fe_poly$Sale_Price, p = 0.8, list = FALSE)
train_data_val <- ames_fe_poly[train_index_val, ]
test_data_val <- ames_fe_poly[-train_index_val, ]

# Define the formula string without Neighborhood
predictor_cols <- c("First_Flr_SF", "House_Age", "Great_Qual", "Lot_Area", "Total_Bsmt_SF", "First_Flr_SF_Sq")
formula_str <- paste("Sale_Price ~", paste(predictor_cols, collapse = " + "))

model_validated_fe <- lm(formula_str, data = train_data_val) # Train on training data
summary(model_validated_fe)

## 
## Call:
## lm(formula = formula_str, data = train_data_val)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -309316  -33590    -941   30786  256897 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.562e+04  3.000e+04   1.521    0.129    
## First_Flr_SF     1.306e+02  3.297e+01   3.960 8.60e-05 ***
## House_Age       -1.062e+03  1.169e+03  -0.908    0.364    
## Great_Qual       6.922e+04  6.300e+03  10.987  < 2e-16 ***
## Lot_Area         4.940e+00  7.157e-01   6.901 1.60e-11 ***
## Total_Bsmt_SF    1.870e+01  2.696e+01   0.693    0.488    
## First_Flr_SF_Sq -2.722e-02  4.526e-03  -6.015 3.53e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 57080 on 490 degrees of freedom
## Multiple R-squared:  0.5976, Adjusted R-squared:  0.5927 
## F-statistic: 121.3 on 6 and 490 DF,  p-value: < 2.2e-16

# Make predictions on the test data
predictions_fe <- predict(model_validated_fe, newdata = test_data_val)

# Evaluate model performance
mae_fe <- mean(abs(test_data_val$Sale_Price - predictions_fe))
rmse_fe <- sqrt(mean((test_data_val$Sale_Price - predictions_fe)^2))
# Note: this is the R-squared of the training fit, not a held-out (test-set) value
r_squared_fe <- summary(model_validated_fe)$r.squared

cat("MAE (Engineered Features):", mae_fe, "\n")

## MAE (Engineered Features): 46707.26

cat("RMSE (Engineered Features):", rmse_fe, "\n")

## RMSE (Engineered Features): 66841.56

cat("Training R-squared (Engineered Features):", r_squared_fe, "\n")

## Training R-squared (Engineered Features): 0.5975808

# Cross-validation on the training data
train_control_fe <- trainControl(method = "cv", number = 5)
cv_model_fe <- train(Sale_Price ~ First_Flr_SF + House_Age + Great_Qual + Lot_Area + Total_Bsmt_SF + First_Flr_SF_Sq,
                    data = train_data_val,
                    method = "lm",
                    trControl = train_control_fe)
print(cv_model_fe$results)

##   intercept     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1      TRUE 67162.24 0.5184409 42898.68 25087.06  0.1931067 5623.542
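
The R-squared printed above comes from `summary()` of the model fitted on the training data, so it does not measure performance on unseen homes. One way to compute all three metrics on held-out data is a small helper like the one below. The name `eval_metrics` and the four toy prices are hypothetical, used only to make the arithmetic easy to check.

```r
# Compute MAE, RMSE, and R-squared from actual and predicted values
eval_metrics <- function(actual, predicted) {
  err <- actual - predicted
  c(
    MAE  = mean(abs(err)),
    RMSE = sqrt(mean(err^2)),
    # 1 - SSE/SST, with SST taken around the mean of the actual values supplied
    R2   = 1 - sum(err^2) / sum((actual - mean(actual))^2)
  )
}

# Toy example with hypothetical prices
actual    <- c(200000, 250000, 300000, 350000)
predicted <- c(210000, 240000, 310000, 330000)

m <- eval_metrics(actual, predicted)
m
# MAE = 12500, RMSE is about 13228.76, R2 = 0.944
```

With the objects created earlier in this section, the call would be `eval_metrics(test_data_val$Sale_Price, predictions_fe)`.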

Goal 3: Ethical and Epistemological Concerns

Review the materials from the Week 5 lesson on Ethics and Epistemology. This includes lecture slides, the lecture video, or the reading. You should also consider doing supplementary research on the topic at hand (e.g., news outlets, historical articles, etc.). Some issues you might want to consider include:

Overcoming biases (existing or potential).
Possible risks or societal implications.
Crucial issues which might not be measurable.
Who would be affected by this project, and how does that affect your critique?

Overcoming Biases: Our model risks inheriting biases from historical sales data (e.g., appraisal practices) and our modeling choices (e.g., assuming linear relationships, selecting specific variables). This could lead to unfair property valuations. To mitigate this, we must critically evaluate the data sources, explore diverse modeling techniques, and regularly audit the model’s predictions for potential disparities.

Possible Risks/Societal Implications: The firm’s use of this model could inadvertently influence the local housing market, potentially leading to price distortions or creating an informational advantage over individual buyers and sellers. We need to be mindful of these broader societal implications and consider strategies for transparency about our modeling approach.

Unmeasurable Issues: Factors like a property’s unique aesthetic appeal or unforeseen future market changes are difficult to quantify and include in our model. Relying solely on measurable data risks overlooking these crucial aspects. Therefore, integrating qualitative assessments and local market expertise alongside the model’s predictions is essential.

Affected Parties & Critique Impact: This project affects the investment firm, sellers, potential homebuyers, existing homeowners, and the wider community. Recognizing these stakeholders underscores the importance of ethical considerations in our model development and application. Our critique emphasizes the need for accuracy, fairness, transparency about limitations, and rigorous validation to ensure responsible use.
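
The audit for disparities suggested under Overcoming Biases could start with something as simple as comparing average residuals across groups. The sketch below groups by `Neighborhood` only because it is a grouping available in the data; treating it as a stand-in for the communities affected is an assumption, and the two-predictor model is illustrative.

```r
library(AmesHousing)

ames <- make_ames()
homes <- droplevels(subset(ames, Bldg_Type == "OneFam" & Year_Built >= 2000))

fit <- lm(Sale_Price ~ First_Flr_SF + Lot_Area, data = homes)

# Mean residual per neighborhood: positive means the model tends to
# under-predict sale prices there, negative means it tends to over-predict
mean_resid <- tapply(residuals(fit), homes$Neighborhood, mean)
n_sales    <- table(homes$Neighborhood)

audit <- data.frame(
  neighborhood = names(mean_resid),
  n            = as.integer(n_sales),
  mean_resid   = as.numeric(mean_resid)
)

# Sort so the most over-predicted neighborhoods appear first
audit[order(audit$mean_resid), ]
```

Neighborhoods with few sales will show noisy averages, so the `n` column should be read together with `mean_resid` before drawing any conclusion.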

Example

For example, in Week 10-11, we used the year built, square footage, elevation, and the number of bedrooms to determine the price of an apartment. A few questions you might ask are:

Is this a “good” selection of variables? What could we be missing, or are there potential biases inherent in the groups of apartments here?
Nowhere in the lab do we investigate the assumptions of a linear model. Is the relationship between the response (i.e., $\log(\text{price})$) and each of these variables linear? Are the error terms evenly distributed?
Is it possible that our conclusions are more appropriate for some group(s) of the data and not others?
What if assumptions are not met? What could happen to this model if it were deployed on a platform like Zillow?
Consider different evaluation metrics between models. What is a practical use for these values?

Share your model critique in this notebook as your data dive submission for the week. Make sure to include your own R code which executes suggested routines.