Model Critique

For this lab, you’ll be working with a group of classmates, and each group will be assigned a lab from a previous week to critique.

Your group will have three goals:

Create an explicit business scenario which might leverage the data (and methods) used in the lab.
Critique the models (or analyses) present in the lab based on this scenario.
Devise a list of ethical and epistemological concerns that might pertain to this lab in the context of your business scenario.

Goal 1: Business Scenario

First, create your own context for the lab. This should be a business use-case such as “a real estate firm aims to present housing trends (and recommendations) for their clients in Ames, Iowa”.

You do not need to solve the problem; you only need to define it.

Your scenario should include the following:

Customer or Audience: who exactly will use your results?
Problem Statement: reference this article to help you write a SMART problem statement.
- E.g., the statement “we need to analyze sales data” is not a good problem summary/statement, but “for <this> reason, the company needs to know if they should stop selling product A …” is on a better track.
Scope: What variables from the data (in the lab) can address the issue presented in your problem statement? What analyses would you use? You’ll need to define any assumptions you feel need to be made before you move forward.
- If you feel the content in the lab cannot sufficiently address the issue, try to devise a more applicable problem statement.
Objective: Define your success criteria. In other words, suppose you started working on this problem in earnest; how will you know when you are done? For example, you might want to “identify the factors that most influence <some variable>.”
- Note: words like “identify”, “maximize”, “determine”, etc. could be useful here. Feel free to find the right action verbs that work for you!

Goal 2: Model Critique

Since this is a class, and not a workplace, we need to be careful not to present too much information to you all at once. For this reason, our labs are often not as analytically rigorous or thorough as they might be in practice … So here, your goal is to:

Present a list of at least 3 (improved) analyses you would recommend for your business scenario. Each proposed analysis should be accompanied by a “proof of concept” R implementation. (As usual, execute R code blocks here in the RMarkdown file.)

In the lab your group has been assigned, consider issues with models, statistical improvements, interpretations, analyses, visualizations, etc. Use this notebook as a sandbox for trying out different code, and investigating the data from a different perspective. Take notes on all the issues you see, and propose your solutions (even if you might need to request more data or resources to accomplish those solutions).

You’ll want to consider the following:

Analytical issues, such as the current model assumptions.
Issues with the data itself.
Statistical improvements; what do we know now that we didn’t know (or at least didn’t use) then? Are there other methods that would be appropriate?
Are there better visualizations which could have been used?

Feel free to use the reading for the week associated with your assigned lab to help refresh your memory on the concepts presented.

Goal 3: Ethical and Epistemological Concerns

Review the materials from the Week 5 lesson on Ethics and Epistemology. This includes lecture slides, the lecture video, or the reading. You should also consider doing supplementary research on the topic at hand (e.g., news outlets, historical articles, etc.). Some issues you might want to consider include:

Overcoming biases (existing or potential).
Possible risks or societal implications.
Crucial issues which might not be measurable.
Who would be affected by this project, and how does that affect your critique?

Example

For example, in Week 10-11, we used the year built, square footage, elevation, and the number of bedrooms to determine the price of an apartment. A few questions you might ask are:

Is this a “good” selection of variables? What could we be missing, or are there potential biases inherent in the groups of apartments here?
Nowhere in the lab do we investigate the assumptions of a linear model. Is the relationship between the response (i.e., \(\log(\text{price})\)) and each of these variables linear? Are the error terms normally distributed with constant variance?
Is it possible that our conclusions are more appropriate for some group(s) of the data and not others?
What if assumptions are not met? What could happen to this model if it were deployed on a platform like Zillow?
Consider different evaluation metrics between models. What is a practical use for these values?

Share your model critique in this notebook as your data dive submission for the week. Make sure to include your own R code which executes suggested routines.

# Model Critique on Week 11 Lab

Goal 1: Business Scenario

Customer or Audience

A nationwide apartment rental platform (e.g., Zillow Rentals or Apartments.com) seeks to improve its rental price estimation model and recommend listings to users more accurately.

Problem Statement

In order to enhance user satisfaction and reduce bounce rates on listings, the company needs to determine which features (e.g., square footage, elevation, year built) most influence the rental price of an apartment.

Scope

Relevant variables from the dataset: price, sqft, elevation, year_built, beds, bath, in_sf

Analyses:
- GLMs to model price and in_sf (location classifier)
- Variable selection via stepwise regression
- Multicollinearity check via VIF
- Model comparison via AIC/BIC/Deviance

Assumptions:
- GLMs are appropriate for the outcome variable type
- Residual deviance, and deviance comparisons between nested models, indicate whether a model has improved
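As a rough sketch of the multicollinearity check and stepwise selection listed in the scope, the block below uses simulated data as a stand-in for the lab's apts data frame. The column names follow the lab, but the values, the sim_apts object, and the vif_manual helper are hypothetical. VIF is computed by hand from its textbook definition, 1 / (1 - R²), where R² comes from regressing each predictor on the others.

```r
set.seed(42)
n <- 300
# Simulated stand-in for the lab's `apts` data (values are hypothetical)
beds <- sample(1:4, n, replace = TRUE)
sqft <- 400 + 350 * beds + rnorm(n, sd = 150)   # deliberately correlated with beds
bath <- pmax(1, round(beds / 2 + rnorm(n, sd = 0.5)))
elevation <- runif(n, 0, 200)
year_built <- sample(1900:2015, n, replace = TRUE)
log_price <- 12 + 0.0006 * sqft + 0.002 * elevation + rnorm(n, sd = 0.3)
sim_apts <- data.frame(log_price, sqft, beds, bath, elevation, year_built)

# Hypothetical helper: VIF_j = 1 / (1 - R^2_j), where R^2_j comes from
# regressing predictor j on all the other predictors
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

predictors <- c("sqft", "beds", "bath", "elevation", "year_built")
vifs <- vif_manual(sim_apts, predictors)
print(round(vifs, 2))

# Backward stepwise selection by AIC, starting from the full model
full_model <- glm(log_price ~ sqft + beds + bath + elevation + year_built,
                  data = sim_apts)
step_model <- step(full_model, direction = "backward", trace = 0)
print(formula(step_model))
```

Because sqft is simulated as a function of beds, those two predictors should show the largest VIFs. A high VIF is a prompt to look closer, not an automatic reason to drop a variable.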

Objective

Identify the variables that most significantly influence apartment price and accurately classify location (SF vs. NY) to power recommendation algorithms and pricing insights.

Goal 2: Model Critique

Issue 1: Feature Engineering Could Be Improved

Currently, only square footage is transformed (sqrt_sqft) while others like elevation or year_built are not.

Recommendation A: Add interaction terms or polynomial features

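library(dplyr)  # provides mutate()

# log_price and sqrt_sqft are assumed to exist already, as in the lab
apts <- apts |>
  mutate(elevation2 = elevation^2)

# Keep linear elevation alongside its square; sqft:beds is the interaction
model_poly <- glm(log_price ~ sqrt_sqft + elevation + elevation2 + beds + sqft:beds,
                  data = apts)
summary(model_poly)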

This captures nonlinear effects of elevation and the interaction between size and number of bedrooms, which may influence price.
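To judge whether extra terms like these earn their place, the extended model can be compared against a main-effects baseline. The sketch below assumes simulated data (sim_apts, with a quadratic elevation effect built in on purpose) in place of the lab's apts, and the model names are hypothetical. It compares AIC, BIC, and deviance, then runs an F-test on the drop in deviance between the nested models.

```r
set.seed(1)
n <- 400
# Simulated stand-in for the lab's `apts` data (values are hypothetical)
beds <- sample(1:4, n, replace = TRUE)
sqft <- 400 + 350 * beds + rnorm(n, sd = 150)
elevation <- runif(n, 0, 200)
# The quadratic elevation effect is built in on purpose for illustration
log_price <- 12 + 0.0006 * sqft + 0.01 * elevation - 0.00004 * elevation^2 +
  rnorm(n, sd = 0.2)
sim_apts <- data.frame(log_price, sqft, beds, elevation)

# Baseline: main effects only
model_base <- glm(log_price ~ sqrt(sqft) + beds + elevation, data = sim_apts)

# Extended: quadratic in elevation plus a size-by-bedrooms interaction
model_ext <- glm(log_price ~ sqrt(sqft) + beds + poly(elevation, 2) + sqft:beds,
                 data = sim_apts)

# Lower AIC/BIC favors a model; BIC penalizes extra terms more heavily
comparison <- data.frame(
  model = c("base", "extended"),
  AIC = c(AIC(model_base), AIC(model_ext)),
  BIC = c(BIC(model_base), BIC(model_ext)),
  deviance = c(deviance(model_base), deviance(model_ext))
)
print(comparison)

# F-test on the drop in deviance between the nested Gaussian models
nested_test <- anova(model_base, model_ext, test = "F")
print(nested_test)
```

Deviance never increases when terms are added to a nested model, so AIC, BIC, or the F-test is the more informative signal of whether the added complexity is justified.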

Issue 2: No Model Diagnostics Are Performed for GLMs

Residual plots and checks for high-leverage or influential points are missing.

Recommendation B: Use diagnostic tools

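library(lindia)       # not needed for the base R plot() call below
par(mfrow = c(2, 2))  # arrange the four diagnostic panels in one grid
plot(model_poly)      # residuals vs. fitted, Q-Q, scale-location, residuals vs. leverage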

These plots help validate GLM assumptions and identify outliers or influential points.
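The same checks can be made numerically, which helps when specific rows need to be listed. The sketch below is illustrative only: it uses simulated data (sim_apts) with one unusual listing planted in row 1, and the 4 / n and 2p / n cutoffs are common rules of thumb, not formal tests.

```r
set.seed(7)
n <- 200
# Simulated stand-in for the lab's `apts` data (values are hypothetical)
sqft <- runif(n, 400, 2500)
elevation <- runif(n, 0, 200)
log_price <- 12 + 0.0006 * sqft + 0.002 * elevation + rnorm(n, sd = 0.25)
sim_apts <- data.frame(log_price, sqft, elevation)
# Plant one unusual listing (row 1) so there is something to detect
sim_apts[1, ] <- list(log_price = 18, sqft = 6000, elevation = 10)

fit <- glm(log_price ~ sqft + elevation, data = sim_apts)

# Cook's distance combines residual size and leverage
cooks <- cooks.distance(fit)
influential <- which(cooks > 4 / n)

# Leverage (hat values): how far a row's predictors sit from the rest
lev <- hatvalues(fit)
p <- length(coef(fit))
high_leverage <- which(lev > 2 * p / n)

# Largest Cook's distances, then rows flagged by both rules
print(head(sort(cooks, decreasing = TRUE), 5))
print(intersect(influential, high_leverage))
```

Rows flagged by both rules are worth inspecting by hand before deciding whether they are data errors or genuinely unusual apartments.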

Issue 3: Classification Analysis Lacks Validation

The in_sf classification is modeled but not validated.

Recommendation C: Evaluate classification with confusion matrix and accuracy

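library(caret)
# model1 is the lab's logistic regression for in_sf (assumed coded 0/1);
# these predictions are in-sample, so the metrics will be optimistic
apts$pred <- ifelse(predict(model1, type = "response") > 0.5, 1, 0)
confusionMatrix(factor(apts$pred, levels = c(0, 1)),
                factor(apts$in_sf, levels = c(0, 1)),
                positive = "1")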

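This provides accuracy, sensitivity, and specificity. Because the predictions are made on the same data used to fit the model, these figures are optimistic; a held-out set or cross-validation gives a better picture of the model’s real-world classification ability.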
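A minimal sketch of the same metrics computed on a held-out split rather than on the training rows, using only base R. The data are simulated (sim_apts, with in_sf coded 1 for San Francisco and 0 for New York), and the 70/30 split and 0.5 threshold are arbitrary choices for illustration.

```r
set.seed(123)
n <- 500
# Simulated stand-in for the lab's `apts` data (values are hypothetical);
# in_sf is coded 1 for San Francisco and 0 for New York
elevation <- runif(n, 0, 200)
in_sf <- rbinom(n, 1, plogis(-3 + 0.04 * elevation))
sim_apts <- data.frame(in_sf, elevation)

# Hold out 30% of rows so the metrics are not computed on the training data
test_idx <- sample(n, size = round(0.3 * n))
train <- sim_apts[-test_idx, ]
test <- sim_apts[test_idx, ]

clf <- glm(in_sf ~ elevation, data = train, family = binomial)
prob <- predict(clf, newdata = test, type = "response")
pred <- as.integer(prob > 0.5)

# Confusion matrix with fixed levels so empty cells still appear
cm <- table(predicted = factor(pred, levels = c(0, 1)),
            actual = factor(test$in_sf, levels = c(0, 1)))
print(cm)

accuracy <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["1", "1"] / sum(cm[, "1"])  # share of SF listings found
specificity <- cm["0", "0"] / sum(cm[, "0"])  # share of NY listings found
print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity), 3))
```

A single split gives a noisy estimate, so repeating it over several folds would be the natural next step.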

Goal 3: Ethical and Epistemological Concerns

Potential Biases

Location bias: The model assumes that elevation and square footage have the same effect on price in SF and NY, but the social and economic contexts of the two cities may differ.
Historical inequality: year_built may reflect systemic urban development disparities.

Societal Impacts

Renters may be priced out of neighborhoods due to algorithms trained on historical inequities.
Biased recommendations can reinforce segregation or gentrification.

Unmeasurable but Crucial Aspects

Neighborhood culture, safety, and accessibility are not captured.
Human behaviors, such as landlord discrimination or preference for certain tenants, are not measured either.

Who Is Affected?

Renters may get unfair price estimates or misleading recommendations.
Small landlords may suffer if the models are skewed by data from large landlords.

Final Suggestions

Use cross-validation and ROC/AUC to validate classification.
Explore nonlinear models like GAMs or tree-based models for better price prediction.
Include more granular location metadata (e.g., zip code, walkability scores).
Center ethical discussion around fairness, interpretability, and equity.
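The first suggestion, cross-validation with AUC, could be sketched as follows. This is an illustration under stated assumptions: the data are simulated (sim_apts), the auc_rank helper is hypothetical and computes AUC through the rank (Mann-Whitney) formulation so that no extra packages are needed, and k = 5 is an arbitrary choice.

```r
set.seed(2024)
n <- 600
# Simulated stand-in for the lab's `apts` data (values are hypothetical)
elevation <- runif(n, 0, 200)
sqft <- runif(n, 400, 2500)
in_sf <- rbinom(n, 1, plogis(-3 + 0.04 * elevation))
sim_apts <- data.frame(in_sf, elevation, sqft)

# Hypothetical helper: AUC as the probability that a randomly chosen
# positive case scores above a randomly chosen negative case
auc_rank <- function(labels, scores) {
  r <- rank(scores)  # average ranks handle tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

k <- 5
fold <- sample(rep(1:k, length.out = n))  # random fold assignment
cv_auc <- numeric(k)
for (i in 1:k) {
  train <- sim_apts[fold != i, ]
  test <- sim_apts[fold == i, ]
  fit <- glm(in_sf ~ elevation + sqft, data = train, family = binomial)
  prob <- predict(fit, newdata = test, type = "response")
  cv_auc[i] <- auc_rank(test$in_sf, prob)  # AUC on the held-out fold only
}
print(round(cv_auc, 3))
print(mean(cv_auc))
```

Reporting the spread across folds alongside the mean shows how stable the classifier is. Unlike accuracy, AUC does not depend on the 0.5 cutoff.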