Model Critique

For this lab, you’ll be working with a group of other classmates, and each group will be assigned a lab from a previous week. Your goal is to critique the models (or analyses) present in the lab.

First, review the materials from the Lesson on Ethics and Epistemology (week 5?). This includes lecture slides, the lecture video, or the reading. You can use these as reference materials for this lab. You may even consider the reading for the week associated with the lab, or even supplementary research on the topic at hand (e.g., news outlets, historical articles, etc.).

For the lab your group has been assigned, consider issues with models, statistical improvements, interpretations, analyses, visualizations, etc. Use this notebook as a sandbox for trying out different code, and investigating the data from a different perspective. Take notes on all the issues you see, and possible solutions (even if you would need to request more data or resources to accomplish those solutions).

Share your model critique in this notebook as your data dive submission for the week.

As a start, think about the context of the lab and consider the following:

Analytical issues, such as model assumptions
Statistical improvements; what do we know now that we didn’t know then?
Are there better visualizations which could have been used?
Overcoming biases (existing or potential)
Possible risks or societal implications
Crucial issues which might not be measurable

Treat this exercise as if the analyses in your assigned lab (i.e., the one you are critiquing) were to be published, made available to the public in a press release, or used at some large company (e.g., for mpg data, imagine if Toyota used the conclusions to drive strategic decisions).

If you were unable to attend class, select a notes_*.Rmd file from a previous week (not including weeks 1 or 3), and complete the analysis above. Share your critique below.

Example

For example, in Week 11, we used the year built, square footage, elevation, and the number of bedrooms to determine the price of an apartment. A few questions you might ask are:

Is this a “good” selection of variables? What could we be missing, or are there potential biases inherent in the groups of apartments here?
Nowhere in the lab do we investigate the assumptions of a linear model. Is the relationship between the response (i.e., \(\log(\text{price})\)) and each of these variables linear? Are the error terms evenly distributed?
Is it possible that our conclusions are more appropriate for some group(s) of the data and not others?
What if assumptions are not met? What could happen to this model if it were deployed on a platform like Zillow?
Consider different evaluation metrics between models. What is a practical use for these values?
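
One way to start on the "evenly distributed errors" question is a numeric check to accompany a residual plot. The sketch below is only an illustration on simulated data: the `toy` data frame, its coefficients, and the built-in heteroscedasticity are assumptions, not values from the apartments lab. It computes Koenker's studentized Breusch-Pagan statistic by hand, using base R only.

```r
# Simulated data only: column names mimic the lab, the values do not.
set.seed(5)
n <- 250
toy <- data.frame(sqft = runif(n, 400, 3000))
# Noise grows with size, so the errors are deliberately NOT evenly spread.
toy$price <- 1e5 + 800 * toy$sqft + rnorm(n, sd = 100 * toy$sqft)

fit <- lm(price ~ sqft, data = toy)

# Studentized Breusch-Pagan: regress the squared residuals on the predictor;
# n * R^2 is compared against a chi-squared with 1 df (one predictor).
toy$sq_resid <- resid(fit)^2
aux <- lm(sq_resid ~ sqft, data = toy)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(statistic = bp_stat, p_value = bp_p)
```

A small p-value suggests the error variance changes with the predictor. On real data, this would be read alongside a residuals-versus-fitted plot rather than in place of it.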

Week 13 Data Dive - Lab 10 (and 11)

Loading apartments data

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)
library(ggrepel)
library(patchwork)
library(broom)
library(lindia)
library(car)

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

url_ <- "https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/apartments/apartments.csv"
apts <- read_delim(url_, delim = ",")

## Rows: 492 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (8): in_sf, beds, bath, price, year_built, sqft, price_per_sqft, elevation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Exploring alternative explanatory variables for ‘price’

In Lab 10, we explore the relationship between price and price per square foot, ignoring the fact that the explanatory variable is itself derived from the response variable. We assumed this was a valid way to explore the relationship, but other columns, such as elevation or the number of bedrooms, can be explored instead.

# use beds (number of bedrooms) as the explanatory variable, for apartments outside San Francisco
apts_ny <- filter(apts, in_sf == 0)
model_beds <- lm(price ~ beds, apts_ny)
rsquared_beds <- summary(model_beds)$r.squared
apts_ny %>%
  ggplot(mapping = aes(x = beds, 
                       y = price)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed', 
              se = FALSE) +
  labs(title = "Price vs. Number of Bedrooms",
       subtitle = paste("Linear Fit R-Squared =", round(rsquared_beds, 3))) +
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'

We can observe a positive correlation between price and the number of bedrooms. This means that as the number of bedrooms increases, the price also tends to increase.

While price per square foot provides insights into the efficiency of space utilization and cost effectiveness, observing the relationship between price and the number of bedrooms offers a more nuanced understanding of housing prices, reflecting factors such as accommodation capacity and market segmentation.

The number of bedrooms directly relates to the accommodation capacity of a home, which can be an important factor for many buyers. Larger families or those with specific needs may prioritize the number of bedrooms over overall square footage. Also, in certain housing market segments, the number of bedrooms may be a more salient factor in pricing than price per square foot.
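
A side-by-side comparison of candidate explanatory variables could look like the sketch below. It is an illustration under assumptions: the `toy` data frame is simulated and only borrows the column names `beds`, `sqft`, and `price`, so the numbers it prints say nothing about the real apartments. Swapping in `apts_ny` would give the actual comparison.

```r
# Simulated data only: values are made up, column names mimic `apts_ny`.
set.seed(1)
n <- 200
toy <- data.frame(beds = sample(0:4, n, replace = TRUE))
toy$sqft <- 400 + 350 * toy$beds + rnorm(n, sd = 150)
toy$price <- 2e5 + 900 * toy$sqft + rnorm(n, sd = 1e5)

# Fit one single-predictor model per candidate variable.
candidates <- c("beds", "sqft")
fits <- lapply(candidates, function(v) {
  lm(reformulate(v, response = "price"), data = toy)
})

# Collect R-squared and the slope of each model in one table.
comparison <- data.frame(
  predictor = candidates,
  r_squared = sapply(fits, function(m) summary(m)$r.squared),
  slope     = sapply(fits, function(m) unname(coef(m)[2]))
)
comparison
```

The slopes are in different units (dollars per bedroom versus dollars per square foot), so only the R-squared column is directly comparable across rows.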

Considerations for Transforming Variables

Taking the square root of area can be problematic because it changes the fundamental nature of the variable. Area is a 2D space, typically measured in square units. The square root operation, however, implies finding the side length of a square with the given area, which changes the whole interpretation of the variable. This transformation complicates the understanding of the relationship between area and price, as it no longer accurately reflects the actual size or extent of the space being measured. Moreover, this transformation can distort the visual representation of the data and lead to misinterpretations of the underlying relationship.

By keeping the square footage untransformed, we can analyze how price changes as the total usable living space increases. This provides clearer insight into how much buyers are willing to pay for additional space.

While the logic behind a square root transformation may seem reasonable, it loses interpretability in this context. We believe that the original square footage provides a more accurate and meaningful measure for understanding how price scales with living space.
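
The interpretability point can be made concrete with a small sketch. The data here are simulated (the `toy` data frame, its coefficients, and the helper `extra_100` are assumptions for illustration, not part of Lab 10). The code compares what each model predicts for 100 additional square feet, starting from a small and a large apartment.

```r
# Simulated data only: a linear price-size relationship with noise.
set.seed(2)
toy <- data.frame(sqft = runif(150, 400, 3000))
toy$price <- 1e5 + 800 * toy$sqft + rnorm(150, sd = 5e4)

fit_raw  <- lm(price ~ sqft, data = toy)
fit_sqrt <- lm(price ~ sqrt(sqft), data = toy)

# Hypothetical helper: predicted price change for 100 extra square feet.
extra_100 <- function(fit, from) {
  p <- predict(fit, newdata = data.frame(sqft = c(from, from + 100)))
  unname(p[2] - p[1])
}

starts <- c(500, 2500)
effects <- data.frame(
  start_sqft = starts,
  raw_model  = sapply(starts, extra_100, fit = fit_raw),
  sqrt_model = sapply(starts, extra_100, fit = fit_sqrt)
)
effects
```

In the untransformed model, the effect of 100 extra square feet is the same at every starting size (100 times the slope). In the square root model, it shrinks as the apartment gets larger, so the coefficient no longer reads as a price per square foot.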

Enhancing Normality with Additional Transformation

When we transform the response with the Box-Cox transformation in Lab 10, the result is not actually normally distributed; we accept it as an approximation and an improvement over the original model. It can be pushed further toward normality by mapping the ranks of the transformed values to standard normal quantiles with the qnorm() function, although doing so discards the original spacing between values.

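# quadratic model, fit to the apartments outside San Francisco
model <- lm(price ~ I(price_per_sqft ^ 2) + price_per_sqft,
            apts_ny)
# estimating the Box-Cox power for the response
pT <- powerTransform(model, family = "bcnPower")
lambda <- pT$lambda
# apply the power to the same rows the model was fit on (apts_ny, not apts);
# this shifted Box-Cox formula only approximates the bcnPower family, which
# also estimates a location parameter (gamma) that is ignored here
price_ny <- apts_ny$price[!is.na(apts_ny$price)]
if (abs(lambda) < 0.01) {
  transformed_price <- log(price_ny)
} else {
  transformed_price <- ((price_ny + 0.1) ^ lambda - 1) / lambda
}
# map each value's rank to a standard normal quantile so that `new` stays
# tied to the data; (rank - 0.5) / n matches ppoints() for n > 10
n_obs <- length(transformed_price)
new <- qnorm((rank(transformed_price) - 0.5) / n_obs)
plot_data <- tibble(price = price_ny, transformed_price, new)
p1 <- ggplot(data = plot_data) +
  geom_histogram(aes(x = price), color = 'white') +
  labs(x = "Original Price Distribution") +
  theme(legend.position = "none")
p2 <- ggplot(data = plot_data) +
  geom_histogram(aes(x = transformed_price), color = 'white') +
  labs(x = "Transformed Price Distribution") +
  theme(legend.position = "none")
p3 <- ggplot(data = plot_data) +
  geom_histogram(aes(x = new), color = 'white') +
  labs(x = "Quantiles of Transformed Price Distribution") +
  theme(legend.position = "none")
# facet_wrap() with a constant string adds a title strip to each panel
combined_plot <- p1 + facet_wrap(~ "Original Price Distribution") +
                 p2 + facet_wrap(~ "Transformed Price Distribution") +
                 p3 + facet_wrap(~ "Quantiles of Transformed Price Distribution", scales = "free")
print(combined_plot)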

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
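
Histograms give only a visual impression of normality, so a numeric check may help. The sketch below isolates the rank-to-quantile step as a helper and compares Shapiro-Wilk p-values before and after. The helper name `rank_to_normal` and the simulated `skewed` vector are assumptions for illustration; they are not part of the lab or of any package.

```r
# Hypothetical helper: rank-based inverse normal transformation.
# Uses (rank - 0.5) / n, the plotting positions ppoints() gives for n > 10.
rank_to_normal <- function(x) {
  out <- rep(NA_real_, length(x))
  ok <- !is.na(x)                      # leave missing values as NA
  n <- sum(ok)
  out[ok] <- qnorm((rank(x[ok], ties.method = "average") - 0.5) / n)
  out
}

# Simulated right-skewed values standing in for prices (not the lab data).
set.seed(3)
skewed <- rexp(300, rate = 1 / 5e5)
z <- rank_to_normal(skewed)

# Shapiro-Wilk p-values: small values indicate departure from normality.
p_before <- shapiro.test(skewed)$p.value
p_after  <- shapiro.test(z)$p.value
c(before = p_before, after = p_after)
```

Because the transformed values are normal quantiles by construction, the "after" test passes almost automatically. It confirms the shape of the response only; it says nothing about whether the residuals of a model fit to it behave well.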

Biases and Limitations in the Dataset

One likely source of bias is that our dataset ignores informal and illegal housing: it probably excludes informal housing and unregistered apartments in both New York and San Francisco. We can therefore assume that it is biased toward wealthier owners who have registered their apartments, which could make the relationships involving area and number of rooms difficult to apply in the real world. Our dataset also contains some extremely old buildings, dating as far back as 1909. We have no way of knowing whether these older buildings still stand or have been demolished. Furthermore, the analysis assumes that older apartments are structured the same way as newer ones. This assumption is likely misplaced, as apartment layouts and structures have changed drastically over time. The relationships of price to area and number of rooms are therefore likely inconsistent across time.
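
Whether the price-to-area relationship differs between older and newer buildings could be checked with an interaction term. The sketch below is a hedged illustration on simulated data: the `toy` data frame, the 1960 cutoff, the upper year of 2015, and the different slopes are all assumptions built into the simulation, so its output says nothing about the real apartments.

```r
# Simulated data only: the slope difference between eras is built in on purpose.
set.seed(4)
n <- 300
toy <- data.frame(
  year_built = sample(1909:2015, n, replace = TRUE),  # upper year is assumed
  sqft = runif(n, 400, 3000)
)
# Arbitrary cutoff, chosen only for illustration.
toy$era <- ifelse(toy$year_built < 1960, "pre-1960", "1960 or later")
slope <- ifelse(toy$era == "pre-1960", 600, 1000)
toy$price <- 1e5 + slope * toy$sqft + rnorm(n, sd = 1e5)

# One shared slope versus a separate slope per era.
fit_common <- lm(price ~ sqft + era, data = toy)
fit_by_era <- lm(price ~ sqft * era, data = toy)

# F-test: does allowing the slope to vary by era improve the fit?
era_test <- anova(fit_common, fit_by_era)
era_test
```

With `apts_ny` in place of `toy`, a small p-value in the second row would support the concern that a single price-per-area relationship does not hold across construction eras.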

Deployment Challenges

Platforms like Zillow would not fully benefit from the inferences in our model, as they often include listings of individual rooms and subleases. Our dataset, which is already biased toward wealthier, registered housing, would therefore not be a good fit for such platforms. The older homes in our dataset also make the model difficult to deploy in real-world applications: older houses may have been heavily remodelled or even demolished over time, so inferences drawn from them may no longer hold.

Crucial issues which might not be measurable

Neighborhood Dynamics - Factors like community cohesion, safety, and neighborly relations can profoundly influence the livability of an area. While crime rates might be measurable, the sense of community or the presence of local amenities and social activities might not be quantifiable but can be crucial determinants of housing value.
Environmental Quality - Issues such as air and noise pollution, proximity to industrial sites, or potential hazards like landslides or flooding might not be readily quantifiable in a dataset but can have a substantial impact on the quality of life and property values.
Historical Significance - Certain properties might hold historical or cultural significance that adds intangible value beyond their physical characteristics.
Regional Development Plans - Potential future developments in the area, such as infrastructure projects, zoning changes, or commercial developments, can significantly affect property values. While speculative, knowledge about such plans can be crucial for making informed decisions about property investment.