Model Critique

For this lab, you’ll be working with a group of other classmates, and each group will be assigned to critique a lab from a previous week.

Your group will have three goals:

  1. Create an explicit business scenario which might leverage the data (and methods) used in the lab.
  2. Critique the models (or analyses) present in the lab based on this scenario.
  3. Devise a list of ethical and epistemological concerns that might pertain to this lab in the context of your business scenario.

Goal 1: Business Scenario

First, create your own context for the lab. This should be a business use-case such as “a real estate firm aims to present housing trends (and recommendations) for their clients in Ames, Iowa”.

You do not need to solve the problem; you only need to define it.

Your scenario should include the following:

  • Customer or Audience: who exactly will use your results?
  • Problem Statement: identify a business need or a possible customer request. This should be actionable, in that it should call for an action taken.
    • E.g., the statement “we need to analyze sales data” is not a good problem statement, but “the company needs to know if they should stop selling product A” is better.
  • Scope: What variables from the data (in the lab) can address the issue presented in your problem statement? What analyses would you use? You’ll need to define any assumptions you feel need to be made before you move forward.
    • If you feel the content in the lab cannot sufficiently address the issue, try to devise a more applicable problem statement.
  • Objective: Define your success criteria. In other words, suppose you started working on this problem in earnest; how will you know when you are done? For example, you might want to “identify the factors that most influence <some variable>.”
    • Note: words like “identify”, “maximize”, “determine”, etc. could be useful here. Feel free to find the right action verbs that work for you!

Goal 2: Model Critique

Since this is a class, and not a workplace, we need to be careful not to present too much information to you all at once. For this reason, our labs are often not as analytically rigorous or thorough as they might be in practice … So here, your goal is to:

Present a list of at least 3 (improved) analyses you would recommend for your business scenario. Each proposed analysis should be accompanied by a “proof of concept” R implementation. (As usual, execute R code blocks here in the RMarkdown file.)

In the lab your group has been assigned, consider issues with models, statistical improvements, interpretations, analyses, visualizations, etc. Use this notebook as a sandbox for trying out different code, and investigating the data from a different perspective. Take notes on all the issues you see, and propose your solutions (even if you might need to request more data or resources to accomplish those solutions).

You’ll want to consider the following:

  • Analytical issues, such as the current model assumptions.
  • Issues with the data itself.
  • Statistical improvements; what do we know now that we didn’t know (or at least didn’t use) then? Are there other methods that would be appropriate?
  • Are there better visualizations which could have been used?

Feel free to use the reading for the week associated with your assigned lab to help refresh your memory on the concepts presented.

Goal 3: Ethical and Epistemological Concerns

Review the materials from the Week 5 lesson on Ethics and Epistemology. This includes lecture slides, the lecture video, or the reading. You should also consider doing supplementary research on the topic at hand (e.g., news outlets, historical articles, etc.). Some issues you might want to consider include:

  • Overcoming biases (existing or potential).
  • Possible risks or societal implications.
  • Crucial issues which might not be measurable.
  • Who would be affected by this project, and how does that affect your critique?

Example

For example, in Week 10-11, we used the year built, square footage, elevation, and the number of bedrooms to determine the price of an apartment. A few questions you might ask are:

  • Is this a “good” selection of variables? What could we be missing, or are there potential biases inherent in the groups of apartments here?
  • Nowhere in the lab do we investigate the assumptions of a linear model. Is the relationship between the response (i.e., \(\log(\text{price})\)) and each of these variables linear? Are the error terms evenly distributed?
  • Is it possible that our conclusions are more appropriate for some group(s) of the data and not others?
  • What if assumptions are not met? What could happen to this model if it were deployed on a platform like Zillow?
  • Consider different evaluation metrics between models. What is a practical use for these values?

Share your model critique in this notebook as your data dive submission for the week. Make sure to include your own R code which executes suggested routines.

library(conflicted)  
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
conflict_prefer("filter", "dplyr")
## [conflicted] Will prefer dplyr::filter over any other package.
conflict_prefer("lag", "dplyr")
## [conflicted] Will prefer dplyr::lag over any other package.

Business scenario

We are working for the San Francisco local government, and we want to increase the number of affordable apartments for residents. To do this, we will assess price data for apartments in the area along with features such as the number of beds, the number of baths, and square footage. We want to identify which type of apartment gives the lowest end price for renters or home-owners. The city can then use this information to incentivise developers to build a higher proportion of new apartments with those characteristics.

url_ <- "https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/apartments/apartments.csv"

apts <- read_delim(url_, delim = ",")
## Rows: 492 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (8): in_sf, beds, bath, price, year_built, sqft, price_per_sqft, elevation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
baths <- apts |>
  filter(bath %in% c(1,2)) |>
  filter(beds == 2)|>
  filter(in_sf == 1)

baths_minus_one <- baths$bath - 1
model <- glm(baths_minus_one ~ price_per_sqft, data = baths,
             family = binomial(link = 'logit'))

model$coefficients
##    (Intercept) price_per_sqft 
##   -3.610613832    0.004645585
# Fitted probability curve, using the estimated coefficients rather than rounded copies
sigmoid <- \(x) plogis(coef(model)[[1]] + coef(model)[[2]] * x)

baths |>
  ggplot(mapping = aes(x = price_per_sqft, y = baths_minus_one)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
  labs(title = "Price per square foot of different 2 bedroom apartments",
       x = "Price per square foot",
       y = "1 bath (0) or 2 bath (1)") +
  scale_y_continuous(breaks = c(0,1)) +
  theme_minimal() 

The graph above gives a glimpse of how one-bath and two-bath apartments (among 2-bedroom apartments in San Francisco) differ in price per square foot. Most of the one-bath apartments have a low price per square foot and show less variation than the two-bath apartments.
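
One way to read the fitted curve is through an odds ratio and the price at which the fitted probability crosses 0.5. The sketch below is only an illustration: it copies the two coefficients printed above, the names `coefs`, `or_per_100`, `midpoint`, and `p_two_bath` are ours, and the 100-unit step is an arbitrary choice made for readability.

```r
# Coefficients copied from the model$coefficients output above.
coefs <- c("(Intercept)" = -3.610613832, price_per_sqft = 0.004645585)

# Multiplicative change in the odds of "2 baths" for each additional
# 100 units of price per square foot (the step size is an assumption).
or_per_100 <- exp(100 * coefs[["price_per_sqft"]])

# Price per square foot at which the fitted probability equals 0.5,
# i.e. where the linear predictor is zero.
midpoint <- -coefs[["(Intercept)"]] / coefs[["price_per_sqft"]]

# Fitted probability of "2 baths" at a given price per square foot.
p_two_bath <- function(x) {
  plogis(coefs[["(Intercept)"]] + coefs[["price_per_sqft"]] * x)
}

round(c(odds_ratio_per_100 = or_per_100, midpoint = midpoint), 2)
```

Under this fit, the odds of a second bathroom rise by roughly 60% for each additional 100 in price per square foot, and the curve crosses 0.5 at a price per square foot of about 777. Inside the notebook, `coef(model)` could replace the copied values.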

# san francisco filter
sfapts <- apts |>
  filter(in_sf == 1)

sfapts |>
  ggplot(mapping = aes(x = as.factor(beds), y = price_per_sqft)) +
  geom_boxplot() + 
  labs(x = "Number of Bedrooms", y = "Price per square foot") +
  theme_minimal()

sfapts$cpb <- sfapts$price / sfapts$beds

sfapts |>
  ggplot(mapping = aes(x = as.factor(beds), y = cpb)) +
  geom_boxplot() + 
  labs(x = "Number of Beds", y = "Price per bed") +
  theme_minimal()
## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Interestingly, as the number of beds increases, the median price per square foot (the center line of each box) appears to decrease steadily. However, when looking at the cost per bed, another valuable metric, the median is low for 2-4 bedrooms. Six-bedroom apartments are lower still, but this may be an artifact of the small number of 6-bedroom apartments in the data. Since it’s not practical to expect the majority of newly constructed apartments to have 6 bedrooms, we’ll likely want the scope of our future tests to include only apartments with four or fewer beds.

sfapts |>
  ggplot(mapping = aes(x = as.factor(beds))) +
  geom_bar() + 
  labs(x = "Number of Beds", y = "Number of Apartments") +
  theme_minimal()
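
The boxplots and bar chart above could be backed by a small table of group sizes and medians, which makes the small-sample caveat concrete. This is a minimal base-R sketch, not part of the original lab: `summarise_by_beds` is a hypothetical helper, and it assumes the data frame has numeric `beds`, `price`, and `price_per_sqft` columns as in the apartments data. It also sets price per bed to `NA` for zero-bedroom listings, since dividing by zero beds would give `Inf` (a likely source of the non-finite rows in the warning above).

```r
# Summarise price per square foot and price per bed by bedroom count.
# `df` is assumed to have numeric columns beds, price, and price_per_sqft.
summarise_by_beds <- function(df) {
  # Zero-bedroom listings would give price / 0 = Inf, so mark them NA.
  df$price_per_bed <- ifelse(df$beds > 0, df$price / df$beds, NA_real_)
  groups <- split(df, df$beds)
  out <- do.call(rbind, lapply(groups, function(g) {
    data.frame(
      beds = g$beds[1],
      n = nrow(g),
      median_ppsf = median(g$price_per_sqft, na.rm = TRUE),
      mean_ppsf = mean(g$price_per_sqft, na.rm = TRUE),
      median_price_per_bed = median(g$price_per_bed, na.rm = TRUE)
    )
  }))
  rownames(out) <- NULL
  out
}

# Only runs when the notebook's `sfapts` object is available.
if (exists("sfapts")) print(summarise_by_beds(sfapts))
```

Groups with a very small `n` deserve less weight in any conclusion drawn from the plots.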

ANOVA Testing

First we’ll test whether the mean price per square foot differs among 2-, 3-, and 4-bedroom apartments. Our null hypothesis is that the three group means are equal.

library(tidyverse)
# san francisco filter
sfapts <- apts |>
  filter(in_sf == 1)
filtered_sfapts <- sfapts |>
  filter(beds %in% c(2,3,4)) #anova only on 2,3,and 4 beds
filtered_sfapts |>
  ggplot(mapping = aes(x = as.factor(beds), y = price_per_sqft)) +
  geom_boxplot() + 
  labs(x = "Number of Bedrooms", y = "Price per square foot") +
  theme_minimal()

# Anova assumptions checks
filtered_sfapts |>
  filter(beds == 4) |> # normality check, shown here for the 4-bedroom group
  ggplot(mapping = aes(x= price_per_sqft)) +
  geom_histogram() +
  scale_x_log10()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
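
The histogram above inspects one group at a time. As a rough sketch of how all groups could be checked together, the hypothetical helper below (`check_anova_assumptions`, our name) reports the within-group standard deviations, a Shapiro-Wilk p-value per group, and Bartlett's test of equal variances. The choice of these particular tests is our assumption, not something the lab prescribes.

```r
# Per-group spread and normality checks for a one-way ANOVA.
# `response` and `group` are column names in `df`.
check_anova_assumptions <- function(df, response, group) {
  y <- df[[response]]
  g <- factor(df[[group]])
  list(
    # Standard deviation of the response within each group.
    sd_by_group = tapply(y, g, sd, na.rm = TRUE),
    # Shapiro-Wilk p-value per group; small values suggest non-normality.
    # shapiro.test() needs at least 3 non-missing values in each group.
    shapiro_p = tapply(y, g, function(v) shapiro.test(v)$p.value),
    # Bartlett's test of equal variances across the groups.
    bartlett_p = bartlett.test(y, g)$p.value
  )
}

# Only runs when the notebook's `filtered_sfapts` object is available.
if (exists("filtered_sfapts")) {
  print(check_anova_assumptions(filtered_sfapts, "price_per_sqft", "beds"))
}
```

Bartlett's test is itself sensitive to non-normality, so its result is best read alongside the histograms rather than on its own.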

filtered_sfapts |>
  filter(beds == 4) |> # sd of the 4-bedroom group, to compare with the other groups
  pull(price_per_sqft) |>
  sd()
## [1] 321.8597
m <- aov(price_per_sqft ~ beds, data = filtered_sfapts)
summary(m)
##              Df   Sum Sq Mean Sq F value   Pr(>F)    
## beds          1  1232894 1232894   11.18 0.000993 ***
## Residuals   193 21283901  110279                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

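We got an F value of 11.18 with a p-value of about 0.001, so we reject the null hypothesis at the 5% level. Note that the table reports Df = 1 for beds: because beds is stored as a number, aov() fit a single slope, so this F test assesses a linear trend in price per square foot across bedroom counts rather than a general difference among the three group means (which would have Df = 2 and requires treating beds as a factor). The result is still suggestive of a difference between these groups.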
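
A minimal sketch of the three-group comparison with `beds` converted to a factor, so that `aov()` estimates a separate mean per bedroom count (Df = 2 for three groups). The helper name `fit_factor_anova` is ours, and the Welch variant is included as an assumption-light alternative rather than something the lab used. No output is shown because we have not re-run the analysis this way.

```r
# One-way ANOVA with the grouping variable treated as categorical.
# `response` and `group` are column names in `df`.
fit_factor_anova <- function(df, response, group) {
  d <- data.frame(y = df[[response]], g = factor(df[[group]]))
  list(
    # Classic one-way ANOVA: Df for g is (number of groups - 1).
    anova = aov(y ~ g, data = d),
    # Welch's one-way test, which does not assume equal variances.
    welch = oneway.test(y ~ g, data = d, var.equal = FALSE)
  )
}

# Only runs when the notebook's `filtered_sfapts` object is available.
if (exists("filtered_sfapts")) {
  fits <- fit_factor_anova(filtered_sfapts, "price_per_sqft", "beds")
  print(summary(fits$anova))
  print(fits$welch)
}
```

The same helper can be called with `"ppb"` as the response once that column has been created.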

Let’s run the same process on price per bed to see whether the bedroom groups differ there as well.

filtered_sfapts$ppb <- filtered_sfapts$price / filtered_sfapts$beds
filtered_sfapts |>
  ggplot(mapping = aes(x = as.factor(beds), y = ppb)) +
  geom_boxplot() + 
  #scale_y_continuous(breaks = c(0, 1)) +
  labs(x = "Number of Beds", y = "Price per bed") +
  theme_minimal()

m2 <- aov(ppb ~ beds, data = filtered_sfapts)
summary(m2)
##              Df    Sum Sq   Mean Sq F value Pr(>F)  
## beds          1 5.883e+11 5.883e+11   4.107 0.0441 *
## Residuals   193 2.765e+13 1.432e+11                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

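Again, an F value of 4.1 (p = 0.044) is significant at the 5% level, though only marginally. The same caveat applies: beds entered the model as a number (Df = 1), so this is evidence of a trend in price per bed across bedroom counts rather than a test of three separate group means.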

This adds some statistical rigor to the claim that bedroom count is related to both price per square foot and price per bed. The F tests alone do not say which groups differ; from the boxplots, 3-bedroom apartments appear to have the lowest price per bed among the three, and a lower price per square foot than 2-bedroom apartments. We could have run pairwise t-tests and confidence intervals to confirm these, but since this is a proof of concept we thought this should be sufficient.
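
For reference, the pairwise follow-up mentioned above could look like the sketch below, which uses Tukey's HSD from base R to get every pairwise difference in means with a family-wise 95% confidence interval. `pairwise_group_diffs` is a hypothetical helper name, and Tukey's HSD is our choice of method; pairwise t-tests with a multiplicity correction would be another option.

```r
# Pairwise differences in group means with Tukey's HSD.
# `response` and `group` are column names in `df`.
pairwise_group_diffs <- function(df, response, group) {
  d <- data.frame(y = df[[response]], g = factor(df[[group]]))
  fit <- aov(y ~ g, data = d)
  # One row per pair (e.g. "3-2"): difference in means, lower and upper
  # confidence limits, and the adjusted p-value.
  as.data.frame(TukeyHSD(fit, "g", conf.level = 0.95)$g)
}

# Only runs when the notebook's `filtered_sfapts` object is available.
if (exists("filtered_sfapts")) {
  print(pairwise_group_diffs(filtered_sfapts, "price_per_sqft", "beds"))
}
```

An interval that excludes zero indicates which specific pair of bedroom counts differs, which the overall F test cannot show.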

Limitations and potential conclusions

Our data is beginning to suggest that 3-bedroom apartments might be among the most efficient in terms of price per square foot and price per bed. This would mean that occupants are paying the least for their room, and are getting the most value for their area, at this number of bedrooms. Additionally, we’ve seen that prices also tend to be lower when there are fewer bathrooms. Because of this, we might want to conclude that the city should incentivise building many more 2-4 bedroom apartments, especially 3-bedroom apartments and those with fewer bathrooms than average for their bedroom count. However, this might not be that simple.

  1. Supply and demand. It may be the case that if supply were raised for certain bedroom counts, prices would go down. Take 1-bedroom apartments: if there were, say, twice as many of them, would their price come down below that of a 2-bedroom apartment? We think not, since 2-bedroom apartments are likely more efficient because of shared expenses like appliances, but that points to another thing we’re missing: cost structures.
  2. Construction costs. Although it’s very important to look at the prices that our end customers, citizens, will pay, the costs associated with constructing new apartments are very important too. Companies will favor constructing apartments with larger margins. A 2-bedroom apartment with 2 bathrooms instead of 1 sells at a higher price, but we don’t know whether the cost to build it is much different.
  3. More supply and demand. How much demand is there for 3- or 4-bedroom apartments? If we try to “flood the market” with these cost-effective places, they may sit vacant because of the added difficulty of finding roommates, difficulties leaving or subletting, and general roommate dissatisfaction.
  4. Cost per person. Although the average 2-bedroom apartment has a lower cost per room and cost per square foot compared to a 1-bedroom apartment, we don’t know the average number of occupants in each of these rooms. It’s common for couples to share a 1-bedroom apartment, but how prevalent is this? Is the average number of occupants per bed 1.5 in a 1-bedroom apartment but 1.1 in a 2-bedroom apartment? What if the average occupancy per bed in a 3-bedroom apartment is 0.9 due to difficulties subletting? In this case, 1-bedroom apartments might actually be the most efficient in terms of cost per occupant. But that occupancy data isn’t available to us.
  5. Location. Are some types of apartments located in nicer areas of SF? Maybe 3- and 4-bedroom apartments seem cheaper because they are on “cheaper” sides of town, and if they were in the more luxurious parts of town they’d be more expensive than the 1-2 bedroom apartments. We don’t know this. Additionally, we can’t tell from elevation which floor of its building each apartment is on. SF is very hilly, so although we have elevation data, it’s plausible that an apartment with a lower elevation is actually on a higher floor than one with a higher elevation, because the two buildings sit on different hills. The price could easily be affected by the apartment’s height relative to its surroundings, not just its elevation above sea level, and that is missing from our data.
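
Some of these confounding concerns can be partly probed with the columns the data does contain. The sketch below is one possible approach under our own assumptions: it regresses log price on bedroom count (as a factor) while holding `bath`, `sqft`, `elevation`, and `year_built` fixed. The helper name `fit_adjusted_price`, the choice of covariates, and the log transform are all illustrative. It cannot address neighborhood, occupancy, or construction cost, none of which is recorded.

```r
# Compare bedroom counts while holding other recorded features fixed.
# Column names follow the apartments data; the covariate set and the
# log transform are illustrative assumptions.
fit_adjusted_price <- function(df) {
  # Keep rows with a usable positive price so log(price) is defined.
  df <- df[is.finite(df$price) & df$price > 0, ]
  df$beds_f <- factor(df$beds)
  lm(log(price) ~ beds_f + bath + sqft + elevation + year_built, data = df)
}

# Only runs when the notebook's `filtered_sfapts` object is available.
if (exists("filtered_sfapts")) {
  adj <- fit_adjusted_price(filtered_sfapts)
  # exp(coef) for the beds_f terms: multiplicative price difference versus
  # the reference bedroom count, at fixed bath, sqft, elevation, year_built.
  print(round(exp(coef(adj)), 3))
}
```

If the bedroom effects shrink or change sign after adjustment, the raw group differences were at least partly driven by these other features.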

Goal 3: Ethical and Epistemological Concerns

Existing Data Bias

  • Housing price data often includes inherent biases tied to socio-economic disparities, historical racial zoning practices, or unequal market dynamics.

    • Example: If the dataset over-represents affluent neighborhoods, results might fail to generalize to low-income housing markets.

    • Variable Concerns: Features like “Elevation” may inadvertently reflect systemic neighborhood segregation, reinforcing biases.

Possible Risks or Societal Implications

  • Profit Over Affordability: The analysis could unintentionally push developers to prioritize maximizing profits, undermining the goal of affordable housing initiatives.

  • Exclusion Risk: Developers may focus on market segments that yield higher returns, leaving out designs that cater to diverse housing needs.

Issues Which Might Not Be Measurable

  • Historical Context: Historical practices such as redlining might have shaped current price trends but remain unquantifiable in a purely statistical analysis.

    • Ignoring such contexts could limit the relevance and fairness of the findings.

Affected Stakeholders

  1. Home Buyers/Renters:

    • Positive Impact: Improved access to affordable and well-designed housing if insights are implemented responsibly.

    • Negative Impact: Risk of pricing out lower-income families if developers focus solely on profit-driven designs.

  2. Real Estate Developers:

    • Positive Impact: Data-driven insights could support cost-effective, equitable housing solutions aligned with affordability goals.

    • Negative Impact: Facing ethical scrutiny if findings drive inequitable housing practices.

Ethics Recommendations

  1. Fairness and Bias Mitigation:

    • Employ fairness-aware preprocessing techniques to address biases in the dataset.

    • Regularly audit model outputs to ensure equitable treatment of different demographic groups.

  2. Collaborate with Stakeholders:

    • Work with policymakers, community representatives, and housing advocates to ensure the analysis aligns with societal needs and prioritizes inclusivity.

  3. Transparency and Documentation:

    • Publish comprehensive documentation detailing:

      • Data sources and preprocessing methods.

      • Model assumptions and potential limitations.

      • Ethical considerations and mitigation efforts.

    • This ensures future analyses maintain integrity and build upon an ethically sound foundation.
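
As a rough illustration of what auditing model outputs could mean in practice, the sketch below compares a fitted model's residuals across groups. Everything here is hypothetical: `audit_residuals` is our name, and the `group` vector stands in for a neighborhood or demographic label that the apartments data does not contain and that would have to be obtained separately. It assumes `group` has one entry per observation used in the fit, in the same order.

```r
# Compare a fitted model's errors across groups.
# `fit` is any model with a residuals() method (e.g. from lm()).
# `group` is a hypothetical label vector aligned with the fitted rows.
audit_residuals <- function(fit, group) {
  res <- residuals(fit)
  g <- factor(group)
  data.frame(
    group = levels(g),
    n = as.vector(table(g)),
    # Mean residual: systematic over- or under-prediction for the group.
    mean_residual = as.vector(tapply(res, g, mean)),
    # Root mean squared error: typical size of the error for the group.
    rmse = as.vector(tapply(res, g, function(r) sqrt(mean(r^2))))
  )
}
```

A group whose mean residual sits far from zero is being systematically over- or under-predicted, which is the kind of pattern such an audit is meant to surface.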