Model Critique

I was not in class this week (11/20), so I am using the notes from Week 10 (Generalized Linear Models Part 1) to complete this week’s data dive/analysis. Week 10’s notebook uses a dataset of apartments in SF and NYC that records the number of beds and baths, the price, the year built, the square footage, the price per square foot, and the elevation.

Goal 1: Business Scenario

My context for this lab is that a real estate company in California is looking to expand its portfolio into San Francisco and wants to know which properties it can sell and at what price. The customer, or audience, is the employees of the real estate company who will decide which apartments to show. The problem statement is that the company needs to know whether it should put more effort into showing apartment A versus apartment B based on historical data. The scope is that I will use the in_sf column to keep only apartments in SF, along with the beds, bath, year_built, sqft, and price columns. The analysis will be a GLM fit with these variables to see how they influence price. My assumptions are that the more bathrooms and the larger the square footage, the more the apartment will sell for. I also think that very new apartments will sell for more. My objective is to identify the factors that influence price.

Goal 2: Model Critique

Load the data

library(tidyverse)

## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr     1.1.2     v readr     2.1.4
## v forcats   1.0.0     v stringr   1.5.1
## v ggplot2   3.5.1     v tibble    3.2.1
## v lubridate 1.9.2     v tidyr     1.3.0
## v purrr     1.0.1     
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)
library(ggrepel)

library(patchwork)
library(broom)
library(lindia)
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

options(scipen = 6)
theme_set(theme_minimal())

url_ <- "https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/apartments/apartments.csv"

apts <- read_delim(url_, delim = ",")

## Rows: 492 Columns: 8
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (8): in_sf, beds, bath, price, year_built, sqft, price_per_sqft, elevation
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Keep only the apartments in SF

sf_apts <- filter(apts, in_sf == 1)

Analysis 1: GLM

glm_model <- glm(price ~ beds + bath + year_built + sqft, data = sf_apts)

summary(glm_model)

## 
## Call:
## glm(formula = price ~ beds + bath + year_built + sqft, data = sf_apts)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2232003   -298802    -19220    209113   4010056  
## 
## Coefficients:
##                Estimate  Std. Error t value     Pr(>|t|)    
## (Intercept) -2077134.08  2242900.68  -0.926        0.355    
## beds         -336633.63    60779.84  -5.539 0.0000000739 ***
## bath           96975.95    88014.30   1.102        0.272    
## year_built      1034.52     1142.77   0.905        0.366    
## sqft            1282.78       80.31  15.974      < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 469906651483)
## 
##     Null deviance: 407240750230947  on 267  degrees of freedom
## Residual deviance: 123585449339964  on 263  degrees of freedom
## AIC: 7970.2
## 
## Number of Fisher Scoring iterations: 2

From this model, the variables with a statistically significant relationship to price are the number of beds and the square footage. This somewhat goes against what I initially assumed; I figured beds would play a part, but I assumed the number of bathrooms might be more important. The beds coefficient is also negative: holding square footage, baths, and year built constant, each additional bedroom is associated with a price about $337,000 lower, which may reflect that more bedrooms in the same square footage means smaller rooms. As someone who’s currently looking for a place to move, I’m less concerned about the number of bedrooms and more concerned about the ratio of bathrooms to bedrooms. For example, I’m looking for a 2 bedroom house, but would pay a little more if it meant I got 2 bathrooms instead of 1. I did assume that the square footage would play a large role, though, because more square footage means more space, and paying more for more space makes sense; here each additional square foot is associated with about $1,283 more in price.

I wouldn’t conclude that this model does a good job of predicting price. Two of the four variables are not significant, and the residuals range from about -$2.2 million to +$4.0 million, which suggests to me that the model may violate some of the assumptions of linear models (such as constant variance or normally distributed errors). To improve the model, I could remove the non-significant variables (bath and year_built), but then I don’t think the full story of why an apartment is priced the way it is would be captured.
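One detail about the glm() call in this section: because no family argument is given, R uses the gaussian family with an identity link, which is the same least-squares fit that lm() produces. The sketch below illustrates this on simulated data; the toy data frame and its numbers are invented for illustration and are not taken from apartments.csv.

```r
# Toy data, simulated purely for illustration (not the apartments dataset).
set.seed(1)
n <- 200
toy <- data.frame(
  beds = sample(1:4, n, replace = TRUE),
  sqft = round(runif(n, 400, 3000))
)
toy$price <- 50000 + 1200 * toy$sqft - 100000 * toy$beds +
  rnorm(n, sd = 200000)

# glm() with no family argument uses family = gaussian (identity link).
glm_fit <- glm(price ~ beds + sqft, data = toy)
lm_fit  <- lm(price ~ beds + sqft, data = toy)

# The two sets of coefficients should agree up to floating-point error.
same_coefs <- isTRUE(all.equal(coef(glm_fit), coef(lm_fit)))
print(same_coefs)
print(family(glm_fit)$family)
```

A GLM only behaves differently from linear regression once a non-default family or link is chosen, so that choice is where any real modeling difference would come from.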

Analysis 2: Linear Regression

lr_model <- lm(price ~ beds + bath + year_built + sqft, data= sf_apts)

summary(lr_model)

## 
## Call:
## lm(formula = price ~ beds + bath + year_built + sqft, data = sf_apts)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2232003  -298802   -19220   209113  4010056 
## 
## Coefficients:
##                Estimate  Std. Error t value     Pr(>|t|)    
## (Intercept) -2077134.08  2242900.68  -0.926        0.355    
## beds         -336633.63    60779.84  -5.539 0.0000000739 ***
## bath           96975.95    88014.30   1.102        0.272    
## year_built      1034.52     1142.77   0.905        0.366    
## sqft            1282.78       80.31  15.974      < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 685500 on 263 degrees of freedom
## Multiple R-squared:  0.6965, Adjusted R-squared:  0.6919 
## F-statistic: 150.9 on 4 and 263 DF,  p-value: < 2.2e-16

Looking at this model after the GLM, the coefficients, standard errors, and residuals are identical. That is expected: glm() with its default gaussian family fits the same least-squares model as lm(), so this is the same fit reported with different summary statistics rather than independent confirmation. The two factors with a significant relationship to price are still the number of beds and the square footage. The linear regression summary does add a few statistics that the GLM summary didn’t show. For one, the R-squared value is around 0.70, which means that around 70% of the variance in SF apartment prices is explained by this model. I think this is still a relatively high number considering only two of the four variables are statistically significant. Additionally, the F-statistic is large (150.9) with a very small p-value, which indicates that the model as a whole fits better than an intercept-only model; it does not mean this model is more accurate than the GLM, since they are the same fit. However, there are a large number of outliers, as can be seen from the residuals, so it’s still not a very strong model.

One issue with both the GLM and the linear regression model is that there is probably collinearity between beds and square footage. The more square footage an apartment has, the more likely it is to have more bedrooms (the likelihood of a very large apartment having one bedroom is very low). Collinearity tends to inflate standard errors and make individual coefficients unstable, and it may be part of why the beds coefficient comes out negative. To improve both of these models, I could check for collinearity (for example, with variance inflation factors) and then decide whether to remove either beds or sqft from the model.
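A collinearity check along these lines could be sketched as follows. The variance inflation factor (VIF) for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the others. The data here are simulated with beds deliberately tied to sqft, and vif_manual is a hypothetical helper written for this example, not a function from any package.

```r
# Toy data with deliberately correlated predictors
# (simulated for illustration, not taken from apartments.csv).
set.seed(2)
n <- 200
sqft <- runif(n, 400, 3000)
beds <- pmax(1, round(sqft / 700 + rnorm(n, sd = 0.5)))
year_built <- sample(1900:2015, n, replace = TRUE)
toy <- data.frame(beds, sqft, year_built)

# Hypothetical helper: VIF for predictor j is 1 / (1 - R^2_j), where R^2_j
# comes from regressing predictor j on all of the other predictors.
vif_manual <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    fit <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- vif_manual(toy)
print(round(vifs, 2))
```

Since car is already loaded in this notebook, calling vif(lr_model) on the fitted model should report the same kind of values for the real data. A common rule of thumb treats VIFs above roughly 5 to 10 as a sign of problematic collinearity, though that threshold is a convention rather than a strict test.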

Analysis 3: Visualization of distribution of price by year built

library(ggplot2)

ggplot(sf_apts, aes(x = year_built, y = price)) +
  geom_point(color = 'pink') +
  geom_smooth(method = "loess", color = "purple") +
  labs(title = "Year Built vs. Price", x = "Year Built", y = "Price") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Finally, looking at this visualization comparing the year built to the price of the apartment (with a LOESS smoothed trend line), I think my initial assumption that a newer apartment would cost more is incorrect. There doesn’t appear to be a trend - there are outliers for almost every year built, and some of the cheapest apartments were actually built in 2000 rather than in older years. The graph does not show a linear relationship between these two variables. A newer apartment might have more square footage, which could then affect the price, but year built on its own does not appear to affect price in any meaningful way (which the two previous models already suggested, but it seemed worth checking whether there was any trend at all).
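The visual impression could be backed up with a rank correlation, which only assumes a monotonic relationship and is less sensitive to outliers than a linear fit. The sketch below uses simulated data with no built-in relationship between the two columns; the toy data frame is an assumption for illustration and is not the apartments data.

```r
# Toy data, simulated for illustration only (not apartments.csv).
set.seed(3)
n <- 200
toy <- data.frame(
  year_built = sample(1900:2015, n, replace = TRUE),
  price = rlnorm(n, meanlog = 14, sdlog = 0.6)  # right-skewed, like prices
)

# Spearman's rank correlation between year built and price.
# exact = FALSE avoids warnings when there are tied values.
rho_test <- cor.test(toy$year_built, toy$price,
                     method = "spearman", exact = FALSE)

print(rho_test$estimate)
print(rho_test$p.value)
```

On the real data, the same call with sf_apts$year_built and sf_apts$price would give a single number to set beside the plot; a coefficient near zero would be consistent with the flat smoother.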

Goal 3: Ethical and Epistemological Concerns

Going off of the example that was provided initially in this document, it would be nice if there were more variables in this dataset, such as the neighborhood the apartment is in or whether it is rent controlled. The dataset also has no sale or listing date, so prices can’t be adjusted for inflation; the price of an apartment in 2000 is different from the price of that same apartment in 2024.

Considering risks or societal implications, and who this project is for, it’s possible that only showing apartments that would sell for the most money could reinforce existing biases, such as recommending only properties in wealthier neighborhoods to the company. This could be acceptable if that is their specific clientele, but even then the risk remains. I recently read a paper in another class about predictive analytics and how it unknowingly reinforces existing biases; this case is a little different, but the point still stands that it’s easy to reinforce bias even if it’s not explicitly part of the data.

There are also a number of factors that can’t be measured, either accurately or at all, in this data. One of the biggest is human preference, especially with real estate/apartment data. A lot more goes into whether a person will want to purchase or rent an apartment than just the beds/baths, price, and square footage. The location and what surrounds the apartment are important, as are the aesthetics of the unit and building and the amenities offered. Utilities might also be a concern, as might the atmosphere or demographics of the building.

The people impacted by this project are primarily the apartment buyers/renters. This affects my critique by making me more aware of what implicit biases might exist in the data, and by acknowledging that a lot more goes into what makes an apartment sellable than just the hard data points. The other party impacted by this project is the real estate agents and company; they likely don’t care about the specifics or aesthetics of the apartment, just how much they can sell it for and whether they can find a person to rent or buy it at that price.