Week 14 Data Dive

Josh Gaughan, Adam Kovar, and Lucas Tetrault

Goal 1: Business Scenario

Customer or Audience: We are a real estate consulting company dedicated to tech workforce housing initiatives in San Francisco and New York. We consult for the tech companies that hire us and sell to tech workers seeking housing.

Problem Statement: San Francisco and New York are two prominent tech hubs in the United States that are attractive to companies and tech workers. Our goal is to formulate a model that determines which aspects of an apartment predict its price in these two cities. This will allow us to tell prospective buyers whether a listed apartment is priced fairly given its features, and whether it fits their budget and preferences.

Scope: We will utilize all variables within the dataset (price, sqft, price_per_sqft, beds, bath, elevation, year_built, in_sf). Our analysis will test various models, including the Poisson regression used in the lab. Our primary assumption is that the dataset is representative of apartments that are relevant to these buyers.

Objective: Identify the variables that most significantly impact price and determine which models perform the best for our specific project.

Goal 2: Model Critique

Analysis 1:

Instead of using just one predictor as in the lab, we would like to consider all predictors. Using AIC to compare the models with and without the additional predictors, we see that the additional features improve the model: AIC decreases substantially. However, the Poisson model’s dispersion estimate (residual deviance divided by residual degrees of freedom) is over 77,000, far above the value of about 1 expected when the Poisson assumption holds. This will be addressed in Analysis 3.

# Poisson full model
pois_full = glm(price ~ I(sqrt(sqft)) + beds + bath + elevation + in_sf + year_built + price_per_sqft, data = apts, family = poisson(link = 'log'))

# Poisson Step Model
pois_step = step(pois_full, direction = "both", trace = 0)

cat("AIC of the stepwise Poisson model:",
    AIC(pois_step))

## AIC of the stepwise Poisson model: 37353507

cat("\nDispersion estimate of the stepwise Poisson model:",
    pois_step$deviance / pois_step$df.residual)

## 
## Dispersion estimate of the stepwise Poisson model: 77160.44
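
The deviance-based ratio above is one common dispersion estimate; a Pearson-based version is another. As a minimal sketch on simulated data (not the apts dataset; `pearson_dispersion` is a hypothetical helper name introduced here), the following illustrates how the statistic behaves when the Poisson assumption holds and when it does not:

```r
# Simulated stand-in data, NOT the apts dataset
set.seed(14)
n <- 500
x <- runif(n, 0, 2)
mu <- exp(1 + 0.8 * x)
y_pois <- rpois(n, lambda = mu)            # variance equals the mean
y_over <- rnbinom(n, mu = mu, size = 1.5)  # variance = mu + mu^2 / 1.5

# Hypothetical helper: Pearson chi-square statistic over residual df
pearson_dispersion <- function(fit) {
  sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
}

fit_pois <- glm(y_pois ~ x, family = poisson(link = "log"))
fit_over <- glm(y_over ~ x, family = poisson(link = "log"))

disp_pois <- pearson_dispersion(fit_pois)
disp_over <- pearson_dispersion(fit_over)

cat("Dispersion, Poisson data:      ", disp_pois, "\n")
cat("Dispersion, overdispersed data:", disp_over, "\n")
```

A value near 1 is consistent with the Poisson assumption, while values well above 1 suggest the standard errors from a Poisson fit are too optimistic.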

Analysis 2:

San Francisco and New York are two very different cities even if they are both appealing to tech-minded people. Maybe we should consider two separate Poisson models?

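AIC values are only comparable between models fit to the same observations, so the city-specific models cannot be ranked against the single pooled model on AIC alone. The dispersion estimates are the more informative check here: roughly 23,216 for San Francisco and 83,594 for New York, both far above 1. The overdispersion problem persists, which indicates that splitting the model by city isn’t our answer.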

apts_sf = filter(apts, in_sf == 1)
apts_ny = filter(apts, in_sf == 0)

# in_sf is constant within each city subset, so it is left out of these formulas
pois_sf = glm(price ~ I(sqrt(sqft)) + beds + bath + elevation + year_built + price_per_sqft, data = apts_sf, family = poisson(link = 'log'))

pois_ny = glm(price ~ I(sqrt(sqft)) + beds + bath + elevation + year_built + price_per_sqft, data = apts_ny, family = poisson(link = 'log'))

# AIC of each city-specific model, computed with AIC() rather than
# deviance / df.residual (that ratio is the dispersion estimate printed below)
cat("AIC of the SF Poisson Model:",
    AIC(pois_sf))

cat("\nAIC of the NY Poisson Model:",
    AIC(pois_ny))

cat("\nDispersion estimate of the SF Poisson model:",
    pois_sf$deviance / pois_sf$df.residual)

## 
## Dispersion estimate of the SF Poisson model: 23216.2

cat("\nDispersion estimate of the NY Poisson model:",
    pois_ny$deviance / pois_ny$df.residual)

## 
## Dispersion estimate of the NY Poisson model: 83594.21
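
One way to put a city split on the same footing as a pooled model is to note that, for a Poisson GLM, fitting the two cities separately is equivalent to one pooled model in which every term interacts with the city indicator. The sketch below uses simulated data; the `sim` data frame and its columns are assumptions standing in for `apts` and `in_sf`, not the real dataset:

```r
# Simulated stand-in data, NOT the apts dataset
set.seed(14)
n <- 400
city <- rbinom(n, 1, 0.5)  # hypothetical stand-in for in_sf
x <- runif(n, 0, 2)
y <- rpois(n, exp(1 + 0.5 * x + 0.4 * city + 0.3 * x * city))
sim <- data.frame(y, x, city)

# Pooled models, both fit to all rows, so their AICs are comparable
pooled_additive <- glm(y ~ x + city, data = sim, family = poisson)
pooled_interact <- glm(y ~ x * city, data = sim, family = poisson)

# Separate models, one per city
split_1 <- glm(y ~ x, data = subset(sim, city == 1), family = poisson)
split_0 <- glm(y ~ x, data = subset(sim, city == 0), family = poisson)

# Log-likelihoods and parameter counts add across disjoint subsets,
# so the summed AIC matches the fully interacted pooled model
aic_split_sum <- AIC(split_1) + AIC(split_0)

cat("AIC, pooled additive model:  ", AIC(pooled_additive), "\n")
cat("AIC, pooled interacted model:", AIC(pooled_interact), "\n")
cat("Sum of split-model AICs:     ", aic_split_sum, "\n")
```

Under this framing, the summed AIC of the split models can be compared against the pooled additive model, because both describe the same set of observations.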

Analysis 3:

From our previous analyses we learned that the Poisson model, with or without all of the predictors, is not the best model for our situation. We hope for a dispersion estimate of about 1, but the Poisson models we tested produced wildly large estimates, which makes their interpretations very questionable. A negative binomial model performs substantially better. It relaxes the Poisson constraint that the variance equals the mean by adding a dispersion parameter (theta), and it also uses a log link function. Its dispersion estimate of about 1.02 is close to our target, and its AIC (13,622) is drastically lower than the stepwise Poisson model’s (37,353,507).

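# glm.nb comes from the MASS package; calling it as MASS::glm.nb avoids masking dplyr::select
nbinomial = MASS::glm.nb(price ~ I(sqrt(sqft)) + beds + bath + elevation + in_sf + year_built + price_per_sqft, data = apts)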

cat("AIC of the Negative Binomial Model:",
    AIC(nbinomial))

## AIC of the Negative Binomial Model: 13622.37

cat("\nDispersion estimate of the Negative Binomial model:",
    nbinomial$deviance / nbinomial$df.residual)

## 
## Dispersion estimate of the Negative Binomial model: 1.02223
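
To illustrate what theta does, here is a minimal sketch on simulated data (not the apts dataset). In the parameterization used by `MASS::glm.nb`, the variance is mu + mu^2 / theta, so a small theta means strong overdispersion and a very large theta approaches the Poisson case. The true theta of 1.5 is an assumption chosen for the demonstration:

```r
library(MASS)  # provides glm.nb

# Simulated stand-in data, NOT the apts dataset
set.seed(14)
n <- 500
x <- runif(n, 0, 2)
mu <- exp(1 + 0.8 * x)
y <- rnbinom(n, mu = mu, size = 1.5)  # true theta = 1.5 (assumed for the demo)
sim <- data.frame(y, x)

# Same response and rows for both fits, so the AICs are comparable
fit_pois <- glm(y ~ x, data = sim, family = poisson(link = "log"))
fit_nb <- glm.nb(y ~ x, data = sim)

cat("Poisson AIC:          ", AIC(fit_pois), "\n")
cat("Negative binomial AIC:", AIC(fit_nb), "\n")
cat("Estimated theta:      ", fit_nb$theta, "\n")
```

When the data really are overdispersed, the negative binomial fit should show the lower AIC and an estimated theta in the neighborhood of the value used to simulate.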

Goal 3: Ethical and Epistemological Concerns

The primary concern for this analysis is the model used in the lab. The Poisson model and the interpretations drawn from it are questionable: the one-predictor Poisson model from the lab produces a dispersion estimate of over 764,000. Adding predictors improved this meaningfully, but the estimate was still far from the target of about 1. Switching to a negative binomial model yielded much better results.
Examining diagnostic plots for the various models, in addition to the numerical performance indicators, would increase confidence in the model.
There are potential biases in the data. We don’t know the sampling technique used for this dataset or whether patterns were forced onto the data, intentionally or not.
There are additional measurable variables that we would ideally consider beyond the few included in this dataset. There are also variables that aren’t measurable but still affect price.
The data we train our model on may be outdated and less useful by the time the model is actually used.
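
As one example of the kind of diagnostic plot mentioned above, the sketch below plots Pearson residuals against fitted values for a Poisson fit. It uses simulated overdispersed data rather than the apts dataset, and the ±2 reference band is only a rough rule of thumb, not a formal test:

```r
# Simulated stand-in data, NOT the apts dataset
set.seed(14)
n <- 300
x <- runif(n, 0, 2)
y <- rnbinom(n, mu = exp(1 + 0.8 * x), size = 1.5)

fit <- glm(y ~ x, family = poisson(link = "log"))

pearson_res <- residuals(fit, type = "pearson")
fitted_mu <- fitted(fit)

# Residuals vs fitted values, with a rough +/- 2 reference band
plot(fitted_mu, pearson_res,
     xlab = "Fitted mean", ylab = "Pearson residual",
     main = "Poisson fit on overdispersed data")
abline(h = c(-2, 0, 2), lty = c(2, 1, 2))

# Share of residuals outside the band; a well-specified model
# would be expected to leave only a small share out there
share_outside <- mean(abs(pearson_res) > 2)
cat("Share of |Pearson residual| > 2:", share_outside, "\n")
```

A fan shape or a large share of points outside the band would point to the same overdispersion problem that the numerical estimates flag.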

2026-04-23
