CUNY SPS DATA-606 Final Project

1. Introduction

Finding affordable housing is a real challenge throughout the country. The Rochester, New York metro, however, is acclaimed for its affordable housing. Recent data show that 81% of homes in the Rochester metro are affordable (National Association of Home Builders 2018). The median sales price in Rochester is $146,000, roughly half the national average of $268,000. Combined with relatively high family incomes, this makes Rochester housing affordable.

There is considerable variety in the type of housing offered in Rochester, from picturesque village homes built in the late 1800s to larger, newer suburban homes. Can the square footage and age of a home reliably estimate its sale price?

2. Data

Data Collection

The NYS Office of Real Property Services (NYS RPS) collects sales price data. These administrative records were scraped from the county's web portal in early November of 2018 (Monroe County 2018). At that time, the data were current as of 22 October 2018.

The NYS GIS Clearinghouse compiles data on housing attributes (NYS GIS Clearinghouse 2017). This data originates with local assessors. The 2017 data, the latest available, was used. The square footage of living space, a quantitative measure, serves as the measure of size in this study.

Data on the age of the home are transformed into eras (a categorical variable). These eras come from an article authored by Trulia’s Chief Economist. They are:

  1. Before 1940s
  2. Post-War Boom (1940 to 1979)
  3. Last 30+ Years (1980+)

This study is purely observational in nature. It does not assert these variables have a causal relationship.

Data Preparation

The scraped data is in a SQLite database stored on GitHub. The first step is to download and extract this database locally.

library(DBI)
library(dplyr)

# Download the database
if(!file.exists("Monroe Real Property Data.db")){
  if(!file.exists("Monroe Real Property Data.db.tar.gz")){
    # Download this specific commit from my GitHub repo
    download.file("https://github.com/mikeasilva/CUNY-SPS/raw/9a2023b57a466550646c9190076a53d51ef09ef2/data/Monroe%20Real%20Property%20Data.db.tar.gz", "Monroe Real Property Data.db.tar.gz")
  }
  # Decompress the file
  untar("Monroe Real Property Data.db.tar.gz")
  # Remove the file
  unlink("Monroe Real Property Data.db.tar.gz")
}

Now that the data is downloaded, it is time to pull the sale prices. Each parcel is identified by a SWIS (Statewide Information System) and SBL (Section, Block, Lot) code. Using this identifier, the next step is to find the latest sale date for each parcel.

# Connect to the database
con <- dbConnect(RSQLite::SQLite(), "Monroe Real Property Data.db")

# Get the date of the latest sale for each parcel, keeping only those
# whose latest sale occurred in 2017
results <- dbSendQuery(con, "SELECT MAX(sale_date) AS sale_date, SWIS, SBL FROM sales_data GROUP BY SWIS, SBL HAVING sale_date BETWEEN '2017-01-01' AND '2017-12-31'")
df <- dbFetch(results) 

dbClearResult(results)

Now the sale price can be pulled. There should be 8,444 sale prices in the data set after the next step.

df <- dbReadTable(con, "sales_data") %>% # Pull all the records
  distinct() %>% # remove any duplicates
  merge(df) %>% # Merge with the latest sale dates
  rename(Price = price)

However, the observations total 8,460; it appears the merging process introduced some duplication into the data set. To correct this, any parcel with more than one record will be dropped from the data set.

df <- df %>%
  select(SWIS, SBL) %>% # Select the unique identifiers
  group_by(SWIS, SBL) %>% 
  summarise(count = n()) %>% # Get the number of records
  ungroup() %>%
  filter(count == 1) %>% # Only keep the observations with one price
  select(-count) %>% # Drop the count variable
  merge(df) # Merge it with the original data frame to subset it

This leaves 8,428 unduplicated observations in the data set. Now that we have the outcome variable, it is time to add in the housing attributes.

# Add in the housing attributes
df <- dbReadTable(con, "property_info") %>% # Pull all the records
  distinct() %>% # remove any duplicates
  merge(df) # Merge with the sale price

# Disconnect from the database
dbDisconnect(con)

There are 8,428 observations in the data set. This indicates the merging process did not introduce any duplication. Now that the housing attributes are added in, some odd observations need to be filtered out.

df <- df %>% # First filter out some obviously weird data
  filter(YR_BLT > 0) %>%
  filter(SQFT_LIV > 0) %>%
  filter(NBR_KITCHN > 0) %>%
  filter(NBR_BEDRM > 0)

After filtering there are 8,421 observations. This is the final count of cases for the study. The next step is to translate the year of construction into an era.

get_era <- function(year){
  if(year < 1940){
    return("1. Before 1940s")
  } else if(year < 1980){
    return("2. Post-War Boom")
  } else {
    return("3. Last 30+ Years")
  }
}

df <- df %>% 
  rowwise() %>%
  mutate(Era = get_era(YR_BLT)) %>%
  ungroup() %>%
  mutate(Era = as.factor(Era))

The last data preparation step is renaming and selecting the variables of interest.

df <- df %>%
  rename(Size = SQFT_LIV) %>%
  select(Price, Size, Era)

Sanity Check

The American Community Survey asks respondents about their home value. Assuming the sale price is comparable to the home value, how well does the study data represent the housing stock? The following figure compares the study data with Monroe County's housing stock (U.S. Census Bureau 2017).

Figure 1: Percent of Homes by Price Groups and Data Set

The study data is generally representative. It underrepresents the two highest and the two lowest categories, and it overrepresents homes in the $100,000 to $149,999 price range. This could be because some segments of the housing stock turn over faster than others.
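A minimal sketch of how the study-side shares can be computed; the bin edges and labels below are my assumptions about the ACS value categories, not taken from the published table:

# Bin the study prices into value groups and compute the share in each
# (breaks and labels are illustrative assumptions)
df %>%
  mutate(Group = cut(Price,
                     breaks = c(0, 50000, 100000, 150000, 200000,
                                300000, 500000, Inf),
                     labels = c("Under $50k", "$50k-$99k", "$100k-$149k",
                                "$150k-$199k", "$200k-$299k",
                                "$300k-$499k", "$500k+"),
                     right = FALSE)) %>%
  count(Group) %>%
  mutate(Share = n / sum(n) * 100)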

The distribution in Figure 1 raises some concern: the data appear skewed. The following density plot illustrates the skew. The x-axis is truncated at half a million dollars for readability, even though the largest sale price in the data set is $11.5 million.
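A sketch of the density plot, assuming ggplot2; coord_cartesian() truncates the axis without dropping the observations above the cutoff:

library(ggplot2)

# Density of sale prices with the x-axis truncated at $500,000
ggplot(df, aes(x = Price)) +
  geom_density() +
  coord_cartesian(xlim = c(0, 500000))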

Figure 2: Sale Price Distribution

Table 1: Preliminary Price Summary

Metric    Value ($)
Min.      2.0
1st Qu.   113,500.0
Median    148,000.0
Mean      176,589.1
3rd Qu.   212,500.0
Max.      11,504,800.0

This confirms there are extreme housing prices in the data set, and the distribution is not normal. To correct this, the price will be put on a log scale and outliers will be removed.

# Log transformation
df <- df %>%
  mutate(Log_Price = log(Price))

# Remove the outliers
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

# Find the range of the non-outlier log prices
valid_prices <- unique(remove_outliers(df$Log_Price))
min_price <- min(valid_prices, na.rm=TRUE)
max_price <- max(valid_prices, na.rm=TRUE)

# Keep only observations strictly between those extremes
df <- df %>%
  filter(Log_Price > min_price) %>%
  filter(Log_Price < max_price)

Now that the outliers have been removed, there are 8,144 observations in the data set. Here is what the log of the prices looks like:

Figure 3: Log Price Distribution

Study Data

As previously stated, there are 8,144 observations in this data set. Each case represents the most recent sale price for a residential property occurring in 2017, ranging in price from $45,000 to $536,000 (10.71442 to 13.19189 on the log scale).
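These figures can be checked directly against the prepared data frame:

nrow(df)            # 8144
range(df$Price)     # 45000 536000
range(df$Log_Price) # 10.71442 13.19189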

3. Exploratory Data Analysis

As mentioned in the introduction, the median sale price for a home in the Rochester Metro is $146,000 (National Association of Home Builders 2018). The median sale price in the study dataset is $149,900 which is comparable.

This study posits that the size of a home and the era in which it was built are reliable predictors of its price (on a log scale). The following scatter plot illustrates the relationship observed in the data.

Figure 4: Log Sales Price and Size by Era

It is clear that price and size are related. The relationship between price and era is not as clear. The following visualization and summary table examine this relationship further:
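A sketch of how the box plot and summary statistics could be produced (ggplot2 assumed for the plot):

library(ggplot2)

# Box plot of log sale price by era (Figure 5)
ggplot(df, aes(x = Era, y = Log_Price)) +
  geom_boxplot()

# Summary statistics behind Table 2
df %>%
  group_by(Era) %>%
  summarise(min = min(Log_Price),
            mean = mean(Log_Price),
            median = median(Log_Price),
            max = max(Log_Price))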

Figure 5: Log Sales Price by Era

Table 2: Log Sales Price by Era Summary

Era                 min       mean      median    max
1. Before 1940s     10.71442  11.73829  11.69525  13.18063
2. Post-War Boom    10.71442  11.81259  11.78295  13.19002
3. Last 30+ Years   10.75790  12.27014  12.27839  13.19189

The older houses tend to fetch less in the market than the newer houses. There also does not seem to be much difference between the pre- and post-war housing.

Checking for Multicollinearity

The two explanatory variables might be related; it is reasonable to believe that homes have increased in size over time. The plot below displays how size and era are related.

Figure 6: Size by Era

Table 3: Size by Era Summary (square feet)

Era                 min  mean      median  max
1. Before 1940s     634  1657.584  1540    6454
2. Post-War Boom    544  1615.005  1537    5489
3. Last 30+ Years   865  1997.098  1896    5648

The newer houses are generally larger than the older houses, though the trend is not as pronounced as I expected.

4. Inference

In order to evaluate how well the model generalizes, the data will be split into training and test sets using an 80/20 split.

library(caret)
set.seed(12345)
in_training <- createDataPartition(df$Log_Price, p = 0.8, list = FALSE)
training <- df[ in_training, ]
testing  <- df[-in_training, ]

There are 6,517 observations in the training data set. With the data split, the OLS line can be fit to the data.

fit <- lm(Log_Price ~ Size + Era, data = training)

This is the summary of the model:

summary(fit)

Call:
lm(formula = Log_Price ~ Size + Era, data = training)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.77588 -0.15563  0.01354  0.17986  1.32777 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          1.091e+01  1.450e-02 752.378   <2e-16 ***
Size                 4.943e-04  6.410e-06  77.114   <2e-16 ***
Era2. Post-War Boom  1.023e-01  1.098e-02   9.314   <2e-16 ***
Era3. Last 30+ Years 3.691e-01  1.182e-02  31.237   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2902 on 6513 degrees of freedom
Multiple R-squared:  0.6014,    Adjusted R-squared:  0.6012 
F-statistic:  3275 on 3 and 6513 DF,  p-value: < 2.2e-16

This model's coefficients are a little hard to interpret given that they are on a log scale, yet all of them are statistically significant. The square footage of living space (Size) is positively associated with the home price, as is a later construction era. Still, only 60% of the variability in the log of the sales price is explained by these two variables.
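Because the outcome is the log of the price, exponentiating a coefficient gives its approximate multiplicative effect on the price; a quick way to see this:

# Express the coefficients as approximate percent effects on price
round((exp(coef(fit)) - 1) * 100, 2)

By this reading, each additional square foot is associated with roughly a 0.05% higher price, a Post-War Boom home sells for about 11% more than a comparable pre-1940 home, and a home from the last 30+ years sells for about 45% more.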

Residual Examination
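The diagnostic plots below can be reproduced along these lines (a sketch using base graphics; the labels are mine):

# Residuals ordered by their index
plot(residuals(fit), xlab = "Index", ylab = "Residual")

# Normal Q-Q plot of the residuals
qqnorm(residuals(fit))
qqline(residuals(fit))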

Figure 7: Residuals Scatterplot

The figure above shows the residuals ordered by their index. There is no clear pattern.

Figure 8: Residuals Q-Q Plot

The Q-Q plot above indicates that the residuals follow the reference line near the middle of the distribution, but there are major departures in the tails.

Generalization

In order to assess how well this model generalizes, predictions for the testing data set will be made and compared to the actual.

# Predict on the testing set, with a confidence interval around each fit
predictions <- predict(fit, testing, interval = "confidence")
# Point estimates, on the log scale and in dollars
testing$Log_Price_hat <- predictions[, 1]
testing$Price_hat <- exp(testing$Log_Price_hat)
# Lower and upper bounds of the interval, on both scales
testing$Log_Price_hat_lower <- predictions[, 2]
testing$Price_hat_lower <- exp(testing$Log_Price_hat_lower)
testing$Log_Price_hat_upper <- predictions[, 3]
testing$Price_hat_upper <- exp(testing$Log_Price_hat_upper)

Let's visualize the predictions against the actual sales prices. In the following plot the sale price is on the x-axis and the prediction is on the y-axis; values are in thousands of dollars. The black dashed line represents where the prediction and the actual value are equal. Blue points are underestimates, red points are overestimates, and green points have a price that falls within the prediction's confidence interval.

Figure 9: Test Set Actual vs Model Prediction (in thousands)

There are 836 cases where the model overestimates the home price and 730 where it underestimates it; only 61 actual prices fall within the confidence interval. The model does not seem to depict reality accurately.
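These counts can be tallied by classifying each test case against the prediction bounds (a sketch; the outcome labels are mine):

# Classify each test case relative to its interval
testing %>%
  mutate(Outcome = case_when(
    Price >= Price_hat_lower & Price <= Price_hat_upper ~ "Within interval",
    Price_hat > Price ~ "Overestimate",
    TRUE ~ "Underestimate"
  )) %>%
  count(Outcome)

Next, let's examine to what degree the model over- or underestimates.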

testing <- testing %>%
  mutate(Error = Price_hat - Price)

Figure 10: Prediction Error Distribution

Table 4: Prediction Error Summary

Metric    Value ($)
Min.      -319,779.752
1st Qu.   -30,638.761
Median    -3,338.245
Mean      -7,323.755
3rd Qu.   21,743.421
Max.      416,150.010

The errors vary widely. Let's look at the distribution of the error in absolute terms, expressed as a share of the actual price. It would be encouraging to see errors within the 5% range.

testing <- testing %>%
  mutate(Error_Share = (abs(Error) / Price) * 100)

Figure 11: Cumulative Distribution of Error Share (5% cutoff in red)

Only 15% of the testing set observations have an error within 5% of the actual price. Once again the model appears to be a poor predictor.
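That share can be computed directly from the error shares:

# Percent of test-set predictions within 5% of the actual price
mean(testing$Error_Share <= 5) * 100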

5. Conclusion

This study set out to see if the size of a home and the era in which it was built could be used to estimate the sale price. While these variables are clearly relevant, predictions made using them alone are largely inaccurate; the relationship appears to be more complicated. Future research could expand the model: a location variable would be a natural next step, along with the size of the lot. A different modeling approach should also be considered, as the relationship appears to be non-linear.

6. References

Monroe County. 2018. “Real Property Portal.” https://www.monroecounty.gov/etc/rp.

NYS GIS Clearinghouse. 2017. “Monroe County Tax Parcel Centroids [Data File].” http://gis.ny.gov/gisdata/inventories/details.cfm?DSID=1300.

U.S. Census Bureau. 2017. “American Community Survey 1-Year Estimates.” https://factfinder.census.gov/.

Mike Silva

November 28, 2018