CUNY SPS DATA-606 Final Project
1. Introduction
Finding affordable housing is a real challenge throughout the country. The Rochester, New York metro is acclaimed fit its affordable housing. Recent data shows 81% of homes in the Rochester Metro are affordable (National Association of Home Builders 2018). The median sales price in Rochester is $146,000, or half of the national average ($268,000). With the relatively high family income, Rochester housing proves to be affordable.
There is considerable variety in the type of housing offered in Rochester. There are picturesque village homes built in the late 1800s. There are also suburban homes that are larger and newer. Can the square footage and age of a home estimate the sale price reliably?
2. Data
Data Collection
The NYS Office of Real Property Services (NYS RPS) collects sales price data. This administrative record data was scrapped using a web scrapper in early November of 2018 (Monroe County 2018). At that time, the data was as of 22 October 2018.
The NYS GIS clearinghouse compiles data on housing attributes (NYS GIS Clearinghouse 2017). This data originates from local assessors. The 2017 data (latest available) was used. The square footage of living space is the measurement of size in this study. It is a quantitative measure.
Data on the age of the home are transformed into eras (a categorical variable). These eras come from an article authored by Trulia’s Chief Economist. They are:
- Before 1940s
- Post-War Boom (1940 to 1979)
- Last 30+ Years (1980+)
This study is purely observational in nature. It does not assert these variables have a causal relationship.
Data Preparation
The scrapped data is in a SQLite database stored on GitHub. The first step is to download and extract this database locally.
library(DBI)
library(dplyr)
# Download the database
if(!file.exists("Monroe Real Property Data.db")){
if(!file.exists("Monroe Real Property Data.db.tar.gz")){
# Download this specific commit from my GitHub repo
download.file("https://github.com/mikeasilva/CUNY-SPS/raw/9a2023b57a466550646c9190076a53d51ef09ef2/data/Monroe%20Real%20Property%20Data.db.tar.gz", "Monroe Real Property Data.db.tar.gz")
}
# Decompress the file
untar("Monroe Real Property Data.db.tar.gz")
# Remove the file
unlink("Monroe Real Property Data.db.tar.gz")
}
Now that the data is downloaded it is time to pull the sales price. Each parcel is identified by a SWIS (Statewide Information System) and SBL (Section, Block, Lot) code. Using this identifier, the next step is to find the latest sale date.
# Connect to the database
con <- dbConnect(RSQLite::SQLite(), "Monroe Real Property Data.db")
# Get the date of the latest sale for each property
results <- dbSendQuery(con, "SELECT MAX(sale_date) AS sale_date, SWIS, SBL FROM sales_data GROUP BY SWIS, SBL HAVING sale_date BETWEEN '2017-01-01' AND '2017-12-31'")
df <- dbFetch(results)
dbClearResult(results)
Now the sale price can be pulled. There should be 8,444 sale prices in the data set after the next step.
df <- dbReadTable(con, "sales_data") %>% # Pull all the records
distinct() %>% # remove any duplicates
merge(df) %>% # Merge with the latest sale dates
rename(Price = price)
The observations total 8,460. It appears the merging process introduced some duplication into the data set. This needs to be corrected. Any parcel with more than one record will be dropped from the dataset.
df <- df %>%
select(SWIS, SBL) %>% # Select the unique identifiers
group_by(SWIS, SBL) %>%
summarise(count = n()) %>% # Get the number of records
ungroup() %>%
filter(count == 1) %>% # Only keep the observations with one price
select(-count) %>% # Drop the count variable
merge(df) # Merge it with the original data frame to subset it
This leaves 8,428 unduplicated observations in the data set. Now that we have the outcome variable, it is time to add in the housing attributes.
# Add in the housing attributes
df <- dbReadTable(con, "property_info") %>% # Pull all the records
distinct() %>% # remove any duplicates
merge(df) # Merge with the sale price
# Disconnect from the database
dbDisconnect(con)
There are 8,428 observations in the data set. This indicates the merging process did not introduce any duplication. Now that the housing attributes are added in, some odd observations need to be filtered out.
df <- df %>% # First filter out some obviously weird data
filter(YR_BLT > 0) %>%
filter(SQFT_LIV > 0) %>%
filter(NBR_KITCHN > 0) %>%
filter(NBR_BEDRM > 0)
After filtering there are 8,421 observations. This is the final count of cases for the study. The next step is to translate the year of construction into an era.
get_era <- function(year){
if(year < 1940){
return("1. Before 1940s")
} else if(year < 1980){
return("2. Post-War Boom")
}else{
return("3. Last 30+ Years")
}
}
df <- df %>%
rowwise() %>%
mutate(Era = get_era(YR_BLT)) %>%
ungroup() %>%
mutate(Era = as.factor(Era))
The last data preparation step is renaming and selecting the variables of interest.
df <- df %>%
rename(Size = SQFT_LIV) %>%
select(Price, Size, Era)
Sanity Check
The American Community survey asks respondents about their home value. Assuming the sale price is comparable to the home value, how well does the study data represent the housing stock? The following figure takes the study data and compares with with Monroe County’s housing stock (U.S. Census Bureau 2017).
Figure 1. Percent of Homes by Price Groups and Data Set
The study data is generally fairly representative. It under represent the two highest and two lowest categories. It also over represent home in the $100,000 to $149,999 price range. This could be due to different segments of the housing stock “moving” faster/more than other segments.
This data distribution depicted above raises some concern. The data appear skewed. The following density plot illustrates the skew. The x-axis is stopped at half a million for readability, even though the largest sale price in the data set is 11.5 million dollars.
Figure 2: Sale Price Distribution
Table 1: Preliminary Price Summary
Metric | Value |
---|---|
Min. | 2.0 |
1st Qu. | 113500.0 |
Median | 148000.0 |
Mean | 176589.1 |
3rd Qu. | 212500.0 |
Max. | 11504800.0 |
This confirms there are extreme housing prices in the data set. The data is also not normal. To correct this, the price will be put on a log scale and outliers will be removed.
# Log transformation
df <- df %>%
mutate(Log_Price = log(Price))
# Remove the outliers
remove_outliers <- function(x, na.rm = TRUE, ...) {
qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
H <- 1.5 * IQR(x, na.rm = na.rm)
y <- x
y[x < (qnt[1] - H)] <- NA
y[x > (qnt[2] + H)] <- NA
y
}
valid_prices <- unique(remove_outliers(df$Log_Price))
min_price <- min(valid_prices, na.rm=TRUE)
max_price <- max(valid_prices, na.rm=TRUE)
df <- df %>%
filter(Log_Price > min_price) %>%
filter(Log_Price < max_price)
Now that the outliers have been removed we have 8,144 observations in our data set. Here is what the log of the prices looks like:
Figure 3. Log Price Distribution
Study Data
As previously stated there are 8,144 observations in this data set. Each case represents the most recent sale price for a residential property occuring in 2017, ranging in price from $45,000 to $536,000. On a log scale these prices range from $10.71442 to $13.19189.
3. Exploratory Data Analysis
As mentioned in the introduction, the median sale price for a home in the Rochester Metro is $146,000 (National Association of Home Builders 2018). The median sale price in the study dataset is $149,900 which is comparable.
This study posits the size and era a home was built are reliable predictors of the price (on a log scale). The following scatter plot illustrates the relationship observed in the data.
Figure 4: Log Sales Price and Size by Era
It is clear that price and size are related. The relationship between price and era is not as clear The following visualization and summary table examines this relationship further:
Figure 5: Log Sales Price by Era
Table 2: Log Sales Price by Era Summary
Era | min | mean | median | max |
---|---|---|---|---|
|
10.71442 | 11.73829 | 11.69525 | 13.18063 |
|
10.71442 | 11.81259 | 11.78295 | 13.19002 |
|
10.75790 | 12.27014 | 12.27839 | 13.19189 |
The older houses tend to fetch less in the market than the newer houses. There also does not seem to be much difference between the pre and post-war housing.
Checking for Multicollinearity
The two explanatory variables might be related. It is reasonable to believe that homes increased in size over time. The plot below will display how the size and era are related.
Figure 6: Size by Era
Table 3: Size by Era Summary
Era | min | mean | median | max |
---|---|---|---|---|
|
634 | 1657.584 | 1540 | 6454 |
|
544 | 1615.005 | 1537 | 5489 |
|
865 | 1997.098 | 1896 | 5648 |
The newer houses are generally larger than the older houses. This trend does not seem as pronounced as I expected it would.
4. Inference
In order to evaluate how well the model generalizes the data will be split into training/test sets using an 80/20 split.
library(caret)
set.seed(12345)
in_training <- createDataPartition(df$Log_Price, p = 0.8, list = FALSE)
training <- df[ in_training, ]
testing <- df[-in_training, ]
There are 6,517 observations in the training data set. With the data split, the OLS line can be fit to the data.
fit <- lm(Log_Price ~ Size + Era , training)
This is the summary of the model:
summary(fit)
Call:
lm(formula = Log_Price ~ Size + Era, data = training)
Residuals:
Min 1Q Median 3Q Max
-1.77588 -0.15563 0.01354 0.17986 1.32777
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.091e+01 1.450e-02 752.378 <2e-16 ***
Size 4.943e-04 6.410e-06 77.114 <2e-16 ***
Era2. Post-War Boom 1.023e-01 1.098e-02 9.314 <2e-16 ***
Era3. Last 30+ Years 3.691e-01 1.182e-02 31.237 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2902 on 6513 degrees of freedom
Multiple R-squared: 0.6014, Adjusted R-squared: 0.6012
F-statistic: 3275 on 3 and 6513 DF, p-value: < 2.2e-16
This model’s coefficients are a little hard to interpret given that they are on a log scale. Yet one observes that all the coefficients are statistically significant. The square footage of living space (size) is positively correlated with the home price. Also the later the home is built is positively correlated with the price. Only 60% of the variability in the log of the sales price is explained by these two variables.
Residual Examination
Figure 7: Residuals Scatterplot
The above figure is the residuals ordered by their index. There is no clear pattern in this figure.
Figure 7: Residuals Q to Q
The Q to Q plot above indicates that the residuals are more linear near the middle of the distribution. But there are some major problems in the tails.
Generalization
In order to assess how well this model generalizes, predictions for the testing data set will be made and compared to the actual.
predictions <- predict(fit, testing, interval = "confidence")
testing$Log_Price_hat <- predictions[, 1]
testing$Price_hat <- exp(testing$Log_Price_hat)
testing$Log_Price_hat_lower <- predictions[, 2]
testing$Price_hat_lower <- exp(testing$Log_Price_hat_lower)
testing$Log_Price_hat_upper <- predictions[, 3]
testing$Price_hat_upper <- exp(testing$Log_Price_hat_upper)
Let’s visualize the predictions vs the actual sales prices. In the following plot the sales price is on the x-axis and the prediction is on the y-axis. Values are per one thousand. The black dashed line represents where the prediction and the actual values are the same. The blue are under estimates, red are over estimates, and the green points have a price that falls within the prediction’s confidence interval.
Figure 8: Test Set Actual vs Model Prediction (in thousands)
There are 836 cases where the model over estimates the home price, and 730 where it under estimates the home price. 61 are in within the confidence interval. Let’s examine to what degree it over/under estimates. The model does not seem to accurately depict reality.
testing <- testing %>%
mutate(Error = Price_hat - Price)
Figure 9: Prediction Error Distribution
Table 4: Prediction Error Summary
Metric | Value |
---|---|
Min. | -319779.752 |
1st Qu. | -30638.761 |
Median | -3338.245 |
Mean | -7323.755 |
3rd Qu. | 21743.421 |
Max. | 416150.010 |
There is a wide variety in the errors. Let’s look at the distribution of the error in absolute terms expressed as a share of the total price. It would be nice to see errors in the 5% range
testing <- testing %>%
mutate(Error_Share = (abs(Error) / Price) * 100)
5. Conclusion
This study set out to see if the size of a home and the era in which it was built could be used to get an estimate of the sale price. While these variables seem to be relevant, predictions made only using the variables are largely inaccurate. It appears that the relationship is more complicated. Future research could expand the model. A location variable would be a next step along with the size of the plot. A different modeling approach should be employed as the relationship appears to be non-linear.
6. References
Monroe County. 2018. “Real Property Portal.” https://www.monroecounty.gov/etc/rp.
National Association of Home Builders. 2018. “Housing Opportunity Index [Data File].” https://www.nahb.org/-/media/Sites/NAHB/research/housing-economics/housing-indexes/housing-opportunity-index/20181108/7-rank-by-region-q318.ashx?la=en&hash=950950B1B28EFCEF56535E21E81E9118A7B1638A.
NYS GIS Clearinghouse. 2017. “Monroe County Tax Parcel Centroids [Data File].” http://gis.ny.gov/gisdata/inventories/details.cfm?DSID=1300.
U.S. Census Bureau. 2017. “American Community Survey 1-Year Estimates.” https://factfinder.census.gov/.