What effect does the age of a home and its square footage have on its sales price in Monroe County, New York?
The sales price data were collected through governmental reporting. The price data come from the NYS Office of Real Property Services and are provided to the general public by Monroe County through its real property portal. The housing attributes are compiled by the NYS GIS Clearinghouse from data provided by local assessors.
This will be an observational study.
The response variable is the housing price, a quantitative variable. This study will use the latest sale price available. Since the housing attributes are as of 2017, only sales that occurred in 2016-2018 will be included (assuming the homes haven't changed much since 2017).
The two explanatory variables will be the square footage of the living space (quantitative) and the era in which the home was built (qualitative). These eras will follow the groups used by Trulia's Chief Economist in this article.
Monroe County Real Property Portal. (2018). [Database synced with NYS RPS on 22 October 2018]. Scraped from https://www.monroecounty.gov/etc/rp on 10 November 2018.
NYS GIS Clearinghouse. (2017). Monroe County Tax Parcel Centroids [Data file]. Retrieved from http://gis.ny.gov/gisdata/inventories/details.cfm?DSID=1300.
The Monroe County Real Property Portal has been scraped, and the data have been stored in an SQLite database.
library(DBI)
library(dplyr)

# Download the database
if(!file.exists("Monroe Real Property Data.db")){
  if(!file.exists("Monroe Real Property Data.db.tar.gz")){
    # Download this specific commit from my GitHub repo
    download.file("https://github.com/mikeasilva/CUNY-SPS/raw/9a2023b57a466550646c9190076a53d51ef09ef2/data/Monroe%20Real%20Property%20Data.db.tar.gz", "Monroe Real Property Data.db.tar.gz")
  }
  # Decompress the file
  untar("Monroe Real Property Data.db.tar.gz")
}
Now that the database is downloaded, it's time to read it in.
# Connect to the database
con <- dbConnect(RSQLite::SQLite(), "Monroe Real Property Data.db")

# Get the latest sale date for each property (SWIS + SBL) sold after 2015
results <- dbSendQuery(con, "SELECT MAX(sale_date) AS sale_date, SWIS, SBL
                             FROM sales_data
                             GROUP BY SWIS, SBL
                             HAVING sale_date > '2015-12-31'")
df <- dbFetch(results)
dbClearResult(results)
There are 22346 sales prices in the data frame.
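Counts like this one (and the others reported below) can be checked with a one-liner along these lines:

# Number of rows (latest sale per property) in the data frame
nrow(df)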
df <- dbReadTable(con, "sales_data") %>% # Pull all the records
  distinct() %>% # Remove any duplicates
  merge(df) %>% # Merge with the targeted sales observations
  rename(Price = price)
The count of observations is now 22392, so the merge introduced some duplication. I will drop all observations with more than one price.
df <- df %>%
  select(SWIS, SBL) %>% # Select the unique identifiers
  group_by(SWIS, SBL) %>%
  summarise(count = n()) %>% # Get the number of records
  ungroup() %>%
  filter(count == 1) %>% # Only keep the observations with one price
  select(-count) %>% # Drop the count
  merge(df) # Subset the price observations
There are 22300 observations in the data frame. Now that we have the dependent variable, it is time to add in the housing attributes.
# Add in the housing attributes
df <- dbReadTable(con, "property_info") %>% # Pull all the records
  distinct() %>% # Remove any duplicates
  merge(df) # Merge with the targeted sales observations

dbDisconnect(con)
Now that the housing attributes are added in, I will filter out odd observations. Some variables are coded as 0 when the data are missing.
df <- df %>% # First filter out some obviously weird data
  filter(YR_BLT > 0) %>%
  filter(SQFT_LIV > 0) %>%
  filter(NBR_KITCHN > 0) %>%
  filter(NBR_BEDRM > 0)
The last step is to convert the quantitative year-built variable into a qualitative era variable.
get_era <- function(year){
  if(year < 1940){
    return("1. Before 1940s")
  } else if(year < 1980){
    return("2. Post-War Boom")
  } else {
    return("3. Last 30+ Years")
  }
}
df <- df %>%
  rowwise() %>%
  mutate(Era = get_era(YR_BLT)) %>%
  ungroup() %>%
  mutate(Era = as.factor(Era)) %>%
  rename(Size = SQFT_LIV) %>%
  select(Price, Size, Era)
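As an aside, the rowwise() step can be avoided entirely; a vectorized sketch of the same transformation, using dplyr::case_when() on the pre-transformation data frame:

df <- df %>%
  mutate(Era = as.factor(case_when(
    YR_BLT < 1940 ~ "1. Before 1940s",   # same cutoffs as get_era()
    YR_BLT < 1980 ~ "2. Post-War Boom",
    TRUE          ~ "3. Last 30+ Years"
  ))) %>%
  rename(Size = SQFT_LIV) %>%
  select(Price, Size, Era)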
Each case is a residential property in Monroe County, New York. There are 22283 cases in the data set.
Here's the summary of the data:
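The output below was presumably produced with a call like:

# Summarize the three analysis variables
summary(df)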
Price Size Era
Min. : 2 Min. : 400 1. Before 1940s : 3201
1st Qu.: 112425 1st Qu.:1326 2. Post-War Boom :11886
Median : 147500 Median :1642 3. Last 30+ Years: 7196
Mean : 175464 Mean :1767
3rd Qu.: 212000 3rd Qu.:2060
Max. :11650000 Max. :9847
There are extreme price and size values on both ends, so some more filtering of the data will probably be needed. Let's explore the data with some visualizations.
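The original figure is not reproduced here. A scatter plot along these lines (ggplot2 assumed; the aesthetic choices are mine) would show the price-size relationship by era:

library(ggplot2)

# Sale price versus living area, colored by construction era
ggplot(df, aes(x = Size, y = Price, color = Era)) +
  geom_point(alpha = 0.3) +
  labs(x = "Living area (sq ft)", y = "Sale price ($)")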
This looks encouraging. I probably should remove the few high price outliers from the data set.
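The second figure is likewise not reproduced; a boxplot of price by era (an assumption about what was plotted) would match the discussion that follows:

# Distribution of sale price within each era
ggplot(df, aes(x = Era, y = Price)) +
  geom_boxplot() +
  labs(x = "Era", y = "Sale price ($)")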
This visualization is not very helpful because of the outliers. Let’s look at the same data in tabular form.
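A dplyr summary along these lines (a sketch, not the original chunk) would produce the table below:

# Price summary statistics by era
df %>%
  group_by(Era) %>%
  summarise(min = min(Price),
            mean = mean(Price),
            median = median(Price),
            max = max(Price))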
Era | min | mean | median | max |
---|---|---|---|---|
1. Before 1940s | 2 | 152151.5 | 119900 | 11650000 |
2. Post-War Boom | 2 | 144699.2 | 131000 | 5890753 |
3. Last 30+ Years | 2 | 236651.3 | 215625 | 1700000 |
Again, it looks like more work will need to be done on the price data to clean up records that reflect nominal transfers rather than the market value of a home. Is a home really worth $2.00?
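One possible next step (the $1,000 cutoff below is an assumed placeholder, not a researched threshold) would be to drop these nominal-price transfers:

# Remove sales that look like nominal transfers rather than market transactions
# (the 1000 threshold is an assumption for illustration)
df <- df %>%
  filter(Price >= 1000)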