What effect does the age of a home and its square footage have on its sales price in Monroe County, New York?
The sales price data were collected through governmental reporting. The price data come from the NYS Office of Real Property Services and are provided to the general public by Monroe County through its real property portal. The housing attributes are compiled by the NYS GIS Clearinghouse from data provided by local assessors.
This will be an observational study.
The response variable is the housing price, a quantitative variable. This study will use the latest sale price available. Since the housing attributes are as of 2017, only sales that occurred in 2016-2018 will be included (assuming the homes haven't changed much since 2017).
The two explanatory variables will be the square footage of the living space (quantitative) and the era in which the home was built (qualitative). These eras will follow the groups used by Trulia's Chief Economist in this article.
Monroe County Real Property Portal. (2018). [Database synced with NYS RPS on 22 October 2018]. Scraped from https://www.monroecounty.gov/etc/rp on 10 November 2018.
NYS GIS Clearinghouse. (2017). Monroe County Tax Parcel Centroids [Data file]. Retrieved from http://gis.ny.gov/gisdata/inventories/details.cfm?DSID=1300.
The Monroe County Real Property Portal has been scraped, and the data have been stored in an SQLite database.
library(DBI)
library(dplyr)

# Download the database
if(!file.exists("Monroe Real Property Data.db")){
  if(!file.exists("Monroe Real Property Data.db.tar.gz")){
    # Download this specific commit from my GitHub repo
    download.file("https://github.com/mikeasilva/CUNY-SPS/raw/9a2023b57a466550646c9190076a53d51ef09ef2/data/Monroe%20Real%20Property%20Data.db.tar.gz", "Monroe Real Property Data.db.tar.gz")
  }
  # Decompress the file
  untar("Monroe Real Property Data.db.tar.gz")
}
Now that the database is downloaded, it's time to read it in.
# Connect to the database
con <- dbConnect(RSQLite::SQLite(), "Monroe Real Property Data.db")

# Get the latest sale date for each property (SWIS + SBL) sold after 2015
results <- dbSendQuery(con, "SELECT MAX(sale_date) AS sale_date, SWIS, SBL
                             FROM sales_data
                             GROUP BY SWIS, SBL
                             HAVING sale_date > '2015-12-31'")
df <- dbFetch(results)
dbClearResult(results)
There are 22346 sales prices in the data frame.
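Counts like this one (and the others reported below) can be checked with a one-liner along these lines:

# Number of rows (latest sale per property) in the data frame
nrow(df)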
df <- dbReadTable(con, "sales_data") %>% # Pull all the records
  distinct() %>% # Remove any duplicates
  merge(df) %>% # Merge with the targeted sales observations
  rename(Price = price)
The count of observations is now 22392, so the merge introduced some duplication. I will drop all observations with more than one price.
df <- df %>%
  select(SWIS, SBL) %>% # Select the unique identifiers
  group_by(SWIS, SBL) %>%
  summarise(count = n()) %>% # Get the number of records
  ungroup() %>%
  filter(count == 1) %>% # Only keep the observations with one price
  select(-count) %>% # Drop the count
  merge(df) # Subset the price observations
There are 22300 observations in the data frame. Now that we have the dependent variable, it is time to add in the housing attributes.
# Add in the housing attributes
df <- dbReadTable(con, "property_info") %>% # Pull all the records
  distinct() %>% # Remove any duplicates
  merge(df) # Merge with the targeted sales observations

dbDisconnect(con)
Now that the housing attributes are added in, I will filter out odd observations. Some variables are coded as 0 when the data are missing.
df <- df %>% # First filter out some obviously weird data
  filter(YR_BLT > 0) %>%
  filter(SQFT_LIV > 0) %>%
  filter(NBR_KITCHN > 0) %>%
  filter(NBR_BEDRM > 0)
The last step is to convert the quantitative year-built variable into a qualitative era variable.
get_era <- function(year){
  if(year < 1940){
    return("1. Before 1940s")
  } else if(year < 1980){
    return("2. Post-War Boom")
  } else {
    return("3. Last 30+ Years")
  }
}
df <- df %>%
  rowwise() %>%
  mutate(Era = get_era(YR_BLT)) %>%
  ungroup() %>%
  mutate(Era = as.factor(Era)) %>%
  rename(Size = SQFT_LIV) %>%
  select(Price, Size, Era)
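As an aside, the rowwise() step can be avoided entirely; a vectorized sketch of the same transformation, using dplyr::case_when() on the pre-transformation data frame:

df <- df %>%
  mutate(Era = as.factor(case_when(
    YR_BLT < 1940 ~ "1. Before 1940s",   # same cutoffs as get_era()
    YR_BLT < 1980 ~ "2. Post-War Boom",
    TRUE          ~ "3. Last 30+ Years"
  ))) %>%
  rename(Size = SQFT_LIV) %>%
  select(Price, Size, Era)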
Each case is a residential property in Monroe County, New York. There are 22283 cases in the data set.
Here's the summary of the data:
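The output below was presumably produced with a call like:

# Summarize the three analysis variables
summary(df)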
Price Size Era
Min. : 2 Min. : 400 1. Before 1940s : 3201
1st Qu.: 112425 1st Qu.:1326 2. Post-War Boom :11886
Median : 147500 Median :1642 3. Last 30+ Years: 7196
Mean : 175464 Mean :1767
3rd Qu.: 212000 3rd Qu.:2060
Max. :11650000 Max. :9847
There are extreme price and size values on both ends, so some more filtering of the data will probably be needed. Let's explore the data with some visualizations.
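The original figure is not reproduced here. A scatter plot along these lines (ggplot2 assumed; the aesthetic choices are mine) would show the price-size relationship by era:

library(ggplot2)

# Sale price versus living area, colored by construction era
ggplot(df, aes(x = Size, y = Price, color = Era)) +
  geom_point(alpha = 0.3) +
  labs(x = "Living area (sq ft)", y = "Sale price ($)")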
This looks encouraging. I probably should remove the few high price outliers from the data set.
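The second figure is likewise not reproduced; a boxplot of price by era (an assumption about what was plotted) would match the discussion that follows:

# Distribution of sale price within each era
ggplot(df, aes(x = Era, y = Price)) +
  geom_boxplot() +
  labs(x = "Era", y = "Sale price ($)")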
This visualization is not very helpful because of the outliers. Let’s look at the same data in tabular form.
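A dplyr summary along these lines (a sketch, not the original chunk) would produce the table below:

# Price summary statistics by era
df %>%
  group_by(Era) %>%
  summarise(min = min(Price),
            mean = mean(Price),
            median = median(Price),
            max = max(Price))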
Era | min | mean | median | max |
---|---|---|---|---|
1. Before 1940s | 2 | 152151.5 | 119900 | 11650000 |
2. Post-War Boom | 2 | 144699.2 | 131000 | 5890753 |
3. Last 30+ Years | 2 | 236651.3 | 215625 | 1700000 |
Again, it looks like more work will need to be done on the price data to clean up records that reflect nominal transfers rather than the market value of a home. Is a home really worth $2.00?
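One possible next step (the $1,000 cutoff below is an assumed placeholder, not a researched threshold) would be to drop these nominal-price transfers:

# Remove sales that look like nominal transfers rather than market transactions
# (the 1000 threshold is an assumption for illustration)
df <- df %>%
  filter(Price >= 1000)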