INTRODUCTION

New York City is repeatedly named as among the most expensive cities in the world to buy real estate. CNBC 15’s recent article on the most expensive places in the US to buy a home included three neighbourhoods from NYC. With the NYC Property Sales Dataset, the New York City Department of Finance opened up its real estate market for analysis.

The Dataset contains information of all the property sales in NYC from September 1, 2016 to August 31, 2017. With such a recent data set, I was able to analyze trends about NYC real estate market borough and neighbourhood-wise.

Analysis Methodology:

My analysis of the NYC Real estate market is broken down into the following sections.

  • Exploratory Data Analysis - Variables such as Borough, Neighborhood, Age of the building/property, Size of property and type of building are the important ones that I explored majorly. The descriptive statistics section speaks about the distribution of each variable and makes necessary changes for the analysis. I’ve also tried to isolate any potential outliers for the variable that will need special attention.

  • Visualization of results - The visualization section is split into a few broad categories for need of clarity. The tabs within the viz section explore one variable at a time with respect to the important numerical fields.

  1. Most In-Demand Borough - Where did New Yorkers buy their properties last year?
  2. Most In-Demand Neighborhood - Which neighborhood do New Yorkers prefer?
  3. The Hottest Buildings - What kind of properties do they buy?
  4. Property sizes in NYC/ Square footage - Does more money mean larger properties in NYC?
  5. Age of the buildings in NYC - Does more money mean newer properties in NYC?

Visualizing each of these plots showed me that many of these variables will be an important predictor of Sales Price, forming the basis for a potential predictive modeling.

REQUIRED PACKAGES

The following packages were used for the analysis. Detials about the code block that uses the package is also mentioned.

library(data.table)
library(tidyverse)
library(lubridate)
library(magrittr)
library(formattable)
library(knitr)
library(modelr)
library(plotly)
  • data.table - In the Data Import section, the fread() function belongs to the data.table package. It is used to import the csv file saved to the system into R.

  • tidyverse - Tidyverse is a combination package for data analyses that installs dplyr, ggplot2, tibble, readr, tidyr and purr packages.

  • lubridate - In the Data cleaning section, the ymd() function uses the lubridate package.

  • magrittr - The magrittr package is necessary to use the pipe operator %>%.

  • formattable - In the Data Preview section, the formattable() function uses the DT package.

  • knitr - In the Data Description section, the kable() function that shows R data in the form of table, uses the knitr package.

  • modelr - Used the modelr package for some of the functions in the Predictive Analysis section.

DATA PREPARATION

The Data Preparation section contains the logic and the steps performed in bringing the data to a form suitable for statistical analysis. Each tab explains the related steps for importing and cleaning the data. The cleaned data can be previewed at the Preview tab with details about variables presented in the Data Description tab.

Data Import

Dataset:

The NYC Property Sales Dataset is a record of every building or apartment unit that was sold in the NYC Property market over a 12 month period.

The dataset was downloaded from Kaggle. Other details about the dataset can be found here.

As this data set is a relatively cleaned- up verison of the original NYC Department of Finance’s dataset, all required details are available in the codebook. The Codebook for the NYC Property Sales is available as a pdf here. It provides a glossary of terms explaining the column headers in the data. A detailed set of Building Classification codes published by the City of New York are available at this link.

Data Import:

I downloaded the dataset from Kaggle and read it as a data frame using the fread() function.

The NYC Property Sales Data has 84548 observations and 22 variables. It has property sales data of each of the 5 boroughs in NYC - Manhattan, the Bronx, Queens, Brookyln and Staten Island.

nyc_prop_orig <- as_data_frame(fread("nyc-rolling-sales.csv"))
class(nyc_prop_orig)
## [1] "tbl_df"     "tbl"        "data.frame"
dim(nyc_prop_orig)
## [1] 84548    22
names(nyc_prop_orig)
##  [1] "V1"                             "BOROUGH"                       
##  [3] "NEIGHBORHOOD"                   "BUILDING CLASS CATEGORY"       
##  [5] "TAX CLASS AT PRESENT"           "BLOCK"                         
##  [7] "LOT"                            "EASE-MENT"                     
##  [9] "BUILDING CLASS AT PRESENT"      "ADDRESS"                       
## [11] "APARTMENT NUMBER"               "ZIP CODE"                      
## [13] "RESIDENTIAL UNITS"              "COMMERCIAL UNITS"              
## [15] "TOTAL UNITS"                    "LAND SQUARE FEET"              
## [17] "GROSS SQUARE FEET"              "YEAR BUILT"                    
## [19] "TAX CLASS AT TIME OF SALE"      "BUILDING CLASS AT TIME OF SALE"
## [21] "SALE PRICE"                     "SALE DATE"

Data Cleaning

Cleaning:

In the original file, nyc_prop_orig, the first column ‘V1’ has observation numbers which is not necessary. Also, the column ‘BUILDING CLASS CATEGORY’ has the category number and title together. I’ve split this column into 2 with the category number in one and title in another. Also, removed the ‘EASE-MENT’ column, which had NULL values throughout the entire dataset and the ‘TIME OF SALE’.

# Removed the first column 'V1' in the working copy, nyc_property  
nyc_property <- as_data_frame(nyc_prop_orig[,-1])

# Separated the columns in 'BUILDING CLASS CATEGORY'
nyc_property <- nyc_property %>%
  separate(col = "BUILDING CLASS CATEGORY", into = c("BUILDING CLASS CATEGORY NUMBER",
                                                     "BUILDING CLASS CATEGORY"), sep = 3) %>%
  separate(col = "SALE DATE", into = c("SALE DATE", "TIME"), sep = " ")

# Removed unnecessary columns - 'EASE-MENT', 'TIME OF SALE' (column #8,23)
nyc_property <- nyc_property[,c(-8,-23)]

After removing the unnecessary columns, I created a new column called ‘Building Age’ transforming the variable, ‘Year Built’. I did this as Building age is a much clearer metric to understand.

# Creating a new column called 'Building Age' transforming the variable, 'Year Built'
nyc_property <- nyc_property %>% 
  mutate(`BUILDING AGE` = 2017 - `YEAR BUILT`)

At this stage, I checked if my dataset had any duplicates. This check returned 765 duplicates in my data frame that I removed.

# Removing the Duplicates in the data frame, nyc_property 
nyc_property %>% filter(duplicated(nyc_property) == TRUE) %>% nrow()
## [1] 765
nyc_property <- unique(nyc_property)

The dataset contains 83783 observations and 22 variables.

Type conversions:

The original data set has columns in data types that are not suitable for analyses. For this purpose, I’ve converted the data types of the columns.

The Borough values were made into ‘factors’ and named accordingly. Building class category, Tax class (present, at sale) were converted into ‘factors’. Block, Lot, Address, Apartment Number, Zipcode were made ‘char’ data types. Land Square Feet, Gross Square Feet, Year, Sale Price were coverted into ‘Numeric’ data types. Sale Date was converted into a ‘Date’ format.

# Data Type conversions to the Data set
fac <- c(1,3,4,5,8,11,18,19)
nyc_property %<>% mutate_at(fac, funs(factor(.)))
levels(nyc_property$BOROUGH) <- c("Manhattan", "Bronx", "Brooklyn", "Queens", "Staten Island")

num <- c(15,16,17,20)
nyc_property %<>% mutate_at(num, funs(as.numeric(.)))

chr <- c(6,7)
nyc_property %<>% mutate_at(chr, funs(as.character(.)))

nyc_property$`SALE DATE` <- ymd(nyc_property$`SALE DATE`)

Missing and Zero Values:

Numerical Variables:

The Data type conversion step above introduced NAs into the dataset by converting the “-” in the original dataset into NAs.

sum(is.na(nyc_property))
## [1] 67615
colSums(is.na(nyc_property))
##                        BOROUGH                   NEIGHBORHOOD 
##                              0                              0 
## BUILDING CLASS CATEGORY NUMBER        BUILDING CLASS CATEGORY 
##                              0                              0 
##           TAX CLASS AT PRESENT                          BLOCK 
##                              0                              0 
##                            LOT      BUILDING CLASS AT PRESENT 
##                              0                              0 
##                        ADDRESS               APARTMENT NUMBER 
##                              0                              0 
##                       ZIP CODE              RESIDENTIAL UNITS 
##                              0                              0 
##               COMMERCIAL UNITS                    TOTAL UNITS 
##                              0                              0 
##               LAND SQUARE FEET              GROSS SQUARE FEET 
##                          26054                          27385 
##                     YEAR BUILT      TAX CLASS AT TIME OF SALE 
##                              0                              0 
## BUILDING CLASS AT TIME OF SALE                     SALE PRICE 
##                              0                          14176 
##                      SALE DATE                   BUILDING AGE 
##                              0                              0

With the code above, I found that only 3 out of 21 variables seem to have missing values, which is surprising. I went ahead and checked I had lost the zero values in my numeric vairables to data coercion.

sum(nyc_prop_orig$`SALE PRICE` == "-")      # 14561 
sum(is.na(nyc_property$`SALE PRICE`))       # 14176 

sum(nyc_prop_orig$`SALE PRICE` == 0)                  #10228 
nyc_property %>% filter(`SALE PRICE` == 0) %>% nrow() #10012

I found that the original dataset, nyc_prop_orig had 14561 ‘-’ values in the SALE PRICE variable and the nyc_property dataset has 14176 NAs. The number is a little lesser than in the original as we removed some duplicates that I suspect had the blank values.

Similarly, I also counted the zero values in SALE PRICE variable and found similar results. I’m now sure that I’ve not lost any zero or missing values to data conversion. It is important to note that this data set has NA values as well as 0 values. I expect this to be the case for all my numeric fields, SALE PRICE, LAND SQUARE FEET and GROSS SQUARE FEET.

1. Sale Price

As the zero values in the SALE PRICE column have a meaning (zero values in Sale Price indicate that this was a sale transfer deed), it would skew the dataset to impute or convert the NA values. This data set contains 10012 observations that have 0 values and 14176 observations that are missing values. At this points, I’m removing them and will analyze them separately.

I created nyc_salepricefaulty, for these observations (24188 observations). I alse created a clean dataset, nyc_analysisdata, by removing all Missing/ Zero values in SALE PRICE from nyc_property. By this step, I removed 24188 observations, leaving nyc_analysisdata, with 71.13% of the original data.

# Missing & Zero values in 'Sale Price' - 28.86% of the original dataset
nyc_salepricefaulty <- nyc_property %>% filter(nyc_property$`SALE PRICE` == 0 | is.na(nyc_property$`SALE PRICE`))

# Base Dataset - (Sale Price = 0 or NULL) - 71.13% of original dataset
nyc_analysisdata <- nyc_property %>% filter((!nyc_property$`SALE PRICE` == 0) & !is.na(nyc_property$`SALE PRICE`))

For all the analysis requiring Sale price values, I’ll be using the nyc_analysisdata dataset.

2. Land Square Feet, Gross Square Feet

For the variables, LAND SQUARE FEET, GROSS SQUARE FEET, I need to evaluate if missing values and zero values mean the same. Zero sq feet could mean the property that was sold was land only with no enclosed square footage.

nyc_analysisdata %>% filter(nyc_analysisdata$`LAND SQUARE FEET` == 0 | nyc_analysisdata$`GROSS SQUARE FEET` == 0) %>% nrow()
explore <- nyc_analysisdata %>% filter(nyc_analysisdata$`LAND SQUARE FEET` == 0 | nyc_analysisdata$`GROSS SQUARE FEET` == 0)

summary(explore$`BUILDING CLASS CATEGORY`)

There are 8674 rows in nyc_analysisdata where Land Sq Footage or Gross Sq Footage is 0. Exploring the Building Class Category variable for the observations where LSF or GSF = 0, we can see that these properties are of different categories. Hence, the assumption of Zero sq footage being land only property is wrong. Hence, 0 and missing values are equivalent in this dataset.

In the nyc_analysisdata, ‘Land Square Feet’ and ‘Gross Square Feet’ variables have 29343, 30380 zero/missing values individually.

nyc_analysisdata %>% filter(nyc_analysisdata$`LAND SQUARE FEET` == 0 | is.na(nyc_analysisdata$`LAND SQUARE FEET`)) %>% nrow()
## [1] 29343
nyc_analysisdata %>% filter(nyc_analysisdata$`GROSS SQUARE FEET` == 0 | is.na(nyc_analysisdata$`GROSS SQUARE FEET`)) %>% nrow()
## [1] 30380

As these are equivalent, I converted the zero values in ‘Land Square Feet’ and ‘Gross Square Feet’ to ‘NA’. This will help with my summary statistic calculations. I’m performing this action on a copy of my nyc_analysisdata table, nyc_data.

nyc_data <- nyc_analysisdata 

# converting the 8185 zero values in LSF into NAs -> 29343 NAs in `LAND SQUARE FEET`
nyc_data$`LAND SQUARE FEET`[nyc_data$`LAND SQUARE FEET` == 0] <- NA

# converting the zero values in GSF into NAs - > 30380 NAs in `GROSS SQUARE FEET`
nyc_data$`GROSS SQUARE FEET`[nyc_data$`GROSS SQUARE FEET` == 0] <- NA

3. Residential Units, Commercial Units and Total Units

There are no missing values in these variables. However, there are 18634, 56819, 16589 zero values individually. At this point, I’m not removing these observations, but will study them independently. I expect that the Total Units = Residential Units + Commercial Units. I found that as expected, for 58801 rows (98% of the nyc_analysisdata), this holds. For 794 rows where it doesn’t seem to match, Total Units = 1 in a majority of the observations. So Total Units might be a good surrogate for these columns.

nyc_analysisdata %>% filter(nyc_analysisdata$`TOTAL UNITS` != nyc_analysisdata$`RESIDENTIAL UNITS` +
                              nyc_analysisdata$`COMMERCIAL UNITS`) %>% nrow()
## [1] 794

Categorical Variables:

  • YEAR BUILT - has 158 levels and 4310 zero values. I have to keep this in mind when I analyze this later.

  • BUILDING CLASS CATEGORY NUMBER, BUILDING CLASS CATEGORY - has 47 levels and no zero values

  • TAX CLASS AT PRESENT - 10 levels + 593 missing obs

  • TAX CLASS AT TIME OF SALE - 4 levels

  • BUILDING CLASS AT PRESENT - 166 levels + 593 missing obs,

  • BUILDING CLASS AT TIME OF SALE - 166 levels

  • ZIP CODE - 185 levels + 0 (656)

levels(nyc_data$BOROUGH)      # 5 levels
## [1] "Manhattan"     "Bronx"         "Brooklyn"      "Queens"       
## [5] "Staten Island"
levels(nyc_data$`BUILDING CLASS CATEGORY NUMBER`) # 47 levels
##  [1] "01 " "02 " "03 " "04 " "05 " "06 " "07 " "08 " "09 " "10 " "11 "
## [12] "11A" "12 " "13 " "14 " "15 " "16 " "17 " "18 " "21 " "22 " "23 "
## [23] "25 " "26 " "27 " "28 " "29 " "30 " "31 " "32 " "33 " "34 " "35 "
## [34] "36 " "37 " "38 " "39 " "40 " "41 " "42 " "43 " "44 " "45 " "46 "
## [45] "47 " "48 " "49 "
levels(nyc_data$`BUILDING CLASS CATEGORY`)        # 47 levels
##  [1] " CONDO-RENTALS"                         
##  [2] "ASYLUMS AND HOMES"                      
##  [3] "COMMERCIAL CONDOS"                      
##  [4] "COMMERCIAL GARAGES"                     
##  [5] "COMMERCIAL VACANT LAND"                 
##  [6] "CONDO COOPS"                            
##  [7] "CONDO CULTURAL/MEDICAL/EDUCATIONAL/ETC" 
##  [8] "CONDO HOTELS"                           
##  [9] "CONDO NON-BUSINESS STORAGE"             
## [10] "CONDO OFFICE BUILDINGS"                 
## [11] "CONDO PARKING"                          
## [12] "CONDO STORE BUILDINGS"                  
## [13] "CONDO TERRACES/GARDENS/CABANAS"         
## [14] "CONDO WAREHOUSES/FACTORY/INDUS"         
## [15] "CONDOS - 2-10 UNIT RESIDENTIAL"         
## [16] "CONDOS - 2-10 UNIT WITH COMMERCIAL UNIT"
## [17] "CONDOS - ELEVATOR APARTMENTS"           
## [18] "CONDOS - WALKUP APARTMENTS"             
## [19] "COOPS - ELEVATOR APARTMENTS"            
## [20] "COOPS - WALKUP APARTMENTS"              
## [21] "EDUCATIONAL FACILITIES"                 
## [22] "FACTORIES"                              
## [23] "HOSPITAL AND HEALTH FACILITIES"         
## [24] "INDOOR PUBLIC AND CULTURAL FACILITIES"  
## [25] "LOFT BUILDINGS"                         
## [26] "LUXURY HOTELS"                          
## [27] "OFFICE BUILDINGS"                       
## [28] "ONE FAMILY DWELLINGS"                   
## [29] "OTHER HOTELS"                           
## [30] "OUTDOOR RECREATIONAL FACILITIES"        
## [31] "RELIGIOUS FACILITIES"                   
## [32] "RENTALS - 4-10 UNIT"                    
## [33] "RENTALS - ELEVATOR APARTMENTS"          
## [34] "RENTALS - WALKUP APARTMENTS"            
## [35] "SELECTED GOVERNMENTAL FACILITIES"       
## [36] "SPECIAL CONDO BILLING LOTS"             
## [37] "STORE BUILDINGS"                        
## [38] "TAX CLASS 1 - OTHER"                    
## [39] "TAX CLASS 1 CONDOS"                     
## [40] "TAX CLASS 1 VACANT LAND"                
## [41] "TAX CLASS 3 - UNTILITY PROPERTIES"      
## [42] "TAX CLASS 4 - OTHER"                    
## [43] "THEATRES"                               
## [44] "THREE FAMILY DWELLINGS"                 
## [45] "TRANSPORTATION FACILITIES"              
## [46] "TWO FAMILY DWELLINGS"                   
## [47] "WAREHOUSES"
levels(nyc_data$`TAX CLASS AT PRESENT`)           # 10 levels + "" (593)
##  [1] ""   "1"  "1A" "1B" "1C" "2"  "2A" "2B" "2C" "3"  "4"
sum(nyc_data$`TAX CLASS AT PRESENT` == "") 
## [1] 593
levels(nyc_data$`TAX CLASS AT TIME OF SALE`)  # 4 levels
## [1] "1" "2" "3" "4"

Data Preview

Listed below are the first 10 entries of the NYC Property Sales data. Each row is a property sale entry with details about the sale provided in the columns.

library(formattable)
head(nyc_data, n = 50) %>%
  formattable() %>%
    as.datatable(options = list(dom = 't',scrollX = TRUE,scrollCollapse = TRUE))

Data Description

Details about the Variables in the Dataset are provided below.

variable.type <- lapply(nyc_data, class)

variable.description <- c("Name of the borough where the property is located", 
"Neighbourhood name", "Building class category code to identify similar properties", 
"Building class category title to identify similar properties", 
"Assigned tax class of the property in the city - Classes 1, 2, 3 or 4", 
"Sub-division of the borough on which real properties are located", 
"Sub-division of a Tax Block, used to uniquely represent the property location",
"Used to describe a property's constructive use", "Property's street address", 
"Property's apartment number", "Property's postal code", 
"Number of residential units at the listed property",
"Number of commercial units at the listed property",
"Total number of units at the listed property", 
"Land area of the property listed in square feet", 
"Total area of all the floors of a building", "Property's construction year", 
"Assigned tax class of the property in the city at the time of sale",
"Used to describe a property's constructive use at the time of sale",
"Price paid for the property", "Date of property sale", "Age of the Building")

variable.name <- colnames(nyc_data)

nyc_datadesc <- as_data_frame(cbind(variable.name, variable.type, variable.description))
colnames(nyc_datadesc) <- c("Variable Name","Data Type","Variable Description")
library(knitr)
kable(nyc_datadesc)
Variable Name Data Type Variable Description
BOROUGH factor Name of the borough where the property is located
NEIGHBORHOOD character Neighbourhood name
BUILDING CLASS CATEGORY NUMBER factor Building class category code to identify similar properties
BUILDING CLASS CATEGORY factor Building class category title to identify similar properties
TAX CLASS AT PRESENT factor Assigned tax class of the property in the city - Classes 1, 2, 3 or 4
BLOCK character Sub-division of the borough on which real properties are located
LOT character Sub-division of a Tax Block, used to uniquely represent the property location
BUILDING CLASS AT PRESENT factor Used to describe a property’s constructive use
ADDRESS character Property’s street address
APARTMENT NUMBER character Property’s apartment number
ZIP CODE factor Property’s postal code
RESIDENTIAL UNITS integer Number of residential units at the listed property
COMMERCIAL UNITS integer Number of commercial units at the listed property
TOTAL UNITS integer Total number of units at the listed property
LAND SQUARE FEET numeric Land area of the property listed in square feet
GROSS SQUARE FEET numeric Total area of all the floors of a building
YEAR BUILT numeric Property’s construction year
TAX CLASS AT TIME OF SALE factor Assigned tax class of the property in the city at the time of sale
BUILDING CLASS AT TIME OF SALE factor Used to describe a property’s constructive use at the time of sale
SALE PRICE numeric Price paid for the property
SALE DATE Date Date of property sale
BUILDING AGE numeric Age of the Building

ANALYSIS

The analysis of the NYC Property Sales data can be found under the following tabs. The Descriptive Statistics section explores each individual variable, while relationships within variables are explored using visualizations.

Descriptive Analysis

Before we start the Descriptive Analysis, let’s understand my cleaned dataset, nyc_dataand understand the dimensions of the dataset.

clean <- dim(nyc_data)

The data set has 59595 observations and 22 rows.

I have a few Numerical variables in the data set.

1. Sale Price

Running a quantile function to the variable shows me that the data is skewed and also that there are many values with very low Sale Price.

quantile(nyc_data$`SALE PRICE`, probs = seq(from = 0, to = 1, by = .1))
##         0%        10%        20%        30%        40%        50% 
##          1     199000     319410     420810     515000     628000 
##        60%        70%        80%        90%       100% 
##     753403     935000    1290000    2250000 2210000000
nyc_data %>% filter(`SALE PRICE` <= 1000) %>% nrow()
## [1] 1125

As I’ve removed these data points earlier, the extremely low data points shouldn’t still be here. I found 1125 observations with Sale Price <= $1000. I added these values into my nyc_salepricefaulty dataset and updated my nyc_data dataset

# Adding these rows to nyc_salespricefaulty
nyc_salepricefaulty <-  nyc_data %>% filter(`SALE PRICE` <= 1000) %>% 
  bind_rows(nyc_salepricefaulty)

# Update nyc_data
nyc_data <- nyc_data %>% filter(!`SALE PRICE` <= 1000)
quantile(nyc_data$`SALE PRICE`, probs = seq(from = 0, to = 1, by = .1))
##         0%        10%        20%        30%        40%        50% 
##       1110     220000     333240     435000     529000     640000 
##        60%        70%        80%        90%       100% 
##     765000     950000    1300000    2300000 2210000000

Re-running the quantile function shows me a better distribution without many low value outliers. However, I see the highest Sale Price value = $2,210,000,000. I looked further at the data point and let the value remain as it looks legitimate.

2. Land Square Feet

summary(nyc_data$`LAND SQUARE FEET`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      33    2000    2500    4618    4000 4252327   29082
quantile(nyc_data$`LAND SQUARE FEET`, probs = seq(from = 0, to = 1, by = .1), na.rm = TRUE)
##        0%       10%       20%       30%       40%       50%       60% 
##      33.0    1680.0    1950.0    2003.0    2351.0    2500.0    2847.2 
##       70%       80%       90%      100% 
##    3500.0    4000.0    5300.0 4252327.0
outliers::outlier(nyc_data$`LAND SQUARE FEET`)
## [1] 4252327

Exploring the Land Square Feet variable, there seem to be outliers it this field as well. Assuming a ball-park value that buildings that have > 500,000 land sq footage should be ballpark, I explored them further. As expected, they were Commercial Vacant lands, Store Buildings and other large properties.

3. Gross Square Feet

I pulled up a similar summary statistic on my ‘Gross Square Feet’ variable and xamined the extreme values, ones that have Gross sq footage > 500,000 - 17 props. As expected, these properties are Store/ Office buildings, Rentals - Elevator Apartments, Factories among others.

summary(nyc_data$`GROSS SQUARE FEET`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     120    1360    1872    4396    2658 3750565   30070
quantile(nyc_data$`GROSS SQUARE FEET`, probs = seq(from = 0, to = 1, by = .1), na.rm = TRUE)
##        0%       10%       20%       30%       40%       50%       60% 
##     120.0    1120.0    1280.0    1448.0    1640.0    1872.0    2120.0 
##       70%       80%       90%      100% 
##    2420.0    2975.2    3984.0 3750565.0
outliers::outlier(nyc_data$`GROSS SQUARE FEET`)
## [1] 3750565

4. Residential Units, Commercial Units and Total Units

summary(nyc_data$`RESIDENTIAL UNITS`)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    0.000    0.000    1.000    1.702    1.000 1844.000
nyc_data %>% filter(`RESIDENTIAL UNITS` > 5) %>% nrow()
## [1] 1376
summary(nyc_data$`COMMERCIAL UNITS`)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##    0.0000    0.0000    0.0000    0.1546    0.0000 2261.0000
nyc_data %>% filter(`COMMERCIAL UNITS` > 5) %>% nrow()
## [1] 140
nyc_data %>% filter(`TOTAL UNITS` > 5) %>% nrow()
## [1] 1573

In the entire data set, there are only 1376 buildings in which more than 5 Residential units were sold in 2016. In the entire data set, there are only 140 buildings in which more than 5 Commercial units were sold in 2016.

As explored earlier, we know that only for 794 properties, Total Units is not equal to Residential, Commericial Units. As Total Units has the least NAs, I will be using this field for further analysis.

5. Building Age

There are 4195 properties with Building age. When I remove properties that don’t have a Year Built entry or Year Built = 0, we get 28 property details.

summary(nyc_data$`BUILDING AGE`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    51.0    77.0   205.2    97.0  2017.0
summary(nyc_data$`YEAR BUILT`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    1920    1940    1812    1966    2017
nyc_data %>% filter(`BUILDING AGE` > 200) %>% nrow() # 4195
## [1] 4195
nyc_data %>% filter(`BUILDING AGE` > 200 & `YEAR BUILT` != 0) %>% 
  arrange(desc(`BUILDING AGE`)) %>% nrow()
## [1] 28

Visualization

After understanding the distribution of my numerical and categorical fields in the previous section, I explored the relationship between the variables by visualizing the fields.

Under each tab, I’ve explored the major Boroughs, Neighborhoods, Building types,

Boroughs

I started the analysis trying to figure out which Borough recored the most property sales last year. I also wanted to understand how much an average property sale price would be across the boroughs in NYC.

Most In-Demand and Expensive Borough in NYC

p1 <- ggplot(data = nyc_property, aes(x = reorder_size(`BOROUGH`))) +
  geom_bar() +
  ggtitle("Most In-Demand Borough in NYC", subtitle = "Borough-wise # of property sales in NYC") +
  scale_y_continuous("# of Property Sales", labels = scales::comma) +
  scale_x_discrete("Borough")

p2 <- ggplot(data = nyc_data, aes(x = reorder_size(`BOROUGH`), y = mean(`SALE PRICE`))) +
  geom_bar(stat = "identity") +
  ggtitle("Most Expensive Borough in NYC", subtitle = "Borough-wise Avg Property Sale Price in NYC") +
  scale_y_continuous("Avg Sale Price", labels = scales::dollar) +
  scale_x_discrete("Borough")

gridExtra::grid.arrange(p1, p2, ncol = 2)

The plots above showed me that Queens had the most number of property sales, followed by Brookyln. The Average Sale Price of a property in Queens was $27 bil, while in Manhattan was $22 bil. I find this surprising as I expected properties in Manhattan to cost more. Let’s explore this further with the other fields.

Neighborhoods

With the exploration of the property sales and prices across Boroughs in NYC, I want to know the numbers divide up with respect to each Neighborhood. I’ll start answering this by looking at the Number of Sales and Average property Sales Prices across the most in-demand neighborhoods.

Most-in demand and Expensive Neighborhood in NYC

df1 <- as.data.frame(table(nyc_property$BOROUGH, nyc_property$NEIGHBORHOOD))
names(df1) <- c('BOROUGH','NEIGHBORHOOD', 'Freq')
df1 <- df1 %>% arrange(desc(Freq)) %>% head(10)

p1 <- ggplot(df1, aes(x = `NEIGHBORHOOD`, y = `Freq`, fill = `BOROUGH`)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  ggtitle("Most-in demand Neighborhood in NYC", subtitle = "Top Neighborhoods by Number") +
  theme(legend.position = "bottom") +
  scale_y_continuous("# of Sales", labels = scales::comma) +
  scale_x_discrete("Neighborhood") 

df2 <- 
  nyc_data %>% group_by(BOROUGH, NEIGHBORHOOD) %>% 
  summarise(MeanSP = mean(`SALE PRICE`)) %>% 
  arrange(desc(MeanSP)) %>% head(10)

p2 <- ggplot(data = df2, aes(x = `NEIGHBORHOOD`, y = MeanSP, fill = `BOROUGH`)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme(legend.position = "bottom") +
  ggtitle("Most Expensive Neighborhoods in NYC", 
          subtitle = "Top Neighborhoods by Avg Price") +
  scale_y_continuous("Avg Sale Price", labels = scales::dollar) +
  scale_x_discrete("Neighborhood") 

gridExtra::grid.arrange(p1, p2, ncol = 2)

Consistent with the previous plots, the Top Neighborhoods by # of Property Sales in 2016 plot shows that neighborhoods in Queens and Manhattan had the most number of properties sold - this accounted to 7 out of the 10 top neighborhoods.

With respect to the Avg value of properties sold in different neighborhoods, Bloomfield in Staten Island was the top. Staten Island, however, was among the lowest when we arranged the Avg. Sale Price according to Borough. So these properties must have been the top 20% of the Sale Price that we explored earlier. Also interesting to note is that 7 out the 10 top property value neighborhoods are from Manhattan. Clearly, even though no neighborhood in Queens fetched top bucks last year, it sold much more properties than the other boroughs.

Also note that the avg. price of the Most expensive Neighborhood and the second most expensive by $12.5 billion. Needless to say, the standard deviation of the property prices in NYC is large!

I want to further check this by plotting the least expensive neighborhoods.

Least Expensive Neighborhoods of NYC

df2 <- 
  nyc_data %>% group_by(BOROUGH, NEIGHBORHOOD) %>% 
  summarise(MeanSP = mean(`SALE PRICE`)) %>% 
  arrange(MeanSP) %>% head(10)

ggplot(data = df2, aes(x = `NEIGHBORHOOD`, y = MeanSP, fill = `BOROUGH`)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme(legend.position = "bottom") +
  ggtitle("Least Expensive Neighborhoods in NYC", subtitle = "Top Neighborhoods by the lowest avg. Price") +
  scale_y_continuous("Avg Sale Price", labels = scales::dollar) +
  scale_x_discrete("Neighborhood") 

As expected, Staten Island, Queens and Bronx - the Boroughs that didn’t feature in the Top Borough list made it here. Interesting to note is the differences in the Avg Sale Price scale of the Most and the Least expensive properties in NYC. The avg property price in Van Cortlandt Park in the Bronx sold for $160,000, while the most expensive property in Staten Island, Bloomsfield sold for $46 billion.

Buildings

With the knowledge of the demand and prices in neighborhoods across Boroughs, I want to understand what kind of buildings get sold across NYC. This will clearly show me what the hottest buildings around NYC are and their sale prices.

Most In-Demand Buildings and Expensive in NYC by Borough

df1 <- as.data.frame(table(nyc_property$BOROUGH, nyc_property$`BUILDING CLASS CATEGORY`))
names(df1) <- c('BOROUGH','BUILDING CLASS CATEGORY', 'Freq')
df1 <- df1 %>% group_by(BOROUGH) %>% arrange(desc(Freq)) %>% head(10)

ggplot(df1, aes(x = `BOROUGH`, y = `Freq`, fill = `BUILDING CLASS CATEGORY`)) +
  geom_bar(stat = "identity", position = "dodge") +
  ggtitle("Most In-Demand Buildings in NYC by Borough", subtitle = "Top types of Buildings sold in NYC") +
  scale_y_continuous("# of Sales", labels = scales::comma) +
  scale_x_discrete("Borough") 

Clearly, the most in-demand buildings in NYC over the last year were one family dwellings, across Staten Island, Queens and Brooklyn. Coops in Elevator Apartments were also much wanted over the last year.

Let’s see what kind of buildings sold for the most money in NYC.

Most Expensive Buildings by value in NYC

df2 <- 
  nyc_data %>% group_by(BOROUGH, `BUILDING CLASS CATEGORY`) %>% 
  summarise(MeanSP = mean(`SALE PRICE`)) %>% 
  arrange(desc(MeanSP)) %>% head(10)

ggplot(data = df2, aes(x = `BUILDING CLASS CATEGORY`, y = MeanSP, fill = `BOROUGH`)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme(legend.position = "bottom") +
  ggtitle("Most Expensive Buildings in NYC", subtitle = "Top Types of Property by Value in NYC in 2016") +
  scale_y_continuous("Avg Sale Price", labels = scales::dollar) +
  scale_x_discrete("Building Type") 

Most of the top properties in Manhattan were commercial - Office buildings, Luxury Hotels, and other commercial classes, while the apartments and condos, though expensive, were on the cheaper side for Manhattan.

Another clear pattern is that the most expensive buildings are almost entirely commerical buildings. Also note how the theatres in Queens are as expensive as Rental apartments in Manhattan.

Least expensive Buildings in NYC

To make the property price variance argument more solid, let’s explore how the lease expensive buildings in NYC look.

df2 <- 
  nyc_data %>% group_by(BOROUGH, `BUILDING CLASS CATEGORY`) %>% 
  summarise(MeanSP = mean(`SALE PRICE`)) %>% 
  arrange(MeanSP) %>% head(10)

ggplot(data = df2, aes(x = `BUILDING CLASS CATEGORY`, y = MeanSP, fill = `BOROUGH`)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
  theme(legend.position = "bottom") +
  ggtitle("Least Expensive Buildings in NYC", subtitle = "Lowest Types of Property by Value in NYC in 2016") +
  scale_y_continuous("Avg Sale Price", labels = scales::dollar) +
  scale_x_discrete("Building Type") 

The least expensive property in NYC is a Condo Parking space in Staten island. Interestingly, the most expensive and the least expensive buildings in NYC are commercial buildings. For $45,000 you could also buy a Condo Terrace in the Queens!

Tax Class of the Properties sold

Adding another variable to the equation now, let’s look at the Tax Class of the Properties sold in NYC. There are 4 tax classes that Property sales are categorized into. Over the last year, there were no Tax Class, 3, property sales.

Class 1: Includes most residential property of up to three units

Class 2: Includes all other property that is primarily residential, such as cooperatives and condominiums.

Class 3: Includes property with equipment owned by a gas, telephone or electric company

Class 4: Includes all other properties not included in class 1,2, and 3, such as offices, factories, warehouses, garage buildings, etc.

With this knowledge, let’s look at the property Tax classes across Boroughs and their value.

ggplot(data = nyc_property, aes(x = reorder_size(`TAX CLASS AT TIME OF SALE`))) +
  geom_bar() +
  facet_wrap(~ BOROUGH) + 
  ggtitle("Borough-wise Sold Property Tax Class") +
  scale_y_continuous("# of Property Sold", labels = scales::comma) +
  scale_x_discrete("Tax Class")

Avg Property Price by Tax Class sold in NYC

ggplot(data = nyc_data, aes(x = `TAX CLASS AT TIME OF SALE`, y = mean(`SALE PRICE`))) +
  geom_bar(stat = "identity") +
  facet_wrap(~ BOROUGH) + 
  ggtitle("Borough-wise Sold Property Tax Class") +
  scale_y_continuous("Avg Property Value", labels = scales::dollar) +
  scale_x_discrete("Tax Class")

Consistent with our previous charts, Manhattan has a large number of high-value residential condos and coops, while Queens has a large number of lower-valued small sized residential properties.

Property Square Footage

Having seen that property prices vary so largely across the neighborhoods in NYC, let’s see if big money in NYC means large space! (Who doesn’t know to take a joke!)

Evaluating this holds merit as we just found out that rental living spaces in Manhattan are as expensive as renting theatres in Queens.

Let’s understand this trend by plotting the average Land Square Feet, Gross square Feet across Boroughs.

Land Square Feet across Boroughs

ggplot(data = nyc_data, aes(x = `BOROUGH`, y = log(`LAND SQUARE FEET`), fill = nyc_data$BOROUGH)) +
  geom_boxplot() +
  theme(legend.position = "bottom") +
  ggtitle("Land Square Feet across Boroughs", subtitle = "Borough-wise Distribution of Avg Land Square Feet") +
  scale_y_continuous("Avg. Land Square Footage", labels = scales::comma) +
  scale_x_discrete("Borough") +
  coord_flip()

There are many outliers within Staten Island Borough. The extreme values seem to have average Land square footage greater than 500,000 sq feet. These are the outliers we identified in the previous section when we explored the LSF field.

Gross Square Feet across Boroughs

ggplot(data = nyc_data, aes(x = `BOROUGH`, y = log(`GROSS SQUARE FEET`), fill = nyc_data$BOROUGH)) +
  geom_boxplot() +
  theme(legend.position = "bottom") +
  ggtitle("Gross Square Feet across Boroughs", subtitle = "Borough-wise Distribution of Avg Gross Square Feet") +
  scale_y_continuous("Abg. Gross Square Footage", labels = scales::comma) +
  scale_x_discrete("Borough") +
  coord_flip()

Similarly exploring the average Gross square footage, we find many property sales greater than 500,000 sq feet. These, again are the outliers we identified in the previous section when we explored the GSF field. However, in contrast to the previous plot, a majority of these seem to be in Manhattan.

Property Size

With the square footage plots before, we can guess that the tentative Price/unit area varies with Neighborhood as well. To explore this metric, I plotted Sale Price vs Land square Feet, Sale Price vs Gross square Feet borough-wise. They clearly all have a different pattern.

Sale Price vs Land Square Feet

ggplot(data = nyc_data, aes(x = log(`LAND SQUARE FEET`), y = log(`SALE PRICE`), color = `BOROUGH`)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  theme(legend.position = "bottom") +
  facet_wrap(~ BOROUGH) +
  ggtitle("Price Vs Land Square Footage in NYC", 
          subtitle = "Distribution of Sale Price vs Land Square feet Borough-wise") +
  scale_y_continuous("Property Sale Price", labels = scales::dollar) +
  scale_x_continuous("Land Square Footage", labels = scales::comma) 

Sale Price vs Gross Square Feet

ggplot(data = nyc_data, aes(x = log(`GROSS SQUARE FEET`), y = log(`SALE PRICE`), color = `BOROUGH`)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  facet_wrap(~ BOROUGH) +
  ggtitle("Price Vs Gross Square Footage in NYC",
          subtitle = "Distribution of Sale Price vs Gross Square feet Borough-wise") +
  scale_y_continuous("Property Sale Price", labels = scales::dollar) +
  scale_x_continuous("Gross Square Footage", labels = scales::comma) +
  theme(legend.position = "bottom")

The trends seem to largely be similar for Gross Square Footage and Land Square Footage except for the different trend in Manhattan.

To complement the charts above, I created a metric ‘Price/unit area’ and plotted this metric for all the Boroughs. This will be the clearest indicator of the price of a unit sq foot of space in NYC

Price/sq. Feet in NYC

df1 <- nyc_data %>% filter(`LAND SQUARE FEET` != 0) %>%
      mutate(PriceLSF = `SALE PRICE`/`LAND SQUARE FEET`) %>%
      group_by(`BOROUGH`, `NEIGHBORHOOD`) %>%
      summarise(MeanPriceLSF = mean(PriceLSF, na.rm = TRUE)) %>%
      arrange(desc(MeanPriceLSF)) %>% head(15)

p1 <- ggplot(data = df1, aes(x = `NEIGHBORHOOD`, y = MeanPriceLSF, fill = `BOROUGH`)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme(legend.position = "bottom") +
  ggtitle("NYC Price per Square Feet", subtitle = "Top Price/sqft Neighborhood") +
  scale_y_continuous("Price/sqft", labels = scales::dollar) +
  scale_x_discrete("Neighborhood") 

df2 <- nyc_data %>% filter(`LAND SQUARE FEET` != 0) %>%
  mutate(PriceLSF = `SALE PRICE`/`LAND SQUARE FEET`) %>%
  group_by(`BOROUGH`, `NEIGHBORHOOD`) %>%
  summarise(MeanPriceLSF = mean(PriceLSF, na.rm = TRUE)) %>%
  arrange(MeanPriceLSF) %>% head(15)

p2 <- ggplot(data = df2, aes(x = `NEIGHBORHOOD`, y = MeanPriceLSF, fill = `BOROUGH`)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme(legend.position = "bottom") +
  ggtitle("MYC Price per Square Feet", subtitle = "Lowest Price/sqft Neighborhood") +
  scale_y_continuous("Price/ sqft", labels = scales::dollar) +
  scale_x_discrete("Neighborhood") 

gridExtra::grid.arrange(p1, p2, ncol = 2)

Consistent with the analysis above, we can see all top 15 Price/sqft from Manhattan and a mojority of the Lowest Price/sqft from Staten Island. While the most expensive property in NYC sold at $16,000/sqft in Midtown CBD, the least expensive property was priced at $26/sqft.

Land Square Feet vs Gross Square Feet

Knowing the distribution of Land Square Feet, Gross Square Feet, we’d expect a strong correlation. I expect Gross Square Feet to be always greater than Land Square Feet, but this might be skewed for commercial/ non-residential properties.

ggplot(data = nyc_data, aes(x = log(`GROSS SQUARE FEET`), y = log(`LAND SQUARE FEET`), color = `BOROUGH`)) +
  geom_point() +
  geom_smooth(method = "lm") +
  ggtitle("Borough-wise Land-Gross Square Footage Ratio",
          subtitle = "Variation of property spaces") +
  scale_y_continuous("Land Square Footage", labels = scales::comma) +
  scale_x_continuous("Gross Square Footage", labels = scales::comma)

Building Age

Another important variable in the dataset is Building age. Exploring this variable will help us contrast how Property prices fluctuate across Boroughs with the age of the building.

I plotted the Distribution of Building age across each borough to figure where the older buildings in NYC are. After that, I also plotted the Sale Price vs Building age across each borough to identify if building age impacted the property sale price or not.

Building age across Borough

ggplot(data = df, aes(x = `BOROUGH`, y = `BUILDING AGE`, fill = `BOROUGH`)) +
  geom_boxplot() +
  coord_flip() +
  ggtitle("Age of Properties sold in NYC", subtitle = "Distribution of building age in NYC") +
  scale_y_continuous("Building Age", labels = scales::comma) +
  scale_x_discrete("Borough") 

The plot shows that the newest of buildings are in Queens and Bronx, and the oldest building is in Manhattan. There were buildings in this dataset that were constructed in 1820, so this is a legitimate outlier.

Sale Price vs Building age

ggplot(data = df, aes(x = `BUILDING AGE`, y = log(`SALE PRICE`))) +
  geom_point(aes(col = df$BOROUGH)) +
  geom_smooth(method = "lm") +
  theme(legend.position = "bottom") +
  ggtitle("Price of Oldest Properties sold in NYC", subtitle = "Oldest buildings in NYC in 2016") +
  scale_y_continuous("Property Value Distribution", labels = scales::dollar) +
  scale_x_continuous("Building Age") 

The plot shows that only in Manhattan and the Bronx can we expect property prices to fall as the age of the Building increases. Building age might not even be a good predictor of Sale price for properties in the other Boroughs. But I’ll need to further analyze this to identify if that’s really the case.

Predictive Analysis

work in progress

Preparation

To predict the NYC Property Price using this data set, I started with creating a new data set for prediction removing all character values - such as Address, etc and retained only the fields that could help in prediction.

I also transformed Sale date into the month in which the sale occurred

nyc_data$`SALE MONTH` <- as.factor(months(nyc_data$`SALE DATE`))

nyc_pred <- nyc_data[, -c(3, 5, 6, 7, 8, 9, 10, 17, 19, 21)]
nyc_pred$NEIGHBORHOOD <- as.factor(nyc_pred$NEIGHBORHOOD)

nyc_pred <- nyc_pred[c(1:10, 12,13,11)]

str(nyc_pred)
## Classes 'tbl_df', 'tbl' and 'data.frame':    58470 obs. of  13 variables:
##  $ BOROUGH                  : Factor w/ 5 levels "Manhattan","Bronx",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ NEIGHBORHOOD             : Factor w/ 253 levels "AIRPORT LA GUARDIA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ BUILDING CLASS CATEGORY  : Factor w/ 47 levels " CONDO-RENTALS",..: 34 34 34 34 33 33 20 20 20 20 ...
##  $ ZIP CODE                 : Factor w/ 186 levels "0","10001","10002",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ RESIDENTIAL UNITS        : int  5 10 6 8 24 10 0 0 0 0 ...
##  $ COMMERCIAL UNITS         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ TOTAL UNITS              : int  5 10 6 8 24 10 0 0 0 0 ...
##  $ LAND SQUARE FEET         : num  1633 2272 2369 1750 4489 ...
##  $ GROSS SQUARE FEET        : num  6440 6794 4615 4226 18523 ...
##  $ TAX CLASS AT TIME OF SALE: Factor w/ 4 levels "1","2","3","4": 2 2 2 2 2 2 2 2 2 2 ...
##  $ BUILDING AGE             : num  117 104 117 97 97 8 97 97 97 92 ...
##  $ SALE MONTH               : Factor w/ 12 levels "April","August",..: 6 12 10 12 10 11 8 7 6 8 ...
##  $ SALE PRICE               : num  6625000 3936272 8000000 3192840 16232000 ...

To begin with the modeling exercise, I split my data set into an 80-20% training and test set.

# Train - Test split
# Splitting the nyc_pred data set into an 80-20% Training- testing split
# to use later for testing the 20% test set Sale prices

set.seed(12420360)
index <- sample(nrow(nyc_pred),nrow(nyc_pred)*0.80)
nyc_pred.train <- nyc_pred[index,]
nyc_pred.test <- nyc_pred[-index,]

The Visualization step earlier showed me that there is a relationship between each of the explored variables and the response variable, Sale Price. To confirm the association, I checked the correlation between Sale Price and the predictor variables.

Correlations

1. Correlation Table (Y vs X)

dist_corr <- vapply(nyc_pred.train[c(1:12)], 
                    function(x) { cor(nyc_pred.train$`SALE PRICE`, as.numeric(x), use = "complete.obs") }, 
                    FUN.VALUE = numeric(1))
sign_corr <- vapply(dist_corr, function(x) { ifelse(x >= 0, "Positive", "Negative")}, 
                    FUN.VALUE = character(1))
var_corr <- names(nyc_pred.train[c(1:12)])

# correlation table - Table 1
table1 <- data.frame(var_corr, abs(dist_corr), sign_corr)
table1 <- table1[order(-abs(dist_corr)),]
names(table1) <- c("Variable", "Size of the Correlation", "Direction of Correlation")

kable(table1)
Variable Size of the Correlation Direction of Correlation
GROSS SQUARE FEET GROSS SQUARE FEET 0.5136732 Positive
TAX CLASS AT TIME OF SALE TAX CLASS AT TIME OF SALE 0.1088512 Positive
TOTAL UNITS TOTAL UNITS 0.0901199 Positive
RESIDENTIAL UNITS RESIDENTIAL UNITS 0.0838978 Positive
BOROUGH BOROUGH 0.0733236 Negative
ZIP CODE ZIP CODE 0.0668161 Negative
LAND SQUARE FEET LAND SQUARE FEET 0.0562736 Positive
COMMERCIAL UNITS COMMERCIAL UNITS 0.0389103 Positive
BUILDING CLASS CATEGORY BUILDING CLASS CATEGORY 0.0152690 Negative
BUILDING AGE BUILDING AGE 0.0052518 Positive
SALE MONTH SALE MONTH 0.0013249 Positive
NEIGHBORHOOD NEIGHBORHOOD 0.0006959 Positive

The correlation matrix above shows that the response variable is only explained by the Gross Square feet, the statistical significance of which I need to check.

I further checked the correlation between all the predictor variables to keep in mind the potential multi-collinearity effects in future.

# 2.b. Correlation numeric predictors (all X's)
correl <- cor(nyc_pred[sapply(nyc_pred, is.numeric)], use = "complete.obs")
library(corrplot)
## corrplot 0.84 loaded
corrplot(correl, type = "upper",order = "hclust", 
         tl.col = "black", tl.srt = 45)

round(correl, 3)
##                   RESIDENTIAL UNITS COMMERCIAL UNITS TOTAL UNITS
## RESIDENTIAL UNITS             1.000            0.012       0.824
## COMMERCIAL UNITS              0.012            1.000       0.577
## TOTAL UNITS                   0.824            0.577       1.000
## LAND SQUARE FEET              0.472            0.050       0.414
## GROSS SQUARE FEET             0.726            0.058       0.626
## BUILDING AGE                  0.002            0.000       0.002
## SALE PRICE                    0.139            0.046       0.140
##                   LAND SQUARE FEET GROSS SQUARE FEET BUILDING AGE
## RESIDENTIAL UNITS            0.472             0.726        0.002
## COMMERCIAL UNITS             0.050             0.058        0.000
## TOTAL UNITS                  0.414             0.626        0.002
## LAND SQUARE FEET             1.000             0.664       -0.018
## GROSS SQUARE FEET            0.664             1.000       -0.016
## BUILDING AGE                -0.018            -0.016        1.000
## SALE PRICE                   0.042             0.524       -0.001
##                   SALE PRICE
## RESIDENTIAL UNITS      0.139
## COMMERCIAL UNITS       0.046
## TOTAL UNITS            0.140
## LAND SQUARE FEET       0.042
## GROSS SQUARE FEET      0.524
## BUILDING AGE          -0.001
## SALE PRICE             1.000

Clearly, Res. units - Total units - Gross Sq are highly correlated (> 0.7); Commercial units and Total units are also fairly correlated (0.577); Total units has high correlation with almost all variables.

LSF and GSF are highly correlated as well (0.664). Consider using Total units, instead of Res units and Comm units; and LSF instead of LSF and GSF.

# 2.c. Correlation between the categorical predictors in the data
temp <- nyc_pred[,c(1:4,10,12)]    
num <- c(1:6)
temp %<>% mutate_at(num, funs(as.numeric(.)))
round(cor(temp), 3)
##                           BOROUGH NEIGHBORHOOD BUILDING CLASS CATEGORY
## BOROUGH                     1.000       -0.164                   0.362
## NEIGHBORHOOD               -0.164        1.000                  -0.108
## BUILDING CLASS CATEGORY     0.362       -0.108                   1.000
## ZIP CODE                    0.655       -0.143                   0.312
## TAX CLASS AT TIME OF SALE  -0.365        0.042                  -0.613
## SALE MONTH                  0.017       -0.005                   0.009
##                           ZIP CODE TAX CLASS AT TIME OF SALE SALE MONTH
## BOROUGH                      0.655                    -0.365      0.017
## NEIGHBORHOOD                -0.143                     0.042     -0.005
## BUILDING CLASS CATEGORY      0.312                    -0.613      0.009
## ZIP CODE                     1.000                    -0.259      0.002
## TAX CLASS AT TIME OF SALE   -0.259                     1.000     -0.006
## SALE MONTH                   0.002                    -0.006      1.000
corrplot(cor(temp), type = "upper",order = "hclust", 
         tl.col = "black", tl.srt = 45)

Checking the correlation between the categorical variables shows some clear trends. Borough and Zip Code have high correlation (0.65) and BCC, Tax class have high correlation too (0.613).

Remember to group variables that have high correlation while building models. Also note the previously noted patterns - RUnits + CUnits = TUnits (for close to 90% of the data).

Regression (SLR)

2. Single Factor Linear Regressions

Running a full model Linear regression for this data set doesn’t seem too appropriate as there are too many categorical predictors that will create lots of dummy variables. I checked if they’re necessary using a single factor regression between the predictor and each response variable.

# 1. Single Factor Regression - Borough
model1 <- lm(nyc_pred.train$`SALE PRICE` ~ nyc_pred.train$BOROUGH)
a <- summary(model1) 

# Summary of the Model:
a
## 
## Call:
## lm(formula = nyc_pred.train$`SALE PRICE` ~ nyc_pred.train$BOROUGH)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
##   -3437392    -832058    -398325     -33840 2206561433 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)                          3438567     125025   27.50   <2e-16
## nyc_pred.train$BOROUGHBronx         -2609727     243847  -10.70   <2e-16
## nyc_pred.train$BOROUGHBrooklyn      -2140241     173188  -12.36   <2e-16
## nyc_pred.train$BOROUGHQueens        -2678845     166761  -16.06   <2e-16
## nyc_pred.train$BOROUGHStaten Island -2890102     230860  -12.52   <2e-16
##                                        
## (Intercept)                         ***
## nyc_pred.train$BOROUGHBronx         ***
## nyc_pred.train$BOROUGHBrooklyn      ***
## nyc_pred.train$BOROUGHQueens        ***
## nyc_pred.train$BOROUGHStaten Island ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13280000 on 46771 degrees of freedom
## Multiple R-squared:  0.006844,   Adjusted R-squared:  0.006759 
## F-statistic: 80.57 on 4 and 46771 DF,  p-value: < 2.2e-16
# F-statistic p-value
pf(a$fstatistic[1],a$fstatistic[2],a$fstatistic[3],lower.tail = FALSE)
##        value 
## 2.905515e-68

The F statistic for Borough is significant. So, I’m maintaining the 4 dummy levels.

# 2. Single Factor Regression - Neighborhood 
model1 <- lm(nyc_pred.train$`SALE PRICE` ~ nyc_pred.train$NEIGHBORHOOD)
a <- summary(model1)

# F-statistic p-value
pf(a$fstatistic[1],a$fstatistic[2],a$fstatistic[3],lower.tail = FALSE)
##         value 
## 8.978386e-141

Only 4 neighborhoods are significant. Manually creating dummies for these categories and clubbing the rest under a Neighborhood ‘Others’ variable.

nyc_pred.train$N_BLOOMFIELD = ifelse(nyc_pred.train$NEIGHBORHOOD == "BLOOMFIELD", 1,0)
nyc_pred.train$N_FASHION = ifelse(nyc_pred.train$NEIGHBORHOOD == "FASHION", 1,0)
nyc_pred.train$`N_JAVITS CENTER` = ifelse(nyc_pred.train$NEIGHBORHOOD == "JAVITS CENTER", 1,0)
nyc_pred.train$`N_MIDTOWN CBD` = ifelse(nyc_pred.train$NEIGHBORHOOD == "MIDTOWN CBD", 1,0)
nyc_pred.train$`N_OTHERS` = ifelse((nyc_pred.train$NEIGHBORHOOD != "MIDTOWN CBD") &
                                                       (nyc_pred.train$NEIGHBORHOOD != "JAVITS CENTER") &
                                                       (nyc_pred.train$NEIGHBORHOOD != "FASHION") &
                                                       (nyc_pred.train$NEIGHBORHOOD != "BLOOMFIELD"), 1,0)

# Removing the original Neighborhood predictor
nyc_pred.train <- nyc_pred.train[,-2] 
# 3. Single Factor Regression - BCC
model1 <- lm(nyc_pred.train$`SALE PRICE` ~ nyc_pred.train$`BUILDING CLASS CATEGORY`)
a <- summary(model1) 

# F-statistic p-value
pf(a$fstatistic[1],a$fstatistic[2],a$fstatistic[3],lower.tail = FALSE)
## value 
##     0

Only 8 of the 46 variables created are NOT significant. So, at this point I’m leaving all the 46 dummy levels created for this variable.

# 4. Single Factor Regression - Tax class
model1 <- lm(nyc_pred.train$`SALE PRICE` ~ nyc_pred.train$`TAX CLASS AT TIME OF SALE`)
a <- summary(model1) 

a
## 
## Call:
## lm(formula = nyc_pred.train$`SALE PRICE` ~ nyc_pred.train$`TAX CLASS AT TIME OF SALE`)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
##   -9177872    -991038    -345990      29010 2200820928 
## 
## Coefficients:
##                                             Estimate Std. Error t value
## (Intercept)                                   745990      90554   8.238
## nyc_pred.train$`TAX CLASS AT TIME OF SALE`2   845048     125145   6.753
## nyc_pred.train$`TAX CLASS AT TIME OF SALE`4  8433082     308263  27.357
##                                             Pr(>|t|)    
## (Intercept)                                  < 2e-16 ***
## nyc_pred.train$`TAX CLASS AT TIME OF SALE`2 1.47e-11 ***
## nyc_pred.train$`TAX CLASS AT TIME OF SALE`4  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13220000 on 46773 degrees of freedom
## Multiple R-squared:  0.01577,    Adjusted R-squared:  0.01573 
## F-statistic: 374.7 on 2 and 46773 DF,  p-value: < 2.2e-16
# F-statistic p-value
pf(a$fstatistic[1],a$fstatistic[2],a$fstatistic[3],lower.tail = FALSE)
##         value 
## 3.786435e-162

All the levels are significant - so maintaining them.

# 5. Single Factor Regression - Sale Month
model1 <- lm(nyc_pred.train$`SALE PRICE` ~ nyc_pred.train$`SALE MONTH`)
a <- summary(model1) 

a
## 
## Call:
## lm(formula = nyc_pred.train$`SALE PRICE` ~ nyc_pred.train$`SALE MONTH`)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
##   -2052435   -1149577    -853747    -409599 2207946065 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)                           1411280     223734   6.308 2.85e-10
## nyc_pred.train$`SALE MONTH`August      -31681     317720  -0.100   0.9206
## nyc_pred.train$`SALE MONTH`December    420970     303357   1.388   0.1652
## nyc_pred.train$`SALE MONTH`February   -152998     318250  -0.481   0.6307
## nyc_pred.train$`SALE MONTH`January     345690     311797   1.109   0.2676
## nyc_pred.train$`SALE MONTH`July        -11703     312998  -0.037   0.9702
## nyc_pred.train$`SALE MONTH`June         -9492     296350  -0.032   0.9744
## nyc_pred.train$`SALE MONTH`March       -10634     304468  -0.035   0.9721
## nyc_pred.train$`SALE MONTH`May         642655     305369   2.105   0.0353
## nyc_pred.train$`SALE MONTH`November     89711     311958   0.288   0.7737
## nyc_pred.train$`SALE MONTH`October      -4436     316386  -0.014   0.9888
## nyc_pred.train$`SALE MONTH`September    92467     302789   0.305   0.7601
##                                         
## (Intercept)                          ***
## nyc_pred.train$`SALE MONTH`August       
## nyc_pred.train$`SALE MONTH`December     
## nyc_pred.train$`SALE MONTH`February     
## nyc_pred.train$`SALE MONTH`January      
## nyc_pred.train$`SALE MONTH`July         
## nyc_pred.train$`SALE MONTH`June         
## nyc_pred.train$`SALE MONTH`March        
## nyc_pred.train$`SALE MONTH`May       *  
## nyc_pred.train$`SALE MONTH`November     
## nyc_pred.train$`SALE MONTH`October      
## nyc_pred.train$`SALE MONTH`September    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13330000 on 46764 degrees of freedom
## Multiple R-squared:  0.0002845,  Adjusted R-squared:  4.939e-05 
## F-statistic:  1.21 on 11 and 46764 DF,  p-value: 0.2736
# F-statistic p-value
pf(a$fstatistic[1],a$fstatistic[2],a$fstatistic[3],lower.tail = FALSE)
##     value 
## 0.2735656

None of the levels of the month variable are significant. So, I won’t be using the predictor to build the full model.

Regression (MLR)

3. Full Model - Multiple Linear Regression

Running full model based on the variables reduced from the above step.

attach(nyc_pred.train)
full.model <- lm(`SALE PRICE` ~ .  -`SALE MONTH`, data = nyc_pred.train)
fm <- summary(full.model)

Though the Adjusted R-squared value is quite high at 0.9265332, the Model MSE is very large at 2.544710310^{13}. This is likely due to the large degrees of freedom in the model. Further the AIC of this model is 7.687737410^{5}.

NEXT STEPS:

  1. Variable Reduction to be employed for the Zip Code variable.
  2. Variable selection methods to choose the optimal number of parameters.
  3. Employ cross-validation methods to test the out-of-sample error
  4. Try other algorithms such as Random Forest.

more to be added

SUMMARY

Much of the work with the NYC Property Sales data was data cleaning. After the initial process of data cleaning (predominantly treating missing values), I identified many outliers within the Sales Price and Square footage numerical variables. Isolating these data points and exploring the points individually was valuable with this data set.

Insights:

With this exploratory data analysis of the NYC Property Sale Prices, I found many interesting trends.

  • Property prices in NYC range from $220,000 (10% percentile of Property prices) all the way to $2.2 billion (95% percentile of Property prices). NYC has a place for everyone!

  • Price per square footage in Manhattan is as high as $16,000/sqft, while in Bloomfield, Staten Island is $26/sqft. Move to Staten Island, everyone!

  • Manhattan and Bronx sold the most residential condo apartments in large buildings/ residential societies, while Queens sold the most residential homes.

Further analyses in progress

Moreover, this exploration forms the basis to show association between Sale Prices and the variables explored such as Borough, Neighborhood, Building Type, Tax Class, Building Age, Land/Gross Square Feet. Using this analysis as a first step, I’ll be extending the analysis to evaluate a multiple regression model with the Sale Price as a predictor variable.

A futher extension of this study will be using KNN imputation to fill in missing values and then evaluate the regression strength