INTRODUCTION

New York City is repeatedly named as among the most expensive cities in the world to buy real estate. The Dataset contains information of all the property sales in NYC from January 2020 to December 2020. With such a recent data set, I was able to analyze trends about NYC real estate market borough and neighbourhood-wise.

Analysis Methodology:

My analysis of the NYC Real estate market is broken down into the following sections.

Exploratory Data Analysis

Here, we take a look at variables such as Borough, Neighborhood, Age of the building/property, Size of property and type of building.
The descriptive statistics section speaks about the distribution of each variable and makes necessary changes for the analysis.
I’ve also tried to isolate any potential outliers for the variable that will need special attention.

Visualization of results

The visualization section is split into a few broad categories for need of clarity. The tabs within the viz section explore one variable at a time with respect to the important numerical fields.
1. Most In-Demand Borough - Where did New Yorkers buy their properties last year?
2. Most In-Demand Neighborhood - Which neighborhood do New Yorkers prefer?
3. The Hottest Buildings - What kind of properties do they buy?
4. Property sizes in NYC/ Square footage - Does more money mean larger properties in NYC?
5. Age of the buildings in NYC - Does more money mean newer properties in NYC?
Visualizing each of these plots showed me that many of these variables will be an important predictor of Sales Price, forming the basis for a potential predictive modeling.

LOADING REQUIRED PACKAGES

library(data.table)
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.1     v dplyr   1.0.1
## v tidyr   1.0.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## Warning: package 'ggplot2' was built under R version 4.0.2
## Warning: package 'readr' was built under R version 4.0.2
## Warning: package 'dplyr' was built under R version 4.0.2
## -- Conflicts ----------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::between()   masks data.table::between()
## x dplyr::filter()    masks stats::filter()
## x dplyr::first()     masks data.table::first()
## x dplyr::lag()       masks stats::lag()
## x dplyr::last()      masks data.table::last()
## x purrr::transpose() masks data.table::transpose()
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(magrittr)
## Warning: package 'magrittr' was built under R version 4.0.3
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
library(formattable)
## Warning: package 'formattable' was built under R version 4.0.4
library(knitr)
library(modelr)
library(plotly)
## Warning: package 'plotly' was built under R version 4.0.2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:formattable':
## 
##     style
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(readxl)

DATA PREPARATION

This section contains the logic and the steps performed in bringing the data to a form suitable for statistical analysis. The first step is to import the data.

Data Import

My data is seperated by boroughs however we can combine the data as follows.
I used read_excel because it tends to perform some degree of cleaning the data during importation. We skip the first 4 rows because they contain description of the data.

brooklyn <- read_excel("C:/Users/Folio 9480/Downloads/rollingsales_brooklyn.xls", skip = 4)
bronx <- read_excel("C:/Users/Folio 9480/Downloads/rollingsales_bronx.xls", skip = 4)
manhattan <- read_excel("C:/Users/Folio 9480/Downloads/rollingsales_manhattan.xls", skip = 4)
queens <- read_excel("C:/Users/Folio 9480/Downloads/rollingsales_queens.xls", skip = 4)
statenisland <- read_excel("C:/Users/Folio 9480/Downloads/rollingsales_statenisland.xls", skip = 4)

Combining data to one dataset is only possible if the datasets involves have similar columns which is the case in our datasets.

NYC_property_sales <- bind_rows(brooklyn, bronx, manhattan, queens, statenisland)
class(NYC_property_sales)
## [1] "tbl_df"     "tbl"        "data.frame"
glimpse(NYC_property_sales)
## Rows: 62,957
## Columns: 21
## $ BOROUGH                          <chr> "3", "3", "3", "3", "3", "3", "3",...
## $ NEIGHBORHOOD                     <chr> "BATH BEACH", "BATH BEACH", "BATH ...
## $ `BUILDING CLASS CATEGORY`        <chr> "01 ONE FAMILY DWELLINGS", "01 ONE...
## $ `TAX CLASS AT PRESENT`           <chr> "1", "1", "1", "1", "1", "1", "1",...
## $ BLOCK                            <dbl> 6359, 6360, 6360, 6362, 6367, 6371...
## $ LOT                              <dbl> 70, 48, 56, 23, 24, 19, 60, 65, 11...
## $ `EASE-MENT`                      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `BUILDING CLASS AT PRESENT`      <chr> "S1", "A5", "A5", "A9", "A9", "A9"...
## $ ADDRESS                          <chr> "8684 15TH AVENUE", "14 BAY 10TH S...
## $ `APARTMENT NUMBER`               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `ZIP CODE`                       <dbl> 11228, 11228, 11228, 11228, 11214,...
## $ `RESIDENTIAL UNITS`              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ `COMMERCIAL UNITS`               <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1...
## $ `TOTAL UNITS`                    <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2...
## $ `LAND SQUARE FEET`               <dbl> 1933, 2513, 1547, 2280, 1571, 2320...
## $ `GROSS SQUARE FEET`              <dbl> 4080, 1428, 1428, 1052, 1456, 1566...
## $ `YEAR BUILT`                     <dbl> 1930, 1930, 1930, 1901, 1935, 1930...
## $ `TAX CLASS AT TIME OF SALE`      <chr> "1", "1", "1", "1", "1", "1", "1",...
## $ `BUILDING CLASS AT TIME OF SALE` <chr> "S1", "A5", "A5", "A9", "A9", "A9"...
## $ `SALE PRICE`                     <dbl> 1300000, 849000, 75000, 0, 830000,...
## $ `SALE DATE`                      <dttm> 2020-04-28, 2020-03-18, 2020-11-3...
dim(NYC_property_sales)
## [1] 62957    21

The combined dataset NYC has 21 columns and 62957 rows.

names(NYC_property_sales)
##  [1] "BOROUGH"                        "NEIGHBORHOOD"                  
##  [3] "BUILDING CLASS CATEGORY"        "TAX CLASS AT PRESENT"          
##  [5] "BLOCK"                          "LOT"                           
##  [7] "EASE-MENT"                      "BUILDING CLASS AT PRESENT"     
##  [9] "ADDRESS"                        "APARTMENT NUMBER"              
## [11] "ZIP CODE"                       "RESIDENTIAL UNITS"             
## [13] "COMMERCIAL UNITS"               "TOTAL UNITS"                   
## [15] "LAND SQUARE FEET"               "GROSS SQUARE FEET"             
## [17] "YEAR BUILT"                     "TAX CLASS AT TIME OF SALE"     
## [19] "BUILDING CLASS AT TIME OF SALE" "SALE PRICE"                    
## [21] "SALE DATE"

Data cleaning

First of all we shall use magnittr and the pipe function to clean the data. We shall convert the column names to lower case and eliminate spaces to make analysis easier.

colnames(NYC_property_sales) %<>% 
  str_replace_all("\\s", "_") %<>% 
  tolower()

The column ‘building_class_category’ has the category number and title together. I’ve split this column into 2 with the category number in one and title in another. Also, removed the ‘EASE-MENT’ column, which had NULL values throughout the entire dataset and the ‘TIME OF SALE’.

# Separated the columns in 'BUILDING CLASS CATEGORY'
nyc_property <- NYC_property_sales %>%
  separate(col = building_class_category, into = c("building_class_category_number","building_class_category"), sep = 3)

# Removed unnecessary columns - 'ease-ment'(column #8)
nyc_property <- nyc_property[,c(-8)]
names(nyc_property)
##  [1] "borough"                        "neighborhood"                  
##  [3] "building_class_category_number" "building_class_category"       
##  [5] "tax_class_at_present"           "block"                         
##  [7] "lot"                            "building_class_at_present"     
##  [9] "address"                        "apartment_number"              
## [11] "zip_code"                       "residential_units"             
## [13] "commercial_units"               "total_units"                   
## [15] "land_square_feet"               "gross_square_feet"             
## [17] "year_built"                     "tax_class_at_time_of_sale"     
## [19] "building_class_at_time_of_sale" "sale_price"                    
## [21] "sale_date"

I now create a new column called ‘Building Age’ transforming the variable, ‘Year Built’. I did this as Building age is a much clearer metric to understand.

# Creating a new column called 'Building Age' transforming the variable, 'Year Built'
nyc_property <- nyc_property %>% 
  mutate(`building_age` = 2020 - `year_built`)

The next step is to check for duplicates in my dataframe.

# Removing the Duplicates in the data frame, nyc_property 
nyc_property %>% 
  filter(duplicated(nyc_property) == TRUE) %>% 
  nrow()
## [1] 56

We have 56 duplicated entries in our data.

nyc_property <- unique(nyc_property)
dim(nyc_property)
## [1] 62901    22

We have eliminated the 56 rows.

Converting data types

glimpse(nyc_property)
## Rows: 62,901
## Columns: 22
## $ borough                        <chr> "3", "3", "3", "3", "3", "3", "3", "...
## $ neighborhood                   <chr> "BATH BEACH", "BATH BEACH", "BATH BE...
## $ building_class_category_number <chr> "01 ", "01 ", "01 ", "01 ", "01 ", "...
## $ building_class_category        <chr> "ONE FAMILY DWELLINGS", "ONE FAMILY ...
## $ tax_class_at_present           <chr> "1", "1", "1", "1", "1", "1", "1", "...
## $ block                          <dbl> 6359, 6360, 6360, 6362, 6367, 6371, ...
## $ lot                            <dbl> 70, 48, 56, 23, 24, 19, 60, 65, 115,...
## $ building_class_at_present      <chr> "S1", "A5", "A5", "A9", "A9", "A9", ...
## $ address                        <chr> "8684 15TH AVENUE", "14 BAY 10TH STR...
## $ apartment_number               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ zip_code                       <dbl> 11228, 11228, 11228, 11228, 11214, 1...
## $ residential_units              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ commercial_units               <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...
## $ total_units                    <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, ...
## $ land_square_feet               <dbl> 1933, 2513, 1547, 2280, 1571, 2320, ...
## $ gross_square_feet              <dbl> 4080, 1428, 1428, 1052, 1456, 1566, ...
## $ year_built                     <dbl> 1930, 1930, 1930, 1901, 1935, 1930, ...
## $ tax_class_at_time_of_sale      <chr> "1", "1", "1", "1", "1", "1", "1", "...
## $ building_class_at_time_of_sale <chr> "S1", "A5", "A5", "A9", "A9", "A9", ...
## $ sale_price                     <dbl> 1300000, 849000, 75000, 0, 830000, 0...
## $ sale_date                      <dttm> 2020-04-28, 2020-03-18, 2020-11-30,...
## $ building_age                   <dbl> 90, 90, 90, 119, 85, 90, 90, 80, 75,...

We need to convert some variables. Borough, Building class category and tax class at present are converted to factors. Block, Lot, Address, Apartment number and Zip code are converted to characters and lastly, land sq feet, gross sq feet, year built and sales price are made numeric.

fac <- c(1,3,4,5,8,11,18,19) #these numbers represent column number/location
nyc_property %<>% 
  mutate_at(fac, funs(factor(.)))
## Warning: `funs()` is deprecated as of dplyr 0.8.0.
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
levels(nyc_property$borough) <- c("Manhattan", "Bronx", "Brooklyn", "Queens", "Staten Island")

num <- c(15,16,17,20)
nyc_property %<>% 
  mutate_at(num, funs(as.numeric(.)))

chr <- c(6,7)
nyc_property %<>% 
  mutate_at(chr, funs(as.character(.)))
glimpse(nyc_property)
## Rows: 62,901
## Columns: 22
## $ borough                        <fct> Brooklyn, Brooklyn, Brooklyn, Brookl...
## $ neighborhood                   <chr> "BATH BEACH", "BATH BEACH", "BATH BE...
## $ building_class_category_number <fct> 01 , 01 , 01 , 01 , 01 , 01 , 01 , 0...
## $ building_class_category        <fct> ONE FAMILY DWELLINGS, ONE FAMILY DWE...
## $ tax_class_at_present           <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ block                          <chr> "6359", "6360", "6360", "6362", "636...
## $ lot                            <chr> "70", "48", "56", "23", "24", "19", ...
## $ building_class_at_present      <fct> S1, A5, A5, A9, A9, A9, A9, A2, A5, ...
## $ address                        <chr> "8684 15TH AVENUE", "14 BAY 10TH STR...
## $ apartment_number               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ zip_code                       <fct> 11228, 11228, 11228, 11228, 11214, 1...
## $ residential_units              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ commercial_units               <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...
## $ total_units                    <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, ...
## $ land_square_feet               <dbl> 1933, 2513, 1547, 2280, 1571, 2320, ...
## $ gross_square_feet              <dbl> 4080, 1428, 1428, 1052, 1456, 1566, ...
## $ year_built                     <dbl> 1930, 1930, 1930, 1901, 1935, 1930, ...
## $ tax_class_at_time_of_sale      <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ building_class_at_time_of_sale <fct> S1, A5, A5, A9, A9, A9, A9, A2, A5, ...
## $ sale_price                     <dbl> 1300000, 849000, 75000, 0, 830000, 0...
## $ sale_date                      <dttm> 2020-04-28, 2020-03-18, 2020-11-30,...
## $ building_age                   <dbl> 90, 90, 90, 119, 85, 90, 90, 80, 75,...

Missing or zero values

Next we check for any missing values.

sum(is.na(nyc_property))
## [1] 158862
colSums(is.na(nyc_property))
##                        borough                   neighborhood 
##                              0                              0 
## building_class_category_number        building_class_category 
##                              0                              0 
##           tax_class_at_present                          block 
##                             29                              0 
##                            lot      building_class_at_present 
##                              0                             29 
##                        address               apartment_number 
##                              0                          48893 
##                       zip_code              residential_units 
##                              1                          13682 
##               commercial_units                    total_units 
##                          24139                          12053 
##               land_square_feet              gross_square_feet 
##                          25768                          25768 
##                     year_built      tax_class_at_time_of_sale 
##                           4250                              0 
## building_class_at_time_of_sale                     sale_price 
##                              0                              0 
##                      sale_date                   building_age 
##                              0                           4250

We find that 11 out of 22 variables have missing values.

sum(NYC_property_sales$sale_price == "-")      #0 
## [1] 0
sum(is.na(nyc_property$sale_price))       # 0 
## [1] 0

I counted the number of missing values in both the original and the modified dataframe. There were none. I also counted the zero values in SALE PRICE variable and found different results. It is important to note that this data set has NA values as well as 0 values. I expect this to be the case for all my numeric fields, SALE PRICE, LAND SQUARE FEET and GROSS SQUARE FEET.

sum(NYC_property_sales$sale_price == 0)                  #20746 
## [1] 20746
nyc_property %>% filter(sale_price == 0) %>% 
  nrow() #20703
## [1] 20703

The difference of 43 must be among the 56 duplicates deleted before.

    1. Sale price
      As the zero values in the SALE PRICE column have a meaning (zero values in Sale Price indicate that this was a sale transfer deed), it would skew the dataset to impute or convert the NA values.
      This data set contains 20703 observations that have 0 values. At this points, I’m removing them and will analyze them separately. I created nyc_salepricezero, for these observations (20703 observations). I also created a clean dataset, nyc_salepricecomplete, by removing all Missing/ Zero values in SALE PRICE from nyc_property. By this step, I removed 20703 observations, leaving nyc_salepricecomplete, with 67.09% of the original data.
# The zero sales prices account for 32.91% of our original data
nyc_salepricezero <- nyc_property %>%
  filter(nyc_property$sale_price == 0)

# The complete sale prices account for 67.09% of our original dataset
nyc_salepricecomplete <- nyc_property %>%
  filter(!nyc_property$sale_price == 0)
    1. Land square feet and gross square feet
      For the variables, LAND SQUARE FEET, GROSS SQUARE FEET, I need to evaluate if missing values and zero values mean the same. Zero sq feet could mean the property that was sold was land only with no enclosed square footage.
nyc_salepricecomplete %>% filter(nyc_salepricecomplete$land_square_feet == 0 | nyc_salepricecomplete$gross_square_feet == 0) %>% nrow()
## [1] 907
explore <- nyc_salepricecomplete %>% filter(nyc_salepricecomplete$land_square_feet == 0 | nyc_salepricecomplete$gross_square_feet == 0)

summary(explore$building_class_category)
##                       ASYLUMS AND HOMES                       COMMERCIAL CONDOS 
##                                       0                                       0 
##                      COMMERCIAL GARAGES                  COMMERCIAL VACANT LAND 
##                                     144                                     163 
##                             CONDO COOPS  CONDO CULTURAL/MEDICAL/EDUCATIONAL/ETC 
##                                       0                                       0 
##                            CONDO HOTELS              CONDO NON-BUSINESS STORAGE 
##                                       0                                       0 
##                  CONDO OFFICE BUILDINGS                           CONDO PARKING 
##                                       0                                       0 
##                   CONDO STORE BUILDINGS          CONDO TERRACES/GARDENS/CABANAS 
##                                       0                                       0 
##          CONDO WAREHOUSES/FACTORY/INDUS          CONDOS - 2-10 UNIT RESIDENTIAL 
##                                       0                                       0 
## CONDOS - 2-10 UNIT WITH COMMERCIAL UNIT            CONDOS - ELEVATOR APARTMENTS 
##                                       0                                       0 
##              CONDOS - WALKUP APARTMENTS             COOPS - ELEVATOR APARTMENTS 
##                                       0                                       0 
##               COOPS - WALKUP APARTMENTS                  EDUCATIONAL FACILITIES 
##                                       0                                       0 
##                               FACTORIES          HOSPITAL AND HEALTH FACILITIES 
##                                       1                                       0 
##   INDOOR PUBLIC AND CULTURAL FACILITIES                           LUXURY HOTELS 
##                                       1                                       0 
##                        OFFICE BUILDINGS                    ONE FAMILY DWELLINGS 
##                                       0                                      12 
##                            OTHER HOTELS         OUTDOOR RECREATIONAL FACILITIES 
##                                       0                                      10 
##                    RELIGIOUS FACILITIES                     RENTALS - 4-10 UNIT 
##                                       3                                       0 
##           RENTALS - ELEVATOR APARTMENTS             RENTALS - WALKUP APARTMENTS 
##                                       1                                       0 
##        SELECTED GOVERNMENTAL FACILITIES              SPECIAL CONDO BILLING LOTS 
##                                       0                                       3 
##                         STORE BUILDINGS                     TAX CLASS 1 - OTHER 
##                                       0                                      73 
##                      TAX CLASS 1 CONDOS                 TAX CLASS 1 VACANT LAND 
##                                       0                                     445 
##                     TAX CLASS 4 - OTHER                                THEATRES 
##                                      44                                       0 
##                  THREE FAMILY DWELLINGS               TRANSPORTATION FACILITIES 
##                                       1                                       3 
##                    TWO FAMILY DWELLINGS                              WAREHOUSES 
##                                       2                                       1

There are 907 rows in nyc_salespricecomplete where Land Sq Footage or Gross Sq Footage is 0. Exploring the Building Class Category variable for the observations where LSF or GSF = 0, we can see that these properties are of different categories. Hence, the assumption of Zero sq footage being land only property is wrong. Hence, 0 and missing values are equivalent in this dataset.
In the nyc_salespricecomplete, ‘Land Square Feet’ and ‘Gross Square Feet’ variables have 20721, 21622 zero/missing values individually.

nyc_salepricecomplete %>% filter(nyc_salepricecomplete$land_square_feet == 0 | is.na(nyc_salepricecomplete$land_square_feet)) %>% nrow()
## [1] 20721
nyc_salepricecomplete %>% filter(nyc_salepricecomplete$gross_square_feet == 0 | is.na(nyc_salepricecomplete$gross_square_feet)) %>% nrow()
## [1] 21622

Lets convert the 0 values to NA’s and store in a new dataset called nyc_data.

nyc_data <- nyc_salepricecomplete 

# converting the zero values in LSF into NAs -> 20721 NAs in `LAND SQUARE FEET`
nyc_data$land_square_feet[nyc_data$land_square_feet == 0] <- NA

# converting the zero values in GSF into NAs - > 21622 NAs in `GROSS SQUARE FEET`
nyc_data$gross_square_feet[nyc_data$gross_square_feet == 0] <- NA

Categorical Variables

Lets explore the categorical variables in our dataset.

levels(nyc_data$borough)
## [1] "Manhattan"     "Bronx"         "Brooklyn"      "Queens"       
## [5] "Staten Island"
levels((nyc_data$building_class_category_number))
##  [1] "01 " "02 " "03 " "04 " "05 " "06 " "07 " "08 " "09 " "10 " "11 " "12 "
## [13] "13 " "14 " "15 " "16 " "17 " "21 " "22 " "25 " "26 " "27 " "28 " "29 "
## [25] "30 " "31 " "32 " "33 " "34 " "35 " "36 " "37 " "38 " "39 " "40 " "41 "
## [37] "42 " "43 " "44 " "45 " "46 " "47 " "48 " "49 "
levels(nyc_data$building_class_category)
##  [1] "ASYLUMS AND HOMES"                      
##  [2] "COMMERCIAL CONDOS"                      
##  [3] "COMMERCIAL GARAGES"                     
##  [4] "COMMERCIAL VACANT LAND"                 
##  [5] "CONDO COOPS"                            
##  [6] "CONDO CULTURAL/MEDICAL/EDUCATIONAL/ETC" 
##  [7] "CONDO HOTELS"                           
##  [8] "CONDO NON-BUSINESS STORAGE"             
##  [9] "CONDO OFFICE BUILDINGS"                 
## [10] "CONDO PARKING"                          
## [11] "CONDO STORE BUILDINGS"                  
## [12] "CONDO TERRACES/GARDENS/CABANAS"         
## [13] "CONDO WAREHOUSES/FACTORY/INDUS"         
## [14] "CONDOS - 2-10 UNIT RESIDENTIAL"         
## [15] "CONDOS - 2-10 UNIT WITH COMMERCIAL UNIT"
## [16] "CONDOS - ELEVATOR APARTMENTS"           
## [17] "CONDOS - WALKUP APARTMENTS"             
## [18] "COOPS - ELEVATOR APARTMENTS"            
## [19] "COOPS - WALKUP APARTMENTS"              
## [20] "EDUCATIONAL FACILITIES"                 
## [21] "FACTORIES"                              
## [22] "HOSPITAL AND HEALTH FACILITIES"         
## [23] "INDOOR PUBLIC AND CULTURAL FACILITIES"  
## [24] "LUXURY HOTELS"                          
## [25] "OFFICE BUILDINGS"                       
## [26] "ONE FAMILY DWELLINGS"                   
## [27] "OTHER HOTELS"                           
## [28] "OUTDOOR RECREATIONAL FACILITIES"        
## [29] "RELIGIOUS FACILITIES"                   
## [30] "RENTALS - 4-10 UNIT"                    
## [31] "RENTALS - ELEVATOR APARTMENTS"          
## [32] "RENTALS - WALKUP APARTMENTS"            
## [33] "SELECTED GOVERNMENTAL FACILITIES"       
## [34] "SPECIAL CONDO BILLING LOTS"             
## [35] "STORE BUILDINGS"                        
## [36] "TAX CLASS 1 - OTHER"                    
## [37] "TAX CLASS 1 CONDOS"                     
## [38] "TAX CLASS 1 VACANT LAND"                
## [39] "TAX CLASS 4 - OTHER"                    
## [40] "THEATRES"                               
## [41] "THREE FAMILY DWELLINGS"                 
## [42] "TRANSPORTATION FACILITIES"              
## [43] "TWO FAMILY DWELLINGS"                   
## [44] "WAREHOUSES"
levels(nyc_data$tax_class_at_present)
##  [1] "1"  "1A" "1B" "1C" "1D" "2"  "2A" "2B" "2C" "4"
table(nyc_data$tax_class_at_present)  
## 
##     1    1A    1B    1C    1D     2    2A    2B    2C     4 
## 18072   781   452   112   108 17501   657   236  1478  2773
levels(nyc_data$tax_class_at_time_of_sale)
## [1] "1" "2" "4"

DATA PREVIEW

library(formattable)
head(nyc_data, n = 10) %>%
  formattable()
borough neighborhood building_class_category_number building_class_category tax_class_at_present block lot building_class_at_present address apartment_number zip_code residential_units commercial_units total_units land_square_feet gross_square_feet year_built tax_class_at_time_of_sale building_class_at_time_of_sale sale_price sale_date building_age
Brooklyn BATH BEACH 01 ONE FAMILY DWELLINGS 1 6359 70 S1 8684 15TH AVENUE NA 11228 1 1 2 1933 4080 1930 1 S1 1300000 2020-04-28 90
Brooklyn BATH BEACH 01 ONE FAMILY DWELLINGS 1 6360 48 A5 14 BAY 10TH STREET NA 11228 1 0 1 2513 1428 1930 1 A5 849000 2020-03-18 90
Brooklyn BATH BEACH 01 ONE FAMILY DWELLINGS 1 6360 56 A5 30 BAY 10TH STREET NA 11228 1 0 1 1547 1428 1930 1 A5 75000 2020-11-30 90
Brooklyn BATH BEACH 01 ONE FAMILY DWELLINGS 1 6367 24 A9 8645 BAY 16TH STREET NA 11214 1 0 1 1571 1456 1935 1 A9 830000 2020-06-26 85
Brooklyn BATH BEACH 01 ONE FAMILY DWELLINGS 1 6371 60 A9 8620 19TH AVENUE NA 11214 1 0 1 2417 2106 1930 1 A9 1188000 2020-07-20 90
Brooklyn BATH BEACH 01 ONE FAMILY DWELLINGS 1 6392 65 A2 140 BAY 8 STREET NA 11228 1 0 1 3222 1804 1940 1 A2 975000 2020-01-21 80
Brooklyn BATH BEACH 01 ONE FAMILY DWELLINGS 1 6392 115 A5 145 BAY 7TH STREET NA 11228 1 0 1 1448 1280 1945 1 A5 870000 2020-10-15 75
Brooklyn BATH BEACH 01 ONE FAMILY DWELLINGS 1 6399 13 A1 155 BAY 14TH ST NA 11214 1 0 1 3680 936 1930 1 A1 1250000 2020-12-23 90
Brooklyn BATH BEACH 01 ONE FAMILY DWELLINGS 1 6409 40 A9 106 BAY 25TH STREET NA 11214 1 0 1 1740 1382 1915 1 A9 805000 2020-12-02 105
Brooklyn BATH BEACH 01 ONE FAMILY DWELLINGS 1 6433 10 A5 1703 CROPSEY AVENUE NA 11214 1 0 1 1438 1568 1937 1 A5 760000 2020-06-29 83

DATA DESCRIPTION

variable.type <- lapply(nyc_data, class)

variable.description <- c(
"Name of the borough where the property is located", 
"Neighbourhood name", 
"Building class category code to identify similar properties", 
"Building class category title to identify similar properties", 
"Assigned tax class of the property in the city - Classes 1, 2, 3 or 4", 
"Sub-division of the borough on which real properties are located", 
"Sub-division of a Tax Block, used to uniquely represent the property location",
"Used to describe a property's constructive use", 
"Property's street address", 
"Property's apartment number", 
"Property's postal code", 
"Number of residential units at the listed property",
"Number of commercial units at the listed property",
"Total number of units at the listed property", 
"Land area of the property listed in square feet", 
"Total area of all the floors of a building", 
"Property's construction year", 
"Assigned tax class of the property in the city at the time of sale",
"Used to describe a property's constructive use at the time of sale",
"Price paid for the property", 
"Date of property sale", 
"Age of the Building")

variable.name <- colnames(nyc_data)

nyc_datadesc <- as_tibble(cbind(variable.name, variable.type, variable.description))

colnames(nyc_datadesc) <- c("Variable Name","Data Type","Variable Description")
library(knitr)
kable(nyc_datadesc)
Variable Name Data Type Variable Description
borough factor Name of the borough where the property is located
neighborhood character Neighbourhood name
building_class_category_number factor Building class category code to identify similar properties
building_class_category factor Building class category title to identify similar properties
tax_class_at_present factor Assigned tax class of the property in the city - Classes 1, 2, 3 or 4
block character Sub-division of the borough on which real properties are located
lot character Sub-division of a Tax Block, used to uniquely represent the property location
building_class_at_present factor Used to describe a property’s constructive use
address character Property’s street address
apartment_number character Property’s apartment number
zip_code factor Property’s postal code
residential_units numeric Number of residential units at the listed property
commercial_units numeric Number of commercial units at the listed property
total_units numeric Total number of units at the listed property
land_square_feet numeric Land area of the property listed in square feet
gross_square_feet numeric Total area of all the floors of a building
year_built numeric Property’s construction year
tax_class_at_time_of_sale factor Assigned tax class of the property in the city at the time of sale
building_class_at_time_of_sale factor Used to describe a property’s constructive use at the time of sale
sale_price numeric Price paid for the property
sale_date c(“POSIXct”, “POSIXt”) Date of property sale
building_age numeric Age of the Building

ANALYSIS

The analysis of the NYC Property Sales data can be found under the following tabs. The Descriptive Statistics section explores each individual variable, while relationships within variables are explored using visualizations.

Descriptive analysis

dim(nyc_data)
## [1] 42198    22

Our cleaned dataset has 22 columns and 42198 rows.
Lets explore some numeric variables in our data set.

  1. Sale price.
quantile(nyc_data$sale_price, probs = seq(from = 0, to = 1, by = .1))
##        0%       10%       20%       30%       40%       50%       60%       70% 
##         1    250000    392000    502500    600000    700000    830000    990000 
##       80%       90%      100% 
##   1380000   2495000 978090439
nyc_data %>% filter(sale_price <= 1000) %>% nrow()
## [1] 946

The 10th percentile lies at (Usd 250,000) and the price ranges on to (Usd 2,495,000). This means there is something for everyone in New York.
Our highest price is slightly less that $1billion (978,090,439).

If we check out this observation;

nyc_data%>%
  filter(sale_price>900000000)
## # A tibble: 1 x 22
##   borough neighborhood building_class_~ building_class_~ tax_class_at_pr~ block
##   <fct>   <chr>        <fct>            <fct>            <fct>            <chr>
## 1 Manhat~ FASHION      "22 "            STORE BUILDINGS  4                840  
## # ... with 16 more variables: lot <chr>, building_class_at_present <fct>,
## #   address <chr>, apartment_number <chr>, zip_code <fct>,
## #   residential_units <dbl>, commercial_units <dbl>, total_units <dbl>,
## #   land_square_feet <dbl>, gross_square_feet <dbl>, year_built <dbl>,
## #   tax_class_at_time_of_sale <fct>, building_class_at_time_of_sale <fct>,
## #   sale_price <dbl>, sale_date <dttm>, building_age <dbl>

The building appears to be exactly 100yrs old. It is probably a public monument. The outlier can be explained.

  1. Land Square feet
summary(nyc_data$land_square_feet)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      45    2000    2506   16011    4000 7649136   20721
quantile(nyc_data$land_square_feet, probs = seq(from = 0, to = 1, by = .1), na.rm = TRUE)
##        0%       10%       20%       30%       40%       50%       60%       70% 
##      45.0    1700.6    1976.0    2038.0    2400.0    2506.0    3000.0    3750.0 
##       80%       90%      100% 
##    4041.6    5700.0 7649136.0
nyc_data %>%
  filter(land_square_feet > 500000) %>%
  tail(10)
## # A tibble: 10 x 22
##    borough neighborhood building_class_~ building_class_~ tax_class_at_pr~ block
##    <fct>   <chr>        <fct>            <fct>            <fct>            <chr>
##  1 Bronx   HUNTS POINT  "41 "            TAX CLASS 4 - O~ 4                2770 
##  2 Bronx   MOTT HAVEN/~ "36 "            OUTDOOR RECREAT~ 4                2557 
##  3 Manhat~ KIPS BAY     "08 "            RENTALS - ELEVA~ 2                972  
##  4 Manhat~ MIDTOWN EAST "41 "            TAX CLASS 4 - O~ 4                1354 
##  5 Staten~ ANNADALE     "05 "            TAX CLASS 1 VAC~ 1B               6431 
##  6 Staten~ BLOOMFIELD   "27 "            FACTORIES        4                1801 
##  7 Staten~ ROSSVILLE-C~ "31 "            COMMERCIAL VACA~ 4                7328 
##  8 Staten~ ROSSVILLE-P~ "27 "            FACTORIES        4                7247 
##  9 Staten~ ROSSVILLE-P~ "31 "            COMMERCIAL VACA~ 4                7207 
## 10 Staten~ TOTTENVILLE  "29 "            COMMERCIAL GARA~ 4                7971 
## # ... with 16 more variables: lot <chr>, building_class_at_present <fct>,
## #   address <chr>, apartment_number <chr>, zip_code <fct>,
## #   residential_units <dbl>, commercial_units <dbl>, total_units <dbl>,
## #   land_square_feet <dbl>, gross_square_feet <dbl>, year_built <dbl>,
## #   tax_class_at_time_of_sale <fct>, building_class_at_time_of_sale <fct>,
## #   sale_price <dbl>, sale_date <dttm>, building_age <dbl>

Assuming a ball-park value that buildings that have > 500,000 land sq footage should be ballpark, I explored them further. I find that I have 22 outlier values. These buildings happen to be in specific neighbourhoods most especially Hunts Point and these buildings fall under building class category of warehouses, tax class 4, outdoor recreational facilities, factories and commercial vacant land. This makes perfect sense.

  1. Gross square feet
summary(nyc_data$gross_square_feet)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     100    1360    1872    9193    2640 8942176   21622
quantile(nyc_data$gross_square_feet, probs = seq(from = 0, to = 1, by= .1), na.rm = TRUE)
##      0%     10%     20%     30%     40%     50%     60%     70%     80%     90% 
##     100    1120    1284    1444    1636    1872    2104    2400    2960    4000 
##    100% 
## 8942176
nyc_data %>%
  filter(gross_square_feet > 500000)
## # A tibble: 62 x 22
##    borough neighborhood building_class_~ building_class_~ tax_class_at_pr~ block
##    <fct>   <chr>        <fct>            <fct>            <fct>            <chr>
##  1 Bronx   HUNTS POINT  "30 "            WAREHOUSES       4                2781 
##  2 Bronx   HUNTS POINT  "30 "            WAREHOUSES       4                2781 
##  3 Bronx   HUNTS POINT  "30 "            WAREHOUSES       4                2781 
##  4 Bronx   HUNTS POINT  "30 "            WAREHOUSES       4                2781 
##  5 Bronx   HUNTS POINT  "30 "            WAREHOUSES       4                2781 
##  6 Bronx   HUNTS POINT  "30 "            WAREHOUSES       4                2781 
##  7 Bronx   HUNTS POINT  "30 "            WAREHOUSES       4                2781 
##  8 Bronx   HUNTS POINT  "30 "            WAREHOUSES       4                2781 
##  9 Bronx   HUNTS POINT  "41 "            TAX CLASS 4 - O~ 4                2770 
## 10 Bronx   HUNTS POINT  "41 "            TAX CLASS 4 - O~ 4                2770 
## # ... with 52 more rows, and 16 more variables: lot <chr>,
## #   building_class_at_present <fct>, address <chr>, apartment_number <chr>,
## #   zip_code <fct>, residential_units <dbl>, commercial_units <dbl>,
## #   total_units <dbl>, land_square_feet <dbl>, gross_square_feet <dbl>,
## #   year_built <dbl>, tax_class_at_time_of_sale <fct>,
## #   building_class_at_time_of_sale <fct>, sale_price <dbl>, sale_date <dttm>,
## #   building_age <dbl>

On investigating the buildings with extreme gross square feet, I found that the said buildings were warehouses, store buildings, office buildings, luxury hotels and tax class 4.

  1. Building age
summary(nyc_data$building_age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   -1.00   45.00   70.00   67.01   95.00 1002.00    2868
quantile(nyc_data$building_age, probs = seq(from = 0, to = 1, by = .1), na.rm =  TRUE)
##   0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
##   -1    8   31   55   62   70   85   90  100  110 1002

Lets look at these really old buildings.

nyc_data %>%
  filter(building_age > 500)
## # A tibble: 2 x 22
##   borough neighborhood building_class_~ building_class_~ tax_class_at_pr~ block
##   <fct>   <chr>        <fct>            <fct>            <fct>            <chr>
## 1 Manhat~ CHELSEA      "29 "            COMMERCIAL GARA~ 4                799  
## 2 Queens  ARVERNE      "02 "            TWO FAMILY DWEL~ 1                15915
## # ... with 16 more variables: lot <chr>, building_class_at_present <fct>,
## #   address <chr>, apartment_number <chr>, zip_code <fct>,
## #   residential_units <dbl>, commercial_units <dbl>, total_units <dbl>,
## #   land_square_feet <dbl>, gross_square_feet <dbl>, year_built <dbl>,
## #   tax_class_at_time_of_sale <fct>, building_class_at_time_of_sale <fct>,
## #   sale_price <dbl>, sale_date <dttm>, building_age <dbl>

The oldest buildings, ie over 500 yrs old, are two buildings; 909 yrs and 1002 yrs old. One is a commercial garage in Manhattan while the other is a two family dwelling at Queens.

nyc_data %>%
  filter(building_age < 0)
## # A tibble: 1 x 22
##   borough neighborhood building_class_~ building_class_~ tax_class_at_pr~ block
##   <fct>   <chr>        <fct>            <fct>            <fct>            <chr>
## 1 Bronx   RIVERDALE    "05 "            TAX CLASS 1 VAC~ 1B               5731 
## # ... with 16 more variables: lot <chr>, building_class_at_present <fct>,
## #   address <chr>, apartment_number <chr>, zip_code <fct>,
## #   residential_units <dbl>, commercial_units <dbl>, total_units <dbl>,
## #   land_square_feet <dbl>, gross_square_feet <dbl>, year_built <dbl>,
## #   tax_class_at_time_of_sale <fct>, building_class_at_time_of_sale <fct>,
## #   sale_price <dbl>, sale_date <dttm>, building_age <dbl>

The -1 raises questions. On investigating, it is vacant land. I believe construction was to commence in 2021.

Visualization

Let us conduct some data visualization analysis. We shall focus on the various boroughs, neighbourhoods and building classes.

  1. Boroughs

I started the analysis trying to figure out which Borough recored the most property sales last year. I also wanted to understand how much an average property sale price would be across the boroughs in NYC.

Most in demand and expensive boroughs in NYC

p1 <- ggplot(data = nyc_property, aes(x = (borough))) +
  geom_bar(fill = "blue") +
  ggtitle("Most In-Demand Borough in NYC", subtitle = "Borough-wise # of property sales in NYC") +
  scale_y_continuous("# of Property Sales", labels = scales::comma) +
  scale_x_discrete("Borough")

p2 <- ggplot(data = nyc_data, aes(x = (borough), y = mean(sale_price))) +
  geom_bar(stat = "identity", color = "darkgreen") +
  ggtitle("Most Expensive Borough in NYC", subtitle = "Borough-wise Avg Property Sale Price in NYC") +
  scale_y_continuous("Avg Sale Price", labels = scales::dollar) +
  scale_x_discrete("Borough")

gridExtra::grid.arrange(p1, p2, nrow = 2)

The plots above showed me that Queens had the most number of property sales, followed by Brookyln then Manhattan. I find this surprising as I expected properties in Manhattan to cost more. Let’s explore this further with the other fields.

  1. Neighbourhoods
df1 <- as.data.frame(table(nyc_property$borough, nyc_property$neighborhood))
names(df1) <- c('BOROUGH','NEIGHBORHOOD', 'Freq')
df1 <- df1 %>% 
  arrange(desc(Freq)) %>% 
  head(10)

p1 <- ggplot(df1, aes(x = `NEIGHBORHOOD`, y = `Freq`, fill = `BOROUGH`)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  ggtitle("Most-in demand Neighborhood in NYC", subtitle = "Top Neighborhoods by Number") +
  theme(legend.position = "bottom") +
  scale_y_continuous("# of Sales", labels = scales::comma) +
  scale_x_discrete("Neighborhood") 

p1

df2 <- 
  nyc_data %>% group_by(borough, neighborhood) %>% 
  summarise(MeanSP = mean(sale_price)) %>% 
  arrange(desc(MeanSP)) %>% 
  head(10)
## `summarise()` regrouping output by 'borough' (override with `.groups` argument)
p2 <- ggplot(data = df2, aes(x = neighborhood, y = MeanSP, fill = borough)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme(legend.position = "bottom") +
  ggtitle("Most Expensive Neighborhoods in NYC", 
          subtitle = "Top Neighborhoods by Avg Price") +
  scale_y_continuous("Avg Sale Price", labels = scales::dollar) +
  scale_x_discrete("Neighborhood") 

p2

df3 <- 
  nyc_data %>% group_by(borough, neighborhood) %>% 
  summarise(MeanSP = mean(sale_price)) %>% 
  arrange(MeanSP) %>%  ## Note the order is ascending
  head(10)
## `summarise()` regrouping output by 'borough' (override with `.groups` argument)
ggplot(data = df3, aes(x = neighborhood, y = MeanSP, fill = borough)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme(legend.position = "bottom") +
  ggtitle("Least Expensive Neighborhoods in NYC", subtitle = "Top Neighborhoods by the lowest avg. Price") +
  scale_y_continuous("Avg Sale Price", labels = scales::dollar) +
  scale_x_discrete("Neighborhood") 

  1. Buildings
df4<- as.data.frame(table(nyc_property$borough, nyc_property$building_class_category))
names(df4) <- c("Borough", "Building class category", "Freq")
df4 <- df4 %>%
  group_by(Borough) %>%
  arrange(desc(Freq)) %>%
  head(10)

p4 <- ggplot(df4, aes(x = Borough, y = Freq, fill = `Building class category`)) +
  geom_bar(stat = "identity", position = "dodge") +
  ggtitle("Most in-demand Buildings in NYC by Borough", subtitle = "Top types of Building sold in NYC") +
  scale_y_continuous("No. of Sales", labels = scales::comma) +
  scale_x_discrete("Borough")

p4

df5 <- nyc_data %>%
  group_by(borough, building_class_category) %>%
  summarise(MeanSP = mean(sale_price)) %>%
  arrange(desc(MeanSP)) %>%
  head(10)
## `summarise()` regrouping output by 'borough' (override with `.groups` argument)
ggplot(df5, aes(x= building_class_category, y = MeanSP, fill = borough)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() + 
  theme(legend.position = "bottom") +
  ggtitle("Most expensive building types in NYC", subtitle = "Top ten types of property by value in NYC as at 2020")+
  scale_y_continuous("Average Sale Price", labels = scales::dollar) +
  scale_x_discrete("Building type")

df6 <- nyc_data %>%
  group_by(borough, building_class_category) %>%
  summarise(MeanSP = mean(sale_price)) %>%
  arrange(MeanSP) %>%
  head(10)
## `summarise()` regrouping output by 'borough' (override with `.groups` argument)
ggplot(df6, aes(x= building_class_category, y = MeanSP, fill = borough)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() + 
  theme(legend.position = "bottom") +
  ggtitle("Least expensive building types in NYC", subtitle = "Top ten cheapest types of property by value in NYC as at 2020")+
  scale_y_continuous("Average Sale Price", labels = scales::dollar) +
  scale_x_discrete("Building type")