New York City is repeatedly named as among the most expensive cities in the world to buy real estate. The Dataset contains information of all the property sales in NYC from January 2020 to December 2020. With such a recent data set, I was able to analyze trends about NYC real estate market borough and neighbourhood-wise.
My analysis of the NYC Real estate market is broken down into the following sections.
Here, we take a look at variables such as Borough, Neighborhood, Age of the building/property, Size of property and type of building.
The descriptive statistics section speaks about the distribution of each variable and makes necessary changes for the analysis.
I’ve also tried to isolate any potential outliers for the variable that will need special attention.
The visualization section is split into a few broad categories for need of clarity. The tabs within the viz section explore one variable at a time with respect to the important numerical fields.
1. Most In-Demand Borough - Where did New Yorkers buy their properties last year?
2. Most In-Demand Neighborhood - Which neighborhood do New Yorkers prefer?
3. The Hottest Buildings - What kind of properties do they buy?
4. Property sizes in NYC/ Square footage - Does more money mean larger properties in NYC?
5. Age of the buildings in NYC - Does more money mean newer properties in NYC?
Visualizing each of these plots showed me that many of these variables will be an important predictor of Sales Price, forming the basis for a potential predictive modeling.
library(data.table)
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.1 v dplyr 1.0.1
## v tidyr 1.0.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## Warning: package 'ggplot2' was built under R version 4.0.2
## Warning: package 'readr' was built under R version 4.0.2
## Warning: package 'dplyr' was built under R version 4.0.2
## -- Conflicts ----------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::between() masks data.table::between()
## x dplyr::filter() masks stats::filter()
## x dplyr::first() masks data.table::first()
## x dplyr::lag() masks stats::lag()
## x dplyr::last() masks data.table::last()
## x purrr::transpose() masks data.table::transpose()
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(magrittr)
## Warning: package 'magrittr' was built under R version 4.0.3
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
library(formattable)
## Warning: package 'formattable' was built under R version 4.0.4
library(knitr)
library(modelr)
library(plotly)
## Warning: package 'plotly' was built under R version 4.0.2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:formattable':
##
## style
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(readxl)
This section contains the logic and the steps performed in bringing the data to a form suitable for statistical analysis. The first step is to import the data.
My data is seperated by boroughs however we can combine the data as follows.
I used read_excel because it tends to perform some degree of cleaning the data during importation. We skip the first 4 rows because they contain description of the data.
brooklyn <- read_excel("C:/Users/Folio 9480/Downloads/rollingsales_brooklyn.xls", skip = 4)
bronx <- read_excel("C:/Users/Folio 9480/Downloads/rollingsales_bronx.xls", skip = 4)
manhattan <- read_excel("C:/Users/Folio 9480/Downloads/rollingsales_manhattan.xls", skip = 4)
queens <- read_excel("C:/Users/Folio 9480/Downloads/rollingsales_queens.xls", skip = 4)
statenisland <- read_excel("C:/Users/Folio 9480/Downloads/rollingsales_statenisland.xls", skip = 4)
Combining data to one dataset is only possible if the datasets involves have similar columns which is the case in our datasets.
NYC_property_sales <- bind_rows(brooklyn, bronx, manhattan, queens, statenisland)
class(NYC_property_sales)
## [1] "tbl_df" "tbl" "data.frame"
glimpse(NYC_property_sales)
## Rows: 62,957
## Columns: 21
## $ BOROUGH <chr> "3", "3", "3", "3", "3", "3", "3",...
## $ NEIGHBORHOOD <chr> "BATH BEACH", "BATH BEACH", "BATH ...
## $ `BUILDING CLASS CATEGORY` <chr> "01 ONE FAMILY DWELLINGS", "01 ONE...
## $ `TAX CLASS AT PRESENT` <chr> "1", "1", "1", "1", "1", "1", "1",...
## $ BLOCK <dbl> 6359, 6360, 6360, 6362, 6367, 6371...
## $ LOT <dbl> 70, 48, 56, 23, 24, 19, 60, 65, 11...
## $ `EASE-MENT` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `BUILDING CLASS AT PRESENT` <chr> "S1", "A5", "A5", "A9", "A9", "A9"...
## $ ADDRESS <chr> "8684 15TH AVENUE", "14 BAY 10TH S...
## $ `APARTMENT NUMBER` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `ZIP CODE` <dbl> 11228, 11228, 11228, 11228, 11214,...
## $ `RESIDENTIAL UNITS` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ `COMMERCIAL UNITS` <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1...
## $ `TOTAL UNITS` <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2...
## $ `LAND SQUARE FEET` <dbl> 1933, 2513, 1547, 2280, 1571, 2320...
## $ `GROSS SQUARE FEET` <dbl> 4080, 1428, 1428, 1052, 1456, 1566...
## $ `YEAR BUILT` <dbl> 1930, 1930, 1930, 1901, 1935, 1930...
## $ `TAX CLASS AT TIME OF SALE` <chr> "1", "1", "1", "1", "1", "1", "1",...
## $ `BUILDING CLASS AT TIME OF SALE` <chr> "S1", "A5", "A5", "A9", "A9", "A9"...
## $ `SALE PRICE` <dbl> 1300000, 849000, 75000, 0, 830000,...
## $ `SALE DATE` <dttm> 2020-04-28, 2020-03-18, 2020-11-3...
dim(NYC_property_sales)
## [1] 62957 21
The combined dataset NYC has 21 columns and 62957 rows.
names(NYC_property_sales)
## [1] "BOROUGH" "NEIGHBORHOOD"
## [3] "BUILDING CLASS CATEGORY" "TAX CLASS AT PRESENT"
## [5] "BLOCK" "LOT"
## [7] "EASE-MENT" "BUILDING CLASS AT PRESENT"
## [9] "ADDRESS" "APARTMENT NUMBER"
## [11] "ZIP CODE" "RESIDENTIAL UNITS"
## [13] "COMMERCIAL UNITS" "TOTAL UNITS"
## [15] "LAND SQUARE FEET" "GROSS SQUARE FEET"
## [17] "YEAR BUILT" "TAX CLASS AT TIME OF SALE"
## [19] "BUILDING CLASS AT TIME OF SALE" "SALE PRICE"
## [21] "SALE DATE"
First of all we shall use magnittr and the pipe function to clean the data. We shall convert the column names to lower case and eliminate spaces to make analysis easier.
colnames(NYC_property_sales) %<>%
str_replace_all("\\s", "_") %<>%
tolower()
The column ‘building_class_category’ has the category number and title together. I’ve split this column into 2 with the category number in one and title in another. Also, removed the ‘EASE-MENT’ column, which had NULL values throughout the entire dataset and the ‘TIME OF SALE’.
# Separated the columns in 'BUILDING CLASS CATEGORY'
nyc_property <- NYC_property_sales %>%
separate(col = building_class_category, into = c("building_class_category_number","building_class_category"), sep = 3)
# Removed unnecessary columns - 'ease-ment'(column #8)
nyc_property <- nyc_property[,c(-8)]
names(nyc_property)
## [1] "borough" "neighborhood"
## [3] "building_class_category_number" "building_class_category"
## [5] "tax_class_at_present" "block"
## [7] "lot" "building_class_at_present"
## [9] "address" "apartment_number"
## [11] "zip_code" "residential_units"
## [13] "commercial_units" "total_units"
## [15] "land_square_feet" "gross_square_feet"
## [17] "year_built" "tax_class_at_time_of_sale"
## [19] "building_class_at_time_of_sale" "sale_price"
## [21] "sale_date"
I now create a new column called ‘Building Age’ transforming the variable, ‘Year Built’. I did this as Building age is a much clearer metric to understand.
# Creating a new column called 'Building Age' transforming the variable, 'Year Built'
nyc_property <- nyc_property %>%
mutate(`building_age` = 2020 - `year_built`)
The next step is to check for duplicates in my dataframe.
# Removing the Duplicates in the data frame, nyc_property
nyc_property %>%
filter(duplicated(nyc_property) == TRUE) %>%
nrow()
## [1] 56
We have 56 duplicated entries in our data.
nyc_property <- unique(nyc_property)
dim(nyc_property)
## [1] 62901 22
We have eliminated the 56 rows.
glimpse(nyc_property)
## Rows: 62,901
## Columns: 22
## $ borough <chr> "3", "3", "3", "3", "3", "3", "3", "...
## $ neighborhood <chr> "BATH BEACH", "BATH BEACH", "BATH BE...
## $ building_class_category_number <chr> "01 ", "01 ", "01 ", "01 ", "01 ", "...
## $ building_class_category <chr> "ONE FAMILY DWELLINGS", "ONE FAMILY ...
## $ tax_class_at_present <chr> "1", "1", "1", "1", "1", "1", "1", "...
## $ block <dbl> 6359, 6360, 6360, 6362, 6367, 6371, ...
## $ lot <dbl> 70, 48, 56, 23, 24, 19, 60, 65, 115,...
## $ building_class_at_present <chr> "S1", "A5", "A5", "A9", "A9", "A9", ...
## $ address <chr> "8684 15TH AVENUE", "14 BAY 10TH STR...
## $ apartment_number <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ zip_code <dbl> 11228, 11228, 11228, 11228, 11214, 1...
## $ residential_units <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ commercial_units <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...
## $ total_units <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, ...
## $ land_square_feet <dbl> 1933, 2513, 1547, 2280, 1571, 2320, ...
## $ gross_square_feet <dbl> 4080, 1428, 1428, 1052, 1456, 1566, ...
## $ year_built <dbl> 1930, 1930, 1930, 1901, 1935, 1930, ...
## $ tax_class_at_time_of_sale <chr> "1", "1", "1", "1", "1", "1", "1", "...
## $ building_class_at_time_of_sale <chr> "S1", "A5", "A5", "A9", "A9", "A9", ...
## $ sale_price <dbl> 1300000, 849000, 75000, 0, 830000, 0...
## $ sale_date <dttm> 2020-04-28, 2020-03-18, 2020-11-30,...
## $ building_age <dbl> 90, 90, 90, 119, 85, 90, 90, 80, 75,...
We need to convert some variables. Borough, Building class category and tax class at present are converted to factors. Block, Lot, Address, Apartment number and Zip code are converted to characters and lastly, land sq feet, gross sq feet, year built and sales price are made numeric.
fac <- c(1,3,4,5,8,11,18,19) #these numbers represent column number/location
nyc_property %<>%
mutate_at(fac, funs(factor(.)))
## Warning: `funs()` is deprecated as of dplyr 0.8.0.
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
levels(nyc_property$borough) <- c("Manhattan", "Bronx", "Brooklyn", "Queens", "Staten Island")
num <- c(15,16,17,20)
nyc_property %<>%
mutate_at(num, funs(as.numeric(.)))
chr <- c(6,7)
nyc_property %<>%
mutate_at(chr, funs(as.character(.)))
glimpse(nyc_property)
## Rows: 62,901
## Columns: 22
## $ borough <fct> Brooklyn, Brooklyn, Brooklyn, Brookl...
## $ neighborhood <chr> "BATH BEACH", "BATH BEACH", "BATH BE...
## $ building_class_category_number <fct> 01 , 01 , 01 , 01 , 01 , 01 , 01 , 0...
## $ building_class_category <fct> ONE FAMILY DWELLINGS, ONE FAMILY DWE...
## $ tax_class_at_present <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ block <chr> "6359", "6360", "6360", "6362", "636...
## $ lot <chr> "70", "48", "56", "23", "24", "19", ...
## $ building_class_at_present <fct> S1, A5, A5, A9, A9, A9, A9, A2, A5, ...
## $ address <chr> "8684 15TH AVENUE", "14 BAY 10TH STR...
## $ apartment_number <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ zip_code <fct> 11228, 11228, 11228, 11228, 11214, 1...
## $ residential_units <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ commercial_units <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...
## $ total_units <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, ...
## $ land_square_feet <dbl> 1933, 2513, 1547, 2280, 1571, 2320, ...
## $ gross_square_feet <dbl> 4080, 1428, 1428, 1052, 1456, 1566, ...
## $ year_built <dbl> 1930, 1930, 1930, 1901, 1935, 1930, ...
## $ tax_class_at_time_of_sale <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ building_class_at_time_of_sale <fct> S1, A5, A5, A9, A9, A9, A9, A2, A5, ...
## $ sale_price <dbl> 1300000, 849000, 75000, 0, 830000, 0...
## $ sale_date <dttm> 2020-04-28, 2020-03-18, 2020-11-30,...
## $ building_age <dbl> 90, 90, 90, 119, 85, 90, 90, 80, 75,...
Next we check for any missing values.
sum(is.na(nyc_property))
## [1] 158862
colSums(is.na(nyc_property))
## borough neighborhood
## 0 0
## building_class_category_number building_class_category
## 0 0
## tax_class_at_present block
## 29 0
## lot building_class_at_present
## 0 29
## address apartment_number
## 0 48893
## zip_code residential_units
## 1 13682
## commercial_units total_units
## 24139 12053
## land_square_feet gross_square_feet
## 25768 25768
## year_built tax_class_at_time_of_sale
## 4250 0
## building_class_at_time_of_sale sale_price
## 0 0
## sale_date building_age
## 0 4250
We find that 11 out of 22 variables have missing values.
sum(NYC_property_sales$sale_price == "-") #0
## [1] 0
sum(is.na(nyc_property$sale_price)) # 0
## [1] 0
I counted the number of missing values in both the original and the modified dataframe. There were none. I also counted the zero values in SALE PRICE variable and found different results. It is important to note that this data set has NA values as well as 0 values. I expect this to be the case for all my numeric fields, SALE PRICE, LAND SQUARE FEET and GROSS SQUARE FEET.
sum(NYC_property_sales$sale_price == 0) #20746
## [1] 20746
nyc_property %>% filter(sale_price == 0) %>%
nrow() #20703
## [1] 20703
The difference of 43 must be among the 56 duplicates deleted before.
# The zero sales prices account for 32.91% of our original data
nyc_salepricezero <- nyc_property %>%
filter(nyc_property$sale_price == 0)
# The complete sale prices account for 67.09% of our original dataset
nyc_salepricecomplete <- nyc_property %>%
filter(!nyc_property$sale_price == 0)
nyc_salepricecomplete %>% filter(nyc_salepricecomplete$land_square_feet == 0 | nyc_salepricecomplete$gross_square_feet == 0) %>% nrow()
## [1] 907
explore <- nyc_salepricecomplete %>% filter(nyc_salepricecomplete$land_square_feet == 0 | nyc_salepricecomplete$gross_square_feet == 0)
summary(explore$building_class_category)
## ASYLUMS AND HOMES COMMERCIAL CONDOS
## 0 0
## COMMERCIAL GARAGES COMMERCIAL VACANT LAND
## 144 163
## CONDO COOPS CONDO CULTURAL/MEDICAL/EDUCATIONAL/ETC
## 0 0
## CONDO HOTELS CONDO NON-BUSINESS STORAGE
## 0 0
## CONDO OFFICE BUILDINGS CONDO PARKING
## 0 0
## CONDO STORE BUILDINGS CONDO TERRACES/GARDENS/CABANAS
## 0 0
## CONDO WAREHOUSES/FACTORY/INDUS CONDOS - 2-10 UNIT RESIDENTIAL
## 0 0
## CONDOS - 2-10 UNIT WITH COMMERCIAL UNIT CONDOS - ELEVATOR APARTMENTS
## 0 0
## CONDOS - WALKUP APARTMENTS COOPS - ELEVATOR APARTMENTS
## 0 0
## COOPS - WALKUP APARTMENTS EDUCATIONAL FACILITIES
## 0 0
## FACTORIES HOSPITAL AND HEALTH FACILITIES
## 1 0
## INDOOR PUBLIC AND CULTURAL FACILITIES LUXURY HOTELS
## 1 0
## OFFICE BUILDINGS ONE FAMILY DWELLINGS
## 0 12
## OTHER HOTELS OUTDOOR RECREATIONAL FACILITIES
## 0 10
## RELIGIOUS FACILITIES RENTALS - 4-10 UNIT
## 3 0
## RENTALS - ELEVATOR APARTMENTS RENTALS - WALKUP APARTMENTS
## 1 0
## SELECTED GOVERNMENTAL FACILITIES SPECIAL CONDO BILLING LOTS
## 0 3
## STORE BUILDINGS TAX CLASS 1 - OTHER
## 0 73
## TAX CLASS 1 CONDOS TAX CLASS 1 VACANT LAND
## 0 445
## TAX CLASS 4 - OTHER THEATRES
## 44 0
## THREE FAMILY DWELLINGS TRANSPORTATION FACILITIES
## 1 3
## TWO FAMILY DWELLINGS WAREHOUSES
## 2 1
There are 907 rows in nyc_salespricecomplete where Land Sq Footage or Gross Sq Footage is 0. Exploring the Building Class Category variable for the observations where LSF or GSF = 0, we can see that these properties are of different categories. Hence, the assumption of Zero sq footage being land only property is wrong. Hence, 0 and missing values are equivalent in this dataset.
In the nyc_salespricecomplete, ‘Land Square Feet’ and ‘Gross Square Feet’ variables have 20721, 21622 zero/missing values individually.
nyc_salepricecomplete %>% filter(nyc_salepricecomplete$land_square_feet == 0 | is.na(nyc_salepricecomplete$land_square_feet)) %>% nrow()
## [1] 20721
nyc_salepricecomplete %>% filter(nyc_salepricecomplete$gross_square_feet == 0 | is.na(nyc_salepricecomplete$gross_square_feet)) %>% nrow()
## [1] 21622
Lets convert the 0 values to NA’s and store in a new dataset called nyc_data.
nyc_data <- nyc_salepricecomplete
# converting the zero values in LSF into NAs -> 20721 NAs in `LAND SQUARE FEET`
nyc_data$land_square_feet[nyc_data$land_square_feet == 0] <- NA
# converting the zero values in GSF into NAs - > 21622 NAs in `GROSS SQUARE FEET`
nyc_data$gross_square_feet[nyc_data$gross_square_feet == 0] <- NA
Lets explore the categorical variables in our dataset.
levels(nyc_data$borough)
## [1] "Manhattan" "Bronx" "Brooklyn" "Queens"
## [5] "Staten Island"
levels((nyc_data$building_class_category_number))
## [1] "01 " "02 " "03 " "04 " "05 " "06 " "07 " "08 " "09 " "10 " "11 " "12 "
## [13] "13 " "14 " "15 " "16 " "17 " "21 " "22 " "25 " "26 " "27 " "28 " "29 "
## [25] "30 " "31 " "32 " "33 " "34 " "35 " "36 " "37 " "38 " "39 " "40 " "41 "
## [37] "42 " "43 " "44 " "45 " "46 " "47 " "48 " "49 "
levels(nyc_data$building_class_category)
## [1] "ASYLUMS AND HOMES"
## [2] "COMMERCIAL CONDOS"
## [3] "COMMERCIAL GARAGES"
## [4] "COMMERCIAL VACANT LAND"
## [5] "CONDO COOPS"
## [6] "CONDO CULTURAL/MEDICAL/EDUCATIONAL/ETC"
## [7] "CONDO HOTELS"
## [8] "CONDO NON-BUSINESS STORAGE"
## [9] "CONDO OFFICE BUILDINGS"
## [10] "CONDO PARKING"
## [11] "CONDO STORE BUILDINGS"
## [12] "CONDO TERRACES/GARDENS/CABANAS"
## [13] "CONDO WAREHOUSES/FACTORY/INDUS"
## [14] "CONDOS - 2-10 UNIT RESIDENTIAL"
## [15] "CONDOS - 2-10 UNIT WITH COMMERCIAL UNIT"
## [16] "CONDOS - ELEVATOR APARTMENTS"
## [17] "CONDOS - WALKUP APARTMENTS"
## [18] "COOPS - ELEVATOR APARTMENTS"
## [19] "COOPS - WALKUP APARTMENTS"
## [20] "EDUCATIONAL FACILITIES"
## [21] "FACTORIES"
## [22] "HOSPITAL AND HEALTH FACILITIES"
## [23] "INDOOR PUBLIC AND CULTURAL FACILITIES"
## [24] "LUXURY HOTELS"
## [25] "OFFICE BUILDINGS"
## [26] "ONE FAMILY DWELLINGS"
## [27] "OTHER HOTELS"
## [28] "OUTDOOR RECREATIONAL FACILITIES"
## [29] "RELIGIOUS FACILITIES"
## [30] "RENTALS - 4-10 UNIT"
## [31] "RENTALS - ELEVATOR APARTMENTS"
## [32] "RENTALS - WALKUP APARTMENTS"
## [33] "SELECTED GOVERNMENTAL FACILITIES"
## [34] "SPECIAL CONDO BILLING LOTS"
## [35] "STORE BUILDINGS"
## [36] "TAX CLASS 1 - OTHER"
## [37] "TAX CLASS 1 CONDOS"
## [38] "TAX CLASS 1 VACANT LAND"
## [39] "TAX CLASS 4 - OTHER"
## [40] "THEATRES"
## [41] "THREE FAMILY DWELLINGS"
## [42] "TRANSPORTATION FACILITIES"
## [43] "TWO FAMILY DWELLINGS"
## [44] "WAREHOUSES"
levels(nyc_data$tax_class_at_present)
## [1] "1" "1A" "1B" "1C" "1D" "2" "2A" "2B" "2C" "4"
table(nyc_data$tax_class_at_present)
##
## 1 1A 1B 1C 1D 2 2A 2B 2C 4
## 18072 781 452 112 108 17501 657 236 1478 2773
levels(nyc_data$tax_class_at_time_of_sale)
## [1] "1" "2" "4"
library(formattable)
head(nyc_data, n = 10) %>%
formattable()
| borough | neighborhood | building_class_category_number | building_class_category | tax_class_at_present | block | lot | building_class_at_present | address | apartment_number | zip_code | residential_units | commercial_units | total_units | land_square_feet | gross_square_feet | year_built | tax_class_at_time_of_sale | building_class_at_time_of_sale | sale_price | sale_date | building_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Brooklyn | BATH BEACH | 01 | ONE FAMILY DWELLINGS | 1 | 6359 | 70 | S1 | 8684 15TH AVENUE | NA | 11228 | 1 | 1 | 2 | 1933 | 4080 | 1930 | 1 | S1 | 1300000 | 2020-04-28 | 90 |
| Brooklyn | BATH BEACH | 01 | ONE FAMILY DWELLINGS | 1 | 6360 | 48 | A5 | 14 BAY 10TH STREET | NA | 11228 | 1 | 0 | 1 | 2513 | 1428 | 1930 | 1 | A5 | 849000 | 2020-03-18 | 90 |
| Brooklyn | BATH BEACH | 01 | ONE FAMILY DWELLINGS | 1 | 6360 | 56 | A5 | 30 BAY 10TH STREET | NA | 11228 | 1 | 0 | 1 | 1547 | 1428 | 1930 | 1 | A5 | 75000 | 2020-11-30 | 90 |
| Brooklyn | BATH BEACH | 01 | ONE FAMILY DWELLINGS | 1 | 6367 | 24 | A9 | 8645 BAY 16TH STREET | NA | 11214 | 1 | 0 | 1 | 1571 | 1456 | 1935 | 1 | A9 | 830000 | 2020-06-26 | 85 |
| Brooklyn | BATH BEACH | 01 | ONE FAMILY DWELLINGS | 1 | 6371 | 60 | A9 | 8620 19TH AVENUE | NA | 11214 | 1 | 0 | 1 | 2417 | 2106 | 1930 | 1 | A9 | 1188000 | 2020-07-20 | 90 |
| Brooklyn | BATH BEACH | 01 | ONE FAMILY DWELLINGS | 1 | 6392 | 65 | A2 | 140 BAY 8 STREET | NA | 11228 | 1 | 0 | 1 | 3222 | 1804 | 1940 | 1 | A2 | 975000 | 2020-01-21 | 80 |
| Brooklyn | BATH BEACH | 01 | ONE FAMILY DWELLINGS | 1 | 6392 | 115 | A5 | 145 BAY 7TH STREET | NA | 11228 | 1 | 0 | 1 | 1448 | 1280 | 1945 | 1 | A5 | 870000 | 2020-10-15 | 75 |
| Brooklyn | BATH BEACH | 01 | ONE FAMILY DWELLINGS | 1 | 6399 | 13 | A1 | 155 BAY 14TH ST | NA | 11214 | 1 | 0 | 1 | 3680 | 936 | 1930 | 1 | A1 | 1250000 | 2020-12-23 | 90 |
| Brooklyn | BATH BEACH | 01 | ONE FAMILY DWELLINGS | 1 | 6409 | 40 | A9 | 106 BAY 25TH STREET | NA | 11214 | 1 | 0 | 1 | 1740 | 1382 | 1915 | 1 | A9 | 805000 | 2020-12-02 | 105 |
| Brooklyn | BATH BEACH | 01 | ONE FAMILY DWELLINGS | 1 | 6433 | 10 | A5 | 1703 CROPSEY AVENUE | NA | 11214 | 1 | 0 | 1 | 1438 | 1568 | 1937 | 1 | A5 | 760000 | 2020-06-29 | 83 |
variable.type <- lapply(nyc_data, class)
variable.description <- c(
"Name of the borough where the property is located",
"Neighbourhood name",
"Building class category code to identify similar properties",
"Building class category title to identify similar properties",
"Assigned tax class of the property in the city - Classes 1, 2, 3 or 4",
"Sub-division of the borough on which real properties are located",
"Sub-division of a Tax Block, used to uniquely represent the property location",
"Used to describe a property's constructive use",
"Property's street address",
"Property's apartment number",
"Property's postal code",
"Number of residential units at the listed property",
"Number of commercial units at the listed property",
"Total number of units at the listed property",
"Land area of the property listed in square feet",
"Total area of all the floors of a building",
"Property's construction year",
"Assigned tax class of the property in the city at the time of sale",
"Used to describe a property's constructive use at the time of sale",
"Price paid for the property",
"Date of property sale",
"Age of the Building")
variable.name <- colnames(nyc_data)
nyc_datadesc <- as_tibble(cbind(variable.name, variable.type, variable.description))
colnames(nyc_datadesc) <- c("Variable Name","Data Type","Variable Description")
library(knitr)
kable(nyc_datadesc)
| Variable Name | Data Type | Variable Description |
|---|---|---|
| borough | factor | Name of the borough where the property is located |
| neighborhood | character | Neighbourhood name |
| building_class_category_number | factor | Building class category code to identify similar properties |
| building_class_category | factor | Building class category title to identify similar properties |
| tax_class_at_present | factor | Assigned tax class of the property in the city - Classes 1, 2, 3 or 4 |
| block | character | Sub-division of the borough on which real properties are located |
| lot | character | Sub-division of a Tax Block, used to uniquely represent the property location |
| building_class_at_present | factor | Used to describe a property’s constructive use |
| address | character | Property’s street address |
| apartment_number | character | Property’s apartment number |
| zip_code | factor | Property’s postal code |
| residential_units | numeric | Number of residential units at the listed property |
| commercial_units | numeric | Number of commercial units at the listed property |
| total_units | numeric | Total number of units at the listed property |
| land_square_feet | numeric | Land area of the property listed in square feet |
| gross_square_feet | numeric | Total area of all the floors of a building |
| year_built | numeric | Property’s construction year |
| tax_class_at_time_of_sale | factor | Assigned tax class of the property in the city at the time of sale |
| building_class_at_time_of_sale | factor | Used to describe a property’s constructive use at the time of sale |
| sale_price | numeric | Price paid for the property |
| sale_date | c(“POSIXct”, “POSIXt”) | Date of property sale |
| building_age | numeric | Age of the Building |
The analysis of the NYC Property Sales data can be found under the following tabs. The Descriptive Statistics section explores each individual variable, while relationships within variables are explored using visualizations.
dim(nyc_data)
## [1] 42198 22
Our cleaned dataset has 22 columns and 42198 rows.
Lets explore some numeric variables in our data set.
quantile(nyc_data$sale_price, probs = seq(from = 0, to = 1, by = .1))
## 0% 10% 20% 30% 40% 50% 60% 70%
## 1 250000 392000 502500 600000 700000 830000 990000
## 80% 90% 100%
## 1380000 2495000 978090439
nyc_data %>% filter(sale_price <= 1000) %>% nrow()
## [1] 946
The 10th percentile lies at (Usd 250,000) and the price ranges on to (Usd 2,495,000). This means there is something for everyone in New York.
Our highest price is slightly less that $1billion (978,090,439).
If we check out this observation;
nyc_data%>%
filter(sale_price>900000000)
## # A tibble: 1 x 22
## borough neighborhood building_class_~ building_class_~ tax_class_at_pr~ block
## <fct> <chr> <fct> <fct> <fct> <chr>
## 1 Manhat~ FASHION "22 " STORE BUILDINGS 4 840
## # ... with 16 more variables: lot <chr>, building_class_at_present <fct>,
## # address <chr>, apartment_number <chr>, zip_code <fct>,
## # residential_units <dbl>, commercial_units <dbl>, total_units <dbl>,
## # land_square_feet <dbl>, gross_square_feet <dbl>, year_built <dbl>,
## # tax_class_at_time_of_sale <fct>, building_class_at_time_of_sale <fct>,
## # sale_price <dbl>, sale_date <dttm>, building_age <dbl>
The building appears to be exactly 100yrs old. It is probably a public monument. The outlier can be explained.
summary(nyc_data$land_square_feet)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 45 2000 2506 16011 4000 7649136 20721
quantile(nyc_data$land_square_feet, probs = seq(from = 0, to = 1, by = .1), na.rm = TRUE)
## 0% 10% 20% 30% 40% 50% 60% 70%
## 45.0 1700.6 1976.0 2038.0 2400.0 2506.0 3000.0 3750.0
## 80% 90% 100%
## 4041.6 5700.0 7649136.0
nyc_data %>%
filter(land_square_feet > 500000) %>%
tail(10)
## # A tibble: 10 x 22
## borough neighborhood building_class_~ building_class_~ tax_class_at_pr~ block
## <fct> <chr> <fct> <fct> <fct> <chr>
## 1 Bronx HUNTS POINT "41 " TAX CLASS 4 - O~ 4 2770
## 2 Bronx MOTT HAVEN/~ "36 " OUTDOOR RECREAT~ 4 2557
## 3 Manhat~ KIPS BAY "08 " RENTALS - ELEVA~ 2 972
## 4 Manhat~ MIDTOWN EAST "41 " TAX CLASS 4 - O~ 4 1354
## 5 Staten~ ANNADALE "05 " TAX CLASS 1 VAC~ 1B 6431
## 6 Staten~ BLOOMFIELD "27 " FACTORIES 4 1801
## 7 Staten~ ROSSVILLE-C~ "31 " COMMERCIAL VACA~ 4 7328
## 8 Staten~ ROSSVILLE-P~ "27 " FACTORIES 4 7247
## 9 Staten~ ROSSVILLE-P~ "31 " COMMERCIAL VACA~ 4 7207
## 10 Staten~ TOTTENVILLE "29 " COMMERCIAL GARA~ 4 7971
## # ... with 16 more variables: lot <chr>, building_class_at_present <fct>,
## # address <chr>, apartment_number <chr>, zip_code <fct>,
## # residential_units <dbl>, commercial_units <dbl>, total_units <dbl>,
## # land_square_feet <dbl>, gross_square_feet <dbl>, year_built <dbl>,
## # tax_class_at_time_of_sale <fct>, building_class_at_time_of_sale <fct>,
## # sale_price <dbl>, sale_date <dttm>, building_age <dbl>
Assuming a ball-park value that buildings that have > 500,000 land sq footage should be ballpark, I explored them further. I find that I have 22 outlier values. These buildings happen to be in specific neighbourhoods most especially Hunts Point and these buildings fall under building class category of warehouses, tax class 4, outdoor recreational facilities, factories and commercial vacant land. This makes perfect sense.
summary(nyc_data$gross_square_feet)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 100 1360 1872 9193 2640 8942176 21622
quantile(nyc_data$gross_square_feet, probs = seq(from = 0, to = 1, by= .1), na.rm = TRUE)
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90%
## 100 1120 1284 1444 1636 1872 2104 2400 2960 4000
## 100%
## 8942176
nyc_data %>%
filter(gross_square_feet > 500000)
## # A tibble: 62 x 22
## borough neighborhood building_class_~ building_class_~ tax_class_at_pr~ block
## <fct> <chr> <fct> <fct> <fct> <chr>
## 1 Bronx HUNTS POINT "30 " WAREHOUSES 4 2781
## 2 Bronx HUNTS POINT "30 " WAREHOUSES 4 2781
## 3 Bronx HUNTS POINT "30 " WAREHOUSES 4 2781
## 4 Bronx HUNTS POINT "30 " WAREHOUSES 4 2781
## 5 Bronx HUNTS POINT "30 " WAREHOUSES 4 2781
## 6 Bronx HUNTS POINT "30 " WAREHOUSES 4 2781
## 7 Bronx HUNTS POINT "30 " WAREHOUSES 4 2781
## 8 Bronx HUNTS POINT "30 " WAREHOUSES 4 2781
## 9 Bronx HUNTS POINT "41 " TAX CLASS 4 - O~ 4 2770
## 10 Bronx HUNTS POINT "41 " TAX CLASS 4 - O~ 4 2770
## # ... with 52 more rows, and 16 more variables: lot <chr>,
## # building_class_at_present <fct>, address <chr>, apartment_number <chr>,
## # zip_code <fct>, residential_units <dbl>, commercial_units <dbl>,
## # total_units <dbl>, land_square_feet <dbl>, gross_square_feet <dbl>,
## # year_built <dbl>, tax_class_at_time_of_sale <fct>,
## # building_class_at_time_of_sale <fct>, sale_price <dbl>, sale_date <dttm>,
## # building_age <dbl>
On investigating the buildings with extreme gross square feet, I found that the said buildings were warehouses, store buildings, office buildings, luxury hotels and tax class 4.
summary(nyc_data$building_age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -1.00 45.00 70.00 67.01 95.00 1002.00 2868
quantile(nyc_data$building_age, probs = seq(from = 0, to = 1, by = .1), na.rm = TRUE)
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## -1 8 31 55 62 70 85 90 100 110 1002
Lets look at these really old buildings.
nyc_data %>%
filter(building_age > 500)
## # A tibble: 2 x 22
## borough neighborhood building_class_~ building_class_~ tax_class_at_pr~ block
## <fct> <chr> <fct> <fct> <fct> <chr>
## 1 Manhat~ CHELSEA "29 " COMMERCIAL GARA~ 4 799
## 2 Queens ARVERNE "02 " TWO FAMILY DWEL~ 1 15915
## # ... with 16 more variables: lot <chr>, building_class_at_present <fct>,
## # address <chr>, apartment_number <chr>, zip_code <fct>,
## # residential_units <dbl>, commercial_units <dbl>, total_units <dbl>,
## # land_square_feet <dbl>, gross_square_feet <dbl>, year_built <dbl>,
## # tax_class_at_time_of_sale <fct>, building_class_at_time_of_sale <fct>,
## # sale_price <dbl>, sale_date <dttm>, building_age <dbl>
The oldest buildings, ie over 500 yrs old, are two buildings; 909 yrs and 1002 yrs old. One is a commercial garage in Manhattan while the other is a two family dwelling at Queens.
nyc_data %>%
filter(building_age < 0)
## # A tibble: 1 x 22
## borough neighborhood building_class_~ building_class_~ tax_class_at_pr~ block
## <fct> <chr> <fct> <fct> <fct> <chr>
## 1 Bronx RIVERDALE "05 " TAX CLASS 1 VAC~ 1B 5731
## # ... with 16 more variables: lot <chr>, building_class_at_present <fct>,
## # address <chr>, apartment_number <chr>, zip_code <fct>,
## # residential_units <dbl>, commercial_units <dbl>, total_units <dbl>,
## # land_square_feet <dbl>, gross_square_feet <dbl>, year_built <dbl>,
## # tax_class_at_time_of_sale <fct>, building_class_at_time_of_sale <fct>,
## # sale_price <dbl>, sale_date <dttm>, building_age <dbl>
The -1 raises questions. On investigating, it is vacant land. I believe construction was to commence in 2021.
Let us conduct some data visualization analysis. We shall focus on the various boroughs, neighbourhoods and building classes.
I started the analysis trying to figure out which Borough recored the most property sales last year. I also wanted to understand how much an average property sale price would be across the boroughs in NYC.
Most in demand and expensive boroughs in NYC
p1 <- ggplot(data = nyc_property, aes(x = (borough))) +
geom_bar(fill = "blue") +
ggtitle("Most In-Demand Borough in NYC", subtitle = "Borough-wise # of property sales in NYC") +
scale_y_continuous("# of Property Sales", labels = scales::comma) +
scale_x_discrete("Borough")
p2 <- ggplot(data = nyc_data, aes(x = (borough), y = mean(sale_price))) +
geom_bar(stat = "identity", color = "darkgreen") +
ggtitle("Most Expensive Borough in NYC", subtitle = "Borough-wise Avg Property Sale Price in NYC") +
scale_y_continuous("Avg Sale Price", labels = scales::dollar) +
scale_x_discrete("Borough")
gridExtra::grid.arrange(p1, p2, nrow = 2)
The plots above showed me that Queens had the most number of property sales, followed by Brookyln then Manhattan. I find this surprising as I expected properties in Manhattan to cost more. Let’s explore this further with the other fields.
df1 <- as.data.frame(table(nyc_property$borough, nyc_property$neighborhood))
names(df1) <- c('BOROUGH','NEIGHBORHOOD', 'Freq')
df1 <- df1 %>%
arrange(desc(Freq)) %>%
head(10)
p1 <- ggplot(df1, aes(x = `NEIGHBORHOOD`, y = `Freq`, fill = `BOROUGH`)) +
geom_bar(stat = "identity") +
coord_flip() +
ggtitle("Most-in demand Neighborhood in NYC", subtitle = "Top Neighborhoods by Number") +
theme(legend.position = "bottom") +
scale_y_continuous("# of Sales", labels = scales::comma) +
scale_x_discrete("Neighborhood")
p1
df2 <-
nyc_data %>% group_by(borough, neighborhood) %>%
summarise(MeanSP = mean(sale_price)) %>%
arrange(desc(MeanSP)) %>%
head(10)
## `summarise()` regrouping output by 'borough' (override with `.groups` argument)
p2 <- ggplot(data = df2, aes(x = neighborhood, y = MeanSP, fill = borough)) +
geom_bar(stat = "identity") +
coord_flip() +
theme(legend.position = "bottom") +
ggtitle("Most Expensive Neighborhoods in NYC",
subtitle = "Top Neighborhoods by Avg Price") +
scale_y_continuous("Avg Sale Price", labels = scales::dollar) +
scale_x_discrete("Neighborhood")
p2
df3 <-
nyc_data %>% group_by(borough, neighborhood) %>%
summarise(MeanSP = mean(sale_price)) %>%
arrange(MeanSP) %>% ## Note the order is ascending
head(10)
## `summarise()` regrouping output by 'borough' (override with `.groups` argument)
ggplot(data = df3, aes(x = neighborhood, y = MeanSP, fill = borough)) +
geom_bar(stat = "identity") +
coord_flip() +
theme(legend.position = "bottom") +
ggtitle("Least Expensive Neighborhoods in NYC", subtitle = "Top Neighborhoods by the lowest avg. Price") +
scale_y_continuous("Avg Sale Price", labels = scales::dollar) +
scale_x_discrete("Neighborhood")
df4<- as.data.frame(table(nyc_property$borough, nyc_property$building_class_category))
names(df4) <- c("Borough", "Building class category", "Freq")
df4 <- df4 %>%
group_by(Borough) %>%
arrange(desc(Freq)) %>%
head(10)
p4 <- ggplot(df4, aes(x = Borough, y = Freq, fill = `Building class category`)) +
geom_bar(stat = "identity", position = "dodge") +
ggtitle("Most in-demand Buildings in NYC by Borough", subtitle = "Top types of Building sold in NYC") +
scale_y_continuous("No. of Sales", labels = scales::comma) +
scale_x_discrete("Borough")
p4
df5 <- nyc_data %>%
group_by(borough, building_class_category) %>%
summarise(MeanSP = mean(sale_price)) %>%
arrange(desc(MeanSP)) %>%
head(10)
## `summarise()` regrouping output by 'borough' (override with `.groups` argument)
ggplot(df5, aes(x= building_class_category, y = MeanSP, fill = borough)) +
geom_bar(stat = "identity", position = "dodge") +
coord_flip() +
theme(legend.position = "bottom") +
ggtitle("Most expensive building types in NYC", subtitle = "Top ten types of property by value in NYC as at 2020")+
scale_y_continuous("Average Sale Price", labels = scales::dollar) +
scale_x_discrete("Building type")
df6 <- nyc_data %>%
group_by(borough, building_class_category) %>%
summarise(MeanSP = mean(sale_price)) %>%
arrange(MeanSP) %>%
head(10)
## `summarise()` regrouping output by 'borough' (override with `.groups` argument)
ggplot(df6, aes(x= building_class_category, y = MeanSP, fill = borough)) +
geom_bar(stat = "identity", position = "dodge") +
coord_flip() +
theme(legend.position = "bottom") +
ggtitle("Least expensive building types in NYC", subtitle = "Top ten cheapest types of property by value in NYC as at 2020")+
scale_y_continuous("Average Sale Price", labels = scales::dollar) +
scale_x_discrete("Building type")