This dataset is a record of every building or building unit (apartment, etc.) sold in the New York City property market over a 12-month period. The focus on this analysis will be residential and commercial buildings.
This dataset contains the location, address, type, sale price, and sale date of building units sold.
BOROUGH: A digit code for the borough the property is located in; in order these are Manhattan (1), Bronx (2), Brooklyn (3), Queens (4), and Staten Island (5). BLOCK; LOT: The combination of borough, block, and lot forms a unique key for property in New York City. Commonly called a BBL. BUILDING CLASS AT PRESENT and BUILDING CLASS AT TIME OF SALE: The type of building at various points in time.
For further reference on individual fields see the Glossary of Terms. For the building classification codes see the Building Classifications Glossary.
As this is a financial transaction dataset, there are some points that need to be kept in mind:
Many sales occur with a nonsensically small dollar amount: $0 most commonly. These sales are actually transfers of deeds between parties: for example, parents transferring ownership to their home to a child after moving out for retirement. This dataset uses the financial definition of a building/building unit, for tax purposes. In case a single entity owns the building in question, a sale covers the value of the entire building. In case a building is owned piecemeal by its residents (a condominium), a sale refers to a single apartment (or group of apartments) owned by some individual.
#Installing libraries
library(tidyverse)
library(skimr)
library(janitor)
library(data.table)
library(readr)
#Importing data for property sales
Property_Sales <- read_csv("nyc-rolling-sales.csv", show_col_types = FALSE)
## New names:
## • `` -> `...1`
Property_Sales[1:10,]
Since the dataset has transactions of 0 dollars that are a product of deed transfers. I will remove them for the purpose of this analysis of sales only. I will also be removing abnormally low transactions for simplicity and converting the sale price column’s unit into Tens of Thousands for simplicity’s sake.
#Cleaning column names
Property_Sales <- Property_Sales %>% clean_names()
#Converting sale price into integer and in unit of ten thousands USD for simplicity
Property_Sales$sale_price = as.numeric(as.character(Property_Sales$sale_price))
## Warning: NAs introduced by coercion
Property_Sales <- Property_Sales %>%
mutate(sale_price_in_ten_thousands = round(sale_price/10000, digits = 1)) %>%
filter(!is.na(sale_price_in_ten_thousands)) %>%
filter(sale_price_in_ten_thousands > 1) %>%
arrange(building_class_category)
Property_Sales
#having an unmodified but clean original dataset
Property_Sales_og <- Property_Sales
As the focus of this analysis is on residential and commercial buildings only, let’s see the different types of building categories and filter the nonresidential and noncommercial buildings. Code Descriptions can be found in above reference links
unique(Property_Sales$building_class_category)
## [1] "01 ONE FAMILY DWELLINGS"
## [2] "02 TWO FAMILY DWELLINGS"
## [3] "03 THREE FAMILY DWELLINGS"
## [4] "04 TAX CLASS 1 CONDOS"
## [5] "05 TAX CLASS 1 VACANT LAND"
## [6] "06 TAX CLASS 1 - OTHER"
## [7] "07 RENTALS - WALKUP APARTMENTS"
## [8] "08 RENTALS - ELEVATOR APARTMENTS"
## [9] "09 COOPS - WALKUP APARTMENTS"
## [10] "10 COOPS - ELEVATOR APARTMENTS"
## [11] "11 SPECIAL CONDO BILLING LOTS"
## [12] "11A CONDO-RENTALS"
## [13] "12 CONDOS - WALKUP APARTMENTS"
## [14] "13 CONDOS - ELEVATOR APARTMENTS"
## [15] "14 RENTALS - 4-10 UNIT"
## [16] "15 CONDOS - 2-10 UNIT RESIDENTIAL"
## [17] "16 CONDOS - 2-10 UNIT WITH COMMERCIAL UNIT"
## [18] "17 CONDO COOPS"
## [19] "21 OFFICE BUILDINGS"
## [20] "22 STORE BUILDINGS"
## [21] "23 LOFT BUILDINGS"
## [22] "25 LUXURY HOTELS"
## [23] "26 OTHER HOTELS"
## [24] "27 FACTORIES"
## [25] "28 COMMERCIAL CONDOS"
## [26] "29 COMMERCIAL GARAGES"
## [27] "30 WAREHOUSES"
## [28] "31 COMMERCIAL VACANT LAND"
## [29] "32 HOSPITAL AND HEALTH FACILITIES"
## [30] "33 EDUCATIONAL FACILITIES"
## [31] "34 THEATRES"
## [32] "35 INDOOR PUBLIC AND CULTURAL FACILITIES"
## [33] "36 OUTDOOR RECREATIONAL FACILITIES"
## [34] "37 RELIGIOUS FACILITIES"
## [35] "38 ASYLUMS AND HOMES"
## [36] "39 TRANSPORTATION FACILITIES"
## [37] "40 SELECTED GOVERNMENTAL FACILITIES"
## [38] "41 TAX CLASS 4 - OTHER"
## [39] "42 CONDO CULTURAL/MEDICAL/EDUCATIONAL/ETC"
## [40] "43 CONDO OFFICE BUILDINGS"
## [41] "44 CONDO PARKING"
## [42] "45 CONDO HOTELS"
## [43] "46 CONDO STORE BUILDINGS"
## [44] "47 CONDO NON-BUSINESS STORAGE"
## [45] "48 CONDO TERRACES/GARDENS/CABANAS"
## [46] "49 CONDO WAREHOUSES/FACTORY/INDUS"
#Filtering for only residential and commercial categories
Property_Sales<- Property_Sales %>%
filter(!grepl('05|06|29|32|33|37|35|36|38|39|30|41|44|47', building_class_category)) %>%
arrange(sale_price)
The first step would be to explore the median sale prices between the different boroughs
Property_Sales %>%
group_by(borough) %>%
summarise(Median_Sale_Price = round(median(sale_price_in_ten_thousands), digits = 1), Mean_Sale_Price = round(mean(sale_price_in_ten_thousands), digits =1)) %>%
mutate(Borough_Name = c("Manhattan", "Bronx", "Brooklyn", "Queens", "Staten Island")) %>%
arrange(desc(Median_Sale_Price))
We can see that Manhattan has the highest Median Sale Price followed by Brooklyn, Queens, Staten Island and lastly, Bronx however surprisingly the Mean Sale Price of Bronx is higher than Staten Island and roughly equal to Queens! The mean can be skewed because of abnormally high or low values. So for the next step, let’s explore the box-plots of each borough’s sale prices.
ggplot(Property_Sales, aes(borough, sale_price_in_ten_thousands)) +
geom_boxplot(aes(group = borough)) +
labs(title = "Median Sale Price of each Borough", x = "Borough", y = "Sale Price (In Tens of Thousands Dollars")
It’s very clear from this graph that Manhattan has the highest amount of high value transactions and Bronx has the lowest however the high values of Manhattan’s transactions is making it hard to compare the plots of the other boroughs so let’s limit the y-axis limit.
ggplot(Property_Sales, aes(borough, sale_price_in_ten_thousands)) +
geom_boxplot(aes(group = borough)) +
scale_y_continuous(limits = c(0, 6000)) +
labs(y = "Sale Price (In Tens of Thousands Dollars)", x = "Borough", title = "Median Sale Price of each Borough")
## Warning: Removed 87 rows containing non-finite values (stat_boxplot).
Now we can see much more clearly that while the median of the Bronx’s Sale Price is lower than Staten Island, it has many high value transactions outside of the median compared to Staten Island. These high value transactions must be cause of skewing Bronx’s mean transaction value higher than Staten Island’s. Our graphs also showed that every borough has a high number of transactions that are outside of the media. These outliers are abnormally large that may need further investigation before adding them to our results, so for simplicity’s sake, I am removing them for now.
#Our first step will be to eliminate outlines shown above for simplicity's sake
#Showing what outliers will be removed from each borough
Property_Sales %>%
filter(sale_price_in_ten_thousands < 1 | sale_price_in_ten_thousands > 1000) %>%
ggplot(aes(sale_price_in_ten_thousands)) +
geom_histogram(bins = 20) +
facet_wrap(~borough, scales = "free") +
theme(axis.text = element_text(size = 7)) +
theme(panel.spacing = unit(0.5, "lines")) +
labs(title = "Outliers Removed Fom Each Borough", x = "Sale Price (In Tens of Thousands of DOllars)", y = "Outliers Removed")
#Removing said outliers
Property_Sales <- subset(Property_Sales, select = -sale_price)
Property_Sales <- Property_Sales %>%
filter(sale_price_in_ten_thousands < 1000)
#Making a function that creates a table for each borough's neighborhoods median and mean price and looping it through each borough
for(i in 1:5) {Neighborhood_Median_Mean <- Property_Sales %>%
filter(borough == i) %>%
group_by(neighborhood) %>%
summarise(Median_Sale_Price = round(median(sale_price_in_ten_thousands), digits = 1), Mean_Sale_Price = round(mean(sale_price_in_ten_thousands), digits = 1), borough = i) %>%
arrange(desc(Median_Sale_Price))
print(Neighborhood_Median_Mean)}
## # A tibble: 39 × 4
## neighborhood Median_Sale_Price Mean_Sale_Price borough
## <chr> <dbl> <dbl> <int>
## 1 CIVIC CENTER 469. 462. 1
## 2 LITTLE ITALY 325 357. 1
## 3 TRIBECA 280 304. 1
## 4 SOHO 255. 293. 1
## 5 FLATIRON 225. 295. 1
## 6 UPPER EAST SIDE (96-110) 187 220. 1
## 7 CHINATOWN 174. 224. 1
## 8 FASHION 160 204. 1
## 9 FINANCIAL 145 187. 1
## 10 GREENWICH VILLAGE-CENTRAL 133. 195. 1
## # … with 29 more rows
## # A tibble: 37 × 4
## neighborhood Median_Sale_Price Mean_Sale_Price borough
## <chr> <dbl> <dbl> <int>
## 1 CITY ISLAND-PELHAM STRIP 73.6 73.6 2
## 2 MOTT HAVEN/PORT MORRIS 68.2 121. 2
## 3 MOUNT HOPE/MOUNT EDEN 57.5 122. 2
## 4 BRONX PARK 56 56 2
## 5 BELMONT 54.5 102 2
## 6 FORDHAM 54 162. 2
## 7 PELHAM GARDENS 51.5 61.1 2
## 8 COUNTRY CLUB 50.2 51.1 2
## 9 MORRIS PARK/VAN NEST 50.2 52.7 2
## 10 CITY ISLAND 50 52.1 2
## # … with 27 more rows
## # A tibble: 60 × 4
## neighborhood Median_Sale_Price Mean_Sale_Price borough
## <chr> <dbl> <dbl> <int>
## 1 DOWNTOWN-FULTON FERRY 197. 220. 3
## 2 CARROLL GARDENS 168 199. 3
## 3 RED HOOK 162. 166. 3
## 4 DOWNTOWN-FULTON MALL 153. 162. 3
## 5 GOWANUS 150 184. 3
## 6 BROOKLYN HEIGHTS 150. 221. 3
## 7 BOERUM HILL 149. 201. 3
## 8 MANHATTAN BEACH 133. 128 3
## 9 PARK SLOPE SOUTH 132. 144. 3
## 10 PARK SLOPE 131. 164. 3
## # … with 50 more rows
## # A tibble: 59 × 4
## neighborhood Median_Sale_Price Mean_Sale_Price borough
## <chr> <dbl> <dbl> <int>
## 1 LONG ISLAND CITY 99.8 123. 4
## 2 RIDGEWOOD 93 100. 4
## 3 JAMAICA ESTATES 88.2 97.7 4
## 4 ASTORIA 83.6 107 4
## 5 NEPONSIT 83 91.2 4
## 6 FRESH MEADOWS 81 69.7 4
## 7 BELLE HARBOR 80 78.8 4
## 8 WHITESTONE 75.2 73.8 4
## 9 FLUSHING-NORTH 75 82.9 4
## 10 HOLLIS HILLS 75 63 4
## # … with 49 more rows
## # A tibble: 58 × 4
## neighborhood Median_Sale_Price Mean_Sale_Price borough
## <chr> <dbl> <dbl> <int>
## 1 ROSSVILLE-PORT MOBIL 606. 606. 5
## 2 TODT HILL 114. 131. 5
## 3 EMERSON HILL 91.2 85.1 5
## 4 RICHMONDTOWN-LIGHTHS HILL 81.2 86.3 5
## 5 DONGAN HILLS-COLONY 80 83.8 5
## 6 ROSSVILLE-RICHMOND VALLEY 69.2 67 5
## 7 RICHMONDTOWN 66.5 70.6 5
## 8 HUGUENOT 65 69 5
## 9 PLEASANT PLAINS 64.5 73.3 5
## 10 PRINCES BAY 64.4 70.3 5
## # … with 48 more rows
The highest and lowest median sale price for the neighborhoods of each borough are the following:
Manhattan: Civic Center has the highest $4680000 and Inwood has the lowest at $398000.
Bronx: City Island-Pelham Strip has the highest at $736000 and Parkchester has the lowest at $138000.
Brooklyn: Downtown-Fulton Ferry has the highest at $1974000 and Spring Creek has the lowest at $420000.
Queens: Long Island City has the highest at $998000 and Holliswood has the lowest at $205000.
Staten Island: Rossville Port Mobile has the highest at $6056000 and Fresh Kills has the lowest at $20000.
It must be kept in mind that these observations come after we have removed our outliers, let’s do the same calculations with just the original dataset to see the difference. You can see the lack of extreme observations by the fact that the median and sale prices are very similar to each other. Let’s do the same calculations with the original dataset to the see differences.
for(i in 1:5) {Neighborhood_Median_Mean_og <- Property_Sales_og %>%
filter(borough == i) %>%
group_by(neighborhood) %>%
summarise(Median_Sale_Price = round(median(sale_price_in_ten_thousands), digits = 1), Mean_Sale_Price = round(mean(sale_price_in_ten_thousands), digits = 1), borough = i) %>%
arrange(desc(Mean_Sale_Price))
print(Neighborhood_Median_Mean_og)}
## # A tibble: 39 × 4
## neighborhood Median_Sale_Price Mean_Sale_Price borough
## <chr> <dbl> <dbl> <int>
## 1 MIDTOWN CBD 175 2748. 1
## 2 FASHION 187. 2188 1
## 3 JAVITS CENTER 92.8 1874. 1
## 4 FINANCIAL 156. 834. 1
## 5 CIVIC CENTER 519. 710. 1
## 6 LITTLE ITALY 375 610. 1
## 7 SOHO 283 496. 1
## 8 KIPS BAY 112 467. 1
## 9 FLATIRON 246. 446. 1
## 10 GREENWICH VILLAGE-WEST 140 443 1
## # … with 29 more rows
## # A tibble: 38 × 4
## neighborhood Median_Sale_Price Mean_Sale_Price borough
## <chr> <dbl> <dbl> <int>
## 1 EAST RIVER 1120 1120 2
## 2 MOUNT HOPE/MOUNT EDEN 58.5 263. 2
## 3 BELMONT 55 225. 2
## 4 KINGSBRIDGE HTS/UNIV HTS 23.5 215. 2
## 5 MOTT HAVEN/PORT MORRIS 75 214. 2
## 6 FORDHAM 56.2 203. 2
## 7 HIGHBRIDGE/MORRIS HEIGHTS 42.9 189. 2
## 8 MELROSE/CONCOURSE 37.9 176. 2
## 9 HUNTS POINT 54 139. 2
## 10 FIELDSTON 31.4 126. 2
## # … with 28 more rows
## # A tibble: 60 × 4
## neighborhood Median_Sale_Price Mean_Sale_Price borough
## <chr> <dbl> <dbl> <int>
## 1 DOWNTOWN-METROTECH 78.1 654. 3
## 2 DOWNTOWN-FULTON FERRY 178. 406 3
## 3 DOWNTOWN-FULTON MALL 155 372. 3
## 4 RED HOOK 170 357 3
## 5 WILLIAMSBURG-NORTH 105 290. 3
## 6 BROOKLYN HEIGHTS 154. 286. 3
## 7 GOWANUS 165. 260. 3
## 8 BOERUM HILL 150. 252. 3
## 9 CARROLL GARDENS 169 198. 3
## 10 PARK SLOPE 131. 196. 3
## # … with 50 more rows
## # A tibble: 59 × 4
## neighborhood Median_Sale_Price Mean_Sale_Price borough
## <chr> <dbl> <dbl> <int>
## 1 LONG ISLAND CITY 106. 350. 4
## 2 ASTORIA 87 143. 4
## 3 EAST ELMHURST 70 128. 4
## 4 ELMHURST 52 121. 4
## 5 COLLEGE POINT 72.5 99.7 4
## 6 FLUSHING-NORTH 75 99.2 4
## 7 JAMAICA ESTATES 88.2 97.7 4
## 8 RIDGEWOOD 90 97.1 4
## 9 MASPETH 72 96.6 4
## 10 SUNNYSIDE 40 94.8 4
## # … with 49 more rows
## # A tibble: 58 × 4
## neighborhood Median_Sale_Price Mean_Sale_Price borough
## <chr> <dbl> <dbl> <int>
## 1 BLOOMFIELD 3150 4625. 5
## 2 ROSSVILLE-PORT MOBIL 312. 359 5
## 3 TODT HILL 110. 126. 5
## 4 TRAVIS 43.5 92.3 5
## 5 DONGAN HILLS-COLONY 80 83.8 5
## 6 EMERSON HILL 90 80.8 5
## 7 RICHMONDTOWN-LIGHTHS HILL 77.5 77 5
## 8 PRINCES BAY 63.8 76.2 5
## 9 TOTTENVILLE 60 71.4 5
## 10 RICHMONDTOWN 65.7 69.8 5
## # … with 48 more rows
We can see how much the mean price changes without removing our outliers. The abnormally high transactions skew the mean up for all neighborhoods. For our next step, I want to explore if there is a correlation between sale price and type of building.
unique(Property_Sales$building_class_at_time_of_sale)
## [1] "A1" "D4" "B9" "A5" "RW" "B3" "C6" "R3" "R9" "A2" "B1" "B2" "C0" "S3" "V1"
## [16] "H3" "S1" "A9" "C5" "C3" "R2" "R4" "R8" "K4" "RB" "A4" "S2" "C2" "C7" "C1"
## [31] "S9" "F5" "R1" "RH" "D1" "RA" "C4" "R5" "RT" "A6" "RK" "O1" "K2" "CM" "O2"
## [46] "C9" "K1" "K9" "R6" "C8" "A0" "D0" "S4" "L1" "K6" "K8" "A3" "V9" "K5" "O7"
## [61] "S0" "O5" "S5" "O8" "F9" "F1" "D6" "F4" "R0" "RR" "HR" "H4" "F2" "O4" "L9"
## [76] "A7" "H8" "L8" "D7" "K3" "K7" "D9" "D3" "L3" "J4" "D2" "D5" "H6" "O9" "J9"
Since we have so many types of building classes, I’ll be grouping them by their first initial. The descriptions of these classes can be seen in the glossary terms provided above.
Property_Sales <- Property_Sales %>%
mutate(building_code = substr(building_class_at_time_of_sale,1,1))
for (i in 1:5) {Code_Median_Mean <- Property_Sales %>%
filter(borough == i) %>%
group_by(building_code) %>%
summarise(Median_Sale_Price = round(median(sale_price_in_ten_thousands), digits = 1), Mean_Sale_Price = round(mean(sale_price_in_ten_thousands), digits = 1), borough = i) %>%
arrange(desc(Median_Sale_Price))
print(Code_Median_Mean)}
## # A tibble: 13 × 4
## building_code Median_Sale_Price Mean_Sale_Price borough
## <chr> <dbl> <dbl> <int>
## 1 S 566 554. 1
## 2 K 545 509. 1
## 3 O 545 552. 1
## 4 J 500 500 1
## 5 L 482. 484. 1
## 6 A 425 457. 1
## 7 F 370 370 1
## 8 V 313. 348 1
## 9 B 298. 366. 1
## 10 R 143. 212 1
## 11 D 79.9 123. 1
## 12 C 73 186. 1
## 13 H 5 18.8 1
## # A tibble: 12 × 4
## building_code Median_Sale_Price Mean_Sale_Price borough
## <chr> <dbl> <dbl> <int>
## 1 H 570 570 2
## 2 F 230 315. 2
## 3 L 208. 208. 2
## 4 O 123. 200. 2
## 5 K 121. 186. 2
## 6 C 59.9 101. 2
## 7 S 57.5 72.1 2
## 8 V 50.4 102 2
## 9 B 49 48.3 2
## 10 A 40.5 44.8 2
## 11 D 19 32.5 2
## 12 R 16.8 29.5 2
## # A tibble: 12 × 4
## building_code Median_Sale_Price Mean_Sale_Price borough
## <chr> <dbl> <dbl> <int>
## 1 F 299 356. 3
## 2 O 196. 232. 3
## 3 K 175 246. 3
## 4 L 120 120 3
## 5 S 110 141. 3
## 6 C 102. 131. 3
## 7 R 86.6 109. 3
## 8 B 81.5 97.3 3
## 9 H 73.8 73.8 3
## 10 A 68 90.9 3
## 11 V 65 140. 3
## 12 D 33.9 47.3 3
## # A tibble: 12 × 4
## building_code Median_Sale_Price Mean_Sale_Price borough
## <chr> <dbl> <dbl> <int>
## 1 H 500. 463. 4
## 2 L 500 500 4
## 3 F 212. 280. 4
## 4 K 141. 203. 4
## 5 O 97.5 178. 4
## 6 S 87.2 104. 4
## 7 V 78 149. 4
## 8 B 67.5 70.9 4
## 9 R 56 67.6 4
## 10 A 55 61.3 4
## 11 C 30 61.5 4
## 12 D 26.2 30.6 4
## # A tibble: 12 × 4
## building_code Median_Sale_Price Mean_Sale_Price borough
## <chr> <dbl> <dbl> <int>
## 1 H 825 626 5
## 2 J 721. 721. 5
## 3 F 235 280. 5
## 4 K 80.5 145. 5
## 5 B 60 58.6 5
## 6 O 57.5 89.5 5
## 7 A 47 49.3 5
## 8 S 43.5 49.6 5
## 9 R 30 32.3 5
## 10 V 27 84.4 5
## 11 C 26.8 42.7 5
## 12 D 20.6 22.3 5
I have choose to explore the median/mean sale prices in each borough compared to aggregating all medians/means. This is because as shown by our previous exploration, buildings in Manhattan have a higher sale price than Bronx which may influence our mean, boroughs may also have a different amount of different buildings being sold such as Staten Island being more suburban and residential than other boroughs or Manhattan having a higher number of Office Buildings due to its designation of being a Central Business District. These differences will skew our median sale price value. By having a narrower view, we can have a better understanding of not only the different sale values but also of the differences between the boroughs themselves.
Interestingly, there is much variance between building codes and sale price between the different boroughs.
For Manhattan, Classes S, K, O and J have high and similarly high median prices whereas building codes R, D, and C have much lower median sale prices. Class H has the lowest median but its value gives an impression of a low outlier which may need further investigation.
For Bronx, Class H has the highest median and seems to be an outlier among the other codes. Classes F and L have high median sale prices whereas Classes D and R have lower median prices.
For Brooklyn, Classes F, O and K have the highest median prices whereas Classes A, V and D have the lowest median prices.
For Queens, Classes H and L have the high similar median prices whereas Classes A, C and D have the lowest median prices.
For Staten Island, Classes H and J have abnormally high median prices whereas Classes V, C and D have the lowest median prices.
#Visualization of the above observations
Property_Sales %>%
arrange(building_code) %>%
ggplot(aes(building_code, sale_price_in_ten_thousands)) +
geom_boxplot(aes(group = building_code)) +
facet_wrap(~borough, scales = "free") +
labs(title = "Median Sale Price of different building codes within each borough", x = "Building Code", y = "Sale Price (in Tens of Thosuands Dollars)")