NYC Housing

Author

Long Lin

Overview

I am using a dataset from Kaggle with information about NYC housing prices. The dataset contains relevant information such as the borough associated with the property sale as well as the year that the primary structure on the tax lot was originally built. My motivation for selecting this dataset is my interest in the local housing market, especially in Brooklyn, where I live. I will use R to process the data by filtering the data for properties in Brooklyn and by the year that they were constructed.

NYC Housing Prices

Source: https://www.kaggle.com/datasets/ishank2005/nyc-housing-prices-csv?resource=download

I will read the data from my github in raw format using tidyverse.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
url <- "https://raw.githubusercontent.com/longflin/NYC_Housing/refs/heads/main/nyc_housing_base.csv"

df <- read_csv(
  file = url,
  show_col_types = FALSE,
  progress = FALSE
)

glimpse(df)
Rows: 34,439
Columns: 19
$ borough_x    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ block        <dbl> 400, 402, 402, 404, 372, 373, 373, 374, 375, 376, 376, 37…
$ lot          <dbl> 11, 36, 60, 42, 55, 46, 46, 8, 30, 5, 5, 13, 31, 62, 67, …
$ sale_price   <dbl> 280000, 2000000, 3400000, 4000000, 655000, 470000, 474000…
$ zip_code     <dbl> 10009, 10009, 10009, 10009, 10009, 10009, 10009, 10009, 1…
$ borough_y    <chr> "MN", "MN", "MN", "MN", "MN", "SI", "SI", "BK", "MN", "MN…
$ yearbuilt    <dbl> 1900, 1900, 1920, 1855, 1920, 1944, 1944, 1899, 1902, 190…
$ lotarea      <dbl> 2404, 1919, 2150, 2369, 5298, 3132, 3132, 1400, 3082, 377…
$ bldgarea     <dbl> 6875, 7810, 6152, 6990, 17990, 2440, 2440, 2700, 17556, 1…
$ resarea      <dbl> 6875, 6810, 5352, 6990, 17990, 440, 440, 2700, 17556, 140…
$ comarea      <dbl> 0, 1000, 800, 0, 0, 2000, 2000, 0, 0, 1000, 1000, 0, 0, 0…
$ unitsres     <dbl> 18, 8, 8, 10, 20, 1, 1, 3, 19, 22, 22, 5, 19, 13, 17, 16,…
$ unitstotal   <dbl> 18, 10, 9, 10, 20, 2, 2, 3, 19, 24, 24, 5, 19, 13, 17, 18…
$ numfloors    <dbl> 5, 5, 5, 5, 5, 2, 2, 3, 6, 6, 6, 3, 6, 6, 6, 6, 6, 6, 6, …
$ latitude     <dbl> 40.72442, 40.72497, 40.72535, 40.72740, 40.72062, 40.6131…
$ longitude    <dbl> -73.98389, -73.98154, -73.98311, -73.98060, -73.97953, -7…
$ landuse      <dbl> 2, 2, 4, 2, 2, 4, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ bldgclass    <chr> "C6", "C4", "C7", "C4", "C6", "K4", "K4", "C0", "C6", "C6…
$ building_age <dbl> 125, 125, 105, 170, 105, 81, 81, 126, 123, 125, 125, 117,…

Subset data

First, I will create a subset that looks at sales in Brooklyn and include the sale price, zip code, year that the primary structure on the lot was built, and the lot area measured in square feet. I also filter for properties where the primary structure was built after 1999 and that has a lot area greater than 10,000 square feet.

library(gt)

df1 <- df[df$borough_y == "BK" & df$yearbuilt > 1999 & df$lotarea > 10000, c("borough_y", "sale_price", "zip_code", "yearbuilt", "lotarea")]

df1 |>
  gt()
borough_y sale_price zip_code yearbuilt lotarea
BK 5300000 10014 2022 74000
BK 4200000 10014 2022 74000
BK 4900000 10014 2022 74000
BK 505000 10016 2019 15025
BK 410000 10016 2019 15025
BK 535000 10016 2019 15025
BK 575000 10016 2019 15025
BK 425000 10016 2019 15025
BK 385000 10016 2019 15025
BK 1536250 10025 2015 12140
BK 2100000 10025 2015 12140
BK 2595000 10025 2015 12140
BK 4550000 10467 2000 13473
BK 3779865 11209 2008 12690
BK 1500000 11234 2018 10100
BK 470000 11213 2009 19466
BK 176073 11201 2013 13295
BK 4000000 11234 2002 22774
BK 813582 11231 2007 15262
BK 1690000 11106 2024 25282
BK 950000 11373 2015 12140
BK 216000 11358 2005 28260
BK 1400000 11375 2019 13800
BK 850000 11385 2011 20000
BK 988000 11370 2014 22012
BK 245000 11372 2020 14504
BK 545000 11372 2015 28375
BK 325000 11372 2015 28375
BK 815000 11372 2015 28375
BK 285000 11374 2005 20000
BK 370000 11374 2005 20000
BK 925000 11385 2024 25123
BK 855000 10308 2024 20287
BK 855000 10303 2016 21371
BK 100000 10309 2021 14221
BK 650000 10314 2023 43020

Sort data

Then I’ll sort the data in order of year that the primary structure was built in order to improve readability.

df2 <- df1[order(df1$yearbuilt), c("borough_y", "sale_price", "zip_code", "yearbuilt", "lotarea")]

df2 |>
  gt()
borough_y sale_price zip_code yearbuilt lotarea
BK 4550000 10467 2000 13473
BK 4000000 11234 2002 22774
BK 216000 11358 2005 28260
BK 285000 11374 2005 20000
BK 370000 11374 2005 20000
BK 813582 11231 2007 15262
BK 3779865 11209 2008 12690
BK 470000 11213 2009 19466
BK 850000 11385 2011 20000
BK 176073 11201 2013 13295
BK 988000 11370 2014 22012
BK 1536250 10025 2015 12140
BK 2100000 10025 2015 12140
BK 2595000 10025 2015 12140
BK 950000 11373 2015 12140
BK 545000 11372 2015 28375
BK 325000 11372 2015 28375
BK 815000 11372 2015 28375
BK 855000 10303 2016 21371
BK 1500000 11234 2018 10100
BK 505000 10016 2019 15025
BK 410000 10016 2019 15025
BK 535000 10016 2019 15025
BK 575000 10016 2019 15025
BK 425000 10016 2019 15025
BK 385000 10016 2019 15025
BK 1400000 11375 2019 13800
BK 245000 11372 2020 14504
BK 100000 10309 2021 14221
BK 5300000 10014 2022 74000
BK 4200000 10014 2022 74000
BK 4900000 10014 2022 74000
BK 650000 10314 2023 43020
BK 1690000 11106 2024 25282
BK 925000 11385 2024 25123
BK 855000 10308 2024 20287

Final Touches

Finally, I will change the column names to be more readable, change the borough name from “BK” to “Brooklyn”, and properly format the sale price to show that it’s a currency value.

df3 <- df2[order(df2$yearbuilt), c("borough_y", "sale_price", "zip_code", "yearbuilt", "lotarea")]

df3 <- df3 |>
  mutate(
    borough_y = if_else(borough_y == "BK", "Brooklyn", borough_y)
  )

df3 |>
  gt() |>
  cols_label(
    borough_y = "Borough",
    sale_price = "Sale Price",
    zip_code = "Zip Code",
    yearbuilt = "Year Built",
    lotarea = "Lot Area (sqft)"
  ) |>
  fmt_currency(
    columns = sale_price,
  )
Borough Sale Price Zip Code Year Built Lot Area (sqft)
Brooklyn $4,550,000.00 10467 2000 13473
Brooklyn $4,000,000.00 11234 2002 22774
Brooklyn $216,000.00 11358 2005 28260
Brooklyn $285,000.00 11374 2005 20000
Brooklyn $370,000.00 11374 2005 20000
Brooklyn $813,582.00 11231 2007 15262
Brooklyn $3,779,865.00 11209 2008 12690
Brooklyn $470,000.00 11213 2009 19466
Brooklyn $850,000.00 11385 2011 20000
Brooklyn $176,073.00 11201 2013 13295
Brooklyn $988,000.00 11370 2014 22012
Brooklyn $1,536,250.00 10025 2015 12140
Brooklyn $2,100,000.00 10025 2015 12140
Brooklyn $2,595,000.00 10025 2015 12140
Brooklyn $950,000.00 11373 2015 12140
Brooklyn $545,000.00 11372 2015 28375
Brooklyn $325,000.00 11372 2015 28375
Brooklyn $815,000.00 11372 2015 28375
Brooklyn $855,000.00 10303 2016 21371
Brooklyn $1,500,000.00 11234 2018 10100
Brooklyn $505,000.00 10016 2019 15025
Brooklyn $410,000.00 10016 2019 15025
Brooklyn $535,000.00 10016 2019 15025
Brooklyn $575,000.00 10016 2019 15025
Brooklyn $425,000.00 10016 2019 15025
Brooklyn $385,000.00 10016 2019 15025
Brooklyn $1,400,000.00 11375 2019 13800
Brooklyn $245,000.00 11372 2020 14504
Brooklyn $100,000.00 10309 2021 14221
Brooklyn $5,300,000.00 10014 2022 74000
Brooklyn $4,200,000.00 10014 2022 74000
Brooklyn $4,900,000.00 10014 2022 74000
Brooklyn $650,000.00 10314 2023 43020
Brooklyn $1,690,000.00 11106 2024 25282
Brooklyn $925,000.00 11385 2024 25123
Brooklyn $855,000.00 10308 2024 20287

Conclusions

In conclusion, something extra I would do to extend the data would be a price per square foot column to compare the pricing of the various properties independent of size. The findings based on the data shows that there are some years with multiple sales of big (10,000+ sqft) properties built after 1999 in Brooklyn, as well as some years like 2006, where there are none.