Introduction

This dataset covers New York City house prices and is sourced from Kaggle: https://www.kaggle.com/datasets/nelgiriyewithana/new-york-housing-market. I use a modified version of the CSV file in which redundant columns have been removed and the data cleaned.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard()    masks scales::discard()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ dplyr::lag()        masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

house <- read.csv('https://raw.githubusercontent.com/vincent-usny/nychouse/refs/heads/main/nychouse.csv')

house <- house %>%
  mutate(price_per_sqft = round(Price / Propertysqft))
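
The division above returns Inf for any row where Propertysqft is 0. Whether the cleaned CSV contains such rows is not stated, so the following is only a defensive sketch on made-up values (the `toy` data frame is hypothetical) showing how zero or missing square footage could be mapped to NA instead.

```r
library(dplyr)

# Toy rows (hypothetical values, not taken from the real dataset)
toy <- data.frame(
  Price        = c(500000, 750000, 300000),
  Propertysqft = c(1000, 0, NA)
)

toy <- toy %>%
  mutate(
    # Only divide when the square footage is positive; otherwise return NA
    price_per_sqft = if_else(
      Propertysqft > 0,
      round(Price / Propertysqft),
      NA_real_
    )
  )

toy$price_per_sqft
```

Later summaries that already pass na.rm = TRUE would then skip these rows rather than return Inf.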

# Compare prices by property type
house %>%
  group_by(Type) %>%
  summarise(
    avg_price = mean(Price),
    median_price = median(Price)) %>%
  arrange(desc(avg_price))

## # A tibble: 13 × 3
##    Type              avg_price median_price
##    <chr>                 <dbl>        <dbl>
##  1 Townhouse          6365925.      2950000
##  2 House              3709576.       858500
##  3 Condo              2627017.       899000
##  4 For sale           1954536.      1044500
##  5 Multi-family home  1683690.      1198500
##  6 Foreclosure        1343010.       592450
##  7 Pending            1341263.       799500
##  8 Mobile house       1288000       1288000
##  9 Coming Soon        1172000       1172000
## 10 Land               1112852.       675000
## 11 Co-op              1104385.       425000
## 12 Condop              998600       1080000
## 13 Contingent          881395.       675000

# Compare prices by city
house %>%
  group_by(City) %>%
  summarise(
    avg_price = mean(Price),
    median_price = median(Price)
  ) %>%
  arrange(desc(median_price))

## # A tibble: 71 × 3
##    City               avg_price median_price
##    <chr>                  <dbl>        <dbl>
##  1 Malba               2966296       3900000
##  2 Brooklyn Heights    3000000       3000000
##  3 Canarsie            2999950       2999950
##  4 Stuyvesant Heights  2380000       2380000
##  5 New York            7212066.      1599000
##  6 Manhattan           3374570.      1337500
##  7 Whitestone          1444166       1288000
##  8 Bedford Stuyvesant  1307666.      1279000
##  9 Ditmas Park         1250000       1250000
## 10 Ridgewood           1524617.      1250000
## # ℹ 61 more rows
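
In the output above, several cities (for example Brooklyn Heights and Canarsie) have identical mean and median prices, which is consistent with a city that has only one listing. A minimal sketch, on a small made-up data frame rather than the real data, of how a listing count and a minimum-size filter could be added to the ranking. The threshold `min_listings` and the toy values are assumptions for illustration.

```r
library(dplyr)

# Toy data (hypothetical values, not from the real dataset)
toy <- data.frame(
  City  = c("A", "A", "A", "B", "C", "C"),
  Price = c(500000, 700000, 900000, 3000000, 400000, 600000)
)

min_listings <- 2  # assumed threshold; choose one that suits the data

city_summary <- toy %>%
  group_by(City) %>%
  summarise(
    n_listings   = n(),           # number of listings per city
    avg_price    = mean(Price),
    median_price = median(Price)
  ) %>%
  filter(n_listings >= min_listings) %>%  # drop cities with too few listings
  arrange(desc(median_price))

city_summary
```

Here city B, with a single expensive listing, drops out of the ranking instead of sitting at the top.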

# Compare average beds and baths by city
house %>%
  group_by(City) %>%
  summarise(
    avg_beds = round(mean(Beds)),
    avg_bath = round(mean(Bath))
  )

## # A tibble: 71 × 3
##    City               avg_beds avg_bath
##    <chr>                 <dbl>    <dbl>
##  1 Arverne                   4        3
##  2 Astoria                   4        2
##  3 Bayside                   3        2
##  4 Bedford Stuyvesant        4        2
##  5 Beechhurst                2        2
##  6 Belle Harbor              3        2
##  7 Bellerose                 3        2
##  8 Briarwood                 2        2
##  9 Brighton Beach            1        1
## 10 Bronx                     4        2
## # ℹ 61 more rows

# Compare price and square footage for condos and houses
house %>%
  filter(
    Type %in% c("Condo","House"),
    Price < 2000000) %>% # narrow the y-axis
  ggplot(aes(x = Propertysqft, y = Price, color = Type)) +
  geom_point(alpha = 0.6, na.rm=TRUE) +
  labs(
    title = "Price vs Propertysqft",
    x = "Propertysqft",
    y = "Price"
  )

The plot suggests that, at a given price, houses tend to have more square footage than condos.
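
One way to check the impression from the scatter plot numerically is to bin listings into price bands and compare the median square footage of each type within a band. The sketch below uses a made-up data frame; the band edges and labels are assumptions chosen for illustration, not values from the original analysis.

```r
library(dplyr)

# Toy listings (hypothetical values, for illustration only)
toy <- data.frame(
  Type         = c("Condo", "Condo", "House", "House", "Condo", "House"),
  Price        = c(450000, 480000, 460000, 490000, 1200000, 1300000),
  Propertysqft = c(600, 700, 1400, 1600, 1100, 2500)
)

band_summary <- toy %>%
  mutate(price_band = cut(
    Price,
    breaks = c(0, 500000, 1000000, 2000000),   # assumed band edges
    labels = c("up to 500k", "500k-1M", "1M-2M")
  )) %>%
  group_by(price_band, Type) %>%
  summarise(
    median_sqft = median(Propertysqft),
    n_listings  = n(),
    .groups = "drop"
  )

band_summary
```

If houses really offer more space for the same money, median_sqft for House should exceed that for Condo within most bands.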

city_house <- house %>%
  filter(Type == "House") %>%
  group_by(City,Type) %>%
  summarise(median_price = median(Price, na.rm=TRUE), .groups = "drop") %>%
  arrange(desc(median_price))

# Top 10 cities by median house price
top10_cities <- city_house %>%
  slice_max(median_price, n=10) %>%
  arrange(desc(median_price))

ggplot(top10_cities, aes(x = reorder(City, -median_price), y = median_price)) +
  geom_col(fill = "steelblue") +
  labs(
    title = "Top 10 cities by median house price",
    x = "City",
    y = "Median Price"
  ) +
  theme(axis.text.x = element_text(size = 8))

Conclusion

This assignment summarises the data with group_by(), summarise(), and arrange(), among other functions, and produces two plots that visualise patterns in the NYC housing market.

Extended Analysis (Sabina Baraili)

Below is an extension of the original vignette with additional TidyVerse transformations and visualizations.

1. Price Per Sqft by Property Type (Extended)

house %>%
  group_by(Type) %>%
  summarise(
    avg_ppsqft = mean(price_per_sqft, na.rm = TRUE),
    median_ppsqft = median(price_per_sqft, na.rm = TRUE),
    count = n()
  ) %>%
  arrange(desc(avg_ppsqft))

## # A tibble: 13 × 4
##    Type              avg_ppsqft median_ppsqft count
##    <chr>                  <dbl>         <dbl> <int>
##  1 Townhouse              1534.          854    299
##  2 Condo                  1218.          923    889
##  3 For sale               1018.          950     20
##  4 House                   837.          493   1000
##  5 Co-op                   607.          352.  1442
##  6 Pending                 606.          458.   242
##  7 Multi-family home       598.          461    716
##  8 Mobile house            590           590      1
##  9 Coming Soon             525           525      2
## 10 Land                    522.          309     47
## 11 Contingent              505.          424     87
## 12 Foreclosure             460           351     14
## 13 Condop                  457.          495      5

2. Property Size Category vs Median Price

house %>%
  mutate(size_category = case_when(
    Propertysqft < 800 ~ "Small",
    Propertysqft >= 800 & Propertysqft < 1500 ~ "Medium",
    TRUE ~ "Large"
  )) %>%
  group_by(size_category) %>%
  summarise(
    median_price = median(Price, na.rm = TRUE),
    avg_ppsqft = mean(price_per_sqft, na.rm = TRUE)
  ) %>%
  arrange(desc(median_price))

## # A tibble: 3 × 3
##   size_category median_price avg_ppsqft
##   <chr>                <dbl>      <dbl>
## 1 Large              1049000       891.
## 2 Medium              640000       686.
## 3 Small               375000       698.

Visualization: Property Size vs Price

house %>%
  mutate(size_category = case_when(
    Propertysqft < 800 ~ "Small",
    Propertysqft >= 800 & Propertysqft < 1500 ~ "Medium",
    TRUE ~ "Large"
  )) %>%
  ggplot(aes(x = size_category, y = Price)) +
  geom_boxplot(fill = "skyblue") +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Price Distribution by Property Size Category",
    x = "Size Category",
    y = "Price"
  )
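
Two details of the size categories used above may be worth noting: the catch-all `TRUE ~ "Large"` branch would also label rows with a missing Propertysqft as "Large", and a character column is ordered alphabetically (Large, Medium, Small) on a plot axis. A minimal sketch of an alternative using base R's cut(), shown on made-up values; it keeps the same boundaries (800 and 1500), leaves missing values as NA, and returns a factor whose levels run Small, Medium, Large.

```r
library(dplyr)

# Toy square footages (hypothetical), including boundary values and an NA
toy <- data.frame(Propertysqft = c(500, 800, 1499, 1500, NA))

toy <- toy %>%
  mutate(size_category = cut(
    Propertysqft,
    breaks = c(-Inf, 800, 1500, Inf),
    labels = c("Small", "Medium", "Large"),
    right  = FALSE   # intervals are [lower, upper), so 800 is Medium and 1500 is Large
  ))

toy$size_category
```

Because the result is a factor with explicit level order, a boxplot built on it would show the categories from smallest to largest.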

3. Price Per Sqft Distribution

ggplot(house, aes(x = price_per_sqft)) +
  geom_histogram(bins = 30, fill = "purple", alpha = 0.6) +
  labs(
    title = "Distribution of Price Per Sqft",
    x = "Price Per Sqft",
    y = "Count"
  )

4. Bedrooms vs Price (Extended)

house %>%
  filter(Beds <= 6, Price < 2500000) %>%
  ggplot(aes(x = Beds, y = Price)) +
  geom_point(alpha = 0.4, color = "darkgreen") +
  geom_smooth(method = "lm", se = FALSE) +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Relationship Between Bedrooms and Price",
    x = "Bedrooms",
    y = "Price"
  )

## `geom_smooth()` using formula = 'y ~ x'

5. Beds and Baths Correlation (Extended)

house %>%
  summarise(
    correlation_beds_baths = cor(Beds, Bath, use = "complete.obs")
  )

##   correlation_beds_baths
## 1              0.7740146
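
cor() computes the Pearson coefficient by default, which measures linear association and can be pulled around by a few very large properties. As a hedged sketch on made-up values (not the real data), a rank-based Spearman coefficient can be computed alongside it for comparison.

```r
library(dplyr)

# Toy data (hypothetical): Bath rises with Beds, but not linearly
toy <- data.frame(
  Beds = c(1, 2, 3, 4, 12),
  Bath = c(1, 1.5, 2, 3, 4)
)

cor_summary <- toy %>%
  summarise(
    pearson  = cor(Beds, Bath, use = "complete.obs", method = "pearson"),
    spearman = cor(Beds, Bath, use = "complete.obs", method = "spearman")
  )

cor_summary
```

In this toy example the ranks agree perfectly, so the Spearman value is 1 while the Pearson value is lower; a large gap between the two on the real data would suggest a monotonic but non-linear relationship or influential outliers.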

Conclusion (Extended Analysis)

The extended portion of this vignette builds on the original NYC housing analysis by introducing additional TidyVerse techniques that provide deeper insight into the dataset. New transformations such as creating property size categories, calculating price-per-square-foot statistics, and exploring numerical relationships helped highlight structural patterns in the market. Visualizations added in this section offer clearer comparisons across property sizes, price distributions, and bedroom–price trends.

These additions demonstrate how TidyVerse functions can be combined to enrich exploratory data analysis and uncover insights that were not visible in the original summary alone.

TidyVerse - NYC Housing Analysis (Extension)

Haoming Chen (Original Author)

Sabina Baraili (Extended Analysis)

2025-11-22