This dataset covers New York City house prices in recent years and is sourced from Kaggle: https://www.kaggle.com/datasets/nelgiriyewithana/new-york-housing-market. I use a modified version of the CSV file in which redundant columns have been removed and the data has been cleaned.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
house <- read.csv('https://raw.githubusercontent.com/vincent-usny/nychouse/refs/heads/main/nychouse.csv')
house <- house %>%
  mutate(price_per_sqft = round(Price / Propertysqft))
# Compare prices by property type
house %>%
  group_by(Type) %>%
  summarise(
    avg_price = mean(Price),
    median_price = median(Price)
  ) %>%
  arrange(desc(avg_price))
## # A tibble: 13 × 3
## Type avg_price median_price
## <chr> <dbl> <dbl>
## 1 Townhouse 6365925. 2950000
## 2 House 3709576. 858500
## 3 Condo 2627017. 899000
## 4 For sale 1954536. 1044500
## 5 Multi-family home 1683690. 1198500
## 6 Foreclosure 1343010. 592450
## 7 Pending 1341263. 799500
## 8 Mobile house 1288000 1288000
## 9 Coming Soon 1172000 1172000
## 10 Land 1112852. 675000
## 11 Co-op 1104385. 425000
## 12 Condop 998600 1080000
## 13 Contingent 881395. 675000
# Compare prices by city
house %>%
  group_by(City) %>%
  summarise(
    avg_price = mean(Price),
    median_price = median(Price)
  ) %>%
  arrange(desc(median_price))
## # A tibble: 71 × 3
## City avg_price median_price
## <chr> <dbl> <dbl>
## 1 Malba 2966296 3900000
## 2 Brooklyn Heights 3000000 3000000
## 3 Canarsie 2999950 2999950
## 4 Stuyvesant Heights 2380000 2380000
## 5 New York 7212066. 1599000
## 6 Manhattan 3374570. 1337500
## 7 Whitestone 1444166 1288000
## 8 Bedford Stuyvesant 1307666. 1279000
## 9 Ditmas Park 1250000 1250000
## 10 Ridgewood 1524617. 1250000
## # ℹ 61 more rows
# Compare average beds and baths by city
house %>%
  group_by(City) %>%
  summarise(
    avg_beds = round(mean(Beds)),
    avg_bath = round(mean(Bath))
  )
## # A tibble: 71 × 3
## City avg_beds avg_bath
## <chr> <dbl> <dbl>
## 1 Arverne 4 3
## 2 Astoria 4 2
## 3 Bayside 3 2
## 4 Bedford Stuyvesant 4 2
## 5 Beechhurst 2 2
## 6 Belle Harbor 3 2
## 7 Bellerose 3 2
## 8 Briarwood 2 2
## 9 Brighton Beach 1 1
## 10 Bronx 4 2
## # ℹ 61 more rows
# Compare price and square footage between condos and houses
house %>%
  filter(
    Type %in% c("Condo", "House"),
    Price < 2000000 # cap the price to narrow the y-axis
  ) %>%
  ggplot(aes(x = Propertysqft, y = Price, color = Type)) +
  geom_point(alpha = 0.6, na.rm = TRUE) +
  labs(
    title = "Price vs Propertysqft",
    x = "Propertysqft",
    y = "Price"
  )
The plot suggests that, at a given price, houses tend to have more square footage than condos.
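One way to check this reading numerically is to compare the median square footage of each type within price bands. The following is a minimal sketch on hypothetical toy listings; the values in `toy` and the band boundaries are illustrative assumptions, not rows or thresholds from the dataset.

```r
library(dplyr)

# Hypothetical toy listings (NOT taken from the real dataset)
toy <- data.frame(
  Type = c("Condo", "Condo", "Condo", "House", "House", "House"),
  Price = c(500000, 900000, 1500000, 520000, 880000, 1450000),
  Propertysqft = c(600, 900, 1300, 1400, 2000, 2600)
)

# Bin listings into price bands, then compare median square footage by type
sqft_by_band <- toy %>%
  mutate(price_band = cut(
    Price,
    breaks = c(0, 750000, 1250000, 2000000),
    labels = c("up to 750k", "750k-1.25M", "1.25M-2M")
  )) %>%
  group_by(price_band, Type) %>%
  summarise(median_sqft = median(Propertysqft), .groups = "drop")

sqft_by_band
```

Applied to `house` (after the same `filter()` as in the plot), a consistently larger `median_sqft` for houses within each band would support the visual impression.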
city_house <- house %>%
  filter(Type == "House") %>%
  group_by(City, Type) %>%
  summarise(median_price = median(Price, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(median_price))
# Top 10 cities by median house price
top10_cities <- city_house %>%
  slice_max(median_price, n = 10) %>%
  arrange(desc(median_price))
ggplot(top10_cities, aes(x = reorder(City, -median_price), y = median_price)) +
  geom_col(fill = "steelblue") +
  labs(
    title = "Top 10 cities' median house prices",
    x = "City",
    y = "Median Price"
  ) +
  theme(axis.text.x = element_text(size = 8))
This assignment performs exploratory data analysis using group_by(), summarise(), arrange(), and related functions, and produces two plots that visualize patterns in the NYC housing market.
Below is an extension of the original vignette with additional TidyVerse transformations and visualizations.
house %>%
  group_by(Type) %>%
  summarise(
    avg_ppsqft = mean(price_per_sqft, na.rm = TRUE),
    median_ppsqft = median(price_per_sqft, na.rm = TRUE),
    count = n()
  ) %>%
  arrange(desc(avg_ppsqft))
## # A tibble: 13 × 4
## Type avg_ppsqft median_ppsqft count
## <chr> <dbl> <dbl> <int>
## 1 Townhouse 1534. 854 299
## 2 Condo 1218. 923 889
## 3 For sale 1018. 950 20
## 4 House 837. 493 1000
## 5 Co-op 607. 352. 1442
## 6 Pending 606. 458. 242
## 7 Multi-family home 598. 461 716
## 8 Mobile house 590 590 1
## 9 Coming Soon 525 525 2
## 10 Land 522. 309 47
## 11 Contingent 505. 424 87
## 12 Foreclosure 460 351 14
## 13 Condop 457. 495 5
house %>%
  mutate(size_category = case_when(
    Propertysqft < 800 ~ "Small",
    Propertysqft >= 800 & Propertysqft < 1500 ~ "Medium",
    TRUE ~ "Large"
  )) %>%
  group_by(size_category) %>%
  summarise(
    median_price = median(Price, na.rm = TRUE),
    avg_ppsqft = mean(price_per_sqft, na.rm = TRUE)
  ) %>%
  arrange(desc(median_price))
## # A tibble: 3 × 3
## size_category median_price avg_ppsqft
## <chr> <dbl> <dbl>
## 1 Large 1049000 891.
## 2 Medium 640000 686.
## 3 Small 375000 698.
house %>%
  mutate(size_category = case_when(
    Propertysqft < 800 ~ "Small",
    Propertysqft >= 800 & Propertysqft < 1500 ~ "Medium",
    TRUE ~ "Large"
  )) %>%
  ggplot(aes(x = size_category, y = Price)) +
  geom_boxplot(fill = "skyblue") +
  # Namespace-qualify comma() so the chunk does not rely on scales being attached
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Price Distribution by Property Size Category",
    x = "Size Category",
    y = "Price"
  )
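Because `size_category` is a character column, ggplot2 places the boxes in alphabetical order (Large, Medium, Small). A minimal sketch of one way to get a Small, Medium, Large ordering is to convert the column to a factor with explicit levels. The `toy` data frame below is hypothetical and only illustrates the technique.

```r
library(dplyr)

# Hypothetical toy listings (NOT taken from the real dataset)
toy <- data.frame(
  Propertysqft = c(450, 799, 800, 1200, 1499, 1500, 3000),
  Price = c(300000, 380000, 500000, 650000, 700000, 950000, 2000000)
)

sized <- toy %>%
  mutate(
    # case_when() checks conditions in order, so the second branch
    # only sees values that are already >= 800
    size_category = case_when(
      Propertysqft < 800 ~ "Small",
      Propertysqft < 1500 ~ "Medium",
      TRUE ~ "Large"
    ),
    # Explicit levels fix the display order in tables and plots
    size_category = factor(size_category, levels = c("Small", "Medium", "Large"))
  )

size_summary <- sized %>%
  group_by(size_category) %>%
  summarise(median_price = median(Price), n = n())

size_summary
```

With the factor in place, the same `ggplot()` call draws the boxes from smallest to largest category, and `summarise()` returns its rows in that order as well.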
ggplot(house, aes(x = price_per_sqft)) +
  geom_histogram(bins = 30, fill = "purple", alpha = 0.6) +
  labs(
    title = "Distribution of Price Per Sqft",
    x = "Price Per Sqft",
    y = "Count"
  )
house %>%
  filter(Beds <= 6, Price < 2500000) %>%
  ggplot(aes(x = Beds, y = Price)) +
  geom_point(alpha = 0.4, color = "darkgreen") +
  geom_smooth(method = "lm", se = FALSE) +
  # Namespace-qualify comma() so the chunk does not rely on scales being attached
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Relationship Between Bedrooms and Price",
    x = "Bedrooms",
    y = "Price"
  )
## `geom_smooth()` using formula = 'y ~ x'
house %>%
  summarise(
    correlation_beds_baths = cor(Beds, Bath, use = "complete.obs")
  )
## correlation_beds_baths
## 1 0.7740146
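A coefficient of about 0.77 indicates a strong positive linear association between bedroom and bathroom counts. The `use = "complete.obs"` argument matters only when values are missing. The small sketch below, on hypothetical vectors rather than the dataset, shows that it drops incomplete pairs before computing the (default Pearson) correlation.

```r
# Hypothetical toy vectors (NOT taken from the real dataset)
beds <- c(1, 2, 3, 4, NA)
bath <- c(1, 1, 2, 3, 2)

# Default behaviour: any NA in the inputs makes the result NA
r_all <- cor(beds, bath)

# complete.obs: pairs containing an NA are removed first
r_complete <- cor(beds, bath, use = "complete.obs")

# Equivalent manual version that keeps only the four complete pairs
r_manual <- cor(beds[1:4], bath[1:4])

is.na(r_all)
all.equal(r_complete, r_manual)
```

If `Beds` and `Bath` contain no missing values in the cleaned file, the argument has no effect on the result and is simply a safeguard.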
The extended portion of this vignette builds on the original NYC housing analysis with additional TidyVerse techniques. New transformations, such as property size categories and price-per-square-foot statistics, highlight patterns in the market: median price rises from $375,000 for small properties (under 800 sqft) to $1,049,000 for large ones (1,500 sqft and up), townhouses and condos have the highest average price per square foot, and bedroom and bathroom counts are strongly correlated (r ≈ 0.77). The added visualizations compare prices across property sizes, show the distribution of price per square foot, and plot the relationship between bedrooms and price.
These additions demonstrate how TidyVerse functions can be combined to enrich exploratory data analysis and uncover insights that were not visible in the original summary alone.