library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
# Read the Airbnb dataset
df_airbnb <- read_csv('AB_NYC_2019.csv')
## Rows: 48895 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): name, host_name, neighbourhood_group, neighbourhood, room_type, la...
## dbl (10): id, host_id, latitude, longitude, price, minimum_nights, number_of...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cat("Original Airbnb Dataset Shape:", dim(df_airbnb)[1], "rows x", dim(df_airbnb)[2], "columns\n")
## Original Airbnb Dataset Shape: 48895 rows x 16 columns
cat("\nFirst 3 rows of original data:\n")
##
## First 3 rows of original data:
print(head(df_airbnb, 3))
## # A tibble: 3 × 16
## id name host_id host_name neighbourhood_group neighbourhood latitude
## <dbl> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 2539 Clean & qu… 2787 John Brooklyn Kensington 40.6
## 2 2595 Skylit Mid… 2845 Jennifer Manhattan Midtown 40.8
## 3 3647 THE VILLAG… 4632 Elisabeth Manhattan Harlem 40.8
## # ℹ 9 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
## # minimum_nights <dbl>, number_of_reviews <dbl>, last_review <chr>,
## # reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
## # availability_365 <dbl>
# Create long format using pivot_longer()
# Keep identifier columns, pivot measurement columns
airbnb_long <- df_airbnb %>%
pivot_longer(
cols = c(price, minimum_nights, number_of_reviews,
reviews_per_month, calculated_host_listings_count, availability_365),
names_to = "metric_type",
values_to = "metric_value"
)
# Clean and tidy the data
airbnb_long_tidy <- airbnb_long %>%
drop_na(metric_value) %>% # Remove null values
arrange(id, metric_type) # Sort by id and metric_type
print(head(airbnb_long_tidy, 12))
## # A tibble: 12 × 12
## id name host_id host_name neighbourhood_group neighbourhood latitude
## <dbl> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 2539 Clean & q… 2787 John Brooklyn Kensington 40.6
## 2 2539 Clean & q… 2787 John Brooklyn Kensington 40.6
## 3 2539 Clean & q… 2787 John Brooklyn Kensington 40.6
## 4 2539 Clean & q… 2787 John Brooklyn Kensington 40.6
## 5 2539 Clean & q… 2787 John Brooklyn Kensington 40.6
## 6 2539 Clean & q… 2787 John Brooklyn Kensington 40.6
## 7 2595 Skylit Mi… 2845 Jennifer Manhattan Midtown 40.8
## 8 2595 Skylit Mi… 2845 Jennifer Manhattan Midtown 40.8
## 9 2595 Skylit Mi… 2845 Jennifer Manhattan Midtown 40.8
## 10 2595 Skylit Mi… 2845 Jennifer Manhattan Midtown 40.8
## 11 2595 Skylit Mi… 2845 Jennifer Manhattan Midtown 40.8
## 12 2595 Skylit Mi… 2845 Jennifer Manhattan Midtown 40.8
## # ℹ 5 more variables: longitude <dbl>, room_type <chr>, last_review <chr>,
## # metric_type <chr>, metric_value <dbl>
# Save tidied long format
write_csv(airbnb_long_tidy, 'airbnb_long_tidy.csv')
cat("1. AIRBNB DATASET:\n")
## 1. AIRBNB DATASET:
cat(" Wide: ", nrow(df_airbnb), "rows ×", ncol(df_airbnb), "columns\n")
## Wide: 48895 rows × 16 columns
cat(" Long: ", nrow(airbnb_long_tidy), "rows ×", ncol(airbnb_long_tidy), "columns\n")
## Long: 283318 rows × 12 columns
cat(" Output: airbnb_long_tidy.csv\n\n")
## Output: airbnb_long_tidy.csv
The Airbnb dataset originally had 48,895 rows (one per listing) with
measurements spread across 16 columns. Using
pivot_longer()
, we transformed 6 measurement columns
(price, minimum_nights, number_of_reviews, etc.) into long format by
creating two new columns: “metric_type” (what we’re measuring) and
“metric_value” (the actual number). Each listing that was 1 row became 6
rows - one per metric - with identifier information repeated across
rows. The dataset grew from ~49,000 to ~290,000 rows, creating a
cleaner, more analyzable structure where each row represents a single
measurement.