Transforming Data with the Airbnb dataset (Second Example)

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)

# Read the Airbnb dataset
df_airbnb <- read_csv('AB_NYC_2019.csv')
## Rows: 48895 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (6): name, host_name, neighbourhood_group, neighbourhood, room_type, la...
## dbl (10): id, host_id, latitude, longitude, price, minimum_nights, number_of...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cat("Original Airbnb Dataset Shape:", dim(df_airbnb)[1], "rows x", dim(df_airbnb)[2], "columns\n")
## Original Airbnb Dataset Shape: 48895 rows x 16 columns
cat("\nFirst 3 rows of original data:\n")
## 
## First 3 rows of original data:
print(head(df_airbnb, 3))
## # A tibble: 3 × 16
##      id name        host_id host_name neighbourhood_group neighbourhood latitude
##   <dbl> <chr>         <dbl> <chr>     <chr>               <chr>            <dbl>
## 1  2539 Clean & qu…    2787 John      Brooklyn            Kensington        40.6
## 2  2595 Skylit Mid…    2845 Jennifer  Manhattan           Midtown           40.8
## 3  3647 THE VILLAG…    4632 Elisabeth Manhattan           Harlem            40.8
## # ℹ 9 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
## #   minimum_nights <dbl>, number_of_reviews <dbl>, last_review <chr>,
## #   reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
## #   availability_365 <dbl>
# Create long format using pivot_longer()
# Keep identifier columns, pivot measurement columns
airbnb_long <- df_airbnb %>%
  pivot_longer(
    cols = c(price, minimum_nights, number_of_reviews, 
             reviews_per_month, calculated_host_listings_count, availability_365),
    names_to = "metric_type",
    values_to = "metric_value"
  )

# Clean and tidy the data
airbnb_long_tidy <- airbnb_long %>%
  drop_na(metric_value) %>%  # Remove null values
  arrange(id, metric_type)    # Sort by id and metric_type

print(head(airbnb_long_tidy, 12))
## # A tibble: 12 × 12
##       id name       host_id host_name neighbourhood_group neighbourhood latitude
##    <dbl> <chr>        <dbl> <chr>     <chr>               <chr>            <dbl>
##  1  2539 Clean & q…    2787 John      Brooklyn            Kensington        40.6
##  2  2539 Clean & q…    2787 John      Brooklyn            Kensington        40.6
##  3  2539 Clean & q…    2787 John      Brooklyn            Kensington        40.6
##  4  2539 Clean & q…    2787 John      Brooklyn            Kensington        40.6
##  5  2539 Clean & q…    2787 John      Brooklyn            Kensington        40.6
##  6  2539 Clean & q…    2787 John      Brooklyn            Kensington        40.6
##  7  2595 Skylit Mi…    2845 Jennifer  Manhattan           Midtown           40.8
##  8  2595 Skylit Mi…    2845 Jennifer  Manhattan           Midtown           40.8
##  9  2595 Skylit Mi…    2845 Jennifer  Manhattan           Midtown           40.8
## 10  2595 Skylit Mi…    2845 Jennifer  Manhattan           Midtown           40.8
## 11  2595 Skylit Mi…    2845 Jennifer  Manhattan           Midtown           40.8
## 12  2595 Skylit Mi…    2845 Jennifer  Manhattan           Midtown           40.8
## # ℹ 5 more variables: longitude <dbl>, room_type <chr>, last_review <chr>,
## #   metric_type <chr>, metric_value <dbl>
# Save tidied long format
write_csv(airbnb_long_tidy, 'airbnb_long_tidy.csv')



cat("1. AIRBNB DATASET:\n")
## 1. AIRBNB DATASET:
cat("   Wide:  ", nrow(df_airbnb), "rows ×", ncol(df_airbnb), "columns\n")
##    Wide:   48895 rows × 16 columns
cat("   Long:  ", nrow(airbnb_long_tidy), "rows ×", ncol(airbnb_long_tidy), "columns\n")
##    Long:   283318 rows × 12 columns
cat("   Output: airbnb_long_tidy.csv\n\n")
##    Output: airbnb_long_tidy.csv
Airbnb Data Transformation Summary: from Wide to Long Format

The Airbnb dataset originally had 48,895 rows (one per listing) with measurements spread across 16 columns. Using pivot_longer(), we transformed 6 measurement columns (price, minimum_nights, number_of_reviews, etc.) into long format by creating two new columns: “metric_type” (what we’re measuring) and “metric_value” (the actual number). Each listing that was 1 row became 6 rows - one per metric - with identifier information repeated across rows. The dataset grew from ~49,000 to ~290,000 rows, creating a cleaner, more analyzable structure where each row represents a single measurement.