For this analysis, I plan to explore New York City Airbnb listings to understand factors that influence rental prices and availability across different neighborhoods and room types. The primary goal is to clean and transform the dataset into a format suitable for analysis, identify key variables, and prepare it for potential predictive modeling in later assignments.
To tackle this problem, I will:
- Source a publicly accessible NYC Airbnb dataset that includes pricing, room type, neighborhood, and availability information.
- Subset relevant columns and ensure meaningful column names are used.
- Identify a potential target variable (price) and explore patterns across different categories such as neighborhood group and room type.
- Handle data cleaning tasks, such as removing extreme outliers (e.g., listings with unrealistically high prices or minimum nights) and addressing missing values.
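As a rough sketch of the cleaning step described above, the example below filters out extreme prices and minimum-night values and fills in missing review rates. The toy data, the `reviews_per_month` column, and the two cutoffs (`max_price`, `max_min_nights`) are assumptions made for illustration only; the actual thresholds would be chosen after inspecting the real distributions.

```r
library(dplyr)

# Hypothetical toy data standing in for the real listings
listings <- tibble(
  neighbourhood_group = c("Manhattan", "Brooklyn", "Queens", "Manhattan", "Brooklyn"),
  room_type = c("Entire home/apt", "Private room", "Private room",
                "Entire home/apt", "Shared room"),
  price = c(225, 89, 0, 10000, 60),
  minimum_nights = c(1, 3, 2, 1, 1250),
  reviews_per_month = c(0.38, NA, 4.64, 0.10, NA)  # assumed column name
)

# Assumed cutoffs, chosen only for illustration
max_price <- 1000
max_min_nights <- 365

listings_clean <- listings %>%
  # Drop free listings and extreme price / minimum-night values
  filter(price > 0, price <= max_price, minimum_nights <= max_min_nights) %>%
  # Treat a missing review rate as zero reviews per month
  mutate(reviews_per_month = coalesce(reviews_per_month, 0))

listings_clean
```

Replacing missing review rates with zero is only one option; dropping those rows or keeping the `NA` values may be more appropriate depending on the later analysis.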
Anticipated data challenges include:
- Large dataset size (~48,000 listings), which may require efficient data handling.
- Missing or inconsistent values for certain variables (e.g., reviews, availability).
- Outliers that could distort summary statistics or visualizations.
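One way to gauge the missing-value challenge before cleaning is to count `NA` values per column. The sketch below does this on a small made-up tibble; the values are hypothetical and only the column names `price`, `number_of_reviews`, and `room_type` come from the dataset.

```r
library(dplyr)

# Hypothetical toy data with a few missing entries
listings <- tibble(
  price = c(150, 80, NA, 120),
  number_of_reviews = c(9, NA, 0, 2),
  room_type = c("Private room", NA, "Entire home/apt", "Private room")
)

# Count missing values in every column
na_counts <- listings %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```

The same `summarise(across(...))` call should work unchanged on the full data frame, since it does not depend on specific column names.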
This approach will allow me to produce a clean, analyzable dataset and provide a foundation for further exploratory analysis or predictive modeling in later weeks.
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load the dataset
url <- "https://huggingface.co/datasets/gradio/NYC-Airbnb-Open-Data/resolve/main/AB_NYC_2019.csv"
airbnb_raw <- read_csv(url)
Rows: 48895 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): name, host_name, neighbourhood_group, neighbourhood, room_type
dbl (10): id, host_id, latitude, longitude, price, minimum_nights, number_o...
date (1): last_review
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Subset 5 columns
airbnb_subset <- airbnb_raw %>%
  select(name, neighbourhood_group, room_type, price, number_of_reviews) %>%
  slice_head(n = 7) # take first 7 rows

# Show as a table
airbnb_subset