Week1_Approach

Author

Theresa Benny

Introduction

For this analysis, I plan to explore New York City Airbnb listings to understand factors that influence rental prices and availability across different neighborhoods and room types. The primary goal is to clean and transform the dataset into a format suitable for analysis, identify key variables, and prepare it for potential predictive modeling in later assignments.

To tackle this problem, I will:

- Source a publicly accessible NYC Airbnb dataset that includes pricing, room type, neighborhood, and availability information.
- Subset relevant columns and ensure meaningful column names are used.
- Identify a potential target variable (price) and explore patterns across different categories such as neighborhood group and room type.
- Handle data cleaning tasks, such as removing extreme outliers (e.g., listings with unrealistically high prices or minimum nights) and addressing missing values.
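The subsetting, renaming, and category comparison steps above could look roughly like the sketch below. It runs on a small made-up tibble rather than the real data, and the new column names (`borough`, `nightly_price`) and the room type labels are illustrative assumptions, not final choices.

```r
library(dplyr)

# Hypothetical toy rows standing in for the real listings (all values are made up)
listings <- tibble(
  neighbourhood_group = c("Brooklyn", "Brooklyn", "Manhattan",
                          "Manhattan", "Manhattan", "Brooklyn"),
  room_type = c("Private room", "Private room", "Entire home/apt",
                "Entire home/apt", "Private room", "Entire home/apt"),
  price = c(60, 80, 200, 240, 150, 90)
)

# Rename to more descriptive (assumed) names, then summarise the
# candidate target variable within each borough and room type
price_summary <- listings %>%
  rename(borough = neighbourhood_group, nightly_price = price) %>%
  group_by(borough, room_type) %>%
  summarise(
    n_listings = n(),
    median_price = median(nightly_price),
    .groups = "drop"
  )

price_summary
```

I use the median here rather than the mean because it is less sensitive to the extreme prices I expect to find before cleaning.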

Anticipated data challenges include:

- Large dataset size (~48,000 listings), which may require efficient data handling.
- Missing or inconsistent values for certain variables (e.g., reviews, availability).
- Outliers that could distort summary statistics or visualizations.
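As a rough illustration of how the missing-value and outlier challenges might be handled, the sketch below counts `NA` values per column and then filters on cutoffs. The toy rows are made up, the cutoffs (`max_price`, `max_min_nights`) are placeholder assumptions to be tuned after inspecting the real distributions, and treating a missing review count as zero is also an assumption.

```r
library(dplyr)

# Hypothetical toy rows (all values are made up); NA marks a missing value
listings <- tibble(
  price = c(60, 150, 10000, 0, 89, 120),
  minimum_nights = c(1, 3, 2, 1, 30, 1250),
  number_of_reviews = c(9, NA, 0, 45, NA, 270)
)

# Count missing values in every column
missing_counts <- listings %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Assumed cutoffs for "unrealistic" listings; not derived from the data
max_price <- 1000
max_min_nights <- 365

listings_clean <- listings %>%
  # Drop zero-priced listings and listings above either cutoff
  filter(price > 0, price <= max_price, minimum_nights <= max_min_nights) %>%
  # Assumption: a missing review count means the listing has no reviews
  mutate(number_of_reviews = coalesce(number_of_reviews, 0))

missing_counts
listings_clean
```

Fixed cutoffs are the simplest option; a quantile-based rule (for example, dropping prices above the 99th percentile) would be an alternative if hard thresholds turn out to be too arbitrary.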

This approach will allow me to produce a clean, analyzable dataset and provide a foundation for further exploratory analysis or predictive modeling in later weeks.


library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Load the dataset
url <- "https://huggingface.co/datasets/gradio/NYC-Airbnb-Open-Data/resolve/main/AB_NYC_2019.csv"
airbnb_raw <- read_csv(url)

Rows: 48895 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (5): name, host_name, neighbourhood_group, neighbourhood, room_type
dbl  (10): id, host_id, latitude, longitude, price, minimum_nights, number_o...
date  (1): last_review

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
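The message above suggests declaring column types explicitly. A minimal sketch of what that could look like follows, using two inline illustrative rows instead of the full download so that it runs offline. Parsing `number_of_reviews` as an integer is my own assumption; `read_csv()` guessed double for it.

```r
library(readr)

# Inline CSV text (illustrative rows) so the example does not need the network;
# I() tells readr to treat the string as literal data rather than a file path
csv_text <- I("neighbourhood_group,price,number_of_reviews\nBrooklyn,149,9\nManhattan,225,45\n")

listings <- read_csv(
  csv_text,
  col_types = cols(
    neighbourhood_group = col_character(),
    price = col_double(),
    number_of_reviews = col_integer()  # assumption: counts are whole numbers
  )
)

listings
```

Supplying `col_types` silences the column specification message and makes the import fail loudly if a column ever arrives in an unexpected format.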

# Select 5 columns of interest and keep a short preview
airbnb_subset <- airbnb_raw %>%
  select(
    name,
    neighbourhood_group,
    room_type,
    price,
    number_of_reviews
  ) %>%
  slice_head(n = 7)  # take first 7 rows

# Show as a table
airbnb_subset

# A tibble: 7 × 5
  name                     neighbourhood_group room_type price number_of_reviews
  <chr>                    <chr>               <chr>     <dbl>             <dbl>
1 Clean & quiet apt home … Brooklyn            Private …   149                 9
2 Skylit Midtown Castle    Manhattan           Entire h…   225                45
3 THE VILLAGE OF HARLEM..… Manhattan           Private …   150                 0
4 Cozy Entire Floor of Br… Brooklyn            Entire h…    89               270
5 Entire Apt: Spacious St… Manhattan           Entire h…    80                 9
6 Large Cozy 1 BR Apartme… Manhattan           Entire h…   200                74
7 BlissArtsSpace!          Brooklyn            Private …    60                49