Project 2 - Airbnb

Author

BG

Source: Centeno (2019), Shared Economy Tax.

Airbnb Data

My project is about New York City Airbnb data. This dataset has information about Airbnb homes in New York City, like price, number of reviews, availability, and neighborhood. Some variables are numbers, like price and reviews, and some are categories, like room type and neighborhood.

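The data comes from Inside Airbnb, which shares public Airbnb information. I cleaned the data by fixing wrong formats (removing currency symbols from the price and service fee columns and standardizing the column names) and organized it by dropping the columns I did not need. Rows with missing values in the filtered columns are removed during filtering.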

I chose this topic because I am interested in travel and housing. Also, New York City is my favorite city, and my dream is to live and work there one day. That is why this dataset is very interesting and meaningful for me.

Load the libraries and set the working directory

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.3     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
setwd("/Users/bettyovalle/Desktop/College/007 – Spring 2026/DATA 110/week 11")
airbnbNYdata <- read_csv("Airbnb_Open_Data.csv")
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 102599 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (13): NAME, host_identity_verified, host name, neighbourhood group, neig...
dbl (11): id, host id, lat, long, Construction year, minimum nights, number ...
lgl  (2): instant_bookable, license

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Cleaning the Data

  1. Check column names
names(airbnbNYdata)
 [1] "id"                             "NAME"                          
 [3] "host id"                        "host_identity_verified"        
 [5] "host name"                      "neighbourhood group"           
 [7] "neighbourhood"                  "lat"                           
 [9] "long"                           "country"                       
[11] "country code"                   "instant_bookable"              
[13] "cancellation_policy"            "room type"                     
[15] "Construction year"              "price"                         
[17] "service fee"                    "minimum nights"                
[19] "number of reviews"              "last review"                   
[21] "reviews per month"              "review rate number"            
[23] "calculated host listings count" "availability 365"              
[25] "house_rules"                    "license"                       
  2. Clean currency symbols
airbnbNYdata$`service fee` <- parse_number(as.character(airbnbNYdata$`service fee`))

airbnbNYdata$price <- parse_number(as.character(airbnbNYdata$price))

head(airbnbNYdata[, c("price", "service fee")])
# A tibble: 6 × 2
  price `service fee`
  <dbl>         <dbl>
1   966           193
2   142            28
3   620           124
4   368            74
5   204            41
6   577           115
  3. Clean variable names
names(airbnbNYdata) <- tolower(names(airbnbNYdata))
names(airbnbNYdata) <- gsub(" ", "_", names(airbnbNYdata))
  4. Remove unnecessary columns
airbnbNYdata_unColumns <- airbnbNYdata |>
  select(-id,
         -name,
         -host_id,
         -host_name,
         -country_code,
         -instant_bookable,
         -cancellation_policy,
         -room_type,
         -construction_year,
         -last_review,
         -reviews_per_month,
         -calculated_host_listings_count,
         -availability_365,
         -house_rules,
         -license)

head(airbnbNYdata_unColumns)
# A tibble: 6 × 11
  host_identity_verified neighbourhood_group neighbourhood   lat  long country  
  <chr>                  <chr>               <chr>         <dbl> <dbl> <chr>    
1 unconfirmed            Brooklyn            Kensington     40.6 -74.0 United S…
2 verified               Manhattan           Midtown        40.8 -74.0 United S…
3 <NA>                   Manhattan           Harlem         40.8 -73.9 United S…
4 unconfirmed            Brooklyn            Clinton Hill   40.7 -74.0 United S…
5 verified               Manhattan           East Harlem    40.8 -73.9 United S…
6 verified               Manhattan           Murray Hill    40.7 -74.0 United S…
# ℹ 5 more variables: price <dbl>, service_fee <dbl>, minimum_nights <dbl>,
#   number_of_reviews <dbl>, review_rate_number <dbl>

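I removed irrelevant and administrative variables such as identifiers and booking settings to simplify the dataset. I kept the variables needed to filter and map the listings: host verification, neighbourhood, coordinates, price, service fee, minimum nights, number of reviews, and review rating.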
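As a small illustration of the cleaning steps above, the sketch below applies the same calls to a made-up three-row table. The `toy` data frame and its values are hypothetical; only `parse_number()`, `tolower()`, and `gsub()` are taken from the original code.

```r
library(readr)  # provides parse_number()

# Hypothetical raw data that mimics the original column style
toy <- data.frame(
  `host id` = c(1, 2, 3),
  price = c("$966 ", "$1,020 ", "$142 "),
  `service fee` = c("$193 ", "$204 ", "$28 "),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Strip currency symbols and thousands separators, leaving numbers
toy$price <- parse_number(toy$price)
toy$`service fee` <- parse_number(toy$`service fee`)

# Lowercase the names and replace spaces with underscores
names(toy) <- gsub(" ", "_", tolower(names(toy)))

str(toy)
```

Cleaning the currency columns before renaming is why the original code refers to `service fee` with backticks; after renaming, the same column is `service_fee`.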

Data filtering

Verified hosts only

verified_data <- airbnbNYdata_unColumns |>
  filter(host_identity_verified == "verified")

head(verified_data)
# A tibble: 6 × 11
  host_identity_verified neighbourhood_group neighbourhood     lat  long country
  <chr>                  <chr>               <chr>           <dbl> <dbl> <chr>  
1 verified               Manhattan           Midtown          40.8 -74.0 United…
2 verified               Manhattan           East Harlem      40.8 -73.9 United…
3 verified               Manhattan           Murray Hill      40.7 -74.0 United…
4 verified               Manhattan           Hell's Kitchen   40.8 -74.0 United…
5 verified               Manhattan           Chinatown        40.7 -74.0 United…
6 verified               Manhattan           Upper West Side  40.8 -74.0 United…
# ℹ 5 more variables: price <dbl>, service_fee <dbl>, minimum_nights <dbl>,
#   number_of_reviews <dbl>, review_rate_number <dbl>
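One detail worth noting about this step: `filter()` keeps only rows where the condition is TRUE, so listings with a missing verification status are dropped along with the unconfirmed ones. The sketch below shows this on a hypothetical three-row table (`toy_hosts` is made up for illustration).

```r
library(dplyr)

# Hypothetical hosts: one verified, one unconfirmed, one missing
toy_hosts <- data.frame(
  host_identity_verified = c("verified", "unconfirmed", NA),
  price = c(142, 966, 620)
)

# filter() keeps rows where the condition is TRUE; NA is not TRUE
verified_only <- toy_hosts |>
  filter(host_identity_verified == "verified")

# Count how many rows had a missing verification status
n_missing <- sum(is.na(toy_hosts$host_identity_verified))

nrow(verified_only)
n_missing
```

Running the same `sum(is.na(...))` check on the full dataset would show how many listings are excluded only because the status is missing.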

Manhattan

Filtering the dataset to include only Manhattan

manhattan_data <- verified_data |>
  filter(neighbourhood_group == "Manhattan")

head(manhattan_data)
# A tibble: 6 × 11
  host_identity_verified neighbourhood_group neighbourhood     lat  long country
  <chr>                  <chr>               <chr>           <dbl> <dbl> <chr>  
1 verified               Manhattan           Midtown          40.8 -74.0 United…
2 verified               Manhattan           East Harlem      40.8 -73.9 United…
3 verified               Manhattan           Murray Hill      40.7 -74.0 United…
4 verified               Manhattan           Hell's Kitchen   40.8 -74.0 United…
5 verified               Manhattan           Chinatown        40.7 -74.0 United…
6 verified               Manhattan           Upper West Side  40.8 -74.0 United…
# ℹ 5 more variables: price <dbl>, service_fee <dbl>, minimum_nights <dbl>,
#   number_of_reviews <dbl>, review_rate_number <dbl>

Filtering by price

Below 500 USD

price_under_500 <- manhattan_data |>
  filter(price < 500)

head(price_under_500)
# A tibble: 6 × 11
  host_identity_verified neighbourhood_group neighbourhood     lat  long country
  <chr>                  <chr>               <chr>           <dbl> <dbl> <chr>  
1 verified               Manhattan           Midtown          40.8 -74.0 United…
2 verified               Manhattan           East Harlem      40.8 -73.9 United…
3 verified               Manhattan           Chinatown        40.7 -74.0 United…
4 verified               Manhattan           Upper West Side  40.8 -74.0 United…
5 verified               Manhattan           East Harlem      40.8 -73.9 United…
6 verified               Manhattan           Inwood           40.9 -73.9 United…
# ℹ 5 more variables: price <dbl>, service_fee <dbl>, minimum_nights <dbl>,
#   number_of_reviews <dbl>, review_rate_number <dbl>

Filtering by minimum nights

oneNight <- price_under_500 |>
  filter(minimum_nights == 1)

head(oneNight)
# A tibble: 6 × 11
  host_identity_verified neighbourhood_group neighbourhood     lat  long country
  <chr>                  <chr>               <chr>           <dbl> <dbl> <chr>  
1 verified               Manhattan           Chinatown        40.7 -74.0 United…
2 verified               Manhattan           Lower East Side  40.7 -74.0 United…
3 verified               Manhattan           Harlem           40.8 -73.9 United…
4 verified               Manhattan           Harlem           40.8 -73.9 United…
5 verified               Manhattan           Harlem           40.8 -74.0 United…
6 verified               Manhattan           East Village     40.7 -74.0 United…
# ℹ 5 more variables: price <dbl>, service_fee <dbl>, minimum_nights <dbl>,
#   number_of_reviews <dbl>, review_rate_number <dbl>

Review rating

Filtering for review ratings of 4 or higher

high_reviews <- oneNight |>
  filter(review_rate_number >= 4)

head(high_reviews)
# A tibble: 6 × 11
  host_identity_verified neighbourhood_group neighbourhood     lat  long country
  <chr>                  <chr>               <chr>           <dbl> <dbl> <chr>  
1 verified               Manhattan           Harlem           40.8 -73.9 United…
2 verified               Manhattan           Lower East Side  40.7 -74.0 United…
3 verified               Manhattan           Financial Dist…  40.7 -74.0 United…
4 verified               Manhattan           Lower East Side  40.7 -74.0 United…
5 verified               Manhattan           East Village     40.7 -74.0 United…
6 verified               Manhattan           West Village     40.7 -74.0 United…
# ℹ 5 more variables: price <dbl>, service_fee <dbl>, minimum_nights <dbl>,
#   number_of_reviews <dbl>, review_rate_number <dbl>

Number of reviews

Selecting Airbnbs that have more than 50 reviews.

cleanData <- high_reviews |>
  filter(number_of_reviews > 50)

head(cleanData)
# A tibble: 6 × 11
  host_identity_verified neighbourhood_group neighbourhood     lat  long country
  <chr>                  <chr>               <chr>           <dbl> <dbl> <chr>  
1 verified               Manhattan           Harlem           40.8 -73.9 United…
2 verified               Manhattan           Lower East Side  40.7 -74.0 United…
3 verified               Manhattan           East Village     40.7 -74.0 United…
4 verified               Manhattan           West Village     40.7 -74.0 United…
5 verified               Manhattan           East Harlem      40.8 -73.9 United…
6 verified               Manhattan           West Village     40.7 -74.0 United…
# ℹ 5 more variables: price <dbl>, service_fee <dbl>, minimum_nights <dbl>,
#   number_of_reviews <dbl>, review_rate_number <dbl>
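The step-by-step filters above could also be written as a single `filter()` call, since conditions separated by commas are combined with AND. The sketch below compares both versions on a hypothetical four-row table (`toy_listings` and its values are made up; the column names and thresholds match the original code).

```r
library(dplyr)

# Hypothetical listings with the same column names as the cleaned data
toy_listings <- data.frame(
  host_identity_verified = c("verified", "verified", "unconfirmed", "verified"),
  neighbourhood_group = c("Manhattan", "Brooklyn", "Manhattan", "Manhattan"),
  price = c(204, 150, 300, 620),
  minimum_nights = c(1, 1, 1, 1),
  review_rate_number = c(5, 4, 5, 4),
  number_of_reviews = c(74, 120, 90, 60)
)

# Step-by-step version, as in the sections above
stepwise <- toy_listings |>
  filter(host_identity_verified == "verified") |>
  filter(neighbourhood_group == "Manhattan") |>
  filter(price < 500) |>
  filter(minimum_nights == 1) |>
  filter(review_rate_number >= 4) |>
  filter(number_of_reviews > 50)

# Single call: conditions separated by commas are combined with AND
combined <- toy_listings |>
  filter(
    host_identity_verified == "verified",
    neighbourhood_group == "Manhattan",
    price < 500,
    minimum_nights == 1,
    review_rate_number >= 4,
    number_of_reviews > 50
  )

nrow(stepwise)
nrow(combined)
```

Keeping the intermediate objects, as the original does, makes it easy to check with `nrow()` how many listings each condition removes.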

Budget Airbnb NYC

This analysis focuses on Airbnbs in New York City for people who want to visit the city on a budget. The goal is to filter and analyze affordable and highly rated listings in Manhattan, so travelers can find safe, well-reviewed, and reasonably priced places to stay. This helps identify good options for visitors who want to enjoy New York City without spending too much money.

Load library

library(ggplot2)
library(plotly)

Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':

    last_plot
The following object is masked from 'package:stats':

    filter
The following object is masked from 'package:graphics':

    layout

See the available data

names(cleanData)
 [1] "host_identity_verified" "neighbourhood_group"    "neighbourhood"         
 [4] "lat"                    "long"                   "country"               
 [7] "price"                  "service_fee"            "minimum_nights"        
[10] "number_of_reviews"      "review_rate_number"    

Boxplot

Price differences between highly rated Airbnb properties

plot1 <- ggplot(cleanData,
                aes(x = factor(review_rate_number),
                    y = price,
                    fill = factor(review_rate_number))) +
  geom_boxplot() +
  labs(title = "Comparison of Price by Review Rating in Manhattan",
       x = "Review Rate Number (4 and 5)",
       y = "Price (USD)",
       fill = "Rating") +
  scale_fill_brewer(palette = "BuPu") +
  theme_minimal()

ggplotly(plot1)

The boxplot suggests a slight negative relationship between price and review rating: listings rated 5 appear to cost slightly less than listings rated 4, so a higher price does not guarantee a higher rating.
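A reading taken from a boxplot can be checked numerically by comparing the median price in each rating group. The sketch below does this with base R's `aggregate()` on hypothetical prices (`toy_ratings` is made up and does not reproduce the project's results); the same call could be pointed at `cleanData`.

```r
# Hypothetical prices for listings rated 4 and 5
toy_ratings <- data.frame(
  review_rate_number = c(4, 4, 4, 5, 5, 5),
  price = c(120, 260, 410, 100, 240, 390)
)

# Median price within each rating group
price_by_rating <- aggregate(
  price ~ review_rate_number,
  data = toy_ratings,
  FUN = median
)

price_by_rating
```

The median is used here because it is the center line that a boxplot draws, so the table and the plot describe the same quantity.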

Map

  1. Load library
library(leaflet)
  2. Pick colors and build the popup text
pal <- colorFactor(
  palette = c("darkorchid1", "turquoise"),
  domain = cleanData$review_rate_number)

popup_airbnb <- paste0(
  "<b>Price: </b>$", cleanData$price, "<br>",
  "<b>Rating: </b>", cleanData$review_rate_number, "<br>",
  "<b>Reviews: </b>", cleanData$number_of_reviews, "<br>",
  "<b>Neighborhood: </b>", cleanData$neighbourhood)
  3. Build the map
leaflet(cleanData) |>
  setView(lng = -73.9855, lat = 40.7580, zoom = 11) |>
  addProviderTiles("CartoDB.Positron") |>
  addCircles(
    lng = ~long,
    lat = ~lat,
    radius = ~price / 5,
    color = ~pal(review_rate_number),
    fillColor = ~pal(review_rate_number),
    fillOpacity = 0.5,
    popup = ~popup_airbnb)

The map shows Airbnb locations in Manhattan, with circle size based on price and circle color based on rating.
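Because the circle colors encode the rating, a legend can make the map easier to read. The sketch below is one possible way to add it with leaflet's `addLegend()`, using a hypothetical two-row table (`toy_map` and its coordinates are made up; the palette and radius rule follow the original code).

```r
library(leaflet)

# Hypothetical listings; coordinates are rough points in Manhattan
toy_map <- data.frame(
  long = c(-73.99, -73.95),
  lat = c(40.72, 40.80),
  price = c(204, 142),
  review_rate_number = c(4, 5)
)

# Same palette setup as in the original map
pal <- colorFactor(
  palette = c("darkorchid1", "turquoise"),
  domain = toy_map$review_rate_number
)

map_with_legend <- leaflet(toy_map) |>
  addProviderTiles("CartoDB.Positron") |>
  addCircles(
    lng = ~long,
    lat = ~lat,
    radius = ~price / 5,  # addCircles() radius is in meters
    color = ~pal(review_rate_number)
  ) |>
  # Legend that maps each color to its rating
  addLegend(
    position = "bottomright",
    pal = pal,
    values = ~review_rate_number,
    title = "Rating"
  )
```

Printing `map_with_legend` displays the map. Since the radius is in meters, dividing price by 5 means a listing priced just under 500 USD is drawn with a radius of just under 100 meters.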

Summary

This project shows Airbnb data in Manhattan for people who want to visit New York City on a budget. The dataset includes information about price, ratings, number of reviews, and location. I cleaned the data by removing unnecessary variables, fixing formats, and filtering step by step to focus on verified, highly rated, and affordable rooms and apartments.

The boxplot shows how price varies between listings rated 4 and 5, while the map shows where the Airbnb properties are located in the city, along with their price, rating, and number of reviews. One interesting result is that more expensive Airbnbs do not always have higher ratings. Also, highly reviewed places are more common in certain areas of Manhattan.

I had some difficulties working with the dataset because it was large and had many variables. I had to clean and filter the data step by step (in chunks) because it took too long to process everything at once. This made the process slower, but it helped me understand the data better and build the final dataset correctly.

In conclusion, this project helped me understand Airbnb pricing and quality patterns in New York City and identify good budget-friendly options for travelers.

Works Cited

Inside Airbnb. (n.d.). New York City Airbnb open data. http://insideairbnb.com/explore/

Azmoudeh, A. (n.d.). Airbnb open data. Kaggle. https://www.kaggle.com/datasets/arianazmoudeh/airbnbopendata

Brown, K. W. (n.d.). Colors in R. RPubs. https://rpubs.com/kylewbrown/r-colors