My project is about New York City Airbnb data. This dataset has information about Airbnb homes in New York City, like price, number of reviews, availability, and neighborhood. Some variables are numbers, like price and reviews, and some are categories, like room type and neighborhood.
The data comes from Inside Airbnb, which shares public Airbnb information. I cleaned the data by removing missing values and fixing incorrectly formatted columns. I also reorganized the data to make it easier to understand and use for this project.
I chose this topic because I am interested in travel and housing. Also, New York City is my favorite city, and my dream is to live and work there one day. That is why this dataset is very interesting and meaningful for me.
Load the libraries and set the working directory
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.1 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.3 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
setwd("/Users/bettyovalle/Desktop/College/007 – Spring 2026/DATA 110/week 11")
airbnbNYdata <- read_csv("Airbnb_Open_Data.csv")
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Rows: 102599 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (13): NAME, host_identity_verified, host name, neighbourhood group, neig...
dbl (11): id, host id, lat, long, Construction year, minimum nights, number ...
lgl (2): instant_bookable, license
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
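The parsing warning above is worth checking before any cleaning. The following is a minimal sketch of how `problems()` can be used, run on a tiny inline CSV instead of Airbnb_Open_Data.csv. The column names and values in `toy_csv` are made up for illustration and do not come from the dataset.

```r
library(readr)

# Hypothetical inline CSV standing in for the real file; the third data row
# has text where a number is expected.
toy_csv <- I("id,price\n1,100\n2,250\n3,not_a_number\n")

# Force `price` to be read as a number so the bad value triggers a parsing issue.
toy <- suppressWarnings(
  read_csv(toy_csv, col_types = cols(id = col_double(), price = col_double()))
)

# One row per parsing failure: where it happened, what was expected, what was found.
issues <- problems(toy)
print(issues)
```

Calling `problems(airbnbNYdata)` in the same way would list the affected rows and columns of the real file, which helps decide whether a column needs an explicit type.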
# A tibble: 6 × 11
host_identity_verified neighbourhood_group neighbourhood lat long country
<chr> <chr> <chr> <dbl> <dbl> <chr>
1 unconfirmed Brooklyn Kensington 40.6 -74.0 United S…
2 verified Manhattan Midtown 40.8 -74.0 United S…
3 <NA> Manhattan Harlem 40.8 -73.9 United S…
4 unconfirmed Brooklyn Clinton Hill 40.7 -74.0 United S…
5 verified Manhattan East Harlem 40.8 -73.9 United S…
6 verified Manhattan Murray Hill 40.7 -74.0 United S…
# ℹ 5 more variables: price <dbl>, service_fee <dbl>, minimum_nights <dbl>,
# number_of_reviews <dbl>, review_rate_number <dbl>
I removed irrelevant and administrative variables, such as identifiers and booking settings, to simplify the dataset. I then filtered the rows to focus my analysis only on listings in Manhattan.
# A tibble: 6 × 11
host_identity_verified neighbourhood_group neighbourhood lat long country
<chr> <chr> <chr> <dbl> <dbl> <chr>
1 verified Manhattan Harlem 40.8 -73.9 United…
2 verified Manhattan Lower East Side 40.7 -74.0 United…
3 verified Manhattan East Village 40.7 -74.0 United…
4 verified Manhattan West Village 40.7 -74.0 United…
5 verified Manhattan East Harlem 40.8 -73.9 United…
6 verified Manhattan West Village 40.7 -74.0 United…
# ℹ 5 more variables: price <dbl>, service_fee <dbl>, minimum_nights <dbl>,
# number_of_reviews <dbl>, review_rate_number <dbl>
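The cleaning code itself is not shown above, so here is a minimal sketch of how the steps described (tidying column names, fixing the price format, dropping administrative columns, and keeping verified Manhattan listings) might look. The `raw` tibble is toy data, and the assumption that prices arrive as text such as "$1,060" is mine, not something the output above confirms.

```r
library(dplyr)
library(readr)

# Toy rows standing in for the raw file; two names keep spaces like the original headers.
raw <- tibble(
  id = c(1, 2, 3, 4),
  host_identity_verified = c("verified", "unconfirmed", "verified", NA),
  `neighbourhood group` = c("Manhattan", "Brooklyn", "Manhattan", "Manhattan"),
  neighbourhood = c("Harlem", "Kensington", "East Village", "Midtown"),
  price = c("$620 ", "$966 ", "$1,060 ", "$142 "),  # assumed text format
  `review rate number` = c(5, 4, 3, 4)
)

manhattan <- raw %>%
  # Make column names easier to type: lower case, spaces to underscores.
  rename_with(~ gsub(" ", "_", tolower(.x))) %>%
  # Turn price strings such as "$1,060 " into numbers.
  mutate(price = parse_number(price)) %>%
  # Drop administrative columns (only `id` exists in this toy data).
  select(-id) %>%
  # Keep verified hosts in Manhattan; rows with NA in either column drop out.
  filter(neighbourhood_group == "Manhattan",
         host_identity_verified == "verified")

print(manhattan)
```

Note that `filter()` silently drops rows where the condition is `NA`, so listings with a missing verification status are removed along with the unconfirmed ones.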
Budget Airbnb NYC
This analysis focuses on Airbnbs in New York City for people who want to visit the city on a budget. The goal is to filter and analyze affordable and highly rated listings in Manhattan, so travelers can find safe, well-reviewed, and reasonably priced places to stay. This helps identify good options for visitors who want to enjoy New York City without spending too much money.
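A budget filter of this kind could be sketched as follows. The `listings` tibble is toy data, and `max_price` is a hypothetical budget cap chosen only for illustration; the price limit actually used in this project is not stated.

```r
library(dplyr)

# Toy listings using the same column names as the cleaned data.
listings <- tibble(
  neighbourhood = c("Harlem", "West Village", "East Village", "Midtown"),
  price = c(120, 450, 95, 180),
  review_rate_number = c(5, 4, 3, 4)
)

max_price <- 200  # hypothetical budget cap in USD, not the author's value

budget <- listings %>%
  # Keep listings rated 4 or 5 that cost no more than the cap.
  filter(review_rate_number >= 4, price <= max_price) %>%
  # Cheapest options first.
  arrange(price)

print(budget)
```

Keeping the cap in a named variable makes it easy to rerun the analysis for a different budget.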
Load library
library(ggplot2)
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
Price changes between highly rated Airbnb properties
plot1 <- ggplot(cleanData, aes(x = factor(review_rate_number),
                               y = price,
                               fill = factor(review_rate_number))) +
  geom_boxplot() +
  labs(title = "Comparison of Price by Review Rating in Manhattan",
       x = "Review Rate Number (4 and 5)",
       y = "Price (USD)",
       fill = "Rating") +
  scale_fill_brewer(palette = "BuPu") +
  theme_minimal()
ggplotly(plot1)
The visualization suggests a slight negative relationship between price and review ratings, meaning that higher-priced Airbnbs do not always have higher ratings.
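One way to back up what the boxplot shows is to compare summary statistics per rating. The sketch below uses invented prices in a toy tibble called `rated`, so its numbers do not reproduce this project's results; with the real data, `cleanData` would take its place.

```r
library(dplyr)

# Toy data: three listings rated 4 and three rated 5 (prices are invented).
rated <- tibble(
  review_rate_number = c(4, 4, 4, 5, 5, 5),
  price = c(100, 150, 200, 90, 140, 160)
)

# Count, median and mean price for each rating level.
price_by_rating <- rated %>%
  group_by(review_rate_number) %>%
  summarise(
    n = n(),
    median_price = median(price),
    mean_price = mean(price)
  )

# Spearman rank correlation between price and rating; a negative value means
# higher-rated listings tend to be cheaper in this sample.
rho <- cor(rated$price, rated$review_rate_number, method = "spearman")

print(price_by_rating)
print(rho)
```

With only two rating levels, the correlation is a rough indicator at best, so the group medians are the more readable comparison.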
The map shows Airbnb locations in Manhattan, with point size based on price and color based on rating.
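The code for the map is not shown, so the following is only a static stand-in built with ggplot2, not the map used in this project. It plots toy coordinates (invented values in `toy_map`) and maps point size to price and color to rating, which is the encoding described above.

```r
library(ggplot2)

# Toy coordinates roughly inside Manhattan; all values are invented.
toy_map <- data.frame(
  long = c(-73.94, -73.99, -73.98, -74.00),
  lat = c(40.81, 40.72, 40.73, 40.73),
  price = c(120, 180, 95, 150),
  review_rate_number = c(5, 4, 4, 5)
)

map_plot <- ggplot(toy_map, aes(x = long, y = lat)) +
  # Point size encodes price; color encodes the rating as a category.
  geom_point(aes(size = price, color = factor(review_rate_number)), alpha = 0.7) +
  # Approximate map projection so distances are not stretched.
  coord_quickmap() +
  labs(title = "Airbnb Listings in Manhattan (toy data)",
       x = "Longitude",
       y = "Latitude",
       size = "Price (USD)",
       color = "Rating") +
  theme_minimal()

print(map_plot)
```

Treating the rating as a factor gives each rating its own distinct color instead of a continuous gradient, which suits a variable with only two values.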
Summary
This project shows Airbnb data in Manhattan for people who want to visit New York City on a budget. The dataset includes information about price, ratings, number of reviews, and location. I cleaned the data by removing unnecessary variables, fixing formats, and filtering step by step to focus on verified, highly rated, and affordable rooms and apartments.
The plot helps show the relationship between price, ratings, and number of reviews, while the map shows where the Airbnb properties are located in the city. One interesting result is that more expensive Airbnbs do not always have higher ratings. Also, highly reviewed places are more common in certain areas of Manhattan.
I had some difficulties working with the dataset because it was large and had many variables. I had to clean and filter the data step by step (in chunks) because it took too long to process everything at once. This made the process slower, but it helped me understand the data better and build the final dataset correctly.
In conclusion, this project helped me understand Airbnb pricing and quality patterns in New York City and identify good budget-friendly options for travelers.