library(tidyverse)
library(tidymodels)
library(infer)
library(skimr)
library(tidyr)
DATA 110 Final Project
**INTRODUCTION**
Over the past couple of decades, Airbnb has established itself as a reliable accommodation option in major cities around the world. A quick Google search of “airbnb_ny” (airbnb_ny19 - Google Search) reveals that this alternative lodging model has played a significant role in New York City’s tourism industry, bringing visitors and supporting local businesses across all five boroughs. In fact, spending by Airbnb guests in 2019 made a substantial contribution to the broader tourism economy, particularly in neighborhoods outside of Manhattan. According to the same source, Airbnb has also raised annual income for property owners and contributed to both national and local tax revenues.
For my final project, I have chosen to explore a dataset titled airbnb_ny19, built from Airbnb listings. This dataset offers valuable insights into the dynamics of the Airbnb market in New York City prior to the implementation of stricter regulatory policies. According to my Google search, Cornell University reports that “the data was primarily collected through web scraping of publicly available information on the Airbnb website. This means that scripts were used to visit listing pages and extract relevant details”. The airbnb_ny19 dataset captures listing activity and relevant metrics for New York City in 2019. It contains 48,895 listings (observations) and the following 16 variables:
id
host_id
host_name
latitude
price
last_review
availability_365
name
neighbourhood_group
neighbourhood
longitude
minimum_nights
reviews_per_month
room_type
number_of_reviews
calculated_host_listings_count
In my project, the overarching question I would like to explore is whether there is any relationship between price and room type. Because price is a numeric variable, I will fit a simple linear regression model with price as the outcome. Because room type is a categorical variable, I will use it as a factor. This data set and type of exploration interest me because I would like to create an Airbnb business in Burkina Faso as part of some family projects.
Loading the libraries and setting the working directory
Load data set
setwd("~/Telesphore/Personnel/Etudes/Montgomery_College/Data_Sciences_Certificate_program/Data_110/Week5")
airbnb_ny19 <- read_csv("airbnb_ny19.csv")
Rows: 48895 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): name, host_name, neighbourhood_group, neighbourhood, room_type, la...
dbl (10): id, host_id, latitude, longitude, price, minimum_nights, number_of...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
**Data cleaning**
# 1. quick look at the data
glimpse(airbnb_ny19)
Rows: 48,895
Columns: 16
$ id <dbl> 2539, 2595, 3647, 3831, 5022, 5099, 512…
$ name <chr> "Clean & quiet apt home by the park", "…
$ host_id <dbl> 2787, 2845, 4632, 4869, 7192, 7322, 735…
$ host_name <chr> "John", "Jennifer", "Elisabeth", "LisaR…
$ neighbourhood_group <chr> "Brooklyn", "Manhattan", "Manhattan", "…
$ neighbourhood <chr> "Kensington", "Midtown", "Harlem", "Cli…
$ latitude <dbl> 40.64749, 40.75362, 40.80902, 40.68514,…
$ longitude <dbl> -73.97237, -73.98377, -73.94190, -73.95…
$ room_type <chr> "Private room", "Entire home/apt", "Pri…
$ price <dbl> 149, 225, 150, 89, 80, 200, 60, 79, 79,…
$ minimum_nights <dbl> 1, 1, 3, 1, 10, 3, 45, 2, 2, 1, 5, 2, 4…
$ number_of_reviews <dbl> 9, 45, 0, 270, 9, 74, 49, 430, 118, 160…
$ last_review <chr> "10/19/2018", "5/21/2019", NA, "7/5/201…
$ reviews_per_month <dbl> 0.21, 0.38, NA, 4.64, 0.10, 0.59, 0.40,…
$ calculated_host_listings_count <dbl> 6, 2, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 3, …
$ availability_365 <dbl> 365, 355, 365, 194, 0, 129, 0, 220, 0, …
# 2. Looking at the variable names
names(airbnb_ny19)
 [1] "id" "name"
[3] "host_id" "host_name"
[5] "neighbourhood_group" "neighbourhood"
[7] "latitude" "longitude"
[9] "room_type" "price"
[11] "minimum_nights" "number_of_reviews"
[13] "last_review" "reviews_per_month"
[15] "calculated_host_listings_count" "availability_365"
# 3. Make all headers lowercase and remove spaces
names(airbnb_ny19) <- tolower(names(airbnb_ny19))
names(airbnb_ny19) <- gsub(" ","",names(airbnb_ny19))
head(airbnb_ny19)
# A tibble: 6 × 16
id name host_id host_name neighbourhood_group neighbourhood latitude
<dbl> <chr> <dbl> <chr> <chr> <chr> <dbl>
1 2539 Clean & qu… 2787 John Brooklyn Kensington 40.6
2 2595 Skylit Mid… 2845 Jennifer Manhattan Midtown 40.8
3 3647 THE VILLAG… 4632 Elisabeth Manhattan Harlem 40.8
4 3831 Cozy Entir… 4869 LisaRoxa… Brooklyn Clinton Hill 40.7
5 5022 Entire Apt… 7192 Laura Manhattan East Harlem 40.8
6 5099 Large Cozy… 7322 Chris Manhattan Murray Hill 40.7
# ℹ 9 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
# minimum_nights <dbl>, number_of_reviews <dbl>, last_review <chr>,
# reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
# availability_365 <dbl>
# filtering to focus the data set only on Manhattan
airbnb_ny19_clean <- airbnb_ny19 |>
filter(neighbourhood_group == "Manhattan")
# checking for missing or non-finite values
summary(airbnb_ny19$price)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.0    69.0   106.0   152.7   175.0 10000.0
summary(airbnb_ny19$reviews_per_month)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  0.010   0.190   0.720   1.373   2.020  58.500   10052
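The missing-value check can also be run on every column at once instead of one `summary()` call per variable. The snippet below is a minimal sketch in base R; the `toy` data frame and its values are made up for illustration and only stand in for a few columns of airbnb_ny19.

```r
# Toy data frame standing in for a few airbnb_ny19 columns (values are illustrative)
toy <- data.frame(
  price = c(149, 225, 150),
  reviews_per_month = c(0.21, 0.38, NA),
  last_review = c("10/19/2018", "5/21/2019", NA)
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Applied to the full data set, the same call would be `colSums(is.na(airbnb_ny19))`, which shows which variables need attention before filtering.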
# removing NAs
airbnb_ny19_clean2 <- airbnb_ny19_clean |>
filter(!is.na(reviews_per_month))
head(airbnb_ny19_clean2)
# A tibble: 6 × 16
id name host_id host_name neighbourhood_group neighbourhood latitude
<dbl> <chr> <dbl> <chr> <chr> <chr> <dbl>
1 2595 Skylit Mid… 2845 Jennifer Manhattan Midtown 40.8
2 5022 Entire Apt… 7192 Laura Manhattan East Harlem 40.8
3 5099 Large Cozy… 7322 Chris Manhattan Murray Hill 40.7
4 5178 Large Furn… 8967 Shunichi Manhattan Hell's Kitch… 40.8
5 5203 Cozy Clean… 7490 MaryEllen Manhattan Upper West S… 40.8
6 5238 Cute & Coz… 7549 Ben Manhattan Chinatown 40.7
# ℹ 9 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
# minimum_nights <dbl>, number_of_reviews <dbl>, last_review <chr>,
# reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
# availability_365 <dbl>
# filtering only for listings with no more than 3 reviews in total (number_of_reviews <= 3)
airbnb_ny19_reviews <- airbnb_ny19_clean2 |>
filter(number_of_reviews <= 3)
head(airbnb_ny19_reviews)
# A tibble: 6 × 16
id name host_id host_name neighbourhood_group neighbourhood latitude
<dbl> <chr> <dbl> <chr> <chr> <chr> <dbl>
1 20300 Great Loca… 76627 Pas Manhattan East Village 40.7
2 21644 Upper Manh… 82685 Elliott Manhattan Harlem 40.8
3 41513 Convenient… 181167 Lorenzo Manhattan Harlem 40.8
4 47370 Chelsea St… 214287 Alex Manhattan Chelsea 40.7
5 60673 Private Ro… 249372 Cynthia Manhattan Harlem 40.8
6 64707 Amazing S… 7310 Tilly Manhattan Little Italy 40.7
# ℹ 9 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
# minimum_nights <dbl>, number_of_reviews <dbl>, last_review <chr>,
# reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
# availability_365 <dbl>
For each room type, calculate the average reviews per month and the number of listings
airbnb_ny19_reviews |>
group_by(room_type) |>
summarize(avg_reviews = mean(reviews_per_month, na.rm = TRUE),
number_of_listing = n()
)
# A tibble: 3 × 3
room_type avg_reviews number_of_listing
<chr> <dbl> <int>
1 Entire home/apt 0.414 3101
2 Private room 0.343 1905
3 Shared room 0.545 88
Exploratory Data Analysis (EDA)
First Visualization: Distribution of Reviews Per Month by type of Room
# 1.Checking distribution of reviews per month
ggplot(data = airbnb_ny19_reviews, aes(x = reviews_per_month, fill = room_type)) +
geom_density(adjust = 1.5, color = "black") +
labs(title = "Distribution of Reviews Per Month by type of Room",
x = "reviews_per_month",
y = "Density") + # geom_density() draws a density curve, not a count
theme_minimal()
# Descriptive paragraph: Among these Manhattan listings, regardless of room type, most receive very few reviews per month; the peak near 0 for all room types confirms this. However, the shared-room curve (in blue) stretches further along the x-axis and shows a relatively wider spread. This may suggest that shared rooms are booked more often, probably because of their lower cost. It might also mean that leaving reviews was not typical behavior for guests who rented an Airbnb in Manhattan in 2019.
**Statistical Test to find out if there is any association between Prices and Room Type**
Given that the distribution of the outcome variable (price) is strongly skewed to the right, I am going to log-transform it before fitting my simple linear model. Because the sample is large, the basic conditions for a simple linear regression are reasonably met when the independent variable (room_type) is used as a factor.
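To make the idea of using room type as a factor concrete, here is a minimal sketch in base R. The `toy` data frame and its prices are made up for illustration and are not taken from airbnb_ny19. It shows that a three-level factor enters `lm()` as two indicator columns, with the first level as the reference, and that the fitted coefficients are the reference group's mean log(price) and the differences from it.

```r
# Made-up listings: prices and room types are illustrative only
toy <- data.frame(
  price = c(200, 150, 180, 80, 90, 70, 40, 50),
  room_type = factor(c(rep("Entire home/apt", 3),
                       rep("Private room", 3),
                       rep("Shared room", 2)))
)
toy$log_price <- log(toy$price)

# A factor is expanded into indicator (dummy) columns; the first level is the reference
design <- model.matrix(~ room_type, data = toy)
print(colnames(design))

# With one factor predictor, the intercept is the reference group's mean
# and each other coefficient is a difference in group means
fit_toy <- lm(log_price ~ room_type, data = toy)
group_means <- tapply(toy$log_price, toy$room_type, mean)
print(coef(fit_toy))
print(group_means)
```

Comparing the two printed results shows that the intercept matches the mean log(price) of "Entire home/apt", and that adding a coefficient to the intercept recovers the mean of the corresponding room type.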
Scatter Plot
# Step 1: log-transform price
airbnb_ny19_reviews2 <- airbnb_ny19_reviews |>
filter(price > 0 & price < 800) |>
mutate(log_price = log(price))
# Step 2: Scatter plot of log(price) by room_type
ggplot(airbnb_ny19_reviews2, aes(x=room_type, y=log_price))+
geom_point() +
theme_bw()+
labs(x="room_type", y="log_price",
title = "Scatterplot of Room Type to logprice",
caption = "Airbnb")
# Entire home/apt has the highest prices overall, Private room has lower prices, and Shared room has the lowest prices. Room type clearly seems to affect prices. This result supports fitting a linear model of the form y ~ x. Because room_type is a factor with three levels, the equation is: log(price) = β₀ + β₁(Private room) + β₂(Shared room) + ε, where 'Entire home/apt' is the reference level, with:
# Ho = There is no difference in the average log(price) of listings across room types
# Ha = There is at least one room type whose average log(price) is different from the others.
# Performing the simple linear model
fit1 <- lm(data = airbnb_ny19_reviews2, log_price ~ room_type)
summary(fit1)
Call:
lm(formula = log_price ~ room_type, data = airbnb_ny19_reviews2)
Residuals:
Min 1Q Median 3Q Max
-2.45611 -0.31433 -0.03574 0.29277 2.19465
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.228694 0.009049 577.82 <2e-16 ***
room_typePrivate room -0.739981 0.014626 -50.59 <2e-16 ***
room_typeShared room -1.028051 0.054346 -18.92 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4998 on 5027 degrees of freedom
Multiple R-squared: 0.3536, Adjusted R-squared: 0.3533
F-statistic: 1375 on 2 and 5027 DF, p-value: < 2.2e-16
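Because the model was fitted on log(price), the coefficients in the output above are on the log scale. A minimal sketch of how they can be translated back to dollars: the three numbers are copied from the `summary(fit1)` output, while the variable names are my own. Exponentiating the intercept gives the geometric mean price of the reference group, and exponentiating a room-type coefficient gives a price ratio relative to that group.

```r
# Coefficients copied from the summary(fit1) output
b0 <- 5.228694          # intercept: mean log(price) for Entire home/apt
b_private <- -0.739981  # Private room vs. Entire home/apt
b_shared <- -1.028051   # Shared room vs. Entire home/apt

# exp() of the intercept: geometric mean price of the reference group
geo_mean_entire <- exp(b0)

# exp() of a coefficient: price ratio relative to the reference group
ratio_private <- exp(b_private)
ratio_shared <- exp(b_shared)

print(round(c(entire_price = geo_mean_entire,
              private_ratio = ratio_private,
              shared_ratio = ratio_shared), 3))
```

The ratios come out at roughly 0.48 and 0.36, so in this sample the typical (geometric mean) price of a private room is about half that of an entire home/apt, and a shared room costs a little over a third.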
# - The fitted equation is: log_price = 5.228694 - 0.739981(Private room) - 1.028051(Shared room), where each room-type term equals 1 for that room type and 0 otherwise.
# - The intercept of 5.228694 represents the expected log(price) for listings categorized as 'Entire home/apt', which is the reference room type.
# - The p-values for the room_type coefficients are very small (<2e-16 ***), as is the p-value of the overall F-test (< 2.2e-16). The adjusted R-squared of 0.3533 means that approximately 35.3% of the variability in log(price) is explained by room type.
# - Conclusion: We reject the null hypothesis and conclude that room type is an important predictor of log(price).
BACKGROUND SEARCH
Although research and scientific publications on the New York 2019 Airbnb data set are scarce, there are many analyses, reports, blog posts, and GitHub and RPubs publications that apply various types of analysis to it. In a blog post, Raymond Atta-Fynn and Charlie Zien (2019) reported that they “explored and modeled Airbnb listings data in NYC from August 2018 to August 2019”. They found that “About 80% of the listings are apartments, with an average nightly price of $180”.
Raymond Atta-Fynn and Charlie Zien (2019) also report that “a machine learning model of listing price per month based on 50 features indicated that tree based models, namely random forest regression, gradient boosting regression, and extreme gradient boosting regression, explain the price variation in the training data set quite well (as measured by the coefficient variation R2)”.
**Creating a Map of Airbnb in Manhattan: 2nd Visualization Per Assignment**
library(leaflet)
# Step 1: Create a color palette for room type
room_palette <- colorFactor(
palette = c("darkorange", "steelblue", "forestgreen"),
domain = airbnb_ny19_reviews$room_type
)
## The code above was created by asking ChatGPT: "How do i add 3 different colors to my map"
# step 2 building an interactive map
leaflet(data = airbnb_ny19_reviews) |>
addProviderTiles("Esri.WorldStreetMap") |>
addCircleMarkers(
~longitude, ~latitude,
radius = 7,
color = ~room_palette(room_type),
stroke = FALSE,
fillOpacity = 0.5,
popup = ~paste(
"<strong> Room Type:</strong>", room_type,
"<br><strong>Price:</strong>$", price
)
) |>
# Add a legend
addLegend(
"bottomright",
pal = room_palette,
values = ~room_type,
title = "Room Type",
opacity = 1
) |>
## The code above was created by asking ChatGPT: "How do i add a legend explaining colors to my map"
# Add title using HTML overlay (annotation)
addControl(
html = "<h3 style='color:#2c3e50;'>Airbnb Listings in Manhattan (2019)</h3>",
position = "topright"
) |>
# Add a fixed annotation (e.g., pointing to Central Park)
addLabelOnlyMarkers(
lng = -73.9654, lat = 40.7829,
label = "Central Park",
labelOptions = labelOptions(
noHide = TRUE,
direction = "top",
style = list(
"color" = "black",
"font-weight" = "bold",
"background" = "white",
"padding" = "4px"
)
)
) |>
# Set initial view to central Manhattan
setView(lng = -73.98057, lat = 40.72912, zoom = 127)
## The code above was created by asking ChatGPT: "How to add an annotation to my map"
# This map is an interactive visualization of Airbnb listings in central Manhattan. Given the very high number of listings to plot in this neighborhood group, I decided to center the map on the longitude and latitude of East Village. As the legend explains, each color represents a room type available in the neighborhood. I also included a click popup so that readers can find out the price and the room type of each listing.
**BIBLIOGRAPHY AND RESOURCES USED**
https://www.google.com/search?q=How+was+the+airbnb_ny19+data+colleccted%3F&sca_esv=94c780ffd014c36e&sxsrf=AE3TifNxgDf2ZcTYaneajhHpo3id8R0zgw%3A1751644411007&ei=-vhnaOrdPJ6s5NoP9aSdsQc&ved=0ahUKEwjqwo_3x6OOAxUeFlkFHXVSJ3YQ4dUDCBI&uact=5&oq=How+was+the+airbnb_ny19+data+colleccted%3F&gs_lp=Egxnd3Mtd2l6LXNlcnAiKEhvdyB3YXMgdGhlIGFpcmJuYl9ueTE5IGRhdGEgY29sbGVjY3RlZD8yBRAhGKABMgUQIRigATIFECEYoAEyBRAhGKsCMgUQIRirAjIFECEYqwJIo3hQ9BRYy3VwAXgBkAEAmAHNAqAB4RyqAQkxMy4xNi4wLjG4AQPIAQD4AQGYAh-gAo4ewgIKEAAYsAMY1gQYR8ICBxAAGIAEGA3CAgYQABgNGB7CAggQABiiBBiJBcICBRAAGO8FwgIIEAAYgAQYogTCAgoQIRigARjDBBgKwgIHECEYoAEYCpgDAIgGAZAGCJIHCTEzLjE3LjAuMaAH6IkBsgcJMTIuMTcuMC4xuAeHHsIHCDAuNi4yMy4yyAeEAQ&sclient=gws-wiz-serp
Cornell University: https://pages.github.coecis.cornell.edu/info2950-s23/project-wondrous-raichu/report.html#:~:text=Data%20description&text=The%20dataset%20is%20aggregated%20through,be%20seen%20in%20Appendix%201.
Raymond Atta-Fynn and Charlie Zien Posted on Oct 7, 2019: https://nycdatascience.com/blog/student-works/analysis-and-machine-learning-modeling-of-new-york-city-airbnb-data/