library(ggplot2)
library(dplyr)
library(rvest)
library(purrr)
library(tidyr)
library(tidytext)
library(textdata)
library(httr)
Ask any of my friends or family members and they will tell you that my long-term goal is to start a real estate business. More specifically, I would like to start a rental business. Whether through long-term or short-term rentals, real estate has always been a passion of mine, and I can see myself making a career out of it once I have the funds. One thing that is hard to understand without experience is what makes a good property. So many factors are involved (location, furniture, amenities, land, and more) that it can be overwhelming for someone like me who is new to the space. This lack of experience drove my decision to find a dataset related to rentals for this project.
Although data and numbers can only do so much for a new real estate investor, I will attempt to use them to answer the following questions:
What factors are important for building a strong reputation as a property/Airbnb owner?
Do reviews and sentiment affect the popularity of a property, and does this carry over to the owner's other listings?
After searching for a while, I came across a great CSV with information on New York City Airbnbs. This dataset covers roughly 49,000 Airbnb listings in the New York City area for 2019. While my long-term goal is to start investing in Detroit, I expect similar concepts to apply across most cities.
The link to download the dataset is attached below:
This link will download the dataset as a CSV file.
The dataset has 16 columns we can examine. The column names are listed below, along with a short description where I thought one was needed:
id: Unique identifier for each listing.
name
host_id: Unique identifier for each host.
host_name
neighbourhood_group: NYC borough where the listing is located.
neighbourhood: Specific neighborhood of the listing.
latitude: Geographic coordinates (latitude).
longitude: Geographic coordinates (longitude).
room_type: Type of room offered (e.g., Entire home, Private room).
price: Nightly price in USD.
minimum_nights: Minimum stay required (in nights).
number_of_reviews: Total number of reviews received.
last_review: Date of the last review.
reviews_per_month: Average number of reviews per month.
calculated_host_listings_count: Number of listings from the same host.
availability_365: Number of days available for booking per year.
NYCBNB <-
read.csv("https://myxavier-my.sharepoint.com/:x:/g/personal/estepa1_xavier_edu/ESo38ziw33NErCpVQXKkJ7kB_fG3qk_crw4WRcOO35iaRQ?download=1")
Since the end goal is for this business to be financially stable, I will start by looking at what drives higher prices in these listings, beginning with location. More specifically, I will look at the neighbourhood group (borough) each listing is in.
average_prices <- NYCBNB %>%
group_by(neighbourhood_group) %>%
summarise(average_price = mean(price, na.rm = TRUE))
ggplot(average_prices, aes(x = neighbourhood_group, y = average_price, fill = neighbourhood_group)) +
geom_bar(stat = "identity") +
labs(title = "Average Price by Neighbourhood Group",
x = "Neighbourhood Group",
y = "Average Price (USD)") +
theme_minimal() +
theme(legend.position = "none")
As we can see, Manhattan is one of the highest-priced neighborhood groups, with the Bronx being the lowest. When I first enter the market, it may make sense to start in an area like the Bronx or Queens, where real estate prices are likely to be lowest. As I establish myself, I may be able to move up to properties in Manhattan, which may bring higher margins. While real estate in this area may be more expensive, it could deliver greater profits in the long run. Let's look at specific neighborhoods within Manhattan and see which charge the most.
manhattan_prices <- NYCBNB %>%
filter(neighbourhood_group == "Manhattan") %>%
group_by(neighbourhood) %>%
summarise(average_price = mean(price, na.rm = TRUE)) %>%
arrange(desc(average_price))
ggplot(manhattan_prices, aes(x = reorder(neighbourhood, -average_price), y = average_price, fill = neighbourhood)) +
geom_bar(stat = "identity") +
labs(title = "Average Price by Neighborhood in Manhattan",
x = "Neighborhood",
y = "Average Price (USD)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
legend.position = "none")
There is a clear spread of prices across neighborhoods within Manhattan. Many real estate investors and property owners suggest that it is best to buy a C-tier house in an A-tier area, so it may make sense to invest in a cheaper part of Manhattan. Some areas, like Murray Hill, sit right in the middle of the price range and could have significant upside in the long term.
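Because a neighborhood average can be pulled up by a handful of very expensive listings, it may be worth checking the median and the number of listings alongside the mean. The following is a minimal sketch on a made-up data frame (`toy_listings` and its values are invented for illustration and are not taken from the NYC dataset):

```r
library(dplyr)

# Toy listings (invented values, not from the NYC dataset)
toy_listings <- tibble(
  neighbourhood = c("A", "A", "A", "B", "B", "B", "B"),
  price         = c(100, 120, 2000, 150, 160, 170, 180)
)

# Summarise each neighbourhood by listing count, mean price, and median price
price_summary <- toy_listings %>%
  group_by(neighbourhood) %>%
  summarise(
    listings     = n(),
    mean_price   = mean(price, na.rm = TRUE),
    median_price = median(price, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(median_price))

print(price_summary)
```

In this toy example, neighbourhood A has the higher mean only because of a single 2000-dollar listing, while B has the higher median. The same summary could be applied to `NYCBNB` by swapping in the real data frame.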
Another question for me as an aspiring real estate investor is the debate between short-term and long-term rentals. With Airbnb listings, short-term rentals typically run 1 to 3 weeks, while longer-term rentals last more than a month. Luckily, this dataset includes the minimum number of nights required per stay, which gives us a general idea of the benefits property owners get from short-term and long-term rentals.
ggplot(NYCBNB, aes(x = minimum_nights, y = price)) +
geom_point(alpha = 0.5, color = "blue") +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(title = "Scatter Plot of Price vs. Minimum Nights",
x = "Minimum Nights",
y = "Price (USD)") +
theme_minimal() +
coord_cartesian(xlim = c(0, 365), ylim = c(0, 750))
price_by_stay_length <- NYCBNB %>%
mutate(stay_category = ifelse(minimum_nights < 30, "Less than 30 nights", "30 nights or more")) %>%
group_by(stay_category) %>%
summarise(average_price = mean(price, na.rm = TRUE))
ggplot(price_by_stay_length, aes(x = stay_category, y = average_price, fill = stay_category)) +
geom_bar(stat = "identity") +
labs(title = "Average Price by Minimum Nights Required",
x = "Category of Minimum Nights",
y = "Average Price (USD)") +
theme_minimal() +
theme(legend.position = "none")
As both the scatter plot and the bar chart show, there is a clear correlation between the price of a listing and the minimum number of nights required, once the outliers are excluded. Both short-term and long-term rentals carry risks. However, listing a long-term Airbnb should deliver more consistent, and potentially larger, revenue.
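To put a number on "correlation excluding the outliers", one option is to trim the most expensive listings and use a rank-based (Spearman) correlation, which is less sensitive to extreme prices. This is only a sketch under assumptions: the data frame `toy`, its values, and the 95th-percentile cutoff are all invented for illustration.

```r
# Toy data (invented values, not from the NYC dataset)
toy <- data.frame(
  minimum_nights = c(1, 2, 3, 7, 30, 30, 60, 365),
  price          = c(150, 140, 160, 120, 90, 110, 95, 5000)
)

# Assumed cutoff: drop listings priced above the 95th percentile
price_cap <- quantile(toy$price, 0.95)
trimmed   <- subset(toy, price <= price_cap)

# Pearson correlation on all rows vs. Spearman correlation on the trimmed rows
pearson_all      <- cor(toy$minimum_nights, toy$price)
spearman_trimmed <- cor(trimmed$minimum_nights, trimmed$price, method = "spearman")

print(c(pearson_all = pearson_all, spearman_trimmed = spearman_trimmed))
```

In this toy data a single extreme listing is enough to flip the sign of the correlation, which is why it helps to state how outliers were handled before reading too much into the trend line.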
Since my long-term goal is to scale this business, it is important to evaluate what successful property owners and LLCs are doing. Below are the 10 hosts with the most listings, which we will use for further analysis:
popular_hosts <- NYCBNB %>%
count(host_id, host_name) %>%
top_n(10, n) %>%
arrange(desc(n))
ggplot(popular_hosts, aes(x = reorder(host_name, n), y = n, fill = host_name)) +
geom_bar(stat = "identity") +
labs(title = "Top 10 Most Common Airbnb Hosts by Name",
x = "Host Name",
y = "Count of Listings") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1), # Rotate x-axis labels for better readability
legend.position = "none")
This bar chart shows the 10 hosts with the most listings in the dataset. Each host is matched with a host ID so that different people are not merged (for example, I do not want multiple people named Kara to be counted as one person). As we can see, plenty of hosts, whether corporate or individual, have successfully scaled their business to at least 50 properties in the NYC area.
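The effect of counting by host ID as well as name can be seen on a small made-up example. In this sketch, `toy_hosts` and the `listings` column name are invented for illustration:

```r
library(dplyr)

# Toy data: two different hosts share the name "Kara" (invented values)
toy_hosts <- tibble(
  host_id   = c(1, 1, 1, 2, 2, 3),
  host_name = c("Kara", "Kara", "Kara", "Kara", "Kara", "Sam")
)

# Counting by name alone merges the two Karas into one host with 5 listings
by_name <- toy_hosts %>% count(host_name, name = "listings")

# Counting by ID and name keeps them apart (3, 2, and 1 listings)
by_id <- toy_hosts %>% count(host_id, host_name, name = "listings")

# Top two hosts by number of listings
top_two <- by_id %>% slice_max(listings, n = 2, with_ties = FALSE)

print(by_name)
print(top_two)
```

Grouping on `host_id` is what keeps the ranking honest. The name is carried along only so the chart has a readable label.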
top_hosts <- NYCBNB %>%
group_by(host_id, host_name) %>%
summarise(total_listings = n(), .groups = "drop") %>%
top_n(5, total_listings) %>%
arrange(desc(total_listings))
top_host_prices <- NYCBNB %>%
filter(host_id %in% top_hosts$host_id)
ggplot(top_host_prices, aes(x = factor(host_name), y = price, fill = host_name)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 1) +
labs(title = "Price Distribution of Listings by Top 5 Hosts",
x = "Host Name",
y = "Price (USD)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none")
Looking at the hosts with the most listings on the site in 2019, we can see that many charge between $275 and $300 a night. When looking to purchase a property, it may be valuable to follow their lead and look for properties that can rent for a similar price while still being profitable.
Finally, I want to see if these top hosts are listing their properties for the whole year, or just optimal seasons for renting. This could be important to minimize expenses in the off season and maximize profit.
top_host_availability <- NYCBNB %>%
filter(host_id %in% top_hosts$host_id)
ggplot(top_host_availability, aes(x = factor(host_name), y = availability_365, fill = host_name)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 1) +
labs(title = "Availability (365 Days) of Listings by Top 5 Hosts",
x = "Host Name",
y = "Availability (Days)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none")
As this box plot shows, the majority of the top 5 hosts list their units for at least 300 days a year on average. This suggests it makes sense to list a property for nearly the entire year, with only short breaks during slow periods for maintenance and updates.
One of the top hosts, Kara, has quite a few reviews we can scrape for a sentiment analysis. From this, we can determine what users are looking for in a property owned by a successful manager.
# Profile page for the host whose reviews we want to scrape
kara_airbnb_url <- "https://www.airbnb.com/users/show/54283244"
# Scrape reviewer names, review dates, and review text from a host profile page.
# The class names below appear to be auto-generated, so they may change over time.
fetch_reviews <- function(url) {
  page <- read_html(url)
  reviewer_names <- page %>%
    html_elements("div.t126ex63.atm_c8_2x1prs.atm_g3_1jbyh58.atm_fr_11a07z3.atm_cs_9dzvea.dir.dir-ltr") %>%
    html_text(trim = TRUE)
  review_dates <- page %>%
    html_elements("div.s17vloqa.atm_7l_1esdqks.atm_cs_6adqpa.dir.dir-ltr") %>%
    html_text(trim = TRUE)
  # Read the review text from the page fetched above, not from a global object
  review_content <- page %>%
    html_elements("div.c141wyfg.atm_5j_1fwxnve") %>%
    html_text2()
  # data.frame() requires the three vectors to have the same length
  reviews <- data.frame(Name = reviewer_names, Date = review_dates, Content = review_content)
  head(reviews, 100)
}
reviews <- fetch_reviews(kara_airbnb_url)
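Building the data frame from three separately scraped vectors only works when every review has a name, a date, and text. A more forgiving pattern is to select each review container first and then pull one element per container, so a missing field becomes `NA` instead of misaligning rows. The sketch below runs on inline HTML; the markup and the class names `review`, `name`, `date`, and `text` are invented for illustration and are not Airbnb's.

```r
library(rvest)

# Hypothetical markup standing in for a reviews page (class names are invented)
page <- minimal_html('
  <div class="review">
    <span class="name">Ana</span>
    <span class="date">May 2019</span>
    <p class="text">Great host, very helpful.</p>
  </div>
  <div class="review">
    <span class="name">Ben</span>
    <p class="text">Clean and friendly.</p>
  </div>
')

# One node per review
cards <- html_elements(page, "div.review")

# html_element() (singular) returns exactly one result per card, NA if absent
toy_reviews <- data.frame(
  Name    = cards %>% html_element(".name") %>% html_text2(),
  Date    = cards %>% html_element(".date") %>% html_text2(),
  Content = cards %>% html_element(".text") %>% html_text2()
)

print(toy_reviews)
```

The second review has no date, so its `Date` is `NA` while its name and text stay on the correct row.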
I will now perform a sentiment analysis on the host's reviews using the NRC lexicon.
nrc_words <- get_sentiments("nrc")
review_words <- reviews %>%
unnest_tokens(word, Content)
# A word can map to several NRC sentiments, so a many-to-many join is expected
emotional_words <- review_words %>%
  inner_join(nrc_words, by = "word", relationship = "many-to-many")
word_counts <- emotional_words %>%
group_by(word) %>%
summarise(count = n(), .groups = 'drop')
ggplot(word_counts, aes(x = word, y = count, fill = word)) +
geom_col() +
theme_minimal() +
labs(title = "Frequency of Words in Reviews", x = "Word", y = "Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
sentiment_counts <- emotional_words %>%
group_by(sentiment) %>%
summarise(count = n(), .groups = 'drop') %>%
arrange(desc(count))
ggplot(sentiment_counts, aes(x = sentiment, y = count, fill = sentiment)) +
geom_col() +
labs(title = "Frequency of Sentiments",
x = "Sentiment",
y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Based on the word and sentiment counts, many factors go into a great host's properties. Above all else, reviewers want the host to be helpful and friendly. This makes sense, since many guests will have seen the property through its listing before they arrive, so the service is the only unknown. In terms of sentiment, one of the most important things for this host, and potentially for others, is trust. Guests have to trust the area, the cleanliness of the interior, and the host to help out when needed. This is useful for me, since I will need to decide what to prioritize when I start serving my tenants.
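The tokenize-then-join steps behind these counts can be sketched without downloading the NRC lexicon. In the example below, `toy_reviews` and `toy_lexicon` are invented stand-ins (the lexicon is not NRC), used only to show how a word that carries two sentiments is counted under both:

```r
library(dplyr)
library(tidytext)

# Invented reviews for illustration
toy_reviews <- tibble(
  Name    = c("Ana", "Ben"),
  Content = c("Friendly host and a clean apartment",
              "Helpful and friendly, great location")
)

# Hypothetical mini-lexicon standing in for NRC; "friendly" has two sentiments
toy_lexicon <- tibble(
  word      = c("friendly", "friendly", "clean", "helpful"),
  sentiment = c("positive", "trust", "positive", "positive")
)

toy_sentiments <- toy_reviews %>%
  unnest_tokens(word, Content) %>%   # one lowercase word per row
  inner_join(toy_lexicon, by = "word", relationship = "many-to-many") %>%
  count(sentiment, sort = TRUE)

print(toy_sentiments)
```

Because one word can land in several sentiment categories, the sentiment totals can exceed the number of matched words. That is worth keeping in mind when comparing the word chart with the sentiment chart.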
There are many factors to take into account when starting an Airbnb business. It is not as simple as buying a property and putting it on the site. From pricing by location and property type to seasonal demand, managing an Airbnb is not straightforward. However, using some of these findings from the data can make entering the market a little easier. For those looking to start an Airbnb business now or in the future, this can serve as a stepping stone into the complex world of real estate.