I am a budding entrepreneur who wants to get into the real estate market, and I concluded that setting up some AirBnB rentals would be a good way for me to start that process. However, I am not sure if I want to start in New York City, London, or both! So I am going to rely on the data sets that were on dropbox.com to assist me in picking out the first neighborhood for me to start.
setwd("C:/Users/cford/OneDrive/Documents/school/Spring 2024/Assignment data/DownloadFiles")
library(DescTools)
library(lubridate)
library(dplyr)
library(ggplot2) #allows us to graph
library(scales) #allows us to include commas
library(RColorBrewer)
library(ggthemes)#can experiment with look and feel of final plot
library(data.table)
library(ggrepel)
library(plotly)
library(plyr)
library(leaflet)
filenameNYC<- "listings_NYC.csv"
dfNYC<-fread(filenameNYC, na.string=c(NA, ""))
filenameLondon<- "listings_London.csv"
dfLondon<-fread(filenameLondon, na.string=c(NA, ""))
#Note that the data set had neighborhood spelled "neighbourhood," so I decided to keep the spelling the same within my write up.
The data sets provide information on AirBnB listings in both London (dfLondon) and New York City (dfNYC) from years 2009 through 2021. Both sets are adjusted for USD and they were combined into one data frame (dfBothCities) for the purpose of this analysis. The fields that were used predominantly in this analysis are neighbourhood, latitude, longitude, room_type, price, minimum_nights, and number_of_reviews. The fields that were created for more enhanced analysis are City and Total Price, and license was removed from our analysis due to the level of blank fields. When combined, the data sets have 105,614 observable rows and 19 columns.
The data sets are being used to better understand which neighbourhoods would be viable for us to set up our first listing. To find this, our main benchmarks involve number_of_reviews, which allows us to get an understanding of the most actively reviewed neighbourhoods and room types, price, which allows us to get an understanding of the nightly list prices per neighbourhood and room type, and Total Price, which allows us to get an understanding of how much revenue has been earned per neighbourhood and room type since 2009. Total Price is an assumed calculation of price * minimum_nights * number_of_reviews.
We begin our graphical analysis by looking at the top ten neighbourhoods by reviews and sales and which room types are most prevalent in each city. We then drill further into the top five earning neighbourhoods by AirBnB room type and their average nightly price so that we can make a final decision on which neighbourhood we will start in, the room type we will choose, and the nightly price that we will charge. Finally, we look at map coordinates to see which areas of our chosen neighbourhood have the most listings to see if that would be a popular area for our first listing. Based on our models, the data suggests that London may be our best bet for our first listing.
#The next section below is to clean the data for our analysis
dfLondon$neighbourhood_group<- as.character(dfLondon$neighbourhood_group)
#I wanted to set all of the columns in each data frame to the same class
dfLondon <- dfLondon[,-c("license")]
dfNYC <- dfNYC[,-c("license")]
#license is showing as NA in both data sets and did not seem valuable to keep as a column
dfLondon$City<-"London"
#create a city column because my plan is to bind the data frames together for comparison of the cities
dfNYC$City<-"New York"
dfLondon$price<- dfLondon$price * 1.27
#to compare apples to apples, I translated the price in the London data set because I suspect it is in GBP (I viewed a couple other foreign cities in the listings and it appears that they were not in USD either). I used the current FX rate as of 1/26/2024 to do this, which is 1.27.
dfBothCities<- rbind(dfLondon,dfNYC)
#I wanted to combine the data sets for some comparison graphing
dfBothCities$TotalPrice<-dfBothCities$price * dfBothCities$minimum_nights * dfBothCities$number_of_reviews
#the above formula assumes that total price for a listing is equal to price * minimum nights to stay * number of reviews. Multiplying by reviews should discount any listings from revenue that haven't had a review (I am assuming that a review = a stay).
dfBothCities<-dfBothCities[-which(dfBothCities$id==7974622),]
#airbnb id 7974622 looks to be an outlier in Tower Hamlets due to min nights required on the listing, so I am deleting it
dfBothCities<-dfBothCities[-which(dfBothCities$id==36300441),]
#airbnb id 36300441 looks to be an outlier in Tower Hamlets due to price on the listing, so I am deleting it (avg price per night for a shared room in the heatmap is higher than an entire home due to that row)
To start our analysis, we begin by examining review count by neighbourhood to get an understanding of where the rental market is more active between our two chosen cities. This graph depicts the ten neighbourhoods with the most reviews and is filled in with color so that we can tell which city each neighbourhood resides. Based on the information, 80% of the most reviewed neighbourhoods reside in London, with Wesminster being the most reviewed neighbourhood.
Could this be an indication that we would get more feedback from our renters in London? As a first time lister, this might be valuable for learning and growth.
neighborhood_reviews<- dfBothCities %>%
select(neighbourhood, City, number_of_reviews)%>%
group_by(neighbourhood, City)%>%
dplyr::summarise(Totreviews = sum(number_of_reviews), .groups = 'keep')%>%
data.frame()
neighborhood_reviews<-neighborhood_reviews[order(neighborhood_reviews$Totreviews, decreasing = TRUE),]
maxy<-round_any(max(neighborhood_reviews$Totreviews), 25000, ceiling)
ggplot(neighborhood_reviews[1:10,], aes(x = reorder(neighbourhood,-Totreviews), y = Totreviews, fill = City))+
geom_bar(color="black", stat = "identity")+
theme_economist()+
labs(title = "Top 10 Neighbourhoods with Most Reviews", x="", y="Total Reviews")+
theme(plot.title=element_text(hjust = 0.5),axis.title.y = element_text(vjust = 4,size=12),
axis.text.x = element_text(size=6.5, vjust = 1.25),legend.text = element_text(size=10))+
scale_x_discrete(labels=label_wrap(10))+
scale_y_continuous(labels = comma,
breaks = seq(0, maxy, by = 25000),
limits = c(0,maxy))
Next we examine the ten neighbourhoods with the most sales and their average list price per night to see where the most money is being generated. Our second graph depicts this for years 2009 through 2021. The average price per night within our top ten neighbourhoods sits at a minimum of $120.50 per night, a maximum of $315.90 per night and an average of $173.40 per night. If we are to be successful within this market, we will need to make sure that the prices that we charge are within these ranges.
Interestingly, where London’s neighbourhoods were more active with reviews, we see a greater proportion of New York City neighbourhoods within the top ten list for sales. 70% of the most lucrative neighbourhoods reside in the boroughs of New York City, but two neighbourhoods in London still take the top two spots.
NeighbourhoodPrice<- dfBothCities %>%
select(neighbourhood, City, price, TotalPrice) %>%
group_by(neighbourhood, City) %>%
dplyr::summarise(TotalPrice=sum(TotalPrice),avgprice=mean(price), .groups = 'keep')%>%
data.frame()
NeighbourhoodPrice <- NeighbourhoodPrice[order(NeighbourhoodPrice$TotalPrice, decreasing = TRUE),]
yavg<-seq(0, max(NeighbourhoodPrice$avgprice),50)
ggplot(NeighbourhoodPrice[1:10,], aes(x=reorder(neighbourhood, TotalPrice), y=TotalPrice/1000000, fill=City))+
geom_bar(color='black', stat = "identity", position = position_stack(reverse = TRUE))+
coord_flip()+
theme_light()+
labs(title="10 Neighbourhoods with Most Sales & their Average Price", x="", y="Total Price", caption = "Total Price is for years 2009 - 2021")+
theme(plot.title=element_text(hjust=0.5, face="bold"), plot.caption=element_text(hjust=0.5, face="italic", size=11))+
scale_fill_brewer(palette="Pastel2",guide = guide_legend(reverse = FALSE))+
geom_line(inherit.aes = FALSE, data = NeighbourhoodPrice[1:10,],
aes(x=neighbourhood,y=avgprice, color = "Average Price per Night", group = 1), size = 1.5)+
scale_color_manual(NULL, values = "black")+
scale_y_continuous(labels = dollar_format(suffix = "M", prefix = "$"),
sec.axis = sec_axis(~.*1, name="Average Price per Night",labels=dollar_format(suffix = "",prefix = "$"), breaks=yavg))+
geom_point(inherit.aes = FALSE, data=NeighbourhoodPrice[1:10,], aes(x=neighbourhood,y=avgprice,group=1), size=4, shape=21, fill="white", color="black")+
theme(legend.background = element_rect(fill = "transparent"),
legend.box.background = element_rect(fill = "transparent", color=NA),
legend.spacing = unit(-5.5, "lines"))
Now that we have an understanding of what neighbourhoods are receiving the most reviews and generating the most sales, we decided to add room type to our analysis to understand what was popular by neighbourhood and to get an idea of what type of listing we should be offering in our targeted area. Our next graph is a high-level analysis by city of the room types that are receiving the most reviews.
We can see in both London and New York that entire homes/apartments and private rooms are the most reviewed. However, entire homes/apartments have slightly more reviews at 55.21% in New York while private rooms have more reviews in London at 50.95%. Could this be an indication that listing an entire home or apartment may be the option for New York?
roomtype_reviews<- dfBothCities %>%
select(room_type, City, number_of_reviews)%>%
group_by(City, room_type)%>%
dplyr::summarise(Totreviews = sum(number_of_reviews), .groups = 'keep')%>%
group_by(City)%>%
dplyr::mutate(percent_TR = round(100*Totreviews/sum(Totreviews),2))%>%
arrange(City)%>%
data.frame()
ggplot(data = roomtype_reviews, aes(x="",y=Totreviews, fill=room_type))+
geom_bar(stat="identity", position="fill", color="black",size=0.7)+
coord_polar(theta="y",start=0)+
labs(fill="Room Type", x=NULL, y=NULL, title = "Total Reviews by Room Type and City",
caption = "", legend = "Room Type")+
theme_bw() +
theme(plot.title = element_text(hjust = 0.5, face="bold"),
axis.text = element_blank(),
axis.ticks=element_blank(),
panel.grid=element_blank(),
strip.text = element_text(size=13, face="bold"))+
facet_wrap(~City, nrow=1)+
scale_fill_brewer(palette="YlOrRd")+
geom_text(aes(x=1.8,label=paste0(percent_TR,"%")), size=3, position=position_fill(vjust=0.5))
We’ve now been able to narrow down our search to the top five neighbourhoods based on revenue, but now, we need to get an understanding of which room types are earning the most revenue in our five chosen neighbourhoods. Our next graphical analysis does this and we can see that four out of five (or 80%) of the neighbourhoods show that entire homes or apartments are the most earning listings, except in Haringey.
Based on this data, we will open our first listing in Tower Hamlets and buy an entire home to rent out with the capital that we have raised. However, what price should we charge?
roomtypePrice<- dfBothCities %>%
select(neighbourhood, room_type, City, price, TotalPrice) %>%
group_by(neighbourhood, room_type,City) %>%
dplyr::summarise(TotalPrice=sum(TotalPrice), .groups = 'keep')%>%
data.frame()
roomtypePrice <- roomtypePrice[order(roomtypePrice$TotalPrice, decreasing = TRUE),]
#This roomtype data frame allows me to sort and find the top neighbourhoods. Now I will only look at the top 5 earning neighbourhoods and see what AirBnB type is most earning by neighbourhood.
top5<-roomtypePrice %>%
select(neighbourhood, room_type, City, TotalPrice) %>%
dplyr::filter(grepl('Tower Hamlets|Westminster|Bedford-Stuyvesant|Williamsburg|Haringey',neighbourhood))%>%
group_by(neighbourhood, room_type,City) %>%
dplyr::summarise(TotalPrice=sum(TotalPrice), .groups = 'keep')%>%
data.frame()
#top5 data frame is filtering on the top 5 earning neighbourhoods and will be used to make a trellis bar char
maxy2<-round_any(max(top5$TotalPrice), 50000000, ceiling)
ggplot(top5, aes(x=reorder(room_type,-TotalPrice),y=TotalPrice,fill=City))+
geom_bar(stat="identity", position="dodge")+
theme_excel_new()+
theme(plot.title = element_text(hjust = 0.5, size = 12.5, face="bold"), strip.text = element_text(size=9, face="bold"),
axis.text.x = element_text(size=10, angle = 90, hjust = 1, vjust = 0.3, face="bold"), axis.text.y = element_text(face="bold"))+
labs(title = "Top Earning AirBnB Types in Top 5 Neighbourhoods",
x="Room Type",
y="Total Price",
fill = "City")+
scale_fill_brewer(palette ="Set1")+
facet_wrap(~neighbourhood, ncol=5, nrow=1, as.table=FALSE)+
scale_y_continuous(labels = dollar_format(prefix = "$"),
breaks = seq(0, maxy2, by = 50000000),
limits = c(0,maxy2))
Our next visual breaks out the average price per night by room type and neighbourhood, so that we can get an idea of what our nightly charge should be. Interestingly, Haringey and Williamsburg appear to have no hotel rooms listed, based on our data set. Also, most of the room type and neighbourhood prices appear to fall in line with each other, with the exception of Bedford-Stuyvesant for hotel rooms and Wesminster for entire homes/apartments and hotel rooms.
Based on this feedback, we will charge $188 per night for our first listing in Tower Hamlets.
avgtop5<-dfBothCities %>%
select(neighbourhood, room_type, City, price) %>%
dplyr::filter(grepl('Tower Hamlets|Westminster|Bedford-Stuyvesant|Williamsburg|Haringey',neighbourhood))%>%
group_by(neighbourhood, room_type,City) %>%
dplyr::summarise(avgprice=mean(price), .groups = 'keep')%>%
data.frame()
avgtop5[nrow(avgtop5)+1,]<-list("Haringey","Hotel room","London",0)
#add a $0 value for Hotel room in Haringey so that it doesn't look blank in the heatmap
ggplot(avgtop5,aes(x=room_type,y=neighbourhood,fill=avgprice))+
geom_tile(color="black")+
geom_text(aes(label=paste0("$",round(avgprice,0))))+
coord_equal(ratio=1)+
labs(title="Average Price in Top 5 Neighbourhoods and Room Type",
x="", y="", fill="Average Price per Night")+
theme_minimal()+
theme(plot.title = element_text(hjust=0.5),
axis.text.x = element_text(face="bold"),
axis.text.y=element_text(face="bold"))+
scale_x_discrete(labels=label_wrap(10))+
scale_y_discrete(limits=rev)+
guides(fill = guide_legend(reverse=TRUE, override.aes=list(color="black")))+
scale_fill_continuous(low="white", high="gold",labels = dollar_format(prefix = "$"))
Based on the data, Tower Hamlets is showing to have earned the most revenue out of our top five neighbourhoods and also has average prices per night consistent with three out of five of the neighbourhoods (Westminster being the outlier). Below you will find our last piece of data, which is a map of Tower Hamlets showing listings with reviews greater than 150, with the assumption that number of reviews and listings clustered in a specific area indicate a more active rental area of Tower Hamlets. We are using this to pinpoint where we should set up our first listing, and our clustering suggests that Commercial Street and Osborn Street to the west of Whitechapel may be our best option for our first entire home/apartment listing at a price of $188 per night.
FinalSpot <- dfBothCities %>%
select(neighbourhood, room_type, latitude,longitude, number_of_reviews) %>%
dplyr::filter(grepl('Tower Hamlets',neighbourhood))%>%
group_by(neighbourhood, room_type, latitude, longitude) %>%
dplyr::summarise(Totreviews = sum(number_of_reviews), .groups = 'keep')%>%
data.frame()
FinalSpot<-FinalSpot[order(FinalSpot$Totreviews,decreasing = TRUE),]
#set view based on listing with most reviews which was coordinates below. If you click on the circles, it will also give you the room_type of each listing
map = leaflet() %>%
addTiles() %>%
setView(lng = -0.070620, lat = 51.51421, zoom = 14 ) %>%
addCircles( lng = subset(FinalSpot, Totreviews>150)$longitude,
lat = subset(FinalSpot, Totreviews>150)$latitude,
opacity=10,
color="blue",
popup = paste('Room Type:',subset(FinalSpot,Totreviews>150)$room_type),
label = paste('Total Reviews:',subset(FinalSpot,Totreviews>150)$Totreviews),
radius = sqrt(subset(FinalSpot,Totreviews>150)$Totreviews))
map