Airbnb, founded by Joe Gebbia, Brian Chesky and Blecharczyk in 2008, started off with renting their own place as Bed and Breakfast. They saw a potential in this space and developed a website that allows people to rent their properties. As properties on Airbnb are typically more affordable than hotels and short-terms stays are allowed, this concept slowly becomes popular for tourists and disrupts the once predictable hospitality industry.
Things aren’t all rosy for the corporation. There are problems include trust, security, and ethnics. One of the biggest problems is illegality of short-term bookings. In Singapore, it is illegal to rent public housing for less than 180 days and private housing for less than 90 days. However, there are property owners in Singapore who are willing to violate the ruling and take their chances on Airbnb for the extra cash.
This objective of this dashboard is to help users to get an overview of all listings on Airbnb as well as understand how prices are correlated to other features like location, property type and review score.
The analysis should include the prices of properties that had been booked, and reviewed to properly analyse the features. However, as advised by Inside Airbnb, I assumed properties with occupancy rate of more than 70% could be blacked out by the hosts, instead of being occupied.
In the data, there are text data with comma in them. For the ease of reading the data properly in R, I first change the delimiter from comma to pipe in python and output as a .txt file.
I also filtered out hosts are not active (based on how updated their calendars are) or properties with no review. This is to get a more realistic view of the listings.
Sketch of the Design
packages = c('tidyverse','leaflet','ggridges','ggplot2','plotly','RColorBrewer')
for (p in packages){
if (!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
main_listing<-read.table("C:/Users/User/Desktop/MITB_2019/AY2020T3_Visual_ISSS608_G2/Assignment5/Data/list_transform.txt",header=TRUE,sep="|",fill=TRUE)
calendar<-read.csv("C:/Users/User/Desktop/MITB_2019/AY2020T3_Visual_ISSS608_G2/Assignment5/Data/calendar.csv",header=TRUE)
reviews<-read.csv("C:/Users/User/Desktop/MITB_2019/AY2020T3_Visual_ISSS608_G2/Assignment5/Data/reviews.csv",header=TRUE)
Renaming some variables
colnames(main_listing)[colnames(main_listing) == 'ï..id'] <- 'listing_id'
#Group by listing_id to count number of days
cal_New <- data.frame(table(calendar$listing_id))
#Rename columns
colnames(cal_New)[colnames(cal_New) == 'Var1'] <- 'listing_id'
colnames(cal_New)[colnames(cal_New) == 'Freq'] <- 'total_days'
To get compute the occupancy rate of each property, I counted the number of days that the property is available in the next 365 days and divide by 365. I only include properties available for more than 30% of the year.
#Group by listing_id to count number of available days in next 365 days
cal_avail <- calendar %>%
filter(available == "t") %>%
group_by(listing_id) %>%
dplyr::summarise(avail_days_ct = n())
#Join to get properties with availability of more than 30% (exclude properties blocked by owners)
cal_New <-merge(x=cal_New,y=cal_avail,by="listing_id",all.x=TRUE) ##left join
cal_New$pct_avail <- cal_New$avail_days_ct/cal_New$total_days
cal_New<-filter(cal_New, pct_avail > 0.3)
listing<-merge(x=main_listing,y=cal_New,by="listing_id") ##inner join
I also exclude the inactive hosts, unnecessary columns and those properties with no reviews.
##colnames(listing)
##Exclude inactive hosts
listing<-filter(listing, calendar_updated %in% c("1 week ago","10 months ago","11 months ago","12 months ago","2 days ago","2 months ago","2 weeks ago","3 days ago",
"3 months ago","3 weeks ago","4 days ago","4 months ago","4 weeks ago","5 days ago","5 months ago","5 weeks ago",
"6 days ago","6 months ago","6 weeks ago","7 months ago","7 weeks ago","8 months ago","9 months ago","a week ago","today",
"yesterday"))
#Remove unwanted columns and rename the table as list_db
listing<-select(listing, -c(listing_url,listing_summary, property_type_temp,
listing_space, listing_description, experiences_offered, thumbnail_url, medium_url, neighborhood_overview, notes, transit, access,
picture_url, xl_picture_url, host_url, interaction, house_rules, host_about, host_response_rate, calendar_updated_2, Freq_book_ind,
weekly_price, monthly_price, security_deposit, cleaning_fee, minimum_nights, maximum_nights, minimum_minimum_nights, maximum_minimum_nights,
minimum_maximum_nights, maximum_maximum_nights, square_feet, has_availability, calendar_updated, Pct_avail_30, availability_30,
Pct_avail_60, availability_60, Pct_avail_90, availability_90, Pct_avail_365, availability_365, requires_license, license, jurisdiction_names,
property_type_2, is_business_travel_ready, cancellation_policy_2, reviews_per_month, smart_location, neighbourhood, host_neighbourhood, country_code, country, market, host_acceptance_rate))
#Remove listing without reviews
listing$review_scores_rating <- as.numeric(as.character(listing$review_scores_rating))
listing<-listing[!is.na(listing$review_scores_rating),]
listing<-listing[!(listing$review_scores_rating=="."),]
Price is read as a character column due to the dollar sign. I remove the dollar sign and change it to a numeric column. The distribution of price is highly skewed. For better visualization of the distribution of price across different factors, I log transformed the price variable.
#Convert the target variables as numeric and create histogram
listing$price2 = as.numeric(gsub("\\$", "", listing$price))
ggplot(listing, aes(x=listing$price2)) +
geom_histogram(bins = 20, color = "black", fill = "#E69F00", alpha=0.7)+
geom_vline(aes(xintercept = mean(price2)), color = "#FC4E07", linetype = "dashed", size = 1)+
labs(title="Histogram of Price",x="Price($)", y = "Number of Properties")+
theme_minimal()+
theme(plot.title = element_text(hjust = 0.5))
##Log Transformation and create a histogram
listing$log_price = log(listing$price2+1)
ggplot(listing, aes(x=listing$log_price)) +
geom_histogram(bins = 20, color = "black", fill = "#E69F00", alpha=0.7)+
geom_vline(aes(xintercept = mean(log_price)), color = "#FC4E07", linetype = "dashed", size = 1)+
labs(title="Histogram of Price After Log Transformation",x="Log Price($)", y = "Number of Properties")+
theme_minimal()+
theme(plot.title = element_text(hjust = 0.5))
Property type and number of rooms variable has too many similar levels. Hence, I group them together. Review score is also skewed to right, therefore I choose to band the review score.
##Group the categories
listing$property_type_group <- case_when(listing$property_type %in% c("Apartment", "Condominium", "Loft", "Guest suite") ~ "Apartment/Condominium",
listing$property_type %in% c("Serviced apartment", "Hotel", "Boat", "Boutique hotel", "Cabin") ~ "Serviced Apartment/Hotel/Yacht",
listing$property_type %in% c("Hostel", "Bed and breakfast", "Aparthotel", "Farm stay") ~ "Hostel/B&B",
listing$property_type %in% c("House", "Townhouse", "Guesthouse", "Villa","Cottage") ~ "Houses/Villa",
TRUE ~ "Others")
listing$review_score_group <- case_when(listing$review_scores_rating=="." ~ "No review",
listing$review_scores_rating!="." & as.numeric(as.character(listing$review_scores_rating))<=10 ~ "[0,10]",
listing$review_scores_rating!="." & as.numeric(as.character(listing$review_scores_rating))<=20 ~ "[11-20]",
listing$review_scores_rating!="." & as.numeric(as.character(listing$review_scores_rating))<=30 ~ "[21-30]",
listing$review_scores_rating!="." & as.numeric(as.character(listing$review_scores_rating))<=40 ~ "[31-40]",
listing$review_scores_rating!="." & as.numeric(as.character(listing$review_scores_rating))<=50 ~ "[41-50]",
listing$review_scores_rating!="." & as.numeric(as.character(listing$review_scores_rating))<=60 ~ "[51-60]",
listing$review_scores_rating!="." & as.numeric(as.character(listing$review_scores_rating))<=70 ~ "[61-70]",
listing$review_scores_rating!="." & as.numeric(as.character(listing$review_scores_rating))<=80 ~ "[71-80]",
listing$review_scores_rating!="." & as.numeric(as.character(listing$review_scores_rating))<=90 ~ "[81-90]",
listing$review_scores_rating!="." & as.numeric(as.character(listing$review_scores_rating))<=100 ~ "[91-100]",
TRUE ~ "Unknown")
listing$bedroom_group <- case_when(as.numeric(as.character(listing$bedrooms))>=3 ~ "At least 3 bedrooms",
as.numeric(as.character(listing$bedrooms))==2 ~ "2 bedrooms",
as.numeric(as.character(listing$bedrooms))==1 ~ "1 bedroom",
as.numeric(as.character(listing$bedrooms))==0 ~ "No bedroom",
TRUE ~ "Unknown")
The price of Apartment, Condominium, Serviced Apartment, Hotel and Yacht are higher compared to Hostel, B&B, Houses or Villa. The distribution of prices also seemed to differ across the various room type (ie Entire property and Private room are priced higher than Shared rooms). Properties located in the central region are the most expensive, followed by East Region and the rest are priced similarly. I used the number of bedrooms to estimate the size of the property. Properties with 3 or more bedrooms are the most highly priced and followed by 2 bedrooms. Those with no bedroom seems to be more expensive than those with 1 bedroom. It could be because some hotels have registered as no bedroom. Prices of properties with good reviews are higher (score > 60), compared to those with neutral reviews (score between 31 to 60). The interesting finding is those with bad review scores (<= 30) are more expensive with neutral reviews. It could be these properties did not meet the expectation of the customers even through the properties were expensive, hence the reviews were bad. Due to the 2 outliers, the number of reviews received does not seem to be correlated to the price of the property. The pearson’s correlation test also shows that the relationship between these 2 variables is not significant.
##Map of location
leaflet(listing) %>%
addTiles() %>%
addMarkers(~longitude, ~latitude,labelOptions = labelOptions(noHide = F),clusterOptions = markerClusterOptions(),popup = paste0("<b> Name: </b>", listing$listing_name , "<br/><b> Host Name: </b>", listing$host_name, "<br> <b> Price: </b>", listing$price, "<br/><b> Room Type: </b>", listing$room_type, "<br/><b> Property Type: </b>", listing$property_type
)) %>%
setView(103.85, 1.32, zoom = 11) %>%
addProviderTiles("CartoDB.Positron")
listing <- listing %>%
mutate(property_type_group = fct_relevel(property_type_group, levels = "Apartment/Condominium", "Hostel/B&B", "Houses/Villa", "Serviced Apartment/Hotel/Yacht","Others"))
ggplot(data = listing, aes(x=log_price, y=property_type_group,
fill = 0.5 - abs(0.5 - stat(ecdf))
))+
geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01, gradient_lwd = 1., calc_ecdf = TRUE) +
theme_ridges(font_size = 13, grid = FALSE)+
scale_fill_viridis_c(name = "Tail probability", direction = -1)+
labs(
x = "Log Price ($)",
title = "Distribution of Log Price by Type of Property"
) +
theme(axis.title.y = element_blank())
ggplot(listing, aes(x = log_price)) +
geom_histogram(aes(color = room_type, fill=room_type), alpha =0.3, position = "identity", bins = 20) +
facet_grid(room_type ~.)
ggplot(data = listing, aes(x=log_price, y=neighbourhood_group_cleansed, group=neighbourhood_group_cleansed))+
geom_density_ridges_gradient(aes(fill=neighbourhood_group_cleansed), stat = "binline", binwidth = 0.5, scale = 0.95)+
theme_ridges(font_size = 13, grid = FALSE)+
labs(
x = "Log Price ($)",
title = "Distribution of Log Price by Location"
) +
scale_fill_cyclical(values = c("#0000B0", "#7070D0"))+
theme(axis.title.y = element_blank())
listing$bedroom_group_reorder<-factor(listing$bedroom_group, c("No bedroom","1 bedroom","2 bedrooms","At least 3 bedrooms"))
fig <- plot_ly(listing, x=~bedroom_group_reorder, y=~log_price, color = ~bedroom_group_reorder) %>%
add_boxplot(hoverinfo = "y") %>%
layout(yaxis = list(hoverformat = ".2f"))
ax <- list(
title = "Number of Bedrooms"
)
ay <- list(
title = "Log Price ($)",
zeroline = FALSE
)
fig <- fig %>% layout(title = "<b>Boxplot of Log Price by Number of Bedrooms</b>")
fig <- fig %>% layout(xaxis = ax, yaxis = ay)
fig
fig <- plot_ly(data = listing,
y = ~log_price,
color = ~review_score_group,
marker = list(size = 9,
color = colorRampPalette(brewer.pal(9,"Spectral"))(9))) %>%
add_boxplot(hoverinfo = "y") %>%
layout(yaxis = list(hoverformat = ".2f"))
ax <- list(
title = "Review Score"
)
ay <- list(
title = "Log Price ($)",
zeroline = FALSE
)
fig <- fig %>% layout(title = "<b>Boxplot of Log Price by Review Score</b>")
fig <- fig %>% layout(xaxis = ax, yaxis = ay)
fig
fig<-plot_ly(listing, x=~number_of_reviews_ltm, y=~log_price,
marker = list(size = 10,
color = 'rgba(255, 182, 193, .9)',
line = list(color = 'rgba(152, 0, 0, .8)',
width = 2)))
ax <- list(
title = "Number of Reviews"
)
ay <- list(
title = "Log Price ($)"
)
fig <- fig %>% layout(title = "<b>Scatterplot of Log Price by Number of Reviews</b>",xaxis = ax, yaxis = ay)
fig
cor.test(as.numeric(listing$number_of_reviews_ltm), listing$log_price)
##
## Pearson's product-moment correlation
##
## data: as.numeric(listing$number_of_reviews_ltm) and listing$log_price
## t = 0.74387, df = 3183, p-value = 0.457
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.02155752 0.04789338
## sample estimates:
## cor
## 0.01318383