Airbnb is a home-sharing platform that lets homeowners list their properties online so that guests from anywhere can book and stay in them. Hosts are expected to set their own prices for their listings. Although Airbnb provides some general guidance, there are currently no free and accurate services that help hosts understand the current Airbnb situation in Singapore. This study focuses on how listing price and comfort factors relate to guest satisfaction, and on whether superhost status changes that relationship.
The following packages are loaded:
library(tidyverse)
library(gridExtra)
library(grid)
library(gvlma)
library(moments)
library(ggcorrplot)
library(caret)
library(broom)
library(lubridate)
library(ggmap)
library(modelr)
library(car)
library(ggfortify)
library(leaflet)
library(jtools)
library(huxtable)
library(ggstance)
library(interactions)
In this study, the Singapore Airbnb dataset is imported from Inside Airbnb; it was scraped on 22 June 2020.
raw.data <- read_csv("http://data.insideairbnb.com/singapore/sg/singapore/2020-06-22/data/listings.csv.gz")
The dataset consists of 106 variables and 7,323 observations.
The variables selected for this study are:
- price
- neighbourhood_group_cleansed
- neighbourhood_cleansed
- property_type
- room_type
- number_of_reviews
- review_scores_rating
- latitude/longitude
- amenities
- cleaning_fee
- host_is_superhost
- bedrooms
- bathrooms
- accommodates

Data is filtered to analyze listings with a minimum stay of fewer than 3 nights.
data_filtered <- raw.data %>%
select(neighbourhood_group_cleansed, neighbourhood_cleansed, property_type,
room_type, price, number_of_reviews, review_scores_rating,
latitude, longitude, amenities, cleaning_fee, host_is_superhost,
bedrooms, bathrooms, accommodates)
glimpse(data_filtered)
## Rows: 7,323
## Columns: 15
## $ neighbourhood_group_cleansed <chr> "North Region", "Central Region", "North…
## $ neighbourhood_cleansed <chr> "Woodlands", "Bukit Timah", "Woodlands",…
## $ property_type <chr> "Apartment", "Apartment", "Apartment", "…
## $ room_type <chr> "Private room", "Private room", "Private…
## $ price <chr> "$84.00", "$80.00", "$70.00", "$167.00",…
## $ number_of_reviews <dbl> 1, 18, 20, 20, 24, 48, 29, 176, 199, 20,…
## $ review_scores_rating <dbl> 94, 91, 98, 89, 83, 88, 82, 99, 99, 88, …
## $ latitude <dbl> 1.44255, 1.33235, 1.44246, 1.34541, 1.34…
## $ longitude <dbl> 103.7958, 103.7852, 103.7967, 103.9571, …
## $ amenities <chr> "{TV,\"Cable TV\",Internet,Wifi,\"Air co…
## $ cleaning_fee <chr> NA, NA, NA, "$56.00", "$28.00", "$28.00"…
## $ host_is_superhost <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
## $ bedrooms <dbl> 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ bathrooms <dbl> 1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 1.0, 1.0, …
## $ accommodates <dbl> 1, 2, 1, 6, 3, 3, 6, 1, 2, 1, 1, 2, 1, 1…
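The next steps of the tidying process are:
- Missing values (NA) in review_scores_rating, cleaning_fee, bedrooms and bathrooms are replaced with 0.
- price and cleaning_fee are parsed from currency strings into numbers.
- amenities is converted into a count of the listed amenities.
- Missing values in host_is_superhost are replaced with FALSE.
- A superhost column is added that converts the logical value to numeric, where FALSE is 0 and TRUE is 1.

data_filtered <- data_filtered %>%
  mutate(
    review_scores_rating = replace_na(review_scores_rating, 0),
    price = parse_number(price),
    # parse the currency string first, then fill NA, so the 0 matches the numeric type
    cleaning_fee = replace_na(parse_number(cleaning_fee), 0),
    # number of amenities = number of commas + 1
    amenities = (str_count(amenities, ",") + 1),
    host_is_superhost = replace_na(host_is_superhost, FALSE),
    superhost = ifelse(host_is_superhost == FALSE, 0, 1),
    bedrooms = replace_na(bedrooms, 0),
    bathrooms = replace_na(bathrooms, 0))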
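The comma-counting step for amenities can be illustrated on two made-up values. This is a minimal base-R sketch, not the author's code: `amenities_raw`, `count_commas` and `adjusted_count` are hypothetical names, and the adjustment for an empty set `{}` is an assumption about how such rows might be handled.

```r
# Two hypothetical raw values in the "{a,b,c}" format shown in the glimpse output
amenities_raw <- c("{TV,\"Cable TV\",Internet,Wifi}", "{}")

# Base-R equivalent of stringr::str_count(x, ","): number of commas per string
count_commas <- function(x) {
  lengths(regmatches(x, gregexpr(",", x, fixed = TRUE)))
}

# Rule used in the text: amenities = commas + 1
naive_count <- count_commas(amenities_raw) + 1

# Assumed refinement: an empty set "{}" holds zero amenities, not one
adjusted_count <- ifelse(amenities_raw == "{}", 0, naive_count)

data.frame(amenities_raw, naive_count, adjusted_count)
```

The commas-plus-one rule reports one amenity for a listing that has none, so a check for `{}` may be worth adding if such rows exist in the data.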
Check for any presence of NAs
map(data_filtered, ~sum(is.na(.)))
To calculate the distance from each listing to the city centre (Marina Bay Sands, MBS),
the latitude and longitude of MBS are retrieved from the Google API.
## Rows: 1
## Columns: 2
## $ lon <dbl> 103.8607
## $ lat <dbl> 1.283894
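Create a function to calculate the distance between each Airbnb listing and MBS.
The geocoded latitude and longitude of MBS are 1.283894 and 103.8607; the function itself hard-codes a nearby reference point (1.285332, 103.8594).
Listings within 5 km of MBS are termed near; all others are termed far.
# Approximate distance (km) to the city-centre reference point,
# treating one degree of latitude or longitude as about 111.139 km
dist_centre <- function(lat, long) {
  degree_to_km <- 111.139
  degree_to_km * sqrt((1.285332 - lat)^2 + (103.8594 - long)^2)
}
# dist_centre() is vectorised, so it can be applied to whole columns directly
data_filtered <- data_filtered %>%
  mutate(distance = dist_centre(latitude, longitude))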
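As a sanity check on the degree-to-kilometre shortcut, it can be compared with a great-circle (haversine) distance. The sketch below is an illustration rather than part of the original analysis: `flat_km` restates the formula used above, `haversine_km` is a swapped-in alternative, and the Earth radius of 6371 km is an assumed constant.

```r
# Shortcut used in the text: one degree is treated as about 111.139 km
flat_km <- function(lat, long, lat0 = 1.285332, long0 = 103.8594) {
  111.139 * sqrt((lat0 - lat)^2 + (long0 - long)^2)
}

# Haversine great-circle distance in km (alternative, not the author's method)
haversine_km <- function(lat, long, lat0 = 1.285332, long0 = 103.8594,
                         radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat - lat0) * to_rad
  dlong <- (long - long0) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat0 * to_rad) * cos(lat * to_rad) * sin(dlong / 2)^2
  2 * radius_km * asin(sqrt(a))
}

# First listing in the dataset (Woodlands), coordinates as printed by glimpse()
flat <- flat_km(1.44255, 103.7958)
great_circle <- haversine_km(1.44255, 103.7958)
c(flat = flat, haversine = great_circle)
```

Singapore lies close to the equator, where a degree of longitude is almost as long as a degree of latitude, so the two results should differ only slightly; at higher latitudes the shortcut would overstate east-west distances.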
Rename three columns to region, neighbourhood and satisfaction_rate:
data_filtered <- data_filtered %>%
  rename(region = neighbourhood_group_cleansed,
         neighbourhood = neighbourhood_cleansed,
         satisfaction_rate = review_scores_rating)
As the price variable is an important predictor in model building, it is essential to check it for outliers.
price_bp1 <- data_filtered %>%
ggplot(aes(y = price)) +
geom_boxplot() +
labs(title = "Singapore Airbnb Price Distribution",
subtitle = "Presence of extreme outliers",
caption = "Source: http://data.insideairbnb.com",
y = "Price (SGD)")
price_bp1
The boxplot shows extreme price outliers, with some listings priced above SGD 5,000 per night.
Such extreme values would distort the summary statistics and the fitted models, so the outliers are collected into a vector and then removed.
price_outlier <- boxplot(data_filtered$price, plot = FALSE)$out
# logical indexing stays correct even when no outliers are found
data_filtered <- data_filtered[!data_filtered$price %in% price_outlier, ]
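The rule behind `boxplot(...)$out` can be reproduced by hand: a value is flagged when it lies more than 1.5 times the hinge spread beyond the lower or upper hinge. A small sketch on made-up prices (the `prices` vector is hypothetical, not taken from the dataset):

```r
# Hypothetical nightly prices with one extreme value
prices <- c(40, 55, 60, 70, 80, 95, 120, 5000)

# Lower and upper hinges, as used by boxplot() and boxplot.stats()
hinges <- fivenum(prices)[c(2, 4)]
fence <- 1.5 * diff(hinges)
lower <- hinges[1] - fence
upper <- hinges[2] + fence

# Values outside the fences are the outliers
manual_out <- prices[prices < lower | prices > upper]

# Same result from the built-in helper
builtin_out <- boxplot.stats(prices)$out

# Keep everything that is not an outlier
kept <- prices[!prices %in% builtin_out]
```

Here only the 5000 entry falls outside the fences, so `kept` retains the other seven prices.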
Visual comparison of the price distribution before and after outlier removal:
# price_bp2 is assumed to be the same boxplot redrawn after outlier removal; its definition is not shown
grid.arrange(price_bp1, price_bp2, ncol = 2)
After removal of the outliers, the interquartile range of the price variable becomes clearer and more statistical insight can be drawn.
The region variable has 5 unique categories. The bar chart shows that the Central Region has the most Airbnb listings while the North Region has the fewest. The boxplot shows that the Central and East regions have a similar mean price and interquartile range.
The neighbourhood variable has 39 unique categories. The 10 most frequent neighbourhoods are filtered to gain more insight. Kallang is the most popular neighbourhood among Airbnb hosts and, among these top 10 neighbourhoods, Downtown Core listings have the highest average price. Nine of the top 10 neighbourhoods are in the Central Region; the exception is Bedok, in the East Region.
The room_type variable has 4 unique categories. The bar chart shows that private rooms have the most listings, while entire homes and apartments have the highest average price.
Before diving deeper into the dataset, the categorical variables need to be transformed so that they can be used in the analysis. Each categorical variable is converted to a numeric code:
columns <- c("region","neighbourhood","property_type","room_type")
data_filtered[, columns] <- data_filtered %>%
select(all_of(columns)) %>% lapply(as.factor) %>% lapply(as.numeric)
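What `as.numeric(as.factor(x))` does can be seen on a short made-up vector. The sketch below is illustrative only; `room_type_example` is a hypothetical name, and the dummy-variable alternative at the end is a common option rather than something the study uses.

```r
# Hypothetical room types, in the spelling used by Inside Airbnb
room_type_example <- c("Private room", "Entire home/apt", "Shared room",
                       "Hotel room", "Private room")

# Factor levels are sorted alphabetically, then replaced by their position
room_levels <- levels(as.factor(room_type_example))
codes <- as.numeric(as.factor(room_type_example))

data.frame(room_type_example, codes)

# Alternative: keep the factor and let the model expand it into dummy columns
dummies <- model.matrix(~ factor(room_type_example))
ncol(dummies)  # intercept plus one column per non-reference level
```

The integer codes follow alphabetical order, so a model that treats them as numbers assumes, for instance, that "Shared room" (4) sits one step above "Private room" (3). Keeping the variables as factors avoids imposing that ordering, at the cost of more coefficients.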
glimpse(data_filtered)
## Rows: 6,973
## Columns: 17
## $ region <dbl> 3, 1, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, …
## $ neighbourhood <dbl> 40, 7, 40, 36, 36, 36, 36, 2, 2, 5, 5, 20, 12, 5, 5…
## $ property_type <dbl> 2, 2, 2, 26, 18, 18, 18, 25, 25, 2, 2, 2, 2, 2, 2, …
## $ room_type <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ price <dbl> 84, 80, 70, 167, 95, 84, 209, 52, 54, 49, 41, 63, 4…
## $ number_of_reviews <dbl> 1, 18, 20, 20, 24, 48, 29, 176, 199, 20, 12, 133, 1…
## $ satisfaction_rate <dbl> 94, 91, 98, 89, 83, 88, 82, 99, 99, 88, 94, 89, 90,…
## $ latitude <dbl> 1.44255, 1.33235, 1.44246, 1.34541, 1.34567, 1.3470…
## $ longitude <dbl> 103.7958, 103.7852, 103.7967, 103.9571, 103.9596, 1…
## $ amenities <dbl> 9, 13, 10, 28, 25, 19, 24, 37, 36, 15, 17, 28, 16, …
## $ cleaning_fee <dbl> 0, 0, 0, 56, 28, 28, 70, 0, 0, 65, 65, 42, 0, 55, 6…
## $ host_is_superhost <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TR…
## $ bedrooms <dbl> 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, …
## $ bathrooms <dbl> 1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 1.0, 1.0, 1.0, 0.0, 0…
## $ accommodates <dbl> 1, 2, 1, 6, 3, 3, 6, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, …
## $ superhost <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ distance <dbl> 18.848617, 9.761806, 18.803281, 12.748842, 13.00218…
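To prepare for modeling, a correlation matrix is created to see how the variables of interest (within the model) are related.
# drop latitude, longitude and host_is_superhost (columns 8, 9 and 12)
data_modeling <- select(data_filtered, -latitude, -longitude, -host_is_superhost)
glimpse(data_modeling)
# pairwise correlations of all remaining (numeric) columns, rounded to 2 decimals
corr <- data_modeling %>%
  cor() %>%
  round(2)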
ggcorrplot(corr, lab = TRUE, colors = c("indianred1", "white", "dodgerblue"),
show.legend = T, outline.color = "white", type = "lower", hc.order = T,
tl.cex = 10, lab_size = 3, sig.level = .2,
title = "Correlation Matrix")
## Model Building
Mean-centering transformations are performed on all the variables that will be turned into interaction terms.
data_centered<- data_modeling %>%
select(satisfaction_rate, # Outcome Variable
price,
neighbourhood, property_type, distance,
room_type, region, number_of_reviews, accommodates,
amenities, cleaning_fee, bedrooms, bathrooms,
superhost) %>% # Moderator
  # subtract each column's mean; across() replaces the deprecated funs()
  mutate(across(price:bathrooms,
                ~ .x - mean(.x, na.rm = TRUE)))
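Mean-centering simply subtracts a column's mean from every value, so the centered column has mean zero and a value of 0 stands for "an average listing". A tiny sketch on made-up numbers (`toy` and its columns are hypothetical):

```r
# Hypothetical prices for three listings
toy <- data.frame(price = c(50, 100, 150), superhost = c(0, 1, 0))

# Center: subtract the mean (100 here)
toy$price_c <- toy$price - mean(toy$price)

# Undo the centering by adding the same mean back
toy$price_back <- toy$price_c + mean(toy$price)

toy
```

With a centered price, the coefficient of superhost in an interaction model describes the superhost difference at the average price rather than at a price of zero.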
The regression models are fitted with a Poisson distribution and a log link. The first model regresses satisfaction rate on price and the comfort factors (model1).
\[ \begin{aligned} \log(\widehat{\text{satisfaction rate}}) = {} & \text{intercept} + b_1\,\text{region} + b_2\,\text{neighbourhood} + b_3\,\text{property type} + b_4\,\text{room type} + b_5\,\text{bathrooms} + \\ & b_6\,\text{amenities} + b_7\,\text{number of reviews} + b_8\,\text{cleaning fee} + b_9\,\text{bedrooms} + b_{10}\,\text{accommodates} + \\ & b_{11}\,\text{superhost} + b_{12}\,\text{distance} + b_{13}\,\text{price} \end{aligned} \]
The key investigation lies in the next model, which regresses satisfaction rate on price and the comfort factors together with the price-by-superhost interaction term (model2).
\[ \begin{aligned} \log(\widehat{\text{satisfaction rate}}) = {} & \text{intercept} + b_1\,\text{region} + b_2\,\text{neighbourhood} + b_3\,\text{property type} + b_4\,\text{room type} + b_5\,\text{bathrooms} + \\ & b_6\,\text{amenities} + b_7\,\text{number of reviews} + b_8\,\text{cleaning fee} + b_9\,\text{bedrooms} + b_{10}\,\text{accommodates} + \\ & b_{11}\,\text{superhost} + b_{12}\,\text{distance} + b_{13}\,\text{price} + b_{14}\,(\text{price} \times \text{superhost}) \end{aligned} \]
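To test whether model2, with its interaction term, improves on the explanatory power of model1, the two models are compared using the anova function.
anova(model1, model2, test="Chisq")
| Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|
| 6.96e+03 | 3.56e+05 | | | |
| 6.96e+03 | 3.55e+05 | 1 | 602 | 5.27e-133 |
Adding the interaction term reduces the residual deviance by 602 on 1 degree of freedom (p = 5.27e-133), so model2 fits significantly better than model1.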
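The calls that create model1 and model2 are not shown in the text. The sketch below illustrates the same workflow on simulated data with only two predictors: fit a main-effects Poisson model, add the interaction, and compare the two with a likelihood-ratio (chi-square) test. All names (`sim`, `m_main`, `m_int`) and the coefficients used to simulate the data are assumptions made for illustration.

```r
set.seed(1)
n <- 500

# Simulated, already-centered price and a binary superhost flag
sim <- data.frame(price_c = rnorm(n, mean = 0, sd = 50),
                  superhost = rbinom(n, size = 1, prob = 0.3))

# Linear predictor on the log scale, including an interaction effect
eta <- 3.9 - 0.001 * sim$price_c + 0.26 * sim$superhost +
  0.0012 * sim$price_c * sim$superhost
sim$rating <- rpois(n, lambda = exp(eta))

# Main effects only, then main effects plus price x superhost
m_main <- glm(rating ~ price_c + superhost, data = sim, family = poisson)
m_int <- glm(rating ~ price_c * superhost, data = sim, family = poisson)

# Likelihood-ratio test: does the extra term reduce the deviance enough?
cmp <- anova(m_main, m_int, test = "Chisq")
cmp
```

In a formula, `price_c * superhost` expands to both main effects plus their product, so the second model has exactly one more coefficient than the first, which is why the comparison uses 1 degree of freedom.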
Check for multicollinearity in Model 1 and Model 2 using the variance inflation factor (VIF).
vif(model1);vif(model2)
## price superhost neighbourhood room_type
## 2.115065 1.140453 1.049608 1.799833
## property_type distance number_of_reviews accommodates
## 1.056309 3.348875 1.098825 1.608353
## amenities cleaning_fee region bedrooms
## 1.263925 1.175992 3.204368 1.657571
## bathrooms
## 1.228814
## price superhost neighbourhood room_type
## 2.593531 1.178599 1.049829 1.828312
## property_type distance number_of_reviews accommodates
## 1.060624 3.363900 1.104111 1.604539
## amenities cleaning_fee region bedrooms
## 1.285326 1.179046 3.212724 1.658764
## bathrooms price:superhost
## 1.236818 1.452826
All VIF values are below 5, which indicates no problematic collinearity among the predictors.
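For intuition, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes it by hand on simulated predictors; it uses a plain linear regression, so it mirrors the textbook definition rather than the exact computation that `car::vif()` applies to a Poisson model. The variables `x1`, `x2` and `x3` are made up.

```r
set.seed(42)
n <- 200

# x2 is deliberately correlated with x1; x3 is independent
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)
x3 <- rnorm(n)

# R-squared from regressing x2 on the other predictors
r2 <- summary(lm(x2 ~ x1 + x3))$r.squared

# Variance inflation factor for x2
vif_x2 <- 1 / (1 - r2)

# Equivalent shortcut: diagonal of the inverse correlation matrix
vif_all <- diag(solve(cor(cbind(x1, x2, x3))))

c(by_hand = vif_x2, from_matrix = unname(vif_all["x2"]))
```

A VIF of 1 means the predictor is uncorrelated with the rest; values above about 5 are often taken as a warning sign, which is the rule of thumb applied above.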
The following shows the variable-level information of Model 2.
library(knitr)
library(kableExtra)
kable(tidy(model2))%>%
kable_paper("striped", full_width = F) %>%
column_spec(c(1, 5), bold = T) %>%
row_spec(c(2, 4, 6, 8, 10, 12, 14), bold = T, color = "white", background = "blue")
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 3.9088136 | 0.0018790 | 2080.315414 | 0.0000000 |
| price | -0.0010250 | 0.0000333 | -30.815984 | 0.0000000 |
| superhost | 0.2611071 | 0.0042999 | 60.723561 | 0.0000000 |
| neighbourhood | 0.0003497 | 0.0001761 | 1.985555 | 0.0470827 |
| room_type | 0.0505853 | 0.0021399 | 23.639313 | 0.0000000 |
| property_type | -0.0075858 | 0.0002314 | -32.779606 | 0.0000000 |
| distance | -0.0058162 | 0.0007120 | -8.169097 | 0.0000000 |
| number_of_reviews | 0.0044253 | 0.0000355 | 124.823633 | 0.0000000 |
| accommodates | 0.0266563 | 0.0008344 | 31.945657 | 0.0000000 |
| amenities | 0.0141954 | 0.0001984 | 71.561580 | 0.0000000 |
| cleaning_fee | 0.0005054 | 0.0000465 | 10.875236 | 0.0000000 |
| region | -0.0104208 | 0.0025757 | -4.045759 | 0.0000522 |
| bedrooms | -0.1008835 | 0.0028799 | -35.030087 | 0.0000000 |
| bathrooms | 0.0453119 | 0.0013558 | 33.420752 | 0.0000000 |
| price:superhost | 0.0012241 | 0.0000495 | 24.709793 | 0.0000000 |
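Because the model uses a log link, the coefficients are easier to read after some arithmetic. The sketch below takes three estimates from the table above and works out the price slope for each host group and the corresponding multiplicative effects; the SGD 100 step is an arbitrary choice for illustration.

```r
# Estimates copied from the Model 2 coefficient table
b_price <- -0.0010250
b_superhost <- 0.2611071
b_interaction <- 0.0012241

# Price slope on the log scale for each group
slope_non_superhost <- b_price
slope_superhost <- b_price + b_interaction

# Multiplicative change in expected satisfaction rate per extra SGD 100
effect_100 <- exp(100 * c(non_superhost = slope_non_superhost,
                          superhost = slope_superhost))

# Superhost vs non-superhost ratio at the average (centered) price
ratio_superhost <- exp(b_superhost)

slope_superhost
effect_100
ratio_superhost
```

Under this model, an extra SGD 100 multiplies the expected satisfaction rate by roughly 0.90 for non-superhosts and roughly 1.02 for superhosts, and at the average price a superhost listing is expected to score about 1.3 times as high as a comparable non-superhost listing.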
The following shows the model level information of Model 2.
kable(glance(model2))%>%
kable_paper("striped", full_width = F) %>%
column_spec(c(2, 4, 6, 8), bold = T, color = "white", background = "blue")
| null.deviance | df.null | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|
| 393806.2 | 6972 | -190887.3 | 381804.7 | 381907.4 | 355166.8 | 6958 | 6973 |
To visualize the Poisson regression fitted above, the model's predictions are stored for a grid of price and superhost values, with every other (mean-centered) predictor held at 0, that is, at its mean.
predicted_Y1 <- data_centered %>%
modelr::data_grid(price,
superhost,
neighbourhood = 0,
region =0,
property_type = 0,
number_of_reviews = 0,
accommodates = 0,
amenities = 0,
cleaning_fee = 0,
bedrooms = 0,
bathrooms = 0,
room_type = 0,
distance = 0) %>%
mutate(pred_SR1 = predict(model2, . , type="response"))
Undo the centering of the price variable so that prices are back on their original scale.
predicted_Y1 <- predicted_Y1 %>%
mutate(price = price + mean(data_modeling$price)
)
The following figure shows two lines, one per superhost status, illustrating how the relationship between price and satisfaction rate differs between superhosts and other hosts.
export_summs(model1, model2,
error_format = "(t = {statistic}, p = {p.value})",
align = "right",
model.names = c("Main Effects Only", "Price X Superhost"),
digits = 3)
| | Main Effects Only | Price X Superhost |
|---|---|---|
| (Intercept) | 3.912 *** | 3.909 *** |
| | (t = 2091.684, p = 0.000) | (t = 2080.315, p = 0.000) |
| price | -0.001 *** | -0.001 *** |
| | (t = -23.020, p = 0.000) | (t = -30.816, p = 0.000) |
| superhost | 0.274 *** | 0.261 *** |
| | (t = 64.816, p = 0.000) | (t = 60.724, p = 0.000) |
| neighbourhood | 0.000 * | 0.000 * |
| | (t = 2.314, p = 0.021) | (t = 1.986, p = 0.047) |
| room_type | 0.056 *** | 0.051 *** |
| | (t = 26.403, p = 0.000) | (t = 23.639, p = 0.000) |
| property_type | -0.007 *** | -0.008 *** |
| | (t = -31.750, p = 0.000) | (t = -32.780, p = 0.000) |
| distance | -0.007 *** | -0.006 *** |
| | (t = -9.638, p = 0.000) | (t = -8.169, p = 0.000) |
| number_of_reviews | 0.004 *** | 0.004 *** |
| | (t = 123.541, p = 0.000) | (t = 124.824, p = 0.000) |
| accommodates | 0.026 *** | 0.027 *** |
| | (t = 30.988, p = 0.000) | (t = 31.946, p = 0.000) |
| amenities | 0.015 *** | 0.014 *** |
| | (t = 74.408, p = 0.000) | (t = 71.562, p = 0.000) |
| cleaning_fee | 0.001 *** | 0.001 *** |
| | (t = 10.869, p = 0.000) | (t = 10.875, p = 0.000) |
| region | -0.007 ** | -0.010 *** |
| | (t = -2.709, p = 0.007) | (t = -4.046, p = 0.000) |
| bedrooms | -0.102 *** | -0.101 *** |
| | (t = -35.530, p = 0.000) | (t = -35.030, p = 0.000) |
| bathrooms | 0.047 *** | 0.045 *** |
| | (t = 35.215, p = 0.000) | (t = 33.421, p = 0.000) |
| price:superhost | | 0.001 *** |
| | | (t = 24.710, p = 0.000) |
| N | 6973 | 6973 |
| AIC | 382404.962 | 381804.653 |
| BIC | 382500.859 | 381907.400 |
| Pseudo R2 | 0.996 | 0.996 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | | |
stone <-
glm(satisfaction_rate ~ price*superhost + neighbourhood + room_type + property_type + distance + number_of_reviews + accommodates + amenities + cleaning_fee + region + bedrooms + bathrooms, data_centered, family = poisson)
Run a simple slopes analysis, first without and then with the Johnson-Neyman technique.
sim_slopes(stone,
data = data_modeling,
pred = price,
modx = superhost,
johnson_neyman = F)
## SIMPLE SLOPES ANALYSIS
##
## Slope of price when superhost = 0.00 (0):
##
## Est. S.E. z val. p
## ------- ------ -------- ------
## -0.00 0.00 -30.82 0.00
##
## Slope of price when superhost = 1.00 (1):
##
## Est. S.E. z val. p
## ------ ------ -------- ------
## 0.00 0.00 4.26 0.00
sim_slopes(stone,
data = data_modeling,
pred = price,
modx = superhost,
johnson_neyman = T)
## JOHNSON-NEYMAN INTERVAL
##
## When superhost is OUTSIDE the interval [0.78, 0.91], the slope of price is
## p < .05.
##
## Note: The range of observed values of superhost is [0.00, 1.00]
##
## SIMPLE SLOPES ANALYSIS
##
## Slope of price when superhost = 0.00 (0):
##
## Est. S.E. z val. p
## ------- ------ -------- ------
## -0.00 0.00 -30.82 0.00
##
## Slope of price when superhost = 1.00 (1):
##
## Est. S.E. z val. p
## ------ ------ -------- ------
## 0.00 0.00 4.26 0.00
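The result indicates that when superhost lies outside the interval 0.78 to 0.91, the slope of price is significant (p < .05). Because superhost is binary, both of its observed values (0 and 1) fall outside this interval, so the price slope is significant for both groups: negative for non-superhosts (z = -30.82) and positive for superhosts (z = 4.26).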
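A simple slope and its standard error can also be computed directly from a fitted model: the slope of price at moderator value m is b_price + m * b_interaction, and its variance combines the variances and the covariance of those two coefficients. The sketch below demonstrates this on simulated data; it is not the internals of `sim_slopes()`, and every name and simulated coefficient is an assumption.

```r
set.seed(7)
n <- 400
price_c <- rnorm(n, mean = 0, sd = 40)
superhost <- rbinom(n, size = 1, prob = 0.3)
y <- rpois(n, exp(3.9 - 0.001 * price_c + 0.25 * superhost +
                    0.0012 * price_c * superhost))

fit <- glm(y ~ price_c * superhost, family = poisson)
b <- coef(fit)
V <- vcov(fit)

# Slope of price, its standard error and z value at moderator value m
simple_slope <- function(m) {
  est <- b["price_c"] + m * b["price_c:superhost"]
  se <- sqrt(V["price_c", "price_c"] +
               m^2 * V["price_c:superhost", "price_c:superhost"] +
               2 * m * V["price_c", "price_c:superhost"])
  c(est = unname(est), se = unname(se), z = unname(est / se))
}

slopes <- rbind(non_superhost = simple_slope(0), superhost = simple_slope(1))
slopes
```

At m = 0 the result is just the price row of the coefficient table; at m = 1 it matches what would be obtained by refitting the model with the superhost coding flipped.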
Run interaction_plot(), adding a benchmark for the regions of significance.
From the analysis above, the following findings are inferred:
Guest satisfaction rate increases with price for listings run by superhosts, whereas it decreases with price for other hosts. Hence, it is important for Airbnb hosts to achieve superhost status, as it appears to be associated with better guest experiences and may indirectly increase booking demand for their listings.
From the earlier exploratory data analysis, price depends largely on room type and region. Entire apartments tend to be more expensive because a larger space is rented. The Central Region has higher demand, and the most popular neighbourhoods are also located there.
Airbnb has a fairly complex relationship with Singapore. For the past five years, the Singapore government has labeled the short-term rentals offered through Airbnb as an illegal service. This study uses a dataset scraped in June 2020, during the Covid-19 pandemic. With the negative impact of Covid-19 on tourism and government restrictions on Airbnb's business model, many hosts may have left the platform. In addition, the study is based on short-term rentals, so many observations have been filtered out.
R For Data Science, by Hadley Wickham & Garrett Grolemund
Inside Airbnb Dataset, from www.insideairbnb.com
Channel News Asia, from www.channelnewsasia.com