Problem Statement

Airbnb is a home-sharing platform that lets home-owners list their properties online so that guests from anywhere can book and stay in them. Hosts are expected to set their own prices for their listings. Although Airbnb provides some general guidance, there are currently no free and accurate services that help hosts understand the current Airbnb market in Singapore. This study addresses the following question:

  • Does the relationship between guest satisfaction rate and price differ by superhost status?

The following packages are activated:

library(tidyverse)
library(gridExtra)
library(grid)
library(gvlma)
library(moments)
library(ggcorrplot)
library(caret)
library(broom)
library(lubridate)
library(ggmap)
library(modelr)
library(car)
library(ggfortify)
library(leaflet)
library(jtools)
library(huxtable)
library(ggstance)
library(interactions)

Import

In this study, the Singapore Airbnb dataset, scraped on 22 June 2020, is imported from Inside Airbnb.

raw.data <- read_csv("http://data.insideairbnb.com/singapore/sg/singapore/2020-06-22/data/listings.csv.gz")

Variables

The dataset consists of 106 variables and 7,323 observations.
The variables that will be selected in this study are:

  • Price of listing given by host (price)
  • Region of the listing (neighbourhood_group_cleansed)
  • Neighbourhood of the listing (neighbourhood_cleansed)
  • Type of property listed on Airbnb (property_type)
  • Type of room listed on Airbnb (room_type)
  • Number of reviews given by guests (number_of_reviews)
  • Satisfaction rating given by guests (review_scores_rating)
  • Latitude/longitude location of the unit listed on Airbnb (latitude/longitude)
  • Amenities provided by the host (amenities)
  • Cleaning fee imposed by the host (cleaning_fee)
  • Status of the host on Airbnb (host_is_superhost)
  • Number of bedrooms in the listed unit (bedrooms)
  • Number of bathrooms in the listed unit (bathrooms)
  • Number of guests the listed unit can accommodate (accommodates)

Tidy & Transform

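The analysis targets listings with a minimum stay of fewer than 3 nights. The code below selects the variables of interest; it does not itself apply the minimum-nights filter, and the output still shows all 7,323 rows.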

data_filtered <-  raw.data %>% 
  select(neighbourhood_group_cleansed, neighbourhood_cleansed, property_type,
         room_type, price, number_of_reviews, review_scores_rating, 
         latitude, longitude, amenities, cleaning_fee, host_is_superhost,
         bedrooms, bathrooms, accommodates)

glimpse(data_filtered)
## Rows: 7,323
## Columns: 15
## $ neighbourhood_group_cleansed <chr> "North Region", "Central Region", "North…
## $ neighbourhood_cleansed       <chr> "Woodlands", "Bukit Timah", "Woodlands",…
## $ property_type                <chr> "Apartment", "Apartment", "Apartment", "…
## $ room_type                    <chr> "Private room", "Private room", "Private…
## $ price                        <chr> "$84.00", "$80.00", "$70.00", "$167.00",…
## $ number_of_reviews            <dbl> 1, 18, 20, 20, 24, 48, 29, 176, 199, 20,…
## $ review_scores_rating         <dbl> 94, 91, 98, 89, 83, 88, 82, 99, 99, 88, …
## $ latitude                     <dbl> 1.44255, 1.33235, 1.44246, 1.34541, 1.34…
## $ longitude                    <dbl> 103.7958, 103.7852, 103.7967, 103.9571, …
## $ amenities                    <chr> "{TV,\"Cable TV\",Internet,Wifi,\"Air co…
## $ cleaning_fee                 <chr> NA, NA, NA, "$56.00", "$28.00", "$28.00"…
## $ host_is_superhost            <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
## $ bedrooms                     <dbl> 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ bathrooms                    <dbl> 1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 1.0, 1.0, …
## $ accommodates                 <dbl> 1, 2, 1, 6, 3, 3, 6, 1, 2, 1, 1, 2, 1, 1…

The next steps of the tidying process are:

  • Missing (NA) values are replaced with zero.
  • Price variables stored as character strings are converted to numeric, with the dollar signs removed.
  • The amenities string is converted to a count of the amenities listed.
  • Missing values in host_is_superhost are replaced with FALSE.
  • A superhost column is created that recodes the logical host_is_superhost as numeric, where FALSE is 0 and TRUE is 1.

data_filtered <- data_filtered %>% 
  mutate(
    review_scores_rating = replace_na(review_scores_rating, 0),
    price = parse_number(price),
    # Parse the fee to numeric first, then fill missing fees with 0
    cleaning_fee = replace_na(parse_number(cleaning_fee), 0),
    # Count amenities as the number of commas plus one
    amenities = str_count(amenities, ",") + 1,
    host_is_superhost = replace_na(host_is_superhost, FALSE),
    superhost = ifelse(host_is_superhost, 1, 0),
    bedrooms = replace_na(bedrooms, 0),
    bathrooms = replace_na(bathrooms, 0))

Check for any presence of NAs

map(data_filtered, ~sum(is.na(.)))

To calculate the distance from each listing to the city centre (Marina Bay Sands, MBS),
the latitude and longitude of MBS are obtained from the Google geocoding API.

## Rows: 1
## Columns: 2
## $ lon <dbl> 103.8607
## $ lat <dbl> 1.283894

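Create a function to calculate the distance between each Airbnb listing and MBS.
The geocoded latitude and longitude of MBS are 1.283894 and 103.8607; note that the function below hard-codes slightly different reference coordinates (1.285332, 103.8594).
Listings within 5 km of MBS are termed near; all others are termed far.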

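dist_centre <- function(lat, long) {
  # Approximate km per degree; latitude and longitude degrees are treated as equal length
  degree_to_km <- 111.139
  degree_to_km * sqrt((1.285332 - lat)^2 + (103.8594 - long)^2)
}

# dist_centre() is vectorized, so it can be applied to whole columns directly
data_filtered <- data_filtered %>% 
  mutate(distance = dist_centre(latitude, longitude))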
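
The dist_centre() function above treats one degree of latitude and one degree of longitude as the same fixed length, which is a reasonable shortcut this close to the equator. As a cross-check, a great-circle (haversine) distance and the near/far split at 5 km can be sketched as below. The names haversine_km and distance_cat, and the Earth radius of 6371 km, are assumptions for illustration and are not part of the original analysis.

```r
# Great-circle distance in km between (lat1, lon1) and (lat2, lon2), all in degrees.
# Hypothetical helper; 6371 km is the commonly used mean Earth radius.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Reference point copied from the dist_centre() function above.
mbs_lat <- 1.285332
mbs_lon <- 103.8594

# Two listing coordinates copied from the glimpse() output (Woodlands, Bukit Timah).
lat <- c(1.44255, 1.33235)
lon <- c(103.7958, 103.7852)

distance_km  <- haversine_km(lat, lon, mbs_lat, mbs_lon)
distance_cat <- ifelse(distance_km <= 5, "near", "far")  # 5 km threshold from the text
```

For these two listings the haversine result should stay close to the flat approximation, so either method would place them in the same near/far category.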

Rename columns

  • neighbourhood_group_cleansed to region
  • neighbourhood_cleansed to neighbourhood
  • review_scores_rating to satisfaction_rate
data_filtered <- data_filtered %>% 
  rename(region = neighbourhood_group_cleansed) %>% 
  rename(neighbourhood = neighbourhood_cleansed) %>% 
  rename(satisfaction_rate = review_scores_rating) 

As price is an important predictor in model building, it is essential to check it for outliers.

price_bp1 <- data_filtered %>% 
  ggplot(aes(y = price)) + 
  geom_boxplot() + 
  labs(title = "Singapore Airbnb Price Distribution",
       subtitle = "Presence of extreme outliers",
       caption = "Source: http://data.insideairbnb.com",
       y = "Price (SGD)")

price_bp1

The boxplot shows extreme price outliers, with some listing prices per night exceeding SGD 5,000.
These outliers are removed so that a handful of extreme values do not dominate the model estimates. The outliers are first collected into a vector and then dropped from the data.

price_outlier <- boxplot(data_filtered$price, plot = FALSE)$out
# filter() is safe even when no outliers are found, unlike negative indexing with which()
data_filtered <- data_filtered %>% filter(!price %in% price_outlier)
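
For reference, the $out component returned by boxplot() flags points lying more than 1.5 times the box length beyond the hinges. The sketch below reproduces that rule on a small made-up price vector; the values are illustrative and do not come from the dataset.

```r
# Toy prices (illustrative only); 5000 plays the role of an extreme listing.
toy_price <- c(41, 49, 52, 54, 63, 70, 80, 84, 95, 167, 209, 5000)

# boxplot() relies on boxplot.stats(); stats[2] and stats[4] are the lower and upper hinges.
hinges <- boxplot.stats(toy_price)$stats[c(2, 4)]
fence  <- 1.5 * diff(hinges)

# Points outside [lower hinge - fence, upper hinge + fence] are flagged as outliers.
is_outlier <- toy_price < hinges[1] - fence | toy_price > hinges[2] + fence

flagged <- toy_price[is_outlier]
kept    <- toy_price[!is_outlier]
```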

Visualization comparison of price distribution before and after outliers removal
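
The comparison needs a second boxplot, price_bp2, which is not defined in the code shown. A minimal sketch, assuming it simply repeats the price_bp1 boxplot on the data after outlier removal, could look like this. The helper name price_boxplot and the toy data frame are hypothetical.

```r
library(ggplot2)

# Hypothetical helper: same boxplot as price_bp1, with a configurable subtitle.
price_boxplot <- function(df, subtitle) {
  ggplot(df, aes(y = price)) +
    geom_boxplot() +
    labs(title = "Singapore Airbnb Price Distribution",
         subtitle = subtitle,
         caption = "Source: http://data.insideairbnb.com",
         y = "Price (SGD)")
}

# Toy stand-in for data_filtered after outlier removal (illustrative values only).
toy_filtered <- data.frame(price = c(41, 49, 52, 54, 63, 70, 80, 84, 95, 167))

price_bp2_example <- price_boxplot(toy_filtered, "After removal of outliers")
```

In the analysis itself, the equivalent call would be price_bp2 <- price_boxplot(data_filtered, "After removal of outliers"), run before grid.arrange().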

grid.arrange(price_bp1, price_bp2, ncol= 2)

After removal of the outliers, the interquartile range of the price variable becomes clearer and more statistical insight can be drawn from the plot.

Exploratory Data Analysis

Visualization of price/listing distribution by Region

The region variable has 5 unique categories. The bar chart shows that the Central Region has the most Airbnb listings, while the North Region has the fewest. The boxplot shows that the Central and East Regions have similar price levels and interquartile ranges.

Visualization of price/listing distribution by Neighbourhood

The neighbourhood variable has 39 unique categories. The 10 most frequent neighbourhoods are selected for a closer look. Kallang has the most listings, and among these top 10 neighbourhoods, Downtown Core listings have the highest average price. Nine of the top 10 neighbourhoods are in the Central Region; the exception is Bedok, in the East Region.

Visualization of price/listing distribution by Room Type

The room_type variable has 4 unique categories. The bar chart shows that private rooms have the most listings, while the entire home/apartment room type has the highest average price.

Transformation

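Before modelling, the categorical variables need further transformation so that they can enter the analysis. They are converted to numeric codes. Note that these codes follow the alphabetical order of the category labels and do not represent a meaningful ranking.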

columns <- c("region", "neighbourhood", "property_type", "room_type")
# Convert each categorical column to a factor, then to its integer level code
data_filtered <- data_filtered %>% 
  mutate(across(all_of(columns), ~ as.numeric(as.factor(.x))))
glimpse(data_filtered)
## Rows: 6,973
## Columns: 17
## $ region            <dbl> 3, 1, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, …
## $ neighbourhood     <dbl> 40, 7, 40, 36, 36, 36, 36, 2, 2, 5, 5, 20, 12, 5, 5…
## $ property_type     <dbl> 2, 2, 2, 26, 18, 18, 18, 25, 25, 2, 2, 2, 2, 2, 2, …
## $ room_type         <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ price             <dbl> 84, 80, 70, 167, 95, 84, 209, 52, 54, 49, 41, 63, 4…
## $ number_of_reviews <dbl> 1, 18, 20, 20, 24, 48, 29, 176, 199, 20, 12, 133, 1…
## $ satisfaction_rate <dbl> 94, 91, 98, 89, 83, 88, 82, 99, 99, 88, 94, 89, 90,…
## $ latitude          <dbl> 1.44255, 1.33235, 1.44246, 1.34541, 1.34567, 1.3470…
## $ longitude         <dbl> 103.7958, 103.7852, 103.7967, 103.9571, 103.9596, 1…
## $ amenities         <dbl> 9, 13, 10, 28, 25, 19, 24, 37, 36, 15, 17, 28, 16, …
## $ cleaning_fee      <dbl> 0, 0, 0, 56, 28, 28, 70, 0, 0, 65, 65, 42, 0, 55, 6…
## $ host_is_superhost <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TR…
## $ bedrooms          <dbl> 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, …
## $ bathrooms         <dbl> 1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 1.0, 1.0, 1.0, 0.0, 0…
## $ accommodates      <dbl> 1, 2, 1, 6, 3, 3, 6, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, …
## $ superhost         <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ distance          <dbl> 18.848617, 9.761806, 18.803281, 12.748842, 13.00218…
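
To make the encoding explicit: as.factor() orders character levels alphabetically and as.numeric() then returns the position of each level, so the numbers are labels rather than quantities. The sketch below shows this on a few room types. It also shows, as one possible alternative that the original analysis does not use, how model.matrix() would expand the same factor into dummy columns.

```r
# A few room_type values as they appear in the raw data.
room_type <- c("Private room", "Entire home/apt", "Shared room", "Private room")

room_factor <- as.factor(room_type)
room_levels <- levels(room_factor)       # levels are sorted alphabetically
room_code   <- as.numeric(room_factor)   # integer position of each level

# Alternative (not used in the original): one 0/1 column per non-reference level.
room_dummies <- model.matrix(~ room_factor)[, -1, drop = FALSE]
```

With only three room types present here, "Private room" is coded 2; in the full data it is coded 3 because a further level sorts ahead of it.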

Correlation Matrix

To prepare for modelling, a correlation matrix is created to see how the variables of interest (within the model) are related.

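# Drop latitude, longitude and the logical host_is_superhost (kept as numeric superhost)
data_modeling <- select(data_filtered, -latitude, -longitude, -host_is_superhost)
glimpse(data_modeling)

# cor() works directly on the all-numeric tibble; no %$% pipe is needed
corr <- round(cor(data_modeling), 2)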

ggcorrplot(corr, lab = TRUE, colors = c("indianred1", "white", "dodgerblue"), 
           show.legend = T, outline.color = "white", type = "lower", hc.order = T,  
           tl.cex = 10, lab_size = 3, sig.level = .2,
           title = "Correlation Matrix") 

Model Building

Mean Centering

Mean-centering is applied to the continuous predictors (price through bathrooms), including price, which enters the interaction term. The moderator, superhost, is kept as a 0/1 indicator.

data_centered <- data_modeling %>% 
  select(satisfaction_rate, # Outcome Variable
         price,
         neighbourhood, property_type, distance,
         room_type, region, number_of_reviews, accommodates,
         amenities, cleaning_fee, bedrooms, bathrooms,
         superhost) %>%  # Moderator
  # across() replaces the deprecated mutate_at()/funs() idiom
  mutate(across(price:bathrooms, ~ .x - mean(.x, na.rm = TRUE)))
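
Centering subtracts each variable's mean, so that a value of zero corresponds to an average listing. With an interaction in the model, this means the superhost coefficient is read at the average price rather than at a price of zero. A small base-R sketch on the first five price values from the output above:

```r
# First five prices from the glimpse() output after outlier removal.
price <- c(84, 80, 70, 167, 95)

price_centered <- price - mean(price, na.rm = TRUE)

# scale() with scale = FALSE performs the same centering; it returns a one-column matrix.
price_centered_alt <- as.numeric(scale(price, center = TRUE, scale = FALSE))
```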

Specify the Poisson Regression Model

The regression models are Poisson generalized linear models with a log link, rather than OLS. The first model regresses satisfaction rate on price and the comfort-factor variables (model1).

\[ \begin{aligned} \log(\widehat{satisfactionrate}) = {} & intercept + b_1 region + b_2 neighbourhood + b_3 propertytype + b_4 roomtype + b_5 bathrooms + \\ & b_6 amenities + b_7 numberofreviews + b_8 cleaningfee + b_9 bedrooms + b_{10} accommodates + \\ & b_{11} superhost + b_{12} distance + b_{13} price \end{aligned} \]

The key investigation lies in the next model, in which satisfaction rate is regressed on price and the comfort-factor variables together with the price by superhost interaction term (model2).

\[ \begin{aligned} \log(\widehat{satisfactionrate}) = {} & intercept + b_1 region + b_2 neighbourhood + b_3 propertytype + b_4 roomtype + b_5 bathrooms + \\ & b_6 amenities + b_7 numberofreviews + b_8 cleaningfee + b_9 bedrooms + b_{10} accommodates + \\ & b_{11} superhost + b_{12} distance + b_{13} price + b_{14} (superhost \times price) \end{aligned} \]
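
The code that fits model1 and model2 is not shown. Assuming both mirror the Poisson glm() call used for the stone model in the Probing Interactions section, the two specifications can be sketched on simulated data as follows. The simulated data frame sim and its coefficients are made up for illustration, and only a subset of the predictors is included to keep the example short.

```r
set.seed(1)

# Simulated stand-in for data_centered (all values are made up).
n <- 500
sim <- data.frame(
  price     = rnorm(n, mean = 0, sd = 50),     # mean-centered price
  superhost = rbinom(n, size = 1, prob = 0.2), # 0/1 moderator
  amenities = rnorm(n, mean = 0, sd = 5)
)
sim$satisfaction_rate <- rpois(
  n,
  lambda = exp(4 - 0.001 * sim$price + 0.2 * sim$superhost +
                 0.001 * sim$price * sim$superhost + 0.01 * sim$amenities)
)

# Main effects only.
model1_sim <- glm(satisfaction_rate ~ price + superhost + amenities,
                  data = sim, family = poisson)

# price * superhost expands to price + superhost + price:superhost.
model2_sim <- glm(satisfaction_rate ~ price * superhost + amenities,
                  data = sim, family = poisson)

# Likelihood-ratio (deviance) test of the added interaction term.
lr_test <- anova(model1_sim, model2_sim, test = "Chisq")
```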

The anova function is used to test whether model2, with the interaction term, improves on model1.

anova(model1, model2, test="Chisq") 

Resid. Df   Resid. Dev   Df   Deviance   Pr(>Chi)
6.96e+03    3.56e+05
6.96e+03    3.55e+05     1    602        5.27e-133

The results suggest that adding the interaction term significantly reduces the residual deviance, so model2 fits the data better than model1.
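
The p-value in the table comes from comparing the drop in deviance with a chi-squared distribution. As a worked check using only figures reported in this document: the AIC values of the two models differ by 382404.962 - 381804.653 = 600.309, and since model2 has one extra parameter (an AIC penalty of 2), the deviance drop is about 602.3 on 1 degree of freedom.

```r
aic_model1   <- 382404.962  # Main Effects Only, from the export_summs() table
aic_model2   <- 381804.653  # Price X Superhost
extra_params <- 1           # the price:superhost term

# AIC = -2 * logLik + 2 * k, so the deviance drop equals the AIC difference
# plus 2 for each extra parameter.
deviance_drop <- (aic_model1 - aic_model2) + 2 * extra_params

# Upper-tail chi-squared probability on 1 degree of freedom.
p_value <- pchisq(deviance_drop, df = extra_params, lower.tail = FALSE)
```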

Multicollinearity Check

Check for multicollinearity in Model 1 and Model 2 using vif().

vif(model1);vif(model2)
##             price         superhost     neighbourhood         room_type 
##          2.115065          1.140453          1.049608          1.799833 
##     property_type          distance number_of_reviews      accommodates 
##          1.056309          3.348875          1.098825          1.608353 
##         amenities      cleaning_fee            region          bedrooms 
##          1.263925          1.175992          3.204368          1.657571 
##         bathrooms 
##          1.228814
##             price         superhost     neighbourhood         room_type 
##          2.593531          1.178599          1.049829          1.828312 
##     property_type          distance number_of_reviews      accommodates 
##          1.060624          3.363900          1.104111          1.604539 
##         amenities      cleaning_fee            region          bedrooms 
##          1.285326          1.179046          3.212724          1.658764 
##         bathrooms   price:superhost 
##          1.236818          1.452826

All VIF values are below 5, which indicates no serious multicollinearity.
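
For intuition, in a linear model the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. A base-R sketch on simulated predictors (made-up data, with hypothetical names x1 to x3):

```r
set.seed(2)
n  <- 300
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.7)  # deliberately correlated with x1
x3 <- rnorm(n)                       # unrelated to the other two

# VIF of x1: regress it on the remaining predictors and use the R-squared.
r2_x1  <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2_x1)

# VIF of x3 should stay close to 1 because x3 is independent of x1 and x2.
r2_x3  <- summary(lm(x3 ~ x1 + x2))$r.squared
vif_x3 <- 1 / (1 - r2_x3)
```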

Report the Results with kable in R Markdown

The following shows the variable level information of Model 2.

library(knitr)
library(kableExtra)
kable(tidy(model2))%>%
  kable_paper("striped", full_width = F) %>%
  column_spec(c(1, 5), bold = T) %>%
  row_spec(c(2, 4, 6, 8, 10, 12, 14), bold = T, color = "white", background = "blue")
term estimate std.error statistic p.value
(Intercept) 3.9088136 0.0018790 2080.315414 0.0000000
price -0.0010250 0.0000333 -30.815984 0.0000000
superhost 0.2611071 0.0042999 60.723561 0.0000000
neighbourhood 0.0003497 0.0001761 1.985555 0.0470827
room_type 0.0505853 0.0021399 23.639313 0.0000000
property_type -0.0075858 0.0002314 -32.779606 0.0000000
distance -0.0058162 0.0007120 -8.169097 0.0000000
number_of_reviews 0.0044253 0.0000355 124.823633 0.0000000
accommodates 0.0266563 0.0008344 31.945657 0.0000000
amenities 0.0141954 0.0001984 71.561580 0.0000000
cleaning_fee 0.0005054 0.0000465 10.875236 0.0000000
region -0.0104208 0.0025757 -4.045759 0.0000522
bedrooms -0.1008835 0.0028799 -35.030087 0.0000000
bathrooms 0.0453119 0.0013558 33.420752 0.0000000
price:superhost 0.0012241 0.0000495 24.709793 0.0000000
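
Because the model uses a log link, coefficients act multiplicatively on the expected satisfaction rate. Using the estimates in the table above, the price slope is -0.0010250 for non-superhosts and -0.0010250 + 0.0012241 = 0.0001991 for superhosts. The sketch below converts these slopes into the multiplicative change per SGD 100 increase in price; the SGD 100 step is an arbitrary choice for illustration.

```r
b_price       <- -0.0010250  # price (slope when superhost = 0)
b_interaction <-  0.0012241  # price:superhost

slope_non_superhost <- b_price
slope_superhost     <- b_price + b_interaction

step <- 100  # SGD; arbitrary illustration step

# Multiplicative change in expected satisfaction rate per 100 SGD.
ratio_non_superhost <- exp(slope_non_superhost * step)
ratio_superhost     <- exp(slope_superhost * step)
```

Under these estimates, the expected rating falls by roughly 10% per SGD 100 for non-superhosts and rises by roughly 2% for superhosts, holding the other predictors fixed.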

The following shows the model level information of Model 2.

kable(glance(model2))%>%
  kable_paper("striped", full_width = F) %>%
  column_spec(c(2, 4, 6, 8), bold = T, color = "white", background = "blue")
null.deviance df.null logLik AIC BIC deviance df.residual nobs
393806.2 6972 -190887.3 381804.7 381907.4 355166.8 6958 6973

Visualize

Price x Superhost

To visualize the regression fitted above, the model's predictions are stored over a grid of price and superhost values, with all other (mean-centered) predictors held at zero, that is, at their means.

predicted_Y1 <- data_centered %>%  
  modelr::data_grid(price, 
                    superhost, 
                    neighbourhood = 0,
                    region =0,
                    property_type = 0, 
                    number_of_reviews = 0,
                    accommodates = 0,
                    amenities = 0,
                    cleaning_fee = 0,
                    bedrooms = 0,
                    bathrooms = 0,
                    room_type = 0,
                    distance = 0) %>% 
  mutate(pred_SR1 = predict(model2, . , type="response"))

Undo the centering of variable (price).

predicted_Y1 <- predicted_Y1 %>% 
  mutate(price = price + mean(data_modeling$price)
  )

The following figure shows two lines, one per superhost status, illustrating how the relationship between satisfaction rate and price differs between superhosts and non-superhosts.
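
The plotting code for this figure is not shown. A minimal ggplot2 sketch, assuming a data frame with the columns price, superhost and pred_SR1 as built above, might look as follows. The toy predictions are generated from the reported Model 2 intercept, price, superhost and interaction estimates, and are for illustration only.

```r
library(ggplot2)

# Toy stand-in for predicted_Y1: a centered price grid at both superhost levels.
toy_pred <- expand.grid(price = seq(-100, 200, by = 50), superhost = c(0, 1))
toy_pred$pred_SR1 <- exp(3.9088136 - 0.0010250 * toy_pred$price +
                           0.2611071 * toy_pred$superhost +
                           0.0012241 * toy_pred$price * toy_pred$superhost)

# One line per superhost status.
interaction_fig <- ggplot(toy_pred,
                          aes(x = price, y = pred_SR1,
                              colour = factor(superhost),
                              group = factor(superhost))) +
  geom_line() +
  labs(x = "Price (SGD, mean-centered)",
       y = "Predicted satisfaction rate",
       colour = "Superhost")
```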

Export Sums

export_summs(model1, model2, 
             error_format = "(t = {statistic}, p = {p.value})",
             align = "right",
             model.names = c("Main Effects Only", "Price X Superhost"),
             digits = 3)
                    Main Effects Only           Price X Superhost
(Intercept)         3.912 ***                   3.909 ***
                    (t = 2091.684, p = 0.000)   (t = 2080.315, p = 0.000)
price               -0.001 ***                  -0.001 ***
                    (t = -23.020, p = 0.000)    (t = -30.816, p = 0.000)
superhost           0.274 ***                   0.261 ***
                    (t = 64.816, p = 0.000)     (t = 60.724, p = 0.000)
neighbourhood       0.000 *                     0.000 *
                    (t = 2.314, p = 0.021)      (t = 1.986, p = 0.047)
room_type           0.056 ***                   0.051 ***
                    (t = 26.403, p = 0.000)     (t = 23.639, p = 0.000)
property_type       -0.007 ***                  -0.008 ***
                    (t = -31.750, p = 0.000)    (t = -32.780, p = 0.000)
distance            -0.007 ***                  -0.006 ***
                    (t = -9.638, p = 0.000)     (t = -8.169, p = 0.000)
number_of_reviews   0.004 ***                   0.004 ***
                    (t = 123.541, p = 0.000)    (t = 124.824, p = 0.000)
accommodates        0.026 ***                   0.027 ***
                    (t = 30.988, p = 0.000)     (t = 31.946, p = 0.000)
amenities           0.015 ***                   0.014 ***
                    (t = 74.408, p = 0.000)     (t = 71.562, p = 0.000)
cleaning_fee        0.001 ***                   0.001 ***
                    (t = 10.869, p = 0.000)     (t = 10.875, p = 0.000)
region              -0.007 **                   -0.010 ***
                    (t = -2.709, p = 0.007)     (t = -4.046, p = 0.000)
bedrooms            -0.102 ***                  -0.101 ***
                    (t = -35.530, p = 0.000)    (t = -35.030, p = 0.000)
bathrooms           0.047 ***                   0.045 ***
                    (t = 35.215, p = 0.000)     (t = 33.421, p = 0.000)
price:superhost                                 0.001 ***
                                                (t = 24.710, p = 0.000)
N                   6973                        6973
AIC                 382404.962                  381804.653
BIC                 382500.859                  381907.400
Pseudo R2           0.996                       0.996
*** p < 0.001; ** p < 0.01; * p < 0.05.

Probing Interactions.

stone <- glm(satisfaction_rate ~ price * superhost + neighbourhood + room_type +
               property_type + distance + number_of_reviews + accommodates +
               amenities + cleaning_fee + region + bedrooms + bathrooms,
             data = data_centered, family = poisson)

Run a simple slopes analysis, first without and then with the Johnson-Neyman interval.

sim_slopes(stone,
           data = data_modeling,
           pred = price, 
           modx = superhost,
           johnson_neyman = F)
## SIMPLE SLOPES ANALYSIS 
## 
## Slope of price when superhost = 0.00 (0): 
## 
##    Est.   S.E.   z val.      p
## ------- ------ -------- ------
##   -0.00   0.00   -30.82   0.00
## 
## Slope of price when superhost = 1.00 (1): 
## 
##   Est.   S.E.   z val.      p
## ------ ------ -------- ------
##   0.00   0.00     4.26   0.00
sim_slopes(stone,
           data = data_modeling,
           pred = price, 
           modx = superhost,
           johnson_neyman = T)
## JOHNSON-NEYMAN INTERVAL 
## 
## When superhost is OUTSIDE the interval [0.78, 0.91], the slope of price is
## p < .05.
## 
## Note: The range of observed values of superhost is [0.00, 1.00]
## 
## SIMPLE SLOPES ANALYSIS 
## 
## Slope of price when superhost = 0.00 (0): 
## 
##    Est.   S.E.   z val.      p
## ------- ------ -------- ------
##   -0.00   0.00   -30.82   0.00
## 
## Slope of price when superhost = 1.00 (1): 
## 
##   Est.   S.E.   z val.      p
## ------ ------ -------- ------
##   0.00   0.00     4.26   0.00

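The result indicates that when superhost lies outside the interval 0.78 to 0.91, the slope of price is significant (p < 0.05). Because superhost only takes the values 0 and 1, both of which fall outside this interval, the price slope is significant for non-superhosts and superhosts alike.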

Run interact_plot() with a benchmark added for the regions of significance.

Interpretation of the Results

From the analysis above, the following findings are inferred:

  1. For superhosts, guest satisfaction rate increases slightly with price, whereas for non-superhosts it decreases with price. Superhost status is also associated with higher satisfaction overall. Achieving superhost status therefore appears worthwhile for Airbnb hosts, as it is associated with better guest experiences and may indirectly increase booking demand for their listings.

  2. The earlier exploratory data analysis suggests that price depends strongly on room type and region. Entire homes and apartments tend to be more expensive because a larger space is rented. The Central Region has the most listings, and most of the popular neighbourhoods are also located there.

Limitations

Airbnb has a fairly complex regulatory position in Singapore. For the past five years, the Singapore government has treated the short-term rentals offered through Airbnb as an illegal service. This study uses a dataset scraped in June 2020, during the Covid-19 pandemic. With the negative impact of Covid-19 on tourism and government restrictions on Airbnb's business model, many hosts may have left the platform. In addition, the study is based on short-term rentals, so many observations have been filtered out.

References