1 Introduction

Airbnb, a peer-to-peer sharing platform that enables the short-term rental of private rooms or homes by individuals to potential guests, is increasingly popular with tourists. As of 30 September 2020, Airbnb has operations in 100,000 cities with over 4 million hosts providing 5.6 million active listings and 800 million guest arrivals since its launch (Airbnb, 2021).

Airbnb launched its Asia Pacific headquarters in Singapore in November 2012, but has had hosts in Singapore since as early as 2009. Local Airbnb stays are regulated by the Urban Redevelopment Authority (URA) and the Housing & Development Board (HDB). The authorities consulted the public and key stakeholders from 2015 to explore a regulatory framework for short-term accommodation (Channel News Asia, 2018), but maintained the regulatory status quo in May 2019 (Co, 2019). The minimum stay is three months for private property and six months for HDB flats. Strict penalties are enforced, including fines of up to $200,000 for first-time offenders, with additional fines and a possible jail term for repeat offenders.

Price is an important reason why people choose Airbnb accommodation over typical alternatives (hotels, hostels, bed and breakfasts). They may also value the experience and cultural exchange with hosts, which bring more local flavour and custom into their travels. Some may also be longer-term travellers with specific purposes (e.g. food tourism, medical tourism, concerts). It is therefore important to understand the characteristics of Airbnb listings that make them attractive to users.

Hedonic pricing theory states that the price of a good does not derive from the good itself, but is a function of the attributes or characteristics that contribute to the consumer's utility. Consumers then try to select the combination of attributes that maximises their utility under a budget constraint (Lancaster, 1966). An Airbnb listing has various attributes (e.g. location, host attributes, room type) that provide consumers with value and influence the overall quality of the good. A hedonic pricing model uses multiple regression analysis to estimate how much each attribute contributes to the price of the good.
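As a sketch in our own notation (the symbols below are illustrative and are not taken from Lancaster), a linear hedonic price model for listing $i$ with $K$ attributes can be written as:

```latex
P_i = \beta_0 + \sum_{k=1}^{K} \beta_k x_{ik} + \varepsilon_i
```

Here $P_i$ is the listing price, $x_{ik}$ is the value of attribute $k$ for listing $i$, $\beta_k$ is the implicit (marginal) price of that attribute, and $\varepsilon_i$ is an error term.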

Understanding the hedonic price of Airbnb listings would give hosts and would-be hosts insights on how to price their listings. It would also help researchers and policy makers who are trying to understand how Airbnb pricing may affect adjacent areas such as hotel revenues, the pricing of long-term rentals and the housing market, as well as the gentrification of neighbourhoods.

The literature review from the previous part of this project showed that very little research has been done on Airbnb in Asia. Recent research has looked at the price determinants of Airbnb listings in Hong Kong, and compared the price determinants across 36 cities in China. However, these studies are limited in how they analyse the spatial variation that may influence the pricing variables: they use a generalised linear model and static spatial information (e.g. distance to a fixed city centre) to estimate location effects.

This study aims to examine the hedonic pricing of Airbnb listings in Singapore, using a geographically weighted regression (GWR) and a multiscale geographically weighted regression (MGWR). We also compare the results with those obtained by the Ordinary Least Squares (OLS) method typically used to estimate price determinants.

This report is structured as follows: firstly, we review literature related to hedonic pricing research in the Airbnb context, GWR-related studies for Airbnb, and examine the context for our study. The next section sets out our methodology and data used. We then present and discuss our results. Finally, we conclude with a summary of results and implications of our study.

2 Literature Review

2.1 Pricing determinants in the Airbnb space

The research on pricing determinants for Airbnb is still fairly sparse compared to other research on the motivations of Airbnb consumers (Guttentag, 2015; Mohlmann, 2015) and hosts (Ert et al., 2016; Karlsson and Dolnicar, 2016; Li et al., 2015), geographical, social and economic variables that explain the penetration of Airbnb in various cities (Quattrone et al., 2018; Quattrone et al., 2016; Lagonigro, Martori, & Apparicio, 2020), or the economic effects of sharing economy based accommodation services (Fang et al., 2015; Li et al., 2015; Zervas et al., 2015).

Recently, there has been an increased focus on hedonic pricing models of Airbnb (Chen & Xie, 2017; Gibbs, Guttentag, Gretzel, Morton et al., 2018; Gibbs, Guttentag, Gretzel, Yao et al., 2018; Wang & Nicolau, 2017; Cai et al., 2019). These studies explore the impact on pricing of the more unique characteristics of sharing economy based accommodation rentals (e.g. host attributes, site and property attributes, reviews). Although the research focused on different cities, some determinants were common to all studies and consistent in their effects, while others behaved differently depending on the city or area studied.

Most of these studies centre on Western cities (e.g. San Francisco; New York City; Tallinn, Estonia; Canada). There are only a couple of studies on hedonic pricing of Airbnb in Asia – one examining the price determinants of Airbnb in Hong Kong (Cai et al. 2019) and another study on the price determinants of 36 cities in China (Wu and Qiu 2019). Both used OLS regression and quantile regression models to analyse selected variables. The study by Wu and Qiu (2019) is not available in English and therefore the results have not been discussed here.

The price determinants of Airbnb listings from various studies and their effects can be categorised into five groups of explanatory variables: (a) listing attributes, (b) host attributes, (c) listing reputation, (d) rental policies and (e) listing location.

Cai et al. (2019) summarised the findings from previous studies and used selected variables to analyse the price determinants of Airbnb listings in Hong Kong. We highlight only the significant determinants from prior research, to help guide the hedonic price model in this study.

2.1.1 Listing attributes

Chen and Xie (2017), Ert et al. (2016), Gibbs et al. (2018), Kakar et al. (2016) and Wang & Nicolau (2017) have shown that room type, accommodation type, number of bedrooms and number of bathrooms had a significant positive impact on the Airbnb listing price. Other listing attributes with positive effects include the number of accommodation photos, having a real bed, wifi provision, and property amenities (parking, pool, gym). Conversely, provision of free breakfast had a negative effect on price. The findings of Cai et al. (2019) for Hong Kong Airbnb listings did not differ from the above. A finding unique to the Hong Kong market was that the coefficients of room types were much higher than in the other studies: the prices of an entire home and a private room were respectively 376.4% and 174.6% higher than that of a shared room.

2.1.2 Host attributes

Chen and Xie (2017), Ert et al. (2016) and Wang and Nicolau (2017) indicated that the host’s listing count, host verification, host profile picture and response time had a positive impact on price, whilst non-white hosts had lower prices on Airbnb listings in San Francisco. Having superhost status or two or more listings (“professional hosts”) had mixed effects on the price of the listing. Gibbs et al. (2018) and Wang & Nicolau (2017) showed that professional hosts had higher listing prices than hosts with a single listing; however, Li et al. (2015) found that they did not have significantly higher prices than “non-professional hosts”, although they did have a higher daily revenue due to higher occupancy of the listings. Conversely, Cai et al. (2019) found that higher listing counts had a negative impact on listing price in Hong Kong; they attributed this to a higher percentage of hosts having multiple listings, which leads to a more competitive market and hence lower listing prices.

2.1.3 Listing reputation

Most of the studies showed that the number of reviews has a negative effect on listing price but a positive effect on the daily revenue and occupancy rate of each property (J. Li et al., 2015). Gibbs et al. (2018) postulated that this is because demand is greater for cheaper listings and the number of reviews is an indication of demand, hence the negative relationship. The rating score on value also has a negative impact, which may follow from the same reasoning. Customer ratings have a mixed effect on pricing: the overall rating and the rating on communication have both positive (Chen & Xie, 2017; Gibbs et al., 2018; Wang & Nicolau, 2017; Cai et al., 2019) and negative (J. Li et al., 2015) impacts, while ratings on cleanliness and location have a positive impact.

2.1.4 Rental policies

The same studies showed that a strict cancellation policy and guests’ phone verification correspond to a higher listing price, while instant bookable listings and smoking permissions correspond to a lower listing price.

2.1.5 Listing location

Closeness to the city centre (the distance between the listing and a point denoted as the city centre, measured using the Haversine formula) corresponds to a higher listing price. The number of Airbnb listings in the same district, the price of surrounding Airbnb listings, the number of points of interest (POIs) in the surrounding area, and proximity to sightseeing, food, shopping or coastal areas all contribute to an increased listing price, whereas the density of hotels corresponds to a lower listing price. Higher median gross rent in the district also corresponds to a higher Airbnb price. Cai et al. (2019) discovered that highly priced Airbnb listings in Hong Kong are not sensitive to location factors, but low and medium priced listings are. Location factors included listing density and distance from the city centre, malls or tourist attractions.
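For reference, the Haversine (great-circle) distance mentioned above can be sketched in base R as follows. The function name `haversine_km` and the mean Earth radius of 6371 km are our own choices for illustration and are not taken from the cited studies.

```r
# Hypothetical helper: great-circle distance in km between two points
# given in decimal degrees, using the Haversine formula.
# Assumption: a spherical Earth with mean radius 6371 km.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# One degree of longitude along the equator
one_degree <- haversine_km(0, 0, 0, 1)
```

The function is vectorised, so it can be applied to whole columns of listing coordinates against a single city-centre point.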

2.1.6 Limitations of the above research

Most of the above research uses OLS and quantile regression to obtain the hedonic pricing model. Many studies assumed that the study area had homogeneous geographical characteristics, and typically used Euclidean distances to a set point (typically the city centre) to account for location factors. However, spatial characteristics that influence pricing (e.g. desirable districts, proximity to transport links and amenities) differ across a city, and these differences are not captured by such studies. Instead, we should look to a Geographically Weighted Regression (GWR) to include spatial variation.

2.2 Geographically Weighted Regression and Multiscale Geographically Weighted Regression

Tobler’s First Law of Geography states that “everything is related to everything else, but near things are more related than distant things” (Tobler, 1970). In the Airbnb context, hosts will be influenced by the pricing strategies of the hosts of other listings, especially nearby ones. Beyond some threshold distance this influence no longer applies; that distance denotes the scale of operation.

Geographically Weighted Regression (GWR) includes spatial variation through a localised estimation of the pricing variables – i.e. it captures geographical variance in regression estimates across space (Brunsdon, Fotheringham & Charlton, 1998). It computes localised regression coefficients of given explanatory variables for a given location, using neighbouring locations, as defined by a search bandwidth.
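In the standard formulation of GWR (written here in our own notation, with $(u_i, v_i)$ the coordinates of listing $i$), the model and its local estimator are:

```latex
y_i = \beta_0(u_i, v_i) + \sum_{k=1}^{K} \beta_k(u_i, v_i)\, x_{ik} + \varepsilon_i,
\qquad
\hat{\boldsymbol{\beta}}(u_i, v_i) = \left( X^{\top} W_i X \right)^{-1} X^{\top} W_i\, y
```

$W_i$ is a diagonal matrix of weights that decline with distance from location $i$ according to the chosen kernel and bandwidth, so that nearby observations have more influence on the local coefficients.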

Studies have started using GWR in finding the price determinants of Airbnb pricing: Zhang et al. (2017) used both a general linear model (GLM) and GWR to identify key factors affecting Airbnb listing prices in Metro Nashville, Tennessee. They discovered that the distance to the city centre (represented by the convention centre) and number of reviews are negatively related to the Airbnb listing price in most regions of Metro Nashville. The Airbnb listing price is more sensitive to the distance to the city centre in the central areas, than in other areas. The distance to the nearest highway, review ratings and age of the listing correlate positively or negatively depending on the area of the listing.

Xu et al. (2019) studied the spatial distribution of Airbnb listings in London, using kernel density estimation to calculate the density of point features and then modelling it with both OLS and GWR. They showed that Airbnb listings were mainly located in the city centre and around tourist attractions. Travel and transport links, university locations, nightlife spots and tourist attractions (museums, monuments, etc.) were among the factors that influenced the location of Airbnb listings.

Voltes-Dorta & Sanchez-Medina (2020) studied the drivers of Airbnb prices in Bristol using both OLS and GWR methods, and also differentiated between different property and room types in their analysis. Their results show that the number of bathrooms is a positive and significant price determinant for entire properties and apartments, but less so for houses and private rooms. The maximum capacity of the listing is significantly more important for houses than apartments.

They performed GWR analysis on entire apartments and entire houses due to better goodness-of-fit. They discovered that prices for apartments were positively influenced by their distance to bus stops, while house prices were positively influenced by their distance to bus or train stations. As expected, the listing price is negatively related to the distance to the city centre and to tourist attractions. GWR showed that this effect is more pronounced for listings in the western neighbourhoods. They also showed that hosts with multiple listings charge higher prices, and that host experience is only significant for houses. The GWR model revealed a higher degree of market power in more affluent neighbourhoods northwest of the city centre. Other significant findings include a positive relationship with annual availability, and a negative relationship with the number of reviews and ratings for houses.

All of these studies found that the GWR model performs better than GLM models in terms of accuracy (higher R2 and adjusted R2, lower AIC) and variable selection. However, GWR is still limited as it assumes that all the explanatory variables have a similar threshold distance or scale of operation (Fotheringham, Yang & Kang, 2017). For example, cancellation policy and superhost status are likely to vary at a global level (with respect to the study area), whereas variables like number of bedrooms, distance to tourist attractions and proximity to transport links are more likely to vary at the local level. Fotheringham et al. (2017) proposed a multiscale geographically weighted regression (MGWR), which uses a back-fitting approach to derive a separate bandwidth for each explanatory variable, allowing each variable to have its own scale of operation.
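In the notation commonly used for MGWR (a sketch; $bw_j$ denotes the bandwidth used to calibrate the $j$-th coefficient surface, and $x_{i0} = 1$ gives the intercept), the model can be written as:

```latex
y_i = \sum_{j=0}^{K} \beta_{bw_j}(u_i, v_i)\, x_{ij} + \varepsilon_i
```

GWR is the special case in which every $bw_j$ takes the same value.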

Two studies have used MGWR in the context of Airbnb listings, as detailed below:

Hong & Yoo (2020) examined the pricing determinants of Airbnb listings in New York City and Los Angeles using OLS and MGWR models. OLS results showed that reputation variables (ratings, number of reviews) have a negative effect on the price. However, superhost status and the length of time that the listing has been active have a positive impact on the price.

As expected, the MGWR model has better explanatory power than the OLS model. The MGWR model identified cancellation policy, number of reviews and distance to tourist destinations as global variables (i.e. these variables affect listing price across the city as a whole) for both cities. The number of bedrooms was identified as a local variable, with a bandwidth of about 1.3km in both cities (i.e. when deciding how to price their listings, hosts only referred to listings within 1.3km with respect to the number of bedrooms on offer). They also discovered that while the OLS model gives a negative relationship between price and review ratings, the MGWR model did not give a statistically significant relationship between the two.

Shabrina, Buyuklieva and Ng (2020) studied the relationship between Airbnb locations in London and elements of urban tourism (hotel locations, food and beverage venues, access to public transport). They used the OLS model as a baseline and compared it with GWR and MGWR models of the same variables. Their results show that both GWR and MGWR models perform better than the OLS model, with the MGWR giving slightly better results than the GWR model (in terms of AICc). The MGWR also provides a larger bandwidth for the different parameter estimates and hence a larger degree of clustering.

3 Packages Used

Set up

The following code chunk loads the packages required for the GWR; it will also install the packages if they have not been installed. Table 1 shows the different packages used in this study:

table1 <- read_csv("data/tablepackages.csv")
table1

## # A tibble: 13 x 3
##    Type         Package         Usage                                           
##    <chr>        <chr>           <chr>                                           
##  1 Data Explor~ tidyverse       Data manipulation & wrangling                   
##  2 Data Explor~ lubridate       Manipulating date-time data                     
##  3 Data Explor~ knitr           knit R-Markdown document, with code to show spe~
##  4 Spatial Data sf (Simple Fea~ Read and manipulate spatial data for analysis   
##  5 Spatial Data tmap            Graphing and mapping spatial data               
##  6 Spatial Data leaflet         Graphing and mapping spatial data               
##  7 Spatial Data gridExtra       Customise display of graphs and plots (in a gri~
##  8 Spatial Data rgdal           Provides access to projection/transformation op~
##  9 Spatial Data maptools        Manipulating geographic data                    
## 10 Spatial Data raster          Manipulating raster data                        
## 11 Spatial Data tmaptools       Reading and mapping spatial data                
## 12 Modelling    GWmodel         Perform geographically weighted regression mode~
## 13 Modelling    olsrr           Tools to build OLS regression models and collin~

# table1 %>% kbl() %>% kable_styling(bootstrap_options = c("striped")) %>% collapse_rows(columns = 1:2)
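The set-up chunk described above is not reproduced in this report. A minimal sketch of how it might look is given below; the helper `load_or_install` is a hypothetical name of our own, and the package list is copied from Table 1. Note that rgdal and maptools were archived on CRAN in 2023, so on a current R installation they may not install from the default repository.

```r
# Packages listed in Table 1
pkgs <- c("tidyverse", "lubridate", "knitr", "sf", "tmap", "leaflet",
          "gridExtra", "rgdal", "maptools", "raster", "tmaptools",
          "GWmodel", "olsrr")

# Hypothetical helper: attach each package, optionally installing it first.
# Returns (invisibly) the names of packages that could not be attached.
load_or_install <- function(pkgs, install = TRUE) {
  failed <- character(0)
  for (p in pkgs) {
    if (install && !requireNamespace(p, quietly = TRUE)) {
      try(install.packages(p), silent = TRUE)
    }
    ok <- suppressWarnings(
      require(p, character.only = TRUE, quietly = TRUE)
    )
    if (!ok) failed <- c(failed, p)
  }
  invisible(failed)
}

# load_or_install(pkgs)  # uncomment to run; needs a configured CRAN mirror
```

Returning the packages that failed to attach makes it easy to spot a missing dependency before the modelling chunks are run.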

4 Methodology

The literature review above, whilst not exhaustive, has shown that there are few studies using GWR, much less MGWR on pricing determinants of Airbnb listings in Asia. As such, our study aims to construct a hedonic pricing model using OLS, GWR and MGWR analysis, and to compare and discuss the results for Airbnb listings in Singapore.

5 Data

Data for this project were downloaded from InsideAirbnb on 29 September 2020; the Singapore dataset used was compiled on 22 June 2020.

From the research above, we identified variables that have been shown to be statistically significant in many markets and studies, including new ones used in Hong & Yoo (2020), grouped in 5 categories (Table 2).

table2 <- read_csv("data/tabledata.csv")
table2

## # A tibble: 13 x 4
##    Category     Variable         Description            `Data Source`           
##    <chr>        <chr>            <chr>                  <chr>                   
##  1 Listing att~ bathrooms        The number of bathroo~ Insideairbnb.com (detai~
##  2 Listing att~ bedrooms         The number of bedroom~ Insideairbnb.com (detai~
##  3 Listing att~ guests_included  The number of guests ~ Insideairbnb.com (detai~
##  4 Listing att~ Room type: priv~ Listing room type (pr~ Insideairbnb.com (detai~
##  5 Host attrib~ superhost        Superhost status of a~ Insideairbnb.com (detai~
##  6 Reputation   number_of_revie~ The number of reviews~ Insideairbnb.com (detai~
##  7 Rental poli~ Cancellation po~ The cancellation poli~ Insideairbnb.com (detai~
##  8 Geographica~ mrt_350m, mrt_7~ Number of mrt station~ Data.gov.sg https://dat~
##  9 Geographica~ tourdistindex    Distance index of tou~ Data.gov.sg https://dat~
## 10 Geographica~ hotels_200m, ho~ Number of hotels with~ Data.gov.sg https://dat~
## 11 Geographica~ hosp_500m, hosp~ Number of major hospi~ Health Hub (Ministry of~
## 12 Geographica~ mall_500m, mall~ Number of malls withi~ Wikipedia https://en.wi~
## 13 Geographica~ malldistindex    Distance index of 57 ~ Wikipedia https://en.wi~

# table2 %>% kbl() %>% kable_styling(bootstrap_options = c("striped")) %>% collapse_rows(columns = 1:2)

We used the sum of the listing price and cleaning fee as the price for each listing (total_price).
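The calculation of total_price can be sketched as below. This is an illustration only: we assume the raw price and cleaning fee are stored as currency strings and that a missing cleaning fee means no fee is charged; the toy rows are made up and `parse_price` is a hypothetical helper. The actual wrangling steps are those described in Appendix 1.

```r
# Hypothetical helper: convert strings such as "$1,250.00" to numeric
parse_price <- function(x) as.numeric(gsub("[$,]", "", x))

# Toy rows standing in for the raw listings table (values are made up)
raw <- data.frame(
  price        = c("$85.00", "$1,250.00", "$40.00"),
  cleaning_fee = c("$20.00", NA, "$0.00"),
  stringsAsFactors = FALSE
)

price        <- parse_price(raw$price)
cleaning_fee <- parse_price(raw$cleaning_fee)

# Assumption: a missing cleaning fee is treated as zero
cleaning_fee[is.na(cleaning_fee)] <- 0

raw$total_price <- price + cleaning_fee
```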

We use the number of MRT train station exits within a set distance (350m, 700m) as an indicator for transport links; these distances correspond to approximately 4- and 8-minute walks (as the crow flies). While bus stations are also prevalent, they may be less easy for tourists or visitors to use, and are therefore omitted from the study.
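Counting the points of interest within a set distance of each listing can be sketched in base R as follows. This assumes both sets of coordinates are already in a projected system measured in metres (for Singapore, typically SVY21); `count_within` and the toy coordinates are hypothetical and do not come from the project data.

```r
# Hypothetical helper: for each listing, count the points of interest
# (e.g. MRT exits) lying within radius_m metres in a straight line.
# Both inputs are two-column matrices of projected x/y coordinates in metres.
count_within <- function(listings_xy, poi_xy, radius_m) {
  vapply(seq_len(nrow(listings_xy)), function(i) {
    d <- sqrt((poi_xy[, 1] - listings_xy[i, 1])^2 +
              (poi_xy[, 2] - listings_xy[i, 2])^2)
    sum(d <= radius_m)
  }, integer(1))
}

# Toy coordinates in metres (made up for illustration)
listings_xy <- matrix(c(0, 0,
                        1000, 0), ncol = 2, byrow = TRUE)
mrt_xy <- matrix(c(100, 0,
                   0, 300,
                   0, 600,
                   1000, 100), ncol = 2, byrow = TRUE)

mrt_350m <- count_within(listings_xy, mrt_xy, 350)
mrt_700m <- count_within(listings_xy, mrt_xy, 700)
```

The same helper could be reused for the hotel, hospital and mall counts by changing the point set and the radius.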

Additionally we want to test if the number of hotels, malls and hospitals (in the case of healthcare tourism) nearby has any impact on Airbnb pricing. Our hypothesis is that the number of hotels, hospitals and malls would each positively correlate with the listing price.

Similar to Hong & Yoo (2020), we use a distance index of tourist attractions to reflect accessibility to multiple tourist destinations, as most visitors are likely to visit more than one tourist attraction. We also add a distance index of major shopping malls in the Orchard and Central areas, as shopping is one of the reasons tourists visit Singapore.

Information on data collation and wrangling can be found in Appendix 1 of the report.

5.1 Loading the Data

load("data/listings_gwr1_v3.RData")
load("data/listings_gwr3_v1.RData")
load("data/tmapGWR.RData")

We load the data that contains our dependent and independent variables for the regression analysis. listings_gwr1 contains the variables for 7272 listings. listings_gwr3 keeps only the listings that have at least one review (i.e. those that have had at least one guest), which leaves 4466 listings.

6 Correlation Analysis

# Select only variables that we will be using for the regression analysis
all_indepvar <- c("bathrooms", "bedrooms", "private", "entire", "shared", "number_of_reviews", "guests_included", "host_is_superhost", "mrt_350m", "mrt_700m", "tourdistindex", "hotels_250m", "hotels_500m", "hotels_1000m", "hotels_2000m", "hosp_500m", "hosp_1000m", "hosp_2000m", "hosp_5000m", "mall_500m", "mall_1000m", "mall_2000m", "malldistindex", "flexible", "moderate")

listings_cor <- listings_gwr1 %>% as.data.frame() %>% dplyr::select(all_of(all_indepvar))

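library(corrplot)  # provides cor.mtest() and corrplot()
res1 <- cor.mtest(listings_cor, conf.level = 0.95)  # pairwise p-values
corrmatrix <- cor(listings_cor)
# Blank out correlations that are not significant at the 5% level
corrplot(corrmatrix, method = "ellipse", type = "lower", p.mat = res1$p, sig.level = 0.05, insig = "blank")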

We use the package corrplot to plot the correlation between the different input variables. The above correlation plot shows that, at the 5% significance level:

- Entire home/apt listings are negatively correlated with private room listings.
- The number of MRT exits within 700m of a listing is highly positively correlated with the number of hotels at all distances from the listing.
- The number of MRT exits within 350m of a listing is slightly positively correlated with the number of hotels at all distances; it is also slightly positively correlated with hospitals within 5km and with malls at various distances from the listing.
- The tourist distance index is positively correlated with hotels and malls at various distances, as well as with the mall distance index. We can probably attribute this to hotels and malls being close to tourist attractions, with some of them being in the same compound (e.g. Marina Bay Sands hotel and Shoppes at Marina Bay Sands, Resorts World Sentosa).
- As expected, similar variables (e.g. hotels_250m, hotels_500m, hotels_1000m, hotels_2000m) are positively correlated with each other, to varying degrees.

7 Ordinary Least Squares Regression

The GWmodel package also reports a global Ordinary Least Squares (OLS) regression when a basic GWR is run. Here we fit the OLS model directly with lm() to use as a baseline.

OLS_all <- lm(total_price ~ bathrooms + bedrooms + private + entire + shared + number_of_reviews + guests_included + superhost + mrt_350m + mrt_700m + tourdistindex + hotels_250m + hotels_500m + hotels_1000m + hotels_2000m + hosp_500m + hosp_1000m + hosp_2000m + hosp_5000m + mall_500m + mall_1000m + mall_2000m + malldistindex + flexible + moderate, listings_gwr1)

summary(OLS_all)

## 
## Call:
## lm(formula = total_price ~ bathrooms + bedrooms + private + entire + 
##     shared + number_of_reviews + guests_included + superhost + 
##     mrt_350m + mrt_700m + tourdistindex + hotels_250m + hotels_500m + 
##     hotels_1000m + hotels_2000m + hosp_500m + hosp_1000m + hosp_2000m + 
##     hosp_5000m + mall_500m + mall_1000m + mall_2000m + malldistindex + 
##     flexible + moderate, data = listings_gwr1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -587.3   -96.2   -37.4    22.4 12572.7 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        60.1484    31.6313   1.902 0.057270 .  
## bathrooms           9.7494     5.3369   1.827 0.067772 .  
## bedrooms           47.1305     7.8654   5.992 2.17e-09 ***
## private           -29.8163    25.7181  -1.159 0.246350    
## entire            120.9772    25.4121   4.761 1.97e-06 ***
## shared            -92.4561    36.9137  -2.505 0.012279 *  
## number_of_reviews  -0.4296     0.1760  -2.441 0.014674 *  
## guests_included     8.9770     4.0869   2.197 0.028087 *  
## superhost          -7.4820    15.6635  -0.478 0.632898    
## mrt_350m           -9.6856     3.0364  -3.190 0.001429 ** 
## mrt_700m            1.1087     1.5314   0.724 0.469097    
## tourdistindex     700.9590   418.4449   1.675 0.093947 .  
## hotels_250m        -6.5100     1.9490  -3.340 0.000841 ***
## hotels_500m         0.7228     1.3191   0.548 0.583735    
## hotels_1000m        1.3396     0.6050   2.214 0.026853 *  
## hotels_2000m       -0.4457     0.2535  -1.758 0.078775 .  
## hosp_500m         -29.6752    19.3104  -1.537 0.124399    
## hosp_1000m          7.1051    11.7646   0.604 0.545906    
## hosp_2000m          2.8152     6.4144   0.439 0.660749    
## hosp_5000m         -7.7355     2.9552  -2.618 0.008873 ** 
## mall_500m          19.1369     4.5357   4.219 2.48e-05 ***
## mall_1000m         -8.3046     2.4674  -3.366 0.000767 ***
## mall_2000m          3.4345     1.0958   3.134 0.001729 ** 
## malldistindex     281.6339   528.1493   0.533 0.593879    
## flexible           58.9784    14.9603   3.942 8.15e-05 ***
## moderate          -12.9466    16.3513  -0.792 0.428518    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 463.5 on 7246 degrees of freedom
## Multiple R-squared:  0.05488,    Adjusted R-squared:  0.05162 
## F-statistic: 16.83 on 25 and 7246 DF,  p-value: < 2.2e-16

When we run the OLS model with all variables we get an AIC of 109,947.2. A summary of the model shows that the following explanatory variables are significant at the 5% significance level:
- number of bedrooms
- whether the listing is an entire apt/home
- whether the listing is a shared listing
- number of reviews of the listing
- number of guests included in the listing
- number of MRT exits within 350m of the listing
- number of hotels within 250m and 1,000m of the listing
- number of hospitals within 5km of the listing
- number of malls within 500m, 1km and 2km of the listing
- whether the cancellation policy is flexible.
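Reading significant terms off a long summary table by eye is error-prone, so they can also be extracted programmatically. The sketch below uses a hypothetical helper, `significant_terms`, and demonstrates it on R's built-in mtcars data rather than on the Airbnb listings.

```r
# Hypothetical helper: names of non-intercept terms with p-value below alpha
significant_terms <- function(model, alpha = 0.05) {
  coefs <- summary(model)$coefficients
  coefs <- coefs[rownames(coefs) != "(Intercept)", , drop = FALSE]
  rownames(coefs)[coefs[, "Pr(>|t|)"] < alpha]
}

# Demonstration on the built-in mtcars data (not the Airbnb listings)
demo_fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
demo_sig <- significant_terms(demo_fit)
```

Applied to the model above, the call would be `significant_terms(OLS_all)`.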

We check for multicollinearity using Variance Inflation Factors (VIF). The VIF score of an independent variable represents how well that variable is explained by the other independent variables. When VIF = 1, there is no correlation between the selected independent variable and the other independent variables. A VIF between 1 and 5 indicates moderate collinearity. A VIF greater than 5 (or, by a looser rule of thumb, 10) indicates high multicollinearity between the selected independent variable and the other independent variables. The table below summarises the VIF scores, together with the tolerance (the reciprocal of the VIF).
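To make the definition concrete, the VIF of predictor j is 1 / (1 - R²_j), where R²_j is the R-squared from regressing predictor j on all the other predictors. The sketch below computes this by hand on the built-in mtcars data; `vif_manual` is a hypothetical helper of our own and is not part of olsrr.

```r
# Hypothetical helper: VIF_j = 1 / (1 - R^2_j), where R^2_j comes from
# regressing predictor j on all the other predictors in the model
vif_manual <- function(model) {
  X <- model.matrix(model)[, -1, drop = FALSE]  # drop the intercept column
  vapply(colnames(X), function(v) {
    others <- X[, colnames(X) != v, drop = FALSE]
    r2 <- summary(lm(X[, v] ~ others))$r.squared
    1 / (1 - r2)
  }, numeric(1))
}

# Demonstration on the built-in mtcars data (not the Airbnb listings)
demo_fit  <- lm(mpg ~ disp + hp + wt, data = mtcars)
vif       <- vif_manual(demo_fit)
tolerance <- 1 / vif  # tolerance is the reciprocal of the VIF
```

For numeric predictors these values coincide with the diagonal of the inverse of the predictors' correlation matrix, which gives a quick cross-check.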

ols_vif_tol(OLS_all)

##            Variables  Tolerance       VIF
## 1          bathrooms 0.76485182  1.307443
## 2           bedrooms 0.63078787  1.585319
## 3            private 0.18237565  5.483188
## 4             entire 0.18350597  5.449414
## 5             shared 0.61739411  1.619711
## 6  number_of_reviews 0.94741147  1.055508
## 7    guests_included 0.68518051  1.459469
## 8          superhost 0.90629178  1.103397
## 9           mrt_350m 0.46594336  2.146184
## 10          mrt_700m 0.25350187  3.944744
## 11     tourdistindex 0.24779875  4.035533
## 12       hotels_250m 0.19449401  5.141547
## 13       hotels_500m 0.08990391 11.122986
## 14      hotels_1000m 0.08306211 12.039184
## 15      hotels_2000m 0.10137748  9.864124
## 16         hosp_500m 0.60019874  1.666115
## 17        hosp_1000m 0.27871890  3.587844
## 18        hosp_2000m 0.18423176  5.427946
## 19        hosp_5000m 0.25497432  3.921964
## 20         mall_500m 0.29264397  3.417122
## 21        mall_1000m 0.18390909  5.437469
## 22        mall_2000m 0.14129053  7.077615
## 23     malldistindex 0.24171125  4.137168
## 24          flexible 0.77680322  1.287327
## 25          moderate 0.89360750  1.119060

From the VIF table above, we can see that the number of hotels within 500m and 1km have the highest VIF scores (>10), with hotels within 2km just below that threshold (9.86). The variables that have a VIF of 5-10 are:
- private room listings
- entire home/apt listings
- hotels_250m
- hotels_2000m
- hosp_2000m
- mall_1000m
- mall_2000m

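# Refit without hotels_500m, hotels_1000m, hotels_2000m, hosp_2000m, hosp_5000m and mall_2000m,
# and with the room-type dummy `hotel` in place of `private`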
OLS_2 <- lm(total_price ~ bathrooms + bedrooms + hotel + entire + shared + number_of_reviews + guests_included + superhost + mrt_350m + mrt_700m + tourdistindex + hotels_250m + hosp_500m + hosp_1000m + mall_500m + mall_1000m + malldistindex + flexible + moderate, listings_gwr1)

AIC(OLS_2)

## [1] 109963.9

summary(OLS_2)

## 
## Call:
## lm(formula = total_price ~ bathrooms + bedrooms + hotel + entire + 
##     shared + number_of_reviews + guests_included + superhost + 
##     mrt_350m + mrt_700m + tourdistindex + hotels_250m + hosp_500m + 
##     hosp_1000m + mall_500m + mall_1000m + malldistindex + flexible + 
##     moderate, data = listings_gwr1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -580.7   -90.4   -41.7    16.2 12615.4 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        15.6541    16.5642   0.945 0.344660    
## bathrooms          10.1836     5.2778   1.930 0.053704 .  
## bedrooms           49.5347     7.8444   6.315 2.87e-10 ***
## hotel              18.1846    25.4477   0.715 0.474888    
## entire            136.5154    13.2570  10.298  < 2e-16 ***
## shared            -62.8493    30.3785  -2.069 0.038593 *  
## number_of_reviews  -0.3758     0.1755  -2.141 0.032302 *  
## guests_included     8.0678     4.0873   1.974 0.048435 *  
## superhost         -10.2840    15.6163  -0.659 0.510211    
## mrt_350m           -8.5817     2.9728  -2.887 0.003903 ** 
## mrt_700m            3.3643     1.4242   2.362 0.018187 *  
## tourdistindex     618.0706   396.0503   1.561 0.118665    
## hotels_250m        -4.0652     1.0226  -3.975 7.09e-05 ***
## hosp_500m         -26.4106    18.6600  -1.415 0.157006    
## hosp_1000m          6.5873     7.7480   0.850 0.395243    
## mall_500m          16.1642     4.1641   3.882 0.000105 ***
## mall_1000m         -4.3734     1.9858  -2.202 0.027671 *  
## malldistindex     419.2812   459.3577   0.913 0.361402    
## flexible           59.6360    14.8728   4.010 6.14e-05 ***
## moderate          -15.3170    16.3105  -0.939 0.347719    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 464.2 on 7252 degrees of freedom
## Multiple R-squared:  0.05114,    Adjusted R-squared:  0.04866 
## F-statistic: 20.57 on 19 and 7252 DF,  p-value: < 2.2e-16

ols_vif_tol(OLS_2)

##            Variables Tolerance      VIF
## 1          bathrooms 0.7845299 1.274649
## 2           bedrooms 0.6361508 1.571954
## 3              hotel 0.7865894 1.271311
## 4             entire 0.6763853 1.478447
## 5             shared 0.9144516 1.093552
## 6  number_of_reviews 0.9559814 1.046045
## 7    guests_included 0.6872005 1.455179
## 8          superhost 0.9146239 1.093346
## 9           mrt_350m 0.4876161 2.050794
## 10          mrt_700m 0.2940458 3.400831
## 11     tourdistindex 0.2774785 3.603882
## 12       hotels_250m 0.7087277 1.410979
## 13         hosp_500m 0.6447805 1.550915
## 14        hosp_1000m 0.6446068 1.551333
## 15         mall_500m 0.3482975 2.871108
## 16        mall_1000m 0.2848229 3.510954
## 17     malldistindex 0.3205257 3.119874
## 18          flexible 0.7884228 1.268355
## 19          moderate 0.9008882 1.110016

Even after removing the variables with higher VIF, the AIC increases slightly instead of improving.
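As a reminder of what the Tolerance and VIF columns measure: for each predictor, the tolerance is 1 minus the R2 obtained by regressing that predictor on all the other predictors, and the VIF is its reciprocal. The sketch below reproduces the calculation by hand in base R on synthetic data. The variables x1, x2 and x3 are made up for illustration and are not part of the Airbnb dataset; it is only meant to show the arithmetic, not to replace ols_vif_tol().

```r
# Manual Tolerance/VIF calculation on synthetic data (illustrative only)
set.seed(42)
n <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)  # deliberately collinear with x1
x3 <- rnorm(n)                 # independent of the others
X <- data.frame(x1, x2, x3)

vif_manual <- sapply(names(X), function(v) {
  # Regress predictor v on all remaining predictors
  f <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  tol <- 1 - r2
  c(Tolerance = tol, VIF = 1 / tol)
})

print(t(vif_manual))

# Consistency check against the first row of the table above (bathrooms):
# the reported VIF should be the reciprocal of the reported Tolerance
vif_bathrooms <- 1 / 0.7845299
print(vif_bathrooms)
```

In this sketch the collinear pair (x1, x2) should show a clearly inflated VIF, while the independent x3 stays close to 1.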

8 Geographically Weighted Regression

GWR provides a local model of the variables by fitting a regression equation to every feature in the dataset. Each local regression incorporates the dependent and independent variables of the features falling within the bandwidth of the target feature. The shape and extent of the weighting depend on the type of kernel used (e.g. gaussian, bisquare) and on whether the bandwidth is a fixed distance or an adaptive number of nearest neighbours.
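To make the role of the kernel and the bandwidth concrete, the sketch below computes the weights that a single local regression would give to its neighbours. It assumes the usual textbook forms of the Gaussian and bisquare kernels; the function names, the distances and the bandwidth values are hypothetical. Conventions such as whether the target point counts as its own nearest neighbour may differ in GWmodel.

```r
# Kernel weight functions as commonly defined for GWR (assumed forms)
gaussian_weight <- function(d, bw) exp(-0.5 * (d / bw)^2)
bisquare_weight <- function(d, bw) ifelse(d < bw, (1 - (d / bw)^2)^2, 0)

# Hypothetical distances (in metres) from one target listing to six listings;
# the first entry (0) is the target itself
d <- c(0, 120, 300, 450, 800, 1500)

# Fixed bandwidth: the same distance is used at every target location
bw_fixed <- 500

# Adaptive bandwidth: the distance to the k-th nearest listing of this target
k <- 4
bw_adaptive <- sort(d)[k]

weights <- data.frame(
  distance = d,
  gaussian_fixed = gaussian_weight(d, bw_fixed),
  bisquare_fixed = bisquare_weight(d, bw_fixed),
  gaussian_adaptive = gaussian_weight(d, bw_adaptive),
  bisquare_adaptive = bisquare_weight(d, bw_adaptive)
)
print(round(weights, 3))
```

The Gaussian kernel never drops to exactly zero, whereas the bisquare kernel gives zero weight to every listing at or beyond the bandwidth.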

8.1 Bandwidth selection

During model calibration, candidate bandwidths are tested and assigned cross validation (CV) scores; the bandwidth with the lowest CV score produces the lowest root mean square prediction error. We use bw.gwr() to select the best bandwidth for our GWR analysis. We try two approaches (AICc, CV) and two kernels (gaussian, bisquare), giving four candidate bandwidths.

# Adaptive bandwidths
bwidth_adaptive <- list()

for (i in seq_along(kerneltype)) {
  kernel_name <- kerneltype[i]
  bwidth_adaptive[[kernel_name]] <- list()
  for (j in seq_along(approachtype)) {
    approach_name <- approachtype[j]
    bwidth_adaptive[[kernel_name]][[approach_name]] <- bw.gwr(total_price ~ bathrooms + bedrooms + private + entire + shared + number_of_reviews + guests_included + superhost + mrt_350m + mrt_700m + tourdistindex + hotels_250m + hotels_500m + hotels_1000m + hotels_2000m + hosp_500m + hosp_1000m + hosp_2000m + hosp_5000m + mall_500m + mall_1000m + mall_2000m + malldistindex + flexible + moderate, data=listings_gwr1, approach=approachtype[j], kernel=kerneltype[i], adaptive = TRUE, longlat=TRUE)
  }
}

We create a list object with the various bandwidths so that we can call them up later. They can be accessed in the form bwidth_{fixed/adaptive}[[kerneltype]][[approachtype]]. The code above gives us the adaptive bandwidths. The details for getting the bandwidths and models can be found in Appendix 2.

bwidth_adaptive %>% as.data.frame()

##   gaussian.AICc gaussian.CV bisquare.AICc bisquare.CV
## 1           134         109          1596        1598

With a Gaussian kernel, the best adaptive bandwidths (expressed as a number of nearest neighbours) are 134 using AICc and 109 using CV; with a bisquare kernel, they are 1596 using AICc and 1598 using CV. We save the bandwidths into a list object to fit our basic GWR model.

8.2 Basic GWR

We create a vector of all the independent variables.

# Create vector of all independent variables
all_indepvar <- c("bathrooms", "bedrooms", "private", "entire", "shared", "number_of_reviews", "guests_included", "superhost", "mrt_350m", "mrt_700m", "tourdistindex", "hotels_250m", "hotels_500m", "hotels_1000m", "hotels_2000m", "hosp_500m", "hosp_1000m", "hosp_2000m", "hosp_5000m", "mall_500m", "mall_1000m", "mall_2000m", "malldistindex", "flexible", "moderate")

Model selection

We use the gwr.model.selection() function to step through models built from different combinations of the independent variables. This returns an object with the candidate models, their independent variables, and the bandwidth, AIC, AICc and RSS of each model. As we will be doing this for the 4 kernel and approach combinations, we create a helper function that picks the best models by AIC, AICc and RSS, and a second one that picks, amongst those three, the model with the highest adjusted R2 value.
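As far as we understand the GWmodel documentation, gwr.model.selection() follows a forward stepwise scheme: fit every single-variable model, keep the best variable, then try adding each remaining variable in turn, and so on until all variables are included. The sketch below imitates that scheme with ordinary lm() and AIC on synthetic data. The data and the names x1 to x4 are hypothetical; this is not the GWmodel implementation. With p candidate variables such a scheme fits p + (p - 1) + ... + 1 models, which is 10 models for the 4 variables here.

```r
# Forward stepwise selection sketch with lm() and AIC on synthetic data
set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 3 * dat$x1 - 2 * dat$x3 + rnorm(n)

candidates <- c("x1", "x2", "x3", "x4")  # hypothetical predictors
selected <- character(0)
fitted_models <- data.frame()

while (length(candidates) > 0) {
  # Fit one model per remaining candidate, each added to the current selection
  trial_aic <- sapply(candidates, function(v) {
    AIC(lm(reformulate(c(selected, v), response = "y"), data = dat))
  })
  trial_rhs <- sapply(candidates, function(v) paste(c(selected, v), collapse = " + "))
  fitted_models <- rbind(
    fitted_models,
    data.frame(formula = paste("y ~", trial_rhs), AIC = unname(trial_aic))
  )

  # Keep the candidate that gives the lowest AIC and move to the next round
  best_var <- names(which.min(trial_aic))
  selected <- c(selected, best_var)
  candidates <- setdiff(candidates, best_var)
}

print(fitted_models)
print(fitted_models[which.min(fitted_models$AIC), ])
```

The last line mirrors what the helper functions in the next section do: once every candidate model has its diagnostics, the preferred model is simply the row that minimises the chosen criterion.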

Details of the model selections can be found in Appendix 2. We load the results of the different models here for analysis.

8.2.1 Helper Functions

8.2.1.1 Select models with lowest AIC/AICc/RSS

# Function to select models with lowest AIC, AICc and RSS
model_type <- c("best_AIC", "best_AICc", "best_RSS")

best_model <- function(modelsel, indepvar) {
  # Sort the candidate models, using the second diagnostics column as the ruler
  sorted.models <- gwr.model.sort(modelsel, numVars = length(indepvar), ruler.vector = modelsel[[2]][, 2])

  # Extract the formula of each sorted model into a data frame
  modelsel.df <- data.frame()
  for (i in seq_along(sorted.models[[1]])) {
    modelsel.df[i, "model"] <- sorted.models[[1]][[i]][[1]]
  }

  # Attach the diagnostics of each model
  res_1 <- sorted.models[[2]] %>% as.data.frame()
  colnames(res_1) <- c("bandwidth", "AIC", "AICc", "RSS")
  modelsel.df <- cbind(modelsel.df, res_1)

  # Keep the models with the lowest AIC, AICc and RSS, in the order of model_type
  obj_name <- rbind(
    modelsel.df[which.min(modelsel.df$AIC),],
    modelsel.df[which.min(modelsel.df$AICc),],
    modelsel.df[which.min(modelsel.df$RSS),]
  )

  # Record which selection object the models came from
  modelselname <- rep(deparse(substitute(modelsel)), 3)
  obj_name <- cbind(obj_name, model_type, modelselname)
  return(obj_name)
}

We create a function that sorts the models, selects the 3 models with the lowest AIC, AICc and RSS, and returns them in a data frame.

8.2.1.2 Functions to get best model by adjusted R2

best_model_gwr <- function(bestmodelobj, kernel_sel) {
  basegwr <- list()
  for (i in seq_along(bestmodelobj$model)) {
    # Fit a basic GWR for each selected model, named by its model type
    x <- bestmodelobj$model_type[i]
    basegwr[[x]] <- gwr.basic(formula=bestmodelobj$model[i], data=listings_gwr1, kernel=kernel_sel, adaptive = TRUE, bw=bestmodelobj$bandwidth[i], cv=TRUE)
  }
  return(basegwr)
}

best_model_adjr2 <- function(x) {
  diagnostics <- data.frame()
  for (i in seq_along(x)) {
    temp <- x[[i]]$GW.diagnostic %>% as.data.frame()
    temp <- cbind(temp, x[[i]]$GW.arguments %>% as.data.frame())
    temp <- cbind(temp, model_type=names(x)[i])
    diagnostics <- rbind(diagnostics, temp)
  }
  diagnostics[which.max(diagnostics$gwR2.adj),]
}

The above functions run gwr.basic() on the 3 selected models using the given bandwidths and kernel, and then select the model with the highest adjusted R2.

8.2.2 Model Selection

8.2.3 Best model

We load the results from Appendix 2 here.

load(file="data/results_adaptive.RData")
load(file="data/GWRbestmodel.RData")

results

##        RSS.gw      AIC     AICc      enp      edf     gw.R2  gwR2.adj      BIC
## 1  1217086290 108547.7 109051.1 596.4907 6675.509 0.2609965 0.1949529 104775.3
## 2  1250116073 108554.3 108830.4 349.0820 6922.918 0.2409411 0.2026607 103297.5
## 11 1315243846 108921.2 109194.6 328.0388 6943.961 0.2013962 0.1636640 103645.4
## 12 1315503012 108922.4 109195.4 327.6988 6944.301 0.2012388 0.1635401 103644.4
##                                                                                                                                                                                                                                                                                                  formula
## 1                                                               total_price~entire+bedrooms+flexible+hosp_5000m+mrt_350m+mall_2000m+hotels_250m+hosp_1000m+mall_500m+hotels_500m+hotels_1000m+hosp_500m+tourdistindex+shared+bathrooms+hotels_2000m+mrt_700m+mall_1000m+guests_included+private+moderate
## 2                                                                                                                                                                                                 total_price~flexible+entire+hosp_5000m+bedrooms+mrt_350m+mall_2000m+hotels_250m+hosp_1000m+hotels_500m
## 11 total_price~hosp_5000m+entire+flexible+bedrooms+bathrooms+mrt_350m+mrt_700m+hotels_1000m+tourdistindex+hosp_2000m+hosp_1000m+hotels_250m+number_of_reviews+moderate+malldistindex+guests_included+mall_1000m+mall_500m+hotels_500m+hosp_500m+hotels_2000m+mall_2000m+host_is_superhost+shared+private
## 12 total_price~hosp_5000m+entire+flexible+bedrooms+bathrooms+mrt_350m+mrt_700m+hotels_1000m+tourdistindex+hosp_2000m+hosp_1000m+hotels_250m+number_of_reviews+moderate+malldistindex+guests_included+mall_1000m+mall_500m+hotels_500m+hosp_500m+hotels_2000m+mall_2000m+host_is_superhost+shared+private
##    rp.given hatmatrix   bw   kernel adaptive p theta longlat DM.given F123.test
## 1     FALSE      TRUE  134 gaussian     TRUE 2     0   FALSE    FALSE     FALSE
## 2     FALSE      TRUE  109 gaussian     TRUE 2     0   FALSE    FALSE     FALSE
## 11    FALSE      TRUE 1596 bisquare     TRUE 2     0   FALSE    FALSE     FALSE
## 12    FALSE      TRUE 1598 bisquare     TRUE 2     0   FALSE    FALSE     FALSE
##    model_type
## 1    best_AIC
## 2   best_AICc
## 11   best_AIC
## 12   best_AIC

# results[which.max(results$gwR2.adj),]

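The above shows the best result for each of the four kernel and approach combinations. The best GWR model is the one with the highest adjusted R2 (0.2027): it uses a Gaussian kernel with the CV-selected bandwidth of 109, and it also has the lowest AICc value (108830.4) of the four.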

results.best <- results[which.max(results$gwR2.adj),]
results.formula <- results.best$formula
best <- gwr.basic(results.best$formula, data=listings_gwr1, kernel = "gaussian", bw=results.best$bw, adaptive = TRUE)

results.best$formula

## [1] "total_price~flexible+entire+hosp_5000m+bedrooms+mrt_350m+mall_2000m+hotels_250m+hosp_1000m+hotels_500m"

8.2.4 Comparison of results

Results	OLS	GWR
AIC	109972.6	108488.9
AICc	109972.6	108763.7
R2	0.04739	0.2476165
Adjusted R2	0.04621	0.2098307
RSS	1568875980	1239122141

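The table above compares the results of the OLS and GWR models. The AIC, AICc and RSS of the GWR model are all lower than those of the global OLS model. Judging by the adjusted R2 values (0.2098 against 0.0462), the GWR model explains roughly 4.5 times as much of the variation in price as the OLS model, although the adjusted R2 of the GWR model is still low in absolute terms.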
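The comparison can be checked with a few lines of arithmetic. The sketch below only re-uses the figures from the comparison table; the variable names are ours and nothing is re-estimated.

```r
# Values copied from the OLS vs GWR comparison table
ols <- c(AIC = 109972.6, AICc = 109972.6, adjR2 = 0.04621, RSS = 1568875980)
gwr <- c(AIC = 108488.9, AICc = 108763.7, adjR2 = 0.2098307, RSS = 1239122141)

aic_drop      <- ols[["AIC"]] - gwr[["AIC"]]      # reduction in AIC
aicc_drop     <- ols[["AICc"]] - gwr[["AICc"]]    # reduction in AICc
adjr2_ratio   <- gwr[["adjR2"]] / ols[["adjR2"]]  # GWR adjusted R2 relative to OLS
rss_reduction <- 1 - gwr[["RSS"]] / ols[["RSS"]]  # share of the OLS RSS removed by GWR

print(round(c(AIC_drop = aic_drop, AICc_drop = aicc_drop,
              adjR2_ratio = adjr2_ratio, RSS_reduction = rss_reduction), 3))
```

The AICc drop is smaller than the AIC drop because AICc penalises the much larger effective number of parameters of the GWR model more heavily.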

8.2.5 Discussion of results

# Show results of GWR
best

##    ***********************************************************************
##    *                       Package   GWmodel                             *
##    ***********************************************************************
##    Program starts at: 2021-04-30 13:31:54 
##    Call:
##    gwr.basic(formula = results.best$formula, data = listings_gwr1, 
##     bw = results.best$bw, kernel = "gaussian", adaptive = TRUE)
## 
##    Dependent (y) variable:  NA
##    Independent variables:  
##    Number of data points: 7272
##    ***********************************************************************
##    *                    Results of Global Regression                     *
##    ***********************************************************************
## 
##    Call:
##     lm(formula = formula, data = data)
## 
##    Residuals:
##     Min      1Q  Median      3Q     Max 
##  -627.3   -89.4   -40.3    12.1 12648.9 
## 
##    Coefficients:
##                Estimate Std. Error t value    Pr(>|t|)    
##    (Intercept)  37.3711    15.4176   2.424     0.01538 *  
##    flexible     56.2799    13.9537   4.033 0.000055554 ***
##    entire      146.9356    12.2684  11.977     < 2e-16 ***
##    hosp_5000m   -6.7267     2.3484  -2.864     0.00419 ** 
##    bedrooms     59.0357     6.7572   8.737     < 2e-16 ***
##    mrt_350m     -0.9425     2.2738  -0.414     0.67853    
##    mall_2000m    2.8063     0.5484   5.118 0.000000317 ***
##    hotels_250m  -6.0534     1.8436  -3.283     0.00103 ** 
##    hosp_1000m    3.8405     7.0355   0.546     0.58517    
##    hotels_500m   2.8255     0.9071   3.115     0.00185 ** 
## 
##    ---Significance stars
##    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
##    Residual standard error: 464.8 on 7262 degrees of freedom
##    Multiple R-squared: 0.04739
##    Adjusted R-squared: 0.04621 
##    F-statistic: 40.14 on 9 and 7262 DF,  p-value: < 2.2e-16 
##    ***Extra Diagnostic information
##    Residual sum of squares: 1568875980
##    Sigma(hat): 464.5443
##    AIC:  109972.6
##    AICc:  109972.6
##    BIC:  102874.2
##    ***********************************************************************
##    *          Results of Geographically Weighted Regression              *
##    ***********************************************************************
## 
##    *********************Model calibration information*********************
##    Kernel function: gaussian 
##    Adaptive bandwidth: 109 (number of nearest neighbours)
##    Regression points: the same locations as observations are used.
##    Distance metric: Euclidean distance metric is used.
## 
##    ****************Summary of GWR coefficient estimates:******************
##                        Min.      1st Qu.       Median      3rd Qu.      Max.
##    Intercept   -11314.03145   -193.24773     17.87486     95.28624 17253.366
##    flexible      -129.41484    -15.99949     21.02387     75.42637  1454.784
##    entire         -22.23338     60.08179     98.31452    126.94262  1568.879
##    hosp_5000m    -332.21168     -5.69241      2.75113     24.83807  1226.895
##    bedrooms      -465.73390     42.82185     63.78579     81.76701   150.560
##    mrt_350m      -138.97949    -11.46909     -1.74250      5.38222   203.472
##    mall_2000m    -294.93361     -2.65940      0.94723      4.30534    37.477
##    hotels_250m   -287.60703     -9.43553     -2.43416      0.82434    38.690
##    hosp_1000m    -497.38147    -20.57995     -2.66157     18.55469   742.839
##    hotels_500m    -40.61800     -0.67713      1.64860      4.49251    55.089
##    ************************Diagnostic information*************************
##    Number of data points: 7272 
##    Effective number of parameters (2trace(S) - trace(S'S)): 347.6987 
##    Effective degrees of freedom (n-2trace(S) + trace(S'S)): 6924.301 
##    AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 108763.7 
##    AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 108488.9 
##    BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): 103223 
##    Residual sum of squares: 1239122141 
##    R-square value:  0.2476165 
##    Adjusted R-square value:  0.2098307 
## 
##    ***********************************************************************
##    Program stops at: 2021-04-30 13:33:28

The output above shows the results of the best model. The global coefficient estimates and their significance codes are tabulated below; variables marked NA are not significant at the 95% level.

coeff <- read_csv("data/coeff.csv")
coeff %>% kbl() %>% kable_styling(bootstrap_options = c("striped"))

Variable	Estimate	Std_Error	t-value	Pr(>|t|)	Significance
(Intercept)	37.3711	15.4176	2.424	0.01538	*
flexible	56.2799	13.9537	4.033	0.000055554	***
entire	146.9356	12.2684	11.977	< 2e-16	***
hosp_5000m	-6.7267	2.3484	-2.864	0.00419	**
bedrooms	59.0357	6.7572	8.737	< 2e-16	***
mrt_350m	-0.9425	2.2738	-0.414	0.67853	NA
mall_2000m	2.8063	0.5484	5.118	0.000000317	***
hotels_250m	-6.0534	1.8436	-3.283	0.00103	**
hosp_1000m	3.8405	7.0355	0.546	0.58517	NA
hotels_500m	2.8255	0.9071	3.115	0.00185	**

We can see that having a flexible cancellation policy, listing an entire house/apartment, and the number of bedrooms are strongly associated with price. Surprisingly, the number of MRT station exits within 350m of a listing (mrt_350m) is not significant in the model selected for GWR, although it is significant in the overall OLS model.

8.2.6 Localised results

We plot the p-values (PV) of the significant variables to see how their significance varies locally.

# Create vector of significant independent variables
sigvariables <- coeff %>% na.omit()
sel_indepvar <- sigvariables$Variable 
sel_indepvar[1] <- "Intercept" #rename first variable

We also create a vector of independent variables that were significant.

8.2.6.1 P-values

best.SDF <- best$SDF %>% as.data.frame()

coln <- colnames(best.SDF)
coln <- coln[grepl("_TV", coln)]

edf <- best$GW.diagnostic$edf
for (column in coln) {
  # Two-tailed p-value of each local t-value, using the effective degrees of freedom
  best.SDF[paste(column, "PV", sep = "_")] <- as.vector(2 * pt(abs(data.matrix(best.SDF[column])), edf, lower.tail = FALSE))
}

The above code calculates the two-tailed p-value at every regression point from the local t-values (the _TV columns) of the GWR results, so that we can use them for plotting.
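The conversion from a t-value to a two-tailed p-value can be checked in isolation. The sketch below wraps it in a small helper (two_tailed_p is our own name, not a GWmodel function) and tests it on the global intercept reported earlier, whose t-value of 2.424 on 7262 degrees of freedom should give a p-value close to the reported 0.01538. The local t-values in the second part are hypothetical.

```r
# Two-tailed p-value from a t-value and its degrees of freedom
two_tailed_p <- function(t_value, df) {
  2 * pt(abs(t_value), df, lower.tail = FALSE)
}

# Check against the global intercept: t = 2.424 on 7262 degrees of freedom
p_intercept <- two_tailed_p(2.424, 7262)
print(p_intercept)

# The helper is vectorised, so a whole column of local t-values can be converted at once
local_t <- c(-3.1, -0.4, 0.0, 1.2, 2.5)          # hypothetical local t-values
local_p <- two_tailed_p(local_t, df = 6924.301)  # edf taken from the GWR diagnostics
print(round(local_p, 4))
```

Because the absolute value is taken first, negative and positive t-values of the same size give the same p-value, and a t-value of 0 gives a p-value of 1.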

8.2.6.2 Mapping p-values of GWR model

load("data/tmapGWR.RData")
plot(nhood_map_sf)

We load the neighbourhood map created in an earlier part of the project.

# Select significant variables from SDF and remap col names
bestpv <- best.SDF %>% 
            dplyr::select(sel_indepvar, y, yhat, residual, flexible_TV_PV, entire_TV_PV, hosp_5000m_TV_PV, bedrooms_TV_PV, mall_2000m_TV_PV, hotels_250m_TV_PV, hotels_500m_TV_PV, Local_R2, coords.x1, coords.x2) %>%
            rename(flexible_PV = flexible_TV_PV, 
                   entire_PV = entire_TV_PV, 
                   hosp_5000m_PV = hosp_5000m_TV_PV, 
                   bedrooms_PV = bedrooms_TV_PV, 
                   mall_2000m_PV = mall_2000m_TV_PV, 
                   hotels_250m_PV = hotels_250m_TV_PV, 
                   hotels_500m_PV = hotels_500m_TV_PV)

## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(sel_indepvar)` instead of `sel_indepvar` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.

We then select the coefficients, coordinates, local R2 and p-values of the significant variables in the code above. We then convert this into an sf object for plotting using tmap.

best_sf <- st_as_sf(bestpv, coords = c("coords.x1", "coords.x2"), crs=3414)

# Functions to map coefficients and p-values
coeffmap <- function(varname) {
  tm_shape(sg_osm) +
    tm_rgb() +
    tm_shape(nhood_map_sf) +
    tm_polygons() +
    tm_shape(best_sf) +
    tm_symbols(col = varname, size=0.2, midpoint=NA, border.col = "black", border.lwd = 1) +
    tm_layout(legend.title.size = .3, legend.text.size = .5, legend.frame = TRUE, main.title = paste("Coefficient:", varname))
}
pvaluemap <- function(varname) {
  tm_shape(sg_osm) +
    tm_rgb() +
    tm_shape(nhood_map_sf) +
    tm_polygons() +
    tm_shape(best_sf) +
    tm_symbols(col = varname, size=0.2, style="cont", midpoint=0.5, border.col= "black", border.lwd=1) +
    tm_layout(legend.title.size = .5, legend.frame = TRUE, main.title = paste("p-value:", varname))
}

The above functions plot the coefficients and the p-values of selected variables.

8.2.6.3 Comparison of localised p-values

When we plot the p-values of the significant variables, we see that they vary by region. We can also plot the coefficients and p-values side by side to see which variables are significant in the localised context and how their coefficients vary.

tmap_mode("plot")

## tmap mode set to plotting

# P-values of all significant variables
tmap_arrange(pvaluemap("flexible_PV"), pvaluemap("entire_PV"), pvaluemap("bedrooms_PV"), pvaluemap("mall_2000m_PV"), ncol=2)

The above 4 maps show the p-values for flexible cancellation, entire listings, number of bedrooms, and number of malls within 2km.

# P-values of all significant variables
tmap_arrange(pvaluemap("hotels_250m_PV"), pvaluemap("hotels_500m_PV"), pvaluemap("hosp_5000m_PV"), ncol=2)

The above 3 graphs show the p-values for hotels_250m, hotels_500m, hosp_5000m.

8.2.7 Discussion of results

8.2.7.1 Entire listings

# entire listings
tmap_arrange(coeffmap("entire"), pvaluemap("entire_PV"), ncol = 2)

Being an entire-home listing appears to act as a ‘global’ variable: it raises the price of a listing regardless of where the listing is located.

8.2.7.2 Cancellation Flexibility

# Cancellation flexibility
tmap_arrange(coeffmap("flexible"), pvaluemap("flexible_PV"), ncol = 2)

The flexibility of the cancellation policy affects pricing positively in the central region (Kallang, Geylang, Downtown Core), as well as in some areas outside the central region (e.g. Serangoon, Hougang, Bishan, Bukit Batok). One possible explanation is that listings in the outer regions use flexibility to attract guests, while listings in the central region compete with hotels, so flexibility also affects the price there.

8.2.7.3 Number of bedrooms

# Number of bedrooms
tmap_arrange(coeffmap("bedrooms"), pvaluemap("bedrooms_PV"), ncol = 2)

The number of bedrooms is also significant for most regions, except for some small concentrations at the northernmost tip (Woodlands, Sembawang) and in the west (Jurong West, Jurong East, Bukit Batok). This is expected, as the number of bedrooms corresponds to the number of guests a listing can accommodate.

8.2.7.4 Number of malls within 2km

# Number of malls within 2km
tmap_arrange(coeffmap("mall_2000m"), pvaluemap("mall_2000m_PV"), ncol = 2)

The number of malls within a 2km radius is more positively associated with price at the fringes of the city centre (e.g. Bukit Timah, Kallang, Geylang) and also in the north-eastern areas of Hougang, Serangoon, Bishan, etc.

8.2.7.5 Hotels 250m and 500m

# Hotels 250m and 500m
tmap_arrange(coeffmap("hotels_250m"), pvaluemap("hotels_250m_PV"), coeffmap("hotels_500m"), pvaluemap("hotels_500m_PV"), ncol = 2)

The number of hotels within a 250m radius negatively affects the price, especially in the central fringe. A possible explanation is that with many hotels close by, visitors have more choice, so these listings have to compete with the hotels on price.
The number of hotels within a 500m radius slightly positively affects the price. The opposite signs for hotels within 250m and within 500m look contradictory. This may be because hotels within 250m are direct competition for listings, whereas hotels within the wider 500m radius indicate higher demand for the area, which could be due to better links (e.g. transport, nightspots, attractions, etc). When we look closer at the coefficients, we also see that they are positive where the variable is significant and negative in the regions where it is not.

8.2.7.6 Hospitals within 5km

# Hospitals within 5km
tmap_arrange(coeffmap("hosp_5000m"), pvaluemap("hosp_5000m_PV"), ncol = 2)

The number of hospitals within 5km seems to be more of a key factor in the Tanglin and Newton neighbourhoods. This may be because there are private hospitals in those areas that cater to healthcare tourism. The variable is also significant in the eastern region, which may be because there are fewer hospitals there, or because of its proximity to the airport.

9 Conclusion

This project has compared the global OLS model with the GWR model to understand the determinants of pricing of Airbnb listings. The local GWR model has greater explanatory power (an adjusted R2 roughly 4.5 times that of the OLS model), and it gives better context as it takes into account differences between neighbourhoods. A variable that would not have shown up in the global OLS model (hosp_5000m) shows that a localised model may indeed capture factors such as healthcare tourism in areas that are close to private hospitals, such as Tanglin and Newton.

Variables such as the number of bedrooms and being an entire home are positively associated with price, which is in line with other studies. Having a flexible cancellation policy also raises the price, in contrast to some other studies where a strict cancellation policy increases the price.

One surprising finding from the local GWR model is that the number of MRT station exits is not significant. This could be because Singapore is a small city-state where transport links are readily accessible across the entire island.

Limitations and future studies

We have presented our findings from the global OLS and local GWR models, which contribute to the understanding of the pricing determinants of Airbnb listings in Singapore; this is one of the first few studies to look at Airbnb pricing in Asia. However, we used a basic GWR model, which applies a single bandwidth to all variables. Future studies could compare the GWR and MGWR models to see whether a multi-scale local model explains the effect of the independent variables better. Future studies could also look at other factors, such as the availability of housing and other amenities such as food and beverage, as pricing determinants in Singapore, and compare pricing determinants across other cities in Asia.

Geographically Weighted Regression of Airbnb in Singapore

Geospatial Analysis of Airbnb in Singapore

Clara Chua

25/04/2020 (updated: 2021-06-09)