Set up

packages = c('tidyverse', 'funModeling', 'knitr', 'rgdal', 'maptools', 'sf','raster','spatstat', 'tmap','tmaptools', 'gridExtra', 'leaflet', 'OpenStreetMap', 'microbenchmark', 'doParallel', 'foreach', 'GWmodel', 'car', 'httr', 'jsonlite', 'kableExtra')
for (p in packages) {
  # Install the package if it is not yet available, then attach it
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
options(kableExtra.auto_format = FALSE)

1 Methodology

We fit a global OLS model and geographically weighted regression (GWR) models to estimate the hedonic pricing model, and compare how well each explains listing prices.

Our hypothesis is that the price determinants of an Airbnb listing include spatial factors such as:

Proximity to tourist attractions
Proximity to transport links - proxied by the number of MRT stations
Proximity to medical tourist facilities (private and major hospitals)
Proximity to malls and the main shopping districts
Proximity to hotels

2 Data

Data was downloaded from InsideAirbnb on 29 September 2020 for this project - the dataset downloaded was compiled on 22 June 2020 for Singapore.

From the research above, we identified variables that have been shown to be statistically significant in many markets and studies, including new ones used in Hong & Yoo (2020), grouped in 5 categories (Table 1).

table1 <- read_csv("data/tabledata.csv")
# table1 %>% kbl() %>% kable_styling(bootstrap_options = c("striped")) %>% collapse_rows(columns = 1:2)

We used the sum of the listing price and cleaning fee as the price for each listing (total_price).

We use the number of MRT train station exits within a set distance (350m, 700m) as an indicator for transport links; these distances correspond to approximately 4- and 8-minute walks (as the crow flies). While bus stations are also prevalent, they may be less easy for tourists or visitors to use, and are therefore omitted from the study.

Additionally we want to test if the number of hotels, malls and hospitals (in the case of healthcare tourism) nearby has any impact on Airbnb pricing. Our hypothesis is that the number of hotels, hospitals and malls would each positively correlate with the listing price.

Similar to Hong & Yoo (2020), we use a distance index of tourist attractions to reflect accessibility to multiple tourist destinations, as most visitors are likely to visit more than one tourist attraction. We also add a distance index of major shopping malls in the Orchard and Central area, as shopping is one of the reasons tourists visit Singapore.

Data Sources

The coordinates of MRT station exits, gazetted hotels and tourist attractions are available on the Data.gov.sg website, which is the Singapore government’s portal of publicly available datasets. The data (in either KML or SHP format) was downloaded from the site on 31 March 2020.

Hospital data was collated and provided by Prof Kam. Locations of hospitals could also be scraped from the Ministry of Health’s website (Health Hub), which gives the postal code of each hospital. The coordinates (latitude, longitude, and X,Y coordinates in SVY21 format) can then be obtained using the OneMap API provided by the Singapore Land Authority.

The list of shopping malls was collated from Wikipedia and the postal codes were obtained from the Singapore Post website or Google Maps. The coordinates were obtained using the OneMap API.

3 Data Wrangling

3.1 Listings data

# Load basedataGWR consisting of listings_clean, d_listings, neighbourhoods, neighbourhood map 
load("data/basedataGWR.RData")

We read in the base data required, consisting of the detailed listings table and the cleaned listings from the previous phase of the project. We also read in the neighbourhoods table and the neighbourhood map as an sf object for plotting later. We also ensure that we obtain the coordinates in longitude / latitude so that we can compute the distance matrix for the geographically weighted regression in relation to the different room types.

3.1.1 Listings variables

We use the listings_clean data from Part 1 of the project, where 37 listings in non-residential areas have been removed. We are also interested in other aspects of the listings, such as whether a host is a superhost, the total price of the listing (price + cleaning fee) and the number of guests that the listing can accommodate. We also look at the number of bedrooms and the number of bathrooms to see if these are variables in the hedonic pricing model, as they have been known to correlate with and explain the price of a listing. Most of this information is available from the detailed listings table (d_listings). We select the relevant columns from d_listings, drop rows with missing bedroom or bathroom counts, and do an inner join with listings_clean so that we keep only the cleaned listings that also have information in the detailed listings table.

# Joining listings_clean with selected columns from detailed listing.
# Select columns from detailed listing
d_listings_sel <- d_listings %>% dplyr::select(id, listing_url, host_is_superhost, latitude, longitude, property_type, bathrooms, bedrooms, cleaning_fee, guests_included, review_scores_rating, cancellation_policy) %>% drop_na(bedrooms, bathrooms)

# Perform inner join
listings_gwr <- inner_join(listings_clean, d_listings_sel, by="id")
# Remove selected listings from global environment
rm(d_listings_sel)
# Show head of listings_gwr
head(listings_gwr)

## Simple feature collection with 6 features and 25 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 22646.02 ymin: 34950.06 xmax: 42212.88 ymax: 47135.4
## projected CRS:  SVY21 / Singapore TM
## # A tibble: 6 x 26
##   id    name    host_id host_name neighbourhood_g~ neighbourhood room_type price
##   <chr> <chr>   <chr>   <chr>     <chr>            <chr>         <chr>     <dbl>
## 1 49091 COZICO~ 266763  Francesca North Region     Woodlands     Private ~    84
## 2 50646 Pleasa~ 227796  Sujatha   Central Region   Bukit Timah   Private ~    80
## 3 56334 COZICO~ 266763  Francesca North Region     Woodlands     Private ~    70
## 4 71609 Ensuit~ 367042  Belinda   East Region      Tampines      Private ~   167
## 5 71896 B&B  R~ 367042  Belinda   East Region      Tampines      Private ~    95
## 6 71903 Room 2~ 367042  Belinda   East Region      Tampines      Private ~    84
## # ... with 18 more variables: minimum_nights <dbl>, number_of_reviews <dbl>,
## #   last_review <date>, reviews_per_month <dbl>,
## #   calculated_host_listings_count <dbl>, availability_365 <dbl>,
## #   geometry <POINT [m]>, listing_url <chr>, host_is_superhost <lgl>,
## #   latitude <dbl>, longitude <dbl>, property_type <chr>, bathrooms <dbl>,
## #   bedrooms <dbl>, cleaning_fee <dbl>, guests_included <dbl>,
## #   review_scores_rating <dbl>, cancellation_policy <chr>
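To make the effect of the join concrete, the following is a minimal sketch on made-up data (the ids and values are illustrative only and are not taken from the dataset). Rows with missing bedroom or bathroom counts are dropped first, and the inner join then keeps only the ids present in both tables.

```r
library(dplyr)
library(tidyr)

# Toy stand-ins for listings_clean and d_listings (hypothetical values)
toy_clean <- tibble(id = c("1", "2", "3"), price = c(84, 80, 70))
toy_detail <- tibble(
  id        = c("1", "2", "4"),
  bedrooms  = c(1, NA, 2),
  bathrooms = c(1, 1, 1)
)

# Drop rows lacking bedroom/bathroom counts (id "2" is removed here)
toy_detail_sel <- toy_detail %>% drop_na(bedrooms, bathrooms)

# Keep only ids that appear in both tables (id "1" is the only match)
toy_joined <- inner_join(toy_clean, toy_detail_sel, by = "id")
toy_joined
```

In this toy case, id "2" is lost because of its missing bedroom count, while ids "3" and "4" are lost because each appears in only one table.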

3.1.2 Dealing with empty or NA data

# Change NA in cleaning fees to 0
listings_gwr$cleaning_fee[is.na(listings_gwr$cleaning_fee)] <- 0

# Calculate total price of listing (price + cleaning_fees)
listings_gwr <- listings_gwr %>% mutate(total_price = price + cleaning_fee)

From the previous section, we see that there are NAs in the cleaning fees. We assume that an NA cleaning fee means the listing does not charge one, and set those values to 0. We then calculate the total price of the listing as the sum of the price and the cleaning fee.
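As an alternative sketch of the same step, the NA replacement and the total price can be written in a single mutate() call using dplyr's coalesce(). The toy values below are made up for illustration.

```r
library(dplyr)

# Two hypothetical listings, one without a cleaning fee recorded
toy_fees <- tibble(price = c(84, 167), cleaning_fee = c(NA, 20))

# coalesce() substitutes 0 wherever cleaning_fee is NA;
# total_price then uses the already-updated cleaning_fee column
toy_fees <- toy_fees %>%
  mutate(cleaning_fee = coalesce(cleaning_fee, 0),
         total_price  = price + cleaning_fee)
toy_fees
```

Whether an NA really means "no fee" remains an assumption about the data; the code only encodes it.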

3.1.3 Listings coordinates

listing_coords <- listings_gwr %>% st_coordinates() %>% as.matrix()

We extract the coordinates from the listings_gwr, which is an sf object. The CRS of the cleaned listings data is in SVY21 format.

3.2 Train station exits data

# Read in train station exits data; data unzipped into TrainStationExit folder in data folder
trainstatexits <- st_read(dsn="data/TrainStationExit", layer="TrainStationExit06032020")

## Reading layer `TrainStationExit06032020' from data source `C:\Users\clarachua\Documents\2. Capstone Project\capstone\data\TrainStationExit' using driver `ESRI Shapefile'
## Simple feature collection with 493 features and 6 fields
## geometry type:  POINT
## dimension:      XYZ
## bbox:           xmin: 103.6368 ymin: 1.264972 xmax: 103.9893 ymax: 1.449157
## z_range:        zmin: 0 zmax: 0
## geographic CRS: WGS 84

glimpse(trainstatexits)

## Rows: 493
## Columns: 7
## $ STN_NAME   <chr> "CANBERRA MRT STATION", "CANBERRA MRT STATION", "CANBERRA M~
## $ EXIT_CODE  <chr> "A", "B", "C", "D", "E", "1", "2", "1", "2", "3", "4", "5",~
## $ LAST_UPD_U <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ LAST_UPD_D <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ CRT_USRID_ <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ STN_NO     <chr> "NS12", "NS12", "NS12", "NS12", "NS12", "TE1", "TE1", "TE3"~
## $ geometry   <POINT [°]> POINT Z (103.8296 1.443477 0), POINT Z (103.83 1.4428~

The train station exit information is available as a .shp file. The downloaded data was extracted into the folder “TrainStationExit” and read in using the st_read() function from the sf package. Taking a glimpse at the information, we have the station name, station number, exit code, and the coordinates in the geometry column. The CRS is WGS84.

As our listing data is in SVY21, we transform the coordinates to SVY21 (EPSG:3414) and extract them in matrix form using the function st_coordinates(). As the resulting matrix includes a Z column, we remove that column so that we can compute the distance matrix (in X, Y dimensions).

# extract the coordinates of train station exits and remove Z column
trainstatcoords <- st_transform(trainstatexits, 3414) %>% st_coordinates()
trainstatcoords <- trainstatcoords[,-3]
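An alternative to removing the Z column by position is to drop the Z dimension from the geometry itself with st_zm() before extracting coordinates. The sketch below uses a single XYZ point copied from the glimpse() output above; it is an illustration, not the code used in this report.

```r
library(sf)

# One XYZ point in WGS84 (longitude, latitude, height)
pt_wgs84 <- st_sfc(st_point(c(103.8296, 1.443477, 0)), crs = 4326)

# Drop the Z dimension, then project to SVY21 / Singapore TM (EPSG:3414)
pt_svy21 <- st_transform(st_zm(pt_wgs84), 3414)

# The coordinate matrix now has only X and Y columns
xy <- st_coordinates(pt_svy21)
colnames(xy)
```

Dropping the dimension by name avoids relying on Z always being the third column.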

3.3 Tourist Attractions

# Read in Tourist attraction data from Data.gov, unzip into folder tourist-attractions
tourattr <- st_read(dsn="data/tourist-attractions", layer="TOURISM")

## Reading layer `TOURISM' from data source `C:\Users\clarachua\Documents\2. Capstone Project\capstone\data\tourist-attractions' using driver `ESRI Shapefile'
## Simple feature collection with 107 features and 19 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -1.797693e+308 ymin: -1.797693e+308 xmax: 43659.54 ymax: 47596.73
## projected CRS:  SVY21

glimpse(tourattr)

## Rows: 107
## Columns: 20
## $ OBJECTID   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, ~
## $ URL_PATH   <chr> "www.yoursingapore.com/en/see-do-singapore/culture-heritage~
## $ IMAGE_PATH <chr> "www.yoursingapore.com/content/dam/desktop/global/see-do-si~
## $ IMAGE_ALT_ <chr> "Learn more about local Chinese culture at the Singapore Ch~
## $ PHOTOCREDI <chr> "Joel Chua DY", "Joel Chua DY", "Joel Chua DY", NA, "Malcol~
## $ PAGETITLE  <chr> "Chinatown Heritage Centre, Singapore", "Thian Hock Keng Te~
## $ LASTMODIFI <chr> "2015-11-02T10:16:52.847+08:00", "2015-11-02T10:20:57.245+0~
## $ LATITUDE   <dbl> 1.283510, 1.280940, 1.310070, 1.277219, 1.275490, 1.302910,~
## $ LONGTITUDE <dbl> 103.8444, 103.8476, 103.8994, 103.8373, 103.8414, 103.8628,~
## $ ADDRESS    <chr> "48 Pagoda Street", "158 Telok Ayer Street", "139 Ceylon Ro~
## $ POSTALCODE <chr> NA, NA, NA, "0", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ OVERVIEW   <chr> "Experience how Singapore08:00", "Beautifully restored, Thi~
## $ EXTERNAL_L <chr> "http://www.singaporechinatown.com.sg/", "http://www.thianh~
## $ META_DESCR <chr> "At the Chinatown Heritage Centre, experience how Singapore~
## $ OPENING_HO <chr> "Daily, 9am atown Heritage Centre, experience how Singapore~
## $ INC_CRC    <chr> "255C03B58EE7B003", "D5D3E3930E0C117F", "9CB05BD01BA1715C",~
## $ FMEL_UPD_D <date> 2017-01-09, 2017-01-09, 2017-01-09, 2017-01-09, 2017-01-09~
## $ X_ADDR     <dbl> 29227.71, 29592.74, 35356.47, 28447.06, 28901.62, 31286.58,~
## $ Y_ADDR     <dbl> 29549.54, 29265.36, 32486.50, 28853.86, 28662.73, 31694.71,~
## $ geometry   <POINT [m]> POINT (29227.71 29549.54), POINT (29592.74 29265.36),~

Similarly for Tourist Attractions, we extract the data into the folder “tourist-attractions” and read in the data using st_read(). Taking a glimpse() at the data, we see that the coordinates are already in SVY21 format. The data was likely extracted from web pages hosted by the Singapore Tourism Board (STB) and includes a pagetitle, which corresponds to the name of the tourist attraction, as well as image links, URL links, latitude, longitude, and X-Y coordinates.

3.3.1 Missing Data

# Check for missing data (where longitude / latitude is NA or 0)
sum(is.na(tourattr$LATITUDE))

## [1] 0

tourattr[tourattr$LATITUDE == 0,]

## Simple feature collection with 1 feature and 19 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -1.797693e+308 ymin: -1.797693e+308 xmax: -1.797693e+308 ymax: -1.797693e+308
## projected CRS:  SVY21
##    OBJECTID                                                        URL_PATH
## 73       73 www.yoursingapore.com/en/see-do-singapore/beyond-singapore.html
##                                                                                                                 IMAGE_PATH
## 73 www.yoursingapore.com/content/dam/desktop/global/see-do-singapore/beyond-singapore/beyond-singapore-carousel01-rect.jpg
##    IMAGE_ALT_ PHOTOCREDI              PAGETITLE                    LASTMODIFI
## 73       <NA>       <NA> Cruises from Singapore 2015-11-03T14:54:39.639+08:00
##    LATITUDE LONGTITUDE ADDRESS POSTALCODE
## 73        0          0    <NA>          0
##                                                                                                                                                                                         OVERVIEW
## 73 Watch the Singapore skyline disappear from view as you sail away into the horizon. As the gateway to Southeast Asia and the world, Singapore is the choice homeport of numerous cruise lines.
##    EXTERNAL_L
## 73       <NA>
##                                                                                                                                                              META_DESCR
## 73 Take these luxurious cruise liners from Singapore and visit lush tropical rainforests, colourful open air markets, mystifying ancient cities and white sand beaches.
##    OPENING_HO          INC_CRC FMEL_UPD_D X_ADDR Y_ADDR
## 73       <NA> BE271D2BD630CE77 2017-01-09      0      0
##                          geometry
## 73 POINT (-1.797693e+308 -1.79...

One row, for the cruise centre, has missing coordinates: the latitude and longitude are 0 and the geometry is not a valid SVY21 position. As only one set of coordinates is affected, we obtain the coordinates for the Marina Bay Cruise Centre from Google Maps: latitude/longitude (1.267190433028671, 103.85996202606474), or X,Y coordinates in SVY21 format (30965.215, 27745.016). We then extract the coordinates into matrix form as before.

# Clean data from tourist attraction
tourattr[73, "LATITUDE"] <- 1.267190433028671
# The longitude column in this dataset is spelled LONGTITUDE
tourattr[73, "LONGTITUDE"] <- 103.85996202606474
cruise = st_point(c(30965.215,27745.016))
tourattr$geometry[73] <- cruise

tourattr_coords <- tourattr %>% st_coordinates()
head(tourattr_coords)

##          X        Y
## 1 29227.71 29549.54
## 2 29592.74 29265.36
## 3 35356.47 32486.50
## 4 28447.06 28853.86
## 5 28901.62 28662.73
## 6 31286.58 31694.71
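As a sanity check on the manually entered cruise centre coordinates, the longitude/latitude pair can be projected to SVY21 with sf and compared against the X,Y pair quoted above. This is a sketch only; the small tolerance one should expect is an assumption, so the difference is printed rather than asserted.

```r
library(sf)

# Longitude/latitude quoted in the text (WGS84), in x = lon, y = lat order
cruise_wgs84 <- st_sfc(
  st_point(c(103.85996202606474, 1.267190433028671)),
  crs = 4326
)

# Project to SVY21 / Singapore TM (EPSG:3414)
cruise_svy21 <- st_coordinates(st_transform(cruise_wgs84, 3414))

# Difference from the X,Y pair used in the text, in metres
cruise_diff <- cruise_svy21 - c(30965.215, 27745.016)
cruise_diff
```

A difference of many metres would suggest that the two coordinate pairs do not describe the same location.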

3.4 Hotels

hotels <- st_read("data/hotels/hotels.kml")

## Reading layer `HOTELS' from data source `C:\Users\clarachua\Documents\2. Capstone Project\capstone\data\hotels\hotels.kml' using driver `KML'
## Simple feature collection with 425 features and 2 fields
## geometry type:  POINT
## dimension:      XYZ
## bbox:           xmin: 103.6351 ymin: 1.245797 xmax: 103.9891 ymax: 1.419278
## z_range:        zmin: 0 zmax: 0
## geographic CRS: WGS 84

hotels_coords <- st_transform(hotels, 3414)  %>% st_coordinates()
hotels_coords <- hotels_coords[,-3]
head(hotels_coords)

##          X        Y
## 1 33146.83 32676.17
## 2 33320.08 32695.04
## 3 30764.81 32415.64
## 4 28156.71 29714.87
## 5 28991.54 31440.85
## 6 30833.05 30656.96

The hotel data comes in a KML file. We read in the file as an sf object and transform the coordinates from WGS84 to SVY21 format. We then extract the coordinates similar to how it was done previously.

3.5 Healthcare facilities

# Read in extracted data on hospitals (healthcare tourism) and convert to an sf object
hospitals <- read_csv("data/hospital_final_distinct.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   Hospital = col_character(),
##   Postcode = col_double(),
##   X = col_double(),
##   Y = col_double(),
##   Latitude = col_double(),
##   Longitude = col_double(),
##   Num_beds = col_double()
## )

glimpse(hospitals)

## Rows: 12
## Columns: 7
## $ Hospital  <chr> "National University Hospital", "Singapore General Hospital"~
## $ Postcode  <dbl> 119074, 169608, 188770, 217562, 228510, 229899, 258500, 3076~
## $ X         <dbl> 22476.40, 27711.00, 30687.33, 30293.80, 28236.02, 29499.53, ~
## $ Y         <dbl> 30755.99, 29154.75, 31503.57, 32752.15, 31944.06, 32531.44, ~
## $ Latitude  <dbl> 1.294420, 1.279940, 1.301182, 1.312473, 1.305165, 1.310477, ~
## $ Longitude <dbl> 119074.0000, 103.8307, 103.8575, 103.8539, 103.8354, 103.846~
## $ Num_beds  <dbl> 1239, 1785, 380, 121, 345, 830, 258, 190, 1500, 333, 106, 304

hospitals_sf <- hospitals %>% st_as_sf(coords = c("X", "Y"), 
                                       crs = 3414)

Hospital information was collated from the web and the coordinates of the hospitals were given in both SVY21 and WGS84 format. We extract the X,Y coordinates from the dataframe.

# Extract X, Y coordinates in matrix form
hospital_coords <- hospitals %>% dplyr::select(X, Y) %>% as.matrix()
hospital_coords

##              X        Y
##  [1,] 22476.40 30755.99
##  [2,] 27711.00 29154.75
##  [3,] 30687.33 31503.57
##  [4,] 30293.80 32752.15
##  [5,] 28236.02 31944.06
##  [6,] 29499.53 32531.44
##  [7,] 26473.81 32194.26
##  [8,] 28905.83 34177.16
##  [9,] 29851.37 33596.15
## [10,] 29213.92 33821.77
## [11,] 36395.56 33041.42
## [12,] 28476.61 35997.47

3.6 Shopping Malls

# Collate coordinate data using OneMap search API

# Load httr and jsonlite (CRAN packages, installed in the set-up chunk)
library(httr)
library(jsonlite)

# Read in list of mall data with postal codes
mallslist <- read_csv("data/shopping_malls.csv", col_types = "ccc")
head(mallslist)

## # A tibble: 6 x 3
##   Mall_Name                      Region  Post_Code
##   <chr>                          <chr>   <chr>    
## 1 100 AM                         Central 079027   
## 2 313 @ Somerset                 Central 238895   
## 3 Aperia                         Central 339509   
## 4 Balestier Hill Shopping Centre Central 300001   
## 5 Bugis Cube                     Central 188735   
## 6 Bugis Junction                 Central 188021

# Initialise empty data frame
allmalls <- data.frame()

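# Function to get coordinates for a postal code from the OneMap search API
onemap_getcoords <- function(pcode) {
  geturl <- paste0("https://developers.onemap.sg/commonapi/search?searchVal=", pcode, "&returnGeom=Y&getAddrDetails=Y")
  response <- GET(geturl)
  # Stop with an informative error if the request failed
  stop_for_status(response)
  # Parse the JSON body into a flat data frame
  rescontent <- content(response, as = "text", encoding = "UTF-8") %>% fromJSON(flatten = TRUE) %>% as.data.frame()
  # Append the results for this postal code to the global allmalls data frame
  allmalls <<- rbind(allmalls, rescontent)
}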

# Test function to get coords
# onemap_getcoords(518457)

# Get all coords by postcode
for (i in seq_along(mallslist$Post_Code)) {
  onemap_getcoords(mallslist$Post_Code[i])
}

# Check allmalls data
head(allmalls)

##   found totalNumPages pageNum results.SEARCHVAL results.BLK_NO
## 1     3             1       1            100 AM            100
## 2     3             1       1      UNION SQUARE            100
## 3     3             1       1   AMARA SINGAPORE            100
## 4     4             1       1 OCBC 313@SOMERSET            313
## 5     4             1       1  UOB 313@SOMERSET            313
## 6     4             1       1    313 @ SOMERSET            313
##   results.ROAD_NAME  results.BUILDING
## 1       TRAS STREET            100 AM
## 2       TRAS STREET      UNION SQUARE
## 3       TRAS STREET   AMARA SINGAPORE
## 4      ORCHARD ROAD OCBC 313@SOMERSET
## 5      ORCHARD ROAD  UOB 313@SOMERSET
## 6      ORCHARD ROAD    313 @ SOMERSET
##                                       results.ADDRESS results.POSTAL
## 1             100 TRAS STREET 100 AM SINGAPORE 079027         079027
## 2       100 TRAS STREET UNION SQUARE SINGAPORE 079027         079027
## 3    100 TRAS STREET AMARA SINGAPORE SINGAPORE 079027         079027
## 4 313 ORCHARD ROAD OCBC 313@SOMERSET SINGAPORE 238895         238895
## 5  313 ORCHARD ROAD UOB 313@SOMERSET SINGAPORE 238895         238895
## 6    313 ORCHARD ROAD 313 @ SOMERSET SINGAPORE 238895         238895
##          results.X        results.Y results.LATITUDE results.LONGITUDE
## 1 29129.8552264441 28563.0121062025 1.27458821795424   103.84347073661
## 2 29134.8516533198 28588.0262943786 1.27481443733105  103.843515632198
## 3 29138.9450046336 28601.3295877781 1.27493474750941  103.843552412915
## 4 28548.4434900116 31484.2152895685 1.30100656917241  103.838246592796
## 5 28485.8709110245 31526.0714011192 1.30138510214714  103.837684350436
## 6 28560.0035594245 31481.0653610379 1.30097808212209  103.838350465208
##   results.LONGTITUDE
## 1    103.84347073661
## 2   103.843515632198
## 3   103.843552412915
## 4   103.838246592796
## 5   103.837684350436
## 6   103.838350465208

# Write response data to mallsall.csv
write_csv(allmalls, "data/mallsall.csv")

# Data wrangle in R - to ARCHIVE AS JOIN DOESN'T WORK WELL ON CHAR VECTORS
# colnames(allmalls1)[7] <- "Mall_Name"
# allmalls2 <- allmalls1 %>% dplyr::select(Mall_Name, results.X, results.Y, results.LATITUDE, results.LONGITUDE)
# malls_data1 <- left_join(mallslist, allmalls2, by="Mall_Name")

A list of shopping malls was collated from Wikipedia (see Table 1). The list was cleaned to include only malls that still exist, and their postal codes were taken from Singapore Post or Google Maps. The above code gets the latitude/longitude and X,Y coordinates (SVY21 format) using the OneMap API. We use the packages httr and jsonlite (both available from CRAN) to query the API and parse the response. As a postal code can return several buildings, we write the response data to a CSV file and map the shopping mall names to the coordinates extracted. The cleaned information is loaded from a new list (shopping_malls_data.csv) and the X,Y coordinates are extracted.
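The name-matching step could in principle be scripted. The sketch below parses a hand-written JSON string shaped like the OneMap output shown above (only a few of its fields are reproduced) and picks the row whose SEARCHVAL matches the mall name, ignoring case. The helper pick_mall() is hypothetical and is not part of the original workflow, which relied on a cleaned CSV.

```r
library(jsonlite)

# Hand-written response text mimicking the structure of the output above
resp_txt <- '{"found":3,"totalNumPages":1,"pageNum":1,"results":[
  {"SEARCHVAL":"100 AM","POSTAL":"079027","X":"29129.8552264441","Y":"28563.0121062025"},
  {"SEARCHVAL":"UNION SQUARE","POSTAL":"079027","X":"29134.8516533198","Y":"28588.0262943786"},
  {"SEARCHVAL":"AMARA SINGAPORE","POSTAL":"079027","X":"29138.9450046336","Y":"28601.3295877781"}]}'

# results is parsed into a data frame with one row per building
res <- fromJSON(resp_txt)$results

# Hypothetical helper: return the first result whose name matches the mall
pick_mall <- function(results, mall_name) {
  hit <- results[toupper(results$SEARCHVAL) == toupper(mall_name), , drop = FALSE]
  if (nrow(hit) == 0) return(NULL)
  data.frame(Mall_Name = mall_name,
             X = as.numeric(hit$X[1]),
             Y = as.numeric(hit$Y[1]))
}

mall_100am <- pick_mall(res, "100 AM")
mall_100am
```

Exact matching would miss names that differ in spacing or punctuation, so a manual review of unmatched malls would still be needed.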

# Load in cleaned data (matching mall names)
malls_data <- read_csv("data/shopping_malls_data.csv", col_types = "cccdddd")
head(malls_data)

## # A tibble: 6 x 7
##   Mall_Name                    Region Post_Code      X      Y Latitude Longitude
##   <chr>                        <chr>  <chr>      <dbl>  <dbl>    <dbl>     <dbl>
## 1 100 AM                       Centr~ 079027    29130. 28563.     1.27      104.
## 2 313 @ Somerset               Centr~ 238895    28560. 31481.     1.30      104.
## 3 Aperia                       Centr~ 339509    31399. 32572.     1.31      104.
## 4 Balestier Hill Shopping Cen~ Centr~ 300001    29021. 34200.     1.33      104.
## 5 Bugis Cube                   Centr~ 188735    30486. 31173.     1.30      104.
## 6 Bugis Junction               Centr~ 188021    30467. 31264.     1.30      104.

malls_coords <- malls_data %>% dplyr::select(X,Y) %>% as.matrix()

3.7 Distance Matrices

From here, we compute the distance matrices of the listings and the various spatial variables, using the gw.dist() function from the GWmodel package.

3.7.1 Train station exits

# Compute distance matrix of train station exits and listings
list_mrt <- gw.dist(dp.locat = trainstatcoords, rp.locat = listing_coords, longlat=FALSE) %>% t()

# Count number of station exits that are within 0.35km, 0.7km of the listing
listings_gwr$mrt_350m <- rowSums(list_mrt <= 350)
listings_gwr$mrt_700m <- rowSums(list_mrt <= 700)
head(listings_gwr)

## Simple feature collection with 6 features and 28 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 22646.02 ymin: 34950.06 xmax: 42212.88 ymax: 47135.4
## projected CRS:  SVY21 / Singapore TM
## # A tibble: 6 x 29
##   id    name    host_id host_name neighbourhood_g~ neighbourhood room_type price
##   <chr> <chr>   <chr>   <chr>     <chr>            <chr>         <chr>     <dbl>
## 1 49091 COZICO~ 266763  Francesca North Region     Woodlands     Private ~    84
## 2 50646 Pleasa~ 227796  Sujatha   Central Region   Bukit Timah   Private ~    80
## 3 56334 COZICO~ 266763  Francesca North Region     Woodlands     Private ~    70
## 4 71609 Ensuit~ 367042  Belinda   East Region      Tampines      Private ~   167
## 5 71896 B&B  R~ 367042  Belinda   East Region      Tampines      Private ~    95
## 6 71903 Room 2~ 367042  Belinda   East Region      Tampines      Private ~    84
## # ... with 21 more variables: minimum_nights <dbl>, number_of_reviews <dbl>,
## #   last_review <date>, reviews_per_month <dbl>,
## #   calculated_host_listings_count <dbl>, availability_365 <dbl>,
## #   geometry <POINT [m]>, listing_url <chr>, host_is_superhost <lgl>,
## #   latitude <dbl>, longitude <dbl>, property_type <chr>, bathrooms <dbl>,
## #   bedrooms <dbl>, cleaning_fee <dbl>, guests_included <dbl>,
## #   review_scores_rating <dbl>, cancellation_policy <chr>, total_price <dbl>,
## #   mrt_350m <dbl>, mrt_700m <dbl>

The above code computes the distance matrix between the listings and the MRT station exits in Singapore. We then count the number of station exits within 350m and 700m of each listing and store the counts as variables in the listings_gwr object.
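For projected coordinates such as SVY21, the distances involved are plain Euclidean distances in metres. The following base R sketch illustrates the counting logic on made-up coordinates; euclid_dist() is a hypothetical helper written for this illustration and is not the GWmodel function used above.

```r
# Euclidean distances between two sets of projected (X, Y) coordinates;
# rows index `from`, columns index `to`
euclid_dist <- function(from, to) {
  dx <- outer(from[, 1], to[, 1], "-")
  dy <- outer(from[, 2], to[, 2], "-")
  sqrt(dx^2 + dy^2)
}

# Made-up coordinates in metres: 2 listings and 3 station exits
toy_listings <- rbind(c(0, 0), c(1000, 0))
toy_exits    <- rbind(c(400, 0), c(0, 600), c(1000, 100))

# One row per listing, one column per exit
toy_d <- euclid_dist(toy_listings, toy_exits)

# A logical matrix summed by row gives the number of exits within range
exits_350m <- rowSums(toy_d <= 350)
exits_700m <- rowSums(toy_d <= 700)
```

Here the first listing has no exit within 350m but two within 700m, while the second has one within 350m and two within 700m.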

3.7.2 Tourist attractions

# Compute distance matrix of tourist attractions and listings
list_tour <- gw.dist(dp.locat = tourattr_coords, rp.locat = listing_coords, longlat=FALSE) %>% t()

# Compute distance index of tourist attractions (sum of inverse distances per listing)
listings_gwr$tourdistindex <- rowSums(1 / list_tour)
head(listings_gwr)

## Simple feature collection with 6 features and 29 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 22646.02 ymin: 34950.06 xmax: 42212.88 ymax: 47135.4
## projected CRS:  SVY21 / Singapore TM
## # A tibble: 6 x 30
##   id    name    host_id host_name neighbourhood_g~ neighbourhood room_type price
##   <chr> <chr>   <chr>   <chr>     <chr>            <chr>         <chr>     <dbl>
## 1 49091 COZICO~ 266763  Francesca North Region     Woodlands     Private ~    84
## 2 50646 Pleasa~ 227796  Sujatha   Central Region   Bukit Timah   Private ~    80
## 3 56334 COZICO~ 266763  Francesca North Region     Woodlands     Private ~    70
## 4 71609 Ensuit~ 367042  Belinda   East Region      Tampines      Private ~   167
## 5 71896 B&B  R~ 367042  Belinda   East Region      Tampines      Private ~    95
## 6 71903 Room 2~ 367042  Belinda   East Region      Tampines      Private ~    84
## # ... with 22 more variables: minimum_nights <dbl>, number_of_reviews <dbl>,
## #   last_review <date>, reviews_per_month <dbl>,
## #   calculated_host_listings_count <dbl>, availability_365 <dbl>,
## #   geometry <POINT [m]>, listing_url <chr>, host_is_superhost <lgl>,
## #   latitude <dbl>, longitude <dbl>, property_type <chr>, bathrooms <dbl>,
## #   bedrooms <dbl>, cleaning_fee <dbl>, guests_included <dbl>,
## #   review_scores_rating <dbl>, cancellation_policy <chr>, total_price <dbl>,
## #   mrt_350m <dbl>, mrt_700m <dbl>, tourdistindex <dbl>

The above code computes the distance matrix between the listings and the tourist attractions in Singapore. Using a similar approach to Hong & Yoo (2020), we compute a distance index for each listing as the sum of the inverse distances from that listing to all the tourist attractions. The index therefore reflects the overall accessibility of the various tourist attractions from the listing: the closer a listing is to many attractions, the higher its index.
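The index can be illustrated on a small made-up distance matrix. The sketch also shows one possible guard against zero distances, which would otherwise produce Inf; the 1m floor is an assumption for illustration and is not applied in this report.

```r
# Made-up listing-by-attraction distances in metres (2 listings, 3 attractions)
toy_tour_dist <- rbind(c(100, 200, 400),
                       c(1000, 2000, 4000))

# Sum of inverse distances per listing
toy_index <- rowSums(1 / toy_tour_dist)

# Optional guard: floor distances at 1m (assumed value) before inverting
toy_index_safe <- rowSums(1 / pmax(toy_tour_dist, 1))
toy_index
```

The first listing is ten times closer to every attraction than the second, so its index is ten times larger.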

3.7.3 Gazetted hotels

# Compute distance matrix of gazetted hotels and listings
list_hotels <- gw.dist(dp.locat = hotels_coords, rp.locat = listing_coords, longlat=FALSE) %>% t()

# Count number of hotels that are within 0.25km, 0.5km, 1km, 2km of the listing
listings_gwr$hotels_250m <- rowSums(list_hotels <= 250)
listings_gwr$hotels_500m <- rowSums(list_hotels <= 500)
listings_gwr$hotels_1000m <- rowSums(list_hotels <= 1000)
listings_gwr$hotels_2000m <- rowSums(list_hotels <= 2000)
head(listings_gwr)

## Simple feature collection with 6 features and 33 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 22646.02 ymin: 34950.06 xmax: 42212.88 ymax: 47135.4
## projected CRS:  SVY21 / Singapore TM
## # A tibble: 6 x 34
##   id    name    host_id host_name neighbourhood_g~ neighbourhood room_type price
##   <chr> <chr>   <chr>   <chr>     <chr>            <chr>         <chr>     <dbl>
## 1 49091 COZICO~ 266763  Francesca North Region     Woodlands     Private ~    84
## 2 50646 Pleasa~ 227796  Sujatha   Central Region   Bukit Timah   Private ~    80
## 3 56334 COZICO~ 266763  Francesca North Region     Woodlands     Private ~    70
## 4 71609 Ensuit~ 367042  Belinda   East Region      Tampines      Private ~   167
## 5 71896 B&B  R~ 367042  Belinda   East Region      Tampines      Private ~    95
## 6 71903 Room 2~ 367042  Belinda   East Region      Tampines      Private ~    84
## # ... with 26 more variables: minimum_nights <dbl>, number_of_reviews <dbl>,
## #   last_review <date>, reviews_per_month <dbl>,
## #   calculated_host_listings_count <dbl>, availability_365 <dbl>,
## #   geometry <POINT [m]>, listing_url <chr>, host_is_superhost <lgl>,
## #   latitude <dbl>, longitude <dbl>, property_type <chr>, bathrooms <dbl>,
## #   bedrooms <dbl>, cleaning_fee <dbl>, guests_included <dbl>,
## #   review_scores_rating <dbl>, cancellation_policy <chr>, total_price <dbl>,
## #   mrt_350m <dbl>, mrt_700m <dbl>, tourdistindex <dbl>, hotels_250m <dbl>,
## #   hotels_500m <dbl>, hotels_1000m <dbl>, hotels_2000m <dbl>

The above code computes the distance matrix between the listings and the gazetted hotels in Singapore. We then count the number of hotels within 250m, 500m, 1km and 2km of each listing and store the counts as variables in the listings_gwr object.
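When several thresholds are needed, the counts can also be generated in one pass rather than one line per threshold. This is a sketch on a made-up distance matrix; the column naming simply mirrors the variables used above.

```r
# Made-up listing-by-hotel distances in metres (2 listings, 3 hotels)
toy_hotel_dist <- rbind(c(100, 300, 900),
                        c(600, 1500, 2500))
thresholds <- c(250, 500, 1000, 2000)

# One column of counts per threshold, one row per listing
toy_counts <- sapply(thresholds, function(d) rowSums(toy_hotel_dist <= d))
colnames(toy_counts) <- paste0("hotels_", thresholds, "m")
toy_counts
```

The resulting matrix could then be bound to the listings table with cbind(), at the cost of slightly less explicit code.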

3.7.4 Healthcare facilities

# Compute distance matrix of hospitals and listings
list_hospital <- gw.dist(dp.locat = hospital_coords, rp.locat = listing_coords, longlat=FALSE) %>% t()

# Count number of hospitals that are within 0.5km, 1km, 2km, 5km of the listing
listings_gwr$hosp_500m <- rowSums(list_hospital <= 500)
listings_gwr$hosp_1000m <- rowSums(list_hospital <= 1000)
listings_gwr$hosp_2000m <- rowSums(list_hospital <= 2000)
listings_gwr$hosp_5000m <- rowSums(list_hospital <= 5000)
head(listings_gwr)

## Simple feature collection with 6 features and 37 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 22646.02 ymin: 34950.06 xmax: 42212.88 ymax: 47135.4
## projected CRS:  SVY21 / Singapore TM
## # A tibble: 6 x 38
##   id    name    host_id host_name neighbourhood_g~ neighbourhood room_type price
##   <chr> <chr>   <chr>   <chr>     <chr>            <chr>         <chr>     <dbl>
## 1 49091 COZICO~ 266763  Francesca North Region     Woodlands     Private ~    84
## 2 50646 Pleasa~ 227796  Sujatha   Central Region   Bukit Timah   Private ~    80
## 3 56334 COZICO~ 266763  Francesca North Region     Woodlands     Private ~    70
## 4 71609 Ensuit~ 367042  Belinda   East Region      Tampines      Private ~   167
## 5 71896 B&B  R~ 367042  Belinda   East Region      Tampines      Private ~    95
## 6 71903 Room 2~ 367042  Belinda   East Region      Tampines      Private ~    84
## # ... with 30 more variables: minimum_nights <dbl>, number_of_reviews <dbl>,
## #   last_review <date>, reviews_per_month <dbl>,
## #   calculated_host_listings_count <dbl>, availability_365 <dbl>,
## #   geometry <POINT [m]>, listing_url <chr>, host_is_superhost <lgl>,
## #   latitude <dbl>, longitude <dbl>, property_type <chr>, bathrooms <dbl>,
## #   bedrooms <dbl>, cleaning_fee <dbl>, guests_included <dbl>,
## #   review_scores_rating <dbl>, cancellation_policy <chr>, total_price <dbl>,
## #   mrt_350m <dbl>, mrt_700m <dbl>, tourdistindex <dbl>, hotels_250m <dbl>,
## #   hotels_500m <dbl>, hotels_1000m <dbl>, hotels_2000m <dbl>, hosp_500m <dbl>,
## #   hosp_1000m <dbl>, hosp_2000m <dbl>, hosp_5000m <dbl>

The above code computes the distance matrix between the listings and specific hospitals in Singapore. As hospitals are typically further away from residential areas, we count the hospitals within 500m, 1km, 2km and 5km of each listing and store the counts as variables in the listings_gwr object.
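The threshold counts can be sketched in base R as follows. This is only an illustration under stated assumptions: the coordinates are made-up values in metres, and euclid_dist() and count_within() are hypothetical helpers that are not part of GWmodel. It assumes plain Euclidean distance on projected coordinates, which is what we expect gw.dist() to return with longlat=FALSE.

```r
# Toy coordinates in metres (hypothetical values, not from the dataset)
listing_xy  <- matrix(c(0,    0,
                        3000, 0), ncol = 2, byrow = TRUE)
facility_xy <- matrix(c(400,  0,
                        0,    900,
                        3000, 1500), ncol = 2, byrow = TRUE)

# Hypothetical helper: Euclidean distance matrix,
# one row per listing and one column per facility
euclid_dist <- function(from, to) {
  dx <- outer(from[, 1], to[, 1], "-")
  dy <- outer(from[, 2], to[, 2], "-")
  sqrt(dx^2 + dy^2)
}

# Hypothetical helper: count facilities within each threshold;
# returns a listings x thresholds matrix with named columns
count_within <- function(dist_mat, thresholds, prefix) {
  counts <- sapply(thresholds, function(d) rowSums(dist_mat <= d))
  counts <- matrix(counts, nrow = nrow(dist_mat))
  colnames(counts) <- paste0(prefix, "_", thresholds, "m")
  counts
}

d <- euclid_dist(listing_xy, facility_xy)
hosp_counts <- count_within(d, c(500, 1000, 2000, 5000), "hosp")
print(hosp_counts)
```

Each column of hosp_counts plays the role of one of the hosp_* variables; a helper of this kind would avoid repeating one rowSums() line per threshold.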

3.7.5 Shopping Malls

# Get malls coords as matrix and calculate distance matrix to listing coords
malls_coords <- malls_data %>% dplyr::select(X,Y) %>% as.matrix()
list_malls <- gw.dist(dp.locat = malls_coords, rp.locat = listing_coords, longlat=FALSE) %>% t()

# Keep only malls in the Orchard / CBD area (denoted by post codes starting with 01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 17, 18, 19, 20, 21, 22, 23) and compute a distance matrix for the selected malls
malls_major <- malls_data %>% filter((Post_Code < 110000) | between(Post_Code, 170000, 240000)) 
malls_major_coords <- malls_major %>% dplyr::select(X,Y) %>% as.matrix()
list_malls_major <- gw.dist(dp.locat=malls_major_coords, rp.locat = listing_coords, longlat=FALSE) %>% t()

# Count number of malls that are within 0.5km, 1km, 2km of the listing
listings_gwr$mall_500m <- rowSums(list_malls <= 500)
listings_gwr$mall_1000m <- rowSums(list_malls <= 1000)
listings_gwr$mall_2000m <- rowSums(list_malls <= 2000)

# Compute distance index of Orchard Road / CBD malls (sum of inverse distances)
# rowSums() on the matrix replaces the deprecated dplyr::mutate_each() call and gives the same values
listings_gwr$malldistindex <- rowSums(1 / list_malls_major)
head(listings_gwr)

## Simple feature collection with 6 features and 41 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 22646.02 ymin: 34950.06 xmax: 42212.88 ymax: 47135.4
## projected CRS:  SVY21 / Singapore TM
## # A tibble: 6 x 42
##   id    name    host_id host_name neighbourhood_g~ neighbourhood room_type price
##   <chr> <chr>   <chr>   <chr>     <chr>            <chr>         <chr>     <dbl>
## 1 49091 COZICO~ 266763  Francesca North Region     Woodlands     Private ~    84
## 2 50646 Pleasa~ 227796  Sujatha   Central Region   Bukit Timah   Private ~    80
## 3 56334 COZICO~ 266763  Francesca North Region     Woodlands     Private ~    70
## 4 71609 Ensuit~ 367042  Belinda   East Region      Tampines      Private ~   167
## 5 71896 B&B  R~ 367042  Belinda   East Region      Tampines      Private ~    95
## 6 71903 Room 2~ 367042  Belinda   East Region      Tampines      Private ~    84
## # ... with 34 more variables: minimum_nights <dbl>, number_of_reviews <dbl>,
## #   last_review <date>, reviews_per_month <dbl>,
## #   calculated_host_listings_count <dbl>, availability_365 <dbl>,
## #   geometry <POINT [m]>, listing_url <chr>, host_is_superhost <lgl>,
## #   latitude <dbl>, longitude <dbl>, property_type <chr>, bathrooms <dbl>,
## #   bedrooms <dbl>, cleaning_fee <dbl>, guests_included <dbl>,
## #   review_scores_rating <dbl>, cancellation_policy <chr>, total_price <dbl>,
## #   mrt_350m <dbl>, mrt_700m <dbl>, tourdistindex <dbl>, hotels_250m <dbl>,
## #   hotels_500m <dbl>, hotels_1000m <dbl>, hotels_2000m <dbl>, hosp_500m <dbl>,
## #   hosp_1000m <dbl>, hosp_2000m <dbl>, hosp_5000m <dbl>, mall_500m <dbl>,
## #   mall_1000m <dbl>, mall_2000m <dbl>, malldistindex <dbl>

The above code computes the distance matrix between the listings and the shopping malls in Singapore. There are 2 types of variables to consider. Firstly, we count the malls within a set distance of each listing (500m, 1km and 2km). Secondly, as shopping malls in the central districts are typically attractions in their own right (e.g. The Shoppes at Marina Bay Sands, malls on Orchard Road), we select the malls in the central districts (by filtering on postal codes) and compute a distance index from each listing to those malls, defined as the sum of the inverse distances.
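As a minimal sketch of the second variable, the example below selects central malls by postal sector (the first two digits of the 6-digit postal code) and sums the inverse distances. The mall names, postal codes and coordinates are hypothetical toy values, and the sector-based selection is an assumed alternative that is approximately equivalent to the numeric range filter above, not the code we ran.

```r
# Toy mall table (hypothetical names, postal codes and coordinates in metres)
malls <- data.frame(
  name      = c("A", "B", "C"),
  Post_Code = c(18956, 238801, 529510),  # a leading zero is dropped when stored as a number
  X         = c(1000, 2000, 9000),
  Y         = c(0, 0, 0)
)

# Postal sector = first two digits of the 6-digit postal code
sector  <- malls$Post_Code %/% 10000
central <- malls[sector %in% c(1:10, 17:23), ]

# Two toy listings (metres)
listing_xy <- matrix(c(0,    0,
                       1500, 0), ncol = 2, byrow = TRUE)

# Euclidean distances: one row per listing, one column per central mall
dx <- outer(listing_xy[, 1], central$X, "-")
dy <- outer(listing_xy[, 2], central$Y, "-")
d  <- sqrt(dx^2 + dy^2)

# Distance index: sum of inverse distances, so nearer malls contribute more
malldistindex <- rowSums(1 / d)
print(malldistindex)
```

A listing that sits exactly on a mall location would have a distance of 0 and therefore an infinite index, so it may be worth checking for zeros in the distance matrix before taking the inverse.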

3.7.6 Dummy Variables

We compute dummy variables from categorical data to be used in our regression models.

# Change any NA in host_is_superhost to FALSE
listings_gwr$host_is_superhost[is.na(listings_gwr$host_is_superhost)] <- FALSE
listings_gwr$superhost <- ifelse(listings_gwr$host_is_superhost == TRUE, 1,0)

First we impute FALSE for any NA values in host_is_superhost. We then use a dummy variable, superhost, to represent whether the host is a superhost (1 if the host is a superhost, 0 otherwise).

# Create dummy variables for room type
listings_gwr$entire <- ifelse(listings_gwr$room_type == "Entire home/apt", 1,0)
listings_gwr$private <- ifelse(listings_gwr$room_type == "Private room", 1,0)
listings_gwr$hotel <- ifelse(listings_gwr$room_type == "Hotel room", 1,0)
listings_gwr$shared <- ifelse(listings_gwr$room_type == "Shared room", 1,0)

# Create dummy variable for cancellation policy
listings_gwr$flexible <- ifelse(listings_gwr$cancellation_policy == "flexible", 1,0)
listings_gwr$moderate <- ifelse(listings_gwr$cancellation_policy == "moderate", 1,0)

We then create dummy variables for the room types: entire, private, hotel, shared. Although only 3 of these variables are needed in a model, creating all 4 lets us choose the reference category later (e.g. if we use shared, private and entire, then hotel is the reference category). We also create 2 dummy variables for the cancellation policy: flexible and moderate, with the strict policies as the reference category.
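The choice of reference category can also be illustrated with base R factors. This is a sketch on a made-up vector of room types, not the approach used in our code: relevel() sets the reference level and model.matrix() builds the remaining 3 dummies. Including all 4 dummies together with an intercept would make the columns perfectly collinear, which is why one category is always left out.

```r
# Toy room types (hypothetical sample, same labels as room_type)
room_type <- c("Entire home/apt", "Private room", "Hotel room",
               "Shared room", "Private room")

# Make "Hotel room" the reference level
room_f <- relevel(factor(room_type), ref = "Hotel room")

# Build the design matrix and drop the intercept column:
# 3 dummy columns remain, one per non-reference level
dummies <- model.matrix(~ room_f)[, -1]
print(colnames(dummies))

# The hotel room row (row 3) is 0 in every dummy column
print(dummies[3, ])
```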

3.8 Final data for regression

# Jitter datapoints
listings_gwr_jitter <- st_jitter(listings_gwr)

# Convert sf dataframe to Spatial Dataframe
listings_gwr1 <- listings_gwr_jitter %>% as_Spatial()

## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj =
## prefer_proj): Discarded datum Unknown based on WGS84 ellipsoid in CRS definition

## Warning in showSRID(SRS_string, format = "PROJ", multiline = "NO", prefer_proj =
## prefer_proj): Discarded datum SVY21 in CRS definition

proj4string(listings_gwr1) <- CRS("+init=epsg:3414")

## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj =
## prefer_proj): Discarded datum SVY21 in CRS definition

## Warning in proj4string(obj): CRS object has comment, which is lost in output

To ensure that no two listings share exactly the same coordinates, we jitter the datapoints using the st_jitter() function from the sf package. As GWmodel works with sp objects, we convert the sf dataframe into a SpatialPointsDataFrame (spdf) object, and ensure that it has the right CRS (SVY21, EPSG:3414).
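The effect of jittering can be checked with a small base R sketch. Note that this swaps in uniform random noise from runif() as a stand-in for st_jitter(); the coordinates and the 1m offset are hypothetical and chosen only for illustration.

```r
set.seed(1)  # make the toy example reproducible

# Toy projected coordinates in metres; the first two points coincide
xy <- matrix(c(100, 200,
               100, 200,
               300, 400), ncol = 2, byrow = TRUE)

# Number of rows that repeat an earlier coordinate pair
n_dup_before <- sum(duplicated(xy))

# Stand-in for st_jitter(): shift each coordinate by up to 1m in x and y
xy_jit <- xy + matrix(runif(length(xy), min = -1, max = 1), ncol = 2)

n_dup_after <- sum(duplicated(xy_jit))

# Largest displacement introduced by the jitter, in metres
max_shift <- max(sqrt(rowSums((xy_jit - xy)^2)))

print(c(before = n_dup_before, after = n_dup_after))
print(max_shift)
```

Comparing the displacement with the smallest distance threshold used for the proximity variables gives a sense of whether the jitter could change any of the counts computed earlier.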

# Write listings_gwr to RData for use in GWR
save(listings_gwr1, file="data/listings_gwr1_v3.RData")

We then save the object listings_gwr1 as an RData file to be used in our modelling.

3.8.1 Smaller dataset

# Filter out only listings with reviews
listings_gwr2 <- listings_gwr_jitter %>% filter(number_of_reviews >0)

# Convert sf dataframe to Spatial Dataframe
listings_gwr3 <- listings_gwr2 %>% as_Spatial() 
proj4string(listings_gwr3) <- CRS("+init=epsg:3414")
save(listings_gwr3, file="data/listings_gwr3_v1.RData")

We also extract a smaller dataset containing only listings with at least one review, as a proxy for listings that have actually been rented out. This gives us 4,466 listings (from the 7,272 listings in listings_gwr1); we save this object as listings_gwr3 for use in our modelling as well.

Appendix 2 - GWR Data Collation and Wrangling

Geospatial Analysis of Airbnb in Singapore

Clara Chua

25/04/2020 (updated: 2021-06-09)