packages = c('tidyverse', 'knitr', 'rgdal', 'maptools', 'sf','raster','spatstat', 'tmap','tmaptools', 'gridExtra', 'leaflet', 'OpenStreetMap')
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}

1 Setting up

We load the data saved previously on the previous markdown document, which consists of the listings, cleaned listings, neighbourhoods table and the neighbourhoods map.

load("basedata.RData")

2 Spatial Point Pattern Analysis

Spatial Point Pattern analysis is the study of the location of events generated by a point process within a bounded area of interest. A point process is a stochastic process where the locations of some events of interest are observed within a bounded area. It can be used in ecology and agriculture to study the spatial distribution of different species of plants and crops, or in epidemiology, where the growth of a disease can be studied spatially. We can determine from Spatial Point Pattern analysis whether events (e.g. cases of disease) are clustered, or whether different plant species compete with each other within a specified region. First order properties of point patterns measure the distribution of events in the study region, namely the intensity and spatial density. Kernel Density Estimation is one method of looking at the first order properties of a point pattern. Second order properties measure the interaction between 2 points or observations, e.g. seed dispersal are dependent on the presence of other trees or genus.

In this study, we use Spatial Point Pattern analysis to discover if Airbnb listings are clustered in Singapore and where they occur. We use the K-function to test whether there is clustering or competition between listings, and the Kernel Density Estimation to ascertain where there are high intensities or densities of listings.

2.1 Data Wrangling for SPPA

We will be using the package spatstat for the analysis of the point patterns. spatstat uses ppp objects to store point patterns and owin objects to store observation windows which is the region of interest. We use the maptools package to convert SpatialPoints or SpatialPointsDataFrames to ppp objects and SpatialPolygons to owin objects.

As some of the other coordinates are in SVY21 format, we will need to convert the cleaned listings dataframe to SVY21 and SpatialPoints format, and add a projection for use in spatstat. Once done, we can then convert them into ppp and owin formats for spatstat to compute the spatial point patterns and density.

# listings to be converted from the listings_sf dataframe to SVY21 format (otherwise it will be out of bounds when using Spatstat) and converting to SpatialPoints (sp)
#Remove listings from neighbourhoods that are not zoned for residential

listings_clean <- st_transform(listings_clean, 3414)
nonreslisting <- c("Tuas", "Pioneer")
listings_clean <- listings_clean %>% filter(!neighbourhood %in% nonreslisting)
listings_sp <- listings_clean %>% dplyr::select(geometry, room_type) %>% as(., Class = "Spatial")

## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj =
## prefer_proj): Discarded datum Unknown based on WGS84 ellipsoid in CRS definition

## Warning in showSRID(SRS_string, format = "PROJ", multiline = "NO", prefer_proj =
## prefer_proj): Discarded datum SVY21 in CRS definition

summary(listings_sp)

## Object of class SpatialPointsDataFrame
## Coordinates:
##                min      max
## coords.x1 11515.98 43401.32
## coords.x2 25166.35 48466.72
## Is projected: TRUE 
## proj4string :
## [+proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1
## +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +units=m +no_defs]
## Number of points: 7286
## Data attributes:
##   room_type        
##  Length:7286       
##  Class :character  
##  Mode  :character

plot(listings_sp)

From the summary above, we can see that the listings data has been transformed to SpatialPoints with the right projections.

2.1.1 Wrangling subzone map

Instead of using the neighbourhood map provided from InsideAirbnb, we use the master plan subzone from URA as this gives greater details to the various areas. We also use the read_osm() function from the OpenStreetMap package to load the background map as a raster in tmap’s plot mode. We load this in sf format for mapping purposes; we also assign the crs projection and convert it to a SpatialPolygons object for use in spatstat.

We check to ensure that the geometry is valid and that the crs of the subzone map is attributed correctly.

# Read in URA's master planning subzones in SF to plot the polygon layer
# Read in OSM raster of listings data for plot view
mpsz_sf <- st_read(dsn = "data/MP14_SUBZONE_WEB_PL.shp", layer="MP14_SUBZONE_WEB_PL")

## Reading layer `MP14_SUBZONE_WEB_PL' from data source `C:\Users\clarachua\Documents\2. Capstone Project\capstone\data\MP14_SUBZONE_WEB_PL.shp' using driver `ESRI Shapefile'
## Simple feature collection with 323 features and 15 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
## projected CRS:  SVY21

sg_osm <- read_osm(listings_clean, ext=1.3)

# Ensure that geometry is valid and check crs of subzone map
mpsz_sf <- st_make_valid(mpsz_sf)
all(st_is_valid(mpsz_sf))

## [1] TRUE

crs(mpsz_sf)

## CRS arguments:
##  +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1
## +x_0=28001.642 +y_0=38744.572 +datum=WGS84 +units=m +no_defs

We use the cleaned data from the previous section and do additional cleaning of the subzone map to exclude areas that are not zoned for residential purposes (e.g. Industrial areas, Airports, military / other zones) and any subzones that do not have any listings. We use the st_join function to join the listings data frame that fall within the subzone map, and filter out areas in the subzone map that are empty.

mpsz_sf2 <- mpsz_sf %>% dplyr::select(SUBZONE_N, REGION_N)
mpsz_sf2 <- st_transform(mpsz_sf2, st_crs(listings_clean))
listings_sf2 <- listings_clean %>% dplyr::select(neighbourhood, room_type)
listings_subzones <- st_join(listings_sf2, mpsz_sf2, prepared=TRUE, join=st_within)
subzone_list <- unique(listings_subzones$SUBZONE_N)
mpsz_sf3 <- mpsz_sf %>% filter(SUBZONE_N %in% subzone_list)
plot(mpsz_sf3["SUBZONE_N"])

2.1.2 Splitting data and map into sub-regions

As we have a relatively large sets of data points to analyse, we split the map of Singapore into the 5 sub-regions to make it computationally faster for the second order analysis.

The following code splits the subzone map into a list of sub-regions that we can call up.

# Create list of regions to iterate over
region_list <- unique(listings_subzones$REGION_N)

# Create a store of region SF
region_store_sf <- list()

# Create list of sub-regions
for (i in seq_along(region_list)) {
  region_name <- paste(tolower(word(region_list[i],1)))
  region_name <- gsub("-", "", region_name)
  var_name <- paste("sf", region_name, sep="_")
  region_store_sf[[region_name]] <- mpsz_sf3[mpsz_sf3$REGION_N == region_list[i],c("SUBZONE_N", "PLN_AREA_N")]
}

# Test region_store
plot(region_store_sf$northeast)

We then convert the subregion maps into a list of owins for use in the analysis. We also dissolve all subzone boundaries in the subregion using the st_union() function to make the computation of the analysis quicker.

# Convert to owin store, with subzone boundaries dissolved
region_store_owin <- list()

for (i in seq_along(region_list)) {
  region_name <- paste(tolower(word(region_list[i],1))) %>% gsub("-", "", .)
  var_name <- paste("sf", region_name, sep="_")
  region_store_owin[[region_name]] <- mpsz_sf3[mpsz_sf3$REGION_N == region_list[i],3] %>% st_union(.) %>% as(., Class = "Spatial") %>% as(., "owin")
}

# Plot owin to check
plot(region_store_owin$northeast)

Similarly we create a ppp object of the listings using the as.ppp() function from spatstat and add marks by room type. Next, we create a jittered ppp to ensure that there is no overlap of the points.

# create ppp in owin objects 
listings_ppp <- as.ppp(listings_sp)
marks(listings_ppp) <- factor(listings_clean$room_type)
listing_ppp_jit <- rjitter(listings_ppp, retry = TRUE, nsim = 1, drop = TRUE)

We then create a list of ppp objects split by sub-regions, then by room type in the form of ppp_store$region$room_type. Hence we would call all the listings in the east by ppp_store4$east$all or only the private rooms in the central region by ppp_store4$central$private. These are also rescaled to ‘km’ for easier interpretation when graphed. We then save the ppp store into an RData file for use in running the tests in a separate file.

# Create ppp in owin objects split by sub-regions, rescaled to km
ppp_east <- listing_ppp_jit[region_store_owin$east] %>% rescale(., 1000, "km")
ppp_central <- listing_ppp_jit[region_store_owin$central] %>% rescale(., 1000, "km")
ppp_west <- listing_ppp_jit[region_store_owin$west] %>% rescale(., 1000, "km")
ppp_north <- listing_ppp_jit[region_store_owin$north] %>% rescale(., 1000, "km")
ppp_northeast <- listing_ppp_jit[region_store_owin$northeast] %>% rescale(., 1000, "km")

# create object store split by region, then by room type of form ppp_store$region$roomtype
types_room <- unique(listings_sf$room_type)
ppp_store4 <- list()
for (i in seq_along(region_list)) {
  region_name <- paste(tolower(word(region_list[i],1))) %>% gsub("-", "", .)
  ppp_store4[[region_name]] <- list()
  for (j in seq_along(types_room)) {
    room_name <- paste(tolower(word(types_room[j],1)))
    var_name <- paste("ppp", region_name, sep="_")
    ppp_store4[[region_name]][["all"]] <- eval(as.name(var_name))
    ppp_store4[[region_name]][[room_name]] <- ppp_store4[[region_name]][["all"]] [ppp_store4[[region_name]][["all"]][["marks"]]==types_room[j]]
  }
}

# Saving data to run SPPA tests
save(ppp_store4, file = "ppp_store4.RData")

2.2 Second order analysis (K-test)

Second-order analysis looks at the strength and type of interactions between events in the point process (R.S. Bivand et al, 2013). We use this to look for clustering and competition between listings within the geographic study area. One of the ways to measure this is via the K-function as defined by Ripley, 1976. Ripley’s K-function measures the number of events found up to a given distance of any particular event. It can be used to summarize a point pattern, test hypotheses and estimate parameters.

The K-function is \[K(t) = \lambda^{-1} E[number\: of\: extra\:events\:within\:distance\:t\:of\:a\:randomly\:chosen\:event]\].

We test the hypothesis that the point processes exhibit Complete Spatial Randomness (CSR) i.e. it follows a homogeneous Poisson process. This is given by the equation $K(t) = \pi t^2$. We use the Ripley edge correction to account for any edge effects. Edge effects arise because boundaries of a study area are arbitrary and points outside the boundary are not counted. The estimator is given by:

\[\hat{K}(t) = \hat\lambda^{-1} \sum_{i} \sum_{j \neq i} w(l_{i}, l_{j})^{-1} \frac{I(d_{ij} < t)}{N} \]

where $d_{ij}$ is the distance between the ith and jth points, and I(x) is the indicator function with the value 1 if x is true and 0 otherwise. The edge correction is given by the weight function w(li, lj). It has the value of 1 when the circle centered at li and passing through the point Ij (with radius $d_{ij}$) is completely inside the study area. If part of the circle falls outside the study area, then w(li,lj) is the proportion of the circumference of that circle that falls in the study area. The effects of edge corrections are more important for large t because large circles are more likely to be outside the study area.

spatstat provides a function Kest() to compute an unbiased estimate of K(r) and we can run a series of Monte Carlo simulations using the envelope() function to compare it to the theoretical value. We use this to test for the null hypothesis of CSR. We would reject the null hypothesis if simulated values fall outside the bounds of the theoretical values. Values higher than the upper bound of theoretical values would imply clustered processes whilst values lower than the lower bound of theoretical values would imply competitive processes. spatstat also offers an alternative to the Kest() function for large pattern of points using a Fast Fourier Transform Kest.fft().

Methodology

We explored both ways of obtaining K-test results - using the Kest() and the Kest.fft() functions. As it was computationally intensive to run the Kest() function for the whole of Singapore, we split the analysis by the 5 different subregions (East, West, Central, North and Northwest). We then ran the Fast Fourier Transform version Kest.fft() for the whole of Singapore. We also compared the results from both versions of the K-tests for the Central region using the same parameters; the Central region was used as it had the most number of data points for all room types.

2.2.1 K-Function tests by subregions

For brevity, we will show the code and results for one subregion. The rest of the code can be found in Appendix 1 and results there were saved into Rdata files for use in this analysis. We used the Ripley correction, and ran 99 simulations using the envelope() function from spatstat.

2.2.1.1 Sample code

# K-function test for East Region
Ktest_east_hotel <- Kest(ppp_store4$east$hotel, correction = "Ripley")
Ktest_east_hotel.csr <- envelope(ppp_store4$east$hotel, Kest, nsim=99, rank=1, glocal=TRUE)

Ktest_east_entire <- Kest(ppp_store4$east$entire, correction = "Ripley")
Ktest_east_entire.csr <- envelope(ppp_store4$east$entire, Kest, nsim=99, rank=1, glocal=TRUE)

Ktest_east_private <- Kest(ppp_store4$east$private, correction = "Ripley")
Ktest_east_private.csr <- envelope(ppp_store4$east$private, Kest, nsim=99, rank=1, glocal=TRUE)

Ktest_east_shared <- Kest(ppp_store4$east$shared, correction = "Ripley")
Ktest_east_shared.csr <- envelope(ppp_store4$east$shared, Kest, nsim=99, rank=1, glocal=TRUE)

2.2.1.2 Loading Results

The results from the K-test for the different regions were saved in the form ‘Ktest_[subregion].RData’. The following code loads the RData files of that form into the global environment.

Ktestfiles = as.list(dir(pattern="Ktest__*"))
lapply(Ktestfiles,load,.GlobalEnv)

2.2.1.3 East Region

The following shows a table of the room types in the East region - there are very few hotel and shared room listings. As such, it is not possible make a proper inference from the K-test. The figure below plots the envelope simulations and observed point processes for the room types in the East region. For both the entire homes and private room listings, the observed values are above the theoretical envelope of simulations (marked by the grey area). Thus we would conclude that the these two room types show signs of clustering. However, it is not possible to see from the simulation at what distance the clusters form.

summary(ppp_store4$east$all$marks)

## Entire home/apt      Hotel room    Private room     Shared room 
##             103               3             330              10

# Plot K-function test results for East Region
par(mfrow=c(2,2))
plot(Ktest_east_hotel.csr, .-r~r, xlab="d", ylab="K(d)-r")
plot(Ktest_east_entire.csr, .-r~r, xlab="d", ylab="K(d)-r")
plot(Ktest_east_private.csr, .-r~r, xlab="d", ylab="K(d)-r")
plot(Ktest_east_shared.csr, .-r~r, xlab="d", ylab="K(d)-r")

2.2.1.4 West Region

Similarly in the West region, there are very few hotel room listings and therefore we are unable to make an inference. Shared rooms in the West region seem to exhibit spatial randomness as the observed values are within the bounds of the theoretical values, whereas the private rooms and entire homes seem to be clustered together where the observed values are above the theoretical bounds.

summary(ppp_store4$west$all$marks)

## Entire home/apt      Hotel room    Private room     Shared room 
##             142               3             348              23

# Plot K-function test results for West Region
par(mfrow=c(2,2))
plot(Ktest_west_hotel.csr, .-r~r, xlab="d", ylab="K(d)-r")
plot(Ktest_west_entire.csr, .-r~r, xlab="d", ylab="K(d)-r")
plot(Ktest_west_private.csr, .-r~r, xlab="d", ylab="K(d)-r")
plot(Ktest_west_shared.csr, .-r~r, xlab="d", ylab="K(d)-r")

2.2.1.5 North Region

There are no hotel room listings in the North region; there are also few entire homes and shared rooms. The K-test seems to show that entire homes would cluster in the North region over a distance of 0.2km, and shared rooms would cluster under a distance of 0.3km. Private rooms in the North show signs of clustering.

summary(ppp_store4$north$all$marks)

## Entire home/apt      Hotel room    Private room     Shared room 
##              18               0             144              14

par(mfrow=c(2,2))
plot(Ktest_north_entire.csr, .-r~r, xlab="d", ylab="K(d)-r")
plot(Ktest_north_private.csr, .-r~r, xlab="d", ylab="K(d)-r")
plot(Ktest_north_shared.csr, .-r~r, xlab="d", ylab="K(d)-r")

2.2.1.6 Northeast Region

Entire homes and private rooms show signs of clustering in the Northeast region as well, while the number of hotel and shared rooms are too small to make an inference.

summary(ppp_store4$northeast$all$marks)

## Entire home/apt      Hotel room    Private room     Shared room 
##              51               3             221               7

# Run K function test for Northeast Region
par(mfrow=c(2,2))
plot(Ktest_northeast_hotel.csr, .-r~r, xlab="d", ylab="K(d)-r")
plot(Ktest_northeast_entire.csr, .-r~r, xlab="d", ylab="K(d)-r")
plot(Ktest_northeast_private.csr, .-r~r, xlab="d", ylab="K(d)-r")
plot(Ktest_northeast_shared.csr, .-r~r, xlab="d", ylab="K(d)-r")

2.2.1.7 Central Region

The central region has the largest concentration of the four room types; it is not surprising that we would find evidence of clustering for all the room types in this region. While we have a smaller envelope from running only 99 simulations due to computational issues, we see the same results when we run the same test using the Fast Fourier Transform in the next section.

summary(ppp_store4$central$all$marks)

## Entire home/apt      Hotel room    Private room     Shared room 
##            3129             442            2083             211

par(mfrow=c(2,2))
plot(Ktest_central_hotel.csr, .-r~r, xlab="d", ylab="K(d)-r")
plot(Ktest_central_entire.csr, .-r~r, xlab="d", ylab="K(d)-r")
plot(Ktest_central_private.csr, .-r~r, xlab="d", ylab="K(d)-r")
plot(Ktest_central_shared.csr, .-r~r, xlab="d", ylab="K(d)-r")

2.2.2 K-test Using Fast Fourier Transform

The Kest.fft() function is an alternative to analyse large pattern of points. From the spatstat package documentation, Kest() computes the distance between each pair of points analytically and would therefore take a long time in computing large datasets (N-1 points * No. of simulations). The Kest.fft() function discretises the point pattern onto a rectangular raster and applies Fast Fourier Transform (FFT) techniques to estimate the K-function. In essence, the Fast Fourier Transform algorithm allows the estimates to be computed in parallel, by splitting them pairwise, and aggregating them again.

The result depends on the resolution of the pixel raster and requires the standard deviation of the isotropic Gaussian smoothing kernel that is introduced (sigma). We use the bw.diggle() function on the ppp object as an estimate of sigma and pass it into the Kest.fft() function.

2.2.2.1 Data set-up

We set up a list of ppp objects covering all of Singapore, split into the different room types and calculate the sigma for the 4 ppp objects.

# Create ppp for all of Singapore with only the subzones containing listings and dissolve the boundaries  
sg_owin <- mpsz_sf3 %>% st_union(.) %>% as(., Class="Spatial") %>% as(., "owin")  
ppp_all <- listing_ppp_jit[sg_owin] %>% rescale(., 1000, "km")
plot(split(ppp_all))

# Create ppp store of the 4 different room types
types_room <- unique(listings_sf$room_type)
ppp_store <- list()

for (i in seq_along(types_room)) {
  room_name <- paste(tolower(word(types_room[i],1)))
  var_name <- paste("ppp", room_name, sep="_") # Construct the name
  ppp_store[[room_name]] <- ppp_all[ppp_all$marks == types_room[i]]
  # x1 <- listingsg_ppp[listingsg_ppppp_all == types_room[i]]
  # assign(var_name, x1, env=.GlobalEnv) # Assign values to variable
}

The following code calculates the sigma to pass into the FFT function.

# Create sigma using the bw.diggle() function for the various room types
sigma_hotel <- bw.diggle(ppp_store$hotel)
sigma_shared <- bw.diggle(ppp_store$shared)
sigma_private <- bw.diggle(ppp_store$private)
sigma_entire <- bw.diggle(ppp_store$entire)

2.2.2.2 Hotel rooms

We first run the Kest.fft() function on the hotel rooms, and pass the sigma obtained earlier into the function. As this is computationally faster, we can run more simulations (300). For reproducibility, we set the seed shown below. As we see from the time taken, it only takes 44 seconds to compute 300 simulations as compared to minutes or hours it would have taken using the Kest() function. For the other room types, the simulation results have been hidden.

# K-test using FFT - Hotel Room
ptm <- proc.time()
set.seed(123)
Kfft_hotel <- Kest.fft(ppp_store$hotel, sigma_hotel)
Kfft_hotel.csr <- envelope(ppp_store$hotel, Kest.fft, sigma = sigma_hotel, nsim = 300, rank = 1, global=TRUE)

## Generating 300 simulations of CSR  ...
## 1, 2, 3, 4.6.8.10.12.14.16.18.20.22.24.26.28.30.32.34.36.38.40
## .42.44.46.48.50.52.54.56.58.60.62.64.66.68.70.72.74.76.78.80
## .82.84.86.88.90.92.94.96.98.100.102.104.106.108.110.112.114.116.118.120
## .122.124.126.128.130.132.134.136.138.140.142.144.146.148.150.152.154.156.158.160
## .162.164.166.168.170.172.174.176.178.180.182.184.186.188.190.192.194.196.198.200
## .202.204.206.208.210.212.214.216.218.220.222.224.226.228.230.232.234.236.238.240
## .242.244.246.248.250.252.254.256.258.260.262.264.266.268.270.272.274.276.278.280
## .282.284.286.288.290.292.294.296.298. 300.
## 
## Done.

proc.time() - ptm

##    user  system elapsed 
##   37.00    0.27   37.38

The following figure shows the plot of the K-test using FFT on the left. The plot on the right zooms into the the range between 0 - 1km and we are able to make out a distance (r) where the point patterns exhibit signs of clustering, at around the 0.25km mark. This is consistent with our previous plot of the points as hotels are predominantly within the Central region, and are located close to one another.

par(mfrow=c(1,2))
plot(Kfft_hotel.csr, . - r ~ r, xlab="d", ylab="K(d)-r")
plot(Kfft_hotel.csr, . - r ~ r, xlab="d", ylab="K(d)-r", xlim=range(0,1))

2.2.2.3 Shared rooms

The K-test shows signs of clustering for shared rooms across Singapore but this time at a wider radius of 0.6km. This matches what we have seen with the plotted point patterns as there are less shared rooms across Singapore but most of them are in the Central region as well.

# K-test using FFT - Shared Room
set.seed(123)
ptm <- proc.time()
Kfft_shared <- Kest.fft(ppp_store$shared, sigma_shared)
Kfft_shared.csr <- envelope(ppp_store$shared, Kest.fft, sigma = sigma_shared, nsim = 300, rank = 1, global=TRUE)
proc.time() - ptm

par(mfrow=c(1,2))
plot(Kfft_shared.csr, . -r ~r, xlab="d", ylab="K(d)-r")
plot(Kfft_shared.csr, . - r ~ r, xlab="d", ylab="K(d)-r", xlim = range(0,1))

2.2.2.4 Private rooms

Private rooms across Singapore also show signs of clustering, at a distance of 0.4km and above.

# K-test using FFT - Shared Room
ptm <- proc.time()
Kfft_private <- Kest.fft(ppp_store$private, sigma_private)
Kfft_private.csr <- envelope(ppp_store$private, Kest.fft, sigma = sigma_private, nsim = 300, rank = 1, global=TRUE)
proc.time() - ptm

par(mfrow=c(1,2))
plot(Kfft_private.csr, . -r ~r, xlab="d", ylab="K(d)-r")
plot(Kfft_private.csr, . - r ~ r, xlab="d", ylab="K(d)-r", xlim = range(0,1))

2.2.2.5 Entire home / apartments

Entire homes and apartments also show signs of clustering, which is consistent with the point patterns observed previously. The distance at which they cluster is from 0.3km and up.

# K-test using FFT - Shared Room
ptm <- proc.time()
Kfft_entire <- Kest.fft(ppp_store$entire, sigma_entire)
Kfft_entire.csr <- envelope(ppp_store$entire, Kest.fft, sigma = sigma_entire, nsim = 300, rank = 1, global=TRUE)
proc.time() - ptm

par(mfrow=c(1,2))
plot(Kfft_entire.csr, . -r ~r, xlab="d", ylab="K(d)-r")
plot(Kfft_entire.csr, . - r ~ r, xlab="d", ylab="K(d)-r", xlim = range(0,1))

2.2.2.6 Summary of Cluster Radius

# Save clustering distance for use in density plots
kdist <- list("private" = 0.4, "entire" = 0.3, "hotel" = 0.25, "shared"=0.6)

From the Monte Carlo simulation, we can see that all 4 room types exhibit signs of clustering at a whole of Singapore level. We zoom into the plot between the 0 - 1km range and we can see the distance that clustering occurs:

Room Type	Cluster Radius
Entire Home/Apt	0.3km
Private Rooms	0.4km
Shared Rooms	0.6km
Hotel Rooms	0.25km

Using different sigmas would change the distance at which we observe the clustering effects.

2.2.2.7 Comparison of FFT method vs normal K-test

We compare the results from the Fast Fourier Transform versus the normal K-test for the Central neighbourhood to see if there are any significant differences using the 2 methods. Following the same steps, we find the sigma for the different room types in the Central neighbourhood, then run the envelope test for them using the same number of simulations as in the K-test.

# Create sigma
sigma_cen_private <- bw.diggle(ppp_store4$central$private)
sigma_cen_entire <- bw.diggle(ppp_store4$central$entire)
sigma_cen_hotel <- bw.diggle(ppp_store4$central$hotel)
sigma_cen_shared <- bw.diggle(ppp_store4$central$shared)

# Run Kfft test for 4 room types in Central region
ptm <- proc.time()
Kfft_cen_private.csr <- envelope(ppp_store4$central$private, Kest.fft, sigma = sigma_cen_private, nsim = 99, rank = 1, global=TRUE)

## Generating 99 simulations of CSR  ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
## 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
## 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98,  99.
## 
## Done.

Kfft_cen_entire.csr <- envelope(ppp_store4$central$entire, Kest.fft, sigma = sigma_cen_entire, nsim = 99, rank = 1, global=TRUE)

## Generating 99 simulations of CSR  ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
## 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
## 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98,  99.
## 
## Done.

Kfft_cen_shared.csr <- envelope(ppp_store4$central$shared, Kest.fft, sigma = sigma_cen_shared, nsim = 99, rank = 1, global=TRUE)

## Generating 99 simulations of CSR  ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
## 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
## 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98,  99.
## 
## Done.

Kfft_cen_hotel.csr <- envelope(ppp_store4$central$hotel, Kest.fft, sigma = sigma_cen_hotel, nsim = 99, rank = 1, global=TRUE)

## Generating 99 simulations of CSR  ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
## 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
## 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98,  99.
## 
## Done.

proc.time() - ptm

##    user  system elapsed 
##   93.01    1.58   94.67

We can see that the time taken to run the 99 simulations using FFT is a fraction of the normal K-test (64 seconds vs 60 hours). We then plot the Kest.fft() and Kest() results side by side for each room type to see the difference.

# plot
par(mfrow = c(2,2))
plot(Kfft_cen_private.csr, . -r ~ r, xlab="d", ylab="K(d)-r", xlim=range(0,1))
plot(Ktest_central_private.csr, . -r ~ r, xlab="d", ylab="K(d)-r", xlim=range(0,1))
plot(Kfft_cen_entire.csr, . -r ~ r, xlab="d", ylab="K(d)-r", xlim=range(0,1))
plot(Ktest_central_entire.csr, . -r ~ r, xlab="d", ylab="K(d)-r", xlim=range(0,1))

From the above graphs, we can see that the Fast Fourier Transform provides us with a larger range of values in the Monte Carlo simulation. Both tests lead us to reject the hypothesis of CSR. However in the case of the Fast Fourier Transform, we can identify a distance where the listings tend to cluster - 0.2km in the case of both private rooms and entire homes in the Central region. Similarly for hotel rooms and shared rooms in the graphs below, shared rooms in the Central region tend to cluster at the 0.2km mark, whereas hotel rooms cluster at approximately 0.25km.

# plot
par(mfrow = c(2,2))
plot(Kfft_cen_shared.csr, . -r ~ r, xlab="d", ylab="K(d)-r", xlim=range(0,1))
plot(Ktest_central_shared.csr, . -r ~ r, xlab="d", ylab="K(d)-r", xlim=range(0,1))
plot(Kfft_cen_hotel.csr, . -r ~ r, xlab="d", ylab="K(d)-r", xlim=range(0,1))
plot(Ktest_central_hotel.csr, . -r ~ r, xlab="d", ylab="K(d)-r", xlim=range(0,1))

2.3 Kernel Density Estimation

The K-tests above show that Airbnb listings are not randomly distributed across Singapore. We can use the Kernel Density Estimation (KDE) to explore the areas where these clusters form. KDE is a non-parametric method using local information defined by windows or kernels to estimate densities of specific features at given locations. We assume a homogeneous or stationary point process. The bandwidth of a kernal estimator can be fixed across the entire mapping area or adaptive based on the location of the points.

As we were able to derive a distance that clusters occured for the whole of Singapore, we can use that distance and pass that as a fixed bandwidth into the Kernel Density Estimation. We then explore where these areas are across the whole area of study. We also compare the difference in using the fixed bandwidth derived from the K-test and using an adaptive bandwidth for the private rooms and entire homes to see if there are any significant differences in the 2 methods.

2.3.1 KDE mapping function

As we will be doing the kernel density estimation for various cuts of the data, we create a function kdemap() that derives the kernel density estimation and transforms it into a raster to overlay onto the basemap to identify the neighbourhoods with potential clusters. We also use the disaggregate() function to increase the resolution of the raster from the resolution (0.265km x 0.195km) to approximately 50m (0.05km).

# Function maps the ppp (in km) with either an adaptive bandwidth ("adaptive) or the specified bandwidth to be used ("bw.diggle, bw.ppl, etc) 
kdemap <- function(ppp, y, polyshape, gtitle) {
  # ppp.km <- rescale(ppp, 1000, "km")
# Computing kernel density estimation using specified bandwidth
    if (y == "adaptive") {
      kde_ppp <- adaptive.density(ppp, method = "kernel")
    } else if (is.numeric(y)) {
      kde_ppp <- density(ppp, sigma = y, edge = TRUE, kernel = "gaussian")
      }
      else {
      kde_ppp <- density(ppp, sigma = eval(as.name(y)), edge = TRUE, kernel = "gaussian")
    }
  rastmap <- as.SpatialGridDataFrame.im(kde_ppp) %>% raster(.)
  projection(rastmap) <- CRS("+init=EPSG:3414 +units=km")
  rastmap1 <- disaggregate(rastmap, fact=3) # increase resolution of the raster map by factor of 3 to approximately 50m 

  tm_shape(sg_osm) +
    tm_rgb() +
  tm_shape(polyshape) +
    # tm_text("SUBZONE_N", size = 0.6) +
    tm_polygons(alpha = 0) +
  tm_shape(rastmap1) +
    tm_raster("v", alpha = 0.7, palette = "YlOrBr") +
    tm_layout(legend.position = c("right", "bottom"), frame = FALSE, main.title = gtitle, main.title.position="center", main.title.size = 1.25) +
    tm_view(set.zoom.limits = c(11, 17))
}

2.3.2 KDE for different room types

We look at the KDE for different room types in Singapore to identify neighbourhoods that have high density of listings. We use the bandwidth identified from the K-test previously to plot the density estimate for each room type. A lower bandwidth means that points that are close to the location are given more weight, whereas as higher bandwidth means that further points would contribute to the density estimation.

2.3.2.1 Private Rooms

The density map below show that private rooms cluster in the Lavender area, spreading to the surrounding areas from Kallang Bahru to Balestier and Little India. Another cluster is in the Chinatown area and surrounding. These two areas are considered city fringe, with Chinatown being closer to the main Central Business District. The River Valley area has a small cluster (Leonie Hill, Oxley, etc)- which is close to the main shopping district in Orchard Road. Aljunied has another cluster of private rooms, being further out on the city fringes but still fairly accessible from the public transport network. These neighbourhoods tend to have a good mix of rental and owner-occupied houses, which may contribute to the number of private rooms listed for rent.

tmap_mode("view")
kdemap(ppp_store$private, kdist$private, mpsz_sf3[,3], "KDE private rooms")

2.3.2.2 Entire homes / apartments

Entire homes and apartments have slightly different clusters. Instead of Lavender, the cluster is more concentrated in Balestier, with smaller densities in Lavender and surrounding areas like Bendemeer and Farrer Park. Chinatown has a high concentration, and has a wider spread to surrounding areas, including Anson and Raffles Place. The area surrounding Orchard has a higher concentration of entire homes listings as compared to private rooms. Aljunied has 2 distinct clusters of entire home listings; one is closer to the city centre, while the other is closer to the MRT station. The greater concentration of entire home listings in Orchard and Chinatown area may be due to these areas being popular for investment or rental properties.

# tmap_mode("view")
kdemap(ppp_store$entire, kdist$entire, mpsz_sf3[,3], "KDE entire homes/apt")

2.3.2.3 Shared rooms

Shared rooms are concentrated in Lavender, spreading out to surrounding areas. However, the density of shared rooms is much lower (25 - 30) compared to that of private rooms and entire homes/apartments. There is also a small concentration in the Chinatown environs.

kdemap(ppp_store$shared, kdist$shared, mpsz_sf3[,3], "KDE shared rooms")

2.3.2.4 Hotel rooms

Hotel rooms are concentrated in the Chinatown and Boat Quay/Clarke Quay area. These correspond to smaller boutique hotels and some of the low to mid-tier hotels that serve the business district and tourists. The density is comparable to that of private rooms.

kdemap(ppp_store$hotel, kdist$hotel, mpsz_sf3[,3], "KDE hotel rooms")

2.3.2.5 Using an adaptive bandwidth

The following maps the results in using an adaptive bandwidth without the benefit of the K-test derived bandwidths. We can see that this method identifies the subzones that have a higher density of listings; however they are very concentrated and do not show the surrounding areas contributing to the cluster.

In the plots below, the adaptive bandwidth method shows high density in the Lavender area for private rooms, but it does not show the contributing listings from the surrounding areas of Kallang Bahru, Little India, etc. For entire homes and apartments, the adaptive bandwidth gives us an indication of the same clusters: clusters in Aljunied (2), Chinatown, Balestier and River Valley. It also does not show the contributing listings from the surrounding areas. The number of listings are very tightly concentrated at one point despite the disaggregation of the raster map: it shows 1,500 - 2,000 listings within Chinatown but there are only 3,500+ listings in total, which does not give us as much information as using the fixed bandwidth derived from the K-test.

kdemap(ppp_store$private, "adaptive", mpsz_sf3[,3], "KDE private rooms adaptive bw")

tmap_mode("view")
kdemap(ppp_store$entire, "adaptive", mpsz_sf3[,3], "KDE entire homes/apt adaptive bw")

2.4 Discussion on findings and results

Methods and Findings

There are several functions that measure whether a point pattern meets CSR; we bypassed G-test, and F-test as they were measures of the first order properties, and would only tell us if the distribution and intensity of the events would come under CSR.

For the second order analysis, we explored the use of both the K-test and the Fast Fourier Tranform of the K-test. The FFT version of the K-test out-performed the normal K-test in terms of computation speed (more than 3,000 times faster) and the results (where we were able to ascertain the distance where the listings start to cluster). With the FFT version of the K-test, we are also able to examine the entire country, instead of splitting it into sub-regions or smaller study areas, which would be arbitrary, and may not account for edge effects, depsite using a Ripley correction.

When we compared both versions of the K-test on the Central region, both versions gave a similar result. However, the FFT version produced larger upper and lower bounds that fall within CSR, and we were able to estimate the distance at which clustering occurred in the Central region.

We did not use the L-test as it gives a similar estimate to the K-test and as it did not have a FFT version, we deemed it computationally inefficient to do for similar results.

We then fed the resultant bandwidths into the Kernel Density Estimation function, which showed us the areas where the listings clustered. The KDE produced a few focal points of clusters, with other listings radiating from the focal points, and we could identify in greater detail the areas where they clustered and the potential reasons for the clustering.

We also performed a KDE using an adaptive bandwidth to compare the 2 methods of deriving the density. While both methods identified similar areas for clustering, using the adaptive bandwidth results in identifying only the focal points, with less of the surrounding areas identified. In addition, as a standalone analysis, we would still have needed to perform statistical analysis of the spatial point pattern using one of the tests listed above.

Challenges

The main challenge was the analysis of a large dataset. In most examples of spatial point pattern analysis, the number of data points were typically smaller or they used a smaller observation window. We had to exclude areas that were not zoned for residential housing, and this created ‘holes’ in the polygons, which made it more difficult for the algorithm to create the envelope simulations. We also had to dissolve the subzone boundaries within the observation window to get the envelope simulations to run.

The time taken to run the K-test for 99 simulations was substantial, coming in at 60 hours for the 4 room types in the Central region. As such, we could not run more simulations without incurring more computational cost. This could have led to less robust results when we analysed the plot for the clustering distance.

Aside from the Fast Fourier Transform version of the K-test function, spatstat does not offer any other method or function to analyse large datasets. As such, we could not complete the L-test for the analysis and had to rely solely on the K-test for our test of CSR.

One potential solution is to parallelise the operations. However, spatstat is not configured for parallelisation, and users would have to undertake and code the parallelisation using other tools. Mac users can use the parallel package (part of base R) in conjunction with snow, using mclapply() or similar functions to parallelise the operations, and aggregate the results back into an overall envelope. Unfortunately, these functions do not work on a Windows OS. Initialising clusters on the local machine did not work as the resultant object could not be aggregated.

3 Conclusion

This report is concerned mainly with the exploratory data analysis of Airbnb listings taken from InsideAirbnb at September 2020. We looked at the prices across the 5 regions in Singapore and confirmed that prices for listings were significantly different, especially the ones in the Central Region. We also confirmed that prices for entire home listings and private room listings are not the same across the different regions, and that there are significant differences between the Central regions and other regions as well. Shared rooms however do not differ significantly across the different regions.

We also discover that approximately 75% of the Airbnb listings are posted by hosts with multiple listings. Despite this, there are no significant differences in the prices of hosts who have single or multiple listings; hence a host with multiple listings does not make the prices. It is likely that the price of the listing is dependent on other factors such as location, size of apartment and other amenities provided.

However, we did find that 89% of the listings have a minimum stay that is less than the legislated minimum stay. This has potential implications for better enforcement of regulations either from the Airbnb side by not allowing listings under the legislated minimum stay, or from the authorities, by identifying, investigating, and charging offenders from the data provided.

Through our spatial point pattern statistical analysis, we confirm that there is clustering in specific areas for listings across Singapore, particularly in the case of entire homes/apartment listings and private room listings. As expected, they clustered mainly in the central region and correspond to access to shopping districts, central business district and city fringe area for cheaper stays as compared to hotels. The different room types also have different concentrations in the different areas, which imply that they cater to different types of visitors (e.g. More casual tourists vs budget travellers or business travellers).

We also explored different methods of testing the second order properties using different K-test functions. We have also showed that the Fast Fourier Transform version of the K-test performs better than the normal K-test for large datasets, and it also produces a better Kernel Density Estimation than just using the adaptive bandwidth.

The findings from the EDA and derived bandwidth from the K-tests will be used in the next phase of the project, which is to develop a hedonic model of Airbnb listings in Singapore.

Limitations and further study

This report looked at mainly understanding the distribution of the listings, and confirming if the listings adhere to Complete Spatial Randomness, or if they are clustered or in competition. It also provided an exploratory data analysis of the various cuts of data (reviews, host listings, etc).

Further iterations of the study could look at a K-cross analysis to see if the listings of different room types impact one another. Our hypothesis would be that they are likely to have some cross correlation for private and entire homes as there may be a mixture of those 2 room types in the same estate. Listings of a private room could motivate others to list an entire apartment, and vice versa. Similarly we could also test for any correlation between the listings and amenities such as transport links, tourist attractions and key business areas.

Additionally, we could explore using an inhomogenous K-test to test if our assumption of stationarity across the entire study area holds true. Other co-variate testing methods could also be introduced (e.g. factors such as whether being a multiple listing host affects the location, and pricing effects, rating of the location in the reviews, etc.)

Section 2: Exploratory Spatial Data Analysis

Clara Chua

9/29/2020 (updated: 2021-03-04)