1 Introduction

Airbnb, a peer-to-peer sharing platform that enables the short-term rental of private rooms or homes by individuals to potential guests, is increasingly popular with tourists. As of 30 September 2020, Airbnb has operations in 100,000 cities with over 4 million hosts providing 5.6 million active listings and 800 million guest arrivals since its launch (Airbnb, 2021).

Airbnb launched its Asia Pacific headquarters in Singapore in November 2012, but has had hosts in Singapore from as early as 2009. Local Airbnb stays are regulated by the Urban Redevelopment Authority (URA) and the Housing and Development Board (HDB). The authorities conducted consultations with the public and key stakeholders from 2015 to explore a regulatory framework for short-term accommodation (Channel News Asia, 2018), but maintained the regulatory status quo in May 2019 (Co, 2019). The minimum stay is three months for private property and six months for HDB flats. Strict penalties have been enforced, including fines of up to $200,000 for first-time offenders, with additional fines and a possible jail term for repeat offenders.

There has been an increasing number of studies on the effects of Airbnb in various contexts, from the impact on the hotel industry (Zervas et al. 2014, Gutierrez et al. 2017, Dogru et al. 2019, Dogru et al. 2020, Blal et al. 2018, Heo et al. 2019), to housing markets (Barron et al. 2018, Horn & Merante 2017, Yrigoy 2018, Ayouba et al. 2019), to whether and how Airbnb should be regulated (Kaplan & Nadler 2015, DiNatale et al. 2018, Wegmann & Jiao 2017), to spatial studies, which are discussed in greater detail in the literature review section below.

1.1 Project Objectives

The motivation for this project is to explore Airbnb specifically in the Singapore market: to understand the spatial distribution of Airbnb accommodation in Singapore and to see how it correlates with various spatial factors. Further analysis could also inform future policy and regulatory frameworks on short-term accommodation.

The project will use geospatial analysis to explore and explain the data on Airbnb rentals in Singapore. Specifically, the project aims to:

- Understand the supply of Airbnb listings from a geographical standpoint
- Determine whether there is any spatial clustering of Airbnb listings, and its impact on hotels (if any)
- Develop a hedonic pricing model to explain the factors affecting the rental pricing and take-up rate of the Airbnb properties in question

The report covers the following five sections. First, we review spatial studies of the distribution of Airbnb listings and its impact, geographically weighted regression (GWR) models for Airbnb pricing, and Asia-based studies of Airbnb. Second, we explore the data available from InsideAirbnb. Third, we carry out exploratory data analysis (EDA) to look for potential insights and hypotheses about Airbnb in Singapore. Fourth, we conduct exploratory spatial analysis on Airbnb listings in Singapore, including Spatial Point Pattern Analysis. Finally, the report concludes with a brief discussion of the implications of the findings for the next phase of the project: developing a hedonic (GWR) pricing model of Airbnb listings.

1.2 Literature Review

Spatial Studies

There have been studies that use spatial analysis to explore the distribution and impact of Airbnb on various factors. A large number of studies are centered in European cities (e.g. Barcelona, London), or in the USA (e.g. New York City, San Francisco) and many studies show that Airbnb listings are concentrated around tourist or leisure areas.

In Barcelona, Gutierrez et al. (2017) found that Airbnb listings are concentrated in the city centre, but cover a slightly wider area than that of hotels. Bivariate spatial autocorrelation analysis showed a close association between Airbnb listings and hotels, with proximity to leisure and tourism activities explaining Airbnb location patterns, whereas hotels were slightly more widespread.

Adamiak et al. (2016) looked at the spatial concentration and autocorrelation of the density of Airbnb listings for the whole of Spain, with a focus on the impact on tourism. They found that Airbnb listings concentrate in large cities and in areas with high tourism and leisure activity, such as coastal areas, national parks and mountain tourist areas. Entire homes or apartments dominated the listings in touristic and leisure areas (~90% of all listings in the area), and the listings were positively correlated with coastal areas, with a high number of non-primary accommodation (such as a second holiday home) and with hotel supply. In such cases, Airbnb helps people to commercialise holiday homes or apartments already used for tourism purposes. The authors suggested that Airbnb encourages the growth of tourist accommodation stock in touristic hotspots, can be supplementary to hotel supply, and could open new opportunities for tourism.

Quattrone et al. (2018) looked at geographic, social and economic variables to try to explain the spatial penetration of Airbnb in 8 US cities at the census tract level. Their results for the geographic variables show that distance from the city centre was negatively related to Airbnb offerings in 5 out of 8 cities. The attractiveness of an area (the number of points of interest within the census tract) was positively correlated with the number of listings in 5 out of 8 cities, i.e. Airbnb listings are predominantly located in more touristic areas. The number of bus stops per tract, which measured the strength of an area’s infrastructure and transport links, was also positively correlated with Airbnb listings in 3 cities (Austin, Oakland, San Francisco). The authors also compared the study to a similar study for London (Quattrone et al., 2016) and found similarities in the geographic results in terms of distance to centre, tourism factor and hotel presence (no relationship between hotel presence and Airbnb adoption). The comparison of social and economic indices found that Airbnb listings are correlated with the young, bohemian and talent indices of an area, with the correlation with the racial index differing between cities.

Another study by Lagonigro, Martori, & Apparicio (2020) analysed the factors affecting the spatial distribution of Airbnb listings in Barcelona in relation to population and tourism indicators, using a Geographically Weighted Regression (GWR) model. They found that in neighbourhoods with medium-low family incomes, poverty and Airbnb ratios were positively correlated, whereas neighbourhoods with higher incomes attracted more Airbnb accommodation. Their study also uncovered how Airbnb contributed to the gentrification of some neighbourhoods by moving housing from the residential stock to short-term rentals.

Other studies have also explored the spatial characteristics of Airbnb accommodation (Zhang and Chen, 2019), or used GWR techniques to model the variation of hotel room prices (Zhang et al., 2011), tourism and rural poverty rates (Deller, 2010), or the housing market (Bitter, Mulligan, & Dall’erba, 2007).

Asian studies

The majority of the studies of Airbnb in Singapore and Asia have focused on the user experience (from either the guest’s or the host’s point of view) and on the disruptive impact Airbnb has on the tourism industry and hotel revenues. No studies have looked in depth at the spatial analysis of Airbnb in Singapore or Asia.

Choi et al. (2015) looked at the impact of Airbnb on hotel revenues across different cities in Korea and found that, at the national level, Airbnb accommodation did not affect hotel revenue, but there were slight variations between cities. Airbnb had a slight negative effect on budget hotels in Seoul, whereas in Busan there was a negative effect on upscale hotels and a positive effect on midscale hotels, but the magnitude of those effects was very small.

Kiatkawsin, Sutherland and Kim (2020) conducted a text analysis of Airbnb reviews in Hong Kong and Singapore using Latent Dirichlet Allocation (LDA) to extract topics from the data. There were 12 topics in the Hong Kong reviews and 5 topics in the Singapore reviews. Topics were related to established hotel attributes (e.g. unit or room amenities, location), but also included host and listing management, which are topics unique to Airbnb listings. Their results show that hosts need to focus on delivering quality service for the entire ‘transaction’, from pre-trip to post-trip, and ensure that their listings are accurate and comprehensive for better guest satisfaction.

Koh and King (2017) conducted a qualitative assessment of the impact of Airbnb on Singapore’s budget hotels, based on interviews with key stakeholders from budget hotels and hostels. While there were growing concerns that Airbnb may prove to be competition down the road, the stakeholders did not consider Airbnb rentals an immediate threat at that point in time.

The Development Bank of Singapore published a briefing on the rise of home sharing platforms (Yong & Tan 2019) and a case study on Airbnb. However, this study had a broader focus on the entire market across Asia, and while it discussed the impact of Airbnb on hotel prices, it did not look in depth into any spatial analysis.

As such, this study aims to close the gap by looking at the spatial densities of Airbnb in Singapore and, similar to the Barcelona study, determining the factors that affect the spatial distribution of Airbnb listings in Singapore using a Geographically Weighted Regression (GWR) model. In addition, we will look at hotels in Singapore and determine whether their location affects Airbnb listings as well. This allows us to assess whether Airbnb is in competition with hotels or whether they are complementary, as Airbnb claims.

2 Setting Up

The following code chunk loads the packages required for the Exploratory Data Analysis; it will also install the packages if they have not been installed. The following table shows the different packages used in this study:

Type	Package	Usage
Data Exploration	tidyverse	Data manipulation & wrangling
Data Exploration	lubridate	Manipulating date-time data
Data Exploration	knitr	knit R-Markdown document, with code to show specific lines of output for the purpose of this report
Data Exploration	funModeling	Provides functions to help in exploratory data analysis, data preparation and model performance
Spatial Data	sf (Simple Features)	read and manipulate spatial data for analysis
Spatial Data	tmap	graphing and mapping spatial data
Spatial Data	leaflet	graphing and mapping spatial data
Spatial Data	gridExtra	customise display of graphs and plots (in a grid format)
Spatial Data	OpenStreetMap	Accesses high resolution raster maps using the OpenStreetMap protocol. This provides a basemap when tmap is set to ‘plot’ mode
Spatial Data	rgdal	Provides access to projection/transformation operations and importing of raster / vector data
Spatial Data	maptools	Manipulating geographic data
Spatial Data	raster	Manipulating raster data
Spatial Data	spatstat	statistical analysis of Spatial Point Patterns
Spatial Data	tmaptools	Reading and mapping spatial data

# Loading in required packages
packages <- c('tidyverse', 'funModeling', 'ggstatsplot', 'statsExpressions', 'lubridate', 'knitr', 'rgdal', 'spatstat', 'maptools', 'sf', 'tmap', 'tmaptools', 'leaflet', 'raster', 'gridExtra', 'OpenStreetMap')
for (p in packages) {
  # install the package first if it cannot be loaded
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

3 Data Preparation

3.1 Data Source

InsideAirbnb is an independent, non-commercial site that provides publicly available information about a city’s Airbnb listings. Started by Murray Cox, the site provides data for over 90 cities by scraping and compiling publicly available information from the Airbnb website at regular intervals. The intention behind InsideAirbnb is to enable data exploration into how Airbnb impacts community housing issues and residential housing markets in various cities around the world.

The data has been famously used to uncover misrepresentations from Airbnb that their hosts only occasionally rent the homes in which they live. In 2016, Murray Cox, together with Tom Slee, reported ¹ that before releasing data on its New York City listings, Airbnb had removed over 1,000 entire home listings that violated New York City’s multiple dwelling law (i.e. hosts with multiple listings). The law states that an apartment in a building with 3 or more units cannot be rented out for under 30 days unless there’s a permanent occupant present.

Note that information such as actual stays (e.g. number of days) and actual rental income per host is not publicly available.

3.2 What Data is Available?

InsideAirbnb provides a snapshot of the following information:

- Listings: summary information on listings
- Detailed Listings: detailed information on the Airbnb listings for rent
- Calendar: detailed calendar data for listings
- Reviews: summary review data
- Detailed Reviews: detailed review data for listings
- Neighbourhoods: list of neighbourhoods in the city and a neighbourhood GeoJSON file

For Singapore, InsideAirbnb has periodic snapshots from 18 March 2019 to 26 October 2020. Data was downloaded from InsideAirbnb on 29 September 2020 for this project - the dataset downloaded was compiled on 22 June 2020.

3.2.1 Reading in the data

# Loading the Data
listings <- read_csv("data/listings.csv")
d_listings <- read_csv("data/detailedlistings.csv")

## Warning: 5 parsing failures.
##  row     col           expected     actual                        file
## 3083 license 1/0/T/F/TRUE/FALSE 201117828H 'data/detailedlistings.csv'
## 4215 license 1/0/T/F/TRUE/FALSE 201537598E 'data/detailedlistings.csv'
## 4684 license 1/0/T/F/TRUE/FALSE 201202564R 'data/detailedlistings.csv'
## 5668 license 1/0/T/F/TRUE/FALSE 201537598E 'data/detailedlistings.csv'
## 5674 license 1/0/T/F/TRUE/FALSE 201537598E 'data/detailedlistings.csv'

calendar <- read_csv("data/calendar.csv")
reviews <- read_csv("data/reviews.csv")
d_reviews <- read_csv("data/detailedreviews.csv")
neighbourhoods <- read_csv("data/neighbourhoods.csv")

We see 5 parsing failures in the license column, where a logical value (TRUE / FALSE) was expected but the actual data contained characters. These are actually the business registration numbers for Singapore companies, and we can change these to a True or Yes in the column when looking at the data. We are not using this column for analysis at the moment.
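One way to avoid these parsing failures is to declare the column type explicitly through the col_types argument of read_csv() rather than relying on type guessing. The sketch below assumes that the license column should be kept as text. The temporary file and its three rows are illustrative stand-ins, not the real data, and only the registration number is taken from the warning above.

```r
library(readr)

# Illustrative stand-in for data/detailedlistings.csv (hypothetical rows;
# the registration number is one of the values shown in the warning above)
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,license",
             "1,",
             "2,201117828H",
             "3,"), tmp)

# Declare id and license as character so that readr does not guess a logical column
demo <- read_csv(tmp,
                 col_types = cols(id = col_character(),
                                  license = col_character()))

# Optionally derive a TRUE / FALSE flag for "has a registration number"
demo$has_license <- !is.na(demo$license)
demo
```

Keeping the raw registration number and deriving the flag separately means no information is lost if the column is needed later.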

Handling spatial data²
The two popular packages for handling geographical data in R are sp, released in 2005, and sf (simple features), released in 2016. They allow users to standardize how spatial data (points, lines, polygons, grids) is treated and operated on in R. However, the packages read and store geographical data differently:

sp uses an S4 class object with slots to build a spatial object. Two slots are pre-defined:
- bounding box: a box that provides the boundaries or window for the object
- crs: the Coordinate Reference System, which tells R how the stored coordinates relate to locations on the Earth’s surface (for a projected CRS, how the curved surface is flattened onto a 2D plane)

A third slot holds the geometric object (points, lines, polygons) and is either a matrix of coordinates or a list of lines or polygons objects.

The last slot holds the attributes associated with the geometric object; adding it turns a Spatial object into a Spatial DataFrame object. There are different classes for points, lines and polygons (e.g. SpatialPoints, SpatialLines and SpatialPolygons, and SpatialPointsDataFrame, SpatialLinesDataFrame and SpatialPolygonsDataFrame).

sf stores spatial objects as a dataframe with a special column named geometry that contains spatial information. This geometry column contains the simple features collection (sfc), which includes:
- geometric objects (points, lines, polygons), each stored as a simple feature geometry (sfg) object
- bounding box
- crs (epsg or proj4string)

The other columns of the dataframe generally represent the attributes of the data (e.g. place names, roads, elevation, temperature, etc). We can conceive of a sf object as a dataframe with a spatial extension.

sf is useful when reading in larger dataframes (faster read/write) and provides a simpler interface in its usage.
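To make the "dataframe with a spatial extension" idea concrete, the following minimal sketch builds an sf object from an ordinary data frame. The two rows reuse listing ids and coordinates as they appear in the listings data; the object names pts and pts_sf are illustrative only.

```r
library(sf)

# Two listings (id, latitude, longitude) as they appear in the listings data
pts <- data.frame(
  id        = c("49091", "50646"),
  latitude  = c(1.44255, 1.33235),
  longitude = c(103.7958, 103.7852),
  stringsAsFactors = FALSE
)

# Convert to sf: the coordinate columns are replaced by a geometry column,
# and the CRS is declared as WGS 84 (EPSG:4326)
pts_sf <- st_as_sf(pts, coords = c("longitude", "latitude"), crs = 4326)

class(pts_sf)         # an sf object is also a data.frame
names(pts_sf)         # attribute column plus the geometry column
st_geometry(pts_sf)   # the sfc: geometries together with bounding box and CRS
st_bbox(pts_sf)       # bounding box of the two points
```

Because the result is still a data frame, the usual tidyverse verbs can be applied to the attribute columns while the geometry column travels along with them.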

Reading in Spatial Data
We use st_read() from the sf package to read the neighbourhood geojson file, specifying the data source name and the layer name. This dispenses with the need to call the rgdal library. GeoJSON coordinates are stored as longitude and latitude in WGS 84 (EPSG:4326), so we use st_transform() to reproject the neighbourhood polygons to the SVY21 projected CRS (EPSG:3414), so that they will appear in the same projected space as the other spatial data.

# reading in the neighbourhood geojson file
nhood_map_sf <- st_read(dsn = "data/neighbourhoods.geojson", 
                        layer="neighbourhoods") %>%
                st_transform(crs = 3414)

## Reading layer `neighbourhoods' from data source `C:\Users\clarachua\Documents\2. Capstone Project\capstone\data\neighbourhoods.geojson' using driver `GeoJSON'
## Simple feature collection with 55 features and 2 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 103.6054 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
## geographic CRS: WGS 84

We can see that the ‘neighbourhoods’ layer is a simple feature collection of 55 multipolygons and 2 attributes (neighbourhood name and neighbourhood group). The message above is printed by st_read() before the transformation, which is why it still reports WGS 84 and a bounding box in degrees. After st_transform(), the layer uses the SVY21 projected CRS (EPSG:3414), a local projection with coordinates in metres that represents distances and areas in Singapore more accurately than unprojected WGS 84 longitude and latitude. The ‘neighbourhoods’ data also has its own bounding box.
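The effect of the reprojection can be checked on a single point. The sketch below is illustrative rather than part of the original pipeline: it takes one listing coordinate from the listings data, declares it as WGS 84 and transforms it to SVY21. The object names p_wgs84 and p_svy21 are assumptions made for this example.

```r
library(sf)

# One listing coordinate (longitude, latitude) from the listings data, in WGS 84
p_wgs84 <- st_sfc(st_point(c(103.7958, 1.44255)), crs = 4326)

# st_transform() recomputes the coordinates in the target CRS (SVY21, EPSG:3414)
p_svy21 <- st_transform(p_wgs84, crs = 3414)

st_is_longlat(p_wgs84)    # TRUE: coordinates are in degrees
st_is_longlat(p_svy21)    # FALSE: coordinates are projected
st_coordinates(p_svy21)   # easting and northing in metres
```

Note the difference from st_set_crs(), which only relabels the CRS without changing the coordinate values; st_transform() is the appropriate call when the data already has a known CRS and needs to be moved into another one.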

3.2.2 Data Structure

glimpse() is used to take a first look at the listings data.

# Reviewing the listings data
glimpse(listings)

## Rows: 7,323
## Columns: 16
## $ id                             <dbl> 49091, 50646, 56334, 71609, 71896, 7190~
## $ name                           <chr> "COZICOMFORT LONG TERM STAY ROOM 2", "P~
## $ host_id                        <dbl> 266763, 227796, 266763, 367042, 367042,~
## $ host_name                      <chr> "Francesca", "Sujatha", "Francesca", "B~
## $ neighbourhood_group            <chr> "North Region", "Central Region", "Nort~
## $ neighbourhood                  <chr> "Woodlands", "Bukit Timah", "Woodlands"~
## $ latitude                       <dbl> 1.44255, 1.33235, 1.44246, 1.34541, 1.3~
## $ longitude                      <dbl> 103.7958, 103.7852, 103.7967, 103.9571,~
## $ room_type                      <chr> "Private room", "Private room", "Privat~
## $ price                          <dbl> 84, 80, 70, 167, 95, 84, 209, 52, 54, 4~
## $ minimum_nights                 <dbl> 180, 90, 6, 90, 90, 90, 1, 90, 90, 14, ~
## $ number_of_reviews              <dbl> 1, 18, 20, 20, 24, 48, 29, 176, 199, 20~
## $ last_review                    <date> 2013-10-21, 2014-12-26, 2015-10-01, 20~
## $ reviews_per_month              <dbl> 0.01, 0.24, 0.18, 0.19, 0.22, 0.43, 0.2~
## $ calculated_host_listings_count <dbl> 2, 1, 2, 8, 8, 8, 8, 3, 3, 4, 4, 7, 1, ~
## $ availability_365               <dbl> 365, 365, 365, 365, 365, 365, 180, 356,~

This table provides basic information about the 7,323 listings that were available as at the date compiled (22 June 2020). We can see the unique listing id, the name of the listing, the host name and id, and the neighbourhood the listing is in, together with the coordinates of the listing. Note however that Airbnb randomizes listing locations by about 150m, so there may be a slight difference between the actual coordinates and the ones stated in the table. The table also provides information about the room type, the price of the listing, the minimum nights and some review statistics for the listing (number of reviews, when the last review was, and the number of reviews per month). There is also some information about how many listings the host has in total, and how many days the listing is available for within a year (availability_365). The data types are correct except for id and host_id, which are read as numeric; we can convert them to character or categorical.

glimpse(reviews)

## Rows: 91,250
## Columns: 2
## $ listing_id <dbl> 49091, 50646, 50646, 50646, 50646, 50646, 50646, 50646, 506~
## $ date       <date> 2013-10-21, 2014-04-18, 2014-06-05, 2014-07-02, 2014-07-08~

glimpse(d_reviews)

## Rows: 91,250
## Columns: 6
## $ listing_id    <dbl> 49091, 50646, 50646, 50646, 50646, 50646, 50646, 50646, ~
## $ id            <dbl> 8243238, 11909864, 13823948, 15117222, 15426462, 1555291~
## $ date          <date> 2013-10-21, 2014-04-18, 2014-06-05, 2014-07-02, 2014-07~
## $ reviewer_id   <dbl> 8557223, 1356099, 15222393, 5543172, 817532, 10942382, 1~
## $ reviewer_name <chr> "Jared", "James", "Welli", "Cyril", "Jake", "Subba", "Cl~
## $ comments      <chr> "Fran was absolutely gracious and welcoming. Made my sta~

Similarly, glimpsing the other data imports shows the following:

- The reviews data shows only the date of each review and the listing_id that it is associated with. There are 91,250 reviews in total.
- In addition to the data in the reviews table, the detailed reviews table gives a unique review id, information about the reviewer and the comments made (text data). The number of reviews in the reviews and detailed reviews tables match.

glimpse(calendar)

## Rows: 2,673,655
## Columns: 7
## $ listing_id     <dbl> 819034, 2362558, 2362558, 2362558, 2362558, 2362558, 23~
## $ date           <date> 2020-06-22, 2020-06-23, 2020-06-24, 2020-06-25, 2020-0~
## $ available      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,~
## $ price          <chr> "$350.00", "$362.00", "$362.00", "$362.00", "$362.00", ~
## $ adjusted_price <chr> "$350.00", "$344.00", "$344.00", "$344.00", "$344.00", ~
## $ minimum_nights <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ maximum_nights <dbl> 30, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28,~

The calendar data shows, for each listing and date, whether the listing is available, the price for that particular date and the minimum and maximum number of nights that the listing can be booked for. As such, this table has the most rows, with about 2.7 million rows of data. The price and adjusted price will need to be changed to numerical (instead of character).

glimpse(d_listings)

## Rows: 7,323
## Columns: 106
## $ id                                           <dbl> 49091, 50646, 56334, 7160~
## $ listing_url                                  <chr> "https://www.airbnb.com/r~
## $ scrape_id                                    <dbl> 2.020062e+13, 2.020062e+1~
## $ last_scraped                                 <date> 2020-06-22, 2020-06-22, ~
## $ name                                         <chr> "COZICOMFORT LONG TERM ST~
## $ summary                                      <chr> NA, "Fully furnished bedr~
## $ space                                        <chr> "This is Room No. 2.(avai~
## $ description                                  <chr> "This is Room No. 2.(avai~
## $ experiences_offered                          <chr> "none", "none", "none", "~
## $ neighborhood_overview                        <chr> NA, "The serenity & quiet~
## $ notes                                        <chr> NA, "Accommodation has a ~
## $ transit                                      <chr> NA, "Less than 400m from ~
## $ access                                       <chr> NA, "Kitchen, washing fac~
## $ interaction                                  <chr> NA, "We love to host peop~
## $ house_rules                                  <chr> "No smoking indoors. Plea~
## $ thumbnail_url                                <lgl> NA, NA, NA, NA, NA, NA, N~
## $ medium_url                                   <lgl> NA, NA, NA, NA, NA, NA, N~
## $ picture_url                                  <chr> "https://a0.muscache.com/~
## $ xl_picture_url                               <lgl> NA, NA, NA, NA, NA, NA, N~
## $ host_id                                      <dbl> 266763, 227796, 266763, 3~
## $ host_url                                     <chr> "https://www.airbnb.com/u~
## $ host_name                                    <chr> "Francesca", "Sujatha", "~
## $ host_since                                   <date> 2010-10-20, 2010-09-08, ~
## $ host_location                                <chr> "singapore", "Singapore, ~
## $ host_about                                   <chr> "I am a private tutor by ~
## $ host_response_time                           <chr> "within an hour", "N/A", ~
## $ host_response_rate                           <chr> "100%", "N/A", "100%", "1~
## $ host_acceptance_rate                         <chr> "N/A", "N/A", "N/A", "100~
## $ host_is_superhost                            <lgl> FALSE, FALSE, FALSE, FALS~
## $ host_thumbnail_url                           <chr> "https://a0.muscache.com/~
## $ host_picture_url                             <chr> "https://a0.muscache.com/~
## $ host_neighbourhood                           <chr> "Woodlands", "Bukit Timah~
## $ host_listings_count                          <dbl> 2, 1, 2, 8, 8, 8, 8, 4, 4~
## $ host_total_listings_count                    <dbl> 2, 1, 2, 8, 8, 8, 8, 4, 4~
## $ host_verifications                           <chr> "['email', 'phone', 'face~
## $ host_has_profile_pic                         <lgl> TRUE, TRUE, TRUE, TRUE, T~
## $ host_identity_verified                       <lgl> FALSE, FALSE, FALSE, TRUE~
## $ street                                       <chr> "Singapore, Singapore", "~
## $ neighbourhood                                <chr> "Woodlands", "Bukit Timah~
## $ neighbourhood_cleansed                       <chr> "Woodlands", "Bukit Timah~
## $ neighbourhood_group_cleansed                 <chr> "North Region", "Central ~
## $ city                                         <chr> "Singapore", "Singapore",~
## $ state                                        <chr> NA, NA, NA, NA, NA, NA, N~
## $ zipcode                                      <chr> "730702", "589664", NA, "~
## $ market                                       <chr> "Singapore", "Singapore",~
## $ smart_location                               <chr> "Singapore", "Singapore",~
## $ country_code                                 <chr> "SG", "SG", "SG", "SG", "~
## $ country                                      <chr> "Singapore", "Singapore",~
## $ latitude                                     <dbl> 1.44255, 1.33235, 1.44246~
## $ longitude                                    <dbl> 103.7958, 103.7852, 103.7~
## $ is_location_exact                            <lgl> TRUE, TRUE, TRUE, TRUE, T~
## $ property_type                                <chr> "Apartment", "Apartment",~
## $ room_type                                    <chr> "Private room", "Private ~
## $ accommodates                                 <dbl> 1, 2, 1, 6, 3, 3, 6, 1, 2~
## $ bathrooms                                    <dbl> 1.0, 1.0, 1.0, 1.0, 0.5, ~
## $ bedrooms                                     <dbl> 1, 1, 1, 2, 1, 1, 1, 1, 1~
## $ beds                                         <dbl> 1, 1, 1, 3, 1, 2, 7, 1, 2~
## $ bed_type                                     <chr> "Real Bed", "Real Bed", "~
## $ amenities                                    <chr> "{TV,\"Cable TV\",Interne~
## $ square_feet                                  <dbl> 0, NA, 0, 205, NA, NA, 45~
## $ price                                        <chr> "$84.00", "$80.00", "$70.~
## $ weekly_price                                 <chr> NA, "$400.00", NA, NA, "$~
## $ monthly_price                                <chr> "$1,048.00", "$1,600.00",~
## $ security_deposit                             <chr> NA, NA, NA, "$279.00", "$~
## $ cleaning_fee                                 <chr> NA, NA, NA, "$56.00", "$2~
## $ guests_included                              <dbl> 1, 2, 1, 4, 1, 1, 4, 1, 1~
## $ extra_people                                 <chr> "$14.00", "$20.00", "$14.~
## $ minimum_nights                               <dbl> 180, 90, 6, 90, 90, 90, 1~
## $ maximum_nights                               <dbl> 360, 730, 14, 1125, 1125,~
## $ minimum_minimum_nights                       <dbl> 180, 90, 6, 90, 90, 90, 1~
## $ maximum_minimum_nights                       <dbl> 180, 90, 6, 90, 90, 90, 1~
## $ minimum_maximum_nights                       <dbl> 360, 730, 14, 1125, 1125,~
## $ maximum_maximum_nights                       <dbl> 360, 730, 14, 1125, 1125,~
## $ minimum_nights_avg_ntm                       <dbl> 180, 90, 6, 90, 90, 90, 1~
## $ maximum_nights_avg_ntm                       <dbl> 360, 730, 14, 1125, 1125,~
## $ calendar_updated                             <chr> "73 months ago", "71 mont~
## $ has_availability                             <lgl> TRUE, TRUE, TRUE, TRUE, T~
## $ availability_30                              <dbl> 30, 30, 30, 30, 30, 30, 3~
## $ availability_60                              <dbl> 60, 60, 60, 60, 60, 60, 6~
## $ availability_90                              <dbl> 90, 90, 90, 90, 90, 90, 9~
## $ availability_365                             <dbl> 365, 365, 365, 365, 365, ~
## $ calendar_last_scraped                        <date> 2020-06-22, 2020-06-22, ~
## $ number_of_reviews                            <dbl> 1, 18, 20, 20, 24, 48, 29~
## $ number_of_reviews_ltm                        <dbl> 0, 0, 0, 8, 4, 13, 6, 2, ~
## $ first_review                                 <date> 2013-10-21, 2014-04-18, ~
## $ last_review                                  <date> 2013-10-21, 2014-12-26, ~
## $ review_scores_rating                         <dbl> 94, 91, 98, 89, 83, 88, 8~
## $ review_scores_accuracy                       <dbl> 10, 9, 10, 9, 8, 9, 9, 10~
## $ review_scores_cleanliness                    <dbl> 10, 10, 10, 8, 8, 9, 8, 1~
## $ review_scores_checkin                        <dbl> 10, 10, 10, 9, 9, 9, 9, 1~
## $ review_scores_communication                  <dbl> 10, 10, 10, 10, 9, 9, 9, ~
## $ review_scores_location                       <dbl> 8, 9, 8, 9, 8, 9, 9, 10, ~
## $ review_scores_value                          <dbl> 8, 9, 9, 9, 8, 9, 8, 10, ~
## $ requires_license                             <lgl> FALSE, FALSE, FALSE, FALS~
## $ license                                      <lgl> NA, NA, NA, NA, NA, NA, N~
## $ jurisdiction_names                           <lgl> NA, NA, NA, NA, NA, NA, N~
## $ instant_bookable                             <lgl> FALSE, FALSE, FALSE, TRUE~
## $ is_business_travel_ready                     <lgl> FALSE, FALSE, FALSE, FALS~
## $ cancellation_policy                          <chr> "flexible", "moderate", "~
## $ require_guest_profile_picture                <lgl> TRUE, FALSE, TRUE, FALSE,~
## $ require_guest_phone_verification             <lgl> TRUE, TRUE, TRUE, TRUE, T~
## $ calculated_host_listings_count               <dbl> 2, 1, 2, 8, 8, 8, 8, 3, 3~
## $ calculated_host_listings_count_entire_homes  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ calculated_host_listings_count_private_rooms <dbl> 2, 1, 2, 8, 8, 8, 8, 3, 3~
## $ calculated_host_listings_count_shared_rooms  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ reviews_per_month                            <dbl> 0.01, 0.24, 0.18, 0.19, 0~

Detailed listings is the widest table, comprising the same 7,323 listings and 106 columns that include data on:
- the listing URL, when it was scraped
- description of the accommodation, space, number of bedrooms, beds and bathrooms
- neighbourhood overview (including transit and access) and any experiences offered by the host
- information about the host (whether they are a superhost, verified, response time and acceptance rate, etc) and number of listings the host has
- geographical information (spatial coordinates, neighbourhoods, street information if available)
- pricing and availability, including any deposits and fees
- review information, including review scores
- guest requirements, house rules and cancellation policy
Similar to the other tables, the price data needs to be changed to numerical, and all ids will need to be treated as character instead of numerical.

The following code makes the changes as specified above.

# Change Data Types of id to characters
listings <- listings %>% mutate_at(vars(id, host_id), as.character)
reviews <- reviews %>% mutate_at(vars(listing_id), as.character)
d_reviews <- d_reviews %>% mutate_at(vars(id, reviewer_id, listing_id), as.character)
calendar <- calendar %>% mutate_at(vars(listing_id), as.character)
d_listings <- d_listings %>% mutate_at(vars(host_id, id), as.character)

# Change price in detailed listings to numerical
# Remove $ and , symbol in columns where currency is read as character.
strip_dollars = function(x) {as.numeric(gsub("[\\$,]", "", x))}
d_listings[,61:65] <- sapply(d_listings[,61:65], strip_dollars)
d_listings[,67] <- sapply(d_listings[,67], strip_dollars)
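As a quick sanity check of the currency conversion, the helper can be run on a few hand-made strings. This is only a sketch: `raw_prices` is a hypothetical vector, not a column from the dataset, and `strip_dollars()` is repeated here so the snippet runs on its own.

```r
# Same helper as above: drop "$" and "," and convert to numeric
strip_dollars <- function(x) as.numeric(gsub("[\\$,]", "", x))

# Hypothetical examples of how prices appear in the raw file
raw_prices <- c("$84.00", "$1,200.00", NA)

clean_prices <- strip_dollars(raw_prices)
clean_prices  # missing values stay NA rather than becoming 0
```

Missing prices stay missing after the conversion, which matters for the missing-data checks in the next section.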

3.2.3 Missing data

We examine the data for missing values and decide how to handle them. A missing value may have been recorded as a zero, which we would need to recode before analysing the data; or it may be genuinely missing (NA), in which case the record may be omitted depending on our use of the data.

df_status() from the funModeling package is a useful function for this: for each column it shows the number and percentage of zeros, NAs and infinite values, the data type, and the number of unique values.
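To make the columns of the df_status() output concrete, the following base R sketch computes comparable counts by hand. It is not the funModeling implementation. The data frame `toy` and its values are hypothetical, and counting unique values over non-missing entries only is an assumption.

```r
# Hypothetical miniature of a listings-style table
toy <- data.frame(
  number_of_reviews = c(0, 3, 0, 12),
  reviews_per_month = c(NA, 0.2, NA, 1.1),
  host_name = c("Ann", NA, "Raj", "Mei"),
  stringsAsFactors = FALSE
)

# One row per column: zeros, NAs, type and distinct non-missing values
status <- data.frame(
  variable = names(toy),
  q_zeros = vapply(toy, function(col) sum(col == 0, na.rm = TRUE), numeric(1)),
  q_na = vapply(toy, function(col) sum(is.na(col)), numeric(1)),
  type = vapply(toy, function(col) class(col)[1], character(1)),
  unique = vapply(toy, function(col) length(unique(col[!is.na(col)])), numeric(1)),
  row.names = NULL
)

# Percentages relative to the number of rows
status$p_zeros <- round(100 * status$q_zeros / nrow(toy), 2)
status$p_na <- round(100 * status$q_na / nrow(toy), 2)
status
```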

Listings

df_status(listings)

##                          variable q_zeros p_zeros q_na  p_na q_inf p_inf
## 1                              id       0    0.00    0  0.00     0     0
## 2                            name       0    0.00    1  0.01     0     0
## 3                         host_id       0    0.00    0  0.00     0     0
## 4                       host_name       0    0.00   22  0.30     0     0
## 5             neighbourhood_group       0    0.00    0  0.00     0     0
## 6                   neighbourhood       0    0.00    0  0.00     0     0
## 7                        latitude       0    0.00    0  0.00     0     0
## 8                       longitude       0    0.00    0  0.00     0     0
## 9                       room_type       0    0.00    0  0.00     0     0
## 10                          price       0    0.00    0  0.00     0     0
## 11                 minimum_nights       0    0.00    0  0.00     0     0
## 12              number_of_reviews    2835   38.71    0  0.00     0     0
## 13                    last_review       0    0.00 2835 38.71     0     0
## 14              reviews_per_month       0    0.00 2835 38.71     0     0
## 15 calculated_host_listings_count       0    0.00    0  0.00     0     0
## 16               availability_365    1761   24.05    0  0.00     0     0
##         type unique
## 1  character   7323
## 2  character   6766
## 3  character   2466
## 4  character   1739
## 5  character      5
## 6  character     43
## 7    numeric   4579
## 8    numeric   4974
## 9  character      4
## 10   numeric    429
## 11   numeric     74
## 12   numeric    215
## 13      Date   1158
## 14   numeric    451
## 15   numeric     53
## 16   numeric    319

The key listing information (host_id, longitude, latitude, neighbourhood, room_type and price) is all intact.
While some listings have missing data (name, host_name, last_review and reviews_per_month), these gaps are not consequential:
- The 2,835 missing values for last_review and reviews_per_month match exactly the 2,835 listings with zero reviews, so they can be taken as listings that have not yet been reviewed (and have possibly never been rented out).
- While 22 host names are missing, there are no missing values for host_id (the unique identifier of the host), so we do not need to be concerned about this.
- Similarly, the one listing with a missing name has sufficient other identifying data to be included.
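The claim that missing review fields coincide with zero-review listings can be checked row by row rather than by comparing totals. A minimal sketch, using a hypothetical data frame `toy` in place of the listings table:

```r
# Hypothetical stand-in for the listings table
toy <- data.frame(
  number_of_reviews = c(0, 3, 0, 12),
  reviews_per_month = c(NA, 0.2, NA, 1.1)
)

no_reviews   <- toy$number_of_reviews == 0
missing_rate <- is.na(toy$reviews_per_month)

# TRUE only if every NA sits on a zero-review row and vice versa
same_rows <- all(no_reviews == missing_rate)
same_rows
```

On the real data, the same comparison would use the number_of_reviews and reviews_per_month columns of listings.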

Calendar

df_status(calendar)

##         variable q_zeros p_zeros q_na p_na q_inf p_inf      type unique
## 1     listing_id       0    0.00    0 0.00     0     0 character   7323
## 2           date       0    0.00    0 0.00     0     0      Date    366
## 3      available 1037094   38.79    0 0.00     0     0   logical      2
## 4          price       0    0.00 2215 0.08     0     0 character    744
## 5 adjusted_price       0    0.00 2215 0.08     0     0 character    743
## 6 minimum_nights       0    0.00  725 0.03     0     0   numeric     75
## 7 maximum_nights       0    0.00  725 0.03     0     0   numeric    109

Each row is one date for one listing. We can surmise that the rows with missing price, minimum_nights or maximum_nights are dates on which the listing was not actually available, despite being marked as available.
As we are not doing any analysis on the calendar availability, we can leave the data as is for now.

Reviews and Detailed Reviews

df_status(reviews)

##     variable q_zeros p_zeros q_na p_na q_inf p_inf      type unique
## 1 listing_id       0       0    0    0     0     0 character   4488
## 2       date       0       0    0    0     0     0      Date   2687

df_status(d_reviews)

##        variable q_zeros p_zeros q_na p_na q_inf p_inf      type unique
## 1    listing_id       0       0    0 0.00     0     0 character   4488
## 2            id       0       0    0 0.00     0     0 character  91250
## 3          date       0       0    0 0.00     0     0      Date   2687
## 4   reviewer_id       0       0    0 0.00     0     0 character  83957
## 5 reviewer_name       0       0    0 0.00     0     0 character  32853
## 6      comments       0       0  104 0.11     0     0 character  87115

The reviews table does not have any missing data.
We also see that 61% of listings (4,488 of 7,323) have at least 1 review.
The detailed reviews table has 104 reviews with missing comments; this could be due to reviewers rating the listing without providing any commentary. We will leave them as is, unless removal is necessary (e.g. for text mining).

Detailed Listings

df_status(d_listings)

##                                         variable q_zeros p_zeros q_na   p_na
## 1                                             id       0    0.00    0   0.00
## 2                                    listing_url       0    0.00    0   0.00
## 3                                      scrape_id       0    0.00    0   0.00
## 4                                   last_scraped       0    0.00    0   0.00
## 5                                           name       0    0.00    1   0.01
## 6                                        summary       0    0.00  324   4.42
## 7                                          space       0    0.00 1957  26.72
## 8                                    description       0    0.00  249   3.40
## 9                            experiences_offered       0    0.00    0   0.00
## 10                         neighborhood_overview       0    0.00 2933  40.05
## 11                                         notes       0    0.00 3315  45.27
## 12                                       transit       0    0.00 2895  39.53
## 13                                        access       0    0.00 2821  38.52
## 14                                   interaction       0    0.00 3250  44.38
## 15                                   house_rules       0    0.00 3874  52.90
## 16                                 thumbnail_url       0    0.00 7323 100.00
## 17                                    medium_url       0    0.00 7323 100.00
## 18                                   picture_url       0    0.00    0   0.00
## 19                                xl_picture_url       0    0.00 7323 100.00
## 20                                       host_id       0    0.00    0   0.00
## 21                                      host_url       0    0.00    0   0.00
## 22                                     host_name       0    0.00   22   0.30
## 23                                    host_since       0    0.00   22   0.30
## 24                                 host_location       0    0.00   41   0.56
## 25                                    host_about       0    0.00 2441  33.33
## 26                            host_response_time       0    0.00   22   0.30
## 27                            host_response_rate       0    0.00   22   0.30
## 28                          host_acceptance_rate       0    0.00   22   0.30
## 29                             host_is_superhost    6143   83.89   22   0.30
## 30                            host_thumbnail_url       0    0.00   22   0.30
## 31                              host_picture_url       0    0.00   22   0.30
## 32                            host_neighbourhood       0    0.00  842  11.50
## 33                           host_listings_count     339    4.63   22   0.30
## 34                     host_total_listings_count     339    4.63   22   0.30
## 35                            host_verifications       0    0.00    0   0.00
## 36                          host_has_profile_pic      19    0.26   22   0.30
## 37                        host_identity_verified    5668   77.40   22   0.30
## 38                                        street       0    0.00    0   0.00
## 39                                 neighbourhood       0    0.00    2   0.03
## 40                        neighbourhood_cleansed       0    0.00    0   0.00
## 41                  neighbourhood_group_cleansed       0    0.00    0   0.00
## 42                                          city       0    0.00   64   0.87
## 43                                         state       1    0.01 6817  93.09
## 44                                       zipcode       0    0.00  818  11.17
## 45                                        market       0    0.00   90   1.23
## 46                                smart_location       0    0.00    0   0.00
## 47                                  country_code       0    0.00    0   0.00
## 48                                       country       0    0.00    0   0.00
## 49                                      latitude       0    0.00    0   0.00
## 50                                     longitude       0    0.00    0   0.00
## 51                             is_location_exact    1478   20.18    0   0.00
## 52                                 property_type       0    0.00    0   0.00
## 53                                     room_type       0    0.00    0   0.00
## 54                                  accommodates       0    0.00    0   0.00
## 55                                     bathrooms     103    1.41    3   0.04
## 56                                      bedrooms     591    8.07   12   0.16
## 57                                          beds     235    3.21   71   0.97
## 58                                      bed_type       0    0.00    0   0.00
## 59                                     amenities       0    0.00    0   0.00
## 60                                   square_feet      13    0.18 7292  99.58
## 61                                         price       0    0.00    0   0.00
## 62                                  weekly_price       0    0.00 6857  93.64
## 63                                 monthly_price       0    0.00 6826  93.21
## 64                              security_deposit    2191   29.92 2217  30.27
## 65                                  cleaning_fee     885   12.09 1947  26.59
## 66                               guests_included       0    0.00    0   0.00
## 67                                  extra_people    3144   42.93    0   0.00
## 68                                minimum_nights       0    0.00    0   0.00
## 69                                maximum_nights       0    0.00    0   0.00
## 70                        minimum_minimum_nights       0    0.00    0   0.00
## 71                        maximum_minimum_nights       0    0.00    0   0.00
## 72                        minimum_maximum_nights       0    0.00    0   0.00
## 73                        maximum_maximum_nights       0    0.00    0   0.00
## 74                        minimum_nights_avg_ntm       0    0.00    0   0.00
## 75                        maximum_nights_avg_ntm       0    0.00    0   0.00
## 76                              calendar_updated       0    0.00    0   0.00
## 77                              has_availability       0    0.00    0   0.00
## 78                               availability_30    2453   33.50    0   0.00
## 79                               availability_60    2154   29.41    0   0.00
## 80                               availability_90    2045   27.93    0   0.00
## 81                              availability_365    1761   24.05    0   0.00
## 82                         calendar_last_scraped       0    0.00    0   0.00
## 83                             number_of_reviews    2835   38.71    0   0.00
## 84                         number_of_reviews_ltm    4206   57.44    0   0.00
## 85                                  first_review       0    0.00 2835  38.71
## 86                                   last_review       0    0.00 2835  38.71
## 87                          review_scores_rating       0    0.00 2969  40.54
## 88                        review_scores_accuracy       0    0.00 2974  40.61
## 89                     review_scores_cleanliness       0    0.00 2972  40.58
## 90                         review_scores_checkin       0    0.00 2978  40.67
## 91                   review_scores_communication       0    0.00 2974  40.61
## 92                        review_scores_location       0    0.00 2979  40.68
## 93                           review_scores_value       0    0.00 2978  40.67
## 94                              requires_license    7323  100.00    0   0.00
## 95                                       license       0    0.00 7323 100.00
## 96                            jurisdiction_names       0    0.00 7323 100.00
## 97                              instant_bookable    4227   57.72    0   0.00
## 98                      is_business_travel_ready    7323  100.00    0   0.00
## 99                           cancellation_policy       0    0.00    0   0.00
## 100                require_guest_profile_picture    7289   99.54    0   0.00
## 101             require_guest_phone_verification    7276   99.36    0   0.00
## 102               calculated_host_listings_count       0    0.00    0   0.00
## 103  calculated_host_listings_count_entire_homes    2784   38.02    0   0.00
## 104 calculated_host_listings_count_private_rooms    3057   41.75    0   0.00
## 105  calculated_host_listings_count_shared_rooms    6612   90.29    0   0.00
## 106                            reviews_per_month       0    0.00 2835  38.71
##     q_inf p_inf      type unique
## 1       0     0 character   7323
## 2       0     0 character   7323
## 3       0     0   numeric      1
## 4       0     0      Date      2
## 5       0     0 character   6766
## 6       0     0 character   4365
## 7       0     0 character   3139
## 8       0     0 character   5180
## 9       0     0 character      1
## 10      0     0 character   2135
## 11      0     0 character   1634
## 12      0     0 character   2211
## 13      0     0 character   2002
## 14      0     0 character   1665
## 15      0     0 character   2059
## 16      0     0   logical      0
## 17      0     0   logical      0
## 18      0     0 character   6777
## 19      0     0   logical      0
## 20      0     0 character   2466
## 21      0     0 character   2466
## 22      0     0 character   1739
## 23      0     0      Date   1576
## 24      0     0 character    217
## 25      0     0 character   1174
## 26      0     0 character      5
## 27      0     0 character     56
## 28      0     0 character     79
## 29      0     0   logical      2
## 30      0     0 character   2448
## 31      0     0 character   2448
## 32      0     0 character     62
## 33      0     0   numeric     60
## 34      0     0   numeric     60
## 35      0     0 character    187
## 36      0     0   logical      2
## 37      0     0   logical      2
## 38      0     0 character     93
## 39      0     0 character     45
## 40      0     0 character     43
## 41      0     0 character      5
## 42      0     0 character     39
## 43      0     0 character     50
## 44      0     0 character   1975
## 45      0     0 character      2
## 46      0     0 character     43
## 47      0     0 character      1
## 48      0     0 character      1
## 49      0     0   numeric   4579
## 50      0     0   numeric   4974
## 51      0     0   logical      2
## 52      0     0 character     26
## 53      0     0 character      4
## 54      0     0   numeric     16
## 55      0     0   numeric     24
## 56      0     0   numeric     10
## 57      0     0   numeric     26
## 58      0     0 character      5
## 59      0     0 character   5621
## 60      0     0   numeric     14
## 61      0     0   numeric    429
## 62      0     0   numeric    203
## 63      0     0   numeric    193
## 64      0     0   numeric    156
## 65      0     0   numeric    113
## 66      0     0   numeric     16
## 67      0     0   numeric     81
## 68      0     0   numeric     74
## 69      0     0   numeric    113
## 70      0     0   numeric     72
## 71      0     0   numeric     75
## 72      0     0   numeric    106
## 73      0     0   numeric    106
## 74      0     0   numeric    140
## 75      0     0   numeric    154
## 76      0     0 character     79
## 77      0     0   logical      1
## 78      0     0   numeric     31
## 79      0     0   numeric     61
## 80      0     0   numeric     90
## 81      0     0   numeric    319
## 82      0     0      Date      2
## 83      0     0   numeric    215
## 84      0     0   numeric     73
## 85      0     0      Date   1730
## 86      0     0      Date   1158
## 87      0     0   numeric     47
## 88      0     0   numeric      9
## 89      0     0   numeric      9
## 90      0     0   numeric      9
## 91      0     0   numeric      8
## 92      0     0   numeric      8
## 93      0     0   numeric      9
## 94      0     0   logical      1
## 95      0     0   logical      0
## 96      0     0   logical      0
## 97      0     0   logical      2
## 98      0     0   logical      1
## 99      0     0 character      5
## 100     0     0   logical      2
## 101     0     0   logical      2
## 102     0     0   numeric     53
## 103     0     0   numeric     45
## 104     0     0   numeric     31
## 105     0     0   numeric     12
## 106     0     0   numeric    451

Several columns carry no information: thumbnail_url, medium_url, xl_picture_url, license and jurisdiction_names are entirely NA, while requires_license and is_business_travel_ready are FALSE for every listing. These fields are probably populated for other cities, but are not available or required for Singapore. We can omit these columns.
Two further columns with a high percentage of missing data can be safely omitted: state is not applicable in the Singapore context (93.09% missing), and square_feet is missing for almost all listings (99.58%).
At this stage of the study, we will leave the remaining data in this table as is, until we need to wrangle the data for specific analysis.
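One way to find and drop such columns is sketched below in base R. The data frame `toy` and its column values are hypothetical. On the real table, the flagged columns should be reviewed before dropping, since constant but meaningful fields (e.g. country) would also be flagged.

```r
# Hypothetical miniature of the detailed listings table
toy <- data.frame(
  id = c("1", "2", "3"),
  price = c(84, 80, 70),
  license = c(NA, NA, NA),
  requires_license = c(FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Columns where every value is NA
all_na <- names(toy)[vapply(toy, function(col) all(is.na(col)), logical(1))]

# Columns with exactly one distinct non-missing value
constant <- names(toy)[vapply(
  toy, function(col) length(unique(col[!is.na(col)])) == 1, logical(1)
)]

# Drop the flagged columns, keeping the result a data frame
drop_cols <- union(all_na, constant)
toy_clean <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
names(toy_clean)
```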

3.2.4 Geospatial Data Wrangling

To be able to map the listings, we need to convert the listings data into an sf object. The st_as_sf() function converts a data frame into an sf object, taking the point coordinates from the longitude and latitude columns of the listings data frame. As the longitude/latitude coordinates are in the WGS84 geographic coordinate system (EPSG:4326), we assign that CRS to the listings data and then transform it to SVY21 (EPSG:3414) to match the neighbourhoods polygon data, so that both layers share the same CRS. st_as_sf() returns a new object and leaves the original listings data frame untouched.

# Convert listings to SF dataframe
listings_sf <- listings %>% 
                st_as_sf(coords = c("longitude", "latitude"),
                         crs = 4326) %>%
                st_transform(crs = 3414)

head() is used to display the first six records and their details. It shows the geometry type (POINT), and we can check that the projected CRS is SVY21 as intended. From the records, we can see that it is the same data frame as listings, except that a geometry column, holding one point per row, has replaced the longitude and latitude columns of the original listings data frame.

head(listings_sf)

## Simple feature collection with 6 features and 14 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 22646.02 ymin: 34950.06 xmax: 42212.88 ymax: 47135.4
## projected CRS:  SVY21 / Singapore TM
## # A tibble: 6 x 15
##   id    name    host_id host_name neighbourhood_g~ neighbourhood room_type price
##   <chr> <chr>   <chr>   <chr>     <chr>            <chr>         <chr>     <dbl>
## 1 49091 COZICO~ 266763  Francesca North Region     Woodlands     Private ~    84
## 2 50646 Pleasa~ 227796  Sujatha   Central Region   Bukit Timah   Private ~    80
## 3 56334 COZICO~ 266763  Francesca North Region     Woodlands     Private ~    70
## 4 71609 Ensuit~ 367042  Belinda   East Region      Tampines      Private ~   167
## 5 71896 B&B  R~ 367042  Belinda   East Region      Tampines      Private ~    95
## 6 71903 Room 2~ 367042  Belinda   East Region      Tampines      Private ~    84
## # ... with 7 more variables: minimum_nights <dbl>, number_of_reviews <dbl>,
## #   last_review <date>, reviews_per_month <dbl>,
## #   calculated_host_listings_count <dbl>, availability_365 <dbl>,
## #   geometry <POINT [m]>

glimpse() shows the point details of the geometry column, giving the x,y coordinates in SVY21 projection.

glimpse(listings_sf)

## Rows: 7,323
## Columns: 15
## $ id                             <chr> "49091", "50646", "56334", "71609", "71~
## $ name                           <chr> "COZICOMFORT LONG TERM STAY ROOM 2", "P~
## $ host_id                        <chr> "266763", "227796", "266763", "367042",~
## $ host_name                      <chr> "Francesca", "Sujatha", "Francesca", "B~
## $ neighbourhood_group            <chr> "North Region", "Central Region", "Nort~
## $ neighbourhood                  <chr> "Woodlands", "Bukit Timah", "Woodlands"~
## $ room_type                      <chr> "Private room", "Private room", "Privat~
## $ price                          <dbl> 84, 80, 70, 167, 95, 84, 209, 52, 54, 4~
## $ minimum_nights                 <dbl> 180, 90, 6, 90, 90, 90, 1, 90, 90, 14, ~
## $ number_of_reviews              <dbl> 1, 18, 20, 20, 24, 48, 29, 176, 199, 20~
## $ last_review                    <date> 2013-10-21, 2014-12-26, 2015-10-01, 20~
## $ reviews_per_month              <dbl> 0.01, 0.24, 0.18, 0.19, 0.22, 0.43, 0.2~
## $ calculated_host_listings_count <dbl> 2, 1, 2, 8, 8, 8, 8, 3, 3, 4, 4, 7, 1, ~
## $ availability_365               <dbl> 365, 365, 365, 365, 365, 365, 180, 356,~
## $ geometry                       <POINT [m]> POINT (23824.77 47135.4), POINT (~
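The conversion can also be verified programmatically instead of by reading the printed header. The sketch below assumes the sf package is installed. The one-row data frame `toy` and its coordinates are hypothetical stand-ins for listings.

```r
library(sf)

# Hypothetical single listing with WGS84 longitude/latitude
toy <- data.frame(id = "1", longitude = 103.8198, latitude = 1.3521)

# Same two steps as for listings_sf: declare EPSG:4326, then project to EPSG:3414
toy_sf <- st_transform(
  st_as_sf(toy, coords = c("longitude", "latitude"), crs = 4326),
  crs = 3414
)

st_crs(toy_sf)$epsg     # EPSG code of the projected object
st_coordinates(toy_sf)  # X/Y coordinates in metres
names(toy)              # the original data frame keeps longitude and latitude
```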

4 Exploratory Data Analysis

Exploratory data analysis (EDA) is performed on the data to have a better understanding of the data, identify ways to approach the analysis and suggest hypotheses to test. Some potential questions to answer are:

- How are the Airbnb rentals spatially distributed, and do they cluster together?
- How do location and room type affect Airbnb rental prices?
- Who are the hosts? Are they in violation of local regulations, or are they providing a complementary service for international visitors, as claimed by Airbnb?
- Do Airbnb rentals affect the pricing and availability of housing in the area?

ggplot2 is mainly used to graph and plot the data to answer some of these questions.

4.1 Type of Accommodation

Airbnb uses the planning boundaries from the Urban Redevelopment Authority of Singapore, which defines 55 planning areas across 5 main regions. Listings in the dataset fall within 43 of these neighbourhoods. The following shows the types of accommodation available and the price distribution of each accommodation type using the cleaned data.

4.1.1 Room Types by Region

The number of listings is first summarized for the various neighbourhood groups and room type. ggplot2 is then used to plot a bar chart of the number of listings for different room types across the different regions.

The majority of listings are entire apartments/houses for rent, followed by private rooms for rent. Shared rooms constitute the lowest number of listings in Singapore.

Examining the listings by region, we see unsurprisingly that the majority of the listings are in the Central Region. In fact, 98% of the hotel room listings are in the Central Region.

In the other regions, there are more listings of private rooms than of entire apartments, with a small proportion of shared rooms and a minuscule number of hotel listings. We could surmise that these non-central listings are possibly owner-occupied homes whose owners are renting out a spare room for additional income.

# Summarizing the types of listings by neighbourhood groups
regionlist <- listings %>%
              group_by(neighbourhood_group, room_type) %>%
              summarise(
                num_listings = n(),
                avg_price = mean(price),
                med_price = median(price))

# Plotting the type of accommodation by region
ggplot(regionlist, aes(x=room_type, fill = room_type)) + theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) + 
      geom_col(aes(y = num_listings)) +
      facet_grid(cols=vars(neighbourhood_group), margins = T, labeller = labeller(neighbourhood_group = label_wrap_gen(width = 5, multi_line = TRUE))) +
      labs(x = "", y = "No. of listings", fill = "Room Type")

No. of listings by region and room type
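Shares such as the proportion of hotel rooms located in the Central Region can be read off a cross-tabulation. A minimal base R sketch follows; the rows of `toy` are hypothetical and far fewer than in the real data.

```r
# Hypothetical listings with region and room type only
toy <- data.frame(
  neighbourhood_group = c("Central Region", "Central Region", "Central Region",
                          "East Region", "West Region"),
  room_type = c("Hotel room", "Hotel room", "Entire home/apt",
                "Private room", "Hotel room")
)

# Counts by room type (rows) and region (columns)
counts <- table(toy$room_type, toy$neighbourhood_group)

# Row percentages: for each room type, the share found in each region
share <- prop.table(counts, margin = 1) * 100
hotel_central <- share["Hotel room", "Central Region"]
round(hotel_central, 1)
```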

4.1.2 Pricing of Room Types

We use boxplots of the price for the different room types to understand the distribution of pricing (left figure, Listing price). We can see that there are outliers, e.g. listings of more than $10,000 per day for an entire home/apt or a private room. It is possible that these are mistakes in the listing price (e.g. a monthly price entered instead of a daily one). Either way, we need to remove the outliers to make a proper comparison and to be able to zoom in on the variations.

# Remove price outliers from listings, using the 99th percentile as the cut-off
outlier_price <- quantile(listings$price, 0.99)
listings_cleanprice <- listings %>% filter(price <= outlier_price)

# Plot boxplot of prices for each room type
p1 <- ggplot(listings, aes(x=room_type, fill = room_type)) + 
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(), legend.position = "none") + 
  geom_boxplot(aes(y=price)) +
  labs(x = "", y = "Listing Price", fill = "Room Type", title = "Listing price")

#Cleaned pricing
p2 <- ggplot(listings_cleanprice, aes(x=room_type, fill = room_type)) + 
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) + 
  geom_boxplot(aes(y=price)) +
  labs(x = "", y = "Listing Price", fill = "Room Type", title = "Listing price (outliers removed)")

grid.arrange(p1, p2, nrow = 1)

a. Listing price by room type b. Listing price by room type with outliers removed

The 99th percentile of prices ($799) was used as the cut-off for price outliers, and listings priced above it were dropped. 72 data points were removed.

As expected, entire homes and apartments command the highest prices of all listings, followed by hotel rooms and then private rooms, with shared rooms having the lowest prices, as seen in the right figure with outliers removed.
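The percentile cut-off and the number of rows it removes can be reproduced on a small example. In this sketch the `prices` vector is hypothetical: 99 ordinary prices and one extreme value.

```r
# Hypothetical prices: 99 listings at $100 and one at $10,000
prices <- c(rep(100, 99), 10000)

# 99th percentile (R's default quantile type interpolates between order statistics)
cutoff <- quantile(prices, 0.99)

# Keep prices at or below the cut-off and count what was dropped
kept <- prices[prices <= cutoff]
n_removed <- length(prices) - length(kept)
n_removed
```

Because the default quantile is interpolated, the cut-off need not equal any observed price. Only values strictly above it are dropped.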

4.1.3 Room Types by Price and Region

Now that we have removed the outliers, we use the facet_grid() function in ggplot2 to draw boxplots of the price distribution by room type and region.

ggplot(listings_cleanprice, aes(x=neighbourhood_group, fill = neighbourhood_group)) + 
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
  geom_boxplot(aes(y=price)) +
  facet_grid(~room_type, labeller = labeller(room_type = label_wrap_gen(width = 5, multi_line = TRUE))) +
  labs(x = "", y = "Listing Price", fill = "Region")

Boxplot of price distribution by room type and region

From the plot above, we can see that despite the cleaning there are still many price outliers, especially in the Central Region.

The median price of entire homes and apartments is highest in the Central Region, followed by the West Region, and is fairly uniform across the other regions.
Hotel rooms show the greatest disparity across the regions: prices are widely spread in the Central Region, yet median prices in the other regions are higher than in the Central Region. This could be due to misclassification, e.g. apartments for rent or serviced apartments that are categorised as hotel rooms. There are no hotel listings in the North Region.
The median price for a private room is fairly uniform across the different regions, with more outliers in the Central Region. Interestingly, the median price for private rooms in the East Region is slightly higher than that in the Central Region and has a greater inter-quartile spread.
The East Region has the highest price and largest spread for shared rooms, followed by the North-East Region. It is interesting that the median price of a shared room in the Central Region is lower than in the other regions. This could be due to competition from more attractive listings (entire houses/apartments or private rooms), as the majority of those listings are in the Central Region.

4.1.4 Confirmatory data analysis

Prices for all listings between neighbourhoods

We use the ggstatsplot package to perform confirmatory data analysis. Our null hypothesis is that the prices of listings in different neighbourhood groups are the same. We reject the null hypothesis if the p-value is less than 0.05, i.e. at the 5% significance level (95% confidence level). First, we select only the variables required (neighbourhood_group, room_type and price) and pass them to the function ggbetweenstats() to get the p-value and a violin plot of the listing prices by neighbourhood group. This function also calculates the pairwise comparisons between the grouped variables, and shows only the significant comparisons and their p-values.

conftest <- listings_cleanprice %>% dplyr::select(neighbourhood_group, room_type, price)
ggbetweenstats(
  data = conftest,
  x = neighbourhood_group,
  y = price
)

Confirmatory analysis of listing price by region

From the above chart, we can see that the p-value for prices across all neighbourhood groups is less than 0.05, which means we reject the null hypothesis and can state that the prices of listings across neighbourhood groups are significantly different. We also see that prices in the Central Region are significantly different from those in all other regions, whilst among the remaining regions only the East and North-East Regions have significantly different prices from each other.
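The decision rule can be illustrated with a plain base R test. This sketch assumes that ggbetweenstats() was run with its default parametric settings, which to our understanding correspond to Welch's one-way ANOVA. The `toy` prices are hypothetical.

```r
# Hypothetical prices for three regions, five listings each
toy <- data.frame(
  region = rep(c("Central", "East", "West"), each = 5),
  price  = c(198, 199, 200, 201, 202,
             148, 149, 150, 151, 152,
             150, 151, 152, 153, 154)
)

# Welch's one-way ANOVA (oneway.test() does not assume equal variances by default)
res <- oneway.test(price ~ region, data = toy)

# Reject the null hypothesis of equal prices when p < alpha
alpha <- 0.05
reject_null <- res$p.value < alpha
reject_null
```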

Prices for each room type by neighbourhood group

We now look at the statistics for prices of listings in each neighbourhood group for each room type. As hotel rooms are not present in the North Region and have only 3 samples in each region other than the Central Region, we will filter hotel rooms out. Shared rooms are analysed separately further below. For the remaining two room types (entire homes and private rooms), we pass the data through the grouped_ggbetweenstats() function to test, for each room type, our null hypothesis that the prices of listings in different neighbourhood groups are the same.

conftest2 <- conftest %>% filter(room_type != "Shared room", room_type != "Hotel room")
grouped_ggbetweenstats(
  data = conftest2,
  x = neighbourhood_group,
  y = price,
  grouping.var = room_type,
  ggsignif.args = list(textsize = 4, tip_length = 0.01),
  p.adjust.method = "bonferroni", # method for adjusting p-values for multiple comparisons
  # adding new components to `ggstatsplot` default
  ggplot.component = list(ggplot2::scale_y_continuous(sec.axis = ggplot2::dup_axis())),
  k = 3,
  title.prefix = "Room Type",
  palette = "default_jama",
  package = "ggsci",
  plotgrid.args = list(nrow = 2),
  title.text = "Differences in listing prices by neighbourhoods for different room types"
)

For entire home listings, we reject the null hypothesis that the prices are the same across the neighbourhood groups. There are significant differences in price between the North-East and Central Regions, and between the North-East and West Regions.
For private room listings, we can also reject the null hypothesis that the prices are the same, and we see significant differences between the West and Central Regions, and between the West and East Regions.

sharedrooms <- conftest %>% filter(room_type == "Shared room")
 ggbetweenstats(
   data = sharedrooms,
   x = neighbourhood_group,
   y = price
)

Confirmatory analysis of price listings of shared rooms

When we compare the prices of shared room listings, we cannot reject the null hypothesis that prices are the same across the different regions, despite the high average price in the East Region. There are also no significant differences in the pairwise comparisons.

4.2 Hosts & Listings

4.2.1 Hosts with multiple listings

Next we examine the number of hosts and listings. As expected, we see a large number of hosts (70% of all hosts) with just 1 listing. However, there are also hosts with more than 1 listing, and one host with more than 300 listings to their name.

# Create table of % of hosts by no. of listings
list_byhost <- listings %>%
                group_by(host_id, host_name) %>%
                count(name = "number_of_listings", sort = TRUE) %>%
                ungroup() %>%
                group_by(number_of_listings) %>%
                count(name = "number_of_hosts")

# Plot above table
ggplot(list_byhost, aes(x=number_of_listings, y= number_of_hosts/sum(number_of_hosts)*100)) +
  geom_point() +
  labs(y="Percentage of hosts", title = "% of hosts vs number of listings", x = "number of listings")

Percentage of hosts vs No. of listings

To explore whether the market is dominated by hosts with single or multiple listings, we plot a Pareto chart that adds the cumulative number of listings to a descending list of Airbnb rentals by host. We see that almost 75% of the Airbnb ‘stock’ is taken up by hosts with multiple listings.

There are currently no regulations or legislation around the number of listings that a host may have, unlike in other cities that regulate the maximum number of listings (e.g. New York). People with more than 1 listing are likely to be agents managing these properties on behalf of landlords.

However, there are restrictions on the minimum stay for short term rentals: 3 months for private housing and 6 months for HDB flats. This means that typical tourist stays (e.g. 2-7 days) in Airbnb listings would technically be illegal. Despite this, 89% of listings have a minimum stay of less than the legislated minimum stay.

# Create Pareto Chart | Source: https://rpubs.com/dav1d00/ggpareto
list_byhost <- list_byhost[order(list_byhost$number_of_listings, decreasing = TRUE),]
list_byhost$number_of_listings1 <- factor(list_byhost$number_of_listings, levels = list_byhost$number_of_listings)
list_byhost$listfreq <- list_byhost$number_of_hosts * list_byhost$number_of_listings
list_byhost$cumul <- cumsum(list_byhost$listfreq)
nr <- nrow(list_byhost)
N <- sum(list_byhost$listfreq)
y2 <- c("  0%", " 10%", " 20%", " 30%", " 40%", " 50%", " 60%", " 70%", " 80%", " 90%", "100%")
ggplot(list_byhost, aes(x=number_of_listings1)) +
  geom_bar(aes(y=number_of_hosts), fill = "blue", stat = "identity") +
  geom_point(aes(y=cumul)) +
  # group = 1 is needed to connect points across the levels of a factor x-axis
  geom_path(aes(y=cumul, group=1)) +
  labs(y="Frequency", title = "Pareto Chart of hosts and listings", x = "No. of listings") +
  theme(plot.margin = margin(c(1,1,1,1), unit="cm"), axis.text.x = element_text(angle=90, vjust=0.6)) +
  annotate("text", x = nr + 3, y = seq(0, N, N/10), label = y2, size = 3.5, hjust = "inward")

Pareto chart of hosts and listings
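
The headline figures behind a chart like this can be checked numerically. The following is a minimal sketch using a made-up vector of listings per host (the counts are hypothetical and chosen only to keep the arithmetic easy to follow); it computes the share of single-listing hosts, the share of the listing stock held by multi-listing hosts, and the cumulative share that the Pareto line traces.

```r
# Hypothetical number of listings for each of ten hosts (not the report's data)
listings_per_host <- c(1, 1, 1, 1, 1, 1, 1, 2, 3, 10)

# Share of hosts with exactly one listing
share_single_hosts <- mean(listings_per_host == 1)

# Share of all listings that belong to hosts with more than one listing
share_multi_stock <- sum(listings_per_host[listings_per_host > 1]) /
  sum(listings_per_host)

# Cumulative share of listings, from the largest host downwards
sorted <- sort(listings_per_host, decreasing = TRUE)
cum_share <- cumsum(sorted) / sum(sorted)

round(100 * c(single_hosts = share_single_hosts,
              multi_stock  = share_multi_stock), 1)
```

In this toy example most hosts have a single listing, yet most of the stock sits with the few hosts who have several, which is the same pattern the chart is meant to reveal.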

4.2.1.1 Distribution of host type and room type

A simple way to see the split between hosts with single or multiple listings is to use a mosaic plot. While there are other packages such as vcd or ggmosaic, we use mosaicplot() from base R as that is sufficient for our needs.

# Create column with Single / Multiple host types
listings <- listings %>% mutate(host_type = ifelse(calculated_host_listings_count ==1, "Single", "Multiple"))

mosaicplot(listings$room_type ~ listings$host_type, color = c("steelblue", "wheat"), xlab = "Room Type", ylab = "Host Type", main = "Mosaic Plot of Room Type and Host Type")

Private rooms make up the majority of listings for hosts with Single listings, followed by shared rooms - which corresponds to the assumption that these are people who are renting out their spare / shared bedroom for extra cash. The number of people renting out the entire home/apt can be attributed to people who may have an investment home or are not in the country for this period. There are a few hosts with single hotel room listings that could have the wrong room type attributed, or are special / boutique offerings, as we expect hotel operators to have multiple listings.

4.2.1.2 Analysis of price of room types by single or multiple hosts

# Set seed for reproducibility
set.seed(123)
test2 <- listings %>% dplyr::select(room_type, host_type, neighbourhood_group, price)
grouped_ggbetweenstats(
  data = test2,
  x = host_type,
  y = price,
  grouping.var = room_type,
  ggsignif.args = list(textsize = 4, tip_length = 0.01),
  p.adjust.method = "bonferroni", # method for adjusting p-values for multiple comparisons
  # adding new components to `ggstatsplot` default
  ggplot.component = list(ggplot2::scale_y_continuous(sec.axis = ggplot2::dup_axis())),
  k = 3,
  title.prefix = "Room Type",
  palette = "default_jama",
  package = "ggsci",
  plotgrid.args = list(nrow = 2),
  title.text = "Differences in listing prices for single/multiple hosts by different room types"
)

Confirmatory analysis of listing price by host type

We can use confirmatory analysis to test our hypothesis that hosts with multiple listings charge the same price as hosts with single listings for each room type. From the above, we can see that the p-value > 0.01 and therefore we cannot reject the null hypothesis. We also performed the same confirmatory analysis by region and again cannot reject the null hypothesis that prices do not differ between host types.

grouped_ggbetweenstats(
  data = test2 %>% filter(room_type != "Hotel room"),
  x = host_type,
  y = price,
  grouping.var = neighbourhood_group,
  ggsignif.args = list(textsize = 4, tip_length = 0.01),
  p.adjust.method = "bonferroni", # method for adjusting p-values for multiple comparisons
  # adding new components to `ggstatsplot` default
  # ggplot.component = list(ggplot2::scale_y_continuous(sec.axis = ggplot2::dup_axis())),
  # k = 3,
  title.prefix = "Region",
  palette = "default_jama",
  package = "ggsci",
  # plotgrid.args = list(nrow = 2),
  title.text = "Differences in listing prices for single/multiple hosts by different neighbourhoods", 
  output = "subtitle"
)

## $`Central Region`
## paste(italic("t")["Welch"], "(", "2384.01", ") = ", "0.54", ", ", 
##     italic("p"), " = ", "0.592", ", ", widehat(italic("g"))["Hedge"], 
##     " = ", "0.02", ", CI"["95%"], " [", "-0.05", ", ", "0.08", 
##     "]", ", ", italic("n")["obs"], " = ", 5422L)
## 
## $`East Region`
## paste(italic("t")["Welch"], "(", "241.52", ") = ", "-1.85", ", ", 
##     italic("p"), " = ", "0.065", ", ", widehat(italic("g"))["Hedge"], 
##     " = ", "-0.17", ", CI"["95%"], " [", "-0.36", ", ", "0.01", 
##     "]", ", ", italic("n")["obs"], " = ", 445L)
## 
## $`North-East Region`
## paste(italic("t")["Welch"], "(", "256.21", ") = ", "-1.89", ", ", 
##     italic("p"), " = ", "0.060", ", ", widehat(italic("g"))["Hedge"], 
##     " = ", "-0.21", ", CI"["95%"], " [", "-0.47", ", ", "0.01", 
##     "]", ", ", italic("n")["obs"], " = ", 279L)
## 
## $`North Region`
## paste(italic("t")["Welch"], "(", "163.94", ") = ", "0.27", ", ", 
##     italic("p"), " = ", "0.791", ", ", widehat(italic("g"))["Hedge"], 
##     " = ", "0.04", ", CI"["95%"], " [", "-0.24", ", ", "0.31", 
##     "]", ", ", italic("n")["obs"], " = ", 208L)
## 
## $`West Region`
## paste(italic("t")["Welch"], "(", "233.65", ") = ", "-1.29", ", ", 
##     italic("p"), " = ", "0.197", ", ", widehat(italic("g"))["Hedge"], 
##     " = ", "-0.12", ", CI"["95%"], " [", "-0.29", ", ", "0.06", 
##     "]", ", ", italic("n")["obs"], " = ", 518L)
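
Each block of the output above reports a Welch t statistic, its p-value and Hedges' g as the effect size. The following is a minimal sketch of how these quantities can be computed in base R on simulated prices (the group sizes, means and standard deviations are made up). The Hedges' g shown uses the common approximate small-sample correction, which may differ slightly from the value ggstatsplot reports.

```r
set.seed(1)
single   <- rnorm(40, mean = 100, sd = 20)  # simulated prices, single-listing hosts
multiple <- rnorm(60, mean = 105, sd = 25)  # simulated prices, multi-listing hosts

# Welch two-sample t-test (unequal variances), the test named in the output
welch <- t.test(multiple, single, var.equal = FALSE)

# Hedges' g: Cohen's d multiplied by a small-sample bias correction
n1 <- length(multiple)
n2 <- length(single)
pooled_sd <- sqrt(((n1 - 1) * var(multiple) + (n2 - 1) * var(single)) /
                    (n1 + n2 - 2))
cohens_d <- (mean(multiple) - mean(single)) / pooled_sd
hedges_g <- cohens_d * (1 - 3 / (4 * (n1 + n2 - 2) - 1))

c(t = unname(welch$statistic), p = welch$p.value, g = hedges_g)
```

An effect size close to zero with a confidence interval that spans zero, as in the regional results above, is consistent with failing to reject the null hypothesis.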

4.2.2 When did hosts join Airbnb?

# When did the hosts sign up?
host_byyear <- d_listings %>%
  dplyr::select(id, host_id, host_since) %>%
  group_by(year_joined = year(host_since)) %>%
  drop_na() %>%
  summarise(number_hosts = n()) %>%
  ungroup() %>%
  # year-on-year percentage change in the number of new hosts
  mutate(change = (number_hosts - lag(number_hosts)) / lag(number_hosts) * 100)

ggplot(host_byyear, aes(x=year_joined, y = number_hosts)) + geom_bar(stat = "identity", fill = "steelblue") + 
   labs(title = "New hosts by year", y = "No. of hosts joined", x = "year") +
   geom_text(aes(label = number_hosts), vjust = -0.3)

New hosts by year

The number of new hosts increased steadily from the start in 2010 and peaked in 2016, before dropping by 34.08% in 2017. This can be attributed to the legislation enacted in May 2017, under which HDB flats have a minimum rental period of 6 months and cannot be rented to tourists, while private residential properties have a minimum rental period of 3 months. Offenders were prosecuted and fined for illegal short term stays on Airbnb.

The number of new hosts joining has since held steady at around 800-900 per year, with a drop to 204 new hosts in 2020. This corresponds to the COVID-19 pandemic, during which there has been little to no travel from March 2020, especially after Singapore closed its borders.

4.2.3 Detailed Information on Hosts

The detailed listings table gives more information on the hosts, including the host join date, identity verification, and their response and acceptance rates. Note that the acceptance rate is tied to the host and not to the listing (e.g. a listing could have no reviews, but the host could have accepted guests on their other listings). Conversely, there are 198 listings whose acceptance rates are missing or 0%, but which still have reviews attributed to the listing. As the acceptance rate (as defined by Airbnb) reflects activity in the past 365 days, these are listings that have been inactive for the past year (from June 2019 - June 2020).

# Listings whose host acceptance rate is 0% or missing, sorted by number of reviews
d_listings %>% filter(., host_acceptance_rate == "0%" | is.na(host_acceptance_rate)) %>% dplyr::select(id, host_acceptance_rate, number_of_reviews) %>% arrange(desc(number_of_reviews))

## # A tibble: 198 x 3
##    id       host_acceptance_rate number_of_reviews
##    <chr>    <chr>                            <dbl>
##  1 17616042 0%                                 108
##  2 8399111  0%                                  55
##  3 32318920 <NA>                                47
##  4 14211027 0%                                  46
##  5 13325975 0%                                  37
##  6 4583694  0%                                  29
##  7 3980202  0%                                  28
##  8 395191   0%                                  27
...
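
The acceptance rate is stored as text (e.g. "0%"), which is why the filter above compares against a string. The following is a minimal sketch, on made-up values, of converting such a column to numbers so that it can also be compared or summarised numerically. The "N/A" entry is an assumption about how a missing rate might be recorded and is not taken from the data.

```r
# Hypothetical acceptance rates as they might appear in the raw data
rates <- c("0%", "95%", NA, "100%", "N/A")

# Strip the percent sign and convert; non-numeric entries become NA
rate_num <- suppressWarnings(as.numeric(sub("%", "", rates, fixed = TRUE)))

# Flag rows treated as inactive: missing or zero acceptance rate
inactive <- is.na(rate_num) | rate_num == 0

data.frame(rates, rate_num, inactive)
```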

4.3 Reviews

4.3.1 Listings with no reviews

There are 2835 listings without reviews. As there is no publicly available data on actual rentals / stays, reviews and acceptance rate are our proxy for inactive listings or listings that have not been rented out. We can also look into them to see their location and other factors (such as room type, price, host join date, etc) to see what could contribute to them not having stays.

no_reviews <- listings_sf %>% filter(is.na(last_review))
host_join <- d_listings %>% dplyr::select(id, host_since)
no_reviews_host <- left_join(no_reviews, host_join, by = c("id")) %>% group_by(year_joined = year(host_since)) %>% summarise(number_hosts = n()) %>% drop_na()  %>% ungroup()

no_reviews_host$year_joined = as.character(no_reviews_host$year_joined)
ggplot(no_reviews_host, aes(x=year_joined, y = number_hosts)) + geom_bar(stat = "identity", fill = "steelblue4") + 
   labs(title = "Listings with no reviews - Hosts by year", y = "No. of hosts joined", x = "year") + geom_text(aes(label = number_hosts), vjust = -0.3) + scale_x_discrete(breaks = no_reviews_host$year_joined)

Year joined by hosts for listings with no reviews

We would expect listings without reviews to be either newer listings (i.e. hosts who joined recently) or much older listings (hosts who joined more than 5 years ago), but at first glance the pattern follows that of the overall host joining dates.

Airbnb also tracks the number of reviews in the last twelve months (number_of_reviews_ltm). If this data exists at other points in time, we could build a picture of how many reviews (stays) a listing had in that timeframe.

4.3.2 Review scores

Review scores are provided in the detailed listings data. Examining the review scores, there is one overall rating (review_scores_rating) rated on a scale of 0 - 100, and scores on specific attributes such as accuracy of the listing, cleanliness of the property, communication of the host, ease of check-in, location and value of the listing rated on a scale of 0 - 10. The distribution of the overall rating scores and the attributes are shown in the histograms below. Rows with missing data were not counted.

Guests generally gave high overall ratings to the majority of properties, with 80% or more scoring in the top quartile. As the Airbnb platform is partly driven by user / host ratings, it is not surprising that reviews are generally positive. The bulk of the ratings corresponded to the most common listing types (entire homes/apts and private rooms).

While the distributions of the specific attributes were similar, there was greater variation in property cleanliness and the value of the listing.

# Select relevant data for review scores
review_scores <- d_listings %>% dplyr::select(id, host_id, number_of_reviews, room_type, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_communication, review_scores_checkin, review_scores_location, review_scores_value, reviews_per_month)

# Plot histogram of overall rating
ggplot(review_scores, aes(x=review_scores_rating)) + geom_histogram(aes(fill = room_type), bins = 20) + stat_bin(aes(label = ..count..), bins = 20, size = 3, geom= "text", vjust = -1)

Distribution of overall review score

# Plot histogram of attribute scores
ggplot(gather(review_scores[, -c(1:5,12)], cols, value), aes(x = value)) + 
       geom_histogram(binwidth = 1) + facet_grid(.~cols)

Distribution of individual attribute scores

5 Spatial Distribution of Airbnb Listings in Singapore

5.1 Handling Spatial Data Outliers

Examining the data from the listings table, we see that there are listings that lie within the Central Water Catchment area, which is a non-residential area (consisting of parks and reservoirs), so we need to check that they are attributed correctly. There are 28 listings in the Central Water Catchment. Exploring the listings data further, we see from some of the names that they are located wrongly (e.g. "3 mins from Jurong East MRT", which is nowhere near the catchment area). However, as it is not possible to identify exactly where these listings are located, we drop them from our dataset. Similarly, there are listings within the Mandai, Sungei Kadut and Western Water Catchment neighbourhoods, which are zoned for industrial use and do not have any residential properties. We also drop the listings in these 3 neighbourhoods.

Alternatively, if we want to include these points, we can try to identify a proxy location from the description, such as using the Jurong East MRT station coordinates for the listing mentioned above. This is only practical when there are a small number of listings; it would not be feasible or reproducible for larger data sets.
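
As a rough illustration of the proxy-location idea, the following sketch matches a landmark keyword in the listing name and assigns that landmark's coordinates. Everything here is hypothetical: the toy listings, the `landmarks` lookup table and the coordinates, which are approximate and given for illustration only. This approach was not applied in this report.

```r
# Toy listings whose true location is unknown (names are illustrative)
toy_listings <- data.frame(
  id   = c("a1", "a2", "a3"),
  name = c("3 mins from Jurong East MRT", "Cosy room near park",
           "Studio, Jurong East"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup of landmark keywords to approximate coordinates
landmarks <- data.frame(
  keyword   = "Jurong East",
  latitude  = 1.333,    # approximate, for illustration only
  longitude = 103.742,  # approximate, for illustration only
  stringsAsFactors = FALSE
)

# Assign the landmark coordinates to every listing whose name mentions it
toy_listings$proxy_lat <- NA_real_
toy_listings$proxy_lon <- NA_real_
for (i in seq_len(nrow(landmarks))) {
  hit <- grepl(landmarks$keyword[i], toy_listings$name, ignore.case = TRUE)
  toy_listings$proxy_lat[hit] <- landmarks$latitude[i]
  toy_listings$proxy_lon[hit] <- landmarks$longitude[i]
}

toy_listings
```

Listings with no recognisable landmark keep missing coordinates and would still need to be dropped or handled manually.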

# Identify listings that fall within the Central Water Catchment area
filter(listings, neighbourhood == "Central Water Catchment")

## # A tibble: 28 x 17
##    id     name        host_id host_name neighbourhood_g~ neighbourhood  latitude
##    <chr>  <chr>       <chr>   <chr>     <chr>            <chr>             <dbl>
##  1 51848~ Little Hab~ 2927830 Shwu She~ North Region     Central Water~     1.35
##  2 28139~ comtempora~ 147591~ Roland    North Region     Central Water~     1.35
##  3 28967~ Lem's crib  190297~ Lemuel    North Region     Central Water~     1.35
##  4 29148~ Home with ~ 764645~ D-Jin     North Region     Central Water~     1.35
##  5 33621~ LUXURY AT ~ 156409~ Rj        North Region     Central Water~     1.35
##  6 33755~ ☀BEAUTIFUL~ 219550~ Ray       North Region     Central Water~     1.35
##  7 33883~ ☀ HUGE & M~ 219550~ Ray       North Region     Central Water~     1.35
##  8 34157~ 1br cosy a~ 664061~ Jay       North Region     Central Water~     1.35
##  9 34188~ Gem of the~ 664061~ Jay       North Region     Central Water~     1.35
## 10 34188~ 1br cosy a~ 664061~ Jay       North Region     Central Water~     1.35
## # ... with 18 more rows, and 10 more variables: longitude <dbl>,
## #   room_type <chr>, price <dbl>, minimum_nights <dbl>,
## #   number_of_reviews <dbl>, last_review <date>, reviews_per_month <dbl>,
## #   calculated_host_listings_count <dbl>, availability_365 <dbl>,
## #   host_type <chr>

# Remove listings in the Central Water Catchment, Sungei Kadut, Mandai and Western Water Catchment areas
listings_clean <- filter(listings_sf, !neighbourhood %in% c("Central Water Catchment", "Sungei Kadut", "Mandai", "Western Water Catchment")) %>% st_as_sf()

5.2 Mapping Airbnb listings in Singapore

We use tmap as our main package for generating maps. Similar to ggplot2, its syntax follows the grammar of graphics, and it is compatible with sf, leaflet and other spatial wrangling packages. Maps can be viewed interactively with tiles from map providers such as OpenStreetMap or ESRI, or plotted as static images for the purposes of a report. We also use the read_osm() function from tmaptools (which relies on the OpenStreetMap package) to load the background map as a raster in tmap’s plot mode.

5.2.1 Loading basemap raster with bounding box

# Read in OSM raster of listings data for plot view and create bounding box 
sg_osm <- tmaptools::read_osm(listings_clean, ext=1.3)
bb_sg_osm <- st_bbox(listings_clean, crs = 3414)

5.2.2 Listings by room types and neighbourhoods

The listings are mapped below by room types and neighbourhoods. The listings are clearly clustered around the Central Business District (CBD) and main shopping district (Orchard). There are also listings in suburban parts of Singapore which, while less expected, could be explained by being closer either to ‘desirable’ neighbourhoods such as East Coast, or to specific industrial areas such as Pioneer, Sembawang, Changi Business Park / Loyang.

Entire home/apartment listings are mainly in the Kallang, Rochor and Novena areas, which are close to the CBD and Orchard. There are a fair number of listings in the other neighbourhoods, and they appear to be in clusters, which we will examine using Spatial Point Pattern analysis in the next part of the Project.

Private room listings also show high concentration in the same neighbourhoods as entire home / apartment listings, but are also prevalent in neighbourhoods like Outram, Bedok and Rochor. They also appear to be more spread out within the neighbourhoods.

Hotel room listings, as mentioned above, are mainly found in the neighbourhoods of Outram, Kallang and Singapore River, whilst shared room listings are concentrated in the Kallang and Rochor areas, with the remaining listings sparsely spread across the other neighbourhoods.

# Plotting neighbourhood listings on tmap
tmap_mode("view")

# Plotting points
tm_basemap(leaflet::providers$OpenStreetMap) +
# The commented out code is for plot mode (report)
# tm_shape(sg_osm, bbox=bb_sg_osm) +
#   tm_rgb() +
tm_shape(nhood_map_sf) +
  tm_polygons(alpha = 0.3) +
tm_shape(listings_clean) +
  tm_symbols(col="room_type", size = 0.2) +
  tm_view(set.zoom.limits = c(11, 17)) +
  tm_facets(by="room_type") +
  tm_layout(legend.show = F)

5.2.3 Rental prices by room type

# Removing price outliers from the sf listings
listings_sf_price <- listings_clean %>% filter(price <= outlier_price)

# Plotting points
# tmap_mode("plot")

tm_basemap(leaflet::providers$OpenStreetMap) +
tm_shape(sg_osm, bbox=bb_sg_osm) +
  tm_rgb() +
tm_shape(nhood_map_sf) +
  tm_polygons(alpha = 0.3) +
  tm_shape(listings_sf_price) +
  tm_symbols(col = "price", size = 0.2, palette = "YlOrBr", legend.hist = TRUE) +
  # tm_view(set.zoom.limits = c(11, 18)) +
  tm_facets(by="room_type") +
  tm_layout(legend.outside = TRUE, legend.outside.position = "bottom", legend.stack = "horizontal", legend.hist.height = 1, legend.hist.width = 0.85, legend.outside.size=0.1)

The map above shows the different range of listing rental prices by room type.

Most of the neighbourhoods have listings within the price range of $0 - $200 per night. We can see some outliers in neighbourhoods like Queenstown, Clementi and Woodlands for entire homes. Private room listings are more homogeneous across the neighbourhoods. However, there are also high priced private room listings outside the central neighbourhoods, e.g. in Tampines, Pasir Ris and Hougang.

We would need to examine the denser neighbourhoods for entire homes/apts and private rooms.

5.2.4 Rental prices by room type and host type

# Facet point symbol map showing rental price for hosts with single and multiple listings
listings_sf_price <- listings_sf_price %>% mutate(host_type = ifelse(calculated_host_listings_count ==1, "Single", "Multiple"))

tm_basemap(leaflet::providers$OpenStreetMap) +
tm_shape(sg_osm, bbox=bb_sg_osm) +
  tm_rgb() +
tm_shape(nhood_map_sf) +
  tm_polygons(alpha = 0.3) +
  tm_shape(listings_sf_price) +
  tm_symbols(col = "host_type", shape = "price", size = 0.2, title.col = "Host Type", title.shape = "Price") +
  tm_facets(by="room_type")+
  tm_layout(legend.outside = TRUE, legend.outside.position = "bottom", legend.stack = "horizontal", legend.outside.size=0.1)

The map above shows the listing rental prices as well as whether the host of the listing has one listing (Single) or more than one listing (Multiple). Hotel rooms are largely listed by hosts with multiple listings, which is expected as hotel operators would have multiple rooms and/or locations available. Entire homes are dominated by multiple-host listings, especially in the central neighbourhoods, whereas the ratio of multiple-host listings to single-host listings is more even for private rooms (2:1) and shared rooms (3:1).
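
Ratios such as these can be read off a cross-tabulation of room type against host type. The following is a minimal sketch on a toy data frame; the rows are made up so that the ratios come out at 2:1 and 3:1 and do not represent the actual listing counts.

```r
# Toy data: room type and host type for a handful of listings
toy <- data.frame(
  room_type = c(rep("Private room", 6), rep("Shared room", 4)),
  host_type = c("Multiple", "Multiple", "Multiple", "Multiple",
                "Single", "Single",
                "Multiple", "Multiple", "Multiple", "Single")
)

# Count listings by room type (rows) and host type (columns)
counts <- table(toy$room_type, toy$host_type)

# Multiple-host listings per single-host listing, for each room type
ratio <- counts[, "Multiple"] / counts[, "Single"]
ratio
```

On the real data the same two lines could be applied to the room_type and host_type columns of the listings table, assuming every room type has at least one single-host listing so that the division is defined.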

5.2.5 Choropleth map of listings and median price

# Create a summary of the cleaned price listings by neighbourhood and room type (number of listings and median price)
listings_cleanprice_sum <- st_drop_geometry(listings_sf_price) %>% group_by(neighbourhood, room_type) %>%
                            summarise(num_listings = n(),
                                      med_price = median(price)) %>%
                            arrange(desc(num_listings))

# Join neighbourhood mapping and summary dataframe
listings_join <- left_join(nhood_map_sf, listings_cleanprice_sum, by = c("neighbourhood"))

The following shows the median price and number of listings for each neighbourhood.

Entire home/apt:
- High median prices, as expected, in central and premium neighbourhoods such as Southern Islands, Orchard, Bukit Timah, Tanglin, Singapore River and Rochor. Potential outliers are Clementi and Choa Chu Kang.
- The highest densities of listings are in Geylang, Kallang, Novena, Downtown Core, Rochor and River Valley. These correspond to city fringe areas with accessible transport.

Private rooms:
- High median prices in Marina South and Southern Islands, which correspond to the higher value of properties there; Pioneer, on the other hand, is not a premium area and its high median price is unexpected.
- The highest densities of listings are in Kallang, Geylang and Bedok. While Bedok is not considered city fringe, its proximity to leisure activities in the East Coast and to the airport may explain the number of listings.

Hotel rooms:
- The highest median prices are in the shopping district of Orchard and the adjacent neighbourhood of Novena.
- The highest density of listings is in Outram, which is on the fringe of tourist hotspots (Chinatown, Tiong Bahru) and the CBD.

Shared rooms:
- The price and density of shared rooms are fairly homogeneous.

We will examine the density of listings further using Kernel Density Estimation in the next section.

tmap_mode("plot")
tmap_arrange(
tm_basemap(leaflet::providers$OpenStreetMap) +
tm_shape(sg_osm, bbox=bb_sg_osm) +
  tm_rgb() +
  tm_shape(listings_join) +
   tm_polygons("med_price", title = "Median Price") +
   tm_view(set.zoom.limits = c(10, 18)) + 
   tm_facets(by="room_type", drop.NA.facets = T) +
   tm_layout(legend.outside = TRUE, legend.outside.position = "bottom", legend.outside.size = 0.2),
tm_basemap(leaflet::providers$OpenStreetMap) +
tm_shape(sg_osm, bbox=bb_sg_osm) +
  tm_rgb() +
   tm_shape(listings_join) +
   tm_polygons("num_listings", title = "No. of listings", palette = "Blues", alpha = 0.6) +
   tm_view(set.zoom.limits = c(10, 18)) + 
   tm_facets(by="room_type", drop.NA.facets = T) +
   tm_layout(legend.outside = TRUE, legend.outside.position = "bottom", legend.outside.size = 0.2)
)

Choropleth map of median listing price by room type

# Saving data to be loaded for Spatial EDA 
# listings, neighbourhood maps in OGR and sf
basedata <- c("listings_sf", "listings_clean", "neighbourhoods", "nhood_map_sf")
save(listings_sf, listings_clean, neighbourhoods, nhood_map_sf, file = "basedata.RData")
save(listings_clean, d_listings, neighbourhoods, nhood_map_sf, file = "basedataGWR.RData")

Geospatial Analysis of Airbnb in Singapore

Section 1: Exploratory Data Analysis

Clara Chua

9/29/2020 (updated: 2021-06-09)