Airbnb, a peer-to-peer sharing platform that enables the short-term rental of private rooms or homes by individuals to potential guests, is increasingly popular with tourists. As of 30 September 2020, Airbnb has operations in 100,000 cities with over 4 million hosts providing 5.6 million active listings and 800 million guest arrivals since its launch (Airbnb, 2021).
Airbnb launched its Asia Pacific headquarters in Singapore in November 2012, but has had hosts in Singapore since as early as 2009. Local Airbnb stays are regulated by the Urban Redevelopment Authority (URA) and the Housing & Development Board (HDB). The authorities consulted the public and key stakeholders from 2015 to explore a regulatory framework for short-term accommodation (Channel News Asia, 2018), but maintained the regulatory status quo in May 2019 (Co, 2019). The minimum stay is three months for private property and six months for HDB flats. Strict penalties are enforced, including fines of up to $200,000 for first-time offenders, with additional fines and a possible jail term for repeat offenders.
There has been an increasing number of studies on the effects of Airbnb in various contexts, from the impact on the hotel industry (Zervas et al. 2014, Gutierrez et al. 2017, Dogru et al. 2019, Dogru et al. 2020, Blal et al. 2018, Heo et al. 2019), to housing markets (Barron et al. 2018, Horn & Merante 2017, Yrigoy 2018, Ayouba et al. 2019), to whether and how Airbnb should be regulated (Kaplan & Nadler 2015, DiNatale et al. 2018, Wegmann & Jiao 2017), to spatial studies, which are discussed in greater detail in the literature review section below.
This project explores Airbnb in the Singapore market to understand the spatial distribution of Airbnb accommodation and how it correlates with various spatial factors. Further analysis could also inform future policy and regulatory frameworks on short-term accommodation.
The project will look at geospatial analysis to explore and explain the data around Airbnb rentals in Singapore. Namely the project aims to:
The report covers the following 5 sections. First, we review spatial studies involving the spatial distribution of Airbnb listings and its impact, geographically weighted regression (GWR) models for Airbnb pricing, and Asia-based studies of Airbnb. Second, we explore the data available from InsideAirbnb. Third, we conduct exploratory data analysis (EDA) to look for potential insights and hypotheses about Airbnb in Singapore. Fourth, we conduct exploratory spatial analysis on Airbnb listings in Singapore, including Spatial Point Pattern Analysis. Finally, the report concludes with a brief discussion of the implications of the findings for the next phase of the project: developing a hedonic (GWR) pricing model of Airbnb listings.
Spatial Studies
There have been studies that use spatial analysis to explore the distribution and impact of Airbnb on various factors. A large number of studies are centered in European cities (e.g. Barcelona, London), or in the USA (e.g. New York City, San Francisco) and many studies show that Airbnb listings are concentrated around tourist or leisure areas.
In Barcelona, Gutierrez et al. (2017) found that Airbnb listings are concentrated in the city centre, but cover a slightly wider area than that of hotels. Bivariate spatial autocorrelation analysis showed a close association between Airbnb listings and hotels, with proximity to leisure and tourism activities explaining Airbnb location patterns, whereas hotels were slightly more widespread.
Adamiak et al. (2016) looked at the spatial concentration and autocorrelation of the density of Airbnb listings for the whole of Spain, with a focus on the impact on tourism. They found that Airbnb listings concentrate in large cities and areas with high tourism and leisure activities such as the coastal areas, national parks and mountain tourist areas. Entire homes or apartments dominated the listings in touristic and leisure areas (~90% of all listings in the area), and the listings were positively correlated with coastal areas, a high number of non-primary accommodation (such as a second holiday home) and hotel supply. In such cases, Airbnb helps people to commercialise holiday homes or apartments already used for tourism purposes. The authors suggested that Airbnb encourages the growth of tourist accommodation stock in touristic hotspots, is supplementary to hotel supply, and could open new opportunities for tourism.
Quattrone et al. (2018) looked at geographic, social and economic variables to try to explain the spatial penetration of Airbnb in 8 US cities at the census tract level. Their results for the geographic variables show that distance from the city centre was negatively related to Airbnb offerings in 5 out of 8 cities. The attractiveness of an area (number of points of interest within the census tract) was positively correlated with the number of listings in 5 out of 8 cities, i.e. Airbnb listings are predominantly located in more touristic areas. The number of bus stops per tract, which measured the strength of an area’s infrastructure and transport links, was also positively correlated with Airbnb listings in 3 cities (Austin, Oakland, San Francisco). The authors also compared the study to a similar study for London (Quattrone et al., 2016) and found similarities in the geographic results in terms of distance to centre, tourism factor and hotel presence (no relationship between hotel presence and Airbnb adoption). The comparison of social and economic indices found that Airbnb listings are correlated with areas scoring high on the young, bohemian and talent indices, while the correlation with the racial index differed across cities.
Another study by Lagonigro, Martori, & Apparicio (2020) analysed the factors affecting the spatial distribution of Airbnb listings in Barcelona in relation to population and tourism indicators, using a Geographically Weighted Regression (GWR) model. They found that in neighbourhoods with medium-low family incomes, poverty was positively correlated with Airbnb ratios, whereas neighbourhoods with higher incomes attracted more Airbnb accommodation. Their study also uncovered how Airbnb contributed to the gentrification of some neighbourhoods by moving housing from the residential stock to short-term rentals.
Other studies have also explored spatial characteristics of Airbnb accommodation (Zhang and Chen, 2019), or used GWR techniques to model variation in hotel room prices (Zhang et al., 2011), tourism and rural poverty rates (Deller, 2010), or the housing market (Bitter, Mulligan, & Dall’erba, 2007).
Asian studies
In Singapore and Asia, the majority of Airbnb studies have focused on the user experience (from either the guest’s or the host’s point of view) and the disruptive impact Airbnb has on the tourism industry and hotel revenues; there have been no studies that look in depth at the spatial analysis of Airbnb in Singapore or Asia.
Choi et al. (2015) looked at the impact of Airbnb on hotel revenues across different cities in Korea and found that at the national level, Airbnb accommodation did not affect hotel revenue, but there were slight variations between cities. Airbnb had a slight negative effect on budget hotels in Seoul, whereas in Busan there was a negative effect on upscale hotels and a positive effect on midscale hotels, but the magnitude of those effects was very small.
Kiatkawsin, Sutherland and Kim (2020) conducted a text analysis of Airbnb reviews in Hong Kong and Singapore using Latent Dirichlet Allocation (LDA) to extract topics from the data. There were 12 topics in the Hong Kong reviews and 5 topics in the Singapore reviews. Topics were related to established hotel attributes (e.g. unit or room amenities, location), but also included host and listing management, which are topics unique to Airbnb listings. Their results show that hosts need to focus on delivering quality service for the entire ‘transaction’, from pre-trip to post-trip, and ensure that their listings are accurate and comprehensive for better guest satisfaction.
Koh and King (2017) conducted a qualitative assessment of the impact of Airbnb on Singapore’s budget hotels, based on interviews with key stakeholders from budget hotels and hostels. While there were growing concerns that Airbnb may prove to be competition down the road, the stakeholders did not consider Airbnb rentals an immediate threat at that point in time.
The Development Bank of Singapore published a briefing on the rise of home sharing platforms (Yong & Tan 2019) and a case study on Airbnb. However, this study had a broader focus on the entire market across Asia, and while it discussed the impact of Airbnb on hotel prices, it did not look in depth into any spatial analysis.
As such, this study aims to close the gap by examining the spatial densities of Airbnb in Singapore and, similar to the Barcelona study, determining the factors that affect the spatial distribution of Airbnb listings in Singapore using a Geographically Weighted Regression (GWR) model. In addition, we will look at hotels in Singapore and determine whether their location affects Airbnb listings as well. This will help determine whether Airbnb is in competition with hotels or whether they are complementary, as Airbnb claims.
The following code chunk loads the packages required for the Exploratory Data Analysis; it will also install the packages if they have not been installed. The following table shows the different packages used in this study:
| Type | Package | Usage |
|---|---|---|
| Data Exploration | tidyverse | Data manipulation & wrangling |
| Data Exploration | lubridate | Manipulating date-time data |
| Data Exploration | knitr | Knit R Markdown document, with code to show specific lines of output for the purpose of this report |
| Data Exploration | funModeling | Provides functions to help in exploratory data analysis, data preparation and model performance |
| Spatial Data | sf (Simple Features) | Read and manipulate spatial data for analysis |
| Spatial Data | tmap | Graphing and mapping spatial data |
| Spatial Data | leaflet | Graphing and mapping spatial data |
| Spatial Data | gridExtra | Customise display of graphs and plots (in a grid format) |
| Spatial Data | OpenStreetMap | Accesses high resolution raster maps using the OpenStreetMap protocol. This provides a basemap when tmap is set to ‘plot’ mode |
| Spatial Data | rgdal | Provides access to projection/transformation operations and importing of raster / vector data |
| Spatial Data | maptools | Manipulating geographic data |
| Spatial Data | raster | Manipulating raster data |
| Spatial Data | spatstat | Statistical analysis of Spatial Point Patterns |
| Spatial Data | tmaptools | Reading and mapping spatial data |
# Loading in required packages (installing any that are missing first)
packages <- c('tidyverse', 'funModeling', 'ggstatsplot', 'statsExpressions', 'lubridate', 'knitr', 'rgdal', 'spatstat', 'maptools', 'sf', 'tmap', 'tmaptools', 'leaflet', 'raster', 'gridExtra', 'OpenStreetMap')
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
InsideAirbnb is an independent, non-commercial site that provides publicly available information about a city’s Airbnb listings. Started by Murray Cox, data is provided for over 90 cities, by scraping and compiling publicly available information from the Airbnb website at regular intervals. The intention behind InsideAirbnb is to enable data exploration into how Airbnb impacts community housing issues and residential housing markets in various cities around the world.
The data has been famously used to uncover misrepresentations from Airbnb that their hosts only occasionally rent the homes in which they live. In 2016, Murray Cox, together with Tom Slee, reported that before releasing data on its New York City listings, Airbnb had removed over 1,000 entire home listings that violated New York City’s multiple dwelling law (i.e. hosts with multiple listings). The law states that an apartment in a building with 3 or more units cannot be rented out for under 30 days unless there’s a permanent occupant present.
Note that information such as actual stays (e.g. number of days) and actual rental income per host is not publicly available.
InsideAirbnb provides a snapshot of the following information:
For Singapore, InsideAirbnb has periodic snapshots from 18 March 2019 to 26 October 2020. Data was downloaded from InsideAirbnb on 29 September 2020 for this project - the dataset downloaded was compiled on 22 June 2020.
# Loading the Data
listings <- read_csv("data/listings.csv")
d_listings <- read_csv("data/detailedlistings.csv")
## Warning: 5 parsing failures.
## row col expected actual file
## 3083 license 1/0/T/F/TRUE/FALSE 201117828H 'data/detailedlistings.csv'
## 4215 license 1/0/T/F/TRUE/FALSE 201537598E 'data/detailedlistings.csv'
## 4684 license 1/0/T/F/TRUE/FALSE 201202564R 'data/detailedlistings.csv'
## 5668 license 1/0/T/F/TRUE/FALSE 201537598E 'data/detailedlistings.csv'
## 5674 license 1/0/T/F/TRUE/FALSE 201537598E 'data/detailedlistings.csv'
calendar <- read_csv("data/calendar.csv")
reviews <- read_csv("data/reviews.csv")
d_reviews <- read_csv("data/detailedreviews.csv")
neighbourhoods <- read_csv("data/neighbourhoods.csv")
We see 5 parsing failures in the license column, where a logical (TRUE/FALSE) value was expected but the data contained character strings. These are actually the business registration numbers for Singapore companies, and we can change these to a True or Yes in the column when looking at the data. We are not using this column for analysis at the moment.
Handling spatial data
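One way to avoid this kind of failure is to declare the column type up front instead of letting readr guess it from the leading rows, which here were presumably all empty. The sketch below illustrates the idea on a small temporary file; the file contents and the object name demo are hypothetical stand-ins for data/detailedlistings.csv, not the project data.

```r
library(readr)

# Hypothetical two-row file standing in for data/detailedlistings.csv
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,license", "1,", "2,201117828H"), tmp)

# Declare license as character; all other columns are still guessed
demo <- read_csv(tmp, col_types = cols(license = col_character()))

# The registration number is kept as text instead of failing to parse
demo$license
```

With the column read as character, the registration numbers are preserved and could later be recoded (for example to a TRUE/FALSE flag) if the column were needed.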
The two popular packages for handling geographical data in R are sp, released in 2005, and sf (simple features), released in 2016. They standardize how spatial data (points, lines, polygons, grids) are treated and operated on in R. However, the packages read and store geographical data differently:
sp uses an S4 class object with slots to build a spatial object. It has 2 pre-defined slots:
- bounding box: a box that provides the boundaries or window for the object
- crs: the Coordinate Reference System, which tells R how the object's coordinates relate to locations on the Earth's surface (and, for a projected CRS, how the curved surface is flattened onto a 2D plane)
One slot is for the geometric object (points, lines, polygons) and is either a matrix of coordinates or a list of lines or polygons objects.
The last slot is for attributes associated with the geometric object, and this transforms a Spatial object into a Spatial DataFrame object. There are different objects for points, lines and polygons (e.g. SpatialPoints, SpatialLines and SpatialPolygons objects, and SpatialPointsDataFrame and SpatialLinesDataFrame objects).
sf stores spatial objects as a dataframe with a special column named geometry that contains the spatial information. This geometry column contains the simple features collection (sfc), which includes:
- geometric objects (points, lines, polygons), each stored as a simple feature geometry (sfg) object
- bounding box
- crs (epsg or proj4string)
The other columns of the dataframe generally represent the attributes of the data (e.g. place names, roads, elevation, temperature, etc). We can conceive of a sf object as a dataframe with a spatial extension.
sf is useful when reading in larger dataframes (faster read/write) and provides a simpler interface in its usage.
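To make the difference between the two representations concrete, the following minimal sketch builds a small sf object and converts it to its sp equivalent and back. The two points and all object names (pts_df, pts_sf, pts_sp, pts_back) are made up for illustration, and the conversion assumes the sp package is installed alongside sf.

```r
library(sf)

# Two hypothetical points in Singapore (longitude/latitude, WGS 84)
pts_df <- data.frame(name = c("A", "B"),
                     lon = c(103.80, 103.85),
                     lat = c(1.30, 1.35))

# sf: an ordinary data frame plus a geometry column
pts_sf <- st_as_sf(pts_df, coords = c("lon", "lat"), crs = 4326)
class(pts_sf)       # "sf" "data.frame"
st_bbox(pts_sf)     # bounding box stored with the geometry column
st_crs(pts_sf)      # CRS stored with the geometry column

# sp: an S4 object whose parts live in slots
pts_sp <- as(pts_sf, "Spatial")
slotNames(pts_sp)   # lists the slots of the S4 object

# Convert back to sf when needed
pts_back <- st_as_sf(pts_sp)
```

Converting between the two is mainly useful when a downstream package only accepts one of the formats.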
Reading in Spatial Data
We use st_read() from the sf package to read the neighbourhood GeoJSON file, specifying the data source name (dsn) and the layer name. This dispenses with the need to call the rgdal package directly. The GeoJSON file is read in geographic coordinates (WGS 84), so we use st_transform() to reproject the neighbourhood polygons to the SVY21 projected CRS (EPSG:3414), so that all spatial layers appear in the same projected space.
# reading in the neighbourhood geojson file
nhood_map_sf <- st_read(dsn = "data/neighbourhoods.geojson",
                        layer = "neighbourhoods") %>%
  st_transform(crs = 3414)
## Reading layer `neighbourhoods' from data source `C:\Users\clarachua\Documents\2. Capstone Project\capstone\data\neighbourhoods.geojson' using driver `GeoJSON'
## Simple feature collection with 55 features and 2 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 103.6054 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
## geographic CRS: WGS 84
We can see that the ‘neighbourhoods’ layer is a simple feature collection of 55 multipolygon features and 2 attributes (neighbourhood name and neighbourhood group corresponding to the multipolygon geometry). The printed summary describes the file as read, in the geographic WGS 84 CRS. After st_transform(), the data is in SVY21 (EPSG:3414), a projected CRS designed for Singapore with coordinates in metres, which is better suited to local distance and area calculations than global WGS 84 longitude/latitude. The ‘neighbourhoods’ data also has its own bounding box.
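The same reprojection step applies to point data. As a minimal sketch, assuming point coordinates are supplied as WGS 84 longitude/latitude, the example below turns two illustrative coordinate pairs into an sf object and reprojects them to SVY21. The data frame and object names (listings_demo, listings_sf, listings_svy21) are hypothetical and not part of the project code.

```r
library(sf)

# Two illustrative coordinate pairs (longitude/latitude in WGS 84)
listings_demo <- data.frame(id = c("a", "b"),
                            longitude = c(103.7958, 103.7852),
                            latitude = c(1.44255, 1.33235))

# Declare the CRS the coordinates are already in (EPSG:4326)...
listings_sf <- st_as_sf(listings_demo,
                        coords = c("longitude", "latitude"),
                        crs = 4326)

# ...then reproject to SVY21 (EPSG:3414)
listings_svy21 <- st_transform(listings_sf, crs = 3414)

st_crs(listings_svy21)$epsg     # EPSG code of the new CRS
st_is_longlat(listings_svy21)   # FALSE once the data is projected
```

Note the distinction between declaring a CRS (the crs argument of st_as_sf(), or st_set_crs()) and converting between CRSs (st_transform()); the latter only works once the former is known.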
glimpse() is used to take a first look at the listings data.
# Reviewing the listings data
glimpse(listings)
## Rows: 7,323
## Columns: 16
## $ id <dbl> 49091, 50646, 56334, 71609, 71896, 7190~
## $ name <chr> "COZICOMFORT LONG TERM STAY ROOM 2", "P~
## $ host_id <dbl> 266763, 227796, 266763, 367042, 367042,~
## $ host_name <chr> "Francesca", "Sujatha", "Francesca", "B~
## $ neighbourhood_group <chr> "North Region", "Central Region", "Nort~
## $ neighbourhood <chr> "Woodlands", "Bukit Timah", "Woodlands"~
## $ latitude <dbl> 1.44255, 1.33235, 1.44246, 1.34541, 1.3~
## $ longitude <dbl> 103.7958, 103.7852, 103.7967, 103.9571,~
## $ room_type <chr> "Private room", "Private room", "Privat~
## $ price <dbl> 84, 80, 70, 167, 95, 84, 209, 52, 54, 4~
## $ minimum_nights <dbl> 180, 90, 6, 90, 90, 90, 1, 90, 90, 14, ~
## $ number_of_reviews <dbl> 1, 18, 20, 20, 24, 48, 29, 176, 199, 20~
## $ last_review <date> 2013-10-21, 2014-12-26, 2015-10-01, 20~
## $ reviews_per_month <dbl> 0.01, 0.24, 0.18, 0.19, 0.22, 0.43, 0.2~
## $ calculated_host_listings_count <dbl> 2, 1, 2, 8, 8, 8, 8, 3, 3, 4, 4, 7, 1, ~
## $ availability_365 <dbl> 365, 365, 365, 365, 365, 365, 180, 356,~
This table provides basic information about the 7,323 listings that were available as at the date compiled (22 June 2020). We can see the unique listing id, the name of the listing, the host name and id, and the neighbourhood the listing is in, together with its coordinates. Note however that Airbnb randomizes the listing location by about 150 m, so there may be a slight difference between the actual coordinates and those stated in the table. The table also provides information about the room type, the price of the listing, the minimum nights and some review statistics for the listing (number of reviews, date of the last review, and number of reviews per month). There is also some information about how many listings the host has in total, and how many days in a year the listing is available (availability_365). The data types are correct except for id and host_id, which are read as numeric; we can convert them to character or categorical.
glimpse(reviews)
## Rows: 91,250
## Columns: 2
## $ listing_id <dbl> 49091, 50646, 50646, 50646, 50646, 50646, 50646, 50646, 506~
## $ date <date> 2013-10-21, 2014-04-18, 2014-06-05, 2014-07-02, 2014-07-08~
glimpse(d_reviews)
## Rows: 91,250
## Columns: 6
## $ listing_id <dbl> 49091, 50646, 50646, 50646, 50646, 50646, 50646, 50646, ~
## $ id <dbl> 8243238, 11909864, 13823948, 15117222, 15426462, 1555291~
## $ date <date> 2013-10-21, 2014-04-18, 2014-06-05, 2014-07-02, 2014-07~
## $ reviewer_id <dbl> 8557223, 1356099, 15222393, 5543172, 817532, 10942382, 1~
## $ reviewer_name <chr> "Jared", "James", "Welli", "Cyril", "Jake", "Subba", "Cl~
## $ comments <chr> "Fran was absolutely gracious and welcoming. Made my sta~
Similarly, we glimpse the data for the other data imports:
glimpse(calendar)
## Rows: 2,673,655
## Columns: 7
## $ listing_id <dbl> 819034, 2362558, 2362558, 2362558, 2362558, 2362558, 23~
## $ date <date> 2020-06-22, 2020-06-23, 2020-06-24, 2020-06-25, 2020-0~
## $ available <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,~
## $ price <chr> "$350.00", "$362.00", "$362.00", "$362.00", "$362.00", ~
## $ adjusted_price <chr> "$350.00", "$344.00", "$344.00", "$344.00", "$344.00", ~
## $ minimum_nights <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ maximum_nights <dbl> 30, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28,~
glimpse(d_listings)
## Rows: 7,323
## Columns: 106
## $ id <dbl> 49091, 50646, 56334, 7160~
## $ listing_url <chr> "https://www.airbnb.com/r~
## $ scrape_id <dbl> 2.020062e+13, 2.020062e+1~
## $ last_scraped <date> 2020-06-22, 2020-06-22, ~
## $ name <chr> "COZICOMFORT LONG TERM ST~
## $ summary <chr> NA, "Fully furnished bedr~
## $ space <chr> "This is Room No. 2.(avai~
## $ description <chr> "This is Room No. 2.(avai~
## $ experiences_offered <chr> "none", "none", "none", "~
## $ neighborhood_overview <chr> NA, "The serenity & quiet~
## $ notes <chr> NA, "Accommodation has a ~
## $ transit <chr> NA, "Less than 400m from ~
## $ access <chr> NA, "Kitchen, washing fac~
## $ interaction <chr> NA, "We love to host peop~
## $ house_rules <chr> "No smoking indoors. Plea~
## $ thumbnail_url <lgl> NA, NA, NA, NA, NA, NA, N~
## $ medium_url <lgl> NA, NA, NA, NA, NA, NA, N~
## $ picture_url <chr> "https://a0.muscache.com/~
## $ xl_picture_url <lgl> NA, NA, NA, NA, NA, NA, N~
## $ host_id <dbl> 266763, 227796, 266763, 3~
## $ host_url <chr> "https://www.airbnb.com/u~
## $ host_name <chr> "Francesca", "Sujatha", "~
## $ host_since <date> 2010-10-20, 2010-09-08, ~
## $ host_location <chr> "singapore", "Singapore, ~
## $ host_about <chr> "I am a private tutor by ~
## $ host_response_time <chr> "within an hour", "N/A", ~
## $ host_response_rate <chr> "100%", "N/A", "100%", "1~
## $ host_acceptance_rate <chr> "N/A", "N/A", "N/A", "100~
## $ host_is_superhost <lgl> FALSE, FALSE, FALSE, FALS~
## $ host_thumbnail_url <chr> "https://a0.muscache.com/~
## $ host_picture_url <chr> "https://a0.muscache.com/~
## $ host_neighbourhood <chr> "Woodlands", "Bukit Timah~
## $ host_listings_count <dbl> 2, 1, 2, 8, 8, 8, 8, 4, 4~
## $ host_total_listings_count <dbl> 2, 1, 2, 8, 8, 8, 8, 4, 4~
## $ host_verifications <chr> "['email', 'phone', 'face~
## $ host_has_profile_pic <lgl> TRUE, TRUE, TRUE, TRUE, T~
## $ host_identity_verified <lgl> FALSE, FALSE, FALSE, TRUE~
## $ street <chr> "Singapore, Singapore", "~
## $ neighbourhood <chr> "Woodlands", "Bukit Timah~
## $ neighbourhood_cleansed <chr> "Woodlands", "Bukit Timah~
## $ neighbourhood_group_cleansed <chr> "North Region", "Central ~
## $ city <chr> "Singapore", "Singapore",~
## $ state <chr> NA, NA, NA, NA, NA, NA, N~
## $ zipcode <chr> "730702", "589664", NA, "~
## $ market <chr> "Singapore", "Singapore",~
## $ smart_location <chr> "Singapore", "Singapore",~
## $ country_code <chr> "SG", "SG", "SG", "SG", "~
## $ country <chr> "Singapore", "Singapore",~
## $ latitude <dbl> 1.44255, 1.33235, 1.44246~
## $ longitude <dbl> 103.7958, 103.7852, 103.7~
## $ is_location_exact <lgl> TRUE, TRUE, TRUE, TRUE, T~
## $ property_type <chr> "Apartment", "Apartment",~
## $ room_type <chr> "Private room", "Private ~
## $ accommodates <dbl> 1, 2, 1, 6, 3, 3, 6, 1, 2~
## $ bathrooms <dbl> 1.0, 1.0, 1.0, 1.0, 0.5, ~
## $ bedrooms <dbl> 1, 1, 1, 2, 1, 1, 1, 1, 1~
## $ beds <dbl> 1, 1, 1, 3, 1, 2, 7, 1, 2~
## $ bed_type <chr> "Real Bed", "Real Bed", "~
## $ amenities <chr> "{TV,\"Cable TV\",Interne~
## $ square_feet <dbl> 0, NA, 0, 205, NA, NA, 45~
## $ price <chr> "$84.00", "$80.00", "$70.~
## $ weekly_price <chr> NA, "$400.00", NA, NA, "$~
## $ monthly_price <chr> "$1,048.00", "$1,600.00",~
## $ security_deposit <chr> NA, NA, NA, "$279.00", "$~
## $ cleaning_fee <chr> NA, NA, NA, "$56.00", "$2~
## $ guests_included <dbl> 1, 2, 1, 4, 1, 1, 4, 1, 1~
## $ extra_people <chr> "$14.00", "$20.00", "$14.~
## $ minimum_nights <dbl> 180, 90, 6, 90, 90, 90, 1~
## $ maximum_nights <dbl> 360, 730, 14, 1125, 1125,~
## $ minimum_minimum_nights <dbl> 180, 90, 6, 90, 90, 90, 1~
## $ maximum_minimum_nights <dbl> 180, 90, 6, 90, 90, 90, 1~
## $ minimum_maximum_nights <dbl> 360, 730, 14, 1125, 1125,~
## $ maximum_maximum_nights <dbl> 360, 730, 14, 1125, 1125,~
## $ minimum_nights_avg_ntm <dbl> 180, 90, 6, 90, 90, 90, 1~
## $ maximum_nights_avg_ntm <dbl> 360, 730, 14, 1125, 1125,~
## $ calendar_updated <chr> "73 months ago", "71 mont~
## $ has_availability <lgl> TRUE, TRUE, TRUE, TRUE, T~
## $ availability_30 <dbl> 30, 30, 30, 30, 30, 30, 3~
## $ availability_60 <dbl> 60, 60, 60, 60, 60, 60, 6~
## $ availability_90 <dbl> 90, 90, 90, 90, 90, 90, 9~
## $ availability_365 <dbl> 365, 365, 365, 365, 365, ~
## $ calendar_last_scraped <date> 2020-06-22, 2020-06-22, ~
## $ number_of_reviews <dbl> 1, 18, 20, 20, 24, 48, 29~
## $ number_of_reviews_ltm <dbl> 0, 0, 0, 8, 4, 13, 6, 2, ~
## $ first_review <date> 2013-10-21, 2014-04-18, ~
## $ last_review <date> 2013-10-21, 2014-12-26, ~
## $ review_scores_rating <dbl> 94, 91, 98, 89, 83, 88, 8~
## $ review_scores_accuracy <dbl> 10, 9, 10, 9, 8, 9, 9, 10~
## $ review_scores_cleanliness <dbl> 10, 10, 10, 8, 8, 9, 8, 1~
## $ review_scores_checkin <dbl> 10, 10, 10, 9, 9, 9, 9, 1~
## $ review_scores_communication <dbl> 10, 10, 10, 10, 9, 9, 9, ~
## $ review_scores_location <dbl> 8, 9, 8, 9, 8, 9, 9, 10, ~
## $ review_scores_value <dbl> 8, 9, 9, 9, 8, 9, 8, 10, ~
## $ requires_license <lgl> FALSE, FALSE, FALSE, FALS~
## $ license <lgl> NA, NA, NA, NA, NA, NA, N~
## $ jurisdiction_names <lgl> NA, NA, NA, NA, NA, NA, N~
## $ instant_bookable <lgl> FALSE, FALSE, FALSE, TRUE~
## $ is_business_travel_ready <lgl> FALSE, FALSE, FALSE, FALS~
## $ cancellation_policy <chr> "flexible", "moderate", "~
## $ require_guest_profile_picture <lgl> TRUE, FALSE, TRUE, FALSE,~
## $ require_guest_phone_verification <lgl> TRUE, TRUE, TRUE, TRUE, T~
## $ calculated_host_listings_count <dbl> 2, 1, 2, 8, 8, 8, 8, 3, 3~
## $ calculated_host_listings_count_entire_homes <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ calculated_host_listings_count_private_rooms <dbl> 2, 1, 2, 8, 8, 8, 8, 3, 3~
## $ calculated_host_listings_count_shared_rooms <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ reviews_per_month <dbl> 0.01, 0.24, 0.18, 0.19, 0~
The following code converts the ID columns to character, as specified above, and converts the currency columns in the detailed listings to numeric.
# Change data types of ID columns to character
listings <- listings %>% mutate_at(vars(id, host_id), as.character)
reviews <- reviews %>% mutate_at(vars(listing_id), as.character)
d_reviews <- d_reviews %>% mutate_at(vars(id, reviewer_id, listing_id), as.character)
calendar <- calendar %>% mutate_at(vars(listing_id), as.character)
d_listings <- d_listings %>% mutate_at(vars(host_id, id), as.character)
# Change price columns in detailed listings to numeric
# Remove the $ and , symbols in columns where currency is read as character
strip_dollars <- function(x) as.numeric(gsub("[\\$,]", "", x))
# Columns 61:65 and 67 of d_listings, referenced by name rather than position
price_cols <- c("price", "weekly_price", "monthly_price",
                "security_deposit", "cleaning_fee", "extra_people")
d_listings[price_cols] <- lapply(d_listings[price_cols], strip_dollars)
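As a quick check of what the currency helper does, the sketch below applies it to a few sample strings in the format seen in the price columns. The helper is redefined here so that the example runs on its own; the vector name cleaned is hypothetical.

```r
# Same helper as above: drop "$" and "," and convert to numeric
strip_dollars <- function(x) as.numeric(gsub("[\\$,]", "", x))

cleaned <- strip_dollars(c("$84.00", "$1,048.00", NA))
cleaned
# [1]   84 1048   NA
```

Missing values stay missing, so the conversion does not turn an absent price into a zero. The price and adjusted_price columns in calendar are stored in the same format and could presumably be converted the same way if needed.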
We examine the data to see if there are missing data and decide how to handle them. Missing data could be a zero value, which we will need to change to be able to analyse the data correctly; it may reflect actual missing data, which may be omitted depending on our use of the data.
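The two kinds of missingness can be counted directly in base R. The following is a minimal sketch on a made-up data frame (toy) whose column names mirror two columns of the listings data; it is an illustration of the idea, not the summary used in this report.

```r
# Hypothetical data: listings without reviews have a zero count
# and a missing (NA) review rate
toy <- data.frame(number_of_reviews = c(0, 18, 0, 20),
                  reviews_per_month = c(NA, 0.24, NA, 0.18))

# Count zeros and NAs per column
missing_summary <- data.frame(
  variable = names(toy),
  q_zeros = unname(sapply(toy, function(col) sum(col == 0, na.rm = TRUE))),
  q_na = unname(sapply(toy, function(col) sum(is.na(col))))
)
missing_summary
```

In this toy case the zeros in one column line up with the NAs in the other, which is the kind of pattern that suggests the NAs reflect "no reviews yet" rather than data that failed to load.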
df_status() from the funModeling package is a useful function for showing missing data, as it reports the number and percentage of zeros, NAs and infinite values, the data type, and the number of unique values for each variable.
df_status(listings)
## variable q_zeros p_zeros q_na p_na q_inf p_inf
## 1 id 0 0.00 0 0.00 0 0
## 2 name 0 0.00 1 0.01 0 0
## 3 host_id 0 0.00 0 0.00 0 0
## 4 host_name 0 0.00 22 0.30 0 0
## 5 neighbourhood_group 0 0.00 0 0.00 0 0
## 6 neighbourhood 0 0.00 0 0.00 0 0
## 7 latitude 0 0.00 0 0.00 0 0
## 8 longitude 0 0.00 0 0.00 0 0
## 9 room_type 0 0.00 0 0.00 0 0
## 10 price 0 0.00 0 0.00 0 0
## 11 minimum_nights 0 0.00 0 0.00 0 0
## 12 number_of_reviews 2835 38.71 0 0.00 0 0
## 13 last_review 0 0.00 2835 38.71 0 0
## 14 reviews_per_month 0 0.00 2835 38.71 0 0
## 15 calculated_host_listings_count 0 0.00 0 0.00 0 0
## 16 availability_365 1761 24.05 0 0.00 0 0
## type unique
## 1 character 7323
## 2 character 6766
## 3 character 2466
## 4 character 1739
## 5 character 5
## 6 character 43
## 7 numeric 4579
## 8 numeric 4974
## 9 character 4
## 10 numeric 429
## 11 numeric 74
## 12 numeric 215
## 13 Date 1158
## 14 numeric 451
## 15 numeric 53
## 16 numeric 319
df_status(calendar)
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 listing_id 0 0.00 0 0.00 0 0 character 7323
## 2 date 0 0.00 0 0.00 0 0 Date 366
## 3 available 1037094 38.79 0 0.00 0 0 logical 2
## 4 price 0 0.00 2215 0.08 0 0 character 744
## 5 adjusted_price 0 0.00 2215 0.08 0 0 character 743
## 6 minimum_nights 0 0.00 725 0.03 0 0 numeric 75
## 7 maximum_nights 0 0.00 725 0.03 0 0 numeric 109
df_status(reviews)
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 listing_id 0 0 0 0 0 0 character 4488
## 2 date 0 0 0 0 0 0 Date 2687
df_status(d_reviews)
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 listing_id 0 0 0 0.00 0 0 character 4488
## 2 id 0 0 0 0.00 0 0 character 91250
## 3 date 0 0 0 0.00 0 0 Date 2687
## 4 reviewer_id 0 0 0 0.00 0 0 character 83957
## 5 reviewer_name 0 0 0 0.00 0 0 character 32853
## 6 comments 0 0 104 0.11 0 0 character 87115
df_status(d_listings)
## variable q_zeros p_zeros q_na p_na
## 1 id 0 0.00 0 0.00
## 2 listing_url 0 0.00 0 0.00
## 3 scrape_id 0 0.00 0 0.00
## 4 last_scraped 0 0.00 0 0.00
## 5 name 0 0.00 1 0.01
## 6 summary 0 0.00 324 4.42
## 7 space 0 0.00 1957 26.72
## 8 description 0 0.00 249 3.40
## 9 experiences_offered 0 0.00 0 0.00
## 10 neighborhood_overview 0 0.00 2933 40.05
## 11 notes 0 0.00 3315 45.27
## 12 transit 0 0.00 2895 39.53
## 13 access 0 0.00 2821 38.52
## 14 interaction 0 0.00 3250 44.38
## 15 house_rules 0 0.00 3874 52.90
## 16 thumbnail_url 0 0.00 7323 100.00
## 17 medium_url 0 0.00 7323 100.00
## 18 picture_url 0 0.00 0 0.00
## 19 xl_picture_url 0 0.00 7323 100.00
## 20 host_id 0 0.00 0 0.00
## 21 host_url 0 0.00 0 0.00
## 22 host_name 0 0.00 22 0.30
## 23 host_since 0 0.00 22 0.30
## 24 host_location 0 0.00 41 0.56
## 25 host_about 0 0.00 2441 33.33
## 26 host_response_time 0 0.00 22 0.30
## 27 host_response_rate 0 0.00 22 0.30
## 28 host_acceptance_rate 0 0.00 22 0.30
## 29 host_is_superhost 6143 83.89 22 0.30
## 30 host_thumbnail_url 0 0.00 22 0.30
## 31 host_picture_url 0 0.00 22 0.30
## 32 host_neighbourhood 0 0.00 842 11.50
## 33 host_listings_count 339 4.63 22 0.30
## 34 host_total_listings_count 339 4.63 22 0.30
## 35 host_verifications 0 0.00 0 0.00
## 36 host_has_profile_pic 19 0.26 22 0.30
## 37 host_identity_verified 5668 77.40 22 0.30
## 38 street 0 0.00 0 0.00
## 39 neighbourhood 0 0.00 2 0.03
## 40 neighbourhood_cleansed 0 0.00 0 0.00
## 41 neighbourhood_group_cleansed 0 0.00 0 0.00
## 42 city 0 0.00 64 0.87
## 43 state 1 0.01 6817 93.09
## 44 zipcode 0 0.00 818 11.17
## 45 market 0 0.00 90 1.23
## 46 smart_location 0 0.00 0 0.00
## 47 country_code 0 0.00 0 0.00
## 48 country 0 0.00 0 0.00
## 49 latitude 0 0.00 0 0.00
## 50 longitude 0 0.00 0 0.00
## 51 is_location_exact 1478 20.18 0 0.00
## 52 property_type 0 0.00 0 0.00
## 53 room_type 0 0.00 0 0.00
## 54 accommodates 0 0.00 0 0.00
## 55 bathrooms 103 1.41 3 0.04
## 56 bedrooms 591 8.07 12 0.16
## 57 beds 235 3.21 71 0.97
## 58 bed_type 0 0.00 0 0.00
## 59 amenities 0 0.00 0 0.00
## 60 square_feet 13 0.18 7292 99.58
## 61 price 0 0.00 0 0.00
## 62 weekly_price 0 0.00 6857 93.64
## 63 monthly_price 0 0.00 6826 93.21
## 64 security_deposit 2191 29.92 2217 30.27
## 65 cleaning_fee 885 12.09 1947 26.59
## 66 guests_included 0 0.00 0 0.00
## 67 extra_people 3144 42.93 0 0.00
## 68 minimum_nights 0 0.00 0 0.00
## 69 maximum_nights 0 0.00 0 0.00
## 70 minimum_minimum_nights 0 0.00 0 0.00
## 71 maximum_minimum_nights 0 0.00 0 0.00
## 72 minimum_maximum_nights 0 0.00 0 0.00
## 73 maximum_maximum_nights 0 0.00 0 0.00
## 74 minimum_nights_avg_ntm 0 0.00 0 0.00
## 75 maximum_nights_avg_ntm 0 0.00 0 0.00
## 76 calendar_updated 0 0.00 0 0.00
## 77 has_availability 0 0.00 0 0.00
## 78 availability_30 2453 33.50 0 0.00
## 79 availability_60 2154 29.41 0 0.00
## 80 availability_90 2045 27.93 0 0.00
## 81 availability_365 1761 24.05 0 0.00
## 82 calendar_last_scraped 0 0.00 0 0.00
## 83 number_of_reviews 2835 38.71 0 0.00
## 84 number_of_reviews_ltm 4206 57.44 0 0.00
## 85 first_review 0 0.00 2835 38.71
## 86 last_review 0 0.00 2835 38.71
## 87 review_scores_rating 0 0.00 2969 40.54
## 88 review_scores_accuracy 0 0.00 2974 40.61
## 89 review_scores_cleanliness 0 0.00 2972 40.58
## 90 review_scores_checkin 0 0.00 2978 40.67
## 91 review_scores_communication 0 0.00 2974 40.61
## 92 review_scores_location 0 0.00 2979 40.68
## 93 review_scores_value 0 0.00 2978 40.67
## 94 requires_license 7323 100.00 0 0.00
## 95 license 0 0.00 7323 100.00
## 96 jurisdiction_names 0 0.00 7323 100.00
## 97 instant_bookable 4227 57.72 0 0.00
## 98 is_business_travel_ready 7323 100.00 0 0.00
## 99 cancellation_policy 0 0.00 0 0.00
## 100 require_guest_profile_picture 7289 99.54 0 0.00
## 101 require_guest_phone_verification 7276 99.36 0 0.00
## 102 calculated_host_listings_count 0 0.00 0 0.00
## 103 calculated_host_listings_count_entire_homes 2784 38.02 0 0.00
## 104 calculated_host_listings_count_private_rooms 3057 41.75 0 0.00
## 105 calculated_host_listings_count_shared_rooms 6612 90.29 0 0.00
## 106 reviews_per_month 0 0.00 2835 38.71
## q_inf p_inf type unique
## 1 0 0 character 7323
## 2 0 0 character 7323
## 3 0 0 numeric 1
## 4 0 0 Date 2
## 5 0 0 character 6766
## 6 0 0 character 4365
## 7 0 0 character 3139
## 8 0 0 character 5180
## 9 0 0 character 1
## 10 0 0 character 2135
## 11 0 0 character 1634
## 12 0 0 character 2211
## 13 0 0 character 2002
## 14 0 0 character 1665
## 15 0 0 character 2059
## 16 0 0 logical 0
## 17 0 0 logical 0
## 18 0 0 character 6777
## 19 0 0 logical 0
## 20 0 0 character 2466
## 21 0 0 character 2466
## 22 0 0 character 1739
## 23 0 0 Date 1576
## 24 0 0 character 217
## 25 0 0 character 1174
## 26 0 0 character 5
## 27 0 0 character 56
## 28 0 0 character 79
## 29 0 0 logical 2
## 30 0 0 character 2448
## 31 0 0 character 2448
## 32 0 0 character 62
## 33 0 0 numeric 60
## 34 0 0 numeric 60
## 35 0 0 character 187
## 36 0 0 logical 2
## 37 0 0 logical 2
## 38 0 0 character 93
## 39 0 0 character 45
## 40 0 0 character 43
## 41 0 0 character 5
## 42 0 0 character 39
## 43 0 0 character 50
## 44 0 0 character 1975
## 45 0 0 character 2
## 46 0 0 character 43
## 47 0 0 character 1
## 48 0 0 character 1
## 49 0 0 numeric 4579
## 50 0 0 numeric 4974
## 51 0 0 logical 2
## 52 0 0 character 26
## 53 0 0 character 4
## 54 0 0 numeric 16
## 55 0 0 numeric 24
## 56 0 0 numeric 10
## 57 0 0 numeric 26
## 58 0 0 character 5
## 59 0 0 character 5621
## 60 0 0 numeric 14
## 61 0 0 numeric 429
## 62 0 0 numeric 203
## 63 0 0 numeric 193
## 64 0 0 numeric 156
## 65 0 0 numeric 113
## 66 0 0 numeric 16
## 67 0 0 numeric 81
## 68 0 0 numeric 74
## 69 0 0 numeric 113
## 70 0 0 numeric 72
## 71 0 0 numeric 75
## 72 0 0 numeric 106
## 73 0 0 numeric 106
## 74 0 0 numeric 140
## 75 0 0 numeric 154
## 76 0 0 character 79
## 77 0 0 logical 1
## 78 0 0 numeric 31
## 79 0 0 numeric 61
## 80 0 0 numeric 90
## 81 0 0 numeric 319
## 82 0 0 Date 2
## 83 0 0 numeric 215
## 84 0 0 numeric 73
## 85 0 0 Date 1730
## 86 0 0 Date 1158
## 87 0 0 numeric 47
## 88 0 0 numeric 9
## 89 0 0 numeric 9
## 90 0 0 numeric 9
## 91 0 0 numeric 8
## 92 0 0 numeric 8
## 93 0 0 numeric 9
## 94 0 0 logical 1
## 95 0 0 logical 0
## 96 0 0 logical 0
## 97 0 0 logical 2
## 98 0 0 logical 1
## 99 0 0 character 5
## 100 0 0 logical 2
## 101 0 0 logical 2
## 102 0 0 numeric 53
## 103 0 0 numeric 45
## 104 0 0 numeric 31
## 105 0 0 numeric 12
## 106 0 0 numeric 451
To be able to map listings we need to convert the listings data into an sf object. The st_as_sf() function converts a foreign object (here, a dataframe) into an sf object and specifies the coordinates (taken from the longitude and latitude columns in the listings dataframe). As the long/lat coordinates are in the WGS84 geographic coordinate system (EPSG:4326), we assign that CRS to the listings data and then transform it into SVY21 coordinates (EPSG:3414) to match the neighbourhoods polygon datafile, so that both layers share the same CRS. The st_as_sf() function returns a new object and leaves the original listings dataframe untouched.
# Convert listings to SF dataframe
listings_sf <- listings %>%
st_as_sf(coords = c("longitude", "latitude"),
crs = 4326) %>%
st_transform(crs = 3414)
head() is used to display the first six records and their details. The output shows the geometry type (POINT), and we can check that the projected CRS is SVY21 as intended. From the records, we can see that it is the same dataframe as the listings, except that a geometry column, holding a point geometry in each row, has replaced the longitude and latitude columns of the original listings dataframe.
head(listings_sf)
## Simple feature collection with 6 features and 14 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 22646.02 ymin: 34950.06 xmax: 42212.88 ymax: 47135.4
## projected CRS: SVY21 / Singapore TM
## # A tibble: 6 x 15
## id name host_id host_name neighbourhood_g~ neighbourhood room_type price
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 49091 COZICO~ 266763 Francesca North Region Woodlands Private ~ 84
## 2 50646 Pleasa~ 227796 Sujatha Central Region Bukit Timah Private ~ 80
## 3 56334 COZICO~ 266763 Francesca North Region Woodlands Private ~ 70
## 4 71609 Ensuit~ 367042 Belinda East Region Tampines Private ~ 167
## 5 71896 B&B R~ 367042 Belinda East Region Tampines Private ~ 95
## 6 71903 Room 2~ 367042 Belinda East Region Tampines Private ~ 84
## # ... with 7 more variables: minimum_nights <dbl>, number_of_reviews <dbl>,
## # last_review <date>, reviews_per_month <dbl>,
## # calculated_host_listings_count <dbl>, availability_365 <dbl>,
## # geometry <POINT [m]>
glimpse() shows the point details of the geometry column, giving the x,y coordinates in SVY21 projection.
glimpse(listings_sf)
## Rows: 7,323
## Columns: 15
## $ id <chr> "49091", "50646", "56334", "71609", "71~
## $ name <chr> "COZICOMFORT LONG TERM STAY ROOM 2", "P~
## $ host_id <chr> "266763", "227796", "266763", "367042",~
## $ host_name <chr> "Francesca", "Sujatha", "Francesca", "B~
## $ neighbourhood_group <chr> "North Region", "Central Region", "Nort~
## $ neighbourhood <chr> "Woodlands", "Bukit Timah", "Woodlands"~
## $ room_type <chr> "Private room", "Private room", "Privat~
## $ price <dbl> 84, 80, 70, 167, 95, 84, 209, 52, 54, 4~
## $ minimum_nights <dbl> 180, 90, 6, 90, 90, 90, 1, 90, 90, 14, ~
## $ number_of_reviews <dbl> 1, 18, 20, 20, 24, 48, 29, 176, 199, 20~
## $ last_review <date> 2013-10-21, 2014-12-26, 2015-10-01, 20~
## $ reviews_per_month <dbl> 0.01, 0.24, 0.18, 0.19, 0.22, 0.43, 0.2~
## $ calculated_host_listings_count <dbl> 2, 1, 2, 8, 8, 8, 8, 3, 3, 4, 4, 7, 1, ~
## $ availability_365 <dbl> 365, 365, 365, 365, 365, 365, 180, 356,~
## $ geometry <POINT [m]> POINT (23824.77 47135.4), POINT (~
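The conversion can be sanity-checked on a small example. The sketch below is not part of the original analysis: the data frame `toy` and its coordinates are hypothetical, and it only illustrates that st_as_sf() drops the coordinate columns in the new object, that st_transform() changes the CRS, and that the input data frame is left unchanged.

```r
library(sf)

# Hypothetical toy listings (illustrative coordinates, not from the dataset)
toy <- data.frame(
  id = c("a", "b"),
  longitude = c(103.80, 103.85),
  latitude = c(1.35, 1.30)
)

# Declare the coordinates as WGS84 (EPSG:4326), then project to SVY21 (EPSG:3414)
toy_sf <- st_as_sf(toy, coords = c("longitude", "latitude"), crs = 4326)
toy_svy21 <- st_transform(toy_sf, crs = 3414)

st_crs(toy_svy21)$epsg             # EPSG code of the projected object
st_coordinates(toy_svy21)          # easting / northing in metres
"longitude" %in% names(toy)        # the original data frame keeps its columns
"longitude" %in% names(toy_svy21)  # the sf object stores them as geometry instead
```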
Exploratory data analysis (EDA) is performed on the data to have a better understanding of the data, identify ways to approach the analysis and suggest hypotheses to test. Some potential questions to answer are:
ggplot2 is mainly used to graph and plot the data to answer some of these questions.
Airbnb uses the planning boundaries from the Urban Redevelopment Authority of Singapore. There are 5 main regions, encompassing a total of 55 neighbourhoods in the dataset. The following shows the types of accommodation available and the price distribution of each accommodation type using the cleaned data.
The number of listings is first summarized for the various neighbourhood groups and room type. ggplot2 is then used to plot a bar chart of the number of listings for different room types across the different regions.
The majority of listings are entire apartments/houses for rent, followed by private rooms for rent. Shared rooms constitute the lowest number of listings in Singapore.
Examining the listings by region, we see unsurprisingly that the majority of the listings are in the Central Region. In fact, 98% of the hotel listings are in the Central Region.
There are more listings of private rooms than entire apartments in the other regions, with a small proportion of listings being shared rooms, and a minuscule number of hotel listings. We could surmise that these non-central region listings are possibly owner-occupied homes whose owners are renting out a spare room for additional income.
# Summarizing the types of listings by neighbourhood groups
regionlist <- listings %>%
group_by(neighbourhood_group, room_type) %>%
summarise(
num_listings = n(),
avg_price = mean(price),
med_price = median(price))
# Plotting the type of accommodation by region
ggplot(regionlist, aes(x=room_type, fill = room_type)) + theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
geom_col(aes(y = num_listings)) +
facet_grid(cols=vars(neighbourhood_group), margins = T, labeller = labeller(neighbourhood_group = label_wrap_gen(width = 5, multi_line = TRUE))) +
labs(x = "", y = "No. of listings", fill = "Room Type")
No. of listings by region and room type
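A regional share such as the hotel figure quoted above can be derived from a table shaped like regionlist. The following is a minimal sketch on hypothetical counts (the object `toy_regionlist` and its numbers are made up for illustration and are not the actual listing counts).

```r
library(dplyr)

# Hypothetical counts, shaped like `regionlist` (not the actual data)
toy_regionlist <- tibble(
  neighbourhood_group = c("Central Region", "East Region",
                          "Central Region", "East Region"),
  room_type = c("Hotel room", "Hotel room", "Private room", "Private room"),
  num_listings = c(49, 1, 60, 40)
)

# Percentage of each room type's listings that fall in each region
region_share <- toy_regionlist %>%
  group_by(room_type) %>%
  mutate(pct_of_room_type = num_listings / sum(num_listings) * 100) %>%
  ungroup()

region_share
```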
We use a boxplot of the price for the different room types to understand the distribution of pricing (left figure, "Listing price"). There are outliers, e.g. listings of more than $10,000 per day for an entire home/apt or a private room. These may be mistakes in the listing price (e.g. a monthly price entered as a daily price), but either way we need to remove the outliers to make a proper comparison and to be able to zoom in on the variations.
# Remove price outliers from listings (cut-off at the 99th percentile of price)
outlier_price <- quantile(listings$price, 0.99)
listings_cleanprice <- listings %>% filter(price <= outlier_price)
# Plot boxplot of prices for each room type
p1 <- ggplot(listings, aes(x=room_type, fill = room_type)) +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(), legend.position = "None") +
geom_boxplot(aes(y=price)) +
labs(x = "", y = "Listing Price", fill = "Room Type", title = "Listing price")
#Cleaned pricing
p2 <- ggplot(listings_cleanprice, aes(x=room_type, fill = room_type)) +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
geom_boxplot(aes(y=price)) +
labs(x = "", y = "Listing Price", fill = "Room Type", title = "Listing price (outliers removed)")
grid.arrange(p1, p2, nrow = 1)
The 99th percentile of price ($799) was used as the cut-off for removing price outliers; 72 data points were removed.
As expected, entire homes and apartments command the highest prices of all listings, followed by hotel rooms and private rooms, with shared rooms having the lowest prices, as seen in the right figure (outliers removed).
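The percentile cut-off described above can be illustrated on a small synthetic sample. This is a sketch only: the prices in `toy` are randomly generated, with one deliberately extreme value standing in for a mispriced listing.

```r
library(dplyr)

set.seed(1)

# Hypothetical prices: 99 ordinary listings plus one extreme value
toy <- tibble(price = c(round(runif(99, min = 30, max = 400)), 10000))

# Cut-off at the 99th percentile, as in the analysis above
cutoff <- quantile(toy$price, 0.99)
toy_clean <- toy %>% filter(price <= cutoff)

cutoff
nrow(toy) - nrow(toy_clean)  # number of rows dropped as outliers
```

Note that a percentile cut-off always removes roughly the top 1% of rows, whether or not they are genuine errors, which is why the cleaned boxplots can still show points flagged as outliers.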
Now that we have removed the outliers, we use the facet_grid() function in ggplot2 to draw boxplots of the price distribution by room type and region.
ggplot(listings_cleanprice, aes(x=neighbourhood_group, fill = neighbourhood_group)) +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
geom_boxplot(aes(y=price)) +
facet_grid(~room_type, labeller = labeller(room_type = label_wrap_gen(width = 5, multi_line = TRUE))) +
labs(x = "", y = "Listing Price", fill = "Region")
Boxplot of price distribution by room type and region
From the plot above, we can see that despite the cleaning there are still many price outliers, especially in the Central Region.
Prices for all listings between neighbourhoods
We use the ggstatsplot package to perform confirmatory data analysis. Our null hypothesis is that the price of listings is the same across the different neighbourhood groups. We reject the null hypothesis if the p-value is less than 0.05 (i.e. at the 5% significance level, or 95% confidence level). Firstly we select only the variables required (neighbourhood_group, room_type and price) and pass them to the function ggbetweenstats() to get the p-value and a violin plot of the listing prices by neighbourhood group. This function also calculates the pairwise comparisons between the grouped variables, and shows only the significant comparisons and their p-values.
conftest <- listings_cleanprice %>% dplyr::select(neighbourhood_group, room_type, price)
ggbetweenstats(
data = conftest,
x = neighbourhood_group,
y = price
)
Confirmatory analysis of listing price by region
From the above chart, we can see that the p-value for prices across all neighbourhood groups is less than 0.05, which means we reject the null hypothesis and can state that listing prices differ significantly across neighbourhood groups. We also see that prices in the Central Region are significantly different from those in all other regions, whilst among the remaining regions only the East and North-East Regions have significantly different prices from each other.
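The decision rule can be made explicit with a base R counterpart. The sketch below assumes that ggbetweenstats() with its default settings reports a Welch one-way ANOVA for more than two groups, and uses oneway.test() as a stand-in; the data frame `toy` is hypothetical and not the Airbnb data.

```r
set.seed(123)

# Hypothetical prices for three regions (not the Airbnb data)
toy <- data.frame(
  region = rep(c("Central", "East", "West"), each = 30),
  price = c(rnorm(30, mean = 200, sd = 20),
            rnorm(30, mean = 100, sd = 20),
            rnorm(30, mean = 100, sd = 20))
)

# Welch's one-way ANOVA (does not assume equal variances across groups)
res <- oneway.test(price ~ region, data = toy, var.equal = FALSE)

alpha <- 0.05
res$p.value
if (res$p.value < alpha) {
  message("Reject the null hypothesis: mean prices differ between regions")
} else {
  message("Cannot reject the null hypothesis")
}
```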
Prices for each room type by neighbourhood group
We now look at the statistics for prices of listings in each neighbourhood group for each room type. As hotel rooms are not present in the North Region and have only 3 samples in each region other than the Central Region, we filter hotel rooms out. For entire homes/apartments and private rooms, we pass the data through the grouped_ggbetweenstats() function to test, for each room type, the null hypothesis that listing prices are the same across neighbourhood groups. Shared rooms are tested separately with ggbetweenstats().
conftest2 <- conftest %>% filter(room_type != "Shared room", room_type != "Hotel room")
grouped_ggbetweenstats(
data = conftest2,
x = neighbourhood_group,
y = price,
grouping.var = room_type,
ggsignif.args = list(textsize = 4, tip_length = 0.01),
p.adjust.method = "bonferroni", # method for adjusting p-values for multiple comparisons
# adding new components to `ggstatsplot` default
ggplot.component = list(ggplot2::scale_y_continuous(sec.axis = ggplot2::dup_axis())),
k = 3,
title.prefix = "Room Type",
palette = "default_jama",
package = "ggsci",
plotgrid.args = list(nrow = 2),
title.text = "Differences in listing prices by neighbourhoods for different room types"
)
sharedrooms <- conftest %>% filter(room_type == "Shared room")
ggbetweenstats(
data = sharedrooms,
x = neighbourhood_group,
y = price
)
Confirmatory analysis of price listings of shared rooms
When we compare the prices of shared room listings, we cannot reject the null hypothesis that prices are the same across the different regions, despite the high average price in the East Region. There are also no significant differences in the pairwise comparisons.
Next we examine the number of hosts and listings. As expected, a large number of hosts (70% of all hosts) have just 1 listing. However, there are also hosts with more than 1 listing, and one host with more than 300 listings to their name.
# Create table of % of hosts by no. of listings
list_byhost <- listings %>%
group_by(host_id, host_name) %>%
count(name = "number_of_listings", sort = TRUE) %>%
ungroup() %>%
group_by(number_of_listings) %>%
count(name = "number_of_hosts")
# Plot above table
ggplot(list_byhost, aes(x=number_of_listings, y= number_of_hosts/sum(number_of_hosts)*100)) +
geom_point() +
labs(y="Percentage of hosts", title = "% of hosts vs number of listings", x = "number of listings")
Percentage of hosts vs No. of listings
To explore whether the market is dominated by hosts with single or multiple listings, we plot a Pareto chart by adding the cumulative frequency of the number of listings of hosts to a descending list of Airbnb rentals by host. We see that almost 75% of the Airbnb ‘stock’ is taken up by hosts with multiple listings.
There are currently no regulations or legislation around the number of listings that a host may have, unlike in other cities that regulate the maximum number of listings (e.g. New York). People with more than 1 listing are likely to be agents managing these properties on behalf of landlords.
However, there are restrictions on the minimum stay for short term rentals: 3 months for private housing and 6 months for HDB flats. This means that typical tourist stays (e.g. 2-7 days) in Airbnb listings would technically be illegal. Despite this, 89% of listings have a minimum stay of less than the legislated minimum stay.
# Create Pareto Chart | Source: https://rpubs.com/dav1d00/ggpareto
list_byhost <- list_byhost[order(list_byhost$number_of_listings, decreasing = TRUE),]
list_byhost$number_of_listings1 <- factor(list_byhost$number_of_listings, levels = list_byhost$number_of_listings)
list_byhost$listfreq <- list_byhost$number_of_hosts * list_byhost$number_of_listings
list_byhost$cumul <- cumsum(list_byhost$listfreq)
nr <- nrow(list_byhost)
N <- sum(list_byhost$listfreq)
y2 <- c(" 0%", " 10%", " 20%", " 30%", " 40%", " 50%", " 60%", " 70%", " 80%", " 90%", "100%")
ggplot(list_byhost, aes(x=number_of_listings1)) +
geom_bar(aes(y=number_of_hosts), fill = "blue", stat = "identity") +
geom_point(aes(y=cumul)) +
# group = 1 is needed to connect the points across a discrete x axis
geom_path(aes(y=cumul, group=1)) +
labs(y="Frequency", title = "Pareto Chart of hosts and listings", x = "No. of listings") +
theme(plot.margin = margin(c(1,1,1,1), unit="cm"), axis.text.x = element_text(angle=90, vjust=0.6)) +
annotate("text", x = nr + 3, y = seq(0, N, N/10), label = y2, size = 3.5, hjust = "inward")
Pareto chart of hosts and listings
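The headline shares quoted above (hosts with a single listing, stock held by multi-listing hosts, and listings below the regulated minimum stay) can be computed directly from the listings table. The sketch below uses a hypothetical ten-row table `toy`, and assumes a 90-day threshold as a stand-in for the 3-month minimum on private housing, since the property tenure is not identified here.

```r
library(dplyr)

# Hypothetical listings (host IDs and minimum stays are made up)
toy <- tibble(
  host_id = c("h1", "h2", "h2", "h3", "h3", "h3", "h4", "h5", "h5", "h6"),
  minimum_nights = c(1, 2, 3, 7, 30, 90, 90, 180, 180, 14)
)

# Number of listings held by each host
per_host <- toy %>% count(host_id, name = "number_of_listings")

# Share of hosts with exactly one listing
pct_single_hosts <- mean(per_host$number_of_listings == 1) * 100

# Share of all listings held by hosts with more than one listing
multi <- per_host$number_of_listings > 1
pct_stock_multi <- sum(per_host$number_of_listings[multi]) / nrow(toy) * 100

# Share of listings with a minimum stay below the assumed 90-day threshold
pct_below_min <- mean(toy$minimum_nights < 90) * 100

c(single_hosts = pct_single_hosts,
  stock_multi = pct_stock_multi,
  below_min_stay = pct_below_min)
```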
A simple way to see the split between hosts with single or multiple listings is to use a mosaic plot. While there are other packages such as vcd or ggmosaic, we use mosaicplot() from base R as that is sufficient for our needs.
# Create column with Single / Multiple host types
listings <- listings %>% mutate(host_type = ifelse(calculated_host_listings_count ==1, "Single", "Multiple"))
mosaicplot(listings$room_type ~ listings$host_type, color = c("steelblue", "wheat"), xlab = "Room Type", ylab = "Host Type", main = "Mosaic Plot of Room Type and Host Type")
Private rooms make up the majority of listings for hosts with Single listings, followed by shared rooms - which corresponds to the assumption that these are people who are renting out their spare / shared bedroom for extra cash. The number of people renting out the entire home/apt can be attributed to people who may have an investment home or are not in the country for this period. There are a few hosts with single hotel room listings that could have the wrong room type attributed, or are special / boutique offerings, as we expect hotel operators to have multiple listings.
# Set seed for reproducibility
set.seed(123)
test2 <- listings %>% dplyr::select(room_type, host_type, neighbourhood_group, price)
grouped_ggbetweenstats(
data = test2,
x = host_type,
y = price,
grouping.var = room_type,
ggsignif.args = list(textsize = 4, tip_length = 0.01),
p.adjust.method = "bonferroni", # method for adjusting p-values for multiple comparisons
# adding new components to `ggstatsplot` default
ggplot.component = list(ggplot2::scale_y_continuous(sec.axis = ggplot2::dup_axis())),
k = 3,
title.prefix = "Room Type",
palette = "default_jama",
package = "ggsci",
plotgrid.args = list(nrow = 2),
title.text = "Differences in listing prices for single/multiple hosts by different room types"
)
Confirmatory analysis of listing price by host type
We use confirmatory analysis to test the null hypothesis that hosts with multiple listings charge the same price as hosts with single listings for each room type. From the above, we can see that the p-value > 0.01 and therefore we cannot reject the null hypothesis. We also performed the same confirmatory analysis by region (excluding hotel rooms); the p-values in the output below are all above 0.05, so again we cannot reject the null hypothesis that prices do not differ.
grouped_ggbetweenstats(
data = test2 %>% filter(room_type != "Hotel room"),
x = host_type,
y = price,
grouping.var = neighbourhood_group,
ggsignif.args = list(textsize = 4, tip_length = 0.01),
p.adjust.method = "bonferroni", # method for adjusting p-values for multiple comparisons
# adding new components to `ggstatsplot` default
# ggplot.component = list(ggplot2::scale_y_continuous(sec.axis = ggplot2::dup_axis())),
# k = 3,
title.prefix = "Region",
palette = "default_jama",
package = "ggsci",
# plotgrid.args = list(nrow = 2),
title.text = "Differences in listing prices for single/multiple hosts by different neighbourhoods",
output = "subtitle"
)
## $`Central Region`
## paste(italic("t")["Welch"], "(", "2384.01", ") = ", "0.54", ", ",
## italic("p"), " = ", "0.592", ", ", widehat(italic("g"))["Hedge"],
## " = ", "0.02", ", CI"["95%"], " [", "-0.05", ", ", "0.08",
## "]", ", ", italic("n")["obs"], " = ", 5422L)
##
## $`East Region`
## paste(italic("t")["Welch"], "(", "241.52", ") = ", "-1.85", ", ",
## italic("p"), " = ", "0.065", ", ", widehat(italic("g"))["Hedge"],
## " = ", "-0.17", ", CI"["95%"], " [", "-0.36", ", ", "0.01",
## "]", ", ", italic("n")["obs"], " = ", 445L)
##
## $`North-East Region`
## paste(italic("t")["Welch"], "(", "256.21", ") = ", "-1.89", ", ",
## italic("p"), " = ", "0.060", ", ", widehat(italic("g"))["Hedge"],
## " = ", "-0.21", ", CI"["95%"], " [", "-0.47", ", ", "0.01",
## "]", ", ", italic("n")["obs"], " = ", 279L)
##
## $`North Region`
## paste(italic("t")["Welch"], "(", "163.94", ") = ", "0.27", ", ",
## italic("p"), " = ", "0.791", ", ", widehat(italic("g"))["Hedge"],
## " = ", "0.04", ", CI"["95%"], " [", "-0.24", ", ", "0.31",
## "]", ", ", italic("n")["obs"], " = ", 208L)
##
## $`West Region`
## paste(italic("t")["Welch"], "(", "233.65", ") = ", "-1.29", ", ",
## italic("p"), " = ", "0.197", ", ", widehat(italic("g"))["Hedge"],
## " = ", "-0.12", ", CI"["95%"], " [", "-0.29", ", ", "0.06",
## "]", ", ", italic("n")["obs"], " = ", 518L)
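The subtitles above report a Welch t-test per region. As a sketch of how to read them, the p-values below are copied from that output and compared against the 5% level; the second half shows base R's t.test(), which performs Welch's test by default, on hypothetical data (the data frame `toy` is not the Airbnb data).

```r
# p-values copied from the subtitle output above
p_by_region <- c(Central = 0.592, East = 0.065, `North-East` = 0.060,
                 North = 0.791, West = 0.197)

alpha <- 0.05
any(p_by_region < alpha)  # is any region significant at the 5% level?

# Hypothetical two-group data to show the base R counterpart of the test
set.seed(123)
toy <- data.frame(
  host_type = rep(c("Single", "Multiple"), each = 40),
  price = rnorm(80, mean = 120, sd = 30)
)

welch <- t.test(price ~ host_type, data = toy)  # var.equal = FALSE by default
welch$method
welch$p.value
```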
# When did the hosts sign up?
host_byyear <- d_listings %>%
dplyr::select(id, host_id, host_since) %>%
group_by(year_joined = year(host_since)) %>%
drop_na() %>%
summarise(number_hosts = n()) %>%
ungroup() %>%
# year-on-year percentage change in the number of new hosts
mutate(change = (number_hosts - lag(number_hosts)) / lag(number_hosts) * 100)
ggplot(host_byyear, aes(x=year_joined, y = number_hosts)) + geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "New hosts by year", y = "No. of hosts joined", x = "year") +
geom_text(aes(label = number_hosts), vjust = -0.3)
New hosts by year
The number of new hosts increased steadily from 2010 and peaked in 2016, then dropped by 34.08% in 2017. This may be attributable to the legislation enacted in May 2017, under which HDB flats have a minimum rental period of 6 months and cannot be rented to tourists, while private residential properties have a minimum rental period of 3 months. Offenders were prosecuted and fined for illegal short-term stays on Airbnb.
The number of new hosts joining has held steady at around 800-900 per year, with a drop to 204 new hosts in 2020. This corresponds to the COVID-19 pandemic, with little to no travel from March 2020, especially after Singapore closed its borders.
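The year-on-year change quoted above comes from comparing each year's count with the previous year's using lag(). A minimal sketch with hypothetical counts (the numbers in `toy_byyear` are made up, not the figures from the chart):

```r
library(dplyr)

# Hypothetical counts of new hosts per year
toy_byyear <- tibble(
  year_joined = c(2015, 2016, 2017),
  number_hosts = c(100, 150, 99)
) %>%
  # percentage change relative to the previous year; the first year has no
  # predecessor, so its change is NA
  mutate(change = (number_hosts - lag(number_hosts)) / lag(number_hosts) * 100)

toy_byyear
```

A negative value of change already denotes a drop, so a value of -34 reads as "dropped by 34%".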
The detailed listings table gives more information on the hosts, including the host join date, identity verification, and their response and acceptance rates. Note that the acceptance rate is tied to the host and not to the listing (e.g. a listing could have no reviews, but the host could have accepted guests on their other listings). Conversely, there are 198 listings whose acceptance rates are missing or 0%, but which still have reviews attributed to the listing. As the acceptance rate (as defined by Airbnb) reflects activity in the past 365 days, these are listings that have been inactive for the past year (from June 2019 - June 2020).
# Listings whose host acceptance rate is 0% or missing, sorted by number of reviews
d_listings %>% filter(., host_acceptance_rate == "0%" | is.na(host_acceptance_rate)) %>% dplyr::select(id, host_acceptance_rate, number_of_reviews) %>% arrange(desc(number_of_reviews))
## # A tibble: 198 x 3
## id host_acceptance_rate number_of_reviews
## <chr> <chr> <dbl>
## 1 17616042 0% 108
## 2 8399111 0% 55
## 3 32318920 <NA> 47
## 4 14211027 0% 46
## 5 13325975 0% 37
## 6 4583694 0% 29
## 7 3980202 0% 28
## 8 395191 0% 27
...
There are 2835 listings without reviews. As there is no publicly available data on actual rentals / stays, reviews and acceptance rate are our proxy for inactive listings or listings that have not been rented out. We can also look into them to see their location and other factors (such as room type, price, host join date, etc) to see what could contribute to them not having stays.
no_reviews <- listings_sf %>% filter(is.na(last_review))
host_join <- d_listings %>% dplyr::select(id, host_since)
no_reviews_host <- left_join(no_reviews, host_join, by = c("id")) %>% group_by(year_joined = year(host_since)) %>% summarise(number_hosts = n()) %>% drop_na() %>% ungroup()
no_reviews_host$year_joined = as.character(no_reviews_host$year_joined)
ggplot(no_reviews_host, aes(x=year_joined, y = number_hosts)) + geom_bar(stat = "identity", fill = "steelblue4") +
labs(title = "Listings with no reviews - Hosts by year", y = "No. of hosts joined", x = "year") + geom_text(aes(label = number_hosts), vjust = -0.3) + scale_x_discrete(breaks = no_reviews_host$year_joined)
Year joined by hosts for listings with no reviews
We would expect listings without reviews to be either newer listings (i.e. from hosts who joined recently) or much older listings (from hosts who joined more than 5 years ago), but at first glance the pattern follows that of the overall host joining dates.
Airbnb also tracks the number of reviews in the last twelve months (number_of_reviews_ltm). If this data exists at other points in time, we could build a picture of how many reviews (stays) a listing had in that timeframe.
Review scores are provided in the detailed listings data. Examining the review scores, there is one overall rating (review_scores_rating) rated on a scale of 0 - 100, and scores on specific attributes such as accuracy of the listing, cleanliness of the property, communication of the host, ease of check-in, location and value of the listing rated on a scale of 0 - 10. The distribution of the overall rating scores and the attributes are shown in the histograms below. Rows with missing data were not counted.
Guests generally gave high overall ratings to the majority of properties, with 80% or more scoring in the top quartile. As the Airbnb platform is partly driven by user / host ratings, it is not surprising that reviews are generally positive. The bulk of the ratings corresponded to the majority of listings (entire listings and private rooms).
While the distributions of the specific attributes were similar, there was greater variation in property cleanliness and the value of the listing.
# Select relevant data for review scores
review_scores <- d_listings %>% dplyr::select(id, host_id, number_of_reviews, room_type, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_communication, review_scores_checkin, review_scores_location, review_scores_value, reviews_per_month)
# Plot histogram of overall rating
ggplot(review_scores, aes(x=review_scores_rating)) + geom_histogram(aes(fill = room_type), bins = 20) + stat_bin(aes(label = ..count..), bins = 20, size = 3, geom= "text", vjust = -1)
Distribution of overall review score
# Plot histogram of attribute scores
ggplot(gather(review_scores[, -c(1:5,12)], cols, value), aes(x = value)) +
geom_histogram(binwidth = 1) + facet_grid(.~cols)
Distribution of individual attribute scores
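The attribute histograms rely on reshaping the score columns from wide to long format with gather(). A sketch of the same reshaping with tidyr's newer pivot_longer() is shown below as an alternative; the table `toy_scores` and its values are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical review scores for three listings
toy_scores <- tibble(
  id = c("1", "2", "3"),
  review_scores_accuracy = c(10, 9, NA),
  review_scores_cleanliness = c(9, 8, 10),
  review_scores_value = c(10, 7, 9)
)

# One row per listing and attribute; rows with missing scores are dropped
scores_long <- toy_scores %>%
  pivot_longer(cols = starts_with("review_scores_"),
               names_to = "attribute",
               values_to = "score") %>%
  drop_na(score)

# Frequency of each score per attribute (what the histograms display)
scores_long %>% count(attribute, score)
```

Selecting columns by name prefix rather than by position makes the reshaping less sensitive to changes in column order.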
Examining the data from the listings table, we see that there are listings that lie within the Central Water Catchment area, which is a non-residential area (consisting of parks and reservoirs), so we need to check that they are attributed correctly. There are 28 listings in the Central Water Catchment. Exploring the listings data further, we see that some of the names show that they are located wrongly (e.g. "3 mins from Jurong East MRT", which is nowhere near the catchment area). However, as it is not possible to identify where exactly these listings are located, we drop them from our dataset. Similarly, there are listings within the Mandai, Sungei Kadut and Western Water Catchment neighbourhoods, which are zoned industrial and do not have any residential properties. We drop the listings from these 3 neighbourhoods as well.
Alternatively, if we want to include these points, we can try to identify a proxy location from the description such as using the MRT station coordinates for the listing from Jurong East; this is only possible if there are a small number of listings; this would not be feasible or reproducible for larger data sets.
# Identify listings that fall within the Central Water Catchment area
filter(listings, neighbourhood == "Central Water Catchment")
## # A tibble: 28 x 17
## id name host_id host_name neighbourhood_g~ neighbourhood latitude
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 51848~ Little Hab~ 2927830 Shwu She~ North Region Central Water~ 1.35
## 2 28139~ comtempora~ 147591~ Roland North Region Central Water~ 1.35
## 3 28967~ Lem's crib 190297~ Lemuel North Region Central Water~ 1.35
## 4 29148~ Home with ~ 764645~ D-Jin North Region Central Water~ 1.35
## 5 33621~ LUXURY AT ~ 156409~ Rj North Region Central Water~ 1.35
## 6 33755~ ☀BEAUTIFUL~ 219550~ Ray North Region Central Water~ 1.35
## 7 33883~ ☀ HUGE & M~ 219550~ Ray North Region Central Water~ 1.35
## 8 34157~ 1br cosy a~ 664061~ Jay North Region Central Water~ 1.35
## 9 34188~ Gem of the~ 664061~ Jay North Region Central Water~ 1.35
## 10 34188~ 1br cosy a~ 664061~ Jay North Region Central Water~ 1.35
## # ... with 18 more rows, and 10 more variables: longitude <dbl>,
## # room_type <chr>, price <dbl>, minimum_nights <dbl>,
## # number_of_reviews <dbl>, last_review <date>, reviews_per_month <dbl>,
## # calculated_host_listings_count <dbl>, availability_365 <dbl>,
## # host_type <chr>
# Remove listings in the Central Water Catchment, Sungei Kadut, Mandai and Western Water Catchment areas
listings_clean <- filter(listings_sf, !neighbourhood %in% c("Central Water Catchment", "Sungei Kadut", "Mandai", "Western Water Catchment")) %>% st_as_sf()
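The alternative mentioned above, relocating a listing to a proxy location inferred from its name, could look roughly like the sketch below. Everything in it is hypothetical: the table `toy`, the lookup `proxy`, and in particular the proxy coordinates, which are approximate placeholder values for illustration and would need to be replaced with verified station coordinates.

```r
library(dplyr)

# Hypothetical listings; two are attributed to the catchment area
toy <- tibble(
  id = c("1", "2", "3"),
  name = c("3 mins from Jurong East MRT", "Cosy room", "Studio near park"),
  neighbourhood = c("Central Water Catchment", "Central Water Catchment", "Bedok"),
  latitude = c(1.35, 1.35, 1.32),
  longitude = c(103.80, 103.80, 103.93)
)

# Hypothetical lookup: a text pattern and approximate placeholder coordinates
proxy <- tibble(pattern = "Jurong East MRT", proxy_lat = 1.333, proxy_lon = 103.742)

suspect <- toy$neighbourhood == "Central Water Catchment"
hit <- suspect & grepl(proxy$pattern, toy$name, fixed = TRUE)

relocated <- toy %>%
  # move listings whose name matches the pattern to the proxy location
  mutate(latitude = ifelse(hit, proxy$proxy_lat, latitude),
         longitude = ifelse(hit, proxy$proxy_lon, longitude)) %>%
  # drop suspect listings that could not be matched to any proxy
  filter(!(suspect & !hit))

relocated
```

The neighbourhood label of a relocated listing would then need to be re-derived from its new coordinates, which is not shown here.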
We use tmap as our main package for generating maps. Similar to ggplot2, the syntax follows the grammar of graphics and is compatible with sf, leaflet, and other spatial wrangling packages. It is also useful as maps can be viewed interactively with tiles from map providers such as OpenStreetMap or ESRI; they can also be plotted as a static image for the purposes of a report. We also use the read_osm() function from the tmaptools package (which relies on the OpenStreetMap package) to load the background map as a raster in tmap’s plot mode.
# Read in OSM raster of listings data for plot view and create bounding box
sg_osm <- tmaptools::read_osm(listings_clean, ext=1.3)
bb_sg_osm <- st_bbox(listings_clean, crs = 3414)
The listings are mapped below by room types and neighbourhoods. The listings are clearly clustered around the Central Business District (CBD) and main shopping district (Orchard). There are also listings in suburban parts of Singapore which, while less expected, could be explained by being closer either to ‘desirable’ neighbourhoods such as East Coast or to specific industrial areas such as Pioneer, Sembawang and Changi Business Park / Loyang.
Entire home/apartment listings are mainly in Kallang, Rochor and Novena area, which are close to the CBD and Orchard. There are a fair number of listings in the other neighbourhoods, and they look to be in clusters, which we will examine using Spatial Point Pattern analysis in the next part of the Project.
Private room listings also show high concentration in the same neighbourhoods as entire home / apartment listings, but are also prevalent in neighbourhoods like Outram, Bedok and Rochor. They also appear to be more spread out within the neighbourhoods.
Hotel room listings as mentioned above are mainly found in the neighbourhoods of Outram, Kallang and Singapore River, whilst shared room listings are concentrated in the Kallang and Rochor areas, with the other listings being sparsely populated across the other neighbourhoods.
# Plotting neighbourhood listings on tmap
tmap_mode("view")
# Plotting points
tm_basemap(leaflet::providers$OpenStreetMap) +
# The commented out code is for plot mode (report)
# tm_shape(sg_osm, bbox=bb_sg_osm) +
# tm_rgb() +
tm_shape(nhood_map_sf) +
tm_polygons(alpha = 0.3) +
tm_shape(listings_clean) +
tm_symbols(col="room_type", size = 0.2) +
tm_view(set.zoom.limits = c(11, 17)) +
tm_facets(by="room_type") +
tm_layout(legend.show = FALSE)
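The visual impression of where each room type concentrates can be cross-checked with a simple tabulation. The sketch below is illustrative only: toy_listings is a hypothetical stand-in for st_drop_geometry(listings_clean), and its counts are made up. It assumes dplyr 1.0.0 or later for slice_max().

```r
library(dplyr)

# Hypothetical stand-in for st_drop_geometry(listings_clean); values are made up
toy_listings <- tibble(
  neighbourhood = c("Kallang", "Kallang", "Rochor", "Outram", "Bedok", "Kallang"),
  room_type     = c("Entire home/apt", "Private room", "Shared room",
                    "Hotel room", "Private room", "Entire home/apt")
)

# Count listings per room type and neighbourhood,
# then keep the three busiest neighbourhoods within each room type
room_counts <- toy_listings %>%
  count(room_type, neighbourhood, name = "num_listings") %>%
  group_by(room_type) %>%
  slice_max(num_listings, n = 3, with_ties = FALSE) %>%
  ungroup()

room_counts
```

Applied to the real listings, a table like this would give the neighbourhood rankings that the faceted map shows only visually.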
# Removing price outliers from the sf listings
listings_sf_price <- listings_clean %>% filter(price <= outlier_price)
# Plotting points
# tmap_mode("plot")
tm_basemap(leaflet::providers$OpenStreetMap) +
tm_shape(sg_osm, bbox=bb_sg_osm) +
tm_rgb() +
tm_shape(nhood_map_sf) +
tm_polygons(alpha = 0.3) +
tm_shape(listings_sf_price) +
tm_symbols(col = "price", size = 0.2, palette = "YlOrBr", legend.hist = TRUE) +
# tm_view(set.zoom.limits = c(11, 18)) +
tm_facets(by="room_type") +
tm_layout(legend.outside = TRUE, legend.outside.position = "bottom", legend.stack = "horizontal", legend.hist.height = 1, legend.hist.width = 0.85, legend.outside.size=0.1)
Most of the neighbourhoods have listings within the price range of $0 - $200 per night. We can see some outliers in neighbourhoods like Queenstown, Clementi and Woodlands for entire homes. Private room prices are more homogeneous across the neighbourhoods. However, there are also high-priced private room listings outside the central neighbourhoods, e.g. Tampines, Pasir Ris and Hougang.
The denser neighbourhoods for entire homes/apts and private rooms warrant closer examination.
# Facet point symbol map showing rental price for hosts with a single listing versus multiple listings
listings_sf_price <- listings_sf_price %>% mutate(host_type = ifelse(calculated_host_listings_count ==1, "Single", "Multiple"))
tm_basemap(leaflet::providers$OpenStreetMap) +
tm_shape(sg_osm, bbox=bb_sg_osm) +
tm_rgb() +
tm_shape(nhood_map_sf) +
tm_polygons(alpha = 0.3) +
tm_shape(listings_sf_price) +
tm_symbols(col = "host_type", shape = "price", size = 0.2, title.col = "Host Type", title.shape = "Price") +
tm_facets(by="room_type")+
tm_layout(legend.outside = TRUE, legend.outside.position = "bottom", legend.stack = "horizontal", legend.outside.size=0.1)
# Create a summary of the cleaned price listings by neighbourhood and room type (number of listings and median price)
listings_cleanprice_sum <- st_drop_geometry(listings_sf_price) %>% group_by(neighbourhood, room_type) %>%
summarise(num_listings = n(),
med_price = median(price)) %>%
arrange(desc(num_listings))
# Join neighbourhood mapping and summary dataframe
listings_join <- left_join(nhood_map_sf, listings_cleanprice_sum, by = c("neighbourhood"))
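To make the summarise-then-join step concrete, here is a minimal sketch on toy data. The tables toy_prices and toy_nhoods are hypothetical and their values are made up. The `.groups = "drop"` argument is an addition for illustration that returns an ungrouped result. The point of interest is that a neighbourhood with no listings survives the left join with NA in the summary columns.

```r
library(dplyr)

# Hypothetical listings with prices; values are made up for illustration
toy_prices <- tibble(
  neighbourhood = c("Kallang", "Kallang", "Kallang", "Rochor", "Rochor"),
  room_type     = c("Private room", "Private room", "Entire home/apt",
                    "Private room", "Private room"),
  price         = c(60, 80, 150, 70, 90)
)

# Hypothetical neighbourhood table; "Tuas" has no listings in toy_prices
toy_nhoods <- tibble(neighbourhood = c("Kallang", "Rochor", "Tuas"))

# Number of listings and median price per neighbourhood and room type
toy_sum <- toy_prices %>%
  group_by(neighbourhood, room_type) %>%
  summarise(num_listings = n(),
            med_price = median(price),
            .groups = "drop") %>%
  arrange(desc(num_listings))

# The left join keeps every neighbourhood; unmatched ones get NA summaries
toy_join <- left_join(toy_nhoods, toy_sum, by = "neighbourhood")

toy_join
```

Rows with NA in room_type are the ones that a facet option such as drop.NA.facets is meant to handle when the joined table is mapped.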
The following shows the median price and number of listings for each neighbourhood.
+ Entire home/apt:
    + High median prices, as expected, in central and premium neighbourhoods such as Southern Islands, Orchard, Bukit Timah, Tanglin, Singapore River and Rochor. Potential outliers are Clementi and Choa Chu Kang.
    + The highest densities of listings are in Geylang, Kallang, Novena, Downtown Core, Rochor and River Valley. These correspond to city fringe areas with accessible transport.
We will examine the density of listings further using Kernel Density Estimation in the next section.
tmap_mode("plot")
tmap_arrange(
tm_basemap(leaflet::providers$OpenStreetMap) +
tm_shape(sg_osm, bbox=bb_sg_osm) +
tm_rgb() +
tm_shape(listings_join) +
tm_polygons("med_price", title = "Median Price") +
tm_view(set.zoom.limits = c(10, 18)) +
tm_facets(by="room_type", drop.NA.facets = TRUE) +
tm_layout(legend.outside = TRUE, legend.outside.position = "bottom", legend.outside.size = 0.2),
tm_basemap(leaflet::providers$OpenStreetMap) +
tm_shape(sg_osm, bbox=bb_sg_osm) +
tm_rgb() +
tm_shape(listings_join) +
tm_polygons("num_listings", title = "No. of listings", palette = "Blues", alpha = 0.6) +
tm_view(set.zoom.limits = c(10, 18)) +
tm_facets(by="room_type", drop.NA.facets = TRUE) +
tm_layout(legend.outside = TRUE, legend.outside.position = "bottom", legend.outside.size = 0.2)
)
Choropleth map of median listing price by room type
# Saving data to be loaded for Spatial EDA
# listings, neighbourhood maps in OGR and sf
basedata <- c("listings_sf", "listings_clean", "neighbourhoods", "nhood_map_sf")
save(list = basedata, file = "basedata.RData")
save(listings_clean, d_listings, neighbourhoods, nhood_map_sf, file = "basedataGWR.RData")
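For readers less familiar with .RData files, the following sketch shows the save/load round trip in base R. The objects toy_a and toy_b and the temporary file path are hypothetical. Loading into a fresh environment is one possible way to avoid overwriting objects already in the workspace, and is not necessarily how the later sections load the data.

```r
# Hypothetical objects standing in for the saved spatial data
toy_a <- data.frame(x = 1:3)
toy_b <- c("Kallang", "Rochor")

# Save both objects to a temporary .RData file
rdata_path <- tempfile(fileext = ".RData")
save(toy_a, toy_b, file = rdata_path)

# Load into a fresh environment; load() returns the restored object names
loaded <- new.env()
restored_names <- load(rdata_path, envir = loaded)

restored_names
```

Objects are restored under the names they were saved with, so a downstream script can refer to them directly after calling load().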