Airbnb, Inc. operates an online marketplace for lodging, primarily homestays for vacation rentals, and tourism activities. It is based in San Francisco, California. The platform is accessible via website and mobile app. Airbnb does not own any of the listed properties; instead, it earns a commission on each booking. The company was founded in 2008 by Brian Chesky, Nathan Blecharczyk and Joe Gebbia. The name Airbnb is a shortened version of its original name, AirBedandBreakfast.com. It currently covers more than 100,000 cities in 220 countries and regions worldwide, including Singapore.

We will now look at Airbnb listing data for Singapore. The data was sourced from Kaggle.com and, according to the website, was collected on 28 August 2019.

Exploratory and Explanatory Data Analysis

Before starting the exploratory and explanatory data analysis, we load all the libraries needed to support it.

library(lubridate)
library(scales)
library(readr)
library(dplyr)
library(ggplot2)
library(plotly)
library(tidyr)
library(glue)
library(viridis)
library(leaflet)
library(treemapify)
library(skimr)
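The library() calls above only load packages that are already installed. A minimal sketch of one way to check for missing packages first, assuming they would be installed from CRAN (the helper name missing_packages is ours and not part of any package):

```r
# Packages used in this analysis
required <- c("lubridate", "scales", "readr", "dplyr", "ggplot2", "plotly",
              "tidyr", "glue", "viridis", "leaflet", "treemapify", "skimr")

# Hypothetical helper: return the packages that cannot be found locally
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```

The install step is left commented out so that nothing is downloaded without an explicit decision.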

After loading the libraries, we read in the data and check its structure.

air <- read_csv("airbnb/listings.csv")
glimpse(air)
## Rows: 7,907
## Columns: 16
## $ id                             <dbl> 49091, 50646, 56334, 71609, 71896, 7190~
## $ name                           <chr> "COZICOMFORT LONG TERM STAY ROOM 2", "P~
## $ host_id                        <dbl> 266763, 227796, 266763, 367042, 367042,~
## $ host_name                      <chr> "Francesca", "Sujatha", "Francesca", "B~
## $ neighbourhood_group            <chr> "North Region", "Central Region", "Nort~
## $ neighbourhood                  <chr> "Woodlands", "Bukit Timah", "Woodlands"~
## $ latitude                       <dbl> 1.44255, 1.33235, 1.44246, 1.34541, 1.3~
## $ longitude                      <dbl> 103.7958, 103.7852, 103.7967, 103.9571,~
## $ room_type                      <chr> "Private room", "Private room", "Privat~
## $ price                          <dbl> 83, 81, 69, 206, 94, 104, 208, 50, 54, ~
## $ minimum_nights                 <dbl> 180, 90, 6, 1, 1, 1, 1, 90, 90, 90, 15,~
## $ number_of_reviews              <dbl> 1, 18, 20, 14, 22, 39, 25, 174, 198, 23~
## $ last_review                    <date> 2013-10-21, 2014-12-26, 2015-10-01, 20~
## $ reviews_per_month              <dbl> 0.01, 0.28, 0.20, 0.15, 0.22, 0.38, 0.2~
## $ calculated_host_listings_count <dbl> 2, 1, 2, 9, 9, 9, 9, 4, 4, 4, 32, 32, 9~
## $ availability_365               <dbl> 365, 365, 365, 353, 355, 346, 172, 59, ~

The data has 7,907 rows and 16 columns. Checking the data type of every column, we find 5 character columns that should be converted to factors:

  • name
  • host_name
  • neighbourhood_group
  • neighbourhood
  • room_type

airsg <- air %>% 
  mutate(name = as.factor(name),
         host_name = as.factor(host_name),
         neighbourhood_group = as.factor(neighbourhood_group),
         neighbourhood = as.factor(neighbourhood),
         room_type = as.factor(room_type))

head(airsg)
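Since the five columns converted above are exactly the character columns of the data, the same conversion can also be written without naming each column. A small sketch on a toy tibble with made-up values, assuming dplyr 1.0 or later for across():

```r
library(dplyr)

# Toy stand-in for the listings data (values are made up for illustration)
toy <- tibble(
  name = c("Room A", "Room B", "Room C"),
  room_type = c("Private room", "Shared room", "Private room"),
  price = c(83, 50, 69)
)

# Convert every character column to a factor in one step
toy_fct <- toy %>%
  mutate(across(where(is.character), as.factor))

# Inspect the resulting column classes
sapply(toy_fct, class)
```

Naming the columns explicitly, as done above, is safer when some character columns should stay as text; the across() form is shorter when all of them are categorical.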

Let us check the data once more and skim the details to see whether any cleaning is necessary.

skim(airsg)
Data summary
Name airsg
Number of rows 7907
Number of columns 16
_______________________
Column type frequency:
Date 1
factor 5
numeric 10
________________________
Group variables None

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
last_review 2758 0.65 2013-10-21 2019-08-27 2019-06-27 1001

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
name 2 1 FALSE 7457 Lux: 13, Inv: 9, Stu: 9, Cit: 8
host_name 0 1 FALSE 1833 Jay: 290, Alv: 249, Ric: 157, Aar: 145
neighbourhood_group 0 1 FALSE 5 Cen: 6309, Wes: 540, Eas: 508, Nor: 346
neighbourhood 0 1 FALSE 43 Kal: 1043, Gey: 994, Nov: 537, Roc: 536
room_type 0 1 FALSE 3 Ent: 4132, Pri: 3381, Sha: 394

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 23388624.63 10164162.07 49091.00 15821800.50 24706270.00 32348500.00 38112762.00 ▂▃▅▆▇
host_id 0 1.00 91144807.41 81909095.31 23666.00 23058075.00 63448912.00 155381142.00 288567551.00 ▇▃▂▂▁
latitude 0 1.00 1.31 0.03 1.24 1.30 1.31 1.32 1.45 ▂▇▂▁▁
longitude 0 1.00 103.85 0.04 103.65 103.84 103.85 103.87 103.97 ▁▁▃▇▁
price 0 1.00 169.33 340.19 0.00 65.00 124.00 199.00 10000.00 ▇▁▁▁▁
minimum_nights 0 1.00 17.51 42.09 1.00 1.00 3.00 10.00 1000.00 ▇▁▁▁▁
number_of_reviews 0 1.00 12.81 29.71 0.00 0.00 2.00 10.00 323.00 ▇▁▁▁▁
reviews_per_month 2758 0.65 1.04 1.29 0.01 0.18 0.55 1.37 13.00 ▇▁▁▁▁
calculated_host_listings_count 0 1.00 40.61 65.14 1.00 2.00 9.00 48.00 274.00 ▇▁▁▁▁
availability_365 0 1.00 208.73 146.12 0.00 54.00 260.00 355.00 365.00 ▅▂▂▂▇

The “reviews_per_month” and “last_review” columns each have 2,758 missing values (about 35% of the rows). We will probably not use these columns in the analysis, so for now we keep them as they are and will transform them later if necessary.
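If these columns are needed later, one possible treatment is sketched below on toy data. It assumes that a missing reviews_per_month simply means the listing has never been reviewed, which the source does not confirm, so the sketch checks that assumption before filling anything:

```r
library(dplyr)

# Toy stand-in with the same two review columns (made-up values)
toy <- tibble(
  number_of_reviews = c(0, 18, 0, 14),
  reviews_per_month = c(NA, 0.28, NA, 0.15)
)

# Count missing values per column
na_counts <- colSums(is.na(toy))

# Assumption check: every missing rate belongs to a listing with zero reviews
all_from_unreviewed <- all(
  toy$number_of_reviews[is.na(toy$reviews_per_month)] == 0
)

# Only if the check holds, read a missing rate as zero reviews per month
toy_filled <- toy %>%
  mutate(reviews_per_month = coalesce(reviews_per_month, 0))
```

If the check fails on the real data, the missing values have another cause and should not be replaced with zero.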

Now let us look at the summary of the data to check the descriptive statistics for each column.

summary(airsg)
##        id                                                          name     
##  Min.   :   49091   Luxury hostel with in-cabin locker - Single mixed:  13  
##  1st Qu.:15821800   Inviting & Cozy 1BR APT 3 mins from Tg Pagar MRT :   9  
##  Median :24706270   Studio Apartment - Oakwood Premier               :   9  
##  Mean   :23388625   City-located 1BR loft apartment *BRAND NEW*      :   8  
##  3rd Qu.:32348500   Stylish 1BR Located 7 mins from Tg Pagar MRT     :   8  
##  Max.   :38112762   (Other)                                          :7858  
##                     NA's                                             :   2  
##     host_id             host_name           neighbourhood_group
##  Min.   :    23666   Jay     : 290   Central Region   :6309    
##  1st Qu.: 23058075   Alvin   : 249   East Region      : 508    
##  Median : 63448912   Richards: 157   North-East Region: 346    
##  Mean   : 91144807   Aaron   : 145   North Region     : 204    
##  3rd Qu.:155381142   Rain    : 115   West Region      : 540    
##  Max.   :288567551   Darcy   : 114                             
##                      (Other) :6837                             
##      neighbourhood     latitude       longitude               room_type   
##  Kallang    :1043   Min.   :1.244   Min.   :103.6   Entire home/apt:4132  
##  Geylang    : 994   1st Qu.:1.296   1st Qu.:103.8   Private room   :3381  
##  Novena     : 537   Median :1.311   Median :103.8   Shared room    : 394  
##  Rochor     : 536   Mean   :1.314   Mean   :103.8                         
##  Outram     : 477   3rd Qu.:1.322   3rd Qu.:103.9                         
##  Bukit Merah: 470   Max.   :1.455   Max.   :104.0                         
##  (Other)    :3850                                                         
##      price         minimum_nights    number_of_reviews  last_review        
##  Min.   :    0.0   Min.   :   1.00   Min.   :  0.00    Min.   :2013-10-21  
##  1st Qu.:   65.0   1st Qu.:   1.00   1st Qu.:  0.00    1st Qu.:2018-11-21  
##  Median :  124.0   Median :   3.00   Median :  2.00    Median :2019-06-27  
##  Mean   :  169.3   Mean   :  17.51   Mean   : 12.81    Mean   :2019-01-11  
##  3rd Qu.:  199.0   3rd Qu.:  10.00   3rd Qu.: 10.00    3rd Qu.:2019-08-07  
##  Max.   :10000.0   Max.   :1000.00   Max.   :323.00    Max.   :2019-08-27  
##                                                        NA's   :2758        
##  reviews_per_month calculated_host_listings_count availability_365
##  Min.   : 0.010    Min.   :  1.00                 Min.   :  0.0   
##  1st Qu.: 0.180    1st Qu.:  2.00                 1st Qu.: 54.0   
##  Median : 0.550    Median :  9.00                 Median :260.0   
##  Mean   : 1.044    Mean   : 40.61                 Mean   :208.7   
##  3rd Qu.: 1.370    3rd Qu.: 48.00                 3rd Qu.:355.0   
##  Max.   :13.000    Max.   :274.00                 Max.   :365.0   
##  NA's   :2758

From the descriptive statistics above, we can conclude:

  • Jay is the host name with the most listings (290). Host names are not unique identifiers, so this count may combine several hosts
  • Central Region has the most listings (6,309) and North Region has the fewest (204)
  • At neighbourhood level, Kallang has the most listings (1,043)
  • There are 3 types of listing: ‘Entire home/apt’, ‘Private room’ and ‘Shared room’
  • The median price is SGD 124
  • The mean minimum stay is about 17.5 nights, but the median is only 3 nights, so the mean is pulled up by a few listings with very long minimum stays (up to 1,000 nights)
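The figures in this list can be recomputed directly rather than read off the summary() printout. A short sketch on a toy tibble with made-up rows, using the same column names as airsg:

```r
library(dplyr)

# Toy stand-in for airsg (a handful of made-up rows)
toy <- tibble(
  neighbourhood_group = c("Central Region", "Central Region",
                          "North Region", "Central Region"),
  price = c(81, 206, 83, 94),
  minimum_nights = c(90, 1, 180, 1)
)

# Number of listings per region, largest first
region_counts <- toy %>%
  count(neighbourhood_group, sort = TRUE)

# Median price, plus mean and median of the minimum-stay requirement
stay_summary <- toy %>%
  summarise(
    median_price = median(price),
    mean_min_nights = mean(minimum_nights),
    median_min_nights = median(minimum_nights)
  )
```

Reporting the mean and the median of minimum_nights side by side shows how strongly a few long-stay listings affect the mean. Note that minimum_nights is a requirement set by the host, not the length of actual stays.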

Visualization and Analysis

Price

Now let’s take a look at the price distribution by room type.

plot1 <- airsg %>% 
  select(room_type,price,name) %>%
  mutate(label=glue(
    "{name}
    Price in SGD: {price}
    Room Type: {room_type}"
  )) %>% 
  ggplot(aes(x=room_type,y=price,text=label)) +
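  geom_jitter(aes(col=room_type),cex=1.8,shape=8,alpha=0.3,show.legend = FALSE) + 
  # Listings priced above SGD 500 fall outside these limits and are not drawn
  scale_y_continuous(limits=c(0,500),breaks=seq(0,500,100))+
  # Dashed red line: median price over all listings
  geom_hline(aes(yintercept=median(price, na.rm=TRUE)),   
               color="red", linetype="dashed", size=1)+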
  labs(title = "Price Distribution to Room Type",
       subtitle = "AirBnb Singapore",
       caption= "Data Source: Kaggle.com",
       x="Room Type",
       y="Price in SGD"
       )+
  theme_bw()
  
ggplotly(plot1,tooltip = "text")  %>% layout(showlegend = FALSE)

In the plot above, the price per night for an entire home/apt tends to be higher than for a private room or a shared room. The dashed red line marks the overall median price.
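Because the y axis is limited to SGD 0 to 500, listings priced above that are not drawn, so the visual impression is worth backing with numbers. A sketch on toy data (made-up prices; price_cap is our name for the plot's upper limit):

```r
library(dplyr)

# Toy stand-in for airsg (made-up prices, including one extreme value)
toy <- tibble(
  room_type = c("Entire home/apt", "Entire home/apt",
                "Private room", "Private room", "Shared room"),
  price = c(206, 10000, 83, 69, 50)
)

price_cap <- 500  # same upper limit as the plot's y axis

# Median price per room type, and how many listings the plot would hide
by_type <- toy %>%
  group_by(room_type) %>%
  summarise(
    median_price = median(price),
    n_above_cap = sum(price > price_cap),
    .groups = "drop"
  )
```

If the goal is only to zoom in, coord_cartesian(ylim = c(0, 500)) limits the visible range without removing the out-of-range observations from the plot data.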

Now let’s take a look at the price distribution by neighbourhood group.

plot2 <- airsg %>% 
  select(neighbourhood_group,neighbourhood,price,name) %>%
  mutate(label=glue(
    "{name}
    Price in SGD: {price}
    Region: {neighbourhood_group}
    Area: {neighbourhood}"
  )) %>% 
  ggplot(aes(y=price,x=neighbourhood_group,text=label)) +
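  geom_jitter(aes(col=neighbourhood_group),
              alpha=0.3,show.legend = FALSE,cex=1.8,shape=8) + 
  # Only the colour aesthetic is mapped, so no fill scale is needed
  scale_color_viridis(discrete = TRUE)+
  # Listings priced above SGD 500 fall outside these limits and are not drawn
  scale_y_continuous(limits=c(0,500),breaks=seq(0,500,100))+
  # Dashed red line: median price over all listings
  geom_hline(aes(yintercept=median(price, na.rm=TRUE)),   
               color="red", linetype="dashed", size=1) +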
  labs(title = "Price Distribution to Neighbourhood Group Region",
       subtitle = "AirBnb Singapore",
       caption= "Data Source: Kaggle.com",
       x="",
       y="Price in SGD",
       fill="",
       col="")+
  theme_bw()
  
ggplotly(plot2,tooltip = "text")  %>% layout(showlegend = FALSE)

In the plot above, prices in the Central Region tend to be higher than in the other regions. This is expected, since this region is home to Singapore’s famous tourist spots.

Since prices in the Central Region are generally higher, let’s now take a look at the price distribution within the Central Region.

plot3 <- airsg %>% 
  filter(neighbourhood_group=="Central Region") %>% 
  select(neighbourhood,price,name) %>% 
  mutate(label=glue(
    "{name}
    Price in SGD: {price}
    {neighbourhood}"
  )) %>% 
  ggplot(aes(y=reorder(neighbourhood,price),x=price,text=label)) +
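  geom_point(aes(col=neighbourhood),alpha=0.4,show.legend = FALSE) + 
  # Listings priced above SGD 3,000 fall outside these limits and are not drawn
  scale_x_continuous(limits=c(0,3000),breaks=seq(0,3000,500))+
  # Dashed line: median price of Central Region listings
  geom_vline(aes(xintercept=median(price, na.rm=TRUE)),   
               color="dodgerblue4", linetype="dashed", size=1) +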
  labs(title = "Price Distribution in Central Region",
       subtitle = "AirBnb Singapore",
       caption= "Data Source: Kaggle.com",
       x="Price in SGD",
       y="",
       col="")+
  theme_bw()
  
ggplotly(plot3,tooltip = "text")  %>% layout(showlegend = FALSE)

In the plot above, all of the listings in Southern Islands and Marina South are priced above the Central Region median (the dashed line).

Now let us check the top 10 hosts by average listing price.

ploth <- airsg %>% 
  group_by(host_name) %>% 
  summarise(avg_price=mean(price)) %>% 
  arrange(desc(avg_price)) %>% 
  head(10) %>% 
  mutate(label=glue(
    "Host Name: {host_name}
     avg price: {round(avg_price, 0)}"
  )) %>% 
  ggplot(aes(x=avg_price,y=reorder(host_name,avg_price),text=label))+
  geom_col(aes(fill=host_name),show.legend = F)+
  scale_fill_viridis(discrete = TRUE)+
  scale_color_viridis(discrete = TRUE)+
  labs(title="Top 10 Hosts by Average Price per Listing",
       x="Price in SGD",
       y="")+
  theme_bw()

ggplotly(ploth,tooltip = "text") %>% layout(showlegend=F)

Based on the plot above, Yolivia’s listings are, on average, the most expensive in Singapore.
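The chart above groups by host_name, and different hosts can share the same name. A sketch on toy data (made-up hosts and prices) of how grouping by host_id as well changes the result:

```r
library(dplyr)

# Toy stand-in: two different hosts are both called "Jay"
toy <- tibble(
  host_id = c(1, 1, 2, 3),
  host_name = c("Jay", "Jay", "Jay", "Alvin"),
  price = c(100, 200, 900, 150)
)

# Grouping by name merges the two hosts called "Jay"
by_name <- toy %>%
  group_by(host_name) %>%
  summarise(avg_price = mean(price), n_listings = n(), .groups = "drop")

# Grouping by id (keeping the name for display) separates them
by_id <- toy %>%
  group_by(host_id, host_name) %>%
  summarise(avg_price = mean(price), n_listings = n(), .groups = "drop")
```

Whether this matters for the real data depends on how many host_id values share a name, which can be checked with n_distinct(host_id) per host_name.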

Room Listings

Now we will take a more detailed look at the room listings.

Let us check the number of listings in every region in Singapore.

plotpop <- airsg %>% 
  select(neighbourhood_group) %>% 
  count(neighbourhood_group) %>% 
  mutate(label=glue(
    "number of listing: {n} rooms
     {neighbourhood_group}"
  )) %>% 
  ggplot(aes(y=reorder(neighbourhood_group,n),x=n,text=label))+
  geom_col(aes(fill=neighbourhood_group)) +
  labs(title= "Population of Listings per Region",
       x="Number of Listings", 
       y = NULL) +
    theme_bw()

ggplotly(plotpop,tooltip = "text") %>% layout(showlegend = FALSE)

The plot above shows that the Central Region has the most listings in Singapore.

Now let us check the number of listings for every room type in Singapore.

plotr <- airsg %>%
  select(room_type) %>%
  count(room_type) %>%
  mutate(label = glue(
    "number of listing: {n} rooms
     {room_type}")) %>%
  ggplot(aes(
    x = reorder(room_type, n),
    y = n,
    text = label
  )) +
  geom_col(aes(fill = room_type)) +
  labs(title = "Population of Listings per Room Type",
       x = NULL,
       y = "Number of Listings") +
  theme_bw()

ggplotly(plotr, tooltip = "text") %>% layout(showlegend = FALSE)

The plot above shows that Entire home/apt is the most common listing type in Singapore.

Let us check the average availability in a year for each room type in every region.

plot4 <- airsg %>%
  group_by(neighbourhood_group, room_type) %>%
  summarise(average_avail = mean(availability_365)) %>%
  mutate(
    label = glue(
    "Avg availability {round(average_avail,0)} days
    Type: {room_type}
    {neighbourhood_group}"
    )
  ) %>%
  ggplot(aes(
    y = reorder(neighbourhood_group, average_avail),
    x = average_avail,
    text = label
  )) +
  geom_col(aes(fill = room_type), position = "dodge", alpha = 0.8) +
  scale_fill_viridis(discrete = TRUE) +
  scale_color_viridis(discrete = TRUE) +
  facet_wrap( ~ room_type, scales = "free_x") +
  labs(
    title = "Average Days of Availability in a Year",
    subtitle = "AirBnb Singapore",
    caption = "Data Source: Kaggle.com",
    x = "",
    y = "",
    fill = ""
  ) +
  theme_bw()

ggplotly(plot4, tooltip = "text") %>% layout(legend = list(
  orientation = "h",
  x = 0.2,
  y = -0.1
))

Based on the plot above, for shared rooms, availability is highest in the Central Region. For private rooms, availability is highest in the North Region. For entire homes/apartments, availability is highest in the Central Region.

Next, let us look at the top 10 host names by number of listings.

plothn <- airsg %>%
  # count() groups by host_name itself; sort = TRUE orders by descending n
  count(host_name, sort = TRUE) %>%
  head(10) %>%
  mutate(label = glue(
    "Host Name: {host_name}
     Total Listing: {n} rooms")) %>%
  ggplot(aes(
    x = n,
    y = reorder(host_name, n),
    text = label
  )) +
  geom_col(aes(fill = host_name), show.legend = F) +
  scale_fill_viridis(discrete = TRUE) +
  scale_color_viridis(discrete = TRUE) +
  labs(title = "Top 10 Hosts by Number of Listings",
       x = "Number of Listings",
       y = "") +
  theme_bw()

ggplotly(plothn, tooltip = "text") %>% layout(showlegend = F)

Based on the plot above, Jay has the most listings, with a whopping 290 rooms.

Mapping

Now let us take a look at the map below, which shows the number of listings in Singapore.

Pic <- makeIcon(
  iconUrl = "images (1).png",
  iconWidth = 100 * 0.35,
  iconHeight = 100 * 0.35
)

map <- leaflet()
map <- addTiles(map)

map <- addMarkers(
  map,
  lng = airsg$longitude,
  lat = airsg$latitude,
  popup = airsg$name,
  clusterOptions = markerClusterOptions(),
  icon = Pic
)

map
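The map above depends on a local icon file. A variant that needs no image file is sketched below with clustered circle markers on toy data; the coordinates are taken from the first rows of the glimpse() output, and the listing names are placeholders:

```r
library(leaflet)

# Toy stand-in: three listings (placeholder names)
toy <- data.frame(
  name = c("Listing A", "Listing B", "Listing C"),
  latitude = c(1.44255, 1.33235, 1.34541),
  longitude = c(103.7958, 103.7852, 103.9571)
)

# Same step-by-step style as above, but with circle markers instead of icons
map_sketch <- leaflet(data = toy)
map_sketch <- addTiles(map_sketch)
map_sketch <- addCircleMarkers(
  map_sketch,
  lng = ~longitude,
  lat = ~latitude,
  popup = ~name,
  radius = 6,
  clusterOptions = markerClusterOptions()
)

# map_sketch  # run this line in an interactive session to display the map
```

Passing the data frame to leaflet() allows the formula interface (~longitude, ~latitude), which avoids repeating the data frame name in every argument.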