Airbnb, Inc. operates an online marketplace for lodging, primarily homestays for vacation rentals, and tourism activities. It is based in San Francisco, California. The platform is accessible via website and mobile app. Airbnb does not own any of the listed properties; instead, it profits by charging a commission on each booking. The company was founded in 2008 by Brian Chesky, Nathan Blecharczyk and Joe Gebbia. Airbnb is a shortened version of its original name, AirBedandBreakfast.com. It currently covers more than 100,000 cities and 220 countries and regions worldwide, including Singapore.

Now we are going to look at Airbnb listing data for Singapore. The data was sourced from Kaggle.com and, according to the website, collected on 28 August 2019.

Exploratory and Explanatory Data Analysis

Before starting the exploratory and explanatory data analysis, we load all the libraries needed to support it.

library(lubridate)
library(scales)
library(readr)
library(dplyr)
library(ggplot2)
library(plotly)
library(tidyr)
library(glue)
library(viridis)
library(leaflet)
library(treemapify)
library(skimr)

After loading the libraries, we read the data and check its structure.

air <- read_csv("airbnb/listings.csv")

glimpse(air)

## Rows: 7,907
## Columns: 16
## $ id                             <dbl> 49091, 50646, 56334, 71609, 71896, 7190~
## $ name                           <chr> "COZICOMFORT LONG TERM STAY ROOM 2", "P~
## $ host_id                        <dbl> 266763, 227796, 266763, 367042, 367042,~
## $ host_name                      <chr> "Francesca", "Sujatha", "Francesca", "B~
## $ neighbourhood_group            <chr> "North Region", "Central Region", "Nort~
## $ neighbourhood                  <chr> "Woodlands", "Bukit Timah", "Woodlands"~
## $ latitude                       <dbl> 1.44255, 1.33235, 1.44246, 1.34541, 1.3~
## $ longitude                      <dbl> 103.7958, 103.7852, 103.7967, 103.9571,~
## $ room_type                      <chr> "Private room", "Private room", "Privat~
## $ price                          <dbl> 83, 81, 69, 206, 94, 104, 208, 50, 54, ~
## $ minimum_nights                 <dbl> 180, 90, 6, 1, 1, 1, 1, 90, 90, 90, 15,~
## $ number_of_reviews              <dbl> 1, 18, 20, 14, 22, 39, 25, 174, 198, 23~
## $ last_review                    <date> 2013-10-21, 2014-12-26, 2015-10-01, 20~
## $ reviews_per_month              <dbl> 0.01, 0.28, 0.20, 0.15, 0.22, 0.38, 0.2~
## $ calculated_host_listings_count <dbl> 2, 1, 2, 9, 9, 9, 9, 4, 4, 4, 32, 32, 9~
## $ availability_365               <dbl> 365, 365, 365, 353, 355, 346, 172, 59, ~

The data has 7,907 rows and 16 columns. Checking the data type of every column, we find 5 character columns that we will convert to factors:

name
host_name
neighbourhood_group
neighbourhood
room_type

airsg <- air %>% 
  mutate(name = as.factor(name),
         host_name = as.factor(host_name),
         neighbourhood_group = as.factor(neighbourhood_group),
         neighbourhood = as.factor(neighbourhood),
         room_type = as.factor(room_type))

head(airsg)
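The same conversion can be written more compactly with `across()`. The sketch below runs on a small made-up tibble (`toy`, not the real data) and converts only the columns named in `factor_cols`, a hypothetical helper vector. Whether free-text columns such as `name` benefit from being factors is a judgment call, since nearly every value is unique.

```r
library(dplyr)

# Toy stand-in for `air`; all values are made up for illustration
toy <- tibble(
  name = c("Room A", "Room B", "Room C"),
  neighbourhood_group = c("Central Region", "North Region", "Central Region"),
  room_type = c("Private room", "Shared room", "Private room"),
  price = c(80, 35, 120)
)

# Hypothetical helper: the columns we want as factors
factor_cols <- c("neighbourhood_group", "room_type")

# Apply as.factor() to every column listed in factor_cols
toy_fct <- toy %>%
  mutate(across(all_of(factor_cols), as.factor))

levels(toy_fct$room_type)
```

With the real data, `factor_cols` would list the five columns named above.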

Now let us check the data once more and skim the details to see whether any cleaning is necessary.

skim(airsg)

Data summary
Name	airsg
Number of rows	7907
Number of columns	16
_______________________
Column type frequency:
Date	1
factor	5
numeric	10
________________________
Group variables	None

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
last_review	2758	0.65	2013-10-21	2019-08-27	2019-06-27	1001

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
name	2	1	FALSE	7457	Lux: 13, Inv: 9, Stu: 9, Cit: 8
host_name	0	1	FALSE	1833	Jay: 290, Alv: 249, Ric: 157, Aar: 145
neighbourhood_group	0	1	FALSE	5	Cen: 6309, Wes: 540, Eas: 508, Nor: 346
neighbourhood	0	1	FALSE	43	Kal: 1043, Gey: 994, Nov: 537, Roc: 536
room_type	0	1	FALSE	3	Ent: 4132, Pri: 3381, Sha: 394

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
id	0	1.00	23388624.63	10164162.07	49091.00	15821800.50	24706270.00	32348500.00	38112762.00	▂▃▅▆▇
host_id	0	1.00	91144807.41	81909095.31	23666.00	23058075.00	63448912.00	155381142.00	288567551.00	▇▃▂▂▁
latitude	0	1.00	1.31	0.03	1.24	1.30	1.31	1.32	1.45	▂▇▂▁▁
longitude	0	1.00	103.85	0.04	103.65	103.84	103.85	103.87	103.97	▁▁▃▇▁
price	0	1.00	169.33	340.19	0.00	65.00	124.00	199.00	10000.00	▇▁▁▁▁
minimum_nights	0	1.00	17.51	42.09	1.00	1.00	3.00	10.00	1000.00	▇▁▁▁▁
number_of_reviews	0	1.00	12.81	29.71	0.00	0.00	2.00	10.00	323.00	▇▁▁▁▁
reviews_per_month	2758	0.65	1.04	1.29	0.01	0.18	0.55	1.37	13.00	▇▁▁▁▁
calculated_host_listings_count	0	1.00	40.61	65.14	1.00	2.00	9.00	48.00	274.00	▇▁▁▁▁
availability_365	0	1.00	208.73	146.12	0.00	54.00	260.00	355.00	365.00	▅▂▂▂▇

We found a large number of missing values (2,758 rows, about 35% of the data) in the “reviews_per_month” and “last_review” columns. We probably will not use these columns for analysis, so for the moment we keep them and will transform them later if necessary.

Now let us look at the summary of the data to check the descriptive statistics for each column.
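One plausible explanation, which the output above does not confirm, is that `reviews_per_month` is missing exactly for listings that have no reviews at all. The sketch below shows how that could be checked, using a made-up tibble (`toy`) in place of `airsg`.

```r
library(dplyr)

# Toy stand-in for `airsg`; all values are made up for illustration
toy <- tibble(
  number_of_reviews = c(0, 5, 0, 12),
  reviews_per_month = c(NA, 0.4, NA, 1.1)
)

# Cross-tabulate "no reviews" against "review rate missing"
pattern <- toy %>%
  count(no_reviews = number_of_reviews == 0,
        rate_missing = is.na(reviews_per_month))
pattern

# If the two always coincide, a missing rate can be read as zero
toy_filled <- toy %>%
  mutate(reviews_per_month = coalesce(reviews_per_month, 0))
```

If the cross-tabulation on the real data showed missing rates for listings that do have reviews, filling with zero would not be appropriate.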

summary(airsg)

##        id                                                          name     
##  Min.   :   49091   Luxury hostel with in-cabin locker - Single mixed:  13  
##  1st Qu.:15821800   Inviting & Cozy 1BR APT 3 mins from Tg Pagar MRT :   9  
##  Median :24706270   Studio Apartment - Oakwood Premier               :   9  
##  Mean   :23388625   City-located 1BR loft apartment *BRAND NEW*      :   8  
##  3rd Qu.:32348500   Stylish 1BR Located 7 mins from Tg Pagar MRT     :   8  
##  Max.   :38112762   (Other)                                          :7858  
##                     NA's                                             :   2  
##     host_id             host_name           neighbourhood_group
##  Min.   :    23666   Jay     : 290   Central Region   :6309    
##  1st Qu.: 23058075   Alvin   : 249   East Region      : 508    
##  Median : 63448912   Richards: 157   North-East Region: 346    
##  Mean   : 91144807   Aaron   : 145   North Region     : 204    
##  3rd Qu.:155381142   Rain    : 115   West Region      : 540    
##  Max.   :288567551   Darcy   : 114                             
##                      (Other) :6837                             
##      neighbourhood     latitude       longitude               room_type   
##  Kallang    :1043   Min.   :1.244   Min.   :103.6   Entire home/apt:4132  
##  Geylang    : 994   1st Qu.:1.296   1st Qu.:103.8   Private room   :3381  
##  Novena     : 537   Median :1.311   Median :103.8   Shared room    : 394  
##  Rochor     : 536   Mean   :1.314   Mean   :103.8                         
##  Outram     : 477   3rd Qu.:1.322   3rd Qu.:103.9                         
##  Bukit Merah: 470   Max.   :1.455   Max.   :104.0                         
##  (Other)    :3850                                                         
##      price         minimum_nights    number_of_reviews  last_review        
##  Min.   :    0.0   Min.   :   1.00   Min.   :  0.00    Min.   :2013-10-21  
##  1st Qu.:   65.0   1st Qu.:   1.00   1st Qu.:  0.00    1st Qu.:2018-11-21  
##  Median :  124.0   Median :   3.00   Median :  2.00    Median :2019-06-27  
##  Mean   :  169.3   Mean   :  17.51   Mean   : 12.81    Mean   :2019-01-11  
##  3rd Qu.:  199.0   3rd Qu.:  10.00   3rd Qu.: 10.00    3rd Qu.:2019-08-07  
##  Max.   :10000.0   Max.   :1000.00   Max.   :323.00    Max.   :2019-08-27  
##                                                        NA's   :2758        
##  reviews_per_month calculated_host_listings_count availability_365
##  Min.   : 0.010    Min.   :  1.00                 Min.   :  0.0   
##  1st Qu.: 0.180    1st Qu.:  2.00                 1st Qu.: 54.0   
##  Median : 0.550    Median :  9.00                 Median :260.0   
##  Mean   : 1.044    Mean   : 40.61                 Mean   :208.7   
##  3rd Qu.: 1.370    3rd Qu.: 48.00                 3rd Qu.:355.0   
##  Max.   :13.000    Max.   :274.00                 Max.   :365.0   
##  NA's   :2758

Looking at the descriptive statistics above, we can conclude:

The host name with the most listings in Singapore is Jay (290 listings)
Central Region has the most Airbnb listings in Singapore (6,309) and North Region has the fewest (204)
In terms of neighbourhood, Kallang has the most listings (1,043)
There are 3 types of listing: ‘Entire home/apt’, ‘Private room’ and ‘Shared room’
The median price is SGD 124; the mean is higher, at about SGD 169, because of a few very expensive listings (the maximum is SGD 10,000)
The mean minimum stay is about 17.5 nights, but the median minimum stay is only 3 nights
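The `minimum_nights` column is heavily right-skewed in the summary above (mean 17.51, median 3, maximum 1,000), so the mean and the median tell different stories. A minimal illustration with made-up values, chosen only to mimic a long right tail:

```r
# Hypothetical minimum-night requirements with one extreme value
min_nights <- c(1, 1, 1, 2, 3, 3, 7, 30, 90, 1000)

# The single 1000-night listing pulls the mean far above a typical value
mean_nights <- mean(min_nights)

# The median is unaffected by how extreme the largest value is
median_nights <- median(min_nights)

c(mean = mean_nights, median = median_nights)
```

For skewed columns like this one, and like `price`, the median is usually the safer summary of a typical listing.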

Visualization and Analysis

Price

Now let’s take a look at the price distribution by room type.

plot1 <- airsg %>% 
  select(room_type,price,name) %>%
  mutate(label=glue(
    "{name}
    Price in SGD: {price}
    Room Type: {room_type}"
  )) %>% 
  ggplot(aes(x=room_type,y=price,text=label)) +
  geom_jitter(aes(col=room_type),cex=1.8,shape=8,alpha=0.3,show.legend = F) + 
  scale_y_continuous(limits=c(0,500),breaks=seq(0,500,100))+
  geom_hline(aes(yintercept=median(price, na.rm=T)),   
               color="red", linetype="dashed", size=1)+
  labs(title = "Price Distribution to Room Type",
       subtitle = "AirBnb Singapore",
       caption= "Data Source: Kaggle.com",
       x="Room Type",
       y="Price in SGD"
       )+
  theme_bw()
  
ggplotly(plot1,tooltip = "text")  %>% layout(showlegend = FALSE)

In the plot above, prices per night for entire homes/apartments are generally higher than for private rooms and shared rooms. The red dashed line marks the overall median price, and listings priced above SGD 500 are not shown.

Now let’s take a look at the price distribution by neighbourhood group.
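A jitter plot shows the spread, but a grouped summary makes the comparison between room types explicit. The sketch below computes the median price per room type on a made-up tibble (`toy`); the prices are illustrative and not taken from the data.

```r
library(dplyr)

# Toy stand-in for `airsg`; prices are made up for illustration
toy <- tibble(
  room_type = c("Entire home/apt", "Entire home/apt", "Private room",
                "Private room", "Shared room", "Shared room"),
  price = c(180, 220, 70, 90, 30, 50)
)

# Median price and listing count per room type, most expensive first
price_by_type <- toy %>%
  group_by(room_type) %>%
  summarise(median_price = median(price), n = n()) %>%
  arrange(desc(median_price))

price_by_type
```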

plot2 <- airsg %>% 
  select(neighbourhood_group,neighbourhood,price,name) %>%
  mutate(label=glue(
    "{name}
    Price in SGD: {price}
    Region: {neighbourhood_group}
    Area: {neighbourhood}"
  )) %>% 
  ggplot(aes(y=price,x=neighbourhood_group,text=label)) +
  geom_jitter(aes(col=neighbourhood_group)
              ,alpha=0.3,show.legend = F,cex=1.8,shape=8) + 
    scale_fill_viridis(discrete = TRUE)+
  scale_color_viridis(discrete = TRUE)+
  scale_y_continuous(limits=c(0,500),breaks=seq(0,500,100))+
  geom_hline(aes(yintercept=median(price, na.rm=T)),   
               color="red", linetype="dashed", size=1) +
  labs(title = "Price Distribution to Neighbourhood Group Region",
       subtitle = "AirBnb Singapore",
       caption= "Data Source: Kaggle.com",
       x="",
       y="Price in SGD",
       fill="",
       col="")+
  theme_bw()
  
ggplotly(plot2,tooltip = "text")  %>% layout(showlegend = FALSE)

In the plot above, prices in the Central Region are generally higher than in the other regions. This is expected, since this region is home to Singapore's famous tourist spots.

Since prices in the Central Region are generally high, let’s now look at the price distribution across the neighbourhoods within the Central Region.
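Because the y-axis is capped at SGD 500, listings priced above that are not drawn. It can be worth counting how many listings each region loses to the cap. The sketch below does this on a made-up tibble (`toy`); `cap` is a hypothetical variable holding the axis limit.

```r
library(dplyr)

# Toy stand-in for `airsg`; prices are made up for illustration
toy <- tibble(
  neighbourhood_group = c("Central Region", "Central Region", "Central Region",
                          "East Region", "East Region"),
  price = c(120, 650, 900, 80, 150)
)

cap <- 500  # upper axis limit used in the plots

# Per region: total listings, and how many lie above the cap
hidden <- toy %>%
  group_by(neighbourhood_group) %>%
  summarise(
    n = n(),
    n_above_cap = sum(price > cap),
    share_above_cap = mean(price > cap)
  )

hidden
```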

plot3 <- airsg %>% 
  filter(neighbourhood_group=="Central Region") %>% 
  select(neighbourhood,price,name) %>% 
  mutate(label=glue(
    "{name}
    Price in SGD: {price}
    {neighbourhood}"
  )) %>% 
  ggplot(aes(y=reorder(neighbourhood,price),x=price,text=label)) +
  geom_point(aes(col=neighbourhood),alpha=0.4,show.legend = F) + 
  scale_x_continuous(limits=c(0,3000),breaks=seq(0,3000,500))+
  geom_vline(aes(xintercept=median(price, na.rm=T)),   
               color="dodgerblue4", linetype="dashed", size=1) +
  labs(title = "Price Distribution in Central Region",
       subtitle = "AirBnb Singapore",
       caption= "Data Source: Kaggle.com",
       x="Price in SGD",
       y="",
       col="")+
  theme_bw()
  
ggplotly(plot3,tooltip = "text")  %>% layout(showlegend = FALSE)

In the plot above, every listing in Southern Islands and Marina South is priced above the Central Region median (the dashed line).

Now let us check the top 10 hosts by average listing price.
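A claim that every listing in a neighbourhood sits above the regional median can be checked directly by comparing each neighbourhood's minimum price with that median. The sketch below uses a made-up tibble (`toy`); the neighbourhood names come from the data, but the prices are illustrative.

```r
library(dplyr)

# Toy stand-in for the Central Region subset; prices are made up
toy <- tibble(
  neighbourhood = c("Marina South", "Marina South", "Kallang",
                    "Kallang", "Kallang", "Outram"),
  price = c(300, 450, 60, 110, 140, 130)
)

# Median over all listings in the (toy) region
region_median <- median(toy$price)

# Neighbourhoods whose cheapest listing is still above the regional median
all_above <- toy %>%
  group_by(neighbourhood) %>%
  summarise(n = n(), min_price = min(price)) %>%
  filter(min_price > region_median)

all_above
```

Including `n` in the result shows whether a neighbourhood qualifies simply because it has very few listings.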

ploth <- airsg %>% 
  group_by(host_name) %>% 
  summarise(avg_price=mean(price)) %>% 
  arrange(desc(avg_price)) %>% 
  head(10) %>% 
  mutate(label=glue(
    "Host Name: {host_name}
     avg price: {avg_price}"
  )) %>% 
  ggplot(aes(x=avg_price,y=reorder(host_name,avg_price),text=label))+
  geom_col(aes(fill=host_name),show.legend = F)+
  scale_fill_viridis(discrete = TRUE)+
  scale_color_viridis(discrete = TRUE)+
  labs(title="Top 10 Host with The Average Highest Price per Listing",
       x="Average Price in SGD",
       y="")+
  theme_bw()

ggplotly(ploth,tooltip = "text") %>% layout(showlegend=F)

Based on the plot above, Yolivia’s listings have the highest average price in Singapore.
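A ranking by average price can be dominated by hosts with a single, very expensive listing. Reporting the listing count next to the average makes this visible. The sketch below uses a made-up tibble (`toy`) with hypothetical host names "A", "B" and "C".

```r
library(dplyr)

# Toy stand-in for `airsg`; hosts and prices are made up
toy <- tibble(
  host_name = c("A", "B", "B", "B", "C", "C"),
  price = c(2000, 150, 200, 250, 90, 110)
)

# Average price and number of listings per host, highest average first
top_hosts <- toy %>%
  group_by(host_name) %>%
  summarise(avg_price = mean(price), listings = n()) %>%
  arrange(desc(avg_price))

top_hosts
```

Here host "A" tops the ranking on the strength of one listing, which is a weaker result than a high average over many listings.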

Room Listings

Now we will take a more detailed look at the room listings.

Let us check the number of listings in every region in Singapore.

plotpop <- airsg %>% 
  select(neighbourhood_group) %>% 
  count(neighbourhood_group) %>% 
  mutate(label=glue(
    "number of listing: {n} rooms
     {neighbourhood_group}"
  )) %>% 
  ggplot(aes(y=reorder(neighbourhood_group,n),x=n,text=label))+
  geom_col(aes(fill=neighbourhood_group)) +
  labs(title= "Population of Listings per Region",
       x="Numbers of Listings", 
       y = NULL) +
    theme_bw()

ggplotly(plotpop,tooltip = "text") %>% layout(showlegend = FALSE)

In the plot above, the Central Region has by far the most listings in Singapore.

Now let us check the number of listings for each room type in Singapore.

plotr <- airsg %>%
  select(room_type) %>%
  count(room_type) %>%
  mutate(label = glue(
    "number of listing: {n} rooms
     {room_type}")) %>%
  ggplot(aes(
    x = reorder(room_type, n),
    y = n,
    text = label
  )) +
  geom_col(aes(fill = room_type)) +
  # room_type is mapped to x and the count to y, so label the axes accordingly
  labs(title = "Population of Listings per Room Type",
       x = NULL,
       y = "Number of Listings") +
  theme_bw()

ggplotly(plotr, tooltip = "text") %>% layout(showlegend = FALSE)

In the plot above, Entire home/apt is the most common listing type in Singapore.

Let us check the average availability over a year for each room type in every region.
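Counts are easier to compare as shares of the total. The sketch below takes the three room-type counts from the `summary()` output above and converts them to percentages; `room_counts` and `share_pct` are names introduced here for illustration.

```r
library(dplyr)

# Counts copied from the summary() output above
room_counts <- tibble(
  room_type = c("Entire home/apt", "Private room", "Shared room"),
  n = c(4132, 3381, 394)
)

# Share of all listings, in percent, rounded to one decimal place
room_share <- room_counts %>%
  mutate(share_pct = round(100 * n / sum(n), 1))

room_share
```

The same `mutate()` step can be added after `count(neighbourhood_group)` to get the share per region.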

plot4 <- airsg %>%
  group_by(neighbourhood_group, room_type) %>%
  summarise(average_avail = mean(availability_365)) %>%
  mutate(
    label = glue(
    "Avg availability {round(average_avail,0)} days
    Type: {room_type}
    {neighbourhood_group}"
    )
  ) %>%
  ggplot(aes(
    y = reorder(neighbourhood_group, average_avail),
    x = average_avail,
    text = label
  )) +
  geom_col(aes(fill = room_type), position = "dodge", alpha = 0.8) +
  scale_fill_viridis(discrete = TRUE) +
  scale_color_viridis(discrete = TRUE) +
  facet_wrap( ~ room_type, scales = "free_x") +
  labs(
    title = "Average Days Availability in a Year",
    subtitle = "AirBnb Singapore",
    caption = "Data Source: Kaggle.com",
    x = "",
    y = "",
    fill = ""
  ) +
  theme_bw()

ggplotly(plot4, tooltip = "text") %>% layout(legend = list(
  orientation = "h",
  x = 0.2,
  y = -0.1
))

Based on the plot above, shared rooms have higher availability in the Central Region than in the other regions, private rooms have higher availability in the North Region, and entire homes/apartments have higher availability in the Central Region.

Next, let us check the top 10 hosts by number of listings.

plothn <- airsg %>%
  group_by(host_name) %>%
  count(host_name) %>%
  arrange(desc(n)) %>%
  head(10) %>%
  mutate(label = glue(
    "Host Name: {host_name}
     Total Listing: {n} rooms")) %>%
  ggplot(aes(
    x = n,
    y = reorder(host_name, n),
    text = label
  )) +
  geom_col(aes(fill = host_name), show.legend = F) +
  scale_fill_viridis(discrete = TRUE) +
  scale_color_viridis(discrete = TRUE) +
  labs(title = "Top 10 Host with The Most Number of Listing",
       x = "Number of Listing",
       y = "") +
  theme_bw()

ggplotly(plothn, tooltip = "text") %>% layout(showlegend = F)

Based on the plot above, the host name Jay has the most listings, with a whopping 290.
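Grouping by `host_name` merges different hosts who share a first name. The skim output hints that this may be happening here: the largest `calculated_host_listings_count` is 274, which is fewer than the 290 listings under the name Jay. Counting by `host_id` avoids the problem. The sketch below shows the difference on a made-up tibble (`toy`); the IDs and counts are hypothetical.

```r
library(dplyr)

# Toy stand-in for `airsg`: two different hosts are both called "Jay"
toy <- tibble(
  host_id   = c(1, 1, 2, 2, 3, 3, 3),
  host_name = c("Jay", "Jay", "Jay", "Jay", "Alvin", "Alvin", "Alvin")
)

# Counting by name lumps both Jays together
by_name <- toy %>% count(host_name, sort = TRUE)

# Counting by ID (keeping the name for display) separates them
by_id <- toy %>% count(host_id, host_name, sort = TRUE)

by_name
by_id
```

In this toy example "Jay" leads by name with 4 listings, while by ID the largest single host is Alvin with 3.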

Mapping

Now let us take a look at the map below, which shows where the listings are located and how many there are in each part of Singapore.

Pic <- makeIcon(
  iconUrl = "images (1).png",
  iconWidth = 100 * 0.35,
  iconHeight = 100 * 0.35
)

map <- leaflet()
map <- addTiles(map)

map <- addMarkers(
  map,
  lng = airsg$longitude,
  lat = airsg$latitude,
  popup = airsg$name,
  clusterOptions = markerClusterOptions(),
  icon = Pic
)

map