Data Visualisation: Singapore Airbnb

Husain Hidayat

12/29/2020

Last Update : March 24th, 2021

1 Introduction

The New Year holiday is just around the corner, so it is time to start planning a trip for the start of the year (hopefully the COVID-19 pandemic will end soon in 2021). Among the many travel destinations, Singapore is a favourite of Indonesians because it is relatively close to our country and offers a range of appealing attractions.

In today's digital era there are many alternative ways to find accommodation, and one of the more popular is Airbnb. Airbnb is a digital marketplace platform where users can list or look for accommodation (homestays) that matches their individual preferences. It is popular among travellers because it offers many kinds of homestays at competitive prices in almost every part of the world. Travellers can also share their holiday experiences and interact with other users, both fellow travellers and homestay owners.

In this article I try to draw insights from Airbnb homestay data for Singapore. The data comes from the open data (CC Public License) published by Inside Airbnb, which I retrieved in mid-December 2020. I hope the article gives readers a simple overview of Airbnb homestay data in Singapore. Interactive versions of the plots in this article are available on the following page.


2 Data Preparation

2.1 About the data

Below is a brief description of the features in the dataset.

| Variable | Description |
|---|---|
| id | Listing ID |
| name | Descriptive name of the homestay |
| host_id | Host ID |
| host_name | Host name |
| neighbourhood_group | Region in which the homestay is located |
| neighbourhood | Planning Area in which the homestay is located |
| latitude | Latitude of the homestay |
| longitude | Longitude of the homestay |
| room_type | Type of homestay |
| price | Rental price of the homestay (in USD) |
| minimum_nights | Minimum number of nights required to book the homestay |
| availability_365 | Number of days (out of 365) on which the homestay is available |
| number_of_reviews | Cumulative number of reviews of the homestay |

Further explanation of the data can be found in the reference About Inside Airbnb.



2.2 Import Library and Dataset

Load the required libraries

# for data wrangling
library(tidyr)
library(dplyr)
library(readr)
#library(lubridate)

#for data visualisation
library(ggplot2)
library(ggtext)
library(plotly)
library(ggthemes)
library(ggpubr)
library(leaflet)
library(ggmap)
library(knitr)
library(gridExtra)

Load Dataset

df_list_summary <- read_csv("data_input/listings_summary.csv")
glimpse(df_list_summary)
#> Rows: 4,492
#> Columns: 16
#> $ id                             <dbl> 49091, 50646, 56334, 71609, 71896, 7...
#> $ name                           <chr> "COZICOMFORT LONG TERM STAY ROOM 2",...
#> $ host_id                        <dbl> 266763, 227796, 266763, 367042, 3670...
#> $ host_name                      <chr> "Francesca", "Sujatha", "Francesca",...
#> $ neighbourhood_group            <chr> "North Region", "Central Region", "N...
#> $ neighbourhood                  <chr> "Woodlands", "Bukit Timah", "Woodlan...
#> $ latitude                       <dbl> 1.44255, 1.33235, 1.44246, 1.34541, ...
#> $ longitude                      <dbl> 103.7958, 103.7852, 103.7967, 103.95...
#> $ room_type                      <chr> "Private room", "Private room", "Pri...
#> $ price                          <dbl> 82, 80, 68, 179, 95, 82, 208, 52, 54...
#> $ minimum_nights                 <dbl> 180, 90, 6, 90, 90, 90, 1, 90, 90, 1...
#> $ number_of_reviews              <dbl> 1, 18, 20, 20, 24, 48, 29, 178, 199,...
#> $ last_review                    <date> 2013-10-21, 2014-12-26, 2015-10-01,...
#> $ reviews_per_month              <dbl> 0.01, 0.23, 0.18, 0.19, 0.21, 0.42, ...
#> $ calculated_host_listings_count <dbl> 2, 1, 2, 8, 8, 8, 8, 3, 3, 4, 4, 7, ...
#> $ availability_365               <dbl> 365, 365, 365, 365, 365, 365, 181, 3...

First 10 rows of the data

kable(head(df_list_summary, 10))
| id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 49091 | COZICOMFORT LONG TERM STAY ROOM 2 | 266763 | Francesca | North Region | Woodlands | 1.44255 | 103.7958 | Private room | 82 | 180 | 1 | 2013-10-21 | 0.01 | 2 | 365 |
| 50646 | Pleasant Room along Bukit Timah | 227796 | Sujatha | Central Region | Bukit Timah | 1.33235 | 103.7852 | Private room | 80 | 90 | 18 | 2014-12-26 | 0.23 | 1 | 365 |
| 56334 | COZICOMFORT | 266763 | Francesca | North Region | Woodlands | 1.44246 | 103.7967 | Private room | 68 | 6 | 20 | 2015-10-01 | 0.18 | 2 | 365 |
| 71609 | Ensuite Room (Room 1 & 2) near EXPO | 367042 | Belinda | East Region | Tampines | 1.34541 | 103.9571 | Private room | 179 | 90 | 20 | 2020-01-17 | 0.19 | 8 | 365 |
| 71896 | B&B Room 1 near Airport & EXPO | 367042 | Belinda | East Region | Tampines | 1.34567 | 103.9596 | Private room | 95 | 90 | 24 | 2019-10-13 | 0.21 | 8 | 365 |
| 71903 | Room 2-near Airport & EXPO | 367042 | Belinda | East Region | Tampines | 1.34702 | 103.9610 | Private room | 82 | 90 | 48 | 2020-01-09 | 0.42 | 8 | 365 |
| 71907 | 3rd level Jumbo room 5 near EXPO | 367042 | Belinda | East Region | Tampines | 1.34348 | 103.9634 | Private room | 208 | 1 | 29 | 2020-01-11 | 0.25 | 8 | 181 |
| 241503 | Long stay at The Breezy East “Leopard” | 1017645 | Bianca | East Region | Bedok | 1.32391 | 103.9128 | Private room | 52 | 90 | 178 | 2020-10-16 | 1.67 | 3 | 365 |
| 241508 | Long stay at The Breezy East “Plumeria” | 1017645 | Bianca | East Region | Bedok | 1.32391 | 103.9128 | Private room | 54 | 90 | 199 | 2019-09-21 | 1.82 | 3 | 354 |
| 275343 | Conveniently located City Room!(1,2,3,4,5,6,7,8) | 1439258 | K2 Guesthouse | Central Region | Bukit Merah | 1.28875 | 103.8081 | Private room | 52 | 14 | 20 | 2020-04-17 | 0.22 | 4 | 364 |

2.3 Data Pre-processing

Convert the neighbourhood_group, neighbourhood, and room_type columns to factors

df_list_summary[, c("neighbourhood_group","neighbourhood","room_type")] <- lapply(df_list_summary[, c("neighbourhood_group","neighbourhood","room_type")], as.factor)
glimpse(df_list_summary)
#> Rows: 4,492
#> Columns: 16
#> $ id                             <dbl> 49091, 50646, 56334, 71609, 71896, 7...
#> $ name                           <chr> "COZICOMFORT LONG TERM STAY ROOM 2",...
#> $ host_id                        <dbl> 266763, 227796, 266763, 367042, 3670...
#> $ host_name                      <chr> "Francesca", "Sujatha", "Francesca",...
#> $ neighbourhood_group            <fct> North Region, Central Region, North ...
#> $ neighbourhood                  <fct> Woodlands, Bukit Timah, Woodlands, T...
#> $ latitude                       <dbl> 1.44255, 1.33235, 1.44246, 1.34541, ...
#> $ longitude                      <dbl> 103.7958, 103.7852, 103.7967, 103.95...
#> $ room_type                      <fct> Private room, Private room, Private ...
#> $ price                          <dbl> 82, 80, 68, 179, 95, 82, 208, 52, 54...
#> $ minimum_nights                 <dbl> 180, 90, 6, 90, 90, 90, 1, 90, 90, 1...
#> $ number_of_reviews              <dbl> 1, 18, 20, 20, 24, 48, 29, 178, 199,...
#> $ last_review                    <date> 2013-10-21, 2014-12-26, 2015-10-01,...
#> $ reviews_per_month              <dbl> 0.01, 0.23, 0.18, 0.19, 0.21, 0.42, ...
#> $ calculated_host_listings_count <dbl> 2, 1, 2, 8, 8, 8, 8, 3, 3, 4, 4, 7, ...
#> $ availability_365               <dbl> 365, 365, 365, 365, 365, 365, 181, 3...
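The same conversion can also be written with dplyr's `across()`, which some readers find easier to scan than the `lapply()` assignment. This is only a sketch on a toy data frame; the `toy` object and its values are made up for illustration and are not the real listings.

```r
library(dplyr)

# Toy stand-in for df_list_summary (hypothetical values)
toy <- tibble(
  neighbourhood_group = c("North Region", "Central Region", "North Region"),
  neighbourhood       = c("Woodlands", "Bukit Timah", "Woodlands"),
  room_type           = c("Private room", "Private room", "Entire home/apt"),
  price               = c(82, 80, 68)
)

# Convert the three categorical columns to factors in one step;
# price is left untouched
toy_fct <- toy %>%
  mutate(across(c(neighbourhood_group, neighbourhood, room_type), as.factor))

# Inspect the resulting column classes
sapply(toy_fct, class)
```

Both approaches should give the same result; the `across()` form simply keeps the column list in one place.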

Check for missing values

dim(df_list_summary)
#> [1] 4492   16
colSums(is.na(df_list_summary))
#>                             id                           name 
#>                              0                              0 
#>                        host_id                      host_name 
#>                              0                              2 
#>            neighbourhood_group                  neighbourhood 
#>                              0                              0 
#>                       latitude                      longitude 
#>                              0                              0 
#>                      room_type                          price 
#>                              0                              0 
#>                 minimum_nights              number_of_reviews 
#>                              0                              0 
#>                    last_review              reviews_per_month 
#>                           1799                           1799 
#> calculated_host_listings_count               availability_365 
#>                              0                              0

The dataset has 4,492 rows and 16 columns.

The output above shows missing values in the host_name, last_review, and reviews_per_month columns, with last_review and reviews_per_month having the same number of missing values (1,799). The rows with a missing last_review are shown below.

#df_list_summary[is.na(df_list_summary$host_name),]
df_list_summary[is.na(df_list_summary$last_review),] %>% 
  select("id","name","host_name","last_review","reviews_per_month")
#> # A tibble: 1,799 x 5
#>         id name                           host_name last_review reviews_per_mon~
#>      <dbl> <chr>                          <chr>     <date>                 <dbl>
#>  1  355955 Double room in an Authentic P~ Aresha    NA                        NA
#>  2  469454 Nice view bedroom with aircon~ Hung      NA                        NA
#>  3  481789 Master Bedroom in Newly Built~ Susan     NA                        NA
#>  4  642660 BEST CITY LIVING WITH GA RESI~ Roger     NA                        NA
#>  5  733863 Homestay at Serangoon          Shirlnet  NA                        NA
#>  6  768313 Common Room for rent immediate Immellym~ NA                        NA
#>  7  823571 Apartment away from town       Tania     NA                        NA
#>  8 1562453 Deluxe Quad-sharing room       Domus     NA                        NA
#>  9 1581224 Life Impact Coaching           Gerard    NA                        NA
#> 10 1611318 Central Condo, reasonable pri~ Betty     NA                        NA
#> # ... with 1,789 more rows
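Whether last_review and reviews_per_month are missing on exactly the same rows can be checked directly rather than read off the printed rows. The sketch below uses a small toy data frame with made-up values; on the real data the same two expressions would be applied to `df_list_summary`.

```r
# Toy stand-in for df_list_summary (hypothetical values)
toy <- data.frame(
  id                = 1:5,
  last_review       = as.Date(c("2020-01-17", NA, "2019-10-13", NA, "2020-04-17")),
  reviews_per_month = c(0.19, NA, 0.21, NA, 0.22)
)

# TRUE when both columns are missing on exactly the same rows
same_rows <- identical(is.na(toy$last_review), is.na(toy$reviews_per_month))
same_rows

# Cross-tabulation of the two missingness indicators;
# the off-diagonal cells are zero when the patterns coincide
na_table <- table(last_review_na       = is.na(toy$last_review),
                  reviews_per_month_na = is.na(toy$reviews_per_month))
na_table
```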

The missing values in last_review always fall on the same rows as those in reviews_per_month. Most likely these listings have not received any reviews yet. Rows with missing last_review and reviews_per_month are dropped, while missing values in host_name are set to “NA”.

df_list_summary <- df_list_summary %>% 
  drop_na(last_review) %>% 
  mutate(host_name = replace_na(host_name, "NA"))
colSums(is.na(df_list_summary))
#>                             id                           name 
#>                              0                              0 
#>                        host_id                      host_name 
#>                              0                              0 
#>            neighbourhood_group                  neighbourhood 
#>                              0                              0 
#>                       latitude                      longitude 
#>                              0                              0 
#>                      room_type                          price 
#>                              0                              0 
#>                 minimum_nights              number_of_reviews 
#>                              0                              0 
#>                    last_review              reviews_per_month 
#>                              0                              0 
#> calculated_host_listings_count               availability_365 
#>                              0                              0
dim(df_list_summary)
#> [1] 2693   16

After cleaning, the data has 2,693 rows and 16 columns.

Below is a summary of the data; it is most informative for the numeric columns.

summary(df_list_summary)
#>        id               name              host_id           host_name        
#>  Min.   :   49091   Length:2693        Min.   :   227796   Length:2693       
#>  1st Qu.:15532514   Class :character   1st Qu.: 17526618   Class :character  
#>  Median :24211014   Mode  :character   Median : 52978586   Mode  :character  
#>  Mean   :24392302                      Mean   : 90806544                     
#>  3rd Qu.:34870326                      3rd Qu.:138649185                     
#>  Max.   :45818138                      Max.   :362776064                     
#>                                                                              
#>         neighbourhood_group      neighbourhood     latitude       longitude    
#>  Central Region   :2174     Kallang     : 434   Min.   :1.245   Min.   :103.6  
#>  East Region      : 176     Geylang     : 259   1st Qu.:1.295   1st Qu.:103.8  
#>  North-East Region: 107     Outram      : 259   Median :1.310   Median :103.9  
#>  North Region     :  75     Rochor      : 214   Mean   :1.313   Mean   :103.8  
#>  West Region      : 161     Novena      : 172   3rd Qu.:1.319   3rd Qu.:103.9  
#>                             River Valley: 143   Max.   :1.453   Max.   :104.0  
#>                             (Other)     :1212                                  
#>            room_type        price         minimum_nights    number_of_reviews
#>  Entire home/apt:1018   Min.   :   15.0   Min.   :   1.00   Min.   :  1.00   
#>  Hotel room     : 211   1st Qu.:   56.0   1st Qu.:   1.00   1st Qu.:  2.00   
#>  Private room   :1351   Median :   94.0   Median :   5.00   Median :  6.00   
#>  Shared room    : 113   Mean   :  137.2   Mean   :  21.01   Mean   : 22.85   
#>                         3rd Qu.:  150.0   3rd Qu.:  18.00   3rd Qu.: 24.00   
#>                         Max.   :10286.0   Max.   :1000.00   Max.   :370.00   
#>                                                                              
#>   last_review         reviews_per_month calculated_host_listings_count
#>  Min.   :2013-10-21   Min.   : 0.0100   Min.   :  1.0                 
#>  1st Qu.:2019-08-02   1st Qu.: 0.1000   1st Qu.:  3.0                 
#>  Median :2020-01-26   Median : 0.2700   Median : 10.0                 
#>  Mean   :2019-10-07   Mean   : 0.7253   Mean   : 29.7                 
#>  3rd Qu.:2020-05-06   3rd Qu.: 0.8700   3rd Qu.: 39.0                 
#>  Max.   :2020-10-26   Max.   :22.5600   Max.   :137.0                 
#>                                                                       
#>  availability_365
#>  Min.   :  0.0   
#>  1st Qu.:178.0   
#>  Median :344.0   
#>  Mean   :269.2   
#>  3rd Qu.:364.0   
#>  Max.   :365.0   
#> 

The summary above shows some unusual values. In the minimum_nights column the maximum is 1000. The rows concerned are shown below.

df_list_summary %>% filter(minimum_nights == 1000)
#> # A tibble: 2 x 16
#>       id name  host_id host_name neighbourhood_g~ neighbourhood latitude
#>    <dbl> <chr>   <dbl> <chr>     <fct>            <fct>            <dbl>
#> 1 3.08e7 Room~  2.51e7 Natasha K West Region      Clementi          1.32
#> 2 3.21e7 Room~  2.51e7 Natasha K West Region      Clementi          1.32
#> # ... with 9 more variables: longitude <dbl>, room_type <fct>, price <dbl>,
#> #   minimum_nights <dbl>, number_of_reviews <dbl>, last_review <date>,
#> #   reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
#> #   availability_365 <dbl>

For the rows with minimum_nights == 1000, availability_365 is 365 and the last review dates from 2019. These homestays were probably never occupied in 2020. On that basis, the rows are kept for the data exploration.


3 Data Exploration

Five subjects from the dataset are explored: Homestay Location, Homestay Type (Room Type), Price and Estimated Revenue, Reviews, and the Homestay Map.

3.1 Homestay Location

Before exploring the data, I looked up how Singapore is divided into areas. According to Wiki: Region of Singapore, Singapore consists of 5 Regions and 55 Planning Areas.

With that in mind, let us look at how the Airbnb homestays are distributed.

#aggregate df
df_list_summary %>% 
  group_by(neighbourhood_group) %>% 
  summarise(Total_Planning_Area = n_distinct(neighbourhood),
            Total_Homestay = n()) %>% 
  bind_rows(summarise(.,
                      across(where(is.numeric), sum),
                      across(where(is.factor), ~"TOTAL")))
#> # A tibble: 6 x 3
#>   neighbourhood_group Total_Planning_Area Total_Homestay
#>   <chr>                             <int>          <int>
#> 1 Central Region                       20           2174
#> 2 East Region                           4            176
#> 3 North-East Region                     5            107
#> 4 North Region                          5             75
#> 5 West Region                           8            161
#> 6 TOTAL                                42           2693

The aggregation above shows 2,693 homestays spread across 5 Regions and 42 Planning Areas in Singapore.
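The TOTAL row can also be built without the `.` placeholder, which makes the step a little more explicit. The sketch below starts from the per-Region figures printed in the output above (copied by hand into a hypothetical `agg` tibble) and appends the totals with `bind_rows()`.

```r
library(dplyr)

# Per-Region figures copied from the aggregation output above
agg <- tibble(
  neighbourhood_group = c("Central Region", "East Region", "North-East Region",
                          "North Region", "West Region"),
  Total_Planning_Area = c(20L, 4L, 5L, 5L, 8L),
  Total_Homestay      = c(2174L, 176L, 107L, 75L, 161L)
)

# One-row summary holding the column sums, labelled "TOTAL"
total_row <- summarise(agg,
                       neighbourhood_group = "TOTAL",
                       Total_Planning_Area = sum(Total_Planning_Area),
                       Total_Homestay      = sum(Total_Homestay))

# Append the TOTAL row to the per-Region table
agg_total <- bind_rows(agg, total_row)
agg_total
```

Summing the Planning Area counts is only valid because every Planning Area belongs to a single Region, so no area is counted twice.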

3.1.1 Areas with the most homestays

plot_col_top10listnb <- df_list_summary %>%
    group_by(neighbourhood) %>%
    summarize(num_listings = n(), neighbourhood_grp = unique(neighbourhood_group)) %>%
    arrange(desc(num_listings)) %>% 
  mutate(Text=paste(
         "Region: ",neighbourhood_grp, "<br>",
         "Planning Area: ",neighbourhood, "<br>",
         "No. of Listings: ",num_listings, sep = ""
      )) %>% 
    head(10) %>% 
  
ggplot(aes(x = reorder(neighbourhood, num_listings), 
           y = num_listings, 
           text = Text, fill = neighbourhood_grp)) +
    geom_col() +
    coord_flip() +
    theme(legend.position = "bottom") +
    labs(x = "", y = "Homestay Count",
         fill = "Region") +
    theme(
      panel.border = element_blank(),
      panel.grid.major = element_blank()
    )

ggplotly(plot_col_top10listnb, tooltip="text") %>%
  layout(title = list(text = paste0('Top 10 Planning Area with Highest Number of Airbnb Homestays',
                                    '<br>',
                                    '<sup>',
                                     'among all Region in Singapore','</sup>')),
         legend = list(orientation = "h", x = 0.1, y = -0.3)) %>%
   config(displayModeBar = F)

Across all Regions, most homestays are in the Central Region. The chart above shows that nine of the Planning Areas with the most homestays belong to the Central Region. This is to be expected: the Central Region is one of the most populous Regions of Singapore, and many offices and tourist attractions are located there.

Broken down per Region, the Planning Areas with the most homestays in Singapore are as follows:

agg_homestayperRegion <- df_list_summary %>% 
  group_by(neighbourhood) %>% 
  summarize(neighbourhood_group = unique(neighbourhood_group), num_listings = n()) %>% 
  arrange(desc(neighbourhood_group), desc(num_listings))


agg_homestayperRegion %>% group_by(neighbourhood_group) %>% 
  slice_max(num_listings, n = 3, with_ties = F) %>%
  arrange(desc(neighbourhood_group)) %>% 
  ggplot(aes(x = num_listings, y = reorder(neighbourhood, num_listings))) +
  geom_col(aes(fill = neighbourhood), show.legend = F) +
  geom_label(mapping = aes(label = num_listings),
             position = position_dodge(width = 0.5), 
             #nudge_y = 3, 
             size = 2.5, 
             show.legend = F, fill = "white") +
  facet_wrap(facets = ~neighbourhood_group, nrow = 5, scales = "free_y") +
  theme_linedraw() + scale_fill_discrete() +
  labs(title = "Top 3 Planning Area with Highest Number of Homestay",
       subtitle = "per Region in Singapore",
       x = "Homestay Count",
       y = "Planning Area",
       fill = "Frequency") +
  theme(panel.grid.major.y = element_blank(), 
        panel.grid.minor.y = element_blank(), 
        panel.grid.minor.x = element_blank(),
        panel.grid.major.x = element_line(colour = "grey", linetype = "dashed"))


3.1.2 Distribution of homestays in Singapore

Most homestays are in the Central Region (80.73 %), while the remaining 19.27 % are spread across the other Regions.

homestaydist_perRegion <- df_list_summary %>% 
  group_by(neighbourhood_group) %>% summarize(num_listings = n()) %>%
  mutate(percentage = round((num_listings/sum(num_listings))*100,2))


ggplot(homestaydist_perRegion, 
       aes(x = num_listings,
           y = reorder(neighbourhood_group, num_listings))) +
  # show.legend is a layer argument, not an aesthetic
  geom_col(aes(fill = neighbourhood_group), show.legend = FALSE) +
  # dotted reference line at the mean number of homestays per Region
  geom_vline(xintercept = mean(homestaydist_perRegion$num_listings), linetype = 3) +
  coord_flip() +
  geom_label(aes(label = paste0(num_listings, ' (', percentage, ' %)')),
             size = 3, 
             show.legend = F, fill = "white") + theme_bw() +
  labs(title = "Regions in Singapore",
       subtitle = "based on number of homestay",
       x = "Number of Homestay", y = "Region") +
  theme(
    panel.border = element_blank(),
    panel.grid.minor = element_blank(),
    legend.position = "none"
  )
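The percentages quoted in this section can be reproduced from the per-Region counts alone. The sketch below copies the counts from the aggregation output in section 3.1 into a named vector (the vector itself is an illustration, not part of the original analysis) and applies the same rounding as the `mutate()` call.

```r
# Homestay counts per Region, copied from the aggregation output in section 3.1
num_listings <- c("Central Region"    = 2174,
                  "East Region"       = 176,
                  "North-East Region" = 107,
                  "North Region"      = 75,
                  "West Region"       = 161)

# Share of each Region in percent, rounded to two decimals
percentage <- round(num_listings / sum(num_listings) * 100, 2)
percentage

# Combined share of all Regions other than the Central Region
other_share <- round(100 - percentage[["Central Region"]], 2)
other_share
```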

To illustrate the distribution, I map the homestays by their coordinates. Note: more details about the location data of Airbnb homestays in this dataset can be found in the “Disclaimer” section of InsideAirbnb.

# Plot host dist in map
temp_height <- max(df_list_summary$latitude) - min(df_list_summary$latitude)
temp_width <- max(df_list_summary$longitude) - min(df_list_summary$longitude)
temp_borders <- c(bottom  = min(df_list_summary$latitude)  - 0.05 * temp_height,
                  top     = max(df_list_summary$latitude)  + 0.05 * temp_height,
                  left    = min(df_list_summary$longitude) - 0.1 * temp_width,
                  right   = max(df_list_summary$longitude) + 0.1 * temp_width)

# load Stamen map
temp_map <- get_stamenmap(temp_borders, zoom = 12, maptype = "toner-lite")

# draw the base map and add one point per homestay
plot_hostpointmap <- ggmap(temp_map) +
  geom_point(data = df_list_summary, 
             mapping = aes(x = longitude, y = latitude,
                           col = neighbourhood_group),
             alpha = 0.7) +  # constant transparency, so it is set outside aes()
  labs(x = "longitude",
       y = "latitude", 
       title = "Homestay Distribution in Singapore",
       col = "Region") +
  theme(plot.title = element_text(vjust=5))

plot_hostpointmap
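The map borders above are the bounding box of the listings with a small margin on each side. The same idea can be wrapped in a helper, sketched below; `padded_bbox()` and its argument names are hypothetical and not part of ggmap, and the coordinates are three listings taken from the first rows of the dataset.

```r
# Hypothetical helper: bounding box around a set of coordinates, padded by a
# fraction of the height (latitude) and width (longitude) of the box
padded_bbox <- function(lat, lon, pad_lat = 0.05, pad_lon = 0.1) {
  height <- diff(range(lat))
  width  <- diff(range(lon))
  c(bottom = min(lat) - pad_lat * height,
    top    = max(lat) + pad_lat * height,
    left   = min(lon) - pad_lon * width,
    right  = max(lon) + pad_lon * width)
}

# Coordinates of three listings from the first rows of the dataset
bbox <- padded_bbox(lat = c(1.44255, 1.33235, 1.34541),
                    lon = c(103.7958, 103.7852, 103.9571))
bbox
```

The returned vector uses the same names (bottom, top, left, right) as `temp_borders`, so it could be passed to the map download call in the same way.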



3.2 Homestay Type

3.2.1 Homestay types (Type of Place) on Airbnb Singapore

Four types of homestay are available on Airbnb Singapore:

kable(unique(df_list_summary$room_type))
| x |
|---|
| Private room |
| Entire home/apt |
| Shared room |
| Hotel room |

An explanation of the Airbnb homestay types can be found in the article “what do the different home types mean?”

3.2.2 Type of Place Distribution

df_list_summary %>%
  count(room_type, sort = TRUE) %>%
  mutate(room_type = reorder(room_type, n), percentage = round((n/sum(n))*100,2)) %>%
  
  ggplot(aes(room_type, n)) +
    geom_col(aes(fill = room_type)) +
  coord_flip() +
  geom_label(aes(label = paste0(n, ' ', '(',percentage,' %',')')),
             nudge_y = 1, size = 2.5, show.legend = F, fill = "white") +
  labs(title = "Airbnb Type of Place Distribution",
       subtitle = "over Singapore",
       x=NULL, y= "Count", fill = "Type of Place") + 
  theme_bw() +
  theme(
    panel.border = element_blank(),
    panel.grid.minor = element_blank(),
    legend.position = "bottom"
  )

In order of frequency, the homestay types are Private Room (50.17 %), Entire Home / Apartment (37.8 %), Hotel Room (7.84 %), and Shared Room (4.2 %).

The distribution of Type of Place per Region is shown in the following chart.

plot_agg_roomtypeperRegion <- data.frame(table(region = df_list_summary$neighbourhood_group,
                                               room_type = df_list_summary$room_type)) %>% 
mutate(Text=paste(
         "Region: ",region, "<br>",
         "Type of Place: ",room_type, "<br>",
         "Count: ",Freq, sep = ""
      )) %>%
  
ggplot(aes(x=region, y=Freq, group=room_type, text=Text)) +
      geom_col(aes(fill=room_type), position = "dodge") +
      labs(x = "",
           y = "Count",
           fill = "Type of Place") +
      theme_bw()+
      theme(
        panel.border = element_blank()
      )
  
ggplotly(plot_agg_roomtypeperRegion, tooltip="text") %>%
  layout(legend = list(orientation = "h", x = 0.1, y = -0.05),
         title = list(text = paste0('Type of Place Distribution',
                                    '<br>',
                                    '<sup>','per Region','</sup>')))

Entire Home/Apartment and Private Room are the most common types in every Region of Singapore.


3.3 Price and Estimated Revenue

Below is a summary of the price data.

summary(df_list_summary$price)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    15.0    56.0    94.0   137.2   150.0 10286.0

The price range of Airbnb homestays in Singapore is very wide, from USD 15 to USD 10,286.

3.3.1 Price Distribution

plot_pricedist <- ggplot(df_list_summary, aes(price)) +
  # density-scaled histogram (default number of bins) on a continuous scale
  geom_histogram(aes(y = after_stat(density)), col = "black", fill = "white") + 
  geom_density(alpha = 0.3, fill="#FF6666") +
  # dotted line at the mean price
  geom_vline(xintercept = round(mean(df_list_summary$price), 2),linetype = 3) +
  scale_x_log10() +
  labs(title =  "Transformed distribution of Price",
      subtitle = expression("With" ~'log'[10] ~ "transformation of x-axis"),
      x="Price", y = "Density")

library(viridis)
plot_price_per_roomtype <- df_list_summary %>% select(price, room_type) %>% 
  ggplot(aes(price, fill = room_type))+
  # one density curve per Type of Place
  geom_density(aes(col = room_type), alpha = 0.5) +
  scale_fill_viridis(discrete = TRUE) +
  scale_color_viridis(discrete = TRUE) +
  theme(text = element_text(size = 10)) +
  # listings priced above 1000 are left out of this panel
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 100))+
  # keep the fill legend only; hide the duplicate colour legend
  guides(col = "none") + theme(legend.position = "bottom") +
  labs(title = "S'pore Airbnb Price Distribution",
       subtitle = "by Type of Place",
       x="Price", y= "Density",
       fill = "")

subplot1 <- ggarrange(plot_pricedist, plot_price_per_roomtype,
                      ncol = 2, nrow = 1, 
                      common.legend = FALSE, 
                      legend = "bottom")
subplot1

The density plot and histogram above show that price is right-skewed: the mean price is larger than the median.
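The link between right skew and a mean above the median can be seen on a handful of numbers. The prices below are made up for illustration (one very expensive listing among ordinary ones); in the real summary the same pattern shows up as a mean of 137.2 against a median of 94.

```r
# Toy right-skewed prices (hypothetical values, one extreme listing)
price <- c(15, 40, 56, 80, 94, 120, 150, 300, 10286)

# A single extreme value pulls the mean far above the median
mean(price)
median(price)
mean(price) > median(price)

# On the log10 scale the extreme value has far less influence,
# so the mean and the median end up much closer together
log_price <- log10(price)
mean(log_price)
median(log_price)
```

This is also why the histogram above is drawn with a log-transformed x-axis: on the original scale nearly all listings would be squeezed into the first few bins.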

3.3.2 Price distribution by homestay type

The price distribution per Type of Place is shown in the following boxplot.

#df_list_summary %>% names()
plot_boxplot_prt <- df_list_summary %>%
  ggplot(aes(x = price, y = room_type)) +
      geom_boxplot(aes(fill = room_type)) +
      scale_x_log10() +
      labs(x = expression( ~'log'[10] ~ "Price"), y = "",
           title = "Price Distribution",
           subtitle = "per Type of Place",
           fill = "Type of Place") +
      theme_bw() +
      theme(
        axis.text.x = element_text(angle = 45, hjust = 1)
      )
plot_boxplot_prt

For every Type of Place the price range is wide and there are many outliers.
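The points drawn beyond the whiskers follow the usual 1.5 × IQR rule, which is the default for geom_boxplot. Because the plot uses scale_x_log10(), the rule is applied to the log-transformed prices, which flags fewer points than the raw scale would. The sketch below illustrates the difference; `iqr_outlier()` is a hypothetical helper and the prices are made up.

```r
# Toy prices (hypothetical values)
price <- c(15, 40, 56, 80, 94, 120, 150, 300, 10286)

# Hypothetical helper: TRUE for values further than coef * IQR
# below the first quartile or above the third quartile
iqr_outlier <- function(x, coef = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- coef * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Outliers when the rule is applied to the raw prices
raw_out <- price[iqr_outlier(price)]
raw_out

# Outliers when the rule is applied to log10(price), as in the boxplot
log_out <- price[iqr_outlier(log10(price))]
log_out
```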

3.3.3 Average price per homestay type (Type of Place)

library(RColorBrewer)
# Set the color scale
palette <- brewer.pal(5, "RdYlBu")[-(2:4)]

df_list_summary %>% 
  group_by(room_type) %>% 
  summarise(avg_price = round(mean(price),2)) %>% 
  ggplot(aes(x = avg_price, y = room_type, color = avg_price)) +
  geom_segment(aes(xend = 10, yend = room_type), size = 2) +
  geom_point(aes(color = avg_price),size = 10) +
  geom_text(aes(label = avg_price), color = "white", size = 2.5) +
  scale_x_continuous("", limits = c(10, 250), position = "top") +
  scale_color_gradientn(colors = palette) +
  labs (
    title = "Average Price per Room Type", 
    fill = "Average Price"
  ) +
  theme_classic() +
  theme(axis.line.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text = element_text(color = "black"),
        axis.title = element_blank(),
        legend.position = "none")

Overall, the homestay types ranked from highest to lowest average price are Entire Home / Apartment, Hotel Room, Private Room, and Shared Room. Entire Home / Apartment is the most expensive because it is generally larger and comes with more facilities than the other types, while Shared Room is the cheapest.

3.3.4 Estimated revenue per homestay

Based on the available data, I try to estimate the income of each homestay with the following formula:

\[\text{Estimated Revenue} = \text{Minimum Nights} \times \text{Price} \times \text{Number of Reviews}\]

Below are the 10 rows with the largest Estimated Revenue.

df_est_revenue <- df_list_summary %>% 
  select(-c(latitude, longitude, last_review)) %>% 
  mutate(est_revenue = minimum_nights * price * number_of_reviews) %>%
  arrange(desc(est_revenue))

kable(head(df_est_revenue,10))
| id | name | host_id | host_name | neighbourhood_group | neighbourhood | room_type | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 | est_revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 38772165 | Private Pool PentHouse Sea View CENTRAL CBD Prime! | 160018769 | SuperHost | Central Region | Downtown Core | Entire home/apt | 2000 | 200 | 12 | 0.91 | 7 | 364 | 4800000 |
| 32148763 | Room in a large house; lush, safe, tranquil estate | 25062093 | Natasha K | West Region | Clementi | Private room | 99 | 1000 | 45 | 2.18 | 2 | 365 | 4455000 |
| 7311328 | (Central-Novena) Semi-Detached House Near Subway | 38303962 | Chi Siang | Central Region | Novena | Entire home/apt | 400 | 60 | 147 | 2.32 | 2 | 269 | 3528000 |
| 15946383 | DesignWithDavid: Entire Home 5 Rooms Chinatown | 50034878 | David | Central Region | Outram | Entire home/apt | 325 | 90 | 83 | 1.88 | 1 | 181 | 2427750 |
| 30776327 | Room in a big house in lush, safe, tranquil estate | 25062093 | Natasha K | West Region | Clementi | Private room | 89 | 1000 | 26 | 1.17 | 2 | 365 | 2314000 |
| 5827713 | The Private Sanctuary | 30080617 | Eddie | East Region | Tampines | Private room | 70 | 90 | 285 | 4.26 | 5 | 365 | 1795500 |
| 5919270 | The Antiquity Room | 30080617 | Eddie | East Region | Tampines | Private room | 80 | 90 | 246 | 3.68 | 5 | 365 | 1771200 |
| 13283938 | 10min to CITY CENTRE CleanCosyApt 5min WalktoMetro | 772728 | Sunrise | Central Region | Geylang | Entire home/apt | 288 | 90 | 65 | 1.22 | 2 | 365 | 1684800 |
| 9532788 | Chinatown: In the Heart of the City | 49380207 | Ong | Central Region | Outram | Private room | 88 | 90 | 206 | 3.50 | 7 | 364 | 1631520 |
| 10827113 | 4BR Designer Penthouse \| The Verv | 33955595 | Abigail | Central Region | River Valley | Entire home/apt | 550 | 30 | 95 | 2.00 | 8 | 63 | 1567500 |
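As a worked check of the formula, the sketch below recomputes est_revenue for the first three listings in the table, with the input values copied by hand into a hypothetical `top_listings` tibble. The estimate appears to treat every review as one booking of exactly the minimum stay, so it is best read as a rough proxy rather than actual income.

```r
library(dplyr)

# Inputs for the first three listings, copied from the table above
top_listings <- tibble(
  id                = c(38772165, 32148763, 7311328),
  price             = c(2000, 99, 400),
  minimum_nights    = c(200, 1000, 60),
  number_of_reviews = c(12, 45, 147)
)

# Estimated Revenue = Minimum Nights * Price * Number of Reviews
top_listings <- top_listings %>%
  mutate(est_revenue = minimum_nights * price * number_of_reviews)

top_listings$est_revenue
```

The second listing shows how strongly minimum_nights drives the estimate: a price of 99 with a 1000-night minimum ranks just behind a listing priced at 2000.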

Below is a summary of the estimated revenue by homestay type.

agg_est_rev <- df_est_revenue %>% 
  group_by(room_type) %>% 
  summarise(avg.price = round(mean(price),2),
            avg.min.nights = round(mean(minimum_nights),2),
            avg.nbr.reviews = round(mean(number_of_reviews),2),
            avg.avail.365 = round(mean(availability_365),2),
            avg.estimated.revenue = round(mean(est_revenue),2)
            )
kable(agg_est_rev)
| room_type | avg.price | avg.min.nights | avg.nbr.reviews | avg.avail.365 | avg.estimated.revenue |
|---|---|---|---|---|---|
| Entire home/apt | 215.47 | 22.02 | 25.51 | 251.34 | 62978.94 |
| Hotel room | 105.89 | 2.19 | 19.88 | 307.21 | 3843.70 |
| Private room | 90.30 | 24.36 | 21.61 | 273.29 | 34977.16 |
| Shared room | 51.35 | 6.94 | 19.31 | 309.86 | 1514.01 |

plot_est1 <- agg_est_rev %>% 
  ggplot(aes(x = avg.estimated.revenue, y = reorder(room_type, avg.estimated.revenue), color = avg.estimated.revenue)) +
  geom_segment(aes(xend = 10, yend = room_type), size = 15) +
  geom_text(aes(label = avg.estimated.revenue), color = "black", size = 3, 
            nudge_x = 4000) +
  scale_x_continuous(name = "", 
                     labels = scales::comma, position = "top") +
  scale_color_gradientn(colors = palette) +
  labs (
    title = "Average Estimated Revenue (in USD)", 
    subtitle = "by Type of Place",
    fill = ""
  ) +
  theme_classic() +
  theme(axis.line.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text = element_text(color = "black"),
        axis.title = element_blank(),
        legend.position = "none")

plot_est2 <- agg_est_rev %>% 
  ggplot(aes(x = avg.nbr.reviews, y = reorder(room_type, avg.estimated.revenue))) +
  geom_col(aes(fill = avg.nbr.reviews)) + 
  geom_text(aes(label = avg.nbr.reviews), color = "white", size = 3, 
            nudge_x = -2) +
  labs (
    title = "Avg. Number of Reviews", 
    subtitle = "by Type of Place",
    fill = "") +
  theme_classic() +
  theme(axis.line.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text = element_text(color = "black"),
        axis.title = element_blank(),
        legend.position = "none")

plot_est3 <- agg_est_rev %>% 
  ggplot(aes(x = avg.min.nights, y = reorder(room_type, avg.estimated.revenue))) +
  geom_col(aes(fill = avg.min.nights)) + 
  geom_text(aes(label = avg.min.nights), color = "white", size = 2.5, 
            nudge_x = -0.9) +
  labs (
    title = "Avg. Minimum Number of \nNights Spend", 
    subtitle = "by Type of Place",
    fill = "", x = "") +
  theme_classic() +
  theme(axis.line.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text = element_text(color = "black"),
        axis.title = element_blank(),
        legend.position = "none")

grid.arrange(plot_est1, plot_est2, plot_est3,
             nrow=2,
             layout_matrix = rbind(c(1,1),c(2,3)))

The aggregation above shows that the average estimated revenue is highest for Entire home / apartment, in line with its high average price.

Although Hotel Room has a higher average price than Private Room, its average estimated revenue is lower. In this data Private Room looks more popular with travellers: it has more reviews on average (21.61 vs 19.88) and lower average availability (273.29 vs 307.21 days). In addition, the average minimum stay for Private Room is much longer than for Hotel Room (24.36 vs 2.19 nights).


3.4 Reviews

3.4.1 Relationship between number of reviews and price

plot_PricevsReviews <- df_list_summary %>% 
  select(c(id, host_id, number_of_reviews, neighbourhood, neighbourhood_group, 
           reviews_per_month, room_type, price, minimum_nights, availability_365)) %>% 
  group_by(neighbourhood_group) %>% 
  arrange(desc(number_of_reviews)) %>% 
  
  ggplot(aes(x = price, y = number_of_reviews)) +
  geom_point(aes(col = room_type), show.legend = FALSE, alpha = 0.2) +
  theme_bw() +
  labs(x = "price", y = "number of reviews", 
       title = "Price vs Number of Reviews")
  
ggplotly(plot_PricevsReviews) %>%
 layout(title = list(text = 'Price vs Number of Reviews'),
        xaxis = list(title = 'price'),  # x-axis shows price; reviews are on the y-axis
        legend = list(title=list(text='<b> Type of Place </b>')))

Homestays with many reviews tend to have low prices.
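A tendency like this can also be summarised with a rank correlation, which copes better with the skewed prices than a linear (Pearson) correlation would. The sketch below uses made-up values purely to show the call; it is not a result for the real listings.

```r
# Toy data (hypothetical values): cheaper listings have more reviews
toy <- data.frame(
  price             = c(40, 55, 70, 90, 120, 200, 450),
  number_of_reviews = c(199, 178, 48, 60, 24, 20, 3)
)

# Spearman rank correlation; a negative value means that
# higher prices go together with fewer reviews
rho <- cor(toy$price, toy$number_of_reviews, method = "spearman")
rho
```

On the real data the same call would be `cor(df_list_summary$price, df_list_summary$number_of_reviews, method = "spearman")`; a value close to zero would suggest that the pattern in the scatter plot is weak.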

3.5 Homestay Map

With the homestay coordinates, the locations and other details can also be drawn on a map. Below is an example that maps the homestays in the Southern Islands Planning Area.

map_dist1 <- df_list_summary %>% 
         filter(neighbourhood == "Southern Islands")

map <- leaflet()
map <- addTiles(map)

# one clustered marker per homestay, with the listing details in the popup
map <- addMarkers(map,
                  lng = map_dist1$longitude,
                  lat = map_dist1$latitude,
                  popup = paste(
                    "<b>","Host Desc: ", "</b>",map_dist1$name, "<br>", 
                    "<b>", "Region: ", "</b>", map_dist1$neighbourhood_group, "<br>",
                    "<b>", "Planning Area: ", "</b>", map_dist1$neighbourhood, "<br>", 
                    "<b>", "Type of Place: ", "</b>", map_dist1$room_type, "<br>",
                    "<b>", "Price: $", "</b>", map_dist1$price, "<br>",
                    map_dist1$number_of_reviews, " reviews",sep = ""),
                  clusterOptions = markerClusterOptions())

map

The map above shows nine homestays in the Southern Islands area. More details about each homestay are available in the popup of its marker on the map.