Last Update : March 24th, 2021
1 Introduction
The New Year holiday is almost here, and it is time to get ready for the first holiday of the year (hopefully the COVID-19 pandemic will end in 2021). Among the many travel destinations, Singapore is a favourite for Indonesians because it is relatively close to our country and offers a range of attractive tourist spots.
In today's digital era there are many alternatives for finding accommodation, and one of the more popular is Airbnb. Airbnb is a digital marketplace platform where users can list or search for accommodation (homestays) according to their individual preferences. It is popular among travellers because it offers a wide variety of homestays at competitive prices, spread across almost the entire world. Travellers can also share their holiday experiences and interact with other users, both fellow travellers and homestay owners.
In this article I try to dig insights out of Airbnb homestay data for Singapore. The data was obtained from the Inside Airbnb open data (CC Public License) in mid-December 2020. I hope this article gives readers a simple overview of Airbnb homestay data in Singapore. The interactive plots in this article can be viewed on the following page.
2 Data Preparation
2.1 About the data
Below is a brief explanation of the features in the dataset.
| Variable | Description |
|---|---|
| id | Listing ID |
| name | Descriptive name of the homestay |
| host_id | Host ID |
| host_name | Host name |
| neighbourhood_group | Region where the homestay is located |
| neighbourhood | Planning Area where the homestay is located |
| latitude | Latitude of the homestay |
| longitude | Longitude of the homestay |
| room_type | Type of homestay |
| price | Rental price of the homestay (in USD) |
| minimum_nights | Minimum number of nights required to book the homestay |
| availability_365 | Availability of the homestay over one year, based on the previous year's period |
| number_of_reviews | Cumulative number of reviews for the homestay |
2.2 Import Library and Dataset
Load the required libraries
# for data wrangling
library(tidyr)
library(dplyr)
library(readr)
#library(lubridate)
#for data visualisation
library(ggplot2)
library(ggtext)
library(plotly)
library(ggthemes)
library(ggpubr)
library(leaflet)
library(ggmap)
library(knitr)
library(gridExtra)
Load the dataset
df_list_summary <- read_csv("data_input/listings_summary.csv")
glimpse(df_list_summary)
#> Rows: 4,492
#> Columns: 16
#> $ id <dbl> 49091, 50646, 56334, 71609, 71896, 7...
#> $ name <chr> "COZICOMFORT LONG TERM STAY ROOM 2",...
#> $ host_id <dbl> 266763, 227796, 266763, 367042, 3670...
#> $ host_name <chr> "Francesca", "Sujatha", "Francesca",...
#> $ neighbourhood_group <chr> "North Region", "Central Region", "N...
#> $ neighbourhood <chr> "Woodlands", "Bukit Timah", "Woodlan...
#> $ latitude <dbl> 1.44255, 1.33235, 1.44246, 1.34541, ...
#> $ longitude <dbl> 103.7958, 103.7852, 103.7967, 103.95...
#> $ room_type <chr> "Private room", "Private room", "Pri...
#> $ price <dbl> 82, 80, 68, 179, 95, 82, 208, 52, 54...
#> $ minimum_nights <dbl> 180, 90, 6, 90, 90, 90, 1, 90, 90, 1...
#> $ number_of_reviews <dbl> 1, 18, 20, 20, 24, 48, 29, 178, 199,...
#> $ last_review <date> 2013-10-21, 2014-12-26, 2015-10-01,...
#> $ reviews_per_month <dbl> 0.01, 0.23, 0.18, 0.19, 0.21, 0.42, ...
#> $ calculated_host_listings_count <dbl> 2, 1, 2, 8, 8, 8, 8, 3, 3, 4, 4, 7, ...
#> $ availability_365 <dbl> 365, 365, 365, 365, 365, 365, 181, 3...
First 10 rows of the data
kable(head(df_list_summary, 10))
| id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 49091 | COZICOMFORT LONG TERM STAY ROOM 2 | 266763 | Francesca | North Region | Woodlands | 1.44255 | 103.7958 | Private room | 82 | 180 | 1 | 2013-10-21 | 0.01 | 2 | 365 |
| 50646 | Pleasant Room along Bukit Timah | 227796 | Sujatha | Central Region | Bukit Timah | 1.33235 | 103.7852 | Private room | 80 | 90 | 18 | 2014-12-26 | 0.23 | 1 | 365 |
| 56334 | COZICOMFORT | 266763 | Francesca | North Region | Woodlands | 1.44246 | 103.7967 | Private room | 68 | 6 | 20 | 2015-10-01 | 0.18 | 2 | 365 |
| 71609 | Ensuite Room (Room 1 & 2) near EXPO | 367042 | Belinda | East Region | Tampines | 1.34541 | 103.9571 | Private room | 179 | 90 | 20 | 2020-01-17 | 0.19 | 8 | 365 |
| 71896 | B&B Room 1 near Airport & EXPO | 367042 | Belinda | East Region | Tampines | 1.34567 | 103.9596 | Private room | 95 | 90 | 24 | 2019-10-13 | 0.21 | 8 | 365 |
| 71903 | Room 2-near Airport & EXPO | 367042 | Belinda | East Region | Tampines | 1.34702 | 103.9610 | Private room | 82 | 90 | 48 | 2020-01-09 | 0.42 | 8 | 365 |
| 71907 | 3rd level Jumbo room 5 near EXPO | 367042 | Belinda | East Region | Tampines | 1.34348 | 103.9634 | Private room | 208 | 1 | 29 | 2020-01-11 | 0.25 | 8 | 181 |
| 241503 | Long stay at The Breezy East “Leopard” | 1017645 | Bianca | East Region | Bedok | 1.32391 | 103.9128 | Private room | 52 | 90 | 178 | 2020-10-16 | 1.67 | 3 | 365 |
| 241508 | Long stay at The Breezy East “Plumeria” | 1017645 | Bianca | East Region | Bedok | 1.32391 | 103.9128 | Private room | 54 | 90 | 199 | 2019-09-21 | 1.82 | 3 | 354 |
| 275343 | Conveniently located City Room!(1,2,3,4,5,6,7,8) | 1439258 | K2 Guesthouse | Central Region | Bukit Merah | 1.28875 | 103.8081 | Private room | 52 | 14 | 20 | 2020-04-17 | 0.22 | 4 | 364 |
2.3 Pre-processing Data
Convert the neighbourhood_group, neighbourhood, and room_type columns to factors
df_list_summary[, c("neighbourhood_group","neighbourhood","room_type")] <- lapply(df_list_summary[, c("neighbourhood_group","neighbourhood","room_type")], as.factor)
glimpse(df_list_summary)
#> Rows: 4,492
#> Columns: 16
#> $ id <dbl> 49091, 50646, 56334, 71609, 71896, 7...
#> $ name <chr> "COZICOMFORT LONG TERM STAY ROOM 2",...
#> $ host_id <dbl> 266763, 227796, 266763, 367042, 3670...
#> $ host_name <chr> "Francesca", "Sujatha", "Francesca",...
#> $ neighbourhood_group <fct> North Region, Central Region, North ...
#> $ neighbourhood <fct> Woodlands, Bukit Timah, Woodlands, T...
#> $ latitude <dbl> 1.44255, 1.33235, 1.44246, 1.34541, ...
#> $ longitude <dbl> 103.7958, 103.7852, 103.7967, 103.95...
#> $ room_type <fct> Private room, Private room, Private ...
#> $ price <dbl> 82, 80, 68, 179, 95, 82, 208, 52, 54...
#> $ minimum_nights <dbl> 180, 90, 6, 90, 90, 90, 1, 90, 90, 1...
#> $ number_of_reviews <dbl> 1, 18, 20, 20, 24, 48, 29, 178, 199,...
#> $ last_review <date> 2013-10-21, 2014-12-26, 2015-10-01,...
#> $ reviews_per_month <dbl> 0.01, 0.23, 0.18, 0.19, 0.21, 0.42, ...
#> $ calculated_host_listings_count <dbl> 2, 1, 2, 8, 8, 8, 8, 3, 3, 4, 4, 7, ...
#> $ availability_365 <dbl> 365, 365, 365, 365, 365, 365, 181, 3...
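The same conversion can also be expressed with dplyr's `mutate()` and `across()`. The sketch below is only an illustration on a small made-up tibble (`toy` and its values are hypothetical, not rows from the dataset):

```r
library(dplyr)

# Hypothetical toy data with the same three character columns
toy <- tibble(
  neighbourhood_group = c("Central Region", "East Region", "Central Region"),
  neighbourhood = c("Kallang", "Bedok", "Outram"),
  room_type = c("Private room", "Shared room", "Private room"),
  price = c(82, 54, 120)
)

# Convert the selected character columns to factors in one step
toy <- toy %>%
  mutate(across(c(neighbourhood_group, neighbourhood, room_type), as.factor))

# Inspect the factor levels of one converted column
levels(toy$room_type)
```

Either form gives factor columns; `across()` simply avoids repeating the column list on both sides of the assignment.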
Check for missing values
dim(df_list_summary)
#> [1] 4492 16
colSums(is.na(df_list_summary))
#> id name
#> 0 0
#> host_id host_name
#> 0 2
#> neighbourhood_group neighbourhood
#> 0 0
#> latitude longitude
#> 0 0
#> room_type price
#> 0 0
#> minimum_nights number_of_reviews
#> 0 0
#> last_review reviews_per_month
#> 1799 1799
#> calculated_host_listings_count availability_365
#> 0 0
The dataset has 4,492 rows and 16 columns.
The output above shows missing values in the host_name, last_review, and reviews_per_month columns, and last_review and reviews_per_month have the same number of missing values (1,799). Below are the rows with a missing value in last_review
#df_list_summary[is.na(df_list_summary$host_name),]
df_list_summary[is.na(df_list_summary$last_review),] %>%
select("id","name","host_name","last_review","reviews_per_month")
#> # A tibble: 1,799 x 5
#> id name host_name last_review reviews_per_mon~
#> <dbl> <chr> <chr> <date> <dbl>
#> 1 355955 Double room in an Authentic P~ Aresha NA NA
#> 2 469454 Nice view bedroom with aircon~ Hung NA NA
#> 3 481789 Master Bedroom in Newly Built~ Susan NA NA
#> 4 642660 BEST CITY LIVING WITH GA RESI~ Roger NA NA
#> 5 733863 Homestay at Serangoon Shirlnet NA NA
#> 6 768313 Common Room for rent immediate Immellym~ NA NA
#> 7 823571 Apartment away from town Tania NA NA
#> 8 1562453 Deluxe Quad-sharing room Domus NA NA
#> 9 1581224 Life Impact Coaching Gerard NA NA
#> 10 1611318 Central Condo, reasonable pri~ Betty NA NA
#> # ... with 1,789 more rows
The missing values in last_review always fall on the same rows as those in reviews_per_month. Most likely these listings have not received any reviews yet. Rows with missing values in last_review and reviews_per_month are dropped, while missing values in host_name are set to the string “NA”.
df_list_summary <- df_list_summary %>%
drop_na(last_review) %>%
mutate(host_name = replace_na(host_name, "NA"))
colSums(is.na(df_list_summary))
#> id name
#> 0 0
#> host_id host_name
#> 0 0
#> neighbourhood_group neighbourhood
#> 0 0
#> latitude longitude
#> 0 0
#> room_type price
#> 0 0
#> minimum_nights number_of_reviews
#> 0 0
#> last_review reviews_per_month
#> 0 0
#> calculated_host_listings_count availability_365
#> 0 0
dim(df_list_summary)
#> [1] 2693 16
After cleaning, the data has 2,693 rows and 16 columns.
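To make the effect of the two cleaning steps concrete, here is a minimal sketch on a made-up tibble using the same `drop_na()` and `replace_na()` calls. The `toy` data and its values are hypothetical and only mimic the pattern seen above (a listing without reviews, and a listing without a host name):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: listing 2 has no reviews, listing 3 has no host name
toy <- tibble(
  id = c(1, 2, 3),
  host_name = c("Ana", "Budi", NA),
  last_review = as.Date(c("2020-01-05", NA, "2019-11-30")),
  reviews_per_month = c(0.5, NA, 1.2)
)

toy_clean <- toy %>%
  drop_na(last_review) %>%                          # removes listing 2
  mutate(host_name = replace_na(host_name, "NA"))   # NA becomes the string "NA"

# Count the remaining missing values per column
n_missing <- colSums(is.na(toy_clean))
n_missing
```

Note that after `replace_na()` the host name is the literal text "NA", so `is.na()` no longer flags it.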
Below is a summary of each column in the data
summary(df_list_summary)
#> id name host_id host_name
#> Min. : 49091 Length:2693 Min. : 227796 Length:2693
#> 1st Qu.:15532514 Class :character 1st Qu.: 17526618 Class :character
#> Median :24211014 Mode :character Median : 52978586 Mode :character
#> Mean :24392302 Mean : 90806544
#> 3rd Qu.:34870326 3rd Qu.:138649185
#> Max. :45818138 Max. :362776064
#>
#> neighbourhood_group neighbourhood latitude longitude
#> Central Region :2174 Kallang : 434 Min. :1.245 Min. :103.6
#> East Region : 176 Geylang : 259 1st Qu.:1.295 1st Qu.:103.8
#> North-East Region: 107 Outram : 259 Median :1.310 Median :103.9
#> North Region : 75 Rochor : 214 Mean :1.313 Mean :103.8
#> West Region : 161 Novena : 172 3rd Qu.:1.319 3rd Qu.:103.9
#> River Valley: 143 Max. :1.453 Max. :104.0
#> (Other) :1212
#> room_type price minimum_nights number_of_reviews
#> Entire home/apt:1018 Min. : 15.0 Min. : 1.00 Min. : 1.00
#> Hotel room : 211 1st Qu.: 56.0 1st Qu.: 1.00 1st Qu.: 2.00
#> Private room :1351 Median : 94.0 Median : 5.00 Median : 6.00
#> Shared room : 113 Mean : 137.2 Mean : 21.01 Mean : 22.85
#> 3rd Qu.: 150.0 3rd Qu.: 18.00 3rd Qu.: 24.00
#> Max. :10286.0 Max. :1000.00 Max. :370.00
#>
#> last_review reviews_per_month calculated_host_listings_count
#> Min. :2013-10-21 Min. : 0.0100 Min. : 1.0
#> 1st Qu.:2019-08-02 1st Qu.: 0.1000 1st Qu.: 3.0
#> Median :2020-01-26 Median : 0.2700 Median : 10.0
#> Mean :2019-10-07 Mean : 0.7253 Mean : 29.7
#> 3rd Qu.:2020-05-06 3rd Qu.: 0.8700 3rd Qu.: 39.0
#> Max. :2020-10-26 Max. :22.5600 Max. :137.0
#>
#> availability_365
#> Min. : 0.0
#> 1st Qu.:178.0
#> Median :344.0
#> Mean :269.2
#> 3rd Qu.:364.0
#> Max. :365.0
#>
The summary above reveals something unusual: the minimum_nights column has a maximum value of 1000. The rows in question are shown below.
df_list_summary %>% filter(minimum_nights == 1000)
#> # A tibble: 2 x 16
#> id name host_id host_name neighbourhood_g~ neighbourhood latitude
#> <dbl> <chr> <dbl> <chr> <fct> <fct> <dbl>
#> 1 3.08e7 Room~ 2.51e7 Natasha K West Region Clementi 1.32
#> 2 3.21e7 Room~ 2.51e7 Natasha K West Region Clementi 1.32
#> # ... with 9 more variables: longitude <dbl>, room_type <fct>, price <dbl>,
#> # minimum_nights <dbl>, number_of_reviews <dbl>, last_review <date>,
#> # reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
#> # availability_365 <dbl>
For the rows with minimum_nights == 1000, availability_365 is 365 and the last review dates from 2019. Presumably these homestays were never occupied in 2020. On that basis, these rows are kept for the data exploration.
3 Data Exploration
Five subjects are explored in this dataset: Homestay Location, Room Type, Price, Reviews, and Homestay Map.
3.1 Homestay Location
Before exploring the data, I looked up how Singapore is divided into areas. According to Wiki: Region of Singapore, Singapore consists of 5 Regions and 55 Planning Areas.
With that in mind, let's look at how the Airbnb homestays are distributed.
#aggregate df
df_list_summary %>%
group_by(neighbourhood_group) %>%
summarise(Total_Planning_Area = n_distinct(neighbourhood),
Total_Homestay = n()) %>%
bind_rows(summarise(.,
across(where(is.numeric), sum),
across(where(is.factor), ~"TOTAL")))
#> # A tibble: 6 x 3
#> neighbourhood_group Total_Planning_Area Total_Homestay
#> <chr> <int> <int>
#> 1 Central Region 20 2174
#> 2 East Region 4 176
#> 3 North-East Region 5 107
#> 4 North Region 5 75
#> 5 West Region 8 161
#> 6 TOTAL 42 2693
The aggregation above shows 2,693 homestays spread across 5 Regions and 42 Planning Areas in Singapore.
3.1.1 Areas with the most homestays
plot_col_top10listnb <- df_list_summary %>%
group_by(neighbourhood) %>%
summarize(num_listings = n(), neighbourhood_grp = unique(neighbourhood_group)) %>%
arrange(desc(num_listings)) %>%
mutate(Text=paste(
"Region: ",neighbourhood_grp, "<br>",
"Planning Area: ",neighbourhood, "<br>",
"No. of Listings: ",num_listings, sep = ""
)) %>%
head(10) %>%
ggplot(aes(x = reorder(neighbourhood, num_listings),
y = num_listings,
text = Text, fill = neighbourhood_grp)) +
geom_col() +
coord_flip() +
theme(legend.position = "bottom") +
labs(x = "", y = "Homestay Count",
fill = "Region") +
theme(
panel.border = element_blank(),
panel.grid.major = element_blank()
)
ggplotly(plot_col_top10listnb, tooltip="text") %>%
layout(title = list(text = paste0('Top 10 Planning Area with Highest Number of Airbnb Homestays',
'<br>',
'<sup>',
'among all Region in Singapore','</sup>')),
legend = list(orientation = "h", x = 0.1, y = -0.3)) %>%
config(displayModeBar = F)
Looking at the Regions overall, most homestays are in the Central Region. As the chart above shows, nine of the ten Planning Areas with the most homestays belong to the Central Region. This is unsurprising: the Central Region contains Singapore's city centre, where many offices and tourist attractions are located.
Broken down per Region, the Planning Areas with the most homestays in Singapore are as follows:
agg_homestayperRegion <- df_list_summary %>%
group_by(neighbourhood) %>%
summarize(neighbourhood_group = unique(neighbourhood_group), num_listings = n()) %>%
arrange(desc(neighbourhood_group), desc(num_listings))
agg_homestayperRegion %>% group_by(neighbourhood_group) %>%
slice_max(num_listings, n = 3, with_ties = F) %>%
arrange(desc(neighbourhood_group)) %>%
ggplot(aes(x = num_listings, y = reorder(neighbourhood, num_listings))) +
geom_col(aes(fill = neighbourhood), show.legend = F) +
geom_label(mapping = aes(label = num_listings),
position = position_dodge(width = 0.5),
#nudge_y = 3,
size = 2.5,
show.legend = F, fill = "white") +
facet_wrap(facets = ~neighbourhood_group, nrow = 5, scales = "free_y") +
theme_linedraw() + scale_fill_discrete() +
labs(title = "Top 3 Planning Area with Highest Number of Homestay",
subtitle = "per Region in Singapore",
x = "Homestay Count",
y = "Planning Area",
fill = "Frequency") +
theme(panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_line(colour = "grey", linetype = "dashed"))
3.1.2 Distribution of homestays in Singapore
Most homestays in Singapore are in the Central Region (80.73 %), while the remaining 19.27 % are spread across the other Regions.
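The percentages can be checked by hand from the per-Region counts in the aggregation table of section 3.1. A minimal sketch (the vector name `counts` is just for illustration):

```r
# Listing counts per Region, copied from the aggregation table in section 3.1
counts <- c("Central Region" = 2174, "East Region" = 176,
            "North-East Region" = 107, "North Region" = 75,
            "West Region" = 161)

# Share of each Region in percent, rounded to two decimals
share <- round(counts / sum(counts) * 100, 2)
share

# Combined share of all Regions other than the Central Region
other_share <- round(100 * (1 - counts[["Central Region"]] / sum(counts)), 2)
other_share
```

The counts sum to 2,693, which matches the number of rows left after cleaning.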
homestaydist_perRegion <- df_list_summary %>%
group_by(neighbourhood_group) %>% summarize(num_listings = n()) %>%
mutate(percentage = round((num_listings/sum(num_listings))*100,2))
ggplot(homestaydist_perRegion,
aes(x=num_listings,
y=reorder(neighbourhood_group, num_listings))) +
geom_col(aes(fill = neighbourhood_group), show.legend = FALSE) +
coord_flip() +
# dotted reference line at the mean number of listings per Region
geom_vline(xintercept = mean(homestaydist_perRegion$num_listings), linetype = 3) +
geom_label(aes(label = paste0(num_listings, ' ','(',percentage,' %',')'
)),
size = 3,
show.legend = F, fill = "white") + theme_bw() +
labs(title = "Regions in Singapore",
subtitle = "based on number of homestay",
x = "Number of Homestay", y = "Region") +
theme(
panel.border = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "none"
)
To illustrate the distribution, I mapped the homestays by their coordinates.
Note: more details about the Airbnb homestay location data in this dataset can be found in the “Disclaimer” section of InsideAirbnb.
# Plot host dist in map
temp_height <- max(df_list_summary$latitude) - min(df_list_summary$latitude)
temp_width <- max(df_list_summary$longitude) - min(df_list_summary$longitude)
temp_borders <- c(bottom = min(df_list_summary$latitude) - 0.05 * temp_height,
top = max(df_list_summary$latitude) + 0.05 * temp_height,
left = min(df_list_summary$longitude) - 0.1 * temp_width,
right = max(df_list_summary$longitude) + 0.1 * temp_width)
# load Stamen map tiles
temp_map <- get_stamenmap(temp_borders, zoom = 12, maptype = "toner-lite")
# draw the base map and overlay one point per homestay
plot_hostpointmap <- ggmap(temp_map) +
geom_point(data = df_list_summary,
mapping = aes(x = longitude, y = latitude,
col = neighbourhood_group),
alpha = 0.7) +
labs(x = "longitude",
y = "latitude",
title = "Homestay Distribution in Singapore",
col = "Region") +
theme(plot.title = element_text(vjust=5))
plot_hostpointmap
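The bounding box used for the map is the range of the coordinates plus a small margin (5 % of the height, 10 % of the width). As a sketch, that padding logic can be wrapped in a small helper; `padded_bbox` and the coordinates below are hypothetical and not part of the original analysis:

```r
# Hypothetical helper: pad a bounding box by a fraction of its height and width
padded_bbox <- function(lat, lon, pad_lat = 0.05, pad_lon = 0.1) {
  height <- max(lat) - min(lat)
  width  <- max(lon) - min(lon)
  c(bottom = min(lat) - pad_lat * height,
    top    = max(lat) + pad_lat * height,
    left   = min(lon) - pad_lon * width,
    right  = max(lon) + pad_lon * width)
}

# Made-up coordinates, roughly in Singapore's range
bbox <- padded_bbox(lat = c(1.25, 1.45), lon = c(103.6, 104.0))
bbox
```

The named vector has the same `bottom`, `top`, `left`, `right` layout as `temp_borders` above, so the margin keeps points at the edge of the data from sitting on the border of the map.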
3.2 Room Type
3.2.1 Room types (Type of Place) on Airbnb Singapore
Four types of homestay are available on Airbnb Singapore:
kable(unique(df_list_summary$room_type))
| x |
|---|
| Private room |
| Entire home/apt |
| Shared room |
| Hotel room |
An explanation of the Airbnb homestay types can be found in the article “what do the different home types mean?”
3.2.2 Type of Place Distribution
df_list_summary %>%
count(room_type, sort = TRUE) %>%
mutate(room_type = reorder(room_type, n), percentage = round((n/sum(n))*100,2)) %>%
ggplot(aes(room_type, n)) +
geom_col(aes(fill = room_type)) +
coord_flip() +
geom_label(aes(label = paste0(n, ' ', '(',percentage,' %',')')),
nudge_y = 1, size = 2.5, show.legend = F, fill = "white") +
labs(title = "Airbnb Type of Place Distribution",
subtitle = "over Singapore",
x=NULL, y= "Count", fill = "Type of Place") +
theme_bw() +
theme(
panel.border = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "bottom"
)
In descending order, the most common types are Private Room (50.17 %), Entire Home / Apartment (37.8 %), Hotel Room (7.84 %), and Shared Room (4.2 %).
The distribution of Type of Place per Region is shown in the following chart.
plot_agg_roomtypeperRegion <- data.frame(table(region = df_list_summary$neighbourhood_group,
room_type = df_list_summary$room_type)) %>%
mutate(Text=paste(
"Region: ",region, "<br>",
"Type of Place: ",room_type, "<br>",
"Count:",Freq, sep = ""
)) %>%
ggplot(aes(x=region, y=Freq, group=room_type, text=Text)) +
geom_col(aes(fill=room_type), position = "dodge") +
labs(x = "",
y = "Count",
fill = "Type of Place") +
theme_bw()+
theme(
panel.border = element_blank()
)
ggplotly(plot_agg_roomtypeperRegion, tooltip="text") %>%
layout(legend = list(orientation = "h", x = 0.1, y = -0.05),
title = list(text = paste0('Type of Place Distribution',
'<br>',
'<sup>','per Region','</sup>')))
Entire Home/Apartment and Private Room are the most common types in every Region of Singapore.
3.3 Price and Estimated Revenue
Below is a summary of the Price data
summary(df_list_summary$price)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 15.0 56.0 94.0 137.2 150.0 10286.0
The price range of Airbnb homestays in Singapore is very wide, from USD 15 to USD 10,286.
3.3.1 Price Distribution
plot_pricedist <- ggplot(df_list_summary, aes(price)) +
# histogram with 50 bins on a continuous (log10) scale
geom_histogram(aes(y = ..density..), bins = 50, col = "black", fill = "white") +
geom_density(alpha = 0.3, fill="#FF6666") +
# dotted vertical line at the mean price
geom_vline(xintercept = round(mean(df_list_summary$price), 2),linetype = 3) +
scale_x_log10() +
labs(title = "Transformed distribution of Price",
subtitle = expression("With" ~'log'[10] ~ "transformation of x-axis"),
x="Price", y = "Density",
fill = "Type of Place")
library(viridis)
plot_price_per_roomtype <- df_list_summary %>% select(price, room_type) %>%
ggplot(aes(price, fill = room_type))+
geom_density(aes(col = room_type), alpha = 0.5) +
scale_fill_viridis(discrete = TRUE) +
scale_color_viridis(discrete = TRUE) +
theme(text = element_text(size = 10)) +
# prices above 1000 fall outside the axis limits and are dropped from this panel
scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 100))+
labs(title = "S'pore Airbnb Price Distribution",
subtitle = "by Type of Place",
x="Price", y= "Density",
fill = "") +
guides(col = "none") + theme(legend.position = "bottom")
subplot1 <- ggarrange(plot_pricedist, plot_price_per_roomtype,
ncol = 2, nrow = 1,
common.legend = FALSE,
legend = "bottom")
subplot1
The density plot and histogram above show that the price data is right-skewed: the mean price is larger than the median.
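A quick way to see why a few expensive listings produce this pattern is to compare mean and median on a tiny made-up vector. The `prices` values below are hypothetical and chosen only to mimic the shape of the data (mostly modest prices plus one extreme value):

```r
# Hypothetical prices: mostly modest values plus one extreme listing
prices <- c(15, 56, 94, 150, 10286)

# The extreme value pulls the mean far above the median
mean(prices) > median(prices)

# On the log10 scale the gap between mean and median shrinks considerably
log_prices <- log10(prices)
gap_raw <- mean(prices) - median(prices)
gap_log <- mean(log_prices) - median(log_prices)
c(raw = gap_raw, log10 = gap_log)
```

This is also why the histogram uses a log10 x-axis: it compresses the long right tail so the bulk of the distribution stays readable.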
3.3.2 Price distribution by room type
The price distribution per Type of Place is shown in the following boxplot
#df_list_summary %>% names()
plot_boxplot_prt <- df_list_summary %>%
ggplot(aes(x = price, y = room_type)) +
geom_boxplot(aes(fill = room_type)) +
scale_x_log10() +
labs(x = expression( ~'log'[10] ~ "Price"), y = "",
title = "Price Distribution",
subtitle = "per Type of Place",
fill = "Type of Place") +
theme_bw() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1)
)
plot_boxplot_prt
For every Type of Place, the price range is wide and there are many outliers.
3.3.3 Average price per room type (Type of Place)
library(RColorBrewer)
# Set the color scale
palette <- brewer.pal(5, "RdYlBu")[-(2:4)]
df_list_summary %>%
group_by(room_type) %>%
summarise(avg_price = round(mean(price),2)) %>%
ggplot(aes(x = avg_price, y = room_type, color = avg_price)) +
geom_segment(aes(xend = 10, yend = room_type), size = 2) +
geom_point(aes(color = avg_price),size = 10) +
geom_text(aes(label = avg_price), color = "white", size = 2.5) +
scale_x_continuous("", limits = c(10, 250), position = "top") +
scale_color_gradientn(colors = palette) +
labs (
title = "Average Price per Room Type",
fill = "Average Price"
) +
theme_classic() +
theme(axis.line.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text = element_text(color = "black"),
axis.title = element_blank(),
legend.position = "none")
Overall, ranking the room types by average price from highest to lowest gives: Entire Home / Apartment, Hotel Room, Private Room, and Shared Room. Entire homes or apartments are more expensive because they are relatively larger and have more facilities than the other types. The cheapest type is Shared Room.
3.3.4 Estimated revenue per homestay
Based on the available data, I try to estimate homestay income with the following formula:
\[\text{Estimated Revenue} = \text{Minimum Nights} \times \text{Price} \times \text{Number of Reviews}\]
Below are the 10 rows with the largest Estimated Revenue
df_est_revenue <- df_list_summary %>%
select(-c(latitude, longitude, last_review)) %>%
mutate(est_revenue = minimum_nights * price * number_of_reviews) %>%
arrange(desc(est_revenue))
kable(head(df_est_revenue,10))
| id | name | host_id | host_name | neighbourhood_group | neighbourhood | room_type | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 | est_revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 38772165 | Private Pool PentHouse Sea View CENTRAL CBD Prime! | 160018769 | SuperHost | Central Region | Downtown Core | Entire home/apt | 2000 | 200 | 12 | 0.91 | 7 | 364 | 4800000 |
| 32148763 | Room in a large house; lush, safe, tranquil estate | 25062093 | Natasha K | West Region | Clementi | Private room | 99 | 1000 | 45 | 2.18 | 2 | 365 | 4455000 |
| 7311328 | (Central-Novena) Semi-Detached House Near Subway | 38303962 | Chi Siang | Central Region | Novena | Entire home/apt | 400 | 60 | 147 | 2.32 | 2 | 269 | 3528000 |
| 15946383 | DesignWithDavid: Entire Home 5 Rooms Chinatown | 50034878 | David | Central Region | Outram | Entire home/apt | 325 | 90 | 83 | 1.88 | 1 | 181 | 2427750 |
| 30776327 | Room in a big house in lush, safe, tranquil estate | 25062093 | Natasha K | West Region | Clementi | Private room | 89 | 1000 | 26 | 1.17 | 2 | 365 | 2314000 |
| 5827713 | The Private Sanctuary | 30080617 | Eddie | East Region | Tampines | Private room | 70 | 90 | 285 | 4.26 | 5 | 365 | 1795500 |
| 5919270 | The Antiquity Room | 30080617 | Eddie | East Region | Tampines | Private room | 80 | 90 | 246 | 3.68 | 5 | 365 | 1771200 |
| 13283938 | 10min to CITY CENTRE CleanCosyApt 5min WalktoMetro | 772728 | Sunrise | Central Region | Geylang | Entire home/apt | 288 | 90 | 65 | 1.22 | 2 | 365 | 1684800 |
| 9532788 | Chinatown: In the Heart of the City | 49380207 | Ong | Central Region | Outram | Private room | 88 | 90 | 206 | 3.50 | 7 | 364 | 1631520 |
| 10827113 | 4BR Designer Penthouse \| The Verv | 33955595 | Abigail | Central Region | River Valley | Entire home/apt | 550 | 30 | 95 | 2.00 | 8 | 63 | 1567500 |
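As a worked check of the formula, the sketch below recomputes est_revenue for the first three rows of the table, using the price, minimum_nights, and number_of_reviews values shown there (the data frame name `top3` is just for illustration). Keep in mind what the formula assumes: every review is treated as one booking of exactly the minimum number of nights.

```r
# Values copied from the first three rows of the table above
top3 <- data.frame(
  id = c(38772165, 32148763, 7311328),
  price = c(2000, 99, 400),
  minimum_nights = c(200, 1000, 60),
  number_of_reviews = c(12, 45, 147)
)

# Estimated Revenue = Minimum Nights x Price x Number of Reviews
top3$est_revenue <- with(top3, minimum_nights * price * number_of_reviews)
top3$est_revenue
# 4800000 4455000 3528000
```

The second row shows how strongly minimum_nights drives the estimate: a USD 99 room ranks near the top only because its minimum stay is 1000 nights.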
Below is a summary of the estimated revenue by room type
agg_est_rev <- df_est_revenue %>%
group_by(room_type) %>%
summarise(avg.price = round(mean(price),2),
avg.min.nights = round(mean(minimum_nights),2),
avg.nbr.reviews = round(mean(number_of_reviews),2),
avg.avail.365 = round(mean(availability_365),2),
avg.estimated.revenue = round(mean(est_revenue),2)
)
kable(agg_est_rev)
| room_type | avg.price | avg.min.nights | avg.nbr.reviews | avg.avail.365 | avg.estimated.revenue |
|---|---|---|---|---|---|
| Entire home/apt | 215.47 | 22.02 | 25.51 | 251.34 | 62978.94 |
| Hotel room | 105.89 | 2.19 | 19.88 | 307.21 | 3843.70 |
| Private room | 90.30 | 24.36 | 21.61 | 273.29 | 34977.16 |
| Shared room | 51.35 | 6.94 | 19.31 | 309.86 | 1514.01 |
plot_est1 <- agg_est_rev %>%
ggplot(aes(x = avg.estimated.revenue, y = reorder(room_type, avg.estimated.revenue), color = avg.estimated.revenue)) +
geom_segment(aes(xend = 10, yend = room_type), size = 15) +
geom_text(aes(label = avg.estimated.revenue), color = "black", size = 3,
nudge_x = 4000) +
scale_x_continuous(name = "",
labels = scales::comma, position = "top") +
scale_color_gradientn(colors = palette) +
labs (
title = "Average Estimated Revenue (in USD)",
subtitle = "by Type of Place",
fill = ""
) +
theme_classic() +
theme(axis.line.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text = element_text(color = "black"),
axis.title = element_blank(),
legend.position = "none")
plot_est2 <- agg_est_rev %>%
ggplot(aes(x = avg.nbr.reviews, y = reorder(room_type, avg.estimated.revenue))) +
geom_col(aes(fill = avg.nbr.reviews)) +
geom_text(aes(label = avg.nbr.reviews), color = "white", size = 3,
nudge_x = -2) +
labs (
title = "Avg. Number of Reviews",
subtitle = "by Type of Place",
fill = "") +
theme_classic() +
theme(axis.line.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text = element_text(color = "black"),
axis.title = element_blank(),
legend.position = "none")
plot_est3 <- agg_est_rev %>%
ggplot(aes(x = avg.min.nights, y = reorder(room_type, avg.estimated.revenue))) +
geom_col(aes(fill = avg.min.nights)) +
geom_text(aes(label = avg.min.nights), color = "white", size = 2.5,
nudge_x = -0.9) +
labs (
title = "Avg. Minimum Number of \nNights Spend",
subtitle = "by Type of Place",
fill = "") +
theme_classic() +
theme(axis.line.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text = element_text(color = "black"),
axis.title = element_blank(),
legend.position = "none")
grid.arrange(plot_est1, plot_est2, plot_est3,
nrow=2,
layout_matrix = rbind(c(1,1),c(2,3)))
From the aggregation above, the largest average estimated revenue belongs to the Entire home / apartment type, in line with its high average price.
Although Hotel Room has a higher average price than Private Room, its average estimated revenue is still lower. This is because Private Rooms are more popular with travellers (judging by occupancy and number of reviews). In addition, the average minimum stay for Private Rooms (24.36 nights) is much longer than for Hotel Rooms (2.19 nights).
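To see which factor drives the gap, the averages from the summary table can be multiplied together for the two room types. This is only a rough sketch: the product of averages is not the same as the average of the per-listing products reported in avg.estimated.revenue, so the numbers will not match the table exactly. The column names below are chosen for illustration.

```r
# Averages per room type, copied from the summary table above
avg <- data.frame(
  room_type = c("Hotel room", "Private room"),
  avg_price = c(105.89, 90.30),
  avg_min_nights = c(2.19, 24.36),
  avg_reviews = c(19.88, 21.61)
)

# Product of the averages: a rough approximation of the revenue estimate
avg$approx_revenue <- with(avg, avg_price * avg_min_nights * avg_reviews)

# Minimum stay relative to Hotel room
avg$nights_ratio <- avg$avg_min_nights / avg$avg_min_nights[1]
avg
```

Price and review counts differ only modestly between the two types, while the average minimum stay of Private Rooms is roughly eleven times longer, so minimum_nights accounts for most of the difference in the estimate.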
3.4 Reviews
3.4.1 Relationship between number of reviews and price
plot_PricevsReviews <- df_list_summary %>%
select(c(id, host_id, number_of_reviews, neighbourhood, neighbourhood_group,
reviews_per_month, room_type, price, minimum_nights, availability_365)) %>%
group_by(neighbourhood_group) %>%
arrange(desc(number_of_reviews)) %>%
ggplot(aes(x = price, y = number_of_reviews)) +
geom_point(aes(col = room_type), show.legend = FALSE, alpha = 0.2) +
theme_bw() +
labs(x = "price", y = "number of reviews",
title = "Price vs Number of Reviews")
ggplotly(plot_PricevsReviews) %>%
layout(title = list(text = paste0('Price vs Number of Reviews')),
xaxis = list(title = 'price'),
legend = list(title=list(text='<b> Type of Place </b>')))
3.5 Homestay Map
Using the homestay coordinates, the homestay locations and additional information can also be drawn on a map. Below is an example that maps the homestays in the Southern Islands planning area
map_dist1 <- df_list_summary %>%
filter(neighbourhood == "Southern Islands")
map <- leaflet()
map <- addTiles(map)
map <- addMarkers(map,
lng = map_dist1$longitude,
lat = map_dist1$latitude,
popup = paste(
"<b>","Host Desc: ", "</b>",map_dist1$name, "<br>",
"<b>", "Region: ", "</b>", map_dist1$neighbourhood_group, "<br>",
"<b>", "Planning Area: ", "</b>", map_dist1$neighbourhood, "<br>",
"<b>", "Type of Place: ", "</b>", map_dist1$room_type, "<br>",
"<b>", "Price: $", "</b>", map_dist1$price, "<br>",
map_dist1$number_of_reviews, " reviews",sep = ""),
clusterOptions = markerClusterOptions())
map