Airbnb, Inc. operates an online marketplace for lodging, primarily homestays for vacation rentals, and tourism activities. It is based in San Francisco, California. The platform is accessible via website and mobile app. Airbnb does not own any of the listed properties; instead, it profits by receiving a commission from each booking. The company was founded in 2008 by Brian Chesky, Nathan Blecharczyk and Joe Gebbia. Airbnb is a shortened version of its original name, AirBedandBreakfast.com. It currently covers more than 100,000 cities and 220 countries and regions worldwide, including Singapore.
Now we are going to look at Airbnb data for Singapore. The data was sourced from Kaggle.com and, according to the website, was collected on 28 August 2019.
Exploratory and Explanatory Data Analysis
Before we start the exploratory and explanatory data analysis, we load all the libraries needed to support it.
library(lubridate)
library(scales)
library(readr)
library(dplyr)
library(ggplot2)
library(plotly)
library(tidyr)
library(glue)
library(viridis)
library(leaflet)
library(treemapify)
library(skimr)
After loading the libraries, we read the data and check its structure.
air <- read_csv("airbnb/listings.csv")
glimpse(air)
## Rows: 7,907
## Columns: 16
## $ id <dbl> 49091, 50646, 56334, 71609, 71896, 7190~
## $ name <chr> "COZICOMFORT LONG TERM STAY ROOM 2", "P~
## $ host_id <dbl> 266763, 227796, 266763, 367042, 367042,~
## $ host_name <chr> "Francesca", "Sujatha", "Francesca", "B~
## $ neighbourhood_group <chr> "North Region", "Central Region", "Nort~
## $ neighbourhood <chr> "Woodlands", "Bukit Timah", "Woodlands"~
## $ latitude <dbl> 1.44255, 1.33235, 1.44246, 1.34541, 1.3~
## $ longitude <dbl> 103.7958, 103.7852, 103.7967, 103.9571,~
## $ room_type <chr> "Private room", "Private room", "Privat~
## $ price <dbl> 83, 81, 69, 206, 94, 104, 208, 50, 54, ~
## $ minimum_nights <dbl> 180, 90, 6, 1, 1, 1, 1, 90, 90, 90, 15,~
## $ number_of_reviews <dbl> 1, 18, 20, 14, 22, 39, 25, 174, 198, 23~
## $ last_review <date> 2013-10-21, 2014-12-26, 2015-10-01, 20~
## $ reviews_per_month <dbl> 0.01, 0.28, 0.20, 0.15, 0.22, 0.38, 0.2~
## $ calculated_host_listings_count <dbl> 2, 1, 2, 9, 9, 9, 9, 4, 4, 4, 32, 32, 9~
## $ availability_365 <dbl> 365, 365, 365, 353, 355, 346, 172, 59, ~
The data has 7,907 rows and 16 columns. Looking at the data types, there are 5 character columns that we need to convert to factors:
- name
- host_name
- neighbourhood_group
- neighbourhood
- room_type
airsg <- air %>%
mutate(name = as.factor(name),
host_name = as.factor(host_name),
neighbourhood_group = as.factor(neighbourhood_group),
neighbourhood = as.factor(neighbourhood),
room_type = as.factor(room_type))
head(airsg)
Now let us check the data once more and skim the details to see whether any cleaning is necessary
skim(airsg)
| Name | airsg |
| Number of rows | 7907 |
| Number of columns | 16 |
| _______________________ | |
| Column type frequency: | |
| Date | 1 |
| factor | 5 |
| numeric | 10 |
| ________________________ | |
| Group variables | None |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| last_review | 2758 | 0.65 | 2013-10-21 | 2019-08-27 | 2019-06-27 | 1001 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| name | 2 | 1 | FALSE | 7457 | Lux: 13, Inv: 9, Stu: 9, Cit: 8 |
| host_name | 0 | 1 | FALSE | 1833 | Jay: 290, Alv: 249, Ric: 157, Aar: 145 |
| neighbourhood_group | 0 | 1 | FALSE | 5 | Cen: 6309, Wes: 540, Eas: 508, Nor: 346 |
| neighbourhood | 0 | 1 | FALSE | 43 | Kal: 1043, Gey: 994, Nov: 537, Roc: 536 |
| room_type | 0 | 1 | FALSE | 3 | Ent: 4132, Pri: 3381, Sha: 394 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 23388624.63 | 10164162.07 | 49091.00 | 15821800.50 | 24706270.00 | 32348500.00 | 38112762.00 | ▂▃▅▆▇ |
| host_id | 0 | 1.00 | 91144807.41 | 81909095.31 | 23666.00 | 23058075.00 | 63448912.00 | 155381142.00 | 288567551.00 | ▇▃▂▂▁ |
| latitude | 0 | 1.00 | 1.31 | 0.03 | 1.24 | 1.30 | 1.31 | 1.32 | 1.45 | ▂▇▂▁▁ |
| longitude | 0 | 1.00 | 103.85 | 0.04 | 103.65 | 103.84 | 103.85 | 103.87 | 103.97 | ▁▁▃▇▁ |
| price | 0 | 1.00 | 169.33 | 340.19 | 0.00 | 65.00 | 124.00 | 199.00 | 10000.00 | ▇▁▁▁▁ |
| minimum_nights | 0 | 1.00 | 17.51 | 42.09 | 1.00 | 1.00 | 3.00 | 10.00 | 1000.00 | ▇▁▁▁▁ |
| number_of_reviews | 0 | 1.00 | 12.81 | 29.71 | 0.00 | 0.00 | 2.00 | 10.00 | 323.00 | ▇▁▁▁▁ |
| reviews_per_month | 2758 | 0.65 | 1.04 | 1.29 | 0.01 | 0.18 | 0.55 | 1.37 | 13.00 | ▇▁▁▁▁ |
| calculated_host_listings_count | 0 | 1.00 | 40.61 | 65.14 | 1.00 | 2.00 | 9.00 | 48.00 | 274.00 | ▇▁▁▁▁ |
| availability_365 | 0 | 1.00 | 208.73 | 146.12 | 0.00 | 54.00 | 260.00 | 355.00 | 365.00 | ▅▂▂▂▇ |
We found a large number of missing values (2,758 rows, about 35% of the data) in the “reviews_per_month” and “last_review” columns. We probably will not use these columns for the analysis, so for now we keep them and will transform them later if necessary
Now let us look at the summary of the data to check the descriptive statistics for each column
summary(airsg)
## id name
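Both review columns have the same number of missing values, which suggests (but does not prove) that they are missing on the same rows, namely listings that have no reviews yet. A minimal sketch of how that could be checked is shown below. It uses base R only and a small invented data frame, `toy`, which is a hypothetical stand-in for `airsg`; none of its values come from the Kaggle data.

```r
# Hypothetical toy data standing in for airsg; all values are invented.
toy <- data.frame(
  number_of_reviews = c(0, 5, 0, 12),
  last_review = as.Date(c(NA, "2019-06-01", NA, "2019-08-20")),
  reviews_per_month = c(NA, 0.4, NA, 1.2)
)

# Count the missing values in every column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Check whether the missing review fields sit exactly on the rows
# where the listing has no reviews.
no_reviews <- toy$number_of_reviews == 0
same_rows <- all(is.na(toy$last_review) == no_reviews) &&
  all(is.na(toy$reviews_per_month) == no_reviews)
print(same_rows)
```

If the same check returned TRUE on the real data, the missing values would simply mean "no reviews yet", and filling `reviews_per_month` with 0 would be one defensible option.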
## Min. : 49091 Luxury hostel with in-cabin locker - Single mixed: 13
## 1st Qu.:15821800 Inviting & Cozy 1BR APT 3 mins from Tg Pagar MRT : 9
## Median :24706270 Studio Apartment - Oakwood Premier : 9
## Mean :23388625 City-located 1BR loft apartment *BRAND NEW* : 8
## 3rd Qu.:32348500 Stylish 1BR Located 7 mins from Tg Pagar MRT : 8
## Max. :38112762 (Other) :7858
## NA's : 2
## host_id host_name neighbourhood_group
## Min. : 23666 Jay : 290 Central Region :6309
## 1st Qu.: 23058075 Alvin : 249 East Region : 508
## Median : 63448912 Richards: 157 North-East Region: 346
## Mean : 91144807 Aaron : 145 North Region : 204
## 3rd Qu.:155381142 Rain : 115 West Region : 540
## Max. :288567551 Darcy : 114
## (Other) :6837
## neighbourhood latitude longitude room_type
## Kallang :1043 Min. :1.244 Min. :103.6 Entire home/apt:4132
## Geylang : 994 1st Qu.:1.296 1st Qu.:103.8 Private room :3381
## Novena : 537 Median :1.311 Median :103.8 Shared room : 394
## Rochor : 536 Mean :1.314 Mean :103.8
## Outram : 477 3rd Qu.:1.322 3rd Qu.:103.9
## Bukit Merah: 470 Max. :1.455 Max. :104.0
## (Other) :3850
## price minimum_nights number_of_reviews last_review
## Min. : 0.0 Min. : 1.00 Min. : 0.00 Min. :2013-10-21
## 1st Qu.: 65.0 1st Qu.: 1.00 1st Qu.: 0.00 1st Qu.:2018-11-21
## Median : 124.0 Median : 3.00 Median : 2.00 Median :2019-06-27
## Mean : 169.3 Mean : 17.51 Mean : 12.81 Mean :2019-01-11
## 3rd Qu.: 199.0 3rd Qu.: 10.00 3rd Qu.: 10.00 3rd Qu.:2019-08-07
## Max. :10000.0 Max. :1000.00 Max. :323.00 Max. :2019-08-27
## NA's :2758
## reviews_per_month calculated_host_listings_count availability_365
## Min. : 0.010 Min. : 1.00 Min. : 0.0
## 1st Qu.: 0.180 1st Qu.: 2.00 1st Qu.: 54.0
## Median : 0.550 Median : 9.00 Median :260.0
## Mean : 1.044 Mean : 40.61 Mean :208.7
## 3rd Qu.: 1.370 3rd Qu.: 48.00 3rd Qu.:355.0
## Max. :13.000 Max. :274.00 Max. :365.0
## NA's :2758
Looking at the summary statistics above, we can conclude:
- Jay is the host name with the most listings in Singapore (290)
- The Central Region has the most Airbnb listings in Singapore (6,309) and the North Region has the fewest (204)
- In terms of neighbourhood, Kallang has the most listings (1,043)
- There are 3 types of listing: ‘Entire home/apt’, ‘Private room’ and ‘Shared room’
- The median price is SGD 124
- The average minimum-stay requirement is about 17.5 nights, although the median is only 3 nights
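In the summary, the mean of minimum_nights (17.51) is far above its median (3), which is typical of a right-skewed column where a few long-term listings pull the mean upward. The sketch below illustrates the effect with invented numbers; `min_nights` is a hypothetical vector, not taken from the data.

```r
# Invented minimum-night values: mostly short stays plus two long-term listings.
min_nights <- c(1, 1, 2, 3, 3, 7, 90, 180)

avg <- mean(min_nights)    # pulled upward by the long-term listings
mid <- median(min_nights)  # middle value, robust to the extremes

cat("mean:", avg, "\n")
cat("median:", mid, "\n")
```

For columns shaped like this, the median is usually the safer figure to quote as a "typical" value.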
Visualization and Analysis
Price
Now let’s take a look at the price distribution by room type
plot1 <- airsg %>%
select(room_type,price,name) %>%
mutate(label=glue(
"{name}
Price in SGD: {price}
Room Type: {room_type}"
)) %>%
ggplot(aes(x=room_type,y=price,text=label)) +
geom_jitter(aes(col=room_type),cex=1.8,shape=8,alpha=0.3,show.legend = F) +
scale_y_continuous(limits=c(0,500),breaks=seq(0,500,100))+
geom_hline(aes(yintercept=median(price, na.rm=T)),
color="red", linetype="dashed", size=1)+
labs(title = "Price Distribution to Room Type",
subtitle = "AirBnb Singapore",
caption= "Data Source: Kaggle.com",
x="Room Type",
y="Price in SGD"
)+
theme_bw()
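ggplotly(plot1,tooltip = "text") %>% layout(showlegend = FALSE)
Looking at the plot above, the nightly price for an entire home/apt is generally higher than for a shared room or a private room. The red dashed line marks the median price, and the y-axis is limited to SGD 500, so more expensive listings are not shown.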
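A jitter plot gives a visual impression, so it can help to back it up with a number per group. The sketch below computes the median and mean price per room type in base R. The data frame `toy` and its prices are invented for illustration; with the real data the same calls would use `airsg$price` and `airsg$room_type`.

```r
# Hypothetical toy data; prices are invented, not taken from the listings.
toy <- data.frame(
  room_type = c("Entire home/apt", "Entire home/apt", "Entire home/apt",
                "Private room", "Private room", "Private room",
                "Shared room", "Shared room", "Shared room"),
  price = c(150, 200, 900, 60, 80, 100, 30, 40, 50)
)

# Median and mean price for each room type.
median_by_type <- tapply(toy$price, toy$room_type, median)
mean_by_type <- tapply(toy$price, toy$room_type, mean)

print(median_by_type)
print(mean_by_type)
```

Comparing the two outputs also shows how a single expensive listing moves the mean while leaving the median unchanged.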
Now let’s take a look at the price distribution by neighbourhood group
plot2 <- airsg %>%
select(neighbourhood_group,neighbourhood,price,name) %>%
mutate(label=glue(
"{name}
Price in SGD: {price}
Region: {neighbourhood_group}
Area: {neighbourhood}"
)) %>%
ggplot(aes(y=price,x=neighbourhood_group,text=label)) +
geom_jitter(aes(col=neighbourhood_group)
,alpha=0.3,show.legend = F,cex=1.8,shape=8) +
scale_fill_viridis(discrete = TRUE)+
scale_color_viridis(discrete = TRUE)+
scale_y_continuous(limits=c(0,500),breaks=seq(0,500,100))+
geom_hline(aes(yintercept=median(price, na.rm=T)),
color="red", linetype="dashed", size=1) +
labs(title = "Price Distribution to Neighbourhood Group Region",
subtitle = "AirBnb Singapore",
caption= "Data Source: Kaggle.com",
x="",
y="Price in SGD",
fill="",
col="")+
theme_bw()
ggplotly(plot2,tooltip = "text") %>% layout(showlegend = FALSE)
Looking at the plot above, prices in the Central Region are generally higher than in the other regions. This is to be expected, since the region is home to Singapore’s famous tourist spots.
Since prices in the Central Region are relatively high, let’s now take a look at the price distribution across its neighbourhoods.
plot3 <- airsg %>%
filter(neighbourhood_group=="Central Region") %>%
select(neighbourhood,price,name) %>%
mutate(label=glue(
"{name}
Price in SGD: {price}
{neighbourhood}"
)) %>%
ggplot(aes(y=reorder(neighbourhood,price),x=price,text=label)) +
geom_point(aes(col=neighbourhood),alpha=0.4,show.legend = F) +
scale_x_continuous(limits=c(0,3000),breaks=seq(0,3000,500))+
geom_vline(aes(xintercept=median(price, na.rm=T)),
color="dodgerblue4", linetype="dashed", size=1) +
labs(title = "Price Distribution in Central Region",
subtitle = "AirBnb Singapore",
caption= "Data Source: Kaggle.com",
x="Price in SGD",
y="",
col="")+
theme_bw()
ggplotly(plot3,tooltip = "text") %>% layout(showlegend = FALSE)
Looking at the plot above, every listing in Southern Islands and Marina South is priced above the Central Region median, which is marked by the dashed line.
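A statement of the form "every listing in an area is above the median" can be verified by comparing each area's minimum price with the overall median. The sketch below shows the idea in base R on invented data; `toy` and the area names are hypothetical.

```r
# Hypothetical toy data; area names and prices are invented.
toy <- data.frame(
  neighbourhood = c("Area A", "Area A", "Area B", "Area B", "Area C", "Area C"),
  price = c(300, 450, 80, 250, 60, 120)
)

# Median over all listings in the (toy) region.
region_median <- median(toy$price)

# Cheapest listing in each area.
min_by_area <- tapply(toy$price, toy$neighbourhood, min)

# Areas whose cheapest listing is still above the regional median.
all_above <- names(min_by_area)[as.vector(min_by_area > region_median)]
print(all_above)
```

On the real data, the same steps would be applied to the rows of `airsg` filtered to the Central Region.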
Now let us check the top 10 hosts with the highest average listing price
ploth <- airsg %>%
group_by(host_name) %>%
summarise(avg_price=mean(price)) %>%
arrange(desc(avg_price)) %>%
head(10) %>%
mutate(label=glue(
"Host Name: {host_name}
avg price: {avg_price}"
)) %>%
ggplot(aes(x=avg_price,y=reorder(host_name,avg_price),text=label))+
geom_col(aes(fill=host_name),show.legend = F)+
scale_fill_viridis(discrete = TRUE)+
scale_color_viridis(discrete = TRUE)+
labs(title="Top 10 Hosts with the Highest Average Price per Listing",
x="Price in SGD",
y="")+
theme_bw()
ggplotly(ploth,tooltip = "text") %>% layout(showlegend=F)
Based on the plot above, Yolivia’s listings have the highest average price in Singapore.
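An average price per host can be driven by a single very expensive listing, so it may be worth reporting the number of listings next to the average. The sketch below does this in base R on invented data; `toy`, the host names and the prices are all hypothetical.

```r
# Hypothetical toy data; hosts and prices are invented.
toy <- data.frame(
  host_name = c("Host A", "Host B", "Host B", "Host B", "Host C", "Host C"),
  price = c(5000, 300, 350, 400, 100, 120)
)

# Average price and number of listings per host.
avg_price <- tapply(toy$price, toy$host_name, mean)
n_listings <- tapply(toy$price, toy$host_name, length)

host_summary <- data.frame(
  host_name = names(avg_price),
  avg_price = as.vector(avg_price),
  n_listings = as.vector(n_listings)
)

# Sort from the highest to the lowest average price.
host_summary <- host_summary[order(-host_summary$avg_price), ]
print(host_summary)
```

In this toy example the top host has only one listing, which is the kind of case where the ranking says more about one property than about the host.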
Room Listings
Now we will take a more detailed look at the room listings.
Let us check the number of listings in each region of Singapore
plotpop <- airsg %>%
select(neighbourhood_group) %>%
count(neighbourhood_group) %>%
mutate(label=glue(
"number of listing: {n} rooms
{neighbourhood_group}"
)) %>%
ggplot(aes(y=reorder(neighbourhood_group,n),x=n,text=label))+
geom_col(aes(fill=neighbourhood_group)) +
labs(title= "Population of Listings per Region",
x="Numbers of Listings",
y = NULL) +
theme_bw()
ggplotly(plotpop,tooltip = "text") %>% layout(showlegend = FALSE)
Looking at the plot above, the Central Region has by far the most listings in Singapore
Now let us check the number of listings for each room type in Singapore
plotr <- airsg %>%
select(room_type) %>%
count(room_type) %>%
mutate(label = glue(
"number of listing: {n} rooms
{room_type}")) %>%
ggplot(aes(
x = reorder(room_type, n),
y = n,
text = label
)) +
geom_col(aes(fill = room_type)) +
labs(title = "Population of Listings per Room Type",
x = NULL,
y = "Number of Listings") +
theme_bw()
ggplotly(plotr, tooltip = "text") %>% layout(showlegend = FALSE)
Looking at the plot above, Entire home/apt is the most common listing type in Singapore.
Let us check the average availability in a year for each room type in every region
plot4 <- airsg %>%
group_by(neighbourhood_group, room_type) %>%
summarise(average_avail = mean(availability_365)) %>%
mutate(
label = glue(
"Avg availability {round(average_avail,0)} days
Type: {room_type}
{neighbourhood_group}"
)
) %>%
ggplot(aes(
y = reorder(neighbourhood_group, average_avail),
x = average_avail,
text = label
)) +
geom_col(aes(fill = room_type), position = "dodge", alpha = 0.8) +
scale_fill_viridis(discrete = TRUE) +
scale_color_viridis(discrete = TRUE) +
facet_wrap( ~ room_type, scales = "free_x") +
labs(
title = "Average Days of Availability in a Year",
subtitle = "AirBnb Singapore",
caption = "Data Source: Kaggle.com",
x = "",
y = "",
fill = ""
) +
theme_bw()
ggplotly(plot4, tooltip = "text") %>% layout(legend = list(
orientation = "h",
x = 0.2,
y = -0.1
))
Based on the plot above, for shared rooms, availability in the Central Region is higher than in the other regions. For private rooms, availability in the North Region is higher than in the other regions. For entire homes/apts, availability in the Central Region is higher than in the other regions.
Now let us check the top 10 hosts by number of listings
plothn <- airsg %>%
group_by(host_name) %>%
count(host_name) %>%
arrange(desc(n)) %>%
head(10) %>%
mutate(label = glue(
"Host Name: {host_name}
Total Listing: {n} rooms")) %>%
ggplot(aes(
x = n,
y = reorder(host_name, n),
text = label
)) +
geom_col(aes(fill = host_name), show.legend = F) +
scale_fill_viridis(discrete = TRUE) +
scale_color_viridis(discrete = TRUE) +
labs(title = "Top 10 Hosts with the Most Listings",
x = "Number of Listings",
y = "") +
theme_bw()
ggplotly(plothn, tooltip = "text") %>% layout(showlegend = F)
Based on the plot above, Jay has the most listings, with a whopping 290 rooms.
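One caveat when counting by host_name is that a first name is not a unique identifier, so different hosts who share a name are merged into one bar. The data also has a host_id column that could be used instead. The sketch below compares the two counts in base R on invented data; `toy`, the ids and the names are hypothetical.

```r
# Hypothetical toy data: two different hosts (ids 101 and 202) share one name.
toy <- data.frame(
  host_id = c(101, 101, 202, 202, 202, 303),
  host_name = c("Name A", "Name A", "Name A", "Name A", "Name A", "Name B")
)

# Listings counted by name versus by id.
by_name <- table(toy$host_name)
by_id <- table(toy$host_id)

# Number of distinct host ids behind each name.
ids_per_name <- tapply(toy$host_id, toy$host_name,
                       function(x) length(unique(x)))

print(by_name)
print(by_id)
print(ids_per_name)
```

If `ids_per_name` were greater than 1 for a name in the real data, the count for that name would combine several hosts.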
Mapping
Now let us take a look at the map below, which shows the number of listings across Singapore
Pic <- makeIcon(
iconUrl = "images (1).png",
iconWidth = 100 * 0.35,
iconHeight = 100 * 0.35
)
map <- leaflet()
map <- addTiles(map)
map <- addMarkers(
map,
lng = airsg$longitude,
lat = airsg$latitude,
popup = airsg$name,
clusterOptions = markerClusterOptions(),
icon = Pic
)
map