1 Intro

1.1 Greetings

Hi! Welcome to my Rmd.
This time I am looking at a dataset from an external source (Kaggle). I hope you'll enjoy it.

1.2 Brief

This dataset contains Airbnb metrics for listings in Singapore. The first thing I need to do is load all the packages that might be needed for this dataset.

library(lubridate)
library(scales)
library(readr)
library(dplyr)
library(ggplot2)
library(plotly)
library(tidyr)
library(hrbrthemes)
library(viridis)
library(glue)
library(packcircles)
library(ggiraph)
library(leaflet)
library(ggridges)
library(tidyverse)
library(treemapify)

2 Data Exploration

2.1 Data Input & Structure

We can read the data into R and store it in the ‘master’ object.

master <- read_csv("listings.csv")
glimpse(master)

## Observations: 7,907
## Variables: 16
## $ id                             <dbl> 49091, 50646, 56334, 71609, 718...
## $ name                           <chr> "COZICOMFORT LONG TERM STAY ROO...
## $ host_id                        <dbl> 266763, 227796, 266763, 367042,...
## $ host_name                      <chr> "Francesca", "Sujatha", "France...
## $ neighbourhood_group            <chr> "North Region", "Central Region...
## $ neighbourhood                  <chr> "Woodlands", "Bukit Timah", "Wo...
## $ latitude                       <dbl> 1.44255, 1.33235, 1.44246, 1.34...
## $ longitude                      <dbl> 103.7958, 103.7852, 103.7967, 1...
## $ room_type                      <chr> "Private room", "Private room",...
## $ price                          <dbl> 83, 81, 69, 206, 94, 104, 208, ...
## $ minimum_nights                 <dbl> 180, 90, 6, 1, 1, 1, 1, 90, 90,...
## $ number_of_reviews              <dbl> 1, 18, 20, 14, 22, 39, 25, 174,...
## $ last_review                    <date> 2013-10-21, 2014-12-26, 2015-1...
## $ reviews_per_month              <dbl> 0.01, 0.28, 0.20, 0.15, 0.22, 0...
## $ calculated_host_listings_count <dbl> 2, 1, 2, 9, 9, 9, 9, 4, 4, 4, 3...
## $ availability_365               <dbl> 365, 365, 365, 353, 355, 346, 1...

dim(master)

## [1] 7907   16

As we can see, the dataset has 7,907 rows and 16 columns.
Some columns have the wrong data type, so we convert them to the correct type and save the result in a new object, ‘airbnb’.

airbnb <- master %>% 
  mutate(name = as.factor(name),
         host_name = as.factor(host_name),
         neighbourhood_group = as.factor(neighbourhood_group),
         neighbourhood = as.factor(neighbourhood),
         room_type = as.factor(room_type))
airbnb
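As a small illustration of the conversion step, the sketch below turns every character column of a data frame into a factor using base R only. The data frame `toy` and its values are invented for this example; only the column names are borrowed from the dataset.

```r
# Toy data (invented values); column names follow the dataset
toy <- data.frame(
  neighbourhood_group = c("Central Region", "North Region", "Central Region"),
  room_type = c("Private room", "Entire home/apt", "Private room"),
  price = c(83, 206, 94),
  stringsAsFactors = FALSE
)

# Find the character columns and convert only those to factors
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

str(toy)
levels(toy$room_type)
```

Whether a free-text column such as ‘name’ really benefits from being a factor is a judgment call; the sketch only shows the mechanics.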

2.2 Missing Data

Seems good! Now I want to check for missing data in this dataset.

airbnb %>% 
  is.na() %>% 
  colSums()

##                             id                           name 
##                              0                              2 
##                        host_id                      host_name 
##                              0                              0 
##            neighbourhood_group                  neighbourhood 
##                              0                              0 
##                       latitude                      longitude 
##                              0                              0 
##                      room_type                          price 
##                              0                              0 
##                 minimum_nights              number_of_reviews 
##                              0                              0 
##                    last_review              reviews_per_month 
##                           2758                           2758 
## calculated_host_listings_count               availability_365 
##                              0                              0

As we can see above, around 35% of the rows (2,758 of 7,907) have missing values, all of them in the ‘last_review’ and ‘reviews_per_month’ columns; ‘name’ is also missing in 2 rows.
Both columns describe review activity (a date and a monthly rate), so let's ignore them for now since we won't use them in further analysis.
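The 35% figure can be checked with a line of arithmetic, and the same idea extends to a per-column share of missing values. This is a minimal sketch: the two counts come from the output above, while the data frame `toy` is invented for illustration.

```r
# Counts taken from the glimpse() and colSums() output above
n_rows    <- 7907
n_missing <- 2758
share_missing <- n_missing / n_rows
round(100 * share_missing, 1)   # percentage of rows without review data

# Same idea on a data frame: share of NA per column (toy values, invented)
toy <- data.frame(
  price = c(83, 81, 69, 206),
  reviews_per_month = c(0.01, NA, 0.20, NA)
)
na_share <- colMeans(is.na(toy))
na_share
```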
We will continue by checking the data summary.

summary(airbnb)

##        id          
##  Min.   :   49091  
##  1st Qu.:15821800  
##  Median :24706270  
##  Mean   :23388625  
##  3rd Qu.:32348500  
##  Max.   :38112762  
##                    
##                                                 name     
##  Luxury hostel with in-cabin locker - Single mixed:  13  
##  Inviting & Cozy 1BR APT 3 mins from Tg Pagar MRT :   9  
##  Studio Apartment - Oakwood Premier               :   9  
##  City-located 1BR loft apartment *BRAND NEW*      :   8  
##  Stylish 1BR Located 7 mins from Tg Pagar MRT     :   8  
##  (Other)                                          :7858  
##  NA's                                             :   2  
##     host_id             host_name           neighbourhood_group
##  Min.   :    23666   Jay     : 290   Central Region   :6309    
##  1st Qu.: 23058075   Alvin   : 249   East Region      : 508    
##  Median : 63448912   Richards: 157   North-East Region: 346    
##  Mean   : 91144807   Aaron   : 145   North Region     : 204    
##  3rd Qu.:155381142   Rain    : 115   West Region      : 540    
##  Max.   :288567551   Darcy   : 114                             
##                      (Other) :6837                             
##      neighbourhood     latitude       longitude               room_type   
##  Kallang    :1043   Min.   :1.244   Min.   :103.6   Entire home/apt:4132  
##  Geylang    : 994   1st Qu.:1.296   1st Qu.:103.8   Private room   :3381  
##  Novena     : 537   Median :1.311   Median :103.8   Shared room    : 394  
##  Rochor     : 536   Mean   :1.314   Mean   :103.8                         
##  Outram     : 477   3rd Qu.:1.322   3rd Qu.:103.9                         
##  Bukit Merah: 470   Max.   :1.455   Max.   :104.0                         
##  (Other)    :3850                                                         
##      price         minimum_nights    number_of_reviews
##  Min.   :    0.0   Min.   :   1.00   Min.   :  0.00   
##  1st Qu.:   65.0   1st Qu.:   1.00   1st Qu.:  0.00   
##  Median :  124.0   Median :   3.00   Median :  2.00   
##  Mean   :  169.3   Mean   :  17.51   Mean   : 12.81   
##  3rd Qu.:  199.0   3rd Qu.:  10.00   3rd Qu.: 10.00   
##  Max.   :10000.0   Max.   :1000.00   Max.   :323.00   
##                                                       
##   last_review         reviews_per_month calculated_host_listings_count
##  Min.   :2013-10-21   Min.   : 0.010    Min.   :  1.00                
##  1st Qu.:2018-11-21   1st Qu.: 0.180    1st Qu.:  2.00                
##  Median :2019-06-27   Median : 0.550    Median :  9.00                
##  Mean   :2019-01-11   Mean   : 1.044    Mean   : 40.61                
##  3rd Qu.:2019-08-07   3rd Qu.: 1.370    3rd Qu.: 48.00                
##  Max.   :2019-08-27   Max.   :13.000    Max.   :274.00                
##  NA's   :2758         NA's   :2758                                    
##  availability_365
##  Min.   :  0.0   
##  1st Qu.: 54.0   
##  Median :260.0   
##  Mean   :208.7   
##  3rd Qu.:355.0   
##  Max.   :365.0   
##

We may conclude:
1. ‘Jay’ is the most common host name in the Singapore listings, appearing on 290 listings.
2. ‘Central Region’ has by far the most Airbnb listings in Singapore (6,309), and ‘North Region’ has the fewest (204).
3. Among neighbourhoods, ‘Kallang’ has the most listings (1,043).
4. There are 3 types of listing: ‘Entire home/apt’, ‘Private room’ and ‘Shared room’.
5. The average price is around SGD 169.3, while the median is SGD 124.
6. The required minimum stay averages about 17.5 nights, although the median is only 3 nights.
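The gap between the mean and the median in the summary (for example a mean price of 169.3 against a median of 124, with a maximum of 10,000) suggests that a few extreme values pull the mean upwards. The sketch below shows that effect on an invented price vector; the numbers are not taken from the dataset.

```r
# Invented prices without and with one extreme value
prices <- c(65, 100, 124, 150, 199)
prices_with_outlier <- c(prices, 10000)

mean(prices)                  # 127.6
median(prices)                # 124
mean(prices_with_outlier)     # 1773
median(prices_with_outlier)   # 137
```

The median barely moves while the mean jumps, which is why the median is often the safer "typical value" for skewed columns such as price or minimum_nights.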

3 Study Case

3.1 Price Distributions

Question: Which region has the highest concentration of expensive listings?

I will use a density plot to describe the price distributions.

options(scipen = 88)
Plot_density <- airbnb %>% 
  select(price, neighbourhood_group) %>% 
  ggplot(aes(price, fill = neighbourhood_group, col = neighbourhood_group)) +
  # One semi-transparent density curve per region
  geom_density(alpha = 0.5) +
  scale_fill_viridis(discrete = TRUE) +
  scale_color_viridis(discrete = TRUE) +
  # Dashed red line at the overall mean price
  geom_vline(aes(xintercept = mean(price, na.rm = TRUE)),
             color = "red", linetype = "dashed", size = 1) +
  theme(text = element_text(size = 12)) +
  # Prices above 800 are dropped from the plot
  scale_x_continuous(limits = c(0, 800), breaks = seq(0, 800, 100)) +
  labs(x = "Price", y = NULL)

Plot_density

ggplotly(Plot_density, tooltip = "y")

Answer: From the graph above, we can see that the Central Region has the largest share of higher rental prices compared to the other regions. This is understandable because most of the Central Region is a tourist area, for example Singapore River, Marine Parade, Newton, Rochor, Novena and Kallang. On the other hand, West Region prices are concentrated around SGD 50-100, and the same holds for North Region and North-East Region.
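A numeric summary per region can back up what a density plot suggests. The sketch below computes the median and the upper quartile of price per region with base R on an invented data frame `toy`; applying the same calls to `airbnb` should work in the same way, but the numbers here are not results from the dataset.

```r
# Toy data (invented prices); region labels follow the dataset
toy <- data.frame(
  neighbourhood_group = c("Central Region", "Central Region", "Central Region",
                          "West Region", "West Region", "West Region"),
  price = c(120, 200, 400, 50, 70, 100)
)

# Median and 75th percentile of price within each region
med <- tapply(toy$price, toy$neighbourhood_group, median)
q75 <- tapply(toy$price, toy$neighbourhood_group, quantile, probs = 0.75)

med
q75
```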

Another question: which neighbourhood in the Central Region has the highest typical rental price?

We filter by region (neighbourhood_group), group by area (neighbourhood) and take the median price of each area.

airbnb %>% 
  filter(neighbourhood_group == "Central Region") %>% 
  group_by(neighbourhood) %>% 
  # Median price and number of listings per neighbourhood
  summarise(Price_median = median(price),
            Freq = n()) %>% 
  # Tile area = median price, tile colour = number of listings
  ggplot(aes(area = Price_median, fill = Freq, label = neighbourhood)) +
  geom_treemap() +
  geom_treemap_text(colour = "white", place = "topleft", reflow = TRUE) +
  labs(fill = "Frequency Airbnb")

Answer: Southern Islands is the neighbourhood with the highest median rental price in the Central Region. However, Southern Islands has relatively few listings, fewer than roughly 500.

3.2 Listing Type

Question: What is the most in-demand listing type for staying in Singapore?

We will analyse this using availability_365, which shows the number of unoccupied days in a year for a single Airbnb listing in Singapore. First, create a new column in the airbnb dataset, ‘priceSeg’, which divides price into 3 ranges: <=300, 300<x<500 and >=500.
This time I will use a box plot to illustrate. The aggregation is shown below:

airbnb <- airbnb %>% 
  mutate(priceSeg = case_when(
    price <= 300 ~ "Below SGD300",
    price >300 & price <500 ~ "SGD300-500",
    TRUE ~ "Above SGD500"
  ))
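To make the segment boundaries explicit, here is a minimal base R sketch of the same three-way split. The helper name `price_segment` is hypothetical (it is not part of the original analysis); the labels and cut-offs mirror the `case_when()` call above.

```r
# Hypothetical helper: same rules as the case_when() above
#   price <= 300        -> "Below SGD300"
#   300 < price < 500   -> "SGD300-500"
#   everything else     -> "Above SGD500" (so exactly 500 lands here)
price_segment <- function(price) {
  ifelse(price <= 300, "Below SGD300",
         ifelse(price < 500, "SGD300-500", "Above SGD500"))
}

segments <- price_segment(c(0, 300, 301, 499, 500, 10000))
segments
```

Checking the edge values 300 and 500 is a quick way to confirm which segment each boundary falls into.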

Filter the dataset to the Central Region only, keeping listings with more than 50 reviews.

Plotbx <- airbnb %>% 
  filter(neighbourhood_group == "Central Region", number_of_reviews > 50) %>% 
  ggplot(aes(x = room_type, y = availability_365)) +
  geom_jitter(aes(col = priceSeg, text = paste("Priceseg:", priceSeg, "<br>",
                                               "Room Type:", room_type, "<br>",
                                               "Unoccupied Days:", availability_365)), alpha = 0.5) +
  geom_boxplot(alpha = 0.3, fill = "yellow") + 
  # Red long-dashed line: median availability across all listings in the dataset
  geom_hline(yintercept = median(airbnb$availability_365), color = "red", linetype = 5) +
  scale_y_continuous(breaks = seq(0, 360, 20)) +
  labs(x = NULL, y = "Days of Unoccupied", col = "Price Segment") +
  guides(fill = "none") +
  # Apply the complete theme first so the tweaks below are not overwritten
  theme_ipsum() +
  theme(legend.position = "top", plot.title = element_text(size = 12))

Plotbx

ggplotly(Plotbx, tooltip = "text")

Answer: Based on the box plot above, we conclude that “Entire home/apt” is the most in-demand listing type on Singapore Airbnb, with the lowest median number of unoccupied days at about 120 days a year. Shared rooms in the Central Region are the least in demand, with the highest median number of unoccupied days at around 350 days a year.

3.3 Popular Area

Question: Which area in the Central Region is most in demand?

As in the previous graph, we will use ‘availability_365’ to gauge how popular an area is on Airbnb in Singapore.
Create a new object named ‘ava’ that keeps one Central Region listing per distinct availability value and then takes the 50 lowest values; fewer unoccupied days means a more popular listing.

ava <- airbnb %>% 
  filter(neighbourhood_group == "Central Region") %>% 
  distinct(availability_365, .keep_all = TRUE) %>% 
  arrange(availability_365) %>% 
  top_n(-50, availability_365)
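It may help to see what the combination of `distinct()` and "lowest n" does on a small example. The sketch below reproduces the same logic with base R on an invented data frame `toy`, taking the 3 lowest values instead of 50; it is an illustration of the mechanics, not the original pipeline.

```r
# Toy data (invented availability values); area names follow the dataset
toy <- data.frame(
  neighbourhood = c("Kallang", "Rochor", "Novena", "Outram", "Newton", "Geylang"),
  availability_365 = c(0, 0, 3, 10, 3, 200)
)

# Keep only the first listing for each distinct availability value
one_per_value <- toy[!duplicated(toy$availability_365), ]

# Sort ascending and keep the 3 lowest values
sorted  <- one_per_value[order(one_per_value$availability_365), ]
lowest3 <- head(sorted, 3)
lowest3
```

Note that Rochor and Newton are dropped even though their availability is low, because another listing with the same value appeared first. Filtering with a condition such as `availability_365 < 50` would keep every such listing instead.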

Continue to the graph:

plotAva <- ggplot(ava, aes(neighbourhood, availability_365)) +
  # Point size = price, colour = room type; text feeds the plotly tooltip
  geom_point(alpha = 0.5, aes(size = price, col = room_type, text = paste("Area:", neighbourhood, "<br>",
                                                         "Unoccupied:", availability_365, "<br>",
                                                         "Price (SGD):", price, "<br>",
                                                         "Room Type:", room_type))) +
  coord_flip() +
  labs(title = "Most Popular Area in Central Region, Singapore", subtitle = "less Unoccupied, more popular", x = NULL, y = "Unoccupied(days)", col = "Room Type") +
  guides(size = "none")
               
plotAva

ggplotly(plotAva, tooltip = "text")

Answer: After keeping the listings with the fewest unoccupied days (less than 50), we get several areas, among them Bishan, Bukit Merah, Bukit Timah, Downtown Core, Geylang, Kallang, Marine Parade, Newton, Novena, Outram, Queenstown, River Valley, Rochor and Tanglin.

3.4 Top 10 Host

Question: Who are the top 10 hosts on Airbnb Singapore?

To answer this question, we can use the number_of_reviews column in the airbnb dataset.
Create a new object named ‘Host’ that keeps one listing per host, sorts it in descending order and takes only the top 10 by number of reviews.

Host <- airbnb %>%
  distinct(host_id, .keep_all = TRUE) %>% 
  arrange(desc(number_of_reviews)) %>% 
  top_n(10, number_of_reviews)
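Because `distinct(host_id, .keep_all = TRUE)` keeps only the first listing of each host, the ranking above uses the review count of that one listing. As a possible alternative (not what the original analysis does), the reviews could be summed per host first. The sketch below compares both approaches in base R on an invented data frame `toy`.

```r
# Toy data (invented): host 1 and host 3 each own several listings
toy <- data.frame(
  host_id = c(1, 1, 2, 3, 3, 3),
  host_name = c("A", "A", "B", "C", "C", "C"),
  number_of_reviews = c(5, 40, 30, 10, 10, 15)
)

# Approach used above: first listing per host
first_listing <- toy[!duplicated(toy$host_id), ]

# Alternative: total reviews across all listings of a host
totals <- aggregate(number_of_reviews ~ host_id + host_name, data = toy, FUN = sum)
totals <- totals[order(-totals$number_of_reviews), ]

first_listing
totals
```

In this toy case the first-listing approach ranks host B on top, while the totals put host A first, so the choice between them can change the top 10.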

Continue to create the plot:

plotHost <- 
  ggplot(Host,aes(reorder(host_name,number_of_reviews), number_of_reviews))+
  geom_col(aes(text = paste("Reviews:", number_of_reviews,"<br>",
                            "Host Name:", host_name, "<br>",
                            "Price (SGD):", price, "<br>",
                            "Type:", room_type), fill = room_type))+
  facet_grid(rows = vars(neighbourhood_group), scales = "free_y")+
  geom_point(aes(col = price))+
  labs( x= NULL, y= "Reviews")+
  coord_flip()
  
plotHost

ggplotly(plotHost, tooltip = "text")

Answer: The top 10 hosts by number of reviews are Val, Yuan, Felix, Callie&Kel, Your home away from home, Cheryl, Anita&David, Shu, Shirley and Su. Most of them are in the Central Region and the rest are in the East Region. Only these 2 regions make it into the top 10, while listings in the other regions do not have as many reviews.

3.5 Leaflet

Question: Plot the spread of the listings in the Central Region using leaflet!

Create a new object named ‘peta’ that contains only Central Region listings.

peta <- airbnb %>% 
  filter(neighbourhood_group == "Central Region")

After that, continue to leaflet.

Pic <- makeIcon(iconUrl = "images (1).png", 
                iconWidth = 100*0.35,
                iconHeight = 100*0.35) 

map <- leaflet()
map <- addTiles(map) 

map <- addMarkers(map,
                  lng = peta$longitude,
                  lat = peta$latitude,
                  popup = peta$name,
                  clusterOptions = markerClusterOptions(), 
                  icon = Pic)

map

Answer: The leaflet map above shows the spread of Airbnb listings in the Central Region of Singapore.

Capstone DV

Widya Kania Rahayu

10/26/2019