1 Intro

1.1 Greetings

Hi! Welcome to my Rmd.
I'm using the same data as in my previous LBB DV, which I took from an external source (Kaggle).
Since we are focusing on shinydashboard, I don't include many plots or much analysis in this Rmd (I already did that in the previous LBB DV, so please check it if needed :) ).

I hope you'll enjoy it.

1.2 Brief

This data contains Airbnb listing metrics for New York City, USA. The first thing I need to do is load all the packages that might be needed for this dataset.

library(lubridate)
library(scales)
library(readr)
library(dplyr)
library(ggplot2)
library(plotly)
library(tidyr)
library(hrbrthemes)
library(viridis)
library(glue)

2 Data Exploration

2.1 Data Input & Structure

We read the data into R and store it in the ‘airbnb’ object

airbnb <- read_csv("AB_NYC_2019.csv")

Then we inspect the data

glimpse(airbnb)

## Observations: 48,895
## Variables: 16
## $ id                             <dbl> 2539, 2595, 3647, 3831, 5022, 5...
## $ name                           <chr> "Clean & quiet apt home by the ...
## $ host_id                        <dbl> 2787, 2845, 4632, 4869, 7192, 7...
## $ host_name                      <chr> "John", "Jennifer", "Elisabeth"...
## $ neighbourhood_group            <chr> "Brooklyn", "Manhattan", "Manha...
## $ neighbourhood                  <chr> "Kensington", "Midtown", "Harle...
## $ latitude                       <dbl> 40.64749, 40.75362, 40.80902, 4...
## $ longitude                      <dbl> -73.97237, -73.98377, -73.94190...
## $ room_type                      <chr> "Private room", "Entire home/ap...
## $ price                          <dbl> 149, 225, 150, 89, 80, 200, 60,...
## $ minimum_nights                 <dbl> 1, 1, 3, 1, 10, 3, 45, 2, 2, 1,...
## $ number_of_reviews              <dbl> 9, 45, 0, 270, 9, 74, 49, 430, ...
## $ last_review                    <date> 2018-10-19, 2019-05-21, NA, 20...
## $ reviews_per_month              <dbl> 0.21, 0.38, NA, 4.64, 0.10, 0.5...
## $ calculated_host_listings_count <dbl> 6, 2, 1, 1, 1, 1, 1, 1, 1, 4, 1...
## $ availability_365               <dbl> 365, 355, 365, 194, 0, 129, 0, ...

From the inspection above, we get a short description of the data: airbnb consists of 48,895 rows and 16 columns. Next we need to check the data structure.
Some of the columns need to be converted to factors, so let’s change them and then check the types again.

airbnb <- airbnb %>% 
  mutate(name = as.factor(name),
         host_name = as.factor(host_name),
         neighbourhood_group = as.factor(neighbourhood_group),
         neighbourhood = as.factor(neighbourhood),
         room_type = as.factor(room_type),
         last_review = as.factor(last_review))

glimpse(airbnb)

## Observations: 48,895
## Variables: 16
## $ id                             <dbl> 2539, 2595, 3647, 3831, 5022, 5...
## $ name                           <fct> "Clean & quiet apt home by the ...
## $ host_id                        <dbl> 2787, 2845, 4632, 4869, 7192, 7...
## $ host_name                      <fct> John, Jennifer, Elisabeth, Lisa...
## $ neighbourhood_group            <fct> Brooklyn, Manhattan, Manhattan,...
## $ neighbourhood                  <fct> Kensington, Midtown, Harlem, Cl...
## $ latitude                       <dbl> 40.64749, 40.75362, 40.80902, 4...
## $ longitude                      <dbl> -73.97237, -73.98377, -73.94190...
## $ room_type                      <fct> Private room, Entire home/apt, ...
## $ price                          <dbl> 149, 225, 150, 89, 80, 200, 60,...
## $ minimum_nights                 <dbl> 1, 1, 3, 1, 10, 3, 45, 2, 2, 1,...
## $ number_of_reviews              <dbl> 9, 45, 0, 270, 9, 74, 49, 430, ...
## $ last_review                    <fct> 2018-10-19, 2019-05-21, NA, 201...
## $ reviews_per_month              <dbl> 0.21, 0.38, NA, 4.64, 0.10, 0.5...
## $ calculated_host_listings_count <dbl> 6, 2, 1, 1, 1, 1, 1, 1, 1, 4, 1...
## $ availability_365               <dbl> 365, 355, 365, 194, 0, 129, 0, ...

As we can see, the data types are now as intended.
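The same conversion can be written more compactly with across(), which applies one function to several columns at once. This is only a sketch on a small made-up tibble (the `toy` object and its values are hypothetical, not the Airbnb data), and it assumes dplyr 1.0 or newer.

```r
library(dplyr)

# Hypothetical stand-in for a few airbnb columns
toy <- tibble(
  neighbourhood_group = c("Brooklyn", "Manhattan", "Manhattan"),
  room_type = c("Private room", "Entire home/apt", "Private room"),
  price = c(149, 225, 150)
)

# Convert the listed character columns to factors in one step
toy_fct <- toy %>%
  mutate(across(c(neighbourhood_group, room_type), as.factor))

# Show the class of every column
sapply(toy_fct, class)
```

On the real data, the column list inside c() would be the six columns converted above.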

2.2 Missing Data

Find the missing values in the dataset we loaded

airbnb %>% 
  is.na() %>% 
  colSums()

##                             id                           name 
##                              0                             16 
##                        host_id                      host_name 
##                              0                             21 
##            neighbourhood_group                  neighbourhood 
##                              0                              0 
##                       latitude                      longitude 
##                              0                              0 
##                      room_type                          price 
##                              0                              0 
##                 minimum_nights              number_of_reviews 
##                              0                              0 
##                    last_review              reviews_per_month 
##                          10052                          10052 
## calculated_host_listings_count               availability_365 
##                              0                              0
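The counts above can also be computed per column with summarise() and across(), which keeps the result as a tibble and makes it easy to add the share of missing values. The sketch below uses a small hypothetical tibble (`toy`), not the real data. The last step checks a guess that is not confirmed here: that the rows with a missing last_review are simply listings without any reviews.

```r
library(dplyr)

# Hypothetical stand-in with some missing values
toy <- tibble(
  name = c("A", NA, "C", "D"),
  number_of_reviews = c(9, 0, 0, 270),
  last_review = as.Date(c("2019-06-23", NA, NA, "2019-07-01")),
  reviews_per_month = c(0.21, NA, NA, 4.64)
)

# Number of missing values per column
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Share of missing values per column
na_share <- toy %>%
  summarise(across(everything(), ~ mean(is.na(.x))))

# Assumption to check: a missing last_review means the listing has no reviews
no_review_check <- toy %>%
  filter(is.na(last_review)) %>%
  summarise(all_zero = all(number_of_reviews == 0))

na_counts
na_share
no_review_check
```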

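The missing values are mostly in the review columns: last_review (a date) and reviews_per_month, with 10,052 each. A few listing names (16) and host names (21) are missing as well. Let's ignore these for now, since we won't use the review date in further analysis.
We will continue by checking the data summary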

summary(airbnb)

##        id                                         name      
##  Min.   :    2539   Hillside Hotel                  :   18  
##  1st Qu.: 9471945   Home away from home             :   17  
##  Median :19677284   New york Multi-unit building    :   16  
##  Mean   :19017143   Brooklyn Apartment              :   12  
##  3rd Qu.:29152178   Loft Suite @ The Box House Hotel:   11  
##  Max.   :36487245   (Other)                         :48805  
##                     NA's                            :   16  
##     host_id                 host_name        neighbourhood_group
##  Min.   :     2438   Michael     :  417   Bronx        : 1091   
##  1st Qu.:  7822033   David       :  403   Brooklyn     :20104   
##  Median : 30793816   Sonder (NYC):  327   Manhattan    :21661   
##  Mean   : 67620011   John        :  294   Queens       : 5666   
##  3rd Qu.:107434423   Alex        :  279   Staten Island:  373   
##  Max.   :274321313   (Other)     :47154                         
##                      NA's        :   21                         
##             neighbourhood      latitude       longitude     
##  Williamsburg      : 3920   Min.   :40.50   Min.   :-74.24  
##  Bedford-Stuyvesant: 3714   1st Qu.:40.69   1st Qu.:-73.98  
##  Harlem            : 2658   Median :40.72   Median :-73.96  
##  Bushwick          : 2465   Mean   :40.73   Mean   :-73.95  
##  Upper West Side   : 1971   3rd Qu.:40.76   3rd Qu.:-73.94  
##  Hell's Kitchen    : 1958   Max.   :40.91   Max.   :-73.71  
##  (Other)           :32209                                   
##            room_type         price         minimum_nights   
##  Entire home/apt:25409   Min.   :    0.0   Min.   :   1.00  
##  Private room   :22326   1st Qu.:   69.0   1st Qu.:   1.00  
##  Shared room    : 1160   Median :  106.0   Median :   3.00  
##                          Mean   :  152.7   Mean   :   7.03  
##                          3rd Qu.:  175.0   3rd Qu.:   5.00  
##                          Max.   :10000.0   Max.   :1250.00  
##                                                             
##  number_of_reviews     last_review    reviews_per_month
##  Min.   :  0.00    2019-06-23: 1413   Min.   : 0.010   
##  1st Qu.:  1.00    2019-07-01: 1359   1st Qu.: 0.190   
##  Median :  5.00    2019-06-30: 1341   Median : 0.720   
##  Mean   : 23.27    2019-06-24:  875   Mean   : 1.373   
##  3rd Qu.: 24.00    2019-07-07:  718   3rd Qu.: 2.020   
##  Max.   :629.00    (Other)   :33137   Max.   :58.500   
##                    NA's      :10052   NA's   :10052    
##  calculated_host_listings_count availability_365
##  Min.   :  1.000                Min.   :  0.0   
##  1st Qu.:  1.000                1st Qu.:  0.0   
##  Median :  1.000                Median : 45.0   
##  Mean   :  7.144                Mean   :112.8   
##  3rd Qu.:  2.000                3rd Qu.:227.0   
##  Max.   :327.000                Max.   :365.0   
##

From the summary above, we can conclude several things:
1. There are 3 room types: Entire home/apt, Private room and Shared room. Entire home/apt is the most common (25,409 listings).
2. Prices range from 0 to 10,000 USD, with a mean of 152.7 USD and a median of 106 USD.
3. Manhattan is the neighbourhood group with the most listings (21,661), followed closely by Brooklyn (20,104).
4. The minimum stay ranges from 1 night to 1,250 nights (about 3.4 years), with a mean of about 7 nights and a median of 3 nights.
5. Michael is the most common host name (417 listings), although several different hosts may share that name.
6. “Hillside Hotel” is the most common listing name (18 listings) in New York City in 2019.
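The summary counts host names, not hosts, so a frequent name can belong to several different people. A possible way to check this is to count distinct host_id values per name, and to count listings per host_id. The sketch below runs on a made-up tibble (`toy`, hypothetical values), so the numbers are illustrative only.

```r
library(dplyr)

# Hypothetical stand-in: three different hosts share the name "Michael"
toy <- tibble(
  host_id = c(1, 1, 2, 3, 4, 4, 4),
  host_name = c("Michael", "Michael", "Michael", "Michael",
                "Sonder", "Sonder", "Sonder")
)

# Listings and distinct hosts behind each host name
by_name <- toy %>%
  group_by(host_name) %>%
  summarise(listings = n(), hosts = n_distinct(host_id))

# Listings per individual host, largest first
by_host <- toy %>%
  count(host_id, host_name, name = "listings", sort = TRUE)

by_name
by_host
```

In this toy example "Michael" has the most listings as a name, but the single host with the most listings is "Sonder".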

3 Aggregation & ggplotly

3.1 Price and availability

Let's look at how price is distributed against the number of available days in a year.
For the graph below, we only pick “Manhattan” as the neighbourhood_group and “Private room” as the room type.
Make a new object named ‘A’ which contains the Manhattan neighbourhood group and the Private room type.

A <- airbnb %>% 
  filter(neighbourhood_group == "Manhattan" & room_type == "Private room") %>% 
  select(neighbourhood_group,room_type, availability_365,price)

dim(A)

## [1] 7982    4

Then we create the graph using geom_point and save it into the ‘plotA’ object

plotA <- ggplot(A, aes(price, availability_365)) +
  # `text` is not a ggplot2 aesthetic (ggplot2 warns about it); ggplotly() uses it for the tooltip
  geom_point(aes(text = paste("Price:", price, "<br>",
                              "Availability:", availability_365)),
             color = "orange",
             fill = "#fd90c9",
             shape = 23,
             alpha = 0.7,
             size = 3,
             stroke = 1) +
  geom_smooth() +
  scale_y_continuous(limits = c(0, 400)) +
  labs(title = "Price and Availability in Year", x = "Price", y = "Availability") +
  theme(plot.title = element_text(hjust = 0.5))

plotA

Continue using plotly

ggplotly(plotA, tooltip = "text")

Interpretations:

This graph shows a positive relationship between availability and price below about 600 USD, and a negative relationship above that.
This means that for Private rooms in the Manhattan neighbourhood group priced below about 600 USD, the higher the price, the more days the listing stays available, which suggests lower demand.
Above that price the trend reverses (this behaviour might be influenced by the minimum number of nights or other factors).
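One way to back up a reading like this with numbers is to compute a correlation separately on each side of the price threshold. The sketch below uses a made-up tibble (`toy`, hypothetical values chosen to rise and then fall). The 600 USD cut is only the value read off the graph, and Spearman correlation is my choice here, not something the original analysis used.

```r
library(dplyr)

# Hypothetical stand-in: availability rises with price, then falls
toy <- tibble(
  price = c(50, 100, 200, 400, 700, 900, 1500),
  availability_365 = c(10, 50, 120, 300, 300, 200, 50)
)

# Rank correlation between price and availability within each price band
cor_by_band <- toy %>%
  mutate(band = if_else(price < 600, "below 600", "600 and above")) %>%
  group_by(band) %>%
  summarise(n = n(),
            rho = cor(price, availability_365, method = "spearman"))

cor_by_band
```

On the real object ‘A’, the `n` column would also show how few listings sit above the threshold, which matters when judging how reliable the second trend is.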

3.2 Price Segmentation Vs Neighbourhood Group

Group prices into four ranges: x < 300, 300 <= x <= 500, 500 < x < 1000 and x >= 1000, then store them in a new column named ‘price_seg’ in the ‘airbnb’ dataset

airbnb <- airbnb %>% 
  mutate(price_seg = case_when(
    price < 300 ~ "Below 300",
    price >= 300 & price <=500 ~ "300 to 500",
    price > 500 & price < 1000 ~ "Between 500 - 1000",
    TRUE ~ "Above 1000"
  ))
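It can help to test the segment boundaries on a few hand-picked prices before trusting the new column. The sketch below repeats the same case_when() rules on a hypothetical tibble (`toy`) and additionally turns the result into a factor with explicit levels, so that legends and tables follow price order instead of alphabetical order. The factor step is an optional addition, not part of the original code.

```r
library(dplyr)

# Hypothetical prices sitting on and around each boundary
toy <- tibble(price = c(0, 299, 300, 500, 501, 999, 1000, 10000))

# Segment labels in increasing price order
seg_levels <- c("Below 300", "300 to 500", "Between 500 - 1000", "Above 1000")

toy <- toy %>%
  mutate(
    price_seg = case_when(
      price < 300 ~ "Below 300",
      price >= 300 & price <= 500 ~ "300 to 500",
      price > 500 & price < 1000 ~ "Between 500 - 1000",
      TRUE ~ "Above 1000"
    ),
    # Ordered levels keep the segments sorted by price
    price_seg = factor(price_seg, levels = seg_levels)
  )

toy %>% count(price_seg)
```

Note that a price of exactly 1000 falls into "Above 1000", because the last rule catches everything the first three do not.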

After that, create the graph using a boxplot and geom_jitter to see the distribution. The tooltip text shows the price range, area and price.

plotB <- ggplot(airbnb, aes(neighbourhood_group,price)) +
   geom_jitter(aes(col= price_seg, text = paste("Price Range:", price_seg, "<br>",
                                                "Area:", neighbourhood_group, "<br>",
                                                "Price:", price)), alpha = 0.7) +
   geom_boxplot(alpha=1) +
   scale_y_continuous(limits = c(0,1500), breaks = seq(0,1500, 100))+
   labs(title = "Price by Neighbourhood Group", x= "Neighbourhood Group", y= "Price", col = "Price Segment") +
   theme(plot.title = element_text(hjust = 0.5))

plotB

Continue to plotly

ggplotly(plotB, tooltip = "text")

Interpretations:
a. All areas are mostly populated by listings priced below 300 USD
b. Manhattan and Brooklyn have the widest price distribution: most listings are below 300 USD, but some are above 1000 USD
c. Queens comes third
d. In the Bronx and Staten Island, we don't find many listings above 1000 USD, and only a few between 500 and 1000 USD, which means that in both areas most prices are below 300 USD
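These readings from the plot could be checked with a small table giving the share of each price segment within each neighbourhood group. This is a sketch on a hypothetical tibble (`toy`, made-up rows), assuming a price_seg column like the one created above.

```r
library(dplyr)

# Hypothetical stand-in for neighbourhood_group and price_seg
toy <- tibble(
  neighbourhood_group = c("Manhattan", "Manhattan", "Manhattan", "Manhattan",
                          "Bronx", "Bronx"),
  price_seg = c("Below 300", "Below 300", "Below 300", "Above 1000",
                "Below 300", "Below 300")
)

# Count listings per group and segment, then convert to shares within each group
shares <- toy %>%
  count(neighbourhood_group, price_seg) %>%
  group_by(neighbourhood_group) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

shares
```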

3.3 Host and Amount of Reviews

We want to know which host has the highest number of reviews.
We keep one row per host_id (the first listing of each host) and order the rows from highest to lowest number of reviews.
We only take the top 30 by number of reviews
and name the result ‘C’.

C <- airbnb %>% 
  distinct(host_id, .keep_all = TRUE) %>% 
  arrange(desc(number_of_reviews)) %>% 
  top_n(30, number_of_reviews)

glimpse(C)

## Observations: 30
## Variables: 17
## $ id                             <dbl> 9145202, 891117, 834190, 347432...
## $ name                           <fct> "Room near JFK Queen Bed", "Pri...
## $ host_id                        <dbl> 47621202, 4734398, 2369681, 129...
## $ host_name                      <fct> Dona, Jj, Carol, Asa, Wanda, Li...
## $ neighbourhood_group            <fct> Queens, Manhattan, Manhattan, B...
## $ neighbourhood                  <fct> Jamaica, Harlem, Lower East Sid...
## $ latitude                       <dbl> 40.66730, 40.82264, 40.71921, 4...
## $ longitude                      <dbl> -73.76831, -73.94041, -73.99116...
## $ room_type                      <fct> Private room, Private room, Pri...
## $ price                          <dbl> 47, 49, 99, 160, 60, 55, 120, 6...
## $ minimum_nights                 <dbl> 1, 1, 2, 1, 3, 1, 30, 1, 1, 5, ...
## $ number_of_reviews              <dbl> 629, 594, 540, 488, 480, 474, 4...
## $ last_review                    <fct> 2019-07-05, 2019-06-15, 2019-07...
## $ reviews_per_month              <dbl> 14.58, 7.57, 6.95, 8.14, 6.70, ...
## $ calculated_host_listings_count <dbl> 2, 3, 1, 1, 1, 3, 2, 2, 1, 2, 5...
## $ availability_365               <dbl> 333, 339, 179, 269, 0, 332, 192...
## $ price_seg                      <chr> "Below 300", "Below 300", "Belo...
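Because distinct() keeps only the first row it meets for each host_id, a host with several listings is represented by whichever listing comes first, not by all of their reviews. If the goal is total reviews per host, a possible alternative is to sum over each host's listings and then take the top rows with slice_max(). The sketch below uses a hypothetical tibble (`toy`, made-up values) and assumes dplyr 1.0 or newer.

```r
library(dplyr)

# Hypothetical stand-in: host 1 has two listings
toy <- tibble(
  host_id = c(1, 1, 2, 3),
  host_name = c("Dona", "Dona", "Jj", "Carol"),
  number_of_reviews = c(10, 600, 594, 540)
)

# What distinct() keeps: only the first listing of each host
first_listing <- toy %>%
  distinct(host_id, .keep_all = TRUE)

# Alternative: total reviews over all listings of a host, top 2 hosts
per_host <- toy %>%
  group_by(host_id, host_name) %>%
  summarise(total_reviews = sum(number_of_reviews),
            listings = n(),
            .groups = "drop") %>%
  slice_max(total_reviews, n = 2)

first_listing
per_host
```

In the toy data, "Dona" shows up with 10 reviews after distinct() but ranks first with 610 once both listings are summed.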

Next, create the plot

plotC <- 
  ggplot(C,aes(reorder(host_name,number_of_reviews), number_of_reviews))+
  geom_col(fill ="#f0b81e", aes(text = paste("Reviews:", number_of_reviews,"<br>","Host Name:", host_name)))+
  facet_grid(rows = vars(neighbourhood_group), scales = "free_y")+
  geom_point(aes(col = price, size = price))+
  labs( x= NULL, y= "Reviews")+
  coord_flip()
  
plotC

ggplotly(plotC, tooltip = c("text", "size"))