Description :

  • Airbnb is a platform that allows house and apartment owners to rent their properties to guests for short-term stays.

    • This dataset describes Airbnb listing activity and metrics in New York City (NYC) as of 2023; the review dates it records go back to 2011.



PHASE 1 : Ask

  • About the company:

    • Airbnb, Inc. is an American company based in San Francisco that operates an online marketplace for short- and long-term homestays and experiences.

      • The company was founded in 2008 by Brian Chesky, Nathan Blecharczyk, and Joe Gebbia.

      • Since then, Airbnb has become one of the most successful and valuable start-ups in the world and has significantly affected the HORECA (hotel, restaurant, and catering) industry.


  • Content:

    • This dataset includes information about hosts, geographical availability, and the metrics needed to support data-driven decisions.


    Key task :

    • Identified the business task.
    • Considered key stakeholders.


    Deliverable :

    • Insights from the data that help solve the business problem.



PHASE 2 : Prepare

  • Before we begin, the key points below summarize the vital steps I’ll follow to complete this phase:




  • Using the ROCCC system to assess the credibility and integrity of the data.

    • Reliability: This public dataset is a subset of Airbnb data made available for public use.

    • Originality: This is a subset of the original dataset.

    • Comprehensiveness: It provides information about Airbnb listings, hosts, and various metrics for analysis and research purposes.

    • Current: The dataset is recent, with review dates running into 2023.

    • Cited: Inside Airbnb created the dataset and published it as a public dataset, so the source is credible.

      • On that basis, I treat the data as unbiased and credible. It meets the ROCCC criteria: reliable, original, comprehensive, current, and cited.


    Key task :

    • Downloaded data and stored it appropriately.
    • Identified how it’s organized.
    • Determined the credibility of the data.
    • Considered key stakeholders.


    Deliverable :

    • This dataset describes the listing activities for NYC from 2011 - 2023.



PHASE 3 : Process

  • I am using R because it is well suited to statistical computation, data analysis, and data mining, and is widely used by data scientists.


  • Dependencies :
# install.packages("tidyverse")
# install.packages("dplyr")
# install.packages("skimr")
# install.packages("mice")
# install.packages("randomForest")
# install.packages("corrplot")
# install.packages("ggcorrplot")
  • Libraries :
library(tidyverse)
library(dplyr)
library(skimr)
library(mice)
library(randomForest)
library(corrplot)
library(ggcorrplot)
  • Working Directory :
setwd("D:/Case_Study/Data/compile_nyc")
  • Data Collection :
nyc_list <- read.csv("NYC-Airbnb-2023.csv")
  • Data Wrangling :

    • Ensured the data’s integrity.
    • Ensured column names are consistent.
nyc_list
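One way to check column-name consistency is to test the names against a snake_case pattern and look for duplicates. The sketch below is illustrative only: `toy` and `is_snake_case()` are made-up names and are not part of the original analysis.

```r
# Hypothetical helper: TRUE for names made of lowercase letters, digits and underscores
is_snake_case <- function(x) grepl("^[a-z][a-z0-9_]*$", x)

# Toy data frame standing in for nyc_list (the column names are assumptions)
toy <- data.frame(id = 1:2, host_id = c(10, 20), `Room Type` = c("a", "b"),
                  check.names = FALSE)

# Names that break the convention
bad_names <- names(toy)[!is_snake_case(names(toy))]
bad_names

# 0 means no duplicated column names
anyDuplicated(names(toy))
```

Any name reported in `bad_names` could then be fixed with `rename()` before the analysis continues.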
  • A quick summary before proceeding :
summary(nyc_list) 
##        id                name              host_id           host_name        
##  Min.   :2.595e+03   Length:42931       Min.   :     1678   Length:42931      
##  1st Qu.:1.940e+07   Class :character   1st Qu.: 16085328   Class :character  
##  Median :4.337e+07   Mode  :character   Median : 74338125   Mode  :character  
##  Mean   :2.223e+17                      Mean   :151601209                     
##  3rd Qu.:6.305e+17                      3rd Qu.:268069240                     
##  Max.   :8.405e+17                      Max.   :503872891                     
##                                                                               
##  neighbourhood_group neighbourhood         latitude       longitude     
##  Length:42931        Length:42931       Min.   :40.50   Min.   :-74.25  
##  Class :character    Class :character   1st Qu.:40.69   1st Qu.:-73.98  
##  Mode  :character    Mode  :character   Median :40.72   Median :-73.95  
##                                         Mean   :40.73   Mean   :-73.94  
##                                         3rd Qu.:40.76   3rd Qu.:-73.92  
##                                         Max.   :40.91   Max.   :-73.71  
##                                                                         
##   room_type             price         minimum_nights    number_of_reviews
##  Length:42931       Min.   :    0.0   Min.   :   1.00   Min.   :   0.00  
##  Class :character   1st Qu.:   75.0   1st Qu.:   2.00   1st Qu.:   1.00  
##  Mode  :character   Median :  125.0   Median :   7.00   Median :   5.00  
##                     Mean   :  200.3   Mean   :  18.11   Mean   :  25.86  
##                     3rd Qu.:  200.0   3rd Qu.:  30.00   3rd Qu.:  24.00  
##                     Max.   :99000.0   Max.   :1250.00   Max.   :1842.00  
##                                                                          
##  last_review        reviews_per_month calculated_host_listings_count
##  Length:42931       Min.   : 0.010    Min.   :  1.00                
##  Class :character   1st Qu.: 0.140    1st Qu.:  1.00                
##  Mode  :character   Median : 0.520    Median :  1.00                
##                     Mean   : 1.169    Mean   : 24.05                
##                     3rd Qu.: 1.670    3rd Qu.:  4.00                
##                     Max.   :86.610    Max.   :526.00                
##                     NA's   :10304                                   
##  availability_365 number_of_reviews_ltm   license         
##  Min.   :  0.0    Min.   :   0.000      Length:42931      
##  1st Qu.:  0.0    1st Qu.:   0.000      Class :character  
##  Median : 89.0    Median :   0.000      Mode  :character  
##  Mean   :140.3    Mean   :   7.737                        
##  3rd Qu.:289.0    3rd Qu.:   7.000                        
##  Max.   :365.0    Max.   :1093.000                        
## 
  • Renaming a few columns to make the data easier to understand :
nyc_df <- nyc_list %>%
  rename(list_id = id,
         listing_name = name,
         area = neighbourhood_group,
         geo_location = neighbourhood,
         host_list_count = calculated_host_listings_count,
         reviews_per_year = number_of_reviews_ltm,   # "ltm" = last twelve months
         reviews_per_month_pct = reviews_per_month) %>%
  select(-license)   # drop the license column
nyc_df 
  • Removed the ‘license’ column since it held a value for only one host id.
    • Checking for missing and empty values in the dataset :
skim_without_charts(nyc_df)
Data summary
Name nyc_df
Number of rows 42931
Number of columns 17
_______________________
Column type frequency:
character 6
numeric 11
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_name 0 1 0 249 10 41410 0
host_name 0 1 0 35 5 9832 0
area 0 1 5 13 0 5 0
geo_location 0 1 4 25 0 223 0
room_type 0 1 10 15 0 4 0
last_review 0 1 0 10 10304 2796 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
list_id 0 1.00 2.222772e+17 3.344213e+17 2595.00 19404736.00 43374815.00 6.305016e+17 8.404660e+17
host_id 0 1.00 1.516012e+08 1.621301e+08 1678.00 16085328.00 74338125.00 2.680692e+08 5.038729e+08
latitude 0 1.00 4.073000e+01 6.000000e-02 40.50 40.69 40.72 4.076000e+01 4.091000e+01
longitude 0 1.00 -7.394000e+01 6.000000e-02 -74.25 -73.98 -73.95 -7.392000e+01 -7.371000e+01
price 0 1.00 2.003100e+02 8.950800e+02 0.00 75.00 125.00 2.000000e+02 9.900000e+04
minimum_nights 0 1.00 1.811000e+01 2.746000e+01 1.00 2.00 7.00 3.000000e+01 1.250000e+03
number_of_reviews 0 1.00 2.586000e+01 5.662000e+01 0.00 1.00 5.00 2.400000e+01 1.842000e+03
reviews_per_month_pct 10304 0.76 1.170000e+00 1.790000e+00 0.01 0.14 0.52 1.670000e+00 8.661000e+01
host_list_count 0 1.00 2.405000e+01 8.087000e+01 1.00 1.00 1.00 4.000000e+00 5.260000e+02
availability_365 0 1.00 1.402600e+02 1.420000e+02 0.00 0.00 89.00 2.890000e+02 3.650000e+02
reviews_per_year 0 1.00 7.740000e+00 1.829000e+01 0.00 0.00 0.00 7.000000e+00 1.093000e+03
  • Changed datatype for last_review :
nyc_df$last_review <- as.Date(nyc_df$last_review)
  • Replacing empty strings with NA :
nyc_df[nyc_df==""] <- NA
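As a small illustration of this step, the sketch below counts empty strings per column and then replaces them with NA in character columns only. The data frame `toy` and its values are made up; only the column names echo the dataset.

```r
# Toy data frame; the values are invented for illustration
toy <- data.frame(
  host_name   = c("Ann", "", "Raj"),
  last_review = c("2023-01-15", "", "2022-11-02"),
  price       = c(120, 80, 95),
  stringsAsFactors = FALSE
)

# Count empty strings per column before replacing them
empty_counts <- sapply(toy, function(col) sum(col == "", na.rm = TRUE))
empty_counts

# Replace empty strings with NA, touching character columns only
is_chr <- sapply(toy, is.character)
toy[is_chr] <- lapply(toy[is_chr], function(col) replace(col, col == "", NA))

# Missing values per column after the replacement
colSums(is.na(toy))
```

Restricting the replacement to character columns avoids comparing dates or numbers with `""`, which is one reason to run a step like this before any type conversion.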
  • Everything looks good except two columns, last_review and reviews_per_month_pct, each with 10304 missing values, so data imputation is needed :
    • changed data types before using mice ;
nyc_df$reviews_per_year <- as.integer(nyc_df$reviews_per_year)
nyc_df$last_review <- as.integer(nyc_df$last_review)
nyc_df <- as.data.frame(nyc_df)
skim_without_charts(nyc_df)
Data summary
Name nyc_df
Number of rows 42931
Number of columns 17
_______________________
Column type frequency:
character 5
numeric 12
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_name 10 1 1 249 0 41409 0
host_name 5 1 1 35 0 9831 0
area 0 1 5 13 0 5 0
geo_location 0 1 4 25 0 223 0
room_type 0 1 10 15 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
list_id 0 1.00 2.222772e+17 3.344213e+17 2595.00 19404736.00 43374815.00 6.305016e+17 8.404660e+17
host_id 0 1.00 1.516012e+08 1.621301e+08 1678.00 16085328.00 74338125.00 2.680692e+08 5.038729e+08
latitude 0 1.00 4.073000e+01 6.000000e-02 40.50 40.69 40.72 4.076000e+01 4.091000e+01
longitude 0 1.00 -7.394000e+01 6.000000e-02 -74.25 -73.98 -73.95 -7.392000e+01 -7.371000e+01
price 0 1.00 2.003100e+02 8.950800e+02 0.00 75.00 125.00 2.000000e+02 9.900000e+04
minimum_nights 0 1.00 1.811000e+01 2.746000e+01 1.00 2.00 7.00 3.000000e+01 1.250000e+03
number_of_reviews 0 1.00 2.586000e+01 5.662000e+01 0.00 1.00 5.00 2.400000e+01 1.842000e+03
last_review 10304 0.76 1.885591e+04 8.011800e+02 15106.00 18331.00 19319.00 1.938900e+04 1.942200e+04
reviews_per_month_pct 10304 0.76 1.170000e+00 1.790000e+00 0.01 0.14 0.52 1.670000e+00 8.661000e+01
host_list_count 0 1.00 2.405000e+01 8.087000e+01 1.00 1.00 1.00 4.000000e+00 5.260000e+02
availability_365 0 1.00 1.402600e+02 1.420000e+02 0.00 0.00 89.00 2.890000e+02 3.650000e+02
reviews_per_year 0 1.00 7.740000e+00 1.829000e+01 0.00 0.00 0.00 7.000000e+00 1.093000e+03
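The numeric last_review values in the summary above are dates stored as days since 1970-01-01. The short sketch below shows the round trip using the two extreme dates that appear later in this report; the variable names are illustrative.

```r
# Dates are stored as days since 1970-01-01, so the conversion is lossless
d <- as.Date(c("2011-05-12", "2023-03-06"))
days <- as.integer(d)
days                                   # 15106 19422

# Convert back; passing `origin` explicitly also works on older R versions
back <- as.Date(days, origin = "1970-01-01")
format(back, "%Y")                     # "2011" "2023"
```

The two integers match the p0 and p100 values reported for last_review, which confirms that nothing is lost by handing the column to mice as a number.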
  • Imputing with mice using classification and regression trees (method = "cart").
    • Applied to last_review & reviews_per_month_pct.
mice_date <- mice(nyc_df, m = 10, method = "cart")
## 
##  iter imp variable
##   1   1  last_review  reviews_per_month_pct
##   1   2  last_review  reviews_per_month_pct
##   1   3  last_review  reviews_per_month_pct
##   1   4  last_review  reviews_per_month_pct
##   1   5  last_review  reviews_per_month_pct
##   1   6  last_review  reviews_per_month_pct
##   1   7  last_review  reviews_per_month_pct
##   1   8  last_review  reviews_per_month_pct
##   1   9  last_review  reviews_per_month_pct
##   1   10  last_review  reviews_per_month_pct
##   2   1  last_review  reviews_per_month_pct
##   2   2  last_review  reviews_per_month_pct
##   2   3  last_review  reviews_per_month_pct
##   2   4  last_review  reviews_per_month_pct
##   2   5  last_review  reviews_per_month_pct
##   2   6  last_review  reviews_per_month_pct
##   2   7  last_review  reviews_per_month_pct
##   2   8  last_review  reviews_per_month_pct
##   2   9  last_review  reviews_per_month_pct
##   2   10  last_review  reviews_per_month_pct
##   3   1  last_review  reviews_per_month_pct
##   3   2  last_review  reviews_per_month_pct
##   3   3  last_review  reviews_per_month_pct
##   3   4  last_review  reviews_per_month_pct
##   3   5  last_review  reviews_per_month_pct
##   3   6  last_review  reviews_per_month_pct
##   3   7  last_review  reviews_per_month_pct
##   3   8  last_review  reviews_per_month_pct
##   3   9  last_review  reviews_per_month_pct
##   3   10  last_review  reviews_per_month_pct
##   4   1  last_review  reviews_per_month_pct
##   4   2  last_review  reviews_per_month_pct
##   4   3  last_review  reviews_per_month_pct
##   4   4  last_review  reviews_per_month_pct
##   4   5  last_review  reviews_per_month_pct
##   4   6  last_review  reviews_per_month_pct
##   4   7  last_review  reviews_per_month_pct
##   4   8  last_review  reviews_per_month_pct
##   4   9  last_review  reviews_per_month_pct
##   4   10  last_review  reviews_per_month_pct
##   5   1  last_review  reviews_per_month_pct
##   5   2  last_review  reviews_per_month_pct
##   5   3  last_review  reviews_per_month_pct
##   5   4  last_review  reviews_per_month_pct
##   5   5  last_review  reviews_per_month_pct
##   5   6  last_review  reviews_per_month_pct
##   5   7  last_review  reviews_per_month_pct
##   5   8  last_review  reviews_per_month_pct
##   5   9  last_review  reviews_per_month_pct
##   5   10  last_review  reviews_per_month_pct
## Warning: Number of logged events: 5
nyc_df <- complete(mice_date)
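For readers who want to try the same imputation on a small scale, here is a minimal sketch on simulated data. It assumes the mice package is installed; the data frame `toy`, its values, and the seed are invented for illustration. Passing `seed` makes the imputed values reproducible, and `printFlag = FALSE` suppresses the iteration log.

```r
library(mice)

# Simulated data standing in for the listing metrics (all values are made up)
set.seed(1)
n <- 200
toy <- data.frame(
  number_of_reviews = rpois(n, 20),
  availability_365  = sample(0:365, n, replace = TRUE)
)
toy$reviews_per_month <- round(toy$number_of_reviews / 24 + runif(n, 0, 0.5), 2)

# Blank out 40 values to imitate listings without review data
toy$reviews_per_month[sample(n, 40)] <- NA

# Impute with classification and regression trees; seed fixes the random draws
imp <- mice(toy, m = 5, method = "cart", seed = 2023, printFlag = FALSE)

# Extract the first of the five completed data sets
toy_complete <- complete(imp, 1)
colSums(is.na(toy_complete))
```

Note that `complete(imp, 1)` keeps only one of the imputed data sets; comparing a few of them is a quick way to see how much the imputed values vary.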
  • Formatting Data Types & Mutating New Column :
nyc_df$host_id <- as.numeric(nyc_df$host_id)
nyc_df$price <- as.numeric(nyc_df$price)
nyc_df$minimum_nights <- as.numeric(nyc_df$minimum_nights)
nyc_df$number_of_reviews <- as.numeric(nyc_df$number_of_reviews)


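# An explicit origin keeps this working on R versions where as.Date() requires one for numbers
nyc_df$last_review <- as.Date(nyc_df$last_review, origin = "1970-01-01")
nyc_df$year <- format(nyc_df$last_review, "%Y") # year of the last review, stored as character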
  • Replicating to new dataframe :
test_com <- nyc_df 
skim_without_charts(test_com)
Data summary
Name test_com
Number of rows 42931
Number of columns 18
_______________________
Column type frequency:
character 6
Date 1
numeric 11
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_name 10 1 1 249 0 41409 0
host_name 5 1 1 35 0 9831 0
area 0 1 5 13 0 5 0
geo_location 0 1 4 25 0 223 0
room_type 0 1 10 15 0 4 0
year 0 1 4 4 0 13 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
last_review 0 1 2011-05-12 2023-03-06 2022-03-06 2795

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
list_id 0 1 2.222772e+17 3.344213e+17 2595.00 19404736.00 43374815.00 6.305016e+17 8.404660e+17
host_id 0 1 1.516012e+08 1.621301e+08 1678.00 16085328.00 74338125.00 2.680692e+08 5.038729e+08
latitude 0 1 4.073000e+01 6.000000e-02 40.50 40.69 40.72 4.076000e+01 4.091000e+01
longitude 0 1 -7.394000e+01 6.000000e-02 -74.25 -73.98 -73.95 -7.392000e+01 -7.371000e+01
price 0 1 2.003100e+02 8.950800e+02 0.00 75.00 125.00 2.000000e+02 9.900000e+04
minimum_nights 0 1 1.811000e+01 2.746000e+01 1.00 2.00 7.00 3.000000e+01 1.250000e+03
number_of_reviews 0 1 2.586000e+01 5.662000e+01 0.00 1.00 5.00 2.400000e+01 1.842000e+03
reviews_per_month_pct 0 1 9.100000e-01 1.630000e+00 0.01 0.08 0.25 1.170000e+00 8.661000e+01
host_list_count 0 1 2.405000e+01 8.087000e+01 1.00 1.00 1.00 4.000000e+00 5.260000e+02
availability_365 0 1 1.402600e+02 1.420000e+02 0.00 0.00 89.00 2.890000e+02 3.650000e+02
reviews_per_year 0 1 7.740000e+00 1.829000e+01 0.00 0.00 0.00 7.000000e+00 1.093000e+03
  • The Process phase is complete.


Key task :

  • Imputed by classification and regression trees.
    • Manipulated empty data.
    • Imputed missing data.
  • Renamed columns for better data understanding.
  • Converted Data Types.
    • Created fresh data frame.
    • listing_name has 10 missing values & host_name has 5 missing values.

Deliverable :

  • The dataset is ready for Analysis.



PHASE 4 : Analysis

  • Descriptive analyses are being used to summarize and explore the behavior of the data.

    • Using statistical techniques to understand the pattern.
  • Total Hosts

    • There are 27455 unique host_id values
    • 42931 rows in total, across 18 columns
n_unique(test_com$host_id) 
## [1] 27455
dim(test_com)
## [1] 42931    18
  • Price
    • Minimum listed price: $0
    • Average price: about $200
    • Maximum price: $99,000
    • Median price: $125
summary(test_com$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    75.0   125.0   200.3   200.0 99000.0
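The gap between the mean and the median comes from a few very expensive listings. The toy example below, with made-up prices, shows how a single extreme value moves the mean but not the median.

```r
# Made-up nightly prices with one extreme listing, echoing the $99,000 maximum
prices <- c(75, 100, 125, 150, 99000)

mean(prices)     # 19890: dragged upward by the single extreme value
median(prices)   # 125: unaffected by how extreme the top value is
```

For skewed price data like this, the median is usually the more representative "typical" price.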
  • Removing all rows with a price of $5 or less (in practice, the $0 listings), since a free listing is not meaningful.
    • The minimum listed price is now $10
      • 27430 unique host_id values
      • 42904 rows in total
test_com <- test_com %>% 
  filter(price > 5)
summary(test_com$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    10.0    75.0   125.0   200.4   200.0 99000.0
  • Minimum nights
    • Minimum: 1 night
    • Average: about 18 nights
    • Maximum: 1250 nights
    • Median: 7 nights
summary(test_com$minimum_nights)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00    7.00   18.12   30.00 1250.00
  • Number of reviews per listing
    • Ranges from 0 to 1842
summary(test_com$number_of_reviews)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.00    5.00   25.83   24.00 1842.00
  • Area (after the price filter)
    • Bronx has 1690 listings
    • Brooklyn has 16234 listings
    • Manhattan has 17635 listings
    • Queens has 6916 listings
    • Staten Island has 429 listings
table(test_com$area)
## 
##         Bronx      Brooklyn     Manhattan        Queens Staten Island 
##          1690         16234         17635          6916           429
  • Room Types (after the price filter)
    • Entire home/apt has 24279 listings
    • Hotel room has 170 listings
    • Private room has 17879 listings
    • Shared room has 576 listings
table(test_com$room_type)
## 
## Entire home/apt      Hotel room    Private room     Shared room 
##           24279             170           17879             576
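  • Summed host_list_count by borough (each row carries its host’s full listing count, so hosts with many listings are counted once per listing)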
area_freq <- test_com %>% 
  group_by(area) %>% 
  summarise(total_list = sum(host_list_count))%>% 
  mutate(percent = total_list *100 / sum(total_list))
area_freq
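To make the difference between summing host_list_count and counting listings concrete, here is a toy example in base R. The hosts, areas, and counts are invented; the point is only that a host with three listings contributes 3 + 3 + 3 = 9 to the sum but 3 to the row count.

```r
# Toy listings: host A owns 3 listings, host B owns 1 (all values are made up)
toy <- data.frame(
  host_id         = c("A", "A", "A", "B"),
  area            = c("Manhattan", "Manhattan", "Manhattan", "Queens"),
  host_list_count = c(3, 3, 3, 1)
)

# Summing host_list_count counts host A's portfolio once per row
summed <- tapply(toy$host_list_count, toy$area, sum)

# Counting rows gives the number of listings in each area
counted <- table(toy$area)

summed    # Manhattan 9, Queens 1
counted   # Manhattan 3, Queens 1
```

Which of the two figures is appropriate depends on the question: the sum weights areas by how large their hosts' portfolios are, while the count measures the number of listings.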
  • price range frequency
price_freq <- test_com %>%
  mutate(price_range = case_when(price > 5 & price < 50 ~ "10 - 49",       # minimum price after filtering is 10
                                 price >= 50 & price < 100 ~ "50 - 99",
                                 price >= 100 & price < 200 ~ "100 - 199",
                                 price >= 200 & price < 300 ~ "200 - 299",
                                 price >= 300 & price < 1000 ~ "300 - 999",  # each bin includes its lower bound
                                 price >= 1000 ~ "1000 and above")) %>% 
  group_by(price_range) %>% 
  summarise(total_list = sum(host_list_count)) %>% 
  mutate(percent = total_list *100 / sum(total_list)) %>% 
  arrange(price_range)
price_freq
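An alternative way to build the same bins is base R's `cut()`, which returns a factor whose levels stay in numeric order (character labels sort alphabetically, so "100 - 199" would come before "50 - 99"). The sketch below uses made-up prices and assumes, as in the filtered data, that the minimum price is 10.

```r
# Made-up prices covering every bin boundary
price <- c(10, 49, 50, 199, 200, 300, 999, 1000, 99000)

# right = FALSE makes each bin include its lower bound and exclude its upper bound
price_range <- cut(
  price,
  breaks = c(10, 50, 100, 200, 300, 1000, Inf),
  labels = c("10 - 49", "50 - 99", "100 - 199", "200 - 299", "300 - 999", "1000+"),
  right  = FALSE
)

# Bins appear in numeric order because cut() returns a factor
table(price_range)
```

With this approach a price of exactly 300 falls in "300 - 999" and a price of exactly 1000 falls in "1000+".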
  • total listing with price grouped by geolocation area
geo_area_freq <- test_com %>%
  group_by(geo_location, area) %>% 
  summarise(total_list = sum(host_list_count),
            min_price = min(price),
            avg_price = mean(price),
            max_price = max(price),
            most_price = median(price))
geo_area_freq
  • total listing with room_type grouped by years
year_room_freq <- test_com %>% 
  group_by(year, room_type) %>% 
  summarise(total_list = sum(host_list_count),
            reviews_per_year = sum(reviews_per_year))%>% 
  mutate(percent = 100* reviews_per_year/sum(reviews_per_year))
year_room_freq
  • total listings made by host_name
host_list_count_total <- test_com %>% 
  group_by(host_name) %>% 
  summarise(total_list_count = sum(host_list_count)) %>% 
  arrange(desc(total_list_count))
host_list_count_total
  • total pricing listed by host_name
host_price_total <- test_com %>% 
  group_by(host_name) %>% 
  summarise(total_price = sum(price)) %>% 
  arrange(desc(total_price))
host_price_total
  • total pricing listed by host_name by Room types
host_room_total <- test_com  %>% 
  select(host_name, room_type, price) %>% 
  group_by(host_name, room_type) %>% 
  summarise(total_price = sum(price)) %>% 
  arrange(desc(total_price))
## `summarise()` has grouped output by 'host_name'. You can override using the
## `.groups` argument.
host_room_total
  • Price summary by room type :
room_price_freq <- test_com %>% 
  select(room_type, area, price, host_list_count) %>% 
  group_by(room_type) %>% 
  summarise(min_price = min(price), 
            avg_price = mean(price), 
            most_price = median(price),
            max_price = max(price),
            total_list = sum(host_list_count)) %>% 
  mutate(list_percent = total_list * 100 / sum(total_list))



PHASE 5 : Visualization

  • Creating a new data frame for Correlation matrix
    • formatted as numeric
corr_df <- test_com %>% 
  select(list_id, host_id, price, minimum_nights,
         number_of_reviews, last_review, reviews_per_month_pct,
         host_list_count, reviews_per_year, availability_365, year)

corr_df$year <- as.numeric(corr_df$year)
corr_df$last_review <- as.numeric(corr_df$last_review)
str(corr_df)
## 'data.frame':    42904 obs. of  11 variables:
##  $ list_id              : num  2595 5121 5203 5178 5136 ...
##  $ host_id              : num  2845 7356 7490 8967 7378 ...
##  $ price                : num  150 60 75 68 275 93 295 124 200 81 ...
##  $ minimum_nights       : num  30 30 2 2 60 3 4 3 1 30 ...
##  $ number_of_reviews    : num  49 50 118 575 3 350 45 223 68 189 ...
##  $ last_review          : num  19164 18232 17368 19407 19214 ...
##  $ reviews_per_month_pct: num  0.3 0.3 0.72 3.41 0.03 2.25 0.27 1.32 0.44 1.13 ...
##  $ host_list_count      : int  3 2 1 1 1 1 1 3 4 1 ...
##  $ reviews_per_year     : int  1 0 0 52 1 48 4 17 0 5 ...
##  $ availability_365     : int  314 365 0 106 181 145 1 164 310 207 ...
##  $ year                 : num  2022 2019 2017 2023 2022 ...
  • Correlation matrix
corrplot(cor(corr_df), method = "circle")

ggcorrplot(cor(corr_df), hc.order = T, type = "lower",
           lab = T)
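The correlation step can be tried on a tiny example. The sketch below uses a made-up data frame with a Date column, converts it to a number, and computes the correlation matrix; `use = "complete.obs"` is an optional safeguard that drops rows containing NA.

```r
# Toy frame with a Date column (values are made up)
toy <- data.frame(
  price       = c(150, 60, 75, 275, 93),
  reviews     = c(49, 50, 118, 3, 350),
  last_review = as.Date(c("2022-06-21", "2019-12-02", "2017-07-21",
                          "2022-08-10", "2023-02-19"))
)

# cor() needs numeric input, so convert the Date column to days since 1970-01-01
toy$last_review <- as.numeric(toy$last_review)

# use = "complete.obs" drops rows with NA; harmless here, useful on raw data
cm <- cor(toy, use = "complete.obs")
round(cm, 2)
```

The result is a symmetric matrix with ones on the diagonal, which is the input that `corrplot()` and `ggcorrplot()` expect.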

  • Total listings by borough, based on summed host_list_count
    • Area ~ Total list Frequency
      • Manhattan alone accounts for 62.5 % of the summed total (by row count it holds 17635 of 42904 listings, about 41.1 %)
      • Brooklyn has 22.1 %
      • Queens has 14.6 %
      • Bronx has 0.49 %
      • Staten Island has 0.11 %
area_freq %>% 
  ggplot(aes(area, total_list, fill= area)) +
  geom_bar(position = "dodge", stat = "identity") +
  geom_text(aes(label = total_list), vjust = 0) +
  guides(fill = guide_legend(title = "Area")) +
  theme(legend.position = "none") +
  labs(x = "Cities",
       y = "Total Listings",
       title = "Total Listings made at Cities :",
       caption = "Data Analyst : JP")  + theme_minimal() +
  scale_y_continuous(labels = scales::comma) +
  theme(legend.position = "none")


  • Host Listing with Room Types
[Figure: Room Types at NYC by Cities]


  • A simple visualization on total listings by price range
price_freq %>% 
  ggplot(aes(price_range, total_list, fill = total_list)) +
  geom_bar(position = "dodge", stat = "identity") +
  geom_text(aes(label = total_list), vjust = 0) +
  guides(fill = guide_legend(title = "Total Lists")) +
  labs(x = "Price Range",
       y = "Total Listings",
       title = "Price Range for Total listings :",
       subtitle = "cheap airbnb price range",
       caption = "Data Analyst : JP") + theme_minimal() +
  scale_y_continuous(labels = scales::comma) +
  theme(legend.position = "none")

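  • Boxplot for Room types on listed Price
    • Outliers are treated as genuine values, since the data passed the ROCCC check.
    • The outliers are visible in the plot.
    • The price axis is limited to $10,000, so listings priced above that are dropped from the plot (hence the warning below).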
ggplot(test_com, aes(x = room_type, y = price)) +
  geom_boxplot(aes(fill = room_type)) + scale_y_log10(limits = c(1, 10000), labels = scales::comma) +
  geom_hline(yintercept = mean(test_com$price), color = "purple", linetype = 6) +
  annotate("text", x = 1,
           y = median(test_com$price[test_com$room_type == "Entire home/apt"]), 
           label = round(median(test_com$price[test_com$room_type == "Entire home/apt"]), 2), 
           size = 5, color = "white") +
  annotate("text", x = 2, 
           y = median(test_com$price[test_com$room_type == "Hotel room"]), 
           label = round(median(test_com$price[test_com$room_type == "Hotel room"]), 2), 
           size = 5, color = "red") +
  annotate("text", x = 3, 
           y = median(test_com$price[test_com$room_type == "Private room"]), 
           label = round(median(test_com$price[test_com$room_type == "Private room"]), 2), 
           size = 5, color = "lightgreen") +
  annotate("text", x = 4, 
           y = median(test_com$price[test_com$room_type == "Shared room"]), 
           label = round(median(test_com$price[test_com$room_type == "Shared room"]), 2), 
           size = 5, color = "green") +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(x = "Room Type",
       y = "Price",
       title = "Price Distribution by Room Type :",
       caption = "Data Analyst : JP") 
## Warning: Removed 7 rows containing non-finite values (`stat_boxplot()`).
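The warning arises because values outside the scale limits are turned into missing values before the boxplot statistics are computed. A quick way to see how many rows a given pair of limits would drop is to count them directly, as in this sketch with made-up prices.

```r
# Made-up prices; two of them fall outside the plotted range of 1 to 10,000
price  <- c(10, 125, 200, 9999, 10000, 15000, 99000)
limits <- c(1, 10000)

# Flag values below the lower limit or above the upper limit
outside <- price < limits[1] | price > limits[2]

sum(outside)       # 2 values would be dropped from the plot
price[outside]     # 15000 99000
```

Since the minimum price after filtering is $10, every row removed in the plot above must be a listing priced over $10,000.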

  • Room Types by Average Price, Labelled with Median Price
    • A clear view of average price vs median price per room type
room_price_freq %>% 
  ggplot(aes(room_type, avg_price, fill = room_type)) +
  geom_bar(position = "dodge", stat = "identity") +
  guides(fill = guide_legend(title = "Most Price")) +
  geom_text(aes(label = most_price), vjust = 0) +
  labs(x = "Room Types",
       y = "Average Price",
       title = "Room Types with Average Price :",
       subtitle = "Median Price floating for Room Types",
       caption = "Data Analyst : JP") +
  theme_minimal() +
  scale_y_continuous(labels = scales::comma) +
  theme(legend.position = "none")

  • Host Listing with Average Price By Cities
[Figure: Average Price Listed at NYC by Cities]
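  • Total Listings over the Years :
    • Here year is the year of each listing’s last review, so the chart shows listings by most recent review rather than by creation date.
    • Totals rise from 2017 onward, suggesting growing host activity on Airbnb.
      • Shared rooms contribute almost nothing in any year.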
year_room_freq %>% 
  ggplot(aes(year, total_list, color = room_type)) +
  scale_y_continuous(labels = scales::comma) +
  geom_point(size = 2, alpha = 1) +   # alpha must lie between 0 and 1
  labs(x = "Years",
       y = "Total Listings",
       title = "Total Listings made over Years :",
       subtitle = "",
       caption = "Data Analyst : JP")  +
  theme_minimal()

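  • Top 5 Listings by Host Names :
    • Blueground ranks first with a summed host_list_count of 276676. This equals 526 × 526, and 526 is the largest host_list_count in the data, which suggests the sum repeats the host’s count on every row and overstates the number of distinct listings.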
host_list_count_total %>% 
  slice(1:5) %>% 
  ggplot(aes(host_name, total_list_count, fill = host_name)) +
  geom_bar(position = "dodge", stat = "identity") +
  geom_text(aes(label = total_list_count), vjust = 0) +
  theme(legend.position = "none") +
  labs(x = "Host Name",
       y = "Total Listings",
       title = "Top 5 Listings by Hosts",
       subtitle = "",
       caption = "Data Analyst : JP")  +
  theme_minimal() +
  scale_y_continuous(labels = scales::comma) +
  theme(legend.position = "none")

  • Top Prices by Host Names :
    • RoomPicks ranks first, with listed prices summing to $318,578
      • Blueground is second, at $175,662
host_room_total %>% 
  filter( total_price >= 52089) %>% 
  ggplot(aes(total_price, host_name, fill = total_price)) +
  geom_bar(position = "dodge", stat = "identity") +
  scale_fill_steps2() +
  geom_text(aes(label = total_price), vjust = 0) +
  theme(legend.position = "none") +
  labs(x = "Total Price in $ ",
       y = "Host Name",
       color = "Room Type",
       title = "Top Prices by Host Names",
       subtitle = "",
       caption = "Data Analyst : JP") +
  theme_minimal() +
  theme(legend.position = "none")

  • Sum of Total Price by Host Listing based on Cities :
[Figure: Total Price Listed at NYC by Cities]
  • Reviews per month vs reviews per year :
test_com %>% 
  ggplot(aes(reviews_per_month_pct, reviews_per_year, color = room_type)) +
  geom_point(size = 2, alpha = 0.8)   +
  geom_smooth(method = 'lm' , se = F, color = 'purple') +
  labs(x = "Reviews per Month PCT",
       y = "Reviews per Year",
       color = "Room type",
       title = "Airbnb's Reviews per Month by Year:",
       subtitle = "Linear Regression Model has 'Strong' fit",
       caption = "Data Analyst : JP") +
  annotate("text", x= 15, y= 900, label = "R^2 =  0.73", color = "darkgreen",
           fontface = "bold", size = 5, angle = 25 ) +
  theme_minimal()
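The R^2 value in the annotation is typed in by hand. As a hedged alternative, the label can be computed from a fitted model so that it always matches the data. The sketch below uses made-up numbers in place of the two review columns; `r2_label` is an illustrative name.

```r
# Made-up data standing in for reviews per month (x) and reviews per year (y)
x <- c(0.1, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0)
y <- c(1, 4, 9, 12, 20, 25, 41)

# Fit a simple linear regression and extract R^2
fit <- lm(y ~ x)
r2  <- summary(fit)$r.squared

# For a single predictor, R^2 equals the squared Pearson correlation
all.equal(r2, cor(x, y)^2)     # TRUE

# Build the plot label from the fitted value instead of typing it by hand
r2_label <- sprintf("R^2 = %.2f", r2)
r2_label
```

The resulting string could then be passed as the `label` argument of `annotate()`.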



PHASE 6 : Act

  • The Act phase is carried out by the company’s executive team, so the documented report is passed to the Director and the team.


  • Data-Driven Decision-Making :

  • RoomPicks and Blueground are the top two hosts in NYC by summed listing price, at $318,578 and $175,662 respectively.

  • Blueground has the highest summed host_list_count of any host in NYC, at 276,676 (526 × 526, so a sum over repeated rows rather than a count of distinct listings).

  • Totals rise from 2017 onward, a positive trend for Airbnb.

  • Hotel rooms are the most expensive room type by average price.

  • Manhattan alone accounts for 62.5 % of the summed host_list_count.


  • Suggestion(s) :

  • Apart from hotel rooms, the other room types are cheaper on average and should appeal to travellers.

  • The Bronx should suit travellers: they can still explore the city from there, and it has the lowest average price of the five boroughs.

  • Staten Island is less crowded but has a higher average price, so it suits travellers who prefer fewer crowds.