EDA on NYC Airbnb [ R ]

Description :

Airbnb is a platform that allows house and apartment owners to rent their properties to guests for short-term stays.
- The earliest review in this dataset dates from 2011. The dataset describes listing activity and metrics in NYC as of 2023.

PHASE 1 : Ask

About the company:
- Airbnb, Inc is an American San Francisco-based company operating an online marketplace for short- and long-term homestays and experiences.
  - The company was founded in 2008 by Brian Chesky, Nathan Blecharczyk, and Joe Gebbia.
  - Since it was founded in 2008, Airbnb has become one of the most successful and valuable start-ups in the world and has significantly impacted the HORECA (hotel, restaurant, and catering) industry.

Content:
- This dataset includes information about hosts, geographical availability, and the metrics needed to support data-driven decision-making.
Key task :
- Identified the business task.
- Considered key stakeholders.
Deliverable :
- Insights from the data that help solve the business problem.

PHASE 2 : Prepare

Before we begin, here are the key points about the data source:

This is a public dataset of Airbnb listings, and the original source can be found on Inside Airbnb.
- Downloaded the data and stored it on my Google Drive.
- The data has been made available by Inside Airbnb with No Copyright (CC0: Public Domain).

Using the ROCCC system to assess the credibility and integrity of the data.
- Reliability: This data is reliable. This public dataset is a subset of Airbnb data and is made available for public use.
- Originality: This is a subset of the original dataset.
- Comprehensiveness: This data is comprehensive. It provides information about Airbnb listings, hosts, and various metrics for analysis and research purposes.
- Current: The dataset is recent (listings as of 2023).
- Cited: Inside Airbnb created the dataset and published it publicly, so the source is credible.
  - On these grounds, I treat the data as unbiased and credible. It meets the ROCCC criteria: reliable, original, comprehensive, current, and cited.
Key task :
- Downloaded data and stored it appropriately.
- Identified how it’s organized.
- Determined the credibility of the data.
- Considered key stakeholders.
Deliverable :
- This dataset describes the listing activities for NYC from 2011 - 2023.

PHASE 3 : Process

I chose R because it is well suited to statistical computation, analysis, and data mining, and it is widely used by data scientists.

Dependencies :

# install.packages("tidyverse")
# install.packages("dplyr")
# install.packages("skimr")
# install.packages("mice")
# install.packages("randomForest")
# install.packages("corrplot")
# install.packages("ggcorrplot")

Libraries :

library(tidyverse)
library(dplyr)
library(skimr)
library(mice)
library(randomForest)
library(corrplot)
library(ggcorrplot)

Working Directory :

setwd("D:/Case_Study/Data/compile_nyc")

Data Collection :

nyc_list <- read.csv("NYC-Airbnb-2023.csv")

Data Wrangling :
- Ensured the data’s integrity.
- Ensured column names are consistent.

nyc_list
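
The integrity checks themselves are not shown in this report. A minimal sketch of what such checks might look like, using a small made-up data frame (toy_list and its values are hypothetical and not taken from the dataset):

```r
# Hypothetical stand-in for nyc_list; values are illustrative only.
toy_list <- data.frame(
  id = c(1, 2, 2, 3),
  name = c("Loft", "Studio", "Studio", ""),
  price = c(120, 80, 80, 0),
  stringsAsFactors = FALSE
)

# Column names should be lower-case snake_case with no spaces.
names_ok <- all(grepl("^[a-z][a-z0-9_]*$", names(toy_list)))

# Duplicate listing ids point to repeated rows.
n_dup_ids <- sum(duplicated(toy_list$id))

# Empty strings are not counted by is.na(), so count them separately.
n_empty_names <- sum(toy_list$name == "")

# Drop exact duplicate rows.
toy_clean <- toy_list[!duplicated(toy_list), ]
```

Running the same three checks on nyc_list before renaming anything would give a quick baseline for the cleaning steps that follow.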

A quick summary before proceeding :

summary(nyc_list)

##        id                name              host_id           host_name        
##  Min.   :2.595e+03   Length:42931       Min.   :     1678   Length:42931      
##  1st Qu.:1.940e+07   Class :character   1st Qu.: 16085328   Class :character  
##  Median :4.337e+07   Mode  :character   Median : 74338125   Mode  :character  
##  Mean   :2.223e+17                      Mean   :151601209                     
##  3rd Qu.:6.305e+17                      3rd Qu.:268069240                     
##  Max.   :8.405e+17                      Max.   :503872891                     
##                                                                               
##  neighbourhood_group neighbourhood         latitude       longitude     
##  Length:42931        Length:42931       Min.   :40.50   Min.   :-74.25  
##  Class :character    Class :character   1st Qu.:40.69   1st Qu.:-73.98  
##  Mode  :character    Mode  :character   Median :40.72   Median :-73.95  
##                                         Mean   :40.73   Mean   :-73.94  
##                                         3rd Qu.:40.76   3rd Qu.:-73.92  
##                                         Max.   :40.91   Max.   :-73.71  
##                                                                         
##   room_type             price         minimum_nights    number_of_reviews
##  Length:42931       Min.   :    0.0   Min.   :   1.00   Min.   :   0.00  
##  Class :character   1st Qu.:   75.0   1st Qu.:   2.00   1st Qu.:   1.00  
##  Mode  :character   Median :  125.0   Median :   7.00   Median :   5.00  
##                     Mean   :  200.3   Mean   :  18.11   Mean   :  25.86  
##                     3rd Qu.:  200.0   3rd Qu.:  30.00   3rd Qu.:  24.00  
##                     Max.   :99000.0   Max.   :1250.00   Max.   :1842.00  
##                                                                          
##  last_review        reviews_per_month calculated_host_listings_count
##  Length:42931       Min.   : 0.010    Min.   :  1.00                
##  Class :character   1st Qu.: 0.140    1st Qu.:  1.00                
##  Mode  :character   Median : 0.520    Median :  1.00                
##                     Mean   : 1.169    Mean   : 24.05                
##                     3rd Qu.: 1.670    3rd Qu.:  4.00                
##                     Max.   :86.610    Max.   :526.00                
##                     NA's   :10304                                   
##  availability_365 number_of_reviews_ltm   license         
##  Min.   :  0.0    Min.   :   0.000      Length:42931      
##  1st Qu.:  0.0    1st Qu.:   0.000      Class :character  
##  Median : 89.0    Median :   0.000      Mode  :character  
##  Mean   :140.3    Mean   :   7.737                        
##  3rd Qu.:289.0    3rd Qu.:   7.000                        
##  Max.   :365.0    Max.   :1093.000                        
##

Renaming a few columns to make the data easier to understand :

nyc_df <- nyc_list %>% 
  rename(list_id = id,
         listing_name = name,
         area = neighbourhood_group,                        # borough
         geo_location = neighbourhood,
         host_list_count = calculated_host_listings_count,
         reviews_per_year = number_of_reviews_ltm,          # reviews in the last twelve months
         reviews_per_month_pct = reviews_per_month) %>%     # a monthly rate, not a percentage
  select(-license)
nyc_df

Removed the ‘license’ column since it held a value for only one host id.
- Checking for missing & empty values in the dataset :

skim_without_charts(nyc_df)

Data summary
Name	nyc_df
Number of rows	42931
Number of columns	17
_______________________
Column type frequency:
character	6
numeric	11
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	empty	n_unique
listing_name	1	0	249	10	41410
host_name	1	0	35	5	9832
area	1	5	13	0	5
geo_location	1	4	25	0	223
room_type	1	10	15	0	4
last_review	1	0	10	10304	2796

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100
list_id	0	1.00	2.222772e+17	3.344213e+17	2595.00	19404736.00	43374815.00	6.305016e+17	8.404660e+17
host_id	0	1.00	1.516012e+08	1.621301e+08	1678.00	16085328.00	74338125.00	2.680692e+08	5.038729e+08
latitude	0	1.00	4.073000e+01	6.000000e-02	40.50	40.69	40.72	4.076000e+01	4.091000e+01
longitude	0	1.00	-7.394000e+01	6.000000e-02	-74.25	-73.98	-73.95	-7.392000e+01	-7.371000e+01
price	0	1.00	2.003100e+02	8.950800e+02	0.00	75.00	125.00	2.000000e+02	9.900000e+04
minimum_nights	0	1.00	1.811000e+01	2.746000e+01	1.00	2.00	7.00	3.000000e+01	1.250000e+03
number_of_reviews	0	1.00	2.586000e+01	5.662000e+01	0.00	1.00	5.00	2.400000e+01	1.842000e+03
reviews_per_month_pct	10304	0.76	1.170000e+00	1.790000e+00	0.01	0.14	0.52	1.670000e+00	8.661000e+01
host_list_count	0	1.00	2.405000e+01	8.087000e+01	1.00	1.00	1.00	4.000000e+00	5.260000e+02
availability_365	0	1.00	1.402600e+02	1.420000e+02	0.00	0.00	89.00	2.890000e+02	3.650000e+02
reviews_per_year	0	1.00	7.740000e+00	1.829000e+01	0.00	0.00	0.00	7.000000e+00	1.093000e+03

Changed datatype for last_review :

nyc_df$last_review <- as.Date(nyc_df$last_review)

Replacing empty strings with NA :

nyc_df[nyc_df==""] <- NA
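
Comparing the whole data frame with "" also touches the Date column. A column-wise alternative, sketched here on a hypothetical toy data frame (the helper name blank_to_na is my own), limits the replacement to character columns and converts dates afterwards:

```r
# Hypothetical toy data; values are illustrative only.
toy <- data.frame(
  host_name = c("Ann", "", "Raj"),
  last_review = c("2023-01-05", "", "2022-11-30"),
  price = c(100, 150, 90),
  stringsAsFactors = FALSE
)

# Replace "" with NA in character columns only; other columns pass through.
blank_to_na <- function(x) {
  if (is.character(x)) x[which(x == "")] <- NA
  x
}
toy[] <- lapply(toy, blank_to_na)

# Convert after the blanks became NA; an explicit format avoids guessing.
toy$last_review <- as.Date(toy$last_review, format = "%Y-%m-%d")

# Missing values per column.
colSums(is.na(toy))
```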

Data imputation is needed: everything looks good except two columns (last_review and reviews_per_month_pct).
- Changed datatypes before using mice :

nyc_df$reviews_per_year <- as.integer(nyc_df$reviews_per_year)
nyc_df$last_review <- as.integer(nyc_df$last_review)
nyc_df <- as.data.frame(nyc_df)
skim_without_charts(nyc_df)

Data summary
Name	nyc_df
Number of rows	42931
Number of columns	17
_______________________
Column type frequency:
character	5
numeric	12
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
listing_name	10	1	1	249	41409
host_name	5	1	1	35	9831
area	0	1	5	13	5
geo_location	0	1	4	25	223
room_type	0	1	10	15	4

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100
list_id	0	1.00	2.222772e+17	3.344213e+17	2595.00	19404736.00	43374815.00	6.305016e+17	8.404660e+17
host_id	0	1.00	1.516012e+08	1.621301e+08	1678.00	16085328.00	74338125.00	2.680692e+08	5.038729e+08
latitude	0	1.00	4.073000e+01	6.000000e-02	40.50	40.69	40.72	4.076000e+01	4.091000e+01
longitude	0	1.00	-7.394000e+01	6.000000e-02	-74.25	-73.98	-73.95	-7.392000e+01	-7.371000e+01
price	0	1.00	2.003100e+02	8.950800e+02	0.00	75.00	125.00	2.000000e+02	9.900000e+04
minimum_nights	0	1.00	1.811000e+01	2.746000e+01	1.00	2.00	7.00	3.000000e+01	1.250000e+03
number_of_reviews	0	1.00	2.586000e+01	5.662000e+01	0.00	1.00	5.00	2.400000e+01	1.842000e+03
last_review	10304	0.76	1.885591e+04	8.011800e+02	15106.00	18331.00	19319.00	1.938900e+04	1.942200e+04
reviews_per_month_pct	10304	0.76	1.170000e+00	1.790000e+00	0.01	0.14	0.52	1.670000e+00	8.661000e+01
host_list_count	0	1.00	2.405000e+01	8.087000e+01	1.00	1.00	1.00	4.000000e+00	5.260000e+02
availability_365	0	1.00	1.402600e+02	1.420000e+02	0.00	0.00	89.00	2.890000e+02	3.650000e+02
reviews_per_year	0	1.00	7.740000e+00	1.829000e+01	0.00	0.00	0.00	7.000000e+00	1.093000e+03

Imputation with classification and regression trees, using mice with method = "cart".
- For last_review & reviews_per_month_pct.

mice_date <- mice(nyc_df, m = 10, method = "cart")

## 
##  iter imp variable
##   1   1  last_review  reviews_per_month_pct
##   1   2  last_review  reviews_per_month_pct
##   1   3  last_review  reviews_per_month_pct
##   1   4  last_review  reviews_per_month_pct
##   1   5  last_review  reviews_per_month_pct
##   1   6  last_review  reviews_per_month_pct
##   1   7  last_review  reviews_per_month_pct
##   1   8  last_review  reviews_per_month_pct
##   1   9  last_review  reviews_per_month_pct
##   1   10  last_review  reviews_per_month_pct
##   2   1  last_review  reviews_per_month_pct
##   2   2  last_review  reviews_per_month_pct
##   2   3  last_review  reviews_per_month_pct
##   2   4  last_review  reviews_per_month_pct
##   2   5  last_review  reviews_per_month_pct
##   2   6  last_review  reviews_per_month_pct
##   2   7  last_review  reviews_per_month_pct
##   2   8  last_review  reviews_per_month_pct
##   2   9  last_review  reviews_per_month_pct
##   2   10  last_review  reviews_per_month_pct
##   3   1  last_review  reviews_per_month_pct
##   3   2  last_review  reviews_per_month_pct
##   3   3  last_review  reviews_per_month_pct
##   3   4  last_review  reviews_per_month_pct
##   3   5  last_review  reviews_per_month_pct
##   3   6  last_review  reviews_per_month_pct
##   3   7  last_review  reviews_per_month_pct
##   3   8  last_review  reviews_per_month_pct
##   3   9  last_review  reviews_per_month_pct
##   3   10  last_review  reviews_per_month_pct
##   4   1  last_review  reviews_per_month_pct
##   4   2  last_review  reviews_per_month_pct
##   4   3  last_review  reviews_per_month_pct
##   4   4  last_review  reviews_per_month_pct
##   4   5  last_review  reviews_per_month_pct
##   4   6  last_review  reviews_per_month_pct
##   4   7  last_review  reviews_per_month_pct
##   4   8  last_review  reviews_per_month_pct
##   4   9  last_review  reviews_per_month_pct
##   4   10  last_review  reviews_per_month_pct
##   5   1  last_review  reviews_per_month_pct
##   5   2  last_review  reviews_per_month_pct
##   5   3  last_review  reviews_per_month_pct
##   5   4  last_review  reviews_per_month_pct
##   5   5  last_review  reviews_per_month_pct
##   5   6  last_review  reviews_per_month_pct
##   5   7  last_review  reviews_per_month_pct
##   5   8  last_review  reviews_per_month_pct
##   5   9  last_review  reviews_per_month_pct
##   5   10  last_review  reviews_per_month_pct

## Warning: Number of logged events: 5

nyc_df <- complete(mice_date)
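
mice() draws imputations at random, so results change between runs unless a seed is fixed, and complete() returns only the first of the m imputed datasets by default. A small sketch on the nhanes data that ships with mice (the seed value and m = 5 are arbitrary choices for illustration):

```r
library(mice)

# nhanes is a small example dataset bundled with mice; it contains NAs.
imp <- mice(nhanes, m = 5, method = "cart", seed = 2023, printFlag = FALSE)

# complete() with action = 1 returns the first imputed dataset.
first_completed <- complete(imp, action = 1)

# The same call with the same seed reproduces the same imputations.
imp_again <- mice(nhanes, m = 5, method = "cart", seed = 2023, printFlag = FALSE)
same_result <- identical(first_completed, complete(imp_again, action = 1))

# Logged events (NULL when there are none) explain warnings such as
# "Number of logged events: 5".
imp$loggedEvents
```

Passing a seed to the mice() call in this report would make the imputed last_review and reviews_per_month_pct values repeatable.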

Formatting Data Types & Mutating New Column :

nyc_df$host_id <- as.numeric(nyc_df$host_id)
nyc_df$price <- as.numeric(nyc_df$price)
nyc_df$minimum_nights <- as.numeric(nyc_df$minimum_nights)
nyc_df$number_of_reviews <- as.numeric(nyc_df$number_of_reviews)


nyc_df$last_review <- as.Date(nyc_df$last_review, origin = "1970-01-01") # integer day counts back to Date
nyc_df$year <- format(as.Date(nyc_df$last_review), "%Y") # mutated year
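
The integer form of last_review is a count of days since 1970-01-01, which is why the maximum 19422 in the earlier summary comes back as 2023-03-06. A short round-trip sketch (variable names are my own):

```r
d <- as.Date("2023-03-06")

# A Date is stored as days since 1970-01-01.
n <- as.integer(d)

# Converting back needs the same origin (older R versions require it explicitly).
back <- as.Date(n, origin = "1970-01-01")

# Extract the year as a character string, as done for the year column.
yr <- format(back, "%Y")
```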

Replicating to new dataframe :

test_com <- nyc_df 
skim_without_charts(test_com)

Data summary
Name	test_com
Number of rows	42931
Number of columns	18
_______________________
Column type frequency:
character	6
Date	1
numeric	11
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
listing_name	10	1	1	249	41409
host_name	5	1	1	35	9831
area	0	1	5	13	5
geo_location	0	1	4	25	223
room_type	0	1	10	15	4
year	0	1	4	4	13

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
last_review	0	1	2011-05-12	2023-03-06	2022-03-06	2795

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100
list_id	1	2.222772e+17	3.344213e+17	2595.00	19404736.00	43374815.00	6.305016e+17	8.404660e+17
host_id	1	1.516012e+08	1.621301e+08	1678.00	16085328.00	74338125.00	2.680692e+08	5.038729e+08
latitude	1	4.073000e+01	6.000000e-02	40.50	40.69	40.72	4.076000e+01	4.091000e+01
longitude	1	-7.394000e+01	6.000000e-02	-74.25	-73.98	-73.95	-7.392000e+01	-7.371000e+01
price	1	2.003100e+02	8.950800e+02	0.00	75.00	125.00	2.000000e+02	9.900000e+04
minimum_nights	1	1.811000e+01	2.746000e+01	1.00	2.00	7.00	3.000000e+01	1.250000e+03
number_of_reviews	1	2.586000e+01	5.662000e+01	0.00	1.00	5.00	2.400000e+01	1.842000e+03
reviews_per_month_pct	1	9.100000e-01	1.630000e+00	0.01	0.08	0.25	1.170000e+00	8.661000e+01
host_list_count	1	2.405000e+01	8.087000e+01	1.00	1.00	1.00	4.000000e+00	5.260000e+02
availability_365	1	1.402600e+02	1.420000e+02	0.00	0.00	89.00	2.890000e+02	3.650000e+02
reviews_per_year	1	7.740000e+00	1.829000e+01	0.00	0.00	0.00	7.000000e+00	1.093000e+03

Process part done.

Key task :
- Replaced empty strings with NA.
- Imputed missing data with classification and regression trees.
- Renamed columns for better data understanding.
- Converted data types.
- Created a fresh data frame.
- listing_name still has 10 missing values & host_name has 5 missing values.

Deliverable :
- The dataset is ready for analysis.

PHASE 4 : Analysis

Descriptive analysis is used to summarize and explore the behavior of the data.
- Using statistical techniques to understand the patterns.
Total Hosts
- There are 27455 unique host_id values
- 42931 total rows (listings) across 18 columns

n_unique(test_com$host_id)

## [1] 27455

dim(test_com)

## [1] 42931    18

Price
- Minimum listed price: $0
- Average price: $200
- Maximum price: $99,000
- Median price: $125

summary(test_com$price)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    75.0   125.0   200.3   200.0 99000.0

Removing all rows with a price of 0, since a free listing does not make sense. The filter below keeps rows with price > 5.
- Minimum listed price is now $10
  - 27430 unique host_id values
  - 42904 total rows

test_com <- test_com %>% 
  filter(price > 5)
summary(test_com$price)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    10.0    75.0   125.0   200.4   200.0 99000.0
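
Filtering rows can also drop hosts whose only listings were removed, which is why the unique host_id count falls along with the row count. A minimal sketch on hypothetical toy data (it uses price > 0 to state the intent directly, which is an assumption on my part rather than the filter used above):

```r
library(dplyr)

# Hypothetical toy data; host 3 has only a zero-price listing.
toy <- data.frame(host_id = c(1, 1, 2, 3),
                  price = c(0, 120, 10, 0))

# Keep only listings with a positive price.
kept <- toy %>% 
  filter(price > 0)

# Rows removed and hosts remaining after the filter.
removed_rows <- nrow(toy) - nrow(kept)
hosts_left <- n_distinct(kept$host_id)
```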

Minimum nights
- Minimum: 1
- Average: 18
- Maximum: 1250
- Median: 7

summary(test_com$minimum_nights)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00    7.00   18.12   30.00 1250.00

Number of reviews per listing
- Ranges from 0 to 1842

summary(test_com$number_of_reviews)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.00    5.00   25.83   24.00 1842.00

Area
- Bronx has 1690 listings
- Brooklyn has 16234 listings
- Manhattan has 17635 listings
- Queens has 6916 listings
- Staten Island has 429 listings

table(test_com$area)

## 
##         Bronx      Brooklyn     Manhattan        Queens Staten Island 
##          1690         16234         17635          6916           429

Room Types
- Entire home/apt has 24279 listings
- Hotel room has 170 listings
- Private room has 17879 listings
- Shared room has 576 listings

table(test_com$room_type)

## 
## Entire home/apt      Hotel room    Private room     Shared room 
##           24279             170           17879             576

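Sum of host_list_count by area (borough)
- host_list_count is repeated on every row of a host, so this sum weights hosts with many listings more heavily than a plain row count would.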

area_freq <- test_com %>% 
  group_by(area) %>% 
  summarise(total_list = sum(host_list_count))%>% 
  mutate(percent = total_list *100 / sum(total_list))
area_freq
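
To see how summing host_list_count differs from counting rows, here is a small sketch on hypothetical toy data: one host with three Manhattan listings and one host with a single Queens listing. The column names summed_host_count and n_listings are my own.

```r
library(dplyr)

# Hypothetical toy data; host 1 owns 3 listings, so each of its rows carries 3.
toy <- data.frame(
  host_id = c(1, 1, 1, 2),
  area = c("Manhattan", "Manhattan", "Manhattan", "Queens"),
  host_list_count = c(3, 3, 3, 1)
)

area_counts <- toy %>% 
  group_by(area) %>% 
  summarise(summed_host_count = sum(host_list_count), # 3 + 3 + 3 = 9 for Manhattan
            n_listings = n()) %>%                     # 3 rows for Manhattan
  mutate(percent = 100 * n_listings / sum(n_listings))
area_counts
```

With n(), Manhattan holds 75 % of the toy listings; with the summed column it would hold 90 %. The same effect inflates the totals for hosts with many listings in the real data.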

price range frequency

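price_levels <- c("10 - 49", "50 - 99", "100 - 199",
                  "200 - 299", "300 - 999", "1000 and above")
price_freq <- test_com %>%
  mutate(price_range = case_when(price > 5 & price < 50 ~ "10 - 49",
                                 price >= 50 & price < 100 ~ "50 - 99",
                                 price >= 100 & price < 200 ~ "100 - 199",
                                 price >= 200 & price < 300 ~ "200 - 299",
                                 price >= 300 & price < 1000 ~ "300 - 999",
                                 price >= 1000 ~ "1000 and above"),
         # ordered levels so arrange() and plots follow price order, not alphabetical order
         price_range = factor(price_range, levels = price_levels)) %>% 
  group_by(price_range) %>% 
  summarise(total_list = sum(host_list_count)) %>% 
  mutate(percent = total_list *100 / sum(total_list)) %>% 
  arrange(price_range)
price_freq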
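
As an alternative to a chain of conditions, base R's cut() can bin prices and keep the bins in order. A minimal sketch with hypothetical prices; the breaks mirror the ranges used here, while the label wording is my own:

```r
# Hypothetical prices chosen to sit on and around the bin edges.
price <- c(10, 49, 50, 99.5, 100, 299, 300, 999, 1000, 99000)

breaks <- c(0, 50, 100, 200, 300, 1000, Inf)
labels <- c("under 50", "50 - 99", "100 - 199",
            "200 - 299", "300 - 999", "1000 and above")

# right = FALSE makes each bin include its lower edge: [50, 100), [100, 200), ...
price_range <- cut(price, breaks = breaks, labels = labels, right = FALSE)

# Counts per bin, in bin order.
table(price_range)
```

Because cut() returns a factor with the levels in break order, tables and plots come out sorted by price without extra work.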

total listing with price grouped by geolocation area

geo_area_freq <- test_com %>%
  group_by(geo_location, area) %>% 
  summarise(total_list = sum(host_list_count),
            min_price = min(price),
            avg_price = mean(price),
            max_price = max(price),
            most_price = median(price))
geo_area_freq

total listing with room_type grouped by years

year_room_freq <- test_com %>% 
  group_by(year, room_type) %>% 
  summarise(total_list = sum(host_list_count),
            reviews_per_year = sum(reviews_per_year))%>% 
  mutate(percent = 100* reviews_per_year/sum(reviews_per_year))
year_room_freq

total listings made by host_name

host_list_count_total <- test_com %>% 
  group_by(host_name) %>% 
  summarise(total_list_count = sum(host_list_count)) %>% 
  arrange(desc(total_list_count))
host_list_count_total

total pricing listed by host_name

host_price_total <- test_com %>% 
  group_by(host_name) %>% 
  summarise(total_price = sum(price)) %>% 
  arrange(desc(total_price))
host_price_total

total pricing listed by host_name by Room types

host_room_total <- test_com  %>% 
  select(host_name, room_type, price) %>% 
  group_by(host_name, room_type) %>% 
  summarise(total_price = sum(price)) %>% 
  arrange(desc(total_price))

## `summarise()` has grouped output by 'host_name'. You can override using the
## `.groups` argument.

host_room_total

Price summary by room type :

room_price_freq <- test_com %>% 
  select(room_type, area, price, host_list_count) %>% 
  group_by(room_type) %>% 
  summarise(min_price = min(price), 
            avg_price = mean(price), 
            most_price = median(price),
            max_price = max(price),
            total_list = sum(host_list_count)) %>% 
  mutate(list_percent = total_list * 100 / sum(total_list))

PHASE 5 : Visualization

Creating a new data frame for Correlation matrix
- formatted as numeric

corr_df <- test_com %>% 
  select(list_id, host_id, price, minimum_nights,
         number_of_reviews, last_review, reviews_per_month_pct,
         host_list_count, reviews_per_year, availability_365, year)

corr_df$year <- as.numeric(corr_df$year)
corr_df$last_review <- as.numeric(corr_df$last_review)
str(corr_df)

## 'data.frame':    42904 obs. of  11 variables:
##  $ list_id              : num  2595 5121 5203 5178 5136 ...
##  $ host_id              : num  2845 7356 7490 8967 7378 ...
##  $ price                : num  150 60 75 68 275 93 295 124 200 81 ...
##  $ minimum_nights       : num  30 30 2 2 60 3 4 3 1 30 ...
##  $ number_of_reviews    : num  49 50 118 575 3 350 45 223 68 189 ...
##  $ last_review          : num  19164 18232 17368 19407 19214 ...
##  $ reviews_per_month_pct: num  0.3 0.3 0.72 3.41 0.03 2.25 0.27 1.32 0.44 1.13 ...
##  $ host_list_count      : int  3 2 1 1 1 1 1 3 4 1 ...
##  $ reviews_per_year     : int  1 0 0 52 1 48 4 17 0 5 ...
##  $ availability_365     : int  314 365 0 106 181 145 1 164 310 207 ...
##  $ year                 : num  2022 2019 2017 2023 2022 ...

Correlation matrix

corrplot(cor(corr_df), method = "circle")

ggcorrplot(cor(corr_df), hc.order = TRUE, type = "lower",
           lab = TRUE)
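
cor() uses Pearson correlation by default, which is sensitive to extreme values such as the $99,000 price. A rank-based check is one way to see how much a single outlier matters. A minimal sketch on hypothetical numbers:

```r
# Hypothetical data: y rises steadily with x, with one extreme last value.
x <- 1:10
y <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 5000)

# Pearson is pulled down by the single extreme point.
pearson <- cor(x, y)

# Spearman works on ranks, so a monotonic relation gives 1.
spearman <- cor(x, y, method = "spearman")
```

The same method argument can be passed when building the matrix, for example cor(corr_df, method = "spearman"), before handing it to corrplot() or ggcorrplot().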

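Total listings by borough
- Area ~ share of total_list (the sum of host_list_count, so hosts with many listings weigh more)
  - Manhattan alone has 62.5 % of the total
  - Brooklyn has 22.1 %
  - Queens has 14.6 %
  - Bronx has 0.49 %
  - Staten Island has 0.11 %
- By plain row count (the table(test_com$area) output above), the shares are about 41.1 % Manhattan, 37.8 % Brooklyn, 16.1 % Queens, 3.9 % Bronx and 1.0 % Staten Island.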

area_freq %>% 
  ggplot(aes(area, total_list, fill = area)) +
  geom_bar(position = "dodge", stat = "identity") +
  geom_text(aes(label = total_list), vjust = 0) +
  scale_y_continuous(labels = scales::comma) +
  labs(x = "Cities",
       y = "Total Listings",
       title = "Total Listings made at Cities :",
       caption = "Data Analyst : JP") +
  theme_minimal() +
  theme(legend.position = "none")

Host Listing with Room Types

Room Types at NYC by Cities

A simple visualization on total listings by price range

price_freq %>% 
  ggplot(aes(price_range, total_list, fill = total_list)) +
  geom_bar(position = "dodge", stat = "identity") +
  geom_text(aes(label = total_list), vjust = 0) +
  # the y scale is set once; setting it twice triggers a "Scale for y is already present" message
  scale_y_continuous(labels = scales::comma) +
  labs(x = "Price Range",
       y = "Total Listings",
       title = "Price Range for Total listings :",
       subtitle = "cheap airbnb price range",
       caption = "Data Analyst : JP") +
  theme_minimal() +
  theme(legend.position = "none")

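Boxplot of listed price by room type
- Treated outliers as true outliers, since the data passed the ROCCC check.
- The outliers are visible in the plot.
- The price axis uses a log scale limited to $10,000, so the 7 listings priced above that are dropped from the plot (see the warning below).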

ggplot(test_com, aes(x = room_type, y = price)) +
  geom_boxplot(aes(fill = room_type)) + scale_y_log10(limits = c(1, 10000), labels = scales::comma) +
  geom_hline(yintercept = mean(test_com$price), color = "purple", linetype = 6) +
  annotate("text", x = 1,
           y = median(test_com$price[test_com$room_type == "Entire home/apt"]), 
           label = round(median(test_com$price[test_com$room_type == "Entire home/apt"]), 2), 
           size = 5, color = "white") +
  annotate("text", x = 2, 
           y = median(test_com$price[test_com$room_type == "Hotel room"]), 
           label = round(median(test_com$price[test_com$room_type == "Hotel room"]), 2), 
           size = 5, color = "red") +
  annotate("text", x = 3, 
           y = median(test_com$price[test_com$room_type == "Private room"]), 
           label = round(median(test_com$price[test_com$room_type == "Private room"]), 2), 
           size = 5, color = "lightgreen") +
  annotate("text", x = 4, 
           y = median(test_com$price[test_com$room_type == "Shared room"]), 
           label = round(median(test_com$price[test_com$room_type == "Shared room"]), 2), 
           size = 5, color = "green") +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(x = "Room Type",
       y = "Price",
       title = "Price Distribution by Room Type :",
       caption = "Data Analyst : JP")

## Warning: Removed 7 rows containing non-finite values (`stat_boxplot()`).
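
The four annotate() calls repeat the same median calculation for each room type. One possible alternative, sketched on hypothetical toy data, computes the medians once and lets geom_text() place them:

```r
library(ggplot2)

# Hypothetical toy data with two room types.
toy <- data.frame(
  room_type = rep(c("Entire home/apt", "Private room"), each = 4),
  price = c(100, 150, 200, 12000, 40, 60, 80, 100)
)

# One row per room type holding its median price.
medians <- aggregate(price ~ room_type, data = toy, FUN = median)

# geom_text() reads the labels and positions from the medians table.
p <- ggplot(toy, aes(room_type, price)) +
  geom_boxplot(aes(fill = room_type)) +
  geom_text(data = medians, aes(label = price), vjust = -0.5) +
  scale_y_log10(labels = scales::comma) +
  theme_minimal() +
  theme(legend.position = "none")
```

This keeps the labels in step with the data if room types are added or the filter changes.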

Room Types by Average Price, Labelled with Median Price
- Bars show the average price, and the labels show the median price for each room type.

room_price_freq %>% 
  ggplot(aes(room_type, avg_price, fill = room_type)) +
  geom_bar(position = "dodge", stat = "identity") +
  # the y scale is set once; setting it twice triggers a "Scale for y is already present" message
  scale_y_continuous(labels = scales::comma) +
  geom_text(aes(label = most_price), vjust = 0) +
  labs(x = "Room Types",
       y = "Average Price",
       title = "Room Types with Average Price :",
       subtitle = "Median Price floating for Room Types",
       caption = "Data Analyst : JP") +
  theme_minimal() +
  theme(legend.position = "none")

Host Listing with Average Price By Cities

Average Price Listed at NYC by Cities

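Total Listings over Years :
- Here year is the year of each listing's last review (imputed where missing), not the year the listing was created.
- Activity picks up from 2017 onward, which suggests growing host adoption of Airbnb.
  - Shared rooms show almost no change over the years.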

year_room_freq %>% 
  ggplot(aes(year, total_list, color = room_type)) +
  scale_y_continuous(labels = scales::comma) +
  geom_point(size = 2, alpha = 1) +
  labs(x = "Years",
       y = "Total Listings",
       title = "Total Listings made over Years :",
       subtitle = "",
       caption = "Data Analyst : JP")  +
  theme_minimal()

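Top 5 Hosts by Listings :
- The top host is Blueground with total_list_count = 276676. This is the sum of host_list_count over the host's rows, not a count of distinct listings: the largest host_list_count in the data is 526, and 526 × 526 = 276676.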

host_list_count_total %>% 
  slice(1:5) %>% 
  ggplot(aes(host_name, total_list_count, fill = host_name)) +
  geom_bar(position = "dodge", stat = "identity") +
  geom_text(aes(label = total_list_count), vjust = 0) +
  theme(legend.position = "none") +
  labs(x = "Host Name",
       y = "Total Listings",
       title = "Top 5 Listings by Hosts",
       subtitle = "",
       caption = "Data Analyst : JP")  +
  theme_minimal() +
  scale_y_continuous(labels = scales::comma) +
  theme(legend.position = "none")

Top Prices by Host Names :
- The top host is RoomPicks, whose listed prices sum to $318,578
  - Blueground is in second position with $175,662

host_room_total %>% 
  filter( total_price >= 52089) %>% 
  ggplot(aes(total_price, host_name, fill = total_price)) +
  geom_bar(position = "dodge", stat = "identity") +
  scale_fill_steps2() +
  geom_text(aes(label = total_price), vjust = 0) +
  theme(legend.position = "none") +
  labs(x = "Total Price in $ ",
       y = "Host Name",
       color = "Room Type",
       title = "Top Prices by Host Names",
       subtitle = "",
       caption = "Data Analyst : JP") +
  theme_minimal() +
  theme(legend.position = "none")
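
The threshold 52089 is hard-coded, so it would need updating whenever the data changes. A possible alternative is dplyr's slice_max(), sketched here on hypothetical toy data; .groups = "drop" ungroups the summary so the top rows are taken overall rather than per host, and it also avoids the summarise() grouping message:

```r
library(dplyr)

# Hypothetical toy listings.
toy <- data.frame(
  host_name = c("A", "A", "B", "B", "C"),
  room_type = c("Private room", "Private room", "Entire home/apt",
                "Hotel room", "Private room"),
  price = c(100, 200, 500, 50, 250)
)

top2 <- toy %>% 
  group_by(host_name, room_type) %>% 
  summarise(total_price = sum(price), .groups = "drop") %>% # ungrouped result
  slice_max(total_price, n = 2)                              # two largest totals
top2
```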

Sum of Total Price by Host Listing based on Cities :

Total Price Listed at NYC by Cities

Reviews per month vs reviews per year :

test_com %>% 
  ggplot(aes(reviews_per_month_pct, reviews_per_year, color = room_type)) +
  geom_point(size = 2, alpha = 0.8)   +
  geom_smooth(method = 'lm' , se = F, color = 'purple') +
  labs(x = "Reviews per Month PCT",
       y = "Reviews per Year",
       color = "Room type",
       title = "Airbnb's Reviews per Month by Year:",
       subtitle = "Linear Regression Model has 'Strong' fit",
       caption = "Data Analyst : JP") +
  annotate("text", x= 15, y= 900, label = "R^2 =  0.73", color = "darkgreen",
           fontface = "bold", size = 5, angle = 25 ) +
  theme_minimal()
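
The R^2 value in the annotation is typed in by hand. A minimal sketch of computing it from a fitted model instead, on hypothetical numbers (for this report the model would presumably be lm(reviews_per_year ~ reviews_per_month_pct, data = test_com), which is my assumption):

```r
# Hypothetical data for illustration only.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 5, 4, 9, 8)

# Fit the same kind of straight line that geom_smooth(method = 'lm') draws.
fit <- lm(y ~ x)

# R squared from the model summary.
r2 <- summary(fit)$r.squared

# Text that could be passed to annotate() as the label.
r2_label <- sprintf("R^2 = %.2f", r2)
```

For a single predictor, this value equals the squared Pearson correlation between the two variables.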

PHASE 6 : Act

The act phase would be carried out by the company's executive team, so I am passing the documented report to the Director and the team.

Data-Driven Decision-Making :
- Blueground & RoomPicks are the top 2 hosts in NYC by summed listed price, at $175,662 and $318,578 respectively.
- Blueground has the highest listing total among hosts in NYC (total_list_count = 276676, which is a sum of host_list_count rather than a count of distinct listings).
- Listing activity picks up from 2017 onward, a progressive trend for Airbnb.
- Hotel rooms are the most expensive on average.
- Manhattan alone has 62.5 % of total listings (measured by summed host_list_count).

Suggestion(s) :
- Apart from hotel rooms, the other room types are cheaper, which travellers will like.
- Travellers may like staying in the Bronx: they can still roam the city, and it has the lowest average price among the boroughs.
- Staten Island is less crowded but has a higher average price, so travellers who want fewer crowds can choose Staten Island.

- THANK YOU
  - Jayprakash Kumar

NYC Airbnb Exploration

Jayprakash Kumar

2023-07-18
