1 Loading required packages

1.1 The following packages are used throughout the data challenge

  • tidyverse : Used in data processing and data transformation as well as for data visualization
  • tidyr : Used to make data tidy
  • readr : Used for importing data CSV files
  • GGally : Used for pairwise plots
  • DT : Used for displaying a table in HTML
  • leaflet : Used to add Interactive Map
  • glue : Used for Concatenating Name and Count in Graphs
  • matrixStats : Used to use functions on rows and columns of matrices
  • ggplot2 : Used to create elegant data visualization
  • expss : Used to use some functions from spreadsheets
  • sqldf : Used to manipulate R data frames using SQL
library(tidyverse)
library(tidyr)
library(readr)    
library(GGally)   
library(DT)       
library(leaflet)  
library(glue)
library(matrixStats)
library(ggplot2)
library(expss)
library(sqldf)

2 Loading the AirBnb listings and Zillow data in R

Airbnb_listings <- read.csv("~/Desktop/Projects/Capital One Data Challenge/From Katie/listings.csv")
Zillow_pricing <- read.csv("~/Desktop/Projects/Capital One Data Challenge/From Katie/Zip_Zhvi_2bedroom.csv")

3 Data Exploration - Zillow Pricing Data

class(Zillow_pricing) 
## [1] "data.frame"
dim(Zillow_pricing) 
## [1] 8946  262

3.1 Data filtration

  • City Criterion : We are only interested in the pricing of the properties located in the New York City as per the problem statement. So, we will filter data based on city.

  • We see that there are 262 variables in the Zillow pricing data table giving the median prices of the 2 BR properties located in different zip’s, but we will only be using the most recent median prices of 2 BR properties. Hence, we will filter for the rest of the columns and only use the median prices for 06/2017.

nyc_zillow_data <- filter(Zillow_pricing, City == 'New York')
nyc_zillow_final <- select(nyc_zillow_data, RegionID:SizeRank, X2017.06)

3.2 Data Cleaning for the filtered data

  • We did a check for missing values in the table which we will be used for further analysis. We found that there are no missing values in the filtered data table and we can go forward with this for further analysis.
colSums(is.na(nyc_zillow_final))
##   RegionID RegionName       City      State      Metro CountyName 
##          0          0          0          0          0          0 
##   SizeRank   X2017.06 
##          0          0

3.3 Checks for duplicates

  • we did a check if the data is unique at the RegionID level and RegionName level. We found that there are no duplicates at the RegionID and RegionName level. We are good to go with this table for our further analysis.
length(unique(nyc_zillow_final$RegionID)) == length(nyc_zillow_final$RegionID)
## [1] TRUE
length(unique(nyc_zillow_final$RegionName)) == length(nyc_zillow_final$RegionName)
## [1] TRUE

3.4 Comparison of median prices of properties

  • We will be interested in taking a look at the median prices of the 2BR properties in the regions of our interest. We will be segmenting Zips based on their county location.

  • After plotting, We found that the median prices of the properties located in the New York County are the highest. The properties in the Zip’s 10013, 10014, and 10011 are most expensive.

  • The median price of Zip’s will be one of the metrics to evaluate the best Zip’s to invest. We now move forward with our analysis using the Airbnb listings data.

Zip_median_prices <- arrange(nyc_zillow_final, desc(X2017.06))

ggplot(data=Zip_median_prices, aes(x = reorder(RegionName, X2017.06) , y = X2017.06, fill = as.factor(Zip_median_prices$CountyName))) + geom_bar(stat="identity" ) + scale_fill_brewer(name = "County Name" ,  palette = "Set1") + coord_flip() + labs(x = "Zipcode", y = "Median Price of 2BR Properties")

4 Data filtration of Airbnb listings data.

nyc_airbnb_data <- filter(Airbnb_listings, bedrooms == 2, zipcode %in% nyc_zillow_final[,"RegionName"])

5 Data Exploration of Airbnb listings data

dim(nyc_airbnb_data)
## [1] 1565  106
names(nyc_airbnb_data)
##   [1] "id"                                          
##   [2] "listing_url"                                 
##   [3] "scrape_id"                                   
##   [4] "last_scraped"                                
##   [5] "name"                                        
##   [6] "summary"                                     
##   [7] "space"                                       
##   [8] "description"                                 
##   [9] "experiences_offered"                         
##  [10] "neighborhood_overview"                       
##  [11] "notes"                                       
##  [12] "transit"                                     
##  [13] "access"                                      
##  [14] "interaction"                                 
##  [15] "house_rules"                                 
##  [16] "thumbnail_url"                               
##  [17] "medium_url"                                  
##  [18] "picture_url"                                 
##  [19] "xl_picture_url"                              
##  [20] "host_id"                                     
##  [21] "host_url"                                    
##  [22] "host_name"                                   
##  [23] "host_since"                                  
##  [24] "host_location"                               
##  [25] "host_about"                                  
##  [26] "host_response_time"                          
##  [27] "host_response_rate"                          
##  [28] "host_acceptance_rate"                        
##  [29] "host_is_superhost"                           
##  [30] "host_thumbnail_url"                          
##  [31] "host_picture_url"                            
##  [32] "host_neighbourhood"                          
##  [33] "host_listings_count"                         
##  [34] "host_total_listings_count"                   
##  [35] "host_verifications"                          
##  [36] "host_has_profile_pic"                        
##  [37] "host_identity_verified"                      
##  [38] "street"                                      
##  [39] "neighbourhood"                               
##  [40] "neighbourhood_cleansed"                      
##  [41] "neighbourhood_group_cleansed"                
##  [42] "city"                                        
##  [43] "state"                                       
##  [44] "zipcode"                                     
##  [45] "market"                                      
##  [46] "smart_location"                              
##  [47] "country_code"                                
##  [48] "country"                                     
##  [49] "latitude"                                    
##  [50] "longitude"                                   
##  [51] "is_location_exact"                           
##  [52] "property_type"                               
##  [53] "room_type"                                   
##  [54] "accommodates"                                
##  [55] "bathrooms"                                   
##  [56] "bedrooms"                                    
##  [57] "beds"                                        
##  [58] "bed_type"                                    
##  [59] "amenities"                                   
##  [60] "square_feet"                                 
##  [61] "price"                                       
##  [62] "weekly_price"                                
##  [63] "monthly_price"                               
##  [64] "security_deposit"                            
##  [65] "cleaning_fee"                                
##  [66] "guests_included"                             
##  [67] "extra_people"                                
##  [68] "minimum_nights"                              
##  [69] "maximum_nights"                              
##  [70] "minimum_minimum_nights"                      
##  [71] "maximum_minimum_nights"                      
##  [72] "minimum_maximum_nights"                      
##  [73] "maximum_maximum_nights"                      
##  [74] "minimum_nights_avg_ntm"                      
##  [75] "maximum_nights_avg_ntm"                      
##  [76] "calendar_updated"                            
##  [77] "has_availability"                            
##  [78] "availability_30"                             
##  [79] "availability_60"                             
##  [80] "availability_90"                             
##  [81] "availability_365"                            
##  [82] "calendar_last_scraped"                       
##  [83] "number_of_reviews"                           
##  [84] "number_of_reviews_ltm"                       
##  [85] "first_review"                                
##  [86] "last_review"                                 
##  [87] "review_scores_rating"                        
##  [88] "review_scores_accuracy"                      
##  [89] "review_scores_cleanliness"                   
##  [90] "review_scores_checkin"                       
##  [91] "review_scores_communication"                 
##  [92] "review_scores_location"                      
##  [93] "review_scores_value"                         
##  [94] "requires_license"                            
##  [95] "license"                                     
##  [96] "jurisdiction_names"                          
##  [97] "instant_bookable"                            
##  [98] "is_business_travel_ready"                    
##  [99] "cancellation_policy"                         
## [100] "require_guest_profile_picture"               
## [101] "require_guest_phone_verification"            
## [102] "calculated_host_listings_count"              
## [103] "calculated_host_listings_count_entire_homes" 
## [104] "calculated_host_listings_count_private_rooms"
## [105] "calculated_host_listings_count_shared_rooms" 
## [106] "reviews_per_month"

5.1 Data Filtering based on variables for Airbnb listings data

  • The data table nyc_airbnb_data contains 1565 observations of 2BR properties. It has 106 different variables as of now. We will only be interested in columns that will help us evaluate the best zips/ which align with our approach. Let us first filter for the variables which will help us create attributes at zip level.

  • We will keep on updating the table based on the use of variables by adding or dropping them. “nyc_airbnb_final” will be the table with which we move towards our further analysis.

nyc_airbnb_final <- select(nyc_airbnb_data, id, host_id, street, neighbourhood, neighbourhood_cleansed, neighbourhood_group_cleansed,   city,   zipcode,    market, smart_location, latitude,   longitude, accommodates, bathrooms, bedrooms,   beds,   square_feet, price, weekly_price, monthly_price, security_deposit,  cleaning_fee,   guests_included,    extra_people, availability_30, availability_60, availability_90, availability_365, cancellation_policy )

5.2 Data Quality check for Airbnb Listings data

glimpse(nyc_airbnb_final)
## Observations: 1,565
## Variables: 29
## $ id                           <int> 16458, 20853, 23135, 27006, 27659, …
## $ host_id                      <int> 64056, 79070, 11481, 115560, 119588…
## $ street                       <fct> "Brooklyn, NY, United States", "Bro…
## $ neighbourhood                <fct> Brooklyn, Brooklyn, Carroll Gardens…
## $ neighbourhood_cleansed       <fct> Park Slope, Prospect Heights, Carro…
## $ neighbourhood_group_cleansed <fct> Brooklyn, Brooklyn, Brooklyn, Manha…
## $ city                         <fct> Brooklyn, Brooklyn, Brooklyn, New Y…
## $ zipcode                      <fct> 11215, 11217, 11231, 10023, 11215, …
## $ market                       <fct> New York, New York, New York, New Y…
## $ smart_location               <fct> "Brooklyn, NY", "Brooklyn, NY", "Br…
## $ latitude                     <dbl> 40.67343, 40.68035, 40.67967, 40.77…
## $ longitude                    <dbl> -73.98338, -73.97162, -74.00154, -7…
## $ accommodates                 <int> 4, 4, 3, 4, 9, 3, 6, 2, 4, 4, 4, 4,…
## $ bathrooms                    <dbl> 2.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.5, …
## $ bedrooms                     <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ beds                         <int> 1, 3, 2, 3, 6, 1, 3, 2, 3, 3, 2, 2,…
## $ square_feet                  <int> NA, NA, NA, NA, NA, 600, 1200, NA, …
## $ price                        <fct> $225.00, $250.00, $175.00, $235.00,…
## $ weekly_price                 <fct> "", "$1,400.00", "", "$2,000.00", "…
## $ monthly_price                <fct> "", "$4,000.00", "", "$6,000.00", "…
## $ security_deposit             <fct> "", "$500.00", "$0.00", "$500.00", …
## $ cleaning_fee                 <fct> , $250.00, $65.00, $120.00, $120.00…
## $ guests_included              <int> 1, 4, 1, 3, 1, 2, 2, 1, 2, 2, 11, 2…
## $ extra_people                 <fct> $0.00, $50.00, $15.00, $40.00, $0.0…
## $ availability_30              <int> 0, 0, 21, 0, 0, 0, 0, 12, 0, 0, 13,…
## $ availability_60              <int> 0, 0, 39, 16, 17, 0, 0, 37, 0, 0, 4…
## $ availability_90              <int> 0, 0, 69, 16, 17, 0, 0, 67, 6, 0, 7…
## $ availability_365             <int> 0, 158, 342, 199, 216, 0, 0, 342, 2…
## $ cancellation_policy          <fct> moderate, strict_14_with_grace_peri…
  • We see that the columns where the entries are in the form of Dollars, they are treated as character strings. So we cannot directly check for N.A’s (missing values) in these columns. They are price, weekly_price, monthly_price, security_deposit, cleaning_fee, and extra_people. So let us check how many missing values are present in these columns
sum((nyc_airbnb_final$price==''))
## [1] 0
sum((nyc_airbnb_final$weekly_price==''))
## [1] 1351
sum((nyc_airbnb_final$monthly_price==''))
## [1] 1398
sum((nyc_airbnb_final$security_deposit==''))
## [1] 420
sum((nyc_airbnb_final$cleaning_fee==''))
## [1] 212
sum((nyc_airbnb_final$extra_people==''))
## [1] 0
  • We see that there are missing values in ‘weekly_price’, ‘monthly_price’, ‘security_deposit’, ‘cleaning_fee’.

  • Let us first remove ‘$’ sign and then the ‘,’ sign from these columns

nyc_airbnb_final$price = gsub("\\$", "", nyc_airbnb_final$price)
nyc_airbnb_final$weekly_price = gsub("\\$", "", nyc_airbnb_final$weekly_price)
nyc_airbnb_final$monthly_price = gsub("\\$", "", nyc_airbnb_final$monthly_price)
nyc_airbnb_final$security_deposit = gsub("\\$", "", nyc_airbnb_final$security_deposit)
nyc_airbnb_final$cleaning_fee = gsub("\\$", "", nyc_airbnb_final$cleaning_fee)
nyc_airbnb_final$extra_people = gsub("\\$", "", nyc_airbnb_final$extra_people)


nyc_airbnb_final$price = gsub("\\,", "", nyc_airbnb_final$price)
nyc_airbnb_final$weekly_price = gsub("\\,", "", nyc_airbnb_final$weekly_price)
nyc_airbnb_final$monthly_price = gsub("\\,", "", nyc_airbnb_final$monthly_price)
nyc_airbnb_final$security_deposit = gsub("\\,", "", nyc_airbnb_final$security_deposit)
nyc_airbnb_final$cleaning_fee = gsub("\\,", "", nyc_airbnb_final$cleaning_fee)
nyc_airbnb_final$extra_people = gsub("\\,", "", nyc_airbnb_final$extra_people)
  • We will convert these columns into integer form for the ease of use
nyc_airbnb_final$price = as.numeric(nyc_airbnb_final$price)
nyc_airbnb_final$weekly_price = as.numeric(nyc_airbnb_final$weekly_price)
nyc_airbnb_final$monthly_price = as.numeric(nyc_airbnb_final$monthly_price)
nyc_airbnb_final$security_deposit = as.numeric(nyc_airbnb_final$security_deposit)
nyc_airbnb_final$cleaning_fee = as.numeric(nyc_airbnb_final$cleaning_fee)
nyc_airbnb_final$extra_people = as.numeric(nyc_airbnb_final$extra_people)
  • Checking for N.A’s in our table (after all of the major columns are converted to integer form)
colSums(is.na(nyc_airbnb_final))
##                           id                      host_id 
##                            0                            0 
##                       street                neighbourhood 
##                            0                            0 
##       neighbourhood_cleansed neighbourhood_group_cleansed 
##                            0                            0 
##                         city                      zipcode 
##                            0                            0 
##                       market               smart_location 
##                            0                            0 
##                     latitude                    longitude 
##                            0                            0 
##                 accommodates                    bathrooms 
##                            0                            3 
##                     bedrooms                         beds 
##                            0                            0 
##                  square_feet                        price 
##                         1538                            0 
##                 weekly_price                monthly_price 
##                         1351                         1398 
##             security_deposit                 cleaning_fee 
##                          420                          212 
##              guests_included                 extra_people 
##                            0                            0 
##              availability_30              availability_60 
##                            0                            0 
##              availability_90             availability_365 
##                            0                            0 
##          cancellation_policy 
##                            0
  • We found that the number of N.A. values for weekly_price, monthly_price, security_deposit and cleaning fee are the same as the number of blanks that were present in these columns before converting into an integer, which tells that we have successfully converted these columns into integer form.

  • There are no missing values for ‘id’ and ‘host_id’ which we will be needing for further analysis. Also, the columns ‘bathrooms’ has a few missing values, but that is a column that we won’t be using much.

  • Now, we will check for the duplicates at the id and host_id level

length(unique(nyc_airbnb_final$id)) == length(nyc_airbnb_final$id)
## [1] TRUE
length(unique(nyc_airbnb_final$host_id)) == length(nyc_airbnb_final$host_id)
## [1] FALSE
  • We see that the data is unique at the listing level and is not unique at the host_id level, which is expected. Every unique listing has a unique id whereas as there can a common host of multiple listings, hence it is not unique at the host level. So, we can go forward and do our analyses at the listing level

6 Zip Potential Analysis

6.1 Methodology and Assumptions for occupancy rate calculation

  • We will be calculating the occupancy rate for all the 4 columns (30, 60, 90 and 365 days).

  • For example 30 days occupancy rate in percent = (No of days occupied in 30 days)*100/30

  • After calculating this for all 4 columns, we will be taking the max of (30 days occupancy, 60 days occupancy, 90 days occupancy, 365 days occupancy, 75%). The reason behind taking the maximum is, these metrics are something which is updated by hosts. So, if a host currently has 100% occupancy for next 30 days but less than 100% occupancy for next 60 days, we can always assume that in some time the occupancy for next 30 days will be updated and it will be close to 100% with current availability_30 as evidence. Similar logic applies to future days (>60 days). Hence, taking the max of these 4 occupancy rates.

  • The mean of the final occupancy rate comes out to be greater than 92%. Our assumption of 75% would have undermined the market potential. We go forward with the calculated final occupancy rate for each listing.

nyc_airbnb_final$percent_occupancy_30 = ((30- nyc_airbnb_final$availability_30)*100)/30
nyc_airbnb_final$percent_occupancy_60 = ((60- nyc_airbnb_final$availability_60)*100)/60
nyc_airbnb_final$percent_occupancy_90 = ((90- nyc_airbnb_final$availability_90)*100)/90
nyc_airbnb_final$percent_occupancy_365 = ((365- nyc_airbnb_final$availability_365)*100)/365

#creating a column with min 75% occupancy for all the listings 

nyc_airbnb_final$min_occu_rate <- replicate(1565, 75)

##Final Occupancy rate calculation 
nyc_airbnb_final$final_occu_rate = pmax(nyc_airbnb_final$percent_occupancy_30, nyc_airbnb_final$percent_occupancy_60, nyc_airbnb_final$percent_occupancy_90, nyc_airbnb_final$percent_occupancy_365, nyc_airbnb_final$min_occu_rate)

mean(nyc_airbnb_final$final_occu_rate)
## [1] 92.07428

6.2 Estimating weekly and monthly prices

  • For the listing which does not have weekly and monthly prices mentioned, we will calculate based on daily prices.

  • Weekly price = 7 times the daily price

  • Monthly price = 30 times the daily price

nyc_airbnb_final$weekly_final_rate = ifelse(is.na(nyc_airbnb_final$weekly_price), 7*nyc_airbnb_final$price, nyc_airbnb_final$weekly_price)

nyc_airbnb_final$monthly_final_rate = ifelse(is.na(nyc_airbnb_final$monthly_price), 30*nyc_airbnb_final$price, nyc_airbnb_final$monthly_price)

6.3 Revenue calculation at the listing level

  • We will first be calculating the maximum possible revenue per year at the listing level

  • We are interested in calculating the revenue generated per year by every listing based on three rates :

    1. Daily (365 times daily rate)
    2. Weekly (52 times the weekly rate)
    3. Monthly (12 times the monthly rate).
  • These calculations will give us the revenues expected if there was a 100% occupancy rate. After this, we will factorize the revenues by using the actual occupancy rates.

nyc_airbnb_final$max_rev_day_basis = 365 * (nyc_airbnb_final$price)
nyc_airbnb_final$max_rev_week_basis = 52 * (nyc_airbnb_final$weekly_final_rate)
nyc_airbnb_final$max_rev_month_basis = 12 * (nyc_airbnb_final$monthly_final_rate)
  • Now these revenues are to be factored by the occupancy rates calculated by us. We will call this as actual revenues generated based on three rates.
nyc_airbnb_final$actual_rev_day_basis = (nyc_airbnb_final$max_rev_day_basis * nyc_airbnb_final$final_occu_rate)/100

nyc_airbnb_final$actual_rev_week_basis = (nyc_airbnb_final$max_rev_week_basis * nyc_airbnb_final$final_occu_rate)/100

nyc_airbnb_final$actual_rev_month_basis = (nyc_airbnb_final$max_rev_month_basis * nyc_airbnb_final$final_occu_rate)/100
  • Practically, the actual revenue is generated using all the 3 types of prices. i.e. out of total occupied days for a listing, the bookings might be daily bookings or weekly bookings or monthly bookings. We cannot proceed to take any one of these. They need to be assigned with some weights. Since we do not have actual weights, we will go forward assuming weights. It is obvious that the maximum number of the booking will be on a day basis, then on a weekly basis and lowest on a monthly basis. So, we are assuming the weights as follows:

  • Assuming that 80% of revenue is generated by daily bookings, 15% by the weekly bookings and 5% by the monthly bookings. Note that we can always revise these weights after taking a look at the results. Now, we can calculate the final revenue per year at a listing level using these weights

nyc_airbnb_final$final_rev_per_year = (0.80 * nyc_airbnb_final$actual_rev_day_basis) + (0.15 * nyc_airbnb_final$actual_rev_week_basis) + (0.05 * nyc_airbnb_final$actual_rev_month_basis)

6.4 Calculations of Individual Bookings per year

  • Now, let us get to the third important calculation i.e. calculation of the number of individual-bookings per year at the listing level.

  • Individual-bookings : The metric ‘individual-bookings’ is counting the number of individuals visiting the listing long with the number of bookings. For example, if the same individual books a listing for 10 days, it will be counted as 10 individual-bookings. So we will be calculating the total number of individual-bookings per year for every listing in our database.

  • We have a variable ‘Accomodates’ in our data base that gives us the capacity of individuals that can be accommodated at the same time for each listing. We will be using ‘Accommodates’ and ‘Final occupancy rate’ to calculate the total number of individual-bookings per year.

  • In the very first step we will calculate the maximum yearly capacity of a listing to accommodate total individuals per year

nyc_airbnb_final$yearly_capacity = nyc_airbnb_final$accommodates * 365
  • In the next step we will factorize this yearly capacity by multiplying it with the actual occupancy rate. This metric will give us the total individual-bookings per year for every listing, which we will be using later to roll up at a Zip level.
nyc_airbnb_final$yearly_indi_bookings = (nyc_airbnb_final$yearly_capacity * nyc_airbnb_final$final_occu_rate)/100

7 Zip Level Calculations

#Grouping data at a ZIP level  

groupby_zip <- group_by(nyc_airbnb_final, zipcode)

##Calculating the total number of listings, Total revenue generated per year and total number of yearly individual-bookings for every ZIP

ZIP_potential_analysis <- summarise(groupby_zip, total_listings = n(), Total_revenue = sum(final_rev_per_year), Total_indi_bookings = sum(yearly_indi_bookings))

#Quick checks

sum(ZIP_potential_analysis$total_listings) #1565 
## [1] 1565
length(nyc_airbnb_final$id) #1565
## [1] 1565
sum(ZIP_potential_analysis$Total_revenue) #145861918 
## [1] 145861918
sum(nyc_airbnb_final$final_rev_per_year) #145861918
## [1] 145861918
sum(ZIP_potential_analysis$Total_indi_bookings) #2331099
## [1] 2331099
sum(nyc_airbnb_final$yearly_indi_bookings) #2331099
## [1] 2331099

7.1 Comparison of Listing Prices in Zip’s

  • As we have seen earlier that the properties in New York County were the most expensive. Similarly, we found that the most expensive listings are also in New York i.e. in the Manhattan area.
plot_price <- ggplot(nyc_airbnb_final, aes(x = zipcode,y = price, fill = neighbourhood_group_cleansed)) + geom_boxplot() + scale_y_continuous(limits = quantile(nyc_airbnb_final$price, c(0, 0.99))) + labs(x = "Zipcode", y = "Price/Night") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + guides(fill = guide_legend(title = "Neighbourhood"))
suppressWarnings(print(plot_price))

8 Zip Rankings based on Market Potential Metrics

ZIP_potential_analysis$rank_listings <- rank(-ZIP_potential_analysis$total_listings)
ZIP_potential_analysis$rank_revenue <- rank(-ZIP_potential_analysis$Total_revenue)
ZIP_potential_analysis$rank_indi_booking <- rank(-ZIP_potential_analysis$Total_indi_bookings)

8.1 Segment Assignment

  • We are assigning segments to all the ZIPs based on the calculated ranks.

  • Segmentation is done in the following way:

    1. The ZIPs having ranks in between 1-5 will be given a ‘VH’ (Very High potential) segment.
    2. The ZIPs having ranks in between 6-10 will be given an ‘H’ (High potential) segment.
    3. The ZIPs having ranks in between 11-15 will be given an ‘M’ (Medium potential) segment.
    4. The ZIPs having ranks in between 16-20 will be given an ‘L’ (Low potential) segment.
    5. The ZIPs having ranks in between 21-25 will be given a ‘VL’ (Very Low potential) segment.
  • We will be creating a data frame with ranks and respective segments as described above. In one of the cases for rank_listings, the same rank of ‘4.5’ is assigned to zipcode’s 10025 and 11217 as they have the same number of listings. We are also putting a provision for this rank into our rank-segment assignment table

rank_to_segment_table <- data.frame(Rank = c(1:25, 4.5), Segment = c('VH','VH','VH','VH','VH','H','H','H','H','H','M','M','M','M','M','L','L','L','L','L','VL','VL','VL','VL','VL','VH'))

rank_to_segment_table
##    Rank Segment
## 1   1.0      VH
## 2   2.0      VH
## 3   3.0      VH
## 4   4.0      VH
## 5   5.0      VH
## 6   6.0       H
## 7   7.0       H
## 8   8.0       H
## 9   9.0       H
## 10 10.0       H
## 11 11.0       M
## 12 12.0       M
## 13 13.0       M
## 14 14.0       M
## 15 15.0       M
## 16 16.0       L
## 17 17.0       L
## 18 18.0       L
## 19 19.0       L
## 20 20.0       L
## 21 21.0      VL
## 22 22.0      VL
## 23 23.0      VL
## 24 24.0      VL
## 25 25.0      VL
## 26  4.5      VH
  • Merging these segments into our ZIP_potential_analysis table.
ZIP_potential_analysis <- merge(x = ZIP_potential_analysis, y = rank_to_segment_table, by.x = "rank_listings", by.y = "Rank", all.x = TRUE)

ZIP_potential_analysis <- rename(ZIP_potential_analysis, Listing_segments = Segment)

ZIP_potential_analysis <- merge(x = ZIP_potential_analysis, y = rank_to_segment_table, by.x = "rank_revenue", by.y = "Rank", all.x = TRUE)

ZIP_potential_analysis <- rename(ZIP_potential_analysis, Revenue_segments = Segment)

ZIP_potential_analysis <- merge(x = ZIP_potential_analysis, y = rank_to_segment_table, by.x = "rank_indi_booking", by.y = "Rank", all.x = TRUE)

ZIP_potential_analysis <- rename(ZIP_potential_analysis, Indi_bookings_segments = Segment)

#Arranging the data frame 

ZIP_potential_analysis <- select(ZIP_potential_analysis, zipcode, total_listings, rank_listings, Listing_segments, Total_revenue, rank_revenue, Revenue_segments, Total_indi_bookings, rank_indi_booking, Indi_bookings_segments)

##Assigning the order of levels of segments

ZIP_potential_analysis$Listing_segments <- factor(ZIP_potential_analysis$Listing_segments, levels = c("VH", "H", "M", "L", "VL"))

ZIP_potential_analysis$Revenue_segments <- factor(ZIP_potential_analysis$Revenue_segments, levels = c("VH", "H", "M", "L", "VL"))

ZIP_potential_analysis$Indi_bookings_segments <- factor(ZIP_potential_analysis$Indi_bookings_segments, levels = c("VH", "H", "M", "L", "VL"))

8.2 Comparison of the number of listings for all the ZIPs

  • We can see that the 5 zips 11215, 10036, 10003, 11217 and 10025 falls in the very high potential segment based on the number of listings.
Zip_listing_plot <- arrange(ZIP_potential_analysis, desc(total_listings))

ggplot(data=Zip_listing_plot, aes(x =zipcode , y = total_listings, fill = as.factor(Zip_listing_plot$Listing_segments))) + ggtitle(label = "Total Listings by Zip's") +    geom_bar(stat="identity" ) + scale_fill_brewer(name = "Segments", palette = "Set2") + labs(x = "Zipcode", y = "Total Listings in Zip") +theme(legend.position="bottom", axis.text.x = element_text(angle = 90, hjust = 1))

8.3 Comparison of Revenue Generated for all the Zips

  • We can see that the 5 zips 10003, 10036, 10013, 10011 and 10025 falls in the very high potential segment on the basis total revenue generated per year.
Zip_revenue_plot <- arrange(ZIP_potential_analysis, desc(Total_revenue))

ggplot(data=Zip_revenue_plot, aes(x = zipcode , y = Total_revenue, fill = as.factor(Zip_revenue_plot$Revenue_segments))) + ggtitle(label = "Total Revenue by Zip's") + geom_bar(stat="identity" ) + scale_fill_brewer(name = "Segments", palette = "Set2") + labs(x = "Zipcode", y = "Revenue per Year") + theme(legend.position="bottom", axis.text.x = element_text(angle = 90, hjust = 1))

8.4 Comparison of individual-bookings per year for all Zip’s

  • We can see that the 5 zips 11215, 10036, 10003, 10025 and 11217 falls in the very high potential segment based on the total individual-bookings per year.
Zip_indi_booking_plot <- arrange(ZIP_potential_analysis, desc(Total_indi_bookings))

ggplot(data=Zip_indi_booking_plot, aes(x = zipcode, y = Total_indi_bookings, fill = as.factor(Zip_indi_booking_plot$Indi_bookings_segments))) + ggtitle(label = "Total Individual Bookings by Zip's") + geom_bar(stat="identity" ) + scale_fill_brewer(name = "Segments", palette = "Set2") + labs(x = "Zipcode", y = "Individual-Bookings per year") + theme(legend.position="bottom", axis.text.x = element_text(angle = 90, hjust = 1))

  • Now, we want to decide which zip has the highest potential based on all 3 potential metrics together, let us check how many times the ZIPs fall under high potential zones (i.e “VH” and “H”)
df1 <- data.frame(Zip = ZIP_potential_analysis[,"zipcode"], Segment = ZIP_potential_analysis[,"Listing_segments"])

df2 <- data.frame(Zip = ZIP_potential_analysis[,"zipcode"], Segment = ZIP_potential_analysis[,"Revenue_segments"])

df3 <- data.frame(Zip = ZIP_potential_analysis[,"zipcode"], Segment = ZIP_potential_analysis[,"Indi_bookings_segments"])

final_df <- rbind(df1, df2, df3)

zip_segment_level <- summarise(group_by(final_df, Zip, Segment), count = n())

zip_segment_summary <- arrange(zip_segment_level, Segment,desc(count))

9 Conclusion based on the Market Potential Metrics

10 Verification of the results by using the metric ‘Cancellation_Policy’

cancellation <- data.frame(zipcode = nyc_airbnb_data$zipcode,  policy = nyc_airbnb_data$cancellation_policy)

cancellation$policy <- factor(cancellation$policy, levels = c("strict_14_with_grace_period", "super_strict_30", "super_strict_60", "strict", "moderate", "flexible"))

cancellation_summary <- summarise(group_by(cancellation, zipcode, policy), count = n())
can_final_summary <- arrange(filter(cancellation_summary, zipcode %in% c("10003", "10025", "10036", "11215", "11217", "10011", "10013")), desc(count))

y <- data.frame(zip = ZIP_potential_analysis$zipcode, total_listings = ZIP_potential_analysis$total_listings)


can_merged_data <- sqldf("select a.zipcode, a.policy, a.count, b.total_listings 
      from can_final_summary a 
      left join 
      y b
      on 
      a.zipcode = b.zip")

can_merged_data$percent <- (can_merged_data$count/can_merged_data$total_listings)*100

can_merged_data
##    zipcode                      policy count total_listings    percent
## 1    10036 strict_14_with_grace_period   110            147 74.8299320
## 2    11215 strict_14_with_grace_period    85            189 44.9735450
## 3    10025 strict_14_with_grace_period    66            124 53.2258065
## 4    11217 strict_14_with_grace_period    64            124 51.6129032
## 5    10013 strict_14_with_grace_period    63            105 60.0000000
## 6    10003 strict_14_with_grace_period    61            136 44.8529412
## 7    10011 strict_14_with_grace_period    59            106 55.6603774
## 8    11215                    moderate    58            189 30.6878307
## 9    11215                    flexible    46            189 24.3386243
## 10   10003                    moderate    38            136 27.9411765
## 11   10025                    flexible    36            124 29.0322581
## 12   11217                    moderate    36            124 29.0322581
## 13   10003                    flexible    34            136 25.0000000
## 14   10011                    flexible    28            106 26.4150943
## 15   10013                    flexible    27            105 25.7142857
## 16   11217                    flexible    23            124 18.5483871
## 17   10025                    moderate    22            124 17.7419355
## 18   10036                    flexible    22            147 14.9659864
## 19   10011                    moderate    19            106 17.9245283
## 20   10013                    moderate    15            105 14.2857143
## 21   10036                    moderate    13            147  8.8435374
## 22   10003             super_strict_60     3            136  2.2058824
## 23   10036             super_strict_30     1            147  0.6802721
## 24   10036             super_strict_60     1            147  0.6802721
## 25   11217             super_strict_30     1            124  0.8064516
ggplot(can_merged_data, aes(fill = policy, y = percent, x = zipcode)) + ggtitle(label = "Cancellation Policy Distribution") +
    geom_bar(position="stack", stat="identity") +labs(x = "Zipcode", y = "Percentage") +theme(legend.position="bottom")

11 Payback Period calculation at the Zip level

zip_avg_revenues <- summarise(groupby_zip, Avg_revenue = mean(final_rev_per_year))

#Merging the median cost of properties from the Zillow data 

zip_avg_revenues <- merge(x = zip_avg_revenues, y = nyc_zillow_final, by.x = "zipcode", by.y = "RegionName", all.x = TRUE)

zip_avg_revenues$payback_years <- zip_avg_revenues$X2017.06/zip_avg_revenues$Avg_revenue

zip_avg_revenues$years_group <- ifelse(zip_avg_revenues$payback_years <= 5, "<5 Years", ifelse(zip_avg_revenues$payback_years <= 10, "5-10 Years", ifelse(zip_avg_revenues$payback_years <= 15, "11-15 Years", ifelse(zip_avg_revenues$payback_years <= 20, "16-20 Years", "20+ Years"))))

zip_avg_revenues$yearly_return_percent <- 100 * (zip_avg_revenues$Avg_revenue/zip_avg_revenues$X2017.06)


#Plotting payback years with percent annual returns

ggplot(zip_avg_revenues, aes(fill = years_group, x= zipcode, y= payback_years, )) +
  geom_point(size = 4, shape = 22) + labs( x = "Zipcode", y = "Years") +   ggtitle("Payback Period of Zip's") + theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 16) )  +  theme_classic() + theme(legend.position="bottom", axis.text.x = element_text(angle = 90, hjust = 1))  

zip_avg_revenues <- rename(zip_avg_revenues, Median_Prices = X2017.06)

plot <- ggplot(zip_avg_revenues, aes(x=zipcode, y=payback_years, fill = zipcode)) + ggtitle(label = "Payback Period with Median Prices of Properties") + scale_y_continuous(labels = scales::comma) + geom_point(aes(col = zipcode, size=Median_Prices/10000)) + geom_smooth(method="loess", se=F) + labs( x="Zipcode", y="Payback Years") + theme(axis.text.x = element_text(angle = 90, hjust = 1))

plot

12 Final Conclusion :

13 Future Scope :

13.1 Zillow Cost Data :

  • We saw that the Zillow had around 262 columns that recorded the prices of 2BR properties in the past. We could have plotted the time series of the median prices of every Zip and studied their behaviors. We can check for the Zip’s where are property prices are rising at a higher rate and rate them accordingly. A metric like ‘Average growth’ in the property values can be calculated to do so.

13.2 AirBnb Listings Data :

  • There were several columns in the raw data that we did not use in our analysis but could be used to support our conclusions. Below are the names and how the other variables can be used:

Neighborhood overview : At the Zip level, we can analyze the additional details such as sites to see, restaurants, landmarks around, etc. The Zip’s can be rated based on the landmarks around or tourist spots around can be given more importance.

Transit : At the Zip level, we can analyze the details such as nearby bus services, train services, distance from airport, etc. The Zip’s can be rated based on these transit services and rated accordingly.

host_response_time and host_response_rate : Analyzing the average response and average response rate in every Zip can give us an idea of how demanding the listings were. Creating a metric based on that can help us rate the Zips.

host_is_superhost : Superhosts are experienced hosts who provide a shining example for other hosts, and extraordinary experiences for their guests. We can check for the number of hosts at the Zip level and rate them accordingly.

Area of Properties (Square Foot) : This can be a good metric that will help to used to check how prices rise as there is a rise in the area of properties. As we saw most of these fields were empty for most of the properties, we did not use it in our analyses.

Cleaning Fee : This can be an addition to the revenue generation calculation in our analysis. But, since we did not consider the maintenance cost and taxes we were not able to use this metric.

Listing Review : There were multiple variables which have the ratings given by the customers to the listings. Using this we can calculate the popularity of Zips which can be an important metric to make some conclusion.

Segment Assignment : We saw that we were only interested in analyzing the properties in the New York Zips. Since the number of Zips to be evaluated was a small number, we used a method of direct ranking and segmenting accordingly. The other approach would have been to calculate the deciles/quantiles and then segment the Zips. This approach was good if we were analyzing all the Zip’s in US/New York State together.

14 Tableau Dashboard for Results