1 Loading required packages
- 1.1 The following packages are used throughout the data challenge
2 Loading the AirBnb listings and Zillow data in R
3 Data Exploration - Zillow Pricing Data
4 Data filtration of Airbnb listings data.
5 Data Exploration of Airbnb listings data
- 5.1 Data Filtering based on variables for Airbnb listings data
- 5.2 Data Quality check for Airbnb Listings data
6 Zip Potential Analysis
7 Zip Level Calculations
- 7.1 Comparison of Listing Prices in Zip’s
8 Zip Rankings based on Market Potential Metrics
9 Conclusion based on the Market Potential Metrics
10 Verification of the results by using the metric ‘Cancellation_Policy’
11 Payback Period calculation at the Zip level
12 Final Conclusion :
13 Future Scope :
- 13.1 Zillow Cost Data :
- 13.2 AirBnb Listings Data :
14 Tableau Dashboard for Results

1 Loading required packages

1.1 The following packages are used throughout the data challenge

tidyverse : Used in data processing and data transformation as well as for data visualization
tidyr : Used to make data tidy
readr : Used for importing data CSV files
GGally : Used for pairwise plots
DT : Used for displaying a table in HTML
leaflet : Used to add Interactive Map
glue : Used for Concatenating Name and Count in Graphs
matrixStats : Used to use functions on rows and columns of matrices
ggplot2 : Used to create elegant data visualization
expss : Used to use some functions from spreadsheets
sqldf : Used to manipulate R data frames using SQL

library(tidyverse)
library(tidyr)
library(readr)    
library(GGally)   
library(DT)       
library(leaflet)  
library(glue)
library(matrixStats)
library(ggplot2)
library(expss)
library(sqldf)

2 Loading the AirBnb listings and Zillow data in R

The raw data files ‘listings.csv’ and ‘Zip_Zhvi_2bedroom.csv’ are loaded in the R-environment.

Airbnb_listings <- read.csv("~/Desktop/Projects/Capital One Data Challenge/From Katie/listings.csv")
Zillow_pricing <- read.csv("~/Desktop/Projects/Capital One Data Challenge/From Katie/Zip_Zhvi_2bedroom.csv")

3 Data Exploration - Zillow Pricing Data

We will be doing data exploration of the raw Zillow pricing data provided. The exploration includes checking for the class of data and data dimensions at this point.

class(Zillow_pricing)

## [1] "data.frame"

dim(Zillow_pricing)

## [1] 8946  262

3.1 Data filtration

City Criterion : We are only interested in the pricing of the properties located in the New York City as per the problem statement. So, we will filter data based on city.
We see that there are 262 variables in the Zillow pricing data table giving the median prices of the 2 BR properties located in different zip’s, but we will only be using the most recent median prices of 2 BR properties. Hence, we will filter for the rest of the columns and only use the median prices for 06/2017.

nyc_zillow_data <- filter(Zillow_pricing, City == 'New York')
nyc_zillow_final <- select(nyc_zillow_data, RegionID:SizeRank, X2017.06)

3.2 Data Cleaning for the filtered data

We did a check for missing values in the table which we will be used for further analysis. We found that there are no missing values in the filtered data table and we can go forward with this for further analysis.

colSums(is.na(nyc_zillow_final))

##   RegionID RegionName       City      State      Metro CountyName 
##          0          0          0          0          0          0 
##   SizeRank   X2017.06 
##          0          0

3.3 Checks for duplicates

we did a check if the data is unique at the RegionID level and RegionName level. We found that there are no duplicates at the RegionID and RegionName level. We are good to go with this table for our further analysis.

length(unique(nyc_zillow_final$RegionID)) == length(nyc_zillow_final$RegionID)

## [1] TRUE

length(unique(nyc_zillow_final$RegionName)) == length(nyc_zillow_final$RegionName)

## [1] TRUE

3.4 Comparison of median prices of properties

We will be interested in taking a look at the median prices of the 2BR properties in the regions of our interest. We will be segmenting Zips based on their county location.
After plotting, We found that the median prices of the properties located in the New York County are the highest. The properties in the Zip’s 10013, 10014, and 10011 are most expensive.
The median price of Zip’s will be one of the metrics to evaluate the best Zip’s to invest. We now move forward with our analysis using the Airbnb listings data.

Zip_median_prices <- arrange(nyc_zillow_final, desc(X2017.06))

ggplot(data=Zip_median_prices, aes(x = reorder(RegionName, X2017.06) , y = X2017.06, fill = as.factor(Zip_median_prices$CountyName))) + geom_bar(stat="identity" ) + scale_fill_brewer(name = "County Name" ,  palette = "Set1") + coord_flip() + labs(x = "Zipcode", y = "Median Price of 2BR Properties")

4 Data filtration of Airbnb listings data.

There are in all 25 ZIPs among which we will be finding the best zips to invest for our client. We will be using the AirBnb Listings data to evaluate zips.
Since our client has already concluded that 2 BR properties are best to invest, we will only be interested in 2 BR property listings located in these 25 ZIPs of New York City. So, we will first be filtering the 2BR properties from the listings data

nyc_airbnb_data <- filter(Airbnb_listings, bedrooms == 2, zipcode %in% nyc_zillow_final[,"RegionName"])

5 Data Exploration of Airbnb listings data

dim(nyc_airbnb_data)

## [1] 1565  106

names(nyc_airbnb_data)

##   [1] "id"                                          
##   [2] "listing_url"                                 
##   [3] "scrape_id"                                   
##   [4] "last_scraped"                                
##   [5] "name"                                        
##   [6] "summary"                                     
##   [7] "space"                                       
##   [8] "description"                                 
##   [9] "experiences_offered"                         
##  [10] "neighborhood_overview"                       
##  [11] "notes"                                       
##  [12] "transit"                                     
##  [13] "access"                                      
##  [14] "interaction"                                 
##  [15] "house_rules"                                 
##  [16] "thumbnail_url"                               
##  [17] "medium_url"                                  
##  [18] "picture_url"                                 
##  [19] "xl_picture_url"                              
##  [20] "host_id"                                     
##  [21] "host_url"                                    
##  [22] "host_name"                                   
##  [23] "host_since"                                  
##  [24] "host_location"                               
##  [25] "host_about"                                  
##  [26] "host_response_time"                          
##  [27] "host_response_rate"                          
##  [28] "host_acceptance_rate"                        
##  [29] "host_is_superhost"                           
##  [30] "host_thumbnail_url"                          
##  [31] "host_picture_url"                            
##  [32] "host_neighbourhood"                          
##  [33] "host_listings_count"                         
##  [34] "host_total_listings_count"                   
##  [35] "host_verifications"                          
##  [36] "host_has_profile_pic"                        
##  [37] "host_identity_verified"                      
##  [38] "street"                                      
##  [39] "neighbourhood"                               
##  [40] "neighbourhood_cleansed"                      
##  [41] "neighbourhood_group_cleansed"                
##  [42] "city"                                        
##  [43] "state"                                       
##  [44] "zipcode"                                     
##  [45] "market"                                      
##  [46] "smart_location"                              
##  [47] "country_code"                                
##  [48] "country"                                     
##  [49] "latitude"                                    
##  [50] "longitude"                                   
##  [51] "is_location_exact"                           
##  [52] "property_type"                               
##  [53] "room_type"                                   
##  [54] "accommodates"                                
##  [55] "bathrooms"                                   
##  [56] "bedrooms"                                    
##  [57] "beds"                                        
##  [58] "bed_type"                                    
##  [59] "amenities"                                   
##  [60] "square_feet"                                 
##  [61] "price"                                       
##  [62] "weekly_price"                                
##  [63] "monthly_price"                               
##  [64] "security_deposit"                            
##  [65] "cleaning_fee"                                
##  [66] "guests_included"                             
##  [67] "extra_people"                                
##  [68] "minimum_nights"                              
##  [69] "maximum_nights"                              
##  [70] "minimum_minimum_nights"                      
##  [71] "maximum_minimum_nights"                      
##  [72] "minimum_maximum_nights"                      
##  [73] "maximum_maximum_nights"                      
##  [74] "minimum_nights_avg_ntm"                      
##  [75] "maximum_nights_avg_ntm"                      
##  [76] "calendar_updated"                            
##  [77] "has_availability"                            
##  [78] "availability_30"                             
##  [79] "availability_60"                             
##  [80] "availability_90"                             
##  [81] "availability_365"                            
##  [82] "calendar_last_scraped"                       
##  [83] "number_of_reviews"                           
##  [84] "number_of_reviews_ltm"                       
##  [85] "first_review"                                
##  [86] "last_review"                                 
##  [87] "review_scores_rating"                        
##  [88] "review_scores_accuracy"                      
##  [89] "review_scores_cleanliness"                   
##  [90] "review_scores_checkin"                       
##  [91] "review_scores_communication"                 
##  [92] "review_scores_location"                      
##  [93] "review_scores_value"                         
##  [94] "requires_license"                            
##  [95] "license"                                     
##  [96] "jurisdiction_names"                          
##  [97] "instant_bookable"                            
##  [98] "is_business_travel_ready"                    
##  [99] "cancellation_policy"                         
## [100] "require_guest_profile_picture"               
## [101] "require_guest_phone_verification"            
## [102] "calculated_host_listings_count"              
## [103] "calculated_host_listings_count_entire_homes" 
## [104] "calculated_host_listings_count_private_rooms"
## [105] "calculated_host_listings_count_shared_rooms" 
## [106] "reviews_per_month"

5.1 Data Filtering based on variables for Airbnb listings data

The data table nyc_airbnb_data contains 1565 observations of 2BR properties. It has 106 different variables as of now. We will only be interested in columns that will help us evaluate the best zips/ which align with our approach. Let us first filter for the variables which will help us create attributes at zip level.
We will keep on updating the table based on the use of variables by adding or dropping them. “nyc_airbnb_final” will be the table with which we move towards our further analysis.

nyc_airbnb_final <- select(nyc_airbnb_data, id, host_id, street, neighbourhood, neighbourhood_cleansed, neighbourhood_group_cleansed,   city,   zipcode,    market, smart_location, latitude,   longitude, accommodates, bathrooms, bedrooms,   beds,   square_feet, price, weekly_price, monthly_price, security_deposit,  cleaning_fee,   guests_included,    extra_people, availability_30, availability_60, availability_90, availability_365, cancellation_policy )

5.2 Data Quality check for Airbnb Listings data

glimpse(nyc_airbnb_final)

## Observations: 1,565
## Variables: 29
## $ id                           <int> 16458, 20853, 23135, 27006, 27659, …
## $ host_id                      <int> 64056, 79070, 11481, 115560, 119588…
## $ street                       <fct> "Brooklyn, NY, United States", "Bro…
## $ neighbourhood                <fct> Brooklyn, Brooklyn, Carroll Gardens…
## $ neighbourhood_cleansed       <fct> Park Slope, Prospect Heights, Carro…
## $ neighbourhood_group_cleansed <fct> Brooklyn, Brooklyn, Brooklyn, Manha…
## $ city                         <fct> Brooklyn, Brooklyn, Brooklyn, New Y…
## $ zipcode                      <fct> 11215, 11217, 11231, 10023, 11215, …
## $ market                       <fct> New York, New York, New York, New Y…
## $ smart_location               <fct> "Brooklyn, NY", "Brooklyn, NY", "Br…
## $ latitude                     <dbl> 40.67343, 40.68035, 40.67967, 40.77…
## $ longitude                    <dbl> -73.98338, -73.97162, -74.00154, -7…
## $ accommodates                 <int> 4, 4, 3, 4, 9, 3, 6, 2, 4, 4, 4, 4,…
## $ bathrooms                    <dbl> 2.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.5, …
## $ bedrooms                     <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ beds                         <int> 1, 3, 2, 3, 6, 1, 3, 2, 3, 3, 2, 2,…
## $ square_feet                  <int> NA, NA, NA, NA, NA, 600, 1200, NA, …
## $ price                        <fct> $225.00, $250.00, $175.00, $235.00,…
## $ weekly_price                 <fct> "", "$1,400.00", "", "$2,000.00", "…
## $ monthly_price                <fct> "", "$4,000.00", "", "$6,000.00", "…
## $ security_deposit             <fct> "", "$500.00", "$0.00", "$500.00", …
## $ cleaning_fee                 <fct> , $250.00, $65.00, $120.00, $120.00…
## $ guests_included              <int> 1, 4, 1, 3, 1, 2, 2, 1, 2, 2, 11, 2…
## $ extra_people                 <fct> $0.00, $50.00, $15.00, $40.00, $0.0…
## $ availability_30              <int> 0, 0, 21, 0, 0, 0, 0, 12, 0, 0, 13,…
## $ availability_60              <int> 0, 0, 39, 16, 17, 0, 0, 37, 0, 0, 4…
## $ availability_90              <int> 0, 0, 69, 16, 17, 0, 0, 67, 6, 0, 7…
## $ availability_365             <int> 0, 158, 342, 199, 216, 0, 0, 342, 2…
## $ cancellation_policy          <fct> moderate, strict_14_with_grace_peri…

We see that the columns where the entries are in the form of Dollars, they are treated as character strings. So we cannot directly check for N.A’s (missing values) in these columns. They are price, weekly_price, monthly_price, security_deposit, cleaning_fee, and extra_people. So let us check how many missing values are present in these columns

sum((nyc_airbnb_final$price==''))

## [1] 0

sum((nyc_airbnb_final$weekly_price==''))

## [1] 1351

sum((nyc_airbnb_final$monthly_price==''))

## [1] 1398

sum((nyc_airbnb_final$security_deposit==''))

## [1] 420

sum((nyc_airbnb_final$cleaning_fee==''))

## [1] 212

sum((nyc_airbnb_final$extra_people==''))

## [1] 0

We see that there are missing values in ‘weekly_price’, ‘monthly_price’, ‘security_deposit’, ‘cleaning_fee’.
Let us first remove ‘$’ sign and then the ‘,’ sign from these columns

nyc_airbnb_final$price = gsub("\\$", "", nyc_airbnb_final$price)
nyc_airbnb_final$weekly_price = gsub("\\$", "", nyc_airbnb_final$weekly_price)
nyc_airbnb_final$monthly_price = gsub("\\$", "", nyc_airbnb_final$monthly_price)
nyc_airbnb_final$security_deposit = gsub("\\$", "", nyc_airbnb_final$security_deposit)
nyc_airbnb_final$cleaning_fee = gsub("\\$", "", nyc_airbnb_final$cleaning_fee)
nyc_airbnb_final$extra_people = gsub("\\$", "", nyc_airbnb_final$extra_people)


nyc_airbnb_final$price = gsub("\\,", "", nyc_airbnb_final$price)
nyc_airbnb_final$weekly_price = gsub("\\,", "", nyc_airbnb_final$weekly_price)
nyc_airbnb_final$monthly_price = gsub("\\,", "", nyc_airbnb_final$monthly_price)
nyc_airbnb_final$security_deposit = gsub("\\,", "", nyc_airbnb_final$security_deposit)
nyc_airbnb_final$cleaning_fee = gsub("\\,", "", nyc_airbnb_final$cleaning_fee)
nyc_airbnb_final$extra_people = gsub("\\,", "", nyc_airbnb_final$extra_people)

We will convert these columns into integer form for the ease of use

nyc_airbnb_final$price = as.numeric(nyc_airbnb_final$price)
nyc_airbnb_final$weekly_price = as.numeric(nyc_airbnb_final$weekly_price)
nyc_airbnb_final$monthly_price = as.numeric(nyc_airbnb_final$monthly_price)
nyc_airbnb_final$security_deposit = as.numeric(nyc_airbnb_final$security_deposit)
nyc_airbnb_final$cleaning_fee = as.numeric(nyc_airbnb_final$cleaning_fee)
nyc_airbnb_final$extra_people = as.numeric(nyc_airbnb_final$extra_people)

Checking for N.A’s in our table (after all of the major columns are converted to integer form)

colSums(is.na(nyc_airbnb_final))

##                           id                      host_id 
##                            0                            0 
##                       street                neighbourhood 
##                            0                            0 
##       neighbourhood_cleansed neighbourhood_group_cleansed 
##                            0                            0 
##                         city                      zipcode 
##                            0                            0 
##                       market               smart_location 
##                            0                            0 
##                     latitude                    longitude 
##                            0                            0 
##                 accommodates                    bathrooms 
##                            0                            3 
##                     bedrooms                         beds 
##                            0                            0 
##                  square_feet                        price 
##                         1538                            0 
##                 weekly_price                monthly_price 
##                         1351                         1398 
##             security_deposit                 cleaning_fee 
##                          420                          212 
##              guests_included                 extra_people 
##                            0                            0 
##              availability_30              availability_60 
##                            0                            0 
##              availability_90             availability_365 
##                            0                            0 
##          cancellation_policy 
##                            0

We found that the number of N.A. values for weekly_price, monthly_price, security_deposit and cleaning fee are the same as the number of blanks that were present in these columns before converting into an integer, which tells that we have successfully converted these columns into integer form.
There are no missing values for ‘id’ and ‘host_id’ which we will be needing for further analysis. Also, the columns ‘bathrooms’ has a few missing values, but that is a column that we won’t be using much.
Now, we will check for the duplicates at the id and host_id level

length(unique(nyc_airbnb_final$id)) == length(nyc_airbnb_final$id)

## [1] TRUE

length(unique(nyc_airbnb_final$host_id)) == length(nyc_airbnb_final$host_id)

## [1] FALSE

We see that the data is unique at the listing level and is not unique at the host_id level, which is expected. Every unique listing has a unique id whereas as there can a common host of multiple listings, hence it is not unique at the host level. So, we can go forward and do our analyses at the listing level

6 Zip Potential Analysis

We will carry out a Zip potential analysis to check which Zip’s have most of the markets potential. This can be done by checking the revenues generated, the total number of listings and the number of total individual bookings taking place in a year for every Zip.
We will first calculate these metrics at the listing level and then roll it up at the ZIP level.
At the listing level, we have two major steps :
1. To calculate the occupancy rate
2. To calculate the revenue_generated/Year by every listing.
3. Calculate the number of Individual-bookings/Year
Let us start with calculating the occupancy rate using the variables ‘availability_30’, ‘availability_60’, ‘availability_90’ and ‘availability_365’. These columns show the number of days the listing is available to book within 30, 60, 90 and 365 days respectively. So, if a listing has a value of zero in available_30, it means that the listing is completely booked for the next 30 days (from the date of calendar scrap). Since these values are updated by the hosts, some of these values might not reflect the actual availability. To cover this, we are assuming a minimum occupancy of 75%.

6.1 Methodology and Assumptions for occupancy rate calculation

We will be calculating the occupancy rate for all the 4 columns (30, 60, 90 and 365 days).
For example 30 days occupancy rate in percent = (No of days occupied in 30 days)*100/30
After calculating this for all 4 columns, we will be taking the max of (30 days occupancy, 60 days occupancy, 90 days occupancy, 365 days occupancy, 75%). The reason behind taking the maximum is, these metrics are something which is updated by hosts. So, if a host currently has 100% occupancy for next 30 days but less than 100% occupancy for next 60 days, we can always assume that in some time the occupancy for next 30 days will be updated and it will be close to 100% with current availability_30 as evidence. Similar logic applies to future days (>60 days). Hence, taking the max of these 4 occupancy rates.
The mean of the final occupancy rate comes out to be greater than 92%. Our assumption of 75% would have undermined the market potential. We go forward with the calculated final occupancy rate for each listing.

nyc_airbnb_final$percent_occupancy_30 = ((30- nyc_airbnb_final$availability_30)*100)/30
nyc_airbnb_final$percent_occupancy_60 = ((60- nyc_airbnb_final$availability_60)*100)/60
nyc_airbnb_final$percent_occupancy_90 = ((90- nyc_airbnb_final$availability_90)*100)/90
nyc_airbnb_final$percent_occupancy_365 = ((365- nyc_airbnb_final$availability_365)*100)/365

#creating a column with min 75% occupancy for all the listings 

nyc_airbnb_final$min_occu_rate <- replicate(1565, 75)

##Final Occupancy rate calculation 
nyc_airbnb_final$final_occu_rate = pmax(nyc_airbnb_final$percent_occupancy_30, nyc_airbnb_final$percent_occupancy_60, nyc_airbnb_final$percent_occupancy_90, nyc_airbnb_final$percent_occupancy_365, nyc_airbnb_final$min_occu_rate)

mean(nyc_airbnb_final$final_occu_rate)

## [1] 92.07428

6.2 Estimating weekly and monthly prices

For the listing which does not have weekly and monthly prices mentioned, we will calculate based on daily prices.
Weekly price = 7 times the daily price
Monthly price = 30 times the daily price

nyc_airbnb_final$weekly_final_rate = ifelse(is.na(nyc_airbnb_final$weekly_price), 7*nyc_airbnb_final$price, nyc_airbnb_final$weekly_price)

nyc_airbnb_final$monthly_final_rate = ifelse(is.na(nyc_airbnb_final$monthly_price), 30*nyc_airbnb_final$price, nyc_airbnb_final$monthly_price)

6.3 Revenue calculation at the listing level

We will first be calculating the maximum possible revenue per year at the listing level
We are interested in calculating the revenue generated per year by every listing based on three rates :
1. Daily (365 times daily rate)
2. Weekly (52 times the weekly rate)
3. Monthly (12 times the monthly rate).
These calculations will give us the revenues expected if there was a 100% occupancy rate. After this, we will factorize the revenues by using the actual occupancy rates.

nyc_airbnb_final$max_rev_day_basis = 365 * (nyc_airbnb_final$price)
nyc_airbnb_final$max_rev_week_basis = 52 * (nyc_airbnb_final$weekly_final_rate)
nyc_airbnb_final$max_rev_month_basis = 12 * (nyc_airbnb_final$monthly_final_rate)

Now these revenues are to be factored by the occupancy rates calculated by us. We will call this as actual revenues generated based on three rates.

nyc_airbnb_final$actual_rev_day_basis = (nyc_airbnb_final$max_rev_day_basis * nyc_airbnb_final$final_occu_rate)/100

nyc_airbnb_final$actual_rev_week_basis = (nyc_airbnb_final$max_rev_week_basis * nyc_airbnb_final$final_occu_rate)/100

nyc_airbnb_final$actual_rev_month_basis = (nyc_airbnb_final$max_rev_month_basis * nyc_airbnb_final$final_occu_rate)/100

Practically, the actual revenue is generated using all the 3 types of prices. i.e. out of total occupied days for a listing, the bookings might be daily bookings or weekly bookings or monthly bookings. We cannot proceed to take any one of these. They need to be assigned with some weights. Since we do not have actual weights, we will go forward assuming weights. It is obvious that the maximum number of the booking will be on a day basis, then on a weekly basis and lowest on a monthly basis. So, we are assuming the weights as follows:
Assuming that 80% of revenue is generated by daily bookings, 15% by the weekly bookings and 5% by the monthly bookings. Note that we can always revise these weights after taking a look at the results. Now, we can calculate the final revenue per year at a listing level using these weights

nyc_airbnb_final$final_rev_per_year = (0.80 * nyc_airbnb_final$actual_rev_day_basis) + (0.15 * nyc_airbnb_final$actual_rev_week_basis) + (0.05 * nyc_airbnb_final$actual_rev_month_basis)

6.4 Calculations of Individual Bookings per year

Now, let us get to the third important calculation i.e. calculation of the number of individual-bookings per year at the listing level.
Individual-bookings : The metric ‘individual-bookings’ is counting the number of individuals visiting the listing long with the number of bookings. For example, if the same individual books a listing for 10 days, it will be counted as 10 individual-bookings. So we will be calculating the total number of individual-bookings per year for every listing in our database.
We have a variable ‘Accomodates’ in our data base that gives us the capacity of individuals that can be accommodated at the same time for each listing. We will be using ‘Accommodates’ and ‘Final occupancy rate’ to calculate the total number of individual-bookings per year.
In the very first step we will calculate the maximum yearly capacity of a listing to accommodate total individuals per year

nyc_airbnb_final$yearly_capacity = nyc_airbnb_final$accommodates * 365

In the next step we will factorize this yearly capacity by multiplying it with the actual occupancy rate. This metric will give us the total individual-bookings per year for every listing, which we will be using later to roll up at a Zip level.

nyc_airbnb_final$yearly_indi_bookings = (nyc_airbnb_final$yearly_capacity * nyc_airbnb_final$final_occu_rate)/100

7 Zip Level Calculations

We are now done calculating metrics at the listings level. Now, we will move and calculate the same metrics at the zip level. All the ZIPs will be analyzed using 3 major metrics. They are:
1. ZIP market potential
2. ZIP payback period
3. Prices of properties in the ZIPs.
4. We will also be using a metric based on the variable ‘Cancellation policy’ to verify our results.
The first metric to be calculated is ZIP market potential. Zip market potential will be calculated using three metrics :
1. Total number of listings in the Zip
2. Yearly Revenue generated by the Zip
3. Yearly individual-bookings in the Zip
Basically, we have the data at the listings level and we also have the ZIP of every listing. Now, we will be rolling up this listing level data at the ZIP level. Let us get the unique ZIPs from the listings table
We did some checks to verify that our calculation is correct. We observe that all the calculations at the listings level are matching with the calculations at the ZIP level after rolling up the data.

#Grouping data at a ZIP level  

groupby_zip <- group_by(nyc_airbnb_final, zipcode)

##Calculating the total number of listings, Total revenue generated per year and total number of yearly individual-bookings for every ZIP

ZIP_potential_analysis <- summarise(groupby_zip, total_listings = n(), Total_revenue = sum(final_rev_per_year), Total_indi_bookings = sum(yearly_indi_bookings))

#Quick checks

sum(ZIP_potential_analysis$total_listings) #1565

## [1] 1565

length(nyc_airbnb_final$id) #1565

## [1] 1565

sum(ZIP_potential_analysis$Total_revenue) #145861918

## [1] 145861918

sum(nyc_airbnb_final$final_rev_per_year) #145861918

## [1] 145861918

sum(ZIP_potential_analysis$Total_indi_bookings) #2331099

## [1] 2331099

sum(nyc_airbnb_final$yearly_indi_bookings) #2331099

## [1] 2331099

7.1 Comparison of Listing Prices in Zip’s

As we have seen earlier that the properties in New York County were the most expensive. Similarly, we found that the most expensive listings are also in New York i.e. in the Manhattan area.

plot_price <- ggplot(nyc_airbnb_final, aes(x = zipcode,y = price, fill = neighbourhood_group_cleansed)) + geom_boxplot() + scale_y_continuous(limits = quantile(nyc_airbnb_final$price, c(0, 0.99))) + labs(x = "Zipcode", y = "Price/Night") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + guides(fill = guide_legend(title = "Neighbourhood"))
suppressWarnings(print(plot_price))

8 Zip Rankings based on Market Potential Metrics

Now we want to rank all the ZIPs based on 3 potential metrics. Lower the rank, higher is the Zip market potential.

ZIP_potential_analysis$rank_listings <- rank(-ZIP_potential_analysis$total_listings)
ZIP_potential_analysis$rank_revenue <- rank(-ZIP_potential_analysis$Total_revenue)
ZIP_potential_analysis$rank_indi_booking <- rank(-ZIP_potential_analysis$Total_indi_bookings)

8.1 Segment Assignment

We are assigning segments to all the ZIPs based on the calculated ranks.
Segmentation is done in the following way:
1. The ZIPs having ranks in between 1-5 will be given a ‘VH’ (Very High potential) segment.
2. The ZIPs having ranks in between 6-10 will be given an ‘H’ (High potential) segment.
3. The ZIPs having ranks in between 11-15 will be given an ‘M’ (Medium potential) segment.
4. The ZIPs having ranks in between 16-20 will be given an ‘L’ (Low potential) segment.
5. The ZIPs having ranks in between 21-25 will be given a ‘VL’ (Very Low potential) segment.
We will be creating a data frame with ranks and respective segments as described above. In one of the cases for rank_listings, the same rank of ‘4.5’ is assigned to zipcode’s 10025 and 11217 as they have the same number of listings. We are also putting a provision for this rank into our rank-segment assignment table

rank_to_segment_table <- data.frame(Rank = c(1:25, 4.5), Segment = c('VH','VH','VH','VH','VH','H','H','H','H','H','M','M','M','M','M','L','L','L','L','L','VL','VL','VL','VL','VL','VH'))

rank_to_segment_table

##    Rank Segment
## 1   1.0      VH
## 2   2.0      VH
## 3   3.0      VH
## 4   4.0      VH
## 5   5.0      VH
## 6   6.0       H
## 7   7.0       H
## 8   8.0       H
## 9   9.0       H
## 10 10.0       H
## 11 11.0       M
## 12 12.0       M
## 13 13.0       M
## 14 14.0       M
## 15 15.0       M
## 16 16.0       L
## 17 17.0       L
## 18 18.0       L
## 19 19.0       L
## 20 20.0       L
## 21 21.0      VL
## 22 22.0      VL
## 23 23.0      VL
## 24 24.0      VL
## 25 25.0      VL
## 26  4.5      VH

Merging these segments into our ZIP_potential_analysis table.

ZIP_potential_analysis <- merge(x = ZIP_potential_analysis, y = rank_to_segment_table, by.x = "rank_listings", by.y = "Rank", all.x = TRUE)

ZIP_potential_analysis <- rename(ZIP_potential_analysis, Listing_segments = Segment)

ZIP_potential_analysis <- merge(x = ZIP_potential_analysis, y = rank_to_segment_table, by.x = "rank_revenue", by.y = "Rank", all.x = TRUE)

ZIP_potential_analysis <- rename(ZIP_potential_analysis, Revenue_segments = Segment)

ZIP_potential_analysis <- merge(x = ZIP_potential_analysis, y = rank_to_segment_table, by.x = "rank_indi_booking", by.y = "Rank", all.x = TRUE)

ZIP_potential_analysis <- rename(ZIP_potential_analysis, Indi_bookings_segments = Segment)

#Arranging the data frame 

ZIP_potential_analysis <- select(ZIP_potential_analysis, zipcode, total_listings, rank_listings, Listing_segments, Total_revenue, rank_revenue, Revenue_segments, Total_indi_bookings, rank_indi_booking, Indi_bookings_segments)

##Assigning the order of levels of segments

ZIP_potential_analysis$Listing_segments <- factor(ZIP_potential_analysis$Listing_segments, levels = c("VH", "H", "M", "L", "VL"))

ZIP_potential_analysis$Revenue_segments <- factor(ZIP_potential_analysis$Revenue_segments, levels = c("VH", "H", "M", "L", "VL"))

ZIP_potential_analysis$Indi_bookings_segments <- factor(ZIP_potential_analysis$Indi_bookings_segments, levels = c("VH", "H", "M", "L", "VL"))

8.2 Comparison of the number of listings for all the ZIPs

We can see that the 5 zips 11215, 10036, 10003, 11217 and 10025 falls in the very high potential segment based on the number of listings.

Zip_listing_plot <- arrange(ZIP_potential_analysis, desc(total_listings))

ggplot(data=Zip_listing_plot, aes(x =zipcode , y = total_listings, fill = as.factor(Zip_listing_plot$Listing_segments))) + ggtitle(label = "Total Listings by Zip's") +    geom_bar(stat="identity" ) + scale_fill_brewer(name = "Segments", palette = "Set2") + labs(x = "Zipcode", y = "Total Listings in Zip") +theme(legend.position="bottom", axis.text.x = element_text(angle = 90, hjust = 1))

8.3 Comparison of Revenue Generated for all the Zips

We can see that the 5 zips 10003, 10036, 10013, 10011 and 10025 falls in the very high potential segment on the basis total revenue generated per year.

Zip_revenue_plot <- arrange(ZIP_potential_analysis, desc(Total_revenue))

ggplot(data=Zip_revenue_plot, aes(x = zipcode , y = Total_revenue, fill = as.factor(Zip_revenue_plot$Revenue_segments))) + ggtitle(label = "Total Revenue by Zip's") + geom_bar(stat="identity" ) + scale_fill_brewer(name = "Segments", palette = "Set2") + labs(x = "Zipcode", y = "Revenue per Year") + theme(legend.position="bottom", axis.text.x = element_text(angle = 90, hjust = 1))

8.4 Comparison of individual-bookings per year for all Zip’s

We can see that the 5 zips 11215, 10036, 10003, 10025 and 11217 falls in the very high potential segment based on the total individual-bookings per year.

Zip_indi_booking_plot <- arrange(ZIP_potential_analysis, desc(Total_indi_bookings))

ggplot(data=Zip_indi_booking_plot, aes(x = zipcode, y = Total_indi_bookings, fill = as.factor(Zip_indi_booking_plot$Indi_bookings_segments))) + ggtitle(label = "Total Individual Bookings by Zip's") + geom_bar(stat="identity" ) + scale_fill_brewer(name = "Segments", palette = "Set2") + labs(x = "Zipcode", y = "Individual-Bookings per year") + theme(legend.position="bottom", axis.text.x = element_text(angle = 90, hjust = 1))

Now, we want to decide which zip has the highest potential based on all 3 potential metrics together, let us check how many times the ZIPs fall under high potential zones (i.e “VH” and “H”)

df1 <- data.frame(Zip = ZIP_potential_analysis[,"zipcode"], Segment = ZIP_potential_analysis[,"Listing_segments"])

df2 <- data.frame(Zip = ZIP_potential_analysis[,"zipcode"], Segment = ZIP_potential_analysis[,"Revenue_segments"])

df3 <- data.frame(Zip = ZIP_potential_analysis[,"zipcode"], Segment = ZIP_potential_analysis[,"Indi_bookings_segments"])

final_df <- rbind(df1, df2, df3)

zip_segment_level <- summarise(group_by(final_df, Zip, Segment), count = n())

zip_segment_summary <- arrange(zip_segment_level, Segment,desc(count))

9 Conclusion based on the Market Potential Metrics

It is observed that 3 ZIPs 10003, 10025, 10036 lie in the very high potential zone based on all the three metrics. Hence, these the best zip’s to invest based on the market potential.
Later, we see that ZIP’s 11215 and 11217 lie twice in Very High potential zone and lie once under the High potential zone. Hence, these are the second best zip’s to invest based on the market potential.
After these 5 ZIP’s, we see that ZIP’s 10011 and 10013 lie once in the Very High potential zone and lie twice under the “High” potential zone. Hence, these are the third best zip’s to invest based on the market potential.

10 Verification of the results by using the metric ‘Cancellation_Policy’

Assumption: We are assuming that if a zip has high market potential i.e. If there is a high demand for the listings in these ZIPs, they should have strict cancellation policies. The reason being the host would not like to keep his/her listing unoccupied because someone is canceling the pre-booked reservation. So, we would check for ZIP’s that have strict cancellation policies.

cancellation <- data.frame(zipcode = nyc_airbnb_data$zipcode,  policy = nyc_airbnb_data$cancellation_policy)

cancellation$policy <- factor(cancellation$policy, levels = c("strict_14_with_grace_period", "super_strict_30", "super_strict_60", "strict", "moderate", "flexible"))

cancellation_summary <- summarise(group_by(cancellation, zipcode, policy), count = n())

We will only be interested in looking for the best-evaluated ZIP’s in our previous analysis

can_final_summary <- arrange(filter(cancellation_summary, zipcode %in% c("10003", "10025", "10036", "11215", "11217", "10011", "10013")), desc(count))

y <- data.frame(zip = ZIP_potential_analysis$zipcode, total_listings = ZIP_potential_analysis$total_listings)


can_merged_data <- sqldf("select a.zipcode, a.policy, a.count, b.total_listings 
      from can_final_summary a 
      left join 
      y b
      on 
      a.zipcode = b.zip")

can_merged_data$percent <- (can_merged_data$count/can_merged_data$total_listings)*100

can_merged_data

##    zipcode                      policy count total_listings    percent
## 1    10036 strict_14_with_grace_period   110            147 74.8299320
## 2    11215 strict_14_with_grace_period    85            189 44.9735450
## 3    10025 strict_14_with_grace_period    66            124 53.2258065
## 4    11217 strict_14_with_grace_period    64            124 51.6129032
## 5    10013 strict_14_with_grace_period    63            105 60.0000000
## 6    10003 strict_14_with_grace_period    61            136 44.8529412
## 7    10011 strict_14_with_grace_period    59            106 55.6603774
## 8    11215                    moderate    58            189 30.6878307
## 9    11215                    flexible    46            189 24.3386243
## 10   10003                    moderate    38            136 27.9411765
## 11   10025                    flexible    36            124 29.0322581
## 12   11217                    moderate    36            124 29.0322581
## 13   10003                    flexible    34            136 25.0000000
## 14   10011                    flexible    28            106 26.4150943
## 15   10013                    flexible    27            105 25.7142857
## 16   11217                    flexible    23            124 18.5483871
## 17   10025                    moderate    22            124 17.7419355
## 18   10036                    flexible    22            147 14.9659864
## 19   10011                    moderate    19            106 17.9245283
## 20   10013                    moderate    15            105 14.2857143
## 21   10036                    moderate    13            147  8.8435374
## 22   10003             super_strict_60     3            136  2.2058824
## 23   10036             super_strict_30     1            147  0.6802721
## 24   10036             super_strict_60     1            147  0.6802721
## 25   11217             super_strict_30     1            124  0.8064516

We can see that the majority of the high potential evaluated Zip have a strict cancellation policy, which supports our results as per the assumption.

ggplot(can_merged_data, aes(fill = policy, y = percent, x = zipcode)) + ggtitle(label = "Cancellation Policy Distribution") +
    geom_bar(position="stack", stat="identity") +labs(x = "Zipcode", y = "Percentage") +theme(legend.position="bottom")

So, these are the final order of ZIP’s that we will suggest our client to invest into based on the market potential:
1. 10003
2. 10025
3. 10036
4. 11215
5. 11217
6. 10011
7. 10013

11 Payback Period calculation at the Zip level

Now, we move to the last past of the analysis where we will calculate the payback period of the Zips by comparing the median prices of properties given in the Zillow data and the average revenue generated per year in the Zip
Assumption 1: We are assuming that the expected value of revenue from a Zip be the average of the revenue generated by all the listings in that Zip. We will then compare this average value of revenue with the property prices given in the Zillow data.
Assumption 2: To calculate the payback period we are directly comparing the average revenue generated with the price of the property i.e. we are not comparing the profits with the property prices. Since with the given data cannot directly calculate the profit at a listing level. To calculate that we will have to consider expenses such as maintenance costs, taxes, etc. The results of the payback period are going to be conservative side as these costs are not considered. They are going to be higher than the actual payback period.
Assumption 3: We will also be calculating yearly returns based on average revenue’s calculated. The actual yearly return will be less than this as we are not comparing the profit with the investment.

zip_avg_revenues <- summarise(groupby_zip, Avg_revenue = mean(final_rev_per_year))

#Merging the median cost of properties from the Zillow data 

zip_avg_revenues <- merge(x = zip_avg_revenues, y = nyc_zillow_final, by.x = "zipcode", by.y = "RegionName", all.x = TRUE)

zip_avg_revenues$payback_years <- zip_avg_revenues$X2017.06/zip_avg_revenues$Avg_revenue

zip_avg_revenues$years_group <- ifelse(zip_avg_revenues$payback_years <= 5, "<5 Years", ifelse(zip_avg_revenues$payback_years <= 10, "5-10 Years", ifelse(zip_avg_revenues$payback_years <= 15, "11-15 Years", ifelse(zip_avg_revenues$payback_years <= 20, "16-20 Years", "20+ Years"))))

zip_avg_revenues$yearly_return_percent <- 100 * (zip_avg_revenues$Avg_revenue/zip_avg_revenues$X2017.06)


#Plotting payback years with percent annual returns

ggplot(zip_avg_revenues, aes(fill = years_group, x= zipcode, y= payback_years, )) +
  geom_point(size = 4, shape = 22) + labs( x = "Zipcode", y = "Years") +   ggtitle("Payback Period of Zip's") + theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 16) )  +  theme_classic() + theme(legend.position="bottom", axis.text.x = element_text(angle = 90, hjust = 1))

Based on the payback period calculated by our approach, Zip’s 10303, 10304, 10305, 10306 and 11434 have the lowest payback period. Only by using these results, we cannot suggest these zips to invest. There might a scenario where the investment itself is low and hence the payback period is low. Let us check for that by including the price comparison along with the payback period.

zip_avg_revenues <- rename(zip_avg_revenues, Median_Prices = X2017.06)

plot <- ggplot(zip_avg_revenues, aes(x=zipcode, y=payback_years, fill = zipcode)) + ggtitle(label = "Payback Period with Median Prices of Properties") + scale_y_continuous(labels = scales::comma) + geom_point(aes(col = zipcode, size=Median_Prices/10000)) + geom_smooth(method="loess", se=F) + labs( x="Zipcode", y="Payback Years") + theme(axis.text.x = element_text(angle = 90, hjust = 1))

plot

Here, we can see that the Zip’s that have less payback period also has less investment. Hence, the profit from this Zip’s will also be comparably low. Whereas, we see that the Zip’s where the investment is more, the payback period is also high. But, the profit will be higher in these Zips.
We cannot suggest the best Zip’s using the metric payback period. We need more data points and other information to come at some conclusion like profit data, tax data, etc.

12 Final Conclusion :

We will finally advise our client to invest in the Zip’s having higher potentials calculated in the previous analysis. Below is the list of those Zips.
1. 10003
2. 10025
3. 10036
4. 11215
5. 11217
6. 10011
7. 10013

13 Future Scope :

Considering the overall raw data, there is a lot that could have been done if time had permitted to do so. Below are some of the Zip evaluation approaches that can help us to conclude:

13.1 Zillow Cost Data :

We saw that the Zillow had around 262 columns that recorded the prices of 2BR properties in the past. We could have plotted the time series of the median prices of every Zip and studied their behaviors. We can check for the Zip’s where are property prices are rising at a higher rate and rate them accordingly. A metric like ‘Average growth’ in the property values can be calculated to do so.

13.2 AirBnb Listings Data :

There were several columns in the raw data that we did not use in our analysis but could be used to support our conclusions. Below are the names and how the other variables can be used:

Neighborhood overview : At the Zip level, we can analyze the additional details such as sites to see, restaurants, landmarks around, etc. The Zip’s can be rated based on the landmarks around or tourist spots around can be given more importance.

Transit : At the Zip level, we can analyze the details such as nearby bus services, train services, distance from airport, etc. The Zip’s can be rated based on these transit services and rated accordingly.

host_response_time and host_response_rate : Analyzing the average response and average response rate in every Zip can give us an idea of how demanding the listings were. Creating a metric based on that can help us rate the Zips.

host_is_superhost : Superhosts are experienced hosts who provide a shining example for other hosts, and extraordinary experiences for their guests. We can check for the number of hosts at the Zip level and rate them accordingly.

Area of Properties (Square Foot) : This can be a good metric that will help to used to check how prices rise as there is a rise in the area of properties. As we saw most of these fields were empty for most of the properties, we did not use it in our analyses.

Cleaning Fee : This can be an addition to the revenue generation calculation in our analysis. But, since we did not consider the maintenance cost and taxes we were not able to use this metric.

Listing Review : There were multiple variables which have the ratings given by the customers to the listings. Using this we can calculate the popularity of Zips which can be an important metric to make some conclusion.

Segment Assignment : We saw that we were only interested in analyzing the properties in the New York Zips. Since the number of Zips to be evaluated was a small number, we used a method of direct ranking and segmenting accordingly. The other approach would have been to calculate the deciles/quantiles and then segment the Zips. This approach was good if we were analyzing all the Zip’s in US/New York State together.

14 Tableau Dashboard for Results

A Tableau Dashboard has been made to show these results to our client. The dashboard shows the geographical location of Zips and related metrics. Below are the instruction on how to see the best Zips:
1. Select the attribute from the ‘Zip Attributes’ dropdown
2. Select the ‘Very High’ and ‘High’ options from the ‘Attribute Segment’ dropdown.
3. There is also a provision to see payback periods for Zips. For that select ‘Payback Period’ in the ‘Zip Attributes’ dropdown and ‘Year groups’ from ‘Attribute Segment’ dropdown.
4. Note that the dashboard was only built with a aim to show the results from the analyses.
5. Use this link to see the dashboard.

Capital One Data Challenge

Digvijay Kawale

January 05, 2020