library(tidyverse)
library(tidyr)
library(readr)
library(GGally)
library(DT)
library(leaflet)
library(glue)
library(matrixStats)
library(ggplot2)
library(expss)
library(sqldf)
Airbnb_listings <- read.csv("~/Desktop/Projects/Capital One Data Challenge/From Katie/listings.csv")
Zillow_pricing <- read.csv("~/Desktop/Projects/Capital One Data Challenge/From Katie/Zip_Zhvi_2bedroom.csv")
class(Zillow_pricing)
## [1] "data.frame"
dim(Zillow_pricing)
## [1] 8946 262
City Criterion : We are only interested in the pricing of the properties located in the New York City as per the problem statement. So, we will filter data based on city.
We see that there are 262 variables in the Zillow pricing data table giving the median prices of the 2 BR properties located in different zip’s, but we will only be using the most recent median prices of 2 BR properties. Hence, we will filter for the rest of the columns and only use the median prices for 06/2017.
nyc_zillow_data <- filter(Zillow_pricing, City == 'New York')
nyc_zillow_final <- select(nyc_zillow_data, RegionID:SizeRank, X2017.06)
colSums(is.na(nyc_zillow_final))
## RegionID RegionName City State Metro CountyName
## 0 0 0 0 0 0
## SizeRank X2017.06
## 0 0
length(unique(nyc_zillow_final$RegionID)) == length(nyc_zillow_final$RegionID)
## [1] TRUE
length(unique(nyc_zillow_final$RegionName)) == length(nyc_zillow_final$RegionName)
## [1] TRUE
We will be interested in taking a look at the median prices of the 2BR properties in the regions of our interest. We will be segmenting Zips based on their county location.
After plotting, We found that the median prices of the properties located in the New York County are the highest. The properties in the Zip’s 10013, 10014, and 10011 are most expensive.
The median price of Zip’s will be one of the metrics to evaluate the best Zip’s to invest. We now move forward with our analysis using the Airbnb listings data.
Zip_median_prices <- arrange(nyc_zillow_final, desc(X2017.06))
ggplot(data=Zip_median_prices, aes(x = reorder(RegionName, X2017.06) , y = X2017.06, fill = as.factor(Zip_median_prices$CountyName))) + geom_bar(stat="identity" ) + scale_fill_brewer(name = "County Name" , palette = "Set1") + coord_flip() + labs(x = "Zipcode", y = "Median Price of 2BR Properties")
There are in all 25 ZIPs among which we will be finding the best zips to invest for our client. We will be using the AirBnb Listings data to evaluate zips.
Since our client has already concluded that 2 BR properties are best to invest, we will only be interested in 2 BR property listings located in these 25 ZIPs of New York City. So, we will first be filtering the 2BR properties from the listings data
nyc_airbnb_data <- filter(Airbnb_listings, bedrooms == 2, zipcode %in% nyc_zillow_final[,"RegionName"])
dim(nyc_airbnb_data)
## [1] 1565 106
names(nyc_airbnb_data)
## [1] "id"
## [2] "listing_url"
## [3] "scrape_id"
## [4] "last_scraped"
## [5] "name"
## [6] "summary"
## [7] "space"
## [8] "description"
## [9] "experiences_offered"
## [10] "neighborhood_overview"
## [11] "notes"
## [12] "transit"
## [13] "access"
## [14] "interaction"
## [15] "house_rules"
## [16] "thumbnail_url"
## [17] "medium_url"
## [18] "picture_url"
## [19] "xl_picture_url"
## [20] "host_id"
## [21] "host_url"
## [22] "host_name"
## [23] "host_since"
## [24] "host_location"
## [25] "host_about"
## [26] "host_response_time"
## [27] "host_response_rate"
## [28] "host_acceptance_rate"
## [29] "host_is_superhost"
## [30] "host_thumbnail_url"
## [31] "host_picture_url"
## [32] "host_neighbourhood"
## [33] "host_listings_count"
## [34] "host_total_listings_count"
## [35] "host_verifications"
## [36] "host_has_profile_pic"
## [37] "host_identity_verified"
## [38] "street"
## [39] "neighbourhood"
## [40] "neighbourhood_cleansed"
## [41] "neighbourhood_group_cleansed"
## [42] "city"
## [43] "state"
## [44] "zipcode"
## [45] "market"
## [46] "smart_location"
## [47] "country_code"
## [48] "country"
## [49] "latitude"
## [50] "longitude"
## [51] "is_location_exact"
## [52] "property_type"
## [53] "room_type"
## [54] "accommodates"
## [55] "bathrooms"
## [56] "bedrooms"
## [57] "beds"
## [58] "bed_type"
## [59] "amenities"
## [60] "square_feet"
## [61] "price"
## [62] "weekly_price"
## [63] "monthly_price"
## [64] "security_deposit"
## [65] "cleaning_fee"
## [66] "guests_included"
## [67] "extra_people"
## [68] "minimum_nights"
## [69] "maximum_nights"
## [70] "minimum_minimum_nights"
## [71] "maximum_minimum_nights"
## [72] "minimum_maximum_nights"
## [73] "maximum_maximum_nights"
## [74] "minimum_nights_avg_ntm"
## [75] "maximum_nights_avg_ntm"
## [76] "calendar_updated"
## [77] "has_availability"
## [78] "availability_30"
## [79] "availability_60"
## [80] "availability_90"
## [81] "availability_365"
## [82] "calendar_last_scraped"
## [83] "number_of_reviews"
## [84] "number_of_reviews_ltm"
## [85] "first_review"
## [86] "last_review"
## [87] "review_scores_rating"
## [88] "review_scores_accuracy"
## [89] "review_scores_cleanliness"
## [90] "review_scores_checkin"
## [91] "review_scores_communication"
## [92] "review_scores_location"
## [93] "review_scores_value"
## [94] "requires_license"
## [95] "license"
## [96] "jurisdiction_names"
## [97] "instant_bookable"
## [98] "is_business_travel_ready"
## [99] "cancellation_policy"
## [100] "require_guest_profile_picture"
## [101] "require_guest_phone_verification"
## [102] "calculated_host_listings_count"
## [103] "calculated_host_listings_count_entire_homes"
## [104] "calculated_host_listings_count_private_rooms"
## [105] "calculated_host_listings_count_shared_rooms"
## [106] "reviews_per_month"
The data table nyc_airbnb_data contains 1565 observations of 2BR properties. It has 106 different variables as of now. We will only be interested in columns that will help us evaluate the best zips/ which align with our approach. Let us first filter for the variables which will help us create attributes at zip level.
We will keep on updating the table based on the use of variables by adding or dropping them. “nyc_airbnb_final” will be the table with which we move towards our further analysis.
nyc_airbnb_final <- select(nyc_airbnb_data, id, host_id, street, neighbourhood, neighbourhood_cleansed, neighbourhood_group_cleansed, city, zipcode, market, smart_location, latitude, longitude, accommodates, bathrooms, bedrooms, beds, square_feet, price, weekly_price, monthly_price, security_deposit, cleaning_fee, guests_included, extra_people, availability_30, availability_60, availability_90, availability_365, cancellation_policy )
glimpse(nyc_airbnb_final)
## Observations: 1,565
## Variables: 29
## $ id <int> 16458, 20853, 23135, 27006, 27659, …
## $ host_id <int> 64056, 79070, 11481, 115560, 119588…
## $ street <fct> "Brooklyn, NY, United States", "Bro…
## $ neighbourhood <fct> Brooklyn, Brooklyn, Carroll Gardens…
## $ neighbourhood_cleansed <fct> Park Slope, Prospect Heights, Carro…
## $ neighbourhood_group_cleansed <fct> Brooklyn, Brooklyn, Brooklyn, Manha…
## $ city <fct> Brooklyn, Brooklyn, Brooklyn, New Y…
## $ zipcode <fct> 11215, 11217, 11231, 10023, 11215, …
## $ market <fct> New York, New York, New York, New Y…
## $ smart_location <fct> "Brooklyn, NY", "Brooklyn, NY", "Br…
## $ latitude <dbl> 40.67343, 40.68035, 40.67967, 40.77…
## $ longitude <dbl> -73.98338, -73.97162, -74.00154, -7…
## $ accommodates <int> 4, 4, 3, 4, 9, 3, 6, 2, 4, 4, 4, 4,…
## $ bathrooms <dbl> 2.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.5, …
## $ bedrooms <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ beds <int> 1, 3, 2, 3, 6, 1, 3, 2, 3, 3, 2, 2,…
## $ square_feet <int> NA, NA, NA, NA, NA, 600, 1200, NA, …
## $ price <fct> $225.00, $250.00, $175.00, $235.00,…
## $ weekly_price <fct> "", "$1,400.00", "", "$2,000.00", "…
## $ monthly_price <fct> "", "$4,000.00", "", "$6,000.00", "…
## $ security_deposit <fct> "", "$500.00", "$0.00", "$500.00", …
## $ cleaning_fee <fct> , $250.00, $65.00, $120.00, $120.00…
## $ guests_included <int> 1, 4, 1, 3, 1, 2, 2, 1, 2, 2, 11, 2…
## $ extra_people <fct> $0.00, $50.00, $15.00, $40.00, $0.0…
## $ availability_30 <int> 0, 0, 21, 0, 0, 0, 0, 12, 0, 0, 13,…
## $ availability_60 <int> 0, 0, 39, 16, 17, 0, 0, 37, 0, 0, 4…
## $ availability_90 <int> 0, 0, 69, 16, 17, 0, 0, 67, 6, 0, 7…
## $ availability_365 <int> 0, 158, 342, 199, 216, 0, 0, 342, 2…
## $ cancellation_policy <fct> moderate, strict_14_with_grace_peri…
sum((nyc_airbnb_final$price==''))
## [1] 0
sum((nyc_airbnb_final$weekly_price==''))
## [1] 1351
sum((nyc_airbnb_final$monthly_price==''))
## [1] 1398
sum((nyc_airbnb_final$security_deposit==''))
## [1] 420
sum((nyc_airbnb_final$cleaning_fee==''))
## [1] 212
sum((nyc_airbnb_final$extra_people==''))
## [1] 0
We see that there are missing values in ‘weekly_price’, ‘monthly_price’, ‘security_deposit’, ‘cleaning_fee’.
Let us first remove ‘$’ sign and then the ‘,’ sign from these columns
nyc_airbnb_final$price = gsub("\\$", "", nyc_airbnb_final$price)
nyc_airbnb_final$weekly_price = gsub("\\$", "", nyc_airbnb_final$weekly_price)
nyc_airbnb_final$monthly_price = gsub("\\$", "", nyc_airbnb_final$monthly_price)
nyc_airbnb_final$security_deposit = gsub("\\$", "", nyc_airbnb_final$security_deposit)
nyc_airbnb_final$cleaning_fee = gsub("\\$", "", nyc_airbnb_final$cleaning_fee)
nyc_airbnb_final$extra_people = gsub("\\$", "", nyc_airbnb_final$extra_people)
nyc_airbnb_final$price = gsub("\\,", "", nyc_airbnb_final$price)
nyc_airbnb_final$weekly_price = gsub("\\,", "", nyc_airbnb_final$weekly_price)
nyc_airbnb_final$monthly_price = gsub("\\,", "", nyc_airbnb_final$monthly_price)
nyc_airbnb_final$security_deposit = gsub("\\,", "", nyc_airbnb_final$security_deposit)
nyc_airbnb_final$cleaning_fee = gsub("\\,", "", nyc_airbnb_final$cleaning_fee)
nyc_airbnb_final$extra_people = gsub("\\,", "", nyc_airbnb_final$extra_people)
nyc_airbnb_final$price = as.numeric(nyc_airbnb_final$price)
nyc_airbnb_final$weekly_price = as.numeric(nyc_airbnb_final$weekly_price)
nyc_airbnb_final$monthly_price = as.numeric(nyc_airbnb_final$monthly_price)
nyc_airbnb_final$security_deposit = as.numeric(nyc_airbnb_final$security_deposit)
nyc_airbnb_final$cleaning_fee = as.numeric(nyc_airbnb_final$cleaning_fee)
nyc_airbnb_final$extra_people = as.numeric(nyc_airbnb_final$extra_people)
colSums(is.na(nyc_airbnb_final))
## id host_id
## 0 0
## street neighbourhood
## 0 0
## neighbourhood_cleansed neighbourhood_group_cleansed
## 0 0
## city zipcode
## 0 0
## market smart_location
## 0 0
## latitude longitude
## 0 0
## accommodates bathrooms
## 0 3
## bedrooms beds
## 0 0
## square_feet price
## 1538 0
## weekly_price monthly_price
## 1351 1398
## security_deposit cleaning_fee
## 420 212
## guests_included extra_people
## 0 0
## availability_30 availability_60
## 0 0
## availability_90 availability_365
## 0 0
## cancellation_policy
## 0
We found that the number of N.A. values for weekly_price, monthly_price, security_deposit and cleaning fee are the same as the number of blanks that were present in these columns before converting into an integer, which tells that we have successfully converted these columns into integer form.
There are no missing values for ‘id’ and ‘host_id’ which we will be needing for further analysis. Also, the columns ‘bathrooms’ has a few missing values, but that is a column that we won’t be using much.
Now, we will check for the duplicates at the id and host_id level
length(unique(nyc_airbnb_final$id)) == length(nyc_airbnb_final$id)
## [1] TRUE
length(unique(nyc_airbnb_final$host_id)) == length(nyc_airbnb_final$host_id)
## [1] FALSE
We will carry out a Zip potential analysis to check which Zip’s have most of the markets potential. This can be done by checking the revenues generated, the total number of listings and the number of total individual bookings taking place in a year for every Zip.
We will first calculate these metrics at the listing level and then roll it up at the ZIP level.
At the listing level, we have two major steps :
Let us start with calculating the occupancy rate using the variables ‘availability_30’, ‘availability_60’, ‘availability_90’ and ‘availability_365’. These columns show the number of days the listing is available to book within 30, 60, 90 and 365 days respectively. So, if a listing has a value of zero in available_30, it means that the listing is completely booked for the next 30 days (from the date of calendar scrap). Since these values are updated by the hosts, some of these values might not reflect the actual availability. To cover this, we are assuming a minimum occupancy of 75%.
We will be calculating the occupancy rate for all the 4 columns (30, 60, 90 and 365 days).
For example 30 days occupancy rate in percent = (No of days occupied in 30 days)*100/30
After calculating this for all 4 columns, we will be taking the max of (30 days occupancy, 60 days occupancy, 90 days occupancy, 365 days occupancy, 75%). The reason behind taking the maximum is, these metrics are something which is updated by hosts. So, if a host currently has 100% occupancy for next 30 days but less than 100% occupancy for next 60 days, we can always assume that in some time the occupancy for next 30 days will be updated and it will be close to 100% with current availability_30 as evidence. Similar logic applies to future days (>60 days). Hence, taking the max of these 4 occupancy rates.
The mean of the final occupancy rate comes out to be greater than 92%. Our assumption of 75% would have undermined the market potential. We go forward with the calculated final occupancy rate for each listing.
nyc_airbnb_final$percent_occupancy_30 = ((30- nyc_airbnb_final$availability_30)*100)/30
nyc_airbnb_final$percent_occupancy_60 = ((60- nyc_airbnb_final$availability_60)*100)/60
nyc_airbnb_final$percent_occupancy_90 = ((90- nyc_airbnb_final$availability_90)*100)/90
nyc_airbnb_final$percent_occupancy_365 = ((365- nyc_airbnb_final$availability_365)*100)/365
#creating a column with min 75% occupancy for all the listings
nyc_airbnb_final$min_occu_rate <- replicate(1565, 75)
##Final Occupancy rate calculation
nyc_airbnb_final$final_occu_rate = pmax(nyc_airbnb_final$percent_occupancy_30, nyc_airbnb_final$percent_occupancy_60, nyc_airbnb_final$percent_occupancy_90, nyc_airbnb_final$percent_occupancy_365, nyc_airbnb_final$min_occu_rate)
mean(nyc_airbnb_final$final_occu_rate)
## [1] 92.07428
For the listing which does not have weekly and monthly prices mentioned, we will calculate based on daily prices.
Weekly price = 7 times the daily price
Monthly price = 30 times the daily price
nyc_airbnb_final$weekly_final_rate = ifelse(is.na(nyc_airbnb_final$weekly_price), 7*nyc_airbnb_final$price, nyc_airbnb_final$weekly_price)
nyc_airbnb_final$monthly_final_rate = ifelse(is.na(nyc_airbnb_final$monthly_price), 30*nyc_airbnb_final$price, nyc_airbnb_final$monthly_price)
We will first be calculating the maximum possible revenue per year at the listing level
We are interested in calculating the revenue generated per year by every listing based on three rates :
These calculations will give us the revenues expected if there was a 100% occupancy rate. After this, we will factorize the revenues by using the actual occupancy rates.
nyc_airbnb_final$max_rev_day_basis = 365 * (nyc_airbnb_final$price)
nyc_airbnb_final$max_rev_week_basis = 52 * (nyc_airbnb_final$weekly_final_rate)
nyc_airbnb_final$max_rev_month_basis = 12 * (nyc_airbnb_final$monthly_final_rate)
nyc_airbnb_final$actual_rev_day_basis = (nyc_airbnb_final$max_rev_day_basis * nyc_airbnb_final$final_occu_rate)/100
nyc_airbnb_final$actual_rev_week_basis = (nyc_airbnb_final$max_rev_week_basis * nyc_airbnb_final$final_occu_rate)/100
nyc_airbnb_final$actual_rev_month_basis = (nyc_airbnb_final$max_rev_month_basis * nyc_airbnb_final$final_occu_rate)/100
Practically, the actual revenue is generated using all the 3 types of prices. i.e. out of total occupied days for a listing, the bookings might be daily bookings or weekly bookings or monthly bookings. We cannot proceed to take any one of these. They need to be assigned with some weights. Since we do not have actual weights, we will go forward assuming weights. It is obvious that the maximum number of the booking will be on a day basis, then on a weekly basis and lowest on a monthly basis. So, we are assuming the weights as follows:
Assuming that 80% of revenue is generated by daily bookings, 15% by the weekly bookings and 5% by the monthly bookings. Note that we can always revise these weights after taking a look at the results. Now, we can calculate the final revenue per year at a listing level using these weights
nyc_airbnb_final$final_rev_per_year = (0.80 * nyc_airbnb_final$actual_rev_day_basis) + (0.15 * nyc_airbnb_final$actual_rev_week_basis) + (0.05 * nyc_airbnb_final$actual_rev_month_basis)
Now, let us get to the third important calculation i.e. calculation of the number of individual-bookings per year at the listing level.
Individual-bookings : The metric ‘individual-bookings’ is counting the number of individuals visiting the listing long with the number of bookings. For example, if the same individual books a listing for 10 days, it will be counted as 10 individual-bookings. So we will be calculating the total number of individual-bookings per year for every listing in our database.
We have a variable ‘Accomodates’ in our data base that gives us the capacity of individuals that can be accommodated at the same time for each listing. We will be using ‘Accommodates’ and ‘Final occupancy rate’ to calculate the total number of individual-bookings per year.
In the very first step we will calculate the maximum yearly capacity of a listing to accommodate total individuals per year
nyc_airbnb_final$yearly_capacity = nyc_airbnb_final$accommodates * 365
nyc_airbnb_final$yearly_indi_bookings = (nyc_airbnb_final$yearly_capacity * nyc_airbnb_final$final_occu_rate)/100
We are now done calculating metrics at the listings level. Now, we will move and calculate the same metrics at the zip level. All the ZIPs will be analyzed using 3 major metrics. They are:
The first metric to be calculated is ZIP market potential. Zip market potential will be calculated using three metrics :
Basically, we have the data at the listings level and we also have the ZIP of every listing. Now, we will be rolling up this listing level data at the ZIP level. Let us get the unique ZIPs from the listings table
We did some checks to verify that our calculation is correct. We observe that all the calculations at the listings level are matching with the calculations at the ZIP level after rolling up the data.
#Grouping data at a ZIP level
groupby_zip <- group_by(nyc_airbnb_final, zipcode)
##Calculating the total number of listings, Total revenue generated per year and total number of yearly individual-bookings for every ZIP
ZIP_potential_analysis <- summarise(groupby_zip, total_listings = n(), Total_revenue = sum(final_rev_per_year), Total_indi_bookings = sum(yearly_indi_bookings))
#Quick checks
sum(ZIP_potential_analysis$total_listings) #1565
## [1] 1565
length(nyc_airbnb_final$id) #1565
## [1] 1565
sum(ZIP_potential_analysis$Total_revenue) #145861918
## [1] 145861918
sum(nyc_airbnb_final$final_rev_per_year) #145861918
## [1] 145861918
sum(ZIP_potential_analysis$Total_indi_bookings) #2331099
## [1] 2331099
sum(nyc_airbnb_final$yearly_indi_bookings) #2331099
## [1] 2331099
plot_price <- ggplot(nyc_airbnb_final, aes(x = zipcode,y = price, fill = neighbourhood_group_cleansed)) + geom_boxplot() + scale_y_continuous(limits = quantile(nyc_airbnb_final$price, c(0, 0.99))) + labs(x = "Zipcode", y = "Price/Night") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + guides(fill = guide_legend(title = "Neighbourhood"))
suppressWarnings(print(plot_price))
ZIP_potential_analysis$rank_listings <- rank(-ZIP_potential_analysis$total_listings)
ZIP_potential_analysis$rank_revenue <- rank(-ZIP_potential_analysis$Total_revenue)
ZIP_potential_analysis$rank_indi_booking <- rank(-ZIP_potential_analysis$Total_indi_bookings)
We are assigning segments to all the ZIPs based on the calculated ranks.
Segmentation is done in the following way:
We will be creating a data frame with ranks and respective segments as described above. In one of the cases for rank_listings, the same rank of ‘4.5’ is assigned to zipcode’s 10025 and 11217 as they have the same number of listings. We are also putting a provision for this rank into our rank-segment assignment table
rank_to_segment_table <- data.frame(Rank = c(1:25, 4.5), Segment = c('VH','VH','VH','VH','VH','H','H','H','H','H','M','M','M','M','M','L','L','L','L','L','VL','VL','VL','VL','VL','VH'))
rank_to_segment_table
## Rank Segment
## 1 1.0 VH
## 2 2.0 VH
## 3 3.0 VH
## 4 4.0 VH
## 5 5.0 VH
## 6 6.0 H
## 7 7.0 H
## 8 8.0 H
## 9 9.0 H
## 10 10.0 H
## 11 11.0 M
## 12 12.0 M
## 13 13.0 M
## 14 14.0 M
## 15 15.0 M
## 16 16.0 L
## 17 17.0 L
## 18 18.0 L
## 19 19.0 L
## 20 20.0 L
## 21 21.0 VL
## 22 22.0 VL
## 23 23.0 VL
## 24 24.0 VL
## 25 25.0 VL
## 26 4.5 VH
ZIP_potential_analysis <- merge(x = ZIP_potential_analysis, y = rank_to_segment_table, by.x = "rank_listings", by.y = "Rank", all.x = TRUE)
ZIP_potential_analysis <- rename(ZIP_potential_analysis, Listing_segments = Segment)
ZIP_potential_analysis <- merge(x = ZIP_potential_analysis, y = rank_to_segment_table, by.x = "rank_revenue", by.y = "Rank", all.x = TRUE)
ZIP_potential_analysis <- rename(ZIP_potential_analysis, Revenue_segments = Segment)
ZIP_potential_analysis <- merge(x = ZIP_potential_analysis, y = rank_to_segment_table, by.x = "rank_indi_booking", by.y = "Rank", all.x = TRUE)
ZIP_potential_analysis <- rename(ZIP_potential_analysis, Indi_bookings_segments = Segment)
#Arranging the data frame
ZIP_potential_analysis <- select(ZIP_potential_analysis, zipcode, total_listings, rank_listings, Listing_segments, Total_revenue, rank_revenue, Revenue_segments, Total_indi_bookings, rank_indi_booking, Indi_bookings_segments)
##Assigning the order of levels of segments
ZIP_potential_analysis$Listing_segments <- factor(ZIP_potential_analysis$Listing_segments, levels = c("VH", "H", "M", "L", "VL"))
ZIP_potential_analysis$Revenue_segments <- factor(ZIP_potential_analysis$Revenue_segments, levels = c("VH", "H", "M", "L", "VL"))
ZIP_potential_analysis$Indi_bookings_segments <- factor(ZIP_potential_analysis$Indi_bookings_segments, levels = c("VH", "H", "M", "L", "VL"))
Zip_listing_plot <- arrange(ZIP_potential_analysis, desc(total_listings))
ggplot(data=Zip_listing_plot, aes(x =zipcode , y = total_listings, fill = as.factor(Zip_listing_plot$Listing_segments))) + ggtitle(label = "Total Listings by Zip's") + geom_bar(stat="identity" ) + scale_fill_brewer(name = "Segments", palette = "Set2") + labs(x = "Zipcode", y = "Total Listings in Zip") +theme(legend.position="bottom", axis.text.x = element_text(angle = 90, hjust = 1))
Zip_revenue_plot <- arrange(ZIP_potential_analysis, desc(Total_revenue))
ggplot(data=Zip_revenue_plot, aes(x = zipcode , y = Total_revenue, fill = as.factor(Zip_revenue_plot$Revenue_segments))) + ggtitle(label = "Total Revenue by Zip's") + geom_bar(stat="identity" ) + scale_fill_brewer(name = "Segments", palette = "Set2") + labs(x = "Zipcode", y = "Revenue per Year") + theme(legend.position="bottom", axis.text.x = element_text(angle = 90, hjust = 1))
Zip_indi_booking_plot <- arrange(ZIP_potential_analysis, desc(Total_indi_bookings))
ggplot(data=Zip_indi_booking_plot, aes(x = zipcode, y = Total_indi_bookings, fill = as.factor(Zip_indi_booking_plot$Indi_bookings_segments))) + ggtitle(label = "Total Individual Bookings by Zip's") + geom_bar(stat="identity" ) + scale_fill_brewer(name = "Segments", palette = "Set2") + labs(x = "Zipcode", y = "Individual-Bookings per year") + theme(legend.position="bottom", axis.text.x = element_text(angle = 90, hjust = 1))
df1 <- data.frame(Zip = ZIP_potential_analysis[,"zipcode"], Segment = ZIP_potential_analysis[,"Listing_segments"])
df2 <- data.frame(Zip = ZIP_potential_analysis[,"zipcode"], Segment = ZIP_potential_analysis[,"Revenue_segments"])
df3 <- data.frame(Zip = ZIP_potential_analysis[,"zipcode"], Segment = ZIP_potential_analysis[,"Indi_bookings_segments"])
final_df <- rbind(df1, df2, df3)
zip_segment_level <- summarise(group_by(final_df, Zip, Segment), count = n())
zip_segment_summary <- arrange(zip_segment_level, Segment,desc(count))
It is observed that 3 ZIPs 10003, 10025, 10036 lie in the very high potential zone based on all the three metrics. Hence, these the best zip’s to invest based on the market potential.
Later, we see that ZIP’s 11215 and 11217 lie twice in Very High potential zone and lie once under the High potential zone. Hence, these are the second best zip’s to invest based on the market potential.
After these 5 ZIP’s, we see that ZIP’s 10011 and 10013 lie once in the Very High potential zone and lie twice under the “High” potential zone. Hence, these are the third best zip’s to invest based on the market potential.
cancellation <- data.frame(zipcode = nyc_airbnb_data$zipcode, policy = nyc_airbnb_data$cancellation_policy)
cancellation$policy <- factor(cancellation$policy, levels = c("strict_14_with_grace_period", "super_strict_30", "super_strict_60", "strict", "moderate", "flexible"))
cancellation_summary <- summarise(group_by(cancellation, zipcode, policy), count = n())
can_final_summary <- arrange(filter(cancellation_summary, zipcode %in% c("10003", "10025", "10036", "11215", "11217", "10011", "10013")), desc(count))
y <- data.frame(zip = ZIP_potential_analysis$zipcode, total_listings = ZIP_potential_analysis$total_listings)
can_merged_data <- sqldf("select a.zipcode, a.policy, a.count, b.total_listings
from can_final_summary a
left join
y b
on
a.zipcode = b.zip")
can_merged_data$percent <- (can_merged_data$count/can_merged_data$total_listings)*100
can_merged_data
## zipcode policy count total_listings percent
## 1 10036 strict_14_with_grace_period 110 147 74.8299320
## 2 11215 strict_14_with_grace_period 85 189 44.9735450
## 3 10025 strict_14_with_grace_period 66 124 53.2258065
## 4 11217 strict_14_with_grace_period 64 124 51.6129032
## 5 10013 strict_14_with_grace_period 63 105 60.0000000
## 6 10003 strict_14_with_grace_period 61 136 44.8529412
## 7 10011 strict_14_with_grace_period 59 106 55.6603774
## 8 11215 moderate 58 189 30.6878307
## 9 11215 flexible 46 189 24.3386243
## 10 10003 moderate 38 136 27.9411765
## 11 10025 flexible 36 124 29.0322581
## 12 11217 moderate 36 124 29.0322581
## 13 10003 flexible 34 136 25.0000000
## 14 10011 flexible 28 106 26.4150943
## 15 10013 flexible 27 105 25.7142857
## 16 11217 flexible 23 124 18.5483871
## 17 10025 moderate 22 124 17.7419355
## 18 10036 flexible 22 147 14.9659864
## 19 10011 moderate 19 106 17.9245283
## 20 10013 moderate 15 105 14.2857143
## 21 10036 moderate 13 147 8.8435374
## 22 10003 super_strict_60 3 136 2.2058824
## 23 10036 super_strict_30 1 147 0.6802721
## 24 10036 super_strict_60 1 147 0.6802721
## 25 11217 super_strict_30 1 124 0.8064516
ggplot(can_merged_data, aes(fill = policy, y = percent, x = zipcode)) + ggtitle(label = "Cancellation Policy Distribution") +
geom_bar(position="stack", stat="identity") +labs(x = "Zipcode", y = "Percentage") +theme(legend.position="bottom")
Now, we move to the last past of the analysis where we will calculate the payback period of the Zips by comparing the median prices of properties given in the Zillow data and the average revenue generated per year in the Zip
Assumption 1: We are assuming that the expected value of revenue from a Zip be the average of the revenue generated by all the listings in that Zip. We will then compare this average value of revenue with the property prices given in the Zillow data.
Assumption 2: To calculate the payback period we are directly comparing the average revenue generated with the price of the property i.e. we are not comparing the profits with the property prices. Since with the given data cannot directly calculate the profit at a listing level. To calculate that we will have to consider expenses such as maintenance costs, taxes, etc. The results of the payback period are going to be conservative side as these costs are not considered. They are going to be higher than the actual payback period.
Assumption 3: We will also be calculating yearly returns based on average revenue’s calculated. The actual yearly return will be less than this as we are not comparing the profit with the investment.
zip_avg_revenues <- summarise(groupby_zip, Avg_revenue = mean(final_rev_per_year))
#Merging the median cost of properties from the Zillow data
zip_avg_revenues <- merge(x = zip_avg_revenues, y = nyc_zillow_final, by.x = "zipcode", by.y = "RegionName", all.x = TRUE)
zip_avg_revenues$payback_years <- zip_avg_revenues$X2017.06/zip_avg_revenues$Avg_revenue
zip_avg_revenues$years_group <- ifelse(zip_avg_revenues$payback_years <= 5, "<5 Years", ifelse(zip_avg_revenues$payback_years <= 10, "5-10 Years", ifelse(zip_avg_revenues$payback_years <= 15, "11-15 Years", ifelse(zip_avg_revenues$payback_years <= 20, "16-20 Years", "20+ Years"))))
zip_avg_revenues$yearly_return_percent <- 100 * (zip_avg_revenues$Avg_revenue/zip_avg_revenues$X2017.06)
#Plotting payback years with percent annual returns
ggplot(zip_avg_revenues, aes(fill = years_group, x= zipcode, y= payback_years, )) +
geom_point(size = 4, shape = 22) + labs( x = "Zipcode", y = "Years") + ggtitle("Payback Period of Zip's") + theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 16) ) + theme_classic() + theme(legend.position="bottom", axis.text.x = element_text(angle = 90, hjust = 1))
zip_avg_revenues <- rename(zip_avg_revenues, Median_Prices = X2017.06)
plot <- ggplot(zip_avg_revenues, aes(x=zipcode, y=payback_years, fill = zipcode)) + ggtitle(label = "Payback Period with Median Prices of Properties") + scale_y_continuous(labels = scales::comma) + geom_point(aes(col = zipcode, size=Median_Prices/10000)) + geom_smooth(method="loess", se=F) + labs( x="Zipcode", y="Payback Years") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
plot
Here, we can see that the Zip’s that have less payback period also has less investment. Hence, the profit from this Zip’s will also be comparably low. Whereas, we see that the Zip’s where the investment is more, the payback period is also high. But, the profit will be higher in these Zips.
We cannot suggest the best Zip’s using the metric payback period. We need more data points and other information to come at some conclusion like profit data, tax data, etc.
We will finally advise our client to invest in the Zip’s having higher potentials calculated in the previous analysis. Below is the list of those Zips.
Neighborhood overview : At the Zip level, we can analyze the additional details such as sites to see, restaurants, landmarks around, etc. The Zip’s can be rated based on the landmarks around or tourist spots around can be given more importance.
Transit : At the Zip level, we can analyze the details such as nearby bus services, train services, distance from airport, etc. The Zip’s can be rated based on these transit services and rated accordingly.
host_response_time and host_response_rate : Analyzing the average response and average response rate in every Zip can give us an idea of how demanding the listings were. Creating a metric based on that can help us rate the Zips.
host_is_superhost : Superhosts are experienced hosts who provide a shining example for other hosts, and extraordinary experiences for their guests. We can check for the number of hosts at the Zip level and rate them accordingly.
Area of Properties (Square Foot) : This can be a good metric that will help to used to check how prices rise as there is a rise in the area of properties. As we saw most of these fields were empty for most of the properties, we did not use it in our analyses.
Cleaning Fee : This can be an addition to the revenue generation calculation in our analysis. But, since we did not consider the maintenance cost and taxes we were not able to use this metric.
Listing Review : There were multiple variables which have the ratings given by the customers to the listings. Using this we can calculate the popularity of Zips which can be an important metric to make some conclusion.
Segment Assignment : We saw that we were only interested in analyzing the properties in the New York Zips. Since the number of Zips to be evaluated was a small number, we used a method of direct ranking and segmenting accordingly. The other approach would have been to calculate the deciles/quantiles and then segment the Zips. This approach was good if we were analyzing all the Zip’s in US/New York State together.
A Tableau Dashboard has been made to show these results to our client. The dashboard shows the geographical location of Zips and related metrics. Below are the instruction on how to see the best Zips: