Installing and accessing necessary packages

if(!require("dplyr")){
 
  install.packages("dplyr",dependencies = T)
  library(dplyr) #For Data Manipulation; Essential 'Pipe' Operator
}
if(!require("readr")){
 
  install.packages("readr",dependencies = T)
 library(readr) #Fast and friendly way to read Rectangular Data
}
if(!require("tidyr")){
 
  install.packages("tidyr",dependencies = T)
 library(tidyr) #Required for tidying and cleaning data to facilitate analysis

}
if(!require("ggplot2")){
 
  install.packages("ggplot2",dependencies = T)
library(ggplot2) #Used for Visualizations and Graphics
}
if(!require("gbm")){
 
  install.packages("gbm",dependencies = T)
library(gbm) #Used for fitting a Gradient Boost Model
}
if(!require("mice")){
 
  install.packages("mice",dependencies = T)
library(mice) # Multivariate Imputation by Chained Equations. Deals with missing data.
}
if(!require("plotly")){
 
  install.packages("plotly",dependencies = T)
library(plotly) #Create interactive web-based graphs
}

Loading Data(Airbnb Listings and Zillow Property Data)

list.data <- read_csv("C:/Users/Swagatam/Desktop/CapitalOne/listings.csv")
zillow.data<-read_csv("C:/Users/Swagatam/Desktop/CapitalOne/Zip_Zhvi_2bedroom.csv")

About the Dataset

  • The Airbnb/Listings data contains information on the listing including location, number of bedrooms, room types (entire home/private home/shared home)
  • The Zillow Data contains cost data to determine the average property price for 2 bedrooms

Objective of the Data Challenge

I am consulting for a real estate company that has a niche in purchasing properties to rent out short-term as part of their business model specifically within New York City. The real estate company has already concluded that two bedroom properties are the most profitable; however, they do not know which zip codes are the best to invest in. My objectives for this analysis would be finding out the Zip Codes and Neighborhoods that would have the best Annual Return Rates based on investments made in the properties and the Revenue from Airbnb. Secondly, I will be analyzing various factors like Room Type, Cancellation Policies, etc. and the relationship with Annual Return Rate and Total Annual Returns.

Assumptions

  • The investor will pay for the property in cash (i.e. no mortgage/interest rate will need to be accounted for).
  • The time value of money discount rate is 0% (i.e. $1 today is worth the same 100 years from now).
  • All properties and all square feet within each locale can be assumed to be homogeneous (i.e. a 1000 square foot property in a locale such as Bronx or Manhattan generates twice the revenue and costs twice as much as any other 500 square foot property within that same locale.)
  • The Occupancy Rate has been derived on the basis of certain conditions relating to Neighborhood and Overall Ratings (will be further discussed in the Analysis)
  • Imputation has been done on the basis of the MICE package in R. The package creates multiple imputations (replacement values) for multivariate missing data.

The datasets have been imported using the read_r package:

  1. The Airbnb data has 40753 observations and 95 variables
  2. The Zillow data has 8946 observations and 262 variables

Data Cleaning

The steps used for cleaning the data are as follows:

  1. Using the dplyr package, data has been filtered for hotels/apartments with 2 or less bedrooms. This corresponds to almost 95% of the data (i.e. 38000 observations). Further analysis/visualizations will be done on apartments with exactly 2 bedrooms for measuring returns, etc.
  2. Further filtering has been done on the Zillow dataset to obtain observations only for New York city. Note that filter has been done through Metro to get more observations. While joining ‘New York’ city will be used automatically for the Merge
  3. Calculation for the metric ‘Average Growth Rate’ has been done based on the Time Serie data from 1996 to 2017. This column has been appended using the Averaged Monthly Growth formula and then Growth based on yearly lag (previous year). Every Zip Code has a Growth Rate now
  4. Based on this Growth Rate, a new predicted price for 2019 has been created. The Growth Rate has been squared and multiplied with the 2017 data
  5. The two files have now been merged on the basis of the Zip Code key.
  6. The number of missing values found are 8429. Imputation of missing values is a must to obtain the prices for all Zip Codes/Locations.
#Removing Text data based columns from Airbnb Data
#Filtering bedrooms(less than 2). This contains more than 95% of the observations and will be required later for only 2 bedrooms
list.data<-list.data%>%
  filter(bedrooms<=2)%>%
  select(-c(listing_url,scrape_id,experiences_offered,thumbnail_url,medium_url,
            picture_url,xl_picture_url,host_url,host_name,host_about,host_thumbnail_url,
            host_picture_url))

#Filtering for New York City. Note that filter has been done through Metro to get more observations. While joining 'New York' city will be used automatically for the Merge
zillow.data<-zillow.data%>%
  filter(Metro=='New York')

#Creating the Time Series Data by using 'dplyr' functions
#Average Growth Rate has been considered for every year based on Monthly Growth every year
time.series.data<-zillow.data%>%
  select(-c(1,3:7))%>%
  group_by(RegionName)%>%
  gather(key=YearMonth,value=Price,-RegionName)%>%
  arrange(RegionName)%>%
  separate(YearMonth, c("Year","Month"),sep="-")%>%
  select(-Month)%>%
  group_by(RegionName,Year) %>%
  dplyr::summarize(Year.Price = mean(Price, na.rm=TRUE))%>%
  group_by(RegionName) %>% 
  mutate(Growth = (Year.Price - lag(Year.Price))/lag(Year.Price))%>%
  select(-Year,-Year.Price)%>%
  group_by(RegionName) %>% 
  dplyr::summarize(Avg.Growth.Rate = mean(Growth, na.rm=TRUE))

#Average Growth Rate column is being merged with Zillow Data to create a New Column of Growth Rates
merge.zillow.data<-merge(zillow.data,time.series.data,by="RegionName",all = TRUE)

#Further, New Column for predicted price in June 2019 has been created by squaring the Growth Rate for 2017 and taking its product with the 2017 price
merge.zillow.data1<-merge.zillow.data%>%
  mutate(Avg.Growth.Rate=Avg.Growth.Rate+1)%>%
  mutate(June.2019.predicted.price=(`2017-06`)*(Avg.Growth.Rate)^2)%>%
  select(-c(8:262))

#New Zillow Data is being merged with Airbnb data
final.data<-merge(list.data, merge.zillow.data1, by.x="zipcode", by.y="RegionName",all.x=TRUE)

#Removing redundant columns
final.data<-final.data%>%
  select(-c(State,Metro))

#Calculating Total Missing Values in New Price column
sum(!is.na(final.data$June.2019.predicted.price))
## [1] 8429

Preparing Data for Gradient Boost Algorithm

This will emphasize on the important predictors for Property Price

  1. The objective behind using this Algorithm is to find the variables that would be important/significant in predicting the Property Price based on the give data.
  2. The data has been prepared by selecting relevant columns like Location, Neighborhood, Ratings, etc. The results after running the gradient boost algorithm (using the gbm function) are as follows:
#Selecting Relevant columns that might influence the Price of a Property
#Important columns like Neighbourhood, Overall Scores/Location scores by previous customers, Zipcode of the location, SizeRank or Population Metric of that area
#Cleaning Done by removing missing values

final.data.model<-final.data%>%
  mutate_if(is.character, as.factor)%>%
  select(c(neighbourhood_cleansed,neighbourhood_group_cleansed,city,
           zipcode,SizeRank,review_scores_location,review_scores_value,June.2019.predicted.price))%>%
  filter(!is.na(June.2019.predicted.price))
final.data.model<-unique(final.data.model)

Applying Gradient Boost Algorithm

#This is help us find the importantance of predictors and rank them based on Node Purity
model.boost<- gbm(June.2019.predicted.price~., data = final.data.model, distribution = "gaussian", n.trees = 1000, shrinkage = 0.01, interaction.depth = 8)
summary(model.boost)

##                                                       var      rel.inf
## zipcode                                           zipcode 7.958414e+01
## neighbourhood_cleansed             neighbourhood_cleansed 1.397552e+01
## SizeRank                                         SizeRank 6.183416e+00
## city                                                 city 1.578422e-01
## neighbourhood_group_cleansed neighbourhood_group_cleansed 9.854045e-02
## review_scores_value                   review_scores_value 5.179075e-04
## review_scores_location             review_scores_location 2.187905e-05

We can conclude that Zipcode(i.e Location), Neighbourhood and Population are important for predicting/determining the Price of a Property

MICE Imputation (Algorithm for imputing Missing Values)

The package creates multiple imputations (replacement values) for multivariate missing data.

#Converting Character Fields/Columns to Factors using 'mutate_if' function
na.model<-final.data%>%
  mutate_if(is.character, as.factor)%>%
  select(c(neighbourhood_cleansed,neighbourhood_group_cleansed,city,
           zipcode,SizeRank,review_scores_location,review_scores_value,June.2019.predicted.price))

set.seed(123)
miceMod <- mice(na.model, method='cart',m=1,maxit=1)  # perform mice imputation, based on CART/Decision Tree Algorithm. Number of iterations here is one
## 
##  iter imp variable
##   1   1  city  zipcode  SizeRank  review_scores_location  review_scores_value  June.2019.predicted.price
miceOutput <- complete(miceMod) #New Data is generated with imputed values for missing observations

#Checking for missing values
anyNA(miceOutput)
## [1] FALSE
#Final Append with original Merged Airbnb Data
final.imputed<-cbind(final.data,miceOutput)

#Removing Redundant columns
final.imputed.clean<-final.imputed[-c(89:96)]

Exploratory Data Analysis on Cleaned Data

#Review Analysis
#Mean Reivew Scores for different Neighbourhoods
final.imputed.clean%>%
  select(neighbourhood_group_cleansed,review_scores_value)%>%
  group_by(neighbourhood_group_cleansed)%>%
  summarise(Mean_Score=mean(review_scores_value,na.rm=T))%>%
  arrange(desc(Mean_Score))
## # A tibble: 5 x 2
##   neighbourhood_group_cleansed Mean_Score
##   <chr>                             <dbl>
## 1 Staten Island                      9.51
## 2 Brooklyn                           9.38
## 3 Bronx                              9.38
## 4 Queens                             9.37
## 5 Manhattan                          9.30
#Price Analysis
#Mean Price for different Neighbourhoods
final.imputed.clean%>%
  select(neighbourhood_group_cleansed,June.2019.predicted.price)%>%
  group_by(neighbourhood_group_cleansed)%>%
  summarise(Mean_Price=mean(June.2019.predicted.price,na.rm=T))%>%
  arrange(desc(Mean_Price))
## # A tibble: 5 x 2
##   neighbourhood_group_cleansed Mean_Price
##   <chr>                             <dbl>
## 1 Manhattan                      2281060.
## 2 Brooklyn                       1454485.
## 3 Queens                          419987.
## 4 Bronx                           414516.
## 5 Staten Island                   406374.
#Neighbourhood Analyis
#Count of Airbnb hotels/residences in different Neighbourhoods
final.imputed.clean%>%
  select(neighbourhood_group_cleansed,zipcode)%>%
  group_by(neighbourhood_group_cleansed)%>%
  count(zipcode)%>%
  group_by(neighbourhood_group_cleansed)%>%
  summarise(Count_Airbnb=sum(n))%>%
  arrange(desc(Count_Airbnb))
## # A tibble: 5 x 2
##   neighbourhood_group_cleansed Count_Airbnb
##   <chr>                               <int>
## 1 Manhattan                           18403
## 2 Brooklyn                            15831
## 3 Queens                               3632
## 4 Bronx                                 625
## 5 Staten Island                         224

Exploring the Price columns

#Checking Price Column

class(final.imputed.clean$monthly_price) # It is not of Numeric Type
## [1] "character"
#Converting Price columns to Numeric using apply function 
#Removing dollar symbol too
strip_dollars = function(x) {as.numeric(gsub("\\$", "",x)) }
final.imputed.clean[, 49:53] <- sapply(final.imputed.clean[, 49:53], strip_dollars )

Calculating the Rating and Occupancy Rate based on self defined criteria

#Establishing Criteria for Rating based on Average Review Score Range
#Above 9 will be 'Excellent'
#Between 8 and 9 will be 'Good'
#Below 8 will be 'Needs Improvement'
final.imputed.clean<-final.imputed.clean%>%
  mutate(Avg.Review.Score=rowMeans(.[,69:74],na.rm=T))%>%
  mutate(Rating=ifelse(Avg.Review.Score>=9 & Avg.Review.Score<=10,"Excellent",
                       ifelse(Avg.Review.Score>=8 & Avg.Review.Score<9,"Good","Needs Improvement")))
  final.imputed.clean$Rating[which(is.na(final.imputed.clean$Rating))]<-"Needs Improvement"
  
#Occupancy Rates based on Rating and Location conditions
#Generally good Neighbourhoods and well Rated accomodations have better Occupancy Rates
#Here there are 4 different Rates: 0.75, 0.70, 0.65, 0.55
  final.imputed.clean<-final.imputed.clean%>%
  mutate(OccupancyRate=ifelse(Rating=="Excellent"&
                                (neighbourhood_group_cleansed=="Manhattan"|
                                   neighbourhood_group_cleansed=="Brooklyn"),0.75,
                              ifelse(Rating=="Excellent"&
                                neighbourhood_group_cleansed=="Queens",0.70,ifelse(Rating=="Good"&
                                   (neighbourhood_group_cleansed=="Manhattan"|
                                      neighbourhood_group_cleansed=="Brooklyn"),0.65,0.55))))

#75% Occupancy Rate
sum(final.imputed.clean$OccupancyRate==0.75)
## [1] 22775
#70% Occupancy Rate
sum(final.imputed.clean$OccupancyRate==0.70)
## [1] 2383
#65% Occupancy Rate
sum(final.imputed.clean$OccupancyRate==0.65)
## [1] 2444
#55% Occupancy Rate
sum(final.imputed.clean$OccupancyRate==0.55)
## [1] 11113
  1. The objective behind using this Algorithm is to impute the missing values of the Price Property
  2. Used the CART/Decision Tree Algorithm as a parameter. Number of iterations here is one
  3. New Data is generated with imputed values for missing observations and number of missing values now is 0.
  4. I have appended it to the new Airbnb dataset using the column bind function. Redundant columns have been removed to obtain a cleaned dataset.
  5. Final cleaning of the Price columns has been done to remove the ‘$’ sign and converted it from numeric type for further Analysis

Calculating Yearly Earnings and Yearly Rate of Returns

#Monthly Rate of Return based on certain factors

#If there is a Monthly Rate given, we consider that and calculate total for the year by multiplying with 12

#If there is no Monthly Rate, we take the daily Rate and multiply by 365 to get total for year

#Product of Total and Occupancy Rate is taken to Calculate Yearly Earnings

#Yearly Return Rate is the Yearly Returns divided by the total invested amount for the property

final.imputed.clean<-final.imputed.clean%>%
  mutate(Yearly_Earnings=ifelse(is.na(monthly_price),365*price*OccupancyRate,
                                12*monthly_price*OccupancyRate))%>%
  mutate(Yearly_Return_Rate=Yearly_Earnings/June.2019.predicted.price)
  1. Staten Island has got the Best Ratings and Manhattan has faired poorly compared to the other 4 counterparts. One probable reason can be that the population of Staten Island is pretty low i.e. 0.4 million when compared to the densely populated Manhattan with 1.67 million. People who book Airbnb rooms/hotels are often on a vacation mood and do not prefer highly populated metro cities.
  2. The Mean Property Price in Manhattan and Brooklyn exceeds the others by a significant number. This is quite predictable as these are busy cities and there is heavy competition among real estate agents. Also the presence of corporate offices and buildings in that area boosts the property prices
  3. From the two conclusions above and the third table, we can also infer that the number of properties in Brooklyn and Manhattan is pretty high compared to the others. Staten Island and Bronx have less than 1000 properties

Important Data for Visualization

#Considering Important Numerical and Categorical Data for visualizing.
#Eg: Return Rate, Rating Score, Total Annual Returns, Zipcode, etc

viz.data<-final.imputed.clean%>%
  select(c(1,18,20,30,31,37:41,42:44,60:63,78:81,87:94))
head(viz.data)
##   zipcode host_response_rate host_is_superhost
## 1   07310               100%             FALSE
## 2   10000                78%             FALSE
## 3   10001               100%             FALSE
## 4   10001                60%             FALSE
## 5   10001               100%              TRUE
## 6   10001                N/A             FALSE
##   neighbourhood_group_cleansed        city latitude longitude
## 1                    Manhattan Jersey City 40.72670 -74.01076
## 2                    Manhattan    New York 40.75980 -73.99374
## 3                    Manhattan    New York 40.74842 -74.00235
## 4                    Manhattan    New York 40.75133 -73.99965
## 5                    Manhattan    New York 40.74402 -73.99272
## 6                    Manhattan    New York 40.74799 -73.99993
##   is_location_exact property_type       room_type accommodates bathrooms
## 1              TRUE     Apartment    Private room            2         1
## 2             FALSE     Apartment    Private room            2         1
## 3              TRUE     Apartment Entire home/apt            1         1
## 4              TRUE     Apartment Entire home/apt            2         1
## 5              TRUE     Apartment Entire home/apt            4         1
## 6              TRUE     Apartment Entire home/apt            4         2
##   bedrooms availability_30 availability_60 availability_90
## 1        2               0               0               2
## 2        1              20              47              77
## 3        1               0               0               0
## 4        1              29              59              89
## 5        0               4              10              29
## 6        2               0               0               0
##   availability_365 instant_bookable cancellation_policy
## 1                2            FALSE            flexible
## 2              346             TRUE            moderate
## 3                0            FALSE            flexible
## 4              179            FALSE            moderate
## 5              101            FALSE              strict
## 6                0            FALSE            moderate
##   require_guest_profile_picture require_guest_phone_verification SizeRank
## 1                         FALSE                            FALSE       NA
## 2                         FALSE                            FALSE       NA
## 3                         FALSE                            FALSE       NA
## 4                         FALSE                            FALSE       NA
## 5                         FALSE                            FALSE       NA
## 6                         FALSE                            FALSE       NA
##   Avg.Growth.Rate June.2019.predicted.price Avg.Review.Score
## 1              NA                   2278585              NaN
## 2              NA                   2278585              NaN
## 3              NA                   2278585              NaN
## 4              NA                   2278585        10.000000
## 5              NA                   2278585         9.833333
## 6              NA                   2278585         9.666667
##              Rating OccupancyRate Yearly_Earnings Yearly_Return_Rate
## 1 Needs Improvement          0.55        13450.25        0.005902897
## 2 Needs Improvement          0.55        40150.00        0.017620588
## 3 Needs Improvement          0.55        60225.00        0.026430882
## 4         Excellent          0.75        75281.25        0.033038602
## 5         Excellent          0.75        95538.75        0.041928989
## 6         Excellent          0.75       123187.50        0.054063167

Plot 1.1 - Rating Based on Neighbourhood

viz.data.rating<-viz.data%>%
  filter(bedrooms==2)%>%
  select(Rating,Yearly_Return_Rate,neighbourhood_group_cleansed)%>%
  group_by(neighbourhood_group_cleansed,Rating)%>%
  summarise(Mean_Return=mean(Yearly_Return_Rate,na.rm=T))

ggplot(viz.data.rating,aes(x=Rating,y=Mean_Return*100,fill=as.factor(Rating)))+geom_bar(stat="identity")+
  facet_wrap(~neighbourhood_group_cleansed)+coord_flip()+
  labs(title="Effect of Ratings on Return", 
       x="Rating",
       y="Annual Return Rate(%)",
       fill="Rating")

Analysis/Conclusions:

  1. Its clear from the graph that all the neighborhoods (except Bronx and Staten Island) have higher return rates when the Customer Ratings are Excellent
  2. Overall Manhattan has the highest Return Rates for accommodations with Excellent and Good Customer Ratings. It would be worth investing in locations in and around Manhattan as services and maintenance would be of high quality here
  3. Bronx area should be avoided due to poor Return Rates

Plot 1.2 - Yearly Earnings Based on Neighbourhood

viz.data%>%
  filter(bedrooms==2)%>%
  select(neighbourhood_group_cleansed,Yearly_Return_Rate)%>%
  group_by(neighbourhood_group_cleansed)%>%
  summarise(Mean=mean(Yearly_Return_Rate,na.rm=T))%>%
  arrange(desc(Mean))
## # A tibble: 5 x 2
##   neighbourhood_group_cleansed   Mean
##   <chr>                         <dbl>
## 1 Queens                       0.0860
## 2 Bronx                        0.0591
## 3 Staten Island                0.0568
## 4 Brooklyn                     0.0319
## 5 Manhattan                    0.0304
#BoxPlot
ggplot(viz.data, aes(x=neighbourhood_group_cleansed, y=Yearly_Earnings/1000, fill=neighbourhood_group_cleansed)) +
  geom_boxplot()+scale_y_continuous(limits = c(0, 150))+
  labs(title="Box Plot for Yearly Earnings", 
       x="Neighbourhood",
       y="Annual Earnings",
       fill="Neighbourhood")

Analysis/Conclusions:

  1. Manhattan has the highest Annual Earnings from followed by Brooklyn including a very high Median Price
  2. This can be attributed to a number of factors:
    • High Population Density
    • Presence of Large Corporates
    • Customers giving higher ratings(read: Excellent and Good) and thereby creating a cycle for more customers to come and stay
    • Bronx and Staten Islands have lower Annual Earnings but the we can cannot eliminate the possibility of better Return Rates as the investment in these properties will also be less. Further Analysis will confirm a better course of action

Table - Mean Response Score Based on Neighbourhood and Zipcode

strip_perc = function(x) {as.numeric(sub("%","",x))/100}
viz.data[,2] <- sapply(viz.data[, 2], strip_perc )
viz.data%>%
  filter(bedrooms==2)%>%
  select(zipcode,host_response_rate,neighbourhood_group_cleansed)%>%
  group_by(neighbourhood_group_cleansed,zipcode)%>%
  summarise(Mean_Response_Score=mean(host_response_rate,na.rm = T))%>%
  arrange(desc(Mean_Response_Score))%>%
  top_n(5)
## # A tibble: 41 x 3
## # Groups:   neighbourhood_group_cleansed [5]
##    neighbourhood_group_cleansed zipcode Mean_Response_Score
##    <chr>                        <chr>                 <dbl>
##  1 Bronx                        10451                     1
##  2 Bronx                        10462                     1
##  3 Bronx                        10465                     1
##  4 Bronx                        10466                     1
##  5 Bronx                        10469                     1
##  6 Bronx                        10475                     1
##  7 Brooklyn                     11204                     1
##  8 Brooklyn                     11214                     1
##  9 Brooklyn                     11223                     1
## 10 Manhattan                    07310                     1
## # ... with 31 more rows

Plot 2 - Annual Return Rate Based on Neighbourhood(Density and Boxplots)

#Plot 2.1-Density Plot

theme_set(theme_classic())
g <- ggplot(data=viz.data, aes(Yearly_Return_Rate*100))
g.g<-g + geom_density(aes(fill=factor(neighbourhood_group_cleansed)), alpha=0.8) + 
  labs(title="Density Plot", 
       subtitle="Annual Return Rate Grouped by Neighbourhood",
       caption="Source: Airbnb Data",
       x="Annual Return Rate(%)",
       y="Density",
       fill="Neighbourhood Area")+scale_x_continuous(limits = c(0, 10))
p <- ggplotly(g.g)
p
#Plot 2.2-Boxplot
ggplot(viz.data, aes(x=neighbourhood_group_cleansed, y=Yearly_Return_Rate*100, fill=neighbourhood_group_cleansed)) +
  geom_boxplot()+scale_y_continuous(limits = c(0, 20))+
  labs(title="Box Plot", 
       subtitle="Annual Return Rate Grouped by Neighbourhood",
       caption="Source: Airbnb Data",
       x="Neighbourhood",
       y="Annual Return Rate(%)",
       fill="Neighbourhood Area")

#### Analysis/Conclusions:

  1. The above plots show that Queens has better Annual Return Rates compared to others. Bronx and Brooklyn have poor Annual Return Rates and it wouldn’t be advisable to invest here
  2. Also, note that there is high variability in the data for Staten Island and Queens, but its not the same for the others where the data is closer to the central values. It is therefore advisable to avoid Staten Island as the number of observations are less for it.

Plot 3 - Annual Earnings Based on Room Type(Histogram plots)

viz.data%>%
  filter(bedrooms==2)%>%
  group_by(room_type)%>%
  tally()
## # A tibble: 2 x 2
##   room_type           n
##   <chr>           <int>
## 1 Entire home/apt  4593
## 2 Private room      301
viz.data%>%
  filter(bedrooms==2)%>%
  plot_ly(x = ~Yearly_Earnings,color = ~room_type) %>%
  add_histogram()%>%
  layout(title="Room Type Analysis", 
        scene = list(
          xaxis = list(title = "Yearly Earnings"),
          yaxis = list(title = "Count")))

Plot 4 - Annual Earnings Based on Cancellation Policy(Area plots)

viz.data%>%
  filter(bedrooms==2)%>%
  ggplot(aes(x=Yearly_Earnings/1000))+geom_area(stat ="bin", color="darkblue",
                                           fill="lightblue")+facet_wrap(~cancellation_policy)+
  ggtitle("Cancellation Policy Analysis")+xlab("Yearly Earnings in $1000s") +ylab("Count") 

  1. The total number of Private Rooms(301) is very less compared to Entire Home/Apartments(4593). These Homes/Apartments contribute heavily toward the Annual Earnings.
  2. The accommodations with Strict Cancellation policies contribute heavily towards the Annual Earnings. They do not offer refund easily, so if there are cancellations, the earnings get accumulated

Plot 5 - Top Zip codes Based on Return Rates(Area plots)

viz.data.1<-viz.data%>%
  filter(bedrooms==2)%>%
  filter(!is.na(zipcode))%>%
  select(zipcode,neighbourhood_group_cleansed,Yearly_Return_Rate)%>%
  group_by(zipcode,neighbourhood_group_cleansed)%>%
  summarise(Mean_Return_Rate=mean(Yearly_Return_Rate,na.rm=T))%>%
  arrange(Mean_Return_Rate)

viz.data.1<-viz.data.1%>%
  select(zipcode,neighbourhood_group_cleansed,Mean_Return_Rate)%>%
  tail(25)

ggplot(data=viz.data.1,aes(x=reorder(zipcode,-Mean_Return_Rate),y=Mean_Return_Rate*100,fill=neighbourhood_group_cleansed))+
  geom_bar(stat="identity")+ 
  #scale_fill_manual(values=c("red", "yellow","dark green","blue","grey"))+
  coord_flip()+ggtitle("Top 25 Zipcodes for Best Return")+xlab("Zipcode") +ylab("Average Return Rate(%)") + labs(fill = "Neighbourhood Area")

Final Conclusions

  1. It can be finally concluded that Queens has very high return on investments i.e. between 10% and 20% for many locations(zip codes). It has the highest for 11109 with 19% return rate
  2. Queens also has 8 locations in the top 25. All these observations have more than 10% Return Rates
  3. Staten Island has one location in the top 25 with a very high return rate i.e. close to 11% but further analysis needs to be done on this zip code like the locality, background check of the hosts, property prices of surrounding areas, etc.
  4. The top 5 Zip Codes are 10460, 11109, 11378, 11435, 11104. It will definitely be worth investing here

SUMMARY AND FUTURE SCOPE OF THE PROJECT

  1. The above analysis is subject to the investor’s budget constraint. If he/she does not have an appetite for high risk-high return investments, the analysis would be very different
  2. There are very few properties in Staten Island and Bronx. So, choices are limited for the investor
  3. To better the analysis, we would need data from other sources, like expenses data for hotels/accommodations, property data, etc.
  4. Modeling has not been performed on this dataset except to find importance of Variables for Visualizations. But exploring various regression and classification models may yield better results
  5. Text Mining/Analysis can be done on the variables that have description, thereby helping us understand whether people in general have a preference for the locality