Airbnb is one of the world’s largest marketplaces for unique, authentic places to stay and local experiences. It is a platform that links two groups: hosts, who want to rent out their homes, and guests, who are looking for accommodation in that locale. Airbnb offers over 7 million accommodations and 50,000 handcrafted activities across 81,000 cities and 191 countries worldwide, all powered by local hosts.
Airbnb in UK and Manchester
According to the 2018 annual report, over the past year hosts from 2,600 villages, towns and cities welcomed 8.4 million guests, with 223,200 active listings on Airbnb across the whole of the UK. The platform provides a flexible source of income, with typical hosts earning an average of £3,100 a year.
Airbnb Manchester has seen a positive trend in bookings since 2017. In 2018, Manchester was identified as one of the top places where Airbnb bookings are made by work travellers. As the home ground of two of the most famous football clubs, Manchester also attracts a lot of visitors looking for fairly priced accommodation such as hostels and single room rentals.
Airbnb’s Popularity
Airbnb is a platform that allows many different types of guest needs to be met. Listings range from offbeat and cosy rural hideaways to stylish city flats, accommodating anyone from groups and families, for whom space and kitchens might be more essential, to people working away from home, who look for accommodation with fast Wi-Fi, a workspace and accessibility. Approximately 89 percent of guests who choose to travel on Airbnb do so because of the amenities on offer, and every listing allows different expectations to be met. The search experience on Airbnb looks like this, with a variety of different features available to select from.
Airbnb Revenue Model and Implications
Airbnb receives commission from both the hosts and the guests on every booking. For each successful transaction, Airbnb charges the guest 6-12% of the booking fee and the host 3%. Given this revenue model, increasing the number of bookings is imperative for the profitability of Airbnb’s business. One means of achieving that is to have reasonable prices for the accommodations and experiences offered by the hosts and to provide efficient recommendations of popular listings to prospective guests. By helping hosts to quote fair prices and to understand market trends and guests’ expectations, Airbnb can attract more customers, thereby increasing revenue for both the hosts and Airbnb.
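The commission arithmetic described above can be illustrated with a small Python sketch. It assumes that the guest fee is added on top of the booking amount and that the host fee is deducted from the host's payout; the function name `booking_fees` and the 10% guest rate in the example are hypothetical and chosen only for illustration.

```python
def booking_fees(booking_amount, guest_fee_rate=0.12, host_fee_rate=0.03):
    """Split one booking into what the guest pays, what the host keeps
    and what Airbnb earns (assumed fee mechanics, see lead-in)."""
    if not 0.06 <= guest_fee_rate <= 0.12:
        raise ValueError("guest fee rate is expected to lie between 6% and 12%")
    guest_fee = booking_amount * guest_fee_rate  # charged to the guest
    host_fee = booking_amount * host_fee_rate    # charged to the host
    return {
        "guest_pays": round(booking_amount + guest_fee, 2),
        "host_receives": round(booking_amount - host_fee, 2),
        "airbnb_revenue": round(guest_fee + host_fee, 2),
    }


# Example: a booking of 100 with a 10% guest fee and the 3% host fee.
fees = booking_fees(100.0, guest_fee_rate=0.10)
print(fees)
```

Because both fees scale with the booking amount and the number of bookings, revenue grows with booking volume, which is why the analysis focuses on popularity and fair pricing.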
Quoting a fair price for a listing is a challenging task for hosts, as a multitude of factors can affect the price. The historical data gathered from Airbnb listings over the years provides a reasonable understanding of guests’ willingness to book based on various features of the hosts and the properties. The market is dynamic, and factors such as location, host hospitality and a history of positive experiences can give hosts a competitive advantage when setting prices at mutually beneficial levels. In addition, text analytics on customer reviews can reveal the attributes associated with positive and negative guest sentiment, which can be used to give hosts recommendations on how to improve their services. Acting on this feedback can help hosts improve their rating scores and attract more bookings.
Given this understanding of the business scenario, the main objective of this analysis is to predict the popularity and a fair price of Airbnb listings based on the features associated with the hosts, properties and reviews. The aim is to develop reliable prediction models using machine learning techniques such as KNN, Naïve Bayes, Decision and Regression Trees and Linear Regression, which can serve as a baseline for understanding the factors that contribute to the popularity and pricing of a listing.
The data we used for analysis is sourced from Inside Airbnb. Inside Airbnb utilizes public information compiled from the Airbnb website and provides key metrics and insights on the Airbnb listings. The files that are available for the analysis are:
A quick inspection of the files reveals that all the features in the ‘listings_summary’ file are also available in the ‘listings_details’ file, so ‘listings_summary’ will not be used further. Also, the location coordinates available in the details file are used for exploratory analysis, hence the ‘neighbourhoods’ csv and geojson files will not be used further. The files ‘listings_details’, ‘reviews_summary’, ‘reviews_details’ and ‘calendar’ are used in the rest of the analysis.
Most of the attributes relating to the property, host and guest ratings required for the analysis are available in the ‘listings_details’ file. The data contains details of 4848 listings from the Greater Manchester region and 105 different variables associated with the listings.
Observations & Variables:
## [1] 4848 105
Out of the 105 features available, after initial manual inspection 49 columns detailed below are retained for further exploration.
• id - listing identifier
• last_scraped – date of the data scraped
• host_since - host experience
• host_location – host locality
• host_is_superhost - categorical t or f - describing highly rated and reliable hosts - https://www.airbnb.co.uk/superhost
• host_identity_verified - categorical t or f - host credibility metric
• host_response_time – categorical measure of how quickly the host responds
• host_response_rate – numerical measure of how quickly the host responds
• host_total_listings_count – total number of listings hosted by the host
• neighbourhood_cleansed – location of the listing
• neighbourhood_group_cleansed – higher level location of the listing
• latitude – coordinates to visualise the data on the map
• longitude - coordinates to visualise the data on the map
• property_type - categorical variable describing the property type
• room_type - categorical variable describing property feature
• accommodates - discrete value describing property feature
• bathrooms - discrete value describing property feature
• bedrooms - discrete value describing property feature
• beds - discrete value describing property feature
• bed_type - categorical value describing property feature
• amenities – list of amenities available
• is_location_exact - categorical t or f - location credibility metric
• price - price per night
• security_deposit - associated with the cost
• cleaning_fee - additional cost at the top of rent
• guests_included - minimum value of accommodates
• extra_people - cost of additional person per night
• minimum_nights – minimum booking duration
• first_review – date of first review
• last_review - date of last review
• weekly_price – price feature
• monthly_price - price feature
• maximum_nights - maximum booking duration
• calendar_updated – frequency of calendar refresh
• has_availability – listing availability
• availability_30 – availability feature
• availability_60 - availability feature
• availability_90 - availability feature
• availability_365 - availability feature
• instant_bookable - categorical value booking feature
• cancellation_policy - ordinal value with 5 categories of flexibility
• review_scores_accuracy – review feature
• review_scores_cleanliness - review feature
• review_scores_value - review feature
• review_scores_rating - weighted sum of review scores
• reviews_per_month - average number of reviews received per month
• number_of_reviews - total number of reviews received
• accm_since - active duration of the accommodation
After analysing the structure and summary of the filtered variables, the following data cleansing steps were implemented:
• The dates and numbers in character format are converted to their respective datatypes
• Missing values are identified and substituted
• Spaces are replaced with ‘_’ in character columns
• ‘amenities’ column is cleansed using text analytics methods and 10 amenities of interest are retained
• Outlier values from columns ‘beds’, ‘bathrooms’ and ‘price’ are removed
• Correlated columns pertaining to availability and reviews are removed
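A few of the cleansing steps above can be sketched in Python as follows. This is a minimal illustration, not the code used for the report: the helper names are hypothetical, the price format (`$1,250.00`) and ISO date format are assumptions about the raw file, and median substitution is only one possible way of filling missing values (the report does not say which strategy was used).

```python
from datetime import date, datetime
from statistics import median


def parse_price(text):
    """Convert a price string such as '$1,250.00' to a float; blank -> None."""
    if text is None or text.strip() == "":
        return None
    return float(text.replace("$", "").replace(",", ""))


def parse_date(text):
    """Convert an ISO date string (assumed format) to a date; blank -> None."""
    if not text:
        return None
    return datetime.strptime(text, "%Y-%m-%d").date()


def fill_missing(values):
    """Replace None with the median of the observed values (assumed strategy)."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)
    fill_value = median(observed)
    return [fill_value if v is None else v for v in values]


def underscore(text):
    """Replace spaces with '_' in a character column value."""
    return text.strip().replace(" ", "_")


cleaning_fees = fill_missing([parse_price(p) for p in ["$10.00", "", "$30.00"]])
print(cleaning_fees, underscore("Private room"), parse_date("2019-11-16"))
```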
Features with high number of missing values
## Variable Count
## 1 monthly_price 4647
## 2 weekly_price 4593
## 3 security_deposit 1747
## 4 cleaning_fee 1263
## 5 review_scores_value 939
## 6 review_scores_accuracy 938
## 7 review_scores_cleanliness 937
## 8 review_scores_rating 936
## 9 first_review 873
## 10 last_review 873
## 11 reviews_per_month 873
Correlation Matrix
Listings per Neighbourhood and Neighbourhood Group
Based on the graph, it is evident that the majority of the listings are in Manchester city, with concentrations in Salford, Trafford, Ancoats & Clayton and the City Centre. This can be expected given the proximity of these areas to the places of interest in Manchester.
Histogram of Calendar Availability
Since this is a view of future availability from 16/11/2019, listings in the near future are more often booked than available, and upticks can also be seen around holiday periods (Christmas and Easter).
Reviews Vs Ratings
Listings with higher ratings have a higher number of reviews. An initial set of positive reviews seems to make a listing more likely to be booked than listings with no, bad or few reviews. It also seems that guests would rather not leave a rating when their review would be bad.
Time Series Spread
The time series plots are an effort to capture the seasonality and trends in the listings data
Host Since Time Series
The first series shows the variation in new hosts joining Airbnb over a period of 10 years.
The time series shows that the number of new hosts joining Airbnb follows an increasing trend. There is an unexplained drop in new host sign-ups in late 2010, but the trend has been maintained since then. There is also evidence of a seasonal component, with spikes that can be attributed to the holiday seasons.
First Reviews Time Series
This time series is a plot of first reviews received over a period of 10 years.
The trend in first reviews received by listings is quite erratic, with a major portion of the time series falling under the random component.
Number of Reviews Time Series
This time series is a plot of the number of reviews received over a period of 10 years.
The time series shows that the total number of reviews on Airbnb follows an increasing trend, with spikes towards the end of every two years, although the seasonal component does not account for this variation.
Histogram of All Non-Categorical Features
Word Cloud of Frequently Used Terms in Guest Reviews
The most frequently used words in the reviews are ‘house’, ‘stay’ and ‘host’, which represent the listings, followed by adjectives such as ‘nice’, ‘love’, ‘comfortable’ and ‘perfect’. Even though the words have not been separated by sentiment, it is evident that the majority of the reviews convey positive feedback.
Word Association
Based on the most frequently used words, 5 key words that represent the listings are considered for word association analysis.
Words most associated with the focus terms (“stay”, “host”, “location”, “manchester”, “house”) are represented as a network graph. The thickness of the arrows in the network of words shows how frequently they are associated with the focus terms.
Words associated with the token ‘house’ such as ‘cosy’,‘spacious’,‘clean’, are descriptive of the property.
Words associated with the token ‘host’ such as ‘helpful’,‘wonderful’, are descriptive of the host.
Words associated with the token ‘manchester’ such as ‘airport’,‘explore’, are descriptive of the location.
Words associated with the token ‘stay’ such as ‘pleasant’,‘enjoy’,‘short’ are descriptive of the experience.
Analysis of word associations provides quality feedback for each of the listing attributes.
Sentiment Analysis
The review texts are analysed to identify whether each review is positive or negative, and the sentiment is matched with the rating of the listing. The review_scores_rating has been converted into 5 bins to represent star ratings.
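The binning of the rating into star levels can be sketched in Python as below. The report does not state the bin edges, so equal-width bins of 20 points on a 0 to 100 scale are an assumption here, and the function name `rating_to_stars` is hypothetical.

```python
def rating_to_stars(score, n_bins=5, max_score=100):
    """Map a 0-100 rating to a 1-5 star level using equal-width bins
    (assumed edges: [0,20) -> 1, [20,40) -> 2, ..., [80,100] -> 5)."""
    if score is None:
        return None  # listings without reviews have no rating
    if not 0 <= score <= max_score:
        raise ValueError("score outside the expected range")
    width = max_score / n_bins
    # Integer division picks the bin; the top edge is folded into the last bin.
    return min(int(score // width) + 1, n_bins)


print([rating_to_stars(s) for s in (15, 65, 80, 100)])
```

With these assumed edges, listings rated 80 and above fall into the top bin.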
It can be seen from the plots that, for listings with a rating of 80 and above, positive words are associated with the host and the listing features. More negative words are associated with listings rated below 70. The associated negative words, such as ‘dirty’ and ‘cancelled’, can be used to give hosts feedback on how they can improve the guest experience.
Sentiments Radar
The sentiments of ‘trust’ and ‘anticipation’ are most associated with higher-rated listings, while ‘anger’ is associated with lower-rated listings.
Based on the above barcharts we can conclude that:
Popular listings have a greater relative share of superhosts and of hosts with a verified identity. A slightly higher number of popular listings don’t seem to offer instant booking.
Based on the above barcharts we can conclude that:
• Popular listings have the highest representation in Manchester city, Salford and Trafford. These locations also have the highest number of listings in the Greater Manchester region.
• Popular listings are dominant in Apartment and House properties.
• Popular listings are mostly Entire home/Apt or Private Room, with Entire Home being more popular.
• Almost all records in both groups have the same bed type, making this feature redundant.
• Popular listings are less likely to have a flexible policy and more likely to have a moderate to strict (14 days) policy.
• Hosts of popular listings respond to queries within an hour.
Based on the above boxplots we can conclude that:
• The distributions of the number of bathrooms, bedrooms, beds and minimum nights are nearly identical in both groups, with a few outliers.
• Overall price is lower for popular listings. Price might be an influencer of the status.
• The distribution of price per extra person is similar in both groups, but with a higher median in the popular listings.
• Popular listings all have high review scores, with even the outliers scoring above 70.
• Popular hosts do not own more than 1-2 listings.
• Popular listings have a higher number of popular amenities listed.
Density Map by Popularity
The listings in the City Centre, Trafford and Salford are more popular than those in neighbouring areas.
Histogram of Listings’ Prices
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 30.00 53.00 79.16 89.00 5539.00
Most listings are priced around the $50 mark (median 53, interquartile range 30 to 89), as expected since Airbnb offers fairly priced accommodation. A sparse number of listings are priced above $200, which might be the offbeat, expensive locations available.
Boxplot of Listing Prices across Neighbourhood Groups
The prices are quite evenly distributed among the different neighbourhood groups with Salford having a slightly higher median of price among all the neighbourhoods.
Boxplot of Listing Prices across Property type
The prices are not quite evenly distributed among the different property types, which is expected. The outliers of offbeat listings such as cave, villa, townhouse can be seen clearly.
Density Map by Price
Locations near popular areas such as the City Centre, Salford and Trafford are priced higher than their neighbours.
The data exploration gave a reasonable understanding of how the listing features are associated and the importance they might hold regarding the accommodation’s popularity and price.
There are a lot of features available in the dataset to choose from in order to build the prediction model. Using feature selection methods, the variables with a significant impact on the response variables ‘is_popular’ and ‘price’ can be determined. The feature selection method used here is Boruta. The list of significant features suggested by the Boruta method is extracted for further processing.
Feature Selection Using Boruta Model
## meanImp decision
## reviews_per_month 70.57177 Confirmed
## review_scores_rating 42.68227 Confirmed
## host_is_superhost 30.32340 Confirmed
## host_identity_verified 29.65874 Confirmed
## price 21.99897 Confirmed
## calculated_host_listings_count 19.10859 Confirmed
## host_response_rate 15.27068 Confirmed
## host_response_time 14.99621 Confirmed
## availability_30 13.92047 Confirmed
## amenities 13.85121 Confirmed
Based on the Boruta model’s suggestions, the features below are considered for the prediction models:
Predictor Features:
host_response_rate(continuous) numerical measure of host response rate
host_response_time(categorical – within a few hours/within a day/within few days) how quickly the host responds
host_is_superhost (categorical -YES/NO) whether the host is an “Airbnb Superhost”
host identity verified (categorical - YES/NO) whether Airbnb has verified the identity of the host
neighbourhood (categorical - 25 levels) the neighbourhood that the property is in
property type (categorical - 17 levels) the type of property
room type (categorical - Entire home/Private room/Shared room) the type of room
accommodates (continuous) the number of people the property can hold
bathrooms (continuous) the number of bathrooms the property has
bedrooms (continuous) the number of bedrooms the property has
beds (continuous) the number of beds the property has
bed type (categorical - Airbed/Couch/Futon/Pull-out Sofa/Real Bed) the type of bed
guests included (continuous) - the number of guests allowed
minimum nights (continuous) - the minimum number of nights for a reservation
total_amenities (continuous) - number of popular amenities available
instant bookable (categorical - YES/NO) whether the property can be reserved through the Airbnb instant booking interface
cancellation policy (categorical - flexible/moderate/strict/super strict 30) the strictness of the cancellation policy
Response Features:
price (continuous) the daily price of the property in dollars
is_popular(categorical- YES/NO) whether the listing is popular
Data Transformations
The below steps were followed to prepare the data for the prediction models,
All the categorical values are factorised and converted to integers. The list of amenities is split into columns
From this point, two sets of data are maintained to evaluate the models, one set is retained as factors and dummy variables are created in another set.
10-fold cross-validation is used to test all the models in this analysis.
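The idea behind 10-fold cross-validation can be sketched in Python as follows: the rows are shuffled once, split into 10 folds, and each fold serves as the test set exactly once while the other nine are used for training. The function name `k_fold_indices` and the seed are hypothetical, and the row count of 4848 is borrowed from the raw listings file purely for illustration (the cleansed dataset has fewer rows).

```python
import random


def k_fold_indices(n_rows, k=10, seed=42):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


for fold_no, (train_idx, test_idx) in enumerate(k_fold_indices(4848), start=1):
    # A model would be fitted on train_idx and evaluated on test_idx here.
    print(fold_no, len(train_idx), len(test_idx))
```

Each test fold holds roughly a tenth of the rows, which is consistent with the test-set sizes of about 484 observations in the confusion matrices reported for the classifiers.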
The ‘is_popular’ feature of the datasets is predicted using the following methods: KNN, Naïve Bayes, Decision Tree and Regression Tree. The evaluation metrics of the different models are captured below.
The entire datasets were normalized using min-max normalization with dataset-specific minimum and maximum values. The evaluation metrics of the KNN prediction model are given below. The optimum k value for nearest neighbour was found to be 5.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 310 64
## 1 43 67
##
## Accuracy : 0.7789
## 95% CI : (0.7393, 0.8151)
## No Information Rate : 0.7293
## P-Value [Acc > NIR] : 0.007224
##
## Kappa : 0.4103
##
## Mcnemar's Test P-Value : 0.053178
##
## Sensitivity : 0.8782
## Specificity : 0.5115
## Pos Pred Value : 0.8289
## Neg Pred Value : 0.6091
## Prevalence : 0.7293
## Detection Rate : 0.6405
## Detection Prevalence : 0.7727
## Balanced Accuracy : 0.6948
##
## 'Positive' Class : 0
##
## values
## Sensitivity 0.8839479
## Specificity 0.4672989
## Precision 0.8183921
## F-Score 0.8498375
## Accuracy 0.7718277
## AUC 0.6948187
The model gives a decent prediction with an AUC of about 0.7, but the Specificity score shows that it fails to identify True Negatives. With class 0 as the ‘positive’ class, only 67 of the 131 class-1 listings in the confusion matrix are classified correctly; the rest are assigned to class 0.
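The headline metrics can be recomputed directly from the KNN confusion matrix above. The Python sketch below uses the four cell counts from that matrix with class 0 as the positive class; the function name `binary_metrics` is hypothetical. It also shows that the balanced accuracy, the mean of sensitivity and specificity, reproduces the value reported as AUC, which is what one would expect when the AUC is computed from hard class predictions rather than from predicted probabilities.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute standard metrics from the four cells of a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)  # true positive rate (recall)
    specificity = tn / (tn + fp)  # true negative rate
    precision = tp / (tp + fp)    # positive predictive value
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_score": 2 * precision * sensitivity / (precision + sensitivity),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# KNN matrix, positive class 0: rows are predictions, columns are references.
knn = binary_metrics(tp=310, fp=64, fn=43, tn=67)
for name, value in knn.items():
    print(f"{name}: {value:.4f}")
```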
The evaluation metrics of Naïve Bayes prediction model are as below. A Laplace correction factor of 1 is used since many of the variables possess value 0.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 221 17
## 2 132 114
##
## Accuracy : 0.6921
## 95% CI : (0.6489, 0.733)
## No Information Rate : 0.7293
## P-Value [Acc > NIR] : 0.9695
##
## Kappa : 0.3889
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.6261
## Specificity : 0.8702
## Pos Pred Value : 0.9286
## Neg Pred Value : 0.4634
## Prevalence : 0.7293
## Detection Rate : 0.4566
## Detection Prevalence : 0.4917
## Balanced Accuracy : 0.7481
##
## 'Positive' Class : 1
##
## values
## Sensitivity 0.6320353
## Specificity 0.8485614
## Precision 0.9190856
## F-Score 0.7486817
## Accuracy 0.6903111
## AUC 0.7481457
The model achieves a higher AUC than the KNN model (about 0.75), but the Sensitivity score shows that it fails to identify True Positives. With class 1 as the ‘positive’ class, 132 of the 353 class-1 listings in the confusion matrix are assigned to class 2. Its accuracy (0.69) is also below the no-information rate (0.73).
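The effect of the Laplace correction used in the Naïve Bayes model can be illustrated with a short Python sketch. Without the correction, a feature level that never occurs in a class gets a conditional probability of zero, which wipes out the whole product of probabilities for that class. The function name `laplace_probability` and the counts in the example are hypothetical.

```python
def laplace_probability(count, class_total, n_levels, laplace=1):
    """Estimate P(feature level | class) with additive (Laplace) smoothing.

    count       -- rows of the class that have this feature level
    class_total -- total rows of the class
    n_levels    -- number of distinct levels of the feature
    """
    return (count + laplace) / (class_total + laplace * n_levels)


# Hypothetical example: a level never seen among 131 rows of one class.
print(laplace_probability(0, 131, n_levels=2, laplace=0))  # unsmoothed estimate
print(laplace_probability(0, 131, n_levels=2, laplace=1))  # smoothed estimate
```

The smoothed estimate is small but non-zero, so a single unseen level no longer rules a class out on its own.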
The evaluation metrics of Decision Tree prediction model are as below. A boost of 5 is set for optimal results.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 313 35
## 2 40 96
##
## Accuracy : 0.845
## 95% CI : (0.8097, 0.8761)
## No Information Rate : 0.7293
## P-Value [Acc > NIR] : 9.979e-10
##
## Kappa : 0.6122
##
## Mcnemar's Test P-Value : 0.6442
##
## Sensitivity : 0.8867
## Specificity : 0.7328
## Pos Pred Value : 0.8994
## Neg Pred Value : 0.7059
## Prevalence : 0.7293
## Detection Rate : 0.6467
## Detection Prevalence : 0.7190
## Balanced Accuracy : 0.8098
##
## 'Positive' Class : 1
##
## values
## Sensitivity 0.8927322
## Specificity 0.7055901
## Precision 0.8922084
## F-Score 0.8922377
## Accuracy 0.8423691
## AUC 0.8097550
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 484
##
##
## | predicted default class
## actual default class | 1 | 2 | Row Total |
## ---------------------|-----------|-----------|-----------|
## 1 | 313 | 40 | 353 |
## | 0.647 | 0.083 | |
## ---------------------|-----------|-----------|-----------|
## 2 | 35 | 96 | 131 |
## | 0.072 | 0.198 | |
## ---------------------|-----------|-----------|-----------|
## Column Total | 348 | 136 | 484 |
## ---------------------|-----------|-----------|-----------|
##
##
The evaluation metrics show that the decision tree achieves a high AUC of 0.8 in predicting popularity.
A random forest model is used here as a means of validating the decision tree model’s prediction.
The RF model achieves a similar level of accuracy to the decision tree model.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 338 43
## 2 22 80
##
## Accuracy : 0.8654
## 95% CI : (0.8317, 0.8946)
## No Information Rate : 0.7453
## P-Value [Acc > NIR] : 7.419e-11
##
## Kappa : 0.6244
##
## Mcnemar's Test P-Value : 0.01311
##
## Sensitivity : 0.9389
## Specificity : 0.6504
## Pos Pred Value : 0.8871
## Neg Pred Value : 0.7843
## Prevalence : 0.7453
## Detection Rate : 0.6998
## Detection Prevalence : 0.7888
## Balanced Accuracy : 0.7946
##
## 'Positive' Class : 1
##
## values
## Sensitivity 0.9320019
## Specificity 0.7267435
## Precision 0.9028018
## F-Score 0.9170704
## Accuracy 0.8769130
## AUC 0.7946477
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 483
##
##
## | predicted default class
## actual default class | 1 | 2 | Row Total |
## ---------------------|-----------|-----------|-----------|
## 1 | 338 | 22 | 360 |
## | 0.700 | 0.046 | |
## ---------------------|-----------|-----------|-----------|
## 2 | 43 | 80 | 123 |
## | 0.089 | 0.166 | |
## ---------------------|-----------|-----------|-----------|
## Column Total | 381 | 102 | 483 |
## ---------------------|-----------|-----------|-----------|
##
##
Linear Regression is employed to train the model and predict the listing price. The model is initially built with all the variables as predictors, after which stepwise regression is used to repeatedly remove the least significant predictor variable until the lowest possible AIC value is achieved.
Each of the 2 best fit models as suggested by the stepwise regression function is then applied on the testing set and the performance evaluation metrics are captured as below.
Not Dummied
Step: AIC=47263.74
Bestmodel_nd <- lm(price ~ host_response_time + host_identity_verified + neighbourhood_cleansed + neighbourhood_group_cleansed + property_type + room_type + accommodates + bedrooms + beds + bathrooms + minimum_nights + availability_30 + review_scores_rating + instant_bookable + calculated_host_listings_count + amenities_kitchen + amenities_laptopfriendlyworkspace + amenities_microwave + amenities_tv + amenities_wifi + is_popular, data=data_train)
Dummied
Step: AIC=47195.83
Bestmodel_nd <-lm( price ~ host_response_rate + accommodates + beds + bathrooms + extra_people + minimum_nights + availability_30 + review_scores_rating + amenities_essentials + amenities_laptopfriendlyworkspace + amenities_microwave + amenities_tv + amenities_wifi + host_identity_verified + instant_bookable + is_popular + host_response_time.within.a.few.hours + property_type.Cave + property_type.House + property_type.Serviced.apartment + room_type.Private_room + room_type.Shared_room + neighbourhood_group_cleansed.Manchester + neighbourhood_group_cleansed.Salford + neighbourhood_group_cleansed.Stockport + neighbourhood_group_cleansed.Trafford + neighbourhood_cleansed.Cheetham + neighbourhood_cleansed.City_Centre + neighbourhood_cleansed.Higher_Blackley + neighbourhood_cleansed.Withington, data = data_train)
## ME RMSE MAE MPE MAPE
## Test set -0.016674450 0.4939639 0.3653723 -1.7522627 9.293342
## Test set 0.021739684 0.5613720 0.3832813 -1.1208590 9.501987
## Test set 0.001567441 0.4521596 0.3270017 -1.1257161 8.118781
## Test set 0.033588674 0.5387948 0.3664397 -0.5759281 9.064609
## Test set -0.014053673 0.5023257 0.3644493 -1.8912911 9.214073
## Test set -0.015334325 0.4666438 0.3468781 -1.7763845 8.931906
## Test set -0.019347610 0.4724692 0.3385554 -1.8432977 8.512478
## Test set -0.011502803 0.4790496 0.3483766 -1.7042239 8.799665
## Test set 0.042286962 0.4952563 0.3617424 -0.4393855 8.920990
## Test set -0.020582957 0.4289735 0.3263710 -1.6594497 8.196135
## values
## Corr 0.7610168
## MAE 0.3528468
## R_Sq 0.5829148
## Adj_R_Sq 0.5808913
Since the price distribution in the dataset is heavily right skewed, logarithmic value of price is used in the models to make the distribution more symmetrical and the performance of the model is evaluated.
Of the combinations considered, the dataset with dummied variables yields the better prediction. The correlation is as high as 0.77, with a reasonable R^2 value. Overall, using the logarithmic value of price provided a better price prediction model.
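The log transformation and the error metrics can be sketched in Python as below. The sample prices are the five summary values from the price histogram section; the predictions are hypothetical and deliberately set 0.35 above the true log price, to show that an error on the log scale corresponds to a multiplicative error on the original price scale. The helper names are hypothetical too.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


prices = [8.0, 30.0, 53.0, 89.0, 5539.0]  # min, quartiles and max from the summary
log_actual = [math.log(p) for p in prices]  # compresses the long right tail

# Hypothetical predictions: each one is off by 0.35 on the log scale.
log_predicted = [value + 0.35 for value in log_actual]

print("MAE (log scale):", mae(log_actual, log_predicted))
print("RMSE (log scale):", rmse(log_actual, log_predicted))

# Back-transforming shows the same error as a ratio on the price scale.
ratio = math.exp(log_predicted[0]) / prices[0]
print("price ratio:", ratio)
```

If the reported MAE of about 0.35 is on the log scale, which the magnitude of the errors suggests, it would correspond to predictions that are typically off by a factor of about exp(0.35), roughly 1.42, rather than by a fixed amount.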
The analysis has revealed that property-related attributes significantly influence Airbnb popularity and prices, although the magnitude of these effects is diverse and complex. Being a categorical field, ‘is_popular’ was easier to predict, as it was mainly driven by the review-related features. The price variable is continuous and depends heavily on other listing features such as the number of people accommodated, beds, amenities like kitchen and Wi-Fi, location and review scores, as seen in the linear regression models. It should also be noted that various biases may be present in the data analysed. The model relies heavily on the number of reviews and ratings that a listing has received. Also, the data doesn’t account for seasonal variations in prices, such as vacations and weekends. Hence, the pricing model can only serve to predict a baseline daily average price without accounting for trends in market demand throughout the year. The data is also too limited to support more accurate predictions. Using more sophisticated text mining and NLP methods, free-text columns such as space, summary and reviews can be analysed to make more efficient data-driven evaluations.