1 Introduction

Airbnb is a peer-to-peer online marketplace and homestay network that enables people to list or rent short-term lodging in residential properties, with the cost of such accommodation set by the property owner. The company receives percentage service fees from both guests and hosts in conjunction with every booking. It has over 2,000,000 listings in 34,000 cities and 191 countries.

Since 2008, guests and hosts have used Airbnb to expand their travel possibilities and to offer a more unique, personalized way of experiencing the world. This dataset describes the listing activity and metrics in NYC, NY for 2019.

This data file includes the information needed to learn more about hosts and geographical availability, along with the metrics necessary to make predictions and draw conclusions.

Column explanations:

  • id: listing ID
  • name: name of the listing
  • host_id: host ID
  • host_name: name of the host
  • neighbourhood_group: location
  • neighbourhood: area
  • latitude: latitude coordinates
  • longitude: longitude coordinates
  • room_type: listing space type
  • price: price in dollars
  • minimum_nights: minimum number of nights
  • number_of_reviews: number of reviews
  • last_review: date of the latest review
  • reviews_per_month: number of reviews per month
  • calculated_host_listings_count: number of listings per host
  • availability_365: number of days when listing is available for booking

1.2 Data Preprocessing

## 'data.frame':    48895 obs. of  16 variables:
##  $ id                            : int  2539 2595 3647 3831 5022 5099 5121 5178 5203 5238 ...
##  $ name                          : Factor w/ 47906 levels "","'Fan'tastic",..: 12661 38172 45171 15702 19366 25001 8337 25048 15597 17682 ...
##  $ host_id                       : int  2787 2845 4632 4869 7192 7322 7356 8967 7490 7549 ...
##  $ host_name                     : Factor w/ 11453 levels "","'Cil","-TheQueensCornerLot",..: 5051 4846 2962 6264 5982 1970 3601 9699 6935 1264 ...
##  $ neighbourhood_group           : Factor w/ 5 levels "Bronx","Brooklyn",..: 2 3 3 2 3 3 2 3 3 3 ...
##  $ neighbourhood                 : Factor w/ 221 levels "Allerton","Arden Heights",..: 109 128 95 42 62 138 14 96 203 36 ...
##  $ latitude                      : num  40.6 40.8 40.8 40.7 40.8 ...
##  $ longitude                     : num  -74 -74 -73.9 -74 -73.9 ...
##  $ room_type                     : Factor w/ 3 levels "Entire home/apt",..: 2 1 2 1 1 1 2 2 2 1 ...
##  $ price                         : int  149 225 150 89 80 200 60 79 79 150 ...
##  $ minimum_nights                : int  1 1 3 1 10 3 45 2 2 1 ...
##  $ number_of_reviews             : int  9 45 0 270 9 74 49 430 118 160 ...
##  $ last_review                   : Factor w/ 1765 levels "","2011-03-28",..: 1503 1717 1 1762 1534 1749 1124 1751 1048 1736 ...
##  $ reviews_per_month             : num  0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
##  $ calculated_host_listings_count: int  6 2 1 1 1 1 1 1 1 4 ...
##  $ availability_365              : int  365 355 365 194 0 129 0 220 0 188 ...
##                             id                           name 
##                              0                              0 
##                        host_id                      host_name 
##                              0                              0 
##            neighbourhood_group                  neighbourhood 
##                              0                              0 
##                      room_type                          price 
##                              0                              0 
##                 minimum_nights              number_of_reviews 
##                              0                              0 
## calculated_host_listings_count               availability_365 
##                              0                              0
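The output above looks like a structure summary followed by per-column missing-value counts. A minimal sketch of how such checks might be produced in base R is shown here. The data frame `listings` and its values are a hypothetical stand-in, not the real data, and dropping incomplete columns is only one possible way to handle the `NA` values visible in `reviews_per_month`.

```r
# Toy stand-in for the listings table (illustrative values only)
listings <- data.frame(
  id = c(1L, 2L, 3L),
  price = c(100L, 250L, 80L),
  reviews_per_month = c(0.21, NA, 4.64)
)

# Column types and a preview of the values
str(listings)

# Number of missing values in each column
na_counts <- colSums(is.na(listings))
print(na_counts)

# One possible choice: keep only the columns with no missing values
complete_cols <- listings[, na_counts == 0, drop = FALSE]
print(names(complete_cols))
```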

2 Unsupervised Learning

2.1 Dimensionality Reduction with PCA

From this summary of the PCA, we can reduce the dimensionality of our data from 5 to 4 PCs and still retain 86.83% of the information (variance). We do not want to choose fewer than 4 PCs, because 3 PCs retain only 69.40%.

## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5
## Standard deviation     1.1695 1.0646 0.9845 0.9334 0.8115
## Proportion of Variance 0.2735 0.2267 0.1938 0.1742 0.1317
## Cumulative Proportion  0.2735 0.5002 0.6940 0.8683 1.0000
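The proportions in this summary follow from the standard deviations: each PC's share is its variance (the squared standard deviation) divided by the total variance. A minimal sketch of that arithmetic, using the standard deviations printed above, is given here. The 85% cutoff is an assumption chosen for illustration, not a threshold stated in the analysis.

```r
# Standard deviations copied from the PCA summary
sdev <- c(PC1 = 1.1695, PC2 = 1.0646, PC3 = 0.9845, PC4 = 0.9334, PC5 = 0.8115)

# Share of total variance explained by each PC
prop_var <- sdev^2 / sum(sdev^2)

# Running total of explained variance
cum_prop <- cumsum(prop_var)
print(round(cum_prop, 3))

# Smallest number of PCs whose cumulative share reaches the cutoff
threshold <- 0.85  # assumed cutoff, for illustration only
n_pcs <- which(cum_prop >= threshold)[1]
print(n_pcs)
```

With these values, 3 PCs fall short of the cutoff and 4 PCs pass it, which matches the choice made in the text.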

We can also locate major outliers using PCA.
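One way this could be done is to project the observations onto the first few PCs and flag those that sit far from the centre. The sketch below uses synthetic data with one planted extreme row. The cutoff of 4 standardised units and the choice of two PCs are assumptions for illustration, and this is not necessarily the method used in the analysis.

```r
set.seed(42)

# Synthetic data: 200 ordinary rows plus one planted extreme row
x <- matrix(rnorm(200 * 5), ncol = 5)
x <- rbind(x, rep(15, 5))

# PCA on centred and scaled variables
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Standardise the scores on the first two PCs by their standard deviations
z <- sweep(pca$x[, 1:2], 2, pca$sdev[1:2], "/")

# Distance of each observation from the centre of the PC1-PC2 plane
dist_from_centre <- sqrt(rowSums(z^2))

# Flag observations beyond the assumed cutoff
cutoff <- 4
outliers <- which(dist_from_centre > cutoff)
print(outliers)
```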

3 Conclusion

  1. From the PCA summary, we can reduce the dimensionality of our data to 4 PCs and still retain 86.83% of the information (variance).
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5
## Standard deviation     1.1695 1.0646 0.9845 0.9334 0.8115
## Proportion of Variance 0.2735 0.2267 0.1938 0.1742 0.1317
## Cumulative Proportion  0.2735 0.5002 0.6940 0.8683 1.0000
  2. The k value at which the gradient of the elbow curve starts to level off is k = 8.

  3. Proportion of observations in each cluster:
## 
##            1            2            3            4            5 
## 0.0008794355 0.0025769506 0.0980468351 0.2484303098 0.5987728807 
##            6            7            8 
## 0.0243174149 0.0155435116 0.0114326618
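The elbow choice of k and the cluster proportions can be sketched as follows. This example assumes k-means on scaled numeric variables and uses synthetic data, so its numbers will not match the output above. The matrix `x` and the range of k values are hypothetical.

```r
set.seed(123)

# Synthetic stand-in for the scaled numeric variables
x <- scale(matrix(rnorm(300 * 5), ncol = 5))

# Total within-cluster sum of squares for each candidate k
ks <- 1:10
wss <- sapply(ks, function(k) {
  kmeans(x, centers = k, nstart = 10, iter.max = 50)$tot.withinss
})

# Plotting wss against k gives the elbow curve; here we only print the values
print(data.frame(k = ks, wss = round(wss, 1)))

# Fit the chosen k and compute the share of observations in each cluster
km <- kmeans(x, centers = 8, nstart = 10, iter.max = 50)
cluster_props <- prop.table(table(km$cluster))
print(cluster_props)
```

The proportions always sum to 1. A very uneven split, such as the one reported above where one cluster holds about 60% of the listings, may be worth inspecting before the clusters are used.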

4 Insights

4.1 Insights PCA

From our PCA, we find that the first 2 principal components (PCs) together summarise about 50% of the variance in our numerical variables. The biggest contributors to PC1 are availability_365 and calculated_host_listings_count, and the biggest contributor to PC2 is number_of_reviews. The other numerical variables show more multicollinearity with one another, as we can see from the variables factor map.
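Variable contributions to a PC can be read from the loadings: because each loading vector has unit length, the squared loadings of a PC sum to 1 and can be expressed as percentages. A minimal sketch using base R's `prcomp` on the built-in `USArrests` data is shown here. That dataset is only a stand-in, so the variables are not those of the Airbnb data.

```r
# PCA on a built-in dataset, with centring and scaling
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Contribution (in %) of each variable to each PC: squared loading * 100
contrib <- pca$rotation^2 * 100
print(round(contrib[, 1:2], 1))

# Variable with the largest contribution to PC1
top_pc1 <- names(which.max(contrib[, "PC1"]))
print(top_pc1)
```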

4.2 Insights Clustering

When we cluster, we want to group the observations with the closest characteristics into the same cluster, and to keep the clusters as far apart from each other as possible.

This happens because k-means clustering compares the positions of the observations based on their numeric variables, and then groups nearby observations to form a cluster. The members of each cluster should therefore have fairly similar values for their numerical variables.

The insight from this clustering is that when we fail to obtain a targeted Airbnb listing, we can look at the other listings in its cluster and should easily find another one with similar characteristics.
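The lookup described here can be sketched as follows. The example runs k-means on synthetic data, takes the cluster of a target row, and ranks the other members of that cluster by Euclidean distance to the target. The matrix `features`, the target index, the choice of 4 clusters, and the number of alternatives returned are all hypothetical.

```r
set.seed(7)

# Synthetic stand-in for the scaled numeric features of 100 listings
features <- scale(matrix(rnorm(100 * 3), ncol = 3))

# Cluster the listings (k = 4 is arbitrary for this toy example)
km <- kmeans(features, centers = 4, nstart = 10)

# Row index of the listing we could not obtain
target <- 1

# All other listings that share the target's cluster
same_cluster <- which(km$cluster == km$cluster[target])
alternatives <- setdiff(same_cluster, target)

# Rank the alternatives by Euclidean distance to the target
diffs <- sweep(features[alternatives, , drop = FALSE], 2, features[target, ])
d <- sqrt(rowSums(diffs^2))
closest <- alternatives[order(d)][seq_len(min(3, length(d)))]
print(closest)
```

Ranking within the cluster is an extra step beyond the text. It narrows a large cluster down to the few listings most similar to the target.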