1 Introduction

Airbnb is a peer-to-peer online marketplace and homestay network that enables people to list or rent short-term lodging in residential properties, with the cost of such accommodation set by the property owner. The company receives percentage service fees from both guests and hosts in conjunction with every booking. It has over 2,000,000 listings in 34,000 cities and 191 countries.

Since 2008, guests and hosts have used Airbnb to expand travel possibilities and to offer a more unique, personalized way of experiencing the world. This dataset describes the listing activity and metrics in New York City, NY, for 2019.

This data file includes all the information needed to find out more about hosts and geographical availability, along with the metrics necessary to make predictions and draw conclusions.

Column explanations:

id: listing ID
name: name of the listing
host_id: host ID
host_name: name of the host
neighbourhood_group: location
neighbourhood: area
latitude: latitude coordinates
longitude: longitude coordinates
room_type: listing space type
price: price in dollars
minimum_nights: minimum number of nights
number_of_reviews: number of reviews
last_review: latest review
reviews_per_month: number of reviews per month
calculated_host_listings_count: number of listings per host
availability_365: number of days when listing is available for booking

1.1 Reading Data

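library(dplyr)  # provides %>%, select(), mutate() and select_if() used below
df <- read.csv("AB_NYC_2019.csv", stringsAsFactors = TRUE)  # factors, as in the str() output below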

1.2 Data Pre Processing

str(df)

## 'data.frame':    48895 obs. of  16 variables:
##  $ id                            : int  2539 2595 3647 3831 5022 5099 5121 5178 5203 5238 ...
##  $ name                          : Factor w/ 47906 levels "","'Fan'tastic",..: 12661 38172 45171 15702 19366 25001 8337 25048 15597 17682 ...
##  $ host_id                       : int  2787 2845 4632 4869 7192 7322 7356 8967 7490 7549 ...
##  $ host_name                     : Factor w/ 11453 levels "","'Cil","-TheQueensCornerLot",..: 5051 4846 2962 6264 5982 1970 3601 9699 6935 1264 ...
##  $ neighbourhood_group           : Factor w/ 5 levels "Bronx","Brooklyn",..: 2 3 3 2 3 3 2 3 3 3 ...
##  $ neighbourhood                 : Factor w/ 221 levels "Allerton","Arden Heights",..: 109 128 95 42 62 138 14 96 203 36 ...
##  $ latitude                      : num  40.6 40.8 40.8 40.7 40.8 ...
##  $ longitude                     : num  -74 -74 -73.9 -74 -73.9 ...
##  $ room_type                     : Factor w/ 3 levels "Entire home/apt",..: 2 1 2 1 1 1 2 2 2 1 ...
##  $ price                         : int  149 225 150 89 80 200 60 79 79 150 ...
##  $ minimum_nights                : int  1 1 3 1 10 3 45 2 2 1 ...
##  $ number_of_reviews             : int  9 45 0 270 9 74 49 430 118 160 ...
##  $ last_review                   : Factor w/ 1765 levels "","2011-03-28",..: 1503 1717 1 1762 1534 1749 1124 1751 1048 1736 ...
##  $ reviews_per_month             : num  0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
##  $ calculated_host_listings_count: int  6 2 1 1 1 1 1 1 1 4 ...
##  $ availability_365              : int  365 355 365 194 0 129 0 220 0 188 ...

df <- df %>% 
  select(-reviews_per_month,-last_review,-latitude,-longitude) %>% 
  mutate(name = as.character(name),
         id = as.character(id),
         host_id = as.character(host_id),
         host_name = as.character(host_name),
         price = as.numeric(price))

#Find NA
sapply(df, function(x) sum(is.na(x)))

##                             id                           name 
##                              0                              0 
##                        host_id                      host_name 
##                              0                              0 
##            neighbourhood_group                  neighbourhood 
##                              0                              0 
##                      room_type                          price 
##                              0                              0 
##                 minimum_nights              number_of_reviews 
##                              0                              0 
## calculated_host_listings_count               availability_365 
##                              0                              0

# Numeric data frame
df.num <- df %>% 
  select_if(is.numeric)

# Non-numeric data frame (base R Negate() avoids needing purrr::negate())
df.nonnumeric <- df %>% 
  select_if(Negate(is.numeric))

2 Unsupervised Learning

2.1 Dimensionality Reduction with PCA

From the PCA summary below, we can reduce the dimensionality of our data from five numeric variables to 4 PCs and still retain 86.83% of the variance. We do not want fewer than 4 PCs, because 3 PCs retain only 69.4%.

pca <- prcomp(df.num, scale. = TRUE)

summary(pca)

## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5
## Standard deviation     1.1695 1.0646 0.9845 0.9334 0.8115
## Proportion of Variance 0.2735 0.2267 0.1938 0.1742 0.1317
## Cumulative Proportion  0.2735 0.5002 0.6940 0.8683 1.0000
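
The rows of this summary follow directly from the standard deviations: each component's variance is its squared standard deviation, and its proportion is that variance divided by the total. Below is a minimal sketch that re-derives the cumulative proportions from the rounded values printed above. The variable names and the 80% threshold are assumptions chosen for illustration, not part of the original analysis.

```r
# Standard deviations copied from the summary(pca) output above (rounded)
sdev <- c(PC1 = 1.1695, PC2 = 1.0646, PC3 = 0.9845, PC4 = 0.9334, PC5 = 0.8115)

# Variance of each component is the squared standard deviation
variance <- sdev^2

# Share of the total variance carried by each component
prop_var <- variance / sum(variance)

# Running total: how much variance the first k components retain
cum_var <- cumsum(prop_var)

# Smallest number of components retaining at least 80% of the variance
# (the 80% threshold is an illustrative assumption)
n_keep <- unname(which(cum_var >= 0.80)[1])

print(round(cum_var, 4))
print(n_keep)
```

With these inputs, the first component count to pass the threshold is 4, which agrees with the choice made in the text.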

We can also locate major outliers using PCA.

library(FactoMineR)  # provides PCA() and plot.PCA()
# Shows how the initial variables contribute to PC1 and PC2
pca2 <- PCA(df.num, graph = TRUE)

# Highlights the 5 observations with the largest contributions (our main outliers)
plot.PCA(pca2, cex=0.6,choix=("ind"),select = "contrib5")

df_pca <- as.data.frame(pca$x)

2.2 Clustering by K-Means

df_scale <- as.data.frame(scale(df.num))

df_km <- kmeans(df_scale,7)

# Plot clusters
library(factoextra)  # provides fviz_cluster()
fviz_cluster(df_km, 
             data = df_scale) + 
  theme_minimal()

#Metrics
df_km$withinss

## [1]  1785.013  3763.203 39664.822 13932.443 10870.864  5919.191  9739.015

df_km$tot.withinss

## [1] 85674.55

df_km$betweenss

## [1] 158795.4
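
These three metrics are linked: the total sum of squares splits into a within-cluster part and a between-cluster part. Because `df_scale` is standardized, the total should also equal (n - 1) × p. The sketch below checks both identities using only the figures reported above. The row count comes from the `str(df)` output, and the count of five numeric columns is inferred from the five PCs in the PCA summary. It is an illustration, not part of the original analysis.

```r
# Figures copied from the k-means output above
withinss     <- c(1785.013, 3763.203, 39664.822, 13932.443,
                  10870.864, 5919.191, 9739.015)
tot_withinss <- 85674.55
betweenss    <- 158795.4

# Total sum of squares = within-cluster part + between-cluster part
totss <- tot_withinss + betweenss

# For standardized data each column has variance 1, so totss = (n - 1) * p
n <- 48895  # rows reported by str(df)
p <- 5      # numeric columns (inferred from the five PCs)
expected_totss <- (n - 1) * p

# Share of the total variation explained by the clustering
ratio <- betweenss / totss

print(c(totss = totss, expected = expected_totss))
print(round(ratio, 3))
```

A ratio of roughly 65% means the seven clusters account for about two thirds of the variation in the scaled data. A ratio closer to 1 would indicate tighter, better separated clusters.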

2.2.1 Determining K

From the graph, we want the k value at which the curve starts to flatten out (the "elbow"), because adding more clusters beyond that point reduces the within-cluster sum of squares only slightly. In this case it seems to be at k = 8.

# Determine K with the elbow method
wss <- function(data, maxCluster = 20) {
  # For k = 1 the within sum of squares equals the total sum of squares
  SSw <- numeric(maxCluster)
  SSw[1] <- (nrow(data) - 1) * sum(apply(data, 2, var))
  for (i in 2:maxCluster) {
    SSw[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:maxCluster, SSw, type = "o", xlab = "Number of Clusters", ylab = "Within groups sum of squares", pch=19)
}
wss(df_scale)
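
Because k-means starts from random centers, the elbow curve can shift slightly from run to run. The sketch below shows the same idea on synthetic data, with a fixed seed and several random restarts per k. The toy data, the seed, and `nstart = 10` are all assumptions made for illustration and are not taken from the original analysis.

```r
set.seed(42)  # k-means starts from random centers, so fix the seed

# Toy data: three well-separated 2-D blobs (synthetic, not the Airbnb data)
toy <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 5,  sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..max_k;
# nstart = 10 keeps the best of ten random starts for each k
max_k <- 6
wss_toy <- sapply(1:max_k, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# Reduction in WSS gained by each additional cluster
gain <- -diff(wss_toy)

print(round(wss_toy, 1))
print(round(gain, 1))
```

On data like this the gains are large up to k = 3 and small afterwards, so the elbow sits at the true number of blobs. On real data the bend is usually less sharp, and reading it off remains a judgment call.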

#Cluster to 8
df_km <- kmeans(df_scale,8)

#Plot new Cluster
fviz_cluster(df_km, 
             data = df_scale) + 
  theme_minimal()

New data frame with cluster labels

cluster <- df_km$cluster

df_cluster <- cbind(df,cluster)

#Proportions of Cluster Components
prop.table(table(df_cluster$cluster))

## 
##            1            2            3            4            5 
## 0.0008794355 0.0025769506 0.0980468351 0.2484303098 0.5987728807 
##            6            7            8 
## 0.0243174149 0.0155435116 0.0114326618

3 Conclusion

From this summary of PCA, we can reduce the dimensionality of our data to 4 PCs and still retain 86.83% of the variance.

summary(pca)

## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5
## Standard deviation     1.1695 1.0646 0.9845 0.9334 0.8115
## Proportion of Variance 0.2735 0.2267 0.1938 0.1742 0.1317
## Cumulative Proportion  0.2735 0.5002 0.6940 0.8683 1.0000

The elbow plot starts to flatten at k = 8, so we cluster the data into 8 groups.

wss(df_scale)

Proportions of Cluster Components

prop.table(table(df_cluster$cluster))

## 
##            1            2            3            4            5 
## 0.0008794355 0.0025769506 0.0980468351 0.2484303098 0.5987728807 
##            6            7            8 
## 0.0243174149 0.0155435116 0.0114326618

4 Insights

4.1 Insights PCA

Our PCA shows that the first two principal components (PCs) together summarise about 50% of the variance in our numeric variables. The biggest contributors to PC1 are availability_365 and calculated_host_listings_count, and the biggest contributor to PC2 is number_of_reviews. The other numeric variables show more multicollinearity with one another, as we can see from the variables factor map.

plot.PCA(pca2, cex=0.6,choix=("var"),select = "contrib5")
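
The same kind of reading can be done numerically. With `prcomp`, each column of `rotation` is a unit-length vector of loadings, so the squared loadings of a component sum to 1 and can be read as percentage contributions. The sketch below uses the built-in `USArrests` data purely as a stand-in for `df.num`. The object names are illustrative, and the result says nothing about the Airbnb variables.

```r
# Stand-in data: USArrests ships with R; it replaces df.num for illustration only
pca_demo <- prcomp(USArrests, scale. = TRUE)

# Squared loadings, expressed as a percentage contribution per component
contrib <- 100 * pca_demo$rotation^2

# Variable contributing most to the first principal component
top_pc1 <- names(sort(contrib[, "PC1"], decreasing = TRUE))[1]

print(round(contrib[, 1:2], 1))
print(top_pc1)
```

Running the same three lines on `pca$rotation` from the analysis above would give a table to set beside the factor map.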

4.2 Insights Clustering

When we cluster, we want to group the observations with the most similar characteristics into the same cluster, and to keep the clusters as far apart from each other as possible.

This works because k-means compares the positions of the observations in the space of their numeric variables and groups nearby observations into a cluster. The members of any one cluster should therefore have fairly similar values for their numeric variables.
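
To make the "nearby observations" idea concrete, the sketch below fits k-means on synthetic two-dimensional data and then confirms that every observation sits in the cluster whose centroid is closest to it. The toy data, the seed, and the object names are assumptions for illustration only.

```r
set.seed(1)  # fixed seed so the toy example is repeatable

# Toy data: two well-separated 2-D blobs (synthetic, not the Airbnb data)
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2)
)

km <- kmeans(toy, centers = 2, nstart = 5)

# Squared Euclidean distance from every observation to every centroid;
# column j holds the distances to centroid j
dist_sq <- sapply(seq_len(nrow(km$centers)), function(j) {
  rowSums(sweep(toy, 2, km$centers[j, ])^2)
})

# Index of the closest centroid for each observation
nearest <- apply(dist_sq, 1, which.min)

# After convergence, cluster labels should match the nearest centroid
print(all(nearest == km$cluster))
```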

One insight from this clustering is that when a targeted Airbnb listing is unavailable, we can look at the other listings in its cluster and should readily find another with similar characteristics.
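
A lookup of this kind could be sketched as follows. The miniature data frame stands in for `df_cluster`, and its ids, prices, and cluster labels are invented. The helper `similar_listings` is hypothetical and not part of the original analysis.

```r
# Hypothetical miniature of df_cluster; ids, prices and labels are made up
listings <- data.frame(
  id      = c("a1", "a2", "a3", "a4", "a5"),
  price   = c(100, 110, 400, 95, 420),
  cluster = c(1, 1, 2, 1, 2),
  stringsAsFactors = FALSE
)

# Return the other listings that share a cluster with `target_id`
similar_listings <- function(data, target_id) {
  target_cluster <- data$cluster[data$id == target_id]
  data[data$cluster == target_cluster & data$id != target_id, ]
}

alternatives <- similar_listings(listings, "a1")
print(alternatives$id)
```

On the real data, the same filter applied to `df_cluster` would return every listing that k-means grouped with the target. Further filters, such as neighbourhood or room type, would likely be needed to narrow the results.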

fviz_cluster(df_km, 
             data = df_scale) + 
  theme_minimal()

Dimensionality Reduction with PCA and Clustering by K means

Thona Elisa

9/10/2019