I decided to explore the Airbnb dataset for New York City. Using clustering methods, I chose the number of clusters. Clusters were built in two scenarios: using only the location attributes (longitude and latitude), and additionally using the price and the type of the rented accommodation. They make it possible to divide the accommodation offer in New York. Within the resulting clusters I conducted, among other statistics, an analysis of prices across the districts of New York. The main purpose is to examine Airbnb prices within the clusters. We can expect the booking price to depend on the location; for example, in Manhattan the price is expected to be higher than in Queens or Brooklyn.
library(dplyr)

# load the New York Airbnb listings and drop extreme prices
airbnb <- read.csv("airbnb.csv")
airbnb <- airbnb %>%
  filter(price < 1000)
The dataset will be reduced by removing NA values. In order to use the room type in the cluster analysis, it will be recoded into binary (0/1) dummy variables.
# duplicate room_type so it can be recoded into three 0/1 dummies
airbnb$private_room <- airbnb$room_type
airbnb$entire_apartment <- airbnb$room_type
airbnb$shared_room <- airbnb$room_type
# draw a reproducible sample of 1000 listings, kept with the original
# (unscaled) variables for profiling the clusters later
set.seed(200)
airbnb_sample_ <- airbnb[sample(nrow(airbnb), 1000), ]
# recode the dummies: 1 if the listing is of the given type, 0 otherwise
airbnb$private_room[airbnb$private_room == "Private room"] <- 1
airbnb$private_room[airbnb$private_room != 1] <- 0
airbnb$entire_apartment[airbnb$entire_apartment == "Entire home/apt"] <- 1
airbnb$entire_apartment[airbnb$entire_apartment != 1] <- 0
airbnb$shared_room[airbnb$shared_room == "Shared room"] <- 1
airbnb$shared_room[airbnb$shared_room != 1] <- 0
airbnb$private_room <- as.numeric(airbnb$private_room)
airbnb$entire_apartment <- as.numeric(airbnb$entire_apartment)
airbnb$shared_room <- as.numeric(airbnb$shared_room)
# keep only the clustering variables: location, price and the room-type dummies
airbnb2 <- subset(airbnb, select=-c(id,name,host_name,host_id,last_review,calculated_host_listings_count, availability_365, neighbourhood_group,neighbourhood, room_type, reviews_per_month, minimum_nights, number_of_reviews))
airbnb2 <- scale(airbnb2)   # standardize so no variable dominates the distances
# the same seed selects the same 1000 rows as in airbnb_sample_
set.seed(200)
airbnb1_sample <- airbnb2[sample(nrow(airbnb2), 1000), ]
airbnb_clus <- airbnb1_sample
Before clustering it is necessary to check whether the data is clusterable. Since the Hopkins statistic equals 0.97, I can assume that the dataset is highly clusterable.
library(factoextra)

# Hopkins statistic and ordered dissimilarity plot for clustering tendency
get_clust_tendency(airbnb_clus, 2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"))
## $hopkins_stat
## [1] 0.973009
##
## $plot
Knowing that the data can be clustered, the next step is to choose the optimal number of clusters and apply clustering algorithms such as k-means.
I will first determine the optimal number of clusters using the silhouette and WSS statistics, and then apply the partitioning algorithms k-means, PAM and CLARA. Since the chosen sample consists of 1,000 observations, a hierarchical clustering dendrogram would not be readable or transparent, so hierarchical clustering is not shown.
In the first step, the silhouette method was used to choose the optimal number of clusters. The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. [Wikipedia]
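For intuition, here is a minimal sketch (an illustration only, not part of the main pipeline) of how the average silhouette width summarized in the charts below can be computed directly with cluster::silhouette for one k-means partition with k = 2:
# Sketch: average silhouette width for a single k-means partition (k = 2),
# computed directly rather than through fviz_nbclust()
library(cluster)
km_tmp <- kmeans(airbnb_clus, centers = 2, nstart = 25)
sil <- silhouette(km_tmp$cluster, dist(airbnb_clus))
summary(sil)$avg.width   # overall mean silhouette width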
The k-means chart suggests dividing the data into 2 clusters. PAM also chooses 2 clusters, and CLARA groups the data into 2 clusters as well. Lastly, hcut recommends 3 clusters; however, the difference in average silhouette width between 2 and 3 clusters for hcut is small.
library(gridExtra)
a <- fviz_nbclust(airbnb_clus, FUNcluster = kmeans, method = "silhouette") +
labs(subtitle = "silhouette - kmeans")
b <- fviz_nbclust(airbnb_clus, FUNcluster = cluster::pam, method = "silhouette") +
labs(subtitle = "silhouette - pam")
c <- fviz_nbclust(airbnb_clus, FUNcluster = cluster::clara, method = "silhouette") +
labs(subtitle = "silhouette - clara")
d <- fviz_nbclust(airbnb_clus, FUNcluster = hcut, method = "silhouette") +
labs(subtitle = "silhouette - hcut")
grid.arrange(a, b, c, d, ncol=2)
The squared error for each point is the squared distance between the point and its representation, i.e. its assigned cluster center. The WSS score is the sum of these squared errors over all points. [https://medium.com/analytics-vidhya/how-to-determine-the-optimal-k-for-k-means-708505d204eb]
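As a minimal sketch (using the same airbnb_clus sample), the WSS plotted in the elbow charts below corresponds to the total within-cluster sum of squares reported by kmeans():
# Sketch: total within-cluster sum of squares for several k;
# the "elbow" is where further increases in k yield little reduction
wss_for_k <- function(k) kmeans(airbnb_clus, centers = k, nstart = 25)$tot.withinss
sapply(2:6, wss_for_k)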
To cross-check the results obtained with the silhouette method, I will analyze the elbow (WSS) charts. For CLARA and PAM, the bend of the elbow visually suggests 4 clusters, while the k-means and hcut charts suggest 3 clusters.
e1 <- fviz_nbclust(airbnb_clus, kmeans, method = "wss", k.max = 10) +
labs(subtitle = "Elbow method - kmeans")
e2 <- fviz_nbclust(airbnb_clus, cluster::clara, method = "wss", k.max = 10) +
labs(subtitle = "Elbow method - clara")
e3 <- fviz_nbclust(airbnb_clus, cluster::pam, method = "wss", k.max = 10) +
labs(subtitle = "Elbow method - pam")
e4 <- fviz_nbclust(airbnb_clus, hcut, method = "wss", k.max = 10) +
labs(subtitle = "Elbow method - hcut")
grid.arrange(e1,e2,e3,e4, ncol=2, top = "Optimal number of clusters")
library(factoextra)
cl_kmeans <- eclust(airbnb_clus, k=2, FUNcluster="kmeans", hc_metric="pearson", graph=FALSE)
a <- fviz_silhouette(cl_kmeans)
## cluster size ave.sil.width
## 1 1 473 0.43
## 2 2 527 0.47
b <- fviz_cluster(cl_kmeans, data = airbnb_clus, ellipse.type = "convex") + theme_minimal()
grid.arrange(a, b, ncol=2)
The K-means clustering algorithm uses iterative refinement to produce a final result. The algorithm inputs are the number of clusters K and the data set, which is a collection of features for each data point. The algorithm starts with initial estimates for the K centroids, which can either be randomly generated or randomly selected from the data set, and then iterates between two steps: a data assignment step and a centroid update step. [https://blogs.oracle.com/ai-and-datascience/post/introduction-to-k-means-clustering]
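For illustration only, a minimal base-R sketch of this two-step loop (random initialization, a fixed number of iterations, no handling of empty clusters or convergence checks; the analysis itself relies on kmeans()/eclust()):
# Sketch of Lloyd's algorithm: alternate assignment and centroid update
simple_kmeans <- function(X, k, iters = 10) {
  centers <- X[sample(nrow(X), k), , drop = FALSE]   # random initial centroids
  for (i in seq_len(iters)) {
    # assignment step: each point joins the cluster of its nearest centroid
    d  <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
    cl <- max.col(-d)
    # update step: each centroid becomes the mean of its assigned points
    centers <- apply(X, 2, function(col) tapply(col, cl, mean))
  }
  list(cluster = cl, centers = centers)
}
# e.g. simple_kmeans(airbnb_clus, k = 2)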
# eclust() uses FUNcluster = "kmeans" by default
km2 <- eclust(airbnb_clus,k=2,hc_metric = 'euclidean', graph = FALSE)
km3 <- eclust(airbnb_clus,k=3,hc_metric = 'euclidean', graph = FALSE)
km4 <- eclust(airbnb_clus,k=4,hc_metric = 'euclidean', graph = FALSE)
km5 <- eclust(airbnb_clus,k=5,hc_metric = 'euclidean', graph = FALSE)
k1 <- fviz_cluster(km2, geom = c("point")) + ggtitle('K-means with 2 clusters')
k2 <- fviz_cluster(km3, geom = c("point")) + ggtitle('K-means with 3 clusters')
k3 <- fviz_cluster(km4, geom = c("point")) + ggtitle('K-means with 4 clusters')
k4 <- fviz_cluster(km5, geom = c("point")) + ggtitle('K-means with 5 clusters')
grid.arrange(arrangeGrob(k1, k2, k3, k4, ncol=2 , top = "Clustering"))
PAM (Partitioning Around Medoids), suited to smaller datasets, is an implementation of the k-medoid method, i.e. a clustering technique that divides a dataset containing n objects into k groups (clusters) known a priori. A useful tool for assessing the quality of the grouping is the silhouette, calculated for each grouped object. On its basis, it is possible to determine whether the objects have been grouped correctly or whether they have been placed in the wrong clusters.
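As a minimal sketch of PAM called directly (the analysis below uses the eclust() wrapper), showing that the cluster representatives are actual observations (medoids):
library(cluster)
pam_fit <- pam(airbnb_clus, k = 2, metric = "euclidean")
pam_fit$medoids             # the two representative observations
pam_fit$silinfo$avg.width   # overall average silhouette width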
pam2 <- eclust(airbnb_clus,'pam',k=2,hc_metric = 'euclidean', graph = FALSE)
pam3 <- eclust(airbnb_clus,'pam',k=3,hc_metric = 'euclidean', graph = FALSE)
pam4 <- eclust(airbnb_clus,'pam',k=4,hc_metric = 'euclidean', graph = FALSE)
pam5<- eclust(airbnb_clus,'pam',k=5,hc_metric = 'euclidean', graph = FALSE)
p1 <- fviz_cluster(pam2, geom = c("point")) + ggtitle('PAM with 2 clusters')
p2 <- fviz_cluster(pam3, geom = c("point")) + ggtitle('PAM with 3 clusters')
p3 <- fviz_cluster(pam4, geom = c("point")) + ggtitle('PAM with 4 clusters')
p4 <- fviz_cluster(pam5, geom = c("point")) + ggtitle('PAM with 5 clusters')
grid.arrange(arrangeGrob(p1, p2, p3, p4, ncol=2 , top = "Clustering"))
CLARA, designed for larger datasets, draws multiple samples of the dataset, applies PAM to each sample, and returns the best clustering as the output. [from presentation]
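A minimal sketch of CLARA called directly, making the sampling explicit (the analysis below again uses the eclust() wrapper; the number of sub-samples is an illustrative choice):
library(cluster)
clara_fit <- clara(airbnb_clus, k = 2, metric = "euclidean",
                   samples = 50, pamLike = TRUE)   # PAM on 50 sub-samples, best medoids kept
clara_fit$medoids
clara_fit$silinfo$avg.width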
clara2 <- eclust(airbnb_clus,'clara',k=2,hc_metric = 'euclidean', graph = FALSE)
clara3 <- eclust(airbnb_clus,'clara',k=3,hc_metric = 'euclidean', graph = FALSE)
clara4 <- eclust(airbnb_clus,'clara',k=4,hc_metric = 'euclidean', graph = FALSE)
clara5 <- eclust(airbnb_clus,'clara',k=5,hc_metric = 'euclidean', graph = FALSE)
c1 <- fviz_cluster(clara2, geom = c("point")) + ggtitle('Clara with 2 clusters')
c2 <- fviz_cluster(clara3, geom = c("point")) + ggtitle('Clara with 3 clusters')
c3 <- fviz_cluster(clara4, geom = c("point")) + ggtitle('Clara with 4 clusters')
c4 <- fviz_cluster(clara5, geom = c("point")) + ggtitle('Clara with 5 clusters')
grid.arrange(arrangeGrob(c1, c2, c3, c4, ncol=2 , top = "Clustering"))
According to the silhouette method and the k-means charts, the most interesting number of clusters is 2. Moreover, the two clusters are similar in size, which cannot be said about the division into 3 or 4 clusters.
km2$silinfo$clus.avg.widths
## [1] 0.4340356 0.4679409
km3$silinfo$clus.avg.widths
## [1] 0.4391868 0.4976030 0.2191956
km4$silinfo$clus.avg.widths
## [1] 0.5338543 0.3078273 0.1656116 0.4157589
km2_s <- fviz_silhouette(km2) + ggtitle('K-means with 2 clusters')
## cluster size ave.sil.width
## 1 1 473 0.43
## 2 2 527 0.47
km3_s <- fviz_silhouette(km3) + ggtitle('K-means with 3 clusters')
## cluster size ave.sil.width
## 1 1 438 0.44
## 2 2 499 0.50
## 3 3 63 0.22
km4_s <- fviz_silhouette(km4) + ggtitle('K-means with 4 clusters')
## cluster size ave.sil.width
## 1 1 418 0.53
## 2 2 48 0.31
## 3 3 64 0.17
## 4 4 470 0.42
grid.arrange(arrangeGrob(km2_s,km3_s, km4_s, ncol=2 , top = "Clustering"))
To get to know the data better, I will analyze and compare the observations divided into 2 clusters.
# re-run k-means with k = 2 and attach the cluster labels to the
# unscaled sample so the clusters can be profiled and mapped
km_ <- kmeans(airbnb_clus, centers = 2, nstart=25)
airbnb_clus_map_ <- cbind(airbnb_sample_, cluster = km_$cluster)
new_airbnb_ <- cbind(airbnb_sample_, cluster = km_$cluster)
cluster1 <- new_airbnb_[new_airbnb_$cluster == 1,]
cluster2 <- new_airbnb_[new_airbnb_$cluster == 2,]
cluster1_map_ <- new_airbnb_[airbnb_clus_map_$cluster == 1,]
cluster2_map_ <- new_airbnb_[airbnb_clus_map_$cluster == 2,]
Cluster 1 consists of 473 listings. Most of them are located in Brooklyn (46.5%), a relatively large group is in Manhattan, and 15.6% are in Queens. 454 of the 473 listings are private rooms. Based on the summary table and the density plot, the mean price is around 80 dollars per night, with a 1st quartile of 50 and a 3rd quartile of 95.
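These shares can also be derived compactly with dplyr (a sketch equivalent to the explicit nrow() counts further below):
# Sketch: borough and room-type shares within cluster 1
library(dplyr)
cluster1 %>% count(neighbourhood_group) %>% mutate(share = n / sum(n))
cluster1 %>% count(room_type) %>% mutate(share = n / sum(n))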
library(sf)
library(mapview)

# convert cluster 1 listings to an sf object for mapping
my.sf.point <- st_as_sf(x = cluster1_map_,
coords = c("longitude", "latitude"),
crs = "+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0")
# simple plot
# plot(my.sf.point)
# interactive map:
mapview(my.sf.point)
# convert to an sp object if needed
my.sp.point <- as(my.sf.point, "Spatial")
summary(cluster1)
## id name host_id host_name
## Min. : 8025 Length:473 Min. : 7549 Length:473
## 1st Qu.:11434339 Class :character 1st Qu.: 12672019 Class :character
## Median :21261549 Mode :character Median : 40632179 Mode :character
## Mean :20266917 Mean : 78282972
## 3rd Qu.:29831517 3rd Qu.:137358866
## Max. :36477588 Max. :273139430
##
## neighbourhood_group neighbourhood latitude longitude
## Length:473 Length:473 Min. :40.58 Min. :-74.08
## Class :character Class :character 1st Qu.:40.69 1st Qu.:-73.97
## Mode :character Mode :character Median :40.72 Median :-73.95
## Mean :40.73 Mean :-73.94
## 3rd Qu.:40.76 3rd Qu.:-73.92
## Max. :40.89 Max. :-73.74
##
## room_type price minimum_nights number_of_reviews
## Length:473 Min. : 21.00 Min. : 1.000 Min. : 0.00
## Class :character 1st Qu.: 50.00 1st Qu.: 1.000 1st Qu.: 1.00
## Mode :character Median : 70.00 Median : 2.000 Median : 5.00
## Mean : 80.69 Mean : 5.307 Mean : 23.61
## 3rd Qu.: 95.00 3rd Qu.: 4.000 3rd Qu.: 24.00
## Max. :525.00 Max. :60.000 Max. :356.00
##
## last_review reviews_per_month calculated_host_listings_count
## Length:473 Min. :0.020 Min. : 1.000
## Class :character 1st Qu.:0.230 1st Qu.: 1.000
## Mode :character Median :0.860 Median : 1.000
## Mean :1.482 Mean : 3.882
## 3rd Qu.:2.110 3rd Qu.: 3.000
## Max. :9.000 Max. :103.000
## NA's :92
## availability_365 private_room entire_apartment shared_room
## Min. : 0.0 Length:473 Length:473 Length:473
## 1st Qu.: 0.0 Class :character Class :character Class :character
## Median : 62.0 Mode :character Mode :character Mode :character
## Mean :111.9
## 3rd Qu.:203.0
## Max. :365.0
##
## cluster
## Min. :1
## 1st Qu.:1
## Median :1
## Mean :1
## 3rd Qu.:1
## Max. :1
##
number_of_private <- filter(cluster1, room_type =='Private room')
number_of_entire <- filter(cluster1, room_type =='Entire home/apt')
number_of_shared <- filter(cluster1, room_type =='Shared room')
nrow(number_of_private)
## [1] 454
nrow(number_of_entire)
## [1] 0
nrow(number_of_shared)
## [1] 19
number_of_bronx <- filter(cluster1, neighbourhood_group =='Bronx')
number_of_manhattan <- filter(cluster1, neighbourhood_group =='Manhattan')
number_of_brooklyn <- filter(cluster1, neighbourhood_group =='Brooklyn')
number_of_queens <- filter(cluster1, neighbourhood_group =='Queens')
number_of_island <- filter(cluster1, neighbourhood_group =='Staten Island')
nrow(number_of_bronx)
## [1] 18
nrow(number_of_manhattan)
## [1] 159
nrow(number_of_brooklyn)
## [1] 220
nrow(number_of_queens)
## [1] 74
nrow(number_of_island)
## [1] 2
nrow(cluster1)
## [1] 473
mean(cluster1$price)
## [1] 80.69133
hist(cluster1$price,xlab = "Airbnb Price",col = "gray",border = "black", xlim = c(0,400) , ylim = c(0,400), breaks = 8)
d1 <- density(cluster1$price)
plot(d1)
summary(cluster1$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.00 50.00 70.00 80.69 95.00 525.00
Cluster 2 consists of 527 listings. About half of them are located in Manhattan, a relatively large group is in Brooklyn (37.2%), and 11.2% are in Queens. 526 of the 527 listings are entire homes/apartments. Based on the summary table and the density plot, the mean price is around 190 dollars per night, with a 1st quartile of 120 and a 3rd quartile of 223.5.
# convert cluster 2 listings to an sf object for mapping
my.sf.point <- st_as_sf(x = cluster2_map_,
coords = c("longitude", "latitude"),
crs = "+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0")
# simple plot
# plot(my.sf.point)
# interactive map:
mapview(my.sf.point)
# convert to an sp object if needed
my.sp.point <- as(my.sf.point, "Spatial")
summary(cluster2)
## id name host_id host_name
## Min. : 27644 Length:527 Min. : 52394 Length:527
## 1st Qu.: 8503370 Class :character 1st Qu.: 5598420 Class :character
## Median :19408844 Mode :character Median : 24467359 Mode :character
## Mean :18578594 Mean : 59844159
## 3rd Qu.:29442508 3rd Qu.: 85272560
## Max. :36404936 Max. :273568164
##
## neighbourhood_group neighbourhood latitude longitude
## Length:527 Length:527 Min. :40.51 Min. :-74.24
## Class :character Class :character 1st Qu.:40.69 1st Qu.:-73.99
## Mode :character Mode :character Median :40.73 Median :-73.96
## Mean :40.73 Mean :-73.96
## 3rd Qu.:40.76 3rd Qu.:-73.94
## Max. :40.87 Max. :-73.73
##
## room_type price minimum_nights number_of_reviews
## Length:527 Min. : 16.0 Min. : 1.000 Min. : 0.00
## Class :character 1st Qu.:120.0 1st Qu.: 2.000 1st Qu.: 1.00
## Mode :character Median :160.0 Median : 3.000 Median : 6.00
## Mean :190.1 Mean : 8.283 Mean : 25.06
## 3rd Qu.:223.5 3rd Qu.: 5.500 3rd Qu.: 25.00
## Max. :900.0 Max. :222.000 Max. :334.00
##
## last_review reviews_per_month calculated_host_listings_count
## Length:527 Min. : 0.010 Min. : 1.00
## Class :character 1st Qu.: 0.210 1st Qu.: 1.00
## Mode :character Median : 0.720 Median : 1.00
## Mean : 1.409 Mean : 12.51
## 3rd Qu.: 2.115 3rd Qu.: 2.00
## Max. :15.780 Max. :327.00
## NA's :95
## availability_365 private_room entire_apartment shared_room
## Min. : 0.0 Length:527 Length:527 Length:527
## 1st Qu.: 0.0 Class :character Class :character Class :character
## Median : 44.0 Mode :character Mode :character Mode :character
## Mean :111.4
## 3rd Qu.:228.0
## Max. :365.0
##
## cluster
## Min. :2
## 1st Qu.:2
## Median :2
## Mean :2
## 3rd Qu.:2
## Max. :2
##
number_of_private <- filter(cluster2, room_type =='Private room')
number_of_entire <- filter(cluster2, room_type =='Entire home/apt')
number_of_shared <- filter(cluster2, room_type =='Shared room')
nrow(number_of_private)
## [1] 1
nrow(number_of_entire)
## [1] 526
nrow(number_of_shared)
## [1] 0
number_of_bronx <- filter(cluster2, neighbourhood_group =='Bronx')
number_of_manhattan <- filter(cluster2, neighbourhood_group =='Manhattan')
number_of_brooklyn <- filter(cluster2, neighbourhood_group =='Brooklyn')
number_of_queens <- filter(cluster2, neighbourhood_group =='Queens')
number_of_island <- filter(cluster2, neighbourhood_group =='Staten Island')
nrow(number_of_bronx)
## [1] 6
nrow(number_of_manhattan)
## [1] 263
nrow(number_of_brooklyn)
## [1] 196
nrow(number_of_queens)
## [1] 59
nrow(number_of_island)
## [1] 3
nrow(cluster2)
## [1] 527
mean(cluster2$price)
## [1] 190.1347
hist(cluster2$price,xlab = "Airbnb Price",col = "gray",border = "black", xlim = c(0,400) , ylim = c(0,400), breaks = 8)
d2 <- density(cluster2$price)
plot(d2)
summary(cluster2$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16.0 120.0 160.0 190.1 223.5 900.0