I decided to explore the Airbnb dataset for New York City. Using clustering methods, I chose the number of clusters. Clusters were built in two scenarios: using only the location attributes (longitude and latitude), and additionally using the price and the type of the rented accommodation. They make it possible to divide the accommodation offer in New York. Within the resulting clusters I conducted, among other statistics, an analysis of prices across the districts of New York. The main purpose is to examine Airbnb prices within the clusters. We can expect the booking price to depend on the location; for example, in Manhattan the price is expected to be higher than in Queens or Brooklyn.
library(dplyr)

# load the New York Airbnb listings and drop extreme prices
airbnb <- read.csv("airbnb.csv")
airbnb <- airbnb %>%
  filter(price < 1000)
The dataset will be reduced by removing NA values. In order to use the room type in the cluster analysis, it will be recoded into binary (0/1) dummy variables.
# duplicate room_type so it can be recoded into three 0/1 dummies
airbnb$private_room <- airbnb$room_type
airbnb$entire_apartment <- airbnb$room_type
airbnb$shared_room <- airbnb$room_type
# draw a reproducible sample of 1000 listings, kept with the original
# (unscaled) variables for profiling the clusters later
set.seed(200)
airbnb_sample_ <- airbnb[sample(nrow(airbnb), 1000), ]
# recode the dummies: 1 if the listing is of the given type, 0 otherwise
airbnb$private_room[airbnb$private_room == "Private room"] <- 1
airbnb$private_room[airbnb$private_room != 1] <- 0
airbnb$entire_apartment[airbnb$entire_apartment == "Entire home/apt"] <- 1
airbnb$entire_apartment[airbnb$entire_apartment != 1] <- 0
airbnb$shared_room[airbnb$shared_room == "Shared room"] <- 1
airbnb$shared_room[airbnb$shared_room != 1] <- 0
airbnb$private_room <- as.numeric(airbnb$private_room)
airbnb$entire_apartment <- as.numeric(airbnb$entire_apartment)
airbnb$shared_room <- as.numeric(airbnb$shared_room)
# keep only the clustering variables: location, price and the room-type dummies
airbnb2 <- subset(airbnb, select=-c(id,name,host_name,host_id,last_review,calculated_host_listings_count, availability_365, neighbourhood_group,neighbourhood, room_type, reviews_per_month, minimum_nights, number_of_reviews))
airbnb2 <- scale(airbnb2)   # standardize so no variable dominates the distances
# the same seed selects the same 1000 rows as in airbnb_sample_
set.seed(200)
airbnb1_sample <- airbnb2[sample(nrow(airbnb2), 1000), ]
airbnb_clus <- airbnb1_sample
Before clustering it is necessary to check whether the data is clusterable. Since the Hopkins statistic equals 0.97, I can assume that the dataset is highly clusterable.
library(factoextra)

# Hopkins statistic and ordered dissimilarity plot for clustering tendency
get_clust_tendency(airbnb_clus, 2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"))
## $hopkins_stat
## [1] 0.973009
##
## $plot
Knowing that the data can be clustered, the next step is to choose the optimal number of clusters and apply clustering algorithms such as k-means.
I will first determine the optimal number of clusters using the silhouette and WSS statistics, and then apply the partitioning algorithms k-means, PAM and CLARA. Since the chosen sample consists of 1,000 observations, a hierarchical clustering dendrogram would not be readable or transparent, so hierarchical clustering is not shown.
In the first step, the silhouette method was used to choose the optimal number of clusters. The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. [Wikipedia]
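For intuition, here is a minimal sketch (an illustration only, not part of the main pipeline) of how the average silhouette width summarized in the charts below can be computed directly with cluster::silhouette for one k-means partition with k = 2:
# Sketch: average silhouette width for a single k-means partition (k = 2),
# computed directly rather than through fviz_nbclust()
library(cluster)
km_tmp <- kmeans(airbnb_clus, centers = 2, nstart = 25)
sil <- silhouette(km_tmp$cluster, dist(airbnb_clus))
summary(sil)$avg.width   # overall mean silhouette width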
The k-means chart suggests dividing the data into 2 clusters. PAM also chooses 2 clusters, and CLARA groups the data into 2 clusters as well. Lastly, hcut recommends 3 clusters; however, the difference in average silhouette width between 2 and 3 clusters for hcut is small.
library(gridExtra)
a <- fviz_nbclust(airbnb_clus, FUNcluster = kmeans, method = "silhouette") +
labs(subtitle = "silhouette - kmeans")
b <- fviz_nbclust(airbnb_clus, FUNcluster = cluster::pam, method = "silhouette") +
labs(subtitle = "silhouette - pam")
c <- fviz_nbclust(airbnb_clus, FUNcluster = cluster::clara, method = "silhouette") +
labs(subtitle = "silhouette - clara")
d <- fviz_nbclust(airbnb_clus, FUNcluster = hcut, method = "silhouette") +
labs(subtitle = "silhouette - hcut")
grid.arrange(a, b, c, d, ncol=2)
The squared error for each point is the squared distance between the point and its representation, i.e. its assigned cluster center. The WSS score is the sum of these squared errors over all points. [https://medium.com/analytics-vidhya/how-to-determine-the-optimal-k-for-k-means-708505d204eb]
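As a minimal sketch (using the same airbnb_clus sample), the WSS plotted in the elbow charts below corresponds to the total within-cluster sum of squares reported by kmeans():
# Sketch: total within-cluster sum of squares for several k;
# the "elbow" is where further increases in k yield little reduction
wss_for_k <- function(k) kmeans(airbnb_clus, centers = k, nstart = 25)$tot.withinss
sapply(2:6, wss_for_k)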
To cross-check the results obtained with the silhouette method, I will analyze the elbow (WSS) charts. For CLARA and PAM, the bend of the elbow visually suggests 4 clusters, while the k-means and hcut charts suggest 3 clusters.
e1 <- fviz_nbclust(airbnb_clus, kmeans, method = "wss", k.max = 10) +
labs(subtitle = "Elbow method - kmeans")
e2 <- fviz_nbclust(airbnb_clus, cluster::clara, method = "wss", k.max = 10) +
labs(subtitle = "Elbow method - clara")
e3 <- fviz_nbclust(airbnb_clus, cluster::pam, method = "wss", k.max = 10) +
labs(subtitle = "Elbow method - pam")
e4 <- fviz_nbclust(airbnb_clus, hcut, method = "wss", k.max = 10) +
labs(subtitle = "Elbow method - hcut")
grid.arrange(e1,e2,e3,e4, ncol=2, top = "Optimal number of clusters")
library(factoextra)
cl_kmeans <- eclust(airbnb_clus, k=2, FUNcluster="kmeans", hc_metric="pearson", graph=FALSE)
a <- fviz_silhouette(cl_kmeans)
## cluster size ave.sil.width
## 1 1 473 0.43
## 2 2 527 0.47
b <- fviz_cluster(cl_kmeans, data = airbnb_clus, ellipse.type = "convex") + theme_minimal()
grid.arrange(a, b, ncol=2)
The K-means clustering algorithm uses iterative refinement to produce a final result. The algorithm inputs are the number of clusters K and the data set, which is a collection of features for each data point. The algorithm starts with initial estimates for the K centroids, which can either be randomly generated or randomly selected from the data set, and then iterates between two steps: a data assignment step and a centroid update step. [https://blogs.oracle.com/ai-and-datascience/post/introduction-to-k-means-clustering]
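For illustration only, a minimal base-R sketch of this two-step loop (random initialization, a fixed number of iterations, no handling of empty clusters or convergence checks; the analysis itself relies on kmeans()/eclust()):
# Sketch of Lloyd's algorithm: alternate assignment and centroid update
simple_kmeans <- function(X, k, iters = 10) {
  centers <- X[sample(nrow(X), k), , drop = FALSE]   # random initial centroids
  for (i in seq_len(iters)) {
    # assignment step: each point joins the cluster of its nearest centroid
    d  <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
    cl <- max.col(-d)
    # update step: each centroid becomes the mean of its assigned points
    centers <- apply(X, 2, function(col) tapply(col, cl, mean))
  }
  list(cluster = cl, centers = centers)
}
# e.g. simple_kmeans(airbnb_clus, k = 2)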
# eclust() uses FUNcluster = "kmeans" by default
km2 <- eclust(airbnb_clus,k=2,hc_metric = 'euclidean', graph = FALSE)
km3 <- eclust(airbnb_clus,k=3,hc_metric = 'euclidean', graph = FALSE)
km4 <- eclust(airbnb_clus,k=4,hc_metric = 'euclidean', graph = FALSE)
km5 <- eclust(airbnb_clus,k=5,hc_metric = 'euclidean', graph = FALSE)
k1 <- fviz_cluster(km2, geom = c("point")) + ggtitle('K-means with 2 clusters')
k2 <- fviz_cluster(km3, geom = c("point")) + ggtitle('K-means with 3 clusters')
k3 <- fviz_cluster(km4, geom = c("point")) + ggtitle('K-means with 4 clusters')
k4 <- fviz_cluster(km5, geom = c("point")) + ggtitle('K-means with 5 clusters')
grid.arrange(arrangeGrob(k1, k2, k3, k4, ncol=2 , top = "Clustering"))
PAM (Partitioning Around Medoids), suited to smaller datasets, is an implementation of the k-medoid method, i.e. a clustering technique that divides a dataset containing n objects into k groups (clusters) known a priori. A useful tool for assessing the quality of the grouping is the silhouette, calculated for each grouped object. On its basis, it is possible to determine whether the objects have been grouped correctly or whether they have been placed in the wrong clusters.
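As a minimal sketch of PAM called directly (the analysis below uses the eclust() wrapper), showing that the cluster representatives are actual observations (medoids):
library(cluster)
pam_fit <- pam(airbnb_clus, k = 2, metric = "euclidean")
pam_fit$medoids             # the two representative observations
pam_fit$silinfo$avg.width   # overall average silhouette width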
pam2 <- eclust(airbnb_clus,'pam',k=2,hc_metric = 'euclidean', graph = FALSE)
pam3 <- eclust(airbnb_clus,'pam',k=3,hc_metric = 'euclidean', graph = FALSE)
pam4 <- eclust(airbnb_clus,'pam',k=4,hc_metric = 'euclidean', graph = FALSE)
pam5<- eclust(airbnb_clus,'pam',k=5,hc_metric = 'euclidean', graph = FALSE)
p1 <- fviz_cluster(pam2, geom = c("point")) + ggtitle('PAM with 2 clusters')
p2 <- fviz_cluster(pam3, geom = c("point")) + ggtitle('PAM with 3 clusters')
p3 <- fviz_cluster(pam4, geom = c("point")) + ggtitle('PAM with 4 clusters')
p4 <- fviz_cluster(pam5, geom = c("point")) + ggtitle('PAM with 5 clusters')
grid.arrange(arrangeGrob(p1, p2, p3, p4, ncol=2 , top = "Clustering"))
CLARA, designed for larger datasets, draws multiple samples of the dataset, applies PAM to each sample, and returns the best clustering as the output. [from presentation]
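A minimal sketch of CLARA called directly, making the sampling explicit (the analysis below again uses the eclust() wrapper; the number of sub-samples is an illustrative choice):
library(cluster)
clara_fit <- clara(airbnb_clus, k = 2, metric = "euclidean",
                   samples = 50, pamLike = TRUE)   # PAM on 50 sub-samples, best medoids kept
clara_fit$medoids
clara_fit$silinfo$avg.width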
clara2 <- eclust(airbnb_clus,'clara',k=2,hc_metric = 'euclidean', graph = FALSE)
clara3 <- eclust(airbnb_clus,'clara',k=3,hc_metric = 'euclidean', graph = FALSE)
clara4 <- eclust(airbnb_clus,'clara',k=4,hc_metric = 'euclidean', graph = FALSE)
clara5 <- eclust(airbnb_clus,'clara',k=5,hc_metric = 'euclidean', graph = FALSE)
c1 <- fviz_cluster(clara2, geom = c("point")) + ggtitle('Clara with 2 clusters')
c2 <- fviz_cluster(clara3, geom = c("point")) + ggtitle('Clara with 3 clusters')
c3 <- fviz_cluster(clara4, geom = c("point")) + ggtitle('Clara with 4 clusters')
c4 <- fviz_cluster(clara5, geom = c("point")) + ggtitle('Clara with 5 clusters')
grid.arrange(arrangeGrob(c1, c2, c3, c4, ncol=2 , top = "Clustering"))
According to the silhouette method and the k-means charts, the most interesting number of clusters is 2. Moreover, the two clusters are similar in size, which cannot be said about the division into 3 or 4 clusters.
km2$silinfo$clus.avg.widths
## [1] 0.4340356 0.4679409
km3$silinfo$clus.avg.widths
## [1] 0.4391868 0.4976030 0.2191956
km4$silinfo$clus.avg.widths
## [1] 0.5338543 0.3078273 0.1656116 0.4157589
km2_s <- fviz_silhouette(km2) + ggtitle('K-means with 2 clusters')
## cluster size ave.sil.width
## 1 1 473 0.43
## 2 2 527 0.47
km3_s <- fviz_silhouette(km3) + ggtitle('K-means with 3 clusters')
## cluster size ave.sil.width
## 1 1 438 0.44
## 2 2 499 0.50
## 3 3 63 0.22
km4_s <- fviz_silhouette(km4) + ggtitle('K-means with 4 clusters')
## cluster size ave.sil.width
## 1 1 418 0.53
## 2 2 48 0.31
## 3 3 64 0.17
## 4 4 470 0.42
grid.arrange(arrangeGrob(km2_s,km3_s, km4_s, ncol=2 , top = "Clustering"))
To get to know the data better, I will analyze and compare the observations divided into 2 clusters.
# re-run k-means with k = 2 and attach the cluster labels to the
# unscaled sample so the clusters can be profiled and mapped
km_ <- kmeans(airbnb_clus, centers = 2, nstart=25)
airbnb_clus_map_ <- cbind(airbnb_sample_, cluster = km_$cluster)
new_airbnb_ <- cbind(airbnb_sample_, cluster = km_$cluster)
cluster1 <- new_airbnb_[new_airbnb_$cluster == 1,]
cluster2 <- new_airbnb_[new_airbnb_$cluster == 2,]
cluster1_map_ <- new_airbnb_[airbnb_clus_map_$cluster == 1,]
cluster2_map_ <- new_airbnb_[airbnb_clus_map_$cluster == 2,]
Cluster 1 consists of 473 listings. Most of them are located in Brooklyn (46.5%), a relatively large group is in Manhattan, and 15.6% are in Queens. 454 of the 473 listings are private rooms. Based on the summary table and the density plot, the mean price is around 80 dollars per night, with a 1st quartile of 50 and a 3rd quartile of 95.
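These shares can also be derived compactly with dplyr (a sketch equivalent to the explicit nrow() counts further below):
# Sketch: borough and room-type shares within cluster 1
library(dplyr)
cluster1 %>% count(neighbourhood_group) %>% mutate(share = n / sum(n))
cluster1 %>% count(room_type) %>% mutate(share = n / sum(n))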
library(sf)
library(mapview)

# convert cluster 1 listings to an sf object for mapping
my.sf.point <- st_as_sf(x = cluster1_map_,
coords = c("longitude", "latitude"),
crs = "+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0")
# simple plot
# plot(my.sf.point)
# interactive map:
mapview(my.sf.point)
# convert to an sp object if needed
my.sp.point <- as(my.sf.point, "Spatial")
summary(cluster1)
## id name host_id host_name
## Min. : 8025 Length:473 Min. : 7549 Length:473
## 1st Qu.:11434339 Class :character 1st Qu.: 12672019 Class :character
## Median :21261549 Mode :character Median : 40632179 Mode :character
## Mean :20266917 Mean : 78282972
## 3rd Qu.:29831517 3rd Qu.:137358866
## Max. :36477588 Max. :273139430
##
## neighbourhood_group neighbourhood latitude longitude
## Length:473 Length:473 Min. :40.58 Min. :-74.08
## Class :character Class :character 1st Qu.:40.69 1st Qu.:-73.97
## Mode :character Mode :character Median :40.72 Median :-73.95
## Mean :40.73 Mean :-73.94
## 3rd Qu.:40.76 3rd Qu.:-73.92
## Max. :40.89 Max. :-73.74
##
## room_type price minimum_nights number_of_reviews
## Length:473 Min. : 21.00 Min. : 1.000 Min. : 0.00
## Class :character 1st Qu.: 50.00 1st Qu.: 1.000 1st Qu.: 1.00
## Mode :character Median : 70.00 Median : 2.000 Median : 5.00
## Mean : 80.69 Mean : 5.307 Mean : 23.61
## 3rd Qu.: 95.00 3rd Qu.: 4.000 3rd Qu.: 24.00
## Max. :525.00 Max. :60.000 Max. :356.00
##
## last_review reviews_per_month calculated_host_listings_count
## Length:473 Min. :0.020 Min. : 1.000
## Class :character 1st Qu.:0.230 1st Qu.: 1.000
## Mode :character Median :0.860 Median : 1.000
## Mean :1.482 Mean : 3.882
## 3rd Qu.:2.110 3rd Qu.: 3.000
## Max. :9.000 Max. :103.000
## NA's :92
## availability_365 private_room entire_apartment shared_room
## Min. : 0.0 Length:473 Length:473 Length:473
## 1st Qu.: 0.0 Class :character Class :character Class :character
## Median : 62.0 Mode :character Mode :character Mode :character
## Mean :111.9
## 3rd Qu.:203.0
## Max. :365.0
##
## cluster
## Min. :1
## 1st Qu.:1
## Median :1
## Mean :1
## 3rd Qu.:1
## Max. :1
##
number_of_private <- filter(cluster1, room_type =='Private room')
number_of_entire <- filter(cluster1, room_type =='Entire home/apt')
number_of_shared <- filter(cluster1, room_type =='Shared room')
nrow(number_of_private)
## [1] 454
nrow(number_of_entire)
## [1] 0
nrow(number_of_shared)
## [1] 19
number_of_bronx <- filter(cluster1, neighbourhood_group =='Bronx')
number_of_manhattan <- filter(cluster1, neighbourhood_group =='Manhattan')
number_of_brooklyn <- filter(cluster1, neighbourhood_group =='Brooklyn')
number_of_queens <- filter(cluster1, neighbourhood_group =='Queens')
number_of_island <- filter(cluster1, neighbourhood_group =='Staten Island')
nrow(number_of_bronx)
## [1] 18
nrow(number_of_manhattan)
## [1] 159
nrow(number_of_brooklyn)
## [1] 220
nrow(number_of_queens)
## [1] 74
nrow(number_of_island)
## [1] 2
nrow(cluster1)
## [1] 473
mean(cluster1$price)
## [1] 80.69133
hist(cluster1$price,xlab = "Airbnb Price",col = "gray",border = "black", xlim = c(0,400) , ylim = c(0,400), breaks = 8)
d1 <- density(cluster1$price)
plot(d1)
summary(cluster1$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.00 50.00 70.00 80.69 95.00 525.00
Cluster 2 consists of 527 listings. About half of them are located in Manhattan, a relatively large group is in Brooklyn (37.2%), and 11.2% are in Queens. 526 of the 527 listings are entire homes/apartments. Based on the summary table and the density plot, the mean price is around 190 dollars per night, with a 1st quartile of 120 and a 3rd quartile of 223.5.
# convert cluster 2 listings to an sf object for mapping
my.sf.point <- st_as_sf(x = cluster2_map_,
coords = c("longitude", "latitude"),
crs = "+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0")
# simple plot
# plot(my.sf.point)
# interactive map:
mapview(my.sf.point)
# convert to an sp object if needed
my.sp.point <- as(my.sf.point, "Spatial")
summary(cluster2)
## id name host_id host_name
## Min. : 27644 Length:527 Min. : 52394 Length:527
## 1st Qu.: 8503370 Class :character 1st Qu.: 5598420 Class :character
## Median :19408844 Mode :character Median : 24467359 Mode :character
## Mean :18578594 Mean : 59844159
## 3rd Qu.:29442508 3rd Qu.: 85272560
## Max. :36404936 Max. :273568164
##
## neighbourhood_group neighbourhood latitude longitude
## Length:527 Length:527 Min. :40.51 Min. :-74.24
## Class :character Class :character 1st Qu.:40.69 1st Qu.:-73.99
## Mode :character Mode :character Median :40.73 Median :-73.96
## Mean :40.73 Mean :-73.96
## 3rd Qu.:40.76 3rd Qu.:-73.94
## Max. :40.87 Max. :-73.73
##
## room_type price minimum_nights number_of_reviews
## Length:527 Min. : 16.0 Min. : 1.000 Min. : 0.00
## Class :character 1st Qu.:120.0 1st Qu.: 2.000 1st Qu.: 1.00
## Mode :character Median :160.0 Median : 3.000 Median : 6.00
## Mean :190.1 Mean : 8.283 Mean : 25.06
## 3rd Qu.:223.5 3rd Qu.: 5.500 3rd Qu.: 25.00
## Max. :900.0 Max. :222.000 Max. :334.00
##
## last_review reviews_per_month calculated_host_listings_count
## Length:527 Min. : 0.010 Min. : 1.00
## Class :character 1st Qu.: 0.210 1st Qu.: 1.00
## Mode :character Median : 0.720 Median : 1.00
## Mean : 1.409 Mean : 12.51
## 3rd Qu.: 2.115 3rd Qu.: 2.00
## Max. :15.780 Max. :327.00
## NA's :95
## availability_365 private_room entire_apartment shared_room
## Min. : 0.0 Length:527 Length:527 Length:527
## 1st Qu.: 0.0 Class :character Class :character Class :character
## Median : 44.0 Mode :character Mode :character Mode :character
## Mean :111.4
## 3rd Qu.:228.0
## Max. :365.0
##
## cluster
## Min. :2
## 1st Qu.:2
## Median :2
## Mean :2
## 3rd Qu.:2
## Max. :2
##
number_of_private <- filter(cluster2, room_type =='Private room')
number_of_entire <- filter(cluster2, room_type =='Entire home/apt')
number_of_shared <- filter(cluster2, room_type =='Shared room')
nrow(number_of_private)
## [1] 1
nrow(number_of_entire)
## [1] 526
nrow(number_of_shared)
## [1] 0
number_of_bronx <- filter(cluster2, neighbourhood_group =='Bronx')
number_of_manhattan <- filter(cluster2, neighbourhood_group =='Manhattan')
number_of_brooklyn <- filter(cluster2, neighbourhood_group =='Brooklyn')
number_of_queens <- filter(cluster2, neighbourhood_group =='Queens')
number_of_island <- filter(cluster2, neighbourhood_group =='Staten Island')
nrow(number_of_bronx)
## [1] 6
nrow(number_of_manhattan)
## [1] 263
nrow(number_of_brooklyn)
## [1] 196
nrow(number_of_queens)
## [1] 59
nrow(number_of_island)
## [1] 3
nrow(cluster2)
## [1] 527
mean(cluster2$price)
## [1] 190.1347
hist(cluster2$price,xlab = "Airbnb Price",col = "gray",border = "black", xlim = c(0,400) , ylim = c(0,400), breaks = 8)
d2 <- density(cluster2$price)
plot(d2)
summary(cluster2$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16.0 120.0 160.0 190.1 223.5 900.0