Data clustering is one of the basic yet most popular and important methods of unsupervised learning. However, the best-known clustering algorithms, e.g. k-means or k-medoids, are designed for datasets consisting of continuous variables. While using count data is in many cases acceptable and widely applied, these algorithms are not meant for, say, nominal variables. A similar problem concerns a popular dimension reduction method: Principal Component Analysis. Nevertheless, real data very often contains variables of various types, and discarding the nominal ones may significantly lower the effectiveness of the analysis. That is when algorithms for mixed-type data may prove valuable.
The main idea of this article is to present and compare different approaches to cluster analysis and dimension reduction, performed on data about Airbnb accommodations in Berlin. The analysis is divided into two cases. The first one uses only the continuous and count variables; simple K-means and PAM algorithms were applied to this data. Additionally, principal component analysis and hierarchical clustering were added before K-means as extra steps that may potentially increase the quality of the analysis. In the second case, categorical variables were added to the previous dataset and three types of clustering were applied to the newly created data: k-prototypes, partitioning around medoids using the Gower distance, and K-means performed on data with the categorical variables converted to binary values and treated as numeric. Finally, Factor Analysis of Mixed Data was used to potentially increase the quality of clustering in the second case.
library(stringr)
library(clustertend)
library(NbClust)
library(factoextra)
library(ClusterR)
library(fpc)
library(clusterSim)
library(psych)
library(FactoMineR)
library(clustMixType)
library(cluster)   # pam() and daisy() used below
library(dplyr)     # %>% pipe used in the data preparation below
The data used in the analysis consists of information about listings in Berlin from the popular web portal Airbnb. The dataset was created by Murray Cox and contains over 22 thousand observations scraped in November 2018. More information about the dataset can be found here.
A deep analysis of Airbnb listings may have plenty of practical applications. Depending on its purpose, data clustering gives better insight into, and helps in understanding, the structure of the travel accommodation market. Airbnb arouses different emotions across the world. On the one hand, it is a very useful and convenient site for finding holiday accommodation. On the other hand, there is always some risk when accommodation can be offered by almost anyone. Moreover, in many cities attractive to tourists the popularity of Airbnb contributes to the displacement of local inhabitants, as flats and rooms are instead offered to tourists by hosts. Clustering may turn out to be useful in finding features characteristic of good vs. bad hosts, safe vs. unsafe offers, accommodation offered by local inhabitants vs. more business-oriented types of accommodation, or even in locating potential fraudsters.
Due to the time complexity of some algorithms presented in this article, which makes them almost impossible to run in a reasonable amount of time on such a large dataset, only observations from the Mitte district were chosen; it is located in the middle of Berlin and is probably the most attractive district for the majority of tourists. There are clustering algorithms designed for large samples, like CLARA or CLARANS, but not all of the methods presented here have proper large-sample equivalents, so it was decided to limit the dataset as stated above. Observations with missing values in any of the variables used in the analysis were removed. The final dataset consists of over 3700 observations, with 20 continuous/count variables, 7 nominal variables (including 5 binary) and 1 ordinal variable.
listings<-read.csv("listings.csv", sep=",", dec=".", header=TRUE)
summar<-read.csv("listings_summary.csv", sep=",", dec=".", header=TRUE)
listings<-listings[,c("id","neighbourhood_group","latitude","longitude","room_type","price","minimum_nights","number_of_reviews","reviews_per_month","calculated_host_listings_count","availability_365")]
summar<-summar[,c("host_is_superhost","host_has_profile_pic","host_identity_verified","is_location_exact","accommodates","bathrooms","bedrooms","beds","bed_type","amenities","security_deposit","cleaning_fee","review_scores_accuracy","review_scores_cleanliness","review_scores_checkin","review_scores_communication","review_scores_location","review_scores_value","instant_bookable","cancellation_policy")]
database<-cbind(listings, summar)
rm("listings","summar")
database <- subset(database, database$neighbourhood_group=="Mitte")
database <- na.omit(database)
database <- database[!(database$host_is_superhost=="" & database$host_has_profile_pic=="" & database$host_identity_verified==""),]
# amenities is stored as a comma-separated list; count the number of items
database$amenities <- as.character(database$amenities)
database$amenities <- ifelse(nchar(database$amenities)>2, str_count(database$amenities, ',')+1, 0)
# security_deposit and cleaning_fee are stored as text amounts; extract the numeric
# part (dropping thousands separators) and treat missing values as zero
database$security_deposit <- database$security_deposit %>% str_extract_all("\\(?[0-9,.]+\\)?") %>% gsub(",", "", .) %>% as.numeric()
database$security_deposit[is.na(database$security_deposit)] <- 0
database$cleaning_fee <- database$cleaning_fee %>% str_extract_all("\\(?[0-9,.]+\\)?") %>% gsub(",", "", .) %>% as.numeric()
database$cleaning_fee[is.na(database$cleaning_fee)] <- 0
# straight-line distance (in degrees) from a central reference point of Berlin
database$dist<-sqrt((database$latitude-52.5195)^2+(database$longitude-13.4072)^2)
dane <- subset(database, select = -c(id, longitude, latitude, room_type, host_is_superhost, host_has_profile_pic, host_identity_verified, is_location_exact, neighbourhood_group, bed_type, instant_bookable, cancellation_policy))
Before the actual clustering, it is usually worthwhile to assess the overall clustering tendency of the data.
hopkins(dane, n=100)
## $H
## [1] 0.008273232
get_clust_tendency(dane, 100, graph=FALSE, gradient=list(low="red", mid="white", high="blue"))$hopkins_stat
## [1] 0.009559817
Clustering tendency can be assessed with the Hopkins statistic. In both implementations used here, values close to zero indicate that the data is far from being uniformly distributed, so the null hypothesis of no meaningful cluster structure can be rejected and the data appears to be clusterable.
K-means is probably the most common clustering method. Its goal is to minimize the total within-cluster squared error. It is simple, but the number of clusters k must be chosen beforehand. There are many methods for doing so, so it is worth checking a few of them.
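For reference, the objective that K-means minimizes (the total within-cluster sum of squares) can be written as follows, with $C_j$ denoting the $j$-th cluster and $\mu_j$ its centroid:

$$\min_{C_1,\dots,C_k} \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$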
fviz_nbclust(dane, FUNcluster = kmeans, method = c("silhouette"), k.max = 8, nboot = 100,)
The most popular of these methods is the silhouette statistic. The higher its value, the better, so here it suggests choosing two clusters.
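For a single observation $i$, the silhouette value compares its average distance $a(i)$ to the other members of its own cluster with its average distance $b(i)$ to the members of the nearest other cluster:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$

Values close to 1 indicate a well-assigned observation, values near 0 a borderline one, and negative values a likely misassignment; the plot above reports the average of $s(i)$ over all observations for each candidate number of clusters.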
fviz_nbclust(dane, FUNcluster = kmeans, method = c("wss"), k.max = 8, nboot = 100,)
opt<-Optimal_Clusters_KMeans(dane, max_clusters=8, plot_clusters=TRUE, criterion="AIC")
opt2<-Optimal_Clusters_KMeans(dane, max_clusters=8, plot_clusters=TRUE, criterion="BIC")
The total within sum of squares, AIC and BIC should be as low as possible, but here one can see that, in general, the higher the number of clusters, the smaller the value of these statistics. Because of that, it is more effective to use the elbow method, which suggests choosing the number of clusters beyond which adding further clusters improves the statistics significantly less. On that basis the total within sum of squares suggests choosing 2, 3 or 5 clusters, and AIC and BIC likewise point to 2 or 3, possibly 5 clusters. Considering this together with the silhouette statistic, 2, 3 and 5 clusters will be used as the values of k in the K-means algorithm.
kmeans2 <- eclust(dane, "kmeans", hc_metric="euclidean", k=2)
kmeans3 <- eclust(dane, "kmeans", hc_metric="euclidean", k=3)
kmeans5 <- eclust(dane, "kmeans", hc_metric="euclidean", k=5)
fviz_cluster(kmeans2, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())
fviz_silhouette(kmeans2)
## cluster size ave.sil.width
## 1 1 287 0.17
## 2 2 3419 0.73
fviz_cluster(kmeans3, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())
fviz_silhouette(kmeans3)
## cluster size ave.sil.width
## 1 1 111 0.27
## 2 2 2396 0.68
## 3 3 1199 0.06
fviz_cluster(kmeans5, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())
fviz_silhouette(kmeans5)
## cluster size ave.sil.width
## 1 1 39 0.42
## 2 2 2108 0.64
## 3 3 746 0.29
## 4 4 269 0.16
## 5 5 544 0.37
Because of the number of variables in the model (and hence the number of dimensions), the differences between clusters are not very clearly visible on the cluster plots. Despite the high overall value of the silhouette statistic in the model with two clusters, it can be seen that while the second cluster is much bigger and has a good average silhouette width, the first cluster is much worse and its silhouette width even takes negative values. However, the two other clusterings have worse results, still with one very good cluster but with the rest taking small or even negative average silhouette widths.
round(calinhara(dane,kmeans2$cluster),digits=2)
## [1] 2408.93
round(calinhara(dane,kmeans3$cluster),digits=2)
## [1] 2216.59
round(calinhara(dane,kmeans5$cluster),digits=2)
## [1] 2379.84
Additionally, there are other measures of clustering quality. The Calinski-Harabasz index is one of them, designed for comparing clusterings with different numbers of clusters. The highest value, which indicates the best clustering, is obtained for 2 clusters.
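The index is, roughly, the ratio of between-cluster to within-cluster dispersion, adjusted for the number of clusters $k$ and the number of observations $n$ (with $B$ the between-cluster and $W$ the total within-cluster sum of squares):

$$CH = \frac{B/(k-1)}{W/(n-k)}$$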
dudahart2(dane,kmeans2$cluster)$p.value
## [1] 0
dudahart2(dane,kmeans3$cluster)$p.value
## [1] 0
dudahart2(dane,kmeans5$cluster)$p.value
## [1] 0
The Duda-Hart test has the null hypothesis that the data should not be split into two clusters. A p-value equal to 0 in every case means that H0 is rejected and the clusters should be split, but as stated before, splitting the dataset into too many clusters is not really preferred.
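For intuition, the statistic underlying the test is (roughly) the ratio of the within-cluster sum of squares after splitting a group into two, $J_e(2)$, to the sum of squares of the unsplit group, $J_e(1)$:

$$\mathrm{DH} = \frac{J_e(2)}{J_e(1)}$$

Ratios that are small relative to the critical value (derived from the sample size and the number of dimensions) speak in favour of splitting, which is what the p-values above reflect.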
Taking all these results into account, the division into two clusters will be considered the best one. The differences in coordinates between the centers of the two clusters in that model are presented below.
kmeans2$centers
## price minimum_nights number_of_reviews reviews_per_month
## 1 118.48780 24.515679 22.06272 1.093868
## 2 63.98187 6.948815 27.73443 1.374452
## calculated_host_listings_count availability_365 accommodates bathrooms
## 1 2.843206 160.39721 3.372822 1.209059
## 2 2.582626 85.11612 2.811348 1.097397
## bedrooms beds amenities security_deposit cleaning_fee
## 1 1.484321 2.066202 22.11498 854.11150 62.19861
## 2 1.155894 1.707809 17.40041 61.63703 18.39807
## review_scores_accuracy review_scores_cleanliness review_scores_checkin
## 1 9.693380 9.508711 9.651568
## 2 9.639368 9.269669 9.676806
## review_scores_communication review_scores_location review_scores_value
## 1 9.745645 9.634146 9.337979
## 2 9.698450 9.462708 9.366481
## dist
## 1 0.03074756
## 2 0.03870296
Partitioning Around Medoids is a method somewhat similar to K-means, but it chooses actual points of the dataset (medoids) as cluster centers. It aims to minimize the sum of dissimilarities between the points in a cluster and the medoid of that cluster. The number of clusters must again be determined beforehand.
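In other words, for $k$ medoids $m_1,\dots,m_k$ chosen from the data points themselves, PAM minimizes the total dissimilarity of every observation to its closest medoid:

$$\min_{m_1,\dots,m_k} \sum_{i=1}^{n} \min_{j=1,\dots,k} d(x_i, m_j)$$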
fviz_nbclust(dane, FUNcluster = cluster::pam, method = c("silhouette"), k.max = 8, nboot = 100,)
This time the silhouette statistic indicates four clusters as the best option, although splitting the data into two clusters also seems quite good.
fviz_nbclust(dane, FUNcluster = cluster::pam, method = c("wss"), k.max = 8, nboot = 100,)
The total within sum of squares again suggests that the higher the number of clusters, the better, but there is a clearly visible elbow at four clusters. Other potential options are six and two clusters.
# PAM models for the two candidate numbers of clusters discussed above
pam2 <- eclust(dane, "pam", k=2)
pam4 <- eclust(dane, "pam", k=4)
fviz_cluster(pam2, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())
fviz_silhouette(pam2)
## cluster size ave.sil.width
## 1 1 1020 -0.04
## 2 2 2686 0.67
fviz_cluster(pam4, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())
fviz_silhouette(pam4)
## cluster size ave.sil.width
## 1 1 717 0.20
## 2 2 735 0.32
## 3 3 2130 0.69
## 4 4 124 0.15
Overall, the average silhouette width in both PAM models is lower than in the chosen K-means model. The difference between two and four clusters is quite small. There is the same problem of one very good cluster and the remaining ones being much weaker, with some silhouette values below zero (especially in the second cluster when splitting into two). Still, splitting the data into four clusters gives more reasonable and balanced results and will be considered the optimal choice.
Standardization is an important step before PCA and MDS, as it rescales every variable to zero mean and unit variance.
dane <- data.Normalization(dane, type="n1",normalization="column")
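As a quick, optional sanity check, after the "n1" standardization every column should have a mean of approximately zero and a standard deviation of one:

# every column of the standardized data should now have mean ~0 and sd ~1
round(colMeans(dane), 3)
round(apply(dane, 2, sd), 3)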
Dimension reduction techniques are used to reduce the number of factors (variables) when there are many of them and at least some are correlated with each other. Dimension reduction makes it possible to replace the initial factors with principal variables, for example to make the analysis easier or to avoid overfitting the model.
Multidimensional scaling is a dimension reduction technique which makes it possible to present the data in two dimensions, regardless of the number of variables used. One of its great advantages is the ease of detecting outliers, so it will be performed before the next steps of the analysis.
rownames(dane) <- NULL
dist.dane <- dist(dane)
mds1 <- cmdscale(dist.dane, k=2)
plot(mds1)
# mark observations with either MDS coordinate above 9 as outliers
abline(v=9)
abline(h=9)
x.out<-which(mds1[,1]>9)
y.out<-which(mds1[,2]>9)
out.all<-c(x.out, y.out)
out.uni<-unique(out.all)
d<-dane[out.uni,]
d$x.out<-mds1[out.uni,1]
d$y.out<-mds1[out.uni,2]
# highlight the outliers on the MDS plot (columns 21:22 hold the MDS coordinates added above)
points(d[,21:22], pch=21, bg="red")
# flag the outlying rows in the dataset
full<-1:3706
limited<-rownames(d)
dane$mark<-full %in% limited
describeBy(dane[,1:20], dane$mark)
##
## Descriptive statistics by group
## group: FALSE
## vars n mean sd median trimmed mad
## price 1 3683 -0.03 0.68 -0.22 -0.15 0.35
## minimum_nights 2 3683 0.00 1.00 -0.07 -0.06 0.02
## number_of_reviews 3 3683 0.00 1.00 -0.41 -0.24 0.19
## reviews_per_month 4 3683 0.00 1.00 -0.35 -0.19 0.47
## calculated_host_listings_count 5 3683 0.00 1.00 -0.29 -0.23 0.00
## availability_365 6 3683 0.00 1.00 -0.63 -0.17 0.16
## accommodates 7 3683 -0.01 0.95 -0.48 -0.16 0.00
## bathrooms 8 3683 -0.01 0.97 -0.33 -0.28 0.00
## bedrooms 9 3683 -0.01 0.98 -0.27 -0.11 0.00
## beds 10 3683 -0.01 0.96 -0.55 -0.23 0.00
## amenities 11 3683 0.00 1.00 -0.29 -0.13 0.77
## security_deposit 12 3683 0.00 1.00 -0.43 -0.22 0.00
## cleaning_fee 13 3683 -0.01 0.99 -0.22 -0.16 0.73
## review_scores_accuracy 14 3683 0.03 0.91 0.48 0.21 0.00
## review_scores_cleanliness 15 3683 0.02 0.95 0.68 0.20 0.00
## review_scores_checkin 16 3683 0.02 0.94 0.48 0.21 0.00
## review_scores_communication 17 3683 0.02 0.90 0.43 0.22 0.00
## review_scores_location 18 3683 0.02 0.95 0.67 0.20 0.00
## review_scores_value 19 3683 0.02 0.92 0.77 0.18 0.00
## dist 20 3683 0.00 1.00 0.00 -0.04 1.33
## min max range skew kurtosis se
## price -0.81 8.68 9.49 3.73 24.38 0.01
## minimum_nights -0.09 58.54 58.63 54.44 3152.05 0.02
## number_of_reviews -0.55 8.75 9.30 3.44 15.31 0.02
## reviews_per_month -0.74 19.34 20.08 5.09 67.52 0.02
## calculated_host_listings_count -0.29 7.73 8.03 5.84 38.22 0.02
## availability_365 -0.73 2.21 2.94 1.10 -0.35 0.02
## accommodates -1.05 7.46 8.51 2.11 6.56 0.02
## bathrooms -3.42 8.96 12.38 3.35 13.82 0.02
## bedrooms -1.78 5.75 7.53 1.58 4.40 0.02
## beds -1.29 10.63 11.92 2.94 12.90 0.02
## amenities -1.86 5.15 7.00 1.39 2.35 0.02
## security_deposit -0.43 14.28 14.71 5.85 52.76 0.02
## cleaning_fee -0.71 12.36 13.07 5.12 51.44 0.02
## review_scores_accuracy -7.62 0.48 8.11 -2.78 11.49 0.01
## review_scores_cleanliness -6.95 0.68 7.63 -2.16 7.01 0.02
## review_scores_checkin -11.24 0.48 11.72 -3.04 15.37 0.02
## review_scores_communication -9.68 0.43 10.11 -3.23 16.22 0.01
## review_scores_location -6.98 0.67 7.65 -1.88 5.48 0.02
## review_scores_value -6.47 0.77 7.24 -1.58 4.25 0.02
## dist -1.61 2.38 3.99 0.25 -1.22 0.02
## --------------------------------------------------------
## group: TRUE
## vars n mean sd median trimmed mad
## price 1 23 4.13 8.60 -0.28 2.41 0.46
## minimum_nights 2 23 0.00 0.12 -0.06 -0.02 0.03
## number_of_reviews 3 23 -0.39 0.36 -0.53 -0.47 0.03
## reviews_per_month 4 23 -0.41 0.41 -0.55 -0.48 0.19
## calculated_host_listings_count 5 23 0.60 1.29 -0.11 0.44 0.27
## availability_365 6 23 0.40 1.13 0.55 0.33 1.90
## accommodates 7 23 1.91 3.53 0.08 1.63 1.68
## bathrooms 8 23 1.56 2.68 -0.33 1.06 0.00
## bedrooms 9 23 1.17 2.38 -0.27 1.00 2.23
## beds 10 23 1.72 3.30 0.20 1.18 1.10
## amenities 11 23 0.03 1.51 -0.71 -0.20 0.77
## security_deposit 12 23 0.21 1.32 -0.43 -0.12 0.00
## cleaning_fee 13 23 0.93 2.19 0.46 0.52 1.74
## review_scores_accuracy 14 23 -4.04 3.66 -4.92 -3.86 4.01
## review_scores_cleanliness 15 23 -2.80 3.12 -3.14 -2.73 5.66
## review_scores_checkin 16 23 -2.84 3.43 -2.45 -2.45 4.34
## review_scores_communication 17 23 -3.84 4.05 -2.46 -3.52 4.28
## review_scores_location 18 23 -2.71 2.92 -1.88 -2.49 3.78
## review_scores_value 19 23 -3.85 3.37 -4.06 -3.80 3.58
## dist 20 23 0.46 1.02 0.64 0.50 1.16
## min max range skew kurtosis se
## price -0.67 28.85 29.52 1.71 1.52 1.79
## minimum_nights -0.09 0.25 0.34 1.11 -0.56 0.03
## number_of_reviews -0.55 0.81 1.37 2.28 3.98 0.07
## reviews_per_month -0.72 0.65 1.37 1.62 1.29 0.08
## calculated_host_listings_count -0.29 2.99 3.28 1.08 -0.62 0.27
## availability_365 -0.73 2.21 2.94 0.39 -1.41 0.23
## accommodates -1.05 7.46 8.51 0.72 -1.31 0.74
## bathrooms -0.33 8.96 9.28 1.29 0.77 0.56
## bedrooms -1.78 5.75 7.53 0.73 -0.70 0.50
## beds -0.55 9.14 9.69 1.16 -0.25 0.69
## amenities -1.54 4.00 5.54 1.13 0.30 0.32
## security_deposit -0.43 4.79 5.21 2.25 4.43 0.27
## cleaning_fee -0.71 7.46 8.17 1.60 1.67 0.46
## review_scores_accuracy -10.33 0.48 10.81 -0.41 -0.92 0.76
## review_scores_cleanliness -6.95 0.68 7.63 -0.16 -1.65 0.65
## review_scores_checkin -11.24 0.48 11.72 -0.62 -0.64 0.71
## review_scores_communication -11.13 0.43 11.56 -0.33 -1.39 0.84
## review_scores_location -9.53 0.67 10.20 -0.61 -0.66 0.61
## review_scores_value -8.88 0.77 9.65 -0.12 -1.41 0.70
## dist -1.42 1.88 3.30 -0.37 -1.04 0.21
The plot above shows that some observations indeed have MDS coordinates very different from the majority of points. These points were marked in red and their details printed. The outliers themselves may be interesting to analyse: are they simply offers that differ from the others, offers with data errors, or perhaps even suspicious listings? In any case, further analysis will be performed without these points to increase its quality.
dane <- subset(dane, dane$mark == FALSE, select = -c(mark))
rm(d)
Principal Component Analysis concentrates on finding the directions with the greatest variability of the data. Transforming the data and reducing its dimensionality with PCA is one of the common ways of improving clustering, e.g. by limiting redundant features.
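Formally, with the standardized data matrix $X$, the first principal component is the unit-length direction $w_1$ that maximizes the variance of the projected data, and each subsequent component does the same while being orthogonal to the previous ones:

$$w_1 = \arg\max_{\lVert w \rVert = 1} \operatorname{Var}(Xw)$$

Equivalently, the principal components are the eigenvectors of the correlation matrix of $X$, and the eigenvalues reported below are the variances captured by the corresponding components.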
pca <- PCA(dane, ncp=6, scale.unit = TRUE, graph=FALSE)
get_eigenvalue(pca)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 3.8122024 19.0610120 19.06101
## Dim.2 2.9281338 14.6406688 33.70168
## Dim.3 1.7585210 8.7926050 42.49429
## Dim.4 1.4685993 7.3429963 49.83728
## Dim.5 1.1735753 5.8678764 55.70516
## Dim.6 1.0110116 5.0550580 60.76022
## Dim.7 0.9669701 4.8348504 65.59507
## Dim.8 0.8696153 4.3480766 69.94314
## Dim.9 0.8106810 4.0534048 73.99655
## Dim.10 0.6759204 3.3796021 77.37615
## Dim.11 0.6714952 3.3574762 80.73363
## Dim.12 0.6009311 3.0046557 83.73828
## Dim.13 0.5061537 2.5307687 86.26905
## Dim.14 0.4907879 2.4539395 88.72299
## Dim.15 0.4789834 2.3949168 91.11791
## Dim.16 0.4363484 2.1817418 93.29965
## Dim.17 0.4154900 2.0774501 95.37710
## Dim.18 0.3913505 1.9567525 97.33385
## Dim.19 0.3733984 1.8669922 99.20084
## Dim.20 0.1598313 0.7991563 100.00000
fviz_eig(pca, addlabels=TRUE)
There are different approaches to choosing the number of dimensions to keep. One is to keep the dimensions whose eigenvalues are greater than 1, which means six dimensions in this case. These are also exactly the dimensions that each explain at least 5% of the variance. On the plot, the elbow method suggests the same number. Those six dimensions explain about 60% of the total variance, which is not perfect but acceptable; this is the share of variance that will be kept in the data after the dimension reduction.
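The roughly 60% figure can be recomputed directly from the eigenvalue table: with standardized variables each of the 20 variables carries a variance of 1, so the retained share is simply the sum of the first six eigenvalues divided by 20. A small check, reusing the pca object from above:

eig <- get_eigenvalue(pca)
# equals the cumulative share reported for Dim.6 above (about 0.61)
sum(eig[1:6, "eigenvalue"]) / nrow(eig)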
pcavar <- get_pca_var(pca)
fviz_pca_var(pca, col.var="cos2", alpha.var="contrib", gradient.cols = c("blue", "green", "red"), repel = TRUE)
The variable correlation plot above shows the relationships between variables in the first two dimensions: positively correlated variables point to the same side, negatively correlated ones to opposite sides, and the length of the arrows together with the colour denotes the quality of representation of the variables. Additionally, the contribution of each variable to these two principal components is shown by its transparency. The contributions of the top ten variables to each of the six chosen dimensions are presented in the bar plots below, with the expected average contribution (5%, i.e. 1 out of 20 variables) marked with the red line.
fviz_contrib(pca, choice = "var", axes = 1, top = 10)
fviz_contrib(pca, choice = "var", axes = 2, top = 10)
fviz_contrib(pca, choice = "var", axes = 3, top = 10)
fviz_contrib(pca, choice = "var", axes = 4, top = 10)
fviz_contrib(pca, choice = "var", axes = 5, top = 10)
fviz_contrib(pca, choice = "var", axes = 6, top = 10)
fviz_pca_ind(pca, col.ind = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
)
The last graph represents the observations and their quality of representation in the first two principal components.
The next steps will be performed on the data prepared with the PCA with the six principal components kept.
Non-hierarchical clustering algorithms like K-means and PAM are not very sensitive to outliers, but they need the number of clusters stated beforehand and their results depend remarkably on the initial centers. On the other hand, hierarchical algorithms do not need an initial number of clusters, but they are sensitive to outliers and cannot be corrected ex post. These two approaches can, however, be used together. This time the optimal number of clusters will be chosen by analysing the dendrogram results.
dendr1 <- HCPC(pca, -1, graph = F)
fviz_dend(dendr1,
palette = "jco",
rect = TRUE, rect_fill = TRUE,
rect_border = "jco",
)
The suggested partition is based on the highest relative loss of inertia. This partition indeed seems to be the best, so it will be chosen. The dendrogram together with the distribution of the observations in the first two dimensions is presented on the graph below.
plot(dendr1, choice = "3D.map")
datapca <- data.frame(pca$ind$coord)
hopkins(datapca, n=100)
## $H
## [1] 0.07239851
get_clust_tendency(datapca, 100, graph=F, gradient=list(low="red", mid="white", high="blue"))$hopkins_stat
## [1] 0.071698
The data based on the principal components is also clusterable, as the Hopkins statistic again takes values close to zero in both tests.
fviz_nbclust(datapca, FUNcluster = kmeans, method = c("silhouette"), k.max = 8, nboot = 100,)
According to the silhouette statistic the optimal number of clusters is two, but even then the value is low. Splitting the data into three clusters is the next best option. In contrast to the dendrogram results, four clusters do not look like a very good solution here.
kmeanspca4 <- eclust(datapca, "kmeans", hc_metric="euclidean", k=4, graph = F)
fviz_cluster(kmeanspca4, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())
fviz_silhouette(kmeanspca4)
## cluster size ave.sil.width
## 1 1 1904 0.39
## 2 2 573 0.13
## 3 3 752 0.10
## 4 4 454 0.05
As already noticed, the average silhouette width is not very satisfying. In particular, the smaller clusters have low, often even negative, silhouette values.
kmeanspca2 <- eclust(datapca, "kmeans", hc_metric="euclidean", k=2, graph = F)
kmeanspca3 <- eclust(datapca, "kmeans", hc_metric="euclidean", k=3, graph = F)
round(calinhara(datapca,kmeanspca2$cluster),digits=2)
## [1] 1017.2
round(calinhara(datapca,kmeanspca3$cluster),digits=2)
## [1] 975.84
round(calinhara(datapca,kmeanspca4$cluster),digits=2)
## [1] 908.34
According to the Calinski-Harabasz index we should prefer two clusters over three and four, just as with the silhouette statistic, so it seems that the number of clusters chosen from the hierarchical clustering may not be optimal for K-means.
This time, in addition to all of the variables used so far, let's add eight categorical/ordinal variables and prepare a new dataset, “dane2”.
dane2 <- subset(database, select = -c(id, longitude, latitude, neighbourhood_group))
K-means is a popular algorithm for clustering continuous data. For categorical variables, K-modes can be used; in the simplest terms, it is an alternative to and an extension of K-means which, as its name suggests, uses modes instead of means. K-prototypes is an approach that combines K-means and K-modes: it uses the Euclidean distance for the numeric variables and the simple matching coefficient for the categorical variables in the dataset.
The trade-off between the weights of these two distance components is controlled by the lambda parameter. By default it is estimated from the variances of the numeric variables; in this analysis that approach will be compared with one using standard deviations instead.
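To make the role of lambda explicit: for an observation $x$ and a cluster prototype $p$, with numeric parts $x^{num}, p^{num}$ and categorical parts $x^{cat}, p^{cat}$, k-prototypes uses a combined distance of the form

$$d(x, p) = \sum_{j} \left(x^{num}_j - p^{num}_j\right)^2 + \lambda \sum_{m} \mathbb{1}\!\left(x^{cat}_m \neq p^{cat}_m\right),$$

so the larger the lambda, the more weight the categorical variables get relative to the numeric ones.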
Es <- numeric(10)
for(i in 1:10){
kpres <- kproto(dane2, k = i)
Es[i] <- kpres$tot.withinss
}
Essil <- numeric(10)
for(i in 1:10){
kpres <- kproto(dane2, k = i)
Essil[i] <- silhouette_kproto(kpres)
}
Esmcclain <- numeric(10)
for(i in 2:10){
kpres <- kproto(dane2, k = i)
Esmcclain[i] <- mcclain_kproto(kpres)
}
plot(1:10, Es, type = "b", ylab = "Total Within Sum Of Squares", xlab = "Number of clusters")
According to the total within sum of squares, the more clusters, the better; according to the elbow method, two, four or six clusters should be considered.
plot(1:10, Essil, type = "b", ylab = "Silhouette", xlab = "Number of clusters")
Silhouette statistic strongly suggests choosing two clusters.
plot(1:10, Esmcclain, type = "b", ylab = "McClain-Rao index", xlab = "Number of clusters")
The McClain-Rao index, like the total within sum of squares, should be as low as possible, so it also suggests two clusters.
Es2 <- numeric(10)
for(i in 1:10){
kpres <- kproto(dane2, k = i, lambdaest(dane2, num.method = 2))
Es2[i] <- kpres$tot.withinss
}
Essil2 <- numeric(10)
for(i in 1:10){
kpres <- kproto(dane2, k = i, lambdaest(dane2, num.method = 2))
Essil2[i] <- silhouette_kproto(kpres)
}
Esmcclain2 <- numeric(10)
for(i in 2:10){
kpres <- kproto(dane2, k = i, lambdaest(dane2, num.method = 2))
Esmcclain2[i] <- mcclain_kproto(kpres)
}
plot(1:10, Es2, type = "b", ylab = "Total Within Sum Of Squares", xlab = "Number of clusters")
The total within sum of squares again suggests choosing a high number of clusters, with some indication of seven or five clusters.
plot(1:10, Essil2, type = "b", ylab = "Silhouette", xlab = "Number of clusters")
According to Silhouette statistic there should be two clusters chosen.
plot(1:10, Esmcclain2, type = "b", ylab = "McClain-Rao index", xlab = "Number of clusters")
This time the best option according to the McClain-Rao index is nine clusters; among smaller numbers of clusters, two and four should be considered.
kproto2var <- kproto(dane2, 2)
kproto2sd <- kproto(dane2, 2, lambdaest(dane2, num.method = 2))
kproto2var$centers
kproto2sd$centers
The results show the centers of the clusters when splitting the data into two groups, respectively with lambda obtained from the variances (kproto2var) and from the standard deviations (kproto2sd). In both cases the centers have the same values of the binary variables, but room type and cancellation policy differ between the centers.
mcclain_kproto(kproto2var)
## [1] 0.1191884
mcclain_kproto(kproto2sd)
## [1] 0.08075219
silhouette_kproto(kproto2var)
## [1] 0.8119357
silhouette_kproto(kproto2sd)
## [1] 0.8483459
The McClain-Rao index is slightly lower and the silhouette statistic slightly higher for the clustering obtained with lambda based on standard deviations, so it can be considered the slightly better clustering. The silhouette statistic is really high, so this approach seems to have worked quite well.
Another popular approach for mixed data is to use the Gower distance, which, in contrast to e.g. the Euclidean distance, can deal with different data types. This is because a distance metric is chosen for every variable according to its type: the Manhattan distance for interval variables and for ordinal ones (which are ranked first), and the Dice coefficient for nominal variables. Given such a distance matrix, an appropriate clustering algorithm can be chosen; in this analysis the clustering will be performed with the Partitioning Around Medoids algorithm.
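In its general form, the Gower dissimilarity between observations $i$ and $j$ is a weighted average of per-variable contributions $d_{ij}^{(k)}$, each scaled to $[0,1]$ (for a numeric variable the absolute difference divided by the variable's range, for a nominal variable a simple 0/1 mismatch):

$$d_{Gower}(i,j) = \frac{\sum_{k} \delta_{ij}^{(k)} d_{ij}^{(k)}}{\sum_{k} \delta_{ij}^{(k)}},$$

where $\delta_{ij}^{(k)}$ indicates whether variable $k$ can be compared for both observations.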
gower_dist <- daisy(dane2,
metric = "gower",
type = list(logratio = 3))
Esgower <- numeric(10)
for(i in 2:10){
pames <- pam(gower_dist, diss = TRUE, k = i)
Esgower[i] <- pames$silinfo$avg.width}
plot(1:10, Esgower, type = "b", ylab = "Silhouette", xlab = "Number of Clusters")
Based on Silhouette statistic two clusters will be chosen as the optimal partition.
pamgower <- pam(gower_dist, diss = TRUE, k=2)
fviz_silhouette(pamgower)
## cluster size ave.sil.width
## 1 1 1781 0.10
## 2 2 1925 0.23
Unfortunately, the results aren’t satisfying. The Silhouette statistic is very low for both clusters.
The last approach is to convert the nominal variables to binary dummy variables and treat all binary values as if they were numeric zero-one values. This approach is sometimes not recommended for various reasons, but it is easy to come up with and does not require any knowledge beyond, for example, the K-means technique. Let's check how this method works with this data.
danebin <- database
danebin$host_is_superhost <- as.numeric(ifelse(danebin$host_is_superhost == 't', 1, 0))
danebin$host_has_profile_pic <- as.numeric(ifelse(danebin$host_has_profile_pic == 't', 1, 0))
danebin$host_identity_verified <- as.numeric(ifelse(danebin$host_identity_verified == 't', 1, 0))
danebin$is_location_exact <- as.numeric(ifelse(danebin$is_location_exact == 't', 1, 0))
danebin$instant_bookable <- as.numeric(ifelse(danebin$instant_bookable == 't', 1, 0))
danebin$room_type_apt <- as.numeric(ifelse(danebin$room_type == 'Entire home/apt', 1, 0))
danebin$room_type_priv <- as.numeric(ifelse(danebin$room_type == 'Private room', 1, 0))
danebin$room_type_shared <- as.numeric(ifelse(danebin$room_type == 'Shared room', 1, 0))
danebin$bed_type_airbed <- as.numeric(ifelse(danebin$bed_type == 'Airbed', 1, 0))
danebin$bed_type_couch <- as.numeric(ifelse(danebin$bed_type == 'Couch', 1, 0))
danebin$bed_type_futon <- as.numeric(ifelse(danebin$bed_type == 'Futon', 1, 0))
danebin$bed_type_sofa <- as.numeric(ifelse(danebin$bed_type == 'Pull-out Sofa', 1, 0))
danebin$bed_type_real <- as.numeric(ifelse(danebin$bed_type == 'Real Bed', 1, 0))
danebin$cancellation_policy_flexible <- as.numeric(ifelse(danebin$cancellation_policy == 'flexible', 1, 0))
danebin$cancellation_policy_moderate <- as.numeric(ifelse(danebin$cancellation_policy == 'moderate', 1, 0))
danebin$cancellation_policy_strict14 <- as.numeric(ifelse(danebin$cancellation_policy == 'strict_14_with_grace_period', 1, 0))
danebin$cancellation_policy_strict30 <- as.numeric(ifelse(danebin$cancellation_policy == 'super_strict_30', 1, 0))
danebin$cancellation_policy_strict60 <- as.numeric(ifelse(danebin$cancellation_policy == 'super_strict_60', 1, 0))
danebin <- subset(danebin, select = -c(id, longitude, latitude, neighbourhood_group, room_type, bed_type, cancellation_policy))
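As a side note, the manual ifelse encoding of the three multi-level factors above could also be generated automatically, for instance with model.matrix; the sketch below is only an illustration (the t/f host variables would still need their own 0/1 conversion as above):

# one 0/1 column per factor level; "-1" removes the intercept so no level is dropped
dummies <- cbind(model.matrix(~ room_type - 1, data = database),
                 model.matrix(~ bed_type - 1, data = database),
                 model.matrix(~ cancellation_policy - 1, data = database))
colnames(dummies)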
hopkins(danebin, n=100)
## $H
## [1] 0.01206602
get_clust_tendency(danebin, 2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"))$hopkins_stat
## [1] 0.002272877
The Hopkins statistic is again close to zero, so the data is clusterable.
fviz_nbclust(danebin, FUNcluster = kmeans, method = c("silhouette"), k.max = 8, nboot = 100,)
According to the silhouette statistic, two clusters should be chosen.
fviz_nbclust(danebin, FUNcluster = kmeans, method = c("wss"), k.max = 8, nboot = 100,)
opt<-Optimal_Clusters_KMeans(danebin, max_clusters=8, plot_clusters=TRUE, criterion="AIC")
opt4<-Optimal_Clusters_KMeans(danebin, max_clusters=8, plot_clusters=TRUE, criterion="BIC")
According to the total within sum of squares, AIC and BIC, the number of clusters should be high; judging by the elbow method, five clusters are indicated, or possibly two or three.
kmeanscat2 <- eclust(danebin, "kmeans", hc_metric="euclidean", k=2)
kmeanscat3 <- eclust(danebin, "kmeans", hc_metric="euclidean", k=3)
kmeanscat5 <- eclust(danebin, "kmeans", hc_metric="euclidean", k=5)
fviz_silhouette(kmeanscat2)
## cluster size ave.sil.width
## 1 1 287 0.17
## 2 2 3419 0.73
fviz_silhouette(kmeanscat3)
## cluster size ave.sil.width
## 1 1 111 0.27
## 2 2 2396 0.68
## 3 3 1199 0.06
fviz_silhouette(kmeanscat5)
## cluster size ave.sil.width
## 1 1 39 0.42
## 2 2 2108 0.64
## 3 3 746 0.29
## 4 4 269 0.16
## 5 5 544 0.37
The average silhouette width is highest for two clusters. All results are quite similar to those obtained with K-means on the continuous variables only in the first scenario: in every case there is one big cluster with a high silhouette value, while the remaining clusters are noticeably worse.
round(calinhara(danebin,kmeanscat2$cluster),digits=2)
## [1] 2408.87
round(calinhara(danebin,kmeanscat3$cluster),digits=2)
## [1] 2216.51
round(calinhara(danebin,kmeanscat5$cluster),digits=2)
## [1] 2379.7
According to the Calinski-Harabasz index we should also prefer splitting the data into two groups, so let's consider that the best option.
Finally, just like in the first scenario, clustering will be performed on data obtained through dimension reduction, with some help from hierarchical clustering methods.
First, the steps for the continuous variables were repeated: the data was standardized, the outliers were detected and eliminated, and that data was joined with the remaining variables.
Factor Analysis of Mixed Data is a technique for performing a principal components method on data with both continuous and categorical variables. It is a combination of Principal Component Analysis for the quantitative variables and Multiple Correspondence Analysis for the qualitative ones.
famd <- FAMD(dane2, ncp=50, graph=FALSE)
get_eigenvalue(famd)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 4.3146660 12.3276172 12.32762
## Dim.2 3.1470083 8.9914522 21.31907
## Dim.3 1.9999964 5.7142755 27.03334
## Dim.4 1.7499815 4.9999471 32.03329
## Dim.5 1.5144771 4.3270776 36.36037
## Dim.6 1.2550052 3.5857291 39.94610
## Dim.7 1.1770949 3.3631283 43.30923
## Dim.8 1.1179266 3.1940759 46.50330
## Dim.9 1.0598303 3.0280866 49.53139
## Dim.10 1.0327728 2.9507795 52.48217
## Dim.11 1.0171137 2.9060391 55.38821
## Dim.12 1.0052944 2.8722696 58.26048
## Dim.13 0.9976618 2.8504624 61.11094
## Dim.14 0.9771573 2.7918780 63.90282
## Dim.15 0.9544407 2.7269735 66.62979
## Dim.16 0.9360554 2.6744440 69.30424
## Dim.17 0.9200119 2.6286054 71.93284
## Dim.18 0.9007241 2.5734975 74.50634
## Dim.19 0.8626151 2.4646145 76.97095
## Dim.20 0.7970548 2.2772995 79.24825
## Dim.21 0.7465696 2.1330560 81.38131
## Dim.22 0.7047723 2.0136353 83.39494
## Dim.23 0.6574763 1.8785036 85.27345
## Dim.24 0.6485629 1.8530368 87.12648
## Dim.25 0.6002737 1.7150676 88.84155
## Dim.26 0.5272054 1.5063011 90.34785
## Dim.27 0.4892949 1.3979855 91.74584
## Dim.28 0.4819052 1.3768719 93.12271
## Dim.29 0.4415568 1.2615908 94.38430
## Dim.30 0.4102558 1.1721595 95.55646
## Dim.31 0.4082420 1.1664057 96.72287
## Dim.32 0.3757689 1.0736253 97.79649
## Dim.33 0.3269025 0.9340073 98.73050
## Dim.34 0.2912427 0.8321219 99.56262
## Dim.35 0.1530826 0.4373790 100.00000
fviz_eig(famd, ncp=20, addlabels=TRUE)
Twelve dimensions have eigenvalues equal to 1 or higher and they jointly explain 58.26% of the variance. The graph shows that six dimensions might be a good choice as well, but they explain only 39.95% of the variance, so twelve dimensions will be chosen.
The graph below presents the correlations between all variables and the first two principal dimensions. The colour of the variables shows the quality of their representation.
famd <- FAMD(dane2, ncp=12, graph=FALSE)
famdvar <- get_famd_var(famd)
fviz_famd_var(famd, col.var="cos2", alpha.var="contrib", gradient.cols = c("blue", "green", "red"), repel = TRUE)
Additionally, the contributions of the variables to the six main dimensions are presented below. Some of the categorical variables used in this scenario have a real impact on the dimensions: the contribution of room type to the fourth dimension is over 20%, and in the fifth dimension the contributions of cancellation_policy, instant_bookable, room_type and host_identity_verified are all above the expected average contribution. This shows that including these variables indeed affects the way the principal components are calculated for this data.
fviz_contrib(famd, choice = "var", axes = 1, top = 20)
fviz_contrib(famd, choice = "var", axes = 2, top = 20)
fviz_contrib(famd, choice = "var", axes = 3, top = 20)
fviz_contrib(famd, choice = "var", axes = 4, top = 20)
fviz_contrib(famd, choice = "var", axes = 5, top = 20)
fviz_contrib(famd, choice = "var", axes = 6, top = 20)
The best partition based on the dendrogram is to split the data into four groups. The distribution of observations among these groups, obtained from the dendrogram, is also presented below.
dendr2 <- HCPC(famd, -1, graph = F)
fviz_dend(dendr2,
palette = "uchicago",
rect = TRUE, rect_fill = TRUE,
rect_border = "uchicago",
)
plot(dendr2, choice = "3D.map")
A great thing about this dimension reduction is that, thanks to FAMD, the initial variables, both continuous and categorical, were replaced with numeric dimensions. Because of that there is no longer any need to worry about the mixed data, and the clustering can now be done with any method applicable to continuous data. Simple K-means clustering with the Euclidean distance will be performed here.
datafamd <- data.frame(famd$ind$coord)
hopkins(datafamd, n=100)
## $H
## [1] 0.05153417
get_clust_tendency(datafamd, 2, graph=F, gradient=list(low="red", mid="white", high="blue"))$hopkins_stat
## [1] 0.02346509
The data obtained through the dimension reduction is clusterable.
# K-means on the FAMD coordinates, with the four clusters suggested by the dendrogram
kmeansfamd <- eclust(datafamd, "kmeans", hc_metric="euclidean", k=4, graph=F)
fviz_cluster(kmeansfamd, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())
fviz_silhouette(kmeansfamd)
## cluster size ave.sil.width
## 1 1 1764 0.29
## 2 2 544 0.06
## 3 3 894 0.06
## 4 4 481 0.02
Unfortunately, the silhouette statistic shows that the result of this clustering is poor: all of the clusters have low silhouette widths and three of them have an average silhouette width of almost zero.
Although at first sight it may seem that all popular clustering methods are dedicated to datasets consisting only of continuous variables, there are many approaches which make it possible to explore and search for hidden structures in mixed data as well. There is probably no single universal solution; as always with clustering, the essence is to work through enough approaches to find the one that is cut out for the problem at hand.