1. Introduction

Data clustering is one of the basic but very popular and important methods of unsupervised learning. However, the most well-known clustering algorithms, e.g. k-means or k-medoids, are designed for datasets consisting of continuous variables. While using count data is in many cases acceptable and widely practised, these algorithms are not suited to, for example, nominal variables. A similar problem concerns a popular dimension reduction method, Principal Component Analysis. Nevertheless, real data very often consists of variables of various types, and discarding the nominal ones may significantly lower the effectiveness of the analysis. That is when algorithms for mixed-type data may prove valuable.

The main idea of this article is to present and compare different approaches to cluster analysis and dimension reduction, performed on data about Airbnb accommodations in Berlin. The analysis was divided into two cases. In the first one only continuous and count variables were selected, and simple K-means and PAM algorithms were applied to this data. Additionally, principal component analysis and hierarchical clustering were added before the K-means as extra steps that could potentially increase the quality of the analysis. In the second case some categorical variables were added to the previous dataset and three types of clustering were applied to the newly created data: k-prototypes, partitioning around medoids using the Gower distance, and K-means performed on data with the categorical variables converted to binary values and treated as numeric. Finally, Factor Analysis of Mixed Data was used to potentially increase the quality of clustering in the second case.

2. Libraries

library(stringr)
library(clustertend)
library(NbClust)
library(factoextra)
library(ClusterR)
library(fpc)
library(clusterSim)
library(psych)
library(FactoMineR)
library(clustMixType)
library(cluster)    # pam() and daisy(), used in the PAM and Gower distance sections
library(magrittr)   # pipe operator %>% used in the data preparation

3. Dataset description

Data used in the analysis consists of information about listings in Berlin from the popular web portal Airbnb. The dataset was created by Murray Cox and contains over 22 thousand observations scraped in November 2018. More information about the dataset can be found here.

A deeper analysis of Airbnb listings may have plenty of practical applications. Data clustering, depending on its purpose, gives better insight and helps in understanding the structure of the travel accommodation market. Airbnb arouses different emotions across the world. On the one hand it is a very useful and convenient site for finding holiday accommodation. On the other hand there is always a risk, since accommodation can be offered by almost anyone. Moreover, in many cities attractive to tourists the popularity of Airbnb contributes to the displacement of local inhabitants, as houses and rooms are offered to tourists instead. Clustering may turn out to be useful in finding features characteristic of good vs bad hosts, safe vs unsafe offers, accommodations offered by local inhabitants vs more business-oriented types of accommodation, or even in locating potential fraudsters.

Due to the time complexity of some algorithms presented in the article, which makes them almost impossible to run in a reasonable amount of time on such a large database, only observations from the Mitte district were chosen; it is probably the most attractive district for the majority of tourists and is located in the middle of Berlin. There are some clustering algorithms dedicated to large samples, like CLARA or CLARANS, however not all of the methods presented here have proper equivalents for large samples, so it was decided to limit the dataset as stated above. Observations with missing values in any of the variables used in the analysis were removed. Eventually the dataset consists of over 3700 observations, with 20 continuous/count variables, 7 nominal variables (including 5 binary) and 1 ordinal variable.

3.1. Continuous and count data for both scenarios

  • price - price of the offer for one day in US dollars
  • minimum_nights - minimum number of nights for which the accommodation could be rented
  • number_of_reviews - number of reviews for the listing
  • reviews_per_month - mean number of reviews per month
  • calculated_host_listings_count - number of host listings
  • availability_365 - number of days in a year for which the listing is available
  • accommodates - number of people the accommodation is offered for
  • bathrooms - number of bathrooms
  • bedrooms - number of bedrooms
  • beds - number of beds
  • amenities - sum of amenities offered in this listing
  • security_deposit - value that has to be paid as a deposit
  • cleaning_fee - one-off fee for cleaning after the guest’s stay
  • review_scores_accuracy - score for the accuracy of the offer, can be 1 to 10
  • review_scores_cleanliness - score for the cleanliness of the offer, can be 1 to 10
  • review_scores_checkin - score for the quality of the check-in, can be 1 to 10
  • review_scores_communication - score for the communication with the host, can be 1 to 10
  • review_scores_location - score for the location of the accommodation
  • review_scores_value - score for the value of the offer relative to the price paid
  • dist - additional variable created from the latitude and longitude coordinates. Some clustering algorithms, for example k-means, cannot handle cases with non-spherical clusters. Considering the trends in travel accommodation markets, one may expect that using latitude and longitude in their natural form would not be effective: the most attractive, costly or desirable accommodations are usually in the centre of the city, and many features change not with an increase or decrease of latitude/longitude as such but rather with the distance from that centre. Nevertheless, the location of an accommodation may be a very important feature in the analysis. The problem is to decide where the centre of the city is, which should not necessarily be the centre understood in the typical way or the geographical centre, but rather the touristic centre or main tourist attraction. For the purpose of this analysis it was assumed that this centre is located at Alexanderplatz, which is a very important tourist attraction and also a central transportation hub in this part of Berlin. Although this assumption is only a guess, it should give much better results than leaving the coordinates unchanged. For every listing, the Euclidean distance between the accommodation and Alexanderplatz was therefore calculated.

3.2. Categorical/ordinal data for the second scenario

  • room_type - type of an accommodation, 3 levels: entire home, private room or shared room
  • host_is_superhost - binary variable informing about the superhost status, an award given to hosts that fulfil certain requirements and a highly valued assessment of a host
  • host_has_profile_pic - binary variable informing whether the host has profile picture on the Airbnb account
  • host_identity_verified - binary variable informing whether the host identity is verified
  • is_location_exact - binary variable informing whether the exact location of the accommodation is stated
  • bed_type - type of the bed in the accommodation, 5 levels: airbed, couch, futon, pull-out sofa, real bed
  • instant_bookable - binary variable informing whether the accommodation can be booked immediately, without prior acceptance from the host
  • cancellation_policy - ordinal variable informing about the rules concerning the booking cancellation, 5 levels from the most flexible to the most strict one.

4. First scenario for continuous data

4.1. Dataset preparation

listings<-read.csv("listings.csv", sep=",", dec=".", header=TRUE) 
summar<-read.csv("listings_summary.csv", sep=",", dec=".", header=TRUE)

listings<-listings[,c("id","neighbourhood_group","latitude","longitude","room_type","price","minimum_nights","number_of_reviews","reviews_per_month","calculated_host_listings_count","availability_365")]

summar<-summar[,c("host_is_superhost","host_has_profile_pic","host_identity_verified","is_location_exact","accommodates","bathrooms","bedrooms","beds","bed_type","amenities","security_deposit","cleaning_fee","review_scores_accuracy","review_scores_cleanliness","review_scores_checkin","review_scores_communication","review_scores_location","review_scores_value","instant_bookable","cancellation_policy")]

database<-cbind(listings, summar)
rm("listings","summar")

database <- subset(database, database$neighbourhood_group=="Mitte")

database <- na.omit(database)

database <- database[!(database$host_is_superhost=="" & database$host_has_profile_pic=="" & database$host_identity_verified==""),]


database$amenities <- as.character(database$amenities)
database$amenities <- ifelse(nchar(database$amenities)>2, str_count(database$amenities, ',')+1, 0)

database$security_deposit <- database$security_deposit  %>% str_extract_all("\\(?[0-9,.]+\\)?") %>% gsub(",", "", .) %>% as.numeric()
database$security_deposit[is.na(database$security_deposit)] <- 0

database$cleaning_fee <- database$cleaning_fee  %>% str_extract_all("\\(?[0-9,.]+\\)?") %>% gsub(",", "", .) %>% as.numeric()
database$cleaning_fee[is.na(database$cleaning_fee)] <- 0

database$dist<-sqrt((database$latitude-52.5195)^2+(database$longitude-13.4072)^2)

dane <- subset(database, select = -c(id, longitude, latitude, room_type, host_is_superhost, host_has_profile_pic, host_identity_verified, is_location_exact, neighbourhood_group, bed_type, instant_bookable, cancellation_policy))

4.2. Prediagnostics - clustering tendency

Before the actual clustering it is usually worthwhile to assess the general clustering tendency of the data.

hopkins(dane, n=100)
## $H
## [1] 0.008273232
get_clust_tendency(dane, 100, graph=FALSE, gradient=list(low="red", mid="white", high="blue"))$hopkins_stat 
## [1] 0.009559817

Clustering tendency can be assessed with the Hopkins statistic. In this implementation values close to zero allow us to reject the null hypothesis that the data is uniformly distributed (i.e. has no meaningful cluster structure), so the data seems to be clusterable.
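For reference (this check is not part of the original analysis), the same statistic computed on uniformly random data of the same shape should land near 0.5 in this implementation, which illustrates the interpretation above:

set.seed(123)
# uniform noise spanning the same ranges as the real variables
unif_data <- as.data.frame(apply(dane, 2, function(x) runif(length(x), min(x), max(x))))
hopkins(unif_data, n=100)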

4.3. K-means

The K-means clustering method is probably the most common one. Its goal is to minimize the total within-cluster sum of squares. It is simple, but the number of clusters k must be chosen beforehand. There are many methods to do that, so it is appropriate to check a few of them.
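To make that criterion concrete, here is a small helper (an illustrative sketch, not part of the original analysis) that computes the total within-cluster sum of squares for a given partition; for a fitted model it should agree, up to numerical error, with the tot.withinss value reported by kmeans().

# total within-cluster sum of squares: for each cluster, sum the squared
# deviations of its observations from the cluster mean, then add them up
total_wss <- function(data, clusters) {
  sum(sapply(split(as.data.frame(data), clusters), function(g) {
    centred <- scale(g, center = TRUE, scale = FALSE)
    sum(centred^2)
  }))
}
# e.g. total_wss(dane, kmeans2$cluster)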

Optimal number of clusters

fviz_nbclust(dane, FUNcluster = kmeans, method = c("silhouette"), k.max = 8, nboot = 100,)

The most popular method for this is the silhouette statistic. The higher its value, the better, so here it suggests choosing two clusters.

fviz_nbclust(dane, FUNcluster = kmeans, method = c("wss"), k.max = 8, nboot = 100,)

opt<-Optimal_Clusters_KMeans(dane, max_clusters=8, plot_clusters=TRUE, criterion="AIC") 

opt2<-Optimal_Clusters_KMeans(dane, max_clusters=8, plot_clusters=TRUE, criterion="BIC") 

The total within sum of squares, AIC and BIC should take as small a value as possible; however, here one can see that generally the higher the number of clusters, the smaller the value of the statistic. Because of that it is more effective to use the elbow method, which suggests choosing the number of clusters at the point beyond which adding more clusters improves the statistic significantly less. Based on that, the total within sum of squares suggests choosing 2, 3 or 5 clusters, and AIC and BIC also point to 2 or 3, maybe 5 clusters. Considering this and the silhouette statistic, 2, 3 and 5 will be used as the values of k in the K-means algorithm.

Results and post-diagnostics

kmeans2 <- eclust(dane, "kmeans", hc_metric="euclidean", k=2)

kmeans3 <- eclust(dane, "kmeans", hc_metric="euclidean", k=3)

kmeans5 <- eclust(dane, "kmeans", hc_metric="euclidean", k=5)

fviz_cluster(kmeans2, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())

fviz_silhouette(kmeans2)
##   cluster size ave.sil.width
## 1       1  287          0.17
## 2       2 3419          0.73

fviz_cluster(kmeans3, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())

fviz_silhouette(kmeans3)
##   cluster size ave.sil.width
## 1       1  111          0.27
## 2       2 2396          0.68
## 3       3 1199          0.06

fviz_cluster(kmeans5, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())

fviz_silhouette(kmeans5)
##   cluster size ave.sil.width
## 1       1   39          0.42
## 2       2 2108          0.64
## 3       3  746          0.29
## 4       4  269          0.16
## 5       5  544          0.37

Because of the number of variables in the model (which results in many dimensions), the differences between clusters are not very clearly visible on the cluster plots. In spite of the high overall value of the silhouette statistic in the model with two clusters, it can be seen that while the second cluster is much bigger and has a good average silhouette width, the first cluster is much worse and its silhouette width even takes negative values. However, the two other clusterings have worse results: there is still one good cluster, but the rest take small or even negative average silhouette widths.

round(calinhara(dane,kmeans2$cluster),digits=2)
## [1] 2408.93
round(calinhara(dane,kmeans3$cluster),digits=2) 
## [1] 2216.59
round(calinhara(dane,kmeans5$cluster),digits=2)
## [1] 2379.84

Additionally, there are some other measures of clustering quality. The Calinski-Harabasz index is one of them, designed for comparing clusterings with different numbers of clusters. The highest value, which indicates the best clustering, is obtained for 2 clusters.
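For reference, the index can be written out explicitly as the ratio of between-cluster to within-cluster dispersion, each scaled by its degrees of freedom. The sketch below is illustrative only (it is not the fpc implementation) and should roughly reproduce the calinhara() values above.

ch_index <- function(data, clusters) {
  data <- as.matrix(data)
  n <- nrow(data)
  k <- length(unique(clusters))
  overall <- colMeans(data)
  # cluster means (k x p matrix) and cluster sizes, in the same sorted label order
  centers <- apply(data, 2, function(col) tapply(col, clusters, mean))
  sizes <- as.numeric(table(clusters))
  B <- sum(sizes * rowSums(sweep(centers, 2, overall)^2))   # between-cluster sum of squares
  W <- sum((data - centers[as.character(clusters), ])^2)    # within-cluster sum of squares
  (B / (k - 1)) / (W / (n - k))
}
# e.g. ch_index(dane, kmeans2$cluster)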

dudahart2(dane,kmeans2$cluster)$p.value
## [1] 0
dudahart2(dane,kmeans3$cluster)$p.value
## [1] 0
dudahart2(dane,kmeans5$cluster)$p.value
## [1] 0

The Duda-Hart test has the null hypothesis that the data should not be split into two clusters. A p-value equal to 0 in every case means that the null hypothesis is rejected and the clusters should be split further, but as was said before, splitting the dataset into too many clusters is not really preferred.

Taking all these results into account, the division into two clusters will be considered the best one. The differences in coordinates between the centers of the two clusters in that model are presented below.

kmeans2$centers
##       price minimum_nights number_of_reviews reviews_per_month
## 1 118.48780      24.515679          22.06272          1.093868
## 2  63.98187       6.948815          27.73443          1.374452
##   calculated_host_listings_count availability_365 accommodates bathrooms
## 1                       2.843206        160.39721     3.372822  1.209059
## 2                       2.582626         85.11612     2.811348  1.097397
##   bedrooms     beds amenities security_deposit cleaning_fee
## 1 1.484321 2.066202  22.11498        854.11150     62.19861
## 2 1.155894 1.707809  17.40041         61.63703     18.39807
##   review_scores_accuracy review_scores_cleanliness review_scores_checkin
## 1               9.693380                  9.508711              9.651568
## 2               9.639368                  9.269669              9.676806
##   review_scores_communication review_scores_location review_scores_value
## 1                    9.745645               9.634146            9.337979
## 2                    9.698450               9.462708            9.366481
##         dist
## 1 0.03074756
## 2 0.03870296

4.4. PAM

Partitioning Around Medoids is a method somewhat similar to K-means, but it chooses actual points in the dataset (medoids) as the centers of clusters. It aims to minimize the sum of dissimilarities between the points in a cluster and the medoid of that cluster. Here, too, the number of clusters must be determined beforehand.

Optimal number of clusters

fviz_nbclust(dane, FUNcluster = cluster::pam, method = c("silhouette"), k.max = 8, nboot = 100,)

This time the silhouette statistic indicates choosing four clusters as the best option; however, splitting the data into two clusters also seems quite good.

fviz_nbclust(dane, FUNcluster = cluster::pam, method = c("wss"), k.max = 8, nboot = 100,) 

The total within sum of squares again suggests that the higher the number of clusters, the better, but there is a clearly visible elbow at four clusters. Other potential options are six and two clusters.

Results and post-diagnostics
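The code fitting the PAM models is not shown above; a minimal sketch consistent with the eclust() calls used for K-means (an assumption about how pam2 and pam4 were obtained) could be:

pam2 <- eclust(dane, "pam", k=2, graph = FALSE)
pam4 <- eclust(dane, "pam", k=4, graph = FALSE)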

fviz_cluster(pam2, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())

fviz_silhouette(pam2)
##   cluster size ave.sil.width
## 1       1 1020         -0.04
## 2       2 2686          0.67

fviz_cluster(pam4, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())

fviz_silhouette(pam4)
##   cluster size ave.sil.width
## 1       1  717          0.20
## 2       2  735          0.32
## 3       3 2130          0.69
## 4       4  124          0.15

Overall, the average silhouette width in both PAM models is lower than in the model chosen for K-means. The difference between two and four clusters is quite small. There is the same problem of one good cluster and the rest being much worse, with some silhouette values below zero (especially in the first cluster of the two-cluster split). Still, splitting the data into four clusters seems to give more reasonable and balanced results and it will be considered the optimal choice.

4.5. Combination of dimension reduction and clustering techniques

Standardization

Standardization is an important step before PCA and MDS, as it rescales every variable to zero mean and unit variance.

dane <- data.Normalization(dane, type="n1",normalization="column")

Dimension reduction

Dimension reduction techniques are used to reduce the number of factors (variables) when there are many of them and at least some are correlated with each other. By using dimension reduction it is possible to replace the initial factors with a smaller set of principal variables, for example to make the analysis easier or to avoid overfitting the model.

MDS

Multidimensional scaling is a dimension reduction technique which makes it possible to present the data in two dimensions, regardless of the number of variables used. One of its great advantages is the ease of detecting outliers, so it will be performed before the next steps of the analysis.

rownames(dane) <- NULL

dist.dane <- dist(dane)

mds1 <- cmdscale(dist.dane, k=2)

plot(mds1)
abline(v=9)
abline(h=9)
x.out<-which(mds1[,1]>9)
y.out<-which(mds1[,2]>9)
out.all<-c(x.out, y.out)
out.uni<-unique(out.all)
d<-dane[out.uni,]
d$x.out<-mds1[out.uni,1]
d$y.out<-mds1[out.uni,2]
points(d[,21:22], pch=21, bg="red")

full<-1:3706
limited<-rownames(d)
dane$mark<-full %in% limited
describeBy(dane[,1:20], dane$mark)
## 
##  Descriptive statistics by group 
## group: FALSE
##                                vars    n  mean   sd median trimmed  mad
## price                             1 3683 -0.03 0.68  -0.22   -0.15 0.35
## minimum_nights                    2 3683  0.00 1.00  -0.07   -0.06 0.02
## number_of_reviews                 3 3683  0.00 1.00  -0.41   -0.24 0.19
## reviews_per_month                 4 3683  0.00 1.00  -0.35   -0.19 0.47
## calculated_host_listings_count    5 3683  0.00 1.00  -0.29   -0.23 0.00
## availability_365                  6 3683  0.00 1.00  -0.63   -0.17 0.16
## accommodates                      7 3683 -0.01 0.95  -0.48   -0.16 0.00
## bathrooms                         8 3683 -0.01 0.97  -0.33   -0.28 0.00
## bedrooms                          9 3683 -0.01 0.98  -0.27   -0.11 0.00
## beds                             10 3683 -0.01 0.96  -0.55   -0.23 0.00
## amenities                        11 3683  0.00 1.00  -0.29   -0.13 0.77
## security_deposit                 12 3683  0.00 1.00  -0.43   -0.22 0.00
## cleaning_fee                     13 3683 -0.01 0.99  -0.22   -0.16 0.73
## review_scores_accuracy           14 3683  0.03 0.91   0.48    0.21 0.00
## review_scores_cleanliness        15 3683  0.02 0.95   0.68    0.20 0.00
## review_scores_checkin            16 3683  0.02 0.94   0.48    0.21 0.00
## review_scores_communication      17 3683  0.02 0.90   0.43    0.22 0.00
## review_scores_location           18 3683  0.02 0.95   0.67    0.20 0.00
## review_scores_value              19 3683  0.02 0.92   0.77    0.18 0.00
## dist                             20 3683  0.00 1.00   0.00   -0.04 1.33
##                                   min   max range  skew kurtosis   se
## price                           -0.81  8.68  9.49  3.73    24.38 0.01
## minimum_nights                  -0.09 58.54 58.63 54.44  3152.05 0.02
## number_of_reviews               -0.55  8.75  9.30  3.44    15.31 0.02
## reviews_per_month               -0.74 19.34 20.08  5.09    67.52 0.02
## calculated_host_listings_count  -0.29  7.73  8.03  5.84    38.22 0.02
## availability_365                -0.73  2.21  2.94  1.10    -0.35 0.02
## accommodates                    -1.05  7.46  8.51  2.11     6.56 0.02
## bathrooms                       -3.42  8.96 12.38  3.35    13.82 0.02
## bedrooms                        -1.78  5.75  7.53  1.58     4.40 0.02
## beds                            -1.29 10.63 11.92  2.94    12.90 0.02
## amenities                       -1.86  5.15  7.00  1.39     2.35 0.02
## security_deposit                -0.43 14.28 14.71  5.85    52.76 0.02
## cleaning_fee                    -0.71 12.36 13.07  5.12    51.44 0.02
## review_scores_accuracy          -7.62  0.48  8.11 -2.78    11.49 0.01
## review_scores_cleanliness       -6.95  0.68  7.63 -2.16     7.01 0.02
## review_scores_checkin          -11.24  0.48 11.72 -3.04    15.37 0.02
## review_scores_communication     -9.68  0.43 10.11 -3.23    16.22 0.01
## review_scores_location          -6.98  0.67  7.65 -1.88     5.48 0.02
## review_scores_value             -6.47  0.77  7.24 -1.58     4.25 0.02
## dist                            -1.61  2.38  3.99  0.25    -1.22 0.02
## -------------------------------------------------------- 
## group: TRUE
##                                vars  n  mean   sd median trimmed  mad
## price                             1 23  4.13 8.60  -0.28    2.41 0.46
## minimum_nights                    2 23  0.00 0.12  -0.06   -0.02 0.03
## number_of_reviews                 3 23 -0.39 0.36  -0.53   -0.47 0.03
## reviews_per_month                 4 23 -0.41 0.41  -0.55   -0.48 0.19
## calculated_host_listings_count    5 23  0.60 1.29  -0.11    0.44 0.27
## availability_365                  6 23  0.40 1.13   0.55    0.33 1.90
## accommodates                      7 23  1.91 3.53   0.08    1.63 1.68
## bathrooms                         8 23  1.56 2.68  -0.33    1.06 0.00
## bedrooms                          9 23  1.17 2.38  -0.27    1.00 2.23
## beds                             10 23  1.72 3.30   0.20    1.18 1.10
## amenities                        11 23  0.03 1.51  -0.71   -0.20 0.77
## security_deposit                 12 23  0.21 1.32  -0.43   -0.12 0.00
## cleaning_fee                     13 23  0.93 2.19   0.46    0.52 1.74
## review_scores_accuracy           14 23 -4.04 3.66  -4.92   -3.86 4.01
## review_scores_cleanliness        15 23 -2.80 3.12  -3.14   -2.73 5.66
## review_scores_checkin            16 23 -2.84 3.43  -2.45   -2.45 4.34
## review_scores_communication      17 23 -3.84 4.05  -2.46   -3.52 4.28
## review_scores_location           18 23 -2.71 2.92  -1.88   -2.49 3.78
## review_scores_value              19 23 -3.85 3.37  -4.06   -3.80 3.58
## dist                             20 23  0.46 1.02   0.64    0.50 1.16
##                                   min   max range  skew kurtosis   se
## price                           -0.67 28.85 29.52  1.71     1.52 1.79
## minimum_nights                  -0.09  0.25  0.34  1.11    -0.56 0.03
## number_of_reviews               -0.55  0.81  1.37  2.28     3.98 0.07
## reviews_per_month               -0.72  0.65  1.37  1.62     1.29 0.08
## calculated_host_listings_count  -0.29  2.99  3.28  1.08    -0.62 0.27
## availability_365                -0.73  2.21  2.94  0.39    -1.41 0.23
## accommodates                    -1.05  7.46  8.51  0.72    -1.31 0.74
## bathrooms                       -0.33  8.96  9.28  1.29     0.77 0.56
## bedrooms                        -1.78  5.75  7.53  0.73    -0.70 0.50
## beds                            -0.55  9.14  9.69  1.16    -0.25 0.69
## amenities                       -1.54  4.00  5.54  1.13     0.30 0.32
## security_deposit                -0.43  4.79  5.21  2.25     4.43 0.27
## cleaning_fee                    -0.71  7.46  8.17  1.60     1.67 0.46
## review_scores_accuracy         -10.33  0.48 10.81 -0.41    -0.92 0.76
## review_scores_cleanliness       -6.95  0.68  7.63 -0.16    -1.65 0.65
## review_scores_checkin          -11.24  0.48 11.72 -0.62    -0.64 0.71
## review_scores_communication    -11.13  0.43 11.56 -0.33    -1.39 0.84
## review_scores_location          -9.53  0.67 10.20 -0.61    -0.66 0.61
## review_scores_value             -8.88  0.77  9.65 -0.12    -1.41 0.70
## dist                            -1.42  1.88  3.30 -0.37    -1.04 0.21

From the plot above it can be seen that some observations indeed have MDS coordinates quite different from the majority of the points. Those points were marked in red and their details were printed. The outliers themselves can also be interesting to analyse: are they simply offers different from the others, offers with some mistakes, or maybe even suspicious offers? In any case, further analysis will be performed without those points to increase its quality.

dane <- subset(dane, dane$mark == FALSE, select = -c(mark))
rm(d)

PCA

Principal Component Analysis concentrates on finding the directions with the greatest variability of the data. Transforming the data and reducing its dimensionality with PCA is one of the common ways of improving clustering, among other things by limiting redundant features.

pca <- PCA(dane, ncp=6, scale.unit = TRUE, graph=FALSE)
get_eigenvalue(pca)
##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1   3.8122024       19.0610120                    19.06101
## Dim.2   2.9281338       14.6406688                    33.70168
## Dim.3   1.7585210        8.7926050                    42.49429
## Dim.4   1.4685993        7.3429963                    49.83728
## Dim.5   1.1735753        5.8678764                    55.70516
## Dim.6   1.0110116        5.0550580                    60.76022
## Dim.7   0.9669701        4.8348504                    65.59507
## Dim.8   0.8696153        4.3480766                    69.94314
## Dim.9   0.8106810        4.0534048                    73.99655
## Dim.10  0.6759204        3.3796021                    77.37615
## Dim.11  0.6714952        3.3574762                    80.73363
## Dim.12  0.6009311        3.0046557                    83.73828
## Dim.13  0.5061537        2.5307687                    86.26905
## Dim.14  0.4907879        2.4539395                    88.72299
## Dim.15  0.4789834        2.3949168                    91.11791
## Dim.16  0.4363484        2.1817418                    93.29965
## Dim.17  0.4154900        2.0774501                    95.37710
## Dim.18  0.3913505        1.9567525                    97.33385
## Dim.19  0.3733984        1.8669922                    99.20084
## Dim.20  0.1598313        0.7991563                   100.00000
fviz_eig(pca, addlabels=TRUE)

There are different approaches to choosing the number of dimensions to keep. One is to keep the dimensions whose eigenvalues are greater than 1, which means six dimensions in this case. These are also exactly the dimensions that each explain at least 5% of the variance. On the plot, the elbow method suggests the same number. Those six dimensions explain about 60% of the total variance, which is not perfect but acceptable; that is the share of variance that will be kept in the data after the dimension reduction.

pcavar <- get_pca_var(pca)
fviz_pca_var(pca, col.var="cos2", alpha.var="contrib", gradient.cols = c("blue", "green", "red"), repel = TRUE)

The variable correlation plot above shows the relationships between the variables for the first two dimensions: positively correlated variables are on the same side, negatively correlated ones on opposite sides, and the length of the arrows as well as the color of the variables denote the quality of their representation. Additionally, the contributions of the variables to these two main principal components are shown through their transparency on the plot. The contributions of the top ten variables to each of the six chosen dimensions are presented on the bar plots below, with the expected average contribution of 5% marked with the red line.

fviz_contrib(pca, choice = "var", axes = 1, top = 10)

fviz_contrib(pca, choice = "var", axes = 2, top = 10)

fviz_contrib(pca, choice = "var", axes = 3, top = 10)

fviz_contrib(pca, choice = "var", axes = 4, top = 10)

fviz_contrib(pca, choice = "var", axes = 5, top = 10)

fviz_contrib(pca, choice = "var", axes = 6, top = 10)

fviz_pca_ind(pca, col.ind = "cos2", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             )

The last graph represents the observations and their quality of representation in the first two principal components.

The next steps will be performed on the data prepared with the PCA with the six principal components kept.

Hierarchical Clustering

Non-hierarchical clustering algorithms like K-means and PAM are not very affected by outliers, but they need the number of clusters stated beforehand and their results depend considerably on the initial centers. On the other hand, hierarchical algorithms do not need an initial number of clusters, but they are sensitive to outliers and cannot be corrected ex post. However, the two approaches can be used together. This time the optimal number of clusters will be chosen by analysing the dendrogram.

dendr1 <- HCPC(pca, -1, graph = F)

fviz_dend(dendr1, 
          palette = "jco", 
          rect = TRUE, rect_fill = TRUE, 
          rect_border = "jco",          
)

The suggested partition is based on the highest relative loss of inertia. This partition indeed seems to be the best one, so it will be chosen. The dendrogram with the distribution of the observations in the first two dimensions is presented on the graph below.

plot(dendr1, choice = "3D.map")

K-means on PCA

Prediagnostics

datapca <- data.frame(pca$ind$coord)

hopkins(datapca, n=100)
## $H
## [1] 0.07239851
get_clust_tendency(datapca, 100, graph=F, gradient=list(low="red", mid="white", high="blue"))$hopkins_stat
## [1] 0.071698

The data based on the principal components is also clusterable, as the Hopkins statistic takes values close to zero in both tests.

Results and post-diagnostics

fviz_nbclust(datapca, FUNcluster = kmeans, method = c("silhouette"), k.max = 8, nboot = 100,)

According to the silhouette statistic the optimal number of clusters is two, but even then the value is low. Splitting the data into three clusters is the next best option. In contrast to the dendrogram results, four clusters do not look like a very good solution here.

kmeanspca4 <- eclust(datapca, "kmeans", hc_metric="euclidean", k=4, graph = F)
fviz_cluster(kmeanspca4, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())

fviz_silhouette(kmeanspca4)
##   cluster size ave.sil.width
## 1       1 1904          0.39
## 2       2  573          0.13
## 3       3  752          0.10
## 4       4  454          0.05

As already noted, the average silhouette width is not very satisfying. In particular, the smaller clusters have small, often even negative, silhouette values.

kmeanspca2 <- eclust(datapca, "kmeans", hc_metric="euclidean", k=2, graph = F)
kmeanspca3 <- eclust(datapca, "kmeans", hc_metric="euclidean", k=3, graph = F)

round(calinhara(datapca,kmeanspca2$cluster),digits=2) 
## [1] 1017.2
round(calinhara(datapca,kmeanspca3$cluster),digits=2) 
## [1] 975.84
round(calinhara(datapca,kmeanspca4$cluster),digits=2)
## [1] 908.34

According to the Calinski-Harabasz index we should prefer two clusters over three and four, just as with the silhouette statistic, so it seems that the number of clusters chosen from the hierarchical clustering might not be optimal for K-means.

5. Second scenario for mixed data

5.1. Data preparation

This time, apart from all of the variables used so far, let’s add eight categorical/ordinal variables and prepare the new dataset “dane2”.

dane2 <- subset(database, select = -c(id, longitude, latitude, neighbourhood_group))

5.2. K-prototypes

K-means is a popular algorithm for clustering continuous data. For categorical variables K-modes can be used; in the simplest words it is an alternative and also an extension of K-means which, as its name suggests, uses modes instead of means. K-prototypes is an approach that combines K-means and K-modes: it uses the Euclidean distance for the numeric variables and the simple matching coefficient for the categorical variables in the dataset, as sketched below.
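As a minimal sketch of that combined dissimilarity (a hypothetical helper, not part of clustMixType), the distance between an observation and a prototype can be written as the squared Euclidean distance on the numeric part plus lambda times the number of mismatches on the categorical part:

# x_num/p_num: numeric parts of the observation and the prototype
# x_cat/p_cat: categorical parts; lambda: weight of the categorical mismatches
kproto_dist <- function(x_num, p_num, x_cat, p_cat, lambda) {
  sum((x_num - p_num)^2) + lambda * sum(x_cat != p_cat)
}
# e.g. kproto_dist(c(50, 2), c(60, 3), c("Private room", "t"), c("Private room", "f"), lambda = 1)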

Lambda parameter

The trade-off between the weights of these two distance components is controlled by the lambda parameter. By default it is calculated from the variances of the numeric variables. In this analysis let's compare that approach with one using the standard deviation instead of the variance.
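The two variants can be inspected side by side with lambdaest(); the assumption here is that num.method = 1 corresponds to the variance-based default, by analogy with num.method = 2 used later for the standard deviation.

lambda_var <- lambdaest(dane2, num.method = 1)  # lambda based on variances (default behaviour)
lambda_sd  <- lambdaest(dane2, num.method = 2)  # lambda based on standard deviations
lambda_var
lambda_sd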

Choosing the optimal number of clusters

For lambda calculated with variance

Es <- numeric(10)
for(i in 1:10){
  kpres <- kproto(dane2, k = i)
  Es[i] <- kpres$tot.withinss
}

Essil <- numeric(10)
for(i in 1:10){
  kpres <- kproto(dane2, k = i)
  Essil[i] <- silhouette_kproto(kpres)
}

Esmcclain <- numeric(10)
for(i in 2:10){
  kpres <- kproto(dane2, k = i)
  Esmcclain[i] <- mcclain_kproto(kpres)
}
plot(1:10, Es, type = "b", ylab = "Total Within Sum Of Squares", xlab = "Number of clusters")

According to the total within sum of squares, the more clusters the better. According to the elbow method, two, four or six clusters should be considered.

plot(1:10, Essil, type = "b", ylab = "Silhouette", xlab = "Number of clusters")

The silhouette statistic strongly suggests choosing two clusters.

plot(1:10, Esmcclain, type = "b", ylab = "McClain-Rao index", xlab = "Number of clusters")

The McClain-Rao index should take as small a value as possible, like the total within sum of squares, so it also suggests two clusters.

For lambda calculated with standard deviation

Es2 <- numeric(10)
for(i in 1:10){
  kpres <- kproto(dane2, k = i, lambdaest(dane2, num.method = 2))
  Es2[i] <- kpres$tot.withinss
}

Essil2 <- numeric(10)
for(i in 1:10){
  kpres <- kproto(dane2, k = i, lambdaest(dane2, num.method = 2))
  Essil2[i] <- silhouette_kproto(kpres)
}

Esmcclain2 <- numeric(10)
for(i in 2:10){
  kpres <- kproto(dane2, k = i, lambdaest(dane2, num.method = 2))
  Esmcclain2[i] <- mcclain_kproto(kpres)
}
plot(1:10, Es2, type = "b", ylab = "Total Within Sum Of Squares", xlab = "Number of clusters")

The total within sum of squares again suggests choosing a high number of clusters, with some indication for five or seven clusters.

plot(1:10, Essil2, type = "b", ylab = "Silhouette", xlab = "Number of clusters")

According to the silhouette statistic, two clusters should be chosen.

plot(1:10, Esmcclain2, type = "b", ylab = "McClain-Rao index", xlab = "Number of clusters")

This time the best option based on the McClain-Rao index is nine clusters; however, among smaller numbers of clusters, two and four should be considered.

Clustering with K-Proto

kproto2var <- kproto(dane2, 2)
kproto2sd <- kproto(dane2, 2, lambdaest(dane2, num.method = 2))
kproto2var$centers
kproto2sd$centers

The results show the centers of the clusters when splitting the data into two groups, respectively with lambda obtained from the variance (kproto2var) and from the standard deviation (kproto2sd). One can see that in both cases the centers have the same values of the binary variables, but the type of room and the cancellation policy differ between the centers.

mcclain_kproto(kproto2var)
## [1] 0.1191884
mcclain_kproto(kproto2sd)
## [1] 0.08075219
silhouette_kproto(kproto2var) 
## [1] 0.8119357
silhouette_kproto(kproto2sd)
## [1] 0.8483459

The McClain-Rao index is slightly lower and the silhouette statistic slightly higher for the clustering obtained with the lambda based on the standard deviation, so it is fair to assume that it is a slightly better clustering. The silhouette statistic is really high, so it seems that this approach worked pretty well.

5.3. Gower distance

Another popular approach for mixed data is to use the Gower distance, which, in contrast to for example the Euclidean distance, can deal with different data types. That is because the distance metric is chosen for every variable according to its type: the Manhattan distance for interval variables and for ordinal ones (which are ranked first), and the Dice coefficient for nominal variables. Having this kind of distance matrix, a suitable clustering algorithm can be chosen. In this analysis the clustering will be performed with the Partitioning Around Medoids algorithm.
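Note that daisy() treats a variable as ordinal only if it is stored as an ordered factor. A hedged sketch (not part of the original pipeline) of how cancellation_policy could be declared ordinal, using the level names that appear later in the dummy coding:

# assumption: these are the five levels ordered from most flexible to most strict
dane2$cancellation_policy <- factor(dane2$cancellation_policy,
                                    levels = c("flexible", "moderate",
                                               "strict_14_with_grace_period",
                                               "super_strict_30", "super_strict_60"),
                                    ordered = TRUE)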

Choosing the optimal number of clusters

gower_dist <- daisy(dane2,
                    metric = "gower",
                    type = list(logratio = 3))

Esgower <- numeric(10)
for(i in 2:10){
  pames <- pam(gower_dist, diss = TRUE, k = i)
  Esgower[i] <- pames$silinfo$avg.width}
plot(1:10, Esgower, type = "b", ylab = "Silhouette", xlab = "Number of Clusters") 

Based on the silhouette statistic, two clusters will be chosen as the optimal partition.

Results

pamgower <- pam(gower_dist, diss = TRUE, k=2)
fviz_silhouette(pamgower)
##   cluster size ave.sil.width
## 1       1 1781          0.10
## 2       2 1925          0.23

Unfortunately, the results aren’t satisfying. The Silhouette statistic is very low for both clusters.

5.4. K-means on categorical attributes converted to binary values

The last approach is to convert the nominal variables to binary dummy variables and treat all of the binary values as if they were numeric zero-one values. This approach is sometimes not recommended for various reasons; however, it is easy to come up with and does not require any knowledge beyond, for example, the K-means technique. Let's check how this method works on this data.

Converting the data

danebin <- database

danebin$host_is_superhost <- as.numeric(ifelse(danebin$host_is_superhost == 't', 1, 0))
danebin$host_has_profile_pic <- as.numeric(ifelse(danebin$host_has_profile_pic == 't', 1, 0))
danebin$host_identity_verified <- as.numeric(ifelse(danebin$host_identity_verified == 't', 1, 0))
danebin$is_location_exact <- as.numeric(ifelse(danebin$is_location_exact == 't', 1, 0))
danebin$instant_bookable <- as.numeric(ifelse(danebin$instant_bookable == 't', 1, 0))

danebin$room_type_apt <- as.numeric(ifelse(danebin$room_type == 'Entire home/apt', 1, 0))
danebin$room_type_priv <- as.numeric(ifelse(danebin$room_type == 'Private room', 1, 0))
danebin$room_type_shared <- as.numeric(ifelse(danebin$room_type == 'Shared room', 1, 0))

danebin$bed_type_airbed <- as.numeric(ifelse(danebin$bed_type == 'Airbed', 1, 0))
danebin$bed_type_couch <- as.numeric(ifelse(danebin$bed_type == 'Couch', 1, 0))
danebin$bed_type_futon <- as.numeric(ifelse(danebin$bed_type == 'Futon', 1, 0))
danebin$bed_type_sofa <- as.numeric(ifelse(danebin$bed_type == 'Pull-out Sofa', 1, 0))
danebin$bed_type_real <- as.numeric(ifelse(danebin$bed_type == 'Real Bed', 1, 0))

danebin$cancellation_policy_flexible <- as.numeric(ifelse(danebin$cancellation_policy == 'flexible', 1, 0))
danebin$cancellation_policy_moderate <- as.numeric(ifelse(danebin$cancellation_policy == 'moderate', 1, 0))
danebin$cancellation_policy_strict14 <- as.numeric(ifelse(danebin$cancellation_policy == 'strict_14_with_grace_period', 1, 0))
danebin$cancellation_policy_strict30 <- as.numeric(ifelse(danebin$cancellation_policy == 'super_strict_30', 1, 0))
danebin$cancellation_policy_strict60 <- as.numeric(ifelse(danebin$cancellation_policy == 'super_strict_60', 1, 0))

danebin <- subset(danebin, select = -c(id, longitude, latitude, neighbourhood_group, room_type, bed_type, cancellation_policy))

Prediagnostics

hopkins(danebin, n=100)
## $H
## [1] 0.01206602
get_clust_tendency(danebin, 2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"))$hopkins_stat
## [1] 0.002272877

Data is clusterable.

fviz_nbclust(danebin, FUNcluster = kmeans, method = c("silhouette"), k.max = 8, nboot = 100,)

According to the silhouette statistic, two clusters should be chosen.

Clustering

fviz_nbclust(danebin, FUNcluster = kmeans, method = c("wss"), k.max = 8, nboot = 100,)

opt<-Optimal_Clusters_KMeans(danebin, max_clusters=8, plot_clusters=TRUE, criterion="AIC")

opt4<-Optimal_Clusters_KMeans(danebin, max_clusters=8, plot_clusters=TRUE, criterion="BIC")

According to the total within sum of squares, AIC and BIC the number of clusters should be high, with an indication for five clusters, or two to three based on the elbow method.

kmeanscat2 <- eclust(danebin, "kmeans", hc_metric="euclidean", k=2)

kmeanscat3 <- eclust(danebin, "kmeans", hc_metric="euclidean", k=3)

kmeanscat5 <- eclust(danebin, "kmeans", hc_metric="euclidean", k=5)

fviz_silhouette(kmeanscat2)
##   cluster size ave.sil.width
## 1       1  287          0.17
## 2       2 3419          0.73

fviz_silhouette(kmeanscat3)
##   cluster size ave.sil.width
## 1       1  111          0.27
## 2       2 2396          0.68
## 3       3 1199          0.06

fviz_silhouette(kmeanscat5)
##   cluster size ave.sil.width
## 1       1   39          0.42
## 2       2 2108          0.64
## 3       3  746          0.29
## 4       4  269          0.16
## 5       5  544          0.37

The average silhouette width is highest for two clusters. All results are quite similar to those obtained with K-means on just the continuous variables in the first scenario: in every case there is one big cluster with the highest silhouette value and the rest of the clusters are rather worse.

Post-diagnostics

round(calinhara(danebin,kmeanscat2$cluster),digits=2)
## [1] 2408.87
round(calinhara(danebin,kmeanscat3$cluster),digits=2) 
## [1] 2216.51
round(calinhara(danebin,kmeanscat5$cluster),digits=2)
## [1] 2379.7

According to the Calinski-Harabasz index we should also prefer splitting the data into two groups, so let's consider it the best option.

5.5. FAMD with K-means

Finally, just like in the first scenario, clustering will be performed on data obtained through dimension reduction, with some help from hierarchical clustering methods.

First, the steps applied to the continuous variables were repeated: the data was standardized and the outliers detected and eliminated, and that data was joined with the remaining variables (a sketch of this step is shown below).
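That joining code is not shown in the original; a hypothetical reconstruction (assuming the standardized, outlier-free dane from the first scenario and the categorical columns of database keep the same row order) might look like:

# keep the eight categorical/ordinal columns only for the rows that survived
# the MDS-based outlier removal, then bind them to the standardized numeric data
cat_cols <- c("room_type", "host_is_superhost", "host_has_profile_pic",
              "host_identity_verified", "is_location_exact", "bed_type",
              "instant_bookable", "cancellation_policy")
rownames(database) <- NULL                      # align row names with dane (1..3706)
dane2 <- cbind(dane, database[rownames(dane), cat_cols])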

FAMD

Factor analysis of mixed data is a technique for performing a principal components method on data with both continuous and categorical variables. It is a combination of Principal Component Analysis for the quantitative variables and Multiple Correspondence Analysis for the qualitative ones.

famd <- FAMD(dane2, ncp=50, graph=FALSE)
get_eigenvalue(famd)
##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1   4.3146660       12.3276172                    12.32762
## Dim.2   3.1470083        8.9914522                    21.31907
## Dim.3   1.9999964        5.7142755                    27.03334
## Dim.4   1.7499815        4.9999471                    32.03329
## Dim.5   1.5144771        4.3270776                    36.36037
## Dim.6   1.2550052        3.5857291                    39.94610
## Dim.7   1.1770949        3.3631283                    43.30923
## Dim.8   1.1179266        3.1940759                    46.50330
## Dim.9   1.0598303        3.0280866                    49.53139
## Dim.10  1.0327728        2.9507795                    52.48217
## Dim.11  1.0171137        2.9060391                    55.38821
## Dim.12  1.0052944        2.8722696                    58.26048
## Dim.13  0.9976618        2.8504624                    61.11094
## Dim.14  0.9771573        2.7918780                    63.90282
## Dim.15  0.9544407        2.7269735                    66.62979
## Dim.16  0.9360554        2.6744440                    69.30424
## Dim.17  0.9200119        2.6286054                    71.93284
## Dim.18  0.9007241        2.5734975                    74.50634
## Dim.19  0.8626151        2.4646145                    76.97095
## Dim.20  0.7970548        2.2772995                    79.24825
## Dim.21  0.7465696        2.1330560                    81.38131
## Dim.22  0.7047723        2.0136353                    83.39494
## Dim.23  0.6574763        1.8785036                    85.27345
## Dim.24  0.6485629        1.8530368                    87.12648
## Dim.25  0.6002737        1.7150676                    88.84155
## Dim.26  0.5272054        1.5063011                    90.34785
## Dim.27  0.4892949        1.3979855                    91.74584
## Dim.28  0.4819052        1.3768719                    93.12271
## Dim.29  0.4415568        1.2615908                    94.38430
## Dim.30  0.4102558        1.1721595                    95.55646
## Dim.31  0.4082420        1.1664057                    96.72287
## Dim.32  0.3757689        1.0736253                    97.79649
## Dim.33  0.3269025        0.9340073                    98.73050
## Dim.34  0.2912427        0.8321219                    99.56262
## Dim.35  0.1530826        0.4373790                   100.00000
fviz_eig(famd, ncp=20, addlabels=TRUE)

Twelve dimensions have eigenvalues equal to 1 or higher and they jointly explain 58.26% of the variance. On the graph it can be seen that six dimensions might be good as well, however they explain only 39.95% of the variance, so twelve dimensions will be chosen.

The graph below presents the correlations between all variables in the first two principal dimensions. The color of the variables shows the quality of their representation.

famd <- FAMD(dane2, ncp=12, graph=FALSE)
famdvar <- get_famd_var(famd)
fviz_famd_var(famd, col.var="cos2", alpha.var="contrib", gradient.cols = c("blue", "green", "red"), repel = TRUE)

Additionally, the contributions of the variables to the six main dimensions are presented below. Some of the categorical variables used in this scenario have a real impact on the dimensions, especially the type of room, whose contribution to the fourth dimension is over 20%, and the fifth dimension, where the contributions of cancellation_policy, instant_bookable, room_type and host_identity_verified are all above the expected average. This shows that including these variables indeed affects the way the principal components are calculated for this data.

fviz_contrib(famd, choice = "var", axes = 1, top = 20)

fviz_contrib(famd, choice = "var", axes = 2, top = 20)

fviz_contrib(famd, choice = "var", axes = 3, top = 20)

fviz_contrib(famd, choice = "var", axes = 4, top = 20)

fviz_contrib(famd, choice = "var", axes = 5, top = 20)

fviz_contrib(famd, choice = "var", axes = 6, top = 20)

Hierarchical clustering

The best partition based on the dendrogram is to split the data into four groups. The distribution of the observations among the groups, obtained from the dendrogram, is also presented below.

dendr2 <- HCPC(famd, -1, graph = F)
fviz_dend(dendr2, 
          palette = "uchicago", 
          rect = TRUE, rect_fill = TRUE, 
          rect_border = "uchicago",       
)

plot(dendr2, choice = "3D.map")

K-means on FAMD

A great thing about dimension reduction is that, thanks to the FAMD, the initial variables, both continuous and categorical, were replaced with the corresponding dimensions. Because of that there is no longer any need to worry about mixed data, and the clustering can now be done with any method applicable to continuous data. Simple K-means clustering with the Euclidean distance will be performed here.

Prediagnostics

datafamd <- data.frame(famd$ind$coord)
hopkins(datafamd, n=100)
## $H
## [1] 0.05153417
get_clust_tendency(datafamd, 2, graph=F, gradient=list(low="red", mid="white", high="blue"))$hopkins_stat
## [1] 0.02346509

The data obtained through the dimension reduction is clusterable.

Results and post-diagnostics
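The K-means object itself is not created in the code shown above; a minimal sketch consistent with the four-group partition suggested by the dendrogram (the silhouette cluster sizes below sum to the 3683 observations) could be:

kmeansfamd <- eclust(datafamd, "kmeans", hc_metric="euclidean", k=4, graph = FALSE)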

fviz_cluster(kmeansfamd, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())

fviz_silhouette(kmeansfamd)
##   cluster size ave.sil.width
## 1       1 1764          0.29
## 2       2  544          0.06
## 3       3  894          0.06
## 4       4  481          0.02

Unfortunately, the silhouette statistic shows that the result of this clustering is poor: all of the clusters have a low silhouette width and three of them have an average silhouette width almost equal to zero.

6. Summary

Although at first sight it may seem that all popular clustering methods are dedicated to datasets consisting only of continuous variables, there are many different approaches which make it possible to explore and search for hidden structures in mixed data as well. There is probably no single universal solution, and the essence of clustering, as always, is to go through enough candidate solutions to find the one that fits the problem at hand.