PCA and Clustering

Workation is a data set that ranks cities by how suitable they are for a workation (work combined with vacation). Alongside the ranking, it contains the 10 variables that were taken into account when choosing the best cities. The data was downloaded from Kaggle.

Exploratory Data Analysis & Cleaning the Data

# Preview the data
head(raw_data, 3)

##   Ranking      City  Country
## 1       1   Bangkok Thailand
## 2       2 New Delhi    India
## 3       3    Lisbon Portugal
##   Remote.connection..Average.WiFi.speed..Mbps.per.second.
## 1                                                      28
## 2                                                      12
## 3                                                      33
##   Co.working.spaces..Number.of.co.working.spaces
## 1                                            117
## 2                                            165
## 3                                             95
##   Caffeine..Average.price.of.buying.a.coffee
## 1                                       1.56
## 2                                       1.42
## 3                                       1.56
##   Travel..Average.price.of.taxi..per.km.
## 1                                   0.82
## 2                                   0.19
## 3                                   0.40
##   After.work.drinks..Average.price.for.2.beers.in.a.bar
## 1                                                  3.08
## 2                                                  2.90
## 3                                                  3.42
##   Accommodation..Average.price.of.1.bedroom.apartment.per.month
## 1                                                        415.18
## 2                                                        179.25
## 3                                                        736.19
##   Food..Average.cost.of.a.meal.at.a.local..mid.level.restaurant
## 1                                                          1.54
## 2                                                          2.90
## 3                                                          7.69
##   Climate..Average.number.of.sunshine.hours
## 1                                      2624
## 2                                      2685
## 3                                      2806
##   Tourist.attractions..Number.of..Things.to.do..on.Tripadvisor
## 1                                                         2262
## 2                                                         2019
## 3                                                         1969
##   Instagramability..Number.of.photos.with..
## 1                                  28386616
## 2                                  28528249
## 3                                  10205538

I decided to change and shorten the column names.

colnames(raw_data)

##  [1] "Ranking"                                                      
##  [2] "City"                                                         
##  [3] "Country"                                                      
##  [4] "Remote.connection..Average.WiFi.speed..Mbps.per.second."      
##  [5] "Co.working.spaces..Number.of.co.working.spaces"               
##  [6] "Caffeine..Average.price.of.buying.a.coffee"                   
##  [7] "Travel..Average.price.of.taxi..per.km."                       
##  [8] "After.work.drinks..Average.price.for.2.beers.in.a.bar"        
##  [9] "Accommodation..Average.price.of.1.bedroom.apartment.per.month"
## [10] "Food..Average.cost.of.a.meal.at.a.local..mid.level.restaurant"
## [11] "Climate..Average.number.of.sunshine.hours"                    
## [12] "Tourist.attractions..Number.of..Things.to.do..on.Tripadvisor" 
## [13] "Instagramability..Number.of.photos.with.."

# Shorten column names for easier use
colnames(raw_data)[4:13] <- c('WiFi', 'CoWorking', 'Coffe', 'Taxi', 'Beer', 'Accommodation', 'Food', 'Sunshine', 'Attractions', 'Instagram')

colnames(raw_data)

##  [1] "Ranking"       "City"          "Country"       "WiFi"         
##  [5] "CoWorking"     "Coffe"         "Taxi"          "Beer"         
##  [9] "Accommodation" "Food"          "Sunshine"      "Attractions"  
## [13] "Instagram"

The variables are:

WiFi is the average WiFi speed (Mbps),

CoWorking is the number of co-working spaces,

Coffe (coffee) is the average price of a coffee,

Taxi is the average price of a taxi per km,

Beer is the average price of two beers in a bar,

Accommodation is the average monthly price of a 1-bedroom apartment,

Food is the average cost of a meal at a local mid-level restaurant,

Sunshine is the average number of sunshine hours,

Attractions is the number of ‘Things to do’ on Tripadvisor,

Instagram is the number of photos with a # hashtag.

Next, I check whether all columns are numeric.

for (i in 1:ncol(raw_data)){
  print(i)
  print(class(raw_data[,i]))
}

## [1] 1
## [1] "integer"
## [1] 2
## [1] "character"
## [1] 3
## [1] "character"
## [1] 4
## [1] "integer"
## [1] 5
## [1] "integer"
## [1] 6
## [1] "numeric"
## [1] 7
## [1] "numeric"
## [1] 8
## [1] "numeric"
## [1] 9
## [1] "numeric"
## [1] 10
## [1] "numeric"
## [1] 11
## [1] "integer"
## [1] 12
## [1] "integer"
## [1] 13
## [1] "integer"
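The loop above can also be written as a single `sapply()` call. This is a minimal sketch on a made-up data frame (`toy` is a hypothetical stand-in for `raw_data`), listing each column's class and the names of the non-numeric columns:

```r
# Hypothetical stand-in for raw_data: one character column, two numeric columns
toy <- data.frame(
  City  = c("Bangkok", "New Delhi", "Lisbon"),
  WiFi  = c(28L, 12L, 33L),
  Coffe = c(1.56, 1.42, 1.56),
  stringsAsFactors = FALSE
)

# Class of every column in one call
col_classes <- sapply(toy, class)
print(col_classes)

# Names of the columns that are not numeric (integer counts as numeric here)
non_numeric <- names(toy)[!sapply(toy, is.numeric)]
print(non_numeric)
```

Applied to the real data, the same two calls would point at City and Country as the columns to drop.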

I removed the City and Country columns (numbers 2 and 3), as they are character identifiers and cannot be used in PCA or distance-based clustering. I also removed the Ranking column (number 1), since I want to compare the clusters with the ranking later. The new, shortened data set is called data.

data <- raw_data[,4:13]
head(data)

##   WiFi CoWorking Coffe Taxi Beer Accommodation  Food Sunshine Attractions
## 1   28       117  1.56 0.82 3.08        415.18  1.54     2624        2262
## 2   12       165  1.42 0.19 2.90        179.25  2.90     2685        2019
## 3   33        95  1.56 0.40 3.42        736.19  7.69     2806        1969
## 4   37       136  1.59 1.01 5.12        768.46 10.25     2591        2739
## 5   17        67  1.22 0.47 2.16        229.55  5.15     2525        1660
## 6   37        40  1.20 0.72 2.40        366.66  4.81     1988        1468
##   Instagram
## 1  28386616
## 2  28528249
## 3  10205538
## 4  62894055
## 5  21293975
## 6  14267880

Then, I checked for NAs:

apply(data, 2, function(x) any(is.na(x)))

##          WiFi     CoWorking         Coffe          Taxi          Beer 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
## Accommodation          Food      Sunshine   Attractions     Instagram 
##         FALSE         FALSE         FALSE         FALSE         FALSE

There were no NAs in my dataset, thus I could proceed to check for outliers.

par(mfrow = c(2, 5))
for (i in 1:ncol(data)){
  boxplot(data[,i], col = 'grey', ylab = colnames(data[i]))
}

par(mfrow = c(1, 1))

There are some outliers, but only a few, and not every variable has them. I will take this into account when choosing a clustering algorithm.
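To put numbers on what the boxplots show, the outliers can be counted per variable. The sketch below uses made-up values (`x` and `toy` are hypothetical) and `boxplot.stats()`, which returns the points lying more than 1.5 times the interquartile range beyond the hinges, i.e. the points a boxplot draws individually:

```r
# Made-up values; 40 is far from the rest
x <- c(10, 12, 11, 13, 12, 11, 40)

# Points beyond 1.5 * IQR from the hinges
out_vals <- boxplot.stats(x)$out
print(out_vals)

# Count flagged points per column of a toy data frame
toy <- data.frame(a = x, b = c(1, 2, 3, 4, 5, 6, 7))
n_out <- sapply(toy, function(col) length(boxplot.stats(col)$out))
print(n_out)
```

Running the last line on `data` instead of `toy` would give the number of boxplot outliers for each of the 10 variables.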

The data set has 10 variables, so I decided to first reduce the dimension using PCA and then perform clustering.

PCA

Let’s look at the correlations between the variables.

corr_data <- cor(data, method="pearson") # nominal values
print(corr_data, digits=2)

##                 WiFi CoWorking Coffe     Taxi  Beer Accommodation   Food
## WiFi           1.000     0.180 0.331  2.4e-01 0.307          0.57  0.409
## CoWorking      0.180     1.000 0.117  1.4e-02 0.134          0.41  0.062
## Coffe          0.331     0.117 1.000  4.1e-01 0.687          0.68  0.603
## Taxi           0.244     0.014 0.413  1.0e+00 0.342          0.48  0.588
## Beer           0.307     0.134 0.687  3.4e-01 1.000          0.67  0.671
## Accommodation  0.574     0.408 0.684  4.8e-01 0.671          1.00  0.690
## Food           0.409     0.062 0.603  5.9e-01 0.671          0.69  1.000
## Sunshine      -0.227    -0.166 0.009 -2.5e-01 0.059         -0.10 -0.170
## Attractions    0.057     0.646 0.013  1.1e-06 0.078          0.31  0.083
## Instagram      0.125     0.736 0.169  2.5e-02 0.274          0.39  0.139
##               Sunshine Attractions Instagram
## WiFi           -0.2275     5.7e-02    0.1251
## CoWorking      -0.1660     6.5e-01    0.7359
## Coffe           0.0090     1.3e-02    0.1689
## Taxi           -0.2545     1.1e-06    0.0246
## Beer            0.0590     7.8e-02    0.2738
## Accommodation  -0.1008     3.1e-01    0.3902
## Food           -0.1701     8.3e-02    0.1391
## Sunshine        1.0000    -1.0e-01    0.0034
## Attractions    -0.1049     1.0e+00    0.6504
## Instagram       0.0034     6.5e-01    1.0000

And the correlation plot:

corrplot(corr_data, type = "lower", order = "alphabet", tl.col = "black", tl.cex = 1, col=colorRampPalette(c("#99FF33", "#CC0066", "black"))(200))

Beer (0.67), Coffee (0.68) and Food (0.69) are strongly correlated with Accommodation; WiFi (0.57) and Taxi (0.48) are moderately correlated with it.

CoWorking (0.65) and Instagram (0.65) are strongly correlated with Attractions.

Coffee (0.69) and Food (0.67) are strongly correlated with Beer.

Food is strongly correlated with Coffee (0.60).

Instagram is strongly correlated with CoWorking (0.74).

Taxi (0.59) and WiFi (0.41) are moderately correlated with Food.
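Pairs like these can also be pulled out of the correlation matrix programmatically instead of being read off the plot. This is a minimal sketch on synthetic data (`toy` and the 0.6 cut-off are assumptions made for illustration, not values from the analysis):

```r
set.seed(1)
# Synthetic data: b is built from a, c is independent noise
a <- rnorm(100)
toy <- data.frame(a = a, b = a + rnorm(100, sd = 0.3), c = rnorm(100))

r <- cor(toy, method = "pearson")

# Keep each pair once (upper triangle) and filter by an assumed cut-off |r| >= 0.6
idx <- which(upper.tri(r) & abs(r) >= 0.6, arr.ind = TRUE)
strong <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = round(r[idx], 2)
)
print(strong)
```

Replacing `toy` with `data` would list every pair of workation variables whose absolute correlation reaches the chosen cut-off.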

Then, I standardized the data:

preproc1 <- preProcess(data, method=c("center", "scale"))
data_s <- predict(preproc1, data)

And calculated the eigenvalues of the covariance matrix (for standardized data this equals the correlation matrix):

data_cov<-cov(data_s)
raw_data_eigen<-eigen(data_cov)
raw_data_eigen$values

##  [1] 3.9415586 2.1782532 1.2112680 0.7846449 0.5236360 0.4125724 0.3414264
##  [8] 0.2410057 0.2187643 0.1468705

head(raw_data_eigen$vectors)

##            [,1]        [,2]        [,3]        [,4]        [,5]        [,6]
## [1,] -0.2893990 -0.08763996  0.27280033  0.79256858 -0.30089931 -0.03819244
## [2,] -0.2349595  0.52385452  0.06584985  0.03430246  0.05107179  0.40938390
## [3,] -0.3779944 -0.21597057 -0.24059344 -0.01963833  0.32990449  0.42374320
## [4,] -0.2894328 -0.25041093  0.27808069 -0.52437050 -0.56434898  0.29174857
## [5,] -0.3866965 -0.16388824 -0.32523657 -0.03620515  0.38098644 -0.22765122
## [6,] -0.4636067 -0.01993322 -0.04106675  0.13889066 -0.10560279  0.03427873
##             [,7]        [,8]        [,9]      [,10]
## [1,]  0.06377226  0.21999066  0.01970819 -0.2472493
## [2,]  0.08608499 -0.59579507  0.07864324 -0.3550872
## [3,] -0.52465156  0.25914526 -0.28270575 -0.2036485
## [4,]  0.09157012  0.16081417  0.22104619 -0.1204785
## [5,]  0.33939859  0.08396046  0.60336568 -0.1823462
## [6,] -0.20970665 -0.28822467  0.15546598  0.7746918

Three eigenvalues are above 1, so following the Kaiser criterion I keep three principal components.
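The two facts used here, that the covariance of standardized data equals the correlation matrix and that components with eigenvalue above 1 are kept, can be checked on a small example. The sketch below uses synthetic data (`toy` is hypothetical) and base R's `scale()` in place of `preProcess()`:

```r
set.seed(42)
# Synthetic data: two pairs of correlated columns on different scales
z1 <- rnorm(200)
z2 <- rnorm(200)
toy <- data.frame(
  x1 = z1 + rnorm(200, sd = 0.5),
  x2 = z1 + rnorm(200, sd = 0.5),
  x3 = 10 * z2 + rnorm(200, sd = 5),
  x4 = 10 * z2 + rnorm(200, sd = 5)
)

toy_s <- scale(toy)  # centre and scale every column to unit variance

# Covariance of standardized data equals the correlation matrix of the raw data
stopifnot(isTRUE(all.equal(cov(toy_s), cor(toy), check.attributes = FALSE)))

ev <- eigen(cor(toy))$values
n_keep <- sum(ev > 1)  # Kaiser criterion: keep components with eigenvalue > 1
print(round(ev, 3))
print(n_keep)
```

Because each standardized variable has variance 1, an eigenvalue above 1 means the component carries more variance than any single original variable.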

pca1<-prcomp(data_s, center=FALSE, scale=FALSE)
pca1

## Standard deviations (1, .., p=10):
##  [1] 1.9853359 1.4758907 1.1005762 0.8858018 0.7236270 0.6423180 0.5843170
##  [8] 0.4909233 0.4677224 0.3832369
## 
## Rotation (n x k) = (10 x 10):
##                       PC1         PC2         PC3         PC4         PC5
## WiFi          -0.28939903 -0.08763996  0.27280033 -0.79256858  0.30089931
## CoWorking     -0.23495954  0.52385452  0.06584985 -0.03430246 -0.05107179
## Coffe         -0.37799444 -0.21597057 -0.24059344  0.01963833 -0.32990449
## Taxi          -0.28943275 -0.25041093  0.27808069  0.52437050  0.56434898
## Beer          -0.38669652 -0.16388824 -0.32523657  0.03620515 -0.38098644
## Accommodation -0.46360666 -0.01993322 -0.04106675 -0.13889066  0.10560279
## Food          -0.40054557 -0.24469054  0.03321884  0.16942140  0.02302925
## Sunshine       0.09728205 -0.02076909 -0.80868713 -0.10143221  0.54599917
## Attractions   -0.18925603  0.52343960  0.03520013  0.17568172  0.15670684
## Instagram     -0.25339173  0.49698717 -0.14915422  0.06964701 -0.00440924
##                       PC6         PC7         PC8         PC9       PC10
## WiFi          -0.03819244  0.06377226 -0.21999066  0.01970819 -0.2472493
## CoWorking      0.40938390  0.08608499  0.59579507  0.07864324 -0.3550872
## Coffe          0.42374320 -0.52465156 -0.25914526 -0.28270575 -0.2036485
## Taxi           0.29174857  0.09157012 -0.16081417  0.22104619 -0.1204785
## Beer          -0.22765122  0.33939859 -0.08396046  0.60336568 -0.1823462
## Accommodation  0.03427873 -0.20970665  0.28822467  0.15546598  0.7746918
## Food          -0.47674753  0.18482220  0.34879373 -0.58407810 -0.1584317
## Sunshine       0.01822636 -0.01455493  0.11108630 -0.03325428 -0.1172021
## Attractions   -0.51816539 -0.52500393 -0.20026221  0.15102063 -0.1627769
## Instagram      0.13097611  0.48549936 -0.48927485 -0.33371059  0.2409611

# PCA summary
pca1$rotation

##                       PC1         PC2         PC3         PC4         PC5
## WiFi          -0.28939903 -0.08763996  0.27280033 -0.79256858  0.30089931
## CoWorking     -0.23495954  0.52385452  0.06584985 -0.03430246 -0.05107179
## Coffe         -0.37799444 -0.21597057 -0.24059344  0.01963833 -0.32990449
## Taxi          -0.28943275 -0.25041093  0.27808069  0.52437050  0.56434898
## Beer          -0.38669652 -0.16388824 -0.32523657  0.03620515 -0.38098644
## Accommodation -0.46360666 -0.01993322 -0.04106675 -0.13889066  0.10560279
## Food          -0.40054557 -0.24469054  0.03321884  0.16942140  0.02302925
## Sunshine       0.09728205 -0.02076909 -0.80868713 -0.10143221  0.54599917
## Attractions   -0.18925603  0.52343960  0.03520013  0.17568172  0.15670684
## Instagram     -0.25339173  0.49698717 -0.14915422  0.06964701 -0.00440924
##                       PC6         PC7         PC8         PC9       PC10
## WiFi          -0.03819244  0.06377226 -0.21999066  0.01970819 -0.2472493
## CoWorking      0.40938390  0.08608499  0.59579507  0.07864324 -0.3550872
## Coffe          0.42374320 -0.52465156 -0.25914526 -0.28270575 -0.2036485
## Taxi           0.29174857  0.09157012 -0.16081417  0.22104619 -0.1204785
## Beer          -0.22765122  0.33939859 -0.08396046  0.60336568 -0.1823462
## Accommodation  0.03427873 -0.20970665  0.28822467  0.15546598  0.7746918
## Food          -0.47674753  0.18482220  0.34879373 -0.58407810 -0.1584317
## Sunshine       0.01822636 -0.01455493  0.11108630 -0.03325428 -0.1172021
## Attractions   -0.51816539 -0.52500393 -0.20026221  0.15102063 -0.1627769
## Instagram      0.13097611  0.48549936 -0.48927485 -0.33371059  0.2409611

summary(pca1)

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.9853 1.4759 1.1006 0.88580 0.72363 0.64232 0.58432
## Proportion of Variance 0.3942 0.2178 0.1211 0.07846 0.05236 0.04126 0.03414
## Cumulative Proportion  0.3942 0.6120 0.7331 0.81157 0.86394 0.90519 0.93934
##                           PC8     PC9    PC10
## Standard deviation     0.4909 0.46772 0.38324
## Proportion of Variance 0.0241 0.02188 0.01469
## Cumulative Proportion  0.9634 0.98531 1.00000

The first three principal components explain about 73% of the variance in the data.
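The 73% figure follows directly from the eigenvalues. As a small worked check, the sketch below copies the eigenvalues printed earlier and recomputes the proportions; the 70% threshold is an assumption chosen only for illustration:

```r
# Eigenvalues copied from the output of raw_data_eigen$values
ev <- c(3.9415586, 2.1782532, 1.2112680, 0.7846449, 0.5236360,
        0.4125724, 0.3414264, 0.2410057, 0.2187643, 0.1468705)

prop_var <- ev / sum(ev)      # proportion of variance per component
cum_var  <- cumsum(prop_var)  # cumulative proportion

print(round(prop_var, 4))
print(round(cum_var, 4))

# Smallest number of components reaching an assumed 70% threshold
n_70 <- which(cum_var >= 0.70)[1]
print(n_70)
```

The eigenvalues sum to 10, the number of standardized variables, so each proportion is simply the eigenvalue divided by 10.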

fviz_eig(pca1)

fviz_pca_var(pca1, repel = TRUE, col.var="contrib")+scale_color_gradient2(low="#99FF33", mid="#CC0066", high="black", midpoint=5)

The darkest variables contribute the most to the first two components. Thus Accommodation, Instagram, CoWorking, Attractions, Food, Beer and Coffee can be considered the most important.

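# PCs contribution
# fviz_contrib() returns ggplot objects, which ignore par(mfrow),
# so the three plots are arranged with gridExtra::grid.arrange() instead
c1 <- fviz_contrib(pca1, "var", axes=1, xtickslab.rt=90)

c2 <- fviz_contrib(pca1, "var", axes=2, xtickslab.rt=90)

c3 <- fviz_contrib(pca1, "var", axes=3, xtickslab.rt=90)

gridExtra::grid.arrange(c1, c2, c3, ncol=3)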

Main contributions to the PCs:

1st PC: Accommodation, Food, Beer and Coffee,

2nd PC: CoWorking, Attractions and Instagram,

3rd PC: mainly Sunshine, with a smaller contribution from Beer.
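The contributions listed above can be recomputed from the loadings. To my understanding, a variable's contribution to a component is its squared loading expressed as a percentage, and a natural reference value is the uniform share of 100 / 10 = 10%. A sketch for the first component, with the loadings copied from the rotation matrix:

```r
# PC1 loadings copied from the rotation matrix printed earlier
pc1 <- c(WiFi = -0.28939903, CoWorking = -0.23495954, Coffe = -0.37799444,
         Taxi = -0.28943275, Beer = -0.38669652, Accommodation = -0.46360666,
         Food = -0.40054557, Sunshine = 0.09728205, Attractions = -0.18925603,
         Instagram = -0.25339173)

# Loadings have unit length, so the squared loadings sum to 1
contrib <- 100 * pc1^2 / sum(pc1^2)
print(round(sort(contrib, decreasing = TRUE), 1))

# Variables above the uniform share of 100 / 10 = 10%
above_avg <- names(contrib)[contrib > 100 / length(contrib)]
print(above_avg)
```

The variables above the 10% line are Accommodation, Food, Beer and Coffe, which agrees with the reading of the first contribution plot.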

Now, I will do clustering on the 3 PCs.

pc_data <- as.data.frame(pca1$x[,1:3])

Clustering

Let’s check whether the data is clusterable using the Hopkins statistic.

# Check whether the data is clusterable
hopkins(pc_data, m=nrow(pc_data)-1)

## [1] 0.9448153

The data is clusterable, as the Hopkins statistic (0.94) is close to 1.
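For intuition about what this number measures, here is a rough base-R sketch of a Hopkins-type statistic on synthetic data. It follows the convention in which values near 1 indicate clustering and values near 0.5 indicate no structure; `hopkins_sketch` is a hypothetical helper written for this illustration, and the `hopkins()` function used above may differ in details such as sampling and the exponent.

```r
set.seed(123)
# Synthetic 2-D data: two tight, well-separated blobs (not the workation data)
X <- rbind(matrix(rnorm(100, mean = 0, sd = 0.2), ncol = 2),
           matrix(rnorm(100, mean = 5, sd = 0.2), ncol = 2))

hopkins_sketch <- function(X, m) {
  n <- nrow(X)
  d <- ncol(X)
  # Distance from point p to its nearest row of Y
  nn_dist <- function(p, Y) {
    P <- matrix(p, nrow(Y), ncol(Y), byrow = TRUE)
    min(sqrt(rowSums((Y - P)^2)))
  }
  # w: nearest-neighbour distances for m sampled real points (self excluded)
  idx <- sample(n, m)
  w <- sapply(idx, function(i) nn_dist(X[i, ], X[-i, , drop = FALSE]))
  # u: nearest-neighbour distances for m uniform points in the bounding box
  U <- sapply(seq_len(d), function(j) runif(m, min(X[, j]), max(X[, j])))
  u <- apply(U, 1, nn_dist, Y = X)
  sum(u^d) / (sum(u^d) + sum(w^d))
}

H <- hopkins_sketch(X, m = 20)
print(H)
```

Real points in clustered data sit much closer to their neighbours than uniformly scattered points do, which pushes the ratio towards 1.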

Let’s find out the best number of clusters.

opt1 <- Optimal_Clusters_KMeans(pc_data, max_clusters=10, plot_clusters = TRUE)

opt2 <- Optimal_Clusters_KMeans(pc_data, max_clusters=10, plot_clusters=TRUE, criterion="silhouette")

The average silhouette is highest for 2 clusters (0.6), so this is my preferred choice. I could also analyze 3 clusters, but the drop from 0.6 for 2 clusters to 0.44 for 3 clusters is substantial. Moreover, a smaller number of clusters reduces complexity.
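The same kind of comparison can be made without the plotting helper. The sketch below is not the `Optimal_Clusters_KMeans()` procedure used above: it swaps in `pam()` from the cluster package and synthetic data (`X` is hypothetical), and reads the average silhouette width for each candidate number of clusters:

```r
library(cluster)  # pam()

set.seed(7)
# Synthetic data with two well-separated groups (not the workation data)
X <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 6), ncol = 2))

# Average silhouette width for each candidate number of clusters
ks <- 2:6
avg_sil <- sapply(ks, function(k) pam(X, k)$silinfo$avg.width)
names(avg_sil) <- ks
print(round(avg_sil, 3))

# Number of clusters with the highest average silhouette width
best_k <- ks[which.max(avg_sil)]
print(best_k)
```

On `pc_data`, the values may differ from the k-means based plot, since the silhouette depends on the clustering algorithm that produced the partition.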

By choosing 2 clusters I would also like to verify the ranking of the best workation cities. I expect the clustering to split the ranking into two parts: the top and the bottom of the ranking. I would like to check whether my clusters confirm it.

I chose PAM (k-medoids) clustering, as it is more robust to outliers and noise than k-means.
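The robustness comes from the choice of cluster centre: k-means uses the mean of the points, while PAM uses a medoid, an actual observation with the smallest total distance to the others. A tiny one-dimensional illustration with made-up values:

```r
# Made-up one-dimensional values with a single extreme point
x <- c(1, 2, 3, 4, 100)

# k-means style centre: the arithmetic mean, pulled towards the outlier
centre_mean <- mean(x)

# k-medoids style centre: the observation with the smallest total distance to all others
total_dist <- rowSums(as.matrix(dist(x)))
centre_medoid <- x[which.min(total_dist)]

print(centre_mean)    # 22
print(centre_medoid)  # 3
```

The single extreme value moves the mean far away from the bulk of the data, while the medoid stays among the typical points.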

k <- pam(pc_data, 2)
summary(k)

## Medoids:
##      ID       PC1        PC2       PC3
## [1,] 27  1.544350  0.2759193 0.2157809
## [2,] 85 -1.451421 -0.5580540 0.5081618
## Clustering vector:
##   [1] 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 2 1 1 2 1
##  [38] 2 2 2 2 2 1 2 2 1 1 2 1 2 1 2 2 1 2 1 2 1 2 1 1 1 2 1 2 1 1 1 2 2 1 1 1 1
##  [75] 2 2 2 1 1 1 1 2 2 1 2 2 1 1 2 1 1 1 2 1 2 1 2 1 2 2 1 2 2 1 2 2 2 2 2 2 2
## [112] 2 1 2 1 2 2 1 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 1 2 2 2 1 2 2 2 2 2 2
## Objective function:
##    build     swap 
## 1.900186 1.786494 
## 
## Numerical information per cluster:
##      size max_diss  av_diss  diameter separation
## [1,]   73 3.767608 1.453106  5.722192   0.550201
## [2,]   74 9.662208 2.115377 11.500497   0.550201
## 
## Isolated clusters:
##  L-clusters: character(0)
##  L*-clusters: character(0)
## 
## Silhouette plot information:
##     cluster neighbor     sil_width
## 11        1        2  0.6438056232
## 104       1        2  0.6407512333
## 68        1        2  0.6385734336
## 72        1        2  0.6227212208
## 49        1        2  0.6210397804
## 54        1        2  0.6182350100
## 81        1        2  0.6153063087
## 16        1        2  0.6140608706
## 74        1        2  0.6138813228
## 27        1        2  0.6103399861
## 61        1        2  0.6089871973
## 32        1        2  0.6023133931
## 21        1        2  0.6017961550
## 115       1        2  0.5999018198
## 34        1        2  0.5983823226
## 5         1        2  0.5947676593
## 67        1        2  0.5907008554
## 23        1        2  0.5899454241
## 46        1        2  0.5880641583
## 91        1        2  0.5877558858
## 130       1        2  0.5876270108
## 29        1        2  0.5859568252
## 25        1        2  0.5837791559
## 10        1        2  0.5814081459
## 20        1        2  0.5762424276
## 51        1        2  0.5734249641
## 15        1        2  0.5721322595
## 26        1        2  0.5695028206
## 87        1        2  0.5684843316
## 121       1        2  0.5675158713
## 79        1        2  0.5671770820
## 43        1        2  0.5623589806
## 17        1        2  0.5579402321
## 135       1        2  0.5543479312
## 19        1        2  0.5436207794
## 84        1        2  0.5370079056
## 90        1        2  0.5303250698
## 88        1        2  0.5292797290
## 18        1        2  0.5268954662
## 6         1        2  0.5261961493
## 62        1        2  0.5165003669
## 47        1        2  0.5130899326
## 80        1        2  0.5043533150
## 37        1        2  0.5011463160
## 92        1        2  0.5005364519
## 9         1        2  0.4954377330
## 56        1        2  0.4844061780
## 96        1        2  0.4776302827
## 14        1        2  0.4762678833
## 101       1        2  0.4680797861
## 94        1        2  0.4619375608
## 64        1        2  0.4528530865
## 1         1        2  0.4468285865
## 66        1        2  0.4392309711
## 2         1        2  0.4371754490
## 7         1        2  0.4337562333
## 60        1        2  0.4317206496
## 28        1        2  0.4278006142
## 3         1        2  0.4255188856
## 13        1        2  0.4122291425
## 8         1        2  0.3320039592
## 35        1        2  0.3301362432
## 24        1        2  0.3158275706
## 113       1        2  0.2963454117
## 58        1        2  0.2877910615
## 73        1        2  0.2746489571
## 31        1        2  0.2470973788
## 78        1        2  0.2436858887
## 98        1        2  0.2306986605
## 141       1        2  0.2037551250
## 118       1        2  0.1953380631
## 71        1        2  0.1926434601
## 137       1        2  0.0830548467
## 77        2        1  0.4567781721
## 40        2        1  0.4506842397
## 103       2        1  0.4465726429
## 138       2        1  0.4419108836
## 124       2        1  0.4402000952
## 107       2        1  0.4399655382
## 95        2        1  0.4237890636
## 89        2        1  0.4164769446
## 38        2        1  0.4125751217
## 147       2        1  0.4093297159
## 112       2        1  0.4085963803
## 85        2        1  0.3932671254
## 76        2        1  0.3920695038
## 50        2        1  0.3920440610
## 105       2        1  0.3898765504
## 106       2        1  0.3868399712
## 145       2        1  0.3859219734
## 42        2        1  0.3799502770
## 127       2        1  0.3797448408
## 134       2        1  0.3743261256
## 117       2        1  0.3714397488
## 144       2        1  0.3707096275
## 22        2        1  0.3697689412
## 100       2        1  0.3610277185
## 30        2        1  0.3595969417
## 128       2        1  0.3494822822
## 33        2        1  0.3462571025
## 52        2        1  0.3392292758
## 93        2        1  0.3217698800
## 70        2        1  0.3112202323
## 83        2        1  0.3108366861
## 53        2        1  0.2902944281
## 36        2        1  0.2841143396
## 44        2        1  0.2659816060
## 146       2        1  0.2656008263
## 132       2        1  0.2549973676
## 139       2        1  0.2536694346
## 133       2        1  0.2522503965
## 143       2        1  0.2413520991
## 41        2        1  0.2399437163
## 122       2        1  0.2265833336
## 120       2        1  0.2234287083
## 69        2        1  0.2070172433
## 140       2        1  0.1944764434
## 142       2        1  0.1882570682
## 55        2        1  0.1880860077
## 63        2        1  0.1819966580
## 39        2        1  0.1671609650
## 114       2        1  0.1660037501
## 97        2        1  0.1618330151
## 119       2        1  0.1598925749
## 111       2        1  0.1586093556
## 102       2        1  0.1519262505
## 99        2        1  0.1420480265
## 45        2        1  0.1331120008
## 48        2        1  0.1260399671
## 86        2        1  0.1044217423
## 125       2        1  0.0946676744
## 12        2        1  0.0863536824
## 131       2        1  0.0837594313
## 108       2        1  0.0794187530
## 123       2        1  0.0546687509
## 110       2        1  0.0501727329
## 116       2        1  0.0480305915
## 129       2        1  0.0101287714
## 59        2        1  0.0007129311
## 126       2        1 -0.0008676781
## 109       2        1 -0.0042116933
## 4         2        1 -0.0054900525
## 136       2        1 -0.0167905418
## 65        2        1 -0.0223499003
## 57        2        1 -0.0598779642
## 75        2        1 -0.1083748515
## 82        2        1 -0.1502109992
## Average silhouette width per cluster:
## [1] 0.4923572 0.2351499
## Average silhouette width of total data set:
## [1] 0.3628787
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"

palette(alpha(brewer.pal(9,'Set1'), 0.5))
# Use the full component name; k$clust only works through partial matching
plot(pc_data, col=k$clustering, pch=16)

The plot above shows the pairwise 2D projections of the data, which lives in a 3D space (six panels, with each pair of PCs shown twice with the axes swapped). There appear to be a couple of outliers, especially in the cluster colored in blue.

Let’s visualize the clusters:

# Plot clusters
fviz_cluster(k, geom = "point", ellipse.type = "norm")

fviz_cluster(k, geom = "point", ellipse.type = "convex")

Then, I added the cluster assignment to the original data.

# Cluster sizes
sort(table(k$clustering))

## 
##  1  2 
## 73 74

clust <- names(sort(table(k$clustering)))

pc <- cbind(pc_data, k$clustering)

colnames(pc)[4] <- 'cluster_number'

# Combine PC data with city names
final_data <- cbind(raw_data, pc)
final_data <- final_data[,c(1:13,17)]

# First cluster
cluster_1 <- final_data[final_data$cluster_number == '1',]
# Second Cluster
cluster_2 <- final_data[final_data$cluster_number == '2',]

Now it’s time for some analysis. As I mentioned before, I am curious whether my clustering splits the ranking into two parts: the top and the bottom of the ranking.
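Besides reading the cluster listings, the question can be answered with a cross-tabulation. The sketch below uses made-up rankings and cluster labels (`toy` is hypothetical and does not reproduce the real results) and assumes that cluster 1 is read as the top half and cluster 2 as the bottom half:

```r
# Made-up ranking and cluster labels for eight cities (not the real results)
toy <- data.frame(
  Ranking        = 1:8,
  cluster_number = c(1, 1, 1, 2, 1, 2, 2, 2)
)

# Split the ranking at its median and cross-tabulate against the cluster labels
toy$half <- ifelse(toy$Ranking <= median(toy$Ranking), "top", "bottom")
tab <- table(half = toy$half, cluster = toy$cluster_number)
print(tab)

# Share of cities whose cluster matches the ranking half
# (assuming cluster 1 is read as "top" and cluster 2 as "bottom")
agreement <- mean((toy$half == "top") == (toy$cluster_number == 1))
print(agreement)
```

The same lines applied to `final_data` would show how many top-half and bottom-half cities fall into each cluster.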

So let’s see.

cluster_1

##     Ranking              City        Country WiFi CoWorking Coffe Taxi  Beer
## 1         1           Bangkok       Thailand   28       117  1.56 0.82  3.08
## 2         2         New Delhi          India   12       165  1.42 0.19  2.90
## 3         3            Lisbon       Portugal   33        95  1.56 0.40  3.42
## 5         5      Buenos Aires      Argentina   17        67  1.22 0.47  2.16
## 6         5          Budapest        Hungary   37        40  1.20 0.72  2.40
## 7         7            Mumbai          India   23       152  1.57 0.22  3.28
## 8         8          Istanbul         Turkey   13        69  1.23 0.29  3.34
## 9         9         Bucharest        Romania   54        46  1.78 0.35  2.78
## 10       10            Phuket       Thailand   23        11  1.73 0.76  3.92
## 11       10        Chiang Mai       Thailand   26        25  1.32 0.33  3.52
## 13       13           Jakarta      Indonesia   14       104  1.74 0.22  3.98
## 14       14         São Paulo         Brazil   17       112  1.14 0.85  2.82
## 15       15    Rio de Janeiro         Brazil   10        42  1.08 0.42  2.26
## 16       16             Sofia       Bulgaria   30        53  1.13 0.35  2.56
## 17       17       Mexico City         Mexico   20       133  1.67 0.24  2.90
## 18       18             Hanoi        Vietnam   14        59  1.20 0.38  1.26
## 19       19            Krakow         Poland   37        35  2.00 0.45  3.76
## 20       20      Kuala Lumpur       Malaysia   17       103  1.94 0.52  5.16
## 21       21  Ho Chi Minh City        Vietnam   13        92  1.36 0.47  1.32
## 23       23          Belgrade         Serbia   34        33  1.32 0.51  2.90
## 24       23            Prague Czech Republic   36        38  1.90 0.93  3.68
## 25       25         Cape Town   South Africa   12        46  1.45 0.60  3.50
## 26       26         Siem Reap       Cambodia    9         3  1.25 0.54  1.26
## 27       27             Porto       Portugal   29        32  1.34 0.43  3.40
## 28       28          Valencia          Spain   30        39  1.51 0.85  5.10
## 29       29              Kyiv        Ukraine   33        49  0.93 0.18  1.58
## 31       31            Moscow         Russia   17        47  1.70 0.15  4.06
## 32       32            Hoi An        Vietnam   12         1  1.03 0.47  1.50
## 34       34           Colombo      Sri Lanka    9        24  1.56 0.18  2.46
## 35       34             Seoul    South Korea   25        74  2.98 0.55  4.68
## 37       37          Santiago          Chile   11        57  2.21 0.94  5.66
## 43       43         Marrakech        Morocco    8         5  1.49 0.40  6.82
## 46       46          Medellin       Colombia    9        41  0.92 0.68  1.50
## 47       47            Malaga          Spain   26        17  1.27 0.73  4.26
## 49       49         Kathmandu          Nepal    7        10  1.13 0.48  3.62
## 51       51           Seville          Spain   28         7  1.21 0.80  3.40
## 54       54      Johannesburg   South Africa   12        49  1.42 0.60  3.00
## 56       56            Taipei         Taiwan   19        60  2.34 0.64  3.08
## 58       58              Faro       Portugal   70         1  1.57 0.85  4.24
## 60       60            Athens         Greece   12        29  2.42 0.64  6.84
## 61       61        Phnom Penh       Cambodia    8        27  1.69 0.72  1.08
## 62       62           Beijing          China    2        69  3.22 0.26  2.22
## 64       64              Lima           Peru   16        67  1.73 1.09  2.90
## 66       66            Warsaw         Poland   23        54  2.24 0.56  3.92
## 67       67             Minsk        Belarus   23         5  1.09 0.15  2.24
## 68       68         Cartagena       Colombia   11         9  0.84 1.35  3.32
## 71       71         Ljubljana       Slovenia   61         8  1.45 0.85  4.44
## 72       72             Quito        Ecuador   12        35  1.71 1.08  2.86
## 73       73          Florence          Italy   18        13  1.12 0.85  8.54
## 74       74            Oaxaca         Mexico   16         7  1.63 0.88  1.46
## 78       77           Tallinn        Estonia   37        14  2.43 0.51  6.82
## 79       79            Zagreb        Croatia   25        12  1.30 0.57  3.40
## 80       80             Xi'an          China   16         4  3.32 0.22  1.10
## 81       81            Yangon  Burma/Myanmar   10        22  1.61 0.72  1.44
## 84       84            Naples          Italy   19         7  1.18 0.87  5.12
## 87       87          Arequipa           Peru    9         4  1.39 0.91  2.84
## 88       88        Montevideo        Uruguay   25        10  1.94 0.82  3.92
## 90       90             Split        Croatia   20         6  1.36 1.13  4.54
## 91       91           Pokhara          Nepal    9         3  0.83 0.45  3.62
## 92       92            Cancun         Mexico    7         9  1.58 2.16  2.88
## 94       94          Shanghai          China    2       119  2.94 0.32  2.22
## 96       96              Riga         Latvia   23        20  2.20 0.60  5.10
## 98       98 Palma de Mallorca          Spain   29        15  1.81 0.83  4.26
## 101     101           Vilnius      Lithuania   25        22  2.03 0.51  5.46
## 104     104            Manila    Philippines    8        29  1.77 0.19  2.44
## 113     113              Hvar        Croatia   19         0  1.57 0.85  8.54
## 115     115          San Jose     Costa Rica   10        17  1.72 0.75  3.48
## 118     118          Adelaide      Australia   18        17  2.34 1.08  6.36
## 121     121         Vientiane           Laos    5         3  1.64 1.43  1.64
## 130     130            La Paz        Bolivia    6         7  1.91 1.14  3.34
## 135     135             Dakar        Senegal    6         6  2.00 1.30  1.64
## 137     137            Muscat           Oman    8         7  3.47 0.51 14.96
## 141     141       Kuwait City         Kuwait    7         4  3.56 1.38  4.78
##     Accommodation  Food Sunshine Attractions Instagram cluster_number
## 1          415.18  1.54     2624        2262  28386616              1
## 2          179.25  2.90     2685        2019  28528249              1
## 3          736.19  7.69     2806        1969  10205538              1
## 5          229.55  5.15     2525        1660  21293975              1
## 6          366.66  4.81     1988        1468  14267880              1
## 7          419.64  2.90     2584         892  47201552              1
## 8          230.10  2.92     2218        2088 116213193              1
## 9          352.35  6.07     2115         726   3376251              1
## 10         301.08  2.72     3450        1698  10190220              1
## 11         302.09  1.54     2512        1181   6428544              1
## 13         338.45  1.99     3252         449  75735133              1
## 14         323.93  4.23     1894        1267  32178314              1
## 15         264.92  4.59     2187        1607  43041426              1
## 16         313.59  5.23     2177         598   4197864              1
## 17         454.33  5.44     2555         969   6810938              1
## 18         247.44  1.25     1585        1948   7438755              1
## 19         538.25  5.63     2492         999   5787290              1
## 20         327.98  2.58     2774         665  15148859              1
## 21         404.46  1.57     2489        1357   1636058              1
## 23         308.00  5.08     2112         535   4424079              1
## 24         595.32  5.01     1668        2342  17540484              1
## 25         494.85  7.00     3094        1469  12323779              1
## 26         203.43  1.43     3519        1650   2293814              1
## 27         572.92  5.96     2468         802   9726729              1
## 28         594.88  8.52     2660         681  23385115              1
## 29         413.92  4.62     1955          87   6800366              1
## 31         643.35  7.56     1901        4653  58201864              1
## 32         160.86  1.33     2786        1176   2115317              1
## 34         313.09  1.45     3007         868   2575173              1
## 35         642.77  5.00     2066        1166  20380943              1
## 37         311.18  6.60     2808        1041  19690250              1
## 43         289.67  3.52     3131        2107   8671948              1
## 46         186.54  2.82     1892         546  30904265              1
## 47         581.14  8.54     3365         525   6992731              1
## 49         105.54  1.66     2529        1723   1684647              1
## 51         561.28  7.66     2898         932   1755465              1
## 54         328.29  6.00     3124         390   2999098              1
## 56         521.57  3.08     1405        1031  14321522              1
## 58         509.68  7.43     3036         167   1705160              1
## 60         348.38  8.54     2773        1149  10176728              1
## 61         338.05  2.15     3197         522   1196625              1
## 62         780.74  3.62     2671        1408   7286155              1
## 64         367.39  2.36     1230         855  10148526              1
## 66         614.23  5.59     1571         829  11966096              1
## 67         293.95  5.59     1780         352  11083815              1
## 68         373.24  2.07     3019         504  18939652              1
## 71         527.62  8.54     1974         422   1737701              1
## 72         343.81  2.69     2238         493   8411766              1
## 73         650.53 12.81     2924        1377  10392587              1
## 74         247.88  3.26     2896         294   4727625              1
## 78         466.50  8.52     1923         490   2861931              1
## 79         407.66  5.67     1913         520   3480120              1
## 80         234.49  2.22     3406         242    573692              1
## 81         594.02  2.54     3156         478    986024              1
## 84         550.03  8.54     2375        1045   8865857              1
## 87         164.43  1.83     3333         211   1571080              1
## 88         316.58  7.03     2481         326   3050401              1
## 90         360.39  7.94     2631         550   4249722              1
## 91          83.02  1.36     2573         344    741567              1
## 92         318.06  5.76     2525         912  10005189              1
## 94         959.42  3.33     1776         975  10969967              1
## 96         376.53  6.81     1754         540   3604009              1
## 98         736.96 12.77     2763         446   2370407              1
## 101        445.99  6.81     1691         457   3414796              1
## 104        493.79  2.87     2103         161   5381518              1
## 113        179.32 12.81     3535          85    880583              1
## 115        417.08  5.21     2081         305   5084798              1
## 118        783.36  8.63     2765         458   6427748              1
## 121        366.07  2.51     3250         104    409470              1
## 130        257.89  2.61     2289         211   1451813              1
## 135        502.94  3.91     3078          70   2108284              1
## 137        448.11  4.67     3493         219   3433459              1
## 141        643.09  4.79     3940         132   4133614              1

cluster_2

##     Ranking           City              Country WiFi CoWorking Coffe Taxi  Beer
## 4         4      Barcelona                Spain   37       136  1.59 1.01  5.12
## 12       12         Madrid                Spain   32       125  1.70 0.94 10.02
## 22       22      Singapore            Singapore   93       117  2.95 0.48  9.54
## 30       30    Los Angeles        United States   58       105  3.39 1.21 10.08
## 33       33      Hong Kong            Hong Kong   78       183  3.50 0.79 10.12
## 36       36      Las Vegas        United States   47        21  3.36 1.45  8.64
## 38       38  San Francisco        United States   75        77  3.39 1.34 10.08
## 39       38         Berlin              Germany   33       127  2.49 1.71  4.98
## 40       40      San Diego        United States   74        53  3.06 1.34  8.60
## 41       41          Tokyo                Japan   32       163  2.65 2.75  6.54
## 42       42        Chicago        United States   42       104  3.02 1.21  7.92
## 44       44         Vienna              Austria   42        91  2.82 1.21  6.82
## 45       45       New York        United States   37       272  3.48 1.34 10.58
## 48       48          Dubai United Arab Emirates   15       115  3.66 0.43 17.64
## 50       50        Houston        United States   60        62  2.80 1.03  7.20
## 52       52          Miami        United States   40        59  3.25 1.16  8.64
## 53       53         Sydney            Australia   28        93  2.14 1.25  8.48
## 55       55      Melbourne            Australia   21       127  2.34 0.87 10.78
## 57       57           Rome                Italy   17        50  1.07 1.02  8.52
## 59       59          Osaka                Japan   26        47  2.33 2.61  2.93
## 63       63        Phoenix        United States   44        35  3.32 1.00  7.18
## 65       65       Montreal               Canada   27        60  2.37 1.01  6.94
## 69       69          Paris               France   30       136  3.17 1.28 10.64
## 70       70        Toronto               Canada   26       113  2.63 1.15  8.06
## 75       75      Liverpool       United Kingdom   26        17  2.66 0.93  3.50
## 76       76    New Orleans        United States   45        16  3.25 1.54  5.02
## 77       77  Washington DC        United States   68        59  3.30 1.87  8.60
## 82       82          Kyoto                Japan   29         9  3.05 1.19  4.76
## 83       83        Hamburg              Germany   41        65  2.46 1.63  7.26
## 85       85      Vancouver               Canada   40        43  2.60 1.08  8.06
## 86       86          Milan                Italy   17        87  1.38 1.15  8.24
## 89       89       Portland        United States   44        31  3.04 1.16  8.60
## 93       93       Brussels              Belgium   37        66  2.60 1.66  6.84
## 95       95         Dublin              Ireland   61        46  2.88 1.28  9.40
## 97       97           Lyon               France   36        21  2.65 1.51  5.24
## 99       99       Brisbane            Australia   22        37  2.49 1.17  8.62
## 100     100       Auckland          New Zealand   34        30  2.57 1.32 10.06
## 102     101         London       United Kingdom   22       318  2.95 1.70 10.00
## 103     103      Stockholm               Sweden   37        50  3.35 1.13 11.72
## 105     105       Honolulu               Hawaii   43         8  4.05 1.47  8.62
## 106     105         Munich              Germany   31        87  2.70 1.71  6.84
## 107     107         Boston        United States   33        41  3.20 1.34 10.08
## 108     108        Calgary               Canada   24        34  2.44 1.16  8.10
## 109     109          Perth            Australia   19        29  2.45 0.91  9.54
## 110     110      Marseille               France   24        11  2.29 1.29 10.24
## 111     111        Cologne              Germany   33        38  2.34 1.71  6.84
## 112     112      Amsterdam          Netherlands   22        99  2.85 2.01  8.54
## 114     114     Dusseldorf              Germany   25        44  2.59 1.88  6.84
## 116     116      Abu Dhabi United Arab Emirates    9        32  3.48 0.39 15.68
## 117     117       Helsinki              Finland   36        22  3.38 0.85 10.24
## 119     119       Bordeaux               France   20        14  2.67 1.42 10.24
## 120     120      Frankfurt              Germany   22        66  2.49 1.71  6.84
## 122     122      Stuttgart              Germany   33        25  2.51 1.45  6.82
## 123     123       Hannover              Germany   38         7  2.26 1.71  5.98
## 124     124     Copenhagen              Denmark   46        26  4.46 1.38 11.48
## 125     125       Edmonton               Canada   30        10  2.69 1.04  6.94
## 126     126        Dresden              Germany   35         5  2.04 1.87  5.94
## 127     126     Manchester       United Kingdom   33        38  2.81 1.22  8.00
## 128     128      Rotterdam          Netherlands   37        31  2.52 1.71  6.80
## 129     129 St. Petersburg               Russia   21         4  2.41 1.00  5.74
## 131     131           Doha                Qatar   12        13  3.39 0.40 17.80
## 132     132         Ottawa               Canada   26        24  2.59 1.15  8.06
## 133     133       Tel Aviv               Israel   14        28  3.02 0.88 13.28
## 134     134      Edinburgh       United Kingdom   26        23  2.76 1.42  8.50
## 136     136      Dubrovnik              Croatia   19         0  2.52 1.19  9.12
## 138     138           Oslo               Norway   49        33  3.62 1.16 14.94
## 139     139        Glasgow       United Kingdom   26        16  2.77 1.06  7.00
## 140     140        Belfast       United Kingdom   26        13  2.74 1.07  9.00
## 142     142       Salzburg              Austria   30         7  2.68 1.65  6.64
## 143     143         Beirut              Lebanon    4        18  6.10 1.02  9.50
## 144     144         Zurich          Switzerland   33        42  3.97 3.00 11.06
## 145     145         Geneva          Switzerland   36        23  3.52 2.44 12.58
## 146     146       Valletta                Malta   10         1  2.33 2.55  6.10
## 147     147      Reykjavik              Iceland   26         9  3.36 2.03 13.88
##     Accommodation  Food Sunshine Attractions Instagram cluster_number
## 4          768.46 10.25     2591        2739  62894055              2
## 12         751.80 10.25     2769        1976  44165754              2
## 22        1455.77  7.28     2022        1561  41453533              2
## 30        1612.62 14.40     3254        1375  75436810              2
## 33        1581.43  5.56     1836        1330  39815830              2
## 36         826.39 10.80     3825        1420  40524360              2
## 38        2149.04 14.34     3062        1490  30781286              2
## 39         810.65  8.54     1626        1894  48461817              2
## 40        1553.79 10.75     3055        1118  29887093              2
## 41        1011.57  6.55     1881        3416  56623159              2
## 42        1330.55 12.24     2508        1411  51603607              2
## 44         755.00  9.37     1930        1149  14453386              2
## 45        2171.68 14.40     2535       10269 113711319              2
## 48        1069.88  6.86     3885        1615 110962610              2
## 50        1002.84 10.80     2578         589  27197101              2
## 52        1400.80 11.52     3154         762  80497499              2
## 53        1398.15  9.54     2636        1343  33859310              2
## 55         925.10 10.78     2636        1242  33945108              2
## 57         822.90 12.77     2473        3444  27665299              2
## 59         621.08  5.21     1996         790  16057725              2
## 63         936.98 10.78     3872         389   9833747              2
## 65         737.69  8.67     2051         793  18833685              2
## 69        1054.23 12.77     1662        4265 130025724              2
## 70        1135.92 11.52     2066        1048  50157783              2
## 75         649.17 12.00     2199        1050  14834860              2
## 76        1002.56 12.51     2649        1016  11614267              2
## 77        1666.66 10.75     2528         778  10770923              2
## 82         612.23  5.21     1775        1533  17684174              2
## 83         809.13 10.25     1557         688  20155361              2
## 85        1166.84 11.52     1938         729  24259344              2
## 86         927.31 12.81     1915        1543  21733273              2
## 89        1151.21 11.47     2341         741  12325674              2
## 93         724.37 12.81     1546         747   7074355              2
## 95        1437.05 12.81     1453        1243  12936240              2
## 97         596.38 10.89     2002         502   9144123              2
## 99         929.46 10.78     2968         668  11948455              2
## 100        964.17 10.06     2003        1099   5385894              2
## 102       1662.25 15.00     1633        4983 150702588              2
## 103       1154.07 10.05     1803         557  13630429              2
## 105       1368.17 10.78     3036         946   4898654              2
## 106       1115.58 11.95     1709         735  12132597              2
## 107       1877.31 14.40     2634         653  23846477              2
## 108        722.32 10.40     2396         540   7317582              2
## 109        803.84 10.60     3230         416  10684382              2
## 110        514.68 11.10     2858         488   8058438              2
## 111        711.60  8.11     1504         465   7660853              2
## 112       1316.79 12.81     1662        1782  33774833              2
## 114        685.86  8.54     2359         312   7242509              2
## 116        891.43  5.88     3509         317  31532015              2
## 117        860.76 11.10     1858         624   7238202              2
## 119        601.71 11.10     2713         560   7924124              2
## 120        894.56  8.54     1586         497  11611541              2
## 122        800.83 10.22     1692         250   9950961              2
## 123        526.02  8.54     1501         177   4633789              2
## 124       1178.93 14.92     1803         772  10101675              2
## 125        681.29 11.56     2345         294   4765517              2
## 126        447.61  8.49     1581         303   3640354              2
## 127        849.23 15.00     1416         458  16470240              2
## 128       1016.10 12.74     1745         323   6749161              2
## 129        995.39  8.96     1636        3218   3030245              2
## 131       1119.45  5.93     3897         203  22179935              2
## 132        928.07 11.52     2084         435   6914271              2
## 133       1148.13 14.38     3311         757   6474622              2
## 134        808.33 15.00     1427        1163   9582024              2
## 136        578.51 11.95     3322         468   2556696              2
## 138       1103.16 14.91     1668         444   8115873              2
## 139        651.23 15.00     1203         682   7250566              2
## 140        604.91 11.98     1868         483   3389741              2
## 142        655.73 10.64     1697         290   3108680              2
## 143        735.51 19.10     3534         188  10239115              2
## 144       1541.47 19.76     1566         485   6904776              2
## 145       1565.45 19.67     1828         300   4329235              2
## 146        890.99 17.03     3054         144    764698              2
## 147       1170.60 14.46     1326         629   2057729              2

Cluster 1 clearly contains most of the top-ranked cities. However, the first 10 rows already show gaps in the ranking: some cities were assigned to the second cluster even though their rank suggests they belong in the first. For instance, rank 4 is missing from the top 10, and rank 12 is missing from the top 20.

Now I would like to see which cities were ‘misplaced’ relative to the ranking. I will only inspect cities that are not on the ‘boundary’ between the two clusters, i.e. those that are not in the middle of the ranking.

The ranking has 147 levels, so let’s explore the top 30 and the bottom 30, leaving 87 cities on the ‘boundary’.

Among the top 30, ranks #4, #12, #22 and #30 are missing from cluster 1.

Let’s check which cities these are.

subset(raw_data, subset = Ranking %in% c(4,12,22,30))

##    Ranking        City       Country WiFi CoWorking Coffe Taxi  Beer
## 4        4   Barcelona         Spain   37       136  1.59 1.01  5.12
## 12      12      Madrid         Spain   32       125  1.70 0.94 10.02
## 22      22   Singapore     Singapore   93       117  2.95 0.48  9.54
## 30      30 Los Angeles United States   58       105  3.39 1.21 10.08
##    Accommodation  Food Sunshine Attractions Instagram
## 4         768.46 10.25     2591        2739  62894055
## 12        751.80 10.25     2769        1976  44165754
## 22       1455.77  7.28     2022        1561  41453533
## 30       1612.62 14.40     3254        1375  75436810

So Barcelona, Madrid, Singapore and Los Angeles have been assigned to the low-ranking cluster (cluster 2), even though their ranks place them in the top 30.
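The manual scan for missing ranks can also be done programmatically. The following is a minimal sketch, assuming a data frame with `Ranking`, `City` and `cluster_number` columns as in the listings above. The small `toy` data frame and the names `n_edge` and `n_total` are illustrative stand-ins, not objects from the original analysis.

```r
# Toy stand-in for the clustered data; rows copied from the listings above
toy <- data.frame(
  Ranking        = c(1, 2, 3, 4, 12, 137, 141, 146, 147),
  City           = c("Bangkok", "New Delhi", "Lisbon", "Barcelona", "Madrid",
                     "Muscat", "Kuwait City", "Valletta", "Reykjavik"),
  cluster_number = c(1, 1, 1, 2, 2, 1, 1, 2, 2),
  stringsAsFactors = FALSE
)

n_edge  <- 30   # how many ranks to inspect at each end (assumption)
n_total <- 147  # number of levels in the full ranking

# Top-ranked cities that landed in cluster 2
top_misplaced <- subset(toy, Ranking <= n_edge & cluster_number == 2)

# Bottom-ranked cities that landed in cluster 1
bottom_misplaced <- subset(toy, Ranking > n_total - n_edge & cluster_number == 1)

top_misplaced$City
bottom_misplaced$City
```

With the full data set, the same two `subset()` calls would list every misplaced city at both ends of the ranking without reading the rank numbers off by eye.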

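From the lower end of the ranking, ranks #141, #137, #130, #121, #115 and #113 appear in cluster 1. Strictly speaking, #115 and #113 lie just outside the bottom 30 (ranks 118 to 147), while #118 (Adelaide) and #135 (Dakar), which are also in cluster 1, are not inspected here.

Let’s check which cities these are.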

subset(raw_data, subset = Ranking %in% c(141,137,130,121,115,113))

##     Ranking        City    Country WiFi CoWorking Coffe Taxi  Beer
## 113     113        Hvar    Croatia   19         0  1.57 0.85  8.54
## 115     115    San Jose Costa Rica   10        17  1.72 0.75  3.48
## 121     121   Vientiane       Laos    5         3  1.64 1.43  1.64
## 130     130      La Paz    Bolivia    6         7  1.91 1.14  3.34
## 137     137      Muscat       Oman    8         7  3.47 0.51 14.96
## 141     141 Kuwait City     Kuwait    7         4  3.56 1.38  4.78
##     Accommodation  Food Sunshine Attractions Instagram
## 113        179.32 12.81     3535          85    880583
## 115        417.08  5.21     2081         305   5084798
## 121        366.07  2.51     3250         104    409470
## 130        257.89  2.61     2289         211   1451813
## 137        448.11  4.67     3493         219   3433459
## 141        643.09  4.79     3940         132   4133614

So Hvar, San Jose, Vientiane, La Paz, Muscat and Kuwait City have been assigned to the top-ranking cluster (cluster 1), even though their ranks place them near the bottom of the ranking.

Now, I would like to calculate the mean of each variable in the two clusters.

The best scenario would be:

i) the highest values for: WiFi, CoWorking, Sunshine, Attractions and Instagram

ii) the lowest values for: Coffee, Taxi, Beer, Accommodation and Food

round(colMeans(cluster_1[4:13]),2)

##          WiFi     CoWorking         Coffe          Taxi          Beer 
##         19.44         38.81          1.69          0.65          3.65 
## Accommodation          Food      Sunshine   Attractions     Instagram 
##        411.99          4.90       2540.32        920.53   12447708.62

round(colMeans(cluster_2[4:13]),2)

##          WiFi     CoWorking         Coffe          Taxi          Beer 
##         34.11         57.93          2.86          1.35          8.73 
## Accommodation          Food      Sunshine   Attractions     Instagram 
##       1022.96         11.38       2280.69       1163.59   25137563.11
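The two `colMeans()` calls can also be combined into one grouped summary. This is a sketch only, assuming the cluster label is stored in a `cluster_number` column. The `toy` data frame holds four rows and two variables copied from the listings above, not the full data.

```r
# Four cities from the listings above, two per cluster
toy <- data.frame(
  City           = c("Vientiane", "La Paz", "Barcelona", "Madrid"),
  WiFi           = c(5, 6, 37, 32),
  Accommodation  = c(366.07, 257.89, 768.46, 751.80),
  cluster_number = c(1, 1, 2, 2),
  stringsAsFactors = FALSE
)

# One row of means per cluster, one column per variable
cluster_means <- aggregate(cbind(WiFi, Accommodation) ~ cluster_number,
                           data = toy, FUN = mean)
round(cluster_means, 2)
```

Having both clusters in one table makes the side-by-side comparison in the conclusion easier to read.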

Conclusion

Cluster 2 has higher means for WiFi, CoWorking, Instagram and Attractions than cluster 1. Thus cluster 2 contains the preferable workation locations in terms of these four variables. However, WiFi was considered a ‘not so important’ variable in the PCA.

It is worth noting that CoWorking, Instagram and Attractions (all but WiFi) are the main contributors to the 2nd PC.

Cluster 1 has a higher mean for Sunshine and lower means for Coffee, Taxi, Beer, Accommodation and Food than cluster 2. Thus cluster 1 contains the preferable workation locations in terms of the cost variables Coffee, Taxi, Beer, Accommodation and Food. However, Taxi was considered a ‘not so important’ variable in the PCA.

It is worth noting that Accommodation, Food, Beer and Coffee (all but Taxi) are the main contributors to the 1st PC.

This comparison suggests that the preferable workation locations are the cheap ones. Someone looking to work and live abroad will probably search for a city where the cost of living is fairly low. Accommodation seems to be the most important variable, as it is usually the largest regular expense.

It doesn’t surprise me that Bangkok is the highest in the ranking, as Thailand is very cheap!

Although Barcelona is 4th in the ranking, it belongs to cluster 2. The average prices for Taxi, Beer, Accommodation and Food in Barcelona are all noticeably higher than the cluster 1 averages. For Accommodation and Food (and, marginally, Taxi) they are closer to the cluster 2 averages, although Beer (5.12) is still closer to the cluster 1 mean (3.65) than to the cluster 2 mean (8.73). On cost grounds, this suggests that Barcelona shouldn’t be considered one of the best cities for workation.
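To see how Barcelona sits relative to each cluster, one can tabulate its absolute distance to the two cluster means. The sketch below uses only the values printed above. The names `gap` and `closer_to` are illustrative, and raw differences ignore each variable's spread, so this is a rough check rather than a formal test.

```r
# Barcelona's prices and the two cluster means, copied from the output above
barcelona      <- c(Taxi = 1.01, Beer = 5.12, Accommodation = 768.46, Food = 10.25)
cluster_1_mean <- c(Taxi = 0.65, Beer = 3.65, Accommodation = 411.99, Food = 4.90)
cluster_2_mean <- c(Taxi = 1.35, Beer = 8.73, Accommodation = 1022.96, Food = 11.38)

# Absolute distance of Barcelona to each cluster mean, per variable
gap <- data.frame(
  to_cluster_1 = abs(barcelona - cluster_1_mean),
  to_cluster_2 = abs(barcelona - cluster_2_mean)
)

# Which cluster mean is nearer for each variable
gap$closer_to <- ifelse(gap$to_cluster_1 < gap$to_cluster_2, 1, 2)
gap
```

Accommodation, Food and (narrowly) Taxi come out nearer to cluster 2, while Beer is nearer to cluster 1.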

Workation

Natalia Miela

2022-12-26

PCA and Clustering

In this project, I analyze the data set and perform cluster analysis to investigate whether the ranking provided in the raw data file is related to the clusters that I obtain.

Exploratory Data Analysis & Clearing the Data

I decided to change and shorten the column names.

Here I would like to introduce the variables:

WiFi is the average WiFi speed (Mbps),

CoWorking is the number of co-working spaces,

Coffee is the average price of a coffee,

Taxi is the average price of a taxi (per km),

Beer is the average price of two beers in a bar,

Accommodation is the average price of a 1-bedroom apartment per month,

Food is the average cost of a meal at a local, mid-level restaurant,

Sunshine is the average number of sunshine hours,

Attractions is the number of ‘Things to do’ on TripAdvisor,

Instagram is the number of photos with the # hashtag.

Now, I would like to check whether all variables are numeric.

I removed the columns City and Country (numbers 2 and 3), as their class is character and they are not informative for clustering. I also removed the Ranking column (number 1) before clustering. The new shortened data set will be called data.

Then, I checked for NAs:

There were no NAs in my dataset, thus I could proceed to check for outliers.

There are some outliers; however, there are not many of them and not all variables have them. I will take this into account when choosing a clustering algorithm.
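The NA and outlier checks described above could look roughly like the following. This is a sketch under assumptions: the 1.5 × IQR boxplot rule is used to flag outliers, which may differ from the method actually used here, and `toy` contains only five cities' WiFi and CoWorking values copied from the listings.

```r
# Five cities' values copied from the listings (Bangkok, New Delhi, Lisbon,
# Barcelona, Singapore); a stand-in for the full numeric data set
toy <- data.frame(
  WiFi      = c(28, 12, 33, 37, 93),
  CoWorking = c(117, 165, 95, 136, 117)
)

# Number of missing values per column
na_counts <- colSums(is.na(toy))

# Values flagged by the boxplot rule (beyond 1.5 * IQR from the hinges)
outliers <- lapply(toy, function(x) boxplot.stats(x)$out)

na_counts
outliers
```

A variable with an empty `out` vector has no flagged values under this rule.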

The data set has a fairly high dimension (10 variables), so I decided to first reduce the dimension using PCA and then perform clustering.

PCA

Let’s look at the correlations between the variables.

And the correlation plot:

Beer, Coffee, Food, Taxi and WiFi have a strong correlation with Accommodation.

CoWorking and Instagram have a strong correlation with Attractions.

Coffee and Food have a strong correlation with Beer.

Food has a strong correlation with Coffee.

Instagram has a strong correlation with CoWorking.

Taxi and WiFi both have a strong correlation with Food.

Then, I standardized the data:

And calculated the eigenvalues of the covariance matrix:

There are 3 eigenvalues above 1, so three principal components should be retained.

The first three principal components represent 73% of the variability in the data.
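The standardization, eigenvalue and ‘eigenvalue above 1’ steps can be sketched in base R as follows. The data here are simulated purely for illustration, so the counts and percentages will not match the 3 components and 73% reported above. The names `x`, `scaled`, `n_keep` and `cum_var` are hypothetical.

```r
set.seed(1)

# Simulated stand-in for the numeric data: 200 rows, 4 variables
x <- matrix(rnorm(200 * 4), ncol = 4)
x[, 2] <- x[, 1] + 0.3 * x[, 2]   # make two columns strongly correlated

# Standardize: each column gets mean 0 and standard deviation 1
scaled <- scale(x)

# Eigenvalues of the covariance matrix of the standardized data
# (this equals the correlation matrix of the original data)
eig <- eigen(cov(scaled))$values

# Keep components whose eigenvalue exceeds 1
n_keep <- sum(eig > 1)

# Cumulative share of variance explained by the first k components
cum_var <- cumsum(eig) / sum(eig)

eig
n_keep
round(cum_var, 2)
```

Because the data are standardized, the eigenvalues sum to the number of variables, which is why 1 (the average eigenvalue) is a natural cutoff.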

The darkest variables are the most important. Thus we can consider Accommodation, Instagram, CoWorking, Attractions, Food, Beer and Coffee as most important.

Main contribution to the PCs:

1st PC: Accommodation, Food, Beer and Coffee,

2nd PC: CoWorking, Attractions and Instagram,

3rd PC: mainly Sunshine, with a smaller contribution from Beer.

Now, I will do clustering on the 3 PCs.

Clustering

Let’s check whether the data is clusterable using the Hopkins statistic.

The data is clusterable, as the Hopkins statistic is close to 1.

Let’s find out what the best number of clusters is.

Since the average silhouette width is highest (0.6) for 2 clusters, this will be my preferred choice. I could also analyze 3 clusters; however, the drop from 0.6 for 2 clusters to 0.44 for 3 clusters is substantial. Moreover, a smaller number of clusters reduces complexity.

By choosing 2 clusters I would like to verify the ranking of the best workation cities. I imagine that the clustering should split the ranking into two parts: top of the ranking and bottom of the ranking. I would like to check whether my clusters confirm the ranking.

I chose PAM (k-medoids) clustering as it is more robust to outliers and noise.
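Choosing the number of clusters by average silhouette width and then fitting PAM can be sketched with the `cluster` package. The `pcs` matrix below is simulated as two well-separated groups in three dimensions, standing in for the three principal component scores. It is not the workation data, so the silhouette values will differ from those reported above.

```r
library(cluster)  # provides pam()

set.seed(42)

# Simulated stand-in for the 3 PC scores: two separated groups of 60 points
pcs <- rbind(
  matrix(rnorm(60 * 3, mean = 0), ncol = 3),
  matrix(rnorm(60 * 3, mean = 5), ncol = 3)
)

# Average silhouette width for k = 2..5 clusters
avg_sil <- sapply(2:5, function(k) pam(pcs, k)$silinfo$avg.width)
names(avg_sil) <- 2:5

# Pick the k with the highest average silhouette width
best_k <- as.integer(names(which.max(avg_sil)))

# Final k-medoids fit and cluster sizes
fit <- pam(pcs, best_k)
round(avg_sil, 2)
table(fit$clustering)
```

The vector `fit$clustering` holds one cluster label per row, which is what would be attached to the original data as `cluster_number`.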

Above we can see six 2D projections of the data, which lie in a 3D space. There appear to be a couple of outliers, especially in the cluster colored in blue.

Let’s visualize the clusters:

Then, I added cluster assignment to the original data.

Now it’s time for some analysis. As I mentioned before, I am curious whether my clustering splits the ranking into two parts: top of the ranking and bottom of the ranking.

So let’s see.
