Exploratory Data Analysis & Cleaning the Data
# Preview the data
head(raw_data, 3)
## Ranking City Country
## 1 1 Bangkok Thailand
## 2 2 New Delhi India
## 3 3 Lisbon Portugal
## Remote.connection..Average.WiFi.speed..Mbps.per.second.
## 1 28
## 2 12
## 3 33
## Co.working.spaces..Number.of.co.working.spaces
## 1 117
## 2 165
## 3 95
## Caffeine..Average.price.of.buying.a.coffee
## 1 1.56
## 2 1.42
## 3 1.56
## Travel..Average.price.of.taxi..per.km.
## 1 0.82
## 2 0.19
## 3 0.40
## After.work.drinks..Average.price.for.2.beers.in.a.bar
## 1 3.08
## 2 2.90
## 3 3.42
## Accommodation..Average.price.of.1.bedroom.apartment.per.month
## 1 415.18
## 2 179.25
## 3 736.19
## Food..Average.cost.of.a.meal.at.a.local..mid.level.restaurant
## 1 1.54
## 2 2.90
## 3 7.69
## Climate..Average.number.of.sunshine.hours
## 1 2624
## 2 2685
## 3 2806
## Tourist.attractions..Number.of..Things.to.do..on.Tripadvisor
## 1 2262
## 2 2019
## 3 1969
## Instagramability..Number.of.photos.with..
## 1 28386616
## 2 28528249
## 3 10205538
I decided to change and shorten the column names.
colnames(raw_data)
## [1] "Ranking"
## [2] "City"
## [3] "Country"
## [4] "Remote.connection..Average.WiFi.speed..Mbps.per.second."
## [5] "Co.working.spaces..Number.of.co.working.spaces"
## [6] "Caffeine..Average.price.of.buying.a.coffee"
## [7] "Travel..Average.price.of.taxi..per.km."
## [8] "After.work.drinks..Average.price.for.2.beers.in.a.bar"
## [9] "Accommodation..Average.price.of.1.bedroom.apartment.per.month"
## [10] "Food..Average.cost.of.a.meal.at.a.local..mid.level.restaurant"
## [11] "Climate..Average.number.of.sunshine.hours"
## [12] "Tourist.attractions..Number.of..Things.to.do..on.Tripadvisor"
## [13] "Instagramability..Number.of.photos.with.."
# Shorten column names for easier use
colnames(raw_data)[4:13] <- c('WiFi', 'CoWorking', 'Coffe', 'Taxi', 'Beer', 'Accommodation', 'Food', 'Sunshine', 'Attractions', 'Instagram')
colnames(raw_data)
## [1] "Ranking" "City" "Country" "WiFi"
## [5] "CoWorking" "Coffe" "Taxi" "Beer"
## [9] "Accommodation" "Food" "Sunshine" "Attractions"
## [13] "Instagram"
Here I would like to introduce the variables:
WiFi is the average WiFi speed (Mbps),
CoWorking is the number of co-working spaces,
Coffee (stored as Coffe in the data) is the average price of a coffee,
Taxi is the average price of a taxi per km,
Beer is the average price of two beers in a bar,
Accommodation is the average price of a 1-bedroom apartment per
month,
Food is the average cost of a meal at a local, mid-level restaurant,
Sunshine is the average number of sunshine hours,
Attractions is the number of ‘Things to do’ on TripAdvisor,
Instagram is the number of photos with a # hashtag.
Now, I would like to check whether all data is of the type
numeric.
for (i in 1:ncol(raw_data)){
print(i)
print(class(raw_data[,i]))
}
## [1] 1
## [1] "integer"
## [1] 2
## [1] "character"
## [1] 3
## [1] "character"
## [1] 4
## [1] "integer"
## [1] 5
## [1] "integer"
## [1] 6
## [1] "numeric"
## [1] 7
## [1] "numeric"
## [1] 8
## [1] "numeric"
## [1] 9
## [1] "numeric"
## [1] 10
## [1] "numeric"
## [1] 11
## [1] "integer"
## [1] 12
## [1] "integer"
## [1] 13
## [1] "integer"
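The same type check can be written without an explicit loop. A minimal sketch, assuming a hypothetical data frame `toy` that reproduces only four columns of the first three rows from the preview above:

```r
# Hypothetical three-row excerpt of the data (values copied from the preview above)
toy <- data.frame(
  Ranking = 1:3,
  City = c("Bangkok", "New Delhi", "Lisbon"),
  WiFi = c(28L, 12L, 33L),
  Coffe = c(1.56, 1.42, 1.56),
  stringsAsFactors = FALSE
)

# One class per column, returned as a named character vector
col_types <- sapply(toy, class)
print(col_types)

# Names of the columns that are neither integer nor numeric
non_numeric <- names(toy)[!sapply(toy, is.numeric)]
print(non_numeric)
```

Listing the non-numeric columns directly makes it easier to see which ones have to be left out before computing correlations or running PCA.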
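Then, I checked for NAs in the ten numeric variables:
data <- raw_data[, 4:13]  # numeric columns only; this step is not shown in the original and is inferred from the output
apply(data, 2, function(x) any(is.na(x)))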
## WiFi CoWorking Coffe Taxi Beer
## FALSE FALSE FALSE FALSE FALSE
## Accommodation Food Sunshine Attractions Instagram
## FALSE FALSE FALSE FALSE FALSE
There were no NAs in my dataset, thus I could proceed to check for
outliers.
par(mfrow = c(2, 5))
# One boxplot per variable, labelled with the column name
for (i in seq_len(ncol(data))) {
  boxplot(data[, i], col = 'grey', ylab = colnames(data)[i])
}

par(mfrow = c(1, 1))
Some variables have outliers, but there are not many of them. I will
take this into account when choosing a
clustering algorithm.
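Outliers can also be counted instead of read off the plots. A minimal sketch, assuming a hypothetical six-row data frame `toy` (the values are a hand-picked selection from the tables in this report, not the full data set):

```r
# Hypothetical toy data: one variable with an obvious outlier, one without
toy <- data.frame(
  Beer = c(3.08, 2.90, 3.42, 2.16, 2.40, 14.96),
  Taxi = c(0.82, 0.19, 0.40, 0.47, 0.72, 0.29)
)

# boxplot.stats() applies the same 1.5 * IQR rule that boxplot() uses for its whiskers
n_outliers <- sapply(toy, function(x) length(boxplot.stats(x)$out))
print(n_outliers)
```

Applied to the full `data` object, the same call would give one outlier count per variable.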
The data set has a fairly high dimension (10 variables), thus I decided to
first reduce the dimension using PCA and then perform
clustering.
PCA
Let’s look at the correlations between the variables.
corr_data <- cor(data, method="pearson") # nominal values
print(corr_data, digits=2)
## WiFi CoWorking Coffe Taxi Beer Accommodation Food
## WiFi 1.000 0.180 0.331 2.4e-01 0.307 0.57 0.409
## CoWorking 0.180 1.000 0.117 1.4e-02 0.134 0.41 0.062
## Coffe 0.331 0.117 1.000 4.1e-01 0.687 0.68 0.603
## Taxi 0.244 0.014 0.413 1.0e+00 0.342 0.48 0.588
## Beer 0.307 0.134 0.687 3.4e-01 1.000 0.67 0.671
## Accommodation 0.574 0.408 0.684 4.8e-01 0.671 1.00 0.690
## Food 0.409 0.062 0.603 5.9e-01 0.671 0.69 1.000
## Sunshine -0.227 -0.166 0.009 -2.5e-01 0.059 -0.10 -0.170
## Attractions 0.057 0.646 0.013 1.1e-06 0.078 0.31 0.083
## Instagram 0.125 0.736 0.169 2.5e-02 0.274 0.39 0.139
## Sunshine Attractions Instagram
## WiFi -0.2275 5.7e-02 0.1251
## CoWorking -0.1660 6.5e-01 0.7359
## Coffe 0.0090 1.3e-02 0.1689
## Taxi -0.2545 1.1e-06 0.0246
## Beer 0.0590 7.8e-02 0.2738
## Accommodation -0.1008 3.1e-01 0.3902
## Food -0.1701 8.3e-02 0.1391
## Sunshine 1.0000 -1.0e-01 0.0034
## Attractions -0.1049 1.0e+00 0.6504
## Instagram 0.0034 6.5e-01 1.0000
And the correlation plot:
corrplot(corr_data, type = "lower", order = "alphabet", tl.col = "black", tl.cex = 1, col=colorRampPalette(c("#99FF33", "#CC0066", "black"))(200))

Beer, Coffee, Food, Taxi and WiFi are all positively correlated with
Accommodation (0.48 to 0.69).
CoWorking and Instagram are strongly correlated with
Attractions (about 0.65 each).
Coffee and Food are strongly correlated with Beer (0.69 and 0.67).
Food is strongly correlated with Coffee (0.60).
Instagram is strongly correlated with CoWorking (0.74).
Taxi and WiFi are moderately correlated with Food (0.59 and 0.41).
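The strongly correlated pairs can also be pulled out of the matrix programmatically. A minimal sketch, assuming a 5 x 5 excerpt of the correlation matrix printed above (rounded to three decimals) and an assumed cut-off of 0.6 for "strong":

```r
vars <- c("Coffe", "Beer", "Accommodation", "Food", "Sunshine")
# Correlations copied from the matrix printed above (rounded to three decimals)
r <- matrix(c(
  1.000,  0.687,  0.684,  0.603,  0.009,
  0.687,  1.000,  0.671,  0.671,  0.059,
  0.684,  0.671,  1.000,  0.690, -0.101,
  0.603,  0.671,  0.690,  1.000, -0.170,
  0.009,  0.059, -0.101, -0.170,  1.000
), nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

threshold <- 0.6  # assumed cut-off for a "strong" correlation
# Look only at the upper triangle so that each pair is reported once
idx <- which(abs(r) > threshold & upper.tri(r), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r = r[idx]
)
print(strong_pairs)
```

With the full `corr_data` matrix in place of `r`, the same lines would list every pair above the chosen threshold.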
Then, I standardized the data:
preproc1 <- preProcess(data, method=c("center", "scale"))
data_s <- predict(preproc1, data)
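Centering and scaling can be checked with base R. A minimal sketch, assuming that the "center" and "scale" steps correspond to subtracting the column mean and dividing by the sample standard deviation, which is what `scale()` does; the data frame `toy` is a hypothetical five-row excerpt:

```r
# Hypothetical toy data with two variables on very different scales
toy <- data.frame(
  Coffe = c(1.56, 1.42, 1.56, 1.59, 1.22),
  Accommodation = c(415.18, 179.25, 736.19, 768.46, 229.55)
)

# Subtract each column mean and divide by each column standard deviation
toy_s <- as.data.frame(scale(toy, center = TRUE, scale = TRUE))

print(round(colMeans(toy_s), 10))  # column means after standardization
print(sapply(toy_s, sd))           # column standard deviations after standardization
```

After standardization every variable has mean 0 and standard deviation 1, so a variable measured in hundreds (Accommodation) no longer dominates one measured in single units (Coffe).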
And calculated the eigenvalues of the covariance matrix (which, for standardized data, equals the correlation matrix):
data_cov<-cov(data_s)
raw_data_eigen<-eigen(data_cov)
raw_data_eigen$values
## [1] 3.9415586 2.1782532 1.2112680 0.7846449 0.5236360 0.4125724 0.3414264
## [8] 0.2410057 0.2187643 0.1468705
head(raw_data_eigen$vectors)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] -0.2893990 -0.08763996 0.27280033 0.79256858 -0.30089931 -0.03819244
## [2,] -0.2349595 0.52385452 0.06584985 0.03430246 0.05107179 0.40938390
## [3,] -0.3779944 -0.21597057 -0.24059344 -0.01963833 0.32990449 0.42374320
## [4,] -0.2894328 -0.25041093 0.27808069 -0.52437050 -0.56434898 0.29174857
## [5,] -0.3866965 -0.16388824 -0.32523657 -0.03620515 0.38098644 -0.22765122
## [6,] -0.4636067 -0.01993322 -0.04106675 0.13889066 -0.10560279 0.03427873
## [,7] [,8] [,9] [,10]
## [1,] 0.06377226 0.21999066 0.01970819 -0.2472493
## [2,] 0.08608499 -0.59579507 0.07864324 -0.3550872
## [3,] -0.52465156 0.25914526 -0.28270575 -0.2036485
## [4,] 0.09157012 0.16081417 0.22104619 -0.1204785
## [5,] 0.33939859 0.08396046 0.60336568 -0.1823462
## [6,] -0.20970665 -0.28822467 0.15546598 0.7746918
Three eigenvalues are greater than 1, so by the Kaiser criterion the
first three components should be kept for PCA.
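The rule of keeping eigenvalues above 1 is easy to apply in code. A minimal sketch, using the eigenvalues printed above; the variable names are illustrative only:

```r
# Eigenvalues copied from the output above
eigenvalues <- c(3.9415586, 2.1782532, 1.2112680, 0.7846449, 0.5236360,
                 0.4125724, 0.3414264, 0.2410057, 0.2187643, 0.1468705)

# Keep components whose eigenvalue exceeds 1, i.e. components that carry
# more variance than a single standardized variable
n_keep <- sum(eigenvalues > 1)
print(n_keep)

# For standardized data the eigenvalues sum to the number of variables
print(sum(eigenvalues))
```

Because each standardized variable has variance 1, an eigenvalue below 1 means the component explains less than one original variable would on its own.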
pca1 <- prcomp(data_s, center = FALSE, scale. = FALSE)  # data_s is already standardized
pca1
## Standard deviations (1, .., p=10):
## [1] 1.9853359 1.4758907 1.1005762 0.8858018 0.7236270 0.6423180 0.5843170
## [8] 0.4909233 0.4677224 0.3832369
##
## Rotation (n x k) = (10 x 10):
## PC1 PC2 PC3 PC4 PC5
## WiFi -0.28939903 -0.08763996 0.27280033 -0.79256858 0.30089931
## CoWorking -0.23495954 0.52385452 0.06584985 -0.03430246 -0.05107179
## Coffe -0.37799444 -0.21597057 -0.24059344 0.01963833 -0.32990449
## Taxi -0.28943275 -0.25041093 0.27808069 0.52437050 0.56434898
## Beer -0.38669652 -0.16388824 -0.32523657 0.03620515 -0.38098644
## Accommodation -0.46360666 -0.01993322 -0.04106675 -0.13889066 0.10560279
## Food -0.40054557 -0.24469054 0.03321884 0.16942140 0.02302925
## Sunshine 0.09728205 -0.02076909 -0.80868713 -0.10143221 0.54599917
## Attractions -0.18925603 0.52343960 0.03520013 0.17568172 0.15670684
## Instagram -0.25339173 0.49698717 -0.14915422 0.06964701 -0.00440924
## PC6 PC7 PC8 PC9 PC10
## WiFi -0.03819244 0.06377226 -0.21999066 0.01970819 -0.2472493
## CoWorking 0.40938390 0.08608499 0.59579507 0.07864324 -0.3550872
## Coffe 0.42374320 -0.52465156 -0.25914526 -0.28270575 -0.2036485
## Taxi 0.29174857 0.09157012 -0.16081417 0.22104619 -0.1204785
## Beer -0.22765122 0.33939859 -0.08396046 0.60336568 -0.1823462
## Accommodation 0.03427873 -0.20970665 0.28822467 0.15546598 0.7746918
## Food -0.47674753 0.18482220 0.34879373 -0.58407810 -0.1584317
## Sunshine 0.01822636 -0.01455493 0.11108630 -0.03325428 -0.1172021
## Attractions -0.51816539 -0.52500393 -0.20026221 0.15102063 -0.1627769
## Instagram 0.13097611 0.48549936 -0.48927485 -0.33371059 0.2409611
# PCA summary
summary(pca1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.9853 1.4759 1.1006 0.88580 0.72363 0.64232 0.58432
## Proportion of Variance 0.3942 0.2178 0.1211 0.07846 0.05236 0.04126 0.03414
## Cumulative Proportion 0.3942 0.6120 0.7331 0.81157 0.86394 0.90519 0.93934
## PC8 PC9 PC10
## Standard deviation 0.4909 0.46772 0.38324
## Proportion of Variance 0.0241 0.02188 0.01469
## Cumulative Proportion 0.9634 0.98531 1.00000
The first three principal components explain about 73% of the variance in
the data.
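The proportions in the summary follow directly from the component standard deviations. A minimal sketch, using the standard deviations printed by `prcomp` above:

```r
# Standard deviations of the components, copied from the prcomp output above
sdev <- c(1.9853359, 1.4758907, 1.1005762, 0.8858018, 0.7236270,
          0.6423180, 0.5843170, 0.4909233, 0.4677224, 0.3832369)

variance <- sdev^2                    # component variances (the eigenvalues)
prop_var <- variance / sum(variance)  # proportion of variance per component
cum_var <- cumsum(prop_var)           # cumulative proportion

print(round(cum_var[1:3], 4))
```

The third entry of `cum_var` is the share of variance retained when only the first three components are kept.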
fviz_eig(pca1)

fviz_pca_var(pca1, repel = TRUE, col.var="contrib")+scale_color_gradient2(low="#99FF33", mid="#CC0066", high="black", midpoint=5)

The darkest variables contribute the most to the first two components.
Thus we can consider Accommodation, Instagram, CoWorking, Attractions,
Food, Beer and Coffee the most important.
# PCs contribution
# fviz_contrib() returns ggplot objects, so par(mfrow) has no effect; each call draws its own plot
fviz_contrib(pca1, "var", axes=1, xtickslab.rt=90)

fviz_contrib(pca1, "var", axes=2, xtickslab.rt=90)

fviz_contrib(pca1, "var", axes=3, xtickslab.rt=90)

Main contributions to the PCs:
1st PC: Accommodation, Food, Beer and Coffee,
2nd PC: CoWorking, Attractions and Instagram,
3rd PC: mainly Sunshine, with a smaller contribution from Beer.
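The contributions to the 1st PC can be recomputed from the loadings. A minimal sketch, under the assumption that a variable's contribution to a component is its squared loading expressed in percent, and that the reference level is the uniform share of 100 / 10 = 10%:

```r
# PC1 loadings copied from the rotation matrix above
pc1 <- c(WiFi = -0.28939903, CoWorking = -0.23495954, Coffe = -0.37799444,
         Taxi = -0.28943275, Beer = -0.38669652, Accommodation = -0.46360666,
         Food = -0.40054557, Sunshine = 0.09728205, Attractions = -0.18925603,
         Instagram = -0.25339173)

# Assumption: contribution = squared loading, rescaled to sum to 100%
contrib <- 100 * pc1^2 / sum(pc1^2)
contrib <- sort(contrib, decreasing = TRUE)
print(round(contrib, 1))

# Variables contributing more than the uniform share of 100 / 10 = 10%
main_vars <- names(contrib)[contrib > 100 / length(contrib)]
print(main_vars)
```

The sign of a loading does not matter here, since only the squared value enters the contribution.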
Now, I will do clustering on the 3 PCs.
pc_data <- as.data.frame(pca1$x[,1:3])
Clustering
Let’s check whether the data is clusterable using the Hopkins
statistic.
# Check whether the data is clusterable
hopkins(pc_data, m=nrow(pc_data)-1)
## [1] 0.9448153
The Hopkins statistic is 0.94, which is close to 1 and indicates a
strong clustering tendency.
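To make the statistic less of a black box, here is a minimal sketch of one common formulation. It is an illustration under stated assumptions, not the implementation behind `hopkins()`: conventions differ between packages, and the function name `hopkins_sketch`, the toy data and the choice of `m` are all hypothetical.

```r
# Sketch of the Hopkins statistic: compare nearest-neighbour distances of
# uniformly placed artificial points (u) with those of real data points (w)
hopkins_sketch <- function(X, m) {
  X <- as.matrix(X)
  n <- nrow(X)
  d <- ncol(X)
  # Distance from point p to its nearest neighbour among the rows of pts
  nn_dist <- function(p, pts) min(sqrt(colSums((t(pts) - p)^2)))
  # m artificial points drawn uniformly inside the bounding box of the data
  U <- sapply(seq_len(d), function(j) runif(m, min(X[, j]), max(X[, j])))
  # m real points sampled without replacement
  idx <- sample(n, m)
  u <- sapply(seq_len(m), function(i) nn_dist(U[i, ], X))
  w <- sapply(idx, function(i) nn_dist(X[i, ], X[-i, , drop = FALSE]))
  sum(u^d) / (sum(u^d) + sum(w^d))
}

set.seed(123)
# Hypothetical toy data: two tight, well-separated groups versus uniform noise
clustered <- rbind(matrix(rnorm(100, mean = 0, sd = 0.1), ncol = 2),
                   matrix(rnorm(100, mean = 5, sd = 0.1), ncol = 2))
uniform <- matrix(runif(200), ncol = 2)

h_clustered <- hopkins_sketch(clustered, m = 20)
h_uniform <- hopkins_sketch(uniform, m = 20)
print(c(clustered = h_clustered, uniform = h_uniform))
```

In this formulation, real points in clustered data sit much closer to their neighbours than random points do, which pushes the ratio towards 1; for data without structure the two sets of distances are similar and the ratio stays around 0.5.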
Let’s find the best number of clusters.
opt1 <- Optimal_Clusters_KMeans(pc_data, max_clusters=10, plot_clusters = TRUE)

opt2 <- Optimal_Clusters_KMeans(pc_data, max_clusters=10, plot_clusters=TRUE, criterion="silhouette")

The average silhouette width is highest (0.6) for 2 clusters, so this
is my preferred choice. I could also analyze 3 clusters, but the
drop from 0.6 for 2 clusters to 0.44 for 3 clusters is
substantial. Moreover, a smaller number of clusters reduces
complexity.
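The silhouette comparison can be reproduced by hand for any clustering method. A minimal sketch, assuming the `cluster` package is installed and using PAM rather than k-means on hypothetical toy data, so the numbers will not match the plot above:

```r
library(cluster)  # provides pam()

set.seed(42)
# Hypothetical toy data: two well-separated groups in two dimensions
toy <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(60, mean = 4, sd = 0.3), ncol = 2))

# Average silhouette width for each candidate number of clusters
ks <- 2:5
avg_sil <- sapply(ks, function(k) pam(toy, k)$silinfo$avg.width)
names(avg_sil) <- ks
print(round(avg_sil, 2))

# Candidate with the highest average silhouette width
best_k <- ks[which.max(avg_sil)]
print(best_k)
```

Replacing `toy` with `pc_data` would give the PAM-based silhouette widths for the three principal components.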
By choosing 2 clusters I would also like to verify the ranking of the best
workation cities. I expect the clustering to split the ranking into
two parts: the top of the ranking and the bottom of the ranking. I would
like to check whether my clusters confirm the ranking.
I chose PAM (k-medoids) clustering as it is more robust to outliers
and noise than k-means.
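The robustness argument can be illustrated with a single cluster in one dimension. A minimal sketch on hypothetical values, comparing the mean (the k-means style centre) with the medoid (the k-medoids style centre):

```r
# Hypothetical one-dimensional example with a single extreme value
x <- c(1, 2, 3, 4, 100)

# k-means style centre: the arithmetic mean, pulled towards the outlier
centroid <- mean(x)

# k-medoids style centre: the observation with the smallest total distance to all others
d <- as.matrix(dist(x))
medoid <- x[which.min(rowSums(d))]

print(c(centroid = centroid, medoid = medoid))
```

The medoid is always an actual observation, so one extreme value cannot drag the cluster centre into an empty region of the data.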
k <- pam(pc_data, 2)
summary(k)
## Medoids:
## ID PC1 PC2 PC3
## [1,] 27 1.544350 0.2759193 0.2157809
## [2,] 85 -1.451421 -0.5580540 0.5081618
## Clustering vector:
## [1] 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 2 1 1 2 1
## [38] 2 2 2 2 2 1 2 2 1 1 2 1 2 1 2 2 1 2 1 2 1 2 1 1 1 2 1 2 1 1 1 2 2 1 1 1 1
## [75] 2 2 2 1 1 1 1 2 2 1 2 2 1 1 2 1 1 1 2 1 2 1 2 1 2 2 1 2 2 1 2 2 2 2 2 2 2
## [112] 2 1 2 1 2 2 1 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 1 2 2 2 1 2 2 2 2 2 2
## Objective function:
## build swap
## 1.900186 1.786494
##
## Numerical information per cluster:
## size max_diss av_diss diameter separation
## [1,] 73 3.767608 1.453106 5.722192 0.550201
## [2,] 74 9.662208 2.115377 11.500497 0.550201
##
## Isolated clusters:
## L-clusters: character(0)
## L*-clusters: character(0)
##
## Silhouette plot information:
## cluster neighbor sil_width
## 11 1 2 0.6438056232
## 104 1 2 0.6407512333
## 68 1 2 0.6385734336
## 72 1 2 0.6227212208
## 49 1 2 0.6210397804
## 54 1 2 0.6182350100
## 81 1 2 0.6153063087
## 16 1 2 0.6140608706
## 74 1 2 0.6138813228
## 27 1 2 0.6103399861
## 61 1 2 0.6089871973
## 32 1 2 0.6023133931
## 21 1 2 0.6017961550
## 115 1 2 0.5999018198
## 34 1 2 0.5983823226
## 5 1 2 0.5947676593
## 67 1 2 0.5907008554
## 23 1 2 0.5899454241
## 46 1 2 0.5880641583
## 91 1 2 0.5877558858
## 130 1 2 0.5876270108
## 29 1 2 0.5859568252
## 25 1 2 0.5837791559
## 10 1 2 0.5814081459
## 20 1 2 0.5762424276
## 51 1 2 0.5734249641
## 15 1 2 0.5721322595
## 26 1 2 0.5695028206
## 87 1 2 0.5684843316
## 121 1 2 0.5675158713
## 79 1 2 0.5671770820
## 43 1 2 0.5623589806
## 17 1 2 0.5579402321
## 135 1 2 0.5543479312
## 19 1 2 0.5436207794
## 84 1 2 0.5370079056
## 90 1 2 0.5303250698
## 88 1 2 0.5292797290
## 18 1 2 0.5268954662
## 6 1 2 0.5261961493
## 62 1 2 0.5165003669
## 47 1 2 0.5130899326
## 80 1 2 0.5043533150
## 37 1 2 0.5011463160
## 92 1 2 0.5005364519
## 9 1 2 0.4954377330
## 56 1 2 0.4844061780
## 96 1 2 0.4776302827
## 14 1 2 0.4762678833
## 101 1 2 0.4680797861
## 94 1 2 0.4619375608
## 64 1 2 0.4528530865
## 1 1 2 0.4468285865
## 66 1 2 0.4392309711
## 2 1 2 0.4371754490
## 7 1 2 0.4337562333
## 60 1 2 0.4317206496
## 28 1 2 0.4278006142
## 3 1 2 0.4255188856
## 13 1 2 0.4122291425
## 8 1 2 0.3320039592
## 35 1 2 0.3301362432
## 24 1 2 0.3158275706
## 113 1 2 0.2963454117
## 58 1 2 0.2877910615
## 73 1 2 0.2746489571
## 31 1 2 0.2470973788
## 78 1 2 0.2436858887
## 98 1 2 0.2306986605
## 141 1 2 0.2037551250
## 118 1 2 0.1953380631
## 71 1 2 0.1926434601
## 137 1 2 0.0830548467
## 77 2 1 0.4567781721
## 40 2 1 0.4506842397
## 103 2 1 0.4465726429
## 138 2 1 0.4419108836
## 124 2 1 0.4402000952
## 107 2 1 0.4399655382
## 95 2 1 0.4237890636
## 89 2 1 0.4164769446
## 38 2 1 0.4125751217
## 147 2 1 0.4093297159
## 112 2 1 0.4085963803
## 85 2 1 0.3932671254
## 76 2 1 0.3920695038
## 50 2 1 0.3920440610
## 105 2 1 0.3898765504
## 106 2 1 0.3868399712
## 145 2 1 0.3859219734
## 42 2 1 0.3799502770
## 127 2 1 0.3797448408
## 134 2 1 0.3743261256
## 117 2 1 0.3714397488
## 144 2 1 0.3707096275
## 22 2 1 0.3697689412
## 100 2 1 0.3610277185
## 30 2 1 0.3595969417
## 128 2 1 0.3494822822
## 33 2 1 0.3462571025
## 52 2 1 0.3392292758
## 93 2 1 0.3217698800
## 70 2 1 0.3112202323
## 83 2 1 0.3108366861
## 53 2 1 0.2902944281
## 36 2 1 0.2841143396
## 44 2 1 0.2659816060
## 146 2 1 0.2656008263
## 132 2 1 0.2549973676
## 139 2 1 0.2536694346
## 133 2 1 0.2522503965
## 143 2 1 0.2413520991
## 41 2 1 0.2399437163
## 122 2 1 0.2265833336
## 120 2 1 0.2234287083
## 69 2 1 0.2070172433
## 140 2 1 0.1944764434
## 142 2 1 0.1882570682
## 55 2 1 0.1880860077
## 63 2 1 0.1819966580
## 39 2 1 0.1671609650
## 114 2 1 0.1660037501
## 97 2 1 0.1618330151
## 119 2 1 0.1598925749
## 111 2 1 0.1586093556
## 102 2 1 0.1519262505
## 99 2 1 0.1420480265
## 45 2 1 0.1331120008
## 48 2 1 0.1260399671
## 86 2 1 0.1044217423
## 125 2 1 0.0946676744
## 12 2 1 0.0863536824
## 131 2 1 0.0837594313
## 108 2 1 0.0794187530
## 123 2 1 0.0546687509
## 110 2 1 0.0501727329
## 116 2 1 0.0480305915
## 129 2 1 0.0101287714
## 59 2 1 0.0007129311
## 126 2 1 -0.0008676781
## 109 2 1 -0.0042116933
## 4 2 1 -0.0054900525
## 136 2 1 -0.0167905418
## 65 2 1 -0.0223499003
## 57 2 1 -0.0598779642
## 75 2 1 -0.1083748515
## 82 2 1 -0.1502109992
## Average silhouette width per cluster:
## [1] 0.4923572 0.2351499
## Average silhouette width of total data set:
## [1] 0.3628787
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
palette(alpha(brewer.pal(9,'Set1'), 0.5))
plot(pc_data, col=k$clustering, pch=16)

Above we can see six 2D projections of the data, which lie in a 3D
space. There appear to be a couple of outliers, especially in the cluster
colored in blue.
Let’s visualize the clusters:
# Plot clusters
fviz_cluster(k, geom = "point", ellipse.type = "norm")

fviz_cluster(k, geom = "point", ellipse.type = "convex")

Then, I added cluster assignment to the original data.
# Cluster sizes
sort(table(k$cluster))
##
## 1 2
## 73 74
# Append the cluster labels to the principal component scores
pc <- cbind(pc_data, cluster_number = k$clustering)
# Combine the PC data with the city names and the original variables
final_data <- cbind(raw_data, pc)
# Keep the 13 original columns plus cluster_number (column 17)
final_data <- final_data[, c(1:13, 17)]
# First cluster
cluster_1 <- final_data[final_data$cluster_number == 1, ]
# Second cluster
cluster_2 <- final_data[final_data$cluster_number == 2, ]
Now it’s time for some analysis. As I mentioned before, I am curious
whether my clustering splits the ranking into two parts: the top
and the bottom of the ranking.
So let’s see.
cluster_1
## Ranking City Country WiFi CoWorking Coffe Taxi Beer
## 1 1 Bangkok Thailand 28 117 1.56 0.82 3.08
## 2 2 New Delhi India 12 165 1.42 0.19 2.90
## 3 3 Lisbon Portugal 33 95 1.56 0.40 3.42
## 5 5 Buenos Aires Argentina 17 67 1.22 0.47 2.16
## 6 5 Budapest Hungary 37 40 1.20 0.72 2.40
## 7 7 Mumbai India 23 152 1.57 0.22 3.28
## 8 8 Istanbul Turkey 13 69 1.23 0.29 3.34
## 9 9 Bucharest Romania 54 46 1.78 0.35 2.78
## 10 10 Phuket Thailand 23 11 1.73 0.76 3.92
## 11 10 Chiang Mai Thailand 26 25 1.32 0.33 3.52
## 13 13 Jakarta Indonesia 14 104 1.74 0.22 3.98
## 14 14 São Paulo Brazil 17 112 1.14 0.85 2.82
## 15 15 Rio de Janeiro Brazil 10 42 1.08 0.42 2.26
## 16 16 Sofia Bulgaria 30 53 1.13 0.35 2.56
## 17 17 Mexico City Mexico 20 133 1.67 0.24 2.90
## 18 18 Hanoi Vietnam 14 59 1.20 0.38 1.26
## 19 19 Krakow Poland 37 35 2.00 0.45 3.76
## 20 20 Kuala Lumpur Malaysia 17 103 1.94 0.52 5.16
## 21 21 Ho Chi Minh City Vietnam 13 92 1.36 0.47 1.32
## 23 23 Belgrade Serbia 34 33 1.32 0.51 2.90
## 24 23 Prague Czech Republic 36 38 1.90 0.93 3.68
## 25 25 Cape Town South Africa 12 46 1.45 0.60 3.50
## 26 26 Siem Reap Cambodia 9 3 1.25 0.54 1.26
## 27 27 Porto Portugal 29 32 1.34 0.43 3.40
## 28 28 Valencia Spain 30 39 1.51 0.85 5.10
## 29 29 Kyiv Ukraine 33 49 0.93 0.18 1.58
## 31 31 Moscow Russia 17 47 1.70 0.15 4.06
## 32 32 Hoi An Vietnam 12 1 1.03 0.47 1.50
## 34 34 Colombo Sri Lanka 9 24 1.56 0.18 2.46
## 35 34 Seoul South Korea 25 74 2.98 0.55 4.68
## 37 37 Santiago Chile 11 57 2.21 0.94 5.66
## 43 43 Marrakech Morocco 8 5 1.49 0.40 6.82
## 46 46 Medellin Colombia 9 41 0.92 0.68 1.50
## 47 47 Malaga Spain 26 17 1.27 0.73 4.26
## 49 49 Kathmandu Nepal 7 10 1.13 0.48 3.62
## 51 51 Seville Spain 28 7 1.21 0.80 3.40
## 54 54 Johannesburg South Africa 12 49 1.42 0.60 3.00
## 56 56 Taipei Taiwan 19 60 2.34 0.64 3.08
## 58 58 Faro Portugal 70 1 1.57 0.85 4.24
## 60 60 Athens Greece 12 29 2.42 0.64 6.84
## 61 61 Phnom Penh Cambodia 8 27 1.69 0.72 1.08
## 62 62 Beijing China 2 69 3.22 0.26 2.22
## 64 64 Lima Peru 16 67 1.73 1.09 2.90
## 66 66 Warsaw Poland 23 54 2.24 0.56 3.92
## 67 67 Minsk Belarus 23 5 1.09 0.15 2.24
## 68 68 Cartagena Colombia 11 9 0.84 1.35 3.32
## 71 71 Ljubljana Slovenia 61 8 1.45 0.85 4.44
## 72 72 Quito Ecuador 12 35 1.71 1.08 2.86
## 73 73 Florence Italy 18 13 1.12 0.85 8.54
## 74 74 Oaxaca Mexico 16 7 1.63 0.88 1.46
## 78 77 Tallinn Estonia 37 14 2.43 0.51 6.82
## 79 79 Zagreb Croatia 25 12 1.30 0.57 3.40
## 80 80 Xi'an China 16 4 3.32 0.22 1.10
## 81 81 Yangon Burma/Myanmar 10 22 1.61 0.72 1.44
## 84 84 Naples Italy 19 7 1.18 0.87 5.12
## 87 87 Arequipa Peru 9 4 1.39 0.91 2.84
## 88 88 Montevideo Uruguay 25 10 1.94 0.82 3.92
## 90 90 Split Croatia 20 6 1.36 1.13 4.54
## 91 91 Pokhara Nepal 9 3 0.83 0.45 3.62
## 92 92 Cancun Mexico 7 9 1.58 2.16 2.88
## 94 94 Shanghai China 2 119 2.94 0.32 2.22
## 96 96 Riga Latvia 23 20 2.20 0.60 5.10
## 98 98 Palma de Mallorca Spain 29 15 1.81 0.83 4.26
## 101 101 Vilnius Lithuania 25 22 2.03 0.51 5.46
## 104 104 Manila Philippines 8 29 1.77 0.19 2.44
## 113 113 Hvar Croatia 19 0 1.57 0.85 8.54
## 115 115 San Jose Costa Rica 10 17 1.72 0.75 3.48
## 118 118 Adelaide Australia 18 17 2.34 1.08 6.36
## 121 121 Vientiane Laos 5 3 1.64 1.43 1.64
## 130 130 La Paz Bolivia 6 7 1.91 1.14 3.34
## 135 135 Dakar Senegal 6 6 2.00 1.30 1.64
## 137 137 Muscat Oman 8 7 3.47 0.51 14.96
## 141 141 Kuwait City Kuwait 7 4 3.56 1.38 4.78
## Accommodation Food Sunshine Attractions Instagram cluster_number
## 1 415.18 1.54 2624 2262 28386616 1
## 2 179.25 2.90 2685 2019 28528249 1
## 3 736.19 7.69 2806 1969 10205538 1
## 5 229.55 5.15 2525 1660 21293975 1
## 6 366.66 4.81 1988 1468 14267880 1
## 7 419.64 2.90 2584 892 47201552 1
## 8 230.10 2.92 2218 2088 116213193 1
## 9 352.35 6.07 2115 726 3376251 1
## 10 301.08 2.72 3450 1698 10190220 1
## 11 302.09 1.54 2512 1181 6428544 1
## 13 338.45 1.99 3252 449 75735133 1
## 14 323.93 4.23 1894 1267 32178314 1
## 15 264.92 4.59 2187 1607 43041426 1
## 16 313.59 5.23 2177 598 4197864 1
## 17 454.33 5.44 2555 969 6810938 1
## 18 247.44 1.25 1585 1948 7438755 1
## 19 538.25 5.63 2492 999 5787290 1
## 20 327.98 2.58 2774 665 15148859 1
## 21 404.46 1.57 2489 1357 1636058 1
## 23 308.00 5.08 2112 535 4424079 1
## 24 595.32 5.01 1668 2342 17540484 1
## 25 494.85 7.00 3094 1469 12323779 1
## 26 203.43 1.43 3519 1650 2293814 1
## 27 572.92 5.96 2468 802 9726729 1
## 28 594.88 8.52 2660 681 23385115 1
## 29 413.92 4.62 1955 87 6800366 1
## 31 643.35 7.56 1901 4653 58201864 1
## 32 160.86 1.33 2786 1176 2115317 1
## 34 313.09 1.45 3007 868 2575173 1
## 35 642.77 5.00 2066 1166 20380943 1
## 37 311.18 6.60 2808 1041 19690250 1
## 43 289.67 3.52 3131 2107 8671948 1
## 46 186.54 2.82 1892 546 30904265 1
## 47 581.14 8.54 3365 525 6992731 1
## 49 105.54 1.66 2529 1723 1684647 1
## 51 561.28 7.66 2898 932 1755465 1
## 54 328.29 6.00 3124 390 2999098 1
## 56 521.57 3.08 1405 1031 14321522 1
## 58 509.68 7.43 3036 167 1705160 1
## 60 348.38 8.54 2773 1149 10176728 1
## 61 338.05 2.15 3197 522 1196625 1
## 62 780.74 3.62 2671 1408 7286155 1
## 64 367.39 2.36 1230 855 10148526 1
## 66 614.23 5.59 1571 829 11966096 1
## 67 293.95 5.59 1780 352 11083815 1
## 68 373.24 2.07 3019 504 18939652 1
## 71 527.62 8.54 1974 422 1737701 1
## 72 343.81 2.69 2238 493 8411766 1
## 73 650.53 12.81 2924 1377 10392587 1
## 74 247.88 3.26 2896 294 4727625 1
## 78 466.50 8.52 1923 490 2861931 1
## 79 407.66 5.67 1913 520 3480120 1
## 80 234.49 2.22 3406 242 573692 1
## 81 594.02 2.54 3156 478 986024 1
## 84 550.03 8.54 2375 1045 8865857 1
## 87 164.43 1.83 3333 211 1571080 1
## 88 316.58 7.03 2481 326 3050401 1
## 90 360.39 7.94 2631 550 4249722 1
## 91 83.02 1.36 2573 344 741567 1
## 92 318.06 5.76 2525 912 10005189 1
## 94 959.42 3.33 1776 975 10969967 1
## 96 376.53 6.81 1754 540 3604009 1
## 98 736.96 12.77 2763 446 2370407 1
## 101 445.99 6.81 1691 457 3414796 1
## 104 493.79 2.87 2103 161 5381518 1
## 113 179.32 12.81 3535 85 880583 1
## 115 417.08 5.21 2081 305 5084798 1
## 118 783.36 8.63 2765 458 6427748 1
## 121 366.07 2.51 3250 104 409470 1
## 130 257.89 2.61 2289 211 1451813 1
## 135 502.94 3.91 3078 70 2108284 1
## 137 448.11 4.67 3493 219 3433459 1
## 141 643.09 4.79 3940 132 4133614 1
cluster_2
## Ranking City Country WiFi CoWorking Coffe Taxi Beer
## 4 4 Barcelona Spain 37 136 1.59 1.01 5.12
## 12 12 Madrid Spain 32 125 1.70 0.94 10.02
## 22 22 Singapore Singapore 93 117 2.95 0.48 9.54
## 30 30 Los Angeles United States 58 105 3.39 1.21 10.08
## 33 33 Hong Kong Hong Kong 78 183 3.50 0.79 10.12
## 36 36 Las Vegas United States 47 21 3.36 1.45 8.64
## 38 38 San Francisco United States 75 77 3.39 1.34 10.08
## 39 38 Berlin Germany 33 127 2.49 1.71 4.98
## 40 40 San Diego United States 74 53 3.06 1.34 8.60
## 41 41 Tokyo Japan 32 163 2.65 2.75 6.54
## 42 42 Chicago United States 42 104 3.02 1.21 7.92
## 44 44 Vienna Austria 42 91 2.82 1.21 6.82
## 45 45 New York United States 37 272 3.48 1.34 10.58
## 48 48 Dubai United Arab Emirates 15 115 3.66 0.43 17.64
## 50 50 Houston United States 60 62 2.80 1.03 7.20
## 52 52 Miami United States 40 59 3.25 1.16 8.64
## 53 53 Sydney Australia 28 93 2.14 1.25 8.48
## 55 55 Melbourne Australia 21 127 2.34 0.87 10.78
## 57 57 Rome Italy 17 50 1.07 1.02 8.52
## 59 59 Osaka Japan 26 47 2.33 2.61 2.93
## 63 63 Phoenix United States 44 35 3.32 1.00 7.18
## 65 65 Montreal Canada 27 60 2.37 1.01 6.94
## 69 69 Paris France 30 136 3.17 1.28 10.64
## 70 70 Toronto Canada 26 113 2.63 1.15 8.06
## 75 75 Liverpool United Kingdom 26 17 2.66 0.93 3.50
## 76 76 New Orleans United States 45 16 3.25 1.54 5.02
## 77 77 Washington DC United States 68 59 3.30 1.87 8.60
## 82 82 Kyoto Japan 29 9 3.05 1.19 4.76
## 83 83 Hamburg Germany 41 65 2.46 1.63 7.26
## 85 85 Vancouver Canada 40 43 2.60 1.08 8.06
## 86 86 Milan Italy 17 87 1.38 1.15 8.24
## 89 89 Portland United States 44 31 3.04 1.16 8.60
## 93 93 Brussels Belgium 37 66 2.60 1.66 6.84
## 95 95 Dublin Ireland 61 46 2.88 1.28 9.40
## 97 97 Lyon France 36 21 2.65 1.51 5.24
## 99 99 Brisbane Australia 22 37 2.49 1.17 8.62
## 100 100 Auckland New Zealand 34 30 2.57 1.32 10.06
## 102 101 London United Kingdom 22 318 2.95 1.70 10.00
## 103 103 Stockholm Sweden 37 50 3.35 1.13 11.72
## 105 105 Honolulu Hawaii 43 8 4.05 1.47 8.62
## 106 105 Munich Germany 31 87 2.70 1.71 6.84
## 107 107 Boston United States 33 41 3.20 1.34 10.08
## 108 108 Calgary Canada 24 34 2.44 1.16 8.10
## 109 109 Perth Australia 19 29 2.45 0.91 9.54
## 110 110 Marseille France 24 11 2.29 1.29 10.24
## 111 111 Cologne Germany 33 38 2.34 1.71 6.84
## 112 112 Amsterdam Netherlands 22 99 2.85 2.01 8.54
## 114 114 Dusseldorf Germany 25 44 2.59 1.88 6.84
## 116 116 Abu Dhabi United Arab Emirates 9 32 3.48 0.39 15.68
## 117 117 Helsinki Finland 36 22 3.38 0.85 10.24
## 119 119 Bordeaux France 20 14 2.67 1.42 10.24
## 120 120 Frankfurt Germany 22 66 2.49 1.71 6.84
## 122 122 Stuttgart Germany 33 25 2.51 1.45 6.82
## 123 123 Hannover Germany 38 7 2.26 1.71 5.98
## 124 124 Copenhagen Denmark 46 26 4.46 1.38 11.48
## 125 125 Edmonton Canada 30 10 2.69 1.04 6.94
## 126 126 Dresden Germany 35 5 2.04 1.87 5.94
## 127 126 Manchester United Kingdom 33 38 2.81 1.22 8.00
## 128 128 Rotterdam Netherlands 37 31 2.52 1.71 6.80
## 129 129 St. Petersburg Russia 21 4 2.41 1.00 5.74
## 131 131 Doha Qatar 12 13 3.39 0.40 17.80
## 132 132 Ottawa Canada 26 24 2.59 1.15 8.06
## 133 133 Tel Aviv Israel 14 28 3.02 0.88 13.28
## 134 134 Edinburgh United Kingdom 26 23 2.76 1.42 8.50
## 136 136 Dubrovnik Croatia 19 0 2.52 1.19 9.12
## 138 138 Oslo Norway 49 33 3.62 1.16 14.94
## 139 139 Glasgow United Kingdom 26 16 2.77 1.06 7.00
## 140 140 Belfast United Kingdom 26 13 2.74 1.07 9.00
## 142 142 Salzburg Austria 30 7 2.68 1.65 6.64
## 143 143 Beirut Lebanon 4 18 6.10 1.02 9.50
## 144 144 Zurich Switzerland 33 42 3.97 3.00 11.06
## 145 145 Geneva Switzerland 36 23 3.52 2.44 12.58
## 146 146 Valletta Malta 10 1 2.33 2.55 6.10
## 147 147 Reykjavik Iceland 26 9 3.36 2.03 13.88
## Accommodation Food Sunshine Attractions Instagram cluster_number
## 4 768.46 10.25 2591 2739 62894055 2
## 12 751.80 10.25 2769 1976 44165754 2
## 22 1455.77 7.28 2022 1561 41453533 2
## 30 1612.62 14.40 3254 1375 75436810 2
## 33 1581.43 5.56 1836 1330 39815830 2
## 36 826.39 10.80 3825 1420 40524360 2
## 38 2149.04 14.34 3062 1490 30781286 2
## 39 810.65 8.54 1626 1894 48461817 2
## 40 1553.79 10.75 3055 1118 29887093 2
## 41 1011.57 6.55 1881 3416 56623159 2
## 42 1330.55 12.24 2508 1411 51603607 2
## 44 755.00 9.37 1930 1149 14453386 2
## 45 2171.68 14.40 2535 10269 113711319 2
## 48 1069.88 6.86 3885 1615 110962610 2
## 50 1002.84 10.80 2578 589 27197101 2
## 52 1400.80 11.52 3154 762 80497499 2
## 53 1398.15 9.54 2636 1343 33859310 2
## 55 925.10 10.78 2636 1242 33945108 2
## 57 822.90 12.77 2473 3444 27665299 2
## 59 621.08 5.21 1996 790 16057725 2
## 63 936.98 10.78 3872 389 9833747 2
## 65 737.69 8.67 2051 793 18833685 2
## 69 1054.23 12.77 1662 4265 130025724 2
## 70 1135.92 11.52 2066 1048 50157783 2
## 75 649.17 12.00 2199 1050 14834860 2
## 76 1002.56 12.51 2649 1016 11614267 2
## 77 1666.66 10.75 2528 778 10770923 2
## 82 612.23 5.21 1775 1533 17684174 2
## 83 809.13 10.25 1557 688 20155361 2
## 85 1166.84 11.52 1938 729 24259344 2
## 86 927.31 12.81 1915 1543 21733273 2
## 89 1151.21 11.47 2341 741 12325674 2
## 93 724.37 12.81 1546 747 7074355 2
## 95 1437.05 12.81 1453 1243 12936240 2
## 97 596.38 10.89 2002 502 9144123 2
## 99 929.46 10.78 2968 668 11948455 2
## 100 964.17 10.06 2003 1099 5385894 2
## 102 1662.25 15.00 1633 4983 150702588 2
## 103 1154.07 10.05 1803 557 13630429 2
## 105 1368.17 10.78 3036 946 4898654 2
## 106 1115.58 11.95 1709 735 12132597 2
## 107 1877.31 14.40 2634 653 23846477 2
## 108 722.32 10.40 2396 540 7317582 2
## 109 803.84 10.60 3230 416 10684382 2
## 110 514.68 11.10 2858 488 8058438 2
## 111 711.60 8.11 1504 465 7660853 2
## 112 1316.79 12.81 1662 1782 33774833 2
## 114 685.86 8.54 2359 312 7242509 2
## 116 891.43 5.88 3509 317 31532015 2
## 117 860.76 11.10 1858 624 7238202 2
## 119 601.71 11.10 2713 560 7924124 2
## 120 894.56 8.54 1586 497 11611541 2
## 122 800.83 10.22 1692 250 9950961 2
## 123 526.02 8.54 1501 177 4633789 2
## 124 1178.93 14.92 1803 772 10101675 2
## 125 681.29 11.56 2345 294 4765517 2
## 126 447.61 8.49 1581 303 3640354 2
## 127 849.23 15.00 1416 458 16470240 2
## 128 1016.10 12.74 1745 323 6749161 2
## 129 995.39 8.96 1636 3218 3030245 2
## 131 1119.45 5.93 3897 203 22179935 2
## 132 928.07 11.52 2084 435 6914271 2
## 133 1148.13 14.38 3311 757 6474622 2
## 134 808.33 15.00 1427 1163 9582024 2
## 136 578.51 11.95 3322 468 2556696 2
## 138 1103.16 14.91 1668 444 8115873 2
## 139 651.23 15.00 1203 682 7250566 2
## 140 604.91 11.98 1868 483 3389741 2
## 142 655.73 10.64 1697 290 3108680 2
## 143 735.51 19.10 3534 188 10239115 2
## 144 1541.47 19.76 1566 485 6904776 2
## 145 1565.45 19.67 1828 300 4329235 2
## 146 890.99 17.03 3054 144 764698 2
## 147 1170.60 14.46 1326 629 2057729 2
Cluster 1 contains most of the cities at the top of the ranking.
However, when looking at the first rows, I do
see that some ranking positions are missing, i.e. some
cities were assigned to the second cluster, even though their rank
suggests that they should be in the first cluster. For instance,
position 4 is missing from the top 10, and position 12 is missing from
the top 20.
Now I would like to see the cities that look ‘misplaced’ when
compared with the ranking. I will only inspect the cities that are not on
the ‘boundary’ between the two clusters, i.e. the ones that are not
in the middle of the ranking.
The ranking has 147 cities, so let’s explore the top 30 and the
bottom 30, leaving 87 cities on the ‘boundary’.
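The search for ‘misplaced’ cities can be automated. A minimal sketch, assuming the cluster labels from the "Clustering vector" of the `pam()` summary above are in row order, so that row 1 is the top-ranked city and row 147 the lowest-ranked (row positions can differ slightly from the Ranking column where ranks are tied):

```r
# Cluster labels copied from the "Clustering vector" printed above, in row order
cl <- c(
  1,1,1,2,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,2,1,1,2,1,1,2,1,
  2,2,2,2,2,1,2,2,1,1,2,1,2,1,2,2,1,2,1,2,1,2,1,1,1,2,1,2,1,1,1,2,2,1,1,1,1,
  2,2,2,1,1,1,1,2,2,1,2,2,1,1,2,1,1,1,2,1,2,1,2,1,2,2,1,2,2,1,2,2,2,2,2,2,2,
  2,1,2,1,2,2,1,2,2,1,2,2,2,2,2,2,2,2,1,2,2,2,2,1,2,1,2,2,2,1,2,2,2,2,2,2
)
n <- length(cl)
top_n <- 30  # size of the top and bottom groups to inspect

top_rows <- seq_len(top_n)
bottom_rows <- (n - top_n + 1):n

# Rows in the top group that landed in cluster 2, and vice versa
top_in_cluster_2 <- top_rows[cl[top_rows] == 2]
bottom_in_cluster_1 <- bottom_rows[cl[bottom_rows] == 1]
print(top_in_cluster_2)
print(bottom_in_cluster_1)
```

The resulting row numbers can be passed to `final_data[rows, ]` to look up the corresponding cities.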
From the top 30 in the ranking, positions #4, #12, #22 and #30 are
missing from cluster 1.
Let’s check which cities these are.
subset(raw_data, subset = Ranking %in% c(4,12,22,30))
## Ranking City Country WiFi CoWorking Coffe Taxi Beer
## 4 4 Barcelona Spain 37 136 1.59 1.01 5.12
## 12 12 Madrid Spain 32 125 1.70 0.94 10.02
## 22 22 Singapore Singapore 93 117 2.95 0.48 9.54
## 30 30 Los Angeles United States 58 105 3.39 1.21 10.08
## Accommodation Food Sunshine Attractions Instagram
## 4 768.46 10.25 2591 2739 62894055
## 12 751.80 10.25 2769 1976 44165754
## 22 1455.77 7.28 2022 1561 41453533
## 30 1612.62 14.40 3254 1375 75436810
So Barcelona, Madrid, Singapore and Los Angeles have been
assigned to the low-ranking cluster, even though the
ranking places them in the top 30.
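From the lower end of the ranking, positions #113, #115, #121, #130,
#137 and #141 are missing from cluster 2. Note that #113 and #115 lie
just outside the bottom 30 (positions 118 to 147), and that #118
(Adelaide) and #135 (Dakar) were also assigned to cluster 1 but are not
inspected here.
Let’s check which cities these are.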
subset(raw_data, subset = Ranking %in% c(141,137,130,121,115,113))
## Ranking City Country WiFi CoWorking Coffe Taxi Beer
## 113 113 Hvar Croatia 19 0 1.57 0.85 8.54
## 115 115 San Jose Costa Rica 10 17 1.72 0.75 3.48
## 121 121 Vientiane Laos 5 3 1.64 1.43 1.64
## 130 130 La Paz Bolivia 6 7 1.91 1.14 3.34
## 137 137 Muscat Oman 8 7 3.47 0.51 14.96
## 141 141 Kuwait City Kuwait 7 4 3.56 1.38 4.78
## Accommodation Food Sunshine Attractions Instagram
## 113 179.32 12.81 3535 85 880583
## 115 417.08 5.21 2081 305 5084798
## 121 366.07 2.51 3250 104 409470
## 130 257.89 2.61 2289 211 1451813
## 137 448.11 4.67 3493 219 3433459
## 141 643.09 4.79 3940 132 4133614
So Hvar, San Jose, Vientiane, La Paz, Muscat and Kuwait City
have been assigned to the top-ranking cluster, even though the
ranking places them near the bottom.
Now, I would like to calculate the mean of each variable in the two
clusters.
The best scenario would be:
i) the highest value for: WiFi, CoWorking, Sunshine, Attractions and
Instagram
ii) the lowest value for: Coffee, Taxi, Beer, Accommodation and Food
round(colMeans(cluster_1[4:13]),2)
## WiFi CoWorking Coffe Taxi Beer
## 19.44 38.81 1.69 0.65 3.65
## Accommodation Food Sunshine Attractions Instagram
## 411.99 4.90 2540.32 920.53 12447708.62
round(colMeans(cluster_2[4:13]),2)
## WiFi CoWorking Coffe Taxi Beer
## 34.11 57.93 2.86 1.35 8.73
## Accommodation Food Sunshine Attractions Instagram
## 1022.96 11.38 2280.69 1163.59 25137563.11
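Both sets of means can be produced in a single call. A minimal sketch, assuming a hypothetical four-row excerpt `toy` (two cities from each cluster, values copied from the tables above) and only two of the ten variables:

```r
# Hypothetical four-row excerpt: two cities from each cluster
toy <- data.frame(
  City = c("Bangkok", "New Delhi", "Barcelona", "Madrid"),
  Beer = c(3.08, 2.90, 5.12, 10.02),
  Food = c(1.54, 2.90, 10.25, 10.25),
  cluster_number = c(1, 1, 2, 2)
)

# Mean of each listed variable within each cluster
cluster_means <- aggregate(cbind(Beer, Food) ~ cluster_number, data = toy, FUN = mean)
print(cluster_means)
```

Having both clusters in one table makes the side-by-side comparison easier than two separate `colMeans()` calls.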
Conclusion
Cluster 2 has a higher mean for WiFi, CoWorking, Instagram and
Attractions than cluster 1. Thus cluster 2 contains the
preferable workation locations in terms of WiFi, CoWorking, Instagram and
Attractions. However, WiFi was not among the most important
variables in the PCA.
It is worth noting that CoWorking, Instagram and Attractions (all
but WiFi) are the main contributors to the 2nd PC.
Cluster 1 has a higher mean for Sunshine and a lower mean for Coffee,
Taxi, Beer, Accommodation and Food than cluster 2. Thus
cluster 1 contains the preferable workation locations in terms of Coffee,
Taxi, Beer, Accommodation and Food. However, Taxi was
not among the most important variables in the PCA.
It is worth noting that Accommodation, Food, Beer and Coffee (all
but Taxi) are the main contributors to the 1st PC.
This comparison suggests that the preferable
workation locations are the cheap ones. Someone who wants
to work and live abroad will probably look for a city where the cost of
living is low. Accommodation seems to be the most important
variable, as it is usually the largest regular
expense.
It doesn’t surprise me that Bangkok is the highest in the ranking,
as Thailand is very cheap!
Although Barcelona is 4th in the ranking, it belongs to cluster
2. The prices for Taxi, Beer, Accommodation and Food in
Barcelona are all well above the cluster 1 averages.
For Accommodation and Food they are closer to the cluster 2
averages (Beer is still closer to cluster 1), which suggests that, on
cost alone, Barcelona should not be considered
one of the best cities for workation.
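The comparison for Barcelona can be made explicit. A minimal sketch, using Barcelona's values and the two sets of cluster means printed above; the column names of `comparison` are illustrative only:

```r
vars <- c("Taxi", "Beer", "Accommodation", "Food")
barcelona <- c(1.01, 5.12, 768.46, 10.25)    # Barcelona's row in the data
mean_cl1 <- c(0.65, 3.65, 411.99, 4.90)      # cluster 1 means
mean_cl2 <- c(1.35, 8.73, 1022.96, 11.38)    # cluster 2 means

# Absolute distance of Barcelona from each cluster mean, per variable
comparison <- data.frame(
  variable = vars,
  barcelona = barcelona,
  dist_to_cl1 = abs(barcelona - mean_cl1),
  dist_to_cl2 = abs(barcelona - mean_cl2)
)
# Which cluster mean is nearer for each variable
comparison$closer_to <- ifelse(comparison$dist_to_cl1 < comparison$dist_to_cl2, 1, 2)
print(comparison)
```

This is a variable-by-variable check on raw values; the clustering itself was done on standardized principal components, so the two views need not agree for every variable.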