East Asia is one of the best travel destinations; visitors can travel to Taiwan, Japan, or South Korea. Many applications now offer recommendations for tourist attractions, one of which is tripadvisor.com. The site also collects feedback and reviews from travelers who have visited East Asia, covering attractions such as bars, parks, museums, beaches, and more. Visitors can rate each place as Excellent (4), Very Good (3), Average (2), Poor (1), or Terrible (0).
Below is review data from tripadvisor.com users who have visited East Asian countries. The data can be processed with K-means clustering, an unsupervised machine learning method, to find out how many groups the reviewers fall into. We can also extract further information by combining PCA with clustering.
Packages loaded for this analysis: dplyr, ggplot2, GGally, gridExtra, factoextra, plotly, Hmisc (which also loads lattice, survival, and Formula), and funModeling v.1.9.3.
## Observations: 980
## Variables: 11
## $ User.ID <fct> User 1, User 2, User 3, User 4, User 5, User 6, User 7,...
## $ Category.1 <dbl> 0.93, 1.02, 1.22, 0.45, 0.51, 0.99, 0.90, 0.74, 1.12, 0...
## $ Category.2 <dbl> 1.80, 2.20, 0.80, 1.80, 1.20, 1.28, 1.36, 1.40, 1.76, 1...
## $ Category.3 <dbl> 2.29, 2.66, 0.54, 0.29, 1.18, 0.72, 0.26, 0.22, 1.04, 0...
## $ Category.4 <dbl> 0.62, 0.64, 0.53, 0.57, 0.57, 0.27, 0.32, 0.41, 0.64, 0...
## $ Category.5 <dbl> 0.80, 1.42, 0.24, 0.46, 1.54, 0.74, 0.86, 0.82, 0.82, 1...
## $ Category.6 <dbl> 2.42, 3.18, 1.54, 1.52, 2.02, 1.26, 1.58, 1.50, 2.14, 1...
## $ Category.7 <dbl> 3.19, 3.21, 3.18, 3.18, 3.18, 3.17, 3.17, 3.17, 3.18, 3...
## $ Category.8 <dbl> 2.79, 2.63, 2.80, 2.96, 2.78, 2.89, 2.66, 2.81, 2.79, 2...
## $ Category.9 <dbl> 1.82, 1.86, 1.31, 1.57, 1.18, 1.66, 1.22, 1.54, 1.41, 2...
## $ Category.10 <dbl> 2.42, 2.32, 2.50, 2.86, 2.54, 3.66, 3.22, 2.88, 2.54, 3...
Each variable is described below:
1. User.ID: Unique user id
2. Category.1: Average user feedback on art galleries
3. Category.2: Average user feedback on dance clubs
4. Category.3: Average user feedback on juice bars
5. Category.4: Average user feedback on restaurants
6. Category.5: Average user feedback on museums
7. Category.6: Average user feedback on resorts
8. Category.7: Average user feedback on parks/picnic spots
9. Category.8: Average user feedback on beaches
10. Category.9: Average user feedback on theaters
11. Category.10: Average user feedback on religious institutions
Inspecting the data structure
## User.ID Category.1 Category.2 Category.3 Category.4 Category.5 Category.6
## 1 User 1 0.93 1.8 2.29 0.62 0.8 2.42
## Category.7 Category.8 Category.9 Category.10
## 1 3.19 2.79 1.82 2.42
Checking for missing values
## User.ID Category.1 Category.2 Category.3 Category.4 Category.5
## 0 0 0 0 0 0
## Category.6 Category.7 Category.8 Category.9 Category.10
## 0 0 0 0 0
Conclusion: no missing values were found.
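The missing-value check can be reproduced with base R. The snippet below is a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical, not the TripAdvisor data); it produces one count of `NA` entries per variable, like the output above.

```r
# Hypothetical toy data with one missing rating (not the real dataset)
toy <- data.frame(
  Art  = c(0.93, NA, 1.22),
  Club = c(1.80, 2.20, 0.80)
)

# is.na() marks missing cells; colSums() adds them up per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# TRUE only when no column contains a missing value
no_missing <- !anyNA(toy)
print(no_missing)
```

On the real data, every count being zero is what justifies skipping imputation before PCA and clustering.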
Renaming the category variables to descriptive names and dropping the user ID variable
# select() and rename() come from dplyr
trip <- trip %>%
  select(-User.ID) %>%
  rename(Art = Category.1,
         Club = Category.2,
         Bar = Category.3,
         Restaurant = Category.4,
         Museum = Category.5,
         Resort = Category.6,
         Park = Category.7,
         Beach = Category.8,
         Theatre = Category.9,
         Institutions = Category.10)
head(trip, 2)
## Art Club Bar Restaurant Museum Resort Park Beach Theatre Institutions
## 1 0.93 1.8 2.29 0.62 0.80 2.42 3.19 2.79 1.82 2.42
## 2 1.02 2.2 2.66 0.64 1.42 3.18 3.21 2.63 1.86 2.32
## Standard deviations (1, .., p=10):
## [1] 1.7255447 1.1237411 1.1104500 1.0317317 1.0098612 0.8988780 0.8393180
## [8] 0.6782247 0.5734843 0.3755038
##
## Rotation (n x k) = (10 x 10):
## PC1 PC2 PC3 PC4 PC5
## Art -0.01533238 0.31907527 -0.520473005 0.41411084 -0.31367748
## Club 0.12965685 -0.29722301 0.603205768 0.17896740 -0.03662148
## Bar 0.42881276 0.28305662 -0.046071730 -0.34153182 0.12952900
## Restaurant 0.22864226 0.01897704 0.078773787 0.74890429 0.02242080
## Museum 0.32181751 -0.34565736 -0.007916925 -0.21441024 -0.56644367
## Resort 0.41744888 -0.24905167 -0.161710035 0.02295544 -0.40737387
## Park 0.49573266 0.14264171 -0.089886330 -0.10745970 0.34490103
## Beach -0.10481328 -0.41966084 -0.537430519 -0.11753217 0.16641040
## Theatre 0.03977775 -0.58584628 -0.179104270 0.18603647 0.42399390
## Institutions -0.45896330 -0.09046983 0.041478568 -0.11185528 -0.26410789
## PC6 PC7 PC8 PC9 PC10
## Art 0.50330680 -0.198420995 -0.03987369 -0.25060928 -0.009942147
## Club 0.34345200 -0.598133204 -0.05571957 -0.12293371 -0.060662580
## Bar 0.19608956 -0.114115540 0.52180303 0.12477140 -0.511013160
## Restaurant -0.43167726 0.036949059 0.42824895 0.09221006 -0.013660185
## Museum -0.11798417 0.216103985 0.23045888 -0.53691710 0.089717049
## Resort 0.05087168 -0.001439142 -0.29880387 0.69220114 -0.007935260
## Park 0.07970356 -0.133089959 0.04034761 -0.04217483 0.753555820
## Beach -0.34631462 -0.588493780 0.08075941 -0.05336670 -0.083260622
## Theatre 0.46256416 0.415093811 0.12106882 -0.01512579 -0.087110340
## Institutions 0.21717814 -0.073718713 0.61190451 0.35458961 0.380027791
Viewing the biplot
Viewing the PCA summary
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.7255 1.1237 1.1105 1.0317 1.0099 0.8989 0.83932 0.6782
## Proportion of Variance 0.2978 0.1263 0.1233 0.1065 0.1020 0.0808 0.07045 0.0460
## Cumulative Proportion 0.2978 0.4240 0.5473 0.6538 0.7558 0.8366 0.90701 0.9530
## PC9 PC10
## Standard deviation 0.57348 0.3755
## Proportion of Variance 0.03289 0.0141
## Cumulative Proportion 0.98590 1.0000
Examining the correlations between variables
The main takeaways from the correlations are:
Park and Institutions have the strongest negative correlation.
Bar and Park have the strongest positive correlation.
Observations 642, 100, and 331 are outliers.
The first 7 principal components capture about 90% of the total information (cumulative proportion of variance 0.907).
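The 90% figure can be checked directly from the standard deviations printed in the PCA output. The sketch below copies those ten values into a vector (`sdev` is just a name chosen here) and recomputes the proportion of variance and its cumulative sum; it assumes nothing beyond the printed numbers.

```r
# Standard deviations of PC1..PC10, copied from the PCA output above
sdev <- c(1.7255447, 1.1237411, 1.1104500, 1.0317317, 1.0098612,
          0.8988780, 0.8393180, 0.6782247, 0.5734843, 0.3755038)

# Variance of each component is the squared standard deviation
prop_var <- sdev^2 / sum(sdev^2)
cum_var  <- cumsum(prop_var)

# Smallest number of components whose cumulative share reaches 90%
n_pc <- which(cum_var >= 0.90)[1]

print(round(cum_var, 4))
print(n_pc)
```

The cumulative share first crosses 0.90 at the seventh component, which matches the "Cumulative Proportion" row of the summary.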
Several approaches can be used to choose k for clustering:
1. Silhouette
2. Elbow
3. Gap Statistic
First, we try the Elbow method. It uses the within-cluster sum of squares (wss) to pick the optimal k: the best k is where adding another cluster no longer reduces wss by much.
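To make the wss idea concrete, here is a minimal base R sketch. It runs on hypothetical toy data (`toy` is two simulated groups, not the TripAdvisor ratings) and collects `tot.withinss` from `kmeans()` for each candidate k. It is an illustration of the quantity being plotted, not the exact computation done inside `fviz_nbclust()`.

```r
set.seed(123)

# Hypothetical toy data: two well-separated groups in two dimensions
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is the k after which the drop in wss becomes small
print(data.frame(k = k_values, wss = round(wss, 1)))
```

With two real groups in the data, the drop from k = 1 to k = 2 is far larger than any later drop, which is the bend the elbow plot is meant to reveal.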
1. Elbow Method
# Elbow method: plot the total within-cluster sum of squares (wss) against k
# and look for the bend where extra clusters stop helping much.
set.seed(123)
fviz_nbclust(trip, kmeans, method = "wss")
2. Silhouette Method
# Average silhouette method: measures how well each observation fits its own
# cluster compared with the nearest other cluster; higher is better.
set.seed(123)
fviz_nbclust(trip, kmeans, method = "silhouette")
3. Gap Statistic
# Gap statistic: compares the within-cluster dispersion with its expected
# value under a reference distribution built from B simulated samples.
set.seed(123)
gap_stat <- clusGap(trip, FUN = kmeans, nstart = 25, K.max = 10, B = 123)
## Warning: did not converge in 10 iterations
Based on the elbow, gap statistic, and silhouette methods, we choose k = 2.
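As an illustration of how the silhouette criterion picks k, the sketch below computes the average silhouette width for several values of k with `silhouette()` from the cluster package. The data are hypothetical (`toy` is simulated, not the review data), so the result only shows the mechanics, not the outcome reported for the TripAdvisor ratings.

```r
library(cluster)  # provides silhouette()

set.seed(123)

# Hypothetical toy data: two well-separated groups in two dimensions
toy <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 1), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 1), ncol = 2)
)
d <- dist(toy)  # pairwise Euclidean distances

# Average silhouette width for k = 2..6 (the silhouette needs at least 2 clusters)
k_values <- 2:6
avg_sil <- sapply(k_values, function(k) {
  cl <- kmeans(toy, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})

# The k with the highest average width is the suggested number of clusters
best_k <- k_values[which.max(avg_sil)]
print(round(avg_sil, 3))
print(best_k)
```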
## K-means clustering with 2 clusters of sizes 598, 382
##
## Cluster means:
## Art Club Bar Restaurant Museum Resort Park Beach
## 1 0.8850167 1.309900 0.491990 0.5003679 0.7862542 1.625284 3.176973 2.854331
## 2 0.9059948 1.419476 1.829398 0.5828010 1.1800000 2.183560 3.187147 2.804895
## Theatre Institutions
## 1 1.597124 2.925485
## 2 1.526099 2.601571
##
## Clustering vector:
## [1] 2 2 1 1 2 1 1 1 2 1 2 1 2 2 2 2 1 2 2 2 2 1 1 1 2 2 2 2 1 2 2 2 2 1 1 1 1
## [38] 1 2 1 1 1 1 1 1 1 1 1 1 2 1 2 2 1 1 1 2 1 2 2 2 1 1 1 1 1 2 1 2 1 1 2 1 2
## [75] 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 1 2 2 2 1 1 1 2 1 1 2 2 1 1 2 1
## [112] 2 1 2 2 1 2 1 1 1 2 1 2 1 2 1 1 1 1 1 2 2 2 2 1 1 2 2 1 1 1 1 1 1 1 1 1 2
## [149] 1 2 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2
## [186] 1 1 1 2 1 2 2 1 1 1 1 1 2 1 1 2 1 2 1 1 1 1 1 2 1 1 2 1 1 1 2 1 1 2 2 1 1
## [223] 2 1 2 1 1 2 1 2 1 2 1 1 1 1 2 1 2 1 2 2 1 1 1 1 1 2 1 1 2 1 2 2 1 1 1 1 1
## [260] 2 1 2 1 2 1 2 2 2 1 2 1 1 2 2 2 1 2 1 1 1 2 1 2 2 1 1 2 1 1 1 2 1 1 1 1 1
## [297] 1 2 2 1 1 2 2 1 2 1 1 2 1 2 1 1 2 2 1 1 2 1 2 1 1 1 1 1 1 2 2 2 2 2 1 2 1
## [334] 1 1 2 2 1 1 1 2 2 2 2 1 1 1 1 2 2 1 1 2 1 1 1 1 1 1 1 1 2 2 1 1 2 1 1 2 1
## [371] 1 2 2 2 1 2 2 1 1 1 1 1 1 2 1 2 2 1 1 1 2 1 2 1 2 2 2 2 1 2 1 2 1 1 2 2 2
## [408] 2 2 2 1 1 2 2 1 2 1 1 1 2 2 1 1 1 2 1 2 1 1 1 2 2 2 2 2 1 1 1 1 1 1 1 1 1
## [445] 2 1 2 2 2 1 1 1 1 2 1 2 1 1 2 1 1 1 1 2 1 2 1 1 1 1 2 1 1 1 2 2 2 1 2 2 1
## [482] 2 1 2 2 1 1 1 1 1 2 1 1 2 1 1 1 2 2 1 2 2 2 1 1 1 1 2 2 1 2 1 1 2 1 2 1 2
## [519] 1 1 1 2 2 1 2 2 2 1 1 1 1 1 2 2 2 2 2 2 2 1 1 1 1 1 2 1 1 1 2 1 1 2 1 1 1
## [556] 2 1 1 1 1 2 2 2 2 1 2 2 2 2 1 2 1 2 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1
## [593] 1 2 1 2 1 1 2 1 1 1 2 1 2 1 1 1 2 1 2 2 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 2 1
## [630] 1 1 1 1 2 1 2 2 2 1 1 2 1 1 1 2 1 1 1 1 2 1 1 1 1 2 2 1 2 1 1 2 1 2 1 1 2
## [667] 2 2 1 1 1 2 1 1 1 1 1 2 1 1 1 1 2 1 2 2 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 2 2
## [704] 1 1 1 1 2 1 1 2 1 1 1 1 1 1 2 2 2 1 1 2 1 1 2 1 2 1 2 2 1 1 2 1 1 1 2 2 1
## [741] 2 2 1 1 1 1 1 1 1 1 2 1 2 1 2 1 1 1 2 1 2 1 1 1 2 1 1 2 1 2 2 1 1 1 1 2 1
## [778] 2 1 1 2 2 1 1 1 2 2 1 1 2 2 2 2 1 1 2 1 1 2 2 1 2 2 1 2 2 1 2 2 2 1 2 2 1
## [815] 2 2 2 1 2 2 2 1 1 2 1 1 2 1 2 2 1 1 1 1 2 2 2 1 1 1 1 1 2 1 1 2 1 1 2 1 1
## [852] 2 1 2 2 1 2 2 1 2 2 1 2 1 1 1 2 2 1 1 1 1 2 1 1 1 2 1 1 2 1 1 1 2 2 2 1 1
## [889] 2 1 1 1 2 1 1 1 2 1 1 1 1 2 2 1 2 2 1 2 1 1 1 2 1 2 1 2 2 1 1 1 2 1 1 1 2
## [926] 2 2 1 1 1 1 2 1 1 1 2 1 2 2 2 2 1 2 1 1 1 1 1 1 2 1 2 2 2 2 2 2 1 1 1 2 1
## [963] 1 2 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 2
##
## Within cluster sum of squares by cluster:
## [1] 662.4259 517.3276
## (between_SS / total_SS = 32.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
The clustering produces two groups of clearly different sizes: cluster 1 has 598 observations and cluster 2 has 382.
## [1] 598 382
## [1] 1179.753
# Standard deviation of each rating category within each cluster
trip_1 <- trip %>%
  mutate(Cluster = trip_r$cluster) %>%
  group_by(Cluster) %>%
  summarise_at(1:10, "sd")
trip_1
## # A tibble: 2 x 11
## Cluster Art Club Bar Restaurant Museum Resort Park Beach Theatre
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0.339 0.453 0.352 0.259 0.363 0.451 0.00524 0.141 0.390
## 2 2 0.306 0.509 0.556 0.303 0.437 0.489 0.00713 0.126 0.316
## # ... with 1 more variable: Institutions <dbl>
Based on the cluster means in the k-means output, cluster 2 has higher average ratings for Art, Club, Bar, Restaurant, Museum, Resort, and Park, with the largest gap in Bar (1.83 versus 0.49). Cluster 1 has higher average ratings for Beach, Theatre, and Institutions.
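The comparison between clusters can be read off the cluster means printed in the k-means output. The sketch below copies those means into a matrix (`centers` and the helper names are chosen here for illustration) and lists which categories are higher in each cluster.

```r
categories <- c("Art", "Club", "Bar", "Restaurant", "Museum",
                "Resort", "Park", "Beach", "Theatre", "Institutions")

# Cluster means copied from the k-means output (rows = clusters 1 and 2)
centers <- matrix(
  c(0.8850167, 1.309900, 0.491990, 0.5003679, 0.7862542,
    1.625284, 3.176973, 2.854331, 1.597124, 2.925485,
    0.9059948, 1.419476, 1.829398, 0.5828010, 1.1800000,
    2.183560, 3.187147, 2.804895, 1.526099, 2.601571),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("1", "2"), categories)
)

# Positive difference: category is rated higher in cluster 2
diff_2_minus_1 <- centers["2", ] - centers["1", ]
higher_in_2 <- names(diff_2_minus_1)[diff_2_minus_1 > 0]
higher_in_1 <- names(diff_2_minus_1)[diff_2_minus_1 < 0]

# Category with the largest absolute gap between the two clusters
largest_gap <- names(which.max(abs(diff_2_minus_1)))

print(higher_in_1)
print(higher_in_2)
print(largest_gap)
```

Note that k-means cluster labels are arbitrary, so the numbering can swap between runs; the comparison should always be made against the centers of the same fitted object.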
Based on the unsupervised learning analysis, we can draw the following conclusions:
K-means clustering can be applied to this dataset. It splits the data into 2 clusters, with the between-cluster sum of squares accounting for 32% of the total sum of squares (between_SS / total_SS = 32.0%).
The separation between the 2 clusters is driven mainly by PC1.
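The 32% figure is the ratio `betweenss / totss` reported by `kmeans()`. The sketch below shows how that ratio relates to the other components of a fitted object, using hypothetical simulated data (`toy`) since the fitted object from this analysis is not reproduced here. The last lines only re-add the two within-cluster values printed in the output above.

```r
set.seed(123)

# Hypothetical toy data: two overlapping groups in two dimensions
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 1), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 1), ncol = 2)
)
fit <- kmeans(toy, centers = 2, nstart = 25)

# Total SS splits into a within-cluster part and a between-cluster part
decomposition_holds <- isTRUE(all.equal(fit$totss,
                                        fit$tot.withinss + fit$betweenss))

# Share of the total variation explained by the cluster assignment
ratio <- fit$betweenss / fit$totss
print(decomposition_holds)
print(round(100 * ratio, 1))

# Within-cluster SS from the output above; their sum is the reported 1179.753
reported_withinss <- c(662.4259, 517.3276)
print(sum(reported_withinss))
```

A ratio closer to 100% means tighter, better-separated clusters; a value of 32% suggests that the two groups still overlap considerably.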