1 Introduction

The countries of East Asia are among the world's best tourist destinations; visitors can travel to Taiwan, Japan, or South Korea. Many applications already offer recommendations for tourist attractions, one of them being tripadvisor.com. The site also collects feedback and reviews from travelers who have visited East Asia, covering attractions such as bars, parks, museums, beaches, gardens, and so on. Visitors can also give a rating of Excellent (4), Very Good (3), Average (2), Poor (1), or Terrible (0).

The data below are reviews from tripadvisor.com users who have visited East Asian countries. The data can be processed with K-means clustering, an unsupervised machine learning method, to find out how many groups the reviewers can be divided into. We can also extract further information by combining PCA with clustering on the same data.

2 Loading libraries

The following packages are attached: dplyr, ggplot2, GGally, gridExtra, factoextra, plotly, Hmisc (together with lattice, survival, and Formula), and funModeling (v.1.9.3).

3 Loading the dataset

## Observations: 980
## Variables: 11
## $ User.ID     <fct> User 1, User 2, User 3, User 4, User 5, User 6, User 7,...
## $ Category.1  <dbl> 0.93, 1.02, 1.22, 0.45, 0.51, 0.99, 0.90, 0.74, 1.12, 0...
## $ Category.2  <dbl> 1.80, 2.20, 0.80, 1.80, 1.20, 1.28, 1.36, 1.40, 1.76, 1...
## $ Category.3  <dbl> 2.29, 2.66, 0.54, 0.29, 1.18, 0.72, 0.26, 0.22, 1.04, 0...
## $ Category.4  <dbl> 0.62, 0.64, 0.53, 0.57, 0.57, 0.27, 0.32, 0.41, 0.64, 0...
## $ Category.5  <dbl> 0.80, 1.42, 0.24, 0.46, 1.54, 0.74, 0.86, 0.82, 0.82, 1...
## $ Category.6  <dbl> 2.42, 3.18, 1.54, 1.52, 2.02, 1.26, 1.58, 1.50, 2.14, 1...
## $ Category.7  <dbl> 3.19, 3.21, 3.18, 3.18, 3.18, 3.17, 3.17, 3.17, 3.18, 3...
## $ Category.8  <dbl> 2.79, 2.63, 2.80, 2.96, 2.78, 2.89, 2.66, 2.81, 2.79, 2...
## $ Category.9  <dbl> 1.82, 1.86, 1.31, 1.57, 1.18, 1.66, 1.22, 1.54, 1.41, 2...
## $ Category.10 <dbl> 2.42, 2.32, 2.50, 2.86, 2.54, 3.66, 3.22, 2.88, 2.54, 3...

Each variable is described below:

  1. User.ID : Unique user id

  2. Category.1 : Average user feedback on art galleries

  3. Category.2 : Average user feedback on dance clubs

  4. Category.3 : Average user feedback on juice bars

  5. Category.4 : Average user feedback on restaurants

  6. Category.5 : Average user feedback on museums

  7. Category.6 : Average user feedback on resorts

  8. Category.7 : Average user feedback on parks/picnic spots

  9. Category.8 : Average user feedback on beaches

  10. Category.9 : Average user feedback on theaters

  11. Category.10 : Average user feedback on religious institutions

4 Exploratory Data Analysis

Inspecting the structure of the data

##   User.ID Category.1 Category.2 Category.3 Category.4 Category.5 Category.6
## 1  User 1       0.93        1.8       2.29       0.62        0.8       2.42
##   Category.7 Category.8 Category.9 Category.10
## 1       3.19       2.79       1.82        2.42

Checking for missing values

##     User.ID  Category.1  Category.2  Category.3  Category.4  Category.5 
##           0           0           0           0           0           0 
##  Category.6  Category.7  Category.8  Category.9 Category.10 
##           0           0           0           0           0

Conclusion: no missing values were found.

Renaming the category variables to descriptive names and dropping the user variable
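A check like the one above can be sketched in base R as follows. The small `reviews` data frame is a hypothetical stand-in that reuses the first three values of Category.1 and Category.2 from the dataset preview; the original report's code is not shown, so this is only one plausible way to produce the per-column counts.

```r
# Hypothetical stand-in for the review data (values taken from the preview above).
reviews <- data.frame(
  User.ID    = c("User 1", "User 2", "User 3"),
  Category.1 = c(0.93, 1.02, 1.22),
  Category.2 = c(1.80, 2.20, 0.80),
  stringsAsFactors = FALSE
)

# Count NA values per column; all zeros means nothing is missing.
na_counts <- colSums(is.na(reviews))
print(na_counts)

# anyNA() gives a single TRUE/FALSE answer for the whole data frame.
print(anyNA(reviews))
```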

##    Art Club  Bar Restaurant Museum Resort Park Beach Theatre Institutions
## 1 0.93  1.8 2.29       0.62   0.80   2.42 3.19  2.79    1.82         2.42
## 2 1.02  2.2 2.66       0.64   1.42   3.18 3.21  2.63    1.86         2.32
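The renaming step could look roughly like this in base R. The data frame below holds only the first two rows and three categories shown earlier, and the object names `reviews` and `ratings` are illustrative assumptions rather than the names used in the original analysis.

```r
# First two rows of the dataset, truncated to three categories for brevity.
reviews <- data.frame(
  User.ID    = c("User 1", "User 2"),
  Category.1 = c(0.93, 1.02),
  Category.2 = c(1.80, 2.20),
  Category.3 = c(2.29, 2.66),
  stringsAsFactors = FALSE
)

# Drop the identifier column, then assign descriptive names in column order.
ratings <- reviews[, setdiff(names(reviews), "User.ID")]
names(ratings) <- c("Art", "Club", "Bar")
print(ratings)
```

Dropping `User.ID` matters because the later PCA and K-means steps only accept numeric columns.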

5 Machine Learning

5.1 Visualizing the trip reviews with PCA and FactoMineR

## Standard deviations (1, .., p=10):
##  [1] 1.7255447 1.1237411 1.1104500 1.0317317 1.0098612 0.8988780 0.8393180
##  [8] 0.6782247 0.5734843 0.3755038
## 
## Rotation (n x k) = (10 x 10):
##                      PC1         PC2          PC3         PC4         PC5
## Art          -0.01533238  0.31907527 -0.520473005  0.41411084 -0.31367748
## Club          0.12965685 -0.29722301  0.603205768  0.17896740 -0.03662148
## Bar           0.42881276  0.28305662 -0.046071730 -0.34153182  0.12952900
## Restaurant    0.22864226  0.01897704  0.078773787  0.74890429  0.02242080
## Museum        0.32181751 -0.34565736 -0.007916925 -0.21441024 -0.56644367
## Resort        0.41744888 -0.24905167 -0.161710035  0.02295544 -0.40737387
## Park          0.49573266  0.14264171 -0.089886330 -0.10745970  0.34490103
## Beach        -0.10481328 -0.41966084 -0.537430519 -0.11753217  0.16641040
## Theatre       0.03977775 -0.58584628 -0.179104270  0.18603647  0.42399390
## Institutions -0.45896330 -0.09046983  0.041478568 -0.11185528 -0.26410789
##                      PC6          PC7         PC8         PC9         PC10
## Art           0.50330680 -0.198420995 -0.03987369 -0.25060928 -0.009942147
## Club          0.34345200 -0.598133204 -0.05571957 -0.12293371 -0.060662580
## Bar           0.19608956 -0.114115540  0.52180303  0.12477140 -0.511013160
## Restaurant   -0.43167726  0.036949059  0.42824895  0.09221006 -0.013660185
## Museum       -0.11798417  0.216103985  0.23045888 -0.53691710  0.089717049
## Resort        0.05087168 -0.001439142 -0.29880387  0.69220114 -0.007935260
## Park          0.07970356 -0.133089959  0.04034761 -0.04217483  0.753555820
## Beach        -0.34631462 -0.588493780  0.08075941 -0.05336670 -0.083260622
## Theatre       0.46256416  0.415093811  0.12106882 -0.01512579 -0.087110340
## Institutions  0.21717814 -0.073718713  0.61190451  0.35458961  0.380027791
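The output above has the layout printed by `prcomp()`, and the squared standard deviations sum to 10 (the number of variables), which suggests the ratings were standardised first. The sketch below assumes that setup and runs it on synthetic data; the column names and value ranges are made up for illustration.

```r
set.seed(1)
# Synthetic ratings (assumption): 100 users, 4 categories.
ratings <- data.frame(
  Art  = runif(100, 0, 4),
  Club = runif(100, 0, 4),
  Bar  = runif(100, 0, 4),
  Park = runif(100, 3.1, 3.3)
)

# center = TRUE and scale. = TRUE standardise each column before the rotation.
pca <- prcomp(ratings, center = TRUE, scale. = TRUE)
print(pca$sdev)                # standard deviation of each component
print(round(pca$rotation, 3))  # loadings: one column per component

# With standardised inputs the component variances sum to the number of columns.
print(sum(pca$sdev^2))
```

Scaling keeps a low-variance category such as Park from being ignored simply because its ratings span a narrow range.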

Viewing the biplot

Viewing the PCA summary

## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5    PC6     PC7    PC8
## Standard deviation     1.7255 1.1237 1.1105 1.0317 1.0099 0.8989 0.83932 0.6782
## Proportion of Variance 0.2978 0.1263 0.1233 0.1065 0.1020 0.0808 0.07045 0.0460
## Cumulative Proportion  0.2978 0.4240 0.5473 0.6538 0.7558 0.8366 0.90701 0.9530
##                            PC9   PC10
## Standard deviation     0.57348 0.3755
## Proportion of Variance 0.03289 0.0141
## Cumulative Proportion  0.98590 1.0000
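The proportions in this summary follow directly from the standard deviations printed earlier: each component's variance is its squared standard deviation, divided by the total. A short check using those printed values (the variable names are illustrative):

```r
# Standard deviations of the ten components, copied from the PCA output.
sdev <- c(1.7255447, 1.1237411, 1.1104500, 1.0317317, 1.0098612,
          0.8988780, 0.8393180, 0.6782247, 0.5734843, 0.3755038)

variance   <- sdev^2
proportion <- variance / sum(variance)
cumulative <- cumsum(proportion)
print(round(proportion, 4))
print(round(cumulative, 4))

# Smallest number of components whose cumulative share reaches 90%.
n_components <- which(cumulative >= 0.90)[1]
print(n_components)
```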

Viewing the correlations between variables

The main takeaways from the PCA and the correlations are:

  1. Park and Institutions have the strongest negative correlation.

  2. Bar and Park have the highest positive correlation.

  3. Observations 642, 100, and 331 are outliers.

  4. The first 7 of the 10 principal components capture about 90% of the total variance (cumulative proportion 0.907).
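Finding the most strongly correlated pairs does not require reading a plot; they can be pulled out of the correlation matrix. The sketch below uses synthetic data built so that Bar follows Park and Institutions moves against it, which is an assumption made only to give the example a known answer. The helper `strongest()` is hypothetical.

```r
set.seed(42)
# Synthetic ratings (assumption): Bar tracks Park, Institutions opposes Park.
park <- rnorm(200)
ratings <- data.frame(
  Park         = park,
  Bar          = park + rnorm(200, sd = 1),
  Institutions = -park + rnorm(200, sd = 0.3),
  Art          = rnorm(200)
)

corr <- cor(ratings)
# Keep each pair once by blanking the diagonal and the lower triangle.
corr[lower.tri(corr, diag = TRUE)] <- NA

# Return the pair of variable names at the max (or min) of the matrix.
strongest <- function(m, pick) {
  idx <- which(m == pick(m, na.rm = TRUE), arr.ind = TRUE)[1, ]
  c(rownames(m)[idx["row"]], colnames(m)[idx["col"]])
}

most_positive <- strongest(corr, max)
most_negative <- strongest(corr, min)
print(most_positive)
print(most_negative)
```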

5.2 Finding the optimal K

Several approaches can be used to choose the number of clusters k: the silhouette method, the elbow method, and the gap statistic.

We start with the elbow method. It uses the total within-cluster sum of squares (wss) to determine the optimal k: the best k is the point after which adding another cluster no longer reduces wss sharply.
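The elbow computation can be reproduced with base R alone. This is a minimal sketch on synthetic two-group data (an assumption; the report's own plotting code is not shown), where the sharp drop is expected between k = 1 and k = 2.

```r
set.seed(123)
# Synthetic data (assumption): two well-separated groups in two dimensions.
x <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6.
wss <- sapply(1:6, function(k) {
  kmeans(x, centers = k, nstart = 10, iter.max = 50)$tot.withinss
})
print(round(wss, 1))

# Reduction in wss gained by each additional cluster.
drops <- -diff(wss)
print(round(drops, 1))

# The "elbow" is where the curve flattens.
plot(1:6, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```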

1. Elbow Method

2. Silhouette Method
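The silhouette method scores each candidate k by the average silhouette width and picks the k with the highest score. The sketch below uses `silhouette()` from the `cluster` package on synthetic data; both the data and the choice of `cluster` (rather than whatever plotting helper the report used) are assumptions.

```r
library(cluster)  # provides silhouette()

set.seed(123)
# Synthetic data (assumption): two well-separated groups.
x <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2)
)
d <- dist(x)

# Average silhouette width for k = 2..6 (it is undefined for k = 1).
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(x, centers = k, nstart = 10, iter.max = 50)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- 2:6
print(round(avg_sil, 3))

best_k <- as.integer(names(which.max(avg_sil)))
print(best_k)
```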

3. Gap Statistic

## Warning: did not converge in 10 iterations

Based on the elbow, gap statistic, and silhouette methods, we take K = 2.

5.3 K-means clustering with the optimal K
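The `did not converge in 10 iterations` warning in the gap statistic output is the message `kmeans()` emits when it hits its default limit of `iter.max = 10`. Assuming that is the source here, raising the limit is one way to address it. The sketch below uses `clusGap()` from the `cluster` package on synthetic data; the values of `K.max`, `B`, `iter.max`, and `nstart` are illustrative choices.

```r
library(cluster)  # provides clusGap() and maxSE()

set.seed(123)
# Synthetic data (assumption): two well-separated groups.
x <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2)
)

# Extra arguments are forwarded to kmeans(): iter.max = 50 raises the default
# limit of 10 iterations, and nstart = 10 tries several random starts.
gap <- clusGap(x, FUNcluster = kmeans, K.max = 5, B = 20,
               iter.max = 50, nstart = 10, verbose = FALSE)
print(round(gap$Tab, 3))

# Choose k from the gap values and their simulation standard errors.
best_k <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
print(best_k)
```

A larger `B` gives a more stable estimate at the cost of run time.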

## K-means clustering with 2 clusters of sizes 598, 382
## 
## Cluster means:
##         Art     Club      Bar Restaurant    Museum   Resort     Park    Beach
## 1 0.8850167 1.309900 0.491990  0.5003679 0.7862542 1.625284 3.176973 2.854331
## 2 0.9059948 1.419476 1.829398  0.5828010 1.1800000 2.183560 3.187147 2.804895
##    Theatre Institutions
## 1 1.597124     2.925485
## 2 1.526099     2.601571
## 
## Clustering vector:
##   [1] 2 2 1 1 2 1 1 1 2 1 2 1 2 2 2 2 1 2 2 2 2 1 1 1 2 2 2 2 1 2 2 2 2 1 1 1 1
##  [38] 1 2 1 1 1 1 1 1 1 1 1 1 2 1 2 2 1 1 1 2 1 2 2 2 1 1 1 1 1 2 1 2 1 1 2 1 2
##  [75] 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 1 2 2 2 1 1 1 2 1 1 2 2 1 1 2 1
## [112] 2 1 2 2 1 2 1 1 1 2 1 2 1 2 1 1 1 1 1 2 2 2 2 1 1 2 2 1 1 1 1 1 1 1 1 1 2
## [149] 1 2 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2
## [186] 1 1 1 2 1 2 2 1 1 1 1 1 2 1 1 2 1 2 1 1 1 1 1 2 1 1 2 1 1 1 2 1 1 2 2 1 1
## [223] 2 1 2 1 1 2 1 2 1 2 1 1 1 1 2 1 2 1 2 2 1 1 1 1 1 2 1 1 2 1 2 2 1 1 1 1 1
## [260] 2 1 2 1 2 1 2 2 2 1 2 1 1 2 2 2 1 2 1 1 1 2 1 2 2 1 1 2 1 1 1 2 1 1 1 1 1
## [297] 1 2 2 1 1 2 2 1 2 1 1 2 1 2 1 1 2 2 1 1 2 1 2 1 1 1 1 1 1 2 2 2 2 2 1 2 1
## [334] 1 1 2 2 1 1 1 2 2 2 2 1 1 1 1 2 2 1 1 2 1 1 1 1 1 1 1 1 2 2 1 1 2 1 1 2 1
## [371] 1 2 2 2 1 2 2 1 1 1 1 1 1 2 1 2 2 1 1 1 2 1 2 1 2 2 2 2 1 2 1 2 1 1 2 2 2
## [408] 2 2 2 1 1 2 2 1 2 1 1 1 2 2 1 1 1 2 1 2 1 1 1 2 2 2 2 2 1 1 1 1 1 1 1 1 1
## [445] 2 1 2 2 2 1 1 1 1 2 1 2 1 1 2 1 1 1 1 2 1 2 1 1 1 1 2 1 1 1 2 2 2 1 2 2 1
## [482] 2 1 2 2 1 1 1 1 1 2 1 1 2 1 1 1 2 2 1 2 2 2 1 1 1 1 2 2 1 2 1 1 2 1 2 1 2
## [519] 1 1 1 2 2 1 2 2 2 1 1 1 1 1 2 2 2 2 2 2 2 1 1 1 1 1 2 1 1 1 2 1 1 2 1 1 1
## [556] 2 1 1 1 1 2 2 2 2 1 2 2 2 2 1 2 1 2 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1
## [593] 1 2 1 2 1 1 2 1 1 1 2 1 2 1 1 1 2 1 2 2 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 2 1
## [630] 1 1 1 1 2 1 2 2 2 1 1 2 1 1 1 2 1 1 1 1 2 1 1 1 1 2 2 1 2 1 1 2 1 2 1 1 2
## [667] 2 2 1 1 1 2 1 1 1 1 1 2 1 1 1 1 2 1 2 2 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 2 2
## [704] 1 1 1 1 2 1 1 2 1 1 1 1 1 1 2 2 2 1 1 2 1 1 2 1 2 1 2 2 1 1 2 1 1 1 2 2 1
## [741] 2 2 1 1 1 1 1 1 1 1 2 1 2 1 2 1 1 1 2 1 2 1 1 1 2 1 1 2 1 2 2 1 1 1 1 2 1
## [778] 2 1 1 2 2 1 1 1 2 2 1 1 2 2 2 2 1 1 2 1 1 2 2 1 2 2 1 2 2 1 2 2 2 1 2 2 1
## [815] 2 2 2 1 2 2 2 1 1 2 1 1 2 1 2 2 1 1 1 1 2 2 2 1 1 1 1 1 2 1 1 2 1 1 2 1 1
## [852] 2 1 2 2 1 2 2 1 2 2 1 2 1 1 1 2 2 1 1 1 1 2 1 1 1 2 1 1 2 1 1 1 2 2 2 1 1
## [889] 2 1 1 1 2 1 1 1 2 1 1 1 1 2 2 1 2 2 1 2 1 1 1 2 1 2 1 2 2 1 1 1 2 1 1 1 2
## [926] 2 2 1 1 1 1 2 1 1 1 2 1 2 2 2 2 1 2 1 1 1 1 1 1 2 1 2 2 2 2 2 2 1 1 1 2 1
## [963] 1 2 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 2
## 
## Within cluster sum of squares by cluster:
## [1] 662.4259 517.3276
##  (between_SS / total_SS =  32.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

The clustering produces two groups of noticeably different sizes: cluster 1 has 598 observations and cluster 2 has 382.
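The `between_SS / total_SS` figure in the output comes from the identity total SS = total within SS + between SS. The sketch below checks that identity on synthetic data (an assumption) and also adds up the two within-cluster values reported above.

```r
set.seed(123)
# Synthetic data (assumption): two groups of unequal size.
x <- rbind(
  matrix(rnorm(120, mean = 0), ncol = 2),
  matrix(rnorm(80, mean = 4), ncol = 2)
)

km <- kmeans(x, centers = 2, nstart = 25)
print(km$size)

# Share of the total sum of squares explained by the split into clusters.
ratio <- km$betweenss / km$totss
print(round(ratio, 3))

# Within-cluster sums of squares copied from the report's output.
reported_withinss <- c(662.4259, 517.3276)
reported_total_within <- sum(reported_withinss)
print(reported_total_within)
```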

## [1] 598 382
## [1] 1179.753

## # A tibble: 2 x 11
##   Cluster   Art  Club   Bar Restaurant Museum Resort    Park Beach Theatre
##     <int> <dbl> <dbl> <dbl>      <dbl>  <dbl>  <dbl>   <dbl> <dbl>   <dbl>
## 1       1 0.339 0.453 0.352      0.259  0.363  0.451 0.00524 0.141   0.390
## 2       2 0.306 0.509 0.556      0.303  0.437  0.489 0.00713 0.126   0.316
## # ... with 1 more variable: Institutions <dbl>

According to the cluster means reported by K-means, cluster 2 has higher average ratings for Art, Club, Bar, Restaurant, Museum, Resort, and Park, with by far the largest gap in Bar (1.83 versus 0.49). Cluster 1 has higher average ratings for Beach, Theatre, and Institutions.

6 Conclusion
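One way to read the cluster profiles is to subtract the two rows of cluster means reported by K-means. The values below are copied from that output; the object names are illustrative.

```r
# Cluster means copied from the kmeans output (rows are clusters 1 and 2).
centers <- rbind(
  c(0.8850167, 1.309900, 0.491990, 0.5003679, 0.7862542,
    1.625284, 3.176973, 2.854331, 1.597124, 2.925485),
  c(0.9059948, 1.419476, 1.829398, 0.5828010, 1.1800000,
    2.183560, 3.187147, 2.804895, 1.526099, 2.601571)
)
colnames(centers) <- c("Art", "Club", "Bar", "Restaurant", "Museum",
                       "Resort", "Park", "Beach", "Theatre", "Institutions")

# Positive values: cluster 2 has the higher mean; negative: cluster 1 does.
difference <- centers[2, ] - centers[1, ]
print(round(sort(difference, decreasing = TRUE), 3))

higher_in_2 <- names(difference)[difference > 0]
higher_in_1 <- names(difference)[difference < 0]
print(higher_in_2)
print(higher_in_1)
```

Sorting the differences also shows which categories separate the clusters most and which barely differ.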

Based on the unsupervised learning analysis above, we can conclude:

  1. K-means clustering can be applied to this dataset. It divides the data into 2 clusters, with 32% of the total sum of squares accounted for by the distance between clusters (between_SS / total_SS = 32.0%).

  2. The separation between the 2 clusters is dominated by PC1.