Dataset yang digunakan berasal dari Gaia DR3 yang dirilis oleh European Space Agency. Dataset ini dapat diakses melalui portal resmi ESA atau melalui tautan berikut:https://gea.esac.esa.int/archive/
set.seed(123)
#load dataset
data <- read.csv("D:/coding/coding sem 4/Analisis Multivariat/tugas harian/tugas 8/data Gaia.csv")
#cek tipe data
str(data)
## 'data.frame': 20000 obs. of 20 variables:
## $ ra : num 45 45 44.9 44.9 45 ...
## $ dec : num 0.0436 0.0463 0.1182 0.1996 0.2197 ...
## $ parallax : num 0.196 -0.345 0.712 0.143 0.157 ...
## $ parallax_error : num 0.1335 0.2701 0.0489 0.1148 0.487 ...
## $ parallax_over_error : num 1.47 -1.277 14.575 1.248 0.322 ...
## $ pmra : num 6.57 3.09 0.91 2.18 2.66 ...
## $ pmra_error : num 0.155 0.325 0.055 0.132 0.675 ...
## $ pmdec : num -1.93 -5.23 5 -2.36 1.19 ...
## $ pmdec_error : num 0.1361 0.2835 0.0438 0.1083 0.5034 ...
## $ phot_g_mean_mag : num 18.1 19.3 16.3 17.9 19.8 ...
## $ phot_bp_mean_mag : num 18.7 20 16.8 18.3 20.7 ...
## $ phot_rp_mean_mag : num 17.5 18.5 15.7 17.4 18.9 ...
## $ phot_g_mean_flux_over_error : num 601 304 1731 779 210 ...
## $ phot_bp_mean_flux_over_error: num 39.15 14.02 269.26 54.53 7.91 ...
## $ phot_rp_mean_flux_over_error: num 58.1 25.3 310.8 61.4 17.4 ...
## $ bp_rp : num 1.27 1.41 1.082 0.874 1.731 ...
## $ ruwe : num 0.979 1.061 0.986 1.001 1.134 ...
## $ astrometric_excess_noise : num 0 0.239 0 0 1.658 ...
## $ astrometric_chi2_al : num 193 235 219 221 244 ...
## $ astrometric_gof_al : num -0.3807 1.2525 -0.2527 0.0575 2.5151 ...
summary(data)
## ra dec parallax parallax_error
## Min. :42.56 Min. :0.005615 Min. :-7.5589 Min. :0.01193
## 1st Qu.:44.70 1st Qu.:1.287861 1st Qu.: 0.3456 1st Qu.:0.09552
## Median :45.52 Median :1.817568 Median : 0.8444 Median :0.26702
## Mean :45.47 Mean :1.809208 Mean : 1.1281 Mean :0.40510
## 3rd Qu.:46.32 3rd Qu.:2.330488 3rd Qu.: 1.6119 3rd Qu.:0.57936
## Max. :47.80 Max. :3.571471 Max. :67.9335 Max. :3.89691
## parallax_over_error pmra pmra_error pmdec
## Min. : -4.3095 Min. :-107.4238 Min. :0.01287 Min. :-925.0263
## 1st Qu.: 0.9394 1st Qu.: 0.1463 1st Qu.:0.10132 1st Qu.: -9.4004
## Median : 3.0777 Median : 3.5680 Median :0.28304 Median : -4.2265
## Mean : 15.8188 Mean : 6.3042 Mean :0.44578 Mean : -6.7115
## 3rd Qu.: 10.5901 3rd Qu.: 9.2044 3rd Qu.:0.62107 3rd Qu.: -0.9846
## Max. :2476.2068 Max. :1400.2918 Max. :3.14349 Max. : 108.7685
## pmdec_error phot_g_mean_mag phot_bp_mean_mag phot_rp_mean_mag
## Min. :0.01128 Min. : 6.463 Min. : 6.955 Min. : 5.802
## 1st Qu.:0.08992 1st Qu.:17.198 1st Qu.:17.872 1st Qu.:16.401
## Median :0.25430 Median :18.920 Median :19.807 Median :17.981
## Mean :0.41548 Mean :18.318 Mean :19.064 Mean :17.458
## 3rd Qu.:0.57253 3rd Qu.:19.977 3rd Qu.:20.779 3rd Qu.:19.007
## Max. :3.04139 Max. :21.058 Max. :23.396 Max. :21.167
## phot_g_mean_flux_over_error phot_bp_mean_flux_over_error
## Min. : 18.07 Min. : 1.015
## 1st Qu.: 168.67 1st Qu.: 7.300
## Median : 341.91 Median : 15.627
## Mean : 799.23 Mean : 120.678
## 3rd Qu.: 914.56 3rd Qu.: 73.535
## Max. :9634.57 Max. :3782.464
## phot_rp_mean_flux_over_error bp_rp ruwe
## Min. : 2.294 Min. :-1.060 Min. : 0.7495
## 1st Qu.: 18.339 1st Qu.: 1.024 1st Qu.: 0.9809
## Median : 43.173 Median : 1.555 Median : 1.0248
## Mean : 199.277 Mean : 1.606 Mean : 1.1119
## 3rd Qu.: 155.534 3rd Qu.: 2.139 3rd Qu.: 1.0767
## Max. :4698.398 Max. : 4.655 Max. :34.1544
## astrometric_excess_noise astrometric_chi2_al astrometric_gof_al
## Min. : 0.00000 Min. : 54.57 Min. : -4.7528
## 1st Qu.: 0.00000 1st Qu.: 177.22 1st Qu.: -0.3286
## Median : 0.08324 Median : 207.85 Median : 0.5037
## Mean : 0.52717 Mean : 450.75 Mean : 1.6839
## 3rd Qu.: 0.70297 3rd Qu.: 240.27 3rd Qu.: 1.4591
## Max. :10.50597 Max. :410619.80 Max. :290.0931
library(tidyverse) # Manipulasi dan Visualisasi Data
library(flexclust) # k-median
library(dbscan) # DBSCAN
library(meanShiftR) # Mean-shift
library(e1071) # Fuzzy C-Means
library(cluster) # Silhouette Score
library(fpc) # Metrics lain
library(mclust) # ARI metrics untuk membandingkan cluster dengan label yang benar
#standarisasi
scaled <- scale(data)
scaled <- as.data.frame(scaled)
======================== KMEAN =============================== [penentuan klaster menggunakan elbow method] hasil elbow menunjukan grafik mulai turun drastis hingga pada klaster 3-4 dan mulai melandai setelah itu,sehingga total klaster ditentukan adalah 3 untuk analisis clustering menggunakan K-Mean Clustering
klus <- numeric(10)
set.seed(123)
for (i in 1:10) {
klus[i] <- kmeans(scaled, centers=i, nstart=25)$tot.withinss
}
plot(1:10, klus, type="b",
xlab="Jumlah Cluster (k)",
ylab="klus_val",
main="Elbow Method")
[penentuan kelas berdasarkan evaluasi Silhouette Score] dengan melakukan
looping terhadap model K-mean clustering dan evaluasi Silhouette Score
maka dapat diambil hasil sebagai berikut
set.seed(123)
for (k in 2:10) {
kmean_temp <- kmeans(scaled, centers=k, nstart=25)
sill <- silhouette(kmean_temp$cluster, dist(scaled))
cat("k =", k, "avg =", mean(sill[,3]), "\n")
}
## k = 2 avg = 0.3862473
## k = 3 avg = 0.2551473
## k = 4 avg = 0.2598647
## k = 5 avg = 0.2014831
## k = 6 avg = 0.1755149
## k = 7 avg = 0.1781538
## k = 8 avg = 0.1712497
## k = 9 avg = 0.1546204
## k = 10 avg = 0.1518328
di tentukan k=3 data selanjutnya dilakukan pemodelan menggunakan model K-mean
set.seed(123)
kmean_model <- kmeans(scaled, centers=3, nstart=25)
cluster dimasukan ke dalam dataframe scaled
scaled$cluster <- kmean_model$cluster
head(scaled)
## ra dec parallax parallax_error parallax_over_error
## 1 -0.4735717 -2.430303 -0.5897656 -0.6530777 -0.29038221
## 2 -0.4711103 -2.426567 -0.9320825 -0.3246222 -0.34595777
## 3 -0.5557789 -2.327605 -0.2632497 -0.8564939 -0.02516868
## 4 -0.5187441 -2.215559 -0.6232278 -0.6978655 -0.29487574
## 5 -0.3973694 -2.187910 -0.6145676 0.1968094 -0.31360090
## 6 -0.5133725 -2.188980 0.1235925 -0.6149237 -0.14078494
## pmra pmra_error pmdec pmdec_error phot_g_mean_mag
## 1 0.01522305 -0.6169215 0.31810922 -0.6207487 -0.08555113
## 2 -0.18596362 -0.2554013 0.09884629 -0.2932130 0.44481821
## 3 -0.31211814 -0.8278677 0.77899951 -0.8256220 -0.93565567
## 4 -0.23860449 -0.6648645 0.28934165 -0.6823465 -0.20084034
## 5 -0.21060182 0.4848737 0.52553759 0.1954100 0.67434952
## 6 -0.45925054 -0.5680823 -0.23938221 -0.6052978 -0.05478509
## phot_bp_mean_mag phot_rp_mean_mag phot_g_mean_flux_over_error
## 1 -0.15043317 -0.002304904 -0.18590405
## 2 0.39479961 0.532204529 -0.46516138
## 3 -1.01080134 -0.861593909 0.87449692
## 4 -0.35665290 -0.036271449 -0.01871451
## 5 0.70354101 0.716782806 -0.55324293
## 6 0.09166898 -0.146470683 -0.23297277
## phot_bp_mean_flux_over_error phot_rp_mean_flux_over_error bp_rp
## 1 -0.2906132 -0.3373898 -0.5083216
## 2 -0.3802030 -0.4159153 -0.2965959
## 3 0.5296668 0.2665592 -0.7933216
## 4 -0.2358138 -0.3295427 -1.1096703
## 5 -0.4019894 -0.4347450 0.1892754
## 6 -0.3172959 -0.3065313 0.7680448
## ruwe astrometric_excess_noise astrometric_chi2_al astrometric_gof_al
## 1 -0.16269314 -0.5771776 -0.04875876 -0.22290968
## 2 -0.06175747 -0.3155205 -0.04082505 -0.04657551
## 3 -0.15368671 -0.5771776 -0.04378172 -0.20909565
## 4 -0.13545920 -0.5771776 -0.04340617 -0.17560503
## 5 0.02648764 1.2382632 -0.03915613 0.08973757
## 6 -0.12248518 -0.5771776 -0.04522973 -0.15285281
## cluster
## 1 3
## 2 3
## 3 3
## 4 3
## 5 2
## 6 3
hasil evaluasi Silhouette Score
#silhouette score
library(cluster)
sill <- silhouette(kmean_model$cluster, dist(scaled))
plot(sill)
#visualisasi
plot(scaled[,1], scaled[,2],
col=scaled$cluster,
pch=19)
visualisasi menggunakan pca setelah dilakukan modeling(jangan
terbalik)
cluster <- scaled$cluster
data2 <- data[, !(names(data) %in% c("bp_rp"))]
scaled2<- scale(data2)
scaled2 <- as.data.frame(scaled2)
scaled21 <- scaled2[, !(names(scaled2) %in% c("cluster"))]
pca <- prcomp(scaled21)
pca_vis11 <- as.data.frame(pca$x[,1:2])
pca_vis11$cluster <- as.factor(cluster)
plot(pca_vis11[,1], pca_vis11[,2],
col=pca_vis11$cluster,
pch=19)
#cek distribusi klaster
table(scaled$cluster)
##
## 1 2 3
## 2194 5574 12232
________PCA_________ untuk mengurangi dimensi tinggi digunakan PCA
set.seed(123)
data2 = data[, !(names(data) %in% c("bp_rp"))]
scaled2=scale(data2)
pca <- prcomp(scaled2)
# ambil komponen utama
pca_data <- as.data.frame(pca$x[, 1:4])
#elbow method
klus_pca <- numeric(10)
set.seed(123)
for (i in 1:10) {
klus_pca[i] <- kmeans(pca_data, centers=i, nstart=25)$tot.withinss
}
plot(1:10, klus_pca, type="b",
xlab="Jumlah Cluster (k)",
ylab="klus_val",
main="Elbow Method")
[penentuan kelas berdasarkan evaluasi Silhouette Score] dengan melakukan
looping terhadap model K-mean clustering dan evaluasi Silhouette Score
maka dapat diambil hasil sebagai berikut
set.seed(123)
for (k in 2:10) {
km_temp <- kmeans(pca_data, centers=k, nstart=25)
sil <- silhouette(km_temp$cluster, dist(pca_data))
cat("k =", k, "avg =", mean(sil[,3]), "\n")
}
## k = 2 avg = 0.5284545
## k = 3 avg = 0.4364458
## k = 4 avg = 0.4419127
## k = 5 avg = 0.392971
## k = 6 avg = 0.3589015
## k = 7 avg = 0.3595987
## k = 8 avg = 0.3693389
## k = 9 avg = 0.3438449
## k = 10 avg = 0.3133077
pemodelan kmean dengan data PCA
set.seed(123)
km <- kmeans(pca_data, centers=3, nstart=25)
masukin kluster ke dataframe pca
pca_data$cluster <- km$cluster
head(pca_data)
## PC1 PC2 PC3 PC4 cluster
## 1 0.2528784 0.8342192 -1.1074683 -0.6026877 3
## 2 -0.8178928 0.5109209 -1.0154965 -0.4918189 3
## 3 2.1667607 0.7082544 -0.9994586 0.3433361 3
## 4 0.4830801 0.8199811 -1.2294265 -0.4329031 3
## 5 -1.8068619 -0.3753090 -0.3110857 0.1247008 2
## 6 0.3096919 0.6929271 -0.7100537 -0.9284087 3
evaluasi model menggunakan sill
library(cluster)
sil <- silhouette(km$cluster, dist(pca_data))
plot(sil)
hasil visualisasi menggunakan data pca yang dimodelkan(pca-model-visualisasi)
plot(pca_data[,1], pca_data[,2],
col=km$cluster,
pch=19,
main="Cluster Visualization (PCA)")
#distribusi kluster
table(pca_data$cluster)
##
## 1 2 3
## 2169 5570 12261
============================kMEDIAN==============================
#persiapan data untuk kmedian
library(flexclust)
library(cluster)
scaled_med <- scale(data)
scaled_med <- as.data.frame(scaled_med)
[penentuan klaster menggunakan elbow method] hasil elbow menunjukan grafik mulai turun drastis hingga pada klaster 3,lalu terjadi patahan yang tidak terlalu tajam pada titik ke 4, dan mulai melandai setelah itu,sehingga total klaster ditentukan adalah 4 untuk analisis K-Median Clustering model menggunakan fungsi kcca() dengan pendekatan K-medians
library(flexclust)
set.seed(123)
klus_med <- numeric(10)
data_mat <- as.matrix(scaled_med)
for (i in 2:10) {
model <- kcca(data_mat, k=i, family=kccaFamily("kmedians"))
cluster <- clusters(model)
centers <- parameters(model)
total_dist <- 0
for (j in 1:nrow(data_mat)) {
center <- centers[cluster[j], ]
total_dist <- total_dist + sum(abs(data_mat[j, ] - center))
}
klus_med[i] <- total_dist
}
plot(2:10, klus_med[2:10], type="b",
xlab="Jumlah Cluster (k)",
ylab="Total Distance",
main="Elbow Method (K-Medians)")
[penetuan kelas berdasarkan evaluasi Silhouette Score] dengan melakukan
looping terhadap model K-Median Clustering menggunakan fungsi kcca()
dengan pendekatan K-medians dan evaluasi Silhouette Score
library(flexclust)
library(cluster)
set.seed(123)
sil_scores <- numeric(10)
data_mat <- as.matrix(scaled_med)
for (k in 2:10) {
model <- kcca(data_mat, k=k, family=kccaFamily("kmedians"))
cluster <- clusters(model)
sil <- silhouette(cluster, dist(data_mat))
sil_scores[k] <- mean(sil[,3])
cat("k =", k, "avg =", sil_scores[k], "\n")
}
## k = 2 avg = 0.2562424
## k = 3 avg = 0.1805973
## k = 4 avg = 0.1475768
## k = 5 avg = 0.1038777
## k = 6 avg = 0.08913588
## k = 7 avg = 0.07991459
## k = 8 avg = 0.08111546
## k = 9 avg = 0.08387388
## k = 10 avg = 0.08016689
#modeling
set.seed(123)
data_mat <- as.matrix(scaled_med)
kmed_model <- kcca(data_mat, k=4,family=kccaFamily("kmedians"))
#masukin kluster ke scaled_med
library(flexclust)
scaled_med$cluster <- clusters(kmed_model)
head(scaled_med)
## ra dec parallax parallax_error parallax_over_error
## 1 -0.4735717 -2.430303 -0.5897656 -0.6530777 -0.29038221
## 2 -0.4711103 -2.426567 -0.9320825 -0.3246222 -0.34595777
## 3 -0.5557789 -2.327605 -0.2632497 -0.8564939 -0.02516868
## 4 -0.5187441 -2.215559 -0.6232278 -0.6978655 -0.29487574
## 5 -0.3973694 -2.187910 -0.6145676 0.1968094 -0.31360090
## 6 -0.5133725 -2.188980 0.1235925 -0.6149237 -0.14078494
## pmra pmra_error pmdec pmdec_error phot_g_mean_mag
## 1 0.01522305 -0.6169215 0.31810922 -0.6207487 -0.08555113
## 2 -0.18596362 -0.2554013 0.09884629 -0.2932130 0.44481821
## 3 -0.31211814 -0.8278677 0.77899951 -0.8256220 -0.93565567
## 4 -0.23860449 -0.6648645 0.28934165 -0.6823465 -0.20084034
## 5 -0.21060182 0.4848737 0.52553759 0.1954100 0.67434952
## 6 -0.45925054 -0.5680823 -0.23938221 -0.6052978 -0.05478509
## phot_bp_mean_mag phot_rp_mean_mag phot_g_mean_flux_over_error
## 1 -0.15043317 -0.002304904 -0.18590405
## 2 0.39479961 0.532204529 -0.46516138
## 3 -1.01080134 -0.861593909 0.87449692
## 4 -0.35665290 -0.036271449 -0.01871451
## 5 0.70354101 0.716782806 -0.55324293
## 6 0.09166898 -0.146470683 -0.23297277
## phot_bp_mean_flux_over_error phot_rp_mean_flux_over_error bp_rp
## 1 -0.2906132 -0.3373898 -0.5083216
## 2 -0.3802030 -0.4159153 -0.2965959
## 3 0.5296668 0.2665592 -0.7933216
## 4 -0.2358138 -0.3295427 -1.1096703
## 5 -0.4019894 -0.4347450 0.1892754
## 6 -0.3172959 -0.3065313 0.7680448
## ruwe astrometric_excess_noise astrometric_chi2_al astrometric_gof_al
## 1 -0.16269314 -0.5771776 -0.04875876 -0.22290968
## 2 -0.06175747 -0.3155205 -0.04082505 -0.04657551
## 3 -0.15368671 -0.5771776 -0.04378172 -0.20909565
## 4 -0.13545920 -0.5771776 -0.04340617 -0.17560503
## 5 0.02648764 1.2382632 -0.03915613 0.08973757
## 6 -0.12248518 -0.5771776 -0.04522973 -0.15285281
## cluster
## 1 3
## 2 1
## 3 3
## 4 3
## 5 2
## 6 1
hasil evaluasi Silhouette Score juga menunjukan rata rata skor silhouette keseluruhan adalah 43%,akurai ini menunjukan bahwa model cukup baik dalam mengklasterkan data menjadi 4 kelas.
library(cluster)
cluster_labels <- clusters(kmed_model)
sil_med <- silhouette(cluster_labels, dist(scaled_med))
plot(sil_med)
scaled_med$cluster <- clusters(kmed_model)
plot(scaled_med[,1], scaled_med[,2],
col=scaled_med$cluster,
pch=19,
main="Cluster Visualization (K-median)")
visualisasi menggunakan pca setelah dilakukan modeling(jangan
terbalik)
cluster2 <- scaled_med$cluster
data2 <- data[, !(names(data) %in% c("bp_rp"))]
scaled3<- scale(data2)
scaled3 <- as.data.frame(scaled3)
scaled22 <- scaled3[, !(names(scaled3) %in% c("cluster"))]
pca <- prcomp(scaled22)
pca_vis21 <- as.data.frame(pca$x[,1:2])
pca_vis21$cluster2 <- as.factor(cluster2)
plot(pca_vis21[,1], pca_vis21[,2],
col=pca_vis21$cluster2,
pch=19)
table(scaled_med$cluster)
##
## 1 2 3 4
## 7504 5007 5341 2148
___________PCA____________
#pca data
set.seed(123)
data2 = data[, !(names(data) %in% c("bp_rp"))]
scaled2=scale(data2)
pca <- prcomp(scaled2)
# ambil komponen utama
pca_med <- as.data.frame(pca$x[, 1:4])
#elbow
library(flexclust)
set.seed(123)
klus_pm <- numeric(10)
data_mat <- as.matrix(pca_med)
for (i in 2:10) {
model <- kcca(data_mat, k=i, family=kccaFamily("kmedians"))
cluster <- clusters(model)
centers <- parameters(model)
total_dist <- 0
for (j in 1:nrow(data_mat)) {
center <- centers[cluster[j], ]
total_dist <- total_dist + sum(abs(data_mat[j, ] - center))
}
klus_pm[i] <- total_dist
}
plot(2:10, klus_pm[2:10], type="b",
xlab="Jumlah Cluster (k)",
ylab="Total Distance",
main="Elbow Method (K-Medians)")
library(flexclust)
library(cluster)
set.seed(123)
sil_scores <- numeric(10)
data_mat <- as.matrix(pca_med)
for (k in 2:10) {
model <- kcca(data_mat, k=k, family=kccaFamily("kmedians"))
cluster <- clusters(model)
sil <- silhouette(cluster, dist(data_mat))
sil_scores[k] <- mean(sil[,3])
cat("k =", k, "avg =", sil_scores[k], "\n")
}
## k = 2 avg = 0.3616178
## k = 3 avg = 0.4302791
## k = 4 avg = 0.3404127
## k = 5 avg = 0.3242769
## k = 6 avg = 0.2859091
## k = 7 avg = 0.3025192
## k = 8 avg = 0.2827804
## k = 9 avg = 0.2626032
## k = 10 avg = 0.2631468
k = 3
set.seed(123)
data_mat <- as.matrix(pca_med)
kmed_pm <- kcca(data_mat, k=3,family=kccaFamily("kmedians"))
library(flexclust)
pca_med$cluster <- clusters(kmed_pm)
head(pca_med)
## PC1 PC2 PC3 PC4 cluster
## 1 0.2528784 0.8342192 -1.1074683 -0.6026877 1
## 2 -0.8178928 0.5109209 -1.0154965 -0.4918189 1
## 3 2.1667607 0.7082544 -0.9994586 0.3433361 1
## 4 0.4830801 0.8199811 -1.2294265 -0.4329031 1
## 5 -1.8068619 -0.3753090 -0.3110857 0.1247008 2
## 6 0.3096919 0.6929271 -0.7100537 -0.9284087 1
library(cluster)
cluster_labmed <- clusters(kmed_pm)
sil_pm <- silhouette(cluster_labmed, dist(pca_med))
plot(sil_pm)
pca_med$cluster <- clusters(kmed_pm)
plot(pca_med[,1], pca_med[,2],
col=pca_med$cluster,
pch=19,
main="Cluster Visualization (K-median_pca)")
table(pca_med$cluster)
##
## 1 2 3
## 11968 5572 2460
=========================DBSCAN================================
library(dbscan)
scaled_db = scale(data)
scaled_db = as.data.frame(scaled_db)
minPts ≈ (2*dim) 2 × 20 = 40
kNNdistplot(scaled_db, k = 40) # minPts
abline(h = , col="red", lty=2)
set.seed(123)
kNNdistplot(scaled_db, k = 40) # minPts
abline(h = 1.5, col="red", lty=2)
penentuan nilai epsilon dilakukan menggunakan pendekatan k-Nearest
Neighbor Distance (k-NN distance). Nilai jarak ke tetangga ke-k (dalam
hal ini k = 8) dihitung untuk setiap titik data menggunakan fungsi
kNNdist(), kemudian hasilnya diurutkan secara menaik. Pada pendekatan
ini, nilai ε ditentukan berdasarkan distribusi jarak, dengan mengambil
nilai pada persentil tertentu, yaitu 90%, 95%, dan 98%.Dengan ini
diperoleh 3 rekomendasi nilai epsilon yang akan digunakan
set.seed(123)
# Hitung jarak dengan knn
jarak_10nn <- kNNdist(scaled_db, k = 40)
jarak_sorted <- sort(jarak_10nn)
# Coba persentile: 90%, 95%, 98%
eps_90 <- quantile(jarak_sorted, 0.90)
eps_95 <- quantile(jarak_sorted, 0.95)
eps_98 <- quantile(jarak_sorted, 0.98)
cat("eps (90% data padat):", round(eps_90, 4), "\n")
## eps (90% data padat): 2.401
cat("eps (95% data padat):", round(eps_95, 4), "\n")
## eps (95% data padat): 3.117
cat("eps (98% data padat):", round(eps_98, 4), "\n")
## eps (98% data padat): 4.4821
# Visualisasi perbandingan
kNNdistplot(scaled_db, k = 10)
grid(col = "gray", lty = "dotted")
abline(h = eps_90, col = "green", lty = 2, lwd = 1.5) # 90%
abline(h = eps_95, col = "orange", lty = 2, lwd = 1.5) # 95%
abline(h = eps_98, col = "purple", lty = 2, lwd = 1.5) # 98%
abline(h = 62.3, col = "red", lty = 1, lwd = 1)
legend("topleft",
legend = c("eps 90%", "eps 95%", "eps 98%", "eps lama (62.3)"),
col = c("green", "orange", "purple", "red"),
lty = 2, cex = 0.8)
library(dbscan)
set.seed(123)
db_model <-dbscan::dbscan(scaled_db, eps = 4.4975, minPts = 40)
library(cluster)
set.seed(123)
# silhouette hanya untuk cluster > 0
sil_db <- silhouette(db_model$cluster, dist(scaled_db))
mean(sil[,3]) # rata-rata silhouette
## [1] 0.2631468
plot(sil_db)
scaled_db$cluster_db <- db_model$cluster
table(scaled_db$cluster_db)
##
## 0 1
## 176 19824
colors <- ifelse(scaled_db$cluster_db == 0, "blue", scaled_db$cluster_db + 1)
plot(scaled_db[,1], scaled_db[,2],
col=colors,
pch=19,
main="Cluster Visualization (DBSCAN)")
cluster_db <- scaled_db$cluster
data2 <- data[, !(names(data) %in% c("bp_rp"))]
scaled4<- scale(data2)
scaled4 <- as.data.frame(scaled4)
scaled23 <- scaled4[, !(names(scaled4) %in% c("cluster"))]
pca <- prcomp(scaled23)
pca_vis31 <- as.data.frame(pca$x[,1:2])
pca_vis31$cluster_db <- as.factor(cluster_db)
plot(pca_vis31[,1], pca_vis31[,2],
col=pca_vis31$cluster_db,
pch=19)
table(scaled_db$cluster)
##
## 0 1
## 176 19824
________________PCA_________________
library(dbscan)
set.seed(123)
data2 = data[, !(names(data) %in% c("bp_rp"))]
scaled2=scale(data2)
pca <- prcomp(scaled2)
# ambil komponen utama
pca_db <- as.data.frame(pca$x[, 1:4])
minPts ≈ (2*dim) 2 × 4 = 8
set.seed(123)
kNNdistplot(pca_db, k = 8) # minPts
abline(h = , col="red", lty=2)
set.seed(123)
kNNdistplot(pca_db, k = 10) # minPts
abline(h = 1.5, col="red", lty=2)
penentuan nilai epsilon dilakukan menggunakan pendekatan k-Nearest
Neighbor Distance (k-NN distance). Nilai jarak ke tetangga ke-k (dalam
hal ini k = 8) dihitung untuk setiap titik data menggunakan fungsi
kNNdist(), kemudian hasilnya diurutkan secara menaik. Pada pendekatan
ini, nilai ε ditentukan berdasarkan distribusi jarak, dengan mengambil
nilai pada persentil tertentu, yaitu 90%, 95%, dan 98%.Dengan ini
diperoleh 3 rekomendasi nilai epsilon yang akan digunakan
set.seed(123)
# Hitung jarak dengan knn
jarak_10nn <- kNNdist(pca_db, k = 8)
jarak_sorted <- sort(jarak_10nn)
# Coba beberapa persentile: 90%, 95%, 98%
eps_90 <- quantile(jarak_sorted, 0.90)
eps_95 <- quantile(jarak_sorted, 0.95)
eps_98 <- quantile(jarak_sorted, 0.98)
cat("eps (90% data padat):", round(eps_90, 4), "\n")
## eps (90% data padat): 0.4235
cat("eps (95% data padat):", round(eps_95, 4), "\n")
## eps (95% data padat): 0.687
cat("eps (98% data padat):", round(eps_98, 4), "\n")
## eps (98% data padat): 1.2376
# Visualisasi perbandingan
kNNdistplot(pca_db, k = 10)
grid(col = "gray", lty = "dotted")
abline(h = eps_90, col = "green", lty = 2, lwd = 1.5) # 90%
abline(h = eps_95, col = "orange", lty = 2, lwd = 1.5) # 95%
abline(h = eps_98, col = "purple", lty = 2, lwd = 1.5) # 98%
abline(h = 62.3, col = "red", lty = 1, lwd = 1)
legend("topleft",
legend = c("eps 90%", "eps 95%", "eps 98%"),
col = c("green", "orange", "purple", "red"),
lty = 2, cex = 0.8)
set.seed(123)
db_pm <- dbscan::dbscan(scaled_db, eps = 1.2376, minPts = 8)
library(cluster)
# silhouette hanya untuk cluster > 0
sil_p_db <- silhouette(db_pm$cluster, dist(pca_db))
mean(sil[,3]) # rata-rata silhouette
## [1] 0.2631468
plot(sil_p_db)
pca_db$cluster_db <- db_pm$cluster
table(pca_db$cluster_db)
##
## 0 1 2 3
## 4166 15824 6 4
colors <- ifelse(pca_db$cluster_db == 0, "blue", pca_db$cluster_db + 1)
plot(pca_db[,1], pca_db[,3],
col=colors,
pch=19,
main="Cluster Visualization (DBSCAN PCA)")
===================MEAN SHIFT dan fuzzy c-mean=====================
#load dataset
data <- read.csv("D:/coding/coding sem 4/Analisis Multivariat/tugas harian/tugas 8/data Gaia.csv")
# head(data)
names(data)
## [1] "ra" "dec"
## [3] "parallax" "parallax_error"
## [5] "parallax_over_error" "pmra"
## [7] "pmra_error" "pmdec"
## [9] "pmdec_error" "phot_g_mean_mag"
## [11] "phot_bp_mean_mag" "phot_rp_mean_mag"
## [13] "phot_g_mean_flux_over_error" "phot_bp_mean_flux_over_error"
## [15] "phot_rp_mean_flux_over_error" "bp_rp"
## [17] "ruwe" "astrometric_excess_noise"
## [19] "astrometric_chi2_al" "astrometric_gof_al"
library(dplyr)
# Cleaning Data
clean_data <- data %>%
filter(
parallax > 0, # jarak valid
parallax_over_error > 5, # sinyal kuat
ruwe < 1.4, # kualitas astrometri bagus
astrometric_excess_noise < 1, # noise rendah
phot_g_mean_flux_over_error > 50, # fotometri bagus
phot_bp_mean_flux_over_error > 20, # fotometri bagus
phot_rp_mean_flux_over_error > 20, # fotometri bagus
!is.na(bp_rp), # warna valid
phot_bp_mean_mag < 18, # hindari objek terlalu redup
abs(pmra) < 50, # hindari gerakan ekstrem
abs(pmdec) < 50
)
remove_outlier <- function(x) {
Q1 <- quantile(x, 0.25)
Q3 <- quantile(x, 0.75)
IQR <- Q3 - Q1
x >= (Q1 - 1.5*IQR) & x <= (Q3 + 1.5*IQR)
}
clean_data <- clean_data[
remove_outlier(clean_data$parallax) &
remove_outlier(clean_data$phot_bp_mean_mag),
]
# Standarisasi
set.seed(123)
scaled <- scale(clean_data)
head(scaled)
## ra dec parallax parallax_error parallax_over_error
## 1 -0.5894716 -2.401366 -0.6627475 0.09709794 -0.6003715
## 4 -0.2073499 -1.878702 0.5367149 -0.42450005 0.1238680
## 5 -0.1053962 -1.880102 0.7674640 -1.26889529 1.6965019
## 6 -0.1995066 -1.826076 -0.5907419 0.41280258 -0.6279018
## 7 -0.2721081 -1.701324 -0.8640934 0.10245696 -0.6759542
## 8 -0.3253414 -1.660678 -0.5532149 -0.66240299 -0.3142721
## pmra pmra_error pmdec pmdec_error phot_g_mean_mag
## 1 -0.49489293 0.2110222 1.2084250 -0.0109340 0.6368205
## 4 1.31392257 -0.6780964 -1.5809992 -0.5772195 -0.1977762
## 5 0.03037912 -1.2926366 -0.5019719 -1.2675357 -1.3897890
## 6 0.76626590 0.3606228 0.8862111 0.6217998 0.7202120
## 7 0.65144640 0.3222144 0.7914461 0.1928078 0.6365164
## 8 -0.56320606 -0.7095169 0.4105252 -0.7426648 -0.1265097
## phot_bp_mean_mag phot_rp_mean_mag phot_g_mean_flux_over_error
## 1 0.5722383 0.68010685 -0.506620719
## 4 -0.1609793 -0.23469672 0.652357960
## 5 -1.4000541 -1.36149459 2.367870831
## 6 0.6851487 0.73890380 -0.695540465
## 7 0.5355388 0.71863544 -0.629149950
## 8 -0.2264605 -0.06459793 -0.009000576
## phot_bp_mean_flux_over_error phot_rp_mean_flux_over_error bp_rp
## 1 -0.36263903 -0.55065191 -0.1921202
## 4 -0.07103512 -0.02093872 0.2259837
## 5 1.41910581 2.10495319 -0.7290301
## 6 -0.60315170 -0.69399215 0.0688648
## 7 -0.48388419 -0.24228984 -0.5054439
## 8 -0.27531863 -0.56493892 -0.7348570
## ruwe astrometric_excess_noise astrometric_chi2_al astrometric_gof_al
## 1 -0.5462404 -0.70491230 -0.13169816 -0.5554004
## 4 -2.1419193 -0.70491230 -0.88235502 -2.2887823
## 5 -1.3081528 -0.70491230 -0.77016330 -1.3359992
## 6 -0.4232824 -0.70491230 -0.20164042 -0.4207737
## 7 -0.8521914 -0.70491230 -0.19578369 -0.8957797
## 8 -0.3069737 0.05980238 -0.01446507 -0.2967915
data2 = clean_data[, !(names(clean_data) %in% c("bp_rp"))]
scaled2 = scale(data2)
pca <- prcomp(scaled2)
# ambil 4 komponen utama
pca_data <- as.matrix(pca$x[, 1:4])
head(pca_data)
## PC1 PC2 PC3 PC4
## 1 -1.5237312 -1.0159387 -1.717715 -1.5881859
## 4 0.6688832 -3.4136220 -1.702044 0.9448040
## 5 4.6116821 -2.6357220 -1.323726 0.2403015
## 6 -2.0651172 -0.7954661 -1.326218 -0.8389543
## 7 -1.6572042 -1.3403829 -1.216895 -1.1142627
## 8 0.2827574 -0.5606037 -1.082361 -1.3459090
nrow(pca_data)
## [1] 3714
data_sample <- pca_data[sample(nrow(pca_data), 2000), ] # Ambil sampel data untuk meringankan proses komputasi
data_clean <- as.matrix(data_sample)
head(data_clean)
## PC1 PC2 PC3 PC4
## 2678 -2.899226 -0.7625470 0.3542632 1.1175675
## 2730 1.005984 -1.7391699 2.0360217 -0.7083932
## 2428 -3.676731 1.0709040 2.0056274 0.2553468
## 574 -3.307732 -0.3013239 -0.7745895 -0.6534462
## 211 7.189455 1.6545760 -2.1771913 1.1907099
## 3231 -0.564175 1.6488640 -1.5314324 -1.7056050
ms_res <- meanShift(data_clean)
head(ms_res$assignment)
## [,1]
## [1,] 1
## [2,] 2
## [3,] 3
## [4,] 4
## [5,] 5
## [6,] 6
# Number of K
# Elbow Method
wss <- sapply(1:10, function(k){
kmeans(data_clean, centers = k, nstart = 20, iter.max = 100)$tot.withinss
})
par(mfrow = c(1, 1))
plot(1:10, wss, type = "b", pch = 19, frame = FALSE,
xlab = "Number of clusters K",
ylab = "Total within-clusters sum of squares",
main = "Elbow Method")
# Silhouette Analysis
avg_sil <- function(k) {
km_res <- kmeans(data_clean, centers = k, nstart = 25)
ss <- silhouette(km_res$cluster, dist(data_clean))
mean(ss[, 3])
}
k_values <- 2:10
avg_sil_values <- sapply(k_values, avg_sil)
par(mfrow = c(1, 1))
plot(k_values, avg_sil_values, type = "b", pch = 19, frame = FALSE,
xlab = "Number of clusters K",
ylab = "Average Silhouette Width",
main = "Silhouette Analysis")
Berdasarkan output diatas, tampak bahwa titik elbow mulai turun secara drastis pada K=2 sehingga kami menggunakan K = 2 untuk mendapatkan cluster yang meaningful. Hal itu juga didukung oleh Silhouette Analysis dengan K = 2 memiliki Silhouette score tertinggi yaitu 0.38 . Nilai ini menunjukkan bahwa struktur cluster cukup jelas dan terpisah dengan baik dibandingkan jumlah cluster lainnya.
fcm_res <- cmeans(data_clean, centers = 2, m = 2)
head(fcm_res$cluster)
## 2678 2730 2428 574 211 3231
## 2 1 2 2 1 2
par(mfrow = c(1, 3), mar = c(4, 4, 2, 1))
# Mean Shift
plot(data_clean[,1], data_clean[,2],
col = ms_res$assignment,
pch = 19,
main = "Mean Shift Clustering",
xlab = "PC1", ylab = "PC2")
# Fuzzy C-Means
plot(data_clean[,1], data_clean[,2],
col = fcm_res$cluster,
pch = 19,
main = "Fuzzy C-Means Clustering",
xlab = "PC1", ylab = "PC2")
# Plot original (tanpa clustering)
plot(data_clean[,1], data_clean[,2],
col = "black",
pch = 19,
main = "Original Data",
xlab = "PC1", ylab = "PC2")
Berdasarkan plot diatas, dapat diamati bahwa clustering menggunakan Mean-Shift menghasilkan point (warna) yang cukup banyak dan pola cluster tidak terlihat jelas. Plot Fuzzy C-Means menunjukkan adanya dua cluster dengan struktur yang jelas. Adapun untuk plot paling kanan menampilkan data asli sebelum dilakukan clustering.
# Ambil sampel data agar komputasi tidak berat
set.seed(123)
idx <- sample(1:nrow(data_clean), 500)
d <- dist(data_clean[idx, ])
cluster_clean <- as.numeric(as.factor(ms_res$assignment[idx]))
# Mean-Shift
# 1. Silhouette
mean(silhouette(ms_res$assignment[idx], d)[,3])
## [1] -0.09024909
# 2. Dunn-Index
stats <- cluster.stats(d, cluster_clean)
paste("Dunn Index:", stats$dunn)
## [1] "Dunn Index: 0.048962050245082"
paste("Within-cluster SS:", stats$within.cluster.ss)
## [1] "Within-cluster SS: 92.8077445443671"
Hasil clustering menggunakan Mean Shift menunjukkan performa yang kurang baik, ditandai dengan nilai Silhouette Score negatif (-0.13) yang mengindikasikan bahwa banyak data tidak terkelompok dengan benar. Selain itu, nilai Dunn Index yang rendah menunjukkan bahwa cluster tidak terpisah dengan baik sehingga bisa disimpulkan Mean-Shift Clustering gagal dilakukan dan tidak cocok untuk diimplementasikan pada Data Astrometri Gaia DR3
# Fuzzy C-Means
# 1. Silhouette
mean(silhouette(fcm_res$cluster[idx], d)[,3])
## [1] 0.3782663
# 2. Dunn-Index
stats <- cluster.stats(d, fcm_res$cluster[idx])
paste("Dunn Index:", stats$dunn)
## [1] "Dunn Index: 0.0303780143090635"
paste("Within-cluster SS:", stats$within.cluster.ss)
## [1] "Within-cluster SS: 4345.42600972149"
Berbeda dengan Mean Shift, metode Fuzzy C-Means menghasilkan performa yang lebih baik dengan nilai Silhouette Score yang cukup besar sekitar 0.37 yang menunjukkan bahwa cluster cukup terpisah dengan baik. Nilai Dunn Index yang lebih tinggi juga menunjukkan adanya peningkatan kualitas cluster sehingga Fuzzy C-Means cocok diterapkan pada Dataset Astrometri Gaia DR3 dengan cluster kiri (hitam) merepresentasikan bintang lebih dekat / normal dan cluster kanan (merah) merepresentasikan bintang lebih terang / jauh / berbeda karakteristik.