CLUSTERING ALGERIAN FOREST FIRE RISKS

Dataset Description

The dataset used in this study is the Algerian Forest Fires Dataset, obtained from the UCI Machine Learning Repository. It contains forest fire observations collected in two regions of Algeria, Bejaia and Sidi Bel-Abbes, from June 2012 to September 2012.

Only the 122 observations from the Bejaia region are used in this study. The analysis uses 10 numeric variables that describe weather conditions and forest fire indices: Temperature, RH (Relative Humidity), Ws (Wind Speed), Rain, FFMC (Fine Fuel Moisture Code), DMC (Duff Moisture Code), DC (Drought Code), ISI (Initial Spread Index), BUI (Buildup Index), and FWI (Fire Weather Index).

The variables day, month, and year are excluded from clustering because they only record the observation date. The variable Classes is also excluded because it is the class label (fire and not fire), whereas this study takes an unsupervised learning approach through clustering.

The dataset is available from the UCI Machine Learning Repository at the following link:

https://archive.ics.uci.edu/dataset/547/algerian+forest+fires+dataset

Import Data

# Library
library(cluster)

library(factoextra)

library(fpc)

library(clusterSim)

library(readxl)

# Import data
data <- read_excel("C:/Users/X13/Downloads/ALGERIAN FOREST FIREST FISKS.xlsx")
head(data)

## # A tibble: 6 × 14
##     day month  year Temperature    RH    Ws  Rain  FFMC   DMC    DC   ISI   BUI
##   <dbl> <dbl> <dbl>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     1     6  2012          29    57    18   0    65.7   3.4   7.6   1.3   3.4
## 2     2     6  2012          29    61    13   1.3  64.4   4.1   7.6   1     3.9
## 3     3     6  2012          26    82    22  13.1  47.1   2.5   7.1   0.3   2.7
## 4     4     6  2012          25    89    13   2.5  28.6   1.3   6.9   0     1.7
## 5     5     6  2012          27    77    16   0    64.8   3    14.2   1.2   3.9
## 6     6     6  2012          31    67    14   0    82.6   5.8  22.2   3.1   7  
## # ℹ 2 more variables: FWI <dbl>, Classes <chr>

str(data)

## tibble [122 × 14] (S3: tbl_df/tbl/data.frame)
##  $ day        : num [1:122] 1 2 3 4 5 6 7 8 9 10 ...
##  $ month      : num [1:122] 6 6 6 6 6 6 6 6 6 6 ...
##  $ year       : num [1:122] 2012 2012 2012 2012 2012 ...
##  $ Temperature: num [1:122] 29 29 26 25 27 31 33 30 25 28 ...
##  $ RH         : num [1:122] 57 61 82 89 77 67 54 73 88 79 ...
##  $ Ws         : num [1:122] 18 13 22 13 16 14 13 15 13 12 ...
##  $ Rain       : num [1:122] 0 1.3 13.1 2.5 0 0 0 0 0.2 0 ...
##  $ FFMC       : num [1:122] 65.7 64.4 47.1 28.6 64.8 82.6 88.2 86.6 52.9 73.2 ...
##  $ DMC        : num [1:122] 3.4 4.1 2.5 1.3 3 5.8 9.9 12.1 7.9 9.5 ...
##  $ DC         : num [1:122] 7.6 7.6 7.1 6.9 14.2 22.2 30.5 38.3 38.8 46.3 ...
##  $ ISI        : num [1:122] 1.3 1 0.3 0 1.2 3.1 6.4 5.6 0.4 1.3 ...
##  $ BUI        : num [1:122] 3.4 3.9 2.7 1.7 3.9 7 10.9 13.5 10.5 12.6 ...
##  $ FWI        : num [1:122] 0.5 0.4 0.1 0 0.5 2.5 7.2 7.1 0.3 0.9 ...
##  $ Classes    : chr [1:122] "not fire" "not fire" "not fire" "not fire" ...

summary(data)

##       day            month          year       Temperature          RH       
##  Min.   : 1.00   Min.   :6.0   Min.   :2012   Min.   :22.00   Min.   :45.00  
##  1st Qu.: 8.00   1st Qu.:7.0   1st Qu.:2012   1st Qu.:29.00   1st Qu.:60.00  
##  Median :16.00   Median :7.5   Median :2012   Median :31.00   Median :68.00  
##  Mean   :15.75   Mean   :7.5   Mean   :2012   Mean   :31.18   Mean   :67.98  
##  3rd Qu.:23.00   3rd Qu.:8.0   3rd Qu.:2012   3rd Qu.:34.00   3rd Qu.:77.75  
##  Max.   :31.00   Max.   :9.0   Max.   :2012   Max.   :37.00   Max.   :89.00  
##        Ws          Rain              FFMC            DMC        
##  Min.   :11   Min.   : 0.0000   Min.   :28.60   Min.   : 0.700  
##  1st Qu.:14   1st Qu.: 0.0000   1st Qu.:65.92   1st Qu.: 3.725  
##  Median :16   Median : 0.0000   Median :80.90   Median : 9.450  
##  Mean   :16   Mean   : 0.8426   Mean   :74.67   Mean   :12.315  
##  3rd Qu.:18   3rd Qu.: 0.5000   3rd Qu.:86.78   3rd Qu.:16.300  
##  Max.   :26   Max.   :16.8000   Max.   :90.30   Max.   :54.200  
##        DC              ISI              BUI             FWI        
##  Min.   :  6.90   Min.   : 0.000   Min.   : 1.10   Min.   : 0.000  
##  1st Qu.: 10.05   1st Qu.: 1.125   1st Qu.: 5.10   1st Qu.: 0.500  
##  Median : 35.55   Median : 2.650   Median :11.20   Median : 3.000  
##  Mean   : 53.16   Mean   : 3.656   Mean   :15.43   Mean   : 5.578  
##  3rd Qu.: 79.03   3rd Qu.: 5.600   3rd Qu.:21.68   3rd Qu.: 8.700  
##  Max.   :220.40   Max.   :12.500   Max.   :67.40   Max.   :30.200  
##    Classes         
##  Length:122        
##  Class :character  
##  Mode  :character  
##                    
##                    
##

# Select the numeric variables
data_num <- data[, c("Temperature",
                     "RH",
                     "Ws",
                     "Rain",
                     "FFMC",
                     "DMC",
                     "DC",
                     "ISI",
                     "BUI",
                     "FWI")]

# Standardize the data
data_scaled <- scale(data_num)
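
The standardization step can be illustrated on a small example. The sketch below uses hypothetical toy values (not rows from the dataset) and assumes the default behaviour of `scale()`, which centers each column on its mean and divides by its sample standard deviation. Standardizing matters here because variables such as DC and Rain are measured on very different ranges, and distance-based clustering would otherwise be dominated by the variables with the largest spread.

```r
# Toy data (hypothetical values, not taken from the dataset)
toy <- data.frame(
  Temperature = c(29, 26, 31, 33),
  RH          = c(57, 82, 67, 54)
)

# Manual z-score: subtract the column mean, divide by the sample standard deviation
manual_z <- sapply(toy, function(x) (x - mean(x)) / sd(x))

# scale() performs the same centering and scaling by default
auto_z <- scale(toy)

# The two results agree up to floating-point error
print(all.equal(unname(manual_z), unname(as.matrix(auto_z)),
                check.attributes = FALSE))

# After standardization every column has mean 0 and standard deviation 1
print(round(colMeans(auto_z), 10))
print(apply(auto_z, 2, sd))
```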

Determining the Number of Clusters

# Determine the number of clusters
# Elbow Method
fviz_nbclust(
  data_scaled,
  kmeans,
  method = "wss"
) +
  ggtitle("Elbow Method")

# Silhouette Method
fviz_nbclust(
  data_scaled,
  kmeans,
  method = "silhouette"
) +
  ggtitle("Silhouette Method")

# Optimal number of clusters
k <- 2

Discussion

The number of clusters was determined with the Elbow Method and the Silhouette Method. The Elbow Method identifies a suitable number of clusters from the Within-Cluster Sum of Squares (WSS). In the Elbow Method plot, the decrease in WSS starts to slow after k = 2, so a suitable number of clusters lies around that value.
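
To make the WSS curve concrete, here is a minimal sketch of how it could be computed by hand with base R. The toy data, the seed, and the range of k are assumptions chosen for illustration; this is not the computation `fviz_nbclust()` performs internally, only the same idea written out.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible

# Hypothetical toy data: two well-separated groups in two dimensions
toy <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1, ..., 6
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is the k after which the curve stops dropping sharply
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total WSS")
```

Because WSS always tends to shrink as k grows, the curve is read for the point where the improvement levels off rather than for its minimum.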

The Silhouette Method was then applied. In the Silhouette Method plot, the highest average silhouette width, about 0.34, occurs at k = 2. This indicates that two clusters give the best grouping quality among the numbers of clusters considered.

Based on both the Elbow Method and the Silhouette Method, 2 clusters were chosen.

K-Means Clustering

# K-MEANS CLUSTERING

set.seed(123)

kmeans_result <- kmeans(
  data_scaled,
  centers = k,
  nstart = 25
)
kmeans_result

## K-means clustering with 2 clusters of sizes 76, 46
## 
## Cluster means:
##   Temperature         RH          Ws       Rain       FFMC        DMC
## 1  -0.4624715  0.3773217  0.03233117  0.2007684 -0.4754015 -0.5703689
## 2   0.7640834 -0.6234010 -0.05341671 -0.3317043  0.7854459  0.9423485
##           DC        ISI        BUI        FWI
## 1 -0.5786084 -0.6141231 -0.5779694 -0.6262925
## 2  0.9559617  1.0146382  0.9549059  1.0347442
## 
## Clustering vector:
##   [1] 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 1 1 1 1 1 2 1
##  [38] 2 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 2 2 2 2 2 2 2 1 1 1 1 2 1 2 1 2 2 2 2 2
##  [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2
## [112] 1 2 1 1 1 1 1 1 1 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 454.5872 274.2449
##  (between_SS / total_SS =  39.8 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

# K-Means cluster sizes
table(kmeans_result$cluster)

## 
##  1  2 
## 76 46

# Plot K-Means
fviz_cluster(
  kmeans_result,
  data = data_scaled,
  geom = "point",
  ellipse.type = "convex",
  main = "Cluster K-Means"
)

Discussion

K-Means produced 2 clusters. In the cluster plot, the two clusters are fairly clearly separated. Cluster 1 contains 76 observations and Cluster 2 contains 46. The standardized cluster means show that Cluster 2 has above-average Temperature and fire indices (FFMC, DMC, DC, ISI, BUI, FWI) and below-average RH and Rain, while Cluster 1 shows the opposite pattern. K-Means therefore groups the data into two sets of observations with similar characteristics.
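
Cluster means on the standardized scale can be hard to read, so it may help to profile the clusters in the original units as well. The sketch below shows one way to do that with `aggregate()`; the toy values and the `labels` vector are hypothetical stand-ins for the real data and for a cluster assignment such as `kmeans_result$cluster`.

```r
# Hypothetical toy data and cluster labels (not taken from the dataset)
toy <- data.frame(
  Temperature = c(25, 27, 26, 34, 35, 36),
  RH          = c(85, 80, 82, 50, 55, 48)
)
labels <- c(1, 1, 1, 2, 2, 2)

# Mean of every variable within each cluster, in the original units
profile <- aggregate(toy, by = list(cluster = labels), FUN = mean)
print(profile)
```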

Hierarchical Clustering

# HIERARCHICAL CLUSTERING

dist_matrix <- dist(data_scaled)

hc <- hclust(
  dist_matrix,
  method = "ward.D2"
)


# Dendrogram
plot(
  hc,
  main = "Dendrogram Hierarchical Clustering",
  xlab = "",
  sub = ""
)

rect.hclust(
  hc,
  k = k,
  border = "red"
)

# Form the clusters
hc_cluster <- cutree(
  hc,
  k = k
)

hc_cluster

##   [1] 1 1 1 1 1 1 2 2 1 1 2 2 1 1 1 1 1 1 2 1 1 2 2 2 2 2 2 2 2 2 1 1 1 1 2 2 2
##  [38] 2 1 1 1 1 1 2 1 1 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2
## [112] 2 2 2 2 1 1 2 2 2 1 1

# Hierarchical cluster sizes
table(hc_cluster)

## hc_cluster
##  1  2 
## 50 72

# Plot Hierarchical
fviz_cluster(
  list(
    data = data_scaled,
    cluster = hc_cluster
  ),
  geom = "point",
  ellipse.type = "convex",
  main = "Cluster Hierarchical"
)

Discussion

Hierarchical clustering was performed with Ward's method (ward.D2) and 2 clusters. The dendrogram shows that the data can be divided into two main groups, marked by the red rectangles. Cutting the dendrogram at k = 2 yields two clusters with different characteristics.

The clustering assigns 50 observations to Cluster 1 and 72 to Cluster 2, so most of the data fall in Cluster 2.

The cluster plot shows that the two clusters are fairly clearly separated, although some observations lie close to the boundary between them. Overall, hierarchical clustering groups the data into two sets based on the similarity of their characteristics.
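
The build-then-cut workflow of hierarchical clustering can be seen on a tiny example. This is a minimal sketch on hypothetical points, using the same base R functions as the analysis (`dist()`, `hclust()` with `method = "ward.D2"`, and `cutree()`); the toy coordinates are assumptions chosen so that the two groups are obvious.

```r
# Hypothetical toy data: two tight groups of three points each
toy <- rbind(
  c(0.0, 0.1), c(0.2, 0.0), c(0.1, 0.3),
  c(5.0, 5.1), c(5.2, 4.9), c(4.9, 5.3)
)

# Euclidean distances, then Ward linkage, which at each step merges
# the pair of clusters that least increases the within-cluster variance
hc_toy <- hclust(dist(toy), method = "ward.D2")

# Cut the tree into two groups; one label per row of the input
groups <- cutree(hc_toy, k = 2)
print(groups)
```

The tree is built once and can then be cut at any k, which is why `rect.hclust()` and `cutree()` can share the same `hc` object.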

Silhouette Evaluation

Cluster quality was evaluated with the Silhouette Coefficient. The silhouette value measures how well an observation fits its own cluster compared with the other clusters. The higher the silhouette value, the better the quality of the resulting clusters.
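
As an illustration of what the silhouette value measures, the sketch below computes it by hand for four hypothetical one-dimensional points. For observation i, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the silhouette width is (b - a) / max(a, b). The helper `silhouette_width()` is a name introduced here for illustration, and the sketch assumes every cluster has at least two members.

```r
# Hypothetical toy data: four 1-D points in two clusters
x       <- c(1, 2, 8, 9)
cluster <- c(1, 1, 2, 2)
d       <- as.matrix(dist(x))

# Silhouette width of observation i (illustrative helper, not a library function)
silhouette_width <- function(i) {
  same <- cluster == cluster[i]
  same[i] <- FALSE
  # a: mean distance to the other members of the point's own cluster
  a <- mean(d[i, same])
  # b: smallest mean distance to the members of any other cluster
  others <- setdiff(unique(cluster), cluster[i])
  b <- min(sapply(others, function(g) mean(d[i, cluster == g])))
  (b - a) / max(a, b)
}

s <- sapply(seq_along(x), silhouette_width)
print(round(s, 3))
print(mean(s))  # average silhouette width of the whole partition
```

Values close to 1 mean an observation is much closer to its own cluster than to the nearest other cluster, values near 0 mean it sits between clusters, and negative values suggest it may be in the wrong cluster.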

# SILHOUETTE EVALUATION

sil_kmeans <- silhouette(
  kmeans_result$cluster,
  dist_matrix
)

sil_hc <- silhouette(
  hc_cluster,
  dist_matrix
)

# Plot Silhouette K-Means
fviz_silhouette(sil_kmeans)

##   cluster size ave.sil.width
## 1       1   76          0.35
## 2       2   46          0.33

Discussion

For K-Means, the average silhouette width is 0.34. This value indicates a moderate cluster structure: the two clusters are distinguishable, although some overlap remains.

Cluster 1 has an average silhouette width of 0.35 and Cluster 2 has 0.33. The positive averages in both clusters suggest that most observations are assigned to an appropriate cluster.

# Plot Silhouette Hierarchical
fviz_silhouette(sil_hc)

##   cluster size ave.sil.width
## 1       1   50          0.36
## 2       2   72          0.26

Discussion

For hierarchical clustering, the average silhouette width is 0.30. Cluster 1 has an average silhouette width of 0.36 and Cluster 2 has 0.26. Although both clusters still have positive averages, the cluster quality is lower than with K-Means.

Conclusion

Overall, the average silhouette width of K-Means is higher than that of hierarchical clustering. This indicates that K-Means yields better cluster quality and clearer separation between clusters than hierarchical clustering.

# Average silhouette width
silhouette_kmeans <- mean(sil_kmeans[,3])

silhouette_hc <- mean(sil_hc[,3])

Bootstrap Stability

# BOOTSTRAP STABILITY

boot_kmeans <- clusterboot(
  data_scaled,
  B = 100,
  clustermethod = kmeansCBI,
  krange = k,
  seed = 123
)

boot_hc <- clusterboot(
  data_scaled,
  B = 100,
  clustermethod = hclustCBI,
  method = "ward.D2",
  k = k,
  seed = 123
)

bootstrap_kmeans <- mean(boot_kmeans$bootmean)
bootstrap_kmeans

## [1] 0.8550133

bootstrap_hc <- mean(boot_hc$bootmean)
bootstrap_hc

## [1] 0.6233238

Discussion

Bootstrap Stability K-Means

The stability of the K-Means clusters was evaluated with bootstrap stability using 100 resamples (B = 100). The resulting value is 0.855, computed as the mean of the clusterwise Jaccard similarities (bootmean) over the two clusters. This indicates good stability: most cluster members stay together across the bootstrap resamples.

Bootstrap Stability Hierarchical Clustering

The stability of the hierarchical clusters was evaluated in the same way, with 100 resamples (B = 100). The resulting bootstrap stability is 0.623. The clusters are moderately stable, but less so than those from K-Means.

Perbandingan Bootstrap Stability

K-Means has a bootstrap stability of 0.855, while hierarchical clustering has 0.623. This indicates that the clusters produced by K-Means are more stable and more consistent under bootstrap resampling than those produced by hierarchical clustering.
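
The stability values above are averages of Jaccard similarities, so a short sketch of that measure may help. As documented for `clusterboot()`, each original cluster is compared with the most similar cluster found in each bootstrap resample, and `bootmean` averages those similarities over the resamples. The function `jaccard()` and the two index vectors below are hypothetical and only illustrate the similarity itself, not the full resampling procedure.

```r
# Jaccard similarity between two sets of observation indices
# (illustrative helper, not a function from fpc)
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Hypothetical example: a cluster found on the full data,
# and the best-matching cluster found on one bootstrap resample
original_cluster  <- c(1, 2, 3, 4, 5, 6)
bootstrap_cluster <- c(2, 3, 4, 5, 6, 7, 8)

# 5 shared observations out of 8 distinct ones gives 0.625
result <- jaccard(original_cluster, bootstrap_cluster)
print(result)
```

A similarity of 1 means the cluster was recovered exactly in the resample, and values near 0 mean it largely dissolved.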

# RESULTS TABLE

hasil_validasi <- data.frame(
  Metode = c(
    "K-Means",
    "Hierarchical"
  ),
  Silhouette = c(
    silhouette_kmeans,
    silhouette_hc
  ),
  Bootstrap_Stability = c(
    bootstrap_kmeans,
    bootstrap_hc
  )
)

print(hasil_validasi)

##         Metode Silhouette Bootstrap_Stability
## 1      K-Means  0.3402009           0.8550133
## 2 Hierarchical  0.3029253           0.6233238

# Determine the best method (based on the average silhouette width)
if(silhouette_kmeans > silhouette_hc){
  cat("\nThe best method is K-Means\n")
} else {
  cat("\nThe best method is Hierarchical Clustering\n")
}

## 
## The best method is K-Means

Discussion

The clustering methods were compared using the Silhouette Coefficient and bootstrap stability. K-Means obtained an average silhouette width of 0.3402 and a bootstrap stability of 0.8550. Hierarchical clustering obtained an average silhouette width of 0.3029 and a bootstrap stability of 0.6233.

On the silhouette criterion, K-Means gives better cluster quality than hierarchical clustering because its average silhouette width is higher. On the bootstrap stability criterion, K-Means also gives more stable clusters than hierarchical clustering.

K-Means therefore outperforms hierarchical clustering in both cluster quality and cluster stability.

Conclusion

In the clustering analysis of the Bejaia region data from the Algerian Forest Fires Dataset, the Elbow Method and the Silhouette Method both pointed to 2 clusters as the optimal number. Clustering was then carried out with K-Means and hierarchical clustering.

The evaluation shows that K-Means yields an average silhouette width of 0.3402 and a bootstrap stability of 0.8550, while hierarchical clustering yields an average silhouette width of 0.3029 and a bootstrap stability of 0.6233.

Based on these results, K-Means is selected as the better clustering method because it gives higher cluster quality and higher stability than hierarchical clustering.