The dataset was found on Kaggle (https://www.kaggle.com/datasets/argolof/predicting-terrorism). It holds almost 30K observations.
The following variables were used in the analysis:
Date
City
(# of people) Killed
(# of people) Injured
“Date” was filtered to cover only 2002 in order to reduce the number of observations. The “Country” variable showed little diversity, so only “City” was kept to reduce dimensionality. The “Killed” and “Injured” variables were used in full.
I initially tried to use the full set of observations, but my machine could not handle it and the outcomes were not meaningful, so I decided to narrow the focus.
In this analysis I will examine whether DBSCAN is indeed better suited to a dataset with large outliers, or whether other clustering methods can also give reasonable insights.
The analysis follows this structure:
Import dataset
Extract needed variables and filter observations
Convert character variables and check for NAs
Check outliers
Standardization and the Hopkins statistic
Clustering and Silhouette score: DBSCAN, K-means, PAM
library(readr)      # read_csv()
library(dbscan)     # kNNdistplot()
library(cluster)    # silhouette(), pam()
library(fpc)        # fpc::dbscan()
library(hopkins)    # Hopkins statistic
library(factoextra) # fviz_* helpers, eclust(), get_clust_tendency()
# Load the dataset with explicit column types for Date, Killed and Injured
attacks_data_UTF8 <- read_csv("C:/Users/stran/Desktop/UW/UL/Clustering/rel_attackes/attacks_data_UTF8.csv",
                              col_types = cols(Date = col_date(format = "%Y-%m-%d"),
                                               Killed = col_number(), Injured = col_number()))
head(attacks_data_UTF8, 5)
## # A tibble: 5 × 7
## ...1 Date Country City Killed Injured Description
## <dbl> <date> <chr> <chr> <dbl> <dbl> <chr>
## 1 1 2002-01-01 Indonesia Palu 1 0 Four bombs explode a…
## 2 2 2002-01-01 India Baramulla 1 0 Terrorists enter the…
## 3 3 2002-01-01 India Poshkar 2 0 Two civilians are ab…
## 4 4 2002-01-02 India Rajouri 6 9 Three separate terro…
## 5 5 2002-01-02 India Jehangir Chowk 2 25 A Muslim militant ki…
Now I transform the dataset to drop the features that won’t be used in clustering and to keep only observations from a single period (2002), since clustering the full dataset exceeds my machine’s capacity and takes too long.
att <- attacks_data_UTF8[, 2:6]          # keep Date, Country, City, Killed, Injured
att_02 <- att[att$Date < "2003-01-01", ] # restrict to attacks from 2002
head(att_02, 5)
## # A tibble: 5 × 5
## Date Country City Killed Injured
## <date> <chr> <chr> <dbl> <dbl>
## 1 2002-01-01 Indonesia Palu 1 0
## 2 2002-01-01 India Baramulla 1 0
## 3 2002-01-01 India Poshkar 2 0
## 4 2002-01-02 India Rajouri 6 9
## 5 2002-01-02 India Jehangir Chowk 2 25
Now I convert the ‘City’ column from character to numeric so that it can be used later on. Then I check for NAs.
att_02$City <- as.numeric(as.factor(att_02$City)) # encode City as numeric factor codes
summary(att_02) # check for NAs; the output shows the data has none
## Date Country City Killed
## Min. :2002-01-01 Length:564 Min. : 1.0 Min. : 0.000
## 1st Qu.:2002-03-31 Class :character 1st Qu.:116.8 1st Qu.: 1.000
## Median :2002-07-04 Mode :character Median :180.0 Median : 2.000
## Mean :2002-07-02 Mean :193.9 Mean : 5.051
## 3rd Qu.:2002-10-01 3rd Qu.:287.0 3rd Qu.: 4.000
## Max. :2002-12-31 Max. :374.0 Max. :216.000
## Injured
## Min. : 0.00
## 1st Qu.: 0.00
## Median : 1.00
## Mean : 10.74
## 3rd Qu.: 8.00
## Max. :521.00
str(att_02) # confirm that City has been converted to numeric
## tibble [564 × 5] (S3: tbl_df/tbl/data.frame)
## $ Date : Date[1:564], format: "2002-01-01" "2002-01-01" ...
## $ Country: chr [1:564] "Indonesia" "India" "India" "India" ...
## $ City : num [1:564] 276 37 285 294 157 175 180 219 297 214 ...
## $ Killed : num [1:564] 1 1 2 6 2 1 2 2 3 4 ...
## $ Injured: num [1:564] 0 0 0 9 25 0 0 3 12 4 ...
# Boxplots to inspect outliers in each variable
boxplot(att_02$City)
boxplot(att_02$Killed)
boxplot(att_02$Injured)
summary(att_02[,3:5])
## City Killed Injured
## Min. : 1.0 Min. : 0.000 Min. : 0.00
## 1st Qu.:116.8 1st Qu.: 1.000 1st Qu.: 0.00
## Median :180.0 Median : 2.000 Median : 1.00
## Mean :193.9 Mean : 5.051 Mean : 10.74
## 3rd Qu.:287.0 3rd Qu.: 4.000 3rd Qu.: 8.00
## Max. :374.0 Max. :216.000 Max. :521.00
The boxplots indicate significant outliers, and the quartile analysis from summary() confirms their existence. The data distribution is therefore far from uniform, and the outliers may skew the mean and standard deviation, making these measures less representative. Normally it might be necessary to investigate and possibly remove or adjust these outliers, but I keep them here deliberately, since I want to test DBSCAN on data with noise.
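For reference, a quick IQR-based count of potential outliers per variable could look like the snippet below (a hypothetical side check, not part of the workflow; nothing is removed).
# Count observations beyond 1.5 * IQR in each of the three variables (illustrative only)
sapply(att_02[, 3:5], function(x) {
  q <- quantile(x, c(0.25, 0.75))
  sum(x < q[1] - 1.5 * IQR(x) | x > q[2] + 1.5 * IQR(x))
})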
scaled <- scale(att_02[, 3:5]) # standardize City, Killed and Injured
hopkins(scaled) # Hopkins statistic for clustering tendency
## [1] 0.9970434
get_clust_tendency(scaled, 2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"))
## $hopkins_stat
## [1] 0.9816819
##
## $plot
The Hopkins statistic for this dataset is approximately 1. For the hopkins() and get_clust_tendency() implementations used here, values close to 1 indicate a strong clustering tendency, while values around 0.5 correspond to uniformly distributed data, so by this measure the data should be clusterable. The ordered dissimilarity plot is less encouraging: the absence of distinct, bright-colored blocks suggests that clearly delineated clusters may be hard to find. Having tested four different datasets, each yielding a Hopkins statistic consistently above 0.5, I have chosen to proceed with this one despite the mixed diagnostics.
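As a quick sanity check (not part of the original workflow), the Hopkins statistic of the scaled data can be compared with that of a uniform reference sample drawn over the same ranges; the reference value should land near 0.5. The seed value below is arbitrary.
# Hypothetical sanity check: Hopkins statistic of a uniform reference sample
set.seed(123)
unif_ref <- apply(scaled, 2, function(col) runif(length(col), min(col), max(col)))
hopkins(unif_ref) # expected to be close to 0.5 for spatially random data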
To determine a suitable epsilon (ε) parameter for DBSCAN, I started with the k-nearest-neighbour distance plot produced by the kNNdistplot function from the dbscan package. The knee of the curve suggests that an epsilon of about 0.5 is appropriate, which is highlighted by the horizontal reference line at 0.5.
I then applied DBSCAN with different epsilon values to compare the results. Initially I used eps = 1 and MinPts = 3 and visualized the clusters with fviz_cluster; afterwards I reduced epsilon to 0.5, the value suggested by the distance plot, and visualized the outcome again.
To evaluate the clustering further, I calculated the silhouette score for the clusters formed with eps = 0.5, excluding the noise points. The silhouette score measures how similar each point is to its own cluster compared with other clusters.
dbscan::kNNdistplot(scaled, k = 3) # k-NN distance plot used to pick eps
abline(h = 0.5, lty = 2, col = "pink") # candidate eps threshold
db_scaled <- fpc::dbscan(scaled, eps = 1, MinPts = 3) # first attempt with a larger eps
fviz_cluster(db_scaled, scaled, stand = FALSE, frame = FALSE, geom = "point")
db_scaled2 <- fpc::dbscan(scaled, eps = 0.5, MinPts = 3) # eps suggested by the kNN distance plot
fviz_cluster(db_scaled2, scaled, stand = FALSE, frame = FALSE, geom = "point")
# Drop noise points (cluster 0) before computing silhouettes
filtered_data <- scaled[db_scaled2$cluster != 0, ]
filtered_labels <- db_scaled2$cluster[db_scaled2$cluster != 0]
filtered_dist_matrix <- dist(filtered_data)
# Calculate silhouette values for the filtered data
sil_values <- silhouette(filtered_labels, filtered_dist_matrix)
# Print silhouette summary
summary(sil_values)
## Silhouette of 540 units in 2 clusters from silhouette.default(x = filtered_labels, dist = filtered_dist_matrix) :
## Cluster sizes and average silhouette widths:
## 536 4
## 0.5301062 0.8585949
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.4463 0.4914 0.5663 0.5325 0.6378 0.8887
# Visualize silhouette plot
plot(sil_values, border = NA, main = "Silhouette Plot for DBSCAN Clustering")
fviz_silhouette(sil_values, palette = "jco", label = TRUE, print.summary = TRUE) +
  theme_minimal() +
  labs(title = "Silhouette Plot for DBSCAN Clustering", x = "Silhouette Width", y = "Cluster") +
  theme(plot.title = element_text(size = 16, face = "bold"),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12),
        legend.text = element_text(size = 12))
## cluster size ave.sil.width
## 1 1 536 0.53
## 2 2 4 0.86
The average silhouette widths are 0.53 and 0.86 for the first and second clusters respectively, which implies that the clustering configuration is appropriate: objects within a cluster are similar to one another (high cohesion) and distinct from objects in other clusters (good separation).
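For a single overall figure, the mean silhouette width can be computed directly from the silhouette object (a small add-on to the output above, covering non-noise points only):
# Overall and per-cluster average silhouette widths for the DBSCAN solution (noise excluded)
mean(sil_values[, "sil_width"])
tapply(sil_values[, "sil_width"], sil_values[, "cluster"], mean)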
Below I try to determine the optimal number of clusters for both k-means and PAM (Partitioning Around Medoids). The silhouette method is used to evaluate the quality of clustering for different numbers of clusters, and the results are visualized with a classic theme.
library(gridExtra)
fviz_nbclust(scaled, FUNcluster = kmeans, method = "silhouette") + theme_classic()
fviz_nbclust(scaled, FUNcluster = cluster::pam, method = "silhouette") + theme_classic()
cl_kmeans1 <- eclust(scaled, k = 5, FUNcluster = "kmeans", hc_metric = "euclidean", graph = FALSE) # k-means with 5 clusters
fviz_silhouette(cl_kmeans1)
## cluster size ave.sil.width
## 1 1 217 0.57
## 2 2 305 0.51
## 3 3 4 0.21
## 4 4 36 0.09
## 5 5 2 0.43
cl_pam <- eclust(scaled, k = 5, FUNcluster = "pam", hc_metric = "pearson", graph = FALSE) # PAM with 5 clusters
fviz_silhouette(cl_pam)
## cluster size ave.sil.width
## 1 1 177 0.51
## 2 2 140 0.50
## 3 3 201 0.54
## 4 4 43 -0.04
## 5 5 3 0.29
fviz_cluster(cl_pam, data = scaled, ellipse.type = "convex", main = "Pam") + theme_minimal()
fviz_cluster(cl_kmeans1, data = scaled, ellipse.type = "convex") + theme_minimal()
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) excels at identifying outliers, which is a significant strength. However, based on the initial visualizations alone, I find the clustering outcomes of both PAM (Partitioning Around Medoids) and K-means more intuitive and visually appealing.
Visually, K-means and PAM produce clusters that appear more distinct and clearly segregated, which aligns well with an intuitive understanding of the dataset; this clarity is useful for quick assessments and initial interpretations. DBSCAN, however, excels at detecting clusters of arbitrary shape and size, which allows it to uncover subtle patterns that traditional methods such as K-means or PAM might overlook.
In my preliminary analysis, DBSCAN identified a cluster that would likely remain hidden with K-means or PAM, demonstrating its strength in revealing complex structures within the data. Moreover, the silhouette scores for the DBSCAN clusters were higher, indicating better-defined clusters with greater cohesion and separation than those produced by K-means or PAM. This suggests that, from a numerical standpoint, DBSCAN clustered the data more effectively.
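A compact numeric comparison can be obtained by collecting the average silhouette widths of the three solutions (a minimal sketch, assuming sil_values, cl_kmeans1 and cl_pam from the chunks above are still in memory):
# Average silhouette width per method (DBSCAN computed on non-noise points only)
data.frame(
  method  = c("DBSCAN (eps = 0.5, noise removed)", "k-means (k = 5)", "PAM (k = 5)"),
  avg_sil = c(mean(sil_values[, "sil_width"]),
              cl_kmeans1$silinfo$avg.width,
              cl_pam$silinfo$avg.width)
)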
It is worth remembering, however, that the clustering-tendency diagnostics were mixed: although the Hopkins statistic was high, the ordered dissimilarity plot did not reveal a clear block structure, so the data may not be strongly clusterable in general.
In conclusion, while the visual clarity of K-means and PAM makes them more immediately appealing for initial assessments, DBSCAN’s strength in identifying outliers and subtle clusters should not be underestimated. A more thorough investigation of the dataset would be needed to fully appreciate the intricate clusters detected by all of the algorithms.