Project Part 1 - Clustering

Clustering project for USL classes

The project aims to analyze deaths caused by drug overdoses using clustering methods. The data pertains to the state of Connecticut in the United States and includes recorded deaths related to drug use. Data is official and sourced from a platform that aggregates datasets provided by state agencies (https://data.ct.gov/Health-and-Human-Services/Accidental-Drug-Related-Deaths-2012-2023/rybz-nyjw/about_data).

Opis obrazka source: https://delamere.com/blog/top-10-most-dangerous-drugs

#Loading packages
library(factoextra)
library(flexclust)
library(fpc)
library(clustertend)
library(cluster)
library(ClusterR)
library(tidyverse)
library(dendextend)
library(dbscan)
library(RColorBrewer)
library(tidyverse)
library(dendextend)
library(leaflet)
library(ggplot2)
library(flexclust)
library(tidyr)
library(leaflet.extras)
library(hopkins)
library(ClustGeo)
library(NbClust)

#Loading data
drug_deaths <- read.csv("C:/Users/User/OneDrive/Pulpit/Studia magisterskie/USL/Accidental_Drug_Related_Deaths.csv")

drug_deaths$Sex <- ifelse(tolower(drug_deaths$Sex) == "male", 1, 0)
drug_deaths$Is_local <- ifelse(drug_deaths$Residence.State == "CT", 1, 0)
drug_deaths$Injury_place_residence <- ifelse(drug_deaths$Injury.Place == "Residence", 1, 0)
drug_deaths[ , 23:44] <- apply(drug_deaths[ , 23:44], 2, function(x) ifelse(x == "Y", 1, ifelse(x == "", 0, x)))
drug_deaths <- drug_deaths %>%
  mutate_at(23:44, ~as.numeric(ifelse(. == "Y", 1, ifelse(. == "", 0, .))))

drug_deaths <- drug_deaths %>%
  separate(InjuryCityGeo, into = c("CityGeo", "Longitude"), sep = ",\\s*-", remove = FALSE)
drug_deaths$Longitude <- sub("\\)$", "", drug_deaths$Longitude)
drug_deaths$Latitude <- sub(".*\\(([-0-9.]+).*", "\\1", drug_deaths$CityGeo)
drug_deaths$Longitude <- paste0("-", drug_deaths$Longitude)
drug_deaths$Longitude <- as.numeric(drug_deaths$Longitude)
drug_deaths$Latitude <- as.numeric(drug_deaths$Latitude)
drug_deaths <- drug_deaths %>%
  select(-Latitude, -Longitude, everything(), Latitude, Longitude)

#Dropping unnecsessary variables
drug_deaths_1 <- drug_deaths[ , !(names(drug_deaths) %in% c("Date", "Date.Type","Ethnicity","Residence.City","Residence.County","Injury.City","Injury.County","Residence.State","Injury.County","Injury.State","Injury.Place","Description.of.Injury","Manner.of.Death","Other.Significant.Conditions","Death.City","Death.County","Death.State","Location","Location.if.Other","Cause.of.Death","Other","ResidenceCityGeo","DeathCityGeo","InjuryCityGeo","CityGeo","Race"))]
drug_deaths_1 <- na.omit(drug_deaths_1)

Visualizing the data on a map of Connecticut State in US

leaflet(data = drug_deaths_1[27:28]) %>%
  addTiles() %>%
  addCircleMarkers(~Longitude, ~Latitude, radius = 4, color = "blue", fillOpacity = 0.7)

leaflet(data = drug_deaths_1[27:28]) %>%
  addTiles() %>%
  addHeatmap(lng = ~Longitude, lat = ~Latitude, intensity = 1, blur = 20, max = 0.05)

As we can see, the observations are fairly evenly distributed across the map of the state. On the heatmap, larger number of observations can be noted in the southern part of the state, where the largest cities, Bridgeport and New Haven, are located. Here I choose a random sample of the data to speed up the code, because working on original data was took to long

set.seed(123)
drug_deaths_sample<- drug_deaths_1[sample(nrow(drug_deaths_1), 2000), ]
#Data_preview
str(drug_deaths_sample)

## 'data.frame':    2000 obs. of  28 variables:
##  $ Age                          : int  37 31 52 41 60 53 36 35 56 48 ...
##  $ Sex                          : num  1 0 1 1 1 1 0 1 1 1 ...
##  $ Heroin                       : num  0 0 1 0 1 0 0 1 0 1 ...
##  $ Heroin.death.certificate..DC.: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Cocaine                      : num  0 0 1 0 0 0 1 0 0 0 ...
##  $ Fentanyl                     : num  1 1 1 0 0 0 1 1 1 1 ...
##  $ Fentanyl.Analogue            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Oxycodone                    : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ Oxymorphone                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Ethanol                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hydrocodone                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Benzodiazepine               : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Methadone                    : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Meth.Amphetamine             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Amphet                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Tramad                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hydromorphone                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Morphine..Not.Heroin.        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Xylazine                     : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Gabapentin                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Opiate.NOS                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Heroin.Morph.Codeine         : num  0 0 1 0 1 0 0 0 0 0 ...
##  $ Other.Opioid                 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Any.Opioid                   : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ Is_local                     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Injury_place_residence       : num  1 1 0 0 1 1 0 1 1 1 ...
##  $ Longitude                    : num  -72 -72 -72.9 -73.5 -72.8 ...
##  $ Latitude                     : num  41.4 41.4 41.4 41.4 41.5 ...
##  - attr(*, "na.action")= 'omit' Named int [1:542] 11 14 17 21 23 26 28 62 85 153 ...
##   ..- attr(*, "names")= chr [1:542] "11" "14" "17" "21" ...

summary(drug_deaths_sample)

##       Age             Sex            Heroin       Heroin.death.certificate..DC.
##  Min.   :14.00   Min.   :0.000   Min.   :0.0000   Min.   :0.000                
##  1st Qu.:34.00   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.000                
##  Median :44.00   Median :1.000   Median :0.0000   Median :0.000                
##  Mean   :44.08   Mean   :0.748   Mean   :0.2995   Mean   :0.055                
##  3rd Qu.:54.00   3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:0.000                
##  Max.   :84.00   Max.   :1.000   Max.   :1.0000   Max.   :1.000                
##     Cocaine          Fentanyl      Fentanyl.Analogue   Oxycodone    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.000     Min.   :0.000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000     1st Qu.:0.000  
##  Median :0.0000   Median :1.0000   Median :0.000     Median :0.000  
##  Mean   :0.3745   Mean   :0.6775   Mean   :0.083     Mean   :0.088  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.000     3rd Qu.:0.000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.000     Max.   :1.000  
##   Oxymorphone        Ethanol       Hydrocodone    Benzodiazepine  
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.000   Median :0.000   Median :0.0000  
##  Mean   :0.0135   Mean   :0.262   Mean   :0.012   Mean   :0.2235  
##  3rd Qu.:0.0000   3rd Qu.:1.000   3rd Qu.:0.000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.000   Max.   :1.000   Max.   :1.0000  
##    Methadone     Meth.Amphetamine     Amphet           Tramad    
##  Min.   :0.000   Min.   :0.000    Min.   :0.0000   Min.   :0.00  
##  1st Qu.:0.000   1st Qu.:0.000    1st Qu.:0.0000   1st Qu.:0.00  
##  Median :0.000   Median :0.000    Median :0.0000   Median :0.00  
##  Mean   :0.093   Mean   :0.011    Mean   :0.0315   Mean   :0.03  
##  3rd Qu.:0.000   3rd Qu.:0.000    3rd Qu.:0.0000   3rd Qu.:0.00  
##  Max.   :1.000   Max.   :1.000    Max.   :1.0000   Max.   :1.00  
##  Hydromorphone    Morphine..Not.Heroin.    Xylazine        Gabapentin    
##  Min.   :0.0000   Min.   :0.0000        Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000        1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000        Median :0.0000   Median :0.0000  
##  Mean   :0.0065   Mean   :0.0035        Mean   :0.0905   Mean   :0.0365  
##  3rd Qu.:0.0000   3rd Qu.:0.0000        3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000        Max.   :1.0000   Max.   :1.0000  
##    Opiate.NOS     Heroin.Morph.Codeine  Other.Opioid    Any.Opioid    
##  Min.   :0.0000   Min.   :0.000        Min.   :0.00   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.000        1st Qu.:0.00   1st Qu.:0.0000  
##  Median :0.0000   Median :0.000        Median :0.00   Median :1.0000  
##  Mean   :0.0115   Mean   :0.186        Mean   :0.01   Mean   :0.7455  
##  3rd Qu.:0.0000   3rd Qu.:0.000        3rd Qu.:0.00   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.000        Max.   :1.00   Max.   :1.0000  
##     Is_local      Injury_place_residence   Longitude         Latitude    
##  Min.   :0.0000   Min.   :0.0000         Min.   :-73.64   Min.   :41.03  
##  1st Qu.:1.0000   1st Qu.:0.0000         1st Qu.:-73.05   1st Qu.:41.31  
##  Median :1.0000   Median :1.0000         Median :-72.86   Median :41.55  
##  Mean   :0.8085   Mean   :0.5425         Mean   :-72.81   Mean   :41.52  
##  3rd Qu.:1.0000   3rd Qu.:1.0000         3rd Qu.:-72.65   3rd Qu.:41.76  
##  Max.   :1.0000   Max.   :1.0000         Max.   :-71.83   Max.   :42.03

any(is.na(drug_deaths_sample))

## [1] FALSE

Clustering

While working with the dataset, I experimented with various approaches and methods for applying clustering, many of which produced unremarkable results. It turns out that the type of substance consumed plays a very minor role, making it challenging to identify clusters using such data. After numerous attempts, I decided to divide the project into two parts:

The first part focuses on using variables that most frequently exhibited differences in clusters, which are also relatively general variables.
The second approach, which will be the primary focus, involves analysis of the geographic distribution of overdose cases, combined with the age of the individuals involved.

This division allows for a more targeted exploration of the data and improves the chances of uncovering meaningful insights.

Part 1

#Choosing variables
drug_deaths_sample_1 <- drug_deaths_sample[c("Age", "Sex","Is_local","Cocaine","Ethanol","Fentanyl","Injury_place_residence")]

hopkins(drug_deaths_sample_1)

## [1] 0.9513197

The Hopkins statistic is very close to 1, indicating that our data can be easly clusterable.

#calculating optimal number of clusters
opt1<-Optimal_Clusters_KMeans(drug_deaths_sample_1, max_clusters=10, plot_clusters=TRUE, criterion="silhouette")

opt2<-NbClust(drug_deaths_sample_1, distance="euclidean", min.nc=2, max.nc=8, method="complete", index="ch")
opt_3<-pamk(drug_deaths_sample_1, krange=2:15, criterion="asw", usepam=TRUE, scaling=FALSE, alpha=0.001, diss=inherits(drug_deaths_sample, "dist"), critout=FALSE)
opt_3$nc

## [1] 2

According to calculations optimal number of clusters is 2

#k-means clustering
cluster_km <- kmeans(drug_deaths_sample_1, 2)
cluster_km$size

## [1]  981 1019

cluster_km$centers

##        Age       Sex  Is_local   Cocaine   Ethanol  Fentanyl
## 1 33.07543 0.7584098 0.7869521 0.3567788 0.2303772 0.7145770
## 2 54.68008 0.7379784 0.8292444 0.3915604 0.2924436 0.6418057
##   Injury_place_residence
## 1              0.5504587
## 2              0.5348381

drug_deaths_sample_1$cluster <- cluster_km$cluster

#PAM
cluster_pam<-eclust(drug_deaths_sample_1, "pam", k= 2)

cluster_pam$medoids

##      Age Sex Is_local Cocaine Ethanol Fentanyl Injury_place_residence cluster
## 5801  34   1        1       0       0        1                      1       1
## 1751  55   1        1       0       0        1                      1       2

drug_deaths_sample_1$cluster <- cluster_pam$cluster

#CLARA
clara_clust<-clara(drug_deaths_sample_1, 2, metric="euclidean", stand=FALSE, samples=6,
                   sampsize=50, trace=0, medoids.x=TRUE,
                   rngR=FALSE, pamLike=FALSE, correct.d=TRUE)
class(clara_clust)

## [1] "clara"     "partition"

clara_clust

## Call:     clara(x = drug_deaths_sample_1, k = 2, metric = "euclidean", stand = FALSE, samples = 6, sampsize = 50, trace = 0, medoids.x = TRUE, rngR = FALSE, pamLike = FALSE, correct.d = TRUE) 
## Medoids:
##      Age Sex Is_local Cocaine Ethanol Fentanyl Injury_place_residence cluster
## 1129  33   1        1       0       0        1                      1       1
## 9367  52   1        1       0       1        1                      0       2
## Objective function:   5.880223
## Clustering vector:    Named int [1:2000] 1 1 2 1 2 2 1 1 2 2 2 2 1 2 1 2 1 1 ...
##  - attr(*, "names")= chr [1:2000] "2526" "2574" "10776" "8935" "3056" "1890" "9600" ...
## Cluster sizes:            932 1068 
## Best sample:
##  [1] 6912  9630  1642  6908  10622 1426  161   11791 11045 11221 2218  692  
## [13] 8383  9367  4728  10704 502   7888  8313  7294  3498  317   7787  6445 
## [25] 1836  5013  1724  5394  9524  6804  4931  10549 6140  2593  6585  8541 
## [37] 5520  773   5659  7604  7889  1129  858   10779 6588  11016 10060 4094 
## [49] 5562  6027 
## 
## Available components:
##  [1] "sample"     "medoids"    "i.med"      "clustering" "objective" 
##  [6] "clusinfo"   "diss"       "call"       "silinfo"    "data"

Based on the three models above, we can draw several conclusions:

-Age is an important variable; one cluster relates to individuals around 30 years old, while another focuses on those around 50.

-In the younger cluster, more people died due to ethanol use, while in the older cluster, the cause of death was primarily fentanyl.

Apart from these differences, the clusters appear relatively similar, which is why we will proceed to a geographical analysis.

Part 2

#Choosing  only geographical variables
drug_deaths_sample_2<-drug_deaths_sample[27:28]
hopkins(drug_deaths_sample_2)

## [1] 0.9897636

Here we have similiar situation to Part 1, hopkins index is close to 1 which means clusterable data

#calculating optimal number of clusters
opt1<-Optimal_Clusters_KMeans(drug_deaths_sample_2, max_clusters=10, plot_clusters=TRUE, criterion="silhouette")

opt2<-NbClust(drug_deaths_sample_2, distance="euclidean", min.nc=2, max.nc=8, method="complete", index="ch")
opt3<-pamk(drug_deaths_sample_2, krange=2:10, criterion="asw", usepam=TRUE, scaling=FALSE, alpha=0.001, diss=inherits(drug_deaths_sample, "dist"), critout=FALSE)
opt3$nc

## [1] 9

We can observe an interesting situation here, where the silhouette index suggests an increasing optimal number of clusters. However, this contradicts the intended goal, which is why, in the further analysis, 10 clusters and 5 clusters were used, based on a slight peak observed in the plots. Since the calculations indicated that the more clusters, the better, I considered 10 clusters as the maximum reasonable number that is visible on the map. If I increase the number further, I would eventually reach a point where each observation becomes a separate cluster, which is not desirable.

Below, I will calculate models using K-means, PAM, and hierarchical clustering for 5 and 10 clusters each. K-means

cluster_km_5 <- kmeans(drug_deaths_sample_2, 5)
plot(drug_deaths_sample_2$Latitude, drug_deaths_sample_2$Longitude, col = cluster_km_5$cluster, pch = 19,
     xlab = "Latitude", ylab = "Longitude", main = "K-means Clustering for 5 cluster")

locations_kmeans <- data.frame(drug_deaths_sample_2, cluster = as.factor(cluster_km_5$cluster))
pal <- colorFactor(palette = c("#E41A1C", "#377EB8", "#4DAF4A", 
                               "#984EA3", "#FF7F00", "#FFFF33", 
                               "#A65628", "#F781BF", "#999999", "#66C2A5"),
                   domain = locations_kmeans$cluster)
leaflet(data = locations_kmeans) %>%
  addTiles() %>%
  addCircleMarkers(
    ~Longitude, ~Latitude, 
    color = ~pal(cluster),
    radius = 10, fill = TRUE, fillOpacity = 0.7, stroke = FALSE,
    popup = ~paste("Cluster:", cluster)
  ) %>%
  addLegend(position = "topright", pal = pal, values = locations_kmeans$cluster, title = "Clusters")

cluster_km_10 <- kmeans(drug_deaths_sample_2, 10)
plot(drug_deaths_sample_2$Latitude, drug_deaths_sample_2$Longitude, col = cluster_km_10$cluster, pch = 19,
     xlab = "Latitude", ylab = "Longitude", main = "K-means Clustering for 10 clusters")

locations_kmeans <- data.frame(drug_deaths_sample_2, cluster = as.factor(cluster_km_10$cluster))
pal <- colorFactor(palette = c("#E41A1C", "#377EB8", "#4DAF4A", 
                               "#984EA3", "#FF7F00", "#FFFF33", 
                               "#A65628", "#F781BF", "#999999", "#66C2A5"),
                   domain = locations_kmeans$cluster)
leaflet(data = locations_kmeans) %>%
  addTiles() %>%
  addCircleMarkers(
    ~Longitude, ~Latitude, 
    color = ~pal(cluster),
    radius = 10, fill = TRUE, fillOpacity = 0.7, stroke = FALSE,
    popup = ~paste("Cluster:", cluster)
  ) %>%
  addLegend(position = "topright", pal = pal, values = locations_kmeans$cluster, title = "Clusters")

PAM

cluster_pam_5<-eclust(drug_deaths_sample_2, "pam", k= 5)

locations_pam_5 <- data.frame(drug_deaths_sample_2, cluster = as.factor(cluster_pam_5$cluster))
pal <- colorFactor(palette = c("#E41A1C", "#377EB8", "#4DAF4A", 
                               "#984EA3", "#FF7F00", "#FFFF33", 
                               "#A65628", "#F781BF", "#999999", 
                               "#66C2A5", "#FC8D62", "#8DA0CB", 
                               "#E78AC3", "#A6D854", "#FFD92F"),
                   domain = locations_kmeans$cluster)
leaflet(data = locations_pam_5) %>%
  addTiles() %>%
  addCircleMarkers(
    ~Longitude, ~Latitude, 
    color = ~pal(cluster),
    radius = 10, fill = TRUE, fillOpacity = 0.7, stroke = FALSE,
    popup = ~paste("Cluster:", cluster)
  ) %>%
  addLegend(position = "topright", pal = pal, values = locations_pam_5$cluster, title = "Clusters")

cluster_pam_10<-eclust(drug_deaths_sample_2, "pam", k= 10)

locations_pam_10 <- data.frame(drug_deaths_sample_2, cluster = as.factor(cluster_pam_10$cluster))
leaflet(data = locations_pam_10) %>%
  addTiles() %>%
  addCircleMarkers(
    ~Longitude, ~Latitude, 
    color = ~pal(cluster),
    radius = 10, fill = TRUE, fillOpacity = 0.7, stroke = FALSE,
    popup = ~paste("Cluster:", cluster)
  ) %>%
  addLegend(position = "topright", pal = pal, values = locations_pam_10$cluster, title = "Clusters")

Hierarchical clustering In hierarchical clustering, I experimented with different functions and methods. The main difference was in separating the southern part of Connecticut. Other than that, subsequent approaches did not differ significantly from one another. Therefore, I decided to use the agnes function with the “Ward” method.

d <- dist(drug_deaths_sample_2, method = "euclidean")
hc <- agnes(d, method = "ward")
cluster_hier_5 <- cutree(hc, k = 5)
table(cluster_hier_5)

## cluster_hier_5
##   1   2   3   4   5 
## 228 523 144 793 312

locations_hierarchical_5 <-drug_deaths_sample_2 %>%
  mutate(cluster = cluster_hier_5)

pltree(hc, cex = 0.6, hang = -1, main = "dendrogram - agnes", labels = FALSE)

rect.hclust(hc, k = 5, border = 2:5)

leaflet(data = locations_hierarchical_5) %>%
  addTiles() %>%
  addCircleMarkers(
    ~Longitude, ~Latitude, 
    color = ~pal(cluster),
    radius = 10, fill = TRUE, fillOpacity = 0.7, stroke = FALSE,
    popup = ~paste("Cluster:", cluster)
  ) %>%
  addLegend(position = "topright", pal = pal, values = locations_hierarchical_5$cluster, title = "Clusters")

cluster_hier_10 <- cutree(hc, k = 10)
table(cluster_hier_10)

## cluster_hier_10
##   1   2   3   4   5   6   7   8   9  10 
## 156 275  69 231 163 248 312 399  75  72

locations_hierarchical_10 <-drug_deaths_sample_2 %>%
  mutate(cluster = cluster_hier_10)

pltree(hc, cex = 0.6, hang = -1, main = "dendrogram - agnes", labels = FALSE)

rect.hclust(hc, k = 10, border = 2:5)

leaflet(data = locations_hierarchical_10) %>%
  addTiles() %>%
  addCircleMarkers(
    ~Longitude, ~Latitude, 
    color = ~pal(cluster),
    radius = 10, fill = TRUE, fillOpacity = 0.7, stroke = FALSE,
    popup = ~paste("Cluster:", cluster)
  ) %>%
  addLegend(position = "topright", pal = pal, values = locations_hierarchical_10$cluster, title = "Clusters")

When it comes to analysis using k-means, PAM, and hierarchical clustering methods, the results are quite similar to each other. For 5 clusters, a separate cluster was created for the eastern part of the state, which is characterized by lower population density. The models also identified the southwest as a distinct cluster, which can be considered a successful outcome.

Regarding clustering with 10 clusters, a situation can be observed where a single city and its surrounding area are identified as a cluster, which is also an interesting result.

DBscan

dbscan::kNNdistplot(drug_deaths_sample_2, k =  10)

db <- fpc::dbscan(drug_deaths_sample_2, eps = 0.08, MinPts = 15)
plot(db, drug_deaths_sample_2, main = "DBSCAN", frame = FALSE)

locations_dbscan <- data.frame(drug_deaths_sample_2, cluster = as.factor(db$cluster))

leaflet(data = locations_dbscan) %>%
  addTiles() %>%
  addCircleMarkers(
    ~Longitude, ~Latitude, 
    color = ~pal(cluster),
    radius = 10, fill = TRUE, fillOpacity = 0.7, stroke = FALSE,
    popup = ~paste("Cluster:", cluster)
  ) %>%
  addLegend(position = "topright", pal = pal, values = locations_dbscan$cluster, title = "Clusters")

## Warning in pal(v): Some values were outside the color scale and will be treated
## as NA

## Warning in pal(cluster): Some values were outside the color scale and will be
## treated as NA

## Warning in pal(cluster): Some values were outside the color scale and will be
## treated as NA

When it comes to clustering using DBSCAN, it is not the best solution in this case. The observations are relatively evenly distributed across the state, making it challenging to select appropriate parameters. This often resulted in either one large cluster or many very small ones with a large number of outliers.

Now let’s analyze the statistics for these models to determine the best one.

Calinski-Harabasz index

cluster_clara_5<-eclust(drug_deaths_sample_2, "clara", k= 5)

cluster_clara_10<-eclust(drug_deaths_sample_2, "clara", k= 10)

Calinski_Harabasz <- data.frame(
  Method = c("K-means", "K-means", "PAM", "PAM","CLARA","CLARA", "Hierarchical", "Hierarchical", "DBSCAN"),
  Num_of_Clusters = c(5, 10, 5, 10,5, 10, 5, 10, ''),
  Calinski_Harabasz = c(
    round(calinhara(drug_deaths_sample_2, cluster_km_5$cluster), digits = 2),
    round(calinhara(drug_deaths_sample_2, cluster_km_10$cluster), digits = 2),
    round(calinhara(drug_deaths_sample_2, cluster_pam_5$cluster), digits = 2),
    round(calinhara(drug_deaths_sample_2, cluster_pam_10$cluster), digits = 2),
    round(calinhara(drug_deaths_sample_2, cluster_clara_5$cluster), digits = 2),
    round(calinhara(drug_deaths_sample_2, cluster_clara_10$cluster), digits = 2),   
    round(calinhara(drug_deaths_sample_2, locations_hierarchical_5$cluster), digits = 2),
    round(calinhara(drug_deaths_sample_2, locations_hierarchical_10$cluster), digits = 2),
    round(calinhara(drug_deaths_sample_2, db$cluster), digits = 2)
    )
  )

Calinski_Harabasz

##         Method Num_of_Clusters Calinski_Harabasz
## 1      K-means               5           2873.59
## 2      K-means              10           2266.96
## 3          PAM               5           2738.93
## 4          PAM              10           3164.83
## 5        CLARA               5           2715.83
## 6        CLARA              10           2785.42
## 7 Hierarchical               5           2466.49
## 8 Hierarchical              10           2719.50
## 9       DBSCAN                           1484.21

Here, the PAM clustering method achieved definitely the highest score.Therefore, I additionally performed clustering using the CLARA method, which is quite similar, to see if it could produce an even better result. After another attempt PAM with 10 clusters is still the best.

Duda-Hart index

Duda_Hart<- data.frame(
  Method = c("K-means", "K-means", "PAM", "PAM", "Hierarchical", "Hierarchical", "DBSCAN"),
  Num_of_Clusters = c(5, 10, 5, 10, 5, 10, ''),
  Duda_Hart = c(
    dudahart2(drug_deaths_sample_2, cluster_km_5$cluster)$p.value,
    dudahart2(drug_deaths_sample_2, cluster_km_10$cluster)$p.value, 
    dudahart2(drug_deaths_sample_2, cluster_pam_5$cluster)$p.value,
    dudahart2(drug_deaths_sample_2, cluster_pam_10$cluster)$p.value,
    dudahart2(drug_deaths_sample_2, locations_hierarchical_5$cluster)$p.value,
    dudahart2(drug_deaths_sample_2, locations_hierarchical_10$cluster)$p.value, 
    dudahart2(drug_deaths_sample_2, db$cluster)$p.value
  )
)

Duda_Hart

##         Method Num_of_Clusters Duda_Hart
## 1      K-means               5         0
## 2      K-means              10         0
## 3          PAM               5         0
## 4          PAM              10         0
## 5 Hierarchical               5         0
## 6 Hierarchical              10         0
## 7       DBSCAN                         0

In all of these cases, the Duda-Hart index is equal to or very close to 0, which indicates that the clusters are not homogeneous. Typically, this is a bad sign for clustering, but when using only geographical variables, this measure may not be the best way to evaluate the models

Inertia

diss.mat<-dist(drug_deaths_sample_2, method = "euclidean")

inertia<-matrix(0, nrow=4, ncol=7)
colnames(inertia)<-c("KM 5 clust.","KM 10 clust.","PAM 5 clust.","PAM 10 clust.","hierarchical 5 clust.","hierarchical 10 clust.","DB-can.")
rownames(inertia)<-c("intra-clust", "total", "percentage", "Q")
inertia[1,1]<-withindiss(diss.mat, part=cluster_km_5$cluster)# intra-cluster
inertia[2,1]<-inertdiss(diss.mat) # overall
inertia[3,1]<-inertia[1,1]/ inertia[2,1] # ratio
inertia[4,1]<-1-inertia[3,1] # Q, inter-cluster

inertia[1,2]<-withindiss(diss.mat, part=cluster_km_10$cluster)
inertia[2,2]<-inertdiss(diss.mat) 
inertia[3,2]<-inertia[1,2]/ inertia[2,2]
inertia[4,2]<-1-inertia[3,2] 

inertia[1,3]<-withindiss(diss.mat, part=cluster_pam_5$clustering)
inertia[2,3]<-inertdiss(diss.mat) 
inertia[3,3]<-inertia[1,3]/ inertia[2,3] 
inertia[4,3]<-1-inertia[3,3] 

inertia[1,4]<-withindiss(diss.mat, part=cluster_pam_10$clustering)
inertia[2,4]<-inertdiss(diss.mat)
inertia[3,4]<-inertia[1,4]/ inertia[2,4]
inertia[4,4]<-1-inertia[3,4]

inertia[1,5]<-withindiss(diss.mat, part=cluster_hier_5)
inertia[2,5]<-inertdiss(diss.mat)
inertia[3,5]<-inertia[1,5]/ inertia[2,5]
inertia[4,5]<-1-inertia[3,5]

inertia[1,6]<-withindiss(diss.mat, part=cluster_hier_10)
inertia[2,6]<-inertdiss(diss.mat)
inertia[3,6]<-inertia[1,6]/ inertia[2,6]
inertia[4,6]<-1-inertia[3,6]

inertia[1,7]<-withindiss(diss.mat, part=db$cluster)
inertia[2,7]<-inertdiss(diss.mat)
inertia[3,7]<-inertia[1,7]/ inertia[2,7]
inertia[4,7]<-1-inertia[3,7]

inertia

##             KM 5 clust. KM 10 clust. PAM 5 clust. PAM 10 clust.
## intra-clust  0.02843194   0.01708449   0.02961443    0.01255409
## total        0.19224472   0.19224472   0.19224472    0.19224472
## percentage   0.14789454   0.08886847   0.15404550    0.06530264
## Q            0.85210546   0.91113153   0.84595450    0.93469736
##             hierarchical 5 clust. hierarchical 10 clust.    DB-can.
## intra-clust            0.03233533             0.01445530 0.02086794
## total                  0.19224472             0.19224472 0.19224472
## percentage             0.16819880             0.07519217 0.10854885
## Q                      0.83180120             0.92480783 0.89145115

In the case of Inertia, the model should be characterized by high inter-cluster inertia (diversity) and low intra-cluster inertia (homogeneity). In our case, the best results in both of these measures were obtained by the PAM model with 10 clusters, which is why it is selected as the best model. Analyzing cluster by Age

locations_pam_10_dt<- merge(locations_pam_10, drug_deaths[, "Age"], by = "row.names", all.x = TRUE)

ggplot(locations_pam_10_dt, aes(x = factor(cluster), y = y, fill = factor(cluster))) +
  geom_boxplot() +
  labs(title = "Boxplot of Time by K-means cluster", 
       x = "Cluster", 
       y = "Age") +
  scale_fill_brewer(palette = "Set1")+
  theme(plot.title = element_text(hjust = 0.5))

As for the age analysis, there are no significant differences between the clusters. Only cluster number 3 is characterized by a lower average age compared to the others. This cluster corresponds to the area around the city of Danbury in the western part of the state

Conclusions

The project aimed to analyze drug overdose cases in the state of Connecticut, USA, using clustering techniques. The main conclusion from the analysis is that the type of substance used does not play a significant role, and the models were unable to separate observations into different clusters based on this information. The models often distinguished observations based on age, with younger individuals more frequently using ethanol, and older individuals more often using fentanyl. The geographical analysis was quite successful in separating large urban clusters into distinct groups, with the best results achieved by the PAM model with 10 clusters. During the analysis, a challenge arose in determining the number of clusters, as various metrics suggested a very high number of clusters, with the best results coming from models where the number of clusters was almost the number of observations.