The project aims to analyze deaths caused by drug overdoses using clustering methods. The data pertains to the state of Connecticut in the United States and includes recorded deaths related to drug use. Data is official and sourced from a platform that aggregates datasets provided by state agencies (https://data.ct.gov/Health-and-Human-Services/Accidental-Drug-Related-Deaths-2012-2023/rybz-nyjw/about_data).
source: https://delamere.com/blog/top-10-most-dangerous-drugs
#Loading packages
library(factoextra)
library(flexclust)
library(fpc)
library(clustertend)
library(cluster)
library(ClusterR)
library(tidyverse)
library(dendextend)
library(dbscan)
library(RColorBrewer)
library(tidyverse)
library(dendextend)
library(leaflet)
library(ggplot2)
library(flexclust)
library(tidyr)
library(leaflet.extras)
library(hopkins)
library(ClustGeo)
library(NbClust)
#Loading data
drug_deaths <- read.csv("C:/Users/User/OneDrive/Pulpit/Studia magisterskie/USL/Accidental_Drug_Related_Deaths.csv")
drug_deaths$Sex <- ifelse(tolower(drug_deaths$Sex) == "male", 1, 0)
drug_deaths$Is_local <- ifelse(drug_deaths$Residence.State == "CT", 1, 0)
drug_deaths$Injury_place_residence <- ifelse(drug_deaths$Injury.Place == "Residence", 1, 0)
drug_deaths[ , 23:44] <- apply(drug_deaths[ , 23:44], 2, function(x) ifelse(x == "Y", 1, ifelse(x == "", 0, x)))
drug_deaths <- drug_deaths %>%
mutate_at(23:44, ~as.numeric(ifelse(. == "Y", 1, ifelse(. == "", 0, .))))
drug_deaths <- drug_deaths %>%
separate(InjuryCityGeo, into = c("CityGeo", "Longitude"), sep = ",\\s*-", remove = FALSE)
drug_deaths$Longitude <- sub("\\)$", "", drug_deaths$Longitude)
drug_deaths$Latitude <- sub(".*\\(([-0-9.]+).*", "\\1", drug_deaths$CityGeo)
drug_deaths$Longitude <- paste0("-", drug_deaths$Longitude)
drug_deaths$Longitude <- as.numeric(drug_deaths$Longitude)
drug_deaths$Latitude <- as.numeric(drug_deaths$Latitude)
drug_deaths <- drug_deaths %>%
select(-Latitude, -Longitude, everything(), Latitude, Longitude)
#Dropping unnecsessary variables
drug_deaths_1 <- drug_deaths[ , !(names(drug_deaths) %in% c("Date", "Date.Type","Ethnicity","Residence.City","Residence.County","Injury.City","Injury.County","Residence.State","Injury.County","Injury.State","Injury.Place","Description.of.Injury","Manner.of.Death","Other.Significant.Conditions","Death.City","Death.County","Death.State","Location","Location.if.Other","Cause.of.Death","Other","ResidenceCityGeo","DeathCityGeo","InjuryCityGeo","CityGeo","Race"))]
drug_deaths_1 <- na.omit(drug_deaths_1)
Visualizing the data on a map of Connecticut State in US
leaflet(data = drug_deaths_1[27:28]) %>%
addTiles() %>%
addCircleMarkers(~Longitude, ~Latitude, radius = 4, color = "blue", fillOpacity = 0.7)
leaflet(data = drug_deaths_1[27:28]) %>%
addTiles() %>%
addHeatmap(lng = ~Longitude, lat = ~Latitude, intensity = 1, blur = 20, max = 0.05)
As we can see, the observations are fairly evenly distributed across the map of the state. On the heatmap, larger number of observations can be noted in the southern part of the state, where the largest cities, Bridgeport and New Haven, are located. Here I choose a random sample of the data to speed up the code, because working on original data was took to long
set.seed(123)
drug_deaths_sample<- drug_deaths_1[sample(nrow(drug_deaths_1), 2000), ]
#Data_preview
str(drug_deaths_sample)
## 'data.frame': 2000 obs. of 28 variables:
## $ Age : int 37 31 52 41 60 53 36 35 56 48 ...
## $ Sex : num 1 0 1 1 1 1 0 1 1 1 ...
## $ Heroin : num 0 0 1 0 1 0 0 1 0 1 ...
## $ Heroin.death.certificate..DC.: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Cocaine : num 0 0 1 0 0 0 1 0 0 0 ...
## $ Fentanyl : num 1 1 1 0 0 0 1 1 1 1 ...
## $ Fentanyl.Analogue : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Oxycodone : num 0 0 0 0 0 1 0 0 0 0 ...
## $ Oxymorphone : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Ethanol : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hydrocodone : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Benzodiazepine : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Methadone : num 0 1 0 0 0 0 0 0 0 0 ...
## $ Meth.Amphetamine : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Amphet : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Tramad : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hydromorphone : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Morphine..Not.Heroin. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Xylazine : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Gabapentin : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Opiate.NOS : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Heroin.Morph.Codeine : num 0 0 1 0 1 0 0 0 0 0 ...
## $ Other.Opioid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Any.Opioid : num 1 1 1 0 1 1 1 1 1 1 ...
## $ Is_local : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Injury_place_residence : num 1 1 0 0 1 1 0 1 1 1 ...
## $ Longitude : num -72 -72 -72.9 -73.5 -72.8 ...
## $ Latitude : num 41.4 41.4 41.4 41.4 41.5 ...
## - attr(*, "na.action")= 'omit' Named int [1:542] 11 14 17 21 23 26 28 62 85 153 ...
## ..- attr(*, "names")= chr [1:542] "11" "14" "17" "21" ...
summary(drug_deaths_sample)
## Age Sex Heroin Heroin.death.certificate..DC.
## Min. :14.00 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:34.00 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.000
## Median :44.00 Median :1.000 Median :0.0000 Median :0.000
## Mean :44.08 Mean :0.748 Mean :0.2995 Mean :0.055
## 3rd Qu.:54.00 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:0.000
## Max. :84.00 Max. :1.000 Max. :1.0000 Max. :1.000
## Cocaine Fentanyl Fentanyl.Analogue Oxycodone
## Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.000
## Median :0.0000 Median :1.0000 Median :0.000 Median :0.000
## Mean :0.3745 Mean :0.6775 Mean :0.083 Mean :0.088
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.000 3rd Qu.:0.000
## Max. :1.0000 Max. :1.0000 Max. :1.000 Max. :1.000
## Oxymorphone Ethanol Hydrocodone Benzodiazepine
## Min. :0.0000 Min. :0.000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.0000
## Median :0.0000 Median :0.000 Median :0.000 Median :0.0000
## Mean :0.0135 Mean :0.262 Mean :0.012 Mean :0.2235
## 3rd Qu.:0.0000 3rd Qu.:1.000 3rd Qu.:0.000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.000 Max. :1.000 Max. :1.0000
## Methadone Meth.Amphetamine Amphet Tramad
## Min. :0.000 Min. :0.000 Min. :0.0000 Min. :0.00
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.00
## Median :0.000 Median :0.000 Median :0.0000 Median :0.00
## Mean :0.093 Mean :0.011 Mean :0.0315 Mean :0.03
## 3rd Qu.:0.000 3rd Qu.:0.000 3rd Qu.:0.0000 3rd Qu.:0.00
## Max. :1.000 Max. :1.000 Max. :1.0000 Max. :1.00
## Hydromorphone Morphine..Not.Heroin. Xylazine Gabapentin
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.0065 Mean :0.0035 Mean :0.0905 Mean :0.0365
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## Opiate.NOS Heroin.Morph.Codeine Other.Opioid Any.Opioid
## Min. :0.0000 Min. :0.000 Min. :0.00 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.00 1st Qu.:0.0000
## Median :0.0000 Median :0.000 Median :0.00 Median :1.0000
## Mean :0.0115 Mean :0.186 Mean :0.01 Mean :0.7455
## 3rd Qu.:0.0000 3rd Qu.:0.000 3rd Qu.:0.00 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.000 Max. :1.00 Max. :1.0000
## Is_local Injury_place_residence Longitude Latitude
## Min. :0.0000 Min. :0.0000 Min. :-73.64 Min. :41.03
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:-73.05 1st Qu.:41.31
## Median :1.0000 Median :1.0000 Median :-72.86 Median :41.55
## Mean :0.8085 Mean :0.5425 Mean :-72.81 Mean :41.52
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:-72.65 3rd Qu.:41.76
## Max. :1.0000 Max. :1.0000 Max. :-71.83 Max. :42.03
any(is.na(drug_deaths_sample))
## [1] FALSE
While working with the dataset, I experimented with various approaches and methods for applying clustering, many of which produced unremarkable results. It turns out that the type of substance consumed plays a very minor role, making it challenging to identify clusters using such data. After numerous attempts, I decided to divide the project into two parts:
This division allows for a more targeted exploration of the data and improves the chances of uncovering meaningful insights.
#Choosing variables
drug_deaths_sample_1 <- drug_deaths_sample[c("Age", "Sex","Is_local","Cocaine","Ethanol","Fentanyl","Injury_place_residence")]
hopkins(drug_deaths_sample_1)
## [1] 0.9513197
The Hopkins statistic is very close to 1, indicating that our data can be easly clusterable.
#calculating optimal number of clusters
opt1<-Optimal_Clusters_KMeans(drug_deaths_sample_1, max_clusters=10, plot_clusters=TRUE, criterion="silhouette")
opt2<-NbClust(drug_deaths_sample_1, distance="euclidean", min.nc=2, max.nc=8, method="complete", index="ch")
opt_3<-pamk(drug_deaths_sample_1, krange=2:15, criterion="asw", usepam=TRUE, scaling=FALSE, alpha=0.001, diss=inherits(drug_deaths_sample, "dist"), critout=FALSE)
opt_3$nc
## [1] 2
According to calculations optimal number of clusters is 2
#k-means clustering
cluster_km <- kmeans(drug_deaths_sample_1, 2)
cluster_km$size
## [1] 981 1019
cluster_km$centers
## Age Sex Is_local Cocaine Ethanol Fentanyl
## 1 33.07543 0.7584098 0.7869521 0.3567788 0.2303772 0.7145770
## 2 54.68008 0.7379784 0.8292444 0.3915604 0.2924436 0.6418057
## Injury_place_residence
## 1 0.5504587
## 2 0.5348381
drug_deaths_sample_1$cluster <- cluster_km$cluster
#PAM
cluster_pam<-eclust(drug_deaths_sample_1, "pam", k= 2)
cluster_pam$medoids
## Age Sex Is_local Cocaine Ethanol Fentanyl Injury_place_residence cluster
## 5801 34 1 1 0 0 1 1 1
## 1751 55 1 1 0 0 1 1 2
drug_deaths_sample_1$cluster <- cluster_pam$cluster
#CLARA
clara_clust<-clara(drug_deaths_sample_1, 2, metric="euclidean", stand=FALSE, samples=6,
sampsize=50, trace=0, medoids.x=TRUE,
rngR=FALSE, pamLike=FALSE, correct.d=TRUE)
class(clara_clust)
## [1] "clara" "partition"
clara_clust
## Call: clara(x = drug_deaths_sample_1, k = 2, metric = "euclidean", stand = FALSE, samples = 6, sampsize = 50, trace = 0, medoids.x = TRUE, rngR = FALSE, pamLike = FALSE, correct.d = TRUE)
## Medoids:
## Age Sex Is_local Cocaine Ethanol Fentanyl Injury_place_residence cluster
## 1129 33 1 1 0 0 1 1 1
## 9367 52 1 1 0 1 1 0 2
## Objective function: 5.880223
## Clustering vector: Named int [1:2000] 1 1 2 1 2 2 1 1 2 2 2 2 1 2 1 2 1 1 ...
## - attr(*, "names")= chr [1:2000] "2526" "2574" "10776" "8935" "3056" "1890" "9600" ...
## Cluster sizes: 932 1068
## Best sample:
## [1] 6912 9630 1642 6908 10622 1426 161 11791 11045 11221 2218 692
## [13] 8383 9367 4728 10704 502 7888 8313 7294 3498 317 7787 6445
## [25] 1836 5013 1724 5394 9524 6804 4931 10549 6140 2593 6585 8541
## [37] 5520 773 5659 7604 7889 1129 858 10779 6588 11016 10060 4094
## [49] 5562 6027
##
## Available components:
## [1] "sample" "medoids" "i.med" "clustering" "objective"
## [6] "clusinfo" "diss" "call" "silinfo" "data"
Based on the three models above, we can draw several conclusions:
-Age is an important variable; one cluster relates to individuals around 30 years old, while another focuses on those around 50.
-In the younger cluster, more people died due to ethanol use, while in the older cluster, the cause of death was primarily fentanyl.
Apart from these differences, the clusters appear relatively similar, which is why we will proceed to a geographical analysis.
#Choosing only geographical variables
drug_deaths_sample_2<-drug_deaths_sample[27:28]
hopkins(drug_deaths_sample_2)
## [1] 0.9897636
Here we have similiar situation to Part 1, hopkins index is close to 1 which means clusterable data
#calculating optimal number of clusters
opt1<-Optimal_Clusters_KMeans(drug_deaths_sample_2, max_clusters=10, plot_clusters=TRUE, criterion="silhouette")
opt2<-NbClust(drug_deaths_sample_2, distance="euclidean", min.nc=2, max.nc=8, method="complete", index="ch")
opt3<-pamk(drug_deaths_sample_2, krange=2:10, criterion="asw", usepam=TRUE, scaling=FALSE, alpha=0.001, diss=inherits(drug_deaths_sample, "dist"), critout=FALSE)
opt3$nc
## [1] 9
We can observe an interesting situation here, where the silhouette index suggests an increasing optimal number of clusters. However, this contradicts the intended goal, which is why, in the further analysis, 10 clusters and 5 clusters were used, based on a slight peak observed in the plots. Since the calculations indicated that the more clusters, the better, I considered 10 clusters as the maximum reasonable number that is visible on the map. If I increase the number further, I would eventually reach a point where each observation becomes a separate cluster, which is not desirable.
Below, I will calculate models using K-means, PAM, and hierarchical clustering for 5 and 10 clusters each. K-means
cluster_km_5 <- kmeans(drug_deaths_sample_2, 5)
plot(drug_deaths_sample_2$Latitude, drug_deaths_sample_2$Longitude, col = cluster_km_5$cluster, pch = 19,
xlab = "Latitude", ylab = "Longitude", main = "K-means Clustering for 5 cluster")
locations_kmeans <- data.frame(drug_deaths_sample_2, cluster = as.factor(cluster_km_5$cluster))
pal <- colorFactor(palette = c("#E41A1C", "#377EB8", "#4DAF4A",
"#984EA3", "#FF7F00", "#FFFF33",
"#A65628", "#F781BF", "#999999", "#66C2A5"),
domain = locations_kmeans$cluster)
leaflet(data = locations_kmeans) %>%
addTiles() %>%
addCircleMarkers(
~Longitude, ~Latitude,
color = ~pal(cluster),
radius = 10, fill = TRUE, fillOpacity = 0.7, stroke = FALSE,
popup = ~paste("Cluster:", cluster)
) %>%
addLegend(position = "topright", pal = pal, values = locations_kmeans$cluster, title = "Clusters")
cluster_km_10 <- kmeans(drug_deaths_sample_2, 10)
plot(drug_deaths_sample_2$Latitude, drug_deaths_sample_2$Longitude, col = cluster_km_10$cluster, pch = 19,
xlab = "Latitude", ylab = "Longitude", main = "K-means Clustering for 10 clusters")
locations_kmeans <- data.frame(drug_deaths_sample_2, cluster = as.factor(cluster_km_10$cluster))
pal <- colorFactor(palette = c("#E41A1C", "#377EB8", "#4DAF4A",
"#984EA3", "#FF7F00", "#FFFF33",
"#A65628", "#F781BF", "#999999", "#66C2A5"),
domain = locations_kmeans$cluster)
leaflet(data = locations_kmeans) %>%
addTiles() %>%
addCircleMarkers(
~Longitude, ~Latitude,
color = ~pal(cluster),
radius = 10, fill = TRUE, fillOpacity = 0.7, stroke = FALSE,
popup = ~paste("Cluster:", cluster)
) %>%
addLegend(position = "topright", pal = pal, values = locations_kmeans$cluster, title = "Clusters")
PAM
cluster_pam_5<-eclust(drug_deaths_sample_2, "pam", k= 5)
locations_pam_5 <- data.frame(drug_deaths_sample_2, cluster = as.factor(cluster_pam_5$cluster))
pal <- colorFactor(palette = c("#E41A1C", "#377EB8", "#4DAF4A",
"#984EA3", "#FF7F00", "#FFFF33",
"#A65628", "#F781BF", "#999999",
"#66C2A5", "#FC8D62", "#8DA0CB",
"#E78AC3", "#A6D854", "#FFD92F"),
domain = locations_kmeans$cluster)
leaflet(data = locations_pam_5) %>%
addTiles() %>%
addCircleMarkers(
~Longitude, ~Latitude,
color = ~pal(cluster),
radius = 10, fill = TRUE, fillOpacity = 0.7, stroke = FALSE,
popup = ~paste("Cluster:", cluster)
) %>%
addLegend(position = "topright", pal = pal, values = locations_pam_5$cluster, title = "Clusters")
cluster_pam_10<-eclust(drug_deaths_sample_2, "pam", k= 10)
locations_pam_10 <- data.frame(drug_deaths_sample_2, cluster = as.factor(cluster_pam_10$cluster))
leaflet(data = locations_pam_10) %>%
addTiles() %>%
addCircleMarkers(
~Longitude, ~Latitude,
color = ~pal(cluster),
radius = 10, fill = TRUE, fillOpacity = 0.7, stroke = FALSE,
popup = ~paste("Cluster:", cluster)
) %>%
addLegend(position = "topright", pal = pal, values = locations_pam_10$cluster, title = "Clusters")
Hierarchical clustering In hierarchical clustering, I experimented with different functions and methods. The main difference was in separating the southern part of Connecticut. Other than that, subsequent approaches did not differ significantly from one another. Therefore, I decided to use the agnes function with the âWardâ method.
d <- dist(drug_deaths_sample_2, method = "euclidean")
hc <- agnes(d, method = "ward")
cluster_hier_5 <- cutree(hc, k = 5)
table(cluster_hier_5)
## cluster_hier_5
## 1 2 3 4 5
## 228 523 144 793 312
locations_hierarchical_5 <-drug_deaths_sample_2 %>%
mutate(cluster = cluster_hier_5)
pltree(hc, cex = 0.6, hang = -1, main = "dendrogram - agnes", labels = FALSE)
rect.hclust(hc, k = 5, border = 2:5)
leaflet(data = locations_hierarchical_5) %>%
addTiles() %>%
addCircleMarkers(
~Longitude, ~Latitude,
color = ~pal(cluster),
radius = 10, fill = TRUE, fillOpacity = 0.7, stroke = FALSE,
popup = ~paste("Cluster:", cluster)
) %>%
addLegend(position = "topright", pal = pal, values = locations_hierarchical_5$cluster, title = "Clusters")
cluster_hier_10 <- cutree(hc, k = 10)
table(cluster_hier_10)
## cluster_hier_10
## 1 2 3 4 5 6 7 8 9 10
## 156 275 69 231 163 248 312 399 75 72
locations_hierarchical_10 <-drug_deaths_sample_2 %>%
mutate(cluster = cluster_hier_10)
pltree(hc, cex = 0.6, hang = -1, main = "dendrogram - agnes", labels = FALSE)
rect.hclust(hc, k = 10, border = 2:5)
leaflet(data = locations_hierarchical_10) %>%
addTiles() %>%
addCircleMarkers(
~Longitude, ~Latitude,
color = ~pal(cluster),
radius = 10, fill = TRUE, fillOpacity = 0.7, stroke = FALSE,
popup = ~paste("Cluster:", cluster)
) %>%
addLegend(position = "topright", pal = pal, values = locations_hierarchical_10$cluster, title = "Clusters")
When it comes to analysis using k-means, PAM, and hierarchical clustering methods, the results are quite similar to each other. For 5 clusters, a separate cluster was created for the eastern part of the state, which is characterized by lower population density. The models also identified the southwest as a distinct cluster, which can be considered a successful outcome.
Regarding clustering with 10 clusters, a situation can be observed where a single city and its surrounding area are identified as a cluster, which is also an interesting result.
DBscan
dbscan::kNNdistplot(drug_deaths_sample_2, k = 10)
db <- fpc::dbscan(drug_deaths_sample_2, eps = 0.08, MinPts = 15)
plot(db, drug_deaths_sample_2, main = "DBSCAN", frame = FALSE)
locations_dbscan <- data.frame(drug_deaths_sample_2, cluster = as.factor(db$cluster))
leaflet(data = locations_dbscan) %>%
addTiles() %>%
addCircleMarkers(
~Longitude, ~Latitude,
color = ~pal(cluster),
radius = 10, fill = TRUE, fillOpacity = 0.7, stroke = FALSE,
popup = ~paste("Cluster:", cluster)
) %>%
addLegend(position = "topright", pal = pal, values = locations_dbscan$cluster, title = "Clusters")
## Warning in pal(v): Some values were outside the color scale and will be treated
## as NA
## Warning in pal(cluster): Some values were outside the color scale and will be
## treated as NA
## Warning in pal(cluster): Some values were outside the color scale and will be
## treated as NA
When it comes to clustering using DBSCAN, it is not the best solution in this case. The observations are relatively evenly distributed across the state, making it challenging to select appropriate parameters. This often resulted in either one large cluster or many very small ones with a large number of outliers.
Now letâs analyze the statistics for these models to determine the best one.
Calinski-Harabasz index
cluster_clara_5<-eclust(drug_deaths_sample_2, "clara", k= 5)
cluster_clara_10<-eclust(drug_deaths_sample_2, "clara", k= 10)
Calinski_Harabasz <- data.frame(
Method = c("K-means", "K-means", "PAM", "PAM","CLARA","CLARA", "Hierarchical", "Hierarchical", "DBSCAN"),
Num_of_Clusters = c(5, 10, 5, 10,5, 10, 5, 10, ''),
Calinski_Harabasz = c(
round(calinhara(drug_deaths_sample_2, cluster_km_5$cluster), digits = 2),
round(calinhara(drug_deaths_sample_2, cluster_km_10$cluster), digits = 2),
round(calinhara(drug_deaths_sample_2, cluster_pam_5$cluster), digits = 2),
round(calinhara(drug_deaths_sample_2, cluster_pam_10$cluster), digits = 2),
round(calinhara(drug_deaths_sample_2, cluster_clara_5$cluster), digits = 2),
round(calinhara(drug_deaths_sample_2, cluster_clara_10$cluster), digits = 2),
round(calinhara(drug_deaths_sample_2, locations_hierarchical_5$cluster), digits = 2),
round(calinhara(drug_deaths_sample_2, locations_hierarchical_10$cluster), digits = 2),
round(calinhara(drug_deaths_sample_2, db$cluster), digits = 2)
)
)
Calinski_Harabasz
## Method Num_of_Clusters Calinski_Harabasz
## 1 K-means 5 2873.59
## 2 K-means 10 2266.96
## 3 PAM 5 2738.93
## 4 PAM 10 3164.83
## 5 CLARA 5 2715.83
## 6 CLARA 10 2785.42
## 7 Hierarchical 5 2466.49
## 8 Hierarchical 10 2719.50
## 9 DBSCAN 1484.21
Here, the PAM clustering method achieved definitely the highest score.Therefore, I additionally performed clustering using the CLARA method, which is quite similar, to see if it could produce an even better result. After another attempt PAM with 10 clusters is still the best.
Duda-Hart index
Duda_Hart<- data.frame(
Method = c("K-means", "K-means", "PAM", "PAM", "Hierarchical", "Hierarchical", "DBSCAN"),
Num_of_Clusters = c(5, 10, 5, 10, 5, 10, ''),
Duda_Hart = c(
dudahart2(drug_deaths_sample_2, cluster_km_5$cluster)$p.value,
dudahart2(drug_deaths_sample_2, cluster_km_10$cluster)$p.value,
dudahart2(drug_deaths_sample_2, cluster_pam_5$cluster)$p.value,
dudahart2(drug_deaths_sample_2, cluster_pam_10$cluster)$p.value,
dudahart2(drug_deaths_sample_2, locations_hierarchical_5$cluster)$p.value,
dudahart2(drug_deaths_sample_2, locations_hierarchical_10$cluster)$p.value,
dudahart2(drug_deaths_sample_2, db$cluster)$p.value
)
)
Duda_Hart
## Method Num_of_Clusters Duda_Hart
## 1 K-means 5 0
## 2 K-means 10 0
## 3 PAM 5 0
## 4 PAM 10 0
## 5 Hierarchical 5 0
## 6 Hierarchical 10 0
## 7 DBSCAN 0
In all of these cases, the Duda-Hart index is equal to or very close to 0, which indicates that the clusters are not homogeneous. Typically, this is a bad sign for clustering, but when using only geographical variables, this measure may not be the best way to evaluate the models
Inertia
diss.mat<-dist(drug_deaths_sample_2, method = "euclidean")
inertia<-matrix(0, nrow=4, ncol=7)
colnames(inertia)<-c("KM 5 clust.","KM 10 clust.","PAM 5 clust.","PAM 10 clust.","hierarchical 5 clust.","hierarchical 10 clust.","DB-can.")
rownames(inertia)<-c("intra-clust", "total", "percentage", "Q")
inertia[1,1]<-withindiss(diss.mat, part=cluster_km_5$cluster)# intra-cluster
inertia[2,1]<-inertdiss(diss.mat) # overall
inertia[3,1]<-inertia[1,1]/ inertia[2,1] # ratio
inertia[4,1]<-1-inertia[3,1] # Q, inter-cluster
inertia[1,2]<-withindiss(diss.mat, part=cluster_km_10$cluster)
inertia[2,2]<-inertdiss(diss.mat)
inertia[3,2]<-inertia[1,2]/ inertia[2,2]
inertia[4,2]<-1-inertia[3,2]
inertia[1,3]<-withindiss(diss.mat, part=cluster_pam_5$clustering)
inertia[2,3]<-inertdiss(diss.mat)
inertia[3,3]<-inertia[1,3]/ inertia[2,3]
inertia[4,3]<-1-inertia[3,3]
inertia[1,4]<-withindiss(diss.mat, part=cluster_pam_10$clustering)
inertia[2,4]<-inertdiss(diss.mat)
inertia[3,4]<-inertia[1,4]/ inertia[2,4]
inertia[4,4]<-1-inertia[3,4]
inertia[1,5]<-withindiss(diss.mat, part=cluster_hier_5)
inertia[2,5]<-inertdiss(diss.mat)
inertia[3,5]<-inertia[1,5]/ inertia[2,5]
inertia[4,5]<-1-inertia[3,5]
inertia[1,6]<-withindiss(diss.mat, part=cluster_hier_10)
inertia[2,6]<-inertdiss(diss.mat)
inertia[3,6]<-inertia[1,6]/ inertia[2,6]
inertia[4,6]<-1-inertia[3,6]
inertia[1,7]<-withindiss(diss.mat, part=db$cluster)
inertia[2,7]<-inertdiss(diss.mat)
inertia[3,7]<-inertia[1,7]/ inertia[2,7]
inertia[4,7]<-1-inertia[3,7]
inertia
## KM 5 clust. KM 10 clust. PAM 5 clust. PAM 10 clust.
## intra-clust 0.02843194 0.01708449 0.02961443 0.01255409
## total 0.19224472 0.19224472 0.19224472 0.19224472
## percentage 0.14789454 0.08886847 0.15404550 0.06530264
## Q 0.85210546 0.91113153 0.84595450 0.93469736
## hierarchical 5 clust. hierarchical 10 clust. DB-can.
## intra-clust 0.03233533 0.01445530 0.02086794
## total 0.19224472 0.19224472 0.19224472
## percentage 0.16819880 0.07519217 0.10854885
## Q 0.83180120 0.92480783 0.89145115
In the case of Inertia, the model should be characterized by high inter-cluster inertia (diversity) and low intra-cluster inertia (homogeneity). In our case, the best results in both of these measures were obtained by the PAM model with 10 clusters, which is why it is selected as the best model. Analyzing cluster by Age
locations_pam_10_dt<- merge(locations_pam_10, drug_deaths[, "Age"], by = "row.names", all.x = TRUE)
ggplot(locations_pam_10_dt, aes(x = factor(cluster), y = y, fill = factor(cluster))) +
geom_boxplot() +
labs(title = "Boxplot of Time by K-means cluster",
x = "Cluster",
y = "Age") +
scale_fill_brewer(palette = "Set1")+
theme(plot.title = element_text(hjust = 0.5))
As for the age analysis, there are no significant differences between the clusters. Only cluster number 3 is characterized by a lower average age compared to the others. This cluster corresponds to the area around the city of Danbury in the western part of the state
The project aimed to analyze drug overdose cases in the state of Connecticut, USA, using clustering techniques. The main conclusion from the analysis is that the type of substance used does not play a significant role, and the models were unable to separate observations into different clusters based on this information. The models often distinguished observations based on age, with younger individuals more frequently using ethanol, and older individuals more often using fentanyl. The geographical analysis was quite successful in separating large urban clusters into distinct groups, with the best results achieved by the PAM model with 10 clusters. During the analysis, a challenge arose in determining the number of clusters, as various metrics suggested a very high number of clusters, with the best results coming from models where the number of clusters was almost the number of observations.