multishapes data:

Methods like DBSCAN redefine what it means to be a cluster:
Clusters are dense regions in \(p-\)space, separated by regions of lower densities of points.
With this definition, a point might not belong to any cluster at all! (See the blue points.)

Types of points:
- Core point: has at least MinPts neighbors within distance \(\epsilon\)
- Border point: has fewer than MinPts neighbors, but does have at least one core point as a neighbor

(Figures: two illustrations, each with MinPts = 5)

MinPts and \(\epsilon\)

minPts:
- Think of minPts as the “minimum expected size of a cluster”: can incorporate domain expertise here
- A minPts of 4 or 5 is a typical default
- Larger minPts implies points must be more dense to be considered core points \(\Rightarrow\) fewer clusters, more outliers
- minPts = 1 \(\Rightarrow\) a cluster could have just a single point

Together, minPts and \(\epsilon\) determine the number of clusters and the classification of outliers:
- Fix minPts (default in dbscan is 5) and choose \(\epsilon\)
- For the chosen minPts (say \(k=5\)), calculate the distance of each point to its \((k-1)^{\text{st}}\) nearest neighbor

dbscan:
- The dbscan library includes both the dbscan and kNNdistplot functions
- dbscan for the multishapes data set
- A horizontal reference line marks the candidate \(\epsilon\) on the k-nearest-neighbor distance plot (abline(h = 0.15))

Implementing dbscan with eps = 0.15:
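The code for this step is not reproduced in these notes. What follows is only a minimal sketch of the workflow described above, assuming the multishapes data shipped with the factoextra package (columns x and y) and the dbscan package. The object names `shapes_xy` and `shapes_db` are hypothetical.

```r
library(dbscan)

# Assumption: multishapes is the data set from the factoextra package
data("multishapes", package = "factoextra")
shapes_xy <- as.matrix(multishapes[, c("x", "y")])

# k-NN distance plot: with minPts = 5, look at the (minPts - 1) = 4th neighbor
kNNdistplot(shapes_xy, k = 4)
abline(h = 0.15, lty = 2)  # candidate epsilon read off the bend in the curve

# Run DBSCAN with the chosen epsilon
shapes_db <- dbscan(shapes_xy, eps = 0.15, minPts = 5)

# Label 0 collects the points assigned to no cluster (noise)
table(shapes_db$cluster)
```

The bend in the sorted distance curve is a judgment call, so it is worth rerunning the last two steps with a slightly smaller and a slightly larger eps to see how stable the clusters are.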
library(tidyverse)  # provides %>% and column_to_rownames()

# Read the state-level data and use the State column as row names
drugs <- read.csv('Data/IllicitDrug.csv') %>%
  column_to_rownames('State')

# Standardize each variable (mean 0, sd 1) and preview the first rows
(drugs_scaled <- scale(drugs)) %>% head

              DrugUse BingeDrink    Poverty     HSdrop     Income
Alabama    -0.9150748 -1.2127521  0.6098588  0.9105382 -1.0569412
Alaska      3.0829126  0.4407902 -0.8767792  0.2063128  0.2459105
Arizona     0.6994970 -0.5040911  1.2220039  1.6561886 -0.5053843
Arkansas   -1.2994966 -0.7403114  0.6973081  0.4134379 -1.2513061
California  1.3145720 -0.7796815  0.8722067  1.5733385  0.5486712
Colorado    1.5452251  0.4407902 -0.9350788 -0.2493624  0.9829551
epsilon value is too high.

Results with \(\epsilon = 1.3\), minPts = 5:
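The results for this step are not reproduced in these notes. As an illustration of how outliers can be read off a DBSCAN fit on scaled state-level data, here is a minimal sketch that uses R's built-in USArrests data as a stand-in for the drug data (an assumption, since the original file is not available here). The eps value is carried over from the text purely for illustration and has not been tuned for USArrests. The names `arrests_scaled`, `arrests_db`, and `outlier_states` are hypothetical.

```r
library(dbscan)

# Stand-in data (assumption): built-in USArrests, one row per state
arrests_scaled <- scale(USArrests)

# Same settings as quoted in the text; not tuned for this stand-in data
arrests_db <- dbscan(arrests_scaled, eps = 1.3, minPts = 5)

# Points labelled 0 are noise: states that belong to no cluster
outlier_states <- rownames(arrests_scaled)[arrests_db$cluster == 0]
outlier_states

# Cluster sizes (a column named "0", if present, counts the outliers)
table(arrests_db$cluster)
```

If nearly every state lands in a single cluster, that is a hint that eps is too large for the data; if most states are labelled 0, it is probably too small.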
Oil_ID Region Area Palmitic Palmitoleic Strearic Oleic Linoleic
1 1 Southern North-Apulia 1075 75 226 7823 672
2 2 Southern North-Apulia 1088 73 224 7709 781
3 3 Southern North-Apulia 911 54 246 8113 549
4 4 Southern North-Apulia 966 57 240 7952 619
5 5 Southern North-Apulia 1051 67 259 7771 672
6 6 Southern North-Apulia 911 49 268 7924 678
Linolenic Eicosanoic Eicosenoic
1 36 60 29
2 31 61 29
3 31 63 29
4 50 78 35
5 50 80 46
6 51 70 44
# Two PCA plots side by side; combining plots with `+` assumes patchwork is loaded
fviz_pca(pca,
         habillage = factor(oils$Area),
         label = 'var',
         repel = TRUE) +
  ggtitle('PCA of olive oil data color coded by region') +
  labs(color = 'Area', shape = 'Area') +
fviz_pca(pca,
         habillage = factor(kmeans_best$cluster),
         label = 'var',
         repel = TRUE) +
  ggtitle('PCA of olive oil data color coded by K-means cluster') +
  labs(color = 'Cluster', shape = 'Cluster')

With minPts = 5, a couple of candidate \(\epsilon\) values:

Implementation:
# Legends show DBSCAN cluster labels (0 = noise), so label them 'Cluster'
fviz_pca(pca,
         habillage = factor(oil_dbscan1$cluster),
         label = 'var',
         repel = TRUE) +
  ggtitle('DBSCAN clustering results, epsilon = 1.2') +
  labs(color = 'Cluster', shape = 'Cluster') +
fviz_pca(pca,
         habillage = factor(oil_dbscan2$cluster),
         label = 'var',
         repel = TRUE) +
  ggtitle('DBSCAN clustering results, epsilon = 1.7') +
  labs(color = 'Cluster', shape = 'Cluster')

What about a different minPts? Increasing minPts to 10:

With minPts = 10, a couple of candidate \(\epsilon\) values:

Implementation with minPts = 10:
oil_dbscan3 <- dbscan(oils_scaled, eps = 1.2, minPts = 10)
oil_dbscan4 <- dbscan(oils_scaled, eps = 1.4, minPts = 10)
oil_dbscan5 <- dbscan(oils_scaled, eps = 1.8, minPts = 10)
fviz_pca(pca,
         habillage = factor(oil_dbscan3$cluster),
         label = 'var',
         repel = TRUE) +
  ggtitle('DBSCAN, eps = 1.2, minPts = 10') +
  labs(color = 'Cluster', shape = 'Cluster') +
fviz_pca(pca,
         habillage = factor(oil_dbscan4$cluster),
         label = 'var',
         repel = TRUE) +
  ggtitle('DBSCAN, eps = 1.4, minPts = 10') +
  labs(color = 'Cluster', shape = 'Cluster') +
fviz_pca(pca,
         habillage = factor(oil_dbscan5$cluster),
         label = 'var',
         repel = TRUE) +
  ggtitle('DBSCAN, eps = 1.8, minPts = 10') +
  labs(color = 'Cluster', shape = 'Cluster')
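
Plots like these can be complemented by a small table of how many clusters and noise points each setting produces. The following is a minimal sketch, assuming the dbscan package. The helper `summarise_dbscan` is hypothetical, and the scaled built-in iris measurements stand in for oils_scaled, which is not defined in these notes.

```r
library(dbscan)

# Hypothetical helper: fit DBSCAN for each eps and count clusters / noise points
summarise_dbscan <- function(x, eps_values, minPts) {
  rows <- lapply(eps_values, function(eps) {
    fit <- dbscan(x, eps = eps, minPts = minPts)
    data.frame(
      eps        = eps,
      minPts     = minPts,
      n_clusters = length(setdiff(unique(fit$cluster), 0)),  # label 0 = noise
      n_noise    = sum(fit$cluster == 0)
    )
  })
  do.call(rbind, rows)
}

# Stand-in data (assumption): scaled iris measurements instead of oils_scaled
x <- scale(iris[, 1:4])
eps_summary <- summarise_dbscan(x, eps_values = c(1.2, 1.4, 1.8), minPts = 10)
eps_summary
```

With the olive oil data available, calling `summarise_dbscan(oils_scaled, c(1.2, 1.4, 1.8), minPts = 10)` should summarise the same three fits as the plots. For a fixed minPts, the noise count cannot increase as eps grows, which makes the table a quick sanity check on the fits.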