Activity 4.2 - Kmeans, PAM, and DBSCAN clustering

SUBMISSION INSTRUCTIONS

  1. Render to html
  2. Publish your html to RPubs
  3. Submit a link to your published solutions

Loading required packages:

#install.packages("dbscan")

library(cluster)
library(dbscan)
library(factoextra)
library(tidyverse)
library(patchwork)
library(ggrepel)

Question 1

Reconsider the three data sets below. We will now compare kmeans, PAM, and DBSCAN to cluster these data sets.

three_spheres <- read.csv('C:/Users/lr7273ow/OneDrive - Minnesota State/Documents/GitHub/DSCI_415/Activities/Data/cluster_data1.csv')
ring_moon_sphere <- read.csv('C:/Users/lr7273ow/OneDrive - Minnesota State/Documents/GitHub/DSCI_415/Activities/Data/cluster_data2.csv')
two_spirals_sphere <- read.csv('C:/Users/lr7273ow/OneDrive - Minnesota State/Documents/GitHub/DSCI_415/Activities/Data/cluster_data3.csv')

A)

With kmeans and PAM, we can specify that we want 3 clusters. But recall with DBSCAN we select minPts and eps, and the number of clusters is determined accordingly. Use k-nearest-neighbor distance plots to determine candidate epsilon values for each data set if minPts = 4. Add horizontal line(s) to each plot indicating your selected value(s) of \(\epsilon.\)

#install.packages("FNN")

library(FNN)
Warning: package 'FNN' was built under R version 4.5.2
knn_plot <- function(data, k = 4, title = "") {
  
  knn <- FNN::get.knn(data, k = k)
  dist_k <- sort(knn$nn.dist[, k])
  
  df <- data.frame(
    index = seq_along(dist_k),
    dist = dist_k
  )
  
  ggplot(df, aes(x = index, y = dist)) +
    geom_line() +
    labs(title = title,
         x = "Points (sorted)",
         y = paste0(k, "-NN distance")) +
    theme_minimal()
}




 
#dbscan for 3 spheres
knn_plot(three_spheres, k = 4,
         title = "4-NN Distance Plot — Three Spheres") + geom_hline(yintercept = .26, linetype = "dashed", color = "red")

#epsilon looked to be around 0.14

#dbscan for ring moon spheres 
knn_plot(ring_moon_sphere, k=4, 
         title = "4-NN Distance Plot-Ring Moon Spheres") + geom_hline(yintercept= 0.24 , linetype = "dashed", color = "red")

#dbscan for two spirals
knn_plot(two_spirals_sphere, k=4, 
         title = "4-NN Distance Plot-Two Spiral Spheres") + geom_hline(yintercept = 1.25,linetype = "dashed", color = "red")

#
kNNdistplot(three_spheres[,1:2],minPts = 5)

kNNdistplot(two_spirals_sphere[,1:2],minPts = 5)

B)

Write a function called plot_dbscan_results(df, eps, minPts). This function takes a data frame, epsilon value, and minPts as arguments and does the following:

  • Runs DBSCAN on the inputted data frame df, given the eps and minPts values;
  • Creates a scatterplot of the data frame with points color-coded by assigned cluster membership. Make sure the title of the plot includes the value of eps and minPts used to create the clusters!!

Using this function, and your candidate eps values from A) as a starting point, implement DBSCAN to correctly identify the 3 cluster shapes in each of the three data sets. You will likely need to revise the eps values until you settle on a “correct” solution.

plot_dbscan_results <- function(df,eps,minPts ){
  
  #run DBSCAN 
  db <- dbscan:: dbscan(df, eps, minPts = minPts) 
  
  #Add cluster labels to the data frame 
  df_plot <- df %>% 
   dplyr:: mutate(cluster = factor(db$cluster))

  
  #Plot the clusters
  p <- ggplot(df_plot, aes(x = df_plot[[1]], y = df_plot[[2]], color = cluster)) +  
  geom_point(size = 1.5) + 
  theme_minimal() + 
    labs(
      title = paste0("DBSCAN Clustering (eps = ", eps, ", minPts = ", minPts, ")"),
      x = "X",
      y = "Y", 
      color = "Cluster"
   )
    print(p)
}
  
  
#Trying the DBSCAN on each of the epsilon values. 

#@ og epsilon value of 0.24 three spheres 
plot_dbscan_results(three_spheres, eps = 0.24, minPts = 4)
Warning: Use of `df_plot[[1]]` is discouraged.
ℹ Use `.data[[1]]` instead.
Warning: Use of `df_plot[[2]]` is discouraged.
ℹ Use `.data[[2]]` instead.

#@ og epsilon value after tweaking a little bit three sphere
plot_dbscan_results(three_spheres, eps = 0.29, minPts = 4)
Warning: Use of `df_plot[[1]]` is discouraged.
ℹ Use `.data[[1]]` instead.
Use of `df_plot[[2]]` is discouraged.
ℹ Use `.data[[2]]` instead.

#@ og epsilon value of 0.11 ring moon 
plot_dbscan_results(ring_moon_sphere, eps = 0.24, minPts = 4)  
Warning: Use of `df_plot[[1]]` is discouraged.
ℹ Use `.data[[1]]` instead.
Use of `df_plot[[2]]` is discouraged.
ℹ Use `.data[[2]]` instead.

#@ eps value after tweaking a little bit (0.28)
plot_dbscan_results(ring_moon_sphere, eps = 0.285, minPts = 4)
Warning: Use of `df_plot[[1]]` is discouraged.
ℹ Use `.data[[1]]` instead.
Use of `df_plot[[2]]` is discouraged.
ℹ Use `.data[[2]]` instead.

#@ og epsilon value of  0.4

plot_dbscan_results(two_spirals_sphere, eps = 1.25, minPts = 4
)
Warning: Use of `df_plot[[1]]` is discouraged.
ℹ Use `.data[[1]]` instead.
Use of `df_plot[[2]]` is discouraged.
ℹ Use `.data[[2]]` instead.

#@ eps value after tweaking a little bit

plot_dbscan_results(two_spirals_sphere, eps = 1.27, minPts = 4
)
Warning: Use of `df_plot[[1]]` is discouraged.
ℹ Use `.data[[1]]` instead.
Use of `df_plot[[2]]` is discouraged.
ℹ Use `.data[[2]]` instead.

C)

Compare your DBSCAN solutions to the 3-cluster solutions from k-means and PAM. Use the patchwork package and your function from B) to produce a 3x3 grid of plots: one plot per method/data set combo. Comment on your findings.

 library(patchwork)

#dbscan
plot_dbscan_results <- function(df, eps, minPts) {
  
  # keep ONLY numeric columns for DBSCAN
  df_numeric <- dplyr::select(df, where(is.numeric))
  
  # run DBSCAN
  db <- dbscan::dbscan(df_numeric, eps = eps, minPts = minPts)
  
  # add cluster column
  df_plot <- df_numeric %>%
    dplyr::mutate(cluster = factor(db$cluster))
  
  # plot using column names, not numeric indexing
  ggplot(df_plot, aes(x = x, y = y, color = cluster)) +
    geom_point(size = 1.5) +
    theme_minimal() +
    labs(
      title = paste0("DBSCAN (eps=", eps, ", minPts=", minPts, ")"),
      x = "X",
      y = "Y",
      color = "Cluster"
    )
}

#kmeans 
plot_kmeans_results <- function(df, centers) {
  
  df_numeric <- dplyr::select(df, where(is.numeric))
  
  km <- kmeans(df_numeric, centers = centers, nstart = 25)
  
  df_plot <- df_numeric %>% 
    dplyr::mutate(cluster = factor(km$cluster))
  
  ggplot(df_plot, aes(x = x, y = y, color = cluster)) +
    geom_point(size = 1.5) +
    theme_minimal() +
    labs(
      title = paste0("K-means (k = ", centers, ")"),
      x = "X",
      y = "Y",
      color = "Cluster"
    )
}


#pam
plot_pam_results <- function(df, k) {
  
  df_numeric <- dplyr::select(df, where(is.numeric))
  
  pam_res <- cluster::pam(df_numeric, k = k)
  
  df_plot <- df_numeric %>%
    dplyr::mutate(cluster = factor(pam_res$clustering))
  
  ggplot(df_plot, aes(x = x, y = y, color = cluster)) +
    geom_point(size = 1.5) +
    theme_minimal() +
    labs(
      title = paste0("PAM (k = ", k, ")"),
      x = "X",
      y = "Y",
      color = "Cluster"
    )
}


three_spheres
            x        y
1   0.8131339 4.084861
2   1.0518924 4.410097
3   0.8745572 3.866620
4   1.1435424 3.742791
5   1.2999460 4.345134
6   1.1079499 3.820448
7   0.8031028 3.827474
8   0.8912335 3.637910
9   0.8992804 3.842571
10  1.0681865 4.123401
11  1.1932278 3.932720
12  0.8892823 4.280426
13  1.2394118 3.571090
14  0.7691755 4.604014
15  1.0247093 3.918123
16  0.7862921 3.539198
17  0.9188633 4.176897
18  0.9785866 3.878812
19  1.0353742 4.230554
20  1.3795931 4.305117
21  0.7940187 3.863665
22  1.0095460 3.834677
23  1.0341107 4.175000
24  0.6796920 4.140951
25  1.1877736 3.996990
26  0.6036274 4.065677
27  0.9283108 4.093132
28  1.0172164 4.453145
29  1.3742069 3.777695
30  1.1120553 4.209853
31  0.9801792 3.909559
32  0.9797569 3.929415
33  0.7744408 4.082380
34  0.8697368 3.968143
35  0.9259300 3.393804
36  0.9647558 3.955853
37  1.1295174 3.894254
38  1.1330970 4.030003
39  1.1497125 4.072190
40  0.8147645 4.079085
41  1.0535198 4.009596
42  1.0909498 3.897154
43  1.1679289 3.745746
44  1.2189834 3.979774
45  0.9094913 3.933128
46  1.3305728 3.996288
47  1.0173299 3.835252
48  1.2016064 4.204404
49  0.9405298 4.356558
50  0.9655414 4.182181
51  0.8647251 3.728951
52  0.7450602 3.790901
53  1.0439616 4.312039
54  1.1038446 3.888212
55  0.7930091 4.180391
56  1.0412575 4.214235
57  0.9294129 4.216965
58  0.9960878 4.386737
59  1.0491910 3.640954
60  0.8846447 4.073937
61  1.0330346 4.148414
62  1.3483509 3.802894
63  0.9182967 3.574639
64  1.1407750 3.717240
65  1.3581725 4.319490
66  0.8543766 3.840055
67  1.0414402 3.807395
68  1.0722948 3.874687
69  0.8329458 3.766696
70  0.9837311 4.007095
71  0.7938279 3.867650
72  0.9025817 3.891115
73  0.9697842 3.679195
74  1.2268227 4.015942
75  1.2147374 3.866425
76  0.9785946 3.730636
77  1.0114503 3.866620
78  1.1268886 4.189743
79  0.7168993 3.827837
80  0.7405473 3.865563
81  0.8968723 4.178466
82  0.8431186 4.251361
83  0.7515127 3.465292
84  1.1772844 3.943965
85  0.9550546 3.731057
86  0.6203938 4.223703
87  1.0206894 4.108058
88  1.1467910 3.908037
89  1.0757584 3.745367
90  0.9354136 4.119445
91  1.4278847 4.194884
92  1.1180948 3.858179
93  0.7497254 4.014398
94  1.1088081 4.112933
95  1.2657804 4.428584
96  0.6924261 4.017691
97  0.9715270 3.948335
98  1.0618618 3.890726
99  1.1711467 3.826490
100 0.9024863 4.152340
101 1.7875219 3.178206
102 2.1210714 2.872378
103 1.7109095 3.014809
104 2.0861871 2.684300
105 1.8701237 3.020808
106 1.7666914 2.975524
107 1.8128653 3.176979
108 1.8897186 3.057175
109 2.0627773 2.760073
110 2.1473530 3.261750
111 1.9483216 3.270882
112 2.1546186 2.870267
113 1.9963029 2.941757
114 1.4959078 3.027445
115 1.9841154 3.040064
116 2.1822725 3.369424
117 2.0850128 3.077191
118 1.6776440 2.848804
119 2.0808511 2.923152
120 2.2601876 2.850728
121 1.7286870 3.108469
122 2.1418878 3.106608
123 2.0447257 2.881759
124 2.0498443 2.846219
125 2.0601877 3.100941
126 2.0169773 2.980953
127 1.8442148 2.700862
128 1.8975192 3.134942
129 1.8099788 2.685548
130 1.9475803 2.992949
131 2.0750316 2.944748
132 2.0618618 2.991223
133 2.3115961 3.099697
134 2.0576561 2.714938
135 2.4215270 2.486399
136 1.6683959 3.022203
137 1.9892953 3.072767
138 2.1513778 2.561965
139 1.7130948 3.320119
140 1.7763528 2.846887
141 1.8817713 3.163502
142 1.8421538 3.217877
143 1.9394005 3.459073
144 2.0746701 2.972872
145 1.9681596 3.085131
146 1.8205382 3.424168
147 1.7873701 2.844277
148 2.1659005 2.998574
149 1.8122476 3.155967
150 2.2094924 2.970507
151 1.6880013 2.859876
152 2.1233878 3.198759
153 2.3719564 3.046343
154 1.7201411 3.124989
155 2.0791674 3.004949
156 1.8850474 3.006909
157 2.2516849 3.267234
158 2.5369851 3.465378
159 1.9761238 2.943754
160 1.8088675 3.043871
161 1.5601990 3.222829
162 1.6495062 2.889838
163 1.9148438 2.889465
164 1.7616644 2.720298
165 2.0998620 3.060538
166 2.4020394 2.578524
167 2.3520262 3.376687
168 1.9082290 2.661465
169 2.0257712 3.328539
170 1.8293005 3.337713
171 1.9929737 2.796102
172 2.1197114 3.252944
173 1.9242784 2.989755
174 2.2995571 2.666137
175 1.8661205 3.225437
176 1.9087120 2.697145
177 1.9701494 3.223456
178 2.2201313 3.041756
179 1.9243165 2.898615
180 1.8569434 2.986407
181 1.7451910 3.346606
182 2.2158653 3.085335
183 1.9157895 3.159590
184 2.3468642 3.135337
185 2.0994880 3.298039
186 1.8423446 2.983745
187 2.3238755 3.253776
188 1.9130496 3.216226
189 2.0547086 2.915632
190 2.1617032 2.876137
191 2.2650297 2.949564
192 1.9349440 2.895680
193 2.2773399 3.201324
194 2.1525428 2.887062
195 2.2934223 3.052111
196 1.9796042 3.270961
197 1.8046324 2.942290
198 1.9466536 3.218901
199 2.3406807 2.533981
200 1.8117508 2.826063
201 2.8462091 3.785487
202 3.0068233 3.969149
203 2.5158774 4.033630
204 3.0598290 3.910584
205 2.9886552 4.128880
206 3.3490797 4.498037
207 3.0871214 4.085186
208 2.8782953 3.786661
209 2.7667600 3.823192
210 2.8768234 3.586523
211 2.9618658 3.813506
212 3.1049103 4.135456
213 3.1004721 3.914012
214 2.8723065 4.132528
215 2.8899511 4.138750
216 3.3793519 4.185737
217 3.2361296 4.000341
218 2.9265196 4.173778
219 2.9921908 3.841311
220 3.0307444 4.136205
221 2.6255651 4.154315
222 2.9726929 3.979482
223 3.0529673 3.894076
224 2.7402724 4.080542
225 2.9845377 4.000087
226 3.3817253 4.237672
227 2.9591532 4.295109
228 3.0835091 3.923510
229 3.3203532 4.044456
230 3.0863721 3.730639
231 2.6859966 3.783978
232 3.1327534 4.389182
233 3.0455145 3.864574
234 3.1841840 3.836356
235 2.8144682 3.985821
236 2.8261097 3.890042
237 2.9723580 4.083361
238 3.0779386 4.104962
239 3.2312926 4.237007
240 3.0035291 4.318167
241 2.8858952 4.039000
242 2.8152711 3.660066
243 2.9373028 3.975357
244 3.1382894 3.917117
245 3.1687054 4.136177
246 3.0999858 4.128190
247 2.7769261 3.661160
248 3.2512821 4.102005
249 3.1201269 3.846332
250 3.1836503 4.029442
251 3.0829865 4.188813
252 2.9812309 4.033411
253 3.2425394 3.826754
254 3.1628092 4.198101
255 3.0260355 3.749005
256 2.9827652 3.947035
257 3.0389127 4.000082
258 2.7606086 3.716426
259 3.0503476 3.873861
260 3.0389852 4.108492
261 2.8378262 4.185987
262 2.2899377 3.998431
263 2.7950654 4.041495
264 3.2613662 3.915076
265 2.7329741 4.269695
266 2.6215694 4.360090
267 2.7579019 4.580998
268 3.2470186 3.976161
269 3.2544428 3.751013
270 3.1165321 3.989267
271 2.7450306 3.658142
272 2.8378446 4.220559
273 3.2774462 4.180639
274 3.0987977 4.414383
275 3.1336114 3.829177
276 2.8409350 4.208425
277 3.1186726 4.558691
278 2.7046521 3.868512
279 2.9611149 4.005875
280 3.2848977 3.980547
281 3.4115816 4.089780
282 2.8853126 4.459397
283 3.0190778 4.009302
284 2.9029999 3.847450
285 3.1483029 4.101669
286 2.7180423 3.982661
287 3.2741374 3.964511
288 3.0920512 3.891550
289 3.2905735 3.878872
290 3.4192674 3.884962
291 2.9229976 3.831025
292 2.8966877 4.197647
293 2.8029638 4.066016
294 2.8127196 4.436529
295 3.0761045 3.751543
296 3.1093809 3.927499
297 2.4334188 3.745932
298 2.9509843 4.011124
299 2.9058766 3.670362
300 3.0318281 3.837162
# kmeans 
p1  <- plot_kmeans_results(three_spheres, 3)
p2  <- plot_kmeans_results(ring_moon_sphere, 3)
p3  <- plot_kmeans_results(two_spirals_sphere, 3)

#pam 

p4 <- plot_pam_results(three_spheres, 3)
p5 <- plot_pam_results(ring_moon_sphere, 3)
p6 <- plot_pam_results(two_spirals_sphere, 3)

three_spheres <- three_spheres[, c("x", "y")]


#dbscan 
three_spheres
            x        y
1   0.8131339 4.084861
2   1.0518924 4.410097
3   0.8745572 3.866620
4   1.1435424 3.742791
5   1.2999460 4.345134
6   1.1079499 3.820448
7   0.8031028 3.827474
8   0.8912335 3.637910
9   0.8992804 3.842571
10  1.0681865 4.123401
11  1.1932278 3.932720
12  0.8892823 4.280426
13  1.2394118 3.571090
14  0.7691755 4.604014
15  1.0247093 3.918123
16  0.7862921 3.539198
17  0.9188633 4.176897
18  0.9785866 3.878812
19  1.0353742 4.230554
20  1.3795931 4.305117
21  0.7940187 3.863665
22  1.0095460 3.834677
23  1.0341107 4.175000
24  0.6796920 4.140951
25  1.1877736 3.996990
26  0.6036274 4.065677
27  0.9283108 4.093132
28  1.0172164 4.453145
29  1.3742069 3.777695
30  1.1120553 4.209853
31  0.9801792 3.909559
32  0.9797569 3.929415
33  0.7744408 4.082380
34  0.8697368 3.968143
35  0.9259300 3.393804
36  0.9647558 3.955853
37  1.1295174 3.894254
38  1.1330970 4.030003
39  1.1497125 4.072190
40  0.8147645 4.079085
41  1.0535198 4.009596
42  1.0909498 3.897154
43  1.1679289 3.745746
44  1.2189834 3.979774
45  0.9094913 3.933128
46  1.3305728 3.996288
47  1.0173299 3.835252
48  1.2016064 4.204404
49  0.9405298 4.356558
50  0.9655414 4.182181
51  0.8647251 3.728951
52  0.7450602 3.790901
53  1.0439616 4.312039
54  1.1038446 3.888212
55  0.7930091 4.180391
56  1.0412575 4.214235
57  0.9294129 4.216965
58  0.9960878 4.386737
59  1.0491910 3.640954
60  0.8846447 4.073937
61  1.0330346 4.148414
62  1.3483509 3.802894
63  0.9182967 3.574639
64  1.1407750 3.717240
65  1.3581725 4.319490
66  0.8543766 3.840055
67  1.0414402 3.807395
68  1.0722948 3.874687
69  0.8329458 3.766696
70  0.9837311 4.007095
71  0.7938279 3.867650
72  0.9025817 3.891115
73  0.9697842 3.679195
74  1.2268227 4.015942
75  1.2147374 3.866425
76  0.9785946 3.730636
77  1.0114503 3.866620
78  1.1268886 4.189743
79  0.7168993 3.827837
80  0.7405473 3.865563
81  0.8968723 4.178466
82  0.8431186 4.251361
83  0.7515127 3.465292
84  1.1772844 3.943965
85  0.9550546 3.731057
86  0.6203938 4.223703
87  1.0206894 4.108058
88  1.1467910 3.908037
89  1.0757584 3.745367
90  0.9354136 4.119445
91  1.4278847 4.194884
92  1.1180948 3.858179
93  0.7497254 4.014398
94  1.1088081 4.112933
95  1.2657804 4.428584
96  0.6924261 4.017691
97  0.9715270 3.948335
98  1.0618618 3.890726
99  1.1711467 3.826490
100 0.9024863 4.152340
101 1.7875219 3.178206
102 2.1210714 2.872378
103 1.7109095 3.014809
104 2.0861871 2.684300
105 1.8701237 3.020808
106 1.7666914 2.975524
107 1.8128653 3.176979
108 1.8897186 3.057175
109 2.0627773 2.760073
110 2.1473530 3.261750
111 1.9483216 3.270882
112 2.1546186 2.870267
113 1.9963029 2.941757
114 1.4959078 3.027445
115 1.9841154 3.040064
116 2.1822725 3.369424
117 2.0850128 3.077191
118 1.6776440 2.848804
119 2.0808511 2.923152
120 2.2601876 2.850728
121 1.7286870 3.108469
122 2.1418878 3.106608
123 2.0447257 2.881759
124 2.0498443 2.846219
125 2.0601877 3.100941
126 2.0169773 2.980953
127 1.8442148 2.700862
128 1.8975192 3.134942
129 1.8099788 2.685548
130 1.9475803 2.992949
131 2.0750316 2.944748
132 2.0618618 2.991223
133 2.3115961 3.099697
134 2.0576561 2.714938
135 2.4215270 2.486399
136 1.6683959 3.022203
137 1.9892953 3.072767
138 2.1513778 2.561965
139 1.7130948 3.320119
140 1.7763528 2.846887
141 1.8817713 3.163502
142 1.8421538 3.217877
143 1.9394005 3.459073
144 2.0746701 2.972872
145 1.9681596 3.085131
146 1.8205382 3.424168
147 1.7873701 2.844277
148 2.1659005 2.998574
149 1.8122476 3.155967
150 2.2094924 2.970507
151 1.6880013 2.859876
152 2.1233878 3.198759
153 2.3719564 3.046343
154 1.7201411 3.124989
155 2.0791674 3.004949
156 1.8850474 3.006909
157 2.2516849 3.267234
158 2.5369851 3.465378
159 1.9761238 2.943754
160 1.8088675 3.043871
161 1.5601990 3.222829
162 1.6495062 2.889838
163 1.9148438 2.889465
164 1.7616644 2.720298
165 2.0998620 3.060538
166 2.4020394 2.578524
167 2.3520262 3.376687
168 1.9082290 2.661465
169 2.0257712 3.328539
170 1.8293005 3.337713
171 1.9929737 2.796102
172 2.1197114 3.252944
173 1.9242784 2.989755
174 2.2995571 2.666137
175 1.8661205 3.225437
176 1.9087120 2.697145
177 1.9701494 3.223456
178 2.2201313 3.041756
179 1.9243165 2.898615
180 1.8569434 2.986407
181 1.7451910 3.346606
182 2.2158653 3.085335
183 1.9157895 3.159590
184 2.3468642 3.135337
185 2.0994880 3.298039
186 1.8423446 2.983745
187 2.3238755 3.253776
188 1.9130496 3.216226
189 2.0547086 2.915632
190 2.1617032 2.876137
191 2.2650297 2.949564
192 1.9349440 2.895680
193 2.2773399 3.201324
194 2.1525428 2.887062
195 2.2934223 3.052111
196 1.9796042 3.270961
197 1.8046324 2.942290
198 1.9466536 3.218901
199 2.3406807 2.533981
200 1.8117508 2.826063
201 2.8462091 3.785487
202 3.0068233 3.969149
203 2.5158774 4.033630
204 3.0598290 3.910584
205 2.9886552 4.128880
206 3.3490797 4.498037
207 3.0871214 4.085186
208 2.8782953 3.786661
209 2.7667600 3.823192
210 2.8768234 3.586523
211 2.9618658 3.813506
212 3.1049103 4.135456
213 3.1004721 3.914012
214 2.8723065 4.132528
215 2.8899511 4.138750
216 3.3793519 4.185737
217 3.2361296 4.000341
218 2.9265196 4.173778
219 2.9921908 3.841311
220 3.0307444 4.136205
221 2.6255651 4.154315
222 2.9726929 3.979482
223 3.0529673 3.894076
224 2.7402724 4.080542
225 2.9845377 4.000087
226 3.3817253 4.237672
227 2.9591532 4.295109
228 3.0835091 3.923510
229 3.3203532 4.044456
230 3.0863721 3.730639
231 2.6859966 3.783978
232 3.1327534 4.389182
233 3.0455145 3.864574
234 3.1841840 3.836356
235 2.8144682 3.985821
236 2.8261097 3.890042
237 2.9723580 4.083361
238 3.0779386 4.104962
239 3.2312926 4.237007
240 3.0035291 4.318167
241 2.8858952 4.039000
242 2.8152711 3.660066
243 2.9373028 3.975357
244 3.1382894 3.917117
245 3.1687054 4.136177
246 3.0999858 4.128190
247 2.7769261 3.661160
248 3.2512821 4.102005
249 3.1201269 3.846332
250 3.1836503 4.029442
251 3.0829865 4.188813
252 2.9812309 4.033411
253 3.2425394 3.826754
254 3.1628092 4.198101
255 3.0260355 3.749005
256 2.9827652 3.947035
257 3.0389127 4.000082
258 2.7606086 3.716426
259 3.0503476 3.873861
260 3.0389852 4.108492
261 2.8378262 4.185987
262 2.2899377 3.998431
263 2.7950654 4.041495
264 3.2613662 3.915076
265 2.7329741 4.269695
266 2.6215694 4.360090
267 2.7579019 4.580998
268 3.2470186 3.976161
269 3.2544428 3.751013
270 3.1165321 3.989267
271 2.7450306 3.658142
272 2.8378446 4.220559
273 3.2774462 4.180639
274 3.0987977 4.414383
275 3.1336114 3.829177
276 2.8409350 4.208425
277 3.1186726 4.558691
278 2.7046521 3.868512
279 2.9611149 4.005875
280 3.2848977 3.980547
281 3.4115816 4.089780
282 2.8853126 4.459397
283 3.0190778 4.009302
284 2.9029999 3.847450
285 3.1483029 4.101669
286 2.7180423 3.982661
287 3.2741374 3.964511
288 3.0920512 3.891550
289 3.2905735 3.878872
290 3.4192674 3.884962
291 2.9229976 3.831025
292 2.8966877 4.197647
293 2.8029638 4.066016
294 2.8127196 4.436529
295 3.0761045 3.751543
296 3.1093809 3.927499
297 2.4334188 3.745932
298 2.9509843 4.011124
299 2.9058766 3.670362
300 3.0318281 3.837162
p7 <- plot_dbscan_results(three_spheres, eps = 0.29, minPts = 4)
p8 <- plot_dbscan_results(ring_moon_sphere, eps = 0.285, minPts = 4)
p9 <- plot_dbscan_results(two_spirals_sphere, eps = 1.27, minPts = 4)

(p1 | p2 | p3) /
(p4 | p5 | p6) /
(p7 | p8 | p9)

#findings

#Out of the graphs for three spheres I would say that the partitioning clustering of Kmeans and PAM worked really well
#for the three cluster data set.Looking very similar to the 
# scatterplots that came out of the dbscan results but in this
#case kmeans and pam have only 3 clusters without the extra '0'
#cluster that just acts as noise. 


#For the graphs of ring moon sphere, I honestly thought that
#the DBSCAN blew the other 2 options out of the water with respect  
#to accuracy. The scatterplot is the closest to what the dbscan #result scatterplot displayed. Visually we can see one ring (circle), one 'moon' (semi circle) and a sphere (filled circular cluster)
#along w/ 1 dot that is a 0 which acts as our noise. I would say that PAM is a better  


#For the graphs of two spirals, I can say that the DBSCAN once again
#did fantastic in creating the graph. It visually came closer to 
#creating what the DBSCAN wanted to create and came very close to 
#creating the dbscan scatterplot from the plot_dbscan_results(df, 
#eps =#, MinPts = #) code from before. Neither the Kmeans nor the
#pam graphs come close to replicating the dataset with respect to 
#the visuals of the graph which are very close visually to what   
#the name of the data set is (two spirals one sphere) but 
#also shows how accurate the epsilon we chose is for the DBSCAN.

Question 2

In this question we will apply cluster analysis to analyze economic development indicators (WDIs) from the World Bank. The data are all 2020 indicators and include:

  • life_expectancy: average life expectancy at birth
  • gdp: GDP per capita, in 2015 USD
  • co2: CO2 emissions, in metric tons per capita
  • fert_rate: annual births per 1000 women
  • health: percentage of GDP spent on health care
  • imports and exports: imports and exports as a percentage of GDP
  • internet and electricity: percentage of population with access to internet and electricity, respectively
  • infant_mort: infant mortality rate, infant deaths per 1000 live births
  • inflation: consumer price inflation, as annual percentage
  • income: annual per-capita income, in 2020 USD
wdi <- read.csv('C:/Users/lr7273ow/OneDrive - Minnesota State/Documents/GitHub/DSCI_415/Activities/Data/wdi_extract_clean.csv') 

head(wdi)
      country life_expectancy        gdp      co2 fert_rate    health internet
1 Afghanistan        61.45400   527.8346 0.180555     5.145 15.533614  17.0485
2     Albania        77.82400  4437.6535 1.607133     1.371  7.503894  72.2377
3     Algeria        73.25700  4363.6853 3.902928     2.940  5.638317  63.4727
4      Angola        63.11600  2433.3764 0.619139     5.371  3.274885  36.6347
5   Argentina        75.87800 11393.0506 3.764393     1.601 10.450306  85.5144
6     Armenia        73.37561  4032.0904 2.334560     1.700 12.240562  76.5077
  infant_mort electricity  imports inflation  exports    income
1        55.3        97.7 36.28908  5.601888 10.42082  475.7181
2         8.1       100.0 36.97995  1.620887 22.54076 4322.5497
3        20.4        99.7 24.85456  2.415131 15.53520 2689.8725
4        42.3        47.0 27.62749 22.271539 38.31454 1100.2175
5         8.7       100.0 13.59828 42.015095 16.60541 7241.0303
6        10.2       100.0 39.72382  1.211436 29.76499 3617.0320

Focus on using kmeans for this problem.

A)

My claim: 3-5 clusters appear optimal for this data set. Support or refute my claim using appropriate visualizations.

#We can try creating a kmeans vs wss plot that looks at what the app. 
#number of clusters would be 
wdi_num <- wdi |> dplyr::select(where(is.numeric))

fviz_nbclust(wdi_num,
            FUNcluster = kmeans,
            method = 'wss') + 
  labs(title = 'Plot of WSS vs k using means')

fviz_nbclust(wdi_num,
             FUNcluster = kmeans,
             method = 'silhouette') + 
  labs(title = 'Plot of silhouette vs. k using means')

#For the most part we can see that judging by a WSS and silhouette elbow
#plot we can see that their is a 'elbow' at the second x axis, which #means that the clustering of the data points starts really starts at  
#that point (k=2), with the number of clusters becomes large at that 
#point. 
#

#These 2 plots really debunk/refute your claim that 3-5 clusters would  
#be optimal for the wdi data. The avg. sil. shows that it's peak is at k 
#= 2. Meaning that's really the peak number of clusters with a second 
#peak at the k=4. So while 3-5 is acceptable, it's not at it's prime 
#in that range.  
#

B)

Use k-means to identify 4 clusters. Characterize the 4 clusters using a dimension reduction technique. Provide examples of countries that are representative of each cluster. Be thorough.

wdi_num <- wdi |> dplyr::select(-country)


#rownames(wdi_num) <- wdi$country

kmeans4 <- kmeans(wdi_num, center =  4, nstart = 10) 

#Using a PCA (dimension reduction)
kmeans_pca <- prcomp(wdi_num,center = TRUE , scale. = TRUE)

kmeans_biplot <- fviz_pca(
  kmeans_pca,
  habillage = factor(kmeans4$cluster),
  label = "all",     # <-- shows country names instead of numbers
  repel = TRUE       # <-- spreads them out so they don’t collide
) +
  ggtitle("K-means 4-cluster solution") +
  guides(color = "none", shape = "none")
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the ggpubr package.
  Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
kmeans_biplot

#Ideally if I had more time I would be able to fix the error that comes
#with this but I sadly cannot. It would allow a rule for each 
#point to be labeled only if it met that numerical metric. So the 
#plot wouldn't have so many country names.

#wdi_num$label_lifeexp <- ifelse(wdi_num$life_expectancy >80, #wdi_num$country, "")
#wdi_num$label_co2     <- ifelse(wdi_num$co2 > 20, wdi_num$country, "")
#wdi_num$label_gdp     <- ifelse(wdi_num$gdp > 80000, wdi_num$country, "")

#wdi_num$label <- ifelse(wdi_num$life_expectancy > 80 |
#                   df$co2 > 20 |
#                   wdi_num$gdp > 80000, 
#                   wdi_num$country, "")

#ggplot(PCA_scores, aes(Dim1, Dim2)) +
#  geom_point(aes(color = cluster)) +
#  geom_text_repel(aes(label = df$label),
#                  size = 3,
#                  max.overlaps = 20,
#                  segment.alpha = 0.5)

#PC1 explains 52.4% of the variance for the plot, while PC2 explains 
#14.6% of the variance for the plot. 
#But one thing can be noticed for each cluster on what really drives 
#first cluster (green ) is fert_rate and infant_mort variables. Countries that are #driven by these high infant mortality would be Togo, Benin, Vanuana, #Chad, Senegal. Many of them also share high fertility rates on the other
#hand. A few others in the top left corner seem to have either a higher
#infant mortality or fertility rates (generally the former) but be marked #by not being rich in exports or imports
#

#The second cluster(red) is driven primarly by the variables being high in the following variables- health, electricity and life expectancy. 
#This red cluster spans not quite as far as the first green one, but 
#does entail a larger span of the biplot than the other 2 clusters we 
#have. One thing that is noticeable is that their is a sizeable portion of the red cluster that is not far from the imports and exports vectors
#meanign that a significant # of countries have higher than average 
#imports, exports or amybe a higher gdp. Good examples include Saudi Arabia, Greece, Portugal 

#The third cluster (purple) is filled with countries that like the
#second cluster are rich in  health, electricity, life expectancy., but #fall further than them in how far the project along those vectors. #Another thing is that many of them fall further (one of them looks like #a outlier) along the proj. of vectors - gdp, exports and imports- #meaning that they have more in those areas While maybe also not being as #strong in health or electricity. Good examples include Netherlands,
#Belgium, Denmark 

#the fourth cluster which is a torqoise blue is generally a smaller #cluster that is stronger in the variables internet,co2, income or gdp.
#many of the country points are outliers in how far there are projected 
#from the vectors. Particulrarly Luxembourg and Ireland. Better #reprentatives would be Norway or Switzerland for the fourth cluster

C)

Remove Ireland, Singapore, and Luxembourg from the data set. Use k-means to find 4 clusters again, with these three countries removed. How do the cluster definitions change?

library(dplyr)

wdi_filtered <- wdi[!wdi$country %in% c("Ireland", "Singapore", "Luxembourg"), ]

wdi_num_filtered <- wdi_filtered |>  dplyr :: select(-country)

kmeans_pca_filtered <- prcomp(wdi_num_filtered, center = TRUE, scale. = TRUE)

kmeans4_filtered<- kmeans(wdi_num_filtered, centers = 4, nstart = 10)

#Non labelled version 
fviz_pca(kmeans_pca_filtered,
         habillage = factor(kmeans4_filtered$cluster),
         repel = TRUE,
         label = "none") +
  ggtitle("K-means 4-cluster solution (Ireland, Singapore, Luxembourg removed)")

#From the get go we can see that PC1 explains less of the variance for #the graph than the PC1 for the original graph did (51.7% vs 52.4%). The #other thing we can see is the PC2 explains less of the variance for the
#graph than original graph did aswell (14.1% vs 14.6%). The other thing
#we can clearly see is the fact that some of the vectors are not pointing
#in the same direction as before. My guess (if they were labelled) is that the vectors of exports, imports and gdp are the ones that are not 
#in the same direction since they were being pulled by the data points
#of Ireland, Singapore & Luxembourg. They now probably point more towards #the upper right quadrant. Overall just a cleaner graph. 
#


fviz_pca_biplot(
  kmeans_pca_filtered,
  habillage = factor(kmeans4_filtered$cluster),   
  repel = TRUE,                             
  label = "var"                           
) +
  ggtitle("K-means 4-cluster solution (PCA biplot)")

#As suspected many of the vectors did indeed move in another direction 
#after getting rid of Ireland, Luxembourg & Singapore. We see that 
#the gdp, income, and co2 vectors move up in to the upper right quad.
#while the export and import vectors (two vectors w/ stronger effects)
#stay in the same place. Vectors infant_mort, fert_rate and inflation 
#all move up into (or near) the upper left quadrant. Overall a much 
#cleaner plot.

fviz_pca_ind(kmeans_pca_filtered,
             geom.ind = "text",
             repel = TRUE,
             label = wdi_filtered$country,
             habillage = factor(kmeans4_filtered$cluster))