1. The k-means clustering algorithm will identify a collection of k clusters using a heuristic search, starting with a selection of k clusters. TRUE or FALSE
-->
TRUE
2. What does heuristic mean?
-->
Heuristic – enabling someone to discover or learn something for themselves.
Or proceeding to a solution by trial and error or by rules that are only loosely defined.
3. What is the k-means approach?
-->
(i). Partition n observations into k clusters, where each observation belongs to the cluster with the nearest centroid (the mean of the cluster’s points).
(ii). Minimize the within-cluster error and maximize the separation (error) between clusters.
4. Fill-in-the-blank:
The __ means __ for the collection of cases that form one of the __ k clusters __ in any particular clustering are then the collection of __ mean values __ for each of the input variables over the cases within that cluster.
5. The k-means clustering algorithm is a hierarchical method. TRUE or FALSE
-->
FALSE
6. What does the k-means clustering algorithm consist of?
--> It consists of the following steps (a small from-scratch sketch of them follows this list):
(i). Initialize the centers of the k groups to a set of k randomly chosen observations.
(ii). Allocate each observation to the group whose center is nearest.
(iii). Update the centroid (mean) value of each cluster.
(iv). Repeat steps (ii) and (iii) until the groups are stable.
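Below is a minimal sketch of these four steps in R. It is my own illustration, not code from the text or the lecture: the function name kmeans_sketch is made up, and there is no guard for clusters that become empty.

kmeans_sketch <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  # (i) initialize the centers to k randomly chosen observations
  centers <- X[sample(nrow(X), k), , drop = FALSE]
  assign <- rep(0L, nrow(X))
  for (it in seq_len(max_iter)) {
    # (ii) allocate each observation to the group whose center is nearest
    d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
    new_assign <- max.col(-d)                # column index of the nearest center
    if (all(new_assign == assign)) break     # (iv) stop once the groups are stable
    assign <- new_assign
    # (iii) update the centroid (mean) of each cluster
    centers <- apply(X, 2, function(col) tapply(col, assign, mean))
  }
  list(cluster = assign, centers = centers)
}
set.seed(1)
res <- kmeans_sketch(iris[, -5], k = 3)
table(res$cluster, iris$Species)             # compare the found groups with the species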
7. What is data noise?
-->
The definition of noise in data varies by field of study. One common definition: noise points are observations that do not have enough other observations within their radius and are not sufficiently close to any core point (the density-based notion used by algorithms such as DBSCAN).
The algorithm starts by setting the noise points aside in a separate cluster, since these cases are so different from the rest that it does not make sense to use them when forming the clusters.
8. When using the k-means algorithm, why isn’t it a good idea to use different starting points as cluster centers?
-->
The k-means optimization finds a local minimum (it minimizes the sum of squared distances), and which minimum it finds depends on the starting points. Different starting values can lead the algorithm to different, possibly completely different, clusterings.
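A small illustration of this sensitivity (my own code, not from the text; the object names run1, run2, and km_best are my own): two single-start kmeans() runs from different random seeds can land in different local minima, which is exactly why kmeans() offers the nstart argument.

set.seed(2)
run1 <- kmeans(iris[, -5], centers = 3, nstart = 1)
set.seed(7)
run2 <- kmeans(iris[, -5], centers = 3, nstart = 1)
c(run1 = run1$tot.withinss, run2 = run2$tot.withinss)   # the two SSE values may differ
table(run1$cluster, run2$cluster)                        # memberships may differ too
km_best <- kmeans(iris[, -5], centers = 3, nstart = 25)  # 25 random starts, best fit kept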
9. The k-means algorithm results in cluster separation that yields a stable, non-changing, maximal clustering solution even when different starting points are
used as centers. TRUE or FALSE
-->
FALSE
10. What are the 3 species of plants in the Iris dataset?
-->
The three Iris species: setosa, versicolor, and virginica.
11. In the chunk of code for k-means clustering applied to the Iris dataset on page 122 in Torgo’s text, explain the following arguments in the kmeans() function:
(a) Iris[ , -5]
--> The iris data frame with the 5th column (Species) dropped, i.e., only the four numeric measurement variables.
(b) iter.max = 200
-->
The maximum number of iterations allowed is 200
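For reference, a call along the lines of the one in Torgo’s text (the exact arguments on page 122 may differ slightly; this is only a sketch):

ir3 <- kmeans(iris[, -5], centers = 3, iter.max = 200)
ir3$size   # number of observations that ended up in each of the 3 clusters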
12. The kmeans( ) function returns an object that contains several bits of information. What are those bits of information?
-->
The output of kmeans() contains the following information (illustrated in the sketch below):
(i). The number of clusters (k) and the number of observations assigned to each cluster.
(ii). The mean value of each variable in each cluster (the cluster centers).
(iii). The vector of cluster memberships assigned by k-means to each observation.
(iv). The within-cluster sum of squares, one value per cluster.
(v). The ratio (between-cluster sum of squares) / (total sum of squares).
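A short sketch (my own) showing where each of these pieces lives in the returned object; ir3 is the fit from Exercise 11:

ir3$size                  # (i)   observations per cluster
ir3$centers               # (ii)  mean of each variable in each cluster
head(ir3$cluster)         # (iii) cluster assigned to each observation
ir3$withinss              # (iv)  within-cluster sum of squares, per cluster
ir3$betweenss / ir3$totss # (v)   between SS / total SS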
13. Explain cluster validation (cluster evaluation).
-->
A key issue with any clustering algorithm is cluster validation – That is, the question of “how to decide if an obtained solution is good or not?”
14. Even though cluster evaluation is not commonly used, what are the evaluation measure (or index) types used to judge various aspects of cluster validity?
-->
The evaluation measures applied to judge various aspects of cluster validity are:
(i). Unsupervised,
(ii). Supervised, and
(iii). Relative.
15. (a) What are unsupervised measures (internal indices)?
-->
Measures of the goodness of a clustering structure without respect to external information. The SSE is an example of this kind of measure. They divide into two classes (a small sketch relating them to kmeans() output follows this list):
(i). Measures of cluster cohesion (compactness, tightness), which determine how closely related the objects in a cluster are, and
(ii). Measures of cluster separation (isolation), which determine how distinct or well-separated a cluster is from other clusters.
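These two notions map directly onto quantities that kmeans() already reports, as in this small sketch (my own; the object name fit is mine):

fit <- kmeans(iris[, -5], centers = 3, nstart = 25)
fit$tot.withinss # cohesion: total within-cluster SS (smaller = tighter clusters)
fit$betweenss    # separation: between-cluster SS (larger = better-separated clusters)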
(b) What are supervised measures (external indices)?
-->
Measures the extent to which the clustering structure discovered by a clustering algorithm matches some external structure.
An example of a supervised index is entropy.
16. What is entropy?
-->
Entropy measures how well cluster labels match externally supplied class labels.
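A small worked sketch (my own, not from the text) of computing the entropy of a clustering against the Species labels; the helper name cluster_entropy is made up, and each cluster’s label entropy is weighted by the cluster’s size:

cluster_entropy <- function(clusters, classes) {
  tab <- table(clusters, classes)
  per_cluster <- apply(tab, 1, function(counts) {
    p <- counts[counts > 0] / sum(counts)    # class proportions within the cluster
    -sum(p * log2(p))                        # entropy of that cluster
  })
  sum(per_cluster * rowSums(tab) / sum(tab)) # size-weighted average over clusters
}
cluster_entropy(ir3$cluster, iris$Species)   # 0 would mean perfectly pure clusters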
17. Why are supervised measures also called external indices?
-->
Supervised measures are often called external indices because they use information that is external to the clustering process – for example, class labels that are not used when building the clusters.
18. What are relative measures?
-->
Relative measures compare different clusterings or clusters. Example: two k-means clusterings can be compared using either the SSE or entropy.
19. Look at the code in Torgo, P. 123. Explain the following arguments of the table( )
function:
(a) ir3$cluster
-->
The vector of cluster numbers assigned to each observation by kmeans().
(b) iris$Species
-->
The vector of true species labels in the iris dataset.
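The cross-tabulation itself looks like this (a sketch of the call on page 123): rows are the clusters found by kmeans(), columns are the true species, and each cell counts how many observations of a species fell in a cluster.

table(ir3$cluster, iris$Species)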
20. Run the code on Page 123 in Torgo’s text. Based on the output, state the contents of Cluster 2.
-->
Setosa = 0
Versicolor = 48
Virginica = 14
21. Based on the output in Exercise 20 above, which clusters do not contain pure plant
classes (observations)?
-->
Clusters 2 and 3.
22. Which measures deal with labels, supervised or unsupervised?
-->
Supervised
23. Fill-in-the-blank:
Internal validation metrics only use information available (during the clustering process).
-->
Internal validation metrics – only use information available during the clustering process.
24. What metrics evaluate the quality of cluster separation?
-->
Internal validation metrics, because they use the same data for modeling and for validation.
Examples of internal validation metrics: the SSE (cohesion), the between-cluster sum of squares (separation), and the silhouette coefficient (see Exercise 25).
25. The silhouette coefficient is an example of which kind of metric?
-->
Internal validation metrics.
26. In the statement
s <- silhouette(ir3$cluster, dist(iris[ , -5]))
explain the argument “iris[ , -5]” of the dist() function.
-->
“iris[ , -5]”: all columns of iris except the fifth (Species), i.e., the four numeric measurement variables used to build the pairwise distance matrix.
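A sketch (my own) of computing and summarizing the silhouette for the fit from Exercise 11; silhouette() comes from the cluster package:

library(cluster)
s <- silhouette(ir3$cluster, dist(iris[, -5]))
summary(s)  # average silhouette width per cluster and overall
# plot(s)   # optional: the classic silhouette plot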
27. The sum of square error (SSE) can be used to compare cluster performance only for a similar number of clusters. TRUE or FALSE
-->
TRUE
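A quick illustration (my own) of why: the total within-cluster SS almost always drops as k grows, so a smaller SSE at a larger k does not by itself indicate a better clustering.

sapply(2:5, function(k) kmeans(iris[, -5], centers = k, nstart = 25)$tot.withinss)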
**For Exercises 28 and 29, in GT Canvas, pull up and use the lecture, "AI4OPT - Lecture 3 - Data Engineering and Mining II - Clustering - Part 3 - Fall 2022.pptx"**
28. Study Approach A, the program located towards the end of the Lecture 3 packet. Then run the Iris dataset through the program. Keep in mind that you will have to tweak the program here and there. (Hint: After library(dataset), replace “dataset” with “Iris.” Also, replace “objects_names” with “species.”) Should your program run, publish it in RPubs and submit it via GA Canvas.
-->
###---- Iris data
rm(list=ls())
##----Code1:
library(cluster)
library(tidyverse)
str(iris )
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
row.names( iris )
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12"
## [13] "13" "14" "15" "16" "17" "18" "19" "20" "21" "22" "23" "24"
## [25] "25" "26" "27" "28" "29" "30" "31" "32" "33" "34" "35" "36"
## [37] "37" "38" "39" "40" "41" "42" "43" "44" "45" "46" "47" "48"
## [49] "49" "50" "51" "52" "53" "54" "55" "56" "57" "58" "59" "60"
## [61] "61" "62" "63" "64" "65" "66" "67" "68" "69" "70" "71" "72"
## [73] "73" "74" "75" "76" "77" "78" "79" "80" "81" "82" "83" "84"
## [85] "85" "86" "87" "88" "89" "90" "91" "92" "93" "94" "95" "96"
## [97] "97" "98" "99" "100" "101" "102" "103" "104" "105" "106" "107" "108"
## [109] "109" "110" "111" "112" "113" "114" "115" "116" "117" "118" "119" "120"
## [121] "121" "122" "123" "124" "125" "126" "127" "128" "129" "130" "131" "132"
## [133] "133" "134" "135" "136" "137" "138" "139" "140" "141" "142" "143" "144"
## [145] "145" "146" "147" "148" "149" "150"
dataset <- iris
## Data Preprocess
sum(!complete.cases(dataset) )
## [1] 0
summary(dataset)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
## Remove or impute missing objects
df <- na.omit( dataset )
## Rescale (or normalization, etc.)
df[, -5] <- scale(df[, -5], center = T, scale = T)
head(df)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 -0.8976739 1.01560199 -1.335752 -1.311052 setosa
## 2 -1.1392005 -0.13153881 -1.335752 -1.311052 setosa
## 3 -1.3807271 0.32731751 -1.392399 -1.311052 setosa
## 4 -1.5014904 0.09788935 -1.279104 -1.311052 setosa
## 5 -1.0184372 1.24503015 -1.335752 -1.311052 setosa
## 6 -0.5353840 1.93331463 -1.165809 -1.048667 setosa
##---Code2:
dataset <- df[, -5]
df <- dataset
summary(df)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :-1.86378 Min. :-2.4258 Min. :-1.5623 Min. :-1.4422
## 1st Qu.:-0.89767 1st Qu.:-0.5904 1st Qu.:-1.2225 1st Qu.:-1.1799
## Median :-0.05233 Median :-0.1315 Median : 0.3354 Median : 0.1321
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.67225 3rd Qu.: 0.5567 3rd Qu.: 0.7602 3rd Qu.: 0.7880
## Max. : 2.48370 Max. : 3.0805 Max. : 1.7799 Max. : 1.7064
## Standardization
apply(dataset, 2, sd)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 1 1 1
apply(dataset, 2, mean)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## -2.318423e-15 -1.684023e-15 -1.577997e-15 -8.829974e-16
apply(df, 2, sd)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 1 1 1
## Distance function and visualization
# library(factoextra)
# distance <- get_dist(df, stand = TRUE, method = "pearson")
# fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
## Code3:
## K means
km_output <- kmeans(df, centers = 2, nstart = 25, iter.max = 100, algorithm = "Hartigan-Wong")
str(km_output)
## List of 9
## $ cluster : Named int [1:150] 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
## $ centers : num [1:2, 1:4] -1.011 0.506 0.85 -0.425 -1.301 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "1" "2"
## .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## $ totss : num 596
## $ withinss : num [1:2] 47.4 173.5
## $ tot.withinss: num 221
## $ betweenss : num 375
## $ size : int [1:2] 50 100
## $ iter : int 1
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
names(km_output)
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
typeof(km_output)
## [1] "list"
length(km_output)
## [1] 9
km_output$cluster
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 141 142 143 144 145 146 147 148 149 150
## 2 2 2 2 2 2 2 2 2 2
## Cluster Validation Evaluation -
## Objective function: Sum of Square Error (SSE)
### SSE
## Code4:
#### Cluster cohesion
#### SSE can be used to compare cluster performance only for a similar number of clusters
km_output$totss
## [1] 596
km_output$withinss # within-cluster sum of squares, one value per cluster
## [1] 47.35062 173.52867
km_output$betweenss
## [1] 375.1207
sum(c(km_output$withinss, km_output$betweenss) )
## [1] 596
cohesion <- sum(km_output$withinss) / km_output$totss # share of total SS left within clusters (lower = tighter clusters)
cohesion
## [1] 0.3706028
### Visualize Clusters
library(factoextra)
fviz_cluster(km_output, data = df)

library(dplyr)
library(ggplot2)
29. Again, study Approach A. Then run the “USArrests” dataset through the program.
Again, you will have to tweak the program here and there. (Hint: After library(dataset),
replace “dataset” with “USArrests.” Replace “objects_names” with “state.”)
If the program runs correctly, publish the program in RPubs and submit a copy via
GA Canvas, or email it to me.
-->
## -- US Arrest dataset
rm(list=ls())
##----Code1:
library(cluster)
library(tidyverse)
str(USArrests )
## 'data.frame': 50 obs. of 4 variables:
## $ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
## $ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
## $ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
## $ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
row.names( USArrests )
## [1] "Alabama" "Alaska" "Arizona" "Arkansas"
## [5] "California" "Colorado" "Connecticut" "Delaware"
## [9] "Florida" "Georgia" "Hawaii" "Idaho"
## [13] "Illinois" "Indiana" "Iowa" "Kansas"
## [17] "Kentucky" "Louisiana" "Maine" "Maryland"
## [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
## [25] "Missouri" "Montana" "Nebraska" "Nevada"
## [29] "New Hampshire" "New Jersey" "New Mexico" "New York"
## [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
## [37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
## [41] "South Dakota" "Tennessee" "Texas" "Utah"
## [45] "Vermont" "Virginia" "Washington" "West Virginia"
## [49] "Wisconsin" "Wyoming"
dataset <- USArrests
## Data Preprocess
sum(!complete.cases(dataset) )
## [1] 0
summary(dataset)
## Murder Assault UrbanPop Rape
## Min. : 0.800 Min. : 45.0 Min. :32.00 Min. : 7.30
## 1st Qu.: 4.075 1st Qu.:109.0 1st Qu.:54.50 1st Qu.:15.07
## Median : 7.250 Median :159.0 Median :66.00 Median :20.10
## Mean : 7.788 Mean :170.8 Mean :65.54 Mean :21.23
## 3rd Qu.:11.250 3rd Qu.:249.0 3rd Qu.:77.75 3rd Qu.:26.18
## Max. :17.400 Max. :337.0 Max. :91.00 Max. :46.00
## Remove or impute missing objects
df <- na.omit( dataset )
## Rescale (or normalization, etc.)
df[, -5] <- scale(df[, -5], center = T, scale = T) # USArrests has only 4 columns, so the -5 index drops nothing here; all numeric columns get scaled
head(df)
## Murder Assault UrbanPop Rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
## Arizona 0.07163341 1.4788032 0.9989801 1.042878388
## Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144 1.7589234 2.067820292
## Colorado 0.02571456 0.3988593 0.8608085 1.864967207
##---Code2:
dataset <- df[, -5]
df <- dataset
summary(df)
## Murder Assault UrbanPop Rape
## Min. :-1.6044 Min. :-1.5090 Min. :-2.31714 Min. :-1.4874
## 1st Qu.:-0.8525 1st Qu.:-0.7411 1st Qu.:-0.76271 1st Qu.:-0.6574
## Median :-0.1235 Median :-0.1411 Median : 0.03178 Median :-0.1209
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.7949 3rd Qu.: 0.9388 3rd Qu.: 0.84354 3rd Qu.: 0.5277
## Max. : 2.2069 Max. : 1.9948 Max. : 1.75892 Max. : 2.6444
## Standardization
apply(dataset, 2, sd)
## Murder Assault UrbanPop Rape
## 1 1 1 1
apply(dataset, 2, mean)
## Murder Assault UrbanPop Rape
## 1.543210e-16 1.143530e-16 -3.996803e-16 8.526513e-16
apply(df, 2, sd)
## Murder Assault UrbanPop Rape
## 1 1 1 1
## Distance function and visualization
# library(factoextra)
# distance <- get_dist(df, stand = TRUE, method = "pearson")
# fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
## Code3:
## K means
km_output <- kmeans(df, centers = 2, nstart = 25, iter.max = 100, algorithm = "Hartigan-Wong")
str(km_output)
## List of 9
## $ cluster : Named int [1:50] 1 1 1 2 1 1 2 2 1 1 ...
## ..- attr(*, "names")= chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ centers : num [1:2, 1:4] 1.005 -0.67 1.014 -0.676 0.198 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "1" "2"
## .. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
## $ totss : num 196
## $ withinss : num [1:2] 46.7 56.1
## $ tot.withinss: num 103
## $ betweenss : num 93.1
## $ size : int [1:2] 20 30
## $ iter : int 1
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
names(km_output)
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
typeof(km_output)
## [1] "list"
length(km_output)
## [1] 9
km_output$cluster
## Alabama Alaska Arizona Arkansas California
## 1 1 1 2 1
## Colorado Connecticut Delaware Florida Georgia
## 1 2 2 1 1
## Hawaii Idaho Illinois Indiana Iowa
## 2 2 1 2 2
## Kansas Kentucky Louisiana Maine Maryland
## 2 2 1 2 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 1 2 1 1
## Montana Nebraska Nevada New Hampshire New Jersey
## 2 2 1 2 2
## New Mexico New York North Carolina North Dakota Ohio
## 1 1 1 2 2
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 2 2 1
## South Dakota Tennessee Texas Utah Vermont
## 2 1 1 2 2
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 2 2 2
## Cluster Validation Evaluation -
## Objective function: Sum of Square Error (SSE)
### SSE
## Code4:
#### Cluster cohesion
#### SSE can be used to compare cluster performance only for a similar number of clusters
km_output$totss
## [1] 196
km_output$withinss # within-cluster sum of squares, one value per cluster
## [1] 46.74796 56.11445
km_output$betweenss
## [1] 93.1376
sum(c(km_output$withinss, km_output$betweenss) )
## [1] 196
cohesion <- sum(km_output$withinss) / km_output$totss # share of total SS left within clusters (lower = tighter clusters)
cohesion
## [1] 0.5248082
### Visualize Clusters
library(factoextra)
fviz_cluster(km_output, data = df)

library(dplyr)
library(ggplot2)
## Code5:
# df %>%
# as.data.frame( df ) %>%
df %>% mutate(cluster = km_output$cluster, objects_name = row.names(dataset)) %>%
ggplot(aes(x = UrbanPop, y = Murder, color = factor(km_output$cluster), label = rownames(df) )) + geom_text( )

## Code6:
### Put Cluster Output on the Map(1)
cluster_df <- data.frame(objects_names = tolower(row.names(dataset)), cluster = unname(km_output$cluster))
head(cluster_df)
## objects_names cluster
## 1 alabama 1
## 2 alaska 1
## 3 arizona 1
## 4 arkansas 2
## 5 california 1
## 6 colorado 1
cluster_df <- cluster_df %>% rename(states = "objects_names")
library(maps)
#states <- map_data("state")
objects_names <- map_data("state")
objects_names %>%
left_join(cluster_df, by = c("region" = "states")) %>%
ggplot( ) +
geom_polygon(aes(x = long, y = lat, fill = as.factor(cluster)), color = "white") +
coord_fixed(1.3) +
guides(fill = F) +
theme_bw( ) +
theme(panel.grid.major = element_blank( ), panel.grid.minor = element_blank( ),
panel.border = element_blank( ),
axis.line = element_blank( ),
axis.text = element_blank( ),
axis.ticks = element_blank( ),
axis.title = element_blank( ))

## Code7:
### Elbow method to decide Optimal Number of Clusters(1)
set.seed(8)
wss <- function(k) {
return(kmeans(df, k, nstart = 25)$tot.withinss)
}
k_values <- 1:15
wss_values <- purrr::map_dbl(k_values, wss)
plot(x = k_values, y = wss_values,
type = "b", frame = F,
xlab = "Number of clusters K",
ylab = "Total within-clusters sum of square")

## Code8:
### Hierarchical Clustering
hac_output <- hclust( dist(dataset, method = "euclidean"), method = "complete")
plot(hac_output) # Calculating distance using hierarchical clustering, using Euclidean distance

# and using complete linkage for hierarchical clustering
### Output Desirable Number of Clusters after Modeling
hac_cut <- cutree(hac_output, 2)
for ( i in 1:length(hac_cut)) {
if( hac_cut[i] != km_output$cluster[i]) print(names(hac_cut) [i])
}
## [1] "Missouri"