1. The k-means clustering algorithm will identify a collection of k clusters using a heuristic search, starting with a selection of k clusters. TRUE or FALSE
-->
TRUE
2. What does heuristic mean?
-->
Heuristic – enabling someone to discover or learn something for themselves.
Or proceeding to a solution by trial and error or by rules that are only loosely defined.
3. What is the k-means approach?
-->
(i). Partition n observations into k clusters, where each observation belongs to the cluster with the nearest centroid (the mean of the cluster’s points).
(ii). Minimize the within-cluster error and maximize the separation (error) between clusters.
4. Fill-in-the-blank:
The __ means __ for the collection of cases that form one of the __ k clusters __ in any particular clustering are then the collection of __ mean values __ for each of the input variables over the cases within that cluster.
5. The k-means clustering algorithm is a hierarchical method. TRUE or FALSE
-->
FALSE
6. What does the k-means clustering algorithm consist of?
--> It consists of the following steps (a small from-scratch sketch of them follows this list):
(i). Initialize the centers of the k groups to a set of k randomly chosen observations.
(ii). Allocate each observation to the group whose center is nearest.
(iii). Update the centroid (mean) value of each cluster.
(iv). Repeat steps (ii) and (iii) until the groups are stable.
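Below is a minimal sketch of these four steps in R. It is my own illustration, not code from the text or the lecture: the function name kmeans_sketch is made up, and there is no guard for clusters that become empty.

kmeans_sketch <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  # (i) initialize the centers to k randomly chosen observations
  centers <- X[sample(nrow(X), k), , drop = FALSE]
  assign <- rep(0L, nrow(X))
  for (it in seq_len(max_iter)) {
    # (ii) allocate each observation to the group whose center is nearest
    d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
    new_assign <- max.col(-d)                # column index of the nearest center
    if (all(new_assign == assign)) break     # (iv) stop once the groups are stable
    assign <- new_assign
    # (iii) update the centroid (mean) of each cluster
    centers <- apply(X, 2, function(col) tapply(col, assign, mean))
  }
  list(cluster = assign, centers = centers)
}
set.seed(1)
res <- kmeans_sketch(iris[, -5], k = 3)
table(res$cluster, iris$Species)             # compare the found groups with the species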
7. What is data noise?
-->
The definition of noise in data varies by field of study. One common definition: noise points are observations that do not have enough other observations within their radius and are not sufficiently close to any core point (the density-based notion used by algorithms such as DBSCAN).
The algorithm starts by setting the noise points aside in a separate cluster, since these cases are so different from the rest that it does not make sense to use them when forming the clusters.
8. When using the k-means algorithm, why isn’t it a good idea to use different starting points as cluster centers?
-->
The k-means optimization finds a local minimum (it minimizes the sum of squared distances), and which minimum it finds depends on the starting points. Different starting values can lead the algorithm to different, possibly completely different, clusterings.
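A small illustration of this sensitivity (my own code, not from the text; the object names run1, run2, and km_best are my own): two single-start kmeans() runs from different random seeds can land in different local minima, which is exactly why kmeans() offers the nstart argument.

set.seed(2)
run1 <- kmeans(iris[, -5], centers = 3, nstart = 1)
set.seed(7)
run2 <- kmeans(iris[, -5], centers = 3, nstart = 1)
c(run1 = run1$tot.withinss, run2 = run2$tot.withinss)   # the two SSE values may differ
table(run1$cluster, run2$cluster)                        # memberships may differ too
km_best <- kmeans(iris[, -5], centers = 3, nstart = 25)  # 25 random starts, best fit kept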
9. The k-means algorithm results in cluster separation that yields a stable, non-changing, maximal clustering solution even when different starting points are
used as centers. TRUE or FALSE
-->
FALSE
10. What are the 3 species of plants in the Iris dataset?
-->
The three Iris species: setosa, versicolor, and virginica.
11. In the chunk of code for k-means clustering applied to the Iris dataset on page 122 in Torgo’s text, explain the following arguments in the kmeans() function:
(a) Iris[ , -5]
--> The iris data frame with the 5th column (Species) dropped, i.e., only the four numeric measurement variables.
(b) iter.max = 200
-->
The maximum number of iterations allowed is 200
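For reference, a call along the lines of the one in Torgo’s text (the exact arguments on page 122 may differ slightly; this is only a sketch):

ir3 <- kmeans(iris[, -5], centers = 3, iter.max = 200)
ir3$size   # number of observations that ended up in each of the 3 clusters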
12. The kmeans( ) function returns an object that contains several bits of information. What are those bits of information?
-->
The output of kmeans() contains the following information (illustrated in the sketch below):
(i). The number of clusters (k) and the number of observations assigned to each cluster.
(ii). The mean value of each variable in each cluster (the cluster centers).
(iii). The vector of cluster memberships assigned by k-means to each observation.
(iv). The within-cluster sum of squares, one value per cluster.
(v). The ratio (between-cluster sum of squares) / (total sum of squares).
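A short sketch (my own) showing where each of these pieces lives in the returned object; ir3 is the fit from Exercise 11:

ir3$size                  # (i)   observations per cluster
ir3$centers               # (ii)  mean of each variable in each cluster
head(ir3$cluster)         # (iii) cluster assigned to each observation
ir3$withinss              # (iv)  within-cluster sum of squares, per cluster
ir3$betweenss / ir3$totss # (v)   between SS / total SS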
13. Explain cluster validation (cluster evaluation).
-->
A key issue with any clustering algorithm is cluster validation – That is, the question of “how to decide if an obtained solution is good or not?”
14. Even though cluster evaluation is not commonly used, what are the evaluation measure (or index) types used to judge various aspects of cluster validity?
-->
The evaluation measures applied to judge various aspects of cluster validity are:
(i). Unsupervised,
(ii). Supervised, and
(iii). Relative.
15. (a) What are unsupervised measures (internal indices)?
-->
Measures of the goodness of a clustering structure without respect to external information. The SSE is an example of this kind of measure. They divide into two classes (a small sketch relating them to kmeans() output follows this list):
(i). Measures of cluster cohesion (compactness, tightness), which determine how closely related the objects in a cluster are, and
(ii). Measures of cluster separation (isolation), which determine how distinct or well-separated a cluster is from other clusters.
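These two notions map directly onto quantities that kmeans() already reports, as in this small sketch (my own; the object name fit is mine):

fit <- kmeans(iris[, -5], centers = 3, nstart = 25)
fit$tot.withinss # cohesion: total within-cluster SS (smaller = tighter clusters)
fit$betweenss    # separation: between-cluster SS (larger = better-separated clusters)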
(b) What are supervised measures (external indices)?
-->
Measures the extent to which the clustering structure discovered by a clustering algorithm matches some external structure.
An example of a supervised index is entropy.
16. What is entropy?
-->
Entropy measures how well cluster labels match externally supplied class labels.
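A small worked sketch (my own, not from the text) of computing the entropy of a clustering against the Species labels; the helper name cluster_entropy is made up, and each cluster’s label entropy is weighted by the cluster’s size:

cluster_entropy <- function(clusters, classes) {
  tab <- table(clusters, classes)
  per_cluster <- apply(tab, 1, function(counts) {
    p <- counts[counts > 0] / sum(counts)    # class proportions within the cluster
    -sum(p * log2(p))                        # entropy of that cluster
  })
  sum(per_cluster * rowSums(tab) / sum(tab)) # size-weighted average over clusters
}
cluster_entropy(ir3$cluster, iris$Species)   # 0 would mean perfectly pure clusters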
17. Why are supervised measures also called external indices?
-->
Supervised measures are often called external indices because they use information that is external to the clustering process – for example, class labels that are not used when building the clusters.
18. What are relative measures?
-->
Relative measures compare different clusterings or clusters. Example: two k-means clusterings can be compared using either the SSE or entropy.
19. Look at the code in Torgo, P. 123. Explain the following arguments of the table( )
function:
(a) ir3$cluster
-->
The vector of cluster numbers assigned to each observation by kmeans().
(b) iris$Species
-->
The vector of true species labels in the iris dataset.
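The cross-tabulation itself looks like this (a sketch of the call on page 123): rows are the clusters found by kmeans(), columns are the true species, and each cell counts how many observations of a species fell in a cluster.

table(ir3$cluster, iris$Species)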
20. Run the code on Page 123 in Torgo’s text. Based on the output, state the contents of Cluster 2.
-->
Setosa = 0
Versicolor = 48
Virginica = 14
21. Based on the output in Exercise 20 above, which clusters do not contain pure plant
classes (observations)?
-->
Clusters 2 and 3.
22. Which measures deal with labels, supervised or unsupervised?
-->
Supervised
23. Fill-in-the-blank:
Internal validation metrics only use information available (during the clustering process).
-->
Internal validation metrics – only use information available during the clustering process.
24. What metrics evaluate the quality of cluster separation?
-->
Internal validation metrics, because they use the same data for modeling and for validation.
Examples of internal validation metrics: the SSE (cohesion), the between-cluster sum of squares (separation), and the silhouette coefficient (see Exercise 25).
25. The silhouette coefficient is an example of which kind of metric?
-->
Internal validation metrics.
26. In the statement
s <- silhouette(ir3$cluster, dist(iris[ , -5]))
explain the argument “iris[ , -5]” of the dist() function.
-->
“iris[ , -5]”: all columns of iris except the fifth (Species), i.e., the four numeric measurement variables used to build the pairwise distance matrix.
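A sketch (my own) of computing and summarizing the silhouette for the fit from Exercise 11; silhouette() comes from the cluster package:

library(cluster)
s <- silhouette(ir3$cluster, dist(iris[, -5]))
summary(s)  # average silhouette width per cluster and overall
# plot(s)   # optional: the classic silhouette plot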
27. The sum of square error (SSE) can be used to compare cluster performance only for a similar number of clusters. TRUE or FALSE
-->
TRUE
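A quick illustration (my own) of why: the total within-cluster SS almost always drops as k grows, so a smaller SSE at a larger k does not by itself indicate a better clustering.

sapply(2:5, function(k) kmeans(iris[, -5], centers = k, nstart = 25)$tot.withinss)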
**For Exercises 28 and 29, in GT Canvas, pull up and use the lecture, "AI4OPT - Lecture 3 - Data Engineering and Mining II - Clustering - Part 3 - Fall 2022.pptx"**
28. Study Approach A, the program located towards the end of the Lecture 3 packet. Then run the Iris dataset through the program. Keep in mind that you will have to tweak the program here and there. (Hint: After library(dataset), replace “dataset” with “Iris.” Also, replace “objects_names” with “species.”) Should your program run, publish it in RPubs and submit it via GA Canvas.
-->
###---- Iris data
rm(list=ls())
##----Code1:
library(cluster)
library(tidyverse)
str(iris )
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
row.names( iris )
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12"
## [13] "13" "14" "15" "16" "17" "18" "19" "20" "21" "22" "23" "24"
## [25] "25" "26" "27" "28" "29" "30" "31" "32" "33" "34" "35" "36"
## [37] "37" "38" "39" "40" "41" "42" "43" "44" "45" "46" "47" "48"
## [49] "49" "50" "51" "52" "53" "54" "55" "56" "57" "58" "59" "60"
## [61] "61" "62" "63" "64" "65" "66" "67" "68" "69" "70" "71" "72"
## [73] "73" "74" "75" "76" "77" "78" "79" "80" "81" "82" "83" "84"
## [85] "85" "86" "87" "88" "89" "90" "91" "92" "93" "94" "95" "96"
## [97] "97" "98" "99" "100" "101" "102" "103" "104" "105" "106" "107" "108"
## [109] "109" "110" "111" "112" "113" "114" "115" "116" "117" "118" "119" "120"
## [121] "121" "122" "123" "124" "125" "126" "127" "128" "129" "130" "131" "132"
## [133] "133" "134" "135" "136" "137" "138" "139" "140" "141" "142" "143" "144"
## [145] "145" "146" "147" "148" "149" "150"
dataset <- iris
## Data Preprocess
sum(!complete.cases(dataset) )
## [1] 0
summary(dataset)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
## Remove or impute missing objects
df <- na.omit( dataset )
## Rescale (or normalization, etc.)
df[, -5] <- scale(df[, -5], center = T, scale = T)
head(df)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 -0.8976739 1.01560199 -1.335752 -1.311052 setosa
## 2 -1.1392005 -0.13153881 -1.335752 -1.311052 setosa
## 3 -1.3807271 0.32731751 -1.392399 -1.311052 setosa
## 4 -1.5014904 0.09788935 -1.279104 -1.311052 setosa
## 5 -1.0184372 1.24503015 -1.335752 -1.311052 setosa
## 6 -0.5353840 1.93331463 -1.165809 -1.048667 setosa
##---Code2:
dataset <- df[, -5]
df <- dataset
summary(df)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :-1.86378 Min. :-2.4258 Min. :-1.5623 Min. :-1.4422
## 1st Qu.:-0.89767 1st Qu.:-0.5904 1st Qu.:-1.2225 1st Qu.:-1.1799
## Median :-0.05233 Median :-0.1315 Median : 0.3354 Median : 0.1321
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.67225 3rd Qu.: 0.5567 3rd Qu.: 0.7602 3rd Qu.: 0.7880
## Max. : 2.48370 Max. : 3.0805 Max. : 1.7799 Max. : 1.7064
## Standardization
apply(dataset, 2, sd)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 1 1 1
apply(dataset, 2, mean)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## -2.318423e-15 -1.684023e-15 -1.577997e-15 -8.829974e-16
apply(df, 2, sd)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 1 1 1
## Distance function and visualization
# library(factoextra)
# distance <- get_dist(df, stand = TRUE, method = "pearson")
# fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
## Code3:
## K means
km_output <- kmeans(df, centers = 2, nstart = 25, iter.max = 100, algorithm = "Hartigan-Wong")
str(km_output)
## List of 9
## $ cluster : Named int [1:150] 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
## $ centers : num [1:2, 1:4] -1.011 0.506 0.85 -0.425 -1.301 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "1" "2"
## .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## $ totss : num 596
## $ withinss : num [1:2] 47.4 173.5
## $ tot.withinss: num 221
## $ betweenss : num 375
## $ size : int [1:2] 50 100
## $ iter : int 1
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
names(km_output)
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
typeof(km_output)
## [1] "list"
length(km_output)
## [1] 9
km_output$cluster
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 141 142 143 144 145 146 147 148 149 150
## 2 2 2 2 2 2 2 2 2 2
## Cluster Validation Evaluation -
## Objective function: Sum of Square Error (SSE)
### SSE
## Code4:
#### Cluster cohesion
#### SSE can be used to compare cluster performance only for a similar number of clusters
km_output$totss
## [1] 596
km_output$withinss # within-cluster sum of squares, one value per cluster
## [1] 47.35062 173.52867
km_output$betweenss
## [1] 375.1207
sum(c(km_output$withinss, km_output$betweenss) )
## [1] 596
cohesion <- sum(km_output$withinss) / km_output$totss # share of total SS left within clusters (lower = tighter clusters)
cohesion
## [1] 0.3706028
### Visualize Clusters
library(factoextra)
fviz_cluster(km_output, data = df)

library(dplyr)
library(ggplot2)
29. Again, study Approach A. Then run the “USArrests” dataset through the program.
Again, you will have to tweak the program here and there. (Hint: After library(dataset),
replace “dataset” with “USArrests.” Replace “objects_names” with “state.”)
If the program runs correctly, publish the program in RPubs and submit a copy via
GA Canvas, or email it to me.
-->
## -- US Arrest dataset
rm(list=ls())
##----Code1:
library(cluster)
library(tidyverse)
str(USArrests )
## 'data.frame': 50 obs. of 4 variables:
## $ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
## $ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
## $ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
## $ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
row.names( USArrests )
## [1] "Alabama" "Alaska" "Arizona" "Arkansas"
## [5] "California" "Colorado" "Connecticut" "Delaware"
## [9] "Florida" "Georgia" "Hawaii" "Idaho"
## [13] "Illinois" "Indiana" "Iowa" "Kansas"
## [17] "Kentucky" "Louisiana" "Maine" "Maryland"
## [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
## [25] "Missouri" "Montana" "Nebraska" "Nevada"
## [29] "New Hampshire" "New Jersey" "New Mexico" "New York"
## [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
## [37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
## [41] "South Dakota" "Tennessee" "Texas" "Utah"
## [45] "Vermont" "Virginia" "Washington" "West Virginia"
## [49] "Wisconsin" "Wyoming"
dataset <- USArrests
## Data Preprocess
sum(!complete.cases(dataset) )
## [1] 0
summary(dataset)
## Murder Assault UrbanPop Rape
## Min. : 0.800 Min. : 45.0 Min. :32.00 Min. : 7.30
## 1st Qu.: 4.075 1st Qu.:109.0 1st Qu.:54.50 1st Qu.:15.07
## Median : 7.250 Median :159.0 Median :66.00 Median :20.10
## Mean : 7.788 Mean :170.8 Mean :65.54 Mean :21.23
## 3rd Qu.:11.250 3rd Qu.:249.0 3rd Qu.:77.75 3rd Qu.:26.18
## Max. :17.400 Max. :337.0 Max. :91.00 Max. :46.00
## Remove or impute missing objects
df <- na.omit( dataset )
## Rescale (or normalization, etc.)
df[, -5] <- scale(df[, -5], center = T, scale = T) # USArrests has only 4 columns, so the -5 index drops nothing here; all numeric columns get scaled
head(df)
## Murder Assault UrbanPop Rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
## Arizona 0.07163341 1.4788032 0.9989801 1.042878388
## Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144 1.7589234 2.067820292
## Colorado 0.02571456 0.3988593 0.8608085 1.864967207
##---Code2:
dataset <- df[, -5]
df <- dataset
summary(df)
## Murder Assault UrbanPop Rape
## Min. :-1.6044 Min. :-1.5090 Min. :-2.31714 Min. :-1.4874
## 1st Qu.:-0.8525 1st Qu.:-0.7411 1st Qu.:-0.76271 1st Qu.:-0.6574
## Median :-0.1235 Median :-0.1411 Median : 0.03178 Median :-0.1209
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.7949 3rd Qu.: 0.9388 3rd Qu.: 0.84354 3rd Qu.: 0.5277
## Max. : 2.2069 Max. : 1.9948 Max. : 1.75892 Max. : 2.6444
## Standardization
apply(dataset, 2, sd)
## Murder Assault UrbanPop Rape
## 1 1 1 1
apply(dataset, 2, mean)
## Murder Assault UrbanPop Rape
## 1.543210e-16 1.143530e-16 -3.996803e-16 8.526513e-16
apply(df, 2, sd)
## Murder Assault UrbanPop Rape
## 1 1 1 1
## Distance function and visualization
# library(factoextra)
# distance <- get_dist(df, stand = TRUE, method = "pearson")
# fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
## Code3:
## K means
km_output <- kmeans(df, centers = 2, nstart = 25, iter.max = 100, algorithm = "Hartigan-Wong")
str(km_output)
## List of 9
## $ cluster : Named int [1:50] 1 1 1 2 1 1 2 2 1 1 ...
## ..- attr(*, "names")= chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ centers : num [1:2, 1:4] 1.005 -0.67 1.014 -0.676 0.198 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "1" "2"
## .. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
## $ totss : num 196
## $ withinss : num [1:2] 46.7 56.1
## $ tot.withinss: num 103
## $ betweenss : num 93.1
## $ size : int [1:2] 20 30
## $ iter : int 1
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
names(km_output)
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
typeof(km_output)
## [1] "list"
length(km_output)
## [1] 9
km_output$cluster
## Alabama Alaska Arizona Arkansas California
## 1 1 1 2 1
## Colorado Connecticut Delaware Florida Georgia
## 1 2 2 1 1
## Hawaii Idaho Illinois Indiana Iowa
## 2 2 1 2 2
## Kansas Kentucky Louisiana Maine Maryland
## 2 2 1 2 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 1 2 1 1
## Montana Nebraska Nevada New Hampshire New Jersey
## 2 2 1 2 2
## New Mexico New York North Carolina North Dakota Ohio
## 1 1 1 2 2
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 2 2 1
## South Dakota Tennessee Texas Utah Vermont
## 2 1 1 2 2
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 2 2 2
## Cluster Validation Evaluation -
## Objective function: Sum of Square Error (SSE)
### SSE
## Code4:
#### Cluster cohesion
#### SSE can be used to compare cluster performance only for a similar number of clusters
km_output$totss
## [1] 196
km_output$withinss # within-cluster sum of squares, one value per cluster
## [1] 46.74796 56.11445
km_output$betweenss
## [1] 93.1376
sum(c(km_output$withinss, km_output$betweenss) )
## [1] 196
cohesion <- sum(km_output$withinss) / km_output$totss # share of total SS left within clusters (lower = tighter clusters)
cohesion
## [1] 0.5248082
### Visualize Clusters
library(factoextra)
fviz_cluster(km_output, data = df)

library(dplyr)
library(ggplot2)
## Code5:
# df %>%
# as.data.frame( df ) %>%
df %>% mutate(cluster = km_output$cluster, objects_name = row.names(dataset)) %>%
ggplot(aes(x = UrbanPop, y = Murder, color = factor(km_output$cluster), label = rownames(df) )) + geom_text( )

## Code6:
### Put Cluster Output on the Map(1)
cluster_df <- data.frame(objects_names = tolower(row.names(dataset)), cluster = unname(km_output$cluster))
head(cluster_df)
## objects_names cluster
## 1 alabama 1
## 2 alaska 1
## 3 arizona 1
## 4 arkansas 2
## 5 california 1
## 6 colorado 1
cluster_df <- cluster_df %>% rename(states = "objects_names")
library(maps)
#states <- map_data("state")
objects_names <- map_data("state")
objects_names %>%
left_join(cluster_df, by = c("region" = "states")) %>%
ggplot( ) +
geom_polygon(aes(x = long, y = lat, fill = as.factor(cluster)), color = "white") +
coord_fixed(1.3) +
guides(fill = F) +
theme_bw( ) +
theme(panel.grid.major = element_blank( ), panel.grid.minor = element_blank( ),
panel.border = element_blank( ),
axis.line = element_blank( ),
axis.text = element_blank( ),
axis.ticks = element_blank( ),
axis.title = element_blank( ))

## Code7:
### Elbow method to decide Optimal Number of Clusters(1)
set.seed(8)
wss <- function(k) {
return(kmeans(df, k, nstart = 25)$tot.withinss)
}
k_values <- 1:15
wss_values <- purrr::map_dbl(k_values, wss)
plot(x = k_values, y = wss_values,
type = "b", frame = F,
xlab = "Number of clusters K",
ylab = "Total within-clusters sum of square")

## Code8:
### Hierarchical Clustering
hac_output <- hclust( dist(dataset, method = "euclidean"), method = "complete")
plot(hac_output) # Calculating distance using hierarchical clustering, using Euclidean distance

# and using complete linkage for hierarchical clustering
### Output Desirable Number of Clusters after Modeling
hac_cut <- cutree(hac_output, 2)
for ( i in 1:length(hac_cut)) {
if( hac_cut[i] != km_output$cluster[i]) print(names(hac_cut) [i])
}
## [1] "Missouri"