# Clustering – Part 3 – Assignment 3
## AI4OPT Data Engineering and Mining II
## 1. The k-means clustering algorithm will identify a collection of k clusters using a heuristic search, starting
## with a selection of k clusters. TRUE or FALSE
## • TRUE
## 2. What does heuristic mean?
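## • In general, enabling a person to discover or learn something for themselves (hands-on, interactive, by discovery). For an algorithm, it means a practical search strategy that finds a good solution without guaranteeing the best one.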
## 3. What is the k-means approach?
## • a collection of k sets of mean values for each of the variables.
## 4. Fill-in-the-blank: The __means__ for the collection of cases that form one of __k clusters__ in
## any particular clustering are then the collection of __mean values__ for each of the input variables
## over the cases within the clustering.
## 5. The k-means clustering algorithm is a hierarchical method. TRUE or FALSE
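## • FALSE. k-means is a partitioning method: it produces a single flat split into k clusters rather than a nested hierarchy.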
## 6. What does the k-means clustering algorithm consist of?
## The algorithm consists of:
## • Initialize the centers of the k groups to a set of randomly chosen observations.
## • Repeat
## - Allocate each observation to the group whose center is nearest
## - Re-calculate the center of each group
## • until the groups are stable
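## The steps above can be sketched in a few lines of R. This is a minimal illustration under stated
## assumptions, not the implementation behind R's kmeans(): the function name simple_kmeans is
## hypothetical, distance is squared Euclidean, and k is assumed to be at least 2.

```r
# Illustrative sketch only; simple_kmeans is a hypothetical helper, not a library function.
simple_kmeans <- function(x, k, iter.max = 100) {
  x <- as.matrix(x)
  # Initialize the centers to k randomly chosen observations
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (i in seq_len(iter.max)) {
    # Squared Euclidean distance from every observation to every center (n x k matrix)
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    # Allocate each observation to the group whose center is nearest
    new_cluster <- max.col(-d, ties.method = "first")
    # Stop when the groups are stable
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Re-calculate the center of each non-empty group as the mean of its members
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(1)  # arbitrary seed, chosen only to make the run repeatable
fit <- simple_kmeans(iris[, -5], k = 3)
table(fit$cluster)
```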
## 7. What is data noise?
## • Noise refers to parts of the data that do not belong to any real group. When attempting to cluster,
## these parts can disturb the clustering on the remaining domain points.
## 8. When using the k-means algorithm, why isn’t it a good idea to use different starting points as cluster centers?
## • Using different starting points as cluster centers may lead the algorithm to converge to a different solution, so a single run is not guaranteed to give the same (or the best) clustering.
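## A small experiment can make this sensitivity visible. The sketch below assumes the built-in iris
## data and arbitrary seeds 1 to 20; it compares single random starts with a run that keeps the best
## of several starts through the nstart argument of kmeans().

```r
x <- iris[, -5]

# One random start per run: the total within-cluster sum of squares can vary with the seed
tot <- sapply(1:20, function(s) {
  set.seed(s)
  kmeans(x, centers = 3, nstart = 1)$tot.withinss
})
range(tot)

# nstart = 25 runs 25 random starts and keeps the solution with the lowest tot.withinss
set.seed(1)
best <- kmeans(x, centers = 3, nstart = 25)
best$tot.withinss
```

## If range(tot) shows two different values, the single-start runs converged to different solutions.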
## 9. The k-means algorithm results in obtaining cluster separation, thereby obtaining a stable and non-changing
## maximal clustering solution, where different starting points are used as centers. TRUE or FALSE
## • FALSE
## 10. What are the 3 species of plants in the Iris dataset?
## • Versicolor, Setosa, Virginica
## 11. In the chunk of code for k-means clustering applied to the Iris dataset on page 122 in Torgo’s text, explain
## the following arguments in the kmeans() function:
## • iris[ , -5] ans: all rows and all columns except column 5 (Species)
## • iter.max = 200 ans: the maximum number of iterations allowed is 200
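## A call in the style of the one discussed might look like the sketch below. The object name ir3
## follows its use later in this assignment; the seed value and centers = 3 are assumptions made here
## for repeatability, not values confirmed from the text.

```r
data(iris)
dim(iris[, -5])   # the four numeric columns, with Species (column 5) dropped

set.seed(1234)    # assumed seed, only so the run is repeatable
ir3 <- kmeans(iris[, -5], centers = 3, iter.max = 200)
ir3$iter          # iterations actually used; iter.max is only an upper bound
```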
## 12. The kmeans( ) function returns an object that contains bits of information. What are those bits of
## information? The kmeans( ) function's bits of information include:
## • cluster, centers, totss, withinss, tot.withinss, betweenss and size (plus iter and ifault).
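## The components can be listed directly from a fitted object. This is a minimal sketch on the iris
## data with an assumed seed; it also checks the identity totss = tot.withinss + betweenss.

```r
set.seed(1234)  # assumed seed
fit <- kmeans(iris[, -5], centers = 3, iter.max = 200)

names(fit)      # component names of the returned object
fit$size        # number of observations in each cluster

# Total sum of squares splits into within-cluster and between-cluster parts
all.equal(fit$totss, fit$tot.withinss + fit$betweenss)
```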
## 13. Explain cluster validation (cluster evaluation).
## • The procedure of evaluating the goodness of the results of a clustering algorithm. This is important
## to avoid finding patterns in random data.
## 14. Even though cluster evaluation is not commonly used, what are the evaluation measure (or index) types used to
## judge various aspects of cluster validity?
## • Unsupervised, Supervised, and Relative.
## 15. (a) What are unsupervised measures (internal indices)?
## • only use information available during the clustering process.
## (b) What are supervised measures (external indices)?
## • Requires the existence of information that was not available when obtaining the clustering solution, that
## can be used to compare against the structure obtained by the clustering algorithm.
## 16. What is entropy?
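## • A supervised measure of how well cluster labels match externally supplied class labels: it captures the degree to which each cluster consists of objects of a single class (lower is purer).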
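## As a hedged illustration, entropy can be computed from the cross-tabulation of clusters against the
## known species. The seed and the use of base-2 logarithms are assumptions of this sketch.

```r
set.seed(1234)  # assumed seed
fit <- kmeans(iris[, -5], centers = 3, iter.max = 200)

tab <- table(fit$cluster, iris$Species)  # rows: clusters, columns: true classes
p <- prop.table(tab, margin = 1)         # class proportions within each cluster

# Entropy of each cluster; zero proportions are skipped because 0 * log(0) is taken as 0
cluster_entropy <- apply(p, 1, function(r) -sum(r[r > 0] * log2(r[r > 0])))

# Overall entropy: cluster entropies weighted by cluster size
total_entropy <- sum(rowSums(tab) / sum(tab) * cluster_entropy)
cluster_entropy
total_entropy
```

## A cluster containing a single species has entropy 0; mixed clusters have larger values.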
## 17. Why are supervised measures also called external indices?
## • Supervised measures are often called external indices because they use information that is not present in the
## data set used for clustering, such as externally supplied class labels
## 18. What are relative measures?
## • Compares different clusterings or clusters. A relative cluster evaluation measure is a supervised or
## unsupervised evaluation measure that is used for the purpose of comparison.
## 19. Look at the code in Torgo, P. 123. Explain the following arguments of the table( ) function:
## • ir3$cluster: the cluster membership assigned to each observation by k-means
## • iris$Species: accesses the Species variable in the iris dataset (the known class labels)
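## A sketch of such a cross-tabulation is shown below. The seed is an assumption, and cluster numbers
## are arbitrary labels, so the row order may differ from the output in the text.

```r
set.seed(1234)  # assumed seed
ir3 <- kmeans(iris[, -5], centers = 3, iter.max = 200)

# Rows are cluster numbers, columns are the true species
tab <- table(ir3$cluster, iris$Species)
tab
```

## A row with a single non-zero count is a pure cluster; rows with several non-zero counts mix species.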
## 20. Run the code on Page 123 in Torgo’s text. Based on the output, state the contents of Cluster 2.
## Cluster 2 is 62 observations and has the following:
## • Sepal length = 5.9016
## • Sepal width = 2.7484
## • Petal length = 4.3935
## • Petal width = 1.4338
## 21. Based on the output in Exercise 20 above, which clusters do not contain pure plant classes (observations)?
## • Clusters 2 and 3
## 22. Which measures deal with labels, supervised or unsupervised?
## • Supervised. These measures need externally supplied class labels to compare against the clusters that were found.
## 23. Fill-in-the-blank: Internal validation metrics only use information available (during the clustering
## process).
## 24. What metrics evaluate the quality of cluster separation?
## • Internal validation metrics
## 25. The silhouette coefficient is an example of which kind of metric?
## • An internal (unsupervised) validation metric
## 26. In the statement
## s <- silhouette(ir3$cluster, dist(iris[ , -5]))
## explain the argument “iris[ , -5]” of the dist() function.
## • all rows and all columns except column 5
## • dist( ) returns the distance matrix of those columns. silhouette( ) comes from the package cluster,
## whose daisy( ) function is another way to compute a dissimilarity matrix
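## The statement can be run as in the sketch below, assuming the cluster package is installed (it is
## one of R's recommended packages) and using an assumed seed for the k-means step.

```r
library(cluster)  # provides silhouette() and daisy()

set.seed(1234)    # assumed seed
ir3 <- kmeans(iris[, -5], centers = 3, iter.max = 200)

# Silhouette widths from cluster memberships and pairwise Euclidean distances
s <- silhouette(ir3$cluster, dist(iris[, -5]))

# Average silhouette width: values closer to 1 indicate better-separated clusters
mean(s[, "sil_width"])
```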
## 27. The sum of square error (SSE) can be used to compare cluster performance only for a similar number of
## clusters. TRUE or FALSE
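## • TRUE. SSE tends to decrease as the number of clusters grows, so it is only a fair comparison between clusterings with a similar number of clusters.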
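## One way to see how SSE behaves as the number of clusters changes is to fit several values of k on
## the same data. The range 1 to 6, the seed and nstart = 25 are assumptions of this sketch.

```r
set.seed(1234)  # assumed seed

# tot.withinss is the SSE of a k-means solution
sse <- sapply(1:6, function(k) {
  kmeans(iris[, -5], centers = k, nstart = 25, iter.max = 200)$tot.withinss
})
round(sse, 2)
```

## Comparing the entries of sse shows how strongly the value depends on k itself.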
## 28. Study Approach A, the program located towards the end of the Lecture 3 packet. Then run the Iris dataset
## through the program. Keep in mind that you will have to tweak the program here and there. (Hint: After
## library(dataset), replace “dataset” with “Iris.” Also, replace “objects_names” with “species.”) Should your
## program run, publish it in RPubs and submit it via GA Canvas.
## • Not having success with this
## 29. Again, study Approach A. Then run the “USArrests” dataset through the program. Again, you will have to tweak
## the program here and there. (Hint: After library(dataset), replace “dataset” with “USArrests.” Replace
## “objects_names” with “state.”)
## If the program runs correctly, publish the program in RPubs and submit a copy via GA Canvas, or email it to me.
## • Not having success with this