# Clustering – Part 3 – Assignment 3 AI4OPT Name: _____________________
# Data Engineering and Mining II Date: August 8, 2025
#
# 1. The k-means clustering algorithm will identify a collection of k clusters using a heuristic
# search, starting with a selection of k clusters. TRUE or FALSE? Ans: TRUE
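# 2. What does heuristic mean? Ans: Enabling a person to discover or learn something for themselves
# (hands-on, interactive discovery). For an algorithm, it means a practical search strategy that
# finds a good solution quickly without guaranteeing the best possible one.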
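# 3. What is the k-means approach? Ans: It partitions the observations into k clusters, each
# represented by the mean values of the input variables over its cases. A clustering is therefore
# a collection of k sets of mean values, one set per cluster.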
# 4. Fill-in-the-blank: The __means__ for the collection of cases that form one of the
# __k clusters__ in any particular clustering are then the collection of __mean values__
# for each of the input variables over the cases within the clustering.
# 5. The k-means clustering algorithm is a hierarchical method. TRUE or FALSE? Ans: FALSE
# 6. What does the k-means clustering algorithm consist of?
# • Initialize the centers of the k groups to a set of randomly chosen observations
# • Repeat:
#   – Allocate each observation to the group whose center is nearest
#   – Re-calculate the center of each group
# • Until the groups are stable
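#
# The steps listed in Exercise 6 can be sketched in a few lines of R. This is a minimal
# illustration only, not the algorithm used by the built-in kmeans() function.
# simple_kmeans is a hypothetical helper name; it assumes numeric input and k >= 2.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Initialize the centers to k randomly chosen observations
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (i in seq_len(max_iter)) {
    # Squared Euclidean distance from every observation to every center (n x k matrix)
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    # Allocate each observation to the group whose center is nearest
    new_cluster <- max.col(-d, ties.method = "first")
    # Stop once the groups are stable (no allocation changed)
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Re-calculate the center of each non-empty group as the mean of its members
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(1)  # assumed seed, only so that the random start is repeatable
res <- simple_kmeans(iris[, -5], k = 3)
table(res$cluster)  # number of observations allocated to each group
#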
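# 7. What is data noise? Ans: Random error or irrelevant variation in the data, such as an anomalous
# observation, that contains no useful information and can obscure the underlying structure.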
# 8. When using the k-means algorithm, why isn’t it a good idea to use different starting
# points as cluster centers? Ans: k-means only converges to a local optimum, so different starting
# centers may lead the algorithm to converge to different solutions.
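#
# A small sketch of the point made in Exercise 8, assuming the built-in iris data and three
# clusters. The seeds are arbitrary assumptions. Two single random starts may or may not reach
# the same solution; nstart asks kmeans() to try several random starts and keep the best one.
set.seed(1)
fit_a <- kmeans(iris[, -5], centers = 3, nstart = 1)
set.seed(2)
fit_b <- kmeans(iris[, -5], centers = 3, nstart = 1)
# Total within-cluster sum of squares of the two runs (lower is better)
c(fit_a$tot.withinss, fit_b$tot.withinss)

# 25 random starts; the run with the lowest total within-cluster sum of squares is returned
set.seed(1)
fit_best <- kmeans(iris[, -5], centers = 3, nstart = 25)
fit_best$tot.withinss
#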
# 9. The k-means algorithm results in obtaining cluster separation, thereby obtaining a
# stable and non-changing maximal clustering solution, where different starting points are
# used as centers. TRUE or FALSE? Ans: FALSE
# 10. What are the 3 species of plants in the Iris dataset? Setosa, Versicolor and Virginica
# 11. In the chunk of code for k-means clustering applied to the Iris dataset on page 122 in
# Torgo’s text, explain the following arguments in the kmeans() function:
# (a) iris[ , -5] ans: All columns of iris except the 5th (Species), so only the four numeric
#     measurements are clustered
# (b) iter.max = 200 ans: The maximum number of iterations allowed
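#
# A short sketch of the two arguments from Exercise 11. R is case-sensitive, so the names are
# written as iris and iter.max. The value centers = 3 and the seed are assumptions made for
# this illustration.
names(iris)[5]     # name of the column that iris[, -5] leaves out
x <- iris[, -5]    # the four numeric measurements only
set.seed(1234)     # assumed seed, for reproducibility only
ir3 <- kmeans(x, centers = 3, iter.max = 200)  # stop after at most 200 iterations
ir3$iter           # number of iterations actually used
#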
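# 12. The kmeans() function returns an object that contains bits of information. What are
# those bits of information? Ans: The number and sizes of the clusters, the cluster means by
# variable, the clustering vector (the cluster assigned to each observation), and the
# within-cluster sums of squares.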
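#
# The pieces of information from Exercise 12 can be inspected directly on the fitted object.
# A minimal sketch, assuming three clusters on the iris measurements; the seed is an assumption.
set.seed(1234)
ir3 <- kmeans(iris[, -5], centers = 3, iter.max = 200)
names(ir3)         # all components stored in the object
ir3$size           # number of observations in each cluster
ir3$centers        # cluster means by variable
head(ir3$cluster)  # clustering vector: cluster assigned to each observation
ir3$withinss       # within-cluster sum of squares, one value per cluster
#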
# 13. Explain cluster validation (cluster evaluation). Ans: Assessing the quality of a clustering
# result. It can be external (comparing the clusters with known outputs such as class labels) or
# internal (using only information available during the clustering process).
# 14. Even though cluster evaluation is not commonly used, what are the evaluation measure
# (or index) types used to judge various aspects of cluster validity? Ans: Unsupervised (internal),
# supervised (external), and relative measures.
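# 15. (a) What are unsupervised measures (internal indices)? Ans: Measures that judge a clustering
#     using only the data itself, with no ground truth, e.g., SSE or the silhouette coefficient.
# (b) What are supervised measures (external indices)? Ans: Measures that compare a clustering
#     against externally supplied ground truth (class labels), e.g., entropy.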
# 16. What is entropy? Ans: A supervised measure of cluster impurity: the degree to which each
# cluster contains objects of more than one class. A cluster made up of a single class has
# entropy 0.
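#
# In cluster validation, entropy is commonly computed per cluster from the class proportions
# p within that cluster as -sum(p * log2(p)), then averaged with weights equal to the cluster
# sizes. A minimal sketch under that assumption, using iris$Species as the class labels;
# the seed and nstart value are arbitrary choices.
set.seed(1)
fit <- kmeans(iris[, -5], centers = 3, nstart = 25)
tab <- table(fit$cluster, iris$Species)  # rows: clusters, columns: classes

# Entropy of each cluster; classes with zero count are skipped to avoid log2(0)
cluster_entropy <- apply(tab, 1, function(n) {
  p <- n[n > 0] / sum(n)
  -sum(p * log2(p))
})
cluster_entropy

# Overall entropy: cluster entropies weighted by cluster size
total_entropy <- sum(rowSums(tab) / sum(tab) * cluster_entropy)
total_entropy
#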
# 17. Why are supervised measures also called external indices? Ans: Because they rely on
# information, such as class labels, that is external to the clustering process.
# 18. What are relative measures? Ans: Measures that compare the scores of different clusterings
# (for example, from different runs or settings) in order to choose the "best" one.
# 19. Look at the code in Torgo, P. 123. Explain the following arguments of the table( )
# function:
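# (a) ir3$cluster. Ans: The cluster assigned to each observation; these values form the rows
#     of the table
# (b) iris$Species. Ans: The true species of each observation; these values label the columns
#     of the table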
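#
# A sketch of the cross-tabulation discussed in Exercise 19. The kmeans() settings and the seed
# are assumptions; cluster numbers are arbitrary labels and can differ from run to run.
set.seed(1234)
ir3 <- kmeans(iris[, -5], centers = 3, iter.max = 200)
# First argument gives the rows (clusters), second gives the columns (species)
tab <- table(ir3$cluster, iris$Species)
tab
#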
# 20. Run the code on Page 123 in Torgo’s text. Based on the output, state the contents of
# Cluster 2. Ans: Cluster 2 contains 62 observations (48 versicolor and 14 virginica).
# Cluster means:
#   Sepal.Length Sepal.Width Petal.Length Petal.Width
# 2     5.901613    2.748387     4.393548    1.433871
# 21. Based on the output in Exercise 20 above, which clusters do not contain pure plant
# classes (observations)? Clusters #2 and #3
# 22. Which measures deal with labels, supervised or unsupervised? Supervised
# 23. Fill-in-the-blank: Internal validation metrics only use information available (during the
# clustering process).
# 24. What metrics evaluate the quality of cluster separation? Ans: The silhouette coefficient
# 25. The silhouette coefficient is an example of which kind of metric? Internal Validation
# 26. In the statement
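# s <- silhouette(ir3$cluster, dist(iris[ , -5]))
# explain the argument “iris[ , -5]” of the dist() function. Ans: It is the iris dataset without
# its 5th column (Species), so dist() computes the pairwise distances between observations from
# the four numeric variables only.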
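#
# A runnable sketch of the silhouette statement from Exercise 26, assuming the cluster package
# is installed. The assignment operator in R is <- (writing <-- would assign the negated value).
# The kmeans() settings and the seed are assumptions.
library(cluster)
set.seed(1234)
ir3 <- kmeans(iris[, -5], centers = 3, iter.max = 200)
# dist() builds the pairwise distance matrix from the four numeric columns
s <- silhouette(ir3$cluster, dist(iris[, -5]))
mean(s[, "sil_width"])      # average silhouette width over all observations
summary(s)$clus.avg.widths  # average silhouette width per cluster
#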
#
# 27. The sum of squared errors (SSE) can be used to compare cluster performance only for a
# similar number of clusters. TRUE or FALSE? Ans: TRUE
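#
# The reason behind Exercise 27 can be seen by computing the SSE for several values of k:
# the SSE tends to shrink as k grows, so a lower SSE for a larger k does not by itself mean a
# better clustering. A minimal sketch; the range of k, nstart, iter.max and the seed are
# assumptions.
set.seed(1)
ks <- 2:6
sse <- sapply(ks, function(k) {
  kmeans(iris[, -5], centers = k, nstart = 25, iter.max = 50)$tot.withinss
})
names(sse) <- ks
sse  # total within-cluster sum of squares for each k
#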
#
# 28. Study Approach A, the program located towards the end of the Lecture 3 packet. Then
# run the Iris dataset through the program. Keep in mind that you will have to tweak the
# program here and there. (Hint: After library(dataset), replace “dataset” with “Iris.”
# Also, replace “objects_names” with “species.”) Should your program run, publish it in
# RPubs and submit it via GA Canvas.
#
# 29. Again, study Approach A. Then run the “USArrests” dataset through the program.
# Again, you will have to tweak the program here and there. (Hint: After library(dataset),
# replace “dataset” with “USArrests.” Replace “objects_names” with “state.”)
# If the program runs correctly, publish the program in RPubs and submit a copy via
# GA Canvas, or email it to me.