DE II HW#3

# Clustering – Part 3 – Assignment 3                    AI4OPT                      Name: _____________________       
# Data Engineering and Mining II                                                             Date:  August 8, 2025
# 
# 1. The k-means clustering algorithm will identify a collection of k clusters using a heuristic
# search, starting with a selection of k clusters. TRUE or FALSE: TRUE

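# 2. What does heuristic mean? In general, enabling a person to discover or learn something for themselves
# (hands-on, interactive discovery). For an algorithm such as k-means, it means a practical search strategy
# that finds a good solution quickly without guaranteeing the best one.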

# 3. What is the k-means approach? It partitions the cases into k clusters, each represented by its mean.
# The result is a collection of k sets of mean values, one value for each of the variables.

# 4. Fill-in-the-blank: The __means__ for the collection of cases that form one of
# __the k clusters__ in any particular clustering are then the collection of __mean values__
# for each of the input variables over the cases within the clustering.

# 5. The k-means clustering algorithm is a hierarchical method. TRUE or FALSE: FALSE (it is a partitional method)

# 6. What does the k-means clustering algorithm consist of?
#  • Initialize the centers of the k groups to a set of randomly chosen observations
#  • Repeat:
#      – Allocate each observation to the group whose center is nearest
#      – Re-calculate the center of each group
#  • Until the groups are stable
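
# The steps above can be sketched in a few lines of R. This is only a minimal illustration, not the
# implementation behind R's kmeans(); the helper name simple_kmeans and its arguments are hypothetical.

```r
# Hypothetical helper: a bare-bones version of the loop described above.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Initialize the centers to k randomly chosen observations.
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- integer(nrow(x))
  for (i in seq_len(max_iter)) {
    # Squared Euclidean distance from every observation to every center (n x k matrix).
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    # Allocate each observation to the group whose center is nearest.
    new_cluster <- max.col(-d, ties.method = "first")
    # Stop once the groups are stable (no observation changed group).
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Re-calculate the center of each non-empty group as the mean of its members.
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(1)  # assumed seed, only so the sketch is repeatable
fit <- simple_kmeans(iris[, -5], k = 3)
table(fit$cluster)
```

# In practice the built-in kmeans() function is the better choice; the sketch only mirrors the three steps.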

# 7. What is data noise? Random error or meaningless variation in the data (for example, measurement error)
# that carries no useful information.

# 8. When using the k-means algorithm, why isn’t it a good idea to use different starting
# points as cluster centers? Because k-means converges only to a local optimum, using different starting
# points as cluster centers may lead the algorithm to converge to a different solution, so results can vary from run to run.
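
# One way to see this sensitivity, sketched under the assumption that each run uses a single random start
# (nstart = 1): fit k-means several times with different seeds and compare the total within-cluster sum
# of squares. The seeds and the variable names below are arbitrary choices for illustration.

```r
x <- iris[, -5]

# One random start per seed; the totals may or may not agree across seeds.
sse_by_seed <- sapply(1:5, function(seed) {
  set.seed(seed)
  kmeans(x, centers = 3, nstart = 1)$tot.withinss
})
print(sse_by_seed)

# A common mitigation: try many random starts and keep the best run.
set.seed(42)
best <- kmeans(x, centers = 3, nstart = 25)
print(best$tot.withinss)
```

# If the values in sse_by_seed differ, the runs converged to different solutions.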

# 9. The k-means algorithm results in obtaining cluster separation, thereby obtaining a
# stable and non-changing maximal clustering solution, where different starting points are
# used as centers. TRUE or FALSE: FALSE

# 10. What are the 3 species of plants in the Iris dataset?  Setosa, Versicolor and Virginica

# 11. In the chunk of code for k-means clustering applied to the Iris dataset on page 122 in
# Torgo’s text, explain the following arguments in the kmeans() function:
# (a) iris[ , -5]        ans: All columns except the 5th (Species), leaving the four numeric measurements
# (b) iter.max = 200     ans: Maximum number of iterations allowed
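
# A short sketch of what the two arguments do. The seed and the centers = 3 setting are assumptions made
# here so the example runs on its own; they are not taken from the question.

```r
data(iris)

# Negative indexing drops column 5 (Species), keeping the four numeric measurements.
features <- iris[, -5]
print(names(features))

set.seed(1234)  # assumed seed
ir3 <- kmeans(features, centers = 3, iter.max = 200)

# Number of iterations actually used; it can never exceed iter.max.
print(ir3$iter)
```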


# 12. The kmeans() function returns an object that contains bits of information. What are
# those bits of information? The number and sizes of the clusters, the cluster means by variable (centers),
# the clustering vector, and the within-cluster sums of squares.
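
# The components can be inspected directly on the fitted object. A minimal sketch, assuming the same
# iris fit as in the previous question (the seed is an assumption):

```r
set.seed(1234)  # assumed seed
ir3 <- kmeans(iris[, -5], centers = 3, iter.max = 200)

print(names(ir3))         # all components stored in the returned object
print(ir3$size)           # number of observations in each cluster
print(ir3$centers)        # cluster means by variable
print(head(ir3$cluster))  # clustering vector: cluster index of each observation
print(ir3$withinss)       # within-cluster sum of squares, one value per cluster
```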


# 13. Explain cluster validation (cluster evaluation). It assesses the quality of a clustering. It can be
# external (comparing the clusters with known outputs such as class labels) or internal (using only the
# information available during the clustering process).

# 14. Even though cluster evaluation is not commonly used, what are the evaluation measure
# (or index) types used to judge various aspects of cluster validity? Unsupervised (internal), supervised (external), and relative.


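# 15. (a) What are unsupervised measures (internal indices)? Measures that judge a clustering using only the
# data itself, with no class labels, e.g. the silhouette coefficient or SSE.
# (b) What are supervised measures (external indices)? Measures that compare a clustering with externally
# supplied class labels (ground truth), e.g. entropy.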

# 16. What is entropy? A measure of impurity (disorder). In cluster evaluation it measures how mixed the
# known classes are within each cluster; it is 0 when a cluster contains a single class.
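
# In the cluster-evaluation sense, entropy can be computed from a cluster-by-class table. The sketch below
# uses base-2 logarithms and a hypothetical helper, cluster_entropy; the seed is an assumption, so the
# cluster numbering may differ from other runs.

```r
set.seed(1234)  # assumed seed
ir3 <- kmeans(iris[, -5], centers = 3, iter.max = 200)
tab <- table(ir3$cluster, iris$Species)

# Hypothetical helper: entropy (in bits) of the class distribution inside one cluster.
cluster_entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]  # treat 0 * log(0) as 0
  -sum(p * log2(p))
}

# Entropy of each cluster (one value per row of the table).
ent <- apply(tab, 1, cluster_entropy)
print(ent)

# Overall entropy: per-cluster entropies weighted by cluster size.
overall <- sum(ent * rowSums(tab) / sum(tab))
print(overall)
```

# A cluster containing a single species has entropy 0; an even mix of two species has entropy 1 bit.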


# 17. Why are supervised measures also called external indices? Because the class labels they use are external to the
#     clustering process.
# 18. What are relative measures? They compare different clusterings or clusters, for example runs with
#     different values of k, to choose the "best" one.

# 19. Look at the code in Torgo, P. 123. Explain the following arguments of the table( )
# function:
# (a) ir3$cluster: the cluster assigned to each observation; these values form the rows of the table
# (b) iris$Species: the true species of each observation; these values form (and label) the columns
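
# A minimal sketch of the cross-tabulation, assuming ir3 is a 3-cluster k-means fit of iris[, -5]
# (the seed is an assumption, so cluster numbers may not match another run):

```r
set.seed(1234)  # assumed seed
ir3 <- kmeans(iris[, -5], centers = 3, iter.max = 200)

# First argument gives the rows (clusters), second gives the columns (species).
tab <- table(ir3$cluster, iris$Species)
print(tab)

# A cluster is "pure" when only one species has a non-zero count in its row.
pure <- rowSums(tab > 0) == 1
print(pure)
```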

# 20. Run the code on Page 123 in Torgo’s text. Based on the output, state the contents of
# Cluster 2. Cluster 2 has size 62.
#  Cluster means:
#    Sepal.Length Sepal.Width Petal.Length Petal.Width
#  2     5.901613    2.748387     4.393548    1.433871

# 21. Based on the output in Exercise 20 above, which clusters do not contain pure plant
# classes (observations)? Clusters #2 and #3

# 22. Which measures deal with labels, supervised or unsupervised? Supervised

# 23. Fill-in-the-blank: Internal validation metrics only use information available (during the
# clustering process).

# 24. What metrics evaluate the quality of cluster separation? Silhouette Coefficient

# 25. The silhouette coefficient is an example of which kind of metric? Internal Validation

# 26. In the statement
#                           s <- silhouette(ir3$cluster, dist(iris[ , -5]))
# explain the argument “iris[ , -5]” of the dist() function. Removal of the 5th column (Species) from the iris
# dataset, so that distances are computed from the four numeric variables only.
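#
# A runnable sketch of the statement above, assuming the cluster package (which provides silhouette()) is
# installed and that ir3 is a 3-cluster k-means fit; the seed is an assumption.

```r
library(cluster)  # provides silhouette()

set.seed(1234)  # assumed seed
ir3 <- kmeans(iris[, -5], centers = 3, iter.max = 200)

# dist() needs numeric data only, hence the removal of the Species column.
s <- silhouette(ir3$cluster, dist(iris[, -5]))

# Average silhouette width: values near 1 suggest compact, well-separated clusters.
avg_width <- mean(s[, "sil_width"])
print(avg_width)
```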
# 
# 27. The sum of squared errors (SSE) can be used to compare cluster performance only for a
# similar number of clusters. TRUE or FALSE: TRUE
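#
# A brief sketch of why the number of clusters matters here: SSE generally shrinks as k grows, so a lower
# SSE at a larger k does not by itself mean a better clustering. The range of k, the seed, and nstart
# are arbitrary choices for illustration.

```r
x <- iris[, -5]

# Total sum of squares around the overall mean (the SSE of a single cluster).
totss <- sum(scale(x, scale = FALSE)^2)

set.seed(1234)  # assumed seed
ks <- 2:6
sse <- sapply(ks, function(k) {
  kmeans(x, centers = k, nstart = 10, iter.max = 200)$tot.withinss
})
names(sse) <- paste0("k=", ks)

print(totss)
print(round(sse, 1))
```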
# 
# 28. Study Approach A, the program located towards the end of the Lecture 3 packet. Then
# run the Iris dataset through the program. Keep in mind that you will have to tweak the
# program here and there. (Hint: After library(dataset), replace “dataset” with “Iris.”
# Also, replace “objects_names” with “species.”) Should your program run, publish it in
# RPubs and submit it via GA Canvas.
# 
# 29. Again, study Approach A. Then run the “USArrests” dataset through the program.
# Again, you will have to tweak the program here and there. (Hint: After library(dataset),
# replace “dataset” with “USArrests.” Replace “objects_names” with “state.”)
# If the program runs correctly, publish the program in RPubs and submit a copy via
# GA Canvas, or email it to me.
Walter James

2025-08-19