Cluster Assignment #3

# Clustering – Part 3 – Assignment 3
 
## AI4OPT   Data Engineering and Mining II                      
 
## 1.   The k-means clustering algorithm will identify a collection of k clusters using a heuristic search, starting 
##    with a selection of k clusters. TRUE or FALSE
##     •    TRUE
## 2.   What does heuristic mean?
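##     •    Enabling a person to discover or learn something for themselves (hands-on, interactive discovery). For an
##          algorithm, it means a practical search strategy that is not guaranteed to find the optimal solution.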
## 3.   What is the k-means approach?
##     •    a collection of k sets of mean values for each of the variables.

## 4.   Fill-in-the-blank: The __means__ for the collection of cases that form one of __k clusters__ in
##    any particular clustering are then the collection of __mean values__ for each of the input variables
##    over the cases within the clustering.

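## 5.   The k-means clustering algorithm is a hierarchical method. TRUE or FALSE
##    • FALSE. k-means is a partitional method: it splits the observations into k groups directly rather than
##      building a hierarchy of nested clusters.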

## 6.   What does the k-means clustering algorithm consist of?
##    The algorithm consists of:
##     •    Initializing the centers of the k groups to a set of randomly chosen observations.
##     •    Repeat
##      -  Allocate each observation to the group whose center is nearest
##      -  Re-calculate the center of each group
##     •    until the groups are stable
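
The steps above can be sketched in a few lines of R. This is only an illustrative sketch, not the implementation behind R's built-in kmeans(): the function name `simple_kmeans`, its arguments, and the seed are assumptions made for the example.

```r
# Minimal k-means sketch; simple_kmeans is a hypothetical helper, not a library function
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Initialise the centers to k randomly chosen observations
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (i in seq_len(max_iter)) {
    # Squared Euclidean distance from every observation to every center (n x k matrix)
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    # Allocate each observation to the group whose center is nearest
    new_cluster <- max.col(-d, ties.method = "first")
    # Stop when the groups are stable
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Re-calculate the center of each non-empty group
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(1234)  # assumed seed, only to make the run repeatable
fit <- simple_kmeans(iris[, -5], k = 3)
table(fit$cluster)
```

In practice the built-in kmeans() function is the better choice; the sketch only makes the loop structure visible.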

## 7.   What is data noise?
##    • Noise refers to parts of the data that do not belong to any meaningful cluster. Attempting to cluster them
##      can disturb the clustering of the remaining data points.

## 8.   When using the k-means algorithm, why isn’t it a good idea to use different starting points as cluster centers?
##    • Using different starting points as cluster centers may lead the algorithm to converge to a different solution,
##      so the result can change from one run to the next.
## 9.   The k-means algorithm results in obtaining cluster separation, thereby obtaining a stable and non-changing 
##    maximal clustering solution, where different starting points are used as centers. TRUE or FALSE
##    • FALSE
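
The sensitivity to starting points described in questions 8 and 9 can be checked with a short experiment. The sketch below assumes the iris data and an arbitrary set of seeds (1 to 10); it compares the total within-cluster sum of squares reached from each random start, then uses the `nstart` argument of kmeans(), which tries several random starts and keeps the best one.

```r
data(iris)

# One k-means run per seed; each seed gives a different set of starting centers
tot <- sapply(1:10, function(seed) {
  set.seed(seed)
  kmeans(iris[, -5], centers = 3, iter.max = 200)$tot.withinss
})
round(tot, 2)  # runs that differ here converged to different solutions

# nstart = 25 repeats the random initialisation 25 times and keeps the best run
set.seed(1)
best <- kmeans(iris[, -5], centers = 3, iter.max = 200, nstart = 25)
best$tot.withinss
```
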
## 10.  What are the 3 species of plants in the Iris dataset?
##    • Versicolor, Setosa, Virginica
## 11.  In the chunk of code for k-means clustering applied to the Iris dataset on page 122 in Torgo’s text, explain
##      the following arguments in the kmeans() function:
##      •   iris[ , -5]                   ans: __all rows and all columns except column 5 (Species)__
##      •   iter.max = 200         ans: __the maximum number of iterations allowed is 200; the algorithm may stop earlier if it converges__
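
Both arguments can be inspected directly. The following is a small sketch in the spirit of the call the question refers to; the seed value and the intermediate name `x` are assumptions added for the example.

```r
data(iris)

# A negative column index drops that column: all rows, every column except Species
x <- iris[, -5]
dim(x)    # number of rows and columns that remain
names(x)  # the four numeric measurements

# iter.max is an upper limit on iterations, not a fixed count
set.seed(1234)  # assumed seed, only to make the run repeatable
ir3 <- kmeans(x, centers = 3, iter.max = 200)
ir3$iter  # iterations actually used before convergence
```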
 
## 12.  The kmeans( ) function returns an object that contains bits of information. What are those bits of
##      information? The kmeans( ) function's bits of information include:
##      •   cluster, centers, totss, withinss, tot.withinss, betweenss and size.
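
The components listed above can be looked up on a fitted object. This sketch assumes the iris data and an arbitrary seed; it also checks the relationship between the three sum-of-squares components.

```r
data(iris)
set.seed(1234)  # assumed seed, only to make the run repeatable
ir3 <- kmeans(iris[, -5], centers = 3, iter.max = 200)

names(ir3)         # component names of the returned object
ir3$size           # number of observations in each cluster
ir3$withinss       # within-cluster sum of squares, one value per cluster
ir3$tot.withinss   # sum of the withinss values

# The total sum of squares splits into within-cluster and between-cluster parts
all.equal(ir3$totss, ir3$tot.withinss + ir3$betweenss)
```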

## 13.  Explain cluster validation (cluster evaluation).
##      •   The term describes the procedure of evaluating the goodness of clustering algorithm results. This is important
##        to avoid finding patterns in random data.

## 14.  Even though cluster evaluation is not commonly used, what are the evaluation measure (or index) types used to
##      judge various aspects of cluster validity?
##      •   Unsupervised, Supervised, and Relative.  

## 15.  (a) What are unsupervised measures (internal indices)?
##      •   only use information available during the clustering process.  

##      (b)  What are supervised measures (external indices)?
##      •   Requires the existence of information that was not available when obtaining the clustering solution, that 
##        can be used to compare against the structure obtained by the clustering algorithm. 

## 16.  What is entropy?
##      •   measures how well cluster labels match externally supplied class labels.  
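
As a rough illustration of entropy as a supervised measure, the sketch below computes it for a k-means clustering of iris against the Species labels. It assumes the usual definition (for each cluster, minus the sum of p * log2(p) over the class proportions p, then a size-weighted average over clusters); the seed and variable names are assumptions for the example.

```r
data(iris)
set.seed(1234)  # assumed seed, only to make the run repeatable
ir3 <- kmeans(iris[, -5], centers = 3, iter.max = 200)

# Rows are clusters, columns are the externally supplied class labels
tab <- table(ir3$cluster, iris$Species)
p <- prop.table(tab, margin = 1)  # class proportions within each cluster

# Entropy of each cluster; zero proportions are dropped because 0 * log2(0) is taken as 0
cluster_entropy <- apply(p, 1, function(row) {
  row <- row[row > 0]
  -sum(row * log2(row))
})

# Overall entropy: cluster entropies weighted by cluster size
total_entropy <- sum(rowSums(tab) / sum(tab) * cluster_entropy)
cluster_entropy
total_entropy
```

A cluster containing a single species has entropy 0; larger values indicate more mixing of classes.
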
## 17.  Why are supervised measures also called external indices?
##      •   Supervised measures are often called external indices because they use information that is not present in the
##        data set used for clustering, such as externally supplied class labels.

## 18.  What are relative measures?

##      •   Compares different clusterings or clusters.  A relative cluster evaluation measure is a supervised or 
##        unsupervised evaluation measure that is used for the purpose of comparison.

## 19.  Look at the code in Torgo, P. 123.  Explain the following arguments of the table( ) function:
##      •   ir3$cluster: __the cluster membership of each observation__
##      •   iris$Species: __the Species variable of the iris dataset__
 
## 20.  Run the code on Page 123 in Torgo’s text. Based on the output, state the contents of Cluster 2.
##      Cluster 2 contains 62 observations and has the following cluster center (mean values):
##          •   Sepal length = 5.9016
##          •   Sepal width =  2.7484
##          •   Petal length =  4.3935
##          •   Petal width =   1.4338

## 21.  Based on the output in Exercise 20 above, which clusters do not contain pure plant classes (observations)?
##          •   Clusters 2 and 3
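
The cross-tabulation and cluster contents discussed in questions 19 to 21 can be reproduced along the following lines. This is a sketch that assumes a particular seed; cluster numbers are arbitrary labels, so with a different random start the same groups may appear under different numbers.

```r
data(iris)
set.seed(1234)  # assumed seed; a different start can relabel the clusters
ir3 <- kmeans(iris[, -5], centers = 3, iter.max = 200)

# Rows: cluster membership; columns: actual species
tab <- table(ir3$cluster, iris$Species)
tab

ir3$size               # observations per cluster
round(ir3$centers, 4)  # mean of each variable within each cluster

# A cluster is pure when all of its observations belong to one species
impure <- which(rowSums(tab > 0) > 1)
impure
```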

## 22.  Which measures deal with labels, supervised or unsupervised?
##          •   Supervised. These measures need externally supplied class labels to compare against the clusters found.
## 23.  Fill-in-the-blank:  Internal validation metrics only use information available (during the clustering 
##      process).

## 24.  What metrics evaluate the quality of cluster separation?
##         •    Internal validation metrics
## 25.  The silhouette coefficient is an example of which kind of metric?

##         •    Internal validation metric (an unsupervised measure)
## 26.  In the statement
##                                        s <- silhouette(ir3$cluster, dist(iris[ , -5]))
##                 explain the argument “iris[ , -5]” of the dist() function.
##         •    all rows and all columns except column 5
##         •    dist() computes the distance matrix of that data. The silhouette() function is in the package
##            cluster, which also provides the daisy( ) function for computing dissimilarities
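
The silhouette statement in question 26 can be run as a complete example. The sketch assumes the cluster package is installed and uses an arbitrary seed for the k-means step.

```r
library(cluster)  # provides silhouette()
data(iris)
set.seed(1234)  # assumed seed, only to make the run repeatable
ir3 <- kmeans(iris[, -5], centers = 3, iter.max = 200)

# Pairwise Euclidean distances between observations, Species column excluded
d <- dist(iris[, -5])

# One silhouette width per observation, each between -1 and 1
s <- silhouette(ir3$cluster, d)
head(s[, "sil_width"])

# Average silhouette width over all observations; values near 1 suggest well-separated clusters
summary(s)$avg.width
```
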
## 27.  The sum of square error (SSE) can be used to compare cluster performance only for a similar number of 
##      clusters. TRUE or FALSE
##         •    FALSE
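
In the kmeans() output, the SSE corresponds to the `tot.withinss` component. The sketch below, which assumes the iris data, an arbitrary seed, and values of k from 2 to 6, prints the SSE for each k; because SSE tends to shrink as k grows, comparisons across different numbers of clusters need care.

```r
data(iris)
ks <- 2:6  # assumed range of k, chosen only for illustration

sse <- sapply(ks, function(k) {
  set.seed(1234)  # assumed seed, only to make the run repeatable
  # nstart = 25 reduces the chance of reporting a poor local solution
  kmeans(iris[, -5], centers = k, iter.max = 200, nstart = 25)$tot.withinss
})

data.frame(k = ks, SSE = round(sse, 2))
```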

## 28.  Study Approach A, the program located towards the end of the Lecture 3 packet.  Then run the Iris dataset 
## through the program.  Keep in mind that you will have to tweak the program here and there.  (Hint: After 
## library(dataset), replace “dataset” with “Iris.”  Also, replace “objects_names” with “species.”)  Should your 
## program run, publish it in RPubs and submit it via GA Canvas. 
##   •  Not having success with this

## 29.  Again, study Approach A.  Then run the “USArrests” dataset through the program.  Again, you will have to tweak
## the program here and there.  (Hint:  After library(dataset), replace “dataset” with “USArrests.”  Replace
## “objects_names” with “state.”)
## If the program runs correctly, publish the program in RPubs and submit a copy via GA Canvas, or email to me.

##    • Not having success with this


Paul Brown

2023-01-11