# AI4OPT – Summer 2025                              Name: _________________________
# 
# Data Engineering and Mining II                            Assignment 1

 

#                                                   Clustering – Part 1

#1) In clustering analysis, why do we explore and analyze large amounts of data?
#It allows data to be grouped without prior "classification" and allows
#the data scientist to create meaningful labels.

#2) What is the uniform definition of clustering analysis? Cluster analysis is a method of
#creating groups of objects, or clusters, in such a way that objects in one cluster are very
#similar and objects in different clusters are quite distinct.

#3) What measure do we use to determine how similar two data points are? A distance measure.
                                                            
#4) Fill in the blanks: Every _clustering algorithm_ is based on the index
#of similarity or dissimilarity between data points.
                                                            
#5) Is clustering a supervised machine learning task? No, it is an unsupervised machine learning task.
                                                            
#6) Clustering is used to predict certain outcomes. TRUE or FALSE? FALSE. It is usually used for discovery.
                                                            
#7) State and explain the concept of clustering. Roughly speaking, by data clustering, we mean that
#for a given set of data points and a similarity measure, we regroup the data such that data points
#in the same group are similar and data points in different groups are dissimilar.
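
#As an illustration of the regrouping described in 7), here is a minimal k-means sketch.
#K-means is only one of many clustering methods. The function name kmeans, the naive
#"first k points" initialization, and the toy data are assumptions made for this example.

```python
from math import dist  # Euclidean distance, used here as the similarity measure


def kmeans(points, k, iterations=10):
    """Tiny k-means sketch: returns (labels, centers) for a list of numeric tuples."""
    # Assumption: use the first k points as starting centers (real code would
    # usually pick them at random or with a smarter seeding scheme).
    centers = list(points[:k])
    for _ in range(iterations):
        # Assignment step: put each point in the group of its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centers[c]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
    return labels, centers


# Toy data: three points near (1, 1) interleaved with three points near (9, 9).
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (8.0, 9.0), (2.0, 1.0), (9.0, 8.0)]
labels, centers = kmeans(points, k=2)
print(labels)
```

#Points that share a label are close to each other, and points with different labels are far apart.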


#8) When is clustering useful? Clustering is useful whenever diverse and varied data can be exemplified
#by a much smaller number of groups.
                                                            
#9) When dealing with the dissimilarity of two cases, what can we say about the higher distance value?
#The higher the dissimilarity measure, the more dissimilar the points or clusters are.
                                                            
#10) Fill in the blanks: Since clustering classifies unlabeled examples, we say it is an unsupervised
#classification.   

#11) Clustering is a machine learning task. TRUE or FALSE? TRUE.
                                                            
#12) You can use clustering to create class labels from unlabeled data. Once this is done, which
#supervised learner should you use to find the most important predictors of these classes?
#Decision trees. (K-means is an unsupervised clustering algorithm, not a supervised learner.)
                                                            
#13) For numerical data, what are the two kinds of clusters? Compact clusters and chained clusters.
                                                            
#14) For which data do we represent a compact cluster by a center? By a mode? For numerical data,
#a compact cluster is represented by its center; for categorical data, by its mode.
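
#A small sketch of the two representatives in 14): the per-feature mean as the center of a
#numerical cluster, and the per-feature most frequent value as the mode of a categorical one.
#The toy clusters and variable names are assumptions made for this example.

```python
from statistics import mean, mode

# Numerical compact cluster: summarize it by its center (per-feature mean).
numeric_cluster = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
center = tuple(mean(column) for column in zip(*numeric_cluster))

# Categorical compact cluster: a mean is undefined, so summarize it by its
# mode (the most frequent value of each feature).
categorical_cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
cluster_mode = tuple(mode(column) for column in zip(*categorical_cluster))

print(center)        # (2.0, 4.0)
print(cluster_mode)  # ('red', 'small')
```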
                                                            
                                                            
#15) For which kind of cluster are any two data points reachable through a path? Chained Cluster.
                                                            
#16) What is the cohesion of a cluster? Cohesion measures how tightly the points of a cluster are
#grouped around its centroid. It is related to the SSE (sum of squared errors), which is used as an objective function.
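
#To make the link between cohesion and the SSE in 16) concrete, here is a minimal sketch that
#computes the sum of squared distances from each point to the cluster centroid. A lower SSE
#means a more cohesive cluster. The function name sse and the toy clusters are assumptions.

```python
def sse(points):
    """Sum of squared Euclidean distances from each point to the cluster centroid."""
    n = len(points)
    centroid = tuple(sum(column) / n for column in zip(*points))
    return sum(
        sum((value - c) ** 2 for value, c in zip(point, centroid))
        for point in points
    )


tight = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]  # points close together
loose = [(0.0, 0.0), (0.0, 4.0), (4.0, 0.0), (4.0, 4.0)]  # same shape, spread out

print(sse(tight))  # 2.0
print(sse(loose))  # 32.0
```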
                                                            
#17) Distance functions, k-means, and hierarchical clustering are all examples of what? Distance-function
#model parameters, used to identify distances and the centroid of data within a model.
                                                              
#18) What is the norm? The Minkowski distance formula is known as the Lp norm.
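
#For reference, the Minkowski (Lp) distance between x and y is (sum_i |x_i - y_i|^p)^(1/p).
#The sketch below evaluates it for p = 1 (Manhattan) and p = 2 (Euclidean). The function name
#minkowski and the sample points are assumptions made for this example.

```python
def minkowski(x, y, p):
    """Minkowski (L_p) distance between two equal-length numeric vectors."""
    if p < 1:
        raise ValueError("p must be >= 1 for the result to be a valid norm")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


a, b = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski(a, b, 1)  # p = 1: sum of absolute differences
euclidean = minkowski(a, b, 2)  # p = 2: straight-line distance

print(manhattan)  # 7.0
print(euclidean)  # 5.0
```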

#19) What are mixed mode data? Datasets that include both nominal and numeric features.

#20) What is a chained cluster? A chained cluster is a set of data points in which every member
#is more like other members in the cluster than other data points not in the cluster.
#More intuitively, any two data points in a chained cluster are reachable through a path; i.e., 
#there is a path that connects the two data points in the cluster.
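
#The "reachable through a path" idea can be sketched by linking any two points whose distance
#is at most some threshold and then collecting connected groups. The threshold eps, the function
#name chained_clusters, and the toy data are assumptions for illustration, not part of the definition.

```python
from math import dist


def chained_clusters(points, eps):
    """Group point indices so that any two in a group are joined by a path of hops <= eps."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        start = min(unvisited)
        unvisited.remove(start)
        stack, members = [start], [start]
        while stack:
            i = stack.pop()
            # Every unvisited point within eps of point i joins the same chain.
            near = [j for j in unvisited if dist(points[i], points[j]) <= eps]
            for j in near:
                unvisited.remove(j)
                members.append(j)
                stack.append(j)
        clusters.append(sorted(members))
    return clusters


# Four points in a row one unit apart, then a separate pair far away.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0), (11, 0)]
clusters = chained_clusters(points, eps=1.5)
print(clusters)  # [[0, 1, 2, 3], [4, 5]]
```

#Points 0 and 3 are 3 units apart, farther than eps, yet they share a cluster because the
#path 0 -> 1 -> 2 -> 3 connects them. A compact cluster would not necessarily group them.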