# AI4OPT – Summer 2025 Name: _________________________
#
# Data Engineering and Mining II Assignment 1
# Clustering – Part 1
#1) In clustering analysis, why do we explore and analyze large amounts of data?
#It allows data to be grouped without prior "classification" and allows
#the data scientist to create meaningful labels.
#2) What is the uniform definition of clustering analysis? Cluster analysis is a method of
#creating groups of objects, or clusters, in such a way that objects in one cluster are very
#similar and objects in different clusters are quite distinct.
#3) What measure do we use to determine how similar two data points are? distance
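#A minimal sketch of how a distance expresses similarity. The three points are hypothetical
#values chosen only for illustration, and Euclidean distance is assumed as the measure.

```python
import math

# Hypothetical 2-D data points, chosen only for illustration.
a = (1.0, 2.0)
b = (1.5, 2.5)
c = (8.0, 9.0)

# math.dist returns the Euclidean distance between two points.
d_ab = math.dist(a, b)
d_ac = math.dist(a, c)

# The pair with the smaller distance is the more similar pair.
print("a is closer to b than to c:", d_ab < d_ac)
```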
#4) Fill in the blanks: Every _clustering algorithm_ is based on the index
#of similarity or dissimilarity between data points.
#5) Is clustering a supervised machine learning task? No, it is an unsupervised machine learning task.
#6) Clustering is used to predict certain outcomes. TRUE or FALSE? FALSE. It is usually used for discovery.
#7) State and explain the concept of clustering. Roughly speaking, by data clustering, we mean that
#for a given set of data points and a similarity measure, we regroup the data such that data points
#in the same group are similar and data points in different groups are dissimilar.
#8) When is clustering useful? Clustering is useful whenever diverse and varied data can be exemplified
#by a much smaller number of groups.
#9) When dealing with the dissimilarity of two cases, what can we say about the higher distance value?
#The higher the dissimilarity measure, the more dissimilar the points or clusters are.
#10) Fill in the blanks: Since clustering classifies unlabeled examples, we say it is an unsupervised
#classification.
#11) Clustering is a machine learning task. TRUE or FALSE? TRUE.
#12) You can use clustering to create class labels from unlabeled data. Once this is done, which
#supervised learner should you use to find the most important predictors of these classes?
#Decision trees. (K-means is itself an unsupervised clustering method, not a supervised learner.)
#13) For numerical data, what are the two kinds of clusters? Compact Clusters, Chained Clusters
#14) For which data do we represent a compact cluster by a center? By a mode? A compact cluster of
#numerical data is represented by its center, and one of categorical data by its mode.
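#A small sketch of the two representatives. The helper names center and mode_record and the
#sample records are assumptions made for illustration. The center is taken as the
#component-wise mean, and the mode as the most frequent category in each attribute.

```python
from collections import Counter

def center(points):
    """Component-wise mean of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def mode_record(records):
    """Most frequent category in each attribute position."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))

# Hypothetical compact clusters, one numerical and one categorical.
numeric_cluster = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
categorical_cluster = [("red", "small"), ("red", "large"), ("blue", "small")]

print(center(numeric_cluster))           # (2.0, 4.0)
print(mode_record(categorical_cluster))  # ('red', 'small')
```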
#15) For which kind of cluster are any two data points reachable through a path? Chained Cluster.
#16) What is the cohesion of a cluster? A measure of how closely the data points in a cluster are
#related. It is tied to the SSE, the objective function used to locate the centroid mathematically.
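#A minimal sketch of the SSE mentioned above, assuming squared Euclidean distance to the cluster
#centroid. The function names and the two sample clusters are hypothetical. A lower SSE indicates
#a more cohesive cluster.

```python
def centroid(points):
    """Component-wise mean of the points in a cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def sse(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)

# Two hypothetical clusters: the same square shape at different scales.
tight = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]
loose = [(0.0, 0.0), (0.0, 4.0), (4.0, 0.0), (4.0, 4.0)]

print(sse(tight))  # 2.0
print(sse(loose))  # 32.0
```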
#17) Distance functions, k-means, and hierarchical clustering are all examples of what? Distance-function
#model parameters, used to identify distances and the centroid of the data within a model.
#18) What is the norm? The Minkowski distance formula is known as the Lp norm.
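#For reference, the Minkowski distance of order p between x and y is
#(sum over i of |x_i - y_i|^p)^(1/p). The sketch below implements it directly. The function
#name minkowski is an assumption for illustration. Setting p = 1 gives the Manhattan distance
#and p = 2 gives the Euclidean distance.

```python
def minkowski(x, y, p):
    """Minkowski (Lp) distance between two equal-length numeric sequences."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

# Hypothetical points chosen so the results are easy to check by hand.
x = (0, 0)
y = (3, 4)

print(minkowski(x, y, 1))  # Manhattan distance: 3 + 4
print(minkowski(x, y, 2))  # Euclidean distance: sqrt(9 + 16)
```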
#19) What are mixed mode data? Datasets that include both nominal and numeric features.
#20) What is a Chained cluster? A chained cluster is a set of data points in which every member
#is more like other members in the cluster than other data points not in the cluster.
#More intuitively, any two data points in a chained cluster are reachable through a path; i.e.,
#there is a path that connects the two data points in the cluster.
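#A sketch of the reachability idea, under a loud assumption: here a "path" is a sequence of data
#points in which each hop covers a distance of at most eps. The threshold eps, the function name
#reachable, and the sample points are all hypothetical and are not part of the definition above.

```python
import math
from collections import deque

def reachable(points, start, goal, eps):
    """True if points[goal] can be reached from points[start] via hops of length <= eps."""
    seen = {start}
    queue = deque([start])
    while queue:
        i = queue.popleft()
        if i == goal:
            return True
        for j in range(len(points)):
            # Follow a hop only to unvisited points within the assumed threshold.
            if j not in seen and math.dist(points[i], points[j]) <= eps:
                seen.add(j)
                queue.append(j)
    return False

# Hypothetical chain: the end points are far apart but linked through their neighbors.
chain = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]

print(reachable(chain, 0, 3, eps=1.0))  # True
print(reachable(chain, 0, 3, eps=0.5))  # False
```

#The two end points are 3 units apart, so they would not share a compact cluster of radius 1,
#yet they belong to the same chain because intermediate points connect them.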