In clustering analysis, why do we explore and analyze large
amounts of data?
–> The data set does not have to be large. However, a large
data set may provide better clustering.
What is the uniform definition of clustering analysis?
–> There is no uniform definition of data clustering. In general, it
means that, for a given set of data points and a similarity measure, we
group the observations so that points in the same group are homogeneous
and points in different groups are heterogeneous.
What measure do we use to determine how similar two data points
are?
–> The most popular are distance measures, such as the Euclidean distance.
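As a rough illustration of a distance measure, the sketch below computes the Euclidean and Manhattan distances between two points. The function names and the example points are hypothetical, chosen only for this illustration.

```python
import math

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """City-block (L1) distance between two equal-length points."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical example points, not taken from any real data set.
p = (1.0, 2.0)
q = (4.0, 6.0)

print(euclidean_distance(p, q))  # 5.0
print(manhattan_distance(p, q))  # 7.0
```

A smaller value means the two points are more similar; a distance of zero means they coincide.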
Fill in the blanks: Every clustering ________ is based on the index of
similarity or dissimilarity between data points.
–> Algorithm
Is clustering a supervised machine learning task?
–> No. Clustering is an unsupervised task.
Clustering is used to predict certain outcomes. TRUE or FALSE. –> FALSE
State and explain the concept of clustering.
–> Given a set of data (X), such as a data frame, clustering separates
the observations into groups, using some statistical tool, so that
observations in the same group are homogeneous.
When is clustering useful?
–> When we are given diverse and varied data and want it to be
exemplified by a much smaller number of groups.
When dealing with the dissimilarity of two cases, what can we say
about the higher distance value?
–> The higher the dissimilarity measure, the more dissimilar the
points or clusters are.
Fill in the blanks: Since clustering classifies unlabeled examples, we say it is an ________ learning task. –> unsupervised
Clustering is a machine learning task. TRUE or FALSE. –> TRUE
You can use clustering to create class labels from unlabeled
data. Once this is done, which supervised learner should you use to find
the most important predictors of these classes?
–> Many statistical methods can be used to find important
predictors. One common example is the coefficients of a logistic
regression model.
For numerical data, what are the two kinds of clusters?
–> Compact clusters and chained clusters.
For which data do we represent a compact cluster by a center? By
a mode?
–> Continuous variables - Center value
Categorical variables - Mode value
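A minimal sketch of the two representatives, assuming the center is taken to be the arithmetic mean. The cluster members below are hypothetical values made up for illustration.

```python
from statistics import mean, mode

# Hypothetical members of one cluster, made up for illustration.
heights_cm = [160.0, 170.0, 180.0]        # continuous variable
colors = ["red", "blue", "red", "green"]  # categorical variable

center = mean(heights_cm)   # representative for continuous data
most_common = mode(colors)  # representative for categorical data

print(center)       # 170.0
print(most_common)  # red
```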
For which kind of cluster are any two data points reachable
through a path?
–> Chained cluster
What is the cohesion of a cluster?
–> In statistics and clustering analysis, cohesion of a cluster
refers to how closely related or similar the data points are within the
same cluster. The SSE (sum of squared errors) is the most common
measure; a lower SSE indicates a more cohesive cluster.
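To make the SSE concrete, here is a small sketch that sums the squared Euclidean distances from each point to the cluster centroid. The helper names `centroid` and `sse` and the two example clusters are assumptions introduced for this illustration.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)

# Two hypothetical clusters of four points each.
tight = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]
loose = [(0.0, 0.0), (0.0, 4.0), (4.0, 0.0), (4.0, 4.0)]

print(sse(tight))  # 2.0
print(sse(loose))  # 32.0
```

The tighter cluster has the smaller SSE, which is what higher cohesion means under this measure.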
Distance functions, k-means and hierarchical clustering, are all
examples of what?
–> These are all examples of tools and methods of cluster
analysis, which is used in statistics and machine learning.
What is the norm?
–> A norm is a function that measures the size, length, or magnitude
of a vector. It’s a way to quantify how “big” a vector is, regardless of
its direction. Examples: L1 norm, Euclidean (L2) norm, p-norm.
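The p-norm of a vector v is the p-th root of the sum of |v_i|^p, with p = 1 giving the L1 norm and p = 2 the Euclidean norm. A minimal sketch follows; the function name `p_norm` and the example vector are hypothetical.

```python
def p_norm(v, p):
    """Return the p-norm of vector v; p must be at least 1."""
    if p < 1:
        raise ValueError("p must be at least 1 for a valid norm")
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

# Hypothetical example vector.
v = (3.0, -4.0)

print(p_norm(v, 1))  # 7.0  (L1 norm)
print(p_norm(v, 2))  # 5.0  (Euclidean norm)
```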
What are mixed mode data?
–> Mixed mode data – datasets that include both nominal and numeric
variables.
What is a Chained cluster?
–> Any two data points in a chained cluster are reachable through a
path; i.e., there is a path that connects the two data points in the
cluster.
A chained cluster is useful for an elongated or chain-like group of points. Chained clusters are formed when points are connected by pairwise distances but the overall group is not cohesive: for example, a long string of points where each point is close to the next, but the start and end of the “chain” are far apart.
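The path idea can be sketched as a breadth-first search in which two points are linked whenever their distance is at most a chosen step size. The function `reachable`, the `max_step` parameter, and the example chain are assumptions made for this illustration, not a standard algorithm from these notes.

```python
import math
from collections import deque

def reachable(points, start, goal, max_step):
    """True if goal can be reached from start via hops of length <= max_step."""
    seen = {start}
    queue = deque([start])
    while queue:
        i = queue.popleft()
        if i == goal:
            return True
        for j in range(len(points)):
            if j not in seen and math.dist(points[i], points[j]) <= max_step:
                seen.add(j)
                queue.append(j)
    return False

# Hypothetical chain of six points spaced one unit apart on a line.
chain = [(float(x), 0.0) for x in range(6)]

print(reachable(chain, 0, 5, max_step=1.0))  # True
print(math.dist(chain[0], chain[5]))         # 5.0
print(reachable(chain, 0, 5, max_step=0.5))  # False
```

The two end points are five units apart, yet they belong to the same chain because every neighboring pair is within one unit.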