## Clustering – Part 1
## 1. In clustering analysis, why do we explore and analyze large amounts of data?
## Clustering is a popular unsupervised method and an essential tool for Big Data analysis. Clustering can
## be used either as a pre-processing step to reduce data dimensionality before running the learning
## algorithm, or as a statistical tool to discover useful patterns within a dataset.
## 2. What is the uniform definition of clustering analysis?
## Cluster analysis is a method of creating groups of objects, or clusters, in such a way that objects in
## one cluster are very similar and objects in different clusters are quite distinct.
## 3. What measure do we use to determine how similar two data points are?
## The Euclidean distance can be used to determine how similar (or dissimilar) two data points are.
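To make the distance measure concrete, here is a minimal sketch in plain Python. The function name `euclidean_distance` and the sample points are illustrative assumptions, not part of the original notes.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric points."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical sample points, chosen only for illustration.
p = (1.0, 2.0)
q = (4.0, 6.0)
r = (1.5, 2.5)

print(euclidean_distance(p, q))  # 5.0
# r lies much closer to p than q does, so p and r are the more similar pair.
print(euclidean_distance(p, r) < euclidean_distance(p, q))  # True
```

A smaller distance means the two points are more similar; a larger distance means they are more different.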
## 4. Fill in the blanks: Every _____clustering algorithm _________________
## is based on the index of similarity or dissimilarity between data points.
## 5. Is clustering a supervised machine learning task?
## No, clustering is an unsupervised ML task.
## 6. Clustering is used to predict certain outcomes. TRUE or FALSE
## False. Clustering is used to identify groups of similar objects in datasets with two or more
## variables.
## 7. State and explain the concept of clustering.
## Clustering is guided by the principle that items inside a cluster should be very similar to each other,
## but very different from those outside it.
## 8. When is clustering useful?
## Clustering is useful whenever diverse and varied data can be represented by a much smaller number of
## groups.
## 9. When dealing with the dissimilarity of two cases, what can we say about the higher distance value?
## The higher the distance value, the more different the two cases are.
## 10. Fill in the blanks: Since clustering classifies unlabeled examples, we say it is an _________________
## ________________.
## Unsupervised classification
## 11. Clustering is a machine learning task. TRUE or FALSE
## True
## 12. You can use clustering to create class labels from unlabeled data. Once this is done, which supervised
## learner should you use to find the most important predictors of these classes?
## Decision trees can be used to find the most important predictors.
## 13. For numerical data, what are the two kinds of clusters?
## Compact clusters and Chained clusters
## 14. For which data do we represent a compact cluster by a center? By a mode?
## Numerical data for a center; categorical data for a mode.
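The two kinds of representative can be sketched with the standard library as follows. The helper names `center` and `cluster_mode` and the sample clusters are assumptions made for illustration. The center is taken here as the per-feature mean, which is one common choice.

```python
from statistics import mean, mode

def center(points):
    """Per-feature mean of a cluster of numeric points."""
    return tuple(mean(column) for column in zip(*points))

def cluster_mode(records):
    """Per-feature most frequent value of a cluster of categorical records."""
    return tuple(mode(column) for column in zip(*records))

# Hypothetical clusters, chosen only for illustration.
numeric_cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
categorical_cluster = [("red", "small"), ("red", "large"), ("blue", "large")]

print(center(numeric_cluster))            # (3.0, 4.0)
print(cluster_mode(categorical_cluster))  # ('red', 'large')
```

A mean is not defined for values such as colors or sizes, which is why the most frequent value stands in for the center on categorical data.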
## 15. For which kind of cluster are any two data points reachable through a path?
## Chained cluster
## 16. What is the cohesion of a cluster?
## Cohesion of a cluster measures how similar the objects (for example, documents) in the cluster are to
## the cluster centroid; a good clustering maximizes this similarity.
## 17. Distance functions, k-means, and hierarchical clustering are all examples of what?
## Model parameters
## 18. What is the L_P norm?
## According to Amazon SageMaker, the Lp-norm (LP) measures the p-norm distance between the facet
## distributions of the observed labels in a training dataset.
## The formula for the Lp-norm is as follows:
## $L_p(P_a, P_d) = \left( \sum_y \lVert P_a - P_d \rVert^p \right)^{1/p}$
## where the p-norm distance between the points x and y is defined as follows:
## $L_p(x, y) = \left( |x_1 - y_1|^p + |x_2 - y_2|^p + \dots + |x_n - y_n|^p \right)^{1/p}$
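The point-to-point formula above translates directly into code. This is a minimal sketch: the function name `lp_distance` and the sample points are illustrative assumptions, and it is not SageMaker's own implementation.

```python
def lp_distance(x, y, p):
    """p-norm distance between two equal-length numeric points (p >= 1)."""
    if p < 1:
        raise ValueError("p must be at least 1")
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

# Hypothetical sample points, chosen only for illustration.
x = (0.0, 0.0)
y = (3.0, 4.0)

print(lp_distance(x, y, 1))  # 7.0 (p = 1, Manhattan distance)
print(lp_distance(x, y, 2))  # 5.0 (p = 2, Euclidean distance)
```

Setting p = 2 recovers the Euclidean distance, and p = 1 gives the sum of absolute coordinate differences.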
## 19. What are mixed mode data?
## Datasets that include both nominal and numeric features.
## 20. What is a Chained cluster?
## A chained cluster is a set of data points in which every member is more like other members in the cluster
## than other data points not in the cluster.
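The idea that any two points in a chained cluster are reachable through a path can be sketched as a connectivity check. The threshold `eps` (the longest hop allowed between neighbouring points) and the function name `is_chained` are assumptions introduced for illustration; the notes above define chained clusters by similarity and do not specify a threshold.

```python
import math
from collections import deque

def is_chained(points, eps):
    """True if every point is reachable from every other via hops of length <= eps."""
    if not points:
        return True
    seen = {0}
    queue = deque([0])
    # Breadth-first search over the "within eps" neighbour relation.
    while queue:
        i = queue.popleft()
        for j in range(len(points)):
            if j not in seen and math.dist(points[i], points[j]) <= eps:
                seen.add(j)
                queue.append(j)
    return len(seen) == len(points)

# Hypothetical points on a line: the two ends are 3 units apart,
# yet each consecutive pair is only 1 unit apart.
chain = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]

print(is_chained(chain, eps=1.0))                  # True
print(is_chained(chain + [(10.0, 0.0)], eps=1.0))  # False
```

The end points of the chain are far apart, yet they belong together because a path of close neighbours links them. A compact cluster would instead require all members to sit near one center.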