Cluster Assignment #1 - 2nd Submission


## Clustering – Part 1
## 1.   In clustering analysis why do we explore and analyze large amounts of data?

##             Clustering is a popular unsupervised method and an essential tool for Big Data Analysis. Clustering can
##             be used either as a pre-processing step to reduce data dimensionality before running the learning
##             algorithm, or as a statistical tool to discover useful patterns within a dataset.

## 2.   What is the uniform definition of clustering analysis?

##            Cluster Analysis is a method of creating groups of objects, or clusters, in such a way that objects in
##            one cluster are very similar and objects in different clusters are quite distinct.

## 3.   What measure do we use to determine how similar two data points are?

##               A distance measure, such as the Euclidean distance, can be used to determine how similar (or
##               dissimilar) two data points are.
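
As a minimal sketch of how the Euclidean distance compares cases, the snippet below measures the straight-line distance between numeric points. The customer records and their features (age, visits per month) are hypothetical and only serve as an illustration; a smaller distance means the two cases are more similar.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric points."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of features")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical customers described by (age, visits per month).
alice, bob, carol = (25, 4), (27, 5), (60, 1)

print(euclidean_distance(alice, bob))    # small distance: similar cases
print(euclidean_distance(alice, carol))  # large distance: dissimilar cases
```

In practice the features would usually be rescaled first, since a feature with a large range (here, age) otherwise dominates the distance.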

## 4. Fill in the blanks:  Every _________________ _________________ is based on the index of similarity or
##    dissimilarity between data points.

##              clustering algorithm

## 5.   Is clustering a supervised machine learning task?

##              No, clustering is an unsupervised ML task.

## 6.   Clustering is used to predict certain outcomes.  TRUE or FALSE

##              False, clustering is used to identify groups of similar objects in datasets with two or more variable 
##              quantities

## 7. State and explain the concept of clustering.

##             Clustering is guided by the principle that items inside a cluster should be very similar to each other,
##             but very different from those outside.

## 8.   When is clustering useful?

##             Clustering is useful whenever diverse and varied data can be exemplified by a much smaller number of 
##             groups

## 9.   When dealing with the dissimilarity of two cases, what can we say about the higher distance value?

##            The higher the distance values, the more different the cases are

## 10. Fill in the blanks:  Since clustering classifies unlabeled examples, we say it is an _________________    
##    ________________. 

##           Unsupervised classification

## 11.  Clustering is a machine learning task.  TRUE or FALSE

##           True

## 12.  You can use clustering to create class labels from unlabeled data.  Once this is done, which supervised 
##      learner should you use to find the most important predictors of these classes?

##          Decision trees can be used to find the most important predictors

## 13.  For numerical data, what are the two kinds of clusters?

##         Compact clusters and Chained clusters

## 14.  For which data do we represent a compact cluster by a center?  By a mode?

##          Numerical data by a center; categorical data by a mode
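
To illustrate the two kinds of representative, the sketch below summarizes a numerical cluster by its center and a categorical cluster by its mode. It assumes the center is the per-feature mean and the mode is the most frequent value of each feature; the function names and sample records are hypothetical.

```python
from statistics import mean, mode


def numeric_center(points):
    """Per-feature mean of a cluster of numeric points."""
    return tuple(mean(column) for column in zip(*points))


def categorical_mode(records):
    """Most frequent value of each feature in a cluster of categorical records."""
    return tuple(mode(column) for column in zip(*records))


numeric_cluster = [(1.0, 2.0), (3.0, 4.0), (2.0, 3.0)]
categorical_cluster = [("red", "S"), ("red", "M"), ("blue", "M")]

print(numeric_center(numeric_cluster))        # center of the numeric cluster
print(categorical_mode(categorical_cluster))  # mode of the categorical cluster
```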

## 15.  For which kind of cluster are any two data points reachable through a path?

##          Chained cluster

## 16.  What is the cohesion of a cluster?

##           The cohesion of a cluster is the similarity of the documents in the cluster to the cluster centroid;
##           a good clustering maximizes it
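
One way to put a number on cohesion is sketched below. It assumes cosine similarity as the similarity measure and averages it over the members of a cluster; the answer above does not name a specific measure, so both choices are assumptions, and the term-count vectors are made up for illustration.

```python
import math


def centroid(points):
    """Per-feature mean of the vectors in a cluster."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cohesion(points):
    """Mean cosine similarity between each member and the cluster centroid."""
    c = centroid(points)
    return sum(cosine_similarity(p, c) for p in points) / len(points)


# Hypothetical term-count vectors for two candidate clusters of documents.
tight = [[1, 0, 1], [2, 0, 2], [1, 0, 2]]
loose = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

print(cohesion(tight))  # close to 1: members point the same way as the centroid
print(cohesion(loose))  # noticeably lower: members share no terms
```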

## 17.  Distance functions, k-means and hierarchical clustering, are all examples of what?

##           Model parameters

## 18.  What is the L_P norm?

##          According to Amazon SageMaker, the Lp-norm (LP) measures the p-norm distance between the facet 
##          distributions of the observed labels in a training dataset.
##          The formula for the Lp-norm is as follows:
##              $L_p(P_a, P_d) = \left( \sum_y \lVert P_a - P_d \rVert^p \right)^{1/p}$
##          where the p-norm distance between the points x and y is defined as follows:
##              $L_p(x, y) = \left( |x_1 - y_1|^p + |x_2 - y_2|^p + \dots + |x_n - y_n|^p \right)^{1/p}$
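
The p-norm distance between two points x and y can be sketched directly from its definition. The function name `lp_distance` and the default `p=2` are illustrative choices, not part of the SageMaker definition; with p = 1 it gives the Manhattan distance and with p = 2 the Euclidean distance.

```python
def lp_distance(x, y, p=2):
    """p-norm distance: (|x1-y1|^p + ... + |xn-yn|^p)^(1/p), for p >= 1."""
    if p < 1:
        raise ValueError("p must be at least 1")
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


x, y = (0, 0), (3, 4)
print(lp_distance(x, y, p=1))  # Manhattan distance: 3 + 4
print(lp_distance(x, y, p=2))  # Euclidean distance: sqrt(9 + 16)
```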

## 19.  What are mixed mode data?

##            Datasets that include both nominal and numeric features.

## 20.  What is a Chained cluster?
##          A chained cluster is a set of data points in which every member is more like other members in the cluster
##          than other data points not in the cluster
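
The idea that any two points in a chained cluster are reachable through a path can be sketched with a breadth-first search. The linking rule used here (two points are joined when they lie within a threshold `eps` of each other) and the sample points are assumptions made for illustration only.

```python
import math
from collections import deque


def reachable(points, start, goal, eps):
    """True if index `goal` can be reached from `start` by hops of length <= eps."""
    seen = {start}
    queue = deque([start])
    while queue:
        i = queue.popleft()
        if i == goal:
            return True
        for j in range(len(points)):
            # Follow a link only to unvisited points within eps of the current one.
            if j not in seen and math.dist(points[i], points[j]) <= eps:
                seen.add(j)
                queue.append(j)
    return False


# The first four points form a chain; the last one is isolated.
chain = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 10)]

print(reachable(chain, 0, 3, eps=1.5))  # True: linked through intermediate points
print(reachable(chain, 0, 4, eps=1.5))  # False: no path to the isolated point
```

The end points of the chain are 3 units apart, further than `eps`, yet they still belong together because intermediate points connect them.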

Paul Brown

2023-01-07