Data Engineering and Mining II

  1. In clustering analysis, why do we explore and analyze large amounts of data?
    –> The data set does not have to be large. However, a larger data set may produce better clustering.

  2. What is the uniform definition of clustering analysis?
    –> There is no uniform definition of data clustering. In general, it means that for a given set of data points and a similarity measure, we group the observations such that observations in the same group are homogeneous and observations in different groups are heterogeneous.

  3. What measure do we use to determine how similar two data points are?
    –> The most popular are distance measures.

  4. Fill in the blanks: Every clustering ____ is based on the index of similarity or dissimilarity between data points.
    –> algorithm

  5. Is clustering a supervised machine learning task?
    –> No. Clustering is an unsupervised machine learning task.

  6. Clustering is used to predict certain outcomes. TRUE or FALSE. –> FALSE

  7. State and explain the concept of clustering.
    –> Given a set of data X (for example, a data frame), clustering uses a statistical tool to separate the observations into groups such that observations in the same group are homogeneous.

  8. When is clustering useful?
    –> Clustering is useful when we are given diverse and varied data and want it to be exemplified by a much smaller number of groups.

  9. When dealing with the dissimilarity of two cases, what can we say about the higher distance value?
    –> The higher the dissimilarity measure, the “MORE” dissimilar the points or clusters are.

  10. Fill in the blanks: Since clustering classifies unlabeled examples, we say it is an ____ learning task.
    –> unsupervised

  11. Clustering is a machine learning task. TRUE or FALSE. –> TRUE

  12. You can use clustering to create class labels from unlabeled data. Once this is done, which supervised learner should you use to find the most important predictors of these classes?
    –> Many statistical methods can be used to find important predictors. One common example is the coefficients of a logistic regression model.

  13. For numerical data, what are the two kinds of clusters?
    –> Compact clusters and chained clusters.

  14. For which data do we represent a compact cluster by a center? By a mode?
    –> Continuous variables - Center value
    Categorical variables - Mode value
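
As a small illustration of the answer above, the sketch below summarizes a compact cluster by its center for a continuous variable and by its mode for a categorical variable. The variable names and values (`heights`, `colors`) are made up for this example, and the mean is assumed to be the center value.

```python
from statistics import mean, mode

# Hypothetical members of one compact cluster (illustrative values only).
heights = [160.0, 170.0, 180.0]          # continuous variable
colors = ["red", "blue", "red", "green"]  # categorical variable

# Continuous variable: represent the cluster by its center (here, the mean).
center = mean(heights)

# Categorical variable: represent the cluster by its most frequent value.
typical = mode(colors)

print(center, typical)
```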

  15. For which kind of cluster are any two data points reachable through a path?
    –> Chained cluster

  16. What is the cohesion of a cluster?
    –> In statistics and clustering analysis, the cohesion of a cluster refers to how closely related or similar the data points within the same cluster are. The SSE (sum of squared errors) is the most common measure.
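
The SSE measure of cohesion can be sketched as follows, assuming the cluster center is the mean of its points and the error is the Euclidean distance to that center. The helper names (`centroid`, `sse`) and the sample points are hypothetical. A lower SSE indicates a more cohesive cluster.

```python
def centroid(points):
    """Return the coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / n for i in range(dim))


def sse(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    c = centroid(points)
    return sum(sum((x - ci) ** 2 for x, ci in zip(p, c)) for p in points)


# Two illustrative clusters: the first is tightly packed, the second is spread out.
tight = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]
loose = [(0.0, 0.0), (0.0, 4.0), (4.0, 0.0), (4.0, 4.0)]

print(sse(tight), sse(loose))  # the tight cluster has the smaller SSE
```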

  17. Distance functions, k-means and hierarchical clustering, are all examples of what?
    –> These are all examples of tools and methods used in cluster analysis, in both statistics and machine learning.

  18. What is the norm?
    –> A norm is a function that measures the size, length, or magnitude of a vector. It’s a way to quantify how “big” a vector is, regardless of its direction. Examples: the L1 norm, the Euclidean (L2) norm, and the general p-norm.
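
The norms named above can be computed with one small function, since the L1 norm and the Euclidean norm are the p-norm with p = 1 and p = 2. This is a minimal sketch; the function name `p_norm` and the sample vector are chosen for illustration.

```python
def p_norm(v, p):
    """Return the p-norm of vector v: (sum of |x|^p) ^ (1/p), for p >= 1."""
    if p < 1:
        raise ValueError("p must be at least 1 for a valid norm")
    return sum(abs(x) ** p for x in v) ** (1.0 / p)


v = (3.0, -4.0)

l1 = p_norm(v, 1)  # L1 norm: |3| + |-4|
l2 = p_norm(v, 2)  # Euclidean norm: sqrt(3^2 + 4^2)

print(l1, l2)
```

The distance between two data points is then the norm of their difference, which is how a norm gives rise to a distance measure.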

  19. What are mixed mode data?
    –> Mixed mode data are datasets that include both nominal and numeric variables.

  20. What is a Chained cluster?
    –> Any two data points in a chained cluster are reachable through a path; i.e., there is a path that connects the two data points in the cluster.

The chained cluster concept is useful for an elongated or chain-like group of points. Chained clusters are formed when points are connected by pairwise distances, but the overall group is not cohesive. An example is a long string of points where each point is close to the next, but the start and end of the “chain” are far apart.
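
The path-based idea can be sketched in code. The example below assumes that two points are "linked" when their Euclidean distance is at most a threshold `eps`, and it groups points that are reachable from one another through a sequence of such links. The function name `chained_clusters`, the threshold, and the sample points are all hypothetical choices for illustration; this is one simple way to realize the definition, not a specific named algorithm from these notes.

```python
from collections import deque
import math


def chained_clusters(points, eps):
    """Group point indices so that any two points in a group are joined
    by a path whose steps are each no longer than eps."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        start = min(unvisited)
        unvisited.remove(start)
        queue = deque([start])
        members = [start]
        # Breadth-first search along links of length <= eps.
        while queue:
            i = queue.popleft()
            near = [j for j in unvisited
                    if math.dist(points[i], points[j]) <= eps]
            for j in near:
                unvisited.remove(j)
                queue.append(j)
                members.append(j)
        clusters.append(sorted(members))
    return clusters


# A chain of six points spaced 1 apart, plus a separate pair far away.
chain = [(float(x), 0.0) for x in range(6)]
far = [(20.0, 0.0), (21.0, 0.0)]
points = chain + far

clusters = chained_clusters(points, eps=1.5)
print(clusters)

# The ends of the chain are farther apart than eps, yet share a cluster.
print(math.dist(points[0], points[5]))
```

The first and last points of the chain are 5 units apart, well beyond the threshold, but they land in the same cluster because each neighbor along the way is within reach.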