#1. When performing clustering, discuss some of the decisions that need to be made with respect to the observations/features as well as in the case of hierarchical clustering and K-means clustering.

# Answer:

# We can cluster observations on the basis of the features in order to identify subgroups among the observations, or we can cluster features
# on the basis of the observations in order to discover subgroups among the features.
# For instance, suppose that we have a set of n observations, each with p
# features. The n observations could correspond to tissue samples for patients
# with breast cancer, and the p features could correspond to measurements
# collected for each tissue sample; these could be clinical measurements, such
# as tumor stage or grade, or they could be gene expression measurements.
# We may have a reason to believe that there is some heterogeneity among
# the n tissue samples; for instance, perhaps there are a few different unknown
# subtypes of breast cancer. Clustering could be used to find these subgroups.

# In K-means clustering, we seek to partition the observations into a pre-specified number of clusters. Here, we want to partition the observations into
# K clusters such that the total within-cluster variation, summed over all K clusters, is as small as possible.
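# A minimal sketch of that objective (an assumption beyond the question: squared
# Euclidean distance, written as the within-cluster sum of squares about each
# centroid, which is equivalent up to a constant factor to the pairwise form):
within_ss = function(x, labels) {
  # Total within-cluster variation of observations `x` under assignment `labels`.
  sum(sapply(unique(labels), function(k) {
    xk = x[labels == k, , drop = FALSE]       # observations in cluster k
    sum(sweep(xk, 2, colMeans(xk))^2)         # squared distances to the centroid
  }))
}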

# In Hierarchical Clustering, we end up with a tree-like visual representation of the observations, called a dendrogram,
# that allows us to view at once the clusterings obtained for each possible number of clusters, from 1 to n.
# Here, observations that fuse at the very bottom of the tree are quite similar to each other, whereas observations
# that fuse close to the top of the tree will tend to be quite different.
# We draw conclusions about the similarity of two observations based on the height on the vertical axis
# at which the branches containing those two observations are first fused.
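# A short sketch of this in base R (the matrix `x` below is hypothetical
# illustration data, not part of these exercises): complete linkage with
# Euclidean dissimilarity.
x = matrix(rnorm(100), ncol = 2)           # 50 illustrative observations, 2 features
hc = hclust(dist(x), method = "complete")  # build the dendrogram
plot(hc)                                   # inspect the tree
cutree(hc, k = 3)                          # cut the tree to obtain 3 clusters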

# So the decisions that need to be made, each of which can affect the clustering output, are:

# Should the observations or features first be standardized in some way? For instance, maybe the variables should be centered to have mean
# zero and scaled to have standard deviation one (see the scaling sketch after this list).

# In the case of K-means clustering, how many clusters should we look for in the data?

# In the case of hierarchical clustering,
# - What dissimilarity measure should be used?
# - What type of linkage should be used?
# - Where should we cut the dendrogram in order to obtain clusters?
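# A small sketch of the standardization decision mentioned above (again using
# hypothetical data, not part of these exercises): scale() centers each column
# to mean zero and scales it to standard deviation one.
x_raw = cbind(rnorm(50, sd = 1), rnorm(50, sd = 100))  # two features on very different scales
x_std = scale(x_raw, center = TRUE, scale = TRUE)
apply(x_std, 2, sd)                                    # both columns now have standard deviation 1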

#2. Perform K-means clustering manually, with K = 2, on a small example with n = 6 observations and p = 2 features. The observations are as follows.
# Obs.  X1  X2
#   1    1   4
#   2    1   3
#   3    0   4
#   4    5   1
#   5    6   2
#   6    4   0

# (a) Plot the observations.
set.seed(139783)
obs = cbind(c(1, 1, 0, 5, 6, 4), c(4, 3, 4, 1, 2, 0))
obs
##      [,1] [,2]
## [1,]    1    4
## [2,]    1    3
## [3,]    0    4
## [4,]    5    1
## [5,]    6    2
## [6,]    4    0
plot(obs[,1], obs[,2])

# (b) Randomly assign a cluster label to each observation. You can use the sample() command in R to do this. Report the cluster labels for each observation.
clust_lbls = sample(2, nrow(obs), replace=TRUE)
clust_lbls
## [1] 1 2 2 1 1 2
# (c) Compute the centroid for each cluster.
centroid1 = c(mean(obs[clust_lbls==1, 1]), mean(obs[clust_lbls==1, 2]))
centroid2 = c(mean(obs[clust_lbls==2, 1]), mean(obs[clust_lbls==2, 2]))
centroid1
## [1] 4.000000 2.333333
centroid2
## [1] 1.666667 2.333333
# Plot the observations colored by their current cluster label, marking each centroid with an X.
plot(obs[,1], obs[,2], col=(clust_lbls+1), pch=20, cex=3)
points(centroid1[1], centroid1[2], col=2, pch=4, cex = 3)
points(centroid2[1], centroid2[2], col=3, pch=4, cex = 3)
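# As a small aside (not required by the question), colMeans() gives the same
# centroid in a single call:
colMeans(obs[clust_lbls == 1, , drop = FALSE])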

# (d) Assign each observation to the centroid to which it is closest, in terms of Euclidean distance. Report the cluster labels for each observation.
# Euclidean distance between two points given as length-2 vectors.
euclid = function(a, b) {
  return(sqrt((a[1] - b[1])^2 + (a[2]-b[2])^2))
}
# Assign each observation to whichever of the two centroids is closer.
assign_clust_lbls = function(obs, centroid1, centroid2) {
  clust_lbls = rep(NA, nrow(obs))
  for (i in 1:nrow(obs)) {
    if (euclid(obs[i,], centroid1) < euclid(obs[i,], centroid2)) {
      clust_lbls[i] = 1
    } else {
      clust_lbls[i] = 2
    }
  }
  return(clust_lbls)
}
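# A more general version of this assignment step (a hypothetical sketch, not part
# of the question): with the K centroids stored as the rows of a matrix, each
# observation goes to the centroid at the smallest squared Euclidean distance.
assign_nearest = function(obs, centroids) {
  apply(obs, 1, function(x) {
    which.min(colSums((t(centroids) - x)^2))  # index of the nearest centroid
  })
}
# e.g. assign_nearest(obs, rbind(centroid1, centroid2)) matches assign_clust_lbls().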
clust_lbls = assign_clust_lbls(obs, centroid1, centroid2)
clust_lbls
## [1] 2 2 2 1 1 1
# (e) Repeat (c) and (d) until the answers obtained stop changing.
# Iterate: recompute the centroids from the current labels, then reassign each
# observation to its nearest centroid, until the labels stop changing.
last_clust_lbls = rep(-1, 6)
while (!all(last_clust_lbls == clust_lbls)) {
  last_clust_lbls = clust_lbls
  centroid1 = c(mean(obs[clust_lbls==1, 1]), mean(obs[clust_lbls==1, 2]))
  centroid2 = c(mean(obs[clust_lbls==2, 1]), mean(obs[clust_lbls==2, 2]))
  print(centroid1)
  print(centroid2)
  clust_lbls = assign_clust_lbls(obs, centroid1, centroid2)
}
## [1] 5 1
## [1] 0.6666667 3.6666667
clust_lbls
## [1] 2 2 2 1 1 1
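# As a sanity check (an aside using base R's kmeans() with several random starts;
# its cluster numbering may be swapped relative to ours, but the grouping of
# observations {1,2,3} versus {4,5,6} should agree):
km = kmeans(obs, centers = 2, nstart = 20)
km$cluster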
# (f) In your plot from (a), color the observations according to the cluster labels obtained.
plot(obs[,1], obs[,2], col=(clust_lbls+1), pch=20, cex=3)
points(centroid1[1], centroid1[2], col=2, pch=4, cex=3)
points(centroid2[1], centroid2[2], col=3, pch=4, cex=3)