Lab 11

  1. In Section 12.2.3, a formula for calculating PVE was given in Equation 12.10. We also saw that the PVE can be obtained using the sdev output of the prcomp() function. On the USArrests data, calculate PVE in two ways:
    1. Using the sdev output of the prcomp() function, as was done in Section 12.2.3.
    pr.out <- prcomp(USArrests, scale = TRUE)
    pr.var <- pr.out$sdev^2
    pve <- pr.var / sum(pr.var)
    sum(pr.var)
    ## [1] 4
    pve
    ## [1] 0.62006039 0.24744129 0.08914080 0.04335752
    1. By applying Equation 12.10 directly. That is, use the prcomp() function to compute the principal component loadings. Then, use those loadings in Equation 12.10 to obtain the PVE. These two approaches should give the same results. Hint: You will only obtain the same results in (a) and (b) if the same data is used in both cases. For instance, if in (a) you performed prcomp() using centered and scaled variables, then you must center and scale the variables before applying Equation 12.10 in (b).
    loadings <- pr.out$rotation
    USArrests2 <- scale(USArrests)
    sumvar <- sum(apply(as.matrix(USArrests2)^2, 2, sum))
    apply((as.matrix(USArrests2) %*% loadings)^2, 2, sum) / sumvar
    ##        PC1        PC2        PC3        PC4 
    ## 0.62006039 0.24744129 0.08914080 0.04335752
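    As a quick cross-check (a minimal sketch, not part of the exercise), summary() on the prcomp object reports the same proportions of variance explained:
    summary(pr.out)$importance["Proportion of Variance", ]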
  1. Consider the USArrests data. We will now perform hierarchical clustering on the states.

    1. Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
    set.seed(2)
    hc.complete <- hclust(dist(USArrests), method = "complete")
    plot(hc.complete)

    1. Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?
    ct <- cutree(hc.complete, 3)
    sapply(1:3, function(i) names(ct)[ct == i])
    ## [[1]]
    ##  [1] "Alabama"        "Alaska"         "Arizona"        "California"    
    ##  [5] "Delaware"       "Florida"        "Illinois"       "Louisiana"     
    ##  [9] "Maryland"       "Michigan"       "Mississippi"    "Nevada"        
    ## [13] "New Mexico"     "New York"       "North Carolina" "South Carolina"
    ## 
    ## [[2]]
    ##  [1] "Arkansas"      "Colorado"      "Georgia"       "Massachusetts"
    ##  [5] "Missouri"      "New Jersey"    "Oklahoma"      "Oregon"       
    ##  [9] "Rhode Island"  "Tennessee"     "Texas"         "Virginia"     
    ## [13] "Washington"    "Wyoming"      
    ## 
    ## [[3]]
    ##  [1] "Connecticut"   "Hawaii"        "Idaho"         "Indiana"      
    ##  [5] "Iowa"          "Kansas"        "Kentucky"      "Maine"        
    ##  [9] "Minnesota"     "Montana"       "Nebraska"      "New Hampshire"
    ## [13] "North Dakota"  "Ohio"          "Pennsylvania"  "South Dakota" 
    ## [17] "Utah"          "Vermont"       "West Virginia" "Wisconsin"
    1. Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.
    sd.data <- scale(USArrests)
    hc.complete.sd <- hclust(dist(sd.data), method = "complete")
    plot(hc.complete.sd)

    1. What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.
    ct2 <- cutree(hc.complete.sd, 3)
    sapply(1:3, function(i) names(ct2)[ct2 == i])
    ## [[1]]
    ## [1] "Alabama"        "Alaska"         "Georgia"        "Louisiana"     
    ## [5] "Mississippi"    "North Carolina" "South Carolina" "Tennessee"     
    ## 
    ## [[2]]
    ##  [1] "Arizona"    "California" "Colorado"   "Florida"    "Illinois"  
    ##  [6] "Maryland"   "Michigan"   "Nevada"     "New Mexico" "New York"  
    ## [11] "Texas"     
    ## 
    ## [[3]]
    ##  [1] "Arkansas"      "Connecticut"   "Delaware"      "Hawaii"       
    ##  [5] "Idaho"         "Indiana"       "Iowa"          "Kansas"       
    ##  [9] "Kentucky"      "Maine"         "Massachusetts" "Minnesota"    
    ## [13] "Missouri"      "Montana"       "Nebraska"      "New Hampshire"
    ## [17] "New Jersey"    "North Dakota"  "Ohio"          "Oklahoma"     
    ## [21] "Oregon"        "Pennsylvania"  "Rhode Island"  "South Dakota" 
    ## [25] "Utah"          "Vermont"       "Virginia"      "Washington"   
    ## [29] "West Virginia" "Wisconsin"     "Wyoming"
    table(cutree(hc.complete, 3), cutree(hc.complete.sd, 3))
    ##    
    ##      1  2  3
    ##   1  6  9  1
    ##   2  2  2 10
    ##   3  0  0 20

    Scaling the variables changes the clusters obtained, even though the two trees have a broadly similar structure. The variables should be scaled before the inter-observation dissimilarities are computed: they are measured in different units (UrbanPop is a percentage, while the crime variables are arrest rates per 100,000), and Assault has a much larger variance than the others, so it would otherwise dominate the Euclidean distances.
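    A look at the raw column variances makes this concrete (a minimal sketch):
    apply(USArrests, 2, var)  # Assault's variance is far larger than Murder's, UrbanPop's, or Rape's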

  1. In this problem, you will generate simulated data, and then perform PCA and K-means clustering on the data.

    1. Generate a simulated data set with 20 observations in each of three classes (i.e. 60 observations total), and 50 variables. Hint: There are a number of functions in R that you can use to generate data. One example is the rnorm() function; runif() is another option. Be sure to add a mean shift to the observations in each class so that there are three distinct classes.
    set.seed(2)
    # 60 observations (20 per class) on 50 variables of low-variance Gaussian noise
    x <- matrix(rnorm(20 * 3 * 50, mean = 0, sd = 0.001), ncol = 50)
    # add mean shifts in the first two variables to create three distinct classes
    x[1:20, 2] <- 1    # class 1
    x[21:40, 1] <- 2   # class 2
    x[21:40, 2] <- 2
    x[41:60, 1] <- 1   # class 3
    true.labels <- c(rep(1, 20), rep(2, 20), rep(3, 20))
    1. Perform PCA on the 60 observations and plot the first two principal component score vectors. Use a different color to indicate the observations in each of the three classes. If the three classes appear separated in this plot, then continue on to part (c). If not, then return to part (a) and modify the simulation so that there is greater separation between the three classes. Do not continue to part (c) until the three classes show at least some separation in the first two principal component score vectors.
    pr.out <- prcomp(x)
    plot(pr.out$x[, 1:2], col = true.labels, xlab = "Z1", ylab = "Z2", pch = 19)  # color each observation by its true class

    1. Perform K-means clustering of the observations with K =3. How well do the clusters that you obtained in K-means clustering compare to the true class labels? Hint: You can use the table() function in R to compare the true class labels to the class labels obtained by clustering. Be careful how you interpret the results: K-means clustering will arbitrarily number the clusters, so you cannot simply check whether the true class labels and clustering labels are the same.
    km.out <- kmeans(x, 3, nstart = 20)
    table(true.labels, km.out$cluster)
    ##            
    ## true.labels  1  2  3
    ##           1  0  0 20
    ##           2 20  0  0
    ##           3  0 20  0

    K-means recovers the three true classes perfectly: each cluster contains exactly the 20 observations of one class (the cluster numbers themselves are arbitrary).
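    Because the cluster numbers are arbitrary, one way to summarize agreement numerically (a minimal sketch) is the sum of the per-row maxima of the confusion table, which upper-bounds the accuracy under any relabelling of the clusters:
    tab <- table(true.labels, km.out$cluster)
    sum(apply(tab, 1, max)) / length(true.labels)  # equals 1 here, since each class maps to one cluster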

    1. Perform K-means clustering with K =2. Describe your results.
    km.out <- kmeans(x, 2, nstart = 20)
    table(true.labels, km.out$cluster)
    ##            
    ## true.labels  1  2
    ##           1 20  0
    ##           2  0 20
    ##           3 20  0

    With K = 2 the algorithm must force three true classes into two clusters. Classes 1 and 3 are merged into cluster 1, while class 2 forms its own cluster (cluster 2).

    1. Now perform K-means clustering with K =4, and describe your results.
    km.out <- kmeans(x, 4, nstart = 20)
    table(true.labels, km.out$cluster)
    ##            
    ## true.labels  1  2  3  4
    ##           1 11  9  0  0
    ##           2  0  0 20  0
    ##           3  0  0  0 20

    With four clusters for three classes, the result cannot align one-to-one with the truth. Class 1 is split between clusters 1 and 2 (11 and 9 observations, respectively), while class 2 maps entirely to cluster 3 and class 3 entirely to cluster 4.

    1. Now perform K-means clustering with K =3 on the first two principal component score vectors, rather than on the raw data. That is, perform K-means clustering on the 60 × 2 matrix of which the first column is the first principal component score vector, and the second column is the second principal component score vector. Comment on the results.
    km.out <- kmeans(pr.out$x[, 1:2], 3, nstart = 20)
    table(true.labels, km.out$cluster)
    ##            
    ## true.labels  1  2  3
    ##           1  0  0 20
    ##           2  0 20  0
    ##           3 20  0  0

    The clusters align perfectly with the true class labels. The first two principal components capture the class-separating mean shifts, so K-means clustering on the 60 × 2 score matrix is just as effective as clustering on the full data.
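    To visualize this agreement (a minimal sketch), the score plot from (b) can be redrawn with points colored by the fitted cluster assignments:
    plot(pr.out$x[, 1:2], col = km.out$cluster, xlab = "Z1", ylab = "Z2", pch = 19)  # color by K-means cluster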

    1. Using the scale() function, perform K-means clustering with K =3 on the data after scaling each variable to have standard deviation one. How do these results compare to those obtained in (c)? Explain.
    km.out <- kmeans(scale(x), 3, nstart = 20)
    table(true.labels, km.out$cluster)
    ##            
    ## true.labels  1  2  3
    ##           1  9  2  9
    ##           2  2 18  0
    ##           3  7  1 12

    Class 2 is still recovered reasonably well (18 of its 20 observations land in cluster 2), but classes 1 and 3 are split across clusters, so the result is clearly worse than the perfect recovery on the unscaled data in (c). The class-defining mean shifts affect only two variables, and scaling every variable to standard deviation one inflates the noise variables relative to the informative ones, diluting the signal.
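    The column standard deviations make this concrete (a minimal sketch): only the first two variables carry the class mean shifts, and scaling puts the 48 pure-noise variables on the same footing as the informative ones.
    round(apply(x, 2, sd)[1:5], 3)   # columns 1-2 have large spread from the mean shifts; the rest are ~0.001
    apply(scale(x), 2, sd)[1:5]      # after scaling, every column has standard deviation 1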