Multivariate Statistics: Tutorial 2

Question 1

Reading in the data:

data_cereals <- read.table("./data/T11-9_Fixed_JB (1).DAT", header = FALSE, row.names = 1)

a) Creating a distance matrix

dist_cereals <- dist(data_cereals[,-1])
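By default dist() computes Euclidean distances on the raw values, so a variable measured on a large scale can dominate the result. The following is a minimal sketch on made-up numbers (the matrix `toy` and its column names are hypothetical and are not taken from the cereal data) showing how standardising with scale() can change which observations are closest:

```r
# Hypothetical toy data: two variables on very different scales
toy <- matrix(c(1, 100,
                2, 300,
                5, 110),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("A", "B", "C"), c("protein", "sodium")))

d_raw    <- dist(toy)         # Euclidean distance on the raw values
d_scaled <- dist(scale(toy))  # columns centred and scaled to unit variance first

print(round(d_raw, 2))     # on raw values, A is closer to C than to B
print(round(d_scaled, 2))  # after scaling, A is closer to B than to C
```

Whether scaling is appropriate depends on the question being asked; the tutorial above works with the unscaled data.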

Displaying a subset of the distance matrix:

print(round(dist_cereals_sub <- dist(data_cereals[1:6,-1]),2))

##                  ACCheerios Cheerios CocoaPuffs CountChocula GoldenGrahams
## Cheerios             116.04                                               
## CocoaPuffs            15.51   121.65                                      
## CountChocula           6.36   117.89      10.00                           
## GoldenGrahams        103.20    61.63     100.62       102.10              
## HoneyNutCheerios      72.82    44.12      78.36        74.43         54.26

b) Creating and plotting the dendrograms for “single” and “complete” linkage:

clust_cer1 <- hclust(dist_cereals,method = "single")
clust_cer2 <- hclust(dist_cereals,method = "complete")

plot(clust_cer1)

plot(clust_cer2)

COMMENT: The two dendrograms are quite different. For example, in the first (single linkage), All Bran forms a cluster of its own, while in the second (complete linkage) it is part of a larger cluster.
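To see why the two linkage rules can produce different trees, here is a small sketch on four hypothetical one-dimensional points (the object `pts` is made up for illustration). Single linkage merges on the smallest distance between two groups, complete linkage on the largest, so the merge heights differ even when the merge order is the same:

```r
# Four hypothetical points on a line: 0, 1, 3, 7
pts <- matrix(c(0, 1, 3, 7), ncol = 1,
              dimnames = list(c("p1", "p2", "p3", "p4"), NULL))
d <- dist(pts)

h_single   <- hclust(d, method = "single")
h_complete <- hclust(d, method = "complete")

h_single$height    # 1 2 4: nearest-member distances between merged groups
h_complete$height  # 1 3 7: farthest-member distances between merged groups
```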

Question 2

Running K-means with 2, 3 and 4 centers respectively and extracting only the cluster assignments:

kmeans_cer2 <- kmeans(x = data_cereals[,2:10], centers = 2)
kmeans_cer3 <- kmeans(x = data_cereals[,2:10], centers = 3)
kmeans_cer4 <- kmeans(x = data_cereals[,2:10], centers = 4)

kmeans2_cer_clust <- kmeans_cer2$cluster
kmeans3_cer_clust <- kmeans_cer3$cluster
kmeans4_cer_clust <- kmeans_cer4$cluster
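kmeans() chooses its starting centers at random, so repeated runs can return different clusterings. A minimal sketch of one way to make a run repeatable, assuming simulated data (the matrix `toy`, the seed values and `nstart = 25` are illustrative choices, not part of the tutorial):

```r
# Simulated data: two well-separated groups of 20 points each
set.seed(42)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))

# Fixing the seed makes the random starts repeatable;
# nstart = 25 keeps the best of 25 random starts
set.seed(1)
km_a <- kmeans(toy, centers = 2, nstart = 25)
set.seed(1)
km_b <- kmeans(toy, centers = 2, nstart = 25)

identical(km_a$cluster, km_b$cluster)
```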

Comparing these results with the results from Question 1 above:

First, cut each of the two dendrograms into 2, 3 and 4 clusters:

dendo_cer1_2 <- cutree(tree=clust_cer1, k = 2)
dendo_cer1_3 <- cutree(tree=clust_cer1, k = 3)
dendo_cer1_4 <- cutree(tree=clust_cer1, k = 4)
dendo_cer2_2 <- cutree(tree=clust_cer2, k = 2)
dendo_cer2_3 <- cutree(tree=clust_cer2, k = 3)
dendo_cer2_4 <- cutree(tree=clust_cer2, k = 4)
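As a side note, cutree() also accepts a vector for k and then returns one column of assignments per requested number of clusters. A small sketch on four hypothetical points (`pts` is made up for illustration):

```r
# Four hypothetical points on a line: 0, 1, 3, 7
pts <- matrix(c(0, 1, 3, 7), ncol = 1,
              dimnames = list(c("p1", "p2", "p3", "p4"), NULL))
h <- hclust(dist(pts), method = "single")

# One call gives the cuts for k = 2, 3 and 4 as columns "2", "3", "4"
cuts <- cutree(h, k = 2:4)
print(cuts)

# Cluster sizes for the two-cluster cut
table(cuts[, "2"])
```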

Binding the dendrogram cuts for each number of clusters with the respective K-means clustering:

# Note: clust_cer2 below overwrites the complete-linkage hclust object from Question 1 b)
clust_cer2 <- cbind(dendo_cer1_2,dendo_cer2_2, kmeans2_cer_clust)
clust_cer3 <- cbind(dendo_cer1_3,dendo_cer2_3, kmeans3_cer_clust)
clust_cer4 <- cbind(dendo_cer1_4,dendo_cer2_4, kmeans4_cer_clust)

Now comparing these:

round(cor(clust_cer2),2)

##                   dendo_cer1_2 dendo_cer2_2 kmeans2_cer_clust
## dendo_cer1_2              1.00        -0.08             -0.08
## dendo_cer2_2             -0.08         1.00              1.00
## kmeans2_cer_clust        -0.08         1.00              1.00

round(cor(clust_cer3),2)

##                   dendo_cer1_3 dendo_cer2_3 kmeans3_cer_clust
## dendo_cer1_3              1.00         0.57             -0.45
## dendo_cer2_3              0.57         1.00             -0.41
## kmeans3_cer_clust        -0.45        -0.41              1.00

round(cor(clust_cer4),2)

##                   dendo_cer1_4 dendo_cer2_4 kmeans4_cer_clust
## dendo_cer1_4              1.00         0.56             -0.59
## dendo_cer2_4              0.56         1.00             -0.47
## kmeans4_cer_clust        -0.59        -0.47              1.00

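COMMENT: From the correlations, the following:
* For two clusters, K-means and the complete-linkage dendrogram appear to give exactly the same clusters (correlation 1.00), while single linkage is almost uncorrelated with both (-0.08).
* For three and four clusters this agreement does not persist: the correlations between K-means and complete linkage are -0.41 and -0.47, and the two dendrograms correlate more strongly with each other (0.57 and 0.56).
* Cluster labels are arbitrary numbers, so these correlations are only a rough indication of agreement between the methods.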
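Because cluster numbers are arbitrary labels, a correlation between two label vectors can be misleading. The sketch below uses two hypothetical label vectors (`a` and `b`) that describe the same partition under different numbering, and compares cor() with a cross-tabulation and a simple Rand index. The helper `rand_index` is written here for illustration and is not part of base R:

```r
# Two hypothetical labelings of six items: same groups, different numbers
a <- c(1, 1, 2, 2, 3, 3)
b <- c(2, 2, 3, 3, 1, 1)

cor(a, b)    # -0.5, although the partitions are identical
table(a, b)  # each row has a single non-zero cell: perfect agreement

# Rand index: share of item pairs on which the two labelings agree
# (both put the pair together, or both put it apart)
rand_index <- function(x, y) {
  same_x <- outer(x, x, "==")
  same_y <- outer(y, y, "==")
  agree  <- same_x == same_y
  mean(agree[upper.tri(agree)])
}

rand_index(a, b)  # 1
```

The same functions could be applied to columns of the bound matrices above, for example table(dendo_cer2_3, kmeans3_cer_clust).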

Question 3

data_records <- read.table("./data/T1-9.dat", header = FALSE, row.names = 1, sep ="")

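NOTE: The data file had to be edited first: the spaces in “Korea, N” and “Korea, S” were removed (giving Korea,N and Korea,S), because with sep = "" read.table treats any whitespace as a field separator and would otherwise split each name into two fields.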
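The effect of the embedded space can be checked with count.fields(). The sketch below uses two made-up lines with invented numbers (not the actual contents of T1-9.dat) and also shows an alternative to deleting the space, namely quoting the name in the file:

```r
# Hypothetical two-line file: the unquoted name adds an extra field
bad_file <- tempfile()
writeLines(c("Japan 11.5 23.1",
             "Korea, N 12.3 25.0"), bad_file)
count.fields(bad_file)  # 3 4: the second line has one field too many

# Quoting the name keeps it together as a single field
good_file <- tempfile()
writeLines(c("Japan 11.5 23.1",
             "\"Korea, N\" 12.3 25.0"), good_file)
rec <- read.table(good_file, header = FALSE, row.names = 1, sep = "")
rownames(rec)
```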

a) Creating a Euclidean distance matrix:

dist_records <- dist(data_records)

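Displaying a subset of the distance matrix (note that this call drops the first variable via [,-1], whereas dist_records above uses all columns, so the values shown need not match the corresponding entries of dist_records):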

print(round(dist_records_sub <- dist(data_records[1:6,-1]),2))

##       ARG   AUS   AUT   BEL   BER
## AUS  7.89                        
## AUT  4.48 11.03                  
## BEL  7.37  2.88 11.33            
## BER 23.88 31.06 20.04 31.21      
## BRA  3.49  4.42  6.95  4.45 26.92

b) As above, creating dendrograms for “single” and “complete” linkage:

clust_records1 <- hclust(dist_records,method = "single")
clust_records2 <- hclust(dist_records,method = "complete")

plot(clust_records1)

plot(clust_records2)

COMMENT: Again, as is apparent from the dendrograms, the two results are quite different.
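A visual comparison can be complemented by a number. One possible summary, which is not used in the tutorial itself, is the correlation between the cophenetic distances of the two trees. A minimal sketch on the built-in USArrests data (chosen only so that the example is self-contained):

```r
# Built-in example data, standardised before computing distances
d <- dist(scale(USArrests))

h_single   <- hclust(d, method = "single")
h_complete <- hclust(d, method = "complete")

# cophenetic() gives, for each pair, the height at which the pair is first merged
coph_single   <- cophenetic(h_single)
coph_complete <- cophenetic(h_complete)

# Values near 1 suggest the two trees group the observations similarly
cor(coph_single, coph_complete)
```

With the objects from part b), the analogous call would be cor(cophenetic(clust_records1), cophenetic(clust_records2)), run before those names are reused.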

c) As above, running K-means with 2, 3 and 4 centers:

kmeans_records2 <- kmeans(x = data_records, centers = 2)
kmeans_records3 <- kmeans(x = data_records, centers = 3)
kmeans_records4 <- kmeans(x = data_records, centers = 4)

kmeans2_rec_clust <- kmeans_records2$cluster
kmeans3_rec_clust <- kmeans_records3$cluster
kmeans4_rec_clust <- kmeans_records4$cluster

First, cut each of the two dendrograms into 2, 3 and 4 clusters:

dendo_rec1_2 <- cutree(tree=clust_records1, k = 2)
dendo_rec1_3 <- cutree(tree=clust_records1, k = 3)
dendo_rec1_4 <- cutree(tree=clust_records1, k = 4)
dendo_rec2_2 <- cutree(tree=clust_records2, k = 2)
dendo_rec2_3 <- cutree(tree=clust_records2, k = 3)
dendo_rec2_4 <- cutree(tree=clust_records2, k = 4)

Binding the dendrogram cuts for each number of clusters with the respective K-means clustering:

# Note: clust_records2 below overwrites the complete-linkage hclust object from part b)
clust_records2 <- cbind(dendo_rec1_2,dendo_rec2_2, kmeans2_rec_clust)
clust_records3 <- cbind(dendo_rec1_3,dendo_rec2_3, kmeans3_rec_clust)
clust_records4 <- cbind(dendo_rec1_4,dendo_rec2_4, kmeans4_rec_clust)

Now comparing these:

round(cor(clust_records2),2)

##                   dendo_rec1_2 dendo_rec2_2 kmeans2_rec_clust
## dendo_rec1_2              1.00         0.81             -0.44
## dendo_rec2_2              0.81         1.00             -0.54
## kmeans2_rec_clust        -0.44        -0.54              1.00

round(cor(clust_records3),2)

##                   dendo_rec1_3 dendo_rec2_3 kmeans3_rec_clust
## dendo_rec1_3              1.00         0.60              0.52
## dendo_rec2_3              0.60         1.00             -0.26
## kmeans3_rec_clust         0.52        -0.26              1.00

round(cor(clust_records4),2)

##                   dendo_rec1_4 dendo_rec2_4 kmeans4_rec_clust
## dendo_rec1_4              1.00          0.7             -0.16
## dendo_rec2_4              0.70          1.0             -0.50
## kmeans4_rec_clust        -0.16         -0.5              1.00

Question 4

See hand-written work

Question 5

Using the data_cereals dataset with three K-means centers, that is, kmeans_cer3.

Consider the centers:

centers <- kmeans_cer3$centers
star <- stars(centers,len=0.6,lwd=2, col.lines=1:6)
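A star plot is easier to read with labelled stars and a key showing which ray belongs to which variable. The sketch below is one possible variation on the call above, using the built-in USArrests data so that it runs on its own (the object `centers_toy`, the seed, and the key position are illustrative assumptions):

```r
# K-means centers from standardised built-in data, three clusters
set.seed(1)
km <- kmeans(scale(USArrests), centers = 3, nstart = 25)
centers_toy <- km$centers
rownames(centers_toy) <- paste("Cluster", 1:3)

# draw.segments = TRUE draws filled segments instead of star outlines;
# key.loc places a legend star that names each variable
stars(centers_toy, len = 0.6, draw.segments = TRUE,
      key.loc = c(4, 2), main = "K-means centers")
```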

For the Faces:

install.packages("aplpack")

library(aplpack)

## Warning: package 'aplpack' was built under R version 3.1.3

## Loading required package: tcltk

face <- faces(data_cereals[,2:10])

## effect of variables:
##  modified item       Var  
##  "height of face   " "V3" 
##  "width of face    " "V4" 
##  "structure of face" "V5" 
##  "height of mouth  " "V6" 
##  "width of mouth   " "V7" 
##  "smiling          " "V8" 
##  "height of eyes   " "V9" 
##  "width of eyes    " "V10"
##  "height of hair   " "V11"
##  "width of hair   "  "V3" 
##  "style of hair   "  "V4" 
##  "height of nose  "  "V5" 
##  "width of nose   "  "V6" 
##  "width of ear    "  "V7" 
##  "height of ear   "  "V8"

The heatmap:

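# Use the same columns as in the analyses above (column 1 is excluded);
# as.matrix() keeps the cereal names as row labels
heatmap(as.matrix(data_cereals[,2:10]))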
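By default heatmap() standardises each row before colouring. When rows are observations and columns are variables on different scales, scaling by column is often the more natural choice. A minimal sketch on the built-in mtcars data (used only to keep the example self-contained):

```r
# Ten observations and five variables from a built-in data set
m <- as.matrix(mtcars[1:10, 1:5])

# scale = "column" standardises each variable rather than each observation
hm <- heatmap(m, scale = "column")

# The returned list records the row and column order chosen by the clustering
rownames(m)[hm$rowInd]
```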