Q3a

eupop <- read.table('/rstudio_files/ST464/data/eupop.txt', header=T, row.names=1)
eupop <- eupop[,-5]
d_eu <- dist(eupop,"euclidean")
h_eu <- hclust(d_eu,"average")
d2_eu = color_branches((as.dendrogram(h_eu)),k=3,col=c(2,3,4))
plot(d2_eu)

The countries are grouped into three clusters. Ireland belongs to the cluster in red (Cluster 3), Spain, Germany, Greece and Italy belong to cluster 2 (blue colour) while other countires are in cluster 1 (green colour). Ireland was found ina cluster with the highest percentage of population with the age group 0 - 14yrs and lowest population from 45yrs and above. the next is cluster 2 while the last group in green has the highest population across the age groups.

Q3b

source('~/rstudio_files/ST464/sumPartition.R')
sumPartition(eupop,cutree(h_eu,3))
## Final Partition
## 
## Number of clusters  3
## 
##           N.obs Within.clus.SS Ave.dist..Centroid Max.dist.centroid
## Cluster 1    10         55.305           2.211730          4.030199
## Cluster 2     4         17.535           1.831295          3.117090
## Cluster 3     1          0.000           0.000000          0.000000
## 
## 
## Cluster centroids
## 
##       Cluster 1 Cluster 2 Cluster 3 Grand centrd
## p014  18.23     15.25     22.2      17.7        
## p1544 42.52     43.775    46.2      43.1        
## p4564 23.94     24.25     20.3      23.78       
## p65.  15.36     16.725    11.3      15.45333    
## 
## 
## Distances between Cluster centroids
## 
##           Cluster 1 Cluster 2 Cluster 3
## Cluster 1  0.000000  3.523457  7.683521
## Cluster 2  3.523457  0.000000  9.960735
## Cluster 3  7.683521  9.960735  0.000000

Cluster 3 has percentage population within 0 - 44yrs above the centroid and 45yrs and above are found below the centroid. cluster 1 is the most spread out. Lastly, cluster 1 and 2 are the closest while cluster 2 and 3 are farthest away.

Q3c

km_eu <- kmeans(eupop, 3,nstart=10)
cluskm_eu <- km_eu$cluster
ord <- order(cluskm_eu)
stars(eupop[ord,],nrow=3, col.stars=cluskm_eu[ord]+1) 

Using the kmeans, Ireland is again in red cluster, cluster blue consists of Portugal, Germany, Austria, Greece, Italy, and Spain. other countires are in cluster green.

Q3d

par(mfrow=c(1,2))
stars(eupop[h_eu$order,],nrow=3, col.stars=cutree(h_eu,3)[h_eu$order]+1, main = "hCluster") # hcluster plot
stars(eupop[ord,],nrow=3, col.stars=cluskm_eu[ord]+1, main = "kmeans") # kmeans cluster plot

Interestingly the number of countries in cluster 2 of hcluster increased from 4 to 6 in kmeans cluster while the countires in cluster 1 in hcluster decreased from 10 to 8 in kmeans.

Q4a

music <- read.csv('/rstudio_files/ST464/data/music.csv')
a <- apply(music[,4:8],2,median)
a
##          LVar          LAve          LMax        LFEner         LFreq 
##  8.210359e+06 -5.662044e+00  2.443100e+04  1.043496e+02  1.752937e+02
b <- apply(music[,4:8],2,mad, constant=1)
b
##         LVar         LAve         LMax       LFEner        LFreq 
## 7.074681e+06 6.342972e+00 5.939500e+03 2.988300e+00 1.094850e+02
music2 <- scale(music[,4:8],a,b)
head(music2)
##             LVar      LAve      LMax     LFEner      LFreq
## [1,]  1.32732438 -13.29737 0.9243202  0.5258240 -1.0569481
## [2,]  0.18837063 -11.05234 0.5379241 -0.5064652 -1.0669356
## [3,]  0.11860927 -14.56744 0.3267952 -0.6775591 -0.4630751
## [4,] -0.09228988 -13.37055 0.7520835 -0.9146170 -1.1556708
## [5,] -0.27253145 -13.13116 0.5907905 -1.3549075 -0.9249976
## [6,] -0.50101085  -9.98882 0.1852008 -1.3724024 -0.8575818
km1 <- kmeans(music2, 1,nstart=25)  ###  The code is repeated for k = 1, 2, 3, ..., 15
km1$tot.withinss                 ######  to obtain TWSS for each run
## [1] 4728.111
k_twss <- matrix(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,4728.11,2878.15,1776.08,1004.15, 723.02,552.29,459.65,394.04,334.75,289.43,250.53,213.58, 190.72,174.54,160.10), nrow=15)
colnames(k_twss) <- c("k", "TWSS")
k_twss
##        k    TWSS
##  [1,]  1 4728.11
##  [2,]  2 2878.15
##  [3,]  3 1776.08
##  [4,]  4 1004.15
##  [5,]  5  723.02
##  [6,]  6  552.29
##  [7,]  7  459.65
##  [8,]  8  394.04
##  [9,]  9  334.75
## [10,] 10  289.43
## [11,] 11  250.53
## [12,] 12  213.58
## [13,] 13  190.72
## [14,] 14  174.54
## [15,] 15  160.10
plot(k_twss, ylim=c(100,5000), xlab= "Number of cluster k", ylab="TWSS", main="k vs TWSS")

The TWSS dropped sharply from 4728 in cluster k = 1 to 723 in k = 5, and then continue to decrease gradually. The best fitting number of clusters would be at k = 5.

Q4b

km5 <- kmeans(music2, 5,nstart=10)
tab <- table(music$Artist,km5$cluster)
tab
##            
##              1  2  3  4  5
##   Abba       0  0  0  0 10
##   Beatles    0  0  3  7  0
##   Beethoven  0  0  8  0  0
##   Eels       0  0  3  7  0
##   Enya       0  0  3  0  0
##   Mozart     0  0  6  0  0
##   Vivaldi    1  3  6  0  0

Q5

protein <- read.csv('/rstudio_files/ST464/data/protein.csv', header=T, row.names=1)
head(protein)
##                RedMeat WhiteMeat Eggs Milk Fish Cereals Starch Nuts Fr.Veg
## Albania           10.1       1.4  0.5  8.9  0.2    42.3    0.6  5.5    1.7
## Austria            8.9      14.0  4.3 19.9  2.1    28.0    3.6  1.3    4.3
## Belgium           13.5       9.3  4.1 17.5  4.5    26.6    5.7  2.1    4.0
## Bulgaria           7.8       6.0  1.6  8.3  1.2    56.7    1.1  3.7    4.2
## Czechoslovakia     9.7      11.4  2.8 12.5  2.0    34.3    5.0  1.1    4.0
## Denmark           10.6      10.8  3.7 25.0  9.9    21.9    4.8  0.7    2.4
pr <- dist(protein,"euclidean")
h_pr <- hclust(pr,"average")
dpr=color_branches((as.dendrogram(h_pr)),k=5,col=c(2,3,4,6,5))
plot(dpr)

cutree(h_pr,5)
##        Albania        Austria        Belgium       Bulgaria Czechoslovakia 
##              1              2              2              3              1 
##        Denmark    EastGermany        Finland         France         Greece 
##              2              4              5              2              1 
##        Hungary        Ireland          Italy    Netherlands         Norway 
##              1              2              1              2              2 
##         Poland       Portugal        Romania          Spain         Sweden 
##              1              4              3              4              2 
##    Switzerland             UK           USSR    WestGermany     Yugoslavia 
##              2              2              1              2              3

The protein compositions across countries are grouped in five clusters with Finland in a cluster (purple), cluster 4 (blue) consists of Portugal, Spain and EastGermany, cluster 3 (red) consists of Bulgaria, Romania ad Yugoslavia, while other countires are divided between cluster 2 (skyblue) and 1(green).

stars(protein[h_pr$order,],nrow=5, col.stars=cutree(h_pr,5)[h_pr$order]+1)

sumPartition(protein,cutree(h_pr,5))
## Final Partition
## 
## Number of clusters  5
## 
##           N.obs Within.clus.SS Ave.dist..Centroid Max.dist.centroid
## Cluster 1     7       416.9914           7.539639          9.848920
## Cluster 2    11       488.5873           6.567144          8.128696
## Cluster 3     3        47.0000           3.874619          4.838388
## Cluster 4     3       148.6067           6.857188          8.562969
## Cluster 5     1         0.0000           0.000000          0.000000
## 
## 
## Cluster centroids
## 
##           Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Grand centrd
## RedMeat   8.642857  12.32727  6.133333  7.233333  9.5       9.828       
## WhiteMeat 6.871429  9.854545  5.766667  6.233333  4.9       7.896       
## Eggs      2.385714  3.8       1.433333  2.633333  2.7       2.936       
## Milk      14.04286  22.02727  9.633333  8.2       33.7      17.112      
## Fish      2.542857  4.918182  0.9333333 8.866667  5.8       4.284       
## Cereals   39.27143  23.81818  54.06667  26.93333  26.3      32.248      
## Starch    3.742857  4.572727  2.4       6.033333  5.1       4.276       
## Nuts      4.214286  1.836364  4.9       3.8       1         3.072       
## Fr.Veg    4.657143  3.681818  3.4       6.233333  1.4       4.136       
## 
## 
## Distances between Cluster centroids
## 
##           Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5
## Cluster 1   0.00000  18.43813  15.91266  15.38557  24.34692
## Cluster 2  18.43813   0.00000  34.04882  16.41372  13.53236
## Cluster 3  15.91266  34.04882   0.00000  28.74920  37.60408
## Cluster 4  15.38557  16.41372  28.74920   0.00000  26.43951
## Cluster 5  24.34692  13.53236  37.60408  26.43951   0.00000

Countires in cluster 2 have higher protein compositions of red/white-meat and eggs above the centroid, other clusters with these compositions are below the centroid.There are large variations in the clusters for other types of protein compositions. While cluster 2 is the most spread out and cluster 5 is the least spread out, Finland protein compositions seem to be closest to that of the countires in cluster 2 and farthest to the countries in cluster 3.