Q3a
eupop <- read.table('/rstudio_files/ST464/data/eupop.txt', header=T, row.names=1)
eupop <- eupop[,-5]
d_eu <- dist(eupop,"euclidean")
h_eu <- hclust(d_eu,"average")
d2_eu = color_branches((as.dendrogram(h_eu)),k=3,col=c(2,3,4))
plot(d2_eu)

The countries are grouped into three clusters. Ireland belongs to the cluster in red (Cluster 3), Spain, Germany, Greece and Italy belong to cluster 2 (blue colour) while other countires are in cluster 1 (green colour). Ireland was found ina cluster with the highest percentage of population with the age group 0 - 14yrs and lowest population from 45yrs and above. the next is cluster 2 while the last group in green has the highest population across the age groups.
Q3b
source('~/rstudio_files/ST464/sumPartition.R')
sumPartition(eupop,cutree(h_eu,3))
## Final Partition
##
## Number of clusters 3
##
## N.obs Within.clus.SS Ave.dist..Centroid Max.dist.centroid
## Cluster 1 10 55.305 2.211730 4.030199
## Cluster 2 4 17.535 1.831295 3.117090
## Cluster 3 1 0.000 0.000000 0.000000
##
##
## Cluster centroids
##
## Cluster 1 Cluster 2 Cluster 3 Grand centrd
## p014 18.23 15.25 22.2 17.7
## p1544 42.52 43.775 46.2 43.1
## p4564 23.94 24.25 20.3 23.78
## p65. 15.36 16.725 11.3 15.45333
##
##
## Distances between Cluster centroids
##
## Cluster 1 Cluster 2 Cluster 3
## Cluster 1 0.000000 3.523457 7.683521
## Cluster 2 3.523457 0.000000 9.960735
## Cluster 3 7.683521 9.960735 0.000000
Cluster 3 has percentage population within 0 - 44yrs above the centroid and 45yrs and above are found below the centroid. cluster 1 is the most spread out. Lastly, cluster 1 and 2 are the closest while cluster 2 and 3 are farthest away.
Q3c
km_eu <- kmeans(eupop, 3,nstart=10)
cluskm_eu <- km_eu$cluster
ord <- order(cluskm_eu)
stars(eupop[ord,],nrow=3, col.stars=cluskm_eu[ord]+1)

Using the kmeans, Ireland is again in red cluster, cluster blue consists of Portugal, Germany, Austria, Greece, Italy, and Spain. other countires are in cluster green.
Q3d
par(mfrow=c(1,2))
stars(eupop[h_eu$order,],nrow=3, col.stars=cutree(h_eu,3)[h_eu$order]+1, main = "hCluster") # hcluster plot
stars(eupop[ord,],nrow=3, col.stars=cluskm_eu[ord]+1, main = "kmeans") # kmeans cluster plot

Interestingly the number of countries in cluster 2 of hcluster increased from 4 to 6 in kmeans cluster while the countires in cluster 1 in hcluster decreased from 10 to 8 in kmeans.
Q4a
music <- read.csv('/rstudio_files/ST464/data/music.csv')
a <- apply(music[,4:8],2,median)
a
## LVar LAve LMax LFEner LFreq
## 8.210359e+06 -5.662044e+00 2.443100e+04 1.043496e+02 1.752937e+02
b <- apply(music[,4:8],2,mad, constant=1)
b
## LVar LAve LMax LFEner LFreq
## 7.074681e+06 6.342972e+00 5.939500e+03 2.988300e+00 1.094850e+02
music2 <- scale(music[,4:8],a,b)
head(music2)
## LVar LAve LMax LFEner LFreq
## [1,] 1.32732438 -13.29737 0.9243202 0.5258240 -1.0569481
## [2,] 0.18837063 -11.05234 0.5379241 -0.5064652 -1.0669356
## [3,] 0.11860927 -14.56744 0.3267952 -0.6775591 -0.4630751
## [4,] -0.09228988 -13.37055 0.7520835 -0.9146170 -1.1556708
## [5,] -0.27253145 -13.13116 0.5907905 -1.3549075 -0.9249976
## [6,] -0.50101085 -9.98882 0.1852008 -1.3724024 -0.8575818
km1 <- kmeans(music2, 1,nstart=25) ### The code is repeated for k = 1, 2, 3, ..., 15
km1$tot.withinss ###### to obtain TWSS for each run
## [1] 4728.111
k_twss <- matrix(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,4728.11,2878.15,1776.08,1004.15, 723.02,552.29,459.65,394.04,334.75,289.43,250.53,213.58, 190.72,174.54,160.10), nrow=15)
colnames(k_twss) <- c("k", "TWSS")
k_twss
## k TWSS
## [1,] 1 4728.11
## [2,] 2 2878.15
## [3,] 3 1776.08
## [4,] 4 1004.15
## [5,] 5 723.02
## [6,] 6 552.29
## [7,] 7 459.65
## [8,] 8 394.04
## [9,] 9 334.75
## [10,] 10 289.43
## [11,] 11 250.53
## [12,] 12 213.58
## [13,] 13 190.72
## [14,] 14 174.54
## [15,] 15 160.10
plot(k_twss, ylim=c(100,5000), xlab= "Number of cluster k", ylab="TWSS", main="k vs TWSS")

The TWSS dropped sharply from 4728 in cluster k = 1 to 723 in k = 5, and then continue to decrease gradually. The best fitting number of clusters would be at k = 5.
Q4b
km5 <- kmeans(music2, 5,nstart=10)
tab <- table(music$Artist,km5$cluster)
tab
##
## 1 2 3 4 5
## Abba 0 0 0 0 10
## Beatles 0 0 3 7 0
## Beethoven 0 0 8 0 0
## Eels 0 0 3 7 0
## Enya 0 0 3 0 0
## Mozart 0 0 6 0 0
## Vivaldi 1 3 6 0 0
Q5
protein <- read.csv('/rstudio_files/ST464/data/protein.csv', header=T, row.names=1)
head(protein)
## RedMeat WhiteMeat Eggs Milk Fish Cereals Starch Nuts Fr.Veg
## Albania 10.1 1.4 0.5 8.9 0.2 42.3 0.6 5.5 1.7
## Austria 8.9 14.0 4.3 19.9 2.1 28.0 3.6 1.3 4.3
## Belgium 13.5 9.3 4.1 17.5 4.5 26.6 5.7 2.1 4.0
## Bulgaria 7.8 6.0 1.6 8.3 1.2 56.7 1.1 3.7 4.2
## Czechoslovakia 9.7 11.4 2.8 12.5 2.0 34.3 5.0 1.1 4.0
## Denmark 10.6 10.8 3.7 25.0 9.9 21.9 4.8 0.7 2.4
pr <- dist(protein,"euclidean")
h_pr <- hclust(pr,"average")
dpr=color_branches((as.dendrogram(h_pr)),k=5,col=c(2,3,4,6,5))
plot(dpr)

cutree(h_pr,5)
## Albania Austria Belgium Bulgaria Czechoslovakia
## 1 2 2 3 1
## Denmark EastGermany Finland France Greece
## 2 4 5 2 1
## Hungary Ireland Italy Netherlands Norway
## 1 2 1 2 2
## Poland Portugal Romania Spain Sweden
## 1 4 3 4 2
## Switzerland UK USSR WestGermany Yugoslavia
## 2 2 1 2 3
The protein compositions across countries are grouped in five clusters with Finland in a cluster (purple), cluster 4 (blue) consists of Portugal, Spain and EastGermany, cluster 3 (red) consists of Bulgaria, Romania ad Yugoslavia, while other countires are divided between cluster 2 (skyblue) and 1(green).
stars(protein[h_pr$order,],nrow=5, col.stars=cutree(h_pr,5)[h_pr$order]+1)

sumPartition(protein,cutree(h_pr,5))
## Final Partition
##
## Number of clusters 5
##
## N.obs Within.clus.SS Ave.dist..Centroid Max.dist.centroid
## Cluster 1 7 416.9914 7.539639 9.848920
## Cluster 2 11 488.5873 6.567144 8.128696
## Cluster 3 3 47.0000 3.874619 4.838388
## Cluster 4 3 148.6067 6.857188 8.562969
## Cluster 5 1 0.0000 0.000000 0.000000
##
##
## Cluster centroids
##
## Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Grand centrd
## RedMeat 8.642857 12.32727 6.133333 7.233333 9.5 9.828
## WhiteMeat 6.871429 9.854545 5.766667 6.233333 4.9 7.896
## Eggs 2.385714 3.8 1.433333 2.633333 2.7 2.936
## Milk 14.04286 22.02727 9.633333 8.2 33.7 17.112
## Fish 2.542857 4.918182 0.9333333 8.866667 5.8 4.284
## Cereals 39.27143 23.81818 54.06667 26.93333 26.3 32.248
## Starch 3.742857 4.572727 2.4 6.033333 5.1 4.276
## Nuts 4.214286 1.836364 4.9 3.8 1 3.072
## Fr.Veg 4.657143 3.681818 3.4 6.233333 1.4 4.136
##
##
## Distances between Cluster centroids
##
## Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5
## Cluster 1 0.00000 18.43813 15.91266 15.38557 24.34692
## Cluster 2 18.43813 0.00000 34.04882 16.41372 13.53236
## Cluster 3 15.91266 34.04882 0.00000 28.74920 37.60408
## Cluster 4 15.38557 16.41372 28.74920 0.00000 26.43951
## Cluster 5 24.34692 13.53236 37.60408 26.43951 0.00000
Countires in cluster 2 have higher protein compositions of red/white-meat and eggs above the centroid, other clusters with these compositions are below the centroid.There are large variations in the clusters for other types of protein compositions. While cluster 2 is the most spread out and cluster 5 is the least spread out, Finland protein compositions seem to be closest to that of the countires in cluster 2 and farthest to the countries in cluster 3.