ST464_Assignment1

Q3a

library(dendextend)  # provides color_branches()

eupop <- read.table('/rstudio_files/ST464/data/eupop.txt', header=TRUE, row.names=1)
eupop <- eupop[,-5]  # drop the fifth column
d_eu <- dist(eupop,"euclidean")
h_eu <- hclust(d_eu,"average")  # average-linkage hierarchical clustering

d2_eu <- color_branches(as.dendrogram(h_eu),k=3,col=c(2,3,4))
plot(d2_eu)

The countries are grouped into three clusters. Ireland forms the red cluster (cluster 3) on its own; Spain, Germany, Greece and Italy form cluster 2 (blue); and the other countries are in cluster 1 (green). Ireland has the highest percentage of population aged 0-14 and the lowest percentages aged 45 and over. Cluster 2 has the lowest percentage aged 0-14 and the highest percentages aged 45 and over, and cluster 1 (green) lies between the two on these age groups (see the cluster centroids in Q3b).
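To list which observations fall in each cluster without reading them off the dendrogram, the labels from cutree() can be split by group. The following is a minimal sketch on made-up data; the data frame toy and its values are hypothetical and are not the eupop data.

```r
# Toy data (hypothetical values, not the eupop data): 6 observations, 2 variables
toy <- data.frame(
  x = c(1.0, 1.2, 0.9, 5.0, 5.1, 9.0),
  y = c(1.0, 0.8, 1.1, 5.0, 4.9, 9.0),
  row.names = c("A", "B", "C", "D", "E", "F")
)

# Average-linkage hierarchical clustering on Euclidean distances
h_toy <- hclust(dist(toy, method = "euclidean"), method = "average")

# Cut the tree into 3 groups; the result is a named vector of cluster labels
grp <- cutree(h_toy, k = 3)

# List the members of each cluster
members <- split(names(grp), grp)
members
```

With the objects in this report, the same idea would read split(names(cutree(h_eu,3)), cutree(h_eu,3)).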

Q3b

source('~/rstudio_files/ST464/sumPartition.R')
sumPartition(eupop,cutree(h_eu,3))

## Final Partition
## 
## Number of clusters  3
## 
##           N.obs Within.clus.SS Ave.dist..Centroid Max.dist.centroid
## Cluster 1    10         55.305           2.211730          4.030199
## Cluster 2     4         17.535           1.831295          3.117090
## Cluster 3     1          0.000           0.000000          0.000000
## 
## 
## Cluster centroids
## 
##       Cluster 1 Cluster 2 Cluster 3 Grand centrd
## p014  18.23     15.25     22.2      17.7        
## p1544 42.52     43.775    46.2      43.1        
## p4564 23.94     24.25     20.3      23.78       
## p65.  15.36     16.725    11.3      15.45333    
## 
## 
## Distances between Cluster centroids
## 
##           Cluster 1 Cluster 2 Cluster 3
## Cluster 1  0.000000  3.523457  7.683521
## Cluster 2  3.523457  0.000000  9.960735
## Cluster 3  7.683521  9.960735  0.000000

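Cluster 3 (Ireland) has population percentages above the grand centroid for ages 0-44 and below it for ages 45 and over. Cluster 1 is the most spread out, with the largest within-cluster sum of squares (55.3) and the largest average distance to its centroid (2.21). Clusters 1 and 2 are the closest pair (centroid distance 3.52), while clusters 2 and 3 are the farthest apart (9.96).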
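The quantities discussed above can be reproduced in base R. The sketch below is an illustration only: cluster_summary() is a hypothetical helper written for this note, not the course's sumPartition(), and it assumes the summary columns mean the within-cluster sum of squares, the average and maximum Euclidean distance to the cluster centroid, and the Euclidean distances between centroids.

```r
# Hypothetical helper illustrating the summary columns; not the course's sumPartition()
cluster_summary <- function(x, cl) {
  x <- as.matrix(x)
  groups <- sort(unique(cl))
  # Centroid = column means of the rows in each cluster (one row per cluster)
  cents <- do.call(rbind, lapply(groups, function(g) {
    colMeans(x[cl == g, , drop = FALSE])
  }))
  rownames(cents) <- paste("Cluster", groups)
  # Euclidean distance of every observation to its own cluster centroid
  d <- sqrt(rowSums((x - cents[match(cl, groups), , drop = FALSE])^2))
  list(
    within_ss     = tapply(d^2, cl, sum),
    ave_dist      = tapply(d, cl, mean),
    max_dist      = tapply(d, cl, max),
    centroid_dist = as.matrix(dist(cents))
  )
}

# Made-up example: two points in cluster 1, one point in cluster 2
x_toy <- rbind(c(0, 0), c(2, 0), c(10, 0))
s <- cluster_summary(x_toy, c(1, 1, 2))
s$within_ss
s$centroid_dist
```

A single-member cluster, like Ireland's, always has zero within-cluster sum of squares and zero distances to its centroid, because the centroid is the observation itself.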

Q3c

km_eu <- kmeans(eupop, 3,nstart=10)
cluskm_eu <- km_eu$cluster
ord <- order(cluskm_eu)
stars(eupop[ord,],nrow=3, col.stars=cluskm_eu[ord]+1)

Using k-means, Ireland is again in the red cluster. The blue cluster consists of Portugal, Germany, Austria, Greece, Italy and Spain, and the other countries are in the green cluster.

Q3d

par(mfrow=c(1,2))
stars(eupop[h_eu$order,],nrow=3, col.stars=cutree(h_eu,3)[h_eu$order]+1, main = "hCluster") # hcluster plot
stars(eupop[ord,],nrow=3, col.stars=cluskm_eu[ord]+1, main = "kmeans") # kmeans cluster plot

Interestingly, cluster 2 grows from 4 countries under hierarchical clustering to 6 under k-means, while cluster 1 shrinks from 10 to 8: Portugal and Austria move from cluster 1 to cluster 2.
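A cross-tabulation of the two sets of labels is one way to make this comparison without counting stars by eye. The sketch below uses simulated data; the matrix toy, the seed and the group means are assumptions chosen for illustration.

```r
set.seed(1)  # k-means starts from random centres, so fix the seed for repeatability

# Toy data (hypothetical): three well-separated groups of 5 points each
toy <- rbind(
  matrix(rnorm(10, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(10, mean = 5,  sd = 0.3), ncol = 2),
  matrix(rnorm(10, mean = 10, sd = 0.3), ncol = 2)
)

hc_lab <- cutree(hclust(dist(toy), "average"), k = 3)
km_lab <- kmeans(toy, centers = 3, nstart = 10)$cluster

# Rows: hierarchical clusters; columns: k-means clusters
agree <- table(hclust = hc_lab, kmeans = km_lab)
agree
```

Cluster numbers from kmeans() are arbitrary, so agreement shows up as one non-zero cell per row and column rather than as a filled diagonal. With the objects in this report the call would be table(cutree(h_eu,3), cluskm_eu).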

Q4a

music <- read.csv('/rstudio_files/ST464/data/music.csv')
a <- apply(music[,4:8],2,median)
a

##          LVar          LAve          LMax        LFEner         LFreq 
##  8.210359e+06 -5.662044e+00  2.443100e+04  1.043496e+02  1.752937e+02

b <- apply(music[,4:8],2,mad, constant=1)
b

##         LVar         LAve         LMax       LFEner        LFreq 
## 7.074681e+06 6.342972e+00 5.939500e+03 2.988300e+00 1.094850e+02

music2 <- scale(music[,4:8],a,b)
head(music2)

##             LVar      LAve      LMax     LFEner      LFreq
## [1,]  1.32732438 -13.29737 0.9243202  0.5258240 -1.0569481
## [2,]  0.18837063 -11.05234 0.5379241 -0.5064652 -1.0669356
## [3,]  0.11860927 -14.56744 0.3267952 -0.6775591 -0.4630751
## [4,] -0.09228988 -13.37055 0.7520835 -0.9146170 -1.1556708
## [5,] -0.27253145 -13.13116 0.5907905 -1.3549075 -0.9249976
## [6,] -0.50101085  -9.98882 0.1852008 -1.3724024 -0.8575818
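As a quick sanity check on the scaling step, a column centred by its median and divided by its raw MAD (constant = 1) should end up with median 0 and MAD 1. The sketch below checks this on simulated data; the matrix x and its column names are made up for illustration and merely stand in for music[,4:8].

```r
set.seed(1)

# Hypothetical skewed data standing in for music[, 4:8]
x <- matrix(rexp(200), ncol = 4, dimnames = list(NULL, paste0("V", 1:4)))

med <- apply(x, 2, median)
md  <- apply(x, 2, mad, constant = 1)   # raw median absolute deviation

# Robust standardisation: subtract the median, divide by the MAD
z <- scale(x, center = med, scale = md)

z_median <- apply(z, 2, median)
z_mad    <- apply(z, 2, mad, constant = 1)
round(z_median, 10)
round(z_mad, 10)
```

Unlike scaling by mean and standard deviation, this keeps a few extreme tracks from dominating the centre and scale used for each variable.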

km1 <- kmeans(music2, 1,nstart=25)  # repeated for k = 1, 2, 3, ..., 15
km1$tot.withinss                    # to obtain the TWSS for each run

## [1] 4728.111

# TWSS values recorded from the runs for k = 1, ..., 15
k_twss <- cbind(k = 1:15,
                TWSS = c(4728.11, 2878.15, 1776.08, 1004.15, 723.02,
                         552.29, 459.65, 394.04, 334.75, 289.43,
                         250.53, 213.58, 190.72, 174.54, 160.10))
k_twss

##        k    TWSS
##  [1,]  1 4728.11
##  [2,]  2 2878.15
##  [3,]  3 1776.08
##  [4,]  4 1004.15
##  [5,]  5  723.02
##  [6,]  6  552.29
##  [7,]  7  459.65
##  [8,]  8  394.04
##  [9,]  9  334.75
## [10,] 10  289.43
## [11,] 11  250.53
## [12,] 12  213.58
## [13,] 13  190.72
## [14,] 14  174.54
## [15,] 15  160.10

plot(k_twss, ylim=c(100,5000), xlab= "Number of clusters k", ylab="TWSS", main="k vs TWSS")

The TWSS drops sharply from 4728 at k = 1 to 723 at k = 5 and then continues to decrease only gradually, so k = 5 is a reasonable choice for the number of clusters.
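The repeated runs for each k can also be written as a loop, which avoids copying the TWSS values by hand. This is a minimal sketch on simulated data; the matrix toy, the seed and the range of k are assumptions for illustration, with toy standing in for music2.

```r
set.seed(464)  # arbitrary seed so the random starts are repeatable

# Hypothetical stand-in for music2: 60 points around 3 centres in 2 dimensions
toy <- rbind(
  matrix(rnorm(40, mean = 0),  ncol = 2),
  matrix(rnorm(40, mean = 6),  ncol = 2),
  matrix(rnorm(40, mean = 12), ncol = 2)
)

# Total within-cluster sum of squares for each candidate k
ks <- 1:8
twss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)

k_twss_toy <- cbind(k = ks, TWSS = twss)
plot(k_twss_toy, type = "b", xlab = "Number of clusters k", ylab = "TWSS")
```

Applied to this report, toy would be replaced by music2 and ks by 1:15. Because k-means uses random starts, the values may differ slightly from the table above unless the same seed is used.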

Q4b

km5 <- kmeans(music2, 5,nstart=10)
tab <- table(music$Artist,km5$cluster)
tab

##            
##              1  2  3  4  5
##   Abba       0  0  0  0 10
##   Beatles    0  0  3  7  0
##   Beethoven  0  0  8  0  0
##   Eels       0  0  3  7  0
##   Enya       0  0  3  0  0
##   Mozart     0  0  6  0  0
##   Vivaldi    1  3  6  0  0

Q5

protein <- read.csv('/rstudio_files/ST464/data/protein.csv', header=TRUE, row.names=1)
head(protein)

##                RedMeat WhiteMeat Eggs Milk Fish Cereals Starch Nuts Fr.Veg
## Albania           10.1       1.4  0.5  8.9  0.2    42.3    0.6  5.5    1.7
## Austria            8.9      14.0  4.3 19.9  2.1    28.0    3.6  1.3    4.3
## Belgium           13.5       9.3  4.1 17.5  4.5    26.6    5.7  2.1    4.0
## Bulgaria           7.8       6.0  1.6  8.3  1.2    56.7    1.1  3.7    4.2
## Czechoslovakia     9.7      11.4  2.8 12.5  2.0    34.3    5.0  1.1    4.0
## Denmark           10.6      10.8  3.7 25.0  9.9    21.9    4.8  0.7    2.4

pr <- dist(protein,"euclidean")
h_pr <- hclust(pr,"average")
dpr <- color_branches(as.dendrogram(h_pr),k=5,col=c(2,3,4,6,5))  # color_branches() comes from dendextend
plot(dpr)

cutree(h_pr,5)

##        Albania        Austria        Belgium       Bulgaria Czechoslovakia 
##              1              2              2              3              1 
##        Denmark    EastGermany        Finland         France         Greece 
##              2              4              5              2              1 
##        Hungary        Ireland          Italy    Netherlands         Norway 
##              1              2              1              2              2 
##         Poland       Portugal        Romania          Spain         Sweden 
##              1              4              3              4              2 
##    Switzerland             UK           USSR    WestGermany     Yugoslavia 
##              2              2              1              2              3

The countries are grouped into five clusters by protein composition. Finland forms a cluster on its own (purple); cluster 4 (blue) consists of Portugal, Spain and EastGermany; cluster 3 (red) consists of Bulgaria, Romania and Yugoslavia; and the other countries are divided between cluster 2 (sky blue) and cluster 1 (green).

stars(protein[h_pr$order,],nrow=5, col.stars=cutree(h_pr,5)[h_pr$order]+1)

sumPartition(protein,cutree(h_pr,5))

## Final Partition
## 
## Number of clusters  5
## 
##           N.obs Within.clus.SS Ave.dist..Centroid Max.dist.centroid
## Cluster 1     7       416.9914           7.539639          9.848920
## Cluster 2    11       488.5873           6.567144          8.128696
## Cluster 3     3        47.0000           3.874619          4.838388
## Cluster 4     3       148.6067           6.857188          8.562969
## Cluster 5     1         0.0000           0.000000          0.000000
## 
## 
## Cluster centroids
## 
##           Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Grand centrd
## RedMeat   8.642857  12.32727  6.133333  7.233333  9.5       9.828       
## WhiteMeat 6.871429  9.854545  5.766667  6.233333  4.9       7.896       
## Eggs      2.385714  3.8       1.433333  2.633333  2.7       2.936       
## Milk      14.04286  22.02727  9.633333  8.2       33.7      17.112      
## Fish      2.542857  4.918182  0.9333333 8.866667  5.8       4.284       
## Cereals   39.27143  23.81818  54.06667  26.93333  26.3      32.248      
## Starch    3.742857  4.572727  2.4       6.033333  5.1       4.276       
## Nuts      4.214286  1.836364  4.9       3.8       1         3.072       
## Fr.Veg    4.657143  3.681818  3.4       6.233333  1.4       4.136       
## 
## 
## Distances between Cluster centroids
## 
##           Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5
## Cluster 1   0.00000  18.43813  15.91266  15.38557  24.34692
## Cluster 2  18.43813   0.00000  34.04882  16.41372  13.53236
## Cluster 3  15.91266  34.04882   0.00000  28.74920  37.60408
## Cluster 4  15.38557  16.41372  28.74920   0.00000  26.43951
## Cluster 5  24.34692  13.53236  37.60408  26.43951   0.00000

Countries in cluster 2 are above the grand centroid for red meat, white meat and eggs, while every other cluster is below it for these three protein sources. The other protein sources vary widely between clusters. Cluster 2 has the largest within-cluster sum of squares (488.6), but it is also the largest cluster (11 countries); by average distance to the centroid, cluster 1 is the most spread out (7.54). Cluster 5 contains only Finland, so it has no spread. Finland's protein composition is closest to that of the countries in cluster 2 (centroid distance 13.53) and farthest from that of cluster 3 (37.60).

Kazeem Ishola 17252302
