Create 12 data vectors to represent each of the following. \[ \underline{a}=(3986,1943,444)\\ \underline{b}=(4271,1444,344)\\ \underline{c}=(8443,2020,415)\\ \underline{d}=(1898,1877,311)\\ \underline{e}=(2916,1253,289)\\ \underline{f}=(7476,1583,347)\\ \underline{g}=(3690,2131,491)\\ \underline{h}=(1355,1837,265)\\ \underline{i}=(7251,1492,306)\\ \underline{j}=(6498,1683,405)\\ \underline{k}=(5211,1298,367)\\ \underline{l}=(8001,1950,392)\\ \]
Use the R-function rbind() to create a matrix representing these vectors in a single data structure.
Use the scale() function to rescale the matrix so that all vector dimensions are of a similar order of magnitude.
Create a scatter plot of the data in dimensions 1 and 2 of this scaled data structure.
Using this scatter plot, identify any potential clusters among the 12 vectors.
Part 1. We begin by creating a list of data structures corresponding to the data objects given.
a<-c(3986,1943,444)
b<-c(4271,1444,344)
c<-c(8443,2020,415)
d<-c(1898,1877,311)
e<-c(2916,1253,289)
f<-c(7476,1583,347)
g<-c(3690,2131,491)
h<-c(1355,1837,265)
i<-c(7251,1492,306)
j<-c(6498,1683,405)
k<-c(5211,1298,367)
l<-c(8001,1950,392)
Part 2. The single data structure is given by:
Data<-rbind(a,b,c,d,e,f,g,h,i,j,k,l)
Data
[,1] [,2] [,3]
a 3986 1943 444
b 4271 1444 344
c 8443 2020 415
d 1898 1877 311
e 2916 1253 289
f 7476 1583 347
g 3690 2131 491
h 1355 1837 265
i 7251 1492 306
j 6498 1683 405
k 5211 1298 367
l 8001 1950 392
Part 3. The scaled data structure is given by:
Data_S<-scale(Data,center=F,scale=c(1000,1000,100))
Data_S
[,1] [,2] [,3]
a 3.986 1.943 4.44
b 4.271 1.444 3.44
c 8.443 2.020 4.15
d 1.898 1.877 3.11
e 2.916 1.253 2.89
f 7.476 1.583 3.47
g 3.690 2.131 4.91
h 1.355 1.837 2.65
i 7.251 1.492 3.06
j 6.498 1.683 4.05
k 5.211 1.298 3.67
l 8.001 1.950 3.92
attr(,"scaled:scale")
[1] 1000 1000 100
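As a quick check on what scale() does here: with center=F and a vector of divisors, it simply divides each column by the matching divisor. The sketch below reproduces this with sweep() for the first three vectors only. The name Data_sweep is introduced purely for illustration.

```r
# First three vectors from Part 1, bound into a matrix with row names a, b, c
Data <- rbind(a = c(3986, 1943, 444),
              b = c(4271, 1444, 344),
              c = c(8443, 2020, 415))

# Scaling as in Part 3: no centring, one divisor per column
Data_S <- scale(Data, center = FALSE, scale = c(1000, 1000, 100))

# Same operation written explicitly: divide each column (MARGIN = 2)
# by the matching divisor
Data_sweep <- sweep(Data, 2, c(1000, 1000, 100), "/")

# scale() also stores a "scaled:scale" attribute, so compare values only
all.equal(Data_sweep, Data_S, check.attributes = FALSE)
```

The divisors 1000, 1000 and 100 are chosen by eye so that every column ends up with values of roughly the same size.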
Part 4. The scatter plot of the scaled data structure in dimensions 1 and 2 is given by:
plot(Data_S[,1],Data_S[,2],col='steelblue',pch=15,xlab='Scaled Dimension 1',ylab='Scaled Dimension 2',main='Scatter Plot for Data in Dimensions 1 and 2')
text(x=Data_S[,1]+0.2,y=Data_S[,2],labels=c('a','b','c','d','e','f','g','h','i','j','k','l'),col='darkslategray')
Part 5. It appears from this scatter plot that the data is clustered as \[(a,g), (b,k), (c,l), (d,h), e, (f,i,j).\]
Remark: While this clustering appears natural when we use the scatter plot, we must also consider that we are only comparing the data vectors along two dimensions.
When we compare the data vectors along another two dimensions we may get a completely different clustering.
When we perform a full cluster analysis, we take into account all dimensions of the objects in the data-set.
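To explore the remark, the same plotting approach can be repeated for a different pair of dimensions. The following is a minimal sketch for dimensions 1 and 3, rebuilding the scaled data so that it runs on its own. The axis labels and title are illustrative choices.

```r
# Rebuild the scaled data from Parts 1 to 3
Data <- rbind(
  a = c(3986, 1943, 444), b = c(4271, 1444, 344), c = c(8443, 2020, 415),
  d = c(1898, 1877, 311), e = c(2916, 1253, 289), f = c(7476, 1583, 347),
  g = c(3690, 2131, 491), h = c(1355, 1837, 265), i = c(7251, 1492, 306),
  j = c(6498, 1683, 405), k = c(5211, 1298, 367), l = c(8001, 1950, 392)
)
Data_S <- scale(Data, center = FALSE, scale = c(1000, 1000, 100))

# Scatter plot along dimensions 1 and 3 instead of 1 and 2
plot(Data_S[, 1], Data_S[, 3], col = 'steelblue', pch = 15,
     xlab = 'Scaled Dimension 1', ylab = 'Scaled Dimension 3',
     main = 'Scatter Plot for Data in Dimensions 1 and 3')

# Label each point with its row name, shifted slightly to the right
text(x = Data_S[, 1] + 0.2, y = Data_S[, 3],
     labels = rownames(Data_S), col = 'darkslategray')
```

Comparing this plot with the one for dimensions 1 and 2 shows whether the groups picked out by eye are stable across views.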
Eight makes of car were compared using three different criteria: price, engine size and efficiency.
The data collected are given in the table below:
| Make | Price (euro) | Engine (cc) | Efficiency |
|---|---|---|---|
| Audi | 38812 | 1968 | 22.7 |
| BMW | 35571 | 1995 | 23.7 |
| Citroen | 20451 | 1560 | 24.7 |
| Hyundai | 23620 | 1685 | 23.7 |
| Jaguar | 53693 | 1999 | 26.5 |
| Mercedes | 41909 | 1950 | 25.5 |
| Mitsubishi | 28192 | 2268 | 18.8 |
| Toyota | 27978 | 1995 | 21.6 |
Using the data in this table, answer the following:
Create 8 data vectors to represent each car make.
Combine these data vectors using the rbind() function, to create a single data structure.
Rescale the dimensions of this data structure, so that each dimension is of the same order of magnitude.
Create a scatter-plot of this data along dimensions 1 and 2 of this data set.
Use this scatter plot to identify potential clusters in the data-set.
Create a scatter-plot of this data along dimensions 1 and 3 of this data set.
Use this scatter plot to identify potential clusters in the data-set.
Do the clusters match?
To implement a clustering algorithm on a set of data vectors, we must find the distances between those data vectors, using a metric such as the Euclidean or Manhattan metric.
Once we have these distances we can use a clustering algorithm such as Complete Linkage or Single Linkage to identify the clusters in the data-set.
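To make these two ideas concrete, the sketch below computes the Manhattan and Euclidean distances between the scaled vectors a and b from Example 1 by hand and with dist(). It then illustrates how Single Linkage and Complete Linkage would measure the distance between two candidate clusters, here taken to be (a,g) and (b,k) purely for illustration. The variable names are assumptions introduced for this example.

```r
# Scaled vectors from Example 1
pts <- rbind(a = c(3.986, 1.943, 4.44),
             g = c(3.690, 2.131, 4.91),
             b = c(4.271, 1.444, 3.44),
             k = c(5.211, 1.298, 3.67))

# Manhattan distance: sum of absolute coordinate differences
manhattan_ab <- sum(abs(pts["a", ] - pts["b", ]))  # 0.285 + 0.499 + 1.000 = 1.784

# Euclidean distance: square root of the sum of squared differences
euclidean_ab <- sqrt(sum((pts["a", ] - pts["b", ])^2))

# The same distances for all pairs, as full symmetric matrices
D_M <- as.matrix(dist(pts, method = 'manhattan'))
D_E <- as.matrix(dist(pts, method = 'euclidean'))

# Distances between members of cluster (a,g) and members of cluster (b,k)
between <- D_M[c("a", "g"), c("b", "k")]

# Single Linkage uses the closest pair, Complete Linkage the farthest pair
single_link   <- min(between)  # 1.784, the pair (a,b)
complete_link <- max(between)  # 3.594, the pair (g,k)
```

The two linkage rules can therefore disagree on which clusters are closest, which is why the choice of method changes the resulting dendrogram.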
Using the scaled data structure from Example 1, answer the following:
Find the Manhattan distances between these data vectors.
Using these Manhattan distances, use the function hclust() to perform a cluster analysis on this data using the Single Linkage method.
Use the function fviz_dend() to create a dendrogram for the data with 2, 4, 6 and 8 clusters.
Solution:
Part 1: The Manhattan distances between the scaled data vectors are given by
Dist_S_M=dist(Data_S,method='manhattan')
Dist_S_M
a b c d e f g h i j k
b 1.784
c 4.824 5.458
d 3.484 3.136 7.728
e 3.310 2.096 7.554 1.862
f 4.820 3.374 2.084 6.232 5.470
g 0.954 2.738 5.624 3.846 3.672 5.774
h 4.527 4.099 8.771 1.043 2.385 7.195 4.889
i 5.096 3.408 2.810 5.788 4.744 0.726 6.050 6.651
j 3.162 3.076 2.382 5.734 5.172 1.658 4.116 6.697 1.934
k 2.640 1.316 4.434 4.452 3.120 2.750 3.594 5.415 2.844 2.052
l 4.542 4.716 0.742 6.986 6.812 1.342 5.482 8.029 2.068 1.900 3.692
Part 2: The clustering is performed using the function hclust(), which belongs to the stats package loaded by default in R; the cluster library loaded below is not required for hclust() itself.
library(cluster)
Cluster_1_M=hclust(Dist_S_M,method='single')
Cluster_1_M
Call:
hclust(d = Dist_S_M, method = "single")
Cluster method : single
Distance : manhattan
Number of objects: 12
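The printed summary does not show which vectors end up in which cluster. One way to inspect this, sketched below, is to look at the merge heights stored in the hclust object and to cut the tree with cutree() from the stats package. The name Groups_4 is introduced here for illustration, and the data are rebuilt so the snippet runs on its own.

```r
# Rebuild the scaled data and the Manhattan distances
Data <- rbind(
  a = c(3986, 1943, 444), b = c(4271, 1444, 344), c = c(8443, 2020, 415),
  d = c(1898, 1877, 311), e = c(2916, 1253, 289), f = c(7476, 1583, 347),
  g = c(3690, 2131, 491), h = c(1355, 1837, 265), i = c(7251, 1492, 306),
  j = c(6498, 1683, 405), k = c(5211, 1298, 367), l = c(8001, 1950, 392)
)
Data_S <- scale(Data, center = FALSE, scale = c(1000, 1000, 100))
Dist_S_M <- dist(Data_S, method = 'manhattan')
Cluster_1_M <- hclust(Dist_S_M, method = 'single')

# Heights at which clusters were merged; the first merge happens at the
# smallest pairwise distance in Dist_S_M
Cluster_1_M$height

# Cut the tree into 4 clusters; the result is a named vector of cluster labels
Groups_4 <- cutree(Cluster_1_M, k = 4)

# List the members of each cluster
split(names(Groups_4), Groups_4)
```

Changing k in cutree() gives the membership for any other number of clusters between 1 and 12.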
Part 3: The function fviz_dend() is available in the factoextra library, which we import as follows:
library(factoextra)
fviz_dend(Cluster_1_M,k=2,color_labels_by_k=T, rect=T, k_colors=c("red", "dodgerblue"))
fviz_dend(Cluster_1_M,k=4,color_labels_by_k=T, rect=T, k_colors=c("red", "dodgerblue",'seagreen4','goldenrod'))
fviz_dend(Cluster_1_M,k=6,color_labels_by_k=T, rect=T, k_colors=c("red", "dodgerblue",'seagreen4','goldenrod','azure4','coral'))
fviz_dend(Cluster_1_M,k=8,color_labels_by_k=T, rect=T, k_colors=c("red", "dodgerblue",'seagreen4','goldenrod','azure4','coral', 'purple', 'firebrick'))
Using the scaled car-data vectors created in Exercise 1, answer the following:
Find the Manhattan distances between these data vectors.
Using these Manhattan distances, use the function hclust() to perform a cluster analysis on this data using the Complete Linkage method.
Use the function fviz_dend() to create a dendrogram for the data with 2, 4, 6 and 8 clusters.
Six makes of laptop were compared using three different criteria:
The data is available at
Moodle \(\rightarrow\) Data Visualisation \(\rightarrow\) Data Files \(\rightarrow\) LaptopData
Laptops<-read.csv(file.choose())
Laptops
Step 1
LaptopNames<-as.character(Laptops$Make)
LaptopNames
[1] "Dell Inspiron" "Acer Aspire" "LG Gram" "Lenovo Yoga" "Google Pixelbook" "Dell i3168"
Step 2
LaptopData<-as.data.frame(Laptops[,-1],row.names=LaptopNames)
LaptopData
Using this data structure answer the following:
Rescale this new data structure so that all dimensions have values between 0 and 10.
Find the Euclidean and Manhattan Distances between these scaled vectors.
Apply the clustering algorithm using Complete Linkage to both of these distance sets.
Plot a dendrogram with 2, 3, 4 and 5 data clusters obtained using each metric.
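One possible way to rescale every column to the range 0 to 10 is min-max rescaling, sketched below. The helper rescale_0_10() and the small data frame Example, including its column names and values, are made up for illustration and are not taken from the LaptopData file.

```r
# Hypothetical helper: map the smallest value to 0 and the largest to 10.
# Assumes the column is not constant (otherwise the denominator is zero).
rescale_0_10 <- function(x) 10 * (x - min(x)) / (max(x) - min(x))

# Made-up example data, NOT the laptop data from Moodle
Example <- data.frame(Price  = c(450, 700, 1200),
                      RAM    = c(4, 8, 16),
                      Weight = c(2.2, 1.4, 1.1),
                      row.names = c("Laptop A", "Laptop B", "Laptop C"))

# Apply the helper to every column while keeping the row names
Example_S <- Example
Example_S[] <- lapply(Example, rescale_0_10)

# The rescaled data frame can be passed to dist() and hclust() as before
Dist_E <- dist(Example_S, method = 'euclidean')
Cluster_E <- hclust(Dist_E, method = 'complete')
```

Unlike dividing by a fixed power of 10, this approach guarantees that every column spans exactly the interval from 0 to 10, at the cost of depending on the extreme values in the data.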
On Moodle, the file EuroZoneData2017.csv compares the countries of the Euro Zone (excluding Malta as not all data was available), using 4 different criteria
Moodle \(\rightarrow\) Data Visualisation \(\rightarrow\) Workbook Files \(\rightarrow\) EuroZoneData2017.csv
(Source: http://www.worldbank.org)
Using this data answer the following:
Import the data into this R workbook using read.csv().
Create an appropriate data frame from this data file which can be used by the dist() function.
Rescale the data columns of this new data frame so that all dimensions have a similar size.
Find the Manhattan distances using this re-scaled data frame.
Apply the clustering algorithms Complete Linkage and Single Linkage to these Manhattan distances.
Plot a dendrogram with 2, 10, 15 and 20 data clusters obtained using both clustering methods.