Cluster analysis allows a researcher to uncover a structure in a large set of data, without explaining why that data structure exists.
There are multiple approaches to cluster analysis, but all of them rely on comparing similar objects using multiple criteria.
Each object may be characterised by one or several features, and the values of these features are collected in a data vector.
The number of entries in these data vectors is called the dimension of the vector.
To quantify the similarity of two data objects, we use a metric to find the distance between the two objects along each of their dimensions.
Given the data objects \(\underline{x}=(x_1,x_2,\ldots,x_n)\) and \(\underline{y}=(y_1,y_2,\ldots,y_n)\), the Euclidean distance between these data objects is given by \[ \langle \underline{x},\underline{y}\rangle_{E}=\sqrt{\sum_{i=1}^{n}(x_{i}-y_{i})^2} = \sqrt{(x_1-y_1)^2+(x_2-y_2)^2+\cdots+(x_n-y_n)^2}. \]
The smaller the Euclidean distance between the data objects \(\underline{x}\) and \(\underline{y}\), the more similar the data objects are.
When the dimensions used to characterise each data object are of varying orders of magnitude, then the Euclidean metric tends to exaggerate the effects of the larger dimensions compared to the smaller dimensions. As such, it is often a good idea to re-scale all dimensions to be of the same order of magnitude.
When comparing data objects, it is necessary that each object have the same number of dimensions, i.e. it does not make sense to find the separation between the data objects \(\underline{x}=(x_1,x_2,x_3,x_4)\) and \(\underline{y}=(y_1,y_2,y_3)\).
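R does not stop you from subtracting vectors of different lengths: it recycles the shorter vector, and only warns when the longer length is not a multiple of the shorter one. A minimal sketch of a guard against this is given below. The name check_same_dimension() is our own choice for illustration; it is not a built-in R function.

```r
# Hypothetical helper: stop with an error unless a and b have the same dimension.
check_same_dimension <- function(a, b) {
  if (length(a) != length(b)) {
    stop("Data objects must have the same number of dimensions: ",
         length(a), " vs ", length(b))
  }
  invisible(TRUE)
}

check_same_dimension(c(1, 2, 3, 4), c(5, 6, 7, 8))   # passes silently
# check_same_dimension(c(1, 2, 3, 4), c(5, 6, 7))    # would stop with an error
```

Calling this helper at the start of a distance function is one way to make sure a mismatch is reported instead of being hidden by recycling.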
Given the data objects \[ \underline{x}=(3443, 561, 23.1, 0.856)\qquad \underline{y}=(5441,397,19.3, 0.472) \] answer the following:
Create 2 data vectors to represent \(\underline{x}\) and \(\underline{y}\).
Identify the number of dimensions in these vectors.
Create an R function to calculate the Euclidean distance between two arbitrary vectors.
Find the Euclidean distance without re-scaling any of the dimensions.
Find the Euclidean distance by re-scaling the dimensions as necessary.
Part 1. The data vectors are given by
x<-c(3443,561,23.1,0.856)
y<-c(5441,397,19.3,0.472)
x
[1] 3443.000 561.000 23.100 0.856
y
[1] 5441.000 397.000 19.300 0.472
Part 2. Each vector has four entries so the dimension is 4.
Part 3. To illustrate how this function should operate, we work our way from inside to outside:
x-y
[1] -1998.000 164.000 3.800 0.384
We see that x-y returns another 4-dimensional vector, whose entries are the differences in each dimension.
(x-y)^2
[1] 3.992004e+06 2.689600e+04 1.444000e+01 1.474560e-01
This has taken the value in each dimension of x-y and squared it.
sum((x-y)^2)
[1] 4018915
This has added the four numbers in (x-y)^2.
sqrt(sum((x-y)^2))
[1] 2004.723
Alternatively, we can create a function ED() (for Euclidean Distance) as follows
ED <- function(a,b){
Euclidean_Distance = sqrt(sum((a-b)^2))
return(Euclidean_Distance)
} # a and b are placeholders; you can use any symbols you wish.
The arguments a and b are placeholders in this definition; they represent the vectors we want to find the distance between. Any symbols may be used in place of a and b.
Part 4. The Euclidean distance between x and y is now calculated using ED():
ED(x,y)
[1] 2004.723
This distance is dominated by the distance between x and y along the first dimension.
Part 5. Next we re-scale each of the dimensions so they are all measured in thousands, and use the function we previously defined to calculate the Euclidean distance between these re-scaled data objects:
x1<-c(3443,5610,2310,8560)
y1<-c(5441,3970,1930,4720)
ED(x1,y1)
[1] 4644.524
In this case the distance between the data objects is measured more evenly across all dimensions.
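To see how much each dimension contributes, one can compute the share of the squared distance coming from each dimension. The sketch below uses the vectors from this example; the helper name dimension_share() is hypothetical and chosen only for illustration.

```r
x  <- c(3443, 561, 23.1, 0.856)
y  <- c(5441, 397, 19.3, 0.472)
x1 <- c(3443, 5610, 2310, 8560)   # re-scaled versions used above
y1 <- c(5441, 3970, 1930, 4720)

# Hypothetical helper: fraction of the squared Euclidean distance
# that is due to each dimension (the fractions sum to 1).
dimension_share <- function(a, b) {
  squared <- (a - b)^2
  squared / sum(squared)
}

share_raw    <- dimension_share(x, y)
share_scaled <- dimension_share(x1, y1)

round(share_raw, 3)      # shares before re-scaling
round(share_scaled, 3)   # shares after re-scaling
```

Before re-scaling, the first dimension accounts for more than 99% of the squared distance. After re-scaling, no single dimension exceeds 70%, although the fourth dimension now contributes the most, so the choice of scale still influences the result.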
Given the data objects \(\underline{x}=(x_1,x_2,\ldots,x_n)\) and \(\underline{y}=(y_1,y_2,\ldots,y_n)\), the Manhattan distance between the data objects is given by \[ \langle \underline{x}, \underline{y}\rangle_{M} = \sum_{i=1}^{n}\left\vert x_i-y_i\right\vert=\left\vert x_1-y_1\right\vert+\left\vert x_2-y_2\right\vert+\cdots+\left\vert x_n-y_n\right\vert. \]
Again, the smaller the Manhattan distance, the more similar the data objects.
As with the Euclidean metric, the Manhattan distance only makes sense when the data objects \(\underline{x}\) and \(\underline{y}\) have the same number of dimensions.
Define a function to measure the Manhattan distance between two data objects.
To do this you only have to modify the function ED() slightly.
Given the data objects
\[ \underline{u}=(1415, 1843, 992, 875)\qquad \underline{v}=(2533, 1005, 329, 176) \]
answer the following:
Identify the number of dimensions characterising the data objects \(\underline{u}\) and \(\underline{v}\).
Use the Manhattan distance function to quantify the similarity between the two data objects \(\underline{u}\) and \(\underline{v}\) (there is no need to re-scale dimensions in this case).
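One possible answer to this exercise is sketched below. The function name MD() is our own choice (by analogy with ED()); any name would do.

```r
# One possible definition: the square and square root in ED()
# are replaced by an absolute value.
MD <- function(a, b) {
  Manhattan_Distance <- sum(abs(a - b))
  return(Manhattan_Distance)
}

u <- c(1415, 1843, 992, 875)
v <- c(2533, 1005, 329, 176)

length(u)   # number of dimensions: 4
MD(u, v)    # 1118 + 838 + 663 + 699 = 3318
```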
Create 12 data vectors to represent each of the following. \[ \underline{a}=(3986,1943,444)\\ \underline{b}=(4271,1444,344)\\ \underline{c}=(8443,2020,415)\\ \underline{d}=(1898,1877,311)\\ \underline{e}=(2916,1253,289)\\ \underline{f}=(7476,1583,347)\\ \underline{g}=(3690,2131,491)\\ \underline{h}=(1355,1837,265)\\ \underline{i}=(7251,1492,306)\\ \underline{j}=(6498,1683,405)\\ \underline{k}=(5211,1298,367)\\ \underline{l}=(8001,1950,392)\\ \]
Use the R-function rbind() to create a matrix representing these vectors in a single data structure.
Use the scale()-function, so that all vector dimensions are of similar size.
Use the R-function dist() to calculate the Euclidean and Manhattan distances between these data vectors.
Use these Euclidean and Manhattan distances to create a heat map to represent these distances graphically.
Part 1. We begin by creating a list of data structures corresponding to the data objects given.
a<-c(3986,1943,444)
b<-c(4271,1444,344)
c<-c(8443,2020,415)
d<-c(1898,1877,311)
e<-c(2916,1253,289)
f<-c(7476,1583,347)
g<-c(3690,2131,491)
h<-c(1355,1837,265)
i<-c(7251,1492,306)
j<-c(6498,1683,405)
k<-c(5211,1298,367)
l<-c(8001,1950,392)
Part 2. The single data structure is given by:
Data<-rbind(a,b,c,d,e,f,g,h,i,j,k,l)
Data
[,1] [,2] [,3]
a 3986 1943 444
b 4271 1444 344
c 8443 2020 415
d 1898 1877 311
e 2916 1253 289
f 7476 1583 347
g 3690 2131 491
h 1355 1837 265
i 7251 1492 306
j 6498 1683 405
k 5211 1298 367
l 8001 1950 392
Part 3. Notice that the first two dimensions of each vector are measured in thousands, while the last dimension is measured in hundreds. To make all dimensions have the same order, we divide the first two dimensions by 1000 and the last dimension by 100. This is done using the scale()-function:
Scaled_Data<-scale(Data,center=F,scale=c(1000,1000,100))
Scaled_Data
[,1] [,2] [,3]
a 3.986 1.943 4.44
b 4.271 1.444 3.44
c 8.443 2.020 4.15
d 1.898 1.877 3.11
e 2.916 1.253 2.89
f 7.476 1.583 3.47
g 3.690 2.131 4.91
h 1.355 1.837 2.65
i 7.251 1.492 3.06
j 6.498 1.683 4.05
k 5.211 1.298 3.67
l 8.001 1.950 3.92
attr(,"scaled:scale")
[1] 1000 1000 100
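The call above divides each column by a hand-picked constant. As an alternative sketch, scale() with its default arguments centres each column on its mean and divides it by its standard deviation, so no constants need to be chosen. The matrix Data_demo below is a small illustrative subset (the first four vectors), and distances computed from standardised data will differ from those obtained with the constants used above.

```r
# Illustrative subset of the data: the first four vectors.
Data_demo <- rbind(
  a = c(3986, 1943, 444),
  b = c(4271, 1444, 344),
  c = c(8443, 2020, 415),
  d = c(1898, 1877, 311)
)

# Defaults are center = TRUE and scale = TRUE:
# each column ends up with mean 0 and standard deviation 1.
Standardised <- scale(Data_demo)

round(colMeans(Standardised), 10)   # column means (zero up to rounding)
apply(Standardised, 2, sd)          # column standard deviations
```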
Part 4. The Euclidean distances between all of these vectors are given by:
Dist_E<- dist(Scaled_Data,method="euclidean")
Dist_E
a b c d e f g h i j k
b 1.1533542
c 4.4670883 4.2710022
d 2.4764895 2.4346495 6.6286555
e 2.0058664 1.4747902 5.7204561 1.2141252
f 3.6401374 3.2081531 1.2603404 5.5973315 4.6085681
g 0.5863958 1.7234935 4.8146578 2.5526026 2.3346006 4.0875054
h 3.1839436 3.0465727 7.2472914 0.7127756 1.6838578 6.1809026 3.2628609
i 3.5732375 3.0045139 1.6993375 5.3670601 4.3449104 0.4764515 4.0634397 5.9202991
j 2.5553559 2.3213681 1.9765106 4.6990676 3.7896206 1.1414394 2.9707184 5.3323695 1.2584077
k 1.5841559 0.9786807 3.3462678 3.4095176 2.4243453 2.2916042 2.1318841 4.0248797 2.1380683 1.3960638
l 4.0485397 3.7946457 0.5031541 6.1569504 5.2348767 0.7828244 4.4269156 6.7671992 1.2295788 1.5320568 2.8760570
The Manhattan distances between these vectors are given by:
Dist_M<- dist(Scaled_Data,method="manhattan")
Dist_M
a b c d e f g h i j k
b 1.784
c 4.824 5.458
d 3.484 3.136 7.728
e 3.310 2.096 7.554 1.862
f 4.820 3.374 2.084 6.232 5.470
g 0.954 2.738 5.624 3.846 3.672 5.774
h 4.527 4.099 8.771 1.043 2.385 7.195 4.889
i 5.096 3.408 2.810 5.788 4.744 0.726 6.050 6.651
j 3.162 3.076 2.382 5.734 5.172 1.658 4.116 6.697 1.934
k 2.640 1.316 4.434 4.452 3.120 2.750 3.594 5.415 2.844 2.052
l 4.542 4.716 0.742 6.986 6.812 1.342 5.482 8.029 2.068 1.900 3.692
By using Scaled_Data, the distances between the 12 objects are spread more evenly across all three dimensions.
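The output of dist() is a lower-triangular "dist" object, which is awkward to index. A minimal sketch of one way to look up a single pair is to convert it with as.matrix(). The example below rebuilds only the first three vectors so that it runs on its own; the names Data_demo, Scaled_demo, E and M are illustrative.

```r
# Illustrative subset: the first three vectors, scaled as in Part 3.
Data_demo <- rbind(
  a = c(3986, 1943, 444),
  b = c(4271, 1444, 344),
  c = c(8443, 2020, 415)
)
Scaled_demo <- scale(Data_demo, center = FALSE, scale = c(1000, 1000, 100))

# Convert the dist objects to full symmetric matrices.
E <- as.matrix(dist(Scaled_demo, method = "euclidean"))
M <- as.matrix(dist(Scaled_demo, method = "manhattan"))

E["a", "b"]   # Euclidean distance between a and b
M["a", "b"]   # Manhattan distance between a and b

# Check both values against the definitions.
diff_ab <- Scaled_demo["a", ] - Scaled_demo["b", ]
all.equal(E["a", "b"], sqrt(sum(diff_ab^2)))
all.equal(M["a", "b"], sum(abs(diff_ab)))
```

The two values agree with the a-b entries in the tables above (1.1533542 and 1.784).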
Part 5.
To create a heat map we first load the library factoextra. It may be necessary to install this package first by going to
__Tools__ $\rightarrow$ __Install__
and writing __factoextra__ in the __Packages__ cell in this dialogue box.
library("factoextra")
fviz_dist(Dist_E,gradient=list(low='ivory',mid='cornflowerblue',high='midnightblue'))
fviz_dist(Dist_M,gradient=list(low='ivory',mid='cornflowerblue',high='red'))
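If factoextra cannot be installed, a rough substitute can be drawn with the base R function heatmap(). This swaps in base graphics for fviz_dist(), so the appearance will differ. The sketch below uses the first four scaled vectors, typed in directly so that the block runs on its own; the names Demo and Demo_Dist are illustrative.

```r
# First four scaled vectors from the example above.
Demo <- rbind(
  a = c(3.986, 1.943, 4.44),
  b = c(4.271, 1.444, 3.44),
  c = c(8.443, 2.020, 4.15),
  d = c(1.898, 1.877, 3.11)
)

# heatmap() needs a full matrix, not a dist object.
Demo_Dist <- as.matrix(dist(Demo, method = "euclidean"))

# Rowv = NA and Colv = NA switch off re-ordering and dendrograms;
# scale = "none" keeps the distances as they are.
heatmap(Demo_Dist, Rowv = NA, Colv = NA, symm = TRUE, scale = "none",
        col = colorRampPalette(c("ivory", "cornflowerblue", "midnightblue"))(50))
```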
Eight makes of car were compared using three different criteria: price, engine size and efficiency.
The data collected are given in the table below:
| Make | Price (euro) | Engine (cc) | Efficiency |
|---|---|---|---|
| Audi | 38812 | 1968 | 22.7 |
| BMW | 35571 | 1995 | 23.7 |
| Citroen | 20451 | 1560 | 24.7 |
| Hyundai | 23620 | 1685 | 23.7 |
| Jaguar | 53693 | 1999 | 26.5 |
| Mercedes | 41909 | 1950 | 25.5 |
| Mitsubishi | 28192 | 2268 | 18.8 |
| Toyota | 27978 | 1995 | 21.6 |
Using the data in this table, answer the following:
Create 8 data vectors to represent each car make.
Combine these data vectors using the rbind() function, to create a single data structure.
Rescale the dimensions of this data structure, so each dimension is measured in the same order.
Create a table of Euclidean and Manhattan distances for this re-scaled data structure.
Create a heat map to represent these distances (Euclidean and Manhattan) using the function fviz_dist()
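One possible way to approach this exercise is sketched below, using the values from the table. The scaling constants c(10000, 1000, 10) are our own assumption, chosen so that each column has values of order 1 to 10; other choices are equally valid and will give different distances.

```r
# One data vector per make: price (euro), engine (cc), efficiency.
Audi       <- c(38812, 1968, 22.7)
BMW        <- c(35571, 1995, 23.7)
Citroen    <- c(20451, 1560, 24.7)
Hyundai    <- c(23620, 1685, 23.7)
Jaguar     <- c(53693, 1999, 26.5)
Mercedes   <- c(41909, 1950, 25.5)
Mitsubishi <- c(28192, 2268, 18.8)
Toyota     <- c(27978, 1995, 21.6)

# Combine into a single matrix; row names are taken from the vector names.
Cars <- rbind(Audi, BMW, Citroen, Hyundai, Jaguar, Mercedes, Mitsubishi, Toyota)
colnames(Cars) <- c("Price", "Engine", "Efficiency")

# Assumed scaling constants (see the note above).
Scaled_Cars <- scale(Cars, center = FALSE, scale = c(10000, 1000, 10))

Cars_E <- dist(Scaled_Cars, method = "euclidean")
Cars_M <- dist(Scaled_Cars, method = "manhattan")
round(Cars_E, 3)
round(Cars_M, 3)

# Heat maps (requires library("factoextra") to be loaded):
# fviz_dist(Cars_E)
# fviz_dist(Cars_M)
```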
We can also apply the functions dist() and fviz_dist() to data which we import from .csv files.
However, to do so, it may be necessary to modify the imported data slightly, otherwise these functions may return errors, depending on the structure of the data file.
Six makes of laptop were compared using four different criteria: storage, screen size, RAM and clock speed.
The data is available at
Moodle \(\rightarrow\) Data Visualisation \(\rightarrow\) Data Files \(\rightarrow\) LaptopData
Laptops<-read.csv(file.choose())
Laptops
Using this data, answer the following:
Part 1. Try to find the Euclidean distances between these laptops using the data frame above.
dist(Laptops,method='euclidean')
NAs introduced by coercion
1 2 3 4 5
2 559.034894
3 8.944272 559.177979
4 545.677205 19.068954 545.603897
5 545.686655 19.334231 545.613348 1.874166
6 559.052815 4.473533 559.195896 19.148433 18.992762
Part 2. Modify this data structure so that it can be used by the function dist().
Notice there are 6 rows and 5 columns in the data structure Laptops. The function dist() can only be applied to data frames consisting of numbers; the text column Make is coerced to NA, which produces the warning above and distorts the distances.
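There are two steps to creating the necessary data structure: first store the laptop names in a character vector, then drop the name column and use the stored names as row names.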
LaptopNames<-as.character(Laptops$Make)
LaptopNames
[1] "Dell Inspiron" "Acer Aspire" "LG Gram" "Lenovo Yoga" "Google Pixelbook" "Dell i3168"
LaptopData<-as.data.frame(Laptops[,-1],row.names=LaptopNames)
LaptopData
dist(LaptopData, method='euclidean')
Dell Inspiron Acer Aspire LG Gram Lenovo Yoga Google Pixelbook
Acer Aspire 500.016010
LG Gram 8.000000 500.143989
Lenovo Yoga 488.068530 17.055791 488.002961
Google Pixelbook 488.076982 17.293062 488.011414 1.676305
Dell i3168 500.032039 4.001250 500.160014 17.126879 16.987643
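The same two steps can be tried without the .csv file. The sketch below builds a small stand-in data frame for three of the laptops, with values reconstructed from the scaled output printed in this section. Rather than dropping the first column by position, it keeps whichever columns are numeric; this is an alternative to Laptops[,-1], and the names Laptops_demo and LaptopData_demo are illustrative.

```r
# Stand-in for the imported data: three of the six laptops.
Laptops_demo <- data.frame(
  Make       = c("Dell Inspiron", "Acer Aspire", "LG Gram"),
  Storage    = c(1000, 500, 1000),
  Screen     = c(15.6, 15.6, 15.6),
  RAM        = c(8, 4, 16),
  ClockSpeed = c(1.8, 1.7, 1.8),
  stringsAsFactors = FALSE
)

# Step 1: keep only the numeric columns.
numeric_cols    <- sapply(Laptops_demo, is.numeric)
LaptopData_demo <- Laptops_demo[, numeric_cols]

# Step 2: use the laptop names as row names.
rownames(LaptopData_demo) <- Laptops_demo$Make

Demo_Dist <- dist(LaptopData_demo, method = "euclidean")
Demo_Dist
```

The resulting distances agree with the corresponding entries in the table above, for example 8 between Dell Inspiron and LG Gram.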
Part 3. Rescale this new data structure so that all dimensions have values between 0 and 10.
Scaled_Laptop_Data <- scale(LaptopData,center=F,scale=c(1000,10,10,1))
Scaled_Laptop_Data
Storage Screen RAM ClockSpeed
Dell Inspiron 1.000 1.56 0.8 1.8
Acer Aspire 0.500 1.56 0.4 1.7
LG Gram 1.000 1.56 1.6 1.8
Lenovo Yoga 0.512 1.39 1.6 1.8
Google Pixelbook 0.512 1.23 1.6 1.3
Dell i3168 0.500 1.16 0.4 1.6
attr(,"scaled:scale")
[1] 1000 10 10 1
Part 4. Find the Euclidean and Manhattan distances in this re-scaled data structure.
The Euclidean distances are
Laptop_Distances_E<-dist(Scaled_Laptop_Data,method='euclidean')
Laptop_Distances_E
Dell Inspiron Acer Aspire LG Gram Lenovo Yoga Google Pixelbook
Acer Aspire 0.6480741
LG Gram 0.8000000 1.3038405
Lenovo Yoga 0.9523886 1.2161595 0.5167630
Google Pixelbook 1.1122248 1.3073041 0.7726862 0.5249762
Dell i3168 0.7810250 0.4123106 1.3747727 1.2381615 1.2389689
The Manhattan distances are
Laptop_Distances_M<-dist(Scaled_Laptop_Data,method='manhattan')
Laptop_Distances_M
Dell Inspiron Acer Aspire LG Gram Lenovo Yoga Google Pixelbook
Acer Aspire 1.000
LG Gram 0.800 1.800
Lenovo Yoga 1.458 1.482 0.658
Google Pixelbook 2.118 1.942 1.318 0.660
Dell i3168 1.500 0.500 2.300 1.642 1.582
Part 5. Create a heat map to represent these distances (for the Euclidean and Manhattan distances).
The heat map for the Euclidean distances is
fviz_dist(Laptop_Distances_E,gradient=list(low='ivory',mid='cornflowerblue',high='red'))
The heat map for the Manhattan distances is
fviz_dist(Laptop_Distances_M,gradient=list(low='ivory',mid='cornflowerblue',high='red'))
On Moodle, the file ISEQ(16Nov2017).csv contains the share price in €, the market capitalisation in € millions, and the value of each company relative to the overall market capitalisation, for the 20 companies on the ISEQ index.
Moodle \(\rightarrow\) Data Visualisation \(\rightarrow\) Workbook Files \(\rightarrow\) ISEQ(16Nov2017).csv
(Source: http://www.isqe.ie)
Using this data answer the following:
Import the data into this R workbook using read.csv().
Create an appropriate data frame from this data file which can be used by the dist() function.
Rescale the data columns of this new data frame so that all dimensions have a similar size.
Find the Euclidean and Manhattan distances using this re-scaled data frame.
Create a heat map to represent the Euclidean and Manhattan distances between these companies.
On Moodle, the file EuroZoneData2017.csv compares the countries of the Euro Zone (excluding Malta, as not all data were available), using 4 different criteria.
Moodle \(\rightarrow\) Data Visualisation \(\rightarrow\) Workbook Files \(\rightarrow\) EuroZoneData2017.csv
(Source: http://www.worldbank.org)
Using this data answer the following:
Import the data into this R workbook using read.csv().
Create an appropriate data frame from this data file which can be used by the dist() function.
Rescale the data columns of this new data frame so that all dimensions have a similar size.
Find the Euclidean and Manhattan distances using this re-scaled data frame.
Create a heat map to represent the Euclidean and Manhattan distances between these countries.