Disclaimer
Please do not cite this document. Data and syntax may be reused freely, provided the original source is attributed and referenced (where applicable). This document is intended for educational purposes only.
This analysis relies heavily on the factoextra package.
#install.packages("factoextra")
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Common procedures:
Example:
Suppose a mall owner wants to group the customers of his mall so that the marketing team can develop the right strategy for the right customers. The data held by the mall are the customer ID, customer age (age), annual income in thousands of dollars (annual income), and spending score. The spending score is a value assigned by the mall to each customer based on customer behavior (time of visit, type of goods purchased, and the amount of money spent when shopping) and ranges from 1 to 100. The greater the spending score, the more loyal the customer is to the mall and the more money they spend there.
The data can be obtained at the following link
data_mall <- read.csv("Mall_Customers.csv")
str(data_mall)
## 'data.frame': 200 obs. of 5 variables:
## $ CustomerID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Genre : chr "Male" "Male" "Female" "Female" ...
## $ Age : int 19 21 20 23 31 22 35 23 64 30 ...
## $ Annual.Income : int 15 15 16 16 17 17 18 18 19 19 ...
## $ Spending.Score: int 39 81 6 77 40 76 6 94 3 72 ...
dim(data_mall)
## [1] 200 5
The variables we are going to use for k-means are Age, Annual.Income, and Spending.Score. Therefore, we will subset the variables of interest and drop the rest.
data_mall <- data_mall[,c("Age","Annual.Income","Spending.Score")]
head(data_mall)
## Age Annual.Income Spending.Score
## 1 19 15 39
## 2 21 15 81
## 3 20 16 6
## 4 23 16 77
## 5 31 17 40
## 6 22 17 76
Data Standardization
Variable standardization transforms a variable so that it has a mean of zero and a standard deviation of one. Standardization is carried out when the variables are measured in different units (for example, age and income). It is needed because the k-means method is based on distances between objects/observations, which are sensitive to the units of measurement. The formula for standardizing the data is as follows:
\[
x = \frac{y_i-\bar{y}}{\sigma_y}
\] where \(\bar{y}\) is the mean
of \(y_i\) and \(\sigma_y\) is the standard deviation of y.
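As a small illustrative sketch (not part of the original guide), the formula above can be applied by hand to the Age variable and checked:
age_std <- (data_mall$Age - mean(data_mall$Age)) / sd(data_mall$Age)
round(c(mean = mean(age_std), sd = sd(age_std)), 10)  # mean 0, standard deviation 1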
In R, data can be standardized using the scale function.
data_mall_standardize <- scale(data_mall)
apply(data_mall_standardize,2,mean)
## Age Annual.Income Spending.Score
## -1.016906e-16 -8.144310e-17 -1.096708e-16
To check the standard deviations:
apply(data_mall_standardize,2,sd)
## Age Annual.Income Spending.Score
## 1 1 1
After standardization, the mean of each variable is (numerically) zero and the standard deviation is exactly one.
Note:
In the data pre-processing stage, we prepare the data so that the
k-means method can be applied optimally. Two things that are generally
done at this stage are selecting the variables used and standardizing
the variables.
Generally, the number of clusters can be determined using statistical criteria such as the silhouette coefficient and the WSS (within-cluster sum of squares).
The silhouette coefficient is calculated from the distances between observations. It measures how close an observation is to the other observations in the same cluster (cohesion) compared to its distance to observations in different clusters (separation). A larger coefficient indicates a more appropriate clustering.
The WSS criterion measures the variability within the clusters that are formed. The smaller the within-cluster variability, the more appropriate the clustering.
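To make the WSS idea concrete, here is a short optional sketch (not part of the original guide) that computes the total WSS by hand for k = 1 to 10 using kmeans; fviz_nbclust below automates this kind of comparison.
set.seed(123)
# total within-cluster sum of squares for k = 1, ..., 10
wss <- sapply(1:10, function(k)
  kmeans(data_mall_standardize, centers = k, nstart = 25)$tot.withinss
)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total WSS")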
Using these criteria, we can compare candidate numbers of clusters and choose the one that best fits the data. In R, the fviz_nbclust function from the factoextra package can be used to select the number of clusters.
fviz_nbclust(data_mall_standardize,FUNcluster = kmeans,method = "silhouette")
fviz_nbclust(data_mall_standardize,FUNcluster = kmeans,method = "wss")
For the silhouette criterion, we choose the number of clusters with the highest coefficient value. For WSS, we choose the number of clusters at which the line bends like an elbow. In the plot above, the line forms an elbow at four clusters. Because this determination is visual, different people may place the elbow at different points.
The two criteria select different numbers of clusters. In that case, the number of clusters can be chosen based on how easily the resulting clusters can be interpreted. In this practical guide we will use 4 clusters.
Note: by default, fviz_nbclust tests up to 10 clusters; to change this, use the k.max argument, for example k.max = 20.
fviz_nbclust(data_mall_standardize,FUNcluster = kmeans, k.max = 20, method = "wss")
After choosing the number of clusters, we apply the k-means method to obtain a cluster label for each observation. The eclust function from the factoextra package is used to run k-means. With eclust we do not need to pass the previously standardized data, because the function has a stand argument: setting stand = TRUE standardizes the data automatically.
kmeans_mall <- eclust(data_mall,stand = TRUE,FUNcluster = "kmeans",k=4,graph = F)
kmeans_mall$cluster
## [1] 3 3 3 3 3 3 2 3 2 3 2 3 2 3 2 3 3 3 2 3 3 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2
## [38] 3 2 3 2 3 2 3 2 3 2 3 3 3 2 3 3 2 2 2 2 2 3 2 2 3 2 2 2 3 2 2 3 3 2 2 2 2
## [75] 2 3 2 2 3 2 2 3 2 2 3 2 2 3 3 2 2 3 2 2 3 3 2 3 2 3 3 2 2 3 2 3 2 2 2 2 2
## [112] 3 1 3 3 3 2 2 2 2 3 1 4 4 1 4 1 4 2 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4
## [149] 1 4 1 4 1 4 1 4 1 4 1 4 2 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1
## [186] 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4
kmeans_mall$centers
## Age Annual.Income Spending.Score
## 1 0.03711223 0.9876366 -1.1857814
## 2 1.08344244 -0.4893373 -0.3961802
## 3 -0.96008279 -0.7827991 0.3910484
## 4 -0.42773261 0.9724070 1.2130414
The cluster label of each observation/object can be obtained with $cluster. Each cluster can then be interpreted with the help of the average value of each variable computed per cluster, which is available via $centers. Because we standardized the variables, these averages are also on the standardized scale.
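As a quick check (a small addition, not from the original analysis), the size of each cluster can be counted with table():
table(kmeans_mall$cluster)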
Based on the average values in $centers, here is the interpretation:
cluster 1: customers of roughly average age (the age variable is close to zero on the standardized scale) with high income (the income variable is large) who spend very little when shopping (the spending score is well below average).
cluster 2: older customers (the age variable is large) with low income (the income variable is small) who spend very little when shopping (the spending score is small). These may be retired customers whose only income is retirement benefits.
cluster 3: very young customers (the age variable is small) with low income who nevertheless spend quite a lot when shopping (the spending score is large). These may be unusual customers, because they have a small income but spend a lot of money.
cluster 4: fairly young customers (the age variable is small) with high income who spend quite a lot when shopping (the spending score is large). This cluster may contain the most attractive customers for the next marketing campaign.
If the results are difficult to read on the standardized scale, we can use the aggregate function to see the averages on the original scale. This function computes the average of each variable within each cluster.
aggregate(data_mall,by =list(cluster=kmeans_mall$cluster),
FUN = mean)
## cluster Age Annual.Income Spending.Score
## 1 1 39.36842 86.50000 19.57895
## 2 2 53.98462 47.70769 39.96923
## 3 3 25.43860 40.00000 60.29825
## 4 4 32.87500 86.10000 81.52500
Another way to interpret the clustering results is with a scatterplot. If more than two variables were used to build the clusters, the variables are first reduced to two principal components using principal component analysis before the scatterplot is drawn. However, interpreting each cluster then requires interpreting the two principal components, and two components will not necessarily explain the variability of the original data well.
fviz_cluster(kmeans_mall)
The two principal components can be interpreted from their loadings (eigenvectors):
pca_mall <- prcomp(data_mall_standardize)
pca_mall$rotation
## PC1 PC2 PC3
## Age 0.70638235 -0.03014116 0.707188441
## Annual.Income -0.04802398 -0.99883160 0.005397916
## Spending.Score -0.70619946 0.03777499 0.707004506
Common procedures:
The data pre-processing stage is almost the same as for the k-means procedure.
Using the silhouette and WSS coefficients
For illustration, we will use the silhouette method only because it is
easier to determine the number of clusters.
#complete
fviz_nbclust(data_mall_standardize,FUNcluster = hcut,method = "silhouette",
hc_method = "complete",hc_metric = "euclidean")
#average
fviz_nbclust(data_mall_standardize,FUNcluster = hcut,method = "silhouette",
hc_method = "average",hc_metric = "euclidean")
#centroid
fviz_nbclust(data_mall_standardize,FUNcluster = hcut,method = "silhouette",
hc_method = "centroid",hc_metric = "euclidean")
#ward
fviz_nbclust(data_mall_standardize,FUNcluster = hcut,method = "silhouette",
hc_method = "ward.D",hc_metric = "euclidean")
Based on the silhouette coefficients, the complete and average linkage methods choose 5 clusters, while the centroid and ward methods choose 2 and 6 clusters, respectively. For now, we will use 5 clusters with the complete method (when two linkage methods choose the same number of clusters, the resulting clusters are relatively similar, so either one can be chosen).
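As an optional check (a sketch assuming the cluster package is installed), the average silhouette width of the chosen 5-cluster complete-linkage solution can be computed directly:
library(cluster)
hc5 <- hcut(data_mall_standardize, k = 5, hc_method = "complete", hc_metric = "euclidean")
sil <- silhouette(hc5$cluster, dist(data_mall_standardize, method = "euclidean"))
mean(sil[, 3])  # average silhouette width for k = 5 with complete linkage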
Using a Dendrogram
Using dendrograms for data with many observations may not be effective, because selecting the number of clusters from a dendrogram is done visually.
linkage_methods <- c("complete","average","centroid","ward.D")
hc_mall_dend <- lapply(linkage_methods, function(i)
hclust(dist(data_mall_standardize,method = 'euclidean'),method = i)
)
#complete
fviz_dend(hc_mall_dend[[1]])
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
#average
fviz_dend(hc_mall_dend[[2]])
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
#centroid
fviz_dend(hc_mall_dend[[3]])
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
#ward
fviz_dend(hc_mall_dend[[4]])
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
From the four dendrograms, one per linkage method, the number of clusters formed is the same as that obtained with the silhouette coefficient above.
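Note that cluster labels could also be obtained directly from an hclust object by cutting the dendrogram with cutree() (a small sketch, not in the original guide); below we use eclust instead.
cutree(hc_mall_dend[[1]], k = 5)  # complete linkage cut into 5 clusters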
hc_mall <- eclust(data_mall,stand = TRUE,FUNcluster = "hclust",k=5,hc_method = "complete",hc_metric = "euclidean",graph = F)
hc_mall$cluster
## [1] 1 2 1 2 1 2 1 2 3 2 3 2 3 2 1 2 1 2 3 2 1 2 3 2 3 2 3 1 3 2 3 2 3 2 3 2 3
## [38] 2 3 2 3 2 3 1 3 2 3 1 1 1 3 1 1 3 3 3 3 3 1 3 3 1 3 3 3 1 1 3 1 1 3 3 3 3
## [75] 3 1 1 1 1 3 3 1 3 3 1 3 3 1 1 3 3 1 3 1 1 1 3 1 3 1 1 3 3 1 3 1 3 3 3 3 3
## [112] 1 1 1 1 1 3 3 3 3 1 1 1 4 1 4 5 4 5 4 5 4 1 4 5 4 5 4 5 4 5 4 1 4 5 4 5 4
## [149] 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5
## [186] 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4
aggregate(data_mall,by =list(cluster=hc_mall$cluster),
FUN = mean)
## cluster Age Annual.Income Spending.Score
## 1 1 28.35417 50.29167 45.93750
## 2 2 24.80952 25.61905 80.23810
## 3 3 55.33333 47.31579 41.08772
## 4 4 32.69231 86.53846 82.12821
## 5 5 41.68571 88.22857 17.28571
fviz_cluster(hc_mall)
The two principal components can be interpreted from their loadings (eigenvectors):
pca_mall <- prcomp(data_mall_standardize)
pca_mall$rotation
## PC1 PC2 PC3
## Age 0.70638235 -0.03014116 0.707188441
## Annual.Income -0.04802398 -0.99883160 0.005397916
## Spending.Score -0.70619946 0.03777499 0.707004506
The R package corrplot provides a visual exploratory tool for correlation matrices that supports automatic variable reordering to help detect hidden patterns among variables.
corrplot is very easy to use and provides a rich array of plotting options for the visualization method, graphic layout, color, legend, text labels, etc. It also provides p-values and confidence intervals to help users determine the statistical significance of the correlations.
#install.packages("corrplot")
library(corrplot)
## corrplot 0.90 loaded
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble 3.1.4 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
corrplot(mpg %>% select(where(is.numeric)) %>% cor,
method = "pie",type = "lower",
diag = FALSE)
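As noted above, corrplot can also display significance information. Here is a brief optional sketch (not from the original document) that uses cor.mtest() to compute p-values and marks the correlations that are not significant at the 5% level:
mpg_num <- mpg %>% select(where(is.numeric))
ptest <- cor.mtest(mpg_num, conf.level = 0.95)  # p-values and confidence intervals
corrplot(cor(mpg_num), method = "pie", type = "lower", diag = FALSE,
         p.mat = ptest$p, sig.level = 0.05)  # insignificant cells are crossed out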
One way to visualize the correlation matrix when more variables are involved is a heatmap, in which the color of each cell reflects the magnitude of the correlation coefficient. Besides displaying the correlation coefficients as colors, the variables can also be grouped based on their correlations, so that highly correlated variables are placed close together.
library(ggplot2)
We are going to use the mtcars data as an example:
mydata <- mtcars[, c(1,3,4,5,6,7)]
head(mydata)
## mpg disp hp drat wt qsec
## Mazda RX4 21.0 160 110 3.90 2.620 16.46
## Mazda RX4 Wag 21.0 160 110 3.90 2.875 17.02
## Datsun 710 22.8 108 93 3.85 2.320 18.61
## Hornet 4 Drive 21.4 258 110 3.08 3.215 19.44
## Hornet Sportabout 18.7 360 175 3.15 3.440 17.02
## Valiant 18.1 225 105 2.76 3.460 20.22
Then, we compute the correlation matrix:
cormat <- round(cor(mydata),2)
head(cormat)
## mpg disp hp drat wt qsec
## mpg 1.00 -0.85 -0.78 0.68 -0.87 0.42
## disp -0.85 1.00 0.79 -0.71 0.89 -0.43
## hp -0.78 0.79 1.00 -0.45 0.66 -0.71
## drat 0.68 -0.71 -0.45 1.00 -0.71 0.09
## wt -0.87 0.89 0.66 -0.71 1.00 -0.17
## qsec 0.42 -0.43 -0.71 0.09 -0.17 1.00
It is time to create a correlation heatmap using the ggplot2 package. Before that, we need to rearrange the data from wide format to long format using the reshape2 package:
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
melted_cormat <- melt(cormat)
head(melted_cormat)
## Var1 Var2 value
## 1 mpg mpg 1.00
## 2 disp mpg -0.85
## 3 hp mpg -0.78
## 4 drat mpg 0.68
## 5 wt mpg -0.87
## 6 qsec mpg 0.42
The geom_tile() function from the ggplot2 package is used to visualize the correlation matrix:
library(ggplot2)
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) +
geom_tile()
The default plot is not very attractive; in the next sections we will see how to change the appearance of the heatmap. First, we take the lower or upper triangle of the correlation matrix, because a correlation matrix contains redundant information. The helper functions below set half of the matrix to NA:
# Get lower triangle of the correlation matrix
get_lower_tri<-function(cormat){
cormat[upper.tri(cormat)] <- NA
return(cormat)
}
# Get upper triangle of the correlation matrix
get_upper_tri <- function(cormat){
cormat[lower.tri(cormat)]<- NA
return(cormat)
}
Usage:
upper_tri <- get_upper_tri(cormat)
upper_tri
## mpg disp hp drat wt qsec
## mpg 1 -0.85 -0.78 0.68 -0.87 0.42
## disp NA 1.00 0.79 -0.71 0.89 -0.43
## hp NA NA 1.00 -0.45 0.66 -0.71
## drat NA NA NA 1.00 -0.71 0.09
## wt NA NA NA NA 1.00 -0.17
## qsec NA NA NA NA NA 1.00
Let us finish the correlation heatmap by melting the correlation data and dropping the NA values:
# Melt the correlation matrix
library(reshape2)
melted_cormat <- melt(upper_tri, na.rm = TRUE)
# Heatmap
library(ggplot2)
ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1))+
coord_fixed()
If we aim to reorder the correlation matrix according to the correlation coefficients, we can use the hclust function:
#helper function to reorder correlation matrix:
reorder_cormat <- function(cormat){
# Use correlation between variables as distance
dd <- as.dist((1-cormat)/2)
hc <- hclust(dd)
cormat <-cormat[hc$order, hc$order]
}
Reordered correlation data visualization:
# Reorder the correlation matrix
cormat <- reorder_cormat(cormat)
upper_tri <- get_upper_tri(cormat)
# Melt the correlation matrix
melted_cormat <- melt(upper_tri, na.rm = TRUE)
# Create a ggheatmap
ggheatmap <- ggplot(melted_cormat, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+ # minimal theme
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1))+
coord_fixed()
# Print the heatmap
print(ggheatmap)
Then, we just need to add the correlation coefficients to the heatmap:
ggheatmap +
geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.major = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.ticks = element_blank(),
legend.justification = c(1, 0),
legend.position = c(0.6, 0.7),
legend.direction = "horizontal")+
guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
title.position = "top", title.hjust = 0.5))
Dimensionality reduction is a technique for reducing the number of dimensions (features) of a dataset. Often the dataset we want to process has tens or even hundreds of features or columns. With dimensionality reduction, we can reduce the number of features or columns while losing as little information from the dataset as possible. PCA is one method for doing this.
library(factoextra)
#install.packages("ggcorrplot")
library(ggcorrplot)
library(openxlsx)
We can obtain the women's track records dataset by downloading it through this link
The data are the record times in seven women's athletic events from 55 countries at one of the Olympic events, namely the 100 meter, 200 meter, 400 meter, 800 meter, 1500 meter, 3000 meter, and marathon. The first three events are recorded in seconds, while the other four are recorded in minutes.
#Import Data
data_women_records <- openxlsx::read.xlsx("women_track_records.xlsx")
head(data_women_records)
## lOOm 200m 400m 800m 1500m 3000m Marathon country
## 1 11.61 22.94 54.50 2.15 4.43 9.79 178.52 argentina
## 2 11.20 22.35 51.08 1.98 4.13 9.08 152.37 australia
## 3 11.43 23.09 50.62 1.99 4.22 9.34 159.37 austria
## 4 11.41 23.04 52.00 2.00 4.14 8.88 157.85 belgium
## 5 11.46 23.05 53.30 2.16 4.58 9.81 169.98 bermuda
## 6 11.31 23.17 52.80 2.10 4.49 9.77 168.75 brazil
Note: openxlsx::read.xlsx means using the read.xlsx function from the openxlsx package without loading the package first with library(). The head function displays the first 6 rows of the data.
For further analysis, the country names will be used as row names of the data. This is done with the rownames function.
rownames(data_women_records) <- data_women_records$country
data_women_records <- data_women_records[,-8]
data_women_records[,-8] means that we omit the eighth column of the data (the country column).
Let's create a correlation matrix for exploration:
cor_women <- cor(data_women_records)
cor_women
## lOOm 200m 400m 800m 1500m 3000m Marathon
## lOOm 1.0000000 0.9527911 0.8346918 0.7276888 0.7283709 0.7416988 0.6863358
## 200m 0.9527911 1.0000000 0.8569621 0.7240597 0.6983643 0.7098710 0.6855745
## 400m 0.8346918 0.8569621 1.0000000 0.8984052 0.7878417 0.7776369 0.7054241
## 800m 0.7276888 0.7240597 0.8984052 1.0000000 0.9016138 0.8635652 0.7792922
## 1500m 0.7283709 0.6983643 0.7878417 0.9016138 1.0000000 0.9691690 0.8779334
## 3000m 0.7416988 0.7098710 0.7776369 0.8635652 0.9691690 1.0000000 0.8998374
## Marathon 0.6863358 0.6855745 0.7054241 0.7792922 0.8779334 0.8998374 1.0000000
This correlation matrix is easier to read in graphical form, which can be produced as follows:
ggcorrplot(cor_women,type="lower",lab = TRUE)
This generates almost the same correlation heatmap as in the previous section, but in a simpler and easier way. The next stage is applying principal component analysis.
In R, PCA can be implemented with the prcomp function. This function has the arguments scale. and center (by default, center = TRUE and scale. = FALSE). If both arguments are TRUE, PCA is computed from the correlation matrix; if scale. = FALSE, the covariance matrix of the (centered) data is used.
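As a quick illustration (an addition, not in the original text) that scale. = TRUE corresponds to working with the correlation matrix, the variances of the components returned by prcomp match the eigenvalues of the correlation matrix:
prcomp(data_women_records, scale. = TRUE, center = TRUE)$sdev^2  # component variances
eigen(cor(data_women_records))$values  # eigenvalues of the correlation matrix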
pca_women_records <- prcomp(data_women_records,scale.=TRUE,center=TRUE)
summary(pca_women_records)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.4095 0.80848 0.54762 0.35423 0.23198 0.19761 0.14981
## Proportion of Variance 0.8294 0.09338 0.04284 0.01793 0.00769 0.00558 0.00321
## Cumulative Proportion 0.8294 0.92276 0.96560 0.98353 0.99122 0.99679 1.00000
The syntax above gives three kinds of results, namely the Standard deviation, Proportion of Variance, and Cumulative Proportion of each principal component. The standard deviation is the square root of the eigenvalue (characteristic root); here the eigenvalue acts as the variance of each principal component. The proportion of variance is the eigenvalue of a component divided by the sum of all eigenvalues. Proportion of Variance explains how much of the variation in the original variables is captured by each principal component: the larger the value, the better that component represents the original variables.
Cumulative Proportion describes how much of the variation is explained by the principal components cumulatively. For example, using only the first two principal components (PC1 and PC2) already explains more than 92% of the variability in the data. Based on this, we will use only two principal components.
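These proportions can also be reproduced by hand from the standard deviations (a short sketch, not in the original guide):
eig <- pca_women_records$sdev^2  # eigenvalues = variances of the principal components
eig / sum(eig)  # proportion of variance
cumsum(eig / sum(eig))  # cumulative proportion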
fviz_screeplot(pca_women_records,geom="line")
Another way to determine how many principal components to use is a screeplot. The function for displaying a screeplot in R is fviz_screeplot, from the factoextra package.
With a screeplot, the number of principal components is chosen by looking for the component at which the line bends like an elbow. In the plot above, the line forms an elbow at the second component (second dimension), so two principal components are used (PC1 and PC2).
Interpretation
The PCA results can be interpreted using the loadings (eigenvectors) of each principal component. The larger a variable's loading on a particular component, the greater that original variable's contribution to constructing the component. Note also that a negative loading means the original variable contributes in the opposite direction: the larger the value of the original variable, the smaller the value of the principal component.
pca_women_records$rotation
## PC1 PC2 PC3 PC4 PC5 PC6
## lOOm 0.3683561 0.4900597 -0.28601157 0.31938631 0.23116950 0.619825234
## 200m 0.3653642 0.5365800 -0.22981913 -0.08330196 0.04145457 -0.710764580
## 400m 0.3816103 0.2465377 0.51536655 -0.34737748 -0.57217791 0.190945970
## 800m 0.3845592 -0.1554023 0.58452608 -0.04207636 0.62032379 -0.019089032
## 1500m 0.3891040 -0.3604093 0.01291198 0.42953873 0.03026144 -0.231248381
## 3000m 0.3888661 -0.3475394 -0.15272772 0.36311995 -0.46335476 0.009277159
## Marathon 0.3670038 -0.3692076 -0.48437037 -0.67249685 0.13053590 0.142280558
## PC7
## lOOm 0.05217655
## 200m -0.10922503
## 400m 0.20849691
## 800m -0.31520972
## 1500m 0.69256151
## 3000m -0.59835943
## Marathon 0.06959828
Since we only use two components, the loadings are interpreted only for PC1 and PC2. The loadings on PC1 are roughly equal, around 0.37-0.39 for all events. These similar loadings indicate that every original variable contributes about equally to this component, so the PC1 score summarizes the running times across all events. We can therefore use PC1 to determine which country has the fastest runners over all race categories. The loadings on PC2 are positive for the sprint events (100m-400m) and negative for the longer events (800m-marathon). This means that the higher a country's PC2 score, the slower its sprint times and the faster its long-distance times relative to the other countries, and vice versa for a low PC2 score. Therefore, PC2 can be used to contrast a country's short-distance performance with its long-distance performance.
Note: the interpretation of principal components is fairly subjective, so different analysts may interpret them differently.
The last thing that can be interpreted is the score on PC1 and PC2. A score is the coordinate of an observation on the new principal component axes. In the context of the track data above, the observations are countries, so the scores give insight into the running events of each country.
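To make this concrete, here is a minimal sketch (an illustration, not part of the original text) showing that the scores are simply the standardized data multiplied by the loading matrix:
scores_manual <- scale(data_women_records) %*% pca_women_records$rotation
head(scores_manual, 1)  # manually computed scores for the first country
head(pca_women_records$x, 1)  # matches the scores returned by prcomp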
The scores on the principal components can be displayed with the following syntax:
pca_women_records$x
## PC1 PC2 PC3 PC4 PC5
## argentina 0.52726229 -0.674719057 0.61558926 0.04552931 -0.0025575859
## australia -2.09355119 -0.532798687 -0.04317487 0.18737792 -0.2183555124
## austria -1.38044331 -0.274995495 -0.53231305 0.42623519 -0.0255039187
## belgium -1.50998970 0.090963946 -0.08345684 -0.03942270 -0.0303263716
## bermuda 0.38781757 -0.976409482 0.64887210 0.47445652 0.2043216118
## brazil -0.11839732 -0.911515514 0.32214131 0.34096557 -0.0959612766
## burma 1.68203844 0.586191540 0.07582590 -0.14511731 0.6034322854
## canada -2.60813211 -0.695287897 0.10954223 0.03328484 0.1410820801
## chile 0.54783013 1.171926564 -0.23733316 -0.09612578 -0.2156317054
## china 0.64127320 0.980047440 0.05370738 0.02293887 -0.0579061799
## columbia 0.14157212 0.154468463 0.21456812 0.17614694 0.1895221521
## cookis 6.07727834 1.399503624 -0.21282983 -0.28085726 -0.0529187535
## costa 2.61922856 0.305956420 1.09460087 0.41203614 -0.5847203364
## czech -3.05379905 -1.012730429 -1.04879114 0.37317120 -0.0258660649
## denmark -1.11637538 0.540131452 0.41098527 -0.17591883 -0.1041309755
## domrep 2.29543640 -0.616097039 0.64633093 -0.26243482 0.3964092177
## finland -2.18183995 -0.670248707 0.08087437 0.08706670 0.3299422783
## france -1.89216992 -0.443832046 0.14465457 -0.05322917 -0.1896245118
## gdr -3.50601681 -1.202500275 -0.52603497 -0.12430576 0.0884050075
## frg -2.92577741 -0.437137537 -0.20120561 -0.02584472 0.0480344770
## gbni -2.78315609 -0.578349737 0.13304952 -0.13024804 0.0417404253
## greece 0.81424542 0.233714829 -0.15991295 -0.08693903 -0.4548804436
## guatemal 3.22729824 -0.919057694 0.44304609 -0.09073846 0.3545878042
## hungary -1.47721337 0.059384256 -0.15006131 0.12504871 0.0924438246
## india 1.01453673 0.253605564 -0.51071114 0.05275360 0.0509072296
## indonesia 2.11236466 -0.378269923 0.33669289 -0.18769360 0.3751519942
## ireland -1.11735099 0.513043238 0.44713600 -0.08797261 -0.0255638127
## israel -0.14296749 0.155867013 0.75136633 -0.16605776 -0.2905821673
## italy -2.13954076 0.351991902 -0.07731676 -0.29052538 -0.2244816386
## japan -0.05923268 0.657973909 0.40042678 0.42998430 0.1230753651
## kenya -0.43089430 0.480611126 -0.75309705 -0.32602370 -0.0643810184
## korea 1.23386149 0.814398564 0.55640268 0.23959883 0.0129823046
## dprkorea 0.46229683 1.717885467 -1.91963681 0.34183702 0.3375496932
## luxembou 1.30174495 1.182174513 -0.10512035 -0.03151618 -0.4495547106
## malaysia 2.34053535 -0.001259817 0.11819610 0.84655411 0.1277242627
## mauritiu 4.23384717 -1.180202966 -0.08772947 -1.39411471 -0.1640608518
## mexico -0.06348187 0.568969717 -0.09040891 0.45214535 -0.2961853317
## netherla -1.79442661 -0.047085355 0.14270838 -0.10800877 -0.3625892311
## nz -1.51125893 0.377920885 0.06112304 0.36901628 0.2627172276
## norway -1.48300990 0.904195203 0.38741591 -0.14529920 0.1311735490
## png 3.98086034 -0.340244237 -0.28834865 -0.29398974 0.1595586421
## philippi 1.64018955 -0.876042797 0.22214354 -0.02227534 0.2056514197
## poland -2.67209659 -0.702323548 -0.59597055 -0.02384024 0.0364171238
## portugal -0.22428415 1.252662130 0.46217831 -0.02349218 0.2384703479
## rumania -2.02982623 0.618046803 -0.83844940 -0.46958197 -0.0740123635
## singapor 1.97013151 0.951982007 -0.38929662 0.40329233 0.0732785524
## spain -0.35565287 0.925478452 -0.05010366 -0.10341725 0.0927143013
## sweden -1.82775494 -0.254820864 0.24805824 -0.14902201 -0.0006283576
## switzerl -1.34665382 0.514180988 0.24518048 -0.22415740 -0.0858444392
## taipei -0.50011940 -1.234653297 0.31159932 -0.04781148 0.0094018502
## thailand 1.95317730 -0.139873925 0.81360772 0.65551039 -0.1576314482
## turkey 1.60820411 0.594342560 0.16105018 -0.81523551 0.1477445097
## usa -3.33581190 -0.685104574 0.38123893 -0.27057989 -0.1954822627
## ussr -3.46468721 -0.245078447 -0.64637481 -0.20743799 -0.0824770130
## wsamoa 8.33288156 -2.326979228 -1.49263486 0.40428466 -0.3425812545
## PC6 PC7
## argentina 0.457907931 -0.079962873
## australia 0.137966927 -0.009815157
## austria -0.081682704 -0.106172559
## belgium 0.062877332 0.138490873
## bermuda -0.049426348 0.047829472
## brazil -0.300448073 -0.006724259
## burma 0.277326099 0.055825566
## canada -0.116436013 -0.117243747
## chile 0.128882585 0.003967955
## china 0.046618764 0.172340012
## columbia -0.324561612 -0.122442477
## cookis -0.055706058 -0.289275038
## costa -0.065656582 -0.044687998
## czech 0.047409019 0.188230868
## denmark -0.179964399 0.322413538
## domrep -0.006777189 0.321472376
## finland -0.031636754 -0.182734885
## france -0.035807689 0.053223788
## gdr -0.047152707 -0.176067724
## frg -0.191479813 0.086729269
## gbni 0.012368224 0.059974475
## greece 0.093873143 -0.121106747
## guatemal -0.279702884 -0.030512752
## hungary 0.061704372 -0.002880404
## india 0.134490600 -0.442230795
## indonesia -0.013434970 -0.058448424
## ireland -0.149414709 -0.043637389
## israel -0.090944777 -0.094991490
## italy 0.012049601 0.080196477
## japan -0.182028982 0.141996717
## kenya 0.119151878 -0.011054538
## korea -0.028182131 -0.027699883
## dprkorea -0.561883421 -0.072075631
## luxembou -0.115327186 0.123404789
## malaysia 0.374021886 -0.140160623
## mauritiu -0.321676670 -0.209056067
## mexico 0.402826018 -0.127280417
## netherla 0.051567793 -0.071966309
## nz 0.078135023 0.198553009
## norway 0.226377016 0.086297791
## png 0.111686449 0.039358192
## philippi 0.267820061 -0.094783185
## poland 0.144920475 -0.248593496
## portugal -0.041084319 0.041047319
## rumania -0.050795933 0.167824160
## singapor 0.091437114 0.018007615
## spain 0.124420761 -0.021530131
## sweden -0.159842868 0.036333974
## switzerl 0.047330468 0.068598802
## taipei 0.024237079 -0.093082005
## thailand -0.482014923 -0.032198783
## turkey 0.287104548 0.191270368
## usa -0.047049486 0.040368337
## ussr 0.097960158 0.017756855
## wsamoa 0.087647875 0.376903191
To make it easier to interpret the score value, the graph below is used.
fviz_pca_ind(pca_women_records,col.ind = "darkred")
Based on the score plot, the country with the slowest recorded times across all events is wsamoa, because its PC1 (Dim1) score is the largest of all. Its strongly negative PC2 score also indicates that, relative to the other countries, its long-distance times lag even further behind than its sprint times. The country with the fastest runners across all events is gdr, whose PC1 score is the most negative.
In this practical note, we will use the ggpubr and MASS packages.
library(ggpubr)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
Swiss Data
We are going to use the swiss data set, which consists of a standardized fertility measure and socio-economic indicators for each of 47 French-speaking provinces of Switzerland at about 1888. It is a data frame with 47 observations on 6 variables, each of which is in percent, i.e., in [0, 100].
All variables but ‘Fertility’ give proportions of the population.
data(swiss)
head(swiss)
## Fertility Agriculture Examination Education Catholic
## Courtelary 80.2 17.0 15 12 9.96
## Delemont 83.1 45.1 6 9 84.84
## Franches-Mnt 92.5 39.7 5 5 93.40
## Moutier 85.8 36.5 12 7 33.77
## Neuveville 76.9 43.5 17 15 5.16
## Porrentruy 76.1 35.3 9 7 90.57
## Infant.Mortality
## Courtelary 22.2
## Delemont 22.2
## Franches-Mnt 20.2
## Moutier 20.3
## Neuveville 20.6
## Porrentruy 26.6
MDS Implementation:
# calculate distance matrix
distance_swiss <- dist(swiss, method = "euclidean")
# Metric MDS
metricMDS_swiss <- cmdscale(d = distance_swiss,k = 2)
colnames(metricMDS_swiss) <- c("Dim.1","Dim.2")
metricMDS_swiss <- as.data.frame(metricMDS_swiss)
head(metricMDS_swiss)
## Dim.1 Dim.2
## Courtelary 37.032433 -17.434879
## Delemont -42.797334 -14.687668
## Franches-Mnt -51.081639 -19.274036
## Moutier 7.716707 -5.458722
## Neuveville 35.032658 5.126097
## Porrentruy -44.161953 -25.922412
Kruskal's non-metric MDS:
KruskalMDS_swiss <- isoMDS(d = distance_swiss,k = 2)
## initial value 5.463800
## iter 5 value 4.499103
## iter 5 value 4.495335
## iter 5 value 4.492669
## final value 4.492669
## converged
KruskalMDS_swiss <- as.data.frame(KruskalMDS_swiss$points)
colnames(KruskalMDS_swiss) <- c("Dim.1","Dim.2")
head(KruskalMDS_swiss)
## Dim.1 Dim.2
## Courtelary 38.850496 -16.154674
## Delemont -42.676573 -13.720989
## Franches-Mnt -53.587659 -21.335763
## Moutier 6.735536 -4.604116
## Neuveville 35.622307 4.633972
## Porrentruy -44.739479 -25.495702
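Note that isoMDS() also returns a stress value (the "final value" printed above, in percent), which indicates how well the two-dimensional configuration preserves the original distances; a small optional sketch (not in the original) keeping the full result to inspect it:
kruskal_full <- isoMDS(d = distance_swiss, k = 2, trace = FALSE)
kruskal_full$stress  # final stress (in percent); lower values indicate a better fit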
MDS Plots
Metric MDS:
ggscatter(metricMDS_swiss,x="Dim.1",y="Dim.2",color = "steelblue",
label = rownames(swiss),
repel = TRUE)
## Warning: ggrepel: 6 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Kruskal’s non-metric MDS:
ggscatter(KruskalMDS_swiss,x="Dim.1",y="Dim.2",color = "steelblue",
label = rownames(swiss),
repel = TRUE)
## Warning: ggrepel: 6 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps