Disclaimer
Please do not cite this document. Data and syntax may be reused freely, provided the original source is attributed and referenced (where applicable). This document is intended for educational purposes only.
This analysis relies heavily on the factoextra package.
#install.packages("factoextra")
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Common procedures:
Example:
Suppose a mall owner wants to group the customers of his mall so that the marketing team can develop the right strategy for the right customers. The data held by the mall are the customer ID, customer age (age), annual income in thousands of dollars (annual income), and spending score. The spending score is a value assigned by the mall to each customer based on customer behavior (time of visit, type of goods purchased, and the amount of money spent when shopping) and ranges from 1 to 100. The greater the spending score, the more loyal the customer is to the mall and the more money they spend there.
The data can be obtained at the following link
data_mall <- read.csv("Mall_Customers.csv")
str(data_mall)
## 'data.frame': 200 obs. of 5 variables:
## $ CustomerID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Genre : chr "Male" "Male" "Female" "Female" ...
## $ Age : int 19 21 20 23 31 22 35 23 64 30 ...
## $ Annual.Income : int 15 15 16 16 17 17 18 18 19 19 ...
## $ Spending.Score: int 39 81 6 77 40 76 6 94 3 72 ...
dim(data_mall)
## [1] 200 5
The variables we are going to use for k-means are Age, Annual.Income, and Spending.Score. Therefore, we will subset the variables of interest and drop the rest.
data_mall <- data_mall[,c("Age","Annual.Income","Spending.Score")]
head(data_mall)
## Age Annual.Income Spending.Score
## 1 19 15 39
## 2 21 15 81
## 3 20 16 6
## 4 23 16 77
## 5 31 17 40
## 6 22 17 76
Data Standardization
Variable standardization transforms a variable so that it has a mean of zero and a standard deviation of one. Standardization is carried out when the variables are measured in different units (for example, age and income). It is needed because the k-means method is based on distances between objects/observations, which are sensitive to the units of measurement. The formula for standardizing the data is as follows:
\[
x = \frac{y_i-\bar{y}}{\sigma_y}
\] where \(\bar{y}\) is the mean
of \(y_i\) and \(\sigma_y\) is the standard deviation of y.
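As a small illustrative sketch (not part of the original guide), the formula above can be applied by hand to the Age variable and checked:
age_std <- (data_mall$Age - mean(data_mall$Age)) / sd(data_mall$Age)
round(c(mean = mean(age_std), sd = sd(age_std)), 10)  # mean 0, standard deviation 1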
In R, data can be standardized using the scale function.
data_mall_standardize <- scale(data_mall)
apply(data_mall_standardize,2,mean)
## Age Annual.Income Spending.Score
## -1.016906e-16 -8.144310e-17 -1.096708e-16
To check the standard deviations:
apply(data_mall_standardize,2,sd)
## Age Annual.Income Spending.Score
## 1 1 1
After standardization, the mean of each variable is (numerically) zero and the standard deviation is exactly one.
Note:
In the data pre-processing stage, we prepare the data so that the
k-means method can be applied optimally. Two things that are generally
done at this stage are selecting the variables used and standardizing
the variables.
Generally, the number of clusters can be determined using statistical criteria such as the silhouette coefficient and the WSS (within-cluster sum of squares).
The silhouette coefficient is calculated from the distances between observations. It measures how close an observation is to the other observations in the same cluster (cohesion) compared to its distance to observations in different clusters (separation). A larger coefficient indicates a more appropriate clustering.
The WSS criterion measures the variability within the clusters that are formed. The smaller the within-cluster variability, the more appropriate the clustering.
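To make the WSS idea concrete, here is a short optional sketch (not part of the original guide) that computes the total WSS by hand for k = 1 to 10 using kmeans; fviz_nbclust below automates this kind of comparison.
set.seed(123)
# total within-cluster sum of squares for k = 1, ..., 10
wss <- sapply(1:10, function(k)
  kmeans(data_mall_standardize, centers = k, nstart = 25)$tot.withinss
)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total WSS")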
Using these criteria, we can compare candidate numbers of clusters and choose the one that best fits the data. In R, the fviz_nbclust function from the factoextra package can be used to select the number of clusters.
fviz_nbclust(data_mall_standardize,FUNcluster = kmeans,method = "silhouette")
fviz_nbclust(data_mall_standardize,FUNcluster = kmeans,method = "wss")
For the silhouette criterion, we choose the number of clusters with the highest coefficient value. For WSS, we choose the number of clusters at which the line bends like an elbow. In the plot above, the line forms an elbow at four clusters. Because this determination is visual, different people may place the elbow at different points.
The two criteria select different numbers of clusters. In that case, the number of clusters can be chosen based on how easily the resulting clusters can be interpreted. In this practical guide we will use 4 clusters.
Note: by default, fviz_nbclust tests up to 10 clusters; to change this, use the k.max argument, for example k.max = 20.
fviz_nbclust(data_mall_standardize,FUNcluster = kmeans, k.max = 20, method = "wss")
After choosing the number of clusters, we apply the k-means method to obtain a cluster label for each observation. The eclust function from the factoextra package is used to run k-means. With eclust we do not need to pass the previously standardized data, because the function has a stand argument: setting stand = TRUE standardizes the data automatically.
kmeans_mall <- eclust(data_mall,stand = TRUE,FUNcluster = "kmeans",k=4,graph = F)
kmeans_mall$cluster
## [1] 3 3 3 3 3 3 2 3 2 3 2 3 2 3 2 3 3 3 2 3 3 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2
## [38] 3 2 3 2 3 2 3 2 3 2 3 3 3 2 3 3 2 2 2 2 2 3 2 2 3 2 2 2 3 2 2 3 3 2 2 2 2
## [75] 2 3 2 2 3 2 2 3 2 2 3 2 2 3 3 2 2 3 2 2 3 3 2 3 2 3 3 2 2 3 2 3 2 2 2 2 2
## [112] 3 1 3 3 3 2 2 2 2 3 1 4 4 1 4 1 4 2 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4
## [149] 1 4 1 4 1 4 1 4 1 4 1 4 2 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1
## [186] 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4
kmeans_mall$centers
## Age Annual.Income Spending.Score
## 1 0.03711223 0.9876366 -1.1857814
## 2 1.08344244 -0.4893373 -0.3961802
## 3 -0.96008279 -0.7827991 0.3910484
## 4 -0.42773261 0.9724070 1.2130414
The cluster label of each observation/object can be obtained with $cluster. Each cluster can then be interpreted with the help of the average value of each variable computed per cluster, which is available via $centers. Because we standardized the variables, these averages are also on the standardized scale.
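As a quick check (a small addition, not from the original analysis), the size of each cluster can be counted with table():
table(kmeans_mall$cluster)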
Based on the average values in $centers, here is the interpretation:
cluster 1: customers of roughly average age (the age variable is close to zero on the standardized scale) with high income (the income variable is large) who spend very little when shopping (the spending score is well below average).
cluster 2: older customers (the age variable is large) with low income (the income variable is small) who spend very little when shopping (the spending score is small). These may be retired customers whose only income is retirement benefits.
cluster 3: very young customers (the age variable is small) with low income who nevertheless spend quite a lot when shopping (the spending score is large). These may be unusual customers, because they have a small income but spend a lot of money.
cluster 4: fairly young customers (the age variable is small) with high income who spend quite a lot when shopping (the spending score is large). This cluster may contain the most attractive customers for the next marketing campaign.
If the results are difficult to read on the standardized scale, we can use the aggregate function to see the averages on the original scale. This function computes the average of each variable within each cluster.
aggregate(data_mall,by =list(cluster=kmeans_mall$cluster),
FUN = mean)
## cluster Age Annual.Income Spending.Score
## 1 1 39.36842 86.50000 19.57895
## 2 2 53.98462 47.70769 39.96923
## 3 3 25.43860 40.00000 60.29825
## 4 4 32.87500 86.10000 81.52500
Another way to interpret the clustering results is with a scatterplot. If more than two variables were used to build the clusters, the variables are first reduced to two principal components using principal component analysis before the scatterplot is drawn. However, interpreting each cluster then requires interpreting the two principal components, and two components will not necessarily explain the variability of the original data well.
fviz_cluster(kmeans_mall)
The two principal components can be interpreted from their loadings (eigenvectors):
pca_mall <- prcomp(data_mall_standardize)
pca_mall$rotation
## PC1 PC2 PC3
## Age 0.70638235 -0.03014116 0.707188441
## Annual.Income -0.04802398 -0.99883160 0.005397916
## Spending.Score -0.70619946 0.03777499 0.707004506
Common procedures:
The data pre-processing stage is almost the same as for the k-means procedure.
Using the silhouette and WSS coefficients
For illustration, we will use the silhouette method only because it is
easier to determine the number of clusters.
#complete
fviz_nbclust(data_mall_standardize,FUNcluster = hcut,method = "silhouette",
hc_method = "complete",hc_metric = "euclidean")
#average
fviz_nbclust(data_mall_standardize,FUNcluster = hcut,method = "silhouette",
hc_method = "average",hc_metric = "euclidean")
#centroid
fviz_nbclust(data_mall_standardize,FUNcluster = hcut,method = "silhouette",
hc_method = "centroid",hc_metric = "euclidean")
#ward
fviz_nbclust(data_mall_standardize,FUNcluster = hcut,method = "silhouette",
hc_method = "ward.D",hc_metric = "euclidean")
Based on the silhouette coefficients, the complete and average linkage methods choose 5 clusters, while the centroid and ward methods choose 2 and 6 clusters, respectively. For now, we will use 5 clusters with the complete method (when two linkage methods choose the same number of clusters, the resulting clusters are relatively similar, so either one can be chosen).
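As an optional check (a sketch assuming the cluster package is installed), the average silhouette width of the chosen 5-cluster complete-linkage solution can be computed directly:
library(cluster)
hc5 <- hcut(data_mall_standardize, k = 5, hc_method = "complete", hc_metric = "euclidean")
sil <- silhouette(hc5$cluster, dist(data_mall_standardize, method = "euclidean"))
mean(sil[, 3])  # average silhouette width for k = 5 with complete linkage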
Using a Dendrogram
Using dendrograms for data with many observations may not be effective, because selecting the number of clusters from a dendrogram is done visually.
linkage_methods <- c("complete","average","centroid","ward.D")
hc_mall_dend <- lapply(linkage_methods, function(i)
hclust(dist(data_mall_standardize,method = 'euclidean'),method = i)
)
#complete
fviz_dend(hc_mall_dend[[1]])
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
#average
fviz_dend(hc_mall_dend[[2]])
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
#centroid
fviz_dend(hc_mall_dend[[3]])
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
#ward
fviz_dend(hc_mall_dend[[4]])
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
From the four dendrograms, one per linkage method, the number of clusters formed is the same as that obtained with the silhouette coefficient above.
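Note that cluster labels could also be obtained directly from an hclust object by cutting the dendrogram with cutree() (a small sketch, not in the original guide); below we use eclust instead.
cutree(hc_mall_dend[[1]], k = 5)  # complete linkage cut into 5 clusters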
hc_mall <- eclust(data_mall,stand = TRUE,FUNcluster = "hclust",k=5,hc_method = "complete",hc_metric = "euclidean",graph = F)
hc_mall$cluster
## [1] 1 2 1 2 1 2 1 2 3 2 3 2 3 2 1 2 1 2 3 2 1 2 3 2 3 2 3 1 3 2 3 2 3 2 3 2 3
## [38] 2 3 2 3 2 3 1 3 2 3 1 1 1 3 1 1 3 3 3 3 3 1 3 3 1 3 3 3 1 1 3 1 1 3 3 3 3
## [75] 3 1 1 1 1 3 3 1 3 3 1 3 3 1 1 3 3 1 3 1 1 1 3 1 3 1 1 3 3 1 3 1 3 3 3 3 3
## [112] 1 1 1 1 1 3 3 3 3 1 1 1 4 1 4 5 4 5 4 5 4 1 4 5 4 5 4 5 4 5 4 1 4 5 4 5 4
## [149] 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5
## [186] 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4
aggregate(data_mall,by =list(cluster=hc_mall$cluster),
FUN = mean)
## cluster Age Annual.Income Spending.Score
## 1 1 28.35417 50.29167 45.93750
## 2 2 24.80952 25.61905 80.23810
## 3 3 55.33333 47.31579 41.08772
## 4 4 32.69231 86.53846 82.12821
## 5 5 41.68571 88.22857 17.28571
fviz_cluster(hc_mall)
The two principal components can be interpreted from their loadings (eigenvectors):
pca_mall <- prcomp(data_mall_standardize)
pca_mall$rotation
## PC1 PC2 PC3
## Age 0.70638235 -0.03014116 0.707188441
## Annual.Income -0.04802398 -0.99883160 0.005397916
## Spending.Score -0.70619946 0.03777499 0.707004506
The R package corrplot provides a visual exploratory tool for correlation matrices that supports automatic variable reordering to help detect hidden patterns among variables.
corrplot is very easy to use and provides a rich array of plotting options for the visualization method, graphic layout, color, legend, text labels, etc. It also provides p-values and confidence intervals to help users determine the statistical significance of the correlations.
#install.packages("corrplot")
library(corrplot)
## corrplot 0.90 loaded
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble 3.1.4 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
corrplot(mpg %>% select(where(is.numeric)) %>% cor,
method = "pie",type = "lower",
diag = FALSE)
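As noted above, corrplot can also display significance information. Here is a brief optional sketch (not from the original document) that uses cor.mtest() to compute p-values and marks the correlations that are not significant at the 5% level:
mpg_num <- mpg %>% select(where(is.numeric))
ptest <- cor.mtest(mpg_num, conf.level = 0.95)  # p-values and confidence intervals
corrplot(cor(mpg_num), method = "pie", type = "lower", diag = FALSE,
         p.mat = ptest$p, sig.level = 0.05)  # insignificant cells are crossed out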
One way to visualize the correlation matrix when more variables are involved is a heatmap, in which the color of each cell reflects the magnitude of the correlation coefficient. Besides displaying the correlation coefficients as colors, the variables can also be grouped based on their correlations, so that highly correlated variables are placed close together.
library(ggplot2)
We are going to use the mtcars data as an example:
mydata <- mtcars[, c(1,3,4,5,6,7)]
head(mydata)
## mpg disp hp drat wt qsec
## Mazda RX4 21.0 160 110 3.90 2.620 16.46
## Mazda RX4 Wag 21.0 160 110 3.90 2.875 17.02
## Datsun 710 22.8 108 93 3.85 2.320 18.61
## Hornet 4 Drive 21.4 258 110 3.08 3.215 19.44
## Hornet Sportabout 18.7 360 175 3.15 3.440 17.02
## Valiant 18.1 225 105 2.76 3.460 20.22
Then, we compute the correlation matrix:
cormat <- round(cor(mydata),2)
head(cormat)
## mpg disp hp drat wt qsec
## mpg 1.00 -0.85 -0.78 0.68 -0.87 0.42
## disp -0.85 1.00 0.79 -0.71 0.89 -0.43
## hp -0.78 0.79 1.00 -0.45 0.66 -0.71
## drat 0.68 -0.71 -0.45 1.00 -0.71 0.09
## wt -0.87 0.89 0.66 -0.71 1.00 -0.17
## qsec 0.42 -0.43 -0.71 0.09 -0.17 1.00
It is time to create a correlation heatmap using the ggplot2 package. Before that, we need to rearrange the data from wide format to long format using the reshape2 package:
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
melted_cormat <- melt(cormat)
head(melted_cormat)
## Var1 Var2 value
## 1 mpg mpg 1.00
## 2 disp mpg -0.85
## 3 hp mpg -0.78
## 4 drat mpg 0.68
## 5 wt mpg -0.87
## 6 qsec mpg 0.42
The geom_tile() function from the ggplot2 package is used to visualize the correlation matrix:
library(ggplot2)
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) +
geom_tile()
The default plot is not very attractive; in the next sections we will see how to change the appearance of the heatmap. First, we take the lower or upper triangle of the correlation matrix, because a correlation matrix contains redundant information. The helper functions below set half of the matrix to NA:
# Get lower triangle of the correlation matrix
get_lower_tri<-function(cormat){
cormat[upper.tri(cormat)] <- NA
return(cormat)
}
# Get upper triangle of the correlation matrix
get_upper_tri <- function(cormat){
cormat[lower.tri(cormat)]<- NA
return(cormat)
}
Usage:
upper_tri <- get_upper_tri(cormat)
upper_tri
## mpg disp hp drat wt qsec
## mpg 1 -0.85 -0.78 0.68 -0.87 0.42
## disp NA 1.00 0.79 -0.71 0.89 -0.43
## hp NA NA 1.00 -0.45 0.66 -0.71
## drat NA NA NA 1.00 -0.71 0.09
## wt NA NA NA NA 1.00 -0.17
## qsec NA NA NA NA NA 1.00
Let us finish the correlation heatmap by melting the correlation data and dropping the NA values:
# Melt the correlation matrix
library(reshape2)
melted_cormat <- melt(upper_tri, na.rm = TRUE)
# Heatmap
library(ggplot2)
ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1))+
coord_fixed()
If we aim to reorder the correlation matrix according to the correlation coefficients, we can use the hclust function:
#helper function to reorder correlation matrix:
reorder_cormat <- function(cormat){
# Use correlation between variables as distance
dd <- as.dist((1-cormat)/2)
hc <- hclust(dd)
cormat <-cormat[hc$order, hc$order]
}
Reordered correlation data visualization:
# Reorder the correlation matrix
cormat <- reorder_cormat(cormat)
upper_tri <- get_upper_tri(cormat)
# Melt the correlation matrix
melted_cormat <- melt(upper_tri, na.rm = TRUE)
# Create a ggheatmap
ggheatmap <- ggplot(melted_cormat, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+ # minimal theme
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1))+
coord_fixed()
# Print the heatmap
print(ggheatmap)
Then, we just need to add the correlation coefficients to the heatmap:
ggheatmap +
geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.major = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.ticks = element_blank(),
legend.justification = c(1, 0),
legend.position = c(0.6, 0.7),
legend.direction = "horizontal")+
guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
title.position = "top", title.hjust = 0.5))
Dimensionality reduction is a technique for reducing the number of dimensions (features) of a dataset. Often the dataset we want to process has tens or even hundreds of features or columns. With dimensionality reduction, we can reduce the number of features or columns while losing as little information from the dataset as possible. PCA is one method for doing this.
library(factoextra)
#install.packages("ggcorrplot")
library(ggcorrplot)
library(openxlsx)
We can obtain the women's track records dataset by downloading it through this link
The data are the record times in seven women's athletic events from 55 countries at one of the Olympic events, namely the 100 meter, 200 meter, 400 meter, 800 meter, 1500 meter, 3000 meter, and marathon. The first three events are recorded in seconds, while the other four are recorded in minutes.
#Import Data
data_women_records <- openxlsx::read.xlsx("women_track_records.xlsx")
head(data_women_records)
## lOOm 200m 400m 800m 1500m 3000m Marathon country
## 1 11.61 22.94 54.50 2.15 4.43 9.79 178.52 argentina
## 2 11.20 22.35 51.08 1.98 4.13 9.08 152.37 australia
## 3 11.43 23.09 50.62 1.99 4.22 9.34 159.37 austria
## 4 11.41 23.04 52.00 2.00 4.14 8.88 157.85 belgium
## 5 11.46 23.05 53.30 2.16 4.58 9.81 169.98 bermuda
## 6 11.31 23.17 52.80 2.10 4.49 9.77 168.75 brazil
Note: openxlsx::read.xlsx means using the read.xlsx function from the openxlsx package without loading the package first with library(). The head function displays the first 6 rows of the data.
For further analysis, the country names will be used as row names of the data. This is done with the rownames function.
rownames(data_women_records) <- data_women_records$country
data_women_records <- data_women_records[,-8]
data_women_records[,-8] means that we omit the eighth column of the data (the country column).
Let's create a correlation matrix for exploration:
cor_women <- cor(data_women_records)
cor_women
## lOOm 200m 400m 800m 1500m 3000m Marathon
## lOOm 1.0000000 0.9527911 0.8346918 0.7276888 0.7283709 0.7416988 0.6863358
## 200m 0.9527911 1.0000000 0.8569621 0.7240597 0.6983643 0.7098710 0.6855745
## 400m 0.8346918 0.8569621 1.0000000 0.8984052 0.7878417 0.7776369 0.7054241
## 800m 0.7276888 0.7240597 0.8984052 1.0000000 0.9016138 0.8635652 0.7792922
## 1500m 0.7283709 0.6983643 0.7878417 0.9016138 1.0000000 0.9691690 0.8779334
## 3000m 0.7416988 0.7098710 0.7776369 0.8635652 0.9691690 1.0000000 0.8998374
## Marathon 0.6863358 0.6855745 0.7054241 0.7792922 0.8779334 0.8998374 1.0000000
This correlation matrix is easier to read in graphical form, which can be produced as follows:
ggcorrplot(cor_women,type="lower",lab = TRUE)
This generates almost the same correlation heatmap as in the previous section, but in a simpler and easier way. The next stage is applying principal component analysis.
In R, PCA can be implemented with the prcomp function. This function has the arguments scale. and center (by default, center = TRUE and scale. = FALSE). If both arguments are TRUE, PCA is computed from the correlation matrix; if scale. = FALSE, the covariance matrix of the (centered) data is used.
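As a quick illustration (an addition, not in the original text) that scale. = TRUE corresponds to working with the correlation matrix, the variances of the components returned by prcomp match the eigenvalues of the correlation matrix:
prcomp(data_women_records, scale. = TRUE, center = TRUE)$sdev^2  # component variances
eigen(cor(data_women_records))$values  # eigenvalues of the correlation matrix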
pca_women_records <- prcomp(data_women_records,scale.=TRUE,center=TRUE)
summary(pca_women_records)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.4095 0.80848 0.54762 0.35423 0.23198 0.19761 0.14981
## Proportion of Variance 0.8294 0.09338 0.04284 0.01793 0.00769 0.00558 0.00321
## Cumulative Proportion 0.8294 0.92276 0.96560 0.98353 0.99122 0.99679 1.00000
The syntax above gives three kinds of results, namely the Standard deviation, Proportion of Variance, and Cumulative Proportion of each principal component. The standard deviation is the square root of the eigenvalue (characteristic root); here the eigenvalue acts as the variance of each principal component. The proportion of variance is the eigenvalue of a component divided by the sum of all eigenvalues. Proportion of Variance explains how much of the variation in the original variables is captured by each principal component: the larger the value, the better that component represents the original variables.
Cumulative Proportion describes how much of the variation is explained by the principal components cumulatively. For example, using only the first two principal components (PC1 and PC2) already explains more than 92% of the variability in the data. Based on this, we will use only two principal components.
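These proportions can also be reproduced by hand from the standard deviations (a short sketch, not in the original guide):
eig <- pca_women_records$sdev^2  # eigenvalues = variances of the principal components
eig / sum(eig)  # proportion of variance
cumsum(eig / sum(eig))  # cumulative proportion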
fviz_screeplot(pca_women_records,geom="line")
Another way to determine how many principal components to use is a screeplot. The function for displaying a screeplot in R is fviz_screeplot, from the factoextra package.
With a screeplot, the number of principal components is chosen by looking for the component at which the line bends like an elbow. In the plot above, the line forms an elbow at the second component (second dimension), so two principal components are used (PC1 and PC2).
Interpretation
The PCA results can be interpreted using the loadings (eigenvectors) of each principal component. The larger a variable's loading on a particular component, the greater that original variable's contribution to constructing the component. Note also that a negative loading means the original variable contributes in the opposite direction: the larger the value of the original variable, the smaller the value of the principal component.
pca_women_records$rotation
## PC1 PC2 PC3 PC4 PC5 PC6
## lOOm 0.3683561 0.4900597 -0.28601157 0.31938631 0.23116950 0.619825234
## 200m 0.3653642 0.5365800 -0.22981913 -0.08330196 0.04145457 -0.710764580
## 400m 0.3816103 0.2465377 0.51536655 -0.34737748 -0.57217791 0.190945970
## 800m 0.3845592 -0.1554023 0.58452608 -0.04207636 0.62032379 -0.019089032
## 1500m 0.3891040 -0.3604093 0.01291198 0.42953873 0.03026144 -0.231248381
## 3000m 0.3888661 -0.3475394 -0.15272772 0.36311995 -0.46335476 0.009277159
## Marathon 0.3670038 -0.3692076 -0.48437037 -0.67249685 0.13053590 0.142280558
## PC7
## lOOm 0.05217655
## 200m -0.10922503
## 400m 0.20849691
## 800m -0.31520972
## 1500m 0.69256151
## 3000m -0.59835943
## Marathon 0.06959828
Since we only use two components, the loadings are interpreted only for PC1 and PC2. The loadings on PC1 are roughly equal, around 0.37-0.39 for all events. These similar loadings indicate that every original variable contributes about equally to this component, so the PC1 score summarizes the running times across all events. We can therefore use PC1 to determine which country has the fastest runners over all race categories. The loadings on PC2 are positive for the sprint events (100m-400m) and negative for the longer events (800m-marathon). This means that the higher a country's PC2 score, the slower its sprint times and the faster its long-distance times relative to the other countries, and vice versa for a low PC2 score. Therefore, PC2 can be used to contrast a country's short-distance performance with its long-distance performance.
Note: the interpretation of principal components is fairly subjective, so different analysts may interpret them differently.
The last thing that can be interpreted is the score on PC1 and PC2. A score is the coordinate of an observation on the new principal component axes. In the context of the track data above, the observations are countries, so the scores give insight into the running events of each country.
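To make this concrete, here is a minimal sketch (an illustration, not part of the original text) showing that the scores are simply the standardized data multiplied by the loading matrix:
scores_manual <- scale(data_women_records) %*% pca_women_records$rotation
head(scores_manual, 1)  # manually computed scores for the first country
head(pca_women_records$x, 1)  # matches the scores returned by prcomp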
The scores on the principal components can be displayed with the following syntax:
pca_women_records$x
## PC1 PC2 PC3 PC4 PC5
## argentina 0.52726229 -0.674719057 0.61558926 0.04552931 -0.0025575859
## australia -2.09355119 -0.532798687 -0.04317487 0.18737792 -0.2183555124
## austria -1.38044331 -0.274995495 -0.53231305 0.42623519 -0.0255039187
## belgium -1.50998970 0.090963946 -0.08345684 -0.03942270 -0.0303263716
## bermuda 0.38781757 -0.976409482 0.64887210 0.47445652 0.2043216118
## brazil -0.11839732 -0.911515514 0.32214131 0.34096557 -0.0959612766
## burma 1.68203844 0.586191540 0.07582590 -0.14511731 0.6034322854
## canada -2.60813211 -0.695287897 0.10954223 0.03328484 0.1410820801
## chile 0.54783013 1.171926564 -0.23733316 -0.09612578 -0.2156317054
## china 0.64127320 0.980047440 0.05370738 0.02293887 -0.0579061799
## columbia 0.14157212 0.154468463 0.21456812 0.17614694 0.1895221521
## cookis 6.07727834 1.399503624 -0.21282983 -0.28085726 -0.0529187535
## costa 2.61922856 0.305956420 1.09460087 0.41203614 -0.5847203364
## czech -3.05379905 -1.012730429 -1.04879114 0.37317120 -0.0258660649
## denmark -1.11637538 0.540131452 0.41098527 -0.17591883 -0.1041309755
## domrep 2.29543640 -0.616097039 0.64633093 -0.26243482 0.3964092177
## finland -2.18183995 -0.670248707 0.08087437 0.08706670 0.3299422783
## france -1.89216992 -0.443832046 0.14465457 -0.05322917 -0.1896245118
## gdr -3.50601681 -1.202500275 -0.52603497 -0.12430576 0.0884050075
## frg -2.92577741 -0.437137537 -0.20120561 -0.02584472 0.0480344770
## gbni -2.78315609 -0.578349737 0.13304952 -0.13024804 0.0417404253
## greece 0.81424542 0.233714829 -0.15991295 -0.08693903 -0.4548804436
## guatemal 3.22729824 -0.919057694 0.44304609 -0.09073846 0.3545878042
## hungary -1.47721337 0.059384256 -0.15006131 0.12504871 0.0924438246
## india 1.01453673 0.253605564 -0.51071114 0.05275360 0.0509072296
## indonesia 2.11236466 -0.378269923 0.33669289 -0.18769360 0.3751519942
## ireland -1.11735099 0.513043238 0.44713600 -0.08797261 -0.0255638127
## israel -0.14296749 0.155867013 0.75136633 -0.16605776 -0.2905821673
## italy -2.13954076 0.351991902 -0.07731676 -0.29052538 -0.2244816386
## japan -0.05923268 0.657973909 0.40042678 0.42998430 0.1230753651
## kenya -0.43089430 0.480611126 -0.75309705 -0.32602370 -0.0643810184
## korea 1.23386149 0.814398564 0.55640268 0.23959883 0.0129823046
## dprkorea 0.46229683 1.717885467 -1.91963681 0.34183702 0.3375496932
## luxembou 1.30174495 1.182174513 -0.10512035 -0.03151618 -0.4495547106
## malaysia 2.34053535 -0.001259817 0.11819610 0.84655411 0.1277242627
## mauritiu 4.23384717 -1.180202966 -0.08772947 -1.39411471 -0.1640608518
## mexico -0.06348187 0.568969717 -0.09040891 0.45214535 -0.2961853317
## netherla -1.79442661 -0.047085355 0.14270838 -0.10800877 -0.3625892311
## nz -1.51125893 0.377920885 0.06112304 0.36901628 0.2627172276
## norway -1.48300990 0.904195203 0.38741591 -0.14529920 0.1311735490
## png 3.98086034 -0.340244237 -0.28834865 -0.29398974 0.1595586421
## philippi 1.64018955 -0.876042797 0.22214354 -0.02227534 0.2056514197
## poland -2.67209659 -0.702323548 -0.59597055 -0.02384024 0.0364171238
## portugal -0.22428415 1.252662130 0.46217831 -0.02349218 0.2384703479
## rumania -2.02982623 0.618046803 -0.83844940 -0.46958197 -0.0740123635
## singapor 1.97013151 0.951982007 -0.38929662 0.40329233 0.0732785524
## spain -0.35565287 0.925478452 -0.05010366 -0.10341725 0.0927143013
## sweden -1.82775494 -0.254820864 0.24805824 -0.14902201 -0.0006283576
## switzerl -1.34665382 0.514180988 0.24518048 -0.22415740 -0.0858444392
## taipei -0.50011940 -1.234653297 0.31159932 -0.04781148 0.0094018502
## thailand 1.95317730 -0.139873925 0.81360772 0.65551039 -0.1576314482
## turkey 1.60820411 0.594342560 0.16105018 -0.81523551 0.1477445097
## usa -3.33581190 -0.685104574 0.38123893 -0.27057989 -0.1954822627
## ussr -3.46468721 -0.245078447 -0.64637481 -0.20743799 -0.0824770130
## wsamoa 8.33288156 -2.326979228 -1.49263486 0.40428466 -0.3425812545
## PC6 PC7
## argentina 0.457907931 -0.079962873
## australia 0.137966927 -0.009815157
## austria -0.081682704 -0.106172559
## belgium 0.062877332 0.138490873
## bermuda -0.049426348 0.047829472
## brazil -0.300448073 -0.006724259
## burma 0.277326099 0.055825566
## canada -0.116436013 -0.117243747
## chile 0.128882585 0.003967955
## china 0.046618764 0.172340012
## columbia -0.324561612 -0.122442477
## cookis -0.055706058 -0.289275038
## costa -0.065656582 -0.044687998
## czech 0.047409019 0.188230868
## denmark -0.179964399 0.322413538
## domrep -0.006777189 0.321472376
## finland -0.031636754 -0.182734885
## france -0.035807689 0.053223788
## gdr -0.047152707 -0.176067724
## frg -0.191479813 0.086729269
## gbni 0.012368224 0.059974475
## greece 0.093873143 -0.121106747
## guatemal -0.279702884 -0.030512752
## hungary 0.061704372 -0.002880404
## india 0.134490600 -0.442230795
## indonesia -0.013434970 -0.058448424
## ireland -0.149414709 -0.043637389
## israel -0.090944777 -0.094991490
## italy 0.012049601 0.080196477
## japan -0.182028982 0.141996717
## kenya 0.119151878 -0.011054538
## korea -0.028182131 -0.027699883
## dprkorea -0.561883421 -0.072075631
## luxembou -0.115327186 0.123404789
## malaysia 0.374021886 -0.140160623
## mauritiu -0.321676670 -0.209056067
## mexico 0.402826018 -0.127280417
## netherla 0.051567793 -0.071966309
## nz 0.078135023 0.198553009
## norway 0.226377016 0.086297791
## png 0.111686449 0.039358192
## philippi 0.267820061 -0.094783185
## poland 0.144920475 -0.248593496
## portugal -0.041084319 0.041047319
## rumania -0.050795933 0.167824160
## singapor 0.091437114 0.018007615
## spain 0.124420761 -0.021530131
## sweden -0.159842868 0.036333974
## switzerl 0.047330468 0.068598802
## taipei 0.024237079 -0.093082005
## thailand -0.482014923 -0.032198783
## turkey 0.287104548 0.191270368
## usa -0.047049486 0.040368337
## ussr 0.097960158 0.017756855
## wsamoa 0.087647875 0.376903191
To make it easier to interpret the score value, the graph below is used.
fviz_pca_ind(pca_women_records,col.ind = "darkred")
Based on the score plot, the country with the slowest recorded times across all events is wsamoa, because its PC1 (Dim1) score is the largest of all. Its strongly negative PC2 score also indicates that, relative to the other countries, its long-distance times lag even further behind than its sprint times. The country with the fastest runners across all events is gdr, whose PC1 score is the most negative.
In this practical note, we will use the ggpubr and MASS packages.
library(ggpubr)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
Swiss Data
We are going to use the swiss data set, which consists of a standardized fertility measure and socio-economic indicators for each of 47 French-speaking provinces of Switzerland at about 1888. It is a data frame with 47 observations on 6 variables, each of which is in percent, i.e., in [0, 100].
All variables but ‘Fertility’ give proportions of the population.
data(swiss)
head(swiss)
## Fertility Agriculture Examination Education Catholic
## Courtelary 80.2 17.0 15 12 9.96
## Delemont 83.1 45.1 6 9 84.84
## Franches-Mnt 92.5 39.7 5 5 93.40
## Moutier 85.8 36.5 12 7 33.77
## Neuveville 76.9 43.5 17 15 5.16
## Porrentruy 76.1 35.3 9 7 90.57
## Infant.Mortality
## Courtelary 22.2
## Delemont 22.2
## Franches-Mnt 20.2
## Moutier 20.3
## Neuveville 20.6
## Porrentruy 26.6
MDS Implementation:
# calculate distance matrix
distance_swiss <- dist(swiss, method = "euclidean")
# Metric MDS
metricMDS_swiss <- cmdscale(d = distance_swiss,k = 2)
colnames(metricMDS_swiss) <- c("Dim.1","Dim.2")
metricMDS_swiss <- as.data.frame(metricMDS_swiss)
head(metricMDS_swiss)
## Dim.1 Dim.2
## Courtelary 37.032433 -17.434879
## Delemont -42.797334 -14.687668
## Franches-Mnt -51.081639 -19.274036
## Moutier 7.716707 -5.458722
## Neuveville 35.032658 5.126097
## Porrentruy -44.161953 -25.922412
Kruskal's non-metric MDS:
KruskalMDS_swiss <- isoMDS(d = distance_swiss,k = 2)
## initial value 5.463800
## iter 5 value 4.499103
## iter 5 value 4.495335
## iter 5 value 4.492669
## final value 4.492669
## converged
KruskalMDS_swiss <- as.data.frame(KruskalMDS_swiss$points)
colnames(KruskalMDS_swiss) <- c("Dim.1","Dim.2")
head(KruskalMDS_swiss)
## Dim.1 Dim.2
## Courtelary 38.850496 -16.154674
## Delemont -42.676573 -13.720989
## Franches-Mnt -53.587659 -21.335763
## Moutier 6.735536 -4.604116
## Neuveville 35.622307 4.633972
## Porrentruy -44.739479 -25.495702
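Note that isoMDS() also returns a stress value (the "final value" printed above, in percent), which indicates how well the two-dimensional configuration preserves the original distances; a small optional sketch (not in the original) keeping the full result to inspect it:
kruskal_full <- isoMDS(d = distance_swiss, k = 2, trace = FALSE)
kruskal_full$stress  # final stress (in percent); lower values indicate a better fit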
MDS Plots
Metric MDS:
ggscatter(metricMDS_swiss,x="Dim.1",y="Dim.2",color = "steelblue",
label = rownames(swiss),
repel = TRUE)
## Warning: ggrepel: 6 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Kruskal’s non-metric MDS:
ggscatter(KruskalMDS_swiss,x="Dim.1",y="Dim.2",color = "steelblue",
label = rownames(swiss),
repel = TRUE)
## Warning: ggrepel: 6 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps