Introduction

GDP (Gross Domestic Product) is the economic figure that receives the most attention, since it is widely regarded as the single best measure of people's welfare, and it can serve as a benchmark of a nation's progress. This research uses PCA for dimension reduction and then clusters countries based on their GDP from 2015 to 2021.

Library

library(readxl)
library(tidyr)
library(factoextra)
library(cluster)

Data

The following data is taken from The World Bank.

# Load the GDP data downloaded from The World Bank

df <- read_excel("GDP.xls")
head(df)
## # A tibble: 6 × 66
##   `Country Name`       `Country Code` `Indicator Name` `Indicator Code`   `1960`
##   <chr>                <chr>          <chr>            <chr>               <dbl>
## 1 Aruba                ABW            GDP (current US… NY.GDP.MKTP.CD   NA      
## 2 Africa Eastern and … AFE            GDP (current US… NY.GDP.MKTP.CD    2.13e10
## 3 Afghanistan          AFG            GDP (current US… NY.GDP.MKTP.CD    5.38e 8
## 4 Africa Western and … AFW            GDP (current US… NY.GDP.MKTP.CD    1.04e10
## 5 Angola               AGO            GDP (current US… NY.GDP.MKTP.CD   NA      
## 6 Albania              ALB            GDP (current US… NY.GDP.MKTP.CD   NA      
## # ℹ 61 more variables: `1961` <dbl>, `1962` <dbl>, `1963` <dbl>, `1964` <dbl>,
## #   `1965` <dbl>, `1966` <dbl>, `1967` <dbl>, `1968` <dbl>, `1969` <dbl>,
## #   `1970` <dbl>, `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>,
## #   `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>,
## #   `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>,
## #   `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>,
## #   `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, …

I decided to restrict this research to the last seven years (2015-2021) because the earlier years contain a significant amount of missing data, and I wanted to work with complete, reliable observations.

# Keep the metadata columns and only the 2015-2021 GDP columns (drop 1960-2014)
df2 <- df[,-5:-59]
# Drop NA
df2 <- na.omit(df2)
# Checks for missing values 
sum(is.na(df2))
## [1] 0
# Remove aggregate rows (regions, income groups) so that only individual countries remain
df2 <- df2[-c(2,4,8,37,49,51,60,61,62,63,64,67,71,72,91,94,98,99,100,101,103,121,127,128,129,131,132,134,144,147,152,160,171,173,181,186,187,193,203,204,205,215,216,220,222,224,225,233,240),]
# Labels
df2.country <- df2$`Country Code`
# Select the 2015-2021 GDP columns
my_data <- df2[,5:11]
# Standardize the variables
data_scale <- scale(my_data)
# Distance
data_dist <- dist(data_scale)

PCA

PCA is a statistical procedure that summarizes the information in large data tables using a smaller set of summary indicators (principal components) that can be more easily visualized and evaluated.
# The covariance matrix of the standardized data (equal to the correlation matrix)
data_cov <- cov(data_scale)
data_cov
##           2015      2016      2017      2018      2019      2020      2021
## 2015 1.0000000 0.9995937 0.9994002 0.9984341 0.9983628 0.9967502 0.9929270
## 2016 0.9995937 1.0000000 0.9994958 0.9981956 0.9981999 0.9966330 0.9920770
## 2017 0.9994002 0.9994958 1.0000000 0.9993412 0.9992544 0.9979195 0.9946644
## 2018 0.9984341 0.9981956 0.9993412 1.0000000 0.9998504 0.9993608 0.9975006
## 2019 0.9983628 0.9981999 0.9992544 0.9998504 1.0000000 0.9995587 0.9975906
## 2020 0.9967502 0.9966330 0.9979195 0.9993608 0.9995587 1.0000000 0.9987986
## 2021 0.9929270 0.9920770 0.9946644 0.9975006 0.9975906 0.9987986 1.0000000
Because the data have been standardized, this covariance matrix is also the correlation matrix, so the main diagonal is always 1. The off-diagonal entries are all close to 1, which already suggests that the yearly GDP values are highly correlated.
# Eigenvalues and eigenvectors
data_eigen <- eigen(data_cov)
data_eigen
## eigen() decomposition
## $values
## [1] 6.986837e+00 1.177578e-02 6.190898e-04 5.043526e-04 1.281898e-04
## [6] 1.131905e-04 2.286294e-05
## 
## $vectors
##            [,1]        [,2]        [,3]         [,4]       [,5]         [,6]
## [1,] -0.3778908 -0.40041654  0.71629253 -0.344439707  0.1542596  0.005489748
## [2,] -0.3778220 -0.45917586 -0.35782442 -0.283963898 -0.4322999 -0.057846329
## [3,] -0.3781400 -0.24850895 -0.11493321  0.610265084 -0.2684117  0.379448287
## [4,] -0.3782809  0.05723015  0.03697213  0.432799430  0.1716062 -0.791346719
## [5,] -0.3782881  0.07167459 -0.17576042  0.077026163  0.7121901  0.418644376
## [6,] -0.3780826  0.29223319 -0.44434063 -0.484762898  0.0787085 -0.135199235
## [7,] -0.3772459  0.68793506  0.34055805 -0.007673069 -0.4176332  0.181303851
##             [,7]
## [1,] -0.20328614
## [2,]  0.49749860
## [3,] -0.43999371
## [4,]  0.09626889
## [5,]  0.36394812
## [6,] -0.56102208
## [7,]  0.24719165
The eigenvalues will be used on the y-axis of the scree plot below, and the eigenvectors ($vectors) give the principal component coefficients (loadings).
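As a quick check, the proportion of variance explained by each component can be computed directly from these eigenvalues; a minimal sketch using the data_eigen object above:
# Proportion of variance explained by each component (eigenvalue / total)
pve <- data_eigen$values / sum(data_eigen$values)
round(pve, 4)
round(cumsum(pve), 4)
These proportions should match the "Proportion of Variance" row reported by summary(pr.out) below.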
# Main component function
pr.out <- prcomp(x = my_data, center = TRUE, scale. = TRUE)
# Determination of the number of principal components
summary(pr.out)
## Importance of components:
##                           PC1     PC2     PC3     PC4     PC5     PC6      PC7
## Standard deviation     2.6433 0.10852 0.02488 0.02246 0.01132 0.01064 0.004782
## Proportion of Variance 0.9981 0.00168 0.00009 0.00007 0.00002 0.00002 0.000000
## Cumulative Proportion  0.9981 0.99980 0.99989 0.99996 0.99998 1.00000 1.000000
The output above indicates that two principal components are sufficient: the first component alone explains 99.81% of the variance, and together the first two capture 99.98% of the diversity in the data.
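The same information can also be plotted; a small optional sketch using factoextra's fviz_eig(), which charts the percentage of variance explained per component:
# Percentage of variance explained per principal component
fviz_eig(pr.out, addlabels = TRUE) + 
  labs(title = "Proportion of variance explained")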
# Calculates the principal component coefficients
pr.out
## Standard deviations (1, .., p=7):
## [1] 2.64326248 0.10851624 0.02488151 0.02245780 0.01132209 0.01063910 0.00478152
## 
## Rotation (n x k) = (7 x 7):
##            PC1         PC2         PC3          PC4        PC5          PC6
## 2015 0.3778908 -0.40041654 -0.71629253  0.344439707  0.1542596 -0.005489748
## 2016 0.3778220 -0.45917586  0.35782442  0.283963898 -0.4322999  0.057846329
## 2017 0.3781400 -0.24850895  0.11493321 -0.610265084 -0.2684117 -0.379448287
## 2018 0.3782809  0.05723015 -0.03697213 -0.432799430  0.1716062  0.791346719
## 2019 0.3782881  0.07167459  0.17576042 -0.077026163  0.7121901 -0.418644376
## 2020 0.3780826  0.29223319  0.44434063  0.484762898  0.0787085  0.135199235
## 2021 0.3772459  0.68793506 -0.34055805  0.007673069 -0.4176332 -0.181303851
##              PC7
## 2015  0.20328614
## 2016 -0.49749860
## 2017  0.43999371
## 2018 -0.09626889
## 2019 -0.36394812
## 2020  0.56102208
## 2021 -0.24719165
The loadings above are the eigenvectors computed earlier with their signs flipped from negative to positive (and vice versa); the sign of an eigenvector is arbitrary, so both represent the same components.
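This sign ambiguity can be checked directly; a minimal sketch comparing the prcomp() output with the eigen decomposition above (the loadings should agree up to sign, and the squared standard deviations should equal the eigenvalues):
# Loadings agree with the eigenvectors up to an arbitrary sign per component
all.equal(abs(unname(pr.out$rotation)), abs(data_eigen$vectors))
# Squared standard deviations equal the eigenvalues
all.equal(pr.out$sdev^2, data_eigen$values)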
# Forming Scree plots
scree_data <- data.frame(eigen_value = eigen(data_cov)$values, PC = 1:7)
scree_data
##    eigen_value PC
## 1 6.986837e+00  1
## 2 1.177578e-02  2
## 3 6.190898e-04  3
## 4 5.043526e-04  4
## 5 1.281898e-04  5
## 6 1.131905e-04  6
## 7 2.286294e-05  7
plot(x = scree_data$eigen_value, type = 'b',
     xlab = 'Main Component', ylab = 'Eigen Value', main = 'Scree Plot')

Based on the figure above, the curve flattens out (the "elbow") at the second component, again indicating that two main components are sufficient.
After obtaining the principal component equations, I can transform the data by projecting the observations onto the constructed components.
# Data Reconstruction
head(pr.out$x[,1:2])
##              PC1           PC2
## [1,] -0.60646690  0.0110077439
## [2,] -0.58388289  0.0077129365
## [3,] -0.51197768  0.0003631248
## [4,] -0.59035584  0.0125313193
## [5,] -0.60649560  0.0111888915
## [6,] -0.04272454 -0.0264252192
This shows that the data can be condensed into two dimensions (variables) while still describing almost all of its diversity.
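These scores can also be reproduced by hand, since they are just the standardized data projected onto the loadings; a small sketch (data_scale uses the same centering and scaling as prcomp() above):
# Project the standardized data onto the first two loading vectors
manual_scores <- data_scale %*% pr.out$rotation[, 1:2]
head(manual_scores)
all.equal(unname(manual_scores), unname(pr.out$x[, 1:2]))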
# Visualization of reconstructed data
fviz_pca(pr.out) + 
  labs(title="Variables loadings for PCA")

The biplot above visualizes both the variable loadings and the observation scores on the first two principal components.

K-means

K-means clustering is commonly employed when there is no single outcome variable to predict; instead, it is used when we have a set of features and want to find groups of observations that share similar properties.
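Before using the convenience functions below, it may help to see what the elbow method actually computes: the total within-cluster sum of squares for a range of k. A minimal sketch (fviz_nbclust() automates this):
# Total within-cluster sum of squares for k = 1, ..., 10
set.seed(123)  # kmeans uses random initial centers
wss <- sapply(1:10, function(k) kmeans(data_scale, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")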

# Calculate how many clusters
fviz_nbclust(data_scale, kmeans, method = "wss") + 
  labs(subtitle = "Elbow Method")

The elbow method suggests 3 clusters.

fviz_nbclust(data_scale, kmeans, method = "silhouette") + 
  labs(subtitle = "Silhouette Method")

The silhouette method suggests 2 clusters.

fviz_nbclust(data_scale, kmeans, method = "gap_stat") + 
  labs(subtitle = "Gap Statistic Method")

The gap statistic method suggests 1 cluster.

Since the three approaches yield different values, I chose the middle value, which is 2.

Visualization of the K-means clustering

k.out <- kmeans(data_scale, centers = 2, nstart = 100)

k.clusters <- k.out$cluster
rownames(data_scale) <- paste(df2$`Country Code`, 1:dim(df2)[1], sep = "_")

fviz_cluster(list(data = data_scale, cluster = k.clusters))

The plot above depicts the K-means clustering with two groups.
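To see which countries fall into each group, the cluster vector can be cross-referenced with the country names; a small exploratory sketch (the 1/2 labels that kmeans assigns are arbitrary, so I look up the smaller cluster):
# Cluster sizes and the members of the smaller cluster
table(k.clusters)
df2$`Country Name`[k.clusters == which.min(table(k.clusters))]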

PAM

PAM stands for Partitioning Around Medoids. It is a partitioning algorithm whose goal is to minimize the average dissimilarity of objects to their nearest selected object (medoid).
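PAM can also be run directly through the cluster package loaded above; a brief sketch (k = 2 is used here purely for illustration, the number of clusters is assessed below) showing that the medoids are actual observations rather than averaged centroids:
# Fit PAM directly; the medoids are real rows of the scaled data
pam_fit <- pam(data_scale, k = 2)
pam_fit$medoids
table(pam_fit$clustering)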

fviz_nbclust(data_scale, FUNcluster = pam, "wss")+ 
  labs(subtitle = "Elbow Method")

The elbow method suggests 3 clusters.

fviz_nbclust(data_scale, FUNcluster = pam, "silhouette")+ 
  labs(subtitle = "Silhouette Method")

The silhouette method suggests 2 clusters.

fviz_nbclust(data_scale, FUNcluster = pam, "gap_stat")+ 
  labs(subtitle = "Gap Statistic Method")  

The gap statistic method suggests 1 cluster.

Since the three approaches again yield different values, I chose the middle value, which is 2.

Visualization of the PAM clustering

pam_cl <- eclust(data_scale, "pam", k = 2) 

The plot above depicts the PAM clustering with two groups. The silhouette widths for the two clusters are computed below.
fviz_silhouette(pam_cl)
##   cluster size ave.sil.width
## 1       1  194          0.97
## 2       2    2          0.56

The first cluster includes 194 countries, whereas the second includes only two.
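The two countries in the second cluster can be looked up from the cluster assignments of the eclust/pam fit above; a small sketch (the numeric label of the small cluster may vary):
# Identify the members of the smaller PAM cluster
small_cl <- which.min(table(pam_cl$clustering))
df2$`Country Name`[pam_cl$clustering == small_cl]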

Hierarchical

One of hierarchical clustering’s key benefits is that it gives precise details about which observations are most similar to one another.
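The dendrogram below uses complete linkage; as a side note, the strength of different linkage choices can be compared with the agglomerative coefficient from cluster::agnes() (values closer to 1 indicate a stronger clustering structure). A small optional sketch:
# Agglomerative coefficient for several linkage methods
linkages <- c("average", "single", "complete", "ward")
sapply(linkages, function(m) agnes(data_scale, method = m)$ac)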

# Dendrogram
h.out <- hclust(data_dist, method = "complete")
plot(h.out)
# Divide into 2 segments
rect.hclust(h.out, k = 2, border = 2:5)

The dendrogram above shows the observations separated into two clusters.
# Calculate how many clusters
fviz_nbclust(data_scale, FUN = hcut, method = "wss")+ 
  labs(subtitle = "Elbow Method")

The elbow method suggests 3 clusters.

fviz_nbclust(data_scale, FUN = hcut, method = "silhouette")+ 
  labs(subtitle = "Silhouette Method")

The silhouette method suggests 2 clusters.

fviz_nbclust(data_scale, FUN = hcut, method = "gap_stat")+ 
  labs(subtitle = "Gap Statistic Method")

The gap statistic method suggests 1 cluster.

The three methods again yield different values. However, the dendrogram and the silhouette method both point to two, so I opt for 2 clusters.

Visualization of the Hierarchical clustering

h.clusters <- cutree(h.out, k=2)
rownames(data_scale) <- paste(df2$`Country Code`, 1:dim(df2)[1], sep = "_")
fviz_cluster(list(data = data_scale, cluster = h.clusters))

The plot above depicts the hierarchical clustering with two groups.
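As a final check before concluding, the agreement between the two-cluster solutions can be summarized with a simple cross-tabulation (the numeric labels may be swapped between methods, but matched partitions show up as a near-diagonal or anti-diagonal table):
# Compare the K-means and hierarchical cluster assignments
table(kmeans = k.clusters, hclust = h.clusters)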

Conclusion

Every algorithm produced comparable findings. The PCA results show that two principal components are enough to summarize the data, and K-means, PAM, and hierarchical clustering all produced two clusters. The United States and China have significantly higher GDP values than the rest of the world. It can therefore be inferred that the world's countries can be classified into two groups based on GDP, with the United States and China having the strongest economies from 2015 to 2021 compared to other countries.