The GDP (Gross Domestic Product), which is the economic figure that receives the most concern since it is thought to be the greatest single measure of people’s welfare, can be used as one of the benchmarks for a nation’s progress.This research intends to use PCA method for dimension reduction and cluster countries based on GDP contribution from 2015 to 2021.
library(readxl)
library(tidyr)
library(factoextra)
library(cluster)
The following data is taken from The World Bank.
# Loading readxl package
df <- read_excel("GDP.xls")
head(df)
## # A tibble: 6 × 66
## `Country Name` `Country Code` `Indicator Name` `Indicator Code` `1960`
## <chr> <chr> <chr> <chr> <dbl>
## 1 Aruba ABW GDP (current US… NY.GDP.MKTP.CD NA
## 2 Africa Eastern and … AFE GDP (current US… NY.GDP.MKTP.CD 2.13e10
## 3 Afghanistan AFG GDP (current US… NY.GDP.MKTP.CD 5.38e 8
## 4 Africa Western and … AFW GDP (current US… NY.GDP.MKTP.CD 1.04e10
## 5 Angola AGO GDP (current US… NY.GDP.MKTP.CD NA
## 6 Albania ALB GDP (current US… NY.GDP.MKTP.CD NA
## # ℹ 61 more variables: `1961` <dbl>, `1962` <dbl>, `1963` <dbl>, `1964` <dbl>,
## # `1965` <dbl>, `1966` <dbl>, `1967` <dbl>, `1968` <dbl>, `1969` <dbl>,
## # `1970` <dbl>, `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>,
## # `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>,
## # `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>,
## # `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>,
## # `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, …
I decided to restrict this research to the last seven years due to a significant number of missing data at the beginning of the year and in order to acquire accurate data.
# Observations only focus on 2015 to 2021
df2 <- df[,-5:-59]
# Drop NA
df2 <- na.omit(df2)
# Checks for missing values
sum(is.na(df2))
## [1] 0
# Deletes some country names
df2 <- df2[-c(2,4,8,37,49,51,60,61,62,63,64,67,71,72,91,94,98,99,100,101,103,121,127,128,129,131,132,134,144,147,152,160,171,173,181,186,187,193,203,204,205,215,216,220,222,224,225,233,240),]
# Labels
df2.country <- df2$`Country Code`
# Scale data
my_data <- df2[,5:11]
data_scale <- scale(my_data)
# Distance
data_dist <- dist(data_scale)
# The variance matrix
data_cov <- cov(data_scale)
data_cov
## 2015 2016 2017 2018 2019 2020 2021
## 2015 1.0000000 0.9995937 0.9994002 0.9984341 0.9983628 0.9967502 0.9929270
## 2016 0.9995937 1.0000000 0.9994958 0.9981956 0.9981999 0.9966330 0.9920770
## 2017 0.9994002 0.9994958 1.0000000 0.9993412 0.9992544 0.9979195 0.9946644
## 2018 0.9984341 0.9981956 0.9993412 1.0000000 0.9998504 0.9993608 0.9975006
## 2019 0.9983628 0.9981999 0.9992544 0.9998504 1.0000000 0.9995587 0.9975906
## 2020 0.9967502 0.9966330 0.9979195 0.9993608 0.9995587 1.0000000 0.9987986
## 2021 0.9929270 0.9920770 0.9946644 0.9975006 0.9975906 0.9987986 1.0000000
# Eigen Value and Eigen Vector
data_eigen <- eigen(data_cov)
data_eigen
## eigen() decomposition
## $values
## [1] 6.986837e+00 1.177578e-02 6.190898e-04 5.043526e-04 1.281898e-04
## [6] 1.131905e-04 2.286294e-05
##
## $vectors
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] -0.3778908 -0.40041654 0.71629253 -0.344439707 0.1542596 0.005489748
## [2,] -0.3778220 -0.45917586 -0.35782442 -0.283963898 -0.4322999 -0.057846329
## [3,] -0.3781400 -0.24850895 -0.11493321 0.610265084 -0.2684117 0.379448287
## [4,] -0.3782809 0.05723015 0.03697213 0.432799430 0.1716062 -0.791346719
## [5,] -0.3782881 0.07167459 -0.17576042 0.077026163 0.7121901 0.418644376
## [6,] -0.3780826 0.29223319 -0.44434063 -0.484762898 0.0787085 -0.135199235
## [7,] -0.3772459 0.68793506 0.34055805 -0.007673069 -0.4176332 0.181303851
## [,7]
## [1,] -0.20328614
## [2,] 0.49749860
## [3,] -0.43999371
## [4,] 0.09626889
## [5,] 0.36394812
## [6,] -0.56102208
## [7,] 0.24719165
# Main component function
pr.out <- prcomp(x = my_data, center = TRUE, scale. = TRUE)
# Determination of the number of principal components
summary(pr.out)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.6433 0.10852 0.02488 0.02246 0.01132 0.01064 0.004782
## Proportion of Variance 0.9981 0.00168 0.00009 0.00007 0.00002 0.00002 0.000000
## Cumulative Proportion 0.9981 0.99980 0.99989 0.99996 0.99998 1.00000 1.000000
# Calculates the principal component coefficients
pr.out
## Standard deviations (1, .., p=7):
## [1] 2.64326248 0.10851624 0.02488151 0.02245780 0.01132209 0.01063910 0.00478152
##
## Rotation (n x k) = (7 x 7):
## PC1 PC2 PC3 PC4 PC5 PC6
## 2015 0.3778908 -0.40041654 -0.71629253 0.344439707 0.1542596 -0.005489748
## 2016 0.3778220 -0.45917586 0.35782442 0.283963898 -0.4322999 0.057846329
## 2017 0.3781400 -0.24850895 0.11493321 -0.610265084 -0.2684117 -0.379448287
## 2018 0.3782809 0.05723015 -0.03697213 -0.432799430 0.1716062 0.791346719
## 2019 0.3782881 0.07167459 0.17576042 -0.077026163 0.7121901 -0.418644376
## 2020 0.3780826 0.29223319 0.44434063 0.484762898 0.0787085 0.135199235
## 2021 0.3772459 0.68793506 -0.34055805 0.007673069 -0.4176332 -0.181303851
## PC7
## 2015 0.20328614
## 2016 -0.49749860
## 2017 0.43999371
## 2018 -0.09626889
## 2019 -0.36394812
## 2020 0.56102208
## 2021 -0.24719165
# Forming Scree plots
scree_data <- data.frame(eigen_value = eigen(data_cov)$values, PC = 1:7)
scree_data
## eigen_value PC
## 1 6.986837e+00 1
## 2 1.177578e-02 2
## 3 6.190898e-04 3
## 4 5.043526e-04 4
## 5 1.281898e-04 5
## 6 1.131905e-04 6
## 7 2.286294e-05 7
plot(x = scree_data$eigen_value, type = 'b',
xlab = 'Main Component', ylab = 'Eigen Value', main = 'Scree Plot')
# Data Reconstruction
head(pr.out$x[,1:2])
## PC1 PC2
## [1,] -0.60646690 0.0110077439
## [2,] -0.58388289 0.0077129365
## [3,] -0.51197768 0.0003631248
## [4,] -0.59035584 0.0125313193
## [5,] -0.60649560 0.0111888915
## [6,] -0.04272454 -0.0264252192
# Visualization of reconstructed data
fviz_pca(pr.out) +
labs(title="Variables loadings for PCA")
k.out <- kmeans(data_scale, centers = 2, nstart = 100)
k.clusters <- k.out$cluster
rownames(data_scale) <- paste(df2$`Country Code`, 1:dim(df2)[1], sep = "_")
fviz_cluster(list(data = data_scale, cluster = k.clusters))
fviz_nbclust(data_scale, FUNcluster = pam, "wss")+
labs(subtitle = "Elbow Method")
Clusters for the data is 3.
fviz_nbclust(data_scale, FUNcluster = pam, "silhouette")+
labs(subtitle = "Silhouette Method")
Clusters for the data is 2.
fviz_nbclust(data_scale, FUNcluster = pam, "gap_stat")+
labs(subtitle = "Gap Statistic Method")
Clusters for the data is 1.
pam_cl<-eclust(data_scale, "pam", k= 2)
fviz_silhouette(pam_cl)
## cluster size ave.sil.width
## 1 1 194 0.97
## 2 2 2 0.56
# Dendrogram
h.out <- hclust(data_dist, method = "complete")
plot(h.out)
# Divide into 2 segments
rect.hclust(h.out, k = 2, border = 2:5)
# Calculate how many clusters
fviz_nbclust(data_scale, FUN = hcut, method = "wss")+
labs(subtitle = "Elbow Method")
Clusters for the data is 3.
fviz_nbclust(data_scale, FUN = hcut, method = "silhouette")+
labs(subtitle = "Silhouette Method")
Clusters for the data is 2.
fviz_nbclust(data_scale, FUN = hcut, method = "gap_stat")+
labs(subtitle = "Gap Statistic Method")
Clusters for the data is 1.
h.clusters <- cutree(h.out, k=2)
rownames(data_scale) <- paste(df2$`Country Code`, 1:dim(df2)[1], sep = "_")
fviz_cluster(list(data = data_scale, cluster = h.clusters))
Every algorithm produced comparable findings. Subsequently, based on the PCA results, there are two primary components. Following that, K-means, PAM, and Hierarchical clustering all generated findings with two clusters. The United States and China have significantly higher GDP values than the rest of the world. It can be inferred from the analysis that all of the world’s nations can be classified into two groups based on GDP contribution, with the United States and China having the strongest economies from 2015 to 2021 when compared to other countries.