Introduction

In this project I conducted dimension reduction on 95 democracy indicators from the V-Dem dataset. The full dataset contains more than 4,000 variables, but only a subset is used here, since the aim of the study is to test how data of this kind can be handled with dimension reduction techniques.

Analysis

As the first step of the analysis, I downloaded and cleaned the data.

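# Load the full V-Dem country-year dataset
dem<-read.csv("V-Dem-CY-Full+Others-v14.csv")
# Keep the observations for 2019 only
demo<-dem[dem$year==2019, ]
# Drop the identifier columns from country_text_id through COWcode
dem1<-demo[, -c(which(names(demo)=="country_text_id"):which(names(demo)=="COWcode"))]
# Keep the country name (column 1) and the first 95 indicators
dem_extra<-dem1[, 1:96]
# Remove countries with missing values
dem_extracted<-na.omit(dem_extra)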

Checking the dimensions of the data (173 countries; the country name plus 95 indicators):

dim(dem_extracted)
## [1] 173  96

First, we construct a correlation matrix to choose the most appropriate dimension reduction algorithm.

library('corrplot')
pears_corr<-cor(dem_extracted[, -1], method="pearson")
corrplot(pears_corr, order="alphabet", tl.cex=0.4)

From the correlation matrix, we can clearly see that most of the indicators are highly correlated. In this case, PCA can be used to decrease the number of dimensions and eliminate multicollinearity in the data. Before proceeding to the analysis, we need to standardise the data. For this purpose, the “preProcess” and “predict” functions from the “caret” package were applied.

library('caret')
data_prop<-preProcess(as.matrix(dem_extracted[, -1]), method=c("center", "scale"))
pred<-predict(data_prop, dem_extracted[,-1])
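
As a cross-check, the same centering and scaling can be reproduced in base R. The sketch below uses a small made-up matrix (“toy_x” is hypothetical, not the V-Dem data) and the base “scale” function, which should give the same result as “preProcess” with method=c("center", "scale"): each column ends up with mean 0 and standard deviation 1.

```r
# Toy data: 5 observations, 3 variables on very different scales (hypothetical values)
toy_x <- cbind(a = c(1, 2, 3, 4, 5),
               b = c(10, 30, 20, 50, 40),
               c = c(0.1, 0.4, 0.2, 0.3, 0.5))

# Center each column on its mean and divide by its standard deviation
toy_std <- scale(toy_x, center = TRUE, scale = TRUE)

# After standardisation every column has mean 0 and standard deviation 1
col_means <- colMeans(toy_std)
col_sds <- apply(toy_std, 2, sd)
print(round(col_means, 10))
print(col_sds)
```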

Now, we can perform PCA and visualize the output with a biplot.

library('stats')
pca1<-prcomp(pred, center=FALSE, scale=FALSE)
biplot(pca1, scale=0)

To choose the number of principal components, we can apply the Kaiser criterion: we take the squares of the standard deviations (the eigenvalues) and keep the components whose eigenvalue is greater than 1. The “which” function returns the indices of the corresponding principal components.

sdev_sq<-pca1$sdev^2
which(sdev_sq>1)
##  [1]  1  2  3  4  5  6  7  8  9 10 11
summary(pca1)
## Importance of components:
##                           PC1     PC2    PC3    PC4     PC5     PC6    PC7
## Standard deviation     7.6204 2.85932 2.0767 1.7678 1.65367 1.41320 1.3364
## Proportion of Variance 0.6113 0.08606 0.0454 0.0329 0.02879 0.02102 0.0188
## Cumulative Proportion  0.6113 0.69732 0.7427 0.7756 0.80440 0.82542 0.8442
##                            PC8    PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.21033 1.1239 1.06514 1.00591 0.92574 0.89712 0.86835
## Proportion of Variance 0.01542 0.0133 0.01194 0.01065 0.00902 0.00847 0.00794
## Cumulative Proportion  0.85964 0.8729 0.88488 0.89553 0.90455 0.91302 0.92096
##                           PC15   PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.84624 0.8155 0.78413 0.75298 0.72596 0.69728 0.67241
## Proportion of Variance 0.00754 0.0070 0.00647 0.00597 0.00555 0.00512 0.00476
## Cumulative Proportion  0.92850 0.9355 0.94197 0.94794 0.95349 0.95861 0.96336
##                           PC22    PC23    PC24    PC25    PC26    PC27    PC28
## Standard deviation     0.64356 0.62917 0.60131 0.59532 0.57427 0.53963 0.48227
## Proportion of Variance 0.00436 0.00417 0.00381 0.00373 0.00347 0.00307 0.00245
## Cumulative Proportion  0.96772 0.97189 0.97570 0.97943 0.98290 0.98596 0.98841
##                          PC29    PC30    PC31    PC32    PC33    PC34    PC35
## Standard deviation     0.4471 0.41458 0.38471 0.36783 0.33036 0.30194 0.25605
## Proportion of Variance 0.0021 0.00181 0.00156 0.00142 0.00115 0.00096 0.00069
## Cumulative Proportion  0.9905 0.99233 0.99388 0.99531 0.99646 0.99742 0.99811
##                           PC36    PC37    PC38    PC39    PC40    PC41    PC42
## Standard deviation     0.22859 0.18273 0.12913 0.12191 0.09387 0.08357 0.08133
## Proportion of Variance 0.00055 0.00035 0.00018 0.00016 0.00009 0.00007 0.00007
## Cumulative Proportion  0.99866 0.99901 0.99918 0.99934 0.99943 0.99951 0.99958
##                           PC43    PC44    PC45    PC46    PC47    PC48    PC49
## Standard deviation     0.07270 0.06612 0.05814 0.05471 0.04996 0.04920 0.04653
## Proportion of Variance 0.00006 0.00005 0.00004 0.00003 0.00003 0.00003 0.00002
## Cumulative Proportion  0.99963 0.99968 0.99971 0.99975 0.99977 0.99980 0.99982
##                           PC50    PC51    PC52    PC53    PC54    PC55    PC56
## Standard deviation     0.04395 0.04134 0.03972 0.03443 0.03363 0.03257 0.02904
## Proportion of Variance 0.00002 0.00002 0.00002 0.00001 0.00001 0.00001 0.00001
## Cumulative Proportion  0.99984 0.99986 0.99987 0.99989 0.99990 0.99991 0.99992
##                           PC57    PC58    PC59    PC60    PC61    PC62   PC63
## Standard deviation     0.02848 0.02762 0.02549 0.02335 0.02247 0.02159 0.0197
## Proportion of Variance 0.00001 0.00001 0.00001 0.00001 0.00001 0.00000 0.0000
## Cumulative Proportion  0.99993 0.99994 0.99994 0.99995 0.99995 0.99996 1.0000
##                           PC64    PC65    PC66    PC67    PC68    PC69    PC70
## Standard deviation     0.01926 0.01813 0.01707 0.01572 0.01565 0.01525 0.01457
## Proportion of Variance 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
## Cumulative Proportion  0.99997 0.99997 0.99997 0.99998 0.99998 0.99998 0.99998
##                           PC71   PC72    PC73    PC74    PC75    PC76    PC77
## Standard deviation     0.01399 0.0127 0.01204 0.01132 0.01094 0.01078 0.01037
## Proportion of Variance 0.00000 0.0000 0.00000 0.00000 0.00000 0.00000 0.00000
## Cumulative Proportion  0.99999 1.0000 0.99999 0.99999 0.99999 0.99999 0.99999
##                            PC78    PC79     PC80     PC81    PC82     PC83
## Standard deviation     0.009994 0.00916 0.008426 0.007942 0.00768 0.007597
## Proportion of Variance 0.000000 0.00000 0.000000 0.000000 0.00000 0.000000
## Cumulative Proportion  0.999990 1.00000 1.000000 1.000000 1.00000 1.000000
##                            PC84     PC85     PC86     PC87     PC88    PC89
## Standard deviation     0.007113 0.005907 0.005661 0.005161 0.004518 0.00359
## Proportion of Variance 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000
## Cumulative Proportion  1.000000 1.000000 1.000000 1.000000 1.000000 1.00000
##                            PC90     PC91     PC92     PC93     PC94      PC95
## Standard deviation     0.003207 0.002834 0.002383 0.002116 0.001158 0.0008749
## Proportion of Variance 0.000000 0.000000 0.000000 0.000000 0.000000 0.0000000
## Cumulative Proportion  1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000

The summary table leads to the same conclusion: the first eleven components explain almost 90 per cent of the variance. We can also conduct principal component analysis with the “princomp()” function. The main difference between “prcomp” and “princomp” is the way they calculate the principal components: “prcomp” uses singular value decomposition of the data matrix, whereas “princomp” uses eigendecomposition of the covariance matrix. Nevertheless, the results obtained by these two functions are often similar (Harvey and Hanson 2024).

library('stats')
pca2<-princomp(pred)
summary(pca2)
## Importance of components:
##                           Comp.1     Comp.2     Comp.3     Comp.4     Comp.5
## Standard deviation     7.5983021 2.85104065 2.07065266 1.76269007 1.64888545
## Proportion of Variance 0.6112617 0.08605991 0.04539506 0.03289622 0.02878558
## Cumulative Proportion  0.6112617 0.69732158 0.74271663 0.77561285 0.80439843
##                           Comp.6     Comp.7     Comp.8    Comp.9    Comp.10
## Standard deviation     1.4091141 1.33255352 1.20683168 1.1206246 1.06206116
## Proportion of Variance 0.0210226 0.01880024 0.01542011 0.0132958 0.01194244
## Cumulative Proportion  0.8254210 0.84422127 0.85964138 0.8729372 0.88487961
##                           Comp.11     Comp.12     Comp.13     Comp.14
## Standard deviation     1.00300005 0.923063253 0.894527066 0.865839702
## Proportion of Variance 0.01065114 0.009021048 0.008471904 0.007937231
## Cumulative Proportion  0.89553075 0.904551798 0.913023702 0.920960933
##                            Comp.15     Comp.16     Comp.17     Comp.18
## Standard deviation     0.843795340 0.813127404 0.781859233 0.750800085
## Proportion of Variance 0.007538211 0.007000213 0.006472189 0.005968191
## Cumulative Proportion  0.928499144 0.935499357 0.941971546 0.947939736
##                            Comp.19     Comp.20     Comp.21     Comp.22
## Standard deviation     0.723862648 0.695258888 0.670464869 0.641700257
## Proportion of Variance 0.005547616 0.005117845 0.004759333 0.004359719
## Cumulative Proportion  0.953487352 0.958605198 0.963364531 0.967724249
##                            Comp.23    Comp.24     Comp.25     Comp.26
## Standard deviation     0.627348146 0.59957200 0.593598243 0.572609088
## Proportion of Variance 0.004166883 0.00380607 0.003730605 0.003471447
## Cumulative Proportion  0.971891132 0.97569720 0.979427807 0.982899254
##                           Comp.27     Comp.28     Comp.29     Comp.30
## Standard deviation     0.53806470 0.480876800 0.445841479 0.413384555
## Proportion of Variance 0.00306523 0.002448283 0.002104529 0.001809267
## Cumulative Proportion  0.98596448 0.988412767 0.990517297 0.992326563
##                            Comp.31     Comp.32     Comp.33      Comp.34
## Standard deviation     0.383599843 0.366763524 0.329403380 0.3010627689
## Proportion of Variance 0.001557941 0.001424185 0.001148815 0.0009596396
## Cumulative Proportion  0.993884504 0.995308688 0.996457503 0.9974171431
##                             Comp.35      Comp.36      Comp.37      Comp.38
## Standard deviation     0.2553041371 0.2279256639 0.1822016433 0.1287572151
## Proportion of Variance 0.0006900964 0.0005500226 0.0003514784 0.0001755243
## Cumulative Proportion  0.9981072395 0.9986572620 0.9990087404 0.9991842647
##                             Comp.39      Comp.40      Comp.41      Comp.42
## Standard deviation     0.1215613181 9.359998e-02 8.332356e-02 8.109609e-02
## Proportion of Variance 0.0001564533 9.275676e-05 7.350716e-05 6.962959e-05
## Cumulative Proportion  0.9993407180 9.994335e-01 9.995070e-01 9.995766e-01
##                             Comp.43      Comp.44      Comp.45      Comp.46
## Standard deviation     7.248973e-02 6.593036e-02 0.0579676648 5.454796e-02
## Proportion of Variance 5.563487e-05 4.602194e-05 0.0000355767 3.150294e-05
## Cumulative Proportion  9.996322e-01 9.996783e-01 0.9997138451 9.997453e-01
##                             Comp.47      Comp.48      Comp.49      Comp.50
## Standard deviation     4.981931e-02 4.905533e-02 0.0463951194 4.382385e-02
## Proportion of Variance 2.627783e-05 2.547807e-05 0.0000227897 2.033364e-05
## Cumulative Proportion  9.997716e-01 9.997971e-01 0.9998198936 9.998402e-01
##                             Comp.51      Comp.52      Comp.53      Comp.54
## Standard deviation     4.121942e-02 3.960247e-02 3.432770e-02 3.353194e-02
## Proportion of Variance 1.798862e-05 1.660499e-05 1.247623e-05 1.190451e-05
## Cumulative Proportion  9.998582e-01 9.998748e-01 9.998873e-01 9.998992e-01
##                             Comp.55      Comp.56      Comp.57      Comp.58
## Standard deviation     3.247329e-02 2.895911e-02 2.839267e-02 2.754304e-02
## Proportion of Variance 1.116469e-05 8.879007e-06 8.535061e-06 8.031893e-06
## Cumulative Proportion  9.999104e-01 9.999192e-01 9.999278e-01 9.999358e-01
##                             Comp.59      Comp.60      Comp.61      Comp.62
## Standard deviation     2.541472e-02 2.327988e-02 2.240484e-02 2.153129e-02
## Proportion of Variance 6.838561e-06 5.737932e-06 5.314687e-06 4.908336e-06
## Cumulative Proportion  9.999427e-01 9.999484e-01 9.999537e-01 9.999586e-01
##                             Comp.63      Comp.64      Comp.65      Comp.66
## Standard deviation     1.963901e-02 1.920650e-02 1.807273e-02 1.701577e-02
## Proportion of Variance 4.083507e-06 3.905624e-06 3.458131e-06 3.065470e-06
## Cumulative Proportion  9.999627e-01 9.999666e-01 9.999701e-01 9.999731e-01
##                             Comp.67      Comp.68      Comp.69      Comp.70
## Standard deviation     1.566954e-02 1.560894e-02 1.520473e-02 1.453218e-02
## Proportion of Variance 2.599599e-06 2.579531e-06 2.447663e-06 2.235918e-06
## Cumulative Proportion  9.999757e-01 9.999783e-01 9.999808e-01 9.999830e-01
##                             Comp.71      Comp.72      Comp.73      Comp.74
## Standard deviation     1.394650e-02 1.266600e-02 1.200676e-02 1.128248e-02
## Proportion of Variance 2.059322e-06 1.698528e-06 1.526321e-06 1.347731e-06
## Cumulative Proportion  9.999850e-01 9.999867e-01 9.999883e-01 9.999896e-01
##                             Comp.75      Comp.76      Comp.77      Comp.78
## Standard deviation     1.091145e-02 1.075110e-02 1.034148e-02 9.964998e-03
## Proportion of Variance 1.260546e-06 1.223771e-06 1.132295e-06 1.051353e-06
## Cumulative Proportion  9.999909e-01 9.999921e-01 9.999932e-01 9.999943e-01
##                             Comp.79      Comp.80      Comp.81      Comp.82
## Standard deviation     9.133064e-03 8.401673e-03 7.919180e-03 7.657971e-03
## Proportion of Variance 8.831349e-07 7.473527e-07 6.639793e-07 6.208998e-07
## Cumulative Proportion  9.999952e-01 9.999959e-01 9.999966e-01 9.999972e-01
##                             Comp.83      Comp.84      Comp.85      Comp.86
## Standard deviation     7.575360e-03 7.092117e-03 5.889643e-03 5.644694e-03
## Proportion of Variance 6.075760e-07 5.325322e-07 3.672587e-07 3.373454e-07
## Cumulative Proportion  9.999978e-01 9.999983e-01 9.999987e-01 9.999990e-01
##                             Comp.87      Comp.88      Comp.89      Comp.90
## Standard deviation     5.146467e-03 4.504618e-03 3.579544e-03 3.198113e-03
## Proportion of Variance 2.804222e-07 2.148374e-07 1.356593e-07 1.082883e-07
## Cumulative Proportion  9.999993e-01 9.999995e-01 9.999997e-01 9.999998e-01
##                             Comp.91      Comp.92     Comp.93      Comp.94
## Standard deviation     2.826267e-03 2.375731e-03 2.11020e-03 1.154832e-03
## Proportion of Variance 8.457083e-08 5.975700e-08 4.71456e-08 1.411991e-08
## Cumulative Proportion  9.999999e-01 9.999999e-01 1.00000e+00 1.000000e+00
##                             Comp.95
## Standard deviation     8.723840e-04
## Proportion of Variance 8.057669e-09
## Cumulative Proportion  1.000000e+00

In this case too, the first eleven principal components account for almost 90 per cent of the variance.
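
The small gap between the two sets of standard deviations (7.6204 versus 7.5983 for the first component) is consistent with the different variance divisors: “prcomp” divides by n - 1 and “princomp” by n, and 7.5983 / 7.6204 is roughly sqrt(172/173). The sketch below illustrates this on the built-in “USArrests” data, which is used here only as a stand-in for the V-Dem indicators.

```r
# Built-in dataset used only for illustration (not the V-Dem data)
x <- scale(USArrests)
n <- nrow(x)

sd_prcomp <- prcomp(x, center = FALSE, scale. = FALSE)$sdev
sd_princomp <- unname(princomp(x)$sdev)

# princomp divides by n, prcomp by n - 1, so the standard deviations
# differ by the constant factor sqrt((n - 1) / n)
ratio <- sd_princomp / sd_prcomp
print(ratio)
print(sqrt((n - 1) / n))

# The proportion of variance explained is the same for both functions
prop_prcomp <- sd_prcomp^2 / sum(sd_prcomp^2)
prop_princomp <- sd_princomp^2 / sum(sd_princomp^2)
print(round(cbind(prop_prcomp, prop_princomp), 4))
```

Because the factor is the same for every component, the proportions of variance and the ordering of the components are unaffected.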

Now, let us visualize the output and check the quality of the analysis.

library('factoextra')
fviz_eig(pca1, choice='eigenvalue')

We can clearly see that the first component explains approximately 60 per cent of the variance: its eigenvalue is about 58 (7.6204 squared), out of a total of 95 for the standardised indicators.
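
The link between eigenvalues, the Kaiser criterion and the proportion of variance can be checked by hand. For standardised data the eigenvalues sum to the number of variables, so for the first component reported by summary(pca1) we get 7.6204^2 ≈ 58.07 and 58.07 / 95 ≈ 0.611. The following is a minimal sketch on the built-in “mtcars” data (an illustrative stand-in, not the V-Dem indicators).

```r
# Built-in dataset used only for illustration
x <- scale(mtcars)
pca <- prcomp(x, center = FALSE, scale. = FALSE)

# Eigenvalues are the squared standard deviations of the components
eig <- pca$sdev^2

# For standardised data the eigenvalues sum to the number of variables
print(sum(eig))
print(ncol(x))

# Proportion of variance explained by each component
prop <- eig / sum(eig)
print(round(prop, 4))

# Kaiser criterion: keep the components with eigenvalue greater than 1
kaiser <- which(eig > 1)
print(kaiser)
```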

We can also put the percentage of variance explained on the vertical axis.

fviz_eig(pca1)

Now let us see which variables contribute most to the first two principal components.

library(gridExtra)
var<-get_pca_var(pca1)
a<-fviz_contrib(pca1, "var", axes=1, xtickslab.rt=90) # default angle=45°
b<-fviz_contrib(pca1, "var", axes=2, xtickslab.rt=90)
grid.arrange(a,b,top='Contribution to the first two Principal Components')
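
The contributions can also be computed by hand. The sketch below assumes that the contribution of a variable to a component is its squared loading expressed as a percentage, which should match what “factoextra” plots for a “prcomp” object, and that the reference line corresponds to a uniform contribution of 100 divided by the number of variables. The built-in “mtcars” data serves only as an illustration.

```r
# Built-in dataset used only for illustration
x <- scale(mtcars)
pca <- prcomp(x, center = FALSE, scale. = FALSE)

# Columns of the rotation matrix have unit length, so the squared
# loadings of each component sum to 1 (100 per cent)
contrib <- pca$rotation^2 * 100

# Three largest contributors to the first component
top_pc1 <- head(sort(contrib[, "PC1"], decreasing = TRUE), 3)
print(round(top_pc1, 2))

# Contribution each variable would have if all contributed equally
expected <- 100 / nrow(contrib)
print(expected)
```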

The output of “princomp” can be visualized with base R as well: calling “plot” on the object draws a scree plot.

plot(pca2, type='lines', main='Scree plot')

As stated by Harvey and Hanson (2024), the results of the two functions do not differ much. To summarize, by running principal component analysis, we managed to decrease the number of dimensions from 95 to only 11 and preserve almost 90 per cent of the variation in the data.

PCA proved to be highly effective in dealing with highly correlated data. Nevertheless, it is not designed to preserve cluster structures in the data. To check whether such structures are present, we compute the Hopkins statistic.

library('factoextra')
get_clust_tendency(pred, 20, graph=TRUE, gradient = list(low = "red", mid = "white", high = "blue"), seed = 123)
## $hopkins_stat
## [1] 0.722153
## 
## $plot

In our case, the Hopkins statistic is relatively high (about 0.72), which indicates some cluster structure in the data. To reduce the number of dimensions to two while preserving that structure, we additionally applied the t-SNE algorithm.

set.seed(123)
library('Rtsne')
tsne<-Rtsne(pred, dims=2, perplexity=25)
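
The perplexity cannot be chosen freely: as far as we are aware, “Rtsne” stops with an error when 3 * perplexity exceeds the number of observations minus one. The helper below (“max_perplexity” is a hypothetical name introduced here, not part of the package) sketches that bound for the 173 countries in the data.

```r
# Largest integer perplexity satisfying 3 * perplexity <= n_obs - 1
# (hypothetical helper, assuming the bound described above)
max_perplexity <- function(n_obs) floor((n_obs - 1) / 3)

n_countries <- 173
upper <- max_perplexity(n_countries)
print(upper)

# The value used in the analysis lies within the bound
chosen <- 25
print(3 * chosen <= n_countries - 1)
```

Since the embedding can change noticeably with this parameter, it is worth comparing a few values below the bound before interpreting the plot.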

To make a visualization, we need to construct a dataframe containing the results.

tsne_res<-data.frame(Dim1=tsne$Y[, 1], Dim2=tsne$Y[, 2], Country=dem_extracted[, 1])
head(tsne_res)
##        Dim1      Dim2      Country
## 1 -3.125425  1.412076       Mexico
## 2 -1.778379 12.321756       Sweden
## 3 -0.706194 13.793165  Switzerland
## 4  1.918939  5.320777        Ghana
## 5 -1.569685  4.790853 South Africa
## 6 -3.421264 10.164769        Japan

Now, let us visualize the results.

library('ggplot2')
ggplot(tsne_res, aes(x = Dim1, y = Dim2)) +
  geom_point(color='red',size = 3) +
  labs(title = "t-SNE Visualization", x = "Dimension 1", y = "Dimension 2") +
  theme_minimal()

Visually, we can observe some groups in the data. Let us find the appropriate number of clusters and conduct k-means.

library('gridExtra')
library('factoextra')
opt_km<-fviz_nbclust(pred, FUNcluster=kmeans, method="gap")+theme_classic()
grid.arrange(opt_km)
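
For reference, the gap statistic can also be computed directly with “clusGap” from the “cluster” package, which “fviz_nbclust” appears to rely on for this method. The sketch below runs on simulated toy data with three groups (hypothetical values, not the V-Dem indicators); the choice of B = 50 reference sets and of the “firstSEmax” rule are assumptions made for the illustration.

```r
library(cluster)  # recommended package shipped with standard R installations

set.seed(123)
# Toy data: three well-separated groups in two dimensions (hypothetical)
toy <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
             matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 2))

# Gap statistic for k = 1..6, using B simulated reference data sets
gap <- clusGap(toy, FUNcluster = kmeans, K.max = 6, B = 50, nstart = 10)
print(round(gap$Tab, 3))

# Choose k with one of the selection rules implemented in maxSE()
best_k <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
print(best_k)
```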

We run the k-means procedure with eight clusters and assign a cluster to each country.

library('stats')
set.seed(1234)
km8<-kmeans(pred, 8)
tsne_res$cluster<-factor(km8$cluster)
head(tsne_res)
##        Dim1      Dim2      Country cluster
## 1 -3.125425  1.412076       Mexico       7
## 2 -1.778379 12.321756       Sweden       8
## 3 -0.706194 13.793165  Switzerland       8
## 4  1.918939  5.320777        Ghana       2
## 5 -1.569685  4.790853 South Africa       2
## 6 -3.421264 10.164769        Japan       8
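
Because k-means starts from randomly chosen centers, the assignment can depend on the seed. One common safeguard is the “nstart” argument of “kmeans”, which repeats the procedure from several random starts and keeps the best solution. The sketch below shows this on simulated toy data with three known groups (hypothetical values; the analysis above uses a single start).

```r
set.seed(1234)
# Toy data: three well-separated groups of 20 points each (hypothetical)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.2), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.2), ncol = 2),
             matrix(rnorm(40, mean = 10, sd = 0.2), ncol = 2))
truth <- rep(1:3, each = 20)

# Run k-means from 25 random starts and keep the best solution
km <- kmeans(toy, centers = 3, nstart = 25)

# Cross-tabulate the known groups against the assigned clusters
tab <- table(truth, km$cluster)
print(tab)
print(km$size)
```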

We can visualize the data one more time.

library('ggplot2')
colors_clust<-c('red', 'blue', 'green', 'black', 'yellow', 'purple', 'brown', 'orange')
ggplot(tsne_res, aes(x = Dim1, y = Dim2, color=cluster)) +
  geom_point(size = 3) + 
  labs(title = "t-SNE Visualization", x = "Dimension 1", y = "Dimension 2") + theme_minimal() + scale_color_manual(values=colors_clust)

To conclude, the t-SNE algorithm reduced the number of dimensions from 95 to only two, and we can observe some clusters. However, the groups overlap at some points.

Conclusion

In this project, we conducted principal component analysis on a dataset consisting of 95 democracy indicators. As the indicators turned out to be highly correlated, PCA reduced the number of dimensions from 95 to only 11 while preserving almost 90 per cent of the variance in the original data. In addition, the Hopkins statistic showed the presence of cluster structures in the data. We reduced the number of dimensions to two with the t-SNE method and plotted the results. Then we determined the optimal number of clusters and ran k-means. After that, the output was plotted again. From the final graph, we can observe cluster patterns in the data.

Sources

  1. Harvey, D., Hanson, B., 2024, A Comparison of Functions for PCA.
  2. V-Dem dataset, https://v-dem.net/data/the-v-dem-dataset/.