Introduction
In this project I conducted dimension reduction on 95 democracy indicators from the V-Dem dataset. The original dataset consists of more than 4000 variables. The aim of the study, however, is only to test how this type of data can be handled with dimension reduction techniques.
Analysis
As the first step of the analysis, I downloaded and cleaned the data.
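# Load the full V-Dem country-year dataset
dem<-read.csv("V-Dem-CY-Full+Others-v14.csv")
# Keep the observations for 2019 only
demo<-dem[dem$year==2019, ]
# Drop the columns from country_text_id through COWcode
dem1<-demo[, -c(which(names(demo)=="country_text_id"):which(names(demo)=="COWcode"))]
# Keep the country name (column 1) and the next 95 indicator columns, then drop incomplete rows
dem_extra<-dem1[, 1:96]
dem_extracted<-na.omit(dem_extra)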
Checking the dimensions in the data:
dim(dem_extracted)
## [1] 173 96
First, we construct a correlation matrix to choose the most appropriate dimension reduction algorithm.
library('corrplot')
pears_corr<-cor(dem_extracted[, -1], method="pearson")
corrplot(pears_corr, order="alphabet", tl.cex=0.4)
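A numeric summary can complement the visual inspection of the correlation plot. The following is a minimal sketch on the built-in mtcars data, which is used only so that the example is self-contained and is not the V-Dem data. The 0.7 threshold is an arbitrary illustrative choice, not a value taken from the analysis.

```r
# Toy data: the built-in mtcars data stand in for the indicator matrix.
corr_toy <- cor(mtcars, method = "pearson")

# Keep each pair of variables once: the upper triangle excludes the diagonal
# and the mirrored duplicates.
pairwise <- corr_toy[upper.tri(corr_toy)]

# Share of variable pairs whose absolute correlation exceeds the threshold.
share_high <- mean(abs(pairwise) > 0.7)
share_high
```

A large share of strongly correlated pairs is the situation in which PCA is expected to compress the data well.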
From the correlation matrix, we can clearly see that most of the indicators are highly correlated. In this case, PCA can be used to decrease the number of dimensions and eliminate multicollinearity in the data. Before proceeding with the analysis, the data need to be standardised. For this purpose, the “preProcess” and “predict” functions from the “caret” package were applied.
library('caret')
data_prop<-preProcess(as.matrix(dem_extracted[, -1]), method=c("center", "scale"))
pred<-predict(data_prop, dem_extracted[,-1])
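For reference, centring and scaling can also be sketched with the base R function scale(), which subtracts the column means and divides by the column standard deviations. The example below uses the built-in mtcars data as a stand-in (an assumption for illustration only). We would expect it to give the same kind of result as the “center” and “scale” steps above, but that equivalence has not been checked on the V-Dem data here.

```r
# Toy data: the built-in mtcars data stand in for the indicator matrix.
scaled_toy <- scale(mtcars, center = TRUE, scale = TRUE)

# After standardisation every column should have mean 0 and standard deviation 1.
col_means <- colMeans(scaled_toy)
col_sds <- apply(scaled_toy, 2, sd)

round(range(col_means), 10)
round(range(col_sds), 10)
```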
Now, we can perform PCA and visualize the output with a biplot.
library('stats')
pca1<-prcomp(pred, center=FALSE, scale.=FALSE)
biplot(pca1, scale=0)
To choose the number of principal components, we can apply the Kaiser criterion: we take the squares of the standard deviations (the eigenvalues) and keep the components for which this value is greater than 1. The function “which” returns the indices of the corresponding principal components.
sdev_sq<-pca1$sdev^2
which(sdev_sq>1)
## [1] 1 2 3 4 5 6 7 8 9 10 11
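The reasoning behind the threshold of 1 can be sketched on a toy example. With standardised variables, the eigenvalues sum to the number of variables, so their average is exactly 1 and the Kaiser criterion keeps the components that carry more variance than an average original variable. The built-in mtcars data are used here only as an illustrative stand-in.

```r
# PCA on standardised toy data (mtcars stands in for the indicator matrix).
pca_toy <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components.
eig_toy <- pca_toy$sdev^2

# For standardised data the eigenvalues sum to the number of variables.
sum(eig_toy)
ncol(mtcars)

# Kaiser criterion: keep components whose eigenvalue exceeds the average of 1.
kaiser_idx <- which(eig_toy > 1)
kaiser_idx
```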
summary(pca1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 7.6204 2.85932 2.0767 1.7678 1.65367 1.41320 1.3364
## Proportion of Variance 0.6113 0.08606 0.0454 0.0329 0.02879 0.02102 0.0188
## Cumulative Proportion 0.6113 0.69732 0.7427 0.7756 0.80440 0.82542 0.8442
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.21033 1.1239 1.06514 1.00591 0.92574 0.89712 0.86835
## Proportion of Variance 0.01542 0.0133 0.01194 0.01065 0.00902 0.00847 0.00794
## Cumulative Proportion 0.85964 0.8729 0.88488 0.89553 0.90455 0.91302 0.92096
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.84624 0.8155 0.78413 0.75298 0.72596 0.69728 0.67241
## Proportion of Variance 0.00754 0.0070 0.00647 0.00597 0.00555 0.00512 0.00476
## Cumulative Proportion 0.92850 0.9355 0.94197 0.94794 0.95349 0.95861 0.96336
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.64356 0.62917 0.60131 0.59532 0.57427 0.53963 0.48227
## Proportion of Variance 0.00436 0.00417 0.00381 0.00373 0.00347 0.00307 0.00245
## Cumulative Proportion 0.96772 0.97189 0.97570 0.97943 0.98290 0.98596 0.98841
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 0.4471 0.41458 0.38471 0.36783 0.33036 0.30194 0.25605
## Proportion of Variance 0.0021 0.00181 0.00156 0.00142 0.00115 0.00096 0.00069
## Cumulative Proportion 0.9905 0.99233 0.99388 0.99531 0.99646 0.99742 0.99811
## PC36 PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 0.22859 0.18273 0.12913 0.12191 0.09387 0.08357 0.08133
## Proportion of Variance 0.00055 0.00035 0.00018 0.00016 0.00009 0.00007 0.00007
## Cumulative Proportion 0.99866 0.99901 0.99918 0.99934 0.99943 0.99951 0.99958
## PC43 PC44 PC45 PC46 PC47 PC48 PC49
## Standard deviation 0.07270 0.06612 0.05814 0.05471 0.04996 0.04920 0.04653
## Proportion of Variance 0.00006 0.00005 0.00004 0.00003 0.00003 0.00003 0.00002
## Cumulative Proportion 0.99963 0.99968 0.99971 0.99975 0.99977 0.99980 0.99982
## PC50 PC51 PC52 PC53 PC54 PC55 PC56
## Standard deviation 0.04395 0.04134 0.03972 0.03443 0.03363 0.03257 0.02904
## Proportion of Variance 0.00002 0.00002 0.00002 0.00001 0.00001 0.00001 0.00001
## Cumulative Proportion 0.99984 0.99986 0.99987 0.99989 0.99990 0.99991 0.99992
## PC57 PC58 PC59 PC60 PC61 PC62 PC63
## Standard deviation 0.02848 0.02762 0.02549 0.02335 0.02247 0.02159 0.0197
## Proportion of Variance 0.00001 0.00001 0.00001 0.00001 0.00001 0.00000 0.0000
## Cumulative Proportion 0.99993 0.99994 0.99994 0.99995 0.99995 0.99996 1.0000
## PC64 PC65 PC66 PC67 PC68 PC69 PC70
## Standard deviation 0.01926 0.01813 0.01707 0.01572 0.01565 0.01525 0.01457
## Proportion of Variance 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
## Cumulative Proportion 0.99997 0.99997 0.99997 0.99998 0.99998 0.99998 0.99998
## PC71 PC72 PC73 PC74 PC75 PC76 PC77
## Standard deviation 0.01399 0.0127 0.01204 0.01132 0.01094 0.01078 0.01037
## Proportion of Variance 0.00000 0.0000 0.00000 0.00000 0.00000 0.00000 0.00000
## Cumulative Proportion 0.99999 1.0000 0.99999 0.99999 0.99999 0.99999 0.99999
## PC78 PC79 PC80 PC81 PC82 PC83
## Standard deviation 0.009994 0.00916 0.008426 0.007942 0.00768 0.007597
## Proportion of Variance 0.000000 0.00000 0.000000 0.000000 0.00000 0.000000
## Cumulative Proportion 0.999990 1.00000 1.000000 1.000000 1.00000 1.000000
## PC84 PC85 PC86 PC87 PC88 PC89
## Standard deviation 0.007113 0.005907 0.005661 0.005161 0.004518 0.00359
## Proportion of Variance 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000
## Cumulative Proportion 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000
## PC90 PC91 PC92 PC93 PC94 PC95
## Standard deviation 0.003207 0.002834 0.002383 0.002116 0.001158 0.0008749
## Proportion of Variance 0.000000 0.000000 0.000000 0.000000 0.000000 0.0000000
## Cumulative Proportion 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000
The summary table leads to the same conclusion: the first eleven components explain almost 90 per cent of the variance. We can also conduct principal component analysis using the “princomp()” function. The main difference between the “prcomp” and “princomp” functions is the way they calculate the principal components. “prcomp” uses a singular value decomposition of the data matrix, whereas “princomp” uses an eigendecomposition of the covariance matrix. Nevertheless, the results obtained by these two functions are often similar (Harvey and Hanson 2024).
library('stats')
pca2<-princomp(pred)
summary(pca2)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 7.5983021 2.85104065 2.07065266 1.76269007 1.64888545
## Proportion of Variance 0.6112617 0.08605991 0.04539506 0.03289622 0.02878558
## Cumulative Proportion 0.6112617 0.69732158 0.74271663 0.77561285 0.80439843
## Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## Standard deviation 1.4091141 1.33255352 1.20683168 1.1206246 1.06206116
## Proportion of Variance 0.0210226 0.01880024 0.01542011 0.0132958 0.01194244
## Cumulative Proportion 0.8254210 0.84422127 0.85964138 0.8729372 0.88487961
## Comp.11 Comp.12 Comp.13 Comp.14
## Standard deviation 1.00300005 0.923063253 0.894527066 0.865839702
## Proportion of Variance 0.01065114 0.009021048 0.008471904 0.007937231
## Cumulative Proportion 0.89553075 0.904551798 0.913023702 0.920960933
## Comp.15 Comp.16 Comp.17 Comp.18
## Standard deviation 0.843795340 0.813127404 0.781859233 0.750800085
## Proportion of Variance 0.007538211 0.007000213 0.006472189 0.005968191
## Cumulative Proportion 0.928499144 0.935499357 0.941971546 0.947939736
## Comp.19 Comp.20 Comp.21 Comp.22
## Standard deviation 0.723862648 0.695258888 0.670464869 0.641700257
## Proportion of Variance 0.005547616 0.005117845 0.004759333 0.004359719
## Cumulative Proportion 0.953487352 0.958605198 0.963364531 0.967724249
## Comp.23 Comp.24 Comp.25 Comp.26
## Standard deviation 0.627348146 0.59957200 0.593598243 0.572609088
## Proportion of Variance 0.004166883 0.00380607 0.003730605 0.003471447
## Cumulative Proportion 0.971891132 0.97569720 0.979427807 0.982899254
## Comp.27 Comp.28 Comp.29 Comp.30
## Standard deviation 0.53806470 0.480876800 0.445841479 0.413384555
## Proportion of Variance 0.00306523 0.002448283 0.002104529 0.001809267
## Cumulative Proportion 0.98596448 0.988412767 0.990517297 0.992326563
## Comp.31 Comp.32 Comp.33 Comp.34
## Standard deviation 0.383599843 0.366763524 0.329403380 0.3010627689
## Proportion of Variance 0.001557941 0.001424185 0.001148815 0.0009596396
## Cumulative Proportion 0.993884504 0.995308688 0.996457503 0.9974171431
## Comp.35 Comp.36 Comp.37 Comp.38
## Standard deviation 0.2553041371 0.2279256639 0.1822016433 0.1287572151
## Proportion of Variance 0.0006900964 0.0005500226 0.0003514784 0.0001755243
## Cumulative Proportion 0.9981072395 0.9986572620 0.9990087404 0.9991842647
## Comp.39 Comp.40 Comp.41 Comp.42
## Standard deviation 0.1215613181 9.359998e-02 8.332356e-02 8.109609e-02
## Proportion of Variance 0.0001564533 9.275676e-05 7.350716e-05 6.962959e-05
## Cumulative Proportion 0.9993407180 9.994335e-01 9.995070e-01 9.995766e-01
## Comp.43 Comp.44 Comp.45 Comp.46
## Standard deviation 7.248973e-02 6.593036e-02 0.0579676648 5.454796e-02
## Proportion of Variance 5.563487e-05 4.602194e-05 0.0000355767 3.150294e-05
## Cumulative Proportion 9.996322e-01 9.996783e-01 0.9997138451 9.997453e-01
## Comp.47 Comp.48 Comp.49 Comp.50
## Standard deviation 4.981931e-02 4.905533e-02 0.0463951194 4.382385e-02
## Proportion of Variance 2.627783e-05 2.547807e-05 0.0000227897 2.033364e-05
## Cumulative Proportion 9.997716e-01 9.997971e-01 0.9998198936 9.998402e-01
## Comp.51 Comp.52 Comp.53 Comp.54
## Standard deviation 4.121942e-02 3.960247e-02 3.432770e-02 3.353194e-02
## Proportion of Variance 1.798862e-05 1.660499e-05 1.247623e-05 1.190451e-05
## Cumulative Proportion 9.998582e-01 9.998748e-01 9.998873e-01 9.998992e-01
## Comp.55 Comp.56 Comp.57 Comp.58
## Standard deviation 3.247329e-02 2.895911e-02 2.839267e-02 2.754304e-02
## Proportion of Variance 1.116469e-05 8.879007e-06 8.535061e-06 8.031893e-06
## Cumulative Proportion 9.999104e-01 9.999192e-01 9.999278e-01 9.999358e-01
## Comp.59 Comp.60 Comp.61 Comp.62
## Standard deviation 2.541472e-02 2.327988e-02 2.240484e-02 2.153129e-02
## Proportion of Variance 6.838561e-06 5.737932e-06 5.314687e-06 4.908336e-06
## Cumulative Proportion 9.999427e-01 9.999484e-01 9.999537e-01 9.999586e-01
## Comp.63 Comp.64 Comp.65 Comp.66
## Standard deviation 1.963901e-02 1.920650e-02 1.807273e-02 1.701577e-02
## Proportion of Variance 4.083507e-06 3.905624e-06 3.458131e-06 3.065470e-06
## Cumulative Proportion 9.999627e-01 9.999666e-01 9.999701e-01 9.999731e-01
## Comp.67 Comp.68 Comp.69 Comp.70
## Standard deviation 1.566954e-02 1.560894e-02 1.520473e-02 1.453218e-02
## Proportion of Variance 2.599599e-06 2.579531e-06 2.447663e-06 2.235918e-06
## Cumulative Proportion 9.999757e-01 9.999783e-01 9.999808e-01 9.999830e-01
## Comp.71 Comp.72 Comp.73 Comp.74
## Standard deviation 1.394650e-02 1.266600e-02 1.200676e-02 1.128248e-02
## Proportion of Variance 2.059322e-06 1.698528e-06 1.526321e-06 1.347731e-06
## Cumulative Proportion 9.999850e-01 9.999867e-01 9.999883e-01 9.999896e-01
## Comp.75 Comp.76 Comp.77 Comp.78
## Standard deviation 1.091145e-02 1.075110e-02 1.034148e-02 9.964998e-03
## Proportion of Variance 1.260546e-06 1.223771e-06 1.132295e-06 1.051353e-06
## Cumulative Proportion 9.999909e-01 9.999921e-01 9.999932e-01 9.999943e-01
## Comp.79 Comp.80 Comp.81 Comp.82
## Standard deviation 9.133064e-03 8.401673e-03 7.919180e-03 7.657971e-03
## Proportion of Variance 8.831349e-07 7.473527e-07 6.639793e-07 6.208998e-07
## Cumulative Proportion 9.999952e-01 9.999959e-01 9.999966e-01 9.999972e-01
## Comp.83 Comp.84 Comp.85 Comp.86
## Standard deviation 7.575360e-03 7.092117e-03 5.889643e-03 5.644694e-03
## Proportion of Variance 6.075760e-07 5.325322e-07 3.672587e-07 3.373454e-07
## Cumulative Proportion 9.999978e-01 9.999983e-01 9.999987e-01 9.999990e-01
## Comp.87 Comp.88 Comp.89 Comp.90
## Standard deviation 5.146467e-03 4.504618e-03 3.579544e-03 3.198113e-03
## Proportion of Variance 2.804222e-07 2.148374e-07 1.356593e-07 1.082883e-07
## Cumulative Proportion 9.999993e-01 9.999995e-01 9.999997e-01 9.999998e-01
## Comp.91 Comp.92 Comp.93 Comp.94
## Standard deviation 2.826267e-03 2.375731e-03 2.11020e-03 1.154832e-03
## Proportion of Variance 8.457083e-08 5.975700e-08 4.71456e-08 1.411991e-08
## Cumulative Proportion 9.999999e-01 9.999999e-01 1.00000e+00 1.000000e+00
## Comp.95
## Standard deviation 8.723840e-04
## Proportion of Variance 8.057669e-09
## Cumulative Proportion 1.000000e+00
In this case, too, the first eleven principal components account for almost 90 per cent of the variance.
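The small gap between the standard deviations reported by the two functions (7.6204 against 7.5983 for the first component) is consistent with the different divisors they use: “prcomp” divides by n - 1 and “princomp” by n. With n = 173 rows, 7.6204 multiplied by the square root of 172/173 gives approximately 7.5983. A minimal sketch on the built-in USArrests data (a stand-in chosen only for illustration) shows the same relationship:

```r
# Standardised toy data (USArrests stands in for the indicator matrix).
x_toy <- scale(USArrests)
n <- nrow(x_toy)

sd_prcomp <- prcomp(x_toy, center = FALSE, scale. = FALSE)$sdev
sd_princomp <- unname(princomp(x_toy)$sdev)

# princomp uses divisor n, prcomp uses divisor n - 1.
same_sdev <- all.equal(sd_princomp, sd_prcomp * sqrt((n - 1) / n))
same_sdev

# The proportions of variance are unaffected by the divisor.
same_share <- all.equal(sd_prcomp^2 / sum(sd_prcomp^2),
                        sd_princomp^2 / sum(sd_princomp^2))
same_share
```

Because the divisor cancels in the ratios, the "Proportion of Variance" rows of the two summaries agree even though the standard deviations differ slightly.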
Now, let us visualize the output and check the quality of the analysis.
library('factoextra')
fviz_eig(pca1, choice='eigenvalue')
We can clearly see that the first component explains approximately 60 per cent of the variance (its eigenvalue is about 58, out of a total variance of 95).
We can also put the percentage of variance explained on the vertical axis.
fviz_eig(pca1)
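The two scree plots differ only in the scale of the vertical axis. As a worked check, the figures reported by summary(pca1) can be converted by hand, assuming that the total variance equals the number of variables (95), which holds for standardised data:

```r
sdev_pc1 <- 7.6204   # standard deviation of PC1, taken from summary(pca1)
n_vars <- 95         # number of standardised indicators

# Eigenvalue of PC1 and its share of the total variance.
eigenvalue_pc1 <- sdev_pc1^2
share_pc1 <- eigenvalue_pc1 / n_vars

round(eigenvalue_pc1, 2)   # 58.07
round(share_pc1, 4)        # 0.6113
```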
Now let us see which variables contribute most to the first two principal components.
library(gridExtra)
var<-get_pca_var(pca1)
a<-fviz_contrib(pca1, "var", axes=1, xtickslab.rt=90) # default angle=45°
b<-fviz_contrib(pca1, "var", axes=2, xtickslab.rt=90)
grid.arrange(a,b,top='Contribution to the first two Principal Components')
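The contributions shown in these plots can be approximated by hand. The sketch below assumes that the contribution of a variable to a component is its squared coordinate as a percentage of the sum of squared coordinates on that component, which reduces to 100 times the squared loading because the columns of the rotation matrix have unit length. It uses the built-in mtcars data as a stand-in and has not been compared against the “factoextra” output for the V-Dem data.

```r
# PCA on standardised toy data (mtcars stands in for the indicator matrix).
pca_toy <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Columns of the rotation matrix have unit length, so the squared loadings
# of each component sum to 1 and can be read as percentage contributions.
contrib_toy <- 100 * pca_toy$rotation^2
colSums(contrib_toy)[1:2]

# Variables ranked by their contribution to the first component.
sort(contrib_toy[, "PC1"], decreasing = TRUE)

# Contribution each variable would have if all contributed equally.
100 / nrow(contrib_toy)
```

Variables whose contribution exceeds the uniform value in the last line can be read as contributing more than average to that component.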
The “factoextra” functions also accept “princomp” objects, but here we draw the scree plot of the second model with base R.
plot(pca2, type='lines', main='Scree plot')
As stated by Harvey and Hanson (2024), the results of the two functions do not differ much. To summarize, by running principal component analysis, we managed to decrease the number of dimensions from 95 to only 11 and preserve almost 90 per cent of the variation in the data.
PCA proved to be highly effective in dealing with highly correlated data. Nevertheless, it is not designed to preserve cluster structures in the data. We therefore check the clustering tendency of the data with the Hopkins statistic.
library('factoextra')
get_clust_tendency(pred, 20, graph=TRUE, gradient = list(low = "red", mid = "white", high = "blue"), seed = 123)
## $hopkins_stat
## [1] 0.722153
##
## $plot
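To make the idea behind the statistic concrete, the following is a rough base R sketch of one common formulation of the Hopkins statistic. It compares nearest-neighbour distances of uniformly generated points with those of sampled real points. The function name hopkins_sketch and the toy data are hypothetical, and the exact formula used by “factoextra” may differ between package versions, so the values are not expected to reproduce the output above.

```r
# One common formulation: H = sum(u) / (sum(u) + sum(w)), where
#   u = distance from each artificial uniform point to its nearest real point,
#   w = distance from each sampled real point to its nearest other real point.
hopkins_sketch <- function(x, m = 20, seed = 123) {
  set.seed(seed)
  x <- as.matrix(x)
  n <- nrow(x)
  p <- ncol(x)

  # m real points sampled without replacement.
  idx <- sample(n, m)

  # m artificial points drawn uniformly inside the bounding box of the data.
  mins <- apply(x, 2, min)
  maxs <- apply(x, 2, max)
  u_pts <- sapply(seq_len(p), function(j) runif(m, mins[j], maxs[j]))

  # Euclidean distances from one point to every row of a matrix.
  dist_to <- function(pt, mat) sqrt(rowSums(sweep(mat, 2, pt)^2))

  u <- apply(u_pts, 1, function(pt) min(dist_to(pt, x)))
  w <- sapply(idx, function(i) min(dist_to(x[i, ], x[-i, , drop = FALSE])))

  sum(u) / (sum(u) + sum(w))
}

# Toy data: two well-separated Gaussian blobs in two dimensions.
set.seed(1)
blobs <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
               matrix(rnorm(100, mean = 10, sd = 0.3), ncol = 2))
h_blobs <- hopkins_sketch(blobs)
h_blobs
```

Under this formulation, values near 0.5 suggest spatially random data and values approaching 1 suggest clustered data.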
In our case, the Hopkins statistic is relatively high (0.72), which indicates some cluster structure in the data. To reduce the number of dimensions to two while preserving that structure, we additionally applied the t-SNE algorithm.
set.seed(123)
library('Rtsne')
tsne<-Rtsne(pred, dims=2, perplexity=25)
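The perplexity cannot be chosen freely for a small sample. As far as we understand the input check in “Rtsne”, the call is rejected when three times the perplexity exceeds the number of rows minus one; this is an assumption that should be verified against the installed package version. The helper max_perplexity below is hypothetical and only illustrates the arithmetic for the 173 countries used here.

```r
# Largest perplexity allowed under the assumed rule 3 * perplexity <= n_rows - 1.
max_perplexity <- function(n_rows) floor((n_rows - 1) / 3)

upper_bound <- max_perplexity(173)
upper_bound          # 57

# The value used in the analysis lies within this bound.
25 <= upper_bound
```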
To make a visualization, we need to construct a dataframe containing the results.
tsne_res<-data.frame(Dim1=tsne$Y[, 1], Dim2=tsne$Y[, 2], Country=dem_extracted[, 1])
head(tsne_res)
## Dim1 Dim2 Country
## 1 -3.125425 1.412076 Mexico
## 2 -1.778379 12.321756 Sweden
## 3 -0.706194 13.793165 Switzerland
## 4 1.918939 5.320777 Ghana
## 5 -1.569685 4.790853 South Africa
## 6 -3.421264 10.164769 Japan
Now, let us visualize the results.
library('ggplot2')
ggplot(tsne_res, aes(x = Dim1, y = Dim2)) +
geom_point(color='red',size = 3) +
labs(title = "t-SNE Visualization", x = "Dimension 1", y = "Dimension 2") +
theme_minimal()
Visually, we can observe some groups in the data. Let us find the appropriate number of clusters and conduct k-means.
library('gridExtra')
library('factoextra')
opt_km<-fviz_nbclust(pred, FUNcluster=kmeans, method="gap_stat")+theme_classic()
grid.arrange(opt_km)
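The gap statistic behind this plot can also be computed directly with clusGap() and maxSE() from the “cluster” package. The sketch below runs on simulated toy data, and the settings (K.max, B, nstart and the "firstSEmax" rule) are illustrative assumptions rather than the settings used in the analysis above.

```r
library(cluster)  # recommended package shipped with standard R distributions

set.seed(123)
# Toy data: three well-separated Gaussian blobs in two dimensions.
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 8, sd = 0.3), ncol = 2)
)

# Gap statistic for k = 1, ..., 6 with 50 reference data sets;
# nstart is passed on to kmeans().
gap_toy <- clusGap(toy, FUNcluster = kmeans, K.max = 6, B = 50,
                   verbose = FALSE, nstart = 10)

# Smallest k whose gap is within one standard error of the first local maximum.
k_hat <- maxSE(gap_toy$Tab[, "gap"], gap_toy$Tab[, "SE.sim"],
               method = "firstSEmax", SE.factor = 1)
k_hat
```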
We run the k-means procedure with eight clusters and assign a cluster to each country.
library('stats')
set.seed(1234)
km8<-kmeans(pred, 8)
tsne_res$cluster<-factor(km8$cluster)
head(tsne_res)
## Dim1 Dim2 Country cluster
## 1 -3.125425 1.412076 Mexico 7
## 2 -1.778379 12.321756 Sweden 8
## 3 -0.706194 13.793165 Switzerland 8
## 4 1.918939 5.320777 Ghana 2
## 5 -1.569685 4.790853 South Africa 2
## 6 -3.421264 10.164769 Japan 8
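Because k-means starts from randomly chosen centres, the solution can depend on the seed. One common safeguard is the nstart argument of kmeans(), which repeats the algorithm from several random starts and keeps the run with the smallest total within-cluster sum of squares. The sketch below uses the built-in mtcars data and four clusters purely for illustration.

```r
# Standardised toy data (mtcars stands in for the indicator matrix).
x_toy <- scale(mtcars)

# A single random start.
set.seed(1234)
km_single <- kmeans(x_toy, centers = 4, nstart = 1)

# 25 random starts from the same seed; the best run is kept.
set.seed(1234)
km_multi <- kmeans(x_toy, centers = 4, nstart = 25)

# Total within-cluster sum of squares for both fits (lower is better).
c(single = km_single$tot.withinss, multi = km_multi$tot.withinss)
```

With the same seed, the first start is identical in both calls, so the multi-start fit should never be worse than the single-start fit.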
We can visualize the data one more time.
library('ggplot2')
colors_clust<-c('red', 'blue', 'green', 'black', 'yellow', 'purple', 'brown', 'orange')
ggplot(tsne_res, aes(x = Dim1, y = Dim2, color=cluster)) +
geom_point(size = 3) +
labs(title = "t-SNE Visualization", x = "Dimension 1", y = "Dimension 2") + theme_minimal() + scale_color_manual(values=colors_clust)
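To conclude, the t-SNE algorithm reduced the number of dimensions from 95 to only two, and we can observe some clusters. However, the groups overlap at some points. This may partly reflect the fact that k-means was run on the standardised 95-dimensional data rather than on the two t-SNE coordinates.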
Conclusion
In this project, we conducted principal component analysis on a dataset consisting of 95 democracy indicators. As the data appeared to be highly correlated, we used PCA to reduce the number of dimensions from 95 to only 11 while preserving almost 90 per cent of the variance in the original data. In addition, the Hopkins statistic showed the presence of cluster structures in the data. We reduced the number of dimensions to two with the t-SNE method and plotted the results. Then, we determined the optimal number of clusters and conducted k-means. After that, the output was plotted again. From the final graph, we can observe cluster patterns in the data.
Sources