Introduction

The objective of this paper is to reduce the complexity of the 2023 national data by applying Principal Component Analysis (PCA): extracting the main features and structure of the data while retaining its key information. PCA is effective at identifying the principal components of a dataset, which helps in understanding each component's variance contribution and the data's intrinsic patterns. This paper aims to reveal the underlying patterns in the data, identify key variables, and provide a scientific basis for subsequent data analysis, pattern recognition, and decision support.
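As a quick illustration of what PCA computes, here is a minimal sketch (on synthetic toy data, not the national dataset) showing that the variances of the principal components returned by prcomp() are the eigenvalues of the correlation matrix when the data are standardized:

# PCA on standardized data is an eigendecomposition of the correlation matrix
set.seed(42)
toy <- matrix(rnorm(100 * 4), ncol = 4)      # 100 observations, 4 variables
p <- prcomp(toy, scale. = TRUE)              # standardized PCA
all.equal(p$sdev^2, eigen(cor(toy))$values)  # TRUE: PC variances = eigenvalues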

# Load necessary libraries
library(tidyverse)   # data manipulation and ggplot2
library(naniar)      # missing-value visualization (vis_miss)
library(corrplot)    # correlation matrix plots
library(ggcorrplot)  # ggplot2-based correlation plots
library(factoextra)  # PCA visualization (fviz_* functions)
# Load the dataset
df <- read.csv('C:/Users/Pandita/Desktop/world-data.csv')
head(df)
##    Country Agricultural.Land.... Armed.Forces.size Birth.Rate Forested.Area....
## 1  Armenia                  0.59             49000      13.99              0.12
## 2 Barbados                  0.23              1000      10.65              0.15
## 3  Belgium                  0.45             32000      10.30              0.23
## 4   Belize                  0.07              2000      20.79              0.60
## 5    Benin                  0.33             12000      36.22              0.38
## 6   Bhutan                  0.14              6000      17.26              0.73
##   Infant.mortality Out.of.pocket.health.expenditure
## 1             11.0                             0.82
## 2             11.3                             0.45
## 3              2.9                             0.18
## 4             11.2                             0.23
## 5             60.5                             0.41
## 6             24.8                             0.20
##   Population..Labor.force.participation.... Unemployment.rate
## 1                                      0.56              0.17
## 2                                      0.65              0.10
## 3                                      0.54              0.06
## 4                                      0.65              0.06
## 5                                      0.71              0.02
## 6                                      0.67              0.02

Check for missing values

The vis_miss() plot below shows that there are no missing values in the dataset.

vis_miss(df)

Data pre-processing

Data Normalization

First, we preprocess the data by min-max scaling each numeric column (columns 2 through 6) to the [0, 1] range: each value x is transformed as (x - min(x)) / (max(x) - min(x)), with a small constant (1e-8) added so that no scaled value is exactly zero.

# Min-max scaling of the numeric columns, offset by 1e-8
data_normalized <- df
data_normalized[, 2:6] <- apply(df[, 2:6], 2, function(x) (x - min(x)) / (max(x) - min(x)) + 1e-8)
data_normalized[, 2:6]
##    Agricultural.Land.... Armed.Forces.size Birth.Rate Forested.Area....
## 1             0.75362320       0.015841594 0.19508586        0.11363637
## 2             0.23188407       0.000000010 0.09621079        0.14772728
## 3             0.55072465       0.010231033 0.08584963        0.23863637
## 4             0.00000001       0.000330043 0.39638841        0.65909092
## 5             0.37681160       0.003630373 0.85316756        0.40909092
## 6             0.10144929       0.001650175 0.29188870        0.80681819
## 7             0.56521740       0.002640274 0.51568977        0.19318183
## 8             0.39130436       0.240594069 0.19301363        0.64772728
## 9             0.00000001       0.023432353 0.07992896        0.40909092
## 10            0.20289856       0.039934003 0.14890469        0.25000001
## 11            0.71014494       0.889108921 0.10361161        0.22727274
## 12            0.47826088       0.158415852 0.22143281        0.57954546
## 13            0.34782610       0.003630373 0.75370042        0.71590910
## 14            0.40579711       0.002970307 0.19449379        0.60227274
## 15            0.30434784       0.005610571 0.04736531        0.36363637
## 16            0.55072465       0.007260736 0.09769095        0.37500001
## 17            0.07246378       0.043894399 1.00000001        0.73863637
## 18            0.60869566       0.023102320 0.35849616        0.45454546
## 19            1.00000001       0.013531363 0.32119598        0.12500001
## 20            0.23188407       0.001650175 0.10361161        0.55681819
## 21            0.23188407       0.000990109 0.41089403        0.61363637
## 22            0.65217392       0.100990109 0.11545294        0.32954546
## 23            0.18840581       0.001980208 0.71669628        1.00000001
## 24            0.76811595       0.000000010 0.92184726        0.52272728
## 25            0.40579711       0.008250835 0.17969214        0.44318183
## 26            0.59420291       0.059075918 0.06216697        0.35227274
## 27            0.89855073       0.004950505 0.65156899        0.44318183
## 28            0.59420291       0.047854795 0.02072233        0.34090910
## 29            0.42028987       0.013861396 0.50799291        0.35227274
## 30            0.31884059       0.007260736 0.42036709        0.43181819
## 31            0.73913044       0.012871297 0.06512730        0.23863637
## 32            0.76811595       1.000000010 0.30965069        0.25000001
## 33            0.36231885       0.222772287 0.31586739        0.54545456
## 34            0.30434784       0.185478558 0.33688574        0.05681819
## 35            0.20289856       0.068646875 0.64179989        0.00000001
## 36            0.26086958       0.058415852 0.39668444        0.06818183
## 37            0.49275363       0.000990109 0.25754886        0.32954546
## 38            0.07246378       0.085808591 0.00000001        0.76136365
##    Infant.mortality
## 1       0.138554227
## 2       0.143072299
## 3       0.016566275
## 4       0.141566275
## 5       0.884036155
## 6       0.346385552
## 7       0.424698805
## 8       0.165662661
## 9       0.037650612
## 10      0.066265070
## 11      0.084337359
## 12      0.156626516
## 13      0.518072299
## 14      0.087349408
## 15      0.033132540
## 16      0.013554227
## 17      1.000000010
## 18      0.335843383
## 19      0.150602420
## 20      0.004518082
## 21      0.298192781
## 22      0.024096396
## 23      0.465361456
## 24      0.560240974
## 25      0.103915673
## 26      0.019578323
## 27      0.498493986
## 28      0.027108444
## 29      0.305722902
## 30      0.200301215
## 31      0.027108444
## 32      0.423192781
## 33      0.290662661
## 34      0.159638564
## 35      0.311746998
## 36      0.018072299
## 37      0.159638564
## 38      0.000000010

Counting the NA values in each column confirms that there are no missing values, so we can move forward.

colSums(is.na(data_normalized))
##                                   Country 
##                                         0 
##                     Agricultural.Land.... 
##                                         0 
##                         Armed.Forces.size 
##                                         0 
##                                Birth.Rate 
##                                         0 
##                         Forested.Area.... 
##                                         0 
##                          Infant.mortality 
##                                         0 
##          Out.of.pocket.health.expenditure 
##                                         0 
## Population..Labor.force.participation.... 
##                                         0 
##                         Unemployment.rate 
##                                         0

Relevant analysis

d <- dist(data_normalized[, 2:6])
fviz_dist(d, show_labels = FALSE) + labs(title = "Ordered Dissimilarity Matrix of Country Data in 2023")

From this ordered dissimilarity plot, it appears that there is some structure in the data: distinct blocks or patterns of color in the plot can indicate a potential clustering structure.

corrplot(cor(data_normalized[, 2:6], use = "complete.obs"),
         method = "number", type = "upper", diag = TRUE,
         tl.col = "black", tl.srt = 30, tl.cex = 0.9, number.cex = 0.85,
         title = "Correlation of Country Data", mar = c(0, 0, 1, 0))

cor_matrix <- cor(data_normalized[,2:6])
print(cor_matrix)
##                       Agricultural.Land.... Armed.Forces.size  Birth.Rate
## Agricultural.Land....            1.00000000       0.267838992 -0.05893966
## Armed.Forces.size                0.26783899       1.000000000 -0.14334426
## Birth.Rate                      -0.05893966      -0.143344264  1.00000000
## Forested.Area....               -0.41703233      -0.173229589  0.25684812
## Infant.mortality                -0.03909071      -0.001905202  0.90224866
##                       Forested.Area.... Infant.mortality
## Agricultural.Land....        -0.4170323     -0.039090714
## Armed.Forces.size            -0.1732296     -0.001905202
## Birth.Rate                    0.2568481      0.902248661
## Forested.Area....             1.0000000      0.320873961
## Infant.mortality              0.3208740      1.000000000
cor_matrix <- cor(data_normalized[, 2:8])
corrplot(cor_matrix, method = "ellipse", type = "upper")

ggcorrplot(cor_matrix, lab = TRUE)
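The correlation matrices show one notably strong relationship: Birth.Rate and Infant.mortality are highly correlated (r ≈ 0.90), which suggests redundancy that PCA can exploit. As a small sketch, strongly correlated pairs can also be flagged programmatically (the 0.8 cutoff here is an arbitrary illustrative threshold):

# Flag variable pairs with |r| > 0.8, excluding the diagonal
which(abs(cor_matrix) > 0.8 & abs(cor_matrix) < 1, arr.ind = TRUE)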

PCA Analysis

pca <- prcomp(data_normalized[, 2:6], scale. = TRUE)  # standardize variables before PCA
pca
## Standard deviations (1, .., p=5):
## [1] 1.4662227 1.1810420 0.9201123 0.7258961 0.2860051
## 
## Rotation (n x k) = (5 x 5):
##                              PC1        PC2         PC3         PC4         PC5
## Agricultural.Land.... -0.2503384  0.6328285  0.27364217  0.67943960  0.01845354
## Armed.Forces.size     -0.1978717  0.4750746 -0.82907387 -0.18465961  0.11699693
## Birth.Rate             0.6014786  0.3201625  0.17434135 -0.16557621  0.69130971
## Forested.Area....      0.4220142 -0.3733623 -0.45515717  0.68425976  0.08441080
## Infant.mortality       0.5985859  0.3632200 -0.01391096 -0.09292874 -0.70776895
fviz_screeplot(pca,addlabels = TRUE,ylim=c(0,100),main="Scree Plot of PCA")

Variance Contribution:
Dimension 1 explains 43% of the variance and is the largest contributor, indicating that this principal component captures the major patterns of variation in the data.
Dimension 2 explains 27.9% of the variance and is the second-largest contributor.
Dimension 3 explains 16.9% of the variance and remains a substantial principal component.
Dimensions 4 and 5 contribute less variance, 10.5% and 1.6% respectively.

summary(pca)
## Importance of components:
##                          PC1    PC2    PC3    PC4     PC5
## Standard deviation     1.466 1.1810 0.9201 0.7259 0.28601
## Proportion of Variance 0.430 0.2790 0.1693 0.1054 0.01636
## Cumulative Proportion  0.430 0.7089 0.8783 0.9836 1.00000
fviz_eig(pca)

The cumulative proportion of explained variance displayed above indicates that the first three components explain almost 88% of the variance, so over 80% of the information is preserved while the number of variables is reduced from five to three. The first two components already explain about 71% of the variance, so two or three components are a reasonable choice; the scree plot and the cumulative-proportion table point to the same conclusion.
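These proportions can be recomputed directly from the component standard deviations stored in the pca object:

# Proportion and cumulative proportion of variance explained
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
round(rbind(Proportion = prop_var, Cumulative = cumsum(prop_var)), 4)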

Components analysis

The “cloud of points” plot shows how well individual observations are represented by the first two components (their cos2 quality of representation).

fviz_pca_biplot(pca, 
                repel = TRUE,  
                col.var = "blue",  
                col.ind = "red",   
                labelsize = 4,     
                arrowsize = 1.5,   
                title = "PCA Biplot of Data") +
  theme_minimal()

fviz_pca_ind(pca, col.ind="cos2", geom = "point", gradient.cols = c("green", "blue", "red" ))

fviz_pca_var(pca, col.var = "red")

The plot above illustrates the relationships between the variables and their quality of representation. Variables that are positively correlated are positioned close to each other, while negatively correlated variables lie on opposite sides of the plot. The quality of a variable's representation is indicated by its distance from the center of the plot; here the best-represented variables are Birth.Rate and Infant.mortality, as they lie farthest from the center. However, based solely on this graph, it is difficult to clearly distinguish the individual components.
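The cos2 values behind this plot can also be inspected numerically; a minimal sketch using factoextra's get_pca_var():

# Quality of representation (cos2) of each variable on the first two PCs
var_info <- get_pca_var(pca)
round(var_info$cos2[, 1:2], 3)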

The percentage contribution of each variable to the first two components is displayed in the plots produced below.
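A minimal sketch using factoextra's fviz_contrib(), one plot per component:

# Contribution of each variable to the first two principal components
fviz_contrib(pca, choice = "var", axes = 1)
fviz_contrib(pca, choice = "var", axes = 2)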

Conclusion

Dimension reduction refers to the process of decreasing the number of dimensions (or features) in a dataset while aiming to retain as much of the original information as possible. The analysis conducted here demonstrates that almost 88% of the variance in the dataset can be explained by three of the five principal components. Furthermore, the first two components alone preserve about 71% of the information contained in the original variables. These findings highlight the effectiveness of dimension reduction techniques, which are particularly valuable for analyzing and storing large datasets efficiently. By reducing complexity without significant loss of information, these methods enable more streamlined data processing and interpretation.
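As a final sketch, the reduced representation itself is readily available from the pca object: the scores of each country on the first three components form the compressed dataset.

# The reduced dataset: scores of each observation on the first three PCs
reduced <- as.data.frame(pca$x[, 1:3])
head(reduced)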