The objective of this paper is to reduce the complexity of the 2023
national primary data by applying principal component analysis (PCA),
a dimensionality reduction method that extracts the main features and
structure of the data while retaining the key information.
The PCA method is effective in identifying the main components of the
data, which helps to understand the variance contribution and the
intrinsic pattern of the data. This paper aims to reveal the underlying
patterns of the data, identify key variables, and provide a scientific
basis for subsequent data analysis, pattern recognition and decision
support.
# Load necessary libraries
library(factoextra)
library(tidyverse)
library(readr)
library(ROCR)
library(PerformanceAnalytics)
library(e1071)
library(caret)
library(gbm)
library(corrplot)
library(ggcorrplot)
library(MASS)
library(rpart)
library(caTools)
library(naivebayes)
library(class)
library(ISLR)
library(glmnet)
library(Hmisc)
library(funModeling)
library(pROC)
library(randomForest)
library(naniar)
# Load the dataset
df <- read.csv('C:/Users/Pandita/Desktop/world-data.csv')
head(df)
## Country Agricultural.Land.... Armed.Forces.size Birth.Rate Forested.Area....
## 1 Armenia 0.59 49000 13.99 0.12
## 2 Barbados 0.23 1000 10.65 0.15
## 3 Belgium 0.45 32000 10.30 0.23
## 4 Belize 0.07 2000 20.79 0.60
## 5 Benin 0.33 12000 36.22 0.38
## 6 Bhutan 0.14 6000 17.26 0.73
## Infant.mortality Out.of.pocket.health.expenditure
## 1 11.0 0.82
## 2 11.3 0.45
## 3 2.9 0.18
## 4 11.2 0.23
## 5 60.5 0.41
## 6 24.8 0.20
## Population..Labor.force.participation.... Unemployment.rate
## 1 0.56 0.17
## 2 0.65 0.10
## 3 0.54 0.06
## 4 0.65 0.06
## 5 0.71 0.02
## 6 0.67 0.02
There are no missing values in the dataset.
vis_miss(df)
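First, we preprocess the data by min-max scaling columns 2 to 6, the five variables later used for PCA. Note that, because of operator precedence, the small constant 1e-8 in the code is added to the scaled values rather than to the denominator. It therefore shifts the range to [1e-8, 1 + 1e-8], as the output shows, and does not by itself prevent division by zero for a constant column.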
#(Min-Max Scaling)
data_normalized <- df
data_normalized[, 2:6] <- apply(df[, 2:6], 2, function(x) (x - min(x)) / (max(x) - min(x))+ 1e-8)
data_normalized[, 2:6]
## Agricultural.Land.... Armed.Forces.size Birth.Rate Forested.Area....
## 1 0.75362320 0.015841594 0.19508586 0.11363637
## 2 0.23188407 0.000000010 0.09621079 0.14772728
## 3 0.55072465 0.010231033 0.08584963 0.23863637
## 4 0.00000001 0.000330043 0.39638841 0.65909092
## 5 0.37681160 0.003630373 0.85316756 0.40909092
## 6 0.10144929 0.001650175 0.29188870 0.80681819
## 7 0.56521740 0.002640274 0.51568977 0.19318183
## 8 0.39130436 0.240594069 0.19301363 0.64772728
## 9 0.00000001 0.023432353 0.07992896 0.40909092
## 10 0.20289856 0.039934003 0.14890469 0.25000001
## 11 0.71014494 0.889108921 0.10361161 0.22727274
## 12 0.47826088 0.158415852 0.22143281 0.57954546
## 13 0.34782610 0.003630373 0.75370042 0.71590910
## 14 0.40579711 0.002970307 0.19449379 0.60227274
## 15 0.30434784 0.005610571 0.04736531 0.36363637
## 16 0.55072465 0.007260736 0.09769095 0.37500001
## 17 0.07246378 0.043894399 1.00000001 0.73863637
## 18 0.60869566 0.023102320 0.35849616 0.45454546
## 19 1.00000001 0.013531363 0.32119598 0.12500001
## 20 0.23188407 0.001650175 0.10361161 0.55681819
## 21 0.23188407 0.000990109 0.41089403 0.61363637
## 22 0.65217392 0.100990109 0.11545294 0.32954546
## 23 0.18840581 0.001980208 0.71669628 1.00000001
## 24 0.76811595 0.000000010 0.92184726 0.52272728
## 25 0.40579711 0.008250835 0.17969214 0.44318183
## 26 0.59420291 0.059075918 0.06216697 0.35227274
## 27 0.89855073 0.004950505 0.65156899 0.44318183
## 28 0.59420291 0.047854795 0.02072233 0.34090910
## 29 0.42028987 0.013861396 0.50799291 0.35227274
## 30 0.31884059 0.007260736 0.42036709 0.43181819
## 31 0.73913044 0.012871297 0.06512730 0.23863637
## 32 0.76811595 1.000000010 0.30965069 0.25000001
## 33 0.36231885 0.222772287 0.31586739 0.54545456
## 34 0.30434784 0.185478558 0.33688574 0.05681819
## 35 0.20289856 0.068646875 0.64179989 0.00000001
## 36 0.26086958 0.058415852 0.39668444 0.06818183
## 37 0.49275363 0.000990109 0.25754886 0.32954546
## 38 0.07246378 0.085808591 0.00000001 0.76136365
## Infant.mortality
## 1 0.138554227
## 2 0.143072299
## 3 0.016566275
## 4 0.141566275
## 5 0.884036155
## 6 0.346385552
## 7 0.424698805
## 8 0.165662661
## 9 0.037650612
## 10 0.066265070
## 11 0.084337359
## 12 0.156626516
## 13 0.518072299
## 14 0.087349408
## 15 0.033132540
## 16 0.013554227
## 17 1.000000010
## 18 0.335843383
## 19 0.150602420
## 20 0.004518082
## 21 0.298192781
## 22 0.024096396
## 23 0.465361456
## 24 0.560240974
## 25 0.103915673
## 26 0.019578323
## 27 0.498493986
## 28 0.027108444
## 29 0.305722902
## 30 0.200301215
## 31 0.027108444
## 32 0.423192781
## 33 0.290662661
## 34 0.159638564
## 35 0.311746998
## 36 0.018072299
## 37 0.159638564
## 38 0.000000010
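In the scaling expression above, the division binds tighter than the addition, so `1e-8` is added to each scaled value. If the intent is to guard against a zero range (a constant column), one option is to place the constant in the denominator. The following is a minimal sketch under that assumption; `min_max_safe` and the toy data frame are hypothetical names introduced for illustration and are not part of the original analysis.

```r
# Hypothetical helper: min-max scaling with the guard in the denominator.
min_max_safe <- function(x, eps = 1e-8) {
  rng <- max(x) - min(x)
  (x - min(x)) / (rng + eps)   # eps keeps the denominator non-zero
}

# Toy data (not from the paper): column b is constant.
toy <- data.frame(a = c(1, 2, 3), b = c(5, 5, 5))
toy_scaled <- as.data.frame(lapply(toy, min_max_safe))
print(toy_scaled)
```

With this variant a constant column maps to zeros instead of `NaN`, and the other columns stay within [0, 1).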
The column-wise NA counts confirm that there are no missing values, so we can move forward.
colSums(is.na(data_normalized))
## Country
## 0
## Agricultural.Land....
## 0
## Armed.Forces.size
## 0
## Birth.Rate
## 0
## Forested.Area....
## 0
## Infant.mortality
## 0
## Out.of.pocket.health.expenditure
## 0
## Population..Labor.force.participation....
## 0
## Unemployment.rate
## 0
The ordered dissimilarity plot suggests that there is some structure in the data: distinct blocks of color in such a plot often indicate potential clusters.
d<-dist(data_normalized[,2:6])
fviz_dist(d, show_labels = FALSE)+ labs(title = "c in 2023")
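To make the idea of an ordered dissimilarity matrix concrete, the sketch below reorders a distance matrix with hierarchical clustering in base R, so that similar observations sit next to each other and clusters appear as blocks along the diagonal. It uses simulated two-cluster data (an assumption for illustration), and the `ward.D2` linkage is a choice made here, not necessarily the ordering that `fviz_dist` applies internally.

```r
set.seed(1)
# Simulated data (illustrative only): two well-separated groups of 10 points.
x <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(20, mean = 3, sd = 0.3), ncol = 2))
x <- x[sample(nrow(x)), ]             # shuffle rows to hide the grouping

d   <- dist(x)                         # Euclidean distances
ord <- hclust(d, method = "ward.D2")$order
m   <- as.matrix(d)[ord, ord]          # reorder rows and columns together

# Blocks of small distances along the diagonal suggest clusters.
image(m, axes = FALSE, main = "Ordered dissimilarity (toy data)")
```

The same reordering can be applied to `d` from the analysis to inspect which countries fall into the same block.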
corrplot(cor(data_normalized[,2:6], use="complete"), method="number", type="upper", diag=T, tl.col="black", tl.srt=30, tl.cex=0.9, number.cex=0.85, title="Correlation of Country-Data", mar=c(0,0,1,0))
cor_matrix <- cor(data_normalized[,2:6])
print(cor_matrix)
## Agricultural.Land.... Armed.Forces.size Birth.Rate
## Agricultural.Land.... 1.00000000 0.267838992 -0.05893966
## Armed.Forces.size 0.26783899 1.000000000 -0.14334426
## Birth.Rate -0.05893966 -0.143344264 1.00000000
## Forested.Area.... -0.41703233 -0.173229589 0.25684812
## Infant.mortality -0.03909071 -0.001905202 0.90224866
## Forested.Area.... Infant.mortality
## Agricultural.Land.... -0.4170323 -0.039090714
## Armed.Forces.size -0.1732296 -0.001905202
## Birth.Rate 0.2568481 0.902248661
## Forested.Area.... 1.0000000 0.320873961
## Infant.mortality 0.3208740 1.000000000
cor_matrix <- cor(data_normalized[,2:8])
corrplot(cor_matrix, method = "ellipse", type = "upper")
ggcorrplot(cor_matrix, lab = TRUE)
pca <- prcomp(data_normalized[, 2:6], scale. = TRUE)
pca
## Standard deviations (1, .., p=5):
## [1] 1.4662227 1.1810420 0.9201123 0.7258961 0.2860051
##
## Rotation (n x k) = (5 x 5):
## PC1 PC2 PC3 PC4 PC5
## Agricultural.Land.... -0.2503384 0.6328285 0.27364217 0.67943960 0.01845354
## Armed.Forces.size -0.1978717 0.4750746 -0.82907387 -0.18465961 0.11699693
## Birth.Rate 0.6014786 0.3201625 0.17434135 -0.16557621 0.69130971
## Forested.Area.... 0.4220142 -0.3733623 -0.45515717 0.68425976 0.08441080
## Infant.mortality 0.5985859 0.3632200 -0.01391096 -0.09292874 -0.70776895
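Because `prcomp` is run here on standardized variables, the same components can be recovered from an eigendecomposition of the correlation matrix. The sketch below re-enters the correlation matrix printed earlier and compares the square roots of its eigenvalues with the reported standard deviations. The short variable names are abbreviations introduced here, and eigenvector signs may differ from the `prcomp` output because they are arbitrary.

```r
vars <- c("AgriLand", "ArmedForces", "BirthRate", "Forest", "InfantMort")
# Correlation matrix copied from the printed cor_matrix output.
R <- matrix(c(
   1.00000000,  0.267838992, -0.05893966, -0.41703233, -0.039090714,
   0.26783899,  1.000000000, -0.14334426, -0.17322959, -0.001905202,
  -0.05893966, -0.143344264,  1.00000000,  0.25684812,  0.902248661,
  -0.41703233, -0.173229589,  0.25684812,  1.00000000,  0.320873961,
  -0.03909071, -0.001905202,  0.90224866,  0.32087396,  1.000000000),
  nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

e    <- eigen(R, symmetric = TRUE)
sdev <- sqrt(e$values)        # component standard deviations
print(round(sdev, 4))
print(round(e$vectors, 3))    # loadings, up to sign
```

The eigenvalues sum to 5, the number of standardized variables, which is why each squared standard deviation divided by 5 gives the proportion of variance for that component.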
fviz_screeplot(pca,addlabels = TRUE,ylim=c(0,100),main="Scree Plot of PCA")
Variance contribution:
Dimension 1 explains 43.0% of the variance and is the largest contributor,
indicating that this principal component captures the major patterns of
variation in the data.
Dimension 2 explains 27.9% of the variance and is the second largest
contributor.
Dimension 3 explains 16.9% of the variance and remains a substantial
principal component.
Dimensions 4 and 5 contribute less, 10.5% and 1.6% of the variance
respectively.
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 1.466 1.1810 0.9201 0.7259 0.28601
## Proportion of Variance 0.430 0.2790 0.1693 0.1054 0.01636
## Cumulative Proportion 0.430 0.7089 0.8783 0.9836 1.00000
fviz_eig(pca)
The cumulative proportion of explained variance shown above indicates that the first three components explain 87.8% of the variance, that is, over 80%. This share of the information is preserved when the five variables are reduced to three components. The first two components explain 70.9% of the variance, slightly under three quarters, so two components may suffice for visualization, while three are needed to pass the 80% mark. The scree plot and the summary table give the same figures.
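The proportions discussed above follow directly from the component standard deviations. As a minimal sketch, the code below recomputes them from the `sdev` values printed by `prcomp` and picks the smallest number of components that reaches a target share; the 0.80 threshold and the name `k80` are assumptions made for illustration.

```r
# Standard deviations as reported in the prcomp output.
sdev <- c(1.4662227, 1.1810420, 0.9201123, 0.7258961, 0.2860051)

prop <- sdev^2 / sum(sdev^2)   # proportion of variance per component
cum  <- cumsum(prop)           # cumulative proportion

# Smallest number of components reaching the assumed 0.80 target.
k80 <- which(cum >= 0.80)[1]

print(round(rbind(prop, cum), 4))
print(k80)
```

For these values the rule selects three components, since the first two reach only about 0.709.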
The “cloud of points” graph shows the quality of representation (cos2) of the individual observations.
fviz_pca_biplot(pca,
repel = TRUE,
col.var = "blue",
col.ind = "red",
labelsize = 4,
arrowsize = 1.5,
title = "PCA Biplot of Data") +
theme_minimal()
fviz_pca_ind(pca, col.ind="cos2", geom = "point", gradient.cols = c("green", "blue", "red" ))
fviz_pca_var(pca, col.var = "red")
The plot above illustrates the relationships between the variables and the quality of representation of each variable on the first two components. Variables that are positively correlated point in similar directions, while negatively correlated variables point to opposite sides of the plot. The quality of a variable is indicated by the distance of its arrow tip from the center of the plot; here the best-represented variables are Infant.mortality and Birth.Rate, which have the longest arrows according to the loadings and standard deviations reported above. However, based solely on this graph, it is challenging to clearly distinguish the individual components.
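The quality of representation can also be computed numerically. In the sketch below, each variable's coordinate on a component is its loading multiplied by the component's standard deviation, and its quality on the PC1-PC2 plane is the sum of the squared coordinates (cos2). The loadings and standard deviations are copied from the `prcomp` output; the variable names are shortened here for readability.

```r
vars <- c("Agricultural.Land", "Armed.Forces.size", "Birth.Rate",
          "Forested.Area", "Infant.mortality")
# PC1 and PC2 loadings, copied from the prcomp rotation matrix.
load12 <- matrix(c(-0.2503384,  0.6328285,
                   -0.1978717,  0.4750746,
                    0.6014786,  0.3201625,
                    0.4220142, -0.3733623,
                    0.5985859,  0.3632200),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(vars, c("PC1", "PC2")))
sdev12 <- c(1.4662227, 1.1810420)

coord <- sweep(load12, 2, sdev12, `*`)  # variable coordinates on PC1 and PC2
cos2  <- rowSums(coord^2)               # quality of representation on the plane
print(sort(round(cos2, 3), decreasing = TRUE))
```

Infant.mortality and Birth.Rate come out highest, while Armed.Forces.size is the least well represented on this plane.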
The percentage of the contribution of the first two components is displayed in the plots shown below.
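The contribution of each variable to a component can be computed directly from the loadings: since each loading vector has unit length, the squared loading times 100 gives the percentage contribution. The sketch below does this in base R for PC1 and PC2, using loadings copied from the `prcomp` output (variable names shortened here). `fviz_contrib` from factoextra is the usual plotting route; this version only reproduces the underlying numbers.

```r
vars <- c("Agricultural.Land", "Armed.Forces.size", "Birth.Rate",
          "Forested.Area", "Infant.mortality")
# PC1 and PC2 loadings, copied from the prcomp rotation matrix.
load12 <- matrix(c(-0.2503384,  0.6328285,
                   -0.1978717,  0.4750746,
                    0.6014786,  0.3201625,
                    0.4220142, -0.3733623,
                    0.5985859,  0.3632200),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(vars, c("PC1", "PC2")))

contrib <- 100 * load12^2     # percent contribution of each variable
print(round(contrib, 1))
print(round(colSums(contrib), 1))  # each column should be close to 100
```

On PC1 the largest contributions come from Birth.Rate and Infant.mortality, while PC2 is dominated by Agricultural.Land.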
Dimension reduction refers to decreasing the number of dimensions (or features) in a dataset while retaining as much of the original information as possible. The analysis shows that three principal components explain more than 80% (87.8%) of the variance in the five variables analyzed, and that the first two components alone preserve about 71% of it. These findings highlight the effectiveness of dimension reduction techniques, which are particularly valuable for analyzing and storing large datasets efficiently. By reducing complexity without significant loss of information, these methods enable more streamlined data processing and interpretation.
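The link between "variance explained" and "information retained" can be checked by reconstructing the data from a few components. The sketch below uses the built-in `USArrests` dataset as a stand-in (it is not the paper's data), keeps `k = 2` components (an assumed value), and compares the share of variance preserved by the rank-k reconstruction with the cumulative proportion reported by PCA.

```r
# Stand-in data: built-in USArrests (4 variables), not the paper's dataset.
X <- scale(USArrests)                  # center and scale each variable
p <- prcomp(X)                         # PCA on the standardized matrix

k <- 2                                 # number of components kept (assumed)
scores <- p$x[, 1:k, drop = FALSE]
X_hat  <- scores %*% t(p$rotation[, 1:k, drop = FALSE])  # rank-k reconstruction

retained <- 1 - sum((X - X_hat)^2) / sum(X^2)   # share of variance preserved
cum_prop <- cumsum(p$sdev^2) / sum(p$sdev^2)
print(c(retained = retained, cumulative = cum_prop[k]))
```

The two printed numbers agree, which is the sense in which the cumulative proportion measures how much of the standardized data survives the reduction.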