I decided to analyse data on socio-economic and health factors that determine the overall development of the countries. I selected the dataset as a source for my analysis on Kaggle.com (https://www.kaggle.com/datasets/vipulgohel/clustering-pca-assignment). I cleaned the data and made a factor variable.
data(package = .packages(all.available = TRUE))
library(readxl)
mydata <- read_excel("~/Desktop/country_data.xlsx")
colnames(mydata) <- c("ID", "Country", "ChildMort", "Exports", "Imports", "Income", "Inflation", "GDP")
head(mydata, 15)
## # A tibble: 15 × 8
## ID Country ChildMort Exports Imports Income Inflation GDP
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Afghanistan 90.2 10 44.9 1610 9.44 553
## 2 2 Albania 16.6 28 48.6 9930 4.49 4090
## 3 3 Algeria 27.3 38.4 31.4 12900 16.1 4460
## 4 4 Angola 119 62.3 42.9 5900 22.4 3530
## 5 5 Antigua and… 10.3 45.5 58.9 19100 1.44 12200
## 6 6 Argentina 14.5 18.9 16 18700 20.9 10300
## 7 7 Armenia 18.1 20.8 45.3 6700 7.77 3220
## 8 8 Australia 4.8 19.8 20.9 41400 1.16 51900
## 9 9 Austria 4.3 51.3 47.8 43200 0.873 46900
## 10 10 Azerbaijan 39.2 54.3 20.7 16000 13.8 5840
## 11 11 Bahamas 13.8 35 43.7 22900 -0.393 28000
## 12 12 Bahrain 8.6 69.5 50.9 41100 7.44 20700
## 13 13 Bangladesh 49.4 16 21.8 2440 7.14 758
## 14 14 Barbados 14.2 39.5 48.7 15300 0.321 16000
## 15 15 Belarus 5.5 51.4 64.5 16200 15.1 6030
Definitions of all variables:
library(psych)
describe(mydata[ , c(-1, -2)])
## vars n mean sd median trimmed mad min
## ChildMort 1 167 38.27 40.33 19.30 31.59 21.94 2.60
## Exports 2 167 41.76 27.72 35.40 38.22 20.90 2.20
## Imports 3 167 46.89 24.21 43.30 44.34 21.05 0.07
## Income 4 167 17144.69 19278.07 9960.00 13807.63 11638.41 609.00
## Inflation 5 167 7.16 7.48 5.14 6.15 5.66 -4.21
## GDP 6 167 12964.16 18328.70 4660.00 9146.43 5814.76 231.00
## max range skew kurtosis se
## ChildMort 208.0 205.40 1.42 1.62 3.12
## Exports 200.0 197.80 2.36 9.05 2.15
## Imports 174.0 173.93 1.87 6.41 1.87
## Income 125000.0 124391.00 2.19 6.67 1491.78
## Inflation 45.9 50.11 1.78 4.91 0.58
## GDP 105000.0 104769.00 2.18 5.23 1418.32
Exports have the largest positive skew therefore it is skewed to right. The average annual income in observed countries is 17.144,69 $. The lowest number of imports in the sample data was 0.07% of total GDP.
To make the PCA, I have to make some checks:
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
fit <- lm (GDP ~ ChildMort + Exports + Imports + Income + Inflation,
data = mydata)
vif(fit)
## ChildMort Exports Imports Income Inflation
## 1.418941 2.889246 2.340056 1.903555 1.159029
mean(vif(fit))
## [1] 1.942166
With VIF statistics I checked for multicoliniarity. Even if it is below 5 for all, the mean is quite high and shows high correlation between variables. I decided to join them with PCA.
mydata_PCA <- mydata[ , c("ChildMort", "Exports", "Imports", "Income")]
R <- cor(mydata_PCA)
round(R, 2)
## ChildMort Exports Imports Income
## ChildMort 1.00 -0.30 -0.13 -0.52
## Exports -0.30 1.00 0.68 0.49
## Imports -0.13 0.68 1.00 0.12
## Income -0.52 0.49 0.12 1.00
library(psych)
corPlot(R)
The table and the plot both show correlation between variables. From them I can see that only correlations between Imports and child mortality and income are below 0.3 (the higher the better). But because it has a high correlation with Exports, I will leave it in the dataset for further analysis.
mydata_PCA <- mydata[ , c("ChildMort", "Exports", "Imports", "Income")]
cortest.bartlett(R, n = nrow(mydata_PCA))
## $chisq
## [1] 222.6528
##
## $p.value
## [1] 2.828364e-45
##
## $df
## [1] 6
Based on the sample data, I can reject H0 at p-value < 0.001, which means that population correlation matrix and identity matrix are not the same. In other words, at least one of the correlation coefficients differs from 0. This means that we can do PCA.
det(R)
## [1] 0.2569125
This is checked just for FA, not for PCA.
R <- cor(mydata_PCA)
KMO(R)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = R)
## Overall MSA = 0.51
## MSA for each item =
## ChildMort Exports Imports Income
## 0.65 0.51 0.44 0.49
MSA is above 0.5 so the requirement is met. Imports have the least common information with others and cause a lower MSA, because it is the least correlated with other variables. In other words, if we join all five variables in one or two PCs, the most information will be lost for Imports. For Imports and Income it is not above 0.5 which means that if I join all five in one or two PCs, the most information will be lost for Imports and Income. Based on this, data is not the most appropriate for PCA but I will still continue with the analysis.
#install.packages("FactoMineR")
library(FactoMineR)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
components <- PCA(mydata_PCA,
scale.unit = TRUE,
graph = FALSE)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
get_eigenvalue(components)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 2.1528976 53.822440 53.82244
## Dim.2 1.1291555 28.228887 82.05133
## Dim.3 0.5112175 12.780437 94.83176
## Dim.4 0.2067294 5.168236 100.00000
This is a PCA on standardized variables. Eigenvalues show the amount of variability explained by each PC and indicate the proportion of the total variance in the dataset that is determined by that PC. If we sum all eigenvalues we get 4 (same as the number of variables), because they are standardised and their variance is 1, therefore all together represent total information.
Explanations:
fviz_eig(components,
choice = "eigenvalue",
main = "Scree plot",
ylab = "Eigenvalue",
xlab = "Principal component",
addlabels = TRUE)
library(psych)
fa.parallel(mydata_PCA,
sim = FALSE,
fa = "pc")
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs =
## np.obs, : The estimated weights for the factor scores are probably
## incorrect. Try a different factor score estimation method.
## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate =
## rotate, : An ultra-Heywood case was detected. Examine the results
## carefully
## Parallel analysis suggests that the number of factors = NA and the number of components = 2
Based on Parallel analysis and Scree plot above I can see that I should include 2 PCs. From Scree plot, I should look at where is the most evident break and then subtract 1 (most evident break is at third component). In Parallel analysis you should take as many components as long as the empirical eigenvalues are higher than theoretical ones. From the eigenvalues above I can also see that I should include 2 PCs, because based on Kaiser’s rule, you should make as many PCs as there is eigenvalues above 1 (Dim1 and Dim2).
Now I will perform a PCA again and include first two PCs.
components <- PCA(mydata_PCA,
scale.unit = TRUE,
graph = FALSE,
ncp = 2)
components
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 167 individuals, described by 4 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"
print(components$var$cor)
## Dim.1 Dim.2
## ChildMort -0.6347827 0.5808986
## Exports 0.8750028 0.3223674
## Imports 0.6666630 0.6694831
## Income 0.7347646 -0.4894731
This is a matrix of rescaled coefficients.
Explanations:
print(components$var$contrib)
## Dim.1 Dim.2
## ChildMort 18.71659 29.884567
## Exports 35.56277 9.203402
## Imports 20.64378 39.694055
## Income 25.07686 21.217975
This matrix shows how different variables contributed to PCs.
Explanations:
library(factoextra)
fviz_pca_var(components,
repel = TRUE)
fviz_pca_biplot(components)
Above we can see that rescaled coefficient for Child mortality and PC1 is negative, which indicates that a lower Child mortality has a positive effect on PC1 and makes it higher. On the other hand, if Income, Exports and Imports get higher, the PC1 also gets higher. Based on that I would say that PC1 measures general development of a country. PC2 on the other hand measures contrast between trade (Import and Exports) and social welfare (Income and Child mortality).
For example, country ID 132 (Seychelles) has above average in general development and has more developed trade. Country ID 67 (Haiti) is below average in development and has also more developed trade. Country ID 78 (Japan) is below average in development and has more developed social welfare.