I decided to analyse data on socio-economic and health factors that determine the overall development of the countries. I selected the dataset as a source for my analysis on Kaggle.com (https://www.kaggle.com/datasets/vipulgohel/clustering-pca-assignment). I cleaned the data and made a factor variable.
data(package = .packages(all.available = TRUE))
library(readxl)
mydata <- read_excel("~/Desktop/country_data.xlsx")
colnames(mydata) <- c("ID", "Country", "ChildMort", "Exports", "Imports", "Income", "Inflation", "GDP")
head(mydata, 15)
## # A tibble: 15 × 8
## ID Country ChildMort Exports Imports Income Inflation GDP
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Afghanistan 90.2 10 44.9 1610 9.44 553
## 2 2 Albania 16.6 28 48.6 9930 4.49 4090
## 3 3 Algeria 27.3 38.4 31.4 12900 16.1 4460
## 4 4 Angola 119 62.3 42.9 5900 22.4 3530
## 5 5 Antigua and… 10.3 45.5 58.9 19100 1.44 12200
## 6 6 Argentina 14.5 18.9 16 18700 20.9 10300
## 7 7 Armenia 18.1 20.8 45.3 6700 7.77 3220
## 8 8 Australia 4.8 19.8 20.9 41400 1.16 51900
## 9 9 Austria 4.3 51.3 47.8 43200 0.873 46900
## 10 10 Azerbaijan 39.2 54.3 20.7 16000 13.8 5840
## 11 11 Bahamas 13.8 35 43.7 22900 -0.393 28000
## 12 12 Bahrain 8.6 69.5 50.9 41100 7.44 20700
## 13 13 Bangladesh 49.4 16 21.8 2440 7.14 758
## 14 14 Barbados 14.2 39.5 48.7 15300 0.321 16000
## 15 15 Belarus 5.5 51.4 64.5 16200 15.1 6030
Definitions of all variables:
library(psych)
describe(mydata[ , c(-1, -2)])
## vars n mean sd median trimmed mad min
## ChildMort 1 167 38.27 40.33 19.30 31.59 21.94 2.60
## Exports 2 167 41.76 27.72 35.40 38.22 20.90 2.20
## Imports 3 167 46.89 24.21 43.30 44.34 21.05 0.07
## Income 4 167 17144.69 19278.07 9960.00 13807.63 11638.41 609.00
## Inflation 5 167 7.16 7.48 5.14 6.15 5.66 -4.21
## GDP 6 167 12964.16 18328.70 4660.00 9146.43 5814.76 231.00
## max range skew kurtosis se
## ChildMort 208.0 205.40 1.42 1.62 3.12
## Exports 200.0 197.80 2.36 9.05 2.15
## Imports 174.0 173.93 1.87 6.41 1.87
## Income 125000.0 124391.00 2.19 6.67 1491.78
## Inflation 45.9 50.11 1.78 4.91 0.58
## GDP 105000.0 104769.00 2.18 5.23 1418.32
Children mortality has the largest positive skew therefore it is skewed to right, while life expectancy has the largest negative skew and is skewed to left. The average annual income in observed countries is 17.144,69 $. In the observed sample data, 50% of people live by the age of 73.1 years, while 50% live longer.
To make the PCA, I have to make some checks:
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
fit <- lm (GDP ~ ChildMort + Exports + Imports + Income + Inflation,
data = mydata)
vif(fit)
## ChildMort Exports Imports Income Inflation
## 1.418941 2.889246 2.340056 1.903555 1.159029
mean(vif(fit))
## [1] 1.942166
With VIF statistics I checked for multicoliniarity. Even if it is below 5 for all, the mean is very high and shows higher correlation between variables. I decided to join them with PCA. I joined the ones that show too high multicoliniarity.
mydata_PCA <- mydata[ , c("ChildMort", "Exports", "Imports", "Income")]
R <- cor(mydata_PCA)
round(R, 2)
## ChildMort Exports Imports Income
## ChildMort 1.00 -0.30 -0.13 -0.52
## Exports -0.30 1.00 0.68 0.49
## Imports -0.13 0.68 1.00 0.12
## Income -0.52 0.49 0.12 1.00
library(psych)
corPlot(R)
The table and the plot both show correlation between variables. From them I can see that only correlations between Imports and child mortality and income are below 0.3 (the higher the better). But because it has a high correlation with Exports, I will leave it in the dataset for further analysis.
mydata_PCA <- mydata[ , c("ChildMort", "Exports", "Imports", "Income")]
cortest.bartlett(R, n = nrow(mydata_PCA))
## $chisq
## [1] 222.6528
##
## $p.value
## [1] 2.828364e-45
##
## $df
## [1] 6
Based on the sample data, I can reject H0 at p-value < 0.001, which means that population correlation matrix and identity matrix are not the same. In other words, at least one of the correlation coefficients differs from 0. This means that data is suitable and we can do PCA.
R <- cor(mydata_PCA)
KMO(R)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = R)
## Overall MSA = 0.51
## MSA for each item =
## ChildMort Exports Imports Income
## 0.65 0.51 0.44 0.49
Imports have the least common information with others. In other words, if we join all five variables in one PC, the most information will be lost for Imports. For Imports and Income it is not above 0.5 which means that if I join all five in one PC, the most information will be lost for Imports and Income. Total MSA is also higher than 0.5.
#install.packages("FactoMineR")
library(FactoMineR)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
components <- PCA(mydata_PCA,
scale.unit = TRUE,
graph = FALSE)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
get_eigenvalue(components)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 2.1528976 53.822440 53.82244
## Dim.2 1.1291555 28.228887 82.05133
## Dim.3 0.5112175 12.780437 94.83176
## Dim.4 0.2067294 5.168236 100.00000
This is a PCA on standardized variables. Eigenvalues show the amount of variability explained by each PC and indicate the proportion of the total variance in the dataset that is determined by that PC.
Explanations:
fviz_eig(components,
choice = "eigenvalue",
main = "Scree plot",
ylab = "Eigenvalue",
xlab = "Principal component",
addlabels = TRUE)
library(psych)
fa.parallel(mydata_PCA,
sim = FALSE,
fa = "pc")
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs =
## np.obs, : The estimated weights for the factor scores are probably
## incorrect. Try a different factor score estimation method.
## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate =
## rotate, : An ultra-Heywood case was detected. Examine the results
## carefully
## Parallel analysis suggests that the number of factors = NA and the number of components = 2
Based on Parallel analysis and Scree plot above I can see that I should include 2 PCs. From the eigenvalues above I can also see that I should include 2 PCs, because based on Kaiser’s rule, you should make as many PCs as there is eigenvalues above 1 (Dim1 and Dim2).
Now I will perform a PCA again and include first two PCs.
components <- PCA(mydata_PCA,
scale.unit = TRUE,
graph = FALSE,
ncp = 2)
components
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 167 individuals, described by 4 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"
print(components$var$cor)
## Dim.1 Dim.2
## ChildMort -0.6347827 0.5808986
## Exports 0.8750028 0.3223674
## Imports 0.6666630 0.6694831
## Income 0.7347646 -0.4894731
This is a matrix of rescaled coefficients.
Explanations:
print(components$var$contrib)
## Dim.1 Dim.2
## ChildMort 18.71659 29.884567
## Exports 35.56277 9.203402
## Imports 20.64378 39.694055
## Income 25.07686 21.217975
This matrix shows how different variables contributed to PCs.
Explanations:
library(factoextra)
fviz_pca_var(components,
repel = TRUE)
fviz_pca_biplot(components)
Above we can see how annual GDP growth can be explained by child mortality, imports, exports and income. Income, exports and imports have a positive effect on GDP growth, while child mortality has a negative effect. On the other hand, the higher the income, the lower the child mortality.
For example, country ID 132 has below average child mortality rates and above average income, imports and exports quantities. It has performs better on trade data than on income.