Author: Jethro Muwanguzi
Title: Homework Assignment 2 - MVA
Date: 2025-27-01
. .
This assignment applies Principal Component Analysis to examine socioeconomic indicators across countries. The goal is to simplify the data while uncovering the main factors that explain differences in economic and social development among nations.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
library(knitr)
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(readr))
file_path <- "/Users/jethro/Downloads/archive-3/Country-data.csv"
countrydata <- read_csv(file_path, show_col_types = FALSE)
kable(head(countrydata, 6), format = "html", caption = "First Six Rows of the Dataset") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"), position = "center") %>%
row_spec(0, bold = TRUE, background = "#4682B4", color = "white") %>%
row_spec(1:6, background = "#f0f8ff")
| country | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp |
|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | 90.2 | 10.0 | 7.58 | 44.9 | 1610 | 9.44 | 56.2 | 5.82 | 553 |
| Albania | 16.6 | 28.0 | 6.55 | 48.6 | 9930 | 4.49 | 76.3 | 1.65 | 4090 |
| Algeria | 27.3 | 38.4 | 4.17 | 31.4 | 12900 | 16.10 | 76.5 | 2.89 | 4460 |
| Angola | 119.0 | 62.3 | 2.85 | 42.9 | 5900 | 22.40 | 60.1 | 6.16 | 3530 |
| Antigua and Barbuda | 10.3 | 45.5 | 6.03 | 58.9 | 19100 | 1.44 | 76.8 | 2.13 | 12200 |
| Argentina | 14.5 | 18.9 | 8.10 | 16.0 | 18700 | 20.90 | 75.8 | 2.37 | 10300 |
Unit of Observation: A single country
Sample Size: 167 rows (representing 167 countries)
Variables:
Source: Kaggle (https://www.kaggle.com/datasets/vipulgohel/clustering-pca-assignment)
. .
library(dplyr)
library(psych)
library(knitr)
library(kableExtra)
describe_table <- psych::describe(countrydata[, c("child_mort", "exports", "health", "imports", "income", "inflation", "life_expec", "total_fer", "gdpp")])
kable(as.data.frame(describe_table), caption = "Descriptive Statistics") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"),
position = "left", font_size = 10) %>%
row_spec(0, bold = TRUE, background = "lightblue") %>%
column_spec(1, bold = TRUE)
| vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| child_mort | 1 | 167 | 38.270060 | 40.328932 | 19.30 | 31.588889 | 21.942480 | 2.6000 | 2.08e+02 | 205.4000 | 1.4248158 | 1.6228439 | 3.1207464 |
| exports | 2 | 167 | 41.108976 | 27.412010 | 35.00 | 37.799259 | 21.201180 | 0.1090 | 2.00e+02 | 199.8910 | 2.4020624 | 9.6490756 | 2.1212050 |
| health | 3 | 167 | 6.815689 | 2.746837 | 6.32 | 6.663333 | 2.639028 | 1.8100 | 1.79e+01 | 16.0900 | 0.6931186 | 0.5944336 | 0.2125567 |
| imports | 4 | 167 | 46.890215 | 24.209589 | 43.30 | 44.337037 | 21.052920 | 0.0659 | 1.74e+02 | 173.9341 | 1.8711858 | 6.4058921 | 1.8733942 |
| income | 5 | 167 | 17144.688623 | 19278.067698 | 9960.00 | 13807.629630 | 11638.410000 | 609.0000 | 1.25e+05 | 124391.0000 | 2.1915532 | 6.6674347 | 1491.7816663 |
| inflation | 6 | 167 | 7.781832 | 10.570704 | 5.39 | 6.265133 | 5.722836 | -4.2100 | 1.04e+02 | 108.2100 | 5.0618313 | 39.9484352 | 0.8179856 |
| life_expec | 7 | 167 | 70.555689 | 8.893172 | 73.10 | 71.327407 | 8.895600 | 32.1000 | 8.28e+01 | 50.7000 | -0.9536222 | 1.0329499 | 0.6881743 |
| total_fer | 8 | 167 | 2.947964 | 1.513848 | 2.41 | 2.768741 | 1.215732 | 1.1500 | 7.49e+00 | 6.3400 | 0.9497881 | -0.2501780 | 0.1171450 |
| gdpp | 9 | 167 | 12964.155689 | 18328.704809 | 4660.00 | 9146.429630 | 5814.757200 | 231.0000 | 1.05e+05 | 104769.0000 | 2.1783653 | 5.2286139 | 1418.3177603 |
SOME OBSERVATIONS
Income: The income variable has a mean of 17,144.69 and ranges from 609 to 125,000, indicating significant economic disparities globally. The median income of 4,660 highlights that most countries fall below the average, skewed by a few very wealthy nations.
Inflation: Inflation shows a mean of 7.78%, with values ranging from extreme deflation at -42.1% to hyperinflation at 104.0%. The median value of 5.39% indicates that most countries experience moderate inflation, while extreme values pull the average upward.
child_mort: Child mortality has a mean of 38.27 deaths per 1,000 live births, with a minimum of 2.60 and a maximum of 200.00. This indicates stark contrasts in child health outcomes between developed and developing nations.
life_expec: Life expectancy has a mean of 70.55 years, ranging from 32.10 to 82.80. The median value of 71.33 reflects relatively high global life expectancy, though some countries still face significantly lower lifespans.
gdpp: GDP per capita has a mean of 12,964, with values ranging from 231 to 105,000. The median of 4,660 highlights the substantial economic inequality, with most countries earning far below the average due to the influence of wealthy nations.
. .
Before we perform the PCA, we check the correlations, carry out the KMO-MSA tests and the Bartlett’s test of sphericity
library(dplyr)
library(knitr)
library(DT)
countrydata_PCA <- countrydata[, c("child_mort", "exports", "health", "imports", "income", "inflation", "life_expec", "total_fer", "gdpp")]
R <- cor(countrydata_PCA)
R_rounded <- round(R, 3)
kable(R_rounded, format = "html", table.attr = 'style="background-color: #eaf3fc; border: 1px solid #b3cde3;"', caption = "Correlation Matrix (Rounded)") %>%
kableExtra::kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
| child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp | |
|---|---|---|---|---|---|---|---|---|---|
| child_mort | 1.000 | -0.318 | -0.200 | -0.127 | -0.524 | 0.288 | -0.887 | 0.848 | -0.483 |
| exports | -0.318 | 1.000 | -0.114 | 0.737 | 0.517 | -0.107 | 0.316 | -0.320 | 0.419 |
| health | -0.200 | -0.114 | 1.000 | 0.096 | 0.130 | -0.255 | 0.211 | -0.197 | 0.346 |
| imports | -0.127 | 0.737 | 0.096 | 1.000 | 0.122 | -0.247 | 0.054 | -0.159 | 0.115 |
| income | -0.524 | 0.517 | 0.130 | 0.122 | 1.000 | -0.148 | 0.612 | -0.502 | 0.896 |
| inflation | 0.288 | -0.107 | -0.255 | -0.247 | -0.148 | 1.000 | -0.240 | 0.317 | -0.222 |
| life_expec | -0.887 | 0.316 | 0.211 | 0.054 | 0.612 | -0.240 | 1.000 | -0.761 | 0.600 |
| total_fer | 0.848 | -0.320 | -0.197 | -0.159 | -0.502 | 0.317 | -0.761 | 1.000 | -0.455 |
| gdpp | -0.483 | 0.419 | 0.346 | 0.115 | 0.896 | -0.222 | 0.600 | -0.455 | 1.000 |
For a more balanced analysis of the relationships among the variables, I use the correlation matrix to ensure that the PCA results are not skewed by variables with larger scales or variances.
library(psych)
corPlot(R)
The correlation plot above, as a visual representation of the correlation matrix. Below are some key observations;
#we proceed to carry out the Bartlett's Test of sphericity
library(psych)
bartlett_test <- cortest.bartlett(R, n = nrow(countrydata))
bartlett_test
## $chisq
## [1] 1169.737
##
## $p.value
## [1] 3.136862e-222
##
## $df
## [1] 36
We reject the null hypothesis (H₀) at p-value < 0.001
determinant_value <- det(R)
determinant_value
## [1] 0.0007368109
#Below we carry out the KMO-MSA tests, looking for values above 0.5
kmo_result <- KMO(R)
kmo_result
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = R)
## Overall MSA = 0.68
## MSA for each item =
## child_mort exports health imports income inflation life_expec
## 0.73 0.57 0.36 0.41 0.69 0.74 0.80
## total_fer gdpp
## 0.86 0.66
From the KMO-MSA results above, most variables have the preferable values above 0.5 however, ‘health’ and ‘imports’ are below. Retaining them might distort the results we get. I then proceed to exclude them from the further analysis.
# Excluding 'country' and low-KMO variables ('health' and 'imports')
filtered_data <- countrydata[, !(colnames(countrydata) %in% c("country", "health", "imports"))]
print(colnames(filtered_data))
## [1] "child_mort" "exports" "income" "inflation" "life_expec"
## [6] "total_fer" "gdpp"
# I proceed to recompute the correlation matrix, run the Bartlett's Test and KMO Test"
R_filtered <- cor(filtered_data)
bartlett_test_filtered <- cortest.bartlett(R_filtered, n = nrow(filtered_data))
bartlett_test_filtered
## $chisq
## [1] 890.692
##
## $p.value
## [1] 5.08529e-175
##
## $df
## [1] 21
kmo_result_filtered <- KMO(R_filtered)
kmo_result_filtered
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = R_filtered)
## Overall MSA = 0.76
## MSA for each item =
## child_mort exports income inflation life_expec total_fer gdpp
## 0.73 0.85 0.70 0.73 0.80 0.85 0.69
Observations
H₀: P = I
H₁: P ≠ I
We reject the null hypothesis (H₀) at p-value < 0.001, this confirms that the variables are correlated enough to perform PCA.
KMO-MSA Tests
I then proceed with carrying out the PCA on the filtered dataset that has passed all tests for further analysis.
R <- cor(countrydata[, c("child_mort", "exports", "income", "inflation", "life_expec", "total_fer", "gdpp")])
library(FactoMineR)
library(factoextra)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(knitr)
library(kableExtra)
library(DT)
countrydata_PCA <- countrydata[, c("child_mort", "exports", "income", "inflation", "life_expec", "total_fer", "gdpp")]
components <- PCA(countrydata_PCA,
scale.unit = TRUE,
graph = FALSE)
eigenvalues <- get_eigenvalue(components)
kable(eigenvalues, format = "html",
col.names = c("Eigenvalue", "Variance (%)", "Cumulative Variance (%)"),
caption = "Eigenvalues and Variance Explained by PCA Components") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"),
position = "center") %>%
row_spec(0, bold = TRUE, background = "#cfe8f6") %>%
row_spec(1:nrow(eigenvalues), background = "#eaf3fc")
| Eigenvalue | Variance (%) | Cumulative Variance (%) | |
|---|---|---|---|
| Dim.1 | 3.9819296 | 56.884708 | 56.88471 |
| Dim.2 | 1.1163086 | 15.947266 | 72.83197 |
| Dim.3 | 0.8507101 | 12.153002 | 84.98498 |
| Dim.4 | 0.6522514 | 9.317877 | 94.30285 |
| Dim.5 | 0.2233174 | 3.190249 | 97.49310 |
| Dim.6 | 0.0949972 | 1.357103 | 98.85021 |
| Dim.7 | 0.0804856 | 1.149794 | 100.00000 |
When working with technical data, we look for the cumulative variance from the first two components to be above 70%. In the case of the results above, this results into 72.83% which gives us confidence to proceed with PCA.
# I proceed to draw the screeplot to judge how many components to be retained based on the 'elbow' rule.
library(factoextra)
fviz_eig(components,
choice = "eigenvalue",
main = "Screeplot",
ylab = "Eigenvalue",
xlab = "Principal Component",
addlabels = TRUE)
From personal observation of the scree plot, we observe a sharp decline at PC 2, at this point we consider one less (2-1) and would therefore retain 1 component.
# We also carry out the parallel analysis to assess how many components we retain
library(psych)
fa.parallel(countrydata_PCA,
sim = FALSE,
fa = "pc")
## Parallel analysis suggests that the number of factors = NA and the number of components = 1
Although the parallel analysis scree plot above suggests retaining one component, keeping two components is preferable, ensuring a more comprehensive understanding of the data by capturing 72.8% of the total variance which also exceeds the threshold for technical data (70%). So for further analysis, we shall retain 2 components.
library(FactoMineR)
components <- PCA(countrydata_PCA,
ncp = 2,
scale.unit = TRUE,
graph = FALSE)
components
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 167 individuals, described by 7 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"
# Correlation of variables with components
library(kableExtra)
cor_matrix <- components$var$cor
kable(cor_matrix, format = "html", digits = 3, caption = "Correlation of Variables with Components") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F) %>%
row_spec(0, bold = T, color = "white", background = "#4682B4") %>%
row_spec(1:nrow(cor_matrix), background = "#f0f8ff")
| Dim.1 | Dim.2 | |
|---|---|---|
| child_mort | -0.868 | 0.347 |
| exports | 0.544 | 0.490 |
| income | 0.827 | 0.447 |
| inflation | -0.364 | 0.477 |
| life_expec | 0.888 | -0.208 |
| total_fer | -0.831 | 0.355 |
| gdpp | 0.798 | 0.398 |
From the results above, we see that the for PC1, the variables of life-expec, income and gdpp have strong positive correlations with PC1. I identify this as the prosperity and economic well-being axis
PC2 captures a strong correlation with exports, inflation and income, representing the trade axis
# Contributions of variables to components
library(kableExtra)
contrib_matrix <- components$var$contrib
kable(contrib_matrix, format = "html", digits = 3, caption = "Contributions of Variables to Components (%)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F) %>%
row_spec(0, bold = T, color = "white", background = "#4682B4") %>%
row_spec(1:nrow(contrib_matrix), background = "#f0f8ff")
| Dim.1 | Dim.2 | |
|---|---|---|
| child_mort | 18.936 | 10.784 |
| exports | 7.443 | 21.527 |
| income | 17.166 | 17.938 |
| inflation | 3.319 | 20.369 |
| life_expec | 19.784 | 3.881 |
| total_fer | 17.363 | 11.292 |
| gdpp | 15.988 | 14.207 |
The prosperity and economic well-being axis is defined primarily by life_expec, child_mort, total_fer & income and PC2 as the trade axis is defined heavily by the exports and inflation
#Next, we visualize the variables and their influcence over the PCs
library(factoextra)
fviz_pca_var(components,
repel = TRUE,
col.var = "contrib",
gradient.cols = c("#f0f8ff", "#4682B4", "#000080"),
ggtheme = theme_minimal())
PC1 (Horizontal Axis):
PC1 represents a development gradient, with regions characterized by higher income, GDP per capita, and life expectancy at one end, and high child mortality and fertility rates at the other.
PC2 (Vertical Axis):
Key to note is that child_mort and total_fer vs. income, gdpp, and life_expec are positioned in opposite directions, reinforcing their inverse relationships.
# Next, we visualise individual data points of countries to see how they cluster along these components.
library(factoextra)
fviz_pca_biplot(components,
repel = TRUE,
col.var = "contrib",
gradient.cols = c("#f0f8ff", "#4682B4", "#000080"),
col.ind = "#696969",
ggtheme = theme_minimal())
The biplot shows a wide distribution of countries, reflecting varying levels of socioeconomic development. While many countries cluster near the center, indicating average characteristics, others are positioned at the extremes, representing distinct profiles.
This biplot simplifies complex data into actionable insights, revealing relationships between variables and clustering similar observations
PC1 and PC2 effectively separate observations based on prosperity metrics and trade factors. Countries with high prosperity are concentrated on the positive side of PC1. Trade-influenced countries align more with PC2.
Clusters and outliers highlight distinct groups of countries with shared characteristics or unique traits.
#For further Analysis, I explored individual points in extreme positions (92, 114, 133) for profiling.
library(kableExtra)
extreme_points <- countrydata[rownames(countrydata) %in% c("92", "114", "133"), ]
extreme_points %>%
kbl(caption = "Extreme Points (Countries)") %>%
kable_styling(full_width = FALSE,
bootstrap_options = c("striped", "hover", "condensed"),
position = "center") %>%
row_spec(0, background = "#f0f8ff", bold = TRUE) %>%
row_spec(1:nrow(extreme_points), background = "#e6f2ff")
| country | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp |
|---|---|---|---|---|---|---|---|---|---|
| Luxembourg | 2.8 | 175.0 | 7.77 | 142.0 | 91700 | 3.62 | 81.3 | 1.63 | 105000 |
| Nigeria | 130.0 | 25.3 | 5.07 | 17.4 | 5150 | 104.00 | 60.5 | 5.84 | 2330 |
| Sierra Leone | 160.0 | 16.8 | 13.10 | 34.5 | 1220 | 17.20 | 55.0 | 5.20 | 399 |
PC 1 differentiates countries based on economic and health characteristics, with prosperous nations like Luxembourg at one end and struggling nations like Sierra Leone at the other.
PC 2 emphasizes trade and inflation dynamics, as seen with Nigeria, where high inflation significantly influences its profile.