Author: Jethro Muwanguzi

Title: Homework Assignment 2 - MVA

Date: 2025-27-01

. .

INTRODUCTION

This assignment applies Principal Component Analysis to examine socioeconomic indicators across countries. The goal is to simplify the data while uncovering the main factors that explain differences in economic and social development among nations.

1 Data Import - Country Classification

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
library(knitr)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(readr))

file_path <- "/Users/jethro/Downloads/archive-3/Country-data.csv"

countrydata <- read_csv(file_path, show_col_types = FALSE)


kable(head(countrydata, 6), format = "html", caption = "First Six Rows of the Dataset") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"), position = "center") %>%
  row_spec(0, bold = TRUE, background = "#4682B4", color = "white") %>% 
  row_spec(1:6, background = "#f0f8ff")                               
First Six Rows of the Dataset
country child_mort exports health imports income inflation life_expec total_fer gdpp
Afghanistan 90.2 10.0 7.58 44.9 1610 9.44 56.2 5.82 553
Albania 16.6 28.0 6.55 48.6 9930 4.49 76.3 1.65 4090
Algeria 27.3 38.4 4.17 31.4 12900 16.10 76.5 2.89 4460
Angola 119.0 62.3 2.85 42.9 5900 22.40 60.1 6.16 3530
Antigua and Barbuda 10.3 45.5 6.03 58.9 19100 1.44 76.8 2.13 12200
Argentina 14.5 18.9 8.10 16.0 18700 20.90 75.8 2.37 10300

2 Data Description

Variables:

  1. country: Names of countries
  1. child_mort: Death of children under 5 years of age per 1,000 live births
  2. exports: Exports of goods and services as a percentage of the total GDP.
  3. health: Total health spending as a percentage of the total GDP
  4. imports: Imports of goods and services as a percentage of the total GDP
  5. Income: Net income per person
  6. Inflation: Annual growth rate of the total GDP
  7. life_expec: Average number of years a newborn child would live if current mortality patterns remain unchanged
  8. total_fer: Number of children that would be born to each woman if current age-fertility rates remain constant
  9. gdpp: GDP per capita, calculated as the total GDP divided by the total population

Source: Kaggle (https://www.kaggle.com/datasets/vipulgohel/clustering-pca-assignment)

. .

3 Descriptive Statistics

library(dplyr)
library(psych)
library(knitr)
library(kableExtra)

describe_table <- psych::describe(countrydata[, c("child_mort", "exports", "health", "imports", "income", "inflation", "life_expec", "total_fer", "gdpp")])
                                              
kable(as.data.frame(describe_table), caption = "Descriptive Statistics") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"), 
                position = "left", font_size = 10) %>%
     row_spec(0, bold = TRUE, background = "lightblue") %>%
       column_spec(1, bold = TRUE)
Descriptive Statistics
vars n mean sd median trimmed mad min max range skew kurtosis se
child_mort 1 167 38.270060 40.328932 19.30 31.588889 21.942480 2.6000 2.08e+02 205.4000 1.4248158 1.6228439 3.1207464
exports 2 167 41.108976 27.412010 35.00 37.799259 21.201180 0.1090 2.00e+02 199.8910 2.4020624 9.6490756 2.1212050
health 3 167 6.815689 2.746837 6.32 6.663333 2.639028 1.8100 1.79e+01 16.0900 0.6931186 0.5944336 0.2125567
imports 4 167 46.890215 24.209589 43.30 44.337037 21.052920 0.0659 1.74e+02 173.9341 1.8711858 6.4058921 1.8733942
income 5 167 17144.688623 19278.067698 9960.00 13807.629630 11638.410000 609.0000 1.25e+05 124391.0000 2.1915532 6.6674347 1491.7816663
inflation 6 167 7.781832 10.570704 5.39 6.265133 5.722836 -4.2100 1.04e+02 108.2100 5.0618313 39.9484352 0.8179856
life_expec 7 167 70.555689 8.893172 73.10 71.327407 8.895600 32.1000 8.28e+01 50.7000 -0.9536222 1.0329499 0.6881743
total_fer 8 167 2.947964 1.513848 2.41 2.768741 1.215732 1.1500 7.49e+00 6.3400 0.9497881 -0.2501780 0.1171450
gdpp 9 167 12964.155689 18328.704809 4660.00 9146.429630 5814.757200 231.0000 1.05e+05 104769.0000 2.1783653 5.2286139 1418.3177603

SOME OBSERVATIONS

  1. Income: The income variable has a mean of 17,144.69 and ranges from 609 to 125,000, indicating significant economic disparities globally. The median income of 4,660 highlights that most countries fall below the average, skewed by a few very wealthy nations.

  2. Inflation: Inflation shows a mean of 7.78%, with values ranging from extreme deflation at -42.1% to hyperinflation at 104.0%. The median value of 5.39% indicates that most countries experience moderate inflation, while extreme values pull the average upward.

  3. child_mort: Child mortality has a mean of 38.27 deaths per 1,000 live births, with a minimum of 2.60 and a maximum of 200.00. This indicates stark contrasts in child health outcomes between developed and developing nations.

  4. life_expec: Life expectancy has a mean of 70.55 years, ranging from 32.10 to 82.80. The median value of 71.33 reflects relatively high global life expectancy, though some countries still face significantly lower lifespans.

  5. gdpp: GDP per capita has a mean of 12,964, with values ranging from 231 to 105,000. The median of 4,660 highlights the substantial economic inequality, with most countries earning far below the average due to the influence of wealthy nations.

. .

4.1 RESEARCH QUESTION:

How can we use principal component analysis to identify key dimensions that summarize the economic and demographic characteristics of countries, and what relationships exist among these dimensions?

Before we perform the PCA, we check the correlations, carry out the KMO-MSA tests and the Bartlett’s test of sphericity

library(dplyr)
library(knitr)
library(DT)

countrydata_PCA <- countrydata[, c("child_mort", "exports", "health", "imports", "income", "inflation", "life_expec", "total_fer", "gdpp")]

R <- cor(countrydata_PCA)
R_rounded <- round(R, 3)

kable(R_rounded, format = "html", table.attr = 'style="background-color: #eaf3fc; border: 1px solid #b3cde3;"', caption = "Correlation Matrix (Rounded)") %>%
  kableExtra::kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
Correlation Matrix (Rounded)
child_mort exports health imports income inflation life_expec total_fer gdpp
child_mort 1.000 -0.318 -0.200 -0.127 -0.524 0.288 -0.887 0.848 -0.483
exports -0.318 1.000 -0.114 0.737 0.517 -0.107 0.316 -0.320 0.419
health -0.200 -0.114 1.000 0.096 0.130 -0.255 0.211 -0.197 0.346
imports -0.127 0.737 0.096 1.000 0.122 -0.247 0.054 -0.159 0.115
income -0.524 0.517 0.130 0.122 1.000 -0.148 0.612 -0.502 0.896
inflation 0.288 -0.107 -0.255 -0.247 -0.148 1.000 -0.240 0.317 -0.222
life_expec -0.887 0.316 0.211 0.054 0.612 -0.240 1.000 -0.761 0.600
total_fer 0.848 -0.320 -0.197 -0.159 -0.502 0.317 -0.761 1.000 -0.455
gdpp -0.483 0.419 0.346 0.115 0.896 -0.222 0.600 -0.455 1.000

For a more balanced analysis of the relationships among the variables, I use the correlation matrix to ensure that the PCA results are not skewed by variables with larger scales or variances.

library(psych)
corPlot(R)

The correlation plot above, as a visual representation of the correlation matrix. Below are some key observations;

  • There is a strong positive correlation between child_mort and total_fer at 0.85, indicating a close relation.
  • There is a strong negative correlation, between life_expec and child_mort at -0.89, suggesting these variables move in opposite directions.
  • Variables like health have weak correlations with most others, making them less impactful in PCA.
#we proceed to carry out the Bartlett's Test of sphericity
library(psych)
bartlett_test <- cortest.bartlett(R, n = nrow(countrydata))
bartlett_test
## $chisq
## [1] 1169.737
## 
## $p.value
## [1] 3.136862e-222
## 
## $df
## [1] 36
  • H₀: P = I
  • H₁: P ≠ I

We reject the null hypothesis (H₀) at p-value < 0.001

determinant_value <- det(R)
determinant_value
## [1] 0.0007368109
#Below we carry out the KMO-MSA tests, looking for values above 0.5 
kmo_result <- KMO(R)
kmo_result
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = R)
## Overall MSA =  0.68
## MSA for each item = 
## child_mort    exports     health    imports     income  inflation life_expec 
##       0.73       0.57       0.36       0.41       0.69       0.74       0.80 
##  total_fer       gdpp 
##       0.86       0.66

From the KMO-MSA results above, most variables have the preferable values above 0.5 however, ‘health’ and ‘imports’ are below. Retaining them might distort the results we get. I then proceed to exclude them from the further analysis.

# Excluding 'country' and low-KMO variables ('health' and 'imports')
filtered_data <- countrydata[, !(colnames(countrydata) %in% c("country", "health", "imports"))]

print(colnames(filtered_data))
## [1] "child_mort" "exports"    "income"     "inflation"  "life_expec"
## [6] "total_fer"  "gdpp"
# I proceed to recompute the correlation matrix, run the Bartlett's Test and KMO Test"
R_filtered <- cor(filtered_data)

bartlett_test_filtered <- cortest.bartlett(R_filtered, n = nrow(filtered_data))
bartlett_test_filtered
## $chisq
## [1] 890.692
## 
## $p.value
## [1] 5.08529e-175
## 
## $df
## [1] 21
kmo_result_filtered <- KMO(R_filtered)
kmo_result_filtered
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = R_filtered)
## Overall MSA =  0.76
## MSA for each item = 
## child_mort    exports     income  inflation life_expec  total_fer       gdpp 
##       0.73       0.85       0.70       0.73       0.80       0.85       0.69

Observations

  • H₀: P = I

  • H₁: P ≠ I

We reject the null hypothesis (H₀) at p-value < 0.001, this confirms that the variables are correlated enough to perform PCA.

KMO-MSA Tests

  1. The overall MSA of 0.76 is in the middling category of data adequacy which is a good enough value for us to proceed with PCA.
  2. The variable-specific MSAs are all okay as they are out of the miserable dimension of data adequacy. this is sufficient for us to proceed with PCA.

I then proceed with carrying out the PCA on the filtered dataset that has passed all tests for further analysis.

R <- cor(countrydata[, c("child_mort", "exports", "income", "inflation", "life_expec", "total_fer", "gdpp")])
library(FactoMineR)
library(factoextra)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(knitr)
library(kableExtra)
library(DT)

countrydata_PCA <- countrydata[, c("child_mort", "exports", "income", "inflation", "life_expec", "total_fer", "gdpp")]

components <- PCA(countrydata_PCA, 
                  scale.unit = TRUE,  
                  graph = FALSE)    

eigenvalues <- get_eigenvalue(components)

kable(eigenvalues, format = "html", 
      col.names = c("Eigenvalue", "Variance (%)", "Cumulative Variance (%)"),
      caption = "Eigenvalues and Variance Explained by PCA Components") %>%
  
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"), 
                position = "center") %>%
  row_spec(0, bold = TRUE, background = "#cfe8f6") %>%
  row_spec(1:nrow(eigenvalues), background = "#eaf3fc")
Eigenvalues and Variance Explained by PCA Components
Eigenvalue Variance (%) Cumulative Variance (%)
Dim.1 3.9819296 56.884708 56.88471
Dim.2 1.1163086 15.947266 72.83197
Dim.3 0.8507101 12.153002 84.98498
Dim.4 0.6522514 9.317877 94.30285
Dim.5 0.2233174 3.190249 97.49310
Dim.6 0.0949972 1.357103 98.85021
Dim.7 0.0804856 1.149794 100.00000

When working with technical data, we look for the cumulative variance from the first two components to be above 70%. In the case of the results above, this results into 72.83% which gives us confidence to proceed with PCA.

# I proceed to draw the screeplot to judge how many components to be retained based on the 'elbow' rule.

library(factoextra)

fviz_eig(components,
         choice = "eigenvalue",
         main = "Screeplot",
         ylab = "Eigenvalue",
         xlab = "Principal Component",
         addlabels = TRUE)

From personal observation of the scree plot, we observe a sharp decline at PC 2, at this point we consider one less (2-1) and would therefore retain 1 component.

# We also carry out the parallel analysis to assess how many components we retain
library(psych)

fa.parallel(countrydata_PCA, 
            sim = FALSE,
            fa = "pc")    

## Parallel analysis suggests that the number of factors =  NA  and the number of components =  1

Although the parallel analysis scree plot above suggests retaining one component, keeping two components is preferable, ensuring a more comprehensive understanding of the data by capturing 72.8% of the total variance which also exceeds the threshold for technical data (70%). So for further analysis, we shall retain 2 components.

library(FactoMineR)

components <- PCA(countrydata_PCA, 
                  ncp = 2,    
                  scale.unit = TRUE,  
                  graph = FALSE)     

components
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 167 individuals, described by 7 variables
## *The results are available in the following objects:
## 
##    name               description                          
## 1  "$eig"             "eigenvalues"                        
## 2  "$var"             "results for the variables"          
## 3  "$var$coord"       "coord. for the variables"           
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"             
## 6  "$var$contrib"     "contributions of the variables"     
## 7  "$ind"             "results for the individuals"        
## 8  "$ind$coord"       "coord. for the individuals"         
## 9  "$ind$cos2"        "cos2 for the individuals"           
## 10 "$ind$contrib"     "contributions of the individuals"   
## 11 "$call"            "summary statistics"                 
## 12 "$call$centre"     "mean of the variables"              
## 13 "$call$ecart.type" "standard error of the variables"    
## 14 "$call$row.w"      "weights for the individuals"        
## 15 "$call$col.w"      "weights for the variables"
# Correlation of variables with components
library(kableExtra)

cor_matrix <- components$var$cor

kable(cor_matrix, format = "html", digits = 3, caption = "Correlation of Variables with Components") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F) %>%
  row_spec(0, bold = T, color = "white", background = "#4682B4") %>%
  row_spec(1:nrow(cor_matrix), background = "#f0f8ff")
Correlation of Variables with Components
Dim.1 Dim.2
child_mort -0.868 0.347
exports 0.544 0.490
income 0.827 0.447
inflation -0.364 0.477
life_expec 0.888 -0.208
total_fer -0.831 0.355
gdpp 0.798 0.398

From the results above, we see that the for PC1, the variables of life-expec, income and gdpp have strong positive correlations with PC1. I identify this as the prosperity and economic well-being axis

PC2 captures a strong correlation with exports, inflation and income, representing the trade axis

# Contributions of variables to components
library(kableExtra)

contrib_matrix <- components$var$contrib

kable(contrib_matrix, format = "html", digits = 3, caption = "Contributions of Variables to Components (%)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F) %>%
  row_spec(0, bold = T, color = "white", background = "#4682B4") %>% 
  row_spec(1:nrow(contrib_matrix), background = "#f0f8ff")
Contributions of Variables to Components (%)
Dim.1 Dim.2
child_mort 18.936 10.784
exports 7.443 21.527
income 17.166 17.938
inflation 3.319 20.369
life_expec 19.784 3.881
total_fer 17.363 11.292
gdpp 15.988 14.207

The prosperity and economic well-being axis is defined primarily by life_expec, child_mort, total_fer & income and PC2 as the trade axis is defined heavily by the exports and inflation

#Next, we visualize the variables and their influcence over the PCs
library(factoextra)

fviz_pca_var(components,
             repel = TRUE,
             col.var = "contrib",
             gradient.cols = c("#f0f8ff", "#4682B4", "#000080"),
             ggtheme = theme_minimal())

KEY OBSERVATIONS

PC1 (Horizontal Axis):

  1. Highly positively correlated variables:
  • income, life_expec, and gdpp: These variables align closely with PC1, indicating that PC1 captures the concept of prosperity and economic well-being.
  1. Highly negatively correlated variables:
  • child_mort and total_fer: These variables point in the opposite direction of income and life_expec, indicating an inverse relationship with prosperity.

PC1 represents a development gradient, with regions characterized by higher income, GDP per capita, and life expectancy at one end, and high child mortality and fertility rates at the other.

PC2 (Vertical Axis):

  1. Highly positively correlated variables:
  • exports and inflation: These variables align closely with PC2, suggesting that PC2 captures trade-related factors.
  1. Variables like child_mort and life_expec are less aligned with PC2, suggesting that health-related factors are not major contributors to this component.

Key to note is that child_mort and total_fer vs. income, gdpp, and life_expec are positioned in opposite directions, reinforcing their inverse relationships.

# Next, we visualise individual data points of countries to see how they cluster along these components.
library(factoextra)

fviz_pca_biplot(components, 
                repel = TRUE,                     
                col.var = "contrib",
                gradient.cols = c("#f0f8ff", "#4682B4", "#000080"),
                col.ind = "#696969", 
                ggtheme = theme_minimal())

OBSERVATIONS

The biplot shows a wide distribution of countries, reflecting varying levels of socioeconomic development. While many countries cluster near the center, indicating average characteristics, others are positioned at the extremes, representing distinct profiles.

This biplot simplifies complex data into actionable insights, revealing relationships between variables and clustering similar observations

  • Countries are spread across the biplot, with some clusters near the center, indicating average characteristics, and others spread towards the extremes.
  • Countries 92, 124, and 134 are located far to the positive end of PC1, suggesting high prosperity levels (high income, gdpp, and life_expec).
  • Countries like 114 and 50 align more with PC2, possibly indicating high trade activity or inflation.

PC1 and PC2 effectively separate observations based on prosperity metrics and trade factors. Countries with high prosperity are concentrated on the positive side of PC1. Trade-influenced countries align more with PC2.

Clusters and outliers highlight distinct groups of countries with shared characteristics or unique traits.

#For further Analysis, I explored individual points in extreme positions (92, 114, 133) for profiling.
library(kableExtra)

extreme_points <- countrydata[rownames(countrydata) %in% c("92", "114", "133"), ]

extreme_points %>%
  kbl(caption = "Extreme Points (Countries)") %>%
  kable_styling(full_width = FALSE, 
                bootstrap_options = c("striped", "hover", "condensed"),
                position = "center") %>%
  row_spec(0, background = "#f0f8ff", bold = TRUE) %>%  
  row_spec(1:nrow(extreme_points), background = "#e6f2ff")
Extreme Points (Countries)
country child_mort exports health imports income inflation life_expec total_fer gdpp
Luxembourg 2.8 175.0 7.77 142.0 91700 3.62 81.3 1.63 105000
Nigeria 130.0 25.3 5.07 17.4 5150 104.00 60.5 5.84 2330
Sierra Leone 160.0 16.8 13.10 34.5 1220 17.20 55.0 5.20 399

PC 1 differentiates countries based on economic and health characteristics, with prosperous nations like Luxembourg at one end and struggling nations like Sierra Leone at the other.

PC 2 emphasizes trade and inflation dynamics, as seen with Nigeria, where high inflation significantly influences its profile.

CONCLUSION

The PCA reveals two key dimensions driving socioeconomic differences across countries.
PC1 represents economic prosperity and health, with countries like Luxembourg excelling due to high income, life expectancy, and low mortality, while Sierra Leone faces challenges on these fronts.
PC2 reflects trade dynamics and inflation, where Nigeria stands out for its high inflation rates.
Together, these dimensions explain 72.8% of the variance, showing how countries cluster based on development and economic pressures. Overall, the analysis highlights significant disparities in socioeconomic profiles globally.