JethroHW2.knit

Author: Jethro Muwanguzi

Title: Homework Assignment 2 - MVA

Date: 2025-27-01

. .

INTRODUCTION

This assignment applies Principal Component Analysis to examine socioeconomic indicators across countries. The goal is to simplify the data while uncovering the main factors that explain differences in economic and social development among nations.

LINK - https://www.kaggle.com/datasets/vipulgohel/clustering-pca-assignment

. .

1 Data Import - Country Classification

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr)
library(knitr)
library(kableExtra)

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(readr))

file_path <- "/Users/jethro/Downloads/archive-3/Country-data.csv"

countrydata <- read_csv(file_path, show_col_types = FALSE)


kable(head(countrydata, 6), format = "html", caption = "First Six Rows of the Dataset") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"), position = "center") %>%
  row_spec(0, bold = TRUE, background = "#4682B4", color = "white") %>% 
  row_spec(1:6, background = "#f0f8ff")

First Six Rows of the Dataset
country	child_mort	exports	health	imports	income	inflation	life_expec	total_fer	gdpp
Afghanistan	90.2	10.0	7.58	44.9	1610	9.44	56.2	5.82	553
Albania	16.6	28.0	6.55	48.6	9930	4.49	76.3	1.65	4090
Algeria	27.3	38.4	4.17	31.4	12900	16.10	76.5	2.89	4460
Angola	119.0	62.3	2.85	42.9	5900	22.40	60.1	6.16	3530
Antigua and Barbuda	10.3	45.5	6.03	58.9	19100	1.44	76.8	2.13	12200
Argentina	14.5	18.9	8.10	16.0	18700	20.90	75.8	2.37	10300

2 Data Description

Unit of Observation: A single country
Sample Size: 167 rows (representing 167 countries)

Variables:

country: Names of countries

child_mort: Death of children under 5 years of age per 1,000 live births
exports: Exports of goods and services as a percentage of the total GDP.
health: Total health spending as a percentage of the total GDP
imports: Imports of goods and services as a percentage of the total GDP
Income: Net income per person
Inflation: Annual growth rate of the total GDP
life_expec: Average number of years a newborn child would live if current mortality patterns remain unchanged
total_fer: Number of children that would be born to each woman if current age-fertility rates remain constant
gdpp: GDP per capita, calculated as the total GDP divided by the total population

Source: Kaggle (https://www.kaggle.com/datasets/vipulgohel/clustering-pca-assignment)

. .

3 Descriptive Statistics

library(dplyr)
library(psych)
library(knitr)
library(kableExtra)

describe_table <- psych::describe(countrydata[, c("child_mort", "exports", "health", "imports", "income", "inflation", "life_expec", "total_fer", "gdpp")])
                                              
kable(as.data.frame(describe_table), caption = "Descriptive Statistics") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"), 
                position = "left", font_size = 10) %>%
     row_spec(0, bold = TRUE, background = "lightblue") %>%
       column_spec(1, bold = TRUE)

Descriptive Statistics
	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
child_mort	1	167	38.270060	40.328932	19.30	31.588889	21.942480	2.6000	2.08e+02	205.4000	1.4248158	1.6228439	3.1207464
exports	2	167	41.108976	27.412010	35.00	37.799259	21.201180	0.1090	2.00e+02	199.8910	2.4020624	9.6490756	2.1212050
health	3	167	6.815689	2.746837	6.32	6.663333	2.639028	1.8100	1.79e+01	16.0900	0.6931186	0.5944336	0.2125567
imports	4	167	46.890215	24.209589	43.30	44.337037	21.052920	0.0659	1.74e+02	173.9341	1.8711858	6.4058921	1.8733942
income	5	167	17144.688623	19278.067698	9960.00	13807.629630	11638.410000	609.0000	1.25e+05	124391.0000	2.1915532	6.6674347	1491.7816663
inflation	6	167	7.781832	10.570704	5.39	6.265133	5.722836	-4.2100	1.04e+02	108.2100	5.0618313	39.9484352	0.8179856
life_expec	7	167	70.555689	8.893172	73.10	71.327407	8.895600	32.1000	8.28e+01	50.7000	-0.9536222	1.0329499	0.6881743
total_fer	8	167	2.947964	1.513848	2.41	2.768741	1.215732	1.1500	7.49e+00	6.3400	0.9497881	-0.2501780	0.1171450
gdpp	9	167	12964.155689	18328.704809	4660.00	9146.429630	5814.757200	231.0000	1.05e+05	104769.0000	2.1783653	5.2286139	1418.3177603

SOME OBSERVATIONS

Income: The income variable has a mean of 17,144.69 and ranges from 609 to 125,000, indicating significant economic disparities globally. The median income of 4,660 highlights that most countries fall below the average, skewed by a few very wealthy nations.
Inflation: Inflation shows a mean of 7.78%, with values ranging from extreme deflation at -42.1% to hyperinflation at 104.0%. The median value of 5.39% indicates that most countries experience moderate inflation, while extreme values pull the average upward.
child_mort: Child mortality has a mean of 38.27 deaths per 1,000 live births, with a minimum of 2.60 and a maximum of 200.00. This indicates stark contrasts in child health outcomes between developed and developing nations.
life_expec: Life expectancy has a mean of 70.55 years, ranging from 32.10 to 82.80. The median value of 71.33 reflects relatively high global life expectancy, though some countries still face significantly lower lifespans.
gdpp: GDP per capita has a mean of 12,964, with values ranging from 231 to 105,000. The median of 4,660 highlights the substantial economic inequality, with most countries earning far below the average due to the influence of wealthy nations.

. .

4.1 RESEARCH QUESTION:

How can we use principal component analysis to identify key dimensions that summarize the economic and demographic characteristics of countries, and what relationships exist among these dimensions?

Before we perform the PCA, we check the correlations, carry out the KMO-MSA tests and the Bartlett’s test of sphericity

library(dplyr)
library(knitr)
library(DT)

countrydata_PCA <- countrydata[, c("child_mort", "exports", "health", "imports", "income", "inflation", "life_expec", "total_fer", "gdpp")]

R <- cor(countrydata_PCA)
R_rounded <- round(R, 3)

kable(R_rounded, format = "html", table.attr = 'style="background-color: #eaf3fc; border: 1px solid #b3cde3;"', caption = "Correlation Matrix (Rounded)") %>%
  kableExtra::kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Correlation Matrix (Rounded)
	child_mort	exports	health	imports	income	inflation	life_expec	total_fer	gdpp
child_mort	1.000	-0.318	-0.200	-0.127	-0.524	0.288	-0.887	0.848	-0.483
exports	-0.318	1.000	-0.114	0.737	0.517	-0.107	0.316	-0.320	0.419
health	-0.200	-0.114	1.000	0.096	0.130	-0.255	0.211	-0.197	0.346
imports	-0.127	0.737	0.096	1.000	0.122	-0.247	0.054	-0.159	0.115
income	-0.524	0.517	0.130	0.122	1.000	-0.148	0.612	-0.502	0.896
inflation	0.288	-0.107	-0.255	-0.247	-0.148	1.000	-0.240	0.317	-0.222
life_expec	-0.887	0.316	0.211	0.054	0.612	-0.240	1.000	-0.761	0.600
total_fer	0.848	-0.320	-0.197	-0.159	-0.502	0.317	-0.761	1.000	-0.455
gdpp	-0.483	0.419	0.346	0.115	0.896	-0.222	0.600	-0.455	1.000

For a more balanced analysis of the relationships among the variables, I use the correlation matrix to ensure that the PCA results are not skewed by variables with larger scales or variances.

library(psych)
corPlot(R)

The correlation plot above, as a visual representation of the correlation matrix. Below are some key observations;

There is a strong positive correlation between child_mort and total_fer at 0.85, indicating a close relation.
There is a strong negative correlation, between life_expec and child_mort at -0.89, suggesting these variables move in opposite directions.
Variables like health have weak correlations with most others, making them less impactful in PCA.

#we proceed to carry out the Bartlett's Test of sphericity
library(psych)
bartlett_test <- cortest.bartlett(R, n = nrow(countrydata))
bartlett_test

## $chisq
## [1] 1169.737
## 
## $p.value
## [1] 3.136862e-222
## 
## $df
## [1] 36

H₀: P = I
H₁: P ≠ I

We reject the null hypothesis (H₀) at p-value < 0.001

determinant_value <- det(R)
determinant_value

## [1] 0.0007368109

#Below we carry out the KMO-MSA tests, looking for values above 0.5 
kmo_result <- KMO(R)
kmo_result

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = R)
## Overall MSA =  0.68
## MSA for each item = 
## child_mort    exports     health    imports     income  inflation life_expec 
##       0.73       0.57       0.36       0.41       0.69       0.74       0.80 
##  total_fer       gdpp 
##       0.86       0.66

From the KMO-MSA results above, most variables have the preferable values above 0.5 however, ‘health’ and ‘imports’ are below. Retaining them might distort the results we get. I then proceed to exclude them from the further analysis.

# Excluding 'country' and low-KMO variables ('health' and 'imports')
filtered_data <- countrydata[, !(colnames(countrydata) %in% c("country", "health", "imports"))]

print(colnames(filtered_data))

## [1] "child_mort" "exports"    "income"     "inflation"  "life_expec"
## [6] "total_fer"  "gdpp"

# I proceed to recompute the correlation matrix, run the Bartlett's Test and KMO Test"
R_filtered <- cor(filtered_data)

bartlett_test_filtered <- cortest.bartlett(R_filtered, n = nrow(filtered_data))
bartlett_test_filtered

## $chisq
## [1] 890.692
## 
## $p.value
## [1] 5.08529e-175
## 
## $df
## [1] 21

kmo_result_filtered <- KMO(R_filtered)
kmo_result_filtered

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = R_filtered)
## Overall MSA =  0.76
## MSA for each item = 
## child_mort    exports     income  inflation life_expec  total_fer       gdpp 
##       0.73       0.85       0.70       0.73       0.80       0.85       0.69

Observations

H₀: P = I
H₁: P ≠ I

We reject the null hypothesis (H₀) at p-value < 0.001, this confirms that the variables are correlated enough to perform PCA.

KMO-MSA Tests

The overall MSA of 0.76 is in the middling category of data adequacy which is a good enough value for us to proceed with PCA.
The variable-specific MSAs are all okay as they are out of the miserable dimension of data adequacy. this is sufficient for us to proceed with PCA.

I then proceed with carrying out the PCA on the filtered dataset that has passed all tests for further analysis.

R <- cor(countrydata[, c("child_mort", "exports", "income", "inflation", "life_expec", "total_fer", "gdpp")])

library(FactoMineR)
library(factoextra)

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(knitr)
library(kableExtra)
library(DT)

countrydata_PCA <- countrydata[, c("child_mort", "exports", "income", "inflation", "life_expec", "total_fer", "gdpp")]

components <- PCA(countrydata_PCA, 
                  scale.unit = TRUE,  
                  graph = FALSE)    

eigenvalues <- get_eigenvalue(components)

kable(eigenvalues, format = "html", 
      col.names = c("Eigenvalue", "Variance (%)", "Cumulative Variance (%)"),
      caption = "Eigenvalues and Variance Explained by PCA Components") %>%
  
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"), 
                position = "center") %>%
  row_spec(0, bold = TRUE, background = "#cfe8f6") %>%
  row_spec(1:nrow(eigenvalues), background = "#eaf3fc")

Eigenvalues and Variance Explained by PCA Components
	Eigenvalue	Variance (%)	Cumulative Variance (%)
Dim.1	3.9819296	56.884708	56.88471
Dim.2	1.1163086	15.947266	72.83197
Dim.3	0.8507101	12.153002	84.98498
Dim.4	0.6522514	9.317877	94.30285
Dim.5	0.2233174	3.190249	97.49310
Dim.6	0.0949972	1.357103	98.85021
Dim.7	0.0804856	1.149794	100.00000

When working with technical data, we look for the cumulative variance from the first two components to be above 70%. In the case of the results above, this results into 72.83% which gives us confidence to proceed with PCA.

# I proceed to draw the screeplot to judge how many components to be retained based on the 'elbow' rule.

library(factoextra)

fviz_eig(components,
         choice = "eigenvalue",
         main = "Screeplot",
         ylab = "Eigenvalue",
         xlab = "Principal Component",
         addlabels = TRUE)

From personal observation of the scree plot, we observe a sharp decline at PC 2, at this point we consider one less (2-1) and would therefore retain 1 component.

# We also carry out the parallel analysis to assess how many components we retain
library(psych)

fa.parallel(countrydata_PCA, 
            sim = FALSE,
            fa = "pc")

## Parallel analysis suggests that the number of factors =  NA  and the number of components =  1

Although the parallel analysis scree plot above suggests retaining one component, keeping two components is preferable, ensuring a more comprehensive understanding of the data by capturing 72.8% of the total variance which also exceeds the threshold for technical data (70%). So for further analysis, we shall retain 2 components.

library(FactoMineR)

components <- PCA(countrydata_PCA, 
                  ncp = 2,    
                  scale.unit = TRUE,  
                  graph = FALSE)     

components

## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 167 individuals, described by 7 variables
## *The results are available in the following objects:
## 
##    name               description                          
## 1  "$eig"             "eigenvalues"                        
## 2  "$var"             "results for the variables"          
## 3  "$var$coord"       "coord. for the variables"           
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"             
## 6  "$var$contrib"     "contributions of the variables"     
## 7  "$ind"             "results for the individuals"        
## 8  "$ind$coord"       "coord. for the individuals"         
## 9  "$ind$cos2"        "cos2 for the individuals"           
## 10 "$ind$contrib"     "contributions of the individuals"   
## 11 "$call"            "summary statistics"                 
## 12 "$call$centre"     "mean of the variables"              
## 13 "$call$ecart.type" "standard error of the variables"    
## 14 "$call$row.w"      "weights for the individuals"        
## 15 "$call$col.w"      "weights for the variables"

# Correlation of variables with components
library(kableExtra)

cor_matrix <- components$var$cor

kable(cor_matrix, format = "html", digits = 3, caption = "Correlation of Variables with Components") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F) %>%
  row_spec(0, bold = T, color = "white", background = "#4682B4") %>%
  row_spec(1:nrow(cor_matrix), background = "#f0f8ff")

Correlation of Variables with Components
	Dim.1	Dim.2
child_mort	-0.868	0.347
exports	0.544	0.490
income	0.827	0.447
inflation	-0.364	0.477
life_expec	0.888	-0.208
total_fer	-0.831	0.355
gdpp	0.798	0.398

From the results above, we see that the for PC1, the variables of life-expec, income and gdpp have strong positive correlations with PC1. I identify this as the prosperity and economic well-being axis

PC2 captures a strong correlation with exports, inflation and income, representing the trade axis

# Contributions of variables to components
library(kableExtra)

contrib_matrix <- components$var$contrib

kable(contrib_matrix, format = "html", digits = 3, caption = "Contributions of Variables to Components (%)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F) %>%
  row_spec(0, bold = T, color = "white", background = "#4682B4") %>% 
  row_spec(1:nrow(contrib_matrix), background = "#f0f8ff")

Contributions of Variables to Components (%)
	Dim.1	Dim.2
child_mort	18.936	10.784
exports	7.443	21.527
income	17.166	17.938
inflation	3.319	20.369
life_expec	19.784	3.881
total_fer	17.363	11.292
gdpp	15.988	14.207

The prosperity and economic well-being axis is defined primarily by life_expec, child_mort, total_fer & income and PC2 as the trade axis is defined heavily by the exports and inflation

#Next, we visualize the variables and their influcence over the PCs
library(factoextra)

fviz_pca_var(components,
             repel = TRUE,
             col.var = "contrib",
             gradient.cols = c("#f0f8ff", "#4682B4", "#000080"),
             ggtheme = theme_minimal())

KEY OBSERVATIONS

PC1 (Horizontal Axis):

Highly positively correlated variables:

income, life_expec, and gdpp: These variables align closely with PC1, indicating that PC1 captures the concept of prosperity and economic well-being.

Highly negatively correlated variables:

child_mort and total_fer: These variables point in the opposite direction of income and life_expec, indicating an inverse relationship with prosperity.

PC1 represents a development gradient, with regions characterized by higher income, GDP per capita, and life expectancy at one end, and high child mortality and fertility rates at the other.

PC2 (Vertical Axis):

Highly positively correlated variables:

exports and inflation: These variables align closely with PC2, suggesting that PC2 captures trade-related factors.

Variables like child_mort and life_expec are less aligned with PC2, suggesting that health-related factors are not major contributors to this component.

Key to note is that child_mort and total_fer vs. income, gdpp, and life_expec are positioned in opposite directions, reinforcing their inverse relationships.

# Next, we visualise individual data points of countries to see how they cluster along these components.
library(factoextra)

fviz_pca_biplot(components, 
                repel = TRUE,                     
                col.var = "contrib",
                gradient.cols = c("#f0f8ff", "#4682B4", "#000080"),
                col.ind = "#696969", 
                ggtheme = theme_minimal())

OBSERVATIONS

The biplot shows a wide distribution of countries, reflecting varying levels of socioeconomic development. While many countries cluster near the center, indicating average characteristics, others are positioned at the extremes, representing distinct profiles.

This biplot simplifies complex data into actionable insights, revealing relationships between variables and clustering similar observations

Countries are spread across the biplot, with some clusters near the center, indicating average characteristics, and others spread towards the extremes.
Countries 92, 124, and 134 are located far to the positive end of PC1, suggesting high prosperity levels (high income, gdpp, and life_expec).
Countries like 114 and 50 align more with PC2, possibly indicating high trade activity or inflation.

PC1 and PC2 effectively separate observations based on prosperity metrics and trade factors. Countries with high prosperity are concentrated on the positive side of PC1. Trade-influenced countries align more with PC2.

Clusters and outliers highlight distinct groups of countries with shared characteristics or unique traits.

#For further Analysis, I explored individual points in extreme positions (92, 114, 133) for profiling.
library(kableExtra)

extreme_points <- countrydata[rownames(countrydata) %in% c("92", "114", "133"), ]

extreme_points %>%
  kbl(caption = "Extreme Points (Countries)") %>%
  kable_styling(full_width = FALSE, 
                bootstrap_options = c("striped", "hover", "condensed"),
                position = "center") %>%
  row_spec(0, background = "#f0f8ff", bold = TRUE) %>%  
  row_spec(1:nrow(extreme_points), background = "#e6f2ff")

Extreme Points (Countries)
country	child_mort	exports	health	imports	income	inflation	life_expec	total_fer	gdpp
Luxembourg	2.8	175.0	7.77	142.0	91700	3.62	81.3	1.63	105000
Nigeria	130.0	25.3	5.07	17.4	5150	104.00	60.5	5.84	2330
Sierra Leone	160.0	16.8	13.10	34.5	1220	17.20	55.0	5.20	399

PC 1 differentiates countries based on economic and health characteristics, with prosperous nations like Luxembourg at one end and struggling nations like Sierra Leone at the other.

PC 2 emphasizes trade and inflation dynamics, as seen with Nigeria, where high inflation significantly influences its profile.

CONCLUSION

The PCA reveals two key dimensions driving socioeconomic differences across countries.

PC1 represents economic prosperity and health, with countries like Luxembourg excelling due to high income, life expectancy, and low mortality, while Sierra Leone faces challenges on these fronts.

PC2 reflects trade dynamics and inflation, where Nigeria stands out for its high inflation rates.

Together, these dimensions explain 72.8% of the variance, showing how countries cluster based on development and economic pressures. Overall, the analysis highlights significant disparities in socioeconomic profiles globally.