Introduction:

This project focuses on applying dimension reduction techniques—Principal Component Analysis (PCA) and Multidimensional Scaling (MDS)—to a dataset related to primary education, obtained from UNICEF. The aim is to understand patterns in the data while reducing its dimensional complexity for efficient analysis.

Primary School

Dataset Description:

The dataset, downloaded from UNICEF’s Pre-Primary Education Data, provides comprehensive insights into global attendance rates and related demographics for pre-primary education, as recorded one year before the primary year of schooling. It includes:Country and Regional Data, Attendance Metrics, Socioeconomic Stratification, Data Sources and Time Periods, Population Information. This dataset is pivotal for understanding attendance disparities across genders, regions, and socioeconomic groups, providing a robust foundation for applying dimension reduction techniques to uncover hidden patterns and key influences on pre-primary education attendance globally.

# Load the dataset
education_data <- read.csv("Data//Primary_education_data.csv")

# Inspect the data
dim(education_data)

## [1] 205  23

str(education_data)

## 'data.frame':    205 obs. of  23 variables:
##  $ ISO3                   : chr  "AFG" "ALB" "DZA" "AND" ...
##  $ Country                : chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
##  $ Region                 : chr  "SA" "ECA" "MENA" "ECA" ...
##  $ Sub.region             : chr  "SA" "EECA" "MENA" "WE" ...
##  $ Development.Regions    : chr  "Least Developed" "More Developed" "Less Developed" "More Developed" ...
##  $ Total                  : num  34.1 NA 78.1 NA 52.9 ...
##  $ Gender..Female         : num  29.3 NA 79.8 NA 50.5 ...
##  $ Gender.Male            : num  38.5 NA 76.7 NA 55.1 ...
##  $ Residence.Rural        : num  34.3 NA 69.6 NA 36.1 ...
##  $ Residence.Urban        : num  33.5 NA 83.4 NA 63.5 ...
##  $ Wealth.quintile.Poorest: num  37.5 NA 61 NA 32 ...
##  $ Wealth.quintile.Second : num  33.8 NA 77.1 NA 38.6 ...
##  $ Wealth.quintile.Middle : num  31.8 NA 83.2 NA 54.2 ...
##  $ Wealth.quintile.Fourth : num  32.2 NA 86.9 NA 69.6 ...
##  $ Wealth.quintile.Richest: num  35.6 NA 88.3 NA 81.2 ...
##  $ Source                 : chr  "DHS 2015" "" "MICS 2019" "" ...
##  $ Time_period            : int  2015 NA 2020 NA 2016 NA NA 2020 2016 NA ...
##  $ Pop_total              : int  1096206 34333 977228 661 1102635 198 1486 749906 43011 334911 ...
##  $ Pop_female             : int  534029 16502 478249 311 546100 97 732 367999 20215 163004 ...
##  $ Pop_male               : int  562177 17831 498979 350 556535 101 754 381907 22796 171907 ...
##  $ Pop_rural              : num  816729.5 13623.7 267477.4 78.9 380250.2 ...
##  $ Pop_urban              : num  279477 20709 709751 582 722385 ...
##  $ Urban_percentage       : num  0.255 0.603 0.726 0.881 0.655 ...

Steps Performed:

1.Data Preparation:

The dataset was preprocessed to handle missing values and standardize scales. Standardization ensures attributes contribute equally during analysis.

Replace missing values with the mean (numerical) or mode (categorical)

# Replace missing values with the mean (numerical) or mode (categorical)
education_data_clean <- education_data %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .))) %>%
  mutate(across(where(is.character), ~ ifelse(is.na(.), Mode(.), .)))

# Define a function for mode (for categorical variables)
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))] 
}

Remove the row if the total column is NA

education_data_cleaned <- education_data %>%
  filter(!is.na(Total))

Normalization numerical columns

# Select numerical columns for normalization
numerical_columns <- education_data %>% dplyr::select(where(is.numeric))

# Apply normalization (scaling)
normalized_columns <- scale(numerical_columns)

# Reassign the normalized data back to the dataset
education_data[names(numerical_columns)] <- as.data.frame(normalized_columns)

Number of columns in my data is

dim(education_data)

## [1] 114  23

Extract the numeric part of the data:

education_matrix <- education_data %>% dplyr::select(where(is.numeric)) %>% as.matrix()

2.Dimensionality Reduction Using PCA:

Principal Component Analysis (PCA) PCA is a technique for reducing the dimensionality of such datasets, increasing interpretability but at the same time minimizing information loss. It does so by creating new uncorrelated variables that successively maximize variance.

# Perform PCA
pca_result <- prcomp(education_matrix, center = TRUE, scale. = TRUE)

# Summary of PCA
summary(pca_result)

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     3.0370 2.2200 1.01780 0.91320 0.70776 0.44276 0.27711
## Proportion of Variance 0.5425 0.2899 0.06094 0.04905 0.02947 0.01153 0.00452
## Cumulative Proportion  0.5425 0.8325 0.89339 0.94244 0.97191 0.98344 0.98796
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.26374 0.24144 0.15706 0.14708 0.12876 0.10280 0.05774
## Proportion of Variance 0.00409 0.00343 0.00145 0.00127 0.00098 0.00062 0.00020
## Cumulative Proportion  0.99205 0.99548 0.99693 0.99820 0.99918 0.99980 0.99999
##                           PC15      PC16     PC17
## Standard deviation     0.01110 3.475e-11 5.97e-16
## Proportion of Variance 0.00001 0.000e+00 0.00e+00
## Cumulative Proportion  1.00000 1.000e+00 1.00e+00

# View rotation matrix (loadings)
pca_result$rotation

Interpretation for PCA

# Eigenvalues and explained variance
eigen_values <- pca_result$sdev^2  # Square of singular values
explained_variance <- eigen_values / sum(eigen_values) * 100  # Percentage of explained variance
cumulative_variance <- cumsum(explained_variance)  # Cumulative variance

# Create a data frame for eigenvalues
eigen_df <- data.frame(
  PC = paste0("PC", 1:length(eigen_values)),
  Eigenvalue = eigen_values,
  Variance_Explained = explained_variance,
  Cumulative_Variance = cumulative_variance
)

Loadings:

The loadings indicate how much each original feature contributes to each principal component (PC). Higher loading values (positive or negative) signify that the respective feature has a stronger influence on that particular principal component. This helps identify which original variables are most important in defining the directions of maximum variance in the dataset. By examining these loadings, we can interpret the characteristics represented by each principal component and their relevance to the underlying data structure.

# Extract loadings
loadings <- pca_result$rotation

# Plot loadings for PC1
barplot(loadings[, 1], main = "Loadings for PC1", xlab = "Variables", ylab = "Loadings")

Eigenvalues:

The eigenvalues represent the amount of variance captured or explained by each principal component. In the table above, we see that the first few PCs (e.g., PC1, PC2, and PC3) have high eigenvalues, which indicates that they account for a significant proportion of the total variance in the data. Specifically:

# Print eigenvalues and variance explained
print(eigen_df)

##      PC   Eigenvalue Variance_Explained Cumulative_Variance
## 1   PC1 9.223181e+00       5.425400e+01            54.25400
## 2   PC2 4.928464e+00       2.899096e+01            83.24497
## 3   PC3 1.035921e+00       6.093651e+00            89.33862
## 4   PC4 8.339286e-01       4.905462e+00            94.24408
## 5   PC5 5.009250e-01       2.946618e+00            97.19070
## 6   PC6 1.960379e-01       1.153164e+00            98.34386
## 7   PC7 7.678795e-02       4.516938e-01            98.79556
## 8   PC8 6.955883e-02       4.091696e-01            99.20473
## 9   PC9 5.829149e-02       3.428911e-01            99.54762
## 10 PC10 2.466914e-02       1.451126e-01            99.69273
## 11 PC11 2.163191e-02       1.272465e-01            99.81998
## 12 PC12 1.657943e-02       9.752605e-02            99.91750
## 13 PC13 1.056732e-02       6.216068e-02            99.97967
## 14 PC14 3.333822e-03       1.961072e-02            99.99928
## 15 PC15 1.231062e-04       7.241542e-04           100.00000
## 16 PC16 1.207382e-21       7.102249e-21           100.00000
## 17 PC17 3.563604e-31       2.096238e-30           100.00000

PC1 explains 54.25% of the total variance.
PC2 adds another 29.90%, bringing the cumulative variance explained to approximately 83.25%.
With PC3, the cumulative variance reaches 89.34%, showing diminishing returns for subsequent PCs.

This suggests that most of the variability in the dataset can be represented using just the first few principal components, while later components contribute very little to explaining additional variance.

fviz_eig(pca_result, choice = "eigenvalue", main = "Scree Plot (Eigenvalues)")

Scree plot of eigenvalues

Cumulative Variance Explained:

The cumulative variance plot demonstrates that the first 3-5 components explain nearly all the variance in the dataset, with the cumulative variance exceeding 95% by the fifth component. This provides a strong justification for dimensionality reduction, as retaining only the first few principal components would significantly reduce the complexity of the data while preserving its essential information.

# Plot cumulative variance
plot(cumulative_variance, type = "b", xlab = "Principal Components", ylab = "Cumulative Variance (%)",
     main = "Cumulative Variance Explained")

Overall, this analysis reveals the utility of PCA in simplifying high-dimensional data while retaining most of the variability, facilitating further analysis or modeling efforts.

# Biplot for the first two principal components
biplot(pca_result, scale = 0, main = "PCA Biplot")

The biplot illustrates the relationships between the original variables (red arrows) and the observations (black points) in the space defined by the first two principal components (PC1 and PC2), highlighting the contribution of each variable to the principal components.

fviz_pca_var(pca_result, col.var="steelblue")

The PCA variable plot shows that most variables strongly contribute to Dim1 (54.3%), while fewer variables are aligned with Dim2 (29%). Variables with longer arrows are more influential, and their directions indicate correlations (similar directions for positive correlation, opposite for negative

Variable contributions to the first two principal components

var_contrib_PC1 <- fviz_contrib(pca_result, choice = "var", axes = 1, xtickslab.rt = 90)
var_contrib_PC2 <- fviz_contrib(pca_result, choice = "var", axes = 2, xtickslab.rt = 90)

# Combine and visualize contributions
library(gridExtra)
grid.arrange(var_contrib_PC1, var_contrib_PC2, top = "Variable Contributions to the First Two PCs")

PC1 is strongly influenced by Total, Gender_Male, and Gender_Female, while PC2 is dominated by population-related variables like Pop_Female, Pop_Male, and Pop_Total. This shows PC1 focuses on attendance and wealth, while PC2 captures demographic patterns.

# Clustering with PCA components
pca_scores <- pca_result$x[, 1:2]  # First two principal components
clusters <- KMeans_rcpp(pca_scores, clusters = 3, num_init = 5, max_iters = 100)

# Assign cluster labels to the PCA scores
cluster_labels <- clusters$clusters

# Plot clusters
plot(pca_scores, col = cluster_labels, pch = 19, 
     main = "K-means Clustering on PCA Components",
     xlab = "Principal Component 1", ylab = "Principal Component 2")
points(clusters$centroids, col = 1:3, pch = 8, cex = 2)  # Centroids

The plot represents the results of K-means clustering applied to the first two principal components (PC1 and PC2). Data points are grouped into three distinct clusters, differentiated by colors, with centroids marked as larger symbols. This clustering highlights patterns and groupings within the dataset, leveraging the reduced dimensions from PCA to simplify and enhance data interpretation.

3.Dimensionality Reduction Using MDS

Multidimensional Scaling (MDS) is a technique used to visualize the similarity or dissimilarity of data in a lower-dimensional space. It preserves the pairwise distances between data points as accurately as possible while reducing dimensionality.

# Compute correlations for numerical variables
cor_matrix <- cor(education_matrix, method = "pearson")

# Visualize correlations
corrplot(cor_matrix, method = "color", order = "hclust", tl.cex = 0.6)

The heatmap represents the Pearson correlation matrix for numerical variables. Darker colors indicate stronger correlations (positive or negative), with variables clustered hierarchically to reveal patterns. Strong correlations are observed among wealth-related variables and attendance metrics, helping identify relationships and redundancies in the data.

ggpairs(as.data.frame(education_matrix))

The ggpairs function generates pairwise plots for all variables in the dataset, including scatterplots, histograms, and correlation coefficients. This helps visualize relationships, distributions, and correlations between variables for comprehensive exploratory data analysis.

Distance Matrix

This code computes the pairwise distance matrix for the dataset using the Euclidean distance method. The resulting matrix quantifies the dissimilarity between observations, where smaller values indicate higher similarity. The first 5x5 portion of the matrix is displayed for inspection.

# Compute the distance matrix
distance_matrix <- dist(education_matrix, method = "euclidean")

# Convert the distance matrix to a matrix format (optional)
as.matrix(distance_matrix)[1:5, 1:5]  # View the first 5x5 distances

##          1        2        3        4        5
## 1 0.000000 7.379434 4.493153 8.038707 4.647702
## 2 7.379434 0.000000 3.863173 1.393947 3.238823
## 3 4.493153 3.863173 0.000000 4.599464 2.297331
## 4 8.038707 1.393947 4.599464 0.000000 3.829134
## 5 4.647702 3.238823 2.297331 3.829134 0.000000

Perform classical MDS

# Perform classical MDS (cmdscale)
mds_result <- cmdscale(distance_matrix, k = 2, eig = TRUE)

# Extract MDS coordinates
mds_coordinates <- mds_result$points

# View the first few coordinates
head(mds_coordinates)

##            [,1]       [,2]
## [1,]  5.7639955  0.3192705
## [2,] -1.0275262  0.1300539
## [3,]  2.5340411  0.3130165
## [4,] -1.6457330 -0.1591963
## [5,]  1.6019750 -0.7938616
## [6,]  0.8265144 -0.7123181

Stress Value

cat("Stress Value for MDS:", mds_result$stress, "\n")

## Stress Value for MDS:

# Compute Stress Value for MDS
dist_matrix <- dist(education_data)

## Warning in dist(education_data): NAs introduced by coercion

mds_result1 <- smacofSym(dist_matrix, ndim = 3)  # MDS in 3d
cat("Stress Value for MDS:", mds_result1$stress, "\n")

## Stress Value for MDS: 0.06284785

The stress value for the Multidimensional Scaling (MDS) analysis with the initial configuration is 0.1139, which falls within the fair fit range according to Kruskal’s guidelines. This suggests that the MDS representation captures much of the original distance information, but with some degree of distortion. However, when the number of dimensions is increased to k = 3, the stress value improves significantly to 0.0628, which is classified as an excellent fit. This indicates that adding an additional dimension greatly enhances the ability of the MDS configuration to represent the original data’s relationships more accurately. This improvement highlights the importance of selecting an appropriate number of dimensions for MDS to optimize the trade-off between interpretability and accuracy in the reduced-dimensional representation.

# Create a data frame and display only the top 10 dimensions
top_15_variance <- data.frame(
  Dimension = seq_along(explained_variance), 
  Eigenvalue = eigenvalues,
  Explained_Variance = explained_variance,
  Cumulative_Variance = cumulative_variance
)

# Show only the top 10 rows
head(top_15_variance, 15)

##    Dimension   Eigenvalue Explained_Variance Cumulative_Variance
## 1          1 1042.2194246       5.425400e+01            54.25400
## 2          2  556.9164326       2.899096e+01            83.24497
## 3          3  117.0590411       6.093651e+00            89.33862
## 4          4   94.2339308       4.905462e+00            94.24408
## 5          5   56.6045299       2.946618e+00            97.19070
## 6          6   22.1522832       1.153164e+00            98.34386
## 7          7    8.6770379       4.516938e-01            98.79556
## 8          8    7.8601480       4.091696e-01            99.20473
## 9          9    6.5869386       3.428911e-01            99.54762
## 10        10    2.7876127       1.451126e-01            99.69273
## 11        11    2.4444056       1.272465e-01            99.81998
## 12        12    1.8734754       9.752605e-02            99.91750
## 13        13    1.1941067       6.216068e-02            99.97967
## 14        14    0.3767219       1.961072e-02            99.99928
## 15        15    0.0139110       7.241542e-04           100.00000

Based on the eigenvalues from my data, I can conclude that selecting 2 dimensions (k = 2) captures 83.24% of the total variance, which is a significant proportion and sufficient for most visualizations and analyses. This choice strikes a balance between retaining the majority of the variance and maintaining simplicity in a 2D representation. Expanding to 3 dimensions (k = 3) increases the cumulative variance explained to 89.34%, adding approximately 6% more variance. While this additional variance might be beneficial for tasks requiring higher precision or 3D visualization, it may not always justify the added complexity. Beyond 3 dimensions, the variance gained becomes negligible, with the 4th dimension contributing only around 5% more, and even less for subsequent dimensions.

Therefore, based on the eigenvalues, I conclude that 2 dimensions are optimal for my analysis, providing a practical trade-off between simplicity and meaningful variance retention, while 3 dimensions could be used if additional precision is necessary.

Visualize MDS Results

# Plot the MDS coordinates
plot(mds_coordinates[, 1], mds_coordinates[, 2], 
     xlab = "Dimension 1", ylab = "Dimension 2", 
     main = "MDS Plot", pch = 21, bg = "blue")


# Add labels to the plot
text(mds_coordinates[, 1], mds_coordinates[, 2], 
     labels = rownames(education_matrix), cex = 0.7, pos = 3)

Perform non-metric MDS using smacof

non_metric_mds <- smacofSym(distance_matrix, ndim = 2)

# Extract MDS coordinates
non_metric_coordinates <- non_metric_mds$conf

# Stress value
non_metric_mds$stress

## [1] 0.1138877

# Visualize the results
plot(non_metric_coordinates[, 1], non_metric_coordinates[, 2], 
     xlab = "Dimension 1", ylab = "Dimension 2", 
     main = "Non-Metric MDS Plot", pch = 21, bg = "red")

# Add labels
text(non_metric_coordinates[, 1], non_metric_coordinates[, 2], 
     labels = rownames(education_matrix), cex = 0.7, pos = 3)

Interpretation for MDS

#Simple plot of the MDS result
plot(fit.data, main = "MDS Plot (Simple)")

# Enhanced plot with the impact of observations on the stress function
plot(
  fit.data, 
  pch = 21, 
  cex = as.numeric(fit.data$spp),  # Symbol size based on stress-per-point (spp)
  bg = "red", 
  main = "MDS Plot with Stress Impact"
)

The code performs Multidimensional Scaling (MDS) to visualize the relationships between variables in a 2D space based on their dissimilarities. The Simple MDS Plot shows variable positions, with closer points indicating higher similarity. The Stress Impact Plot highlights variables like Urban_percentage and Time_period contributing significantly to stress, indicating challenges in representing them accurately in reduced dimensions

K-means Clustering on MDS Coordinates

if (!is.matrix(mds_coordinates) && !is.data.frame(mds_coordinates)) {
  stop("mds_coordinates must be a matrix or data frame with 2D coordinates.")
}

# Perform K-means clustering
set.seed(123)  # Ensure reproducibility
kmeans_result <- kmeans(mds_coordinates, centers = 3, nstart = 5)
library(ggplot2)
ggplot(data = as.data.frame(mds_coordinates), aes(x = V1, y = V2, color = factor(kmeans_result$cluster))) +
  geom_point(size = 3) +
  geom_point(data = as.data.frame(kmeans_result$centers), aes(x = V1, y = V2), size = 5, shape = 8, color = "black") +
  labs(title = "K-means Clustering on MDS Coordinates", x = "Dimension 1", y = "Dimension 2") +
  theme_minimal()

The plot visualizes the results of K-means clustering applied to MDS coordinates, dividing the data into three clusters (colored points). Cluster centroids are marked with black stars. The clustering highlights distinct groupings in the MDS space, with points within each cluster being relatively similar.

Conclusion

This project showed how dimension reduction techniques can simplify complex educational data. Key points include:

PCA helped identify the most important factors influencing primary education, making the data simpler and reducing unnecessary information.
MDS offered a different way to look at the data by focusing on the relationships between regions.
Together, PCA and MDS provided complementary insights:
- PCA was great for understanding how much information is captured.
- MDS was useful for visualizing connections between data points.

Reducing Complexity in Primary Education Data for Enhanced Analysis

Shagufta Shaheen

19/01/2025