This project focuses on applying dimension reduction techniques—Principal Component Analysis (PCA) and Multidimensional Scaling (MDS)—to a dataset related to primary education, obtained from UNICEF. The aim is to understand patterns in the data while reducing its dimensional complexity for efficient analysis.
The dataset, downloaded from UNICEF’s Pre-Primary Education Data, provides comprehensive insights into global attendance rates and related demographics for pre-primary education, as recorded one year before the primary year of schooling. It includes:Country and Regional Data, Attendance Metrics, Socioeconomic Stratification, Data Sources and Time Periods, Population Information. This dataset is pivotal for understanding attendance disparities across genders, regions, and socioeconomic groups, providing a robust foundation for applying dimension reduction techniques to uncover hidden patterns and key influences on pre-primary education attendance globally.
# Load the dataset
education_data <- read.csv("Data//Primary_education_data.csv")
# Inspect the data
dim(education_data)
## [1] 205 23
str(education_data)
## 'data.frame': 205 obs. of 23 variables:
## $ ISO3 : chr "AFG" "ALB" "DZA" "AND" ...
## $ Country : chr "Afghanistan" "Albania" "Algeria" "Andorra" ...
## $ Region : chr "SA" "ECA" "MENA" "ECA" ...
## $ Sub.region : chr "SA" "EECA" "MENA" "WE" ...
## $ Development.Regions : chr "Least Developed" "More Developed" "Less Developed" "More Developed" ...
## $ Total : num 34.1 NA 78.1 NA 52.9 ...
## $ Gender..Female : num 29.3 NA 79.8 NA 50.5 ...
## $ Gender.Male : num 38.5 NA 76.7 NA 55.1 ...
## $ Residence.Rural : num 34.3 NA 69.6 NA 36.1 ...
## $ Residence.Urban : num 33.5 NA 83.4 NA 63.5 ...
## $ Wealth.quintile.Poorest: num 37.5 NA 61 NA 32 ...
## $ Wealth.quintile.Second : num 33.8 NA 77.1 NA 38.6 ...
## $ Wealth.quintile.Middle : num 31.8 NA 83.2 NA 54.2 ...
## $ Wealth.quintile.Fourth : num 32.2 NA 86.9 NA 69.6 ...
## $ Wealth.quintile.Richest: num 35.6 NA 88.3 NA 81.2 ...
## $ Source : chr "DHS 2015" "" "MICS 2019" "" ...
## $ Time_period : int 2015 NA 2020 NA 2016 NA NA 2020 2016 NA ...
## $ Pop_total : int 1096206 34333 977228 661 1102635 198 1486 749906 43011 334911 ...
## $ Pop_female : int 534029 16502 478249 311 546100 97 732 367999 20215 163004 ...
## $ Pop_male : int 562177 17831 498979 350 556535 101 754 381907 22796 171907 ...
## $ Pop_rural : num 816729.5 13623.7 267477.4 78.9 380250.2 ...
## $ Pop_urban : num 279477 20709 709751 582 722385 ...
## $ Urban_percentage : num 0.255 0.603 0.726 0.881 0.655 ...
# Replace missing values with the mean (numerical) or mode (categorical)
education_data_clean <- education_data %>%
mutate(across(where(is.numeric), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .))) %>%
mutate(across(where(is.character), ~ ifelse(is.na(.), Mode(.), .)))
# Define a function for mode (for categorical variables)
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
education_data_cleaned <- education_data %>%
filter(!is.na(Total))
# Select numerical columns for normalization
numerical_columns <- education_data %>% dplyr::select(where(is.numeric))
# Apply normalization (scaling)
normalized_columns <- scale(numerical_columns)
# Reassign the normalized data back to the dataset
education_data[names(numerical_columns)] <- as.data.frame(normalized_columns)
Number of columns in my data is
dim(education_data)
## [1] 114 23
education_matrix <- education_data %>% dplyr::select(where(is.numeric)) %>% as.matrix()
Principal Component Analysis (PCA) PCA is a technique for reducing the dimensionality of such datasets, increasing interpretability but at the same time minimizing information loss. It does so by creating new uncorrelated variables that successively maximize variance.
# Perform PCA
pca_result <- prcomp(education_matrix, center = TRUE, scale. = TRUE)
# Summary of PCA
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.0370 2.2200 1.01780 0.91320 0.70776 0.44276 0.27711
## Proportion of Variance 0.5425 0.2899 0.06094 0.04905 0.02947 0.01153 0.00452
## Cumulative Proportion 0.5425 0.8325 0.89339 0.94244 0.97191 0.98344 0.98796
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.26374 0.24144 0.15706 0.14708 0.12876 0.10280 0.05774
## Proportion of Variance 0.00409 0.00343 0.00145 0.00127 0.00098 0.00062 0.00020
## Cumulative Proportion 0.99205 0.99548 0.99693 0.99820 0.99918 0.99980 0.99999
## PC15 PC16 PC17
## Standard deviation 0.01110 3.475e-11 5.97e-16
## Proportion of Variance 0.00001 0.000e+00 0.00e+00
## Cumulative Proportion 1.00000 1.000e+00 1.00e+00
# View rotation matrix (loadings)
pca_result$rotation
# Eigenvalues and explained variance
eigen_values <- pca_result$sdev^2 # Square of singular values
explained_variance <- eigen_values / sum(eigen_values) * 100 # Percentage of explained variance
cumulative_variance <- cumsum(explained_variance) # Cumulative variance
# Create a data frame for eigenvalues
eigen_df <- data.frame(
PC = paste0("PC", 1:length(eigen_values)),
Eigenvalue = eigen_values,
Variance_Explained = explained_variance,
Cumulative_Variance = cumulative_variance
)
The loadings indicate how much each original feature contributes to each principal component (PC). Higher loading values (positive or negative) signify that the respective feature has a stronger influence on that particular principal component. This helps identify which original variables are most important in defining the directions of maximum variance in the dataset. By examining these loadings, we can interpret the characteristics represented by each principal component and their relevance to the underlying data structure.
# Extract loadings
loadings <- pca_result$rotation
# Plot loadings for PC1
barplot(loadings[, 1], main = "Loadings for PC1", xlab = "Variables", ylab = "Loadings")
The eigenvalues represent the amount of variance captured or explained by each principal component. In the table above, we see that the first few PCs (e.g., PC1, PC2, and PC3) have high eigenvalues, which indicates that they account for a significant proportion of the total variance in the data. Specifically:
# Print eigenvalues and variance explained
print(eigen_df)
## PC Eigenvalue Variance_Explained Cumulative_Variance
## 1 PC1 9.223181e+00 5.425400e+01 54.25400
## 2 PC2 4.928464e+00 2.899096e+01 83.24497
## 3 PC3 1.035921e+00 6.093651e+00 89.33862
## 4 PC4 8.339286e-01 4.905462e+00 94.24408
## 5 PC5 5.009250e-01 2.946618e+00 97.19070
## 6 PC6 1.960379e-01 1.153164e+00 98.34386
## 7 PC7 7.678795e-02 4.516938e-01 98.79556
## 8 PC8 6.955883e-02 4.091696e-01 99.20473
## 9 PC9 5.829149e-02 3.428911e-01 99.54762
## 10 PC10 2.466914e-02 1.451126e-01 99.69273
## 11 PC11 2.163191e-02 1.272465e-01 99.81998
## 12 PC12 1.657943e-02 9.752605e-02 99.91750
## 13 PC13 1.056732e-02 6.216068e-02 99.97967
## 14 PC14 3.333822e-03 1.961072e-02 99.99928
## 15 PC15 1.231062e-04 7.241542e-04 100.00000
## 16 PC16 1.207382e-21 7.102249e-21 100.00000
## 17 PC17 3.563604e-31 2.096238e-30 100.00000
This suggests that most of the variability in the dataset can be represented using just the first few principal components, while later components contribute very little to explaining additional variance.
fviz_eig(pca_result, choice = "eigenvalue", main = "Scree Plot (Eigenvalues)")
Scree plot of eigenvalues
The cumulative variance plot demonstrates that the first 3-5 components explain nearly all the variance in the dataset, with the cumulative variance exceeding 95% by the fifth component. This provides a strong justification for dimensionality reduction, as retaining only the first few principal components would significantly reduce the complexity of the data while preserving its essential information.
# Plot cumulative variance
plot(cumulative_variance, type = "b", xlab = "Principal Components", ylab = "Cumulative Variance (%)",
main = "Cumulative Variance Explained")
Overall, this analysis reveals the utility of PCA in simplifying high-dimensional data while retaining most of the variability, facilitating further analysis or modeling efforts.
# Biplot for the first two principal components
biplot(pca_result, scale = 0, main = "PCA Biplot")
The biplot illustrates the relationships between the original variables (red arrows) and the observations (black points) in the space defined by the first two principal components (PC1 and PC2), highlighting the contribution of each variable to the principal components.
fviz_pca_var(pca_result, col.var="steelblue")
The PCA variable plot shows that most variables strongly contribute to Dim1 (54.3%), while fewer variables are aligned with Dim2 (29%). Variables with longer arrows are more influential, and their directions indicate correlations (similar directions for positive correlation, opposite for negative
var_contrib_PC1 <- fviz_contrib(pca_result, choice = "var", axes = 1, xtickslab.rt = 90)
var_contrib_PC2 <- fviz_contrib(pca_result, choice = "var", axes = 2, xtickslab.rt = 90)
# Combine and visualize contributions
library(gridExtra)
grid.arrange(var_contrib_PC1, var_contrib_PC2, top = "Variable Contributions to the First Two PCs")
PC1 is strongly influenced by Total, Gender_Male, and
Gender_Female, while PC2 is dominated by population-related variables
like Pop_Female, Pop_Male, and Pop_Total. This shows PC1 focuses on
attendance and wealth, while PC2 captures demographic patterns.
# Clustering with PCA components
pca_scores <- pca_result$x[, 1:2] # First two principal components
clusters <- KMeans_rcpp(pca_scores, clusters = 3, num_init = 5, max_iters = 100)
# Assign cluster labels to the PCA scores
cluster_labels <- clusters$clusters
# Plot clusters
plot(pca_scores, col = cluster_labels, pch = 19,
main = "K-means Clustering on PCA Components",
xlab = "Principal Component 1", ylab = "Principal Component 2")
points(clusters$centroids, col = 1:3, pch = 8, cex = 2) # Centroids
The plot represents the results of K-means clustering applied to the
first two principal components (PC1 and PC2). Data points are grouped
into three distinct clusters, differentiated by colors, with centroids
marked as larger symbols. This clustering highlights patterns and
groupings within the dataset, leveraging the reduced dimensions from PCA
to simplify and enhance data interpretation.
Multidimensional Scaling (MDS) is a technique used to visualize the similarity or dissimilarity of data in a lower-dimensional space. It preserves the pairwise distances between data points as accurately as possible while reducing dimensionality.
# Compute correlations for numerical variables
cor_matrix <- cor(education_matrix, method = "pearson")
# Visualize correlations
corrplot(cor_matrix, method = "color", order = "hclust", tl.cex = 0.6)
The heatmap represents the Pearson correlation matrix for numerical
variables. Darker colors indicate stronger correlations (positive or
negative), with variables clustered hierarchically to reveal patterns.
Strong correlations are observed among wealth-related variables and
attendance metrics, helping identify relationships and redundancies in
the data.
ggpairs(as.data.frame(education_matrix))
The
ggpairs function generates pairwise plots for all
variables in the dataset, including scatterplots, histograms, and
correlation coefficients. This helps visualize relationships,
distributions, and correlations between variables for comprehensive
exploratory data analysis.
This code computes the pairwise distance matrix for the dataset using the Euclidean distance method. The resulting matrix quantifies the dissimilarity between observations, where smaller values indicate higher similarity. The first 5x5 portion of the matrix is displayed for inspection.
# Compute the distance matrix
distance_matrix <- dist(education_matrix, method = "euclidean")
# Convert the distance matrix to a matrix format (optional)
as.matrix(distance_matrix)[1:5, 1:5] # View the first 5x5 distances
## 1 2 3 4 5
## 1 0.000000 7.379434 4.493153 8.038707 4.647702
## 2 7.379434 0.000000 3.863173 1.393947 3.238823
## 3 4.493153 3.863173 0.000000 4.599464 2.297331
## 4 8.038707 1.393947 4.599464 0.000000 3.829134
## 5 4.647702 3.238823 2.297331 3.829134 0.000000
# Perform classical MDS (cmdscale)
mds_result <- cmdscale(distance_matrix, k = 2, eig = TRUE)
# Extract MDS coordinates
mds_coordinates <- mds_result$points
# View the first few coordinates
head(mds_coordinates)
## [,1] [,2]
## [1,] 5.7639955 0.3192705
## [2,] -1.0275262 0.1300539
## [3,] 2.5340411 0.3130165
## [4,] -1.6457330 -0.1591963
## [5,] 1.6019750 -0.7938616
## [6,] 0.8265144 -0.7123181
Stress Value
cat("Stress Value for MDS:", mds_result$stress, "\n")
## Stress Value for MDS:
# Compute Stress Value for MDS
dist_matrix <- dist(education_data)
## Warning in dist(education_data): NAs introduced by coercion
mds_result1 <- smacofSym(dist_matrix, ndim = 3) # MDS in 3d
cat("Stress Value for MDS:", mds_result1$stress, "\n")
## Stress Value for MDS: 0.06284785
The stress value for the Multidimensional Scaling (MDS)
analysis with the initial configuration is 0.1139, which falls within
the fair fit range according to Kruskal’s guidelines. This suggests that
the MDS representation captures much of the original distance
information, but with some degree of distortion. However, when the
number of dimensions is increased to k = 3, the stress value improves
significantly to 0.0628, which is classified as an excellent fit. This
indicates that adding an additional dimension greatly enhances the
ability of the MDS configuration to represent the original data’s
relationships more accurately. This improvement highlights the
importance of selecting an appropriate number of dimensions for MDS to
optimize the trade-off between interpretability and accuracy in the
reduced-dimensional representation.
# Create a data frame and display only the top 10 dimensions
top_15_variance <- data.frame(
Dimension = seq_along(explained_variance),
Eigenvalue = eigenvalues,
Explained_Variance = explained_variance,
Cumulative_Variance = cumulative_variance
)
# Show only the top 10 rows
head(top_15_variance, 15)
## Dimension Eigenvalue Explained_Variance Cumulative_Variance
## 1 1 1042.2194246 5.425400e+01 54.25400
## 2 2 556.9164326 2.899096e+01 83.24497
## 3 3 117.0590411 6.093651e+00 89.33862
## 4 4 94.2339308 4.905462e+00 94.24408
## 5 5 56.6045299 2.946618e+00 97.19070
## 6 6 22.1522832 1.153164e+00 98.34386
## 7 7 8.6770379 4.516938e-01 98.79556
## 8 8 7.8601480 4.091696e-01 99.20473
## 9 9 6.5869386 3.428911e-01 99.54762
## 10 10 2.7876127 1.451126e-01 99.69273
## 11 11 2.4444056 1.272465e-01 99.81998
## 12 12 1.8734754 9.752605e-02 99.91750
## 13 13 1.1941067 6.216068e-02 99.97967
## 14 14 0.3767219 1.961072e-02 99.99928
## 15 15 0.0139110 7.241542e-04 100.00000
Based on the eigenvalues from my data, I can conclude that selecting
2 dimensions (k = 2) captures
83.24% of the total variance, which is a significant
proportion and sufficient for most visualizations and analyses. This
choice strikes a balance between retaining the majority of the variance
and maintaining simplicity in a 2D representation. Expanding to
3 dimensions (k = 3) increases the
cumulative variance explained to 89.34%, adding
approximately 6% more variance. While this additional
variance might be beneficial for tasks requiring higher precision or 3D
visualization, it may not always justify the added complexity. Beyond 3
dimensions, the variance gained becomes negligible, with the 4th
dimension contributing only around 5% more, and even
less for subsequent dimensions.
Therefore, based on the eigenvalues, I conclude that 2 dimensions are optimal for my analysis, providing a practical trade-off between simplicity and meaningful variance retention, while 3 dimensions could be used if additional precision is necessary.
# Plot the MDS coordinates
plot(mds_coordinates[, 1], mds_coordinates[, 2],
xlab = "Dimension 1", ylab = "Dimension 2",
main = "MDS Plot", pch = 21, bg = "blue")
# Add labels to the plot
text(mds_coordinates[, 1], mds_coordinates[, 2],
labels = rownames(education_matrix), cex = 0.7, pos = 3)
non_metric_mds <- smacofSym(distance_matrix, ndim = 2)
# Extract MDS coordinates
non_metric_coordinates <- non_metric_mds$conf
# Stress value
non_metric_mds$stress
## [1] 0.1138877
# Visualize the results
plot(non_metric_coordinates[, 1], non_metric_coordinates[, 2],
xlab = "Dimension 1", ylab = "Dimension 2",
main = "Non-Metric MDS Plot", pch = 21, bg = "red")
# Add labels
text(non_metric_coordinates[, 1], non_metric_coordinates[, 2],
labels = rownames(education_matrix), cex = 0.7, pos = 3)
#Simple plot of the MDS result
plot(fit.data, main = "MDS Plot (Simple)")
# Enhanced plot with the impact of observations on the stress function
plot(
fit.data,
pch = 21,
cex = as.numeric(fit.data$spp), # Symbol size based on stress-per-point (spp)
bg = "red",
main = "MDS Plot with Stress Impact"
)
The code performs Multidimensional Scaling (MDS) to visualize
the relationships between variables in a 2D space based on their
dissimilarities. The Simple MDS Plot shows variable positions, with
closer points indicating higher similarity. The Stress Impact Plot
highlights variables like Urban_percentage and Time_period contributing
significantly to stress, indicating challenges in representing them
accurately in reduced dimensions
if (!is.matrix(mds_coordinates) && !is.data.frame(mds_coordinates)) {
stop("mds_coordinates must be a matrix or data frame with 2D coordinates.")
}
# Perform K-means clustering
set.seed(123) # Ensure reproducibility
kmeans_result <- kmeans(mds_coordinates, centers = 3, nstart = 5)
library(ggplot2)
ggplot(data = as.data.frame(mds_coordinates), aes(x = V1, y = V2, color = factor(kmeans_result$cluster))) +
geom_point(size = 3) +
geom_point(data = as.data.frame(kmeans_result$centers), aes(x = V1, y = V2), size = 5, shape = 8, color = "black") +
labs(title = "K-means Clustering on MDS Coordinates", x = "Dimension 1", y = "Dimension 2") +
theme_minimal()
The plot visualizes the results of K-means clustering applied to MDS coordinates, dividing the data into three clusters (colored points). Cluster centroids are marked with black stars. The clustering highlights distinct groupings in the MDS space, with points within each cluster being relatively similar.
This project showed how dimension reduction techniques can simplify complex educational data. Key points include: