Water is one of the most essential resources for life, but access to clean and safe water is still a major challenge for many people worldwide. This project analyzes water accessibility, quality, and management by studying data from different countries. By looking at important factors like urban and rural water availability, income levels, environmental conditions, and resource management, we can better understand the difficulties countries face in providing safe water to their populations.
Access to clean water depends on many factors, such as natural resources, economic conditions, infrastructure, and government policies. For example, wealthier countries may have better water management systems, while regions with low rainfall might struggle with natural water shortages. Similarly, rural areas often have less access to basic water services than urban areas, leading to inequalities that need to be addressed.
This project uses clustering techniques to group countries with similar water-related characteristics. Clustering organizes data into groups so that countries in the same group share similar challenges or strengths. This helps us identify patterns, such as:
- Countries that rely heavily on non-piped water.
- Regions facing water contamination issues.
- How income levels and urbanization impact water availability.
The insights from this project can guide policymakers, organizations, and governments in making better decisions. For example:
- Regions with poor water quality can be prioritized for infrastructure improvements.
- Countries with low rainfall can adopt water conservation strategies.
- Understanding urban-rural disparities can help distribute resources more fairly and reduce inequalities.
This project combines data from two sources to understand global water access and its connection to climate and socioeconomic factors. The dataset started with 213 records but after cleaning and processing, it now contains 171 records with 41 variables. Here’s a breakdown of the data:
After processing the data, the final dataset contains detailed information about water access, management, climate, and socioeconomic factors for 171 countries. Here are some highlights:
Population and Urbanization:
Population sizes range from 93,800 people to 1.44 billion.
The percentage of people living in cities (urbanization) ranges from
13.6% to 100%, with an average of about 60%.
Water Accessibility:
At.least.basic: This measures the percentage of people with basic water access, ranging from 35% to 100%. On average, 88.9% of people have basic water access.
Limited..more.than.30.mins.: This shows the percentage of people who need over 30 minutes to fetch water, ranging from 0% to 37%.
Unimproved: This tracks the use of unimproved water sources like unprotected wells, ranging from 0% to 33.6%.
Surface.water: The percentage of people relying on surface water, like rivers or lakes, ranges from 0% to 21%.
Rural vs. Urban Disparities:
Rural areas (At.least.basic_RURAL): Basic water access in
rural areas ranges from 13.8% to 100%, with an average of 83.7%.
Urban areas (At.least.basic_URBAN): Urban areas generally
have better access, ranging from 48% to 100%, with an average of
94.8%.
Water Management:
Safely.managed: This measures safe water services and
ranges from 2.8% to 100%, with an average of 69.2%.
Accessible.on.premises: This shows the percentage of water
available directly at homes, ranging from 2.8% to 100%, with an average
of 76.4%.
Free.from.contamination: The percentage of water that is
free from contamination ranges from 10.6% to 100%, with an average of
72.4%.
Climate and Precipitation:
Average.precipitation.in.depth..mm.per.year.: Average
annual rainfall ranges from as low as 18 mm to as high as 3,240 mm, with
an average of 1,139 mm.
Socioeconomic Context:
Income.Groupings: Countries are divided into four groups:
low-income, lower-middle-income, upper-middle-income, and
high-income.
Fragile.contexts: About 31.6% of the countries in the
dataset are classified as fragile or extremely fragile.
This dataset provides a detailed view of water access and its challenges across the globe. It highlights the differences between rural and urban areas, the role of climate, and the impact of socioeconomic conditions, making it a valuable resource for understanding global water issues.
This section explains how I approached the analysis step by step. The process involved preparing the data, selecting the important factors, applying clustering techniques, and validating the results.
The dataset was cleaned to ensure it was ready for analysis:
The Sl and ISO3 columns and the unnamed placeholder columns were dropped because they added no value to the analysis.
selected_data <- water_data %>%
  select(-Sl, -ISO3, -`Unnamed..19`, -`Unnamed..20`, -`Unnamed..21`, -`Unnamed..43`, -`Unnamed..44`, -`Unnamed..45`)
Rows with missing values in Income.Groupings, UrbanPercentage, or Average.precipitation.in.depth..mm.per.year. were removed.
selected_data <- selected_data %>%
  filter(!is.na(UrbanPercentage) & !is.na(Average.precipitation.in.depth..mm.per.year.))
In the Fragile.contexts column, invalid entries (e.g., “-”) were replaced with “Not Fragile” to improve consistency.
selected_data <- selected_data %>%
  mutate(Fragile.contexts = ifelse(Fragile.contexts == "-", "Not Fragile", Fragile.contexts))
Missing values were handled strategically:
Missing values in Accessible.on.premises were replaced with the mean value of their respective Income.Groupings.
selected_data <- selected_data %>%
group_by(Income.Groupings) %>%
mutate(Accessible.on.premises = ifelse(is.na(Accessible.on.premises),
mean(Accessible.on.premises, na.rm = TRUE),
Accessible.on.premises)) %>%
ungroup()
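To illustrate the group-mean imputation on a small scale, here is a minimal sketch using base R's ave(). The toy data frame, its column names, and its values are hypothetical and are not taken from the dataset.

```r
# Toy data (hypothetical values, not from the real dataset)
toy <- data.frame(
  income = c("Low", "Low", "Low", "High", "High"),
  access = c(40, NA, 60, 90, NA)
)

# Replace each NA with the mean of the non-missing values in its income group
toy$access_filled <- ave(
  toy$access, toy$income,
  FUN = function(v) ifelse(is.na(v), mean(v, na.rm = TRUE), v)
)

print(toy)
```

In this toy case the missing "Low" value becomes 50 (the mean of 40 and 60) and the missing "High" value becomes 90.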
Missing values in Safely.managed, Safely.managed_RURAL, and Safely.managed_URBAN were filled with the corresponding Accessible.on.premises values (overall, rural, and urban).
selected_data <- selected_data %>%
mutate(
Safely.managed = ifelse(is.na(Safely.managed), Accessible.on.premises, Safely.managed),
Safely.managed_RURAL = ifelse(is.na(Safely.managed_RURAL), Accessible.on.premises_RURAL, Safely.managed_RURAL),
Safely.managed_URBAN = ifelse(is.na(Safely.managed_URBAN), Accessible.on.premises_URBAN, Safely.managed_URBAN)
)
In the Non.piped column, missing values were replaced with 100 - Piped to maintain data integrity.
selected_data <- selected_data %>%
mutate(
Non.piped = ifelse(is.na(Non.piped), 100 - Piped, Non.piped)
)
Additional steps enhanced the dataset for clustering:
Missing values in Available.when.needed were replaced with the minimum of Safely.managed and Accessible.on.premises, reflecting practical water accessibility.
selected_data <- selected_data %>%
mutate(
Available.when.needed = ifelse(
is.na(Available.when.needed),
pmin(Safely.managed, Accessible.on.premises, na.rm = TRUE),
Available.when.needed
)
)
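The element-wise minimum used for this fallback can be checked on a few made-up numbers. This is only a sketch; the vectors below are hypothetical and simply show how pmin() behaves with na.rm = TRUE.

```r
# Hypothetical percentages for four countries
safely_managed <- c(50, NA, 80, NA)
on_premises    <- c(60, 70, NA, NA)

# Element-wise minimum; with na.rm = TRUE a single NA is ignored,
# and the result is NA only when both inputs are NA
fallback <- pmin(safely_managed, on_premises, na.rm = TRUE)
print(fallback)
```

Rows where both source columns are missing therefore stay missing after this step.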
Categorical variables (SDG.Region, Income.Groupings, and Fragile.contexts) were converted into numeric values for clustering.
selected_data <- selected_data %>%
mutate(
SDG.Region = case_when(
SDG.Region == "Central and Southern Asia" ~ 1,
SDG.Region == "Europe and Northern America" ~ 2,
SDG.Region == "Northern Africa and Western Asia" ~ 3,
SDG.Region == "Sub-Saharan Africa" ~ 4,
SDG.Region == "Latin America and the Caribbean" ~ 5,
SDG.Region == "Australia and New Zealand" ~ 6,
SDG.Region == "Eastern and South-Eastern Asia" ~ 7,
SDG.Region == "Oceania" ~ 8,
TRUE ~ NA_real_
),
Income.Groupings = case_when(
Income.Groupings == "Low income" ~ 1,
Income.Groupings == "Lower middle income" ~ 2,
Income.Groupings == "Upper middle income" ~ 3,
Income.Groupings == "High income" ~ 4,
TRUE ~ NA_real_
),
Fragile.contexts = case_when(
Fragile.contexts == "Fragile or Extremely Fragile" ~ 1,
Fragile.contexts == "Not Fragile" ~ 0,
TRUE ~ NA_real_
)
)
columns_to_exclude <- c("COUNTRY..AREA.OR.TERRITORY", "SDG.Region", "Fragile.contexts", "Income.Groupings")
selected_data <- selected_data %>%
mutate(across(.cols = everything(),
.fns = ~ if (cur_column() %in% columns_to_exclude) {
.
} else {
suppressWarnings(as.numeric(gsub(">99", "100", gsub("<1", "0", .))))
}))
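The conversion step above turns censored text entries into numbers before coercion. A small sketch of the same nested gsub() call on a hypothetical character vector shows the effect:

```r
# Hypothetical raw entries as they might appear in a text column
raw <- c("<1", "50.5", ">99", "-")

# "<1" becomes 0, ">99" becomes 100, everything else is parsed as a number;
# entries that cannot be parsed (such as "-") become NA
cleaned <- suppressWarnings(as.numeric(gsub(">99", "100", gsub("<1", "0", raw))))
print(cleaned)
```

Because suppressWarnings() hides the coercion warning, it is worth counting the resulting NA values afterwards to confirm that only expected placeholders were lost.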
All numeric columns were standardized to ensure equal importance in clustering.
# Standardize the data (each column rescaled to mean 0 and standard deviation 1)
scaled_data <- water_access_data %>%
select(-COUNTRY..AREA.OR.TERRITORY) %>%
scale()
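As a quick check of what scale() does, the sketch below standardizes a small hypothetical data frame and verifies that each column ends up with mean 0 and standard deviation 1 (it does not rescale values to the 0 to 1 range).

```r
# Hypothetical values for three countries
toy <- data.frame(basic = c(50, 75, 100), income = c(1, 2, 4))

# scale() centers each column and divides by its standard deviation
scaled <- scale(toy)

col_means <- colMeans(scaled)
col_sds   <- apply(scaled, 2, sd)
print(round(col_means, 10))
print(col_sds)
```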
# Load required libraries
library(tidyverse)
library(cluster) # For clustering
library(factoextra) # For visualization
library(ggplot2)
library(conflicted)
library(dplyr)
After preparing the data, I used clustering to group countries based on their similarities.
Clustering is a way to organize countries into groups that share similar water-related challenges.
K-Means Clustering: This method divides countries into a fixed number of groups based on their similarities. It’s fast and works well with standardized data.
Hierarchical Clustering: This technique builds a hierarchy of clusters, which helps visualize how countries are grouped at different levels.
CLARA (Clustering Large Applications): This method is useful for handling larger datasets efficiently by focusing on smaller samples of data.
These techniques allowed me to identify patterns in the data and to compare how consistently the methods group the same countries.
This methodology is flexible and can be used to study various aspects of water access and management. For example, it can analyze Water Resource Management to see how well countries manage their water systems, identify reliance on piped versus non-piped water, and highlight contamination issues. It can also explore Urban vs. Rural Disparities to understand where rural areas lag behind urban ones in water access, helping policymakers address these gaps.
Additionally, it can look at Climate and Environmental Factors, like rainfall and fragile ecosystems, to prioritize regions needing conservation efforts. Lastly, it can examine Socioeconomic Groupings to uncover how income, urbanization, and population affect water availability.
For this project, I’m focusing only on Water Access and Quality, grouping countries based on clean water availability. This helps identify regions at risk and provides actionable insights into improving water access and management.
Columns Used:
columns_to_use <- c("At.least.basic","Income.Groupings","Fragile.contexts")
water_access_data <- data %>%
select(COUNTRY..AREA.OR.TERRITORY, all_of(columns_to_use))
The elbow method helps us figure out how many clusters (k) work best for the data. It looks at how the “Within Sum of Squares” (WSS) changes as the number of clusters increases. The plot shows that as we add more clusters, the WSS decreases, but after a certain point, the improvement slows down.
From the graph, the “elbow” is at k = 5, meaning 5 clusters are a good choice. This keeps the clusters meaningful and easy to interpret without making them too complicated.
fviz_nbclust(scaled_data, kmeans, method = "wss") +
labs(title = "Elbow Method to Determine Optimal Clusters")
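The quantity behind the elbow plot can also be computed directly. The following is a minimal sketch on synthetic data (three artificial groups, not the water dataset) that records the total within-cluster sum of squares for each k using base R's kmeans().

```r
set.seed(1)

# Synthetic data: three well-separated groups of 30 points each
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6
ks  <- 1:6
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)

print(data.frame(k = ks, wss = round(wss, 1)))
```

For data like this, the drop in WSS is steep up to k = 3 and small afterwards, which is the bend the elbow method looks for.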
The kmeans() function is applied to the scaled dataset, specifying optimal_k = 5 (determined from the Elbow Method). This assigns each country to one of the five clusters based on its water access features.
set.seed(477654)
optimal_k <- 5
kmeans_result <- kmeans(scaled_data, centers = optimal_k, nstart = 25)
The cluster assignments are added as a new column in the original dataset for further analysis and visualization.
water_access_data <- water_access_data %>%
mutate(Cluster = as.factor(kmeans_result$cluster))
Visualization with PCA
To make the clusters easier to understand, Principal Component Analysis (PCA) is used to simplify the data into two dimensions. Each color represents a cluster, indicating distinct groupings of countries.
pca_result <- prcomp(scaled_data, center = TRUE, scale. = TRUE)
pca_data <- data.frame(pca_result$x[, 1:2])
pca_data$Cluster <- as.factor(kmeans_result$cluster)
ggplot(pca_data, aes(x = PC1, y = PC2, color = Cluster)) +
geom_point(size = 2) +
labs(title = "K-Means Clustering (Euclidean Distance)",
x = "Principal Component 1",
y = "Principal Component 2") +
theme_minimal()
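When reading a two-dimensional PCA plot, it helps to know how much of the total variance the first two components retain. The sketch below shows one way to compute that share with prcomp(); the toy data frame is hypothetical.

```r
# Hypothetical data: columns a and b are strongly correlated, c is not
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4.1, 5.9, 8.2, 10),
  c = c(5, 3, 6, 2, 4)
)

pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(cumsum(var_explained), 3))
```

The second cumulative value is the share of variance shown in a PC1 versus PC2 scatter plot.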
Evaluating Cluster Quality with Silhouette Analysis
Silhouette analysis helps check how well each point fits in its cluster compared to others. The average silhouette width is 0.68, showing that most points are well-matched to their clusters, with clear separation.
The red dashed line on the plot marks this average, and most clusters are well-defined, confirming the 5-cluster solution is a good choice.
library(cluster)
sil <- silhouette(kmeans_result$cluster, dist(scaled_data))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1   38          0.77
## 2       2   28          0.49
## 3       3   53          0.91
## 4       4   26          0.44
## 5       5   26          0.49
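To make the silhouette width concrete: for each point, a is the mean distance to the other points in its own cluster, b is the smallest mean distance to the points of any other cluster, and the width is (b - a) / max(a, b). The sketch below computes this by hand in base R for a hypothetical one-dimensional example with two clusters.

```r
# Hypothetical one-dimensional data with two obvious clusters
x  <- c(1, 2, 3, 10, 11, 12)
cl <- c(1, 1, 1, 2, 2, 2)
d  <- as.matrix(dist(x))

sil_width <- sapply(seq_along(x), function(i) {
  own <- cl == cl[i]
  own[i] <- FALSE                      # exclude the point itself
  a <- mean(d[i, own])                 # cohesion: mean distance within own cluster
  others <- setdiff(unique(cl), cl[i])
  b <- min(sapply(others, function(g) mean(d[i, cl == g])))  # nearest other cluster
  (b - a) / max(a, b)
})

print(round(sil_width, 3))
print(round(mean(sil_width), 3))
```

Values close to 1 indicate points that sit much nearer to their own cluster than to the next one, which is how the per-cluster averages in the output above should be read.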
Visualizing Clusters in 2D
fviz_cluster(kmeans_result, data = scaled_data)
Each cluster is represented by a different color and shape, making it easy to see how the data points are distributed. From the plot, the clusters are well-separated, indicating that the K-Means clustering effectively grouped countries with similar water access features.
To check the silhouette value for different values of k,
I used this code.
# Try different values for k and compute silhouette scores
silhouette_scores <- c()
for (k in 2:6) {
kmeans_result_n <- kmeans(scaled_data, centers = k, nstart = 25)
sil <- silhouette(kmeans_result_n$cluster, dist(scaled_data))
avg_sil <- mean(sil[, 3])
silhouette_scores <- c(silhouette_scores, avg_sil)
}
# Plot silhouette scores for different k values
plot(2:6, silhouette_scores, type = "b", pch = 19,
xlab = "Number of Clusters (k)",
ylab = "Average Silhouette Score",
main = "Silhouette Score vs. Number of Clusters")
From the output, we can see that the silhouette score is highest at
k=5, indicating that the clustering works best with 5
clusters. This confirms that selecting k=5 was a good
choice for this analysis.
To check how the clustering performs with a different algorithm, I used PAM (Partitioning Around Medoids) as an alternative to K-Means. PAM is more robust to outliers because it selects representative points (medoids) from the dataset to define clusters.
pam_result <- pam(scaled_data, k = 5)
pam_clusters <- pam_result$clustering
# Add cluster assignments to the dataset
clustered_data <- data.frame(water_access_data, Cluster = as.factor(pam_clusters))
pca_result <- prcomp(scaled_data, center = TRUE, scale. = TRUE)
pca_data <- data.frame(pca_result$x[, 1:2])
pca_data$Cluster <- as.factor(pam_clusters)
# Visualize clusters
library(ggplot2)
ggplot(pca_data, aes(x = PC1, y = PC2, color = Cluster)) +
geom_point(size = 2) +
labs(title = "PAM Clustering Results",
x = "Principal Component 1",
y = "Principal Component 2") +
theme_minimal()
The cluster distribution looks quite similar to K-Means, but there are
some slight differences because PAM uses medoids instead of
centroids.
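The difference between medoids and centroids can be seen on a tiny example. This sketch assumes the cluster package is installed and uses hypothetical one-dimensional data with one extreme value.

```r
library(cluster)  # provides pam()

# Hypothetical data with one extreme value (30)
x <- matrix(c(1, 2, 3, 10, 11, 30), ncol = 1)

set.seed(123)
km <- kmeans(x, centers = 2, nstart = 10)
pm <- pam(x, k = 2)

# K-Means centers are averages and need not be observed values
print(km$centers)

# PAM medoids are always actual rows of the data
print(pm$medoids)
print(pm$id.med)   # row indices of the medoids
```

Because each medoid is a real observation, a PAM cluster can be summarized by naming its representative country, which is not possible with an averaged centroid.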
Silhouette Analysis
silhouette_scores <- silhouette(pam_clusters, dist(scaled_data))
library(factoextra)
fviz_silhouette(silhouette_scores)
##   cluster size ave.sil.width
## 1       1   27          0.43
## 2       2   38          0.77
## 3       3   26          0.49
## 4       4   27          0.50
## 5       5   53          0.91
The silhouette plot shows an average score of 0.67, meaning the clusters are fairly well-separated. Cluster 5 stands out with the best cohesion, while Cluster 1 has more variation within its points.
Comparison with K-Means:
Cluster Quality: The average silhouette width for K-Means was 0.68, slightly higher than PAM’s 0.67. This suggests that K-Means provides marginally better-defined clusters for this dataset.
CLARA (Clustering Large Applications) is applied here as an alternative clustering method for handling larger datasets efficiently. It uses a sampling approach to find representative clusters, reducing computational overhead while maintaining accuracy.
clara_result <- clara(scaled_data, k = 5, samples = 5, sampsize = 50)
clara_clusters <- clara_result$clustering
clustered_data <- data.frame(water_access_data, Cluster = as.factor(clara_clusters))
pca_result <- prcomp(scaled_data, center = TRUE, scale. = TRUE)
pca_data <- data.frame(pca_result$x[, 1:2]) # Take the first 2 principal components
pca_data$Cluster <- as.factor(clara_clusters)
# Plot clusters
library(ggplot2)
ggplot(pca_data, aes(x = PC1, y = PC2, color = Cluster)) +
geom_point(size = 2) +
labs(title = "CLARA Clustering Results",
x = "Principal Component 1",
y = "Principal Component 2") +
theme_minimal()
silhouette_scores <- silhouette(clara_clusters, dist(scaled_data))
library(factoextra)
fviz_silhouette(silhouette_scores)
##   cluster size ave.sil.width
## 1       1   28          0.41
## 2       2   38          0.77
## 3       3   26          0.49
## 4       4   26          0.50
## 5       5   53          0.91
While CLARA is suitable for larger datasets due to its efficiency, K-Means performed better, with an average silhouette width of 0.68 and stronger overall cluster cohesion compared to CLARA’s 0.61.
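Here, K-Means is applied to the pairwise Manhattan distance matrix instead of the standardized features, so each country is represented by its Manhattan distances to every other country. Note that kmeans() still minimizes squared Euclidean distance between these rows, so this is a Manhattan-based variant rather than a K-Means that optimizes Manhattan distance directly.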
manhattan_dist <- dist(scaled_data, method = "manhattan")
set.seed(477654)
kmeans_manhattan <- kmeans(as.matrix(manhattan_dist), centers = 5, nstart = 25)
pca_data$Cluster <- as.factor(kmeans_manhattan$cluster)
ggplot(pca_data, aes(x = PC1, y = PC2, color = Cluster)) +
geom_point(size = 2) +
labs(title = "K-Means Clustering (Manhattan Distance)",
x = "Principal Component 1",
y = "Principal Component 2") +
theme_minimal()
sil_manhattan <- silhouette(kmeans_manhattan$cluster, manhattan_dist)
fviz_silhouette(sil_manhattan)
##   cluster size ave.sil.width
## 1       1   38          0.80
## 2       2   25          0.51
## 3       3   28          0.26
## 4       4   27          0.33
## 5       5   53          0.93
The silhouette plot shows an average silhouette width of 0.64, which is lower than the K-Means with Euclidean distance (0.68).
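For comparison, a partitioning method that works with Manhattan distance directly is PAM with metric = "manhattan". This is a different technique from the one used above and is shown only as a sketch on synthetic data; it assumes the cluster package is installed.

```r
library(cluster)  # provides pam() and silhouette()

set.seed(42)

# Synthetic data: two well-separated groups of 20 points each
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 4, sd = 0.5), ncol = 2)
)

# PAM assigns points to medoids using Manhattan distance
pam_manhattan <- pam(toy, k = 2, metric = "manhattan")

# Evaluate with the same metric that was used for clustering
sil <- silhouette(pam_manhattan$clustering, dist(toy, method = "manhattan"))
avg_sil <- mean(sil[, "sil_width"])
print(round(avg_sil, 2))
```

Using the same distance for both clustering and silhouette evaluation keeps the quality score consistent with the objective the algorithm optimized.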
Hierarchical clustering is a technique that groups data into a tree-like structure (dendrogram) based on their similarity. The dendrogram visualizes how clusters are formed by merging smaller clusters iteratively.
# Hierarchical clustering using Ward's method
hc <- hclust(dist(scaled_data), method = "ward.D2")
# Cut tree into 5 clusters
hc_clusters <- cutree(hc, k = 5)
# Visualize dendrogram
plot(hc, labels = FALSE, main = "Dendrogram of Hierarchical Clustering")
rect.hclust(hc, k = 5, border = "red")
# Visualize using PCA
pca_data$Cluster <- as.factor(hc_clusters)
ggplot(pca_data, aes(x = PC1, y = PC2, color = Cluster)) +
geom_point(size = 2) +
labs(title = "Hierarchical Clustering",
x = "Principal Component 1",
y = "Principal Component 2") +
theme_minimal()
Compared to K-Means, hierarchical clustering offers more flexibility in visualizing the clustering process, because the same tree can be cut at different heights without rerunning the algorithm.
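One practical consequence of the tree structure is that a single hclust() result can be cut at several values of k, and the resulting clusters are nested. The sketch below illustrates this on synthetic data (three artificial groups, not the water dataset).

```r
set.seed(7)

# Synthetic data: three well-separated groups of 10 points each
toy <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 6, sd = 0.3), ncol = 2)
)

hc_toy <- hclust(dist(toy), method = "ward.D2")

# Cut the same tree at k = 2, 3, and 4; columns are named "2", "3", "4"
cuts <- cutree(hc_toy, k = 2:4)

# Cross-tabulate the k = 3 and k = 2 solutions: each k = 3 cluster
# falls entirely inside one k = 2 cluster
nesting <- table(k3 = cuts[, "3"], k2 = cuts[, "2"])
print(nesting)
```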
This code adds cluster labels to each country based on the K-Means results, making it easier to understand the data. The columns are renamed for clarity. It then filters the data to focus on Cluster 1 and lists the countries in this group.
clustered_data <- water_access_data %>%
mutate(Cluster = as.factor(kmeans_result$cluster))
# Rename long column names for better readability
colnames(clustered_data) <- c("Country", "Basic", "Income", "Fragile", "Cluster")
# Filter data for Cluster 1
cluster_1_data <- clustered_data %>%
dplyr::filter(Cluster == 1)
# Print the data for Cluster 1
print(cluster_1_data)
## Country Basic Income Fragile Cluster
## 1 Albania 95.09456 3 0 1
## 2 Argentina 96.74032 3 0 1
## 3 Armenia 100.00000 3 0 1
## 4 Azerbaijan 97.64219 3 0 1
## 5 Belarus 100.00000 3 0 1
## 6 Belize 98.40523 3 0 1
## 7 Bosnia and Herzegovina 96.09347 3 0 1
## 8 Botswana 92.56756 3 0 1
## 9 Brazil 100.00000 3 0 1
## 10 Bulgaria 100.00000 3 0 1
## 11 China 97.64693 3 0 1
## 12 Colombia 97.53776 3 0 1
## 13 Costa Rica 100.00000 3 0 1
## 14 Cuba 94.65270 3 0 1
## 15 Dominican Republic 96.75452 3 0 1
## 16 Ecuador 95.70334 3 0 1
## 17 Fiji 95.48863 3 0 1
## 18 Gabon 86.92777 3 0 1
## 19 Georgia 94.97754 3 0 1
## 20 Guyana 95.85974 3 0 1
## 21 Jamaica 91.09963 3 0 1
## 22 Jordan 98.96939 3 0 1
## 23 Kazakhstan 96.74032 3 0 1
## 24 Malaysia 97.15646 3 0 1
## 25 Maldives 100.00000 3 0 1
## 26 Mauritius 100.00000 3 0 1
## 27 Mexico 100.00000 3 0 1
## 28 Namibia 85.91211 3 0 1
## 29 North Macedonia 97.84256 3 0 1
## 30 Paraguay 100.00000 3 0 1
## 31 Peru 94.80581 3 0 1
## 32 Republic of Moldova 92.02427 3 0 1
## 33 Russian Federation 97.05160 3 0 1
## 34 Saint Lucia 96.88866 3 0 1
## 35 South Africa 94.49209 3 0 1
## 36 Suriname 97.99301 3 0 1
## 37 Thailand 100.00000 3 0 1
## 38 Türkiye 97.02619 3 0 1
Iterate Through Clusters:
The code below groups the data by cluster and prints the first five countries in each group, which helps analyze similarities and patterns within each cluster.
clustered_data <- water_access_data %>%
mutate(Cluster = as.factor(kmeans_result$cluster))
# Rename long column names for better readability
colnames(clustered_data) <- c("Country", "Basic", "Income", "Fragile", "Cluster")
# Take the first 5 countries from each cluster
sampled_data <- clustered_data %>%
group_by(Cluster) %>%
slice_head(n = 5) # Select the first 5 rows from each cluster
print(sampled_data, n = 25)
## # A tibble: 25 × 5
## # Groups: Cluster [5]
## Country Basic Income Fragile Cluster
## <chr> <dbl> <int> <int> <fct>
## 1 Albania 95.1 3 0 1
## 2 Argentina 96.7 3 0 1
## 3 Armenia 100 3 0 1
## 4 Azerbaijan 97.6 3 0 1
## 5 Belarus 100 3 0 1
## 6 Angola 57.7 2 1 2
## 7 Benin 67.4 2 1 2
## 8 Burkina Faso 49.5 1 1 2
## 9 Burundi 62.4 1 1 2
## 10 Cameroon 69.6 2 1 2
## 11 Antigua and Barbuda 98.4 4 0 3
## 12 Australia 100 4 0 3
## 13 Austria 100 4 0 3
## 14 Bahrain 100 4 0 3
## 15 Barbados 98.5 4 0 3
## 16 Afghanistan 82.2 1 1 4
## 17 Bangladesh 98.1 2 1 4
## 18 Cambodia 78.0 2 1 4
## 19 Côte d'Ivoire 72.9 2 1 4
## 20 Democratic People's Republic of Korea 93.9 1 1 4
## 21 Algeria 94.7 2 0 5
## 22 Bhutan 100 2 0 5
## 23 Bolivia (Plurinational State of) 94.1 2 0 5
## 24 Cabo Verde 89.9 2 0 5
## 25 Egypt 98.8 2 0 5
The number of countries in each cluster
The clustering results group countries based on similar features like Basic access to services, Income levels, and Fragility. By looking at these groups, we can see patterns and identify challenges that specific groups of countries face.
Cluster 1: Countries in this cluster have limited access to basic services, as reflected in their low “Basic” scores. These regions are predominantly fragile, facing challenges like political instability, conflict, and weak infrastructure. Many belong to low-income or lower-middle-income categories.
Cluster 2: This cluster consists of high-income, developed countries with perfect or near-perfect scores in “Basic” and no fragility concerns. These nations have well-established infrastructure and stable socio-political conditions.
Cluster 3: Primarily comprising upper-middle-income countries, this cluster has high “Basic” scores and low fragility. These nations are stable but not as economically advanced as Cluster 2.
Cluster 4: Countries in this cluster show moderate scores in “Basic” and fragility, reflecting their status as developing economies transitioning toward greater stability.
Cluster 5: This cluster includes low-income countries with moderate-to-high fragility but better “Basic” scores than Cluster 1. These regions are fragile yet show signs of progress in basic access and income levels.
Percentage distribution of countries across clusters
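The percentage distribution can be recomputed from the cluster sizes reported in the K-Means silhouette output (38, 28, 53, 26, and 26 countries). The sketch below uses those reported sizes, with clusters numbered as in that output; the variable names are illustrative.

```r
# Cluster sizes as reported in the K-Means silhouette output
cluster_sizes <- c("1" = 38, "2" = 28, "3" = 53, "4" = 26, "5" = 26)

# Share of countries in each cluster, in percent
shares <- round(100 * prop.table(cluster_sizes), 1)

print(sum(cluster_sizes))
print(shares)
```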
The clustering analysis helps policymakers, organizations, and governments identify regional disparities and tailor interventions to specific needs:
Cluster 1: Requires significant support to address fragility and improve access to basic services.
Cluster 2: Can focus on supporting less developed nations.
Cluster 4: Needs targeted development programs to address transitional challenges.
Better Resource Allocation: Funds and resources can be distributed efficiently based on each cluster’s specific needs (e.g., humanitarian aid for Cluster 1, capacity-building for Cluster 4).
Prioritized Interventions: Efforts focused on improving fragility and access in Clusters 1 and 5 can yield measurable progress.
Policy Design: The clusters offer insights into shared socio-economic challenges, enabling policymakers to craft evidence-based strategies for sustainable growth.
This project shows how clustering techniques can help solve real-world problems. It creates a strong base for future research, such as focusing on specific regional issues or adding more factors to the analysis.