In this report, we use the DBSCAN algorithm to:
- Discover clusters of European countries based on their protein consumption patterns.
- Use DBSCAN’s built-in noise label to flag outlier countries whose diets do not fit any dense cluster.
suppressPackageStartupMessages(library(ggplot2)); theme_set(theme_classic())
suppressPackageStartupMessages(library(factoextra))
suppressPackageStartupMessages(library(fpc))
suppressPackageStartupMessages(library(dbscan))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(knitr))
protein <- read.csv("protein.csv", check.names = TRUE)
names(protein) <- trimws(names(protein))
head(protein)
## Country RedMeat WhiteMeat Eggs Milk Fish Cereals Starch Nuts Fr.Veg
## 1 Albania 10.1 1.4 0.5 8.9 0.2 42.3 0.6 5.5 1.7
## 2 Austria 8.9 14.0 4.3 19.9 2.1 28.0 3.6 1.3 4.3
## 3 Belgium 13.5 9.3 4.1 17.5 4.5 26.6 5.7 2.1 4.0
## 4 Bulgaria 7.8 6.0 1.6 8.3 1.2 56.7 1.1 3.7 4.2
## 5 Czechoslovakia 9.7 11.4 2.8 12.5 2.0 34.3 5.0 1.1 4.0
## 6 Denmark 10.6 10.8 3.7 25.0 9.9 21.9 4.8 0.7 2.4
str(protein)
## 'data.frame': 25 obs. of 10 variables:
## $ Country : chr "Albania" "Austria" "Belgium" "Bulgaria" ...
## $ RedMeat : num 10.1 8.9 13.5 7.8 9.7 10.6 8.4 9.5 18 10.2 ...
## $ WhiteMeat: num 1.4 14 9.3 6 11.4 10.8 11.6 4.9 9.9 3 ...
## $ Eggs : num 0.5 4.3 4.1 1.6 2.8 3.7 3.7 2.7 3.3 2.8 ...
## $ Milk : num 8.9 19.9 17.5 8.3 12.5 25 11.1 33.7 19.5 17.6 ...
## $ Fish : num 0.2 2.1 4.5 1.2 2 9.9 5.4 5.8 5.7 5.9 ...
## $ Cereals : num 42.3 28 26.6 56.7 34.3 21.9 24.6 26.3 28.1 41.7 ...
## $ Starch : num 0.6 3.6 5.7 1.1 5 4.8 6.5 5.1 4.8 2.2 ...
## $ Nuts : num 5.5 1.3 2.1 3.7 1.1 0.7 0.8 1 2.4 7.8 ...
## $ Fr.Veg : num 1.7 4.3 4 4.2 4 2.4 3.6 1.4 6.5 6.5 ...
summary(protein)
## Country RedMeat WhiteMeat Eggs
## Length:25 Min. : 4.400 Min. : 1.400 Min. :0.500
## Class :character 1st Qu.: 7.800 1st Qu.: 4.900 1st Qu.:2.700
## Mode :character Median : 9.500 Median : 7.800 Median :2.900
## Mean : 9.828 Mean : 7.896 Mean :2.936
## 3rd Qu.:10.600 3rd Qu.:10.800 3rd Qu.:3.700
## Max. :18.000 Max. :14.000 Max. :4.700
## Milk Fish Cereals Starch
## Min. : 4.90 Min. : 0.200 Min. :18.60 Min. :0.600
## 1st Qu.:11.10 1st Qu.: 2.100 1st Qu.:24.30 1st Qu.:3.100
## Median :17.60 Median : 3.400 Median :28.00 Median :4.700
## Mean :17.11 Mean : 4.284 Mean :32.25 Mean :4.276
## 3rd Qu.:23.30 3rd Qu.: 5.800 3rd Qu.:40.10 3rd Qu.:5.700
## Max. :33.70 Max. :14.200 Max. :56.70 Max. :6.500
## Nuts Fr.Veg
## Min. :0.700 Min. :1.400
## 1st Qu.:1.500 1st Qu.:2.900
## Median :2.400 Median :3.800
## Mean :3.072 Mean :4.136
## 3rd Qu.:4.700 3rd Qu.:4.900
## Max. :7.800 Max. :7.900
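Interpretation
- Each row is a country.
- Columns (except Country) are daily protein intake (grams/person/day) from:
  - RedMeat, WhiteMeat, Eggs, Milk, Fish
  - Cereals, Starch, Nuts, Fr.Veg (fruits & vegetables)
- All variables share the same unit, so we keep them in their original units for DBSCAN, which keeps interpretation straightforward.
- Their ranges differ, though: Cereals spans 18.6–56.7 while Eggs spans 0.5–4.7, so unscaled Euclidean distances are driven mostly by the wide-range variables (Cereals and Milk). Scaling is not strictly necessary here, but the clusters below should be read with that weighting in mind.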
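To make the point about ranges concrete, here is a minimal sketch that uses only the six rows printed by head(protein) above, not the full 25-country file. The names `sample_rows`, `col_var`, and `var_share` are illustrative and do not appear elsewhere in the report. A column's share of the total variance equals its share of the total squared pairwise Euclidean distance, so it indicates how strongly that column drives unscaled distances.

```r
# Six rows copied from the head(protein) output above; not the full dataset.
sample_rows <- data.frame(
  RedMeat   = c(10.1, 8.9, 13.5, 7.8, 9.7, 10.6),
  WhiteMeat = c(1.4, 14.0, 9.3, 6.0, 11.4, 10.8),
  Eggs      = c(0.5, 4.3, 4.1, 1.6, 2.8, 3.7),
  Milk      = c(8.9, 19.9, 17.5, 8.3, 12.5, 25.0),
  Fish      = c(0.2, 2.1, 4.5, 1.2, 2.0, 9.9),
  Cereals   = c(42.3, 28.0, 26.6, 56.7, 34.3, 21.9),
  Starch    = c(0.6, 3.6, 5.7, 1.1, 5.0, 4.8),
  Nuts      = c(5.5, 1.3, 2.1, 3.7, 1.1, 0.7),
  FrVeg     = c(1.7, 4.3, 4.0, 4.2, 4.0, 2.4),
  row.names = c("Albania", "Austria", "Belgium", "Bulgaria",
                "Czechoslovakia", "Denmark")
)

# Share of total variance per column: columns with a large share dominate
# unscaled Euclidean distances.
col_var   <- sapply(sample_rows, var)
var_share <- round(col_var / sum(col_var), 3)
print(sort(var_share, decreasing = TRUE))

# After standardising, every column has unit variance and contributes equally.
scaled_rows <- scale(sample_rows)
print(round(apply(scaled_rows, 2, var), 3))
```

On these six rows, Cereals carries the largest share. The same check on the full `protein_num` matrix would show whether that holds for all 25 countries.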
# Keep the country names for interpretation
protein <- protein %>%
rename(FrVeg = `Fr.Veg`) # for slightly cleaner name in R
# Store country as rownames for convenience
rownames(protein) <- protein$Country
# Numeric data matrix for DBSCAN (drop Country column)
protein_num <- protein %>%
select(-Country)
head(protein_num)
## RedMeat WhiteMeat Eggs Milk Fish Cereals Starch Nuts FrVeg
## Albania 10.1 1.4 0.5 8.9 0.2 42.3 0.6 5.5 1.7
## Austria 8.9 14.0 4.3 19.9 2.1 28.0 3.6 1.3 4.3
## Belgium 13.5 9.3 4.1 17.5 4.5 26.6 5.7 2.1 4.0
## Bulgaria 7.8 6.0 1.6 8.3 1.2 56.7 1.1 3.7 4.2
## Czechoslovakia 9.7 11.4 2.8 12.5 2.0 34.3 5.0 1.1 4.0
## Denmark 10.6 10.8 3.7 25.0 9.9 21.9 4.8 0.7 2.4
Interpretation
- protein_num is the numeric feature matrix DBSCAN will use.
- Row names (countries) will help us understand cluster memberships later.
MinPts <- 3
dbscan::kNNdistplot(as.matrix(protein_num), k = MinPts)
abline(h = 9, lty = 2, col = "red")
Interpretation
- The k-NN distance plot shows each point’s distance to its 3rd nearest neighbor, sorted in increasing order.
- We look for a “knee”/elbow where distances start rising more sharply; here it occurs around distance ≈ 9.
- We choose eps = 9 as a reasonable balance:
  - A point with at least MinPts = 3 points within distance 9 (under the usual DBSCAN definition, the point itself counts) is a core point, and core points define the dense clusters.
  - Points that are neither core points nor within distance 9 of a core point become noise/outliers.
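The quantity behind a k-NN distance plot can be reproduced in base R. The sketch below runs on a small hypothetical 2-D data set (`toy`), not on the protein data, and is meant only to show what is being sorted. It does not reproduce the internals of `dbscan::kNNdistplot`.

```r
# Toy 2-D data (hypothetical): two tight groups plus one isolated point.
toy <- rbind(
  c(0, 0), c(0, 1), c(1, 0), c(1, 1),
  c(10, 10), c(10, 11), c(11, 10), c(11, 11),
  c(30, 30)
)
k <- 3

# Full pairwise Euclidean distance matrix.
d <- as.matrix(dist(toy))

# For each point, sort its distances. Position 1 is the point itself
# (distance 0), so position k + 1 is the distance to its k-th nearest neighbour.
knn_dist <- apply(d, 1, function(row) sort(row)[k + 1])

# Sorted in increasing order, these values form the curve of a k-NN distance plot.
sorted_knn <- sort(knn_dist)
print(round(sorted_knn, 2))
```

Eight of the nine values are small and nearly equal, and the isolated point produces the jump at the end. That jump is the “knee” an eps threshold is meant to sit just below. Calling `plot(sorted_knn, type = "l")` draws the curve.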
set.seed(123) # for reproducibility of any internal operations
eps_val <- 9
db_protein <- fpc::dbscan(
as.matrix(protein_num),
eps = eps_val,
MinPts = MinPts
)
db_protein
## dbscan Pts=25 MinPts=3 eps=9
## 0 1 2 3
## border 5 2 0 2
## seed 0 10 3 3
## total 5 12 3 5
table(db_protein$cluster)
##
## 0 1 2 3
## 5 12 3 5
Interpretation:
- db_protein$cluster gives the cluster assignment for each country.
- By DBSCAN convention in this implementation:
  - Cluster 0 = noise / outliers (points that do not belong to any dense region).
  - Clusters 1, 2, 3 are actual clusters of density-connected points.
- In the printed summary, the seed row counts core points and the border row counts the remaining points; the 5 entries under column 0 are the noise points.
- The frequency table shows:
  - 5 countries labelled as noise (cluster 0).
  - Three proper clusters with 12, 3, and 5 countries.
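The core / border / noise distinction can be illustrated with a few lines of base R. This is a sketch on hypothetical toy data with illustrative names (`toy`, `eps`, `min_pts`, `role`). It follows the textbook DBSCAN definitions and is not a reimplementation of `fpc::dbscan`.

```r
# Toy data (hypothetical): one dense group, one point on its edge, one far away.
toy <- rbind(c(0, 0), c(0, 1), c(1, 0), c(1, 1), c(2.2, 1), c(8, 8))
eps     <- 1.5
min_pts <- 3

d <- as.matrix(dist(toy))

# Number of points within eps of each point, the point itself included.
n_within <- rowSums(d <= eps)

# Core points have at least min_pts points in their eps-neighbourhood.
is_core <- n_within >= min_pts

# Border points are not core but lie within eps of at least one core point.
near_core <- apply(d[, is_core, drop = FALSE] <= eps, 1, any)
is_border <- !is_core & near_core

# Everything else is noise.
role <- ifelse(is_core, "core", ifelse(is_border, "border", "noise"))
print(data.frame(x = toy[, 1], y = toy[, 2], n_within, role))
```

The four corner points are core, the point at (2.2, 1) is a border point because it is close to a core point but has too few neighbours of its own, and the point at (8, 8) is noise.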
protein_clust <- protein %>%
mutate(cluster = db_protein$cluster)
# View countries by cluster
protein_clust %>%
arrange(cluster, Country) %>%
select(Country, cluster)
## Country cluster
## Albania Albania 0
## Finland Finland 0
## Hungary Hungary 0
## Portugal Portugal 0
## Spain Spain 0
## Austria Austria 1
## Belgium Belgium 1
## Denmark Denmark 1
## E Germany E Germany 1
## France France 1
## Ireland Ireland 1
## Netherlands Netherlands 1
## Norway Norway 1
## Sweden Sweden 1
## Switzerland Switzerland 1
## UK UK 1
## W Germany W Germany 1
## Bulgaria Bulgaria 2
## Romania Romania 2
## Yugoslavia Yugoslavia 2
## Czechoslovakia Czechoslovakia 3
## Greece Greece 3
## Italy Italy 3
## Poland Poland 3
## USSR USSR 3
library(ggrepel)
pca <- prcomp(protein_num, scale. = TRUE)
plot_df <- data.frame(
PC1 = pca$x[,1],
PC2 = pca$x[,2],
Country = rownames(protein_num),
cluster = factor(db_protein$cluster)
)
# Compute cluster "centers" (in PCA space)
cluster_centers <- plot_df %>%
filter(cluster != "0") %>%
group_by(cluster) %>%
summarise(
cx = mean(PC1),
cy = mean(PC2)
)
# eps circle function
circle_df <- function(center_x, center_y, r = eps_val, n = 200){
theta <- seq(0, 2*pi, length.out = n)
data.frame(
x = center_x + r * cos(theta),
y = center_y + r * sin(theta)
)
}
# build circles for each cluster center
circle_list <- do.call(rbind,
lapply(seq_len(nrow(cluster_centers)), function(i){
cdat <- circle_df(cluster_centers$cx[i],
cluster_centers$cy[i],
r = 1.2) # radius in PCA units
cdat$cluster <- cluster_centers$cluster[i]
cdat
}))
# Final simple plot
ggplot(plot_df, aes(PC1, PC2, color = cluster)) +
geom_point(size = 3) +
geom_text_repel(aes(label = Country), max.overlaps = 20) +
geom_path(data = circle_list,
aes(x = x, y = y, color = cluster, group = cluster),
linewidth = 0.5, linetype = "dashed") +
labs(
title = "DBSCAN Clusters on Protein Consumption",
subtitle = "Dashed circles: radius 1.2 around cluster means in PCA space (visual guide, not eps)",
x = "PC1", y = "PC2"
) +
theme_classic()
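Interpretation:
- Each point is a country, plotted on the first two principal components of the standardized features (prcomp with scale. = TRUE), the 2D projection that preserves the most variance.
- DBSCAN was run on the unscaled data, so distances in this plot do not correspond exactly to the distances used for clustering, and clusters may overlap visually.
- Colors represent DBSCAN clusters:
  - Tight colored groups = dense clusters of similar diets.
  - Points colored as cluster 0 = noise/outliers, which do not belong to any dense group.
- The dashed circles have a fixed radius of 1.2 PC units around each cluster’s mean position. They are a visual guide and do not show the eps = 9 neighborhoods.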
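How faithful a two-component plot is depends on how much variance those components retain. The sketch below illustrates the check on the six rows printed by head(protein_num) above, so its numbers say nothing about the full 25-country PCA. The names `sample_rows`, `pca_sample`, `d_raw`, and `d_pc` are illustrative.

```r
# Six rows copied from the head(protein_num) output above; not the full dataset.
sample_rows <- matrix(
  c(10.1,  1.4, 0.5,  8.9, 0.2, 42.3, 0.6, 5.5, 1.7,
     8.9, 14.0, 4.3, 19.9, 2.1, 28.0, 3.6, 1.3, 4.3,
    13.5,  9.3, 4.1, 17.5, 4.5, 26.6, 5.7, 2.1, 4.0,
     7.8,  6.0, 1.6,  8.3, 1.2, 56.7, 1.1, 3.7, 4.2,
     9.7, 11.4, 2.8, 12.5, 2.0, 34.3, 5.0, 1.1, 4.0,
    10.6, 10.8, 3.7, 25.0, 9.9, 21.9, 4.8, 0.7, 2.4),
  ncol = 9, byrow = TRUE,
  dimnames = list(
    c("Albania", "Austria", "Belgium", "Bulgaria", "Czechoslovakia", "Denmark"),
    c("RedMeat", "WhiteMeat", "Eggs", "Milk", "Fish",
      "Cereals", "Starch", "Nuts", "FrVeg")
  )
)

# PCA on standardised columns, as in the report's plotting code.
pca_sample <- prcomp(sample_rows, scale. = TRUE)

# Cumulative proportion of variance retained by the leading components.
var_explained <- pca_sample$sdev^2 / sum(pca_sample$sdev^2)
print(round(cumsum(var_explained), 3))

# Distances from Albania in the raw units (what DBSCAN used) versus
# distances in the 2-D PCA plane (what the plot shows).
d_raw <- as.matrix(dist(sample_rows))
d_pc  <- as.matrix(dist(pca_sample$x[, 1:2]))
print(round(d_raw["Albania", ], 1))
print(round(d_pc["Albania", ], 2))
```

Running the same two steps on the full `protein_num` and `pca` objects would show how much of the structure the published plot actually captures.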
Let’s compute average protein intake by food group for each cluster.
cluster_summary <- protein_clust %>%
group_by(cluster) %>%
summarise(across(where(is.numeric), mean), .groups = "drop")
kable(cluster_summary, digits = 2, caption = "Average Protein Intake by DBSCAN Cluster")
| cluster | RedMeat | WhiteMeat | Eggs | Milk | Fish | Cereals | Starch | Nuts | FrVeg |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.64 | 5.16 | 2.06 | 13.16 | 5.50 | 32.98 | 4.26 | 4.50 | 4.48 |
| 1 | 12.00 | 10.00 | 3.79 | 21.12 | 4.96 | 23.88 | 4.73 | 1.75 | 3.67 |
| 2 | 6.13 | 5.77 | 1.43 | 9.63 | 0.93 | 54.07 | 2.40 | 4.90 | 3.40 |
| 3 | 9.02 | 6.86 | 2.66 | 15.94 | 3.46 | 38.50 | 4.32 | 3.72 | 5.34 |
# To make the interpretation easier, let’s quickly inspect which countries are in each cluster:
countries_by_cluster <- protein_clust %>%
group_by(cluster) %>%
summarise(Countries = paste(Country, collapse = ", "))
kable(countries_by_cluster, caption = "Countries in Each DBSCAN Cluster")
| cluster | Countries |
|---|---|
| 0 | Albania, Finland, Hungary, Portugal, Spain |
| 1 | Austria, Belgium, Denmark, E Germany, France, Ireland, Netherlands, Norway, Sweden, Switzerland, UK, W Germany |
| 2 | Bulgaria, Romania, Yugoslavia |
| 3 | Czechoslovakia, Greece, Italy, Poland, USSR |
Cluster 1 – High Meat & Dairy, Lower Cereals (Western / Northern Europe)
Members: Austria, Belgium, Denmark, E Germany, France, Ireland, Netherlands, Norway, Sweden, Switzerland, UK, W Germany
- High RedMeat (12.00) & WhiteMeat (10.00): diets rich in animal protein.
- High Milk (21.12): strong dairy consumption.
- Lower Cereals (23.88) compared with other clusters.
- Moderate Fish (4.96), low Nuts (1.75), moderate Fruits & Vegetables (3.67).
Cluster 2 – Cereal-Heavy, Lower Animal Protein (Balkan/Eastern)
Members: Bulgaria, Romania, Yugoslavia
- Lower RedMeat, WhiteMeat, Eggs, Milk, Fish than Cluster 1.
- Very high Cereals (54.07), the dominant protein source.
- Relatively high Nuts (4.90), moderate Fruits & Vegetables (3.40).
Cluster 3 – Mixed Diet with High Fruits & Vegetables (Central / Mediterranean mix)
Members: Czechoslovakia, Greece, Italy, Poland, USSR
- Moderate meat and milk compared to Cluster 1.
- Medium to high Cereals (38.50).
- Highest Fruits & Vegetables (FrVeg, 5.34) on average.
- Moderate Fish and Nuts.
Outliers (Cluster 0) – Anomalies / Unique Diet Patterns
Outlier countries: Albania, Finland, Hungary, Portugal, Spain
These countries do not form a cohesive group, so their averages in the cluster summary are only a rough guide:
- RedMeat (7.64) and Milk (13.16) are below the Cluster 1 and Cluster 3 averages.
- Fish (5.50) is the highest average of any group, and Nuts (4.50) is also high.
- Spain & Portugal may have higher Fish than most other European countries.
- Finland may also show distinct combinations (e.g., relatively high fish + dairy).
- Albania sits at the dataset minimum for WhiteMeat (1.4), Eggs (0.5), Fish (0.2), and Starch (0.6), as a comparison of its row with summary(protein) shows.
- Hungary may likewise have an unusual balance of cereals, meat, and nuts.
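One simple way to see why a single country stands apart is to compare its row with the overall means. The sketch below does this for Albania, using only numbers already printed above: its row from head(protein) and the Mean line of each variable in summary(protein). The names `overall_mean`, `albania`, and `ratio` are illustrative.

```r
# Overall means copied from the summary(protein) output above.
overall_mean <- c(RedMeat = 9.828, WhiteMeat = 7.896, Eggs = 2.936,
                  Milk = 17.11, Fish = 4.284, Cereals = 32.25,
                  Starch = 4.276, Nuts = 3.072, FrVeg = 4.136)

# Albania's row copied from the head(protein) output above.
albania <- c(RedMeat = 10.1, WhiteMeat = 1.4, Eggs = 0.5, Milk = 8.9,
             Fish = 0.2, Cereals = 42.3, Starch = 0.6, Nuts = 5.5,
             FrVeg = 1.7)

# Ratio to the overall mean: values far from 1 mark the unusual variables.
ratio <- round(albania / overall_mean, 2)
print(sort(ratio))
```

Fish, Starch, Eggs, and WhiteMeat come out well below the average and Nuts well above it, which fits Albania being labelled as noise. The same comparison could be repeated for the other four outliers using their rows in the full data.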