12.1-556-DBScan

Goal of This Analysis

In this report, we use the DBSCAN algorithm to:
- Discover clusters of European countries based on their protein consumption patterns.
- Use DBSCAN’s built-in anomaly detection to flag outlier countries whose diets do not fit any dense cluster.

Load Packages and Data

suppressPackageStartupMessages(library(ggplot2)); theme_set(theme_classic())
suppressPackageStartupMessages(library(factoextra))
suppressPackageStartupMessages(library(fpc))
suppressPackageStartupMessages(library(dbscan))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(knitr))

protein <- read.csv("protein.csv", check.names = TRUE)
names(protein) <- trimws(names(protein))
head(protein)

##          Country RedMeat WhiteMeat Eggs Milk Fish Cereals Starch Nuts Fr.Veg
## 1        Albania    10.1       1.4  0.5  8.9  0.2    42.3    0.6  5.5    1.7
## 2        Austria     8.9      14.0  4.3 19.9  2.1    28.0    3.6  1.3    4.3
## 3        Belgium    13.5       9.3  4.1 17.5  4.5    26.6    5.7  2.1    4.0
## 4       Bulgaria     7.8       6.0  1.6  8.3  1.2    56.7    1.1  3.7    4.2
## 5 Czechoslovakia     9.7      11.4  2.8 12.5  2.0    34.3    5.0  1.1    4.0
## 6        Denmark    10.6      10.8  3.7 25.0  9.9    21.9    4.8  0.7    2.4

str(protein)

## 'data.frame':    25 obs. of  10 variables:
##  $ Country  : chr  "Albania" "Austria" "Belgium" "Bulgaria" ...
##  $ RedMeat  : num  10.1 8.9 13.5 7.8 9.7 10.6 8.4 9.5 18 10.2 ...
##  $ WhiteMeat: num  1.4 14 9.3 6 11.4 10.8 11.6 4.9 9.9 3 ...
##  $ Eggs     : num  0.5 4.3 4.1 1.6 2.8 3.7 3.7 2.7 3.3 2.8 ...
##  $ Milk     : num  8.9 19.9 17.5 8.3 12.5 25 11.1 33.7 19.5 17.6 ...
##  $ Fish     : num  0.2 2.1 4.5 1.2 2 9.9 5.4 5.8 5.7 5.9 ...
##  $ Cereals  : num  42.3 28 26.6 56.7 34.3 21.9 24.6 26.3 28.1 41.7 ...
##  $ Starch   : num  0.6 3.6 5.7 1.1 5 4.8 6.5 5.1 4.8 2.2 ...
##  $ Nuts     : num  5.5 1.3 2.1 3.7 1.1 0.7 0.8 1 2.4 7.8 ...
##  $ Fr.Veg   : num  1.7 4.3 4 4.2 4 2.4 3.6 1.4 6.5 6.5 ...

summary(protein)

##    Country             RedMeat         WhiteMeat           Eggs      
##  Length:25          Min.   : 4.400   Min.   : 1.400   Min.   :0.500  
##  Class :character   1st Qu.: 7.800   1st Qu.: 4.900   1st Qu.:2.700  
##  Mode  :character   Median : 9.500   Median : 7.800   Median :2.900  
##                     Mean   : 9.828   Mean   : 7.896   Mean   :2.936  
##                     3rd Qu.:10.600   3rd Qu.:10.800   3rd Qu.:3.700  
##                     Max.   :18.000   Max.   :14.000   Max.   :4.700  
##       Milk            Fish           Cereals          Starch     
##  Min.   : 4.90   Min.   : 0.200   Min.   :18.60   Min.   :0.600  
##  1st Qu.:11.10   1st Qu.: 2.100   1st Qu.:24.30   1st Qu.:3.100  
##  Median :17.60   Median : 3.400   Median :28.00   Median :4.700  
##  Mean   :17.11   Mean   : 4.284   Mean   :32.25   Mean   :4.276  
##  3rd Qu.:23.30   3rd Qu.: 5.800   3rd Qu.:40.10   3rd Qu.:5.700  
##  Max.   :33.70   Max.   :14.200   Max.   :56.70   Max.   :6.500  
##       Nuts           Fr.Veg     
##  Min.   :0.700   Min.   :1.400  
##  1st Qu.:1.500   1st Qu.:2.900  
##  Median :2.400   Median :3.800  
##  Mean   :3.072   Mean   :4.136  
##  3rd Qu.:4.700   3rd Qu.:4.900  
##  Max.   :7.800   Max.   :7.900

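Interpretation
- Each row is a country.
- Columns (except Country) are daily protein intake (grams/person/day) from:
  - RedMeat, WhiteMeat, Eggs, Milk, Fish
  - Cereals, Starch, Nuts, Fr.Veg (fruits & vegetables)
- All variables share the same unit, so we keep them in their original units for DBSCAN, which keeps interpretation straightforward. Their ranges are not equal, however (Eggs spans 0.5–4.7, Cereals 18.6–56.7), so wide-ranging variables such as Cereals and Milk carry more weight in the Euclidean distances than Eggs or Starch. Scaling is not strictly necessary here, but the clusters should be read with that weighting in mind.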
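The interpretation above argues from the variables' numeric ranges. One way to check how much each variable would drive a Euclidean distance is to compare their spreads. The sketch below is a minimal base-R illustration that uses only the six rows printed by head(protein); the object name demo is hypothetical, and its results describe this subset, not the full 25-country data.

```r
# Six rows copied from the head(protein) output (illustration only)
demo <- rbind(
  Albania        = c(10.1,  1.4, 0.5,  8.9, 0.2, 42.3, 0.6, 5.5, 1.7),
  Austria        = c( 8.9, 14.0, 4.3, 19.9, 2.1, 28.0, 3.6, 1.3, 4.3),
  Belgium        = c(13.5,  9.3, 4.1, 17.5, 4.5, 26.6, 5.7, 2.1, 4.0),
  Bulgaria       = c( 7.8,  6.0, 1.6,  8.3, 1.2, 56.7, 1.1, 3.7, 4.2),
  Czechoslovakia = c( 9.7, 11.4, 2.8, 12.5, 2.0, 34.3, 5.0, 1.1, 4.0),
  Denmark        = c(10.6, 10.8, 3.7, 25.0, 9.9, 21.9, 4.8, 0.7, 2.4)
)
colnames(demo) <- c("RedMeat", "WhiteMeat", "Eggs", "Milk", "Fish",
                    "Cereals", "Starch", "Nuts", "FrVeg")

# Standard deviation per variable: a larger spread means a larger
# influence on unscaled Euclidean distances
spread <- apply(demo, 2, sd)
round(sort(spread, decreasing = TRUE), 2)

# Share of the summed squared pairwise distances due to each variable
# (this equals each variance divided by the total variance)
var_share <- apply(demo, 2, var) / sum(apply(demo, 2, var))
round(sort(var_share, decreasing = TRUE), 2)

# Alternative that this report does not use: standardise every column
# to mean 0 and sd 1 so that all variables weigh equally
demo_scaled <- scale(demo)
```

If one variable holds most of the variance share, the unscaled clustering is largely a clustering on that variable, which is worth knowing when reading the cluster profiles.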

Preparation

# Keep the country names for interpretation

protein <- protein %>%
  rename(FrVeg = `Fr.Veg`)  # for slightly cleaner name in R

# Store country as rownames for convenience

rownames(protein) <- protein$Country

# Numeric data matrix for DBSCAN (drop Country column)

protein_num <- protein %>%
  select(-Country)

head(protein_num)

##                RedMeat WhiteMeat Eggs Milk Fish Cereals Starch Nuts FrVeg
## Albania           10.1       1.4  0.5  8.9  0.2    42.3    0.6  5.5   1.7
## Austria            8.9      14.0  4.3 19.9  2.1    28.0    3.6  1.3   4.3
## Belgium           13.5       9.3  4.1 17.5  4.5    26.6    5.7  2.1   4.0
## Bulgaria           7.8       6.0  1.6  8.3  1.2    56.7    1.1  3.7   4.2
## Czechoslovakia     9.7      11.4  2.8 12.5  2.0    34.3    5.0  1.1   4.0
## Denmark           10.6      10.8  3.7 25.0  9.9    21.9    4.8  0.7   2.4

Interpretation - protein_num is the numeric feature matrix DBSCAN will use. - Row names (countries) will help us understand cluster memberships later.

Choosing DBSCAN Parameters (eps and MinPts)

DBSCAN requires two key parameters:
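- MinPts: minimum number of points required in a neighborhood to form a dense region; as a rule of thumb, at least 3.
- eps: radius of the neighborhood; chosen using a k-distance plot, where we look for the “knee” (strong bend).

Our dataset is small (25 countries) with several dimensions. A reasonable choice is MinPts = 3 (small dataset, but still allows identification of dense regions). We then use the k-NN distance plot with k = MinPts to find a good eps. (The dbscan package documentation suggests k = MinPts - 1, because MinPts counts the point itself; with k = MinPts the plotted distances are slightly larger, so the chosen eps errs on the generous side.)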

MinPts <- 3

dbscan::kNNdistplot(as.matrix(protein_num), k = MinPts)
abline(h = 9, lty = 2, col = "red")

Interpretation
- The k-NN distance plot sorts the points by their distance to the 3rd nearest neighbor.
- We look for a “knee”/elbow where distances start rising more sharply; here it occurs around distance ≈ 9.
- We choose eps = 9 as a reasonable balance:
  - Points with at least MinPts = 3 points inside their eps-neighborhood (radius 9) are core points; core points and the points within eps of them form the dense clusters.
  - Points that are neither core points nor within eps of one become noise/outliers.
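To make the k-distance idea concrete, the following base-R sketch computes by hand the quantity that a k-NN distance plot displays: each point's distance to its k-th nearest neighbor, sorted in increasing order. It uses only the six rows shown by head(protein), so the values differ from the full 25-country plot; the names demo, d, and knn_dist are hypothetical.

```r
# Six rows copied from the head(protein) output (illustration only)
demo <- rbind(
  Albania        = c(10.1,  1.4, 0.5,  8.9, 0.2, 42.3, 0.6, 5.5, 1.7),
  Austria        = c( 8.9, 14.0, 4.3, 19.9, 2.1, 28.0, 3.6, 1.3, 4.3),
  Belgium        = c(13.5,  9.3, 4.1, 17.5, 4.5, 26.6, 5.7, 2.1, 4.0),
  Bulgaria       = c( 7.8,  6.0, 1.6,  8.3, 1.2, 56.7, 1.1, 3.7, 4.2),
  Czechoslovakia = c( 9.7, 11.4, 2.8, 12.5, 2.0, 34.3, 5.0, 1.1, 4.0),
  Denmark        = c(10.6, 10.8, 3.7, 25.0, 9.9, 21.9, 4.8, 0.7, 2.4)
)

k <- 3

# Full matrix of pairwise Euclidean distances (diagonal is 0)
d <- as.matrix(dist(demo))

# For each point: sort its distances, skip the 0 to itself,
# and keep the distance to the k-th nearest other point
knn_dist <- apply(d, 1, function(row) sort(row)[k + 1])

# Sorted k-NN distances: the curve in which one looks for a knee
sort(knn_dist)
```

Plotting sort(knn_dist) against its index gives the same kind of curve as the k-distance plot; a sharp rise marks points whose k-th neighbor is far away, which are the candidates for noise.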

Run

set.seed(123)  # for reproducibility of any internal operations

eps_val <- 9

db_protein <- fpc::dbscan(
  as.matrix(protein_num),
  eps    = eps_val,
  MinPts = MinPts
)

db_protein

## dbscan Pts=25 MinPts=3 eps=9
##        0  1 2 3
## border 5  2 0 2
## seed   0 10 3 3
## total  5 12 3 5

table(db_protein$cluster)

## 
##  0  1  2  3 
##  5 12  3  5

Interpretation:
- db_protein$cluster gives the cluster assignment for each country.
- By DBSCAN convention in this implementation:
  - Cluster 0 = noise / outliers (points that do not belong to any dense region).
  - Clusters 1, 2, 3, … are actual clusters of density-connected points.
- In the printed summary, the “seed” row counts core points and the “border” row counts non-core points (border points of a cluster, or noise in column 0).
- The frequency table shows:
  - One noise group (cluster 0) with 5 countries.
  - Three “proper” clusters with 12, 3, and 5 countries.
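The distinction between core (seed), border, and noise points can be reproduced by hand from a distance matrix. The sketch below is a minimal base-R illustration, not the fpc implementation. It assumes the usual DBSCAN convention that a point counts toward its own neighborhood, and it uses only the six rows shown by head(protein). The value eps = 11.5 is chosen purely so that this small subset shows all three roles; the report itself uses eps = 9 on all 25 countries. The names demo, n_within, and role are hypothetical.

```r
# Six rows copied from the head(protein) output (illustration only)
demo <- rbind(
  Albania        = c(10.1,  1.4, 0.5,  8.9, 0.2, 42.3, 0.6, 5.5, 1.7),
  Austria        = c( 8.9, 14.0, 4.3, 19.9, 2.1, 28.0, 3.6, 1.3, 4.3),
  Belgium        = c(13.5,  9.3, 4.1, 17.5, 4.5, 26.6, 5.7, 2.1, 4.0),
  Bulgaria       = c( 7.8,  6.0, 1.6,  8.3, 1.2, 56.7, 1.1, 3.7, 4.2),
  Czechoslovakia = c( 9.7, 11.4, 2.8, 12.5, 2.0, 34.3, 5.0, 1.1, 4.0),
  Denmark        = c(10.6, 10.8, 3.7, 25.0, 9.9, 21.9, 4.8, 0.7, 2.4)
)

eps     <- 11.5  # demo value for the six-row subset only (assumption)
min_pts <- 3

d <- as.matrix(dist(demo))

# Number of points within eps of each point, the point itself included
n_within <- rowSums(d <= eps)

# Core points have at least min_pts points in their eps-neighborhood
is_core <- n_within >= min_pts

# A non-core point within eps of some core point is a border point;
# anything else is noise
near_core <- rowSums(d[, is_core, drop = FALSE] <= eps) > 0
role <- ifelse(is_core, "core", ifelse(near_core, "border", "noise"))
role
```

On this subset, Austria, Belgium, and Czechoslovakia come out as core points, Denmark as a border point (within eps of Belgium only), and Albania and Bulgaria as noise. The roles in the full analysis differ because the neighborhoods there contain all 25 countries.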

Attach Cluster Labels Back to Countries

protein_clust <- protein %>%
  mutate(cluster = db_protein$cluster)

# View countries by cluster

protein_clust %>%
  arrange(cluster, Country) %>%
  select(Country, cluster)

##                       Country cluster
## Albania               Albania       0
## Finland               Finland       0
## Hungary               Hungary       0
## Portugal             Portugal       0
## Spain                   Spain       0
## Austria               Austria       1
## Belgium               Belgium       1
## Denmark               Denmark       1
## E Germany           E Germany       1
## France                 France       1
## Ireland               Ireland       1
## Netherlands       Netherlands       1
## Norway                 Norway       1
## Sweden                 Sweden       1
## Switzerland       Switzerland       1
## UK                         UK       1
## W Germany           W Germany       1
## Bulgaria             Bulgaria       2
## Romania               Romania       2
## Yugoslavia         Yugoslavia       2
## Czechoslovakia Czechoslovakia       3
## Greece                 Greece       3
## Italy                   Italy       3
## Poland                 Poland       3
## USSR                     USSR       3

Visualizing Clusters

library(ggrepel)

pca <- prcomp(protein_num, scale. = TRUE)

plot_df <- data.frame(
  PC1 = pca$x[,1],
  PC2 = pca$x[,2],
  Country = rownames(protein_num),
  cluster = factor(db_protein$cluster)
)

# Compute cluster "centers" (in PCA space)
cluster_centers <- plot_df %>%
  filter(cluster != "0") %>%
  group_by(cluster) %>%
  summarise(
    cx = mean(PC1),
    cy = mean(PC2)
  )

# Helper: points on a circle of radius r (in PCA units) around a centre
circle_df <- function(center_x, center_y, r = 1.2, n = 200){
  theta <- seq(0, 2*pi, length.out = n)
  data.frame(
    x = center_x + r * cos(theta),
    y = center_y + r * sin(theta)
  )
}

# Build one circle per cluster centre. The radius is purely visual:
# eps_val (9) is measured in the unscaled 9-D space, not in scaled PCA units.
circle_list <- do.call(rbind,
                       lapply(seq_len(nrow(cluster_centers)), function(i){
                         cdat <- circle_df(cluster_centers$cx[i],
                                           cluster_centers$cy[i],
                                           r = 1.2)  # radius in PCA units
                         cdat$cluster <- cluster_centers$cluster[i]
                         cdat
                       }))

# Final simple plot
ggplot(plot_df, aes(PC1, PC2, color = cluster)) +
  geom_point(size = 3) +
  geom_text_repel(aes(label = Country), max.overlaps = 20) +
  geom_path(data = circle_list,
            aes(x = x, y = y, color = cluster, group = cluster),
            linewidth = 0.5, linetype = "dashed") +
  labs(
    title = "DBSCAN Clusters on Protein Consumption",
    subtitle = "Dashed circles mark cluster centres in PCA space (radius is illustrative, not eps)",
    x = "PC1", y = "PC2"
  ) +
  theme_classic()

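Interpretation:
- Each point is a country, plotted on the first two principal components of the standardized features (the 2D projection that preserves the most variance).
- Colors represent DBSCAN clusters:
  - Tight colored groups = dense clusters of similar diets.
  - Points colored as cluster 0 = noise/outliers that belong to no dense group.
- DBSCAN was run on the unscaled 9-dimensional data, while the PCA uses scaled variables, so distances in the plot only approximate the distances used for clustering; groups can look closer together here than they are in the clustering space.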
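How faithful a two-dimensional PCA picture is depends on how much variance the first two components retain. The sketch below shows one way to read that off a prcomp fit in base R. It uses only the six rows shown by head(protein), so the proportions it prints are for this subset and are not the values for the full data; the names demo, pca_demo, and var_explained are hypothetical.

```r
# Six rows copied from the head(protein) output (illustration only)
demo <- rbind(
  Albania        = c(10.1,  1.4, 0.5,  8.9, 0.2, 42.3, 0.6, 5.5, 1.7),
  Austria        = c( 8.9, 14.0, 4.3, 19.9, 2.1, 28.0, 3.6, 1.3, 4.3),
  Belgium        = c(13.5,  9.3, 4.1, 17.5, 4.5, 26.6, 5.7, 2.1, 4.0),
  Bulgaria       = c( 7.8,  6.0, 1.6,  8.3, 1.2, 56.7, 1.1, 3.7, 4.2),
  Czechoslovakia = c( 9.7, 11.4, 2.8, 12.5, 2.0, 34.3, 5.0, 1.1, 4.0),
  Denmark        = c(10.6, 10.8, 3.7, 25.0, 9.9, 21.9, 4.8, 0.7, 2.4)
)
colnames(demo) <- c("RedMeat", "WhiteMeat", "Eggs", "Milk", "Fish",
                    "Cereals", "Starch", "Nuts", "FrVeg")

# PCA on standardised variables, as in the plotting code
pca_demo <- prcomp(demo, scale. = TRUE)

# Proportion of total variance carried by each component
var_explained <- pca_demo$sdev^2 / sum(pca_demo$sdev^2)

# Cumulative share retained by PC1 and PC2, i.e. by the 2D plot
round(cumsum(var_explained)[1:2], 2)
```

The same two lines applied to the pca object from the plotting code would give the share for the full data. The lower that share, the more caution is needed when judging cluster separation from the plot alone.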

Cluster Profiles: What Makes Each Cluster Different?

Let’s compute average protein intake by food group for each cluster.

cluster_summary <- protein_clust %>%
  group_by(cluster) %>%
  summarise(across(where(is.numeric), mean), .groups = "drop")

kable(cluster_summary, digits = 2, caption = "Average Protein Intake by DBSCAN Cluster")

Average Protein Intake by DBSCAN Cluster
cluster	RedMeat	WhiteMeat	Eggs	Milk	Fish	Cereals	Starch	Nuts	FrVeg
0	7.64	5.16	2.06	13.16	5.50	32.98	4.26	4.50	4.48
1	12.00	10.00	3.79	21.12	4.96	23.88	4.73	1.75	3.67
2	6.13	5.77	1.43	9.63	0.93	54.07	2.40	4.90	3.40
3	9.02	6.86	2.66	15.94	3.46	38.50	4.32	3.72	5.34

# To make the interpretation easier, let’s quickly inspect which countries are in each cluster:
countries_by_cluster <- protein_clust %>%
  group_by(cluster) %>%
  summarise(Countries = paste(Country, collapse = ", "))

kable(countries_by_cluster, caption = "Countries in Each DBSCAN Cluster")

Countries in Each DBSCAN Cluster
cluster	Countries
0	Albania, Finland, Hungary, Portugal, Spain
1	Austria, Belgium, Denmark, E Germany, France, Ireland, Netherlands, Norway, Sweden, Switzerland, UK, W Germany
2	Bulgaria, Romania, Yugoslavia
3	Czechoslovakia, Greece, Italy, Poland, USSR

Interpretation

Cluster 1 – High Meat & Dairy, Lower Cereals (Western / Northern Europe)
Members: Austria, Belgium, Denmark, E Germany, France, Ireland, Netherlands, Norway, Sweden, Switzerland, UK, W Germany
- High RedMeat (12.00) & WhiteMeat (10.00): diets rich in animal protein.
- High Milk (21.12): strong dairy consumption.
- Lowest Cereals of all groups (23.88).
- Moderate Fish, low Nuts, moderate Fruits & Vegetables.

Cluster 2 – Cereal-Heavy, Lower Animal Protein (Balkan/Eastern)
Members: Bulgaria, Romania, Yugoslavia
- Lower RedMeat, WhiteMeat, Eggs, Milk, Fish than Cluster 1; Fish is especially low (0.93).
- Very high Cereals (54.07): the dominant protein source.
- Relatively high Nuts (4.90); the lowest Fruits & Vegetables of the three clusters (3.40).

Cluster 3 – Mixed Diet with High Fruits & Vegetables (Central / Mediterranean mix)
Members: Czechoslovakia, Greece, Italy, Poland, USSR
- Moderate meat and milk compared to Cluster 1.
- Medium to high Cereals (38.50).
- Highest Fruits & Vegetables (FrVeg, 5.34) on average.
- Moderate Fish and Nuts.

Outliers (Cluster 0) – Anomalies / Unique Diet Patterns
Outlier countries: Albania, Finland, Hungary, Portugal, Spain
- These countries do not form a group with one another, so the cluster 0 row in the summary table is an average over dissimilar diets rather than a shared profile.
- On average they show moderate to low meat and milk, but the highest Fish (5.50) and relatively high Nuts (4.50).
- Albania sits at the minimum of WhiteMeat (1.4), Eggs (0.5), Fish (0.2), and Starch (0.6) in the data.
- Spain & Portugal may have higher Fish than most other European countries.
- Finland may also show a distinct combination (e.g., relatively high fish + dairy).
- Hungary may have its own balance of cereals, meat, and nuts.
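To see why a particular country ends up as noise, it can help to locate its values within the observed range of each variable. The sketch below does this for Albania in base R, using its row from head(protein) and the Min. and Max. values from summary(protein). The names albania, col_min, col_max, and pos are hypothetical, and the range position is only a simple descriptive device, not part of DBSCAN.

```r
vars <- c("RedMeat", "WhiteMeat", "Eggs", "Milk", "Fish",
          "Cereals", "Starch", "Nuts", "FrVeg")

# Albania's row as printed by head(protein)
albania <- c(10.1, 1.4, 0.5, 8.9, 0.2, 42.3, 0.6, 5.5, 1.7)

# Column minima and maxima as printed by summary(protein)
col_min <- c( 4.4,  1.4, 0.5,  4.9,  0.2, 18.6, 0.6, 0.7, 1.4)
col_max <- c(18.0, 14.0, 4.7, 33.7, 14.2, 56.7, 6.5, 7.8, 7.9)

# Position of each value within the observed range:
# 0 = at the minimum, 1 = at the maximum
pos <- (albania - col_min) / (col_max - col_min)
names(pos) <- vars
round(pos, 2)

# Variables on which Albania holds the lowest value in the data
names(pos)[pos == 0]
```

Albania is at the bottom of the range for WhiteMeat, Eggs, Fish, and Starch at once, which fits its being far from every dense group. Applying the same calculation to the rows of the other noise countries would require their values from the full data set.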