1. Objective
2. Data Loading, Cleaning and Feature Engineering
3. Clustering and PCA
4. Association Rule: Wine Share
5. Conclusions

1. Objective

The objective of this project is to verify and present wine production across countries worldwide and to analyse what proportion of total grape production is allocated to wine production, taking into account the vineyard area of each country. The analysis aims to understand structural differences between countries and to evaluate how efficiently vineyard resources are transformed into wine over time.

The dataset used in this project comes from the International Organisation of Vine and Wine (OIV): https://www.oiv.int/what-we-do/data-discovery-report?oiv. It covers the period from 1995 to 2025 and provides comprehensive information about the global grape industry. The dataset includes data on production, consumption, imports and exports for fresh grapes, dried grapes, table grapes and wine. Additionally, it contains information on vineyard area for each reporting country.

For the purposes of this analysis, the focus is placed primarily on the wine-to-grapes production ratio and on the productivity of harvested vineyard area over time. The broader trade and consumption variables are not the central subject of this study, although they form the structural background of the dataset.

The analysis integrates all three topics covered in our classes:

Clustering
Dimension reduction
Association rule analysis

2. Data Loading, Cleaning and Feature Engineering

2.1 Libraries

These libraries were used in the analysis:

# Libraries
library(tidyverse)
library(broom)
library(lubridate)
library(cluster)
library(leaflet)
library(rnaturalearth)
library(rnaturalearthdata)
library(sf)
library(countrycode)
library(ggplot2)
library(ggrepel)
library(GGally)
library(dplyr)
library(fmsb)
library(reshape2)

2.2 Data Cleaning

First, the dataset was filtered to maintain observations within the 1995–2025 period. Next, only production-related variables were selected, and irrelevant product categories were excluded from the main analytical dataset. Character variables were converted into numeric format where necessary, and missing values were handled either through removal or aggregation, depending on their structural relevance.

Country names were standardized to ensure consistency across datasets, especially for later spatial visualization. In addition, the data were aggregated at the country level in order to compute long-term averages and growth indicators. These steps were conducted to ensure comparability between countries and to eliminate inconsistencies that could distort clustering results.

# input_path (file location) and sep_symbol (field delimiter) are assumed to be defined beforehand
ds_raw <- read_delim(input_path, delim = sep_symbol, locale = locale(encoding = "latin1"),
                     col_types = cols(.default = "c"))

cat("Original rows:", nrow(ds_raw), "columns:", ncol(ds_raw), "\n")

## Original rows: 59704 columns: 7

cat("Column names:", paste(colnames(ds_raw), collapse = ", "), "\n\n")

## Column names: Continent, Region/Country, Product, Variable, Year, Unit, Quantity

# normalise column names and tidy up text fields
ds <- ds_raw %>%
  rename(
    Continent = 1,
    Region_Country = 2,
    Product = 3,
    Variable = 4,
    Year = 5,
    Unit = 6,
    Quantity = 7
  ) %>%
  mutate(
    Product = str_squish(Product),
    Variable = str_squish(Variable),
    Region_Country = str_squish(Region_Country),
    Unit = str_squish(Unit),
    Year = as.integer(as.numeric(Year))
  )

# standardise product names
ds <- ds %>%
  mutate(
    Product_std = case_when(
      str_detect(str_to_lower(Product), "\\bwine\\b") ~ "Wine",
      str_detect(str_to_lower(Product), "vineyard") ~ "Vineyard",
      str_detect(str_to_lower(Product), "grape") ~ "Grapes",
      TRUE ~ Product
    ),
    Variable_std = str_to_lower(Variable)
  )

# filter: wine production, grapes production, vineyard surface area
ds_prod <- ds %>%
  filter(
    (Product_std == "Wine" & Variable_std == "production") |
      (Product_std == "Vineyard" & str_detect(Variable_std, "surface")) |
      (Product_std == "Grapes" & Variable_std == "production")
  ) %>%
  # min_year and max_year are assumed to be defined beforehand (1995 and 2025 for this study)
  filter(!is.na(Year), Year >= min_year & Year <= max_year)

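# clean quantity: strip whitespace and stray symbols, convert decimal commas
clean_quantity <- function(q) {
  q2 <- q %>%
    str_replace_all("\\s+", "") %>%
    str_replace_all(",", ".") %>%
    str_replace_all("[^0-9\\.\\-]", "")
  # as.numeric() yields NA for empty or non-numeric leftovers such as "" or "."
  suppressWarnings(as.numeric(q2))
}

# clean_quantity() is vectorised, so it can be applied to the whole column at once
ds_prod <- ds_prod %>%
  mutate(
    Quantity_num = clean_quantity(Quantity),
    Unit = if_else(is.na(Unit) | Unit == "", "unknown", Unit)
  )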

2.3 Feature Engineering

Several new variables were created to better capture the structure and dynamics of national grape industries. These include:

mean_wine – average wine production
mean_grapes – average grape production
mean_vineyard – average vineyard area
sd_wine, sd_grapes, sd_vineyard – volatility measures
growth_wine, growth_grapes, growth_vineyard – long-term growth rates
wine_per_vineyard, grapes_per_vineyard – efficiency indicators (production per unit of vineyard area)
wine_grape_ratio – share of grape production allocated to wine

These were the main variables analyzed in this project.
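The exact formulas behind these variables are not shown in the report, so the following is only a minimal sketch of how such country-level features could be derived. It assumes a hypothetical wide table `yearly` with one row per country and year, defines long-term growth as the relative change between the first and last available observation, and takes `wine_grape_ratio` as mean wine divided by mean grapes. All three choices are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical wide-format input: one row per country and year.
yearly <- data.frame(
  Region_Country = rep(c("A", "B"), each = 3),
  Year     = rep(c(1995, 2010, 2025), times = 2),
  wine     = c(100, 110, 120,  10,  8,  5),
  grapes   = c(200, 210, 240,  50, 40, 25),
  vineyard = c( 50,  50,  60,  20, 20, 10)
)

# Relative change between the first and last non-missing observation
# (assumed definition of "long-term growth").
rel_growth <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) < 2 || x[1] == 0) return(NA_real_)
  (x[length(x)] - x[1]) / x[1]
}

country_summary_sketch <- yearly %>%
  arrange(Region_Country, Year) %>%
  group_by(Region_Country) %>%
  summarise(
    mean_wine       = mean(wine, na.rm = TRUE),
    mean_grapes     = mean(grapes, na.rm = TRUE),
    mean_vineyard   = mean(vineyard, na.rm = TRUE),
    sd_wine         = sd(wine, na.rm = TRUE),
    growth_wine     = rel_growth(wine),
    growth_vineyard = rel_growth(vineyard),
    .groups = "drop"
  ) %>%
  mutate(
    wine_per_vineyard = mean_wine / mean_vineyard,  # efficiency indicator
    wine_grape_ratio  = mean_wine / mean_grapes     # assumed definition
  )

country_summary_sketch
```

The remaining `sd_*` and `growth_*` columns would follow the same pattern.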

3. Clustering and PCA

3.1 Preparation and Cluster Selection

Before clustering, all structural variables were standardized to ensure equal contribution to the distance metric. The Elbow Method and Silhouette Analysis were used to evaluate the optimal number of clusters.

country_summary <- country_summary %>%
  mutate(has_vineyard = if_else(!is.na(mean_vineyard) & mean_vineyard > 0, 1, 0))

group_A <- country_summary %>% filter(has_vineyard == 1) 
group_B <- country_summary %>% filter(has_vineyard == 0)

feat_vars <- c("mean_wine", "mean_grapes", "mean_vineyard",
               "sd_wine", "sd_grapes", "sd_vineyard",
               "growth_wine", "growth_grapes", "growth_vineyard",
               "wine_per_vineyard", "grapes_per_vineyard")

feat_df <- country_summary %>%
  select(Region_Country, has_vineyard, all_of(feat_vars))

cat("Total countries:", nrow(feat_df), "\n")

## Total countries: 219

cat("Countries with vineyard:", sum(feat_df$has_vineyard == 1), "\n")

## Countries with vineyard: 96

cat("Countries without vineyard:", sum(feat_df$has_vineyard == 0), "\n")

## Countries without vineyard: 123

# clustering matrix
group_A_features <- group_A %>%
  mutate(
    log_mean_wine = log1p(mean_wine),
    log_mean_grapes = log1p(mean_grapes),
    log_mean_vineyard = log1p(mean_vineyard),
    log_sd_wine = log1p(sd_wine),
    log_sd_grapes = log1p(sd_grapes),
    log_sd_vineyard = log1p(sd_vineyard)
  ) %>%
  select(
    Region_Country,
    log_mean_wine,
    log_mean_grapes,
    log_mean_vineyard,
    log_sd_wine,
    log_sd_grapes,
    log_sd_vineyard,
    growth_wine,
    growth_grapes,
    growth_vineyard,
    wine_per_vineyard,
    grapes_per_vineyard)

# remove remaining NAs
group_A_features <- group_A_features %>% drop_na()

country_names <- group_A_features$Region_Country

scaled_matrix <- group_A_features %>% select(-Region_Country) %>% scale()

The silhouette analysis suggested that three clusters would not be sufficient and that a larger number would fit the data better, so the three-cluster solution was not examined further. The choice between four and five clusters remained uncertain; therefore, a PCA projection was used to check which option was more appropriate.
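The code behind the Elbow Method and the Silhouette Analysis is not included in the report. The sketch below shows one common way to run both checks, assuming k-means as the clustering algorithm (the report does not state which algorithm was used) and a simulated matrix `toy` standing in for `scaled_matrix`.

```r
library(cluster)  # silhouette()

set.seed(42)
# Hypothetical stand-in for scaled_matrix: three well-separated groups.
toy <- scale(rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 6), ncol = 2),
  cbind(rnorm(30, mean = 0), rnorm(30, mean = 6))
))

ks <- 2:6
d  <- dist(toy)  # Euclidean distances, needed for the silhouette widths

# Elbow: total within-cluster sum of squares for each k.
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)

# Silhouette: mean silhouette width for each k.
sil <- sapply(ks, function(k) {
  cl <- kmeans(toy, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})

data.frame(k = ks, wss = round(wss, 1), avg_silhouette = round(sil, 3))
best_k <- ks[which.max(sil)]
```

On real data the two criteria often disagree, as they did here between four and five clusters, which is why a visual check on the PCA projection is a reasonable tie-breaker.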

3.2 PCA (Dimensionality Reduction)

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5    PC6     PC7
## Standard deviation     2.2977 1.4499 1.1773 0.99619 0.85652 0.4931 0.38887
## Proportion of Variance 0.4799 0.1911 0.1260 0.09022 0.06669 0.0221 0.01375
## Cumulative Proportion  0.4799 0.6711 0.7971 0.88730 0.95399 0.9761 0.98984
##                            PC8    PC9    PC10    PC11
## Standard deviation     0.27608 0.1554 0.08047 0.06992
## Proportion of Variance 0.00693 0.0022 0.00059 0.00044
## Cumulative Proportion  0.99677 0.9990 0.99956 1.00000

The principal component analysis shows that the first component explains 48% of the total variance, while the second explains 19%. Together, the first two components account for approximately 67% of the total variance, and the first three explain nearly 80%. In socio-economic datasets, explaining around two-thirds of total variance with two components is generally considered satisfactory, especially when the variables represent complex structural systems. As the aim was to avoid overcomplicating the project, two components were selected.
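The percentages quoted above can be reproduced directly from the standard deviations in the summary table: each component's variance is its squared standard deviation, and for 11 standardized variables these variances sum to roughly 11. The short check below uses only the values printed in the table; the eigenvalue-greater-than-one count (Kaiser criterion) is added purely for comparison and is not a rule used in the report.

```r
# Standard deviations copied from the "Importance of components" table.
sdev <- c(2.2977, 1.4499, 1.1773, 0.99619, 0.85652, 0.4931,
          0.38887, 0.27608, 0.1554, 0.08047, 0.06992)

eigenvalues <- sdev^2                        # variance carried by each component
prop_var    <- eigenvalues / sum(eigenvalues)
cum_var     <- cumsum(prop_var)

round(prop_var[1:3], 4)   # share of variance for PC1 to PC3
round(cum_var[1:3], 4)    # cumulative share for PC1 to PC3

# Components with eigenvalue above 1 (Kaiser criterion, for comparison only).
n_kaiser <- sum(eigenvalues > 1)
```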

The clustering was based on structural size variables (wine production, grape production, vineyard area), variability measures, growth dynamics, and efficiency ratios. PCA confirms that the first principal component captures overall production scale, as mean values and standard deviations load strongly and positively. The second principal component reflects growth dynamics, particularly changes in grape production and vineyard area. The third component further differentiates structural composition, especially in terms of efficiency ratios.

When testing five clusters, significant overlap between groups appeared, which reduced interpretability. Therefore, four clusters were selected as a balanced solution that provides sufficient structural depth without overcomplicating interpretation. Based on the four clusters, additional visualizations were prepared to better understand the data.

3.3 Parallel Coordinates Plot

The parallel coordinates plot further confirms the structural separation between clusters. It is clearly visible that the two selected components had the most difficulty in explaining the growth rates. The first variables represent structural size and are strongly associated with log-transformed mean production, volatility measures, and production per area. These variables are well correlated with the two selected PCA dimensions. Variables related to growth largely fall outside the 67% of variance explained by the chosen principal components. For this reason, a heat map was plotted.

3.4 Cluster Heatmap

The heat map visually confirms that clusters are separated primarily along the size dimension and the growth dimension. It clearly presents what the parallel plot could not fully explain. Based on the heat map, differences between clusters in terms of growth can be better understood. To avoid repetition, a detailed explanation of the heat map is provided in the next section. Combined with the other visual analyses, this is sufficient to draw conclusions.
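The heat map code is not part of the report. A cluster heat map of this kind is typically built from a matrix of per-cluster mean z-scores, which could be computed as in the sketch below. The feature values and cluster labels here are simulated and hypothetical; only the two column names are borrowed from the report.

```r
set.seed(1)
# Hypothetical standardised features and cluster labels.
features <- scale(data.frame(
  log_mean_wine   = c(rnorm(10, 8), rnorm(10, 3)),
  growth_vineyard = c(rnorm(10, -0.1, 0.05), rnorm(10, 0.4, 0.05))
))
cluster <- rep(c(1, 2), each = 10)

# One row per cluster, one column per feature: mean z-score.
profile <- aggregate(as.data.frame(features),
                     by = list(Cluster = cluster), FUN = mean)
profile_mat <- as.matrix(profile[, -1])
rownames(profile_mat) <- paste("Cluster", profile$Cluster)

round(profile_mat, 2)
# Optional plot of the profile matrix:
# heatmap(profile_mat, Rowv = NA, Colv = NA, scale = "none")
```

Positive cells mark features where a cluster sits above the overall average, negative cells below it, which is how the size and growth dimensions become visible side by side.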

3.5 Clustering Analysis

Cluster 1 represents large, mature and slightly declining producers. These countries present very high average wine production (approximately 490 hl per year), extremely high grape production (around 55 000 tons per year), and very large vineyard areas. Growth rates are slightly negative, suggesting structural maturity and mild contraction. These are established wine economies with developed infrastructure and limited expansion potential. Examples include Spain, Italy, China, South Africa, Brazil, Peru and Canada. This group reflects industrial-scale production systems.
Cluster 2 consists of medium-sized producers experiencing structural decline. These countries show moderate vineyard area and grape production but relatively low wine output in relation to vineyard size. Growth rates for grapes and vineyard area are strongly negative, which suggests restructuring or a decreasing importance of viticulture. Some countries in this cluster, such as Egypt, Saudi Arabia and Yemen, face cultural or religious constraints that limit wine production. Other representative countries include Venezuela, Namibia, Kazakhstan, Slovakia and Morocco. The cluster also contains one extreme outlier, Serbia and Montenegro, whose growth is far higher than that of the other countries in the cluster and indeed in the whole dataset.

##          Region_Country       PC1     PC2 Cluster distance_from_center
## 1 Serbia And Montenegro 0.3503303 7.85909       2             7.201962

Cluster 3 represents expanding or emerging producers. These countries demonstrate moderate production levels in past years but strong positive growth in wine, grapes and vineyard area. The grapes_per_vineyard ratio is the highest among clusters, indicating structural intensification. Countries such as Belarus, Colombia, India, Cuba and Vietnam belong to this group. This cluster reflects investment-driven expansion and modernization.
Cluster 4 includes very small producers with minimal wine output and small vineyard areas. Although growth rates are slightly positive, the structural base is extremely limited. Countries such as Sweden, the Netherlands, the Philippines, Zimbabwe, Ecuador and Honduras belong to this category. These systems are economically marginal and often focused on niche or domestic markets.

Some countries form an NA cluster due to missing or extreme values. They are structurally outside the main system.
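The outlier row reported for Cluster 2 includes a `distance_from_center` column. The report does not show how it was obtained; one plausible reading is the Euclidean distance between a country's PCA scores and the centroid of its cluster in the PC1-PC2 plane. The sketch below illustrates that reading on hypothetical scores and should not be taken as the exact computation used.

```r
# Hypothetical PCA scores with cluster labels; the last point is a deliberate outlier.
scores <- data.frame(
  Region_Country = c("A", "B", "C", "D", "E"),
  PC1     = c(0.0,  0.2, -0.2, 0.1, 0.4),
  PC2     = c(0.1, -0.1,  0.0, 0.2, 7.8),
  Cluster = c(2, 2, 2, 2, 2)
)

# Cluster centroids in the PC1-PC2 plane (ave() returns the group mean per row).
scores$center_PC1 <- ave(scores$PC1, scores$Cluster)
scores$center_PC2 <- ave(scores$PC2, scores$Cluster)

# Euclidean distance of each country from its own cluster centroid.
scores$distance_from_center <- sqrt(
  (scores$PC1 - scores$center_PC1)^2 + (scores$PC2 - scores$center_PC2)^2
)

# Country furthest from its cluster centroid.
scores[which.max(scores$distance_from_center),
       c("Region_Country", "distance_from_center")]
```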

4. Association Rule: Wine Share

4.1 Main Goal

The purpose of the association analysis was to examine what proportion of grape production is converted into wine and to identify countries with the strongest specialization in wine production. Since roughly 1.6 to 2 kg of grapes are needed to produce 1 liter of wine, a standard estimate of 1.8 kg per liter was adopted for this project.
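To make the conversion factor concrete, the sketch below turns a wine volume in hectolitres into the tonnes of grapes it would require at 1.8 kg per liter. It assumes wine is reported in hl and grapes in tonnes, as in the map labels; the function names are hypothetical, and the code shown in section 4.3 does not itself apply this factor.

```r
KG_GRAPES_PER_LITER <- 1.8   # conversion factor stated in the text
LITERS_PER_HL       <- 100

# Tonnes of grapes implied by a given wine volume in hectolitres.
wine_hl_to_grape_tonnes <- function(wine_hl) {
  wine_hl * LITERS_PER_HL * KG_GRAPES_PER_LITER / 1000
}

# Share of the grape harvest needed for the reported wine volume.
grape_share_for_wine <- function(wine_hl, grapes_tonnes) {
  wine_hl_to_grape_tonnes(wine_hl) / grapes_tonnes
}

wine_hl_to_grape_tonnes(1000)      # grapes needed for 1 000 hl of wine
grape_share_for_wine(1000, 400)    # hypothetical country with 400 tonnes of grapes
```

Expressing both quantities in grape-equivalent tonnes keeps the ratio dimensionless, which a direct comparison of hectolitres with tonnes does not.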

4.2 Data Preparation

To conduct this analysis, grape production and wine production were aggregated at the country level and merged into a unified dataset. As a key indicator, the wine-to-grapes production ratio was calculated. During preparation, several country names did not match the spatial dataset used for mapping. These mismatched values were corrected manually to ensure proper merging. Countries with incomplete structural information were excluded from the final visualization and are presented as NA.
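The manual correction of country names can be organised as a small lookup table. The sketch below first lists names with no match in the map layer and then applies a named vector of fixes. The names and the fixes are hypothetical examples, not the actual mismatches found in the OIV data.

```r
# Hypothetical name lists; the real mismatches depend on the OIV and map data.
oiv_names <- c("France", "Russian Federation", "Czechia", "Atlantis")
map_names <- c("France", "Russia", "Czech Rep.", "Spain")

# 1. Names in the OIV data with no exact match in the map layer.
unmatched <- setdiff(oiv_names, map_names)

# 2. Manual lookup: OIV spelling -> map spelling (illustrative entries).
fixes <- c("Russian Federation" = "Russia", "Czechia" = "Czech Rep.")

# 3. Apply the fixes, leaving all other names unchanged.
harmonised <- unname(ifelse(oiv_names %in% names(fixes), fixes[oiv_names], oiv_names))

# Names that still have no counterpart end up as NA on the map.
still_unmatched <- setdiff(harmonised, map_names)
```

Listing `still_unmatched` after the fix gives an explicit record of which countries appear as NA on the map and why.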

4.3 Data Visualisation and Explanation

assoc_data <- country_summary %>%
  mutate(
    total_prod = mean_wine + mean_grapes,
    wine_share = mean_wine / total_prod
  )

world <- ne_countries(scale = "medium", returnclass = "sf")

world_spec <- world %>%
  left_join(assoc_data, by = c("name" = "Region_Country"))

world_spec$wine_share <- as.numeric(world_spec$wine_share)

pal_cont <- colorNumeric(
  palette = colorRampPalette(c("#dcedd1", "#822336"))(100),
  domain = world_spec$wine_share,
  na.color = "#f0f0f0"
)

leaflet(world_spec, width = "100%", height = 700) %>%
  addTiles() %>%
  addPolygons(
    fillColor = ~pal_cont(wine_share),
    weight = 1,
    opacity = 1,
    color = "white",
    fillOpacity = 0.85,
    highlightOptions = highlightOptions(
      weight = 2,
      color = "#444",
      fillOpacity = 0.95,
      bringToFront = TRUE
    ),
    label = ~paste0(
      "<strong>", name, "</strong><br/>",
      "Wine share: ", 
      ifelse(is.na(wine_share), "No data", round(wine_share, 3)),
      "<br/>",
      "Wine: ", round(mean_wine / 30, 2), " hl<br/>",
      "Grapes: ", round(mean_grapes / 30, 2), " tonnes<br/>",
      "Vineyards area: ", round(mean_vineyard / 30, 2), " ha"
    ) %>%
      lapply(htmltools::HTML)
  ) %>%
  addLegend(
    "bottomright",
    pal = pal_cont,
    values = ~wine_share,
    title = "Wine Share in Grape <br/>Production per Year",
    labFormat = labelFormat(digits = 2),
    opacity = 1
  )

The map reveals clear global patterns. The United Kingdom shows the highest wine-to-grape production ratio (1.27 hl of wine for 22.33 tonnes of grapes produced). Several European countries also show relatively high ratios, with wine shares between 1.2% and 1.8% of production. Outside Europe, Argentina, South Africa and Australia also display high wine shares. Even in the countries with the highest wine-to-grape ratios, the figure remains below 2% of production. Personally, I did not expect it to be so low.

In contrast, many Middle Eastern countries show near-zero wine shares despite maintaining large vineyard areas and grape production. This pattern likely reflects religious and cultural constraints regarding alcohol production and consumption. Sweden also exhibits a very low wine share, consistent with its historically restrictive alcohol policies.

Namibia presents an interesting case. It shows significant grape production but a zero wine ratio. According to a published study (https://pmc.ncbi.nlm.nih.gov/articles/PMC10184179/), Namibia has one of the highest per capita alcohol consumption rates in Africa, estimated between 11.2 and 12 liters of pure alcohol per adult annually. This discrepancy may indicate that residents rely on imported alcoholic beverages or on domestic non-wine production. It may also point to limitations of the dataset or to missing values in it.

5. Conclusions

The results show that global wine production is clearly divided into different structural groups. Countries differ mainly in two ways: how large their production systems are and whether they are growing or declining over time. These two factors explain most of the variation in the dataset.

The clustering reveals meaningful segments. Some countries are large, stable producers with strong historical positions. Others are expanding and investing in vineyards. There is also a group where production and vineyard area are shrinking, and finally several very small producers with limited global importance.

The wine-to-grape ratio adds another important insight. Although some countries are well known for their wine production, in reality wine accounts for only a small percentage of their grape production. At the other extreme, some countries grow grapes but produce no wine at all. These differences are often linked to cultural, institutional or economic conditions.

The global wine sector is structurally diverse. Countries follow different development paths, and their position in the market depends on scale, growth dynamics and production orientation.

Global Wine Production and Structural Dynamics of the Grape Industry (1995–2025)

Aleksandra Karolina Stawicka for Unsupervised Learning classes conducted by PhD, Assoc. Prof. Katarzyna Kopczewska