Homework 2 - clustering

library(readr)
Cleaned_Global_Wine_Data_2023_ <- read_csv("~/Desktop/HW1JULIJA/Wine clustering/Cleaned_Global_Wine_Data__2023_ .csv")

## New names:
## Rows: 85 Columns: 6
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): Country dbl (5): ...1, Wine Consumption (tonnes), Wine Exports (tonnes),
## Wine Import...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

head(Cleaned_Global_Wine_Data_2023_[, 1:4])

## # A tibble: 6 × 4
##    ...1 Country     `Wine Consumption (tonnes)` `Wine Exports (tonnes)`
##   <dbl> <chr>                             <dbl>                   <dbl>
## 1     0 Afghanistan                      449964                  202874
## 2     1 Albania                             253                      79
## 3     2 Algeria                          459287                       2
## 4     6 Argentina                          7753                   30595
## 5     7 Armenia                              90                    4654
## 6     9 Australia                         37862                  139619

I will now manipulate the data.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Cleaned_Global_Wine_Data_2023_ <- Cleaned_Global_Wine_Data_2023_ %>%
    rename(Consumption = `Wine Consumption (tonnes)`,
        Exports = `Wine Exports (tonnes)`,
        Imports = `Wine Imports (tonnes)`,
        Production = `Wine Production (tonnes)`)

colnames(Cleaned_Global_Wine_Data_2023_)

## [1] "...1"        "Country"     "Consumption" "Exports"     "Imports"    
## [6] "Production"

colnames(Cleaned_Global_Wine_Data_2023_)[colnames(Cleaned_Global_Wine_Data_2023_) == "...1"] <- "ID"

head(Cleaned_Global_Wine_Data_2023_)

## # A tibble: 6 × 6
##      ID Country     Consumption Exports Imports Production
##   <dbl> <chr>             <dbl>   <dbl>   <dbl>      <dbl>
## 1     0 Afghanistan      449964  202874     420    1552658
## 2     1 Albania             253      79    3673     218094
## 3     2 Algeria          459287       2    7594    1079044
## 4     6 Argentina          7753   30595    3787    1602835
## 5     7 Armenia              90    4654     797     214662
## 6     9 Australia         37862  139619   26351    1632170

Description:

ID: Country ID

Country: Name of the country

Consumption: Wine consumption in tonnes

Exports: Wine exports in tonnes

Imports: Wine imports in tonnes

Production: Wine production in tonnes

The data are for the year 2023 and were taken from the official website of the International Organisation of Vine and Wine (OIV): https://www.oiv.int/what-we-do/data-discovery-report?oiv.

RQ: “Can countries be grouped into distinct clusters based on their wine production, imports, and exports?”

For the clustering analysis, I selected three variables: Production, Imports and Exports. These variables capture distinct and critical dimensions of global wine trade and production, providing a comprehensive picture of wine-related activities and trade patterns across countries. Consumption is kept out of the clustering and is used later to check whether the resulting clusters differ on it.

clustering_data <- Cleaned_Global_Wine_Data_2023_ %>%
  select(Exports, Imports, Production)

standardized_data <- scale(clustering_data)

standardized_data <- as.data.frame(standardized_data)

head(standardized_data)

##        Exports    Imports  Production
## 1  0.102542236 -0.4268769 -0.07665651
## 2 -0.211194203 -0.4065942 -0.18679742
## 3 -0.211313326 -0.3821466 -0.11574364
## 4 -0.163984059 -0.4058834 -0.07251543
## 5 -0.204116394 -0.4245263 -0.18708066
## 6  0.004682829 -0.2651956 -0.07009443

Cleaned_Global_Wine_Data_2023__clu_std <- as.data.frame(scale(Cleaned_Global_Wine_Data_2023_[c(3:6)]))

Cleaned_Global_Wine_Data_2023_$Dissimilarity <- sqrt(
  Cleaned_Global_Wine_Data_2023__clu_std$Exports^2 +
  Cleaned_Global_Wine_Data_2023__clu_std$Imports^2 +
  Cleaned_Global_Wine_Data_2023__clu_std$Production^2)

I have standardized the data that will be used for clustering.

head(Cleaned_Global_Wine_Data_2023_[order(-Cleaned_Global_Wine_Data_2023_$Dissimilarity), c("ID", "Country", "Dissimilarity")], 20)

## # A tibble: 20 × 3
##       ID Country                          Dissimilarity
##    <dbl> <chr>                                    <dbl>
##  1    75 Global                                  13.6  
##  2   197 United States of America                 4.92 
##  3   132 Netherlands                              2.90 
##  4    41 China, mainland                          2.46 
##  5    73 Germany                                  2.24 
##  6   155 Russia                                   1.92 
##  7   196 United Kingdom                           1.79 
##  8   146 Peru                                     0.902
##  9    33 Canada                                   0.866
## 10    37 Chile                                    0.812
## 11    67 France                                   0.638
## 12    94 Italy                                    0.575
## 13   152 Republic of Türkiye                      0.550
## 14   185 Thailand                                 0.538
## 15   148 Poland                                   0.536
## 16   184 Tanzania, the United Republic of         0.518
## 17    63 Ethiopia                                 0.517
## 18   174 South Africa                             0.516
## 19   198 Uruguay                                  0.514
## 20   117 Malta                                    0.512

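I calculated the Dissimilarity variable to assess which countries differ markedly from the rest of the data set. It is the Euclidean distance of each unit from the origin of the standardized space, that is, from a hypothetical country with average exports, imports, and production. The two units with the highest scores, Global (13.6) and United States of America (4.92), deviate strongly from the other observations, which makes them potential outliers. Global also appears to be an aggregate row rather than a country. To ensure they do not distort the clustering process, I will remove both.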
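As a minimal sketch of what the Dissimilarity score measures, the toy example below standardizes three made-up variables and computes each row's Euclidean distance from the origin of the standardized space. The data frame `toy` and all of its values are hypothetical and only illustrate the calculation.

```r
# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  Exports    = c(10, 12, 11, 9, 200),
  Imports    = c(5, 6, 4, 5, 90),
  Production = c(100, 110, 95, 105, 3000)
)

# scale() centers each column on its mean and divides by its standard deviation
toy_std <- as.data.frame(scale(toy))

# Euclidean distance of each row from the origin (the "average" unit)
toy$Dissimilarity <- sqrt(rowSums(toy_std^2))

# Sort so that the most atypical rows come first
toy[order(-toy$Dissimilarity), ]
```

In this toy data the fifth row is far larger than the others on every variable, so it receives the highest score, in the same way that Global does in the wine data.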

Cleaned_Global_Wine_Data_2023_<- Cleaned_Global_Wine_Data_2023_ %>% 
  filter(!Country %in% c("Global", "United States of America"))

Cleaned_Global_Wine_Data_2023__clu_std <- as.data.frame(scale(Cleaned_Global_Wine_Data_2023_[, c("Exports",
                                                                                                 "Imports",
                                                                                                 "Production")]))

head(Cleaned_Global_Wine_Data_2023_[order(-Cleaned_Global_Wine_Data_2023_$Dissimilarity), c("ID", "Country", "Dissimilarity")], 20)

## # A tibble: 20 × 3
##       ID Country                          Dissimilarity
##    <dbl> <chr>                                    <dbl>
##  1   132 Netherlands                              2.90 
##  2    41 China, mainland                          2.46 
##  3    73 Germany                                  2.24 
##  4   155 Russia                                   1.92 
##  5   196 United Kingdom                           1.79 
##  6   146 Peru                                     0.902
##  7    33 Canada                                   0.866
##  8    37 Chile                                    0.812
##  9    67 France                                   0.638
## 10    94 Italy                                    0.575
## 11   152 Republic of Türkiye                      0.550
## 12   185 Thailand                                 0.538
## 13   148 Poland                                   0.536
## 14   184 Tanzania, the United Republic of         0.518
## 15    63 Ethiopia                                 0.517
## 16   174 South Africa                             0.516
## 17   198 Uruguay                                  0.514
## 18   117 Malta                                    0.512
## 19   190 Tunisia                                  0.510
## 20   191 Turkmenistan                             0.508

Euclidean distances

library(factoextra)

## Loading required package: ggplot2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

Distances <- get_dist(Cleaned_Global_Wine_Data_2023__clu_std,
                      method = "euclidean")
# Visualize the distance matrix
fviz_dist(Distances, gradient = list(low = "darkred",
                                     mid = "grey95",
                                     high = "white"))

Now I will calculate the Hopkins statistic to see whether my data is clusterable.

library(factoextra)

get_clust_tendency(Cleaned_Global_Wine_Data_2023__clu_std,
                   n = nrow(Cleaned_Global_Wine_Data_2023__clu_std) - 1,
                   graph = FALSE)

## $hopkins_stat
## [1] 0.9447029
## 
## $plot
## NULL

Explanation of results: The Hopkins statistic, which measures clustering tendency, is 0.945. A value close to 1 indicates that the data are highly clusterable, so the data set is suitable for cluster analysis. Visual inspection of the distance matrix shows several distinct square formations, which supports the existence of well-defined clusters in the data.
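To make the idea behind the Hopkins statistic concrete, here is a hand-written sketch in base R. It compares nearest-neighbour distances from uniformly placed artificial points with nearest-neighbour distances between real observations. The function `hopkins_sketch` is a hypothetical helper written for this illustration; it is not the implementation used by `get_clust_tendency()`, whose details may differ.

```r
# Hypothetical helper: a simplified Hopkins-type statistic
hopkins_sketch <- function(x, m = 10) {
  x <- as.matrix(x)
  n <- nrow(x)
  # Indices of m real observations, sampled without replacement
  idx <- sample(n, m)
  # m artificial points drawn uniformly within the range of each variable
  u_pts <- apply(x, 2, function(col) runif(m, min(col), max(col)))
  # Distance from point p to its nearest neighbour among the rows of pts
  nn_dist <- function(p, pts) min(sqrt(rowSums(sweep(pts, 2, p)^2)))
  u <- sapply(seq_len(m), function(i) nn_dist(u_pts[i, ], x))
  w <- sapply(idx, function(i) nn_dist(x[i, ], x[-i, , drop = FALSE]))
  # Close to 1 for clustered data, around 0.5 for uniformly scattered data
  sum(u) / (sum(u) + sum(w))
}

set.seed(1)
# Toy data: two tight, well-separated groups
clustered <- rbind(matrix(rnorm(60, mean = 0, sd = 0.05), ncol = 2),
                   matrix(rnorm(60, mean = 5, sd = 0.05), ncol = 2))
h_clustered <- hopkins_sketch(clustered, m = 10)
h_clustered
```

When the real points sit in tight groups, their nearest neighbours are very close while the artificial points mostly fall into empty space, so the ratio approaches 1.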

How many clusters?

library(factoextra)
library(NbClust)

fviz_nbclust(Cleaned_Global_Wine_Data_2023__clu_std, kmeans, method = "wss") +
  labs(subtitle = "Elbow method")

Explanation of results:

The elbow method is one way to determine the optimal number of clusters. The elbow is the point where the total within-cluster sum of squares (WSS) starts to level off, so adding further clusters yields little improvement. The plot suggests dividing the data into 3 or 4 clusters to capture the structure of the data set effectively.
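The quantity behind the elbow plot can be sketched in base R: run k-means for several values of k and record the total within-cluster sum of squares each time. The toy matrix below is hypothetical and contains three well-separated groups, so the curve should flatten after k = 3.

```r
set.seed(123)
# Hypothetical toy data: three well-separated groups in two dimensions
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 4, sd = 0.3), ncol = 2),
  cbind(rnorm(20, mean = 0, sd = 0.3), rnorm(20, mean = 4, sd = 0.3))
)

# Total within-cluster sum of squares for k = 1, ..., 6
wss <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# The curve drops sharply up to k = 3 and levels off afterwards
plot(1:6, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```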

We will now confirm this using the Silhouette method to find the optimal number of clusters.

fviz_nbclust(Cleaned_Global_Wine_Data_2023__clu_std, kmeans, method = "silhouette") +
   labs(subtitle = "Silhouette analysis")

The silhouette analysis indicates that 3 clusters provide the best separation and cohesion, which suggests that 3 clusters capture the underlying structure most effectively. We will now confirm this one last time with NbClust, which runs k-means for 2 to 10 clusters and applies a majority rule across many indices.
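The average silhouette width that the silhouette plot is built on can be sketched as follows. The example assumes that the `cluster` package is installed (it ships with most R distributions) and uses hypothetical toy data with three groups; the k with the highest average width is taken as the suggestion.

```r
library(cluster)  # assumption: the recommended package 'cluster' is available

set.seed(123)
# Hypothetical toy data: three well-separated groups
toy <- rbind(
  cbind(rnorm(20, mean = 0, sd = 0.3), rnorm(20, mean = 0, sd = 0.3)),
  cbind(rnorm(20, mean = 6, sd = 0.3), rnorm(20, mean = 0, sd = 0.3)),
  cbind(rnorm(20, mean = 3, sd = 0.3), rnorm(20, mean = 6, sd = 0.3))
)
d <- dist(toy)

# Average silhouette width for k = 2, ..., 6
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- 2:6

# Number of clusters with the highest average silhouette width
best_k <- as.integer(names(avg_sil)[which.max(avg_sil)])
best_k
```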

NbClust(Cleaned_Global_Wine_Data_2023__clu_std, 
        distance = "euclidean",
        min.nc = 2, max.nc = 10,
        method = "kmeans", 
        index = "all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 3 proposed 2 as the best number of clusters 
## * 8 proposed 3 as the best number of clusters 
## * 1 proposed 4 as the best number of clusters 
## * 3 proposed 5 as the best number of clusters 
## * 5 proposed 6 as the best number of clusters 
## * 1 proposed 7 as the best number of clusters 
## * 2 proposed 9 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************

## $All.index
##          KL       CH Hartigan     CCC    Scott    Marriot    TrCovW   TraceW
## 2    0.0292  46.6640  59.4302 -1.6940 117.9091 347352.878 4041.8966 156.0816
## 3    9.9338  69.2997  13.8606  1.4383 273.6394 119701.613  805.3012  90.0277
## 4    0.0595  58.0891 115.7140  0.4757 313.3533 131879.124  769.4708  76.7331
## 5    6.6244 134.5837  28.0810 12.2160 507.6547  19829.915  116.9988  31.1324
## 6  221.1058 150.0953   8.2552 13.9396 575.0616  12675.909   65.5729  22.8913
## 7    0.0193 138.0486   3.2788 12.9359 611.1331  11171.975   58.4998  20.6747
## 8    0.7215 122.2704  11.6130 11.4610 624.2611  12457.248   54.5788  19.8197
## 9    1.2533 123.3373   1.3203 11.6808 692.9649   6890.285   49.5351  17.1623
## 10   0.8952 110.2259   0.5807 10.3423 700.6298   7756.142   48.6832  16.8614
##    Friedman   Rubin Cindex     DB Silhouette    Duda Pseudot2   Beale Ratkowsky
## 2    2.6752  1.5761 0.0718 1.3145     0.6927  4.3967 -59.4869 -1.2959    0.4218
## 3    8.9576  2.7325 0.0827 0.8640     0.7636  0.7452  24.6194  0.5739    0.4522
## 4   11.1703  3.2059 0.0562 1.1526     0.6372  1.6381 -27.2675 -0.6526    0.4104
## 5   26.2543  7.9017 0.1515 0.7160     0.6696  0.6029  43.4645  1.1039    0.4177
## 6   34.2195 10.7464 0.1045 0.6850     0.6610  1.9652 -23.0842 -0.8220    0.3888
## 7   41.6442 11.8986 0.0764 0.7147     0.5050  0.7433  15.8901  0.5747    0.3617
## 8   44.7696 12.4119 0.0624 0.7666     0.5161  1.8193 -23.8678 -0.7480    0.3390
## 9   68.6456 14.3338 0.0592 0.6931     0.5357 16.4697 -11.2714 -1.4658    0.3214
## 10  71.6306 14.5895 0.0554 0.7212     0.4711  1.7961  -9.7513 -0.6707    0.3051
##       Ball Ptbiserial     Frey McClain   Dunn Hubert SDindex Dindex   SDbw
## 2  78.0408     0.7387  55.8314  0.0709 0.1178 0.0093 21.3214 0.7842 1.8794
## 3  30.0092     0.8167   4.7272  0.0506 0.1585 0.0089 19.3208 0.6431 1.6357
## 4  19.1833     0.7208  30.4709  0.0912 0.0617 0.0091 19.6852 0.5542 1.7063
## 5   6.2265     0.7642   3.4985  0.0760 0.2161 0.0092  5.3861 0.4559 0.3221
## 6   3.8152     0.6982   8.5895  0.0951 0.1510 0.0093  5.3356 0.3709 0.2843
## 7   2.9535     0.4968   2.4075  0.2111 0.0125 0.0095  9.0709 0.3184 0.2777
## 8   2.4775     0.4640   0.2313  0.2220 0.0173 0.0096  9.4897 0.2988 0.2620
## 9   1.9069     0.4647  10.3010  0.2134 0.0173 0.0096  9.2780 0.2738 0.1830
## 10  1.6861     0.4125 -10.9122  0.2739 0.0170 0.0096 13.5509 0.2615 0.1886
## 
## $All.CriticalValues
##    CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2          0.5191            71.3386       1.0000
## 3          0.5247            65.2107       0.6328
## 4          0.5088            67.5830       1.0000
## 5          0.5130            62.6440       0.3488
## 6          0.4996            47.0690       1.0000
## 7          0.4551            55.0867       0.6326
## 8          0.4434            66.5373       1.0000
## 9          0.1687            59.1210       1.0000
## 10         0.0819           246.4614       1.0000
## 
## $Best.nc
##                       KL       CH Hartigan     CCC    Scott  Marriot   TrCovW
## Number_clusters   6.0000   6.0000   4.0000  6.0000   5.0000      3.0    3.000
## Value_Index     221.1058 150.0953 101.8534 13.9396 194.3014 239828.8 3236.595
##                  TraceW Friedman   Rubin  Cindex    DB Silhouette   Duda
## Number_clusters  3.0000    9.000  5.0000 10.0000 6.000     3.0000 2.0000
## Value_Index     52.7593   23.876 -1.8511  0.0554 0.685     0.7636 4.3967
##                 PseudoT2   Beale Ratkowsky    Ball PtBiserial   Frey McClain
## Number_clusters   2.0000  2.0000    3.0000  3.0000     3.0000 7.0000  3.0000
## Value_Index     -59.4869 -1.2959    0.4522 48.0315     0.8167 2.4075  0.0506
##                   Dunn Hubert SDindex Dindex  SDbw
## Number_clusters 5.0000      0  6.0000      0 9.000
## Value_Index     0.2161      0  5.3356      0 0.183
## 
## $Best.partition
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 3 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 3 1 1 3
## [39] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 3 1 1 1 1 3 1 2 1 1 1 1 3 1 1 1 1 1 1 1
## [77] 1 1 1 2 1 1 1

After a thorough analysis, I have confirmed that the best number of clusters is 3.

Clustering <- kmeans(Cleaned_Global_Wine_Data_2023__clu_std,
                         centers = 3,
                         nstart = 25)
Clustering

## K-means clustering with 3 clusters of sizes 7, 4, 72
## 
## Cluster means:
##      Exports     Imports Production
## 1  2.8700041 -0.02001805  1.9021219
## 2  0.2420837  3.90299613 -0.1803986
## 3 -0.2924773 -0.21488692 -0.1749064
## 
## Clustering vector:
##  [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 1 3 3 1
## [39] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 1 3 3 3 3 1 3 2 3 3 3 3 1 3 3 3 3 3 3 3
## [77] 3 3 3 2 3 3 3
## 
## Within cluster sum of squares by cluster:
## [1] 54.105876  6.582436 29.339403
##  (between_SS / total_SS =  63.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

The between_SS / total_SS ratio is the proportion of total variance explained by the clustering. Here, 63.4% of the variability in the data is explained by differences between the 3 clusters; the remaining 36.6% is variability within the clusters. A higher ratio indicates more compact and better-separated clusters.
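The ratio printed by `kmeans()` can be reproduced from the components of the fitted object. The sketch below uses hypothetical toy data and also checks the decomposition of the total sum of squares into a between-cluster and a within-cluster part.

```r
set.seed(123)
# Hypothetical toy data: two groups in two dimensions
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(40, mean = 3, sd = 0.5), ncol = 2))

km <- kmeans(toy, centers = 2, nstart = 25)

# Share of total variability explained by the clustering
explained <- km$betweenss / km$totss
round(100 * explained, 1)

# Total SS splits into between-cluster SS plus total within-cluster SS
decomposition_holds <- isTRUE(all.equal(km$totss,
                                        km$betweenss + km$tot.withinss))
decomposition_holds
```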

fviz_cluster(Clustering,
              palette = "Set1",
              repel = TRUE,
              ggtheme = theme_bw(),
              data = Cleaned_Global_Wine_Data_2023__clu_std)

Cleaned_Global_Wine_Data_2023_ <- Cleaned_Global_Wine_Data_2023_ %>%
  filter(!(ID %in% c(41)))

I have filtered out the outlier in row 18 (ID 41, China, mainland), which had been assigned to the first cluster.

Cleaned_Global_Wine_Data_2023__clu_std <- as.data.frame(scale(Cleaned_Global_Wine_Data_2023_[, c("Exports",
                                                                                                 "Imports",
                                                                                                 "Production")]))

Clustering <- kmeans(Cleaned_Global_Wine_Data_2023__clu_std,
                         centers = 3,
                         nstart = 25)
Clustering

## K-means clustering with 3 clusters of sizes 67, 4, 11
## 
## Cluster means:
##      Exports    Imports Production
## 1 -0.3283477 -0.2162458 -0.3453174
## 2  0.2996277  3.9485063 -0.1875710
## 3  1.8909806 -0.1186870  2.1715047
## 
## Clustering vector:
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 3 1 1 3 1 2 1 1 1 1 3 3 1 3 1
## [39] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 3 1 1 1 1 3 1 2 1 1 1 1 3 3 1 1 1 1 1 1 1
## [77] 1 1 2 1 3 1
## 
## Within cluster sum of squares by cluster:
## [1] 22.711434  7.584806 40.136668
##  (between_SS / total_SS =  71.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

The between_SS / total_SS ratio increased from 63.4% to 71.0% after the outlier in row 18 (ID 41) was removed, which supports the decision to filter it out.

fviz_cluster(Clustering,
              palette = "Set1",
              repel = FALSE,
              ggtheme = theme_bw(),
              data = Cleaned_Global_Wine_Data_2023__clu_std)

After eliminating outliers, the visual representation of data looks much better. This indicates more meaningful and consistent groupings in the data.

Cluster averages

Averages <- Clustering$centers

Averages

##      Exports    Imports Production
## 1 -0.3283477 -0.2162458 -0.3453174
## 2  0.2996277  3.9485063 -0.1875710
## 3  1.8909806 -0.1186870  2.1715047

Cluster 1: This cluster likely represents countries that are not major players in the global wine market, because their exports, imports, and production are all below average.

Cluster 2: This cluster represents countries that have extremely high imports, slightly above-average exports, and slightly below-average production. They could be major wine importers that rely on external sources.

Cluster 3: This cluster likely represents the major wine-producing and exporting countries. They have well above-average production and exports and roughly average imports.

Figure <- as.data.frame(Averages)
Figure$id <- 1:nrow(Figure)

library(tidyr)
Figure <- pivot_longer(Figure, cols = c("Exports", "Imports", "Production" ))

Figure$Group <- factor(Figure$id, 
                       levels = c(1, 2, 3), 
                       labels = c("1", "2", "3"))

Figure$ImeF <- factor(Figure$name, 
              levels = c("Exports", "Imports", "Production"), 
              labels = c("Exports", "Imports", "Production"))


library(ggplot2)
ggplot(Figure, aes(x = ImeF, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Group, col = Group), size = 3) +
  geom_line(aes(group = id), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster variables") +
  scale_color_brewer(palette="Set1") +
  ylim(-1, 4) +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))

The graphical visualization makes it easier to see differences in wine trade and production between countries by grouping them into clusters. Each cluster highlights distinct market trends.

Cluster 2 includes countries that import large amounts of wine, meaning they rely on external markets to meet their demand. These countries have relatively low production but very high imports, indicating a high level of wine consumption with limited domestic supply.

Cluster 3 contains the major wine-producing and exporting nations, which produce and export large amounts of wine while their imports are slightly below average. Countries in this group, such as France, Italy, and Spain, are key players in the global wine industry, supplying wine to many other nations.

Cluster 1 consists of countries with low levels of wine trade and production, suggesting they play a minimal role in the international market. These countries neither produce nor export large quantities, and their imports are also below average, meaning they likely have a smaller wine industry overall.

These results could be valuable for wine producers, policymakers, and trade analysts, as they provide insights into global market trends. Understanding these clusters can help businesses and governments develop strategies for trade, expansion, and partnerships based on how each country engages with the wine industry.

Now, I will look into which cluster Croatia is in because I want to see how my country is performing.

Cluster_averages <- Averages[3, ] 
print(Cluster_averages)

##    Exports    Imports Production 
##   1.890981  -0.118687   2.171505

Croatia_values <- Cleaned_Global_Wine_Data_2023__clu_std[19, ]

print(Croatia_values)

##       Exports    Imports Production
## 19 -0.4269313 -0.3475385 -0.4475074

Croatia belongs to Cluster 1: position 19 of the clustering vector is 1, and its standardized values are close to the Cluster 1 averages and far from the Cluster 3 averages printed above. Croatia has below-average levels of wine exports, imports, and production, indicating that it is not a major player in the global wine market. The negative standardized values suggest that Croatia does not export significant amounts of wine and also does not rely heavily on wine imports to meet domestic demand. With moderately low production, Croatia’s wine industry appears to be relatively self-sufficient, likely focusing on local consumption rather than large-scale international trade. This suggests that while Croatia does have a wine industry, its presence in the global market remains modest in scale.

This position is a good starting point for a country that wants to improve its position in the global wine market.
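Selecting a country by row position is fragile, because row numbers shift whenever observations are filtered out. A small sketch of looking up cluster membership by name instead is shown below. The data frame `toy`, its country labels, and its group assignments are hypothetical; with the real data the condition would be `Country == "Croatia"` once the Group column has been added.

```r
# Hypothetical toy data standing in for the wine data set
toy <- data.frame(
  Country = c("Country A", "Country B", "Country C", "Country D"),
  Group   = c(1, 3, 2, 1)
)

# Cluster of one country, selected by name rather than by row index
group_of_b <- toy$Group[toy$Country == "Country B"]
group_of_b

# All countries listed by cluster
members <- split(toy$Country, toy$Group)
members
```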

Cleaned_Global_Wine_Data_2023_$Group <- Clustering$cluster

fit <- aov(cbind(Exports, Imports, Production) ~ as.factor(Group),
           data = Cleaned_Global_Wine_Data_2023_)
summary(fit)

##  Response Exports :
##                  Df     Sum Sq    Mean Sq F value    Pr(>F)    
## as.factor(Group)  2 8.3217e+11 4.1609e+11  54.372 1.414e-15 ***
## Residuals        79 6.0455e+11 7.6526e+09                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Imports :
##                  Df     Sum Sq    Mean Sq F value    Pr(>F)    
## as.factor(Group)  2 5.8732e+11 2.9366e+11  168.95 < 2.2e-16 ***
## Residuals        79 1.3732e+11 1.7382e+09                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Production :
##                  Df     Sum Sq    Mean Sq F value    Pr(>F)    
## as.factor(Group)  2 1.4212e+14 7.1061e+13  112.86 < 2.2e-16 ***
## Residuals        79 4.9743e+13 6.2966e+11                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Explanation of results:

The clustering effectively captures distinct patterns in wine-related activities, as shown by the very small p-values (p < 0.001 for all three variables). This confirms that wine exports, imports, and production differ significantly between clusters. The clustering method successfully grouped countries based on meaningful wine trade and production characteristics.
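As a rough complement to the p-values, the share of each variable's total sum of squares that lies between clusters (eta squared) can be computed from the Sum Sq column of the ANOVA output above. This is an added illustration rather than part of the original analysis; the numbers are copied from that output as printed, so the result is only as precise as the rounding there.

```r
# Sums of squares copied from the ANOVA output (rounded as printed)
ss_between  <- c(Exports = 8.3217e11, Imports = 5.8732e11, Production = 1.4212e14)
ss_residual <- c(Exports = 6.0455e11, Imports = 1.3732e11, Production = 4.9743e13)

# Eta squared: between-cluster SS divided by total SS, per variable
eta_sq <- ss_between / (ss_between + ss_residual)
round(eta_sq, 3)
```

Values closer to 1 mean that more of the variable's variation is accounted for by cluster membership.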

aggregate(Cleaned_Global_Wine_Data_2023_$Consumption,
           by = list(Cluster = Cleaned_Global_Wine_Data_2023_$Group),
           FUN = mean)

##   Cluster        x
## 1       1  29363.6
## 2       2 147026.0
## 3       3 772899.1

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

leveneTest(Cleaned_Global_Wine_Data_2023_$Consumption, as.factor(Cleaned_Global_Wine_Data_2023_$Group))

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value    Pr(>F)    
## group  2  35.239 1.149e-11 ***
##       79                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Explanation of results:

Levene’s test checks whether the variances across groups are homogeneous. In this case, it checks whether the variance in wine consumption is the same across the three clusters.

H0: The variances of the groups are equal (homogeneity of variance). - The variance of consumption is equal in all three clusters.
H1: The variances of the groups are not equal (heterogeneity of variance). - The variance of consumption differs in at least one cluster.

Since the p-value (p < 0.001) is lower than 0.05, we reject the null hypothesis. There is significant evidence that the variance of consumption differs across clusters, so the homogeneity-of-variance assumption of ANOVA is violated.
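The logic of Levene's test with `center = median`, as reported in the output header, can be sketched in base R: take the absolute deviation of each observation from its group median and run a one-way ANOVA on those deviations. The toy data below are hypothetical, and the sketch is meant to show the idea rather than to replace `leveneTest()`.

```r
set.seed(42)
# Hypothetical toy data: same mean, very different spread
toy <- data.frame(
  value = c(rnorm(30, mean = 0, sd = 1), rnorm(30, mean = 0, sd = 10)),
  group = factor(rep(c("a", "b"), each = 30))
)

# Absolute deviation of each observation from its own group median
toy$abs_dev <- abs(toy$value - ave(toy$value, toy$group, FUN = median))

# One-way ANOVA on the deviations; a small p-value means unequal spread
levene_fit <- anova(lm(abs_dev ~ group, data = toy))
levene_p <- levene_fit[["Pr(>F)"]][1]
levene_p
```

The test therefore speaks to differences in spread between groups, not to differences in their average level.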

I will check whether consumption is normally distributed within each cluster using the Shapiro-Wilk normality test.

library(dplyr)
library(rstatix)

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

Cleaned_Global_Wine_Data_2023_ %>%
  group_by(as.factor(Cleaned_Global_Wine_Data_2023_$Group)) %>%
  shapiro_test(Consumption)

## # A tibble: 3 × 4
##   `as.factor(Cleaned_Global_Wine_Data_2023_$Group)` variable  statistic        p
##   <fct>                                             <chr>         <dbl>    <dbl>
## 1 1                                                 Consumpt…     0.255 1.00e-16
## 2 2                                                 Consumpt…     0.730 2.46e- 2
## 3 3                                                 Consumpt…     0.832 2.51e- 2

Explanation of results:

H0: The data follow a normal distribution.
H1: The data do not follow a normal distribution.

The p-values in all three clusters are less than 0.05 (<0.001, 0.025, 0.025), so the null hypothesis is rejected in each case. This means that consumption is not normally distributed in any of the three clusters.

Because the assumption of normality is violated in all three groups, I will perform the non-parametric alternative to ANOVA, the Kruskal-Wallis rank sum test.
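The same per-group check can be sketched in base R with `tapply()` and `shapiro.test()`, without extra packages. The toy data are hypothetical: one group is drawn from a normal distribution and the other from a strongly skewed exponential distribution.

```r
set.seed(7)
# Hypothetical toy data: one normal group and one right-skewed group
toy <- data.frame(
  value = c(rnorm(100), rexp(100)),
  group = factor(rep(c("normal", "skewed"), each = 100))
)

# Shapiro-Wilk p-value within each group
p_values <- tapply(toy$value, toy$group,
                   function(v) shapiro.test(v)$p.value)
p_values
```

A p-value below 0.05 in a group leads to rejecting normality for that group, as happens for the skewed group here.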

kruskal.test(Consumption ~ as.factor(Group),
             data = Cleaned_Global_Wine_Data_2023_)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Consumption by as.factor(Group)
## Kruskal-Wallis chi-squared = 24.625, df = 2, p-value = 4.496e-06

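H0: All distribution locations of consumption are the same.
H1: At least one distribution location of consumption is different.

Based on the p-value (p < 0.001), we can reject the null hypothesis: the location of wine consumption differs for at least one cluster. This agrees with the cluster means of consumption reported above, which rise from Cluster 1 to Cluster 3.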
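The Kruskal-Wallis test only says that at least one group differs. A possible follow-up, not part of the original analysis, is a set of pairwise Wilcoxon rank sum tests with a multiplicity correction, sketched here on hypothetical toy data with base R's `pairwise.wilcox.test()`.

```r
set.seed(99)
# Hypothetical toy data: groups a and b are alike, group c is shifted
toy <- data.frame(
  value = c(rnorm(20, mean = 0), rnorm(20, mean = 0), rnorm(20, mean = 5)),
  group = factor(rep(c("a", "b", "c"), each = 20))
)

# Omnibus test: is at least one group different?
kruskal.test(value ~ group, data = toy)

# Pairwise comparisons with Holm-adjusted p-values
posthoc <- pairwise.wilcox.test(toy$value, toy$group,
                                p.adjust.method = "holm")
posthoc$p.value
```

The resulting matrix holds one adjusted p-value per pair of groups, which shows which groups drive the overall difference.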

Conclusion

This analysis aimed to determine whether countries could be effectively grouped based on their wine trade characteristics: exports, imports, and production. The test results confirm that the clusters differ significantly, and each cluster shows a distinct profile in the global wine market.

After standardizing the data and determining the optimal number of clusters, three groups emerged:

Emerging Wine Markets (Cluster 1): These countries exhibit below-average exports, imports, and production, indicating that they are not major players in the global wine industry. Their involvement in wine trade is relatively low, suggesting either that wine is a minor industry or that domestic consumption is met through local production.

Wine Importers (Cluster 2): Countries in this group show very high import values, suggesting a strong reliance on external markets to meet their wine demand. These countries don’t produce significant amounts of wine but have high consumer demand, making them key destinations for global wine exports.

Leading Wine Producers & Exporters (Cluster 3): This cluster includes countries with high production and export levels, playing a dominant role in the global wine trade. These countries, such as Italy, France, and Spain, are known for their strong wine industries and contribute significantly to international wine markets.

These results can be valuable for people in the wine industry, policymakers, and trade experts, as they give a clearer picture of different market segments. Countries in Cluster 1 could be good places for businesses looking to expand, while Cluster 2 countries are big wine buyers, making them great targets for exporters. Meanwhile, the countries in Cluster 3 are already major wine producers and exporters, so they should focus on staying competitive by improving quality and innovating.

Overall, the clustering analysis did a great job of identifying different wine trade patterns around the world, showing how useful data analysis can be in understanding global markets.

Julija Pletenac

2025-02-03