library(readr)
Cleaned_Global_Wine_Data_2023_ <- read_csv("~/Desktop/HW1JULIJA/Wine clustering/Cleaned_Global_Wine_Data__2023_ .csv")
## New names:
## Rows: 85 Columns: 6
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): Country dbl (5): ...1, Wine Consumption (tonnes), Wine Exports (tonnes),
## Wine Import...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
head(Cleaned_Global_Wine_Data_2023_ [, 1:4])
## # A tibble: 6 × 4
## ...1 Country `Wine Consumption (tonnes)` `Wine Exports (tonnes)`
## <dbl> <chr> <dbl> <dbl>
## 1 0 Afghanistan 449964 202874
## 2 1 Albania 253 79
## 3 2 Algeria 459287 2
## 4 6 Argentina 7753 30595
## 5 7 Armenia 90 4654
## 6 9 Australia 37862 139619
I will now rename the columns to shorter names.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Cleaned_Global_Wine_Data_2023_ <- Cleaned_Global_Wine_Data_2023_ %>%
rename(Consumption = `Wine Consumption (tonnes)`,
Exports = `Wine Exports (tonnes)`,
Imports = `Wine Imports (tonnes)`,
Production = `Wine Production (tonnes)`)
colnames(Cleaned_Global_Wine_Data_2023_)
## [1] "...1" "Country" "Consumption" "Exports" "Imports"
## [6] "Production"
colnames(Cleaned_Global_Wine_Data_2023_)[colnames(Cleaned_Global_Wine_Data_2023_) == "...1"] <- "ID"
head(Cleaned_Global_Wine_Data_2023_)
## # A tibble: 6 × 6
## ID Country Consumption Exports Imports Production
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 0 Afghanistan 449964 202874 420 1552658
## 2 1 Albania 253 79 3673 218094
## 3 2 Algeria 459287 2 7594 1079044
## 4 6 Argentina 7753 30595 3787 1602835
## 5 7 Armenia 90 4654 797 214662
## 6 9 Australia 37862 139619 26351 1632170
Description:
ID: Country ID
Country: Name of the country
Consumption: Wine consumption in tonnes
Exports: Wine exports in tonnes
Imports: Wine imports in tonnes
Production: Wine production in tonnes
The data are for the year 2023 and were taken from the official website of the International Organisation of Vine and Wine (OIV): https://www.oiv.int/what-we-do/data-discovery-report?oiv.
For the clustering analysis, I selected three variables: Production, Imports and Exports. These categories capture distinct and critical dimensions of global wine trade and production, providing a comprehensive picture of wine-related activities and trade patterns across countries.
clustering_data <- Cleaned_Global_Wine_Data_2023_ %>%
select(Exports, Imports, Production)
standardized_data <- scale(clustering_data)
standardized_data <- as.data.frame(standardized_data)
head(standardized_data)
## Exports Imports Production
## 1 0.102542236 -0.4268769 -0.07665651
## 2 -0.211194203 -0.4065942 -0.18679742
## 3 -0.211313326 -0.3821466 -0.11574364
## 4 -0.163984059 -0.4058834 -0.07251543
## 5 -0.204116394 -0.4245263 -0.18708066
## 6 0.004682829 -0.2651956 -0.07009443
Cleaned_Global_Wine_Data_2023__clu_std <- as.data.frame(scale(Cleaned_Global_Wine_Data_2023_[c(3:6)]))
Cleaned_Global_Wine_Data_2023_$Dissimilarity = sqrt(Cleaned_Global_Wine_Data_2023__clu_std$Exports^2 + Cleaned_Global_Wine_Data_2023__clu_std$Imports^2 + Cleaned_Global_Wine_Data_2023__clu_std$Production^2)
I have standardized the data that will be used for clustering.
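To make the standardization step concrete, here is a minimal sketch on invented numbers (the `toy` data frame is hypothetical and is not the OIV data). It illustrates that `scale()` subtracts each column's mean and divides by its standard deviation, so every standardized column ends up with mean 0 and standard deviation 1.

```r
# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(Exports = c(10, 20, 30, 40),
                  Imports = c(1, 1, 2, 8))

# scale() centers each column on its mean and divides by its standard deviation
toy_std <- as.data.frame(scale(toy))

# The same result computed by hand for one column
manual_exports <- (toy$Exports - mean(toy$Exports)) / sd(toy$Exports)

round(colMeans(toy_std), 10)                 # column means after standardization
sapply(toy_std, sd)                          # column standard deviations
all.equal(toy_std$Exports, manual_exports)   # scale() agrees with the manual formula
```

Standardizing matters here because Production is recorded on a much larger scale than Imports and Exports, and without it Production would dominate the Euclidean distances.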
head(Cleaned_Global_Wine_Data_2023_[order(-Cleaned_Global_Wine_Data_2023_$Dissimilarity), c("ID", "Country", "Dissimilarity")], 20)
## # A tibble: 20 × 3
## ID Country Dissimilarity
## <dbl> <chr> <dbl>
## 1 75 Global 13.6
## 2 197 United States of America 4.92
## 3 132 Netherlands 2.90
## 4 41 China, mainland 2.46
## 5 73 Germany 2.24
## 6 155 Russia 1.92
## 7 196 United Kingdom 1.79
## 8 146 Peru 0.902
## 9 33 Canada 0.866
## 10 37 Chile 0.812
## 11 67 France 0.638
## 12 94 Italy 0.575
## 13 152 Republic of Türkiye 0.550
## 14 185 Thailand 0.538
## 15 148 Poland 0.536
## 16 184 Tanzania, the United Republic of 0.518
## 17 63 Ethiopia 0.517
## 18 174 South Africa 0.516
## 19 198 Uruguay 0.514
## 20 117 Malta 0.512
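I calculated the variable Dissimilarity to assess which observations differ most from the rest of the data set. It is the Euclidean distance of each observation from the mean of the standardized Exports, Imports and Production. The observations with the highest dissimilarity scores deviate strongly from the others, which makes them potential outliers. To ensure they do not distort the clustering process, I will remove the two most extreme ones: Global, which appears to be an aggregate row rather than a country, and the United States of America.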
Cleaned_Global_Wine_Data_2023_<- Cleaned_Global_Wine_Data_2023_ %>%
filter(!Country %in% c("Global", "United States of America"))
Cleaned_Global_Wine_Data_2023__clu_std <- as.data.frame(scale(Cleaned_Global_Wine_Data_2023_[, c("Exports",
"Imports",
"Production")]))
head(Cleaned_Global_Wine_Data_2023_[order(-Cleaned_Global_Wine_Data_2023_$Dissimilarity), c("ID", "Country", "Dissimilarity")], 20)
## # A tibble: 20 × 3
## ID Country Dissimilarity
## <dbl> <chr> <dbl>
## 1 132 Netherlands 2.90
## 2 41 China, mainland 2.46
## 3 73 Germany 2.24
## 4 155 Russia 1.92
## 5 196 United Kingdom 1.79
## 6 146 Peru 0.902
## 7 33 Canada 0.866
## 8 37 Chile 0.812
## 9 67 France 0.638
## 10 94 Italy 0.575
## 11 152 Republic of Türkiye 0.550
## 12 185 Thailand 0.538
## 13 148 Poland 0.536
## 14 184 Tanzania, the United Republic of 0.518
## 15 63 Ethiopia 0.517
## 16 174 South Africa 0.516
## 17 198 Uruguay 0.514
## 18 117 Malta 0.512
## 19 190 Tunisia 0.510
## 20 191 Turkmenistan 0.508
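The dissimilarity scores in the second table are identical to those in the first one, which suggests they are the values computed before the two rows were removed. Since removing rows changes the column means and standard deviations, one possible re-check is to recompute the score afterwards. The sketch below does this on hypothetical toy data; the helper `dissimilarity()` and all values are assumptions made for illustration.

```r
# Hypothetical toy data with one extreme row (an aggregate such as a world total)
toy <- data.frame(Country    = c("A", "B", "C", "D", "Total"),
                  Exports    = c(1, 2, 3, 4, 10),
                  Imports    = c(2, 1, 4, 3, 10),
                  Production = c(5, 6, 7, 8, 26))
vars <- c("Exports", "Imports", "Production")

# Distance of each standardized row from the origin (the overall mean)
dissimilarity <- function(df) sqrt(rowSums(scale(df[, vars])^2))

toy$Dissimilarity <- dissimilarity(toy)
toy[order(-toy$Dissimilarity), c("Country", "Dissimilarity")]

# After dropping the extreme row, recompute: means and standard deviations change
toy_clean <- toy[toy$Country != "Total", ]
toy_clean$Dissimilarity <- dissimilarity(toy_clean)
toy_clean[, c("Country", "Dissimilarity")]
```

In the toy example the scores of the remaining rows change noticeably once the aggregate row is gone, so a second round of outlier screening would be based on the updated values.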
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Distances <- get_dist(Cleaned_Global_Wine_Data_2023__clu_std,
method = "euclidean")
# Visualize the distance matrix
fviz_dist(Distances, gradient = list(low = "darkred",
mid = "grey95",
high = "white"))
Now I will calculate the Hopkins statistic to see whether the data are clusterable.
library(factoextra)
get_clust_tendency(Cleaned_Global_Wine_Data_2023__clu_std,
n = nrow(Cleaned_Global_Wine_Data_2023__clu_std) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.9447029
##
## $plot
## NULL
Explanation of results: The Hopkins statistic, which measures clustering tendency, is 0.945. A value close to 1 indicates a strong cluster structure and confirms that the data are suitable for cluster analysis. In the visual inspection of the distance matrix, several square formations are visible, which supports the existence of clusters in the data.
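To give an intuition for what the Hopkins statistic measures, the following is a simplified base R sketch. It is not the implementation used by `get_clust_tendency()`, whose sampling and distance details may differ; the function name `hopkins_sketch` and the toy data are assumptions. The idea is to compare nearest-neighbour distances from uniformly random points to the data with nearest-neighbour distances among real observations.

```r
# Simplified Hopkins statistic: values near 0.5 suggest no structure,
# values near 1 suggest clustered data (the convention assumed here)
hopkins_sketch <- function(x, m) {
  x <- as.matrix(x)
  # Distance from point p to its nearest row in pts
  nearest <- function(p, pts) min(sqrt(colSums((t(pts) - p)^2)))
  # m points drawn uniformly inside the bounding box of the data
  unif <- apply(x, 2, function(col) runif(m, min(col), max(col)))
  u <- apply(unif, 1, nearest, pts = x)
  # m real observations, each compared with all other observations
  idx <- sample(nrow(x), m)
  w <- sapply(idx, function(i) nearest(x[i, ], x[-i, , drop = FALSE]))
  sum(u) / (sum(u) + sum(w))
}

set.seed(1)
# Hypothetical toy data: two tight, well-separated groups
clustered <- rbind(matrix(rnorm(60, mean = 0, sd = 0.1), ncol = 2),
                   matrix(rnorm(60, mean = 5, sd = 0.1), ncol = 2))
# Hypothetical toy data: points spread uniformly, no groups
uniform <- matrix(runif(120), ncol = 2)

h_clustered <- hopkins_sketch(clustered, m = 20)
h_uniform   <- hopkins_sketch(uniform, m = 20)
c(clustered = h_clustered, uniform = h_uniform)
```

Because the statistic reacts strongly to a few very distant observations, a high value on data with extreme countries should be read together with the distance matrix plot.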
library(factoextra)
library(NbClust)
fviz_nbclust(Cleaned_Global_Wine_Data_2023__clu_std, kmeans, method = "wss") +
labs(subtitle = "Elbow method")
Explanation of results:
The elbow method is one way to determine the number of clusters. It plots the total within-cluster sum of squares (WSS) against the number of clusters and looks for the point where adding another cluster no longer reduces WSS substantially, that is, where the curve starts to level off. In this plot the curve levels off at 3 or 4 clusters, which suggests dividing the data into 3 or 4 clusters.
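The quantity behind the elbow plot can be computed directly with `kmeans()`. The sketch below uses hypothetical toy data with three groups and tabulates WSS together with the reduction gained by each additional cluster; it only illustrates the idea and does not reproduce the plot above.

```r
set.seed(1)
# Hypothetical toy data: three well-separated groups in two dimensions
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 4, sd = 0.3), ncol = 2),
             cbind(rnorm(20, mean = 0, sd = 0.3), rnorm(20, mean = 4, sd = 0.3)))

# Total within-cluster sum of squares for k = 1, ..., 6
wss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)

# Reduction in WSS gained by each additional cluster;
# the elbow is where this reduction becomes small
data.frame(k = 1:6, wss = round(wss, 2), drop = c(NA, round(-diff(wss), 2)))
```

On the toy data the reduction is large up to three clusters and small afterwards, which is the pattern one looks for in the plot.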
We will now confirm this using the Silhouette method to find the optimal number of clusters.
fviz_nbclust(Cleaned_Global_Wine_Data_2023__clu_std, kmeans, method = "silhouette") +
labs(subtitle = "Silhouette analysis")
The silhouette analysis indicates that 3 clusters provide the best separation between clusters and cohesion within them, so a 3-cluster solution captures the structure of the data most effectively. We will now check this one last time with NbClust, which computes a range of indices on k-means partitions and applies a majority rule.
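For readers who want to see what the silhouette criterion measures, here is a hand-written base R sketch of the average silhouette width. The helper `avg_silhouette()` and the toy data are assumptions for illustration. For each observation it compares the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b).

```r
# Average silhouette width, computed by hand from a distance matrix
avg_silhouette <- function(x, cl) {
  d <- as.matrix(dist(x))
  widths <- sapply(seq_len(nrow(d)), function(i) {
    own <- cl == cl[i]
    if (sum(own) == 1) return(0)                  # singleton clusters get width 0
    a <- sum(d[i, own]) / (sum(own) - 1)          # mean distance within own cluster
    b <- min(tapply(d[i, !own], cl[!own], mean))  # mean distance to nearest other cluster
    (b - a) / max(a, b)
  })
  mean(widths)
}

set.seed(1)
# Hypothetical toy data: three well-separated groups in two dimensions
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 4, sd = 0.3), ncol = 2),
             cbind(rnorm(20, mean = 0, sd = 0.3), rnorm(20, mean = 4, sd = 0.3)))

# Average silhouette width for k = 2, ..., 6
sil <- sapply(2:6, function(k) {
  avg_silhouette(toy, kmeans(toy, centers = k, nstart = 25)$cluster)
})
names(sil) <- 2:6
round(sil, 3)
best_k <- as.integer(names(sil)[which.max(sil)])
best_k
```

The number of clusters with the highest average width is the one the silhouette method proposes.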
NbClust(Cleaned_Global_Wine_Data_2023__clu_std,
distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "kmeans",
index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 3 proposed 2 as the best number of clusters
## * 8 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 3 proposed 5 as the best number of clusters
## * 5 proposed 6 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 2 proposed 9 as the best number of clusters
## * 1 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
## $All.index
## KL CH Hartigan CCC Scott Marriot TrCovW TraceW
## 2 0.0292 46.6640 59.4302 -1.6940 117.9091 347352.878 4041.8966 156.0816
## 3 9.9338 69.2997 13.8606 1.4383 273.6394 119701.613 805.3012 90.0277
## 4 0.0595 58.0891 115.7140 0.4757 313.3533 131879.124 769.4708 76.7331
## 5 6.6244 134.5837 28.0810 12.2160 507.6547 19829.915 116.9988 31.1324
## 6 221.1058 150.0953 8.2552 13.9396 575.0616 12675.909 65.5729 22.8913
## 7 0.0193 138.0486 3.2788 12.9359 611.1331 11171.975 58.4998 20.6747
## 8 0.7215 122.2704 11.6130 11.4610 624.2611 12457.248 54.5788 19.8197
## 9 1.2533 123.3373 1.3203 11.6808 692.9649 6890.285 49.5351 17.1623
## 10 0.8952 110.2259 0.5807 10.3423 700.6298 7756.142 48.6832 16.8614
## Friedman Rubin Cindex DB Silhouette Duda Pseudot2 Beale Ratkowsky
## 2 2.6752 1.5761 0.0718 1.3145 0.6927 4.3967 -59.4869 -1.2959 0.4218
## 3 8.9576 2.7325 0.0827 0.8640 0.7636 0.7452 24.6194 0.5739 0.4522
## 4 11.1703 3.2059 0.0562 1.1526 0.6372 1.6381 -27.2675 -0.6526 0.4104
## 5 26.2543 7.9017 0.1515 0.7160 0.6696 0.6029 43.4645 1.1039 0.4177
## 6 34.2195 10.7464 0.1045 0.6850 0.6610 1.9652 -23.0842 -0.8220 0.3888
## 7 41.6442 11.8986 0.0764 0.7147 0.5050 0.7433 15.8901 0.5747 0.3617
## 8 44.7696 12.4119 0.0624 0.7666 0.5161 1.8193 -23.8678 -0.7480 0.3390
## 9 68.6456 14.3338 0.0592 0.6931 0.5357 16.4697 -11.2714 -1.4658 0.3214
## 10 71.6306 14.5895 0.0554 0.7212 0.4711 1.7961 -9.7513 -0.6707 0.3051
## Ball Ptbiserial Frey McClain Dunn Hubert SDindex Dindex SDbw
## 2 78.0408 0.7387 55.8314 0.0709 0.1178 0.0093 21.3214 0.7842 1.8794
## 3 30.0092 0.8167 4.7272 0.0506 0.1585 0.0089 19.3208 0.6431 1.6357
## 4 19.1833 0.7208 30.4709 0.0912 0.0617 0.0091 19.6852 0.5542 1.7063
## 5 6.2265 0.7642 3.4985 0.0760 0.2161 0.0092 5.3861 0.4559 0.3221
## 6 3.8152 0.6982 8.5895 0.0951 0.1510 0.0093 5.3356 0.3709 0.2843
## 7 2.9535 0.4968 2.4075 0.2111 0.0125 0.0095 9.0709 0.3184 0.2777
## 8 2.4775 0.4640 0.2313 0.2220 0.0173 0.0096 9.4897 0.2988 0.2620
## 9 1.9069 0.4647 10.3010 0.2134 0.0173 0.0096 9.2780 0.2738 0.1830
## 10 1.6861 0.4125 -10.9122 0.2739 0.0170 0.0096 13.5509 0.2615 0.1886
##
## $All.CriticalValues
## CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2 0.5191 71.3386 1.0000
## 3 0.5247 65.2107 0.6328
## 4 0.5088 67.5830 1.0000
## 5 0.5130 62.6440 0.3488
## 6 0.4996 47.0690 1.0000
## 7 0.4551 55.0867 0.6326
## 8 0.4434 66.5373 1.0000
## 9 0.1687 59.1210 1.0000
## 10 0.0819 246.4614 1.0000
##
## $Best.nc
## KL CH Hartigan CCC Scott Marriot TrCovW
## Number_clusters 6.0000 6.0000 4.0000 6.0000 5.0000 3.0 3.000
## Value_Index 221.1058 150.0953 101.8534 13.9396 194.3014 239828.8 3236.595
## TraceW Friedman Rubin Cindex DB Silhouette Duda
## Number_clusters 3.0000 9.000 5.0000 10.0000 6.000 3.0000 2.0000
## Value_Index 52.7593 23.876 -1.8511 0.0554 0.685 0.7636 4.3967
## PseudoT2 Beale Ratkowsky Ball PtBiserial Frey McClain
## Number_clusters 2.0000 2.0000 3.0000 3.0000 3.0000 7.0000 3.0000
## Value_Index -59.4869 -1.2959 0.4522 48.0315 0.8167 2.4075 0.0506
## Dunn Hubert SDindex Dindex SDbw
## Number_clusters 5.0000 0 6.0000 0 9.000
## Value_Index 0.2161 0 5.3356 0 0.183
##
## $Best.partition
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 3 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 3 1 1 3
## [39] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 3 1 1 1 1 3 1 2 1 1 1 1 3 1 1 1 1 1 1 1
## [77] 1 1 1 2 1 1 1
After a thorough analysis, I have confirmed that the best number of clusters is 3.
Clustering <- kmeans(Cleaned_Global_Wine_Data_2023__clu_std,
centers = 3,
nstart = 25)
Clustering
## K-means clustering with 3 clusters of sizes 7, 4, 72
##
## Cluster means:
## Exports Imports Production
## 1 2.8700041 -0.02001805 1.9021219
## 2 0.2420837 3.90299613 -0.1803986
## 3 -0.2924773 -0.21488692 -0.1749064
##
## Clustering vector:
## [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 1 3 3 1
## [39] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 1 3 3 3 3 1 3 2 3 3 3 3 1 3 3 3 3 3 3 3
## [77] 3 3 3 2 3 3 3
##
## Within cluster sum of squares by cluster:
## [1] 54.105876 6.582436 29.339403
## (between_SS / total_SS = 63.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
The between_SS / total_SS ratio is the proportion of the total variability that is explained by the clustering. Here, 63.4% of the variability in the data lies between the 3 clusters, and the remaining 36.6% lies within the clusters. This points to a reasonable, though not perfect, separation of the clusters.
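The ratio reported by `kmeans()` can be reproduced from the components of the fitted object. The sketch below uses hypothetical toy data and shows that the total sum of squares splits into a between-cluster and a within-cluster part.

```r
set.seed(1)
# Hypothetical toy data: two groups, standardized as in the analysis
toy <- scale(rbind(matrix(rnorm(40, mean = 0), ncol = 2),
                   matrix(rnorm(40, mean = 5), ncol = 2)))
km <- kmeans(toy, centers = 2, nstart = 25)

# Total SS splits into a between-cluster and a within-cluster part
all.equal(km$totss, km$betweenss + km$tot.withinss)

# Share of the total variability that lies between the clusters
explained <- km$betweenss / km$totss
round(100 * explained, 1)
```

Note that this share generally grows as more clusters are added, so it is best compared across solutions with the same number of clusters.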
fviz_cluster(Clustering,
palette = "Set1",
repel = TRUE,
ggtheme = theme_bw(),
data = Cleaned_Global_Wine_Data_2023__clu_std)
Cleaned_Global_Wine_Data_2023_ <- Cleaned_Global_Wine_Data_2023_ %>%
filter(!(ID %in% c(41)))
I have filtered out observation 18 (ID 41, China, mainland), which stood out as an outlier in the first cluster.
Cleaned_Global_Wine_Data_2023__clu_std <- as.data.frame(scale(Cleaned_Global_Wine_Data_2023_[, c("Exports",
"Imports",
"Production")]))
Clustering <- kmeans(Cleaned_Global_Wine_Data_2023__clu_std,
centers = 3,
nstart = 25)
Clustering
## K-means clustering with 3 clusters of sizes 67, 4, 11
##
## Cluster means:
## Exports Imports Production
## 1 -0.3283477 -0.2162458 -0.3453174
## 2 0.2996277 3.9485063 -0.1875710
## 3 1.8909806 -0.1186870 2.1715047
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 3 1 1 3 1 2 1 1 1 1 3 3 1 3 1
## [39] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 3 1 1 1 1 3 1 2 1 1 1 1 3 3 1 1 1 1 1 1 1
## [77] 1 1 2 1 3 1
##
## Within cluster sum of squares by cluster:
## [1] 22.711434 7.584806 40.136668
## (between_SS / total_SS = 71.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
After removing observation 18 (ID 41), between_SS / total_SS increased from 63.4% to 71.0%, which supports the decision to filter it out.
fviz_cluster(Clustering,
palette = "Set1",
repel = FALSE,
ggtheme = theme_bw(),
data = Cleaned_Global_Wine_Data_2023__clu_std)
After eliminating outliers, the visual representation of data looks much
better. This indicates more meaningful and consistent groupings in the
data.
Averages <- Clustering$centers
Averages
## Exports Imports Production
## 1 -0.3283477 -0.2162458 -0.3453174
## 2 0.2996277 3.9485063 -0.1875710
## 3 1.8909806 -0.1186870 2.1715047
Cluster 1: Exports, imports, and production are all below average, so this cluster likely represents countries that are not major players in the global wine market.
Cluster 2: This cluster has extremely high imports, slightly above-average exports, and slightly below-average production. These countries could be major wine importers that rely on external sources.
Cluster 3: Production and exports are well above average, while imports are close to average. These are likely the major wine-producing and exporting countries.
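The cluster averages above are in standardized units. To read them in tonnes, the standardization can be undone using the attributes that `scale()` stores. The following sketch uses hypothetical toy data and checks that the back-transformed centres equal the per-cluster means of the raw data.

```r
set.seed(1)
# Hypothetical toy data in original units (tonnes)
toy <- data.frame(Exports    = c(rnorm(10, 100, 10), rnorm(10, 900, 10)),
                  Production = c(rnorm(10, 500, 50), rnorm(10, 5000, 50)))
toy_scaled <- scale(toy)
km <- kmeans(toy_scaled, centers = 2, nstart = 25)

# Undo the standardization: multiply by the column SDs, then add the column means
centers_tonnes <- sweep(km$centers, 2, attr(toy_scaled, "scaled:scale"), "*")
centers_tonnes <- sweep(centers_tonnes, 2, attr(toy_scaled, "scaled:center"), "+")
centers_tonnes

# The same numbers obtained directly as per-cluster means of the raw data
raw_means <- aggregate(toy, by = list(Cluster = km$cluster), FUN = mean)
raw_means
```

Reporting the centres in tonnes alongside the standardized values can make the cluster descriptions easier to relate to actual trade volumes.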
Figure <- as.data.frame(Averages)
Figure$id <- 1:nrow(Figure)
library(tidyr)
Figure <- pivot_longer(Figure, cols = c("Exports", "Imports", "Production" ))
Figure$Group <- factor(Figure$id,
levels = c(1, 2, 3),
labels = c("1", "2", "3"))
Figure$ImeF <- factor(Figure$name,
levels = c("Exports", "Imports", "Production"),
labels = c("Exports", "Imports", "Production"))
library(ggplot2)
ggplot(Figure, aes(x = ImeF, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Group, col = Group), size = 3) +
geom_line(aes(group = id), linewidth = 1) +
ylab("Averages") +
xlab("Cluster variables") +
scale_color_brewer(palette="Set1") +
ylim(-1, 4) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))
The graphical visualization makes it easier to see differences in wine trade and production between countries by grouping them into clusters. Each cluster highlights distinct market trends.
Cluster 2 includes countries that import large amounts of wine, meaning they rely on external markets to meet their demand. Their production is slightly below average while their imports are very high, which points to substantial wine consumption with limited domestic supply. Cluster 3 contains the major wine-producing and exporting nations, which produce and export large amounts of wine while their imports stay close to average. Countries in this group, such as France, Italy, and Spain, are key players in the global wine industry and supply wine to many other nations. Cluster 1 consists of countries with low levels of wine trade and production, suggesting they play a minimal role in the international market. These countries neither produce nor export large quantities, and their imports are also slightly below average, meaning they likely have a smaller wine industry overall.
These results could be valuable for wine producers, policymakers, and trade analysts, as they provide insights into global market trends. Understanding these clusters can help businesses and governments develop strategies for trade, expansion, and partnerships based on how each country engages with the wine industry.
Now, I will look into which cluster Croatia is in because I want to see how my country is performing.
Cluster_averages <- Averages[3, ]
print(Cluster_averages)
## Exports Imports Production
## 1.890981 -0.118687 2.171505
Croatia_values <- Cleaned_Global_Wine_Data_2023__clu_std[19, ]
print(Croatia_values)
## Exports Imports Production
## 19 -0.4269313 -0.3475385 -0.4475074
Croatia has below-average levels of wine exports, imports, and production, which indicates that it is not a major player in the global wine market. According to the clustering vector, observation 19 belongs to Cluster 1, so the Cluster 3 averages printed above serve as a comparison with the leading producers and exporters. The negative standardized values suggest that Croatia does not export significant amounts of wine and does not rely heavily on wine imports to meet domestic demand. With below-average production and little trade, Croatia’s wine industry may be focused mainly on local consumption rather than large-scale international trade. While Croatia does have a wine industry, its presence in the global market remains modest.
This position is a good starting point for a country that wants to improve its position in the global wine market.
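Looking up a country by a fixed row number is fragile, because the position shifts whenever rows are filtered out. A minimal sketch of a name-based lookup is shown below; the toy data, the country labels, and the object names are hypothetical.

```r
# Hypothetical toy data; labels and values are invented for illustration
toy <- data.frame(Country = c("A", "B", "C", "D", "E", "F"),
                  Exports = c(1, 2, 1, 9, 10, 9),
                  Imports = c(2, 1, 1, 8, 9, 10))
toy_std <- as.data.frame(scale(toy[, c("Exports", "Imports")]))

set.seed(1)
km <- kmeans(toy_std, centers = 2, nstart = 25)

# Find the row by name instead of by a fixed position
row_id <- which(toy$Country == "B")
stopifnot(length(row_id) == 1)   # guard against a missing or duplicated name

own_cluster <- km$cluster[row_id]
toy_std[row_id, ]                # the country's standardized values
km$centers[own_cluster, ]        # the averages of the cluster it belongs to
```

With this approach the country is always compared with the averages of its own cluster, whichever cluster label k-means happens to assign.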
Cleaned_Global_Wine_Data_2023_$Group <- Clustering$cluster
fit <- aov(cbind(Exports, Imports, Production) ~ as.factor(Group),
data = Cleaned_Global_Wine_Data_2023_)
summary(fit)
## Response Exports :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 8.3217e+11 4.1609e+11 54.372 1.414e-15 ***
## Residuals 79 6.0455e+11 7.6526e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Imports :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 5.8732e+11 2.9366e+11 168.95 < 2.2e-16 ***
## Residuals 79 1.3732e+11 1.7382e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Production :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 1.4212e+14 7.1061e+13 112.86 < 2.2e-16 ***
## Residuals 79 4.9743e+13 6.2966e+11
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Explanation of results:
The very small p-values (p < 0.001) confirm that wine exports, imports, and production differ significantly between the clusters. Because the clusters were formed from these same variables, significant differences are expected, and the result mainly shows that all three variables contribute to separating the groups. The clustering therefore grouped countries on wine trade and production characteristics that genuinely distinguish them.
aggregate(Cleaned_Global_Wine_Data_2023_$Consumption,
by = list(Cluster = Cleaned_Global_Wine_Data_2023_$Group),
FUN = mean)
## Cluster x
## 1 1 29363.6
## 2 2 147026.0
## 3 3 772899.1
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
leveneTest(Cleaned_Global_Wine_Data_2023_$Consumption, as.factor(Cleaned_Global_Wine_Data_2023_$Group))
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 35.239 1.149e-11 ***
## 79
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Explanation of results:
Levene’s test checks whether the variances across groups are homogeneous. In this case, it checks whether the variance of wine consumption is the same across the clusters (groups).
H0: The variances of the groups are equal (homogeneity of variance).
- The variance of consumption is equal in all three clusters.
H1: The variances of the groups are not equal (heterogeneity of variance).
- The variance of consumption is not equal in all three clusters.
Since the p-value (p < 0.001) is lower than 0.05, we reject the null hypothesis. There is significant evidence that the variance of consumption differs across the clusters.
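To see why this is a statement about variances and not about means, the median-centred version of Levene's test can be written in a few lines of base R: it is a one-way ANOVA on the absolute deviations of each value from its group median. The helper `levene_median()` and the toy data below are assumptions for illustration.

```r
# Levene's test with center = median: a one-way ANOVA on the absolute
# deviations of each value from its group median
levene_median <- function(y, g) {
  g <- as.factor(g)
  z <- abs(y - ave(y, g, FUN = median))
  anova(lm(z ~ g))
}

set.seed(1)
# Hypothetical toy data: same mean in every group, very different spread
y <- c(rnorm(30, mean = 50, sd = 1),
       rnorm(30, mean = 50, sd = 10),
       rnorm(30, mean = 50, sd = 30))
g <- rep(c("1", "2", "3"), each = 30)

res <- levene_median(y, g)
res
p_value <- res[1, "Pr(>F)"]
p_value
```

In the toy data all groups share the same mean, yet the test rejects the null hypothesis because the spreads differ.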
I will check whether the variable is normally distributed within each cluster with the Shapiro-Wilk normality test.
library(dplyr)
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
Cleaned_Global_Wine_Data_2023_ %>%
group_by(as.factor(Cleaned_Global_Wine_Data_2023_$Group)) %>%
shapiro_test(Consumption)
## # A tibble: 3 × 4
## `as.factor(Cleaned_Global_Wine_Data_2023_$Group)` variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 1 Consumpt… 0.255 1.00e-16
## 2 2 Consumpt… 0.730 2.46e- 2
## 3 3 Consumpt… 0.832 2.51e- 2
Explanation of results:
H0: The data follows a normal distribution.
H1: The data does not follow a normal distribution.
The p-values in all three clusters are less than 0.05 (<0.001, 0.025, 0.025), so the null hypothesis is rejected in each case: consumption is not normally distributed in any of the three clusters.
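The same per-group check can be done with base R alone, which may be convenient when `rstatix` is not available. The sketch below runs `shapiro.test()` within each group of a hypothetical toy data set.

```r
set.seed(1)
# Hypothetical toy data: one strongly right-skewed group, one roughly symmetric group
values <- c(rlnorm(100, meanlog = 0, sdlog = 1.5),
            rnorm(100, mean = 10, sd = 2))
group <- rep(c("skewed", "symmetric"), each = 100)

# Shapiro-Wilk p-value within each group, using base R only
p_by_group <- tapply(values, group, function(v) shapiro.test(v)$p.value)
p_by_group
```

A small p-value for a group means normality is rejected for that group, which is the situation observed for consumption in all three clusters.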
Because the assumption of normality is violated in all three groups, I will perform the non-parametric alternative to ANOVA, the Kruskal-Wallis rank sum test.
kruskal.test(Consumption ~ as.factor(Group),
data = Cleaned_Global_Wine_Data_2023_)
##
## Kruskal-Wallis rank sum test
##
## data: Consumption by as.factor(Group)
## Kruskal-Wallis chi-squared = 24.625, df = 2, p-value = 4.496e-06
H0: All distribution locations of consumption are the same.
H1: At least one distribution location of consumption is different.
Based on the p-value (p < 0.001), we reject the null hypothesis: the distribution location of consumption differs for at least one cluster.
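The Kruskal-Wallis test only says that at least one group differs, not which one. A possible follow-up, which was not part of the analysis above, is a set of pairwise Wilcoxon rank sum tests with a multiple-testing correction. The sketch below shows the idea on hypothetical toy data.

```r
set.seed(1)
# Hypothetical toy data: groups "1" and "2" are similar, group "3" is shifted upwards
consumption <- c(rexp(30, rate = 1),
                 rexp(30, rate = 1),
                 rexp(30, rate = 1) + 5)
group <- factor(rep(c("1", "2", "3"), each = 30))

# Omnibus test: is at least one group's location different?
kw <- kruskal.test(consumption ~ group)
kw$p.value

# Pairwise Wilcoxon rank sum tests with Holm correction to see which groups differ
pw <- pairwise.wilcox.test(consumption, group, p.adjust.method = "holm")
pw$p.value
```

The resulting matrix holds one adjusted p-value per pair of groups, so it shows directly which clusters drive the overall difference.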
This analysis aimed to determine whether countries could be effectively grouped based on their wine trade characteristics: exports, imports, and production. The ANOVA results confirm that the clusters differ significantly on all three variables, and the Kruskal-Wallis test shows that wine consumption also differs between them.
After standardizing the data and determining the optimal number of clusters, three groups emerged:
Emerging Wine Markets (Cluster 1): These countries exhibit below-average exports, imports, and production, indicating that they are not major players in the global wine industry. Their involvement in wine trade is relatively low, suggesting that wine is either a minor industry or that domestic consumption is met through local production.
Wine Importers (Cluster 2): Countries in this group show significantly high import values, suggesting a strong reliance on external markets to meet their wine demand. These countries don’t produce significant amounts of wine but have high consumer demand, making them key destinations for global wine exports.
Leading Wine Producers & Exporters (Cluster 3): This cluster includes countries with high production and export levels, playing a dominant role in the global wine trade. These countries, such as Italy, France, and Spain, are known for their strong wine industries and contribute significantly to international wine markets.
These results can be valuable for people in the wine industry, policymakers, and trade experts, as they give a clearer picture of different market segments. Countries in Cluster 1 could be good places for businesses looking to expand, while Cluster 2 countries are big wine buyers, making them great targets for exporters. Meanwhile, the countries in Cluster 3 are already major wine producers and exporters, so they should focus on staying competitive by improving quality and innovating.
Overall, the clustering analysis did a great job of identifying different wine trade patterns around the world, showing how useful data analysis can be in understanding global markets.