Correspondence Analysis & Wine

thematic::thematic_rmd(
  bg = "#FDFEFE",
  fg = "#2E7874",
  accent = "#067A5C",
  font = thematic::font_spec("Roboto Condensed", scale = 1),
  qualitative = paletteer::paletteer_d("dutchmasters::pearl_earring"),
  sequential = thematic::sequential_gradient(0.5, 0.75))

Idea

This project has a practical goal: to answer the vital and everlasting question of where to buy a bottle of wine. Which store should you go to if you have specific preferences, and what will you find on the shelves when you get there?

In these challenging times, it is becoming difficult to find a bottle of dry white wine from Portugal. If you are planning to open your own alcohol business, some of my conclusions may also be useful to you.

I chose three alcohol distributors, AMwine (“Ароматный мир”), K&B (“Красное и Белое”), and Winelab (“Винлаб”), based on three factors:

  1. They are the leaders in the market
  2. They have a satisfactory filtering system on their website
  3. The shops are located in St. Petersburg

I chose the wine categories and countries I am most interested in, restricted to those with at least one option in each of the stores. This leaves no NAs, which makes the stores easier to compare.

Data

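library(readxl)      # read_excel()
library(kableExtra)  # kbl(), kable_styling(), %>%
library(gplots)      # balloonplot()
library(corrplot)    # corrplot()
library(FactoMineR)  # CA()
library(factoextra)  # fviz_*()

wine <- read_excel("~/data1/WINE.xlsx")
row <- wine$...1          # the first (unnamed) column holds the wine labels
col <- names(wine[2:4])   # the three store names
# Convert the counts to a matrix, then to a table with proper dimnames
winemat <- wine[-1]
names(winemat) <- NULL
winemat <- as.matrix(winemat)
colnames(winemat) <- col
row.names(winemat) <- row
winemat <- as.table(winemat)
kbl(winemat, align = "lll") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), font_size = 6, full_width = FALSE)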
Wine              K&B   Winelab   Amwine
Portugal R          7         2       26
Portugal W          8         2       15
France R           12        43       40
France W           10        35       26
Italy R            30        67       58
Italy W            22        49       63
Spain R            29        48       83
Spain W            17        19       43
Georgia R          26        22       22
Georgia W           7        12        7
Russia R           40        35       46
Russia W           30        35       41
Chile W             8        13       10
Chile R             7        21       19
South Africa W      5         5       10
South Africa R      3         4       19
Argentina W         2         5        2
Argentina R         4         7       10

W stands for white wine, R for red wine.

Visualization of contingency table

balloonplot(
  t(winemat),
  main = "Wine in Shops",
  xlab = "",
  ylab = "",
  label = FALSE,
  show.margins = FALSE,
  colmar = 3,
  rowmar = 1,
  text.size = 1)

At first glance, it seems that one should not go to K&B at all. Let’s explore this further.
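One caveat before reading too much into the balloon sizes: they show raw counts, so a store with a smaller catalog looks weak in every row. The following is only a sketch, with the counts re-entered by hand from the table above; the names `counts`, `store_totals`, `store_share` and `col_profiles` are mine and not part of the original analysis. It compares the store totals and the column profiles.

```r
# Counts re-entered from the contingency table (an assumption: typed by hand,
# rows in the same order as the table, columns K&B, Winelab, Amwine).
counts <- matrix(
  c( 7,  2, 26,   8,  2, 15,  12, 43, 40,  10, 35, 26,
    30, 67, 58,  22, 49, 63,  29, 48, 83,  17, 19, 43,
    26, 22, 22,   7, 12,  7,  40, 35, 46,  30, 35, 41,
     8, 13, 10,   7, 21, 19,   5,  5, 10,   3,  4, 19,
     2,  5,  2,   4,  7, 10),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("K&B", "Winelab", "Amwine")))

# Total number of listed positions per store and its share of all positions
store_totals <- colSums(counts)
store_share  <- round(100 * store_totals / sum(counts), 1)

# Column profiles: each store's catalog as proportions that sum to 1
col_profiles <- prop.table(counts, margin = 2)

print(store_totals)
print(store_share)
```

With these counts K&B lists 267 of the 1231 positions, so part of its apparent weakness is simply catalog size. The chi-square test compares observed counts with the counts expected from the margins, which takes this into account.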

Chi-square test

chisq.test(winemat)
## 
##  Pearson's Chi-squared test
## 
## data:  winemat
## X-squared = 95.836, df = 34, p-value = 8.453e-08
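# chisq.test(winemat)$residuals   # Pearson residuals
# Standardized residuals: cells with |value| above about 2 deviate most from independence
k <- chisq.test(winemat)$stdres
kbl(k, align = "ccc") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), font_size = 8, full_width = FALSE)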
Wine                    K&B      Winelab       Amwine
Portugal R       -0.2460745   -3.6287763    3.6792871
Portugal W        1.2637504   -2.8111457    1.6423614
France R         -2.2299898    2.3102122   -0.3601715
France W         -1.6017782    2.7130595   -1.2676898
Italy R          -0.7544155    2.4611247   -1.7302022
Italy W          -1.5685512    0.5480167    0.7779573
Spain R          -1.1729490   -1.2681246    2.1885310
Spain W          -0.0380543   -2.0095176    1.9559179
Georgia R         3.2303050   -0.5466173   -2.1594430
Georgia W         0.6544395    1.2700731   -1.7597563
Russia R          3.1953219   -1.3451606   -1.3657038
Russia W          1.7278889   -0.3228937   -1.1258705
Chile W           0.5632998    0.8891029   -1.3192448
Chile R          -1.1527194    1.5060045   -0.4847743
South Africa W    0.3621641   -0.8960768    0.5572922
South Africa R   -1.2694172   -2.0670880    3.0337381
Argentina W       0.0389072    1.3377738   -1.3133650
Argentina R      -0.2963147   -0.1079913    0.3495123

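The p-value is far below 0.05, so the row and column variables are statistically significantly associated: each shop’s catalog leans toward certain types of wine. In the table of standardized residuals, values above about 2 in absolute value mark the cells that deviate most from independence.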
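To make the test less of a black box, here is a minimal base-R sketch that rebuilds the statistic, the degrees of freedom and the standardized residuals by hand. The counts are re-entered from the contingency table, and all object names (`counts`, `expected`, `chi2`, `stdres`) are my own, so treat it as an illustration and not as the code behind the output above.

```r
# Counts re-entered by hand from the contingency table (assumption), one wine
# per row in table order; columns are K&B, Winelab, Amwine.
counts <- matrix(
  c( 7,  2, 26,   8,  2, 15,  12, 43, 40,  10, 35, 26,
    30, 67, 58,  22, 49, 63,  29, 48, 83,  17, 19, 43,
    26, 22, 22,   7, 12,  7,  40, 35, 46,  30, 35, 41,
     8, 13, 10,   7, 21, 19,   5,  5, 10,   3,  4, 19,
     2,  5,  2,   4,  7, 10),
  ncol = 3, byrow = TRUE)

n <- sum(counts)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(counts), colSums(counts)) / n

# Pearson chi-square statistic and its degrees of freedom
chi2    <- sum((counts - expected)^2 / expected)
df      <- (nrow(counts) - 1) * (ncol(counts) - 1)
p_value <- pchisq(chi2, df, lower.tail = FALSE)

# Standardized residuals: (observed - expected) divided by the residual's
# standard error, which also depends on the row and column proportions
row_prop <- rowSums(counts) / n
col_prop <- colSums(counts) / n
stdres <- (counts - expected) /
  sqrt(expected * outer(1 - row_prop, 1 - col_prop))

print(c(chi2 = chi2, df = df, p_value = p_value))
```

The degrees of freedom come from (18 - 1) × (3 - 1) = 34, which matches the `chisq.test()` output.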

corrplot(t(chisq.test(winemat)$stdres), is.corr=FALSE)

Now we see that K&B does have categories in which it is the best option for consumers: Georgian and Russian red wine. Winelab, while a good provider of French wines, is not the best option for my favorite dry white wine from Portugal. Interestingly, the geography of the stores’ catalogs differs a lot.

CA

res.ca <- CA(winemat, graph = FALSE)
print(res.ca)
## **Results of the Correspondence Analysis (CA)**
## The row variable has  18  categories; the column variable has 3 categories
## The chi square of independence between the two variables is equal to 95.83634 (p-value =  8.453298e-08 ).
## *The results are available in the following objects:
## 
##    name              description                   
## 1  "$eig"            "eigenvalues"                 
## 2  "$col"            "results for the columns"     
## 3  "$col$coord"      "coord. for the columns"      
## 4  "$col$cos2"       "cos2 for the columns"        
## 5  "$col$contrib"    "contributions of the columns"
## 6  "$row"            "results for the rows"        
## 7  "$row$coord"      "coord. for the rows"         
## 8  "$row$cos2"       "cos2 for the rows"           
## 9  "$row$contrib"    "contributions of the rows"   
## 10 "$call"           "summary called parameters"   
## 11 "$call$marge.col" "weights of the columns"      
## 12 "$call$marge.row" "weights of the rows"

If the data were random, each axis would be expected to account for the following share of the variance:

Rows: 1/17 (wine types - 1) ≈ 5.9%

Columns: 1/2 (stores - 1) = 50%
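The two reference values follow directly from the table dimensions. A small sketch of the arithmetic (the variable names are illustrative only):

```r
# Dimensions of the contingency table: 18 wine types, 3 stores
n_rows <- 18
n_cols <- 3

# Share of variance (in percent) an average axis would explain
# if the data were random
row_threshold <- 100 / (n_rows - 1)
col_threshold <- 100 / (n_cols - 1)

print(round(c(rows = row_threshold, columns = col_threshold), 2))
```

The larger of the two, 50%, is the value used for the dashed reference line in the scree plot.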

res.ca$eig
##       eigenvalue percentage of variance cumulative percentage of variance
## dim 1 0.04710405               60.50428                          60.50428
## dim 2 0.03074837               39.49572                         100.00000
fviz_screeplot(res.ca) +
 geom_hline(yintercept = 50, linetype = 2, color = "#A65141FF")

According to this graph, only one dimension exceeds the 50% reference line and should be used in the solution. However, two dimensions are all we have, and we still need to look at all three stores, so we will continue to work with the two-dimensional solution.
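For readers curious where the eigenvalues come from, they can be reproduced in base R as the squared singular values of the matrix of standardized deviations from independence. This is a sketch of the standard CA computation, not necessarily how `CA()` is implemented internally. The counts are re-entered by hand and every object name is my own.

```r
# Counts re-entered by hand from the contingency table (assumption)
counts <- matrix(
  c( 7,  2, 26,   8,  2, 15,  12, 43, 40,  10, 35, 26,
    30, 67, 58,  22, 49, 63,  29, 48, 83,  17, 19, 43,
    26, 22, 22,   7, 12,  7,  40, 35, 46,  30, 35, 41,
     8, 13, 10,   7, 21, 19,   5,  5, 10,   3,  4, 19,
     2,  5,  2,   4,  7, 10),
  ncol = 3, byrow = TRUE)

n <- sum(counts)
P <- counts / n               # correspondence matrix (relative frequencies)
row_mass <- rowSums(P)        # row weights
col_mass <- colSums(P)        # column weights

# Standardized deviations from independence
S <- (P - outer(row_mass, col_mass)) / sqrt(outer(row_mass, col_mass))

# At most min(rows, columns) - 1 non-trivial dimensions: here 3 - 1 = 2
n_dim <- min(dim(counts)) - 1
eig   <- svd(S)$d[seq_len(n_dim)]^2
pct   <- 100 * eig / sum(eig)

print(round(cbind(eigenvalue = eig, percent = pct), 5))
```

The eigenvalues sum to the total inertia, which equals the chi-square statistic divided by the number of observations. This is why a table with three stores can never have more than two dimensions.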

#fviz_ca_biplot(res.ca, repel = TRUE, col.row = "#394165FF", title = "CA Biplot for stores and wine")

#fviz_ca_biplot(res.ca, repel = TRUE,
             # map = "colprincipal",
             # arrow = c(TRUE, TRUE), col.col = "#A65141FF", col.row = "#394165FF" )

(Adding pictures in R seems to be criminally time-consuming, so I added them in another app.)

The angle between the arrows for a store and a wine is acute for K&B with Russian red and white wine and Georgian red wine; for AMwine with red wine from South Africa and Spain; and for Winelab with red Italian wine and white Argentine wine.

An acute angle indicates a strong association. Overall, this gave me as a buyer a picture of each store’s assortment and of what I will find on the shelves when I go there.
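The angles can also be checked numerically instead of by eye. The sketch below assumes the asymmetric `colprincipal` map from the commented-out call above (stores in principal coordinates, wines in standard coordinates) and computes the cosine of the angle for every wine/store pair. The counts are re-entered by hand, and all object names are hypothetical.

```r
# Counts re-entered by hand from the contingency table (assumption)
wines <- c("Portugal R", "Portugal W", "France R", "France W", "Italy R",
           "Italy W", "Spain R", "Spain W", "Georgia R", "Georgia W",
           "Russia R", "Russia W", "Chile W", "Chile R", "South Africa W",
           "South Africa R", "Argentina W", "Argentina R")
counts <- matrix(
  c( 7,  2, 26,   8,  2, 15,  12, 43, 40,  10, 35, 26,
    30, 67, 58,  22, 49, 63,  29, 48, 83,  17, 19, 43,
    26, 22, 22,   7, 12,  7,  40, 35, 46,  30, 35, 41,
     8, 13, 10,   7, 21, 19,   5,  5, 10,   3,  4, 19,
     2,  5,  2,   4,  7, 10),
  ncol = 3, byrow = TRUE,
  dimnames = list(wines, c("K&B", "Winelab", "Amwine")))

P <- counts / sum(counts)
row_mass <- rowSums(P)
col_mass <- colSums(P)
S  <- (P - outer(row_mass, col_mass)) / sqrt(outer(row_mass, col_mass))
sv <- svd(S)

# Wines in standard coordinates, stores in principal coordinates (2 dimensions)
row_std  <- sv$u[, 1:2] / sqrt(row_mass)
col_prin <- sweep(sv$v[, 1:2], 2, sv$d[1:2], "*") / sqrt(col_mass)

# Cosine of the angle between each wine arrow and each store arrow
inner  <- row_std %*% t(col_prin)
norms  <- outer(sqrt(rowSums(row_std^2)), sqrt(rowSums(col_prin^2)))
cosine <- pmin(pmax(inner / norms, -1), 1)   # clamp rounding noise
dimnames(cosine) <- dimnames(counts)

angle_deg <- acos(cosine) * 180 / pi         # below 90 degrees = acute
print(round(angle_deg[c("Russia R", "Georgia R", "Portugal R"), ], 1))
```

Because the two dimensions carry all of the inertia here, the sign of each cosine matches the sign of the observed count minus the expected count. An acute angle therefore means that the store lists more of that wine than independence would predict. The length of the arrows still matters for how strong that surplus is.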

Contributions of rows

#corrplot(t(res.ca$row$contrib), is.corr = FALSE)

The rows contributing most to Dim.1 are red wine from Portugal and white wine from South Africa, while for Dim.2 they are Georgian and Russian red wine.

Column representation

fviz_cos2(res.ca, choice = "col", axes = 1:2)

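All our distributors are well represented: the two dimensions together account for 100% of the inertia, so each store’s cos2 over axes 1 and 2 equals 1.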

Conclusion:

1) Strong association:

Russian red and white wine, Georgian red wine — K&B

Red wine from South Africa and Spain — AMwine

Red Italian wine and white Argentine wine — Winelab

2) In general, CA and visualization are enough to get a picture of the assortments within the selected stores.

3) Do I think that the first 20 lines of code and balloon plot would be enough to come to similar conclusions? Yes.

4) It seems to me that, with the same amount of effort, it would be possible to come up with something that combines a more thorough comparison of the stores with a description of each.

5) However, it is interesting to see that every store’s assortment is associated with red wine. I don’t like it that much. And if I wanted to open my own wine store, a marketing campaign built around selling lots and lots of good white wine would not be a bad idea!