Clustering and Multidimensional Scaling

library(cluster)
library(factoextra)
library(flexclust)
library(fpc)
library(clustertend)
library(hopkins)
library(ClusterR)
library(dplyr)
library(tidyverse)
library(rstatix)
library(ggpubr)
library(psych)
library(readxl)
library(Rcpp)
library(NbClust)
library(FeatureImpCluster)
library(ggplot2)
library(spdep)
library(rgdal)
library(maptools)
library(sp)
library(RColorBrewer)
library(classInt)
library(GISTools)
library(maps)
library(corrplot)
library(clusterSim)
library(GGally)

Introduction

This paper presents two families of methods, clustering (k-means, labelled KNN throughout this paper, and PAM) and dimensional scaling (PCA), applied to a dataset of popular wines. The dataset was created by scraping Vivino, one of the most popular wine websites, using Beautiful Soup, a Python library. The first part of the paper describes how the raw data were processed into a usable dataset. The second and third parts focus on clustering and on dimensional scaling, respectively. The final chapter summarises the results of the unsupervised learning algorithms used.

Clustering / Multidimensional Scaling Explanation of method

Clustering and multidimensional scaling are methods of unsupervised learning. These algorithms search for hidden patterns and structures in unlabeled data. The name refers to the fact that the algorithms do not need human supervision in the form of labelled examples.

Clustering identifies groups of observations in a dataset. The aim is to create groups such that the elements within a group are similar to each other and dissimilar to the elements of other groups.

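Each observation is assigned to a group according to its distance from the cluster center. In KNN (k-means) the center is the mean of the group and need not be a point that exists in the dataset; in PAM the center is a medoid, a specific point from the dataset. Distances between points can be calculated using several metrics (Minkowski, Manhattan, Euclidean).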
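The difference between the two kinds of cluster center can be checked on toy data. The sketch below is only an illustration under stated assumptions: the data are simulated, and `is_row_of` is a hypothetical helper written for this example, not part of the analysis in this paper.

```r
library(cluster)  # provides pam()

set.seed(1)
# Toy data: two well-separated groups in two dimensions
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2))

km <- kmeans(x, centers = 2, nstart = 10)  # centers are coordinate means
pm <- pam(x, k = 2)                        # centers are medoids (actual rows of x)

# Hypothetical helper: is a given center identical to some row of the data?
is_row_of <- function(center, data) {
  any(apply(data, 1, function(r) all(r == center)))
}

kmeans_centers_in_data <- apply(km$centers, 1, is_row_of, data = x)
pam_medoids_in_data    <- apply(pm$medoids, 1, is_row_of, data = x)
```

On data like this, the PAM medoids are always rows of `x`, while the k-means centers are averages that generally coincide with no observed point.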

Multidimensional scaling (principal coordinate analysis) also builds on distances: it maps items to a low-dimensional space based on their distances to each other. As a result, similar items should lie closer together on the plot than dissimilar ones.

In this paper I use a method that works on numeric variables, PCA. As in clustering, distances between points can be calculated using different metrics (Minkowski, Manhattan, Euclidean, or a general dissimilarity).
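As a small illustration of the metrics and of distance-based scaling, the sketch below uses three made-up points (not wine data) and only base R functions, `dist()` and `cmdscale()`.

```r
# Three toy observations with two numeric features
x <- rbind(a = c(0, 0), b = c(3, 4), c = c(6, 0))

d_euclid    <- dist(x, method = "euclidean")         # a-b: sqrt(3^2 + 4^2) = 5
d_manhattan <- dist(x, method = "manhattan")         # a-b: |3| + |4| = 7
d_minkowski <- dist(x, method = "minkowski", p = 3)  # a-b: (3^3 + 4^3)^(1/3)

# Classical multidimensional scaling (principal coordinate analysis):
# find 2-D coordinates whose Euclidean distances approximate d_euclid
coords <- cmdscale(d_euclid, k = 2)

# For Euclidean input in two dimensions the distances are reproduced
max_error <- max(abs(dist(coords) - d_euclid))
```

The recovered coordinates may be rotated or mirrored relative to `x`; only the pairwise distances are preserved.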

Dataset - Web Scraping

As noted in the introduction, the data were scraped from the Vivino website using a Python library. The aim was to select the most popular wines and, for each of them, to obtain the information listed below.

## [1] "Wine"           "Winery"         "Coutry"         "Region"         "Style"          "Grapes"         "Price"          "Rating"         "Number_of_mark"

##                 Wine       Winery        Coutry         Region          Style  Grapes     Price Rating Number_of_mark
## 1 Vinho Verde Branco Casal Garcia      Portugal    Vinho Verde     White wine   Blend      <NA>    3.8  76668 ratings
## 2               <NA>      Farnese         Italy        Abruzzo       Red wine   Blend   150 zł.    4.3  61609 ratings
## 3       Memoro Rosso      Piccini         Italy        Toscana       Red wine   Blend      <NA>    3.8  56353 ratings
## 4           Prosecco     La Marca         Italy       Prosecco Sparkling wine   Glera      <NA>    3.8  27818 ratings
## 5    Mucho Más Tinto  Félix Solís         Spain Vino de Espana       Red wine   Blend      <NA>    4.1  23976 ratings
## 6               <NA>     Barefoot United States     California     White wine Moscato 34.90 zł.    3.8  17905 ratings

Dataset - Preparing data for unsupervised learning algorithms

After web scraping, the dataset contained variables that were not numeric, while clustering and PCA require numeric data. Below I present the cleaning and transformation steps that lead to the final dataset used in the analysis. In the first step, I deleted the rows with missing or duplicated values. I also concatenated the name of the winery and the name of the wine to improve the label: wines from different wineries may share a name, and labels have to be unique.

Wina$Wine <-paste(Wina$Winery, Wina$Wine)
Wina <- (na.omit(Wina))
Wina <- Wina[!duplicated(Wina[,1:6]),]
Wina <- Wina[,-c(2)]
paste("Number of rows in dataframe:", dim(Wina)[1])

## [1] "Number of rows in dataframe: 1643"

paste("Number of columns in dataframe:", dim(Wina)[2])

## [1] "Number of columns in dataframe: 8"

In the next step, the columns holding numeric values (price, popularity and rating) were converted from text to numbers.

# Keep only the text before the first space ("34.90 zł." -> "34.90", "76668 ratings" -> "76668")
Wina$Price <- sub(" .*$", "", Wina$Price)
Wina$Number_of_mark <- sub(" .*$", "", Wina$Number_of_mark)
# Convert the cleaned text columns to numbers
Wina$Rating <- as.numeric(Wina$Rating)
Wina$Price <- as.numeric(Wina$Price)
Wina$Number_of_mark <- as.numeric(Wina$Number_of_mark)

For the variables describing country and region, I first limited the countries to those that appear most frequently (the code below keeps the 14 most frequent).

Selected_country <- names(sort(table(Wina$Coutry), decreasing = TRUE))[1:14]
Wina <- Wina[Wina$Coutry %in% Selected_country, ]

In the second step, the most popular regions were selected and each was assigned its longitude and latitude, which are numeric variables. These coordinates are also used later to present the clustering results on maps. The other variables (style of wine and grapes used) were converted to dummy variables: one if the wine is made from a given grape or in a given style, and zero otherwise.

Wina %>% group_by(Region) %>% summarize(n=n()) %>% arrange(desc(n))

## # A tibble: 340 x 2
##    Region                    n
##    <chr>                 <int>
##  1 Rioja                    91
##  2 Douro                    90
##  3 Alentejano               50
##  4 Ribera del Duero         49
##  5 Stellenbosch             42
##  6 Südtirol - Alto Adige    35
##  7 Toscana                  33
##  8 Marlborough              28
##  9 Alentejo                 27
## 10 Puglia                   24
## # ... with 330 more rows

Selected_Region <- c("Vino Nobile di Montepulciano","Veneto","Toro","Somontano","Sancerre","Sicilia","Salento","Rueda","Península de Setúbal","Penedes","Moscato d'Asti","Monçao e Melgaço","Lugana","Dao","Etna","Aconcagua Valley", "Alentejano","Alentejo","Alicante","Alsace","Bairrada","Barbera d'Alba","Barbera d'Asti","Barossa","Barossa Valley","Castilla","Castilla y León","Constantia","Chablis","Chianti Classico","Coastal Trhion","Coastal Region","Colchagua Valley","Crozes-Hermitage","Douro","Gavi","Goriška Brda","Isola dei Nuraghi","Langhe","Lisboa","Médoc","Maipo Valley","Marlborough","Mendoza","Morgon","Mosel",
                     "Navarra","Paarl", "Pfalz","Pic-Saint-Loup","Primitivo di Manduria","Priorat","Puglia",
                     "Rías Baixas","Rheingau","Ribera del Duero","Rioja","Robertson","Rosso di Montalcino","Südtirol - Alto Adige","Salento",
                     "Salice Salentino","Stellenbosch","Tejo","Terre Siciliane","Tokaj","Toscana","Umbria","Veneto","Valpolicella",
                     "Valpolicella Ripasso", "Valpolicella Ripasso Classico","Vinho Verde","Western Cape")
Latitude <- c(43.77,45.97,41.52,42.03,47.19,37.79,40.15,41.41,38.26,41.28,44.90,41.14,45.58,40.37,37.75,-32.65,38.20, 38.20, 38.34, 48.67, 39.59,44.70, 44.54,-34.53,-34.53, 39.8, 39.8, -34.01,
              47.48, 43.32, -34, -34, -34.35, 45.10, 41.06, 44.41, 
              46.04, 41.06, 44.48, 38.74, 45.00, -33.36, 42.20, -32.53, 41.16, 50.21, 
              42.81, -33.42, 49.25, 43.46, 41.00, 41.07, 41, 
              42.23, 50.08, 41.7, 42.3, -34.24, 41.03, 46.73, 41, 
              40.23, -33.55, 39.23, 37.3, 48.07, 43.77, 43, 45.04, 45.33, 45.33, 45.33, 41.4, -34)
Longitude <- c(11.24,12.30,-5.39,-0.11,2.50,15.20,15.11,-4.95,-9.05,1.14,8.20,8.17,9.93,-8.14,14.99,-70.01,-8.2, -8.2, -0.28, 7, -8.23, 8.03, 8.12, 138.56,138.56,4, 4, 18.25, 
                3.47, 11.18, 20, 20, 71.24, 4.84, -7.47,  8.48,
                13.53, 9.17, 8.18, -9.14, 1, -71.37, -71.33, 68.49, 2.1, 7.36, 
                0.65, 18.57, 8.19, 2.48, 16.3, 0.48, 16.3, 
                -8.71, 8.04, 1.9, -2.25, 20.50, 13.82, 11.28, 16.3, 
                17.57, 18.51, -8.68, 14, 21.25, 11.26, 12.5, 11.47, 10.54, 10.54, 10.54, -8.3, 20)
Wina <- Wina[Wina$Region %in% Selected_Region, ]

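# DF is the data frame of dummy-coded wines derived from Wina (its construction is not shown here)
Latitude_R <- matrix(NA, nrow = dim(DF)[1],1)
Longitude_R <- matrix(NA, nrow = dim(DF)[1],1)
# Copy each region's coordinates to every wine from that region;
# for a region listed twice in Selected_Region the later entry wins
for(j in 1:length(Latitude)){
  for(z in which(DF$Region == Selected_Region[j])){
    Latitude_R[z] <- Latitude[j]
    Longitude_R[z] <- Longitude[j]
  }
}
DF$Region_Latitude <- Latitude_R
DF$Region_Longitude <- Longitude_R

# Drop the 8th column and name the label column
DF <- DF[-8]
colnames(DF)[1] <- "Wine"
tail(DF,5)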

##                                  Wine Style_Red Style_White Style_Spark Price Rating Popularity Grapes_Blend Grapes_Tempranillo Grapes_Sauvignon_Blanc Grapes_Primitivo Grapes_Sangiovese Grapes_Cabernet_Sauvignon Grapes_Chardonnay Grapes_Riesling Grapes_Touriga_Nacional Grapes_Shiraz_Syrah Grapes_Barbera Grapes_Malbec Grapes_Pinotage Grapes_Nebbiolo Grapes_Other Region_Latitude Region_Longitude
## 1012         Caliterra Tributo Malbec         1           0           0 50.26    3.8        222            0                  0                      0                0                 0                         0                 0               0                       0                   0              0             1               0               0            0          -34.35            71.24
## 1013 Casal da Coelheira Reserva Tinto         1           0           0 79.99    4.0        222            1                  0                      0                0                 0                         0                 0               0                       0                   0              0             0               0               0            0           39.23            -8.68
## 1014      Telmo Rodriguez Pegaso Zeta         1           0           0 68.48    3.9        221            0                  0                      0                0                 0                         0                 0               0                       0                   0              0             0               0               0            1           39.80             4.00
## 1015                     Punica Samas         0           1           0 55.29    3.8        221            1                  0                      0                0                 0                         0                 0               0                       0                   0              0             0               0               0            0           41.06             9.17
## 1016   Colinas do Douro Reserva Tinto         1           0           0 40.95    3.9        221            1                  0                      0                0                 0                         0                 0               0                       0                   0              0             0               0               0            0           41.06            -7.47
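The 0/1 columns in the table above (Style_*, Grapes_*) come from the dummy-variable step, whose code is not shown. A minimal sketch of such an encoding follows, assuming a toy table named `wines` and an illustrative shortlist `frequent_grapes`; both are hypothetical stand-ins, not the objects used in the analysis.

```r
# Toy stand-in for the scraped table (column names follow the paper)
wines <- data.frame(
  Wine   = c("A", "B", "C", "D"),
  Style  = c("Red wine", "White wine", "Sparkling wine", "Red wine"),
  Grapes = c("Blend", "Malbec", "Glera", "Tempranillo"),
  stringsAsFactors = FALSE
)

# One 0/1 column per wine style
wines$Style_Red   <- as.integer(wines$Style == "Red wine")
wines$Style_White <- as.integer(wines$Style == "White wine")
wines$Style_Spark <- as.integer(wines$Style == "Sparkling wine")

# One 0/1 column per frequent grape; every other grape falls into Grapes_Other
frequent_grapes <- c("Blend", "Malbec", "Tempranillo")  # assumed shortlist
for (g in frequent_grapes) {
  wines[[paste0("Grapes_", g)]] <- as.integer(wines$Grapes == g)
}
wines$Grapes_Other <- as.integer(!(wines$Grapes %in% frequent_grapes))
```

Each wine then has exactly one grape indicator set to one, which is the pattern visible in the rows printed above.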

Finally, in the last step of data preparation, I separated the “Wine” column as the row labels and rescaled every numeric variable to the range 0-1 (min-max scaling).
My experiments showed that without this rescaling popularity has too much influence on the results.

Data <- as.matrix(DF[,2:dim(DF)[2]])
Label <- DF[,1]

Standarization <- function(x){(x-min(x))/(max(x)-min(x))}
Data <- apply(Data[,], 2, function(x) Standarization(x) )
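Why rescaling matters can be seen on three made-up wines. The sketch below is an illustration only; the values and the helper name `min_max` are assumptions chosen for this example.

```r
# Three toy wines: w2 differs from w1 mainly in rating,
# w3 differs from w1 only in popularity
x <- rbind(w1 = c(Rating = 3.5, Popularity = 1000),
           w2 = c(Rating = 4.5, Popularity = 1010),
           w3 = c(Rating = 3.5, Popularity = 1100))

# Same min-max rescaling as above, applied column by column
min_max <- function(v) (v - min(v)) / (max(v) - min(v))
x_scaled <- apply(x, 2, min_max)

d_raw    <- as.matrix(dist(x))         # distances on the original scale
d_scaled <- as.matrix(dist(x_scaled))  # distances after rescaling to 0-1
```

On the raw scale the distance from w1 to w3 is about ten times the distance from w1 to w2, purely because popularity is measured in larger numbers. After rescaling, the full rating difference and the full popularity difference count about equally.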

Maps of Wine Regions

Below is a map of the world and of Europe (the continent from which most wines in the analysed dataset originate). Each wine region is shown as an orange triangle.

Clustering

Hopkins statistic

The Hopkins statistic indicates whether the data can be clustered. A result close to 1 indicates that the dataset is not uniformly distributed (i.e., it contains meaningful clusters). This conclusion is also supported by the plot of the ordered dissimilarity matrix (ODM).

We can see some red squares (low distance = low dissimilarity, i.e. high similarity). The data are clusterable (clusters are visible).

## [1] "Hopkins statistics: 0.99999999999893"
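To make the idea behind the statistic concrete, here is a hand-rolled sketch in base R. It assumes Euclidean distance and a uniform reference distribution over the bounding box of the data. The function name `hopkins_sketch` is hypothetical, and packaged implementations differ in details (for example, some raise the distances to a power), so its values are not directly comparable with the output above.

```r
# Minimal Hopkins-type statistic (illustrative sketch, not a package function)
hopkins_sketch <- function(X, m = ceiling(nrow(X) / 10)) {
  X <- as.matrix(X)
  n <- nrow(X)
  # Distance from point p to its nearest row of Y
  nn_dist <- function(p, Y) min(sqrt(rowSums(sweep(Y, 2, p)^2)))

  # w: distances from m sampled real points to their nearest other real point
  idx <- sample(n, m)
  w <- sapply(idx, function(i) nn_dist(X[i, ], X[-i, , drop = FALSE]))

  # u: distances from m uniform random points to their nearest real point
  lo <- apply(X, 2, min)
  hi <- apply(X, 2, max)
  U <- sapply(seq_len(ncol(X)), function(j) runif(m, lo[j], hi[j]))
  U <- matrix(U, nrow = m)
  u <- apply(U, 1, nn_dist, Y = X)

  sum(u) / (sum(u) + sum(w))  # near 1: clustered; near 0.5: no structure
}

set.seed(42)
# Toy data: two tight groups far apart
clustered <- rbind(matrix(rnorm(200, mean = 0,  sd = 0.1), ncol = 2),
                   matrix(rnorm(200, mean = 10, sd = 0.1), ncol = 2))
h <- hopkins_sketch(clustered, m = 20)
```

For strongly clustered data the real points sit much closer to each other than random points sit to the data, so the ratio approaches 1.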

Selection of number of clusters

Unsupervised learning is the part of machine learning in which the role of the human is minimal. However, clustering methods such as KNN and PAM require the number of clusters to be defined in advance.

The literature offers many ways to select an appropriate number of groups. In this paper I use some of the most popular ones: silhouette, gap statistic, Calinski-Harabasz index and Akaike Information Criterion. For each of them, a plot of the statistic for different numbers of clusters is presented below.

The silhouette for both the KNN and PAM methods suggests about 18 or 19 clusters. The gap statistic suggests 4 clusters for KNN and 20 or more for PAM (the search was capped at 20 clusters).
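The silhouette criterion can be sketched as follows: fit PAM for a range of cluster counts and keep the count with the highest average silhouette width. The example assumes simulated data with three groups and a search grid of 2 to 6; it is not the code that produced the plots in this paper.

```r
library(cluster)  # pam() stores silhouette information in $silinfo

set.seed(7)
# Toy data with three well-separated groups
x <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 4, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 8, sd = 0.3), ncol = 2))

k_grid <- 2:6
# Average silhouette width of the PAM solution for each candidate k
avg_sil <- sapply(k_grid, function(k) pam(x, k = k)$silinfo$avg.width)
best_k <- k_grid[which.max(avg_sil)]
```

On data this clean the maximum falls at three clusters. On real data the curve is often flat, which is one reason the different criteria in this section disagree.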

The Calinski-Harabasz index suggests 2 clusters for the KNN method. For the PAM method the search limit (20 clusters) was again selected as giving the best split of the dataset.

## [1] "Best number of clusters in KNN method based on Calinski-Harabasz statistic: 2"

## [1] "Best number of clusters in PAM method based on Calinski-Harabasz statistic: 20"
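The Calinski-Harabasz index is the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom. A minimal sketch computing it from a base R `kmeans` fit is shown below; `ch_index` is a hypothetical helper and the data are simulated, so this illustrates the formula rather than the code used for the results above.

```r
# Calinski-Harabasz index from a kmeans fit:
# (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k))
ch_index <- function(fit, n) {
  k <- length(fit$size)
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
}

set.seed(3)
# Toy data with two well-separated groups
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2))

k_grid <- 2:6
ch <- sapply(k_grid, function(k) {
  ch_index(kmeans(x, centers = k, nstart = 20), n = nrow(x))
})
best_k <- k_grid[which.max(ch)]
```

Higher values indicate compact, well-separated clusters, so the number of clusters with the largest index is preferred.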

The last plot, the Akaike Information Criterion for KNN, suggests two candidate numbers of clusters: 6-7 or 11-13. At 6-7 clusters the AIC is not minimal, but the drop between 4 and 6-7 is large, and the value at 6-7 is relatively close to the minimum, which is reached at 11 and almost again at 13 clusters.

Clustering

Based on the above plots I decided to create three clusterings of the wine dataset: two using the KNN method (7 and 13 clusters) and one using the PAM method (19 clusters). The two-dimensional cluster plots and the silhouette for each clustering are presented below, in the order KNN (7), KNN (13), PAM (19). The highest average silhouette was achieved by the PAM method (more than 0.68).

##   cluster size ave.sil.width
## 1       1  174          0.06
## 2       2  134          0.70
## 3       3  130          0.70
## 4       4  158          0.22
## 5       5   92          0.76
## 6       6   40          0.74
## 7       7  244          0.69

##    cluster size ave.sil.width
## 1        1  143          0.05
## 2        2    9          0.32
## 3        3   53          0.38
## 4        4   49          0.76
## 5        5   92          0.76
## 6        6   40          0.74
## 7        7  152          0.13
## 8        8   30          0.52
## 9        9   89          0.49
## 10      10   31          0.63
## 11      11   55          0.59
## 12      12  134          0.69
## 13      13   95          0.32

##    cluster size ave.sil.width
## 1        1  213          0.69
## 2        2   92          0.76
## 3        3   14          0.73
## 4        4   55          0.52
## 5        5  116          0.71
## 6        6   39          0.83
## 7        7   40          0.74
## 8        8   31          0.59
## 9        9   14          0.84
## 10      10   49          0.71
## 11      11   19          0.52
## 12      12   31          0.54
## 13      13  129          0.63
## 14      14   13          0.85
## 15      15   29          0.52
## 16      16   17          0.90
## 17      17   20          0.79
## 18      18   26          0.52
## 19      19   25          0.75

Statistics for created clusters

The Caliński-Harabasz index is a measure of clustering quality and is used to compare clusterings. The statistic is highest for the PAM clustering.

## [1] "Calinski-Harabasz statistic value for KNN (7 clusters): 348.98"

## [1] "Calinski-Harabasz statistic value for KNN (13 clusters): 298.17"

## [1] "Calinski-Harabasz statistic value for PAM (19 clusters): 609.91"

Another statistic for evaluating a clustering is the shadow value. The R documentation states: “The shadow value of each data point is defined as twice the distance to the closest centroid divided by the sum of distances to the closest and second-closest centroid. If the shadow values of a point are close to 0, then the point is close to its cluster centroid”.

The shadow plots show that the values for the clusters are well below 1, so the clusters may be accepted as well separated.
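The definition quoted above translates directly into a few lines of base R. The sketch below assumes Euclidean distance and uses a hypothetical helper `shadow_values` on made-up points; it illustrates the formula and is not the routine that produced the shadow plots.

```r
# Shadow value of each point: 2 * d1 / (d1 + d2), where d1 and d2 are the
# distances to the closest and second-closest centroid
shadow_values <- function(x, centers) {
  # Distance from every row of x to every centroid (rows: points, cols: centroids)
  d <- apply(centers, 1, function(ctr) sqrt(rowSums(sweep(x, 2, ctr)^2)))
  d <- matrix(d, nrow = nrow(x))
  apply(d, 1, function(row) {
    s <- sort(row)
    2 * s[1] / (s[1] + s[2])
  })
}

# Three points on a line, centroids at 0 and 4
x <- rbind(c(0, 0), c(1, 0), c(2, 0))
centers <- rbind(c(0, 0), c(4, 0))
sv <- shadow_values(x, centers)
```

The point sitting on a centroid gets 0, the point a quarter of the way to the other centroid gets 0.5, and the point exactly halfway between the two centroids gets 1.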

## Found more than one class "kcca" in cache; using the first, from namespace 'flexclust'

## Also defined by 'kernlab'

The importance of the variables for the PAM clustering is presented below. The most important variables proved to be “Grapes_Blend”, “Grapes_Other”, “Region_Latitude”, “Style Red” and “Style wine”. They have the biggest influence on how the clusters are formed.

The continuous numeric variables are also presented as boxplots for each of the 19 clusters. The figures show why, among them, only Region Latitude is an important variable in this clustering: it is the only one whose values differ markedly between clusters. The other variables shown have similar values in every cluster.

##   Style_Red Style_White Style_Spark      Price    Rating Popularity Grapes_Blend Grapes_Tempranillo Grapes_Sauvignon_Blanc Grapes_Primitivo Grapes_Sangiovese Grapes_Cabernet_Sauvignon Grapes_Chardonnay Grapes_Riesling Grapes_Touriga_Nacional Grapes_Shiraz_Syrah Grapes_Barbera Grapes_Malbec Grapes_Pinotage Grapes_Nebbiolo Grapes_Other Region_Latitude Region_Longitude Clust
## 1         1           0           0 0.25915364 0.5714286  1.0000000            1                  0                      0                0                 0                         0                 0               0                       0                   0              0             0               0               0            0       0.9424121        0.3901777     1
## 2         1           0           0 0.07903104 0.2857143  0.9617280            0                  1                      0                0                 0                         0                 0               0                       0                   0              0             0               0               0            0       0.9066557        0.3292526     2
## 3         1           0           0 0.34441276 0.4285714  0.9195847            0                  1                      0                0                 0                         0                 0               0                       0                   0              0             0               0               0            0       0.9066557        0.3292526     2
## 4         1           0           0 0.19392271 0.4285714  0.8871195            0                  0                      0                0                 0                         0                 0               0                       0                   0              0             1               0               0            0       0.0236016        0.6662221     3
## 5         1           0           0 0.07450749 0.0000000  0.8228928            1                  0                      0                0                 0                         0                 0               0                       0                   0              0             0               0               0            0       0.8920227        0.3043872     1
## 6         1           0           0 0.14855607 0.4285714  0.7953546            0                  1                      0                0                 0                         0                 0               0                       0                   0              0             0               0               0            0       0.9066557        0.3292526     2

Clusters presented on maps

The above analysis (CH index) guided the choice of the final clustering of the data; the maps below use the PAM clustering with 19 clusters. The maps (world and Europe) follow those at the end of the previous chapter, with the colour of each symbol representing the cluster of the most popular wine in that region. Jitter was used to show wines from the same region that fall into other clusters.

Judging by the maps, the split of wines into clusters does not follow a clear geographical pattern.

maps::map('world', xlim = c(-120, 170), ylim = c(-50,60))
plot(Winery_Points, pch=2, axes=TRUE, col = PAM19$cluster, add=TRUE, cex = 1.5, lwd = 2)
mtext("Plot of all wine on world map - PAM(19 clusters)", side = 3, line = 1, cex = 1)

maps::map('world', xlim = c(-10, 22), ylim = c(29,54))
plot(Winery_Points, pch=2, axes=TRUE, col = PAM19$cluster, add=TRUE, cex = 1.2, lwd = 2)
mtext("Part of Europe with most wine regions in dataset - PAM(19 clusters)", side = 3, line = 1, cex = 1)

Multidimensional scaling

The second method presented in the paper is multidimensional scaling, carried out here with PCA. Like clustering, it is an unsupervised method that looks for structure in the data. It was already used above to present the clusters in two-dimensional plots.

In the first stage, I analysed the correlation of the variables using the Spearman method, since high correlation may affect the result of PCA. In the dataset, the variables for the grapes Touriga Nacional and Barbera are strongly positively correlated. A negative correlation exists between red and white wine (a red wine cannot be white).

For a closer look at the continuous numeric variables (Price, Popularity, Number of marks, Latitude, Longitude), an additional plot of their correlations and distributions was created. The distributions of latitude and longitude reflect the fact that most wines come from Europe. Most wines in the dataset are relatively cheap and have relatively low popularity.
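A Spearman correlation matrix of this kind can be obtained with base R's `cor()`. The sketch below uses simulated columns whose names merely mimic the wine data; it shows why mutually exclusive dummy variables produce a strong negative correlation.

```r
set.seed(5)
n <- 50
style_red   <- rbinom(n, 1, 0.6)
style_white <- 1 - style_red          # in this toy example a wine is red or white
price       <- rexp(n, rate = 1 / 60)

toy <- data.frame(Style_Red = style_red,
                  Style_White = style_white,
                  Price = price)

# Rank-based (Spearman) correlation matrix
rho <- cor(toy, method = "spearman")
```

Because `Style_White` is fully determined by `Style_Red` here, their correlation is exactly -1. In the real data a third style (sparkling) exists, so the relationship is strong but not perfect.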

To investigate dimensional scaling I decided to use PCA (Principal Component Analysis). The dataset was already standardised during data processing, but the R documentation advises centring and scaling the data, so I did that as well.

The first two principal components explain 20% of the variance of the variables; 14 components are needed to explain more than 80%, and the first 18 explain more than 95%.

The two-dimensional plots of the individual observations, of the variables, and of both together are provided below. In the plot of individual observations the groups are separated: red wines are located in the left part of the figure and white wines on the right. The longitude and latitude variables also point in different directions. The combined plot shows that the groups are mostly driven by the variables Style_Red, Style_White, Region Longitude, Grapes Touriga Nacional and Barbera. The other grape types used by the winery, the popularity of the wine and its price may also have some influence.

Influence of variables

Below, the absolute values of the variable loadings are presented. They correspond to the lengths of the arrows in the plots above. Besides the variables mentioned above, the grape Sauvignon_Blanc also has a relatively large value.
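The variance shares quoted here are the kind of figures that can be read off a `prcomp` fit. A minimal sketch on simulated data (three columns, two of them correlated) follows; the object names are illustrative and the numbers have nothing to do with the wine results.

```r
set.seed(11)
n <- 200
# Toy data: two correlated columns plus one independent column
z <- rnorm(n)
toy <- cbind(a = z + rnorm(n, sd = 0.3),
             b = z + rnorm(n, sd = 0.3),
             c = rnorm(n))

# PCA on centred and scaled data
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Share of variance explained by each component, and the running total
var_share <- pca$sdev^2 / sum(pca$sdev^2)
cum_share <- cumsum(var_share)

# Smallest number of components explaining at least 80% of the variance
n_comp_80 <- which(cum_share >= 0.80)[1]

# Absolute loadings on the first component, largest first
abs_loadings_pc1 <- sort(abs(pca$rotation[, 1]), decreasing = TRUE)
```

In this toy case the two correlated columns collapse into one component, so two components pass the 80% threshold. In the wine data many dummy variables are nearly independent, which is why far more components are needed there.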

##                 Style_Red               Style_White    Grapes_Sauvignon_Blanc          Region_Longitude              Grapes_Other           Region_Latitude         Grapes_Chardonnay Grapes_Cabernet_Sauvignon        Grapes_Tempranillo             Grapes_Malbec            Grapes_Barbera   Grapes_Touriga_Nacional                Popularity              Grapes_Blend       Grapes_Shiraz_Syrah         Grapes_Sangiovese           Grapes_Pinotage          Grapes_Primitivo                     Price           Grapes_Nebbiolo           Grapes_Riesling                    Rating               Style_Spark 
##                0.55173368                0.54912623                0.29588685                0.24588003                0.21698058                0.17438122                0.16190688                0.12729021                0.12387194                0.12181586                0.11921436                0.11921436                0.10667376                0.10661389                0.09954066                0.08813833                0.08555922                0.08490978                0.07689928                0.04831947                0.03884699                0.03870779                0.03636740

Rotated PCA

A rotated PCA with five components, varimax rotation and a loading cutoff of 0.5 shows that the rotated components are built from the variables described above. RC1 can be described as wine style, with the grape Sauvignon_Blanc loading strongly enough to behave like a style of its own. RC2 contains the grape types Touriga Nacional and Barbera. RC3 is built from the geographical coordinates. RC4 carries the price and rating variables (popularity loads on the same component at 0.44, below the cutoff). RC5 captures the contrast between two grape variables: Blend, for wines made from more than one type of grape, and Other, which equals one if the wine uses grapes too rare in the data to have a variable of their own.

## Warning in cor.smooth(r): Matrix was not positive definite, smoothing was done

## Warning in principal(Data, nfactor = 5, rotate = "varimax"): The matrix is not positive semi-definite, scores found from Structure loadings

## Principal Components Analysis
## Call: principal(r = Data, nfactors = 5, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                             RC1   RC2   RC3   RC4   RC5    h2     u2 com
## Style_Red                  0.85  0.09  0.22  0.24 -0.24 0.903 0.0970 1.5
## Style_White               -0.87 -0.08 -0.22 -0.21  0.22 0.897 0.1025 1.4
## Style_Spark                0.02 -0.03 -0.02 -0.14  0.12 0.038 0.9619 2.2
## Price                      0.02  0.10  0.06  0.54  0.22 0.352 0.6476 1.5
## Rating                    -0.20 -0.03 -0.08  0.61 -0.04 0.417 0.5834 1.3
## Popularity                 0.04 -0.04  0.06  0.44 -0.15 0.218 0.7818 1.3
## Grapes_Blend               0.20 -0.12 -0.14 -0.32 -0.74 0.725 0.2748 1.7
## Grapes_Tempranillo         0.23 -0.05 -0.15  0.45  0.02 0.285 0.7154 1.8
## Grapes_Sauvignon_Blanc    -0.64 -0.01 -0.14  0.17 -0.23 0.506 0.4940 1.5
## Grapes_Primitivo           0.10 -0.03  0.01  0.32  0.02 0.115 0.8851 1.2
## Grapes_Sangiovese          0.23 -0.02 -0.05  0.09  0.15 0.088 0.9123 2.2
## Grapes_Cabernet_Sauvignon -0.01  0.00  0.45  0.01 -0.07 0.211 0.7885 1.1
## Grapes_Chardonnay         -0.41  0.01  0.13  0.09  0.07 0.200 0.7997 1.4
## Grapes_Riesling            0.10 -0.02 -0.09  0.08 -0.03 0.027 0.9732 3.2
## Grapes_Touriga_Nacional    0.02  1.00 -0.03 -0.03 -0.01 0.995 0.0046 1.0
## Grapes_Shiraz_Syrah        0.08  0.00  0.24  0.08  0.08 0.075 0.9252 1.7
## Grapes_Barbera             0.02  1.00 -0.03 -0.03 -0.01 0.995 0.0046 1.0
## Grapes_Malbec             -0.04 -0.01  0.49  0.07 -0.02 0.249 0.7509 1.1
## Grapes_Pinotage           -0.02  0.00  0.34 -0.08 -0.10 0.133 0.8673 1.3
## Grapes_Nebbiolo            0.12 -0.01 -0.04  0.12  0.09 0.037 0.9627 3.1
## Grapes_Other              -0.04 -0.10 -0.14 -0.33  0.77 0.731 0.2691 1.5
## Region_Latitude            0.10  0.05 -0.86  0.10  0.08 0.762 0.2382 1.1
## Region_Longitude           0.23  0.03  0.67 -0.08  0.27 0.582 0.4179 1.6
## 
##                        RC1  RC2  RC3  RC4  RC5
## SS loadings           2.35 2.05 2.03 1.59 1.53
## Proportion Var        0.10 0.09 0.09 0.07 0.07
## Cumulative Var        0.10 0.19 0.28 0.35 0.41
## Proportion Explained  0.25 0.21 0.21 0.17 0.16
## Cumulative Proportion 0.25 0.46 0.67 0.84 1.00
## 
## Mean item complexity =  1.6
## Test of the hypothesis that 5 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.07 
##  with the empirical chi square  2670.1  with prob <  0 
## 
## Fit based upon off diagonal values = 0.71

## 
## Loadings:
##                           RC1    RC2    RC3    RC4    RC5   
## Style_Red                  0.853                            
## Style_White               -0.865                            
## Grapes_Sauvignon_Blanc    -0.636                            
## Grapes_Touriga_Nacional           0.996                     
## Grapes_Barbera                    0.996                     
## Region_Latitude                         -0.856              
## Region_Longitude                         0.671              
## Price                                           0.536       
## Rating                                          0.606       
## Grapes_Blend                                          -0.740
## Grapes_Other                                           0.771
## Style_Spark                                                 
## Popularity                                                  
## Grapes_Tempranillo                                          
## Grapes_Primitivo                                            
## Grapes_Sangiovese                                           
## Grapes_Cabernet_Sauvignon                                   
## Grapes_Chardonnay                                           
## Grapes_Riesling                                             
## Grapes_Shiraz_Syrah                                         
## Grapes_Malbec                                               
## Grapes_Pinotage                                             
## Grapes_Nebbiolo                                             
## 
##                  RC1   RC2   RC3   RC4   RC5
## SS loadings    2.347 2.045 2.028 1.590 1.532
## Proportion Var 0.102 0.089 0.088 0.069 0.067
## Cumulative Var 0.102 0.191 0.279 0.348 0.415

Goodness of fit - Multidimensional Scaling

Uniqueness (u2 in the output above) is the proportion of a variable's variance that is not shared with the other variables; when it is high, PCA needs more components to describe the variance at a given level. It reaches high values for the variables with short arrows in the plots. Variables that are only weakly related to the others create problems when we want to group wines.
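The link between rotated loadings, communality (h2) and uniqueness (u2) can be sketched in base R with `prcomp()` and `varimax()`. This is an analogue on simulated data, not the routine that produced the output above, and the column names `x1` to `x5` are made up.

```r
set.seed(21)
n <- 300
f1 <- rnorm(n)
f2 <- rnorm(n)
# Toy data: two pairs of related columns and one unrelated column
toy <- cbind(x1 = f1 + rnorm(n, sd = 0.4),
             x2 = f1 + rnorm(n, sd = 0.4),
             x3 = f2 + rnorm(n, sd = 0.4),
             x4 = f2 + rnorm(n, sd = 0.4),
             x5 = rnorm(n))

pca <- prcomp(toy, center = TRUE, scale. = TRUE)
k <- 2  # number of components kept

# Component loadings: eigenvectors scaled by the component standard deviations
L <- pca$rotation[, 1:k] %*% diag(pca$sdev[1:k])
rot <- varimax(L)  # orthogonal rotation towards a simpler loading pattern

# Communality: variance of each variable captured by the kept components
communality <- rowSums(unclass(rot$loadings)^2)
uniqueness  <- setNames(1 - communality, colnames(toy))
```

The rotation redistributes the loadings between components but leaves each variable's communality unchanged. The unrelated column `x5` ends up with a uniqueness close to 1, the same pattern as the variables with short arrows in the wine data.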

Conclusion

This paper presented an analysis using clustering and multidimensional scaling on wine data that I collected myself. In the first chapter, I described the process of preparing the data. The second part was dedicated to clustering methods. The number of clusters for the KNN and PAM methods was selected using various statistics, and the methods were compared using the CH index. The clusters were presented on maps, and statistics for the most interesting variables were presented in boxplots. For scaling, the PCA method was used. The most important variables in the dimension reduction turned out to be the style of wine, the geographical coordinates and some grape types (Touriga Nacional and Barbera).

Adrian Szymański

2022-02-25
