library(cluster)
library(factoextra)
library(flexclust)
library(fpc)
library(clustertend)
library(hopkins)
library(ClusterR)
library(dplyr)
library(tidyverse)
library(rstatix)
library(ggpubr)
library(psych)
library(readxl)
library(Rcpp)
library(NbClust)
library(FeatureImpCluster)
library(ggplot2)
library(spdep)
library(rgdal)
library(maptools)
library(sp)
library(RColorBrewer)
library(classInt)
library(GISTools)
library(maps)
library(corrplot)
library(clusterSim)
library(GGally)
Introduction
This paper presents two families of unsupervised learning methods, clustering (k-means, labelled KNN throughout this paper, and PAM) and dimensional scaling (PCA), applied to a dataset of popular wines. The dataset was created by scraping Vivino, one of the most popular wine websites, with Beautiful Soup, a Python library. The first part of the paper describes how the raw data were processed into a usable dataset. The second and third parts cover clustering and dimensional scaling, respectively. The final chapter summarises the results of the unsupervised learning algorithms used.
Clustering / Multidimensional Scaling - Explanation of methods
Clustering and dimension reduction are unsupervised learning methods. These algorithms search for hidden patterns and structure in unlabelled data. The name refers to the fact that the algorithms do not need human intervention in the form of labelled examples.
Clustering identifies groups of observations in a dataset. The aim is to form groups whose elements are similar to each other and dissimilar to the elements of other groups.
Each observation is assigned to a group according to its distance from the cluster centre. In k-means (KNN in this paper) the centre is a centroid, which need not be a point of the dataset; in PAM it is a medoid, an actual point of the dataset. Distances between points can be calculated with several metrics (Minkowski, Manhattan, Euclidean).
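As a small illustration of the ideas above, the sketch below computes the three distance metrics for two toy points and contrasts a k-means centroid with a PAM medoid. The toy data and all variable names are assumptions made for this example; they are not part of the wine dataset.

```r
library(cluster)  # pam()

# Two toy points: (0, 0) and (3, 4)
pts <- rbind(a = c(0, 0), b = c(3, 4))
d_euclid <- as.numeric(dist(pts, method = "euclidean"))         # sqrt(3^2 + 4^2)
d_manhat <- as.numeric(dist(pts, method = "manhattan"))         # |3| + |4|
d_minkow <- as.numeric(dist(pts, method = "minkowski", p = 3))  # (3^3 + 4^3)^(1/3)

# Toy data: two groups of 20 points in 2 dimensions
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(40, mean = 4, sd = 0.5), ncol = 2))

km <- kmeans(x, centers = 2, nstart = 10)  # centres are cluster means
pm <- pam(x, k = 2)                        # centres are observed points

km$centers       # centroids: in general not rows of x
x[pm$id.med, ]   # medoids: always rows of x
```

The medoid indices in `pm$id.med` point back into the data, which is what makes PAM centres directly interpretable as "representative" observations.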
Multidimensional scaling (Principal Coordinate Analysis) also builds a mapping of items based on their distances to one another. As a result, similar items should lie closer together on the plot than dissimilar items.
In this paper I use PCA, which works on numerical variables. As in clustering, distances between points can be calculated with different metrics (Minkowski, Manhattan, Euclidean) or with a dissimilarity measure.
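The link between principal coordinate analysis and PCA can be checked numerically. The sketch below, on invented toy data, uses base R `cmdscale()` and `prcomp()`: with Euclidean distances the two give the same coordinates up to the sign of each axis, while another metric gives a different configuration. It is an illustration only, not part of the analysis in this paper.

```r
set.seed(42)
# Toy data: 30 observations, 4 variables with clearly different spreads
x <- cbind(rnorm(30, sd = 4), rnorm(30, sd = 2),
           rnorm(30, sd = 1), rnorm(30, sd = 0.5))

# Classical MDS (principal coordinate analysis) on Euclidean distances
mds <- cmdscale(dist(x, method = "euclidean"), k = 2)

# PCA on the centred, unscaled data
pca <- prcomp(x, center = TRUE, scale. = FALSE)

# The two sets of coordinates agree up to the sign of each axis
max_diff <- max(abs(abs(mds) - abs(pca$x[, 1:2])))
max_diff

# A different metric (here Manhattan) leads to a different configuration
mds_manhattan <- cmdscale(dist(x, method = "manhattan"), k = 2)
```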
Dataset - Web Scraping
As mentioned in the introduction, the data were scraped from the Vivino website using a Python library. The aim was to select the most popular wines and to obtain, for each of them, the information listed below.
## [1] "Wine" "Winery" "Coutry" "Region" "Style" "Grapes" "Price" "Rating" "Number_of_mark"
##                 Wine       Winery        Coutry         Region          Style  Grapes     Price Rating Number_of_mark
## 1 Vinho Verde Branco Casal Garcia      Portugal    Vinho Verde     White wine   Blend      <NA>    3.8  76668 ratings
## 2               <NA>      Farnese         Italy        Abruzzo       Red wine   Blend   150 zł.    4.3  61609 ratings
## 3       Memoro Rosso      Piccini         Italy        Toscana       Red wine   Blend      <NA>    3.8  56353 ratings
## 4           Prosecco     La Marca         Italy       Prosecco Sparkling wine   Glera      <NA>    3.8  27818 ratings
## 5    Mucho Más Tinto  Félix Solís         Spain Vino de Espana       Red wine   Blend      <NA>    4.1  23976 ratings
## 6               <NA>     Barefoot United States     California     White wine Moscato 34.90 zł.    3.8  17905 ratings
Dataset - Preparing data for unsupervised learning algorithms
After web scraping, the dataset contained variables that were not numeric, whereas clustering and PCA require numeric data. The steps used to clean and transform the data into the final dataset are presented below. In the first step, I concatenated the winery name and the wine name to build a better label, because wines from different wineries may share a name and labels have to be unique. I then deleted rows with missing values and duplicate rows.
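# Build a unique label from winery and wine name.
# Note: paste() turns a missing wine name into the string "NA",
# so na.omit() below only drops rows with NA in the other columns.
Wina$Wine <- paste(Wina$Winery, Wina$Wine)
Wina <- na.omit(Wina)
# Drop rows duplicated on the first six columns
Wina <- Wina[!duplicated(Wina[, 1:6]), ]
# Drop the Winery column (now part of the label)
Wina <- Wina[, -2]
paste("Number of rows in dataframe:", dim(Wina)[1])
## [1] "Number of rows in dataframe: 1643"
paste("Number of columns in dataframe:", dim(Wina)[2])
## [1] "Number of columns in dataframe: 8"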
In the next step, the columns holding numeric values (Price, Rating and Number_of_mark, i.e. popularity) were converted from text to numbers.
# Keep only the leading token (e.g. "34.90" from "34.90 zł.") and convert to numeric
Wina$Price <- as.numeric(sub(" .*", "", Wina$Price))
Wina$Number_of_mark <- as.numeric(sub(" .*", "", Wina$Number_of_mark))
Wina$Rating <- as.numeric(Wina$Rating)
For the Country and Region variables, I first limited the countries to the most frequent ones (the code below keeps the top 14).
Selected_country <- names(sort(table(Wina$Coutry), decreasing = TRUE))[1:14]
Wina <- Wina[Wina$Coutry %in% Selected_country, ]
In the second step, the more popular regions were selected and each region was assigned its latitude and longitude, which are numeric variables. These coordinates are also used later to present the clustering results on maps. The other variables (wine style and grapes used) were converted to dummy variables: one if the wine has a given style or grape, zero otherwise.
Wina %>% group_by(Region) %>% summarize(n=n()) %>% arrange(desc(n))
## # A tibble: 340 x 2
## Region n
## <chr> <int>
## 1 Rioja 91
## 2 Douro 90
## 3 Alentejano 50
## 4 Ribera del Duero 49
## 5 Stellenbosch 42
## 6 Südtirol - Alto Adige 35
## 7 Toscana 33
## 8 Marlborough 28
## 9 Alentejo 27
## 10 Puglia 24
## # ... with 330 more rows
Selected_Region <- c("Vino Nobile di Montepulciano","Veneto","Toro","Somontano","Sancerre","Sicilia","Salento","Rueda","Península de Setúbal","Penedes","Moscato d'Asti","Monçao e Melgaço","Lugana","Dao","Etna","Aconcagua Valley", "Alentejano","Alentejo","Alicante","Alsace","Bairrada","Barbera d'Alba","Barbera d'Asti","Barossa","Barossa Valley","Castilla","Castilla y León","Constantia","Chablis","Chianti Classico","Coastal Trhion","Coastal Region","Colchagua Valley","Crozes-Hermitage","Douro","Gavi","Goriška Brda","Isola dei Nuraghi","Langhe","Lisboa","Médoc","Maipo Valley","Marlborough","Mendoza","Morgon","Mosel",
"Navarra","Paarl", "Pfalz","Pic-Saint-Loup","Primitivo di Manduria","Priorat","Puglia",
"Rías Baixas","Rheingau","Ribera del Duero","Rioja","Robertson","Rosso di Montalcino","Südtirol - Alto Adige","Salento",
"Salice Salentino","Stellenbosch","Tejo","Terre Siciliane","Tokaj","Toscana","Umbria","Veneto","Valpolicella",
"Valpolicella Ripasso", "Valpolicella Ripasso Classico","Vinho Verde","Western Cape")
Latitude <- c(43.77,45.97,41.52,42.03,47.19,37.79,40.15,41.41,38.26,41.28,44.90,41.14,45.58,40.37,37.75,-32.65,38.20, 38.20, 38.34, 48.67, 39.59,44.70, 44.54,-34.53,-34.53, 39.8, 39.8, -34.01,
47.48, 43.32, -34, -34, -34.35, 45.10, 41.06, 44.41,
46.04, 41.06, 44.48, 38.74, 45.00, -33.36, 42.20, -32.53, 41.16, 50.21,
42.81, -33.42, 49.25, 43.46, 41.00, 41.07, 41,
42.23, 50.08, 41.7, 42.3, -34.24, 41.03, 46.73, 41,
40.23, -33.55, 39.23, 37.3, 48.07, 43.77, 43, 45.04, 45.33, 45.33, 45.33, 41.4, -34)
Longitude <- c(11.24,12.30,-5.39,-0.11,2.50,15.20,15.11,-4.95,-9.05,1.14,8.20,8.17,9.93,-8.14,14.99,-70.01,-8.2, -8.2, -0.28, 7, -8.23, 8.03, 8.12, 138.56,138.56,4, 4, 18.25,
3.47, 11.18, 20, 20, 71.24, 4.84, -7.47, 8.48,
13.53, 9.17, 8.18, -9.14, 1, -71.37, -71.33, 68.49, 2.1, 7.36,
0.65, 18.57, 8.19, 2.48, 16.3, 0.48, 16.3,
-8.71, 8.04, 1.9, -2.25, 20.50, 13.82, 11.28, 16.3,
17.57, 18.51, -8.68, 14, 21.25, 11.26, 12.5, 11.47, 10.54, 10.54, 10.54, -8.3, 20)
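Wina <- Wina[Wina$Region %in% Selected_Region, ]
# DF: the dummy-coded data frame derived from Wina; its construction is not shown in this document
Latitude_R <- matrix(NA, nrow = dim(DF)[1],1)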
Longitude_R <- matrix(NA, nrow = dim(DF)[1],1)
for(j in 1:length(Latitude)){
for(z in which(DF$Region == Selected_Region[j])){
Latitude_R[z] <- Latitude[j]
Longitude_R[z] <- Longitude[j]
}
}
DF$Region_Latitude <- Latitude_R
DF$Region_Longitude <- Longitude_R
DF <- DF[-8]
colnames(DF)[1] <- "Wine"
tail(DF,5)
## Wine Style_Red Style_White Style_Spark Price Rating Popularity Grapes_Blend Grapes_Tempranillo Grapes_Sauvignon_Blanc Grapes_Primitivo Grapes_Sangiovese Grapes_Cabernet_Sauvignon Grapes_Chardonnay Grapes_Riesling Grapes_Touriga_Nacional Grapes_Shiraz_Syrah Grapes_Barbera Grapes_Malbec Grapes_Pinotage Grapes_Nebbiolo Grapes_Other Region_Latitude Region_Longitude
## 1012 Caliterra Tributo Malbec 1 0 0 50.26 3.8 222 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 -34.35 71.24
## 1013 Casal da Coelheira Reserva Tinto 1 0 0 79.99 4.0 222 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 39.23 -8.68
## 1014 Telmo Rodriguez Pegaso Zeta 1 0 0 68.48 3.9 221 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 39.80 4.00
## 1015 Punica Samas 0 1 0 55.29 3.8 221 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 41.06 9.17
## 1016 Colinas do Douro Reserva Tinto 1 0 0 40.95 3.9 221 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 41.06 -7.47
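The code that creates the dummy variables is not shown above. A minimal sketch of how such columns could be built is given here on a hypothetical toy data frame `toy`; the column names follow the outputs above, while the rows are invented. The sketch also shows a vectorised coordinate lookup with `match()` as an alternative to the double loop.

```r
# Hypothetical toy rows; column names mimic the paper's data, values are invented
toy <- data.frame(
  Wine   = c("A", "B", "C"),
  Style  = c("Red wine", "White wine", "Red wine"),
  Region = c("Rioja", "Douro", "Rioja"),
  stringsAsFactors = FALSE
)

# Dummy variables: 1 if the wine has the given style, 0 otherwise
toy$Style_Red   <- as.integer(toy$Style == "Red wine")
toy$Style_White <- as.integer(toy$Style == "White wine")

# Coordinate lookup with match() instead of a double loop
regions   <- c("Rioja", "Douro")
latitude  <- c(42.3, 41.06)    # values taken from the Latitude vector above
longitude <- c(-2.25, -7.47)   # values taken from the Longitude vector above
idx <- match(toy$Region, regions)
toy$Region_Latitude  <- latitude[idx]
toy$Region_Longitude <- longitude[idx]
toy
```

One design difference is worth noting: `match()` returns the first matching entry, whereas the loop above keeps the last one. This matters for region names that are listed twice in `Selected_Region` with different coordinates, such as Veneto and Salento.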
Finally, in the last step of data preparation, I converted the “Wine” column to row names and rescaled every numeric variable to the range 0-1 (min-max scaling). My experiments showed that without this rescaling, popularity has too much influence on the results.
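To illustrate why rescaling matters, the sketch below compares Euclidean distances before and after min-max scaling on three hypothetical wines. The values are invented for the example; only the column names echo the dataset.

```r
# Hypothetical values for three wines: price, rating and number of ratings
toy <- cbind(Price      = c(40, 45, 150),
             Rating     = c(3.8, 4.3, 3.9),
             Popularity = c(200, 60000, 250))
rownames(toy) <- c("w1", "w2", "w3")

min_max <- function(v) (v - min(v)) / (max(v) - min(v))

scaled <- apply(toy, 2, min_max)
rownames(scaled) <- rownames(toy)

d_raw    <- as.matrix(dist(toy))     # dominated by Popularity
d_scaled <- as.matrix(dist(scaled))  # every variable lies in [0, 1]

d_raw["w1", c("w2", "w3")]
d_scaled["w1", c("w2", "w3")]
```

On the raw scale, the distance from `w1` to `w2` is driven almost entirely by the difference in the number of ratings; after scaling, price and rating contribute on an equal footing.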
Data <- as.matrix(DF[,2:dim(DF)[2]])
Label <- DF[,1]
Standarization <- function(x){(x-min(x))/(max(x)-min(x))}
Data <- apply(Data, 2, Standarization)
Maps of Wine Regions
Below are maps of the world and of Europe (the continent from which most wines in the analysed dataset originate). Each wine region is shown as an orange triangle.
Clustering
Hopkins statistic
The Hopkins statistic indicates whether the data can be clustered. A result close to 1 indicates that the dataset is not uniformly distributed (i.e. it contains meaningful clusters). This conclusion is also supported by the plot of the ordered dissimilarity matrix (ODM).
The plot shows some red squares (low distance = low dissimilarity = high similarity), so the data are clusterable (clusters are visible).
## [1] "Hopkins statistics: 0.99999999999893"
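The idea behind the statistic can be sketched in a few lines of base R. The function below is one common formulation (ratio of nearest-neighbour distances from random uniform points to those from sampled data points); the helper name `hopkins_sketch` is hypothetical, and the packages loaded in this paper may differ in details such as exponents and sampling. It is not the code that produced the value above.

```r
hopkins_sketch <- function(X, m = 50) {
  X <- as.matrix(X)
  # Distance from point p to its nearest row of Y
  nn_dist <- function(p, Y) sqrt(min(rowSums(sweep(Y, 2, p)^2)))

  # w: nearest-neighbour distances of m sampled data points (excluding themselves)
  idx <- sample(nrow(X), m)
  w <- sapply(idx, function(i) nn_dist(X[i, ], X[-i, , drop = FALSE]))

  # u: distances from m uniform random points (in the bounding box) to the data
  lo <- apply(X, 2, min)
  hi <- apply(X, 2, max)
  U <- sapply(seq_len(ncol(X)), function(j) runif(m, lo[j], hi[j]))
  u <- apply(U, 1, nn_dist, Y = X)

  sum(u) / (sum(u) + sum(w))
}

set.seed(3)
# Two tight groups versus uniformly scattered points (both invented)
clustered <- rbind(matrix(rnorm(200, 0, 0.1), ncol = 2),
                   matrix(rnorm(200, 5, 0.1), ncol = 2))
uniform   <- matrix(runif(1000), ncol = 2)

h_clustered <- hopkins_sketch(clustered)  # expected to be near 1
h_uniform   <- hopkins_sketch(uniform)    # expected to be near 0.5
c(clustered = h_clustered, uniform = h_uniform)
```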
Selection of number of clusters
Unsupervised learning is the part of machine learning where the role of the human is minimal. However, clustering methods such as KNN and PAM require the number of clusters to be specified in advance.
The literature offers many ways to select an appropriate number of groups. In this paper I use the most popular ones: the silhouette, the gap statistic, the Calinski-Harabasz index and the Akaike Information Criterion. For each of them, a plot of the statistic against the number of clusters is presented below.
For both KNN and PAM the silhouette suggests about 18 or 19 clusters. The gap statistic suggests 4 clusters for KNN and 20 or more for PAM (the search was capped at 20 clusters).
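The silhouette criterion can be sketched as follows: for each candidate number of clusters, fit the clustering and average the silhouette widths, then pick the maximum. The example uses invented toy data with three groups and `cluster::silhouette()`; it illustrates the procedure and does not reproduce the plots of this paper.

```r
library(cluster)  # silhouette()

set.seed(7)
# Toy data: three well-separated groups of 50 points in 2 dimensions
centres <- rbind(c(0, 0), c(5, 5), c(0, 5))
x <- do.call(rbind, lapply(1:3, function(g)
  cbind(rnorm(50, centres[g, 1], 0.3), rnorm(50, centres[g, 2], 0.3))))
d <- dist(x)

ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(x, centers = k, nstart = 20)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- ks

round(avg_sil, 3)
best_k_sil <- ks[which.max(avg_sil)]
best_k_sil
```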
The Calinski-Harabasz index suggests 2 clusters for the KNN method. For the PAM method the upper search limit (20 clusters) was again selected as giving the best split of the dataset.
## [1] "Best number of clusters in KNN method based on Calinski-Harabasz statistic: 2"
## [1] "Best number of clusters in PAM method based on Calinski-Harabasz statistic: 20"
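The Calinski-Harabasz index is the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom. A minimal base R sketch for k-means on invented toy data is shown below; the helper `ch_index` is a hypothetical name, and the paper's own values were presumably obtained with one of the loaded packages.

```r
set.seed(7)
# Toy data: three well-separated groups of 50 points in 2 dimensions
centres <- rbind(c(0, 0), c(5, 5), c(0, 5))
x <- do.call(rbind, lapply(1:3, function(g)
  cbind(rnorm(50, centres[g, 1], 0.3), rnorm(50, centres[g, 2], 0.3))))

# CH = (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k))
ch_index <- function(km, n) {
  k <- length(km$size)
  (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
}

ks <- 2:6
ch <- sapply(ks, function(k)
  ch_index(kmeans(x, centers = k, nstart = 20), nrow(x)))
names(ch) <- ks

round(ch, 1)
best_k_ch <- ks[which.max(ch)]
best_k_ch
```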
The last plot, KNN with the Akaike Information Criterion, suggests two candidate numbers of clusters: 6 or 7, and 11 to 13. For 6 or 7 clusters the AIC is not minimal, but the drop between 4 and 6 or 7 clusters is large, and the value is relatively close to the minimum, which is reached at 11 and almost again at 13 clusters.
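The formula behind the AIC plot is not shown in this paper. One commonly used approximation for k-means, assumed here purely for illustration, is AIC = W + 2dk, where W is the total within-cluster sum of squares, d the number of variables and k the number of clusters (this treats clusters as spherical Gaussians with unit variance). The helper name `kmeans_aic` is hypothetical.

```r
set.seed(7)
# Toy data: three well-separated groups of 50 points in 2 dimensions
centres <- rbind(c(0, 0), c(5, 5), c(0, 5))
x <- do.call(rbind, lapply(1:3, function(g)
  cbind(rnorm(50, centres[g, 1], 0.3), rnorm(50, centres[g, 2], 0.3))))

# Assumed approximation: within-cluster SS plus a penalty of 2 per free centroid coordinate
kmeans_aic <- function(km) {
  d <- ncol(km$centers)
  k <- nrow(km$centers)
  km$tot.withinss + 2 * d * k
}

ks <- 1:6
aic <- sapply(ks, function(k) kmeans_aic(kmeans(x, centers = k, nstart = 20)))
names(aic) <- ks
round(aic, 1)
```

As in the plot discussed above, the useful information is often the point where the curve stops dropping sharply rather than the exact minimum.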
Clustering
Based on the above plots, I decided to create three clusterings of the wine dataset: two using the KNN method (7 and 13 clusters) and one using the PAM method (19 clusters). Two-dimensional cluster plots and silhouettes for each clustering are presented below. The highest average silhouette width was achieved by the PAM method (more than 0.68). The three tables below list the silhouette width per cluster, in the order KNN (7 clusters), KNN (13 clusters), PAM (19 clusters).
## cluster size ave.sil.width
## 1 1 174 0.06
## 2 2 134 0.70
## 3 3 130 0.70
## 4 4 158 0.22
## 5 5 92 0.76
## 6 6 40 0.74
## 7 7 244 0.69
## cluster size ave.sil.width
## 1 1 143 0.05
## 2 2 9 0.32
## 3 3 53 0.38
## 4 4 49 0.76
## 5 5 92 0.76
## 6 6 40 0.74
## 7 7 152 0.13
## 8 8 30 0.52
## 9 9 89 0.49
## 10 10 31 0.63
## 11 11 55 0.59
## 12 12 134 0.69
## 13 13 95 0.32
## cluster size ave.sil.width
## 1 1 213 0.69
## 2 2 92 0.76
## 3 3 14 0.73
## 4 4 55 0.52
## 5 5 116 0.71
## 6 6 39 0.83
## 7 7 40 0.74
## 8 8 31 0.59
## 9 9 14 0.84
## 10 10 49 0.71
## 11 11 19 0.52
## 12 12 31 0.54
## 13 13 129 0.63
## 14 14 13 0.85
## 15 15 29 0.52
## 16 16 17 0.90
## 17 17 20 0.79
## 18 18 26 0.52
## 19 19 25 0.75
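Tables with the same three columns can be obtained directly from a fitted PAM object. The sketch below uses `cluster::pam()` on invented toy data; the object `sil_tab` is a hypothetical name, and the code is an illustration rather than the code that produced the tables above.

```r
library(cluster)  # pam()

set.seed(11)
# Toy data: three well-separated groups of 50 points in 2 dimensions
centres <- rbind(c(0, 0), c(5, 5), c(0, 5))
x <- do.call(rbind, lapply(1:3, function(g)
  cbind(rnorm(50, centres[g, 1], 0.3), rnorm(50, centres[g, 2], 0.3))))

pam_fit <- pam(x, k = 3)

# Per-cluster size and average silhouette width
sil_tab <- data.frame(
  cluster       = seq_along(pam_fit$silinfo$clus.avg.widths),
  size          = pam_fit$clusinfo[, "size"],
  ave.sil.width = round(pam_fit$silinfo$clus.avg.widths, 2)
)
sil_tab

# Overall average silhouette width, the figure used to compare clusterings
pam_fit$silinfo$avg.width
```

Clusters with an average width near zero, such as cluster 1 in the first two tables above, contain many points that sit between two clusters.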
Statistics for created clusters
The Caliński-Harabasz index is a measure of clustering quality and is used to compare clusterings; higher values indicate more compact, better-separated clusters. It is highest for the PAM clustering.
## [1] "Calinski-Harabasz statistic value for KNN (7 clusters): 348.98"
## [1] "Calinski-Harabasz statistic value for KNN (13 clusters): 298.17"
## [1] "Calinski-Harabasz statistic value for PAM (19 clusters): 609.91"
Another statistic for evaluating a clustering is the shadow value. The R documentation states: “The shadow value of each data point is defined as twice the distance to the closest centroid divided by the sum of distances to the closest and second-closest centroid. If the shadow values of a point are close to 0, then the point is close to its cluster centroid”.
The shadow plots show that the values for the clusters are much lower than 1, so the clusters can be regarded as well separated.
## Found more than one class "kcca" in cache; using the first, from namespace 'flexclust'
## Also defined by 'kernlab'
## Found more than one class "kcca" in cache; using the first, from namespace 'flexclust'
## Also defined by 'kernlab'
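The definition quoted above translates directly into code. The sketch below computes shadow values by hand for a k-means fit on invented toy data, using only base R; it follows the quoted definition and is not the flexclust implementation used for the plots.

```r
set.seed(5)
# Toy data: three well-separated groups of 50 points in 2 dimensions
centres <- rbind(c(0, 0), c(5, 5), c(0, 5))
x <- do.call(rbind, lapply(1:3, function(g)
  cbind(rnorm(50, centres[g, 1], 0.3), rnorm(50, centres[g, 2], 0.3))))

km <- kmeans(x, centers = 3, nstart = 20)

# Distances from every point (rows) to every centroid (columns)
D <- sapply(seq_len(nrow(km$centers)), function(k)
  sqrt(rowSums(sweep(x, 2, km$centers[k, ])^2)))

# Sort each row so that the closest centroid comes first
d_sorted <- t(apply(D, 1, sort))

# Shadow value: 2 * d(closest) / (d(closest) + d(second closest))
shadow <- 2 * d_sorted[, 1] / (d_sorted[, 1] + d_sorted[, 2])

# Mean shadow value per cluster
tapply(shadow, km$cluster, mean)
```

By construction the value lies between 0 (the point coincides with its centroid) and 1 (the point is equally far from its two nearest centroids).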
The importance of the variables for the PAM clustering is presented below. The most important variables turned out to be “Grapes_Blend”, “Grapes_Other”, “Region_Latitude”, “Style_Red” and “Style_White”. They have the biggest influence on how the clusters are formed.
The continuous numeric variables are also shown as boxplots for each of the 19 clusters. The figures make clear why, among the continuous variables, only Region_Latitude is important in this clustering approach: only its values differ clearly between the clusters, while the other variables take similar values in every cluster.
## Style_Red Style_White Style_Spark Price Rating Popularity Grapes_Blend Grapes_Tempranillo Grapes_Sauvignon_Blanc Grapes_Primitivo Grapes_Sangiovese Grapes_Cabernet_Sauvignon Grapes_Chardonnay Grapes_Riesling Grapes_Touriga_Nacional Grapes_Shiraz_Syrah Grapes_Barbera Grapes_Malbec Grapes_Pinotage Grapes_Nebbiolo Grapes_Other Region_Latitude Region_Longitude Clust
## 1 1 0 0 0.25915364 0.5714286 1.0000000 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.9424121 0.3901777 1
## 2 1 0 0 0.07903104 0.2857143 0.9617280 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.9066557 0.3292526 2
## 3 1 0 0 0.34441276 0.4285714 0.9195847 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.9066557 0.3292526 2
## 4 1 0 0 0.19392271 0.4285714 0.8871195 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0.0236016 0.6662221 3
## 5 1 0 0 0.07450749 0.0000000 0.8228928 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.8920227 0.3043872 1
## 6 1 0 0 0.14855607 0.4285714 0.7953546 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.9066557 0.3292526 2
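The idea of variable importance for a clustering can be illustrated with a permutation approach: shuffle one variable, reassign every point to its nearest centre, and record the share of points that change cluster. The sketch below is written in that spirit on invented toy data with k-means centres; the helper names are hypothetical and the procedure is not claimed to match the FeatureImpCluster implementation exactly.

```r
set.seed(9)
n <- 200
# Toy data: x1 separates two groups, x2 is pure noise
X <- cbind(x1 = c(rnorm(n / 2, 0, 0.3), rnorm(n / 2, 4, 0.3)),
           x2 = rnorm(n, 0, 0.3))
km <- kmeans(X, centers = 2, nstart = 10)

# Assign each row of X to the nearest centre (squared Euclidean distance)
assign_nearest <- function(X, centers) {
  D <- sapply(seq_len(nrow(centers)), function(k)
    rowSums(sweep(X, 2, centers[k, ])^2))
  max.col(-D, ties.method = "first")
}

baseline <- assign_nearest(X, km$centers)

# Share of points that change cluster after permuting one variable (20 repetitions)
perm_importance <- sapply(colnames(X), function(v) {
  mean(replicate(20, {
    Xp <- X
    Xp[, v] <- sample(Xp[, v])
    mean(assign_nearest(Xp, km$centers) != baseline)
  }))
})
perm_importance
```

A variable whose values barely differ between clusters, like most of the continuous variables in the boxplots, yields an importance close to zero under this scheme.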
Clusters presented on maps
The above analysis (CH index) suggested selecting two final clusterings of the data. Based on that, I prepared maps (world and Europe) like those at the end of the previous chapter, where the colour of each symbol represents the cluster of the most popular wine in that region. Jitter was used to show wines from the same region that belong to other clusters.
On the maps, the split of wines into clusters is not clear.
maps::map('world', xlim = c(-120, 170), ylim = c(-50,60))
plot(Winery_Points, pch=2, axes=TRUE, col = PAM19$cluster, add=TRUE, cex = 1.5, lwd = 2)
mtext("Plot of all wine on world map - PAM(19 clusters)", side = 3, line = 1, cex = 1)
maps::map('world', xlim = c(-10, 22), ylim = c(29,54))
plot(Winery_Points, pch=2, axes=TRUE, col = PAM19$cluster, add=TRUE, cex = 1.2, lwd = 2)
mtext("Part of Europe with most wine regions in dataset - PAM(19 clusters)", side = 3, line = 1, cex = 1)
Multidimensional scaling
The second method presented in the paper is dimensional scaling with PCA. Like clustering, it is an unsupervised method that works from the similarity between observations. Here it is used to present the clusters in a two-dimensional plot.
In the first stage, I analysed the correlation between variables using the Spearman method, since high correlation might affect the PCA result. In the dataset there is a strong positive correlation between the grape variables Touriga Nacional and Barbera. A negative correlation exists between red and white wine (a red wine cannot be white).
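The behaviour of Spearman correlation on dummy variables can be seen on a tiny example. The vectors below are invented: mutually exclusive style dummies are negatively correlated, and two dummy columns that always take the same value would be perfectly correlated.

```r
# Invented styles for six wines
style <- c("red", "red", "red", "white", "white", "sparkling")
red   <- as.integer(style == "red")
white <- as.integer(style == "white")

# Mutually exclusive dummies: negative correlation
rho_red_white <- cor(red, white, method = "spearman")

# Two hypothetical grape dummies that always coincide: correlation of 1
g1 <- c(0, 1, 0, 0, 1, 0)
g2 <- g1
rho_identical <- cor(g1, g2, method = "spearman")

c(red_white = rho_red_white, identical = rho_identical)
```

The correlation between red and white is not exactly -1 here because a third style (sparkling) exists, so "not red" does not imply "white".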
To take a closer look at the numeric, continuous variables (Price, Popularity, Number of marks, Latitude, Longitude), an additional plot of their correlations and distributions was created. The distribution of latitude and longitude reflects the fact that most wines come from Europe. Most wines in the dataset are relatively cheap and have relatively low popularity.
To investigate dimensional scaling I used PCA (Principal Component Analysis). The dataset had already been rescaled during processing, but the R documentation advises centring and scaling the data, so I did so. The first two components explain 20% of the variance; 14 components are needed to explain more than 80%, and 18 to explain more than 95%.
The two-dimensional plots of individual observations, of variables, and of both together are provided below. In the plot of individual observations, groups are separated: wines in the red style lie in the left part of the figure and wines in the white style on the right. The longitude and latitude variables also point in different directions. The combined plot clearly shows that the groups are mostly driven by the variables Style_Red, Style_White, Region_Longitude, and the grapes Touriga Nacional and Barbera. The other grape types, the Popularity of the wine and its Price may also have some influence.
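The shares of explained variance quoted above are derived from the component standard deviations. A minimal sketch with base R `prcomp()` on invented toy data shows the calculation; the numbers it prints have nothing to do with the wine dataset.

```r
set.seed(21)
# Toy data: 100 observations of 6 variables, two of them strongly correlated
X <- matrix(rnorm(600), ncol = 6)
X[, 2] <- X[, 1] + rnorm(100, sd = 0.2)
colnames(X) <- paste0("v", 1:6)

# Centre and scale, as advised in the R documentation
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Share of variance per component and cumulative share
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_explained <- cumsum(var_explained)

# Smallest number of components reaching a given share of variance
n_for_80 <- which(cum_explained >= 0.80)[1]
n_for_95 <- which(cum_explained >= 0.95)[1]

round(cum_explained, 3)
c(n_for_80 = n_for_80, n_for_95 = n_for_95)
```

When many variables are nearly uncorrelated dummies, as in the wine data, the variance is spread over many components, which is consistent with the first two explaining only 20%.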
Influence of variables
The absolute values of the variable loadings are presented below. They correspond to the lengths of the arrows in the plots above. Besides the variables mentioned above, the grape Sauvignon_Blanc also has a relatively large value.
## Style_Red                 0.55173368
## Style_White               0.54912623
## Grapes_Sauvignon_Blanc    0.29588685
## Region_Longitude          0.24588003
## Grapes_Other              0.21698058
## Region_Latitude           0.17438122
## Grapes_Chardonnay         0.16190688
## Grapes_Cabernet_Sauvignon 0.12729021
## Grapes_Tempranillo        0.12387194
## Grapes_Malbec             0.12181586
## Grapes_Barbera            0.11921436
## Grapes_Touriga_Nacional   0.11921436
## Popularity                0.10667376
## Grapes_Blend              0.10661389
## Grapes_Shiraz_Syrah       0.09954066
## Grapes_Sangiovese         0.08813833
## Grapes_Pinotage           0.08555922
## Grapes_Primitivo          0.08490978
## Price                     0.07689928
## Grapes_Nebbiolo           0.04831947
## Grapes_Riesling           0.03884699
## Rating                    0.03870779
## Style_Spark               0.03636740
Rotated PCA
Rotated PCA with five components, varimax rotation and a loading cutoff of 0.5 shows that the rotated components are built from the variables described above. RC1 describes wine style, with the grape Sauvignon_Blanc distinctive enough to load on it as if it were a separate style. RC2 contains the grape types Touriga Nacional and Barbera. RC3 is built on the geographical coordinates. RC4 contains price and rating (Popularity, at 0.44, falls just below the cutoff). RC5 captures the contrast between two grape variables: Blend, for wines made from more than one type of grape, and Other, which equals one if the wine uses grapes too rare in the data to have a separate variable.
## Warning in cor.smooth(r): Matrix was not positive definite, smoothing was done
## Warning in principal(Data, nfactor = 5, rotate = "varimax"): The matrix is not positive semi-definite, scores found from Structure loadings
## Principal Components Analysis
## Call: principal(r = Data, nfactors = 5, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC1 RC2 RC3 RC4 RC5 h2 u2 com
## Style_Red 0.85 0.09 0.22 0.24 -0.24 0.903 0.0970 1.5
## Style_White -0.87 -0.08 -0.22 -0.21 0.22 0.897 0.1025 1.4
## Style_Spark 0.02 -0.03 -0.02 -0.14 0.12 0.038 0.9619 2.2
## Price 0.02 0.10 0.06 0.54 0.22 0.352 0.6476 1.5
## Rating -0.20 -0.03 -0.08 0.61 -0.04 0.417 0.5834 1.3
## Popularity 0.04 -0.04 0.06 0.44 -0.15 0.218 0.7818 1.3
## Grapes_Blend 0.20 -0.12 -0.14 -0.32 -0.74 0.725 0.2748 1.7
## Grapes_Tempranillo 0.23 -0.05 -0.15 0.45 0.02 0.285 0.7154 1.8
## Grapes_Sauvignon_Blanc -0.64 -0.01 -0.14 0.17 -0.23 0.506 0.4940 1.5
## Grapes_Primitivo 0.10 -0.03 0.01 0.32 0.02 0.115 0.8851 1.2
## Grapes_Sangiovese 0.23 -0.02 -0.05 0.09 0.15 0.088 0.9123 2.2
## Grapes_Cabernet_Sauvignon -0.01 0.00 0.45 0.01 -0.07 0.211 0.7885 1.1
## Grapes_Chardonnay -0.41 0.01 0.13 0.09 0.07 0.200 0.7997 1.4
## Grapes_Riesling 0.10 -0.02 -0.09 0.08 -0.03 0.027 0.9732 3.2
## Grapes_Touriga_Nacional 0.02 1.00 -0.03 -0.03 -0.01 0.995 0.0046 1.0
## Grapes_Shiraz_Syrah 0.08 0.00 0.24 0.08 0.08 0.075 0.9252 1.7
## Grapes_Barbera 0.02 1.00 -0.03 -0.03 -0.01 0.995 0.0046 1.0
## Grapes_Malbec -0.04 -0.01 0.49 0.07 -0.02 0.249 0.7509 1.1
## Grapes_Pinotage -0.02 0.00 0.34 -0.08 -0.10 0.133 0.8673 1.3
## Grapes_Nebbiolo 0.12 -0.01 -0.04 0.12 0.09 0.037 0.9627 3.1
## Grapes_Other -0.04 -0.10 -0.14 -0.33 0.77 0.731 0.2691 1.5
## Region_Latitude 0.10 0.05 -0.86 0.10 0.08 0.762 0.2382 1.1
## Region_Longitude 0.23 0.03 0.67 -0.08 0.27 0.582 0.4179 1.6
##
## RC1 RC2 RC3 RC4 RC5
## SS loadings 2.35 2.05 2.03 1.59 1.53
## Proportion Var 0.10 0.09 0.09 0.07 0.07
## Cumulative Var 0.10 0.19 0.28 0.35 0.41
## Proportion Explained 0.25 0.21 0.21 0.17 0.16
## Cumulative Proportion 0.25 0.46 0.67 0.84 1.00
##
## Mean item complexity = 1.6
## Test of the hypothesis that 5 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.07
## with the empirical chi square 2670.1 with prob < 0
##
## Fit based upon off diagonal values = 0.71
##
## Loadings:
##                               RC1    RC2    RC3    RC4    RC5
## Style_Red                   0.853
## Style_White                -0.865
## Grapes_Sauvignon_Blanc     -0.636
## Grapes_Touriga_Nacional            0.996
## Grapes_Barbera                     0.996
## Region_Latitude                          -0.856
## Region_Longitude                          0.671
## Price                                            0.536
## Rating                                           0.606
## Grapes_Blend                                           -0.740
## Grapes_Other                                            0.771
## Style_Spark
## Popularity
## Grapes_Tempranillo
## Grapes_Primitivo
## Grapes_Sangiovese
## Grapes_Cabernet_Sauvignon
## Grapes_Chardonnay
## Grapes_Riesling
## Grapes_Shiraz_Syrah
## Grapes_Malbec
## Grapes_Pinotage
## Grapes_Nebbiolo
##
## RC1 RC2 RC3 RC4 RC5
## SS loadings 2.347 2.045 2.028 1.590 1.532
## Proportion Var 0.102 0.089 0.088 0.069 0.067
## Cumulative Var 0.102 0.191 0.279 0.348 0.415
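The output above comes from `psych::principal`. The same idea, principal component loadings followed by a varimax rotation and a display cutoff, can be sketched with base R only (`prcomp()` and `stats::varimax()`). The toy data and object names below are assumptions for illustration; the sketch does not reproduce the psych output.

```r
set.seed(33)
# Toy data: two blocks of three correlated variables each
f1 <- rnorm(200)
f2 <- rnorm(200)
X <- cbind(a1 = f1 + rnorm(200, sd = 0.4), a2 = f1 + rnorm(200, sd = 0.4),
           a3 = f1 + rnorm(200, sd = 0.4),
           b1 = f2 + rnorm(200, sd = 0.4), b2 = f2 + rnorm(200, sd = 0.4),
           b3 = f2 + rnorm(200, sd = 0.4))

pca <- prcomp(X, center = TRUE, scale. = TRUE)

k <- 2
# Component loadings: eigenvectors scaled by the component standard deviations
L <- pca$rotation[, 1:k] %*% diag(pca$sdev[1:k])

# Orthogonal varimax rotation of the loadings
rot <- varimax(L)
RL  <- unclass(rot$loadings)

# Show only loadings at or above the cutoff, as in the "Loadings:" table
cutoff <- 0.5
shown <- ifelse(abs(RL) >= cutoff, round(RL, 3), NA)
shown
```

Because the rotation is orthogonal, it redistributes variance between components without changing how much of each variable is explained in total.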
Goodness of fit - Multidimensional Scaling
Uniqueness is the proportion of a variable's variance that is not shared with the other variables; when it is high, PCA needs more components to describe that variable's variance adequately. It reaches high values for the variables with short arrows in the plots. Such variables, which are only weakly related to the others, cause problems when we want to group wines.
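The relation between loadings, communality (h2) and uniqueness (u2) can be checked on rows copied from the rotated loading table above. Since the copied loadings are rounded to two decimals, the results should match the h2 and u2 columns only approximately.

```r
# Rounded loadings of three variables, copied from the table above (RC1 to RC5)
loadings <- rbind(
  Style_Red   = c( 0.85,  0.09,  0.22,  0.24, -0.24),
  Rating      = c(-0.20, -0.03, -0.08,  0.61, -0.04),
  Style_Spark = c( 0.02, -0.03, -0.02, -0.14,  0.12)
)
colnames(loadings) <- paste0("RC", 1:5)

h2 <- rowSums(loadings^2)  # communality: variance explained by the five components
u2 <- 1 - h2               # uniqueness: variance left unexplained

round(cbind(h2, u2), 2)
```

Style_Spark, whose loadings are all small, ends up with a uniqueness close to 1, which is the situation described in the paragraph above.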
Conclusion
This paper presented an analysis of wine data, which I collected myself, using clustering and dimensional scaling methods. The first chapter described the data preparation process. The second part was dedicated to clustering: the number of clusters for the KNN and PAM methods was selected using various statistics, the methods were compared using the CH index, the clusters were presented on maps, and statistics for the most interesting variables were shown as boxplots. For scaling, the PCA method was used. The variables that proved most important for dimension reduction were the style of wine, the geographical coordinates and some grape types (Touriga Nacional and Barbera).