We step into the realm of oenology, investigating the subtle dance of chemical constituents that characterizes the vast range of Italian wines. The dataset contains the results of a chemical analysis of wines made from three different cultivars grown in the same region of Italy. Our goal is to uncover the presence of clusters in the wine dataset; in other words, to see whether we can distinguish the three cultivars from their chemical profiles alone.
First, we load the required libraries and read in the dataset.
# Load the necessary libraries
library(tidyverse)
library(corrplot)
library(corrgram)
library(gridExtra)
library(GGally)
library(knitr)
library(cluster)
library(fpc)
library(factoextra)
# Load the wine dataset
wines <- read.csv("Wine.csv")
Display the first and last few rows of the dataset.
# First rows
head(wines)
## Alcohol Malic_Acid Ash Ash_Alcanity Magnesium Total_Phenols Flavanoids
## 1 14.23 1.71 2.43 15.6 127 2.80 3.06
## 2 13.20 1.78 2.14 11.2 100 2.65 2.76
## 3 13.16 2.36 2.67 18.6 101 2.80 3.24
## 4 14.37 1.95 2.50 16.8 113 3.85 3.49
## 5 13.24 2.59 2.87 21.0 118 2.80 2.69
## 6 14.20 1.76 2.45 15.2 112 3.27 3.39
## Nonflavanoid_Phenols Proanthocyanins Color_Intensity Hue OD280 Proline
## 1 0.28 2.29 5.64 1.04 3.92 1065
## 2 0.26 1.28 4.38 1.05 3.40 1050
## 3 0.30 2.81 5.68 1.03 3.17 1185
## 4 0.24 2.18 7.80 0.86 3.45 1480
## 5 0.39 1.82 4.32 1.04 2.93 735
## 6 0.34 1.97 6.75 1.05 2.85 1450
# Last rows
tail(wines)
## Alcohol Malic_Acid Ash Ash_Alcanity Magnesium Total_Phenols Flavanoids
## 173 14.16 2.51 2.48 20.0 91 1.68 0.70
## 174 13.71 5.65 2.45 20.5 95 1.68 0.61
## 175 13.40 3.91 2.48 23.0 102 1.80 0.75
## 176 13.27 4.28 2.26 20.0 120 1.59 0.69
## 177 13.17 2.59 2.37 20.0 120 1.65 0.68
## 178 14.13 4.10 2.74 24.5 96 2.05 0.76
## Nonflavanoid_Phenols Proanthocyanins Color_Intensity Hue OD280 Proline
## 173 0.44 1.24 9.7 0.62 1.71 660
## 174 0.52 1.06 7.7 0.64 1.74 740
## 175 0.43 1.41 7.3 0.70 1.56 750
## 176 0.43 1.35 10.2 0.59 1.56 835
## 177 0.53 1.46 9.3 0.60 1.62 840
## 178 0.56 1.35 9.2 0.61 1.60 560
# Summary
summary(wines)
## Alcohol Malic_Acid Ash Ash_Alcanity
## Min. :11.03 Min. :0.740 Min. :1.360 Min. :10.60
## 1st Qu.:12.36 1st Qu.:1.603 1st Qu.:2.210 1st Qu.:17.20
## Median :13.05 Median :1.865 Median :2.360 Median :19.50
## Mean :13.00 Mean :2.336 Mean :2.367 Mean :19.49
## 3rd Qu.:13.68 3rd Qu.:3.083 3rd Qu.:2.558 3rd Qu.:21.50
## Max. :14.83 Max. :5.800 Max. :3.230 Max. :30.00
## Magnesium Total_Phenols Flavanoids Nonflavanoid_Phenols
## Min. : 70.00 Min. :0.980 Min. :0.340 Min. :0.1300
## 1st Qu.: 88.00 1st Qu.:1.742 1st Qu.:1.205 1st Qu.:0.2700
## Median : 98.00 Median :2.355 Median :2.135 Median :0.3400
## Mean : 99.74 Mean :2.295 Mean :2.029 Mean :0.3619
## 3rd Qu.:107.00 3rd Qu.:2.800 3rd Qu.:2.875 3rd Qu.:0.4375
## Max. :162.00 Max. :3.880 Max. :5.080 Max. :0.6600
## Proanthocyanins Color_Intensity Hue OD280
## Min. :0.410 Min. : 1.280 Min. :0.4800 Min. :1.270
## 1st Qu.:1.250 1st Qu.: 3.220 1st Qu.:0.7825 1st Qu.:1.938
## Median :1.555 Median : 4.690 Median :0.9650 Median :2.780
## Mean :1.591 Mean : 5.058 Mean :0.9574 Mean :2.612
## 3rd Qu.:1.950 3rd Qu.: 6.200 3rd Qu.:1.1200 3rd Qu.:3.170
## Max. :3.580 Max. :13.000 Max. :1.7100 Max. :4.000
## Proline
## Min. : 278.0
## 1st Qu.: 500.5
## Median : 673.5
## Mean : 746.9
## 3rd Qu.: 985.0
## Max. :1680.0
# Structure
glimpse(wines)
## Rows: 178
## Columns: 13
## $ Alcohol <dbl> 14.23, 13.20, 13.16, 14.37, 13.24, 14.20, 14.39, …
## $ Malic_Acid <dbl> 1.71, 1.78, 2.36, 1.95, 2.59, 1.76, 1.87, 2.15, 1…
## $ Ash <dbl> 2.43, 2.14, 2.67, 2.50, 2.87, 2.45, 2.45, 2.61, 2…
## $ Ash_Alcanity <dbl> 15.6, 11.2, 18.6, 16.8, 21.0, 15.2, 14.6, 17.6, 1…
## $ Magnesium <int> 127, 100, 101, 113, 118, 112, 96, 121, 97, 98, 10…
## $ Total_Phenols <dbl> 2.80, 2.65, 2.80, 3.85, 2.80, 3.27, 2.50, 2.60, 2…
## $ Flavanoids <dbl> 3.06, 2.76, 3.24, 3.49, 2.69, 3.39, 2.52, 2.51, 2…
## $ Nonflavanoid_Phenols <dbl> 0.28, 0.26, 0.30, 0.24, 0.39, 0.34, 0.30, 0.31, 0…
## $ Proanthocyanins <dbl> 2.29, 1.28, 2.81, 2.18, 1.82, 1.97, 1.98, 1.25, 1…
## $ Color_Intensity <dbl> 5.64, 4.38, 5.68, 7.80, 4.32, 6.75, 5.25, 5.05, 5…
## $ Hue <dbl> 1.04, 1.05, 1.03, 0.86, 1.04, 1.05, 1.02, 1.06, 1…
## $ OD280 <dbl> 3.92, 3.40, 3.17, 3.45, 2.93, 2.85, 3.58, 3.58, 2…
## $ Proline <int> 1065, 1050, 1185, 1480, 735, 1450, 1290, 1295, 10…
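Before visualizing, a quick sketch to confirm the dataset has no missing values, which the rest of the analysis assumes:
# Count missing values per column; all zeros means the data is complete
colSums(is.na(wines))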
This graphical study of the data helps us detect patterns, central tendencies, and probable outliers. Histograms are a useful tool for examining the underlying distribution of each variable, whether alcohol concentration, acidity, or another chemical parameter.
wines %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  # Plot the histogram on the density scale so the density curve overlays correctly
  geom_histogram(aes(y = after_stat(density)), colour = "black") +
  geom_density(colour = "red", alpha = 0.5, show.legend = FALSE) +
  facet_wrap(~ key, scales = "free") +
  labs(x = "Values", y = "Density",
       title = "Diverse Attributes of Italian Wines - Histograms") +
  theme_bw()
The density curves above show that several variables are far from normally distributed and appear to contain outliers. We can identify the outliers with boxplots.
boxplot(wines,
las = 2,
col = c("blue")
)
Capping (winsorizing values at the 1.5 × IQR Tukey fences) will be used to treat the outliers, since the other methods we tried introduced new outliers.
# Columns with outliers: Malic_Acid, Ash, Ash_Alcanity, Magnesium, Proanthocyanins, Color_Intensity, Hue
# Cap values beyond the Tukey fences (quartiles +/- threshold * IQR)
cap_outliers <- function(x, threshold = 1.5) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  lower_bound <- q[1] - threshold * iqr
  upper_bound <- q[2] + threshold * iqr
  x[x < lower_bound] <- lower_bound
  x[x > upper_bound] <- upper_bound
  return(x)
}
columns_to_cap <- c("Malic_Acid", "Ash", "Ash_Alcanity", "Magnesium", "Proanthocyanins", "Color_Intensity", "Hue")
wines[columns_to_cap] <- lapply(wines[columns_to_cap], cap_outliers)
We can now verify with another boxplot that the capping worked.
boxplot(wines,
las = 2,
col = c("blue")
)
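As a programmatic check (a small sketch reusing the same Tukey-fence logic), we can count how many values still fall outside the fences; every count should now be zero.
# Count values still outside the 1.5 * IQR fences after capping
outliers_left <- function(x, threshold = 1.5) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  sum(x < q[1] - threshold * iqr | x > q[2] + threshold * iqr)
}
sapply(wines[columns_to_cap], outliers_left)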
To construct a graphical representation of the correlation matrix, we can use the corrgram() function; a corrplot() sketch follows it below.
corrgram(wines, order=TRUE, lower.panel=panel.shade,
upper.panel=panel.pie, text.panel=panel.txt)
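Since the corrplot package is already loaded, here is a sketch of an equivalent view using corrplot() on the correlation matrix:
# Equivalent view with corrplot: circles sized and coloured by correlation strength
corrplot(cor(wines), method = "circle", type = "upper", tl.cex = 0.7)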
Because the variables are measured on very different scales (compare Proline with Hue), we must standardize them so that each contributes equally to the distance calculations.
# Standardization (z-scores): centre each variable and divide by its standard deviation
wines_df <- as.data.frame(scale(wines))
In this section we execute the k-means algorithm and analyze the main components the function returns. Here the data is grouped into two clusters (centers = 2). The kmeans() function also has an nstart option that tries multiple random initial configurations and reports the best one; for example, nstart = 25 generates 25 initial configurations. Using nstart is generally recommended, since a single random start can land in a poor local optimum.
# Execution of k-means with k=2
set.seed(1234)
wines_df1 <- kmeans(wines_df, centers = 2, nstart = 25)
par(mfrow = c(1, 2))
# plot1: first two (standardized) variables coloured by cluster,
# with the cluster centres overlaid in the same coordinates
plot(as.matrix(wines_df), col = wines_df1$cluster,
     main = "K-means with 2 clusters", xlab = "", ylab = "")
points(wines_df1$centers[, 1:2], col = 1:2, pch = 16, cex = 2)
# plot2: discriminant projection of the clusters (fpc::plotcluster)
plotcluster(wines_df, wines_df1$cluster)
par(mfrow = c(1, 1))
You can also view the results using fviz_cluster(), which shows the clusters clearly. If there are more than two dimensions (variables), fviz_cluster() performs principal component analysis (PCA) and plots the data points according to the first two principal components, which explain most of the variance.
fviz_cluster(wines_df1, data = wines_df)
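To see what fviz_cluster() is doing under the hood, here is a minimal sketch that performs the same projection by hand: run PCA on the standardized data, keep the first two components, and colour the scores by cluster assignment.
# Hand-rolled version of the fviz_cluster() projection
pca_scores <- as.data.frame(prcomp(wines_df)$x[, 1:2])
pca_scores$cluster <- factor(wines_df1$cluster)
ggplot(pca_scores, aes(PC1, PC2, colour = cluster)) +
  geom_point() +
  labs(title = "Clusters on the first two principal components") +
  theme_bw()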
We can use the elbow, silhouette, and gap statistic methods to find the optimal number of clusters.
set.seed(1234)
plot1 <- fviz_nbclust(wines_df, kmeans, method = "wss") +
  geom_vline(xintercept = 3, linetype = 5) +
  labs(subtitle = "Elbow method plot")
plot2 <- fviz_nbclust(wines_df, kmeans, method = "silhouette") +
  labs(subtitle = "Silhouette method plot")
plot3 <- fviz_nbclust(wines_df, kmeans, nstart = 25, method = "gap_stat", nboot = 50) +
  labs(subtitle = "Gap statistic method plot")
# Display all three diagnostics together
grid.arrange(plot1, plot2, plot3, ncol = 2)
All three methods estimated K = 3.
set.seed(1234)
wines_k3 <- kmeans(wines_df, centers = 3, nstart = 25)
plot(as.matrix(wines_df), col = wines_k3$cluster,
     main = "K-means with 3 clusters", xlab = "", ylab = "")
clusplot(wines_df, wines_k3$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0)
fviz_cluster(wines_k3, data = wines_df)
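As a further check on the k = 3 solution (a brief sketch using cluster::silhouette()), we can compute the average silhouette width; values closer to 1 indicate well-separated clusters.
# Average silhouette width for the k = 3 solution
sil <- silhouette(wines_k3$cluster, dist(wines_df))
mean(sil[, "sil_width"])
fviz_silhouette(sil)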
Next, we use PCA to reduce the dimensionality of the data and identify the principal components that explain the bulk of the variation; as the summary below shows, the first six components capture roughly 85% of it. This lets us display the dataset in a lower-dimensional space and expose the underlying factors that shape wine characteristics. Applying clustering and PCA together reveals concealed trends and yields important insights into wine categories and the key drivers of variability, which makes these approaches powerful tools for interpreting complex, unlabeled data and making data-driven decisions.
# Rebuild the data frame from the capped (but unscaled) data; prcomp() will centre and scale it
wines_df <- as.data.frame(apply(wines, 2, as.numeric))
wines.pca <- prcomp(wines_df, center = TRUE, scale. = TRUE)
# Summary
summary(wines.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.1776 1.5966 1.1910 0.95818 0.90477 0.79118 0.72394
## Proportion of Variance 0.3648 0.1961 0.1091 0.07062 0.06297 0.04815 0.04032
## Cumulative Proportion 0.3648 0.5609 0.6700 0.74059 0.80356 0.85171 0.89203
## PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 0.59732 0.53873 0.50104 0.47771 0.41529 0.32392
## Proportion of Variance 0.02745 0.02233 0.01931 0.01755 0.01327 0.00807
## Cumulative Proportion 0.91947 0.94180 0.96111 0.97866 0.99193 1.00000
Eigenvalues show the amount of variance explained by each principal component. In PCA, eigenvalues are used to determine the significance of each component in capturing the variation in the data. Eigenvectors indicate the direction of the principal components in the original feature space. These vectors represent the weights assigned to each original feature in the linear combination that forms the principal components.
cov_matrix <- cov(wines_df)
eigens <- eigen(cov_matrix)
# eigenvalues
values <- eigens$values
# eigenvectors
vectors <- eigens$vectors
# Sort the eigenvectors by their corresponding eigenvalues in descending order
sorted_indices <- order(values, decreasing = TRUE)
sorted_values <- values[sorted_indices]
sorted_vectors <- vectors[, sorted_indices]
print(sorted_values)
## [1] 9.920072e+04 1.496058e+02 9.062181e+00 4.745043e+00 1.172489e+00
## [6] 8.165326e-01 2.768415e-01 1.466982e-01 1.113144e-01 7.192301e-02
## [11] 3.569606e-02 2.057526e-02 8.339814e-03
print(sorted_vectors)
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.0016593080 -0.0027761662 0.016391489 -0.142874926 0.036172691
## [2,] 0.0006677014 -0.0041934002 0.125965256 -0.173165435 -0.575191774
## [3,] -0.0001924218 -0.0054879766 0.049630777 0.010219881 0.017289952
## [4,] 0.0046835201 -0.0273975429 0.940194011 0.325951829 0.067651990
## [5,] -0.0175545210 -0.9991777062 -0.033020551 0.012286692 -0.007665888
## [6,] -0.0009898384 -0.0011231922 -0.044181380 0.081738936 0.324478814
## [7,] -0.0015673086 -0.0003016742 -0.091062158 0.178002706 0.542972954
## [8,] 0.0001230822 0.0013910539 0.014031046 -0.011389299 -0.030189061
## [9,] -0.0006036117 -0.0030055900 -0.025325702 0.055591726 0.258662209
## [10,] -0.0023486151 -0.0223590143 0.281266974 -0.873709189 0.350631946
## [11,] -0.0001729932 0.0012750789 -0.026753256 0.061437939 0.047658053
## [12,] -0.0007049313 0.0043345027 -0.073585741 0.185372761 0.264136814
## [13,] -0.9998283896 0.0174704364 0.004630613 0.002946757 -0.002346365
## [,6] [,7] [,8] [,9] [,10]
## [1,] -0.196704317 0.9395376757 -2.228308e-01 7.673033e-02 -1.348560e-02
## [2,] -0.768418188 -0.1576080436 4.749577e-02 2.252255e-02 -2.152715e-02
## [3,] -0.035772489 0.0404659068 1.273863e-01 8.049553e-02 -3.466877e-02
## [4,] 0.024782607 0.0321668955 -8.992161e-03 1.083829e-03 -2.294613e-04
## [5,] 0.002953274 0.0008509383 1.616223e-03 -2.357219e-03 5.371658e-05
## [6,] -0.264553961 -0.0267135405 1.701381e-01 2.941137e-01 8.354143e-01
## [7,] -0.411020461 -0.0668437914 2.230388e-01 3.666183e-01 -5.405048e-01
## [8,] 0.020382943 -0.0021211328 -6.955413e-03 3.989250e-02 4.017669e-02
## [9,] -0.221260705 -0.2480629996 -8.917215e-01 -8.768485e-02 3.162450e-02
## [10,] 0.012386957 -0.1214536445 7.861507e-02 -1.006349e-01 -1.781824e-02
## [11,] 0.025924990 0.0297311478 7.229968e-03 3.109937e-02 -5.261243e-02
## [12,] -0.282512325 0.0860132434 2.272408e-01 -8.636159e-01 4.889195e-02
## [13,] 0.001092317 -0.0010363146 -2.020321e-05 -4.945129e-05 3.521567e-05
## [,11] [,12] [,13]
## [1,] 0.0248170506 -0.0130591310 8.131430e-03
## [2,] 0.0175240682 0.0656060960 -1.175961e-02
## [3,] -0.9589882511 -0.1447970565 -1.721996e-01
## [4,] 0.0516359983 0.0065153954 1.254503e-03
## [5,] 0.0033932152 0.0007572426 2.441334e-03
## [6,] 0.0317662010 0.0195926559 -2.481482e-02
## [7,] 0.0952058366 -0.0482260769 6.406239e-02
## [8,] -0.1950901050 0.1421737938 9.678908e-01
## [9,] -0.1213782382 -0.0085770592 -1.410269e-02
## [10,] 0.0021520616 0.0500962659 -5.389182e-03
## [11,] -0.1120031169 0.9741206613 -1.625958e-01
## [12,] -0.0364047564 0.0099397349 4.401880e-02
## [13,] 0.0002459432 -0.0001037714 3.637047e-05
PCA involves computing the eigenvectors and eigenvalues of the covariance matrix of the dataset and then projecting the data onto the reduced set of dimensions defined by the selected principal components. Note that the eigen decomposition above was computed on the unscaled covariance matrix, so its first eigenvalue is dominated by Proline, whose variance dwarfs that of every other variable; get_eigenvalue() below reports the eigenvalues of wines.pca, which was fitted on standardized data, which is why the values differ.
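As a minimal sketch of that projection (assuming wines.pca from above), we can standardize the data, multiply by the rotation (loading) matrix, and confirm the result matches the scores prcomp() stored in wines.pca$x:
# Project the standardized data onto the principal axes by hand
scores_manual <- scale(wines_df) %*% wines.pca$rotation
# Maximum absolute difference from prcomp()'s scores; should be ~0
max(abs(scores_manual - wines.pca$x))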
eig.val <- get_eigenvalue(wines.pca)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 4.7418544 36.4758028 36.47580
## Dim.2 2.5492707 19.6097746 56.08558
## Dim.3 1.4184536 10.9111819 66.99676
## Dim.4 0.9181091 7.0623779 74.05914
## Dim.5 0.8186077 6.2969826 80.35612
## Dim.6 0.6259587 4.8150671 85.17119
## Dim.7 0.5240955 4.0315036 89.20269
## Dim.8 0.3567893 2.7445334 91.94722
## Dim.9 0.2902270 2.2325157 94.17974
## Dim.10 0.2510438 1.9311063 96.11085
## Dim.11 0.2282063 1.7554330 97.86628
## Dim.12 0.1724624 1.3266336 99.19291
## Dim.13 0.1049214 0.8070876 100.00000
fviz_pca_ind(wines.pca, geom = "point", ggtheme = theme_classic())
A scree plot is a graphical representation of the eigenvalues corresponding to each principal component. It helps in visualizing the amount of variance explained by each component.
fviz_eig(wines.pca, addlabels = TRUE, ylim = c(0, 50))
PCA serves as a powerful tool for dimensionality reduction and data exploration, offering valuable insights into the underlying patterns and structures present in the dataset.
fviz_pca_ind(wines.pca)
A biplot helps visualize the relationships between variables and observations simultaneously.
variable_colors <- c("#E41A1C", "#377EB8", "#4DAF4A", "#FF7F00", "#FDBF6F",
"#A65628", "#984EA3", "#999999", "#DEB887", "#C0C0C0",
"#000080", "#4682B4", "#FFA07A")
# Create a biplot with variable colors
fviz_pca_biplot(wines.pca, geom = "point", col.var = variable_colors)
Finally, we can view three of the attributes in a 3-D scatter plot, coloured by the k = 3 cluster assignment (this requires the plotly package).
library(plotly)
plot_ly(wines_df, x = ~Ash, y = ~Hue, z = ~Malic_Acid,
        type = "scatter3d", mode = "markers", color = factor(wines_k3$cluster))
In conclusion, both k-means clustering and principal component analysis (PCA) offer valuable insights into the structure of a dataset. K-means identifies distinct groups based on similarities between data points, while PCA aids dimensionality reduction and reveals underlying patterns. By combining the elbow, silhouette, and gap statistic methods with PCA-based visualizations, we can determine the most suitable number of clusters for our data, enhancing our understanding of complex datasets and supporting more informed, data-driven decisions.