This project involves applying unsupervised clustering to customer data from a wine farmer and seller. The data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.
By grouping wines based on their similarities, I aim to provide a more customer-centric approach to product offerings and marketing, tailoring them to the individual needs of different customer segments.
The goal of this analysis is to identify the optimal marketing strategies and branding approaches for each distinct customer segment.
For our analysis, we will employ the Wine dataset sourced from Kaggle. The following descriptions outline the dataset’s features:
| Column Name | Data Type | Description |
|---|---|---|
| Alcohol | double | Alcohol content |
| Malic_Acid | double | Malic acid content |
| Ash | double | Ash content |
| Ash_Alcanity | double | Ash alkalinity |
| Magnesium | integer | Magnesium content |
| Total_Phenols | double | Total phenols content |
| Flavanoids | double | Flavanoids content |
| Nonflavanoid_Phenols | double | Nonflavanoid phenols content |
| Proanthocyanins | double | Proanthocyanins content |
| Color_Intensity | double | Color intensity |
| Hue | double | Hue |
| OD280 | double | Optical density at 280 nm |
| Proline | integer | Proline content |
| Customer_Segment | integer | Customer segment (presumably a numerical identifier) |
Next, we must import the necessary libraries.
library(tidyr)
library(factoextra)
library(cluster)
library(FactoMineR)
library(stringr)
library(tibble)
library(tidyverse)
library(cluster)
library(ggplot2)
library(grid)
library(gridExtra)
library(GGally)
Following that, we incorporate our data into the wine
dataset.
wine <- read.csv("data_input/Wine.csv")
head(wine)
Before proceeding with data cleaning and feature engineering, we’ll
conduct a preliminary exploration of the imported dataset. This involves
examining the initial and final rows using the head() and
tail() functions. This step will provide valuable insights
into the dataset’s structure and content.
head(wine)
tail(wine)
In this step, we are going to inspect our dataset by checking its data types, and check for missing values.
To find out the suitable data type, it is checked first with the `glimpse() function.
wine %>%
glimpse()
#> Rows: 178
#> Columns: 14
#> $ Alcohol <dbl> 14.23, 13.20, 13.16, 14.37, 13.24, 14.20, 14.39, …
#> $ Malic_Acid <dbl> 1.71, 1.78, 2.36, 1.95, 2.59, 1.76, 1.87, 2.15, 1…
#> $ Ash <dbl> 2.43, 2.14, 2.67, 2.50, 2.87, 2.45, 2.45, 2.61, 2…
#> $ Ash_Alcanity <dbl> 15.6, 11.2, 18.6, 16.8, 21.0, 15.2, 14.6, 17.6, 1…
#> $ Magnesium <int> 127, 100, 101, 113, 118, 112, 96, 121, 97, 98, 10…
#> $ Total_Phenols <dbl> 2.80, 2.65, 2.80, 3.85, 2.80, 3.27, 2.50, 2.60, 2…
#> $ Flavanoids <dbl> 3.06, 2.76, 3.24, 3.49, 2.69, 3.39, 2.52, 2.51, 2…
#> $ Nonflavanoid_Phenols <dbl> 0.28, 0.26, 0.30, 0.24, 0.39, 0.34, 0.30, 0.31, 0…
#> $ Proanthocyanins <dbl> 2.29, 1.28, 2.81, 2.18, 1.82, 1.97, 1.98, 1.25, 1…
#> $ Color_Intensity <dbl> 5.64, 4.38, 5.68, 7.80, 4.32, 6.75, 5.25, 5.05, 5…
#> $ Hue <dbl> 1.04, 1.05, 1.03, 0.86, 1.04, 1.05, 1.02, 1.06, 1…
#> $ OD280 <dbl> 3.92, 3.40, 3.17, 3.45, 2.93, 2.85, 3.58, 3.58, 2…
#> $ Proline <int> 1065, 1050, 1185, 1480, 735, 1450, 1290, 1295, 10…
#> $ Customer_Segment <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
Here we need to change data type for Customer_Segment column from
integer to factor
wine_cl <-
wine %>%
mutate(Customer_Segment = as.factor(Customer_Segment))
Next we check for missing values using is.na
wine_cl %>%
is.na() %>%
colSums()
#> Alcohol Malic_Acid Ash
#> 0 0 0
#> Ash_Alcanity Magnesium Total_Phenols
#> 0 0 0
#> Flavanoids Nonflavanoid_Phenols Proanthocyanins
#> 0 0 0
#> Color_Intensity Hue OD280
#> 0 0 0
#> Proline Customer_Segment
#> 0 0
Insights:
The wine dataset stands out for having no missing data
at all. This means that each data point in the dataset has values for
every variable that is defined.
To gain a comprehensive understanding of the dataset, we can use:
summary(wine_cl)
#> Alcohol Malic_Acid Ash Ash_Alcanity
#> Min. :11.03 Min. :0.740 Min. :1.360 Min. :10.60
#> 1st Qu.:12.36 1st Qu.:1.603 1st Qu.:2.210 1st Qu.:17.20
#> Median :13.05 Median :1.865 Median :2.360 Median :19.50
#> Mean :13.00 Mean :2.336 Mean :2.367 Mean :19.49
#> 3rd Qu.:13.68 3rd Qu.:3.083 3rd Qu.:2.558 3rd Qu.:21.50
#> Max. :14.83 Max. :5.800 Max. :3.230 Max. :30.00
#> Magnesium Total_Phenols Flavanoids Nonflavanoid_Phenols
#> Min. : 70.00 Min. :0.980 Min. :0.340 Min. :0.1300
#> 1st Qu.: 88.00 1st Qu.:1.742 1st Qu.:1.205 1st Qu.:0.2700
#> Median : 98.00 Median :2.355 Median :2.135 Median :0.3400
#> Mean : 99.74 Mean :2.295 Mean :2.029 Mean :0.3619
#> 3rd Qu.:107.00 3rd Qu.:2.800 3rd Qu.:2.875 3rd Qu.:0.4375
#> Max. :162.00 Max. :3.880 Max. :5.080 Max. :0.6600
#> Proanthocyanins Color_Intensity Hue OD280
#> Min. :0.410 Min. : 1.280 Min. :0.4800 Min. :1.270
#> 1st Qu.:1.250 1st Qu.: 3.220 1st Qu.:0.7825 1st Qu.:1.938
#> Median :1.555 Median : 4.690 Median :0.9650 Median :2.780
#> Mean :1.591 Mean : 5.058 Mean :0.9574 Mean :2.612
#> 3rd Qu.:1.950 3rd Qu.: 6.200 3rd Qu.:1.1200 3rd Qu.:3.170
#> Max. :3.580 Max. :13.000 Max. :1.7100 Max. :4.000
#> Proline Customer_Segment
#> Min. : 278.0 1:59
#> 1st Qu.: 500.5 2:71
#> Median : 673.5 3:48
#> Mean : 746.9
#> 3rd Qu.: 985.0
#> Max. :1680.0
and also we can visualize its column using boxplot
function
boxplots <- list()
for (i in 1:ncol(wine_cl)) {
col_name <- colnames(wine_cl)[i]
boxplots[[i]] <- ggplot(wine_cl, aes(x = "", y = .data[[col_name]])) +
geom_boxplot() +
labs(title = paste(col_name))
}
grid.arrange(grobs = boxplots, ncol = 5)
Insights:
General Observations:
Specifics Observations:
The following code automatically identifies the positions of numeric and categorical variables, making it adaptable to datasets with many columns.
quanti <- wine_cl %>%
select_if(is.numeric) %>%
colnames()
quantivar <- which(colnames(wine_cl) %in% quanti)
quali <- wine_cl %>%
select_if(is.factor) %>%
colnames()
qualivar <- which(colnames(wine_cl) %in% quali)
wine_pca <- PCA(
X = wine_cl,
scale.unit = T, # scaling data
quali.sup = qualivar,
graph = F,
ncp = 13
)
wine_pca
#> **Results for the Principal Component Analysis (PCA)**
#> The analysis was performed on 178 individuals, described by 14 variables
#> *The results are available in the following objects:
#>
#> name description
#> 1 "$eig" "eigenvalues"
#> 2 "$var" "results for the variables"
#> 3 "$var$coord" "coord. for the variables"
#> 4 "$var$cor" "correlations variables - dimensions"
#> 5 "$var$cos2" "cos2 for the variables"
#> 6 "$var$contrib" "contributions of the variables"
#> 7 "$ind" "results for the individuals"
#> 8 "$ind$coord" "coord. for the individuals"
#> 9 "$ind$cos2" "cos2 for the individuals"
#> 10 "$ind$contrib" "contributions of the individuals"
#> 11 "$quali.sup" "results for the supplementary categorical variables"
#> 12 "$quali.sup$coord" "coord. for the supplementary categories"
#> 13 "$quali.sup$v.test" "v-test of the supplementary categories"
#> 14 "$call" "summary statistics"
#> 15 "$call$centre" "mean of the variables"
#> 16 "$call$ecart.type" "standard error of the variables"
#> 17 "$call$row.w" "weights for the individuals"
#> 18 "$call$col.w" "weights for the variables"
Subsequently, we will examine each factor plot separately and pinpoint the five most extreme outliers.
plot.PCA(
x = wine_pca,
choix = "ind",
select = "contrib 5",
invisible = "quali",
habillage = "Customer_Segment"
)
Insights:
For a more comprehensive understanding of variable distribution, the
fviz_contrib() function can be employed. This tool offers a
visual representation of how rows or columns contribute to the principal
components (PCs) in a Principal Component Analysis (PCA), providing
valuable insights into the data’s underlying structure.
fviz_contrib(X = wine_pca,
axes = 1, # = PC1
choice = "var")
Notes:
Correlation between Variable
Next, we’ll determine the relationships between variables using the
ggcorr() function.
ggcorr(wine_cl, label = TRUE, hjust = 1)
Insights:
Strong Positive Correlations:
Negative Correlations:
Chemical Composition: The strong positive correlations between Total_Phenols, Flavanoids, and OD280 might suggest that these compounds are chemically related or co-occur in wine.
Sensory Attributes: The negative correlation between Color_Intensity and Proanthocyanins could indicate that wines with a more intense color might have different sensory characteristics related to Proanthocyanins.
Next, we need to remove Customer_Segment from our
dataframe and store it in new dataframe called
wine_clean.
wine_clean <-
wine_cl %>%
select(-Customer_Segment)
The resulting data frame now only contains columns that are of
integer, double, or numeric data types. Next, we need to ensure that
different variables contribute equally to the analysis, this can be done
by using scale() function. The scale()
function standardizes numerical data by subtracting the mean of each
column and dividing by its standard deviation. This transforms the data
to have a mean of 0 and a standard deviation of 1.
wine_z <- scale(wine_clean)
summary(wine_z)
#> Alcohol Malic_Acid Ash Ash_Alcanity
#> Min. :-2.42739 Min. :-1.4290 Min. :-3.66881 Min. :-2.663505
#> 1st Qu.:-0.78603 1st Qu.:-0.6569 1st Qu.:-0.57051 1st Qu.:-0.687199
#> Median : 0.06083 Median :-0.4219 Median :-0.02375 Median : 0.001514
#> Mean : 0.00000 Mean : 0.0000 Mean : 0.00000 Mean : 0.000000
#> 3rd Qu.: 0.83378 3rd Qu.: 0.6679 3rd Qu.: 0.69615 3rd Qu.: 0.600395
#> Max. : 2.25341 Max. : 3.1004 Max. : 3.14745 Max. : 3.145637
#> Magnesium Total_Phenols Flavanoids Nonflavanoid_Phenols
#> Min. :-2.0824 Min. :-2.10132 Min. :-1.6912 Min. :-1.8630
#> 1st Qu.:-0.8221 1st Qu.:-0.88298 1st Qu.:-0.8252 1st Qu.:-0.7381
#> Median :-0.1219 Median : 0.09569 Median : 0.1059 Median :-0.1756
#> Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
#> 3rd Qu.: 0.5082 3rd Qu.: 0.80672 3rd Qu.: 0.8467 3rd Qu.: 0.6078
#> Max. : 4.3591 Max. : 2.53237 Max. : 3.0542 Max. : 2.3956
#> Proanthocyanins Color_Intensity Hue OD280
#> Min. :-2.06321 Min. :-1.6297 Min. :-2.08884 Min. :-1.8897
#> 1st Qu.:-0.59560 1st Qu.:-0.7929 1st Qu.:-0.76540 1st Qu.:-0.9496
#> Median :-0.06272 Median :-0.1588 Median : 0.03303 Median : 0.2371
#> Mean : 0.00000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
#> 3rd Qu.: 0.62741 3rd Qu.: 0.4926 3rd Qu.: 0.71116 3rd Qu.: 0.7864
#> Max. : 3.47527 Max. : 3.4258 Max. : 3.29241 Max. : 1.9554
#> Proline
#> Min. :-1.4890
#> 1st Qu.:-0.7824
#> Median :-0.2331
#> Mean : 0.0000
#> 3rd Qu.: 0.7561
#> Max. : 2.9631
K-means
Next, we need to partitions data into K clusters based on the mean distance to cluster centers using K-mean algorithm.
Parameters:
Note: It is necessary to use set.seed() due to the
random initialization in the initial stages of the K-means
algorithm.
RNGkind(sample.kind = "Rounding")
set.seed(100)
wine_km <- kmeans(
x = wine_z,
centers = 5
)
next bit is the result of k-means() function:
wine_km$centers
#> Alcohol Malic_Acid Ash Ash_Alcanity Magnesium Total_Phenols
#> 1 -0.1886092 -0.3973055 0.88933287 0.3084403 1.11034100 0.4408200
#> 2 0.1766166 0.9039567 0.21536153 0.5494898 -0.07712756 -0.9873154
#> 3 -0.7932129 -0.2670132 -1.06563358 -0.2679823 -0.68011948 0.2781082
#> 4 1.0751808 -0.3606243 0.16642266 -0.8929012 0.46102024 0.9849139
#> 5 -0.9985113 -0.4250204 0.08279355 0.6568276 -0.57973353 -0.5680217
#> Flavanoids Nonflavanoid_Phenols Proanthocyanins Color_Intensity Hue
#> 1 0.5438510 -0.3483549 0.5933360 -0.5275837 0.7570958
#> 2 -1.2236663 0.7114800 -0.7591372 0.9516989 -1.1867156
#> 3 0.2668683 -0.5684285 0.1678066 -0.8107192 0.2991794
#> 4 1.0567190 -0.6716820 0.6866569 0.3596910 0.4267835
#> 5 -0.3042323 0.8751497 -0.4437352 -0.9381988 0.5304372
#> OD280 Proline
#> 1 0.61524013 0.1902360
#> 2 -1.28577141 -0.3952058
#> 3 0.50311030 -0.7139802
#> 4 0.79065585 1.3330286
#> 5 -0.09609123 -0.7561714
Optional step when we don’t have a predetermined value for K based on business decisions.
The quality of clustering results can be evaluated using three metrics:
$withinss):
This measures the sum of squared distances from each data point to its
assigned cluster centroid.$betweenss):
This measures the sum of squared distances, weighted by cluster size,
from each cluster centroid to the overall mean.$totss): This
measures the total sum of squared distances from each data point to the
overall mean.wine_km$tot.withinss
#> [1] 1106.418
wine_km$betweenss / wine_km$totss
#> [1] 0.519158
Insights:
tot.withinss suggests that there is a moderate level of
within-cluster variation. This means that data points within their
respective clusters are reasonably close together, but there might be
some room for improvement in terms of compactness.To determine optimum number of clusters (\(k\)), objectively we can use a method called Elbow method. The elbow method is a common technique used to determine the optimal number of clusters (\(k\)) in K-means clustering. It involves plotting the Within-Cluster Sum of Squares (WSS) against the number of clusters. The “elbow” point in the plot, where the rate of decrease in WSS starts to slow down significantly, is often considered the optimal number of clusters.
elbow_plot <- fviz_nbclust(wine_z,
FUNcluster = kmeans,
method = "wss")
k_optimal <- 3
elbow_plot + geom_vline(xintercept = k_optimal, linetype = "dashed")
Insight :
Based on the visualization provided, the optimal \(k\) value for your dataset is 3. This means that dividing the data into 3 clusters will provide good and representative clustering results.
Re-run K-means with the Optimal Number of Clusters (\(k\)):
RNGkind(sample.kind = "Rounding")
set.seed(100)
wine_km_opt <- kmeans(x = wine_z,
centers = 3)
Next, we need to analyze the characteristics of each cluster and understand how the data points are grouped based on their similarity.
wine_km_opt$cluster
#> [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3 3 3 3 1
#> [75] 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
#> [112] 3 3 3 3 3 3 3 2 3 3 1 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#> [149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
The output wine_km_opt$cluster shows the cluster
assignments for each data point in the wine_z dataset. The
values in the output represent the cluster labels assigned to each
observation.
wine$kelompok <- wine_km_opt$cluster
wine_cluster <-
wine %>%
group_by(kelompok) %>%
summarise_all(mean)
wine_cluster
Cluster Profiling:
Based on above code bit and the result we can profile our clusters as follows:
Cluster 1: Premium Wine
Cluster 2: Unique Wine
Cluster 3: Fresh Wine
Based on the cluster analysis, we can identify three distinct customer segments
Cluster 1: Premium Wine
Cluster 2: Unique Wine
Cluster 3: Fresh Wine
Sadegh Jalalian. Wine Customer Segmentation. Retrieved from https://www.kaggle.com/datasets/sadeghjalalian/wine-customer-segmentation/data