1 Introduction

This project involves applying unsupervised clustering to customer data from a wine farmer and seller. The data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.

By grouping wines based on their similarities, I aim to provide a more customer-centric approach to product offerings and marketing, tailoring them to the individual needs of different customer segments.

2 Business Question

The goal of this analysis is to identify the optimal marketing strategies and branding approaches for each distinct customer segment.

3 Preparation

For our analysis, we will employ the Wine dataset sourced from Kaggle. The following descriptions outline the dataset’s features:

Column Name	Data Type	Description
Alcohol	double	Alcohol content
Malic_Acid	double	Malic acid content
Ash	double	Ash content
Ash_Alcanity	double	Ash alkalinity
Magnesium	integer	Magnesium content
Total_Phenols	double	Total phenols content
Flavanoids	double	Flavanoids content
Nonflavanoid_Phenols	double	Nonflavanoid phenols content
Proanthocyanins	double	Proanthocyanins content
Color_Intensity	double	Color intensity
Hue	double	Hue
OD280	double	Optical density at 280 nm
Proline	integer	Proline content
Customer_Segment	integer	Customer segment (presumably a numerical identifier)

3.1 Importing Libraries

Next, we must import the necessary libraries.

library(tidyr)
library(factoextra)
library(cluster)
library(FactoMineR)
library(stringr)
library(tibble)
library(tidyverse)
library(cluster)
library(ggplot2)
library(grid)
library(gridExtra)
library(GGally)

3.2 Importing Data

Following that, we incorporate our data into the wine dataset.

wine <- read.csv("data_input/Wine.csv")

head(wine)

4 Data Cleansing and Exploratory Data Analysis (EDA)

Before proceeding with data cleaning and feature engineering, we’ll conduct a preliminary exploration of the imported dataset. This involves examining the initial and final rows using the head() and tail() functions. This step will provide valuable insights into the dataset’s structure and content.

head(wine)

tail(wine)

In this step, we are going to inspect our dataset by checking its data types, and check for missing values.

Result

Structure with Glimpse

To find out the suitable data type, it is checked first with the `glimpse() function.

wine %>% 
  glimpse()

#> Rows: 178
#> Columns: 14
#> $ Alcohol              <dbl> 14.23, 13.20, 13.16, 14.37, 13.24, 14.20, 14.39, …
#> $ Malic_Acid           <dbl> 1.71, 1.78, 2.36, 1.95, 2.59, 1.76, 1.87, 2.15, 1…
#> $ Ash                  <dbl> 2.43, 2.14, 2.67, 2.50, 2.87, 2.45, 2.45, 2.61, 2…
#> $ Ash_Alcanity         <dbl> 15.6, 11.2, 18.6, 16.8, 21.0, 15.2, 14.6, 17.6, 1…
#> $ Magnesium            <int> 127, 100, 101, 113, 118, 112, 96, 121, 97, 98, 10…
#> $ Total_Phenols        <dbl> 2.80, 2.65, 2.80, 3.85, 2.80, 3.27, 2.50, 2.60, 2…
#> $ Flavanoids           <dbl> 3.06, 2.76, 3.24, 3.49, 2.69, 3.39, 2.52, 2.51, 2…
#> $ Nonflavanoid_Phenols <dbl> 0.28, 0.26, 0.30, 0.24, 0.39, 0.34, 0.30, 0.31, 0…
#> $ Proanthocyanins      <dbl> 2.29, 1.28, 2.81, 2.18, 1.82, 1.97, 1.98, 1.25, 1…
#> $ Color_Intensity      <dbl> 5.64, 4.38, 5.68, 7.80, 4.32, 6.75, 5.25, 5.05, 5…
#> $ Hue                  <dbl> 1.04, 1.05, 1.03, 0.86, 1.04, 1.05, 1.02, 1.06, 1…
#> $ OD280                <dbl> 3.92, 3.40, 3.17, 3.45, 2.93, 2.85, 3.58, 3.58, 2…
#> $ Proline              <int> 1065, 1050, 1185, 1480, 735, 1450, 1290, 1295, 10…
#> $ Customer_Segment     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…

Data Cleansing & Missing Value

Here we need to change data type for Customer_Segment column from integer to factor

wine_cl <- 
  wine %>% 
    mutate(Customer_Segment = as.factor(Customer_Segment))

Next we check for missing values using is.na

wine_cl %>% 
  is.na() %>% 
  colSums()

#>              Alcohol           Malic_Acid                  Ash 
#>                    0                    0                    0 
#>         Ash_Alcanity            Magnesium        Total_Phenols 
#>                    0                    0                    0 
#>           Flavanoids Nonflavanoid_Phenols      Proanthocyanins 
#>                    0                    0                    0 
#>      Color_Intensity                  Hue                OD280 
#>                    0                    0                    0 
#>              Proline     Customer_Segment 
#>                    0                    0

Insights:

The wine dataset stands out for having no missing data at all. This means that each data point in the dataset has values for every variable that is defined.

Descriptive Statistics

To gain a comprehensive understanding of the dataset, we can use:

summary(wine_cl)

#>     Alcohol        Malic_Acid         Ash         Ash_Alcanity  
#>  Min.   :11.03   Min.   :0.740   Min.   :1.360   Min.   :10.60  
#>  1st Qu.:12.36   1st Qu.:1.603   1st Qu.:2.210   1st Qu.:17.20  
#>  Median :13.05   Median :1.865   Median :2.360   Median :19.50  
#>  Mean   :13.00   Mean   :2.336   Mean   :2.367   Mean   :19.49  
#>  3rd Qu.:13.68   3rd Qu.:3.083   3rd Qu.:2.558   3rd Qu.:21.50  
#>  Max.   :14.83   Max.   :5.800   Max.   :3.230   Max.   :30.00  
#>    Magnesium      Total_Phenols     Flavanoids    Nonflavanoid_Phenols
#>  Min.   : 70.00   Min.   :0.980   Min.   :0.340   Min.   :0.1300      
#>  1st Qu.: 88.00   1st Qu.:1.742   1st Qu.:1.205   1st Qu.:0.2700      
#>  Median : 98.00   Median :2.355   Median :2.135   Median :0.3400      
#>  Mean   : 99.74   Mean   :2.295   Mean   :2.029   Mean   :0.3619      
#>  3rd Qu.:107.00   3rd Qu.:2.800   3rd Qu.:2.875   3rd Qu.:0.4375      
#>  Max.   :162.00   Max.   :3.880   Max.   :5.080   Max.   :0.6600      
#>  Proanthocyanins Color_Intensity       Hue             OD280      
#>  Min.   :0.410   Min.   : 1.280   Min.   :0.4800   Min.   :1.270  
#>  1st Qu.:1.250   1st Qu.: 3.220   1st Qu.:0.7825   1st Qu.:1.938  
#>  Median :1.555   Median : 4.690   Median :0.9650   Median :2.780  
#>  Mean   :1.591   Mean   : 5.058   Mean   :0.9574   Mean   :2.612  
#>  3rd Qu.:1.950   3rd Qu.: 6.200   3rd Qu.:1.1200   3rd Qu.:3.170  
#>  Max.   :3.580   Max.   :13.000   Max.   :1.7100   Max.   :4.000  
#>     Proline       Customer_Segment
#>  Min.   : 278.0   1:59            
#>  1st Qu.: 500.5   2:71            
#>  Median : 673.5   3:48            
#>  Mean   : 746.9                   
#>  3rd Qu.: 985.0                   
#>  Max.   :1680.0

and also we can visualize its column using boxplot function

boxplots <- list()
for (i in 1:ncol(wine_cl)) {
  col_name <- colnames(wine_cl)[i]
  boxplots[[i]] <- ggplot(wine_cl, aes(x = "", y = .data[[col_name]])) +
    geom_boxplot() +
    labs(title = paste(col_name))
}

grid.arrange(grobs = boxplots, ncol = 5)

Insights:

General Observations:

Skewness: A number of variables (such as Alcohol, Malic_Acid, Ash, Ash_Alcanity, Magnesium, Total_Phenols, Flavanoids, Proanthocyanins, Color_Intensity, Hue, OD280, Proline) seem to have a right-skewed distribution, indicating a longer tail on the right side. This implies that there are possibly some outliers causing the mean to be shifted towards the upper range.
Variability: Some variables have wide ranges (e.g., Proline), suggesting there could be notable diversity in the data.

Specifics Observations:

Segments of customers: The Customer_Segment variable contains three separate values (1, 2, and 3), indicating a possible division of data into three groups or clusters.
Alcohol Percentage: The alcohol percentage varies between 11.03% and 14.83%, with an average of 13.00%. This indicates that the typical wine in the dataset has a moderate level of alcohol.
Acidity is measured by the variables Malic Acid and Ash Alcanity. The variables’ distributions indicate that there is a variety of acidity levels present in the wines.
Phenolic compounds in wine such as Total_Phenols, Flavanoids, Nonflavanoid_Phenols, and Proanthocyanins are crucial for flavor and color. The variable distributions indicate various levels of phenolic compounds.
Intensity of Color: The Color_Intensity variable assesses the strength of the color of the wine. The distribution indicates a variety of color intensities, with certain wines exhibiting higher color intensity than others.
Proline, an amino acid, can be found in wine. The Proline distribution indicates variability in Proline levels among wines, with some wines containing notably higher levels than others.

5 PCA (Principal Component Analysis)

5.1 Data Preprocessing

The following code automatically identifies the positions of numeric and categorical variables, making it adaptable to datasets with many columns.

quanti <- wine_cl %>% 
  select_if(is.numeric) %>% 
  colnames()

quantivar <- which(colnames(wine_cl) %in% quanti)

quali <- wine_cl %>% 
  select_if(is.factor) %>% 
  colnames()

qualivar <- which(colnames(wine_cl) %in% quali)

wine_pca <- PCA(
  X = wine_cl, 
  scale.unit = T, # scaling data
  quali.sup = qualivar, 
  graph = F, 
  ncp = 13 
)
wine_pca

#> **Results for the Principal Component Analysis (PCA)**
#> The analysis was performed on 178 individuals, described by 14 variables
#> *The results are available in the following objects:
#> 
#>    name                description                                          
#> 1  "$eig"              "eigenvalues"                                        
#> 2  "$var"              "results for the variables"                          
#> 3  "$var$coord"        "coord. for the variables"                           
#> 4  "$var$cor"          "correlations variables - dimensions"                
#> 5  "$var$cos2"         "cos2 for the variables"                             
#> 6  "$var$contrib"      "contributions of the variables"                     
#> 7  "$ind"              "results for the individuals"                        
#> 8  "$ind$coord"        "coord. for the individuals"                         
#> 9  "$ind$cos2"         "cos2 for the individuals"                           
#> 10 "$ind$contrib"      "contributions of the individuals"                   
#> 11 "$quali.sup"        "results for the supplementary categorical variables"
#> 12 "$quali.sup$coord"  "coord. for the supplementary categories"            
#> 13 "$quali.sup$v.test" "v-test of the supplementary categories"             
#> 14 "$call"             "summary statistics"                                 
#> 15 "$call$centre"      "mean of the variables"                              
#> 16 "$call$ecart.type"  "standard error of the variables"                    
#> 17 "$call$row.w"       "weights for the individuals"                        
#> 18 "$call$col.w"       "weights for the variables"

5.2 Data Visualizing

Subsequently, we will examine each factor plot separately and pinpoint the five most extreme outliers.

plot.PCA(
  x = wine_pca,
  choix = "ind",
  select = "contrib 5", 
  invisible = "quali",
  habillage = "Customer_Segment"
)

Insights:

Based on the visual representation, the points corresponding to Customer IDs 4, 15, 19, 147, and 178 appear to be outliers. These points are located far from the main clusters of data, suggesting that they have unique characteristics or behaviors that deviate significantly from the norm.
Clustering: The plot shows distinct clustering among the customer segments. This suggests that the segments exhibit different patterns or characteristics based on the underlying variables captured by the principal components.
Separation: Customer Segment 3 appears to be well-separated from the other segments, particularly along Dim 1. This could indicate that this segment is characterized by a unique combination of attributes that distinguishes it from the rest.
Overlapping: Customer Segments 1 and 2 show some overlap, suggesting that they share similarities in terms of the variables represented by the principal components. However, there’s still a degree of separation, indicating that they also have distinct characteristics.
Importance of Dimensions: The percentages associated with Dim 1 (36.20%) and Dim 2 (19.21%) indicate the relative importance of these components in explaining the variation in the data. Dim 1 is more influential, suggesting that the variations captured by Dim 1 are more significant in differentiating the customer segments.

For a more comprehensive understanding of variable distribution, the fviz_contrib() function can be employed. This tool offers a visual representation of how rows or columns contribute to the principal components (PCs) in a Principal Component Analysis (PCA), providing valuable insights into the data’s underlying structure.

fviz_contrib(X = wine_pca, 
             axes = 1, # = PC1
             choice = "var")

Notes:

The red dotted horizontal line indicates the anticipated contribution level when the principal component extracts information evenly from all variables.
To calculate it, we divide 100% by the original number of variables.
In this specific instance, the calculation is 100 divided by 13, resulting in 7.7%.
If a variable’s contribution surpasses this threshold, it indicates that the variable is influential in forming that particular principal component.

Correlation between Variable

Next, we’ll determine the relationships between variables using the ggcorr() function.

ggcorr(wine_cl, label = TRUE, hjust = 1)

Insights:

Strong Positive Correlations:
1. Total_Phenols and Flavanoids: These two variables have a very strong positive correlation, as indicated by the dark red square at their intersection. This suggests that wines with higher levels of Total_Phenols tend to also have higher levels of Flavanoids.
2. Flavanoids and OD280: Another strong positive correlation exists between Flavanoids and OD280.
3. Total_Phenols and OD280: A strong positive correlation is also observed between Total_Phenols and OD280.
Negative Correlations:
1. Color_Intensity and Proanthocyanins: There is a moderate negative correlation between Color_Intensity and Proanthocyanins. This suggests that wines with higher Color_Intensity tend to have lower levels of Proanthocyanins.
Chemical Composition: The strong positive correlations between Total_Phenols, Flavanoids, and OD280 might suggest that these compounds are chemically related or co-occur in wine.
Sensory Attributes: The negative correlation between Color_Intensity and Proanthocyanins could indicate that wines with a more intense color might have different sensory characteristics related to Proanthocyanins.

6 Clustering

Next, we need to remove Customer_Segment from our dataframe and store it in new dataframe called wine_clean.

wine_clean <- 
wine_cl %>% 
  select(-Customer_Segment)

The resulting data frame now only contains columns that are of integer, double, or numeric data types. Next, we need to ensure that different variables contribute equally to the analysis, this can be done by using scale() function. The scale() function standardizes numerical data by subtracting the mean of each column and dividing by its standard deviation. This transforms the data to have a mean of 0 and a standard deviation of 1.

wine_z <- scale(wine_clean)

summary(wine_z)

#>     Alcohol           Malic_Acid           Ash            Ash_Alcanity      
#>  Min.   :-2.42739   Min.   :-1.4290   Min.   :-3.66881   Min.   :-2.663505  
#>  1st Qu.:-0.78603   1st Qu.:-0.6569   1st Qu.:-0.57051   1st Qu.:-0.687199  
#>  Median : 0.06083   Median :-0.4219   Median :-0.02375   Median : 0.001514  
#>  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.000000  
#>  3rd Qu.: 0.83378   3rd Qu.: 0.6679   3rd Qu.: 0.69615   3rd Qu.: 0.600395  
#>  Max.   : 2.25341   Max.   : 3.1004   Max.   : 3.14745   Max.   : 3.145637  
#>    Magnesium       Total_Phenols        Flavanoids      Nonflavanoid_Phenols
#>  Min.   :-2.0824   Min.   :-2.10132   Min.   :-1.6912   Min.   :-1.8630     
#>  1st Qu.:-0.8221   1st Qu.:-0.88298   1st Qu.:-0.8252   1st Qu.:-0.7381     
#>  Median :-0.1219   Median : 0.09569   Median : 0.1059   Median :-0.1756     
#>  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000     
#>  3rd Qu.: 0.5082   3rd Qu.: 0.80672   3rd Qu.: 0.8467   3rd Qu.: 0.6078     
#>  Max.   : 4.3591   Max.   : 2.53237   Max.   : 3.0542   Max.   : 2.3956     
#>  Proanthocyanins    Color_Intensity        Hue               OD280        
#>  Min.   :-2.06321   Min.   :-1.6297   Min.   :-2.08884   Min.   :-1.8897  
#>  1st Qu.:-0.59560   1st Qu.:-0.7929   1st Qu.:-0.76540   1st Qu.:-0.9496  
#>  Median :-0.06272   Median :-0.1588   Median : 0.03303   Median : 0.2371  
#>  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
#>  3rd Qu.: 0.62741   3rd Qu.: 0.4926   3rd Qu.: 0.71116   3rd Qu.: 0.7864  
#>  Max.   : 3.47527   Max.   : 3.4258   Max.   : 3.29241   Max.   : 1.9554  
#>     Proline       
#>  Min.   :-1.4890  
#>  1st Qu.:-0.7824  
#>  Median :-0.2331  
#>  Mean   : 0.0000  
#>  3rd Qu.: 0.7561  
#>  Max.   : 2.9631

K-means

Next, we need to partitions data into K clusters based on the mean distance to cluster centers using K-mean algorithm.

Parameters:

x: dataset
centers: number of centroids (k)

Note: It is necessary to use set.seed() due to the random initialization in the initial stages of the K-means algorithm.

RNGkind(sample.kind = "Rounding")
set.seed(100)

wine_km <- kmeans(
  x = wine_z,
  centers = 5
)

next bit is the result of k-means() function:

wine_km$centers

#>      Alcohol Malic_Acid         Ash Ash_Alcanity   Magnesium Total_Phenols
#> 1 -0.1886092 -0.3973055  0.88933287    0.3084403  1.11034100     0.4408200
#> 2  0.1766166  0.9039567  0.21536153    0.5494898 -0.07712756    -0.9873154
#> 3 -0.7932129 -0.2670132 -1.06563358   -0.2679823 -0.68011948     0.2781082
#> 4  1.0751808 -0.3606243  0.16642266   -0.8929012  0.46102024     0.9849139
#> 5 -0.9985113 -0.4250204  0.08279355    0.6568276 -0.57973353    -0.5680217
#>   Flavanoids Nonflavanoid_Phenols Proanthocyanins Color_Intensity        Hue
#> 1  0.5438510           -0.3483549       0.5933360      -0.5275837  0.7570958
#> 2 -1.2236663            0.7114800      -0.7591372       0.9516989 -1.1867156
#> 3  0.2668683           -0.5684285       0.1678066      -0.8107192  0.2991794
#> 4  1.0567190           -0.6716820       0.6866569       0.3596910  0.4267835
#> 5 -0.3042323            0.8751497      -0.4437352      -0.9381988  0.5304372
#>         OD280    Proline
#> 1  0.61524013  0.1902360
#> 2 -1.28577141 -0.3952058
#> 3  0.50311030 -0.7139802
#> 4  0.79065585  1.3330286
#> 5 -0.09609123 -0.7561714

Optional step when we don’t have a predetermined value for K based on business decisions.

The quality of clustering results can be evaluated using three metrics:

Within Sum of Squares ($withinss): This measures the sum of squared distances from each data point to its assigned cluster centroid.
Between Sum of Squares ($betweenss): This measures the sum of squared distances, weighted by cluster size, from each cluster centroid to the overall mean.
Total Sum of Squares ($totss): This measures the total sum of squared distances from each data point to the overall mean.

wine_km$tot.withinss

#> [1] 1106.418

wine_km$betweenss / wine_km$totss

#> [1] 0.519158

Insights:

Within-Cluster Variation: The value of 1106.418 for tot.withinss suggests that there is a moderate level of within-cluster variation. This means that data points within their respective clusters are reasonably close together, but there might be some room for improvement in terms of compactness.
Explained Variance Ratio (EVR): The explained variance ratio of 0.519158 indicates that the clustering is moderately effective in separating the data into distinct groups. This means that the clusters are reasonably well-separated, but there might be some overlap between them. The higher ratio close to 1, the better because it represent actual data distribution.

7 Parameter Tuning: Determining the Optimal Number of Clusters ( $k$ )

To determine optimum number of clusters ($k$), objectively we can use a method called Elbow method. The elbow method is a common technique used to determine the optimal number of clusters ($k$) in K-means clustering. It involves plotting the Within-Cluster Sum of Squares (WSS) against the number of clusters. The “elbow” point in the plot, where the rate of decrease in WSS starts to slow down significantly, is often considered the optimal number of clusters.

elbow_plot <- fviz_nbclust(wine_z,
             FUNcluster = kmeans,
             method = "wss")

k_optimal <- 3

elbow_plot + geom_vline(xintercept = k_optimal, linetype = "dashed")

Insight :

Based on the visualization provided, the optimal $k$ value for your dataset is 3. This means that dividing the data into 3 clusters will provide good and representative clustering results.

Re-run K-means with the Optimal Number of Clusters ($k$):

RNGkind(sample.kind = "Rounding")
set.seed(100)

wine_km_opt <- kmeans(x = wine_z,
                       centers = 3)

8 Cluster Interpretation and Profiling

Next, we need to analyze the characteristics of each cluster and understand how the data points are grouped based on their similarity.

wine_km_opt$cluster

#>   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#>  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3 3 3 3 1
#>  [75] 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
#> [112] 3 3 3 3 3 3 3 2 3 3 1 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#> [149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

The output wine_km_opt$cluster shows the cluster assignments for each data point in the wine_z dataset. The values in the output represent the cluster labels assigned to each observation.

wine$kelompok <- wine_km_opt$cluster

wine_cluster <- 
wine %>% 
  group_by(kelompok) %>% 
  summarise_all(mean)

wine_cluster

Cluster Profiling:

Based on above code bit and the result we can profile our clusters as follows:

Cluster 1: Premium Wine

Highest: Alcohol, Ash, Magnesium, Total Phenols, Flavanoids, Proanthocyanins, Hue, OD280, Proline
Lowest: Malic Acid, Ash Alcanity, Nonflavanoid Phenols
Label: Premium Wine (premium quality)

Cluster 2: Unique Wine

Highest: Malic Acid, Ash Alcanity, Nonflavanoid Phenols, Color Intensity
Lowest: Total Phenols, Flavanoids, Proanthocyanins, Hue, OD280
Label: Unique Wine (have a distinctive flavor)

Cluster 3: Fresh Wine

Highest: None
Lowest: Alcohol, Malic Acid, Ash, Magnesium, Color Intensity, Proline
Label: Fresh Wine (have a light flavor)

9 Conclusion

Based on the cluster analysis, we can identify three distinct customer segments

Cluster 1: Premium Wine

Characteristics: High levels of key wine components, suggesting premium quality and complexity.
Marketing Strategies: Target high-end wine connoisseurs and collectors. Emphasize the wine’s exceptional quality, complexity, and unique flavor profile. Consider premium packaging and distribution channels.
Branding Approaches: Position the brand as a luxury, exclusive product. Use elegant and sophisticated branding elements.

Cluster 2: Unique Wine

Characteristics: Distinctive flavor profile with higher levels of Malic Acid, Ash Alcanity, and Nonflavanoid Phenols.
Marketing Strategies: Target consumers who appreciate unique and unconventional wines. Highlight the wine’s distinctive taste and its ability to pair well with specific foods.
Branding Approaches: Create a brand identity that reflects the wine’s unique character. Use bold and eye-catching visuals.

Cluster 3: Fresh Wine

Characteristics: Lighter flavor profile with lower levels of most components.
Marketing Strategies: Target consumers who prefer lighter, more refreshing wines. Emphasize the wine’s freshness and drinkability.
Branding Approaches: Create a brand identity that conveys a sense of lightness and freshness. Use vibrant colors and imagery.

10 Reference

Sadegh Jalalian. Wine Customer Segmentation. Retrieved from https://www.kaggle.com/datasets/sadeghjalalian/wine-customer-segmentation/data

Optimizing Wine Marketing: A Cluster Analysis

Intan M Sari

2024-08-23

1 Introduction

2 Business Question

3 Preparation

3.1 Importing Libraries

3.2 Importing Data

4 Data Cleansing and Exploratory Data Analysis (EDA)

Result

Structure with Glimpse

Data Cleansing & Missing Value

Descriptive Statistics

5 PCA (Principal Component Analysis)

5.1 Data Preprocessing

5.2 Data Visualizing

6 Clustering

7 Parameter Tuning: Determining the Optimal Number of Clusters ( \(k\) )

8 Cluster Interpretation and Profiling

9 Conclusion

10 Reference