Exploring Brand Perception and Repeat Purchase Behavior Through Unsupervised Learning

Timeseries Analysis

1 Preparation

1.1 Load Packages

library(pacman)
p_load(tidyverse, tidymodels, here, gt, forcats, dplyr, recipes, GGally, gapminder, ggplot2, corrplot, psych, factoextra, cluster, dendextend, broom, visdat)

1.2 Load Dataset

brand <- here("Data/brand.csv") |> read.csv()

glimpse(brand)

Rows: 1,000
Columns: 10
$ performance   <int> 2, 1, 2, 1, 1, 2, 1, 2, 2, 3, 2, 1, 3, 1, 3, 3, 2, 3, 2,…
$ leadership    <int> 4, 1, 3, 6, 1, 8, 1, 1, 1, 1, 1, 2, 3, 5, 1, 2, 1, 8, 5,…
$ productLatest <int> 8, 4, 5, 10, 5, 9, 5, 7, 8, 9, 5, 7, 10, 7, 3, 7, 6, 9, …
$ fun           <int> 8, 7, 9, 8, 8, 5, 7, 5, 10, 8, 6, 7, 10, 10, 6, 6, 7, 9,…
$ serious       <int> 2, 1, 2, 3, 1, 3, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 2, 4, 2,…
$ bargain       <int> 9, 1, 9, 4, 9, 8, 5, 8, 7, 3, 1, 3, 3, 1, 3, 10, 1, 7, 6…
$ bestValue     <int> 7, 1, 5, 5, 9, 7, 1, 7, 7, 3, 1, 2, 3, 3, 4, 5, 3, 10, 1…
$ trendiness    <int> 4, 2, 1, 2, 1, 1, 1, 7, 5, 4, 1, 1, 3, 3, 4, 1, 5, 4, 5,…
$ repeatBuy     <int> 6, 2, 6, 1, 1, 2, 1, 1, 1, 1, 2, 5, 3, 1, 2, 3, 1, 2, 2,…
$ name          <chr> "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "…

2 Data Cleaning

2.1 Missing Values

# calculate the percentage of NA cells relative to all cells in the dataset
percent_na <- 100 * sum(is.na(brand)) / (nrow(brand) * ncol(brand))

# calculate the total number of NA cells
full_na <- sum(is.na(brand))

# visualize the missing values
vis_miss(brand)

The dataset has no missing values.

2.2 Converting categorical variables into factors

# Convert the brand name to a factor and recode the letters a-j as 1-10
prepared_brand <- brand |> 
  mutate(name = as_factor(name)) |>  
  mutate(name = fct_recode(name,
                           "1" = "a",
                           "2" = "b",
                           "3" = "c",
                           "4" = "d",
                           "5" = "e",
                           "6" = "f",
                           "7" = "g",
                           "8" = "h",
                           "9" = "i",
                           "10" = "j"))


prepared_brand$name <- factor(prepared_brand$name, levels = as.character(1:10))

glimpse(prepared_brand)

Rows: 1,000
Columns: 10
$ performance   <int> 2, 1, 2, 1, 1, 2, 1, 2, 2, 3, 2, 1, 3, 1, 3, 3, 2, 3, 2,…
$ leadership    <int> 4, 1, 3, 6, 1, 8, 1, 1, 1, 1, 1, 2, 3, 5, 1, 2, 1, 8, 5,…
$ productLatest <int> 8, 4, 5, 10, 5, 9, 5, 7, 8, 9, 5, 7, 10, 7, 3, 7, 6, 9, …
$ fun           <int> 8, 7, 9, 8, 8, 5, 7, 5, 10, 8, 6, 7, 10, 10, 6, 6, 7, 9,…
$ serious       <int> 2, 1, 2, 3, 1, 3, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 2, 4, 2,…
$ bargain       <int> 9, 1, 9, 4, 9, 8, 5, 8, 7, 3, 1, 3, 3, 1, 3, 10, 1, 7, 6…
$ bestValue     <int> 7, 1, 5, 5, 9, 7, 1, 7, 7, 3, 1, 2, 3, 3, 4, 5, 3, 10, 1…
$ trendiness    <int> 4, 2, 1, 2, 1, 1, 1, 7, 5, 4, 1, 1, 3, 3, 4, 1, 5, 4, 5,…
$ repeatBuy     <int> 6, 2, 6, 1, 1, 2, 1, 1, 1, 1, 2, 5, 3, 1, 2, 3, 1, 2, 2,…
$ name          <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…

Converting the categorical variable into a factor ensures that the distinct brand identities are encoded as discrete levels, so that the dummy-coding step and the regression treat brand as a category rather than as free text.

2.3 Preprocessing steps

brand_recipe <- recipe(repeatBuy ~ ., data = prepared_brand) |>
  step_normalize(all_numeric_predictors()) |>
  step_pca(performance, leadership, productLatest, fun, serious, bargain, bestValue, trendiness, num_comp = 3, id = "pca") |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())

print(brand_recipe)

step_normalize ensures all numeric predictors contribute equally, which is crucial for accurate distance calculations in clustering, unbiased principal components in PCA, and comparable coefficients in regression.
step_pca reduces dimensionality by transforming variables into uncorrelated components, simplifying the dataset for easier visualization and interpretation in clustering, and improving regression model performance by focusing on the most important features.
step_dummy converts categorical predictors into binary variables, enabling their use in PCA, cluster analysis, and regression models that require numerical input.
step_zv eliminates predictors with zero variance, which improves computational efficiency and ensures that only informative variables are used in clustering, PCA, and regression analyses.
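
As a rough illustration of what the normalization and PCA steps do, the following base-R sketch applies scale() and prcomp() to made-up ratings. The toy data frame and its values are assumptions for illustration only, and the base functions stand in for step_normalize and step_pca rather than reproducing the recipe itself.

```r
set.seed(42)

# Hypothetical toy ratings on a 1-10 scale (not the brand data)
toy <- data.frame(
  performance = sample(1:10, 50, replace = TRUE),
  fun         = sample(1:10, 50, replace = TRUE),
  bargain     = sample(1:10, 50, replace = TRUE),
  bestValue   = sample(1:10, 50, replace = TRUE)
)

# Center and scale each column, analogous to step_normalize()
toy_scaled <- scale(toy)

# PCA on the already-normalized data, keeping three components,
# analogous to step_pca(num_comp = 3)
toy_pca    <- prcomp(toy_scaled, center = FALSE, scale. = FALSE)
toy_scores <- toy_pca$x[, 1:3]

round(colMeans(toy_scaled), 10)      # column means after centering
round(apply(toy_scaled, 2, sd), 10)  # column standard deviations after scaling
round(cor(toy_scores), 10)           # correlations between component scores
```

After normalization every column has mean 0 and standard deviation 1, and the component scores are mutually uncorrelated, which is what makes them convenient regression predictors.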

3 Cluster Analysis

3.1 Descriptive analysis of numerical variables

# Descriptive 
describe(prepared_brand)

              vars    n mean   sd median trimmed  mad min max range  skew
performance      1 1000 4.49 3.20    4.0    4.24 4.45   1  10     9  0.43
leadership       2 1000 4.42 2.61    4.0    4.27 2.97   1  10     9  0.28
productLatest    3 1000 6.20 3.08    7.0    6.37 4.45   1  10     9 -0.35
fun              4 1000 6.07 2.74    6.0    6.18 2.97   1  10     9 -0.24
serious          5 1000 4.32 2.78    4.0    4.07 2.97   1  10     9  0.58
bargain          6 1000 4.26 2.67    4.0    4.07 2.97   1  10     9  0.37
bestValue        7 1000 4.34 2.40    4.0    4.21 2.97   1  10     9  0.33
trendiness       8 1000 5.22 2.74    5.0    5.19 2.97   1  10     9  0.02
repeatBuy        9 1000 3.73 2.54    3.0    3.43 2.97   1  10     9  0.74
name*           10 1000 5.50 2.87    5.5    5.50 3.71   1  10     9  0.00
              kurtosis   se
performance      -1.23 0.10
leadership       -0.96 0.08
productLatest    -1.17 0.10
fun              -0.96 0.09
serious          -0.73 0.09
bargain          -0.98 0.08
bestValue        -0.78 0.08
trendiness       -1.06 0.09
repeatBuy        -0.45 0.08
name*            -1.23 0.09

brand_cor <- brand[, -which(names(brand) == "name")]

cor_matrix <- cor(brand_cor)

corrplot(cor_matrix, type = "full", order = "original", tl.col = "black", tl.srt = 50, addCoef.col = "black",number.cex = 0.7, col = colorRampPalette(c("red", "white", "blue"))(200))

The strongest positive correlation with repeatBuy is bestValue (0.51), and the strongest negative correlation is productLatest (-0.47). No variable has a correlation above 0.8 with repeatBuy. Multicollinearity, however, concerns correlations among the predictors themselves, so it is the pairwise correlations between the other variables in the plot that matter for the later supervised regression analysis, in which repeatBuy is the outcome variable.
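
Because multicollinearity is a property of the predictors rather than of their relationship with the outcome, one quick check is the largest absolute pairwise correlation among the predictor columns. The sketch below runs on simulated data; the column names x1 to x3 and y are hypothetical, and the 0.8 cutoff simply mirrors the threshold used above.

```r
set.seed(1)

# Simulated predictors and outcome (hypothetical, not the brand data)
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
toy$y <- 0.5 * toy$x1 - 0.4 * toy$x2 + rnorm(100)

# Correlation matrix of the predictors only (outcome column dropped)
pred_cor <- cor(toy[, setdiff(names(toy), "y")])

# Largest absolute off-diagonal correlation among the predictors
max_abs_cor <- max(abs(pred_cor[upper.tri(pred_cor)]))

# TRUE would flag at least one predictor pair above the 0.8 threshold
max_abs_cor > 0.8
```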

3.2 Define the number of the clusters

Because the goal is to compare the relationships among the ordinal rating variables, which correspond to the shopping factors, the brand names are removed before clustering so that the clusters are not biased by brand.

# keep only the numeric rating columns (this drops the brand factor)
brand_numeric <- prepared_brand |> select(where(is.numeric))

3.2.1 Elbow Method

# (A) Elbow method
p1 <- fviz_nbclust(brand_numeric, FUNcluster = hcut, method = "wss", 
                   k.max = 10) +
  labs(title="(A) Elbow method")
p1

3.2.2 Silhouette Method

# (B) Silhouette method
p2 <- fviz_nbclust(brand_numeric, FUNcluster = hcut, method = "silhouette",
                   k.max = 10) +
 labs(title="(B) Silhouette method")
p2

The first plot uses the elbow method, which plots the total within-cluster sum of squares (inertia), that is, the sum of squared distances from each point to its assigned cluster center, against the number of clusters. The “elbow” point in the plot, beyond which adding clusters yields only small reductions, indicates the optimal number of clusters.

The second plot uses the silhouette method, which calculates the average silhouette score for different numbers of clusters. The optimal number of clusters maximizes the average silhouette score.

The optimal number of clusters is 3 in this analysis.
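
The two criteria can also be computed by hand, which makes clear what the plots summarize. The following is a minimal sketch on simulated data with three well-separated groups; the toy matrix, the wss helper and the choice of Ward linkage via hclust are assumptions for illustration, not the exact settings behind the plots above.

```r
library(cluster)  # provides silhouette()

set.seed(123)

# Hypothetical toy data with three well-separated groups (not the brand data)
toy <- rbind(
  matrix(rnorm(60, mean = 0),  ncol = 2),
  matrix(rnorm(60, mean = 6),  ncol = 2),
  matrix(rnorm(60, mean = 12), ncol = 2)
)

d  <- dist(toy)
hc <- hclust(d, method = "ward.D2")

# Total within-cluster sum of squares for a given cluster assignment
wss <- function(x, cl) {
  sum(sapply(split(as.data.frame(x), cl), function(g) {
    sum(scale(g, center = TRUE, scale = FALSE)^2)
  }))
}

ks <- 2:6
results <- data.frame(
  k       = ks,
  wss     = sapply(ks, function(k) wss(toy, cutree(hc, k))),
  avg_sil = sapply(ks, function(k) {
    mean(silhouette(cutree(hc, k), d)[, "sil_width"])
  })
)
results

# Number of clusters with the highest average silhouette width
best_k <- results$k[which.max(results$avg_sil)]
best_k
```

The wss column is what the elbow plot draws, and avg_sil is what the silhouette plot draws; the elbow is read by eye, while the silhouette criterion picks the maximum directly.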

3.3 Create Clusters

cluster	n
1	270
2	494
3	236

cluster	performance	leadership	productLatest	fun	serious	bargain	bestValue	trendiness	repeatBuy
1	7.407407	7.100000	7.185185	4.507407	7.185185	3.537037	3.781481	6.466667	3.792593
2	2.497976	2.983806	7.435223	7.340081	2.882591	3.491903	3.429150	5.939271	2.334008
3	5.313559	4.347458	2.466102	5.190678	4.063559	6.690678	6.872881	2.288136	6.567797
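
The code that produced these two tables is not shown in the report. A minimal sketch of how such tables can be built, assuming hierarchical clustering cut at three clusters, is given below; the toy data frame, its column values and the Ward linkage are hypothetical choices for illustration.

```r
set.seed(7)

# Hypothetical toy data: a brand label and two rating columns (not the brand data)
toy <- data.frame(
  brand     = rep(c("a", "b", "c"), each = 20),
  bargain   = c(rnorm(20, 2), rnorm(20, 5), rnorm(20, 8)),
  repeatBuy = c(rnorm(20, 3), rnorm(20, 5), rnorm(20, 7))
)
num_cols <- c("bargain", "repeatBuy")

# Hierarchical clustering on the numeric columns, cut into three clusters
hc <- hclust(dist(toy[, num_cols]), method = "ward.D2")
toy$cluster <- factor(cutree(hc, k = 3))

# Cluster sizes
cluster_sizes <- table(toy$cluster)
cluster_sizes

# Mean of each numeric variable within each cluster
cluster_means <- aggregate(toy[, num_cols],
                           by = list(cluster = toy$cluster), FUN = mean)
cluster_means

# Cross-tabulation of brand by cluster, the basis for a frequency chart
brand_by_cluster <- table(brand = toy$brand, cluster = toy$cluster)
brand_by_cluster
```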

3.3.1 The frequencies of clusters by brand

3.3.2 Interpretation of the frequency distribution

This chart displays the frequency distribution (between 0 and 100) of clusters across the different brands (labeled ‘a’ to ‘j’).

Brand a: Dominated by cluster 2, with a significant frequency. Cluster 1 and 3 have very low frequencies.
Brand b: Mainly belongs to cluster 1 with some presence of cluster 2, and a negligible presence of cluster 3.
Brand c: Similar to Brand b, mainly in cluster 1, with minimal presence in clusters 2 and 3.
Brand d: Primarily in cluster 2, similar to Brand a.
Brand e: Cluster 2 is the most frequent, followed by cluster 3 and a small frequency in cluster 1.
Brand f: Mostly in cluster 3 with some presence in cluster 2 and very low frequency in cluster 1.
Brand g: Predominantly in cluster 3 with no presence in cluster 1 and very low frequency in cluster 2.
Brand h: Primarily in cluster 2 with a minor presence in cluster 1 and no presence in cluster 3.
Brand i: Mostly in cluster 2, followed by cluster 1, and very low in cluster 3.
Brand j: Dominated by cluster 2 with some presence of cluster 3 and very low presence of cluster 1.

In terms of clustering, cluster 1 mainly contains brands b and c, cluster 2 mainly contains brands a, d, e, h, i and j, and cluster 3 mainly contains brands f and g.

3.4 The box plots representing the relationships among variables and repeatBuy

3.4.1 Interpretation of the box plots

These charts show box plots of repeat buy by cluster, faceted by performance, leadership, product latest, fun, serious, bargain, best value and trendiness.

In the box plot faceted by performance, cluster 3 has the highest median, followed by cluster 1, and cluster 2 has the lowest. Cluster 3 has fewer low outliers overall than clusters 1 and 2, which suggests that its repeat purchase rates differ from those of clusters 1 and 2.

In the box plot faceted by leadership, as in the performance facet, cluster 3 has the highest median, followed by cluster 1 and cluster 2. The outliers for clusters 1 and 3 are generally higher. At repeat buy values of 9 and 10, cluster 1 has more observations than clusters 2 and 3.

In the box plot faceted by product latest, cluster 3 has the highest median, followed by cluster 1, and cluster 2 has the lowest. Clusters 1 and 2 have more outliers. Cluster 3 has no observations at repeat buy values of 9 and 10, while clusters 1 and 2 have many outliers at a repeat buy value of 10, which indicates high variance at that level.

In the box plot faceted by fun, cluster 3 has the highest median, followed by cluster 1, and cluster 2 has the lowest. Cluster 2 has noticeably more outliers overall.

In the box plot faceted by serious, cluster 3 has the highest median, and clusters 1 and 2 both have relatively low medians. Cluster 1 has noticeably more outliers and a wider IQR than clusters 2 and 3, implying that cluster 1 is more variable on this dimension.

In the box plot faceted by bargain, cluster 3 has the highest median, followed by cluster 1, and cluster 2 has the lowest. Outliers are distributed fairly evenly across the three clusters.

In the box plot faceted by best value, cluster 3 has the highest median, followed by cluster 1, and cluster 2 has the lowest. Clusters 1 and 3 have more outliers. Cluster 3 has very few observations at a repeat buy value of 1, while clusters 1 and 2 have very few at values of 9 and 10.

In the box plot faceted by trendiness, cluster 3 has the highest median, followed by cluster 1, and cluster 2 has the lowest. Clusters 1 and 2 have more outliers. At repeat buy values above 7, cluster 3 has noticeably fewer observations than clusters 1 and 2.

3.4.2 Overall Interpretations

Cluster 1:

Repeat Buying Behavior: Typically has the second-highest median, indicating moderate repeat buying behavior. The moderate number of outliers suggests some variability in purchasing patterns.
Customer Traits:
- Open-minded: Varied perceptions across different facets.
- Value-conscious: Interest in bargains and good value.
- Skeptical: Lower perceptions of leadership and seriousness, indicating a tendency to question authority.

Cluster 2:

Repeat Buying Behavior: Consistently has the lowest median, indicating the least frequent repeat buying behavior. More outliers suggest high variability in repeat buying patterns.
Customer Traits:
- Conventional: More conservative perceptions.
- Practical: Prioritize performance and value.
- Reserved: Moderate perceptions as indicated by lower outliers.

Cluster 3:

Repeat Buying Behavior: Consistently shows the highest median repeat buy across all facets, indicating strong repeat buying behavior. Fewer lower outliers suggest more consistent purchasing patterns.
Customer Traits:
- Enthusiastic: High perceptions across most facets.
- Trend-conscious: Preference for modern and enjoyable experiences.
- Brand loyalists: Lower variability in perceptions, suggesting strong brand loyalty.

4 Principal Component Analysis (PCA)

4.0.1 Descriptive analysis of categorical variables

The brands with the highest mean repeat buy value are ‘f’ and ‘g’, both with a mean of 7.2. This indicates that customers are most likely to make repeat purchases of these brands in the future, suggesting strong customer loyalty or satisfaction with these brands. The brand with the lowest mean repeat buy value is ‘j’, with a mean of 1.3. This suggests that customers are least likely to repurchase this brand, potentially indicating low customer satisfaction or other factors that deter repeat purchases. The box plot shows outliers, represented by dots above the upper whiskers, for brands ‘a’, ‘e’ and ‘g’. These outliers indicate that some customers have an exceptionally high likelihood of repurchasing these brands compared to the overall trend.

4.0.2 Prep the recipe

4.0.3 Extract PCA components

PCA components
terms	value	component	id
variance	3	1	pca
variance	2	2	pca
variance	1	3	pca
variance	1	4	pca
variance	1	5	pca
variance	0	6	pca
variance	0	7	pca
variance	0	8	pca
cumulative variance	3	1	pca
cumulative variance	5	2	pca
cumulative variance	6	3	pca
cumulative variance	6	4	pca
cumulative variance	7	5	pca
cumulative variance	7	6	pca
cumulative variance	8	7	pca
cumulative variance	8	8	pca
percent variance	31	1	pca
percent variance	26	2	pca
percent variance	13	3	pca
percent variance	9	4	pca
percent variance	8	5	pca
percent variance	5	6	pca
percent variance	4	7	pca
percent variance	3	8	pca
cumulative percent variance	31	1	pca
cumulative percent variance	58	2	pca
cumulative percent variance	71	3	pca
cumulative percent variance	80	4	pca
cumulative percent variance	88	5	pca
cumulative percent variance	93	6	pca
cumulative percent variance	97	7	pca
cumulative percent variance	100	8	pca

Three PCA components capture 71% of the variation in the 8 variables, as shown by the cumulative percent variance above, so three components are deemed sufficient to explain the variation in the variables.
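
The four blocks of the table above (variance, cumulative variance, percent variance, cumulative percent variance) all derive from the component standard deviations. A minimal base-R sketch, assuming a made-up data frame of eight rating columns rather than the brand data:

```r
set.seed(2024)

# Hypothetical toy ratings in 8 columns (not the brand data)
toy <- as.data.frame(matrix(sample(1:10, 400, replace = TRUE), ncol = 8))

pca <- prcomp(toy, center = TRUE, scale. = TRUE)

variance     <- pca$sdev^2                      # eigenvalue of each component
pct_variance <- 100 * variance / sum(variance)  # percent variance per component
cum_pct      <- cumsum(pct_variance)            # cumulative percent variance

variance_table <- data.frame(
  component  = seq_along(variance),
  variance   = round(variance, 2),
  percent    = round(pct_variance, 1),
  cumulative = round(cum_pct, 1)
)
variance_table
```

With standardized variables the eigenvalues sum to the number of variables, which is why the cumulative variance in the report ends at 8 and the cumulative percentage at 100.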

4.0.4 Obtain the loading

# A tibble: 64 × 4
   terms          value component id   
   <chr>          <dbl> <chr>     <chr>
 1 performance    0.242 PC1       pca  
 2 leadership     0.219 PC1       pca  
 3 productLatest -0.420 PC1       pca  
 4 fun           -0.284 PC1       pca  
 5 serious        0.164 PC1       pca  
 6 bargain        0.436 PC1       pca  
 7 bestValue      0.493 PC1       pca  
 8 trendiness    -0.418 PC1       pca  
 9 performance    0.428 PC2       pca  
10 leadership     0.532 PC2       pca  
# ℹ 54 more rows

In PC1, “productLatest” has the largest negative loading (-0.420), so it moves inversely to this component, while “bestValue” has the largest positive loading (0.493), indicating a strong positive contribution. For PC2, “leadership” has the largest positive loading (0.532) and is the main driver of this component, whereas “fun” has the largest negative loading (-0.261). In PC3, “bargain” dominates with a large negative loading (-0.529), while “performance” has the largest positive loading (0.032), which is negligible by comparison.
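
When reading loadings it helps to remember that the sign of a principal component is arbitrary: multiplying a whole loading vector by -1 gives an equally valid component, so it is the pattern of signs within a component that carries meaning. The sketch below illustrates this with prcomp on made-up data; the three column names are borrowed from the report only as labels, and the values are hypothetical.

```r
set.seed(11)

# Hypothetical toy ratings (not the brand data)
toy <- as.data.frame(matrix(sample(1:10, 300, replace = TRUE), ncol = 3))
names(toy) <- c("bargain", "fun", "leadership")

pca      <- prcomp(toy, center = TRUE, scale. = TRUE)
loadings <- pca$rotation   # rows are variables, columns are components

# Variable with the largest absolute loading on each component
top_var <- apply(abs(loadings), 2, function(v) names(v)[which.max(v)])
top_var

# Each loading vector has unit length
colSums(loadings^2)

# Flipping the sign of PC1 leaves the variance of every score unchanged
flipped <- loadings
flipped[, 1] <- -flipped[, 1]
flipped_var <- apply(scale(toy) %*% flipped, 2, var)
flipped_var
```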

4.0.5 Interpretation of PCA Components

PC1 (Principal Component 1): Contrasts value/bargain perception with trendiness/recency.

BestValue (0.493): This variable has the highest positive loading, suggesting that brands perceived as offering good value are strongly associated with PC1.
Bargain (0.436): This also has a strong positive loading, indicating that brands seen as bargains are similarly associated with PC1.
Trendiness (-0.418): This variable has a strong negative loading, suggesting an inverse relationship with PC1. Brands perceived as trendy are less associated with PC1.
ProductLatest (-0.420): This also has a strong negative loading, indicating that brands with the most recent products are inversely related to PC1.

A higher PC1 score indicates that customers perceive the brand as good value and a bargain, but not as trendy or as offering the latest products. PC1 can therefore be read as a value-versus-trend dimension. Whether it relates to repeat purchase is tested in the regression in Section 6.

PC2 (Principal Component 2): Emphasizes leadership, seriousness, performance, and trendiness.

Leadership (0.532): This variable has the highest positive loading on PC2, indicating a strong association.
Serious (0.517): This also has a strong positive loading, showing a significant relationship with PC2.
Performance (0.428): This variable has a moderate positive loading on PC2.
Trendiness (0.303): This has the lowest loading among the top contributors but is still positively associated with PC2.

A higher PC2 score indicates that customers perceive the brand as a leader in its field, serious, a strong performer and trendy. Whether this perception relates to repeat purchase is tested in the regression in Section 6.

PC3 (Principal Component 3): Loads negatively on bargain/value, recent products and fun.

Bargain (-0.530): This variable has the highest negative loading on PC3, indicating a strong inverse relationship.
ProductLatest (-0.514): This also has a strong negative loading.
BestValue (-0.414): This variable has a moderate negative loading.
Fun (-0.392): This has the lowest loading among the top contributors but is still negatively associated with PC3.

A higher PC3 score indicates that customers do not perceive the brand as a bargain, as having recent products, as good value or as fun. A lower (more negative) PC3 score indicates the opposite.

5 Comparison of PCA Scores with Clusters

5.0.0.1 Prep and Bake the recipe

Rows: 1,000
Columns: 14
$ repeatBuy <int> 6, 2, 6, 1, 1, 2, 1, 1, 1, 1, 2, 5, 3, 1, 2, 3, 1, 2, 2, 2, …
$ PC1       <dbl> 0.70208363, -1.27234346, 0.97033026, -0.34325825, 1.59384234…
$ PC2       <dbl> -1.6139918, -1.8687620, -2.2821677, -0.7265777, -3.2033164, …
$ PC3       <dbl> -1.7832971, 2.0516587, -0.6693561, -0.6666432, -1.1081565, -…
$ name_X2   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ name_X3   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ name_X4   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ name_X5   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ name_X6   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ name_X7   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ name_X8   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ name_X9   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ name_X10  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ cluster   <fct> 3, 2, 3, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, …

5.1 Compare principal component scores with clusters

5.1.1 Results

PC1 is an important dimension separating the three clusters. Cluster 3 has a high positive mean around 2, indicating that these brands are strongly perceived as offering good value and bargains, but not as trendy or having the latest products. Cluster 2 has a high negative mean around -1.5, suggesting that these brands are viewed as trendy and having recent product lines, but lacking in value/bargain perception. Cluster 1 has a moderately negative mean around -0.5, placing it between the value/bargain and trendy/recent ends of this dimension.

PC2 separates cluster 2 from the other two to some degree. Cluster 2 has a moderate positive mean around 0.5, which aligns with brands seen as leaders, top performers and serious players that are also relatively trendy. Clusters 1 and 3 have slightly negative means around -0.2, indicating a lack of strong leadership/performance/seriousness perception for these brand groups.

On PC3, cluster 1 has a high positive mean around 4, suggesting that these brands are viewed as less fun, less bargain/value-focused, and not having the most recent product lines. Cluster 3 has a strongly negative mean around -2, representing brands perceived as high bargain/value brands with fun and recent products. Cluster 2 has a slightly negative mean around -0.1, placing it marginally toward the bargain/value side of this dimension.

Overall, cluster 3 aligns with value/bargain positioning but lacks trendiness and new products, and is not seen as a leader or top performer. Cluster 2 represents trendy brands with new products, but lacks value/bargain and leadership/performance attributes. Cluster 1 is moderately fun and bargain/value focused and lacks a trendy/recent product reputation, but is viewed as a group of serious, leading, top-performing brands.
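
The cluster means quoted above come from averaging each component score within each cluster. The report does not show that step, so the following is only a sketch on made-up data, with the toy data frame and the Ward linkage as assumptions.

```r
set.seed(5)

# Hypothetical toy ratings in 4 columns, 90 rows (not the brand data)
toy <- as.data.frame(matrix(sample(1:10, 360, replace = TRUE), ncol = 4))

# Scores on the first three principal components
scores <- prcomp(toy, center = TRUE, scale. = TRUE)$x[, 1:3]

# Three clusters from hierarchical clustering
cluster <- factor(cutree(hclust(dist(toy), method = "ward.D2"), k = 3))

# Mean score on each retained component within each cluster
pc_means <- aggregate(as.data.frame(scores),
                      by = list(cluster = cluster), FUN = mean)
pc_means
```

Because the scores are centered, the size-weighted average of the cluster means is zero on every component, so a strongly positive mean in one cluster must be offset by negative means elsewhere.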

5.1.2 Interpretation of Clusters

Cluster 1 (Purple):

Principal Component Scores: Neutral to slightly positive on PC1, very positive on PC2, neutral on PC3.
Repeat Buy Behavior: Moderate loyalty with variability in purchasing patterns.
Customer Traits: Open-minded, value-conscious, skeptical of leadership.
Majority brands in this cluster: b and c

Cluster 2 (Orange):

Principal Component Scores: Negative on PC1, somewhat negative on PC2, slightly negative on PC3.
Repeat Buy Behavior: Low loyalty with high variability in purchasing patterns.
Customer Traits: Conventional, practical, reserved.
Majority brands in this cluster: a, d, e, h, i, and j

Cluster 3 (Blue):

Principal Component Scores: Very positive on PC1, negative on PC2, slightly positive on PC3.
Repeat Buy Behavior: High loyalty with consistent purchasing patterns.
Customer Traits: Enthusiastic, trend-conscious, brand loyalists.
Majority brands in this cluster: f and g

5.1.3 Interpretation of Principal Components

PC1 was interpreted as contrasting value/bargain perception against trendiness/recency. The cluster analysis suggests cluster 3 is strongly associated with the value/bargain side, cluster 2 with the trendy/recent side, and cluster 1 is somewhat in the middle.

PC2 emphasized leadership, seriousness, performance and trendiness. The moderate positive mean for cluster 2 aligns with this, potentially indicating these brands are seen as leaders/top performers. Clusters 1 and 3 skew slightly negative on this dimension.

PC3 loaded negatively on bargain/value, product recency and fun. Cluster 1’s very high positive mean matches the interpretation of brands seen as not bargain/value-oriented, not having recent products and less fun. Clusters 2 and 3 skew slightly to strongly negative, suggesting they are viewed as more value/bargain-focused with recent product lines.

6 Analysis of Factors Influencing Repeat Purchase

6.0.0.1 Prepare the Data

Rows: 1,000
Columns: 14
$ repeatBuy <int> 6, 2, 6, 1, 1, 2, 1, 1, 1, 1, 2, 5, 3, 1, 2, 3, 1, 2, 2, 2, …
$ PC1       <dbl> 0.70208363, -1.27234346, 0.97033026, -0.34325825, 1.59384234…
$ PC2       <dbl> -1.6139918, -1.8687620, -2.2821677, -0.7265777, -3.2033164, …
$ PC3       <dbl> -1.7832971, 2.0516587, -0.6693561, -0.6666432, -1.1081565, -…
$ cluster   <fct> 3, 2, 3, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, …
$ b         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ c         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ d         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ e         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ f         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ g         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ h         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ i         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ j         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

6.0.0.2 Setup Recipe

── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs

Number of variables by role

outcome:    1
predictor: 13

── Operations

• Centering and scaling for: all_numeric_predictors()

• Dummy variables from: all_nominal_predictors()

6.1 Model Development

6.1.1 Define the model

Linear Regression Model Specification (regression)

Computational engine: lm

6.1.2 Create the workflow

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_normalize()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm

6.1.3 Fit the model

══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_normalize()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────

Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
(Intercept)          PC1          PC2          PC3            b            c  
     3.8421       0.3258      -0.1066      -0.2947       0.5518       0.3470  
          d            e            f            g            h            i  
     0.2734       0.4133       1.1376       1.1250       0.1327       0.3812  
          j   cluster_X2   cluster_X3  
    -0.2228      -0.4974       0.5534

6.2 Interpret the results

Linear Regression Results
term	estimate	p.value
(Intercept)	3.8421361	1.812458e-91
PC1	0.3258048	1.146124e-03
PC2	-0.1065947	2.982900e-01
PC3	-0.2946554	3.428169e-07
b	0.5518088	1.107524e-08
c	0.3470022	3.133610e-04
d	0.2734432	3.349442e-04
e	0.4132519	1.889629e-08
f	1.1375543	7.603756e-30
g	1.1250411	1.193237e-28
h	0.1327365	8.657522e-02
i	0.3811577	6.239035e-07
j	-0.2228147	1.462694e-03
cluster_X2	-0.4974322	2.947149e-02
cluster_X3	0.5533705	6.807553e-02

6.2.1 Interpretation of PCA

PC1 has a positive coefficient (0.3258048) and a p-value below 0.05, indicating that it is statistically significant. A higher PC1 score, which reflects value/bargain perception as opposed to trendiness/recency, is associated with a higher likelihood of customers repurchasing the products.

PC2 has a negative coefficient (-0.1065947) and a p-value above 0.05, so it is not statistically significant. PC2, which emphasizes leadership, seriousness, performance, and trendiness, does not have a significant effect on the likelihood of customers repurchasing the products.

PC3 has a negative coefficient (-0.2946554) and a p-value below 0.05, indicating that it is statistically significant. A lower PC3 score, which reflects stronger bargain/value, recent-product and fun perceptions, is associated with a higher likelihood of customers repurchasing the products.

6.2.2 Interpretation of clusters

Cluster 1 is the baseline for the regression, so the results for clusters 2 and 3 are interpreted relative to cluster 1. The coefficients of clusters 2 and 3 are -0.4974322 and 0.5533705 respectively. This suggests that cluster 1 customers are expected to repurchase more often than cluster 2 customers but less often than cluster 3 customers.
Cluster 2’s p-value (0.029) is below the significance level of 0.05, making it statistically significant. The negative coefficient (-0.4974322) suggests that brands in this cluster struggle with customer retention and repeat purchases, exhibiting lower repeat buying behavior than cluster 1.
Cluster 3 has a positive effect on repeat buying behavior compared to cluster 1, but the result is not statistically significant, because its p-value (0.07) is above the significance level of 0.05.

6.2.3 Interpretation of categorical variable

 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

The intercept term (3.84) represents the estimated mean value of the dependent variable (repeatBuy) when all other predictors are set to zero. Because the numeric predictors were centered and scaled, zero corresponds to their sample means, and the cluster dummies are zero for the baseline cluster 1. Brand ‘a’ is the baseline against which the other brands are compared. Brand ‘h’ is not statistically significant, because its p-value (0.087) is above the significance level of 0.05.

Brands ‘b’, ‘c’, ‘d’, ‘e’, ‘f’, ‘g’ and ‘i’ are significant with positive coefficients, suggesting that customers are more likely to buy these brands again than brand ‘a’, with brands ‘f’ and ‘g’ showing the largest increase relative to brand ‘a’.

Brand ‘j’ is significant and has a negative coefficient, suggesting that customers are less likely to repeat buy this brand than brand ‘a’.
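
The baseline logic used in this section can be checked on a small example. In the sketch below, which uses simulated data with hypothetical brands a, b and c and a single unscaled factor predictor (simpler than the model above), the intercept equals the mean of the baseline brand and each remaining coefficient equals that brand's mean minus the baseline mean.

```r
set.seed(99)

# Simulated outcome for three hypothetical brands (not the brand data)
toy <- data.frame(
  brand     = factor(rep(c("a", "b", "c"), each = 30)),
  repeatBuy = c(rnorm(30, 3), rnorm(30, 4), rnorm(30, 2))
)

# The first factor level ("a") is the baseline under default treatment coding
fit <- lm(repeatBuy ~ brand, data = toy)
group_means <- tapply(toy$repeatBuy, toy$brand, mean)

coef(fit)                        # (Intercept), brandb, brandc
group_means - group_means["a"]   # differences from the baseline mean

# p-value of each term, the same quantity reported in the results table
p_values <- summary(fit)$coefficients[, "Pr(>|t|)"]
p_values
```

With additional predictors in the model, as in the regression above, the coefficients are differences from the baseline after adjusting for those predictors rather than raw differences in means.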