Timeseries Analysis
library(pacman)
# Load (and install if missing) every package used in this analysis
p_load(tidyverse, tidymodels, here, gt, forcats, dplyr, recipes, GGally, gapminder, ggplot2, corrplot, psych, factoextra, cluster, dendextend, broom, visdat)
brand <- here("Data/brand.csv") |> read.csv()
glimpse(brand)
Rows: 1,000
Columns: 10
$ performance <int> 2, 1, 2, 1, 1, 2, 1, 2, 2, 3, 2, 1, 3, 1, 3, 3, 2, 3, 2,…
$ leadership <int> 4, 1, 3, 6, 1, 8, 1, 1, 1, 1, 1, 2, 3, 5, 1, 2, 1, 8, 5,…
$ productLatest <int> 8, 4, 5, 10, 5, 9, 5, 7, 8, 9, 5, 7, 10, 7, 3, 7, 6, 9, …
$ fun <int> 8, 7, 9, 8, 8, 5, 7, 5, 10, 8, 6, 7, 10, 10, 6, 6, 7, 9,…
$ serious <int> 2, 1, 2, 3, 1, 3, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 2, 4, 2,…
$ bargain <int> 9, 1, 9, 4, 9, 8, 5, 8, 7, 3, 1, 3, 3, 1, 3, 10, 1, 7, 6…
$ bestValue <int> 7, 1, 5, 5, 9, 7, 1, 7, 7, 3, 1, 2, 3, 3, 4, 5, 3, 10, 1…
$ trendiness <int> 4, 2, 1, 2, 1, 1, 1, 7, 5, 4, 1, 1, 3, 3, 4, 1, 5, 4, 5,…
$ repeatBuy <int> 6, 2, 6, 1, 1, 2, 1, 1, 1, 1, 2, 5, 3, 1, 2, 3, 1, 2, 2,…
$ name <chr> "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "…
# Share of missing cells relative to all cells in the dataset
percent_na <- sum(is.na(brand)) / (nrow(brand) * ncol(brand))
# Total number of missing cells
full_na <- sum(is.na(brand))
# Visualize the missing values
vis_miss(brand)
This dataset has no missing values.
# Convert the brand name to a factor and recode the letters a-j as "1"-"10"
prepared_brand <- brand |>
  mutate(name = as_factor(name)) |>
  mutate(name = fct_recode(name,
    "1" = "a",
    "2" = "b",
    "3" = "c",
    "4" = "d",
    "5" = "e",
    "6" = "f",
    "7" = "g",
    "8" = "h",
    "9" = "i",
    "10" = "j"))
# Make the level order 1..10 explicit
prepared_brand$name <- factor(prepared_brand$name, levels = as.character(1:10))
glimpse(prepared_brand)
Rows: 1,000
Columns: 10
$ performance <int> 2, 1, 2, 1, 1, 2, 1, 2, 2, 3, 2, 1, 3, 1, 3, 3, 2, 3, 2,…
$ leadership <int> 4, 1, 3, 6, 1, 8, 1, 1, 1, 1, 1, 2, 3, 5, 1, 2, 1, 8, 5,…
$ productLatest <int> 8, 4, 5, 10, 5, 9, 5, 7, 8, 9, 5, 7, 10, 7, 3, 7, 6, 9, …
$ fun <int> 8, 7, 9, 8, 8, 5, 7, 5, 10, 8, 6, 7, 10, 10, 6, 6, 7, 9,…
$ serious <int> 2, 1, 2, 3, 1, 3, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 2, 4, 2,…
$ bargain <int> 9, 1, 9, 4, 9, 8, 5, 8, 7, 3, 1, 3, 3, 1, 3, 10, 1, 7, 6…
$ bestValue <int> 7, 1, 5, 5, 9, 7, 1, 7, 7, 3, 1, 2, 3, 3, 4, 5, 3, 10, 1…
$ trendiness <int> 4, 2, 1, 2, 1, 1, 1, 7, 5, 4, 1, 1, 3, 3, 4, 1, 5, 4, 5,…
$ repeatBuy <int> 6, 2, 6, 1, 1, 2, 1, 1, 1, 1, 2, 5, 3, 1, 2, 3, 1, 2, 2,…
$ name <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
Converting the categorical variable into a factor ensures that the distinct brand identities are properly encoded as discrete levels, so that the later steps (dummy encoding and regression) treat each brand as a separate category.
brand_recipe <- recipe(repeatBuy ~ ., data = prepared_brand) |>
  step_normalize(all_numeric_predictors()) |>
  step_pca(performance, leadership, productLatest, fun, serious, bargain, bestValue, trendiness, num_comp = 3, id = "pca") |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())
print(brand_recipe)
step_normalize ensures all numeric predictors contribute equally, which is crucial for accurate distance calculations in clustering, unbiased principal components in PCA, and comparable coefficients in regression.
step_pca reduces dimensionality by transforming variables into uncorrelated components, simplifying the dataset for easier visualization and interpretation in clustering, and improving regression model performance by focusing on the most important features.
step_dummy converts categorical predictors into binary variables, enabling their use in PCA, cluster analysis, and regression models that require numerical input.
step_zv eliminates predictors with zero variance to improve computational efficiency and ensure that only informative variables are used in clustering, PCA, and regression analyses.
# Descriptive statistics
describe(prepared_brand)
vars n mean sd median trimmed mad min max range skew
performance 1 1000 4.49 3.20 4.0 4.24 4.45 1 10 9 0.43
leadership 2 1000 4.42 2.61 4.0 4.27 2.97 1 10 9 0.28
productLatest 3 1000 6.20 3.08 7.0 6.37 4.45 1 10 9 -0.35
fun 4 1000 6.07 2.74 6.0 6.18 2.97 1 10 9 -0.24
serious 5 1000 4.32 2.78 4.0 4.07 2.97 1 10 9 0.58
bargain 6 1000 4.26 2.67 4.0 4.07 2.97 1 10 9 0.37
bestValue 7 1000 4.34 2.40 4.0 4.21 2.97 1 10 9 0.33
trendiness 8 1000 5.22 2.74 5.0 5.19 2.97 1 10 9 0.02
repeatBuy 9 1000 3.73 2.54 3.0 3.43 2.97 1 10 9 0.74
name* 10 1000 5.50 2.87 5.5 5.50 3.71 1 10 9 0.00
kurtosis se
performance -1.23 0.10
leadership -0.96 0.08
productLatest -1.17 0.10
fun -0.96 0.09
serious -0.73 0.09
bargain -0.98 0.08
bestValue -0.78 0.08
trendiness -1.06 0.09
repeatBuy -0.45 0.08
name* -1.23 0.09
# Drop the brand name so that only the numeric ratings enter the correlation matrix
brand_cor <- brand[, -which(names(brand) == "name")]
cor_matrix <- cor(brand_cor)
corrplot(cor_matrix, type = "full", order = "original", tl.col = "black", tl.srt = 50,
         addCoef.col = "black", number.cex = 0.7,
         col = colorRampPalette(c("red", "white", "blue"))(200))
The strongest positive correlation with repeatBuy is bestValue (0.51), and the strongest negative correlation is productLatest (-0.47). None of the variables has a correlation higher than 0.8 with repeatBuy, and the analysis treats multicollinearity as not severe for the later supervised regression, in which repeatBuy is the outcome variable.
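The two readings of the correlation matrix can be separated in code: ranking predictors by their correlation with the outcome, and screening predictor pairs for high mutual correlation. The following is a minimal sketch on simulated ratings; the data frame `toy` and its coefficients are hypothetical and are not taken from brand.csv.

```r
# Simulated 1-10 ratings stand in for brand.csv (hypothetical data).
set.seed(1)
n <- 200
toy <- data.frame(
  bestValue     = sample(1:10, n, replace = TRUE),
  productLatest = sample(1:10, n, replace = TRUE),
  fun           = sample(1:10, n, replace = TRUE)
)
# Outcome built to rise with bestValue and fall with productLatest.
toy$repeatBuy <- 0.5 * toy$bestValue - 0.4 * toy$productLatest + rnorm(n)

cor_matrix <- cor(toy)

# Correlation of each predictor with the outcome, sorted from most negative.
outcome_cor <- sort(cor_matrix[rownames(cor_matrix) != "repeatBuy", "repeatBuy"])

# Multicollinearity screen: largest absolute correlation among predictor pairs.
pred     <- setdiff(colnames(cor_matrix), "repeatBuy")
pred_cor <- cor_matrix[pred, pred]
max_pair <- max(abs(pred_cor[upper.tri(pred_cor)]))

print(outcome_cor)
print(max_pair)
```

On the real data, the same two steps would be applied to `cor_matrix` as computed above, with `max_pair` compared against whatever threshold is chosen (0.8 in this report).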
Because the task is to compare relationships among the ordinal variables, which correspond to the shopping factors, the brand names are removed before clustering so that the clusters are not biased by brand.
# Keep only the numeric rating columns for clustering
brand_numeric <- prepared_brand |> select(where(is.numeric))
# (A) Elbow method
p1 <- fviz_nbclust(brand_numeric, FUNcluster = hcut, method = "wss",
                   k.max = 10) +
  labs(title = "(A) Elbow method")
p1
# (B) Silhouette method
p2 <- fviz_nbclust(brand_numeric, FUNcluster = hcut, method = "silhouette",
                   k.max = 10) +
  labs(title = "(B) Silhouette method")
p2
In the first plot, the elbow method plots the sum of squared distances (inertia) from each point to its assigned cluster center against the number of clusters. The “elbow” point in the plot, where adding clusters stops reducing this sum sharply, indicates the optimal number of clusters.
In the second plot, the silhouette method calculates the average silhouette score for different numbers of clusters. The optimal number of clusters maximizes the average silhouette score.
The optimal number of clusters is 3 in this analysis.
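To make the two criteria concrete, here is a minimal sketch that computes them by hand for a hierarchical clustering. It uses base R `hclust()` with Ward linkage and `cluster::silhouette()` in place of `fviz_nbclust()`, and runs on simulated points (`toy`, hypothetical) with three well-separated groups, not on the brand data.

```r
library(cluster)  # silhouette(); a recommended package shipped with R

set.seed(42)
# Three well-separated simulated groups (hypothetical data, not brand.csv).
centers <- rbind(c(0, 0), c(6, 6), c(0, 6))
toy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(50, centers[i, 1]), rnorm(50, centers[i, 2]))
}))

d  <- dist(toy)
hc <- hclust(d, method = "ward.D2")

ks <- 2:6
# Elbow criterion: total within-cluster sum of squares for each k.
wss <- sapply(ks, function(k) {
  cl <- cutree(hc, k)
  sum(sapply(split(as.data.frame(toy), cl), function(g) {
    sum(scale(g, scale = FALSE)^2)  # squared distances to the cluster mean
  }))
})
# Silhouette criterion: average silhouette width for each k.
avg_sil <- sapply(ks, function(k) {
  mean(silhouette(cutree(hc, k), d)[, "sil_width"])
})

best_k <- ks[which.max(avg_sil)]
print(data.frame(k = ks, wss = wss, avg_sil = avg_sil))
print(best_k)
```

The within-cluster sum of squares can only fall as k grows, which is why the elbow is read from the bend of the curve, whereas the silhouette width has a genuine maximum.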
| cluster | n |
|---|---|
| 1 | 270 |
| 2 | 494 |
| 3 | 236 |
| cluster | performance | leadership | productLatest | fun | serious | bargain | bestValue | trendiness | repeatBuy |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 7.407407 | 7.100000 | 7.185185 | 4.507407 | 7.185185 | 3.537037 | 3.781481 | 6.466667 | 3.792593 |
| 2 | 2.497976 | 2.983806 | 7.435223 | 7.340081 | 2.882591 | 3.491903 | 3.429150 | 5.939271 | 2.334008 |
| 3 | 5.313559 | 4.347458 | 2.466102 | 5.190678 | 4.063559 | 6.690678 | 6.872881 | 2.288136 | 6.567797 |
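Tables like the two above can be produced by cutting the dendrogram at three clusters and summarising by cluster. This is a sketch under stated assumptions: simulated ratings (`toy`, hypothetical), two columns only, and base R `hclust()`/`cutree()` standing in for `hcut()`.

```r
set.seed(7)
# Hypothetical ratings; the real analysis would use brand_numeric.
toy <- data.frame(
  performance = c(rnorm(40, 8), rnorm(40, 2), rnorm(40, 5)),
  bargain     = c(rnorm(40, 3), rnorm(40, 3), rnorm(40, 8))
)

# Ward hierarchical clustering, cut into three groups.
hc <- hclust(dist(toy), method = "ward.D2")
toy$cluster <- factor(cutree(hc, k = 3))

# Number of observations per cluster.
cluster_sizes <- table(toy$cluster)
# Mean of every rating within each cluster.
cluster_means <- aggregate(cbind(performance, bargain) ~ cluster,
                           data = toy, FUN = mean)

print(cluster_sizes)
print(cluster_means)
```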
This chart displays the frequency distribution (between 0 and 100) of clusters across the different brands (labeled ‘a’ to ‘j’).
Brand a: Dominated by cluster 2, with a significant frequency. Cluster 1 and 3 have very low frequencies.
Brand b: Mainly belongs to cluster 1 with some presence of cluster 2, and a negligible presence of cluster 3.
Brand c: Similar to Brand b, mainly in cluster 1, with minimal presence in clusters 2 and 3.
Brand d: Primarily in cluster 2, similar to Brand a.
Brand e: Cluster 2 is the most frequent, followed by cluster 3 and a small frequency in cluster 1.
Brand f: Mostly in cluster 3 with some presence in cluster 2 and very low frequency in cluster 1.
Brand g: Predominantly in cluster 3 with no presence in cluster 1 and very low frequency in cluster 2.
Brand h: Primarily in cluster 2 with a minor presence in cluster 1 and no presence in cluster 3.
Brand i: Mostly in cluster 2, followed by cluster 1, and very low in cluster 3.
Brand j: Dominated by cluster 2 with some presence of cluster 3 and very low presence of cluster 1.
In terms of clustering, cluster 1 mainly contains brands b and c, cluster 2 mainly contains brands a, d, e, h, i and j, and cluster 3 mainly contains brands f and g.
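The brand-by-cluster reading above amounts to a cross-tabulation. A minimal sketch, using a small hand-made data frame (`toy`, hypothetical counts for three brands only) to show how the dominant cluster per brand could be read off:

```r
# Hypothetical brand labels and cluster assignments (not the report's data).
toy <- data.frame(
  name    = rep(c("a", "b", "f"), each = 10),
  cluster = c(rep(2, 9), 1,          # brand a: mostly cluster 2
              rep(1, 8), rep(2, 2),  # brand b: mostly cluster 1
              rep(3, 7), rep(2, 3))  # brand f: mostly cluster 3
)

# Counts of respondents by brand (rows) and cluster (columns).
counts <- table(toy$name, toy$cluster)
# Row percentages: share of each brand's respondents in each cluster.
row_pct <- round(100 * prop.table(counts, margin = 1))
# Cluster with the highest count for each brand.
dominant <- colnames(counts)[apply(counts, 1, which.max)]
names(dominant) <- rownames(counts)

print(row_pct)
print(dominant)
```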
These charts show box plots of repeat buy by cluster, faceted by performance, leadership, product latest, fun, serious, bargain, best value and trendiness.
In the performance facet, cluster 3 has the highest median, followed by cluster 1, and cluster 2 has the lowest. Cluster 3 has fewer low outliers than clusters 1 and 2, suggesting that its repeat purchase rates differ from those of clusters 1 and 2.
In the leadership facet, as in the performance facet, cluster 3 has the highest median, followed by cluster 1 and then cluster 2. The outliers for clusters 1 and 3 are generally higher. At repeat buy values of 9 and 10, cluster 1 has more observations than clusters 2 and 3.
In the product latest facet, cluster 3 has the highest median, followed by cluster 1, and cluster 2 has the lowest. Clusters 1 and 2 have more outliers. Cluster 3 has no observations at repeat buy values of 9 and 10, while clusters 1 and 2 have many outliers at a repeat buy value of 10, which means the variance is very high at this point.
In the fun facet, cluster 3 has the highest median, followed by cluster 1, and cluster 2 has the lowest. Cluster 2 has noticeably more outliers overall.
In the serious facet, cluster 3 has the highest median and clusters 1 and 2 both have relatively low medians. Cluster 1 has noticeably more outliers and a wider interquartile range than clusters 2 and 3, implying that cluster 1 is more variable in this dimension.
In the bargain facet, cluster 3 has the highest median, followed by cluster 1, and cluster 2 has the lowest. Outliers are spread fairly evenly across the three clusters.
In the best value facet, cluster 3 has the highest median, followed by cluster 1, and cluster 2 has the lowest. Clusters 1 and 3 have more outliers. Cluster 3 has very few observations at a repeat buy value of 1, while clusters 1 and 2 have very few at repeat buy values of 9 and 10.
In the trendiness facet, cluster 3 has the highest median, followed by cluster 1, and cluster 2 has the lowest. Clusters 1 and 2 have more outliers. At repeat buy values above 7, cluster 3 has noticeably fewer observations than clusters 1 and 2.
Cluster 1:
Repeat Buying Behavior: Typically has the second-highest median, indicating moderate repeat buying behavior. The moderate number of outliers suggests some variability in purchasing patterns.
Customer Traits:
Open-minded: Varied perceptions across different facets.
Value-conscious: Interest in bargains and good value.
Skeptical: Lower perceptions of leadership and seriousness, indicating a tendency to question authority.
Cluster 2:
Repeat Buying Behavior: Consistently has the lowest median, indicating the least frequent repeat buying behavior. More outliers suggest high variability in repeat buying patterns.
Customer Traits:
Conventional: More conservative perceptions.
Practical: Prioritize performance and value.
Reserved: Moderate perceptions as indicated by lower outliers.
Cluster 3:
Repeat Buying Behavior: Consistently shows the highest median repeat buy across all facets, indicating strong repeat buying behavior. Fewer lower outliers suggest more consistent purchasing patterns.
Customer Traits:
Enthusiastic: High perceptions across most facets.
Trend-conscious: Preference for modern and enjoyable experiences.
Brand loyalists: Lower variability in perceptions, suggesting strong brand loyalty.
The brands with the highest mean repeat buy value are ‘f’ and ‘g’, both with a mean of 7.2. This indicates that customers are most likely to make repeat purchases of these brands in the future, suggesting strong customer loyalty or satisfaction with these brands. The brand with the lowest mean repeat buy value is ‘j’, with a mean of 1.3. This suggests that customers are least likely to repurchase this brand, potentially indicating low customer satisfaction or other factors that deter repeat purchases. The box plot also reveals outliers, represented by dots above the maximum values, for brands ‘a’, ‘e’ and ‘g’. These outliers indicate that some customers have an exceptionally high likelihood of repurchasing these brands compared to the overall trend.
PCA components

| terms | value | component | id |
|---|---|---|---|
| variance | 3 | 1 | pca |
| variance | 2 | 2 | pca |
| variance | 1 | 3 | pca |
| variance | 1 | 4 | pca |
| variance | 1 | 5 | pca |
| variance | 0 | 6 | pca |
| variance | 0 | 7 | pca |
| variance | 0 | 8 | pca |
| cumulative variance | 3 | 1 | pca |
| cumulative variance | 5 | 2 | pca |
| cumulative variance | 6 | 3 | pca |
| cumulative variance | 6 | 4 | pca |
| cumulative variance | 7 | 5 | pca |
| cumulative variance | 7 | 6 | pca |
| cumulative variance | 8 | 7 | pca |
| cumulative variance | 8 | 8 | pca |
| percent variance | 31 | 1 | pca |
| percent variance | 26 | 2 | pca |
| percent variance | 13 | 3 | pca |
| percent variance | 9 | 4 | pca |
| percent variance | 8 | 5 | pca |
| percent variance | 5 | 6 | pca |
| percent variance | 4 | 7 | pca |
| percent variance | 3 | 8 | pca |
| cumulative percent variance | 31 | 1 | pca |
| cumulative percent variance | 58 | 2 | pca |
| cumulative percent variance | 71 | 3 | pca |
| cumulative percent variance | 80 | 4 | pca |
| cumulative percent variance | 88 | 5 | pca |
| cumulative percent variance | 93 | 6 | pca |
| cumulative percent variance | 97 | 7 | pca |
| cumulative percent variance | 100 | 8 | pca |
Three PCA components capture 71% of the variation in the eight variables, per the cumulative percent variance above, so three components are deemed sufficient to explain the variation in the variables.
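The variance rows in the table follow directly from the component standard deviations. As a minimal sketch with base R `prcomp()` on simulated data (`toy`, hypothetical, four variables instead of eight), and with the 70% cut-off chosen only for illustration:

```r
set.seed(3)
# Hypothetical correlated variables standing in for the eight brand ratings.
latent <- rnorm(300)
toy <- data.frame(
  x1 = latent + rnorm(300, sd = 0.5),
  x2 = latent + rnorm(300, sd = 0.5),
  x3 = rnorm(300),
  x4 = rnorm(300)
)

# PCA on centered and scaled variables, as step_normalize + step_pca would do.
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

variance     <- pca$sdev^2                      # eigenvalue of each component
pct_variance <- 100 * variance / sum(variance)  # percent variance
cum_pct      <- cumsum(pct_variance)            # cumulative percent variance

# Smallest number of components reaching the chosen threshold (70% is an assumption).
n_keep <- which(cum_pct >= 70)[1]

print(round(cbind(variance, pct_variance, cum_pct), 2))
print(n_keep)
```

With scaled inputs the variances sum to the number of variables, which is why the cumulative variance in the table above ends at 8.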
# A tibble: 64 × 4
terms value component id
<chr> <dbl> <chr> <chr>
1 performance 0.242 PC1 pca
2 leadership 0.219 PC1 pca
3 productLatest -0.420 PC1 pca
4 fun -0.284 PC1 pca
5 serious 0.164 PC1 pca
6 bargain 0.436 PC1 pca
7 bestValue 0.493 PC1 pca
8 trendiness -0.418 PC1 pca
9 performance 0.428 PC2 pca
10 leadership 0.532 PC2 pca
# ℹ 54 more rows
In PC1, “productLatest” has the largest negative loading (-0.420) and “bestValue” the largest positive loading (0.493), so the two pull this component in opposite directions. In PC2, “leadership” has the largest positive loading (0.532) and drives the component most strongly, whereas “fun” has the largest negative loading (-0.261). In PC3, “bargain” dominates with a large negative loading (-0.529), while “performance” has the largest positive loading (0.032), which is negligible by comparison.
PC1 (Principal Component 1): Contrasts value/bargain perception with trendiness/recency.
BestValue (0.493): This variable has the highest positive loading, suggesting that brands perceived as offering good value are strongly associated with PC1.
Bargain (0.436): This also has a strong positive loading, indicating that brands seen as bargains are similarly associated with PC1.
Trendiness (-0.418): This variable has a strong negative loading, suggesting an inverse relationship with PC1. Brands perceived as trendy are less associated with PC1.
ProductLatest (-0.420): This also has a strong negative loading, indicating that brands with the most recent products are inversely related to PC1.
A higher PC1 score reflects a stronger perception of the brand as good value and a bargain, which represents a higher chance of customers repurchasing it. It also reflects a weaker perception of the brand as trendy and up to date, which represents a lower chance of repurchase. In short, PC1 represents brands seen as good value and bargains but not as trendy or latest.
PC2 (Principal Component 2): Emphasizes leadership, seriousness, performance, and trendiness.
Leadership (0.532): This variable has the highest positive loading on PC2, indicating a strong association.
Serious (0.517): This also has a strong positive loading, showing a significant relationship with PC2.
Performance (0.428): This variable has a moderate positive loading on PC2.
Trendiness (0.303): This has the lowest loading among the top contributors but is still positively associated with PC2.
A higher PC2 score represents a higher chance of customers repurchasing the brand because they perceive it as a leader in its field, serious, a strong performer, and trendy. In short, PC2 represents brands perceived as leaders in their field, serious, strong performers, and trendy.
PC3 (Principal Component 3): Loads negatively on bargain/value, recent products, and fun.
Bargain (-0.530): This variable has the highest negative loading on PC3, indicating a strong inverse relationship.
ProductLatest (-0.514): This also has a strong negative loading.
BestValue (-0.414): This variable has a moderate negative loading.
Fun (-0.392): This has the lowest loading among the top contributors but is still negatively associated with PC3.
A higher PC3 score represents a lower chance of customers repurchasing the brand, because it reflects a weaker perception of the brand as a bargain, a recent product, good value, and fun. In short, brands scoring high on PC3 are not perceived as bargains, latest products, good value, or fun.
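Loadings such as those listed for PC1 to PC3 can be inspected and ranked directly. The sketch below uses base R `prcomp()` on three simulated variables (`toy`, hypothetical) built so that two move together and one moves against them. Note that the sign of a whole component is arbitrary: only the relative signs of the loadings within a component carry meaning.

```r
set.seed(11)
# Hypothetical ratings: bestValue and bargain move together, trendiness against them.
latent <- rnorm(300)
toy <- data.frame(
  bestValue  = latent + rnorm(300, sd = 0.4),
  bargain    = latent + rnorm(300, sd = 0.4),
  trendiness = -latent + rnorm(300, sd = 0.4)
)

pca <- prcomp(toy, scale. = TRUE)

# Loadings: one column per component, one row per original variable.
loadings <- pca$rotation

# Variables ordered by the absolute size of their PC1 loading.
pc1_ranked <- loadings[order(abs(loadings[, "PC1"]), decreasing = TRUE), "PC1"]

print(round(loadings, 3))
print(round(pc1_ranked, 3))
```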
Rows: 1,000
Columns: 14
$ repeatBuy <int> 6, 2, 6, 1, 1, 2, 1, 1, 1, 1, 2, 5, 3, 1, 2, 3, 1, 2, 2, 2, …
$ PC1 <dbl> 0.70208363, -1.27234346, 0.97033026, -0.34325825, 1.59384234…
$ PC2 <dbl> -1.6139918, -1.8687620, -2.2821677, -0.7265777, -3.2033164, …
$ PC3 <dbl> -1.7832971, 2.0516587, -0.6693561, -0.6666432, -1.1081565, -…
$ name_X2 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ name_X3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ name_X4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ name_X5 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ name_X6 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ name_X7 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ name_X8 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ name_X9 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ name_X10 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ cluster <fct> 3, 2, 3, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, …
PC1 separates the three clusters clearly. Cluster 3 has a high positive mean of around 2, indicating these brands are strongly perceived as offering good value and bargains, but not as trendy or having the latest products. Cluster 2 has a strongly negative mean of around -1.5, suggesting these brands are viewed as trendy and having recent product lines, but lacking in value/bargain perception. Cluster 1 has a moderately negative mean of around -0.5, placing it between the value/bargain and trendy/recent ends of this dimension.
For PC2, cluster 2 has a moderate positive mean of around 0.5, while clusters 1 and 3 have slightly negative means of around -0.2, so PC2 separates cluster 2 from the other two to some degree. Cluster 2’s mean aligns with brands seen as leaders, top performers, and serious players that are also relatively trendy. The slightly negative means of clusters 1 and 3 indicate a lack of strong leadership/performance/seriousness perception for these brand groups.
PC3 shows cluster 1 with a high positive mean of around 4, cluster 2 with a slightly negative mean of around -0.1, and cluster 3 with a strongly negative mean of around -2. Cluster 1’s mean near 4 suggests these brands are viewed as less fun, less bargain/value-focused, and not having the most recent product lines. Cluster 3’s mean of around -2 represents brands perceived as high bargain/value brands with a fun, recent product image. Cluster 2’s mean of around -0.1 places it only slightly toward the bargain/value side of this dimension.
Overall, cluster 3 aligns with value/bargain positioning but lacks trendiness and new products, and is not seen as a leader or top performer. Cluster 2 represents trendy brands with new products, but lacks value/bargain and leadership/performance attributes. Cluster 1 is moderately fun and bargain/value focused and lacks a trendy/recent product reputation, but is viewed as a group of serious, leading, top-performing brands.
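Cluster means on each component, like those quoted above, are a grouped average of the PC scores. A minimal sketch on simulated scores (`toy`, hypothetical; the group centers merely echo the approximate means mentioned in the text and are not computed from the brand data):

```r
set.seed(5)
# Hypothetical PC scores for three clusters of 100 observations each.
toy <- data.frame(
  PC1     = c(rnorm(100, 2),  rnorm(100, -1.5), rnorm(100, -0.5)),
  PC3     = c(rnorm(100, -2), rnorm(100, -0.1), rnorm(100, 4)),
  cluster = factor(rep(c(3, 2, 1), each = 100))
)

# Mean score on each component within each cluster.
pc_means <- aggregate(cbind(PC1, PC3) ~ cluster, data = toy, FUN = mean)

# Cluster with the highest and lowest mean on PC1.
top_pc1    <- as.character(pc_means$cluster[which.max(pc_means$PC1)])
bottom_pc1 <- as.character(pc_means$cluster[which.min(pc_means$PC1)])

print(pc_means)
```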
Cluster 1 (Purple):
Principal Component Scores: Neutral to slightly positive on PC1, very positive on PC2, neutral on PC3.
Repeat Buy Behavior: Moderate loyalty with variability in purchasing patterns.
Customer Traits: Open-minded, value-conscious, skeptical of leadership.
The cluster’s majority customer brand: b and c
Cluster 2 (Orange):
Principal Component Scores: Negative on PC1, somewhat negative on PC2, slightly negative on PC3.
Repeat Buy Behavior: Low loyalty with high variability in purchasing patterns.
Customer Traits: Conventional, practical, reserved.
The cluster’s majority customer brand: a, d, e, h, i, and j
Cluster 3 (Blue):
Principal Component Scores: Very positive on PC1, negative on PC2, slightly positive on PC3.
Repeat Buy Behavior: High loyalty with consistent purchasing patterns.
Customer Traits: Enthusiastic, trend-conscious, brand loyalists.
The cluster’s majority customer brand: f and g
PC1 was interpreted as contrasting value/bargain perception against trendiness/recency. The cluster analysis suggests cluster 3 is strongly associated with the value/bargain side, cluster 2 with the trendy/recent side, and cluster 1 is somewhat in the middle.
PC2 emphasized leadership, seriousness, performance and trendiness. The moderate positive mean for cluster 2 aligns with this, potentially indicating these brands are seen as leaders/top performers. Clusters 1 and 3 skew slightly negative on this dimension.
PC3 loaded negatively on bargain/value, product recency, and fun. Cluster 1’s very high positive mean matches the interpretation of being seen as not bargain/value-oriented, not having recent products, and not being “fun” brands. Cluster 3 skews strongly negative and cluster 2 only slightly negative, suggesting they are viewed as more value/bargain-focused with recent product lines.
Rows: 1,000
Columns: 14
$ repeatBuy <int> 6, 2, 6, 1, 1, 2, 1, 1, 1, 1, 2, 5, 3, 1, 2, 3, 1, 2, 2, 2, …
$ PC1 <dbl> 0.70208363, -1.27234346, 0.97033026, -0.34325825, 1.59384234…
$ PC2 <dbl> -1.6139918, -1.8687620, -2.2821677, -0.7265777, -3.2033164, …
$ PC3 <dbl> -1.7832971, 2.0516587, -0.6693561, -0.6666432, -1.1081565, -…
$ cluster <fct> 3, 2, 3, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, …
$ b <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ c <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ d <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ e <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ f <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ g <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ h <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ i <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ j <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 13
── Operations
• Centering and scaling for: all_numeric_predictors()
• Dummy variables from: all_nominal_predictors()
Linear Regression Model Specification (regression)
Computational engine: lm
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps
• step_normalize()
• step_dummy()
── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Computational engine: lm
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps
• step_normalize()
• step_dummy()
── Model ───────────────────────────────────────────────────────────────────────
Call:
stats::lm(formula = ..y ~ ., data = data)
Coefficients:
(Intercept) PC1 PC2 PC3 b c
3.8421 0.3258 -0.1066 -0.2947 0.5518 0.3470
d e f g h i
0.2734 0.4133 1.1376 1.1250 0.1327 0.3812
j cluster_X2 cluster_X3
-0.2228 -0.4974 0.5534
Linear Regression Results

| term | estimate | p.value |
|---|---|---|
| (Intercept) | 3.8421361 | 1.812458e-91 |
| PC1 | 0.3258048 | 1.146124e-03 |
| PC2 | -0.1065947 | 2.982900e-01 |
| PC3 | -0.2946554 | 3.428169e-07 |
| b | 0.5518088 | 1.107524e-08 |
| c | 0.3470022 | 3.133610e-04 |
| d | 0.2734432 | 3.349442e-04 |
| e | 0.4132519 | 1.889629e-08 |
| f | 1.1375543 | 7.603756e-30 |
| g | 1.1250411 | 1.193237e-28 |
| h | 0.1327365 | 8.657522e-02 |
| i | 0.3811577 | 6.239035e-07 |
| j | -0.2228147 | 1.462694e-03 |
| cluster_X2 | -0.4974322 | 2.947149e-02 |
| cluster_X3 | 0.5533705 | 6.807553e-02 |
PC1 has a positive coefficient (0.3258048) and a p-value below 0.05, indicating that it is statistically significant. A higher PC1 score, which reflects value/bargain perception as opposed to trendiness/recency, is associated with a higher likelihood of customers repurchasing the products.
PC2 has a negative coefficient (-0.1065947) and a p-value above 0.05, so it is not statistically significant. PC2, which emphasizes leadership, seriousness, performance, and trendiness, does not have a significant effect on the likelihood of customers repurchasing the products.
PC3 has a negative coefficient (-0.2946554) and a p-value below 0.05, indicating that it is statistically significant. A lower PC3 score, which reflects stronger bargain/value, recent-product, and fun perceptions, is associated with a higher likelihood of customers repurchasing the products.
Cluster 1 is the baseline for the regression, so the results for clusters 2 and 3 are interpreted relative to cluster 1. The coefficients of clusters 2 and 3 are -0.4974322 and 0.5533705 respectively. This suggests that cluster 1 customers are expected to repurchase more often than cluster 2 customers but less often than cluster 3 customers.
The p-value for cluster 2 is below the significance level of 0.05, making it statistically significant. The negative coefficient (-0.4974322) suggests brands in this cluster struggle with customer retention and repeat purchases, exhibiting lower repeat buying behavior than cluster 1.
Cluster 3 has a positive effect on repeat buying behavior relative to cluster 1, but the result is not statistically significant because its p-value (0.07) exceeds the 0.05 significance level.
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
The intercept term (3.84) represents the estimated mean value of the dependent variable (repeatBuy) when all other predictors are set to zero. Brand ‘a’ is the baseline level for this regression, so the other brand coefficients are read in comparison with it. Brand ‘h’ is not significant, as its p-value is above the 0.05 significance level.
Brands ‘b’, ‘c’, ‘d’, ‘e’, ‘f’, ‘g’ and ‘i’ are significant, and their positive coefficients suggest a higher chance of customers buying the product again than for brand ‘a’, with brands ‘f’ and ‘g’ showing the largest increase.
Brand ‘j’ is significant and has a negative coefficient, suggesting a lower chance of customers repeat buying than for brand ‘a’.
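The baseline reading of dummy-coded coefficients can be checked on a small example. This sketch fits `lm()` on simulated data (`toy`, hypothetical, three brands and no other predictors), where the first factor level serves as the baseline. In a one-factor model the intercept equals the baseline group mean and each brand coefficient equals that brand's mean minus the baseline mean. The report's model also includes PCs and clusters, so its coefficients are adjusted differences and will not match raw group means.

```r
set.seed(9)
# Hypothetical repeatBuy scores for three brands; 'a' is the first level.
toy <- data.frame(
  name      = factor(rep(c("a", "b", "j"), each = 50)),
  repeatBuy = c(rnorm(50, 3), rnorm(50, 4), rnorm(50, 2))
)

# R's default treatment contrasts use the first factor level as the baseline.
baseline <- levels(toy$name)[1]

fit   <- lm(repeatBuy ~ name, data = toy)
coefs <- coef(fit)

# Group means, for comparison with the coefficients.
group_means <- tapply(toy$repeatBuy, toy$name, mean)

print(coefs)
print(group_means)
```

A different reference brand could be chosen with `relevel(toy$name, ref = "j")` before fitting, which changes the coefficients but not the fitted values.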