1 Introduction

Customer Personality Analysis helps a business to better understand its customers and makes it easier for them to modify products based on its target customers from different types of customer segments. Instead of spending money to market a new product to every customer, this analysis can help direct the marketing team to market the product to specific related segments. Our goal is to perform customer segmentation using unsupervised learning analysis, which includes Clustering with K-Means and also if possible, dimensional reduction using Principle Component Analysis (PCA). The dataset used in this analysis is from Kaggle: Customer Personality Analysis.

2 Data Wrangling

2.1 Import Data

customers <- read.csv("marketing_campaign.csv", sep="\t")
head(customers)
glimpse(customers)
## Rows: 2,240
## Columns: 29
## $ ID                  <int> 5524, 2174, 4141, 6182, 5324, 7446, 965, 6177, 485…
## $ Year_Birth          <int> 1957, 1954, 1965, 1984, 1981, 1967, 1971, 1985, 19…
## $ Education           <chr> "Graduation", "Graduation", "Graduation", "Graduat…
## $ Marital_Status      <chr> "Single", "Single", "Together", "Together", "Marri…
## $ Income              <int> 58138, 46344, 71613, 26646, 58293, 62513, 55635, 3…
## $ Kidhome             <int> 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1,…
## $ Teenhome            <int> 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1,…
## $ Dt_Customer         <chr> "04-09-2012", "08-03-2014", "21-08-2013", "10-02-2…
## $ Recency             <int> 58, 38, 26, 26, 94, 16, 34, 32, 19, 68, 11, 59, 82…
## $ MntWines            <int> 635, 11, 426, 11, 173, 520, 235, 76, 14, 28, 5, 6,…
## $ MntFruits           <int> 88, 1, 49, 4, 43, 42, 65, 10, 0, 0, 5, 16, 61, 2, …
## $ MntMeatProducts     <int> 546, 6, 127, 20, 118, 98, 164, 56, 24, 6, 6, 11, 4…
## $ MntFishProducts     <int> 172, 2, 111, 10, 46, 0, 50, 3, 3, 1, 0, 11, 225, 3…
## $ MntSweetProducts    <int> 88, 1, 21, 3, 27, 42, 49, 1, 3, 1, 2, 1, 112, 5, 1…
## $ MntGoldProds        <int> 88, 6, 42, 5, 15, 14, 27, 23, 2, 13, 1, 16, 30, 14…
## $ NumDealsPurchases   <int> 3, 2, 1, 2, 5, 2, 4, 2, 1, 1, 1, 1, 1, 3, 1, 1, 3,…
## $ NumWebPurchases     <int> 8, 1, 8, 2, 5, 6, 7, 4, 3, 1, 1, 2, 3, 6, 1, 7, 3,…
## $ NumCatalogPurchases <int> 10, 1, 2, 0, 3, 4, 3, 0, 0, 0, 0, 0, 4, 1, 0, 6, 0…
## $ NumStorePurchases   <int> 4, 2, 10, 4, 6, 10, 7, 4, 2, 0, 2, 3, 8, 5, 3, 12,…
## $ NumWebVisitsMonth   <int> 7, 5, 4, 6, 5, 6, 6, 8, 9, 20, 7, 8, 2, 6, 8, 3, 8…
## $ AcceptedCmp3        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ AcceptedCmp4        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ AcceptedCmp5        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ AcceptedCmp1        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ AcceptedCmp2        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Complain            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Z_CostContact       <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ Z_Revenue           <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11…
## $ Response            <int> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,…

2.2 Variable Explanation

People

  • ID: Customer’s unique identifier
  • Year_Birth: Customer’s birth year
  • Education: Customer’s education level
  • Marital_Status: Customer’s marital status
  • Income: Customer’s yearly household income
  • Kidhome: Number of children in customer’s household
  • Teenhome: Number of teenagers in customer’s household
  • Dt_Customer: Date of customer’s enrollment with the company
  • Recency: Number of days since customer’s last purchase
  • Complain: 1 if the customer complained in the last 2 years, 0 otherwise

Products

  • MntWines: Amount spent on wine in last 2 years
  • MntFruits: Amount spent on fruits in last 2 years
  • MntMeatProducts: Amount spent on meat in last 2 years
  • MntFishProducts: Amount spent on fish in last 2 years
  • MntSweetProducts: Amount spent on sweets in last 2 years
  • MntGoldProds: Amount spent on gold in last 2 years

Promotion

  • NumDealsPurchases: Number of purchases made with a discount
  • AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
  • AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
  • AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
  • AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
  • AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
  • Response: 1 if customer accepted the offer in the last campaign, 0 otherwise

Place

  • NumWebPurchases: Number of purchases made through the company’s website
  • NumCatalogPurchases: Number of purchases made using a catalogue
  • NumStorePurchases: Number of purchases made directly in stores
  • NumWebVisitsMonth: Number of visits to company’s website in the last month

2.3 Adjusting Data Type

#factor data type
customers <- customers %>%   mutate_at(vars(Complain, AcceptedCmp1, AcceptedCmp2, AcceptedCmp3, AcceptedCmp4, AcceptedCmp5, Response), as.factor)

#date data type
customers$Dt_Customer <- dmy(customers$Dt_Customer)
#remove Id, Z_CostContact, Z_Revenue column
customers <- customers %>% select(-ID, -Z_Revenue, -Z_CostContact)

2.3.1 Feature Engineering

unique(customers$Marital_Status)
## [1] "Single"   "Together" "Married"  "Divorced" "Widow"    "Alone"    "Absurd"  
## [8] "YOLO"
unique(customers $Education)
## [1] "Graduation" "PhD"        "Master"     "Basic"      "2n Cycle"

Since the labels are not efficient from the dataset, we will simplify Marital_status categories into : Single and Couple, and for Education we will simplify the categories into: Undergraduate and Postgraduate.

customers <- customers %>%
# Marital Status
  mutate(Marital_Status = if_else(Marital_Status %in% c("Single", "Divorced", "Widow", "YOLO", "Alone", "Absurd"), "Single", Marital_Status)) %>%
  mutate(Marital_Status = if_else(Marital_Status %in% c("Together", "Married"),"Couple", Marital_Status)) %>% 
  
# Education
  mutate(Education = if_else(Education %in% c("Graduation", "Master", "PhD"), "Postgraduate", Education)) %>% 
  mutate(Education = if_else(Education %in% c("2n Cycle", "Basic"), "Undergraduate", Education)) %>% 
  
 mutate(Marital_Status = as.factor(Marital_Status),
        Education = as.factor(Education))

For columns Kidhome and Teenhome will be combined into Children:

customers <- customers %>%
  mutate(Children = Kidhome + Teenhome) %>%
  select(-Kidhome, -Teenhome) 

customers <- customers %>% mutate(Children = as.factor(Children))

We will create customer age variable based on the columns Year_Birth and Dt_Customer:

range(customers$Dt_Customer)
## [1] "2012-07-30" "2014-06-29"

The last recorded membership enrollment is the year 2014, so we will assume that the data gathered was from this year.

customers <- customers %>% mutate(Age = 2014 - Year_Birth) %>% select(-Year_Birth)
#rename variables
customers <- customers %>%
              mutate(Wine = MntWines, 
                     Fruit = MntFruits,
                     Meat = MntMeatProducts,
                     Fish = MntFishProducts,
                     Sweet = MntSweetProducts,
                     Gold = MntGoldProds) %>% 
              select(-c(MntWines, MntFruits, MntMeatProducts, MntFishProducts, 
                        MntSweetProducts,MntGoldProds)) %>% 
              mutate(Deals = NumDealsPurchases,
                     Web = NumWebPurchases,
                     Catalog = NumCatalogPurchases,
                     Store = NumStorePurchases,
                     WebVisitsMonth = NumWebVisitsMonth) %>% 
              select(-c(NumDealsPurchases, NumWebPurchases, 
                        NumCatalogPurchases, NumStorePurchases, 
                        NumStorePurchases,NumWebVisitsMonth))

2.4 Check for Missing Values

colSums(is.na(customers))
##      Education Marital_Status         Income    Dt_Customer        Recency 
##              0              0             24              0              0 
##   AcceptedCmp3   AcceptedCmp4   AcceptedCmp5   AcceptedCmp1   AcceptedCmp2 
##              0              0              0              0              0 
##       Complain       Response       Children            Age           Wine 
##              0              0              0              0              0 
##          Fruit           Meat           Fish          Sweet           Gold 
##              0              0              0              0              0 
##          Deals            Web        Catalog          Store WebVisitsMonth 
##              0              0              0              0              0
customers <- na.omit(customers)
customers<- na.omit(customers)
anyNA(customers)
## [1] FALSE

3 Exploratory Data Analysis

3.1 Demographic Analysis

After cleansing and feature engineering, our demographic variables are: Education, Marital_Status, Income, Age, and Children_Home.

Customers Age

range(customers$Age)
## [1]  18 121
customers$Age_group <- cut(customers$Age, breaks=c(18, 30, 40, 50, 60, Inf), labels=c("18-30", "31-40", "41-50", "51-60", "60+"))
prop.table(table(customers$Age_group))*100
## 
##    18-30    31-40    41-50    51-60      60+ 
## 11.51762 26.37760 28.95212 21.09304 12.05962
barplot(table(customers$Age_group), main="Age Distribution", col = "lightblue")

Majority of the customers are adults (approx. 75%), with young adults and elderly being the minority.

Customers Income

hist(customers$Income, 
     main = "Income Distribution", 
     xlab = "Income ($)",
     col = "lightblue",
     breaks = 50) 
abline(v = mean(customers$Income), col = "red", lty = 2, lwd = 1)

boxplot(customers$Income,col = "lightblue", horizontal = T)

There seems to be a case in which one of the customers fall out of the majority income range (an outlier). The mean income is at $52,247.25.

Customers Education Status

barplot(table(customers$Education), main="Education", col = "lightblue")

Majority of customers education status is Postgraduate, which confirms the age distribution.

Customers Marital Status and Number of Children at Home

barplot(table(customers$Children, customers$Marital_Status),main="Marital Status and Number of Children", beside=TRUE, col=c("lightblue","skyblue3", "steelblue4","midnightblue"), legend=TRUE)

There are more customers who are a couple (64.5%) than those who are single. Of which, 50% of these customers at least has one child.

3.2 Possibility for PCA

customers_num <- customers %>% select_if(is.numeric)
quali <- customers %>% select_if(is.factor) %>% select(-c(Age_group,))
qualivar <- quali %>% colnames()
summary(customers_num)
##      Income          Recency           Age              Wine       
##  Min.   :  1730   Min.   : 0.00   Min.   : 18.00   Min.   :   0.0  
##  1st Qu.: 35303   1st Qu.:24.00   1st Qu.: 37.00   1st Qu.:  24.0  
##  Median : 51382   Median :49.00   Median : 44.00   Median : 174.5  
##  Mean   : 52247   Mean   :49.01   Mean   : 45.18   Mean   : 305.1  
##  3rd Qu.: 68522   3rd Qu.:74.00   3rd Qu.: 55.00   3rd Qu.: 505.0  
##  Max.   :666666   Max.   :99.00   Max.   :121.00   Max.   :1493.0  
##      Fruit             Meat             Fish            Sweet       
##  Min.   :  0.00   Min.   :   0.0   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:  2.00   1st Qu.:  16.0   1st Qu.:  3.00   1st Qu.:  1.00  
##  Median :  8.00   Median :  68.0   Median : 12.00   Median :  8.00  
##  Mean   : 26.36   Mean   : 167.0   Mean   : 37.64   Mean   : 27.03  
##  3rd Qu.: 33.00   3rd Qu.: 232.2   3rd Qu.: 50.00   3rd Qu.: 33.00  
##  Max.   :199.00   Max.   :1725.0   Max.   :259.00   Max.   :262.00  
##       Gold            Deals             Web            Catalog      
##  Min.   :  0.00   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:  9.00   1st Qu.: 1.000   1st Qu.: 2.000   1st Qu.: 0.000  
##  Median : 24.50   Median : 2.000   Median : 4.000   Median : 2.000  
##  Mean   : 43.97   Mean   : 2.324   Mean   : 4.085   Mean   : 2.671  
##  3rd Qu.: 56.00   3rd Qu.: 3.000   3rd Qu.: 6.000   3rd Qu.: 4.000  
##  Max.   :321.00   Max.   :15.000   Max.   :27.000   Max.   :28.000  
##      Store        WebVisitsMonth  
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 3.000   1st Qu.: 3.000  
##  Median : 5.000   Median : 6.000  
##  Mean   : 5.801   Mean   : 5.319  
##  3rd Qu.: 8.000   3rd Qu.: 7.000  
##  Max.   :13.000   Max.   :20.000
var(customers_num)
##                       Income        Recency          Age           Wine
## Income         633683788.576 -2892.83658596 48814.621201 4913651.567032
## Recency            -2892.837   838.00706375     5.653712     153.517126
## Age                48814.621     5.65371157   143.653507     644.670673
## Wine             4913651.567   153.51712642   644.670673  113790.125690
## Fruit             431589.325    -6.73181459     8.464241    5195.253148
## Meat             3300781.480   146.19869694    90.582220   43038.246873
## Fish              604886.901     0.87320371    26.528317    7345.671585
## Sweet             455689.305    29.85471392     9.946052    5407.865525
## Gold              425110.225    26.49344150    39.875318    6864.460972
## Deals              -4024.231     0.11780606     1.352697       5.766284
## Web                26762.860    -0.44757927     5.028016     512.030107
## Catalog            43406.478     2.04027593     4.271300     626.671684
## Store              43318.897    -0.04082519     4.982937     701.824559
## WebVisitsMonth    -33768.090    -1.30335585    -3.601796    -263.423371
##                        Fruit          Meat           Fish         Sweet
## Income         431589.325364 3300781.48003 604886.9010551 455689.305153
## Recency            -6.731815     146.19870      0.8732037     29.854714
## Age                 8.464241      90.58222     26.5283165      9.946052
## Wine             5195.253148   43038.24687   7345.6715847   5407.865525
## Fruit            1583.555792    4889.37300   1292.9674518    934.242986
## Meat             4889.373004   50302.98644   7043.4752771   4929.552262
## Fish             1292.967452    7043.47528   2997.7905288   1312.988348
## Sweet             934.242986    4929.55226   1312.9883482   1686.912935
## Gold              817.531565    4177.24410   1211.8027530    760.712065
## Deals             -10.297194     -52.33909    -15.0872165     -9.594450
## Web                32.944337     188.78364     44.9749377     37.593472
## Catalog            56.633209     481.89347     85.3714836     59.518760
## Store              59.311068     354.34502     81.4727753     60.780016
## WebVisitsMonth    -40.413420    -293.46191    -59.2820805    -42.074230
##                         Gold          Deals           Web        Catalog
## Income         425110.224539 -4024.23079940 26762.8598168 43406.47821080
## Recency            26.493442     0.11780606    -0.4475793     2.04027593
## Age                39.875318     1.35269739     5.0280158     4.27130045
## Wine             6864.460972     5.76628359   512.0301067   626.67168449
## Fruit             817.531565   -10.29719442    32.9443373    56.63320913
## Meat             4177.244103   -52.33909165   188.7836423   481.89347145
## Fish             1211.802753   -15.08721651    44.9749377    85.37148361
## Sweet             760.712065    -9.59445038    37.5934725    59.51876034
## Gold             2684.837167     5.17377599    57.8128972    67.09420732
## Deals               5.173776     3.70068189     1.2730690    -0.06822921
## Web                57.812897     1.27306904     7.5128128     3.10346526
## Catalog            67.094207    -0.06822921     3.1034653     8.56576978
## Store              65.553804     0.41340324     4.5998250     4.92682787
## WebVisitsMonth    -31.127510     1.61455799    -0.3405414    -3.70538032
##                         Store WebVisitsMonth
## Income         43318.89744787 -33768.0897098
## Recency           -0.04082519     -1.3033559
## Age                4.98293674     -3.6017965
## Wine             701.82455913   -263.4233706
## Fruit             59.31106849    -40.4134199
## Meat             354.34501532   -293.4619091
## Fish              81.47277526    -59.2820805
## Sweet             60.78001646    -42.0742297
## Gold              65.55380426    -31.1275096
## Deals              0.41340324      1.6145580
## Web                4.59982500     -0.3405414
## Catalog            4.92682787     -3.7053803
## Store             10.56760172     -3.4091656
## WebVisitsMonth    -3.40916564      5.8823641

Our data has high variance and has different scales. There are a few variables that noticeably have outliers in the data, which can be observed in the vast difference between their max and 3rd quartile values (Income, Meat, Wine, Gold, etc.).

We want to see if there is a high correlation between variables in customers_num. If the variables in our data are highly correlated, it may suggest redundancy in the data , which confirms our need for Principle Component Analysis (PCA) in order to mitigate multicollinearity, reduce dimensionality and improve cluster quality.

ggcorr(customers_num, label = T, hjust = 0.9)+
  theme(plot.margin = unit(c(0,-2, 0.5, 1), "cm"))

Since the correlation map above shows that there are variables with high correlation (Sweet & Fish, etc.), we will use PCA to reduce the data dimensionality.

3.3 Possibility for Clustering

no_outliers <- customers[customers$Income < 300000, ]
 ggplot(no_outliers, aes(Age_group, Income, fill = Age_group)) + 
  geom_boxplot(show.legend = F) + 
  scale_fill_brewer()+
    theme_minimal() + 
  labs(title = "Income by Age Group")

From the boxplot above, we can observe that as we go up the age group, the average for income increases.

# Convert wide to long format using pivot_longer
long_purchase <- no_outliers %>%
  pivot_longer(cols = c(Wine,Fruit,Meat,Fish,Sweet,Gold), names_to = "Purchase_Type", values_to = "Purchase_Amt")

# Plot relationship between Product, Purchase Amount and Income
ggplot(long_purchase, aes(Purchase_Amt, Income, color = Purchase_Type)) +
    geom_point(alpha = 0.5) + 
    theme_minimal() + 
    labs(title = "Income and Product Purchase", color = "Product", x = "Spending")+
    scale_color_brewer(palette = "Set3")

Based on the scatterplot for Income and Product Purchase, there is a lot of overlap for products with fish, fruit, gold, and sweet , thus we are unable to discern any significant patterns for purchases with these products. But, for wine and meat, we can observe a potential pattern for clustering, in which those with higher income tend to spend more on these two products.

ggplot(long_purchase, aes(x = Age_group, y = Purchase_Amt, fill = Purchase_Type)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "Product Purchase by Age Group", fill = "Product", x = "Age Group", 
       y = "Spending") + 
  scale_fill_brewer()+
  theme_minimal()

Based on this stacked barplot of Product Purchase by Age Group, the age group that has the highest amount of spending in total is 41-50 years old. Across all age groups the product that customers spend the most on is Wine and Meat.

# Convert wide to long format using pivot_longer
long_method <- no_outliers %>%
  pivot_longer(cols = c(Deals, Web, Store, Catalog), names_to = "Purchase_Method", values_to = "Qty")

# Plot relationship between Product, Purchase Amount and Income
ggplot(long_method, aes(Qty, Income, color = Purchase_Method)) +
    geom_point(alpha = 0.5) + 
    theme_minimal() + 
    labs(title = "Income and Purchase Method", color = "Method", x = "Number of Purchases")+    
    scale_color_brewer(palette = "Set3")

From this scatterplot on Income and Purchase Method, we can observe a few patterns such as the catalog method being one of the least inefficient methods of attracting customers to purchase. Customers tend to purchase more from the physical store. There doesn’t seem to be a discernible pattern between income and the method of purchase.

4 Data Pre-Processing

4.1 Scaling

customers_sc <- scale(customers_num)
boxplot(customers_sc)

The numerical variables are now scaled but we still have outliers in our data. Let’s handle it using windsorize to handle outliers and preserve overall structure of data.

4.2 Outliers

customers_win <- Winsorize(customers_sc) %>% as.matrix()
customers_win %>% boxplot()

Our data seems better now, we no longer have extreme outliers.

4.3 Encoding

#Encode Marital Status
quali$Single <- ifelse(quali$Marital_Status == "Single", 1, 0)
quali <- quali %>% select(-Marital_Status)

#Encode Education
quali$Postgraduate <- ifelse(quali$Education == "Postgraduate", 1, 0)
quali <- quali %>% select(-Education)

#Encode Children
quali$Children <- ifelse(quali$Children != 0, 1, 0)

quali <- quali %>%  mutate_at(vars(Complain, AcceptedCmp1, AcceptedCmp2, AcceptedCmp3, AcceptedCmp4, AcceptedCmp5, Response), as.numeric)
combined <- cbind(customers_win, quali)

5 Principle Component Analysis

Make PCA object

pca <- PCA(combined,
            scale.unit = T,
            quali.sup = qualivar,
            graph = F,
            ncp = 14)
pca$eig
##         eigenvalue percentage of variance cumulative percentage of variance
## comp 1   6.4488455             40.3052841                          40.30528
## comp 2   1.6638377             10.3989855                          50.70427
## comp 3   1.2084139              7.5525866                          58.25686
## comp 4   1.0046245              6.2789029                          64.53576
## comp 5   1.0006563              6.2541016                          70.78986
## comp 6   0.8720914              5.4505715                          76.24043
## comp 7   0.6930427              4.3315168                          80.57195
## comp 8   0.5856672              3.6604200                          84.23237
## comp 9   0.5285247              3.3032796                          87.53565
## comp 10  0.4219360              2.6370998                          90.17275
## comp 11  0.3572397              2.2327483                          92.40550
## comp 12  0.3322164              2.0763527                          94.48185
## comp 13  0.3106524              1.9415777                          96.42343
## comp 14  0.2261436              1.4133974                          97.83682
## comp 15  0.2013635              1.2585218                          99.09535
## comp 16  0.1447446              0.9046536                         100.00000

Through PCA, we can gain information on principal components (high in cumulative variance) and perform dimensionality reduction (reduce dimension of the data while maintaining as much info as possible). In this case we would like to retain at least 85% of information from our data, thus based on the summary above we will pick PC1-PC7, since cumulative percentage of variance for PC7 meets the minimum of 85%. We will be able to reduce our data by 50% while retaining 85% of the information from our data.

Let’s extract values PC1-PC7 and make a new dataframe

keepnum <- as.data.frame(pca$ind$coord[,1:7])

6 Clustering

6.1 Optimum K

fviz_nbclust(x = keepnum , # data 
             FUNcluster = kmeans, # algoritma klustering yg digunakan 
             method = "wss" ) 

Using the elbow method, we can observe that the optimum k is 3 clusters since there are no longer any significant declines of total within sum of square past k = 3.

6.2 K-Means Clustering

clusterpca <- kmeans(x = keepnum,
              centers = 3)

Looking at the cluster size, they are not proportional to each other and have a considerably low ratio between the Between-Cluster Sum of Squares (Between_SS) and the Total Sum of Squares (Total_SS).

Let’s reconstruct our PCA object and bind it with our qualitative data, also adding cluster as a new variable.

reconpca <- reconst(pca, 
            ncp = 7) %>% as.data.frame() %>% select(-Single, -Postgraduate)
reconpca<- cbind(reconpca, quali)
reconpca$cluster <- clusterpca$cluster %>% as.factor()

head(reconpca)

6.3 Profiling

customer_centroids <- reconpca %>% group_by(cluster) %>% 
  summarise_all(mean)
customer_centroids %>% 
  pivot_longer(-cluster) %>% 
  group_by(name) %>% 
  summarize(
    kelompok_min = which.min(value),
    kelompok_max = which.max(value))

Based from our clustering model, we can observe these characteristics in the 3 clusters: Cluster 1 - Oldest customers,Prone to purchase with deals,Postgraduate education, Top spender through Web

Cluster 2 - Top spenders in all product categories, highest income

Cluster 3 - Youngest customers, Most complains, lowest income