Introduction

Customer segmentation is the process of separating customers into groups based on traits they share. Segmentation offers a simple way of organizing and managing a company’s relationships with its customers. It also makes it easier to tailor and personalize marketing, service, and sales efforts to the needs of specific groups, which helps boost customer loyalty and conversions.

We use a dataset from Kaggle.com on customer personality analysis. Customer personality analysis is a detailed analysis of a company’s ideal customers. It helps a business better understand its customers and makes it easier to modify products according to the specific needs, behaviors, and concerns of different customer segments. For example, instead of spending money to market a new product to every customer in the company’s database, a company can analyze which customer segment is most likely to buy the product and then market it only to that segment.

We will analyze the data with unsupervised learning methods: clustering and PCA. Unsupervised learning is a family of machine learning methods for data with no target variable; it focuses on exploring the data, such as looking for patterns.

Data Preparation

Importing Libraries

library(dplyr) #for general data wrangling
library(lubridate) #for converting date-time data
library(GGally) #for the correlation plot
library(FactoMineR) #for exploratory multivariate data analysis
library(factoextra) #for intuitive PCA and clustering plots
library(ggplot2) #for polishing the plots
library(gridExtra) #for arranging the plots
library(ggiraphExtra) #for the profiling radar chart
library(plotly) #for interactive 3D visualization

Importing Dataset

customer <- read.csv("marketing_campaign.csv", sep = "\t")
rmarkdown::paged_table(customer)
#Rename column names to make it more intuitive
names(customer) <- c("ID","Year_Birth","Education","Marital_Status","Income","Kidhome","Teenhome","Dt_Customer","Recency","Wines", "Fruits", "Meat", "Fish", "Sweet", "Gold", "Deals", "Web", "Catalog", "Store", "WebVisits", "Cmp3", "Cmp4", "Cmp5", "Cmp1", "Cmp2", "Complain", "Z_CostContact", "Z_Revenue", "Response")

Data Columns:

• ID: Customer’s unique identifier
• Year_Birth: Customer’s birth year
• Education: Customer’s education level
• Marital_Status: Customer’s marital status
• Income: Customer’s yearly household income
• Kidhome: Number of children in customer’s household
• Teenhome: Number of teenagers in customer’s household
• Dt_Customer: Date of customer’s enrollment with the company
• Recency: Number of days since customer’s last purchase
• Wines (MntWines): Amount spent on wine in last 2 years
• Fruits (MntFruits): Amount spent on fruits in last 2 years
• Meat (MntMeatProducts): Amount spent on meat in last 2 years
• Fish (MntFishProducts): Amount spent on fish in last 2 years
• Sweet (MntSweetProducts): Amount spent on sweets in last 2 years
• Gold (MntGoldProds): Amount spent on gold in last 2 years
• Web (NumWebPurchases): Number of purchases made through the company’s website
• Catalog (NumCatalogPurchases): Number of purchases made using a catalog
• Store (NumStorePurchases): Number of purchases made directly in stores
• WebVisits (NumWebVisitsMonth): Number of visits to company’s website in the last month
• Deals (NumDealsPurchases): Number of purchases made with a discount
• Cmp1 (AcceptedCmp1): 1 if customer accepted the offer in the 1st campaign, 0 otherwise
• Cmp2 (AcceptedCmp2): 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
• Cmp3 (AcceptedCmp3): 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
• Cmp4 (AcceptedCmp4): 1 if customer accepted the offer in the 4th campaign, 0 otherwise
• Cmp5 (AcceptedCmp5): 1 if customer accepted the offer in the 5th campaign, 0 otherwise
• Complain: 1 if the customer complained in the last 2 years, 0 otherwise
• Response: 1 if customer accepted the offer in the last campaign, 0 otherwise

Data Wrangling

Data Types

glimpse(customer)
## Rows: 2,240
## Columns: 29
## $ ID             <int> 5524, 2174, 4141, 6182, 5324, 7446, 965, 6177, 4855, 58…
## $ Year_Birth     <int> 1957, 1954, 1965, 1984, 1981, 1967, 1971, 1985, 1974, 1…
## $ Education      <chr> "Graduation", "Graduation", "Graduation", "Graduation",…
## $ Marital_Status <chr> "Single", "Single", "Together", "Together", "Married", …
## $ Income         <int> 58138, 46344, 71613, 26646, 58293, 62513, 55635, 33454,…
## $ Kidhome        <int> 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0…
## $ Teenhome       <int> 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1…
## $ Dt_Customer    <chr> "04-09-2012", "08-03-2014", "21-08-2013", "10-02-2014",…
## $ Recency        <int> 58, 38, 26, 26, 94, 16, 34, 32, 19, 68, 11, 59, 82, 53,…
## $ Wines          <int> 635, 11, 426, 11, 173, 520, 235, 76, 14, 28, 5, 6, 194,…
## $ Fruits         <int> 88, 1, 49, 4, 43, 42, 65, 10, 0, 0, 5, 16, 61, 2, 14, 2…
## $ Meat           <int> 546, 6, 127, 20, 118, 98, 164, 56, 24, 6, 6, 11, 480, 5…
## $ Fish           <int> 172, 2, 111, 10, 46, 0, 50, 3, 3, 1, 0, 11, 225, 3, 6, …
## $ Sweet          <int> 88, 1, 21, 3, 27, 42, 49, 1, 3, 1, 2, 1, 112, 5, 1, 68,…
## $ Gold           <int> 88, 6, 42, 5, 15, 14, 27, 23, 2, 13, 1, 16, 30, 14, 5, …
## $ Deals          <int> 3, 2, 1, 2, 5, 2, 4, 2, 1, 1, 1, 1, 1, 3, 1, 1, 3, 2, 2…
## $ Web            <int> 8, 1, 8, 2, 5, 6, 7, 4, 3, 1, 1, 2, 3, 6, 1, 7, 3, 4, 1…
## $ Catalog        <int> 10, 1, 2, 0, 3, 4, 3, 0, 0, 0, 0, 0, 4, 1, 0, 6, 0, 1, …
## $ Store          <int> 4, 2, 10, 4, 6, 10, 7, 4, 2, 0, 2, 3, 8, 5, 3, 12, 3, 6…
## $ WebVisits      <int> 7, 5, 4, 6, 5, 6, 6, 8, 9, 20, 7, 8, 2, 6, 8, 3, 8, 7, …
## $ Cmp3           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Cmp4           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Cmp5           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…
## $ Cmp1           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1…
## $ Cmp2           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Complain       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Z_CostContact  <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
## $ Z_Revenue      <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,…
## $ Response       <int> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…

We will take out ID, Year_Birth, Dt_Customer, Z_CostContact, and Z_Revenue because we won’t use them. In K-means clustering the data must be numeric, so the categorical columns need integer codes: the Education and Marital_Status columns will be converted to numeric codes.

#Convert Education and Marital_Status columns into numeric codes
cust_clean <- customer
cust_clean$Marital_Status <- sapply(X = cust_clean$Marital_Status,
                           FUN = switch, 
                           "Single" = "0",
                           "Married" = "1", 
                           "Together" = "2", 
                           "Divorced" = "3", 
                           "Widow" = "4",
                           "Alone" = "5", 
                           "Other" = "6")
#Values not listed above fall through switch() as NULL; as.character() turns
#them into the literal string "NULL" (the NULL level seen in summary() later)
cust_clean$Marital_Status <- as.character(cust_clean$Marital_Status)
cust_clean$Education <- sapply(X = cust_clean$Education,
                           FUN = switch, 
                           "2n Cycle" = "0",
                           "Basic" = "1", 
                           "Graduation" = "2", 
                           "Master" = "3", 
                           "PhD" = "4")
#Convert some columns into factors to make the EDA process easier
cust_clean <- cust_clean %>% 
  mutate_at(vars(Education, Marital_Status, Kidhome, Teenhome, Cmp3, Cmp4, Cmp5, Cmp1, Cmp2, Complain, Response), as.factor) %>% 
  select(-c(Dt_Customer, ID, Year_Birth, Z_CostContact, Z_Revenue))
glimpse(cust_clean)
## Rows: 2,240
## Columns: 24
## $ Education      <fct> 2, 2, 2, 2, 4, 3, 2, 4, 4, 4, 2, 1, 2, 3, 2, 4, 2, 2, 3…
## $ Marital_Status <fct> 0, 0, 2, 2, 1, 2, 3, 1, 2, 2, 1, 1, 3, 3, 1, 0, 1, 2, 1…
## $ Income         <int> 58138, 46344, 71613, 26646, 58293, 62513, 55635, 33454,…
## $ Kidhome        <fct> 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0…
## $ Teenhome       <fct> 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1…
## $ Recency        <int> 58, 38, 26, 26, 94, 16, 34, 32, 19, 68, 11, 59, 82, 53,…
## $ Wines          <int> 635, 11, 426, 11, 173, 520, 235, 76, 14, 28, 5, 6, 194,…
## $ Fruits         <int> 88, 1, 49, 4, 43, 42, 65, 10, 0, 0, 5, 16, 61, 2, 14, 2…
## $ Meat           <int> 546, 6, 127, 20, 118, 98, 164, 56, 24, 6, 6, 11, 480, 5…
## $ Fish           <int> 172, 2, 111, 10, 46, 0, 50, 3, 3, 1, 0, 11, 225, 3, 6, …
## $ Sweet          <int> 88, 1, 21, 3, 27, 42, 49, 1, 3, 1, 2, 1, 112, 5, 1, 68,…
## $ Gold           <int> 88, 6, 42, 5, 15, 14, 27, 23, 2, 13, 1, 16, 30, 14, 5, …
## $ Deals          <int> 3, 2, 1, 2, 5, 2, 4, 2, 1, 1, 1, 1, 1, 3, 1, 1, 3, 2, 2…
## $ Web            <int> 8, 1, 8, 2, 5, 6, 7, 4, 3, 1, 1, 2, 3, 6, 1, 7, 3, 4, 1…
## $ Catalog        <int> 10, 1, 2, 0, 3, 4, 3, 0, 0, 0, 0, 0, 4, 1, 0, 6, 0, 1, …
## $ Store          <int> 4, 2, 10, 4, 6, 10, 7, 4, 2, 0, 2, 3, 8, 5, 3, 12, 3, 6…
## $ WebVisits      <int> 7, 5, 4, 6, 5, 6, 6, 8, 9, 20, 7, 8, 2, 6, 8, 3, 8, 7, …
## $ Cmp3           <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Cmp4           <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Cmp5           <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…
## $ Cmp1           <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1…
## $ Cmp2           <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Complain       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Response       <fct> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…

Missing Values

cust_clean %>% is.na() %>% colSums()
##      Education Marital_Status         Income        Kidhome       Teenhome 
##              0              0             24              0              0 
##        Recency          Wines         Fruits           Meat           Fish 
##              0              0              0              0              0 
##          Sweet           Gold          Deals            Web        Catalog 
##              0              0              0              0              0 
##          Store      WebVisits           Cmp3           Cmp4           Cmp5 
##              0              0              0              0              0 
##           Cmp1           Cmp2       Complain       Response 
##              0              0              0              0

There are missing values in the Income column. We will remove those rows rather than impute them, since they account for only 24 of the 2,240 rows.

cust_clean <- na.omit(cust_clean)
cust_clean %>% is.na() %>% colSums()
##      Education Marital_Status         Income        Kidhome       Teenhome 
##              0              0              0              0              0 
##        Recency          Wines         Fruits           Meat           Fish 
##              0              0              0              0              0 
##          Sweet           Gold          Deals            Web        Catalog 
##              0              0              0              0              0 
##          Store      WebVisits           Cmp3           Cmp4           Cmp5 
##              0              0              0              0              0 
##           Cmp1           Cmp2       Complain       Response 
##              0              0              0              0

There are no missing values anymore, so we can continue to the next step.

Remove outlier

Is there any outlier in this data set?

summary(cust_clean)
##  Education Marital_Status     Income       Kidhome  Teenhome    Recency     
##  0: 200    0   :471       Min.   :  1730   0:1283   0:1147   Min.   : 0.00  
##  1:  54    1   :857       1st Qu.: 35303   1: 887   1:1018   1st Qu.:24.00  
##  2:1116    2   :573       Median : 51382   2:  46   2:  51   Median :49.00  
##  3: 365    3   :232       Mean   : 52247                     Mean   :49.01  
##  4: 481    4   : 76       3rd Qu.: 68522                     3rd Qu.:74.00  
##            5   :  3       Max.   :666666                     Max.   :99.00  
##            NULL:  4                                                         
##      Wines            Fruits            Meat             Fish       
##  Min.   :   0.0   Min.   :  0.00   Min.   :   0.0   Min.   :  0.00  
##  1st Qu.:  24.0   1st Qu.:  2.00   1st Qu.:  16.0   1st Qu.:  3.00  
##  Median : 174.5   Median :  8.00   Median :  68.0   Median : 12.00  
##  Mean   : 305.1   Mean   : 26.36   Mean   : 167.0   Mean   : 37.64  
##  3rd Qu.: 505.0   3rd Qu.: 33.00   3rd Qu.: 232.2   3rd Qu.: 50.00  
##  Max.   :1493.0   Max.   :199.00   Max.   :1725.0   Max.   :259.00  
##                                                                     
##      Sweet             Gold            Deals             Web        
##  Min.   :  0.00   Min.   :  0.00   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:  1.00   1st Qu.:  9.00   1st Qu.: 1.000   1st Qu.: 2.000  
##  Median :  8.00   Median : 24.50   Median : 2.000   Median : 4.000  
##  Mean   : 27.03   Mean   : 43.97   Mean   : 2.324   Mean   : 4.085  
##  3rd Qu.: 33.00   3rd Qu.: 56.00   3rd Qu.: 3.000   3rd Qu.: 6.000  
##  Max.   :262.00   Max.   :321.00   Max.   :15.000   Max.   :27.000  
##                                                                     
##     Catalog           Store          WebVisits      Cmp3     Cmp4     Cmp5    
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   0:2053   0:2052   0:2054  
##  1st Qu.: 0.000   1st Qu.: 3.000   1st Qu.: 3.000   1: 163   1: 164   1: 162  
##  Median : 2.000   Median : 5.000   Median : 6.000                             
##  Mean   : 2.671   Mean   : 5.801   Mean   : 5.319                             
##  3rd Qu.: 4.000   3rd Qu.: 8.000   3rd Qu.: 7.000                             
##  Max.   :28.000   Max.   :13.000   Max.   :20.000                             
##                                                                               
##  Cmp1     Cmp2     Complain Response
##  0:2074   0:2186   0:2195   0:1883  
##  1: 142   1:  30   1:  21   1: 333  
##                                     
##                                     
##                                     
##                                     
## 

It seems there is an outlier in the Income column. Let’s check it using a boxplot.

boxplot(cust_clean$Income)

Yes, we were right: there is an outlier in the Income column (the maximum of 666666). We will take it out.

cust_clean <- cust_clean %>% filter(Income<=300000)

Exploratory Data Analysis

Possibility for Principal Component Analysis (PCA)

- Do the variables have the same scale?
Let’s check the range for each column:

summary(cust_clean)
##  Education Marital_Status     Income       Kidhome  Teenhome    Recency     
##  0: 200    0   :471       Min.   :  1730   0:1283   0:1146   Min.   : 0.00  
##  1:  54    1   :857       1st Qu.: 35284   1: 886   1:1018   1st Qu.:24.00  
##  2:1115    2   :572       Median : 51373   2:  46   2:  51   Median :49.00  
##  3: 365    3   :232       Mean   : 51970                     Mean   :49.02  
##  4: 481    4   : 76       3rd Qu.: 68487                     3rd Qu.:74.00  
##            5   :  3       Max.   :162397                     Max.   :99.00  
##            NULL:  4                                                         
##      Wines            Fruits            Meat             Fish       
##  Min.   :   0.0   Min.   :  0.00   Min.   :   0.0   Min.   :  0.00  
##  1st Qu.:  24.0   1st Qu.:  2.00   1st Qu.:  16.0   1st Qu.:  3.00  
##  Median : 175.0   Median :  8.00   Median :  68.0   Median : 12.00  
##  Mean   : 305.2   Mean   : 26.36   Mean   : 167.1   Mean   : 37.65  
##  3rd Qu.: 505.0   3rd Qu.: 33.00   3rd Qu.: 232.5   3rd Qu.: 50.00  
##  Max.   :1493.0   Max.   :199.00   Max.   :1725.0   Max.   :259.00  
##                                                                     
##      Sweet             Gold            Deals             Web        
##  Min.   :  0.00   Min.   :  0.00   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:  1.00   1st Qu.:  9.00   1st Qu.: 1.000   1st Qu.: 2.000  
##  Median :  8.00   Median : 25.00   Median : 2.000   Median : 4.000  
##  Mean   : 27.04   Mean   : 43.98   Mean   : 2.323   Mean   : 4.086  
##  3rd Qu.: 33.00   3rd Qu.: 56.00   3rd Qu.: 3.000   3rd Qu.: 6.000  
##  Max.   :262.00   Max.   :321.00   Max.   :15.000   Max.   :27.000  
##                                                                     
##     Catalog           Store          WebVisits      Cmp3     Cmp4     Cmp5    
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   0:2052   0:2051   0:2053  
##  1st Qu.: 0.000   1st Qu.: 3.000   1st Qu.: 3.000   1: 163   1: 164   1: 162  
##  Median : 2.000   Median : 5.000   Median : 6.000                             
##  Mean   : 2.672   Mean   : 5.802   Mean   : 5.319                             
##  3rd Qu.: 4.000   3rd Qu.: 8.000   3rd Qu.: 7.000                             
##  Max.   :28.000   Max.   :13.000   Max.   :20.000                             
##                                                                               
##  Cmp1     Cmp2     Complain Response
##  0:2073   0:2185   0:2194   0:1882  
##  1: 142   1:  30   1:  21   1: 333  
##                                     
##                                     
##                                     
##                                     
## 

The numeric variables have very different ranges. This risks biasing the model, especially in the PCA process, so scaling will be carried out before PCA.
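As a minimal sketch of what that scaling step does (base R’s scale() z-standardizes each column; the manual version below is for illustration only):

# z-score standardization: (x - mean(x)) / sd(x), column by column
income_z <- scale(cust_clean$Income)
income_manual <- (cust_clean$Income - mean(cust_clean$Income)) / sd(cust_clean$Income)
all.equal(as.numeric(income_z), income_manual) # should be TRUE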

- Do our numeric predictors correlate with each other?

ggcorr(cust_clean, low = "navy", high = "turquoise", label = T)

The Pearson correlation coefficient (r) lies between -1 and +1:
- r = +1: perfectly positively correlated.
- r = -1: perfectly negatively correlated.
- r = 0: no relation between the variables.
- r = 0 to 0.30: negligible correlation.
- r = 0.30 to 0.50: moderate correlation.
- r = 0.50 to 1: highly correlated.
source
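To check a single pair directly (a small illustration; the pair chosen here is arbitrary):

# Pearson correlation between Income and the amount spent on wine
cor(cust_clean$Income, cust_clean$Wines, method = "pearson")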

Based on the plot above, most of the predictors have at least moderate correlation with one another. Principal Component Analysis can be used to reduce the dimension of the data while preserving as much information as possible, producing non-multicollinear data.

Possibility for Clustering

Our target column is Response. Let’s see whether customers accepted the last campaign based on Deals and Income. We filter on Income first to get a clearer visualization, because the raw data contains an Income outlier.

#Convert Response and Marital_Status into factors for easier visualization in this section
#use the original customer data (before Marital_Status was converted to numeric) for clearer labels
customer_eda <- customer %>%  mutate_at(vars(Response, Marital_Status), as.factor)
customer_eda %>% filter(Income<=200000) %>% 
              ggplot( aes(Deals, Income, color = Response, size = Marital_Status)) + 
                geom_point(alpha = 0.5) + theme_minimal()

Based on the graph above, most customers who accepted the campaign have an income of 25,000–125,000, and their marital status is Married, Single, or Together.

Next, we check the spending distributions by customer Response using boxplots.

p1 <- ggplot(customer_eda, aes(Response, Wines, fill = Response)) + geom_boxplot(show.legend = F) + 
    theme_minimal() + labs(title = "MntWines")

p2 <- ggplot(customer_eda, aes(Response, Fruits, fill = Response)) + geom_boxplot(show.legend = F) + 
    theme_minimal() + labs(title = "MntFruits")

p3 <- ggplot(customer_eda, aes(Response, Meat, fill = Response)) + geom_boxplot(show.legend = F) + 
    theme_minimal() + labs(title = "MntMeatProducts")

p4 <- ggplot(customer_eda, aes(Response, Fish, fill = Response)) + geom_boxplot(show.legend = F) + 
    theme_minimal()  + labs(title = "MntFishProducts")

p5 <- ggplot(customer_eda, aes(Response, Sweet, fill = Response)) + geom_boxplot(show.legend = F) + 
    theme_minimal() + labs(title = "MntSweetProducts")

p6 <- ggplot(customer_eda, aes(Response, Gold, fill = Response)) + geom_boxplot(show.legend = F) + 
    theme_minimal() + labs(title = "MntGoldProds")
grid.arrange(p1, p2, p3, p4, p5, p6)

Based on the graphs above, customers who said yes to the campaign tend to buy larger amounts, even though the difference is not dramatic. We can regard them as our more loyal customers.

Our dataset does not have a customer-type label, but the patterns in purchase amounts across the various products and in the response to the last marketing campaign suggest distinguishable groups. This data might therefore be suitable for K-means clustering.

Preprocessing Data

We split the predictors into quantitative and qualitative groups. This split is only needed for the PCA method; for the clustering analysis, all categorical predictors will be converted into numerical predictors.

#Check the numeric predictors
cust_clean %>% 
  select_if(is.numeric) %>% 
  colnames()
##  [1] "Income"    "Recency"   "Wines"     "Fruits"    "Meat"      "Fish"     
##  [7] "Sweet"     "Gold"      "Deals"     "Web"       "Catalog"   "Store"    
## [13] "WebVisits"

There are 13 numerical columns.

# quantitative columns
quanti <- cust_clean %>% 
  select_if(is.numeric) %>% 
  colnames()

# indexing numerical columns
quantivar <- which(colnames(cust_clean) %in% quanti)

# qualitative columns
quali <- cust_clean %>% 
  select_if(is.factor) %>% 
  colnames()

# indexing categorical columns
qualivar <- which(colnames(cust_clean) %in% quali)

PCA

Principal component analysis (PCA) is a multivariate statistical analysis technique, and arguably the most popular one today. PCA is commonly used in pattern recognition and signal processing.

PCA is a cornerstone of multivariate data analysis that applies the projection method. It is typically used to summarize large multivariate data tables into a smaller set of summary indices. From there, the variables are analyzed to find specific trends, variable clusters, and outliers. source
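As a hedged aside, the same projection idea can be sketched with base R’s prcomp(); the analysis below uses FactoMineR::PCA() instead because it can carry the qualitative columns along as supplementary variables:

# illustrative only: base-R PCA on the standardized numeric predictors
pca_base <- prcomp(cust_clean %>% select_if(is.numeric), scale. = TRUE)
summary(pca_base) # proportion of variance per component, analogous to cust_pca$eig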

Dimensionality reduction

# PCA by using FactoMineR
cust_pca <- PCA(
  X = cust_clean,
  scale.unit = T, 
  quali.sup = qualivar, 
  graph = F, 
  ncp = 13 # 13 numerical columns
)

cust_pca$eig # analyze cumulative variance of each PC
##         eigenvalue percentage of variance cumulative percentage of variance
## comp 1   5.8313793              44.856764                          44.85676
## comp 2   1.5545979              11.958445                          56.81521
## comp 3   1.0026549               7.712730                          64.52794
## comp 4   0.8987336               6.913335                          71.44127
## comp 5   0.6654878               5.119137                          76.56041
## comp 6   0.6362576               4.894289                          81.45470
## comp 7   0.5290949               4.069961                          85.52466
## comp 8   0.4330794               3.331380                          88.85604
## comp 9   0.3981573               3.062749                          91.91879
## comp 10  0.3629042               2.791571                          94.71036
## comp 11  0.2661855               2.047580                          96.75794
## comp 12  0.2461341               1.893339                          98.65128
## comp 13  0.1753336               1.348720                         100.00000

Eigenvalues represent the amount of variance explained by a given principal component. In theory they can be positive or negative, but in practice they explain variance, which is always positive:
- Eigenvalues greater than zero are a good sign.
- Since variance cannot be negative, negative eigenvalues imply the model is ill-conditioned.
- Eigenvalues close to zero imply multicollinearity among the items, since almost all of the variance can be taken up by the first component.
source
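A scree plot makes this drop-off easy to see (a small sketch using factoextra, which is already loaded):

# illustrative only: scree plot of the percentage of variance per component
fviz_eig(cust_pca, addlabels = TRUE)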

Through PCA, we can keep a few informative principal components (with high cumulative variance) from the customer dataset to perform dimensionality reduction. By doing this, we reduce the dimensionality of the dataset while retaining as much information as possible.

In this study, we want to retain at least 85% of the information in the data. From the PCA summary (cust_pca$eig), we picked PC1–PC8 out of a total of 13 PCs. By doing this, we reduced the dimensionality by 38.46% (from 13 columns to 8) while retaining 88.86% of the information in the data.

We can extract the values of PC1–PC8 for all observations and put them into a new data frame. This data frame can later be analyzed with a supervised classification technique or used for other purposes.

# making a new data frame from PCA result
cust_x <- data.frame(cust_pca$ind$coord[,1:8])

cust_1 <- cbind(cust_x, Response = cust_clean$Response)
head(cust_1)
##         Dim.1      Dim.2      Dim.3       Dim.4       Dim.5      Dim.6
## 1  3.63029082  0.6284463  0.2578246  1.75833936  0.96723077 -0.2504482
## 2 -2.13279216 -0.9567982 -0.3326070 -0.34326515  0.31540242 -0.3632943
## 3  1.71918345  0.1411721 -0.9475246 -0.11005151 -0.95656049  1.0691186
## 4 -2.31635974 -0.5090272 -0.7967320  0.07766089  0.09785698  0.1260063
## 5  0.06529784  0.7402671  1.5385334  0.22315971  1.12298509  0.3412906
## 6  0.78300157  0.6853681 -1.1632779 -0.85228829 -0.25014634  1.1248055
##        Dim.7       Dim.8 Response
## 1  2.3737353 -0.04560618        1
## 2 -0.1418925 -0.17755398        0
## 3 -0.6182426  0.45559518        0
## 4 -0.1837603  0.06647132        0
## 5 -0.5817403  0.19720561        0
## 6 -0.1887550  0.15115626        0
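As a hedged illustration of that later use (not part of this study’s workflow), the retained components could feed a simple classifier such as logistic regression:

# illustrative only: logistic regression of Response on the retained components
pc_model <- glm(Response ~ ., data = cust_1, family = binomial)
summary(pc_model)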

Individual and Variable Factor Map

Outlier identification

plot.PCA(
  x=cust_pca, 
  choix="ind", 
  select="contrib 10", 
  invisible = "quali", 
  habillage = "Response")

In total, 8 customers are flagged as outliers, 2 of which have Response = 1, i.e. those customers accepted the offer in the last campaign.

Variables Factor Map
We want to know how much each variable contributes to each PC, how much information each variable explains for each PC, and the correlation between the initial variables.

# PC1 bar plot variable contribution
fviz_contrib(X = cust_pca, 
             axes = 1, # = PC1
             choice = "var")

# PC2 bar plot variable contribution
fviz_contrib(X = cust_pca, 
             axes = 2, # = PC2
             choice = "var")

Based on the graphs above:
- Catalog, Meat, Income, Wines, Store, Fish, Fruits, and Sweet are the variables that contribute most to PC1.
- Deals, Web, and WebVisits are the variables that contribute most to PC2.
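The same information can be viewed at once in a variable correlation circle (a sketch; coloring by contribution is an optional choice):

# illustrative only: variables factor map on PC1-PC2, colored by contribution
fviz_pca_var(cust_pca, col.var = "contrib")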

Clustering

Clustering is a machine learning method that falls under unsupervised learning. It aims to find similar patterns in the data so that similar observations can be grouped together; the resulting groups are called clusters. A good clustering is one in which members of the same cluster are as similar as possible, while members of different clusters differ significantly. Clustering is widely used in fields such as customer segmentation, product recommendation, data profiling, and many more. source

K-means clustering is a popular unsupervised machine learning algorithm for partitioning a dataset into a pre-defined number of clusters. The goal is to group similar data points together and discover underlying patterns or structures within the data. K-means is a centroid-based (distance-based) algorithm: each cluster is associated with a centroid, and points are assigned to clusters according to their distance from these centroids. The main objective is to partition the data into a specific number (k) of groups, where data points within each group are similar to each other and dissimilar to points in other groups. It achieves this by minimizing the distance between data points and their assigned cluster’s center, the centroid. source
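Concretely, the quantity k-means minimizes is the total within-cluster sum of squared distances to the centroids. A minimal sketch of that objective for a given assignment (the function name is our own):

# illustrative only: total within-cluster sum of squares for an assignment;
# for a fitted model this matches kmeans()$tot.withinss
tot_withinss <- function(x, cluster) {
  sum(sapply(unique(cluster), function(k) {
    members  <- x[cluster == k, , drop = FALSE]
    centroid <- colMeans(members)
    sum(sweep(members, 2, centroid)^2) # squared distances to the centroid
  }))
}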

In K-means clustering, the data must be numeric, so we will use only the numerical predictors.

cust_clust <- cust_clean %>%  select(-c(Education, Marital_Status, Kidhome, Teenhome, Cmp3, Cmp4, Cmp5, Cmp1, Cmp2, Complain, Response))
glimpse(cust_clust)
## Rows: 2,215
## Columns: 13
## $ Income    <int> 58138, 46344, 71613, 26646, 58293, 62513, 55635, 33454, 3035…
## $ Recency   <int> 58, 38, 26, 26, 94, 16, 34, 32, 19, 68, 59, 82, 53, 38, 23, …
## $ Wines     <int> 635, 11, 426, 11, 173, 520, 235, 76, 14, 28, 6, 194, 233, 3,…
## $ Fruits    <int> 88, 1, 49, 4, 43, 42, 65, 10, 0, 0, 16, 61, 2, 14, 22, 5, 5,…
## $ Meat      <int> 546, 6, 127, 20, 118, 98, 164, 56, 24, 6, 11, 480, 53, 17, 1…
## $ Fish      <int> 172, 2, 111, 10, 46, 0, 50, 3, 3, 1, 11, 225, 3, 6, 59, 2, 1…
## $ Sweet     <int> 88, 1, 21, 3, 27, 42, 49, 1, 3, 1, 1, 112, 5, 1, 68, 13, 12,…
## $ Gold      <int> 88, 6, 42, 5, 15, 14, 27, 23, 2, 13, 16, 30, 14, 5, 45, 4, 2…
## $ Deals     <int> 3, 2, 1, 2, 5, 2, 4, 2, 1, 1, 1, 1, 3, 1, 1, 3, 2, 2, 2, 1, …
## $ Web       <int> 8, 1, 8, 2, 5, 6, 7, 4, 3, 1, 2, 3, 6, 1, 7, 3, 4, 11, 2, 4,…
## $ Catalog   <int> 10, 1, 2, 0, 3, 4, 3, 0, 0, 0, 0, 4, 1, 0, 6, 0, 1, 4, 1, 2,…
## $ Store     <int> 4, 2, 10, 4, 6, 10, 7, 4, 2, 0, 3, 8, 5, 3, 12, 3, 6, 9, 3, …
## $ WebVisits <int> 7, 5, 4, 6, 5, 6, 6, 8, 9, 20, 8, 2, 6, 8, 3, 8, 7, 5, 6, 8,…
#scaling data
cust_z <- scale(cust_clust)
summary(cust_z)
##      Income            Recency               Wines             Fruits       
##  Min.   :-2.33388   Min.   :-1.6934384   Min.   :-0.9048   Min.   :-0.6623  
##  1st Qu.:-0.77514   1st Qu.:-0.8644117   1st Qu.:-0.8336   1st Qu.:-0.6121  
##  Median :-0.02773   Median :-0.0008421   Median :-0.3860   Median :-0.4613  
##  Mean   : 0.00000   Mean   : 0.0000000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.76730   3rd Qu.: 0.8627274   3rd Qu.: 0.5922   3rd Qu.: 0.1668  
##  Max.   : 5.12987   Max.   : 1.7262970   Max.   : 3.5209   Max.   : 4.3374  
##       Meat              Fish             Sweet              Gold        
##  Min.   :-0.7448   Min.   :-0.6876   Min.   :-0.6583   Min.   :-0.8487  
##  1st Qu.:-0.6735   1st Qu.:-0.6328   1st Qu.:-0.6339   1st Qu.:-0.6750  
##  Median :-0.4416   Median :-0.4684   Median :-0.4635   Median :-0.3662  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.2917   3rd Qu.: 0.2255   3rd Qu.: 0.1451   3rd Qu.: 0.2320  
##  Max.   : 6.9454   Max.   : 4.0421   Max.   : 5.7199   Max.   : 5.3455  
##      Deals              Web              Catalog            Store        
##  Min.   :-1.2074   Min.   :-1.49036   Min.   :-0.9127   Min.   :-1.7848  
##  1st Qu.:-0.6876   1st Qu.:-0.76082   1st Qu.:-0.9127   1st Qu.:-0.8620  
##  Median :-0.1678   Median :-0.03129   Median :-0.2295   Median :-0.2468  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3520   3rd Qu.: 0.69825   3rd Qu.: 0.4538   3rd Qu.: 0.6760  
##  Max.   : 6.5896   Max.   : 8.35836   Max.   : 8.6528   Max.   : 2.2140  
##    WebVisits      
##  Min.   :-2.1925  
##  1st Qu.:-0.9558  
##  Median : 0.2808  
##  Mean   : 0.0000  
##  3rd Qu.: 0.6931  
##  Max.   : 6.0520

Finding the Optimal Number of Clusters (k)

1. Elbow Method
The elbow method, based on the total within-cluster sum of squares, is a technique used to determine the optimal number of clusters for a k-means clustering analysis. source

# obtain k optimum
fviz_nbclust(cust_z, FUNcluster = kmeans, method = "wss") + labs(subtitle = "Elbow method")

Based on the graph above, the decrease slows down after k = 2. The elbow method is one of the most widely used methods, but it has a drawback: it relies on visual interpretation, which can lead to ambiguous or incorrect results.
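The same curve can be computed by hand, which makes explicit what fviz_nbclust() is plotting (a sketch; the seed, k range, and nstart value are arbitrary choices):

# illustrative only: total within-cluster sum of squares for k = 1..10
set.seed(133)
wss <- sapply(1:10, function(k) kmeans(cust_z, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")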

2. Silhouette Method
The average silhouette method measures how well-defined a particular cluster is and how well-separated it is from other clusters. A silhouette value is calculated for each observation in the data set; the average over all observations gives the average silhouette width, which summarizes the quality of the clustering. source

fviz_nbclust(cust_z, kmeans, "silhouette", k.max = 15) + labs(subtitle = "Silhouette method")

As can easily be seen from the plot, the clustering with the highest silhouette value uses 2 clusters, so the optimal number of clusters appears to be two. However, looking at the silhouette values on the y-axis, the value for 3 clusters is quite close to that for 2. For this reason, it is useful to run the clustering algorithm for both 2 and 3 clusters and interpret the results. source
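To check the two candidate values directly, here is a sketch using the cluster package (an extra dependency, not loaded above):

# illustrative only: average silhouette width for k = 2 and k = 3
library(cluster)
d <- dist(cust_z)
set.seed(133)
for (k in c(2, 3)) {
  km  <- kmeans(cust_z, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, d)
  cat("k =", k, "average silhouette width:", mean(sil[, "sil_width"]), "\n")
}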

3. Gap Statistic
Gap Statistic Method compares the observed within-cluster variation for different values of k with the variation expected under a null reference distribution of the data.

fviz_nbclust(cust_z, kmeans, "gap_stat", k.max = 15) + labs(subtitle = "Gap Statistic method")

Gap Statistic Method offers 5 as the optimal number of clusters.

Two of the methods suggest k = 2, but the gap statistic gives 5. We will use their rough average, k = 3.

K-Means Clustering

# k-means clustering
set.seed(133)
cust_km <- kmeans(cust_z, 3)

# result analysis
cust_km
## K-means clustering with 3 clusters of sizes 597, 593, 1025
## 
## Cluster means:
##       Income      Recency      Wines     Fruits       Meat       Fish
## 1  1.1118923  0.010729995  0.8678931  1.0678523  1.2548446  1.1541386
## 2  0.2566170 -0.018550399  0.4767978 -0.1340294 -0.1449327 -0.1959788
## 3 -0.7960718  0.004482517 -0.7813398 -0.5444179 -0.6470216 -0.5588345
##        Sweet       Gold      Deals        Web    Catalog      Store  WebVisits
## 1  1.0746399  0.6740915 -0.4803667  0.4379592  1.1455884  0.8156569 -1.0235079
## 2 -0.1521858  0.2927043  0.7823986  0.8292665  0.1011815  0.5577581  0.2467713
## 3 -0.5378672 -0.5619573 -0.1728619 -0.7348456 -0.7257726 -0.7977539  0.4533647
## 
## Clustering vector:
##    [1] 1 3 2 3 2 2 2 3 3 3 3 1 2 3 1 3 3 1 3 3 1 2 2 2 3 3 3 1 3 3 3 2 1 3 2 3 3
##   [38] 1 1 3 3 3 1 3 3 2 2 1 3 1 2 1 1 3 2 1 2 2 2 1 3 3 1 2 2 1 2 2 3 3 1 1 3 2
##   [75] 3 3 3 3 1 3 3 2 1 3 3 3 3 2 3 1 2 3 3 1 1 1 3 3 1 3 1 1 2 2 1 2 3 1 2 3 3
##  [112] 2 3 3 3 1 1 1 3 2 2 2 1 3 1 3 3 3 3 1 2 1 2 3 2 3 3 3 3 2 2 2 3 2 2 3 3 3
##  [149] 1 3 2 3 2 1 3 1 3 1 3 3 3 3 3 3 1 1 3 3 1 3 3 1 3 3 3 3 2 1 3 3 1 3 3 2 3
##  [186] 2 1 1 2 2 1 2 1 3 3 3 2 3 2 3 1 2 2 1 2 3 1 3 2 3 1 2 3 2 3 2 2 1 3 2 1 3
##  [223] 3 2 3 3 2 3 3 1 1 3 1 2 3 2 1 1 1 3 3 1 3 2 3 2 2 3 3 3 2 3 3 2 3 1 3 1 3
##  [260] 1 3 3 3 3 2 1 1 1 2 3 2 3 2 3 3 1 2 1 2 3 3 1 3 3 2 3 3 1 2 3 2 3 3 3 1 3
##  [297] 1 2 3 3 3 1 2 3 3 2 3 2 3 3 2 2 1 3 3 3 3 3 3 2 3 3 1 1 3 1 1 1 3 2 2 3 1
##  [334] 3 1 3 3 2 1 2 1 2 3 3 1 2 2 1 2 3 3 2 2 1 3 1 2 3 2 3 2 3 3 3 3 2 3 3 3 3
##  [371] 3 3 3 2 1 3 2 1 3 1 3 2 1 3 3 3 3 3 1 3 3 2 3 3 2 3 2 3 1 2 2 2 1 3 1 1 2
##  [408] 3 3 3 1 1 3 1 2 3 1 1 2 2 1 3 3 2 2 3 3 3 3 3 3 3 3 3 1 3 2 2 2 3 2 2 2 1
##  [445] 3 3 1 2 1 3 1 3 1 1 3 2 2 1 3 2 3 3 2 3 2 2 2 3 3 3 3 1 1 2 2 3 3 1 3 1 2
##  [482] 2 2 3 1 2 1 3 3 3 2 3 2 1 1 3 2 3 1 2 1 2 1 3 3 1 1 3 1 3 2 3 3 1 2 1 2 2
##  [519] 1 2 3 3 2 3 1 3 3 3 3 3 1 1 3 1 3 3 3 3 3 2 3 1 3 1 1 3 1 3 1 2 1 2 3 2 3
##  [556] 3 3 2 3 3 3 3 2 3 3 3 2 3 2 3 3 3 3 1 1 2 3 3 1 1 3 2 3 3 3 3 3 3 2 1 2 3
##  [593] 3 3 3 3 1 3 3 3 3 2 2 3 3 3 3 2 3 1 3 1 3 1 1 3 3 2 1 1 3 1 3 1 2 2 1 1 1
##  [630] 2 2 1 2 1 3 1 2 1 3 2 3 2 3 3 3 1 3 2 3 1 3 3 3 3 3 3 3 1 2 1 1 2 3 2 1 3
##  [667] 1 2 1 2 3 1 2 1 1 1 1 2 1 3 3 3 3 3 3 2 1 2 2 2 1 3 1 3 2 2 3 3 2 3 2 3 1
##  [704] 1 3 2 3 2 2 3 1 3 3 1 1 2 1 3 2 2 2 2 1 1 2 3 1 2 3 3 3 1 1 3 1 3 1 1 2 1
##  [741] 1 1 1 2 2 3 3 3 2 1 3 1 3 1 1 3 2 1 1 2 3 3 3 3 1 3 1 1 3 3 3 3 3 3 2 2 2
##  [778] 1 1 3 3 3 3 2 2 1 3 2 3 3 1 1 2 3 2 2 1 3 3 1 2 1 2 3 2 2 3 1 3 2 3 1 1 2
##  [815] 3 1 3 3 2 2 3 3 1 2 1 3 2 3 3 3 3 1 1 1 2 3 3 2 2 1 3 2 1 3 2 3 1 3 2 3 2
##  [852] 2 2 2 3 2 3 2 1 3 3 1 1 2 3 1 3 3 3 3 3 1 1 3 3 2 1 3 2 1 3 1 2 1 1 2 3 1
##  [889] 3 1 2 2 1 1 3 3 3 1 1 2 3 1 1 2 1 3 1 3 1 3 3 1 2 1 1 1 1 1 3 2 3 1 3 1 2
##  [926] 2 2 2 2 1 1 3 2 2 2 3 2 3 3 3 3 3 3 2 2 3 2 1 2 3 3 2 2 1 3 3 2 1 3 3 2 1
##  [963] 1 1 2 3 2 3 3 3 2 1 2 1 1 1 3 1 3 2 1 3 3 1 3 2 2 2 1 2 3 3 2 2 1 3 3 1 3
## [1000] 2 3 3 2 1 3 3 3 3 3 2 3 3 1 3 3 3 2 1 1 1 3 1 3 3 3 3 2 2 3 3 1 3 3 3 1 2
## [1037] 2 1 3 1 3 3 1 3 3 1 1 2 2 2 3 2 3 1 2 3 1 3 1 2 3 2 1 1 3 2 3 1 2 1 3 1 2
## [1074] 3 1 3 1 1 3 1 3 3 2 2 1 3 2 1 2 2 3 3 1 3 3 2 2 1 1 3 1 3 2 3 3 3 2 2 3 2
## [1111] 3 3 2 2 3 3 1 2 3 3 1 1 3 3 1 3 3 2 3 3 3 1 3 3 2 2 3 2 1 3 1 3 3 2 1 1 1
## [1148] 3 2 2 1 3 2 3 3 1 1 3 3 1 2 3 3 3 2 3 1 2 3 2 3 3 3 3 2 3 3 1 2 3 3 3 2 3
## [1185] 2 1 1 3 2 3 3 1 2 1 3 3 3 3 1 1 1 2 3 2 3 1 3 3 3 1 3 3 2 2 3 2 3 3 3 3 3
## [1222] 3 2 3 1 3 3 3 3 1 1 3 3 3 3 3 2 1 2 1 1 2 2 2 1 3 1 3 1 1 3 3 1 2 3 3 1 2
## [1259] 1 3 2 3 2 3 3 1 2 1 1 3 2 2 3 2 3 1 2 3 3 3 3 3 3 3 2 1 3 3 1 3 2 1 2 2 1
## [1296] 2 2 2 1 3 1 2 3 2 3 3 3 1 2 1 2 3 2 3 3 3 1 3 2 1 1 2 3 1 3 3 3 3 2 2 3 3
## [1333] 3 3 2 2 1 1 1 3 3 1 1 3 2 1 3 2 3 2 1 2 1 2 3 3 3 3 3 3 2 2 2 2 3 2 3 3 3
## [1370] 1 3 3 1 2 3 3 3 3 2 3 3 2 2 2 2 3 3 2 3 2 1 2 3 2 2 3 1 3 2 3 3 3 3 3 1 2
## [1407] 3 3 3 3 3 3 2 3 3 1 3 2 3 2 3 3 3 3 3 3 1 1 3 1 2 1 2 3 1 1 3 2 1 3 3 1 2
## [1444] 2 2 3 3 3 1 2 1 3 1 3 3 3 1 1 3 1 3 3 2 1 2 3 3 1 2 1 2 1 2 2 3 1 2 3 1 3
## [1481] 2 1 1 2 3 2 2 2 2 2 1 1 2 1 3 1 1 3 3 2 3 3 3 1 1 3 3 3 2 1 3 1 3 2 2 2 3
## [1518] 3 3 3 1 2 2 3 2 1 3 2 3 3 3 2 3 3 1 1 1 2 3 2 3 3 1 3 2 3 1 3 2 2 1 2 2 1
## [1555] 3 1 3 2 3 3 1 3 2 3 1 1 3 1 3 3 2 3 1 3 1 3 3 3 2 3 3 1 3 2 1 2 3 3 3 1 2
## [1592] 2 2 1 3 1 3 3 1 3 3 3 2 3 3 1 2 3 3 2 2 3 3 1 3 3 3 2 2 2 1 3 3 3 2 2 3 1
## [1629] 3 2 2 1 3 3 1 3 1 3 3 3 1 2 1 2 3 2 3 3 3 2 3 2 1 3 1 1 1 1 3 3 3 2 3 1 3
## [1666] 3 3 3 3 1 2 2 1 2 1 3 3 3 1 3 2 3 1 2 3 3 2 3 3 2 3 1 3 1 1 3 1 3 2 2 3 3
## [1703] 2 3 1 1 1 2 3 3 3 2 1 3 3 3 1 1 1 2 1 2 3 3 2 3 1 2 1 3 1 1 1 2 2 3 2 3 3
## [1740] 3 3 3 1 1 3 2 1 3 2 3 1 3 3 3 3 1 1 3 3 3 3 3 1 3 3 1 2 2 3 2 3 2 1 3 3 1
## [1777] 3 3 3 2 3 2 1 1 1 3 2 3 2 2 3 1 1 3 2 1 1 1 3 2 1 2 3 2 3 3 1 1 3 2 1 1 3
## [1814] 3 3 2 3 3 3 1 3 2 2 3 1 3 1 2 2 3 3 2 3 1 3 1 1 1 2 3 3 1 2 2 3 2 1 1 2 3
## [1851] 3 2 1 3 1 3 2 3 3 2 1 2 2 1 3 3 1 3 2 3 1 1 3 1 1 3 2 1 3 3 1 1 2 3 3 2 3
## [1888] 3 1 2 3 3 3 1 1 1 1 2 2 3 3 3 3 3 1 1 1 1 3 2 1 1 2 3 2 3 3 1 3 2 2 3 3 1
## [1925] 3 3 1 3 1 1 1 3 3 3 2 1 1 2 1 3 2 1 3 1 2 3 3 3 1 2 1 1 1 2 3 2 2 3 2 1 3
## [1962] 3 3 3 3 1 3 2 3 2 3 3 2 1 3 1 1 3 1 1 2 2 2 3 3 3 3 3 3 1 2 3 2 1 3 1 2 1
## [1999] 3 3 3 3 3 3 3 2 2 1 3 3 3 2 3 2 2 3 1 2 2 2 2 3 1 2 2 3 3 2 3 3 1 1 1 3 2
## [2036] 1 3 3 3 1 1 2 3 1 3 2 1 3 2 3 3 2 1 3 2 1 2 3 3 3 2 1 2 1 1 3 3 2 3 2 1 2
## [2073] 3 1 2 1 2 3 2 3 2 3 3 3 2 3 1 2 2 3 3 2 2 3 1 1 3 3 3 3 2 3 1 2 1 2 3 3 1
## [2110] 3 3 1 2 3 3 3 3 3 3 3 3 2 1 3 3 2 3 3 1 3 3 3 3 3 3 2 3 1 3 2 1 3 3 1 1 1
## [2147] 3 2 2 1 2 2 1 1 2 2 3 3 2 3 3 3 1 1 1 1 3 1 3 3 1 1 3 3 2 2 3 3 2 2 1 3 3
## [2184] 1 3 3 3 2 1 3 1 3 3 3 1 3 2 2 1 3 3 2 2 2 2 3 3 2 3 3 1 2 2 1 3
## 
## Within cluster sum of squares by cluster:
## [1] 7705.288 4901.901 3334.311
##  (between_SS / total_SS =  44.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

From the clustering result, the cluster sizes are reasonably proportional to one another (cluster 1: 597; cluster 2: 593; cluster 3: 1025). The points in cluster 1 have a sum of squared distances from their centroid of 7705.288, cluster 2 of 4901.901, and cluster 3 of 3334.311.

The ratio between_SS / total_SS = 44.6% means that only 44.6% of the total sum of squares is accounted for by the separation between clusters. Imagine each cluster as a cloud of points around its centroid: within_SS measures how spread out that cloud is. A good clustering has compact clusters (low within_SS for each cluster, hence low tot.withinss, their sum) and well-separated centroids (high between_SS), so that the ratio approaches 1 (100%).

Since our ratio is only 44.6%, the observations within each cluster still show considerable spread, so the data is not very tightly clustered.
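The ratio discussed above can be pulled straight from the fitted kmeans object:

# proportion of total variance explained by between-cluster separation
cust_km$betweenss / cust_km$totss # ~0.446, as reported above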

Cluster Profiling

# add the cluster assignment as a new column
cust_clust$cluster <- cust_km$cluster

# profile the clusters by summarising the mean of every column
cust_clust %>% 
  group_by(cluster) %>% 
  summarise_all(mean)
## # A tibble: 3 × 14
##   cluster Income Recency Wines Fruits  Meat   Fish Sweet  Gold Deals   Web
##     <int>  <dbl>   <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1       1 75905.    49.3 598.   68.9  449.  101.   71.2   78.9  1.40  5.29
## 2       2 57494.    48.5 466.   21.0  135.   26.9  20.8   59.1  3.83  6.36
## 3       3 34833.    49.2  41.6   4.69  21.9   7.05  4.95  14.9  1.99  2.07
## # ℹ 3 more variables: Catalog <dbl>, Store <dbl>, WebVisits <dbl>
# radar chart to make the profiling easier to read
ggRadar(data=cust_clust, aes(colour=cluster), interactive=TRUE)

Profiling:
- Cluster 1, Beloved Customer:
  - The highest income level.
  - The largest purchase amounts for every type of product.
  - They like to buy through catalogs and come directly to the store.
- Cluster 2, Discount Chaser:
  - The second income level.
  - They love discounts.
  - They like to buy online via our website.
- Cluster 3, The Potential One:
  - The third income level.
  - They only buy in small amounts.
  - They love visiting our website.

Combining Clustering and PCA

PCA can also be combined with the K-means clustering result to help visualize our data in fewer dimensions than the original feature space.

fviz_cluster(object = cust_km, data = cust_clust, labelsize = 0) + theme_minimal()

3D-plot Visualization for Multidimensional Data

cust_3D <- cbind(cust_x, cluster = as.factor(cust_clust$cluster)) # factor so plotly uses discrete colors
plotly::plot_ly(cust_3D, x = ~Dim.1, y = ~Dim.2, z = ~Dim.3, color = ~cluster, colors = c( 
    "red", "green", "blue")) %>% 
  add_markers() %>% 
  layout(scene = list(xaxis = list(title = "Dim.1"),
                      yaxis = list(title = "Dim.2"), 
                      zaxis = list(title = "Dim.3")))

Based on the two graphs above, the unsupervised learning model can separate the data into three distinct clusters without overlap between them.

Conclusion

This customer data set was used for customer segmentation with unsupervised learning, using PCA and clustering methods.
* We picked PC1–PC8 out of a total of 13 PCs, reducing the dimensionality by 38.46% while retaining 88.86% of the information in the data.
* We determined the optimal k using three methods (elbow, silhouette, and gap statistic) and chose k = 3.
* Our model creates 3 clusters with quite good results, namely clusters that do not overlap each other.
* Customer profiling results:
  - Cluster 1, Beloved Customer: the highest income level; the largest purchase amounts for every product type; they buy through catalogs and directly in stores.
  - Cluster 2, Discount Chaser: the second income level; they love discounts; they buy online via our website.
  - Cluster 3, The Potential One: the third income level; they only buy in small amounts; they love visiting our website.