To conduct the personality and segmentation analysis and build a machine learning model to classify the data, this research follows several steps:
After conducting the analysis, two customer categories emerge: shopaholic and general shopper. The general shopper category is further divided into Mid to Low Class and High Class. Shopaholics are characterized as discount hunters who love shopping, with mid income and a graduation-level education, among other traits. If the company wants to acquire more customers, it can also target the General Shopper - High Class category. The decision tree model classifies the positive and negative classes with an accuracy of 0.6897321, a sensitivity of 0.7272727, and an AUC of 73.1215%. A more complex machine learning model could be used to classify both classes with better accuracy.
Customer Personality Analysis is an analysis to understand the characteristics of a customer's personality. It helps the company understand its customers more deeply, making it easier to tailor products and recommendations to the specific characters, profiles, behaviors, needs, and concerns of different types of customers.
The objectives are:
People
Products
Promotion
Place
Several packages are used in this research, as follows:
# data cleansing
library(readr)
library(dplyr)
# dates
library(lubridate)
# data analysis
library(GGally)
# Silhouette, Elbow
library(factoextra)
# K-Medoids
library(cluster)
# PCA
library(FactoMineR)
# K-Modes
library(MASS)
library(klaR)
# data visualization
library(ggplot2)
library(scales)
library(echarts4r)
# Sampling adequacy (MSA)
library(EFAtools)
# Decision Tree
library(partykit)
# Confusion Matrix
library(caret)
# ROC
library(ROCR)
# Decision Tree visualization
library(rpart)
library(rattle)
# ROC
library(pROC)
The csv file is loaded into the persona variable using the read.csv function. The data contains 27 columns and 2240 rows.
# Extracting Data
persona <- read.csv("data_input/marketing_campaign.csv",sep = "\t")
persona <- persona %>% dplyr::select(-c(Z_CostContact,Z_Revenue)) # Remove uninformative columns
persona
This step consists of three parts: inspecting the top 6 observations and the bottom 6 observations of the data set using the head and tail functions, then inspecting the number of unique values of each variable so that the data set's background can be understood at a glance.
1. Top 6 observations of the data set.
# Top 6 data
head(persona)
2. Bottom 6 observations of the data set.
# Bottom 6 data
tail(persona)
3. Unique Inspection.
lengths(lapply(persona, unique))
#> ID Year_Birth Education Marital_Status
#> 2240 59 5 8
#> Income Kidhome Teenhome Dt_Customer
#> 1975 3 3 663
#> Recency MntWines MntFruits MntMeatProducts
#> 100 776 158 558
#> MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
#> 182 177 213 15
#> NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
#> 15 14 14 16
#> AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1
#> 2 2 2 2
#> AcceptedCmp2 Complain Response
#> 2 2 2
Perform a data type inspection to ensure that the data type of each column is appropriate, using the glimpse() function from the dplyr package. Several variables have their data types changed, as follows:
persona <- persona %>%
mutate_at(vars(Education,Marital_Status,Complain,AcceptedCmp1,AcceptedCmp2,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,Response,Year_Birth),as.factor) %>%
mutate(Dt_Customer=dmy(Dt_Customer))
glimpse(persona)
#> Rows: 2,240
#> Columns: 27
#> $ ID <int> 5524, 2174, 4141, 6182, 5324, 7446, 965, 6177, 485…
#> $ Year_Birth <fct> 1957, 1954, 1965, 1984, 1981, 1967, 1971, 1985, 19…
#> $ Education <fct> Graduation, Graduation, Graduation, Graduation, Ph…
#> $ Marital_Status <fct> Single, Single, Together, Together, Married, Toget…
#> $ Income <int> 58138, 46344, 71613, 26646, 58293, 62513, 55635, 3…
#> $ Kidhome <int> 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1,…
#> $ Teenhome <int> 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1,…
#> $ Dt_Customer <date> 2012-09-04, 2014-03-08, 2013-08-21, 2014-02-10, 2…
#> $ Recency <int> 58, 38, 26, 26, 94, 16, 34, 32, 19, 68, 11, 59, 82…
#> $ MntWines <int> 635, 11, 426, 11, 173, 520, 235, 76, 14, 28, 5, 6,…
#> $ MntFruits <int> 88, 1, 49, 4, 43, 42, 65, 10, 0, 0, 5, 16, 61, 2, …
#> $ MntMeatProducts <int> 546, 6, 127, 20, 118, 98, 164, 56, 24, 6, 6, 11, 4…
#> $ MntFishProducts <int> 172, 2, 111, 10, 46, 0, 50, 3, 3, 1, 0, 11, 225, 3…
#> $ MntSweetProducts <int> 88, 1, 21, 3, 27, 42, 49, 1, 3, 1, 2, 1, 112, 5, 1…
#> $ MntGoldProds <int> 88, 6, 42, 5, 15, 14, 27, 23, 2, 13, 1, 16, 30, 14…
#> $ NumDealsPurchases <int> 3, 2, 1, 2, 5, 2, 4, 2, 1, 1, 1, 1, 1, 3, 1, 1, 3,…
#> $ NumWebPurchases <int> 8, 1, 8, 2, 5, 6, 7, 4, 3, 1, 1, 2, 3, 6, 1, 7, 3,…
#> $ NumCatalogPurchases <int> 10, 1, 2, 0, 3, 4, 3, 0, 0, 0, 0, 0, 4, 1, 0, 6, 0…
#> $ NumStorePurchases <int> 4, 2, 10, 4, 6, 10, 7, 4, 2, 0, 2, 3, 8, 5, 3, 12,…
#> $ NumWebVisitsMonth <int> 7, 5, 4, 6, 5, 6, 6, 8, 9, 20, 7, 8, 2, 6, 8, 3, 8…
#> $ AcceptedCmp3 <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
#> $ AcceptedCmp4 <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ AcceptedCmp5 <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
#> $ AcceptedCmp1 <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
#> $ AcceptedCmp2 <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ Complain <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ Response <fct> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,…
In this step, missing values are checked so that they can be treated and the data is ready for analysis.
# Check Missing Value
colSums(is.na(persona))
#> ID Year_Birth Education Marital_Status
#> 0 0 0 0
#> Income Kidhome Teenhome Dt_Customer
#> 24 0 0 0
#> Recency MntWines MntFruits MntMeatProducts
#> 0 0 0 0
#> MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
#> 0 0 0 0
#> NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
#> 0 0 0 0
#> AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1
#> 0 0 0 0
#> AcceptedCmp2 Complain Response
#> 0 0 0
# Inspect rows with missing Income
persona[is.na(persona$Income),]
Insight: There are 24 rows with missing values in the Income column; the other columns have no missing values.
persona <- persona %>%
  mutate(Income = ifelse(is.na(Income), 0, Income)) # impute missing Income with 0
head(persona)
Checking whether the data contains duplicate rows.
persona %>%
duplicated() %>%
sum()
#> [1] 0
Insight: There are no duplicated rows, so no treatment is needed.
The data is split into two subsets: numeric columns (persona1) for dimensionality reduction and factor columns (persona2) for unsupervised learning.
# Selection Data 1
persona1 <- persona %>%
select_if(is.numeric)
# rownames(persona1) <- persona1$ID
persona1 <- persona1 %>%
dplyr::select(-ID)
persona1
# Selection Data 2
persona2 <- persona %>%
select_if(purrr::negate(is.numeric)) %>% # Select all columns except numeric ones
dplyr::select(-Dt_Customer)
persona2
summary(persona1)
#> Income Kidhome Teenhome Recency
#> Min. : 0 Min. :0.0000 Min. :0.0000 Min. : 0.00
#> 1st Qu.: 34722 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:24.00
#> Median : 51075 Median :0.0000 Median :0.0000 Median :49.00
#> Mean : 51688 Mean :0.4442 Mean :0.5062 Mean :49.11
#> 3rd Qu.: 68290 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:74.00
#> Max. :666666 Max. :2.0000 Max. :2.0000 Max. :99.00
#> MntWines MntFruits MntMeatProducts MntFishProducts
#> Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.00
#> 1st Qu.: 23.75 1st Qu.: 1.0 1st Qu.: 16.0 1st Qu.: 3.00
#> Median : 173.50 Median : 8.0 Median : 67.0 Median : 12.00
#> Mean : 303.94 Mean : 26.3 Mean : 166.9 Mean : 37.53
#> 3rd Qu.: 504.25 3rd Qu.: 33.0 3rd Qu.: 232.0 3rd Qu.: 50.00
#> Max. :1493.00 Max. :199.0 Max. :1725.0 Max. :259.00
#> MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases
#> Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.000
#> 1st Qu.: 1.00 1st Qu.: 9.00 1st Qu.: 1.000 1st Qu.: 2.000
#> Median : 8.00 Median : 24.00 Median : 2.000 Median : 4.000
#> Mean : 27.06 Mean : 44.02 Mean : 2.325 Mean : 4.085
#> 3rd Qu.: 33.00 3rd Qu.: 56.00 3rd Qu.: 3.000 3rd Qu.: 6.000
#> Max. :263.00 Max. :362.00 Max. :15.000 Max. :27.000
#> NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
#> Min. : 0.000 Min. : 0.00 Min. : 0.000
#> 1st Qu.: 0.000 1st Qu.: 3.00 1st Qu.: 3.000
#> Median : 2.000 Median : 5.00 Median : 6.000
#> Mean : 2.662 Mean : 5.79 Mean : 5.317
#> 3rd Qu.: 4.000 3rd Qu.: 8.00 3rd Qu.: 7.000
#> Max. :28.000 Max. :13.00 Max. :20.000
Insight:
ggcorr(persona1, palette = "RdGy", nbreaks = 4, label = T, label_round = 2, label_size = 2.8, hjust = 0.95, layout.exp = 0.2, label_color = "white")
ggpairs(persona1[1:15], progress = F, upper = list(continuous = wrap("cor", size = 2))) +
  theme(axis.text = element_text(colour = "black", size = 4),
        strip.text = element_text(size = 5))
Insight:
K-means is a centroid-based clustering algorithm, meaning that each cluster is represented by one centroid. In this part, K-means clustering is applied using only the numeric variables in the persona1 data frame.
The Elbow method determines the optimal number of clusters based on the SSE (Sum of Squared Errors) value. Visually, the optimal number of clusters corresponds to the point that forms the first elbow, i.e. the point after which the decrease from the previous point is no longer significant.
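Before running fviz_nbclust() below, the following minimal sketch shows what the Elbow method computes under the hood: the total within-cluster sum of squares (WSS) for a range of k values on the scaled numeric data (the nstart value here is only an illustrative choice).
# Illustrative only: compute the WSS curve manually with kmeans()
set.seed(113)
wss <- sapply(1:10, function(k) {
  kmeans(scale(persona1), centers = k, nstart = 5)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares (WSS)")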
elbow_scale_num <- scale(persona1)
as.data.frame(elbow_scale_num)
fviz_nbclust(x = elbow_scale_num, FUNcluster = kmeans, method = "wss") + geom_hline(yintercept = 19750) + geom_hline(yintercept = 16200) + geom_hline(yintercept = 17800) + geom_vline(xintercept = 3, linetype = 2) + geom_vline(xintercept = 5, linetype = 2) + geom_vline(xintercept = 7, linetype = 2)
Insight:
Based on the plot above, the candidate values of k whose elbows are closest to 90 degrees are 3 and 7, and k = 3 is closer to 90 degrees than k = 7.
RNGkind(sample.kind = "Rounding")
set.seed(113)
# based on the Elbow method, k = 3
elb <- kmeans(elbow_scale_num, centers = 3)
# Attach the cluster labels to the data frame
df_elb <- persona1 %>%
  mutate(cluster = elb$cluster)
df_elb
# Mean
df_elb %>%
  dplyr::select(1, 4, 16) %>%
  dplyr::group_by(cluster) %>%
  summarise_all(mean)
# Sum
df_elb %>%
  dplyr::select(-c(1, 4)) %>%
  dplyr::group_by(cluster) %>%
  summarise_all(sum)
🔻Cluster Profiling:
Cluster 1:
Cluster 2:
Cluster 3:
fviz_cluster(object = elb,
             data = persona1, labelsize = 5) # Does not involve the group/target column, since that is what the clustering produces.
The clustering goal is to minimize the average distance within clusters and to maximize the average distance between clusters. Therefore, there are some measurements to evaluate the clusters, as follows:
The first way we can look at this is by running a Silhouette Analysis. This analysis measures how well an observation is clustered and it estimates the average distance between clusters.
sil0 <- silhouette(elb$cluster, dist(elbow_scale_num))
head(sil0[, 1:3], 10) # looking at the first 10 observations
#> cluster neighbor sil_width
#> [1,] 3 2 0.17234788
#> [2,] 1 2 0.46941136
#> [3,] 3 2 -0.05048963
#> [4,] 1 2 0.51994893
#> [5,] 2 1 -0.10534239
#> [6,] 2 1 0.22107363
#> [7,] 2 1 0.20944478
#> [8,] 1 2 0.46563436
#> [9,] 1 2 0.48843113
#> [10,] 1 2 0.22527343
The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters.
fviz_silhouette(sil0)
#> cluster size ave.sil.width
#> 1 1 1046 0.40
#> 2 2 613 0.11
#> 3 3 581 0.14
Insight:
Using k = 3, based on the company's needs and business perspective.
RNGkind(sample.kind = "Rounding")
set.seed(193)
cl_p1 <- kmodes(persona2, 3, iter.max = 14, weighted = FALSE )
cluster <- cl_p1$cluster
output_p1 <- cbind(persona2,cluster)
output_p1
output_p1 %>%
dplyr::group_by(cluster,AcceptedCmp1,AcceptedCmp2,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,Response,Marital_Status,Education) %>%
summarise(n=n()) %>%
arrange(cluster,desc(n)) %>%
filter(cluster == 1)
🔻Cluster Profiling:
Cluster 1:
c2 <- output_p1 %>%
dplyr::group_by(cluster,AcceptedCmp1,AcceptedCmp2,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,Response,Marital_Status,Education) %>%
summarise(n=n()) %>%
arrange(cluster,desc(n)) %>%
filter(cluster==2)
c2
🔻Cluster Profiling:
Cluster 2:
c3 <- output_p1 %>%
dplyr::group_by(cluster,AcceptedCmp1,AcceptedCmp2,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,Response,Marital_Status,Education) %>%
summarise(n=n()) %>%
arrange(cluster,desc(n)) %>%
filter(cluster==3)
c3
🔻Cluster Profiling:
Cluster 3:
K-Medoids is a partitioning clustering algorithm in which the center of each cluster is called the medoid. In general, the optimum k for K-Medoids is determined using the silhouette technique, which seeks the lowest dissimilarity within each cluster.
In this part, unsupervised learning (K-Medoids) is used to provide the target labels that will later be used to build a supervised model. This K-Medoids clustering is conducted on both the numeric variables in the persona1 data frame and the factor variables in the persona2 data frame.
The Silhouette method measures the Silhouette coefficient by calculating the average distance of each data to all data in the same cluster, then calculating the average distance of each data to all data in other clusters. Then, the data is regrouped based on the minimum distance.
# Scale
# Calling df persona 1 due to numeric data type
scale_num <- as.data.frame(scale(persona1))
persona_scale <- cbind(scale_num,persona2)
persona_scale
fviz_nbclust(x = scale_num, FUNcluster = pam, method = "silhouette", k.max = 15) + geom_hline(yintercept = 0.3225) +
labs(subtitle = "Silhouette method")
***
The Elbow method measures the optimal number of clusters based on the SSE (Sum of Square Error) value. Visually, the optimal number of clusters can be seen from the points that tend to form the first elbow or the points that have a high/significant number of changes from the previous point.
elbow_scale_medoid <- scale(persona1)
as.data.frame(elbow_scale_medoid)
fviz_nbclust(x = elbow_scale_medoid, FUNcluster = pam, method = "wss") + geom_hline(yintercept = 23150) + geom_vline(xintercept = 2, linetype = 2)
Based on the Silhouette and Elbow methods, the optimum k is two. The goal is to choose a small value of k that still has a low WSS, with an elbow that forms roughly a 90-degree angle.
K-Medoids clustering (PAM) with k = 2 is used to cluster the company's customers. The clustering result then provides the label/target column.
RNGkind(sample.kind = "Rounding")
set.seed(1004)
#perform k-medoids clustering with k = 2 clusters
kmed <- pam(persona_scale, k = 2)
#Input Labelling to df
df <- cbind(persona1,persona2)
df$label <- kmed$cluster
df
# Numeric - mean
df %>%
dplyr::select(1:15,26) %>%
dplyr::group_by(label) %>%
dplyr::select(1,4) %>%
summarise_all(mean)
# Numeric - sum
df %>%
dplyr::select(1:15,26) %>%
dplyr::group_by(label) %>%
dplyr::select(2,3,5:15) %>%
summarise_all(sum)
# Categorical - Label 1
f1 <- df %>%
dplyr::select(-(1:15)) %>%
dplyr::group_by(label,AcceptedCmp1,AcceptedCmp2,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,Response,Marital_Status,Education) %>% summarise(n=n()) %>%
filter(label==1) %>%
arrange(label,desc(n))
f1
# Categorical - Label 2
f2 <- df %>%
dplyr::select(-(1:15)) %>%
dplyr::group_by(label,AcceptedCmp1,AcceptedCmp2,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,Response,Marital_Status,Education) %>% summarise(n=n()) %>%
filter(label==2) %>%
arrange(label,desc(n))
f2
🔻Cluster Profiling:
Cluster 1:
Cluster 2:
fviz_cluster(object = kmed,
             data = persona1, labelsize = 5) # Does not involve the group/target column, since that is what the clustering produces.
The clustering goal is to minimize the average distance within clusters and to maximize the average distance between clusters. Therefore, there are some measurements to evaluate the clusters, as follows:
The first way we can look at this is by running a Silhouette Analysis. This analysis measures how well an observation is clustered and it estimates the average distance between clusters.
sil <- silhouette(kmed$cluster, dist(persona_scale))
head(sil[, 1:3], 10) # looking at the first 10 observations
#> cluster neighbor sil_width
#> [1,] 1 2 0.596189205
#> [2,] 1 2 0.664027875
#> [3,] 1 2 0.233695083
#> [4,] 2 1 0.646646375
#> [5,] 2 1 0.655211120
#> [6,] 2 1 -0.006236619
#> [7,] 2 1 0.409488525
#> [8,] 2 1 0.638280574
#> [9,] 2 1 0.566364248
#> [10,] 1 2 0.569280417
The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters.
fviz_silhouette(sil)
#> cluster size ave.sil.width
#> 1 1 910 0.52
#> 2 2 1330 0.50
Insight:
Clusters with a large silhouette width are very well clustered; a width near 0 means the observation lies between two clusters, and a negative silhouette width means the observation is likely in the wrong cluster.
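As a small follow-up, the sketch below (using the sil object computed above) quantifies this interpretation: the distribution of silhouette widths, the number of observations with a negative width (the ones most likely placed in the wrong cluster), and the overall average width.
# Summarise the silhouette widths from the `sil` object above
sil_width <- sil[, "sil_width"]
summary(sil_width)   # distribution of silhouette widths
sum(sil_width < 0)   # observations likely assigned to the wrong cluster
mean(sil_width)      # overall average silhouette width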
A supervised machine learning model is then used to classify the customers as either cluster 1 (general shopper) or cluster 2 (shopaholic).
The purpose of dimensionality reduction with PCA is to reduce the number of variables (dimensions/features) in the data while retaining as much information as possible, thereby overcoming the difficulties encountered when working with high-dimensional data.
Checking sampling adequacy (MSA) is important before conducting the PCA analysis; this is done via the Kaiser-Meyer-Olkin (KMO) test. The result below shows that the overall MSA is 0.888, which is more than sufficient to proceed (generally anything above 0.6 is considered adequate).
KMO(cor(persona1))
#>
#> ── Kaiser-Meyer-Olkin criterion (KMO) ──────────────────────────────────────────
#>
#> ✔ The overall KMO value for your data is meritorious.
#> These data are probably suitable for factor analysis.
#>
#> Overall: 0.888
#>
#> For each variable:
#> Income Kidhome Teenhome Recency
#> 0.931 0.900 0.574 0.302
#> MntWines MntFruits MntMeatProducts MntFishProducts
#> 0.882 0.938 0.911 0.937
#> MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases
#> 0.939 0.935 0.502 0.858
#> NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
#> 0.903 0.899 0.842
The FactoMineR package is used to conduct the PCA analysis, with the PCA() function for the analysis and plot.PCA() for visualization.
persona1_pca <- PCA(X = persona1, # numeric data without the target/label
                    scale.unit = T, # scale the variables
                    graph = F, # do not display the visualization
                    ncp = 15) # the number of PCs retained (all 15 numeric columns)
# Show the PCA results
results1 <- persona1_pca$ind$coord %>% as.data.frame()
results1
Insight:
persona1_pca$eig
#> eigenvalue percentage of variance cumulative percentage of variance
#> comp 1 6.0226357 40.150905 40.15090
#> comp 2 1.8344275 12.229517 52.38042
#> comp 3 1.0252150 6.834767 59.21519
#> comp 4 1.0049524 6.699683 65.91487
#> comp 5 0.7842345 5.228230 71.14310
#> comp 6 0.7548658 5.032438 76.17554
#> comp 7 0.6272726 4.181817 80.35736
#> comp 8 0.5031005 3.354004 83.71136
#> comp 9 0.4666598 3.111065 86.82243
#> comp 10 0.4323828 2.882552 89.70498
#> comp 11 0.3943847 2.629232 92.33421
#> comp 12 0.3568601 2.379067 94.71328
#> comp 13 0.3210059 2.140039 96.85332
#> comp 14 0.2533239 1.688826 98.54214
#> comp 15 0.2186788 1.457859 100.00000
Insight:
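As a hedged illustration of how the number of retained components could be chosen from the eigenvalue table above, the sketch below keeps the smallest number of PCs whose cumulative variance reaches roughly 80%, which corresponds to the 7 dimensions used in the next step (the column name follows the printed output of persona1_pca$eig).
# Smallest number of PCs covering at least ~80% of the variance
cum_var <- persona1_pca$eig[, "cumulative percentage of variance"]
n_pc <- which(cum_var >= 80)[1]
n_pc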
The data frame of PCA results is combined with the non-numeric data frame and with the labels produced by the unsupervised clustering model.
df_clean <- cbind(results1[,1:7],persona2)
df_clean$label <- kmed$cluster
df_clean <- df_clean %>% mutate(label=as.factor(label))
df_clean <- df_clean %>% dplyr::select(-Year_Birth) # dropped because it is only used for visualization purposes
rownames(df_clean) <- NULL
df_clean
In the cross-validation step, the data is divided into train data (80% of the original data) and test data (20% of the original data, treated as unseen data).
# Set seed to lock the random
# RNGkind(sample.kind = "Rounding")
set.seed(1239)
# index sampling
index <- sample(x = nrow(df_clean), size = nrow(df_clean)*0.8)
# splitting
df_clean_train <- df_clean[index,]
df_clean_test <- df_clean[-index,]
prop.table(table(df_clean_train$label))
#>
#> 1 2
#> 0.4051339 0.5948661
After checking the target variable for class imbalance, none is found (as a common rule of thumb, a 60%/40% split is still considered balanced). It can therefore be concluded that treatments such as up-sampling, down-sampling, SMOTE, or ROSE are not needed. If such a treatment were needed, it should be applied after cross-validation and only to the train data, because the test data is unseen data.
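Although no treatment is required here, a minimal sketch of how up-sampling could be applied to the train data only is shown below, using caret's upSample(); the result is illustrative and is not used further in this analysis.
# Illustrative only: balance the train data with caret::upSample()
up_train <- upSample(x = df_clean_train %>% dplyr::select(-label),
                     y = df_clean_train$label,
                     yname = "label")
prop.table(table(up_train$label)) # classes are now balanced 50/50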
The Decision Tree is a tree-based model that is quite simple yet delivers robust, powerful predictive performance. It produces a visualization in the form of a decision tree that can be interpreted easily. The ctree function from the partykit package is used to implement the decision tree model.
library(party)
persona_tree <-
ctree(
formula = label ~ Dim.1 + Dim.2 + Dim.3 + Dim.4 + Dim.5 + Dim.6 + Dim.7 +
Education + Marital_Status + AcceptedCmp1 + AcceptedCmp2 + AcceptedCmp3 +
AcceptedCmp4 + AcceptedCmp5 + Complain + Response ,
data = df_clean_train)
plot(persona_tree, type = "simple")
The tree diagram can be simplified by using the rpart function from the rpart library.
tree_model <- rpart(label ~ Dim.1 + Dim.2 + Dim.3 + Dim.4 + Dim.5 + Dim.6 + Dim.7 +
Education + Marital_Status + AcceptedCmp1 + AcceptedCmp2 + AcceptedCmp3 +
AcceptedCmp4 + AcceptedCmp5 + Complain + Response,
data = df_clean_train
)
fancyRpartPlot(tree_model, caption = NULL)
# prediction on data test
persona_pred_test <- predict(object = persona_tree, # model decision tree
newdata = df_clean_test, # predict data test
type = "response") # confusion matrix
untuned_test <- confusionMatrix(data = persona_pred_test, # prediction result
reference = df_clean_test$label, # original label
positive = "2") # number 2 -> Shopaholic is this research observation
untuned_test
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 1 2
#> 1 118 74
#> 2 66 190
#>
#> Accuracy : 0.6875
#> 95% CI : (0.6423, 0.7302)
#> No Information Rate : 0.5893
#> P-Value [Acc > NIR] : 1.117e-05
#>
#> Kappa : 0.3586
#>
#> Mcnemar's Test P-Value : 0.5541
#>
#> Sensitivity : 0.7197
#> Specificity : 0.6413
#> Pos Pred Value : 0.7422
#> Neg Pred Value : 0.6146
#> Prevalence : 0.5893
#> Detection Rate : 0.4241
#> Detection Prevalence : 0.5714
#> Balanced Accuracy : 0.6805
#>
#> 'Positive' Class : 2
#>
# prediction on train data
persona_pred_train <- predict(object = persona_tree, # decision tree model
newdata = df_clean_train, # predict on train data
type = "response") # confusion matrix
untuned_train <- confusionMatrix(data = persona_pred_train, # prediction result
reference = df_clean_train$label, # original label
positive = "2") # number 2 -> Shopaholic is this research observation
untuned_train
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 1 2
#> 1 447 307
#> 2 279 759
#>
#> Accuracy : 0.673
#> 95% CI : (0.6507, 0.6947)
#> No Information Rate : 0.5949
#> P-Value [Acc > NIR] : 5.365e-12
#>
#> Kappa : 0.3257
#>
#> Mcnemar's Test P-Value : 0.2647
#>
#> Sensitivity : 0.7120
#> Specificity : 0.6157
#> Pos Pred Value : 0.7312
#> Neg Pred Value : 0.5928
#> Prevalence : 0.5949
#> Detection Rate : 0.4235
#> Detection Prevalence : 0.5792
#> Balanced Accuracy : 0.6639
#>
#> 'Positive' Class : 2
#>
In the post-pruning step, the ctree_control parameter is added to the ctree function to improve model accuracy by reducing overfitting. This is because a decision tree can split the data down to a very detailed level, making the model merely memorize the train data patterns through complex rules, when it should also be able to generalize to the test data patterns.
tuning_tree <- ctree(formula = label ~. , data = df_clean_train,
controls = ctree_control(mincriterion = 0.95,
minsplit = 100,
minbucket = 5)
)
plot(tuning_tree, type = "simple")
# prediction on data test
persona_pred_test1 <- predict(object = tuning_tree, # model decision tree
newdata = df_clean_test) # confusion matrix
tuned_test <- confusionMatrix(data = persona_pred_test1, # prediction result
reference = df_clean_test$label, # original label
positive = "2") # number 2 -> Shopaholic is this research observation
Insight:
In this step, the train data is predicted using the tuned decision tree model, with the result type set to response (class). The model is then evaluated with a confusion matrix, focusing on the accuracy score.
# prediction on train data
persona_pred_train1 <- predict(object = tuning_tree, # decision tree model
newdata = df_clean_train, # predict on train data
type = "response") # confusion matrix
tuned_train <- confusionMatrix(data = persona_pred_train1, # prediction result
reference = df_clean_train$label, # original label
positive = "2") # number 2 -> Shopaholic is this research observation
tuned_train
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 1 2
#> 1 417 295
#> 2 309 771
#>
#> Accuracy : 0.6629
#> 95% CI : (0.6405, 0.6848)
#> No Information Rate : 0.5949
#> P-Value [Acc > NIR] : 1.733e-09
#>
#> Kappa : 0.2986
#>
#> Mcnemar's Test P-Value : 0.5968
#>
#> Sensitivity : 0.7233
#> Specificity : 0.5744
#> Pos Pred Value : 0.7139
#> Neg Pred Value : 0.5857
#> Prevalence : 0.5949
#> Detection Rate : 0.4302
#> Detection Prevalence : 0.6027
#> Balanced Accuracy : 0.6488
#>
#> 'Positive' Class : 2
#>
Insight:
# Comparison After Tuning & Before Tuning
performance_comparison <- cbind.data.frame(Accuracy=c(untuned_test$overall[1],tuned_test$overall[1]),
Sensitivity = c(untuned_test$byClass[[1]], tuned_test$byClass[[1]])
)
rownames(performance_comparison) <- c("Untuned DT Model", "Tuned DT Model")
performance_comparison
Insight:
# Get Prediction Result with type = probability
persona_pred_test1_prob <- predict(object = tuning_tree, # model decision tree
newdata = df_clean_test, type="prob")
# Extract / Convert the list into matrix with 2 columns
pred_test_improvement <- matrix(unlist(persona_pred_test1_prob), ncol = 2, byrow = TRUE)
## AUC
pred_prob_improvement <- pred_test_improvement[,2]
model_auc_eva <- prediction(pred_prob_improvement,
df_clean_test$label)
model_auc1 <- performance(model_auc_eva,
measure = "auc")
model_auc1@y.values
#> [[1]]
#> [1] 0.731215
Insight:
With AUC = 0.731215, it can be concluded that the model is quite good at separating the positive (shopaholic) and negative (general shopper) classes.
The AUC value of the Decision Tree model is greater than 0.5 and reasonably close to 1, meaning the model is able to distinguish the positive and negative classes fairly well.
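To visualize this, the ROC curve behind the AUC can be plotted from the same ROCR prediction object (model_auc_eva); this is a small illustrative addition.
# Plot the ROC curve from the existing ROCR prediction object
roc_perf <- performance(model_auc_eva, measure = "tpr", x.measure = "fpr")
plot(roc_perf, main = "ROC Curve - Tuned Decision Tree")
abline(a = 0, b = 1, lty = 2) # reference line: random classifier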
pred_result <- df_clean_test
pred_result$pred <- persona_pred_test1
pred_result <- pred_result %>% dplyr::select(label,pred) %>% rename(actual=label)
row.names(pred_result) <- NULL
pred_result
# confusion matrix
tuned_test <- confusionMatrix(data = persona_pred_test1, # prediction result
reference = df_clean_test$label, # original label
positive = "2") # number 2 -> Shopaholic is this research observation
# Plot
Actual <- factor(c("No", "No", "Yes", "Yes"))
Predicted <- factor(c("No", "Yes", "No", "Yes"))
y <- c(tuned_test$table[1], tuned_test$table[2], tuned_test$table[3], tuned_test$table[4])
df1_log <- data.frame(Actual, Predicted, y)
# Plot 1
plot_log1 <- ggplot(data = df1_log, mapping = aes(x = Actual, y = Predicted, col = y)) +
geom_tile(aes(fill = y)) +
geom_text(aes(label = sprintf("%1.0f", y)), color = "Black", size = 5, family = "Times New Roman") +
scale_fill_gradient(low = "#FFA07C", high = "#FC945A") +
scale_color_gradient(low = "#FFA07C", high = "#FC945A") +
theme(legend.position = "none",
plot.background = element_rect(fill = "#B78980", color = "#B78980"),
panel.background = element_rect(fill = "#B78980"),
panel.grid = element_line(colour = "#B78980"),
panel.grid.major.x = element_line(colour = "#B78980"),
panel.grid.minor.x = element_line(colour = "#B78980"),
axis.text.x = element_text(color = "#54433A", family = "Times New Roman", size = 14),
axis.text.y = element_text(color = "#54433A", family = "Times New Roman", size = 14),
axis.title.x = element_text(color = "#402E32", family = "Times New Roman", size = 14),
axis.title.y = element_text(color = "#402E32", family = "Times New Roman", size = 14),
axis.ticks = element_blank())
plot_log1
There are two customer categories, namely shopaholic and general shopper. The general shopper category is further divided into Mid to Low Class and High Class.
The personality traits of shopaholics are: discount hunters who love shopping both in stores and on the website, like visiting the company's website, and, besides discounts, mostly buy products during the 4th campaign. The shopaholic backgrounds are mid income, graduation-level education, and married or single marital status.
If the company wants to acquire more customers, it can target the General Shopper - High Class category. The personality traits of this category are: not easily tempted by marketing campaigns, loves wines, meat products, and fish products, and likes to shop in stores, through catalogs, and on the website. The backgrounds of this category are high income, a master's degree, and "together" marital status.
Using the tuned Decision Tree model, the positive and negative classes can be classified with an accuracy of 0.6897321, a sensitivity of 0.7272727, and an AUC of 73.1215%.
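As suggested in the summary, a more complex model could be explored to improve accuracy. Below is a hedged sketch of one such option: a random forest trained with caret's train() and 5-fold cross-validation (it assumes the randomForest package is installed; the settings are illustrative, not tuned).
# Illustrative sketch: random forest as a more complex alternative model
set.seed(1239)
rf_model <- train(label ~ .,
                  data = df_clean_train,
                  method = "rf", # requires the randomForest package
                  trControl = trainControl(method = "cv", number = 5))
rf_pred <- predict(rf_model, newdata = df_clean_test)
confusionMatrix(data = rf_pred, reference = df_clean_test$label, positive = "2")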