1 Problem Statement

Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business to better understand its customers and makes it easier for them to modify products according to the specific needs, behaviors and concerns of different types of customers.

Customer personality analysis helps a business to modify its product based on its target customers from different types of customer segments. Customer segmentation is the practice of dividing a company’s customers into groups that reflect similarity among customers in each group. The goal of segmenting customers is to decide how to relate to customers in each segment in order to maximize the value of each customer to the business. (Source: Optimove)

To find the customer types, we will group the data into several clusters using K-Means clustering.

Load Library

library(tidyverse)
library(dplyr)
library(caret)
library(class)
library(scales)
library(lubridate)

2 Load the Data

cust <- read.csv("marketing_campaign.csv", sep = '\t')
head(cust,5)

3 EDA and Preprocessing

This is a rich dataset from which we can extract a lot of information.

str(cust)

## 'data.frame':    2240 obs. of  29 variables:
##  $ ID                 : int  5524 2174 4141 6182 5324 7446 965 6177 4855 5899 ...
##  $ Year_Birth         : int  1957 1954 1965 1984 1981 1967 1971 1985 1974 1950 ...
##  $ Education          : chr  "Graduation" "Graduation" "Graduation" "Graduation" ...
##  $ Marital_Status     : chr  "Single" "Single" "Together" "Together" ...
##  $ Income             : int  58138 46344 71613 26646 58293 62513 55635 33454 30351 5648 ...
##  $ Kidhome            : int  0 1 0 1 1 0 0 1 1 1 ...
##  $ Teenhome           : int  0 1 0 0 0 1 1 0 0 1 ...
##  $ Dt_Customer        : chr  "04-09-2012" "08-03-2014" "21-08-2013" "10-02-2014" ...
##  $ Recency            : int  58 38 26 26 94 16 34 32 19 68 ...
##  $ MntWines           : int  635 11 426 11 173 520 235 76 14 28 ...
##  $ MntFruits          : int  88 1 49 4 43 42 65 10 0 0 ...
##  $ MntMeatProducts    : int  546 6 127 20 118 98 164 56 24 6 ...
##  $ MntFishProducts    : int  172 2 111 10 46 0 50 3 3 1 ...
##  $ MntSweetProducts   : int  88 1 21 3 27 42 49 1 3 1 ...
##  $ MntGoldProds       : int  88 6 42 5 15 14 27 23 2 13 ...
##  $ NumDealsPurchases  : int  3 2 1 2 5 2 4 2 1 1 ...
##  $ NumWebPurchases    : int  8 1 8 2 5 6 7 4 3 1 ...
##  $ NumCatalogPurchases: int  10 1 2 0 3 4 3 0 0 0 ...
##  $ NumStorePurchases  : int  4 2 10 4 6 10 7 4 2 0 ...
##  $ NumWebVisitsMonth  : int  7 5 4 6 5 6 6 8 9 20 ...
##  $ AcceptedCmp3       : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ AcceptedCmp4       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AcceptedCmp5       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AcceptedCmp1       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AcceptedCmp2       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Complain           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Z_CostContact      : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ Z_Revenue          : int  11 11 11 11 11 11 11 11 11 11 ...
##  $ Response           : int  1 0 0 0 0 0 0 0 1 0 ...

3.1 Understanding Family Details

First, we need to understand the family details: we count the children in each household and collapse marital status into two groups.

cust_clean <- 
cust %>% 
  mutate(Kids = Kidhome + Teenhome,
         Marital_Status = replace(Marital_Status, Marital_Status == "Single", "Not in Relationship"),
         Marital_Status = replace(Marital_Status, Marital_Status == "Divorced", "Not in Relationship"),
         Marital_Status = replace(Marital_Status, Marital_Status == "Widow", "Not in Relationship"),
         Marital_Status = replace(Marital_Status, Marital_Status == "Alone", "Not in Relationship"),
         Marital_Status = replace(Marital_Status, Marital_Status == "Absurd", "Not in Relationship"),
         Marital_Status = replace(Marital_Status, Marital_Status == "YOLO", "Not in Relationship"),
         Marital_Status = replace(Marital_Status, Marital_Status == "Together", "In Relationship"),
         Marital_Status = replace(Marital_Status, Marital_Status == "Married", "In Relationship")
         )
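The chain of replace() calls above can also be written more compactly. The following is a minimal sketch on a hypothetical toy data frame (`toy`, `toy_clean` and `in_relationship` are illustrative names, not part of the original analysis). It assumes that every status other than "Together" and "Married" should become "Not in Relationship", which matches the original code only for the statuses it lists.

```r
library(dplyr)

# Hypothetical toy data standing in for the relevant columns of `cust`
toy <- data.frame(
  Marital_Status = c("Single", "Married", "YOLO", "Together", "Widow"),
  Kidhome        = c(0, 1, 0, 2, 0),
  Teenhome       = c(1, 0, 0, 1, 0)
)

# Statuses treated as "In Relationship"; everything else falls into the other group
in_relationship <- c("Together", "Married")

toy_clean <- toy %>%
  mutate(
    Kids = Kidhome + Teenhome,
    Marital_Status = if_else(Marital_Status %in% in_relationship,
                             "In Relationship", "Not in Relationship")
  )

print(toy_clean)
```

Unlike the explicit replace() chain, this version would silently map an unexpected new status to "Not in Relationship", so it is worth checking unique(cust$Marital_Status) first.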

3.2 Purchase Information

We can also work out when each customer joined and how long they have been a shopper.

cust_clean <- 
cust_clean %>% 
  mutate(Education = as.factor(Education),
         Marital_Status = as.factor(Marital_Status),
         Dt_Customer = dmy(Dt_Customer))
max(cust_clean$Dt_Customer)

## [1] "2014-06-29"

Because the latest Dt_Customer value is 29 June 2014, we will assume that the customer information was collected on 2014-06-30 (30 June 2014).

cust_clean <- 
cust_clean %>% 
  mutate(Collected = '30-06-2014',
         Collected = dmy(Collected),
         Day_Is_Client = Collected - Dt_Customer )
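To make the date arithmetic concrete, here is a small sketch using two dates from the output above: the first Dt_Customer value shown by str() and the latest date in the data. The variable names (`enrolled`, `collected`, `tenure`) are illustrative only.

```r
library(lubridate)

# Two Dt_Customer values in the dataset's day-month-year format
enrolled  <- dmy(c("04-09-2012", "29-06-2014"))
collected <- dmy("30-06-2014")   # assumed collection date, as stated in the text

# Subtracting two Date objects gives a difftime measured in days
tenure      <- collected - enrolled
tenure_days <- as.integer(tenure)

print(tenure)
print(tenure_days)   # 664 and 1
```

The result is a difftime rather than a plain number, which is why the clustering step later converts Day_Is_Client with as.integer().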

Then infer some more information about purchases.

cust_clean <-  
cust_clean %>% 
  
        # Total spending across all product categories (currency not stated; possibly dollars)
  mutate(MntTotal = MntWines + MntFruits + MntMeatProducts + MntFishProducts + MntSweetProducts + MntGoldProds,
         
         # Total number of purchases across all channels (web, catalog, store)
         NumAllPurchases = NumWebPurchases + NumCatalogPurchases + NumStorePurchases,
         
         # Average spending per purchase
         AverageCheck = round(MntTotal / NumAllPurchases, 1), 
         
         # Percentage of purchases that were made with a deal or discount
         ShareDealsPurchases = round((NumDealsPurchases / NumAllPurchases) * 100, 1), 
         
         # Total number of accepted offers across the campaigns
         TotalAcceptedCmp = AcceptedCmp1 + AcceptedCmp2 + AcceptedCmp3 + AcceptedCmp4 + AcceptedCmp5 + Response
         )
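As a worked example, the derived features can be computed by hand for the first customer, using the values shown in the str() output above. The variable names here are illustrative only.

```r
# Values for the first customer (ID 5524), copied from the str() output
mnt   <- c(Wines = 635, Fruits = 88, Meat = 546, Fish = 172, Sweets = 88, Gold = 88)
buys  <- c(Web = 8, Catalog = 10, Store = 4)
deals <- 3

mnt_total   <- sum(mnt)                              # 1617
n_purchases <- sum(buys)                             # 22
avg_check   <- round(mnt_total / n_purchases, 1)     # 73.5
share_deals <- round(deals / n_purchases * 100, 1)   # 13.6

print(c(MntTotal = mnt_total, NumAllPurchases = n_purchases,
        AverageCheck = avg_check, ShareDealsPurchases = share_deals))
```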

3.3 Anomaly Detection

There are some anomalous records in the data: 6 clients spent money but did not make a single purchase. We will remove them.

# Removing the anomalous records (customers with zero purchases)
cust_clean <-  
cust_clean %>% 
  filter(NumAllPurchases != 0)
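A short sketch of why these rows are a problem: dividing spending by zero purchases yields Inf in R, so AverageCheck and ShareDealsPurchases become unusable for those customers. The numbers below are hypothetical.

```r
# Hypothetical customers: the second has spending but zero recorded purchases
mnt_total   <- c(120, 45)
n_purchases <- c(4, 0)

avg_check <- mnt_total / n_purchases
print(avg_check)              # second value is Inf
print(is.finite(avg_check))   # FALSE marks the row that would distort later steps
```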

We also have missing values (NA) in Income, which we will fill with the mean of that variable.

# Replacing NA income with the mean of entire column
cust_clean$Income[is.na(cust_clean$Income)]<-
  mean(cust_clean$Income,na.rm=TRUE)

# Re-check for remaining NA values in every column
colSums(is.na(cust_clean))

##                  ID          Year_Birth           Education      Marital_Status 
##                   0                   0                   0                   0 
##              Income             Kidhome            Teenhome         Dt_Customer 
##                   0                   0                   0                   0 
##             Recency            MntWines           MntFruits     MntMeatProducts 
##                   0                   0                   0                   0 
##     MntFishProducts    MntSweetProducts        MntGoldProds   NumDealsPurchases 
##                   0                   0                   0                   0 
##     NumWebPurchases NumCatalogPurchases   NumStorePurchases   NumWebVisitsMonth 
##                   0                   0                   0                   0 
##        AcceptedCmp3        AcceptedCmp4        AcceptedCmp5        AcceptedCmp1 
##                   0                   0                   0                   0 
##        AcceptedCmp2            Complain       Z_CostContact           Z_Revenue 
##                   0                   0                   0                   0 
##            Response                Kids           Collected       Day_Is_Client 
##                   0                   0                   0                   0 
##            MntTotal     NumAllPurchases        AverageCheck ShareDealsPurchases 
##                   0                   0                   0                   0 
##    TotalAcceptedCmp 
##                   0
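The imputation step can be illustrated on a small vector. This is only a sketch with a hypothetical `income` vector; it also shows a side effect worth knowing, namely that an integer column becomes double once a non-integer mean is written into it.

```r
# Hypothetical integer income column with one missing value
income <- c(58138L, NA, 71613L, 26646L)

# Replace the NA with the mean of the observed values
income[is.na(income)] <- mean(income, na.rm = TRUE)

print(income)
print(typeof(income))   # "double": the column is no longer integer
```

Mean imputation keeps the column average unchanged but shrinks its variance slightly, which is usually acceptable when only a few values are missing.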

Remove all unnecessary columns.

cust_clean2 <- 
cust_clean %>% 
  select(-c(ID, Year_Birth, Kidhome, Teenhome, Dt_Customer, Z_CostContact, Z_Revenue, Collected))

3.4 Checking and Converting Outliers

Outlier (anomaly) detection is a technique for finding observations that differ extremely from the rest. It matters here because the clustering algorithm is K-Means: the algorithm forces every outlier into one of the clusters, and when that happens the characteristics of that cluster can change significantly.

Now we detect the outliers.
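The detection step itself is not shown in the text. As one possible approach, and only as an assumption about a reasonable method rather than the one actually used here, the sketch below flags outliers with the boxplot rule (values more than 1.5 times the interquartile range beyond the hinges) via base R's boxplot.stats(). The data are hypothetical.

```r
# Hypothetical average-check values with one extreme observation
avg_check <- c(10, 12, 11, 13, 12, 500)

# boxplot.stats() returns the values lying beyond 1.5 * IQR from the hinges
outliers <- boxplot.stats(avg_check)$out
print(outliers)
```

On the real data, the same call could be applied to columns such as Income and AverageCheck to decide where sensible cut-offs lie.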

Then we cap them at fixed upper limits.

cust_clean2 <- 
cust_clean2 %>% 
mutate(Income= ifelse(Income > 120000, 120000, Income),
       AverageCheck = ifelse(AverageCheck > 200, 200, AverageCheck))
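Capping at an upper limit can also be expressed with pmin(), which takes the element-wise minimum. The sketch below uses hypothetical incomes and checks that it gives the same result as the ifelse() form used above.

```r
# Hypothetical incomes, one of them far above the 120000 cap
income <- c(58138, 46344, 666666, 120000)

capped_ifelse <- ifelse(income > 120000, 120000, income)
capped_pmin   <- pmin(income, 120000)

print(capped_pmin)
print(identical(capped_ifelse, capped_pmin))
```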

Finally, we calculate the difference between how long a person has been a client and the number of days since their last purchase.

cust_clean2 <-  
cust_clean2 %>% 
  mutate(ActiveDays = Day_Is_Client - Recency)

Result:

head(cust_clean2, 5)

4 Clustering

Clustering will be based on the average check, the total number of purchases, and the length of time the person has been a client.

cust_cluster <- 
  cust_clean2 %>% 
  select(AverageCheck, Day_Is_Client, NumAllPurchases) %>% 
  mutate(Day_Is_Client = as.integer(Day_Is_Client))

summary(cust_cluster)

##   AverageCheck    Day_Is_Client   NumAllPurchases
##  Min.   :  2.70   Min.   :  1.0   Min.   : 1.00  
##  1st Qu.: 13.00   1st Qu.:182.2   1st Qu.: 6.00  
##  Median : 29.75   Median :357.0   Median :12.00  
##  Mean   : 37.54   Mean   :355.1   Mean   :12.57  
##  3rd Qu.: 49.20   3rd Qu.:530.0   3rd Qu.:18.00  
##  Max.   :200.00   Max.   :700.0   Max.   :32.00

4.1 Scaling and Choosing Optimal K

Scaling the data.

cust_cluster_z <- scale(cust_cluster)
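Scaling matters because K-Means uses Euclidean distance, so a variable with a large range (such as Day_Is_Client, up to 700) would otherwise dominate one with a small range (such as NumAllPurchases, up to 32). By default scale() subtracts each column's mean and divides by its standard deviation. A minimal sketch with hypothetical values:

```r
# Hypothetical matrix with two columns on very different scales
m <- cbind(AverageCheck  = c(10, 20, 30, 40),
           Day_Is_Client = c(100, 300, 500, 700))

z <- scale(m)

# The same z-score computed by hand for one column
manual <- (m[, "AverageCheck"] - mean(m[, "AverageCheck"])) / sd(m[, "AverageCheck"])

print(z)
print(colMeans(z))        # approximately 0 for every column
print(apply(z, 2, sd))    # 1 for every column
```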

We can find a good value for K using an elbow plot. From the visualization below, the optimal number of clusters appears to be around 4.

library(factoextra)
fviz_nbclust(cust_cluster_z, FUNcluster = kmeans, method = "wss")
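The quantity behind the elbow plot is the total within-cluster sum of squares (WSS) for each candidate k. The sketch below computes it directly with base R on hypothetical data containing three well-separated groups; `toy`, `ks` and `wss` are illustrative names.

```r
set.seed(1)

# Hypothetical 2-D data with three well-separated groups of 30 points each
toy <- rbind(
  matrix(rnorm(60, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 10, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6
ks  <- 1:6
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)

print(data.frame(k = ks, wss = round(wss, 1)))
```

The "elbow" is the k after which WSS stops dropping sharply; in this toy data that happens at k = 3, while in the analysis above it is read off as roughly 4.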

4.2 K-Means Clustering

# Fix the random number generator so the clustering is reproducible
RNGkind(sample.kind = "Rounding")
set.seed(88)

# Fit K-Means with k = 4 
cust_cluster_4 <- kmeans(cust_cluster_z, centers = 4)

# Mean of each input variable per cluster, on the original (unscaled) scale
aggregate(cust_cluster, by=list(cluster=cust_cluster_4$cluster), mean)

# Attach the cluster labels to the cleaned dataset
cust_clean2$category <- cust_cluster_4$cluster

head(cust_clean2,5)
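K-Means starts from random centers, so a single run can settle in a poor local optimum. One common safeguard, shown here as a sketch on hypothetical data rather than as part of the original analysis, is the `nstart` argument of kmeans(), which runs several random starts and keeps the solution with the lowest total within-cluster sum of squares.

```r
set.seed(88)

# Hypothetical data: three well-separated groups of 30 points each
toy <- rbind(
  matrix(rnorm(60, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 10, sd = 0.5), ncol = 2)
)

# 25 random starts; the best of them is returned
fit <- kmeans(scale(toy), centers = 3, nstart = 25)

cluster_sizes <- sort(as.integer(table(fit$cluster)))
print(cluster_sizes)
print(fit$tot.withinss)
```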

4.3 Cluster Profiling

We will generate descriptions of the clusters with reference to the input variables we used for K-Means earlier.

cust_clean2 %>% 
  group_by(category) %>%
  summarise_all(mean)
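Note that mean() returns NA with a warning for factor columns such as Education. A possible variation, sketched below on hypothetical data, restricts the summary to numeric columns using across() and where() from dplyr. On the real data, difftime columns such as Day_Is_Client are not treated as numeric by is.numeric(), so they would need converting (for example with as.integer()) to be included.

```r
library(dplyr)

# Hypothetical profile data with one non-numeric column
toy <- data.frame(
  category        = c(1, 1, 2, 2),
  Education       = factor(c("PhD", "Master", "PhD", "Basic")),
  AverageCheck    = c(10, 30, 50, 70),
  NumAllPurchases = c(4, 6, 10, 20)
)

# Average only the numeric columns within each cluster
profile <- toy %>%
  group_by(category) %>%
  summarise(across(where(is.numeric), mean))

print(profile)
```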

Profiling:

Cluster 1: Clients with high purchases who have been clients for a long time; we may call them Elite Client
Cluster 2: Clients with average purchases who have been clients for a long time; labelled Good Client
Cluster 3: Clients with high purchases who are relatively new; categorised as High Potential Client
Cluster 4: Clients with average purchases who are relatively new; assigned as Ordinary Client

Renaming the Clusters

cust_profile <- 
cust_clean2 %>% 
  mutate(category = replace(category, category == 1, "Elite Client"),
         category = replace(category, category == 2, "Good Client"), 
         category = replace(category, category == 3, "High Potential Client"), 
         category = replace(category, category == 4, "Ordinary Client"), 
         category = as.factor(category)
         )
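The same relabelling can be done in one step with factor(), mapping cluster numbers to names through `levels` and `labels`. This is a minimal sketch on a hypothetical vector of cluster ids; `cluster_id` and `cluster_names` are illustrative names.

```r
# Hypothetical cluster ids as returned by kmeans()
cluster_id <- c(3, 1, 4, 2, 1)

cluster_names <- c("Elite Client", "Good Client",
                   "High Potential Client", "Ordinary Client")

# Cluster 1 maps to the first label, cluster 2 to the second, and so on
category <- factor(cluster_id, levels = 1:4, labels = cluster_names)

print(category)
```

Because K-Means numbers its clusters arbitrarily, the mapping between numbers and names has to be rechecked against the cluster means whenever the seed or the data change.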

The relationship between income and spending is approximately linear: customers with higher incomes spend more. Accordingly, Elite Client and High Potential Client are the clusters with the higher incomes.

Whether the Marital Status is In Relationship or Not in Relationship, the clients are quite evenly distributed across clusters, but in every cluster the number of clients who are In Relationship is slightly higher than the number who are Not in Relationship.

4.3.1 What are the purchasing habits of each cluster?

Insights:

Elite Client and High Potential Client are the most likely to purchase in store.
Most of the web and catalog purchases are also made by customers from Elite Client and High Potential Client.
Deal purchases are more common among Good Client and Ordinary Client.
Ordinary Client made the highest number of web visits, while customers from the High Potential Client segment made the fewest.

4.3.2 What do customers from different clusters buy?

We are probably dealing with a store that has a good variety of wines, which would explain why wine purchases are high and almost equally common across all clusters of buyers. In general, there are no major differences between the clusters; however, customers from the Good Client and Ordinary Client clusters are more likely to buy gold. Furthermore, Elite Client and High Potential Client customers are more likely to buy meat than the other clusters.

4.3.3 Which clients take part in the promotions the most?

As we can see, Elite Client and High Potential Client are the clusters that took part in promotions the most. This may be explained by the higher number of purchases these clusters make.

4.3.4 Promotions Acceptance by Clusters

The 3rd Campaign, 4th Campaign, and Latest Campaign seem to be the most successful ones, since they were fairly well accepted by all client clusters.
The Ordinary Client cluster showed the least interest in the company's promotion campaigns.
The Elite Client cluster accepted most of the offers from the company. We can regard this as an unusual trend, since clients with lower income are usually more likely to participate in promotions, but in this case it was the other way around.
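Comparisons like these can be backed by per-cluster acceptance rates. Because the campaign columns are 0/1 flags, their mean within a cluster is the share of clients who accepted. The sketch below uses hypothetical data and base R's aggregate(); it is one way to produce such a table, not necessarily how the figures behind the insights above were obtained.

```r
# Hypothetical clients with 0/1 campaign acceptance flags
toy <- data.frame(
  category     = c("Elite Client", "Elite Client", "Ordinary Client", "Ordinary Client"),
  AcceptedCmp3 = c(1, 0, 0, 0),
  AcceptedCmp4 = c(1, 1, 0, 0),
  Response     = c(1, 0, 1, 0)
)

# Mean of a 0/1 flag = acceptance rate within the cluster
rates <- aggregate(cbind(AcceptedCmp3, AcceptedCmp4, Response) ~ category,
                   data = toy, FUN = mean)

print(rates)
```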

Customer K-Means Clustering

Laura Andretha

10/7/2021