Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business understand its customers better and makes it easier to adapt products to the specific needs, behaviors, and concerns of different types of customers.
Customer segmentation is the practice of dividing a company’s customers into groups that reflect similarity among the customers in each group. The goal of segmenting customers is to decide how to relate to customers in each segment in order to maximize the value of each customer to the business. (Source: Optimove)
To identify the customer types, we will group the data into several clusters using K-Means clustering.
Load Library
library(tidyverse)
library(dplyr)
library(caret)
library(class)
library(scales)
library(lubridate)
cust <- read.csv("marketing_campaign.csv", sep = '\t')
head(cust,5)
It is a very interesting dataset from which we can extract a lot of information.
str(cust)
## 'data.frame': 2240 obs. of 29 variables:
## $ ID : int 5524 2174 4141 6182 5324 7446 965 6177 4855 5899 ...
## $ Year_Birth : int 1957 1954 1965 1984 1981 1967 1971 1985 1974 1950 ...
## $ Education : chr "Graduation" "Graduation" "Graduation" "Graduation" ...
## $ Marital_Status : chr "Single" "Single" "Together" "Together" ...
## $ Income : int 58138 46344 71613 26646 58293 62513 55635 33454 30351 5648 ...
## $ Kidhome : int 0 1 0 1 1 0 0 1 1 1 ...
## $ Teenhome : int 0 1 0 0 0 1 1 0 0 1 ...
## $ Dt_Customer : chr "04-09-2012" "08-03-2014" "21-08-2013" "10-02-2014" ...
## $ Recency : int 58 38 26 26 94 16 34 32 19 68 ...
## $ MntWines : int 635 11 426 11 173 520 235 76 14 28 ...
## $ MntFruits : int 88 1 49 4 43 42 65 10 0 0 ...
## $ MntMeatProducts : int 546 6 127 20 118 98 164 56 24 6 ...
## $ MntFishProducts : int 172 2 111 10 46 0 50 3 3 1 ...
## $ MntSweetProducts : int 88 1 21 3 27 42 49 1 3 1 ...
## $ MntGoldProds : int 88 6 42 5 15 14 27 23 2 13 ...
## $ NumDealsPurchases : int 3 2 1 2 5 2 4 2 1 1 ...
## $ NumWebPurchases : int 8 1 8 2 5 6 7 4 3 1 ...
## $ NumCatalogPurchases: int 10 1 2 0 3 4 3 0 0 0 ...
## $ NumStorePurchases : int 4 2 10 4 6 10 7 4 2 0 ...
## $ NumWebVisitsMonth : int 7 5 4 6 5 6 6 8 9 20 ...
## $ AcceptedCmp3 : int 0 0 0 0 0 0 0 0 0 1 ...
## $ AcceptedCmp4 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp5 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp2 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Complain : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Z_CostContact : int 3 3 3 3 3 3 3 3 3 3 ...
## $ Z_Revenue : int 11 11 11 11 11 11 11 11 11 11 ...
## $ Response : int 1 0 0 0 0 0 0 0 1 0 ...
First, we look at the family details: we add up the children and teenagers at home and collapse marital status into two groups.
cust_clean <-
  cust %>%
  mutate(
    # Total number of children and teenagers in the household
    Kids = Kidhome + Teenhome,
    # Collapse the marital status values into two groups
    Marital_Status = case_when(
      Marital_Status %in% c("Single", "Divorced", "Widow", "Alone", "Absurd", "YOLO") ~ "Not in Relationship",
      Marital_Status %in% c("Together", "Married") ~ "In Relationship",
      TRUE ~ Marital_Status
    )
  )
We can also work out when each customer joined and how long they have been a customer.
cust_clean <-
cust_clean %>%
mutate(Education = as.factor(Education),
Marital_Status = as.factor(Marital_Status),
Dt_Customer = dmy(Dt_Customer))
max(cust_clean$Dt_Customer)
## [1] "2014-06-29"
Because the most recent customer enrollment date is 29 June 2014, we will assume that the customer information was collected on 2014-06-30 (30 June 2014).
cust_clean <-
cust_clean %>%
mutate(Collected = '30-06-2014',
Collected = dmy(Collected),
Day_Is_Client = Collected - Dt_Customer )
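The date arithmetic behind Day_Is_Client can be checked on a single value. The sketch below uses base R's as.Date() instead of lubridate::dmy() so that it runs without extra packages; the variable names are illustrative, and the two dates are the first Dt_Customer value shown by str() and the assumed collection date.

```r
# Base-R sketch of the tenure calculation (the article itself uses lubridate::dmy()).
joined    <- as.Date("04-09-2012", format = "%d-%m-%Y")  # first Dt_Customer value in str(cust)
collected <- as.Date("30-06-2014", format = "%d-%m-%Y")  # assumed collection date

tenure      <- collected - joined   # a difftime object measured in days
tenure_days <- as.integer(tenure)   # plain integer, as used later for clustering

print(tenure)
print(tenure_days)
```

Subtracting two Date objects yields a difftime, which is why the clustering step later converts Day_Is_Client with as.integer().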
Next, we derive some additional information about purchases.
cust_clean <-
cust_clean %>%
# Calculating total spending in all products (maybe in dollar $)
mutate(MntTotal = MntWines + MntFruits + MntMeatProducts + MntFishProducts + MntSweetProducts+ MntGoldProds,
# Total purchase or number of transactions in every platform
NumAllPurchases = NumWebPurchases + NumCatalogPurchases + NumStorePurchases,
# Average spending in every purchase (dollar $/ purchase)
AverageCheck = round(MntTotal/NumAllPurchases, 1),
# Percentage of purchase that were made with additional deals/ discount
ShareDealsPurchases = round((NumDealsPurchases/ NumAllPurchases) * 100, 1),
# Total accepted offer from various campaign
TotalAcceptedCmp = AcceptedCmp1 + AcceptedCmp2 + AcceptedCmp3 +AcceptedCmp4 + AcceptedCmp5 + Response
)
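A small illustration of how the AverageCheck ratio behaves when a customer has spending but no recorded purchases. The toy rows below are hypothetical; only the column names mirror the derived features above.

```r
# Hypothetical toy rows; column names mirror the article's derived features.
toy <- data.frame(
  MntTotal        = c(120, 45, 6),
  NumAllPurchases = c(10, 3, 0)    # the last customer spent money but has zero purchases
)

# Same formula as in the article: spending per purchase, rounded to one decimal
toy$AverageCheck <- round(toy$MntTotal / toy$NumAllPurchases, 1)
print(toy)

# Division by zero yields Inf in R, which would distort any mean computed later
bad_rows <- !is.finite(toy$AverageCheck)
print(sum(bad_rows))
```

Rows flagged this way are the ones that need to be removed (or handled otherwise) before clustering.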
There are some anomalous records in the data: 6 clients spent money but did not make a single purchase. We will delete them.
# Drop the clients with no recorded purchases
cust_clean <-
cust_clean %>%
filter(NumAllPurchases != 0)
We also have missing values (NA) in Income, which we will fill with the mean of that variable.
# Replacing NA income with the mean of entire column
cust_clean$Income[is.na(cust_clean$Income)]<-
mean(cust_clean$Income,na.rm=TRUE)
# Re-check the NA in Income
colSums(is.na(cust_clean))
## ID Year_Birth Education Marital_Status
## 0 0 0 0
## Income Kidhome Teenhome Dt_Customer
## 0 0 0 0
## Recency MntWines MntFruits MntMeatProducts
## 0 0 0 0
## MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
## 0 0 0 0
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## 0 0 0 0
## AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1
## 0 0 0 0
## AcceptedCmp2 Complain Z_CostContact Z_Revenue
## 0 0 0 0
## Response Kids Collected Day_Is_Client
## 0 0 0 0
## MntTotal NumAllPurchases AverageCheck ShareDealsPurchases
## 0 0 0 0
## TotalAcceptedCmp
## 0
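The imputation step can be sketched on a short vector. The first values echo the Income column shown by str(); the NA position is hypothetical. The median is shown alongside the mean only as a common alternative that is less sensitive to extreme incomes, not as something the analysis above uses.

```r
# Toy income vector; the NA is placed here for illustration only.
income <- c(58138, 46344, NA, 26646)

mean_fill   <- mean(income, na.rm = TRUE)     # value used by mean imputation
median_fill <- median(income, na.rm = TRUE)   # alternative, more robust to outliers

# Replace missing entries with the mean, leaving observed values untouched
income_filled <- ifelse(is.na(income), mean_fill, income)

print(income_filled)
print(median_fill)
```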
Remove all unnecessary columns.
cust_clean2 <-
cust_clean %>%
select(-c(ID, Year_Birth, Kidhome, Teenhome, Dt_Customer, Z_CostContact, Z_Revenue, Collected))
Outlier (anomaly) detection is a technique for finding observations that differ extremely from the rest of the data. It needs to be done here because the clustering algorithm is K-Means: the algorithm forces every outlier into one of the clusters, and when that happens the characteristics of that cluster can change significantly.
Now we detect the outliers.
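One common way to flag outliers is the boxplot (1.5 x IQR) rule, sketched below with base R's boxplot.stats(). This is an assumption about the kind of check that fits here, not necessarily the method used in the original analysis, and the income values are made up.

```r
# Hypothetical incomes with one extreme value
income <- c(30000, 45000, 52000, 58000, 61000, 70000, 666666)

stats <- boxplot.stats(income)   # default coef = 1.5
outliers <- stats$out            # points beyond the whiskers

print(stats$stats)   # lower whisker, lower hinge, median, upper hinge, upper whisker
print(outliers)
```

Values returned in `out` are candidates for capping or removal; the thresholds themselves remain a judgment call.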
Then we cap them: incomes above 120,000 are set to 120,000 and average checks above 200 are set to 200.
cust_clean2 <-
cust_clean2 %>%
mutate(Income= ifelse(Income > 120000, 120000, Income),
AverageCheck = ifelse(AverageCheck > 200, 200, AverageCheck))
Finally, we calculate the difference between how long a person has been a client and the number of days since their last purchase.
cust_clean2 <-
cust_clean2 %>%
mutate(ActiveDays = Day_Is_Client - Recency)
Result:
head(cust_clean2, 5)
Clustering will be based on the average check, the total number of purchases, and the length of time the person has been a client.
cust_cluster <-
cust_clean2 %>%
select(AverageCheck, Day_Is_Client, NumAllPurchases) %>%
mutate(Day_Is_Client = as.integer(Day_Is_Client))
summary(cust_cluster)
## AverageCheck Day_Is_Client NumAllPurchases
## Min. : 2.70 Min. : 1.0 Min. : 1.00
## 1st Qu.: 13.00 1st Qu.:182.2 1st Qu.: 6.00
## Median : 29.75 Median :357.0 Median :12.00
## Mean : 37.54 Mean :355.1 Mean :12.57
## 3rd Qu.: 49.20 3rd Qu.:530.0 3rd Qu.:18.00
## Max. :200.00 Max. :700.0 Max. :32.00
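Scaling the data, so that variables measured on very different ranges contribute equally to the distance calculation.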
cust_cluster_z <- scale(cust_cluster)
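What scale() does by default can be verified by hand: each column is centered on its mean and divided by its standard deviation (a z-score). The toy values below are hypothetical.

```r
# Hypothetical toy data with two columns on very different ranges
toy <- data.frame(
  AverageCheck  = c(10, 20, 30),
  Day_Is_Client = c(100, 400, 700)
)

toy_z <- scale(toy)   # returns a matrix of z-scores

# The same transformation written out for one column
manual <- (toy$AverageCheck - mean(toy$AverageCheck)) / sd(toy$AverageCheck)

print(toy_z)
print(manual)
```

After scaling, both columns span the same range even though the raw values differ by an order of magnitude.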
We can find a suitable value for K using an elbow plot of the total within-cluster sum of squares. From the visualization, the optimal number of clusters appears to be around 4.
library(factoextra)
fviz_nbclust(cust_cluster_z, FUNcluster = kmeans, method = "wss")
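The quantity plotted by the "wss" method is the total within-cluster sum of squares for each candidate K. A minimal base-R sketch of that calculation follows, using synthetic data (three artificial blobs) rather than the customer data; the nstart value is an illustrative choice.

```r
set.seed(42)

# Synthetic data: three well-separated 2-D blobs of 30 points each
toy <- rbind(
  matrix(rnorm(60, mean = 0),  ncol = 2),
  matrix(rnorm(60, mean = 5),  ncol = 2),
  matrix(rnorm(60, mean = 10), ncol = 2)
)
toy_z <- scale(toy)

# Total within-cluster sum of squares for K = 1..6
ks  <- 1:6
wss <- sapply(ks, function(k) kmeans(toy_z, centers = k, nstart = 10)$tot.withinss)

print(round(wss, 1))
```

The "elbow" is the K after which further increases reduce wss only slightly; for this synthetic data it should sit at 3.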
# Fix the random number generator so the clustering is reproducible
RNGkind(sample.kind = "Rounding")
set.seed(88)
# Fit k-means with k = 4 clusters
cust_cluster_4 <- kmeans(cust_cluster_z, centers = 4)
aggregate(cust_cluster, by=list(cluster=cust_cluster_4$cluster), mean)
# Attach the cluster label to the cleaned dataset
cust_clean2$category <- cust_cluster_4$cluster
head(cust_clean2,5)
We will generate descriptions of the clusters with reference to the input variables we used for K-Means earlier.
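cust_clean2 %>%
  group_by(category) %>%
  # Average every non-factor column per cluster (mean() is undefined for factors)
  summarise(across(!where(is.factor), mean))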
Profiling:
Elite Client
Good Client
High Potential Client
Ordinary Client
Renaming the Clusters
cust_profile <-
  cust_clean2 %>%
  # Map cluster numbers 1-4 to descriptive names and store them as a factor
  mutate(category = factor(category,
                           levels = 1:4,
                           labels = c("Elite Client", "Good Client",
                                      "High Potential Client", "Ordinary Client")))
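Cluster numbers returned by kmeans() are arbitrary and depend on the random start, so a fixed number-to-name mapping only holds for a given seed. One possible way to make the naming independent of the numbering is to rank clusters by a summary statistic first. The sketch below does this on hypothetical toy data with two made-up labels.

```r
set.seed(1)

# Hypothetical toy data: two well-separated groups of average-check values
toy <- data.frame(AverageCheck = c(rnorm(20, mean = 10), rnorm(20, mean = 100)))
km  <- kmeans(scale(toy), centers = 2, nstart = 10)

# Mean average check per cluster id, then cluster ids ordered from highest to lowest
centers <- tapply(toy$AverageCheck, km$cluster, mean)
ranked  <- names(sort(centers, decreasing = TRUE))

# Assign names by rank rather than by raw cluster number (labels are illustrative)
toy$category <- factor(km$cluster,
                       levels = as.integer(ranked),
                       labels = c("High spender", "Low spender"))

print(table(toy$category))
```

With this approach the cluster with the highest average check always receives the first label, whatever number kmeans() happened to give it.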
The relationship between income and spending is linear: customers with higher incomes spend more. Accordingly, Elite Client and High Potential Client are the clusters with the highest incomes.
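A claim of linearity like this can be checked numerically with a correlation coefficient and a simple linear fit. The numbers below are invented for illustration and are not taken from the dataset.

```r
# Hypothetical income and total-spending pairs
income <- c(20000, 40000, 60000, 80000, 100000)
spend  <- c(205, 395, 600, 805, 995)

r   <- cor(income, spend)    # Pearson correlation; near 1 means close to linear
fit <- lm(spend ~ income)    # least-squares line

print(r)
print(coef(fit))
```

On the real data the same two calls would use the Income and MntTotal columns.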
Clients are split fairly evenly between In Relationship and Not in Relationship, but in every cluster the number of clients who are In Relationship is slightly higher than the number who are Not in Relationship.
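The counts behind a comparison like this come from a cross-tabulation of cluster and marital status. A minimal sketch with made-up rows (the category and status names follow the article, the counts do not):

```r
# Hypothetical rows; only the labels match the article
toy <- data.frame(
  category = c("Elite Client", "Elite Client", "Elite Client",
               "Ordinary Client", "Ordinary Client"),
  Marital_Status = c("In Relationship", "In Relationship", "Not in Relationship",
                     "In Relationship", "Not in Relationship")
)

counts <- table(toy$category, toy$Marital_Status)   # clients per cluster and status
shares <- prop.table(counts, margin = 1)            # row-wise shares within each cluster

print(counts)
print(round(shares, 2))
```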
Insights:
Elite Client and High Potential Client are most likely to purchase in store.
Elite Client and High Potential Client.
Good Client and Ordinary Client.
Ordinary Client made the most web visits, while customers from the High Potential Client segment made the fewest.
This store appears to have a good variety of wines, which results in high wine purchases that are spread almost equally across all clusters of buyers. In general, there are no major differences between the clusters; however, customers from the Good Client and Ordinary Client clusters are more likely to buy gold, and Elite Client and High Potential Client customers buy meat more often than the other clusters.
As we can see, Elite Client and High Potential Client are the clusters that took part in promotions the most. This may be explained by the higher number of purchases these clusters make.
The Ordinary Client cluster showed the least interest in the promotion campaigns of the company.
Elite Client accepted the most offers from the company. This is an unusual trend, since clients with lower incomes are usually more likely to participate in promotions, but here it is the other way around.
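Because the number of accepted offers may simply track the number of purchases, it can help to look at both per cluster and at their ratio. The sketch below uses invented rows; the AcceptedPerPurchase column is a hypothetical helper, not a variable from the analysis above.

```r
# Hypothetical rows; values are made up and say nothing about the real clusters
toy <- data.frame(
  category         = c("Elite Client", "Elite Client", "Ordinary Client", "Ordinary Client"),
  TotalAcceptedCmp = c(2, 1, 0, 1),
  NumAllPurchases  = c(20, 25, 5, 8)
)

# Mean accepted offers and mean purchases per cluster
by_cluster <- aggregate(cbind(TotalAcceptedCmp, NumAllPurchases) ~ category,
                        data = toy, FUN = mean)

# Accepted offers relative to purchase volume (illustrative helper column)
by_cluster$AcceptedPerPurchase <- by_cluster$TotalAcceptedCmp / by_cluster$NumAllPurchases

print(by_cluster)
```

If the per-purchase ratio is similar across clusters, the difference in accepted offers is largely a volume effect.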