Customer Personality analysis is a detailed analysis done by a brand or company to find an ideal set of customers. This helps the company to make changes in their product and marketing in a more efficient way. Promoting a product to a customer who is more likely to buy the product is better than promoting it to a random person.
Customer Personality analysis is done by clustering as it is a un-supervised problem.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(cluster)
library(gower)
library(Rtsne)
library(stringr)
library(corrgram)
data <- read.csv2("/Users/aashaysharma/Desktop/DS/Customer Personality Analysis/marketing_campaign.csv", sep = "\t")
sum(is.na(data))
## [1] 24
data <- na.omit(data)
As there are very few NA items we will just omit them.
str(data)
## 'data.frame': 2216 obs. of 29 variables:
## $ ID : int 5524 2174 4141 6182 5324 7446 965 6177 4855 5899 ...
## $ Year_Birth : int 1957 1954 1965 1984 1981 1967 1971 1985 1974 1950 ...
## $ Education : chr "Graduation" "Graduation" "Graduation" "Graduation" ...
## $ Marital_Status : chr "Single" "Single" "Together" "Together" ...
## $ Income : int 58138 46344 71613 26646 58293 62513 55635 33454 30351 5648 ...
## $ Kidhome : int 0 1 0 1 1 0 0 1 1 1 ...
## $ Teenhome : int 0 1 0 0 0 1 1 0 0 1 ...
## $ Dt_Customer : chr "04-09-2012" "08-03-2014" "21-08-2013" "10-02-2014" ...
## $ Recency : int 58 38 26 26 94 16 34 32 19 68 ...
## $ MntWines : int 635 11 426 11 173 520 235 76 14 28 ...
## $ MntFruits : int 88 1 49 4 43 42 65 10 0 0 ...
## $ MntMeatProducts : int 546 6 127 20 118 98 164 56 24 6 ...
## $ MntFishProducts : int 172 2 111 10 46 0 50 3 3 1 ...
## $ MntSweetProducts : int 88 1 21 3 27 42 49 1 3 1 ...
## $ MntGoldProds : int 88 6 42 5 15 14 27 23 2 13 ...
## $ NumDealsPurchases : int 3 2 1 2 5 2 4 2 1 1 ...
## $ NumWebPurchases : int 8 1 8 2 5 6 7 4 3 1 ...
## $ NumCatalogPurchases: int 10 1 2 0 3 4 3 0 0 0 ...
## $ NumStorePurchases : int 4 2 10 4 6 10 7 4 2 0 ...
## $ NumWebVisitsMonth : int 7 5 4 6 5 6 6 8 9 20 ...
## $ AcceptedCmp3 : int 0 0 0 0 0 0 0 0 0 1 ...
## $ AcceptedCmp4 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp5 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp2 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Complain : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Z_CostContact : int 3 3 3 3 3 3 3 3 3 3 ...
## $ Z_Revenue : int 11 11 11 11 11 11 11 11 11 11 ...
## $ Response : int 1 0 0 0 0 0 0 0 1 0 ...
## - attr(*, "na.action")= 'omit' Named int [1:24] 11 28 44 49 59 72 91 92 93 129 ...
## ..- attr(*, "names")= chr [1:24] "11" "28" "44" "49" ...
The data is mixed type data (ie both Categorical and Continous values). Euclidean distance is commonly used for clustering algorithms like KMeans and hierarchical clustering but it is useful when we only have to deal with continous values. Thus here we will use gower distance as the distance metric.
Gower distance is computed as the average of partial dissimilarities across individuals. Each partial dissimilarity (and thus Gower distance) ranges in [0 1].
To work with the data we need to alter some variables and add a few new variables based on the existing ones.
data$Age <- 2014 - data$Year_Birth
data$Education[data$Education == "2n Cycle"] = "UG"
data$Education[data$Education == "Basic"] = "UG"
data$Education[data$Education == "Graduation"] = "PG"
data$Education[data$Education == "Master"] = "PG"
data$Education[data$Education == "PhD"] = "PG"
data$Marital_Status[data$Marital_Status == "Divorced"] = "Single"
data$Marital_Status[data$Marital_Status == "Absurd"] = "Single"
data$Marital_Status[data$Marital_Status == "YOLO"] = "Single"
data$Marital_Status[data$Marital_Status == "Widow"] = "Single"
data$Marital_Status[data$Marital_Status == "Together"] = "Couple"
data$Marital_Status[data$Marital_Status == "Married"] = "Couple"
data$Marital_Status[data$Marital_Status == "Alone"] = "Single"
data$Customer_year <- str_sub(data$Dt_Customer,-4)
data$Customer_year <- as.numeric(data$Customer_year)
data$Customer_Seniority <- 2014 - data$Customer_year
data$Child <- data$Kidhome + data$Teenhome
data$Amt_Spent <- data$MntWines + data$MntFishProducts + data$MntFruits + data$MntGoldProds + data$MntMeatProducts + data$MntSweetProducts
data$Num_Purchases_made <- data$NumWebPurchases + data$NumCatalogPurchases + data$NumStorePurchases
Thus we are left with 12 variables that can be used to create the distance matrix.
data <- data[c(1,30,3,4,5,33,32,9,34,35,16,20)]
data2 <- data
data2$Education <- unclass(as.factor(data2$Education))
data2$Marital_Status <- unclass(as.factor(data2$Marital_Status))
data2$Education <- as.numeric(data2$Education)
data2$Marital_Status <- as.numeric(data2$Marital_Status)
corrgram(data2[-c(1)], order=TRUE, lower.panel=panel.shade, upper.panel=NULL, text.panel=panel.txt, main="Customer Data")
Strong Correlation between Income, Amount Spent and Purchase, As well as there is slight trend between Education and Income(PG:1, UG:2) so this tells us that people with higher studies earn more. 1. Age
age_plot <- ggplot(data = data, aes(Age))
age_plot + geom_density()
edu_plot1 <- ggplot(data = data, aes(Education))
edu_plot1 + geom_histogram(stat = "count")
edu_plot2<- ggplot(data = data, aes(Age, fill = Education))
edu_plot2 + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
marital_plot <- ggplot(data = data, aes(Marital_Status, fill = Marital_Status))
marital_plot + geom_histogram(stat = "count")
require(scales)
## Loading required package: scales
inc_plt <- ggplot(data = data, aes(Income, Amt_Spent))
inc_plt + geom_point(alpha = 0.5, color = "blue") + scale_x_continuous(labels = comma)
a <- ggplot(data = data, aes(Child, fill = Marital_Status))
a + geom_histogram(position = "dodge")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
cust_plt <- ggplot(data = data, aes(as.factor(Customer_Seniority), Amt_Spent, fill = Customer_Seniority))
cust_plt + geom_boxplot(color = "black")
gower_dist2 <- daisy(data[c("Income", "Amt_Spent", "Customer_Seniority", "Age")],
metric = "gower")
summary(gower_dist2)
## 2454220 dissimilarities, summarized :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1234 0.1902 0.1973 0.2686 0.7065
## Metric : mixed ; Types = I, I, I, I
## Number of objects : 2216
gower_mat2 <- as.matrix(gower_dist2)
MOST SIMILAR PAIR
data[
which(gower_mat2 == min(gower_mat2[gower_mat2 != min(gower_mat2)]),
arr.ind = TRUE)[1, ], ]
## ID Age Education Marital_Status Income Child Customer_Seniority Recency
## 1754 5092 65 PG Single 51569 1 1 39
## 438 10664 65 PG Single 51529 1 1 14
## Amt_Spent Num_Purchases_made NumDealsPurchases NumWebVisitsMonth
## 1754 467 15 4 8
## 438 467 15 2 8
MOST DISSIMILAR PAIR
data[
which(gower_mat2 == max(gower_mat2[gower_mat2 != max(gower_mat2)]),
arr.ind = TRUE)[1, ], ]
## ID Age Education Marital_Status Income Child Customer_Seniority Recency
## 915 10619 20 PG Single 95529 0 2 29
## 240 11004 121 UG Single 60182 1 0 23
## Amt_Spent Num_Purchases_made NumDealsPurchases NumWebVisitsMonth
## 915 1990 17 1 3
## 240 22 3 1 4
FINDING NUMBER OF CLUSTERS
sil_width2 <- c(NA)
for(i in 2:10){
pam_fit2 <- pam(gower_dist2,
diss = TRUE,
k = i)
sil_width2[i] <- pam_fit2$silinfo$avg.width
}
plot(1:10, sil_width2,
xlab = "Number of clusters",
ylab = "Silhouette Width")
lines(1:10, sil_width2)
By silhouette method we can choose 4 clusters.
CLUSTER SUMMARY
pam_fit2 <- pam(gower_dist2, diss = TRUE, k = 4)
pam_results2 <- data %>%
dplyr::select(-c(ID, Child, NumWebVisitsMonth, Num_Purchases_made, NumDealsPurchases, Education, Marital_Status)) %>%
mutate(cluster = pam_fit2$clustering) %>%
group_by(cluster) %>%
do(the_summary = summary(.))
pam_results2$the_summary
## [[1]]
## Age Income Customer_Seniority Recency
## Min. :18.00 Min. : 6835 Min. :2 Min. : 0.00
## 1st Qu.:36.00 1st Qu.: 34217 1st Qu.:2 1st Qu.:26.00
## Median :43.00 Median : 52050 Median :2 Median :50.50
## Mean :44.17 Mean : 51150 Mean :2 Mean :50.11
## 3rd Qu.:54.00 3rd Qu.: 67441 3rd Qu.:2 3rd Qu.:76.00
## Max. :69.00 Max. :160803 Max. :2 Max. :99.00
## Amt_Spent cluster
## Min. : 13.0 Min. :1
## 1st Qu.: 132.2 1st Qu.:1
## Median : 578.5 Median :1
## Mean : 746.6 Mean :1
## 3rd Qu.:1240.8 3rd Qu.:1
## Max. :2352.0 Max. :1
##
## [[2]]
## Age Income Customer_Seniority Recency
## Min. : 18.00 Min. : 1730 Min. :0 Min. : 0.00
## 1st Qu.: 37.00 1st Qu.: 35246 1st Qu.:0 1st Qu.:23.00
## Median : 44.00 Median : 51287 Median :0 Median :47.00
## Mean : 45.18 Mean : 52827 Mean :0 Mean :47.83
## 3rd Qu.: 55.00 3rd Qu.: 70545 3rd Qu.:0 3rd Qu.:73.00
## Max. :121.00 Max. :157243 Max. :0 Max. :99.00
## Amt_Spent cluster
## Min. : 6.0 Min. :2
## 1st Qu.: 49.0 1st Qu.:2
## Median : 165.0 Median :2
## Mean : 493.8 Mean :2
## 3rd Qu.: 889.0 3rd Qu.:2
## Max. :2525.0 Max. :2
##
## [[3]]
## Age Income Customer_Seniority Recency
## Min. : 21.00 Min. : 2447 Min. :1 Min. : 0.00
## 1st Qu.: 39.00 1st Qu.: 62745 1st Qu.:1 1st Qu.:24.00
## Median : 49.00 Median : 70379 Median :1 Median :51.00
## Mean : 48.25 Mean : 70493 Mean :1 Mean :49.63
## 3rd Qu.: 59.00 3rd Qu.: 78353 3rd Qu.:1 3rd Qu.:74.00
## Max. :115.00 Max. :157146 Max. :1 Max. :99.00
## Amt_Spent cluster
## Min. : 507 Min. :3
## 1st Qu.: 882 1st Qu.:3
## Median :1128 Median :3
## Mean :1200 Mean :3
## 3rd Qu.:1493 3rd Qu.:3
## Max. :2524 Max. :3
##
## [[4]]
## Age Income Customer_Seniority Recency
## Min. : 22.00 Min. : 3502 Min. :1 Min. : 0.00
## 1st Qu.: 36.00 1st Qu.: 28437 1st Qu.:1 1st Qu.:25.00
## Median : 42.00 Median : 38199 Median :1 Median :49.00
## Mean : 43.69 Mean : 39338 Mean :1 Mean :48.74
## 3rd Qu.: 51.00 3rd Qu.: 47496 3rd Qu.:1 3rd Qu.:74.00
## Max. :114.00 Max. :666666 Max. :1 Max. :99.00
## Amt_Spent cluster
## Min. : 5.0 Min. :4
## 1st Qu.: 46.0 1st Qu.:4
## Median : 91.0 Median :4
## Mean :168.4 Mean :4
## 3rd Qu.:268.2 3rd Qu.:4
## Max. :662.0 Max. :4
We can interpret from the cluster summary that 4 types of customers here are :
VISUALISING CLUSTERS
tsne_obj2 <- Rtsne(gower_dist2, is_distance = TRUE)
tsne_data2 <- tsne_obj2$Y %>%
data.frame() %>%
setNames(c("X", "Y")) %>%
mutate(cluster = factor(pam_fit2$clustering),
name = data$ID)
ggplot(aes(x = X, y = Y), data = tsne_data2) +
geom_point(aes(color = cluster))
data2 <- data
data2$Education <- unclass(as.factor(data2$Education))
data2$Marital_Status <- unclass(as.factor(data2$Marital_Status))
data2$Education <- as.numeric(data2$Education)
data2$Marital_Status <- as.numeric(data2$Marital_Status)
data_new <- read.csv2("/Users/aashaysharma/Desktop/DS/Customer Personality Analysis/marketing_campaign.csv", sep = "\t")
data_new <- na.omit(data_new)
data3 <- data2
data3$AmtWines <- data_new$MntWines
data3$AmtFruits <- data_new$MntFruits
data3$AmtFish <- data_new$MntFishProducts
data3$AmtMeat <- data_new$MntMeatProducts
data3$AmtSweet <- data_new$MntSweetProducts
data3$AmtGold <- data_new$MntGoldProds
data3$Education <- as.factor(data3$Education)
data3$Marital_Status <- as.factor(data3$Marital_Status)
gower_dist3 <- daisy(data3[c("Income", "Child", "Num_Purchases_made", "Marital_Status", "AmtWines", "AmtFruits", "AmtFish", "AmtMeat", "AmtSweet", "AmtGold", "NumDealsPurchases", "Amt_Spent", "Recency")],
metric = "gower")
summary(gower_dist3)
## 2454220 dissimilarities, summarized :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1422 0.2075 0.2135 0.2796 0.6256
## Metric : mixed ; Types = I, I, I, N, I, I, I, I, I, I, I, I, I
## Number of objects : 2216
gower_mat3 <- as.matrix(gower_dist3)
MOST SIMILAR PAIR
data3[
which(gower_mat3 == min(gower_mat3[gower_mat3 != min(gower_mat3)]),
arr.ind = TRUE)[1, ], ]
## ID Age Education Marital_Status Income Child Customer_Seniority Recency
## 1469 7196 64 1 1 41145 2 0 20
## 570 3525 44 1 1 38200 2 2 19
## Amt_Spent Num_Purchases_made NumDealsPurchases NumWebVisitsMonth AmtWines
## 1469 13 3 1 3 9
## 570 17 3 1 7 12
## AmtFruits AmtFish AmtMeat AmtSweet AmtGold
## 1469 0 0 3 0 1
## 570 0 0 4 0 1
MOST DISSIMILAR PAIR
data3[
which(gower_mat3 == max(gower_mat3[gower_mat3 != max(gower_mat3)]),
arr.ind = TRUE)[1, ], ]
## ID Age Education Marital_Status Income Child Customer_Seniority Recency
## 1445 1553 68 1 1 82657 0 1 71
## 161 2795 56 1 2 30523 3 1 0
## Amt_Spent Num_Purchases_made NumDealsPurchases NumWebVisitsMonth AmtWines
## 1445 2283 22 1 4 966
## 161 13 3 1 7 5
## AmtFruits AmtFish AmtMeat AmtSweet AmtGold
## 1445 168 246 672 105 126
## 161 0 0 3 0 5
FINDING NUMBER OF CLUSTERS
sil_width3 <- c(NA)
for(i in 2:10){
pam_fit3 <- pam(gower_dist3,
diss = TRUE,
k = i)
sil_width3[i] <- pam_fit3$silinfo$avg.width
}
plot(1:10, sil_width3,
xlab = "Number of clusters",
ylab = "Silhouette Width")
lines(1:10, sil_width3)
By silhouette method we can choose 4 clusters.
CLUSTER SUMMARY
pam_fit3 <- pam(gower_dist3, diss = TRUE, k = 3)
pam_results3 <- data3 %>%
dplyr::select(-c(ID, NumWebVisitsMonth, Age, Customer_Seniority)) %>%
mutate(cluster = pam_fit3$clustering) %>%
group_by(cluster) %>%
do(the_summary = summary(.))
pam_results3$the_summary
## [[1]]
## Education Marital_Status Income Child Recency
## 1:563 1:433 Min. : 32632 Min. :0.0000 Min. : 0.00
## 2: 60 2:190 1st Qu.: 68109 1st Qu.:0.0000 1st Qu.:25.00
## Median : 75012 Median :0.0000 Median :52.00
## Mean : 74363 Mean :0.3451 Mean :49.67
## 3rd Qu.: 81187 3rd Qu.:1.0000 3rd Qu.:74.00
## Max. :105471 Max. :3.0000 Max. :99.00
## Amt_Spent Num_Purchases_made NumDealsPurchases AmtWines
## Min. : 461 Min. : 9.00 Min. : 0.00 Min. : 47.0
## 1st Qu.:1072 1st Qu.:17.00 1st Qu.: 1.00 1st Qu.: 397.5
## Median :1338 Median :20.00 Median : 1.00 Median : 587.0
## Mean :1381 Mean :19.98 Mean : 1.66 Mean : 639.3
## 3rd Qu.:1653 3rd Qu.:23.00 3rd Qu.: 2.00 3rd Qu.: 854.0
## Max. :2525 Max. :32.00 Max. :15.00 Max. :1493.0
## AmtFruits AmtFish AmtMeat AmtSweet
## Min. : 0.0 Min. : 0.00 Min. : 48.0 Min. : 0.00
## 1st Qu.: 27.0 1st Qu.: 45.00 1st Qu.:242.0 1st Qu.: 30.00
## Median : 58.0 Median : 86.00 Median :397.0 Median : 56.00
## Mean : 68.3 Mean : 98.18 Mean :425.5 Mean : 69.32
## 3rd Qu.:102.0 3rd Qu.:145.00 3rd Qu.:568.5 3rd Qu.:102.00
## Max. :199.0 Max. :259.00 Max. :984.0 Max. :198.00
## AmtGold cluster
## Min. : 0.00 Min. :1
## 1st Qu.: 33.00 1st Qu.:1
## Median : 61.00 Median :1
## Mean : 80.41 Mean :1
## 3rd Qu.:118.50 3rd Qu.:1
## Max. :249.00 Max. :1
##
## [[2]]
## Education Marital_Status Income Child Recency
## 1:529 1: 0 Min. : 1730 Min. :0.000 Min. : 0.00
## 2: 67 2:596 1st Qu.: 31630 1st Qu.:1.000 1st Qu.:25.00
## Median : 43102 Median :1.000 Median :49.00
## Mean : 44460 Mean :1.117 Mean :48.13
## 3rd Qu.: 57102 3rd Qu.:2.000 3rd Qu.:73.00
## Max. :153924 Max. :3.000 Max. :99.00
## Amt_Spent Num_Purchases_made NumDealsPurchases AmtWines
## Min. : 5.0 Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 54.0 1st Qu.: 5.00 1st Qu.: 1.00 1st Qu.: 15.0
## Median : 166.0 Median : 8.00 Median : 2.00 Median : 68.0
## Mean : 351.8 Mean :10.31 Mean : 2.49 Mean : 198.3
## 3rd Qu.: 577.2 3rd Qu.:15.00 3rd Qu.: 3.00 3rd Qu.: 273.0
## Max. :1766.0 Max. :31.00 Max. :15.00 Max. :1462.0
## AmtFruits AmtFish AmtMeat AmtSweet
## Min. : 0.0 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.0 1st Qu.: 2.00 1st Qu.: 11.00 1st Qu.: 1.00
## Median : 5.0 Median : 7.00 Median : 32.00 Median : 4.00
## Mean : 12.9 Mean : 17.28 Mean : 77.25 Mean : 12.56
## 3rd Qu.: 15.0 3rd Qu.: 19.00 3rd Qu.:100.00 3rd Qu.: 14.00
## Max. :174.0 Max. :201.00 Max. :873.00 Max. :262.00
## AmtGold cluster
## Min. : 0.00 Min. :2
## 1st Qu.: 7.00 1st Qu.:2
## Median : 18.00 Median :2
## Mean : 33.53 Mean :2
## 3rd Qu.: 40.00 3rd Qu.:2
## Max. :291.00 Max. :2
##
## [[3]]
## Education Marital_Status Income Child Recency
## 1:870 1:997 Min. : 2447 Min. :0.000 Min. : 0.00
## 2:127 2: 0 1st Qu.: 30261 1st Qu.:1.000 1st Qu.:24.00
## Median : 41145 Median :1.000 Median :49.00
## Mean : 43083 Mean :1.222 Mean :49.13
## 3rd Qu.: 54198 3rd Qu.:2.000 3rd Qu.:74.00
## Max. :666666 Max. :3.000 Max. :99.00
## Amt_Spent Num_Purchases_made NumDealsPurchases AmtWines
## Min. : 8 Min. : 0.000 Min. : 0.000 Min. : 0.0
## 1st Qu.: 50 1st Qu.: 5.000 1st Qu.: 1.000 1st Qu.: 13.0
## Median : 127 Median : 7.000 Median : 2.000 Median : 48.0
## Mean : 276 Mean : 9.266 Mean : 2.639 Mean : 160.1
## 3rd Qu.: 428 3rd Qu.:13.000 3rd Qu.: 3.000 3rd Qu.: 216.0
## Max. :1730 Max. :29.000 Max. :15.000 Max. :1279.0
## AmtFruits AmtFish AmtMeat AmtSweet
## Min. : 0.000 Min. : 0.00 Min. : 1.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 2.00 1st Qu.: 11.00 1st Qu.: 1.00
## Median : 4.000 Median : 6.00 Median : 25.00 Median : 4.00
## Mean : 8.186 Mean : 11.98 Mean : 59.11 Mean : 9.25
## 3rd Qu.: 10.000 3rd Qu.: 15.00 3rd Qu.: 73.00 3rd Qu.: 11.00
## Max. :123.000 Max. :179.00 Max. :1725.00 Max. :195.00
## AmtGold cluster
## Min. : 0.00 Min. :3
## 1st Qu.: 6.00 1st Qu.:3
## Median : 14.00 Median :3
## Mean : 27.43 Mean :3
## 3rd Qu.: 32.00 3rd Qu.:3
## Max. :321.00 Max. :3
We can interpret from the cluster summary that 3 types of customers here are :
VISUALISING CLUSTERS
tsne_obj3 <- Rtsne(gower_dist3, is_distance = TRUE)
tsne_data3 <- tsne_obj3$Y %>%
data.frame() %>%
setNames(c("X", "Y")) %>%
mutate(cluster = factor(pam_fit3$clustering),
name = data3$ID)
ggplot(aes(x = X, y = Y), data = tsne_data3) +
geom_point(aes(color = cluster))
# General Conclusions :