Customer Personality Analysis helps a business to better understand its customers and makes it easier for them to modify products based on its target customers from different types of customer segments. Instead of spending money to market a new product to every customer, this analysis can help direct the marketing team to market the product to specific related segments. Our goal is to perform customer segmentation using unsupervised learning analysis, which includes Clustering with K-Means and also if possible, dimensional reduction using Principle Component Analysis (PCA). The dataset used in this analysis is from Kaggle: Customer Personality Analysis.
## Rows: 2,240
## Columns: 29
## $ ID <int> 5524, 2174, 4141, 6182, 5324, 7446, 965, 6177, 485…
## $ Year_Birth <int> 1957, 1954, 1965, 1984, 1981, 1967, 1971, 1985, 19…
## $ Education <chr> "Graduation", "Graduation", "Graduation", "Graduat…
## $ Marital_Status <chr> "Single", "Single", "Together", "Together", "Marri…
## $ Income <int> 58138, 46344, 71613, 26646, 58293, 62513, 55635, 3…
## $ Kidhome <int> 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1,…
## $ Teenhome <int> 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1,…
## $ Dt_Customer <chr> "04-09-2012", "08-03-2014", "21-08-2013", "10-02-2…
## $ Recency <int> 58, 38, 26, 26, 94, 16, 34, 32, 19, 68, 11, 59, 82…
## $ MntWines <int> 635, 11, 426, 11, 173, 520, 235, 76, 14, 28, 5, 6,…
## $ MntFruits <int> 88, 1, 49, 4, 43, 42, 65, 10, 0, 0, 5, 16, 61, 2, …
## $ MntMeatProducts <int> 546, 6, 127, 20, 118, 98, 164, 56, 24, 6, 6, 11, 4…
## $ MntFishProducts <int> 172, 2, 111, 10, 46, 0, 50, 3, 3, 1, 0, 11, 225, 3…
## $ MntSweetProducts <int> 88, 1, 21, 3, 27, 42, 49, 1, 3, 1, 2, 1, 112, 5, 1…
## $ MntGoldProds <int> 88, 6, 42, 5, 15, 14, 27, 23, 2, 13, 1, 16, 30, 14…
## $ NumDealsPurchases <int> 3, 2, 1, 2, 5, 2, 4, 2, 1, 1, 1, 1, 1, 3, 1, 1, 3,…
## $ NumWebPurchases <int> 8, 1, 8, 2, 5, 6, 7, 4, 3, 1, 1, 2, 3, 6, 1, 7, 3,…
## $ NumCatalogPurchases <int> 10, 1, 2, 0, 3, 4, 3, 0, 0, 0, 0, 0, 4, 1, 0, 6, 0…
## $ NumStorePurchases <int> 4, 2, 10, 4, 6, 10, 7, 4, 2, 0, 2, 3, 8, 5, 3, 12,…
## $ NumWebVisitsMonth <int> 7, 5, 4, 6, 5, 6, 6, 8, 9, 20, 7, 8, 2, 6, 8, 3, 8…
## $ AcceptedCmp3 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ AcceptedCmp4 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ AcceptedCmp5 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ AcceptedCmp1 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ AcceptedCmp2 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Complain <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Z_CostContact <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ Z_Revenue <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11…
## $ Response <int> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,…
People
ID: Customer’s unique identifierYear_Birth: Customer’s birth yearEducation: Customer’s education levelMarital_Status: Customer’s marital statusIncome: Customer’s yearly household incomeKidhome: Number of children in customer’s
householdTeenhome: Number of teenagers in customer’s
householdDt_Customer: Date of customer’s enrollment with the
companyRecency: Number of days since customer’s last
purchaseComplain: 1 if the customer complained in the last 2
years, 0 otherwiseProducts
MntWines: Amount spent on wine in last 2 yearsMntFruits: Amount spent on fruits in last 2 yearsMntMeatProducts: Amount spent on meat in last 2
yearsMntFishProducts: Amount spent on fish in last 2
yearsMntSweetProducts: Amount spent on sweets in last 2
yearsMntGoldProds: Amount spent on gold in last 2 yearsPromotion
NumDealsPurchases: Number of purchases made with a
discountAcceptedCmp1: 1 if customer accepted the offer in the
1st campaign, 0 otherwiseAcceptedCmp2: 1 if customer accepted the offer in the
2nd campaign, 0 otherwiseAcceptedCmp3: 1 if customer accepted the offer in the
3rd campaign, 0 otherwiseAcceptedCmp4: 1 if customer accepted the offer in the
4th campaign, 0 otherwiseAcceptedCmp5: 1 if customer accepted the offer in the
5th campaign, 0 otherwiseResponse: 1 if customer accepted the offer in the last
campaign, 0 otherwisePlace
NumWebPurchases: Number of purchases made through the
company’s websiteNumCatalogPurchases: Number of purchases made using a
catalogueNumStorePurchases: Number of purchases made directly in
storesNumWebVisitsMonth: Number of visits to company’s
website in the last month#factor data type
customers <- customers %>% mutate_at(vars(Complain, AcceptedCmp1, AcceptedCmp2, AcceptedCmp3, AcceptedCmp4, AcceptedCmp5, Response), as.factor)
#date data type
customers$Dt_Customer <- dmy(customers$Dt_Customer)#remove Id, Z_CostContact, Z_Revenue column
customers <- customers %>% select(-ID, -Z_Revenue, -Z_CostContact)## [1] "Single" "Together" "Married" "Divorced" "Widow" "Alone" "Absurd"
## [8] "YOLO"
## [1] "Graduation" "PhD" "Master" "Basic" "2n Cycle"
Since the labels are not efficient from the dataset, we will simplify
Marital_status categories into : Single and
Couple, and for Education we will simplify the
categories into: Undergraduate and Postgraduate.
customers <- customers %>%
# Marital Status
mutate(Marital_Status = if_else(Marital_Status %in% c("Single", "Divorced", "Widow", "YOLO", "Alone", "Absurd"), "Single", Marital_Status)) %>%
mutate(Marital_Status = if_else(Marital_Status %in% c("Together", "Married"),"Couple", Marital_Status)) %>%
# Education
mutate(Education = if_else(Education %in% c("Graduation", "Master", "PhD"), "Postgraduate", Education)) %>%
mutate(Education = if_else(Education %in% c("2n Cycle", "Basic"), "Undergraduate", Education)) %>%
mutate(Marital_Status = as.factor(Marital_Status),
Education = as.factor(Education))For columns Kidhome and Teenhome will be
combined into Children:
customers <- customers %>%
mutate(Children = Kidhome + Teenhome) %>%
select(-Kidhome, -Teenhome)
customers <- customers %>% mutate(Children = as.factor(Children))We will create customer age variable based on the columns
Year_Birth and Dt_Customer:
## [1] "2012-07-30" "2014-06-29"
The last recorded membership enrollment is the year 2014, so we will assume that the data gathered was from this year.
#rename variables
customers <- customers %>%
mutate(Wine = MntWines,
Fruit = MntFruits,
Meat = MntMeatProducts,
Fish = MntFishProducts,
Sweet = MntSweetProducts,
Gold = MntGoldProds) %>%
select(-c(MntWines, MntFruits, MntMeatProducts, MntFishProducts,
MntSweetProducts,MntGoldProds)) %>%
mutate(Deals = NumDealsPurchases,
Web = NumWebPurchases,
Catalog = NumCatalogPurchases,
Store = NumStorePurchases,
WebVisitsMonth = NumWebVisitsMonth) %>%
select(-c(NumDealsPurchases, NumWebPurchases,
NumCatalogPurchases, NumStorePurchases,
NumStorePurchases,NumWebVisitsMonth))## Education Marital_Status Income Dt_Customer Recency
## 0 0 24 0 0
## AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2
## 0 0 0 0 0
## Complain Response Children Age Wine
## 0 0 0 0 0
## Fruit Meat Fish Sweet Gold
## 0 0 0 0 0
## Deals Web Catalog Store WebVisitsMonth
## 0 0 0 0 0
## [1] FALSE
After cleansing and feature engineering, our demographic variables
are: Education, Marital_Status,
Income, Age, and
Children_Home.
Customers Age
## [1] 18 121
customers$Age_group <- cut(customers$Age, breaks=c(18, 30, 40, 50, 60, Inf), labels=c("18-30", "31-40", "41-50", "51-60", "60+"))##
## 18-30 31-40 41-50 51-60 60+
## 11.51762 26.37760 28.95212 21.09304 12.05962
Majority of the customers are adults (approx. 75%), with young adults and elderly being the minority.
Customers Income
hist(customers$Income,
main = "Income Distribution",
xlab = "Income ($)",
col = "lightblue",
breaks = 50)
abline(v = mean(customers$Income), col = "red", lty = 2, lwd = 1)There seems to be a case in which one of the customers fall out of the majority income range (an outlier). The mean income is at $52,247.25.
Customers Education Status
Majority of customers education status is Postgraduate, which confirms the age distribution.
Customers Marital Status and Number of Children at Home
barplot(table(customers$Children, customers$Marital_Status),main="Marital Status and Number of Children", beside=TRUE, col=c("lightblue","skyblue3", "steelblue4","midnightblue"), legend=TRUE)There are more customers who are a couple (64.5%) than those who are single. Of which, 50% of these customers at least has one child.
customers_num <- customers %>% select_if(is.numeric)
quali <- customers %>% select_if(is.factor) %>% select(-c(Age_group,))
qualivar <- quali %>% colnames()
summary(customers_num)## Income Recency Age Wine
## Min. : 1730 Min. : 0.00 Min. : 18.00 Min. : 0.0
## 1st Qu.: 35303 1st Qu.:24.00 1st Qu.: 37.00 1st Qu.: 24.0
## Median : 51382 Median :49.00 Median : 44.00 Median : 174.5
## Mean : 52247 Mean :49.01 Mean : 45.18 Mean : 305.1
## 3rd Qu.: 68522 3rd Qu.:74.00 3rd Qu.: 55.00 3rd Qu.: 505.0
## Max. :666666 Max. :99.00 Max. :121.00 Max. :1493.0
## Fruit Meat Fish Sweet
## Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 2.00 1st Qu.: 16.0 1st Qu.: 3.00 1st Qu.: 1.00
## Median : 8.00 Median : 68.0 Median : 12.00 Median : 8.00
## Mean : 26.36 Mean : 167.0 Mean : 37.64 Mean : 27.03
## 3rd Qu.: 33.00 3rd Qu.: 232.2 3rd Qu.: 50.00 3rd Qu.: 33.00
## Max. :199.00 Max. :1725.0 Max. :259.00 Max. :262.00
## Gold Deals Web Catalog
## Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 9.00 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.: 0.000
## Median : 24.50 Median : 2.000 Median : 4.000 Median : 2.000
## Mean : 43.97 Mean : 2.324 Mean : 4.085 Mean : 2.671
## 3rd Qu.: 56.00 3rd Qu.: 3.000 3rd Qu.: 6.000 3rd Qu.: 4.000
## Max. :321.00 Max. :15.000 Max. :27.000 Max. :28.000
## Store WebVisitsMonth
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 3.000 1st Qu.: 3.000
## Median : 5.000 Median : 6.000
## Mean : 5.801 Mean : 5.319
## 3rd Qu.: 8.000 3rd Qu.: 7.000
## Max. :13.000 Max. :20.000
## Income Recency Age Wine
## Income 633683788.576 -2892.83658596 48814.621201 4913651.567032
## Recency -2892.837 838.00706375 5.653712 153.517126
## Age 48814.621 5.65371157 143.653507 644.670673
## Wine 4913651.567 153.51712642 644.670673 113790.125690
## Fruit 431589.325 -6.73181459 8.464241 5195.253148
## Meat 3300781.480 146.19869694 90.582220 43038.246873
## Fish 604886.901 0.87320371 26.528317 7345.671585
## Sweet 455689.305 29.85471392 9.946052 5407.865525
## Gold 425110.225 26.49344150 39.875318 6864.460972
## Deals -4024.231 0.11780606 1.352697 5.766284
## Web 26762.860 -0.44757927 5.028016 512.030107
## Catalog 43406.478 2.04027593 4.271300 626.671684
## Store 43318.897 -0.04082519 4.982937 701.824559
## WebVisitsMonth -33768.090 -1.30335585 -3.601796 -263.423371
## Fruit Meat Fish Sweet
## Income 431589.325364 3300781.48003 604886.9010551 455689.305153
## Recency -6.731815 146.19870 0.8732037 29.854714
## Age 8.464241 90.58222 26.5283165 9.946052
## Wine 5195.253148 43038.24687 7345.6715847 5407.865525
## Fruit 1583.555792 4889.37300 1292.9674518 934.242986
## Meat 4889.373004 50302.98644 7043.4752771 4929.552262
## Fish 1292.967452 7043.47528 2997.7905288 1312.988348
## Sweet 934.242986 4929.55226 1312.9883482 1686.912935
## Gold 817.531565 4177.24410 1211.8027530 760.712065
## Deals -10.297194 -52.33909 -15.0872165 -9.594450
## Web 32.944337 188.78364 44.9749377 37.593472
## Catalog 56.633209 481.89347 85.3714836 59.518760
## Store 59.311068 354.34502 81.4727753 60.780016
## WebVisitsMonth -40.413420 -293.46191 -59.2820805 -42.074230
## Gold Deals Web Catalog
## Income 425110.224539 -4024.23079940 26762.8598168 43406.47821080
## Recency 26.493442 0.11780606 -0.4475793 2.04027593
## Age 39.875318 1.35269739 5.0280158 4.27130045
## Wine 6864.460972 5.76628359 512.0301067 626.67168449
## Fruit 817.531565 -10.29719442 32.9443373 56.63320913
## Meat 4177.244103 -52.33909165 188.7836423 481.89347145
## Fish 1211.802753 -15.08721651 44.9749377 85.37148361
## Sweet 760.712065 -9.59445038 37.5934725 59.51876034
## Gold 2684.837167 5.17377599 57.8128972 67.09420732
## Deals 5.173776 3.70068189 1.2730690 -0.06822921
## Web 57.812897 1.27306904 7.5128128 3.10346526
## Catalog 67.094207 -0.06822921 3.1034653 8.56576978
## Store 65.553804 0.41340324 4.5998250 4.92682787
## WebVisitsMonth -31.127510 1.61455799 -0.3405414 -3.70538032
## Store WebVisitsMonth
## Income 43318.89744787 -33768.0897098
## Recency -0.04082519 -1.3033559
## Age 4.98293674 -3.6017965
## Wine 701.82455913 -263.4233706
## Fruit 59.31106849 -40.4134199
## Meat 354.34501532 -293.4619091
## Fish 81.47277526 -59.2820805
## Sweet 60.78001646 -42.0742297
## Gold 65.55380426 -31.1275096
## Deals 0.41340324 1.6145580
## Web 4.59982500 -0.3405414
## Catalog 4.92682787 -3.7053803
## Store 10.56760172 -3.4091656
## WebVisitsMonth -3.40916564 5.8823641
Our data has high variance and has different scales. There are a few
variables that noticeably have outliers in the data, which can be
observed in the vast difference between their max and 3rd quartile
values (Income, Meat, Wine,
Gold, etc.).
We want to see if there is a high correlation between variables in
customers_num. If the variables in our data are highly
correlated, it may suggest redundancy in the data , which confirms our
need for Principle Component Analysis (PCA) in order to mitigate
multicollinearity, reduce dimensionality and improve cluster
quality.
Since the correlation map above shows that there are variables with
high correlation (Sweet & Fish, etc.), we
will use PCA to reduce the data dimensionality.
ggplot(no_outliers, aes(Age_group, Income, fill = Age_group)) +
geom_boxplot(show.legend = F) +
scale_fill_brewer()+
theme_minimal() +
labs(title = "Income by Age Group")From the boxplot above, we can observe that as we go up the age group, the average for income increases.
# Convert wide to long format using pivot_longer
long_purchase <- no_outliers %>%
pivot_longer(cols = c(Wine,Fruit,Meat,Fish,Sweet,Gold), names_to = "Purchase_Type", values_to = "Purchase_Amt")
# Plot relationship between Product, Purchase Amount and Income
ggplot(long_purchase, aes(Purchase_Amt, Income, color = Purchase_Type)) +
geom_point(alpha = 0.5) +
theme_minimal() +
labs(title = "Income and Product Purchase", color = "Product", x = "Spending")+
scale_color_brewer(palette = "Set3")Based on the scatterplot for
Income and Product Purchase, there is a lot of overlap for
products with fish, fruit, gold,
and sweet , thus we are unable to discern any significant
patterns for purchases with these products. But, for wine
and meat, we can observe a potential pattern for
clustering, in which those with higher income tend to spend more on
these two products.
ggplot(long_purchase, aes(x = Age_group, y = Purchase_Amt, fill = Purchase_Type)) +
geom_bar(stat = "identity", position = "stack") +
labs(title = "Product Purchase by Age Group", fill = "Product", x = "Age Group",
y = "Spending") +
scale_fill_brewer()+
theme_minimal()Based on this stacked barplot of
Product Purchase by Age Group, the age group that has the
highest amount of spending in total is 41-50 years old. Across all age
groups the product that customers spend the most on is Wine and
Meat.
# Convert wide to long format using pivot_longer
long_method <- no_outliers %>%
pivot_longer(cols = c(Deals, Web, Store, Catalog), names_to = "Purchase_Method", values_to = "Qty")
# Plot relationship between Product, Purchase Amount and Income
ggplot(long_method, aes(Qty, Income, color = Purchase_Method)) +
geom_point(alpha = 0.5) +
theme_minimal() +
labs(title = "Income and Purchase Method", color = "Method", x = "Number of Purchases")+
scale_color_brewer(palette = "Set3")From this scatterplot on Income and Purchase Method, we
can observe a few patterns such as the catalog method being
one of the least inefficient methods of attracting customers to
purchase. Customers tend to purchase more from the physical
store. There doesn’t seem to be a discernible pattern
between income and the method of purchase.
The numerical variables are now scaled but we still have outliers in our data. Let’s handle it using windsorize to handle outliers and preserve overall structure of data.
Our data seems better now, we no longer have extreme outliers.
#Encode Marital Status
quali$Single <- ifelse(quali$Marital_Status == "Single", 1, 0)
quali <- quali %>% select(-Marital_Status)
#Encode Education
quali$Postgraduate <- ifelse(quali$Education == "Postgraduate", 1, 0)
quali <- quali %>% select(-Education)
#Encode Children
quali$Children <- ifelse(quali$Children != 0, 1, 0)
quali <- quali %>% mutate_at(vars(Complain, AcceptedCmp1, AcceptedCmp2, AcceptedCmp3, AcceptedCmp4, AcceptedCmp5, Response), as.numeric)Make PCA object
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 6.4488455 40.3052841 40.30528
## comp 2 1.6638377 10.3989855 50.70427
## comp 3 1.2084139 7.5525866 58.25686
## comp 4 1.0046245 6.2789029 64.53576
## comp 5 1.0006563 6.2541016 70.78986
## comp 6 0.8720914 5.4505715 76.24043
## comp 7 0.6930427 4.3315168 80.57195
## comp 8 0.5856672 3.6604200 84.23237
## comp 9 0.5285247 3.3032796 87.53565
## comp 10 0.4219360 2.6370998 90.17275
## comp 11 0.3572397 2.2327483 92.40550
## comp 12 0.3322164 2.0763527 94.48185
## comp 13 0.3106524 1.9415777 96.42343
## comp 14 0.2261436 1.4133974 97.83682
## comp 15 0.2013635 1.2585218 99.09535
## comp 16 0.1447446 0.9046536 100.00000
Through PCA, we can gain information on principal components (high in cumulative variance) and perform dimensionality reduction (reduce dimension of the data while maintaining as much info as possible). In this case we would like to retain at least 85% of information from our data, thus based on the summary above we will pick PC1-PC7, since cumulative percentage of variance for PC7 meets the minimum of 85%. We will be able to reduce our data by 50% while retaining 85% of the information from our data.
Let’s extract values PC1-PC7 and make a new dataframe
fviz_nbclust(x = keepnum , # data
FUNcluster = kmeans, # algoritma klustering yg digunakan
method = "wss" ) Using the elbow method, we can observe that the optimum k is 3 clusters since there are no longer any significant declines of total within sum of square past k = 3.
Looking at the cluster size, they are not proportional to each other and have a considerably low ratio between the Between-Cluster Sum of Squares (Between_SS) and the Total Sum of Squares (Total_SS).
Let’s reconstruct our PCA object and bind it with our qualitative data, also adding cluster as a new variable.
reconpca <- reconst(pca,
ncp = 7) %>% as.data.frame() %>% select(-Single, -Postgraduate)
reconpca<- cbind(reconpca, quali)
reconpca$cluster <- clusterpca$cluster %>% as.factor()
head(reconpca)customer_centroids %>%
pivot_longer(-cluster) %>%
group_by(name) %>%
summarize(
kelompok_min = which.min(value),
kelompok_max = which.max(value))Based from our clustering model, we can observe these characteristics in the 3 clusters: Cluster 1 - Oldest customers,Prone to purchase with deals,Postgraduate education, Top spender through Web
Cluster 2 - Top spenders in all product categories, highest income
Cluster 3 - Youngest customers, Most complains, lowest income