Source: awkn.pro
The aim of this analysis is to segment customers using clustering methods. Correct customer segmentation can be extremely valuable for companies: it allows them to better understand their customers’ needs and preferences, and it enables more effective and targeted marketing activities, since product and service offerings can be tailored to a specific group of customers, which increases the likelihood of their interest and purchase. Finally, insightful customer segmentation allows for a more efficient use of company resources and a better understanding of the market.
Both clustering and dimension reduction methods were used in the analysis. First, the available data was examined: redundant variables were removed, some variables were transformed, and outliers were discarded. The data was then normalised and scaled for the subsequent dimension reduction and clustering analysis.
The data used in the analysis was taken from Kaggle: https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis
Before cleaning, the data contains 29 variables and 2240 observations. The dataset describes the customers of a certain company. Among other things, it includes a range of personal customer information, such as year of birth, level of education, marital status and income. It also contains the expenditure on each product group over the last two years, the number of transactions broken down by channel, and information on accepted marketing campaigns. All variables and their descriptions can be found in the tables below.
| Variable | Description |
|---|---|
| ID | Customer’s unique identifier |
| Year_Birth | Customer’s birth year |
| Education | Customer’s education level |
| Marital_Status | Customer’s marital status |
| Income | Customer’s yearly household income |
| Kidhome | Number of children in customer’s household |
| Teenhome | Number of teenagers in customer’s household |
| Dt_Customer | Date of customer’s enrollment with the company |
| Recency | Number of days since customer’s last purchase |
| Complain | 1 if the customer complained in the last 2 years, 0 otherwise |
| Variable | Description |
|---|---|
| MntWines | Amount spent on wine in last 2 years |
| MntFruits | Amount spent on fruits in last 2 years |
| MntMeatProducts | Amount spent on meat in last 2 years |
| MntFishProducts | Amount spent on fish in last 2 years |
| MntSweetProducts | Amount spent on sweets in last 2 years |
| MntGoldProds | Amount spent on gold in last 2 years |
| Variable | Description |
|---|---|
| NumDealsPurchases | Number of purchases made with a discount |
| AcceptedCmp1 | 1 if customer accepted the offer in the 1st campaign, 0 otherwise |
| AcceptedCmp2 | 1 if customer accepted the offer in the 2nd campaign, 0 otherwise |
| AcceptedCmp3 | 1 if customer accepted the offer in the 3rd campaign, 0 otherwise |
| AcceptedCmp4 | 1 if customer accepted the offer in the 4th campaign, 0 otherwise |
| AcceptedCmp5 | 1 if customer accepted the offer in the 5th campaign, 0 otherwise |
| Response | 1 if customer accepted the offer in the last campaign, 0 otherwise |
| Variable | Description |
|---|---|
| NumDealsPurchases | Number of purchases made on deals |
| NumWebPurchases | Number of purchases made through the company’s website |
| NumCatalogPurchases | Number of purchases made using a catalogue |
| NumStorePurchases | Number of purchases made directly in stores |
| NumWebVisitsMonth | Number of visits to company’s website in the last month |
data=read.table("USL_clustering_marketing_campaign.csv", sep = ",", header = T)
customers=data.frame(data)
head(customers)
Searching for missing data:
customers[!complete.cases(customers),]
customers <- na.omit(customers)
There are only 24 rows with missing values, so I decided to discard them.
str(customers)
'data.frame': 2216 obs. of 29 variables:
$ ID : int 5524 2174 4141 6182 5324 7446 965 6177 4855 5899 ...
$ Year_Birth : int 1957 1954 1965 1984 1981 1967 1971 1985 1974 1950 ...
$ Education : chr "Graduation" "Graduation" "Graduation" "Graduation" ...
$ Marital_Status : chr "Single" "Single" "Together" "Together" ...
$ Income : num 58138 46344 71613 26646 58293 ...
$ Kidhome : int 0 1 0 1 1 0 0 1 1 1 ...
$ Teenhome : int 0 1 0 0 0 1 1 0 0 1 ...
$ Dt_Customer : chr "2012-04-09" "2014-08-03" "2013-08-21" "2014-10-02" ...
$ Recency : int 58 38 26 26 94 16 34 32 19 68 ...
$ MntWines : int 635 11 426 11 173 520 235 76 14 28 ...
$ MntFruits : int 88 1 49 4 43 42 65 10 0 0 ...
$ MntMeatProducts : int 546 6 127 20 118 98 164 56 24 6 ...
$ MntFishProducts : int 172 2 111 10 46 0 50 3 3 1 ...
$ MntSweetProducts : int 88 1 21 3 27 42 49 1 3 1 ...
$ MntGoldProds : int 88 6 42 5 15 14 27 23 2 13 ...
$ NumDealsPurchases : int 3 2 1 2 5 2 4 2 1 1 ...
$ NumWebPurchases : int 8 1 8 2 5 6 7 4 3 1 ...
$ NumCatalogPurchases: int 10 1 2 0 3 4 3 0 0 0 ...
$ NumStorePurchases : int 4 2 10 4 6 10 7 4 2 0 ...
$ NumWebVisitsMonth : int 7 5 4 6 5 6 6 8 9 20 ...
$ AcceptedCmp3 : int 0 0 0 0 0 0 0 0 0 1 ...
$ AcceptedCmp4 : int 0 0 0 0 0 0 0 0 0 0 ...
$ AcceptedCmp5 : int 0 0 0 0 0 0 0 0 0 0 ...
$ AcceptedCmp1 : int 0 0 0 0 0 0 0 0 0 0 ...
$ AcceptedCmp2 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Complain : int 0 0 0 0 0 0 0 0 0 0 ...
$ Z_CostContact : int 3 3 3 3 3 3 3 3 3 3 ...
$ Z_Revenue : int 11 11 11 11 11 11 11 11 11 11 ...
$ Response : int 1 0 0 0 0 0 0 0 1 0 ...
- attr(*, "na.action")= 'omit' Named int [1:24] 11 28 44 49 59 72 91 92 93 129 ...
..- attr(*, "names")= chr [1:24] "11" "28" "44" "49" ...
The ‘Dt_Customer’ variable, which indicates the date the customer was added to the database, should be in Date format. It is then worth creating a new variable ‘client_days’ measuring how long the customer has been in the system, computed here as the number of days since the earliest enrollment date in the data.
customers$Dt_Customer <- as.Date(customers$Dt_Customer)
customers$client_days <- as.numeric(customers$Dt_Customer - min(customers$Dt_Customer))
Another variable to be improved is Education. It is stored as a character variable, but we would like it to be a categorical variable (factor).
customers %>% group_by(Education) %>% tally()
Looking at the distribution of the Education variable, it is worth reducing the number of its levels. In my opinion, three levels are sufficient: Undergraduate, Graduate and Postgraduate.
customers[customers$Education=='Basic',]$Education <- 'Undergraduate'
customers[(customers$Education=='Graduation' | customers$Education=='2n Cycle'),]$Education <- 'Graduate'
customers[(customers$Education=='Master' | customers$Education=='PhD'),]$Education <- 'Postgraduate'
customers$Education <- factor(customers$Education,
levels = c('Undergraduate', 'Graduate', 'Postgraduate'),
labels = c('Undergraduate', 'Graduate', 'Postgraduate'),
ordered = T)
customers %>% group_by(Education) %>% tally()
Similarly, the Marital_Status variable should be corrected.
customers %>% group_by(Marital_Status) %>% tally()
filter(customers, Marital_Status=='YOLO' | Marital_Status=='Absurd')
The observations with Marital_Status ‘YOLO’ appear to be duplicates. As there are very few observations with marital status YOLO or Absurd, I decided to discard them. The remaining levels of the variable will be aggregated into two: Single and In_relationship.
customers<-customers[!(customers$Marital_Status=='YOLO' | customers$Marital_Status=='Absurd'),]
customers[(customers$Marital_Status=='Alone' | customers$Marital_Status=='Divorced' | customers$Marital_Status=='Widow'),]$Marital_Status <- 'Single'
customers[(customers$Marital_Status=='Married'| customers$Marital_Status=='Together'),]$Marital_Status <- 'In_relationship'
customers$Marital_Status <- factor(customers$Marital_Status)
customers %>% group_by(Marital_Status) %>% tally()
The other variables in the dataset appear to have appropriate types. However, before proceeding with further analysis, it is worth supplementing the dataset with some useful new variables.
customers$Age <- 2014 - customers$Year_Birth  # 2014 is the most recent year in the data
customers$Cmp_accepted <- customers$AcceptedCmp1 + customers$AcceptedCmp2 + customers$AcceptedCmp3 + customers$AcceptedCmp4 + customers$AcceptedCmp5
customers$Children <- customers$Teenhome + customers$Kidhome
customers$Money_spent <- customers$MntWines + customers$MntFruits + customers$MntMeatProducts + customers$MntFishProducts + customers$MntSweetProducts + customers$MntGoldProds
customers$Number_purchases <- customers$NumWebPurchases + customers$NumCatalogPurchases + customers$NumStorePurchases
customers$Number_deals_purchases <- customers$NumDealsPurchases
customers$Number_web_visits_month <- customers$NumWebVisitsMonth
The newly created variables are:
Age - the customer’s age (as of 2014)
Cmp_accepted - the number of accepted marketing campaign offers
Children - the number of children and teenagers in the household
Money_spent - the total spent on the listed products in the last 2 years
Number_purchases - the total number of web, catalogue and store purchases
Number_deals_purchases - the number of purchases made with a discount
Number_web_visits_month - the number of visits to the company’s website in the last month
Checking whether the dataset contains any outlier observations.
describe(customers)
Warning: no non-missing arguments to min; returning Inf
Warning: no non-missing arguments to max; returning -Inf
Looking at the statistics, potential outliers are noticeable in the Age and Income variables. It is worth verifying them graphically.
ggplot(customers, aes(Age)) +
geom_histogram(aes(y = ..density..), color = "#000000", fill = "#0099F8") +
geom_density(color = "#000000", fill = "#F85700", alpha = 0.6)
ggplot(customers, aes(y = Age)) +
geom_boxplot(fill = '#0099F8', alpha = 0.5, color = 1, outlier.colour = 2) +
theme_bw()
boxplot.stats(customers$Age)$out
[1] 114 121 115
customers <- customers[customers$Age < 114,]
ggplot(customers, aes(Income)) +
geom_histogram(aes(y = ..density..), color = "#000000", fill = "#0099F8") +
geom_density(color = "#000000", fill = "#F85700", alpha = 0.6)
ggplot(customers, aes(y = Income)) +
geom_boxplot(fill = '#0099F8', alpha = 0.5, color = 1, outlier.colour = 2) +
theme_bw()
boxplot.stats(customers$Income)$out
[1] 157243 162397 153924 160803 157733 157146 156924 666666
customers <- customers[customers$Income < 153924,]
It is also worth considering the suitability of some of the variables for further analysis.
hist(customers$Z_Revenue, col = "#0099F8")
hist(customers$Z_CostContact, col = "#0099F8")
hist(customers$Complain, col = "#0099F8")
The variables Z_Revenue and Z_CostContact have a fixed value for all observations, so we can safely remove them from the dataset. The variable Complain, on the other hand, takes the value 1 for too few observations to be worth retaining.
#Removing unnecessary variables from the dataset for further analysis
customers_df <- customers[c(-1,-2,-8,-26,-27,-28,-29)]
corr_dim_red <- customers_df[c(-1,-2,-25,-26,-27,-28,-29,-30)]
ggcorr(corr_dim_red, method = c("everything", "pearson"), label = TRUE, label_size = 1.35, label_round = 3, hjust = 0.85, size = 1.75)
ggcorr(corr_dim_red, method = c("everything", "spearman"), label = TRUE, label_size = 1.35, label_round = 3, hjust = 0.85, size = 1.75)
As can be seen from the graphs, quite a few of the variables are correlated. Dimension reduction will therefore not only provide some insight into the data, but will also help to get rid of the high correlation between variables.
# customers_dim_red holds the variables kept for dimension reduction; its
# definition is missing from the source. Judging by the scaled summary below,
# Income and Kidhome are dropped along with the aggregate variables:
customers_dim_red <- customers_df[c(-3,-4,-25,-26,-27,-28,-29,-30)]
customers_dim_red$Edu_ug <- 0
customers_dim_red[customers$Education == 'Undergraduate',]$Edu_ug <- 1
customers_dim_red$Edu_g <- 0
customers_dim_red[customers$Education == 'Graduate',]$Edu_g <- 1
customers_dim_red$Edu_pg <- 0
customers_dim_red[customers$Education == 'Postgraduate',]$Edu_pg <- 1
customers_dim_red$Marital_status <- 0
customers_dim_red[customers$Marital_Status == 'In_relationship',]$Marital_status <- 1
customers_dim_red <- customers_dim_red[c(-1,-2)]
preproc2 <- preProcess(customers_dim_red, method=c("center", "scale"))
customers_dim_red.s <- predict(preproc2, customers_dim_red)
summary(customers_dim_red.s)
Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases
Min. :-0.9304 Min. :-1.695700 Min. :-0.9062 Min. :-0.6631 Min. :-0.7587 Min. :-0.6889 Min. :-0.6597 Min. :-0.8520 Min. :-1.2282 Min. :-1.4967
1st Qu.:-0.9304 1st Qu.:-0.865953 1st Qu.:-0.8352 1st Qu.:-0.6128 1st Qu.:-0.6853 1st Qu.:-0.6340 1st Qu.:-0.6354 1st Qu.:-0.6774 1st Qu.:-0.6980 1st Qu.:-0.7664
Median :-0.9304 Median :-0.001634 Median :-0.3852 Median :-0.4619 Median :-0.4466 Median :-0.4692 Median :-0.4653 Median :-0.3669 Median :-0.1677 Median :-0.0360
Mean : 0.0000 Mean : 0.000000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.9062 3rd Qu.: 0.862686 3rd Qu.: 0.5948 3rd Qu.: 0.1667 3rd Qu.: 0.3063 3rd Qu.: 0.2263 3rd Qu.: 0.1665 3rd Qu.: 0.2345 3rd Qu.: 0.3626 3rd Qu.: 0.6943
Max. : 2.7429 Max. : 1.727005 Max. : 3.5138 Max. : 4.3404 Max. : 7.1600 Max. : 4.0520 Max. : 5.7071 Max. : 5.3758 Max. : 6.7259 Max. : 8.3630
NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 client_days Age
Min. :-0.9451 Min. :-1.7946 Min. :-2.2135 Min. :-0.2827 Min. :-0.2837 Min. :-0.2799 Min. :-0.2616 Min. :-0.1175 Min. :-2.371570 Min. :-2.31631
1st Qu.:-0.9451 1st Qu.:-0.8700 1st Qu.:-0.9695 1st Qu.:-0.2827 1st Qu.:-0.2837 1st Qu.:-0.2799 1st Qu.:-0.2616 1st Qu.:-0.1175 1st Qu.:-0.749124 1st Qu.:-0.69263
Median :-0.2298 Median :-0.2536 Median : 0.2745 Median :-0.2827 Median :-0.2837 Median :-0.2799 Median :-0.2616 Median :-0.1175 Median :-0.004607 Median :-0.09443
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.000000 Mean : 0.00000
3rd Qu.: 0.4856 3rd Qu.: 0.6710 3rd Qu.: 0.6892 3rd Qu.:-0.2827 3rd Qu.:-0.2837 3rd Qu.:-0.2799 3rd Qu.:-0.2616 3rd Qu.:-0.1175 3rd Qu.: 0.739911 3rd Qu.: 0.84560
Max. : 9.0695 Max. : 2.2120 Max. : 6.0798 Max. : 3.5352 Max. : 3.5235 Max. : 3.5708 Max. : 3.8214 Max. : 8.5049 Max. : 2.203124 Max. : 2.46929
Edu_ug Edu_g Edu_pg Marital_status
Min. :-0.1586 Min. :-1.2123 Min. :-0.7832 Min. :-1.351
1st Qu.:-0.1586 1st Qu.:-1.2123 1st Qu.:-0.7832 1st Qu.:-1.351
Median :-0.1586 Median : 0.8245 Median :-0.7832 Median : 0.740
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.000
3rd Qu.:-0.1586 3rd Qu.: 0.8245 3rd Qu.: 1.2763 3rd Qu.: 0.740
Max. : 6.3041 Max. : 0.8245 Max. : 1.2763 Max. : 0.740
Multidimensional Scaling (MDS) simplifies complex data by representing it in a low-dimensional space while preserving the pairwise distances between observations as well as possible. Even with many variables involved, MDS can present the data in a more manageable form. It also makes outliers easy to spot, which makes it a useful step in the data analysis process. Here, the data is transformed from 24 dimensions into a more comprehensible two-dimensional format.
dist.customers<-dist(customers_dim_red.s)
mds1<-cmdscale(dist.customers, k=2)
summary(mds1)
V1 V2
Min. :-4.7315 Min. :-5.2298
1st Qu.:-2.1042 1st Qu.:-1.1029
Median :-0.7606 Median : 0.1407
Mean : 0.0000 Mean : 0.0000
3rd Qu.: 1.9036 3rd Qu.: 1.2791
Max. : 7.5714 Max. : 3.2274
plot(mds1)
abline(v=-4)
abline(h=-4.5)
x.out<-which(mds1[,1]< -4)
y.out<-which(mds1[,2]< -4.5)
out.all<-c(x.out, y.out)
out.uni<-unique(out.all)
points(mds1[out.uni,], pch=4, col="red", cex=2)
customers_dim_red.s_2 <- customers_dim_red.s
full<-1:2201
limited<-as.numeric(out.all)
customers_dim_red.s_2$mark<-full %in% limited
customers_dim_red.s_2 <- subset(customers_dim_red.s_2, customers_dim_red.s_2$mark == FALSE, select = -c(mark))
PCA (Principal Component Analysis), like MDS, is a statistical method used to reduce the dimensionality of large and complex datasets. The aim of PCA is to identify the underlying structure of the data and reduce it to a smaller set of uncorrelated variables, called principal components, which capture most of the information in the original data.
customers_pca.s <- customers_dim_red.s
customers_pca1<-prcomp(customers_pca.s, center=FALSE, scale.=FALSE)
customers_pca1
Standard deviations (1, .., p=24):
[1] 2.385535e+00 1.524139e+00 1.409266e+00 1.282481e+00 1.165746e+00 1.047859e+00 1.004184e+00 9.943085e-01 9.664366e-01 8.981268e-01 8.901675e-01 8.196812e-01 7.813397e-01 7.721401e-01 7.385156e-01
[16] 7.302660e-01 7.116891e-01 6.510803e-01 6.206232e-01 6.080506e-01 5.368458e-01 4.983573e-01 4.350454e-01 2.656610e-15
Rotation (n x k) = (24 x 24):
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14
Teenhome 0.073185952 -0.341133856 0.320645247 -0.032845418 0.31229689 -0.123272536 0.023820214 -0.047492341 0.24005215 -0.02986569 0.051765316 -0.05347785 -0.156577743 -0.1894530237
Recency -0.007827026 0.006910237 0.039314118 0.005303605 0.04594928 0.348552473 0.480956802 -0.778170724 -0.09189307 -0.03384369 0.122481437 -0.08012419 -0.003891442 0.0396792246
MntWines -0.325104699 -0.241264169 -0.010221154 0.157065224 -0.04611474 0.033097363 0.004020872 0.004656491 -0.04619117 -0.03119975 0.033880396 0.21942886 -0.030686751 -0.0465769893
MntFruits -0.294236781 0.143093220 0.049030963 -0.180736660 -0.05489847 0.028552652 -0.018012615 0.053531498 0.02561488 0.08921164 0.003197363 -0.15418085 -0.197664325 0.1886904245
MntMeatProducts -0.346283435 0.061446277 -0.083338395 -0.080423851 -0.04493856 0.044967260 0.012204253 -0.008085930 -0.08604396 -0.04834517 -0.059042219 0.04755970 -0.110398397 -0.1916050594
MntFishProducts -0.306644806 0.150946826 0.044588192 -0.165157274 -0.04659368 0.038545067 -0.023676906 0.032078735 0.05424469 0.04164063 -0.088872033 -0.25703788 -0.038025137 0.1910952981
MntSweetProducts -0.298423627 0.133633714 0.046946310 -0.129879520 -0.03865166 0.065852002 -0.018865292 0.011497341 0.03524294 -0.01114937 -0.010087239 -0.25277699 -0.287160417 0.0863810958
MntGoldProds -0.243547875 0.015019708 0.221015331 -0.017678987 -0.13546042 -0.195567690 0.057934937 -0.080899611 0.09669337 0.12067843 0.072363347 0.09026913 0.779244867 0.0585350706
NumDealsPurchases 0.058003152 -0.271547632 0.438799616 0.062136375 -0.12257758 -0.004668105 -0.054033355 0.036207152 -0.05316077 -0.09503608 0.304162806 -0.25735820 -0.168362567 -0.3477078866
NumWebPurchases -0.227955364 -0.242759089 0.287242311 0.031463994 -0.11124936 -0.046658590 -0.043710632 0.037930307 -0.04037265 -0.08643869 0.230723543 0.02470812 0.105509378 0.2947594593
NumCatalogPurchases -0.351624286 -0.050217157 -0.003873638 -0.043665266 -0.02846532 -0.073860403 0.038727780 -0.065144267 0.01816080 0.02720838 -0.030395052 0.08296407 -0.049395275 -0.1471509144
NumStorePurchases -0.305783572 -0.146050314 0.125327302 -0.077776312 0.05584569 0.120011131 -0.050843125 0.085555959 -0.01677967 0.15618580 0.194918340 0.12784665 -0.084244819 -0.0803875666
NumWebVisitsMonth 0.258875376 -0.101551933 0.226265342 0.231979459 -0.31517978 0.004716936 -0.012576649 0.029317220 -0.12395842 -0.10989314 -0.081794161 -0.15678400 0.001475785 0.2348634696
AcceptedCmp3 -0.022182292 0.014948201 -0.040062015 0.114635358 -0.29393841 -0.750506890 0.113902405 -0.298816455 0.05931115 0.05101144 -0.006116423 0.17467647 -0.350800791 0.1570016805
AcceptedCmp4 -0.097130384 -0.193852028 -0.090585569 0.474794224 0.14375892 0.265464091 0.026389897 0.134733475 0.03550017 -0.03881517 0.120688082 0.23109850 -0.153852814 0.5268564386
AcceptedCmp5 -0.205040110 -0.011109829 -0.250472865 0.352272253 0.01852175 -0.010489983 -0.047182781 -0.053309987 -0.03492185 -0.26037839 -0.024536163 0.25969523 0.021961774 -0.4279174401
AcceptedCmp1 -0.185864798 0.010627459 -0.178009701 0.318181565 0.01673699 -0.125869794 -0.067972657 -0.032903794 0.04584277 -0.51499882 -0.074107621 -0.58205996 0.159998275 0.0003375772
AcceptedCmp2 -0.060493303 -0.089248476 -0.113873101 0.458999259 0.04871356 -0.021618725 0.088382709 0.016815675 0.12602370 0.73562547 -0.103651733 -0.35996883 0.022177841 -0.1448927932
client_days 0.039460236 0.069300733 -0.289438582 -0.092512057 0.46729932 -0.270418384 -0.006703345 -0.004358483 0.11657258 -0.01173306 0.637028410 -0.12482735 0.056837882 0.1188384631
Age -0.051983752 -0.288738153 0.097546358 -0.128135854 0.38777469 -0.086919371 0.096931260 -0.064688143 0.40935157 -0.13239242 -0.538848290 0.03498719 0.027311621 0.1307741243
Edu_ug 0.067315913 0.086927128 -0.103378021 0.001958897 -0.41200317 0.231644369 -0.018223978 -0.004207145 0.80005748 -0.06831803 0.204744354 0.06203049 -0.036203060 -0.0685189225
Edu_g -0.035596849 0.458340262 0.381291969 0.246965133 0.26400760 -0.055115302 0.008869278 0.014488210 -0.04993636 -0.02750539 -0.038208473 0.09617673 -0.041128092 -0.0220934700
Edu_pg 0.014541191 -0.491140200 -0.352589998 -0.250336713 -0.13565093 -0.018089992 -0.003160484 -0.013308686 -0.20446350 0.04958237 -0.026612636 -0.11701398 0.053122494 0.0441742623
Marital_status 0.008282944 -0.019736489 0.017147116 0.027977229 0.06219102 0.073326393 -0.846219724 -0.500103315 0.01549563 0.11285110 -0.037292537 0.02107725 0.003797217 0.0699139223
PC15 PC16 PC17 PC18 PC19 PC20 PC21 PC22 PC23 PC24
Teenhome -0.15837681 0.25638251 -0.505636400 -0.058804914 0.389016980 -2.186702e-02 0.106797438 -0.1147730629 0.089627260 -2.574958e-15
Recency -0.01233471 -0.03411091 -0.005720900 -0.043256854 -0.023629420 3.750364e-02 -0.006565130 0.0021315031 0.003869343 5.848378e-17
MntWines 0.09419265 -0.16336709 -0.032200850 -0.094567073 0.113239937 -1.422494e-05 0.439326831 0.1685499318 -0.687581531 3.474832e-16
MntFruits -0.31831007 0.18379038 0.032307697 -0.723390153 -0.106858428 -2.357716e-01 -0.064027558 0.1118870277 -0.026935728 -7.809101e-16
MntMeatProducts 0.18294654 -0.01947767 0.235925051 -0.044002680 0.276036969 -1.962685e-01 0.045650244 -0.7612013717 0.081111581 -4.385008e-16
MntFishProducts 0.05105992 0.18670926 0.100984974 0.075571254 0.327376733 7.527154e-01 -0.002797024 0.0638082420 -0.029231024 5.195270e-16
MntSweetProducts -0.41680650 0.01075232 -0.084510711 0.625529280 -0.193303380 -2.559232e-01 0.182261676 -0.0002743741 -0.063659767 8.006048e-17
MntGoldProds -0.11757191 0.33846372 0.006330666 0.075822973 -0.112140097 -8.820847e-02 0.100924311 -0.1100990561 -0.004537891 3.472787e-17
NumDealsPurchases 0.09419086 0.25062127 0.465152393 0.034004274 -0.213977935 4.160248e-02 -0.164389673 0.0313271583 -0.150440364 -2.085176e-17
NumWebPurchases -0.16279114 -0.56257764 -0.038772735 0.041152227 0.229429129 -3.496007e-02 -0.474117603 -0.0147259887 0.041073044 1.112222e-16
NumCatalogPurchases 0.31918976 0.04628244 0.165835424 0.094387097 0.275765795 -2.939016e-01 0.082446157 0.5580791522 0.453683054 3.308902e-16
NumStorePurchases 0.20918533 -0.19940903 -0.283259288 -0.092820200 -0.552936686 2.946534e-01 0.209288942 -0.1125854103 0.349284564 3.358918e-18
NumWebVisitsMonth -0.18938318 -0.15563178 0.193269157 -0.119495282 0.126641591 3.737228e-02 0.598936597 -0.0656474317 0.333491504 7.817682e-17
AcceptedCmp3 0.07239344 0.06653415 -0.047785467 0.021866276 -0.129418137 9.140870e-02 -0.060306421 -0.0820424115 -0.018200559 9.346817e-18
AcceptedCmp4 0.09698400 0.43106124 0.063716914 0.099607034 -0.031823466 -5.818113e-02 -0.103725203 -0.0801288716 0.081231753 1.284028e-17
AcceptedCmp5 -0.56498313 0.01856657 0.102680209 -0.047837601 -0.015742300 2.586764e-01 -0.089931342 0.0637440840 0.182019612 3.291880e-16
AcceptedCmp1 0.26558696 -0.02786658 -0.262603955 -0.092704721 -0.133598433 -7.312160e-02 -0.035279518 -0.0077800155 -0.002998012 -7.854050e-17
AcceptedCmp2 -0.04631960 -0.14981273 0.046269797 -0.007890175 0.044252711 -2.190371e-02 -0.066204959 -0.0361998805 -0.005939678 1.923083e-17
client_days -0.07209616 -0.11899075 0.277044416 -0.022212308 0.058932686 1.659797e-02 0.235627309 -0.0145386344 0.034606570 1.069968e-17
Age -0.03155110 -0.16612330 0.374108016 -0.014506312 -0.227064846 2.721443e-02 0.008120800 -0.0289246150 -0.001520263 -2.093640e-18
Edu_ug 0.04517987 -0.06839611 -0.002787300 -0.020966576 0.034176812 -1.082798e-02 0.014338821 -0.0214677528 -0.001221529 2.186621e-01
Edu_g 0.05867348 -0.07918358 0.011959332 -0.013682832 0.003476073 -7.648098e-03 0.016963535 0.0157207076 -0.033274431 6.938013e-01
Edu_pg -0.07372372 0.10186032 -0.011204140 0.020516492 -0.014405912 1.118375e-02 -0.021721615 -0.0090544336 0.034033855 6.861681e-01
Marital_status 0.02042953 0.02016865 0.035248256 -0.001334638 0.008255286 -2.831734e-02 0.012187251 -0.0146429687 -0.003913106 3.623449e-17
fviz_eig(customers_pca1, addlabels = T)
fviz_pca_var(customers_pca1, col.var = "contrib", repel = TRUE, labels = TRUE)
var <- get_pca_var(customers_pca1)
a<-fviz_contrib(customers_pca1, "var", axes=1, xtickslab.rt=90)
b<-fviz_contrib(customers_pca1, "var", axes=2, xtickslab.rt=90)
grid.arrange(a, b, top='Contribution to the first two Principal Components')
fviz_eig(customers_pca1, choice='eigenvalue', addlabels = T)
eig.val <- get_eigenvalue(customers_pca1)
eig.val
cumulative_variance_1 <- summary(customers_pca1)
plot(cumulative_variance_1$importance[3,], type="l")
customers_pca2 <- principal(customers_pca.s, nfactors=12, rotate="varimax")
Warning: Matrix was not positive definite, smoothing was done
Warning: The matrix is not positive semi-definite, scores found from Structure loadings
customers_pca2
Principal Components Analysis
Call: principal(r = customers_pca.s, nfactors = 12, rotate = "varimax")
Standardized loadings (pattern matrix) based upon correlation matrix
RC1 RC2 RC4 RC3 RC5 RC11 RC6 RC9 RC7 RC8 RC10 RC12
SS loadings 5.16 2.03 1.79 1.70 1.30 1.14 1.12 1.06 1.00 1.00 1.00 1.00
Proportion Var 0.21 0.08 0.07 0.07 0.05 0.05 0.05 0.04 0.04 0.04 0.04 0.04
Cumulative Var 0.21 0.30 0.37 0.44 0.50 0.55 0.59 0.64 0.68 0.72 0.76 0.80
Proportion Explained 0.27 0.10 0.09 0.09 0.07 0.06 0.06 0.06 0.05 0.05 0.05 0.05
Cumulative Proportion 0.27 0.37 0.46 0.55 0.62 0.68 0.74 0.79 0.84 0.90 0.95 1.00
Mean item complexity = 1.5
Test of the hypothesis that 12 components are sufficient.
The root mean square of the residuals (RMSR) is 0.04
with the empirical chi square 1834.91 with prob < 0
Fit based upon off diagonal values = 0.97
print(loadings(customers_pca2), digits=3, cutoff=0.35, sort=TRUE)
Loadings:
RC1 RC2 RC4 RC3 RC5 RC11 RC6 RC9 RC7 RC8 RC10 RC12
MntWines 0.617 0.582
MntFruits 0.775
MntMeatProducts 0.790
MntFishProducts 0.796
MntSweetProducts 0.758
MntGoldProds 0.599
NumCatalogPurchases 0.788
NumStorePurchases 0.720
NumWebVisitsMonth -0.631 0.394 -0.414
Edu_g -0.977
Edu_pg 0.979
AcceptedCmp4 0.774
AcceptedCmp5 0.667
NumDealsPurchases 0.835
NumWebPurchases 0.504 0.565
Teenhome 0.534 0.603
Age 0.926
client_days 0.945
AcceptedCmp3 0.938
Edu_ug 0.987
Marital_status 0.997
Recency 0.998
AcceptedCmp2 0.958
AcceptedCmp1 0.868
RC1 RC2 RC4 RC3 RC5 RC11 RC6 RC9 RC7 RC8 RC10 RC12
SS loadings 5.157 2.026 1.790 1.704 1.302 1.141 1.116 1.064 1.004 1.003 1.000 0.998
Proportion Var 0.215 0.084 0.075 0.071 0.054 0.048 0.047 0.044 0.042 0.042 0.042 0.042
Cumulative Var 0.215 0.299 0.374 0.445 0.499 0.547 0.593 0.637 0.679 0.721 0.763 0.804
fviz_eig(customers_pca1, addlabels = T)
Twelve principal components account for almost 80% of the variance in the data. We can see how well the algorithm has dealt with correlated variables by assigning them to a single principal component. One can also try to interpret the principal components to better understand the data. Customers who generally spend a lot and make many purchases will have extreme scores on PC1 (with the loadings shown above, large negative values; the sign of a principal component is arbitrary). PC2, in turn, separates customers with a master’s or doctoral degree from those with a bachelor’s degree. Further principal components can be interpreted in a similar way to explore the relationships in the data highlighted by the PCA.
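As a quick sanity check on this reading of PC1, the component scores can be correlated with one of the spending variables; given the negative loading of MntWines on PC1, the correlation should come out strongly negative. This is an illustrative check, not part of the original pipeline.
# Correlate PC1 scores with (scaled) wine spending; expected to be strongly
# negative, matching MntWines' negative loading on PC1
cor(customers_pca1$x[, 1], customers_dim_red.s$MntWines)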
m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")
ac <- function(x) {
agnes(customers_dim_red.s, method = x)$ac
}
map_dbl(m, ac)
average single complete ward
0.8747472 0.8509067 0.9051705 0.9854785
# dendrogram
hc <- agnes(customers_dim_red.s, method = "ward")
pltree(hc, cex = 0.6, hang = -1, main = "dendrogram - agnes")
# dissimilarity matrix
d <- dist(customers_dim_red.s)
hc1 <- hclust(d, method = "ward.D" )
plot(hc1, cex = 0.6)
rect.hclust(hc1, k = 3, border = 2:5)
res.pca <- PCA(customers_dim_red.s, graph=FALSE)
dendr <- HCPC(res.pca, nb.clust= -1, graph = F)
fviz_dend(dendr,
palette = "rickandmorty",
rect = TRUE, rect_fill = TRUE,
rect_border = "rickandmorty")
Hierarchical clustering is a method of grouping data into clusters based on similarity. The algorithm builds a hierarchy of clusters by successively splitting or merging smaller clusters into larger ones. The result is a tree-like diagram (dendrogram) that shows the relationships between the clusters. The advantage of hierarchical clustering is that it can help identify meaningful patterns in large datasets and is useful for exploratory data analysis. The partition suggested automatically by HCPC appears to be the best, so it will be used in further analysis.
#Dropping Education and Marital_Status for corr plot preparation. Also dropping variables which are represented by new variables.
customers_correlation <- customers_df[c(-1,-2,-4,-5,-7,-8,-9,-10,-11,-12,-13,-14,-15,-16,-17,-18,-19,-20,-21,-22)]
ggcorr(customers_correlation, method = c("everything", "pearson"), label = TRUE, label_size = 3, label_round = 2, hjust = 0.85, size = 3)
ggcorr(customers_correlation, method = c("everything", "spearman"), label = TRUE, label_size = 3, label_round = 2, hjust = 0.85, size = 3)
As can be seen from the graph, the variables Income, Money_spent and Number_purchases are highly correlated. For further analysis on continuous variables, therefore, only the Money_spent variable will be used; the other two are temporarily discarded.
customers_data_con <- customers_correlation[c(-1,-8)]
skim(customers_data_con)
── Data Summary ────────────────────────
Values
Name customers_data_con
Number of rows 2201
Number of columns 8
_______________________
Column type frequency:
numeric 8
________________________
Group variables None
get_clust_tendency(customers_data_con, n=ceiling(nrow(customers_data_con)/10), graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 123)
$hopkins_stat
[1] 0.6487238
$plot
Before proceeding with further analysis, it is useful to check the clustering tendency of the data. This can be done with the Hopkins statistic; in the convention reported here, values near zero indicate clusterable data. The value obtained, 0.649, is not very low, so clustering based only on the continuous variables is unlikely to be of high quality.
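For reference, a minimal sketch of the statistic in the convention used in this write-up (values near 0 indicating clusterable data): w are nearest-neighbour distances among sampled real points and u are nearest-neighbour distances from uniformly generated points to the data. The helper below is illustrative only, assuming a numeric matrix X; the analysis itself relies on get_clust_tendency.
# Hopkins-type clustering-tendency sketch; near 0 when the data are clustered
hopkins_sketch <- function(X, m = ceiling(nrow(X) / 10)) {
  n <- nrow(X); d <- ncol(X)
  idx <- sample(n, m)
  # uniform points drawn inside the bounding box of the data
  U <- sapply(seq_len(d), function(j) runif(m, min(X[, j]), max(X[, j])))
  # distance from each sampled real point to its nearest other real point
  w <- apply(X[idx, , drop = FALSE], 1, function(p)
    min(sqrt(colSums((t(X[-idx, , drop = FALSE]) - p)^2))))
  # distance from each uniform point to its nearest real point
  u <- apply(U, 1, function(p) min(sqrt(colSums((t(X) - p)^2))))
  sum(w) / (sum(u) + sum(w))
}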
preproc <- preProcess(customers_data_con, method=c("center", "scale"))
customers_data_con.s <- predict(preproc, customers_data_con)
summary(customers_data_con.s)
Recency client_days Age Cmp_accepted Children
Min. :-1.695700 Min. :-2.371570 Min. :-2.31631 Min. :-0.4396 Min. :-1.2672
1st Qu.:-0.865953 1st Qu.:-0.749124 1st Qu.:-0.69263 1st Qu.:-0.4396 1st Qu.:-1.2672
Median :-0.001634 Median :-0.004607 Median :-0.09443 Median :-0.4396 Median : 0.0673
Mean : 0.000000 Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.862686 3rd Qu.: 0.739911 3rd Qu.: 0.84560 3rd Qu.:-0.4396 3rd Qu.: 0.0673
Max. : 1.727005 Max. : 2.203124 Max. : 2.46929 Max. : 5.4427 Max. : 2.7362
Money_spent Number_deals_purchases Number_web_visits_month
Min. :-0.9992 Min. :-1.2282 Min. :-2.2135
1st Qu.:-0.8929 1st Qu.:-0.6980 1st Qu.:-0.9695
Median :-0.3496 Median :-0.1677 Median : 0.2745
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.7286 3rd Qu.: 0.3626 3rd Qu.: 0.6892
Max. : 3.1873 Max. : 6.7259 Max. : 6.0798
K-means is a widely used and well-known clustering algorithm that segments a dataset into k clusters. It operates by assigning each data point to the cluster with the closest centroid (mean), and then recomputing the centroids from the new cluster assignments. This repeats until the centroids stop changing or a stopping criterion is met. One drawback of K-means is that the number of clusters must be specified in advance by the user. To address this limitation, various techniques are available for determining the optimal number of clusters, for example the Elbow method, where the number of clusters is chosen at the point where the explained variance stops increasing appreciably, and the Silhouette score, which measures how similar a data point is to its own cluster compared to other clusters. A bare-bones sketch of the K-means loop is shown below.
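The sketch below illustrates the assign-then-update loop just described (illustrative only; the analysis itself uses stats::kmeans through eclust, and this toy version does not handle empty clusters).
kmeans_sketch <- function(X, k, iters = 100) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), k), , drop = FALSE]  # random initial centroids
  for (i in seq_len(iters)) {
    # squared distance of every point to every centroid (n x k matrix)
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(X, 2, centers[j, ])^2))
    cl <- max.col(-d2)  # index of the nearest centroid for each point
    # recompute centroids as cluster means; stop when they no longer move
    new_centers <- t(sapply(seq_len(k), function(j)
      colMeans(X[cl == j, , drop = FALSE])))
    if (all(abs(new_centers - centers) < 1e-8)) break
    centers <- new_centers
  }
  list(cluster = cl, centers = centers)
}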
opt_elbow_grpd <- Optimal_Clusters_KMeans(customers_data_con.s, max_clusters=20, plot_clusters = TRUE)
opt_silhouette_grpd <- Optimal_Clusters_KMeans(customers_data_con.s, max_clusters=20, plot_clusters=TRUE, criterion="silhouette")
opt_grpd <- NbClust(customers_data_con.s, distance="euclidean", min.nc=2, max.nc=12, method="complete", index="ch")
opt_grpd$Best.nc
Number_clusters Value_Index
11.0000 162.4258
agr_km2 <- eclust(customers_data_con.s, "kmeans", k = 2)
fviz_silhouette(agr_km2)
agr_km3 <- eclust(customers_data_con.s, "kmeans", k = 3)
fviz_silhouette(agr_km3)
agr_km6 <- eclust(customers_data_con.s, "kmeans", k = 6)
fviz_silhouette(agr_km6)
agr_km11 <- eclust(customers_data_con.s, "kmeans", k = 11)
fviz_silhouette(agr_km11)
score_agr_km2 <- kmeans(customers_data_con.s, 2)
round(calinhara(customers_data_con.s, score_agr_km2$cluster), digits=2)
[1] 580.37
score_agr_km3 <- kmeans(customers_data_con.s, 3)
round(calinhara(customers_data_con.s, score_agr_km3$cluster), digits=2)
[1] 471.86
score_agr_km6 <- kmeans(customers_data_con.s, 6)
round(calinhara(customers_data_con.s, score_agr_km6$cluster), digits=2)
[1] 363.26
score_agr_km11 <- kmeans(customers_data_con.s, 11)
round(calinhara(customers_data_con.s, score_agr_km11$cluster), digits=2)
[1] 276.6
dudahart2(customers_data_con.s, score_agr_km2$cluster)
$p.value
[1] 0
$dh
[1] 0.7911873
$compare
[1] 0.8892011
$cluster1
[1] FALSE
$alpha
[1] 0.001
$z
[1] 3.090232
dudahart2(customers_data_con.s, score_agr_km3$cluster)
$p.value
[1] 0
$dh
[1] 0.5080522
$compare
[1] 0.8892011
$cluster1
[1] FALSE
$alpha
[1] 0.001
$z
[1] 3.090232
dudahart2(customers_data_con.s, score_agr_km6$cluster)
$p.value
[1] 0
$dh
[1] 0.1511493
$compare
[1] 0.8892011
$cluster1
[1] FALSE
$alpha
[1] 0.001
$z
[1] 3.090232
dudahart2(customers_data_con.s, score_agr_km11$cluster)
$p.value
[1] 0
$dh
[1] 0.07948125
$compare
[1] 0.8892011
$cluster1
[1] FALSE
$alpha
[1] 0.001
$z
[1] 3.090232
Clustering using K-means achieved its best results for 2 clusters, followed by 3. This was confirmed by all the statistics considered, including the Calinski-Harabasz index (which compares between-cluster to within-cluster dispersion, so higher is better) and the Duda-Hart test. Even so, the highest average silhouette width achieved was only 0.24, which shows that the clustering results are poor.
data_pca <- customers_pca1$x[,1:12]
get_clust_tendency(data_pca, n=ceiling(nrow(data_pca)/10), seed = 123)$hopkins_stat
[1] 0.8377595
opt_elbow <- Optimal_Clusters_KMeans(data_pca, max_clusters=20, plot_clusters = TRUE)
opt_silhouette <- Optimal_Clusters_KMeans(data_pca, max_clusters=20, plot_clusters=TRUE, criterion="silhouette")
opt1 <- NbClust(data_pca, distance="euclidean", min.nc=2, max.nc=12, method="complete", index="ch")
opt1$All.index
2 3 4 5 6 7 8 9 10 11 12
550.4834 385.3125 377.4499 356.5730 290.3140 267.5355 248.6706 220.3645 207.1051 188.0100 184.8578
opt1$Best.nc
Number_clusters Value_Index
2.0000 550.4834
pca12_km2 <- eclust(data_pca, "kmeans", k = 2)
fviz_silhouette(pca12_km2)
pca12_km3 <- eclust(data_pca, "kmeans", k = 3)
fviz_silhouette(pca12_km3)
pca12_km6 <- eclust(data_pca, "kmeans", k = 6)
fviz_silhouette(pca12_km6)
score_pca12_km2 <- kmeans(data_pca, 2)
round(calinhara(data_pca, score_pca12_km2$cluster), digits=2)
[1] 261.79
score_pca12_km3 <- kmeans(data_pca, 3)
round(calinhara(data_pca, score_pca12_km3$cluster), digits=2)
[1] 459.97
score_pca12_km6 <- kmeans(data_pca, 6)
round(calinhara(data_pca, score_pca12_km6$cluster), digits=2)
[1] 378.71
dudahart2(data_pca, score_pca12_km2$cluster)
$p.value
[1] 1.097645e-10
$dh
[1] 0.8936152
$compare
[1] 0.9209815
$cluster1
[1] FALSE
$alpha
[1] 0.001
$z
[1] 3.090232
dudahart2(data_pca, score_pca12_km3$cluster)
$p.value
[1] 0
$dh
[1] 0.5575084
$compare
[1] 0.9209815
$cluster1
[1] FALSE
$alpha
[1] 0.001
$z
[1] 3.090232
dudahart2(data_pca, score_pca12_km6$cluster)
$p.value
[1] 0
$dh
[1] 0.126912
$compare
[1] 0.9209815
$cluster1
[1] FALSE
$alpha
[1] 0.001
$z
[1] 3.090232
The k-means clustering results obtained on the PCA data are again very poor. The high Hopkins statistic and low average silhouette width show that this is a bad method for clustering these data.
Another idea is to use more variables, not just the continuous ones, and see whether clustering on them works better. There are many methods for clustering mixed data. Some of the most popular include clustering on Factor Analysis of Mixed Data (FAMD) coordinates, clustering with the Gower distance, and K-prototypes. All of these methods were tested on this dataset.
Factor Analysis of Mixed Data (FAMD) is a method for examining data that includes both numerical and categorical variables. It combines two statistical techniques, PCA for the numeric variables and Multiple Correspondence Analysis (MCA) for the categorical ones, to condense the data and highlight the connections between variables. Here, the coordinates obtained with FAMD will be used to perform clustering with K-means.
# customers_mixed combines the two factors with the numeric variables used
# above (a reconstruction - the original definition is missing from the source)
customers_mixed <- cbind(customers_df[c("Education", "Marital_Status")],
                         customers_correlation[-1])
famd <- FAMD(customers_mixed, ncp = 20, graph = FALSE)
summary(famd)
Call:
FAMD(base = customers_mixed, ncp = 20, graph = FALSE)
Eigenvalues
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
Variance 2.870 1.556 1.281 1.028 1.006 0.987 0.865 0.799 0.739 0.429 0.297 0.143
% of var. 23.918 12.969 10.671 8.570 8.383 8.223 7.206 6.662 6.157 3.579 2.472 1.190
Cumulative % of var. 23.918 36.887 47.558 56.128 64.511 72.734 79.940 86.602 92.759 96.338 98.810 100.000
Individuals (the 10 first)
Dist Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
1 | 3.807 | 1.646 0.043 0.187 | 1.845 0.099 0.235 | -1.514 0.081 0.158 |
2 | 3.406 | -1.558 0.038 0.209 | -0.828 0.020 0.059 | 1.810 0.116 0.282 |
3 | 2.398 | 1.398 0.031 0.340 | -0.554 0.009 0.053 | -0.096 0.000 0.002 |
4 | 3.035 | -1.294 0.027 0.182 | -1.787 0.093 0.347 | 0.277 0.003 0.008 |
5 | 2.953 | -0.501 0.004 0.029 | 0.486 0.007 0.027 | 0.499 0.009 0.029 |
6 | 2.232 | 0.370 0.002 0.027 | 0.445 0.006 0.040 | 0.814 0.023 0.133 |
7 | 2.311 | -0.169 0.000 0.005 | 1.071 0.034 0.215 | -0.998 0.035 0.187 |
8 | 2.609 | -1.378 0.030 0.279 | -0.489 0.007 0.035 | -0.155 0.001 0.004 |
9 | 2.905 | -1.676 0.044 0.333 | -0.548 0.009 0.036 | 0.161 0.001 0.003 |
10 | 7.090 | -3.743 0.222 0.279 | 0.931 0.025 0.017 | 0.814 0.024 0.013 |
Continuous variables
Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
Recency | 0.010 0.003 0.000 | 0.061 0.242 0.004 | -0.051 0.202 0.003 |
client_days | 0.003 0.000 0.000 | -0.522 17.523 0.273 | 0.601 28.241 0.362 |
Age | 0.148 0.768 0.022 | 0.363 8.490 0.132 | 0.586 26.796 0.343 |
Cmp_accepted | 0.532 9.864 0.283 | 0.043 0.120 0.002 | -0.128 1.278 0.016 |
Children | -0.697 16.907 0.485 | 0.353 8.022 0.125 | 0.282 6.204 0.079 |
Money_spent | 0.895 27.914 0.801 | 0.263 4.459 0.069 | -0.102 0.816 0.010 |
Number_purchases | 0.800 22.271 0.639 | 0.431 11.931 0.186 | -0.041 0.134 0.002 |
Number_deals_purchases | -0.324 3.650 0.105 | 0.772 38.265 0.595 | -0.040 0.123 0.002 |
Number_web_visits_month | -0.710 17.545 0.504 | 0.291 5.444 0.085 | -0.288 6.460 0.083 |
Categories
Dim.1 ctr cos2 v.test Dim.2 ctr cos2 v.test Dim.3 ctr cos2 v.test
Undergraduate | -1.815 0.981 0.077 -7.970 | -2.007 4.082 0.094 -11.969 | -2.987 13.352 0.208 -19.637 |
Graduate | 0.001 0.000 0.000 0.044 | -0.083 0.171 0.010 -3.799 | -0.338 4.140 0.165 -16.974 |
Postgraduate | 0.115 0.061 0.008 2.495 | 0.260 1.061 0.040 7.655 | 0.721 12.066 0.307 23.420 |
In_relationship | -0.040 0.013 0.003 -1.497 | 0.051 0.068 0.005 2.571 | 0.041 0.067 0.003 2.304 |
Single | 0.073 0.023 0.003 1.497 | -0.092 0.125 0.005 -2.571 | -0.075 0.122 0.003 -2.304 |
fviz_eig(famd, ncp=20, addlabels=TRUE)
fviz_pca_var(famd, col.var = "contrib", repel = TRUE, labels = TRUE)
fviz_contrib(famd, "var", axes=1, xtickslab.rt=90)
fviz_contrib(famd, "var", axes=2, xtickslab.rt=90)
# customers_famd: the FAMD coordinates of the individuals, used for clustering
# (a reconstruction - this definition is missing from the source)
customers_famd <- famd$ind$coord
opt_silhouette <- Optimal_Clusters_KMeans(customers_famd, max_clusters=20, plot_clusters=TRUE, criterion="silhouette")
famd_kmeans <- eclust(customers_famd, "kmeans", k = 3)
fviz_silhouette(famd_kmeans)
score_famd_kmeans <- kmeans(customers_famd, 3)
round(calinhara(customers_famd, score_famd_kmeans$cluster), digits=2)
[1] 370.08
dudahart2(customers_famd, score_famd_kmeans$cluster)
$p.value
[1] 0
$dh
[1] 0.4123133
$compare
[1] 0.9209815
$cluster1
[1] FALSE
$alpha
[1] 0.001
$z
[1] 3.090232
Again, the clustering results obtained by this method are not satisfactory. The average silhouette width is only 0.15, which shows how poor the clustering is.
The combination of the Gower distance and PAM (Partitioning Around Medoids) is a clustering technique that aims to tackle the challenges posed by mixed-type data. The Gower distance is used to assess the similarity between data points, averaging range-normalised differences for numeric variables and simple mismatches for categorical ones, while PAM represents each cluster by a medoid instead of a centroid. This approach can give better results on mixed data than traditional techniques such as K-means with the Euclidean distance. A small illustration of the Gower dissimilarity follows.
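A minimal sketch of the per-pair Gower dissimilarity described above (illustrative; the analysis itself uses cluster::daisy). The helper and its arguments are hypothetical: x and y are two rows given as lists, and ranges gives each numeric variable's range (ignored for factors).
gower_pair <- function(x, y, ranges) {
  scores <- mapply(function(xi, yi, r) {
    if (is.numeric(xi)) abs(xi - yi) / r  # numeric: range-normalised difference
    else as.numeric(xi != yi)             # categorical: simple mismatch
  }, x, y, ranges)
  mean(scores)  # Gower dissimilarity = average per-variable score
}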
gower_dist <- daisy(customers_mixed, metric = "gower",
stand = FALSE, warnType = TRUE)
summary(gower_dist)
2421100 dissimilarities, summarized :
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.1859 0.2427 0.2455 0.3015 0.6837
Metric : mixed ; Types = O, N, I, I, I, I, I, I, I, I, I
Number of objects : 2201
sil_width <- c(NA)
for(i in 2:10){
pam_fit <- pam(gower_dist,
diss = TRUE,
k = i)
sil_width[i] <- pam_fit$silinfo$avg.width
}
plot(1:10, sil_width,
xlab = "Number of clusters",
ylab = "Silhouette Width",
main = "Optimal number of clusters PAM")
lines(1:10, sil_width)
c1 <- eclust(gower_dist, "pam", k = 2)
fviz_silhouette(c1)
fviz_cluster(c1)
c2 <- eclust(gower_dist, "pam", k = 3)
fviz_silhouette(c2)
fviz_cluster(c2)
score_pam2 <- pam(gower_dist, 2)
round(calinhara(gower_dist, score_pam2$cluster), digits=2)
[1] 1058.05
score_pam3 <- pam(gower_dist, 3)
round(calinhara(gower_dist, score_pam3$cluster), digits=2)
[1] 856.1
dudahart2(gower_dist, score_pam2$cluster)
$p.value
[1] 0
$dh
[1] 0.6751517
$compare
[1] 0.9977256
$cluster1
[1] FALSE
$alpha
[1] 0.001
$z
[1] 3.090232
dudahart2(gower_dist, score_pam3$cluster)
$p.value
[1] 0
$dh
[1] 0.388104
$compare
[1] 0.9977256
$cluster1
[1] FALSE
$alpha
[1] 0.001
$z
[1] 3.090232
As can be seen, clustering with this method improves on the previous results. This is indicated by the quality measures, including an average silhouette width of 0.28 and a Calinski-Harabasz index of 856.1 (for k=3), among others.
K-prototypes is a clustering method designed specifically for datasets that combine categorical and numerical variables. It merges the principles of K-means (for numerical data) and K-modes (for categorical data) into a single approach to clustering mixed data. The algorithm assigns each data point to a cluster based on both its categorical and numerical values, and then iteratively refines the cluster prototypes until they stabilise; the sketch below shows the mixed distance it minimises.
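A minimal sketch of the distance at the heart of K-prototypes, assuming numeric parts x_num/c_num, categorical parts x_cat/c_cat and a trade-off weight lambda (this helper is illustrative, not taken from kproto; the clustMixType package estimates lambda automatically, as seen in the output below):
# Squared Euclidean distance on the numeric part plus lambda times the
# number of categorical mismatches
kproto_dist <- function(x_num, c_num, x_cat, c_cat, lambda) {
  sum((x_num - c_num)^2) + lambda * sum(x_cat != c_cat)
}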
kproto_twsos <- numeric(12)
for(i in 1:12){
kpres <- kproto(customers_mixed, k = i)
kproto_twsos[i] <- kpres$tot.withinss
}
# NAs in variables:
Education Marital_Status Recency client_days Age Cmp_accepted Children Money_spent Number_purchases
0 0 0 0 0 0 0 0 0
Number_deals_purchases Number_web_visits_month
0 0
0 observation(s) with NAs.
Estimated lambda: 96825.71
(kproto prints the same NA check and estimated lambda for each of the 12 values of k.)
plot(1:12, kproto_twsos,
xlab = "Number of clusters",
ylab = "Total Within Sum Of Squares",
main = "Optimal number of clusters K-prototypes")
lines(1:12, kproto_twsos)
kproto_ss <- numeric(12)
for(i in 2:12){
kpres <- kproto(customers_mixed, k = i)
kproto_ss[i] <- validation_kproto(object = kpres, method = "silhouette")
}
(Again, the same NA check and estimated lambda are printed for each k.)
# Plot sihouette width (higher is better)
plot(1:12, kproto_ss,
xlab = "Number of clusters",
ylab = "Silhouette Width",
main = "Optimal number of clusters K-prototypes")
lines(1:12, kproto_ss)
customers_k_prototypes <- kproto(customers_mixed, 3)
customers_k_prototypes$centers
validation_kproto(object = customers_k_prototypes, method = "silhouette")
[1] 0.398714
c_assignment_kproto <- customers_k_prototypes$cluster
customers_kproto <- cbind(customers_mixed, c_assignment_kproto)
customers_kproto$c_assignment_kproto <- as.factor(customers_kproto$c_assignment_kproto)
ggplot(customers_kproto, aes(x = Money_spent, y = Number_purchases, color = c_assignment_kproto)) +
geom_point() +
labs(title = "Money spent ~ Number of purchases",
x = "Money_spent", y = "Number_purchases") + guides(color = guide_legend(title = "Cluster"))Definitely performing K-prototypes on mixed data gives the best clustering results. Average silhouette width is highest for 2 clusters and is more than 0.7, but for further analysis a split into 3 clusters was chosen with the second highest average silhouette width = 0.398 guided by indications from hierachical clustering and wanting to see potential differences between more clusters.
set.seed(123)
features <- subset(customers_dim_red.s)
tsne <- tsne(features, initial_dims = 26, k = 3, max_iter = 300, epoch = 50)
sigma summary: Min. : 0.33590247385282 |1st Qu. : 0.50268605673983 |Median : 0.555157820787002 |Mean : 0.593604554549779 |3rd Qu. : 0.653023509629496 |Max. : 1.15235770954057 |
Epoch: Iteration #50 error is: 18.2826308570591
Epoch: Iteration #100 error is: 18.0346360515252
Epoch: Iteration #150 error is: 1.31086852227389
Epoch: Iteration #200 error is: 1.08051232062025
Epoch: Iteration #250 error is: 0.98146496383044
Epoch: Iteration #300 error is: 0.870987572159646
tsne <- data.frame(tsne)
fig <- plot_ly(data = tsne, x = ~X1, y = ~X2, z = ~X3, colors = c('#636EFA','#EF553B','#00CC96')) %>%
add_markers(size = 8) %>%
layout(
xaxis = list(
zerolinecolor = "#ffff",
zerolinewidth = 2,
gridcolor='#ffff'),
yaxis = list(
zerolinecolor = "#ffff",
zerolinewidth = 2,
gridcolor='#ffff'),
scene =list(bgcolor = "#e5ecf6"))
fig
fig2 <- plot_ly(data = tsne, x = ~X1, y = ~X2, z = ~X3, color = ~customers_kproto$c_assignment_kproto, colors = c('#636EFA','#EF553B','#00CC96')) %>%
add_markers(size = 8) %>%
layout(
xaxis = list(
zerolinecolor = "#ffff",
zerolinewidth = 2,
gridcolor='#ffff'),
yaxis = list(
zerolinecolor = "#ffff",
zerolinewidth = 2,
gridcolor='#ffff'),
scene =list(bgcolor = "#e5ecf6"))
fig2
c_assignment_pam <- c2$cluster
customers_pam <- cbind(customers_mixed, c_assignment_pam)
customers_pam$c_assignment_pam <- as.factor(customers_pam$c_assignment_pam)
fig3 <- plot_ly(data = tsne ,x = ~X1, y = ~X2, z = ~X3, color = ~customers_pam$c_assignment_pam, colors = c('#636EFA','#EF553B','#00CC96') ) %>%
add_markers(size = 8) %>%
layout(
xaxis = list(
zerolinecolor = "#ffff",
zerolinewidth = 2,
gridcolor='#ffff'),
yaxis = list(
zerolinecolor = "#ffff",
zerolinewidth = 2,
gridcolor='#ffff'),
scene =list(bgcolor = "#e5ecf6"))
fig3
features2 <- subset(customers_dim_red.s)
set.seed(123)
tsne2 <- tsne(features2, initial_dims = 26, k = 2, max_iter = 300, epoch = 50)
sigma summary: Min. : 0.33590247385282 |1st Qu. : 0.50268605673983 |Median : 0.555157820787002 |Mean : 0.593604554549779 |3rd Qu. : 0.653023509629496 |Max. : 1.15235770954057 |
Epoch: Iteration #50 error is: 18.6989606457907
Epoch: Iteration #100 error is: 18.3985917244678
Epoch: Iteration #150 error is: 1.59540704101302
Epoch: Iteration #200 error is: 1.34819514926882
Epoch: Iteration #250 error is: 1.23263486520246
Epoch: Iteration #300 error is: 1.09182863987721
tsne2 <- data.frame(tsne2)
options(warn = -1)
fig4 <- plot_ly(data = tsne2 ,x = ~X1, y = ~X2, type = 'scatter', mode = 'markers', split = ~customers_kproto$c_assignment_kproto)
fig4 <- fig4 %>%
layout(
plot_bgcolor = "#e5ecf6"
)
fig4
options(warn = -1)
fig5 <- plot_ly(data = tsne2 ,x = ~X1, y = ~X2, type = 'scatter', mode = 'markers', split = ~customers_pam$c_assignment_pam)
fig5 <- fig5 %>%
layout(
plot_bgcolor = "#e5ecf6"
)
fig5
customers_kproto <- customers_kproto %>% mutate(Income = customers_df[, "Income"])
customers_kproto_grpd <- customers_kproto %>% group_by(c_assignment_kproto)
skim(customers_kproto_grpd)
── Data Summary ────────────────────────
Values
Name customers_kproto_grpd
Number of rows 2201
Number of columns 13
_______________________
Column type frequency:
factor 2
numeric 10
________________________
Group variables c_assignment_kproto
customers_kproto_grpd %>%
summarize(Count = n()) %>%
ggplot(aes(x=c_assignment_kproto, y=Count, fill=c_assignment_kproto)) +
theme_solarized() +
scale_fill_solarized() +
labs(title="Number of observations in each cluster",
x ="Cluster", y = "Number of observations", fill = "Clusters") +
geom_bar(stat='identity', position= "dodge")
target_variables <- c("Money_spent","Income","Age","client_days","Number_purchases","Number_deals_purchases","Number_web_visits_month","Cmp_accepted","Children","Recency")
for (each_variable in target_variables) {
plot_var_name <- str_c(c("ggplot", each_variable), collapse = "_")
temp_plot <- ggplot(customers_kproto, aes_string(x="c_assignment_kproto", y=each_variable, fill = "c_assignment_kproto")) +
theme_solarized() +
scale_fill_solarized() +
ggtitle(str_c("Cluster ~ ",each_variable)) +
labs(fill = "Clusters") +
geom_boxplot()
assign(plot_var_name, temp_plot)
}
gridExtra::grid.arrange(ggplot_Money_spent, ggplot_Income, ggplot_Age, ggplot_client_days, ggplot_Number_purchases, ggplot_Number_deals_purchases, ggplot_Number_web_visits_month, ggplot_Cmp_accepted, ggplot_Children, ggplot_Recency, ncol = 2)
customers_kproto_grpd %>%
summarize(Mean_money_spent = mean(Money_spent)) %>%
ggplot(aes(x=c_assignment_kproto, y=Mean_money_spent, fill = c_assignment_kproto)) +
theme_solarized() +
scale_fill_solarized() +
labs(title="Mean money spent in each cluster",
x ="Cluster", y = "Mean money spent", fill = "Clusters") +
geom_bar(stat='identity', position= "dodge")customers_kproto_grpd %>%
summarize(Mean_income = mean(Income)) %>%
ggplot(aes(x=c_assignment_kproto, y=Mean_income, fill = c_assignment_kproto)) +
theme_solarized() +
scale_fill_solarized() +
labs(title="Mean income in each cluster",
x ="Cluster", y = "Mean income", fill = "Clusters") +
geom_bar(stat='identity', position= "dodge")ggplot(customers_kproto, aes(x = Money_spent, y = Income, color = c_assignment_kproto)) +
geom_point() +
theme_solarized() +
scale_fill_solarized() +
labs(title = "Money spent ~ Income",
x = "Money spent", y = "Income") + guides(color = guide_legend(title = "Cluster"))customers_kproto_grpd %>%
summarize(Mean_client_days = mean(client_days)) %>%
ggplot(aes(x=c_assignment_kproto, y=Mean_client_days, fill = c_assignment_kproto)) +
theme_solarized() +
scale_fill_solarized() +
labs(title="Mean client days in each cluster",
x ="Cluster", y = "Mean client days", fill = "Clusters") +
geom_bar(stat='identity', position= "dodge")customers_kproto_summ <- customers_kproto_grpd[,-c(1,2)]
customers_kproto_summ %>% summarize(across(everything(), mean))
customers_kproto_summ %>% summarize(across(everything(), median))
customers_kproto_summ %>% summarize(across(everything(), min))
customers_kproto_summ %>% summarize(across(everything(), max))
The purpose of the analysis was to segment customers by applying clustering methods. The best approach turned out to be the K-prototypes method applied to both continuous and categorical variables. Accurate customer segmentation is of great value to companies because it allows them to better understand customer needs and preferences. The clustering divides customers into those who spend the most, have the highest incomes and are mostly not parents (Cluster 1), and those who spend less but buy on promotions and are mostly parents (Clusters 2 and 3). It also makes it possible to carry out targeted marketing activities, thereby increasing the likelihood of customer interest and purchase.
Given the high similarity of clusters 2 and 3, and the fact that the average silhouette width for K-prototypes was highest for two clusters, it would be worthwhile in further work on this project to repeat the analysis with two clusters and compare the results.
https://plotly.com/r/t-sne-and-umap-projections/
https://medium.com/analytics-vidhya/the-ultimate-guide-for-clustering-mixed-data-1eefa0b4743b
https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis