Using clustering techniques, we are going to explore the best audience for a company. Our dataset is taken from https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis. Let’s get started!
First, we have to load our dataset:
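# Load the packages used below: dplyr, ggplot2, factoextra, cluster
library(dplyr); library(ggplot2); library(factoextra); library(cluster)
# Load the dataset (the file is tab-separated)
marketing_data <- read.csv("marketing_campaign.csv", sep = "\t")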
summary(marketing_data)
## ID Year_Birth Education Marital_Status
## Min. : 0 Min. :1893 Length:2240 Length:2240
## 1st Qu.: 2828 1st Qu.:1959 Class :character Class :character
## Median : 5458 Median :1970 Mode :character Mode :character
## Mean : 5592 Mean :1969
## 3rd Qu.: 8428 3rd Qu.:1977
## Max. :11191 Max. :1996
##
## Income Kidhome Teenhome Dt_Customer
## Min. : 1730 Min. :0.0000 Min. :0.0000 Length:2240
## 1st Qu.: 35303 1st Qu.:0.0000 1st Qu.:0.0000 Class :character
## Median : 51382 Median :0.0000 Median :0.0000 Mode :character
## Mean : 52247 Mean :0.4442 Mean :0.5062
## 3rd Qu.: 68522 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :666666 Max. :2.0000 Max. :2.0000
## NA's :24
## Recency MntWines MntFruits MntMeatProducts
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.:24.00 1st Qu.: 23.75 1st Qu.: 1.0 1st Qu.: 16.0
## Median :49.00 Median : 173.50 Median : 8.0 Median : 67.0
## Mean :49.11 Mean : 303.94 Mean : 26.3 Mean : 166.9
## 3rd Qu.:74.00 3rd Qu.: 504.25 3rd Qu.: 33.0 3rd Qu.: 232.0
## Max. :99.00 Max. :1493.00 Max. :199.0 Max. :1725.0
##
## MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 3.00 1st Qu.: 1.00 1st Qu.: 9.00 1st Qu.: 1.000
## Median : 12.00 Median : 8.00 Median : 24.00 Median : 2.000
## Mean : 37.53 Mean : 27.06 Mean : 44.02 Mean : 2.325
## 3rd Qu.: 50.00 3rd Qu.: 33.00 3rd Qu.: 56.00 3rd Qu.: 3.000
## Max. :259.00 Max. :263.00 Max. :362.00 Max. :15.000
##
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 3.00 1st Qu.: 3.000
## Median : 4.000 Median : 2.000 Median : 5.00 Median : 6.000
## Mean : 4.085 Mean : 2.662 Mean : 5.79 Mean : 5.317
## 3rd Qu.: 6.000 3rd Qu.: 4.000 3rd Qu.: 8.00 3rd Qu.: 7.000
## Max. :27.000 Max. :28.000 Max. :13.00 Max. :20.000
##
## AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.07277 Mean :0.07455 Mean :0.07277 Mean :0.06429
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000
##
## AcceptedCmp2 Complain Z_CostContact Z_Revenue
## Min. :0.00000 Min. :0.000000 Min. :3 Min. :11
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:3 1st Qu.:11
## Median :0.00000 Median :0.000000 Median :3 Median :11
## Mean :0.01339 Mean :0.009375 Mean :3 Mean :11
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:3 3rd Qu.:11
## Max. :1.00000 Max. :1.000000 Max. :3 Max. :11
##
## Response
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.1491
## 3rd Qu.:0.0000
## Max. :1.0000
##
Data pre-processing is an important step in any data analysis or machine learning task. A manual skim of the summary above shows that only the Income column has missing values (24 NA's). Let’s confirm that:
# Check for missing values
missing_values <- colSums(is.na(marketing_data))
print("Missing Values:")
## [1] "Missing Values:"
print(missing_values)
## ID Year_Birth Education Marital_Status
## 0 0 0 0
## Income Kidhome Teenhome Dt_Customer
## 24 0 0 0
## Recency MntWines MntFruits MntMeatProducts
## 0 0 0 0
## MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
## 0 0 0 0
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## 0 0 0 0
## AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1
## 0 0 0 0
## AcceptedCmp2 Complain Z_CostContact Z_Revenue
## 0 0 0 0
## Response
## 0
# Convert 'Dt_Customer' from a day-month-year string to a Date
marketing_data$Dt_Customer <- as.Date(marketing_data$Dt_Customer, format = "%d-%m-%Y")
# Replace missing incomes with the mean income
marketing_data$Income[is.na(marketing_data$Income)] <- mean(marketing_data$Income, na.rm = TRUE)
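One thing worth keeping in mind: the summary shows a maximum income of 666666, far above the third quartile, and a single value like that pulls the mean upward. The sketch below uses a small made-up vector (not the real data) to compare mean and median imputation. The names `income`, `income_mean` and `income_median` are illustrative only.

```r
# Toy income vector (made-up values) with one extreme outlier and two NAs
income <- c(30000, 45000, 52000, 61000, 666666, NA, NA)

# Mean imputation: the outlier inflates the fill value
income_mean <- income
income_mean[is.na(income_mean)] <- mean(income, na.rm = TRUE)

# Median imputation: the fill value is not affected by the outlier
income_median <- income
income_median[is.na(income_median)] <- median(income, na.rm = TRUE)

# Compare the value that was filled in for the first missing entry
print(income_mean[6])
print(income_median[6])
```

With only 24 missing values out of 2240 the choice probably matters little here, but the median is the more cautious default when outliers are present.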
Exploring the data is an important step, so I generate some plots to get a high-level overview. Since there are 2240 observations, working with a random sample keeps the visuals readable:
set.seed(4242)
# Create a sample of 500 rows for analysis
sample_size <- 500
sample_data <- marketing_data %>% sample_n(sample_size)
# Display a summary of the sample
summary(sample_data)
## ID Year_Birth Education Marital_Status
## Min. : 1 Min. :1893 Length:500 Length:500
## 1st Qu.: 2925 1st Qu.:1958 Class :character Class :character
## Median : 5778 Median :1969 Mode :character Mode :character
## Mean : 5651 Mean :1968
## 3rd Qu.: 8320 3rd Qu.:1976
## Max. :11187 Max. :1993
## Income Kidhome Teenhome Dt_Customer
## Min. : 1730 Min. :0.000 Min. :0.000 Min. :2012-08-01
## 1st Qu.: 35409 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:2013-02-05
## Median : 54158 Median :0.000 Median :0.000 Median :2013-07-27
## Mean : 53448 Mean :0.434 Mean :0.522 Mean :2013-07-18
## 3rd Qu.: 68667 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:2014-01-03
## Max. :666666 Max. :2.000 Max. :2.000 Max. :2014-06-29
## Recency MntWines MntFruits MntMeatProducts
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 1.0
## 1st Qu.:23.00 1st Qu.: 24.75 1st Qu.: 1.75 1st Qu.: 16.0
## Median :49.00 Median : 182.00 Median : 9.00 Median : 69.0
## Mean :48.62 Mean : 313.23 Mean : 27.57 Mean :167.8
## 3rd Qu.:74.00 3rd Qu.: 516.50 3rd Qu.: 35.00 3rd Qu.:232.0
## Max. :99.00 Max. :1493.00 Max. :194.00 Max. :951.0
## MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 3.00 1st Qu.: 1.00 1st Qu.: 9.00 1st Qu.: 1.000
## Median : 12.00 Median : 8.00 Median : 24.00 Median : 2.000
## Mean : 34.94 Mean : 24.55 Mean : 45.73 Mean : 2.264
## 3rd Qu.: 50.00 3rd Qu.: 32.00 3rd Qu.: 58.00 3rd Qu.: 3.000
## Max. :258.00 Max. :191.00 Max. :291.00 Max. :15.000
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## Min. : 0.000 Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.: 2.000 1st Qu.: 0.00 1st Qu.: 3.000 1st Qu.: 3.00
## Median : 4.000 Median : 2.00 Median : 5.000 Median : 6.00
## Mean : 4.062 Mean : 2.66 Mean : 5.728 Mean : 5.25
## 3rd Qu.: 6.000 3rd Qu.: 4.00 3rd Qu.: 8.000 3rd Qu.: 7.00
## Max. :23.000 Max. :11.00 Max. :13.000 Max. :20.00
## AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2
## Min. :0.000 Min. :0.00 Min. :0.00 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.00 1st Qu.:0.00 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.00 Median :0.00 Median :0.000 Median :0.000
## Mean :0.078 Mean :0.08 Mean :0.09 Mean :0.048 Mean :0.014
## 3rd Qu.:0.000 3rd Qu.:0.00 3rd Qu.:0.00 3rd Qu.:0.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.00 Max. :1.00 Max. :1.000 Max. :1.000
## Complain Z_CostContact Z_Revenue Response
## Min. :0.000 Min. :3 Min. :11 Min. :0.000
## 1st Qu.:0.000 1st Qu.:3 1st Qu.:11 1st Qu.:0.000
## Median :0.000 Median :3 Median :11 Median :0.000
## Mean :0.008 Mean :3 Mean :11 Mean :0.158
## 3rd Qu.:0.000 3rd Qu.:3 3rd Qu.:11 3rd Qu.:0.000
## Max. :1.000 Max. :3 Max. :11 Max. :1.000
Here is the distribution of income for all customers in the campaign:
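The plotting code is not shown in this write-up, so the following is only a sketch of how such a histogram could be drawn with ggplot2. It runs on simulated incomes (the `toy` data frame is a stand-in, not the real data), and the bin width and colours are arbitrary choices.

```r
library(ggplot2)

# Simulated stand-in data; with the real data, use marketing_data instead
set.seed(1)
toy <- data.frame(Income = round(rlnorm(500, meanlog = 10.8, sdlog = 0.5)))

# Histogram of income; a bin width of 5000 is an illustrative choice
income_plot <- ggplot(toy, aes(x = Income)) +
  geom_histogram(binwidth = 5000, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Income", x = "Income", y = "Number of Customers")
print(income_plot)
```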
It is useful to investigate customers’ activity on the website in order to understand their behaviour. Taking a random sample of customers helps create cleaner visuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 23.00 49.00 48.62 74.00 99.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 3.00 6.00 5.25 7.00 20.00
Let’s see whether education level affects the kind of purchases customers make, since that would help us target specific audiences for specific products.
For wine, customers with a PhD stand out, so targeting them is a sensible way to increase sales:
For fruits, there is no large difference between education levels.
The same holds for meat purchases:
This matches everyday experience: fruits and meat are common needs for everyone, regardless of education level.
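A numeric check of this kind of comparison can complement the plots. The sketch below computes mean spending per education level with base R's `aggregate`. The `toy` data frame and its values are made up for illustration. With the real data, `marketing_data` would take its place.

```r
# Toy data (made-up values); with the real data, use marketing_data
toy <- data.frame(
  Education = c("PhD", "PhD", "Master", "Master", "Graduation", "Graduation"),
  MntWines  = c(600, 400, 350, 250, 300, 200),
  MntFruits = c(25, 27, 24, 28, 26, 30)
)

# Mean spending per education level for each product category
spend_by_edu <- aggregate(cbind(MntWines, MntFruits) ~ Education,
                          data = toy, FUN = mean)
print(spend_by_edu)
```

A large gap between rows in one column and near-identical values in another is the numeric counterpart of what the plots show.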
Now that the exploratory analysis is done, onward to clustering!
In this part, we segment customers using three popular clustering methods: K-means, hierarchical clustering, and PAM. To compare the results, we will plot each clustering on spending on wines vs. spending on meat products.
K-means clustering is a popular method that partitions data into k clusters based on similarity.
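Using the fviz_nbclust function from the factoextra package, we can plot the total within-cluster sum of squares against the number of clusters and look for the "elbow" that marks a reasonable choice of k: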
features_for_clustering <- select(marketing_data, MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts, MntGoldProds)
fviz_nbclust(features_for_clustering, kmeans, method = "wss") +
  geom_vline(xintercept = 3, linetype = 2)
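To make clear what the "wss" method measures, here is a minimal base-R sketch that computes the same quantity by hand: the total within-cluster sum of squares for each candidate k. It uses simulated data with three well-separated groups (the `toy` matrix is a stand-in, not the spending columns), and the range 1 to 6 is an arbitrary choice.

```r
# Simulated two-feature data with three groups (stand-in for the spending columns)
set.seed(42)
toy <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 5),  ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 20)$tot.withinss)

# The "elbow" is the k after which adding clusters barely reduces the WSS
plot(1:6, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```

On data like this the curve drops sharply up to k = 3 and flattens afterwards. On real data the elbow is often less clear-cut, so the choice of k involves some judgment.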
Let’s apply K-means to our dataset with the number of clusters suggested by the elbow plot, which is three:
# K-means starts from random centres, so fix the seed for reproducible clusters
set.seed(4242)
kmeans_model <- kmeans(features_for_clustering, centers = 3, nstart = 20)
marketing_data$Kmeans_Cluster <- as.factor(kmeans_model$cluster)
ggplot(marketing_data, aes(x = MntWines, y = MntMeatProducts, color = Kmeans_Cluster)) +
  geom_point() +
  labs(title = "K-means Clustering", x = "Spending on Wines", y = "Spending on Meat Products", color = "Cluster")
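Hierarchical clustering builds a tree-like structure of clusters by repeatedly merging the closest groups. We’ll use Ward’s method, which at each step merges the pair of clusters that gives the smallest increase in within-cluster variance and therefore tends to produce compact clusters that are easy to tell apart: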
hierarchical_model <- hclust(dist(features_for_clustering), method = "ward.D2")
hierarchical_clusters <- cutree(hierarchical_model, 3)
marketing_data$Hierarchical_Cluster <- as.factor(hierarchical_clusters)
ggplot(marketing_data, aes(x = MntWines, y = MntMeatProducts, color = Hierarchical_Cluster)) +
  geom_point() +
  labs(title = "Hierarchical Clustering", x = "Spending on Wines", y = "Spending on Meat Products", color = "Cluster")
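Since K-means and hierarchical clustering are run on the same features, a cross-tabulation of their labels is a quick way to see how far they agree. The sketch below does this on simulated data (the `toy` matrix is a stand-in). With the real data, the two cluster columns of `marketing_data` could be passed to `table` in the same way.

```r
# Simulated data with three groups (stand-in for the spending columns)
set.seed(42)
toy <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 5),  ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)

# Cluster labels from both methods, three clusters each
km <- kmeans(toy, centers = 3, nstart = 20)$cluster
hc <- cutree(hclust(dist(toy), method = "ward.D2"), k = 3)

# Rows: K-means labels, columns: hierarchical labels
agreement <- table(kmeans = km, hierarchical = hc)
print(agreement)
```

Cluster numbers are arbitrary, so agreement shows up as one dominant cell per row rather than as a filled diagonal.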
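PAM (Partitioning Around Medoids) is a partitioning algorithm similar to K-means, but each cluster is represented by a medoid, an actual observation from the data, instead of a mean. This makes it less sensitive to outliers. Here it is fitted on the two plotted features only: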
# Select relevant features for PAM
features_for_pam <- select(marketing_data, MntWines, MntMeatProducts)
pam_model <- pam(features_for_pam, k = 3)
# The cluster labels are stored in the 'clustering' component of the pam object
marketing_data$PAM_Cluster <- as.factor(pam_model$clustering)
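No plot is shown for PAM, so here is a hedged sketch of one in the same style as the other two, with the medoids overlaid. It runs on simulated spending values (the `toy` data frame is made up). The marker shape and size are arbitrary choices.

```r
library(cluster)
library(ggplot2)

# Simulated stand-in for the two spending columns (values are made up)
set.seed(42)
toy <- data.frame(
  MntWines        = c(rnorm(50, 50, 10), rnorm(50, 400, 40), rnorm(50, 900, 60)),
  MntMeatProducts = c(rnorm(50, 30, 8),  rnorm(50, 150, 25), rnorm(50, 500, 50))
)

pam_fit <- pam(toy, k = 3)
toy$PAM_Cluster <- as.factor(pam_fit$clustering)

# Medoids are actual observations, so they can be drawn as points
medoids <- as.data.frame(pam_fit$medoids)

pam_plot <- ggplot(toy, aes(x = MntWines, y = MntMeatProducts, color = PAM_Cluster)) +
  geom_point() +
  geom_point(data = medoids, aes(x = MntWines, y = MntMeatProducts),
             inherit.aes = FALSE, color = "black", shape = 4, size = 4) +
  labs(title = "PAM Clustering", x = "Spending on Wines",
       y = "Spending on Meat Products", color = "Cluster")
print(pam_plot)
```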
Now, the results!
With the clusters identified, businesses can now tailor marketing strategies to each segment. For example, high-spending clusters might be targeted with premium offers, while clusters showing low engagement might benefit from targeted promotions. Monitoring customer responses within each cluster can guide ongoing marketing efforts.
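To decide which cluster is the "high-spending" one, it helps to profile each cluster by its average spending and size. The sketch below does this on simulated data (the `toy` data frame and the `profile` table are illustrative). With the real data, the same `aggregate` call could be applied to `marketing_data` grouped by one of the cluster columns.

```r
# Simulated stand-in for the two spending columns (values are made up)
set.seed(42)
toy <- data.frame(
  MntWines        = c(rnorm(50, 50, 10), rnorm(50, 400, 40), rnorm(50, 900, 60)),
  MntMeatProducts = c(rnorm(50, 30, 8),  rnorm(50, 150, 25), rnorm(50, 500, 50))
)
toy$Cluster <- as.factor(kmeans(toy, centers = 3, nstart = 20)$cluster)

# Average spending per cluster, plus the number of customers in each
profile <- aggregate(cbind(MntWines, MntMeatProducts) ~ Cluster,
                     data = toy, FUN = mean)
profile$Customers <- as.vector(table(toy$Cluster))

# Sort so the highest-spending cluster comes first
print(profile[order(-profile$MntWines), ])
```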