Introduction

Using clustering techniques, we are going to explore which audience a company should target. Our dataset is the Customer Personality Analysis dataset from Kaggle: https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis. Let’s get started!

Preliminaries

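First, we load the packages used throughout the analysis and then the dataset:

# Load the packages used below, then the tab-separated dataset
library(dplyr); library(ggplot2); library(factoextra); library(cluster)
marketing_data <- read.csv("marketing_campaign.csv", sep = "\t")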

summary(marketing_data)
##        ID          Year_Birth    Education         Marital_Status    
##  Min.   :    0   Min.   :1893   Length:2240        Length:2240       
##  1st Qu.: 2828   1st Qu.:1959   Class :character   Class :character  
##  Median : 5458   Median :1970   Mode  :character   Mode  :character  
##  Mean   : 5592   Mean   :1969                                        
##  3rd Qu.: 8428   3rd Qu.:1977                                        
##  Max.   :11191   Max.   :1996                                        
##                                                                      
##      Income          Kidhome          Teenhome      Dt_Customer       
##  Min.   :  1730   Min.   :0.0000   Min.   :0.0000   Length:2240       
##  1st Qu.: 35303   1st Qu.:0.0000   1st Qu.:0.0000   Class :character  
##  Median : 51382   Median :0.0000   Median :0.0000   Mode  :character  
##  Mean   : 52247   Mean   :0.4442   Mean   :0.5062                     
##  3rd Qu.: 68522   3rd Qu.:1.0000   3rd Qu.:1.0000                     
##  Max.   :666666   Max.   :2.0000   Max.   :2.0000                     
##  NA's   :24                                                           
##     Recency         MntWines         MntFruits     MntMeatProducts 
##  Min.   : 0.00   Min.   :   0.00   Min.   :  0.0   Min.   :   0.0  
##  1st Qu.:24.00   1st Qu.:  23.75   1st Qu.:  1.0   1st Qu.:  16.0  
##  Median :49.00   Median : 173.50   Median :  8.0   Median :  67.0  
##  Mean   :49.11   Mean   : 303.94   Mean   : 26.3   Mean   : 166.9  
##  3rd Qu.:74.00   3rd Qu.: 504.25   3rd Qu.: 33.0   3rd Qu.: 232.0  
##  Max.   :99.00   Max.   :1493.00   Max.   :199.0   Max.   :1725.0  
##                                                                    
##  MntFishProducts  MntSweetProducts  MntGoldProds    NumDealsPurchases
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   : 0.000   
##  1st Qu.:  3.00   1st Qu.:  1.00   1st Qu.:  9.00   1st Qu.: 1.000   
##  Median : 12.00   Median :  8.00   Median : 24.00   Median : 2.000   
##  Mean   : 37.53   Mean   : 27.06   Mean   : 44.02   Mean   : 2.325   
##  3rd Qu.: 50.00   3rd Qu.: 33.00   3rd Qu.: 56.00   3rd Qu.: 3.000   
##  Max.   :259.00   Max.   :263.00   Max.   :362.00   Max.   :15.000   
##                                                                      
##  NumWebPurchases  NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
##  Min.   : 0.000   Min.   : 0.000      Min.   : 0.00     Min.   : 0.000   
##  1st Qu.: 2.000   1st Qu.: 0.000      1st Qu.: 3.00     1st Qu.: 3.000   
##  Median : 4.000   Median : 2.000      Median : 5.00     Median : 6.000   
##  Mean   : 4.085   Mean   : 2.662      Mean   : 5.79     Mean   : 5.317   
##  3rd Qu.: 6.000   3rd Qu.: 4.000      3rd Qu.: 8.00     3rd Qu.: 7.000   
##  Max.   :27.000   Max.   :28.000      Max.   :13.00     Max.   :20.000   
##                                                                          
##   AcceptedCmp3      AcceptedCmp4      AcceptedCmp5      AcceptedCmp1    
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.07277   Mean   :0.07455   Mean   :0.07277   Mean   :0.06429  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##                                                                         
##   AcceptedCmp2        Complain        Z_CostContact   Z_Revenue 
##  Min.   :0.00000   Min.   :0.000000   Min.   :3     Min.   :11  
##  1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:3     1st Qu.:11  
##  Median :0.00000   Median :0.000000   Median :3     Median :11  
##  Mean   :0.01339   Mean   :0.009375   Mean   :3     Mean   :11  
##  3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:3     3rd Qu.:11  
##  Max.   :1.00000   Max.   :1.000000   Max.   :3     Max.   :11  
##                                                                 
##     Response     
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.1491  
##  3rd Qu.:0.0000  
##  Max.   :1.0000  
## 

Data pre-processing is an important step in any data analysis or machine learning task. The summary above shows that Income is the only column with missing values (24 NA's); the check below confirms this.

# Check for missing values
missing_values <- colSums(is.na(marketing_data))
print("Missing Values:")
## [1] "Missing Values:"
print(missing_values)
##                  ID          Year_Birth           Education      Marital_Status 
##                   0                   0                   0                   0 
##              Income             Kidhome            Teenhome         Dt_Customer 
##                  24                   0                   0                   0 
##             Recency            MntWines           MntFruits     MntMeatProducts 
##                   0                   0                   0                   0 
##     MntFishProducts    MntSweetProducts        MntGoldProds   NumDealsPurchases 
##                   0                   0                   0                   0 
##     NumWebPurchases NumCatalogPurchases   NumStorePurchases   NumWebVisitsMonth 
##                   0                   0                   0                   0 
##        AcceptedCmp3        AcceptedCmp4        AcceptedCmp5        AcceptedCmp1 
##                   0                   0                   0                   0 
##        AcceptedCmp2            Complain       Z_CostContact           Z_Revenue 
##                   0                   0                   0                   0 
##            Response 
##                   0
# Convert 'Dt_Customer' to Date (values are formatted as day-month-year)
marketing_data$Dt_Customer <- as.Date(marketing_data$Dt_Customer, format = "%d-%m-%Y")

# Replace missing (NA) incomes with the mean income
marketing_data$Income[is.na(marketing_data$Income)] <- mean(marketing_data$Income, na.rm = TRUE)
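To make the imputation step concrete, here is a small sketch on a toy vector. The values are made up for illustration and are not taken from the dataset. The median variant is shown only as a possible alternative, since the mean can be pulled up by extreme values such as the 666666 maximum visible in the summary.

```r
# Toy income vector (hypothetical values, not from the dataset)
income <- c(30000, 45000, NA, 60000, 666666, NA)

# Mean imputation, the approach used in this analysis
income_mean <- income
income_mean[is.na(income_mean)] <- mean(income, na.rm = TRUE)

# Median imputation: an alternative that is less affected by extreme values
income_median <- income
income_median[is.na(income_median)] <- median(income, na.rm = TRUE)

print(income_mean)
print(income_median)
```

In this toy case the single extreme value drags the imputed mean far above the typical income, while the median stays close to the bulk of the values.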

Exploratory Data Analysis

Exploring the data is an important step. I generate some plots to get a high-level overview of the data. Since there are 2240 observations, plotting a random sample keeps the visuals readable:

set.seed(4242)

# Create a sample of 500 rows for analysis
sample_size <- 500
sample_data <- marketing_data %>% sample_n(sample_size)

# Display a summary of the sample
summary(sample_data)
##        ID          Year_Birth    Education         Marital_Status    
##  Min.   :    1   Min.   :1893   Length:500         Length:500        
##  1st Qu.: 2925   1st Qu.:1958   Class :character   Class :character  
##  Median : 5778   Median :1969   Mode  :character   Mode  :character  
##  Mean   : 5651   Mean   :1968                                        
##  3rd Qu.: 8320   3rd Qu.:1976                                        
##  Max.   :11187   Max.   :1993                                        
##      Income          Kidhome         Teenhome      Dt_Customer        
##  Min.   :  1730   Min.   :0.000   Min.   :0.000   Min.   :2012-08-01  
##  1st Qu.: 35409   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:2013-02-05  
##  Median : 54158   Median :0.000   Median :0.000   Median :2013-07-27  
##  Mean   : 53448   Mean   :0.434   Mean   :0.522   Mean   :2013-07-18  
##  3rd Qu.: 68667   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:2014-01-03  
##  Max.   :666666   Max.   :2.000   Max.   :2.000   Max.   :2014-06-29  
##     Recency         MntWines         MntFruits      MntMeatProducts
##  Min.   : 0.00   Min.   :   0.00   Min.   :  0.00   Min.   :  1.0  
##  1st Qu.:23.00   1st Qu.:  24.75   1st Qu.:  1.75   1st Qu.: 16.0  
##  Median :49.00   Median : 182.00   Median :  9.00   Median : 69.0  
##  Mean   :48.62   Mean   : 313.23   Mean   : 27.57   Mean   :167.8  
##  3rd Qu.:74.00   3rd Qu.: 516.50   3rd Qu.: 35.00   3rd Qu.:232.0  
##  Max.   :99.00   Max.   :1493.00   Max.   :194.00   Max.   :951.0  
##  MntFishProducts  MntSweetProducts  MntGoldProds    NumDealsPurchases
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   : 0.000   
##  1st Qu.:  3.00   1st Qu.:  1.00   1st Qu.:  9.00   1st Qu.: 1.000   
##  Median : 12.00   Median :  8.00   Median : 24.00   Median : 2.000   
##  Mean   : 34.94   Mean   : 24.55   Mean   : 45.73   Mean   : 2.264   
##  3rd Qu.: 50.00   3rd Qu.: 32.00   3rd Qu.: 58.00   3rd Qu.: 3.000   
##  Max.   :258.00   Max.   :191.00   Max.   :291.00   Max.   :15.000   
##  NumWebPurchases  NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
##  Min.   : 0.000   Min.   : 0.00       Min.   : 0.000    Min.   : 0.00    
##  1st Qu.: 2.000   1st Qu.: 0.00       1st Qu.: 3.000    1st Qu.: 3.00    
##  Median : 4.000   Median : 2.00       Median : 5.000    Median : 6.00    
##  Mean   : 4.062   Mean   : 2.66       Mean   : 5.728    Mean   : 5.25    
##  3rd Qu.: 6.000   3rd Qu.: 4.00       3rd Qu.: 8.000    3rd Qu.: 7.00    
##  Max.   :23.000   Max.   :11.00       Max.   :13.000    Max.   :20.00    
##   AcceptedCmp3    AcceptedCmp4   AcceptedCmp5   AcceptedCmp1    AcceptedCmp2  
##  Min.   :0.000   Min.   :0.00   Min.   :0.00   Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.000   1st Qu.:0.00   1st Qu.:0.00   1st Qu.:0.000   1st Qu.:0.000  
##  Median :0.000   Median :0.00   Median :0.00   Median :0.000   Median :0.000  
##  Mean   :0.078   Mean   :0.08   Mean   :0.09   Mean   :0.048   Mean   :0.014  
##  3rd Qu.:0.000   3rd Qu.:0.00   3rd Qu.:0.00   3rd Qu.:0.000   3rd Qu.:0.000  
##  Max.   :1.000   Max.   :1.00   Max.   :1.00   Max.   :1.000   Max.   :1.000  
##     Complain     Z_CostContact   Z_Revenue     Response    
##  Min.   :0.000   Min.   :3     Min.   :11   Min.   :0.000  
##  1st Qu.:0.000   1st Qu.:3     1st Qu.:11   1st Qu.:0.000  
##  Median :0.000   Median :3     Median :11   Median :0.000  
##  Mean   :0.008   Mean   :3     Mean   :11   Mean   :0.158  
##  3rd Qu.:0.000   3rd Qu.:3     3rd Qu.:11   3rd Qu.:0.000  
##  Max.   :1.000   Max.   :3     Max.   :11   Max.   :1.000

Here is the distribution of income for all people in the campaign:

It is useful to investigate customers’ activity on the website in order to understand their behaviour. Taking a random sample of customers helps create cleaner visuals. The two summaries below are for Recency and NumWebVisitsMonth in the 500-row sample:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   23.00   49.00   48.62   74.00   99.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    3.00    6.00    5.25    7.00   20.00

Let’s see whether education level is related to the kind of purchases customers make; this will help us target specific audiences for specific products.

For wine, customers with a PhD appear to spend the most, so targeting them may help increase sales:

For fruits, there is no huge difference between education levels.

Same observation for meat purchases:

This is consistent with everyday experience: fruit and meat are common needs for everyone, regardless of education level.
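A comparison like this can be backed by a group-wise summary. The sketch below uses a small made-up data frame (only the column names mirror the dataset, the values are invented) and base R's aggregate to compute mean spending per education level.

```r
# Hypothetical toy data; column names mirror the dataset, values are invented
toy <- data.frame(
  Education = c("PhD", "PhD", "Master", "Master", "Graduation", "Graduation"),
  MntWines  = c(600, 400, 300, 350, 250, 270),
  MntFruits = c(20, 30, 25, 27, 24, 26)
)

# Mean spending per education level for each product category
spend_by_edu <- aggregate(cbind(MntWines, MntFruits) ~ Education,
                          data = toy, FUN = mean)
print(spend_by_edu)
```

On the real data, the same call with data = marketing_data (and MntMeatProducts added inside cbind) should give the numbers behind the plots.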

With the exploratory analysis done, onward to clustering!

Clustering

In this analysis, we delve into customer segmentation using three popular clustering methods: K-means, hierarchical clustering, and PAM. We will visualise the resulting clusters by plotting spending on wines against spending on meat products.

K-Means Clustering

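K-means clustering is a popular method that partitions data into k clusters by assigning each observation to the cluster with the nearest centre (the mean of its members).

Using the fviz_nbclust function from the factoextra package, we can estimate a suitable number of clusters with the elbow method, which plots the total within-cluster sum of squares (WSS) against k: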

features_for_clustering <- select(marketing_data, MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts, MntGoldProds)

fviz_nbclust(features_for_clustering, kmeans, method = "wss") +
  geom_vline(xintercept = 3, linetype = 2)
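To see what the elbow plot is built from, the WSS curve can be computed by hand. This is a minimal sketch on synthetic data (three artificial blobs, not the marketing data), using only base R's kmeans; the seed and the range of k are arbitrary choices.

```r
set.seed(1)  # arbitrary seed so the toy example is reproducible

# Synthetic data: three well-separated 2-D blobs (not the marketing data)
toy <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 20)$tot.withinss
})

# The "elbow" is the k after which WSS stops dropping substantially
print(data.frame(k = 1:6, wss = round(wss, 1)))
```

For these blobs the WSS falls sharply up to k = 3 and then flattens, which is the pattern one looks for in the fviz_nbclust plot.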

Let’s apply K-means to our dataset with three clusters, the number suggested by the elbow plot:

kmeans_model <- kmeans(features_for_clustering, centers = 3, nstart = 20)
marketing_data$Kmeans_Cluster <- as.factor(kmeans_model$cluster)
ggplot(marketing_data, aes(x = MntWines, y = MntMeatProducts, color = Kmeans_Cluster)) +
  geom_point() +
  labs(title = "K-means Clustering", x = "Spending on Wines", y = "Spending on Meat Products", color = "Cluster")
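Besides the scatter plot, the fitted object itself describes the segments. The sketch below fits kmeans on synthetic spending data (two artificial groups, invented values) and prints the cluster sizes and centres; the column names are borrowed from the dataset only for readability.

```r
set.seed(1)  # arbitrary seed for a reproducible toy example

# Synthetic spending data: a low-spending and a high-spending group
toy <- data.frame(
  MntWines        = c(rnorm(30, mean = 50, sd = 10), rnorm(30, mean = 800, sd = 50)),
  MntMeatProducts = c(rnorm(30, mean = 20, sd = 5),  rnorm(30, mean = 400, sd = 40))
)

fit <- kmeans(toy, centers = 2, nstart = 20)

# Number of observations per cluster
print(fit$size)

# Cluster centres, i.e. mean spending per cluster
print(fit$centers)
```

Applied to the model above, kmeans_model$size and kmeans_model$centers would give the same kind of profile for the three customer segments.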

Hierarchical Clustering

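Hierarchical clustering builds a tree-like structure of clusters. We’ll use Ward’s method, which at each step merges the pair of clusters that gives the smallest increase in within-cluster variance and so tends to produce compact clusters that are easy to tell apart: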

hierarchical_model <- hclust(dist(features_for_clustering), method = "ward.D2")
hierarchical_clusters <- cutree(hierarchical_model, 3)
marketing_data$Hierarchical_Cluster <- as.factor(hierarchical_clusters)

ggplot(marketing_data, aes(x = MntWines, y = MntMeatProducts, color = Hierarchical_Cluster)) +
  geom_point() +
  labs(title = "Hierarchical Clustering", x = "Spending on Wines", y = "Spending on Meat Products", color = "Cluster")
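The hclust and cutree steps are easier to follow on a tiny example. Here is a sketch with five made-up one-dimensional points that form two obvious groups; the values are purely illustrative.

```r
# Five 1-D points forming two obvious groups (toy values)
x <- c(1, 2, 3, 10, 11)

# Build the tree with Ward's method on Euclidean distances
tree <- hclust(dist(x), method = "ward.D2")

# Cut the tree into two clusters; returns one label per observation
groups <- cutree(tree, k = 2)
print(groups)

# Merge heights: the last (largest) one joins the two groups
print(tree$height)
```

The large jump in the final merge height is what makes cutting the tree at two clusters a natural choice here.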

PAM

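PAM (Partitioning Around Medoids) is a partitioning clustering algorithm similar to K-means, but it uses medoids (actual observations) instead of means as cluster centres, which makes it less sensitive to outliers. Here we fit it on the two plotted features only, wine and meat spending: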

# Select relevant features for PAM
features_for_pam <- select(marketing_data, MntWines, MntMeatProducts)

pam_model <- pam(features_for_pam, k = 3)

marketing_data$PAM_Cluster <- as.factor(pam_model$clustering)
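To illustrate what "medoids instead of means" looks like in practice, here is a small sketch with made-up spending values (not from the dataset). It relies on pam from the cluster package and on the medoids, id.med and clustering components of the fitted object.

```r
library(cluster)  # provides pam()

# Toy 2-D data with two clear groups (hypothetical values)
toy <- data.frame(
  MntWines        = c(10, 12, 11, 200, 210, 205),
  MntMeatProducts = c(5, 5, 5, 100, 100, 100)
)

pam_fit <- pam(toy, k = 2)

# Medoids are rows of the input data, not averaged coordinates
print(pam_fit$medoids)

# Row indices of the medoids and the cluster label of each observation
print(pam_fit$id.med)
print(pam_fit$clustering)
```

Unlike a K-means centre, each medoid can be looked up as a real customer, which can make the segments easier to describe.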

Now, the results!

Conclusion

With the clusters identified, businesses can now tailor marketing strategies to each segment. For example, high-spending clusters might be targeted with premium offers, while clusters showing low engagement might benefit from targeted promotions. Monitoring customer responses within each cluster can guide ongoing marketing efforts.