Customer Segmentation (Smriti Raman)

Author

Smriti

MARKET SEGMENTATION

1. Introduction

Market segmentation involves breaking a broad market into smaller, more defined groups of consumers who share similar needs, characteristics, or behaviors. This approach enables businesses to target specific audiences more precisely, customize products and marketing strategies, and allocate resources more effectively. By recognizing and meeting the unique requirements of each segment, companies can enhance customer satisfaction, boost sales, and strengthen their competitive edge.

2. Descriptive Data Mining

Descriptive data mining is the practice of examining data to summarize its key characteristics and reveal patterns or relationships within it. Unlike predictive data mining, which aims to forecast future events, descriptive analysis concentrates on understanding past data. It uncovers trends, groupings, and associations such as customer segments, frequent item combinations, or correlations that offer insights into historical behaviors or conditions. This type of analysis is commonly applied in reporting, informed decision-making, and identifying hidden structures within large datasets.

# Rattle is Copyright (c) 2006-2021 Togaware Pty Ltd.
# It is free (as in libre) open source software.
# It is licensed under the GNU General Public License,
# Version 2. Rattle comes with ABSOLUTELY NO WARRANTY.
# Rattle was written by Graham Williams with contributions
# from others as acknowledged in 'library(help=rattle)'.
# Visit https://rattle.togaware.com/ for details.

#=======================================================================
# Rattle timestamp: 2025-07-15 11:55:07.733592 x86_64-w64-mingw32 

# Rattle version 5.5.1 user 'smrit'

# This log captures interactions with Rattle as an R script. 

# For repeatability, export this activity log to a 
# file, like 'model.R' using the Export button or 
# through the Tools menu. The script can then serve as a 
# starting point for developing your own scripts. 
# After exporting to a file called 'model.R', for example, 
# you can type into a new R Console the command 
# "source('model.R')" and so repeat all actions. Generally, 
# you will want to edit the file to suit your own needs. 
# You can also edit this log in place to record additional 
# information before exporting the script. 
 
# Note that saving/loading projects retains this log.

# We begin most scripts by loading the required packages.
# Here are some initial packages to load and others will be
# identified as we proceed through the script. When writing
# our own scripts we often collect together the library
# commands at the beginning of the script here.

library(rattle)   # Access the weather dataset and utilities.

Loading required package: tibble

Loading required package: bitops

Rattle: A free graphical interface for data science with R.
Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.

library(magrittr) # Utilise %>% and %<>% pipeline operators.

# This log generally records the process of building a model. 
# However, with very little effort the log can also be used 
# to score a new dataset. The logical variable 'building' 
# is used to toggle between generating transformations, 
# when building a model and using the transformations, 
# when scoring a dataset.

building <- TRUE
scoring  <- ! building

# A pre-defined value is used to reset the random seed 
# so that results are repeatable.

crv$seed <- 42 

#=======================================================================
# Rattle timestamp: 2025-07-15 11:55:23.990856 x86_64-w64-mingw32 

# Load a dataset from file.

library(readxl, quietly=TRUE)

 crs$dataset <- read_excel("C:/Users/smrit/OneDrive/Documents/Amrita/DemoKTC.xlsx", guess_max=1e4)

 crs$dataset

# A tibble: 30 × 7
     Age Female Income Married Children  Loan Mortgage
   <dbl>  <dbl>  <dbl>   <dbl>    <dbl> <dbl>    <dbl>
 1    48      1 17546        0        1     0        0
 2    40      0 30085.       1        3     1        1
 3    51      1 16575.       1        0     1        0
 4    23      1 20375.       1        3     0        0
 5    57      1 50576.       1        0     0        0
 6    57      1 37870.       1        2     0        0
 7    22      0  8877.       0        0     0        0
 8    58      0 24947.       1        0     1        0
 9    37      1 25304.       1        2     1        0
10    54      0 24212.       1        2     1        0
# ℹ 20 more rows

#=======================================================================
# Rattle timestamp: 2025-07-15 11:55:24.921638 x86_64-w64-mingw32 

# Action the user selections from the Data tab. 

# Build the train/validate/test datasets.

# nobs=30 train=21 validate=4 test=5

set.seed(crv$seed)

crs$nobs <- nrow(crs$dataset)

crs$train <- sample(crs$nobs, 0.7*crs$nobs)

crs$nobs %>%
  seq_len() %>%
  setdiff(crs$train) %>%
  sample(0.15*crs$nobs) ->
crs$validate

crs$nobs %>%
  seq_len() %>%
  setdiff(crs$train) %>%
  setdiff(crs$validate) ->
crs$test

# The following variable selections have been noted.

crs$input     <- c("Age", "Female", "Income", "Married",
                   "Children", "Loan")

crs$numeric   <- c("Age", "Female", "Income", "Married",
                   "Children", "Loan")

crs$categoric <- NULL

crs$target    <- "Mortgage"
crs$risk      <- NULL
crs$ident     <- NULL
crs$ignore    <- NULL
crs$weights   <- NULL

#=======================================================================
# Rattle timestamp: 2025-07-15 11:55:28.079238 x86_64-w64-mingw32 

# Action the user selections from the Data tab. 

# Build the train/validate/test datasets.

# nobs=30 train=21 validate=4 test=5

set.seed(crv$seed)

crs$nobs <- nrow(crs$dataset)

crs$train <- sample(crs$nobs, 0.7*crs$nobs)

crs$nobs %>%
  seq_len() %>%
  setdiff(crs$train) %>%
  sample(0.15*crs$nobs) ->
crs$validate

crs$nobs %>%
  seq_len() %>%
  setdiff(crs$train) %>%
  setdiff(crs$validate) ->
crs$test

# The following variable selections have been noted.

crs$input     <- c("Age", "Female", "Income", "Married",
                   "Children", "Loan", "Mortgage")

crs$numeric   <- c("Age", "Female", "Income", "Married",
                   "Children", "Loan", "Mortgage")

crs$categoric <- NULL

crs$target    <- NULL
crs$risk      <- NULL
crs$ident     <- NULL
crs$ignore    <- NULL
crs$weights   <- NULL

2.1 Data Exploration

The dataset for KTC Company consists of 30 records and 7 variables: Age, Gender (recorded as the 0/1 indicator Female), Income, Marital Status (Married), Number of Children, Loan status, and Mortgage status.

library(Hmisc, quietly=TRUE)


Attaching package: 'Hmisc'

The following objects are masked from 'package:base':

    format.pval, units

# Obtain a summary of the dataset.

contents(crs$dataset[, c(crs$input, crs$risk, crs$target)])


Data frame:crs$dataset[, c(crs$input, crs$risk, crs$target)]    30 observations and 7 variables    Maximum # NAs:0

         Storage
Age       double
Female    double
Income    double
Married   double
Children  double
Loan      double
Mortgage  double

summary(crs$dataset[, c(crs$input, crs$risk, crs$target)])

      Age            Female           Income         Married   
 Min.   :22.00   Min.   :0.0000   Min.   : 8877   Min.   :0.0  
 1st Qu.:37.25   1st Qu.:0.0000   1st Qu.:18166   1st Qu.:1.0  
 Median :47.00   Median :1.0000   Median :24241   Median :1.0  
 Mean   :45.97   Mean   :0.5667   Mean   :28012   Mean   :0.8  
 3rd Qu.:56.75   3rd Qu.:1.0000   3rd Qu.:35923   3rd Qu.:1.0  
 Max.   :66.00   Max.   :1.0000   Max.   :59804   Max.   :1.0  
    Children           Loan           Mortgage  
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0  
 Median :0.5000   Median :0.0000   Median :0.0  
 Mean   :0.9333   Mean   :0.4333   Mean   :0.4  
 3rd Qu.:2.0000   3rd Qu.:1.0000   3rd Qu.:1.0  
 Max.   :3.0000   Max.   :1.0000   Max.   :1.0

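The summary shows 30 complete observations with no missing values. Age ranges from 22 to 66 (median 47), and Income ranges from 8,877 to 59,804 (median 24,241). Female, Married, Loan, and Mortgage are stored as 0/1 indicators, and Children is a count from 0 to 3.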
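Because Female, Married, Loan, and Mortgage are 0/1 indicators, each of their means in the summary is the share of customers with that attribute. As a small sketch, the printed means can be converted back to head counts. The name `shares` is only illustrative, and the values are the rounded means from the summary output, not a fresh computation on the data.

```r
n <- 30  # number of observations reported by contents()

# Means of the 0/1 indicators, copied from the summary() output (rounded there).
shares <- c(Female = 0.5667, Married = 0.8, Loan = 0.4333, Mortgage = 0.4)

# The mean of a 0/1 variable times n is the count of ones; round() absorbs
# the rounding already present in the printed means.
counts <- round(shares * n)
print(counts)
```

Reading the indicators as counts makes the later plots easier to interpret, since a box plot or density curve adds little for a variable that takes only two values.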

2.1.1 Age

# Use ggplot2 to generate box plot for Age

# Generate a box plot.

p01 <- crs %>%
  with(dataset[,]) %>%
  ggplot2::ggplot(ggplot2::aes(y=Age)) +
  ggplot2::geom_boxplot(ggplot2::aes(x="All"), notch=TRUE, fill="navy") +
  ggplot2::stat_summary(ggplot2::aes(x="All"), fun=mean, geom="point", shape=8) +
  ggplot2::xlab("Rattle 2025-Jul-15 12:42:43 smrit") +
  ggplot2::ggtitle("Distribution of Age") +
  ggplot2::theme(legend.position="none")

# Display the plots.

gridExtra::grid.arrange(p01)

The box plot shows no outliers in the Age data.
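The box plot's conclusion can be cross-checked with the usual 1.5 × IQR rule, using only the quartiles printed in the summary. This is a sketch under the assumption that the plot's hinges are close to the quartiles reported by summary(); the variable names are illustrative.

```r
# Quartiles, minimum and maximum of Age, copied from the summary() output.
q1 <- 37.25
q3 <- 56.75
age_min <- 22
age_max <- 66

iqr <- q3 - q1                 # interquartile range
lower_fence <- q1 - 1.5 * iqr  # values below this would be flagged
upper_fence <- q3 + 1.5 * iqr  # values above this would be flagged

# No outliers if the whole observed range lies inside the fences.
no_outliers <- age_min >= lower_fence && age_max <= upper_fence
print(c(lower = lower_fence, upper = upper_fence))
print(no_outliers)
```

The fences come out at 8 and 86, so the observed range of 22 to 66 sits well inside them.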

Histogram

# Rattle timestamp: 2025-07-15 12:45:36.104563 x86_64-w64-mingw32 

# Display histogram plots for the selected variables. 

# Use ggplot2 to generate histogram plot for Age

# Generate the plot.

p01 <- crs %>%
  with(dataset[,]) %>%
  dplyr::select(Age) %>%
  ggplot2::ggplot(ggplot2::aes(x=Age)) +
  ggplot2::geom_density(lty=3) +
  ggplot2::xlab("Age\n\nRattle 2025-Jul-15 12:45:36 smrit") +
  ggplot2::ggtitle("Distribution of Age") +
  ggplot2::labs(y="Density")

# Display the plots.

gridExtra::grid.arrange(p01)

The density curve for the Age variable is slightly left-skewed: the mean (45.97) lies below the median (47.00).

2.1.2 Gender

# Display box plots for the selected variables. 

# Use ggplot2 to generate box plot for Female

# Generate a box plot.

p01 <- crs %>%
  with(dataset[,]) %>%
  ggplot2::ggplot(ggplot2::aes(y=Female)) +
  ggplot2::geom_boxplot(ggplot2::aes(x="All"), notch=TRUE, fill="pink") +
  ggplot2::stat_summary(ggplot2::aes(x="All"), fun=mean, geom="point", shape=8) +
  ggplot2::xlab("Rattle 2025-Jul-15 12:50:54 smrit") +
  ggplot2::ggtitle("Distribution of Female") +
  ggplot2::theme(legend.position="none")

# Display the plots.

gridExtra::grid.arrange(p01)

Notch went outside hinges
ℹ Do you want `notch = FALSE`?

Histogram

# Rattle timestamp: 2025-08-12 22:53:55.721733 x86_64-w64-mingw32 

# Display histogram plots for the selected variables. 

# Use ggplot2 to generate histogram plot for Female

# Generate the plot.

p01 <- crs %>%
  with(dataset[,]) %>%
  dplyr::select(Female) %>%
  ggplot2::ggplot(ggplot2::aes(x=Female)) +
  ggplot2::geom_density(lty=3) +
  ggplot2::xlab("Female\n\nRattle 2025-Aug-12 22:53:55 smrit") +
  ggplot2::ggtitle("Distribution of Female") +
  ggplot2::labs(y="Density")

# Display the plots.

gridExtra::grid.arrange(p01)

2.1.3 Income

# Display box plots for the selected variables. 

# Use ggplot2 to generate box plot for Income

# Generate a box plot.

p01 <- crs %>%
  with(dataset[,]) %>%
  ggplot2::ggplot(ggplot2::aes(y=Income)) +
  ggplot2::geom_boxplot(ggplot2::aes(x="All"), notch=TRUE, fill="grey") +
  ggplot2::stat_summary(ggplot2::aes(x="All"), fun=mean, geom="point", shape=8) +
  ggplot2::xlab("Rattle 2025-Jul-17 11:57:19 smrit") +
  ggplot2::ggtitle("Distribution of Income") +
  ggplot2::theme(legend.position="none")

# Display the plots.

gridExtra::grid.arrange(p01)

# Display histogram plots for the selected variables. 

# Use ggplot2 to generate histogram plot for Income

# Generate the plot.

p01 <- crs %>%
  with(dataset[,]) %>%
  dplyr::select(Income) %>%
  ggplot2::ggplot(ggplot2::aes(x=Income)) +
  ggplot2::geom_density(lty=3) +
  ggplot2::xlab("Income\n\nRattle 2025-Jul-17 11:58:37 smrit") +
  ggplot2::ggtitle("Distribution of Income") +
  ggplot2::labs(y="Density")

# Display the plots.

gridExtra::grid.arrange(p01)

The chart indicates that most customers have a moderate income, concentrated slightly above $20,000 (the median is about $24,241). The distribution is right-skewed: the mean (about $28,012) exceeds the median, and a smaller group of customers earns considerably more, up to about $59,804.

2.1.4 Marital Status

# Display box plots for the selected variables. 

# Use ggplot2 to generate box plot for Married

# Generate a box plot.

p01 <- crs %>%
  with(dataset[,]) %>%
  ggplot2::ggplot(ggplot2::aes(y=Married)) +
  ggplot2::geom_boxplot(ggplot2::aes(x="All"), notch=TRUE, fill="grey") +
  ggplot2::stat_summary(ggplot2::aes(x="All"), fun=mean, geom="point", shape=8) +
  ggplot2::xlab("Rattle 2025-Jul-17 12:00:13 smrit") +
  ggplot2::ggtitle("Distribution of Married") +
  ggplot2::theme(legend.position="none")

# Display the plots.

gridExtra::grid.arrange(p01)

# Display histogram plots for the selected variables. 

# Use ggplot2 to generate histogram plot for Married

# Generate the plot.

p01 <- crs %>%
  with(dataset[,]) %>%
  dplyr::select(Married) %>%
  ggplot2::ggplot(ggplot2::aes(x=Married)) +
  ggplot2::geom_density(lty=3) +
  ggplot2::xlab("Married\n\nRattle 2025-Jul-17 12:01:04 smrit") +
  ggplot2::ggtitle("Distribution of Married") +
  ggplot2::labs(y="Density")

# Display the plots.

gridExtra::grid.arrange(p01)

Most customers are married: the mean of the Married indicator is 0.8, which corresponds to 24 of the 30 customers.

2.1.5 Children

# Display box plots for the selected variables. 

# Use ggplot2 to generate box plot for Children

# Generate a box plot.

p01 <- crs %>%
  with(dataset[,]) %>%
  ggplot2::ggplot(ggplot2::aes(y=Children)) +
  ggplot2::geom_boxplot(ggplot2::aes(x="All"), notch=TRUE, fill="purple") +
  ggplot2::stat_summary(ggplot2::aes(x="All"), fun=mean, geom="point", shape=8) +
  ggplot2::xlab("Rattle 2025-Jul-17 12:02:05 smrit") +
  ggplot2::ggtitle("Distribution of Children") +
  ggplot2::theme(legend.position="none")

# Display the plots.

gridExtra::grid.arrange(p01)

Notch went outside hinges
ℹ Do you want `notch = FALSE`?

# Display histogram plots for the selected variables. 

# Use ggplot2 to generate histogram plot for Children

# Generate the plot.

p01 <- crs %>%
  with(dataset[,]) %>%
  dplyr::select(Children) %>%
  ggplot2::ggplot(ggplot2::aes(x=Children)) +
  ggplot2::geom_density(lty=3) +
  ggplot2::xlab("Children\n\nRattle 2025-Jul-17 12:02:52 smrit") +
  ggplot2::ggtitle("Distribution of Children") +
  ggplot2::labs(y="Density")

# Display the plots.

gridExtra::grid.arrange(p01)

Children is a small count variable ranging from 0 to 3, with a median of 0.5 and a mean of 0.93, so half of the customers have no children. A first-digit comparison against Benford's Law is not informative here. Benford's Law applies to data that span several orders of magnitude, so the absence of the digits 4 through 9 simply reflects the range of the variable and is not evidence that the data are artificial.

2.1.6 Loan

# Display box plots for the selected variables. 

# Use ggplot2 to generate box plot for Loan

# Generate a box plot.

p01 <- crs %>%
  with(dataset[,]) %>%
  ggplot2::ggplot(ggplot2::aes(y=Loan)) +
  ggplot2::geom_boxplot(ggplot2::aes(x="All"), notch=TRUE, fill="grey") +
  ggplot2::stat_summary(ggplot2::aes(x="All"), fun=mean, geom="point", shape=8) +
  ggplot2::xlab("Rattle 2025-Jul-17 12:03:45 smrit") +
  ggplot2::ggtitle("Distribution of Loan") +
  ggplot2::theme(legend.position="none")

# Display the plots.

gridExtra::grid.arrange(p01)

Notch went outside hinges
ℹ Do you want `notch = FALSE`?

# Display histogram plots for the selected variables. 

# Use ggplot2 to generate histogram plot for Loan

# Generate the plot.

p01 <- crs %>%
  with(dataset[,]) %>%
  dplyr::select(Loan) %>%
  ggplot2::ggplot(ggplot2::aes(x=Loan)) +
  ggplot2::geom_density(lty=3) +
  ggplot2::xlab("Loan\n\nRattle 2025-Jul-17 12:04:35 smrit") +
  ggplot2::ggtitle("Distribution of Loan") +
  ggplot2::labs(y="Density")

# Display the plots.

gridExtra::grid.arrange(p01)

The density plot for Loan shows two peaks: a larger one near 0 (customers without a loan) and a smaller one near 1 (customers with a loan), with almost no density around 0.5. Loan is a binary variable, so the curve between 0 and 1 is only an artifact of smoothing. The mean of 0.4333 means that 13 of the 30 customers have a loan and 17 do not.
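For a binary variable such as Loan, a frequency table is usually easier to read than a density curve. The sketch below rebuilds a stand-in vector from the reported mean (17 zeros and 13 ones); it is an assumption used for illustration and does not reproduce the row order of the real data.

```r
# Stand-in for crs$dataset$Loan, reconstructed from the mean of 0.4333 (13 of 30).
loan <- rep(c(0, 1), times = c(17, 13))

# Treat the indicator as a labelled factor rather than a number.
loan_f <- factor(loan, levels = c(0, 1), labels = c("No loan", "Loan"))

loan_counts <- table(loan_f)                      # head counts per category
loan_share  <- round(prop.table(loan_counts), 3)  # shares per category
print(loan_counts)
print(loan_share)
```

Passing `loan_counts` to `barplot()` would give a two-bar chart that shows the same split without the misleading density between 0 and 1.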

2.1.7 Mortgage

# Display box plots for the selected variables. 

# Use ggplot2 to generate box plot for Mortgage

# Generate a box plot.

p01 <- crs %>%
  with(dataset[,]) %>%
  ggplot2::ggplot(ggplot2::aes(y=Mortgage)) +
  ggplot2::geom_boxplot(ggplot2::aes(x="All"), notch=TRUE, fill="yellow") +
  ggplot2::stat_summary(ggplot2::aes(x="All"), fun=mean, geom="point", shape=8) +
  ggplot2::xlab("Rattle 2025-Jul-17 12:05:28 smrit") +
  ggplot2::ggtitle("Distribution of Mortgage") +
  ggplot2::theme(legend.position="none")

# Display the plots.

gridExtra::grid.arrange(p01)

Notch went outside hinges
ℹ Do you want `notch = FALSE`?

# Display histogram plots for the selected variables. 

# Use ggplot2 to generate histogram plot for Mortgage

# Generate the plot.

p01 <- crs %>%
  with(dataset[,]) %>%
  dplyr::select(Mortgage) %>%
  ggplot2::ggplot(ggplot2::aes(x=Mortgage)) +
  ggplot2::geom_density(lty=3) +
  ggplot2::xlab("Mortgage\n\nRattle 2025-Jul-17 12:06:19 smrit") +
  ggplot2::ggtitle("Distribution of Mortgage") +
  ggplot2::labs(y="Density")

# Display the plots.

gridExtra::grid.arrange(p01)

Mortgage is also a binary variable. Its mean of 0.4 means that 12 of the 30 customers have a mortgage and 18 do not, so the two groups are reasonably balanced, with non-holders in the majority.

2.2 Data Summary

The dataset describes customers by age, gender, marital status, number of children, income, and financial obligations (loan and mortgage). Ages range from 22 to 66, from young adults to customers approaching retirement, which allows analysis of how age relates to loan ownership or income. Gender is recorded as a binary Female indicator, and the split is fairly balanced, with about 57% of customers female.

2.3 Segmentation using Clustering

Clustering is a method of grouping observations based on their similarities. We use distance measures to assess the dissimilarity among the observations. There are many measures of distance, including Euclidean, Manhattan, Mahalanobis, etc. Similarly, we have different types of algorithms, such as hierarchical, k-means, bi-cluster, etc. We start with the hierarchical method as part of data exploration and then use k-means.
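To make the distance measures concrete, here is a minimal sketch comparing Euclidean and Manhattan distance for two customers, using the Age and Children values of the first two rows shown in the data preview. The object names are illustrative.

```r
# Age and Children for the first two customers in the data preview.
customers <- rbind(
  row1 = c(Age = 48, Children = 1),
  row2 = c(Age = 40, Children = 3)
)

# Euclidean: straight-line distance, sqrt(8^2 + 2^2).
d_euclidean <- dist(customers, method = "euclidean")

# Manhattan: sum of absolute differences, 8 + 2.
d_manhattan <- dist(customers, method = "manhattan")

print(d_euclidean)
print(d_manhattan)
```

On the raw values the Age difference dominates both distances, which is why standardizing the variables before clustering is a common choice when variables are on very different scales (Income even more so).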

2.3.1 Hierarchical

DemoKTC <- read_excel("C:/Users/smrit/OneDrive/Documents/Amrita/DemoKTC.xlsx")
mydata <- scale(DemoKTC) # standardise every variable to mean 0, sd 1
d <- dist(mydata, method = "manhattan") # distance matrix
fit <- hclust(d, method = "ward.D2") # hierarchical clustering (Ward linkage)
plot(fit) # display dendrogram
groups <- cutree(fit, k = 3) # cut tree into 3 clusters
# draw dendrogram with purple borders around the 3 clusters
rect.hclust(fit, k = 3, border = "purple")

Cluster 1 is primarily made up of younger, unmarried individuals with lower incomes, higher loan and mortgage levels, and a larger proportion of females (~67%). Cluster 2 consists of middle-aged, married individuals with moderate incomes, the highest average number of children, average loan and mortgage levels, and a relatively balanced gender distribution (~54% female). Cluster 3 comprises older, married individuals with higher incomes, fewer children, lower loan and mortgage levels, and a comparable female proportion (~55%).
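Cluster profiles like the ones described above can be obtained by averaging each variable within the groups returned by `cutree()`. The sketch below runs the same steps on only the ten rows visible in the data preview (Income as rounded in that display), so its groups will not match the full 30-row result; `demo` and `profile` are illustrative names.

```r
# First ten rows as printed in the tibble preview (not the full dataset).
demo <- data.frame(
  Age      = c(48, 40, 51, 23, 57, 57, 22, 58, 37, 54),
  Female   = c(1, 0, 1, 1, 1, 1, 0, 0, 1, 0),
  Income   = c(17546, 30085, 16575, 20375, 50576, 37870, 8877, 24947, 25304, 24212),
  Married  = c(0, 1, 1, 1, 1, 1, 0, 1, 1, 1),
  Children = c(1, 3, 0, 3, 0, 2, 0, 0, 2, 2),
  Loan     = c(0, 1, 1, 0, 0, 0, 0, 1, 1, 1),
  Mortgage = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0)
)

z      <- scale(demo)                              # standardise each column
fit    <- hclust(dist(z, method = "manhattan"),    # same distance and linkage
                 method = "ward.D2")               # as in the code above
groups <- cutree(fit, k = 3)                       # cluster label per row

# Mean of every variable within each cluster, in original units.
profile <- aggregate(demo, by = list(cluster = groups), FUN = mean)
print(profile)
```

For the indicator columns the cluster means read directly as shares, for example the proportion of female customers in each cluster.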

2.3.2 K-means

# Rattle timestamp: 2025-08-12 22:50:30.932313 x86_64-w64-mingw32 

# Hierarchical Cluster 

# Generate a hierarchical cluster from the numeric data.

crs$dataset[, crs$numeric] %>%
  amap::hclusterpar(method="euclidean", link="ward.D2", nbproc=1) ->
crs$hclust

Cluster

library(factoextra)

Loading required package: ggplot2

Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(cluster)
data <- mydata  # mydata was already standardised with scale() above
fviz_nbclust(data, kmeans, method = "wss")

set.seed(123)  # For reproducibility
km <- kmeans(data, centers = 3, nstart = 25)
fviz_cluster(km, data)

data2 <- as.data.frame(mydata)   # duplicate the standardised data as a data frame
data2$cluster_id <- km$cluster   # append each observation's cluster membership

# Group data2 by cluster_id and compute mean for each group
group_means <- aggregate(. ~ cluster_id, data = data2, FUN = mean)

# Split the standardised data into a list of data frames by cluster_id
grouped_data <- split(data2, data2$cluster_id)

# If we specifically want 3 data sets, we can extract them like this:
data_cluster1 <- grouped_data[[1]]
data_cluster2 <- grouped_data[[2]]
data_cluster3 <- grouped_data[[3]]

# Optionally view the group means
print(group_means)

  cluster_id        Age      Female     Income   Married    Children
1          1 -0.8408520  0.19840997 -0.6987521 -1.966384  0.06169096
2          2 -0.4279950 -0.05596179 -0.5185505  0.491596  0.27523658
3          3  0.9644588 -0.04208696  0.9939700  0.491596 -0.35892921
         Loan   Mortgage
1  0.46295659  0.2006932
2  0.05596179  0.1235035
3 -0.31865844 -0.2554278

Result

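The group means are reported in standardized units (z-scores). Cluster 3 has the highest average age and income, with fewer children and lower loan and mortgage levels than average. Cluster 1 is the youngest group, consists entirely of unmarried customers, and has the lowest income and the highest loan level. Cluster 2 is made up of married customers of slightly below-average age and income, with the highest average number of children.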
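Because the k-means group means are z-scores, they can be mapped back to original units with the centre and scale that `scale()` stores as attributes. The sketch below does this for the Married column only, using a stand-in vector rebuilt from the reported mean of 0.8 (24 ones and 6 zeros); that reconstruction is an assumption made for illustration.

```r
# Stand-in for the Married column, rebuilt from its reported mean of 0.8.
married <- rep(c(1, 0), times = c(24, 6))

z      <- scale(married)             # same standardisation as scale(DemoKTC)
centre <- attr(z, "scaled:center")   # column mean used by scale()
spread <- attr(z, "scaled:scale")    # column standard deviation used by scale()

# Standardised Married means per cluster, copied from print(group_means).
z_means <- c(cluster1 = -1.966384, cluster2 = 0.491596, cluster3 = 0.491596)

# Undo the standardisation: original = z * sd + mean.
married_share <- round(z_means * spread + centre, 3)
print(married_share)
```

Under that assumption the values come out as 0, 1, and 1: no one in cluster 1 is married, and everyone in clusters 2 and 3 is.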

Both the hierarchical clustering results and the elbow method suggest that three clusters are a reasonable choice for this dataset.
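The elbow method plots the total within-cluster sum of squares (WSS) against the number of clusters and looks for the point where further clusters stop reducing it much. The following sketch computes that curve with base R on synthetic data containing three planted groups; the data are generated purely for illustration and are not the KTC dataset.

```r
set.seed(42)  # reproducible synthetic data

# Three well-separated synthetic groups of 20 points each, in two dimensions.
x <- rbind(
  matrix(rnorm(40, mean = 0),  ncol = 2),
  matrix(rnorm(40, mean = 5),  ncol = 2),
  matrix(rnorm(40, mean = 10), ncol = 2)
)

# Total within-cluster sum of squares for k = 1, ..., 6.
wss <- sapply(1:6, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

# Reduction in WSS gained by each additional cluster.
gain <- -diff(wss)
print(round(wss, 1))
print(round(gain, 1))
```

The elbow is the value of k after which the entries of `gain` become small. With only 30 observations the curve can be noisy, so it is worth reading alongside the dendrogram rather than on its own.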