Market segmentation involves breaking a broad market into smaller, more defined groups of consumers who share similar needs, characteristics, or behaviors. This approach enables businesses to target specific audiences more precisely, customize products and marketing strategies, and allocate resources more effectively. By recognizing and meeting the unique requirements of each segment, companies can enhance customer satisfaction, boost sales, and strengthen their competitive edge.
Descriptive Data Mining
Descriptive data mining is the practice of examining data to summarize its key characteristics and reveal patterns or relationships within it. Unlike predictive data mining, which aims to forecast future events, descriptive analysis concentrates on understanding past data. It uncovers trends, groupings, and associations such as customer segments, frequent item combinations, or correlations that offer insights into historical behaviors or conditions. This type of analysis is commonly applied in reporting, informed decision-making, and identifying hidden structures within large datasets.
# Rattle is Copyright (c) 2006-2021 Togaware Pty Ltd.
# It is free (as in libre) open source software.
# It is licensed under the GNU General Public License,
# Version 2. Rattle comes with ABSOLUTELY NO WARRANTY.
# Rattle was written by Graham Williams with contributions
# from others as acknowledged in 'library(help=rattle)'.
# Visit https://rattle.togaware.com/ for details.

#=======================================================================
# Rattle timestamp: 2025-07-15 11:55:07.733592 x86_64-w64-mingw32

# Rattle version 5.5.1 user 'smrit'

# This log captures interactions with Rattle as an R script.

# For repeatability, export this activity log to a
# file, like 'model.R' using the Export button or
# through the Tools menu. The script can then serve as a
# starting point for developing your own scripts.
# After exporting to a file called 'model.R', for example,
# you can type into a new R Console the command
# "source('model.R')" and so repeat all actions. Generally,
# you will want to edit the file to suit your own needs.
# You can also edit this log in place to record additional
# information before exporting the script.

# Note that saving/loading projects retains this log.

# We begin most scripts by loading the required packages.
# Here are some initial packages to load and others will be
# identified as we proceed through the script. When writing
# our own scripts we often collect together the library
# commands at the beginning of the script here.

library(rattle)   # Access the weather dataset and utilities.
Loading required package: tibble
Loading required package: bitops
Rattle: A free graphical interface for data science with R.
Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.
library(magrittr) # Utilise %>% and %<>% pipeline operators.

# This log generally records the process of building a model.
# However, with very little effort the log can also be used
# to score a new dataset. The logical variable 'building'
# is used to toggle between generating transformations,
# when building a model and using the transformations,
# when scoring a dataset.

building <- TRUE
scoring  <- ! building

# A pre-defined value is used to reset the random seed
# so that results are repeatable.

crv$seed <- 42

#=======================================================================
# Rattle timestamp: 2025-07-15 11:55:23.990856 x86_64-w64-mingw32

# Load a dataset from file.

library(readxl, quietly=TRUE)

crs$dataset <- read_excel("C:/Users/smrit/OneDrive/Documents/Amrita/DemoKTC.xlsx", guess_max=1e4)

crs$dataset

#=======================================================================
# Rattle timestamp: 2025-07-15 11:55:24.921638 x86_64-w64-mingw32

# Action the user selections from the Data tab.

# Build the train/validate/test datasets.

# nobs=30 train=21 validate=4 test=5

set.seed(crv$seed)

crs$nobs <- nrow(crs$dataset)

crs$train <- sample(crs$nobs, 0.7*crs$nobs)

crs$nobs %>%
  seq_len() %>%
  setdiff(crs$train) %>%
  sample(0.15*crs$nobs) ->
crs$validate

crs$nobs %>%
  seq_len() %>%
  setdiff(crs$train) %>%
  setdiff(crs$validate) ->
crs$test

# The following variable selections have been noted.

crs$input     <- c("Age", "Female", "Income", "Married",
                   "Children", "Loan")
crs$numeric   <- c("Age", "Female", "Income", "Married",
                   "Children", "Loan")
crs$categoric <- NULL
crs$target    <- "Mortgage"
crs$risk      <- NULL
crs$ident     <- NULL
crs$ignore    <- NULL
crs$weights   <- NULL

#=======================================================================
# Rattle timestamp: 2025-07-15 11:55:28.079238 x86_64-w64-mingw32

# Action the user selections from the Data tab.

# Build the train/validate/test datasets.

# nobs=30 train=21 validate=4 test=5

set.seed(crv$seed)

crs$nobs <- nrow(crs$dataset)

crs$train <- sample(crs$nobs, 0.7*crs$nobs)

crs$nobs %>%
  seq_len() %>%
  setdiff(crs$train) %>%
  sample(0.15*crs$nobs) ->
crs$validate

crs$nobs %>%
  seq_len() %>%
  setdiff(crs$train) %>%
  setdiff(crs$validate) ->
crs$test

# The following variable selections have been noted.

crs$input     <- c("Age", "Female", "Income", "Married",
                   "Children", "Loan", "Mortgage")
crs$numeric   <- c("Age", "Female", "Income", "Married",
                   "Children", "Loan", "Mortgage")
crs$categoric <- NULL
crs$target    <- NULL
crs$risk      <- NULL
crs$ident     <- NULL
crs$ignore    <- NULL
crs$weights   <- NULL

#=======================================================================
# Rattle timestamp: 2025-07-15 12:26:38.223719 x86_64-w64-mingw32

# Action the user selections from the Data tab.

# The following variable selections have been noted.

crs$input     <- c("Age", "Female", "Income", "Married",
                   "Children", "Loan", "Mortgage")
crs$numeric   <- c("Age", "Female", "Income", "Married",
                   "Children", "Loan", "Mortgage")
crs$categoric <- NULL
crs$target    <- NULL
crs$risk      <- NULL
crs$ident     <- NULL
crs$ignore    <- NULL
crs$weights   <- NULL
2.1 Data Exploration
The dataset for KTC Company consists of 30 records and 7 variables, which include Age, Gender, Income, Marital Status, Number of Children, Loan status, and Mortgage status.
library(Hmisc, quietly=TRUE)
Attaching package: 'Hmisc'
The following objects are masked from 'package:base':
format.pval, units
# Obtain a summary of the dataset.

contents(crs$dataset[, c(crs$input, crs$risk, crs$target)])
Data frame:crs$dataset[, c(crs$input, crs$risk, crs$target)] 30 observations and 7 variables Maximum # NAs:0
Storage
Age double
Female double
Income double
Married double
Children double
Loan double
Mortgage double
Age Female Income Married
Min. :22.00 Min. :0.0000 Min. : 8877 Min. :0.0
1st Qu.:37.25 1st Qu.:0.0000 1st Qu.:18166 1st Qu.:1.0
Median :47.00 Median :1.0000 Median :24241 Median :1.0
Mean :45.97 Mean :0.5667 Mean :28012 Mean :0.8
3rd Qu.:56.75 3rd Qu.:1.0000 3rd Qu.:35923 3rd Qu.:1.0
Max. :66.00 Max. :1.0000 Max. :59804 Max. :1.0
Children Loan Mortgage
Min. :0.0000 Min. :0.0000 Min. :0.0
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0
Median :0.5000 Median :0.0000 Median :0.0
Mean :0.9333 Mean :0.4333 Mean :0.4
3rd Qu.:2.0000 3rd Qu.:1.0000 3rd Qu.:1.0
Max. :3.0000 Max. :1.0000 Max. :1.0
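The summary confirms 30 observations with no missing values across age, gender, income, marital status, number of children, loan status, and mortgage status. All seven variables are stored as numeric (double); Female, Married, Loan, and Mortgage are 0/1 indicators.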
2.1.1 Age
# Use ggplot2 to generate box plot for Age

# Generate a box plot.

p01 <- crs %>%
  with(dataset[,]) %>%
  ggplot2::ggplot(ggplot2::aes(y=Age)) +
  ggplot2::geom_boxplot(ggplot2::aes(x="All"), notch=TRUE, fill="navy") +
  ggplot2::stat_summary(ggplot2::aes(x="All"), fun.y=mean, geom="point", shape=8) +
  ggplot2::xlab("Rattle 2025-Jul-15 12:42:43 smrit") +
  ggplot2::ggtitle("Distribution of Age") +
  ggplot2::theme(legend.position="none")
Warning: The `fun.y` argument of `stat_summary()` is deprecated as of ggplot2 3.3.0.
ℹ Please use the `fun` argument instead.
# Display the plots.

gridExtra::grid.arrange(p01)
The box plot shows that there are no outliers in the Age variable.
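The no-outlier reading can be cross-checked with the 1.5 × IQR rule that box plots conventionally use for their whiskers. The sketch below plugs in the Age quartiles and extremes reported in the summary output. The helper name `iqr_fences` is hypothetical, and quartiles from a summary may differ slightly from the hinges a plotting function computes.

```r
# Tukey-style fences: points more than k * IQR beyond the quartiles
# are the ones a box plot draws as outliers.
# iqr_fences is a hypothetical helper, not part of Rattle.
iqr_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Age quartiles and extremes as reported in the summary output
age_fences <- iqr_fences(q1 = 37.25, q3 = 56.75)
age_min <- 22
age_max <- 66

print(age_fences)

# TRUE means both extremes sit inside the fences, i.e. no outliers
no_outliers <- age_min >= age_fences[["lower"]] &&
  age_max <= age_fences[["upper"]]
print(no_outliers)
```

With these figures the fences fall at 8 and 86, so the observed range of 22 to 66 lies well inside them.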
Histogram
# Rattle timestamp: 2025-07-15 12:45:36.104563 x86_64-w64-mingw32

# Display histogram plots for the selected variables.

# Use ggplot2 to generate histogram plot for Age

# Generate the plot.

p01 <- crs %>%
  with(dataset[,]) %>%
  dplyr::select(Age) %>%
  ggplot2::ggplot(ggplot2::aes(x=Age)) +
  ggplot2::geom_density(lty=3) +
  ggplot2::xlab("Age\n\nRattle 2025-Jul-15 12:45:36 smrit") +
  ggplot2::ggtitle("Distribution of Age") +
  ggplot2::labs(y="Density")

# Display the plots.

gridExtra::grid.arrange(p01)
The density plot of the Age variable is slightly left-skewed: the mean (45.97) lies below the median (47.00).
2.1.2 Gender
# Display box plots for the selected variables.

# Use ggplot2 to generate box plot for Female

# Generate a box plot.

p01 <- crs %>%
  with(dataset[,]) %>%
  ggplot2::ggplot(ggplot2::aes(y=Female)) +
  ggplot2::geom_boxplot(ggplot2::aes(x="All"), notch=TRUE, fill="pink") +
  ggplot2::stat_summary(ggplot2::aes(x="All"), fun.y=mean, geom="point", shape=8) +
  ggplot2::xlab("Rattle 2025-Jul-15 12:50:54 smrit") +
  ggplot2::ggtitle("Distribution of Female") +
  ggplot2::theme(legend.position="none")

# Display the plots.

gridExtra::grid.arrange(p01)
Notch went outside hinges
ℹ Do you want `notch = FALSE`?
Histogram
# Rattle timestamp: 2025-08-12 22:53:55.721733 x86_64-w64-mingw32

# Display histogram plots for the selected variables.

# Use ggplot2 to generate histogram plot for Female

# Generate the plot.

p01 <- crs %>%
  with(dataset[,]) %>%
  dplyr::select(Female) %>%
  ggplot2::ggplot(ggplot2::aes(x=Female)) +
  ggplot2::geom_density(lty=3) +
  ggplot2::xlab("Female\n\nRattle 2025-Aug-12 22:53:55 smrit") +
  ggplot2::ggtitle("Distribution of Female") +
  ggplot2::labs(y="Density")

# Display the plots.

gridExtra::grid.arrange(p01)
2.1.3 Income
# Display box plots for the selected variables.

# Use ggplot2 to generate box plot for Income

# Generate a box plot.

p01 <- crs %>%
  with(dataset[,]) %>%
  ggplot2::ggplot(ggplot2::aes(y=Income)) +
  ggplot2::geom_boxplot(ggplot2::aes(x="All"), notch=TRUE, fill="grey") +
  ggplot2::stat_summary(ggplot2::aes(x="All"), fun.y=mean, geom="point", shape=8) +
  ggplot2::xlab("Rattle 2025-Jul-17 11:57:19 smrit") +
  ggplot2::ggtitle("Distribution of Income") +
  ggplot2::theme(legend.position="none")

# Display the plots.

gridExtra::grid.arrange(p01)

# Display histogram plots for the selected variables.

# Use ggplot2 to generate histogram plot for Income

# Generate the plot.

p01 <- crs %>%
  with(dataset[,]) %>%
  dplyr::select(Income) %>%
  ggplot2::ggplot(ggplot2::aes(x=Income)) +
  ggplot2::geom_density(lty=3) +
  ggplot2::xlab("Income\n\nRattle 2025-Jul-17 11:58:37 smrit") +
  ggplot2::ggtitle("Distribution of Income") +
  ggplot2::labs(y="Density")

# Display the plots.

gridExtra::grid.arrange(p01)
The chart indicates that the majority of customers have a moderate income (slightly above $20,000). However, the right-skewed distribution reveals that a portion of customers earn significantly higher incomes.
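The direction of skew can be sanity-checked from the summary statistics alone, since a mean above the median usually points to a longer right tail. The sketch below applies that rule of thumb to the Income and Age figures reported in the summary output. `skew_direction` is a hypothetical helper, and the comparison is a heuristic rather than a formal skewness test.

```r
# Rule of thumb: compare the mean with the median.
# skew_direction is a hypothetical helper for illustration only.
skew_direction <- function(mean_value, median_value) {
  if (mean_value > median_value) {
    "right-skewed"
  } else if (mean_value < median_value) {
    "left-skewed"
  } else {
    "roughly symmetric"
  }
}

# Mean and median as reported in the summary output
income_skew <- skew_direction(mean_value = 28012, median_value = 24241)
age_skew    <- skew_direction(mean_value = 45.97, median_value = 47.00)

print(c(Income = income_skew, Age = age_skew))
```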
2.1.4 Marital Status
# Display box plots for the selected variables.

# Use ggplot2 to generate box plot for Married

# Generate a box plot.

p01 <- crs %>%
  with(dataset[,]) %>%
  ggplot2::ggplot(ggplot2::aes(y=Married)) +
  ggplot2::geom_boxplot(ggplot2::aes(x="All"), notch=TRUE, fill="grey") +
  ggplot2::stat_summary(ggplot2::aes(x="All"), fun.y=mean, geom="point", shape=8) +
  ggplot2::xlab("Rattle 2025-Jul-17 12:00:13 smrit") +
  ggplot2::ggtitle("Distribution of Married") +
  ggplot2::theme(legend.position="none")

# Display the plots.

gridExtra::grid.arrange(p01)

# Display histogram plots for the selected variables.

# Use ggplot2 to generate histogram plot for Married

# Generate the plot.

p01 <- crs %>%
  with(dataset[,]) %>%
  dplyr::select(Married) %>%
  ggplot2::ggplot(ggplot2::aes(x=Married)) +
  ggplot2::geom_density(lty=3) +
  ggplot2::xlab("Married\n\nRattle 2025-Jul-17 12:01:04 smrit") +
  ggplot2::ggtitle("Distribution of Married") +
  ggplot2::labs(y="Density")

# Display the plots.

gridExtra::grid.arrange(p01)
The distribution shows that most customers are married: the mean of the 0/1 Married indicator is 0.8, so 80% of the customers in this dataset are married.
2.1.5 Children
# Display box plots for the selected variables.

# Use ggplot2 to generate box plot for Children

# Generate a box plot.

p01 <- crs %>%
  with(dataset[,]) %>%
  ggplot2::ggplot(ggplot2::aes(y=Children)) +
  ggplot2::geom_boxplot(ggplot2::aes(x="All"), notch=TRUE, fill="purple") +
  ggplot2::stat_summary(ggplot2::aes(x="All"), fun.y=mean, geom="point", shape=8) +
  ggplot2::xlab("Rattle 2025-Jul-17 12:02:05 smrit") +
  ggplot2::ggtitle("Distribution of Children") +
  ggplot2::theme(legend.position="none")

# Display the plots.

gridExtra::grid.arrange(p01)
Notch went outside hinges
ℹ Do you want `notch = FALSE`?
# Display histogram plots for the selected variables.

# Use ggplot2 to generate histogram plot for Children

# Generate the plot.

p01 <- crs %>%
  with(dataset[,]) %>%
  dplyr::select(Children) %>%
  ggplot2::ggplot(ggplot2::aes(x=Children)) +
  ggplot2::geom_density(lty=3) +
  ggplot2::xlab("Children\n\nRattle 2025-Jul-17 12:02:52 smrit") +
  ggplot2::ggtitle("Distribution of Children") +
  ggplot2::labs(y="Density")

# Display the plots.

gridExtra::grid.arrange(p01)
Children takes only the values 0 to 3, with a median of 0.5 and a mean of 0.93, so half of the customers have no children. A first-digit comparison against Benford’s Law shows a high occurrence of the digit ‘2’ and zero frequency for digits ‘4’ through ‘9’, but this mismatch is expected rather than suspicious: Benford’s Law applies to values spanning several orders of magnitude, and a small count variable such as the number of children cannot produce leading digits above its maximum of 3. The deviation therefore reflects the variable’s narrow range and is not evidence that the data are artificially generated.
2.1.6 Loan
# Display box plots for the selected variables.

# Use ggplot2 to generate box plot for Loan

# Generate a box plot.

p01 <- crs %>%
  with(dataset[,]) %>%
  ggplot2::ggplot(ggplot2::aes(y=Loan)) +
  ggplot2::geom_boxplot(ggplot2::aes(x="All"), notch=TRUE, fill="grey") +
  ggplot2::stat_summary(ggplot2::aes(x="All"), fun.y=mean, geom="point", shape=8) +
  ggplot2::xlab("Rattle 2025-Jul-17 12:03:45 smrit") +
  ggplot2::ggtitle("Distribution of Loan") +
  ggplot2::theme(legend.position="none")

# Display the plots.

gridExtra::grid.arrange(p01)
Notch went outside hinges
ℹ Do you want `notch = FALSE`?
# Display histogram plots for the selected variables.

# Use ggplot2 to generate histogram plot for Loan

# Generate the plot.

p01 <- crs %>%
  with(dataset[,]) %>%
  dplyr::select(Loan) %>%
  ggplot2::ggplot(ggplot2::aes(x=Loan)) +
  ggplot2::geom_density(lty=3) +
  ggplot2::xlab("Loan\n\nRattle 2025-Jul-17 12:04:35 smrit") +
  ggplot2::ggtitle("Distribution of Loan") +
  ggplot2::labs(y="Density")

# Display the plots.

gridExtra::grid.arrange(p01)
The chart displays a bimodal distribution for the ‘Loan’ variable, with two distinct density peaks. The larger peak near 0.0 indicates a high concentration of customers without a loan, while a smaller peak around 1.0 represents those with an active loan. The lowest density occurs around 0.5, showing few observations between these extremes. This pattern suggests that ‘Loan’ is essentially a binary or categorical variable, with a majority of customers falling into the “no loan” category.
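Because Loan only takes the values 0 and 1, a frequency table may describe it more directly than a density curve. The sketch below uses a stand-in vector with 13 ones and 17 zeros. That split is an assumption reconstructed from the reported mean of 0.4333 over 30 records; it is not the original data.

```r
# Stand-in for the Loan column (assumption: 13 of 30 customers have a loan,
# which reproduces the reported mean of 0.4333). Not the original records.
loan <- rep(c(0, 1), times = c(17, 13))

# Count each category and convert the counts to shares
loan_counts <- table(factor(loan, levels = c(0, 1),
                            labels = c("No loan", "Loan")))
loan_share <- round(prop.table(loan_counts), 3)

print(loan_counts)
print(loan_share)
```

The same pattern would apply to the other 0/1 indicators (Female, Married, Mortgage).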
2.1.7 Mortgage
# Display box plots for the selected variables.

# Use ggplot2 to generate box plot for Mortgage

# Generate a box plot.

p01 <- crs %>%
  with(dataset[,]) %>%
  ggplot2::ggplot(ggplot2::aes(y=Mortgage)) +
  ggplot2::geom_boxplot(ggplot2::aes(x="All"), notch=TRUE, fill="yellow") +
  ggplot2::stat_summary(ggplot2::aes(x="All"), fun.y=mean, geom="point", shape=8) +
  ggplot2::xlab("Rattle 2025-Jul-17 12:05:28 smrit") +
  ggplot2::ggtitle("Distribution of Mortgage") +
  ggplot2::theme(legend.position="none")

# Display the plots.

gridExtra::grid.arrange(p01)
Notch went outside hinges
ℹ Do you want `notch = FALSE`?
# Display histogram plots for the selected variables.

# Use ggplot2 to generate histogram plot for Mortgage

# Generate the plot.

p01 <- crs %>%
  with(dataset[,]) %>%
  dplyr::select(Mortgage) %>%
  ggplot2::ggplot(ggplot2::aes(x=Mortgage)) +
  ggplot2::geom_density(lty=3) +
  ggplot2::xlab("Mortgage\n\nRattle 2025-Jul-17 12:06:19 smrit") +
  ggplot2::ggtitle("Distribution of Mortgage") +
  ggplot2::labs(y="Density")

# Display the plots.

gridExtra::grid.arrange(p01)
Mortgage is also a 0/1 indicator. Its mean of 0.4 means that 40% of customers (12 of 30) hold a mortgage and 60% do not, so the two groups are reasonably balanced.
2.2 Data summarizing
The dataset describes customers by age, gender, marital status, number of children, income, and financial obligations (loan and mortgage). Ages range from 22 to 66, from young adults to individuals approaching retirement, which allows analysis of how age relates to loan ownership or income. Gender is recorded as a binary indicator (Female) and is reasonably balanced, with about 57% of customers being female.
2.3 Segmentation using Clustering
Clustering is a method of grouping observations based on their similarities. We use distance measures to assess the dissimilarity among the observations. There are many measures of distance, including Euclidean, Manhattan, Mahalanobis, etc. Similarly, we have different types of algorithms, such as hierarchical, k-means, bi-cluster, etc. We start with the hierarchical method as part of data exploration and then use k-means.
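To make the distance measures concrete, the sketch below compares Euclidean and Manhattan distance on two hypothetical points using base R's `dist()`. The coordinates are made up for illustration and are not taken from the KTC data.

```r
# Two hypothetical observations with two features each
pts <- rbind(a = c(0, 0),
             b = c(3, 4))

# Euclidean: straight-line distance, sqrt(3^2 + 4^2) = 5
euclid <- dist(pts, method = "euclidean")

# Manhattan: sum of absolute differences, |3| + |4| = 7
manhat <- dist(pts, method = "manhattan")

print(euclid)
print(manhat)
```

Variables measured on very different scales, such as Income next to a 0/1 indicator, tend to dominate either measure unless the data are standardized first, which is why `scale()` is commonly applied before clustering.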
2.3.1 Hierarchical
DemoKTC <- read_excel("C:/Users/smrit/OneDrive/Documents/Amrita/DemoKTC.xlsx")
mydata <- scale(DemoKTC)                 # standardize every variable
d <- dist(mydata, method = "manhattan")  # distance matrix
fit <- hclust(d, method = "ward.D2")     # clustering
plot(fit)                                # display dendrogram
groups <- cutree(fit, k = 3)             # cut tree into 3 clusters
# draw dendrogram with purple borders around the 3 clusters
rect.hclust(fit, k = 3, border = "purple")
Cluster 1 is primarily made up of younger, unmarried individuals with lower incomes, higher loan and mortgage levels, and a larger proportion of females (~67%). Cluster 2 consists of middle-aged, married individuals with moderate incomes, the highest average number of children, average loan and mortgage levels, and a relatively balanced gender distribution (~54% female). Cluster 3 comprises older, married individuals with higher incomes, fewer children, lower loan and mortgage levels, and a comparable female proportion (~55%).
K-means
# Rattle timestamp: 2025-08-12 22:50:30.932313 x86_64-w64-mingw32

# Hierarchical Cluster

# Generate a hierarchical cluster from the numeric data.

crs$dataset[, crs$numeric] %>%
  amap::hclusterpar(method="euclidean", link="ward.D2", nbproc=1) ->
crs$hclust
Cluster
library(factoextra)
Loading required package: ggplot2
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# 'data' is not defined earlier in this log. It is assumed here to be the
# scaled matrix 'mydata' from the hierarchical step.
data <- mydata

set.seed(123) # For reproducibility
km <- kmeans(data, centers = 3, nstart = 25)
fviz_cluster(km, data)

data2 <- data                # duplicating the data
data2$cluster <- km$cluster  # writing the cluster membership into the data
Warning in data2$cluster <- km$cluster: Coercing LHS to a list
data2 <- mydata                             # duplicating the data
cluster_id <- as.vector(unlist(km$cluster)) # extracting the cluster membership
data2 <- as.data.frame(cbind(data2, cluster_id))

# Group data2 by cluster_id and compute mean for each group
group_means <- aggregate(. ~ cluster_id, data = data2, FUN = mean)

# Split the original data into a list of data frames by cluster_id
grouped_data <- split(data2, data2$cluster_id)

# If we specifically want 3 data sets, we can extract them like this:
data_cluster1 <- grouped_data[[1]]
data_cluster2 <- grouped_data[[2]]
data_cluster3 <- grouped_data[[3]]

# Optionally view the group means
print(group_means)
  cluster_id        Age      Female     Income   Married    Children        Loan   Mortgage
1          1 -0.8408520  0.19840997 -0.6987521 -1.966384  0.06169096  0.46295659  0.2006932
2          2 -0.4279950 -0.05596179 -0.5185505  0.491596  0.27523658  0.05596179  0.1235035
3          3  0.9644588 -0.04208696  0.9939700  0.491596 -0.35892921 -0.31865844 -0.2554278
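The group means are in standardized units because the clustering ran on the output of `scale()`. One way to read such values in original units is to undo the scaling with the attributes that `scale()` stores on its result. The sketch below shows this on a small hypothetical data frame, not on the KTC data.

```r
# Hypothetical values for illustration only
toy <- data.frame(Age    = c(25, 40, 60),
                  Income = c(12000, 25000, 50000))

z <- scale(toy)  # centre each column and divide by its standard deviation

# scale() keeps the column means and standard deviations as attributes
centers <- attr(z, "scaled:center")
spreads <- attr(z, "scaled:scale")

# Reverse the transformation: original = z * sd + mean, column by column
restored <- sweep(sweep(z, 2, spreads, "*"), 2, centers, "+")

print(restored)
```

Applying the same two `sweep()` calls to a table of standardized cluster means would express each cluster's profile in years, currency units, and proportions.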
Result
The third cluster has the highest average age (standardized mean 0.96), while the first cluster is the youngest (standardized mean -0.84).
Based on the hierarchical clustering results and the elbow method, three clusters appear to be a suitable choice for this dataset.
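A minimal sketch of the elbow method follows, assuming simulated stand-in data with three well-separated groups rather than the KTC records. The total within-cluster sum of squares is computed for several values of k, and the k at which the curve flattens is taken as a candidate number of clusters. The object names `toy` and `wss` are illustrative.

```r
set.seed(123)  # for reproducibility

# Simulated stand-in data (assumption): three groups of 20 points each
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 6, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is where adding another cluster stops reducing wss by much
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

On real data the bend is often less sharp than in this simulated case, so the elbow is usually read alongside other evidence such as the dendrogram.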