Market segmentation is one of the most valued marketing techniques. It can help define corporate strategy and inform top-level business decisions. However, carrying out an adequate market segmentation has proven difficult for small and medium-sized companies, both in terms of data supply and in terms of available techniques and local know-how.
With the advent of a society and business environment that thrives on data and information, the first of these two obstacles is slowly being eroded. Organizations of any size now have the technology and capacity to gather, classify and store immense amounts of data about their customers, their interactions with the company and their historical transactions. While the data barrier has been slowly eroding, the capacity to use and benefit from that data remains scarce. The reasons are multiple and fall beyond the scope of this article, but competition for talented data scientists and information analysts is fierce, and the best of them usually go to the companies with the greatest financial prowess, making it difficult for smaller companies to compete for those resources.
In this paper, a strategy to segment customers for a publicly available data set will be presented.
Segmenting a customer base into multiple categories contributes to a better understanding of the market and supports several aspects of market strategy: relational marketing, the identification of different communication and promotional strategies, and product and sales differentiation. A better understanding of the market can ultimately increase customer loyalty and customer satisfaction, enabling market growth and better long-term results.
The “bank data set” is a publicly available collection of records representing a hypothetical sample drawn from a larger customer data set; the customers are clients of an unidentified large Japanese bank.
The data set contains 600 records (one for each customer) and, besides a customer identifier, 11 attributes or variables for each customer. The variables range from demographic data to attributes that identify the actual use of different bank products by those customers. The general characteristics of the data set are described in the table below.
Descriptive Summary of the Bank Data
##       id            age            sex           region
##  ID12101:  1   Min.   :18.00   FEMALE:300   INNER_CITY:269
##  ID12102:  1   1st Qu.:30.00   MALE  :300   RURAL     : 96
##  ID12103:  1   Median :42.00                SUBURBAN  : 62
##  ID12104:  1   Mean   :42.40                TOWN      :173
##  ID12105:  1   3rd Qu.:55.25
##  ID12106:  1   Max.   :67.00
##  (Other):594
##     income       married     children        car     save_act   current_act
##  Min.   : 5014   NO :204   Min.   :0.000   NO :304   NO :186    NO :145
##  1st Qu.:17265   YES:396   1st Qu.:0.000   YES:296   YES:414    YES:455
##  Median :24925             Median :1.000
##  Mean   :27524             Mean   :1.012
##  3rd Qu.:36173             3rd Qu.:2.000
##  Max.   :63130             Max.   :3.000
##  mortgage     pep
##  NO :391     NO :326
##  YES:209     YES:274
Summary MultiPlot of the Bank data
For this paper, R, an open-source environment for statistical computing, will be used to carry out the data processing and marketing analytics.
R and almost all current statistical software for machine learning and data mining include different algorithms that can be used for market segmentation. The general technique of separating a whole market into groups that share a common subset of attributes and characteristics is known as clustering. Among the most popular algorithms and techniques are k-means clustering, k-medoids clustering, hierarchical clustering and density-based clustering. The specific characteristics and technicalities of those algorithms won't be discussed here, but a large body of literature about them is available on the internet.
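As a rough illustration of what clustering does, the sketch below runs k-means on two numeric attributes. The data frame `customers` is synthetic and purely hypothetical (it is not the bank data), and the choice of two clusters and of standardising the variables are assumptions made for the example.

```r
# Synthetic stand-in for two numeric customer attributes (NOT the bank data).
set.seed(42)
customers <- data.frame(
  age    = c(rnorm(50, mean = 30, sd = 4), rnorm(50, mean = 55, sd = 4)),
  income = c(rnorm(50, mean = 18000, sd = 2500),
             rnorm(50, mean = 45000, sd = 4000))
)

# Standardise so that income (large values) does not dominate age
# in the Euclidean distance used by k-means.
scaled <- scale(customers)

# k-means with two clusters; nstart repeats the random initialisation
# and keeps the best solution.
km <- kmeans(scaled, centers = 2, nstart = 25)

# Segment sizes and the mean profile of each segment on the original scale.
table(km$cluster)
aggregate(customers, by = list(segment = km$cluster), FUN = mean)
```

Each row of the aggregated table is a segment profile, which is the kind of description used later in this document to characterise the bank's customer segments.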
The following part of the document presents the use of the k-means and k-medoids algorithms to cluster the bank data set.
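The exact calls behind the output that follows are not shown in the source. The sketch below is one plausible way to run k-medoids on mixed data with `pam()` from the `cluster` package, using a small synthetic data frame (`bank` here is a hypothetical stand-in, not the real file). Option A replaces factors by their integer codes, which is an assumption suggested by the numeric codes in the medoid table; Option B uses a Gower dissimilarity and is offered only as a common alternative for mixed numeric and categorical columns.

```r
library(cluster)  # provides pam() and daisy()

# Hypothetical stand-in for the bank data: a few mixed-type columns.
set.seed(1)
n <- 120
bank <- data.frame(
  age      = sample(18:67, n, replace = TRUE),
  sex      = factor(sample(c("FEMALE", "MALE"), n, replace = TRUE)),
  income   = round(runif(n, 5000, 63000), 1),
  married  = factor(sample(c("NO", "YES"), n, replace = TRUE)),
  children = sample(0:3, n, replace = TRUE)
)

# Option A: factors replaced by their integer codes (1, 2, ...).
fit_codes <- pam(data.matrix(bank), k = 6)

# Option B: Gower dissimilarity, which handles numeric and factor columns.
d <- daisy(bank, metric = "gower")
fit_gower <- pam(d, k = 6)  # a dissimilarity object is detected automatically

fit_codes$medoids            # one representative row per cluster
table(fit_gower$clustering)  # cluster sizes under the Gower variant
```

With Option A on unscaled data, a wide-ranging variable such as income tends to dominate the distances, so the two options can produce quite different segments.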
## 'data.frame': 600 obs. of 10 variables:
## $ age : int 48 40 51 23 57 57 22 58 37 54 ...
## $ sex : Factor w/ 2 levels "FEMALE","MALE": 1 2 1 1 1 1 2 2 1 2 ...
## $ region : Factor w/ 4 levels "INNER_CITY","RURAL",..: 1 4 1 4 2 4 2 4 3 4 ...
## $ income : num 17546 30085 16575 20375 50576 ...
## $ married : Factor w/ 2 levels "NO","YES": 1 2 2 2 2 2 1 2 2 2 ...
## $ children : int 1 3 0 3 0 2 0 0 2 2 ...
## $ car : Factor w/ 2 levels "NO","YES": 1 2 2 1 1 1 1 2 2 2 ...
## $ save_act : Factor w/ 2 levels "NO","YES": 1 1 2 1 2 2 1 2 1 2 ...
## $ current_act: Factor w/ 2 levels "NO","YES": 1 2 2 2 1 2 2 2 1 2 ...
## $ mortgage : Factor w/ 2 levels "NO","YES": 1 2 1 1 1 1 1 1 1 1 ...
## Warning: package 'cluster' was built under R version 3.2.5
## Medoids:
##       ID age sex region  income married children car save_act current_act mortgage
## [1,] 376  41   2      1 20866.3       2        0   2        2           2        1
## [2,] 501  39   1      1 27765.8       2        3   2        2           1        1
## [3,] 457  32   1      4 13267.6       2        0   2        2           2        2
## [4,]  75  64   1      1 52674.0       1        2   2        2           2        1
## [5,] 443  43   1      2 36281.0       2        0   2        2           2        1
## [6,] 447  52   1      4 43719.5       2        0   1        2           2        1
## Clustering vector:
## [1] 1 2 3 1 4 5 3 2 2 1 4 2 3 4 1 1 1 6 2 1 4 3 5 3 3 6 1 1 2 1 1 3 1 2 5
## [36] 1 3 3 1 2 2 2 4 2 3 4 3 3 5 1 3 1 3 2 4 6 2 1 3 2 2 5 2 3 3 1 3 5 1 3
## [71] 3 3 2 1 4 5 2 4 3 5 1 2 3 3 1 2 3 1 3 2 2 2 5 6 6 3 3 2 3 3 2 3 3 5 5
## [106] 5 6 3 3 1 5 2 5 5 6 6 1 3 1 6 2 4 1 5 6 5 2 6 3 1 1 6 3 5 4 2 3 5 6 4
## [141] 5 1 1 4 1 4 1 1 2 2 2 5 4 2 1 3 2 1 1 3 1 3 1 3 2 1 3 3 2 5 4 1 2 2 6
## [176] 4 5 2 3 5 1 4 5 1 2 5 1 5 5 1 4 3 4 3 5 1 1 4 3 1 2 6 3 1 3 3 4 3 5 1
## [211] 2 5 3 3 3 1 3 2 2 2 2 6 5 6 4 2 4 2 2 3 2 2 3 2 6 3 4 1 1 3 3 4 5 1 2
## [246] 5 3 1 1 1 4 5 3 1 3 1 2 1 1 3 1 2 1 6 1 1 5 2 5 3 4 4 3 2 2 3 2 1 2 5
## [281] 1 3 4 5 3 3 2 2 3 5 3 5 6 1 1 5 2 3 1 3 3 3 2 6 4 3 4 2 2 3 2 5 2 3 1
## [316] 6 1 5 1 1 5 1 3 1 2 3 5 1 4 1 4 3 2 1 1 2 2 5 5 1 2 4 4 1 1 6 5 6 5 1
## [351] 2 2 2 6 2 4 1 3 5 1 4 1 2 3 4 1 2 6 3 5 6 5 3 2 6 1 5 2 1 3 2 5 3 5 2
## [386] 5 3 2 3 3 6 1 5 1 2 1 2 5 4 5 1 1 5 1 3 5 6 1 1 3 1 3 4 3 6 1 3 1 2 1
## [421] 4 6 3 6 1 2 2 5 1 2 2 4 3 2 1 3 3 2 3 3 2 6 5 5 3 2 6 2 4 5 2 5 2 4 3
## [456] 3 3 4 3 6 1 2 1 3 1 2 1 3 3 5 6 2 1 2 1 3 2 2 1 6 3 6 5 3 1 4 1 3 3 1
## [491] 2 1 2 6 6 5 6 6 6 1 2 5 2 3 4 2 1 3 4 3 3 3 6 3 5 3 1 5 1 2 3 1 3 3 1
## [526] 1 3 3 2 6 4 2 2 2 3 3 5 4 5 3 1 5 3 4 1 5 3 3 1 4 3 5 6 5 5 3 1 2 2 2
## [561] 3 1 4 1 1 2 1 1 2 1 5 5 2 1 2 2 1 6 3 2 4 3 1 2 1 3 1 2 4 3 1 2 1 4 2
## [596] 6 3 3 3 2
## Objective function:
## build swap
## 2244.399 2128.571
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
Clustering results
The results of the clustering algorithm (k-medoids clustering using Partitioning Around Medoids (PAM), as implemented by pam() in the R cluster package) include statistics for each cluster, among them its medoid: the actual customer record that is most central to the cluster. Unlike the centroids produced by k-means, medoids are observed records rather than mean vectors. The medoids are used to describe each cluster. When PAM was run in R, the number of clusters was set to 6.
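To make the distinction between a medoid and a mean vector concrete, here is a minimal sketch on synthetic two-dimensional data (the matrix `x` is hypothetical and unrelated to the bank data). It relies on the `id.med`, `medoids`, `clustering` and `clusinfo` components listed in the output above.

```r
library(cluster)

# Two synthetic groups of 20 points each in two dimensions.
set.seed(7)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))

fit <- pam(x, k = 2)

# Medoids are actual rows of x; id.med holds their row numbers.
medoid_rows <- x[fit$id.med, , drop = FALSE]

# Cluster means (centroids) are computed averages and are generally
# not equal to any observed row.
centroids <- apply(x, 2, function(col) tapply(col, fit$clustering, mean))

fit$medoids
centroids
fit$clusinfo  # size, max and average dissimilarity, diameter, separation
```

Because each medoid is a real record, it can be read as a "typical customer" of its cluster, whereas a mean vector may describe a customer who does not exist.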
The medoid for cluster 1 indicates a market segment typified by a middle-aged (age 41), married male customer living in the inner city with an income of approximately $20,866. The factor codes in the medoid table map to the levels shown in the data summary (for example, sex 2 = MALE, married 2 = YES, mortgage 1 = NO). This representative customer has no children, owns a vehicle, and uses both chequing and savings accounts but does not carry a mortgage. The other 5 segments calculated by the model can be described in a similar manner.
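The reading of the first medoid can be checked by mapping its integer codes back to factor levels. The sketch below hard-codes the levels as they appear in the data summary and the codes from row [1,] of the medoid table; the helper names (`levels_by_column`, `medoid1`, `decoded`) are introduced only for this illustration.

```r
# Factor levels in the order reported by the data summary.
yes_no <- c("NO", "YES")
levels_by_column <- list(
  sex         = c("FEMALE", "MALE"),
  region      = c("INNER_CITY", "RURAL", "SUBURBAN", "TOWN"),
  married     = yes_no,
  car         = yes_no,
  save_act    = yes_no,
  current_act = yes_no,
  mortgage    = yes_no
)

# Integer codes of the categorical columns in medoid row [1,].
medoid1 <- c(sex = 2, region = 1, married = 2, car = 2,
             save_act = 2, current_act = 2, mortgage = 1)

# Look up the label for each code, column by column.
decoded <- mapply(function(lv, code) lv[code],
                  levels_by_column[names(medoid1)], medoid1)
decoded
```

The same lookup applied to rows [2,] to [6,] yields the profiles of the remaining five segments.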
The last part of the output lists the cluster to which each of the 600 customers has been assigned; e.g., customer #1 has been assigned to cluster #1 (typified by a married male in the inner city who owns a vehicle and has no children). This vector of values will be used to present a graphical analysis of the cluster separation.
http://rpubs.com/MauVas