This example uses data related to the direct marketing campaigns of a Portuguese banking institution. The campaigns were conducted by phone; often more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed (‘yes’) or not (‘no’). For additional details on this dataset, please refer to: http://mlr.cs.umass.edu/ml/datasets/Bank+Marketing
In direct marketing campaigns (phone calls, direct mail, direct email, and so on), marketing managers usually rely on data from past campaigns to build predictive models that then support upcoming targeted efforts. The goal is to find out which features (demographics, transactional activity, contextual data, etc.) have the greatest impact on the probability of a customer accepting a given offer. Once a model has been built, new or existing customers (who do not already hold the product in question) are scored and assigned a probability of taking the offer. Because the marketing budget is a scarce resource, marketing managers must optimize it by targeting customers who combine a high probability of accepting the offer with a large potential (expected) return. By doing so, the expected profit from the campaign (the expected return once the customer opens the given service or financial product, minus the cost of the campaign) is also maximized.
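As a rough, purely illustrative sketch of that calculation (all numbers below are hypothetical and are not taken from this dataset):

# Hypothetical illustration: expected profit of contacting one customer
p_accept <- 0.08   # model-estimated probability that the customer subscribes
value    <- 120    # assumed return once the term deposit is opened
cost     <- 5      # assumed cost of one phone contact
expected_profit <- p_accept * value - cost
expected_profit  # target the customers for whom this quantity is positive and largest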
Let's start by loading the data. We also use this code chunk to load the libraries needed for this project. Note that the data set must already have been downloaded from the link provided above and placed in the current working directory.
library(caret); library(ggplot2)
library(plyr); library(dplyr)
library(pander)
bank <- read.csv("bank-full.csv",
                 header = TRUE,
                 sep = ";")
The data set contains bank client attributes (such as age, job, marital status, education, default status, balance, and housing/personal loans) together with variables describing the marketing campaign contacts.
Our objective is to build a clustering scheme for the current customer base, including all bank client data and using the current campaign data as potential discriminating variables.
# Keep only customers contacted by cellular or telephone, and drop unused factor levels
bank <- subset(bank, contact %in% c("cellular", "telephone"))
bank$contact <- as.factor(as.character(bank$contact))
# Balance
p <- qplot(balance, data = bank,
           geom = "density", alpha = I(1/2))
p <- p + ggtitle("Density of Balance")
p
The balance column does not appear to follow a normal distribution; a number of outliers add noise to the variable, so we need to remove those values.
Using the interquartile rule, we exclude all values above the third quartile plus 1.5 times the IQR and below the first quartile minus 1.5 times the IQR, i.e. outside the interval (-2067.5, 3664.5).
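A minimal sketch of how these cutoffs are obtained (computed on bank before the filtering below; exact values may shift slightly if earlier steps change):

q <- quantile(bank$balance, probs = c(0.25, 0.75))  # first and third quartiles
iqr <- unname(q[2] - q[1])                          # interquartile range
lower <- unname(q[1]) - 1.5 * iqr
upper <- unname(q[2]) + 1.5 * iqr
c(lower, upper)  # should reproduce the (-2067.5, 3664.5) cutoffs quoted above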
bank <- subset(bank, balance > -2067.5 & balance < 3664.5)
p <- qplot(balance, data = bank,
           geom = "density", alpha = I(1/2))
p <- p + ggtitle("Density of Balance")
p
We now see a more Gaussian-looking shape in the distribution of balance.
We now eliminate the columns that we don't want to include in the analysis (all marketing campaign columns).
# Keep only the bank client attributes
bank.cluster <- select(bank,
                       age, job, marital, education, default, balance,
                       housing, loan)
In this case, we will use only the two real-valued columns (age and balance) as clustering features.
p <- ggplot(bank.cluster, aes(x=age, y=balance))
p <- p + geom_point()
p
Let's standardize those two columns by subtracting their respective means and dividing by their standard deviations.
feat.scaled <- scale(bank.cluster[,c("age","balance")])
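For reference, scale() here is simply subtracting each column's mean and dividing by its standard deviation; a quick check on the age column confirms this:

# Manual standardization of age should match the scale() output
manual.age <- (bank.cluster$age - mean(bank.cluster$age)) / sd(bank.cluster$age)
all.equal(manual.age, as.numeric(feat.scaled[, "age"]))  # should be TRUE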
We first try the clustering algorithm (k-means) with k=4, meaning that we want 4 different clusters or customer segments.
set.seed(15555)
pclusters <- kmeans(feat.scaled, 4, nstart=20, iter.max=100)
groups <- pclusters$cluster
clusterDF <- cbind(as.data.frame(feat.scaled), Cluster=as.factor(groups))
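Before profiling, it can help to visualize the segments on the two standardized features; a quick sketch reusing the clusterDF data frame built above:

p <- ggplot(clusterDF, aes(x = age, y = balance, color = Cluster))
p <- p + geom_point(alpha = 1/2)
p <- p + ggtitle("Customer Segments (k = 4, standardized features)")
p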
We now profile the 4 clusters obtained. The table below reports, for each cluster, the mean age and mean balance, together with the percentage of customers falling into each of the listed job, education, marital status, default, and loan categories.
| | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 |
|---|---|---|---|---|
| Age | 51.52 | 35.17 | 56.19 | 33.41 |
| Balance | 306 | 2066 | 2199 | 282.4 |
| Administrative | 10.45 | 10.36 | 9.186 | 13.05 |
| Entrepreneur | 3.658 | 3.12 | 2.887 | 3.104 |
| Management | 20.15 | 28.03 | 20.6 | 22.43 |
| Services | 7.861 | 7.539 | 5.731 | 10.09 |
| Student | 0.02368 | 3.844 | 0 | 3.932 |
| Unemployed | 3.149 | 3.32 | 3.412 | 2.836 |
| Retired | 11.31 | 0.09985 | 22.88 | 0.1556 |
| Educ.Tertiary | 24.91 | 41.19 | 27.91 | 34.3 |
| Married | 73.46 | 52.9 | 79.31 | 49.29 |
| Defaulted | 2.356 | 0.09985 | 0.2187 | 2.192 |
| Housing.Loan | 41.81 | 53.37 | 31.63 | 56.16 |
| Personal.Loan | 20.18 | 11.68 | 11.94 | 17.74 |
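These profiles can be assembled by attaching the cluster labels back to the unscaled customer attributes and summarising each segment. A minimal sketch of that step follows; the factor level names (e.g. "retired", "tertiary", "yes") are assumed to match the raw data, and only some of the job categories are shown:

# Attach cluster labels to the unscaled attributes and summarise each segment;
# categorical columns are expressed as the percentage of customers in the segment
profile <- bank.cluster %>%
  mutate(Cluster = factor(groups)) %>%
  group_by(Cluster) %>%
  summarise(Age           = mean(age),
            Balance       = mean(balance),
            Retired       = 100 * mean(job == "retired"),
            Student       = 100 * mean(job == "student"),
            Educ.Tertiary = 100 * mean(education == "tertiary"),
            Married       = 100 * mean(marital == "married"),
            Defaulted     = 100 * mean(default == "yes"),
            Housing.Loan  = 100 * mean(housing == "yes"),
            Personal.Loan = 100 * mean(loan == "yes"))
pander(profile)

We now repeat the clustering with k=5.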
set.seed(15555)
pclusters <- kmeans(feat.scaled, 5, nstart=20, iter.max=100)
groups <- pclusters$cluster
clusterDF <- cbind(as.data.frame(feat.scaled), Cluster=as.factor(groups))
We now profile the 5 clusters obtained in the second run. The results are reported in the following table, using the same layout as before.
| | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 |
|---|---|---|---|---|---|
| Age | 34.6 | 31.54 | 57.19 | 54.44 | 43.6 |
| Balance | 2073 | 287.7 | 364.2 | 2305 | 292.9 |
| Administrative | 10.38 | 13.37 | 8.348 | 9.768 | 12.11 |
| Entrepreneur | 3.129 | 2.734 | 3.199 | 3.167 | 4.13 |
| Management | 28.22 | 22.5 | 17.76 | 21.81 | 22.24 |
| Services | 7.515 | 10.2 | 6.352 | 6.066 | 9.411 |
| Student | 4.012 | 4.977 | 0 | 0.0446 | 0.1738 |
| Unemployed | 3.316 | 2.87 | 3.085 | 3.568 | 2.968 |
| Retired | 0.107 | 0.09083 | 23.48 | 18.11 | 0.655 |
| Educ.Tertiary | 41.83 | 35.7 | 23.93 | 28.72 | 27.39 |
| Married | 51.73 | 44.85 | 76.09 | 78.72 | 68.39 |
| Defaulted | 0.08024 | 2.171 | 1.996 | 0.1338 | 2.473 |
| Housing.Loan | 53.2 | 54.82 | 32.49 | 34.7 | 54.81 |
| Personal.Loan | 11.63 | 17.69 | 18.83 | 12.58 | 19.5 |
Which value of k would you recommend using, and why? Fully explain your selection.
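To support that choice, one common heuristic is the elbow method: run k-means for a range of k values and look for the point where the total within-cluster sum of squares stops dropping sharply. A sketch of that check on the same standardized features:

# Total within-cluster sum of squares for k = 2..10 (elbow method)
set.seed(15555)
wss <- sapply(2:10, function(k)
  kmeans(feat.scaled, k, nstart = 20, iter.max = 100)$tot.withinss)
qplot(2:10, wss, geom = c("point", "line"),
      xlab = "Number of clusters k",
      ylab = "Total within-cluster sum of squares")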