This example uses data related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to assess if the product (bank term deposit) would be subscribed (‘yes’) or not (‘no’). For additional details on this dataset, please refer to: http://mlr.cs.umass.edu/ml/datasets/Bank+Marketing

In direct marketing campaigns (phone calls, direct mail, direct email…), marketing managers usually rely on data from past campaigns in order to build predictive models that are then used to support upcoming targeted efforts. The goal is to find out which features (demographics, transactional activity, contextual data, etc.) have the greatest impact on the probability of a customer accepting a given offer. Once a model has been built, new or existing customers (that don’t have the given product) are scored and assigned with a probability of taking the offer. Due to the fact that marketing dollars are limited (budget is a scarce resource), marketing managers must optimize their budgets by “hitting” customers with a high probability of accepting an offer and also with a large potential (expected return). By doing so, the expected profit from the campaign (cost of campaign minus the expected return once the customer agrees to open the given service or financial product) is also maximized.

Loading the Data

Let´s start by loading our data. We use this code chunk also to require the libraries needed for this project. Notice that the data set must be already downloaded from the link provided above, and located in the current working directory.

library(caret); library(ggplot2)
library(plyr); library(dplyr)
library(pander)
bank <- read.csv("bank-full.csv",
                 header=T,
                 sep=";")

This data set contains the following columns:

Bank client data:

  1. age (numeric)
  2. job: type of job (categorical: ‘admin.’, ‘blue-collar’, ‘entrepreneur’, ‘housemaid’, ‘management’, ‘retired’, ‘self-employed’, ‘services’, ‘student’, ‘technician’, ‘unemployed’, ‘unknown’)
  3. marital: marital status (categorical: ‘divorced’, ‘married’, ‘single’, ‘unknown’; note: ‘divorced’ means divorced or widowed)
  4. education (categorical: ‘basic.4y’, ‘basic.6y’, ‘basic.9y’, ‘high.school’, ‘illiterate’, ‘professional.course’, ‘university.degree’, ‘unknown’)
  5. default: has credit in default? (categorical: ‘no’, ‘yes’, ‘unknown’)
  6. housing: has housing loan? (categorical: ‘no’, ‘yes’, ‘unknown’)
  7. loan: has personal loan? (categorical: ‘no’, ‘yes’, ‘unknown’)

Objective

Our objective is to build clustering scheme for the current customer base, including all bank client data, and using the current campaign data as potential discrimination variables.

Data Preparation

bank <- subset(bank, contact %in% c("cellular", "telephone"))
bank$contact <- as.factor(as.character(bank$contact))
# Balance
p <- qplot(balance, data = bank,
           geom = "density", alpha = I(1/2))
p <- p + ggtitle("Histogram of Balance")
p

The balance column does not appear to follow a normal distribution. There are some outliers that add noise to the variable so there is a need to remove such values.

Using the interquartile rule, we proceed with the exclusion of all values that lie above the third quartile plus 1.5 times the IQR, and below the first quartile minus 1.5 times the IQR (-2067.5, 3664.5).

bank <- subset(bank, balance>-2067.5 & balance<3664.5)

p <- qplot(balance, data = bank,
           geom = "density", alpha = I(1/2))
p <- p + ggtitle("Histogram of Balance")
p

We now see a more Gaussian looking shape in the distributions of age for each target class.

Features

We now eliminate columns that we don´t want to include in the analysis (all marketing campaign columns).

bank.cluster <- select(bank,
                       age, job, marital, education, default, balance,
                       housing, loan)

In this case, we will be using only real-valued columns as clustering features (age and balance).

p <- ggplot(bank.cluster, aes(x=age, y=balance))
p <- p + geom_point()
p

Scaling

Let´s standardize those two columns by substracting their respective means and dividing by their standard deviations.

feat.scaled <- scale(bank.cluster[,c("age","balance")])

Clustering with k=4

We first try the clustering algorithm (k-means) with k=4, meaning that we want 4 different clusters or customer segments.

set.seed(15555)
pclusters <- kmeans(feat.scaled, 4, nstart=20, iter.max=100)

groups <- pclusters$cluster
clusterDF <- cbind(as.data.frame(feat.scaled), Cluster=as.factor(groups))

4-Segment Profiling

We now profile the 4 clusters we got. The results are reported in the following table.

Segment Profiles (k=4)
  Cluster 1 Cluster 2 Cluster 3 Cluster 4
Age 51.52 35.17 56.19 33.41
Balance 306 2066 2199 282.4
Administrative 10.45 10.36 9.186 13.05
Entrepreneur 3.658 3.12 2.887 3.104
Management 20.15 28.03 20.6 22.43
Services 7.861 7.539 5.731 10.09
Student 0.02368 3.844 0 3.932
Unemployed 3.149 3.32 3.412 2.836
Retired 11.31 0.09985 22.88 0.1556
Educ.Tertiary 24.91 41.19 27.91 34.3
Married 73.46 52.9 79.31 49.29
Defaulted 2.356 0.09985 0.2187 2.192
Housing.Loan 41.81 53.37 31.63 56.16
Personal.Loan 20.18 11.68 11.94 17.74

Clustering with k=5

set.seed(15555)
pclusters <- kmeans(feat.scaled, 5, nstart=20, iter.max=100)

groups <- pclusters$cluster
clusterDF <- cbind(as.data.frame(feat.scaled), Cluster=as.factor(groups))

5-Segment Profiling

We now profile the 5 clusters we got in our second run. The results are reported in the following table.

Segment Profiles (k=5)
  Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5
Age 34.6 31.54 57.19 54.44 43.6
Balance 2073 287.7 364.2 2305 292.9
Administrative 10.38 13.37 8.348 9.768 12.11
Entrepreneur 3.129 2.734 3.199 3.167 4.13
Management 28.22 22.5 17.76 21.81 22.24
Services 7.515 10.2 6.352 6.066 9.411
Student 4.012 4.977 0 0.0446 0.1738
Unemployed 3.316 2.87 3.085 3.568 2.968
Retired 0.107 0.09083 23.48 18.11 0.655
Educ.Tertiary 41.83 35.7 23.93 28.72 27.39
Married 51.73 44.85 76.09 78.72 68.39
Defaulted 0.08024 2.171 1.996 0.1338 2.473
Housing.Loan 53.2 54.82 32.49 34.7 54.81
Personal.Loan 11.63 17.69 18.83 12.58 19.5

Case Study Guide

  1. Which k would you recommend using, and why? Fully explain your selection
  2. Name the clusters you selected (4 or 5) according to the profiling
  3. Select one or two clusters of interest and develop a strategy for servicing and marketing such customers
  4. What would you recommend in order to improve the current clustering results?