This example uses data related to the direct marketing campaigns of a Portuguese banking institution. The campaigns were conducted by phone; often more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed (‘yes’) or not (‘no’). For additional details on this dataset, please refer to: http://mlr.cs.umass.edu/ml/datasets/Bank+Marketing
In direct marketing campaigns (phone calls, direct mail, direct email, and so on), marketing managers usually rely on data from past campaigns to build predictive models that then support upcoming targeted efforts. The goal is to find out which features (demographics, transactional activity, contextual data, etc.) have the greatest impact on the probability of a customer accepting a given offer. Once a model has been built, new or existing customers (who do not already hold the product in question) are scored and assigned a probability of taking the offer. Because the marketing budget is a scarce resource, marketing managers must optimize it by targeting customers who combine a high probability of accepting the offer with a large potential (expected) return. By doing so, the expected profit from the campaign (the expected return once the customer opens the given service or financial product, minus the cost of the campaign) is also maximized.
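As a rough, purely illustrative sketch of that calculation (all numbers below are hypothetical and are not taken from this dataset):

# Hypothetical illustration: expected profit of contacting one customer
p_accept <- 0.08   # model-estimated probability that the customer subscribes
value    <- 120    # assumed return once the term deposit is opened
cost     <- 5      # assumed cost of one phone contact
expected_profit <- p_accept * value - cost
expected_profit  # target the customers for whom this quantity is positive and largest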
Let's start by loading the data. We also use this code chunk to load the libraries needed for this project. Note that the data set must already have been downloaded from the link provided above and placed in the current working directory.
library(caret); library(ggplot2)
library(plyr); library(dplyr)
library(pander)
bank <- read.csv("bank-full.csv",
                 header = TRUE,
                 sep = ";")
The data set contains bank client attributes (such as age, job, marital status, education, default status, balance, and housing/personal loans) together with variables describing the marketing campaign contacts.
Our objective is to build a clustering scheme for the current customer base, including all bank client data and using the current campaign data as potential discriminating variables.
# Keep only customers contacted by cellular or telephone, and drop unused factor levels
bank <- subset(bank, contact %in% c("cellular", "telephone"))
bank$contact <- as.factor(as.character(bank$contact))
# Balance
p <- qplot(balance, data = bank,
           geom = "density", alpha = I(1/2))
p <- p + ggtitle("Density of Balance")
p
The balance column does not appear to follow a normal distribution; a number of outliers add noise to the variable, so we need to remove those values.
Using the interquartile rule, we exclude all values above the third quartile plus 1.5 times the IQR and below the first quartile minus 1.5 times the IQR, i.e. outside the interval (-2067.5, 3664.5).
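A minimal sketch of how these cutoffs are obtained (computed on bank before the filtering below; exact values may shift slightly if earlier steps change):

q <- quantile(bank$balance, probs = c(0.25, 0.75))  # first and third quartiles
iqr <- unname(q[2] - q[1])                          # interquartile range
lower <- unname(q[1]) - 1.5 * iqr
upper <- unname(q[2]) + 1.5 * iqr
c(lower, upper)  # should reproduce the (-2067.5, 3664.5) cutoffs quoted above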
bank <- subset(bank, balance > -2067.5 & balance < 3664.5)
p <- qplot(balance, data = bank,
           geom = "density", alpha = I(1/2))
p <- p + ggtitle("Density of Balance")
p
We now see a more Gaussian-looking shape in the distribution of balance.
We now eliminate the columns that we don't want to include in the analysis (all marketing campaign columns).
# Keep only the bank client attributes
bank.cluster <- select(bank,
                       age, job, marital, education, default, balance,
                       housing, loan)
In this case, we will use only the two real-valued columns (age and balance) as clustering features.
p <- ggplot(bank.cluster, aes(x=age, y=balance))
p <- p + geom_point()
p
Let's standardize those two columns by subtracting their respective means and dividing by their standard deviations.
feat.scaled <- scale(bank.cluster[,c("age","balance")])
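For reference, scale() here is simply subtracting each column's mean and dividing by its standard deviation; a quick check on the age column confirms this:

# Manual standardization of age should match the scale() output
manual.age <- (bank.cluster$age - mean(bank.cluster$age)) / sd(bank.cluster$age)
all.equal(manual.age, as.numeric(feat.scaled[, "age"]))  # should be TRUE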
We first try the clustering algorithm (k-means) with k=4, meaning that we want 4 different clusters or customer segments.
set.seed(15555)
pclusters <- kmeans(feat.scaled, 4, nstart=20, iter.max=100)
groups <- pclusters$cluster
clusterDF <- cbind(as.data.frame(feat.scaled), Cluster=as.factor(groups))
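Before profiling, it can help to visualize the segments on the two standardized features; a quick sketch reusing the clusterDF data frame built above:

p <- ggplot(clusterDF, aes(x = age, y = balance, color = Cluster))
p <- p + geom_point(alpha = 1/2)
p <- p + ggtitle("Customer Segments (k = 4, standardized features)")
p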
We now profile the 4 clusters obtained. The table below reports, for each cluster, the mean age and mean balance, together with the percentage of customers falling into each of the listed job, education, marital status, default, and loan categories.
| | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 |
|---|---|---|---|---|
| Age | 51.52 | 35.17 | 56.19 | 33.41 |
| Balance | 306 | 2066 | 2199 | 282.4 |
| Administrative | 10.45 | 10.36 | 9.186 | 13.05 |
| Entrepreneur | 3.658 | 3.12 | 2.887 | 3.104 |
| Management | 20.15 | 28.03 | 20.6 | 22.43 |
| Services | 7.861 | 7.539 | 5.731 | 10.09 |
| Student | 0.02368 | 3.844 | 0 | 3.932 |
| Unemployed | 3.149 | 3.32 | 3.412 | 2.836 |
| Retired | 11.31 | 0.09985 | 22.88 | 0.1556 |
| Educ.Tertiary | 24.91 | 41.19 | 27.91 | 34.3 |
| Married | 73.46 | 52.9 | 79.31 | 49.29 |
| Defaulted | 2.356 | 0.09985 | 0.2187 | 2.192 |
| Housing.Loan | 41.81 | 53.37 | 31.63 | 56.16 |
| Personal.Loan | 20.18 | 11.68 | 11.94 | 17.74 |
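These profiles can be assembled by attaching the cluster labels back to the unscaled customer attributes and summarising each segment. A minimal sketch of that step follows; the factor level names (e.g. "retired", "tertiary", "yes") are assumed to match the raw data, and only some of the job categories are shown:

# Attach cluster labels to the unscaled attributes and summarise each segment;
# categorical columns are expressed as the percentage of customers in the segment
profile <- bank.cluster %>%
  mutate(Cluster = factor(groups)) %>%
  group_by(Cluster) %>%
  summarise(Age           = mean(age),
            Balance       = mean(balance),
            Retired       = 100 * mean(job == "retired"),
            Student       = 100 * mean(job == "student"),
            Educ.Tertiary = 100 * mean(education == "tertiary"),
            Married       = 100 * mean(marital == "married"),
            Defaulted     = 100 * mean(default == "yes"),
            Housing.Loan  = 100 * mean(housing == "yes"),
            Personal.Loan = 100 * mean(loan == "yes"))
pander(profile)

We now repeat the clustering with k=5.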
set.seed(15555)
pclusters <- kmeans(feat.scaled, 5, nstart=20, iter.max=100)
groups <- pclusters$cluster
clusterDF <- cbind(as.data.frame(feat.scaled), Cluster=as.factor(groups))
We now profile the 5 clusters obtained in the second run. The results are reported in the following table, using the same layout as before.
| | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 |
|---|---|---|---|---|---|
| Age | 34.6 | 31.54 | 57.19 | 54.44 | 43.6 |
| Balance | 2073 | 287.7 | 364.2 | 2305 | 292.9 |
| Administrative | 10.38 | 13.37 | 8.348 | 9.768 | 12.11 |
| Entrepreneur | 3.129 | 2.734 | 3.199 | 3.167 | 4.13 |
| Management | 28.22 | 22.5 | 17.76 | 21.81 | 22.24 |
| Services | 7.515 | 10.2 | 6.352 | 6.066 | 9.411 |
| Student | 4.012 | 4.977 | 0 | 0.0446 | 0.1738 |
| Unemployed | 3.316 | 2.87 | 3.085 | 3.568 | 2.968 |
| Retired | 0.107 | 0.09083 | 23.48 | 18.11 | 0.655 |
| Educ.Tertiary | 41.83 | 35.7 | 23.93 | 28.72 | 27.39 |
| Married | 51.73 | 44.85 | 76.09 | 78.72 | 68.39 |
| Defaulted | 0.08024 | 2.171 | 1.996 | 0.1338 | 2.473 |
| Housing.Loan | 53.2 | 54.82 | 32.49 | 34.7 | 54.81 |
| Personal.Loan | 11.63 | 17.69 | 18.83 | 12.58 | 19.5 |
Which value of k would you recommend using, and why? Fully explain your selection.
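To support that choice, one common heuristic is the elbow method: run k-means for a range of k values and look for the point where the total within-cluster sum of squares stops dropping sharply. A sketch of that check on the same standardized features:

# Total within-cluster sum of squares for k = 2..10 (elbow method)
set.seed(15555)
wss <- sapply(2:10, function(k)
  kmeans(feat.scaled, k, nstart = 20, iter.max = 100)$tot.withinss)
qplot(2:10, wss, geom = c("point", "line"),
      xlab = "Number of clusters k",
      ylab = "Total within-cluster sum of squares")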