The client company, Santander bank, is originated from Spain and has branches from all over the world. Currently, their system doesn’t offer the right products to their customers, so that some customers get too many offers while others don’t have any. Their wish for the solution is “with a more effective recommendation system in place, Santander can better meet the individual needs of all customers and ensure their satisfaction no matter where they are in life” (Kaggle). Therefore, an algorithm or a model that is a result from my project will help Santander bank to predict which products their existing customers will use in the next month based on their past behavior.
Train data has 13,647,309 observations and 48 different variables. First 24 variables are all customer related and last 24 variables are all bank products. Since Santander bank is originated from Spain, data variable names are all in Spanish.
dim(bank)
## [1] 13647309 48
names(bank)
## [1] "fecha_dato" "ncodpers"
## [3] "ind_empleado" "pais_residencia"
## [5] "sexo" "age"
## [7] "fecha_alta" "ind_nuevo"
## [9] "antiguedad" "indrel"
## [11] "ult_fec_cli_1t" "indrel_1mes"
## [13] "tiprel_1mes" "indresi"
## [15] "indext" "conyuemp"
## [17] "canal_entrada" "indfall"
## [19] "tipodom" "cod_prov"
## [21] "nomprov" "ind_actividad_cliente"
## [23] "renta" "segmento"
## [25] "ind_ahor_fin_ult1" "ind_aval_fin_ult1"
## [27] "ind_cco_fin_ult1" "ind_cder_fin_ult1"
## [29] "ind_cno_fin_ult1" "ind_ctju_fin_ult1"
## [31] "ind_ctma_fin_ult1" "ind_ctop_fin_ult1"
## [33] "ind_ctpp_fin_ult1" "ind_deco_fin_ult1"
## [35] "ind_deme_fin_ult1" "ind_dela_fin_ult1"
## [37] "ind_ecue_fin_ult1" "ind_fond_fin_ult1"
## [39] "ind_hip_fin_ult1" "ind_plan_fin_ult1"
## [41] "ind_pres_fin_ult1" "ind_reca_fin_ult1"
## [43] "ind_tjcr_fin_ult1" "ind_valo_fin_ult1"
## [45] "ind_viv_fin_ult1" "ind_nomina_ult1"
## [47] "ind_nom_pens_ult1" "ind_recibo_ult1"
There are 2 main types of variables: customer related variables and bank products. Since I need the results to be in Spanish, I leave bank products as is and changed the names of customer related variables only to make them readable and understandable.
Now let’s see what kind of data we’re dealing with.
summary(bank2)
## date custcode emp_index country
## 2016-05-28: 931453 Min. : 15889 : 27734 ES :13553710
## 2016-04-28: 928274 1st Qu.: 452813 A: 2492 : 27734
## 2016-03-28: 925076 Median : 931893 B: 3566 FR : 5161
## 2016-02-28: 920904 Mean : 834904 F: 2523 AR : 4835
## 2016-01-28: 916269 3rd Qu.:1199286 N:13610977 DE : 4625
## 2015-12-28: 912021 Max. :1553689 S: 17 GB : 4605
## (Other) :8113312 (Other): 46639
## gender age first_date new_cust
## : 27804 23 : 779884 2014-07-28: 57389 Min. :0.00
## H:6195253 22 : 736314 2014-10-03: 54287 1st Qu.:0.00
## V:7424252 24 : 734785 2014-08-04: 45746 Median :0.00
## 21 : 675988 2013-10-14: 40804 Mean :0.06
## 25 : 472016 2013-08-03: 33414 3rd Qu.:0.00
## 20 : 422867 : 27734 Max. :1.00
## (Other):9825455 (Other) :13387935 NA's :27734
## senior primary last_date
## 12: 243160 Min. : 1.000 :13622516
## 21: 214795 1st Qu.: 1.000 2015-12-24: 763
## 10: 206165 Median : 1.000 2015-12-28: 521
## 9: 177957 Mean : 1.178 2015-07-09: 443
## 23: 177839 3rd Qu.: 1.000 2015-07-06: 405
## 33: 174352 Max. :99.000 2015-07-01: 401
## (Other):12453041 NA's :27734 (Other) : 22260
## cust_type_first cust_relation_first resi_index for_index
## 1.0 :9133383 : 149781 : 27734 : 27734
## 1 :4357298 A:6187123 N: 65864 N:12974839
## : 149781 I:7304875 S:13553711 S: 644736
## 3.0 : 2780 N: 4
## 3 : 1570 P: 4656
## P : 874 R: 870
## (Other): 1623
## spouse_index channel deceased add_type
## :13645501 KHE :4055270 : 27734 Min. :1
## N: 1791 KAT :3268209 N:13584813 1st Qu.:1
## S: 17 KFC :3098360 S: 34762 Median :1
## KHQ : 591039 Mean :1
## KFA : 409669 3rd Qu.:1
## KHK : 241084 Max. :1
## (Other):1983678 NA's :27735
## cod_prov prov_name active income
## Min. : 1.00 MADRID :4409600 Min. :0.000 Min. : 1203
## 1st Qu.:15.00 BARCELONA :1275219 1st Qu.:0.000 1st Qu.: 68711
## Median :28.00 VALENCIA : 682304 Median :0.000 Median : 101850
## Mean :26.57 SEVILLA : 605164 Mean :0.458 Mean : 134254
## 3rd Qu.:35.00 CORUÃA, A: 429322 3rd Qu.:1.000 3rd Qu.: 155956
## Max. :52.00 MURCIA : 396759 Max. :1.000 Max. :28894396
## NA's :93591 (Other) :5848941 NA's :27734 NA's :2794375
## segmento ind_ahor_fin_ult1 ind_aval_fin_ult1
## : 189368 Min. :0.0000000 Min. :0.00e+00
## 01 - TOP : 562142 1st Qu.:0.0000000 1st Qu.:0.00e+00
## 02 - PARTICULARES :7960220 Median :0.0000000 Median :0.00e+00
## 03 - UNIVERSITARIO:4935579 Mean :0.0001023 Mean :2.32e-05
## 3rd Qu.:0.0000000 3rd Qu.:0.00e+00
## Max. :1.0000000 Max. :1.00e+00
##
## ind_cco_fin_ult1 ind_cder_fin_ult1 ind_cno_fin_ult1 ind_ctju_fin_ult1
## Min. :0.0000 Min. :0.0000000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.0000 1st Qu.:0.0000000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :1.0000 Median :0.0000000 Median :0.00000 Median :0.000000
## Mean :0.6555 Mean :0.0003939 Mean :0.08087 Mean :0.009474
## 3rd Qu.:1.0000 3rd Qu.:0.0000000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.0000 Max. :1.0000000 Max. :1.00000 Max. :1.000000
##
## ind_ctma_fin_ult1 ind_ctop_fin_ult1 ind_ctpp_fin_ult1 ind_deco_fin_ult1
## Min. :0.000000 Min. :0.000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.000000 Median :0.000 Median :0.00000 Median :0.000000
## Mean :0.009727 Mean :0.129 Mean :0.04331 Mean :0.001779
## 3rd Qu.:0.000000 3rd Qu.:0.000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.000 Max. :1.00000 Max. :1.000000
##
## ind_deme_fin_ult1 ind_dela_fin_ult1 ind_ecue_fin_ult1 ind_fond_fin_ult1
## Min. :0.000000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.000000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.001661 Mean :0.04297 Mean :0.08274 Mean :0.01849
## 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.00000 Max. :1.00000 Max. :1.00000
##
## ind_hip_fin_ult1 ind_plan_fin_ult1 ind_pres_fin_ult1
## Min. :0.000000 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.000000 Median :0.000000 Median :0.000000
## Mean :0.005887 Mean :0.009171 Mean :0.002627
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.000000 Max. :1.000000
##
## ind_reca_fin_ult1 ind_tjcr_fin_ult1 ind_valo_fin_ult1 ind_viv_fin_ult1
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.000000
## Mean :0.05254 Mean :0.04439 Mean :0.02561 Mean :0.003848
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.000000
##
## ind_nomina_ult1 ind_nom_pens_ult1 ind_recibo_ult1
## Min. :0.000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.0000
## Median :0.000 Median :0.000 Median :0.0000
## Mean :0.055 Mean :0.059 Mean :0.1279
## 3rd Qu.:0.000 3rd Qu.:0.000 3rd Qu.:0.0000
## Max. :1.000 Max. :1.000 Max. :1.0000
## NA's :16063 NA's :16063
I found there are consistent of 27,734 missing data in several variables and checking those accounts proved that there isn’t much information in those accounts, so I decided to delete them all together.
missing<-subset(bank2, bank2$emp_index=="")
bank_new<-subset(bank2, bank2$emp_index!="")
To check all the other missing values, I created bank_missing to visualize the missing values. Let’s quickly understand this. There are 80% values in the data set with no missing value. There are 20% missing values in Income, around 0.05% missing values in province code and so on.
library(VIM)
bank_missing<-cbind(bank_new$add_type, bank_new$cod_prov, bank_new$income, bank_new[,46], bank_new[,47])
aggr(bank_missing, col=c('#ffffb2','#fecc5c'),
numbers=TRUE, sortVars=TRUE,
labels=names(bank_missing), cex.axis=.7,
gap=3, ylab=c("Missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## V3 2.031371e-01
## V2 4.835467e-03
## V4 1.593295e-05
## V5 1.593295e-05
## V1 7.342373e-08
Using Hmisc package to impute missing values of income with its statistical mean value.
# impute with mean value
library(Hmisc)
bank_new$income <- with(bank_new, impute(income, mean))
Also, I replaced the missing values of payroll and pension with 0 since it represents no ownership of the products.
bank_new$ind_nomina_ult1[is.na(bank_new$ind_nomina_ult1)] <- 0
bank_new$ind_nom_pens_ult1[is.na(bank_new$ind_nom_pens_ult1)] <- 0
Initial histogram revealed that age distribution has two distinct spikes.
library(ggplot2)
ggplot(bank_new, aes(x=age)) +
stat_count(width=1, position="stack") +
ggtitle("Age Histogram")
Later histogram reveals that younger group belongs to college graduates and older group belongs to individual group of customers.
bank_new$age<-as.numeric(as.character(bank_new$age))
library(ggthemes)
library(ggplot2)
bank_new$segmento <- factor(bank_new$segmento, labels = c("Other", "VIP", "Individuals", "Graduates"))
ggplot(bank_new, aes(x=age, fill=factor(segmento))) +
geom_bar() +
facet_grid(".~segmento") +
scale_fill_manual("Customer Segmentation", values=c("#ffffb2", "#fecc5c", "#fd8d3c", "#e31a1c")) +
theme_classic() +
theme(legend.position="bottom") +
scale_y_continuous("Frequency")
When you look at the average age of customers first open their account within time line, younger people more likely to open an account at Santander bank since 2010.
Country table shows that Santander customers are from all over the world.
table(bank_new$country)
##
## AD AE AL AO AR AT AU
## 0 111 221 17 68 4835 476 424
## BA BE BG BM BO BR BY BZ
## 34 1526 476 6 1514 2351 102 17
## CA CD CF CG CH CI CL CM
## 446 17 17 34 1995 51 989 85
## CN CO CR CU CZ DE DJ DK
## 563 3526 147 758 102 4625 11 226
## DO DZ EC EE EG ES ET FI
## 424 86 2169 45 68 13553710 34 345
## FR GA GB GE GH GI GM GN
## 5161 51 4605 17 17 17 17 51
## GQ GR GT GW HK HN HR HU
## 119 243 130 34 51 282 68 37
## IE IL IN IS IT JM JP KE
## 409 413 187 17 2947 11 239 72
## KH KR KW KZ LB LT LU LV
## 17 96 17 17 17 45 124 17
## LY MA MD MK ML MM MR MT
## 17 396 96 51 17 17 51 2
## MX MZ NG NI NL NO NZ OM
## 2573 34 214 62 757 136 51 22
## PA PE PH PK PL PR PT PY
## 77 900 34 85 599 101 1419 1430
## QA RO RS RU SA SE SG SK
## 52 2931 34 769 79 603 117 85
## SL SN SV TG TH TN TR TW
## 17 68 102 17 102 17 62 34
## UA US UY VE VN ZA ZW
## 493 3651 510 2331 34 119 11
round(prop.table(table(bank2$country=="ES"))*100,2)
##
## FALSE TRUE
## 0.69 99.31
Income distribution shows that there are many extreme values up to 30 million. It’s interesting to see that VIP customers do not have the highest income distribution.
First I explored if there is any correlations between bank products. To do this I had to extract only products information from the dataset.
Now, visualize to see any correlation
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.3.2
product_corr <- cor(product)
corrplot(product_corr, method="square")
See actual correlation values
corrplot.mixed(product_corr)
More than 2 million customers do not have a single product, and almost half of the customers own only one product. There are only 15 products that one person holds at same time.
bank_new$totalproducts<-rowSums(bank_new[,25:48], na.rm = TRUE)
table(bank_new$totalproducts)
##
## 0 1 2 3 4 5 6 7 8
## 2550371 7150086 1928371 770354 449867 289819 210919 141468 76138
## 9 10 11 12 13 14 15
## 33768 12682 4343 1124 230 26 9
barplot(table(bank_new$totalproducts), xlab="Number of Products", las=3)
The current or debit account is the most popular product as it has 8,938,150 account holders.
tbl<-colSums(bank_new[,25:48])
names(tbl)<-c("savings", "guarantee", "current", "derivada", "payroll_acc",
"junior", "mas particular", "particular", "particular Plus", "short_dep",
"med-dep", "long_dep", "e_account", "funds", "mortgage", "pensions",
"loans", "taxes", "creditcard", "securities", "home", "payroll", "pension_last",
"direct_debit")
barplot(tbl)
More transformation
bank_clean<-subset(bank_new, bank_new$totalproducts>0)
dim(bank_clean)
## [1] 11069204 49
#Checking top 15 products
prodname<-head(sort(tbl,decreasing=TRUE), n = 15)
#Only choosing top 15 products
bank_last<-cbind(bank_clean[,1:24],bank_clean[,27],bank_clean[,29:33],
bank_clean[,36:38], bank_clean[,42:44], bank_clean[,46:48], bank_clean[,49])
colnames(bank_last)[25] <- "ind_cco_fin_ult1"
colnames(bank_last)[40] <- "totalproducts"
dim(bank_last)
## [1] 11069204 40
bank_active<-subset(bank_last, bank_last$active==1)
dim(bank_active)
## [1] 6215578 40
bank_active.5.28.16<-subset(bank_active, bank_active$date=="2016-05-28")
bank.matrix<-bank_active.5.28.16
library(arules)
## Warning: package 'arules' was built under R version 3.3.2
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
bank.matrix$custcode <- as.factor(bank.matrix$custcode)
bank.matrix$age <- discretize(bank.matrix$age, method = "frequency", 5)
bank.matrix$new_cust <- as.logical(bank.matrix$new_cust)
bank.matrix$primary <- as.factor(bank.matrix$primary)
bank.matrix$add_type<-NULL
bank.matrix$cod_prov<-as.factor(bank.matrix$cod_prov)
bank.matrix$income <- discretize(bank.matrix$income, method = "frequency", 5)
## Warning in unique(c(x[!is.na(x)], if (!missing(cuts)) cuts)): Reached total
## allocation of 16291Mb: see help(memory.size)
## Warning in unique(c(x[!is.na(x)], if (!missing(cuts)) cuts)): Reached total
## allocation of 16291Mb: see help(memory.size)
bank.matrix$active <- NULL
bank.matrix$totalproducts <- as.factor(bank.matrix$totalproducts)
bank.matrix$first_date<- as.factor(bank.matrix$first_date)
dim(bank.matrix)
## [1] 394050 38
# For 0-1 data, I'm using "recommenderlab" package
library(recommenderlab)
bank.matrix<-bank_active.5.28.16
# I subset dataset that has only user ids and top 15 bank products
bank.recom<-cbind(bank.matrix[,2], bank.matrix[23:37])
head(bank.recom)
## bank.matrix[, 2] income segmento ind_cco_fin_ult1
## 12715858 657788 132559.35 Individuals 1
## 12715859 657795 81399.57 Individuals 1
## 12715860 657790 134254.32 Individuals 0
## 12715861 657794 102189.00 VIP 1
## 12715863 657789 153725.49 VIP 1
## 12715867 657781 159854.19 VIP 0
## ind_cno_fin_ult1 ind_ctju_fin_ult1 ind_ctma_fin_ult1
## 12715858 0 0 0
## 12715859 0 0 0
## 12715860 1 0 0
## 12715861 0 0 0
## 12715863 0 0 0
## 12715867 0 0 0
## ind_ctop_fin_ult1 ind_ctpp_fin_ult1 ind_dela_fin_ult1
## 12715858 0 0 0
## 12715859 0 0 0
## 12715860 0 0 0
## 12715861 0 0 1
## 12715863 0 1 1
## 12715867 0 0 1
## ind_ecue_fin_ult1 ind_fond_fin_ult1 ind_reca_fin_ult1
## 12715858 0 0 0
## 12715859 0 0 0
## 12715860 1 0 0
## 12715861 0 0 0
## 12715863 0 0 0
## 12715867 0 0 0
## ind_tjcr_fin_ult1 ind_valo_fin_ult1 ind_nomina_ult1
## 12715858 0 0 0
## 12715859 0 0 0
## 12715860 1 0 1
## 12715861 0 0 0
## 12715863 1 0 0
## 12715867 0 0 0
bank.recom<-as(bank.recom, "realRatingMatrix")
# Create a simple recommender that generates recommendations based on popularity of items
bank.rec<-Recommender(bank.recom, method="POPULAR")
getModel(bank.rec)
## $topN
## Recommendations as 'topNList' with n = 240021 for 1 users.
##
## $ratings
## 1 x 240021 rating matrix of class 'realRatingMatrix' with 240021 ratings.
## Normalized using center on rows.
##
## $normalize
## [1] "center"
##
## $aggregationRatings
## function (x, na.rm = FALSE, dims = 1, ...)
## standardGeneric("colMeans")
## <bytecode: 0x0000000046a0e520>
## <environment: 0x0000000008cdc650>
## attr(,"generic")
## [1] "colMeans"
## attr(,"generic")attr(,"package")
## [1] "base"
## attr(,"package")
## [1] "base"
## attr(,"group")
## list()
## attr(,"valueClass")
## character(0)
## attr(,"signature")
## [1] "x" "na.rm" "dims"
## attr(,"default")
## Method Definition (Class "derivedDefaultMethod"):
##
## function (x, na.rm = FALSE, dims = 1, ...)
## base::colMeans(x, na.rm = na.rm, dims = dims, ...)
## <environment: 0x0000000008b84088>
##
## Signatures:
## x
## target "ANY"
## defined "ANY"
## attr(,"skeleton")
## (function (x, na.rm = FALSE, dims = 1, ...)
## base::colMeans(x, na.rm = na.rm, dims = dims, ...))(x, na.rm,
## dims, ...)
## attr(,"class")
## [1] "standardGeneric"
## attr(,"class")attr(,"package")
## [1] "methods"
##
## $aggregationPopularity
## function (x, na.rm = FALSE, dims = 1, ...)
## standardGeneric("colSums")
## <bytecode: 0x0000000046a10438>
## <environment: 0x000000000886cc58>
## attr(,"generic")
## [1] "colSums"
## attr(,"generic")attr(,"package")
## [1] "base"
## attr(,"package")
## [1] "base"
## attr(,"group")
## list()
## attr(,"valueClass")
## character(0)
## attr(,"signature")
## [1] "x" "na.rm" "dims"
## attr(,"default")
## Method Definition (Class "derivedDefaultMethod"):
##
## function (x, na.rm = FALSE, dims = 1, ...)
## base::colSums(x, na.rm = na.rm, dims = dims, ...)
## <environment: 0x0000000008b84088>
##
## Signatures:
## x
## target "ANY"
## defined "ANY"
## attr(,"skeleton")
## (function (x, na.rm = FALSE, dims = 1, ...)
## base::colSums(x, na.rm = na.rm, dims = dims, ...))(x, na.rm,
## dims, ...)
## attr(,"class")
## [1] "standardGeneric"
## attr(,"class")attr(,"package")
## [1] "methods"
##
## $verbose
## [1] FALSE
# Create Top-1 list from recommendations
#bank.pred.15889<-predict(bank.rec, bank.recom["15889",], n=1)
#summary(list(bank.pred))
#as(bank.pred.15889, "list")