Data Exploration: Recommender System on Predicting Next Bank Products

Introduction

The client company, Santander bank, is originated from Spain and has branches from all over the world. Currently, their system doesn’t offer the right products to their customers, so that some customers get too many offers while others don’t have any. Their wish for the solution is “with a more effective recommendation system in place, Santander can better meet the individual needs of all customers and ensure their satisfaction no matter where they are in life” (Kaggle). Therefore, an algorithm or a model that is a result from my project will help Santander bank to predict which products their existing customers will use in the next month based on their past behavior.

Data Cleaning:

Train data has 13,647,309 observations and 48 different variables. First 24 variables are all customer related and last 24 variables are all bank products. Since Santander bank is originated from Spain, data variable names are all in Spanish.

dim(bank)

## [1] 13647309       48

names(bank)

##  [1] "fecha_dato"            "ncodpers"             
##  [3] "ind_empleado"          "pais_residencia"      
##  [5] "sexo"                  "age"                  
##  [7] "fecha_alta"            "ind_nuevo"            
##  [9] "antiguedad"            "indrel"               
## [11] "ult_fec_cli_1t"        "indrel_1mes"          
## [13] "tiprel_1mes"           "indresi"              
## [15] "indext"                "conyuemp"             
## [17] "canal_entrada"         "indfall"              
## [19] "tipodom"               "cod_prov"             
## [21] "nomprov"               "ind_actividad_cliente"
## [23] "renta"                 "segmento"             
## [25] "ind_ahor_fin_ult1"     "ind_aval_fin_ult1"    
## [27] "ind_cco_fin_ult1"      "ind_cder_fin_ult1"    
## [29] "ind_cno_fin_ult1"      "ind_ctju_fin_ult1"    
## [31] "ind_ctma_fin_ult1"     "ind_ctop_fin_ult1"    
## [33] "ind_ctpp_fin_ult1"     "ind_deco_fin_ult1"    
## [35] "ind_deme_fin_ult1"     "ind_dela_fin_ult1"    
## [37] "ind_ecue_fin_ult1"     "ind_fond_fin_ult1"    
## [39] "ind_hip_fin_ult1"      "ind_plan_fin_ult1"    
## [41] "ind_pres_fin_ult1"     "ind_reca_fin_ult1"    
## [43] "ind_tjcr_fin_ult1"     "ind_valo_fin_ult1"    
## [45] "ind_viv_fin_ult1"      "ind_nomina_ult1"      
## [47] "ind_nom_pens_ult1"     "ind_recibo_ult1"

1. Change Variable Names to English

There are 2 main types of variables: customer related variables and bank products. Since I need the results to be in Spanish, I leave bank products as is and changed the names of customer related variables only to make them readable and understandable.

Now let’s see what kind of data we’re dealing with.

summary(bank2)

##          date            custcode       emp_index       country        
##  2016-05-28: 931453   Min.   :  15889    :   27734   ES     :13553710  
##  2016-04-28: 928274   1st Qu.: 452813   A:    2492          :   27734  
##  2016-03-28: 925076   Median : 931893   B:    3566   FR     :    5161  
##  2016-02-28: 920904   Mean   : 834904   F:    2523   AR     :    4835  
##  2016-01-28: 916269   3rd Qu.:1199286   N:13610977   DE     :    4625  
##  2015-12-28: 912021   Max.   :1553689   S:      17   GB     :    4605  
##  (Other)   :8113312                                  (Other):   46639  
##  gender           age               first_date          new_cust    
##   :  27804    23    : 779884   2014-07-28:   57389   Min.   :0.00   
##  H:6195253    22    : 736314   2014-10-03:   54287   1st Qu.:0.00   
##  V:7424252    24    : 734785   2014-08-04:   45746   Median :0.00   
##               21    : 675988   2013-10-14:   40804   Mean   :0.06   
##               25    : 472016   2013-08-03:   33414   3rd Qu.:0.00   
##               20    : 422867             :   27734   Max.   :1.00   
##              (Other):9825455   (Other)   :13387935   NA's   :27734  
##      senior            primary            last_date       
##       12:  243160   Min.   : 1.000             :13622516  
##       21:  214795   1st Qu.: 1.000   2015-12-24:     763  
##       10:  206165   Median : 1.000   2015-12-28:     521  
##        9:  177957   Mean   : 1.178   2015-07-09:     443  
##       23:  177839   3rd Qu.: 1.000   2015-07-06:     405  
##       33:  174352   Max.   :99.000   2015-07-01:     401  
##  (Other):12453041   NA's   :27734    (Other)   :   22260  
##  cust_type_first   cust_relation_first resi_index   for_index   
##  1.0    :9133383    : 149781            :   27734    :   27734  
##  1      :4357298   A:6187123           N:   65864   N:12974839  
##         : 149781   I:7304875           S:13553711   S:  644736  
##  3.0    :   2780   N:      4                                    
##  3      :   1570   P:   4656                                    
##  P      :    874   R:    870                                    
##  (Other):   1623                                                
##  spouse_index    channel        deceased        add_type    
##   :13645501   KHE    :4055270    :   27734   Min.   :1      
##  N:    1791   KAT    :3268209   N:13584813   1st Qu.:1      
##  S:      17   KFC    :3098360   S:   34762   Median :1      
##               KHQ    : 591039                Mean   :1      
##               KFA    : 409669                3rd Qu.:1      
##               KHK    : 241084                Max.   :1      
##               (Other):1983678                NA's   :27735  
##     cod_prov          prov_name           active          income        
##  Min.   : 1.00   MADRID    :4409600   Min.   :0.000   Min.   :    1203  
##  1st Qu.:15.00   BARCELONA :1275219   1st Qu.:0.000   1st Qu.:   68711  
##  Median :28.00   VALENCIA  : 682304   Median :0.000   Median :  101850  
##  Mean   :26.57   SEVILLA   : 605164   Mean   :0.458   Mean   :  134254  
##  3rd Qu.:35.00   CORUÃA, A: 429322   3rd Qu.:1.000   3rd Qu.:  155956  
##  Max.   :52.00   MURCIA    : 396759   Max.   :1.000   Max.   :28894396  
##  NA's   :93591   (Other)   :5848941   NA's   :27734   NA's   :2794375   
##                segmento       ind_ahor_fin_ult1   ind_aval_fin_ult1 
##                    : 189368   Min.   :0.0000000   Min.   :0.00e+00  
##  01 - TOP          : 562142   1st Qu.:0.0000000   1st Qu.:0.00e+00  
##  02 - PARTICULARES :7960220   Median :0.0000000   Median :0.00e+00  
##  03 - UNIVERSITARIO:4935579   Mean   :0.0001023   Mean   :2.32e-05  
##                               3rd Qu.:0.0000000   3rd Qu.:0.00e+00  
##                               Max.   :1.0000000   Max.   :1.00e+00  
##                                                                     
##  ind_cco_fin_ult1 ind_cder_fin_ult1   ind_cno_fin_ult1  ind_ctju_fin_ult1 
##  Min.   :0.0000   Min.   :0.0000000   Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.0000   1st Qu.:0.0000000   1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :1.0000   Median :0.0000000   Median :0.00000   Median :0.000000  
##  Mean   :0.6555   Mean   :0.0003939   Mean   :0.08087   Mean   :0.009474  
##  3rd Qu.:1.0000   3rd Qu.:0.0000000   3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :1.0000   Max.   :1.0000000   Max.   :1.00000   Max.   :1.000000  
##                                                                           
##  ind_ctma_fin_ult1  ind_ctop_fin_ult1 ind_ctpp_fin_ult1 ind_deco_fin_ult1 
##  Min.   :0.000000   Min.   :0.000     Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.000000   1st Qu.:0.000     1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.000000   Median :0.000     Median :0.00000   Median :0.000000  
##  Mean   :0.009727   Mean   :0.129     Mean   :0.04331   Mean   :0.001779  
##  3rd Qu.:0.000000   3rd Qu.:0.000     3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :1.000000   Max.   :1.000     Max.   :1.00000   Max.   :1.000000  
##                                                                           
##  ind_deme_fin_ult1  ind_dela_fin_ult1 ind_ecue_fin_ult1 ind_fond_fin_ult1
##  Min.   :0.000000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.000000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.001661   Mean   :0.04297   Mean   :0.08274   Mean   :0.01849  
##  3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.000000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##                                                                          
##  ind_hip_fin_ult1   ind_plan_fin_ult1  ind_pres_fin_ult1 
##  Min.   :0.000000   Min.   :0.000000   Min.   :0.000000  
##  1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.000000  
##  Median :0.000000   Median :0.000000   Median :0.000000  
##  Mean   :0.005887   Mean   :0.009171   Mean   :0.002627  
##  3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.000000  
##  Max.   :1.000000   Max.   :1.000000   Max.   :1.000000  
##                                                          
##  ind_reca_fin_ult1 ind_tjcr_fin_ult1 ind_valo_fin_ult1 ind_viv_fin_ult1  
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.000000  
##  Mean   :0.05254   Mean   :0.04439   Mean   :0.02561   Mean   :0.003848  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.000000  
##                                                                          
##  ind_nomina_ult1 ind_nom_pens_ult1 ind_recibo_ult1 
##  Min.   :0.000   Min.   :0.000     Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:0.000     1st Qu.:0.0000  
##  Median :0.000   Median :0.000     Median :0.0000  
##  Mean   :0.055   Mean   :0.059     Mean   :0.1279  
##  3rd Qu.:0.000   3rd Qu.:0.000     3rd Qu.:0.0000  
##  Max.   :1.000   Max.   :1.000     Max.   :1.0000  
##  NA's   :16063   NA's   :16063

2. Missing data

I found there are consistent of 27,734 missing data in several variables and checking those accounts proved that there isn’t much information in those accounts, so I decided to delete them all together.

missing<-subset(bank2, bank2$emp_index=="")
bank_new<-subset(bank2, bank2$emp_index!="")

To check all the other missing values, I created bank_missing to visualize the missing values. Let’s quickly understand this. There are 80% values in the data set with no missing value. There are 20% missing values in Income, around 0.05% missing values in province code and so on.

library(VIM)
bank_missing<-cbind(bank_new$add_type, bank_new$cod_prov, bank_new$income, bank_new[,46], bank_new[,47])
aggr(bank_missing, col=c('#ffffb2','#fecc5c'),
     numbers=TRUE, sortVars=TRUE,
     labels=names(bank_missing), cex.axis=.7,
     gap=3, ylab=c("Missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##  Variable        Count
##        V3 2.031371e-01
##        V2 4.835467e-03
##        V4 1.593295e-05
##        V5 1.593295e-05
##        V1 7.342373e-08

Using Hmisc package to impute missing values of income with its statistical mean value.

# impute with mean value
library(Hmisc)
bank_new$income <- with(bank_new, impute(income, mean))

Also, I replaced the missing values of payroll and pension with 0 since it represents no ownership of the products.

bank_new$ind_nomina_ult1[is.na(bank_new$ind_nomina_ult1)] <- 0
bank_new$ind_nom_pens_ult1[is.na(bank_new$ind_nom_pens_ult1)] <- 0

Data Exploration

1. Exploring Customer Age

Initial histogram revealed that age distribution has two distinct spikes.

library(ggplot2)
ggplot(bank_new, aes(x=age)) +
  stat_count(width=1, position="stack") +
  ggtitle("Age Histogram")

Later histogram reveals that younger group belongs to college graduates and older group belongs to individual group of customers.

bank_new$age<-as.numeric(as.character(bank_new$age))
library(ggthemes)
library(ggplot2)
bank_new$segmento <- factor(bank_new$segmento, labels = c("Other", "VIP", "Individuals", "Graduates"))
ggplot(bank_new, aes(x=age, fill=factor(segmento))) +
  geom_bar() +
  facet_grid(".~segmento") + 
  scale_fill_manual("Customer Segmentation", values=c("#ffffb2", "#fecc5c", "#fd8d3c", "#e31a1c")) +
  theme_classic() +
  theme(legend.position="bottom") +
  scale_y_continuous("Frequency")

When you look at the average age of customers first open their account within time line, younger people more likely to open an account at Santander bank since 2010.

Changing Age Distribution Over Years

2. Exploring Customer Locations

Country table shows that Santander customers are from all over the world.

table(bank_new$country)

## 
##                AD       AE       AL       AO       AR       AT       AU 
##        0      111      221       17       68     4835      476      424 
##       BA       BE       BG       BM       BO       BR       BY       BZ 
##       34     1526      476        6     1514     2351      102       17 
##       CA       CD       CF       CG       CH       CI       CL       CM 
##      446       17       17       34     1995       51      989       85 
##       CN       CO       CR       CU       CZ       DE       DJ       DK 
##      563     3526      147      758      102     4625       11      226 
##       DO       DZ       EC       EE       EG       ES       ET       FI 
##      424       86     2169       45       68 13553710       34      345 
##       FR       GA       GB       GE       GH       GI       GM       GN 
##     5161       51     4605       17       17       17       17       51 
##       GQ       GR       GT       GW       HK       HN       HR       HU 
##      119      243      130       34       51      282       68       37 
##       IE       IL       IN       IS       IT       JM       JP       KE 
##      409      413      187       17     2947       11      239       72 
##       KH       KR       KW       KZ       LB       LT       LU       LV 
##       17       96       17       17       17       45      124       17 
##       LY       MA       MD       MK       ML       MM       MR       MT 
##       17      396       96       51       17       17       51        2 
##       MX       MZ       NG       NI       NL       NO       NZ       OM 
##     2573       34      214       62      757      136       51       22 
##       PA       PE       PH       PK       PL       PR       PT       PY 
##       77      900       34       85      599      101     1419     1430 
##       QA       RO       RS       RU       SA       SE       SG       SK 
##       52     2931       34      769       79      603      117       85 
##       SL       SN       SV       TG       TH       TN       TR       TW 
##       17       68      102       17      102       17       62       34 
##       UA       US       UY       VE       VN       ZA       ZW 
##      493     3651      510     2331       34      119       11

Let’s see it in the map:

99.31% of customers are from ‘Spain’

round(prop.table(table(bank2$country=="ES"))*100,2)

## 
## FALSE  TRUE 
##  0.69 99.31

3. Exploring Customer Income

Income distribution shows that there are many extreme values up to 30 million. It’s interesting to see that VIP customers do not have the highest income distribution.

4. Exploring Bank Products

First I explored if there is any correlations between bank products. To do this I had to extract only products information from the dataset.

Now, visualize to see any correlation

library(corrplot)

## Warning: package 'corrplot' was built under R version 3.3.2

product_corr <- cor(product)
corrplot(product_corr, method="square")

See actual correlation values

corrplot.mixed(product_corr)

Let’s see how many total products customers have from Santander bank.

More than 2 million customers do not have a single product, and almost half of the customers own only one product. There are only 15 products that one person holds at same time.

bank_new$totalproducts<-rowSums(bank_new[,25:48], na.rm = TRUE)
table(bank_new$totalproducts)

## 
##       0       1       2       3       4       5       6       7       8 
## 2550371 7150086 1928371  770354  449867  289819  210919  141468   76138 
##       9      10      11      12      13      14      15 
##   33768   12682    4343    1124     230      26       9

barplot(table(bank_new$totalproducts), xlab="Number of Products", las=3)

Now let’s see which product is most popular.

The current or debit account is the most popular product as it has 8,938,150 account holders.

tbl<-colSums(bank_new[,25:48])
names(tbl)<-c("savings", "guarantee", "current", "derivada", "payroll_acc",
              "junior", "mas particular", "particular", "particular Plus", "short_dep",
              "med-dep", "long_dep", "e_account", "funds", "mortgage", "pensions",
              "loans", "taxes", "creditcard", "securities", "home", "payroll", "pension_last",
              "direct_debit")
barplot(tbl)

More transformation

bank_clean<-subset(bank_new, bank_new$totalproducts>0)
dim(bank_clean)

## [1] 11069204       49

#Checking top 15 products
prodname<-head(sort(tbl,decreasing=TRUE), n = 15)
#Only choosing top 15 products
bank_last<-cbind(bank_clean[,1:24],bank_clean[,27],bank_clean[,29:33], 
                 bank_clean[,36:38], bank_clean[,42:44], bank_clean[,46:48], bank_clean[,49])
colnames(bank_last)[25] <- "ind_cco_fin_ult1"
colnames(bank_last)[40] <- "totalproducts"
dim(bank_last)

## [1] 11069204       40

RECOMMENDATION SYSTEM

1. Transforming dataset for analysis

bank_active<-subset(bank_last, bank_last$active==1)
dim(bank_active)

## [1] 6215578      40

bank_active.5.28.16<-subset(bank_active, bank_active$date=="2016-05-28")
bank.matrix<-bank_active.5.28.16
library(arules)

## Warning: package 'arules' was built under R version 3.3.2

## Loading required package: Matrix

## 
## Attaching package: 'arules'

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

bank.matrix$custcode <- as.factor(bank.matrix$custcode)
bank.matrix$age <- discretize(bank.matrix$age, method = "frequency", 5)
bank.matrix$new_cust <- as.logical(bank.matrix$new_cust)
bank.matrix$primary <- as.factor(bank.matrix$primary)
bank.matrix$add_type<-NULL
bank.matrix$cod_prov<-as.factor(bank.matrix$cod_prov)
bank.matrix$income <- discretize(bank.matrix$income, method = "frequency", 5)

## Warning in unique(c(x[!is.na(x)], if (!missing(cuts)) cuts)): Reached total
## allocation of 16291Mb: see help(memory.size)

## Warning in unique(c(x[!is.na(x)], if (!missing(cuts)) cuts)): Reached total
## allocation of 16291Mb: see help(memory.size)

bank.matrix$active <- NULL
bank.matrix$totalproducts <- as.factor(bank.matrix$totalproducts)
bank.matrix$first_date<- as.factor(bank.matrix$first_date)
dim(bank.matrix)

## [1] 394050     38

2. Using Recommenderlab Package for Recommendation

# For 0-1 data, I'm using "recommenderlab" package
library(recommenderlab)
bank.matrix<-bank_active.5.28.16
# I subset dataset that has only user ids and top 15 bank products
bank.recom<-cbind(bank.matrix[,2], bank.matrix[23:37])
head(bank.recom)

##          bank.matrix[, 2]    income    segmento ind_cco_fin_ult1
## 12715858           657788 132559.35 Individuals                1
## 12715859           657795  81399.57 Individuals                1
## 12715860           657790 134254.32 Individuals                0
## 12715861           657794 102189.00         VIP                1
## 12715863           657789 153725.49         VIP                1
## 12715867           657781 159854.19         VIP                0
##          ind_cno_fin_ult1 ind_ctju_fin_ult1 ind_ctma_fin_ult1
## 12715858                0                 0                 0
## 12715859                0                 0                 0
## 12715860                1                 0                 0
## 12715861                0                 0                 0
## 12715863                0                 0                 0
## 12715867                0                 0                 0
##          ind_ctop_fin_ult1 ind_ctpp_fin_ult1 ind_dela_fin_ult1
## 12715858                 0                 0                 0
## 12715859                 0                 0                 0
## 12715860                 0                 0                 0
## 12715861                 0                 0                 1
## 12715863                 0                 1                 1
## 12715867                 0                 0                 1
##          ind_ecue_fin_ult1 ind_fond_fin_ult1 ind_reca_fin_ult1
## 12715858                 0                 0                 0
## 12715859                 0                 0                 0
## 12715860                 1                 0                 0
## 12715861                 0                 0                 0
## 12715863                 0                 0                 0
## 12715867                 0                 0                 0
##          ind_tjcr_fin_ult1 ind_valo_fin_ult1 ind_nomina_ult1
## 12715858                 0                 0               0
## 12715859                 0                 0               0
## 12715860                 1                 0               1
## 12715861                 0                 0               0
## 12715863                 1                 0               0
## 12715867                 0                 0               0

bank.recom<-as(bank.recom, "realRatingMatrix")

# Create a simple recommender that generates recommendations based on popularity of items

bank.rec<-Recommender(bank.recom, method="POPULAR")

getModel(bank.rec)

## $topN
## Recommendations as 'topNList' with n = 240021 for 1 users. 
## 
## $ratings
## 1 x 240021 rating matrix of class 'realRatingMatrix' with 240021 ratings.
## Normalized using center on rows.
## 
## $normalize
## [1] "center"
## 
## $aggregationRatings
## function (x, na.rm = FALSE, dims = 1, ...) 
## standardGeneric("colMeans")
## <bytecode: 0x0000000046a0e520>
## <environment: 0x0000000008cdc650>
## attr(,"generic")
## [1] "colMeans"
## attr(,"generic")attr(,"package")
## [1] "base"
## attr(,"package")
## [1] "base"
## attr(,"group")
## list()
## attr(,"valueClass")
## character(0)
## attr(,"signature")
## [1] "x"     "na.rm" "dims" 
## attr(,"default")
## Method Definition (Class "derivedDefaultMethod"):
## 
## function (x, na.rm = FALSE, dims = 1, ...) 
## base::colMeans(x, na.rm = na.rm, dims = dims, ...)
## <environment: 0x0000000008b84088>
## 
## Signatures:
##         x    
## target  "ANY"
## defined "ANY"
## attr(,"skeleton")
## (function (x, na.rm = FALSE, dims = 1, ...) 
## base::colMeans(x, na.rm = na.rm, dims = dims, ...))(x, na.rm, 
##     dims, ...)
## attr(,"class")
## [1] "standardGeneric"
## attr(,"class")attr(,"package")
## [1] "methods"
## 
## $aggregationPopularity
## function (x, na.rm = FALSE, dims = 1, ...) 
## standardGeneric("colSums")
## <bytecode: 0x0000000046a10438>
## <environment: 0x000000000886cc58>
## attr(,"generic")
## [1] "colSums"
## attr(,"generic")attr(,"package")
## [1] "base"
## attr(,"package")
## [1] "base"
## attr(,"group")
## list()
## attr(,"valueClass")
## character(0)
## attr(,"signature")
## [1] "x"     "na.rm" "dims" 
## attr(,"default")
## Method Definition (Class "derivedDefaultMethod"):
## 
## function (x, na.rm = FALSE, dims = 1, ...) 
## base::colSums(x, na.rm = na.rm, dims = dims, ...)
## <environment: 0x0000000008b84088>
## 
## Signatures:
##         x    
## target  "ANY"
## defined "ANY"
## attr(,"skeleton")
## (function (x, na.rm = FALSE, dims = 1, ...) 
## base::colSums(x, na.rm = na.rm, dims = dims, ...))(x, na.rm, 
##     dims, ...)
## attr(,"class")
## [1] "standardGeneric"
## attr(,"class")attr(,"package")
## [1] "methods"
## 
## $verbose
## [1] FALSE

# Create Top-1 list from recommendations
#bank.pred.15889<-predict(bank.rec, bank.recom["15889",], n=1)
#summary(list(bank.pred))
#as(bank.pred.15889, "list")