Data Exploration: Recommender System on Predicting Next Bank Products

Introduction

The client company, Santander bank, is originated from Spain and has branches from all over the world. Currently, their system doesn’t offer the right products to their customers, so that some customers get too many offers while others don’t have any. Their wish for the solution is “with a more effective recommendation system in place, Santander can better meet the individual needs of all customers and ensure their satisfaction no matter where they are in life” (Kaggle). Therefore, an algorithm or a model that is a result from my project will help Santander bank to predict which products their existing customers will use in the next month based on their past behavior.

Data Cleaning:

Train data has 13,647,309 observations and 48 different variables. First 24 variables are all customer related and last 24 variables are all bank products. Since Santander bank is originated from Spain, data variable names are all in Spanish.

dim(bank)

## [1] 13647309       48

names(bank)

##  [1] "fecha_dato"            "ncodpers"             
##  [3] "ind_empleado"          "pais_residencia"      
##  [5] "sexo"                  "age"                  
##  [7] "fecha_alta"            "ind_nuevo"            
##  [9] "antiguedad"            "indrel"               
## [11] "ult_fec_cli_1t"        "indrel_1mes"          
## [13] "tiprel_1mes"           "indresi"              
## [15] "indext"                "conyuemp"             
## [17] "canal_entrada"         "indfall"              
## [19] "tipodom"               "cod_prov"             
## [21] "nomprov"               "ind_actividad_cliente"
## [23] "renta"                 "segmento"             
## [25] "ind_ahor_fin_ult1"     "ind_aval_fin_ult1"    
## [27] "ind_cco_fin_ult1"      "ind_cder_fin_ult1"    
## [29] "ind_cno_fin_ult1"      "ind_ctju_fin_ult1"    
## [31] "ind_ctma_fin_ult1"     "ind_ctop_fin_ult1"    
## [33] "ind_ctpp_fin_ult1"     "ind_deco_fin_ult1"    
## [35] "ind_deme_fin_ult1"     "ind_dela_fin_ult1"    
## [37] "ind_ecue_fin_ult1"     "ind_fond_fin_ult1"    
## [39] "ind_hip_fin_ult1"      "ind_plan_fin_ult1"    
## [41] "ind_pres_fin_ult1"     "ind_reca_fin_ult1"    
## [43] "ind_tjcr_fin_ult1"     "ind_valo_fin_ult1"    
## [45] "ind_viv_fin_ult1"      "ind_nomina_ult1"      
## [47] "ind_nom_pens_ult1"     "ind_recibo_ult1"

1. Change Variable Names to English

There are 2 main types of variables: customer related variables and bank products. Since I need the results to be in Spanish, I leave bank products as is and changed the names of customer related variables only to make them readable and understandable.

Now let’s see what kind of data we’re dealing with.

summary(bank2)

##          date            custcode       emp_index       country        
##  2016-05-28: 931453   Min.   :  15889    :   27734   ES     :13553710  
##  2016-04-28: 928274   1st Qu.: 452813   A:    2492          :   27734  
##  2016-03-28: 925076   Median : 931893   B:    3566   FR     :    5161  
##  2016-02-28: 920904   Mean   : 834904   F:    2523   AR     :    4835  
##  2016-01-28: 916269   3rd Qu.:1199286   N:13610977   DE     :    4625  
##  2015-12-28: 912021   Max.   :1553689   S:      17   GB     :    4605  
##  (Other)   :8113312                                  (Other):   46639  
##  gender           age               first_date          new_cust    
##   :  27804    23    : 779884   2014-07-28:   57389   Min.   :0.00   
##  H:6195253    22    : 736314   2014-10-03:   54287   1st Qu.:0.00   
##  V:7424252    24    : 734785   2014-08-04:   45746   Median :0.00   
##               21    : 675988   2013-10-14:   40804   Mean   :0.06   
##               25    : 472016   2013-08-03:   33414   3rd Qu.:0.00   
##               20    : 422867             :   27734   Max.   :1.00   
##              (Other):9825455   (Other)   :13387935   NA's   :27734  
##      senior            primary            last_date       
##       12:  243160   Min.   : 1.000             :13622516  
##       21:  214795   1st Qu.: 1.000   2015-12-24:     763  
##       10:  206165   Median : 1.000   2015-12-28:     521  
##        9:  177957   Mean   : 1.178   2015-07-09:     443  
##       23:  177839   3rd Qu.: 1.000   2015-07-06:     405  
##       33:  174352   Max.   :99.000   2015-07-01:     401  
##  (Other):12453041   NA's   :27734    (Other)   :   22260  
##  cust_type_first   cust_relation_first resi_index   for_index   
##  1.0    :9133383    : 149781            :   27734    :   27734  
##  1      :4357298   A:6187123           N:   65864   N:12974839  
##         : 149781   I:7304875           S:13553711   S:  644736  
##  3.0    :   2780   N:      4                                    
##  3      :   1570   P:   4656                                    
##  P      :    874   R:    870                                    
##  (Other):   1623                                                
##  spouse_index    channel        deceased        add_type    
##   :13645501   KHE    :4055270    :   27734   Min.   :1      
##  N:    1791   KAT    :3268209   N:13584813   1st Qu.:1      
##  S:      17   KFC    :3098360   S:   34762   Median :1      
##               KHQ    : 591039                Mean   :1      
##               KFA    : 409669                3rd Qu.:1      
##               KHK    : 241084                Max.   :1      
##               (Other):1983678                NA's   :27735  
##     cod_prov          prov_name           active          income        
##  Min.   : 1.00   MADRID    :4409600   Min.   :0.000   Min.   :    1203  
##  1st Qu.:15.00   BARCELONA :1275219   1st Qu.:0.000   1st Qu.:   68711  
##  Median :28.00   VALENCIA  : 682304   Median :0.000   Median :  101850  
##  Mean   :26.57   SEVILLA   : 605164   Mean   :0.458   Mean   :  134254  
##  3rd Qu.:35.00   CORUÃA, A: 429322   3rd Qu.:1.000   3rd Qu.:  155956  
##  Max.   :52.00   MURCIA    : 396759   Max.   :1.000   Max.   :28894396  
##  NA's   :93591   (Other)   :5848941   NA's   :27734   NA's   :2794375   
##                segmento       ind_ahor_fin_ult1   ind_aval_fin_ult1 
##                    : 189368   Min.   :0.0000000   Min.   :0.00e+00  
##  01 - TOP          : 562142   1st Qu.:0.0000000   1st Qu.:0.00e+00  
##  02 - PARTICULARES :7960220   Median :0.0000000   Median :0.00e+00  
##  03 - UNIVERSITARIO:4935579   Mean   :0.0001023   Mean   :2.32e-05  
##                               3rd Qu.:0.0000000   3rd Qu.:0.00e+00  
##                               Max.   :1.0000000   Max.   :1.00e+00  
##                                                                     
##  ind_cco_fin_ult1 ind_cder_fin_ult1   ind_cno_fin_ult1  ind_ctju_fin_ult1 
##  Min.   :0.0000   Min.   :0.0000000   Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.0000   1st Qu.:0.0000000   1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :1.0000   Median :0.0000000   Median :0.00000   Median :0.000000  
##  Mean   :0.6555   Mean   :0.0003939   Mean   :0.08087   Mean   :0.009474  
##  3rd Qu.:1.0000   3rd Qu.:0.0000000   3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :1.0000   Max.   :1.0000000   Max.   :1.00000   Max.   :1.000000  
##                                                                           
##  ind_ctma_fin_ult1  ind_ctop_fin_ult1 ind_ctpp_fin_ult1 ind_deco_fin_ult1 
##  Min.   :0.000000   Min.   :0.000     Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.000000   1st Qu.:0.000     1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.000000   Median :0.000     Median :0.00000   Median :0.000000  
##  Mean   :0.009727   Mean   :0.129     Mean   :0.04331   Mean   :0.001779  
##  3rd Qu.:0.000000   3rd Qu.:0.000     3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :1.000000   Max.   :1.000     Max.   :1.00000   Max.   :1.000000  
##                                                                           
##  ind_deme_fin_ult1  ind_dela_fin_ult1 ind_ecue_fin_ult1 ind_fond_fin_ult1
##  Min.   :0.000000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.000000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.001661   Mean   :0.04297   Mean   :0.08274   Mean   :0.01849  
##  3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.000000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##                                                                          
##  ind_hip_fin_ult1   ind_plan_fin_ult1  ind_pres_fin_ult1 
##  Min.   :0.000000   Min.   :0.000000   Min.   :0.000000  
##  1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.000000  
##  Median :0.000000   Median :0.000000   Median :0.000000  
##  Mean   :0.005887   Mean   :0.009171   Mean   :0.002627  
##  3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.000000  
##  Max.   :1.000000   Max.   :1.000000   Max.   :1.000000  
##                                                          
##  ind_reca_fin_ult1 ind_tjcr_fin_ult1 ind_valo_fin_ult1 ind_viv_fin_ult1  
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.000000  
##  Mean   :0.05254   Mean   :0.04439   Mean   :0.02561   Mean   :0.003848  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.000000  
##                                                                          
##  ind_nomina_ult1 ind_nom_pens_ult1 ind_recibo_ult1 
##  Min.   :0.000   Min.   :0.000     Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:0.000     1st Qu.:0.0000  
##  Median :0.000   Median :0.000     Median :0.0000  
##  Mean   :0.055   Mean   :0.059     Mean   :0.1279  
##  3rd Qu.:0.000   3rd Qu.:0.000     3rd Qu.:0.0000  
##  Max.   :1.000   Max.   :1.000     Max.   :1.0000  
##  NA's   :16063   NA's   :16063

2. Missing data

I found there are consistent of 27,734 missing data in several variables and checking those accounts proved that there isn’t much information in those accounts, so I decided to delete them all together.

missing<-subset(bank2, bank2$emp_index=="")
bank_new<-subset(bank2, bank2$emp_index!="")

To check all the other missing values, I created bank_missing to visualize the missing values. Let’s quickly understand this. There are 80% values in the data set with no missing value. There are 20% missing values in Income, around 0.05% missing values in province code and so on.

library(VIM)
bank_missing<-cbind(bank_new$add_type, bank_new$cod_prov, bank_new$income, bank_new[,46], bank_new[,47])
aggr(bank_missing, col=c('#ffffb2','#fecc5c'),
     numbers=TRUE, sortVars=TRUE,
     labels=names(bank_missing), cex.axis=.7,
     gap=3, ylab=c("Missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##  Variable        Count
##        V3 2.031371e-01
##        V2 4.835467e-03
##        V4 1.593295e-05
##        V5 1.593295e-05
##        V1 7.342373e-08

Using Hmisc package to impute missing values of income with its statistical mean value.

# impute with mean value
library(Hmisc)
bank_new$income <- with(bank_new, impute(income, mean))

Also, I replaced the missing values of payroll and pension with 0 since it represents no ownership of the products.

bank_new$ind_nomina_ult1[is.na(bank_new$ind_nomina_ult1)] <- 0
bank_new$ind_nom_pens_ult1[is.na(bank_new$ind_nom_pens_ult1)] <- 0

Data Exploration

1. Exploring Customer Age

Initial histogram revealed that age distribution has two distinct spikes.

library(ggplot2)
bank_new$age<-as.numeric(as.character(bank_new$age))
ggplot(bank_new, aes(x=age)) +
  stat_count(width=1, position="stack") +
  ggtitle("Age Histogram")

Later histogram reveals that younger group belongs to college graduates and older group belongs to individual group of customers.

bank_new$age<-as.numeric(as.character(bank_new$age))
library(ggthemes)
library(ggplot2)
bank_new$segmento <- factor(bank_new$segmento, labels = c("Other", "VIP", "Individuals", "Graduates"))
ggplot(bank_new, aes(x=age, fill=factor(segmento))) +
  geom_bar() +
  facet_grid(".~segmento") + 
  scale_fill_manual("Customer Segmentation", values=c("#ffffb2", "#fecc5c", "#fd8d3c", "#e31a1c")) +
  theme_classic() +
  theme(legend.position="bottom") +
  scale_y_continuous("Frequency")

When you look at the average age of customers first open their account within time line, younger people more likely to open an account at Santander bank since 2010.

Changing Age Distribution Over Years

2. Exploring Customer Locations

Country table shows that Santander customers are from all over the world.

table(bank_new$country)

## 
##                AD       AE       AL       AO       AR       AT       AU 
##        0      111      221       17       68     4835      476      424 
##       BA       BE       BG       BM       BO       BR       BY       BZ 
##       34     1526      476        6     1514     2351      102       17 
##       CA       CD       CF       CG       CH       CI       CL       CM 
##      446       17       17       34     1995       51      989       85 
##       CN       CO       CR       CU       CZ       DE       DJ       DK 
##      563     3526      147      758      102     4625       11      226 
##       DO       DZ       EC       EE       EG       ES       ET       FI 
##      424       86     2169       45       68 13553710       34      345 
##       FR       GA       GB       GE       GH       GI       GM       GN 
##     5161       51     4605       17       17       17       17       51 
##       GQ       GR       GT       GW       HK       HN       HR       HU 
##      119      243      130       34       51      282       68       37 
##       IE       IL       IN       IS       IT       JM       JP       KE 
##      409      413      187       17     2947       11      239       72 
##       KH       KR       KW       KZ       LB       LT       LU       LV 
##       17       96       17       17       17       45      124       17 
##       LY       MA       MD       MK       ML       MM       MR       MT 
##       17      396       96       51       17       17       51        2 
##       MX       MZ       NG       NI       NL       NO       NZ       OM 
##     2573       34      214       62      757      136       51       22 
##       PA       PE       PH       PK       PL       PR       PT       PY 
##       77      900       34       85      599      101     1419     1430 
##       QA       RO       RS       RU       SA       SE       SG       SK 
##       52     2931       34      769       79      603      117       85 
##       SL       SN       SV       TG       TH       TN       TR       TW 
##       17       68      102       17      102       17       62       34 
##       UA       US       UY       VE       VN       ZA       ZW 
##      493     3651      510     2331       34      119       11

Let’s see it in the map:

99.31% of customers are from ‘Spain’

round(prop.table(table(bank2$country=="ES"))*100,2)

## 
## FALSE  TRUE 
##  0.69 99.31

3. Exploring Customer Income

Income distribution shows that there are many extreme values up to 30 million. It’s interesting to see that VIP customers do not have the highest income distribution.

4. Exploring Bank Products

First I explored if there is any correlations between bank products. To do this I had to extract only products information from the dataset.

Now, visualize to see any correlation

library(corrplot)
product_corr <- cor(product)
corrplot(product_corr, method="square")

See actual correlation values. Pension_last and payroll products are the ones are highly correlated. Since I don’t have detailed description of each product, I cannot assume the reasons why they’re highly correlated.

layout(matrix(1:1, ncol = 3))
corrplot(product_corr, method="number", tl.cex = 1)

layout(1)

Let’s see how many total products customers have from Santander bank.

More than 2 million customers do not have a single product, and almost half of the customers own only one product. At most, there are only 15 products that one person holds at the same time.

bank_new$totalproducts<-rowSums(bank_new[,25:48], na.rm = TRUE)
table(bank_new$totalproducts)

## 
##       0       1       2       3       4       5       6       7       8 
## 2550371 7150086 1928371  770354  449867  289819  210919  141468   76138 
##       9      10      11      12      13      14      15 
##   33768   12682    4343    1124     230      26       9

barplot(table(bank_new$totalproducts), xlab="Number of Products", las=3)

Now let’s see which product is most popular.

The current account is the most popular product as it has 8,938,150 account holders.

tbl<-colSums(bank_new[,25:48])
names(tbl)<-c("savings", "guarantee", "current", "derivada", "payroll_acc",
              "junior", "mas particular", "particular", "particular Plus", "short_dep",
              "med-dep", "long_dep", "e_account", "funds", "mortgage", "pensions",
              "loans", "taxes", "creditcard", "securities", "home", "payroll", "pension_last",
              "direct_debit")
barplot(tbl)

More transformation: 1. Deleted all users that do not own any products 2. Only selected top 8 products that users own to simplify

bank_clean<-subset(bank_new, bank_new$totalproducts>0)
dim(bank_clean)

## [1] 11069204       49

#Checking top 8 products
prodname<-head(sort(tbl,decreasing=TRUE), n = 8)
#Only choosing top 8 products
# Dropped payroll
bank_last<-cbind(bank_clean[,1:24], bank_clean[,27], bank_clean[,29], bank_clean[,32], bank_clean[,37], bank_clean[,42:43], bank_clean[,47:48])

dim(bank_last)

## [1] 11069204       32

RECOMMENDATION SYSTEM

1. Transforming dataset for analysis

# Select only active users
bank_active<-subset(bank_last, bank_last$active==1)
dim(bank_active)

## [1] 6215578      32

# Select users in May 28th, 2016, month before prediction
bank_active.5.28.16<-subset(bank_active, bank_active$date=="2016-05-28")
bank.matrix<-bank_active.5.28.16
library(arules)
bank.matrix$custcode <- as.character(bank.matrix$custcode)
bank.matrix$age <- discretize(bank.matrix$age, method = "frequency", 5)
bank.matrix$new_cust <- as.logical(bank.matrix$new_cust)
bank.matrix$primary <- as.factor(bank.matrix$primary)
bank.matrix$add_type<-NULL
bank.matrix$cod_prov<-as.factor(bank.matrix$cod_prov)
bank.matrix$income <- discretize(bank.matrix$income, method = "frequency", 5)
bank.matrix$active <- NULL
bank.matrix$first_date<- as.factor(bank.matrix$first_date)
# Select only non-employee users
bank.matrix<-subset(bank.matrix, bank.matrix$emp_index=="N")
bank.matrix$emp_index<-NULL
bank.matrix$date<-NULL
dim(bank.matrix)

## [1] 393636     28

2. Using Collaborative Filtering for Recommendation

A. Item-based Collaborative Filtering

#Choosing only bank products for item-to-item collaborative filtering
# I don't need user information, only products
bank.recom<-bank.matrix[,21:28]

prodnam8 <- c("current", "payroll_acc",
               "particular", "e_account",  
               "taxes", "creditcard", "pension_last",
               "direct_debit")
names(bank.recom)<- prodnam8
summary(bank.recom)

##     current        payroll_acc       particular       e_account    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000  
##  Median :1.0000   Median :0.0000   Median :0.0000   Median :0.000  
##  Mean   :0.7418   Mean   :0.1825   Mean   :0.1711   Mean   :0.179  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.000  
##      taxes          creditcard       pension_last     direct_debit   
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.00000   Median :0.0000   Median :0.0000  
##  Mean   :0.1073   Mean   :0.08783   Mean   :0.1338   Mean   :0.2857  
##  3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :1.0000

# Create a helper function to calculate the cosine between two vectors
  getCosine <- function(x,y) 
  {
    this.cosine <- sum(x*y) / (sqrt(sum(x*x)) * sqrt(sum(y*y)))
    return(this.cosine)
  }

# Create item to item dataframe
holder <- matrix(NA, nrow=ncol(bank.recom),ncol=ncol(bank.recom),dimnames=list(colnames(bank.recom),colnames(bank.recom)))

bank.recom.similarity <- as.data.frame(holder)

#Lets fill in those empty spaces with cosine similarities
for(i in 1:ncol(bank.recom)) {
  for(j in 1:ncol(bank.recom)) {
    bank.recom.similarity[i,j]= getCosine(bank.recom[i],bank.recom[j])
  }
}
 # Back to dataframe
bank.recom.similarity <- as.data.frame(bank.recom.similarity)
head(bank.recom.similarity)

##                current payroll_acc particular e_account     taxes
## current     1.00000000  0.04887032  0.3687681 0.2747616 0.1902221
## payroll_acc 0.04887032  1.00000000  0.1391024 0.3674849 0.3687115
## particular  0.36876805  0.13910241  1.0000000 0.1153420 0.1672927
## e_account   0.27476158  0.36748492  0.1153420 1.0000000 0.2688076
## taxes       0.19022212  0.36871153  0.1672927 0.2688076 1.0000000
## creditcard  0.15297901  0.39759523  0.1561424 0.2961349 0.3148097
##             creditcard pension_last direct_debit
## current      0.1529790   0.06127241    0.3545048
## payroll_acc  0.3975952   0.79755111    0.5742283
## particular   0.1561424   0.13023025    0.1897989
## e_account    0.2961349   0.33744902    0.3753149
## taxes        0.3148097   0.33478083    0.4053543
## creditcard   1.0000000   0.38736088    0.4006921

# Get the top 2 neighbours for each product
# I only need to recommend 1 product for each user for the next month

bank.neighbours <- matrix(NA, nrow=ncol(bank.recom.similarity),ncol=2,dimnames=list(colnames(bank.recom.similarity)))

# Then I need to find the neighbours. This is another loop but runs much faster.
for(i in 1:ncol(bank.recom)) 
{
  bank.neighbours[i,] <- (t(head(n=2,rownames(bank.recom.similarity[order(bank.recom.similarity[,i],decreasing=TRUE),][i]))))
}
# First column is just identical of the product, so ignore it
head(bank.neighbours, n=8)

##              [,1]           [,2]          
## current      "current"      "particular"  
## payroll_acc  "payroll_acc"  "pension_last"
## particular   "particular"   "current"     
## e_account    "e_account"    "direct_debit"
## taxes        "taxes"        "direct_debit"
## creditcard   "creditcard"   "direct_debit"
## pension_last "pension_last" "payroll_acc" 
## direct_debit "direct_debit" "payroll_acc"

str(bank.neighbours)

##  chr [1:8, 1:2] "current" "payroll_acc" "particular" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:8] "current" "payroll_acc" "particular" "e_account" ...
##   ..$ : NULL

# Recommended products for each user
bank.final<-cbind(bank.matrix[,1],bank.neighbours[,2])

#Let's see recommended products for first 10 users
colnames(bank.final)<-c("ncodpers", "added_products")
head(bank.final, n=10)

##       ncodpers added_products
##  [1,] "657788" "particular"  
##  [2,] "657795" "pension_last"
##  [3,] "657790" "current"     
##  [4,] "657794" "direct_debit"
##  [5,] "657789" "direct_debit"
##  [6,] "657781" "direct_debit"
##  [7,] "657779" "payroll_acc" 
##  [8,] "657770" "payroll_acc" 
##  [9,] "657764" "particular"  
## [10,] "657760" "pension_last"

#Save the results to a file for submission
save(bank.final, file="bank.final.submission.csv")

B. Using Recommenderlab Package for User-Based CF approach

library(Matrix)
library(reshape2)
library(recommenderlab)
#User-based Collaborative  Filtering


# This time, I need user code and product ratings
bank.recomd<-cbind(bank.matrix[,1], bank.matrix[21:28])

prodnam9 <- c("user", "current", "payroll_acc",
              "particular", "e_account",  
              "taxes", "payroll", "pension_last",
              "direct_debit")
names(bank.recomd)<- prodnam9

#For demonstration purpose, I only selected first 5000 users since the large dataset
#makes challenging to use user-based collaborative filtering
bank.recomd<-bank.recomd[1:5000,]


bank.recomd$user <- as.numeric(factor(bank.recomd$user))

bank.reshaped <- transform(melt(bank.recomd, id.var='user', na.rm=TRUE), 
                variable=match(variable, names(bank.recomd)[-1]))
head(bank.reshaped)

##   user variable value
## 1 1973        1     1
## 2 1977        1     1
## 3 1975        1     0
## 4 1976        1     1
## 5 1974        1     1
## 6 1971        1     0

#Change dataset to SparseMatrix
sparse_ratings<-with(bank.reshaped, sparseMatrix(user, variable, x=value)) 

#recommenderlab works with realRatingMatrix object, which is created from sparse matrix
real_ratings <- new("realRatingMatrix", data = sparse_ratings)

head(as(real_ratings, "matrix"))

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    1    0    0    0    0    0    0    0
## [2,]    1    0    0    0    0    0    0    1
## [3,]    1    0    0    0    0    0    0    1
## [4,]    1    0    1    0    1    0    0    0
## [5,]    0    1    1    0    1    1    1    1
## [6,]    1    0    0    0    0    0    0    0

#Create a recommender model based on user ratings

bank.rec<-Recommender(real_ratings, method="UBCF", param=list(normalize = "center"))
getModel(bank.rec)

## $description
## [1] "UBCF-Real data: contains full or sample of data set"
## 
## $data
## 5000 x 8 rating matrix of class 'realRatingMatrix' with 40000 ratings.
## Normalized using center on rows.
## 
## $method
## [1] "cosine"
## 
## $nn
## [1] 25
## 
## $sample
## [1] FALSE
## 
## $normalize
## [1] "center"
## 
## $verbose
## [1] FALSE

# 
set.seed(1)
e <- evaluationScheme(real_ratings, method="split", train=0.8, given=3)

#Predicting recommendation
bank.pred<-predict(bank.rec, getData(e, "known"), type="ratings", n=1)
head(as(bank.pred, "matrix"), n=10)

##            [,1]      [,2]      [,3]      [,4]      [,5]      [,6]
##  [1,]        NA        NA 0.2083333        NA 0.2083333 0.2083333
##  [2,]        NA 0.2083333 0.2083333        NA        NA 0.2083333
##  [3,] 0.2083333        NA        NA 0.2083333 0.2083333 0.2083333
##  [4,] 0.2083333        NA 0.2083333 0.2083333 0.2083333        NA
##  [5,] 0.0000000        NA        NA 0.0000000 0.0000000 0.0000000
##  [6,]        NA        NA 0.2083333        NA 0.2083333 0.2083333
##  [7,] 0.0000000 0.0000000        NA 0.0000000 0.0000000        NA
##  [8,]        NA 0.2083333 0.2083333        NA        NA 0.2083333
##  [9,] 0.0000000 0.0000000        NA 0.0000000 0.0000000        NA
## [10,]        NA        NA 0.2083333        NA 0.2083333 0.2083333
##            [,7]      [,8]
##  [1,] 0.2083333 0.2083333
##  [2,] 0.2083333 0.2083333
##  [3,]        NA 0.2083333
##  [4,]        NA 0.2083333
##  [5,] 0.0000000        NA
##  [6,] 0.2083333 0.2083333
##  [7,] 0.0000000        NA
##  [8,] 0.2083333 0.2083333
##  [9,] 0.0000000        NA
## [10,] 0.2083333 0.2083333

# Calculating RMSE
RMSE.ubcf <- calcPredictionAccuracy(bank.pred, getData(e, "unknown"))[1]
RMSE.ubcf

##    RMSE 
## 0.45487