The client, Santander Bank, originated in Spain and has branches all over the world. Its current system does not match products to customers well: some customers receive too many offers while others receive none. Santander’s stated goal is that “with a more effective recommendation system in place, Santander can better meet the individual needs of all customers and ensure their satisfaction no matter where they are in life” (Kaggle). The model produced by this project is therefore meant to help Santander predict which products its existing customers will use in the next month, based on their past behavior.
The training data has 13,647,309 observations and 48 variables. The first 24 variables describe the customer and the last 24 are bank products. Since Santander is a Spanish bank, the variable names are all in Spanish.
dim(bank)
## [1] 13647309 48
names(bank)
## [1] "fecha_dato" "ncodpers"
## [3] "ind_empleado" "pais_residencia"
## [5] "sexo" "age"
## [7] "fecha_alta" "ind_nuevo"
## [9] "antiguedad" "indrel"
## [11] "ult_fec_cli_1t" "indrel_1mes"
## [13] "tiprel_1mes" "indresi"
## [15] "indext" "conyuemp"
## [17] "canal_entrada" "indfall"
## [19] "tipodom" "cod_prov"
## [21] "nomprov" "ind_actividad_cliente"
## [23] "renta" "segmento"
## [25] "ind_ahor_fin_ult1" "ind_aval_fin_ult1"
## [27] "ind_cco_fin_ult1" "ind_cder_fin_ult1"
## [29] "ind_cno_fin_ult1" "ind_ctju_fin_ult1"
## [31] "ind_ctma_fin_ult1" "ind_ctop_fin_ult1"
## [33] "ind_ctpp_fin_ult1" "ind_deco_fin_ult1"
## [35] "ind_deme_fin_ult1" "ind_dela_fin_ult1"
## [37] "ind_ecue_fin_ult1" "ind_fond_fin_ult1"
## [39] "ind_hip_fin_ult1" "ind_plan_fin_ult1"
## [41] "ind_pres_fin_ult1" "ind_reca_fin_ult1"
## [43] "ind_tjcr_fin_ult1" "ind_valo_fin_ult1"
## [45] "ind_viv_fin_ult1" "ind_nomina_ult1"
## [47] "ind_nom_pens_ult1" "ind_recibo_ult1"
There are two main types of variables: customer-related variables and bank products. Since I need the results to be in Spanish, I left the bank product names as they are and renamed only the customer-related variables to make them readable.
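The renaming step itself is not shown. As a minimal sketch, assuming the new names are the ones that appear in the summary output further down and that they map onto the first 24 columns in order, it could look like this. The data frame `toy` is a hypothetical stand-in for `bank`.

```r
# English names as they appear in the summary(bank2) output (assumed column order)
new_names <- c("date", "custcode", "emp_index", "country", "gender", "age",
               "first_date", "new_cust", "senior", "primary", "last_date",
               "cust_type_first", "cust_relation_first", "resi_index",
               "for_index", "spouse_index", "channel", "deceased", "add_type",
               "cod_prov", "prov_name", "active", "income", "segmento")

# Hypothetical stand-in for `bank`: 24 customer columns plus 2 product columns
toy <- as.data.frame(matrix(0, nrow = 1, ncol = 26))
names(toy)[25:26] <- c("ind_ahor_fin_ult1", "ind_aval_fin_ult1")

# Rename only the customer columns; product columns keep their Spanish names
toy2 <- toy
names(toy2)[1:24] <- new_names
names(toy2)
```

Renaming by position is fragile if the column order ever changes, so a named lookup from old to new names would be a safer choice on the real data.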
Now let’s see what kind of data we’re dealing with.
summary(bank2)
## date custcode emp_index country
## 2016-05-28: 931453 Min. : 15889 : 27734 ES :13553710
## 2016-04-28: 928274 1st Qu.: 452813 A: 2492 : 27734
## 2016-03-28: 925076 Median : 931893 B: 3566 FR : 5161
## 2016-02-28: 920904 Mean : 834904 F: 2523 AR : 4835
## 2016-01-28: 916269 3rd Qu.:1199286 N:13610977 DE : 4625
## 2015-12-28: 912021 Max. :1553689 S: 17 GB : 4605
## (Other) :8113312 (Other): 46639
## gender age first_date new_cust
## : 27804 23 : 779884 2014-07-28: 57389 Min. :0.00
## H:6195253 22 : 736314 2014-10-03: 54287 1st Qu.:0.00
## V:7424252 24 : 734785 2014-08-04: 45746 Median :0.00
## 21 : 675988 2013-10-14: 40804 Mean :0.06
## 25 : 472016 2013-08-03: 33414 3rd Qu.:0.00
## 20 : 422867 : 27734 Max. :1.00
## (Other):9825455 (Other) :13387935 NA's :27734
## senior primary last_date
## 12: 243160 Min. : 1.000 :13622516
## 21: 214795 1st Qu.: 1.000 2015-12-24: 763
## 10: 206165 Median : 1.000 2015-12-28: 521
## 9: 177957 Mean : 1.178 2015-07-09: 443
## 23: 177839 3rd Qu.: 1.000 2015-07-06: 405
## 33: 174352 Max. :99.000 2015-07-01: 401
## (Other):12453041 NA's :27734 (Other) : 22260
## cust_type_first cust_relation_first resi_index for_index
## 1.0 :9133383 : 149781 : 27734 : 27734
## 1 :4357298 A:6187123 N: 65864 N:12974839
## : 149781 I:7304875 S:13553711 S: 644736
## 3.0 : 2780 N: 4
## 3 : 1570 P: 4656
## P : 874 R: 870
## (Other): 1623
## spouse_index channel deceased add_type
## :13645501 KHE :4055270 : 27734 Min. :1
## N: 1791 KAT :3268209 N:13584813 1st Qu.:1
## S: 17 KFC :3098360 S: 34762 Median :1
## KHQ : 591039 Mean :1
## KFA : 409669 3rd Qu.:1
## KHK : 241084 Max. :1
## (Other):1983678 NA's :27735
## cod_prov prov_name active income
## Min. : 1.00 MADRID :4409600 Min. :0.000 Min. : 1203
## 1st Qu.:15.00 BARCELONA :1275219 1st Qu.:0.000 1st Qu.: 68711
## Median :28.00 VALENCIA : 682304 Median :0.000 Median : 101850
## Mean :26.57 SEVILLA : 605164 Mean :0.458 Mean : 134254
## 3rd Qu.:35.00 CORUÑA, A: 429322 3rd Qu.:1.000 3rd Qu.: 155956
## Max. :52.00 MURCIA : 396759 Max. :1.000 Max. :28894396
## NA's :93591 (Other) :5848941 NA's :27734 NA's :2794375
## segmento ind_ahor_fin_ult1 ind_aval_fin_ult1
## : 189368 Min. :0.0000000 Min. :0.00e+00
## 01 - TOP : 562142 1st Qu.:0.0000000 1st Qu.:0.00e+00
## 02 - PARTICULARES :7960220 Median :0.0000000 Median :0.00e+00
## 03 - UNIVERSITARIO:4935579 Mean :0.0001023 Mean :2.32e-05
## 3rd Qu.:0.0000000 3rd Qu.:0.00e+00
## Max. :1.0000000 Max. :1.00e+00
##
## ind_cco_fin_ult1 ind_cder_fin_ult1 ind_cno_fin_ult1 ind_ctju_fin_ult1
## Min. :0.0000 Min. :0.0000000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.0000 1st Qu.:0.0000000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :1.0000 Median :0.0000000 Median :0.00000 Median :0.000000
## Mean :0.6555 Mean :0.0003939 Mean :0.08087 Mean :0.009474
## 3rd Qu.:1.0000 3rd Qu.:0.0000000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.0000 Max. :1.0000000 Max. :1.00000 Max. :1.000000
##
## ind_ctma_fin_ult1 ind_ctop_fin_ult1 ind_ctpp_fin_ult1 ind_deco_fin_ult1
## Min. :0.000000 Min. :0.000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.000000 Median :0.000 Median :0.00000 Median :0.000000
## Mean :0.009727 Mean :0.129 Mean :0.04331 Mean :0.001779
## 3rd Qu.:0.000000 3rd Qu.:0.000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.000 Max. :1.00000 Max. :1.000000
##
## ind_deme_fin_ult1 ind_dela_fin_ult1 ind_ecue_fin_ult1 ind_fond_fin_ult1
## Min. :0.000000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.000000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.001661 Mean :0.04297 Mean :0.08274 Mean :0.01849
## 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.00000 Max. :1.00000 Max. :1.00000
##
## ind_hip_fin_ult1 ind_plan_fin_ult1 ind_pres_fin_ult1
## Min. :0.000000 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.000000 Median :0.000000 Median :0.000000
## Mean :0.005887 Mean :0.009171 Mean :0.002627
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.000000 Max. :1.000000
##
## ind_reca_fin_ult1 ind_tjcr_fin_ult1 ind_valo_fin_ult1 ind_viv_fin_ult1
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.000000
## Mean :0.05254 Mean :0.04439 Mean :0.02561 Mean :0.003848
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.000000
##
## ind_nomina_ult1 ind_nom_pens_ult1 ind_recibo_ult1
## Min. :0.000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.0000
## Median :0.000 Median :0.000 Median :0.0000
## Mean :0.055 Mean :0.059 Mean :0.1279
## 3rd Qu.:0.000 3rd Qu.:0.000 3rd Qu.:0.0000
## Max. :1.000 Max. :1.000 Max. :1.0000
## NA's :16063 NA's :16063
I found that the same 27,734 records have missing values in several variables. Checking those accounts showed that they carry very little information, so I deleted them altogether.
missing<-subset(bank2, bank2$emp_index=="")
bank_new<-subset(bank2, bank2$emp_index!="")
To check the remaining missing values, I created bank_missing and visualized it. About 80% of the rows have no missing value at all. About 20% of the values are missing in income, around 0.5% in province code, and so on. Because bank_missing is built with cbind, its columns are unnamed: in the output below, V1 to V5 stand for add_type, cod_prov, income, payroll (ind_nomina_ult1) and pension_last (ind_nom_pens_ult1).
library(VIM)
bank_missing<-cbind(bank_new$add_type, bank_new$cod_prov, bank_new$income, bank_new[,46], bank_new[,47])
aggr(bank_missing, col=c('#ffffb2','#fecc5c'),
numbers=TRUE, sortVars=TRUE,
labels=names(bank_missing), cex.axis=.7,
gap=3, ylab=c("Missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## V3 2.031371e-01
## V2 4.835467e-03
## V4 1.593295e-05
## V5 1.593295e-05
## V1 7.342373e-08
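Note that the "Count" column in this output is a proportion of rows, not a raw count. The same kind of figure can be cross-checked with base R. The sketch below uses a hypothetical toy data frame, not the bank data.

```r
# Hypothetical toy data with missing values in some columns
toy <- data.frame(income   = c(100, NA, 300, NA, 500),
                  cod_prov = c(28, 8, NA, 46, 41),
                  add_type = c(1, 1, 1, 1, 1))

# Share of missing values per column, sorted like aggr(sortVars = TRUE)
miss_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
miss_share

# Share of rows with no missing value at all
complete_share <- mean(complete.cases(toy))
complete_share
```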
I used the Hmisc package to impute the missing values of income with the mean.
# impute with mean value
library(Hmisc)
bank_new$income <- with(bank_new, impute(income, mean))
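To make the effect of this step concrete, here is a base R sketch of mean imputation on a hypothetical income vector. The summary above shows income is strongly right-skewed (mean 134,254 against a median of 101,850), so median imputation is included purely for comparison. It is not what the analysis used.

```r
# Hypothetical income vector with missing values and one extreme value
income <- c(50000, 60000, NA, 70000, 1000000, NA)

# Mean imputation: the effect of impute(income, mean), written in base R
income_mean <- ifelse(is.na(income), mean(income, na.rm = TRUE), income)

# Median imputation, for comparison only
income_median <- ifelse(is.na(income), median(income, na.rm = TRUE), income)

income_mean[3]    # filled with the mean of the observed values
income_median[3]  # filled with the median of the observed values
```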
I also replaced the missing values of payroll and pension with 0, since 0 represents no ownership of the product.
bank_new$ind_nomina_ult1[is.na(bank_new$ind_nomina_ult1)] <- 0
bank_new$ind_nom_pens_ult1[is.na(bank_new$ind_nom_pens_ult1)] <- 0
An initial histogram revealed that the age distribution has two distinct peaks.
library(ggplot2)
bank_new$age<-as.numeric(as.character(bank_new$age))
ggplot(bank_new, aes(x=age)) +
stat_count(width=1, position="stack") +
ggtitle("Age Histogram")
A second histogram, split by segment, shows that the younger peak consists mostly of college graduates and the older peak of individual customers.
bank_new$age<-as.numeric(as.character(bank_new$age))
library(ggthemes)
library(ggplot2)
bank_new$segmento <- factor(bank_new$segmento, labels = c("Other", "VIP", "Individuals", "Graduates"))
ggplot(bank_new, aes(x=age, fill=factor(segmento))) +
geom_bar() +
facet_grid(".~segmento") +
scale_fill_manual("Customer Segmentation", values=c("#ffffb2", "#fecc5c", "#fd8d3c", "#e31a1c")) +
theme_classic() +
theme(legend.position="bottom") +
scale_y_continuous("Frequency")
Looking at the average age of customers by the date they first opened an account, younger people have been more likely to open an account at Santander since 2010.
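The code behind this observation is not shown. One way to compute it, sketched here on hypothetical toy data and assuming first_date is stored as a "YYYY-MM-DD" string, is to aggregate age by opening year:

```r
# Hypothetical toy data: account opening date and customer age
toy <- data.frame(first_date = c("2005-03-01", "2005-07-15", "2012-09-01",
                                 "2012-10-03", "2014-07-28"),
                  age = c(48, 52, 24, 22, 21),
                  stringsAsFactors = FALSE)

# Extract the opening year from the date string
toy$open_year <- as.integer(substr(toy$first_date, 1, 4))

# Mean age per opening year
age_by_year <- aggregate(age ~ open_year, data = toy, FUN = mean)
age_by_year
```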
The country table shows that Santander customers come from all over the world, although more than 99% of the records are from Spain (ES).
table(bank_new$country)
##
## AD AE AL AO AR AT AU
## 0 111 221 17 68 4835 476 424
## BA BE BG BM BO BR BY BZ
## 34 1526 476 6 1514 2351 102 17
## CA CD CF CG CH CI CL CM
## 446 17 17 34 1995 51 989 85
## CN CO CR CU CZ DE DJ DK
## 563 3526 147 758 102 4625 11 226
## DO DZ EC EE EG ES ET FI
## 424 86 2169 45 68 13553710 34 345
## FR GA GB GE GH GI GM GN
## 5161 51 4605 17 17 17 17 51
## GQ GR GT GW HK HN HR HU
## 119 243 130 34 51 282 68 37
## IE IL IN IS IT JM JP KE
## 409 413 187 17 2947 11 239 72
## KH KR KW KZ LB LT LU LV
## 17 96 17 17 17 45 124 17
## LY MA MD MK ML MM MR MT
## 17 396 96 51 17 17 51 2
## MX MZ NG NI NL NO NZ OM
## 2573 34 214 62 757 136 51 22
## PA PE PH PK PL PR PT PY
## 77 900 34 85 599 101 1419 1430
## QA RO RS RU SA SE SG SK
## 52 2931 34 769 79 603 117 85
## SL SN SV TG TH TN TR TW
## 17 68 102 17 102 17 62 34
## UA US UY VE VN ZA ZW
## 493 3651 510 2331 34 119 11
round(prop.table(table(bank2$country=="ES"))*100,2)
##
## FALSE TRUE
## 0.69 99.31
The income distribution has many extreme values, up to about 29 million. Interestingly, VIP customers do not have the highest income distribution.
First, I explored whether there are any correlations between bank products. To do this, I had to extract only the product information from the dataset.
Now, visualize the correlations.
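The extraction step is not shown. A minimal sketch, assuming `product` is simply the 24 product indicator columns (columns 25 to 48 of bank_new), is given below on a hypothetical toy data frame. Selecting by name pattern rather than by position is an illustrative choice.

```r
# Hypothetical stand-in for bank_new: 2 customer columns, then 3 product columns
toy <- data.frame(custcode = c(1, 2, 3, 4),
                  age = c(23, 45, 31, 60),
                  ind_cco_fin_ult1 = c(1, 1, 0, 1),
                  ind_nomina_ult1 = c(0, 1, 0, 1),
                  ind_nom_pens_ult1 = c(0, 1, 0, 1))

# Keep only the product indicator columns (columns 25:48 in the real data)
product_toy <- toy[, grepl("^ind_.*_ult1$", names(toy))]
names(product_toy)
```

Note that cor() returns NA for any column that still contains missing values, so the payroll and pension NAs need to be replaced with 0 before this step, as was done above.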
library(corrplot)
product_corr <- cor(product)
corrplot(product_corr, method="square")
The next plot shows the actual correlation values. The pension_last and payroll products are the most highly correlated pair. Since I don’t have a detailed description of each product, I cannot say why they are so highly correlated.
layout(matrix(1:1, ncol = 3))
corrplot(product_corr, method="number", tl.cex = 1)
layout(1)
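Reading the strongest pairs off a 24 by 24 plot is error-prone, so it can help to list the pairs sorted by correlation. The sketch below does this on a hypothetical toy matrix. The same steps should apply to product_corr.

```r
# Hypothetical toy product matrix (rows = customers, columns = products)
toy <- data.frame(payroll      = c(1, 0, 1, 0, 1, 0),
                  pension_last = c(1, 0, 1, 0, 1, 1),
                  current      = c(1, 1, 1, 0, 1, 1))
toy_corr <- cor(toy)

# Keep each pair once by taking the upper triangle only
idx <- which(upper.tri(toy_corr), arr.ind = TRUE)
pair_tbl <- data.frame(a = rownames(toy_corr)[idx[, "row"]],
                       b = colnames(toy_corr)[idx[, "col"]],
                       r = toy_corr[idx],
                       stringsAsFactors = FALSE)

# Strongest absolute correlations first
pair_tbl <- pair_tbl[order(-abs(pair_tbl$r)), ]
pair_tbl
```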
About 2.55 million customer records show no product at all, and just over half of the records (7.15 million) show exactly one product. No customer holds more than 15 products at the same time.
bank_new$totalproducts<-rowSums(bank_new[,25:48], na.rm = TRUE)
table(bank_new$totalproducts)
##
## 0 1 2 3 4 5 6 7 8
## 2550371 7150086 1928371 770354 449867 289819 210919 141468 76138
## 9 10 11 12 13 14 15
## 33768 12682 4343 1124 230 26 9
barplot(table(bank_new$totalproducts), xlab="Number of Products", las=3)
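The shares behind these statements can be computed directly from the counts. The sketch below copies the counts from the table output above, so it runs without the full dataset.

```r
# Counts copied from the table(bank_new$totalproducts) output above
counts <- c("0" = 2550371, "1" = 7150086, "2" = 1928371, "3" = 770354,
            "4" = 449867, "5" = 289819, "6" = 210919, "7" = 141468,
            "8" = 76138, "9" = 33768, "10" = 12682, "11" = 4343,
            "12" = 1124, "13" = 230, "14" = 26, "15" = 9)

# Percentage of records for each number of products held
shares <- round(100 * counts / sum(counts), 2)
shares[c("0", "1")]
```

On the full data, `round(100 * prop.table(table(bank_new$totalproducts)), 2)` gives the same result in one step.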
The current account is the most popular product, with 8,938,150 account holders.
tbl<-colSums(bank_new[,25:48])
names(tbl)<-c("savings", "guarantee", "current", "derivada", "payroll_acc",
"junior", "mas particular", "particular", "particular Plus", "short_dep",
"med-dep", "long_dep", "e_account", "funds", "mortgage", "pensions",
"loans", "taxes", "creditcard", "securities", "home", "payroll", "pension_last",
"direct_debit")
barplot(tbl)
More transformations:
1. Deleted all users that do not own any product.
2. Selected only the top 8 products that users own, to simplify the problem.
bank_clean<-subset(bank_new, bank_new$totalproducts>0)
dim(bank_clean)
## [1] 11069204 49
#Checking top 8 products
prodname<-head(sort(tbl,decreasing=TRUE), n = 8)
#Only choosing top 8 products
# Dropped payroll
bank_last<-cbind(bank_clean[,1:24], bank_clean[,27], bank_clean[,29], bank_clean[,32], bank_clean[,37], bank_clean[,42:43], bank_clean[,47:48])
dim(bank_last)
## [1] 11069204 32
# Select only active users
bank_active<-subset(bank_last, bank_last$active==1)
dim(bank_active)
## [1] 6215578 32
# Select records from May 28th, 2016, the month before the prediction month
bank_active.5.28.16<-subset(bank_active, bank_active$date=="2016-05-28")
bank.matrix<-bank_active.5.28.16
library(arules)
bank.matrix$custcode <- as.character(bank.matrix$custcode)
bank.matrix$age <- discretize(bank.matrix$age, method = "frequency", 5)
bank.matrix$new_cust <- as.logical(bank.matrix$new_cust)
bank.matrix$primary <- as.factor(bank.matrix$primary)
bank.matrix$add_type<-NULL
bank.matrix$cod_prov<-as.factor(bank.matrix$cod_prov)
bank.matrix$income <- discretize(bank.matrix$income, method = "frequency", 5)
bank.matrix$active <- NULL
bank.matrix$first_date<- as.factor(bank.matrix$first_date)
# Select only non-employee users
bank.matrix<-subset(bank.matrix, bank.matrix$emp_index=="N")
bank.matrix$emp_index<-NULL
bank.matrix$date<-NULL
dim(bank.matrix)
## [1] 393636 28
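The discretize(..., method = "frequency", 5) calls above split age and income into five bins holding roughly the same number of customers. The sketch below shows the same idea in base R with quantile() and cut() on a hypothetical age vector. It illustrates the technique and is not guaranteed to reproduce the exact break points arules chooses, for example when values are tied.

```r
# Hypothetical ages
age <- c(21, 22, 23, 24, 25, 31, 38, 42, 47, 55)

# Five equal-frequency bins: break points at the 0%, 20%, ..., 100% quantiles
breaks <- quantile(age, probs = seq(0, 1, length.out = 6))
age_bin <- cut(age, breaks = breaks, include.lowest = TRUE)

table(age_bin)
```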
#Choosing only bank products for item-to-item collaborative filtering
# I don't need user information, only products
bank.recom<-bank.matrix[,21:28]
prodnam8 <- c("current", "payroll_acc",
"particular", "e_account",
"taxes", "creditcard", "pension_last",
"direct_debit")
names(bank.recom)<- prodnam8
summary(bank.recom)
## current payroll_acc particular e_account
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000
## Median :1.0000 Median :0.0000 Median :0.0000 Median :0.000
## Mean :0.7418 Mean :0.1825 Mean :0.1711 Mean :0.179
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.000
## taxes creditcard pension_last direct_debit
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :0.0000 Median :0.0000
## Mean :0.1073 Mean :0.08783 Mean :0.1338 Mean :0.2857
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.0000
# Create a helper function to calculate the cosine between two vectors
getCosine <- function(x,y)
{
this.cosine <- sum(x*y) / (sqrt(sum(x*x)) * sqrt(sum(y*y)))
return(this.cosine)
}
# Create item to item dataframe
holder <- matrix(NA, nrow=ncol(bank.recom),ncol=ncol(bank.recom),dimnames=list(colnames(bank.recom),colnames(bank.recom)))
bank.recom.similarity <- as.data.frame(holder)
# Fill in the empty cells with the cosine similarity of each pair of products
for(i in 1:ncol(bank.recom)) {
  for(j in 1:ncol(bank.recom)) {
    # bank.recom[, i] extracts the column as a plain numeric vector
    bank.recom.similarity[i,j] <- getCosine(bank.recom[,i], bank.recom[,j])
  }
}
# Back to dataframe
bank.recom.similarity <- as.data.frame(bank.recom.similarity)
head(bank.recom.similarity)
## current payroll_acc particular e_account taxes
## current 1.00000000 0.04887032 0.3687681 0.2747616 0.1902221
## payroll_acc 0.04887032 1.00000000 0.1391024 0.3674849 0.3687115
## particular 0.36876805 0.13910241 1.0000000 0.1153420 0.1672927
## e_account 0.27476158 0.36748492 0.1153420 1.0000000 0.2688076
## taxes 0.19022212 0.36871153 0.1672927 0.2688076 1.0000000
## creditcard 0.15297901 0.39759523 0.1561424 0.2961349 0.3148097
## creditcard pension_last direct_debit
## current 0.1529790 0.06127241 0.3545048
## payroll_acc 0.3975952 0.79755111 0.5742283
## particular 0.1561424 0.13023025 0.1897989
## e_account 0.2961349 0.33744902 0.3753149
## taxes 0.3148097 0.33478083 0.4053543
## creditcard 1.0000000 0.38736088 0.4006921
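For 0/1 ownership data, the cosine similarity between two products is the number of users who own both, divided by the square root of the product of the two owner counts. That means the double loop can also be written as a single matrix product. The sketch below shows this on a hypothetical toy matrix. On the real data, `as.matrix(bank.recom)` would take the place of `m`.

```r
# Hypothetical 0/1 ownership matrix: rows = users, columns = products
m <- cbind(current     = c(1, 1, 1, 0),
           payroll_acc = c(0, 1, 0, 1),
           creditcard  = c(0, 1, 0, 0))

# crossprod(m) = t(m) %*% m: entry (i, j) counts users who own both i and j
co_owned <- crossprod(m)

# Divide by the product of the column norms to get cosine similarity
norms <- sqrt(diag(co_owned))
sim <- co_owned / outer(norms, norms)
round(sim, 3)
```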
# Get the top 2 neighbours for each product
# I only need to recommend 1 product for each user for the next month
bank.neighbours <- matrix(NA, nrow=ncol(bank.recom.similarity),ncol=2,dimnames=list(colnames(bank.recom.similarity)))
# Then I need to find the neighbours. This is another loop but runs much faster.
for(i in 1:ncol(bank.recom))
{
bank.neighbours[i,] <- (t(head(n=2,rownames(bank.recom.similarity[order(bank.recom.similarity[,i],decreasing=TRUE),][i]))))
}
# The first column is the product itself, so ignore it
head(bank.neighbours, n=8)
## [,1] [,2]
## current "current" "particular"
## payroll_acc "payroll_acc" "pension_last"
## particular "particular" "current"
## e_account "e_account" "direct_debit"
## taxes "taxes" "direct_debit"
## creditcard "creditcard" "direct_debit"
## pension_last "pension_last" "payroll_acc"
## direct_debit "direct_debit" "payroll_acc"
str(bank.neighbours)
## chr [1:8, 1:2] "current" "payroll_acc" "particular" ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:8] "current" "payroll_acc" "particular" "e_account" ...
## ..$ : NULL
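# Recommended product for each user
# Note: cbind() recycles the 8 neighbour names down the user column, so each
# user is paired with a product by row position, not by the products they own
bank.final<-cbind(bank.matrix[,1],bank.neighbours[,2])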
#Let's see recommended products for first 10 users
colnames(bank.final)<-c("ncodpers", "added_products")
head(bank.final, n=10)
## ncodpers added_products
## [1,] "657788" "particular"
## [2,] "657795" "pension_last"
## [3,] "657790" "current"
## [4,] "657794" "direct_debit"
## [5,] "657789" "direct_debit"
## [6,] "657781" "direct_debit"
## [7,] "657779" "payroll_acc"
## [8,] "657770" "payroll_acc"
## [9,] "657764" "particular"
## [10,] "657760" "pension_last"
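The cbind() call above pairs users with neighbour names by row position, which is why the eight products repeat in the same order. One possible way to make the recommendation depend on what each user actually owns is to score every product the user does not own by its summed similarity to the products they do own, then pick the highest score. The sketch below uses hypothetical toy inputs. `recommend_one()` is an illustrative helper, not part of the original analysis.

```r
# Hypothetical similarity matrix between three products
prods <- c("current", "payroll_acc", "pension_last")
sim <- matrix(c(1.0, 0.5, 0.2,
                0.5, 1.0, 0.8,
                0.2, 0.8, 1.0),
              nrow = 3, byrow = TRUE, dimnames = list(prods, prods))

# Hypothetical 0/1 ownership matrix: one row per user
owned <- rbind(u1 = c(current = 1, payroll_acc = 0, pension_last = 0),
               u2 = c(current = 0, payroll_acc = 1, pension_last = 0),
               u3 = c(current = 1, payroll_acc = 1, pension_last = 1))

# recommend_one() is an illustrative helper, not part of the original analysis
recommend_one <- function(user_row, sim) {
  scores <- as.vector(sim %*% user_row)  # summed similarity to owned products
  names(scores) <- rownames(sim)
  scores[user_row == 1] <- NA            # never recommend what is already owned
  if (all(is.na(scores))) return(NA_character_)
  names(scores)[which.max(scores)]
}

recs <- apply(owned, 1, recommend_one, sim = sim)
recs
```

In this sketch, a user who already owns every product gets NA rather than a recommendation.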
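# Save the results to a CSV file for submission
# (save() would write R's binary format, not CSV, so write.csv() is used)
write.csv(bank.final, file="bank.final.submission.csv", row.names=FALSE)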
library(Matrix)
library(reshape2)
library(recommenderlab)
#User-based Collaborative Filtering
# This time, I need user code and product ratings
bank.recomd<-cbind(bank.matrix[,1], bank.matrix[21:28])
# The sixth product column is ind_tjcr_fin_ult1, named "creditcard" earlier
prodnam9 <- c("user", "current", "payroll_acc",
"particular", "e_account",
"taxes", "creditcard", "pension_last",
"direct_debit")
names(bank.recomd)<- prodnam9
# For demonstration purposes, I selected only the first 5000 users, since the
# full dataset is too large for user-based collaborative filtering
bank.recomd<-bank.recomd[1:5000,]
bank.recomd$user <- as.numeric(factor(bank.recomd$user))
bank.reshaped <- transform(melt(bank.recomd, id.var='user', na.rm=TRUE),
variable=match(variable, names(bank.recomd)[-1]))
head(bank.reshaped)
## user variable value
## 1 1973 1 1
## 2 1977 1 1
## 3 1975 1 0
## 4 1976 1 1
## 5 1974 1 1
## 6 1971 1 0
#Change dataset to SparseMatrix
sparse_ratings<-with(bank.reshaped, sparseMatrix(user, variable, x=value))
#recommenderlab works with realRatingMatrix object, which is created from sparse matrix
real_ratings <- new("realRatingMatrix", data = sparse_ratings)
head(as(real_ratings, "matrix"))
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 1 0 0 0 0 0 0 0
## [2,] 1 0 0 0 0 0 0 1
## [3,] 1 0 0 0 0 0 0 1
## [4,] 1 0 1 0 1 0 0 0
## [5,] 0 1 1 0 1 1 1 1
## [6,] 1 0 0 0 0 0 0 0
#Create a recommender model based on user ratings
bank.rec<-Recommender(real_ratings, method="UBCF", param=list(normalize = "center"))
getModel(bank.rec)
## $description
## [1] "UBCF-Real data: contains full or sample of data set"
##
## $data
## 5000 x 8 rating matrix of class 'realRatingMatrix' with 40000 ratings.
## Normalized using center on rows.
##
## $method
## [1] "cosine"
##
## $nn
## [1] 25
##
## $sample
## [1] FALSE
##
## $normalize
## [1] "center"
##
## $verbose
## [1] FALSE
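# Hold out 20% of the users for testing; 3 products per test user are given as known
# Note: bank.rec above was fit on all 5000 users, including the test users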
set.seed(1)
e <- evaluationScheme(real_ratings, method="split", train=0.8, given=3)
#Predicting recommendation
bank.pred<-predict(bank.rec, getData(e, "known"), type="ratings", n=1)
head(as(bank.pred, "matrix"), n=10)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] NA NA 0.2083333 NA 0.2083333 0.2083333
## [2,] NA 0.2083333 0.2083333 NA NA 0.2083333
## [3,] 0.2083333 NA NA 0.2083333 0.2083333 0.2083333
## [4,] 0.2083333 NA 0.2083333 0.2083333 0.2083333 NA
## [5,] 0.0000000 NA NA 0.0000000 0.0000000 0.0000000
## [6,] NA NA 0.2083333 NA 0.2083333 0.2083333
## [7,] 0.0000000 0.0000000 NA 0.0000000 0.0000000 NA
## [8,] NA 0.2083333 0.2083333 NA NA 0.2083333
## [9,] 0.0000000 0.0000000 NA 0.0000000 0.0000000 NA
## [10,] NA NA 0.2083333 NA 0.2083333 0.2083333
## [,7] [,8]
## [1,] 0.2083333 0.2083333
## [2,] 0.2083333 0.2083333
## [3,] NA 0.2083333
## [4,] NA 0.2083333
## [5,] 0.0000000 NA
## [6,] 0.2083333 0.2083333
## [7,] 0.0000000 NA
## [8,] 0.2083333 0.2083333
## [9,] 0.0000000 NA
## [10,] 0.2083333 0.2083333
# Calculating RMSE
RMSE.ubcf <- calcPredictionAccuracy(bank.pred, getData(e, "unknown"))[1]
RMSE.ubcf
## RMSE
## 0.45487
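For reference, RMSE is the square root of the mean squared difference between predicted and actual values, taken over the pairs where a prediction exists. The sketch below computes it by hand on hypothetical toy vectors. It illustrates the metric and does not reproduce recommenderlab's internals.

```r
# Hypothetical held-out ratings and predictions (NA = no prediction made)
actual    <- c(1, 0, 1, 0, 1)
predicted <- c(0.8, 0.2, NA, 0.1, 0.5)

# RMSE over the pairs where a prediction exists
ok <- !is.na(predicted)
rmse <- sqrt(mean((actual[ok] - predicted[ok])^2))
rmse
```

As a point of comparison, predicting a constant 0.5 for 0/1 data always gives an RMSE of exactly 0.5.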