Background

TartaCUBE Bank Corp has a history of catering to the needs of its customers through a range of services. Each year the bank launches a number of products and spends a considerable budget on advertising campaigns that target all of its customers.

GIF Source: https://giphy.com/gifs/propaganda-4CPazmmkkYzxm

Business Opportunity

TartaCUBE currently uses direct mail and direct calling to inform customers of new product launches. This is done by a bank representative and costs the bank approximately $2 per customer. The bank has developed a new automated system that can place automated calls and send automated emails to potential customers. This costs the bank approximately 50 cents per customer.

To optimize its marketing costs, the bank wants to reach customers who would be receptive to its products through the automated marketing system, and customers who would not be receptive through the personal marketing system.
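
To make the cost argument concrete, the following sketch compares contacting every customer personally with a split between the two channels. The $2 and 50-cent figures come from the text above; the customer count and the share routed to the automated system are hypothetical values chosen only for illustration.

```r
# Per-customer contact costs quoted in the text above
cost_personal  <- 2.00  # contact by a bank representative
cost_automated <- 0.50  # automated call or email

# Hypothetical inputs, for illustration only
n_customers     <- 10000
share_automated <- 0.30  # assumed share of customers routed to the automated system

# Total campaign cost for a given split between the two channels
campaign_cost <- function(n, share_auto,
                          c_auto = cost_automated,
                          c_personal = cost_personal) {
  n_auto     <- n * share_auto
  n_personal <- n - n_auto
  n_auto * c_auto + n_personal * c_personal
}

all_personal <- campaign_cost(n_customers, 0)
mixed        <- campaign_cost(n_customers, share_automated)
savings      <- all_personal - mixed

c(all_personal = all_personal, mixed = mixed, savings = savings)
```

Under these assumed inputs, every customer moved to the automated channel saves $1.50, so the total saving grows linearly with the share of customers the model routes there.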

GIF Source: https://giphy.com/gifs/clouds-movie-2Zx1Pa8Fcs9xu

Task

The bank has collected data about its customers over a three-year period, covering a total of 20 parameters that include job status, educational status, response to other campaigns, etc.

The data is available in raw form and is a mix of numerical, character, categorical and binary values. Using this dataset and a suitable machine learning algorithm, the bank wants to build a predictive model that classifies customers into two groups: those who are likely to subscribe to the bank’s new product and those who are not. This would allow the bank to channel its resources more efficiently and cater appropriately to both groups through automated and personal marketing campaigns respectively.
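
As a minimal sketch of how a model's output could drive the channel choice described above, the snippet below maps a predicted label to a marketing channel. The `predicted` vector is a hypothetical stand-in for real model output, and the `channel` name is introduced here only for illustration.

```r
# Hypothetical model output: predicted subscription label per customer
predicted <- factor(c("yes", "no", "no", "yes", "no"), levels = c("no", "yes"))

# Routing rule described in the text: likely subscribers are contacted through
# the automated system, everyone else through a bank representative
channel <- ifelse(predicted == "yes", "automated", "personal")

table(channel)
```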

GIF Source: https://giphy.com/gifs/animation-loop-blue-26ufbhAiPrAlyvY4g

Dataset

The dataset used was obtained from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Bank+Marketing#). The raw dataset is a CSV file with more than 40,000 data points and about 20 variables, ordered by date (from May 2008 to November 2010).

Distribution of classes in target variable

## 
##   no  yes 
## 88.3 11.7
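
The code that produced this table is not shown. One way to obtain a percentage breakdown of this shape is `prop.table()` on a frequency table. The sketch below uses a toy vector `y` built to mirror the proportions above; it is a hypothetical stand-in for the real target column.

```r
# Toy stand-in for the target column (hypothetical; built to mirror the
# proportions reported above, not taken from the real data)
y <- factor(c(rep("no", 883), rep("yes", 117)), levels = c("no", "yes"))

# Percentage of each class, rounded to one decimal place
class_pct <- round(prop.table(table(y)) * 100, digits = 1)
print(class_pct)
```

With close to nine in ten customers in the "no" class, a model that always predicts "no" would already be about 88% accurate, which is why the class balance matters before modelling.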

Distribution of classes in Normalized Dataset

## 
##  no yes 
##  50  50
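
This section does not show how the 50/50 split or the normalization was done. A common approach, sketched here as an assumption, is to downsample the majority class and then rescale each numeric feature to the range [0, 1] with min-max normalization. The toy data frame, its column names, the seed and the `normalize` helper are all hypothetical.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Hypothetical toy data: two numeric features and an imbalanced target
toy <- data.frame(
  age      = round(runif(1000, 18, 90)),
  duration = round(runif(1000, 0, 3000)),
  y        = factor(c(rep("no", 883), rep("yes", 117)), levels = c("no", "yes"))
)

# 1. Downsample the majority class to the size of the minority class
n_minority <- min(table(toy$y))
idx_no     <- sample(which(toy$y == "no"), n_minority)
idx_yes    <- which(toy$y == "yes")
balanced   <- toy[c(idx_no, idx_yes), ]

# 2. Min-max normalization: rescale each numeric feature to [0, 1]
normalize  <- function(x) (x - min(x)) / (max(x) - min(x))
balanced_n <- as.data.frame(lapply(balanced[, c("age", "duration")], normalize))

# Class shares after balancing, in percent
round(prop.table(table(balanced$y)) * 100)
```

Normalization matters for kNN because the algorithm is distance-based: without it, a feature measured in thousands would dominate one measured in tens.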

Model with k=1

## Training model on the data with k = 1
library(class)    # provides knn()
library(gmodels)  # provides CrossTable()
bank2_test_pred_1 <- knn(train = bank2_n_train, test = bank2_n_test, cl = bank2_n_train_labels, k = 1)
CrossTable(x = bank2_n_test_labels, y = bank2_test_pred_1, prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  2000 
## 
##  
##                     | bank2_test_pred_1 
## bank2_n_test_labels |        no |       yes | Row Total | 
## --------------------|-----------|-----------|-----------|
##                  no |       824 |       229 |      1053 | 
##                     |     0.783 |     0.217 |     0.526 | 
##                     |     0.755 |     0.252 |           | 
##                     |     0.412 |     0.114 |           | 
## --------------------|-----------|-----------|-----------|
##                 yes |       268 |       679 |       947 | 
##                     |     0.283 |     0.717 |     0.473 | 
##                     |     0.245 |     0.748 |           | 
##                     |     0.134 |     0.340 |           | 
## --------------------|-----------|-----------|-----------|
##        Column Total |      1092 |       908 |      2000 | 
##                     |     0.546 |     0.454 |           | 
## --------------------|-----------|-----------|-----------|
## 
## 
#Determining the accuracy of the model
(mean(bank2_test_pred_1 == bank2_n_test_labels))*100
## [1] 75.15
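
Accuracy alone hides how the errors are split between the two classes. As a sketch, the counts from the k = 1 cross table above can be entered by hand to derive sensitivity, specificity and precision alongside accuracy. The matrix `cm` and the metric variable names are introduced here for illustration only.

```r
# Counts copied from the k = 1 cross table above (rows = actual, columns = predicted)
cm <- matrix(c(824, 229,
               268, 679),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("no", "yes"),
                             predicted = c("no", "yes")))

accuracy    <- sum(diag(cm)) / sum(cm)              # share of all predictions that are correct
sensitivity <- cm["yes", "yes"] / sum(cm["yes", ])  # share of actual subscribers identified
specificity <- cm["no", "no"] / sum(cm["no", ])     # share of actual non-subscribers identified
precision   <- cm["yes", "yes"] / sum(cm[, "yes"])  # share of predicted subscribers that are correct

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision) * 100, 2)
```

These values correspond to cells already printed in the cross table: sensitivity and specificity are the "N / Row Total" entries on the diagonal, and precision is the "N / Col Total" entry for the yes/yes cell.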

Model with k=5

## Training model on the data with k = 5
bank2_test_pred_5 <- knn(train = bank2_n_train, test = bank2_n_test, cl = bank2_n_train_labels, k = 5)
CrossTable(x = bank2_n_test_labels, y = bank2_test_pred_5, prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  2000 
## 
##  
##                     | bank2_test_pred_5 
## bank2_n_test_labels |        no |       yes | Row Total | 
## --------------------|-----------|-----------|-----------|
##                  no |       855 |       198 |      1053 | 
##                     |     0.812 |     0.188 |     0.526 | 
##                     |     0.804 |     0.212 |           | 
##                     |     0.427 |     0.099 |           | 
## --------------------|-----------|-----------|-----------|
##                 yes |       209 |       738 |       947 | 
##                     |     0.221 |     0.779 |     0.473 | 
##                     |     0.196 |     0.788 |           | 
##                     |     0.104 |     0.369 |           | 
## --------------------|-----------|-----------|-----------|
##        Column Total |      1064 |       936 |      2000 | 
##                     |     0.532 |     0.468 |           | 
## --------------------|-----------|-----------|-----------|
## 
## 
#Determining the accuracy of the model
(mean(bank2_test_pred_5 == bank2_n_test_labels))*100
## [1] 79.65

Model with k=11

## Training model on the data with k = 11
bank2_test_pred_11 <- knn(train = bank2_n_train, test = bank2_n_test, cl = bank2_n_train_labels, k = 11)
CrossTable(x = bank2_n_test_labels, y = bank2_test_pred_11, prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  2000 
## 
##  
##                     | bank2_test_pred_11 
## bank2_n_test_labels |        no |       yes | Row Total | 
## --------------------|-----------|-----------|-----------|
##                  no |       858 |       195 |      1053 | 
##                     |     0.815 |     0.185 |     0.526 | 
##                     |     0.806 |     0.208 |           | 
##                     |     0.429 |     0.098 |           | 
## --------------------|-----------|-----------|-----------|
##                 yes |       206 |       741 |       947 | 
##                     |     0.218 |     0.782 |     0.473 | 
##                     |     0.194 |     0.792 |           | 
##                     |     0.103 |     0.370 |           | 
## --------------------|-----------|-----------|-----------|
##        Column Total |      1064 |       936 |      2000 | 
##                     |     0.532 |     0.468 |           | 
## --------------------|-----------|-----------|-----------|
## 
## 
#Determining accuracy of the model
(mean(bank2_test_pred_11 == bank2_n_test_labels))*100
## [1] 79.95

Model with k=15

## Training model on the data with k = 15
bank2_test_pred_15 <- knn(train = bank2_n_train, test = bank2_n_test, cl = bank2_n_train_labels, k = 15)
CrossTable(x = bank2_n_test_labels, y = bank2_test_pred_15, prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  2000 
## 
##  
##                     | bank2_test_pred_15 
## bank2_n_test_labels |        no |       yes | Row Total | 
## --------------------|-----------|-----------|-----------|
##                  no |       851 |       202 |      1053 | 
##                     |     0.808 |     0.192 |     0.526 | 
##                     |     0.801 |     0.215 |           | 
##                     |     0.425 |     0.101 |           | 
## --------------------|-----------|-----------|-----------|
##                 yes |       211 |       736 |       947 | 
##                     |     0.223 |     0.777 |     0.473 | 
##                     |     0.199 |     0.785 |           | 
##                     |     0.105 |     0.368 |           | 
## --------------------|-----------|-----------|-----------|
##        Column Total |      1062 |       938 |      2000 | 
##                     |     0.531 |     0.469 |           | 
## --------------------|-----------|-----------|-----------|
## 
## 
#Determining accuracy of the model
(mean(bank2_test_pred_15 == bank2_n_test_labels))*100
## [1] 79.35

Model with k=21

## Training model on the data with k = 21
bank2_test_pred_21 <- knn(train = bank2_n_train, test = bank2_n_test, cl = bank2_n_train_labels, k = 21)
CrossTable(x = bank2_n_test_labels, y = bank2_test_pred_21, prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  2000 
## 
##  
##                     | bank2_test_pred_21 
## bank2_n_test_labels |        no |       yes | Row Total | 
## --------------------|-----------|-----------|-----------|
##                  no |       859 |       194 |      1053 | 
##                     |     0.816 |     0.184 |     0.526 | 
##                     |     0.800 |     0.210 |           | 
##                     |     0.429 |     0.097 |           | 
## --------------------|-----------|-----------|-----------|
##                 yes |       215 |       732 |       947 | 
##                     |     0.227 |     0.773 |     0.473 | 
##                     |     0.200 |     0.790 |           | 
##                     |     0.107 |     0.366 |           | 
## --------------------|-----------|-----------|-----------|
##        Column Total |      1074 |       926 |      2000 | 
##                     |     0.537 |     0.463 |           | 
## --------------------|-----------|-----------|-----------|
## 
## 
#Determining accuracy of the model
(mean(bank2_test_pred_21 == bank2_n_test_labels))*100
## [1] 79.55

Model with k=27

## Training model on the data with k = 27
bank2_test_pred_27 <- knn(train = bank2_n_train, test = bank2_n_test, cl = bank2_n_train_labels, k = 27)
CrossTable(x = bank2_n_test_labels, y = bank2_test_pred_27, prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  2000 
## 
##  
##                     | bank2_test_pred_27 
## bank2_n_test_labels |        no |       yes | Row Total | 
## --------------------|-----------|-----------|-----------|
##                  no |       858 |       195 |      1053 | 
##                     |     0.815 |     0.185 |     0.526 | 
##                     |     0.797 |     0.211 |           | 
##                     |     0.429 |     0.098 |           | 
## --------------------|-----------|-----------|-----------|
##                 yes |       218 |       729 |       947 | 
##                     |     0.230 |     0.770 |     0.473 | 
##                     |     0.203 |     0.789 |           | 
##                     |     0.109 |     0.364 |           | 
## --------------------|-----------|-----------|-----------|
##        Column Total |      1076 |       924 |      2000 | 
##                     |     0.538 |     0.462 |           | 
## --------------------|-----------|-----------|-----------|
## 
## 
#Determining accuracy of the model
(mean(bank2_test_pred_27 == bank2_n_test_labels))*100
## [1] 79.35
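
The six runs above differ only in the value of k, so they can be expressed as a single loop. The sketch below does this on simulated data, because the prepared objects (`bank2_n_train` and so on) are not defined in this section. The simulated features, sample size, split ratio and seed are assumptions for illustration, and the resulting accuracies will not match the figures reported above.

```r
library(class)  # provides knn()

set.seed(123)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in data: two features, two partly overlapping classes
n <- 400
labels <- factor(rep(c("no", "yes"), each = n / 2), levels = c("no", "yes"))
features <- data.frame(
  x1 = rnorm(n, mean = ifelse(labels == "yes", 1, 0)),
  x2 = rnorm(n, mean = ifelse(labels == "yes", 1, 0))
)

# Random 75/25 train/test split
train_idx <- sample(n, size = 0.75 * n)
train_x <- features[train_idx, ]
test_x  <- features[-train_idx, ]
train_y <- labels[train_idx]
test_y  <- labels[-train_idx]

# Same k values as the runs above
k_values <- c(1, 5, 11, 15, 21, 27)

# Fit kNN for each k and record the test accuracy in percent
accuracy_by_k <- sapply(k_values, function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  mean(pred == test_y) * 100
})
names(accuracy_by_k) <- paste0("k=", k_values)

accuracy_by_k
best_k <- k_values[which.max(accuracy_by_k)]
```

On the bank data, the accuracies reported above are 75.15, 79.65, 79.95, 79.35, 79.55 and 79.35 for k = 1, 5, 11, 15, 21 and 27. The highest is k = 11, although every value from k = 5 upward lies within about half a percentage point of it.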