Bank Marketing

Dataset: https://www.kaggle.com/janiobachmann/bank-marketing-dataset

Loading libraries

library("ggplot2")
library("data.table")
library("plotly")

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library("stats")
library("caret")

## Loading required package: lattice

library("randomForest")

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

library("e1071")
library("class")
library("dplyr")

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:randomForest':
## 
##     combine

## The following objects are masked from 'package:data.table':
## 
##     between, first, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Loading data

bank <- read.csv("bank.csv")
str(bank)

## 'data.frame':    11162 obs. of  17 variables:
##  $ age      : int  59 56 41 55 54 42 56 60 37 28 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 1 1 10 8 1 5 5 6 10 8 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 2 2 2 2 3 2 1 2 3 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 2 2 2 2 3 3 3 2 2 2 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ balance  : int  2343 45 1270 2476 184 0 830 545 1 5090 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 2 1 2 2 1 2 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 2 1 1 1 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ day      : int  5 5 5 5 5 5 6 6 6 6 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ duration : int  1042 1467 1389 579 673 562 1201 1030 608 1297 ...
##  $ campaign : int  1 1 1 1 2 2 1 1 1 3 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ deposit  : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...

bank$deposit = ifelse(bank$deposit=='yes',1,0)

Splitting the data

Into training and test sets, with 80% of the rows used for training.

split<-sample(nrow(bank),nrow(bank)*0.8)
train<-bank[split,]
test<-bank[-split,]
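
The split above draws a fresh random sample on every run, so the rows shown later will differ between runs. A minimal sketch of a repeatable 80/20 split follows; the toy data frame and the seed value are assumptions for illustration, not part of the original analysis.

```r
# Toy stand-in for the bank data (hypothetical values, not the real dataset)
set.seed(42)  # assumed seed; any fixed integer makes the split repeatable
toy <- data.frame(
  age     = sample(18:90, 100, replace = TRUE),
  deposit = rbinom(100, 1, 0.5)
)

# Same idea as above: sample 80% of the row indices for training
n_train   <- floor(nrow(toy) * 0.8)
idx       <- sample(nrow(toy), n_train)
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

# The two sets are disjoint and together cover every row
c(train = nrow(toy_train), test = nrow(toy_test))
```

Calling `set.seed()` once before `sample()` is enough; the same seed then reproduces the same train and test rows.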

Exploration

head(train)

##      age        job marital education default balance housing loan
## 1225  61    retired married  tertiary      no    1257      no   no
## 5952  51 technician married secondary      no    1327      no   no
## 4609  35   services  single   primary      no     167      no  yes
## 9673  34  housemaid  single   primary      no     443      no   no
## 9046  53 management married  tertiary      no       0     yes   no
## 3545  30 technician  single secondary      no    3144      no   no
##       contact day month duration campaign pdays previous poutcome deposit
## 1225 cellular  10   feb      503        1    -1        0  unknown       1
## 5952 cellular   7   jul       21        2    -1        0  unknown       0
## 4609 cellular  11   jul      614        2    -1        0  unknown       1
## 9673 cellular  30   jan       10        1     2        1    other       0
## 9046 cellular  14   jul       85        3    -1        0  unknown       0
## 3545 cellular  19   may      212        2    -1        0  unknown       1

colnames(train)

##  [1] "age"       "job"       "marital"   "education" "default"  
##  [6] "balance"   "housing"   "loan"      "contact"   "day"      
## [11] "month"     "duration"  "campaign"  "pdays"     "previous" 
## [16] "poutcome"  "deposit"

Glossary

Client Information
- age - age of client
- job - type of job held by client
- marital - marital status of client
- education - highest level of education completed by client
- default - has the client ever defaulted on previous debts?
- balance - client’s average yearly balance, in euros
- housing - does the client have a housing loan?
- loan - does the client have a personal loan?

Information related to the last contact of the client during the current campaign
- contact - communication type
- month - month of year
- day - day (of the month) that the client was contacted
- duration - contact duration in seconds

Miscellaneous Attributes
- campaign - number of contacts performed during this campaign and for this client
- pdays - number of days that passed after the client was last contacted in a previous campaign (-1 means the client was not previously contacted)
- previous - number of contacts performed before this campaign and for this client
- poutcome - outcome of the previous marketing campaign for this client
- deposit - has the client subscribed to a term deposit? (dependent variable)

# percentage of all values that are missing/NA:
(sum(is.na(train))/(nrow(train)*ncol(train)))*100

## [1] 0
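
The single percentage above hides where any missing values sit. A small sketch of a per-column breakdown follows, using a made-up data frame (the values and the NAs are hypothetical) since the real training set has none.

```r
# Toy data with a few NAs (hypothetical values for illustration only)
toy <- data.frame(
  age     = c(59, NA, 41, 55),
  job     = c("admin.", "services", NA, NA),
  balance = c(2343, 45, 1270, 2476)
)

# Percentage of missing values in each column
na_pct_by_column <- colMeans(is.na(toy)) * 100

# Overall percentage, equivalent to the sum/(nrow*ncol) formula above
overall_na_pct <- mean(is.na(toy)) * 100

na_pct_by_column
overall_na_pct
```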

Actions

Data structure
Plots of:
- Ages
- Jobs
- Marital status
- Education level
- Has credit in default (default)
- Balance
- Housing loan (housing)
- Personal loan (loan)
- Contact method
- Day of the month of contact
- Month of contact
- Call duration
- Number of contacts made to the client
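
As a starting point for the plots listed above, here is a hedged sketch of two of them (ages and jobs) with ggplot2. The toy data frame is a hypothetical stand-in for `train`; the bin width and labels are arbitrary choices, not the author's.

```r
library(ggplot2)

# Toy stand-in for `train` (hypothetical values)
toy <- data.frame(
  age     = c(59, 56, 41, 55, 54, 42, 37, 28),
  job     = c("admin.", "admin.", "technician", "services",
              "admin.", "management", "technician", "services"),
  deposit = c(1, 1, 1, 0, 0, 1, 0, 0)
)

# Histogram of ages, coloured by whether the client subscribed
p_age <- ggplot(toy, aes(x = age, fill = factor(deposit))) +
  geom_histogram(binwidth = 5) +
  labs(x = "Age", y = "Clients", fill = "Deposit")

# Bar chart of the number of clients per job
p_job <- ggplot(toy, aes(x = job)) +
  geom_bar() +
  coord_flip() +
  labs(x = "Job", y = "Clients")

# print(p_age); print(p_job)  # uncomment to draw the plots
```

The same two patterns (histogram for numeric columns, bar chart for categorical ones) would cover the remaining items in the list by swapping the column name.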

CART

library(tidyverse)

## ── Attaching packages ────────

## ✔ tibble  2.1.3     ✔ purrr   0.3.2
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0

## ── Conflicts ─────────────────
## ✖ dplyr::between()       masks data.table::between()
## ✖ dplyr::combine()       masks randomForest::combine()
## ✖ dplyr::filter()        masks plotly::filter(), stats::filter()
## ✖ dplyr::first()         masks data.table::first()
## ✖ dplyr::lag()           masks stats::lag()
## ✖ dplyr::last()          masks data.table::last()
## ✖ purrr::lift()          masks caret::lift()
## ✖ randomForest::margin() masks ggplot2::margin()
## ✖ purrr::transpose()     masks data.table::transpose()

library(rpart)
library(rpart.plot)

classifier = rpart(formula = deposit ~ .,
                   data = train, method = "class")
rpart.plot(classifier)

# plot
# https://www.rdocumentation.org/packages/rpart.plot/versions/3.0.8/topics/prp
# type = 0|5
prp(classifier, type = 2, extra = 104, fallen.leaves = TRUE, main="Decision Tree")
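
Besides plotting, a fitted rpart tree can be inspected numerically. The sketch below is an illustration only: it uses the `kyphosis` data that ships with rpart as a stand-in for the bank data, and the pruning step is an optional extra that the original analysis does not perform.

```r
library(rpart)

set.seed(1)  # assumed seed; rpart's internal cross-validation is random

# `kyphosis` ships with rpart and stands in for the bank data here
fit <- rpart(Kyphosis ~ Age + Number + Start,
             data = kyphosis, method = "class")

# Relative importance of each predictor across the tree's splits
fit$variable.importance

# Complexity table: one row per candidate tree size,
# with the cross-validated error in the "xerror" column
printcp(fit)

# Prune back to the size with the lowest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```

The same calls should apply to `classifier` from the section above, since it is also an rpart object.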

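## prediction

# Predict class labels for the held-out test set
pred<-predict(classifier,test,type = "class")
# Note: confusionMatrix() expects (data = predictions, reference = true labels).
# The true labels are passed first here, so the "Prediction" rows in the output
# hold the true classes and the "Reference" columns hold the predictions.
confusionMatrix(as.factor(test$deposit),as.factor(pred))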

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 849 355
##          1  94 935
##                                           
##                Accuracy : 0.7989          
##                  95% CI : (0.7817, 0.8154)
##     No Information Rate : 0.5777          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6027          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9003          
##             Specificity : 0.7248          
##          Pos Pred Value : 0.7051          
##          Neg Pred Value : 0.9086          
##              Prevalence : 0.4223          
##          Detection Rate : 0.3802          
##    Detection Prevalence : 0.5392          
##       Balanced Accuracy : 0.8126          
##                                           
##        'Positive' Class : 0               
##
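
In the call above, `test$deposit` is passed as the first argument and `pred` as the second, so the rows of the printed table are the true classes and the columns are the predicted ones. The sketch below recomputes a few metrics by hand from the printed counts under that reading, taking class 1 (the client subscribed) as the outcome of interest; the variable names are illustrative.

```r
# Counts copied from the confusion matrix printed above.
# Rows are TRUE classes, columns are PREDICTED classes.
cm <- matrix(c(849, 355,
                94, 935),
             nrow = 2, byrow = TRUE,
             dimnames = list(truth = c("0", "1"), predicted = c("0", "1")))

# Share of all test rows classified correctly
accuracy <- sum(diag(cm)) / sum(cm)

# Metrics for class 1 (deposit == 1)
recall_yes    <- cm["1", "1"] / sum(cm["1", ])   # real subscribers that were found
precision_yes <- cm["1", "1"] / sum(cm[, "1"])   # predicted subscribers that are real

round(c(accuracy = accuracy, recall = recall_yes, precision = precision_yes), 4)
```

With caret, the conventional argument order would be `confusionMatrix(data = as.factor(pred), reference = as.factor(test$deposit), positive = "1")`, which labels the table and the per-class statistics accordingly.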

Patricio Araneda G.

9/11/2019
