Dataset: https://www.kaggle.com/janiobachmann/bank-marketing-dataset
Loading libraries
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2': last_plot
## The following object is masked from 'package:stats': filter
## The following object is masked from 'package:graphics': layout
## Loading required package: lattice
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2': margin
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest': combine
## The following objects are masked from 'package:data.table': between, first, last
## The following objects are masked from 'package:stats': filter, lag
## The following objects are masked from 'package:base': intersect, setdiff, setequal, union
## 'data.frame': 11162 obs. of 17 variables:
## $ age : int 59 56 41 55 54 42 56 60 37 28 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 1 1 10 8 1 5 5 6 10 8 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 2 2 2 2 3 2 1 2 3 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 2 2 2 2 3 3 3 2 2 2 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ balance : int 2343 45 1270 2476 184 0 830 545 1 5090 ...
## $ housing : Factor w/ 2 levels "no","yes": 2 1 2 2 1 2 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 2 1 1 1 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ day : int 5 5 5 5 5 5 6 6 6 6 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ duration : int 1042 1467 1389 579 673 562 1201 1030 608 1297 ...
## $ campaign : int 1 1 1 1 2 2 1 1 1 3 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ deposit : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
Split into training and test sets, with 80% of the rows used for training
##      age        job marital education default balance housing loan
## 1225  61    retired married  tertiary      no    1257      no   no
## 5952  51 technician married secondary      no    1327      no   no
## 4609  35   services  single   primary      no     167      no  yes
## 9673  34  housemaid  single   primary      no     443      no   no
## 9046  53 management married  tertiary      no       0     yes   no
## 3545  30 technician  single secondary      no    3144      no   no
##       contact day month duration campaign pdays previous poutcome deposit
## 1225 cellular  10   feb      503        1    -1        0  unknown       1
## 5952 cellular   7   jul       21        2    -1        0  unknown       0
## 4609 cellular  11   jul      614        2    -1        0  unknown       1
## 9673 cellular  30   jan       10        1     2        1    other       0
## 9046 cellular  14   jul       85        3    -1        0  unknown       0
## 3545 cellular  19   may      212        2    -1        0  unknown       1
## [1] "age" "job" "marital" "education" "default"
## [6] "balance" "housing" "loan" "contact" "day"
## [11] "month" "duration" "campaign" "pdays" "previous"
## [16] "poutcome" "deposit"
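The split step described above could be sketched roughly as follows. This is a minimal illustration in base R, not necessarily the code that produced the output shown: `bank` is a small synthetic stand-in for the real data set, the seed is arbitrary, and the recoding of `deposit` to 0/1 assumes that "yes" maps to 1 and "no" to 0.

```r
# Synthetic stand-in for the bank marketing data (hypothetical values)
set.seed(123)  # arbitrary seed, only for reproducibility of the sketch
bank <- data.frame(
  age     = sample(18:90, 100, replace = TRUE),
  balance = sample(-500:5000, 100, replace = TRUE),
  deposit = factor(sample(c("no", "yes"), 100, replace = TRUE),
                   levels = c("no", "yes"))
)

# Recode the target as 0/1 (assumed mapping: "no" -> 0, "yes" -> 1)
bank$deposit <- ifelse(bank$deposit == "yes", 1, 0)

# Draw 80% of the row indices at random for training;
# the remaining 20% of the rows form the test set
train_idx <- sample(seq_len(nrow(bank)), size = floor(0.8 * nrow(bank)))
train <- bank[train_idx, ]
test  <- bank[-train_idx, ]

dim(train)
dim(test)
```

Sampling row indices at random leaves the original row names in shuffled order, which is consistent with the non-sequential row names (1225, 5952, ...) in the output above.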
# percentage of all values that are missing/NA:
(sum(is.na(train))/(nrow(train)*ncol(train)))*100
## [1] 0
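A result of 0 only says that no value is stored as `NA`. The rows printed earlier show `poutcome` values of "unknown" and `pdays` values of -1, which may act as placeholder codes. The sketch below counts such values on a toy data frame; treating "unknown" and -1 as missing-value markers is an assumption made here, and `toy` and its contents are hypothetical.

```r
# Toy rows mimicking a few of the columns shown above (hypothetical values)
toy <- data.frame(
  poutcome = factor(c("unknown", "unknown", "other", "failure")),
  loan     = factor(c("no", "yes", "no", "no")),
  pdays    = c(-1L, -1L, 2L, 180L)
)

# Percentage of "unknown" entries in each factor column
is_factor   <- vapply(toy, is.factor, logical(1))
unknown_pct <- vapply(toy[is_factor],
                      function(col) mean(col == "unknown") * 100,
                      numeric(1))

# Percentage of rows where pdays carries the -1 code
pdays_pct <- mean(toy$pdays == -1) * 100

unknown_pct
pdays_pct
```

Running the same two checks on `train` would show how much of the data these codes cover.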
## ── Attaching packages ────────
## ✔ tibble 2.1.3 ✔ purrr 0.3.2
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ─────────────────
## ✖ dplyr::between() masks data.table::between()
## ✖ dplyr::combine() masks randomForest::combine()
## ✖ dplyr::filter() masks plotly::filter(), stats::filter()
## ✖ dplyr::first() masks data.table::first()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::last() masks data.table::last()
## ✖ purrr::lift() masks caret::lift()
## ✖ randomForest::margin() masks ggplot2::margin()
## ✖ purrr::transpose() masks data.table::transpose()
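library(rpart)
library(rpart.plot)
library(caret)  # provides confusionMatrix()

# fit a classification tree predicting deposit from all other variables
classifier <- rpart(formula = deposit ~ ., data = train, method = "class")
# plot the tree
rpart.plot(classifier)
# https://www.rdocumentation.org/packages/rpart.plot/versions/3.0.8/topics/prp
# type = 0|5
prp(classifier, type = 2, extra = 104, fallen.leaves = TRUE, main = "Decision Tree")

# prediction on the test set
pred <- predict(classifier, test, type = "class")
# Note: confusionMatrix() takes (data = predictions, reference = true labels).
# The arguments are passed here in the opposite order, so in the output the
# "Prediction" rows are the true labels and the "Reference" columns the predictions.
confusionMatrix(as.factor(test$deposit), as.factor(pred))
## Confusion Matrix and Statistics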
##
##           Reference
## Prediction   0   1
##          0 849 355
##          1  94 935
##
## Accuracy : 0.7989
## 95% CI : (0.7817, 0.8154)
## No Information Rate : 0.5777
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6027
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9003
## Specificity : 0.7248
## Pos Pred Value : 0.7051
## Neg Pred Value : 0.9086
## Prevalence : 0.4223
## Detection Rate : 0.3802
## Detection Prevalence : 0.5392
## Balanced Accuracy : 0.8126
##
## 'Positive' Class : 0
##
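The headline statistics above can be recomputed by hand from the four cell counts, which is a useful check on how each one is defined. The sketch below rebuilds the table exactly as printed (rows labelled "Prediction", columns labelled "Reference", positive class 0); the variable names are illustrative only.

```r
# Confusion table as printed: rows = "Prediction", columns = "Reference"
cm <- matrix(c(849, 94, 355, 935), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)           # (849 + 935) / 2233
sensitivity <- cm["0", "0"] / sum(cm[, "0"])     # 849 / (849 + 94)
specificity <- cm["1", "1"] / sum(cm[, "1"])     # 935 / (355 + 935)
ppv         <- cm["0", "0"] / sum(cm["0", ])     # 849 / (849 + 355)
npv         <- cm["1", "1"] / sum(cm["1", ])     # 935 / (94 + 935)
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv, npv = npv,
        balanced = balanced), 4)
```

These reproduce the reported 0.7989, 0.9003, 0.7248, 0.7051, 0.9086 and 0.8126. Because the call passes `test$deposit` as the first argument, the column totals used for sensitivity and specificity are totals of model predictions. With the true labels as the reference, the recall for class 0 would be 849 / 1204, which is the value reported above as Pos Pred Value.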