Bases desbalanceadas e Naive Bayes

Introdução

A ideia aqui é trabalhar com uma base desbalanceada e mostrar uma possível saída para que os modelos não sejam prejudicados por esse desbalanceamento. Será usado aqui o Naive Bayes como algoritmo mas poderia ser usado outro também.

Naive Bayes

O Naive Bayes é um algoritmo probabilistico, ou seja usa probabilidades para chegar aos resultados. Ele faz parte dos algoritmos de Machine Learning chamados supervisionados, ou seja, necessita de uma variável alvo para ser treinado.

Ele é derivado do teorema de bayes. Esse teorema calcula a probabilidade de um evento dado um conhecimento a priori e com isso fazemos inferências com eventos que ainda não conhecemos.

Usando notações de probabilidade o teorema diz que:

\[\begin{equation} P(A/B) = \frac{P(B/A)\times P(A)}{P(B)} \end{equation}\]

Pode-se olhar a fórmula da seguinte forma:

\[\begin{equation} \text probabilidade a posteriori = \frac{\text probabilidade condicional \times \text probabilidade a priori}{\text evidencia} \end{equation}\]

Vamos usar esse algoritmo para criar um modelo em uma base desbalanceada.

A base de dados que usaremos está em : - http://archive.ics.uci.edu/ml/datasets/Bank+Marketing#

Os dados referem-se a campanhas de marketing direto (telefonemas) de uma instituição bancária portuguesa. O objetivo da classificação é prever se o cliente fará um depósito a prazo (variável y).

Conhecendo a base de dados

# variaveis
head(bank_data)

##   age          job marital education default balance housing loan contact day
## 1  58   management married  tertiary      no    2143     yes   no unknown   5
## 2  44   technician  single secondary      no      29     yes   no unknown   5
## 3  33 entrepreneur married secondary      no       2     yes  yes unknown   5
## 4  47  blue-collar married   unknown      no    1506     yes   no unknown   5
## 5  33      unknown  single   unknown      no       1      no   no unknown   5
## 6  35   management married  tertiary      no     231     yes   no unknown   5
##   month duration campaign pdays previous poutcome  y
## 1   may      261        1    -1        0  unknown no
## 2   may      151        1    -1        0  unknown no
## 3   may       76        1    -1        0  unknown no
## 4   may       92        1    -1        0  unknown no
## 5   may      198        1    -1        0  unknown no
## 6   may      139        1    -1        0  unknown no

#dimensão
dim(bank_data)

## [1] 45211    17

#estrutura
glimpse(bank_data)

## Rows: 45,211
## Columns: 17
## $ age       <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 5...
## $ job       <fct> management, technician, entrepreneur, blue-collar, unknow...
## $ marital   <fct> married, single, married, married, single, married, singl...
## $ education <fct> tertiary, secondary, secondary, unknown, unknown, tertiar...
## $ default   <fct> no, no, no, no, no, no, no, yes, no, no, no, no, no, no, ...
## $ balance   <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6,...
## $ housing   <fct> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes...
## $ loan      <fct> no, no, yes, no, no, no, yes, no, no, no, no, no, no, no,...
## $ contact   <fct> unknown, unknown, unknown, unknown, unknown, unknown, unk...
## $ day       <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ month     <fct> may, may, may, may, may, may, may, may, may, may, may, ma...
## $ duration  <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 5...
## $ campaign  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ pdays     <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -...
## $ previous  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ poutcome  <fct> unknown, unknown, unknown, unknown, unknown, unknown, unk...
## $ y         <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, n...

Vamos observa nossa variável alvo:

bank_data %>% count(y)

##     y     n
## 1  no 39922
## 2 yes  5289

Perceba que ela é desbalanceada, e isso pode prejudicar os algoritmos de previsão.

Analisando a variável alvo com as variaveis categoricas

Para isso, vamos observar primeiramente de forma tabular:

round(prop.table(table(bank_data$job, bank_data$y, dnn=c('job', 'target')))*100,2)

##                target
## job                no   yes
##   admin.        10.04  1.40
##   blue-collar   19.96  1.57
##   entrepreneur   3.02  0.27
##   housemaid      2.50  0.24
##   management    18.04  2.88
##   retired        3.87  1.14
##   self-employed  3.08  0.41
##   services       8.37  0.82
##   student        1.48  0.59
##   technician    14.95  1.86
##   unemployed     2.44  0.45
##   unknown        0.56  0.08

round(prop.table(table(bank_data$marital,bank_data$y, dnn = c("Marital","Target")),1)*100,2)

##           Target
## Marital       no   yes
##   divorced 88.05 11.95
##   married  89.88 10.12
##   single   85.05 14.95

round(prop.table(table(bank_data$education,bank_data$y, dnn = c("Education","Target")),1)*100,2)

##            Target
## Education      no   yes
##   primary   91.37  8.63
##   secondary 89.44 10.56
##   tertiary  84.99 15.01
##   unknown   86.43 13.57

round(prop.table(table(bank_data$default,bank_data$y, dnn = c("Default","Target")),1)*100,2)

##        Target
## Default    no   yes
##     no  88.20 11.80
##     yes 93.62  6.38

round(prop.table(table(bank_data$housing,bank_data$y, dnn = c("Housing","Target")),1)*100,2)

##        Target
## Housing   no  yes
##     no  83.3 16.7
##     yes 92.3  7.7

round(prop.table(table(bank_data$loan,bank_data$y, dnn = c("Loan","Target")),1)*100,2)

##      Target
## Loan     no   yes
##   no  87.34 12.66
##   yes 93.32  6.68

round(prop.table(table(bank_data$contact,bank_data$y, dnn = c("Contact","Target")),1)*100,2)

##            Target
## Contact        no   yes
##   cellular  85.08 14.92
##   telephone 86.58 13.42
##   unknown   95.93  4.07

round(prop.table(table(bank_data$month,bank_data$y, dnn = c("Month","Target")),1)*100,2)

##      Target
## Month    no   yes
##   apr 80.32 19.68
##   aug 88.99 11.01
##   dec 53.27 46.73
##   feb 83.35 16.65
##   jan 89.88 10.12
##   jul 90.91  9.09
##   jun 89.78 10.22
##   mar 48.01 51.99
##   may 93.28  6.72
##   nov 89.85 10.15
##   oct 56.23 43.77
##   sep 53.54 46.46

round(prop.table(table(bank_data$poutcome ,bank_data$y, dnn = c("Poutcome","Target")),1)*100,2)

##          Target
## Poutcome     no   yes
##   failure 87.39 12.61
##   other   83.32 16.68
##   success 35.27 64.73
##   unknown 90.84  9.16

Observando essas variaveis e ignorando por hora o desbalanceamento, parece que existe associação entre elas e a variável alvo, ou seja elas podem ter forte poder preditivo. Podemos ver isso também graficamente:

job<- bank_data %>% count(job,y)

ggplot(job, aes(x = job, y = n )) +
        geom_bar(aes(fill = job), stat = "identity", color = "white") +
        scale_y_discrete(name = "Frequencia") + scale_x_discrete(name = "Job") +  ggtitle("Job") +
        facet_wrap(~y) + theme_gray() + 
        theme(axis.text.x = element_blank(), axis.ticks = element_blank())

Podemos ver também todas as outras variáveis categóricas.

Vamos preparar as variáveis para os gráficos:

marital <- bank_data %>% count(marital, y)

education <- bank_data %>% count(education, y)

default <- bank_data %>% count(default, y)

housing <- bank_data %>% count(housing, y)

loan <- bank_data %>% count(loan, y)

contact <- bank_data %>% count(contact, y)

month <- bank_data %>% count(month, y)

poutcome <- bank_data %>% count(poutcome, y)

bp_marital <- ggplot(marital, aes(x = marital, y = n )) +
        geom_bar(aes(fill = marital), stat = "identity", color = "white") +
        scale_y_discrete(name = "Frequencia") +
        scale_x_discrete(name = "Marital") +
        ggtitle("Marital") +
        facet_wrap(~y) + 
        theme_gray() +
        theme(axis.text.x = element_text(angle = 90),plot.title = element_text(hjust = 0.5))

bp_education <- ggplot(education, aes(x = education, y = n )) +
        geom_bar(aes(fill = education), stat = "identity", color = "white") +
        scale_y_discrete(name = "Frequencia") +
        scale_x_discrete(name = "education") +
        ggtitle("Education") +
        facet_wrap(~y) + 
        theme_gray() + 
        theme(axis.text.x = element_text(angle = 90),plot.title = element_text(hjust = 0.5))

bp_default <- ggplot(default, aes(x = default, y = n )) +
        geom_bar(aes(fill = default), stat = "identity", color = "white") +
        scale_y_discrete(name = "Frequencia") +
        scale_x_discrete(name = "default") +
        ggtitle("Default") +
        facet_wrap(~y) + 
        theme_gray() + 
        theme(axis.text.x = element_text(angle = 90),plot.title = element_text(hjust = 0.5))

bp_housing <- ggplot(housing, aes(x = housing, y = n )) +
        geom_bar(aes(fill = housing), stat = "identity", color = "white") +
        scale_y_discrete(name = "Frequencia") +
        scale_x_discrete(name = "housing") +
        ggtitle("Housing") +
        facet_wrap(~y) + 
        theme_gray() + 
        theme(axis.text.x = element_text(angle = 90),plot.title = element_text(hjust = 0.5))

bp_loan <- ggplot(loan, aes(x = loan, y = n )) +
        geom_bar(aes(fill = loan), stat = "identity", color = "white") +
        scale_y_discrete(name = "Frequencia") +
        scale_x_discrete(name = "loan") +
        ggtitle("Loan") +
        facet_wrap(~y) + 
        theme_gray() + 
        theme(axis.text.x = element_text(angle = 90),plot.title = element_text(hjust = 0.5))

bp_contact <- ggplot(contact, aes(x = contact, y = n )) +
        geom_bar(aes(fill = contact), stat = "identity", color = "white") +
        scale_y_discrete(name = "Frequencia") +
        scale_x_discrete(name = "contact") +
        ggtitle("Contact") +
        facet_wrap(~y) + 
        theme_gray() + 
        theme(axis.text.x = element_text(angle = 90),plot.title = element_text(hjust = 0.5))

bp_month <- ggplot(month, aes(x = month, y = n )) +
        geom_bar(aes(fill = month), stat = "identity", color = "white") +
        scale_y_discrete(name = "Frequencia") +
        scale_x_discrete(name = "month") +
        ggtitle("month") +
        facet_wrap(~y) + 
        theme_gray() + 
        theme(axis.text.x = element_text(angle = 90),plot.title = element_text(hjust = 0.5))

bp_poutcome <- ggplot(poutcome, aes(x = poutcome, y = n )) +
        geom_bar(aes(fill = poutcome), stat = "identity", color = "white") +
        scale_y_discrete(name = "Frequencia") +
        scale_x_discrete(name = "poutcome") +
        ggtitle("Poutcome") +
        facet_wrap(~y) + 
        theme_gray() + 
        theme(axis.text.x = element_text(angle = 90),plot.title = element_text(hjust = 0.5))

library(gridExtra)

## Warning: package 'gridExtra' was built under R version 3.6.3

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

grid.arrange(bp_marital, bp_education, bp_default, bp_housing
             , nrow = 2, ncol = 2)

grid.arrange(bp_loan, bp_contact, bp_month, bp_poutcome
             , nrow = 2, ncol = 2)

Visualizaçao das variaveis continuas

bp_age <- ggplot(bank_data, aes(x = y, y = age)) +
        geom_boxplot(fill = "#228B22", colour = "#1F3552", alpha = 0.6) +
        scale_y_continuous(name = "Age") +
        scale_x_discrete(name = "Target") +
        ggtitle("Age") +
        theme_gray() + 
        theme(plot.title = element_text(hjust = 0.5))

bp_balance <- ggplot(bank_data, aes(x = y, y = balance )) +
        geom_boxplot(fill = "#4271AE", colour = "#1F3552", alpha = 0.6) +
        scale_y_continuous(name = "Balance") +
        scale_x_discrete(name = "Target") +
        ggtitle("Balance") +
        theme_gray() + 
        theme(plot.title = element_text(hjust = 0.5))

bp_day <- ggplot(bank_data, aes(x = y, y = day )) +
        geom_boxplot(fill = "#4271AE", colour = "#1F3552", alpha = 0.6) +
        scale_y_continuous(name = "Day") +
        scale_x_discrete(name = "Target") +
        ggtitle("Day") +
        theme_gray() + 
        theme(plot.title = element_text(hjust = 0.5))

bp_duration <- ggplot(bank_data, aes(x = y, y = duration )) +
        geom_boxplot(fill = "#4271AE", colour = "#1F3552", alpha = 0.6) +
        scale_y_continuous(name = "Duration") +
        scale_x_discrete(name = "Target") +
        ggtitle("Duration") +
        theme_gray() + 
        theme(plot.title = element_text(hjust = 0.5))

bp_campaign <- ggplot(bank_data, aes(x = y, y = campaign )) +
        geom_boxplot(fill = "#4271AE", colour = "#1F3552", alpha = 0.6) +
        scale_y_continuous(name = "Campaign") +
        scale_x_discrete(name = "Target") +
        ggtitle("Campaign") +
        theme_gray() + 
        theme(plot.title = element_text(hjust = 0.5))

bp_pdays <- ggplot(bank_data, aes(x = y, y = pdays )) +
        geom_boxplot(fill = "#4271AE", colour = "#1F3552", alpha = 0.6) +
        scale_y_continuous(name = "pdays") +
        scale_x_discrete(name = "Target") +
        ggtitle("Pdays") +
        theme_gray() + 
        theme(plot.title = element_text(hjust = 0.5))

bp_previous <- ggplot(bank_data, aes(x = y, y = previous )) +
        geom_boxplot(fill = "#4271AE", colour = "#1F3552", alpha = 0.6) +
        scale_y_continuous(name = "Previous") +
        scale_x_discrete(name = "Target") +
        ggtitle("Previous") +
        theme_gray() + 
        theme(plot.title = element_text(hjust = 0.5))

library(ggpubr)

## Warning: package 'ggpubr' was built under R version 3.6.3

library(gridExtra)

grid.arrange(bp_age, bp_balance, bp_day, bp_duration
             ,bp_campaign, bp_pdays, bp_previous
             , nrow = 2, ncol = 4)

Modelo 1

library(caret)

## Warning: package 'caret' was built under R version 3.6.3

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

set.seed(42) # semente alaeatoria para reproducao dos mesmos resultados

partic <- createDataPartition(bank_data$y, p = 0.7, list = FALSE) #separacao em treino e teste

train1 <- bank_data[partic, ]  #treino

test1 <- bank_data[-partic, ] #teste

dim(train1)

## [1] 31649    17

dim(test1)

## [1] 13562    17

Treinamento do Modelo 1

library(e1071)

## Warning: package 'e1071' was built under R version 3.6.3

# Treinamento do modelo 1

nb1 <- naiveBayes(y ~ ., data = train1 , laplace = 1)

print(nb1)

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##        no       yes 
## 0.8829979 0.1170021 
## 
## Conditional probabilities:
##      age
## Y         [,1]     [,2]
##   no  40.81518 10.14110
##   yes 41.60897 13.40696
## 
##      job
## Y          admin. blue-collar entrepreneur   housemaid  management     retired
##   no  0.113098219 0.227019100  0.033729165 0.028972029 0.205808713 0.043171901
##   yes 0.118169583 0.132436070  0.021534320 0.019919246 0.249259758 0.095558546
##      job
## Y     self-employed    services     student  technician  unemployed     unknown
##   no    0.034158380 0.095214250 0.016274412 0.169075041 0.026969025 0.006509765
##   yes   0.035800808 0.066218035 0.050605653 0.162045760 0.041453567 0.006998654
## 
##      marital
## Y      divorced   married    single
##   no  0.1161759 0.6100755 0.2737486
##   yes 0.1184566 0.5153805 0.3661630
## 
##      education
## Y        primary  secondary   tertiary    unknown
##   no  0.15735242 0.51871199 0.28382826 0.04010733
##   yes 0.11302940 0.46398705 0.37712436 0.04585919
## 
##      default
## Y              no         yes
##   no  0.981250895 0.018749105
##   yes 0.990553306 0.009446694
## 
##      balance
## Y         [,1]     [,2]
##   no  1298.774 3024.689
##   yes 1780.023 3165.041
## 
##      housing
## Y            no       yes
##   no  0.4183126 0.5816874
##   yes 0.6364372 0.3635628
## 
##      loan
## Y             no        yes
##   no  0.83079290 0.16920710
##   yes 0.90823212 0.09176788
## 
##      contact
## Y       cellular  telephone    unknown
##   no  0.62145336 0.06332964 0.31521700
##   yes 0.82514841 0.07501349 0.09983810
## 
##      day
## Y         [,1]     [,2]
##   no  15.86413 8.284102
##   yes 15.23386 8.569881
## 
##      month
## Y             apr         aug         dec         feb         jan         jul
##   no  0.059625152 0.139709564 0.002968739 0.055118392 0.031153874 0.156234352
##   yes 0.108748318 0.127590848 0.019111709 0.081830417 0.027187079 0.116823688
##      month
## Y             jun         mar         may         nov         oct         sep
##   no  0.121539452 0.005436727 0.320480721 0.090170971 0.009871951 0.007690107
##   yes 0.106325707 0.046298789 0.172812921 0.076985195 0.065679677 0.050605653
## 
##      duration
## Y         [,1]     [,2]
##   no  220.8359 207.7037
##   yes 532.9714 390.5589
## 
##      campaign
## Y         [,1]     [,2]
##   no  2.851428 3.221690
##   yes 2.098569 1.876453
## 
##      pdays
## Y         [,1]      [,2]
##   no  36.30863  96.80896
##   yes 69.63948 119.48881
## 
##      previous
## Y          [,1]     [,2]
##   no  0.4984255 2.408017
##   yes 1.1763435 2.457079
## 
##      poutcome
## Y        failure      other    success    unknown
##   no  0.10719141 0.03817531 0.01305903 0.84157424
##   yes 0.11896412 0.05988670 0.18586458 0.63528460

Observe que o argumento acima laplace=1 é usado para evitar o caso quando o modelo resulta em probablidade zero, ou seja, quando ele analisa um evento que não está na base de dados. E porque isso é ruim, isso é ruim pois como o algoritmo multiplica probabilidades, se uma for zero toda a multiplicação será zero!

Previsão na base de treinamento do Modelo 1

# Previsao na base de treinamento

nb_train_pred1 <- predict(nb1, train1, type = "class")

confusionMatrix(nb_train_pred1, train1$y, positive = 'yes') # para dizer que a minha classe positiva

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  25849  1681
##        yes  2097  2022
##                                          
##                Accuracy : 0.8806         
##                  95% CI : (0.877, 0.8842)
##     No Information Rate : 0.883          
##     P-Value [Acc > NIR] : 0.9064         
##                                          
##                   Kappa : 0.4491         
##                                          
##  Mcnemar's Test P-Value : 1.461e-11      
##                                          
##             Sensitivity : 0.54604        
##             Specificity : 0.92496        
##          Pos Pred Value : 0.49090        
##          Neg Pred Value : 0.93894        
##              Prevalence : 0.11700        
##          Detection Rate : 0.06389        
##    Detection Prevalence : 0.13015        
##       Balanced Accuracy : 0.73550        
##                                          
##        'Positive' Class : yes            
##

#é o yes(positive = 'yes')

O modelo teve 88% de acurácia, o que é uma informação importante, porém isso não basta para termos um bom modelo. Porém para avaliar os que aceitaram a campanha temos que olhar para o “sensitivity”, ou seja o modelo só está certando 55% para esses, é o quase como se jogássemos cara ou coroa. Resumindo apesar de 88% de acurácia esse modelo é uma fraude, pois sua base é desbalanceada prejudicando o modelo.

Obs: “sensitivity” - mede a proporção de positivos que são identificados corretamente

Previsão na base de teste do Modelo 1

# Previsao na base de teste

nb_test_pred1 <- predict(nb1, test1, type = "class")

confusionMatrix(nb_test_pred1, test1$y, positive = 'yes')

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  11041   736
##        yes   935   850
##                                           
##                Accuracy : 0.8768          
##                  95% CI : (0.8711, 0.8823)
##     No Information Rate : 0.8831          
##     P-Value [Acc > NIR] : 0.9884          
##                                           
##                   Kappa : 0.4342          
##                                           
##  Mcnemar's Test P-Value : 1.274e-06       
##                                           
##             Sensitivity : 0.53594         
##             Specificity : 0.92193         
##          Pos Pred Value : 0.47619         
##          Neg Pred Value : 0.93751         
##              Prevalence : 0.11694         
##          Detection Rate : 0.06268         
##    Detection Prevalence : 0.13162         
##       Balanced Accuracy : 0.72893         
##                                           
##        'Positive' Class : yes             
##

Ocorre basicamente a mesma coisa que na base de treinamento, o modelo não pode ser usado para fazer previsões para saber quem aceitará a campanha. Precisamos balancear essa base de dados.

Modelo 2 - Trabalhando a base desbalanceada

nb_test_pred1 <- predict(nb1, test1, type = "class")

confusionMatrix(nb_test_pred1, test1$y, positive = 'yes')

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  11041   736
##        yes   935   850
##                                           
##                Accuracy : 0.8768          
##                  95% CI : (0.8711, 0.8823)
##     No Information Rate : 0.8831          
##     P-Value [Acc > NIR] : 0.9884          
##                                           
##                   Kappa : 0.4342          
##                                           
##  Mcnemar's Test P-Value : 1.274e-06       
##                                           
##             Sensitivity : 0.53594         
##             Specificity : 0.92193         
##          Pos Pred Value : 0.47619         
##          Neg Pred Value : 0.93751         
##              Prevalence : 0.11694         
##          Detection Rate : 0.06268         
##    Detection Prevalence : 0.13162         
##       Balanced Accuracy : 0.72893         
##                                           
##        'Positive' Class : yes             
##

Oversampling

Para trabalhar a base desbalanceada vamos utilizar o oversampling - Oversampling aleatório envolve suplementar os dados de treinamento com várias cópias de algumas das classes minoritárias; Ou seja sua base é aumentada como citado até que ela deixe de ser desbalanceada. - Esse é um dos métodos mais antigos propostos, que também é comprovadamente robusto; - Em vez de duplicar todas as amostras da classe minoritária, algumas delas podem ser escolhidas aleatoriamente, com substituição.

bank_data_oversample<- upSample(bank_data,bank_data$y) #base e classe  desbalanceada
str(bank_data_oversample)

## 'data.frame':    79844 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Class    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Perceba como a base aumentou, antes tinha 45211 e com o oversample ficou 79844

Além disso, vejamos a classe desbalanceada como ficou:

bank_data_oversample %>% count(Class)

##   Class     n
## 1    no 39922
## 2   yes 39922

Agora esta equilibrada!

Vamos agora separar essa nova base, agora balanceada em treinamento e teste

set.seed(42)

partc2 <- createDataPartition(bank_data_oversample$Class, p = 0.7, list = FALSE)

train2 <- bank_data_oversample[partc2, ]

test2 <- bank_data_oversample[-partc2, ]

dim(train2)

## [1] 55892    17

dim(test2)

## [1] 23952    17

Treinamento do modelo 2

# Treinamento do modelo 2

nb2 <- naiveBayes(Class ~ .
                  , data = train2)

Previsao na base de treinamento do modelo 2

# Previsao na base de treinamento 2

nb_train_pred2 <- predict(nb2, train2, type = "class")

confusionMatrix(nb_train_pred2, train2$Class, positive = 'yes')

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  22071  6027
##        yes  5875 21919
##                                           
##                Accuracy : 0.7871          
##                  95% CI : (0.7836, 0.7904)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.5741          
##                                           
##  Mcnemar's Test P-Value : 0.1663          
##                                           
##             Sensitivity : 0.7843          
##             Specificity : 0.7898          
##          Pos Pred Value : 0.7886          
##          Neg Pred Value : 0.7855          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3922          
##    Detection Prevalence : 0.4973          
##       Balanced Accuracy : 0.7871          
##                                           
##        'Positive' Class : yes             
##

Observe agora que o “sensitivity” melhorou bastante, 78%, já é um modelo bem razoável. Vamos verificar agora na base de teste:

Previsao na base de teste do modelo 2

# Previsao na base de teste 2
 
nb_test_pred2 <- predict(nb2, test2, type = "class")

confusionMatrix(nb_test_pred2, test2$Class, positive = 'yes')

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  9390 2703
##        yes 2586 9273
##                                           
##                Accuracy : 0.7792          
##                  95% CI : (0.7739, 0.7844)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.5584          
##                                           
##  Mcnemar's Test P-Value : 0.1107          
##                                           
##             Sensitivity : 0.7743          
##             Specificity : 0.7841          
##          Pos Pred Value : 0.7819          
##          Neg Pred Value : 0.7765          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3871          
##    Detection Prevalence : 0.4951          
##       Balanced Accuracy : 0.7792          
##                                           
##        'Positive' Class : yes             
##

Aqui o “sensitivity” também melhorou bastante, 77%, além disso tem performance próxima a que ocoorreu na base de treinamento, mostarndo estabilildade do modelo.

Porém agora temos que testar o modelo treinado na base de teste real, lembre que testamos acima na base aumentada pelo oversampling:

Previsao na base de teste original

# Previsao na base de teste 1 - dados sem oversampling
 
nb_test_pred_no_over <- predict(nb2, test1, type = "class")

confusionMatrix(nb_test_pred_no_over, test1$y, positive = 'yes')

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  9390  373
##        yes 2586 1213
##                                           
##                Accuracy : 0.7818          
##                  95% CI : (0.7748, 0.7887)
##     No Information Rate : 0.8831          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3419          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.76482         
##             Specificity : 0.78407         
##          Pos Pred Value : 0.31929         
##          Neg Pred Value : 0.96179         
##              Prevalence : 0.11694         
##          Detection Rate : 0.08944         
##    Detection Prevalence : 0.28012         
##       Balanced Accuracy : 0.77444         
##                                           
##        'Positive' Class : yes             
##

Aqui o “sensitivity” também melhorou bastante, 76%. Ou seja, o oversampling melhorou em muito o modelo! Era isso!

Keep calm and analysing data!