The idea here is to work with an imbalanced dataset and show one possible way to keep models from being hurt by that imbalance. Naive Bayes will be used as the algorithm, but another one could be used as well.
Naive Bayes is a probabilistic algorithm, i.e., it uses probabilities to reach its results. It belongs to the family of supervised Machine Learning algorithms, meaning it requires a target variable for training.
It is derived from Bayes' theorem. This theorem computes the probability of an event given prior knowledge, and with it we can make inferences about events we have not yet observed.
In probability notation, the theorem states:
\[\begin{equation} P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)} \end{equation}\]
The formula can be read as follows:
\[\begin{equation} \text{posterior probability} = \frac{\text{conditional probability} \times \text{prior probability}}{\text{evidence}} \end{equation}\]
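A quick worked example with made-up numbers (not from the dataset below): let A = "the customer accepts the campaign" and B = "the customer was contacted by cellular".
p_a <- 0.10 # hypothetical prior: 10% of customers accept
p_b_given_a <- 0.80 # hypothetical: 80% of acceptors were contacted by cellular
p_b <- 0.60 # hypothetical: 60% of all customers were contacted by cellular
p_b_given_a * p_a / p_b # Bayes: P(A|B) = 0.1333; observing B raises P(A) from 10% to ~13%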
Let's use this algorithm to build a model on an imbalanced dataset.
The dataset we will use is available at: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing#
The data concern direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe a term deposit (variable y).
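All the code below assumes the data have already been read into bank_data. A minimal loading sketch (assuming the file bank-full.csv from the UCI archive, which uses ";" as separator, has been downloaded and unzipped into the working directory):
library(dplyr) # provides %>%, count() and glimpse() used below
library(ggplot2) # used for the plots further down
bank_data <- read.csv("bank-full.csv", sep = ";", stringsAsFactors = TRUE) # strings become factors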
# variables
head(bank_data)
## age job marital education default balance housing loan contact day
## 1 58 management married tertiary no 2143 yes no unknown 5
## 2 44 technician single secondary no 29 yes no unknown 5
## 3 33 entrepreneur married secondary no 2 yes yes unknown 5
## 4 47 blue-collar married unknown no 1506 yes no unknown 5
## 5 33 unknown single unknown no 1 no no unknown 5
## 6 35 management married tertiary no 231 yes no unknown 5
## month duration campaign pdays previous poutcome y
## 1 may 261 1 -1 0 unknown no
## 2 may 151 1 -1 0 unknown no
## 3 may 76 1 -1 0 unknown no
## 4 may 92 1 -1 0 unknown no
## 5 may 198 1 -1 0 unknown no
## 6 may 139 1 -1 0 unknown no
# dimensions
dim(bank_data)
## [1] 45211 17
# structure
glimpse(bank_data)
## Rows: 45,211
## Columns: 17
## $ age <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 5...
## $ job <fct> management, technician, entrepreneur, blue-collar, unknow...
## $ marital <fct> married, single, married, married, single, married, singl...
## $ education <fct> tertiary, secondary, secondary, unknown, unknown, tertiar...
## $ default <fct> no, no, no, no, no, no, no, yes, no, no, no, no, no, no, ...
## $ balance <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6,...
## $ housing <fct> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes...
## $ loan <fct> no, no, yes, no, no, no, yes, no, no, no, no, no, no, no,...
## $ contact <fct> unknown, unknown, unknown, unknown, unknown, unknown, unk...
## $ day <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ month <fct> may, may, may, may, may, may, may, may, may, may, may, ma...
## $ duration <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 5...
## $ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -...
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ poutcome <fct> unknown, unknown, unknown, unknown, unknown, unknown, unk...
## $ y <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, n...
Let's take a look at our target variable:
bank_data %>% count(y)
## y n
## 1 no 39922
## 2 yes 5289
Notice that it is imbalanced: only about 11.7% of the observations are "yes", and this can hurt predictive algorithms.
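We can quantify the imbalance directly (the percentages follow from the counts above):
round(prop.table(table(bank_data$y)) * 100, 1) # no: 88.3, yes: 11.7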
With that in mind, let's first look at how each predictor relates to the target in tabular form. Note that the first table (job) shows joint percentages over the whole dataset, while the subsequent ones use margin 1 in prop.table() and therefore show row percentages within each category:
round(prop.table(table(bank_data$job, bank_data$y, dnn=c('job', 'target')))*100,2)
## target
## job no yes
## admin. 10.04 1.40
## blue-collar 19.96 1.57
## entrepreneur 3.02 0.27
## housemaid 2.50 0.24
## management 18.04 2.88
## retired 3.87 1.14
## self-employed 3.08 0.41
## services 8.37 0.82
## student 1.48 0.59
## technician 14.95 1.86
## unemployed 2.44 0.45
## unknown 0.56 0.08
round(prop.table(table(bank_data$marital,bank_data$y, dnn = c("Marital","Target")),1)*100,2)
## Target
## Marital no yes
## divorced 88.05 11.95
## married 89.88 10.12
## single 85.05 14.95
round(prop.table(table(bank_data$education,bank_data$y, dnn = c("Education","Target")),1)*100,2)
## Target
## Education no yes
## primary 91.37 8.63
## secondary 89.44 10.56
## tertiary 84.99 15.01
## unknown 86.43 13.57
round(prop.table(table(bank_data$default,bank_data$y, dnn = c("Default","Target")),1)*100,2)
## Target
## Default no yes
## no 88.20 11.80
## yes 93.62 6.38
round(prop.table(table(bank_data$housing,bank_data$y, dnn = c("Housing","Target")),1)*100,2)
## Target
## Housing no yes
## no 83.3 16.7
## yes 92.3 7.7
round(prop.table(table(bank_data$loan,bank_data$y, dnn = c("Loan","Target")),1)*100,2)
## Target
## Loan no yes
## no 87.34 12.66
## yes 93.32 6.68
round(prop.table(table(bank_data$contact,bank_data$y, dnn = c("Contact","Target")),1)*100,2)
## Target
## Contact no yes
## cellular 85.08 14.92
## telephone 86.58 13.42
## unknown 95.93 4.07
round(prop.table(table(bank_data$month,bank_data$y, dnn = c("Month","Target")),1)*100,2)
## Target
## Month no yes
## apr 80.32 19.68
## aug 88.99 11.01
## dec 53.27 46.73
## feb 83.35 16.65
## jan 89.88 10.12
## jul 90.91 9.09
## jun 89.78 10.22
## mar 48.01 51.99
## may 93.28 6.72
## nov 89.85 10.15
## oct 56.23 43.77
## sep 53.54 46.46
round(prop.table(table(bank_data$poutcome ,bank_data$y, dnn = c("Poutcome","Target")),1)*100,2)
## Target
## Poutcome no yes
## failure 87.39 12.61
## other 83.32 16.68
## success 35.27 64.73
## unknown 90.84 9.16
Looking at these variables, and ignoring the imbalance for now, there seems to be an association between them and the target variable; in other words, they may have strong predictive power. We can also see this graphically:
job <- bank_data %>% count(job, y)
ggplot(job, aes(x = job, y = n)) +
  geom_bar(aes(fill = job), stat = "identity", color = "white") +
  scale_y_continuous(name = "Frequency") + scale_x_discrete(name = "Job") + ggtitle("Job") +
  facet_wrap(~y) + theme_gray() +
  theme(axis.text.x = element_blank(), axis.ticks = element_blank())
We can do the same for all the other categorical variables.
Let's prepare the variables for the plots:
marital <- bank_data %>% count(marital, y)
education <- bank_data %>% count(education, y)
default <- bank_data %>% count(default, y)
housing <- bank_data %>% count(housing, y)
loan <- bank_data %>% count(loan, y)
contact <- bank_data %>% count(contact, y)
month <- bank_data %>% count(month, y)
poutcome <- bank_data %>% count(poutcome, y)
bp_marital <- ggplot(marital, aes(x = marital, y = n)) +
  geom_bar(aes(fill = marital), stat = "identity", color = "white") +
  scale_y_continuous(name = "Frequency") +
  scale_x_discrete(name = "Marital") +
  ggtitle("Marital") +
  facet_wrap(~y) +
  theme_gray() +
  theme(axis.text.x = element_text(angle = 90), plot.title = element_text(hjust = 0.5))
bp_education <- ggplot(education, aes(x = education, y = n)) +
  geom_bar(aes(fill = education), stat = "identity", color = "white") +
  scale_y_continuous(name = "Frequency") +
  scale_x_discrete(name = "Education") +
  ggtitle("Education") +
  facet_wrap(~y) +
  theme_gray() +
  theme(axis.text.x = element_text(angle = 90), plot.title = element_text(hjust = 0.5))
bp_default <- ggplot(default, aes(x = default, y = n)) +
  geom_bar(aes(fill = default), stat = "identity", color = "white") +
  scale_y_continuous(name = "Frequency") +
  scale_x_discrete(name = "Default") +
  ggtitle("Default") +
  facet_wrap(~y) +
  theme_gray() +
  theme(axis.text.x = element_text(angle = 90), plot.title = element_text(hjust = 0.5))
bp_housing <- ggplot(housing, aes(x = housing, y = n)) +
  geom_bar(aes(fill = housing), stat = "identity", color = "white") +
  scale_y_continuous(name = "Frequency") +
  scale_x_discrete(name = "Housing") +
  ggtitle("Housing") +
  facet_wrap(~y) +
  theme_gray() +
  theme(axis.text.x = element_text(angle = 90), plot.title = element_text(hjust = 0.5))
bp_loan <- ggplot(loan, aes(x = loan, y = n)) +
  geom_bar(aes(fill = loan), stat = "identity", color = "white") +
  scale_y_continuous(name = "Frequency") +
  scale_x_discrete(name = "Loan") +
  ggtitle("Loan") +
  facet_wrap(~y) +
  theme_gray() +
  theme(axis.text.x = element_text(angle = 90), plot.title = element_text(hjust = 0.5))
bp_contact <- ggplot(contact, aes(x = contact, y = n)) +
  geom_bar(aes(fill = contact), stat = "identity", color = "white") +
  scale_y_continuous(name = "Frequency") +
  scale_x_discrete(name = "Contact") +
  ggtitle("Contact") +
  facet_wrap(~y) +
  theme_gray() +
  theme(axis.text.x = element_text(angle = 90), plot.title = element_text(hjust = 0.5))
bp_month <- ggplot(month, aes(x = month, y = n)) +
  geom_bar(aes(fill = month), stat = "identity", color = "white") +
  scale_y_continuous(name = "Frequency") +
  scale_x_discrete(name = "Month") +
  ggtitle("Month") +
  facet_wrap(~y) +
  theme_gray() +
  theme(axis.text.x = element_text(angle = 90), plot.title = element_text(hjust = 0.5))
bp_poutcome <- ggplot(poutcome, aes(x = poutcome, y = n)) +
  geom_bar(aes(fill = poutcome), stat = "identity", color = "white") +
  scale_y_continuous(name = "Frequency") +
  scale_x_discrete(name = "Poutcome") +
  ggtitle("Poutcome") +
  facet_wrap(~y) +
  theme_gray() +
  theme(axis.text.x = element_text(angle = 90), plot.title = element_text(hjust = 0.5))
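As an aside, the eight nearly identical blocks above could be generated by a small helper function (a sketch; plot_cat is a hypothetical name, and geom_bar's default counting stat replaces the count() plus stat = "identity" pair):
plot_cat <- function(df, var) {
  ggplot(df, aes(x = .data[[var]], fill = .data[[var]])) +
    geom_bar(color = "white") + # counts rows per level by default
    labs(x = var, y = "Frequency", title = var) +
    facet_wrap(~y) +
    theme_gray() +
    theme(axis.text.x = element_text(angle = 90), plot.title = element_text(hjust = 0.5))
}
# e.g. bp_marital <- plot_cat(bank_data, "marital")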
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.6.3
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
grid.arrange(bp_marital, bp_education, bp_default, bp_housing
, nrow = 2, ncol = 2)
grid.arrange(bp_loan, bp_contact, bp_month, bp_poutcome
, nrow = 2, ncol = 2)
bp_age <- ggplot(bank_data, aes(x = y, y = age)) +
geom_boxplot(fill = "#228B22", colour = "#1F3552", alpha = 0.6) +
scale_y_continuous(name = "Age") +
scale_x_discrete(name = "Target") +
ggtitle("Age") +
theme_gray() +
theme(plot.title = element_text(hjust = 0.5))
bp_balance <- ggplot(bank_data, aes(x = y, y = balance )) +
geom_boxplot(fill = "#4271AE", colour = "#1F3552", alpha = 0.6) +
scale_y_continuous(name = "Balance") +
scale_x_discrete(name = "Target") +
ggtitle("Balance") +
theme_gray() +
theme(plot.title = element_text(hjust = 0.5))
bp_day <- ggplot(bank_data, aes(x = y, y = day )) +
geom_boxplot(fill = "#4271AE", colour = "#1F3552", alpha = 0.6) +
scale_y_continuous(name = "Day") +
scale_x_discrete(name = "Target") +
ggtitle("Day") +
theme_gray() +
theme(plot.title = element_text(hjust = 0.5))
bp_duration <- ggplot(bank_data, aes(x = y, y = duration )) +
geom_boxplot(fill = "#4271AE", colour = "#1F3552", alpha = 0.6) +
scale_y_continuous(name = "Duration") +
scale_x_discrete(name = "Target") +
ggtitle("Duration") +
theme_gray() +
theme(plot.title = element_text(hjust = 0.5))
bp_campaign <- ggplot(bank_data, aes(x = y, y = campaign )) +
geom_boxplot(fill = "#4271AE", colour = "#1F3552", alpha = 0.6) +
scale_y_continuous(name = "Campaign") +
scale_x_discrete(name = "Target") +
ggtitle("Campaign") +
theme_gray() +
theme(plot.title = element_text(hjust = 0.5))
bp_pdays <- ggplot(bank_data, aes(x = y, y = pdays )) +
geom_boxplot(fill = "#4271AE", colour = "#1F3552", alpha = 0.6) +
scale_y_continuous(name = "pdays") +
scale_x_discrete(name = "Target") +
ggtitle("Pdays") +
theme_gray() +
theme(plot.title = element_text(hjust = 0.5))
bp_previous <- ggplot(bank_data, aes(x = y, y = previous )) +
geom_boxplot(fill = "#4271AE", colour = "#1F3552", alpha = 0.6) +
scale_y_continuous(name = "Previous") +
scale_x_discrete(name = "Target") +
ggtitle("Previous") +
theme_gray() +
theme(plot.title = element_text(hjust = 0.5))
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 3.6.3
library(gridExtra)
grid.arrange(bp_age, bp_balance, bp_day, bp_duration
,bp_campaign, bp_pdays, bp_previous
, nrow = 2, ncol = 4)
library(caret)
## Warning: package 'caret' was built under R version 3.6.3
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
set.seed(42) # random seed so the same results can be reproduced
partic <- createDataPartition(bank_data$y, p = 0.7, list = FALSE) # split into training and test
train1 <- bank_data[partic, ] # training set
test1 <- bank_data[-partic, ] # test set
dim(train1)
## [1] 31649 17
dim(test1)
## [1] 13562 17
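createDataPartition samples within each level of y, so both parts keep roughly the original 88.3%/11.7% class split; we can confirm it:
round(prop.table(table(train1$y)) * 100, 1)
round(prop.table(table(test1$y)) * 100, 1)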
library(e1071)
## Warning: package 'e1071' was built under R version 3.6.3
# Training model 1
nb1 <- naiveBayes(y ~ ., data = train1 , laplace = 1)
print(nb1)
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## no yes
## 0.8829979 0.1170021
##
## Conditional probabilities:
## age
## Y [,1] [,2]
## no 40.81518 10.14110
## yes 41.60897 13.40696
##
## job
## Y admin. blue-collar entrepreneur housemaid management retired
## no 0.113098219 0.227019100 0.033729165 0.028972029 0.205808713 0.043171901
## yes 0.118169583 0.132436070 0.021534320 0.019919246 0.249259758 0.095558546
## job
## Y self-employed services student technician unemployed unknown
## no 0.034158380 0.095214250 0.016274412 0.169075041 0.026969025 0.006509765
## yes 0.035800808 0.066218035 0.050605653 0.162045760 0.041453567 0.006998654
##
## marital
## Y divorced married single
## no 0.1161759 0.6100755 0.2737486
## yes 0.1184566 0.5153805 0.3661630
##
## education
## Y primary secondary tertiary unknown
## no 0.15735242 0.51871199 0.28382826 0.04010733
## yes 0.11302940 0.46398705 0.37712436 0.04585919
##
## default
## Y no yes
## no 0.981250895 0.018749105
## yes 0.990553306 0.009446694
##
## balance
## Y [,1] [,2]
## no 1298.774 3024.689
## yes 1780.023 3165.041
##
## housing
## Y no yes
## no 0.4183126 0.5816874
## yes 0.6364372 0.3635628
##
## loan
## Y no yes
## no 0.83079290 0.16920710
## yes 0.90823212 0.09176788
##
## contact
## Y cellular telephone unknown
## no 0.62145336 0.06332964 0.31521700
## yes 0.82514841 0.07501349 0.09983810
##
## day
## Y [,1] [,2]
## no 15.86413 8.284102
## yes 15.23386 8.569881
##
## month
## Y apr aug dec feb jan jul
## no 0.059625152 0.139709564 0.002968739 0.055118392 0.031153874 0.156234352
## yes 0.108748318 0.127590848 0.019111709 0.081830417 0.027187079 0.116823688
## month
## Y jun mar may nov oct sep
## no 0.121539452 0.005436727 0.320480721 0.090170971 0.009871951 0.007690107
## yes 0.106325707 0.046298789 0.172812921 0.076985195 0.065679677 0.050605653
##
## duration
## Y [,1] [,2]
## no 220.8359 207.7037
## yes 532.9714 390.5589
##
## campaign
## Y [,1] [,2]
## no 2.851428 3.221690
## yes 2.098569 1.876453
##
## pdays
## Y [,1] [,2]
## no 36.30863 96.80896
## yes 69.63948 119.48881
##
## previous
## Y [,1] [,2]
## no 0.4984255 2.408017
## yes 1.1763435 2.457079
##
## poutcome
## Y failure other success unknown
## no 0.10719141 0.03817531 0.01305903 0.84157424
## yes 0.11896412 0.05988670 0.18586458 0.63528460
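For numeric predictors such as duration, the columns [,1] and [,2] above are the class-conditional mean and standard deviation: naiveBayes models numeric variables with a Gaussian. For instance, a hypothetical call lasting 300 seconds contributes these density factors to each class (means and sds copied from the print above):
dnorm(300, mean = 220.8359, sd = 207.7037) # factor multiplied into class "no"
dnorm(300, mean = 532.9714, sd = 390.5589) # factor multiplied into class "yes"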
Note that the laplace = 1 argument above is used to avoid the model producing a probability of zero, which happens when it encounters a category that is absent from the training data. Why is that bad? Because the algorithm multiplies probabilities together: if a single one is zero, the entire product becomes zero!
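A toy illustration of the smoothing, with hypothetical counts (not from this dataset):
count_dec_yes <- 0 # hypothetical: "dec" never observed among the "yes" rows
n_yes <- 100 # hypothetical number of "yes" rows in training
k <- 12 # number of levels of month
count_dec_yes / n_yes # 0: this would zero out the whole product
(count_dec_yes + 1) / (n_yes + k) # 1/112, about 0.009: small but non-zero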
# Prediction on the training set
nb_train_pred1 <- predict(nb1, train1, type = "class")
confusionMatrix(nb_train_pred1, train1$y, positive = 'yes') # sets 'yes' as the positive class
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 25849 1681
## yes 2097 2022
##
## Accuracy : 0.8806
## 95% CI : (0.877, 0.8842)
## No Information Rate : 0.883
## P-Value [Acc > NIR] : 0.9064
##
## Kappa : 0.4491
##
## Mcnemar's Test P-Value : 1.461e-11
##
## Sensitivity : 0.54604
## Specificity : 0.92496
## Pos Pred Value : 0.49090
## Neg Pred Value : 0.93894
## Prevalence : 0.11700
## Detection Rate : 0.06389
## Detection Prevalence : 0.13015
## Balanced Accuracy : 0.73550
##
## 'Positive' Class : yes
##
The model achieved 88% accuracy, which sounds like useful information, but by itself it is not enough for a good model; in fact, it does not even beat the No Information Rate of 88.3% (always predicting "no"). To evaluate the customers who accepted the campaign we have to look at sensitivity: the model gets only about 55% of them right, little better than flipping a coin. In short, despite the 88% accuracy this model is misleading, because the imbalanced data are hurting it.
Note: sensitivity measures the proportion of actual positives that are correctly identified.
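From the training confusion matrix above, that is:
2022 / (2022 + 1681) # sensitivity = TP / (TP + FN) = 0.54604, matching the report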
# Prediction on the test set
nb_test_pred1 <- predict(nb1, test1, type = "class")
confusionMatrix(nb_test_pred1, test1$y, positive = 'yes')
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 11041 736
## yes 935 850
##
## Accuracy : 0.8768
## 95% CI : (0.8711, 0.8823)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 0.9884
##
## Kappa : 0.4342
##
## Mcnemar's Test P-Value : 1.274e-06
##
## Sensitivity : 0.53594
## Specificity : 0.92193
## Pos Pred Value : 0.47619
## Neg Pred Value : 0.93751
## Prevalence : 0.11694
## Detection Rate : 0.06268
## Detection Prevalence : 0.13162
## Balanced Accuracy : 0.72893
##
## 'Positive' Class : yes
##
Basically the same thing happens as on the training set: the model cannot be used to predict who will accept the campaign. We need to balance this dataset.
To handle the imbalanced dataset we will use oversampling (a by-hand sketch of the idea follows this list):

- Random oversampling supplements the training data with multiple copies of minority-class examples; that is, the dataset is enlarged, as mentioned, until it is no longer imbalanced.
- It is one of the earliest methods proposed, and it has also proven to be robust.
- Instead of duplicating every minority-class sample, some of them can be chosen at random, with replacement.
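Conceptually, random oversampling can be done by hand; a minimal sketch (caret's upSample, used below, automates this):
set.seed(123) # hypothetical seed, only so the sketch is reproducible
minority <- bank_data[bank_data$y == "yes", ] # 5289 rows
majority <- bank_data[bank_data$y == "no", ] # 39922 rows
extra_n <- nrow(majority) - nrow(minority) # 34633 extra copies needed
extra <- minority[sample(nrow(minority), extra_n, replace = TRUE), ]
balanced <- rbind(majority, minority, extra)
table(balanced$y) # both classes now have 39922 rows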
# x = predictors (dropping the target), y = the imbalanced class
bank_data_oversample <- upSample(x = bank_data %>% select(-y), y = bank_data$y)
str(bank_data_oversample)
## 'data.frame': 79844 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Class : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
Notice how the dataset grew: it had 45211 rows before, and with the oversampling it now has 79844.
Moreover, let's see how the previously imbalanced class looks now:
bank_data_oversample %>% count(Class)
## Class n
## 1 no 39922
## 2 yes 39922
Now it is balanced!
Let's now split this new, balanced dataset into training and test sets:
set.seed(42)
partc2 <- createDataPartition(bank_data_oversample$Class, p = 0.7, list = FALSE)
train2 <- bank_data_oversample[partc2, ]
test2 <- bank_data_oversample[-partc2, ]
dim(train2)
## [1] 55892 17
dim(test2)
## [1] 23952 17
# Training model 2
nb2 <- naiveBayes(Class ~ ., data = train2)
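As a side note, naiveBayes can also return the posterior probabilities instead of hard labels, which would be useful if we later wanted to tune the decision threshold rather than use the default:
head(predict(nb2, test2, type = "raw")) # one probability column per class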
# Prediction on training set 2
nb_train_pred2 <- predict(nb2, train2, type = "class")
confusionMatrix(nb_train_pred2, train2$Class, positive = 'yes')
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 22071 6027
## yes 5875 21919
##
## Accuracy : 0.7871
## 95% CI : (0.7836, 0.7904)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5741
##
## Mcnemar's Test P-Value : 0.1663
##
## Sensitivity : 0.7843
## Specificity : 0.7898
## Pos Pred Value : 0.7886
## Neg Pred Value : 0.7855
## Prevalence : 0.5000
## Detection Rate : 0.3922
## Detection Prevalence : 0.4973
## Balanced Accuracy : 0.7871
##
## 'Positive' Class : yes
##
Notice that sensitivity has now improved a lot, to 78%; this is already quite a reasonable model. Let's now check it on the test set:
# Prediction on test set 2
nb_test_pred2 <- predict(nb2, test2, type = "class")
confusionMatrix(nb_test_pred2, test2$Class, positive = 'yes')
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 9390 2703
## yes 2586 9273
##
## Accuracy : 0.7792
## 95% CI : (0.7739, 0.7844)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5584
##
## Mcnemar's Test P-Value : 0.1107
##
## Sensitivity : 0.7743
## Specificity : 0.7841
## Pos Pred Value : 0.7819
## Neg Pred Value : 0.7765
## Prevalence : 0.5000
## Detection Rate : 0.3871
## Detection Prevalence : 0.4951
## Balanced Accuracy : 0.7792
##
## 'Positive' Class : yes
##
Here sensitivity also improved a lot, to 77%; moreover, the performance is close to what occurred on the training set, showing that the model is stable.
However, we still have to test the trained model on the real test set; remember that above we tested on the dataset augmented by oversampling. Since the oversampling was applied before the train/test split, duplicated minority rows can appear in both train2 and test2, which tends to flatter those metrics; the untouched test1 gives a more honest estimate:
# Prediction on test set 1 - data without oversampling
nb_test_pred_no_over <- predict(nb2, test1, type = "class")
confusionMatrix(nb_test_pred_no_over, test1$y, positive = 'yes')
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 9390 373
## yes 2586 1213
##
## Accuracy : 0.7818
## 95% CI : (0.7748, 0.7887)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3419
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.76482
## Specificity : 0.78407
## Pos Pred Value : 0.31929
## Neg Pred Value : 0.96179
## Prevalence : 0.11694
## Detection Rate : 0.08944
## Detection Prevalence : 0.28012
## Balanced Accuracy : 0.77444
##
## 'Positive' Class : yes
##
Here sensitivity also improved a lot, to 76% (versus 54% for the model trained on the imbalanced data). The trade-off is a lower positive predictive value (about 32%), i.e., more false alarms among the predicted "yes", but for finding the customers likely to accept the campaign the oversampled model is far more useful. In short, oversampling greatly improved the model! That's it!
Keep calm and keep analysing data!