Marcos J Ribeiro
04/06/2020
I built a Naive Bayes algorithm to solve classification problems
I used the R language, version 4.0, to do this
This presentation was built using R Markdown with Beamer
My Naive Bayes algorithm and this presentation can be viewed on my GitHub (see the Nbayes2_work file). This presentation is also available on my RPubs
First, I will apply my algorithm to three data sets. The first two are simpler
Then I will apply my algorithm to Ibovespa returns. This is my main analysis: I will try to predict the crash of the Brazilian stock market during the COVID-19 pandemic
The last three data sets have unbalanced classes. To deal with this I used the ROSE library (a sketch of this step follows below)
The decision boundaries of my algorithm were plotted with the ggplot2 library
As a quality check, I compared my algorithm with the implementation in the e1071 library
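A minimal sketch of the class-balancing step, assuming a data frame dados with a binary factor column y (these names are illustrative; the exact call in my scripts may differ):

library(ROSE)                              # synthetic re-balancing of a binary class variable
set.seed(1)
balanced = ROSE(y ~ ., data = dados)$data  # generates a roughly balanced synthetic sample
table(balanced$y)                          # check the new class frequencies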
There are two types of independent variables: categorical and non-categorical (numeric)
The classification approach is different in each case
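Concretely, the two cases differ only in how the class-conditional likelihood is estimated. As a sketch (in generic notation of my own, not tied to the code), the posterior is

\[P(c \mid x_1, \dotsc, x_p) \propto P(c)\prod_{j=1}^{p} P(x_j \mid c),\]

where \(P(x_j \mid c)\) is a relative frequency for a categorical predictor, and for a numeric predictor it is the Gaussian density \(\frac{1}{\sqrt{2\pi\sigma_{j,c}^2}}\exp\!\left(-\frac{(x_j-\mu_{j,c})^2}{2\sigma_{j,c}^2}\right)\), with \(\mu_{j,c}\) and \(\sigma_{j,c}\) the class mean and standard deviation.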
So I built two functions, one for each case, and wrapped them inside a single function:
# wrapper: dispatches to the training routine that matches the type of the independent variables
naivef = function(k, df, cd = 1){
  if(cd == 1){
    naive_marcos(k, df)        # categorical (discrete) predictors
  } else if(cd == 0){
    naive_marcos2(k, df)       # non-categorical (numeric) predictors
  } else {
    cat('Type cd = 1 for categorical independent variables,\n',
        'and cd = 0 for non-categorical independent variables.\n')
  }
}

# wrapper: dispatches to the prediction routine that matches the type of the independent variables
predf = function(k, df, df_n, cl, cclas = 0, cd = 1){
  if(cd == 1){
    pred_marcos(k, df, df_n, cl, cclas)     # categorical predictors
  } else if(cd == 0){
    pred_marcos2(k, df, df_n, cl, cclas)    # numeric predictors
  } else {
    cat('Type cd = 1 for categorical independent variables,\n',
        'and cd = 0 for non-categorical independent variables.\n')
  }
}

| historia | divida | risco |
|---|---|---|
| ruim | alta | alto |
| desconhecida | alta | alto |
| desconhecida | baixa | moderado |
| desconhecida | baixa | alto |
| desconhecida | baixa | baixo |
| desconhecida | baixa | baixo |
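For reference, the class priors and class-conditional tables that a discrete Naive Bayes relies on can be obtained in base R roughly as below (df is the training data frame shown above; the summary printed by my function, shown next, may organize these tables differently):

prop.table(table(df$risco))                             # a-priori probability of each risk class
prop.table(table(df$historia, df$risco), margin = 2)    # P(historia | risco)
prop.table(table(df$divida, df$risco), margin = 2)      # P(divida | risco)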
## [1] "=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-"
## [1] "Marcos Naive Bayes Classifier for Discrete Predictors"
## [1] "=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-"
## A-priori probabilities:
##
## alto baixo moderado
## 0.4285714 0.3571429 0.2142857
## Conditional Probabilities:
## alta baixa
## boa 0.04761905 0.02380952
## desconhecida 0.09523810 0.04761905
## ruim 0.14285714 0.07142857
| historia | divida |
|---|---|
| boa | baixa |
| boa | alta |
| ruim | baixa |
| ruim | alta |
| desconhecida | baixa |
| desconhecida | alta |
## alto baixo moderado
## [1,] 0.1190476 0.6428571 0.2380952
## [2,] 0.3030303 0.5454545 0.1515152
## [3,] 0.6000000 0.0000000 0.4000000
## [4,] 0.8571429 0.0000000 0.1428571
## [5,] 0.2631579 0.4736842 0.2631579
## [6,] 0.5405405 0.3243243 0.1351351
## [1] "baixo" "baixo" "alto" "alto" "baixo" "alto"
library(e1071)                                            # reference implementation for comparison
clas2 = naiveBayes(x = df[-3], y = as.factor(df$risco))   # fit on the same training data
prev2 = predict(clas2, newdata = df_teste)                # predict the six test cases
print(prev2)
## [1] baixo baixo alto alto baixo alto
## Levels: alto baixo moderado
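For a fuller comparison, e1071 can also return the posterior probabilities (its predict method accepts type = "raw"):

prev2_prob = predict(clas2, newdata = df_teste, type = "raw")   # posterior probability of each class
print(prev2_prob)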
Naive Bayes decision boundaries
| height (ft) | weight (lb) | sex |
|---|---|---|
| 6.00 | 180 | male |
| 5.92 | 190 | male |
| 5.58 | 170 | male |
| 5.92 | 165 | male |
| 5.00 | 100 | female |
| 5.50 | 150 | female |
## [1] "=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-"
## [1] "Marcos Naive Bayes Classifier for Discrete Predictors"
## [1] "=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-"
## A-priori probabilities:
##
## female male
## 0.5 0.5
## , , female
##
## mean variance
## [1,] 5.4175 0.3118092
## [2,] 132.5000 23.6290781
##
## , , male
##
## mean variance
## [1,] 5.855 0.1871719
## [2,] 176.250 11.0867789
| height (ft) | weight (lb) |
|---|---|
| 5.4 | 170 |
| 5.8 | 183 |
| 6.0 | 188 |
| 5.0 | 188 |
## [,1]
## [1,] "female"
## [2,] "male"
## [3,] "male"
## [4,] "female"
## female male
## [1,] 0.642353175 0.357646825
## [2,] 0.016711702 0.983288298
## [3,] 0.007327301 0.992672699
## [4,] 0.997700955 0.002299045
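As a sanity check, the first row of these posteriors can be reproduced by hand from the printed summary, assuming the second column of that summary holds the class standard deviations and using the equal priors of 0.5:

lik_f = dnorm(5.4, mean = 5.4175, sd = 0.3118092) * dnorm(170, mean = 132.50, sd = 23.6290781)  # female likelihood
lik_m = dnorm(5.4, mean = 5.855, sd = 0.1871719) * dnorm(170, mean = 176.25, sd = 11.0867789)   # male likelihood
post = c(female = 0.5 * lik_f, male = 0.5 * lik_m)
post / sum(post)    # approximately 0.642 and 0.358, matching the first row above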
clas3 = naiveBayes(x = teste[-3], y = teste$sex)       # fit e1071 on the same training data
prev3 = predict(clas3, newdata = dfn, type = "raw")    # posterior probability of each class
print(prev3)
## female male
## [1,] 0.642353175 0.357646825
## [2,] 0.016711702 0.983288298
## [3,] 0.007327301 0.992672699
## [4,] 0.997700955 0.002299045
Naive Bayes decision boundaries
Naive Bayes decision boundaries
| education | occupation | income |
|---|---|---|
| HS-grad | Adm-clerical | <=50K |
| Some-college | Prof-specialty | <=50K |
| HS-grad | Adm-clerical | <=50K |
| Bachelors | Prof-specialty | <=50K |
| Bachelors | Prof-specialty | <=50K |
| HS-grad | Other-service | <=50K |
| education (levels) | occupation (levels) | income (levels) |
|---|---|---|
| 10th | Adm-clerical | <=50K |
| 11th | Armed-Forces | >50K |
| 12th | Craft-repair | |
| 1st-4th | Exec-managerial | |
| 5th-6th | Farming-fishing | |
| 7th-8th | Handlers-cleaners | |
| 9th | Machine-op-inspct | |
| Assoc-acdm | Other-service | |
| Assoc-voc | Priv-house-serv | |
| Bachelors | Prof-specialty | |
| Doctorate | Protective-serv | |
| HS-grad | Sales | |
| Masters | Tech-support | |
| Preschool | Transport-moving | |
| Prof-school | | |
| Some-college | | |
## [1] "=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-"
## [1] "Marcos Naive Bayes Classifier for Discrete Predictors"
## [1] "=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-"
## A-priori probabilities:
##
## <=50K >50K
## 0.5032143 0.4967857
## Conditional Probabilities:
## <=50K >50K
## [1,] 0.8412499 0.1587501
## [2,] 0.4708610 0.5291390
## [3,] 1.0000000 0.0000000
## [4,] 1.0000000 0.0000000
## [5,] 1.0000000 0.0000000
## [6,] 0.1352615 0.8647385
## [1] " <=50K" " >50K" " <=50K" " <=50K" " <=50K" " >50K"
## [1] 73.63552
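This last number appears to be the test-set accuracy in percent. A sketch of that computation, assuming (as in the Ibovespa code shown later) that the predicted labels are stored in prev and the test data in tst with income as the label column — these object names are illustrative:

accuracy = sum(prev == tst$income) / length(tst$income) * 100   # share of correct predictions, in percent
accuracy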
Naive Bayes decision region
Naive Bayes decision boundaries
\[\begin{equation}\label{eq12} E(R^i_{t+1}) - R^f_{t+1} = \lambda_{g_{t+1}} \beta_{i,g_{t+1}} \end{equation}\]
where
\[\begin{equation}\label{eq13} \beta_{i,g_{t+1}} = \left(\frac{Cov_t(g_{t+1}, R_{t+1})}{Var_t(g_{t+1})} \right) \end{equation}\]
and
\[\begin{equation}\label{eq14} \lambda_{g_{t+1}} = \gamma Var_t(g_{t+1}) \end{equation}\]
\[\begin{equation}\label{eq15} CMAX_t = \frac{p_t}{max(p_{t-12},\dotsb,p_t)} \end{equation}\]
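A minimal sketch of the CMAX computation for a price series p (a plain numeric vector of monthly closes here; the actual analysis uses Ibovespa prices):

# CMAX_t = p_t / max(p_{t-12}, ..., p_t): current price relative to its maximum over the last 12 periods
cmax = sapply(seq_along(p), function(t) p[t] / max(p[max(1, t - 12):t]))
# values well below 1 mean the index is far below its recent high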
CMAX for the Ibovespa
## [1] "=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-"
## [1] "Marcos Naive Bayes Classifier for Discrete Predictors"
## [1] "=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-"
## A-priori probabilities:
##
## 0 1
## 0.4935065 0.5064935
## 0 1
## [1,] 5.545518e-105 1
## [2,] 1.277693e-98 1
## [3,] 5.748209e-91 1
## [4,] 4.422171e-88 1
## [5,] 1.389092e-87 1
## [6,] 2.479376e-92 1
## [7,] 1.332146e-110 1
## [8,] 3.762972e-80 1
## [9,] 2.888895e-67 1
## [10,] 1.063357e-14 1
## [,1]
## [1,] "1"
## [2,] "1"
## [3,] "1"
## [4,] "1"
## [5,] "1"
## [6,] "1"
## [7,] "1"
## [8,] "1"
## [9,] "1"
## [10,] "1"
prev = predf('x', tr, tst, cl3, cclas = 1, cd = 0)           # predicted classes for the test set
accuracy = sum(prev == tst[, 1]) / length(tst[, 1]) * 100    # percentage of correct predictions
accuracy
## [1] 100
I created two graphics: one with the decision region for each class and one with the decision boundary
We can see in the figures below that there is no obvious pattern that clearly allows falls in the Brazilian stock market to be predicted, but the falls do seem to be associated with lower oil prices and high values of pca
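A minimal sketch of how a decision-region plot like the one below can be built with ggplot2. For brevity it uses e1071's classifier and illustrative column names (oil, pca, crash) on the training set tr; these names are assumptions, not necessarily the ones in my data:

library(e1071)
library(ggplot2)
# fit on the two plotted predictors
fit = naiveBayes(x = tr[, c("oil", "pca")], y = as.factor(tr$crash))
# evaluate the classifier on a fine grid covering the plotting region
grid = expand.grid(oil = seq(min(tr$oil), max(tr$oil), length.out = 200),
                   pca = seq(min(tr$pca), max(tr$pca), length.out = 200))
grid$class = predict(fit, newdata = grid)
# shade each grid cell by predicted class and overlay the observed points
ggplot(grid, aes(oil, pca)) +
  geom_tile(aes(fill = class), alpha = 0.4) +
  geom_point(data = tr, aes(colour = as.factor(crash)))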
Naive Bayes decision region
Naive Bayes decision boundaries
## [1] "=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-"
## [1] "Marcos Naive Bayes Classifier for Discrete Predictors"
## [1] "=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-"
## A-priori probabilities:
##
## 0 1
## 0.4935065 0.5064935
## 0 1
## [1,] 5.020233e-42 1
## [2,] 3.710212e-40 1
## [3,] 5.763950e-46 1
## [4,] 1.000049e-48 1
## [5,] 4.960224e-48 1
## [6,] 3.223118e-49 1
## [7,] 1.396563e-46 1
## [8,] 1.053612e-47 1
## [9,] 3.915360e-53 1
## [10,] 4.292669e-68 1
## [,1]
## [1,] "1"
## [2,] "1"
## [3,] "1"
## [4,] "1"
## [5,] "1"
## [6,] "1"
## [7,] "1"
## [8,] "1"
## [9,] "1"
## [10,] "1"
prev2 = predf('x', tr2, tst2, cl4, cclas = 1, cd = 0)           # predicted classes for the test set
accuracy = sum(prev2 == tst2[, 1]) / length(tst2[, 1]) * 100    # percentage of correct predictions
accuracy
## [1] 100
Again, I created two graphics: one with the decision region for each class and one with the decision boundary
In the figures below, a VIX above 35 appears to be associated with crises in the stock market, and an exchange rate index above 105 also seems to be associated with them.
Naive Bayes decision region
Naive Bayes decision boundaries