Marcos J Ribeiro
04/06/2020
I built a Naive Bayes algorithm to solve classification problems
I used the R language, version 4.0, to do this
This presentation was built using R Markdown with Beamer
My Naive Bayes algorithm and this presentation can be viewed on my GitHub (see the Nbayes2_work file). This presentation is also available on my RPubs
First, I will apply my algorithm to three data sets. The first two are simpler
Then I will apply my algorithm to Ibovespa returns. This is my main analysis: I will try to predict the crash of the Brazilian stock market during the COVID-19 pandemic
The last three data sets have unbalanced classes. To deal with this I used the ROSE library (a sketch of this step follows below)
The decision boundaries of my algorithm were plotted with the ggplot2 library
As a quality check, I compared my algorithm with the implementation in the e1071 library
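A minimal sketch of the class-balancing step, assuming a data frame dados with a binary factor column y (these names are illustrative; the exact call in my scripts may differ):

library(ROSE)                              # synthetic re-balancing of a binary class variable
set.seed(1)
balanced = ROSE(y ~ ., data = dados)$data  # generates a roughly balanced synthetic sample
table(balanced$y)                          # check the new class frequencies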
There are two types of independent variables: categorical and non-categorical (numeric)
The classification approach is different in each case
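Concretely, the two cases differ only in how the class-conditional likelihood is estimated. As a sketch (in generic notation of my own, not tied to the code), the posterior is

\[P(c \mid x_1, \dotsc, x_p) \propto P(c)\prod_{j=1}^{p} P(x_j \mid c),\]

where \(P(x_j \mid c)\) is a relative frequency for a categorical predictor, and for a numeric predictor it is the Gaussian density \(\frac{1}{\sqrt{2\pi\sigma_{j,c}^2}}\exp\!\left(-\frac{(x_j-\mu_{j,c})^2}{2\sigma_{j,c}^2}\right)\), with \(\mu_{j,c}\) and \(\sigma_{j,c}\) the class mean and standard deviation.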
So I built two functions, one for each case, and wrapped them inside a single function:
# wrapper: dispatches to the training routine that matches the type of the independent variables
naivef = function(k, df, cd = 1){
  if(cd == 1){
    naive_marcos(k, df)        # categorical (discrete) predictors
  } else if(cd == 0){
    naive_marcos2(k, df)       # non-categorical (numeric) predictors
  } else {
    cat('Type cd = 1 for categorical independent variables,\n',
        'and cd = 0 for non-categorical independent variables.\n')
  }
}

# wrapper: dispatches to the prediction routine that matches the type of the independent variables
predf = function(k, df, df_n, cl, cclas = 0, cd = 1){
  if(cd == 1){
    pred_marcos(k, df, df_n, cl, cclas)     # categorical predictors
  } else if(cd == 0){
    pred_marcos2(k, df, df_n, cl, cclas)    # numeric predictors
  } else {
    cat('Type cd = 1 for categorical independent variables,\n',
        'and cd = 0 for non-categorical independent variables.\n')
  }
}

| historia | divida | risco |
|---|---|---|
| ruim | alta | alto |
| desconhecida | alta | alto |
| desconhecida | baixa | moderado |
| desconhecida | baixa | alto |
| desconhecida | baixa | baixo |
| desconhecida | baixa | baixo |
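For reference, the class priors and class-conditional tables that a discrete Naive Bayes relies on can be obtained in base R roughly as below (df is the training data frame shown above; the summary printed by my function, shown next, may organize these tables differently):

prop.table(table(df$risco))                             # a-priori probability of each risk class
prop.table(table(df$historia, df$risco), margin = 2)    # P(historia | risco)
prop.table(table(df$divida, df$risco), margin = 2)      # P(divida | risco)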
## [1] "=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-"
## [1] "Marcos Naive Bayes Classifier for Discrete Predictors"
## [1] "=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-"
## A-priori probabilities:
##
## alto baixo moderado
## 0.4285714 0.3571429 0.2142857
## Conditional Probabilities:
## alta baixa
## boa 0.04761905 0.02380952
## desconhecida 0.09523810 0.04761905
## ruim 0.14285714 0.07142857
| historia | divida |
|---|---|
| boa | baixa |
| boa | alta |
| ruim | baixa |
| ruim | alta |
| desconhecida | baixa |
| desconhecida | alta |
## alto baixo moderado
## [1,] 0.1190476 0.6428571 0.2380952
## [2,] 0.3030303 0.5454545 0.1515152
## [3,] 0.6000000 0.0000000 0.4000000
## [4,] 0.8571429 0.0000000 0.1428571
## [5,] 0.2631579 0.4736842 0.2631579
## [6,] 0.5405405 0.3243243 0.1351351
## [1] "baixo" "baixo" "alto" "alto" "baixo" "alto"
library(e1071)                                            # reference implementation for comparison
clas2 = naiveBayes(x = df[-3], y = as.factor(df$risco))   # fit on the same training data
prev2 = predict(clas2, newdata = df_teste)                # predict the six test cases
print(prev2)
## [1] baixo baixo alto alto baixo alto
## Levels: alto baixo moderado
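For a fuller comparison, e1071 can also return the posterior probabilities (its predict method accepts type = "raw"):

prev2_prob = predict(clas2, newdata = df_teste, type = "raw")   # posterior probability of each class
print(prev2_prob)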
Naive Bayes decision boundaries
| height (ft) | weight (lb) | sex |
|---|---|---|
| 6.00 | 180 | male |
| 5.92 | 190 | male |
| 5.58 | 170 | male |
| 5.92 | 165 | male |
| 5.00 | 100 | female |
| 5.50 | 150 | female |
## [1] "=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-"
## [1] "Marcos Naive Bayes Classifier for Discrete Predictors"
## [1] "=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-"
## A-priori probabilities:
##
## female male
## 0.5 0.5
## , , female
##
## mean variance
## [1,] 5.4175 0.3118092
## [2,] 132.5000 23.6290781
##
## , , male
##
## mean variance
## [1,] 5.855 0.1871719
## [2,] 176.250 11.0867789
| height (ft) | weight (lb) |
|---|---|
| 5.4 | 170 |
| 5.8 | 183 |
| 6.0 | 188 |
| 5.0 | 188 |
## [,1]
## [1,] "female"
## [2,] "male"
## [3,] "male"
## [4,] "female"
## female male
## [1,] 0.642353175 0.357646825
## [2,] 0.016711702 0.983288298
## [3,] 0.007327301 0.992672699
## [4,] 0.997700955 0.002299045
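As a sanity check, the first row of these posteriors can be reproduced by hand from the printed summary, assuming the second column of that summary holds the class standard deviations and using the equal priors of 0.5:

lik_f = dnorm(5.4, mean = 5.4175, sd = 0.3118092) * dnorm(170, mean = 132.50, sd = 23.6290781)  # female likelihood
lik_m = dnorm(5.4, mean = 5.855, sd = 0.1871719) * dnorm(170, mean = 176.25, sd = 11.0867789)   # male likelihood
post = c(female = 0.5 * lik_f, male = 0.5 * lik_m)
post / sum(post)    # approximately 0.642 and 0.358, matching the first row above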
clas3 = naiveBayes(x = teste[-3], y = teste$sex)       # fit e1071 on the same training data
prev3 = predict(clas3, newdata = dfn, type = "raw")    # posterior probability of each class
print(prev3)
## female male
## [1,] 0.642353175 0.357646825
## [2,] 0.016711702 0.983288298
## [3,] 0.007327301 0.992672699
## [4,] 0.997700955 0.002299045
Naive Bayes decision boundaries
Naive Bayes decision boundaries
| education | occupation | income |
|---|---|---|
| HS-grad | Adm-clerical | <=50K |
| Some-college | Prof-specialty | <=50K |
| HS-grad | Adm-clerical | <=50K |
| Bachelors | Prof-specialty | <=50K |
| Bachelors | Prof-specialty | <=50K |
| HS-grad | Other-service | <=50K |
| education (levels) | occupation (levels) | income (levels) |
|---|---|---|
| 10th | Adm-clerical | <=50K |
| 11th | Armed-Forces | >50K |
| 12th | Craft-repair | |
| 1st-4th | Exec-managerial | |
| 5th-6th | Farming-fishing | |
| 7th-8th | Handlers-cleaners | |
| 9th | Machine-op-inspct | |
| Assoc-acdm | Other-service | |
| Assoc-voc | Priv-house-serv | |
| Bachelors | Prof-specialty | |
| Doctorate | Protective-serv | |
| HS-grad | Sales | |
| Masters | Tech-support | |
| Preschool | Transport-moving | |
| Prof-school | | |
| Some-college | | |
## [1] "=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-"
## [1] "Marcos Naive Bayes Classifier for Discrete Predictors"
## [1] "=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-"
## A-priori probabilities:
##
## <=50K >50K
## 0.5032143 0.4967857
## Conditional Probabilities:
## <=50K >50K
## [1,] 0.8412499 0.1587501
## [2,] 0.4708610 0.5291390
## [3,] 1.0000000 0.0000000
## [4,] 1.0000000 0.0000000
## [5,] 1.0000000 0.0000000
## [6,] 0.1352615 0.8647385
## [1] " <=50K" " >50K" " <=50K" " <=50K" " <=50K" " >50K"
## [1] 73.63552
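This last number appears to be the test-set accuracy in percent. A sketch of that computation, assuming (as in the Ibovespa code shown later) that the predicted labels are stored in prev and the test data in tst with income as the label column — these object names are illustrative:

accuracy = sum(prev == tst$income) / length(tst$income) * 100   # share of correct predictions, in percent
accuracy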
Naive Bayes decision region
Naive Bayes decision boundaries
\[\begin{equation}\label{eq12} E(R^i_{t+1}) - R^f_{t+1} = \lambda_{g_{t+1}} \beta_{i,g_{t+1}} \end{equation}\]
where
\[\begin{equation}\label{eq13} \beta_{i,g_{t+1}} = \left(\frac{Cov_t(g_{t+1}, R_{t+1})}{Var_t(g_{t+1})} \right) \end{equation}\]
and
\[\begin{equation}\label{eq14} \lambda_{g_{t+1}} = \gamma Var_t(g_{t+1}) \end{equation}\]
\[\begin{equation}\label{eq15} CMAX_t = \frac{p_t}{max(p_{t-12},\dotsb,p_t)} \end{equation}\]
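A minimal sketch of the CMAX computation for a price series p (a plain numeric vector of monthly closes here; the actual analysis uses Ibovespa prices):

# CMAX_t = p_t / max(p_{t-12}, ..., p_t): current price relative to its maximum over the last 12 periods
cmax = sapply(seq_along(p), function(t) p[t] / max(p[max(1, t - 12):t]))
# values well below 1 mean the index is far below its recent high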
CMAX for the Ibovespa
## [1] "=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-"
## [1] "Marcos Naive Bayes Classifier for Discrete Predictors"
## [1] "=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-"
## A-priori probabilities:
##
## 0 1
## 0.4935065 0.5064935
## 0 1
## [1,] 5.545518e-105 1
## [2,] 1.277693e-98 1
## [3,] 5.748209e-91 1
## [4,] 4.422171e-88 1
## [5,] 1.389092e-87 1
## [6,] 2.479376e-92 1
## [7,] 1.332146e-110 1
## [8,] 3.762972e-80 1
## [9,] 2.888895e-67 1
## [10,] 1.063357e-14 1
## [,1]
## [1,] "1"
## [2,] "1"
## [3,] "1"
## [4,] "1"
## [5,] "1"
## [6,] "1"
## [7,] "1"
## [8,] "1"
## [9,] "1"
## [10,] "1"
prev = predf('x', tr, tst, cl3, cclas = 1, cd = 0)           # predicted classes for the test set
accuracy = sum(prev == tst[, 1]) / length(tst[, 1]) * 100    # percentage of correct predictions
accuracy
## [1] 100
I created two graphics: one with the decision region for each class and one with the decision boundary
We can see in the figures below that there is no obvious pattern that clearly allows falls in the Brazilian stock market to be predicted, but the falls do seem to be associated with lower oil prices and high values of pca
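A minimal sketch of how a decision-region plot like the one below can be built with ggplot2. For brevity it uses e1071's classifier and illustrative column names (oil, pca, crash) on the training set tr; these names are assumptions, not necessarily the ones in my data:

library(e1071)
library(ggplot2)
# fit on the two plotted predictors
fit = naiveBayes(x = tr[, c("oil", "pca")], y = as.factor(tr$crash))
# evaluate the classifier on a fine grid covering the plotting region
grid = expand.grid(oil = seq(min(tr$oil), max(tr$oil), length.out = 200),
                   pca = seq(min(tr$pca), max(tr$pca), length.out = 200))
grid$class = predict(fit, newdata = grid)
# shade each grid cell by predicted class and overlay the observed points
ggplot(grid, aes(oil, pca)) +
  geom_tile(aes(fill = class), alpha = 0.4) +
  geom_point(data = tr, aes(colour = as.factor(crash)))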
Naive Bayes decision region
Naive Bayes decision boundaries
## [1] "=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-"
## [1] "Marcos Naive Bayes Classifier for Discrete Predictors"
## [1] "=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-"
## A-priori probabilities:
##
## 0 1
## 0.4935065 0.5064935
## 0 1
## [1,] 5.020233e-42 1
## [2,] 3.710212e-40 1
## [3,] 5.763950e-46 1
## [4,] 1.000049e-48 1
## [5,] 4.960224e-48 1
## [6,] 3.223118e-49 1
## [7,] 1.396563e-46 1
## [8,] 1.053612e-47 1
## [9,] 3.915360e-53 1
## [10,] 4.292669e-68 1
## [,1]
## [1,] "1"
## [2,] "1"
## [3,] "1"
## [4,] "1"
## [5,] "1"
## [6,] "1"
## [7,] "1"
## [8,] "1"
## [9,] "1"
## [10,] "1"
prev2 = predf('x', tr2, tst2, cl4, cclas = 1, cd = 0)           # predicted classes for the test set
accuracy = sum(prev2 == tst2[, 1]) / length(tst2[, 1]) * 100    # percentage of correct predictions
accuracy
## [1] 100
Again, I created two graphics: one with the decision region for each class and one with the decision boundary
In the figures below, a VIX above 35 appears to be associated with crises in the stock market, and an exchange rate index above 105 also seems to be associated with them.
Naive Bayes decision region
Naive Bayes decision boundaries