Introduction

In the study of Neural Networks, Deep Learning, Artificial Intelligence, and fancy terms of the like, binary classification seems to be one of the first problems to solve. It is the first one tackles, and I think for a good reason: it is the simplest shallow neural network one can come up with that has the structure we then generalise for “deep” neural networks.

The best book I have found on the subject of deep learning is, precisely, Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville. You may buy the book here or, in the spirit of free knowledge, you may read it for free here.

I have been reading on the Deep Learning approach to things, but it really doesn’t feel like a natural way of solving the problem to me. It feels a bit like a brute-force approach, which doesn’t mean it’s wrong or that it doesn’t work. Maybe it’s just that, being a probabilist, thinking in terms of a logistic regression seems the natural way to go. This is why I am writing these notes: so next time I need to explain how to do this stuff I know how to do it from the very beginning.

Binary classification

The idea is that we have data on several variables \(x_1,x_2,\ldots,x_m\) and a result \(y\) that may take values on the set \(\{0,1\}\) for each of \(n\) observations.

In order to tackle the problem, I will develop some ideas below, using the following fake data as an example.
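
(A note for reproducibility: df comes from an external file that I won’t reproduce here. If you want to follow along, the hypothetical generator below produces data of the same shape, two covariates in \([0,1]\) and a binary \(y\) roughly separated by a line, though the exact values, and hence the printed output below, will differ.)

# Hypothetical stand-in for the fake data: two uniform covariates and
# a binary response determined by a noisy linear boundary.
set.seed(1)
n  <- 100
x1 <- runif(n)
x2 <- runif(n)
df <- data.frame(x1 = x1, x2 = x2,
                 y  = as.integer(x1 + x2 + rnorm(n, sd = 0.2) < 1))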

# We will use dplyr later on (for %>% and if_else).
library(dplyr)

# The result y can only take values 0 or 1.
# The additional information we know are the variables x1 and x2.
# These variables have no restrictions.
head(df)
##        x1        x2 y
## 1 0.78051 -0.063669 1
## 2 0.28774  0.291390 1
## 3 0.40714  0.178780 1
## 4 0.29230  0.421700 1
## 5 0.50922  0.352560 1
## 6 0.27785  0.108020 1
# Here is a summary of the information of each variable.
summary(df)
##        x1                 x2                 y      
##  Min.   :0.008449   Min.   :-0.06367   Min.   :0.0  
##  1st Qu.:0.334825   1st Qu.: 0.34346   1st Qu.:0.0  
##  Median :0.528265   Median : 0.55639   Median :0.5  
##  Mean   :0.520591   Mean   : 0.51966   Mean   :0.5  
##  3rd Qu.:0.693588   3rd Qu.: 0.71025   3rd Qu.:1.0  
##  Max.   :1.000000   Max.   : 1.00000   Max.   :1.0
# Here is how the three variables are related
par(pty = 's')
plot(df$x1,df$x2,col = unlist(lapply(df$y,function(x){ return(if_else(x==1,'black','blue'))})),
       pch = 15,
       xlab = 'x1',ylab = 'x2', main = 'Fake data', asp = 1, xlim = c(0,1), ylim = c(0,1))
legend(1.05, 0.6, c('y = 0', 'y = 1'),col = c('blue','black'),pch = 15, xpd = TRUE, bty = 'n')

Maximum Likelihood

As a probabilist, I am forced to start the modelling with the phrase “Let (name of random variable) be the random variable that…” so here we go:

Let \(Y\) be the random variable that denotes whether the result is a \(0\) or a \(1\) for a random observation. Since \(Y\) can only take values on the set \(\{0,1\}\), a Bernoulli model sounds like a good idea: \[ Y\sim\mathrm{Bernoulli}(p).\]

Then we can write explicitly the probability that \(Y\) takes a specific value: \[ \mathbb P[Y = y] = p^y (1-p)^{1-y} \mathbb I_{\{0,1\}} (y),\] where the function \(\mathbb I_A\colon\mathbb R\to\{0,1\}\) is the indicator function that takes the value \(0\) if the number where it is evaluated doesn’t belong to the set \(A\) and \(1\) if it does. That is,

\[ \mathbb I_A(x) = \begin{cases} 0 & \text{ if }x\notin A,\\ 1 & \text{ if }x\in A.\end{cases}\]

Assuming that we have a random sample of size \(n\); that is, \(n\) independent and identically distributed random variables \(Y_1,Y_2,\ldots,Y_n\) from this distribution then we can find the joint probability as \[ \begin{align*} \mathbb P[Y_1 = y_1,Y_2= y_2,\ldots,Y_n= y_n] &= \prod_{k = 1}^n \mathbb P[Y_k = y_k]\\ & = \prod_{k = 1}^n p^{y_k} (1-p)^{1-y_k} \mathbb I_{\{0,1\}} (y_k)\\ & = \prod_{k = 1}^n p^{y_k} (1-p)^{1-y_k} \\ & = p^{\sum_{k = 1}^n y_k} (1-p)^{n-\sum_{k = 1}^n y_k}. \end{align*}\]

If we see this function as a function of \(p\) rather than of the values \(y_1,\ldots,y_n\), we call it the Likelihood function. Since we already have a data set with the values \(y_1,\ldots,y_n\), it makes sense to find the \(p\) that maximises the probability of the data we have already observed. We call this value of \(p\) the Maximum Likelihood Estimator (MLE) and denote it by \(\hat p\).

To find it we note that the argument that maximises the Likelihood function is the same as the argument that minimises the negative logarithm of the Likelihood function. We call this the loss function and denote it by \(\ell(p) = -\log \mathbb P[Y_1 = y_1,Y_2= y_2,\ldots,Y_n= y_n]\).

Simplifying a bit this expression, we find that \[ \begin{align*} \ell(p) & = -\log \mathbb P[Y_1 = y_1,Y_2= y_2,\ldots,Y_n= y_n]\\ & = -\log\left( p^{\sum_{k = 1}^n y_k} (1-p)^{n-\sum_{k = 1}^n y_k} \right)\\ & = -\log(p) \sum_{k = 1}^n y_k - \left(n-\sum_{k = 1}^n y_k \right) \log(1-p). \end{align*}\]

In order to find \(\hat p\) we now differentiate, \[\begin{align*} \frac{d}{dp}\ell(p) & = -\frac{d}{dp} \left(\log(p) \sum_{k = 1}^n y_k + \left(n-\sum_{k = 1}^n y_k \right) \log(1-p)\right)\\ & = -\frac{1}{p}\sum_{k=1}^n y_k + \frac{1}{1-p}\left(n - \sum_{k=1}^n y_k\right), \end{align*}\]

equate to \(0\), \[ \frac{1}{\hat p}\sum_{k=1}^n y_k - \frac{1}{1-\hat p}\left(n - \sum_{k=1}^n y_k\right) = 0,\]

and solve for \(\hat p\),

\[\hat p = \frac{1}{n}\sum_{k=1}^n y_k.\]

This shouldn’t come as a surprise: the proportion of \(1\)’s should be the probability of \(Y\) being \(1\). In our case, \(\hat p = 0.5\) (see the quick check below). However, it is clear that the variables \(x_1\) and \(x_2\) provide additional information that may help us refine the parameter \(p\) for each specific observation. This is when logistic regression comes into play.
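
Here is that quick check, consistent with the summary of df shown above:

# The MLE of p is just the sample proportion of 1's.
mean(df$y)
## [1] 0.5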

Logistic Regression by hand

We start the model in the same way: \(Y\) is still a Bernoulli random variable, but now the parameter \(p\) is a function of the additional information, which we call covariates. The sample is then no longer identically distributed; each observation has its own parameter \[ Y_k\sim\mathrm{Bernoulli}(p_k).\]

The likelihood can no longer be simplified as in the previous case; we only get this far: \[ \begin{align*} \mathbb P[Y_1 = y_1,Y_2= y_2,\ldots,Y_n= y_n] &= \prod_{k = 1}^n \mathbb P[Y_k = y_k]\\ & = \prod_{k = 1}^n p_k^{y_k} (1-p_k)^{1-y_k} \mathbb I_{\{0,1\}} (y_k)\\ & = \prod_{k = 1}^n p_k^{y_k} (1-p_k)^{1-y_k}. \end{align*}\]

Taking logarithms \[\begin{align*} \ell(p) & = -\log \prod_{k = 1}^n p_k^{y_k} (1-p_k)^{1-y_k}\\ & =-\sum_{k=1}^n \left( y_k \log(p_k) + (1- y_k)\log(1 - p_k) \right). \end{align*}\]

Here we then propose a functional form in which \(p\) depends on the covariates. Recall that \(p\) must be a number in \((0,1)\), so we propose \[ \mathrm{logit} (p_k(x_1,\ldots,x_m)) = \sum_{j=1}^m \beta_j x_j,\] where \(\mathrm{logit}(p) = \log\frac{p}{1-p}\) maps \((0,1)\) onto the whole real line. To simplify notation, we will write all the known information as \(\mathbf x = (x_1,\ldots,x_m)\in\mathbb R^m\). Hence, our functional form can be written as \[ \mathrm{logit} (p(\mathbf x)) = \mathbf\beta^T \mathbf x.\] For the \(k\)-th observation we’ll write \[ \mathrm{logit} (p_k(\mathbf x_k)) = \mathbf\beta^T \mathbf x_k.\]
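
For concreteness, here is a minimal sketch of the logit and its inverse, the sigmoid: whatever real number \(\mathbf\beta^T\mathbf x\) turns out to be, the sigmoid maps it back into \((0,1)\), so \(p\) is always a valid probability.

# The logit and its inverse (the sigmoid).
logit   <- function(q) log(q / (1 - q))
sigmoid <- function(z) 1 / (1 + exp(-z))
sigmoid(logit(0.8))  # recovers 0.8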

Plugging this functional form in leads to the loss function \[\begin{align*} \ell(p) & =-\sum_{k=1}^n \left( y_k \log(p_k) + (1- y_k)\log(1 - p_k) \right)\\ & =-\sum_{k=1}^n \left( y_k \log\frac{p_k}{1-p_k} + \log(1 - p_k) \right)\\ & =-\sum_{k=1}^n \left( y_k \mathbf \beta^T\mathbf x_k - \log(1 + e^{\mathbf \beta^T\mathbf x_k}) \right). \end{align*}\]

Differentiating with respect to a single parameter, say \(\beta_j\), we find \[\begin{align*} \frac{\partial}{\partial\beta_j}\ell(\beta) & = -\frac{\partial}{\partial\beta_j} \sum_{k=1}^n \left( y_k \mathbf \beta^T\mathbf x_k - \log(1 + e^{\mathbf \beta^T\mathbf x_k}) \right)\\ & = -\sum_{k=1}^n \left( y_k x_{kj} - \frac{x_{kj}e^{\mathbf \beta^T\mathbf x_k}}{1 + e^{\mathbf \beta^T\mathbf x_k}} \right)\\ & = -\sum_{k=1}^n x_{kj}(y_k - p_k(\mathbf x_k,\beta)). \end{align*}\]

Rewriting this in matrix notation, we find that \[ \nabla_\beta \ell(\beta) = -X^T(\mathbf y - \mathbf p(X,\beta)),\] where \(X\in\mathbb R^{n\times m}\) is the matrix of covariates, \(\mathbf y\in\mathbb R^n\) is the vector of responses, and \(\mathbf p\in\mathbb R^n\) is the vector of probabilities for each observation; recall it is a function of the parameter \(\beta\in\mathbb R^m\) we are trying to find.

We now equate to zero, and find a system of \(m\) unknowns (the vector \(\hat \beta\)) in \(m\) non-linear equations: \[ X^T(\mathbf y - \mathbf p(X,\hat\beta)) = \mathbf 0\in\mathbb R^m.\] There is no tractable way to solve this system analytically; we will have to rely on numerical methods. We recall Newton’s method:

  1. Define \(F\colon \mathbb R^{m}\to\mathbb R^m\) by \(F(\beta) = X^T(\mathbf y - \mathbf p(X,\beta))\).
  2. Set \(\beta^{(0)}\in\mathbb R^m\).
  3. Calculate iteratively \(\beta^{(i+1)} = \beta^{(i)} - \nabla F(\beta^{(i)})^{-1}F(\beta^{(i)})\).
  4. If the starting point \(\beta^{(0)}\) lies in a suitable neighbourhood of \(\hat\beta\), the sequence \(\{\beta^{(i)}\}_{i\in\mathbb N_0}\) converges quadratically to \(\hat\beta\).

This basically means that we should repeat step 3 enough times to get a good result. This is itself a challenge, since we are dealing with the inverse of a matrix. Our example seems quite innocuous, so we will just give it the naïve mathematician’s approach to inverse matrices and see what happens.

# We first set the variables as our equations.
y <- df$y

# We will add a column of ones to our covariate matrix just to have a constant in our model.
# Our equations never really prohibited this. The important issue will be that the inverse
# of the derivative of F exists.
X <- data.frame(Constant = rep(1,length(y)),
                x1 = df$x1,
                x2 = df$x2)

# Since the MLE of p is 0.5, we will start with a beta that does that.
# This means all 3 betas in 0, this way logit(p)=0 and hence p=0.5 for all observations.
beta <- rep(0,length=3)

# Define how p depends on X and beta.
# Note that we will consider X to be fixed and we will be interested in p
# just as a function of beta.
p <- function(beta){
  return(1/(1 + exp(-colSums(apply(X,1,function(k){return(k*beta)})))))
}

# We then define the function F
Fun <- function(beta){
  res <- (X %>% as.matrix() %>% t()) %*% (y - p(beta))
  res <- as.numeric(res)
  return(res)
}

Before proceeding, we should know what the derivative of \(F\) looks like. For that, we first find the derivative of \(\mathbf p\). Recall that \(p_k = \frac{1}{1+e^{-\mathbf x_k^T\beta}}\), and as such \(\mathbf p\colon\mathbb R^m\to\mathbb R^n\); hence the derivative is a matrix \(\nabla \mathbf p \colon \mathbb{R}^m \to \mathbb{R}^{m\times n}\) whose element in row \(j\) and column \(k\) is given by \[\begin{align*} \frac{\partial}{\partial \beta_j}p_k(\beta) &= \frac{\partial}{\partial \beta_j}\frac{1}{1+e^{-\mathbf x_k^T\beta}}\\ & = \frac{x_{kj}e^{-\mathbf x_k^T\beta}}{(1+e^{-\mathbf x_k^T\beta})^2}\\ & = x_{kj}p_k(\beta)(1-p_k(\beta)). \end{align*}\]

In matrix notation, this may be rewritten as \[ \nabla \mathbf p = X^T W,\] where \(W\in\mathbb R^{n\times n}\) is the diagonal matrix defined by \[W_{k,k} = p_k(1-p_k).\]

We now find the derivative of \(F\) using the chain rule \[\begin{align*} \nabla F(\beta) &= \nabla X^T(\mathbf y - \mathbf p)\\ & = -(\nabla \mathbf p(\beta))X\\ & = - X^T W X. \end{align*}\]
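
Plugging \(F(\beta) = X^T(\mathbf y - \mathbf p)\) and \(\nabla F(\beta) = -X^TWX\) into the Newton update from step 3 above gives \[ \beta^{(i+1)} = \beta^{(i)} + (X^TWX)^{-1}X^T(\mathbf y - \mathbf p(X,\beta^{(i)})),\] which one may recognise as the classic iteratively reweighted least squares (IRLS) update for logistic regression.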

Ok, back to R:

# We now calculate the derivative of F
DFun <- function(beta){
  res <- - t(as.matrix(X)) %*% diag(p(beta)*(1-p(beta))) %*% as.matrix(X)
  return(res)
}

# We write a function to do Newton's method
# following the previously described pseudo-code.
Newton <- function(beta0, maxiter = 1000, eps = 1e-10){
  # Initialise the results
  current_beta <- beta0
  convergence <- 0
  iter <- 0
  norms <- sqrt(sum(Fun(current_beta) * Fun(current_beta)))
  # Default result, in case the loop below exits without setting one
  # (e.g. if the starting point already satisfies the stopping criterion).
  res <- list(beta = current_beta, convergence = 0, iter = 0, norms = norms)
  
  # Start loop to convergence
  while(convergence == 0 & iter < maxiter & norms[length(norms)] > eps){
    # Check if the derivative of F is singular
    if(abs(det(DFun(current_beta))) <= eps){
      # Return current beta with no convergence
      res <- list(beta = current_beta,
                  convergence = 0,
                  iter = iter + 1,
                  norms = norms)
      # Claim it has converged to get out of loop
      convergence <- 1
    }else{
      # We have a non-singular matrix, so we proceed
      # Update beta
      current_beta <- current_beta - solve(DFun(current_beta)) %*% Fun(current_beta)
      # Update norm vector, this is for the history of convergence
      # Recall we should find a quadratic convergence
      norms <- c(norms,sqrt(sum(Fun(current_beta) * Fun(current_beta))))
      # Update number of iterations
      iter <- iter + 1
      # Check if we have converged
      if(sqrt(sum(Fun(current_beta) * Fun(current_beta))) <= eps){
        # Update convergence
        convergence <- 1
        # This is what we want to find
        res <- list(beta = current_beta,
                    convergence = 1,
                    iter = iter,
                    norms = norms)
      }
    }
  }
  # Check if we ran out of iterations
  if(iter >= maxiter){
    # Return that we have with no convergence
    res <- list(beta = current_beta,
                convergence = 0,
                iter = iter,
                norms = norms)
  }
  # Only one return!
  return(res)
}

It is now the moment of truth: we run this function, and we will have found our logistic regression.

# Finding the MLE of the logistic regression coefficients
beta_hat <- Newton(beta)

# Checking that the algorithm converged
beta_hat$convergence
## [1] 1
# Checking that the derivative of the log-likelihood is indeed 0. That is F(beta_hat)=0.
Fun(beta_hat$beta)
## [1] 5.162537e-15 1.255377e-15 1.151208e-15
# Looking at how the method converged. It took only 10 iterations.
plot(beta_hat$norms, pch = 15,xlab = 'Iteration', ylab = 'Norm of F(beta)',
     main="Convergence of Newton's method")

Logistic Regression with R

Fortunately for everybody, no one needs to do all of this by hand, and we may find all of the above in a few lines (fewer than 2!) of R:

# Logistic regression
log_reg <- glm(y ~ x1 + x2,family = 'binomial', data = df)

That’s right! We can summarise all of the previous section in just one line of code (\(1<2\), told you fewer than two!). We will notice that we actually find the same coefficients, the same values of \(\beta\).

data.frame( our_coefficients = beta_hat$beta,
            R_function = log_reg$coefficients)
##          our_coefficients R_function
## Constant         16.28138   16.28138
## x1              -14.51051  -14.51051
## x2              -16.42439  -16.42439
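
A quick way to read these numbers: the decision boundary, where \(p = 0.5\), i.e. \(\mathbf\beta^T\mathbf x = 0\), is the line \[ 16.28 - 14.51\,x_1 - 16.42\,x_2 = 0, \qquad\text{i.e.}\qquad x_2 \approx 0.99 - 0.88\,x_1,\] which matches the separation visible in the scatter plot of the fake data.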

Now we have an estimated probability for each observation. Note that high probabilities are mostly associated with values of \(y=1\), while low probabilities with values of \(y=0\).

# First, note that I'll keep the R results from here on, but we estimated exactly the same thing.
plot(log_reg$fitted.values,p(beta_hat$beta), pch = 15,
     xlab = 'R estimated probabilities',
     ylab = 'Our estimated probabilities',
     main = "Comparison of our method and R's")

# Second, we now add the estimated probabilities to our data frame.
df$probabilities <- log_reg$fitted.values

# Third compare them to the actual values of y.
boxplot(probabilities~y,data=df, xlab = 'Values of y',
        ylab = 'Estimated probability of y being 1')

Here we see that, regardless of the cut-off point at which we round a probability to decide whether the corresponding value of \(y\) should be \(0\) or \(1\), we will make some errors. Not many, but some.

In fact, we can see what our probability function looks like on the whole plane:

grid <- expand.grid(list(x1 = seq(0,1,length = 100), x2 = seq(0,1,length = 100)))
surface <-  1/(1+exp(-predict(log_reg,newdata = grid)))

image(seq(0,1,length = 100),seq(0,1,length = 100),matrix(surface,ncol = 100),
      col = rev(heat.colors(15)),
      xlab = 'x1',ylab = 'x2', main = 'Logistic regression', asp = 1,
      xlim = c(0,1), ylim = c(0,1))
points(df$x1,df$x2,col = unlist(lapply(df$y,function(x){ return(if_else(x==1,'black','blue'))})),
       pch = 15)
legend(1.05, 0.6, c('y = 0', 'y = 1'),col = c('blue','black'),pch = 15, xpd = TRUE, bty = 'n')

The redder the region, the higher the probability that \(y=1\); the yellower, the lower. There’s a clear band where the gradient starts falling. We may even draw the exact values at which the decision is fuzzy. Whatever we do, that’s the region where we will fail: the uncertainty is too big to go for one team or the other.

image(seq(0,1,length = 100),seq(0,1,length = 100),matrix(surface,ncol = 100),
      col = rev(heat.colors(15)),
      xlab = 'x1',ylab = 'x2', main = 'Logistic regression', asp = 1,
      xlim = c(0,1), ylim = c(0,1))
points(df$x1,df$x2,col = unlist(lapply(df$y,function(x){ return(if_else(x==1,'black','blue'))})),
       pch = 15)
legend(1.05, 0.6, c('y = 0', 'y = 1'),col = c('blue','black'),pch = 15, xpd = TRUE, bty = 'n')
contour(x = seq(0,1,length = 100),
        y = seq(0,1,length = 100),
        z = matrix(surface,ncol = 100),
        levels = c(0.1,0.5,0.9), 
        add = TRUE)

Since the final objective is to decide between \(0\) and \(1\), we go for the \(0.5\) cut-off: we decide \(y=1\) if \(p\geq 0.5\) and \(y=0\) if \(p<0.5\).

# We now add the decision made by the logistic regression
df$fitted <- df$probabilities %>% 
  lapply(function(p){ return(if_else(p>=0.5,1,0)) }) %>% 
  unlist()

# Compare the actual value against the fitted one
xtabs(~y +fitted,data = df)
##    fitted
## y    0  1
##   0 46  4
##   1  3 47
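
From this table we can read the accuracy at the \(0.5\) cut-off directly:

# Proportion of observations classified correctly: (46 + 47) / 100.
mean(df$y == df$fitted)
## [1] 0.93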

We classify 93% of the data correctly. This is a very good logistic regression. We may even tune the cut-off point by checking how well we classify at each possible value:

# This table gathers the percentage of correct decisions given a certain cut-off value.
Calif <- data.frame(cut_off = seq(0.01,0.99,0.001))
Calif <- lapply(1:length(Calif$cut_off),function(k){
  aux <- df$probabilities %>%
    lapply(function(p){ return(if_else(p>=Calif$cut_off[k],1,0)) }) %>% unlist()
  cont_tab <- xtabs(~df$y + aux)
  Correct <- (cont_tab[1,1] + cont_tab[2,2])/(cont_tab[1,1] + cont_tab[2,2] + cont_tab[1,2] + cont_tab[2,1])
  Jac <- cont_tab[2,2]/(cont_tab[2,2] + cont_tab[1,2] + cont_tab[2,1])
  Prec <- cont_tab[2,2]/(cont_tab[2,2] + cont_tab[1,2])
  Spec <- cont_tab[1,1]/(cont_tab[1,1] + cont_tab[1,2])
  Sens <- cont_tab[2,2]/(cont_tab[2,2] + cont_tab[2,1])
  res <- data.frame(Correct = Correct,
                    Jaccard = Jac,
                    Precision = Prec,
                    Specificity = Spec,
                    Sensitivity = Sens)
  return(res)
}) %>% do.call(rbind,.) %>% cbind(Calif,.)
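
For reference, writing \(TP=\) cont_tab[2,2], \(TN=\) cont_tab[1,1], \(FP=\) cont_tab[1,2] and \(FN=\) cont_tab[2,1] in the code above, the measurements computed are \[ \text{Accuracy} = \frac{TP+TN}{n},\quad \text{Jaccard} = \frac{TP}{TP+FP+FN},\quad \text{Precision} = \frac{TP}{TP+FP},\quad \text{Specificity} = \frac{TN}{TN+FP},\quad \text{Sensitivity} = \frac{TP}{TP+FN}.\]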

# This plot shows each probability cut-off performance (by different measurements).
# I would just focus on the accuracy and the Jaccard coefficients.

plot(Calif$cut_off,Calif$Jaccard,xlab = 'Cut off probability',ylab = 'Jaccard coefficient',
     type='l',col = 'blue',ylim = c(0.6,1))
lines(Calif$cut_off,Calif$Correct,col='red')
lines(Calif$cut_off,Calif$Precision,col='green')
lines(Calif$cut_off,Calif$Specificity,col='orange')
lines(Calif$cut_off,Calif$Sensitivity,col='purple')
legend(0.5, 0.8, c('Accuracy', 'Jaccard','Precision','Specificity','Sensitivity'),
       col = c('red','blue','green','orange','purple'),pch = 1, xpd = TRUE, bty = 'n')

# Since we gain 1 extra correct classification, we'll change the cut-off to 0.6,
# although right now we don't have a strong argument to do so.
df$fitted <- df$probabilities %>% 
  lapply(function(p){ return(if_else(p>=0.6,1,0)) }) %>% 
  unlist()

Loss function

This is how a logistic regression is done. The main task was to minimise the loss function \(\ell\); put another way, to maximise the log-likelihood \[ -\ell(\beta) = \log L(\beta) = \sum_{k = 1}^n (y_k\log(p_k) + (1-y_k)\log(1-p_k)).\] In the AI argot, we refer to the loss function rather than the likelihood or log-likelihood, and it is something we want to minimise rather than maximise. So from here onwards, I’ll only talk about the loss function rather than the log-likelihood.
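
Tying the notation back to the code from earlier, here is a small sketch (reusing the y vector and the p() function defined above) that evaluates this loss at our estimate:

# The loss (negative log-likelihood) as a function of beta.
loss <- function(beta){
  pk <- p(beta)
  return(-sum(y * log(pk) + (1 - y) * log(1 - pk)))
}
loss(as.numeric(beta_hat$beta))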

A Bayesian approach

For the same model, we now need to specify our current knowledge of the unknown parameter \(\beta\) with a probability distribution, say \(f_{\beta}(\beta)\), which we call the prior. Then our likelihood looks like this:

\[L(\beta) = \prod_{k = 1}^n f_{Y_k|\beta}(y_k).\]

We are making explicit the fact that the probability distribution of \(Y\) depends on the parameter \(\beta\). The posterior distribution of \(\beta\) is the update of our knowledge of \(\beta\) after observing the data. That is:

\[f_{\beta|Y}(\beta) = \frac{f_{Y|\beta}(\mathbf y) f_{\beta}(\beta)}{f_Y(\mathbf y)} = \frac{L(\beta) f_{\beta}(\beta)}{\int_{\mathbb R^m}f_{Y|\beta}(\mathbf y) f_{\beta}(\beta)\,d\beta}.\]

This is Bayes’s Theorem. The numerator is the product of the likelihood and the prior of the parameter. The denominator is the marginal distribution of \(Y\), which is unknown to us. That is why we simply write the previous equation as a proportionality statement:

\[ f_{\beta|Y}(\beta) \propto L(\beta) f_{\beta}(\beta).\]

The goal is to say something about the posterior of \(\beta\), a function that we know only up to a multiplicative constant. I don’t want to go into the details here, but that’s precisely what MCMC samplers, like Gibbs sampling or the Hamiltonian Monte Carlo variant that Stan uses, are all about. This is available in R via the rstan package.

We first save the following file:

// Bayesian logistic regression
// file saved as bayes_log_reg.stan
data {
  int<lower=0> N;             // number of data points 
  int<lower=0,upper=1> y[N];  // value 0 or 1
  vector[N] x1;                 // First covariate
  vector[N] x2;                 // Second covariate
  real b0_init;
  real b1_init;
  real b2_init;
  real<lower=0> sig;
}
parameters {
  real b0;                    // intercept
  real b1;                    // slope of x1
  real b2;                    // slope of x2
}
transformed parameters {
  vector[N] p = 1 ./ (1 + exp(-b0 - x1 * b1 - x2 * b2)); // probabilities
}
model {
  b0 ~ normal(b0_init,sig);
  b1 ~ normal(b1_init,sig);
  b2 ~ normal(b2_init,sig);
  for(k in 1:N)
      y[k] ~ bernoulli(p[k]);
}
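
The blrs object used in the next chunk is the compiled model; a minimal sketch of how it would be created, assuming the Stan file was saved as above:

# Compile the Stan model once so it can be reused for sampling.
library(rstan)
blrs <- stan_model(file = 'bayes_log_reg.stan')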

Then run the model like this:

# We prepare the data for STAN
logistic_dat <- list(N = 100,
                     y = df$y, x1 = df$x1, x2 = df$x2,
                     b0_init = log_reg$coefficients[1],
                     b1_init = log_reg$coefficients[2],
                     b2_init = log_reg$coefficients[3],
                     sig = 20)
# Run the model. We could compile and sample in one go:
#MCMC_log_reg <- stan(file = 'bayes_log_reg.stan', data = logistic_dat)
# or reuse the compiled model object blrs from above:
MCMC_log_reg <- rstan::sampling(blrs, data = logistic_dat)

The first thing that makes this different from the previous approach is that we don’t get a point estimator for our parameters, but rather a whole posterior density.

Taking the means of these posterior densities, we may then compare with our previous results:

##          our_coefficients R_function Bayesian_regression
## Constant         16.28138   16.28138            18.43367
## x1              -14.51051  -14.51051           -16.39721
## x2              -16.42439  -16.42439           -18.59865

Yes, they differ a bit. The question is whether that difference is a huge one or not.
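
The bayes_coeff vector used in the plotting code below holds these posterior means; here is a sketch of one way to extract it from the fitted object:

# Posterior means of the three coefficients from the MCMC draws.
bayes_coeff <- colMeans(as.matrix(MCMC_log_reg, pars = c('b0', 'b1', 'b2')))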

surface <-  1/(1+exp(-bayes_coeff[1] - bayes_coeff[2] * grid$x1 - bayes_coeff[3] * grid$x2))

image(seq(0,1,length = 100),seq(0,1,length = 100),matrix(surface,ncol = 100),
      col = rev(heat.colors(15)),
      xlab = 'x1',ylab = 'x2', main = 'Bayesian Logistic regression', asp = 1,
      xlim = c(0,1), ylim = c(0,1))

points(df$x1,df$x2,col = unlist(lapply(df$y,function(x){
  return(if_else(x==1,'black','blue'))})),
  pch = 15)

legend(1.05, 0.6, c('y = 0', 'y = 1'),col = c('blue','black'),pch = 15,
       xpd = TRUE, bty = 'n')

contour(x = seq(0,1,length = 100),
        y = seq(0,1,length = 100),
        z = matrix(surface,ncol = 100),
        levels = c(0.1,0.5,0.9), 
        add = TRUE)

It seems we are still on the right track, and even better we have found a whole distribution that represents our knowledge of the parameters we are trying to find.

Conclusion

I hope this helps explain how a logistic regression is done. It is quite important in the machine learning community, since it is the first and simplest neural network: a linear layer followed by an activation function.