Logistic Regression is not a regression but a classification learning algorithm. The name comes from statistics and is due to the fact that the mathematical formulation of logistic regression is similar to that of linear regression. Logistic regression turns linear regression into a classification model through the logistic function, also known as the sigmoid function (note: the logistic function is not the same thing as the logistic regression model).
\(\sigma\left(x\right)\ =\ \frac{1}{1\ +\ e^{-x}}\)
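As a quick illustration, here is a minimal sketch of the sigmoid in Python using NumPy (the function name `sigmoid` is just an illustrative choice):

```python
import numpy as np

def sigmoid(x):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0))     # 0.5
print(sigmoid(10))    # ~0.99995, close to 1
print(sigmoid(-10))   # ~0.000045, close to 0
```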
1830-1850 - under the guidance of Adolphe Quetelet, Pierre François Verhulst developed the logistic function for the purpose of modeling population growth.
1883 - the logistic function was independently developed in chemistry as a model of autocatalysis by Wilhelm Ostwald.
The logistic function is an S-shaped curve, and any value of x produces an output between 0 and 1. Many natural real-world systems have a carrying capacity or a natural limiting factor, for example world population growth.
1940s - the use of the logistic function for statistical modeling was developed by Joseph Berkson, who published "Application of the logistic function to bio-assay" in the Journal of the American Statistical Association in 1944.
In the example above, the graph plots sepal length against the probability of being the setosa species, and we can clearly see that linear regression is not a good model for that data (remember Anscombe’s Quartet). The logistic model, however, can explain being the setosa species almost perfectly from petal length in that data.
From linear regression, we already know the formula \(\hat{y} = \beta_0x_0 + ... + \beta_nx_n = \Sigma_{i=0}^n \beta_ix_i\), where \(x_0 = 1\) is the intercept term.
The logistic function transforms the output of that formula into the range 0 to 1 so that it can be treated as a probability: \(\hat{y} = \sigma(\Sigma_{i=0}^n \beta_ix_i) = \frac{1}{1 + e^{-(\Sigma_{i=0}^n \beta_ix_i)}}\)
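A minimal sketch of that prediction step in Python; the data and coefficient values here are made up purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, beta):
    """Linear combination of features and coefficients, squashed by the sigmoid.

    X    : (m, n) feature matrix whose first column is all ones (the intercept term x_0).
    beta : (n,) coefficient vector.
    """
    return sigmoid(X @ beta)

# Hypothetical numbers, just to show the shape of the computation.
X = np.array([[1.0, 2.5], [1.0, 4.0], [1.0, 6.5]])   # intercept column + one feature
beta = np.array([-4.0, 1.0])
print(predict_proba(X, beta))   # probabilities between 0 and 1
```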
First, we need to understand the term odds. The odds of an event with probability p are defined as the chance of the event happening divided by the chance of the event not happening.
\(odds = \frac{p}{1-p}\)
e.g. an event with a 50% probability of happening has odds 0.5/(1 - 0.5) = 1, i.e. 1 to 1 odds of happening.
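As a small sketch:

```python
def odds(p):
    """Odds of an event with probability p."""
    return p / (1.0 - p)

print(odds(0.5))   # 1.0  -> 1 to 1
print(odds(0.8))   # 4.0  -> 4 to 1
```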
Now rearrange the logistic regression formula to see where the odds show up. \(\hat{y} = \frac{1}{1 + e^{-(\Sigma_{i=0}^n \beta_ix_i)}}\)
\(\hat{y} + \hat{y}e^{-(\Sigma_{i=0}^n \beta_ix_i)} = 1\)
\(\hat{y}e^{-(\Sigma_{i=0}^n \beta_ix_i)} = 1 - \hat{y}\)
\(\frac{\hat{y}}{1 - \hat{y}} = e^{(\Sigma_{i=0}^n \beta_ix_i)}\)
\(ln(\frac{\hat{y}}{1 - \hat{y}}) = \Sigma_{i=0}^n \beta_ix_i\)
Note that the left-hand side is not the odds; it is the log odds. Consider p = 0.5: \(ln(\frac{0.5}{1-0.5}) = ln(1) = 0\)
As p goes to 1, the log odds go to \(\infty\), and as p goes to 0, the log odds go to \(-\infty\).
So, in terms of log odds, the class points (the actual data points, the x’s) are pushed out toward infinity. Since the log-odds scale is non-linear, \(\beta\) values cannot be interpreted directly as the effect of a one-unit increase, as in linear regression. Positive \(\beta\) values indicate an increased likelihood of belonging to class 1 as the associated x feature increases, and negative values indicate a decrease. The magnitudes of the coefficients are harder to interpret; we can use odds ratios to compare them against each other. Comparing the magnitudes of the coefficients against each other can give insight into which features have the strongest effect on the prediction output, as the sketch below illustrates.
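A rough sketch of reading odds ratios off fitted coefficients; the coefficient values below are invented for illustration:

```python
import numpy as np

# Hypothetical fitted coefficients for three features (illustrative values only).
beta = np.array([0.9, -0.4, 0.1])

# exp(beta_i) is the odds ratio: the multiplicative change in the odds of class 1
# for a one-unit increase in feature i, holding the other features fixed.
odds_ratios = np.exp(beta)
print(odds_ratios)   # roughly [2.46, 0.67, 1.11]
```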
In linear regression, we used the residual sum of squares (RSS) to fit the model. In logistic regression, as noted above, the data points sit at \(\pm\infty\) on the log-odds scale, so RSS cannot be calculated. Instead, we have to use maximum likelihood.
To calculate the likelihood, we first need to map the points back from the log-odds scale (where they sit at infinity) to probabilities between 0 and 1.
Log odds is defined as
\(ln(\frac{p}{1 - p}) = ln(odds)\)
\(\frac{p}{1 - p} = e^{ln(odds)}\)
\(p = (1-p) e^{ln(odds)}\)
\(p = e^{ln(odds)} - pe^{ln(odds)}\)
\(p + pe^{ln(odds)} = e^{ln(odds)}\)
\(p(1 + e^{ln(odds)}) = e^{ln(odds)}\)
\(p = \frac{e^{ln(odds)}}{1 + e^{ln(odds)}}\)
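This is exactly the sigmoid again, which a quick numeric sketch confirms:

```python
import numpy as np

def log_odds_to_prob(z):
    """Convert a log-odds value back to a probability: p = e^z / (1 + e^z)."""
    return np.exp(z) / (1.0 + np.exp(z))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])
print(log_odds_to_prob(z))   # same values as sigmoid(z)
print(sigmoid(z))
```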
Choose a candidate line on the log(odds) axis, project the data points onto that line (a dot product of features and coefficients), and then convert these log-odds values into probabilities on the logistic regression curve.
The likelihood is the product, over all data points, of the predicted probability of the class each point actually belongs to. The general formula for the likelihood is
\(L_{w,b} = \Pi_{i=1}^m f_{w,b}(x_i)^{y_i} (1-f_{w,b}(x_i))^{1-y_i}\)
The formula seems strange, but it is easy. If the target value y is 1, only the first factor, \(f_{w,b}(x_i)^{y_i}\), matters, because the exponent of the second factor becomes 0 and that factor equals 1. If the target value y is 0, only the second factor matters.
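A small sketch of that selection behaviour in Python; the probabilities and labels below are made up:

```python
import numpy as np

def likelihood(p_hat, y):
    """Product over all samples of p_hat^y * (1 - p_hat)^(1 - y).

    For each sample, only one factor differs from 1: p_hat when y = 1,
    and (1 - p_hat) when y = 0.
    """
    return np.prod(p_hat**y * (1.0 - p_hat)**(1 - y))

# Hypothetical predicted probabilities and true labels, just to illustrate.
p_hat = np.array([0.9, 0.2, 0.8])
y     = np.array([1,   0,   1])
print(likelihood(p_hat, y))   # 0.9 * 0.8 * 0.8 = 0.576
```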
In practice, we work with the log of the likelihood, which makes the gradient calculation easier. The problem is one of maximizing the likelihood, but gradient descent can only search for minima, so we negate the log-likelihood and minimize it instead. The cost function then looks like this.
\(J(\beta) = -\frac{1}{m} \Sigma_{j=1}^m [y^{(j)} log(\hat{y}^{(j)}) + (1-y^{(j)})log(1-\hat{y}^{(j)})]\)
where \(\hat{y} = \frac{1}{1 + e^{-(\Sigma_{i=0}^n \beta_ix_i)}}\)
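A minimal sketch of this cost function (binary cross-entropy), again with invented numbers:

```python
import numpy as np

def cost(y_hat, y):
    """Negative mean log-likelihood (binary cross-entropy)."""
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Hypothetical predictions and labels, for illustration only.
y_hat = np.array([0.9, 0.2, 0.8])
y     = np.array([1,   0,   1])
print(cost(y_hat, y))   # lower is better; perfect predictions give a cost near 0
```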
To run gradient descent, we need to calculate the derivatives.
\(\sigma(x)' = \sigma(x)(1 - \sigma(x))\)
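That identity can be sanity-checked numerically with a small finite-difference sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Finite-difference check of sigma'(x) = sigma(x) * (1 - sigma(x)) at a few points.
x = np.array([-2.0, 0.0, 1.5])
h = 1e-6
numeric  = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
analytic = sigmoid(x) * (1 - sigmoid(x))
print(numeric)
print(analytic)   # the two agree to several decimal places
```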
\(\frac{\partial}{\partial \beta_j} J(\beta) = \frac{1}{m}\Sigma_{i=1}^m [\hat{y}^{(i)} - y^{(i)}]x_j^{(i)}\)
I skip the intermediate steps, as they are too long to write out. From that formula we can run gradient descent just as in linear regression:
\(Repeat\)
\(\beta_j := \beta_j - \frac{\alpha}{m}\Sigma_{i=1}^m(\hat{y}^{(i)} - y^{(i)})x_j^{(i)}\)
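Putting it together, a minimal sketch of the full training loop in plain NumPy; the function name, learning rate, iteration count, and the tiny data set are all illustrative choices, not from the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, n_iters=10_000):
    """Batch gradient descent on the negative log-likelihood.

    X : (m, n) feature matrix with a leading column of ones for the intercept.
    y : (m,) vector of 0/1 labels.
    """
    m, n = X.shape
    beta = np.zeros(n)
    for _ in range(n_iters):
        y_hat = sigmoid(X @ beta)
        gradient = (X.T @ (y_hat - y)) / m   # all beta_j updated simultaneously
        beta -= alpha * gradient
    return beta

# Tiny made-up data set: class 1 tends to have larger feature values.
X = np.column_stack([np.ones(6), np.array([1.0, 2.0, 2.5, 4.0, 4.5, 6.0])])
y = np.array([0, 0, 0, 1, 1, 1])
beta = fit_logistic(X, y)
print(beta)
print(sigmoid(X @ beta))   # predicted probabilities for the training points
```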
That’s all there is to calculating logistic regression.