Logistic Regression

Video 6-1: Classification

Examples:

Email: spam/not spam
Online transactions: fraud/not fraud
Tumor: malignant/benign

The explanatory variable has only two possible values for now (multiclass problems may take on more values).

$ y \in {0,1} $

0: “Negative Class” 1: “Positive Class”

The following shows an example of a classification problem.

plot of chunk unnamed-chunk-1

A naive approach:

Threshold classifier output $ h_{\theta}(x) $ at 0.5:

If $ h_{\theta}(x) \ge 0.5 $, predict y=1
If $ h_{\theta}(x) \ 0.5 $, predict y=0

This seems to work; the intersection of the gray lines shows that the values with x >= 5 are classified as 1, others as 0.

A linear regression might look like it would work, until we add another data point at x=50, y=1, then the regression line changes greatly and many values with y=1 are now on the wrong side of the line.

plot of chunk unnamed-chunk-2

Now the regression line does not work, points are misclassified.

Another “funny thing” about using linear regression for a classification problem:

For classificatio problem, y = 0 or 1

$ h_{\theta}(x) $ can be > 1 or < 0

But these open ranges don't match up with “1 or 0”.

Logisitic Regression: $ 0 \le h_{\theta}(x) \le 1 $

Logistic regression is a classification algorithm. The name is a misnomer. True regression works on continuous numeric values.

Video 6-2: Hypothesis Presentation

Logistic Regression Model

Want $ 0 \le h_{\theta}(x) \le 1 $

Hypothesis in linear regression:

\[ h_{\theta}(x) = \theta^Tx \]

Modified for logistic regression:

\[ h_{\theta}(x) = g(\theta^Tx) \]

\[ g(z) = \frac{1}{1+e^{-z}}, z \in R \]

This is the sigmoid or logistic function.

plot of chunk unnamed-chunk-3

Any value for x will return a y between 0 and 1. x=0 returns y=0.5. There are asymptotes at y = 0 and y = 1.

So combining both functions above:

\[ h_{\theta}(x) = \frac{1}{1 + e^{-\theta^Tx}} \]

Interpretation of Hypothesis Output

$ h_{\theta}(x) $ = estimated probability that y=1 on input x

Example:

If x = [x_0=1, x_1=tumorSize], and as a result, $ h_{\theta}(x)=0.7 $, then there is a 70% chance that tumor is malignant.

In terms of probability:

\[ h_{\theta} = P(y=1|x;\theta) \]

Since y must be either 0 or 1, it also follows that:

\[ P(y=0|x;\theta) + P(y=1|x;\theta) = 1 \]

Video 6-3: Decision Boundary

Suppose we predict “y=1” when $ h_{\theta}(x) \ge 0 $. Therefore, “y=0” when $ h_{\theta}(x) < 0 $.

The plot of the sigmoid function above shows that, when $ x < 0 $, then $ y < 0.5 $, and when $ x \ge 0 $, $ y \ge 0.5 $. Thus $ g(z) \ge 0 $ when $ z \ge 0 $.

Since in logistic regression, $ h_{\theta}(x) = g(\theta^Tx) $, then $ h_{\theta}(x) \ge 0.5 $ whenever $ \theta^Tx \ge 0 $.

So, $ y=1 $ whenever $ \theta^Tx \ge 0 $, and $ y=0 $ whenever $ \theta^Tx < 0 $.

Decision Boundary

plot of chunk unnamed-chunk-4

Based on the data in the plot above, let $ \theta = [-3,1,1] $. Then predict y=1 if $ -3 + x_1 + x_2 \ge 0 $, or $ x_1 + x_2 \ge 3 $.

Everything above the line is in the “y=1” region. Everythin below it is in the “y=0” region ($ x_1 + x_2 < 3 $).

The line in the plot therefore is the decision boundary. The decision boundary is a property of the hypothesis, including the parameters $ \theta_0 $, $ \theta_1 $ and $ \theta_2 $. It is a property of the hypothesis, not of the data set.

Later, we learn to fit the parameters, and that involves the use of the data set.

Non-Linear Decision Boundaries

plot of chunk unnamed-chunk-5

Now the decision boundary would form something like a circle, which might be represented by $ h_{\theta}(x) = g(\theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_1^2 + \theta_4x_2^2) $.

Not yet to the point to discuss automatically selecting values for the parameters in $ \theta $. But for now, speculate that the values are $ [-1, 0, 0, 1, 1] $.

This in turn predicts that $ y=1 $ when $ -1 + x_1^2 + x_2^2 \ge 0 $, or $ x_1^2 + x_2^2 \ge 1 $. This is the equation for a circle centered at origin with radius of 1. (This doesn't fit my example plot above but too bad. Close enough.)

This is the first example of a polynomial as a more-complex decision boundary than a simple linear equation.

I think the point here is that the training set does not define the decision boundary. For example, it is not related to the selection of the polynomial. It is used to “fit” the model, by adjusting the values of $ \theta $, but it does not determine the polynomial. (And yet the data must somehow influence this too…this must come later.)

There can be even more complex, even higher-ordered polynomials to form a decision boundary. They may take the form of an ellipse, or almost any shape, with many local optima.

Video 6-4: Cost Function

We have a training set of $ m $ examples:

$ \{(x^{(1)},y^{(1)}), (x^{(2)},y^{(2)}),\dots,(x^{(m)},y^{(m)})\} $

$ m $ examples of $ x $, where $ x $ has $ n $ features and $ n+1 $ dimensions:

$ x \in \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \\ \end{bmatrix} $, where $ x_0=1 $, $ y \in \{0,1\} $

And:

$ h_{\theta}(x) = \frac{1}{1 + e^{-\theta^Tx}} $, where $ x_0 = 1 $.

How to choose the parameters for $ \theta $?

Linear regression: $ J(\theta) = \frac{1}{m} \sum\limits_{i=1}^{m} \frac{1}{2} (h_{\theta} (x^{(i)}) - y^{(i)})^2 $

Note that, instead of $ \frac{1}{2m} $, the $ \frac{1}{2} $ has been moved inside the summation.

Restate the above as $ J(\theta) = \frac{1}{m} \sum\limits_{i=1}^{m} \text{cost}( h_{\theta} (x^{(i)}) - y^{(i)}) $

So:

\[ \text{cost}( h_{\theta} (x^{(i)}), y^{(i)}) = \frac{1}{2} (h_{\theta} (x^{(i)}) - y^{(i)})^2 \]

Now we can eliminate the $ (i) $ superscripts, since the summation is gone:

\[ \text{cost}( h_{\theta}(x), y) = \frac{1}{2} (h_{\theta} (x) - y)^2 \]

The above is, again, the cost for linear regression. It won't work for logistic regression, because the function is non-convex. It will have many local optima.

Logistic regression cost function

\[ \text{Cost}(h_{\theta}(x), y) = \{ \begin{array}{l l} -\log(h_{\theta}(x)) & \quad \text{if $y = 1$}\\ -\log(1 - h_{\theta}(x)) & \quad \text{if $y = 0$}\\ \end{array} \] Here is plot of $ \log(x) $:

plot of chunk unnamed-chunk-7

Here is plot of $ -\log(x) $:

plot of chunk unnamed-chunk-8

The last plot shows that values of x between 0 and 1, we fall from infinity toward 1 on the y axis. This is the cost we want when $ y=1 $. If x is very near zero, we do not expect y to be 1, so there is a very high cost. If x is very near 1, we expect y to be 1, so there is a very low cost.

When $ y = 0 $, we flip this around. Here is the plot of $ f(x) = -log(1 - x) $:

## Warning: NaNs produced

plot of chunk unnamed-chunk-9

Now we expect y to be 0. If x is zero, the cost is very low. If x is near 1, the cost is very high.

Video 6-5: Simplified Cost Function and Gradient Descent

Cost function (recap of last video):

\[ J(\theta) \frac{1}{m} \sum\limits_{i=1}^{m} \text{cost}( h_{\theta}(x), y) \]

\[ \text{Cost}(h_{\theta}(x), y) = \{ \begin{array}{l l} -\log(h_{\theta}(x)) & \quad \text{if $y = 1$}\\ -\log(1 - h_{\theta}(x)) & \quad \text{if $y = 0$}\\ \end{array} \]

Note: $ y $ is always 0 or 1. Therefore cost can be simplified as:

\[ \text{Cost}(h_{\theta}(x), y) = -y\log(h_{\theta}(x)) - (1 - y) \log(1 - h_{\theta}(x)) \]

When $ y=0 $, the first part of the equation is 0. When $ y=1 $, the second part of the equation is 0. So the complete cost function is:

\[ J(\theta) = -\frac{1}{m} \sum\limits_{i=1}^{m} y^{(i)}\log(h_{\theta}(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_{\theta}(x^{(i)})) \]

Other cost functions are possible, but this is derived from statistics using the principle of maximum likelihood estimation. It has a property that it is convex. This topic is outside the scope of the course.

The next step is determining how to minimize $ J(\theta) $. Want $ min_{\theta} J(\theta) $.

Repeat {
$ \theta_j : \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta) $
} (simultaneously update all $ \theta_j $)

\[ \frac{\partial}{\partial\theta_j} = \frac{1}{m} \sum\limits_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)} \]

Therefore to minimize:

Repeat {
$ \theta_j : \theta_j - \alpha\sum\limits_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)} $
}

This algorithm looks the same as the one used in linear regression, but of course the hypothesis is now very different.

Feature scaling still applies with logistic regression–it can help gradient descent run much faster with logistic regression.

Video 6-6: Advanced Optimization

The goal is to find the minimum of $ \theta $ for a cost function $ J(\theta) $. We have seen code that uses the partial derivatives of each feature in $ \theta $ to iteratre and minimize $ \theta $. So gradient descent requires the definitions of $ J(\theta) $ and the partial derivatives. The actual gradient descent algorithm does not require the definition of $ J(\theta) $, unless you want to monitor the progress of the descent.

There is more than one algorithm to minimize $ \theta $:

gradient descent
conjugate gradient
BFGS
L-BFGS

The last three are all much more sophisticated and complex. Advantages include:

No need to manually pick $ \alpha $
Often faster than gradient descent

Disadvantages:

More complex; difficult to understand how they work

Other languages such as C++ and Java have their own libraries with more advanced minimization functions. But they may vary in performance so it is important to test and investigate them.

Example

$ \theta = \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} $

$ J(\theta) = (\theta_1 - 5)^2 + (\theta_2 - 5)^2 $

$ \frac{\partial}{\partial_{\theta_1}} = 2(\theta_1 - 5) $

$ \frac{\partial}{\partial_{\theta_2}} = 2(\theta_2 - 5) $

The obvious minimum here would be $ \theta_1 = 5, \theta_2 = 5 $. But to use an algorithm:

function[jVal, gradient] = costFunction(theta)
  jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2
  gradient = zeros(2,1)
  gradient(1) = 2 * (theta(1) - 5)
  gradient(2) = 2 * (theta(2) - 5)
end

This returns two things. The first is jVal, the function to compute J. The second is a 2-element vector that contains the partial derivatives for $ \theta_1 $ and $ \theta_2 $.

Now you can pass these items to the advanced optimization function.

options = optimset('GradObj', 'on', 'MaxIter', '100')
initialTheta = zeros(2,1)
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options)

fminiunc stands for “function minimization unconstrained.”

The optimset function builds a data structure with options for fminunc. The options here:

'GradObj' is 'on': This means you will supply a gradient to the alogrithm
'MaxIter' is '100': perform no more than 100 iterations

The @ symbol before costFubction identifies it as a pointer to the function. Rather like Perl, I think.

The call will return:

optTheta, the final, optimized set of values for $ \theta $
functionVal is the value returned for the optimized values of theta; in our example, this would be zero, or a number very close to it
exitFlag is the “convergence status”–a 1 means that the function is confident it found a convergence

How do we adapt this simple example to logistic regression?

$ \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix} $

function [jVal, gradient] = costFunction(theta)  
  jVal = # code to compute J(theta)  
  gradient(1) = [code to compute partial derviative for theta(1)];  
  gradient(2) = [code to compute partial derviative for theta(2)];  
  ...  
  gradient(n+1) = [code to compute partial derviative for theta(n+1)];  
end

Video 6-7: Multiclass Classification

Examples of a multiclass classification problem:

Email foldering/tagging: friends, family, work, hobby
Medical diagnosis: Not ill, cold, flu
Weather: Sunny, cloudy, snow, rain

In each example, y can take on one of a small number of discrete, predefined values.

In a binary classification problem, there are two groups of data in a plot. For examples, X's and O's. In a multiclass classification problem, there will be three or more: X's, triangles, squares, etc. plot of chunk unnamed-chunk-10

Here, $h_{\theta}¹(x) is the hypothesis separating the plus signs from everything else. The superscript 1 stands for “class 1” in a multiclass one-vs-all scheme. We can now train a simple logistic regression algorithm against this hypothesis.

We can now draw two other lines separating the other classes, and their hypotheses are $ h_{\theta}^{(2)} $ and $ h_{\theta}^{(3)} $.

The probabilities for these hypotheses are:

P(y=1|x;$ \theta $)
P(y=2|x;$ \theta $)
P(y=3|x;$ \theta $)

So we have trained a logisitic regression classifier $ h_{\theta}^{(i)}(x) $ for each class $ i $ to predict the probability that $ y=i $.

To make a prediction on a new input $ x $, pick the class $ i $ that maximizes:

\[ \smash{\displaystyle\max_{i}} h_{\theta}^{(i)}(x) \]

This means that we run each classifier, and then pick the one that returns returns the highest value; this indicates that class with the highest probability of containing $ x $.