Partial Derivative of Cost Function for Logistic Regression

The challenge here is to work out the partial derivatives for a cost function in logistic regression. It is far more complicated than the trivial exercise of partial derivatives for linear regression. For me, anyway.

Cost function

The cost function for logistic regression is:

\[J(\theta) = -\frac{1}{m} \sum\limits_{i=1}^m \left[ y^{(i)} \log(h_{\theta}(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_{\theta}(x^{(i)})) \right]\]
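To make the formula concrete, here is a minimal sketch in plain Python (my choice of language for illustration, not something taken from the course). The names `sigmoid` and `cost` and the toy data `X`, `y` are assumptions; each row of `X` starts with 1.0 for the intercept term, and `sigmoid` is the hypothesis function defined in the next section.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, X, y):
    """J(theta): the averaged negative log-likelihood over the m examples."""
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = sigmoid(sum(t * x_k for t, x_k in zip(theta, x_i)))
        # Both log terms sit inside the sum; the minus sign applies to the whole sum.
        total += y_i * math.log(h) + (1 - y_i) * math.log(1 - h)
    return -total / m

# Hypothetical toy data, invented for this example.
X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]]
y = [1, 0, 1]

# With theta = 0 every prediction is 0.5, so the cost is log(2).
print(cost([0.0, 0.0], X, y))
```

A quick sanity check like the `theta = 0` case is handy because the answer, \(\log 2\), can be worked out by hand.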

Partial derivative of the cost function

This is what we know (thanks to Prof. Ng) to be the partial derivative of the cost function. Can we find it?

\[\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)}\]

Hypothesis (sigmoid) function

\[h_{\theta}(x^{(i)}) = \frac{1}{1+e^{-\theta^Tx^{(i)}}}\]
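As a small illustration, the sketch below evaluates the hypothesis for one example and checks the identity \(1 - h = \frac{e^{-z}}{1+e^{-z}}\) with \(z = \theta^Tx\), which the simplification of the cost function relies on. The function name `hypothesis` and the values of `theta` and `x` are made up for the example.

```python
import math

def hypothesis(theta, x):
    """h_theta(x) = 1 / (1 + e^(-theta^T x)) for a single example x."""
    z = sum(t * x_k for t, x_k in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

theta = [0.2, -0.4]   # hypothetical parameters
x = [1.0, 3.0]        # hypothetical example; x[0] = 1 is the intercept term
z = sum(t * x_k for t, x_k in zip(theta, x))

h = hypothesis(theta, x)
# 1 - h written as a single rational expression in e^(-z)
one_minus_h = math.exp(-z) / (1.0 + math.exp(-z))

print(h, 1.0 - h, one_minus_h)
```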

Rules for logarithmic expressions

\[\log \left( \frac{x}{y} \right) = \log(x) - \log(y)\]

\[\log(e^a) = a\]

Rules for derivatives of logarithmic expressions

\[\frac{d}{dx} \log(\text{expression}) = \frac{1}{\text{expression}} \cdot \frac{d}{dx} \text{expression}\]

Examples:

\[\frac{d}{dx} \log(x) = \frac{1}{x} \cdot \frac{d}{dx} x = \frac{1}{x} \cdot 1 = \frac{1}{x}\]

\[\frac{d}{dx} \log\left(\frac{1}{2x^2 + 3}\right) = (2x^2 + 3) \cdot \frac{d}{dx} \left(\frac{1}{2x^2 + 3}\right) = (2x^2 + 3) \cdot \frac{-4x}{(2x^2 + 3)^2} = \frac{-4x}{2x^2 + 3}\]
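Rules like this are easy to get wrong by a sign or a factor, so a numeric check is worth having. The sketch below compares the rule "one over the expression times the derivative of the expression" against a central finite difference, using \(f(x) = \frac{1}{2x^2+3}\) as the test function. The helper names `f`, `df`, and `numeric_derivative` and the test point `x0` are my own choices.

```python
import math

def f(x):
    """The inner expression 1 / (2x^2 + 3)."""
    return 1.0 / (2.0 * x**2 + 3.0)

def df(x):
    """Derivative of (2x^2 + 3)^(-1) by the power rule and chain rule."""
    return -4.0 * x / (2.0 * x**2 + 3.0) ** 2

def numeric_derivative(g, x, eps=1e-6):
    """Central finite-difference approximation of g'(x)."""
    return (g(x + eps) - g(x - eps)) / (2.0 * eps)

x0 = 1.5
by_rule = df(x0) / f(x0)   # (1 / expression) * d/dx expression
by_limit = numeric_derivative(lambda x: math.log(f(x)), x0)

print(by_rule, by_limit)
```

The two numbers should agree to several decimal places; at \(x = 1.5\) the rule gives \(-4x/(2x^2+3) = -0.8\).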

Finding the partial derivatives for each \(j\) in \(\theta\)

This is an almost maddeningly clever technique. It is not the sort of answer one might find casually. It seems clear that there was much trial and error arriving at the neat, compact form at the end of step 2.

1. Simplify the cost function

\[\begin{eqnarray} J(\theta) &=& -\frac{1}{m} \left[ \sum\limits_{i=1}^m \left( y^{(i)} \log(h_{\theta}(x^{(i)})) + (1 - y^{(i)})\log(1 - h_{\theta}(x^{(i)})) \right) \right] \\ \nonumber
& & \text{Replace }h_{\theta}(x^{(i)})\text{ with sigmoid} \\ \nonumber
&=& -\frac{1}{m} \left[ \sum\limits_{i=1}^m \left( y^{(i)}\log(\frac{1}{1+e^{-\theta^Tx^{(i)}}}) + (1-y^{(i)}) \log(1 - \frac{1}{1+e^{-\theta^Tx^{(i)}}}) \right) \right] \\ \nonumber
& & \text{Convert right term to single rational expression} \\ \nonumber
&=& -\frac{1}{m} \left[ \sum\limits_{i=1}^m \left( y^{(i)}\log(\frac{1}{1+e^{-\theta^Tx^{(i)}}}) + (1-y^{(i)}) \log(\frac{e^{-\theta^Tx^{(i)}}}{1+e^{-\theta^Tx^{(i)}}}) \right) \right] \\ \nonumber
& & \text{Apply }\log(\frac{a}{b})=\log(a) - \log(b) \text{ to left term} \\ \nonumber
&=& -\frac{1}{m} \left[ \sum\limits_{i=1}^m \left( y^{(i)}(\log(1)-\log(1+e^{-\theta^Tx^{(i)}})) + (1-y^{(i)}) \log(\frac{e^{-\theta^Tx^{(i)}}}{1+e^{-\theta^Tx^{(i)}}}) \right) \right] \\ \nonumber
& & \text{Use }\log(1) = 0 \\ \nonumber
&=& -\frac{1}{m} \left[ \sum\limits_{i=1}^m \left( -y^{(i)}\log(1+e^{-\theta^Tx^{(i)}}) + (1-y^{(i)}) \log(\frac{e^{-\theta^Tx^{(i)}}}{1+e^{-\theta^Tx^{(i)}}}) \right) \right] \\ \nonumber
& & \text{Apply }\log(\frac{a}{b}) = \log(a) - \log(b) \text{ to right term} \\ \nonumber
&=& -\frac{1}{m} \left[ \sum\limits_{i=1}^m \left( -y^{(i)}\log(1+e^{-\theta^Tx^{(i)}}) + (1-y^{(i)}) \log(e^{-\theta^Tx^{(i)}})-(1-y^{(i)})\log(1+e^{-\theta^Tx^{(i)}}) \right) \right] \\ \nonumber
& & \text{Apply } \log(e^a) = a \text{ to middle term} \\ \nonumber
&=& -\frac{1}{m} \left[ \sum\limits_{i=1}^m \left( -y^{(i)}\log(1+e^{-\theta^Tx^{(i)}}) + (1-y^{(i)})(-\theta^Tx^{(i)})-(1-y^{(i)})\log(1+e^{-\theta^Tx^{(i)}}) \right) \right] \\ \nonumber
& & \text{Move minus sign inside }\sum \\ \nonumber
&=& \frac{1}{m} \left[ \sum\limits_{i=1}^m \left( y^{(i)}\log(1+e^{-\theta^Tx^{(i)}}) + (1-y^{(i)})(\theta^Tx^{(i)})+(1-y^{(i)})\log(1+e^{-\theta^Tx^{(i)}}) \right) \right] \\ \nonumber
& & \text{Combine first and third terms} \\ \nonumber
&=& \frac{1}{m} \left[ \sum\limits_{i=1}^m \left( \log(1+e^{-\theta^Tx^{(i)}}) + (1-y^{(i)})(\theta^Tx^{(i)}) \right) \right] \\ \nonumber \end{eqnarray}\]
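After this much algebra I like to confirm that nothing was lost along the way. The sketch below evaluates the original cost and the simplified form on the same data and compares them. The function names, the helper `dot`, and the toy data are assumptions made for this check.

```python
import math

def dot(a, b):
    """Inner product theta^T x."""
    return sum(p * q for p, q in zip(a, b))

def cost_original(theta, X, y):
    """-(1/m) * sum( y log(h) + (1 - y) log(1 - h) )."""
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = 1.0 / (1.0 + math.exp(-dot(theta, x_i)))
        total += y_i * math.log(h) + (1 - y_i) * math.log(1 - h)
    return -total / m

def cost_simplified(theta, X, y):
    """(1/m) * sum( log(1 + e^(-z)) + (1 - y) z ) with z = theta^T x."""
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        z = dot(theta, x_i)
        total += math.log(1.0 + math.exp(-z)) + (1 - y_i) * z
    return total / m

# Hypothetical data and parameters.
X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0], [1.0, 0.1]]
y = [1, 0, 1, 0]
theta = [0.3, -0.7]

a = cost_original(theta, X, y)
b = cost_simplified(theta, X, y)
print(a, b)
```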

2. Take the partial derivative

See step 2 in First Attempt (below) for the derivative of \(1 + e^{-\theta^Tx}\), which is needed for the left term.

\[\begin{eqnarray} \frac{\partial}{\partial \theta_j} J(\theta) &=& \frac{1}{m} \left[ \sum\limits_{i=1}^{m} \left( \frac{e^{-\theta^Tx^{(i)}}(-x_j^{(i)})}{1 + e^{-\theta^Tx^{(i)}}} + (1 - y^{(i)})x_j^{(i)}\right) \right] \\ \nonumber
& & \text{Now factor out }x_j^{(i)} \\ \nonumber
&=& \frac{1}{m} \left[ \sum\limits_{i=1}^{m} \left( \frac{-e^{-\theta^Tx^{(i)}}}{1 + e^{-\theta^Tx^{(i)}}} + 1 - y^{(i)} \right) x_j^{(i)} \right] \\ \nonumber
& & \text{Combine first two terms} \\ \nonumber
&=& \frac{1}{m} \left[ \sum\limits_{i=1}^{m} \left( \frac{1}{1 + e^{-\theta^Tx^{(i)}}} - y^{(i)} \right) x_j^{(i)} \right] \\ \nonumber
& & \text{Substitute }h_{\theta}(x^{(i)}) \text{ for sigmoid function} \\ \nonumber
&=& \frac{1}{m} \left[ \sum\limits_{i=1}^{m} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \right] \\ \nonumber \end{eqnarray}\]
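One way to gain confidence in the result is a gradient check: compute the partial derivatives from the closed form and compare them with finite-difference estimates of the cost itself. This is only a sketch; the function names, the step size `eps`, and the toy data are my assumptions, and each row of `X` starts with 1.0 for the intercept term.

```python
import math

def dot(a, b):
    """Inner product theta^T x."""
    return sum(p * q for p, q in zip(a, b))

def cost(theta, X, y):
    """J(theta) in its original (unsimplified) form."""
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = 1.0 / (1.0 + math.exp(-dot(theta, x_i)))
        total += y_i * math.log(h) + (1 - y_i) * math.log(1 - h)
    return -total / m

def analytic_gradient(theta, X, y):
    """(1/m) * sum_i (h(x_i) - y_i) * x_ij for every j."""
    m = len(X)
    grad = [0.0] * len(theta)
    for x_i, y_i in zip(X, y):
        h = 1.0 / (1.0 + math.exp(-dot(theta, x_i)))
        for j, x_ij in enumerate(x_i):
            grad[j] += (h - y_i) * x_ij
    return [g / m for g in grad]

def numeric_gradient(theta, X, y, eps=1e-6):
    """Central finite differences of the cost, one coordinate at a time."""
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[j] += eps
        down[j] -= eps
        grad.append((cost(up, X, y) - cost(down, X, y)) / (2.0 * eps))
    return grad

# Hypothetical data and parameters.
X = [[1.0, 0.5, -1.2], [1.0, -1.5, 0.3], [1.0, 2.0, 0.8], [1.0, 0.1, -0.4]]
y = [1, 0, 1, 0]
theta = [0.3, -0.7, 0.2]

analytic = analytic_gradient(theta, X, y)
numeric = numeric_gradient(theta, X, y)
for a, n in zip(analytic, numeric):
    print(f"{a:+.8f}  {n:+.8f}")
```

The two columns should match to roughly six decimal places. A mismatch in sign or magnitude would point to an algebra error somewhere in the derivation.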

First attempt

Everything below was a first attempt: differentiating the cost function piece by piece, without simplifying it first.

1. Derivative of \(e^{-\theta^Tx}\)

\[\frac{\partial}{\partial\theta_j} e^{-\theta^Tx} = e^{-\theta^Tx} \cdot \frac{\partial}{\partial\theta_j} (-\theta^T x) = e^{-\theta^Tx} \cdot (-x_j^{(i)})\]

If I’m right here, it means that we can’t get rid of the \(-\theta^Tx\), because of the chain rule: \(\frac{d}{dx} e^{u} = e^{u} \cdot \frac{du}{dx}\). Hence, things quickly get messy.

However, we are looking at \(h_{\theta}(x^{(i)})\), not \(h_{\theta}(x)\). So I think it should be understood that every instance of \(e^{-\theta^Tx}\) below is actually:

\[e^{-\theta^Tx^{(i)}}\]

Does this make a difference?

2. Derivative of \(1 + e^{-\theta^Tx}\)

The 1 is just a constant, so the derivative is the same as for \(e^{-\theta^Tx}\):

\[\frac{\partial}{\partial\theta_j} (1 + e^{-\theta^Tx}) = 0 + \frac{\partial}{\partial\theta_j} e^{-\theta^Tx}= e^{-\theta^Tx} \cdot (-x_j^{(i)})\]
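This building block is simple enough to verify numerically. The sketch below perturbs a single \(\theta_j\) and compares the finite-difference slope of \(1 + e^{-\theta^Tx}\) with \(e^{-\theta^Tx} \cdot (-x_j)\). The function name `g`, the index `j`, and the numbers are invented for the example.

```python
import math

def g(theta, x):
    """1 + e^(-theta^T x) for a single example x."""
    return 1.0 + math.exp(-sum(t * x_k for t, x_k in zip(theta, x)))

theta = [0.2, -0.4, 0.1]   # hypothetical parameters
x = [1.0, 3.0, -2.0]       # hypothetical example
j = 1                      # which theta_j to differentiate with respect to
eps = 1e-6

z = sum(t * x_k for t, x_k in zip(theta, x))
analytic = math.exp(-z) * -x[j]

# Central finite difference in the j-th coordinate only.
up = list(theta)
down = list(theta)
up[j] += eps
down[j] -= eps
numeric = (g(up, x) - g(down, x)) / (2.0 * eps)

print(analytic, numeric)
```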

3. Derivative of sigmoid function

\[\frac{\partial}{\partial\theta_j} \frac{1}{1+e^{-\theta^Tx}} = \frac{\partial}{\partial\theta_j} (1+e^{-\theta^Tx})^{-1}=-1(1+e^{-\theta^Tx})^{-2} \cdot \frac{\partial}{\partial\theta_j} (1 + e^{-\theta^Tx})\]

Substitute 2. for the last term, which becomes the numerator. The \(-1\) from the power rule cancels the minus sign in \(-x_j^{(i)}\), so the result is positive:

\[\frac{\partial}{\partial\theta_j} \frac{1}{1+e^{-\theta^Tx}} = \frac{e^{-\theta^Tx} \cdot x_j^{(i)}}{(1+e^{-\theta^Tx})^2}\]

4. Derivative of log of sigmoid function

\[\frac{\partial}{\partial\theta_j} \log(\frac{1}{1+e^{-\theta^Tx}}) = (1+e^{-\theta^Tx}) \cdot \frac{\partial}{\partial\theta_j} \frac{1}{1 + e^{-\theta^Tx}}\]

Substitute 3. for the last term, keeping its positive sign (the \(-1\) from the power rule cancels the minus sign in \(-x_j^{(i)}\)).

\[\frac{\partial}{\partial\theta_j} \log(\frac{1}{1+e^{-\theta^Tx}}) = (1+e^{-\theta^Tx}) \cdot \frac{e^{-\theta^Tx} \cdot x_j^{(i)}}{(1+e^{-\theta^Tx})^2} = \frac{e^{-\theta^Tx} \cdot x_j^{(i)}}{1+e^{-\theta^Tx}}\]

Substitute \(h_{\theta}(x)\) for \(\frac{1}{1+e^{-\theta^Tx}}\).

\[\frac{\partial}{\partial\theta_j} \log(\frac{1}{1+e^{-\theta^Tx}}) = \frac{e^{-\theta^Tx} \cdot x_j^{(i)}}{1+e^{-\theta^Tx}} = h_{\theta}(x) \cdot e^{-\theta^Tx} \cdot x_j^{(i)}\]

Since \(h_{\theta}(x) \cdot e^{-\theta^Tx} = 1 - h_{\theta}(x)\), this can also be written as \((1 - h_{\theta}(x)) \cdot x_j^{(i)}\).
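Sign slips are easy to make in this step, so a numeric check is a useful sanity test. Because \(\log h\) increases with \(\theta^Tx\), the derivative with respect to \(\theta_j\) should have the same sign as \(x_j\). The sketch below compares a finite-difference estimate with the closed form \(\frac{e^{-z} x_j}{1+e^{-z}}\), where \(z = \theta^Tx\), and with the compact form \((1-h) x_j\). All names and numbers are hypothetical.

```python
import math

def log_h(theta, x):
    """log of the sigmoid hypothesis for a single example x."""
    z = sum(t * x_k for t, x_k in zip(theta, x))
    return math.log(1.0 / (1.0 + math.exp(-z)))

theta = [0.2, -0.4, 0.1]   # hypothetical parameters
x = [1.0, 3.0, -2.0]       # hypothetical example
j = 1                      # x[j] = 3.0 is positive
eps = 1e-6

z = sum(t * x_k for t, x_k in zip(theta, x))
h = 1.0 / (1.0 + math.exp(-z))

analytic = x[j] * math.exp(-z) / (1.0 + math.exp(-z))
compact = (1.0 - h) * x[j]

# Central finite difference in the j-th coordinate only.
up = list(theta)
down = list(theta)
up[j] += eps
down[j] -= eps
numeric = (log_h(up, x) - log_h(down, x)) / (2.0 * eps)

print(analytic, compact, numeric)
```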

5. Derivative of log of (1 - sigmoid function)

The intermediate expressions here are long, but they collapse to something compact. The key is the sign of the sigmoid derivative from step 3: \(\frac{\partial}{\partial\theta_j} \frac{1}{1+e^{-\theta^Tx}} = \frac{e^{-\theta^Tx} \cdot x_j^{(i)}}{(1+e^{-\theta^Tx})^2}\), which is positive because the \(-1\) from the power rule cancels the minus sign in \(-x_j^{(i)}\).

Find \(\frac{\partial}{\partial\theta_j} \log(1-h_{\theta}(x^{(i)}))\).

First, replace \(h_{\theta}(x^{(i)})\) with the sigmoid function.

\[\frac{\partial}{\partial\theta_j} \log(1-h_{\theta}(x^{(i)})) = \frac{\partial}{\partial\theta_j} \log(1 - \frac{1}{1+e^{-\theta^Tx}})\]

Put into a single rational expression.

\[= \frac{\partial}{\partial\theta_j} \log(\frac{1+e^{-\theta^Tx}}{1+e^{-\theta^Tx}}-\frac{1}{1+e^{-\theta^Tx}}) = \frac{\partial}{\partial\theta_j} \log(\frac{e^{-\theta^Tx}}{1 + e^{-\theta^Tx}})\]

Next, \(\frac{\partial}{\partial x} \log(\text{expr}) = \frac{1}{\text{expr}} \cdot \frac{\partial}{\partial x} \text{expr}\).

\[= \frac{\partial}{\partial\theta_j} \log(\frac{e^{-\theta^Tx}}{1+e^{-\theta^Tx}}) = \frac{1+e^{-\theta^Tx}}{e^{-\theta^Tx}} \cdot \frac{\partial}{\partial\theta_j} \frac{e^{-\theta^Tx}}{1+e^{-\theta^Tx}}\]

Rewrite the quotient as a product of two expressions whose derivatives are already known.

\[= \frac{1+e^{-\theta^Tx}}{e^{-\theta^Tx}} \cdot \frac{\partial}{\partial\theta_j} \left[ e^{-\theta^Tx} \cdot \frac{1}{1 + e^{-\theta^Tx}} \right]\]

Now apply the product rule, substituting the previous derivatives (1. and 3.).

\[= \frac{1+e^{-\theta^Tx}}{e^{-\theta^Tx}} \cdot \left[ e^{-\theta^Tx} \cdot \frac{e^{-\theta^Tx} \cdot x_j^{(i)}}{(1+e^{-\theta^Tx})^2} + (e^{-\theta^Tx} \cdot (-x_j^{(i)})) \cdot \frac{1}{1+e^{-\theta^Tx}} \right]\]

Simplify both expressions in the square brackets into rational expressions.

\[= \frac{1+e^{-\theta^Tx}}{e^{-\theta^Tx}} \cdot \left[ \frac{(e^{-\theta^Tx})^2 \cdot x_j^{(i)}}{(1+e^{-\theta^Tx})^2} - \frac{e^{-\theta^Tx} \cdot x_j^{(i)}}{1+e^{-\theta^Tx}} \right]\]

Make the denominators in the square brackets equivalent.

\[= \frac{1+e^{-\theta^Tx}}{e^{-\theta^Tx}} \cdot \left[ \frac{(e^{-\theta^Tx})^2 \cdot x_j^{(i)}}{(1+e^{-\theta^Tx})^2} - \frac{e^{-\theta^Tx} \cdot x_j^{(i)} \cdot (1+e^{-\theta^Tx})}{(1+e^{-\theta^Tx})^2} \right]\]

Combine the two expressions in the square brackets.

\[= \frac{1+e^{-\theta^Tx}}{e^{-\theta^Tx}} \cdot \left[ \frac{(e^{-\theta^Tx})^2 \cdot x_j^{(i)} - e^{-\theta^Tx} \cdot x_j^{(i)} \cdot (1+e^{-\theta^Tx})}{(1+e^{-\theta^Tx})^2} \right]\]

Simplify the whole thing. Cancel \(e^{-\theta^Tx}\) between the left denominator and the right numerator. Cancel \(1 + e^{-\theta^Tx}\) between the left numerator and the right denominator.

\[= \frac{e^{-\theta^Tx}\cdot x_j^{(i)} - x_j^{(i)} \cdot (1+e^{-\theta^Tx})}{1+e^{-\theta^Tx}}\]

Factor out \(x_j^{(i)}\) in the numerator. The exponential terms cancel.

\[= \frac{x_j^{(i)} (e^{-\theta^Tx} - 1 - e^{-\theta^Tx})}{1+e^{-\theta^Tx}} = \frac{-x_j^{(i)}}{1+e^{-\theta^Tx}}\]

In terms of the hypothesis, this is:

\[= -x_j^{(i)} \cdot h_{\theta}(x^{(i)})\]

6. Derivative of everything in the left half of the \(\sum\) expression

\[\frac{\partial}{\partial\theta_j} y^{(i)} \log(h_{\theta}(x^{(i)})) = y^{(i)} \cdot x_j^{(i)} \cdot h_{\theta}(x^{(i)}) \cdot e^{-\theta^Tx}\]

7. Derivative of everything in the right half of the \(\sum\) expression

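\[\frac{\partial}{\partial\theta_j} (1 - y^{(i)}) \log(1 - h_{\theta}(x^{(i)})) = -(1 - y^{(i)}) \cdot x_j^{(i)} \cdot h_{\theta}(x^{(i)})\]

This follows from \(\log(1 - h_{\theta}(x^{(i)})) = -\theta^Tx^{(i)} - \log(1 + e^{-\theta^Tx^{(i)}})\), whose derivative with respect to \(\theta_j\) is \(-x_j^{(i)} + \frac{e^{-\theta^Tx^{(i)}} \cdot x_j^{(i)}}{1 + e^{-\theta^Tx^{(i)}}} = -x_j^{(i)} \cdot h_{\theta}(x^{(i)})\).

8. Derivative of the whole cost function

Add 6. and 7. together, using \(h_{\theta}(x^{(i)}) \cdot e^{-\theta^Tx} = 1 - h_{\theta}(x^{(i)})\) in 6.

\[y^{(i)} x_j^{(i)} (1 - h_{\theta}(x^{(i)})) - (1 - y^{(i)}) x_j^{(i)} h_{\theta}(x^{(i)}) = (y^{(i)} - h_{\theta}(x^{(i)})) x_j^{(i)}\]

Multiply by the leading \(-\frac{1}{m}\) and sum over \(i\).

\[\frac{\partial}{\partial \theta_j} J(\theta) = -\frac{1}{m} \sum\limits_{i=1}^m (y^{(i)} - h_{\theta}(x^{(i)})) x_j^{(i)} = \frac{1}{m} \sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)}\]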