Partial Derivatives of Cost Function for Linear Regression

The following does not attempt to explain everything about a derivative; it is assumed you are at least familiar with the power rule and the chain rule.

First, here are the elements used in the hypothesis and the cost function:

\(n\) represents the number of features in the training set, \(m\) represents the number of rows (training examples), \(x^{(i)}\) is the feature vector of the \(i\)-th row, and \(y^{(i)}\) is its observed value.

The hypothesis is:

\[h_{\theta}(x) = \theta^Tx\]
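As a quick sketch in NumPy (assuming the usual convention that \(x_0 = 1\) so that \(\theta_0\) acts as the intercept; that convention is an assumption, not something stated above), the hypothesis could be written as:

```python
import numpy as np

def hypothesis(theta, x):
    """h_theta(x) = theta^T x for a single example.

    theta : 1-D array of parameters (theta_0 .. theta_n)
    x     : 1-D feature vector, with x[0] assumed to be 1 (intercept term)
    """
    return theta @ x
```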

This is the cost function in its simplest form (the superscript \(i\)'s refer to a particular training example, i.e., a particular row):

\[J(\theta) = \sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2\]

This simply gives the total of the squared errors. To get the mean of the squared errors:

\[J(\theta) = \frac{1}{m}\sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2\]

We can also multiply this by \(\frac{1}{2}\), purely as a matter of convenience, to make the derivative easier to calculate, as we'll see below. Note that the result is no longer the mean of the squared errors, but rather one-half of the mean. This does not matter, because we will only be comparing values of this quantity against each other, and scaling by a positive constant does not change which \(\theta\) gives the smallest value.

\[J(\theta) = \frac{1}{2m}\sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2\]
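Here is how that could be computed, again as a rough sketch, assuming the training rows are stacked into a matrix \(X\) whose \(i\)-th row is \(x^{(i)}\) (with a leading column of ones for the intercept):

```python
import numpy as np

def cost(theta, X, y):
    """One-half the mean squared error:
    J(theta) = 1/(2m) * sum_i (h_theta(x^(i)) - y^(i))^2
    """
    m = len(y)
    errors = X @ theta - y             # h_theta(x^(i)) - y^(i) for every row at once
    return (errors @ errors) / (2 * m)
```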

Now we can take the partial derivative for any one feature in \(x\).

\[\frac{\partial}{\partial\theta_j} \frac{1}{2m}\sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2 = \text{?}\]

First, the \(\frac{1}{2m}\) is outside of the \(\sum\), so it acts as a constant factor that passes through the derivative unchanged. That leaves the partial derivative of everything inside the \(\sum\). The quantity inside is squared, so the first step is to apply the chain rule, which effectively means:

\[\frac{\partial}{\partial\theta_j} (\text{stuff})^2 = 2\,(\text{stuff}) \times \frac{\partial}{\partial\theta_j} (\text{stuff})\]

So:

\[\frac{\partial}{\partial\theta_j} \frac{1}{2m}\sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2 = \frac{1}{2m}\sum\limits_{i=1}^m 2(h_{\theta}(x^{(i)}) - y^{(i)}) \times \frac{\partial}{\partial\theta_{j}} h_{\theta}(x^{(i)})\]
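If you want to convince yourself of the chain-rule step, here is a small SymPy spot-check on a single squared term (a one-feature stand-in for the expression above):

```python
import sympy as sp

theta, x, y = sp.symbols('theta x y')
term = (theta * x - y) ** 2                  # one squared-error term
derivative = sp.diff(term, theta)
# The difference with 2*(theta*x - y)*x simplifies to zero, confirming the step.
print(sp.simplify(derivative - 2 * (theta * x - y) * x))   # 0
```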

We know from above that the function \(h_{\theta}\) is:

\[h_{\theta}(x) = \theta^Tx\]

So:

\[\frac{\partial}{\partial\theta_j} h_{\theta}(x) = \frac{\partial}{\partial\theta_j} \theta^Tx = \frac{\partial}{\partial\theta_j} (\theta_0 x_0 + \theta_1 x_1 + \dots + \theta_j x_j + \dots + \theta_n x_n)\]

Only the \(\theta_j x_j\) term contains \(\theta_j\); every other term is a constant with respect to \(\theta_j\), so its partial derivative is zero. In the remaining term, \(x_j\) is a constant, and the derivative of \(\theta_j\) with respect to \(\theta_j\) is 1, so we are left with:

\[\frac{\partial}{\partial\theta_j} h_{\theta}(x) = x_j\]
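The same kind of spot-check works for this inner derivative. With, say, two features plus the intercept term, SymPy agrees that the partial with respect to \(\theta_1\) is \(x_1\):

```python
import sympy as sp

theta0, theta1, theta2 = sp.symbols('theta0 theta1 theta2')
x0, x1, x2 = sp.symbols('x0 x1 x2')

h = theta0 * x0 + theta1 * x1 + theta2 * x2   # theta^T x written out term by term
print(sp.diff(h, theta1))                      # -> x1
```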

Plug this back into the above equation and we have:

\[\frac{\partial}{\partial\theta_j} \frac{1}{2m}\sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2 = \frac{1}{2m}\sum\limits_{i=1}^m 2(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}\]

As the final step, the 2 from the \(\frac{1}{2m}\) outside of the \(\sum\) and the 2 inside the \(\sum\) cancel each other out, leaving:

\[\frac{\partial}{\partial\theta_j} J(\theta) = \frac{1}{m}\sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}\]
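Putting it all together, the partial derivatives for every \(j\) can be computed at once. The sketch below (assuming, as before, a design matrix \(X\) with a leading column of ones) uses the vectorized form \(\frac{1}{m}X^T(X\theta - y)\) and checks it against the summation written out row by row:

```python
import numpy as np

def gradient(theta, X, y):
    """dJ/dtheta_j for all j: (1/m) * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i),
    vectorised as (1/m) * X^T (X theta - y)."""
    m = len(y)
    return X.T @ (X @ theta - y) / m

# Sanity check against the summation form, one row at a time.
rng = np.random.default_rng(0)
X = np.c_[np.ones(5), rng.normal(size=(5, 2))]   # m = 5 rows: intercept + 2 features
y = rng.normal(size=5)
theta = rng.normal(size=3)

m = len(y)
loop_grad = np.zeros_like(theta)
for i in range(m):
    loop_grad += (X[i] @ theta - y[i]) * X[i]    # (h_theta(x^(i)) - y^(i)) * x^(i)
loop_grad /= m

print(np.allclose(loop_grad, gradient(theta, X, y)))   # True
```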