Partial Derivatives of Cost Function for Linear Regression

The following does not attempt to explain everything about a derivative; it is assumed you are at least familiar with the power rule and the chain rule.

First, here are the elements used in the hypothesis and the cost function:

\(n\) represents the number of features in the training set, \(m\) represents the number of rows (training examples), \(x^{(i)}\) is the feature vector of the \(i\)-th row, and \(y^{(i)}\) is its observed value.

The hypothesis is:

\[h_{\theta}(x) = \theta^Tx\]
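As a quick sketch in NumPy (assuming the usual convention that \(x_0 = 1\) so that \(\theta_0\) acts as the intercept; that convention is an assumption, not something stated above), the hypothesis could be written as:

```python
import numpy as np

def hypothesis(theta, x):
    """h_theta(x) = theta^T x for a single example.

    theta : 1-D array of parameters (theta_0 .. theta_n)
    x     : 1-D feature vector, with x[0] assumed to be 1 (intercept term)
    """
    return theta @ x
```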

This is the cost function in its simplest form (the superscript \(i\)'s refer to a particular training example, i.e., a particular row):

\[J(\theta) = \sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2\]

This simply gives the total of the squared errors. To get the mean of the squared errors:

\[J(\theta) = \frac{1}{m}\sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2\]

We can also multiply this by \(\frac{1}{2}\), purely as a matter of convenience, to make the derivative easier to calculate, as we'll see below. Note that the result is no longer the mean of the squared errors, but rather one-half of the mean. This does not matter, because we will only be comparing values of this quantity against each other, and scaling by a positive constant does not change which \(\theta\) gives the smallest value.

\[J(\theta) = \frac{1}{2m}\sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2\]
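Here is how that could be computed, again as a rough sketch, assuming the training rows are stacked into a matrix \(X\) whose \(i\)-th row is \(x^{(i)}\) (with a leading column of ones for the intercept):

```python
import numpy as np

def cost(theta, X, y):
    """One-half the mean squared error:
    J(theta) = 1/(2m) * sum_i (h_theta(x^(i)) - y^(i))^2
    """
    m = len(y)
    errors = X @ theta - y             # h_theta(x^(i)) - y^(i) for every row at once
    return (errors @ errors) / (2 * m)
```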

Now we can take the partial derivative for any one feature in \(x\).

\[\frac{\partial}{\partial\theta_j} \frac{1}{2m}\sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2 = \text{?}\]

First, the \(\frac{1}{2m}\) is outside of the \(\sum\), so it acts as a constant factor that passes through the derivative unchanged. That leaves the partial derivative of everything inside the \(\sum\). The quantity inside is squared, so the first step is to apply the chain rule, which effectively means:

\[\frac{\partial}{\partial\theta_j} (\text{stuff})^2 = 2\,(\text{stuff}) \times \frac{\partial}{\partial\theta_j} (\text{stuff})\]

So:

\[\frac{\partial}{\partial\theta_j} \frac{1}{2m}\sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2 = \frac{1}{2m}\sum\limits_{i=1}^m 2(h_{\theta}(x^{(i)}) - y^{(i)}) \times \frac{\partial}{\partial\theta_{j}} h_{\theta}(x^{(i)})\]
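If you want to convince yourself of the chain-rule step, here is a small SymPy spot-check on a single squared term (a one-feature stand-in for the expression above):

```python
import sympy as sp

theta, x, y = sp.symbols('theta x y')
term = (theta * x - y) ** 2                  # one squared-error term
derivative = sp.diff(term, theta)
# The difference with 2*(theta*x - y)*x simplifies to zero, confirming the step.
print(sp.simplify(derivative - 2 * (theta * x - y) * x))   # 0
```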

We know from above that the function \(h_{\theta}\) is:

\[h_{\theta}(x) = \theta^Tx\]

So:

\[\frac{\partial}{\partial\theta_j} h_{\theta}(x) = \frac{\partial}{\partial\theta_j} \theta^Tx = \frac{\partial}{\partial\theta_j} (\theta_0 x_0 + \theta_1 x_1 + \dots + \theta_j x_j + \dots + \theta_n x_n)\]

Only the \(\theta_j x_j\) term contains \(\theta_j\); every other term is a constant with respect to \(\theta_j\), so its partial derivative is zero. In the remaining term, \(x_j\) is a constant, and the derivative of \(\theta_j\) with respect to \(\theta_j\) is 1, so we are left with:

\[\frac{\partial}{\partial\theta_j} h_{\theta}(x) = x_j\]
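The same kind of spot-check works for this inner derivative. With, say, two features plus the intercept term, SymPy agrees that the partial with respect to \(\theta_1\) is \(x_1\):

```python
import sympy as sp

theta0, theta1, theta2 = sp.symbols('theta0 theta1 theta2')
x0, x1, x2 = sp.symbols('x0 x1 x2')

h = theta0 * x0 + theta1 * x1 + theta2 * x2   # theta^T x written out term by term
print(sp.diff(h, theta1))                      # -> x1
```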

Plug this back into the above equation and we have:

\[\frac{\partial}{\partial\theta_j} \frac{1}{2m}\sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2 = \frac{1}{2m}\sum\limits_{i=1}^m 2(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}\]

As the final step, the 2 from the \(\frac{1}{2m}\) outside of the \(\sum\) and the 2 inside the \(\sum\) cancel each other out, leaving:

\[\frac{\partial}{\partial\theta_j} J(\theta) = \frac{1}{m}\sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}\]
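Putting it all together, the partial derivatives for every \(j\) can be computed at once. The sketch below (assuming, as before, a design matrix \(X\) with a leading column of ones) uses the vectorized form \(\frac{1}{m}X^T(X\theta - y)\) and checks it against the summation written out row by row:

```python
import numpy as np

def gradient(theta, X, y):
    """dJ/dtheta_j for all j: (1/m) * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i),
    vectorised as (1/m) * X^T (X theta - y)."""
    m = len(y)
    return X.T @ (X @ theta - y) / m

# Sanity check against the summation form, one row at a time.
rng = np.random.default_rng(0)
X = np.c_[np.ones(5), rng.normal(size=(5, 2))]   # m = 5 rows: intercept + 2 features
y = rng.normal(size=5)
theta = rng.normal(size=3)

m = len(y)
loop_grad = np.zeros_like(theta)
for i in range(m):
    loop_grad += (X[i] @ theta - y[i]) * X[i]    # (h_theta(x^(i)) - y^(i)) * x^(i)
loop_grad /= m

print(np.allclose(loop_grad, gradient(theta, X, y)))   # True
```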