We now return to our original example of study hours and grades. If you inspect the population plot in section 2.2, our assumption of a linear relationship between \(x\) and \(y\) does not seem quite right. As \(x\) increases further, \(y\) eventually saturates and does not increase any longer. Is there some function other than the linear one that would perhaps capture this better? Recalling our discussion of different function shapes, you may remember that the logarithm grows very fast in the beginning but then its growth slows down considerably. How can we use this? We will resort to feature transformation.
Feature transformation is an approach in which we manually transform the features (the inputs) before using them in the standard linear regression. In our case, we will try the log transformation by creating a new set of inputs as \(\tilde{x}_i = \log(x_i), \ i=1, \ldots, n\).
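To make this concrete, here is a minimal sketch in Python (the arrays `hours` and `grades` are hypothetical stand-ins for the data from section 2.2): we log-transform the inputs and then run an ordinary simple linear regression on the transformed values.

```python
import numpy as np

# hypothetical data standing in for the study-hours example
hours = np.array([1, 2, 4, 8, 16, 32, 64], dtype=float)
grades = np.array([2.1, 3.0, 3.9, 4.8, 5.6, 6.3, 7.1])

# feature transformation: x_tilde = log(x)
x_tilde = np.log(hours)

# ordinary least squares on the transformed feature
# (design matrix with a column of ones for the intercept)
X = np.column_stack([np.ones_like(x_tilde), x_tilde])
w, *_ = np.linalg.lstsq(X, grades, rcond=None)

print(w)  # [w0, w1]: predictions are y_hat = w0 + w1 * log(x)
```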
Let’s explore one more, perhaps more obvious example. Here clearly fitting a line to the original data does not do a great job. However, applying first the feature transformation \(\tilde{x} = x^2\) and then performing the linear regression \[y \approx w_0 + w_1 \tilde{x} = w_0 + w_1 x^2\] seems to work rather well!
In the examples above we followed the same strategy. Inspecting the data, we suspected that the hypothesis of a linear relationship between the inputs \(x \in \mathcal{X}\) and \(y \in \mathcal{Y}\) is not quite satisfactory. Instead of working with the original inputs \(x \in \mathcal{X}\) and trying to figure out how to learn a complex nonlinear model, we first applied a suitable nonlinear feature transformation to the data \(\phi: \mathcal{X} \to \tilde{\mathcal{X}}, \ \tilde{x} = \phi(x)\) and performed the linear regression in this transformed feature space \(\tilde{\mathcal{X}}\).
\[y \approx w_0 + w_1 \phi(x)\]
This neat trick allowed us to get from the very restrictive world of linear relationships to highly nonlinear ones while keeping the math and machinery simple!
In the examples above we applied a single nonlinear transformation to a single input dimension. It is, however, possible to apply multiple transformations to the same variable, creating a vector of transformed inputs \(\tilde{\mathbf{x}} = (\phi_1(x), \phi_2(x), \ldots, \phi_d(x))^T\). Instead of simple linear regression we then need to work with multivariate regression with the parameter vector \(\mathbf{w}\) of length \(d+1\).
The most common transformations are powers \((x^2, x^3, \ldots)\). This is sometimes referred to as polynomial regression because the hypothesis is a polynomial in the original feature space \(\mathcal{X}\) (while still a simple linear regression in the transformed feature space \(\tilde{\mathcal{X}}\)).
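As an illustrative sketch (on hypothetical data), a cubic polynomial regression is nothing more than a linear regression on the stacked power features \(x, x^2, x^3\):

```python
import numpy as np

# hypothetical one-dimensional data
x = np.linspace(-3, 3, 50)
y = 0.5 * x**3 - 2.0 * x + np.random.default_rng(0).normal(scale=1.0, size=x.shape)

# transformed inputs: phi(x) = (1, x, x^2, x^3)
Phi = np.column_stack([np.ones_like(x), x, x**2, x**3])

# linear regression in the transformed feature space
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

y_hat = Phi @ w  # a polynomial in x, but linear in the parameters w
```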
When the original feature space \(\mathcal{X} \subseteq \mathbb{R}^d\) is multi-dimensional, we can apply various nonlinear transformations (such as powers) to each dimension \(x_i\). These may or may not be the same for each dimension. Furthermore, we can account for interactions between dimensions by working with products such as \(x_i x_j\). The regression model is then \[y \approx \phi(\mathbf{x})^T \mathbf{w} \enspace ,\] where \(\phi: \mathcal{X} \to \tilde{\mathcal{X}}\) is the vector nonlinear transformation function.
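For instance, one possible (purely hypothetical) feature map for a two-dimensional input \(\mathbf{x} = (x_1, x_2)\) could combine per-dimension transformations with an interaction term:

```python
import numpy as np

def phi(x):
    """Hypothetical feature map for a 2-d input x = (x1, x2):
    a constant for the intercept, the original dimensions,
    a square, a log, and an interaction term."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, np.log(x2), x1 * x2])

# a single transformed input; the model is y ~ phi(x) @ w
x_tilde = phi(np.array([2.0, 3.0]))
```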
In fact, feature transformation is such a common approach that you can often see simply \[y \approx \mathbf{\phi}^T \mathbf{w} \enspace ,\] where everybody understands that this is a linear regression problem after first applying some nonlinear transformation to the original data. Everything we discussed for the linear regression over the original input vectors \(\mathbf{x}\) and matrix \(\mathbf{X}\) can be rewritten using the transformed feature vectors \(\phi\) and matrix \(\Phi\).
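Concretely, the least-squares machinery carries over unchanged once \(\mathbf{X}\) is replaced by \(\Phi\). A minimal self-contained sketch (the raw data, targets, and the particular transformations are all hypothetical):

```python
import numpy as np

# hypothetical raw inputs (n instances, 2 dimensions) and targets
rng = np.random.default_rng(1)
X_raw = rng.uniform(0.5, 5.0, size=(100, 2))
y = 1.0 + X_raw[:, 0] ** 2 - 2.0 * np.log(X_raw[:, 1]) + rng.normal(scale=0.1, size=100)

# transformed data matrix Phi: one row phi(x)^T per instance
Phi = np.column_stack([
    np.ones(len(X_raw)),   # constant for the intercept
    X_raw[:, 0] ** 2,      # squared first dimension
    np.log(X_raw[:, 1]),   # log of the second dimension
])

# the same normal equations as with the original X, just written with Phi
w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
```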
There are two major take-home messages from this:
There is one question we haven’t yet answered. How do you know which feature transformation you should apply? This indeed is a very critical question and the answer is anything but obvious. The process of finding suitable feature transformations is called feature engineering. It is usually very much based on domain knowledge and experience, and if done right it can bring rather impressive improvements to the basic ML models. Two classical examples come from computer vision and natural language processing.
In computer vision you often work with images. These can initially be represented either as \(2\)-d matrices \(x \in [0,1]^{h \times w}\), where each element represents the grayscale intensity of a pixel (with \(h\) and \(w\) the image height and width), or as \(3\)-d arrays \(x \in [0,1]^{h \times w \times 3}\) with the intensities of the Red, Green and Blue channels. Computer vision experts discovered long ago that what typically matters in images (for tasks such as classification) are the important parts of the image such as edges, corners, blobs, etc. They have therefore developed a multitude of feature detectors, of which probably the most successful is SIFT (Scale-Invariant Feature Transform). It performs a series of transformations on the original pixel values of the image to find its important parts and encodes them into a transformed feature space \(\tilde{x} \in \tilde{\mathcal{X}}\).
In natural language processing the original input data \(x \in \mathcal{X}\) are often simply pieces of text (one for each instance). In the bag-of-words (BOW) approach we treat the text as a set of words, disregarding their order and grammar. We can then create features by counting the number of occurrences of each word in each of the texts.
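A tiny sketch of the bag-of-words idea (on made-up example sentences), counting word occurrences with plain Python:

```python
from collections import Counter

# hypothetical text instances
texts = ["the cat sat on the mat", "the dog ate the cat food"]

# vocabulary: all words seen across the texts
vocabulary = sorted(set(word for text in texts for word in text.split()))

# bag-of-words features: one count vector per text
bow = [[Counter(text.split())[word] for word in vocabulary] for text in texts]

print(vocabulary)
print(bow)  # each row is the transformed feature vector of one text
```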
One of the most common feature transformations, which is usually a good idea to apply to any numerical data you work with, is the so-called normalization. Some people call it standardization, though these are not exactly the same and there is a little bit of confusion about the two terms. It is thus a good idea to always make clear the mathematical formulation of the transformations you use rather than relying simply on the keywords.
Normalization is in fact the composition of two transformations: centering and rescaling.
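A minimal sketch of this composition on a hypothetical numerical feature, assuming here that the rescaling step divides by the standard deviation:

```python
import numpy as np

# hypothetical numerical feature
x = np.array([12.0, 15.0, 9.0, 21.0, 18.0])

x_centered = x - x.mean()            # centering: subtract the mean
x_normalized = x_centered / x.std()  # rescaling: divide by the standard deviation

print(x_normalized.mean(), x_normalized.std())  # approximately 0 and 1
```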
The goal of centering is to move the average of the data to zero. One of the major reasons for doing this is to get rid of the intercept term \(w_0\) in the linear regression and, with it, the need to expand the inputs \(\mathbf{X}\) by the column of ones.
The centering transformation removes the mean from the data and is defined as follows (\(c\) stands for center) \[x^{(c)} = \phi_{c}(x) = x - \bar{x} = x - \frac{1}{n} \sum_i^n x_i\] It is rather easy to see that this transformed variable has zero mean.
\[\bar{x}^{(c)} = \frac{1}{n} \sum_i^n x^{(c)}_i = \frac{1}{n} \sum_i^n \big( x_i - \frac{\sum_j^n x_j}{n} \big) = \frac{1}{n} \sum_i^n x_i - \frac{1}{n} \sum_i^n \frac{\sum_j^n x_j}{n} = \frac{1}{n} \sum_i^n x_i - \frac{1}{n} \frac{n \sum_j^n x_j}{n} = 0 \] We usually apply the centering transformation to the outputs as well \[y^{(c)} = y - \bar{y} = y - \frac{1}{n} \sum_i^n y_i \enspace .\]
Revisiting the minimising solution for simple linear regression from section 2.3.2 \[\widehat{w}_0 = \bar{y} - \widehat{w}_1 \bar{x} \enspace ,\] we see that when working with the centred data we always get \[\widehat{w}_0 = \bar{y}^{(c)} - \widehat{w}_1 \bar{x}^{(c)} = 0 - \widehat{w}_1 \cdot 0 = 0 \enspace .\] We can thus drop the parameter \(w_0\) from the regression altogether (it is always zero) and there is no need to augment our data matrix by the column of ones.
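A quick numerical check of this on hypothetical data: after centring both \(x\) and \(y\), the slope can be fitted without any column of ones, and the implied intercept comes out as (numerically) zero.

```python
import numpy as np

# hypothetical data
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=50)
y = 3.0 + 1.5 * x + rng.normal(scale=0.5, size=50)

# centre inputs and outputs
x_c = x - x.mean()
y_c = y - y.mean()

# no column of ones needed: fit the slope alone on the centred data
w1 = (x_c @ y_c) / (x_c @ x_c)

# the intercept implied by the centred data is zero (up to floating point error)
w0 = y_c.mean() - w1 * x_c.mean()
print(w0, w1)  # w0 close to 0, w1 close to 1.5
```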