Least Squares Solutions to \(Ax = b\)

As always we are looking for solutions (\(x\)) to systems of the form \(Ax = b\). A Least Squares solution is a vector \(\hat{x}\) that:

  1. Solves \(Ax = b\) exactly if \(b\) is in \(col(A)\)

  2. Gives a best approximate solution if \(b\) is not in \(col(A)\): \(A\hat{x}\) is a vector in \(col(A)\) that is the closest to \(b\) out of all other \(Ax\) vectors.

There are many ways to think of a solution being the “best.” Least Squares defines “best” as getting \(Ax\) as “close” to \(b\) as possible out of possible solutions. In particular, “close” is defined as Euclidean distance. As we saw last lecture, this means we must find find \(\hat{x}\) to minimize \(\|(b-Ax)\|^2\), hence the name “Least Squares.”

If the columns of \(A\) are linearly independent then a unique least squares solution can be expressed as

\[\begin{aligned} \hat{x} = (A^TA)^{-1}A^Tb. \end{aligned}\]

The complexity (computational load) of this calculation grows rapidly with the size of \(A\) but provided we have enough computing power and \(A^TA\) is not near-singular, finding \(\hat{x}\) is very straightforward.

Least Squares Application: Regression Models

If we are using systems of equations to study real-world problems we are often interested in learning how a response variable \(y\) is related to a set of input variables \(x_1\),…, \(x_p\). If we have reason to believe that the underlying relationship between \(y\) and each \(x_i\) is linear; that is:

\[\begin{aligned} y = f(x_1, x_2, ..., x_p) = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_px_p + \epsilon \end{aligned}\]

then we can compute a regression model that estimates this linear relationship. Broadly speaking, Machine Learning in this context uses data collected from \(n\) subjects on each of these \(p + 1\) variables to construct a Linear Regression Model of the form

\[\begin{aligned} \hat{y} = \hat{f}(x_1, x_2, ..., x_p) = \hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2x_2 + ... + \hat{\beta}_px_p. \end{aligned}\]

If the underlying function \(f(x_1,x_2,..., x_p)\) that relates the variables is linear, then \(\hat{\beta}_0, ..., \hat{\beta}_p\) are our model’s estimates of \(\beta_0, ..., \beta_p\), and \(\hat{y}\) is the model’s estimate for the output value that results from input values for \(x_1, ..., x_p\).

Training a Regression Model

Suppose we have an \(n\) x \(p\) data matrix \(C\), where each row of \(C\) contains measurements of the \(x_i\) variables for each of the \(n\) subjects. In Lecture 14 we saw that \(\hat{\beta}_0, ..., \hat{\beta}_p\) can be found using least squares, where our linear system \(Ax = b\) is

\[\begin{aligned} b = \vec{y}, A = \Big[ \vec{1} |C\Big], \hat{x} = [\hat{\beta}_0, ..., \hat{\beta}_p]^T. \end{aligned}\]

If our goal is to simply use the data we have to estimate \(\beta_0, ..., \beta_p\) and minimize the Residual Sum of Squares (RSS), then we can simply use all the data we have and find the \(\hat{\beta}_i\) slopes for each \(x_i\) in the model. The resulting model will do the best job possible (“in the least squares sense”) of predicting outputs for inputs that came from data the model was trained on.

Model Assumptions: Differences Between Pure ML and Statistical Learning

There are generally quite a few conditions to check before fitting a regression model, most of which have to do with making sure a linear model is reasonable (see the pairs function in the Iris Example below), and independence between sample values (usually achieved by collecting the data using a random sampling technique). Machine Learning from a statistical perspective (often called Statistical Learning) emphasizes checking these modeling conditions because they lead to desirable statistical properties/results and generally lead to models that are more interpretable. Since this is a linear algebra course (compared to a statistics course) we are operating in a pure ML environment, so we will not focus as much on these statistical details: our only goal is to develop models that are powerful predictive tools, and to that end our focus is mainly on minimizing prediction error (rather than statistical significance or model interpretability).

Assessing Model Fit

The least squares solution minimizes \[\sum_{i=1}^{n} (y_i - \hat{y}_i)^2,\] which is commonly called the Residual Sum of Squares or RSS. The RSS is the total error in the model, so it makes sense to build the model with the goal of making it small. Using RSS to assess model when comparing competing models all using the same data set is fine (see examples in Lecture 14) but in ML applications we are typically trying to choose a best model from a large number of competing models, and RSS is not the best measure to compare models on since it’s only measuring fit of a particular model relative to the variables and data it uses. More commonly used measures of how good model fit is are:

  1. The Mean Squared Error is \(MSE = \frac{RSS}{n}\). It adjusts the RSS for sample size, making it an average of the squared error in the model.

  2. Resdiual Standard Error is \(RSE = \sqrt{\frac{1}{n-p-1}RSS}\). It scales the RSS by the sample size and number of variables. If we used \(\sqrt{\frac{1}{n}RSS}\) then we would have the Root-Mean-Square defined in Exercise 19 from the Midterm Exam. Dividing by \(n-p-1\) is commonly used instead of \(n\) in order to make RSE an unbiased estimate of \(\epsilon\) (think of this as a desirable statistical property).

  3. R2 measures model fit as \(\frac{TSS-RSS}{TSS}\), where \(TSS = \sum_{i=1}^{n} (y_i - \bar{y}_i)^2.\) Think of TSS as the total variation in the output variable; our model seeks to explain that variation in terms of the \(x_i\) inputs. RSS is the amount of variation that the model doesn’t explain. What would R2 be if the model was able explain all the variation in \(y\)?

  4. Adjusted R2 is an improvement over R2 that penalizes (or “adjusts”) R2 for the number of input variables the model uses:

    \[\begin{aligned} R^2_{adj} = 1 - \frac{(1-R^2)(n-1)}{n-p-1}. \end{aligned}\]

    Including more variables in a model will always increase R2 because the more dimensions gives more flexibility for the model to be close to our data values. Adjusted R2 penalizes this inflation in R2 by reducing it in proportion to the number of variables we use.

Example 1: The Iris Dataset

In Lab 3 we used the k-means algorithm to group similar flowers in the \(iris\) data set. Let’s explore Linear Regression by trying to predict one of the variables from the others.

iris[1:10,] # display the first 10 rows of the dataset 
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa

Note that there are 5 total variables, all of which are numerical except for Species, which is categorical (R calls it “factor” instead of “categorical).

class(iris$Petal.Length) 
## [1] "numeric"
class(iris$Species)
## [1] "factor"

In Lecture 16 we’ll explore classification models, which can be used to predict categorical variables. For today we’ll restrict ourselves to models that have a numerical output variable, although categorical input variables are allowed.

The pairs function graphs each variable in the data set against all other variables. It’s helpful to have an idea of how each variable’s values relates to each other before we use them in our model. Since our underlying model is linear we hope to find linear relationships within the data.

pairs(iris)

Let’s build a linear regression model to predict \(Petal.Length\) from the other variables.

lm_iris = lm(Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Width + Species, data = iris)
summary(lm_iris)
## 
## Call:
## lm(formula = Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Width + 
##     Species, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78396 -0.15708  0.00193  0.14730  0.65418 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.11099    0.26987  -4.117 6.45e-05 ***
## Sepal.Length       0.60801    0.05024  12.101  < 2e-16 ***
## Sepal.Width       -0.18052    0.08036  -2.246   0.0262 *  
## Petal.Width        0.60222    0.12144   4.959 1.97e-06 ***
## Speciesversicolor  1.46337    0.17345   8.437 3.14e-14 ***
## Speciesvirginica   1.97422    0.24480   8.065 2.60e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2627 on 144 degrees of freedom
## Multiple R-squared:  0.9786, Adjusted R-squared:  0.9778 
## F-statistic:  1317 on 5 and 144 DF,  p-value: < 2.2e-16
MSE_iris = mean(lm_iris$residuals^2)
MSE_iris
## [1] 0.06626455

The following code inputs the variable values from the first iris in the data set to predict \(Petal.Length\):

predict(lm_iris, data.frame(Sepal.Length = 5.1, Sepal.Width = 3.5, Petal.Width = 0.2, Species = "setosa"))
##        1 
## 1.478453

Example 2: A closer look at model selection for the Iris data set

This example will be done in class.

Training and Test Data

In practice, most machine learning applications seek to build a model that will make the most accurate possible predictions about the response \(y\) that will be associated with a new value of inputs. In that case, model performance on data that was used to train that model isn’t very interesting- we know the \(y\) response associated with the each of the inputs because we have that data in \(\vec{y}\) and \(C\). Further, we know mathematically that the model we constructed fits the data that was used to construct it well, good fit for the data we already have was the mathematical motivation that underlies the model in the first place.

A simple way to test how our model will perform on future data is to randomly partition the data we already have into two groups: a training data set, and a test data set. We train (build) the regression model using only the training data and get a sense of how well it fits by looking at MSE for the training data. Then we run the model on the test data set and see how well it does. To do this we enter each of the training data values in the model, then compare the model’s prediction of \(y\) to the model’s \(\hat{y}\). Finally, calculate the MSE for the test data. Since the test data was not used to train the model it won’t “know” it as well as the training data, thus the test MSE will be a more accurate representation of how well the model will perform on future (unseen) data. First, we randomly partition the data: 80% for the training, and the remaining 20% for the testing.

set.seed(3) # Makes this result reproducible 
train_indexes = sample(1:150, 120) # randomly pull 120 numbers from 1:150
iris_train = iris[train_indexes,]
iris_test = iris[-train_indexes,]

Train the model using training data subset:

lm_iris = lm(Petal.Length ~. + I(Sepal.Width^2), data = iris_train)

Find the MSE for the training data using the mean function:

mean((iris$Petal.Length[train_indexes] - predict(lm_iris, iris_train))^2)
## [1] 0.06110418

See how well the model does on new data by using it to predict \(Petal.Length\) and find the MSE for the test data:

mean((iris$Petal.Length[-train_indexes] - predict(lm_iris, iris_test))^2)
## [1] 0.07966783

Notice that the MSE for the test data is higher than the MSE for the training data. This is not surprising since the model was trained on the training data- it fits that data well so it does a good job of predicting it. The test MSE is a more realistic estimate of how the model would perform if it were predicting \(Petal.Length\) from new data.

Example 2: The Babies Dataset

Read about this data set here: https://www.openintro.org/data/index.php?data=babies.

babies <- read.table("babies.csv", header = TRUE, sep = ",", row.names = "case") # import the dataset
babies = babies[complete.cases(babies),] # remove incomplete data
pairs(babies)

set.seed(1) # makes this result reproducible
train_indexes_b = sample(1:length(babies[,1]), 0.8*length(babies[,1])) # randomly pull 80% of the indexes from the dataset  
babies_train = babies[train_indexes_b,]
babies_test = babies[-train_indexes_b,]

Train the model on the training data subset:

lm_babies = lm(bwt ~. , data = babies_train) 
summary(lm_babies)
## 
## Call:
## lm(formula = bwt ~ ., data = babies_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -56.953  -9.905  -0.428   9.142  50.740 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -81.308260  16.170698  -5.028 5.94e-07 ***
## gestation     0.480474   0.034256  14.026  < 2e-16 ***
## parity       -3.670333   1.265397  -2.901  0.00381 ** 
## age          -0.006929   0.095029  -0.073  0.94189    
## height        0.996741   0.228034   4.371 1.38e-05 ***
## weight        0.056827   0.028122   2.021  0.04359 *  
## smoke        -8.604407   1.061008  -8.110 1.59e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.72 on 932 degrees of freedom
## Multiple R-squared:  0.2702, Adjusted R-squared:  0.2655 
## F-statistic: 57.52 on 6 and 932 DF,  p-value: < 2.2e-16

Find the MSE for the training data using the mean function:

mean((babies$bwt[train_indexes_b] - predict(lm_babies, babies_train))^2) 
## [1] 245.2302

See how well the model did by using it to predict \(bwt\) and find the MSE for the test data:

mean((babies$bwt[-train_indexes_b] - predict(lm_babies, babies_test))^2)
## [1] 266.8251