Introduction

Multiple linear regression is a widely used tool in science, but psychologists are rarely taught how its coefficients are derived once we go beyond the basic case of a simple linear regression. This tutorial is designed to show R users in the field of psychology how to reverse engineer the lm function and see how it finds the coefficients under the hood. In doing so, one can get a more complete understanding of how the machinery of ordinary least squares works. This tutorial assumes some very basic familiarity with running R, but the instructions do not require advanced programming knowledge (save for a more involved bonus section at the end).

Typical Regression Fitting in R

Before starting on the linear algebra, we can first look at the easy way of doing this. You may be familiar with the lm function in R, which by default estimates a linear model using OLS. The standard way is to supply a formula with your dependent variable on the left of the ~ and your predictors on the right of the same ~, using the + operator to add main effects, followed by the data you are fitting. As an example, I will first fit a linear regression in R from the well-known iris data, which has measures of flower dimensions such as the widths and lengths of petals and sepals. We can specify a simple model here with one intercept and two slopes as follows:

\[ \text{Petal Length} = \beta_0 + \beta_1 \text{Petal Width} + \beta_2 \text{Sepal Width} + \epsilon \] First the model fit, which we save as an object called fit. I spell out formula and data here explicitly to show you what each part of the function is doing, but the argument names can technically be omitted:

fit <- lm(
  formula = Petal.Length ~ Petal.Width + Sepal.Width,
  data = iris 
  )

Running coef(fit), we get the following intercept and slopes:

## (Intercept) Petal.Width Sepal.Width 
##   2.2581635   2.1556105  -0.3550346

Which in linear equation notation would be:

\[ \text{Petal Length} = 2.258 + (2.156 \times \text{Petal Width}) + (-0.355 \times \text{Sepal Width}) + \epsilon \]

Based on this, we have an intercept of \(2.258\) (the expected petal length when both predictors equal zero), a slope showing that each unit increase in petal width generally increases petal length by about \(2.156\), and a slope showing that each unit increase in sepal width slightly decreases petal length (by about \(0.355\)). We will come back to this at the end, but we have saved this regression output here for now to compare against our hand-calculated version later.
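To see the fitted equation in action, we can plug some example values into it by hand and compare against predict(). This is just an illustrative sketch; the petal width of 1.5 and sepal width of 3.0 below are made-up values, and cf is simply a name I give the saved coefficients.

#### Plug Values Into the Fitted Equation ####
cf <- coef(fit)

# prediction by hand for petal width = 1.5 and sepal width = 3.0
cf["(Intercept)"] + cf["Petal.Width"] * 1.5 + cf["Sepal.Width"] * 3.0

# same prediction from the fitted model
predict(fit, newdata = data.frame(Petal.Width = 1.5, Sepal.Width = 3.0))

Both lines should return the same predicted petal length, since predict() is doing exactly the arithmetic written out in the first line.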

Some Basic Linear Algebra

To understand what follows, we first need to cover some basic linear algebra (you can skip to “How to Get the Coefficients” if you know this already). For starters, a vector is an \(n \times 1\) column of data, such as what is shown below:

\[ x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \]

where each \(x_i\) is a raw data point. If we run the following code, we get a vector of \(n = 6\) data points for sepal widths from our iris data.

head(iris$Sepal.Width)
## [1] 3.5 3.0 3.2 3.1 3.6 3.9

Arranged in vector notation as shown below:

\[ \text{Sepal Width} = \begin{bmatrix} 3.5 \\ 3.0 \\ 3.2 \\ 3.1 \\ 3.6 \\ 3.9 \end{bmatrix} \] We can also just create our own by using the c() function to “concatenate” (aka combine) a string of numbers together into a single vector:

x <- c(0,3,4,1,8)
x
## [1] 0 3 4 1 8
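If you want the vector to print the way the column notation above looks, one option is to wrap it in matrix() with a single column. This is purely a display choice; the underlying data are the same.

#### Display a Vector as an n x 1 Column ####
matrix(x, ncol = 1)
##      [,1]
## [1,]    0
## [2,]    3
## [3,]    4
## [4,]    1
## [5,]    8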

A matrix on the other hand is an \(n \times p\) arrangement of \(n\) rows and \(p\) columns of data, such as the matrix \(A\) shown below (here I made up some data to show what it would look like in practice):

\[ A = \begin{bmatrix} 45 & 15 & 8 \\ 27 & 43 & 32 \\ 0 & 35 & 28 \end{bmatrix} \]

We can create this matrix ourselves in R with the following code. We first enter a string of numbers for each row of the matrix, then specify the number of rows (or columns), and then use the byrow=T argument to enter the numbers by row so it reads the same way we entered it.

#### Create Matrix ####
A <- matrix(
  c(45,15,8, 
    27,43,32, 
    0,35,28),
  nrow=3,
  byrow = T
)

#### Print Matrix ####
A
##      [,1] [,2] [,3]
## [1,]   45   15    8
## [2,]   27   43   32
## [3,]    0   35   28

To pull a specific cell from the matrix, you would use the code A[n,p], where n is the row and p is the column. For example, we can get the cell located in Row 1, Column 3 by using this code:

A[1,3]
## [1] 8

To pull an entire row, we would instead use A[n,] and for an entire column we would use A[,p]. Examples are shown for both below:

A[2,]
## [1] 27 43 32
A[,2]
## [1] 15 43 35

We would call this a square matrix because the number of rows and columns are equal, in that it is a \(3 \times 3\) matrix. Its transpose, \(A^T\), is simply a reflection of the matrix across its diagonal, so that the row and column indices are swapped, such as below:

\[ A = \begin{bmatrix} 45 & 15 & 8 \\ 27 & 43 & 32 \\ 0 & 35 & 28 \end{bmatrix} \]

\[ A^T = \begin{bmatrix} 45 & 27 & 0 \\ 15 & 43 & 35 \\ 8 & 32 & 28 \end{bmatrix} \]

We can get \(A^T\) by using the t() function.

t(A)
##      [,1] [,2] [,3]
## [1,]   45   27    0
## [2,]   15   43   35
## [3,]    8   32   28

A symmetric matrix would look something like the one below, where each off-diagonal element equals its mirror image across the diagonal (the element in row \(i\), column \(j\) equals the element in row \(j\), column \(i\)). Importantly, if a matrix is symmetric, then \(A = A^T\), so taking a symmetric matrix’s transpose gives back the original matrix.

\[ \begin{bmatrix} 5 & 1 & 8 \\ 1 & 4 & 3 \\ 8 & 3 & 2 \end{bmatrix} \]
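We can check this property in R. Below, S is just a name I give to the symmetric matrix shown above; both checks should return TRUE.

#### A Symmetric Matrix Equals Its Transpose ####
S <- matrix(
  c(5,1,8,
    1,4,3,
    8,3,2),
  nrow = 3,
  byrow = T
)

all(S == t(S)) # TRUE
isSymmetric(S) # TRUE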

If the matrix looked something like this:

\[ \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \]

We would call this an identity matrix because its diagonal elements are all equal to \(1\) and its off-diagonal elements are equal to \(0\), making it both a square and symmetric matrix (the identity matrix has some important functions, but we won’t go into much detail about it here). We can create an identity matrix with the diag function by entering the number of diagonal \(1\) values we want. For a \(3 \times 3\) identity matrix, we would just run this:

diag(3)
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1
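One of those “important functions” is easy to demonstrate with the \(A\) matrix we already created: multiplying any matrix by a conformable identity matrix just returns the original matrix.

#### Multiplying by the Identity Returns the Original Matrix ####
A %*% diag(3)
all(A %*% diag(3) == A) # TRUE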

The Inverse

The last key thing we need to learn before moving on is the inverse. The inverse of a matrix, \(A^{-1}\), is one of the crucial elements of calculating the coefficients later. Essentially, if \(A\) is our matrix, then multiplying \(A^{-1}\) with \(A\) gives us back the identity matrix. We can actually solve for the coefficients in a multitude of ways, including QR decomposition, Cholesky decomposition, and other methods, which are generally preferable to using the inverse directly. For illustration, I start with a very basic method for solving a set of linear equations by hand, and later show how to do it much faster in R. This is just to show you, again, what is loosely going on under the hood when you are estimating OLS regressions.
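As a quick illustration of that definition, here is a small invertible matrix (I call it M; it is the same \(2 \times 2\) coefficient matrix that appears in the example just below) multiplied by its inverse.

#### The Inverse Times the Original Returns the Identity ####
M <- matrix(
  c(2,4,
    4,6),
  nrow = 2,
  byrow = T
)

round(solve(M) %*% M) # 2 x 2 identity matrix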

Suppose we have two basic linear equations, where the variables are represented as \(x_1\) and \(x_2\). The numbers next to them are the coefficients, which are numbers that multiply with each variable. The right hand side is what the left hand side equals, the constants. As an example, the first equation reads as “\(2\) of \(x_1\) plus \(4\) of \(x_2\) equals 10.” The second equation has a similar interpretation.

\[ 2x_1 + 4x_2 = 10 \\ 4x_1 + 6x_2 = 12 \] It would be helpful to know what \(x_1\) and \(x_2\) equal so we can solve these equations. To do so, we have to reorganize the equations a bit. The coefficients on the left-hand side can be collected into a matrix \(A\) as follows:

\[ A = \begin{bmatrix} 2 & 4 \\ 4 & 6 \end{bmatrix} \]

The constants on the right make up their own vector \(b\):

\[ b = \begin{bmatrix} 10 \\ 12 \end{bmatrix} \] We merge them into an augmented matrix, which puts the coefficients on the left of a bar and the constants on the right.

\[ \begin{bmatrix} 2 & 4 & \bigm| & 10 \\ 4 & 6 &\bigm| & 12 \end{bmatrix} \] or simply:

\[ \begin{bmatrix} 2 & 4 & 10 \\ 4 & 6 & 12 \end{bmatrix} \]

One of the most basic ways to solve these two equations is via Gauss-Jordan elimination. It may be helpful to see a visual example of this method here as you follow along. It is okay if by the end of this section you don’t “get” this method. The core idea is to simply show how it works, and then graduate to simply using a convenient function to do this for us. In any case, the matrix is transformed using three basic principles of elimination:

  1. Replacement. This is what we do most of the time, where we replace a row with another by adding or subtracting it by another row.
  2. Interchange. This involves swapping the positions of two rows to make the arithmetic easier before or after replacement.
  3. Scaling. This simply involves multiplying a row by a constant value to make the arithmetic easier for the same reasons.

The basic idea is this: we want all of the diagonal elements in the \(A\) coefficient matrix to equal 1 and all of the off-diagonal elements to equal 0 (in other words: an identity matrix). The \(b\) constants can take on any value so long as we satisfy this constraint. Since we need the off-diagonals to equal zero, we first use the scaling method to multiply the first row by \(2\). That way we can subtract the first row from the second to make the leading element of the second row zero.

\[ \begin{bmatrix} 2 & 4 & 10 \\ 4 & 6 & 12 \end{bmatrix} -> \begin{bmatrix} 4 & 8 & 20 \\ 4 & 6 & 12 \end{bmatrix} \] We then subtract the first row from the second:

\[ \begin{bmatrix} 4 & 8 & 20 \\ 4 & 6 & 12 \end{bmatrix} -> \begin{bmatrix} 4 & 8 & 20 \\ 0 & -2 & -8 \end{bmatrix} \] We then divide the first row by \(4\) so the first diagonal element equals \(1\).

\[ \begin{bmatrix} 4 & 8 & 20 \\ 0 & -2 & -8 \end{bmatrix} -> \begin{bmatrix} 1 & 2 & 5 \\ 0 & -2 & -8 \end{bmatrix} \] We do the same to the second row, dividing it by \(-2\) to get a \(1\) in its diagonal element:

\[ \begin{bmatrix} 1 & 2 & 5 \\ 0 & -2 & -8 \end{bmatrix} -> \begin{bmatrix} 1 & 2 & 5 \\ 0 & 1 & 4 \end{bmatrix} \] Now that the number below the diagonal in the \(A\) matrix is zero, we need to do the same above the diagonal, where the \(2\) is located. We multiply the second row by \(2\) so we can subtract it from the first row and set the \(2\) to zero.

\[ \begin{bmatrix} 1 & 2 & 5 \\ 0 & 1 & 4 \end{bmatrix} -> \begin{bmatrix} 1 & 2 & 5 \\ 0 & 2 & 8 \end{bmatrix} \] We then subtract the second row from the first row:

\[ \begin{bmatrix} 1 & 2 & 5 \\ 0 & 2 & 8 \end{bmatrix} -> \begin{bmatrix} 1 & 0 & -3 \\ 0 & 2 & 8 \end{bmatrix} \] Finally to get our last \(1\) in the diagonal, we divide the second row by \(2\):

\[ \begin{bmatrix} 1 & 0 & -3 \\ 0 & 2 & 8 \end{bmatrix} -> \begin{bmatrix} 1 & 0 & -3 \\ 0 & 1 & 4 \end{bmatrix} \] Now our \(A\) matrix is an identity matrix and the \(b\) vector holds our solutions. Recall that each column corresponds to a variable, either \(x_1\) or \(x_2\). When an entry in a variable’s column is zero, that variable drops out of the corresponding equation; when it is one, the equation reduces to that variable alone. Thus our solutions are:

\[ x_1 = -3 \\ x_2 = 4 \]

Our original set of equations was this:

\[ 2x_1 + 4x_2 = 10 \\ 4x_1 + 6x_2 = 12 \] If we now plug our solutions into these two equations, both equations hold:

\[ 2(-3) + 4(4) = 10 \\ 4(-3) + 6(4) = 12 \]
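We can confirm this quickly in R with ordinary arithmetic (x1 and x2 below just hold the two solutions we derived):

#### Plug the Solutions Back In ####
x1 <- -3
x2 <- 4

2*x1 + 4*x2 # 10
4*x1 + 6*x2 # 12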

Using the Much Simpler Method in R

This can scale up by a lot depending on how many linear equations we have. Suppose we have another set of three linear equations like the ones below.

\[ 2 x_1 + x_2 + 4 x_3 = 10\\ 4 x_1 + 7 x_2 + 5 x_3 = 6\\ 8 x_1 + 9 x_2 + 3x_3 = 5 \]

This will take longer to solve. It would be great if we didn’t have to spend forever trying to solve these equations, particularly if they contain decimal or fractional values which make the math slower to perform. The Gauss-Jordan method is also not ideal for modern computation, as other methods tend to be more stable (though I show how you can do it in R at the end if you are curious).

Instead we can let R do all the work using the solve function, which speeds up this process by a lot. First, the left hand side which contains the coefficients, \(A\), is saved below in R:

A <- matrix(
  c(2,1,4,
    4,7,5,
    8,9,3),
  nrow = 3,
  byrow = T
  )

The right hand side of the equation, the constant vector \(b\), is similarly saved in R. We save it as a one-column matrix so it prints like the column vector above, but it is still effectively a vector since it only has one column (a plain vector would also work with solve).

b <- matrix(
  c(10,6,5),
  nrow=3,
  ncol=1
  )
b
##      [,1]
## [1,]   10
## [2,]    6
## [3,]    5

Our \(A\) and \(b\) form the following augmented matrix:

\[ \begin{bmatrix} 2 & 1 & 4 &\bigm| & 10 \\ 4 & 7 & 5 &\bigm| & 6 \\ 8 & 9 & 3 &\bigm| & 5 \end{bmatrix} \]

And can be solved with the following code:

solution <- solve(A,b)
solution
##       [,1]
## [1,]  1.57
## [2,] -1.54
## [3,]  2.10

This means that the solution which solves all three of the linear equations is:

\[ x_1 = 1.57 \\ x_2 = -1.54 \\ x_3 = 2.10 \] To prove this works, we can check the first linear equation by multiplying the first row of the \(A\) matrix by the solution, which should give back the first element of \(b\) (we use the %*% operator for matrix multiplication here).

A[1,] %*% solution
##      [,1]
## [1,]   10

Recall that in our original set of equations, the first constant was equal to \(10\), so this matches what we wanted. Multiplying the full \(A\) matrix by the solution reproduces all three constants at once, as shown below.
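The code below just re-uses the A, b, and solution objects we already created.

#### Verify All Three Equations at Once ####
A %*% solution # reproduces b: 10, 6, 5

Now that we have some basics, we can move on to estimating the coefficients, which will involve some similar math, but plugged into R.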

How to Get the Coefficients

A simple linear regression can be expressed as:

\[ Y_i = \beta_0 + \beta_1 x_i + \epsilon_i \] where \(\beta_0\) represents the intercept, \(\beta_1\) is the slope for a given predictor, \(x\) is a vector which contains our \(x_i\) values, and \(\epsilon_i\) is the error left over for each observation. Our equation here seeks to solve a set of \(n\) linear equations, one for each row of data, to obtain our \(\beta\) coefficients:

\[ Y_1 = \beta_0 + \beta_1 x_1 + \epsilon_1 \\ Y_2 = \beta_0 + \beta_1 x_2 + \epsilon_2 \\ \vdots \\ Y_n = \beta_0 + \beta_1 x_n + \epsilon_n \] We can generalize this with much simpler matrix notation; instead of writing out each predictor and each of the \(n\) rows of data, we have the following more compact equation:

\[ Y = X \beta + \epsilon \]

Here \(Y\) is the dependent variable, \(X\) is the design matrix (our predictors plus a constant column, usually \(1\)), \(\beta\) is a vector of coefficients, and \(\epsilon\) is the same error term. Usually capital letters denote matrices and lowercase letters denote vectors, but to stay consistent with the notation used later, I use capitals throughout and explain what each part is doing along the way. This equation can first be visualized by looking at each vector making up the matrix. Using our regression fitted to the iris data, the first column is the outcome \(Y\), or petal length, the next column is a linear constant \(1\) used to estimate an intercept, the next two columns are the predictors, and the last column holds the errors (which you wouldn’t normally have until after you estimate the regression).

\[ \begin{array}{c} \begin{matrix} \text{Petal Length} & \text{Constant} & \text{Petal Width} & \text{Sepal Width} & \text{Error} \end{matrix} \\ \begin{bmatrix} 1.4 &&&&& 1 &&&&& 0.2 &&&&& 3.5 &&&&& -0.05\\ 1.4 &&&&& 1 &&&&& 0.2 &&&&& 3.0 &&&&& -0.22\\ 1.3 &&&&& 1 &&&&& 0.2 &&&&& 3.2 &&&&& -0.25\\ 1.5 &&&&& 1 &&&&& 0.2 &&&&& 3.1 &&&&& -0.09\\ 1.4 &&&&& 1 &&&&& 0.2 &&&&& 3.6 &&&&& -0.01\\ 1.7 &&&&& 1 &&&&& 0.4 &&&&& 3.9 &&&&& -0.04 \end{bmatrix} \end{array} \] The constant, \(1\), and the two predictor vectors make up the design matrix \(X\), so the same matrix above can be re-represented like so: \[ \begin{array}{c} \begin{matrix} Y &&& X &&&&& \epsilon \end{matrix} \\ \begin{bmatrix} 1.4 & | 1 & 0.2 & 3.5| & -0.05\\ 1.4 & | 1 & 0.2 & 3.0| & -0.22\\ 1.3 & | 1 & 0.2 & 3.2| & -0.25\\ 1.5 & | 1 & 0.2 & 3.1| & -0.09\\ 1.4 & | 1 & 0.2 & 3.6| & -0.01\\ 1.7 & | 1 & 0.4 & 3.9| & -0.04 \end{bmatrix} \end{array} \]
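If you would like to reproduce the first six rows of this matrix yourself, a sketch like the one below works, re-using the fit object from earlier. The Error column comes from resid(fit), which only exists after fitting, and the column labels are just for display.

#### Reproduce the First Six Rows of the Matrix Above ####
head(
  cbind(
    Petal.Length = iris$Petal.Length,
    Constant     = 1,
    Petal.Width  = iris$Petal.Width,
    Sepal.Width  = iris$Sepal.Width,
    Error        = round(resid(fit), 2)
  )
)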

You may have already noticed that the vector of \(\beta\) coefficients is missing here. To obtain the \(\beta\) coefficients, we use the formula:

\[ \beta = (X^T X)^{-1} X^T Y \] The formula looks a bit ominous, but it basically says the following: multiply the transpose of the design matrix by the design matrix itself, take the inverse of that product, and then multiply the result by the transpose of the design matrix times the outcome vector.

That is a lot of words…and probably not helpful on their own. Let’s instead start building our coefficients to see how this works. First we need the design matrix, which will include the linear constant and our two predictors, petal width and sepal width. The dataset has \(n=150\) observations, so our design matrix will be a \(150 \times 3\) matrix. This can be easily created in R, where cbind glues these three columns together (the \(1\) will simply repeat \(150\) times to match the other two columns):

X <- cbind(
  1,
  iris$Petal.Width,
  iris$Sepal.Width
)

Printing X gets us this matrix (I show the first ten rows here to save space). You should notice that the first six rows match the \(X\) portion of the matrix shown above.

       [,1] [,2] [,3]
  [1,]    1  0.2  3.5
  [2,]    1  0.2  3.0
  [3,]    1  0.2  3.2
  [4,]    1  0.2  3.1
  [5,]    1  0.2  3.6
  [6,]    1  0.4  3.9
  [7,]    1  0.3  3.4
  [8,]    1  0.2  3.4
  [9,]    1  0.2  2.9
 [10,]    1  0.1  3.1

Now we just need our \(Y\), which will just be a vector of petal lengths.

Y <- iris$Petal.Length

You may think that if we already have our \(X\) and \(Y\), surely we could just multiply the matrix \(X\) by \(Y\). Using the %*% operator for matrix multiplication, we could theoretically do that.

X %*% Y

This doesn’t really make much sense to do, but illustrates something about matrix multiplication, as it predictably fails:

Error in X %*% Y : non-conformable arguments

Why does this operation fail? In order to multiply two matrices, the number of columns of the first matrix, which we can call matrix \(A\), must equal the number of rows of the second matrix, which we may call matrix \(B\). So if matrix \(A\) is a \(2 \times 3\) matrix and matrix \(B\) is a \(3 \times 2\) matrix, this will work! But if \(A\) were instead a \(2 \times 4\) matrix, it would fail to give us a solution. Each entry of the resulting matrix \(C\) is the inner product of a row of \(A\) with a column of \(B\). As an example, here is the product of two matrices with the matching dimensions necessary for a solution. Notice how the number of columns in the first matrix matches the number of rows in the second matrix.

\[ \begin{bmatrix} 5 & 2 \\ 3 & 8 \\ 0 & 7 \end{bmatrix} \times \begin{bmatrix} 9 & 1 & 0 \\ 0 & 5 & 4 \end{bmatrix} = \begin{bmatrix} 45 & 15 & 8 \\ 27 & 43 & 32 \\ 0 & 35 & 28 \end{bmatrix} \]
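We can verify this multiplication in R with the %*% operator (B and C below are just names for the two matrices shown above). Notice that the product is the same \(A\) matrix we built back in the linear algebra section.

#### Verify the Matrix Product in R ####
B <- matrix(
  c(5,2,
    3,8,
    0,7),
  nrow = 3,
  byrow = T
)

C <- matrix(
  c(9,1,0,
    0,5,4),
  nrow = 2,
  byrow = T
)

B %*% C
##      [,1] [,2] [,3]
## [1,]   45   15    8
## [2,]   27   43   32
## [3,]    0   35   28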

Before applying the coefficient formula, we first need to take the transpose of \(X\). We can compare the original and transposed versions with the following code:

X # original matrix
t(X) # transposed matrix

You’ll now notice that \(X^T\) is a \(3 \times 150\) matrix, whereas the original \(X\) is a \(150 \times 3\) matrix. Now the nice part about a matrix and its transpose is this: multiplying them together gives you a symmetric matrix, but the order matters here for linear regression. If we wrap each product with the dim function, we can see the row-by-column dimensions: \(X X^T\) is a \(150 \times 150\) matrix, while \(X^T X\) is a \(3 \times 3\) matrix:

dim(X %*% t(X)) # what we dont want
dim(t(X) %*% X) # what we want

We want the \(3 \times 3\) version, not the \(150 \times 150\) one. This part will be crucial in a moment, but first we have to find the inverse. Calling solve on the product of \(X^T\) and \(X\) with a single argument returns its inverse.

solve(t(X) %*% X)

This gives back the \(3 \times 3\) inverse, \((X^T X)^{-1}\), which we can now use to find our coefficients in the linear regression:

            [,1]         [,2]         [,3]
[1,]  0.46981785 -0.042111063 -0.134969247
[2,] -0.04211106  0.013339489  0.008540963
[3,] -0.13496925  0.008540963  0.040795612
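As a quick sanity check tying back to the definition of the inverse from earlier, multiplying this inverse by the original product should return the \(3 \times 3\) identity matrix (XtX and XtX_inv below are just temporary names for this check).

#### Sanity Check: Inverse Times Original Gives the Identity ####
XtX     <- t(X) %*% X
XtX_inv <- solve(XtX)

round(XtX_inv %*% XtX) # 3 x 3 identity matrix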

Our inverse of \(X^T X\) is a matrix with \(3\) columns, and we already know that \(X^T\) has three rows, which means that we can perform matrix multiplication. This wouldn’t have worked if the order were flipped: \(X X^T\) is a \(150 \times 150\) matrix that is not even invertible, and its dimensions would not conform with \(X^T\) anyway. See below for proof:

solve(X %*% t(X)) %*% t(X) # doesn't work
solve(t(X) %*% X) %*% t(X) # works like magic

Now we can combine all of these steps using the formula we saw previously. For our case of one intercept and two slopes, the formula is:

\[ \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} = (X^T X)^{-1} X^T Y \]

All we are adding here is multiplying everything we already did by \(Y\). Remember that \(Y\) by itself is just a column vector with \(150\) rows. And since \((X^T X)^{-1} X^T\) is a \(3 \times 150\) matrix, its \(150\) columns match the \(150\) rows of \(Y\), so we can again take the product. Our complete equation for solving for \(\beta\) is this code:

betas <- solve(t(X) %*% X) %*% t(X) %*% Y
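As an aside, since inverting \(X^T X\) directly is not the most numerically stable route (lm itself relies on a QR decomposition rather than this formula), an equivalent sketch is to have solve handle the system \((X^T X)\beta = X^T Y\) directly, without ever forming the inverse:

#### Equivalent Ways to Get the Same Betas ####
solve(t(X) %*% X, t(X) %*% Y)        # solve the normal equations directly
solve(crossprod(X), crossprod(X, Y)) # same thing; crossprod(X) is t(X) %*% X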

That was a lot of work, but now we can finally check to see if the regression from our saved lm object matches what we get here. As a reminder, we fit our original regression with this code:

fit <- lm(Petal.Length ~ Petal.Width + Sepal.Width,
          iris)

We can now print both our estimated betas and those from the saved model using betas for the former and coef(fit) for the latter.

betas
coef(fit)

And they match!

> betas
           [,1]
[1,]  2.2581635
[2,]  2.1556105
[3,] -0.3550346

> coef(fit)
(Intercept) Petal.Width Sepal.Width 
  2.2581635   2.1556105  -0.3550346 
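As one last check, the fitted values and residuals implied by our hand-calculated betas also line up with what lm reports (fitted_by_hand and resid_by_hand below are just names made up for this comparison):

#### Fitted Values and Residuals From the Hand-Calculated Betas ####
fitted_by_hand <- X %*% betas
resid_by_hand  <- Y - fitted_by_hand

head(round(resid_by_hand, 2)) # matches the error column shown earlier
all.equal(as.numeric(fitted_by_hand),
          unname(fitted(fit))) # TRUE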
  

You have now successfully hand-calculated the coefficients of a linear regression. This is a crucial step to understanding not only how OLS works, but how coefficients are estimated in other types of regression (with some modifications).

Bonus: Finding the Inverse Manually

It may be illustrative to see how one can compute the inverse of \(X^T X\) without using a convenient wrapper function like solve. We can create a custom function that uses Gauss-Jordan elimination. Because it requires a bit more coding sophistication, it is heavily annotated to explain each step.

#### Gauss Jordan Inverse Function ####
gauss.jor <- function(a) {
  
  #### Get Dimensions ####
  n <- dim(a)[1]  # Get number of rows/columns (it's a square matrix)
  
  #### Augment With the Identity Matrix #### 
  a <- cbind(a, diag(n))  
  
  #### Loop Over Each Row of Matrix ####
  for (k in 1:n) {
    
    #### For Each Row, Loop Over All Other Rows (Our Operations) ####
    for (i in 1:n) {
      
      ### Don't Change the Row We're Currently Working On ###
      if (i != k) {
        
        #### Factor Method ####
        # Calculate the factor by which to multiply the pivot row (row k):
        # the element of the current row i in column k,
        # divided by the diagonal (pivot) element of that column.

        factor <- a[i, k] / a[k, k]
        
        #### Operations ####
        # Subtract the pivot row (multiplied by the factor) from the current row i.
        # This creates zeros below/above the diagonal element of the current column.

        a[i, ] <- a[i, ] - factor * a[k, ]
      }
    }
    
    #### Morph to 1 ####
    # Divide current row by its diagonal element to make the diagonal element 1,
    # which is necessary for completing Gauss-Jordan elimination.

    a[k, ] <- a[k, ] / a[k, k]
  }
  
  #### Return the Inverse (Right Half of the Augmented Matrix) ####
  # This is the right half after the original matrix has been transformed into 
  # the identity matrix. If we returned all of a, we would also get the
  # (now unnecessary) identity matrix on the left.

  return(a[, -(1:n)])  

}

#### Use the Gauss-Jordan Method ####
gauss.jor(t(X) %*% X) # original computation
gauss.jor(t(X) %*% X) %*% t(X) %*% Y # the betas
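If you want an explicit check that the custom function reproduces the inverse from solve, all.equal below should return TRUE.

#### Compare Against solve() ####
all.equal(
  gauss.jor(t(X) %*% X),
  solve(t(X) %*% X)
)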

And you can see the beta coefficients are the same. Note, again, that this method is not ideal for complex computation and is just for illustration since it is simpler than the actual solve function. In any case, it is close enough to show how one can hand-calculate every piece of the regression.