Mathematical formula for the OLS (least-squares) estimator:

\[\hat{\beta} = (X^T X)^{-1} X^T y\]

where one of the ways of computing the inverse of a (square) matrix is

\[A^{-1} = \frac{1}{|det(A)|} adj(A)\]

where the formula for adjoint of a matrix is given here.

However, using this method for computing the inverse of the matrix can be numerically unstable. [add cite]

Thus, a different method is used to compute the least-squares coefficients, namely through the QR-decomposition of the \(X\) matrix. This results in \(X = QR\) where Q is an orthogonal matrix and R is an upper triangular matrix. If the matrix that is used in the decomposition is \(n \times p\) where \(n > p\) (which is the case we’re concerned with here) then \[X = [Q_1 \ Q_2] \begin{bmatrix} R_1 \\ 0 \end{bmatrix}\]

where \(X = Q_1 R_1\) is known as the thin QR factorization of the data matrix X.

The inverse of a square matrix \(A\) can be written as \[A^{-1} = (QR)^{-1} = R^{-1} Q^{-1} = R^{-1} Q^T\] since \(Q\) is an orthogonal matrix. For orthogonal matrices, \[QQ^T = I \Rightarrow (Q^{-1}Q)Q^T = Q^{-1}I \Rightarrow Q^T = Q^{-1}\]

Instead of computing \((X^T X)^{-1}\) using QR and plugging it into the formula for \(\hat{\beta}\) above, we can directly use the QR matrices to compute the \(\beta\)’s using the following formula \[\hat{\beta} = R^{-1} Q^T y\]

where R^{-1} can be computed using back-substitution (R command: backsolve()) or using a different formula for computing the inverse of a triangular matrix.

The following function computes the inner product of two vectors (of equal length).

# function to compute the inner product of two vectors
# https://en.wikipedia.org/wiki/Kahan_summation_algorithm
# https://en.wikipedia.org/wiki/Dot_product#Computation
inner_prod = function(v1, v2) {
  
  stopifnot(length(v1) == length(v2))
  
  len = length(v1)
  res = vector("numeric", length = len)
  
  for (i in 1:len) {
    res[i] = v1[i] * v2[i]
  }
  
  return(sum(res))
}
inner = function(v1, v2) {
  
  stopifnot(length(v1) == length(v2))
  
  return(sum(v1 * v2))
}
# test the functions
all.equal(inner_prod(1:10, 1:10), inner(1:10, 1:10))
[1] TRUE

Using Gram-Schmidt Process

The formula/algorithm for the classical Gram-Schmidt process can be found here.

The function below implements the modified Gram-Schmidt algorithm which is a slight modification of the classical Gram-Schmidt algorithm resulting in greater numerical stability.

# Thin QR factorization and modified gram-schmidt
qr_gs = function(x) {
  
  n = nrow(x)
  p = ncol(x)
  
  q1 = matrix(0, nrow = n, ncol = p)
  r1 = matrix(0, nrow = p, ncol = p) 
  u = matrix(0, nrow = n, ncol = p)
  
  u[, 1] = x[, 1]
  
  for (k in 2:ncol(x)) {
    
    u[, k] = x[, k]
    
    # successive orthogonalization
    for (ctr in seq(1, (k - 1), 1)) {
      
      # classical GS
      #u[, k] = u[, k] - ((inner(u[, ctr], x[, k]) / inner(u[, ctr], u[, ctr])) * (u[, ctr]))
      
      # modified (stable) GS
      u[, k] = u[, k] - ((inner(u[, ctr], u[, k]) / inner(u[, ctr], u[, ctr])) * (u[, ctr]))
    }
    
  }
  
  # dividing each column by respective vector (l2) norm (rescaling each vector to have length 1)
  q1 = apply(u, 2, function(x) { x / sqrt(inner(x, x)) })
  r1 = crossprod(q1, x) # t(q1) %*% x
  
  return(list(q = q1, r = r1, u = u))
}

Check if it works: (example taken from the wikipedia article)

A = matrix(c(12, -51, 4, 6, 167, -68, -4, 24, -41), ncol = 3, byrow = TRUE)
A
     [,1] [,2] [,3]
[1,]   12  -51    4
[2,]    6  167  -68
[3,]   -4   24  -41
res = qr_gs(A)
res2 = qr(A)
res$u
     [,1] [,2]  [,3]
[1,]   12  -69 -11.6
[2,]    6  158   1.2
[3,]   -4   30 -33.0
res$q
           [,1]       [,2]        [,3]
[1,]  0.8571429 -0.3942857 -0.33142857
[2,]  0.4285714  0.9028571  0.03428571
[3,] -0.2857143  0.1714286 -0.94285714
qr.Q(res2)
           [,1]       [,2]        [,3]
[1,] -0.8571429  0.3942857  0.33142857
[2,] -0.4285714 -0.9028571 -0.03428571
[3,]  0.2857143 -0.1714286  0.94285714
all.equal(abs(qr.Q(res2)), abs(res$q))
[1] TRUE
round(res$r, 10)
     [,1] [,2] [,3]
[1,]   14   21  -14
[2,]    0  175  -70
[3,]    0    0   35
qr.R(res2)
     [,1] [,2] [,3]
[1,]  -14  -21   14
[2,]    0 -175   70
[3,]    0    0  -35
all.equal(abs(round(res$r, 10)), abs(qr.R(res2)))
[1] TRUE

Why are the signs from the qr call and our implementation different? The footnote on p. 20 of Wood (2006) mentions this:

In fact the QR decomposition is not uniquely defined, in that the sign of rows of Q, and corresponding columns of R, can be switched, without changing X — these sign changes are equivalent to reflections of vectors, and the sign leading to maximum numerical stability is usually selected in practice.

Check if it gives the same result for \(A^{-1}\) as R function solve()

all.equal(tcrossprod(backsolve(qr_gs(A)$r, diag(3)), qr_gs(A)$q), solve(A))
[1] TRUE

Using Householder Transformations

@TODO

Using Givens Rotations

@TODO

Computing OLS Coefficients

Taken from wikipedia.

\(\hat{\beta} = R^{-1} Q^T y\)

# respose vector y and predictor (n x p) matrix x
get_betas = function(x, y) {
  
  # get QR matrices
  res = qr_gs(x)
  
  # t(Q) %*% y
  qb = crossprod(res$q, y)
  
  # betas: R^-1 Q^T y
  coeff = as.vector(backsolve(res$r, qb))
  
  names(coeff) = colnames(x)
  
  # alternatively
  # return(as.vector(backsolve(res$r), crossprod(res$q, y)))
  
  return(coeff)
}

Test whether the beta’s are the same from the lm call and our QR implementation

# generate data
n = 100
x = matrix(rnorm(2 * n), ncol = 2)
y = 2 * x[, 1] + 3 * x[, 2] + rnorm(n)
unname(coef(lm(y ~ 0 + x))) # by lm
[1] 1.925652 2.867929
get_betas(x, y)
[1] 1.925652 2.867929

References

https://en.wikipedia.org/wiki/QR_decomposition

Simon Wood - Generalized Additive Models (2006)

Some other useful references:

---
title: "Computing OLS Coefficients Using QR Decomposition"
author: "Akshat Dwivedi"
date: "`r Sys.Date()`"
output: html_notebook
---

Mathematical formula for the OLS (least-squares) estimator:

$$\hat{\beta} = (X^T X)^{-1} X^T y$$

where one of the ways of computing the inverse of a (square) matrix is

$$A^{-1} = \frac{1}{|det(A)|} adj(A)$$

where the formula for adjoint of a matrix is given [here](https://en.wikipedia.org/wiki/Adjugate_matrix).

However, using this method for computing the inverse of the matrix can be numerically unstable. [add cite]

Thus, a different method is used to compute the least-squares coefficients, namely through the QR-decomposition of the $X$ matrix. This results in $X = QR$ where Q is an orthogonal matrix and R is an upper triangular matrix. If the matrix that is used in the decomposition is $n \times p$ where $n > p$ (which is the case we're concerned with here) then $$X = [Q_1 \ Q_2] \begin{bmatrix} R_1 \\ 0 \end{bmatrix}$$

where $X = Q_1 R_1$ is known as the thin QR factorization of the data matrix X.

The inverse of a square matrix $A$ can be written as $$A^{-1} = (QR)^{-1} = R^{-1} Q^{-1} = R^{-1} Q^T$$ since $Q$ is an orthogonal matrix. For orthogonal matrices, $$QQ^T = I \Rightarrow (Q^{-1}Q)Q^T = Q^{-1}I \Rightarrow Q^T = Q^{-1}$$

Instead of computing $(X^T X)^{-1}$ using QR and plugging it into the formula for $\hat{\beta}$ above, we can directly use the QR matrices to compute the $\beta$'s using the following formula $$\hat{\beta} = R^{-1} Q^T y$$

where R^{-1} can be computed using [back-substitution](https://en.wikipedia.org/wiki/Triangular_matrix#Forward_and_back_substitution) (R command: `backsolve()`) or using [a different formula for computing the inverse of a triangular matrix](https://math.stackexchange.com/questions/4841/inverse-of-an-invertible-triangular-matrix-either-upper-or-lower-is-triangular).

The following function computes the inner product of two vectors (of equal length).

```{r}
# function to compute the inner product of two vectors
# https://en.wikipedia.org/wiki/Kahan_summation_algorithm
# https://en.wikipedia.org/wiki/Dot_product#Computation

inner_prod = function(v1, v2) {
  
  stopifnot(length(v1) == length(v2))
  
  len = length(v1)
  res = vector("numeric", length = len)
  
  for (i in 1:len) {
    res[i] = v1[i] * v2[i]
  }
  
  return(sum(res))
}

inner = function(v1, v2) {
  
  stopifnot(length(v1) == length(v2))
  
  return(sum(v1 * v2))
}

# test the functions
all.equal(inner_prod(1:10, 1:10), inner(1:10, 1:10))
```

### Using Gram-Schmidt Process

The formula/algorithm for the classical Gram-Schmidt process can be found [here](https://en.wikipedia.org/wiki/QR_decomposition#Using_the_Gram.E2.80.93Schmidt_process).

The function below implements the [modified Gram-Schmidt algorithm](https://en.wikipedia.org/wiki/Gram%E2%80%93Schmidt_process#Numerical_stability) which is a slight modification of the classical Gram-Schmidt algorithm resulting in greater numerical stability.

```{r}
# Thin QR factorization and modified gram-schmidt
qr_gs = function(x) {
  
  n = nrow(x)
  p = ncol(x)
  
  q1 = matrix(0, nrow = n, ncol = p)
  r1 = matrix(0, nrow = p, ncol = p) 
  u = matrix(0, nrow = n, ncol = p)
  
  u[, 1] = x[, 1]
  
  for (k in 2:ncol(x)) {
    
    u[, k] = x[, k]
    
    # successive orthogonalization
    for (ctr in seq(1, (k - 1), 1)) {
      
      # classical GS
      #u[, k] = u[, k] - ((inner(u[, ctr], x[, k]) / inner(u[, ctr], u[, ctr])) * (u[, ctr]))
      
      # modified (stable) GS
      u[, k] = u[, k] - ((inner(u[, ctr], u[, k]) / inner(u[, ctr], u[, ctr])) * (u[, ctr]))
    }
    
  }
  
  # dividing each column by respective vector (l2) norm (rescaling each vector to have length 1)
  q1 = apply(u, 2, function(x) { x / sqrt(inner(x, x)) })
  r1 = crossprod(q1, x) # t(q1) %*% x
  
  return(list(q = q1, r = r1, u = u))
}
```

Check if it works: (example taken from the wikipedia article)

```{r}
A = matrix(c(12, -51, 4, 6, 167, -68, -4, 24, -41), ncol = 3, byrow = TRUE)
A

res = qr_gs(A)
res2 = qr(A)

res$u
res$q
qr.Q(res2)
all.equal(abs(qr.Q(res2)), abs(res$q))
round(res$r, 10)
qr.R(res2)
all.equal(abs(round(res$r, 10)), abs(qr.R(res2)))
```

Why are the signs from the `qr` call and our implementation different? The footnote on p. 20 of Wood (2006) mentions this:

```
In fact the QR decomposition is not uniquely defined, in that the sign of rows of Q, and corresponding columns of R, can be switched, without changing X — these sign changes are equivalent to reflections of vectors, and the sign leading to maximum numerical stability is usually selected in practice.
```
Check if it gives the same result for $A^{-1}$ as R function `solve()`

```{r}
all.equal(tcrossprod(backsolve(qr_gs(A)$r, diag(3)), qr_gs(A)$q), solve(A))
```

### Using Householder Transformations

@TODO

### Using Givens Rotations

@TODO

### Computing OLS Coefficients

Taken from [wikipedia](https://en.wikipedia.org/wiki/QR_decomposition#Using_for_solution_to_linear_inverse_problems).

$\hat{\beta} = R^{-1} Q^T y$

```{r}
# respose vector y and predictor (n x p) matrix x
get_betas = function(x, y) {
  
  # get QR matrices
  res = qr_gs(x)
  
  # t(Q) %*% y
  qb = crossprod(res$q, y)
  
  # betas: R^-1 Q^T y
  coeff = as.vector(backsolve(res$r, qb))
  
  names(coeff) = colnames(x)
  
  # alternatively
  # return(as.vector(backsolve(res$r), crossprod(res$q, y)))
  
  return(coeff)
}
```

Test whether the beta's are the same from the `lm` call and our QR implementation

```{r}
# generate data
n = 100
x = matrix(rnorm(2 * n), ncol = 2)
y = 2 * x[, 1] + 3 * x[, 2] + rnorm(n)

unname(coef(lm(y ~ 0 + x))) # by lm

get_betas(x, y)
```

```{r, include=FALSE, eval=FALSE}
out = matrix(0, nrow = nrow(A), ncol = ncol(A))
for (i in 1:ncol(A)) {
  for (j in 1:ncol(A)) {
    out[i, j] = sum(A[i, ] * A[, j])
  }
}
out
A %*% A
all.equal(out, A %*% A)
```

### References

https://en.wikipedia.org/wiki/QR_decomposition

Simon Wood - Generalized Additive Models (2006)

Some other useful references:

* https://stackoverflow.com/questions/39849941/writing-a-householder-qr-factorization-function-in-r-code

* https://stackoverflow.com/questions/38741127/lm-what-is-qraux-returned-by-qr-decomposition-in-linpack-lapack

* https://stackoverflow.com/questions/13248960/qr-decomposition-different-in-lm-and-biglm
