Problem 1: Gradient Descent

(a) Finding the Critical Point

We consider the function:

\[ f(x, y) = (x - 1)^2 + (y + 2)^2 \]

Taking the partial derivatives:

\[ \frac{\partial f}{\partial x} = 2(x - 1), \quad \frac{\partial f}{\partial y} = 2(y + 2) \]

Setting both to zero:

\[ x - 1 = 0 \Rightarrow x = 1, \quad y + 2 = 0 \Rightarrow y = -2 \]

Thus, the critical point is \((1, -2)\). To confirm this is a minimum, we compute the Hessian matrix:

\[ H = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix} = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} \]

Since the Hessian is positive definite (its eigenvalues are the diagonal entries, both equal to 2 and hence positive) and constant everywhere, the critical point \((1, -2)\) is the global minimum.
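
As a quick numerical cross-check (a minimal sketch; f_check is a helper name introduced here for illustration), a general-purpose optimizer lands on the same point:

f_check <- function(p) (p[1] - 1)^2 + (p[2] + 2)^2  # f(x, y) from above
opt <- optim(c(0, 0), f_check, method = "BFGS")
opt$par    # approximately (1, -2)
opt$value  # approximately 0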

(b) Gradient Descent for a New Function

The objective function is:

\[ f(x, y) = x^2 + b y^2 \]

Implementation in R

Gradient Descent Function

gradient_descent <- function(b, k, iterations = 100) {
  # Minimize f(x, y) = x^2 + b*y^2 by gradient descent, starting from (x, y) = (b, 1)
  x <- b
  y <- 1
  history <- data.frame(iter = 0, x = x, y = y)
  
  for (i in 1:iterations) {
    grad_x <- 2 * x      # df/dx
    grad_y <- 2 * b * y  # df/dy
    x <- x - k * grad_x
    y <- y - k * grad_y
    history <- rbind(history, data.frame(iter = i, x = x, y = y))
  }
  return(history)
}
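
A quick sanity check of the routine (assuming, for illustration, \(b = 3\) and \(k = 0.1\)): the iterates should shrink toward the minimizer at the origin.

tail(gradient_descent(3, 0.1), 3)  # last few rows: x and y close to 0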

Experiment: Running for Different Learning Rates

k_values <- seq(0.01, 0.3, length.out = 100)
b_values <- c(3, 10)
errors <- data.frame()

for (b in b_values) {
  for (k in k_values) {
    history <- gradient_descent(b, k)
    final_error <- sqrt(tail(history$x, 1)^2 + tail(history$y, 1)^2)  # error at the final iterate (iteration 100)
    errors <- rbind(errors, data.frame(b = b, k = k, error = final_error))
  }
}

Error Analysis Plot

library(ggplot2)

ggplot(errors, aes(x = k, y = log10(error), color = factor(b))) +
  geom_line() +
  labs(title = "Effect of Learning Rate on Convergence",
       x = "Learning Rate (k)",
       y = "Log10(Error after 100 iterations)",
       color = "b value")

Observations

  • The final error decreases as the learning rate \(k\) increases toward an optimal value, but an excessively large \(k\) leads to divergence.
  • For \(b = 3\), the error remains low over a wider range of \(k\), showing more stable convergence.
  • For \(b = 10\), the error increases sharply beyond \(k \approx 0.1\), indicating difficulty in convergence; this threshold is checked analytically below.
  • This suggests that larger \(b\) values make gradient descent more sensitive to the choice of learning rate.

Contour Plot with Gradient Descent Path

contour_data <- expand.grid(x = seq(-1, 3.5, 0.1), y = seq(-1, 1.5, 0.1))  # grid wide enough to cover the start point (3, 1)
contour_data$z <- with(contour_data, x^2 + 3 * y^2)  # f(x, y) = x^2 + b*y^2 with b = 3

best_k <- 0.1  # a well-behaved k chosen from the error plot
history <- gradient_descent(3, best_k)

ggplot(contour_data, aes(x = x, y = y)) +
  geom_contour(aes(z = z), bins = 20) +
  geom_path(data = history, aes(x = x, y = y), color = "red", linewidth = 1) +
  labs(title = "Gradient Descent Trajectory on Contour Plot")

Observations

  • The two coordinates contract at different rates: the \(y\)-component, where the curvature \(2b\) is larger, shrinks quickly, while the \(x\)-component decays slowly, showing the anisotropy in the function's scaling (the per-axis factors are computed below).
  • The path is stretched along the slow \(x\)-direction, meaning that different directions converge at different speeds, which aligns with the condition number effect.
  • Higher \(b\) values slow down convergence because the learning rates they tolerate are small relative to the shallow \(x\)-direction, increasing the imbalance in gradient steps across axes.

- - -

Problem 2: Solving Least Squares Problems

We generate a random \(20 \times 10\) matrix \(A\) and a random 20-vector \(b\) (using a Gaussian distribution). We solve the least squares problem:

\[ \min_{x \in \mathbb{R}^{10}} \| Ax - b \|^2 \]

(a) Solve using the Moore-Penrose Pseudoinverse

library(MASS)
set.seed(42)
A <- matrix(rnorm(20 * 10), nrow = 20, ncol = 10)
b <- rnorm(20)
x_pinv <- ginv(A) %*% b  # Using generalized inverse from MASS package
x_pinv
##              [,1]
##  [1,] -0.04603502
##  [2,]  0.22854321
##  [3,] -0.21036634
##  [4,]  0.80145627
##  [5,]  0.07151163
##  [6,]  0.15837052
##  [7,]  0.67111527
##  [8,] -0.68633301
##  [9,] -0.11063594
## [10,]  0.22085979
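
For reference, the same solution can be built directly from the SVD of \(A\), which is also how MASS::ginv computes the pseudoinverse. A minimal sketch (x_svd is a name introduced here; it assumes \(A\) has full column rank, which holds almost surely for a random Gaussian matrix):

sv <- svd(A)                                # A = U diag(d) V^T
x_svd <- sv$v %*% ((t(sv$u) %*% b) / sv$d)  # x = V diag(1/d) U^T b
max(abs(x_svd - x_pinv))                    # agreement to machine precision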

(b) Solve using built-in least squares function

x_lm <- lm.fit(A, b)$coefficients  # Using lm.fit for least squares
x_lm
##          x1          x2          x3          x4          x5          x6 
## -0.04603502  0.22854321 -0.21036634  0.80145627  0.07151163  0.15837052 
##          x7          x8          x9         x10 
##  0.67111527 -0.68633301 -0.11063594  0.22085979
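
Another common route, shown here only for comparison, is to solve the normal equations \(A^T A x = A^T b\) directly; it is less numerically stable than QR for ill-conditioned \(A\), but instructive (x_normal is a name introduced here):

x_normal <- solve(crossprod(A), crossprod(A, b))  # solve (A^T A) x = A^T b
max(abs(x_normal - x_lm))                         # matches the lm.fit solution up to rounding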

(c) Solve using QR Factorization

QR <- qr(A)
Q <- qr.Q(QR)
R <- qr.R(QR)
x_qr <- solve(R, t(Q) %*% b)  # Solving using QR decomposition
x_qr
##              [,1]
##  [1,] -0.04603502
##  [2,]  0.22854321
##  [3,] -0.21036634
##  [4,]  0.80145627
##  [5,]  0.07151163
##  [6,]  0.15837052
##  [7,]  0.67111527
##  [8,] -0.68633301
##  [9,] -0.11063594
## [10,]  0.22085979
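
Since \(R\) is upper triangular, backsolve is the cheaper, more idiomatic call than a general solve; qr.coef gives the same coefficients directly from the stored factorization (x_qr_bs and x_qr_cf are names introduced here for illustration):

x_qr_bs <- backsolve(R, t(Q) %*% b)  # exploits the triangular structure of R
x_qr_cf <- qr.coef(QR, b)            # built-in helper working from the qr() object
max(abs(x_qr_bs - x_qr))             # identical up to rounding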

(d) Verify solutions are nearly equal by computing the Norm of the Differences

Since these difference values are on the order of machine precision, the three solutions are essentially identical.

norm(x_pinv - x_lm, type = "2")  # Difference between pseudoinverse and least squares
## [1] 1.855165e-15
norm(x_lm - x_qr, type = "2")    # Difference between least squares and QR factorization
## [1] 4.550666e-16
norm(x_pinv - x_qr, type = "2")  # Difference between pseudoinverse and QR
## [1] 1.653977e-15

(e) Verify residuals are orthogonal to \(Ax\)

Since orthogonality_check is essentially zero (-2.88658e-15), this confirms that the residual is numerically orthogonal to \(Ax\), as expected from least squares theory.

residuals <- A %*% x_lm - b
orthogonality_check <- t(residuals) %*% (A %*% x_lm)
orthogonality_check
##              [,1]
## [1,] -2.88658e-15
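
A slightly stronger check of the same property: the residual should be orthogonal to every column of \(A\) (the normal equations \(A^T (Ax - b) = 0\)), not only to the particular vector \(Ax\). A minimal sketch:

max(abs(t(A) %*% residuals))  # every entry of A^T r is on the order of 1e-15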

Observations

  • The three methods (Pseudoinverse, built-in least squares, and QR factorization) yield nearly identical solutions.
  • The residuals \(Ax - b\) are orthogonal to \(Ax\) (and, more generally, to the column space of \(A\)), confirming that the least squares solution satisfies the orthogonality condition.

Problem 3: Iterative Solutions to Least Squares

Although the pseudoinverse provides an exact formula for the least squares solution, computing it can be expensive for large matrices. Instead, we use an iterative method, the Richardson iteration:

\[ x^{(k+1)} = x^{(k)} - \mu A^T (Ax^{(k)} - b) \]

where \(\mu\) is a positive parameter ensuring convergence, set as:

\[ \mu = \frac{1}{\| A \|^2} \]

(In the implementation below, the Frobenius norm is used; since \(\|A\|_F \ge \|A\|_2\), this choice of \(\mu\) satisfies the convergence condition \(\mu < 2 / \|A\|_2^2\).)

(a) Verification that \(x = A^+ b\) is a Fixed Point

Substituting \(x = A^+ b\) into the Richardson iteration:

\[ x^{(k+1)} = x^{(k)} - \mu A^T (Ax^{(k)} - b), \]

we set \(x^{(k)} = A^+ b\):

\[ x^{(k+1)} = A^+ b - \mu A^T (A A^+ b - b). \]

Here \(A A^+\) is the orthogonal projector onto the column space of \(A\), so the residual \(A A^+ b - b\) is orthogonal to that column space and hence \(A^T (A A^+ b - b) = 0\) (equivalently, the Moore-Penrose identities give \(A^T A A^+ = A^T\), so the two terms cancel). The update therefore reduces to:

\[ x^{(k+1)} = A^+ b. \]

Thus, \(x = A^+ b\) remains unchanged under the iteration, confirming it as a fixed point.
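
This can also be verified numerically (a minimal sketch, reusing the A, b, and ginv already loaded above; mu_check and one_step are names introduced here for illustration):

x_star <- ginv(A) %*% b                                      # exact least squares solution
mu_check <- 1 / norm(A, type = "F")^2                        # same step size as in part (b)
one_step <- x_star - mu_check * t(A) %*% (A %*% x_star - b)  # one Richardson update
max(abs(one_step - x_star))                                  # ~ machine precision: x_star does not move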

(b) Implement the Richardson Iteration

set.seed(42)
A <- matrix(rnorm(20 * 10), nrow = 20, ncol = 10)
b <- rnorm(20)
x_exact <- ginv(A) %*% b  # Exact least squares solution

mu <- 1 / norm(A, type = "F")^2  # step size; the Frobenius norm gives a conservative, stable mu
x_iter <- matrix(0, nrow = 10, ncol = 501)  # column k+1 stores x^(k), starting from x^(0) = 0

for (k in 1:500) {
  x_iter[, k+1] <- x_iter[, k] - mu * t(A) %*% (A %*% x_iter[, k] - b)
}

# Compute error at each step
errors <- apply(x_iter, 2, function(x_k) norm(x_k - x_exact, type = "2"))

(c) Plot convergence of \(x^{(k)}\)

library(ggplot2)
data <- data.frame(iteration = 0:500, error = errors)
ggplot(data, aes(x = iteration, y = log10(error))) +
  geom_line() +
  labs(title = "Convergence of Richardson Iteration",
       x = "Iteration",
       y = "Log10(Norm of Error)")

Observations

  • The error decreases monotonically, verifying convergence to the least squares solution; the observed decay rate is compared with the theoretical contraction factor in the sketch below.
  • The choice of \(\mu\) ensures stability, preventing divergence.
  • The method is useful when direct computation of \(A^+\) is impractical for large matrices.
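
The decay rate visible in the plot can be compared with theory: subtracting the fixed point gives \(x^{(k+1)} - x^{\star} = (I - \mu A^T A)(x^{(k)} - x^{\star})\), so the error asymptotically shrinks by a factor \(\rho = 1 - \mu \sigma_{\min}^2\) per iteration, where \(\sigma_{\min}\) is the smallest singular value of \(A\). A minimal sketch (rho_theory and rho_observed are names introduced here for illustration):

sigma <- svd(A)$d
rho_theory <- max(abs(1 - mu * sigma^2))               # worst-case contraction factor per step
rho_observed <- (errors[501] / errors[101])^(1 / 400)  # average observed factor over iterations 100 to 500
c(rho_theory = rho_theory, rho_observed = rho_observed)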