We consider the function:
\[ f(x, y) = (x - 1)^2 + (y + 2)^2 \]
Taking the partial derivatives:
\[ \frac{\partial f}{\partial x} = 2(x - 1), \quad \frac{\partial f}{\partial y} = 2(y + 2) \]
Setting both to zero:
\[ x - 1 = 0 \Rightarrow x = 1, \quad y + 2 = 0 \Rightarrow y = -2 \]
Thus, the critical point is \((1, -2)\). To confirm this is a minimum, we compute the Hessian matrix:
\[ H = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix} = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} \]
Since the Hessian is constant and positive definite (its eigenvalues are the diagonal entries, both equal to 2), \(f\) is strictly convex, so \((1, -2)\) is the global minimum.
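As a quick numerical sanity check, a general-purpose optimizer should land on the same point (a minimal sketch; the helper f_check and the starting point are our own choices, not part of the derivation):
f_check <- function(p) (p[1] - 1)^2 + (p[2] + 2)^2  # the objective f above
optim(c(0, 0), f_check)$par  # Nelder-Mead; returns approximately (1, -2)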
We now minimize the objective function
\[ f(x, y) = x^2 + b y^2 \]
by gradient descent with learning rate \(k\), comparing \(b = 3\) and \(b = 10\).
gradient_descent <- function(b, k, iterations = 100) {
  # Minimize f(x, y) = x^2 + b * y^2 by gradient descent with step size k,
  # starting from (b, 1), and record every iterate.
  x <- b
  y <- 1
  history <- data.frame(iter = 0, x = x, y = y)
  for (i in 1:iterations) {
    grad_x <- 2 * x       # df/dx
    grad_y <- 2 * b * y   # df/dy
    x <- x - k * grad_x
    y <- y - k * grad_y
    history <- rbind(history, data.frame(iter = i, x = x, y = y))
  }
  return(history)
}
k_values <- seq(0.01, 0.3, length.out = 100)
b_values <- c(3, 10)
errors <- data.frame()
for (b in b_values) {
  for (k in k_values) {
    history <- gradient_descent(b, k)
    # Distance of the final iterate from the minimizer (0, 0)
    final_error <- sqrt(tail(history$x, 1)^2 + tail(history$y, 1)^2)
    errors <- rbind(errors, data.frame(b = b, k = k, error = final_error))
  }
}
library(ggplot2)
ggplot(errors, aes(x = k, y = log10(error), color = factor(b))) +
  geom_line() +
  labs(title = "Effect of Learning Rate on Convergence",
       x = "Learning Rate (k)",
       y = "Log10(Error after 100 iterations)",
       color = "b value")
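The divergence visible in the plot can be predicted analytically: for this quadratic the coordinates decouple, and each gradient-descent step scales them by a fixed factor,
\[ x_{i+1} = (1 - 2k)\, x_i, \qquad y_{i+1} = (1 - 2bk)\, y_i, \]
so the iteration converges exactly when \(|1 - 2k| < 1\) and \(|1 - 2bk| < 1\), i.e. \(0 < k < 1/b\) for \(b \ge 1\). This gives thresholds \(k < 1/3\) for \(b = 3\) and \(k < 0.1\) for \(b = 10\), matching where the error curves blow up.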
contour_data <- expand.grid(x = seq(-2, 2, 0.1), y = seq(-2, 2, 0.1))
contour_data$z <- with(contour_data, x^2 + 3 * y^2)  # f(x, y) with b = 3
best_k <- 0.1  # a step size in the stable region for b = 3 (see the error plot)
history <- gradient_descent(3, best_k)
ggplot(contour_data, aes(x = x, y = y)) +
  geom_contour(aes(z = z), bins = 20) +
  geom_path(data = history, aes(x = x, y = y), color = "red", linewidth = 1) +
  labs(title = "Gradient Descent Trajectory on Contour Plot")
---
We generate a random \(20 \times 10\) matrix \(A\) and a random 20-vector \(b\) (using a Gaussian distribution). We solve the least squares problem:
\[ \min_{x \in \mathbb{R}^{10}} \| Ax - b \|^2 \]
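Since a Gaussian \(A\) has full column rank with probability one, the problem has a unique minimizer, characterized by the normal equations; the three methods below should therefore agree:
\[ A^T A x = A^T b \quad\Longrightarrow\quad x = (A^T A)^{-1} A^T b = A^{+} b. \]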
library(MASS)  # provides ginv() for the Moore-Penrose pseudoinverse
set.seed(42)
A <- matrix(rnorm(20 * 10), nrow = 20, ncol = 10)
b <- rnorm(20)
x_pinv <- ginv(A) %*% b # Using generalized inverse from MASS package
x_pinv
## [,1]
## [1,] -0.04603502
## [2,] 0.22854321
## [3,] -0.21036634
## [4,] 0.80145627
## [5,] 0.07151163
## [6,] 0.15837052
## [7,] 0.67111527
## [8,] -0.68633301
## [9,] -0.11063594
## [10,] 0.22085979
x_lm <- lm.fit(A, b)$coefficients # Using lm.fit for least squares
x_lm
## x1 x2 x3 x4 x5 x6
## -0.04603502 0.22854321 -0.21036634 0.80145627 0.07151163 0.15837052
## x7 x8 x9 x10
## 0.67111527 -0.68633301 -0.11063594 0.22085979
QR <- qr(A)
Q <- qr.Q(QR)
R <- qr.R(QR)
x_qr <- solve(R, t(Q) %*% b) # Solving using QR decomposition
x_qr
## [,1]
## [1,] -0.04603502
## [2,] 0.22854321
## [3,] -0.21036634
## [4,] 0.80145627
## [5,] 0.07151163
## [6,] 0.15837052
## [7,] 0.67111527
## [8,] -0.68633301
## [9,] -0.11063594
## [10,] 0.22085979
The pairwise differences below are on the order of machine precision, so the three solutions are numerically identical.
norm(x_pinv - x_lm, type = "2") # Difference between pseudoinverse and least squares
## [1] 1.855165e-15
norm(x_lm - x_qr, type = "2") # Difference between least squares and QR factorization
## [1] 4.550666e-16
norm(x_pinv - x_qr, type = "2") # Difference between pseudoinverse and QR
## [1] 1.653977e-15
The quantity orthogonality_check computed below is essentially zero (\(-2.9 \times 10^{-15}\)), confirming that the residual is numerically orthogonal to the fitted values \(A\hat{x}\), as least squares theory predicts.
residuals <- A %*% x_lm - b                            # residual vector A x_hat - b
orthogonality_check <- t(residuals) %*% (A %*% x_lm)   # inner product with fitted values; should be ~ 0
orthogonality_check
## [,1]
## [1,] -2.88658e-15
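This is exactly what the normal equations guarantee:
\[ A^T (A\hat{x} - b) = 0 \quad\Longrightarrow\quad (A\hat{x})^T (A\hat{x} - b) = \hat{x}^T A^T (A\hat{x} - b) = 0. \]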
Although the pseudoinverse gives an exact formula for the least squares solution, it can be expensive to compute for large matrices. Instead, we use an iterative method, the Richardson iteration:
\[ x^{(k+1)} = x^{(k)} - \mu A^T (Ax^{(k)} - b), \]
where \(\mu\) is a positive step size chosen small enough for convergence. We take
\[ \mu = \frac{1}{\| A \|_F^2}, \]
using the Frobenius norm (as in the code below); since \(\|A\|_F \ge \|A\|_2\), this satisfies the convergence condition \(\mu < 2 / \|A\|_2^2\).
Substituting \(x = A^+ b\) into the Richardson iteration:
\[ x^{(k+1)} = x^{(k)} - \mu A^T (Ax^{(k)} - b), \]
we set \(x^{(k)} = A^+ b\):
\[ x^{(k+1)} = A^+ b - \mu A^T (A A^+ b - b). \]
Because \(A A^+\) is the orthogonal projection onto the column space of \(A\), the residual \(A A^+ b - b\) is orthogonal to that column space, so \(A^T (A A^+ b - b) = 0\); these are exactly the normal equations satisfied by \(A^+ b\). Hence
\[ x^{(k+1)} = A^+ b - \mu \cdot 0 = A^+ b. \]
Thus, \(x = A^+ b\) is unchanged by the iteration, confirming that it is a fixed point.
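As a quick numerical check (a minimal sketch reusing A, b, and x_pinv from the earlier chunk), the correction term of the Richardson update vanishes at \(x = A^+ b\):
norm(t(A) %*% (A %*% x_pinv - b), type = "2")  # normal-equation residual; essentially zero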
set.seed(42)
A <- matrix(rnorm(20 * 10), nrow = 20, ncol = 10)
b <- rnorm(20)
x_exact <- ginv(A) %*% b # Exact least squares solution
mu <- 1 / norm(A, type = "F")^2 # Step size
x_iter <- matrix(0, nrow = 10, ncol = 501)  # column k + 1 stores x^(k); start from x^(0) = 0
for (k in 1:500) {
  x_iter[, k + 1] <- x_iter[, k] - mu * t(A) %*% (A %*% x_iter[, k] - b)
}
# Compute error at each step
errors <- apply(x_iter, 2, function(x_k) norm(x_k - x_exact, type = "2"))
library(ggplot2)
data <- data.frame(iteration = 0:500, error = errors)
ggplot(data, aes(x = iteration, y = log10(error))) +
  geom_line() +
  labs(title = "Convergence of Richardson Iteration",
       x = "Iteration",
       y = "Log10(Norm of Error)")
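The straight line in the log-error plot reflects linear convergence: since \(A^T(A\hat{x} - b) = 0\), the error obeys \(e^{(k+1)} = (I - \mu A^T A)\, e^{(k)}\), so each iteration shrinks the slowest error mode by a fixed factor. A short check of that factor (a sketch reusing A and mu from above):
sigma <- svd(A)$d            # singular values of A
max(abs(1 - mu * sigma^2))   # per-iteration contraction factor of the slowest mode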