First we introduce the basic algorithm and explore some of its important design components: step sizes and convergence checking.
The final section of this notebook shows a more sophisticated
implementation of gradient descent for estimating logistic regression
coefficients that uses a few advanced topics, so it’s OK to not
understand all the code in a first read. Some interesting experiments to
try in that section: changing the data generating parameters
(n, p, sparsity, etc) or
algorithm parameters (max_steps, tol).
(Note: if there’s any place where code is not displayed the source can be seen in the .Rmd file version of this notebook).
We can think of this iterative optimization algorithm as minimizing a function by taking steps in the direction of the steepest decrease (or descent). The path followed by the algorithm is like someone skiiing downhill as fast as possible by always turning in the direction with the steepest downward slope. Here’s a simple example using the function \(f(x) = x^4\). This starts at the point \(x_0 = 1\), and then computes a sequence of inputs \(x_1, x_2, \ldots, x_n\) with the goal that \(f(x_n)\) converges to a (potentially local!) minimum of \(f\). In this case \(f\) has a global minimum at \(x = 0\) of \(f(0) = 0\), and we hope our algorithm will give this correct answer. We treat \(x\) as the parameter and \(f(x)\) as the loss function here. \[ x_{new} = x_{old} - \eta\left.\frac{\partial f(x)}{\partial x}\right\vert_{x_{old}} \]
# Function to minimize
f <- function(x) x^4
# Derivative (because it's univariate)
grad_f <- function(x) 4 * x^3
xn <- 1 # starting point
fn <- f(xn)
step_size <- .1
for (step in 1:5) {
# Update step
xn <- xn - step_size * grad_f(xn)
# Show results
cat(paste0("At step ", step,
", xn = ", round(xn, 5),
" and f(xn) = ", round(f(xn), 5)), "\n")
}
## At step 1, xn = 0.6 and f(xn) = 0.1296
## At step 2, xn = 0.5136 and f(xn) = 0.06958
## At step 3, xn = 0.45941 and f(xn) = 0.04454
## At step 4, xn = 0.42062 and f(xn) = 0.0313
## At step 5, xn = 0.39086 and f(xn) = 0.02334
Notice that xn is getting closer to 0 and
f(xn) is decreasing. Perhaps if the algorithm takes more
steps it will converge to the minimizer of \(f\).
However, I’ve chosen the values here carefully. Right now it
converges very slow. If we increase the number of steps from 5 to 100 it
still has not yet reached even xn = 0.1, never mind the
actual minimizer of 0. If we change the step_size or
initial starting point xn to some other values it may
converge faster, or it may diverge to infinity instead of converging to
0.
This shows that step sizes and the number of steps are important parts of any gradient descent implementation, and performance will also depend on the function (and its gradient) and starting points.
The step size is a scalar multiplier of the gradient. It’s denoted \(\gamma_n\) in the Wikipedia article and we’ll keep that convention.
We need the step size to decrease to zero so that the algorithm doesn’t get stuck jumping back and forth past the minimum value instead of getting closer to it. But how fast should it decrease?
If it starts too small or decreases too fast then gradient descent will require many more tiny steps before it reaches the optimum. On the other hand, if it decreases too slowly then gradient descent may jump around the optimum many times before the step size decreases enough for it to get closer.
One common approach is to take a number \(0 < \gamma < 1\), e.g. \(\gamma \approx .9\), and make step \(n\) have step size \(\gamma^n\). The step sizes then would look like this:
gamma <- .9
data.frame(step = 1:30) |>
mutate(step_size = gamma^step) |>
ggplot(aes(x = step, y = step_size)) + geom_point(size = 2)
Let’s see this in action trying to minimize \(f(x) = x^2\).
First we’ll try a value of \(\gamma\) which is too large.
f <- function(x) x^2 # Function to optimize
grad_f <- function(x) 2 * x # Gradient of f
max_steps <- 15
x0 <- 3 # starting point
gamma <- .995
# ... (rest of the code not displayed)
Now we plot the path of the algorithm. We can see gradient descent keeps jumping past the optimum many times, wasting computation by converging too slowly.
Now we’ll try a value of \(\gamma\) which is too small. This time gradient descent converges too fast because the step sizes become numerically close to zero even though we have not yet reached the optimum (minimum of \(f\)).
gamma <- .55
Recall that at a (local) optimal value, the derivative (or gradient) should be zero (or undefined, as in the case of \(|x|\)). We can use this to check if gradient descent converged (stopped moving) because of the step size and has not yet reached an optimum:
grad_f(xnext) # numerically close to zero?
## [1] 1.025414
The gradient isn’t close to zero, instead the algorithm has stopped because the overall step is small:
gamma^step * grad_f(xnext)
## [1] 0.0001307192
Summary: Step sizes are decreasing too slow if gradient descent jumps around the optimum and takes a long time to converge. Step sizes are decreasing too fast if gradient descent converges but fails to reach an optimum.
In practice, people often try different values of \(\gamma\), check to see if gradient descent is converging too slowly or quickly, then modify \(\gamma\) and try again. Or they might pick a sequence of \(\gamma\) values using multiple phases that looks something like this:
Using these step sizes to minimize \(f(x) = x^2\) again, this time the result would converge better:
But strategies like this require “hand tuning,” which we usually prefer to avoid. Instead, if they are available we could use “adaptive” methods that work automatically (most of the time). These include methods like Adam.
We’ll now create a version with adaptive step sizes and numeric differentiation, and apply it to a logistic regression model.
This is the function we want to minimize to fit a logistic regression
model. This is a special case of maximum
likelihood estimation, which could also be applied to any other
glm (or parametric probability model).
Since gradient descent minimizes, we apply it to the negative of the log-likelihood in order to maximize the log-likelihood.The cost function is: \[ C(\beta) = -\sum_{i=1}^n \left[ y_i \log\left( \frac{1}{1 + e^{-x_i \beta}} \right) + (1 - y_i) \log\left( 1 - \frac{1}{1 + e^{-x_i \beta}} \right) \right] \]
# See e.g. lecture slides
neglogL <- function(X, Y, beta) {
Xbeta <- X %*% beta
expXbeta <- exp(-Xbeta)
exp_ratio <- 1/(1+expXbeta)
-sum(Y * log(exp_ratio) + (1-Y) * log(1-exp_ratio))
}
Some functions are difficult to differentiate by hand, or may not
even have a closed-form expression for their gradient. But we can still
use numeric differentiation like this finite difference
method to approximate the gradient. Instead of taking a limit (as in the
definition of derivatives or gradients), we just substitute in a very
small value of h and compute:
\[\frac{f(x + h) - f(x - h)}{2h}\]
Or, in code: (f(x+h) - f(x-h))/(2*h) (note: I spent a
long time wondering why I was getting an error message because my code
had 2h instead of 2*h… this happens sometimes
even after many years of experience). The limit of this as
h goes to zero would be the exact gradient (or derivative)
rather than an approximation. But since we use this approximate method
we don’t need to do any calculus to find an expression for the gradient,
we just need to evaluate the function itself.
Reference: Wikipedia
# Note: This value of h is chosen to be
# on the order of magnitude of the cube-root
# of .Machine$double.eps, which helps avoid
# some numerical approximation errors
numeric_grad <- function(X, Y, beta, h = 1e-06) {
# Approximate each coordinate of the gradient
numerator <- sapply(1:length(beta), function(j) {
H <- rep(0, length(beta))
H[j] <- h # step in each coordinate j
neglogL(X, Y, beta + H) - neglogL(X, Y, beta - H)
})
numerator / (2*h)
}
sapply(v, function): apply the (finite difference) function on each element of the vector V.
We’ll use the Barzilai–Borwein method for choosing a step size. This BB-method requires we keep track of two consecutive values of the function and its gradient in order to compute a step size, and that is why the code for our final implementation is a little more complicated.
Reference: Wikipedia
Now we create some data to fit logistic regression models.
# Setting parameters
set.seed(1)
# Try changing some parameters
n <- 200
p <- 10
sparsity <- p/2
nonzero_beta <- runif(sparsity, min = -1, max = 1)
# or, e.g. rep(1, sparsity), runif(sparsity, min = -1, max = 1)
true_beta <- c(.5, nonzero_beta, rep(0, p - sparsity))
# Generating simulated data
X <- cbind(rep(1, n), matrix(rnorm(n*p), nrow = n))
mu <- X %*% true_beta
px <- exp(mu)/(1+exp(mu))
Y <- rbinom(n, 1, prob = px)
train_ld <- data.frame(y = as.factor(Y), x = X)
fit_glm <- glm(Y ~ X-1, family = "binomial")
# to compare with our results later
Initialize and take a first step:
max_steps <- 50 # try ~3 for comparison
beta_prev2 <- rnorm(ncol(X)) # random start
#beta_prev2 <- c(mean(Y), cor(X[,-1], Y)) # "warm" start
grad_prev2 <- numeric_grad(X, Y, beta_prev2)
beta_prev1 <- beta_prev2 + 0.1 * grad_prev2 / sqrt(sum(grad_prev2^2)) #some ascending (with respect to beta prev 2), random update
grad_prev1 <- numeric_grad(X, Y, beta_prev1)
previous_loss <- neglogL(X, Y, beta_prev2)
next_loss <- neglogL(X, Y, beta_prev1)
steps <- 1
Repeat until convergence: Recall the BB method is defined as: \[ \gamma_n=\frac{|(\mathbf{\beta_n}- \mathbf{\beta_{n-1}})^T[\nabla F(\mathbf{\beta_n})-\nabla F(\mathbf{\beta_{n-1}})]|}{||\nabla F(\mathbf{\beta_n})-\nabla F(\mathbf{\beta_{n-1}})||^2} \] The numerator is an inner product of two vectors, it equals to the component-wise product of the two vectors and take the sum; The denominator is the square of the L2norm, which equals to square the vector and sum.
tol <- 1e-5
while (abs(previous_loss - next_loss) > tol) { # if two losses are close enough to each other, then we stop
# Compute BB step size
grad_diff <- grad_prev1 - grad_prev2
step_BB <- abs(sum((beta_prev1 - beta_prev2) * grad_diff)) / sum(grad_diff^2)
beta_prev2 <- beta_prev1 # for the (n+1)-th step, beta_{n} becomes beta_{n-1}
# Update step
beta_prev1 <- beta_prev1 - step_BB * grad_prev1 # update the newest parameter
# Shift previous steps
grad_prev2 <- grad_prev1
grad_prev1 <- numeric_grad(X, Y, beta_prev1)
previous_loss <- next_loss
next_loss <- neglogL(X, Y, beta_prev1)
# Print beta values and loss for the first 5 steps
if (steps <= 5) {
cat("Step ", steps, ":\n")
cat("Beta_prev2: ", beta_prev2, "\n")
cat("Beta_prev1: ", beta_prev1, "\n")
cat("Loss: ", next_loss, "\n\n")
}
# Print loss every 5 steps, track path, control loop
if (round(steps / 5) == steps / 5) print(previous_loss)
steps <- steps + 1
beta_path[steps, 2] <- next_loss
beta_path[steps, 3:ncol(beta_path)] <- beta_prev1
if (steps > max_steps) break
}
## Step 1 :
## Beta_prev2: -1.191636 0.1574767 1.259388 1.8611 0.2918844 0.8160081 -0.6491023 -0.7952438 -0.8237287 1.27518 -1.353789
## Beta_prev1: 1.4708 -0.8155563 -0.1115051 0.3089516 1.671834 -1.352305 0.3501109 0.5994087 -0.2929673 0.7166714 0.5835274
## Loss: 131.5617
##
## Step 2 :
## Beta_prev2: 1.4708 -0.8155563 -0.1115051 0.3089516 1.671834 -1.352305 0.3501109 0.5994087 -0.2929673 0.7166714 0.5835274
## Beta_prev1: 1.142294 -0.924233 -0.3264248 0.3958693 1.25034 -0.7510966 -0.09464864 0.2733416 -0.1779926 0.2503399 0.3461318
## Loss: 109.9626
##
## Step 3 :
## Beta_prev2: 1.142294 -0.924233 -0.3264248 0.3958693 1.25034 -0.7510966 -0.09464864 0.2733416 -0.1779926 0.2503399 0.3461318
## Beta_prev1: 0.8224974 -0.7359625 -0.2382096 0.3839713 1.121106 -0.6059559 -0.04953832 0.2754572 -0.08279057 0.08517445 0.144412
## Loss: 106.4919
##
## Step 4 :
## Beta_prev2: 0.8224974 -0.7359625 -0.2382096 0.3839713 1.121106 -0.6059559 -0.04953832 0.2754572 -0.08279057 0.08517445 0.144412
## Beta_prev1: 0.8287109 -0.6573478 -0.2457377 0.2878532 0.9632382 -0.558952 -0.09016935 0.2575075 -0.06937636 0.1479613 0.1913891
## Loss: 105.8221
##
## Step 5 :
## Beta_prev2: 0.8287109 -0.6573478 -0.2457377 0.2878532 0.9632382 -0.558952 -0.09016935 0.2575075 -0.06937636 0.1479613 0.1913891
## Beta_prev1: 0.7923798 -0.6108725 -0.1967297 0.2976425 0.9544923 -0.5719123 -0.04268959 0.2645689 -0.0886301 0.1166794 0.1659749
## Loss: 105.5898
##
## [1] 105.8221
## [1] 105.4875
How many steps?
steps - 1 # not counting starting point
## [1] 11
The plot below compares the estimate from our gradient descent
implementation to the estimate from the glm() function
(hollow squares and diamonds, should be almost identical), and also
displays the true coefficients (solid points).
beta_plot_df <- data.frame(
coordinate = 1:11,
beta = c(true_beta, fit_glm |> coef(), beta_prev1),
type = c(rep("true beta", p+1), rep("est: glm()", p+1), rep("est: grad. desc.", p+1)))
beta_plot_df |>
filter(type != "true beta") |>
ggplot(aes(coordinate, beta, color = type)) +
geom_point(alpha = .8, size = 2, stroke = 1, aes(shape = type)) +
scale_shape_manual(values = c(0, 5)) +
geom_point(size = 1.8, color = "black", alpha = .9, data = beta_plot_df |> filter(type == "true beta")) +
scale_color_viridis_d(begin = .1, end = .9, direction = 1) +
scale_x_continuous(breaks = 1:11)
(Leaving out the initial starting point)
(Note: the coefficient indexes may be a bit confusing because of the intercept term)
(MLE: maximizing neglogL is the same as minimizing
neglogL)
How far did gradient descent move (in Euclidean distance of the \(p\)-dimensional parameter space) in each step?
Euclidian Distance -> how much we update/change beta by in each step, but quantified in some way. Done by calculating the square root of the second norm (each element squared added up, and then square rooted) of the vector of differences.
Note: with numeric differentiation we did not need to do the calculus ourselves to find an expression for the gradient. (People have done a lot of work clearing this path for us, though…)
Question: Suppose that instead of logistic regression we wanted to fit some other kind of model. What would we need to change about the code?
Answer: Only the neglogL function! We
could optimize many functions (and therefore fit many models with MLE)
with almost the same code.
Concerns: Although finite difference looks good here, but it is usually not popular in machine learning, for some reasons: 1. Finite difference approximations are prone to numerical errors. The calculation involves subtracting two similar values, which can lead to catastrophic cancellation—a significant loss of precision in floating-point arithmetic. 2. For each dimension of vector β, you need two function evaluations, for high-dimensional parameter spaces, this results in a high computational cost, making it impractical for large-scale problems. 3. The quality of the gradient approximation heavily depends on the choice of the step size h. Finding an appropriate step size is often problem-dependent and can require trial and error.
Generate data from a high-dimensional model. After your
implementation is finished, you can come back here to experiment and see
how the algorithm’s performance depends on these values. (One
interesting experiment: reduce lambda in rpois
to have a sparser beta and see how the results change).
set.seed(1) # comment/uncomment/change this
n <- 100
p <- 400
X <- matrix(rnorm(n*p), nrow = n)
beta <- rpois(p, lambda = 1.5)
SGD_starting_beta <- rnorm(p) # generate once
#beta <- sample(c(-1, 0, 0, 1), p, replace = TRUE)
sigma <- 2
y <- X %*% beta + rnorm(n, sd = sigma)
# sparsity level
sum(beta != 0)
## [1] 308
Question: Can we fit OLS to estimate beta? Why or why not?
Answer: No. Since there are more predictors than observations, i.e. \(p > n\), we cannot fit OLS. (The matrix \(X^TX\) is not invertible)
# Loss function
least_squares_loss <- function(x, y, beta) {
residuals <- y - x %*% beta
sum(residuals^2)
}
\[ L(\beta) = ||y - X \beta||^2 \] Bear in mind that the below gradient is with respct to beta \[ \nabla L(\beta) = -2 X^T (y - X \beta) \]
# Gradient of the loss function
least_squares_gradient <- function(x, y, beta) {
residuals <- y - x %*% beta
-2 * t(x) %*% residuals
}
SGD stands for stochastic gradient descent
The Summand actually gives us a p dimensional vector \[ \nabla L_{SGD}(\beta) = -\frac{1}{m} \sum_{i=1}^m X_i (y_i - X_i^T \beta) \] \[ \beta \leftarrow \beta - \eta \nabla L_{\text{SGD}}(\beta) \] Computing gradient on a subset of the sample:
batch_size <- 10
batch_indices <- sample(1:nrow(X), batch_size)
X_batch <- X[batch_indices, , drop = FALSE]
y_batch <- y[batch_indices]
gradient_batch <- least_squares_gradient(X_batch, y_batch, SGD_starting_beta)
dim(X_batch)
## [1] 10 400
length(y_batch)
## [1] 10
dim(gradient_batch)
## [1] 400 1
Modify the code below to complete your implementation.
In this implementation, normalize the gradient to make it a unit
vector before taking a step. Recall that if v is vector,
then u <- v / sqrt(sum(v^2)) is a unit vector in the
direction of v. (Optional: for numerical stability you
could also add a small positive number in the denominator like
1e-8).
epochs <- 40 # decrease for early stopping?
batch_size <- 8 # for batch SGD
gamma <- 0.9
beta_hat <- SGD_starting_beta # generated once above
#beta_hat <- rnorm(p) # random start
num_batches <- ceiling(n/batch_size)
steps <- epochs * num_batches
SGD_path <- data.frame(step = 1:steps,
epoch = numeric(steps),
gamma = numeric(steps),
loss = numeric(steps),
MSE = numeric(steps),
beta_hat_norm = numeric(steps),
gradient_norm = numeric(steps))
start_time <- Sys.time()
step <- 1
for (epoch in 1:epochs) {
# Pass over sample in a random order each time
shuffled_indices <- sample(1:n, n, replace = TRUE)
for (batch in 1:num_batches) {
batch_start <- 1 + batch_size * (batch - 1)
batch_end <- min(batch_size * batch, nrow(X))
batch_indices <- shuffled_indices[batch_start:batch_end]
# Mini-batch
X_batch <- X[batch_indices, , drop = FALSE]
y_batch <- y[batch_indices]
# Update beta_hat
gradient_batch <- least_squares_gradient(X_batch, y_batch, beta_hat)
gradient_norm <- sqrt(sum(gradient_batch^2))
beta_hat <- beta_hat - gamma^epoch * gradient_batch / (1e-8 + gradient_norm)
LS_loss <- least_squares_loss(X, y, beta_hat)
# Store algorithm path information
SGD_path[step, ] <- c(step,
epoch,
gamma^epoch,
LS_loss,
mean((beta_hat - beta)^2),
sqrt(sum(beta_hat^2)),
gradient_norm)
step <- step + 1
}
}
print(Sys.time() - start_time)
## Time difference of 0.09337211 secs
Epoch: An epoch is one complete pass through the entire training dataset. iteration: An iteration refers to one update of the model’s parameters. Example: If your dataset has 1,000 samples and your batch size is 100, it will take 10 iterations to complete one epoch.
(Averaged over all batches in that epoch)
Note: the loss is plotted on a logarithmic scale so that
large initial changes don’t obscure later variation.
Question: Does the MSE of beta_hat level off? After
roughly how many epochs? Is this happening because gamma
decreased too fast?
Answer: Yes, in this example after roughly 20
epochs, and no because the gamma value is still over 0.1
(has not yet become numerically very close to zero).
We have 100 samples, batch size is 8, so there are 13 iterations per epoch. We have 40 epoches, the total number of iterations is 520.
Has the algorithm pulled more of the squared errors closer to zero compared to the (random) initial starting point?
Compute the (Euclidean) distances between the following points:
# From algorithm starting point to end point
sqrt(sum((beta_hat - SGD_starting_beta)^2))
## [1] 21.06934
# From starting point to true beta
sqrt(sum((beta - SGD_starting_beta)^2))
## [1] 43.28581
# From end point to true beta
sqrt(sum((beta - beta_hat)^2))
## [1] 37.63315
Question: Does SGD do better than a random guess (the starting point) at estimating the true coefficients?
Answer: Yes, after taking some steps the algorithm has reduced the MSE of estimating the true beta.
\[ L_{SGD}(\mathbf{\beta}) = -\frac{1}{m} \sum_{i=1}^m \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] \] \[ \hat{y}_i = \sigma(\beta^T \mathbf{x}_i) = \frac{1}{1 + e^{-\beta^T \mathbf{x}_i}} \]
\[ \frac{d\sigma(z)}{dz}=\sigma(z)(1-\sigma(z)) \] Hence \[ \frac{\partial \hat y_i}{\partial \beta}=\frac{\partial \sigma(\beta^Tx_i)}{\partial \beta}=\sigma(\beta^Tx_i)(1-\sigma(\beta^Tx_i))x_i=\hat{y}_i(1-\hat{y}_i)x_i \] Substitute back to the loss function: \[ \frac{\partial L_{SGD}(\mathbf{\beta})}{\partial \beta}=-\frac{1}{m}\sum^m_{i=1}\big[ y_i(1-\hat y_i)x_i - (1-y_i)\hat y_ix_i \big]=-\frac{1}{m}\sum^m_{i=1}(y_i-\hat y_i)x_i \] ##### 1. Data generation
set.seed(123)
n <- 500
# Generate Feature1 and Feature2 for Class 0
feature1_class0 <- rnorm(n, mean = 2, sd = 1)
feature2_class0 <- rnorm(n, mean = 2, sd = 1)
# Generate Feature1 and Feature2 for Class 1
feature1_class1 <- rnorm(n, mean = 4, sd = 1)
feature2_class1 <- rnorm(n, mean = 4, sd = 1)
# Combine into data frames
data_class0 <- data.frame(
Feature1 = feature1_class0,
Feature2 = feature2_class0,
Class = 0
)
data_class1 <- data.frame(
Feature1 = feature1_class1,
Feature2 = feature2_class1,
Class = 1
)
# Combine both classes into one dataset
data <- rbind(data_class0, data_class1)
# Introduce some overlap by adding noise
data$Feature1 <- data$Feature1 + rnorm(nrow(data), mean = 0, sd = 0.5)
data$Feature2 <- data$Feature2 + rnorm(nrow(data), mean = 0, sd = 0.5)
# Shuffle the dataset
data <- data[sample(nrow(data)), ]
ggplot(data, aes(x = Feature1, y = Feature2, color = factor(Class))) +
geom_point(alpha = 0.7) +
labs(title = "Simulated Binary Classification Dataset",
x = "Feature 1",
y = "Feature 2",
color = "Class") +
theme_minimal()