Gradient descent

Rhitankar Bandyopadhyay
Subhrangsu Bhunia

2024-04-19

What is gradient descent?

Gradient descent is a method for unconstrained mathematical optimization. It is a first-order iterative algorithm for finding a local minimum of a differentiable objective function, often referred to as the criterion.

Suppose we have a function \(y = f(x)\). The derivative \(f'(x)\) gives the slope of \(f(x)\) at the point \(x\). In other words, it specifies how to scale a small change in the input to obtain the corresponding change in the output: \(f(x + \epsilon) \approx f(x) + \epsilon f'(x)\). The derivative is therefore useful for minimizing a function because it tells us how to change \(x\) in order to make a small improvement in \(y\).
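As a quick numerical illustration (a minimal sketch; the function, point, and step are arbitrary choices):

Code
# first-order approximation check: f(x + eps) is close to f(x) + eps * f'(x)
f  <- function(x) x^2        # illustrative function
fp <- function(x) 2*x        # its derivative
x <- 1.5; eps <- 1e-3
c(actual = f(x + eps), linear_approx = f(x) + eps*fp(x))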

When \(f'(x) = 0\), the derivative provides no information about which direction to move. These are called the critical points or stationary points.

Method of gradient-based optimisation

The directional derivative in direction \(u\) (a unit vector) is the slope of the function \(f\) in direction \(u\), i.e. the derivative of the function \(f(x + \alpha u)\) with respect to \(\alpha\), evaluated at \(\alpha = 0\). Using the chain rule, we see that \(\frac{\partial f(x+\alpha u)}{\partial \alpha}\) evaluates to \(u^T \nabla f(x)\) when \(\alpha = 0\).

To minimise \(f\), we would like to find the direction in which \(f\) decreases the fastest. We can do this using the directional derivative:

\[ \min_{u,\, u^Tu=1} u^T \nabla f(x) = \min_{u,\, u^Tu=1} \|u\|_2 \|\nabla f(x)\|_2 \cos(\theta) \]

where \(\theta\) is the angle between \(u\) and the gradient. Substituting in \(\|u\|_2 = 1\) and ignoring factors that do not depend on \(u\), this simplifies to \(\min_{u} \cos(\theta)\). This is minimized when \(u\) points in the opposite direction to the gradient.

In other words, the gradient points directly uphill and the negative gradient points directly downhill. We can decrease \(f\) by moving in the direction of the negative gradient. This is known as the method of steepest descent or gradient descent.
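To see this numerically, the sketch below (an illustrative example, not from the original) scans \(u(\theta)^T\nabla f\) over unit vectors \(u(\theta) = (\cos\theta, \sin\theta)\), for \(f(x,y)=x^2+y^2\) at \((1,1)\), where the gradient \((2,2)\) points at angle \(\pi/4\):

Code
grad   <- c(2, 2)                                   # gradient of x^2 + y^2 at (1, 1)
theta  <- seq(0, 2*pi, length.out = 721)            # angles parameterising unit vectors u
dirder <- cos(theta)*grad[1] + sin(theta)*grad[2]   # directional derivative u(theta)' grad
theta[which.min(dirder)]/pi                         # 1.25, i.e. directly opposite the gradient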

Learning rate

Steepest descent proposes a new point: \(x' = x - \epsilon \nabla f(x)\)

where \(\epsilon\) is the learning rate, a positive scalar determining the size of the step.

A popular approach to choosing \(\epsilon\) is to set it to a pre-assigned small constant. Sometimes, we can instead solve for the step size that makes the directional derivative vanish.

Another approach is a strategy called line search, in which we evaluate \(f(x - \epsilon \nabla f(x))\) for several values of \(\epsilon\) and choose the one that results in the smallest objective function value.
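A minimal sketch of such a line search (the objective and the candidate step sizes are arbitrary choices):

Code
f      <- function(x) x^2                               # objective
grad_f <- function(x) 2*x                               # its gradient
x0   <- 3
eps  <- c(0.01, 0.05, 0.1, 0.3, 0.5)                    # candidate learning rates
vals <- sapply(eps, function(e) f(x0 - e*grad_f(x0)))   # objective after one step of each size
eps[which.min(vals)]                                    # here 0.5, which lands exactly on the minimum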

Steepest descent converges when every element of the gradient is zero or, in practice, very close to zero. In some cases, we may be able to avoid running the iterative algorithm altogether and jump directly to the critical point by solving the equation \(\nabla f(x) = 0\).
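For instance, for a positive definite quadratic \(f(x) = \frac{1}{2}x^TAx - b^Tx\), the gradient \(Ax - b\) vanishes at \(x^* = A^{-1}b\), so a single linear solve replaces the whole iteration (a sketch with an arbitrary \(A\) and \(b\)):

Code
A <- matrix(c(2, 0.5, 0.5, 1), 2, 2)   # symmetric positive definite
b <- c(1, -1)
solve(A, b)                            # the unique critical point (and minimiser) x* = A^{-1} b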

Jacobian and Hessian matrices

For a function \(f : \mathbb{R}^m \to \mathbb{R}^n\), the Jacobian matrix \(J \in \mathbb{R}^{n \times m}\) of \(f\) is defined such that \(J_{i,j}\) = \(\frac{\partial f(x)_{i}}{\partial x_{j}}\).

When our function has multiple input dimensions, there are many second derivatives. These derivatives can be collected together into a matrix called the Hessian matrix \(H(f)(x)\), defined such that \(H(f)(x)_{i,j}\) = \(\frac{\partial^2 f(x)}{\partial x_i \partial x_j}\).

Equivalently, the Hessian is the Jacobian of the gradient.
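This identity can be checked numerically. A minimal sketch, assuming the numDeriv package is installed (the function and evaluation point are arbitrary choices):

Code
# the Hessian equals the Jacobian of the gradient (numerical check)
f  <- function(v) sin(v[1]) + cos(v[2])
x0 <- c(1, 2)
numDeriv::jacobian(function(v) numDeriv::grad(f, v), x0)   # differentiate the gradient
numDeriv::hessian(f, x0)                                   # both are approx. diag(-sin(1), -cos(2))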

In our context, we will mostly deal with functions whose second partial derivatives are continuous almost everywhere, so the Hessian matrix will be symmetric almost everywhere. Because \(H\) is real and symmetric, we can decompose it into a set of real eigenvalues and an orthogonal basis of eigenvectors.

The second derivative in a specific direction \(d\) (a unit vector) is given by \(d^THd\). When \(d\) is an eigenvector of \(H\), the second derivative in that direction is given by the corresponding eigenvalue. For other directions \(d\), the directional second derivative is a weighted average of all the eigenvalues of \(H\).
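A minimal sketch of these facts, again assuming numDeriv and reusing the same arbitrary function and point:

Code
H   <- numDeriv::hessian(function(v) sin(v[1]) + cos(v[2]), c(1, 2))
eig <- eigen(H)           # real eigenvalues, orthogonal eigenvectors, since H is symmetric
d   <- eig$vectors[, 1]   # a unit eigenvector of H
c(directional = drop(t(d) %*% H %*% d),   # second derivative in direction d ...
  eigenvalue  = eig$values[1])            # ... matches the corresponding eigenvalue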

How does Hessian play its part

The Hessian tells us how well a gradient descent step of a given size can be expected to perform. Using a second-order Taylor series approximation around the current point \(x^{(0)}\), with gradient \(g\) and Hessian \(H\),

\[ f\left(x^{(0)} - \epsilon g\right) \approx f\left(x^{(0)}\right) - \epsilon g^T g + \frac{1}{2}\epsilon^2 g^T H g \]

When \(g^T H g\) is positive, the step size that decreases this approximation the most is \(\epsilon^* = \frac{g^T g}{g^T H g}\); when \(g^T H g\) is zero or negative, the approximation alone predicts that ever larger steps keep decreasing \(f\), and \(\epsilon\) must then be chosen more carefully.

Basic examples

Example 1

Code
library(plotly)
library(dplyr)
## numerical gradients below are taken with calculus::gradient()

# piecewise objective: x^2 to the left of 5, a decaying exponential to the right
gradient_fun = function(x){
 return(ifelse(x < 5, (x**2), (5*exp(5 - x))))
}

gradientD_1d = function(initial_point, range = 5, learning_rate = 0.01, tol = 0.0001, iter = 80, ...){
  u = c(); u[1] = initial_point   # the iterates
  z = c(); z[1] = 1               # iteration counter, used as the animation frame
  i = 2
  while(i < iter) {
    z = c(z, i)
    # gradient descent update: x' = x - epsilon * f'(x)
    u[i] = u[i-1] - learning_rate*calculus::gradient(f = gradient_fun, var = c(x = u[i-1]))[1]
    # stop once successive iterates are within tol of each other
    if((u[i] - u[i-1])**2 < tol**2){
      break
    }
    i = i + 1
  }
  x_values = seq(-range, range, by = 0.1)
  point_data = data.frame(x = u, y = gradient_fun(u), iteration = z)
  # animated plot: the function curve plus one red marker per iteration
  plot <- plot_ly(mode = "markers") %>%
    add_lines(x = x_values, y = gradient_fun(x_values), name = "Function") %>%
    add_markers(data = point_data, x = ~x, y = ~y, frame = ~iteration, color = I("red"), name = "Gradient Descent Steps") %>%
    layout(title = "Gradient Descent Optimization", xaxis = list(title = "X"), yaxis = list(title = "f(X)"))
  return(plot)
}
gradientD_1d(initial_point = 4.5, range = 10, learning_rate = 0.2, iter = 50)

The process takes 22 iterations to reach a point within \(0.0001\) of the local minimum.

Example 2

Code
# an objective with several local minima: x/4 - sin(x) + 3
gradient_fun = function(x){
 return(x/4 - sin(x) + 3)
}
gradientD_1d(initial_point = 3.8, range = 8, learning_rate = 0.2, iter = 50)

Example 3

We now consider the function \(f(x, y) = \sin(x) + \cos(y)\), whose graph is a surface in three-dimensional space. The algorithm attempts to find a local minimum starting from the fixed initial point \((1, 5)\), moving forward with the help of a pre-assigned learning rate.

Code
library(gganimate)   # used later for the ggplot-based animations

# objective surface: f(x, y) = sin(x) + cos(y)
gradient_fun2 = function(x, y){ return(sin(x) + cos(y)) }

gradientD_multi = function(initial_pointx, initial_pointy, range = 5, learning_rate = 0.1, tol = 0.0001, iter = 50, ...) {
  u = c(); u[1] = initial_pointx   # x-coordinates of the iterates
  v = c(); v[1] = initial_pointy   # y-coordinates of the iterates
  iteration = c(); iteration[1] = 1
  i = 2
  while(i < iter) {
    iteration = c(iteration, i)
    # evaluate the gradient once at the previous point, then update both coordinates
    g = calculus::gradient(f = gradient_fun2, var = c(x = u[i-1], y = v[i-1]))
    u[i] = u[i-1] - learning_rate*g[1]
    v[i] = v[i-1] - learning_rate*g[2]
    # stop once successive iterates are within tol of each other
    if((u[i] - u[i-1])**2 + (v[i] - v[i-1])**2 < tol**2){
      break
    }
    i = i + 1
  }
  x_values = seq(-range, range, by = 0.1)
  point_data = data.frame(x = u, y = v, z = gradient_fun2(u, v), iteration)
  # evaluate f on a grid for the surface plot
  z_grid = expand.grid(x = x_values, y = x_values)
  z_grid$z = gradient_fun2(x = z_grid$x, y = z_grid$y)
  z_matrix <- matrix(z_grid$z, nrow = length(x_values), byrow = TRUE)
  plot = plot_ly(x = x_values, y = x_values, z = z_matrix, type = "surface") %>%
    layout(scene = list(xaxis = list(title = "X"), yaxis = list(title = "Y"), zaxis = list(title = "Z"), camera = list(eye = list(x = 1.8, y = 1.8, z = 0.5))))
  # add one red marker per iteration, animated along the descent path
  plot <- plot %>%
    add_trace(type = "scatter3d", mode = "markers", data = point_data, x = ~x, y = ~y, z = ~z, frame = ~iteration, marker = list(size = 4, color = "red"), name = "Gradient Descent Steps")
  plot <- plot %>%
    animation_opts(frame = 500, redraw = TRUE)
  return(plot)
}
gradientD_multi(1, 5, iter = 50)

The process takes 49 iterations to reach an optimised solution within a precision of \(0.0001\) of the local minimum.

Example 4

Code
# a convex bowl: f(x, y) = x^2 + y^2, with its unique minimum at the origin
gradient_fun2 = function(x, y){ return(x**2 + y**2) }
gradientD_multi(2, 4, iter = 50)

Applications

Loss functions

The primary use of gradient descent is to minimise various loss functions. We first have a look at a few commonly used ones.

  • Mean Squared Error (MSE)

MSE measures the average squared difference between the predicted values and the actual values. It penalises larger errors more severely than small ones due to the squaring operation, making it sensitive to outliers.

\[ \mathrm{MSE}=\frac{1}{n}\sum_{i=1}^n\left(y_i-\hat{y}_i\right)^2 \]

  • Root Mean Squared Error (RMSE)

RMSE is derived from the MSE by taking the square root of the average squared difference between the predicted and actual values. While MSE is preferred for optimization problems, RMSE is often preferred for reporting model performance, as it is in the same unit as the target variable \((y)\), making the magnitude of the error easier to interpret.

\[ \mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^n\left(y_i-\hat{y}_i\right)^2} \]

  • Mean Absolute Error (MAE)

MAE measures the average absolute difference between the predicted and actual values. It is suitable for regression tasks where outliers are present, and is advisable when the target variable follows a skewed distribution.

\[ \mathrm{MAE}=\frac{1}{n}\sum_{i=1}^n\left|y_i-\hat{y}_i\right| \]

  • Root Mean Squared Log Error (RMSLE)

RMSLE is a variant of RMSE that uses the logarithm of the target variable. This makes it more suitable for handling targets with a wide range of values, as the logarithm compresses the scale, reducing the impact of large values. Implementations of all four losses above are sketched after this list.

\[ \mathrm{RMSLE}=\sqrt{\frac{1}{n}\sum_{i=1}^n\left(\log\left(\hat{y_i}+1\right)-\log\left({y_i}+1\right)\right)^2} \]
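A minimal R sketch of the four losses above, assuming y and y_hat are numeric vectors of equal length (and, for RMSLE, greater than \(-1\)):

Code
mse   <- function(y, y_hat) mean((y - y_hat)^2)
rmse  <- function(y, y_hat) sqrt(mse(y, y_hat))
mae   <- function(y, y_hat) mean(abs(y - y_hat))
rmsle <- function(y, y_hat) sqrt(mean((log(y_hat + 1) - log(y + 1))^2))

y <- c(3, 5, 2.5); y_hat <- c(2.8, 5.4, 3.1)   # toy values for illustration
c(MSE = mse(y, y_hat), RMSE = rmse(y, y_hat), MAE = mae(y, y_hat), RMSLE = rmsle(y, y_hat))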

A few more loss functions

  • Huber Loss

Huber loss is a hybrid loss function that combines the properties of MSE and MAE. It is designed to be less sensitive to outliers than MSE but more sensitive than MAE. It is preferred when we want a loss function that is robust to outliers while still providing smooth gradients for optimization.

\[ L_{\mathrm{Huber}}\left(r\right)=\begin{cases}\frac{1}{2}r^2 & \mathrm{if}\:\left|r\right|\leq\delta\\\delta\left(\left|r\right|-\frac{\delta}{2}\right) & \mathrm{if}\:\left|r\right|>\delta \end{cases} \]

where \(r=\) residual error, \(\delta=\) smoothing parameter which controls the transition point from quadratic to linear behaviour.

  • Log-Cosh Loss

This is a smooth approximation to MAE and Huber loss, striking a balance between the robustness to outliers exhibited by MAE and the smoothness of the Huber loss function. It is particularly helpful when we want to optimise a differentiable loss function that is robust to outliers; a one-line R version follows the formula below.

\[ \mathrm{logcosh}=\frac{1}{n}\sum_{i=1}^n\log\left(\cosh\left(\hat{y}_i-y_i\right)\right) \]
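A one-line R version of this loss (a sketch; note that cosh overflows for very large residuals, so production implementations usually switch to the asymptote \(|r| - \log 2\) there):

Code
# approx. r^2/2 for small residuals r, approx. |r| - log(2) for large ones
log_cosh <- function(y, y_hat) mean(log(cosh(y_hat - y)))
log_cosh(c(3, 5, 2.5), c(2.8, 5.4, 3.1))   # toy values for illustration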

Simple Linear Regression

MSE Loss Function

Consider the Simple Linear Regression model with usual assumptions,

\[ y=\beta_0+\beta_1 x+\epsilon \]

MSE Loss function is defined by,

\[ L\left(\boldsymbol{\beta}\right)=\frac{1}{n}\sum_{i=1}^n\left(y_i-\beta_0-\beta_1x_i\right)^2 \]

We’ll apply the gradient descent algorithm to obtain the minimiser of this loss function. Its gradient is

\[ \nabla L\left(\boldsymbol{\beta}\right)=\left(\begin{array}{c}\frac{\partial}{\partial\beta_0}L\left(\boldsymbol{\beta}\right)\\\frac{\partial}{\partial\beta_1}L\left(\boldsymbol{\beta}\right)\end{array}\right) \]

\[ =\left(\begin{array}{c}-\frac{2}{n}\sum_{i=1}^n\left(y_i-\beta_0-\beta_1x_i\right)\\-\frac{2}{n}\sum_{i=1}^n\left(y_i-\beta_0-\beta_1x_i\right)x_i\end{array}\right)=\left(\begin{array}{c}-\frac{2}{n}\sum_{i=1}^n e_i\\-\frac{2}{n}\sum_{i=1}^n e_ix_i\end{array}\right) \]

where \(e_i = y_i-\beta_0-\beta_1x_i\) denotes the \(i\)-th residual.

Applying the gradient descent algorithm to the MSE loss function, the coefficients are updated as

\[ \boldsymbol{\beta}^{(i)}=\boldsymbol{\beta}^{(i-1)}-\alpha\nabla L\left(\boldsymbol{\beta}^{(i-1)}\right) \]

where \(\alpha\) is our pre-assigned learning rate.

The Setup

\[ y_i=\beta_0+\beta_1x_i+\epsilon_i,\:\: i=1,2,\ldots,200 \]

where the \(x\) observations are drawn from \(\mathcal{N}\left(0,1\right)\) and \(\epsilon_i\overset{iid}{\sim}\mathcal{N}\left(0,3^2\right)\)

Here the true parameter values are \(\beta_0=1\) and \(\beta_1=5\)

Code
### squared error loss
library(tidyverse)
set.seed(9179)
true_beta_0 <- 1
true_beta_1 <- 5
n_obs <- 200
x <- rnorm(n_obs)
y <- true_beta_1*x + true_beta_0 + rnorm(n_obs, 0, 3)
slrdata<- data.frame(x = x, y = y)
ggplot(slrdata, aes(x = x, y = y)) + 
  geom_point(size = 2) + theme_minimal() + 
  labs(title = 'Simulated Data for simple linear regression')+
  geom_abline(intercept = 1,slope = 5,col="red",lty=1,lwd=1.1)

Cont.

We obtain the ordinary least squares estimates and the corresponding cost function value.

Code
## ols estimate
ols <- lm(y ~ x, data = slrdata)
ols$coefficients
(Intercept)           x 
   1.093300    4.822676 
Code
cost_function <- function(beta_0, beta_1, x, y){
  pred <- beta_1*x + beta_0   # fitted values
  res_sq <- (y - pred)^2      # squared residuals
  res_ss <- sum(res_sq)       # residual sum of squares (n times the MSE,
  return(res_ss)              # minimised at the same parameter values)
}

cost_function(beta_0 = ols$coefficients[1][[1]],beta_1 = ols$coefficients[2][[1]],x = slrdata$x, y = slrdata$y)
[1] 2133.751

We obtain the estimates of \(\beta_0\) and \(\beta_1\) :

Code
gradient_desc <- function(beta_0, beta_1, x, y){
  N = length(x)
  pred <- beta_1*x + beta_0           # fitted values
  res = y - pred                      # residuals
  delta_beta_0 = -(2/N)*sum(res)      # partial derivative w.r.t. beta_0
  delta_beta_1 = -(2/N)*sum(res*x)    # partial derivative w.r.t. beta_1
  return(c(delta_beta_0, delta_beta_1))
}
minimize_function <- function(beta_0, beta_1, x, y, learning_rate = 0.1, iter = 100, ...){
  gd = gradient_desc(beta_0, beta_1, x, y)
  new_beta_0 = beta_0 - gd[1] * learning_rate   # one gradient descent update
  new_beta_1 = beta_1 - gd[2] * learning_rate
  return(c(new_beta_0, new_beta_1))
}
res <- list(); res[[1]] <- c(0, 0)   # start from (0, 0)
precision = 0.001; iter = 100
for (i in 2:iter){
  res[[i]] <- minimize_function(res[[i-1]][1], res[[i-1]][2], slrdata$x, slrdata$y, learning_rate = 0.1)
  # stop once successive estimates are within the desired precision
  if((res[[i]][1] - res[[i-1]][1])**2 + (res[[i]][2] - res[[i-1]][2])**2 < precision**2) {break}
}
parameter = bind_rows(lapply(res, function(x) as.data.frame(t(x))))
colnames(parameter) = c('beta0', 'beta1')   ## the betas at each iteration
MSE = NULL
for(i in 1:nrow(parameter)){
  MSE = c(MSE, cost_function(parameter$beta0[i], parameter$beta1[i], slrdata$x, slrdata$y))
}
parameter2 = cbind(parameter, MSE, Iteration = 1:nrow(parameter))
cat("the estimate for intercept is",parameter2$beta0[nrow(parameter2)])
the estimate for intercept is 1.093811
Code
cat("\nthe estimate for slope is",parameter2$beta1[nrow(parameter2)])

the estimate for slope is 4.819216

Minimizing MSE through gradient descent

We observe how the gradient descent algorithm gradually moves towards the minimum value of MSE after every iteration.

Code
ggplot(parameter2, aes(x = Iteration, y =MSE)) + 
  geom_point(size = 2,col="chocolate") + 
  theme_minimal() + geom_line(aes(group = 1)) +
  labs(title = 'MSE at different iterations')+ylab("Mean Square Error")+xlab("Iteration")

Note that although the curve suggests the optimised MSE value is almost achieved within the first 12-15 iterations, the process continues until about the 30th iteration, stopping only when it reaches the desired level of precision, here \(0.001\).

Minimizing MSE by Gradient descent (contd.)

Code
# overlay the regression line from every iteration (revealed one by one),
# with the true line y = 1 + 5x in red
anime_plot = ggplot(data = slrdata, aes(x = x, y = y)) + geom_point(col = "steelblue") +
  geom_abline(intercept = parameter2$beta0, slope = parameter2$beta1, lwd = 1, col = "purple") +
  geom_abline(intercept = 1, slope = 5, lwd = 1, col = "red") +
  xlab("X values") + ylab("Y values") +
  transition_reveal(parameter2$beta0) + theme_minimal() +
  labs(title = "Linear Regression by Gradient Descent")
anime_plot

Robust Regression

Huber Loss Function

\[ L_{\mathrm{Huber}}\left(r\right)=\begin{cases}\frac{1}{2}r^2 & \mathrm{if}\:\left|r\right|\leq\delta\\\delta\left(\left|r\right|-\frac{\delta}{2}\right) & \mathrm{if}\:\left|r\right|>\delta \end{cases} \]

where \(r\) is the residual error and \(\delta\) is a threshold that determines the point where the loss function transitions from quadratic to linear.

Here we are interested in the Huber loss function corresponding to the usual simple linear regression model

\[ \boldsymbol{y}=X\boldsymbol{\beta}+\boldsymbol{\epsilon} \]

Gradient Descent on Huber Loss function

\[ g_{\mathrm{Huber}}\left(\boldsymbol{\beta}\right)=\frac{1}{n}\sum_{i=1}^n\rho_c\left(e_i\right),\qquad \rho_c\left(e\right)=\begin{cases}\frac{1}{2}e^2 & \mathrm{if}\:\left|e\right|\leq c\\ c\left|e\right|-\frac{c^2}{2} & \mathrm{otherwise} \end{cases} \]

where \(e_i=y_i-\beta_0-\beta_1x_i\).

We apply the gradient descent algorithm on the Huber loss function to obtain its optimized value.

\[ \boldsymbol{\beta}^{(i)}=\boldsymbol{\beta}^{(i-1)}-\alpha\nabla g_{\mathrm{Huber}}\left(\boldsymbol{\beta}^{(i-1)}\right) \]

where \(\alpha\) is our pre-assigned learning rate.

\[ \nabla g_{\mathrm{Huber}}\left(\boldsymbol{\beta}\right)=\left(\begin{array}{cc}\frac{\partial}{\partial\beta_0}g_{\mathrm{Huber}}\left(\boldsymbol{\beta}\right)\\\frac{\partial}{\partial\beta_1}g_{\mathrm{Huber}}\left(\boldsymbol{\beta}\right) \end{array}\right) \]

\[ =\left(\begin{array}{c}-\frac{1}{n}\sum_{i=1}^n\psi_c\left(e_i\right)\\-\frac{1}{n}\sum_{i=1}^n\psi_c\left(e_i\right)x_i\end{array}\right),\qquad \psi_c\left(e\right)=\begin{cases}e & \mathrm{if}\:\left|e\right|\leq c\\ c\,\mathrm{sign}\left(e\right) & \mathrm{otherwise}\end{cases} \]

i.e. each residual is clipped to the interval \([-c, c]\) before entering the usual least-squares gradient.

Huber Loss function (contd.)

Code
# Define Huber loss function
huber_loss <- function(r, delta) {
  ifelse(abs(r) <= delta, 0.5 * r^2, delta * (abs(r) - 0.5 * delta))
}
x=seq(-6,6,by=0.1)
huber_data=data.frame(x,f=huber_loss(x,2),g=0.5*x**2)
cols=c("Sq. Error loss"="purple","Huber Loss"="red")
ggplot(data=huber_data,aes(x=x,y=f))+
  geom_line(aes(col="Huber Loss"),lwd=1)+
  geom_line(aes(x=x,y=g,col="Sq. Error loss"),lwd=1)+labs(title = "Huber loss function")+ylab("f(x)")+
  scale_color_manual("Loss Functions",values = cols)

Code
# Compute gradient of Huber loss with respect to beta0 and beta1
huber_gradient <- function(r, x, delta) {
  clipped_r <- pmin(pmax(r, -delta), delta)
  grad_beta0 <- -clipped_r
  grad_beta1 <- -clipped_r * x
  return(list(grad_beta0 = grad_beta0, grad_beta1 = grad_beta1))
}

The function above calculates \(\nabla g_{\mathrm{Huber}}\left(\boldsymbol{\beta}\right)\) at every iteration; the descent routine below gradually moves towards the optimised loss value and halts once the norm of the gradient terms falls below the tolerance,

\[ \left\|\nabla g_{\mathrm{Huber}}\left(\boldsymbol{\beta}^{(i)}\right)\right\|<10^{-6} \]

The value of \(\boldsymbol{\beta}\) at which \(g_{\mathrm{Huber}}\left(\boldsymbol{\beta}\right)\) attains its optimised value is denoted by \(\boldsymbol{\beta}^*\).

Code
# Perform gradient descent for robust linear regression with Huber loss
gradient_descent_huber <- function(x, y, delta = 1.0, learning_rate = 0.01, max_iter = 1000, tol = 1e-6) {
  n <- length(y)
  beta0 <- 0  # Initialize beta0
  beta1 <- 0  # Initialize beta1
  
  for (iter in 1:max_iter) {
    # Compute residuals
    residuals <- y - beta0 - beta1 * x
    
    # Compute gradients of Huber loss
    gradients <- huber_gradient(residuals, x, delta)
    
    # Update parameters using gradients
    beta0 <- beta0 - learning_rate * mean(gradients$grad_beta0)
    beta1 <- beta1 - learning_rate * mean(gradients$grad_beta1)
    
    # Check convergence
    if (iter > 1 && sqrt(sum(gradients$grad_beta0^2) + sum(gradients$grad_beta1^2)) < tol) {
      break
    }
  }
  
  # Return optimized parameters
  return(list(beta0 = beta0, beta1 = beta1))
}

The Setup

We observe how the above algorithm performs on a specific regression model,

\[ y_i=\beta_0+\beta_1x_i+\epsilon_i,\:\: i=1,2,\ldots,100 \]

where the \(x\) values are drawn from \(\mathcal{U}\left(-1,1\right)\) and \(\epsilon\sim\mathrm{Cauchy}\left(0,5\right)\)

The Setup (contd.)

Code
set.seed(123)
n <- 100
x <- runif(n, -1, 1)
epsilon <- rcauchy(n, 0, 5)  # Cauchy distribution for non-normal, heavy-tailed errors
tbeta0 = 30; tbeta1 = 8      # true coefficients
y <- tbeta0 + tbeta1 * x + epsilon
noise = seq(0, 20, 2)        # increasing noise levels for the outliers
# Introduce progressively larger outliers in every 10th observation and compare
# the Huber-loss fit (purple) with OLS (green) and the true line (red)
beta = c(); beta1 = matrix(0, nrow = length(noise), ncol = 5)
for(i in 3:length(noise)){
  y[seq(1, n, 10)] <- y[seq(1, n, 10)] + noise[i]**2
  trial = lm(y ~ x)                      # ordinary least squares fit
  coeff = unname(trial$coefficients)
  robust_data = data.frame(x = x, y = y, iteration = rep(i, n))
  new_data = robust_data
  result <- gradient_descent_huber(x, y, delta = 1.35, learning_rate = 0.1)
  beta = c(result$beta0, result$beta1)   # Huber-loss estimates
  beta1[i,] = c(beta, coeff, i)
  beta1 = as.data.frame(beta1)
  anime <- ggplot(data = new_data, aes(x = x, y = y)) +
    geom_point() +
    geom_abline(data = beta1, aes(intercept = beta1[i,1], slope = beta1[i,2]),
                color = "purple", size = 1) +
    geom_abline(data = beta1, aes(intercept = beta1[i,3], slope = beta1[i,4]),
                color = "green", size = 1) +
    geom_abline(intercept = tbeta0, slope = tbeta1, lwd = 0.5, col = "red") +
    labs(x = "X values", y = "Y values")
  anime = ggplotly(anime)
}
print(anime)

Estimates

Code
# Print estimated coefficients
cat("Intercept (beta0):", result$beta0, "\n")
Intercept (beta0): 32.07752 
Code
cat("Slope (beta1):", result$beta1, "\n")
Slope (beta1): 11.46457 

Logistic Regression

Gradient descent in Logistic regression

The setup

Suppose we have a logistic regression setup, where \(y\) denotes the response variable, taking binary values in \(\{0,1\}\), and the independent variables are \(x_1\) and \(x_2\).
Our given data is \((\boldsymbol{x_1},\boldsymbol{x_2},\boldsymbol{y})\).
Writing \(p = \mathbb{P}(y=1\,|\,x_1,x_2)\), define the model as
\[\log\left(\dfrac{p}{1-p}\right)\;=\;\beta_0\;+\;x_1\beta_1\;+\;x_2\beta_2\]

Theory

Define,

\[h_{\boldsymbol{\beta}}(\boldsymbol{x})\;=\;g(\boldsymbol{\beta}^T\boldsymbol{x})\;=\;\dfrac{1}{1+e^{-\boldsymbol{\beta}^T\boldsymbol{x}}} \] i.e., \(g(z)\;=\;\dfrac{1}{1+e^{-z}}\) is the logistic (sigmoid) function. Define \[p(y=1\,|\,\boldsymbol{x};\boldsymbol{\beta})\;=\;h_{\boldsymbol{\beta}}(\boldsymbol{x})\] \[p(y=0\,|\,\boldsymbol{x};\boldsymbol{\beta})\;=\;1-h_{\boldsymbol{\beta}}(\boldsymbol{x})\] \[\therefore\hspace{1cm}p(y\,|\,\boldsymbol{x};\boldsymbol\beta)\;=\;(h_{\boldsymbol\beta}(\boldsymbol{x}))^y(1-h_{\boldsymbol\beta}(\boldsymbol{x}))^{1-y}\] \[\Rightarrow\;\log p(y\,|\,\boldsymbol{x};\boldsymbol\beta)\;=\;y\,\log(h_{\boldsymbol\beta}(\boldsymbol{x}))+(1-y)\,\log(1-h_{\boldsymbol\beta}(\boldsymbol{x}))\]

Logistic regression (contd.)

Now the joint log-likelihood is: \[\log p(\boldsymbol{y}\,|\,\boldsymbol{X};\boldsymbol\beta)\;=\;\sum_i{y_i}\,\log(h_{\boldsymbol\beta}(\boldsymbol{x}_i))+\sum_i{(1-y_i)}\,\log(1-h_{\boldsymbol\beta}(\boldsymbol{x}_i))\]

We have to maximise the likelihood to obtain the MLE, or equivalently we can maximise the log-likelihood. Note that maximising the log-likelihood, \(\underset{\boldsymbol\beta}{\mathrm{argmax}}\;\log p(\boldsymbol{y}\,|\,\boldsymbol{X};\boldsymbol\beta)\), is equivalent to minimising its negative, \(\underset{\boldsymbol\beta}{\mathrm{argmin}}\;\left(-\log p(\boldsymbol{y}\,|\,\boldsymbol{X};\boldsymbol\beta)\right)\).

Now consider the negative log-likelihood \[l(\boldsymbol{\beta})\;=\;-\sum_i{y_i}\,\log(h_{\boldsymbol\beta}(\boldsymbol{x}_i))-\sum_i{(1-y_i)}\,\log(1-h_{\boldsymbol\beta}(\boldsymbol{x}_i))\] So, let us minimise \(l(\boldsymbol\beta)\).
Note that \(l(\boldsymbol\beta)\) is a smooth function, so gradient descent applies, with updates \[\boldsymbol\beta^{(i)}\;=\;\boldsymbol\beta^{(i-1)}\;-\;\alpha\nabla_{\boldsymbol\beta}l(\boldsymbol\beta^{(i-1)})\] Now, for a single observation \((\boldsymbol{x},y)\), \[\dfrac{\partial}{\partial\beta_j}l(\boldsymbol\beta)\] \[= \left(-\dfrac{y}{h_{\boldsymbol\beta}(\boldsymbol{x})}\;+\dfrac{(1-y)}{1-h_{\boldsymbol\beta}(\boldsymbol{x})}\right)\dfrac{\partial}{\partial\beta_j}g(\boldsymbol\beta^T\boldsymbol{x})\] \[=\;\left(-\dfrac{y}{g(\boldsymbol\beta^T\boldsymbol{x})}\;+\dfrac{(1-y)}{1-g(\boldsymbol\beta^T\boldsymbol{x})}\right)g(\boldsymbol\beta^T\boldsymbol{x})\left(1-g(\boldsymbol\beta^T\boldsymbol{x})\right)\dfrac{\partial}{\partial\beta_j}\boldsymbol\beta^T\boldsymbol{x}\] \[=\;\left(-y\left(1-g(\boldsymbol\beta^T\boldsymbol{x})\right)\;+\;(1-y)\,g(\boldsymbol\beta^T\boldsymbol{x})\right)x_j\] \[=\;-(y-h_{\boldsymbol\beta}(\boldsymbol{x}))x_j\]

(contd.)

Thus, updating on one observation at a time, the iterative stage is \[\beta_j^{(i)}\;=\;\beta_j^{(i-1)}+\alpha\left(y^{(i)}-h_{\boldsymbol\beta}(\boldsymbol{x}^{(i)})\right)x_j^{(i)}\] Hence, considering the whole data, \[\beta_j^{(k+1)}\;=\;\beta_j^{(k)}+\alpha\sum_{i}\left(y^{(i)}-h_{\boldsymbol\beta^{(k)}}(\boldsymbol{x}^{(i)})\right)x_j^{(i)}\] \[=\;\beta_j^{(k)}+\alpha\sum_{i}\left(y^{(i)}-\dfrac{1}{1+e^{-{\boldsymbol\beta^{(k)}}^T\boldsymbol{x}^{(i)}}}\right)x_j^{(i)}\]

Visualization

Code
# simulate two predictors and a binary response from a logistic model
n = 200
x1 = rnorm(n, 1, 2)
x2 = rnorm(n, 0.5, 1)
true_beta0 = -1
true_beta1 = 1.5
true_beta2 = -0.5
y = true_beta0 + true_beta1 * x1 + true_beta2 * x2   # linear predictor
prob = plogis(y)                                     # P(y = 1) via the logistic link
true_y = rbinom(n, 1, prob)
classify_data = as.data.frame(cbind(x1, x2, response = true_y))
random = sample(1:200, 40, replace = F)   # hold out 40 observations as a test set
test_data = classify_data[random,]
classify_data = classify_data[-random,]

predict_logistic = function(beta, x) {            ## x:dataframe
  z = beta[1] + beta[2] * x[,1] + beta[3] * x[,2]    
  return(plogis(z))
}
neg_log_likelihood = function(beta, x) {    ## x:dataframe
  p = predict_logistic(beta, x)
  return(-sum(x[,3] * log(p) + (1 - x[,3]) * log(1 - p)))
}
gradient_likelihood = function(beta, x) {
  p = predict_logistic(beta, x)
  gradient = c( -sum(x[,3] - p), -sum((x[,3] - p) * x[,1]), -sum((x[,3] - p) * x[,2]))
  return(gradient)
}
gd_logistic = function(x, learning_rate=0.01, iter=100,tol=0.001,...) {
  beta=list()
  beta[[1]] = rep(0, 3)
  for (i in 2:iter) {
    gradient = gradient_likelihood(beta[[i-1]], x)
    beta[[i]] = beta[[i-1]] - learning_rate * gradient
    if(sum((beta[[i]]-beta[[i-1]])**2)< tol**2){ break}
  }
 
  return(beta[length(beta)])
}
beta_hat = gd_logistic(classify_data)[[1]]
plot(classify_data$x1[which(classify_data$response == 0)], classify_data$x2[which(classify_data$response == 0)], col = "blue", pch = 19, xlab = "x1", ylab = "x2", main = "Logistic Regression by Gradient Descent",xlim=c(-5,7),ylim=c(-5,7))
points(classify_data$x1[which(classify_data$response == 1)], classify_data$x2[which(classify_data$response == 1)], col = "red", pch = 19)
x_range = range(x1)
boundary_x = c(x_range[1], x_range[2])
boundary_y = (-beta_hat[1] - beta_hat[2] * boundary_x) / beta_hat[3]
lines(boundary_x, boundary_y, col = "green", lwd = 3)
legend("topright", legend = c("Class 0", "Class 1", "Decision Boundary"), col = c("blue", "red", "green"), pch = 19, lwd = 2)
Code
pred = predict_logistic(beta_hat, test_data)   # predicted P(y = 1) on the test set
pred = ifelse(pred > 0.5, 1, 0)                # classify with a 0.5 threshold
count = sum(pred == test_data$response)        # number of correct classifications
cat("the Apparent Error Rate(APER) is",(length(pred)-count)/length(pred))
the Apparent Error Rate(APER) is 0.1

Limitations

Limitations of Gradient descent

We observe that the algorithm is efficient and effective in many problems; however, a number of limitations can be identified.

  • Choice of Learning Rate (\(\alpha\)) : The performance of gradient descent is sensitive to the choice of the learning rate (\(\alpha\)). If the learning rate is too small, the algorithm may converge very slowly, requiring a large number of iterations to reach the minimum. On the other hand, if the learning rate is too large, the algorithm may overshoot the minimum and fail to converge, or even oscillate around the minimum indefinitely.

  • Local minima and Saddle points : Gradient descent can get trapped in local minima, especially in non-convex optimization problems, and may converge to a suboptimal solution that is not the global minimum. Additionally, gradient descent can struggle around saddle points, where the gradient is near zero but the point is not a minimum; this can lead to slow convergence or to the algorithm stalling.

  • Functions in high-dimensional spaces : In high-dimensional spaces (such as those common in deep learning), gradient descent can be computationally expensive and memory-intensive. Computing the gradient requires evaluating all data points, and this becomes increasingly costly as the dimensionality of the data increases.

  • Feature Scaling : Gradient descent’s performance can be influenced by the scale of features in the dataset. Features with different scales can cause issues with convergence, as the gradient updates may be dominated by features with larger scales; a small demonstration follows this list.

  • Non-differentiable points : If the objective function contains non-differentiable points (or may contain points of discontinuity), the algorithm may fail to converge or provide inaccurate results.

  • Batch Size and Stochastic Gradient Descent (SGD) : In practice, using the full dataset (batch gradient descent) can be slow and memory-intensive in some cases. Techniques like mini-batch gradient descent or stochastic gradient descent (SGD) are often used to approximate the gradient using a subset of the data. However, these approaches introduce additional complexity and may require careful tuning of parameters.

  • Sensitivity to Initial Parameters : Gradient descent can be sensitive to the initial parameters (starting point) of the optimization. The algorithm may converge to a different minimum depending on the initialization, especially in non-convex optimization problems.
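To illustrate the feature-scaling point above, here is a minimal sketch (simulated data, with scales and learning rates of our choosing, not from the original) that runs batch gradient descent on a two-feature linear regression, once on raw features and once after standardising them:

Code
set.seed(1)
n  <- 100
x1 <- runif(n)           # feature on a small scale
x2 <- runif(n, 0, 100)   # feature on a much larger scale
y  <- 2*x1 + 0.05*x2 + rnorm(n, 0, 0.1)

# returns the number of iterations until successive estimates stop changing
gd_iters <- function(X, y, lr, max_iter = 50000, tol = 1e-8){
  beta <- rep(0, ncol(X))
  for(i in 1:max_iter){
    grad <- -(2/nrow(X)) * t(X) %*% (y - X %*% beta)   # MSE gradient
    beta_new <- beta - lr * grad
    if(sum((beta_new - beta)^2) < tol^2) return(i)
    beta <- beta_new
  }
  max_iter
}
gd_iters(cbind(1, x1, x2),               y, lr = 1e-4)  # raw scales force a tiny rate: slow, may hit the cap
gd_iters(cbind(1, scale(x1), scale(x2)), y, lr = 0.1)   # standardised features converge in far fewer steps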

Thank you