2024-04-19
Gradient descent is a method for unconstrained mathematical optimization. It is a first-order iterative algorithm for finding a local minimum of a differentiable objective function, often referred to as the criterion.
Suppose we have a function \(y = f(x)\). The derivative \(f'(x)\) gives the slope of \(f(x)\) at the point \(x\). In other words, it specifies how to scale a small change in the input to obtain the corresponding change in the output: \(f(x + \epsilon) \approx f(x) + \epsilon f'(x)\). The derivative is therefore useful for minimizing a function because it tells us how to change \(x\) in order to make a small improvement in \(y\).
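As a quick numerical illustration of this first-order approximation (the function and point here are illustrative, not part of the original exposition):
f <- function(x) x^2              # example objective
x0 <- 2; eps <- 0.01
f(x0 + eps)                       # 4.0401
f(x0) + eps * (2 * x0)            # 4.04, the first-order approximation with f'(x) = 2x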
When \(f'(x) = 0\), the derivative provides no information about which direction to move. These are called the critical points or stationary points.
The directional derivative in direction \(u\) (a unit vector) is the slope of the function \(f\) in direction \(u\), i.e. the derivative of the function \(f(x + \alpha u)\) with respect to \(\alpha\), evaluated at \(\alpha = 0\). Using the chain rule, we see that \(\frac{\partial f(x+\alpha u)}{\partial \alpha}\) evaluates to \(u^T \nabla f(x)\) when \(\alpha = 0\).
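We can verify this identity numerically; a small sketch using the numDeriv package (the function, point, and direction are illustrative):
f2 <- function(v) sin(v[1]) + cos(v[2])
x0 <- c(1, 2); u <- c(1, 1)/sqrt(2)          # a unit direction
sum(u * numDeriv::grad(f2, x0))              # u^T grad f(x0)
alpha <- 1e-6
(f2(x0 + alpha*u) - f2(x0))/alpha            # finite-difference slope along u; approximately equal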
To minimise \(f\), we want to find the direction in which \(f\) decreases the fastest. We can do this using the directional derivative:
\[ \min_{u,\, u^Tu = 1} u^T\nabla f(x) = \min_{u,\, u^Tu = 1} \|u\|_2\,\|\nabla f(x)\|_2 \cos\theta \]
where \(\theta\) is the angle between \(u\) and the gradient. Substituting in \(\|u\|_2 = 1\) and ignoring factors that do not depend on \(u\), this simplifies to \(\min_{u} \cos(\theta)\). This is minimized when \(u\) points in the opposite direction to the gradient.
In other words, the gradient points directly uphill and the negative gradient points directly downhill. We can decrease \(f\) by moving in the direction of the negative gradient. This is known as the method of steepest descent or gradient descent.
Steepest descent proposes a new point: \(x' = x - \epsilon \nabla f(x)\)
where \(\epsilon\) is the learning rate, a positive scalar determining the size of the step.
A popular approach for choosing \(\epsilon\) is to set it to a pre-assigned small constant. Sometimes, we can solve for the step size that makes the directional derivative vanish.
Another approach is a strategy called line search, in which we evaluate \(f(x - \epsilon \nabla f(x))\) for several values of \(\epsilon\) and choose the one that results in the smallest objective function value.
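A minimal line-search sketch over a grid of candidate step sizes (the objective and grid are illustrative):
f <- function(x) x^2                         # example objective
gradf <- function(x) 2*x                     # its gradient
x0 <- 3
eps_grid <- c(0.001, 0.01, 0.1, 0.5, 1)      # candidate step sizes
vals <- sapply(eps_grid, function(e) f(x0 - e * gradf(x0)))
eps_grid[which.min(vals)]                    # step size giving the smallest objective value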
Steepest descent converges when every element of the gradient is zero or, in practice, very close to zero. In some cases, we may be able to avoid running the iterative algorithm and jump directly to the critical point by solving the equation \(\nabla f(x) = 0\).
For a function \(f : \mathbb{R}^m \to \mathbb{R}^n\), the Jacobian matrix \(J\) of \(f\) is defined such that \(J_{i,j} = \frac{\partial f(x)_{i}}{\partial x_{j}}\).
When our function has multiple input dimensions, there are many second derivatives. These derivatives can be collected together into a matrix called the Hessian matrix \(H(f)(x)\), defined such that \(H(f)(x)_{i,j}\) = \(\frac{\partial^2 f(x)}{\partial x_i \partial x_j}\).
Equivalently, the Hessian is the Jacobian of the gradient.
In our context, we will mostly deal with functions for which second partial derivatives are continuous almost everywhere, thus, the Hessian matrix will be symmetric almost everywhere. Because \(H\) is real and symmetric, we can decompose it into a set of real eigenvalues and an orthogonal basis of eigenvectors.
The second derivative in a specific direction \(d\) (a unit vector) is given by \(d^THd\). When \(d\) is an eigenvector of \(H\), the second derivative in that direction is given by the corresponding eigenvalue. For other directions of \(d\), the directional second derivative is a weighted average of all the eigenvalues of \(H\).
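This can be checked numerically; a sketch using numDeriv::hessian on an illustrative quadratic:
f3 <- function(v) v[1]^2 + 3*v[1]*v[2] + 5*v[2]^2
H <- numDeriv::hessian(f3, c(1, 1))          # Hessian at an arbitrary point
eig <- eigen(H)                              # real symmetric: real eigenvalues, orthogonal eigenvectors
d <- eig$vectors[, 1]                        # an eigenvector (unit length)
t(d) %*% H %*% d                             # equals eig$values[1]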
library(numDeriv)
library(plotly)   # interactive plots and animations
library(dplyr)    # pipe operator
# Objective: x^2 to the left of x = 5, a decaying exponential to the right
gradient_fun = function(x){
  return(ifelse(x < 5, x**2, 5*exp(5 - x)))
}
gradientD_1d = function(initial_point, range = 5, learning_rate = 0.01, tol = 0.0001, iter = 80, ...){
  u = c(); u[1] = initial_point        # the iterates
  z = c(); z[1] = 1                    # iteration counter (used as animation frames)
  i = 2
  while(i < iter) {
    z = c(z, i)
    # one steepest-descent step: x' = x - epsilon * f'(x)
    u[i] = u[i-1] - learning_rate*calculus::gradient(f = gradient_fun, var = c(x = u[i-1]))[1]
    if((u[i] - u[i-1])**2 < tol**2){   # stop once successive iterates are within tol
      break
    }
    i = i + 1
  }
  x_values = seq(-range, range, by = 0.1)
  point_data = data.frame(x = u, y = gradient_fun(u), iteration = z)
  plot <- plot_ly(mode = "markers") %>%
    add_lines(x = x_values, y = gradient_fun(x_values), name = "Function") %>%
    add_markers(data = point_data, x = ~x, y = ~y, frame = ~iteration,
                color = I("red"), name = "Gradient Descent Steps") %>%
    layout(title = "Gradient Descent Optimization",
           xaxis = list(title = "X"), yaxis = list(title = "f(X)"))
  return(plot)
}
gradientD_1d(initial_point = 4.5, range = 10, learning_rate = 0.2, iter = 50)
The process takes 22 iterations to obtain a point at a distance of not more than \(0.0001\) from the local minimum.
We now consider the function \(f(x, y) = \sin(x) + \cos(y)\), visualized as a surface in three-dimensional space. The algorithm attempts to find a local minimum starting from the fixed point \((1, 5)\), moving forward with the help of a pre-assigned learning rate.
library(gganimate)
gradient_fun2 = function(x, y){ return(sin(x) + cos(y)) }
gradientD_multi = function(initial_pointx, initial_pointy, range = 5, learning_rate = 0.1, tol = 0.0001, iter = 50, ...) {
  u = c(); u[1] = initial_pointx       # x-coordinates of the iterates
  v = c(); v[1] = initial_pointy       # y-coordinates of the iterates
  iteration = c(); iteration[1] = 1
  i = 2
  while(i < iter) {
    iteration = c(iteration, i)
    # evaluate the gradient once at the current point, then step along both coordinates
    g = calculus::gradient(f = gradient_fun2, var = c(x = u[i-1], y = v[i-1]))
    u[i] = u[i-1] - learning_rate*g[1]
    v[i] = v[i-1] - learning_rate*g[2]
    if((u[i] - u[i-1])**2 + (v[i] - v[i-1])**2 < tol**2){  # squared step length below tol^2
      break
    }
    i = i + 1
  }
  x_values = seq(-range, range, by = 0.1)
  point_data = data.frame(x = u, y = v, z = gradient_fun2(u, v), iteration)
  z_grid = expand.grid(x = x_values, y = x_values)
  z_grid$z = gradient_fun2(x = z_grid$x, y = z_grid$y)
  z_matrix <- matrix(z_grid$z, nrow = length(x_values), byrow = TRUE)
  plot = plot_ly(x = x_values, y = x_values, z = z_matrix, type = "surface") %>%
    layout(scene = list(xaxis = list(title = "X"), yaxis = list(title = "Y"),
                        zaxis = list(title = "Z"),
                        camera = list(eye = list(x = 1.8, y = 1.8, z = 0.5))))
  plot <- plot %>%
    add_trace(type = "scatter3d", mode = "markers", data = point_data,
              x = ~x, y = ~y, z = ~z, frame = ~iteration,
              marker = list(size = 4, color = "red"), name = "Gradient Descent Steps")
  plot <- plot %>%
    animation_opts(frame = 500, redraw = TRUE)
  return(plot)
}
gradientD_multi(1, 5, iter = 50)
The process takes 49 iterations to reach an optimized solution within a precision of \(0.0001\) of the actual local minimum.
The primary use of gradient descent is to optimise various loss functions. We first take a look at a few commonly used loss functions.
MSE measures the average squared difference between the predicted values and the actual values. It penalises larger errors more severely than small errors due to the squaring operation, making it sensitive to outliers.
\[ \mathrm{MSE}=\frac{1}{n}\sum_{i=1}^n\left(y_i-\hat{y}_i\right)^2 \]
RMSE is derived from the MSE by taking the square root of the average squared difference between the predicted and actual values. While MSE is preferred for optimization problems, RMSE is often preferred for reporting model performance, as it is in the same unit as the target variable \((y)\), making it easier to understand the magnitude of error.
\[ \mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^n\left(y_i-\hat{y}_i\right)^2} \]
MAE measures the average absolute difference between the predicted and actual values. MAE is suitable for regression tasks, generally in cases where outliers are present. It is advisable when the target variable follows a skewed distribution.
\[ \mathrm{MAE}=\frac{1}{n}\sum_{i=1}^n\left|y_i-\hat{y}_i\right| \]
RMSLE is a variant of RMSE that uses the logarithm of the target variable. This makes it more suitable for handling targets with a wide range of values, as the logarithm compresses the scale, reducing the impact of large values.
\[ \mathrm{RMSLE}=\sqrt{\frac{1}{n}\sum_{i=1}^n\left(\log\left(\hat{y_i}+1\right)-\log\left({y_i}+1\right)\right)^2} \]
Huber loss is a hybrid loss function that combines the properties of MSE and MAE. It is designed to be less sensitive to outliers than MSE but more sensitive than MAE. It is preferred when we want a loss function which is robust to outliers while still providing smooth gradients for optimization.
\[ L_{\mathrm{Huber}}\left(r\right)=\begin{cases}\frac{1}{2}r^2 & \mathrm{if}\;\left|r\right|\leq\delta\\\delta\left(\left|r\right|-\frac{\delta}{2}\right) & \mathrm{if}\;\left|r\right|>\delta \end{cases} \]
where \(r=\) residual error, \(\delta=\) smoothing parameter which controls the transition point from quadratic to linear behaviour.
The log-cosh loss is a smooth alternative to MAE and Huber loss, striking a balance between the robustness to outliers exhibited by MAE and the smoothness of the Huber loss function. It is particularly helpful when we want to optimise a differentiable loss function that is robust to outliers.
\[ \mathrm{logcosh}=\frac{1}{n}\sum_{i=1}^n\log\left(\cosh\left(\hat{y}_i-y_i\right)\right) \]
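For reference, these losses are straightforward to implement in R; a sketch (the function names are ours, each vectorized over predictions):
mse <- function(y, yhat) mean((y - yhat)^2)
rmse <- function(y, yhat) sqrt(mse(y, yhat))
mae <- function(y, yhat) mean(abs(y - yhat))
rmsle <- function(y, yhat) sqrt(mean((log(yhat + 1) - log(y + 1))^2))
huber <- function(y, yhat, delta = 1) {      # quadratic inside delta, linear outside
  r <- y - yhat
  mean(ifelse(abs(r) <= delta, 0.5*r^2, delta*(abs(r) - delta/2)))
}
logcosh <- function(y, yhat) mean(log(cosh(yhat - y)))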
Consider the Simple Linear Regression model with usual assumptions,
\[ y=\beta_0+\beta_1 x+\epsilon \]
The MSE loss function is defined by
\[ L\left(\boldsymbol{\beta}\right)=\frac{1}{n}\sum_{i=1}^n\left(y_i-\beta_0-\beta_1x_i\right)^2 \]
We apply the gradient descent algorithm to obtain the optimized value of this loss function. The gradient is
\[ \nabla L\left(\boldsymbol{\beta}\right)=\left(\begin{array}{c}\frac{\partial}{\partial\beta_0}L\left(\boldsymbol{\beta}\right)\\\frac{\partial}{\partial\beta_1}L\left(\boldsymbol{\beta}\right)\end{array}\right) \]
\[ =\left(\begin{array}{c}-\frac{2}{n}\sum_{i=1}^n\left(y_i-\beta_0-\beta_1x_i\right)\\-\frac{2}{n}\sum_{i=1}^n\left(y_i-\beta_0-\beta_1x_i\right)x_i\end{array}\right)=\left(\begin{array}{c}-\frac{2}{n}\sum_{i=1}^n e_i\\-\frac{2}{n}\sum_{i=1}^n e_ix_i\end{array}\right) \]
where \(e_i=y_i-\beta_0-\beta_1x_i\) denotes the \(i\)-th residual.
The gradient descent update for minimizing this loss is
\[ \boldsymbol{\beta}^{(i)}=\boldsymbol{\beta}^{(i-1)}-\alpha\nabla L\left(\boldsymbol{\beta}^{(i-1)}\right) \]
where \(\alpha\) is our pre-assigned learning rate.
\[ y_i=\beta_0+\beta_1x_i+\epsilon_i,\:\: i=1,2,\ldots,200 \]
where the \(x\) observations are drawn from \(\mathcal{N}\left(0,1\right)\) and \(\epsilon_i\overset{iid}{\sim}\mathcal{N}\left(0,3^2\right)\).
Here the true parameter values are \(\beta_0=1\) and \(\beta_1=5\).
### squared error loss
library(tidyverse)
set.seed(9179)
true_beta_0 <- 1
true_beta_1 <- 5
n_obs <- 200
x <- rnorm(n_obs)                                        # predictor values
y <- true_beta_1*x + true_beta_0 + rnorm(n_obs, 0, 3)    # response with N(0, 3^2) noise
slrdata <- data.frame(x = x, y = y)
ggplot(slrdata, aes(x = x, y = y)) +
  geom_point(size = 2) + theme_minimal() +
  labs(title = 'Simulated Data for simple linear regression') +
  geom_abline(intercept = 1, slope = 5, col = "red", lty = 1, lwd = 1.1)
We obtain the ordinary least squares estimates and the corresponding cost function value.
(Intercept) x
1.093300 4.822676
[1] 2133.751
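The chunk producing the output above is not shown; a minimal reconstruction, assuming cost_function (used again below) is the unnormalized sum of squared residuals, i.e. \(n\,L(\boldsymbol\beta)\), which matches the printed value:
cost_function <- function(beta_0, beta_1, x, y){
  sum((y - beta_0 - beta_1*x)^2)             # residual sum of squares, n * L(beta)
}
ols_fit <- lm(y ~ x, data = slrdata)         # ordinary least squares fit
coef(ols_fit)
cost_function(coef(ols_fit)[1], coef(ols_fit)[2], slrdata$x, slrdata$y)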
We now obtain the gradient descent estimates of \(\beta_0\) and \(\beta_1\):
gradient_desc <- function(beta_0, beta_1, x, y){
  N = length(x)
  pred <- beta_1*x + beta_0            # fitted values
  res = y - pred                       # residuals
  delta_beta_0 = -(2/N)*sum(res)       # partial derivative w.r.t. beta_0
  delta_beta_1 = -(2/N)*sum(res*x)     # partial derivative w.r.t. beta_1
  return(c(delta_beta_0, delta_beta_1))
}
minimize_function <- function(beta_0, beta_1, x, y, learning_rate = 0.1, iter = 100, ...){
  gd = gradient_desc(beta_0, beta_1, x, y)
  new_beta_0 = beta_0 - gd[1] * learning_rate   # one gradient step
  new_beta_1 = beta_1 - gd[2] * learning_rate
  return(c(new_beta_0, new_beta_1))
}
res <- list(); res[[1]] <- c(0, 0)     # start from (0, 0)
precision = 0.001; iter = 100
for (i in 2:iter){
  res[[i]] <- minimize_function(res[[i-1]][1], res[[i-1]][2], slrdata$x, slrdata$y, learning_rate = 0.1)
  if((res[[i]][1]-res[[i-1]][1])**2 + (res[[i]][2]-res[[i-1]][2])**2 < precision**2) {break}
}
parameter = bind_rows(lapply(res, function(x) as.data.frame(t(x))))
colnames(parameter) = c('beta0', 'beta1') ## updates of the betas in each iteration
MSE = NULL
for(i in 1:nrow(parameter)){
  MSE = c(MSE, cost_function(parameter$beta0[i], parameter$beta1[i], slrdata$x, slrdata$y))
}
parameter2 = cbind(parameter, MSE, Iteration = 1:nrow(parameter))
cat("the estimate for intercept is", parameter2$beta0[nrow(parameter2)])
cat("the estimate for slope is", parameter2$beta1[nrow(parameter2)])
the estimate for intercept is 1.093811
the estimate for slope is 4.819216
We observe how the gradient descent algorithm gradually moves towards the minimum value of MSE after every iteration.
Note that although, from the curve, it appears that the optimised MSE value is almost achieved within the first 12-15 iterations, the process continues until about the 30th iterative step and stops only when it achieves the desired level of precision, here \(0.001\).
anime_plot = ggplot(data = slrdata, aes(x = x, y = y)) + geom_point(col = "steelblue") +
  geom_abline(intercept = parameter2$beta0, slope = parameter2$beta1, lwd = 1, col = "purple") +
  geom_abline(intercept = 1, slope = 5, lwd = 1, col = "red") + xlab("X values") + ylab("Y values") +
  transition_reveal(parameter2$beta0) + theme_minimal() + labs(title = "Linear Regression by Gradient Descent")
anime_plot
\[ L_{\mathrm{Huber}}\left(r\right)=\begin{cases}\frac{1}{2}r^2 & \mathrm{if}\;\left|r\right|\leq\delta\\\delta\left(\left|r\right|-\frac{\delta}{2}\right) & \mathrm{if}\;\left|r\right|>\delta \end{cases} \]
where, \(r=\) residual error, \(\delta=\) threshold that determines the point where the loss function transitions from quadratic to linear.
Here we are interested in the Huber loss function corresponding to the usual simple linear regression model
\[ \boldsymbol{y}=X\boldsymbol{\beta}+\boldsymbol{\epsilon} \]
\[ g_{\mathrm{Huber}}\left(\boldsymbol{\beta}\right)=\frac{1}{n}\sum_{i=1}^n L_{\mathrm{Huber}}\left(e_i\right),\qquad L_{\mathrm{Huber}}\left(e_i\right)=\begin{cases}\frac{1}{2}e_i^2 & \mathrm{if}\;\left|e_i\right|\leq c\\ c\left(\left|e_i\right|-\frac{c}{2}\right) & \mathrm{otherwise} \end{cases} \]
where \(e_i=y_i-\beta_0-\beta_1x_i\).
We apply the gradient descent algorithm to the Huber loss function to obtain its optimized value.
\[ \boldsymbol{\beta}^{(i)}=\boldsymbol{\beta}^{(i-1)}-\alpha\nabla g_{\mathrm{Huber}}\left(\boldsymbol{\beta}^{(i-1)}\right) \]
where \(\alpha\) is our pre-assigned learning rate.
\[ \nabla g_{\mathrm{Huber}}\left(\boldsymbol{\beta}\right)=\left(\begin{array}{c}\frac{\partial}{\partial\beta_0}g_{\mathrm{Huber}}\left(\boldsymbol{\beta}\right)\\\frac{\partial}{\partial\beta_1}g_{\mathrm{Huber}}\left(\boldsymbol{\beta}\right) \end{array}\right)=-\frac{1}{n}\left(\begin{array}{c}\sum_{i=1}^n\psi\left(e_i\right)\\\sum_{i=1}^n\psi\left(e_i\right)x_i\end{array}\right),\qquad \psi\left(e\right)=\begin{cases}e & \mathrm{if}\;\left|e\right|\leq c\\ c\,\mathrm{sign}\left(e\right) & \mathrm{otherwise}\end{cases} \]
so each residual contributes its raw value inside the quadratic zone and a clipped value \(\pm c\) outside it.
# Define Huber loss function
huber_loss <- function(r, delta) {
ifelse(abs(r) <= delta, 0.5 * r^2, delta * (abs(r) - 0.5 * delta))
}
x = seq(-6, 6, by = 0.1)
huber_data = data.frame(x, f = huber_loss(x, 2), g = 0.5*x**2)
cols = c("Sq. Error loss" = "purple", "Huber Loss" = "red")
ggplot(data = huber_data, aes(x = x, y = f)) +
  geom_line(aes(col = "Huber Loss"), lwd = 1) +
  geom_line(aes(x = x, y = g, col = "Sq. Error loss"), lwd = 1) +
  labs(title = "Huber loss function") + ylab("f(x)") +
  scale_color_manual("Loss Functions", values = cols)
The function calculates the gradient \(\nabla g_{\mathrm{Huber}}\left(\boldsymbol{\beta}\right)\) at every iteration, gradually moves towards the optimized loss function value, and halts the process once it achieves a certain precision,
\[ \left\|\boldsymbol{\beta}^{(i)}-\boldsymbol{\beta}^{(i-1)}\right\|<10^{-6} \]
The value of \(\boldsymbol{\beta}\) at which \(g_{\mathrm{Huber}}\left(\boldsymbol{\beta}\right)\) attains its optimized value is denoted by \(\boldsymbol{\beta}^*\).
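The helper huber_gradient() called below is not defined in the text; a minimal reconstruction consistent with the gradient derived above, returning per-observation gradients that gradient_descent_huber() then averages:
huber_gradient <- function(residuals, x, delta) {
  # psi clips each residual at +/- delta, as in the piecewise gradient above
  psi <- ifelse(abs(residuals) <= delta, residuals, delta * sign(residuals))
  list(grad_beta0 = -psi,                    # per-observation d/d(beta0)
       grad_beta1 = -psi * x)                # per-observation d/d(beta1)
}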
# Perform gradient descent for robust linear regression with Huber loss
gradient_descent_huber <- function(x, y, delta = 1.0, learning_rate = 0.01, max_iter = 1000, tol = 1e-6) {
  n <- length(y)
  beta0 <- 0   # Initialize beta0
  beta1 <- 0   # Initialize beta1
  for (iter in 1:max_iter) {
    # Compute residuals
    residuals <- y - beta0 - beta1 * x
    # Compute per-observation gradients of the Huber loss
    gradients <- huber_gradient(residuals, x, delta)
    # Update parameters using the averaged gradients
    beta0 <- beta0 - learning_rate * mean(gradients$grad_beta0)
    beta1 <- beta1 - learning_rate * mean(gradients$grad_beta1)
    # Check convergence on the gradient norm
    if (iter > 1 && sqrt(sum(gradients$grad_beta0^2) + sum(gradients$grad_beta1^2)) < tol) {
      break
    }
  }
  # Return optimized parameters
  return(list(beta0 = beta0, beta1 = beta1))
}
We observe how the above algorithm performs on a specific regression model,
\[ y_i=\beta_0+\beta_1x_i+\epsilon_i,\:\: i=1,2,\ldots,100 \]
where the \(x\) values are drawn from \(\mathcal{U}\left(-1,1\right)\) and \(\epsilon\sim\mathrm{Cauchy}\left(0,5\right)\).
set.seed(123)
n <- 100
x <- runif(n, -1, 1)
epsilon <- rcauchy(n, 0, 5) # Cauchy distribution for non-normal errors
tbeta0 = 30; tbeta1 = 8                                # true parameters
y <- tbeta0 + tbeta1 * x + epsilon
noise = seq(0, 20, 2)
# Introduce outliers of growing size and compare the Huber and OLS fits
beta = c(); beta1 = matrix(0, nrow = length(noise), ncol = 5)
for(i in 3:length(noise)){
  y[seq(1, n, 10)] <- y[seq(1, n, 10)] + noise[i]**2   # shift every 10th point upward
  trial = lm(y ~ x)                                    # ordinary least squares fit
  coeff = unname(trial$coefficients)
  robust_data = data.frame(x = x, y = y, iteration = rep(i, n))
  new_data = robust_data
  result <- gradient_descent_huber(x, y, delta = 1.35, learning_rate = 0.1)
  beta = c(result$beta0, result$beta1)
  beta1[i,] = c(beta, coeff, i)                        # Huber and OLS estimates side by side
  beta1 = as.data.frame(beta1)
  anime <- ggplot(data = new_data, aes(x = x, y = y)) +
    geom_point() +
    geom_abline(data = beta1, aes(intercept = beta1[i,1], slope = beta1[i,2]),
                color = "purple", size = 1) +          # Huber (gradient descent) fit
    geom_abline(data = beta1, aes(intercept = beta1[i,3], slope = beta1[i,4]),
                color = "green", size = 1) +           # OLS fit
    geom_abline(intercept = tbeta0, slope = tbeta1, lwd = 0.5, col = "red") +  # true line
    labs(x = "X values", y = "Y values")
  anime = ggplotly(anime)
}
print(anime)
Suppose we have a logistic regression setup, where \(y\) denotes the response variable taking binary values in \(\{0,1\}\), and the independent variables are \(x_1\) and \(x_2\).
Our given data is \((\boldsymbol{x_1},\boldsymbol{x_2},\boldsymbol{y})\)
Define the model as
\[\log\left(\dfrac{p}{1-p}\right)=\beta_0+\beta_1x_1+\beta_2x_2,\qquad p=\mathbb{P}\left(y=1\mid x_1,x_2\right)\]
Define,
\[h_{\boldsymbol{\beta}}(\boldsymbol{x})=g(\boldsymbol{\beta}^T\boldsymbol{x})=\dfrac{1}{1+e^{-\boldsymbol{\beta}^T\boldsymbol{x}}}\] i.e., \(g(z)=\dfrac{1}{1+e^{-z}}\). Define \[p(y=1\mid\boldsymbol{x};\boldsymbol{\beta})=h_{\boldsymbol{\beta}}(\boldsymbol{x})\] \[p(y=0\mid\boldsymbol{x};\boldsymbol{\beta})=1-h_{\boldsymbol{\beta}}(\boldsymbol{x})\] \[\therefore\quad p(y\mid\boldsymbol{x};\boldsymbol{\beta})=\left(h_{\boldsymbol{\beta}}(\boldsymbol{x})\right)^y\left(1-h_{\boldsymbol{\beta}}(\boldsymbol{x})\right)^{1-y}\] \[\Rightarrow\quad\log p(y\mid\boldsymbol{x};\boldsymbol{\beta})=y\log h_{\boldsymbol{\beta}}(\boldsymbol{x})+(1-y)\log\left(1-h_{\boldsymbol{\beta}}(\boldsymbol{x})\right)\]
Now the joint log-likelihood is: \[\log p(\boldsymbol{y}\mid X;\boldsymbol{\beta})=\sum_i y_i\log h_{\boldsymbol{\beta}}(\boldsymbol{x}_i)+\sum_i(1-y_i)\log\left(1-h_{\boldsymbol{\beta}}(\boldsymbol{x}_i)\right)\]
We have to maximize the likelihood to obtain the MLE, or equivalently we can maximize the log-likelihood function. Note that maximizing the log-likelihood, \(\underset{\boldsymbol\beta}{\operatorname{argmax}}\,\log p(\boldsymbol{y}\mid X;\boldsymbol\beta)\), is equivalent to \(\underset{\boldsymbol\beta}{\operatorname{argmin}}\left(-\log p(\boldsymbol{y}\mid X;\boldsymbol\beta)\right)\).
Now, consider the function \[l(\boldsymbol{\beta})=-\sum_i y_i\log h_{\boldsymbol{\beta}}(\boldsymbol{x}_i)-\sum_i(1-y_i)\log\left(1-h_{\boldsymbol{\beta}}(\boldsymbol{x}_i)\right)\] So, let us minimize \(l(\boldsymbol\beta)\), the negative log-likelihood.
Note that \(l(\boldsymbol\beta)\) is a smooth function, so gradient descent applies, with updates \[\boldsymbol\beta^{(i)}=\boldsymbol\beta^{(i-1)}-\alpha\nabla_{\boldsymbol\beta}\,l(\boldsymbol\beta^{(i-1)})\] Now, for a single observation, using \(g'(z)=g(z)(1-g(z))\), \[\dfrac{\partial}{\partial\beta_j}l(\boldsymbol\beta)=\left(-\dfrac{y}{h_{\boldsymbol\beta}(\boldsymbol{x})}+\dfrac{1-y}{1-h_{\boldsymbol\beta}(\boldsymbol{x})}\right)\dfrac{\partial}{\partial\beta_j}g(\boldsymbol\beta^T\boldsymbol{x})\] \[=\left(-\dfrac{y}{g(\boldsymbol\beta^T\boldsymbol{x})}+\dfrac{1-y}{1-g(\boldsymbol\beta^T\boldsymbol{x})}\right)g(\boldsymbol\beta^T\boldsymbol{x})\left(1-g(\boldsymbol\beta^T\boldsymbol{x})\right)\dfrac{\partial}{\partial\beta_j}\boldsymbol\beta^T\boldsymbol{x}\] \[=\left(-y\left(1-g(\boldsymbol\beta^T\boldsymbol{x})\right)+(1-y)\,g(\boldsymbol\beta^T\boldsymbol{x})\right)x_j\] \[=-\left(y-h_{\boldsymbol\beta}(\boldsymbol{x})\right)x_j\]
Thus, updating on a single observation at the \(i\)-th stage, \[\beta_j^{(i)}=\beta_j^{(i-1)}+\alpha\left(y^{(i)}-h_{\boldsymbol\beta}(\boldsymbol{x}^{(i)})\right)x_j^{(i)}\] Hence, considering the whole data, \[\beta_j^{(k+1)}=\beta_j^{(k)}+\alpha\sum_{i}\left(y^{(i)}-h_{\boldsymbol\beta^{(k)}}(\boldsymbol{x}^{(i)})\right)x_j^{(i)}=\beta_j^{(k)}+\alpha\sum_{i}\left(y^{(i)}-\dfrac{1}{1+e^{-\boldsymbol{\beta}^{(k)T}\boldsymbol{x}^{(i)}}}\right)x_j^{(i)}\]
n = 200
x1 = rnorm(n, 1, 2)
x2 = rnorm(n, 0.5, 1)
true_beta0 = -1
true_beta1 = 1.5
true_beta2 = -0.5
y = true_beta0 + true_beta1 * x1 + true_beta2 * x2   # linear predictor
prob = plogis(y)                                     # P(y = 1) via the logistic link
true_y = rbinom(n, 1, prob)                          # binary responses
classify_data = as.data.frame(cbind(x1, x2, response = true_y))
random = sample(1:200, 40, replace = F)              # hold out 40 observations
test_data = classify_data[random,]
classify_data = classify_data[-random,]
predict_logistic = function(beta, x) { ## x: dataframe with columns x1, x2
  z = beta[1] + beta[2] * x[,1] + beta[3] * x[,2]
  return(plogis(z))                    # fitted probabilities h_beta(x)
}
neg_log_likelihood = function(beta, x) { ## x: dataframe; column 3 holds the response
  p = predict_logistic(beta, x)
  return(-sum(x[,3] * log(p) + (1 - x[,3]) * log(1 - p)))
}
gradient_likelihood = function(beta, x) {
  p = predict_logistic(beta, x)
  # gradient of the negative log-likelihood: -(y - p) summed per coefficient
  gradient = c(-sum(x[,3] - p), -sum((x[,3] - p) * x[,1]), -sum((x[,3] - p) * x[,2]))
  return(gradient)
}
gd_logistic = function(x, learning_rate = 0.01, iter = 100, tol = 0.001, ...) {
  beta = list()
  beta[[1]] = rep(0, 3)                # start from the zero vector
  for (i in 2:iter) {
    gradient = gradient_likelihood(beta[[i-1]], x)
    beta[[i]] = beta[[i-1]] - learning_rate * gradient
    if(sum((beta[[i]] - beta[[i-1]])**2) < tol**2){ break }
  }
  return(beta[length(beta)])           # final iterate
}
beta_hat = gd_logistic(classify_data)[[1]]
plot(classify_data$x1[which(classify_data$response == 0)],
     classify_data$x2[which(classify_data$response == 0)],
     col = "blue", pch = 19, xlab = "x1", ylab = "x2",
     main = "Logistic Regression by Gradient Descent", xlim = c(-5,7), ylim = c(-5,7))
points(classify_data$x1[which(classify_data$response == 1)],
       classify_data$x2[which(classify_data$response == 1)], col = "red", pch = 19)
x_range = range(x1)
boundary_x = c(x_range[1], x_range[2])
boundary_y = (-beta_hat[1] - beta_hat[2] * boundary_x) / beta_hat[3]   # line where h_beta(x) = 0.5
lines(boundary_x, boundary_y, col = "green", lwd = 3)
legend("topright", legend = c("Class 0", "Class 1", "Decision Boundary"),
       col = c("blue", "red", "green"), pch = 19, lwd = 2)
the Apparent Error Rate (APER) is 0.1
We observe that the algorithm is efficient and effective in many problems; however, it has a number of limitations.
Choice of Learning Rate (\(\alpha\)): The performance of gradient descent is sensitive to the choice of the learning rate (\(\alpha\)). If the learning rate is too small, the algorithm may converge very slowly, requiring a higher number of iterations to reach the minimum. On the other hand, if the learning rate is too large, the algorithm may overshoot the minimum and fail to converge, or it may even oscillate around the minimum without converging.
Local minima and Saddle points: Gradient descent can get trapped in local minima, especially in non-convex optimization problems. It may converge to a suboptimal solution that is not the global minimum. Additionally, gradient descent can struggle around saddle points, where the gradient is near-zero but the point is not a minimum. This can lead to slow convergence, or to the iterates stalling near such points.
Functions in high-dimensional spaces: In high-dimensional spaces (such as those common in deep learning), gradient descent can be computationally expensive and memory-intensive. Computing the gradient requires evaluating all data points, and this becomes increasingly costly as the dimensionality of the data increases.
Feature Scaling: Gradient descent's performance can be influenced by the scale of features in the dataset. Features with different scales can cause issues with convergence, as the gradient updates may be dominated by features with larger scales.
Non-differentiable points: If the objective function contains non-differentiable points (or points of discontinuity), the algorithm may fail to converge or may give inaccurate results.
Batch Size and Stochastic Gradient Descent (SGD): In practice, using the full dataset (batch gradient descent) can be slow and memory-intensive. Techniques like mini-batch gradient descent or stochastic gradient descent (SGD) are often used to approximate the gradient using a subset of the data. However, these approaches introduce additional complexity and may require careful tuning of parameters; a minimal mini-batch sketch is given after this list.
Sensitivity to Initial Parameters: Gradient descent can be sensitive to the initial parameters (starting point) of the optimization. The algorithm may converge to a different minimum depending on the initialization, especially in non-convex optimization problems.
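To make the mini-batch idea concrete, here is a minimal stochastic gradient descent sketch for the earlier simple linear regression data; the batch size, learning rate, and epoch count are illustrative choices, not values used elsewhere in this note:
sgd_slr <- function(x, y, batch_size = 20, learning_rate = 0.05, epochs = 50) {
  beta <- c(0, 0)                            # (beta0, beta1), starting from zero
  n <- length(x)
  for (e in 1:epochs) {
    idx <- sample(n)                         # reshuffle the data each epoch
    for (b in seq(1, n, by = batch_size)) {
      take <- idx[b:min(b + batch_size - 1, n)]        # indices of the current mini-batch
      res <- y[take] - beta[1] - beta[2] * x[take]
      grad <- c(-2*mean(res), -2*mean(res * x[take]))  # MSE gradient on the batch only
      beta <- beta - learning_rate * grad
    }
  }
  beta
}
sgd_slr(slrdata$x, slrdata$y)                # close to the batch gradient descent estimates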