Experimenting with Gradient Descent

Kaius de Paula Corrêa

2023-11-01

Above image locally generated by me using Stable Diffusion. Uploaded on prompthero.

1 Project: Implementing Gradient Descent from Scratch in R

1.1 Introduction

In this project, we embark on a journey to delve into the heart of one of the fundamental optimization algorithms in machine learning and data science - Gradient Descent. We’ll demonstrate our mastery of this critical concept by developing a custom Gradient Descent algorithm from scratch, using the R programming language.

1.1.1 What is Gradient Descent?

Gradient Descent is an optimization algorithm used to minimize a cost function by iteratively adjusting model parameters. It plays a pivotal role in training machine learning models, including linear regression, logistic regression, neural networks, and more.

1.2 Project Objectives

Our main objectives are as follows:

Understanding Gradient Descent: We will start by gaining a deep understanding of the principles behind Gradient Descent, including the mathematics involved.
Implementing Gradient Descent: I will create a custom implementation of the Gradient Descent algorithm in R, without relying on external libraries. This implementation will be flexible enough to work with various cost functions and optimization problems.

1.3 Methodology

Our project will be structured into the following stages:

1.3.1 1. Mathematical Foundation

I will lay the mathematical groundwork by understanding the core concepts, including:

Cost Function: Definition and mathematical representation.
Gradient: Calculating the gradient of the cost function.
Learning Rate: Exploring the impact of learning rate on convergence.

1.3.2 2. Custom Gradient Descent Implementation

In this stage, I will create our custom Gradient Descent algorithm. Key components will include:

Initialization: Setting initial parameter values.
Iterative Updates: Implementing the core update rule.
Convergence Criteria: Determining when to stop the optimization process.

1.3.3 3. Testing and Validation

I will apply our custom Gradient Descent to a simple linear regression problem and compare its performance with established R packages like lm() to verify correctness.

1.4 Expected Outcomes

Upon completion of this project, I anticipate:

A profound understanding of Gradient Descent and its inner workings.
A functional, custom Gradient Descent implementation in R.
The ability to fine-tune hyperparameters and optimize convergence.
A tangible skill set that can be applied to real-world machine learning problems.

1.5 Conclusion

This project is not only an exercise in technical proficiency but also a demonstration of your dedication to mastering the intricacies of fundamental machine learning concepts. By implementing Gradient Descent from scratch, you are poised to impress your peers and establish your expertise in the field of data science. Happy coding!

2 Packages Utilized on this document

I’ll be using the following R packages:

# Data Manipulation
library(data.table)
library(dplyr)

# Data Visualization
library(ggplot2)
library(ggpmisc)

3 Generating Synthetic Data

To lay the foundation for our exploration, I’ll create a synthetic dataset using R’s data.table. This dataset will feature two variables, ‘x’ and ‘y,’ with a clear linear relationship. Our data generation process will mimic a real-world scenario, making ‘x’ a random variable and ‘y’ a linear function of ‘x’ with some added noise.

dataset <- data.table(x = rnorm(10000, 50, 2))
dataset[, y := 100 + (x * 2) + rnorm(10000, 10)]

lm_model <- lm(y~x, dataset)

p <- ggplot(dataset, aes(x, y)) +
  geom_point(color = "steelblue") +
  theme_classic()

# Add the regression line using geom_abline
p + geom_smooth(method = "lm", formula = y ~ x, color = "red") +
  stat_poly_eq(
    formula = y ~ x,
    aes(label = paste(after_stat(eq.label), after_stat(rr.label), sep = "~~~~")),
    parse = TRUE
  )

4 Linear Modelling on Vector Data

On this section, I’ll experiment gradient descent on a very specific use case of 2 parameters: \(\beta_0\) and \(\beta_1\).

x = dataset$x # as x is a vector, I'll use x instead of X.
y = dataset$y

# Initialize random betas
b1 = rnorm(1)
b0 = rnorm(1)

# Prediction is simply:
predict <- function(x, b0, b1){
  return (b0 + b1 * x)
}

# And loss function is:

loss_mse <- function(predictions, y){
  residuals = y - predictions
  return(mean(residuals ^ 2))
}

predictions <- predict(x, b0, b1)

loss = loss_mse(predictions, y)

print(paste0("Loss is: ", round(loss)))

## [1] "Loss is: 32512"

Let’s look at the gradient of this linear model:

The MSE function is defined as:

\[ L = \frac{\sum(y_i - \hat y_i)^2}{n} = \frac{1}{n}\sum(y_i - f(\beta_0, \beta_1))^2 \]

Our prediction is simply put:

\[ f(\beta_0, \beta_1) =\beta_0 + x_{i} \cdot \beta_{1} \]

The partial derivative of the Loss function related to \(\beta_1\) is, by the chain rule:

\[ \frac{\partial L}{\partial \beta_1} = \frac{1}{n}\sum 2(y_i - (x_{i} \cdot \beta_1 + \beta_0)) \cdot-x_i = \frac{-2}{n}\sum x_i(y_i - (x_{i} \cdot \beta_1 + \beta_0)) \]

The partial derivative of the Loss function related to \(\beta_0\) is, by the chain rule:

\[ \frac{\partial L}{\partial \beta_0} = \frac{1}{n}\sum2(y_i - (x_{i} \cdot \beta_1 + \beta_0))\cdot -1 = \frac{-2}{n}\sum(y_i - (x_{i} \cdot \beta_1 + \beta_0)) \]

This is basically all. Now, the R implementation for the gradient is:

gradient <- function(x, y, predictions){
  dinputs = y - predictions
  db1 = -2 * mean(x * dinputs)
  db0 = -2 * mean(dinputs)
  
  return(list("db1" = db1, "db0" = db0))
}

gradients <- gradient(x, y, predictions)
print(gradients)

## $db1
## [1] -18047.99
## 
## $db0
## [1] -360.5727

And it’s time to optimize our results. I’ll implement a pretty straightforward Gradient Descent optimizer below:

Important note: Gradient Descent is best utilized with scaled X, and even more important when using euclidean distance loss functions, such as MSE. Scaling y is not a necessity, but it would converge faster to a global minimum. I won’t delve into the specifics of that, but it’s highly recommended to learn more about it.

# Train the model with scaled features
x_scaled <- (x - mean(x)) / sd(x)

learning_rate = 1e-1

# Record Loss for each epoch:
logs = list()

for (epoch in 1:50){
    
    # Predict all y values:
    predictions = predict(x_scaled, b0, b1)
    loss = loss_mse(predictions, y)
    
    logs = append(logs, loss)
    
    if (epoch %% 10 == 0){
      print(paste0("Epoch: ",epoch, ", Loss: ", round(loss, 5)))
    }
    
    gradients <- gradient(x_scaled, y, predictions)
    db1 = gradients$db1
    db0 = gradients$db0
    
    b1 = b1 - db1 * learning_rate
    b0 = b0 - db0 * learning_rate
}

## [1] "Epoch: 10, Loss: 795.18259"
## [1] "Epoch: 20, Loss: 10.15455"
## [1] "Epoch: 30, Loss: 1.10379"
## [1] "Epoch: 40, Loss: 0.99944"
## [1] "Epoch: 50, Loss: 0.99824"

# I must unscale coefficients to make them comprehensible
b0 =  b0 - (mean(x) / sd(x)) * b1
b1 = b1 / sd(x)

As I scaled X using standard scaler, to interpret the coefficients, I must unscale them. I found out that the equations are pretty much:

\[ w_i = \frac{x_i - mean(x_i)}{std(x_i)} \]

The original scaled equation is:

\[ y = \alpha_0 + \alpha_1 \cdot w_i + \epsilon_i \]

\[ \beta_0 = \alpha_0 - \frac{mean(x)}{sd(x)} \cdot \alpha_1 \]

\[ \beta_1 = \frac{\alpha_1}{sd(x)} \]

So:

\[ y = \beta_0 + \beta_1 \cdot x_i + u_i \]

Let’s see if the Gradient Descent Optimizer is converging to expected values:

print(paste0("Inclination: ", b1, ", Intercept: ", b0))

## [1] "Inclination: 2.00152116791044, Intercept: 109.932747724164"

It seems like it. Let’s see how loss function behaved on the progression of epochs:

logs <- data.table(epoch = 1:50, loss = unlist(logs))

ggplot(logs, aes(epoch, loss)) +
  geom_line(color="steelblue") +
  theme_classic()

And looking at original scaled graph:

# Original graph
p <- ggplot(dataset, aes(x, y)) +
  geom_point(color = "steelblue") +
  theme_classic()

# Add the regression line using geom_abline
p + geom_abline(intercept = b0, slope = b1, color = "red")

5 Matrix Form of MSE for Gradient Descent

Matrix calculation is a powerful approach to perform gradient descent in machine learning and optimization. While it may appear more complex initially, it offers significant advantages:

1. Efficiency: Matrix operations are highly optimized in numerical computing libraries such as NumPy (Python) or matrix algebra in R. These libraries are written in lower-level languages like C or Fortran, resulting in faster execution and improved algorithm efficiency.

2. Simplicity: Matrix notation simplifies the expression of mathematical operations, reducing the complexity of gradient and loss function computations. This simplification makes code more readable and easier to maintain.

3. Parallelism: Modern hardware, including CPUs and GPUs, is designed to handle parallel matrix operations efficiently. Utilizing matrix calculations allows for parallel processing and harnessing the full computational power of these hardware resources.

4. Scalability: Matrix operations enable scalable implementation of gradient descent algorithms. Whether you are working with a few features or high-dimensional data, the same matrix-based code can be applied, making it adaptable to a wide range of problem sizes.

5. Abstraction: Matrix notation abstracts the underlying mathematical operations, allowing for greater flexibility when dealing with different optimization problems. This abstraction simplifies code reuse across various applications.

In summary, matrix calculation is a fundamental tool in gradient descent, offering not only computational advantages but also simplicity, scalability, and abstraction, which make it a preferred choice for implementing and optimizing machine learning algorithms.

5.1 Math behind gradient of the Loss function

Remember I are estimating the gradients of the MSE function, defined as:

\[ L = \frac{1}{n}\sum(y_i - f(w, b))^2 \]

On matrix notation, I can have a vector of parameters, defined as \(\beta\), a matrix of X and a vector of y. Translating MSE into matrix notation is simply:

\[ L = \frac{1}{n}\sum(y_i - \beta^T \cdot X)^2 \]

Note that \(\beta^T \cdot X\) are the model predictions. The partial derivative of the Loss function and the vector of parameters is:

\[ \frac{\partial L}{\partial \beta} = \frac{2}{n}(y_i - \beta^T \cdot X) \cdot -X \]

\[ \frac{\partial L}{\partial \beta} = \frac{2}{n} X^T (X \cdot \beta - y_i) \]

This translated to R code is:

rm(list = setdiff(ls(), "dataset"))

X = cbind(1, as.matrix(dataset$x)) # as X is a Matrix and add a column of a constant.
y = dataset$y

# Initialize random betas
b = as.matrix(c(rnorm(1),rnorm(1)))

# Prediction is simply:
predict <- function(X, b){
  return (X %*% b)
}

# And loss function is:

loss_mse <- function(predictions, y){
  residuals = y - predictions
  return(mean(residuals ^ 2))
}

predictions <- predict(X, b)

loss = loss_mse(predictions, y)

gradient_vector <- function(X, y, predictions){
  N = length(y)
  db = 2 / N * t(X) %*% (predictions - y) 
  return(db)
}

gradients <- gradient_vector(X, y, predictions)
gradients <- paste(gradients, collapse = ' ')
print(paste0("First loss was: ", loss, " and gradients were: ", gradients))

## [1] "First loss was: 43094.3911620034 and gradients were: -415.108305876494 -20780.107297104"

So it all comes to optimize:

# Train the model with scaled features
W <- cbind(1, as.matrix((dataset$x - mean(dataset$x)) / sd(dataset$x)))

# Initialize random betas # Unscaled
a = as.matrix(c(rnorm(1),rnorm(1)))

learning_rate = 1e-1

# Record Loss for each epoch:
logs = c()

for (epoch in 1:100){
    
    # Predict all y values:
    predictions = predict(W, a)
    loss = loss_mse(predictions, y)
    
    logs = append(logs, loss)
    
    if (epoch %% 10 == 0){
      print(paste0("Epoch: ",epoch, ", Loss: ", round(loss, 5)))
    }
    
    db <- gradient_vector(W, y, predictions)
    
    a = a - db * learning_rate
}

## [1] "Epoch: 10, Loss: 801.36449"
## [1] "Epoch: 20, Loss: 10.22582"
## [1] "Epoch: 30, Loss: 1.10461"
## [1] "Epoch: 40, Loss: 0.99945"
## [1] "Epoch: 50, Loss: 0.99824"
## [1] "Epoch: 60, Loss: 0.99823"
## [1] "Epoch: 70, Loss: 0.99823"
## [1] "Epoch: 80, Loss: 0.99823"
## [1] "Epoch: 90, Loss: 0.99823"
## [1] "Epoch: 100, Loss: 0.99823"

Return original parameters scale is just the same idea:

Original equation is:

\[ W_{ij} \cdot \alpha = X_{ij} \cdot \beta \]

Solve for beta:

\[ \beta = [X_{ij}^T \cdot X_{ij}]^{-1} \cdot X_{ij}^T \cdot W_{ij} \cdot \alpha \]

b = solve(t(X) %*% X) %*% t(X) %*% W %*% a
print(b)

##            [,1]
## [1,] 109.934527
## [2,]   2.001546

Which is our solution!

6 Extending Models with Multiple Parameters

In many real-world scenarios, machine learning models are more complex and involve multiple parameters. While working with just two parameters is illustrative, practical applications often require models with numerous variables to capture the underlying complexity of the data. This expansion presents both opportunities and challenges.

6.1 Benefits of Multiple Parameters

Increased Model Capacity: With additional parameters, models have greater capacity to capture complex relationships within the data. This can lead to improved model performance and predictive accuracy.
Feature Representation: Multiple parameters allow for the representation of a wide range of features or factors that contribute to the target variable. This provides a more comprehensive view of the problem.
Flexibility: A model with multiple parameters can adapt to various data patterns, making it versatile and suitable for different tasks.

6.2 Challenges and Considerations

Gradient Descent: Optimization becomes more challenging with higher-dimensional parameter spaces. Convergence can be slower, and it may require fine-tuning of learning rates and regularization techniques.
Data Size: As the number of parameters increases, the model may require a larger dataset to effectively estimate the parameters and avoid overfitting.
Regularization: With more parameters, the risk of overfitting grows. Regularization techniques (e.g., L1 or L2 regularization) are often necessary to prevent excessive model complexity.
Visualization: Visualizing the parameter space is impractical in high dimensions. Interpretability may become more challenging.
Computational Cost: More parameters can significantly increase the computational cost, especially if you’re working with deep learning models and large datasets.

When working with models that have more than two parameters, it’s crucial to carefully manage optimization, regularization, and data considerations. Gradient descent remains a fundamental optimization technique, but be prepared to adapt your approach to handle the added complexity and challenges that arise in high-dimensional parameter spaces.

6.3 Generating Artificial Dataset

Now that I’ll work with more than 2 dimentions, it’ll be impossible to clearly visualize all variables on a singular graph. Let’s just look at the OLS results:

dataset <- data.table(
  x = rnorm(10000, 50, 2),
  w = rnorm(10000, 15, 1),
  z = rnorm(10000, 4, 0.1),
  t = rnorm(10000, 10, 2),
  h = rnorm(10000, 80, 2)
)
dataset[, y := 100 + (x * 2) + (w * -4) + (z * 0.2) + (t * 10) + (h * -3) + rnorm(10000, 10)]

lm_model <- lm(y~x+w+z+t+h, dataset)
print(coefficients(lm_model))

## (Intercept)           x           w           z           t           h 
## 110.4126878   2.0050596  -4.0008427   0.1048177   9.9961378  -3.0030729

OLS got the optimal result quite fast! Let’s see how the Gradient Descent Algoritm will perform.

6.4 Gradient Descent on Multidimentional Data

standard_scaler <- function(x) {
  return((x - mean(x)) / sd(x))
}

X = cbind(1, as.matrix(dataset[,1:5])) # as X is a Matrix and add a column of a constant.
y = dataset$y

# Original X Matrix
X <- cbind(1, as.matrix(dataset[,1:5]))

# Scaled X Matrix
W <- sapply(dataset, standard_scaler)
W <- cbind(1, as.matrix(W))

# Initialize random betas # Unscaled
a = as.matrix(c(rnorm(1), rnorm(1), rnorm(1), rnorm(1), rnorm(1), rnorm(1), rnorm(1)))

learning_rate = 1e-2

# Record Loss for each epoch:
log = c()

# I'll increment epochs and reduce learning rate
for (epoch in 1:250){
    
    # Predict all y values:
    predictions = predict(W, a)
    loss = loss_mse(predictions, y)
    
    log = append(log, loss)
    
    if (epoch %% 10 == 0){
      print(paste0("Epoch: ",epoch, ", Loss: ", round(loss, 5)))
    }
    
    db <- gradient_vector(W, y, predictions)
    
    a = a - db * learning_rate
}

## [1] "Epoch: 10, Loss: 365.77269"
## [1] "Epoch: 20, Loss: 181.13565"
## [1] "Epoch: 30, Loss: 93.03944"
## [1] "Epoch: 40, Loss: 49.80435"
## [1] "Epoch: 50, Loss: 27.84109"
## [1] "Epoch: 60, Loss: 16.23506"
## [1] "Epoch: 70, Loss: 9.84086"
## [1] "Epoch: 80, Loss: 6.17184"
## [1] "Epoch: 90, Loss: 3.98805"
## [1] "Epoch: 100, Loss: 2.64781"
## [1] "Epoch: 110, Loss: 1.80516"
## [1] "Epoch: 120, Loss: 1.26568"
## [1] "Epoch: 130, Loss: 0.91571"
## [1] "Epoch: 140, Loss: 0.68657"
## [1] "Epoch: 150, Loss: 0.53557"
## [1] "Epoch: 160, Loss: 0.43562"
## [1] "Epoch: 170, Loss: 0.36925"
## [1] "Epoch: 180, Loss: 0.32509"
## [1] "Epoch: 190, Loss: 0.29564"
## [1] "Epoch: 200, Loss: 0.27599"
## [1] "Epoch: 210, Loss: 0.26285"
## [1] "Epoch: 220, Loss: 0.25405"
## [1] "Epoch: 230, Loss: 0.24815"
## [1] "Epoch: 240, Loss: 0.24417"
## [1] "Epoch: 250, Loss: 0.24149"

And our scaled parameters are:

b = solve(t(X) %*% X) %*% t(X) %*% W %*% a
print(b)

##           [,1]
##   110.76308032
## x   2.00488811
## w  -3.99965998
## z   0.02560536
## t  10.00112905
## h  -3.00510368

Which is the global minimum!

Let’s see how the optimizer minimized the loss function:

logs <- data.table(epoch = 1:length(log), loss = log)

ggplot(logs, aes(epoch, loss)) +
  geom_line(color="steelblue") +
  theme_classic()

And that’s it. As it seems, all linear models will have a convex loss function, so minimizing those is quite easy!

6.5 Conclusion

In the course of this project, we embarked on a journey to implement a custom Gradient Descent algorithm from scratch in R. We began by delving into the foundational principles of Gradient Descent, dissecting the mathematical intricacies, and understanding its crucial role in machine learning and optimization.

Our objectives were not just to develop a working algorithm but to empower ourselves with the knowledge and skills to comprehend and fine-tune the inner workings of this fundamental optimization method.

As we conclude this project, we have not only acquired a valuable skill set in developing Gradient Descent algorithms but also a broader appreciation for the importance of efficient optimization techniques in the world of data science and machine learning. The journey doesn’t end here; it’s just the beginning of a more profound exploration into the boundless realm of optimization algorithms and their practical applications.

May the knowledge and experiences gained from this project continue to inspire your pursuit of excellence in the exciting field of data science and machine learning. Thank you for joining me on this enlightening voyage.

Keep learning, keep exploring, and keep innovating.