library(pryr)
library(ggplot2)
library(dplyr)
Mini-batch stochastic gradient descent.
The direction of steepest descent points at the minimum only when the level curves are circles; for an elongated ellipse it does not. The same issue arises for non-linear multi-layer nets. On top of that, computing the gradient over the whole dataset at every step is wasteful.
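As a minimal sketch of the mini-batch idea (least-squares regression on made-up data; lr, batch_size, and the number of steps are arbitrary illustration choices), each update estimates the gradient from a small random subset instead of the full dataset:
set.seed(1)
n <- 1000
X <- cbind(1, rnorm(n)) # design matrix with intercept
y <- as.vector(X %*% c(2, -3) + rnorm(n, sd = 0.1))
lr <- 0.05 # learning rate
batch_size <- 32
beta <- c(0, 0)
for (step in 1:500) {
  batch <- sample(n, batch_size) # draw a random mini-batch
  Xb <- X[batch, , drop = FALSE]
  resid <- as.vector(Xb %*% beta) - y[batch]
  beta <- beta - lr * 2 * as.vector(crossprod(Xb, resid))/batch_size # gradient on the batch only
}
beta # should end up close to c(2, -3)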
Here we take a simple function and run a few steps of gradient descent on it.
f1 <- function(x, y) x^2 + y^2
f1_h <- f(x, f1(x[1], x[2])) # helper, since numDeriv wants a one-input fn
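If you would rather avoid the pryr dependency, a plain anonymous function does the same job:
f1_h <- function(x) f1(x[1], x[2]) # base-R equivalent of the pryr::f() one-liner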
Compute the gradient.
library(numDeriv)
grad(func = f1_h, x = c(1,2))
## [1] 2 4
For this simple function you could just as well use a crude forward-difference approximation:
eps <- 0.0001
grad_x <- (f1(1 + eps,2) - f1(1, 2))/eps
grad_y <- (f1(1, 2 + eps) - f1(1, 2))/eps
c(grad_x, grad_y)
## [1] 2.0001 4.0001
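A central difference is more accurate for the same eps (its error is O(eps^2) rather than O(eps)); for a quadratic like f1 it is even exact up to rounding:
grad_x <- (f1(1 + eps, 2) - f1(1 - eps, 2))/(2 * eps)
grad_y <- (f1(1, 2 + eps) - f1(1, 2 - eps))/(2 * eps)
c(grad_x, grad_y) # essentially c(2, 4)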
Now we run a number of descent steps and store the intermediate points.
step_size <- 0.1 # learning rate; a new name, so eps keeps meaning the finite-difference step above
N <- 20
vals <- vector("list", length = N)
vals[[1]] <- c(1, 2)
grads <- vector("list", length = N - 1) # 'grads', not 'grad', to avoid shadowing numDeriv::grad
for (idx in seq_len(N - 1)) {
  grads[[idx]] <- grad(func = f1_h, x = vals[[idx]])
  vals[[idx + 1]] <- vals[[idx]] - step_size * grads[[idx]]
}
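For this particular f1 there is a closed form to check against: the exact gradient at (x, y) is (2x, 2y), so each update multiplies the point by 1 - 2 * step_size = 0.8.
vals[[N]]
0.8^(N - 1) * c(1, 2) # closed form; should match up to numDeriv's tiny numerical error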
Here is a plot of the descent (note that the arrows are not orthogonal to the level lines because the x and y axes are not scaled identically).
dat <- setNames(as.data.frame(do.call(rbind, vals)), c('x', 'y'))
dat_contour <-
expand.grid(x = seq(0, 2, by = 0.2), y = seq(0, 2, by = 0.2)) %>%
mutate(z = f1(x,y))
dat_steps <-
cbind(dat[1:(N - 1),], rename(dat[2:N,], xend = x, yend = y))
ggplot(data = dat, aes_(x = ~x, y = ~y)) +
  geom_contour(data = dat_contour, aes_(x = ~x, y = ~y, z = ~z)) +
  geom_curve(data = dat_steps, aes_(x = ~x, y = ~y, xend = ~xend, yend = ~yend),
             arrow = arrow(length = unit(0.03, "npc")),
             curvature = 0, color = "grey") +
  geom_point()
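To actually make the arrows orthogonal to the level lines, force a 1:1 aspect ratio (assuming the plot above was the last one drawn):
last_plot() + coord_equal() # same plot with identically scaled axes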
A common remedy is to initialise each unit's incoming weights with magnitude proportional to 1/sqrt(fan-in). The goal is a circular error surface (or a good approximation of one), since then the gradient points towards the minimum.
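A minimal sketch of that rule (my own illustration, not tied to any framework): sample each weight with standard deviation 1/sqrt(fan-in), so a unit's summed input has roughly unit variance regardless of how many inputs feed into it.
init_weights <- function(fan_in, fan_out) {
  matrix(rnorm(fan_in * fan_out, sd = 1/sqrt(fan_in)), nrow = fan_in)
}
W <- init_weights(256, 64) # hypothetical layer: 256 inputs, 64 units
x <- rnorm(256) # a unit-variance input vector
sd(as.vector(crossprod(W, x))) # roughly 1: each unit's total input has about unit variance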