First of all, we define a linear function. Because the slope and intercept (a and b) will stay constant throughout this project, I will not pass them to the function as arguments.
a <- 5
b <- 0.2
linear_function <- function(x){
return(a*x + b)
}
The function curve() evaluates the function we pass to it at a number of points within the specified boundaries and also plots it. We also need to define number_of_samples.
number_of_samples <- 50
samples <- curve(from=0, to=1,
linear_function, n=number_of_samples,
type="p", col="blue3", cex=0.7)
The data we just generated would fit a linear model perfectly, because it was created by the linear function without any noise. To make it a bit closer to what we see in nature, we add an error to each data point along both the x and y axes. These errors are assumed to be normally distributed, therefore we use rnorm() to generate them.
deviation_x <- 0.1
deviation_y <- 0.2
samples$x <- rnorm(mean=samples$x,
sd=deviation_x,
n=number_of_samples)
samples$y <- rnorm(mean=samples$y,
sd=deviation_y,
n=number_of_samples)
plot(samples, col="red2", type="p", cex=0.7)
First we create a linear model by defining a function. As you can see, this model has only two parameters, w0 and w1, which correspond to the slope and intercept of the line in the x-y plane. The function returns the model's value for any given input.
Note that the input x can be a single number or a vector of numbers.
hw <- function(x, w0, w1){
return(w0*x + w1)
}
Then we define the error and cost functions. Each individual point in the sample data has its own error with respect to the model we create. The error and the cost are defined as below:
\[\large E_i=[h_w(x^{(i)}) - y^{(i)}]^2\] \[\large J(w_0,w_1) = \frac{1}{2n}\sum_{i=1}^{n}{E_i}=\frac{1}{2n}\sum_{i=1}^{n}{[h_w(x^{(i)}) - y^{(i)}]^2}\] where:
\[\large h_w(x^{(i)}) = w_0x^{(i)} + w_1\] \[\large n=number\_of\_samples\] and \(y^{(i)}\) is the \(y\) value corresponding to \(x^{(i)}\) in our sample data.
E <- function(w0, w1){
return((hw(samples$x, w0, w1) - samples$y)^2)
}
J <- function(w0, w1){
return((1/(2*number_of_samples)) * sum(E(w0, w1)))
}
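As a quick sanity check (the exact numbers depend on the random noise added above), the cost evaluated at the true generating parameters should be much smaller than at an arbitrary point:
J(a, b)   # cost at the true slope and intercept
J(0, 0)   # cost at an arbitrary point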
Each error can be visualized as a vertical line whose length equals the distance between a sample point and its corresponding value under the model. In other words, each individual error is the vertical distance from the sample point to the model line.
curve(from=-0.5, to=1.5,
hw(x, a, b),
type="l", col="red2", cex=0.7, ylim=c(-2, 8))
points(samples, col="blue2", type="p", cex=0.7)
x_s <- samples$x
y_s <- samples$y
y_m <- hw(x_s, a, b)
segments(x0=x_s, x1=x_s, y0=y_s, y1=y_m, lty = "dotted")
To get a visual sense of how the cost depends on the parameters, we change the model parameters in each frame, plotting the model against the samples on the left and the accumulated cost values on the right.
frames <- 10
cost <- rep(NA, frames)
for (f in 1:frames) {
w0 <- f/2
w1 <- 0
par(mfrow=c(1, 2), mar=c(4, 4, 1, 1))
# left plot
curve(from=-0.5, to=1.5,
hw(x, w0, w1), n=number_of_samples,
type="l", col="red2", cex=0.7, ylim=c(-2, 8))
points(samples, col="blue3", cex=0.7)
legend("topleft",
col=c("red2", "blue3"),
lty=c(1, NA),
pch=c(NA, "o"),
legend=c("model", "samples"))
x_s <- samples$x
y_s <- samples$y
y_m <- hw(x_s, w0, w1)
segments(x0=x_s, x1=x_s, y0=y_s, y1=y_m, lty = "dotted")
cost[f] <- J(w0, w1)
# right plot
plot(type="p", cost, col="red2", cex=0.7, ylim = c(0, 4))
}
To illustrate the cost over the parameter space, we build a grid of (w0, w1) values and evaluate the cost function at each point.
suppressMessages(require(plotly))
w0_space <- seq(from=-10, to=10, length.out=100)
w1_space <- seq(from=-10, to=10, length.out=100)
cost <- matrix(nrow=length(w0_space), ncol=length(w1_space))
for (w0_i in 1:length(w0_space)) {
for (w1_i in 1:length(w1_space)) {
cost[w0_i, w1_i] = (J(w0_space[w0_i], w1_space[w1_i]))
}
}
# for a 2D heatmap the axis titles go directly in layout(), not inside a scene list
plotly::plot_ly(z=cost, type="heatmap") %>%
plotly::layout(xaxis = list(title = "w0"),
yaxis = list(title = "w1"))
plotly::plot_ly(z=cost, type="surface") %>%
plotly::layout(scene = list(
xaxis = list(title = "w0"),
yaxis = list(title = "w1"),
zaxis = list(title = "cost")))
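Since the minimum of this grid approximates the optimum, we can also read it off directly. This is only a sketch; min_idx is an illustrative name and a single minimum on the grid is assumed:
# index of the grid cell with the lowest cost (rows index w0, columns index w1)
min_idx <- which(cost == min(cost), arr.ind = TRUE)
c(w0 = w0_space[min_idx[1]], w1 = w1_space[min_idx[2]])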
Now that everything is ready, we can optimize the parameters of the model by minimizing the cost in order to fit the model to our data. To do this we use two different methods: Gradient Descent and a Genetic Algorithm.
In this method we need to calculate the gradient of the cost function with respect to its parameters, w0 and w1. To minimize the cost, we walk in the opposite direction of the gradient, so the parameters are updated as follows: \[\large w:=w-\alpha\triangledown J(w)\] where \(\alpha\) is the learning rate. \[\large \triangledown J(w) = \frac{\partial}{\partial w}J(w)\] \[\large =\frac{\partial}{\partial w}\frac{1}{2n}\sum_{i=1}^{n}{[h_w(x^{(i)}) - y^{(i)}]^2}\] \[\large =\frac{1}{2n}\sum_{i=1}^{n}\frac{\partial}{\partial w}{[h_w(x^{(i)}) - y^{(i)}]^2}\] \[\large =\frac{1}{n}\sum_{i=1}^{n}\left[\frac{\partial}{\partial w}(h_w(x^{(i)}) - y^{(i)})\right][h_w(x^{(i)}) - y^{(i)}]\] \[\large =\frac{1}{n}\sum_{i=1}^{n}\left[\frac{\partial}{\partial w}h_w(x^{(i)})\right][h_w(x^{(i)}) - y^{(i)}]\] Thus:
\[\large \frac{\partial}{\partial w_0}J(w)=\frac{1}{n}\sum_{i=1}^{n}\left[\frac{\partial}{\partial w_0}(w_0 x^{(i)} + w_1)\right][h_w(x^{(i)}) - y^{(i)}]\] \[\large =\frac{1}{n}\sum_{i=1}^{n}[x^{(i)} + 0][h_w(x^{(i)}) - y^{(i)}]\] \[\large =\frac{1}{n}\sum_{i=1}^{n}x^{(i)}[h_w(x^{(i)}) - y^{(i)}]\] and \[\large \frac{\partial}{\partial w_1}J(w)=\frac{1}{n}\sum_{i=1}^{n}\left[\frac{\partial}{\partial w_1}(w_0 x^{(i)} + w_1)\right][h_w(x^{(i)}) - y^{(i)}]\] \[\large =\frac{1}{n}\sum_{i=1}^{n}[0 + 1][h_w(x^{(i)}) - y^{(i)}]\] \[\large =\frac{1}{n}\sum_{i=1}^{n}[h_w(x^{(i)}) - y^{(i)}]\] Therefore: \[\large w_0:=w_0-\alpha\frac{1}{n}\sum_{i=1}^{n}x^{(i)}[h_w(x^{(i)}) - y^{(i)}]\] and \[\large w_1:=w_1-\alpha\frac{1}{n}\sum_{i=1}^{n}[h_w(x^{(i)}) - y^{(i)}]\]
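These update rules translate directly into R. The following is only a sketch, equivalent to the inline computations in the loop below; the helper names grad_J_w0 and grad_J_w1 are purely illustrative:
grad_J_w0 <- function(w0, w1){
# (1/n) * sum( x_i * (h_w(x_i) - y_i) )
return(sum((hw(samples$x, w0, w1) - samples$y) * samples$x) / number_of_samples)
}
grad_J_w1 <- function(w0, w1){
# (1/n) * sum( h_w(x_i) - y_i )
return(sum(hw(samples$x, w0, w1) - samples$y) / number_of_samples)
}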
alpha <- .2
stop_error <- 0.000001
steps <- 5000
w0 <- -8
w1 <- -8
costs <- c()
w0s <- c()
w1s <- c()
grad_w0s <- c()
grad_w1s <- c()
for(step in 1:steps){
grad_w0 <- sum((hw(samples$x, w0, w1)-samples$y)*samples$x)/number_of_samples
w0p <- w0 - alpha*grad_w0
grad_w1 <- sum(hw(samples$x, w0, w1)-samples$y)/number_of_samples
w1p <- w1 - alpha*grad_w1
w0 <- w0p
w1 <- w1p
w0s <- c(w0s, w0)
w1s <- c(w1s, w1)
grad_w0s <- c(grad_w0s, grad_w0)
grad_w1s <- c(grad_w1s, grad_w1)
costs <- c(costs, J(w0, w1))
if(step > 1){
# if the cost increased, the step overshot: restore the previous parameters and stop
if(costs[step-1] < costs[step]){
w0 <- w0s[step-1]
w1 <- w1s[step-1]
break
}
# stop once the improvement per step becomes negligible
if(costs[step-1] - costs[step] < stop_error)
break
}
}
curve(from=-0.5, to=1.5,
hw(x, w0, w1),
type="l", col="red2", cex=0.7)
points(samples)
The final values are w0 = 4.7007863 and w1 = 0.2948093, with a corresponding cost of 0.1273341.
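As a cross-check, the same least-squares problem has a closed-form solution via the normal equations. A minimal sketch, assuming the illustrative names X and w_closed:
X <- cbind(samples$x, 1)                          # design matrix: slope column and intercept column
w_closed <- solve(t(X) %*% X, t(X) %*% samples$y)
w_closed                                          # should be close to the (w0, w1) found by gradient descent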
plotly::plot_ly(data.frame()) %>%
plotly::add_trace(x=w0_space, y=w1_space, z=cost, type="surface") %>%
plotly::add_trace(x=w1s, y=w0s, z=costs+5, type='scatter3d', mode='lines+markers',
opacity = 1, marker = list(size=2, color="Set3"))
In this section we attempt to solve the same problem with a heuristic method called the Genetic Algorithm. The first step is to define a fitness function.
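Since ga() from the GA package maximizes the fitness while we want to minimize the cost, the fitness is simply the negative of J. Written as a standalone function it would look like the sketch below (in the call further down it is passed inline as an anonymous function; fitness_fn is only an illustrative name):
fitness_fn <- function(w){
# ga() maximizes, so return the negative cost
return(-J(w[1], w[2]))
}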
suppressMessages(require(GA))
require(plotly)
mon <- function(obj){
#if (obj@iter == 1) {
contour(x=w0_space, y=w1_space, z=cost, nlevels = 100, drawlabels = F,
ylab="w1", xlab="w0")
#}
grad <- obj@iter / obj@maxiter
points(x=obj@population[,1], y=obj@population[,2], cex=0.5, col=rgb(grad, 0, 1-grad, 1), pch=16)
Sys.sleep(0.5)
if (obj@iter == obj@maxiter) {
best <- as.numeric(obj@bestSol[obj@maxiter][[1]])
points(x=best[1], y=best[2], col="black", cex=2, pch=8)
}
}
result <- ga(
type = "real-valued",
lower = c(-10, -10),
upper = c(10, 10),
popSize = 50,
maxiter = 10,
pmutation = 0.1,
pcrossover = 0.9,
elitism = 10,
fitness = function(x)(-J(x[1], x[2])),
monitor = mon,
keepBest = T
)
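The optimized parameters can also be read from the returned object; in the GA package, result@solution holds the best parameter set found:
result@solution                                   # best (w0, w1) found by the GA
J(result@solution[1], result@solution[2])         # corresponding cost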
The final values are w0 = 4.6110536 and w1 = 0.4134887, with a corresponding cost of 0.1304028.
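Finally, as a reference, we fit the same data with R's built-in lm() and compare its coefficients with the values found by gradient descent and the genetic algorithm.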
summary(lm(samples$y~samples$x))
##
## Call:
## lm(formula = samples$y ~ samples$x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.29978 -0.33567 -0.00838 0.31819 1.02289
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2948 0.1406 2.096 0.0414 *
## samples$x 4.7008 0.2384 19.722 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5151 on 48 degrees of freedom
## Multiple R-squared: 0.8901, Adjusted R-squared: 0.8879
## F-statistic: 388.9 on 1 and 48 DF, p-value: < 2.2e-16