Function definition

First of all, we define a linear function. Because the slope and intercept (a and b) will stay constant throughout this project, I will not pass them to the function as arguments.

a <- 5    # slope
b <- 0.2  # intercept

linear_function <- function(x){
  return(a*x + b)
}
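
As a quick check, calling the function with one value or a vector gives the expected outputs (these numbers simply follow from a = 5 and b = 0.2):

linear_function(0.5)      # 5 * 0.5 + 0.2 = 2.7
linear_function(c(0, 1))  # 0.2 and 5.2, since the arithmetic is vectorized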

Sample creation

The function curve() samples the function we pass to it within the specified boundaries, plots it, and returns the sampled x and y values. We also need to define number_of_samples.

number_of_samples <- 50
samples <- curve(from=0, to=1, 
                 linear_function, n=number_of_samples, 
                 type="p", col="blue3", cex=0.7)

The data we just generated would fit a linear model perfectly, because it was created by the linear function without any errors. To make it a bit closer to what we find in nature, we add an error to each data point along both the x and y axes. These errors are assumed to be normally distributed, therefore we use rnorm() to generate them.

deviation_x <- 0.1
deviation_y <- 0.2

# add normally distributed noise to both the x and y coordinates of each sample
samples$x <- rnorm(mean=samples$x, 
                   sd=deviation_x, 
                   n=number_of_samples)

samples$y <- rnorm(mean=samples$y, 
                   sd=deviation_y, 
                   n=number_of_samples)

plot(samples, col="red2", type="p", cex=0.7)

Linear regression

First we create a linear model by defining a function. As you can see, this model has only two parameters, w0 and w1, which correspond to the slope and intercept of the line in the x-y plane. This function returns the model value for any given input.
Note that the input x can be either a single number or a vector of numbers.

hw <- function(x, w0, w1){
  return(w0*x + w1)
}
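
For instance, passing a vector of inputs returns the model value for every element, which is exactly what we rely on later when evaluating the cost over all samples (the parameter values here are just an illustrative choice):

hw(c(0, 0.5, 1), w0=2, w1=1)  # returns 1, 2, 3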

Then we define the error and cost functions. Each individual point in the sample data has its own error with respect to the model we create. These errors and the resulting cost are defined as below:
\[\large E_i=[h_w(x^{(i)}) - y^{(i)}]^2\] \[\large J(w_0,w_1) = \frac{1}{2n}\sum_{i=1}^{n}{E_i}=\frac{1}{2n}\sum_{i=1}^{n}{[h_w(x^{(i)}) - y^{(i)}]^2}\] where:
\[\large h_w(x^{(i)}) = w_0x^{(i)} + w_1\] \[\large n=number\_of\_samples\] and \(y^{(i)}\) is the \(y\) value corresponding to \(x^{(i)}\) in our sample data.

E <- function(w0, w1){
  return((hw(samples$x, w0, w1) - samples$y)^2)
}

J <- function(w0, w1){
  return((1/(2*number_of_samples)) * sum(E(w0, w1)))
}
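
As a sanity check (purely illustrative, since the exact numbers depend on the random noise), the cost evaluated at the true parameters (a, b) should be much smaller than at an arbitrary guess:

J(a, b)  # small, but not zero, because of the noise we added
J(0, 0)  # considerably larger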

Visualization

Errors can be drawn as vertical segments whose lengths equal the distance between each sample and the value the model assigns to it. In other words, each individual error is the vertical distance from a sample point to the model line.

curve(from=-0.5, to=1.5,
      hw(x, a, b), 
      type="l", col="red2", cex=0.7, ylim=c(-2, 8))
points(samples, col="blue2", type="p", cex=0.7)
x_s <- samples$x
y_s <- samples$y
y_m <- hw(x_s, a, b)
# dotted vertical segments show the error of each sample relative to the model line
segments(x0=x_s, x1=x_s, y0=y_s, y1=y_m, lty = "dotted")

To get a visual sense of how the parameters affect the fit and the cost, we change the parameters of the model in each frame.

frames <- 10
cost <- rep(NA, frames)
for (f in 1:frames) {
  w0 <- f/2
  w1 <- 0
  par(mfrow=c(1, 2), mar=c(4, 4, 1, 1))
  # left plot
  curve(from=-0.5, to=1.5,
        hw(x, w0, w1), n=number_of_samples,
        type="l", col="red2", cex=0.7, ylim=c(-2, 8))
  points(samples, col="blue3", cex=0.7)
  legend("topleft", 
         col=c("red2", "blue3"), 
         lty=c(1, NA), 
         pch=c(NA, "o"), 
         legend=c("model", "samples"))
  x_s <- samples$x
  y_s <- samples$y
  y_m <- hw(x_s, w0, w1)
  segments(x0=x_s, x1=x_s, y0=y_s, y1=y_m, lty = "dotted")
  cost[f] <- J(w0, w1)
  
  # right plot
  plot(type="p", cost, col="red2", cex=0.7, ylim = c(0, 4))
}

To illustrate the cost over the parameter space, we can build a grid and evaluate the cost function at each grid point.

suppressMessages(require(plotly))
w0_space <- seq(from=-10, to=10, length.out=100)
w1_space <- seq(from=-10, to=10, length.out=100)
cost <- matrix(nrow=length(w0_space), ncol=length(w1_space))
# evaluate the cost at every point of the (w0, w1) grid
for (w0_i in 1:length(w0_space)) {
  for (w1_i in 1:length(w1_space)) {
    cost[w0_i, w1_i] <- J(w0_space[w0_i], w1_space[w1_i])
  }
}
plotly::plot_ly(z=cost, type="heatmap") %>% 
  plotly::layout(scene = list(
    xaxis = list(title = "w0"),
    yaxis = list(title = "w1")))
plotly::plot_ly(z=cost, type="surface") %>% 
  plotly::layout(scene = list(
    xaxis = list(title = "w0"),
    yaxis = list(title = "w1"),
    zaxis = list(title = "cost")))
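
As an aside, the same grid can be filled without the explicit double loop by combining outer() with Vectorize(); this is only an equivalent alternative to the loop above:

# element [i, j] is J(w0_space[i], w1_space[j]), exactly as in the nested loop
cost <- outer(w0_space, w1_space, Vectorize(J))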

Optimization

Now that everything is ready, we can optimize the parameters of the model by minimizing the cost in order to fit the model to our data. To do this, we use two different methods: gradient descent and a genetic algorithm.

Gradient Descent

In this method we need to calculate the gradient of the cost function with respect to its parameters, w0 and w1. To minimize the cost, we have to walk in the opposite direction of the gradient. Thus, we update the parameters as follows: \[\large w:=w-\alpha\triangledown J(w)\] where \(\alpha\) is the learning rate and \(:=\) denotes assigning the updated value to \(w\). \[\large \triangledown J(w) = \frac{\partial}{\partial w}J(w)\] \[\large =\frac{\partial}{\partial w}\frac{1}{2n}\sum_{i=1}^{n}{[h_w(x^{(i)}) - y^{(i)}]^2}\] \[\large =\frac{1}{2n}\sum_{i=1}^{n}\frac{\partial}{\partial w}{[h_w(x^{(i)}) - y^{(i)}]^2}\] \[\large =\frac{1}{n}\sum_{i=1}^{n}[\frac{\partial}{\partial w}(h_w(x^{(i)}) - y^{(i)})][h_w(x^{(i)}) - y^{(i)}]\] \[\large =\frac{1}{n}\sum_{i=1}^{n}[\frac{\partial}{\partial w}h_w(x^{(i)}) - 0][h_w(x^{(i)}) - y^{(i)}]\] Thus:
\[\large \frac{\partial}{\partial w_0}J(w)=\frac{1}{n}\sum_{i=1}^{n}[\frac{\partial}{\partial w_0}(w_0x^{(i)} + w_1)][h_w(x^{(i)}) - y^{(i)}]\] \[\large =\frac{1}{n}\sum_{i=1}^{n}[x^{(i)} + 0][h_w(x^{(i)}) - y^{(i)}]\] \[\large =\frac{1}{n}\sum_{i=1}^{n}x^{(i)}[h_w(x^{(i)}) - y^{(i)}]\] and \[\large \frac{\partial}{\partial w_1}J(w)=\frac{1}{n}\sum_{i=1}^{n}[\frac{\partial}{\partial w_1}(w_0x^{(i)} + w_1)][h_w(x^{(i)}) - y^{(i)}]\] \[\large =\frac{1}{n}\sum_{i=1}^{n}[0 + 1][h_w(x^{(i)}) - y^{(i)}]\] \[\large =\frac{1}{n}\sum_{i=1}^{n}[h_w(x^{(i)}) - y^{(i)}]\] Therefore: \[\large w_0:=w_0-\alpha\frac{1}{n}\sum_{i=1}^{n}x^{(i)}[h_w(x^{(i)}) - y^{(i)}]\] and \[\large w_1:=w_1-\alpha\frac{1}{n}\sum_{i=1}^{n}[h_w(x^{(i)}) - y^{(i)}]\]
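
Before writing the optimization loop, these two expressions can be coded as small helper functions for checking; the names grad_J_w0 and grad_J_w1 are used only here for illustration, and the loop below computes the same quantities inline:

grad_J_w0 <- function(w0, w1){
  # (1/n) * sum of x_i * (h_w(x_i) - y_i)
  sum((hw(samples$x, w0, w1) - samples$y) * samples$x) / number_of_samples
}
grad_J_w1 <- function(w0, w1){
  # (1/n) * sum of (h_w(x_i) - y_i)
  sum(hw(samples$x, w0, w1) - samples$y) / number_of_samples
}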

alpha      <- .2
stop_error <- 0.000001
steps      <- 5000
w0         <- -8
w1         <- -8
costs      <- c()
w0s        <- c()
w1s        <- c()
grad_w0s   <- c()
grad_w1s   <- c()
for(step in 1:steps){
  # gradient of J with respect to w0, and the proposed update of w0
  grad_w0 <- sum((hw(samples$x, w0, w1)-samples$y)*samples$x)/number_of_samples
  w0p <- w0 - alpha*grad_w0
  
  # gradient of J with respect to w1, and the proposed update of w1
  grad_w1 <- sum(hw(samples$x, w0, w1)-samples$y)/number_of_samples
  w1p <- w1 - alpha*grad_w1
  
  w0 <- w0p
  w1 <- w1p
  
  w0s <- c(w0s, w0)
  w1s <- c(w1s, w1)
  
  grad_w0s <- c(grad_w0s, grad_w0)
  grad_w1s <- c(grad_w1s, grad_w1)
  
  costs <- c(costs, J(w0, w1))
  # stop early if the cost increased (overshoot): revert to the previous parameters
  if(step > 1)
    if(costs[step-1] < costs[step]){
      w0 <- w0s[step-1]
      w1 <- w1s[step-1]
      break
    }
}

curve(from=-0.5, to=1.5,
      hw(x, w0, w1), 
      type="l", col="red2", cex=0.7)
points(samples)

The final value for w0 is 4.7007863, w1 is 0.2948093 and the corresponding cost is 0.1273341.

plotly::plot_ly(data.frame()) %>% 
  plotly::add_trace(x=w0_space, y=w1_space, z=cost, type="surface") %>%
  plotly::add_trace(x=w1s, y=w0s, z=costs+5, type='scatter3d', mode='lines+markers',
                   opacity = 1, marker = list(size=2, color="Set3"))

Genetic Algorithm

In this section we attempt to solve the problem with a heuristic method called a genetic algorithm. The first step is to define a fitness function.
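
Note that ga() maximizes its fitness function while we want to minimize the cost, so the fitness is simply the negative of J. Written out as a standalone function (the name fitness_fn is only illustrative; the call below passes an equivalent anonymous function):

fitness_fn <- function(w){
  # w is a vector c(w0, w1); higher fitness means lower cost
  -J(w[1], w[2])
}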

suppressMessages(require(GA))
require(plotly)

# Monitor function: redraws the cost contours and overlays the current population,
# coloured from blue (early generations) to red (late generations).
mon <- function(obj){
  #if (obj@iter == 1) {
    contour(x=w0_space, y=w1_space, z=cost, nlevels = 100, drawlabels = F, 
            ylab="w1", xlab="w0")
  #}
  grad <- obj@iter / obj@maxiter
  points(x=obj@population[,1], y=obj@population[,2], cex=0.5, col=rgb(grad, 0, 1-grad, 1), pch=16)
  Sys.sleep(0.5)
  if (obj@iter == obj@maxiter) {
    best <- as.numeric(obj@bestSol[obj@maxiter][[1]])
    points(x=best[1], y=best[2], col="black", cex=2, pch=8)
  }
}
result <- ga(
  type = "real-valued",
  lower = c(-10, -10),
  upper = c(10, 10),
  popSize = 50,
  maxiter = 10,
  pmutation = 0.1,
  pcrossover = 0.9,
  elitism = 10,
  fitness = function(x)(-J(x[1], x[2])),
  monitor = mon,
  keepBest = T
)

The final value for w0 is 4.6110536, w1 is 0.4134887 and the corresponding cost is 0.1304028.
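
These values can also be read directly from the fitted object: in the GA package, the solution and fitnessValue slots hold the best parameters and their fitness, so for example:

result@solution        # best (w0, w1) found by the GA
-result@fitnessValue   # corresponding cost, since the fitness was defined as -J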

The lm() function of base R!

summary(lm(samples$y~samples$x))
## 
## Call:
## lm(formula = samples$y ~ samples$x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.29978 -0.33567 -0.00838  0.31819  1.02289 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.2948     0.1406   2.096   0.0414 *  
## samples$x     4.7008     0.2384  19.722   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5151 on 48 degrees of freedom
## Multiple R-squared:  0.8901, Adjusted R-squared:  0.8879 
## F-statistic: 388.9 on 1 and 48 DF,  p-value: < 2.2e-16
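
To compare these estimates directly with the values found by gradient descent and the genetic algorithm, the coefficients can be extracted with coef(); the intercept corresponds to w1 and the slope on samples$x to w0:

fit <- lm(samples$y ~ samples$x)
coef(fit)  # (Intercept) ~ w1, samples$x ~ w0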