Linear Regression with One Variable

The Hypothesis Function

The hypothesis function in general form: \(h_\theta(x) = \theta_0 + \theta_1 x\)

Example:

Given: the first row of the training set, \(x_1 = 2, y_1 = 10\)

Now we can try out arbitrary values of \(\theta_0\) and \(\theta_1\).

For example, with \(\theta_0=3\) and \(\theta_1=5\) the hypothesis becomes:

\(h_\theta(x) = 3 + 5 x,\)

and for the given \(x_1 = 2\) we get \(h_\theta = 13\), which is greater than \(y_1\) by 3.

The same example in R:

In R the hypothesis function is represented as \(h_\theta(x) = \theta_0 x_0 + \theta_1 x_1\), where \(x_0 = 1\) at all times, so that it can be computed as a matrix multiplication.

theta <- c(3, 5)
x1 <- c(1, 2) # add 1 as x_0
# find h_theta
h <- x1 %*% theta 
print(paste("h =", h[1]))
## [1] "h = 13"
# or we can calculate h for the whole training set at once
x <- matrix(c(rep(1, 3), c(10,20,30)), ncol=2)
x
##      [,1] [,2]
## [1,]    1   10
## [2,]    1   20
## [3,]    1   30
x %*% theta
##      [,1]
## [1,]   53
## [2,]  103
## [3,]  153

Cost Function

This function is also known as the “squared error function” or “mean squared error”.

\(J(\theta_0, \theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right)^2\)

computeCost <- function(X, y, theta) {
    # number of training examples
    m <- length(y)
    # hypothesis values for the whole training set
    predictions <- X %*% theta
    # squared error of each prediction
    sqerrors <- (predictions - y)^2
    # average the squared errors (with the 1/2 factor)
    J <- 1 / (2 * m) * sum(sqerrors)
    J
}

Example:

Given:

\(x_1 = 2, y_1 = 10, m = 1\)

\(\theta_0=3, \theta_1=5\):

\(h_\theta(x) = 3 + 5 x,\)

for given \(x_1 = 2\), our \(h_\theta = 13\).

It is greater than \(y_1\) by 3.

\(J(3, 5) = \dfrac {1}{2 * 1} \displaystyle \sum _{i=1}^1 \left (13 - 10 \right)^2 = \dfrac {9}{2} = 4.5\)

theta
## [1] 3 5
x1
## [1] 1 2
print(paste("J =", computeCost(x1, 10, theta)))
## [1] "J = 4.5"

Gradient Descent

Repeat until convergence:

\(\begin{align*} \lbrace & \newline \theta_0 := & \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}(h_\theta(x_{i}) - y_{i}) \newline \theta_1 := & \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left((h_\theta(x_{i}) - y_{i}) x_{i}\right) \newline \rbrace& \end{align*}\)

Where:

\(m\) – the size of the training set,

\(\theta_0, \theta_1\) – the parameters to update simultaneously,

\(x_i, y_i\) – the items of the training set,

\(\alpha\) – the learning rate (step size).
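For intuition, here is one update step computed by hand for the earlier single-example training set (\(x_1 = 2, y_1 = 10, m = 1\)), starting from \(\theta_0 = 0, \theta_1 = 0\) and assuming \(\alpha = 0.01\) (the same step size used in the examples below). With this \(\theta\) we have \(h_\theta(x_1) = 0\), so:

\(\theta_0 := 0 - 0.01 \cdot \frac{1}{1} (0 - 10) = 0.1\)

\(\theta_1 := 0 - 0.01 \cdot \frac{1}{1} \left((0 - 10) \cdot 2\right) = 0.2\)

Both updates use the prediction made with the old \(\theta\) – that is what updating “simultaneously” means. The implementation below performs both updates at once in vectorized form.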

gradientDescent <- function(X, y, theta, alpha, num_iters) {
    # number of training examples
    m <- length(y)
    # cost after every iteration, to inspect convergence later
    J_history <- rep(0, num_iters)
    for (iter in 1:num_iters) {
        # hypothesis values for the current theta
        predictions <- X %*% theta
        # vectorized form of the two per-parameter sums from the update rule
        updates <- t(X) %*% (predictions - y)
        # simultaneous update of both parameters
        theta <- theta - alpha * (1 / m) * updates
        J_history[iter] <- computeCost(X, y, theta)
    }
    list("theta" = theta, "J_history" = J_history)
}

Example

theta <- c(0, 0)
iterations <- 1500
alpha <- 0.01
X.test <- matrix(c(1, 1, 3, 3), ncol=2)
y.test <- matrix(c(10, 10), ncol=1)
# the answer must be theta = (1, 3):
# h(x) = 1 + 3*x, and for x = 3 that gives 1 + 3*3 = 10
result.test <- gradientDescent(X.test, y.test, theta, alpha, iterations)
result.test$theta
##      [,1]
## [1,]    1
## [2,]    3
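As a quick sanity check (not part of the original listing), the cost at the found theta should be almost zero for this exactly solvable example, and the tail of the cost history should confirm convergence:

# cost at the found theta – close to 0, since theta = (1, 3) fits y.test exactly
computeCost(X.test, y.test, result.test$theta)
# the last few recorded costs show that the algorithm has converged
tail(result.test$J_history, 3)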

How it works

Create demo data

library(ggplot2)

set.seed(37)
x <- rnorm(n=100, mean=5, sd=1)
e <- rnorm(n=100, mean=0, sd=1)
y <- e + 2*x # true relationship: intercept 0, slope 2, plus noise e

# show data
data = data.frame(x=x, y=y)
g <- ggplot(data, aes(x=x, y=y))  + 
    geom_point(alpha=1/3, size=4) +
    geom_smooth(method="lm", se=F, col="steelblue") +
    labs(title = "Linear Regression – Demo data")
g

# Add a column of ones to x
X <- matrix(c(rep(1,length(x)),x), ncol=2)
head(X)
##      [,1]     [,2]
## [1,]    1 5.124754
## [2,]    1 5.382075
## [3,]    1 5.579243
## [4,]    1 4.706252
## [5,]    1 4.171651
## [6,]    1 4.667286

Run Gradient Descent

Now let’s initialize the Gradient Descent parameters and execute the function.

# Initialize 
theta <- c(0, 0)
iterations <- 1500
# a more precise fit would use alpha=0.0002
alpha <- 0.0001 # kept smaller so the difference is visible on the plot


# run gradient descent
result <- gradientDescent(X, y, theta, alpha, iterations);
theta <- result$theta
print("theta found:");print(theta)
## [1] "theta found:"
##           [,1]
## [1,] 0.3725434
## [2,] 1.9043863

Let’s plot the new regression line based on the theta we found.

# data with prediction
data = data.frame(x=x, y=y, test = X%*%theta)


ggplot(data, aes(x=x, y=y, test=test))  + 
    geom_point(alpha=1/3, size=4) +
    stat_smooth(method = "lm", formula = test ~ x, size = 1, se = FALSE,
                aes(color="Gradient Descent")) +
    geom_smooth(method="lm", se=F, aes(color="Training set")) +
    scale_colour_manual(name="Method", values=c("red", "steelblue")) +
    theme(legend.position = "bottom") +
    labs(title = "Gradient Descent – Results")

The history of cost values computed at each iteration is stored in result$J_history.

data <- data.frame(x=seq(1, length(result$J_history)),
                   y=result$J_history)
ggplot(data, aes(x=x, y=y)) +
    geom_line() +
    labs(title="Gradient descent iterations",
         x="Iterations", y="Cost J")

Now we can make predictions:

predict1 <- c(1, 3.5) %*% theta
predict2 <- c(1, 7) %*% theta
  • For x = 3.5, we predict y of 7.038
  • For x = 7, we predict y of 13.703
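The same pattern can be wrapped into a small helper; predict_gd below is a hypothetical convenience function, not part of the original code:

# hypothetical helper: prepend the x_0 = 1 column and multiply by theta
predict_gd <- function(x_new, theta) {
    cbind(1, x_new) %*% theta
}
predict_gd(c(3.5, 7), theta)  # same two predictions as above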

Predict with predict.lm

Run Gradient Descent again with \(\alpha = 0.0002\) (the more precise value) to compare:

theta <- c(0, 0)
iterations <- 1500
alpha <- 0.0002 # set alpha more precisely
result <- gradientDescent(X, y, theta, alpha, iterations);
matrix(c(1, 1, 3.5, 7), ncol=2) %*% result$theta
##           [,1]
## [1,]  7.177233
## [2,] 13.974120

Finally, make predictions with lm:

fit <- lm(y ~ x) # avoid shadowing the lm() function with a variable named lm
newdata <- data.frame(x=c(3.5, 7))
predict(fit, newdata, interval="none") 
##         1         2 
##  7.202689 13.953160
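The estimated coefficients can also be compared directly: coef() returns the intercept and slope found by lm, which should be close to the theta produced by gradient descent (a quick comparison, not part of the original post):

# intercept and slope estimated by lm
coef(fit)
# theta found by gradient descent (alpha = 0.0002, 1500 iterations)
result$theta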