class: title-slide

.row[
.col-7[
.title[
# Gradient Descent (One variable)
]
.subtitle[
## Gradient Descent for linear regression with one variable
]
.author[
### Laxmikant Soni <br> [Web-Site](https://laxmikants.github.io) <br> [<i class="fab fa-github"></i>](https://github.com/laxmiaknts) [<i class="fab fa-twitter"></i>](https://twitter.com/laxmikantsoni09)
]
.affiliation[
]
]
.col-5[
.logo[
<!-- -->
]
]
]

---
class: inverse, center, middle

# Linear Regression with One Variable

---
class: very-large-body

.pull-top[
## Hypothesis function

The hypothesis function in its general form:

`\(h_\theta(x) = \theta_0 + \theta_1 x\)`

Example: take the first row of the training set, `\(x_1 = 2, y_1 = 10\)`.

We can now try different values of `\(\theta_0\)` and `\(\theta_1\)`. For `\(\theta_0 = 3\)` and `\(\theta_1 = 5\)` the hypothesis becomes `\(h_\theta(x) = 3 + 5x\)`, so for the given `\(x_1 = 2\)` we get `\(h_\theta = 13\)`, which is greater than `\(y_1\)` by 3.
]

---

# Some example with R

.pull-left[
The hypothesis function is represented as `\(h_\theta(x) = \theta_0 x_0 + \theta_1 x_1\)`, where `\(x_0 = 1\)` at all times so that matrix multiplication can be used.

```r
theta <- c(3, 5)
x1 <- c(1, 2)  # add 1 as x_0

# find h_theta
h <- x1 %*% theta
print(paste("h =", h[1]))
```

```
## [1] "h = 13"
```

```r
# or we can calculate h for the whole training set at once
x <- matrix(c(rep(1, 3), c(10, 20, 30)), ncol = 2)
x
```

```
##      [,1] [,2]
## [1,]    1   10
## [2,]    1   20
## [3,]    1   30
```

```r
x %*% theta
```

```
##      [,1]
## [1,]   53
## [2,]  103
## [3,]  153
```
]

---

# Cost Function

.pull-left[
This function is also known as the "squared error function" or "mean squared error".

`\(J(\theta_0, \theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right)^2\)`
]

--

.pull-right[
```r
computeCost <- function(X, y, theta) {
  # number of training examples
  m <- length(y)

  predictions <- X %*% theta
  sqerrors <- (predictions - y)^2

  # cost to return
  J <- 1 / (2 * m) * sum(sqerrors)
  J
}
```
]

---

# Example

.pull-left[
Given `\(x_1 = 2, y_1 = 10, m = 1\)` and `\(\theta_0 = 3, \theta_1 = 5\)`:

`\(h_\theta(x) = 3 + 5x\)`, so for the given `\(x_1 = 2\)` our `\(h_\theta = 13\)`, which is greater than `\(y_1\)` by 3.

`\(J(3, 5) = \dfrac {1}{2 \cdot 1} \displaystyle \sum _{i=1}^1 \left (13 - 10 \right)^2 = \dfrac {9}{2} = 4.5\)`

```r
theta
```

```
## [1] 3 5
```

```r
x1
```

```
## [1] 1 2
```

```r
print(paste("J =", computeCost(x1, 10, theta)))
```

```
## [1] "J = 4.5"
```
]

---

# Gradient Descent

.pull-left[
Repeat until convergence:

`\(\begin{align*} \lbrace & \newline \theta_0 := & \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}(h_\theta(x_{i}) - y_{i}) \newline \theta_1 := & \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left((h_\theta(x_{i}) - y_{i}) x_{i}\right) \newline \rbrace& \end{align*}\)`
]

Where: m – size of the training set, `\(\theta_0, \theta_1\)` – parameters to update simultaneously, `\(x_i, y_i\)` – items of the training set, α – the learning rate (step size).
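These updates are simply `\(\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)\)`: differentiating the cost function gives `\(\frac{\partial}{\partial \theta_0} J = \frac{1}{m} \sum_{i=1}^{m}(h_\theta(x_{i}) - y_{i})\)` and `\(\frac{\partial}{\partial \theta_1} J = \frac{1}{m} \sum_{i=1}^{m}(h_\theta(x_{i}) - y_{i})\, x_{i}\)`, so each step moves the parameters a small distance downhill on `\(J\)`.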
--

.pull-right[
```r
gradientDescent <- function(X, y, theta, alpha, num_iters) {
  m <- length(y)
  J_history <- rep(0, num_iters)

  for (iter in 1:num_iters) {
    predictions <- X %*% theta
    updates <- t(X) %*% (predictions - y)
    theta <- theta - alpha * (1 / m) * updates
    J_history[iter] <- computeCost(X, y, theta)
  }

  list("theta" = theta, "J_history" = J_history)
}
```
]

---

# Example

.pull-left[
```r
theta <- c(0, 0)
iterations <- 1500
alpha <- 0.01

X.test <- matrix(c(1, 1, 3, 3), ncol = 2)
y.test <- matrix(c(10, 10), ncol = 1)

# answer must be 1, 3
# h(1, 3) = 1 + 3*x = 1 + 3*3 = 10
result.test <- gradientDescent(X.test, y.test, theta, alpha, iterations)
result.test$theta
```

```
##      [,1]
## [1,]    1
## [2,]    3
```
]

---

# Demo

.pull-left[
Create demo data

```
## `geom_smooth()` using formula 'y ~ x'
```

<!-- -->
]

--

.pull-right[
```r
gradientDescent <- function(X, y, theta, alpha, num_iters) {
  m <- length(y)
  J_history <- rep(0, num_iters)

  for (iter in 1:num_iters) {
    predictions <- X %*% theta
    updates <- t(X) %*% (predictions - y)
    theta <- theta - alpha * (1 / m) * updates
    J_history[iter] <- computeCost(X, y, theta)
  }

  list("theta" = theta, "J_history" = J_history)
}
```
]

---

# Demo (Cont..)

.pull-left[
Add a column of ones to x

```r
X <- matrix(c(rep(1, length(x)), x), ncol = 2)
head(X)
```

```
##      [,1]     [,2]
## [1,]    1 5.124754
## [2,]    1 5.382075
## [3,]    1 5.579243
## [4,]    1 4.706252
## [5,]    1 4.171651
## [6,]    1 4.667286
```
]

--

.pull-right[
Run Gradient Descent

```r
# Initialize
theta <- c(0, 0)
iterations <- 1500

# to be precise pick alpha = 0.0002
alpha <- 0.0001  # for a visible difference on the plot

# run gradient descent
result <- gradientDescent(X, y, theta, alpha, iterations)
theta <- result$theta
print("theta found:"); print(theta)
```

```
## [1] "theta found:"
```

```
##           [,1]
## [1,] 0.3725434
## [2,] 1.9043863
```
]

---

# Demo (Cont..)

.pull-left[
Let's plot the new line based on the theta we found.

```r
data <- data.frame(x = x, y = y, fit = as.vector(X %*% theta))
ggplot(data, aes(x = x, y = y)) +
  geom_point(alpha = 1/3, size = 4) +
  # line fitted by gradient descent, drawn from the predictions X %*% theta
  geom_line(aes(y = fit, color = "Gradient Descent"), size = 1) +
  # least-squares line fitted on the training set, for comparison
  geom_smooth(method = "lm", se = FALSE, aes(color = "Training set")) +
  scale_colour_manual(name = "Method", values = c("red", "steelblue")) +
  theme(legend.position = "bottom") +
  labs(title = "Gradient Descent – Results")
```
]

--

.pull-right[
```
## `geom_smooth()` using formula 'y ~ x'
```

<!-- -->
]

---

# Demo (Cont..)

.pull-left[
The cost computed at each iteration is stored in `result$J_history`.

```r
data <- data.frame(x = seq(1, length(result$J_history)),
                   y = result$J_history)
ggplot(data, aes(x = x, y = y)) +
  geom_line() +
  labs(title = "Gradient descent iterations",
       x = "Iterations", y = "Cost J")
```
]

--

.pull-right[
<!-- -->
]

---

# Demo (Cont..)

.pull-left[
Make predictions

```r
predict1 <- c(1, 3.5) %*% theta
predict2 <- c(1, 7) %*% theta
```
]

--

.pull-right[
]

For x = 3.5, we predict y of 7.038

For x = 7, we predict y of 13.703

---

# Demo (Cont..)
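One way to pick the learning rate is to run gradient descent with a few candidate values and compare the cost histories. Below is a minimal sketch (not part of the original demo), assuming `X`, `y`, `computeCost()` and `gradientDescent()` from the previous slides are in the workspace:

```r
# compare cost histories for two candidate learning rates
alphas <- c(0.0001, 0.0002)
costs <- sapply(alphas, function(a) gradientDescent(X, y, c(0, 0), a, 1500)$J_history)

matplot(costs, type = "l", lty = 1,
        xlab = "Iteration", ylab = "Cost J",
        main = "Effect of the learning rate")
legend("topright", legend = paste("alpha =", format(alphas, scientific = FALSE)),
       col = 1:2, lty = 1)
```

The run with α = 0.0001 lowers the cost more slowly, which is why the next slide switches to α = 0.0002.

---

# Demo (Cont..)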
.pull-left[
Run gradient descent again with α = 0.0002 (more precise) to compare:

```r
theta <- c(0, 0)
iterations <- 1500
alpha <- 0.0002  # set alpha more precisely

result <- gradientDescent(X, y, theta, alpha, iterations)
matrix(c(1, 1, 3.5, 7), ncol = 2) %*% result$theta
```
]

--

.pull-right[
```
##           [,1]
## [1,]  7.177233
## [2,] 13.974120
```

With α = 0.0001 we predicted 7.038 for x = 3.5 and 13.703 for x = 7, so the more precise run moves the predictions closer to the `lm` fit shown on the next slide.
]

---

# Using predict.lm in R

.pull-left[
Predict using `predict.lm`:

```r
fit <- lm(y ~ x)
newdata <- data.frame(x = c(3.5, 7))
predict(fit, newdata, interval = "none")
```
]

--

.pull-right[
```
##         1         2 
##  7.202689 13.953160
```
]

---
class: very-large-body

# Take aways ...

Gradient descent can converge with a fixed learning rate alpha: as it approaches a minimum the gradient shrinks, so the steps automatically get smaller and there is no need to decrease alpha over time.

If alpha is too large, gradient descent may overshoot the minimum and fail to converge, or even diverge.

Gradient descent can converge to a local minimum, so in general it will not always converge to the global minimum. (For linear regression the cost function is convex, so the minimum it finds is the global one.)

---
class: inverse, center, middle

# Thanks

---
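# Appendix: comparing fitted coefficients with lm()

The slides compare predictions from gradient descent with `predict.lm`; the fitted coefficients can also be compared directly. A minimal sketch (not part of the original deck), assuming `x`, `y`, `X` and `gradientDescent()` from the demo are still in the workspace:

```r
# gradient descent fit, using the settings from the comparison slide
theta_gd <- gradientDescent(X, y, c(0, 0), alpha = 0.0002, num_iters = 1500)$theta
t(theta_gd)

# ordinary least squares fit for reference
coef(lm(y ~ x))
```

With enough iterations and a suitable alpha the two estimates should be close, but not identical, because gradient descent stops after a fixed number of iterations.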