Info about the activity

Objectives

By the end of this activity, students should be able to

  1. Implement gradient descent from scratch for logistic regression.

  2. Explain pitfalls of the gradient descent algorithm.

Mode

Please run the R chunks one by one, look at the output, and make sure you understand how it is produced. Some questions require a short answer, which you type directly into this document; others ask you to modify the R code, which you also do here. In either case, you can discuss your work with the lab instructor.

Data

We will work with data on Titanic passengers. We will estimate the probability that a passenger survived the Titanic disaster given their age and ticket price.

Loading data into R

First we will load the libraries and the data into R. Note that libraries that are not installed can be installed automatically. You can use this method of package installation in the future; it will help your R code run on another computer.
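
One common way to do this (a sketch of such a setup chunk, which is not shown here; the package list is the one used below) is to install any missing packages before loading them:

# Install any of the required packages that are missing, then load them all
pkgs <- c("tidyverse", "caret", "titanic")
missing <- pkgs[!(pkgs %in% installed.packages()[, "Package"])]
if (length(missing) > 0) install.packages(missing)
invisible(lapply(pkgs, library, character.only = TRUE))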

Below are the dimensions of the data

library(tidyverse)
library(caret)
library(titanic)

# Keep only passengers with a positive fare and a known age
T <- titanic_train[(titanic_train$Fare > 0) & !(is.na(titanic_train$Age)) , ]
T$Sex <- as.factor(T$Sex)

dim(T)
## [1] 707  12

Below are variable names

names(T)
##  [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
##  [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
## [11] "Cabin"       "Embarked"

Below are types of variables

sapply(T, class)
## PassengerId    Survived      Pclass        Name         Sex         Age 
##   "integer"   "integer"   "integer" "character"    "factor"   "numeric" 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##   "integer"   "integer" "character"   "numeric" "character" "character"

Below is a sample of the data

T[sample(1:nrow(T), size = 5),]

Below is the summary of each variable

summary(T)
##   PassengerId       Survived          Pclass          Name          
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:707        
##  1st Qu.:221.5   1st Qu.:0.0000   1st Qu.:1.000   Class :character  
##  Median :446.0   Median :0.0000   Median :2.000   Mode  :character  
##  Mean   :448.4   Mean   :0.4088   Mean   :2.238                     
##  3rd Qu.:677.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
##      Sex           Age            SibSp            Parch       
##  female:261   Min.   : 0.42   Min.   :0.0000   Min.   :0.0000  
##  male  :446   1st Qu.:20.00   1st Qu.:0.0000   1st Qu.:0.0000  
##               Median :28.00   Median :0.0000   Median :0.0000  
##               Mean   :29.65   Mean   :0.5177   Mean   :0.4356  
##               3rd Qu.:38.00   3rd Qu.:1.0000   3rd Qu.:1.0000  
##               Max.   :80.00   Max.   :5.0000   Max.   :6.0000  
##     Ticket               Fare            Cabin             Embarked        
##  Length:707         Min.   :  4.013   Length:707         Length:707        
##  Class :character   1st Qu.:  8.050   Class :character   Class :character  
##  Mode  :character   Median : 15.850   Mode  :character   Mode  :character  
##                     Mean   : 35.038                                        
##                     3rd Qu.: 34.198                                        
##                     Max.   :512.329

Data visualization

Below is a scatterplot of the data. Each point represents one passenger. The coordinates are the log of the ticket price and the age of the passenger.

ggplot(data = T, aes(x = log(Fare), 
                     y = Age, colour = Survived)) +
  geom_point(size = 2)

Question

Based on the plot, how do you think age and ticket price are related to survival on the Titanic? Were older people more likely to survive than younger people? Were people with expensive tickets more likely to survive than people with cheap tickets?

People with expensive tickets were more likely to survive, as were children.
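
A quick numerical check of this reading (a supplementary sketch, not part of the original lab; the cutoff of 16 years for "child" is an arbitrary choice) is to compare survival rates across fare quartiles and age groups:

# Survival rate by fare quartile and by child/adult status
T %>%
  mutate(fare_group = ntile(Fare, 4),
         child = Age < 16) %>%
  group_by(fare_group, child) %>%
  summarise(survival_rate = mean(Survived), n = n(), .groups = "drop")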

Splitting the data into training and validation sets

We will randomly split the data into a 70% training set and a 30% validation set.

set.seed(2019)
random_choice <- runif(nrow(T)) < 0.7
ind_train <- which(random_choice)
ind_val <- which(!random_choice)
cat("Records in the training set are", head(ind_train),"...\n")
## Records in the training set are 3 4 5 6 8 9 ...
cat("Records in the validation set are", head(ind_val),"...\n")
## Records in the validation set are 1 2 7 11 18 23 ...
train_data <- T[ind_train , ]
val_data <- T[ind_val , ]
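
As a quick sanity check (a small addition, not in the original lab), we can confirm that roughly 70% of the records ended up in the training set:

# Fraction of records assigned to the training set; should be close to 0.7
length(ind_train) / nrow(T)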

Logistic regression

Built-in function

First we will fit a logistic model using the built-in R function.

mod_logit <- glm(Survived ~ Age + Fare, family = "binomial",
                 data = train_data)

summary(mod_logit)
## 
## Call:
## glm(formula = Survived ~ Age + Fare, family = "binomial", data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4410  -0.9425  -0.8560   1.2730   1.7232  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.357963   0.213029  -1.680   0.0929 .  
## Age         -0.015884   0.006559  -2.422   0.0155 *  
## Fare         0.013638   0.002685   5.080 3.77e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 692.43  on 511  degrees of freedom
## Residual deviance: 649.92  on 509  degrees of freedom
## AIC: 655.92
## 
## Number of Fisher Scoring iterations: 4

We will extract the coefficients of the logistic regression, i.e., the vector \[ (\beta_0,\beta_1,\beta_2). \] The vector is

beta_opt <- summary(mod_logit)$coefficients[ , 1]
beta_opt
## (Intercept)         Age        Fare 
## -0.35796266 -0.01588390  0.01363827

Question 1

What does this logistic regression tell us? Were older people more likely to survive than younger people? Were people with expensive tickets more likely to survive than people with cheap tickets?

What is the estimated probability of survival for a hypothetical 10-year-old passenger who paid 16 dollars for a ticket? How would this chance change if the ticket cost 200 dollars instead?

Logistic regression tells us the same thing as an informal inspection of the plot, i.e., people with expensive tickets were more likely to survive, as were children. Below is the estimated probability that a hypothetical 10-year-old passenger with a 16-dollar ticket survives

predict(mod_logit, data.frame(Age = 10, Fare = 16), type = "response")
##         1 
## 0.4259024

Below is the estimated probability that a hypothetical 10-year-old passenger with a 200-dollar ticket survives

predict(mod_logit, data.frame(Age = 10, Fare = 200), type = "response")
##         1 
## 0.9012198

Sigmoid function

We will first define the sigmoid function \[ s(z)=\frac{1}{1+e^{-z}} \] in R. Below is its graph.

sigmoid <- function(z) 
{
  1 / (1 + exp(-z))
}

ggplot(data = data.frame(x = c(-5, 5), y = c(1, 1)),
       aes(x = x, y = y)) +
  stat_function(fun = sigmoid, colour = "blue")

Now we will define the sigmoid activation function, i.e., the prediction of logistic regression. Its output is \[ p(x)=s\left( \begin{bmatrix} 1 & x_1 & x_2 \end{bmatrix}\cdot \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} \right) \]

Below we use it to estimate the probability of surviving the Titanic disaster for a hypothetical 10-year-old passenger who paid 16 dollars for a ticket:

sigmoid_activation <- function(x, beta) {
  # Note that this function is written so that it works for input of arbitrary dimension.
  X <- cbind(1, matrix(x, ncol = length(beta) - 1))
  W <- matrix(beta, ncol = 1)
  z <- X %*% W
  sigmoid(z)
}

sigmoid_activation(c(10, 16), beta_opt)
##           [,1]
## [1,] 0.4259024

The loss function

The loss of logistic regression is given by \[ J(\beta)=-\sum_{i=1}^{N}y^{i}\log p(x^{i}) -\sum_{i=1}^{N}(1-y^{i})\log(1- p(x^{i})) \]

Below we input it into R

J <- function(beta, 
              response = "Survived",
              features = c("Age", "Fare"),
              dataset = train_data) {
  x <- as.matrix(dataset[ , features])
  y <- dataset[ , response]
  -sum(y * log(sigmoid_activation(x, beta))) -
    sum((1-y) * (log(1 - sigmoid_activation(x, beta))))
}


beta_other <- c(-2, -0.2, -0.2)
names(beta_other) <- names(beta_opt)

cat("The loss of", beta_opt, "is", J(beta_opt), "\n")
## The loss of -0.3579627 -0.0158839 0.01363827 is 324.9582
cat("The loss of", beta_other, "is", J(beta_other), "\n")
## The loss of -2 -0.2 -0.2 is 3735.104
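
As a sanity check, for a 0/1 response this loss is the negative log-likelihood of the model, so twice its value at beta_opt should reproduce the residual deviance reported by glm above (649.92):

# Deviance = -2 * log-likelihood for a Bernoulli response; compare with summary(mod_logit)
2 * J(beta_opt)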

Gradient of the loss function

The derivatives of the loss function are \[ J_{\beta_0}= \sum_{i=1}^{N}\left(p(x^{i})-y^{i}\right),\quad J_{\beta_1}= \sum_{i=1}^{N}\left(p(x^{i})-y^{i}\right)x^{i}_1,\quad J_{\beta_2}= \sum_{i=1}^{N}\left(p(x^{i})-y^{i}\right)x^{i}_2 \]

Below we input these functions into R

dJdb0 <- function(beta) {
  x1 <- T$Age
  x2 <- T$Fare
  y <- T$Survived
  sum(sigmoid_activation(c(x1, x2), beta) - y)
}

dJdb1 <- function(beta) {
  x1 <- T$Age
  x2 <- T$Fare
  y <- T$Survived
  sum((sigmoid_activation(c(x1, x2), beta) - y) * x1)
}

dJdb2 <- function(beta) {
  x1 <- T$Age
  x2 <- T$Fare
  y <- T$Survived
  sum((sigmoid_activation(c(x1, x2), beta) - y) * x2)
}


cat("Beta =", beta_other, "\n")
## Beta = -2 -0.2 -0.2
cat("J(beta) =", J(beta_other), "\n")
## J(beta) = 3735.104
cat("grad J =", c(dJdb0(beta_other), dJdb1(beta_other), dJdb2(beta_other)), "\n")
## grad J = -288.805 -8192.789 -15032.3

Question 2

Rewrite the functions above in the same fashion as our function J, i.e., write just one function that takes as input a vector \(\beta\) of arbitrary length \(p+1\), a dataset, and a vector of predictors' names of length \(p\). The new function should return the vector \[ \begin{bmatrix} J_{\beta_0} & J_{\beta_1} & \cdots & J_{\beta_p} \end{bmatrix} \]

# Modify this code here
grad_J <- function(beta, response = "Survived", 
                   features = c("Age", "Fare"), 
                   dataset = train_data) {
  p <- length(features)
  # Design matrix of predictors (without the intercept column) and the response vector
  x <- as.matrix(dataset[ , features])
  y <- as.matrix(dataset[ , response])
  # Derivative with respect to the intercept: sum over observations of p(x) - y
  dJdb0 <- sum(sigmoid_activation(x, beta) - y)
  # Derivatives with respect to the remaining coefficients: column sums of (p(x) - y) * x
  dJdb <- apply(matrix(rep((sigmoid_activation(x, beta) - y), p), ncol = p) * x,
                2, sum)
  
  c(dJdb0, dJdb)
}

print(grad_J(beta_opt))
##                         Age          Fare 
## -4.937155e-07 -1.479214e-05 -1.102473e-04
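
One way to verify grad_J (a sketch, with an arbitrary step size eps) is to compare it with a central finite-difference approximation of J at a point where the gradient is not close to zero, such as beta_other:

# Numerical gradient of J by central differences; should be close to grad_J(beta_other)
numeric_grad <- function(beta, eps = 1e-5) {
  sapply(seq_along(beta), function(k) {
    e <- rep(0, length(beta))
    e[k] <- eps
    (J(beta + e) - J(beta - e)) / (2 * eps)
  })
}
numeric_grad(beta_other)
grad_J(beta_other)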

Gradient descent implemented from scratch

Gradient descent begins at some initial point and then repeatedly updates \(\beta\) as follows: \[ \beta_{new}=\beta_{old}-\alpha J_{\beta}, \] where \(\alpha\) is the learning rate and \(J_{\beta}\) is the gradient of the loss. This is what it looks like:

alpha <- 0.000001
cat("Learning rate =", alpha, "\n")
## Learning rate = 1e-06
grad_descent <- function(num_steps,
                         learning_rate,
                         initial_point = beta_other,
                         cost_function = J,
                         grad_function = grad_J) {
  p <- length(initial_point)-1
  grad_vars <- paste("Jb", 0:p, sep = "")
  result <- as.data.frame(matrix(NA, nrow = num_steps, ncol = 2*p+3))
  names(result)[1] <- "J"
  names(result)[2:(p+2)] <- names(initial_point)
  names(result)[(p+3):(2*p+3)] <- grad_vars
  result[1, names(initial_point)] <- t(as.data.frame(initial_point))

  result[ , grad_vars] <- NA
  result[1, "J"] <- cost_function(initial_point)
  
  for (k in 1:(num_steps - 1)) {
    b <- as.numeric(result[k, 2:(p+2)])
    v <- grad_function(b)
    result[k, grad_vars] <- v
    result[k+1, 2:(p+2)] <-
      b - learning_rate * v
    result[k+1, "J"] <- cost_function(b)
  }
  result
}

grad_descent(10, alpha)

Below is a trajectory of the gradient descent with the learning rate \[ \alpha=0.000001 \] and 1000 steps.

gd <- grad_descent(1000, 0.000001)
beta_gd <- as.numeric(gd[nrow(gd) , names(beta_opt)])
names(beta_gd) <- names(beta_opt)
cat("Our result is", beta_gd,"\n")
## Our result is -1.96219 0.02296343 0.01923662
ggplot(gd, aes(x = Age, y = Fare)) +
  geom_line()
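
It can also help to plot the recorded loss against the iteration number (a supplementary sketch); a curve that decreases steadily and then flattens suggests the learning rate is small enough:

# Loss J recorded at each step of gradient descent
ggplot(gd, aes(x = seq_len(nrow(gd)), y = J)) +
  geom_line() +
  labs(x = "iteration", y = "loss J")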

Question 3

How good is the final result produced by our own implementation of gradient descent? Find the loss of our result and compare it to the loss of the parameter vector produced by the built-in function.

What will happen if we change the learning rate? Increase the learning rate by factors of 10 and 100 and compare the plots. Explain the results.

The loss of our final result is

gd[nrow(gd) , "J"]
## [1] 351.6276

while the loss of the result produced by the built-in function is

J(beta_opt)
## [1] 324.9582

With a larger learning rate, we get the following plots:

alpha <- 0.00001
cat("Learning rate is", alpha, "\n")
## Learning rate is 1e-05
gd <- grad_descent(1000, alpha)
beta_gd <- as.numeric(gd[nrow(gd) , names(beta_opt)])
names(beta_gd) <- names(beta_opt)
cat("Our result is", beta_gd,"\n")
## Our result is -1.742619 -0.0016482 0.01301318
ggplot(gd, aes(x = Age, y = Fare)) +
  geom_line()

alpha <- 0.0001
cat("Learning rate is", alpha, "\n")
## Learning rate is 1e-04
gd <- grad_descent(1000, alpha)
beta_gd <- as.numeric(gd[nrow(gd) , names(beta_opt)])
names(beta_gd) <- names(beta_opt)
cat("Our result is", beta_gd,"\n")
## Our result is -2.314643 -0.2803736 0.216131
ggplot(gd, aes(x = Age, y = Fare)) +
  geom_line()

This happens because some components of the gradient vector are much larger than others. Level surfaces of the loss function \(J\) are very long and very narrow ellipsoids, so gradient descent overshoots in the steep direction while making slow progress along the flat one.
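
A common remedy (not part of the original lab; a sketch that assumes we refit on standardized features with an illustrative learning rate) is to put Age and Fare on comparable scales before running gradient descent, which makes the level surfaces rounder and allows a much larger learning rate:

# Standardize the two features on the training set
train_scaled <- train_data
train_scaled$Age <- as.numeric(scale(train_data$Age))
train_scaled$Fare <- as.numeric(scale(train_data$Fare))

# Wrap J and grad_J so that they use the standardized data
J_scaled <- function(beta) J(beta, dataset = train_scaled)
grad_J_scaled <- function(beta) grad_J(beta, dataset = train_scaled)

# With comparable feature scales a much larger learning rate can be used
gd_scaled <- grad_descent(1000, 0.001,
                          cost_function = J_scaled,
                          grad_function = grad_J_scaled)
gd_scaled[nrow(gd_scaled), "J"]

Note that the coefficients found this way are on the standardized scale and would have to be transformed back before comparing them with beta_opt.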

Accuracy of predictions

Let us now compare the prediction accuracy of the built-in logistic regression and our own logistic regression implemented from scratch. We will predict survival whenever the estimated probability of survival is at least 0.5.

Below is the confusion matrix for the built-in logistic regression.

pred_logit <- 0 + (predict(mod_logit, val_data, type = "response") > 0.5 )
confusionMatrix(as.factor(pred_logit), as.factor(val_data$Survived))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 107  56
##          1   8  24
##                                           
##                Accuracy : 0.6718          
##                  95% CI : (0.6011, 0.7372)
##     No Information Rate : 0.5897          
##     P-Value [Acc > NIR] : 0.01133         
##                                           
##                   Kappa : 0.2536          
##                                           
##  Mcnemar's Test P-Value : 4.228e-09       
##                                           
##             Sensitivity : 0.9304          
##             Specificity : 0.3000          
##          Pos Pred Value : 0.6564          
##          Neg Pred Value : 0.7500          
##              Prevalence : 0.5897          
##          Detection Rate : 0.5487          
##    Detection Prevalence : 0.8359          
##       Balanced Accuracy : 0.6152          
##                                           
##        'Positive' Class : 0               
## 

Below is the confusion matrix for our own version of logistic regression:

pred_our <- 0 + (sigmoid_activation(
  as.matrix(val_data[ , c("Age", "Fare")]), beta_gd) >= 0.5)
confusionMatrix(as.factor(pred_our), as.factor(val_data$Survived))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 103  42
##          1  12  38
##                                           
##                Accuracy : 0.7231          
##                  95% CI : (0.6546, 0.7846)
##     No Information Rate : 0.5897          
##     P-Value [Acc > NIR] : 7.415e-05       
##                                           
##                   Kappa : 0.3931          
##                                           
##  Mcnemar's Test P-Value : 7.933e-05       
##                                           
##             Sensitivity : 0.8957          
##             Specificity : 0.4750          
##          Pos Pred Value : 0.7103          
##          Neg Pred Value : 0.7600          
##              Prevalence : 0.5897          
##          Detection Rate : 0.5282          
##    Detection Prevalence : 0.7436          
##       Balanced Accuracy : 0.6853          
##                                           
##        'Positive' Class : 0               
##