By the end of this activity, students should be able to
Implement gradient descent from scratch for logistic regression.
Explain the pitfalls of the gradient descent algorithm.
Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. Some questions require a short answer, which you type directly in this document; others require modifying the R code, which you also do in this document. In either case, you can discuss your work with the lab instructor.
We will work with data on Titanic passengers and estimate the probability that a passenger survived the Titanic disaster given their age and ticket price.
First we will load the libraries and the data into R. Note that libraries that are not installed will be installed automatically. You can use this method of package installation in the future - it will make it easier to run your R code on another computer.
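The chunk below is a minimal sketch of such an install-if-missing pattern (it is not necessarily the exact code used to render this document); a package is installed only when it cannot be loaded.

# Install the lab's packages only if they are not already available
pkgs <- c("tidyverse", "caret", "titanic")
for (p in pkgs) {
  if (!requireNamespace(p, quietly = TRUE)) install.packages(p)
}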
Below are the dimensions of the data
library(tidyverse)
library(caret)
library(titanic)
# Keep passengers with a positive fare and a non-missing age
# (note: the name T masks R's built-in shorthand for TRUE)
T <- titanic_train[(titanic_train$Fare > 0) & !(is.na(titanic_train$Age)), ]
T$Sex <- as.factor(T$Sex)
dim(T)
## [1] 707 12
Below are variable names
names(T)
## [1] "PassengerId" "Survived" "Pclass" "Name" "Sex"
## [6] "Age" "SibSp" "Parch" "Ticket" "Fare"
## [11] "Cabin" "Embarked"
Below are types of variables
sapply(T, class)
## PassengerId Survived Pclass Name Sex Age
## "integer" "integer" "integer" "character" "factor" "numeric"
## SibSp Parch Ticket Fare Cabin Embarked
## "integer" "integer" "character" "numeric" "character" "character"
Below is a sample of the data
T[sample(1:nrow(T), size = 5),]
Below is the summary of each variable
summary(T)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:707
## 1st Qu.:221.5 1st Qu.:0.0000 1st Qu.:1.000 Class :character
## Median :446.0 Median :0.0000 Median :2.000 Mode :character
## Mean :448.4 Mean :0.4088 Mean :2.238
## 3rd Qu.:677.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
## Sex Age SibSp Parch
## female:261 Min. : 0.42 Min. :0.0000 Min. :0.0000
## male :446 1st Qu.:20.00 1st Qu.:0.0000 1st Qu.:0.0000
## Median :28.00 Median :0.0000 Median :0.0000
## Mean :29.65 Mean :0.5177 Mean :0.4356
## 3rd Qu.:38.00 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :80.00 Max. :5.0000 Max. :6.0000
## Ticket Fare Cabin Embarked
## Length:707 Min. : 4.013 Length:707 Length:707
## Class :character 1st Qu.: 8.050 Class :character Class :character
## Mode :character Median : 15.850 Mode :character Mode :character
## Mean : 35.038
## 3rd Qu.: 34.198
## Max. :512.329
Below is a scatterplot of the data. Each point represents one passenger; the coordinates are the log of the ticket price and the passenger's age, and the colour indicates survival.
ggplot(data = T, aes(x = log(Fare),
y = Age, colour = Survived)) +
geom_point(size = 2)
Based on the plot, how do you think age and ticket price are related to survival on Titanic? Were older people more likely to survive than younger people? Were people with expensive tickets more likely to survive than people with cheap tickets?
People with expensive tickets were more likely to survive, as were children.
We will randomly split the data into training (approximately 70%) and validation (approximately 30%) sets.
set.seed(2019)
random_choice <- runif(nrow(T)) < 0.7
ind_train <- which(random_choice)
ind_val <- which(!random_choice)
cat("Records in the training set are", head(ind_train),"...\n")
## Records in the training set are 3 4 5 6 8 9 ...
cat("Records in the validation set are", head(ind_val),"...\n")
## Records in the validation set are 1 2 7 11 18 23 ...
train_data <- T[ind_train , ]
val_data <- T[ind_val , ]
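As a quick check (a sketch), the realized split is only approximately 70/30, because each record is assigned to the training set independently with probability 0.7:

# Realized proportions of the random split
c(train = nrow(train_data), validation = nrow(val_data)) / nrow(T)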
First we will fit a logistic model using the built-in R function glm.
mod_logit <- glm(Survived ~ Age + Fare, family = "binomial",
data = train_data)
summary(mod_logit)
##
## Call:
## glm(formula = Survived ~ Age + Fare, family = "binomial", data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4410 -0.9425 -0.8560 1.2730 1.7232
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.357963 0.213029 -1.680 0.0929 .
## Age -0.015884 0.006559 -2.422 0.0155 *
## Fare 0.013638 0.002685 5.080 3.77e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 692.43 on 511 degrees of freedom
## Residual deviance: 649.92 on 509 degrees of freedom
## AIC: 655.92
##
## Number of Fisher Scoring iterations: 4
We will extract the coefficients of the logistic regression, i.e., the vector \[ (\beta_0,\beta_1,\beta_2) \] The vector is
beta_opt <- summary(mod_logit)$coefficients[ , 1]
beta_opt
## (Intercept) Age Fare
## -0.35796266 -0.01588390 0.01363827
What does this logistic regression tell us? Were older people more likely to survive than younger people? Were people with expensive tickets more likely to survive than people with cheap tickets?
What is the estimated probability of survival of a hypothetical 10 year old passenger who paid 16 dollars for a ticket? How would this chance change if the ticket cost 200 dollars instead?
The logistic regression tells us the same thing as the informal inspection of the plot, i.e., people with expensive tickets were more likely to survive, as were children. Below is the estimated probability of survival for a hypothetical 10 year old passenger with a 16 dollar ticket.
predict(mod_logit, data.frame(Age = 10, Fare = 16), type = "response")
## 1
## 0.4259024
Below is the estimated probability of survival for a hypothetical 10 year old passenger with a 200 dollar ticket.
predict(mod_logit, data.frame(Age = 10, Fare = 200), type = "response")
## 1
## 0.9012198
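As a sketch of what predict() does under the hood here, the same probability can be computed directly from the fitted coefficients with the logistic function:

# Manual computation of the predicted probability for Age = 10, Fare = 16
z <- beta_opt["(Intercept)"] + beta_opt["Age"] * 10 + beta_opt["Fare"] * 16
as.numeric(1 / (1 + exp(-z)))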
We will first define the sigmoid function \[ s(z)=\frac{1}{1+e^{-z}} \] in R. Below is its graph.
sigmoid <- function(z)
{
1 / (1 + exp(-z))
}
ggplot(data = data.frame(x = c(-5, 5), y = c(1, 1)),
aes(x = x, y = y)) +
stat_function(fun = sigmoid, colour = "blue")
Now we will define the sigmoid activation function, i.e., the prediction of logistic regression. Its output is \[ p(x)=s\left( \begin{bmatrix} 1 & x_1 & x_2 \end{bmatrix}\cdot \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} \right) \]
Below we use it to estimate the probability to survive the Titanic disaster for a hypothetical 10 year old passenger who paid 16 dollars for a ticket:
sigmoid_activation <- function(x, beta) {
  # Note: this function is written so that it works for input of arbitrary dimension.
  X <- cbind(1, matrix(x, ncol = length(beta) - 1))  # prepend the intercept column
  W <- matrix(beta, ncol = 1)
  z <- X %*% W
  sigmoid(z)
}
sigmoid_activation(c(10, 16), beta_opt)
## [,1]
## [1,] 0.4259024
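The same function works for several passengers at once; for example (a quick sketch), here are the predicted probabilities for the first three records of the training set:

# Predicted survival probabilities for the first three training records
sigmoid_activation(as.matrix(train_data[1:3, c("Age", "Fare")]), beta_opt)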
The loss of logistic regression is given by \[ J(\beta)=-\sum_{i=1}^{N}y^{i}\log p(x^{i}) -\sum_{i=1}^{N}(1-y^{i})\log(1- p(x^{i})) \]
Below we input it into R
J <- function(beta,
response = "Survived",
features = c("Age", "Fare"),
dataset = train_data) {
x <- as.matrix(dataset[ , features])
y <- dataset[ , response]
-sum(y * log(sigmoid_activation(x, beta))) -
sum((1-y) * (log(1 - sigmoid_activation(x, beta))))
}
beta_other <- c(-2, -0.2, -0.2)
names(beta_other) <- names(beta_opt)
cat("The loss of", beta_opt, "is", J(beta_opt), "\n")
## The loss of -0.3579627 -0.0158839 0.01363827 is 324.9582
cat("The loss of", beta_other, "is", J(beta_other), "\n")
## The loss of -2 -0.2 -0.2 is 3735.104
The derivatives of the loss function are \[ J_{\beta_0}= \sum_{i=1}^{N}\left(p(x^{i})-y^{i}\right),\quad J_{\beta_1}= \sum_{i=1}^{N}\left(p(x^{i})-y^{i}\right)x^{i}_1,\quad J_{\beta_2}= \sum_{i=1}^{N}\left(p(x^{i})-y^{i}\right)x^{i}_2 \]
Below we input these functions into R
# Partial derivative of J with respect to beta_0 (computed over the full dataset T)
dJdb0 <- function(beta) {
  x1 <- T$Age
  x2 <- T$Fare
  y <- T$Survived
  sum(sigmoid_activation(c(x1, x2), beta) - y)
}
# Partial derivative of J with respect to beta_1
dJdb1 <- function(beta) {
  x1 <- T$Age
  x2 <- T$Fare
  y <- T$Survived
  sum((sigmoid_activation(c(x1, x2), beta) - y) * x1)
}
# Partial derivative of J with respect to beta_2
dJdb2 <- function(beta) {
  x1 <- T$Age
  x2 <- T$Fare
  y <- T$Survived
  sum((sigmoid_activation(c(x1, x2), beta) - y) * x2)
}
cat("Beta =", beta_other, "\n")
## Beta = -2 -0.2 -0.2
cat("J(beta) =", J(beta_other), "\n")
## J(beta) = 3735.104
cat("grad J =", c(dJdb0(beta_other), dJdb1(beta_other), dJdb2(beta_other)), "\n")
## grad J = -288.805 -8192.789 -15032.3
Rewrite the functions above in the same fashion as our function J, i.e., write just one function that takes as input a vector \(\beta\) of arbitrary length \(p+1\), a dataset, and a vector of predictors' names of length \(p\). The new function should return the vector \[ \begin{bmatrix} J_{\beta_0} & J_{\beta_1} & \cdots & J_{\beta_p} \end{bmatrix} \]
# Modify this code here
grad_J <- function(beta, response = "Survived",
                   features = c("Age", "Fare"),
                   dataset = train_data) {
  x <- as.matrix(dataset[ , features])
  y <- dataset[ , response]
  resid <- sigmoid_activation(x, beta) - y   # p(x^i) - y^i for every record
  dJdb0 <- sum(resid)                        # derivative with respect to beta_0
  dJdb <- colSums(as.numeric(resid) * x)     # derivatives with respect to beta_1, ..., beta_p
  c(dJdb0, dJdb)
}
print(grad_J(beta_opt))
## Age Fare
## -4.937155e-07 -1.479214e-05 -1.102473e-04
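As a sanity check (a sketch, not part of the required solution), we can compare grad_J to a central-difference numerical approximation of the gradient of J at beta_other; the two should agree to several digits.

# Numerical gradient of J at beta_other (both J and grad_J use train_data by default)
num_grad <- sapply(seq_along(beta_other), function(j) {
  h <- 1e-5
  e <- rep(0, length(beta_other))
  e[j] <- h
  (J(beta_other + e) - J(beta_other - e)) / (2 * h)
})
rbind(analytic = grad_J(beta_other), numeric = num_grad)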
Begin at some initial point and then repeatedly update \(\beta\) as follows \[ \beta_{new}=\beta_{old}-\alpha J_{\beta}, \] where \(\alpha\) is the learning rate and \(J_{\beta}\) is the gradient vector of the loss. This is what it looks like:
alpha <- 0.000001
cat("Learning rate =", alpha, "\n")
## Learning rate = 1e-06
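Before wrapping the iteration in a bookkeeping function, here is the update rule in its barest form (a sketch): start at beta_other and apply the rule repeatedly.

# Bare-bones gradient descent: repeat beta <- beta - alpha * grad J(beta)
beta_k <- beta_other
for (k in 1:100) {
  beta_k <- beta_k - alpha * grad_J(beta_k)
}
beta_k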
grad_descent <- function(num_steps,
                         learning_rate,
                         initial_point = beta_other,
                         cost_function = J,
                         grad_function = grad_J) {
  p <- length(initial_point) - 1
  grad_vars <- paste("Jb", 0:p, sep = "")
  # One row per step: the loss, the current coefficients, and the gradient at that point
  result <- as.data.frame(matrix(NA, nrow = num_steps, ncol = 2*p + 3))
  names(result)[1] <- "J"
  names(result)[2:(p+2)] <- names(initial_point)
  names(result)[(p+3):(2*p+3)] <- grad_vars
  result[1, names(initial_point)] <- t(as.data.frame(initial_point))
  result[1, "J"] <- cost_function(initial_point)
  for (k in 1:(num_steps - 1)) {
    b <- as.numeric(result[k, 2:(p+2)])
    v <- grad_function(b)
    result[k, grad_vars] <- v
    # Gradient descent update: beta_new = beta_old - alpha * grad J(beta_old)
    b_new <- b - learning_rate * v
    result[k+1, 2:(p+2)] <- b_new
    result[k+1, "J"] <- cost_function(b_new)
  }
  result
}
grad_descent(10, alpha)
Below is the trajectory of gradient descent with learning rate \[ \alpha=0.000001 \] and 1000 steps.
gd <- grad_descent(1000, 0.000001)
beta_gd <- as.numeric(gd[nrow(gd) , names(beta_opt)])
names(beta_gd) <- names(beta_opt)
cat("Our result is", beta_gd,"\n")
## Our result is -1.96219 0.02296343 0.01923662
ggplot(gd, aes(x = Age, y = Fare)) +
geom_line()
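It is also informative (a quick sketch) to plot the recorded loss J against the step number; with a suitable learning rate the curve should decrease steadily.

# Loss J at each step of gradient descent
ggplot(data.frame(step = 1:nrow(gd), loss = gd$J), aes(x = step, y = loss)) +
  geom_line()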
How good is the final result produced by our own implementation of gradient descent? Find the loss of our result and compare it to the loss of the parameter vector produced by the built-in function.
What will happen if we try to change the learning rate? Increase the learning rate 10 and 100 times and compare the plots. Explain the results.
The loss of our final result is
gd[nrow(gd) , "J"]
## [1] 351.6276
while the loss of the result produced by the built-in function is
J(beta_opt)
## [1] 324.9582
With larger learning rates, we get the following plots.
alpha <- 0.00001
cat("Learning rate is", alpha, "\n")
## Learning rate is 1e-05
gd <- grad_descent(1000, alpha)
beta_gd <- as.numeric(gd[nrow(gd) , names(beta_opt)])
names(beta_gd) <- names(beta_opt)
cat("Our result is", beta_gd,"\n")
## Our result is -1.742619 -0.0016482 0.01301318
ggplot(gd, aes(x = Age, y = Fare)) +
geom_line()
alpha <- 0.0001
cat("Learning rate is", alpha, "\n")
## Learning rate is 1e-04
gd <- grad_descent(1000, alpha)
beta_gd <- as.numeric(gd[nrow(gd) , names(beta_opt)])
names(beta_gd) <- names(beta_opt)
cat("Our result is", beta_gd,"\n")
## Our result is -2.314643 -0.2803736 0.216131
ggplot(gd, aes(x = Age, y = Fare)) +
geom_line()
This happens because some components of the gradient vector are much larger than others (the intercept, Age and Fare are on very different scales). The level surfaces of the loss function \(J\) are therefore very long and very narrow ellipsoids, and with a larger learning rate gradient descent overshoots along the steep directions and oscillates instead of converging.
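One standard remedy for this pitfall is to standardize the features before running gradient descent; the level surfaces then become much rounder and a far larger learning rate converges without oscillating. Below is a minimal sketch under that assumption; the objects train_scaled, J_scaled, grad_J_scaled and beta_start are introduced here for illustration only, and the learning rate may still need some tuning.

# Standardize Age and Fare in the training data (illustration only)
train_scaled <- train_data
train_scaled$Age  <- as.numeric(scale(train_scaled$Age))
train_scaled$Fare <- as.numeric(scale(train_scaled$Fare))

# Loss and gradient evaluated on the standardized training data
J_scaled      <- function(beta) J(beta, dataset = train_scaled)
grad_J_scaled <- function(beta) grad_J(beta, dataset = train_scaled)

beta_start <- c(0, 0, 0)
names(beta_start) <- names(beta_opt)

# A learning rate several orders of magnitude larger than before
gd_scaled <- grad_descent(1000, 0.001,
                          initial_point = beta_start,
                          cost_function = J_scaled,
                          grad_function = grad_J_scaled)
gd_scaled[nrow(gd_scaled), "J"]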
Let us now compare the accuracy of the predictions of the built-in logistic regression and of our own logistic regression implemented from scratch. We will predict survival whenever the estimated probability of survival is at least 0.5.
Below is the confusion matrix for the built-in logistic regression.
pred_logit <- 0 + (predict(mod_logit, val_data, type = "response") > 0.5 )
confusionMatrix(as.factor(pred_logit), as.factor(val_data$Survived))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 107 56
## 1 8 24
##
## Accuracy : 0.6718
## 95% CI : (0.6011, 0.7372)
## No Information Rate : 0.5897
## P-Value [Acc > NIR] : 0.01133
##
## Kappa : 0.2536
##
## Mcnemar's Test P-Value : 4.228e-09
##
## Sensitivity : 0.9304
## Specificity : 0.3000
## Pos Pred Value : 0.6564
## Neg Pred Value : 0.7500
## Prevalence : 0.5897
## Detection Rate : 0.5487
## Detection Prevalence : 0.8359
## Balanced Accuracy : 0.6152
##
## 'Positive' Class : 0
##
Below is the confusion matrix for our own version of logistic regression:
pred_our <- 0 + (sigmoid_activation(
as.matrix(val_data[ , c("Age", "Fare")]), beta_gd) >= 0.5)
confusionMatrix(as.factor(pred_our), as.factor(val_data$Survived))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 103 42
## 1 12 38
##
## Accuracy : 0.7231
## 95% CI : (0.6546, 0.7846)
## No Information Rate : 0.5897
## P-Value [Acc > NIR] : 7.415e-05
##
## Kappa : 0.3931
##
## Mcnemar's Test P-Value : 7.933e-05
##
## Sensitivity : 0.8957
## Specificity : 0.4750
## Pos Pred Value : 0.7103
## Neg Pred Value : 0.7600
## Prevalence : 0.5897
## Detection Rate : 0.5282
## Detection Prevalence : 0.7436
## Balanced Accuracy : 0.6853
##
## 'Positive' Class : 0
##
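As a quick summary (a sketch), the headline accuracies of the two models on the validation set can also be computed directly, without caret:

# Validation accuracy of the built-in glm fit and of our gradient descent fit
mean(pred_logit == val_data$Survived)
mean(pred_our == val_data$Survived)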