Introduction

For this exercise, you will use logistic regression and neural networks to recognize handwritten digits (from 0 to 9). Automated handwritten digit recognition is widely used today - from recognizing zip codes (postal codes) on mail envelopes to recognizing amounts written on bank checks. This exercise will show you how the methods you’ve learned can be used for this classification task.

1 Multi-class Classification

In the first part of the exercise, you will extend your previous implementation of logistic regression and apply it to one-vs-all classification.

You are given a data set in ex3data1.mat that contains 5000 training examples of handwritten digits. Each example is a 20 x 20 pixel grayscale image unrolled into a 400-dimensional row of X, and the digit 0 is stored with the label 10 in y.

Loading data

library(R.matlab)
## Warning: package 'R.matlab' was built under R version 3.3.3
## R.matlab v3.6.1 (2016-10-19) successfully loaded. See ?R.matlab for help.
## 
## Attaching package: 'R.matlab'
## The following objects are masked from 'package:base':
## 
##     getOption, isOpen
ds_data <- readMat("~/R_projects/171206_exo_3_Ng/ex3data1.mat")
ds_weight <- readMat("~/R_projects/171206_exo_3_Ng/ex3weights.mat")  

X <- as.matrix(ds_data$X)
X <- cbind(1, X) # add the intercept (bias) column of 1s
y <- as.vector(ds_data$y)

theta1 <- as.matrix(ds_weight$Theta1) # pre-trained neural network weights,
theta2 <- as.matrix(ds_weight$Theta2) # used in part 2 of the exercise

Logistic Regression: Cost function

In the original exercise, lrCostFunction computes the cost and gradient for regularized logistic regression: J = LRCOSTFUNCTION(theta, X, y, lambda) returns the cost of using theta as the parameter for regularized logistic regression, together with the gradient of the cost with respect to the parameters.

Here, we will split the function in two: one function to compute the cost J and one to compute the gradient.
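
For reference, the regularized cost and gradient implemented below are the standard ones from the exercise, with the sigmoid hypothesis $h_\theta(x) = 1/(1 + e^{-\theta^T x})$ and no regularization of the intercept $\theta_0$:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log\big(h_\theta(x^{(i)})\big) - (1 - y^{(i)}) \log\big(1 - h_\theta(x^{(i)})\big) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big)\, x_j^{(i)} + \frac{\lambda}{m} \theta_j \qquad \text{for } j \ge 1$$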

CostFunction <- function(X, y, Theta, Lambda){
  m <- length(y)
  Hypothesis <- 1 / (1 + exp(- X %*% Theta)) # probability that y = 1 on input X
  J <- (1/m) * sum(- y * log(Hypothesis) - (1 - y) * log(1 - Hypothesis)) # compute J cost
  reg <- (Lambda / (2 * m)) * sum(Theta[2:length(Theta)] ^ 2) # regularization term; theta_0 (the intercept) is not regularized
  J <- J + reg
  return(J)
}


GradFunction <- function(X, y, Theta, Lambda){
  m <- length(y)
  Hypothesis <- 1 / (1 + exp(- X %*% Theta))
  Grad <- ((t(X) %*% (Hypothesis - y)) / m) + ((Lambda / m) * c(0, Theta[2:length(Theta)])) # the leading 0 leaves theta_0 unregularized
  return(Grad)
}

Testing cost function

theta_t <- c(-2, -1, 1, 2)
X_t <- as.matrix(cbind(1, matrix(1:15, 5,3)/10))
y_t <- c(1, 0, 1, 0, 1) >= 0.5
lambda_t <- 3

CostFunction(X_t, y_t, theta_t, lambda_t)
## [1] 2.534819
# Expected cost: 2.534819

GradFunction(X_t, y_t, theta_t, lambda_t)
##            [,1]
## [1,]  0.1465614
## [2,] -0.5485584
## [3,]  0.7247223
## [4,]  1.3980030
# Expected gradients: 0.146561, -0.548558, 0.724722, 1.398003

One-vs-all Classification

In this part of the exercise, you will implement one-vs-all classification by training multiple regularized logistic regression classifiers, one for each of the K classes in our dataset (here K = 10, one class per digit).

The original Octave function is [all_theta] = oneVsAll(X, y, num_labels, lambda): it trains num_labels regularized logistic regression classifiers and returns them in a matrix all_theta, where the i-th row of all_theta corresponds to the classifier for label i.

In our case, we will split this into two functions: one that trains a single regularized logistic regression classifier, and one that applies it in a one-vs-all fashion over every label.
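
Before looking at the full function, here is a tiny illustration of how the binary labels for one of the classifiers are built (a made-up example vector, not part of the dataset; as in the dataset, the label 10 stands for the digit 0):

y_example <- c(10, 3, 5, 10, 1)
as.numeric(y_example == 10) # binary labels for the "is this a 10?" classifier
# 1 0 0 1 0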

# Train a single regularized logistic regression classifier
TrainLogisticReg <- function(X, y, Lambda){
  # initialize theta at zero, one coefficient per column of X
  initial_theta <- if (is.vector(X)) c(0, 0) else rep(0, ncol(X))
  
  # minimize the regularized cost with BFGS, using the analytic gradient
  LogisticRec <- optim(par = initial_theta,
                       fn = CostFunction,
                       gr = GradFunction,
                       method = "BFGS",
                       X = X,
                       y = y,
                       Lambda = Lambda)
  
  print(paste("J", LogisticRec$value))
  return(LogisticRec$par)
}

# sanity check of the training wrapper on the raw labels
# (y still contains 1..10 here, not 0/1, which is why the reported cost is negative)
theta_result <- TrainLogisticReg(X, y, 0)
## [1] "J -78.1116869389638"
# One-vs-all function
# note that the column of 1s has already been added to X
OneVsAllFunction <- function(X, y, Lambda){
  
  y_cat <- unique(y) # the distinct labels present in y
  Predict_result <- matrix(nrow = nrow(X), ncol = length(y_cat)) # one column of predicted probabilities per label
  colnames(Predict_result) <- y_cat # name the columns after the labels
  
  Theta_result <- matrix(nrow = length(y_cat), ncol = ncol(X)) # one row of parameters per label; ncol(X) already includes the intercept column
  
  row.names(Theta_result) <- y_cat
  
  for(i in 1:length(y_cat)){ 
    y_temp <- as.numeric(y == y_cat[i]) # 1 for the current label, 0 for every other label
    
    Theta_temp <- TrainLogisticReg(X, y_temp, Lambda)
    Predict_result[ ,i] <- 1 / (1 + exp(- X %*% Theta_temp)) # probability that each example belongs to label y_cat[i]
    Theta_result[i, ] <- Theta_temp 
    
  }
  
  Predict_result <- as.data.frame(Predict_result)
  # the predicted label is the column with the highest probability
  Predict_result$Final <- max.col(Predict_result[ ,1:length(y_cat)], ties.method = "first")
  # convert the winning column index back to the corresponding label name
  Predict_result$Final <- colnames(Predict_result)[Predict_result$Final]

  Result <- list("Predict" = Predict_result, "Theta" = Theta_result) # create list of the result
  return(Result)
  
}

Predict for One-Vs-All

lambda <- 0.1
test <- OneVsAllFunction(X, y, lambda)
## [1] "J 0.0103494707209181"
## [1] "J 0.0148622440173195"
## [1] "J 0.0567794334766962"
## [1] "J 0.0629846884235192"
## [1] "J 0.0381953746039654"
## [1] "J 0.0614150237069662"
## [1] "J 0.0221935890719235"
## [1] "J 0.0349697783660666"
## [1] "J 0.0833928138934852"
## [1] "J 0.0770231714049791"
table(test$Predict$Final, y)
##     y
##        1   2   3   4   5   6   7   8   9  10
##   1  496   2   2   1   1   0   1   4   1   0
##   10   0   2   0   0   2   2   0   0   2 496
##   2    1 466  11   1   3   0   2   3   3   0
##   3    0   3 459   0  12   0   1   9   4   1
##   4    1   4   1 481   4   1   4   7   7   1
##   5    1   3  11   0 458   4   1   4   3   1
##   6    0   3   1   1   4 492   0   4   1   0
##   7    0   3   6   0   1   0 480   2  12   0
##   8    1  11   3   4   9   1   1 461   4   1
##   9    0   3   6  12   6   0  10   6 463   0
mean(test$Predict$Final == y)
## [1] 0.9504
# You should see that the training set accuracy is about 94.9%
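
As an optional check that is not part of the original exercise, the per-class accuracy can be read off the confusion matrix above; a minimal sketch, assuming the labels are 1 to 10 as in this dataset:

conf_mat <- table(test$Predict$Final, y)
# reorder the rows so they line up with the numeric column order 1..10,
# then divide the diagonal by the column totals (500 examples per class)
per_class <- diag(conf_mat[as.character(1:10), ]) / colSums(conf_mat)
round(per_class, 3)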

2 Neural Network

In the previous part of this exercise, you implemented multi-class logistic regression to recognize handwritten digits. However, logistic regression cannot form more complex hypotheses, as it is only a linear classifier.

In this part of the exercise, you will implement a neural network to recognize handwritten digits using the same training set as before. The neural network will be able to represent complex models that form non-linear hypotheses. For this week, you will be using parameters from a neural network that we have already trained. Your goal is to implement the feedforward propagation algorithm to use our weights for prediction. In next week’s exercise, you will write the backpropagation algorithm for learning the neural network parameters.
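
The network used here has three layers: 400 input units (the 20 x 20 pixel images) plus a bias unit, 25 hidden units, and 10 output units, one per label. A quick sanity check on the weight matrices loaded earlier (their expected shapes follow from that architecture):

dim(theta1) # expected: 25 401 (hidden layer weights, including the bias column)
dim(theta2) # expected: 10 26  (output layer weights, including the bias column)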

Predict

# PREDICT: predict the label of an input given a trained neural network.
# p = predict(Theta1, Theta2, X) outputs the predicted label of X given the
# trained weights of a neural network (Theta1, Theta2).
Theta1 <- ds_weight$Theta1
Theta2 <- ds_weight$Theta2

predict <- function(Theta1, Theta2, X){
  # X already contains the bias column of 1s added earlier
  A2 <- 1 / (1 + exp(- X %*% t(Theta1)))  # hidden layer activations
  A2 <- cbind(1, A2)                      # add the bias unit to the hidden layer
  A3 <- 1 / (1 + exp(- A2 %*% t(Theta2))) # output layer activations
  p <- apply(A3, 1, FUN = "which.max")    # predicted label = index of the largest output
  return(p)
}


pred <- predict(Theta1, Theta2, X)

# Training set accuracy; the exercise expects about 97.5%
print(mean(pred == y))
## [1] 0.9752
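
To mirror the last step of the original exercise, which displays one example at a time, the prediction for a single random training example can be inspected as well (a small illustrative sketch, not part of the output above; recall that the label 10 stands for the digit 0):

i <- sample(nrow(X), 1)                         # pick one training example at random
pred_i <- predict(Theta1, Theta2, X[i, , drop = FALSE])
print(c(predicted = pred_i, actual = y[i]))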