Introduction

In this exercise, you will implement the backpropagation algorithm for neural networks and apply it to the task of hand-written digit recognition.

Note: it was quite challenging to find a way to minimize the cost J (which is done in Octave with the fmincg function). After trying and failing with the optim function, which I had used previously for logistic regression and linear regression, I found a solution in this GitHub repository: https://github.com/faridcher/machine-learning-course/tree/master

1. Neural Networks

In the previous exercise, you implemented feedforward propagation for neural networks and used it to predict handwritten digits with the weights we provided. In this exercise, you will implement the backpropagation algorithm to learn the parameters for the neural network.

Loading data

library(R.matlab)
## Warning: package 'R.matlab' was built under R version 3.3.3
## R.matlab v3.6.1 (2016-10-19) successfully loaded. See ?R.matlab for help.
## 
## Attaching package: 'R.matlab'
## The following objects are masked from 'package:base':
## 
##     getOption, isOpen
ds_data <- readMat("~/R_projects/171213_exo4_ng/ex4data1.mat")
X <- ds_data$X
y <- ds_data$y
rm(ds_data)

#  In this part of the exercise, we load some pre-initialized neural network parameters.
ds_weight <- readMat("~/R_projects/171213_exo4_ng/ex4weights.mat")
Theta1 <- ds_weight$Theta1
Theta2 <- ds_weight$Theta2
rm(ds_weight)

# unroll parameters
nn_params <- c(as.vector(Theta1), as.vector(Theta2))  

# Setup the parameters you will use for this exercise
input_layer_size  <- 400  # 20x20 Input Images of Digits
hidden_layer_size <- 25   # 25 hidden units
num_labels <- 10          # 10 labels, from 1 to 10 (note that we have mapped "0" to label 10)
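
Before moving on, it can help to sanity-check the loaded objects. The snippet below is not part of the original exercise code; the expected sizes follow from the dataset (5000 training examples of 20x20 pixels) and the layer sizes defined above.

# Quick sanity check (illustrative, not in the original script)
dim(X)              # 5000 x 400: one row per example, one column per pixel
dim(Theta1)         # 25 x 401: hidden_layer_size x (input_layer_size + 1)
dim(Theta2)         # 10 x 26: num_labels x (hidden_layer_size + 1)
length(nn_params)   # 25*401 + 10*26 = 10285 unrolled parameters
table(y)            # 500 examples per label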

Useful functions: sigmoid, sigmoidGradient and random weight initialization

sigmoid <- function(z) {
  #SIGMOID Compute the sigmoid function
  #   g <- SIGMOID(z) computes the sigmoid of z element-wise,
  #   so z can be a scalar, a vector or a matrix.
  
  g <- 1 / (1 + exp(-z))
  g
}

sigmoidGradient <- function(z) {
  #SIGMOIDGRADIENT returns the gradient of the sigmoid function
  #evaluated at z
  #   g <- SIGMOIDGRADIENT(z) computes the gradient of the sigmoid function
  #   evaluated at z. This should work regardless if z is a matrix or a
  #   vector. In particular, if z is a vector or matrix, you should return
  #   the gradient for each element.
  
  g <- sigmoid(z) * (1 - sigmoid(z))
  g
}
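
The sigmoid gradient is easy to spot-check: it peaks at 0.25 for z = 0 and approaches 0 for large |z|. A quick illustrative check (not in the original code):

sigmoidGradient(0)               # 0.25, the maximum of the sigmoid derivative
sigmoidGradient(c(-10, 0, 10))   # near 0 at the extremes, 0.25 in the middle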

randInitializeWeights <- function(L_in, L_out) {
  #RANDINITIALIZEWEIGHTS Randomly initialize the weights of a layer with L_in
  #incoming connections and L_out outgoing connections
  #   W <- RANDINITIALIZEWEIGHTS(L_in, L_out) randomly initializes the weights
  #   of a layer with L_in incoming connections and L_out outgoing
  #   connections.
  #
  #   Note that W should be set to a matrix of size (L_out, 1 + L_in) as
  #   the first column of W handles the "bias" terms
  #
  
  # You need to return the following variables correctly
  W <- matrix(0,L_out, 1 + L_in)
  
  # ----------------------- YOUR CODE HERE -----------------------
  # Instructions: Initialize W randomly so that we break the symmetry while
  #               training the neural network.
  #
  # Note: The first column of W corresponds to the parameters for the bias units.
  # The exercise PDF (bottom of page 7) explains how to choose epsilon_init;
  # the rule of thumb is epsilon_init = sqrt(6) / sqrt(L_in + L_out).
  
  epsilon_init <- 0.12
  
  rnd <- runif(L_out * (1 + L_in))
  rnd <- matrix(rnd,L_out,1 + L_in)
  W <- rnd * 2 * epsilon_init - epsilon_init
  W
}
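
As mentioned in the comment above, the exercise PDF gives a rule of thumb for choosing epsilon_init, namely sqrt(6) / sqrt(L_in + L_out); the fixed value 0.12 is simply that quantity rounded for the 400-unit input layer and 25-unit hidden layer. An illustrative check (not in the original code):

sqrt(6) / sqrt(400 + 25)              # about 0.119, hence the rounded 0.12 above
dim(randInitializeWeights(400, 25))   # 25 x 401, including the bias column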

Cost function with feedforward and backpropagation

For the neural network, you should first implement the feedforward part, which returns the cost only. After implementing the feedforward pass to compute the cost, you can verify that your implementation is correct by checking that you get the same cost as the reference values for the fixed debugging parameters.
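
As a side note, the label expansion done with a for-loop in Step 1 of the function below can also be written in a vectorized way. A minimal, equivalent sketch (not the form used in the original code):

# Vectorized alternative to the Step 1 loop: index the one-hot matrix by (row, label) pairs
y_matrix <- matrix(0, nrow = length(y), ncol = num_labels)
y_matrix[cbind(1:length(y), y)] <- 1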

# NNCOSTFUNCTION Implements the neural network cost function for a two-layer neural network which performs classification
#    [J grad] = NNCOSTFUNCTION(nn_params, hidden_layer_size, num_labels, X, y, lambda)
#    computes the cost and gradient of the neural network. The parameters for the
#    neural network are "unrolled" into the vector nn_params and need to be converted back into the weight matrices.
#    The returned parameter grad should be an "unrolled" vector of the partial derivatives of the neural network.

nnCostFunction <- function(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, lambda){
  
  # useful value
  m <- nrow(X)
  
  # Reshape nn_params back into the parameters Theta1 and Theta2, the weight matrices for our 2 layer neural network
  Theta1 <- matrix(nn_params[1: (hidden_layer_size * (input_layer_size + 1))], 
                    nrow = hidden_layer_size,
                    ncol = input_layer_size + 1)

  Theta2 <- matrix(nn_params[((hidden_layer_size * (input_layer_size + 1))+1) : length(nn_params)],
                    nrow= num_labels,
                    ncol= hidden_layer_size + 1)

  # You need to return the following variables correctly 
  J <- 0
  Theta1_grad <-  matrix(0, nrow= dim(Theta1)[1], ncol= dim(Theta1)[2]) 
  Theta2_grad <- matrix(0, nrow= dim(Theta2)[1], ncol= dim(Theta2)[2]) 
  
  
  # Part 1: Feedforward the neural network and return the cost in the variable J. After implementing Part 1, you can verify that your cost function computation is correct by verifying the cost computed in ex4.m
  
  
  # Step 1 - Expand the 'y' label vector into a matrix of 0/1 values (one row per example, one column per label).
  y_matrix <- matrix(0, nrow = length(y), ncol = num_labels)
  
  for (i in 1:m){
    ans <- y[i]
    y_matrix[i,ans] <- 1 
  }

  # Step 2: implement forward propagation to get the hypothesis for every example x(i).
  # Add a column of 1s to X (the bias unit), then compute
  # the hypothesis a3 = g(z3) = g(Theta2 %*% g(Theta1 %*% x)).
  a1 <- cbind(1,X)
  z2 <- a1 %*% t(Theta1)
  a2 <- 1 / (1 + exp(- z2))
  a2 <- cbind(1, a2)
  z3 <- a2 %*% t(Theta2)
  a3 <- 1 / (1 + exp(- z3))
  
  # Step 3: compute the unregularized cost function J
  J <- - sum(y_matrix * log(a3)) - sum((1 - y_matrix) * log(1 - a3))
  J <- J / m
  
  # Step 4: add the regularization term (the bias columns of Theta1 and Theta2 are excluded)
  Theta1_b <- Theta1[ ,2: ncol(Theta1)]
  Theta2_b <- Theta2[ ,2: ncol(Theta2)]
  
  t1 <- sum(Theta1_b ^2)
  t2 <- sum(Theta2_b^2)

  J <- J + (lambda/(2 * m)) * (t1 + t2)
  print(J)
  
  #   Part 2: Implement the backpropagation algorithm to compute the gradients Theta1_grad and Theta2_grad. You should return the partial derivatives of the cost function with respect to Theta1 and Theta2 in Theta1_grad and Theta2_grad, respectively. After implementing Part 2, you can check that your implementation is correct by running checkNNGradients
  # 
  # Note: The vector y passed into the function is a vector of labels containing values from 1..K. You need to map this vector into a binary vector of 1's and 0's to be used with the neural network cost function.
  #                
  # Hint: We recommend implementing backpropagation using a for-loop over the training examples if you are implementing it for the first time.
  #  
  # Step 5 (backpropagation): compute the error terms ("deltas") for the output and hidden layers
  error3 <- a3 - y_matrix                                       # 5000 x 10
  error2 <- (error3 %*% Theta2) * sigmoidGradient(cbind(1, z2))
  error2 <- error2[, -1]                                        # drop the bias column

  # Step 6: accumulate the gradients and average over the m examples
  Theta1_grad <- Theta1_grad + t(error2) %*% a1 # 25*401
  Theta1_grad <- Theta1_grad / m
  
  Theta2_grad <- Theta2_grad + t(error3) %*% a2 #10*26
  Theta2_grad <- Theta2_grad / m
  
  Theta1_b <- cbind(0,Theta1_b)
  Theta2_b <- cbind(0,Theta2_b)
  
 # Part 3: Implement regularization with the cost function and gradients.
 # 
 #         Hint: You can implement this around the code for
 #               backpropagation. That is, you can compute the gradients for
 #               the regularization separately and then add them to Theta1_grad
 #               and Theta2_grad from Part 2.


  Theta1_grad <- Theta1_grad + lambda / m * Theta1_b
  Theta2_grad <- Theta2_grad + lambda / m * Theta2_b
  
  # Unroll the gradients into a single vector
  grad <- c(as.vector(Theta1_grad), as.vector(Theta2_grad))
  return(grad)
}

Testing J Cost:

  • Cost at the loaded (debug) parameters with lambda = 0: this value should be about 0.287629

  • Cost at the loaded (debug) parameters with lambda = 1 (regularized): this value should be about 0.383770

test <- nnCostFunction(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, 0)
## [1] 0.2876292
test <- nnCostFunction(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, 1)
## [1] 0.3837699

lbfgsb3_ function instead of fmincg

In order to use an R function similar to Octave's fmincg (to minimize J), we need the lbfgsb3 function from the lbfgsb3 package. The lbfgsb3_ function given below is a modified copy of it (rather than the version from the package) that raises the limit on the number of parameters allowed. This solution was found in the GitHub repository https://github.com/faridcher/machine-learning-course/tree/master.

Furthermore, in order to use this function, we need to split the cost and the gradient computation into two separate functions.

lbfgsb3_ <- function (prm, fn, gr = NULL, lower = -Inf, upper = Inf, control = list(), 
          ...)
{
  tasklist <- c("NEW_X", "START", "STOP", "FG", "ABNORMAL_TERMINATION_IN_LNSRCH", 
                "CONVERGENCE", "CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL", 
                "CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH", "ERROR: FTOL .LT. ZERO", 
                "ERROR: GTOL .LT. ZERO", "ERROR: INITIAL G .GE. ZERO", 
                "ERROR: INVALID NBD", "ERROR: N .LE. 0", "ERROR: NO FEASIBLE SOLUTION", 
                "ERROR: STP .GT. STPMAX", "ERROR: STP .LT. STPMIN", 
                "ERROR: STPMAX .LT. STPMIN", "ERROR: STPMIN .LT. ZERO", 
                "ERROR: XTOL .LT. ZERO", "FG_LNSRCH", "FG_START", "RESTART_FROM_LNSRCH", 
                "WARNING: ROUNDING ERRORS PREVENT PROGRESS", "WARNING: STP .eq. STPMAX", 
                "WARNING: STP .eq. STPMIN", "WARNING: XTOL TEST SATISFIED")
  ctrl <- list(maxit = 100, trace = 0, iprint = 0L)
  namc <- names(control)
  if (!all(namc %in% names(ctrl))) 
    stop("unknown names in control: ", namc[!(namc %in% 
                                                names(ctrl))])
  ctrl[namc] <- control
  iprint <- as.integer(ctrl$iprint)
  factr <- 1e+07
  pgtol <- 1e-05
  nmax <- 26260
  mmax <- 17L
  if (length(prm) > nmax) 
    stop("The number of parameters cannot exceed ", nmax)
  n <- as.integer(length(prm))
  m <- 5L
  nbd <- rep(2L, n)
  nwa <- 2 * mmax * nmax + 5 * nmax + 11 * mmax * mmax + 8 * 
    mmax
  wa <- rep(0, nwa)
  dsave <- rep(0, 29)
  lsave <- rep(TRUE, 4)
  isave <- rep(0L, 44)
  iwa <- rep(0L, 3 * nmax)
  csave <- ""
  if (length(lower) == 1) 
    lower <- rep(lower, n)
  if (length(upper) == 1) 
    upper <- rep(upper, n)
  bigval <- .Machine$double.xmax/10
  for (i in 1:n) {
    if (is.finite(lower[i])) {
      if (is.finite(upper[i])) 
        nbd[i] <- 2
      else {
        nbd[i] <- 1
        upper[i] <- bigval
      }
    }
    else {
      if (is.finite(upper[i])) {
        nbd[i] <- 3
        lower[i] <- -bigval
      }
      else {
        nbd[i] <- 0
        upper[i] <- bigval
        lower[i] <- -bigval
      }
    }
  }
  itask <- 2L
  task <- tasklist[itask]
  f <- .Machine$double.xmax/100
  g <- rep(f, n)
  icsave <- 0
  repeat {
    if (isave[34] > ctrl$maxit )
      break
    
    if (ctrl$trace >= 2) {
      cat("Before call, f=", f, "  task number ", itask, 
          " ")
      print(task)
    }
    possibleError <- tryCatch(
    result <- .Fortran("lbfgsb3", n = as.integer(n), m = as.integer(m), 
                       x = as.double(prm), l = as.double(lower), u = as.double(upper), 
                       nbd = as.integer(nbd), f = as.double(f), g = as.double(g), 
                       factr = as.double(factr), pgtol = as.double(pgtol), 
                       wa = as.double(wa), iwa = as.integer(iwa), itask = as.integer(itask), 
                       iprint = as.integer(iprint), icsave = as.integer(icsave), 
                       lsave = as.logical(lsave), isave = as.integer(isave), 
                       dsave = as.double(dsave))
    , error = function(e) e)
    
    if(inherits(possibleError, "error"))
      break
    
    
    itask <- result$itask
    icsave <- result$icsave
    prm <- result$x
    g <- result$g
    iwa <- result$iwa
    wa <- result$wa
    nbd <- result$nbd
    lsave <- result$lsave
    isave <- result$isave
    dsave <- result$dsave
    if (ctrl$trace > 2) {
      cat("returned from lbfgsb3\n")
      cat("returned itask is ", itask, "\n")
      task <- tasklist[itask]
      cat("changed task to ", task, "\n")
    }
    if (itask %in% c(4L, 20L, 21L)) {
      if (ctrl$trace >= 2) {
        cat("computing f and g at prm=")
        print(prm)
      }
      f <- fn(prm, ...)
      if (is.null(gr)) {
        g <- grad(fn, prm, ...)
      }
      else {
        g <- gr(prm, ...)
      }
      if (ctrl$trace > 0) {
        cat("At iteration ", isave[34], " f =", f)
        if (ctrl$trace > 1) {
          cat("max(abs(g))=", max(abs(g)))
        }
        cat("\n")
      }
    }
    else {
      if (itask == 1L) {
      }
      else break
    }
  }
  info <- list(task = task, itask = itask, lsave = lsave, 
               icsave = icsave, dsave = dsave, isave = isave)
  ans <- list(prm = prm, f = f, g = g, info = info)
  ans
}
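
The interface of lbfgsb3_ mirrors optim(): a starting parameter vector, a cost function fn, an optional gradient gr and a control list, and it returns a list with prm (the optimized parameters) and f (the final cost). As a quick illustration, here is a toy check on a simple quadratic; it assumes the lbfgsb3 package is installed, since the function above calls its compiled routine through .Fortran. The toy_fn and toy_gr names are made up for this example.

# Toy check (not part of the exercise): minimize (x - 3)^2 + (y + 1)^2
library(lbfgsb3)   # provides the compiled "lbfgsb3" routine used above
toy_fn <- function(p) sum((p - c(3, -1))^2)
toy_gr <- function(p) 2 * (p - c(3, -1))
toy <- lbfgsb3_(c(0, 0), fn = toy_fn, gr = toy_gr, control = list(maxit = 20))
toy$prm   # should be close to c(3, -1)
toy$f     # should be close to 0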

Cost function and Gradient function

CostFunction <- function(input_layer_size, hidden_layer_size, num_labels,X, y, lambda){
  
  function(nn_params) {
    # Reshape nn_params back into the parameters Theta1 and Theta2, the weight matrices for our 2 layer neural network
    Theta1 <- matrix(nn_params[1: (hidden_layer_size * (input_layer_size + 1))], 
                     nrow = hidden_layer_size,
                     ncol = input_layer_size + 1)
    
    Theta2 <- matrix(nn_params[((hidden_layer_size * (input_layer_size + 1))+1) : length(nn_params)],
                     nrow= num_labels,
                     ncol= hidden_layer_size + 1)
    
    m <- dim(X)[1]  
    J <- 0
    
    # Implement forward propagation to get the hypothesis for every example x(i):
    # add a column of 1s to X (the bias unit), then compute
    # the hypothesis a3 = g(z3) = g(Theta2 %*% g(Theta1 %*% x)).
    a1 <- cbind(1,X)
    z2 <- a1 %*% t(Theta1)
    a2 <- 1 / (1 + exp(- z2))
    a2 <- cbind(1, a2)
    z3 <- a2 %*% t(Theta2)
    a3 <- 1 / (1 + exp(- z3))
    
    
    # Compute the unregularized cost (y is expected here as the one-hot label matrix)
    J <- - sum(y * log(a3)) - sum((1 - y) * log(1 - a3))
    J <- J / m
    
    # add the regularization term  
    Theta1_b <- Theta1[ ,2: ncol(Theta1)]
    Theta2_b <- Theta2[ ,2: ncol(Theta2)]
    
    
    t1 <- sum(Theta1_b ^2)
    t2 <- sum(Theta2_b^2)
    
    J <- J + (lambda/(2 * m)) * (t1 + t2)
    return(J)
  }
}



GradFunction <- function(input_layer_size, hidden_layer_size, num_labels,
                         X, y, lambda){
  
  function(nn_params) {
    # Reshape nn_params back into the parameters Theta1 and Theta2, the weight matrices
    # for our 2 layer neural network
    Theta1 <- matrix(nn_params[1: (hidden_layer_size * (input_layer_size + 1))], 
                     nrow = hidden_layer_size,
                     ncol = input_layer_size + 1)
    
    Theta2 <- matrix(nn_params[((hidden_layer_size * (input_layer_size + 1))+1) : length(nn_params)],
                     nrow= num_labels,
                     ncol= hidden_layer_size + 1)
    
    
    Theta1_grad <-  matrix(0, nrow= dim(Theta1)[1], ncol= dim(Theta1)[2]) 
    Theta2_grad <- matrix(0, nrow= dim(Theta2)[1], ncol= dim(Theta2)[2]) 
    
    m <- dim(X)[1]  
    
    
    # Implement forward propagation to get the hypothesis for every example x(i):
    # add a column of 1s to X (the bias unit), then compute
    # the hypothesis a3 = g(z3) = g(Theta2 %*% g(Theta1 %*% x)).
    a1 <- cbind(1,X)
    z2 <- a1 %*% t(Theta1)
    a2 <- 1 / (1 + exp(- z2))
    a2 <- cbind(1, a2)
    z3 <- a2 %*% t(Theta2)
    a3 <- 1 / (1 + exp(- z3))
    
    # calculate the error terms ("deltas"); y is expected here as the one-hot label matrix
    error3 <- a3 - y                                              # 5000 x 10
    error2 <- (error3 %*% Theta2) * sigmoidGradient(cbind(1, z2))
    error2 <- error2[, -1]                                        # drop the bias column
    
    
    # accumulate gradients
    Theta1_grad <- Theta1_grad + t(error2) %*% a1 # 25*401
    Theta1_grad <- Theta1_grad / m
    
    Theta2_grad <- Theta2_grad + t(error3) %*% a2 #10*26
    Theta2_grad <- Theta2_grad / m
    
    # Part 3: Implement regularization with the cost function and gradients.
    # 
    #         Hint: You can implement this around the code for
    #               backpropagation. That is, you can compute the gradients for
    #               the regularization separately and then add them to Theta1_grad
    #               and Theta2_grad from Part 2.
    
    # calculate regularized gradient
    Theta1_grad <- Theta1_grad + lambda / m * cbind(rep(0,dim(Theta1)[1]), Theta1[,-1])
    Theta2_grad <- Theta2_grad + lambda / m * cbind(rep(0,dim(Theta2)[1]), Theta2[,-1])
    
    #unroll gradients
    grad <- c(as.vector(Theta1_grad), as.vector(Theta2_grad))
    return(grad)
  }
}

Training the Neural Network

Code adapted from https://github.com/faridcher/machine-learning-course/tree/master. To make the optimizer work, the author created two "short hand" functions (closures) that take only nn_params as an argument. Note that both closures expect the one-hot label matrix y_matrix built below, so it is passed to both of them.

library(lbfgsb3)
## Loading required package: numDeriv
initial_Theta1 <- randInitializeWeights(input_layer_size, hidden_layer_size)
initial_Theta2 <- randInitializeWeights(hidden_layer_size, num_labels)

lambda <- 0
# Expand the 'y' label vector into a one-hot matrix (one row per example, one column per label)
y_matrix <- matrix(0, nrow = length(y), ncol = num_labels)

for (i in 1:nrow(X)){
  ans <- y[i]
  y_matrix[i,ans] <- 1 
}

# Unroll parameters
initial_nn_params <- c(initial_Theta1,initial_Theta2)

# Create "short hand" for the cost function to be minimized
costF <- CostFunction(input_layer_size, hidden_layer_size, 
                                   num_labels, X, y_matrix, lambda) #over nn_params

gradF <- GradFunction(input_layer_size, hidden_layer_size, 
                               num_labels, X, y_matrix, lambda) #over nn_params

# Now, costF and gradF are functions that take in only one argument (the
# neural network parameters)
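
Before launching the full training run, two optional sanity checks can be done (neither is part of the original script). First, the cost closure evaluated at the provided pre-trained weights should reproduce the lambda = 0 value obtained above with nnCostFunction. Second, on a tiny random network the analytic gradient from GradFunction should agree with a numerical gradient of CostFunction (this plays the role of the checkNNGradients step mentioned earlier); numDeriv is already loaded as a dependency of lbfgsb3, and all the tiny_* objects are made up for this check.

# Check 1: reproduce the cost computed earlier (lambda = 0, provided weights)
costF(nn_params)            # expected: about 0.287629
length(gradF(nn_params))    # expected: 25*401 + 10*26 = 10285, one entry per parameter

# Check 2: analytic vs numerical gradient on a small random network
set.seed(1)
tiny_il <- 3; tiny_hl <- 4; tiny_nl <- 3; tiny_m <- 5
tiny_X <- matrix(runif(tiny_m * tiny_il), tiny_m, tiny_il)
tiny_y <- matrix(0, tiny_m, tiny_nl)
tiny_y[cbind(1:tiny_m, sample(1:tiny_nl, tiny_m, replace = TRUE))] <- 1  # one-hot labels
tiny_theta <- c(randInitializeWeights(tiny_il, tiny_hl), randInitializeWeights(tiny_hl, tiny_nl))
tiny_cost <- CostFunction(tiny_il, tiny_hl, tiny_nl, tiny_X, tiny_y, 1)
tiny_grad <- GradFunction(tiny_il, tiny_hl, tiny_nl, tiny_X, tiny_y, 1)
max(abs(tiny_grad(tiny_theta) - numDeriv::grad(tiny_cost, tiny_theta)))  # should be very small (around 1e-9)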

# lbfgsb3 works like fmincg (fast)

# After you have completed the assignment, change the maxit to a larger
# value to see how more training helps.
opt <- lbfgsb3_(initial_nn_params, fn= costF, gr=gradF,
                  control = list(trace=1,maxit=50))
## This problem is unconstrained.
## At iteration  0  f = 6.976408
## At iteration  2  f = 4.276925
## At iteration  3  f = 3.252908
## At iteration  4  f = 3.244889
## At iteration  5  f = 3.236283
## At iteration  6  f = 3.206265
## At iteration  7  f = 3.081171
## At iteration  8  f = 3.14935
## At iteration  9  f = 2.872306
## At iteration  10  f = 2.581716
## At iteration  11  f = 2.394459
## At iteration  12  f = 1.95672
## At iteration  13  f = 1.80206
## At iteration  14  f = 1.636209
## At iteration  15  f = 1.47261
## At iteration  16  f = 1.25506
## At iteration  17  f = 1.126324
## At iteration  18  f = 1.041125
## At iteration  19  f = 0.9431381
## At iteration  20  f = 0.8774341
## At iteration  21  f = 0.8264804
## At iteration  22  f = 0.769674
## At iteration  23  f = 0.730318
## At iteration  24  f = 0.6919751
## At iteration  25  f = 0.6407558
## At iteration  26  f = 0.6163657
## At iteration  27  f = 0.5740396
## At iteration  28  f = 0.5558077
## At iteration  29  f = 0.5227216
## At iteration  30  f = 0.5131235
## At iteration  31  f = 0.4795803
## At iteration  32  f = 0.4668565
## At iteration  33  f = 0.4524015
## At iteration  34  f = 0.4304622
## At iteration  35  f = 0.4288251
## At iteration  36  f = 0.41672
## At iteration  37  f = 0.4007615
## At iteration  38  f = 0.383439
## At iteration  39  f = 0.3635975
## At iteration  40  f = 0.3460794
## At iteration  41  f = 0.3382101
## At iteration  42  f = 0.3251173
## At iteration  43  f = 0.3145473
## At iteration  44  f = 0.3040788
## At iteration  45  f = 0.2840729
## At iteration  46  f = 0.2773529
## At iteration  47  f = 0.2674028
## At iteration  48  f = 0.2580811
## At iteration  49  f = 0.2502549
## At iteration  50  f = 0.2364219
## At iteration  51  f = 0.2396558
nn_params <- opt$prm
cost <- opt$f

# Obtain Theta1 and Theta2 back from nn_params
Theta1 <- matrix(nn_params[1:(hidden_layer_size * (input_layer_size + 1))],
                 hidden_layer_size, (input_layer_size + 1))

Theta2 <- matrix(nn_params[(1 + (hidden_layer_size * (input_layer_size + 1))):length(nn_params)],
                 num_labels, (hidden_layer_size + 1))

Predict with lambda = 0

predict <- function(Theta1, Theta2, X){
  #PREDICT Predict the label of an input given a trained neural network
  # p = PREDICT(Theta1, Theta2, X) outputs the predicted label of X given the
  #  trained weights of a neural network (Theta1, Theta2)
  
  # Useful values
  m <- nrow(X)
  num_labels <- nrow(Theta2)
                     
  # You need to return the following variables correctly 
  p <- rep(0, m)
  
  h1 <- sigmoid(cbind(1, X) %*% t(Theta1))
  h2 <- sigmoid(cbind(1, h1) %*% t(Theta2))
  p <- apply(h2,1,which.max)
  return(p)
}
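
The function can be applied to the whole data set (as done just below) or to a single image, as long as the row is kept as a 1 x 400 matrix; remember that label 10 stands for the digit "0" (see the setup above). A small illustrative example (not in the original script):

# Predict one example; drop = FALSE keeps X[1, ] as a 1 x 400 matrix
p1 <- predict(Theta1, Theta2, X[1, , drop = FALSE])
p1                        # predicted label, between 1 and 10
ifelse(p1 == 10, 0, p1)   # mapped back to the digit, with label 10 meaning "0"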

pred <- predict(Theta1, Theta2, X)
# accuracy
print(mean(pred==y) * 100)
## [1] 96.86
# confusion matrix
table(pred, y)
##     y
## pred   1   2   3   4   5   6   7   8   9  10
##   1  494   1   2   1   1   0   2   1   0   0
##   2    1 486   4   1   2   0   2   3   0   0
##   3    0   0 463   0   2   0   1   2   5   2
##   4    0   2   0 491   0   1   4   1   4   1
##   5    1   3  22   0 490   6   0  10   5   5
##   6    0   1   0   2   2 492   0   1   1   0
##   7    1   0   3   0   0   0 485   3   5   0
##   8    3   4   5   0   1   0   0 478   3   2
##   9    0   1   1   4   1   0   6   1 475   1
##   10   0   2   0   1   1   1   0   0   2 489
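
The confusion matrix can also be summarized as a per-class accuracy (correct predictions divided by the number of examples of each true label); a short sketch, not in the original script:

cm <- table(pred, y)
round(diag(cm) / colSums(cm) * 100, 1)   # per-class accuracy in percent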

Predict with lambda = 10

lambda <- 10

# Create "short hand" for the cost function to be minimized
costF <- CostFunction(input_layer_size, hidden_layer_size, 
                                   num_labels, X, y_matrix, lambda) #over nn_params

gradF <- GradFunction(input_layer_size, hidden_layer_size, 
                               num_labels, X, y_matrix, lambda) #over nn_params

opt <- lbfgsb3_(initial_nn_params, fn= costF, gr=gradF,
                  control = list(trace=1,maxit=50))
## This problem is unconstrained.
## At iteration  0  f = 7.025619
## At iteration  2  f = 4.327071
## At iteration  3  f = 3.306176
## At iteration  4  f = 3.298132
## At iteration  5  f = 3.289381
## At iteration  6  f = 3.260188
## At iteration  7  f = 3.152471
## At iteration  8  f = 3.131547
## At iteration  9  f = 2.721189
## At iteration  10  f = 2.351989
## At iteration  11  f = 2.081654
## At iteration  12  f = 1.982127
## At iteration  13  f = 1.888246
## At iteration  14  f = 1.695554
## At iteration  15  f = 1.612727
## At iteration  16  f = 1.530456
## At iteration  17  f = 1.494745
## At iteration  18  f = 1.418897
## At iteration  19  f = 1.394754
## At iteration  20  f = 1.348353
## At iteration  21  f = 1.329708
## At iteration  22  f = 1.299886
## At iteration  23  f = 1.270708
## At iteration  24  f = 1.256065
## At iteration  25  f = 1.21982
## At iteration  26  f = 1.205858
## At iteration  27  f = 1.184341
## At iteration  28  f = 1.168228
## At iteration  29  f = 1.150984
## At iteration  30  f = 1.137874
## At iteration  31  f = 1.127415
## At iteration  32  f = 1.118114
## At iteration  33  f = 1.109182
## At iteration  34  f = 1.099622
## At iteration  35  f = 1.097045
## At iteration  36  f = 1.085889
## At iteration  37  f = 1.081499
## At iteration  38  f = 1.076358
## At iteration  39  f = 1.073028
## At iteration  40  f = 1.068067
## At iteration  41  f = 1.063618
## At iteration  42  f = 1.059889
## At iteration  43  f = 1.059816
## At iteration  44  f = 1.057212
## At iteration  45  f = 1.053664
## At iteration  46  f = 1.051392
## At iteration  47  f = 1.048636
## At iteration  48  f = 1.04656
## At iteration  49  f = 1.043939
## At iteration  50  f = 1.042195
## At iteration  51  f = 1.040808
nn_params <- opt$prm
cost <- opt$f

# Obtain Theta1 and Theta2 back from nn_params
Theta1 <- matrix(nn_params[1:(hidden_layer_size * (input_layer_size + 1))],
                 hidden_layer_size, (input_layer_size + 1))

Theta2 <- matrix(nn_params[(1 + (hidden_layer_size * (input_layer_size + 1))):length(nn_params)],
                 num_labels, (hidden_layer_size + 1))

pred <- predict(Theta1, Theta2, X)
# accuracy
print(mean(pred==y) * 100)
## [1] 93.44
# confusion matrix
table(pred, y)
##     y
## pred   1   2   3   4   5   6   7   8   9  10
##   1  488   5   4   2   2   3  11   7   2   0
##   2    1 448  14   4   2   1   2   4   1   0
##   3    0   4 449   0  14   0   0   5   5   2
##   4    1   9   0 474   8   1   8   4   6   2
##   5    2   1  17   0 453   8   0   7   3   0
##   6    1   4   1   6   8 483   0   7   1   2
##   7    2   9   7   0   0   1 467   1  12   1
##   8    4  12   5   2   3   1   0 459   2   3
##   9    1   1   2  12   4   0   9   5 462   1
##   10   0   7   1   0   6   2   3   1   6 489