In this exercise, you will implement the backpropagation algorithm for neural networks and apply it to the task of hand-written digit recognition.
Note: it was quite challenging to find a way to minimize the cost J (which is done in Octave with the fmincg function). After trying and failing with the optim function, which I had used previously for logistic regression and linear regression, I found a solution in this GitHub repository: https://github.com/faridcher/machine-learning-course/tree/master
In the previous exercise, you implemented feedforward propagation for neural networks and used it to predict handwritten digits with the weights we provided. In this exercise, you will implement the backpropagation algorithm to learn the parameters for the neural network.
library(R.matlab)
## Warning: package 'R.matlab' was built under R version 3.3.3
## R.matlab v3.6.1 (2016-10-19) successfully loaded. See ?R.matlab for help.
##
## Attaching package: 'R.matlab'
## The following objects are masked from 'package:base':
##
## getOption, isOpen
ds_data <- readMat("~/R_projects/171213_exo4_ng/ex4data1.mat")
X <- ds_data$X
y <- ds_data$y
rm(ds_data)
# In this part of the exercise, we load some pre-initialized neural network parameters.
ds_weight <- readMat("~/R_projects/171213_exo4_ng/ex4weights.mat")
Theta1 <- ds_weight$Theta1
Theta2 <- ds_weight$Theta2
rm(ds_weight)
# unroll parameters
nn_params <- c(as.vector(Theta1), as.vector(Theta2))
# Setup the parameters you will use for this exercise
input_layer_size <- 400 # 20x20 Input Images of Digits
hidden_layer_size <- 25 # 25 hidden units
num_labels <- 10 # 10 labels, from 1 to 10 (note that we have mapped "0" to label 10)
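As a quick sanity check (my own addition, not part of the original exercise), the length of the unrolled parameter vector should match the layer sizes just defined: 25 * 401 + 10 * 26 = 10285.
# Check (not in the original exercise): number of unrolled parameters
length(nn_params) # should equal hidden_layer_size*(input_layer_size+1) + num_labels*(hidden_layer_size+1) = 10285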
sigmoid <- function(z) {
#SIGMOID Compute the sigmoid function
# g <- SIGMOID(z) computes the sigmoid of z, element-wise for vectors and matrices.
g <- 1 / (1 + exp(-z))
g
}
sigmoidGradient <- function(z) {
#SIGMOIDGRADIENT returns the gradient of the sigmoid function
#evaluated at z
# g <- SIGMOIDGRADIENT(z) computes the gradient of the sigmoid function
# evaluated at z. This works whether z is a scalar, a vector or a matrix:
# the gradient is returned element-wise.
g <- sigmoid(z) * (1 - sigmoid(z))
g
}
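A quick way to convince yourself that both helpers behave as expected (my own check, not part of the assignment): the sigmoid of 0 is 0.5, its gradient reaches its maximum of 0.25 there, and both functions work element-wise on vectors and matrices.
# Sanity check (not part of the assignment)
sigmoid(0) # 0.5
sigmoidGradient(0) # 0.25, the maximum of the gradient
sigmoidGradient(c(-1, -0.5, 0, 0.5, 1)) # element-wise, symmetric around 0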
randInitializeWeights <- function(L_in, L_out) {
#RANDINITIALIZEWEIGHTS Randomly initialize the weights of a layer with L_in
#incoming connections and L_out outgoing connections
# W <- RANDINITIALIZEWEIGHTS(L_in, L_out) randomly initializes the weights
# of a layer with L_in incoming connections and L_out outgoing
# connections.
#
# Note that W should be set to a matrix of size (L_out, 1 + L_in), as
# the first column of W handles the "bias" terms
#
# You need to return the following variables correctly
W <- matrix(0,L_out, 1 + L_in)
# ----------------------- YOUR CODE HERE -----------------------
# Instructions: Initialize W randomly so that we break the symmetry while
# training the neural network.
#
# Note: the first column of W corresponds to the parameters for the bias units.
# The exercise PDF (bottom of page 7) explains how to choose epsilon_init.
epsilon_init <- 0.12
rnd <- runif(L_out * (1 + L_in))
rnd <- matrix(rnd,L_out,1 + L_in)
W <- rnd * 2 * epsilon_init - epsilon_init
W
}
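For example, the two weight matrices of this network can be initialized as below (my own illustration; the set.seed call is only there for reproducibility). The value epsilon_init = 0.12 used above matches the strategy suggested in the PDF, epsilon_init = sqrt(6) / sqrt(L_in + L_out), which gives roughly 0.12 for the 400 -> 25 layer.
# Example usage (my own illustration, not part of the original script)
set.seed(1) # only for reproducibility
W1 <- randInitializeWeights(input_layer_size, hidden_layer_size)
W2 <- randInitializeWeights(hidden_layer_size, num_labels)
dim(W1) # 25 x 401
dim(W2) # 10 x 26
range(W1) # all values lie within (-0.12, 0.12)
sqrt(6) / sqrt(input_layer_size + hidden_layer_size) # approximately 0.119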
Now we implement the cost function and gradient for the neural network. You should first start by implementing the feedforward part of the neural network that returns the cost only. After implementing the feedforward pass to compute the cost, you can verify that your implementation is correct by checking that you get the same cost as the reference values given in the exercise for the fixed debugging parameters.
# NNCOSTFUNCTION Implements the neural network cost function for a two-layer neural network which performs classification
# [J grad] = NNCOSTFUNCTION(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, lambda) computes the cost and gradient of the neural network.
# The parameters for the neural network are "unrolled" into the vector nn_params and need to be converted back into the weight matrices.
# The returned parameter grad should be an "unrolled" vector of the partial derivatives of the neural network.
nnCostFunction <- function(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, lambda){
# useful value
m <- nrow(X)
# Reshape nn_params back into the parameters Theta1 and Theta2, the weight matrices for our 2 layer neural network
Theta1 <- matrix(nn_params[1: (hidden_layer_size * (input_layer_size + 1))],
nrow = hidden_layer_size,
ncol = input_layer_size + 1)
Theta2 <- matrix(nn_params[((hidden_layer_size * (input_layer_size + 1))+1) : length(nn_params)],
nrow= num_labels,
ncol= hidden_layer_size + 1)
# You need to return the following variables correctly
J <- 0
Theta1_grad <- matrix(0, nrow= dim(Theta1)[1], ncol= dim(Theta1)[2])
Theta2_grad <- matrix(0, nrow= dim(Theta2)[1], ncol= dim(Theta2)[2])
# Part 1: Feedforward the neural network and return the cost in the variable J. After implementing Part 1, you can verify that your cost function computation is correct by checking the values printed below against the reference costs given in the exercise.
# Step 1: expand the 'y' labels into a binary (one-hot) matrix, one row per example and one column per class.
y_matrix <- matrix(0, nrow = length(y), ncol = num_labels)
for (i in 1:m){
ans <- y[i]
y_matrix[i,ans] <- 1
}
# Step 2: implement forward propagation to get the hypothesis for every x(i)
# add a column of 1s to X (bias unit)
# the hypothesis is a3 = g(z3) = g(theta2 * g(theta1 * x))
a1 <- cbind(1,X)
z2 <- a1 %*% t(Theta1)
a2 <- 1 / (1 + exp(- z2))
a2 <- cbind(1, a2)
z3 <- a2 %*% t(Theta2)
a3 <- 1 / (1 + exp(- z3))
# Step 3: compute the (unregularized) cost function
J <- - sum(y_matrix * log(a3)) - sum((1 - y_matrix) * log(1 - a3))
J <- J / m
# Step 4: add the regularization term
Theta1_b <- Theta1[ ,2: ncol(Theta1)]
Theta2_b <- Theta2[ ,2: ncol(Theta2)]
t1 <- sum(Theta1_b ^2)
t2 <- sum(Theta2_b^2)
J <- J + (lambda/(2 * m)) * (t1 + t2)
print(J)
# Part 2: Implement the backpropagation algorithm to compute the gradients Theta1_grad and Theta2_grad. You should return the partial derivatives of the cost function with respect to Theta1 and Theta2 in Theta1_grad and Theta2_grad, respectively. After implementing Part 2, you can check that your implementation is correct by running checkNNGradients
#
# Note: The vector y passed into the function is a vector of labels containing values from 1..K. You need to map this vector into a binary vector of 1's and 0's to be used with the neural network cost function.
#
# Hint: We recommend implementing backpropagation using a for-loop over the training examples if you are implementing it for the first time.
#
error3 <- a3 - y_matrix # 5000 x 10
error2 <- (error3 %*% Theta2) * sigmoidGradient(cbind(rep(1, dim(z2)[1]), z2))
error2 <- error2[ ,-1] # drop the bias column: 5000 x 25
# Accumulate the gradients and average over the m examples
Theta1_grad <- Theta1_grad + t(error2) %*% a1 # 25*401
Theta1_grad <- Theta1_grad / m
Theta2_grad <- Theta2_grad + t(error3) %*% a2 #10*26
Theta2_grad <- Theta2_grad / m
# prepend a column of zeros so that the bias weights are not regularized
Theta1_b <- cbind(0, Theta1_b)
Theta2_b <- cbind(0, Theta2_b)
# Part 3: Implement regularization with the cost function and gradients.
#
# Hint: You can implement this around the code for
# backpropagation. That is, you can compute the gradients for
# the regularization separately and then add them to Theta1_grad
# and Theta2_grad from Part 2.
Theta1_grad <- Theta1_grad + lambda / m * Theta1_b
Theta2_grad <- Theta2_grad + lambda / m * Theta2_b
# Unroll the gradients into a single vector
grad <- c(as.vector(Theta1_grad), as.vector(Theta2_grad))
return(grad)
}
Testing the cost J:
Cost at the loaded debugging parameters with lambda = 0: this value should be about 0.287629
Cost at the loaded debugging parameters with lambda = 1: this value should be about 0.383770
test <- nnCostFunction(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, 0)
## [1] 0.2876292
test <- nnCostFunction(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, 1)
## [1] 0.3837699
In order to use an R function similar to Octave's fmincg (to minimize J), we need the lbfgsb3 function from the lbfgsb3 package. The lbfgsb3_ function below is a copy of the package function (instead of using it straight from the library), with the limit on the number of parameters (nmax) raised so that the 10,285 unrolled parameters of this network fit. This solution was found in the GitHub repository https://github.com/faridcher/machine-learning-course/tree/master.
Furthermore, in order to use this function, we need to split the cost and the gradient into two separate functions.
lbfgsb3_ <- function (prm, fn, gr = NULL, lower = -Inf, upper = Inf, control = list(),
...)
{
tasklist <- c("NEW_X", "START", "STOP", "FG", "ABNORMAL_TERMINATION_IN_LNSRCH",
"CONVERGENCE", "CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL",
"CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH", "ERROR: FTOL .LT. ZERO",
"ERROR: GTOL .LT. ZERO", "ERROR: INITIAL G .GE. ZERO",
"ERROR: INVALID NBD", "ERROR: N .LE. 0", "ERROR: NO FEASIBLE SOLUTION",
"ERROR: STP .GT. STPMAX", "ERROR: STP .LT. STPMIN",
"ERROR: STPMAX .LT. STPMIN", "ERROR: STPMIN .LT. ZERO",
"ERROR: XTOL .LT. ZERO", "FG_LNSRCH", "FG_START", "RESTART_FROM_LNSRCH",
"WARNING: ROUNDING ERRORS PREVENT PROGRESS", "WARNING: STP .eq. STPMAX",
"WARNING: STP .eq. STPMIN", "WARNING: XTOL TEST SATISFIED")
ctrl <- list(maxit = 100, trace = 0, iprint = 0L)
namc <- names(control)
if (!all(namc %in% names(ctrl)))
stop("unknown names in control: ", namc[!(namc %in%
names(ctrl))])
ctrl[namc] <- control
iprint <- as.integer(ctrl$iprint)
factr <- 1e+07
pgtol <- 1e-05
nmax <- 26260
mmax <- 17L
if (length(prm) > nmax)
stop("The number of parameters cannot exceed nmax = ", nmax)
n <- as.integer(length(prm))
m <- 5L
nbd <- rep(2L, n)
nwa <- 2 * mmax * nmax + 5 * nmax + 11 * mmax * mmax + 8 *
mmax
wa <- rep(0, nwa)
dsave <- rep(0, 29)
lsave <- rep(TRUE, 4)
isave <- rep(0L, 44)
iwa <- rep(0L, 3 * nmax)
csave <- ""
if (length(lower) == 1)
lower <- rep(lower, n)
if (length(upper) == 1)
upper <- rep(upper, n)
bigval <- .Machine$double.xmax/10
for (i in 1:n) {
if (is.finite(lower[i])) {
if (is.finite(upper[i]))
nbd[i] <- 2
else {
nbd[i] <- 1
upper[i] <- bigval
}
}
else {
if (is.finite(upper[i])) {
nbd[i] <- 3
lower[i] <- -bigval
}
else {
nbd[i] <- 0
upper[i] <- bigval
lower[i] <- -bigval
}
}
}
itask <- 2L
task <- tasklist[itask]
f <- .Machine$double.xmax/100
g <- rep(f, n)
icsave <- 0
repeat {
if (isave[34] > ctrl$maxit )
break
if (ctrl$trace >= 2) {
cat("Before call, f=", f, " task number ", itask,
" ")
print(task)
}
possibleError <- tryCatch(
result <- .Fortran("lbfgsb3", n = as.integer(n), m = as.integer(m),
x = as.double(prm), l = as.double(lower), u = as.double(upper),
nbd = as.integer(nbd), f = as.double(f), g = as.double(g),
factr = as.double(factr), pgtol = as.double(pgtol),
wa = as.double(wa), iwa = as.integer(iwa), itask = as.integer(itask),
iprint = as.integer(iprint), icsave = as.integer(icsave),
lsave = as.logical(lsave), isave = as.integer(isave),
dsave = as.double(dsave))
, error = function(e) e)
if(inherits(possibleError, "error"))
break
itask <- result$itask
icsave <- result$icsave
prm <- result$x
g <- result$g
iwa <- result$iwa
wa <- result$wa
nbd <- result$nbd
lsave <- result$lsave
isave <- result$isave
dsave <- result$dsave
if (ctrl$trace > 2) {
cat("returned from lbfgsb3\n")
cat("returned itask is ", itask, "\n")
task <- tasklist[itask]
cat("changed task to ", task, "\n")
}
if (itask %in% c(4L, 20L, 21L)) {
if (ctrl$trace >= 2) {
cat("computing f and g at prm=")
print(prm)
}
f <- fn(prm, ...)
if (is.null(gr)) {
g <- grad(fn, prm, ...)
}
else {
g <- gr(prm, ...)
}
if (ctrl$trace > 0) {
cat("At iteration ", isave[34], " f =", f)
if (ctrl$trace > 1) {
cat("max(abs(g))=", max(abs(g)))
}
cat("\n")
}
}
else {
if (itask == 1L) {
}
else break
}
}
info <- list(task = task, itask = itask, lsave = lsave,
icsave = icsave, dsave = dsave, isave = isave)
ans <- list(prm = prm, f = f, g = g, info = info)
}
# Returns a closure over nn_params that computes the (regularized) cost only.
# Note: y is expected to be the binary (one-hot) label matrix (y_matrix below).
CostFunction <- function(input_layer_size, hidden_layer_size, num_labels, X, y, lambda){
function(nn_params) {
# Reshape nn_params back into the parameters Theta1 and Theta2, the weight matrices for our 2 layer neural network
Theta1 <- matrix(nn_params[1: (hidden_layer_size * (input_layer_size + 1))],
nrow = hidden_layer_size,
ncol = input_layer_size + 1)
Theta2 <- matrix(nn_params[((hidden_layer_size * (input_layer_size + 1))+1) : length(nn_params)],
nrow= num_labels,
ncol= hidden_layer_size + 1)
m <- dim(X)[1]
J <- 0
# implement forward propagation to get the hypothesis for every x(i)
# add a column of 1s to X (bias unit)
# the hypothesis is a3 = g(z3) = g(theta2 * g(theta1 * x))
a1 <- cbind(1,X)
z2 <- a1 %*% t(Theta1)
a2 <- 1 / (1 + exp(- z2))
a2 <- cbind(1, a2)
z3 <- a2 %*% t(Theta2)
a3 <- 1 / (1 + exp(- z3))
# compute the cost function; y is the one-hot label matrix
J <- - sum(y * log(a3)) - sum((1 - y) * log(1 - a3))
J <- J / m
# add the regularization term
Theta1_b <- Theta1[ ,2: ncol(Theta1)]
Theta2_b <- Theta2[ ,2: ncol(Theta2)]
t1 <- sum(Theta1_b ^2)
t2 <- sum(Theta2_b^2)
J <- J + (lambda/(2 * m)) * (t1 + t2)
return(J)
}
}
# Returns a closure over nn_params that computes the (regularized) gradient only.
# Note: y is expected to be the binary (one-hot) label matrix (y_matrix below).
GradFunction <- function(input_layer_size, hidden_layer_size, num_labels,
X, y, lambda){
function(nn_params) {
# Reshape nn_params back into the parameters Theta1 and Theta2, the weight matrices
# for our 2 layer neural network
Theta1 <- matrix(nn_params[1: (hidden_layer_size * (input_layer_size + 1))],
nrow = hidden_layer_size,
ncol = input_layer_size + 1)
Theta2 <- matrix(nn_params[((hidden_layer_size * (input_layer_size + 1))+1) : length(nn_params)],
nrow= num_labels,
ncol= hidden_layer_size + 1)
Theta1_grad <- matrix(0, nrow= dim(Theta1)[1], ncol= dim(Theta1)[2])
Theta2_grad <- matrix(0, nrow= dim(Theta2)[1], ncol= dim(Theta2)[2])
m <- dim(X)[1]
# implement forward propagation to get the hypothesis for every x(i)
# add a column of 1s to X (bias unit)
# the hypothesis is a3 = g(z3) = g(theta2 * g(theta1 * x))
a1 <- cbind(1,X)
z2 <- a1 %*% t(Theta1)
a2 <- 1 / (1 + exp(- z2))
a2 <- cbind(1, a2)
z3 <- a2 %*% t(Theta2)
a3 <- 1 / (1 + exp(- z3))
# calculate the error terms ("sigmas") for the output and hidden layers
error3 <- a3 - y # 5000 x 10
error2 <- (error3 %*% Theta2) * sigmoidGradient(cbind(rep(1, dim(z2)[1]), z2))
error2 <- error2[ ,-1] # drop the bias column
# accumulate gradients
Theta1_grad <- Theta1_grad + t(error2) %*% a1 # 25*401
Theta1_grad <- Theta1_grad / m
Theta2_grad <- Theta2_grad + t(error3) %*% a2 #10*26
Theta2_grad <- Theta2_grad / m
# Part 3: Implement regularization with the cost function and gradients.
#
# Hint: You can implement this around the code for
# backpropagation. That is, you can compute the gradients for
# the regularization separately and then add them to Theta1_grad
# and Theta2_grad from Part 2.
# calculate regularized gradient
Theta1_grad <- Theta1_grad + lambda / m * cbind(rep(0,dim(Theta1)[1]), Theta1[,-1])
Theta2_grad <- Theta2_grad + lambda / m * cbind(rep(0,dim(Theta2)[1]), Theta2[,-1])
#unroll gradients
grad <- c(as.vector(Theta1_grad), as.vector(Theta2_grad))
return(grad)
}
}
Code adapted from https://github.com/faridcher/machine-learning-course/tree/master. To make it work with lbfgsb3_, the cost and the gradient are wrapped in two "short hand" closures so that each one takes only nn_params as its argument.
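Before handing these closures to the optimizer, it is worth checking the analytic gradient against a finite-difference approximation of the cost, which is essentially what the course's checkNNGradients script does in Octave. The sketch below is my own R version: the tiny layer sizes, the random data, the seed and the *_chk names are arbitrary choices for illustration, not part of the original exercise.
# Gradient check on a tiny synthetic network (my own sketch of checkNNGradients)
set.seed(42) # arbitrary, for reproducibility only
in_sz <- 3; hid_sz <- 5; n_lab <- 3; m_chk <- 10
X_chk <- matrix(runif(m_chk * in_sz), m_chk, in_sz)
y_chk <- sample(1:n_lab, m_chk, replace = TRUE)
Y_chk <- diag(n_lab)[y_chk, ] # one-hot label matrix
theta_chk <- c(randInitializeWeights(in_sz, hid_sz),
randInitializeWeights(hid_sz, n_lab))
costC <- CostFunction(in_sz, hid_sz, n_lab, X_chk, Y_chk, 1)
gradC <- GradFunction(in_sz, hid_sz, n_lab, X_chk, Y_chk, 1)
eps <- 1e-4
num_grad <- sapply(seq_along(theta_chk), function(i) {
e <- rep(0, length(theta_chk)); e[i] <- eps
(costC(theta_chk + e) - costC(theta_chk - e)) / (2 * eps) # central difference
})
max(abs(num_grad - gradC(theta_chk))) # should be very small (around 1e-9) if backprop is correct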
library(lbfgsb3)
## Loading required package: numDeriv
initial_Theta1 <- randInitializeWeights(input_layer_size, hidden_layer_size)
initial_Theta2 <- randInitializeWeights(hidden_layer_size, num_labels)
lambda <- 0
# Expand the 'y' labels into a binary (one-hot) matrix
y_matrix <- matrix(0, nrow = length(y), ncol = num_labels)
for (i in 1:nrow(X)){
ans <- y[i]
y_matrix[i,ans] <- 1
}
# Unroll parameters
initial_nn_params <- c(initial_Theta1,initial_Theta2)
# Create "short hand" for the cost function to be minimized
costF <- CostFunction(input_layer_size, hidden_layer_size,
num_labels, X, y_matrix, lambda) #over nn_params
gradF <- GradFunction(input_layer_size, hidden_layer_size,
num_labels, X, y_matrix, lambda) #over nn_params
# Now, costF and gradF are functions that take in only one argument (the
# neural network parameters)
# lbfgsb3 works like fmincg (fast)
# After you have completed the assignment, change the maxit to a larger
# value to see how more training helps.
opt <- lbfgsb3_(initial_nn_params, fn= costF, gr=gradF,
control = list(trace=1,maxit=50))
## This problem is unconstrained.
## At iteration 0 f = 6.976408
## At iteration 2 f = 4.276925
## At iteration 3 f = 3.252908
## At iteration 4 f = 3.244889
## At iteration 5 f = 3.236283
## At iteration 6 f = 3.206265
## At iteration 7 f = 3.081171
## At iteration 8 f = 3.14935
## At iteration 9 f = 2.872306
## At iteration 10 f = 2.581716
## At iteration 11 f = 2.394459
## At iteration 12 f = 1.95672
## At iteration 13 f = 1.80206
## At iteration 14 f = 1.636209
## At iteration 15 f = 1.47261
## At iteration 16 f = 1.25506
## At iteration 17 f = 1.126324
## At iteration 18 f = 1.041125
## At iteration 19 f = 0.9431381
## At iteration 20 f = 0.8774341
## At iteration 21 f = 0.8264804
## At iteration 22 f = 0.769674
## At iteration 23 f = 0.730318
## At iteration 24 f = 0.6919751
## At iteration 25 f = 0.6407558
## At iteration 26 f = 0.6163657
## At iteration 27 f = 0.5740396
## At iteration 28 f = 0.5558077
## At iteration 29 f = 0.5227216
## At iteration 30 f = 0.5131235
## At iteration 31 f = 0.4795803
## At iteration 32 f = 0.4668565
## At iteration 33 f = 0.4524015
## At iteration 34 f = 0.4304622
## At iteration 35 f = 0.4288251
## At iteration 36 f = 0.41672
## At iteration 37 f = 0.4007615
## At iteration 38 f = 0.383439
## At iteration 39 f = 0.3635975
## At iteration 40 f = 0.3460794
## At iteration 41 f = 0.3382101
## At iteration 42 f = 0.3251173
## At iteration 43 f = 0.3145473
## At iteration 44 f = 0.3040788
## At iteration 45 f = 0.2840729
## At iteration 46 f = 0.2773529
## At iteration 47 f = 0.2674028
## At iteration 48 f = 0.2580811
## At iteration 49 f = 0.2502549
## At iteration 50 f = 0.2364219
## At iteration 51 f = 0.2396558
nn_params <- opt$prm
cost <- opt$f
# Obtain Theta1 and Theta2 back from nn_params
Theta1 <- matrix(nn_params[1:(hidden_layer_size * (input_layer_size + 1))],
hidden_layer_size, (input_layer_size + 1))
Theta2 <- matrix(nn_params[(1 + (hidden_layer_size * (input_layer_size + 1))):length(nn_params)],
num_labels, (hidden_layer_size + 1))
predict <- function(Theta1, Theta2, X){
#PREDICT Predict the label of an input given a trained neural network
# p = PREDICT(Theta1, Theta2, X) outputs the predicted label of X given the
# trained weights of a neural network (Theta1, Theta2)
# Useful values
m <- nrow(X)
num_labels <- nrow(Theta2)
# You need to return the following variables correctly
p <- rep(0, m)
h1 <- sigmoid(cbind(1, X) %*% t(Theta1))
h2 <- sigmoid(cbind(1, h1) %*% t(Theta2))
p <- apply(h2,1,which.max)
return(p)
}
pred <- predict(Theta1, Theta2, X)
# accuracy
print(mean(pred==y) * 100)
## [1] 96.86
# confusion matrix
table(pred, y)
## y
## pred 1 2 3 4 5 6 7 8 9 10
## 1 494 1 2 1 1 0 2 1 0 0
## 2 1 486 4 1 2 0 2 3 0 0
## 3 0 0 463 0 2 0 1 2 5 2
## 4 0 2 0 491 0 1 4 1 4 1
## 5 1 3 22 0 490 6 0 10 5 5
## 6 0 1 0 2 2 492 0 1 1 0
## 7 1 0 3 0 0 0 485 3 5 0
## 8 3 4 5 0 1 0 0 478 3 2
## 9 0 1 1 4 1 0 6 1 475 1
## 10 0 2 0 1 1 1 0 0 2 489
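The diagonal of this confusion matrix contains the correctly classified examples for each digit (the columns are the true labels, 500 examples per digit), so the per-class accuracy can be read off directly. This small check is my own addition:
# Per-class accuracy in percent, from the confusion matrix above (my own addition)
cm <- table(pred, y)
round(diag(cm) / colSums(cm) * 100, 1)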
lambda <- 10
# Create "short hand" for the cost function to be minimized
costF <- CostFunction(input_layer_size, hidden_layer_size,
num_labels, X, y_matrix, lambda) #over nn_params
gradF <- GradFunction(input_layer_size, hidden_layer_size,
num_labels, X, y_matrix, lambda) #over nn_params
opt <- lbfgsb3_(initial_nn_params, fn= costF, gr=gradF,
control = list(trace=1,maxit=50))
## This problem is unconstrained.
## At iteration 0 f = 7.025619
## At iteration 2 f = 4.327071
## At iteration 3 f = 3.306176
## At iteration 4 f = 3.298132
## At iteration 5 f = 3.289381
## At iteration 6 f = 3.260188
## At iteration 7 f = 3.152471
## At iteration 8 f = 3.131547
## At iteration 9 f = 2.721189
## At iteration 10 f = 2.351989
## At iteration 11 f = 2.081654
## At iteration 12 f = 1.982127
## At iteration 13 f = 1.888246
## At iteration 14 f = 1.695554
## At iteration 15 f = 1.612727
## At iteration 16 f = 1.530456
## At iteration 17 f = 1.494745
## At iteration 18 f = 1.418897
## At iteration 19 f = 1.394754
## At iteration 20 f = 1.348353
## At iteration 21 f = 1.329708
## At iteration 22 f = 1.299886
## At iteration 23 f = 1.270708
## At iteration 24 f = 1.256065
## At iteration 25 f = 1.21982
## At iteration 26 f = 1.205858
## At iteration 27 f = 1.184341
## At iteration 28 f = 1.168228
## At iteration 29 f = 1.150984
## At iteration 30 f = 1.137874
## At iteration 31 f = 1.127415
## At iteration 32 f = 1.118114
## At iteration 33 f = 1.109182
## At iteration 34 f = 1.099622
## At iteration 35 f = 1.097045
## At iteration 36 f = 1.085889
## At iteration 37 f = 1.081499
## At iteration 38 f = 1.076358
## At iteration 39 f = 1.073028
## At iteration 40 f = 1.068067
## At iteration 41 f = 1.063618
## At iteration 42 f = 1.059889
## At iteration 43 f = 1.059816
## At iteration 44 f = 1.057212
## At iteration 45 f = 1.053664
## At iteration 46 f = 1.051392
## At iteration 47 f = 1.048636
## At iteration 48 f = 1.04656
## At iteration 49 f = 1.043939
## At iteration 50 f = 1.042195
## At iteration 51 f = 1.040808
nn_params <- opt$prm
cost <- opt$f
# Obtain Theta1 and Theta2 back from nn_params
Theta1 <- matrix(nn_params[1:(hidden_layer_size * (input_layer_size + 1))],
hidden_layer_size, (input_layer_size + 1))
Theta2 <- matrix(nn_params[(1 + (hidden_layer_size * (input_layer_size + 1))):length(nn_params)],
num_labels, (hidden_layer_size + 1))
pred <- predict(Theta1, Theta2, X)
# accuracy
print(mean(pred==y) * 100)
## [1] 93.44
# confusion matrix
table(pred, y)
## y
## pred 1 2 3 4 5 6 7 8 9 10
## 1 488 5 4 2 2 3 11 7 2 0
## 2 1 448 14 4 2 1 2 4 1 0
## 3 0 4 449 0 14 0 0 5 5 2
## 4 1 9 0 474 8 1 8 4 6 2
## 5 2 1 17 0 453 8 0 7 3 0
## 6 1 4 1 6 8 483 0 7 1 2
## 7 2 9 7 0 0 1 467 1 12 1
## 8 4 12 5 2 3 1 0 459 2 3
## 9 1 1 2 12 4 0 9 5 462 1
## 10 0 7 1 0 6 2 3 1 6 489