Background

An artificial neural network (ANN), usually just called a “neural network” (NN), is a learning algorithm inspired by the structure and functional aspects of biological neural networks. Falling hardware prices and the development of GPUs for personal use over the last few years have contributed to the rise of deep learning, which uses multiple hidden layers in an artificial neural network. Neural networks and deep learning currently provide the best solutions to many problems in image recognition, speech recognition, and natural language processing. In the conventional approach to programming, we tell the computer what to do, breaking big problems up into many small, precisely defined tasks that the computer can easily perform. By contrast, in a neural network we don’t tell the computer how to solve our problem. Instead, it learns from observational data, figuring out its own solution to the problem at hand.

Prepare Environment

Google Cloud Platform

Create a virtual machine instance with the following configuration:

  • Machine Name: neural-network
  • Hardware: 2 vCPUs, 10GB RAM, 100GB Disk
  • Operating System: Linux Ubuntu 17.04 Zesty
  • Access Scopes: Allow full access to all Cloud APIs
  • Firewall: Allow HTTP traffic

Docker Cloud on GCP

sudo apt-get remove docker docker-engine docker.io
sudo apt-get update
sudo apt-get install apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update
sudo apt-get install docker-ce
sudo docker run hello-world

DataQuest Data Science Image

sudo mkdir -p /home/me/notebooks/
sudo docker run -d -p 8888:8888 -v /home/me/notebooks:/home/ds/notebooks dataquestio/python2-starter
sudo docker ps # to get <container_hash>
sudo docker exec -it <container_hash> bash
sudo chmod 777 /home/ds/notebooks/
pip install --upgrade pip
touch requirements.txt
pip freeze > requirements.txt
pip install -U $(pip freeze | awk '{split($0, a, "=="); print a[1]}')
pip install notebook==4.0.6 ipython==4.0.0 ipykernel==4.1.1
pip freeze > requirements.txt

Download Data and Code

The MNIST database of handwritten digits is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.

wget https://github.com/mnielsen/neural-networks-and-deep-learning/archive/master.zip
sudo apt-get install unzip
unzip master.zip

Expand MNIST Data Set

One way to improve results is by algorithmically expanding the training data. A simple way of expanding the training data is to displace each training image by a single pixel, either up, down, left, or right. This is accomplished by running the program expand_mnist.py from the shell prompt (a minimal sketch of the displacement idea follows the commands below).

cd neural-networks-and-deep-learning-master/src
python expand_mnist.py
cd ../..
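
To make the displacement concrete, here is a minimal numpy sketch (not the expand_mnist.py implementation itself) that shifts a flattened 28x28 image one pixel to the right; the image array here is a made-up placeholder.

import numpy as np

# Hypothetical flattened 28x28 image, as stored in mnist.pkl.gz.
image = np.zeros(784)
image[300] = 1.0  # one "on" pixel, purely for illustration

# Reshape to 2-D, shift every row one pixel to the right, clear the
# column that wrapped around, then flatten back to the original format.
shifted = np.roll(image.reshape(28, 28), 1, axis=1)
shifted[:, 0] = 0.0
shifted = shifted.reshape(784)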

Jupyter Notebook Interface

Open a web browser and navigate to http://public_dns_name:8888/tree/neural-networks-and-deep-learning-master/src, where public_dns_name is the GCP External IP address of your VM instance. Start a new “Python 2” notebook.

Python Libraries

Theano makes it easy to implement backpropagation for convolutional neural networks, since it automatically computes all the mappings involved. Theano is also quite a bit faster than implementing a plain vanilla backpropagation algorithm, and this makes it practical to train more complex networks. In particular, Theano evaluates expressions with C and can also run code on either a CPU or, if available, a GPU. Running on a GPU provides a substantial speedup and, again, helps make it practical to train more complex networks.

import cPickle, gzip, numpy as np
import theano
import theano.tensor as T
from theano.tensor.nnet import conv
from theano.tensor.nnet import softmax
from theano.tensor import shared_randomstreams
from theano.tensor.signal import downsample
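
As a minimal illustration of the automatic gradient computation described above (using the theano and T imports from this cell; the toy cost is only for illustration):

w = T.dscalar('w')
cost = w ** 2 + 3 * w            # a toy scalar cost
grad = T.grad(cost, w)           # Theano derives d(cost)/dw symbolically
f = theano.function([w], grad)
f(2.0)                           # evaluates to 7.0, i.e. 2*w + 3 at w = 2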

Activation Functions

The neural network’s output is assumed to be the index of whichever neuron in the final layer has the highest activation. Networks using rectified linear units consistently outperformed networks based on sigmoid activation functions on this dataset. At present, there is a poor understanding of what makes the rectified linear activation function better than the sigmoid or tanh functions.

def linear(z): return z
def ReLU(z): return T.maximum(0.0, z)
from theano.tensor.nnet import sigmoid
from theano.tensor import tanh
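
A quick sanity check of these activation functions, compiled with theano.function (the sample values are arbitrary):

z = T.dvector('z')
relu_f = theano.function([z], ReLU(z))
sigmoid_f = theano.function([z], sigmoid(z))
relu_f([-2.0, 0.0, 3.0])     # [0., 0., 3.]
sigmoid_f([-2.0, 0.0, 3.0])  # approximately [0.119, 0.5, 0.953]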

Processing Unit

GPU = False
if GPU:
    print "Trying to run under a GPU. If this is not desired, then set the GPU flag to False."
    try: theano.config.device = 'gpu'
    except: pass # it's already set
    theano.config.floatX = 'float32'
else:
    print "Running with a CPU. If this is not desired, then set the GPU flag to True."
Running with a CPU. If this is not desired, then set the GPU flag to True.

Load MNIST Data

The MNIST database of handwritten digits is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.

def load_data_shared(filename="neural-networks-and-deep-learning-master/data/mnist.pkl.gz"):
    f = gzip.open(filename, 'rb')
    training_data, validation_data, test_data = cPickle.load(f)
    f.close()
    def shared(data):
        """Place the data into shared variables.  This allows Theano to copy
        the data to the GPU, if one is available."""
        shared_x = theano.shared(
            np.asarray(data[0], dtype=theano.config.floatX), borrow=True)
        shared_y = theano.shared(
            np.asarray(data[1], dtype=theano.config.floatX), borrow=True)
        return shared_x, T.cast(shared_y, "int32")
    return [shared(training_data), shared(validation_data), shared(test_data)]
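
A quick usage check of the loader; the standard MNIST split in mnist.pkl.gz has 50,000 training, 10,000 validation, and 10,000 test examples, each a 784-pixel vector:

training_data, validation_data, test_data = load_data_shared()
training_data[0].get_value(borrow=True).shape  # (50000, 784)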

Network Class Instance

Main class used to construct and train networks.

class Network(object):

    def __init__(self, layers, mini_batch_size):
        """Takes a list of `layers`, describing the network architecture, and
        a value for the `mini_batch_size` to be used during training
        by stochastic gradient descent."""
        self.layers = layers
        self.mini_batch_size = mini_batch_size
        self.params = [param for layer in self.layers for param in layer.params]
        self.x = T.matrix("x")
        self.y = T.ivector("y")
        init_layer = self.layers[0]
        init_layer.set_inpt(self.x, self.x, self.mini_batch_size)
        for j in xrange(1, len(self.layers)):
            prev_layer, layer  = self.layers[j-1], self.layers[j]
            layer.set_inpt(
                prev_layer.output, prev_layer.output_dropout, self.mini_batch_size)
        self.output = self.layers[-1].output
        self.output_dropout = self.layers[-1].output_dropout

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            validation_data, test_data, lmbda=0.0):
        """Train the network using mini-batch stochastic gradient descent."""
        training_x, training_y = training_data
        validation_x, validation_y = validation_data
        test_x, test_y = test_data

        # compute number of minibatches for training, validation and testing
        num_training_batches = size(training_data)/mini_batch_size
        num_validation_batches = size(validation_data)/mini_batch_size
        num_test_batches = size(test_data)/mini_batch_size

        # define the (regularized) cost function, symbolic gradients, and updates
        l2_norm_squared = sum([(layer.w**2).sum() for layer in self.layers])
        cost = self.layers[-1].cost(self)+\
               0.5*lmbda*l2_norm_squared/num_training_batches
        grads = T.grad(cost, self.params)
        updates = [(param, param-eta*grad)
                   for param, grad in zip(self.params, grads)]

        # define functions to train a mini-batch, and to compute the
        # accuracy in validation and test mini-batches.
        i = T.lscalar() # mini-batch index
        train_mb = theano.function(
            [i], cost, updates=updates,
            givens={
                self.x:
                training_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
                self.y:
                training_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
            })
        validate_mb_accuracy = theano.function(
            [i], self.layers[-1].accuracy(self.y),
            givens={
                self.x:
                validation_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
                self.y:
                validation_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
            })
        test_mb_accuracy = theano.function(
            [i], self.layers[-1].accuracy(self.y),
            givens={
                self.x:
                test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
                self.y:
                test_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
            })
        self.test_mb_predictions = theano.function(
            [i], self.layers[-1].y_out,
            givens={
                self.x:
                test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
            })
        # Do the actual training
        best_validation_accuracy = 0.0
        for epoch in xrange(epochs):
            for minibatch_index in xrange(num_training_batches):
                iteration = num_training_batches*epoch+minibatch_index
                if iteration % 1000 == 0:
                    print("Training mini-batch number {0}".format(iteration))
                cost_ij = train_mb(minibatch_index)
                if (iteration+1) % num_training_batches == 0:
                    validation_accuracy = np.mean(
                        [validate_mb_accuracy(j) for j in xrange(num_validation_batches)])
                    print("Epoch {0}: validation accuracy {1:.2%}".format(
                        epoch, validation_accuracy))
                    if validation_accuracy >= best_validation_accuracy:
                        print("This is the best validation accuracy to date.")
                        best_validation_accuracy = validation_accuracy
                        best_iteration = iteration
                        if test_data:
                            test_accuracy = np.mean(
                                [test_mb_accuracy(j) for j in xrange(num_test_batches)])
                            print('The corresponding test accuracy is {0:.2%}'.format(
                                test_accuracy))
        print("Finished training network.")
        print("Best validation accuracy of {0:.2%} obtained at iteration {1}".format(
            best_validation_accuracy, best_iteration))
        print("Corresponding test accuracy of {0:.2%}".format(test_accuracy))

Cost Functions

The cost function is sometimes referred to as a loss or objective function. The training algorithm has done a good job if it can find weights and biases such that the quadratic cost function, sometimes known as the mean squared error or just MSE, is equal to zero. By contrast, it’s not doing so well when the cost (error) is large. \[C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2 \tag{Quadratic}\] The quadratic cost function suffers from a learning slowdown when paired with a sigmoid activation: learning is slowest precisely when the initial error is large. This is mitigated by using the cross-entropy cost function with the sigmoid function. \[C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y) \ln (1-a) \right] \tag{Cross-Entropy}\] A more common choice in modern image classification networks, however, is the log-likelihood cost function paired with a softmax output layer. A softmax output layer with log-likelihood cost is quite similar to a sigmoid output layer with cross-entropy cost. \[C \equiv -\ln a^L_y \tag{Log-Likelihood}\]
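
To see where the learning slowdown comes from, compare the weight gradients for a single training example at a single sigmoid output neuron (a standard result; the cross-entropy gradient drops the derivative of the sigmoid, so learning does not stall when the neuron saturates): \[\frac{\partial C}{\partial w_j} = x_j \, (a - y) \, \sigma'(z) \tag{Quadratic Gradient}\]\[\frac{\partial C}{\partial w_j} = x_j \, (a - y) \tag{Cross-Entropy Gradient}\]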

Define Layer Types

Convolutional-Pooling Layers

Convolutional neural networks use three basic ideas: local receptive fields, shared weights, and pooling. Each neuron in the first hidden layer is connected to a small region of the input neurons. That region in the input image is called the local receptive field for hidden neurons. Each hidden neuron has a bias and weights connected to its local receptive field. The same weights and bias are used for each of the hidden neurons. The shared weights and bias are often said to define a kernel or filter. Pooling layers are usually used immediately after convolutional layers. What the pooling layers do is simplify the information in the output from the convolutional layer.

Figure: local receptive fields mapped to feature maps.

Max-pooling is a way for the network to ask whether a given feature is found anywhere in a region of the image. It then throws away the exact positional information. The intuition is that once a feature has been found, its exact location isn’t as important as its rough location relative to other features. A big benefit is that there are many fewer pooled features, and so this helps reduce the number of parameters needed in later layers.
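
A minimal numpy sketch of 2x2 max-pooling on a toy 4x4 feature map (purely illustrative, using the np import above; the network itself uses Theano's max_pool_2d):

feature_map = np.arange(16).reshape(4, 4)   # toy 4x4 activations
# Group into non-overlapping 2x2 blocks and keep each block's maximum.
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
# pooled is the 2x2 array [[5, 7], [13, 15]]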

Figure: feature maps condensed by pooling units.

class ConvPoolLayer(object):
    """Used to create a combination of a convolutional and a max-pooling
    layer.  A more sophisticated implementation would separate the
    two, but for our purposes we'll always use them together, and it
    simplifies the code, so it makes sense to combine them."""
    def __init__(self, filter_shape, image_shape, poolsize=(2, 2),
                 activation_fn=sigmoid):
        """`filter_shape` is a tuple of length 4, whose entries are the number
        of filters, the number of input feature maps, the filter height, and the
        filter width.

        `image_shape` is a tuple of length 4, whose entries are the
        mini-batch size, the number of input feature maps, the image
        height, and the image width.

        `poolsize` is a tuple of length 2, whose entries are the y and
        x pooling sizes."""
        self.filter_shape = filter_shape
        self.image_shape = image_shape
        self.poolsize = poolsize
        self.activation_fn=activation_fn
        # initialize weights and biases
        n_out = (filter_shape[0]*np.prod(filter_shape[2:])/np.prod(poolsize))
        self.w = theano.shared(
            np.asarray(
                np.random.normal(loc=0, scale=np.sqrt(1.0/n_out), size=filter_shape),
                dtype=theano.config.floatX),
            borrow=True)
        self.b = theano.shared(
            np.asarray(
                np.random.normal(loc=0, scale=1.0, size=(filter_shape[0],)),
                dtype=theano.config.floatX),
            borrow=True)
        self.params = [self.w, self.b]

    def set_inpt(self, inpt, inpt_dropout, mini_batch_size):
        self.inpt = inpt.reshape(self.image_shape)
        conv_out = conv.conv2d(
            input=self.inpt, filters=self.w, filter_shape=self.filter_shape,
            image_shape=self.image_shape)
        pooled_out = downsample.max_pool_2d(
            input=conv_out, ds=self.poolsize, ignore_border=True)
        self.output = self.activation_fn(
            pooled_out + self.b.dimshuffle('x', 0, 'x', 'x'))
        self.output_dropout = self.output # no dropout in the convolutional layers

Fully Connected Layer

Adjacent network layers are fully connected to one another. That is, every neuron in the network is connected to every neuron in adjacent layers. Although they work well, fully connected layers alone do not take into account the spatial structure of the images they classify. Convolutional neural networks take advantage of the spatial structure. These networks use a special architecture which is particularly well-adapted to classifying images. Using this architecture makes convolutional networks fast to train. This, in turn, helps us train deep, many-layer networks, which are very good at classifying images. The convolutional and pooling layers are sometimes treated as a single layer.


class FullyConnectedLayer(object):

    def __init__(self, n_in, n_out, activation_fn=sigmoid, p_dropout=0.0):
        self.n_in = n_in
        self.n_out = n_out
        self.activation_fn = activation_fn
        self.p_dropout = p_dropout
        # Initialize weights and biases
        self.w = theano.shared(
            np.asarray(
                np.random.normal(
                    loc=0.0, scale=np.sqrt(1.0/n_out), size=(n_in, n_out)),
                dtype=theano.config.floatX),
            name='w', borrow=True)
        self.b = theano.shared(
            np.asarray(np.random.normal(loc=0.0, scale=1.0, size=(n_out,)),
                       dtype=theano.config.floatX),
            name='b', borrow=True)
        self.params = [self.w, self.b]

    def set_inpt(self, inpt, inpt_dropout, mini_batch_size):
        self.inpt = inpt.reshape((mini_batch_size, self.n_in))
        self.output = self.activation_fn(
            (1-self.p_dropout)*T.dot(self.inpt, self.w) + self.b)
        self.y_out = T.argmax(self.output, axis=1)
        self.inpt_dropout = dropout_layer(
            inpt_dropout.reshape((mini_batch_size, self.n_in)), self.p_dropout)
        self.output_dropout = self.activation_fn(
            T.dot(self.inpt_dropout, self.w) + self.b)

    def accuracy(self, y):
        "Return the accuracy for the mini-batch."
        return T.mean(T.eq(y, self.y_out))

Softmax Layer

A softmax layer (that uses the softmax function) addresses the learning slowdown problem encountered with the sigmoid function. A softmax output layer with log-likelihood cost is quite similar to a sigmoid output layer with cross-entropy cost. The output from the softmax layer is a set of positive numbers which sum up to 1. In other words, the output from the softmax layer can be thought of as a probability distribution. Softmax plus log-likelihood cost is more common in modern image classification networks.

\[\sigma(z) \equiv \frac{1}{1+e^{-z}}=\frac{1}{1+\exp(-\sum_j w_j x_j-b)} \tag{Sigmoid Function}\]\[a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}} \tag{Softmax Function}\]
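
A quick numeric check, using the np import above, that the softmax outputs form a probability distribution (the input values are arbitrary):

z_vals = np.array([1.0, 2.0, 3.0])
a = np.exp(z_vals) / np.sum(np.exp(z_vals))
# a is approximately [0.090, 0.245, 0.665] and sums to 1.0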

class SoftmaxLayer(object):

    def __init__(self, n_in, n_out, p_dropout=0.0):
        self.n_in = n_in
        self.n_out = n_out
        self.p_dropout = p_dropout
        # Initialize weights and biases
        self.w = theano.shared(
            np.zeros((n_in, n_out), dtype=theano.config.floatX),
            name='w', borrow=True)
        self.b = theano.shared(
            np.zeros((n_out,), dtype=theano.config.floatX),
            name='b', borrow=True)
        self.params = [self.w, self.b]

    def set_inpt(self, inpt, inpt_dropout, mini_batch_size):
        self.inpt = inpt.reshape((mini_batch_size, self.n_in))
        self.output = softmax((1-self.p_dropout)*T.dot(self.inpt, self.w) + self.b)
        self.y_out = T.argmax(self.output, axis=1)
        self.inpt_dropout = dropout_layer(
            inpt_dropout.reshape((mini_batch_size, self.n_in)), self.p_dropout)
        self.output_dropout = softmax(T.dot(self.inpt_dropout, self.w) + self.b)

    def cost(self, net):
        "Return the log-likelihood cost."
        return -T.mean(T.log(self.output_dropout)[T.arange(net.y.shape[0]), net.y])

    def accuracy(self, y):
        "Return the accuracy for the mini-batch."
        return T.mean(T.eq(y, self.y_out))

Compute Dataset Size

Return the number of examples in a dataset. In the SGD method above, this value is divided by mini_batch_size to get the number of mini-batches for training, validation, and testing.

def size(data):
    "Return the size of the dataset `data`."
    return data[0].get_value(borrow=True).shape[0]

Dropout Neurons

Remove individual activations at random while training the network. This makes the model more robust to the loss of individual pieces of evidence, and thus less likely to rely on particular idiosyncrasies of the training data. This also reduces overfitting.


def dropout_layer(layer, p_dropout):
    srng = shared_randomstreams.RandomStreams(
        np.random.RandomState(0).randint(999999))
    mask = srng.binomial(n=1, p=1-p_dropout, size=layer.shape)
    return layer*T.cast(mask, theano.config.floatX)
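
A small usage sketch of dropout_layer: compile it into a Theano function and apply it to a vector of ones; with a p_dropout of 0.5, roughly half of the activations come back zeroed (the exact pattern depends on the random stream).

x_demo = T.matrix('x_demo')
drop_f = theano.function([x_demo], dropout_layer(x_demo, 0.5))
drop_f(np.ones((1, 10), dtype=theano.config.floatX))
# roughly half of the ten entries are returned as 0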

Prepare MNIST Data

training_data, validation_data, test_data = load_data_shared()
expanded_training_data, _, _ = load_data_shared("neural-networks-and-deep-learning-master/data/mnist_expanded.pkl.gz")

Set Hyper-Parameters

A deep neural network is generally understood as a network with multiple hidden layers; the network defined below has six hidden layers plus the input and output layers.

mini_batch_size = 100 # random training inputs (to find minima within)
neurons = [] # List with number of neurons in each layer

First Convolutional-Pooling Layer

The convolutional and pooling layers are treated as a single layer. This layer analyzes unmodified versions of the scanned handwritten digit images. The local receptive field of the Convolutional Neural Network moves across the input image at a stride length of 1.

feature_maps1 = 1 # The starting number of input feature maps
image_width1 = 28 # Horizontal pixels of scanned images
image_height1 = 28 # Vertical pixels of scanned images
neurons.append(image_width1 * image_height1) # Input layer
num_filters1 = 20 # Image subsections (with shared weights and bias)
filter_width1 = 5 # local receptive field horizontal pixel width
filter_height1 = 5 # local receptive field vertical pixel height
width = (image_width1 - filter_width1 + 1)
height = (image_height1 - filter_height1 + 1)
neurons.append(num_filters1 * height * width) # hidden layer 1
pool_x1 = 2 # max-pooling pixel width
pool_y1 = 2 # max-pooling pixel height
width = (image_width1 - filter_width1 + 1) / pool_x1
height = (image_height1 - filter_height1 + 1) / pool_y1
neurons.append(num_filters1 * height * width) # hidden layer 2
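
The neuron counts appended above follow directly from the layer geometry (28x28 input, 5x5 local receptive fields, 20 filters, 2x2 pooling): \[20 \times (28 - 5 + 1)^2 = 20 \times 24^2 = 11520 \tag{Convolutional Neurons}\]\[20 \times (24 / 2)^2 = 20 \times 12^2 = 2880 \tag{Pooled Neurons}\] The second convolutional-pooling layer below follows the same arithmetic with a 12x12 input and 40 filters, giving 2560 convolutional and 640 pooled neurons.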

Second Convolutional-Pooling Layer

The convolutional and pooling layers are treated as a single layer. This layer analyzes abstracted and condensed versions of the original images using the output of the first convolutional-pooling layer as input. The local receptive field of the Convolutional Neural Network moves across the input image at a stride length of 1.

feature_maps2 = 20 # The starting number of input feature maps (subsections above)
image_height2 = 12 # Vertical pixels of abstracted image
image_width2 = 12 # Horizontal pixels of abstracted image
num_filters2 = 40 # Image subsections (with shared weights and bias)
filter_height2 = 5 # local receptive field pixel height
filter_width2 = 5 # local receptive field pixel width
width = (image_width2 - filter_width2 + 1)
height = (image_height2 - filter_height2 + 1)
neurons.append(num_filters2 * height * width) # hidden layer 3
pool_x2 = 2 # max-pooling pixel width
pool_y2 = 2 # max-pooling pixel height
width = (image_width2 - filter_width2 + 1) / pool_x2
height = (image_height2 - filter_height2 + 1) / pool_y2
neurons.append(num_filters2 * height * width) # hidden layer 4

First Fully Connected Layer

Connects every neuron from the max-pooled layer to a hidden layer of neurons. The neural network’s output is assumed to be the index of whichever neuron in the final layer has the highest activation. Dropout reduces overfitting and improves learning speed.

neurons.append(1000) # hidden layer 5
dropout_prob1 = 0.5 # Random probability for removing individual activations

Second Fully Connected Layer

Connects every neuron from the previous hidden layer to another hidden layer of neurons. This is like adding a second hidden layer in a basic neural network model. Dropout reduces overfitting and improves learning speed.

neurons.append(1000) # hidden layer 6
dropout_prob2 = 0.5 # Random probability for removing individual activations

Softmax Layer

Connects every neuron from the previous hidden layer to all 10 output neurons. Uses the softmax function plus the log-likelihood cost function for predictions. Dropout reduces overfitting and improves learning speed.

neurons.append(10) # Output layer for digits zero through nine
dropout_prob3 = 0.5 # Random probability for removing individual activations

Neurons in Layers

The neural network’s output is assumed to be the index of whichever neuron in the final layer has the highest activation.

print(neurons)
[784, 11520, 2880, 2560, 640, 1000, 1000, 10]
sum(neurons)
20394

Define Neural Network

The convolutional and pooling layers are sometimes treated as a single layer. The first convolutional-pooling layer analyzes the unmodified versions of the scanned handwritten digit images. The second convolutional-pooling layer uses the output of the first convolutional-pooling layer as input and analyzes an abstracted and condensed version of the original images. When the abstraction retains a useful amount of spatial structure, this second convolutional-pooling layer helps improve model accuracy. The first fully connected layer connects every neuron from the max-pooled layer to every neuron in a second fully connected layer. The second fully connected layer connects every neuron from the first fully connected layer to all 10 softmax output layer neurons. Adding a second fully connected layer is like adding another hidden layer in a basic neural network model.

net = Network([
        ConvPoolLayer(image_shape=(mini_batch_size, feature_maps1, image_height1, image_width1), 
                      filter_shape=(num_filters1, feature_maps1, filter_height1, filter_width1), 
                      poolsize=(pool_x1, pool_y1), 
                      activation_fn=ReLU),
        ConvPoolLayer(image_shape=(mini_batch_size, feature_maps2, image_height2, image_width2), 
                      filter_shape=(num_filters2, feature_maps2, filter_height2, filter_width2), 
                      poolsize=(pool_x2, pool_y2), 
                      activation_fn=ReLU),
        FullyConnectedLayer(
            n_in=neurons[4], n_out=neurons[5], activation_fn=ReLU, p_dropout=dropout_prob1),
        FullyConnectedLayer(
            n_in=neurons[5], n_out=neurons[6], activation_fn=ReLU, p_dropout=dropout_prob2),
        SoftmaxLayer(n_in=neurons[6], n_out=neurons[7], p_dropout=dropout_prob3)], 
        mini_batch_size)

Stochastic Gradient Descent

The training algorithm has done a good job if it can find weights and biases such that the cost function (error) is equal to zero. By contrast, it’s not doing so well when the cost (error) is large. Gradient descent minimizes the cost as a function of the weights and biases. Stochastic gradient descent can be used to speed up learning: it works by picking out a small, randomly chosen mini-batch of training inputs and training with just those. It’s much easier to sample a small mini-batch than it is to apply gradient descent to the full batch.

\[C \equiv -\ln a^L_y \tag{Log-Likelihood Cost}\]\[\Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2 \tag{Cost Change}\]\[\nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^T \tag{Gradient Vector}\]\[\Delta C \approx \Delta v \cdot \nabla C = -\eta \nabla C \cdot \nabla C = -\eta \|\nabla C\|^2 \tag{Learning Rate}\]
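
The cost-change approximation above assumes each parameter update moves in the direction of the negative gradient; written out, the gradient descent update rule is \[v \rightarrow v' = v - \eta \nabla C \tag{Update Rule}\] so each step decreases the cost by approximately the learning rate times the squared magnitude of the gradient, provided the learning rate is small enough for the linear approximation to hold.
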
epochs = 40 # training periods (iterations)
eta = 0.03 # learning rate (rate of descent)
net.SGD(expanded_training_data, epochs, mini_batch_size, eta, validation_data, test_data)
Training mini-batch number 0
Training mini-batch number 1000
Training mini-batch number 2000
Epoch 0: validation accuracy 94.54%
This is the best validation accuracy to date.
The corresponding test accuracy is 94.31%
Training mini-batch number 3000
Training mini-batch number 4000
Epoch 1: validation accuracy 97.22%
This is the best validation accuracy to date.
The corresponding test accuracy is 97.23%
Training mini-batch number 5000
Training mini-batch number 6000
Training mini-batch number 7000
Epoch 2: validation accuracy 97.84%
This is the best validation accuracy to date.
The corresponding test accuracy is 98.00%
Training mini-batch number 8000
Training mini-batch number 9000
Epoch 3: validation accuracy 98.08%
This is the best validation accuracy to date.
The corresponding test accuracy is 98.31%
Training mini-batch number 10000
Training mini-batch number 11000
Training mini-batch number 12000
Epoch 4: validation accuracy 98.46%
This is the best validation accuracy to date.
The corresponding test accuracy is 98.62%
Training mini-batch number 13000
Training mini-batch number 14000
Epoch 5: validation accuracy 98.56%
This is the best validation accuracy to date.
The corresponding test accuracy is 98.79%
Training mini-batch number 15000
Training mini-batch number 16000
Training mini-batch number 17000
Epoch 6: validation accuracy 98.72%
This is the best validation accuracy to date.
The corresponding test accuracy is 98.96%
Training mini-batch number 18000
Training mini-batch number 19000
Epoch 7: validation accuracy 98.82%
This is the best validation accuracy to date.
The corresponding test accuracy is 98.96%
Training mini-batch number 20000
Training mini-batch number 21000
Training mini-batch number 22000
Epoch 8: validation accuracy 98.87%
This is the best validation accuracy to date.
The corresponding test accuracy is 99.08%
Training mini-batch number 23000
Training mini-batch number 24000
Epoch 9: validation accuracy 98.92%
This is the best validation accuracy to date.
The corresponding test accuracy is 99.20%
Training mini-batch number 25000
Training mini-batch number 26000
Training mini-batch number 27000
Epoch 10: validation accuracy 98.95%
This is the best validation accuracy to date.
The corresponding test accuracy is 99.21%
Training mini-batch number 28000
Training mini-batch number 29000
Epoch 11: validation accuracy 99.02%
This is the best validation accuracy to date.
The corresponding test accuracy is 99.25%
Training mini-batch number 30000
Training mini-batch number 31000
Training mini-batch number 32000
Epoch 12: validation accuracy 99.00%
Training mini-batch number 33000
Training mini-batch number 34000
Epoch 13: validation accuracy 99.07%
This is the best validation accuracy to date.
The corresponding test accuracy is 99.31%
Training mini-batch number 35000
Training mini-batch number 36000
Training mini-batch number 37000
Epoch 14: validation accuracy 99.09%
This is the best validation accuracy to date.
The corresponding test accuracy is 99.30%
Training mini-batch number 38000
Training mini-batch number 39000
Epoch 15: validation accuracy 99.14%
This is the best validation accuracy to date.
The corresponding test accuracy is 99.32%
Training mini-batch number 40000
Training mini-batch number 41000
Training mini-batch number 42000
Epoch 16: validation accuracy 99.16%
This is the best validation accuracy to date.
The corresponding test accuracy is 99.39%
Training mini-batch number 43000
Training mini-batch number 44000
Epoch 17: validation accuracy 99.22%
This is the best validation accuracy to date.
The corresponding test accuracy is 99.34%
Training mini-batch number 45000
Training mini-batch number 46000
Training mini-batch number 47000
Epoch 18: validation accuracy 99.19%
Training mini-batch number 48000
Training mini-batch number 49000
Epoch 19: validation accuracy 99.27%
This is the best validation accuracy to date.
The corresponding test accuracy is 99.35%
Training mini-batch number 50000
Training mini-batch number 51000
Training mini-batch number 52000
Epoch 20: validation accuracy 99.22%
Training mini-batch number 53000
Training mini-batch number 54000
Epoch 21: validation accuracy 99.26%
Training mini-batch number 55000
Training mini-batch number 56000
Training mini-batch number 57000
Epoch 22: validation accuracy 99.26%
Training mini-batch number 58000
Training mini-batch number 59000
Epoch 23: validation accuracy 99.28%
This is the best validation accuracy to date.
The corresponding test accuracy is 99.39%
Training mini-batch number 60000
Training mini-batch number 61000
Training mini-batch number 62000
Epoch 24: validation accuracy 99.29%
This is the best validation accuracy to date.
The corresponding test accuracy is 99.43%
Training mini-batch number 63000
Training mini-batch number 64000
Epoch 25: validation accuracy 99.33%
This is the best validation accuracy to date.
The corresponding test accuracy is 99.44%
Training mini-batch number 65000
Training mini-batch number 66000
Training mini-batch number 67000
Epoch 26: validation accuracy 99.33%
This is the best validation accuracy to date.
The corresponding test accuracy is 99.42%
Training mini-batch number 68000
Training mini-batch number 69000
Epoch 27: validation accuracy 99.28%
Training mini-batch number 70000
Training mini-batch number 71000
Training mini-batch number 72000
Epoch 28: validation accuracy 99.28%
Training mini-batch number 73000
Training mini-batch number 74000
Epoch 29: validation accuracy 99.32%
Training mini-batch number 75000
Training mini-batch number 76000
Training mini-batch number 77000
Epoch 30: validation accuracy 99.29%
Training mini-batch number 78000
Training mini-batch number 79000
Epoch 31: validation accuracy 99.30%
Training mini-batch number 80000
Training mini-batch number 81000
Training mini-batch number 82000
Epoch 32: validation accuracy 99.32%
Training mini-batch number 83000
Training mini-batch number 84000
Epoch 33: validation accuracy 99.36%
This is the best validation accuracy to date.
The corresponding test accuracy is 99.45%
Training mini-batch number 85000
Training mini-batch number 86000
Training mini-batch number 87000
Epoch 34: validation accuracy 99.36%
This is the best validation accuracy to date.
The corresponding test accuracy is 99.44%
Training mini-batch number 88000
Training mini-batch number 89000
Epoch 35: validation accuracy 99.36%
This is the best validation accuracy to date.
The corresponding test accuracy is 99.49%
Training mini-batch number 90000
Training mini-batch number 91000
Training mini-batch number 92000
Epoch 36: validation accuracy 99.38%
This is the best validation accuracy to date.
The corresponding test accuracy is 99.43%
Training mini-batch number 93000
Training mini-batch number 94000
Epoch 37: validation accuracy 99.34%
Training mini-batch number 95000
Training mini-batch number 96000
Training mini-batch number 97000
Epoch 38: validation accuracy 99.36%
Training mini-batch number 98000
Training mini-batch number 99000
Epoch 39: validation accuracy 99.40%
This is the best validation accuracy to date.
The corresponding test accuracy is 99.45%
Finished training network.
Best validation accuracy of 99.40% obtained at iteration 99999
Corresponding test accuracy of 99.45%