An artificial neural network (ANN) learning algorithm, usually called a “neural network” (NN), is a learning algorithm inspired by the structure and functional aspects of biological neural networks. Falling hardware prices and the development of GPUs for personal use in the last few years have contributed to the rise of deep learning, which uses multiple hidden layers in an artificial neural network. Neural networks and deep learning currently provide the best solutions to many problems in image recognition, speech recognition, and natural language processing. In the conventional approach to programming, we tell the computer what to do, breaking big problems up into many small, precisely defined tasks that the computer can easily perform. By contrast, in a neural network we don’t tell the computer how to solve our problem. Instead, it learns from observational data, figuring out its own solution to the problem at hand.
Created a GCP virtual machine instance and installed Docker on it:
sudo apt-get remove docker docker-engine docker.io
sudo apt-get update
sudo apt-get install apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update
sudo apt-get install docker-ce
sudo docker run hello-world
sudo mkdir -p /home/me/notebooks/
sudo docker run -d -p 8888:8888 -v /home/me/notebooks:/home/ds/notebooks dataquestio/python2-starter
sudo docker ps # to get <container_hash>
sudo docker exec -it <container_hash> bash
sudo chmod 777 /home/ds/notebooks/
pip install --upgrade pip
touch requirements.txt
pip freeze > requirements.txt
pip install -U $(pip freeze | awk '{split($0, a, "=="); print a[1]}')
pip install notebook==4.0.6 ipython==4.0.0 ipykernel==4.1.1
pip freeze > requirements.txt
The MNIST database of handwritten digits is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.
wget https://github.com/mnielsen/neural-networks-and-deep-learning/archive/master.zip
sudo apt-get install unzip
unzip master.zip
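As a quick, optional check that the archive contains the MNIST data (a sketch using the same cPickle/gzip approach as the load_data_shared function further below; run it from the directory where the archive was unzipped):
import cPickle, gzip
f = gzip.open("neural-networks-and-deep-learning-master/data/mnist.pkl.gz", 'rb')
training_data, validation_data, test_data = cPickle.load(f)
f.close()
print training_data[0].shape, training_data[1].shape   # expect (50000, 784) (50000,)
print validation_data[0].shape, test_data[0].shape     # expect (10000, 784) (10000, 784)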
One way to improve results is by algorithmically expanding the training data. A simple way of expanding the training data is to displace each training image by a single pixel: up one pixel, down one pixel, left one pixel, or right one pixel. This is accomplished by running the program expand_mnist.py from the shell prompt.
cd neural-networks-and-deep-learning-master/src
python expand_mnist.py
cd ../..
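As a rough sketch of the displacement idea (plain numpy, not the actual expand_mnist.py code; it assumes an image is stored as a flattened 784-element vector, as in mnist.pkl.gz):
import numpy as np

def shift_right_one_pixel(flat_image):
    """Shift a flattened 28x28 image one pixel to the right,
    filling the vacated left column with zeros."""
    img = flat_image.reshape(28, 28)
    shifted = np.roll(img, 1, axis=1)  # roll columns one step to the right
    shifted[:, 0] = 0.0                # blank out the wrapped-around column
    return shifted.reshape(784)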
Open a web browser and navigate to http://public_dns_name:8888/tree/neural-networks-and-deep-learning-master/src, where public_dns_name is the GCP External IP address for your VM instance. Start a new “Python 2” notebook.
Theano makes it easy to implement backpropagation for convolutional neural networks, since it automatically computes all the mappings involved. Theano is also quite a bit faster than a plain vanilla implementation of backpropagation, which makes it practical to train more complex networks. In particular, Theano compiles symbolic expressions to C and can run code on either a CPU or, if available, a GPU. Running on a GPU provides a substantial speedup and, again, helps make it practical to train more complex networks.
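As a minimal illustration of the symbolic-graph idea (a sketch, separate from the network code below): Theano builds a symbolic expression, differentiates it automatically with T.grad, and compiles it into a callable function.
import theano
import theano.tensor as T

x = T.dscalar('x')                    # a symbolic scalar
y = x ** 2                            # a symbolic expression built from it
dy_dx = T.grad(y, x)                  # Theano derives the gradient symbolically
f = theano.function([x], [y, dy_dx])  # compile the graph
print f(3.0)                          # [array(9.0), array(6.0)]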
import cPickle, gzip, numpy as np
import theano
import theano.tensor as T
from theano.tensor.nnet import conv
from theano.tensor.nnet import softmax
from theano.tensor import shared_randomstreams
from theano.tensor.signal import downsample
The neural network’s output is assumed to be the index of whichever neuron in the final layer has the highest activation. Networks using rectified linear units consistently outperformed networks based on sigmoid activation functions on this dataset. At present, there is a poor understanding of what makes the rectified linear activation function better than the sigmoid or tanh functions.
def linear(z): return z
def ReLU(z): return T.maximum(0.0, z)
from theano.tensor.nnet import sigmoid
from theano.tensor import tanh
GPU = False
if GPU:
print "Trying to run under a GPU. If this is not desired, then set the GPU flag to False."
try: theano.config.device = 'gpu'
except: pass # it's already set
theano.config.floatX = 'float32'
else:
print "Running with a CPU. If this is not desired, then set the GPU flag to True."
Load the MNIST data (training, validation, and test sets) from mnist.pkl.gz and place it into Theano shared variables, so that it can be copied to the GPU if one is available.
def load_data_shared(filename="neural-networks-and-deep-learning-master/data/mnist.pkl.gz"):
f = gzip.open(filename, 'rb')
training_data, validation_data, test_data = cPickle.load(f)
f.close()
def shared(data):
"""Place the data into shared variables. This allows Theano to copy
the data to the GPU, if one is available."""
shared_x = theano.shared(
np.asarray(data[0], dtype=theano.config.floatX), borrow=True)
shared_y = theano.shared(
np.asarray(data[1], dtype=theano.config.floatX), borrow=True)
return shared_x, T.cast(shared_y, "int32")
return [shared(training_data), shared(validation_data), shared(test_data)]
Main class used to construct and train networks.
class Network(object):
def __init__(self, layers, mini_batch_size):
"""Takes a list of `layers`, describing the network architecture, and
a value for the `mini_batch_size` to be used during training
by stochastic gradient descent."""
self.layers = layers
self.mini_batch_size = mini_batch_size
self.params = [param for layer in self.layers for param in layer.params]
self.x = T.matrix("x")
self.y = T.ivector("y")
init_layer = self.layers[0]
init_layer.set_inpt(self.x, self.x, self.mini_batch_size)
for j in xrange(1, len(self.layers)):
prev_layer, layer = self.layers[j-1], self.layers[j]
layer.set_inpt(
prev_layer.output, prev_layer.output_dropout, self.mini_batch_size)
self.output = self.layers[-1].output
self.output_dropout = self.layers[-1].output_dropout
def SGD(self, training_data, epochs, mini_batch_size, eta,
validation_data, test_data, lmbda=0.0):
"""Train the network using mini-batch stochastic gradient descent."""
training_x, training_y = training_data
validation_x, validation_y = validation_data
test_x, test_y = test_data
# compute number of minibatches for training, validation and testing
num_training_batches = size(training_data)/mini_batch_size
num_validation_batches = size(validation_data)/mini_batch_size
num_test_batches = size(test_data)/mini_batch_size
# define the (regularized) cost function, symbolic gradients, and updates
l2_norm_squared = sum([(layer.w**2).sum() for layer in self.layers])
cost = self.layers[-1].cost(self)+\
0.5*lmbda*l2_norm_squared/num_training_batches
grads = T.grad(cost, self.params)
updates = [(param, param-eta*grad)
for param, grad in zip(self.params, grads)]
# define functions to train a mini-batch, and to compute the
# accuracy in validation and test mini-batches.
i = T.lscalar() # mini-batch index
train_mb = theano.function(
[i], cost, updates=updates,
givens={
self.x:
training_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
self.y:
training_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
})
validate_mb_accuracy = theano.function(
[i], self.layers[-1].accuracy(self.y),
givens={
self.x:
validation_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
self.y:
validation_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
})
test_mb_accuracy = theano.function(
[i], self.layers[-1].accuracy(self.y),
givens={
self.x:
test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
self.y:
test_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
})
self.test_mb_predictions = theano.function(
[i], self.layers[-1].y_out,
givens={
self.x:
test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
})
# Do the actual training
best_validation_accuracy = 0.0
for epoch in xrange(epochs):
for minibatch_index in xrange(num_training_batches):
iteration = num_training_batches*epoch+minibatch_index
if iteration % 1000 == 0:
print("Training mini-batch number {0}".format(iteration))
cost_ij = train_mb(minibatch_index)
if (iteration+1) % num_training_batches == 0:
validation_accuracy = np.mean(
[validate_mb_accuracy(j) for j in xrange(num_validation_batches)])
print("Epoch {0}: validation accuracy {1:.2%}".format(
epoch, validation_accuracy))
if validation_accuracy >= best_validation_accuracy:
print("This is the best validation accuracy to date.")
best_validation_accuracy = validation_accuracy
best_iteration = iteration
if test_data:
test_accuracy = np.mean(
[test_mb_accuracy(j) for j in xrange(num_test_batches)])
print('The corresponding test accuracy is {0:.2%}'.format(
test_accuracy))
print("Finished training network.")
print("Best validation accuracy of {0:.2%} obtained at iteration {1}".format(
best_validation_accuracy, best_iteration))
print("Corresponding test accuracy of {0:.2%}".format(test_accuracy))
Sometimes referred to as a loss or objective function. The training algorithm has done a good job if it can find weights and biases such that the quadratic cost function, sometimes known as the mean squared error or just MSE, is equal to zero. By contrast, it’s not doing so well when the cost (error) is large. \[C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2 \tag{Quadratic}\] There is a learning slowdown (difficulty learning when initial errors are large) problem encountered with the quadratic cost function when using a sigmoid function. This is mitigated by using a cross-entropy cost function with the sigmoid function. \[C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right] \tag{Cross-Entropy}\] A more common choice in modern image classification networks, however, is the log-likelihood cost function with a softmax output layer. A softmax output layer with log-likelihood cost is quite similar to a sigmoid output layer with cross-entropy cost. \[C \equiv -\ln a^L_y \tag{Log-Likelihood}\]
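This can be made slightly more precise (a standard result quoted here for reference, not derived in this notebook): with a sigmoid output layer and cross-entropy cost, and likewise with a softmax output layer and log-likelihood cost, the output-layer error is simply \[\delta^L_j = \frac{\partial C}{\partial z^L_j} = a^L_j - y_j \tag{Output Error}\] whereas the quadratic cost with a sigmoid output gives \[\delta^L_j = (a^L_j - y_j)\,\sigma'(z^L_j) \tag{Output Error, Quadratic}\] and the extra \(\sigma'(z^L_j)\) factor, which is close to zero when the neuron saturates, is what causes the learning slowdown.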
Convolutional neural networks use three basic ideas: local receptive fields, shared weights, and pooling. Each neuron in the first hidden layer is connected to a small region of the input neurons. That region in the input image is called the local receptive field for hidden neurons. Each hidden neuron has a bias and weights connected to its local receptive field. The same weights and bias are used for each of the hidden neurons. The shared weights and bias are often said to define a kernel or filter. Pooling layers are usually used immediately after convolutional layers. What the pooling layers do is simplify the information in the output from the convolutional layer.
[Figure: local receptive fields and the resulting feature maps]
Max-pooling is a way for the network to ask whether a given feature is found anywhere in a region of the image. It then throws away the exact positional information. The intuition is that once a feature has been found, its exact location isn’t as important as its rough location relative to other features. A big benefit is that there are many fewer pooled features, and so this helps reduce the number of parameters needed in later layers.
[Figure: pooling units condensing each feature map]
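As a small numerical sketch of 2x2 max-pooling (plain numpy, separate from the network code): each non-overlapping 2x2 block of a feature map is replaced by its maximum.
import numpy as np

feature_map = np.array([[ 1,  2,  5,  6],
                        [ 3,  4,  7,  8],
                        [ 9, 10, 13, 14],
                        [11, 12, 15, 16]])
# group the 4x4 map into 2x2 blocks and take the maximum of each block
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=3).max(axis=1)
# pooled == [[ 4,  8],
#            [12, 16]]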
class ConvPoolLayer(object):
"""Used to create a combination of a convolutional and a max-pooling
layer. A more sophisticated implementation would separate the
two, but for our purposes we'll always use them together, and it
simplifies the code, so it makes sense to combine them."""
def __init__(self, filter_shape, image_shape, poolsize=(2, 2),
activation_fn=sigmoid):
"""`filter_shape` is a tuple of length 4, whose entries are the number
of filters, the number of input feature maps, the filter height, and the
filter width.
`image_shape` is a tuple of length 4, whose entries are the
mini-batch size, the number of input feature maps, the image
height, and the image width.
`poolsize` is a tuple of length 2, whose entries are the y and
x pooling sizes."""
self.filter_shape = filter_shape
self.image_shape = image_shape
self.poolsize = poolsize
self.activation_fn=activation_fn
# initialize weights and biases
n_out = (filter_shape[0]*np.prod(filter_shape[2:])/np.prod(poolsize))
self.w = theano.shared(
np.asarray(
np.random.normal(loc=0, scale=np.sqrt(1.0/n_out), size=filter_shape),
dtype=theano.config.floatX),
borrow=True)
self.b = theano.shared(
np.asarray(
np.random.normal(loc=0, scale=1.0, size=(filter_shape[0],)),
dtype=theano.config.floatX),
borrow=True)
self.params = [self.w, self.b]
def set_inpt(self, inpt, inpt_dropout, mini_batch_size):
self.inpt = inpt.reshape(self.image_shape)
conv_out = conv.conv2d(
input=self.inpt, filters=self.w, filter_shape=self.filter_shape,
image_shape=self.image_shape)
pooled_out = downsample.max_pool_2d(
input=conv_out, ds=self.poolsize, ignore_border=True)
self.output = self.activation_fn(
pooled_out + self.b.dimshuffle('x', 0, 'x', 'x'))
self.output_dropout = self.output # no dropout in the convolutional layers
Adjacent network layers are fully connected to one another. That is, every neuron in the network is connected to every neuron in adjacent layers. Although they work well, architectures built from fully-connected layers alone do not take into account the spatial structure of the images they classify. Convolutional neural networks take advantage of the spatial structure. These networks use a special architecture which is particularly well adapted to classifying images. Using this architecture makes convolutional networks fast to train. This, in turn, helps us train deep, many-layer networks, which are very good at classifying images. The convolutional and pooling layers are sometimes treated as a single layer.
class FullyConnectedLayer(object):
def __init__(self, n_in, n_out, activation_fn=sigmoid, p_dropout=0.0):
self.n_in = n_in
self.n_out = n_out
self.activation_fn = activation_fn
self.p_dropout = p_dropout
# Initialize weights and biases
self.w = theano.shared(
np.asarray(
np.random.normal(
loc=0.0, scale=np.sqrt(1.0/n_out), size=(n_in, n_out)),
dtype=theano.config.floatX),
name='w', borrow=True)
self.b = theano.shared(
np.asarray(np.random.normal(loc=0.0, scale=1.0, size=(n_out,)),
dtype=theano.config.floatX),
name='b', borrow=True)
self.params = [self.w, self.b]
def set_inpt(self, inpt, inpt_dropout, mini_batch_size):
self.inpt = inpt.reshape((mini_batch_size, self.n_in))
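        # Note: the non-dropout output below scales the weighted input by (1 - p_dropout)
        # so that its expected magnitude matches the dropout path used during training.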
self.output = self.activation_fn(
(1-self.p_dropout)*T.dot(self.inpt, self.w) + self.b)
self.y_out = T.argmax(self.output, axis=1)
self.inpt_dropout = dropout_layer(
inpt_dropout.reshape((mini_batch_size, self.n_in)), self.p_dropout)
self.output_dropout = self.activation_fn(
T.dot(self.inpt_dropout, self.w) + self.b)
def accuracy(self, y):
"Return the accuracy for the mini-batch."
return T.mean(T.eq(y, self.y_out))
A softmax layer (that uses the softmax function) addresses the learning slowdown problem encountered with the sigmoid function. A softmax output layer with log-likelihood cost is quite similar to a sigmoid output layer with cross-entropy cost. The output from the softmax layer is a set of positive numbers which sum up to 1. In other words, the output from the softmax layer can be thought of as a probability distribution. Softmax plus log-likelihood cost is more common in modern image classification networks.
\[\sigma(z) \equiv \frac{1}{1+e^{-z}}=\frac{1}{1+\exp(-\sum_j w_j x_j-b)} \tag{Sigmoid Function}\]\[a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}} \tag{Softmax Function}\]
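A small numeric sketch of the softmax function (plain numpy, separate from the network code):
import numpy as np

z = np.array([1.0, 2.0, 3.0])
a = np.exp(z) / np.sum(np.exp(z))
# a is approximately [0.090, 0.245, 0.665]; the entries are positive and sum to 1,
# so they can be read as a probability distribution over the classes.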
class SoftmaxLayer(object):
def __init__(self, n_in, n_out, p_dropout=0.0):
self.n_in = n_in
self.n_out = n_out
self.p_dropout = p_dropout
# Initialize weights and biases
self.w = theano.shared(
np.zeros((n_in, n_out), dtype=theano.config.floatX),
name='w', borrow=True)
self.b = theano.shared(
np.zeros((n_out,), dtype=theano.config.floatX),
name='b', borrow=True)
self.params = [self.w, self.b]
def set_inpt(self, inpt, inpt_dropout, mini_batch_size):
self.inpt = inpt.reshape((mini_batch_size, self.n_in))
self.output = softmax((1-self.p_dropout)*T.dot(self.inpt, self.w) + self.b)
self.y_out = T.argmax(self.output, axis=1)
self.inpt_dropout = dropout_layer(
inpt_dropout.reshape((mini_batch_size, self.n_in)), self.p_dropout)
self.output_dropout = softmax(T.dot(self.inpt_dropout, self.w) + self.b)
def cost(self, net):
"Return the log-likelihood cost."
return -T.mean(T.log(self.output_dropout)[T.arange(net.y.shape[0]), net.y])
def accuracy(self, y):
"Return the accuracy for the mini-batch."
return T.mean(T.eq(y, self.y_out))
Return the number of examples in a dataset. In the SGD method above, this value is divided by mini_batch_size to get the number of mini-batches for training, validation, and testing.
def size(data):
"Return the size of the dataset `data`."
return data[0].get_value(borrow=True).shape[0]
Remove individual activations at random while training the network. This makes the model more robust to the loss of individual pieces of evidence, and thus less likely to rely on particular idiosyncrasies of the training data. This also reduces overfitting.
def dropout_layer(layer, p_dropout):
srng = shared_randomstreams.RandomStreams(
np.random.RandomState(0).randint(999999))
mask = srng.binomial(n=1, p=1-p_dropout, size=layer.shape)
return layer*T.cast(mask, theano.config.floatX)
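As a rough numpy sketch of what the binomial mask does (separate from the Theano code above): with p_dropout = 0.5, each activation is kept with probability 0.5 and zeroed otherwise on each training pass.
import numpy as np

rng = np.random.RandomState(0)
activations = np.ones(10)
p_dropout = 0.5
mask = rng.binomial(n=1, p=1 - p_dropout, size=activations.shape)
dropped = activations * mask  # roughly half of the entries are zeroed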
training_data, validation_data, test_data = load_data_shared()
expanded_training_data, _, _ = load_data_shared("neural-networks-and-deep-learning-master/data/mnist_expanded.pkl.gz")
A deep neural network has 3 or more hidden layers.
mini_batch_size = 100 # number of randomly chosen training inputs per mini-batch
neurons = [] # List with number of neurons in each layer
The convolutional and pooling layers are treated as a single layer. This layer analyzes unmodified versions of the scanned handwritten digit images. The local receptive field of the Convolutional Neural Network moves across the input image at a stride length of 1.
feature_maps1 = 1 # The starting number of input feature maps
image_width1 = 28 # Horizontal pixels of scanned images
image_height1 = 28 # Vertical pixels of scanned images
neurons.append(image_width1 * image_height1) # Input layer
num_filters1 = 20 # number of filters / output feature maps (each with shared weights and a bias)
filter_width1 = 5 # local receptive field horizontal pixel width
filter_height1 = 5 # local receptive field vertical pixel height
width = (image_width1 - filter_width1 + 1)
height = (image_height1 - filter_height1 + 1)
neurons.append(num_filters1 * height * width) # hidden layer 1
pool_x1 = 2 # max-pooling pixel width
pool_y1 = 2 # max-pooling pixel height
width = (image_width1 - filter_width1 + 1) / pool_x1
height = (image_height1 - filter_height1 + 1) / pool_y1
neurons.append(num_filters1 * height * width) # hidden layer 2
The convolutional and pooling layers are treated as a single layer. This layer analyzes abstracted and condensed versions of the original images using the output of the first convolutional-pooling layer as input. The local receptive field of the Convolutional Neural Network moves across the input image at a stride length of 1.
feature_maps2 = 20 # number of input feature maps (the output feature maps of the first conv-pool layer)
image_height2 = 12 # Vertical pixels of the abstracted image
image_width2 = 12 # Horizontal pixels of the abstracted image
num_filters2 = 40 # number of filters / output feature maps (each with shared weights and a bias)
filter_height2 = 5 # local receptive field pixel height
filter_width2 = 5 # local receptive field pixel width
width = (image_width2 - filter_width2 + 1)
height = (image_height2 - filter_height2 + 1)
neurons.append(num_filters2 * height * width) # hidden layer 3
pool_x2 = 2 # max-pooling pixel width
pool_y2 = 2 # max-pooling pixel height
width = (image_width2 - filter_width2 + 1) / pool_x2
height = (image_height2 - filter_height2 + 1) / pool_y2
neurons.append(num_filters2 * height * width) # hidden layer 4
Connects every neuron from the max-pooled layer to a hidden layer of neurons. Dropout reduces overfitting and improves learning speed.
neurons.append(1000) # hidden layer 5
dropout_prob1 = 0.5 # Random probability for removing individual activations
Connects every neuron from the previous hidden layer to another hidden layer of neurons. This is like adding a second hidden layer in a basic neural network model. Dropout reduces overfitting and improves learning speed.
neurons.append(1000) # hidden layer 6
dropout_prob2 = 0.5 # Random probability for removing individual activations
Connects every neuron from the previous hidden layer to all 10 output neurons. Uses the softmax function plus the log-likelihood cost function for predictions. Dropout reduces overfitting and improves learning speed.
neurons.append(10) # Output layer for digits zero through nine
dropout_prob3 = 0.5 # Random probability for removing individual activations
Print the number of neurons in each layer and the total number of neurons in the network.
print(neurons)
sum(neurons)
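With the settings above, the layer sizes work out to 784 input neurons; 20 × 24 × 24 = 11,520 and 20 × 12 × 12 = 2,880 neurons for the first convolutional and pooling stages; 40 × 8 × 8 = 2,560 and 40 × 4 × 4 = 640 for the second; 1,000 neurons in each fully connected layer; and 10 output neurons, for a total of 20,394 neurons.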
The convolutional and pooling layers are sometimes treated as a single layer. The first convolutional-pooling layer analyzes the unmodified versions of the scanned handwritten digit images. The second convolutional-pooling layer uses the output of the first convolutional-pooling layer as input and analyzes an abstracted and condensed version of the original images. When the abstraction retains a useful amount of spatial structure, this second convolutional-pooling layer helps improve model accuracy. The first fully connected layer connects every neuron from the max-pooled layer to every neuron in a second fully connected layer. The second fully connected layer connects every neuron from the first fully connected layer to all 10 softmax output layer neurons. Adding a second fully connected layer is like adding another hidden layer in a basic neural network model.
net = Network([
ConvPoolLayer(image_shape=(mini_batch_size, feature_maps1, image_height1, image_width1),
filter_shape=(num_filters1, feature_maps1, filter_height1, filter_width1),
poolsize=(pool_x1, pool_y1),
activation_fn=ReLU),
ConvPoolLayer(image_shape=(mini_batch_size, feature_maps2, image_height2, image_width2),
filter_shape=(num_filters2, feature_maps2, filter_height2, filter_width2),
poolsize=(pool_x2, pool_y2),
activation_fn=ReLU),
FullyConnectedLayer(
n_in=neurons[4], n_out=neurons[5], activation_fn=ReLU, p_dropout=dropout_prob1),
FullyConnectedLayer(
n_in=neurons[5], n_out=neurons[6], activation_fn=ReLU, p_dropout=dropout_prob2),
SoftmaxLayer(n_in=neurons[6], n_out=neurons[7], p_dropout=dropout_prob3)],
mini_batch_size)
The training algorithm has done a good job if it can find weights and biases such that the cost function (error) is equal to zero. By contrast, it’s not doing so well when the cost (error) is large. Gradient descent minimizes the cost as a function of the weights and biases. Stochastic gradient descent can be used to speed up learning. It works by picking out a small, randomly chosen mini-batch of training inputs and training with them. It’s much easier to sample a small mini-batch than it is to apply gradient descent to the full batch.
\[C \equiv -\ln a^L_y \tag{Log-Likelihood Cost}\]\[\Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2 \tag{Cost Change}\]\[\nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^T \tag{Gradient Vector}\]\[\Delta C \approx \nabla C \cdot \Delta v = -\eta \nabla C \cdot \nabla C = -\eta \|\nabla C\|^2 \quad \text{for the choice } \Delta v = -\eta \nabla C \tag{Learning Rate}\]
epochs = 40 # training epochs (complete passes through the training data)
eta = 0.03 # learning rate (rate of descent)
net.SGD(expanded_training_data, epochs, mini_batch_size, eta, validation_data, test_data)
http://yann.lecun.com/exdb/mnist/
http://deeplearning.net/software/theano/
http://neuralnetworksanddeeplearning.com/
https://github.com/mnielsen/neural-networks-and-deep-learning
https://stackoverflow.com/questions/24906126/how-to-unpack-pkl-file
https://en.wikipedia.org/wiki/Machine_learning#Artificial_neural_networks
https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi