notes

  • topics: convolutions, pooling, GPUs (to allow more training), algorithmic expansion of the training data (to reduce overfitting), dropout (also to reduce overfitting), ensembles of networks
  • upon reflection, it is strange to use networks with fc (fully-connected) layers to classify images.

    • such a network does not take the spatial structure of the images into account:

      • it treats input pixels that are far apart and pixels that are close together on exactly the same footing.
  • cnns use three basic ideas: local receptive fields, shared weights and pooling.

    1. the creation of maps that learn a feature, i.e. a feature map
    2. a 5x5 subset of the 28x28 input image is weighted, summed and added to a bias to produce a single neuron of the feature map. the same weights and bias are reused for every 5x5 subset of the original input image

      • hence "shared weights". this is what makes a feature map respond to a feature regardless of where it occurs in the input image
    3. we discard exact positional information with pooling, because the rough position of a feature relative to other features matters more than its absolute location
  • the "local receptive field" slides over by a "stride-length"

    • BTW, we can use validation data to choose the stride length that gives the best performance!
  • for the \(j\), \(k\)-th hidden neuron the output is:

    \begin{equation} \label{eq:a} \sigma \left ( b + \sum^4_{l=0}\sum^4_{m=0} w_{l,m} a_{j+l, k+m} \right ) \end{equation}
  • the convolution operation can be used to rewrite \ref{eq:a} (see the toy sketch after this list):

    \begin{equation} \label{eq:a-conv} a^1 = \sigma(b + w * a^0) \end{equation}
  • pooling layers are usually used immediately after convolutional layers.

    • they take each feature map and prepare a condensed feature map
    • max-pooling just takes the highest activation in a given 2x2 region
    • L2 pooling takes the square root of the sum of the squares of the activations in the 2x2 region.
    • we can certainly use the validation data to see which pooling strategy works best (both appear in the sketch after this list)!
  • we need to modify backprop (from network.py / network2.py) for CNNs.
  • softmax plus log-likelihood cost is more common in modern image classification networks.
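
as a toy illustration of the feature-map equation \ref{eq:a} and the 2x2 pooling strategies above, here is a minimal NumPy sketch (my own, not from the book or network3.py); the sizes assume a single 28x28 input, one 5x5 local receptive field with stride 1, and a single feature map:

  import numpy as np

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  rng = np.random.default_rng(0)
  a0 = rng.random((28, 28))          # toy "input image"
  w = rng.standard_normal((5, 5))    # shared 5x5 weights (the local receptive field)
  b = rng.standard_normal()          # shared bias

  # explicit double sum from \ref{eq:a}: one hidden neuron per (j, k) position, stride 1
  a1 = np.empty((24, 24))
  for j in range(24):
      for k in range(24):
          a1[j, k] = sigmoid(b + np.sum(w * a0[j:j+5, k:k+5]))

  # pooling: condense the 24x24 feature map over non-overlapping 2x2 regions
  blocks = a1.reshape(12, 2, 12, 2)
  max_pooled = blocks.max(axis=(1, 3))                 # max-pooling
  l2_pooled = np.sqrt((blocks ** 2).sum(axis=(1, 3)))  # L2 pooling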

experiments

  • in the code, the convolutional and pooling layers are treated as a single layer.

network3.py

"""
Got the code from https://github.com/MichalDanielDobrzanski/DeepLearningPython/pull/14/
"""

"""network3.py
~~~~~~~~~~~~~~
A Theano-based program for training and running simple neural
networks.
Supports several layer types (fully connected, convolutional, max
pooling, softmax), and activation functions (sigmoid, tanh, and
rectified linear units, with more easily added).
When run on a CPU, this program is much faster than network.py and
network2.py.  However, unlike network.py and network2.py it can also
be run on a GPU, which makes it faster still.
Because the code is based on Theano, the code is different in many
ways from network.py and network2.py.  However, where possible I have
tried to maintain consistency with the earlier programs.  In
particular, the API is similar to network2.py.  Note that I have
focused on making the code simple, easily readable, and easily
modifiable.  It is not optimized, and omits many desirable features.
This program incorporates ideas from the Theano documentation on
convolutional neural nets (notably,
http://deeplearning.net/tutorial/lenet.html ), from Misha Denil's
implementation of dropout (https://github.com/mdenil/dropout ), and
from Chris Olah (http://colah.github.io ).
"""

#### Libraries
# Standard library
import pickle
import gzip

# Third-party libraries
import numpy as np
import theano
import theano.tensor as T
from theano.tensor.nnet import conv
from theano.tensor.nnet import softmax
from theano.tensor import shared_randomstreams
from theano.tensor.signal.pool import pool_2d

# Activation functions for neurons
def linear(z): return z
def ReLU(z): return T.maximum(0.0, z)
from theano.tensor.nnet import sigmoid
from theano.tensor import tanh


#### Constants
GPU = True
if GPU:
    print("Trying to run under a GPU.  If this is not desired, then modify "+\
        "network3.py\nto set the GPU flag to False.")
    try: theano.config.device = 'gpu'
    except: pass # it's already set
    theano.config.floatX = 'float32'
else:
    print("Running with a CPU.  If this is not desired, then the modify "+\
        "network3.py to set\nthe GPU flag to True.")

#### Load the MNIST data
def load_data_shared(filename="../data/mnist.pkl.gz"):
    f = gzip.open(filename, 'rb')
    training_data, validation_data, test_data = pickle.load(f, encoding="latin1")
    f.close()
    def shared(data):
        """Place the data into shared variables.  This allows Theano to copy
        the data to the GPU, if one is available.
        """
        shared_x = theano.shared(
            np.asarray(data[0], dtype=theano.config.floatX), borrow=True)
        shared_y = theano.shared(
            np.asarray(data[1], dtype=theano.config.floatX), borrow=True)
        return shared_x, T.cast(shared_y, "int32")
    return [shared(training_data), shared(validation_data), shared(test_data)]

#### Main class used to construct and train networks
class Network(object):

    def __init__(self, layers, mini_batch_size):
        """Takes a list of `layers`, describing the network architecture, and
        a value for the `mini_batch_size` to be used during training
        by stochastic gradient descent.
        """
        self.layers = layers
        self.mini_batch_size = mini_batch_size
        self.params = [param for layer in self.layers for param in layer.params]
        self.x = T.matrix("x")
        self.y = T.ivector("y")
        init_layer = self.layers[0]
        init_layer.set_inpt(self.x, self.x, self.mini_batch_size)
        for j in range(1, len(self.layers)): # xrange() was renamed to range() in Python 3.
            prev_layer, layer  = self.layers[j-1], self.layers[j]
            layer.set_inpt(
                prev_layer.output, prev_layer.output_dropout, self.mini_batch_size)
        self.output = self.layers[-1].output
        self.output_dropout = self.layers[-1].output_dropout

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            validation_data, test_data, lmbda=0.0):
        """Train the network using mini-batch stochastic gradient descent."""
        training_x, training_y = training_data
        validation_x, validation_y = validation_data
        test_x, test_y = test_data

        # compute number of minibatches for training, validation and testing
        num_training_batches = int(size(training_data)/mini_batch_size)
        num_validation_batches = int(size(validation_data)/mini_batch_size)
        num_test_batches = int(size(test_data)/mini_batch_size)

        # define the (regularized) cost function, symbolic gradients, and updates
        l2_norm_squared = sum([(layer.w**2).sum() for layer in self.layers])
        cost = self.layers[-1].cost(self)+\
               0.5*lmbda*l2_norm_squared/num_training_batches
        grads = T.grad(cost, self.params)
        updates = [(param, param-eta*grad)
                   for param, grad in zip(self.params, grads)]

        # define functions to train a mini-batch, and to compute the
        # accuracy in validation and test mini-batches.
        i = T.lscalar() # mini-batch index
        train_mb = theano.function(
            [i], cost, updates=updates,
            givens={
                self.x:
                training_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
                self.y:
                training_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
            })
        validate_mb_accuracy = theano.function(
            [i], self.layers[-1].accuracy(self.y),
            givens={
                self.x:
                validation_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
                self.y:
                validation_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
            })
        test_mb_accuracy = theano.function(
            [i], self.layers[-1].accuracy(self.y),
            givens={
                self.x:
                test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
                self.y:
                test_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
            })
        self.test_mb_predictions = theano.function(
            [i], self.layers[-1].y_out,
            givens={
                self.x:
                test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
            })
        # Do the actual training
        best_validation_accuracy = 0.0
        for epoch in range(epochs):
            for minibatch_index in range(num_training_batches):
                iteration = num_training_batches*epoch+minibatch_index
                if iteration % 1000 == 0:
                    print("Training mini-batch number {0}".format(iteration))
                cost_ij = train_mb(minibatch_index)
                if (iteration+1) % num_training_batches == 0:
                    validation_accuracy = np.mean(
                        [validate_mb_accuracy(j) for j in range(num_validation_batches)])
                    print("Epoch {0}: validation accuracy {1:.2%}".format(
                        epoch, validation_accuracy))
                    if validation_accuracy >= best_validation_accuracy:
                        print("This is the best validation accuracy to date.")
                        best_validation_accuracy = validation_accuracy
                        best_iteration = iteration
                        if test_data:
                            test_accuracy = np.mean(
                                [test_mb_accuracy(j) for j in range(num_test_batches)])
                            print('The corresponding test accuracy is {0:.2%}'.format(
                                test_accuracy))
        print("Finished training network.")
        print("Best validation accuracy of {0:.2%} obtained at iteration {1}".format(
            best_validation_accuracy, best_iteration))
        print("Corresponding test accuracy of {0:.2%}".format(test_accuracy))

#### Define layer types

class ConvPoolLayer(object):
    """Used to create a combination of a convolutional and a max-pooling
    layer.  A more sophisticated implementation would separate the
    two, but for our purposes we'll always use them together, and it
    simplifies the code, so it makes sense to combine them.
    """

    def __init__(self, filter_shape, image_shape, poolsize=(2, 2),
                 activation_fn=sigmoid):
        """`filter_shape` is a tuple of length 4, whose entries are the number
        of filters, the number of input feature maps, the filter height, and the
        filter width.
        `image_shape` is a tuple of length 4, whose entries are the
        mini-batch size, the number of input feature maps, the image
        height, and the image width.
        `poolsize` is a tuple of length 2, whose entries are the y and
        x pooling sizes.
        """
        self.filter_shape = filter_shape
        self.image_shape = image_shape
        self.poolsize = poolsize
        self.activation_fn=activation_fn
        # initialize weights and biases
        n_out = (filter_shape[0]*np.prod(filter_shape[2:])/np.prod(poolsize))
        self.w = theano.shared(
            np.asarray(
                np.random.normal(loc=0, scale=np.sqrt(1.0/n_out), size=filter_shape),
                dtype=theano.config.floatX),
            borrow=True)
        self.b = theano.shared(
            np.asarray(
                np.random.normal(loc=0, scale=1.0, size=(filter_shape[0],)),
                dtype=theano.config.floatX),
            borrow=True)
        self.params = [self.w, self.b]

    def set_inpt(self, inpt, inpt_dropout, mini_batch_size):
        self.inpt = inpt.reshape(self.image_shape)
        conv_out = conv.conv2d(
            input=self.inpt, filters=self.w, filter_shape=self.filter_shape,
            image_shape=self.image_shape)
        pooled_out = pool_2d(
            input=conv_out, ws=self.poolsize, ignore_border=True)
        self.output = self.activation_fn(
            pooled_out + self.b.dimshuffle('x', 0, 'x', 'x'))
        self.output_dropout = self.output # no dropout in the convolutional layers

class FullyConnectedLayer(object):

    def __init__(self, n_in, n_out, activation_fn=sigmoid, p_dropout=0.0):
        self.n_in = n_in
        self.n_out = n_out
        self.activation_fn = activation_fn
        self.p_dropout = p_dropout
        # Initialize weights and biases
        self.w = theano.shared(
            np.asarray(
                np.random.normal(
                    loc=0.0, scale=np.sqrt(1.0/n_out), size=(n_in, n_out)),
                dtype=theano.config.floatX),
            name='w', borrow=True)
        self.b = theano.shared(
            np.asarray(np.random.normal(loc=0.0, scale=1.0, size=(n_out,)),
                       dtype=theano.config.floatX),
            name='b', borrow=True)
        self.params = [self.w, self.b]

    def set_inpt(self, inpt, inpt_dropout, mini_batch_size):
        self.inpt = inpt.reshape((mini_batch_size, self.n_in))
        self.output = self.activation_fn(
            (1-self.p_dropout)*T.dot(self.inpt, self.w) + self.b)
        self.y_out = T.argmax(self.output, axis=1)
        self.inpt_dropout = dropout_layer(
            inpt_dropout.reshape((mini_batch_size, self.n_in)), self.p_dropout)
        self.output_dropout = self.activation_fn(
            T.dot(self.inpt_dropout, self.w) + self.b)

    def accuracy(self, y):
        "Return the accuracy for the mini-batch."
        return T.mean(T.eq(y, self.y_out))

class SoftmaxLayer(object):

    def __init__(self, n_in, n_out, p_dropout=0.0):
        self.n_in = n_in
        self.n_out = n_out
        self.p_dropout = p_dropout
        # Initialize weights and biases
        self.w = theano.shared(
            np.zeros((n_in, n_out), dtype=theano.config.floatX),
            name='w', borrow=True)
        self.b = theano.shared(
            np.zeros((n_out,), dtype=theano.config.floatX),
            name='b', borrow=True)
        self.params = [self.w, self.b]

    def set_inpt(self, inpt, inpt_dropout, mini_batch_size):
        self.inpt = inpt.reshape((mini_batch_size, self.n_in))
        self.output = softmax((1-self.p_dropout)*T.dot(self.inpt, self.w) + self.b)
        self.y_out = T.argmax(self.output, axis=1)
        self.inpt_dropout = dropout_layer(
            inpt_dropout.reshape((mini_batch_size, self.n_in)), self.p_dropout)
        self.output_dropout = softmax(T.dot(self.inpt_dropout, self.w) + self.b)

    def cost(self, net):
        "Return the log-likelihood cost."
        return -T.mean(T.log(self.output_dropout)[T.arange(net.y.shape[0]), net.y])

    def accuracy(self, y):
        "Return the accuracy for the mini-batch."
        return T.mean(T.eq(y, self.y_out))


#### Miscellanea
def size(data):
    "Return the size of the dataset `data`."
    return data[0].get_value(borrow=True).shape[0]

def dropout_layer(layer, p_dropout):
    srng = shared_randomstreams.RandomStreams(
        np.random.RandomState(0).randint(999999))
    mask = srng.binomial(n=1, p=1-p_dropout, size=layer.shape)
    return layer*T.cast(mask, theano.config.floatX)

DONE single hidden layer, baseline

CLOSED: [2025-04-14 Mon 13:06]

  • State "DONE" from [2025-04-14 Mon 13:06]

60 epochs, \(\eta = 0.1\), mini-batch 10, 100 hidden neurons

(theo310) z5362216@k105:~/neural-networks-and-deep-learning/src $ python
Python 3.10.8 (main, Dec  5 2022, 10:38:26) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import network3
WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
Trying to run under a GPU.  If this is not desired, then modify network3.py
to set the GPU flag to False.
>>> from network3 import Network
>>> from network3 import ConvPoolLayer, FullyConnectedLayer, SoftmaxLayer
>>> training_data, validation_data, test_data = network3.load_data_shared()
>>> mini_batch_size = 10
>>> net = Network([FullyConnectedLayer(n_in=784, n_out=100),SoftmaxLayer(n_in=100,n_out=10)], mini_batch_size)
>>> net.SGD(training_data, 60, mini_batch_size, 0.1, validation_data, test_data)

#+RESULTS:
: Epoch 59: validation accuracy 97.74%
: Finished training network.
: Best validation accuracy of 97.82% obtained at iteration 114999
: Corresponding test accuracy of 97.67%

DONE adding 1 convolutional-pooling layer:

CLOSED: [2025-04-14 Mon 13:06]

  • State "DONE" from [2025-04-14 Mon 13:06]
>>> net = Network([
... ConvPoolLayer(
... image_shape=(mini_batch_size, 1, 28, 28),
... filter_shape=(20,1,5,5),
... poolsize=(2,2)),
... FullyConnectedLayer(n_in=20*12*12, n_out=100),
... SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
>>> net.SGD(training_data, 60, mini_batch_size, 0.1, validation_data, test_data)

#+RESULTS:
: Epoch 59: validation accuracy 98.90%
: This is the best validation accuracy to date.
: The corresponding test accuracy is 98.81%
: Finished training network.
: Best validation accuracy of 98.90% obtained at iteration 299999
: Corresponding test accuracy of 98.81%

DONE adding a second conv-pool layer:

CLOSED: [2025-04-14 Mon 13:06]

  • State "DONE" from [2025-04-14 Mon 13:06]
net = Network([
    ConvPoolLayer(
	image_shape=(mini_batch_size, 1, 28, 28),
	filter_shape=(20,1,5,5),poolsize=(2,2)),
    ConvPoolLayer(
	image_shape=(mini_batch_size, 20, 12, 12),
	filter_shape=(40,20,5,5),poolsize=(2,2)),
    FullyConnectedLayer(n_in=40*4*4, n_out=100),
    SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
>>> net.SGD(training_data, 60, mini_batch_size, 0.1, validation_data, test_data)

Epoch 59: validation accuracy 98.94%
Finished training network.
Best validation accuracy of 98.94% obtained at iteration 259999
Corresponding test accuracy of 98.98%

DONE changing to relu activation function:

CLOSED: [2025-04-14 Mon 14:34]

  • State "DONE" from "WAIT" [2025-04-14 Mon 14:34]
  • State "WAIT" from [2025-04-14 Mon 13:06]
  from network3 import ReLU
  net = Network([
      ConvPoolLayer(
	  image_shape=(mini_batch_size, 1, 28, 28),
	  filter_shape=(20,1,5,5),poolsize=(2,2), activation_fn=ReLU),
      ConvPoolLayer(
	  image_shape=(mini_batch_size, 20, 12, 12),
	  filter_shape=(40,20,5,5),poolsize=(2,2), activation_fn=ReLU),
      FullyConnectedLayer(n_in=40*4*4, n_out=100, activation_fn=ReLU),
      SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
  net.SGD(training_data, 60, mini_batch_size, 0.03, validation_data, test_data, lmbda=0.1)

: Epoch 59: validation accuracy 99.12%
: Finished training network.
: Best validation accuracy of 99.13% obtained at iteration 199999
: Corresponding test accuracy of 99.19%

DONE augment training data:

CLOSED: [2025-04-14 Mon 21:43]

  • State "DONE" from "WAIT" [2025-04-14 Mon 21:43]
  • State "WAIT" from "TODO" [2025-04-14 Mon 15:11]

minor but significant! move each image 1 pixel up/down/left/right

  python expand_mnist.py
  expanded_training_data, _, _ = network3.load_data_shared("../data/mnist_expanded.pkl.gz")
  net = Network([
      ConvPoolLayer(
	  image_shape=(mini_batch_size, 1, 28, 28),
	  filter_shape=(20,1,5,5),poolsize=(2,2), activation_fn=ReLU),
      ConvPoolLayer(
	  image_shape=(mini_batch_size, 20, 12, 12),
	  filter_shape=(40,20,5,5),poolsize=(2,2), activation_fn=ReLU),
      FullyConnectedLayer(n_in=40*4*4, n_out=100, activation_fn=ReLU),
      SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
  net.SGD(expanded_training_data, 60, mini_batch_size, 0.03, validation_data, test_data, lmbda=0.1)

Epoch 59: validation accuracy 99.39%
Finished training network.
Best validation accuracy of 99.40% obtained at iteration 1449999
Corresponding test accuracy of 99.36%

expand_mnist code

"""expand_mnist.py
~~~~~~~~~~~~~~~~~~

Take the 50,000 MNIST training images, and create an expanded set of
250,000 images, by displacing each training image up, down, left and
right, by one pixel.  Save the resulting file to
../data/mnist_expanded.pkl.gz.

Note that this program is memory intensive, and may not run on small
systems.

"""

from __future__ import print_function

#### Libraries

# Standard library
import pickle
import gzip
import os.path
import random

# Third-party libraries
import numpy as np

print("Expanding the MNIST training set")

if os.path.exists("../data/mnist_expanded.pkl.gz"):
    print("The expanded training set already exists.  Exiting.")
else:
    f = gzip.open("../data/mnist.pkl.gz", 'rb')
    u = pickle._Unpickler(f)
    u.encoding = 'latin1'
    training_data, validation_data, test_data = u.load()
    f.close()
    expanded_training_pairs = []
    j = 0 # counter
    for x, y in zip(training_data[0], training_data[1]):
        expanded_training_pairs.append((x, y))
        image = np.reshape(x, (-1, 28))
        j += 1
        if j % 1000 == 0: print("Expanding image number", j)
        # iterate over data telling us the details of how to
        # do the displacement
        for d, axis, index_position, index in [
                (1,  0, "first", 0),
                (-1, 0, "first", 27),
                (1,  1, "last",  0),
                (-1, 1, "last",  27)]:
            new_img = np.roll(image, d, axis)
            if index_position == "first": 
                new_img[index, :] = np.zeros(28)
            else: 
                new_img[:, index] = np.zeros(28)
            expanded_training_pairs.append((np.reshape(new_img, 784), y))
    random.shuffle(expanded_training_pairs)
    expanded_training_data = [list(d) for d in zip(*expanded_training_pairs)]
    print("Saving expanded data. This may take a few minutes.")
    f = gzip.open("../data/mnist_expanded.pkl.gz", "w")
    pickle.dump((expanded_training_data, validation_data, test_data), f)
    f.close()

DONE dropout (regularisation)

CLOSED: [2025-04-16 Wed 21:01]

  • State "DONE" from "TODO" [2025-04-16 Wed 21:01]
     net = Network([
	 ConvPoolLayer(
	     image_shape=(mini_batch_size, 1, 28, 28),
	     filter_shape=(20,1,5,5),poolsize=(2,2), activation_fn=ReLU),
	 ConvPoolLayer(
	     image_shape=(mini_batch_size, 20, 12, 12),
	     filter_shape=(40,20,5,5),poolsize=(2,2), activation_fn=ReLU),
	 FullyConnectedLayer(n_in=40*4*4, n_out=1000, activation_fn=ReLU, p_dropout=0.5),
	 FullyConnectedLayer(n_in=1000, n_out=1000, activation_fn=ReLU, p_dropout=0.5),
	 SoftmaxLayer(n_in=1000, n_out=10, p_dropout=0.5)
     ], mini_batch_size)
     >>> net.SGD(expanded_training_data, 40, mini_batch_size, 0.03, validation_data, test_data, lmbda=0.1)

Epoch 39: validation accuracy 99.54%
Finished training network.
Best validation accuracy of 99.62% obtained at iteration 874999
Corresponding test accuracy of 99.60%

discussion

from fc to c-p

  • a big advantage of shared weights (and biases) is that it greatly reduces the number of parameters of the network.

    • in this case, using a c-p (convolutional-pooling) layer as the first layer instead of a fully-connected one gives roughly a 40x saving in parameters (a rough count follows this list)
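
a back-of-the-envelope count (my own arithmetic; the exact ratio depends on which fully-connected layer you compare against, here a 30-hidden-neuron first layer as used in earlier chapters):

  conv_params = 20 * (5 * 5 + 1)   # 20 feature maps, each sharing 5x5 weights + 1 bias = 520
  fc_params = 784 * 30 + 30        # fully-connected first layer with 30 hidden neurons = 23,550
  print(fc_params / conv_params)   # ~45, i.e. on the order of the ~40x saving above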

second conv-pool layer

  • what does it conceptually mean to add such a layer?

    • just think of the new input images as slightly more condensed versions of the original image, with lots of patterns still to discover.
    • interestingly, the input to this layer is no longer a single image: there are as many inputs as there are feature maps from the previous layer.

      • the answer is the same as if the input image were RGB: just let the convolution operation sample across all of the input channels (see the sketch below).
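
a minimal NumPy sketch of this (my own illustration, with made-up array names): one filter of the second conv layer, e.g. from filter_shape=(40, 20, 5, 5), carries a 5x5 set of weights for every one of the 20 input feature maps and sums across all of them, exactly as an RGB convolution sums across the 3 colour channels:

  import numpy as np

  rng = np.random.default_rng(1)
  inpt = rng.random((20, 12, 12))       # the 20 feature maps output by the first conv-pool layer
  w = rng.standard_normal((20, 5, 5))   # ONE filter of the second layer: 5x5 weights per input channel
  b = 0.0

  out = np.empty((8, 8))                # 12 - 5 + 1 = 8
  for j in range(8):
      for k in range(8):
          out[j, k] = b + np.sum(w * inpt[:, j:j+5, k:k+5])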

relu

  • empirically this performs better than the sigmoid.

    • \(\max(0,z)\) doesn't saturate in the limit of large \(z\), unlike sigmoid neurons

expanded mnist

  • reasonable gains to be had here. we explode the training data from 50,000 images to 250,000.

    • each copy generates another 4, one pixel up/down/left/right
  • in 2003, Simard, Steinkraus and Platt improved their MNIST performance to 99.6 percent by augmenting the training data with "elastic distortions", which mimic the random oscillations of hand muscles during writing

    • rectified linear units were not yet in common use back then.

dropout

  • our best result.
  • we applied dropout to FC layers only. convolutional layers have their own regularisation due to the shared weights.
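
a toy version of the idea implemented by dropout_layer and the set_inpt methods above (my own sketch, not the book's code): units are randomly dropped at training time, and the full layer is scaled down by (1 - p_dropout) at test time:

  import numpy as np

  rng = np.random.default_rng(0)
  p_dropout = 0.5
  activations = rng.random(8)

  # training time: keep each unit with probability 1 - p_dropout
  mask = rng.binomial(n=1, p=1 - p_dropout, size=activations.shape)
  train_out = activations * mask

  # test time: use every unit, scaled by (1 - p_dropout), as in the set_inpt methods
  test_out = activations * (1 - p_dropout)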

ensembles

  • Nielsen implemented an ensemble of networks himself and achieved 99.67 percent accuracy.

    • realise that this implies 9,967 / 10,000 images were classified correctly!
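
one simple way to combine several trained networks is a majority vote over their predicted digits; a minimal sketch with hypothetical inputs (my own illustration of the idea, not Nielsen's code):

  import numpy as np

  def ensemble_predict(per_net_predictions):
      """per_net_predictions: shape (n_nets, n_images), each entry a predicted digit 0-9."""
      votes = np.asarray(per_net_predictions)
      # for each image, pick the digit predicted by the most networks
      return np.array([np.bincount(votes[:, i], minlength=10).argmax()
                       for i in range(votes.shape[1])])

  # e.g. three networks voting on four images
  print(ensemble_predict([[7, 2, 1, 0], [7, 2, 7, 0], [3, 2, 1, 0]]))  # -> [7 2 1 0]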

conclusion

we managed to train despite the difficulties (exploding / vanishing gradients).

these difficulties did not disappear, but rather we avoided them by:

  • using convolutional layers, which greatly reduce the number of parameters that would suffer from those problems
  • using dropout, and more data to reduce overfitting
  • using ReLU instead of sigmoids
  • using GPUs
  • good weight initialisations

    • note that the appropriate initialisation differs for different activation functions (see the sketch after this list).
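
for example, network3.py draws sigmoid-layer weights with scale np.sqrt(1.0/n_out); a commonly used alternative for ReLU layers is the He-style scale np.sqrt(2.0/n_in). a small sketch of the two (my own illustration; network3.py itself only uses the first):

  import numpy as np

  n_in, n_out = 784, 100
  rng = np.random.default_rng(0)

  # sigmoid-style initialisation, as in network3.py's FullyConnectedLayer
  w_sigmoid = rng.normal(loc=0.0, scale=np.sqrt(1.0 / n_out), size=(n_in, n_out))

  # He-style initialisation, a common choice when the activation is ReLU
  w_relu = rng.normal(loc=0.0, scale=np.sqrt(2.0 / n_in), size=(n_in, n_out))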

deep belief networks are worth looking into. they can do both unsupervised and semi-supervised learning, and they are generative models. a key component of these is the restricted Boltzmann machine.

To recognise shapes, first learn to generate images.—Geoffrey Hinton

The ability to learn hierarchies of concepts, building up multiple layers of abstraction, seems to be fundamental to making sense of the world.

Conway's Law:

Any organization that designs a system… will inevitably produce a design whose structure is a copy of the organization's communication structure.

The mark of a mature field is the necessity for specialisation, cf. Hippocrates / Galen in medicine:

"the fields start out monolithic, with just a few deep ideas. early experts can master all those ideas. but as time passes that monolithic character changes. we discover many deep new ideas, too many for any one person to really master."