• the algorithm was introduced in the 1970s, but its importance wasn't fully appreciated until the famous 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams.
  • "workhorse of learning in neural networks"
  • at the heart of it is an expression that tells us how quickly the cost function changes when we change the weights and biases.
/projects/ml/dl/neural-nets/ch2/activations.svg
[figure: activation diagram of a single neuron in matrix notation]

notation

  • $w^l_{jk}$ denotes the weight for the connection from the $k$-th neuron in the $(l-1)$-th layer to the $j$-th neuron in the $l$-th layer

    • similarly, $b^l_j$ is the bias of the $j$-th neuron in the $l$-th layer
    • and $a^l_j$ is the activation of the $j$-th neuron in the $l$-th layer

obviously the activations $a^l_j$ are computed with:

$$a^l_j = \sigma\left( \sum_k w^l_{jk} a^{l-1}_k + b^l_j \right)$$

which says that the activation of any given neuron in layer $l$ is the sigmoid of the weighted sum of all activations in the previous layer, plus the bias.

we can then vectorise this:

$$a^l = \sigma(w^l a^{l-1} + b^l)$$

it appears that the 'juicy bit' inside the sigma is worth naming: $z^l = w^l a^{l-1} + b^l$, the weighted input to layer $l$.
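the vectorised forward pass is short enough to sketch in numpy (the 2-3-1 layer sizes and random weights here are made up purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(a, weights, biases):
    # apply a^l = sigma(w^l a^{l-1} + b^l) layer by layer
    for w, b in zip(weights, biases):
        a = sigmoid(w @ a + b)
    return a

# hypothetical 2-3-1 network; weights[l] has shape (neurons in l, neurons in l-1)
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases = [rng.standard_normal((3, 1)), rng.standard_normal((1, 1))]

x = np.array([[0.5], [-0.2]])  # one input as a column vector
print(feedforward(x, weights, biases).shape)  # (1, 1)
```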

cost function

as before, we have:

$$C = \frac{1}{2n} \sum_x \| y(x) - a^L(x) \|^2$$

we make two assumptions about this cost function.

  1. the cost function can be written as an average $C = \frac{1}{n} \sum_x C_x$ over cost functions $C_x$ for individual training examples, $x$.
  2. the cost function can be written as a function of the outputs $a^L$ from the neural network.
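assumption 1 can be sketched directly: the quadratic cost is just the mean of the per-example costs $C_x = \frac{1}{2}\|y - a^L\|^2$ (the numbers below are toy values, not real network outputs):

```python
import numpy as np

def quadratic_cost(outputs, targets):
    # C = (1/2n) * sum over examples x of ||y(x) - a^L(x)||^2,
    # i.e. the average of the per-example costs C_x = (1/2)||y - a^L||^2
    n = len(outputs)
    return sum(0.5 * np.sum((y - a) ** 2)
               for a, y in zip(outputs, targets)) / n

# two toy output/target pairs: each has C_x = 0.5, so the average is 0.5
outputs = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
targets = [np.array([1.0, 0.0]), np.array([1.0, 0.0])]
print(quadratic_cost(outputs, targets))  # 0.5
```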

hadamard product

it's just the element-wise multiplication of vectors.

implementing backprop with this operator tends to be quite fast, since an element-wise product is cheaper than a full matrix multiplication.
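in numpy the Hadamard product $s \odot t$ is just the `*` operator on arrays of the same shape:

```python
import numpy as np

s = np.array([1, 2, 3])
t = np.array([4, 5, 6])
# element-wise product, NOT matrix multiplication (that would be s @ t)
print(s * t)  # [ 4 10 18]
```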

four fundamental equations

$$\delta^L = \nabla_a C \odot \sigma'(z^L) \tag{BP1}$$

$$\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \tag{BP2}$$

$$\frac{\partial C}{\partial b^l_j} = \delta^l_j \tag{BP3}$$

$$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \tag{BP4}$$

where:

  • $\delta^l_j = \frac{\partial C}{\partial z^l_j}$ is the error of neuron $j$ in layer $l$.
  • L is the last layer
  • and then $\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j)$, or in vectorised form $\delta^L = \nabla_a C \odot \sigma'(z^L) = (a^L - y) \odot \sigma'(z^L)$, the last step holding for the quadratic cost

note that saturation means that the neuron has low or high activations (close to 0 or 1), where the sigmoid is flat, so $\sigma'(z) \approx 0$ and its output barely changes despite varying inputs. i.e. it learns slowly or not at all.
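a quick numerical check of why saturation stalls learning: the sigmoid's derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ peaks at $0.25$ at $z = 0$ and collapses towards $0$ for large $|z|$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # derivative of the sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1 - s)

for z in [-10.0, 0.0, 10.0]:
    # largest gradient at z = 0; near-zero gradient once the neuron saturates
    print(z, sigmoid_prime(z))
```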

pseudocode

backprop

  1. Input x: Set the corresponding activation $a^1$ for the input layer
  2. Feedforward: For each $l = 2, 3, \ldots, L$ compute $z^l = w^l a^{l-1} + b^l$ and $a^l = \sigma(z^l)$.
  3. Output error $\delta^L$: Compute the vector $\delta^L = \nabla_a C \odot \sigma'(z^L)$
  4. Backpropagate the error: For each $l = L-1, L-2, \ldots, 2$ compute $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$
  5. Output: The gradient of the cost function is given by $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$ and $\frac{\partial C}{\partial b^l_j} = \delta^l_j$.
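the five steps above can be sketched in numpy; this is a minimal single-example version for the quadratic cost, with weights/biases stored as lists of matrices as in the earlier notation (not the book's exact code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def backprop(x, y, weights, biases):
    """Return (nabla_w, nabla_b): the gradient of the quadratic cost C_x
    for one training example (x, y). weights/biases are lists, one entry
    per layer after the input; x and y are column vectors."""
    # steps 1-2: feedforward, storing all weighted inputs z and activations
    activation = x
    activations = [x]
    zs = []
    for w, b in zip(weights, biases):
        z = w @ activation + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    # step 3: output error, delta^L = (a^L - y) * sigma'(z^L)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    nabla_b = [np.zeros(b.shape) for b in biases]
    nabla_w = [np.zeros(w.shape) for w in weights]
    nabla_b[-1] = delta
    nabla_w[-1] = delta @ activations[-2].T
    # step 4: backpropagate, delta^l = ((w^{l+1})^T delta^{l+1}) * sigma'(z^l)
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        nabla_b[-l] = delta
        nabla_w[-l] = delta @ activations[-l - 1].T
    # step 5: nabla_w, nabla_b hold dC/dw^l_jk and dC/db^l_j
    return nabla_w, nabla_b
```

a useful sanity check is to compare one entry of `nabla_w` against a finite-difference estimate of the cost's slope; they should agree closely.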

sgd

given a minibatch, we step in the direction of steepest descent with the stochastic gradient descent algorithm:

  1. Input minibatch
  2. For each training example x: Set the corresponding input activation $a^{x,1}$, and perform the following steps:

    1. Feedforward: For each $l = 2, 3, \ldots, L$ compute $z^{x,l} = w^l a^{x,l-1} + b^l$ and $a^{x,l} = \sigma(z^{x,l})$
    2. Output Error $\delta^{x,L}$: Compute the vector $\delta^{x,L} = \nabla_a C_x \odot \sigma'(z^{x,L})$.
    3. Backpropagate the Error: For each $l = L-1, L-2, \ldots, 2$ compute $\delta^{x,l} = ((w^{l+1})^T \delta^{x,l+1}) \odot \sigma'(z^{x,l})$.
  3. Gradient descent: For each $l = L, L-1, \ldots, 2$ update the weights according to the rule $w^l \rightarrow w^l - \frac{\eta}{m} \sum_x \delta^{x,l} (a^{x,l-1})^T$, and the biases according to the rule $b^l \rightarrow b^l - \frac{\eta}{m} \sum_x \delta^{x,l}$.
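the gradient-descent step can be sketched as follows, assuming a hypothetical `backprop(x, y, weights, biases)` helper that returns the per-example gradients $(\delta^{x,l} (a^{x,l-1})^T, \delta^{x,l})$ as lists:

```python
import numpy as np

def update_mini_batch(mini_batch, weights, biases, eta, backprop):
    """One SGD step: average the per-example gradients over the
    mini-batch of (x, y) pairs, then step downhill with learning rate eta.
    `backprop` is an assumed helper returning (nabla_w, nabla_b)."""
    m = len(mini_batch)
    sum_w = [np.zeros(w.shape) for w in weights]
    sum_b = [np.zeros(b.shape) for b in biases]
    for x, y in mini_batch:
        nw, nb = backprop(x, y, weights, biases)
        sum_w = [sw + d for sw, d in zip(sum_w, nw)]
        sum_b = [sb + d for sb, d in zip(sum_b, nb)]
    # w^l <- w^l - (eta/m) * sum_x delta^{x,l} (a^{x,l-1})^T, same shape for b^l
    new_w = [w - (eta / m) * sw for w, sw in zip(weights, sum_w)]
    new_b = [b - (eta / m) * sb for b, sb in zip(biases, sum_b)]
    return new_w, new_b
```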

speed of backprop

  • the total cost of backpropagation is roughly the same as making just two forward passes through the network.
  • speedup was first appreciated in 1986, and greatly expanded the range of problems that neural networks could solve.
  • this speed is what makes it practical to use backprop to train deep neural nets.