chapter 2: how the backpropagation algorithm works
- the algorithm was introduced in the 1970s, but its importance wasn't fully appreciated until the famous 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams.
- "workhorse of learning in neural networks"
- at the heart of it is an expression that tells us how quickly the cost function changes when we change the weights and biases.
notation
- $w^l_{jk}$ is the weight from the $k^{\text{th}}$ neuron in the $(l-1)^{\text{th}}$ layer to the $j^{\text{th}}$ neuron in the $l^{\text{th}}$ layer.
- similarly, $b^l_j$ is the $j^{\text{th}}$ bias in the $l^{\text{th}}$ layer, and $a^l_j$ is the $j^{\text{th}}$ activation in the $l^{\text{th}}$ layer.
- obviously the activations in layer $l$ depend on those in layer $l-1$: the activation of any given neuron in layer $l$ is the sigmoid of the weights multiplied by the activations of the previous layer, summed over all inputs, plus the bias:
  $$a^l_j = \sigma\Big(\sum_k w^l_{jk}\, a^{l-1}_k + b^l_j\Big)$$
- we can then vectorise this: $a^l = \sigma(w^l a^{l-1} + b^l)$
- the 'juicy bit' inside the sigma is worth naming: the weighted input $z^l \equiv w^l a^{l-1} + b^l$, so that $a^l = \sigma(z^l)$.
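the vectorised forward pass for one layer can be sketched in numpy; the layer sizes and values here are illustrative assumptions, not anything fixed by the notes:

```python
import numpy as np

def sigmoid(z):
    # element-wise logistic sigmoid
    return 1.0 / (1.0 + np.exp(-z))

# illustrative sizes: 3 inputs feeding a layer of 2 neurons
rng = np.random.default_rng(0)
w = rng.standard_normal((2, 3))       # w[j, k]: weight from neuron k to neuron j
b = rng.standard_normal((2, 1))       # b[j]: bias of neuron j
a_prev = rng.standard_normal((3, 1))  # activations of the previous layer

z = w @ a_prev + b   # the 'juicy bit': weighted input z^l
a = sigmoid(z)       # a^l = sigma(z^l), shape (2, 1)
```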
cost function
as before, we have the quadratic cost $C = \frac{1}{2n} \sum_x \| y(x) - a^L(x) \|^2$.
we make two assumptions about this cost function:
- the cost function can be written as an average $C = \frac{1}{n} \sum_x C_x$ over cost functions $C_x$ for individual training examples $x$.
- the cost function can be written as a function of the outputs from the neural network.
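the averaging assumption can be checked numerically; a minimal sketch with made-up outputs and targets (the `quadratic_cost` helper is a hypothetical name):

```python
import numpy as np

def quadratic_cost(outputs, targets):
    # C = (1/n) * sum_x C_x, with C_x = 0.5 * ||y(x) - a^L(x)||^2
    per_example = [0.5 * np.sum((y - a) ** 2) for a, y in zip(outputs, targets)]
    return np.mean(per_example)

# two toy training examples (illustrative values)
outputs = [np.array([0.8, 0.2]), np.array([0.1, 0.9])]
targets = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]

# the total cost really is the average of the per-example costs
c = quadratic_cost(outputs, targets)
```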
hadamard product
it's just the element-wise multiplication of two vectors: $(s \odot t)_j = s_j t_j$.
element-wise multiplication is cheap (linear in the vector length, unlike matrix multiplication), and good matrix libraries provide fast implementations of it, so expressing backprop with this operator keeps the implementation quick.
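in numpy (an assumed choice of library; the notes name none), the hadamard product is just `*` on arrays:

```python
import numpy as np

s = np.array([1, 2, 3])
t = np.array([4, 5, 6])

# hadamard product: element-wise, (s ⊙ t)_j = s_j * t_j
print(s * t)  # [ 4 10 18]
```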
four fundamental equations.
where:
- $\delta^l_j \equiv \frac{\partial C}{\partial z^l_j}$ is the error of neuron $j$ in layer $l$ - $L$ is the last layer
- and then the error in the output layer is
  $$\delta^L_j = \frac{\partial C}{\partial a^L_j}\, \sigma'(z^L_j), \qquad \text{vectorised: } \delta^L = \nabla_a C \odot \sigma'(z^L) = (a^L - y) \odot \sigma'(z^L)$$
  (the last equality holding for the quadratic cost).
- the error in terms of the error in the next layer: $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$
- the rate of change of the cost w.r.t. any bias: $\frac{\partial C}{\partial b^l_j} = \delta^l_j$
- the rate of change of the cost w.r.t. any weight: $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$
note that saturation means the neuron's activation is close to 0 or 1, where $\sigma'(z) \approx 0$: its output barely changes despite varying inputs, so the $\sigma'$ factor in the error kills the gradient and the neuron learns slowly or not at all.
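saturation is visible directly in $\sigma'(z) = \sigma(z)(1 - \sigma(z))$: the derivative peaks at $z = 0$ and vanishes for large $|z|$. a quick numerical check:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z)), maximal at z = 0
    s = sigmoid(z)
    return s * (1 - s)

print(sigmoid_prime(0.0))   # 0.25, the maximum
print(sigmoid_prime(10.0))  # ~4.5e-05: a saturated neuron gets almost no gradient
```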
pseudocode
backprop
- Input $x$: set the corresponding activation $a^1$ for the input layer.
- Feedforward: for each $l = 2, 3, \ldots, L$ compute $z^l = w^l a^{l-1} + b^l$ and $a^l = \sigma(z^l)$.
- Output error $\delta^L$: compute the vector $\delta^L = \nabla_a C \odot \sigma'(z^L)$.
- Backpropagate the error: for each $l = L-1, L-2, \ldots, 2$ compute $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$.
- Output: the gradient of the cost function is given by $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$ and $\frac{\partial C}{\partial b^l_j} = \delta^l_j$.
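the steps above can be sketched as a single-example `backprop` routine in numpy; the network shape, the random data, and the choice of quadratic cost are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def backprop(x, y, weights, biases):
    """Return (nabla_w, nabla_b): the gradient of the quadratic cost for one example."""
    # feedforward, storing every weighted input z^l and activation a^l
    activation, activations, zs = x, [x], []
    for w, b in zip(weights, biases):
        z = w @ activation + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    nabla_w = [np.zeros_like(w) for w in weights]
    nabla_b = [np.zeros_like(b) for b in biases]
    # output error for the quadratic cost: delta^L = (a^L - y) ⊙ sigma'(z^L)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    nabla_w[-1] = delta @ activations[-2].T
    nabla_b[-1] = delta
    # backpropagate: delta^l = ((w^{l+1})^T delta^{l+1}) ⊙ sigma'(z^l)
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        nabla_w[-l] = delta @ activations[-l - 1].T
        nabla_b[-l] = delta
    return nabla_w, nabla_b

# illustrative 3-4-2 network, random weights, one training example
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
weights = [rng.standard_normal((j, k)) for k, j in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((j, 1)) for j in sizes[1:]]
x, y = rng.standard_normal((3, 1)), np.array([[1.0], [0.0]])
nabla_w, nabla_b = backprop(x, y, weights, biases)
```

a useful sanity check on such a sketch is to compare one partial derivative against a finite-difference estimate of the cost.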
sgd
given a minibatch, we step in the direction of steepest descent with the stochastic gradient descent algorithm:
- Input: a set of $m$ training examples (the minibatch).
- For each training example $x$: set the corresponding input activation $a^{x,1}$, and perform the following steps:
  - Feedforward: for each $l = 2, 3, \ldots, L$ compute $z^{x,l} = w^l a^{x,l-1} + b^l$ and $a^{x,l} = \sigma(z^{x,l})$.
  - Output error $\delta^{x,L}$: compute the vector $\delta^{x,L} = \nabla_a C_x \odot \sigma'(z^{x,L})$.
  - Backpropagate the error: for each $l = L-1, L-2, \ldots, 2$ compute $\delta^{x,l} = ((w^{l+1})^T \delta^{x,l+1}) \odot \sigma'(z^{x,l})$.
- Gradient descent: for each $l = L, L-1, \ldots, 2$ update the weights according to the rule $w^l \to w^l - \frac{\eta}{m} \sum_x \delta^{x,l} (a^{x,l-1})^T$ and the biases according to the rule $b^l \to b^l - \frac{\eta}{m} \sum_x \delta^{x,l}$.
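the minibatch update can be sketched as follows, assuming some `backprop(x, y, weights, biases)` helper that returns the per-example gradients; `update_mini_batch` is a hypothetical name, not anything the notes define:

```python
import numpy as np

def update_mini_batch(mini_batch, weights, biases, eta, backprop):
    """One SGD step: average per-example gradients over the batch and step against them."""
    m = len(mini_batch)
    sum_nw = [np.zeros_like(w) for w in weights]
    sum_nb = [np.zeros_like(b) for b in biases]
    # accumulate the per-example gradients over the m examples
    for x, y in mini_batch:
        nw, nb = backprop(x, y, weights, biases)
        sum_nw = [s + g for s, g in zip(sum_nw, nw)]
        sum_nb = [s + g for s, g in zip(sum_nb, nb)]
    # w^l -> w^l - (eta/m) * sum_x (gradient for example x)
    new_w = [w - (eta / m) * s for w, s in zip(weights, sum_nw)]
    # b^l -> b^l - (eta/m) * sum_x (gradient for example x)
    new_b = [b - (eta / m) * s for b, s in zip(biases, sum_nb)]
    return new_w, new_b
```

passing `backprop` in as a parameter keeps the update rule separate from how the per-example gradients are computed.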
speed of backprop
- the total cost of backpropagation is roughly the same as making just two forward passes through the network.
- speedup was first appreciated in 1986, and greatly expanded the range of problems that neural networks could solve.
- this speedup is what makes it practical to use backprop to train deep neural nets.
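the comparison can be made concrete: estimating the gradient by finite differences needs roughly one extra forward pass per parameter, whereas backprop delivers every partial derivative for about the cost of two forward passes. a toy count (the network shape is an illustrative assumption):

```python
# count forward passes needed per full gradient evaluation
sizes = [784, 30, 10]  # illustrative network shape
n_params = sum(j * k + j for k, j in zip(sizes[:-1], sizes[1:]))

# finite differences: ~one forward pass per parameter (plus a baseline pass)
fd_passes = n_params + 1
# backprop: roughly two forward-pass-equivalents, regardless of parameter count
bp_passes = 2
speedup = fd_passes / bp_passes  # thousands of times fewer passes
```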