chapter 2: how the backpropagation algorithm works
- the algorithm was introduced in the 1970s, but its importance wasn’t fully appreciated until the famous 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams.
- “workhorse of learning in neural networks”
- at the heart of it is an expression that tells us how quickly the cost function changes when we change the weights and biases.
activation diagram of a single neuron in matrix notation
notation
- \(w_{jk}^l\)
𐃏
means the weight of the j\(^{th}\) neuron in layer \(l\) to the k\(^{th}\) neuron in the previous layer
- similarly for \(b_j^l\) we say it is the j\(^{th}\) bias in the l\(^{th}\) layer
- and \(a_j^l\) we say is the j\(^{th}\) activation in the l\(^{th}\) layer
obviously the activations of \(a_j^l\) are computed with:
\begin{equation} \label{eq:ac_sum} a_j^l = \sigma(\sum_k w_{jk}^l a_k^{l-1} + b_j^l) \end{equation}
and says that the activation of any given neuron in layer l = the sigmoid of the weights multiplied by activations for all inputs of the previous layer + bias.
we can then vectorise this:
\begin{equation} \label{eq:ac_vec} a^l = \sigma(w^la^{l-1} + b^l) \end{equation}
it appears that the ‘juicy bit’ of the sigma above is worth naming: \(w^la^{l-1} + b^l = z^l\).
cost function
as before, we have:
\begin{equation} \label{eq:cost} C = \cfrac{1}{2n}\sum_x \|y(x) -a^L(x) \|^2 \end{equation}
we make two assumptions about this cost function.
- the cost function can be written as an average \(C = \frac{1}{n}\sum_x C_x\) over cost functions \(C_x\) for individual training examples, \(x\).
- the cost function can be written as a function of the outputs from the neural network.
hadamard product
it’s just the element-wise multiplication of vectors.
implementing back-prop with this operator tends to be quite fast. (faster than matrix multiplication at times).
four fundamental equations.
\[\require{bbox}\require{color}\definecolor{myred}{RGB}{128,0,0} \fcolorbox{red}{lightblue}{$\begin{align} \color{myred}{\boldsymbol{\delta^L}} &\color{myred}{\boldsymbol{= \nabla_a C \odot \sigma’(z^L)}} \label{eq:bp1}\\ \color{myred}{\boldsymbol{\delta^l}} &\color{myred}{\boldsymbol{= ((w^{l+1})^T \delta^{l+1}) \odot \sigma’(z^l)}} \label{eq:bp2}\\ \color{myred}{\boldsymbol{\cfrac{\partial C}{\partial b_j^l}}} &\color{myred}{\boldsymbol{= \delta_j^l}} \label{eq:bp3}\\ \color{myred}{\boldsymbol{\cfrac{\partial C}{\partial w_{jk}^l}}} &\color{myred}{\boldsymbol{= a_k^{l-1} \delta_j^l}} \label{eq:bp4} \end{align}$}\]
where:
- \(\delta_j^l = \cfrac{\partial C}{\partial z_j^l}\) is the error of neuron j in layer l.
- L is the last layer
- and then \(\delta_j^L = \cfrac{\partial C}{\partial a_j^L}\sigma’(z^L_j) = \nabla_a C \odot \sigma’(z^L) = (a^L -y) \odot \sigma’(z^L)\)
note that saturation means that the neuron has low or high activations (close to 0 or 1) and thus its output is not changing despite varying inputs. i.e. it is not learning.
pseudocode
backprop
- Input x: Set the corresponding activation \(a^1\) for the input layer
- Feedforward: For each \(l=2,3,\ldots,L\) compute \(z^l = w^la^{l-1}+b^l\) and \(a^l = \sigma(z_l)\).
- Output Error \(\delta^l\): Compute the vector \(\delta^L = \nabla_a C \odot \sigma’(z^L)\)
- Backpropagate the error: For each \(l=L-1,L-2,\ldots,2\) compute \(\delta^l = ((w^{l^1})^T \delta^{l^1}) \odot \sigma’(z^l)\)
- Output: The gradient of the cost function is given by \(\cfrac{\partial C}{\partial w_{jk}^l = a_k^{l-1}\delta^l_j}\) and \(\cfrac{\partial C}{\partial b_j^l} = \delta_j^l\).
sgd
given a minibatch, we step in the direction of steepest descent with the stochastic gradient descent algorithm:
- Input minibatch
- For each training example x: Set the corresponding input activation \(a^{x,1}, and perform the following steps: a. *Feedforward*: For each \(l = 2, 3, …, L\) compute \(z^{x,l} = w^l a^{x, l-1} + b^l\) and \(a^{x,l} = \sigma(z^{x,l})\) b. Output Error \(\delta^{x,L}\): Compute the vector \(\delta^{x,L} = \nabla_a C_x \odot \sigma’(z^{x,L})\). c. Backpropagate the Error: For each \(l = L-1, L-2, …, 2\) compute \(\delta^{x,L} = ((w^{l+1})^T \delta^{x,l+1}) \odot \sigma’(z^{x,L})\).
- Gradient descent: For each \(l = L, L-1, …, 2\) update the weights according to the rule \(w^l \rightarrow w^l - \frac{\eta}{m}\sum_x \delta^{x,l}(a^{x,l-1})^T\), and the biases according to the rule \(b^l \rightarrow b^l - \frac{\eta}{m}\sum_x \delta^{x,l}\).
speed of backprop
- the total cost of backpropagation is roughly the same as making just two forward passes through the network.
- speedup was first appreciated in 1986, and greatly expanded the range of problems that neural networks could solve.
- as such it is possible to use backprop to train deep neural nets.
Backlinks (2)
“We all die. The goal isn’t to live forever, the goal is to create something that will.” — Chuck Palahniuk
Originally the AI suffix stood for archived intellect, however these days it has concretised to becoming an Augmenting Infrastructure — a place from which to branch out in many directions.
Within this site you will find self-contained material in the form of project posts and blog posts, but also external links 1 to other work – my own as well as not.
2. Feedforward Deep Neural Networks /wiki/ml/dl/feedforward/
This page includes 𐃏 my Chapter notes for the book by Michael Nielsen.