Multilayer Perceptron
We have seen what can be learned by the perceptron algorithm — namely, linear decision boundaries for binary classification problems.
It may also be of interest that the perceptron algorithm can be used for regression with one simple modification: not applying an activation function (i.e. the sigmoid) to the output. I refer the interested reader to open another tab.
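In symbols (generic notation of my own choosing, since none has been fixed yet), the two uses differ only in whether the activation is applied to the weighted sum:

$$\hat{y}_{\text{classification}} = \sigma\!\left(\mathbf{w}^{\top}\mathbf{x} + b\right), \qquad \hat{y}_{\text{regression}} = \mathbf{w}^{\top}\mathbf{x} + b$$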
We begin with the punchline:
XOR

Now clearly, taking a ruler, your finger, or any straight-lined object and positioning it on the above figure will not enable you to separate the blue (true) from the red (false) circles. This was also one of Marvin Minsky and Seymour Papert's arguments against further development of the Perceptron in 1969. However, with the benefit of hindsight, we shall not retire so quickly; instead we add another layer of neurons:
Whilst the above architecture does not immediately solve our problem, it puts us on the correct trajectory.
Propositional Logic
At some point we were going to have to deal with the logical interpretation of XOR.
XOR, denoted by the symbol $\oplus$, means the logical exclusive or, which translates colloquially to:
$x_1$ or $x_2$, but not both.
We can express this sentiment in logical syntax,

$$x_1 \oplus x_2 \;\equiv\; (x_1 \lor x_2) \land \lnot(x_1 \land x_2),$$

and then construct smaller diagrams that express the $\lor$, $\land$ and $\lnot$.
Piecing these together (notice we now have weights), and taking the $\land$ of these two, we reproduce the mathematical equivalence we are looking for:
The overlapping $x_1$'s and $x_2$'s yield the glorious MLP:
Manual Trace
We now test all of our points to see if they are correctly classified by the above MLP:
- **[1,1]**: lights up the $x_1$ and $x_2$ nodes.
  - the first input, $x_1$, pushes $+1$ to $h_1$ (the OR node) and $-1$ to $h_2$ (the NAND node)
  - the bias is defeated at $h_1$, and thus $h_1$ will now fire
  - $h_2$'s bias causes it to fire by default, and it keeps firing despite the $-1$ charge from $x_1$
  - next, we get the charges from $x_2$: $+1$ to $h_1$ and $-1$ to $h_2$
  - $h_1$ becomes more positive, and thus certainly fires, whilst $h_2$ has now been dragged below the $0$ threshold
  - now, for the next layer to fire, since it is an AND neuron, both $h_1$ and $h_2$ must have fired
  - thus, $y$ will not fire this time
- **[0,0]**: continuing with this line of thought, the MLP does not fire either. Only $h_2$ does, and as we have seen a moment ago this is not sufficient.
- **[1,0]**: fires! Do the trace to check.
- **[0,1]**: also fires! Yay.
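To double-check this trace numerically, here is a minimal sketch in NumPy. The specific weights and biases ($\pm 1$ connections, thresholds of $0.5$ and $1.5$) are one standard hand-crafted choice consistent with the construction above, assumed here for illustration:

```python
import numpy as np

def step(z):
    # Step activation: fire (1) when the weighted sum plus bias is positive.
    return (z > 0).astype(int)

# Hidden layer: h1 is the OR node, h2 is the NAND node (assumed weights).
W_hidden = np.array([[ 1.0,  1.0],   # weights into h1
                     [-1.0, -1.0]])  # weights into h2
b_hidden = np.array([-0.5, 1.5])     # h1 fires if x1 + x2 > 0.5; h2 fires by default

# Output layer: y is the AND of h1 and h2.
w_out = np.array([1.0, 1.0])
b_out = -1.5                         # fires only when both hidden nodes fire

for x in [(1, 1), (0, 0), (1, 0), (0, 1)]:
    h = step(W_hidden @ np.array(x) + b_hidden)
    y = step(w_out @ h + b_out)
    print(f"x={x}  hidden={h.tolist()}  y={int(y)}")
```

Running this reproduces the table in the next section: only [1,0] and [0,1] make the output fire.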
Linear Decision Boundary in Hidden State Space
From the above trace we can summarise the following results:
| $x_1$ | $x_2$ | $h_1$ (OR) | $h_2$ (NAND) |
|---|---|---|---|
| 1 | 1 | 1 | 0 |
| 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 1 |
| 0 | 1 | 1 | 1 |
And produce a linear decision boundary in the hidden state space!
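Concretely, with the AND output weights assumed earlier ($+1$, $+1$, threshold $1.5$), that boundary in the $(h_1, h_2)$ plane is

$$h_1 + h_2 = 1.5,$$

with the XOR-true hidden state $(1,1)$ on one side and the XOR-false states $(1,0)$ and $(0,1)$ on the other.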

Original State Space
Reverting to our original state space, we can see that plotting the weights corresponds to using TWO linear inequality boundaries!
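Written out under the same assumed weights, the two hidden units fire according to the inequalities

$$x_1 + x_2 > 0.5 \quad (h_1, \text{ the OR node}), \qquad x_1 + x_2 < 1.5 \quad (h_2, \text{ the NAND node}),$$

and the XOR-true region is exactly the strip where both hold.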

Truth Table (optional)
This part can be skipped, but I consider it valuable to understand the problem I am working on from all angles.
| $x_1$ | $x_2$ | $x_1 \land x_2$ | $x_1 \lor x_2$ | $\lnot(x_1 \land x_2)$ | $\lnot(x_1 \lor x_2)$ | $x_1 \oplus x_2$ |
|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 1 | 0 | 0 | 1 | 1 | 0 | 1 |
| 0 | 1 | 0 | 1 | 1 | 0 | 1 |
A benefit of this analysis is noticing that we have another equivalence for the XOR: $x_1 \oplus x_2 \equiv \lnot\big((x_1 \land x_2) \lor \lnot(x_1 \lor x_2)\big)$.
A Single Neuron
Also, for reference, here is a single neuron:
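As a minimal sketch (with names of my own choosing), a single neuron is nothing more than a weighted sum, a bias, and an activation:

```python
import numpy as np

def neuron(x, w, b, activation=lambda z: int(z > 0)):
    # A single neuron: weighted sum of the inputs, plus a bias, through an activation.
    return activation(np.dot(w, x) + b)

# Example: the OR node from earlier (assumed weights +1, +1 and bias -0.5).
print(neuron(np.array([1, 0]), np.array([1.0, 1.0]), -0.5))  # -> 1
```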
The Problem with XOR
Now, as beautiful and rewarding as this manual derivation is, it is not always possible to know how many neurons you will need to be able to linearly separate your data in a different state space.
It is also worth acknowledging that we introduced a degree of non-linearity by using the step-function activation at the hidden nodes; this function is not differentiable, and is now nearing the end of its shelf life. As such, we must refactor the wheel:
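A standard replacement, and the one implied by the sigmoid mentioned earlier and the binary cross-entropy loss used below, is the sigmoid, which is smooth and has a conveniently cheap derivative:

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \sigma'(z) = \sigma(z)\big(1 - \sigma(z)\big)$$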


Real-world problems – such as MNIST, tackled under the last heading of this page – have hundreds of inputs with thousands of weights across tens of layers.
Backpropagation is the bridge between simple Perceptrons and Deep Learning with Multi-layered Perceptrons. We will now solve 3-XOR by Backpropagation.
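At its core (written in generic notation of my own, since the article does not fix symbols for this part), backpropagation applies the chain rule to push the loss gradient back through each layer, and gradient descent then nudges every weight downhill:

$$\frac{\partial L}{\partial w^{(l)}_{ij}} = \frac{\partial L}{\partial a^{(l)}_{j}} \cdot \sigma'\!\big(z^{(l)}_{j}\big) \cdot a^{(l-1)}_{i}, \qquad w^{(l)}_{ij} \leftarrow w^{(l)}_{ij} - \eta\,\frac{\partial L}{\partial w^{(l)}_{ij}},$$

where $z^{(l)}_{j}$ is the pre-activation of unit $j$ in layer $l$, $a^{(l)}_{j} = \sigma(z^{(l)}_{j})$, and $\eta$ is the learning rate.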
3-XOR / Advanced XOR

I call this problem 3-XOR. It makes sense to extend our XOR of two inputs ([0,1]) to that of three inputs ([0,1,2]).
In general, we could extend the problem to any integer N, with the number of dots growing accordingly.
Code
Here we implement an MLP in PyTorch, train it using binary cross-entropy loss, and visualise the hidden-layer activations and outputs. We will also make use of the ability to set the weights manually in a moment, but for now we will let the network use random initialisations of scale 0.15.
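The full script is not reproduced here, so below is a minimal sketch of the setup described: an MLP with one hidden layer, sigmoid activations, binary cross-entropy loss, and access to the hidden activations. The class and variable names, the `n_hidden=5` default, the learning rate, and the plain two-input XOR data used for illustration are all my assumptions, not the original code; the 3-XOR grid would be substituted in for `X` and `y`.

```python
import torch
import torch.nn as nn

class XorMLP(nn.Module):
    def __init__(self, n_inputs=2, n_hidden=5):
        super().__init__()
        self.hidden = nn.Linear(n_inputs, n_hidden)
        self.output = nn.Linear(n_hidden, 1)

    def forward(self, x):
        # Sigmoid at the hidden layer (the differentiable stand-in for the step function).
        h = torch.sigmoid(self.hidden(x))
        # Sigmoid at the output so BCE loss receives a probability.
        return torch.sigmoid(self.output(h)), h

# Illustrative data: the ordinary two-input XOR (swap in the 3-XOR points here).
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

model = XorMLP()
# Small random initialisation, mirroring the scale mentioned in the text.
with torch.no_grad():
    for p in model.parameters():
        p.uniform_(-0.15, 0.15)

criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(20_000):
    optimizer.zero_grad()
    pred, hidden = model(X)
    loss = criterion(pred, y)
    loss.backward()
    optimizer.step()

print("predictions:", pred.detach().round().squeeze().tolist())
print("hidden activations:\n", hidden.detach())
```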
Results
5 nodes
Creating a neural network that learns the weights with 5 hidden nodes was possible. We can observe the output and see where on our MLP architecture these weights sit:
The above figure was generated by ChatGPT.
Hidden Unit Dynamics
We can also visualise what each of these hidden nodes was responsible for contributing to the overall segmentation of blue dots from red:






4 nodes
Trying to achieve the same effect with 4 nodes is a different story. Running the code above for 200,000 epochs multiple times still does not allow it to converge to 100% accuracy, and thus the task is never learned. We can, however, set the initial weights manually after studying the problem on paper to produce:





With manually set weights
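For reference, this is how weights can be set by hand in PyTorch, sketched against the hypothetical `XorMLP` class from the earlier snippet; the numbers below are placeholders for illustration, not the hand-derived values behind the figures above:

```python
import torch

model = XorMLP(n_inputs=2, n_hidden=4)  # hypothetical class from the sketch above

# Overwrite the random parameters with hand-chosen values (placeholders only).
with torch.no_grad():
    model.hidden.weight.copy_(torch.tensor([[ 1.0,  1.0],
                                            [-1.0, -1.0],
                                            [ 1.0, -1.0],
                                            [-1.0,  1.0]]))
    model.hidden.bias.copy_(torch.tensor([-0.5, 1.5, -0.5, -0.5]))
    model.output.weight.copy_(torch.tensor([[1.0, 1.0, 1.0, 1.0]]))
    model.output.bias.copy_(torch.tensor([-1.5]))
```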
Conclusion
In conclusion, we can see that MLPs are beautiful, and the logical next step in a world where perceptron learning works. Furthermore, we notice the fragility of the model to its initial weights, and the way in which it is sometimes simply unable to reach the correct global optimum and instead sits in a local one. Finally, Machine Learning continues to be as much art as science, in that we must sprinkle the right amounts of non-linearity in the right places to get our puppet to jiggle and dance.
I leave you with a small code snippet from geeksforgeeks, who use the tensorflow library to leverage MLPs and a modern pipeline to classify the MNIST dataset.
Code
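The referenced snippet itself is not reproduced here; in its place is a minimal sketch of a comparable `tf.keras` pipeline for MNIST (the layer sizes, optimizer and number of epochs are my own choices, not necessarily those of the original snippet):

```python
import tensorflow as tf

# Load MNIST and scale pixel values to [0, 1].
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# A plain MLP: flatten the 28x28 image, one hidden layer, softmax over the 10 digits.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=5, validation_split=0.1)
print(model.evaluate(x_test, y_test))
```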
Results
