It has only one hidden layer. The activation function is simple. The process is straightforward. We've not done anything particularly fancy - though there are techniques and optimisations to boost performance.
And yet we've managed to get really good performance.
Yann LeCun is a world-leading researcher and pioneer of neural networks - and his page lists some benchmarks for performance against the MNIST data set.
Let's go through the performance improvements that we can do without spoiling the simplicity that we've managed to retain on our journey:
94.7%
Our basic code had a performance of almost 95%, which is good for a network that we wrote at home, as beginners. It compares with the benchmarks on LeCun's website. A score of 60% or 30% would have been not so good but understandable as a first effort. Even a score of 85% would have been a solid starting point.
95.3%
We tweaked the learning rate and the performance broke through the 95% barrier. Keep in mind that there will be diminishing returns as we keep pushing for more performance. This is because there will be inherent limits - the data itself might have gaps or a bias, and so not be fully educational. The architecture of the network itself will also impose a limit - architecture means the number of layers, the number of nodes in each layer, the activation function, the design of the labelling system, etc.
The following shows the results of an experiment tweaking the learning rate. The shape of the performance curve is as expected - extremes away from a sweet spot give poorer performance.
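To get a feel for why both extremes hurt, here is a minimal sketch on a toy problem, not our MNIST network. It runs plain gradient descent on a single value `w` to minimise `(w - target)**2` for a fixed number of steps. The function name `final_error` and all the numbers in it are made up for illustration.

```python
def final_error(learning_rate, steps=20, start=0.0, target=3.0):
    """Run gradient descent on (w - target)**2 and return the remaining distance."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * (w - target)   # slope of the squared error at w
        w -= learning_rate * gradient   # step downhill, scaled by the learning rate
    return abs(w - target)


# Try a range of learning rates with the same small budget of steps.
for rate in (0.01, 0.1, 0.3, 0.6, 1.1):
    print(f"learning rate {rate}: distance from target {final_error(rate):.6f}")
```

A very small rate barely moves in the steps available, while a rate that is too large overshoots further on every step and ends up worse than where it started. A real network is much messier than this, but the same trade-off is at work.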
96.3%
Another easy improvement is running through the training data multiple times. Each full pass through the data is called an epoch. Doing it twice, instead of once, boosted the performance to 95.8%. Getting closer to that elusive 96%!
The following shows the results of trying different numbers of epochs.
- neural network learning is at heart a random process, and results will be different every time, and sometimes will go wrong altogether
- these experiments are not very scientific - to make them so, we'd have to run them many, many times to reduce the impact of the randomness
The peak now is 96.3% with 7 epochs.
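Here is a minimal sketch of what a more careful version of this experiment could look like. It is a toy, not our MNIST code: a single weight learns the rule `y = 2x` from ten examples, starting from a random value. The names `train_once` and `summarise`, the toy data and the seeds are all assumptions made for illustration. The point is the shape of the experiment: an outer loop over epochs, and several repeated runs summarised by a mean and a spread.

```python
import random
import statistics


def train_once(epochs, seed, learning_rate=0.1):
    """Train one weight on y = 2x for a number of epochs; return the final error."""
    rng = random.Random(seed)                # each seed gives a different random run
    data = [(k / 10, 2 * k / 10) for k in range(1, 11)]
    w = rng.uniform(-1.0, 1.0)               # random starting weight
    for _ in range(epochs):                  # one epoch = one full pass over the data
        rng.shuffle(data)
        for x, y in data:
            w += learning_rate * (y - w * x) * x   # nudge w to reduce the error
    return abs(w - 2.0)


def summarise(epochs, seeds):
    """Repeat the run once per seed and report the mean error and its spread."""
    errors = [train_once(epochs, seed) for seed in seeds]
    return statistics.mean(errors), statistics.stdev(errors)


for epochs in (1, 2, 5, 10):
    mean_error, spread = summarise(epochs, seeds=range(10))
    print(f"{epochs} epochs: mean error {mean_error:.4f}, spread {spread:.4f}")
```

This toy is too simple to show performance getting worse with too many epochs, so it won't reproduce a sweet spot. What it does show is how averaging over several runs stops a single lucky or unlucky run from misleading us.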
96.9%
Increasing the epochs means doing more training steps. Perhaps we can shorten the learning steps so that we are more cautious and reduce the chance of overshooting, now that we make up for smaller steps with more steps overall?
The following shows the previous experiment's results overlaid by the results with a smaller learning rate of 0.1.
There is a really good peak at 5 epochs with performance now boosted to 96.9%. That's very very good. And all without any really fancy tricks. We're getting closer to breaching the 97% mark!
97.6%
One parameter we haven't yet tweaked is the number of hidden nodes. This is an important layer because it is the one at the heart of any learning the network does. The input nodes simply bring in the question. The output nodes pretty much pop out the answer. It is the hidden nodes - or more strictly, the link weights either side of those nodes - that contain any knowledge the network has gained through learning.
Too few and there just isn't enough learning capacity. Too many and you dilute the learning and increase the time it takes to get to an effective learned state. Have a look at the following experiment varying the number of hidden nodes.
This is really interesting. Why? Because even with 5 hidden nodes, a tiny number if you think about it, the performance is still amazing at 70%. That really is remarkable. The ability to do a complex task like recognise human handwritten numbers has been encoded by 5 nodes well enough to perform with 70% accuracy. Not bad at all.
As the number of nodes increases, the performance does too, but the returns are diminishing. The previous 100 nodes give us 96.7% accuracy. Increasing the hidden layer to 200 nodes boosts the performance to 97.5%. We've broken the 97% barrier!
Actually, 500 hidden nodes gives us 97.6% accuracy, a small improvement, but at the cost of a much larger number of calculations and time taken to do them. So for me, 200 is the sweet spot.
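To put a rough number on that cost, we can count the link weights for different hidden layer sizes. This is a back-of-envelope sketch which assumes 784 input nodes (one per pixel of a 28 by 28 MNIST image), 10 output nodes (one per digit), a single hidden layer, and no bias terms. The function name `count_weights` is made up for illustration.

```python
def count_weights(hidden_nodes, input_nodes=784, output_nodes=10):
    """Count link weights in a three-layer network (assumes no bias terms)."""
    input_to_hidden = input_nodes * hidden_nodes
    hidden_to_output = hidden_nodes * output_nodes
    return input_to_hidden + hidden_to_output


for hidden in (5, 100, 200, 500):
    print(f"{hidden} hidden nodes: {count_weights(hidden)} weights")
```

Under those assumptions, going from 200 to 500 hidden nodes takes us from 158,800 to 397,000 weights. That is two and a half times as many numbers to update for every training example.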
Can we do more?
97.9% (!!!!)
We can do more, and the next idea is only mildly more sophisticated than our simple refinements above.
Take each training image, and create two new versions of it by rotating the original clockwise and anti-clockwise by a specific angle. This gives us new training examples, but ones which might add additional knowledge because they represent the possibility of someone writing those numbers at a different angle. The following illustrates this.
We've not cheated by taking new training data. We've used the existing training data to create additional versions.
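The rotation idea can be sketched in plain Python. This is a minimal illustration, not necessarily how the experiments here were coded. It assumes each image is a list of rows of pixel values, and it uses a crude nearest-pixel rotation about the image centre. The names `rotate_image` and `augment`, and the `fill` value used for empty corners, are made up for illustration. A library routine such as `scipy.ndimage.rotate` would give smoother results because it interpolates between pixels.

```python
import math


def rotate_image(pixels, angle_degrees, fill=0.0):
    """Rotate a 2D grid of pixels about its centre using nearest-pixel lookup.

    Positive angles turn the image clockwise on screen (rows increase downwards).
    """
    rows, cols = len(pixels), len(pixels[0])
    centre_row, centre_col = (rows - 1) / 2, (cols - 1) / 2
    theta = math.radians(angle_degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    rotated = [[fill] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Work backwards: which source pixel lands on this output pixel?
            dy, dx = r - centre_row, c - centre_col
            src_col = round(cos_t * dx + sin_t * dy + centre_col)
            src_row = round(-sin_t * dx + cos_t * dy + centre_row)
            if 0 <= src_row < rows and 0 <= src_col < cols:
                rotated[r][c] = pixels[src_row][src_col]
    return rotated


def augment(pixels, angle_degrees=10.0):
    """Return the original image plus clockwise and anti-clockwise copies."""
    return [
        pixels,
        rotate_image(pixels, angle_degrees),
        rotate_image(pixels, -angle_degrees),
    ]
```

Each training image then becomes three training examples which all share the same label, so the training set triples in size without any new data being collected.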
Let's see what performance we get if we run experiments with different angles. We've added in the results for 10 epochs just to see what happens too.
The peak for 5 epochs at +/- 10 degrees rotation is 97.5%.
Increase the epochs to 10 and we boost the performance to a record breaking 97.9% !!
This 2% error is comparable with some of the best benchmarks. And all from simple ideas and simple code.
Isn't computer science cool ?!