Wednesday, 6 July 2016

Error Backpropagation Revisited

A great question from Alex J has prompted a deeper look at how we take the error at the output layer of a neural network and propagate it back into the network.


Reminder: Error Informs Weight Updates

Here's a reminder why we care about the error:
  • In a neural network, it is the link weights that do the learning. They are adjusted again and again in an attempt to better match the training data.
  • This refinement of the weights is informed by the error associated with a node in the network. A small error means we don't need to change the weights much. 
  • The error at the output layer is easy - it's simply the difference between the desired target and the actual output of the network (see the short sketch after this list).
  • However the error associated with internal hidden layer nodes is not obvious.
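
To make that output error concrete, here is a minimal numpy sketch (not code from the book), using assumed names targets and final_outputs for the desired and actual outputs:

```python
import numpy

# assumed illustrative values: desired targets and the network's actual outputs
targets = numpy.array([0.01, 0.99, 0.01])
final_outputs = numpy.array([0.20, 0.80, 0.10])

# the error at the output layer is simply (target - actual)
output_errors = targets - final_outputs
print(output_errors)   # [-0.19  0.19 -0.09]
```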


What's The Error Inside The Network?

There isn't a mathematically perfect answer to this question.

So we use approaches that make sense intuitively, even if there isn't a mathematically pure and precise derivation for them. These kinds of approaches are called heuristics.

These "rule of thumb" heuristics are fine ... as long as they actually help the network learn!

What we're trying to achieve is this - use the error at the output layer to work out, somehow, the error inside the network.



Previously, and in the book, we considered three ideas. An extra one is added here:
  • split the error equally amongst the connected nodes, recombine at the hidden node
  • split the error in proportion to the link weights, recombine at the hidden node
  • simply multiply the error by the link weights, recombine at the hidden node
  • the same as above but attempt to normalise by dividing by the number of hidden nodes

Let's look at these in turn, before we try them to see what performance each approach gives us.


1. Split Error Equally Amongst Links

We split the error at each output node, dividing it equally amongst the incoming links connected to it. We then recombine these pieces at each hidden layer node to arrive at an internal error.


Mathematically, and in matrix form, this looks like the following. $N$ is the number of links from hidden layer nodes into an output node - that is, the number of hidden layer nodes.

$$
e_{hidden} =
\begin{pmatrix}
1/N & 1/N & \cdots \\
1/N & 1/N & \cdots  \\
\vdots & \vdots & \ddots \\
\end{pmatrix}
\cdot e_{output}
$$

Remember that a matrix form is really useful because Python's numpy can do the calculations efficiently (quickly) and we can write very concise code.
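
As a minimal sketch (not code from the book), assuming a tiny network with 2 hidden nodes and 2 output nodes, the equal-split matrix is simply filled with $1/N$:

```python
import numpy

# a small assumed example: 2 hidden nodes, 2 output nodes
hidden_nodes = 2
output_nodes = 2

# assumed illustrative output errors, one per output node
output_errors = numpy.array([[0.8], [0.5]])

# every entry is 1/N, where N is the number of hidden nodes feeding each output node
split_matrix = numpy.full((hidden_nodes, output_nodes), 1.0 / hidden_nodes)

# recombine the equal shares at each hidden node
hidden_errors = numpy.dot(split_matrix, output_errors)
print(hidden_errors)   # every hidden node gets (0.8 + 0.5) / 2 = 0.65
```

Notice that every hidden node ends up with exactly the same error, no matter how much it actually contributed - a hint of why this heuristic might struggle.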


2. Split Error In Proportion To Link Weights

We split the error, not equally, but in proportion to the link weights. The reason for this is that those links with larger weights contributed more to the error at the output layer. That makes intuitive sense - small weights contribute smaller signals to the final output layer, and should be blamed less for the overall error. These proportional bits are recombined again at the hidden layer nodes.


Again, in matrix form, this looks like the following.

$$
e_{hidden} =
\begin{pmatrix}
\frac{w_{11}}{w_{11} + w_{21} + \cdots} & \frac{w_{12}}{w_{12} + w_{22} + \cdots} & \cdots \\
\frac{w_{21}}{w_{11} + w_{21} + \cdots} & \frac{w_{22}}{w_{12} + w_{22} + \cdots} & \cdots \\
\vdots & \vdots & \ddots \\
\end{pmatrix}
\cdot e_{output}
$$

The problem is ... we can't write this as a simple combination of the matrices we already have, like the weight matrix and the output error matrix - we'd need an extra element-wise normalisation step each time, losing some of the benefit of numpy's accelerated matrix calculations. Even so, let's try it to see how well it performs.
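
For completeness, here is one way it could be coded - a sketch, not code from the book, with an assumed weight matrix w shaped (output nodes × hidden nodes) to match the $w^{T} \cdot e_{output}$ convention used below. Note the extra normalisation step on top of the plain matrix multiply:

```python
import numpy

# assumed illustrative weights, shape (output nodes x hidden nodes)
w = numpy.array([[3.0, 1.0],
                 [2.0, 6.0]])
output_errors = numpy.array([[0.8], [0.5]])

# total weight arriving at each output node (assumes non-zero totals;
# real weights can be negative, which makes this normalisation less well behaved)
totals = w.sum(axis=1, keepdims=True)

# share each output node's error in proportion to its incoming weights,
# then recombine the pieces at the hidden nodes
proportions = w / totals
hidden_errors = numpy.dot(proportions.T, output_errors)
print(hidden_errors)   # [[0.725], [0.575]]
```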


3. Error Simply Multiplied By Link Weights

We don't split the error, but simply multiply the error by the link weights. This is much simpler than the previous idea but retains the key intuition that larger weights contribute more to the network's error at the output layer.

You can see from the previous expression that the output errors are multiplied by the weights, and there is also a kind of normalising division. Here we don't have that normalisation.

In matrix form this looks like the following - it is very simple!

$$
e_{hidden} = w^{T} \cdot e_{output}
$$

Let's try it - and if it works, we have a much simpler heuristic, and one that can be accelerated by numpy's ability to do matrix multiplications efficiently.
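
In numpy this is a single matrix multiplication; a minimal sketch with the same assumed names and shapes as before:

```python
import numpy

# same assumed shapes as before: w is (output nodes x hidden nodes)
w = numpy.array([[3.0, 1.0],
                 [2.0, 6.0]])
output_errors = numpy.array([[0.8], [0.5]])

# heuristic 3: no splitting, no normalisation - just multiply by the link weights
hidden_errors = numpy.dot(w.T, output_errors)
print(hidden_errors)   # [[3.4], [3.8]]
```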


4. Same as Above But "Normalised"

This additional heuristic is the same as the previous very simple one - but with an attempt to apply some kind of normalisation. We want to see if the lack of a normalisation in the simple heuristic has a negative effect on performance. 

The expression is still simple - it's just the previous expression divided by the number of hidden nodes $N$.

$$
e_{hidden} = \frac{w^{T}}{N} \cdot e_{output}
$$

You can imagine this goes some way to allaying fears that the previous approach magnifies the error unduly. This fear goes away if you realise the weights can be $<1$ and so can have a shrinking effect, not just a growing effect.
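
As a sketch, the only change from the previous snippet is a division by the number of hidden nodes (same assumed names as before):

```python
import numpy

# same assumed set-up as heuristic 3
w = numpy.array([[3.0, 1.0],
                 [2.0, 6.0]])
output_errors = numpy.array([[0.8], [0.5]])
hidden_nodes = 2

# heuristic 4: multiply by the link weights, then divide by the number of hidden nodes
hidden_errors = numpy.dot(w.T, output_errors) / hidden_nodes
print(hidden_errors)   # [[1.7], [1.9]]
```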


Results!

The above heuristics were coded and compared using the MNIST challenge. We keep the number of hidden nodes at 100, and the learning rate at 0.1. We do vary the number of learning epochs over 1, 2 and 5.

The following shows the results.


We can make some interesting conclusions from these results.


Conclusions

  • Naively splitting the error equally amongst links doesn't work. At all! The performance of 0.1, or 10% accuracy, is what you'd get by randomly choosing an answer from the 10 possible answers (the digits 0-9).
  • There is no real difference between the sophisticated error splitting and the much simpler multiplication by the link weights. This is important - it means we can safely use the much simpler method and benefit from accelerated matrix multiplication.
  • Trying to normalise the simple method actually reduces performance ... by slowing down learning. You can see it recover as you increase the number of learning epochs.

All this explains why we, and others, choose the simpler heuristic. It's simple, it works really well, and it can benefit from technology that accelerates matrix multiplication ... software like Python's numpy, and hardware like GPUs accessed through OpenCL and CUDA.



I'll update the book so readers can benefit from a better explanation of the choice of heuristic. All ebooks can be updated for free by asking Amazon Kindle support.