Friday 3 April 2015

Backpropagation 3/3

In the last two posts on backpropagation we saw:
  • that the error of a neural network is used to refine the internal weights
  • that simple calculus is used to determine the change in weight for a single neuron

We now need to cover two more themes:
  • activation functions that have nice properties to support this calculus
  • how this change in weights is applied to the internal layers of a neural network, and to many nodes, each with many connections

Sigmoid Activation Function
We know the activation function needs to meet certain criteria. These are:
  • it must be non-linear, otherwise the entire neural network collapses into a single linear node
  • it should be broadly similar to biology - that is, monotonically increasing and reflecting "activation" or "firing" once a threshold of inputs is reached
But it should also be easy to calculate with - and, as we saw in the last post, it should be easily differentiable with respect to the weights. If this is hard, or the computation is expensive, then it will make programming a neural network difficult or impossible.

For this last reason, the sigmoid function is often used. Really, the main reasons for choosing it are the easy calculus and the right shape (ramping from a lower level to a higher level) - not much more!


What is this easy calculus? Well, if the sigmoid function is

f(x) = 1/(1+exp(-x))

then the derivative is amazingly simple:

Δf/Δx = f(x) (1 - f(x))

Nice and neat! You could work this out using all the algebra, but that's too boring for a blog - you can see the few steps of algebra here.
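
If you'd rather convince yourself numerically than do the algebra, here is a small sketch (Python, with an arbitrarily chosen value of x) comparing the neat formula against a finite-difference estimate of the slope:

# A quick numerical check of Δf/Δx = f(x)(1 - f(x)). The value of x is arbitrary.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = 0.75
neat = sigmoid(x) * (1.0 - sigmoid(x))            # the neat formula
h = 1e-6
approx = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)   # finite-difference slope

print(neat)     # roughly 0.2179
print(approx)   # agrees to several decimal places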

But what we actually want is the relationship between the output f and the weights, aka Δf/Δw. That's ok - and easy - because the weights are independent of each other. The "x" in f(x) is really the weighted sum of several inputs, w1*x1 + w2*x2 + w3*x3 + ... and so on. The nice thing about calculus is that the derivative of f with respect to one of these weights, say w2, is independent of everything else, including the other wn.

If we have Δf/Δx but we want Δf/Δw, what do we do? Well, we use the "chain rule" of calculus, which helps us do this translation:

Δf/Δw is the same as Δf/Δx * Δx/Δw

This looks like a normal fraction relationship but isn't really - still, if it helps you remember it, that's ok.
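
Putting the two pieces together: since x = w1*x1 + w2*x2 + ..., the slope Δx/Δw2 is just the input x2, so Δf/Δw2 = f(x) (1 - f(x)) * x2. Here is a small sketch (Python, with made-up input and weight values) of that calculation for a single neuron:

# Sketch: gradient of a sigmoid neuron's output with respect to one weight (w2).
# The input and weight values are made up for illustration.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

inputs  = [0.9, 0.1, 0.8]    # x1, x2, x3
weights = [0.3, -0.2, 0.5]   # w1, w2, w3

x = sum(w * i for w, i in zip(weights, inputs))   # weighted sum feeding the neuron
f = sigmoid(x)                                    # neuron output

# chain rule: Δf/Δw2 = Δf/Δx * Δx/Δw2 = f(x)(1 - f(x)) * x2
df_dw2 = f * (1.0 - f) * inputs[1]
print(df_dw2)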




CORRECTION: This post has been edited to remove the previous content of this section. You can find a fuller explanation of the maths, still written in an accessible manner, here in the early draft of Make Your Own Neural Network:

http://makeyourownneuralnetwork.blogspot.co.uk/2016/02/early-draft-feedback-wanted.html



What error do you use for internal nodes?
I forgot to explain a key point. The error of the entire neural network is easy to understand - it's the difference between the output of the network and the desired target output. We use this to refine the final layer of nodes using the above formula.

But what about internal nodes? It doesn't make sense for the internal nodes to be refined using the same formula but with the error of the entire network. Intuitively, they each contribute only a smaller error - the full final error is too big. What do we do?

We could simply divide the big final error equally among the nodes. Not too bad an idea, but that wouldn't reflect the fact that some weights push the final output in different directions - negative rather than positive weights. If a weight was trying to invert its input, it doesn't make sense to naively distribute the final error without taking this into account. This is a clue to a better approach.

That better approach is to divide the final error into pieces according to the weights. This blog talks about it - http://home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html - but the idea is easily illustrated:

The error is divided between nodes C and E in the same proportion as the weights WCF and WEF. The same logic applies when dividing the error again at A and B. The error used to calculate the updated WAC and WBC is the error at C (the share that came back along WCF), divided in the proportion WAC to WBC.

What does this mean? It means that if, say, WCF was much larger than WEF, it should carry much more of the final error. Which is nicely intuitive too!
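
As a rough sketch of how that proportional splitting works (Python; the node names come from the illustration above, but the weight and error values are invented, and the complication of negative weights is ignored here):

# Splitting the error at output node F back to nodes C and E in proportion
# to the weights WCF and WEF. All values are invented for illustration.
w_CF, w_EF = 2.0, 1.0
error_F = 0.6                               # error at the output node F

error_C = error_F * w_CF / (w_CF + w_EF)    # 0.4 - the larger weight carries more error
error_E = error_F * w_EF / (w_CF + w_EF)    # 0.2

# the same rule applied one layer further back, splitting the error at C
w_AC, w_BC = 3.0, 1.0
error_A = error_C * w_AC / (w_AC + w_BC)    # 0.3
error_B = error_C * w_BC / (w_AC + w_BC)    # 0.1

print(error_C, error_E, error_A, error_B)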

Again - it's amazing how few texts and guides explain this, leaving it a mystery.