Saturday, 18 April 2015

Backpropagation Video Tutorials

The following youtube videos are a great tutorial on the backpropagation algorithm. They're about 13 minutes each.

Enjoy! Amd credit to Ryan Harris.

Wednesday, 15 April 2015

Early Code Seems to Work

I've been working on turning the theory of the last few posts into working code. The hardest part has been trying to create code that works with matrices as a whole - and not code which inefficiently works element by element.

Today I was a happy milestone! The code is far from finished but I did my first cut of the back propagation updating the weights.

To check this works, my code prints out lots of info on internal variables, like the weights or the outputs of the hidden layer, or the final output. I used the output of the overall network error - which is the difference between the training target and the actual output of the network to see if it got smaller for each iteration of the back propagation (called epoch).

I was delighted to see it did:

The code is still early, and the above example used a very simplistic training set (all 1s as input and a 1 as output for a 3-2-1 node nework) but the fact that the overall error falls with training epoch is a very good sign.

Onwards and upwards.

Next steps ... I'm worried my code doesn't handle the division in the weight update equation well if the denominator is small or zero:

Friday, 10 April 2015

Matrices - Doing Lot of Work in One Go

One of the things that programmers and algorithm designers try to do is to take advantage of that fact that computers are quite good at doing lots of the same thing very quickly.

What do we mean by this?

Imagine I had a list of bank balances and I wanted to multiply them by 1.1 to add 10% interest. I could do it by taking each one individually and adding the 10%. Something like:

For each balance in bank:
newbalance = b * 1.10
pass

However - many computers, and programming languages running on those computers, allow the same job to be done on chunks of many data items. This is vastly more efficient (and quicker) than looping over each data item. This would look something like this:

newbalance = balance * 1.10

The underlining shows the thing is a matrix. Thislast expression allows the computer to operate on all of the balance data items at the same time, paralellising the calculation. Its as if the computer gives the work to a team of workers, instead of getting one worker to work through the list sequentially. In reality, there is a limit to how much this parallisation happens, but in most cases it is vastly more efficient.

This is variously called parallelisation, array programming, vectorisation. Wikipedia has a good entry on it http://en.wikipedia.org/wiki/Array_programming

Now... it is worth exploring to see if this is applicable to our neural network calculations.

In fact calculating the outputs of each neural network layer easily translates into this vectorised approach.

Consider the calculation of outputs from the output layer using the outputs from the hiden layer and of course taking into account the weights.

output_1 = activation_function (  hiddenoutput_1 * weight_11 + hiddenoutput_2 * weight_21 + hiddenoutput_3 * weight_31 )

output_2 = activation_function (  hiddenoutput_1 * weight_12 + hiddenoutput_2 * weight_22 + hiddenoutput_3 * weight_32 )

...

Can you see the pattern? Look at the drawing connecting the hidden layer to the output layer through the weights. Let's user shorter words to see it cleaer:

o_1 = af (  ho_1 * w_11 + ho_2 * w_21 + ho_3 * w_31 )

o_2 = af (  ho_1 * w_12 + ho_2 * w_22 + ho_3 * w_32 )

...

These patterns where you can see essentially the same expressions but with just one array index changing are usually emanable to vectorisation. It doesn't take long to see that:

So when we're implementing this in Python we can use the dot product function to multiply the weights array and the hidden outputs array. The same logic applies to arraive at the hidden outputs using the inputs and their associated weights.

Neat!

That's the forward flow vectorised.... but I'm still struggling to parallelise the backpropagation that we talked about in the last post.

Friday, 3 April 2015

Backpropagation 3/3

In the last two posts on backpropagation we saw:
• that the error of a neutral network is used to refine the internal weights
• that simple calculus is used to determine the change in weight for a single neuron

We now need to cover two more themes:
• activation functions that have nice properties to support this calculus
• seeing how this change in weights is applied to internal layers of a neural network, and to many nodes each with many connections

Sigmoid Activation Function
We know the activation function needs to meet certain criteria. These are
• it must be non-linear, otherwise the entire neural network collapses into a single linear node
• it should be broadly similar to biology - that is be monotonically increasing and reflect "activation" or "firing" once a threshold of inputs is reached
But it should also be easy to calculate with - and as we saw in the last post, it should be easily differentiable with respect to the weights. If this is hard, or the computation expensive, then it will make programming a neural network difficult or impossible.

For this last reason, the sigmoid function is often used. Really, the main reason for it is easy calculus and it has the right shape (ramping from a lower level to a higher level), not much more!

What is this easy calculus? Well if the sigmoid function is

f(x) = 1/(1+exp(-x))

then the derivative

Δf/Δx is amazingly simple f(x)(1-f(x))

Nice and neat! You could work this out using all the algebra but that's too boring for a blog - you see the few steps of algebra here.

But we actually want the relationship between the output f and the weights aka Δf/Δw. That's ok - and easy - because the weights are independent of each other. The "x" in f(x) is really the weighted sum of several inputs w1*x1 + w2*x2 + w3*x3 + ... and so on. The nice thing about calculus is that the derivative of f with respect to one of these, say w2, is independent of anything else including the other wn.

If we have Δf/Δx but we want Δf/Δw what do we do? Well, we use the "chain rule" of calculus which helps us do this translation:

Δf/Δw is the same as Δf/Δx * Δx/Δw

This looks like a normal fraction relationship but isn't really - but if it helps you remember it, thats ok.

CORRECTION: This post has been edited to remove the previous content of this section. You can find a fuller explanation of the maths, still written in an accessible manner, here in the early draft of Make Your Own Neural Network:

http://makeyourownneuralnetwork.blogspot.co.uk/2016/02/early-draft-feedback-wanted.html

What error do you use for internal nodes?
I forgot to explain a key point. The error of the entire neural network is easy to understand - its the difference between the output of the network and the desired target output. We use this to refine the final layer of nodes using the above formula.

But what about internal nodes? It doesn't make sense for the internal nodes to to be refined using the same formula but with the error of the entire network. Intuitively they all have smaller errors that contribute - the final error is too big. What do we do?

We could simply divide the big final error equally into the number of nodes. Not too bad an idea but that wouldn't reflect the contribution of some weights taking the final output in different directions - negative not positive weights. If a weight was trying to invert its input, it doesn't make sense to naively distribute the final error without taking this into account. This is a clue to a better approach.

That better approach is to divide the final error into pieces according to the weights. This blog talks about it http://home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html but the idea is easily illustrated:

The error is divided between nodes C and E in the same proportion as WCF and WEF. The same logic applies to dividing this error at A and B. The error used to calculate updated WAC and WBC is the error at WCF divided in the proportion WAC to WBC .

What does this mean? It means that if, say WCF was much larger than WEF it should carry much more of the final error. Which is nicely intuitive too!

Again - it's amazing how few texts and guides explain this, leaving it a mystery.