## Wednesday, 6 July 2016

### Error Backpropagation Revisted

A great question from Alex J has prompted a deeper look at how we take the error at the output layer of a neural network and propagate it back into the network.

### Reminder: Error Informs Weight Updates

Here's a reminder why we care about the error:
• In a neural network, it is the link weights that do the learning. They are adjusted again and again in an attempt to better match the training data.
• This refinement of the weights is informed by the error associated with a node in the network. A small error means we don't need to change the weights much.
• The error at the output layer is easy - it's simply the difference between the desired target and actual output of the network.
• However the error associated with internal hidden layer nodes is not obvious.

### What's The Error Inside The Network?

There isn't a mathematically perfect answer to this question.

So we use approaches that make sense intuitively, even if there isn't a mathematically pure and precise derivation for them. These kinds of approaches are called heuristics.

These "rule of thumb" heuristics are fine ... as long as they actually help the network learn!

The following illustrates what we're trying to achieve - use the error at the output layer to work out, somehow, the error inside the network.

Previously, and in the book, we considered three ideas. An extra one is added here:
• split the error equally amongst the connected nodes, recombine at the hidden node
• split the error in proportion to the link weights, recombine at the hidden node
• simply multiply the error by the link weights, recombine at the hidden node
• the same as above but attempt to normalise by dividing by the number of hidden nodes

Let's look at these in turn, before we try them to see what performance each approach gives us.

### 1. Split Error Equally Amongst Links

We split the error at each output node, dividing it equally amongst the number of connected incoming links. We then recombine these pieces at each hidden layer node to arrive at an internal error.

Mathematically, and in matrix form, this looks like the following. $N$ is the number of links from hidden layer nodes into an output node - that is, the number of hidden layer nodes.

$$e_{hidden} = \begin{pmatrix} 1/N & 1/N & \cdots \\ 1/N & 1/N & \cdots \\ \vdots & \vdots & \ddots \\ \end{pmatrix} \cdot e_{output}$$

Remember that a matrix form is really useful because Python's numpy can do the calculations efficiently (quickly) and we can write very concise code.

### 2. Split Error In Proportion To Link Weights

We split the error, not equally, but in proportion to the link weights. The reason for this is that those links with larger weights contributed more to the error at the output layer. That makes intuitive sense - small weights contribute smaller signals to the final output layer, and should be blamed less for the overall error. These proportional bits are recombined again at the hidden layer nodes.

Again, in matrix form, this looks like the following.

$$e_{hidden} = \begin{pmatrix} \frac{w_{11}}{w_{11} + w_{21} + \cdots} & \frac{w_{12}}{w_{12} + w_{22} + \cdots} & \cdots \\ \frac{w_{21}}{w_{11} + w_{21} + \cdots} & \frac{w_{22}}{w_{12} + w_{22} + \cdots} & \cdots \\ \vdots & \vdots & \ddots \\ \end{pmatrix} \cdot e_{output}$$

The problem is ... we can't easily write this as a simple combination of matrices we already have, like the weight matrix and the output error matrix. To code this, we'd lose the benefits of numpy being able to accelerate the calculations. Even so, let's try it to see how well it performs.

### 3. Error Simply Multiplied By Link Weights

We don't split the error, but simply multiply the error by the link weights. This is much simpler than the previous idea but retains the key intuition that larger weights contribute more to the networks error at the output layer.

You can see from the expression above that the output errors are multiple day the weights, and there is also a kind of normalisation division. Here we don't have that normalisation.

In matrix form this looks like the following - it is very simple!

$$e_{hidden} = w^{T} \cdot e_{output}$$

Let's try it - and if it works, we have a much simpler heuristic, and one that can be accelerated by numpy's ability to do matrix multiplications efficiently.

### 4. Same as Above But "Normalised"

This additional heuristic is the same as the previous very simple one - but with an attempt to apply some kind of normalisation. We want to see if the lack of a normalisation in the simple heuristic has a negative effect on performance.

The expression is still simple, the above expression divided by the number of hidden nodes $N$.

$$e_{hidden} = \frac{w^{T}}{N} \cdot e_{output}$$

You can imagine this goes some way to allaying fears that the previous approach magnifies the error unduly. This fear goes away if you realise the weights can be $<1$ and so can have a shrinking effect, not just a growing effect.

### Results!

The above heuristics were coded and compared using the MNIST challenge. We keep the number of hidden nodes at 100, and the learning rate at 0.1 We do vary he number of learning epochs over 1, 2 and 5.

The following shows the results.

We can make some interesting conclusions from these results.

### Conclusions

• Naively splitting the error equally among links doesn't work. At all! The performance of 0.1 or 10% accuracy is what you'd get randomly choosing an answer from a possible 10 answers (the digits 0-9).
• There is no real difference between the sophisticated error splitting and the much simpler multiplication by the link weights. This is important - it means we can safely use the much simpler method and benefit from accelerated matrix multiplication.
• Trying to normalise the simple method actually reduces performance ... by slowing down the learning rate. You can see it recover as you increase the number of learning epochs.

All this explains why we, and others, choose the simpler heuristic. It's simple, it works really well, and it can benefit from technology that accelerates matrix multiplication ... software like Python's numpy, and hardware like GPUs through openCL and CUDA.

I'll update the book so readers can benefit from a better explanation of the choice of heuristic. All ebooks can be updated for free by asking Amazon Kindle support.

1. It would be easy to just multiply wT by the matrix:
|x 0|
|0 y|,

where x = normalizing factor for left column and y = normalizing factor for the right column. That doesn't seem to add too much to a calcalulation

2. In the page 81 of your book (may be near that page somewhere), you have written - if we throw out that kind of normalization factor, the matrix multiplication is easy to spot. And you have given the same matrix before transpose and after transpose. And the before matrix looks correct itself.- can you verify please.

1. I'd love some insight into this as well! The matrix, while being labeled as transposed, seems to be the same. Is this intentional?

3. Good question!

The expression you see at the bottom of that page is

error_hidden = W^T . e

Why is that a W^T and not just simply W?

Well .. earlier in the book .... and throughout the book we have W as

| w_1,1 w_2,1 w_3,1 ... |
| w_1,2 w_2,2 w_3,2 ... |
| w_1,3 w_2,3 w_3,3 ... |

but the expression we arrive at for the error at a hidden layer involves

| w_1,1 w_1,2 w_1,3 ... |
| w_2,1 w_2,2 w_2,3 ... |
| w_3,1 w_3,2 w_3,3 ... |

which is W^T not W.

Earlier when we did the calculation for feed-forward signals, the expression was W not W^T ... O = W.I

The key is to stick to a convention for which way round you make the matrices for weights. I got lost several times when I wasn't careful and consistent about which way around my matrix was. Some textbooks or guides will have it the other way around so be aware when reading different texts and blogs.

Let me know if this doesn't help and we'll have another go!

1. .. and please do get back to me if this still doesn't make sense .. I really want to get to the bottom of any issues or errors!

4. Hi!

One thing I don't fully understand is why we use gradient descent to adjust our weights. Why don't we use the proportionate backpropagated error directly to change the associated weight?

Regards/Per

1. hi - the reason for using gradient descent is that the relationship between the inputs to a node (incl the weights) and the output of that node is not simple (linear) .. it is by design non-linear.

if the relationship was simple and linear - you could use the error directly .. just like we do earlier in the book with the linear predictors .. where we use a trivial shuffling of the algebra to use the error directly to update the weight.

also - the gradient descent means we "smooth" any errors in the example data and also avoid over-fitting ...

2. if I've misunderstood let me know ... how did you want to use the error?

I'm really bad at math so I'm sorry if I've misunderstood the explanation of the gradient descent but
what I guess it all boils down to lowering or increasing the weights right?

When I look at the final expression for updating the weight (-error * sigmoid(sumofinput)*(1-sigmoid(sumofinput)) * ot) it looks like "error" is the only variable that drives the direction to change the weight in (+/-) (since sigmoid(x) will always be in the range 0-1) and "ot" is an output that's also in the range 0-1.

As I said math is really not my strong suite and I've been scratching my head for days trying to figure out why gradient descent is needed and why we can't use the backpropagated error to change the weights.

I guess the overfitting and smoothing might not be applicable if done this way?

4. hi - I'm glad you persisted with this .. it is a good question!

The main reason for gradient descent is to be able to find out where a minimum for a complicated function is. That is - if it was easy, like y=x^2 .. we would't necessarily need gradient descent. However, the expression relating a network's outputs to the internal weights is complicated ... have a look at slide 60 of tutorial slides at https://goo.gl/JKsb62 .. that's a horrible expression .. which we can't easily analyse in a pure algebraic way .. so we need approximate methods like gradient descent.

The expression you refer to ... also shown on slide 65 ... is the expression for the gradient.

So back to your question. Why can we use an expression for gradient descent which looks manageable .. but not a direct solution?

Because .. luckily ... the expression for the gradient of the error was much easier to handle than the full expression for error.

Why can't we simply substitute the local node errors and get an expression for the correct weight? Because following a gradient only takes you in the direction of the desired minimum .. it doesn't always take your to the exact spot.

On this last point - the book uses examples in the appendix showing gradients for simple curves like y=x^2 ... the gradient at a point o that curve will point you int he direction of the minimum but doesn't point to the exact spot.

The second min reason is as you say .. to reduce over-fitting and smoothing ...

Does this help? If not do get back to me again!

5. Hi again and thanks for replying. I really appreciate your help! :)

I understand that directly trying to find a solution is extremely hard because of the complexity of the net. What I'm struggling with is some of the components of the final expression for updating the weight. You might've already answered this and I just haven't gotten it through my thick skull :P

If we split the below expression into it's components:
(-error * sigmoid(i)*(1-sigmoid(i)) * ot)

1: -error <- Can be -/+

2: sigmoid(i) <- will always be 0-1

3: (1-sigmoid(i)) <- will always be 0-1

4: ot <- will always be 0-1

So "-error" is the only component that drives whether or not to increase or decrease the weight. The rest of the components will "only" weight "-error".

Wouldn't it be enough to use -error * learning rate to update the weights?

Regards/Per

6. Just to be clear. With -error I mean the proportionate error of each weight like those discussed in this blog post.

7. aha - yes the weights are changed in proportion to the error.

so yes - you could just try to have (-error * learning_rate) ..

so what's missing from this? what's missing is the other factors which scale that adjustent ..

i haven't done extensive experiments but I suspect that the outcome could be anywhere between (1) slower convergence, an (2) failure to converge.

i suspect that factor O(1-O)Ot is important in providing a scale that's *relative* to other nodes which may have the same error but need different weight adjustments

.... i like your question .. it uncovers things most guides gloss over...

are you convinced? if you do experiments i'd love to share them here and via twitter

8. Hi again!

Indeed. The weight scaling might be important to the speed of the convergence. I'll have to run some tests and see what kind of impact only using -error as a heuristic would cause.

Thanks so much for your help. I'll be sure to post my results.
And thanks again for your great book. It's been invaluable for me to really understand backpropagation.

Regards/Per

9. I can inform you that it didn't work :)
I was running my NN on a XOR training set and what it did was it converged all outputs towards 0.5.
One guess is that it moves the resulting outputs towards the mean value of the expected outputs.

Regards/Per

10. thanks for sharing your experiment results!

I too had many questions early on about backdrop .. it was sooo badly explained in many texts .. and some of my own tests showed a simpler idea didn't work ..

interestingly I spent ages trying to work out why people used a squared error (t-o)^2 and not just the (t-o) .. there's a blog post on this which talks about some really revealing insight for me anyway!

11. I couldn't agree more. Alot of texts just follow previous examples and never go into the detail about why things are done a certain way.

Yes! Exactly!! The (t-o)^2 thing hade me very confused until I read your book. It was another one of those "Why didn't they just tell me that in the first place"-moments :P

5. Going through the book, page 93, the diagram doesn't mention 'b' which is used in the next page to drop the SUM part.

Also, cannot understand how the derivative of dE/dOk = -2(tk - ok). Why the -ve sign? We will be negating this output to the old weight, as described in page 98.

1. Don't know why it was posted as "Unknown". Posted this for my notification.

2. Hi Kevin

The negative sign comes from the chain rule of differentiation. Here's a simpler example:

if y = (3-2x)^2

then dy/dx = 2 * (3-2x) * (-2) = -4(3-2x)

The appendix on calculus at the end of the book explains the chain rule. Does that help?

The first question - not sure which part you mean. Can you email me a picture to makeyourownneuralnetwork at gmail dot com and I'll get back to you.

6. Hi,

Hope you can shine a light for me - i'm a little confused about something. I have been using the book, which is great btw, but i'm having trouble with the weight update method - specifically for the hidden layer weight.

The book intructs us to calculate the error for a hidden node by multiplying each weight that connects it to a node in the forward layer (output node fex) by the error of that forward node, and then sum these together. Most other sources I have found also multiply each product with the derived activation of the output of the forwards layer. Often this quantity is called 'Gamma' or 'Delta'. So, a hidden weight update would take the form....

HWeight += eta *(hidden output)*(1 - hidden ouptut)*previous layer output*[the sum of all(weights between hidden and output*(target-actual)*finaloutput*(1-finaloutput))]

The part that is missing from the book is the 'finaloutput*(1-finaloutput)' part.

What am I missing here? (likely alot)

Any help very much appreciated.

1. Hi Tonje - great question.

When I was researching the various ways neural networks are implemented I was confused by the different explanations, and indeed the terminology eg "gamma" "delta" "derived activation" etc

on page 99 you'll see the update expression for the weights between the hidden and output layers Wjk:

delta Wjk = alpha * error_k * Ok(1 - Ok) * Oj

That Ok(1 - Ok) is the finaloutput(1-finaloutput) you're looking for because k is the output layer.

I don't recognise the expression you're printed:
HWeight += eta *(hidden output)*(1 - hidden ouptut)*previous layer output*[the sum of all(weights between hidden and output*(target-actual)*finaloutput*(1-finaloutput))]

I have tried to unpick it but I can't make sense of it. Perhaps you could point me to teh source so I can look at it?

Overall I would go with an expression whose derivation you can understand and are happy with - rather than use expressions which you have to take on trust. For this reason - over many years - I stopped pursuing papers/texts where I couldn't understand their logic. That is mostl likely moy own limitation ..

I'd be happy to look again.

Perhaps this expalantion which uses the alternative terminology of "error term" might be helpful? It's one of the better explanations that I've seen.
https://brilliant.org/wiki/backpropagation/

2. Hi,

Thanks for responding - I really do appreciate you taking the time :D

I followed the link you posted and, unless i'm mistaken, it agrees with the expression I posted. The wiki explanation describes the 'error term' as 'delta_k' and says that it is composed of the following..

For the final layer the error term is:

Delta_m1 = go'(a_m1)*(actual - target)

where go'(a_m1) is the derived output layer activation function of the input to the output node.

For the hiddenlayer the error term is:

Delta_kj = g'(a_jk)*SUM_OF_ALL(Weights_k+1*Delta_k+1)

where g'(a_jk) is the derived hidden layer activation function and the Delta_k+1 is the error term of the next layer - for the output layer this would be the 'Delta_m1' expression described above.

The error term for the hidden layer then would be composed of the derived activation function for the hidden layer as well as the sum of the derived activation functions of the net layer. This is not the same as is shown in the book since the book only uses the '(actual-target)' as the output layer error.

Interestingly, the wiki article you posted does at one point describe the output layer error to be the same as the book - BUT - this is only in the case where the output layer activation function is the linear activation. This is for a regression network not classification and the derived linear activation function is simply = 1! This would leave only the '(actual-target)' as the error term.

Maybe i'm missing something? Could you take another look and let me know?

Here are some other sources which explain the expression I am refering too..

^^The above link is a for part2 of a set of videos explaining the backprop algorithm.

http://neuralnetworksanddeeplearning.com/chap2.html

^^This on is an online book and uses the same approach.

http://www.cedar.buffalo.edu/~srihari/CSE574/Chap5/Chap5.3-BackProp.pdf

^^this is a pdf - Page 18/30 shows the algorithm when the activation functions of the output and hidden layer are the sigmoid.

All of these examples use include the derived activation function of the output layer as part of the error term to be backpropgated. The only way the error term can be simply the (actual-target) is if there is a linear activation function for the output layer - which essentially amounts to no activation function and is only for regression networks.

I wish that I could attach images here to better show the expressions from the sources - it would make things much easier to explain.

Look forwards to your input :D

1. Hi Tonje - my apologies for the delay. I have been a little busy with moving home and have not had time to look at this yet. I will need to spend some time on this as I don't find the maths in other books / papers easy to follow !!

7. Hello. I can't understand this topic: Backpropagating Errors To More Layers, on page 78.
To find e1 I used this formula: w21/(w21+w11) = 3/(3+2) and I got proportions 0.6 and 0.4. But there is 0.6 and 0.9. Where did 0.9 come from?
Lets find e2. w12=1, w22=4, so w22/(w22+w12) = 4(4+1) = 4/5. Proportions are 0.8 and 0.2, but you wrote 0.4 and 0.1