Last time we covered the overarching idea that, during training, the network's overall output error is the key thing that guides how each internal weight must be tuned so that the overall output error is reduced.
Simpler Case First: Linear Neuron
Because the maths usually gets in the way of most explanations, let's illustrate the idea using an simpler model of a neuron.
Remember the neuron we described in an earlier post? We know consider an artificially simple node which has a linear activation function. Linear means the output function is of the form ax+b .... that is, (weight x input) + constant, or o=w.i+c to be more concise.
Ignore for now the fact that useful neural networks can't have such simple activation functions.
Imagine we start with an untrained neuron, with
- weight w=0.8
- and for simplicity set the constant c as zero.
- Imagine also we have a training data with input = 1.0 and output = -1.0.
So what do we do with this error -1.8? We said in the last post that this is the value we used to refine the network's weights to improve the result.
Derivatives: How Outputs Depend on Other Factors
To answer this we need to dig a little into the activation function which turns the input into the output and uses the weight to do so. This makes intuitive sense - if we're to tweak the weight, we need to understand the function which uses the weight to produce, hopefully, a better output. This function is of course the activation function o = w.i + c.
Look at the following graphs showing o = w.i for different w. Larger w have larger slopes, negative w have slopes going in the oppositve direction. So w has a direct effect on the output. You can see that w=-0.7 would be a better weight for our training dara because with this the output would be -0.7, not so far from the desired -1.0. Much better than 0.8 above.
But we need to be able to express this connection more precisely than handwaving at a graph.
As is all too common in mathematics, you can get an insight into a function by looking at how it depends on a variable.You'll remember this is simply calculus - how does X change as a result of Y changing. The following shows the idea applied to a general activation function.
So let's do this with o = w.i + c and see how it varies with w, because we're interested in tweaking w. The derivative of o with respect to w is Δo/Δw and is simply Δo/Δw = i. This is simple school level calculus.
Let's rearrange that expression so that we have the output and weights on different sides of the equation. That becomes Δo = Δw.i and this simply says that a small change in w multiplied by i gives the small change in o. Or rearranging slightly Δw = Δo/i.
This is looking hopeful! The Δw is the change in weight we want to work out. We know i because for the training data set it was 1.0. So what is Δo? It is the change in o, the output, we want to correspond with that change in w. But what should Δo be? Its the error, the different between the desired and current output. So, using the above training data item, we have error = Δo = -1.8. This deliberately simplistic function and it's derivative tell us that the Δw should change by -1.8 too. The weight w was 0.8 but changing it by -1.8 gives the new value -1.0.
Hey presto! The new weight of -1.0, gives us a new improved output of -1.0 ... to perfectly match the target desired output.
Step Back: The Process
Ok - phew! That was along winded way of tweaking a really simple weight to get a better output... but the important thing is the process - the process of using the derivatives of the activation function to understand how the output depends on the weight. Or more precisely, how small changes in output correlate to changes in weight Δw. And using this relationship to work out Δw, the required weight change.
This is the important idea: That we improve the weights by calculating the small change dw based on the desired change in output, Δo. And the relationship between Δw and Δo is derived using calculus of the activation function.
In reality, we don't just change the weight by the value of Δw. Instead we moderate how much of the Δw change we apply. This is often called a learning rate. We might choose to only apply half of each Δw we calculate for example. Why do this?
We moderate the "training", or changes in weight recommended by the calculation of Δw, because the training data won't necessarily perfectly fit a real world model, and trying to aim at it too strongly means risking over-fitting. Over-fitting is when a system is too closely matched to imperfect training data, and as a result performs badly on new data. In short, it doesn't generalise but merely memorises the training data.
This time we looked at a deliberately and extremely simple neural network consisting of one neuron, and a simple activation function, just to see more clearly how we can refine the weights with each training example.
Next time we'll have to work out how to do this with a more complex network, with hidden layers.