Having got a small neural network to learn logical XOR, I tried scaling up to learn the MNIST handwritten characters.
The early rough code is here: www.googledrive.com/host/0B6e9Zx7axvo-TjcwM2J5cUlVLVk/
You'll recall the images are bitmaps of 28x28 pixels. The neural network's input layer therefore has 28x28 = 784 nodes. That's a big step up from our 2 or 3 nodes!
The output needs to represent the range of possible answers, and one way of doing this is simply to have a node for each answer - that is, 10 nodes, one for each possible digit between 0 and 9.
The number of middle hidden layer nodes is interesting - it could be 784, more than 784, or a lot less. More means a greater computational load and perhaps a risk of over-fitting. Too few and the neural network cannot learn to classify the characters because there isn't enough freedom to represent the required model. Let's try 100 nodes.
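Just to make those shapes concrete, here's a minimal sketch of what the three layers imply for the weight matrices - the variable names and the random initialisation are mine, not the actual notebook code:

import numpy

# a sketch only: 784 input, 100 hidden and 10 output nodes
input_nodes, hidden_nodes, output_nodes = 784, 100, 10

# weight matrices initialised to small random values centred on zero
weights_input_hidden = numpy.random.uniform(-0.5, 0.5, (hidden_nodes, input_nodes))
weights_hidden_output = numpy.random.uniform(-0.5, 0.5, (output_nodes, hidden_nodes))

print(weights_input_hidden.shape)    # (100, 784)
print(weights_hidden_output.shape)   # (10, 100)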
Look through the code and you'll see the line
inputs = (numpy.asfarray(linebits[1:])/ 256.0) + 0.01
which scales the inputs from the initial range of 0-255 to the range 0.01 to just over 1.00. We could fix this to make the top of the range exactly 1.00, but this is just a quick hack to prevent any input being zero, which we know damages back propagation learning.
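For what it's worth, if we did want the top of the range to be exactly 1.00, dividing by 255.0 and applying a scale factor would do it - a quick sketch, not the actual notebook code:

import numpy

# illustrative only: map pixel values 0-255 exactly onto the range 0.01 to 1.00
raw_pixels = numpy.asfarray(["0", "128", "255"])   # stand-in for linebits[1:]
inputs = (raw_pixels / 255.0) * 0.99 + 0.01
print(inputs)                                      # roughly [0.01 0.507 1.0]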
The code runs through all 60,000 training examples - character bitmaps together with the correct, previously known answer. As we do this we keep track of the sum-squared-error and redirect it to a file for plotting later. Just like in the previous posts, we expect it to fall over the training epoch. But it doesn't look like the clean drop we've seen before.
You might be able to see a density of points shifting downwards, but we need a better way of visualising it. Let's plot these errors as a histogram. The following gnuplot code does this:
binwidth=0.1
bin(x,width)=width*floor(x/width)
plot [:][:] 'a.txt' using (bin($1,binwidth)):(1.0) smooth freq with boxes
Now that's much clearer! The vast majority of sum-squared-errors are in the range 0.0 - 0.1. In fact approximately 48,000 of the 60,000, or 80%, of them are. That's a very good result for a first attempt at training a neural network to recognise handwritten numerals.
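The same histogram can also be drawn directly in Python - a sketch using matplotlib, assuming the sum-squared-errors were written one per line to a.txt:

import numpy
import matplotlib.pyplot as plt

errors = numpy.loadtxt("a.txt")                                      # one sum-squared-error per line
plt.hist(errors, bins=numpy.arange(0.0, errors.max() + 0.1, 0.1))    # same 0.1 bin width as the gnuplot version
plt.xlabel("sum-squared-error")
plt.ylabel("count")
plt.show()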
Test Query
The last bit of code in the Python notebook illustrates querying the neural network. In this case we take the 5th example (it could have been any other example) which represents the numeral "2". Some code prepares the query and we see the following output:
finaloutputs =
[[ 1.31890673e-05]
[ 8.37041536e-08]
[ 9.96541147e-01]
[ 1.68720831e-06]
[ 1.39223748e-07]
[ 8.52160977e-09]
[ 3.48925352e-06]
[ 4.08379490e-05]
[ 4.82287514e-05]
[ 5.91914796e-03]]
These are the ten output layer node values. We can see that the largest one by far is the 3rd element, which represents the desired "2" - because we count from zero: 0, 1, 2, ... Let's visualise these output values:
You can see quite clearly the output value for the node representing "2" is by far the largest! The next most likely candidate "9" has an output over 100 times smaller.
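To turn that list of outputs into an answer in code, we can simply take the index of the largest value - a minimal sketch using numpy on the output values shown above:

import numpy

# the ten output layer node values from the query above, as a column array
final_outputs = numpy.array([[1.31890673e-05], [8.37041536e-08], [9.96541147e-01],
                             [1.68720831e-06], [1.39223748e-07], [8.52160977e-09],
                             [3.48925352e-06], [4.08379490e-05], [4.82287514e-05],
                             [5.91914796e-03]])

# the index of the largest output is the network's answer
print(numpy.argmax(final_outputs))   # 2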
Tuesday, 5 May 2015
Saturation and Normalising Inputs
I've been puzzled by my neural network code not working. It struck me that back propagation suffers from some issues, like getting stuck in local minima, or not being able to learn from inputs that are all the same, or zero.
I think I was suffering from network saturation - where the activation functions were all operating in the region where the gradient is almost zero. No gradient means no weight change, which means no learning!
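You can see the problem with a quick sketch, assuming a sigmoid activation function - near zero the gradient is healthy, but far from zero it practically vanishes:

import numpy

def sigmoid(x):
    return 1.0 / (1.0 + numpy.exp(-x))

def sigmoid_gradient(x):
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_gradient(0.0))    # 0.25 - plenty of gradient to learn from
print(sigmoid_gradient(10.0))   # about 0.000045 - saturated, almost no learning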
The simple problem I was trying to get working was the famous XOR problem. Normalising the inputs from the range [0, 1.0] to [0.1, 0.9] seems to work much better.
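The normalisation itself is just a linear rescaling - a sketch, with the function name being mine:

import numpy

def normalise(x):
    # map values in [0.0, 1.0] linearly onto [0.1, 0.9], away from the sigmoid's flat regions
    return numpy.asfarray(x) * 0.8 + 0.1

# the four XOR training inputs, normalised
print(normalise([[0, 0], [0, 1], [1, 0], [1, 1]]))
# [[0.1 0.1]
#  [0.1 0.9]
#  [0.9 0.1]
#  [0.9 0.9]]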
The Python source code I have so far is at:
The resultant sum-squared-error for XOR learning is:
I will read with interest the fantastic paper "Efficient Backprop": http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf
I tried my early code with OR and AND, not just XOR, and it all worked!
I'm still cautious so I'll think about this a bit more and read that paper in more detail.
Sunday, 3 May 2015
Hill Climbing Learning
I've been having real trouble debugging my back propagation algorithm, and also understanding the error in my own derivation of the weight change... so I took a step back and tried a different approach altogether.
Hill climbing is a fancy term, but all we're doing is taking an untrained neural network and making a small change to one of the weights to see if it improves the overall result. If it does, keep that change; if it doesn't, discard it and revert.
That's an intuitive explanation - much easier to understand than back propagation at the weight change level.
Why is it called hill climbing? Imagine the landscape of weights - two dimensions is easy to imagine - that is, w1 and w2. We stand at a starting position and we want to find the best w1 and w2. If we take a small step in a random direction, we can then see whether it improves the overall output (the sum-squared-error) of the neural network - or not. If it does, we've moved closer to the optimal w1, w2. Keep doing this and you'll get there. The "hill" is a measure of solution quality, the error function for example.
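Here's a minimal sketch of the idea in Python - the error function and weights are stand-ins, not my actual neural network code:

import random

def hill_climb(weights, error, step=0.01, iterations=10000):
    # weights: a list of floats; error: a function returning the sum-squared-error for those weights
    best_error = error(weights)
    for _ in range(iterations):
        i = random.randrange(len(weights))      # pick one weight at random
        nudge = random.uniform(-step, step)     # make a small random change to it
        weights[i] += nudge
        new_error = error(weights)
        if new_error < best_error:
            best_error = new_error              # improvement - keep the change
        else:
            weights[i] -= nudge                 # no improvement - revert
    return weights, best_error

# toy usage: find w1, w2 that minimise a simple quadratic "error landscape"
error = lambda w: (w[0] - 0.3)**2 + (w[1] + 0.7)**2
print(hill_climb([0.0, 0.0], error))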
Again - a great graph showing a neural network improving its accuracy over training epochs:
Do keep in mind that this method, just like back propagation, can end up with a locally - not globally - optimal solution. Thinking of that landscape above, we may end up in a ditch, but not the deepest ditch!