Having got a small neural network to learn logical XOR, I tried scaling up to learn the MNIST handwritten characters.
The early rough code is here: www.googledrive.com/host/0B6e9Zx7axvo-TjcwM2J5cUlVLVk/
You'll recall the images are bitmaps of 28x28 pixels. The input layer of the neural network therefore has 28x28 = 784 nodes, one per pixel. That's a big step up from our 2 or 3 nodes!
The output needs to represent the range of possible answers, and one way of doing this is simply to have a node for each answer - i.e. 10 nodes, one for each possible digit from 0 to 9.
The number of middle hidden layer nodes is interesting - it could be 784, more than 784, or a lot less. More means greater computational load and perhaps a risk of over-fitting. Too few and the neural network cannot learn to classify the characters because there isn't enough freedom to represent the model required. Let's try 100 nodes.
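To make the 784-100-10 shape concrete, here is a minimal sketch of such a network's weights and forward pass. The sigmoid activation, the random initialisation and all the names (w_input_hidden, w_hidden_output, query) are assumptions for illustration, not necessarily what the notebook does.

```python
import numpy

# Layer sizes from the text: 784 input, 100 hidden, 10 output nodes.
input_nodes, hidden_nodes, output_nodes = 784, 100, 10

rng = numpy.random.default_rng(0)

# Assumption: small random starting weights centred on zero.
w_input_hidden = rng.normal(0.0, input_nodes ** -0.5, (hidden_nodes, input_nodes))
w_hidden_output = rng.normal(0.0, hidden_nodes ** -0.5, (output_nodes, hidden_nodes))

def sigmoid(x):
    # Assumption: sigmoid activation on both layers.
    return 1.0 / (1.0 + numpy.exp(-x))

def query(inputs):
    # Forward pass: input -> hidden -> output.
    hidden = sigmoid(w_input_hidden @ inputs)
    return sigmoid(w_hidden_output @ hidden)

# A dummy mid-grey image, just to check the shapes line up.
outputs = query(numpy.full(input_nodes, 0.5))
print(outputs.shape)
```

With 100 hidden nodes there are 784x100 + 100x10 = 79,400 weights to learn, which gives a feel for the computational load mentioned above.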
Look through the code and you'll see the line
inputs = (numpy.asfarray(linebits[1:])/ 256.0) + 0.01
which scales the inputs from the initial range of 0-255 to the range 0.01 to just over 1.00 (255/256 + 0.01 is about 1.006). We could fix this to make the maximum exactly 1.00, but this is just a quick hack to stop any input being zero, which we know damages back propagation learning.
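As a rough check of those ranges, here is a small sketch that applies the same scaling to a made-up record. The record layout (label first, then pixels) and the 4-pixel example line are assumptions for illustration. The second formula is one possible way to hit exactly 1.00, not something taken from the notebook. Note that numpy.asfarray was removed in NumPy 2.0, so the sketch builds the float array directly.

```python
import numpy

# Hypothetical record: label first, then pixel values (4 shown here, not 784).
line = "2,0,128,255,64"
linebits = line.split(",")

pixels = numpy.array([float(v) for v in linebits[1:]])

# Same scaling as the line above: 0..255 -> 0.01..(255/256 + 0.01).
scaled_original = pixels / 256.0 + 0.01

# One possible alternative: 0..255 -> 0.01..1.00 exactly.
scaled_exact = pixels / 255.0 * 0.99 + 0.01

print(scaled_original.min(), scaled_original.max())
print(scaled_exact.min(), scaled_exact.max())
```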
The code runs through all 60,000 training examples - character bitmaps together with the correct, previously known answer. As we do this we keep track of the sum-squared-error and redirect it to a file for plotting later. Just like in the previous posts, we expect it to fall as training progresses. But the plot doesn't show a clean drop like we've seen before.
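For reference, the sum-squared-error for a single training example can be sketched as below. The 0.01/0.99 target values, the function name and the example outputs are assumptions for illustration. The notebook may define its targets or scale the error differently.

```python
import numpy

def sum_squared_error(targets, outputs):
    # Sum over the 10 output nodes of (target - actual)^2.
    return float(numpy.sum((targets - outputs) ** 2))

# Assumed target vector for the numeral "2": 0.99 at index 2, 0.01 elsewhere.
targets = numpy.full(10, 0.01)
targets[2] = 0.99

# Hypothetical network output that is close, but not perfect, on node 2.
outputs = targets.copy()
outputs[2] = 0.89

error = sum_squared_error(targets, outputs)
print(error)
```

One such value per training example, written out one per line, is the kind of file that can then be plotted.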
You might be able to see a density of points shifting downwards which we need a better way of visualising. Let's plot these errors as a histogram. The following gnuplot code does this:
plot [:][:] 'a.txt' using (bin($1,binwidth)):(1.0) smooth freq with boxes
Now that's much clearer! The vast majority of the sum-squared errors are in the range 0.0 - 0.1. In fact approx 48,000 of the 60,000, or 80% of them, are. That's a very good result for a first attempt at training a neural network to recognise handwritten numerals.
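The same binning can be sketched in Python with numpy.histogram, which is handy if you want the counts as numbers rather than a picture. The error values below are made-up placeholders, not results from the training run, and the bin width of 0.1 is an assumption.

```python
import numpy

# Placeholder error values for illustration only (not real training output).
errors = numpy.array([0.02, 0.05, 0.08, 0.15, 0.31, 0.04, 0.95, 0.07])

# Ten bins of width 0.1 covering 0.0 to 1.0.
edges = numpy.linspace(0.0, 1.0, 11)
counts, _ = numpy.histogram(errors, bins=edges)

# Fraction of examples whose error falls in the first bin, 0.0 - 0.1.
fraction_low = counts[0] / counts.sum()

for left, count in zip(edges[:-1], counts):
    print(f"{left:.1f} - {left + 0.1:.1f}: {count}")
print(fraction_low)
```

With the real error file loaded in place of the placeholder array, fraction_low is the figure quoted above as roughly 80%.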
The last bit of code in the Python notebook illustrates querying the neural network. In this case we take the 5th example (it could have been any other example) which represents the numeral "2". Some code prepares the query and we see the following output:
These are the ten output layer nodes. We can see that the largest one by far is the 3rd element, which represents the desired "2" - because we count from zero: 0, 1, 2, ... Let's visualise these output values:
You can see quite clearly the output value for the node representing "2" is by far the largest! The next most likely candidate "9" has an output 100 times smaller.
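Picking the answer out of the ten outputs programmatically is just a matter of finding the index of the largest value. A minimal sketch follows. The output values are hypothetical, chosen only to mimic the pattern described (node 2 largest, node 9 about 100 times smaller), and are not the notebook's actual numbers.

```python
import numpy

# Hypothetical output layer values, one per numeral 0..9.
outputs = numpy.array([0.001, 0.002, 0.95, 0.001, 0.003,
                       0.001, 0.002, 0.001, 0.004, 0.0095])

# The index of the largest output is the network's answer.
label = int(numpy.argmax(outputs))

# Indices sorted from most to least likely, to see the runner-up.
ranked = numpy.argsort(outputs)[::-1]

print(label)
print(int(ranked[1]), outputs[label] / outputs[ranked[1]])
```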