Monday, 9 May 2016

Great Question from France: Training Order

I had a great question from Hamid in France, which led on to some more interesting thoughts.

He assumed each training example was fed forward through the network many times, each time reducing the error, and wanted to know when to stop and move on to the next training example. That is:

Training Example 1: FW, BP, FW, BP,  FW, BP, ....
Training Example 2: FW, BP, FW, BP,  FW, BP, ....
Training Example 3: FW, BP, FW, BP,  FW, BP, ....
Training Example 4: FW, BP, FW, BP,  FW, BP, ....

( FW=Feed Forward, BP=Back Propagate )

---

My immediate reply was to say that this wasn't how it was normally done, and that instead each training example was used in turn. Some call this on-line learning. That is:

Training Example 1: FW, BP
Training Example 2: FW, BP
Training Example 3: FW, BP
Training Example 4: FW, BP
...

And I said that it is often a good idea to repeat this several times, that is, training for several epochs.
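
In Python that scheme looks something like the following - just a sketch, assuming a "network" object with a train() method like the class we build in the book, and "training_data" as a list of (inputs, targets) pairs:

epochs = 5

for e in range(epochs):
    # work through every training example in turn
    for inputs, targets in training_data:
        # one feed forward and one back propagation per example
        network.train(inputs, targets)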

---

Some will in fact batch together a few training examples and sum up the error to be used for back propagation. That is:

Training Example 1: FW (accumulate error)
Training Example 2: FW (accumulate error)
Training Example 3: FW (accumulate error)
Training Example 4: FW, BP (accumulated error)
Training Example 5: FW (accumulate error)
Training Example 6: FW (accumulate error)
Training Example 7: FW (accumulate error)
Training Example 8: FW, BP (accumulated error)
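
A rough sketch of that batching idea might look like this - query() does the feed forward, and train_with_error() is a made-up method standing in for a back propagation step driven by the accumulated error (the class in the book doesn't have one):

import numpy

batch_inputs, batch_errors = [], []

for inputs, targets in training_data:
    outputs = network.query(inputs)              # feed forward only
    batch_inputs.append(inputs)
    batch_errors.append(numpy.array(targets, ndmin=2).T - outputs)

    if len(batch_inputs) == 4:                   # a batch of 4, as above
        # a single back propagation driven by the summed error
        network.train_with_error(batch_inputs, sum(batch_errors))
        batch_inputs, batch_errors = [], []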

---

Then I thought about it more and concluded that Hamid's approach wasn't wrong at all - just different. He was asking what the stopping criterion would be for applying the same training example many times. The real answer is .. I don't know - but I would experiment to find out.
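
If I were experimenting, a first attempt might look something like this - the 0.01 error threshold and the cap of 20 repeats are arbitrary numbers chosen purely for illustration, and network and training_data are the same assumed objects as in the sketches above:

import numpy

for inputs, targets in training_data:
    for attempt in range(20):                    # don't repeat forever
        outputs = network.query(inputs)          # feed forward
        error = numpy.sum((numpy.array(targets, ndmin=2).T - outputs) ** 2)
        if error < 0.01:                         # "got it" - move on
            break
        network.train(inputs, targets)           # back propagate and try again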

---

Hamid's question is a good one, because it is not often made very clear which order or scheme is used. It is too easy for authors and teachers to assume new readers will know which scheme is being considered, or even which ones are a good idea.

That's why I love feedback from readers - they ask the best questions!

Thanks Hamid!

5 comments:

  1. Firstly, I enjoyed reading your book, which clarified so much for me.

    I recently challenged myself to write a clustering algorithm which worked out for itself how many clusters there are, rather than having the number passed as a parameter as many do. It works by comparing how big a step it is to join the closest clusters against the density of the existing clusters. As such there are no parameters or constants. It has limitations but is very effective with clean data.

    I was thinking of applying the same principles to the above question from Hamid about training with a single example until it has 'got it', then moving on to the next. I did a quick test using a hacked version of your code which trained with a single example and then tested with the same sample. It appeared that the network was very confident after the first learn, scoring over 0.9 for the right label. This might be obvious but I was expecting it to creep towards confidence. Is that level of confidence to be expected from a single learn?

    Thanks.

    Replies
    1. Did some work and got a workable solution. It is possible to FW, BP, FW, BP on the same example and get it to decide when it has learned enough before moving on to the next. It actually turned out to be quite simple (took about 15 minutes to work it out and code it) and doesn't need any constants or parameters. I logged its learning and with the early examples it needs to learn it up to 5 times but then later on it gets it first or second time. A bit like human learning really.
      I ran it using the settings from the first successful run in the book :- 100 hidden nodes and a learning rate of 0.1. Its performance was 0.9503 which is slightly better than the book.
      Happy to share the solution.

    2. Hi Jed

      Really sorry I missed this comment - not sure how it evaded my attention!

      I love how you've applied these ideas to clustering .. and how you've addressed the main problem of traditional clustering algorithms requiring the number of clusters ahead of time.

      If you did a blog post about your work I'd be interested .. and also link to it too.

      About your question - getting 0.9 for the right label .. I think I am surprised, but it just goes to show how well the neural network idea works for character classification. More to the point - it shows that the method is very robust to imperfect data (not much of it, data with errors, or data with noise, etc).

    3. No problem, just checked back and saw your reply.

      I'm not a blogger but happy to describe what I did with the clustering.

      The first phase is to put all elements into their own group and then loop through all elements looking to see if there is another element closer (euclidean distance) to them than the closest member of their own group. If there is then it swaps to that element's group. It cycles through until no element swaps groups.

      At this point elements are in 'perfect' groups where they are closer to at least one element in their group than any other group. Although the groups are perfect there could be a lot of them.
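
      Roughly, in Python, that first phase could look something like the following (a sketch of the description above, with illustrative names - not the actual code):

      import numpy

      def initial_grouping(points):
          # every element starts in its own group
          points = [numpy.array(p, dtype=float) for p in points]
          group = list(range(len(points)))

          changed = True
          while changed:
              changed = False
              for i, p in enumerate(points):
                  # euclidean distance from element i to every other element
                  dists = [(numpy.linalg.norm(p - q), j)
                           for j, q in enumerate(points) if j != i]
                  nearest_dist, nearest_j = min(dists)
                  # closest member of i's own group, excluding i itself
                  own_dists = [d for d, j in dists if group[j] == group[i]]
                  # swap if some other element is closer than any group-mate
                  swap = (not own_dists) or (nearest_dist < min(own_dists))
                  if swap and group[i] != group[nearest_j]:
                      group[i] = group[nearest_j]
                      changed = True
          return group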

      The next stage is to start clumping groups together. It does this by going through the groups looking for the nearest adjacent group. Once found it examines how much the density of the combined group would drop. If the drop is acceptable it joins them and looks again for adjacent groups. It also modifies the level of acceptability for the next join. At some point the drop in density is not acceptable (the two groups are too far apart) and it stops.

      The code to work out how to change the acceptability took the most time and a lot of trial and error and is sort of based on integration.

      As joining is always at the edge and does not use centroids, the shapes (of any dimension) do not matter, and it happily does the two moons test, for example.

      It fails where there is not a clean divide - for instance it cannot do the iris data set, which has overlapping data points.

      Cheers.

    4. Hi Jed - that's a great approach. You have a clustering method that is really practical and effective.

      If you do ever decide to write it up as a blog post (and people love illustrations!) .. I'd be very happy to share your great work!
