Tuesday, 28 June 2016

Bias Nodes in Neural Networks

I've been asked about bias nodes in neural networks. What are they? Why are they useful?

Back to Basics

Before we dive into bias nodes, let's go back to basics. Each node in a neural network applies a threshold function to its combined input. The output helps us make a decision about the inputs.

We know the nodes in a real neural network usually use a sigmoid-shaped function, with the logistic function $1/(1+e^{-x})$ and the $tanh()$ function both being popular choices.

But before we arrived at those, we used a very simple linear function to understand how it could be used to classify or predict, and how it could be refined by adjusting its slope. So let's stick with linear functions for now - because they are simpler.

The following is a simple linear function.

$$y = A\cdot x$$

You'll remember it was the parameter $A$ that we varied to get different classifications. And it was this parameter $A$ that we refined by learning from the error from each training example.

The following diagram shows some examples of different lines possible with such a linear function.

You can see how some lines are better at separating the two clusters. In this case the line $y=2x$ is the best at separating the two clusters.

That's all cool and happy - and stuff we've already covered before.

A Limitation

Look at the following diagram and see which line of the form $y=A\cdot x$ would best separate the data.

Ouch! We can't seem to find a line that does the job - no matter what slope we choose.

This is a limitation we've hit. Any line of the form $y= A\cdot x$ must go through the origin. You can see in the diagram all three example lines do.
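
To make the limitation concrete, here is a small Python sketch. The two clusters are made-up points chosen for illustration, not the data from the diagram, and the function name is hypothetical. It tries a wide range of slopes and checks whether any line through the origin leaves one cluster entirely below it and the other entirely above it.

```python
# Hypothetical (x, y) points, invented for illustration only.
cluster_low = [(1.0, 1.0), (2.0, 1.2), (3.0, 0.8)]
cluster_high = [(1.0, 3.0), (2.0, 3.2), (3.0, 2.8)]


def separates_through_origin(A):
    """True if y = A*x has every low point below it and every high point above it."""
    low_ok = all(y < A * x for x, y in cluster_low)
    high_ok = all(y > A * x for x, y in cluster_high)
    return low_ok and high_ok


# Try slopes from -5.00 to 5.00 in steps of 0.01.
slopes = [i / 100 for i in range(-500, 501)]
working_slopes = [A for A in slopes if separates_through_origin(A)]

print(working_slopes)  # prints [] because no slope does the job
```

For these points, keeping the low cluster below the line needs a slope above 1, while keeping the high cluster above the line needs a slope below about 0.93. Both can't be true at once, so changing the slope alone never works.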

More Freedom

What we need is to be able to shift the line up and down. We need an extra degree of freedom.

The following diagram shows some example separator lines which have been liberated from the need to go through the origin.

You can see one that does actually do a good job of separating the two data clusters.

So what form do these liberated lines take? They take the following form:

$$ y = A \cdot x + B $$

We've added an extra $+B$ to the previous simpler equation $y = A\cdot x$. All this will be familiar to you if you've done maths at school.
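
Here is a minimal Python sketch of what the extra $+B$ buys us. The points are invented for illustration (they are not the data from the diagram) and the function name is hypothetical. The same check that fails for a line pinned to the origin succeeds once the line is allowed to shift up.

```python
# Hypothetical (x, y) points, invented for illustration only.
cluster_low = [(1.0, 1.0), (2.0, 1.2), (3.0, 0.8)]
cluster_high = [(1.0, 3.0), (2.0, 3.2), (3.0, 2.8)]


def separates(A, B):
    """True if y = A*x + B has every low point below it and every high point above it."""
    low_ok = all(y < A * x + B for x, y in cluster_low)
    high_ok = all(y > A * x + B for x, y in cluster_high)
    return low_ok and high_ok


# A flat line pinned to the origin (B = 0) fails for these points ...
pinned = separates(A=0.0, B=0.0)

# ... but the same slope shifted up by B = 2 separates them.
shifted = separates(A=0.0, B=2.0)

print(pinned, shifted)  # prints: False True
```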

Bias Node

So we've just found that for some problems, a simple linear classifier of the form $y=A\cdot x$ was insufficient to represent the training data. We needed an extra degree of freedom so the lines were freer to go all over the data. The full form of a linear function $y = A\cdot x + B$ does that.

The same idea applies even when we're using sigmoid shaped functions in each neural network node. You can see that without a $+B$ those simpler functions are doomed to stick to a fixed origin point, and only their slope changes. You can see this in the following diagram.

How do we represent this in a neural network?

We could change the activation function in each node. But remember, we chose not to alter the slope of that function, never mind adding a constant. We instead chose to change the weights of the incoming signals.

So we need to continue that approach. The way to do this is to add a special additional node into a layer, alongside the others, which always outputs a constant value, usually 1. The weight of the link from that node is able to change, and even become negative. This has the same effect as adding the additional degree of freedom that we needed above.

The following illustrates the idea:

The activation function is a sigmoid of the combined incoming signals $w_0 + w_1\cdot x$. The $w_0$ is the weight on the link from the additional node, and has the effect of shifting the function left or right along the x-axis. That in effect allows the function to escape being pinned to the "origin", which is $(0, \frac{1}{2})$ for the logistic function and $(0,0)$ for the $tanh()$.

Don't forget that the $w_1$ can be negative too ... which allows the function to flip top to bottom too, allowing for lines which fall not just rise.
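
A quick way to see the shift is to evaluate the logistic function of $w_0 + w_1\cdot x$ for a few weights. The following Python sketch uses hypothetical function names and example weights picked only for illustration. The midpoint of the curve, where the output is $\frac{1}{2}$, sits where $w_0 + w_1\cdot x = 0$, that is at $x = -w_0 / w_1$.

```python
import math


def logistic(z):
    """The logistic function 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))


def node_output(x, w0, w1):
    """Output of a node fed by a bias link (weight w0) and one input x (weight w1)."""
    return logistic(w0 + w1 * x)


# Without a bias (w0 = 0) the curve is pinned at (0, 0.5), whatever the slope.
pinned = [node_output(0.0, 0.0, w1) for w1 in (0.5, 1.0, 3.0)]
print(pinned)  # prints [0.5, 0.5, 0.5]

# With w0 = -4 and w1 = 2 the midpoint moves to x = -w0 / w1 = 2.
print(node_output(2.0, -4.0, 2.0))  # prints 0.5

# A negative w1 gives a curve that falls as x increases.
falling = node_output(1.0, 0.0, -2.0) < node_output(-1.0, 0.0, -2.0)
print(falling)  # prints True
```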

The following shows how the extra node is included in a layer. That node is called a bias node.

It is worth experimenting to determine whether you need a bias node to augment the input layer, or whether you also need one to augment the internal hidden layers. You don't need one on the output layer, because a bias node feeds the nodes of the next layer and there is no layer after the output.

Coding A Bias Node

A bias node is simple to code. The following shows how we might add a bias node to the input layer, with code based on our examples on GitHub.

  • Increment the number of input nodes to make room for the bias node, self.inodes = input_nodes + 1.
  • This automatically means that the weight matrix takes the right shape, because self.wih depends on self.inodes.
  • In the query() and train() functions, prepend or append a constant 1.0 bias input to the inputs_list.
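
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact code from the repository: the class name and the _with_bias() helper are hypothetical, the sigmoid is written out with numpy, the random seed is fixed only to make runs repeatable, and the bias is added to the input layer only.

```python
import numpy as np


class NeuralNetworkWithInputBias:
    """Three-layer network with a constant 1.0 bias node added to the input layer."""

    def __init__(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        # One extra input node to carry the bias constant.
        self.inodes = input_nodes + 1
        self.hnodes = hidden_nodes
        self.onodes = output_nodes
        self.lr = learning_rate

        # wih gets its extra column automatically because it depends on self.inodes.
        rng = np.random.default_rng(0)  # fixed seed: an assumption, for repeatability
        self.wih = rng.normal(0.0, self.inodes ** -0.5, (self.hnodes, self.inodes))
        self.who = rng.normal(0.0, self.hnodes ** -0.5, (self.onodes, self.hnodes))

    @staticmethod
    def activation(x):
        # Logistic sigmoid.
        return 1.0 / (1.0 + np.exp(-x))

    def _with_bias(self, inputs_list):
        # Hypothetical helper: prepend the 1.0 bias input, return a column vector.
        return np.array([1.0] + list(inputs_list), ndmin=2).T

    def query(self, inputs_list):
        inputs = self._with_bias(inputs_list)
        hidden_outputs = self.activation(self.wih @ inputs)
        return self.activation(self.who @ hidden_outputs)

    def train(self, inputs_list, targets_list):
        inputs = self._with_bias(inputs_list)
        targets = np.array(targets_list, ndmin=2).T

        # Forward pass.
        hidden_outputs = self.activation(self.wih @ inputs)
        final_outputs = self.activation(self.who @ hidden_outputs)

        # Errors, split back across the hidden nodes by the link weights.
        output_errors = targets - final_outputs
        hidden_errors = self.who.T @ output_errors

        # Weight updates; the bias weight is simply the first column of wih.
        self.who += self.lr * (output_errors * final_outputs * (1.0 - final_outputs)) @ hidden_outputs.T
        self.wih += self.lr * (hidden_errors * hidden_outputs * (1.0 - hidden_outputs)) @ inputs.T


n = NeuralNetworkWithInputBias(3, 4, 2, 0.3)
print(n.wih.shape)  # prints (4, 4): 4 hidden nodes, 3 inputs plus 1 bias
```

Callers still pass the original three inputs, for example n.query([0.1, 0.2, 0.3]). The bias constant is added inside the class, so the rest of the code doesn't need to know about it.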

Why Didn't We Use Bias?

Why didn't we use bias when we created a neural network to learn the MNIST data set?

The primary aim of the book was to keep things as simple as possible and avoid additional details or optimisations as much as possible.

The MNIST data challenge is one that happens not to need a bias node, just like some cluster separation problems don't need the extra degree of freedom.