Monday, 17 February 2020

Calculating the Output Size of Convolutions and Transpose Convolutions

Convolution is common in neural networks which work with images, either as classifiers or as generators. When designing such convolutional neural networks, the shape of data emerging from each convolution layer needs to be worked out.

Here we’ll see how this can be done step-by-step with configurations of convolution that we’re likely to see working with images.

In particular, transposed convolutions are thought of as difficult to grasp. Here we’ll show that they’re not difficult at all by working through some examples which all follow a very simple recipe.

Example 1: Convolution With Stride 1, No Padding

In this first simple example we apply a 2 by 2 kernel to an input of size 6 by 6, with stride 1.

The picture shows how the kernel moves along the image in steps of size 1. The areas covered by the kernel do overlap but this is not a problem. Across the top of the image, the kernel can take 5 positions, which is why the output is 5 wide. Down the image, the kernel can also take 5 positions, which is why the output is a 5 by 5 square. Easy!

The PyTorch function for this convolution is:

nn.Conv2d(in_channels, out_channels, kernel_size=2, stride=1)

Example 2: Convolution With Stride 2, No Padding

This second example is the same as the previous one, but we now have a stride of 2.

We can see the kernel moves along the image in steps of size 2. This time the areas covered by the kernel don’t overlap. In fact, because the kernel size is the same as the stride, the image is covered without overlaps or gaps. The kernel can take 3 positions across and down the image, so the output is 3 by 3.

The PyTorch function for this convolution is:

nn.Conv2d(in_channels, out_channels, kernel_size=2, stride=2)

Example 3: Convolution With Stride 2, With Padding

This third example is the same as the previous one, but this time we use a padding of 1.

By setting padding to 1, we extend all the image edges by 1 pixel, with values set to 0. That means the image width has grown by 2. We apply the kernel to this extended image. The picture shows the kernel can take 4 positions across the image. This is why the output is 4 by 4.

The PyTorch function for this convolution is:

nn.Conv2d(in_channels, out_channels, kernel_size=2, stride=2, padding=1)

Example 4: Convolution With Coverage Gaps

This example illustrates the case where the chosen kernel size and stride mean it doesn’t reach the end of the image.

Here, the 2 by 2 kernel moves with a step size of 2 over the 5 by 5 image. The last column of the image is not covered by the kernel.

The easiest thing to do is to just ignore the uncovered column, and this is in fact the approach taken by many implementations, including PyTorch. That’s why the output is 2 by 2.
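To see which column gets left out, here is a minimal sketch in plain Python. The helper name kernel_positions is made up for illustration, and it only mimics the counting described above, not any real PyTorch internals.

```python
def kernel_positions(input_size, kernel_size, stride):
    """Left-edge positions where the kernel fits entirely inside the input."""
    last_start = input_size - kernel_size
    return list(range(0, last_start + 1, stride))

input_size, kernel_size, stride = 5, 2, 2

positions = kernel_positions(input_size, kernel_size, stride)

# every column touched by the kernel at some position
covered = {p + offset for p in positions for offset in range(kernel_size)}
uncovered = [col for col in range(input_size) if col not in covered]

print(positions)   # [0, 2]
print(uncovered)   # [4]
```

The kernel fits at two positions, so the output is 2 wide, and the last column (index 4) is never visited.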

For medium to large images, the loss of information from the very edge of the image is rarely a problem as the meaningful content is usually in the middle of the image. Even if it wasn’t, the fraction of information lost is very small.

If we really wanted to avoid any information being lost, we’d adjust some of the options. We could add padding to ensure no part of the input image was missed, or we could adjust the kernel and stride sizes so they match the image size.

Example 5: Transpose Convolution With Stride 2, No Padding

The transpose convolution is commonly used to expand a tensor to a larger tensor. This is the opposite of a normal convolution which is used to reduce a tensor to a smaller tensor.

In this example we use a 2 by 2 kernel again, set to stride 2, applied to a 3 by 3 input.

The process for transposed convolution has a few extra steps but is not complicated.

First we create an intermediate grid which has the original input’s cells spaced apart with a step size set to the stride. In the picture above, we can see the pink cells spaced apart with a step size of 2. The new in-between cells have value 0.

Next we extend the edges of the intermediate image with additional cells with value 0. We add the maximum amount of these so that a kernel in the top left covers one of the original cells. This is shown in the picture at the top left of the intermediate grid. If we added another ring of cells, the kernel would no longer cover the original pink cell.

Finally, the kernel is moved across this intermediate grid in step sizes of 1. This step size is always 1. The stride option is used to set how far apart the original cells are in the intermediate grid. Unlike normal convolution, here the stride is not used to decide how the kernel moves.

The kernel moving across this 7 by 7 intermediate grid gives us an output of 6 by 6.
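The recipe can be sketched in plain Python to check the sizes. This is only an illustration under some assumptions: the function names are invented here, and the kernel is all ones, which sidesteps any question about kernel orientation and simply shows which original cell each output position picks up.

```python
def intermediate_grid(cells, kernel_size, stride):
    """Space the input cells `stride` apart, then surround them with zero rings."""
    n = len(cells)
    spaced = (n - 1) * stride + 1   # size once the original cells are spaced apart
    ring = kernel_size - 1          # most rings that still let a corner kernel touch a cell
    size = spaced + 2 * ring
    grid = [[0] * size for _ in range(size)]
    for r in range(n):
        for c in range(n):
            grid[ring + r * stride][ring + c * stride] = cells[r][c]
    return grid

def slide_kernel(grid, kernel):
    """Move the kernel over the grid in steps of 1, summing the products."""
    k = len(kernel)
    out_size = len(grid) - k + 1
    return [[sum(grid[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(out_size)]
            for r in range(out_size)]

cells = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 1], [1, 1]]

grid = intermediate_grid(cells, kernel_size=2, stride=2)
output = slide_kernel(grid, kernel)

print(len(grid), len(output))   # 7 6
print(output[0])                # [1, 1, 2, 2, 3, 3]
```

With a 2 by 2 kernel and the cells spaced 2 apart, every kernel position covers exactly one original cell, which is why each input value shows up in a 2 by 2 block of the output.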

Notice how this transformation of a 3 by 3 input to a 6 by 6 output is the opposite of Example 2 which transformed an input of size 6 by 6 to an output of size 3 by 3, using the same kernel size and stride options.

The PyTorch function for this transpose convolution is:

nn.ConvTranspose2d(in_channels, out_channels, kernel_size=2, stride=2)

Example 6: Transpose Convolution With Stride 1, No Padding

In the previous example we used a stride of 2 because it is easier to see how it is used in the process. In this example we use a stride of 1.

The process is exactly the same. Because the stride is 1, the original cells are spaced apart without a gap in the intermediate grid. We then grow the intermediate grid with the maximum number of additional outer rings so that a kernel in the top left can still cover one of the original cells. We then move the kernel with step size 1 over this intermediate 7 by 7 grid to give an output of size 6 by 6.

You’ll notice this is the opposite transformation to Example 1.

The PyTorch function for this transpose convolution is:

nn.ConvTranspose2d(in_channels, out_channels, kernel_size=2, stride=1)

Example 7: Transpose Convolution With Stride 2, With Padding

In this transpose convolution example we introduce padding. Unlike the normal convolution where padding is used to expand the image, here it is used to reduce it.

We have a 2 by 2 kernel with stride set to 2, and an input of size 3 by 3, and we have set padding to 1.

We create the intermediate grid just as we did in Example 5. The original cells are spaced 2 apart, and the grid is expanded so that the kernel can cover one of the original values.

The padding is set to 1, so we remove 1 ring from around the grid. This leaves the grid at size 5 by 5. Applying the kernel to this grid gives us an output of size 4 by 4.

The PyTorch function for this transpose convolution is:

nn.ConvTranspose2d(in_channels, out_channels, kernel_size=2, stride=2, padding=1)

Calculating Output Sizes

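Assuming we’re working with square shaped input, with equal width and height, the formula for calculating the output size for a convolution is:

$$ \text{output size} = \left\lfloor \frac{\text{input size} + 2 \times \text{padding} - (\text{kernel size} - 1) - 1}{\text{stride}} + 1 \right\rfloor $$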

The L-shaped brackets take the mathematical floor of the value inside them. That means the largest integer below or equal to the given value. For example, the floor of 2.3 is 2.

If we use this formula for Example 3, we have input size = 6, padding = 1, kernel size = 2 and stride = 2. The calculation inside the floor brackets is (6 + 2 - 1 - 1) / 2 + 1, which is 4. The floor of 4 remains 4, which is the size of the output.
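As a quick check, here is the same calculation as a small Python function, tried on Examples 1 to 4. The function name is our own, and it assumes square inputs and no options beyond kernel size, stride and padding.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output width (or height) of a square convolution.

    Integer division (//) plays the role of the floor brackets.
    """
    return (input_size + 2 * padding - (kernel_size - 1) - 1) // stride + 1

print(conv_output_size(6, kernel_size=2, stride=1))             # Example 1: 5
print(conv_output_size(6, kernel_size=2, stride=2))             # Example 2: 3
print(conv_output_size(6, kernel_size=2, stride=2, padding=1))  # Example 3: 4
print(conv_output_size(5, kernel_size=2, stride=2))             # Example 4: 2
```

Example 4 is the case where the floor matters: (5 - 1 - 1) / 2 + 1 is 2.5, which is rounded down to 2.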

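Again, assuming square shaped tensors, the formula for transposed convolution is:

$$ \text{output size} = (\text{input size} - 1) \times \text{stride} - 2 \times \text{padding} + (\text{kernel size} - 1) + 1 $$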

Let’s try this with Example 7, where the input size = 3, stride = 2, padding = 1, kernel size = 2. The calculation is then simply 2*2 - 2 + 1 + 1 = 4, so the output is of size 4.
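The transposed calculation can be checked the same way against Examples 5 to 7. Again this is just a sketch with a made-up function name, assuming square tensors and only the kernel size, stride and padding options.

```python
def conv_transpose_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output width (or height) of a square transposed convolution."""
    return (input_size - 1) * stride - 2 * padding + (kernel_size - 1) + 1

print(conv_transpose_output_size(3, kernel_size=2, stride=2))             # Example 5: 6
print(conv_transpose_output_size(5, kernel_size=2, stride=1))             # Example 6: 6
print(conv_transpose_output_size(3, kernel_size=2, stride=2, padding=1))  # Example 7: 4
```

Example 6 uses an input of size 5 because it reverses Example 1, which turned a 6 by 6 input into a 5 by 5 output.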

On the PyTorch reference pages you can read about more general formulae, which can work with rectangular tensors and also additional configuration options we’ve not needed here.


Wednesday, 12 September 2018

Application of Neural Networks - Satellite Measurement of Water Waves

It's always great to see interesting uses of machine learning methods - and especially satisfying to see someone inspired by my book to apply the methods.

I was privileged to have an initial discussion with Dennis when he was planning on applying neural networks to the task of classifying water waveforms measured by radar from a satellite orbiting the Earth.

He went on to succeed and presented his work at a well respected conference. You can see his presentation slides here:


Satellite radar is used to measure the altitude (height) of surface features - which can be both land and water.

The signal needs to be interpreted so that:

  • we can establish if the surface is land or water
  • and if water, calculate the height of the water waves from the non-trivial signal pattern

Land or Water?

A neural network was trained to determine whether the signal was from land or water.

As you can see from the slide above, the signal signature is very different.

A neural network was very successful in detecting water. Detecting land was a little more challenging but this initial work showed great promise.

Water Wave Height

The next step is to calculate the height of the water waves. In-situ measurements were used as reference data to train a different neural network.

Part of the challenge for a neural network is that there are several peaks that can be detected during a measurement, and we want the highest peak of a wave.

Tracking a peak as it moves allows us to have a higher level of confidence in labelling it a water wave peak.


The results are promising with some areas identified for further work.

The following shows how good the calculated water wave heights are based on automatic analysis by neural networks.

The first area for improvement is detecting land where the accuracy rate is lower than it is for water.

The second area for further work is to resolve the "delay" visible in the calculated heights. This is not a major issue in this application as the height and shape are more important than the horizontal displacement / phase.

The following shows more challenging wave forms.

A good next challenge is to automate the detection of the correct peak, and neural network architectures that take into account a sequence of data - such as recurrent neural networks - can help in these scenarios.

Tuesday, 22 May 2018

Imageio.imread() Replaces Scipy.misc.imread()

Some of the code we wrote reads data from image files using a helper function scipy.misc.imread().

However, recently, users were notified that this function is deprecated:

We're encouraged to use the imageio.imread() function instead.

From imread() to imread()

The change is very easy. We first change the import statements which include the helper library.

From this:
import scipy.misc

To this:
import imageio

We then change the actual function which reads image data from files.

From this form:
img_array = scipy.misc.imread(image_file_name, flatten=True)

To this form:
img_array = imageio.imread(image_file_name, as_gray=True)


We can see the new function is used in a very similar way. We still provide the name of the image file we want to read into an array of data.

Previously we used flatten=True to convert the image pixels into a greyscale value, instead of having separate numbers for the red, green, blue and maybe alpha channels. We now use as_gray=True which does the same thing.

I thought we might have to mess about with inverting number ranges from 0-255 to 255-0 but it seems we don't need to.

Github Code Updated

The notebooks which use imread() have been updated on the main github repository.

This does mean the code is slightly different to that described in the book, but the change should be easy to understand until a new version of the book is released.

Wednesday, 16 May 2018

Online Interactive Course by

I've been really impressed with who took the content for Make Your Own Neural Network and developed a beautifully designed interactive online course.

The course breaks the content down into digestible bite-size chunks, and the interactivity is really helpful to the process of learning through hands-on experimentation and play.

Have a go!

Wednesday, 7 February 2018

Saving and Loading Neural Networks

A very common question I get is how to save a neural network, and load it again later.

Why Save and Load?

There are two key scenarios where being able to save and load a neural network is useful.

  • During a long training period it is sometimes useful to stop and continue at a later time. This might be because you're using a laptop which can't remain on all the time. It could be because you want to stop the training and test how well the neural network performs. Being able to resume training at a different time is really helpful.
  • It is useful to share your trained neural network with others. Being able to save it, and for someone else to load it, is necessary for this to work.

What Do We Save?

In a neural network the things that do the learning are the link weights. In our Python code, these are represented by matrices like wih and who. The wih matrix contains the weights for the links between the input and hidden layer, and the who matrix contains the weights for the links between the hidden and output layer.

If we save these matrices to a file, we can load them again later. That way we don't need to restart the training from the beginning.

Saving Numpy Arrays

The matrices wih and who are numpy arrays. Luckily the numpy library provides convenience functions for saving and loading them.

The function to save a numpy array is numpy.save(filename, array). This will store array in filename. If we wanted to add a method to our neuralNetwork class, we could do it simply like this:

# save neural network weights 
def save(self):
    numpy.save('saved_wih.npy', self.wih)
    numpy.save('saved_who.npy', self.who)

This will save the wih matrix as a file saved_wih.npy, and the who matrix as a file saved_who.npy.

If we want to stop the training we can issue n.save() in a notebook cell. We can then close down the notebook or even shut down the computer if we need to.

Loading Numpy Arrays

To load a numpy array we use array = numpy.load(filename). If we want to add a method to our neuralNetwork class, we should use the filenames we used to save the data.

# load neural network weights 
def load(self):
    self.wih = numpy.load('saved_wih.npy')
    self.who = numpy.load('saved_who.npy')

If we come back to our training, we need to run the notebook up to the point just before training. That means running the Python code that sets up the neural network class, and sets the various parameters like the number of input nodes, the data source filenames, etc.

We can then issue n.load() in a notebook cell to load the previously saved neural network weights back into the neural network object n.


We've kept the approach simple here, in line with our approach to learning about and coding simple neural networks. That means there are some things our very simple network saving and loading code doesn't do.

Our simple code only saves and loads the two wih and who weights matrices. It doesn't do anything else. It doesn't check that the loaded data matches the desired size of neural network. We need to make sure that if we load a saved neural network, we continue to use it with the same parameters. For example, we can't train a network, pause, and continue with different settings for the number of nodes in each layer.
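If we did want a basic safety check, one option is to compare the shapes of the loaded matrices against the network we're about to use. The following is only a sketch: the load_weights function is hypothetical, not part of the book's code, and it assumes wih has shape (hidden nodes, input nodes) and who has shape (output nodes, hidden nodes).

```python
import os
import tempfile

import numpy

def load_weights(wih_file, who_file, inodes, hnodes, onodes):
    """Load saved weights, refusing any that don't fit the requested network shape."""
    wih = numpy.load(wih_file)
    who = numpy.load(who_file)
    # assumed layout: wih is (hnodes, inodes), who is (onodes, hnodes)
    if wih.shape != (hnodes, inodes):
        raise ValueError(f"wih has shape {wih.shape}, expected {(hnodes, inodes)}")
    if who.shape != (onodes, hnodes):
        raise ValueError(f"who has shape {who.shape}, expected {(onodes, hnodes)}")
    return wih, who

# try it out with tiny made-up matrices saved to a temporary folder
folder = tempfile.mkdtemp()
wih_file = os.path.join(folder, 'saved_wih.npy')
who_file = os.path.join(folder, 'saved_who.npy')
numpy.save(wih_file, numpy.zeros((4, 3)))
numpy.save(who_file, numpy.zeros((2, 4)))

wih, who = load_weights(wih_file, who_file, inodes=3, hnodes=4, onodes=2)
print(wih.shape, who.shape)   # (4, 3) (2, 4)
```

Asking for a different number of nodes than was saved, for example inodes=5, raises a ValueError instead of silently carrying on with weights that don't fit.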

If we want to share our neural network, the people we share it with also need to be running the same Python code. The data we're passing them isn't rich enough to be independent of any particular neural network code. Efforts to develop such an open inter-operable data standard have started, for example the Open Neural Network Exchange Format.

HDF5 for Very Large Data

In some cases, with very large networks, the amount of data to be saved and loaded can be quite big. In my own experience from around 2016, the normal saving of numpy arrays in this way didn't always work. I then fell back to a slightly more involved method to save and load data using the very mature HDF5 data format, popular in science and engineering.

The Anaconda Python distribution allows you to install the h5py package, which gives Python the ability to work with HDF5 data.

HDF5 data stores do more than the simple data saving and loading. They have the idea of a group or folder which can contain several data sets, such as numpy arrays. The data stores also keep account of data set names, and don't just blindly save data. For very large data sets, the data can be traversed and segmented on-disk without having to load it all into memory before subsets are taken.

You can explore more here:

Tuesday, 23 May 2017

Learning MNIST with GPU Acceleration - A Step by Step PyTorch Tutorial

I'm often asked why I don't talk about neural network frameworks like Tensorflow, Caffe, or Theano.

Reasons for Not Using Frameworks

I avoided these frameworks because the main thing I wanted to do was to learn how neural networks actually work. That includes learning about the core concepts and the maths too. By creating our own neural networks code, from scratch, we can really start to understand them, and the issues that emerge when trying to apply them to real problems.

We don't get that learning and experience if we only learned how to use someone else's library.

Reasons for Using Frameworks - GPU Acceleration

But there are some good reasons for using such frameworks, after you've learned about how neural networks actually work.

One reason is that you want to take advantage of the special hardware in some computers, called a GPU, to accelerate the core calculations done by a neural network. The GPU - graphics processing unit - was traditionally used to accelerate calculations to support rich and intricate graphics, but recently that same special hardware has been used to accelerate machine learning.

The normal brain of a computer, the CPU, is good at doing all kinds of tasks. But if your tasks are matrix multiplications, and lots of them in parallel, for example, then a GPU can do that kind of work much faster. That's because they have lots and lots of computing cores, and very fast access to locally stored data. Nvidia has a page explaining the advantage, with a fun video too. But remember, GPUs are not good for general purpose work, they're just really fast at a few specific kinds of jobs.

The following illustrates a key difference between general purpose CPUs and GPUs with many, more task-specific, compute cores:

GPUs have hundreds of cores, compared to a CPU's 2, 4 or maybe 8.

Writing code to directly take advantage of GPUs is not fun, currently. In fact, it is extremely complex and painful. And very very unlike the joy of easy coding with Python.

This is where the neural network frameworks can help - they allow you to imagine a much simpler world - and write code in that world, which is then translated into the complex, detailed, and low-level nuts-n-bolts code that the GPUs need.

There are quite a few neural network frameworks out there .. but comparing them can be confusing. There are a few good comparisons and discussions on the web.


I'm going to use PyTorch for three main reasons:
  • It's largely vendor independent. Tensorflow has a lot of momentum and interest, but is very much a Google product. 
  • It's designed to be Python - not an ugly and ill-fitting Python wrap around something that really isn't Python. Debugging is also massively easier if what you're debugging is Python itself.
  • It's simple and light - preferring simplicity in design, working naturally with things like the ubiquitous numpy arrays, and avoiding hiding too much stuff as magic, something I really don't like.

Some more discussion of PyTorch can be found online.

Working With PyTorch

To use PyTorch, we have to understand how it wants to be worked with. This will be a little different to the normal Python and numpy world we're used to.

The main ideas are:
  • build up your network architecture using the building blocks provided by PyTorch - these are things like layers of nodes and activation functions.
  • you let PyTorch automatically work out how to back propagate the error - it can do this for any of the building blocks it provides, which is really convenient.
  • we train the network in the normal way, and measure accuracy as usual, but PyTorch provides functions for doing this.
  • to make use of the GPU, we configure a setting and push the neural network weight matrices to the GPU, and work on them there.

We shouldn't try to replicate what we did with our pure Python (and numpy) neural network code - we should work with PyTorch in the way it was designed to be used.

A key part of this is auto differentiation. Let's look at that next.

Auto Differentiation

A powerful and central part of PyTorch is the ability to create neural networks, chaining together different elements - like activation functions,  convolutions, and error functions - and for PyTorch to work out the error gradients for the various parameters we want to improve.

That's quite cool if it works!

Let's see it working. Imagine a simple parameter $y$ which depends on another input variable $x$. Imagine that

$$  y = x^2 + 5x + 2 $$

Let's encode this in PyTorch:

import torch
from torch.autograd import Variable

x = Variable(torch.Tensor([2.0]), requires_grad=True)

y = (x**2) + (5*x) + 2

Let's look at that more slowly.  First we import torch, and also the Variable from torch.autograd, the auto differentiation library. Variable is important because we need to wrap normal Python variables with it, so that PyTorch can do the differentiation. It can't do it with normal Python variables like a = 10, or b = 5*a. Variables include links to where the variables came from - so that if one depends on another, PyTorch can do the correct differentiation.

We then create x as a Variable. You can see that it is a simple tensor of trivial size, just a single number, 2.0. We also signal that it requires a gradient to be calculated.

A tensor? Think of it as just a fancy name for multi-dimensional matrices. A 2-dimensional tensor is a matrix that we're all familiar with, like numpy arrays. A 1-dimensional tensor is like a list. A 0-dimensional one is just a single number. When we create a torch.Tensor([2.0]) we're just creating a single number.

We then create the next Variable called y. That looks like a normal Python variable by the way we've created it .. but it isn't, because it is made from x, which is a PyTorch Variable. Remember, the magic that Variable brings is that when we define y in terms of x, the definition of y remembers this, so we can do proper differentiation on it with respect to x.

So let's do the differentiation!

y.backward()


That's it. That's all that is required to ask PyTorch to use what it knows about y and all the Variables it depends on to work out how to differentiate it.

Let's see if it did it correctly. Remember that $x=2$ so we're asking for

$$ \frac{dy}{dx}\Big|_{x=2} = (2x + 5)\Big|_{x=2} = 9 $$

This is how we ask for that to be done.

x.grad


Let's see how all that works out:

It works! You can also see how y is shown as type Variable, not just x.

So that's cool. And that's how we define our neural network, using elements that PyTorch provides us, so it can automatically work out error gradients.
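To make the idea of variables remembering where they came from a bit more concrete, here is a toy version in plain Python. It is emphatically not how PyTorch is implemented. The Var class is invented for illustration, and it only supports addition and multiplication, so we write x * x rather than x**2.

```python
class Var:
    """A number that remembers which Vars it was computed from."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs of (parent Var, local gradient)
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    # allow plain numbers on the left, as in 5 * x
    __radd__ = __add__
    __rmul__ = __mul__

    def backward(self, upstream=1.0):
        # add this path's contribution, then pass it back using the chain rule
        self.grad += upstream
        for parent, local_gradient in self.parents:
            parent.backward(upstream * local_gradient)

x = Var(2.0)
y = x * x + 5 * x + 2
y.backward()

print(y.value)   # 16.0
print(x.grad)    # 9.0
```

Because y keeps links back to x through each operation, calling y.backward() can follow those links and arrive at the same gradient of 9 we worked out by hand.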

Let's Describe Our Simple Neural Network

Let's look at some super-simple skeleton code which is a common starting point for many, if not all, PyTorch neural networks.

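import torch
import torch.nn

class NeuralNetwork(torch.nn.Module):

    def __init__(self):
        # the layers will be defined here
        pass

    def forward(self, inputs):
        # the network architecture will be described here;
        # as a placeholder, pass the inputs straight through
        outputs = inputs
        return outputs

net = NeuralNetwork()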

The neural network class is derived from torch.nn.Module which brings with it the machinery of a neural network including the training and querying functions - see the PyTorch documentation for torch.nn.Module.

There is a tiny bit of boilerplate code we have to add to our initialisation function __init__() .. and that's calling the initialisation of the class it was derived from. That should be the __init__() belonging to torch.nn.Module. The clean way to do this is to use super():

    def __init__(self):
        # call the base class's initialisation too
        super().__init__()

We're not finished yet. When we create an object from the NeuralNetwork class, we need to tell it at that time what shape it will be. We're sticking with a simple 3-layer design .. so we need to specify how many nodes there are at the input, hidden and output layers. Just like our pure Python example, we pass this information to the __init__() function. We might as well create these layers during the initialisation. Our __init__() now looks like this:

    def __init__(self, inodes, hnodes, onodes):
        # call the base class's initialisation too
        super().__init__()

        # define the layers and their sizes, turn off bias
        self.linear_ih = torch.nn.Linear(inodes, hnodes, bias=False)
        self.linear_ho = torch.nn.Linear(hnodes, onodes, bias=False)

        # define activation function
        self.activation = torch.nn.Sigmoid()

The nn.Linear() module is the thing that creates the relationship between one layer and another and combines the network signals in a linear way .. which is what we did in our pure Python code. Because this is PyTorch, that nn.Linear() creates a parameter that can be adjusted .. the link weights that we're familiar with. You can read more about nn.Linear() in the PyTorch documentation.

We also create the activation function we want to use, in this case the logistic sigmoid function. Note, we're using the one provided by torch.nn, not making our own.

Note that we're not using these PyTorch elements yet, we're just defining them because we have the information about the number of input, hidden and output nodes.

We have to over-ride the forward() function in our neural network class. Remember, that backward() is provided automatically, but can only work if PyTorch knows how we've designed our neural network - how many layers, what those layers are doing with activation functions, what the error function is, etc.

So let's create a simple forward() function which is the description of the network architecture. Our example will be really simple, just like the one we created with pure Python to learn the MNIST dataset.

    def forward(self, inputs_list):
        # convert list to Variable
        inputs = Variable(inputs_list)
        # combine input layer signals into hidden layer
        hidden_inputs = self.linear_ih(inputs)
        # apply sigmoid activation function
        hidden_outputs = self.activation(hidden_inputs)
        # combine hidden layer signals into output layer
        final_inputs = self.linear_ho(hidden_outputs)
        # apply sigmoid activation function
        final_outputs = self.activation(final_inputs)
        return final_outputs

You can see the first thing we do is convert the list of numbers, a Python list, into a PyTorch Variable.  We must do this, otherwise PyTorch won't be able to calculate the error gradient later.

The next section is very familiar, the combination of signals at each node, in each layer, followed immediately by the activation function. Here we're using the nn.Linear() elements we defined above, and the activation function we defined earlier, using the torch.nn.Sigmoid() provided by PyTorch.
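As a reminder of what those two building blocks actually calculate, here is the same forward pass written in plain Python for a tiny network. The weights are made up, the function names are our own, and there is no bias term, matching the bias=False choice above.

```python
import math

def sigmoid(x):
    """The logistic sigmoid activation function."""
    return 1.0 / (1.0 + math.exp(-x))

def linear(weights, signals):
    """Weighted sum of the incoming signals for each node in the next layer (no bias)."""
    return [sum(w * s for w, s in zip(row, signals)) for row in weights]

def forward(inputs, w_ih, w_ho):
    # combine input layer signals into hidden layer, then apply the activation
    hidden_outputs = [sigmoid(v) for v in linear(w_ih, inputs)]
    # combine hidden layer signals into output layer, then apply the activation
    final_outputs = [sigmoid(v) for v in linear(w_ho, hidden_outputs)]
    return final_outputs

# made-up weights for a tiny network with 3 input, 2 hidden and 1 output nodes
w_ih = [[0.5, -0.2, 0.1],
        [0.3, 0.8, -0.5]]
w_ho = [[1.0, -1.0]]

outputs = forward([1.0, 0.5, -1.0], w_ih, w_ho)
print(outputs)
```

Each row of a weight matrix holds the link weights feeding one node in the next layer, which is the same job the nn.Linear() elements do for us.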

Error Function
Now that we've defined the network, we need to define the error function. This is an important bit of information because it defines how we judge the correctness of the neural network, and that wrong-ness is used to update the internal parameters during training.

There are many error functions that people use, some better for some kinds of problems than others. We'll use the really simple one we developed for the pure Python network, the squared error function.  It looks like the following.

error_function = torch.nn.MSELoss(size_average=False)

We've set the size_average parameter to False to avoid the error function dividing by the size of the output and target vectors.
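The difference between summing and averaging is easy to see with a few made-up numbers. This plain Python sketch only mirrors the arithmetic and does not call PyTorch. The squared_error function and its average option are our own names.

```python
def squared_error(outputs, targets, average=False):
    """Sum of squared differences, optionally divided by the number of elements."""
    total = sum((o - t) ** 2 for o, t in zip(outputs, targets))
    return total / len(outputs) if average else total

outputs = [1.0, 0.0, 0.5, 0.5]
targets = [0.0, 0.0, 0.0, 0.0]

print(squared_error(outputs, targets))                 # 1.5
print(squared_error(outputs, targets, average=True))   # 0.375
```

Averaging shrinks the error, and so the gradients, by a factor equal to the vector size, which would change how far each training step moves for a given learning rate.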


We're almost there. We've just defined the error function, which means we know how far wrong the neural network is during training. We know that PyTorch can calculate the error gradients for each parameter.

When we created our simple neural network, we didn't think too much about different ways of improving the parameters based on the error function and error gradients. We simply descended down the gradients a small bit. And that is simple, and powerful.

Actually there are many refined and sophisticated approaches to doing this step. Some are designed to avoid false minimum traps, others designed to converge as quickly as possible, etc. We'll stick to the simple approach we took, and the closest in the PyTorch toolset is the stochastic gradient descent:

optimiser = torch.optim.SGD(net.parameters(), lr=0.1)

We feed this optimiser the adjustable parameters of our neural network, and we also specify the familiar learning rate as lr.
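The update rule behind this kind of simple gradient descent is small enough to write out by hand. Here is a sketch on a toy problem with one parameter, where the error is (w - 3)^2 and its gradient is 2(w - 3). The toy problem and names are made up for illustration.

```python
def sgd_step(parameters, gradients, lr):
    """Move each parameter a small step against its error gradient."""
    return [p - lr * g for p, g in zip(parameters, gradients)]

# toy problem: the error is (w - 3)**2, so the error gradient is 2 * (w - 3)
parameters = [0.0]
for _ in range(100):
    gradients = [2 * (w - 3) for w in parameters]
    parameters = sgd_step(parameters, gradients, lr=0.1)

print(parameters)   # ends up very close to [3.0]
```

Each step closes a fixed fraction of the remaining gap, so the parameter creeps towards the value that minimises the error.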

Finally, Doing the Update

Finally, we can talk about doing the update - that is, updating the neural network parameters in response to the error seen with each training example.

Here's how we do that for each training example:

  • calculate the output for a training data example
  • use the error function to calculate the difference (the loss, as people call it)
  • zero gradients of the optimiser which might be hanging around from a previous iteration
  • perform automatic differentiation to calculate new gradients
  • use the optimiser to update parameters based on these new gradients

In code this will look like:

for inputs, target in training_set:

    output = net(inputs)

    # compute the loss
    loss = error_function(output, target)

    # zero gradients, perform a backward pass, and update the weights
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
It is a common error not to zero the gradients during each iteration, so keep an eye out for that. I'm not really sure why the default is not to clear them ...
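To get a feel for why forgetting to zero the gradients matters, here is a toy in plain Python that imitates gradients accumulating from one iteration to the next. It is a made-up single-weight example, not PyTorch code. The Weight class and train function are hypothetical.

```python
class Weight:
    """A single adjustable parameter whose gradient accumulates between iterations."""

    def __init__(self, value):
        self.value = value
        self.grad = 0.0

def backward(weight, x, target):
    # gradient of the squared error (weight * x - target)**2 with respect to the weight,
    # added on top of whatever gradient is already stored
    weight.grad += 2 * (weight.value * x - target) * x

def train(zero_gradients, steps=20, lr=0.1):
    w = Weight(0.0)
    for _ in range(steps):
        if zero_gradients:
            w.grad = 0.0              # forget the previous iteration's gradient
        backward(w, x=1.0, target=2.0)
        w.value -= lr * w.grad        # optimiser step
    return w.value

print(train(zero_gradients=True))    # settles close to the ideal weight of 2.0
print(train(zero_gradients=False))   # stale gradients pile up, so the updates overshoot
```

In this toy, the run that zeroes its gradients homes in on the right weight, while the other keeps swinging past it because every update includes all the old gradients too.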

The Final Code 

Now that we have all the elements developed and understood, we can rewrite the pure python neural network we developed in the course of Make Your Own Neural Network and throughout this blog.

You can find the code as a notebook on GitHub:

The only unusual thing I had to work out was that during the evaluation of performance, we keep a scorecard list, and append a 1 to it if the network's answer matches the known correct answer from the test data set. This comparison needs the actual number to be extracted from the PyTorch tensor via numpy, as follows. We couldn't just say label == correct_label.

if ([0][0] == correct_label):

The results seem to match our pure python code for performance - no major difference, and we expected that because we've tried to architect the network to be the same.

Performance Comparison On a Laptop

Let's compare performance between our simple pure python (with numpy) code and the PyTorch version. As a reminder, here are the details of the architecture and data:

  • MNIST training data with 60,000 examples of 28x28 images
  • neural network with 3 layers: 784 nodes in input layer, 200 in hidden layer, 10 in output layer
  • learning rate of 0.1
  • stochastic gradient descent with mean squared error
  • 5 training epochs (that is, repeat training data 5 times)
  • no batching of training data

The timing was done with the following python notebook magic command in the cell that contains only the code to train the network. The options ensure only one run of the code, and the -c option ensures unix user time is used to account for other tasks taking CPU time on the same machine.

%%timeit -n1 -r1 -c

The results from doing this twice on a MacBook Pro 13 (early 2015), which has no GPU for accelerating the tensor calculations, are:

  • home-made simple pure python - 440 seconds, 458 seconds
  • simple PyTorch version - 841 seconds, 834 seconds

Amazing! Our own home-made code is about 1.9 times faster .. roughly twice as fast!

GPU Accelerated Performance

One of the key reasons we chose to invest time learning a framework like PyTorch is that it makes it easy to take advantage of GPU acceleration. So let's try it.

I don't have a laptop with a CUDA GPU so I fired up a Google Cloud Compute Instance.  The specs for mine are:

  • n1-highmem-2 (2 vCPUs, 13 GB memory)
  • Intel Sandy Bridge
  • 1 x NVIDIA Tesla K80 GPU

So that we can compare GPU results with CPU results, I ran the above code, this time not as a notebook but as a command line script, using the unix time command. This will give us the time to complete the whole program, including the training and testing stages. The results are:

real    8m14.387s
user    7m31.223s
sys     8m39.810s

The interpretation of these numbers needs some sophistication, especially if our code has multiple threads, so we'll just stick to the simple real wall-clock time of 8m14s or 494 seconds.
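As a quick check of that conversion, here is a small sketch that turns the strings printed by the time command into seconds. The helper name to_seconds is hypothetical, and it assumes the simple "XmY.YYYs" format shown above:

```python
import re

def to_seconds(stamp):
    """Convert a string such as '8m14.387s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time format: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)

# the three figures reported for the CPU run
for name, stamp in [("real", "8m14.387s"), ("user", "7m31.223s"), ("sys", "8m39.810s")]:
    print(name, round(to_seconds(stamp)))
# real 494
# user 451
# sys 520
```

Notice that user and sys added together are larger than real, which is what we'd expect if the work was spread over more than one thread.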

Now we need to change the code to run on the GPU. First check that CUDA - NVIDIA's GPU acceleration framework - is available to Python and PyTorch:

Python 3.6.0 |Anaconda custom (64-bit)| (default, Dec 23 2016, 12:22:00) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True

So CUDA is available. This gave a False on my own home laptop.

The overall approach to shifting work from the CPU to the GPU is to shift the tensors there. Here is the current (but immature) PyTorch guidance on working with the GPU. To create a Tensor on a GPU we use torch.cuda:

>>> x = torch.cuda.FloatTensor([1.0, 2.0])
>>> x

[torch.cuda.FloatTensor of size 2 (GPU 0)]

You can see that this new tensor x is created on the GPU. It is shown as GPU 0, as there can be more than one. If we perform a calculation on x, it is actually carried out on the same GPU 0, and if the results are assigned to a new variable, they are also stored on the same GPU.

>>> y = x**x
>>> y

[torch.cuda.FloatTensor of size 2 (GPU 0)]

This may not seem like much but is incredibly powerful - yet easy to use, as you've just seen.

The changes to the code are minimal:

  • we move the neural network class to the GPU once we've created it using n.cuda()
  • the inputs are still converted from a list to a PyTorch Tensor, but we now use the CUDA variant: inputs = Variable(torch.cuda.FloatTensor(inputs_list).view(1, self.inodes))
  • similarly the target outputs are also converted using this variant: target_variable = Variable(torch.cuda.FloatTensor(targets_list).view(1, self.onodes), requires_grad=False)

That's it! Not too difficult at all .. actually that took a day to work out because the PyTorch documentation isn't yet that accessible to beginners.
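For comparison, here is a minimal sketch of the same two moves written with the device style used by more recent PyTorch releases, rather than the Variable and torch.cuda.FloatTensor calls above. The small nn.Sequential model is a hypothetical stand-in for our network class, and the code falls back to the CPU when CUDA isn't available:

```python
import torch
import torch.nn as nn

# use the GPU when CUDA is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

inodes, hnodes, onodes = 784, 200, 10

# hypothetical stand-in for the neural network class, moved to the device once created
model = nn.Sequential(
    nn.Linear(inodes, hnodes),
    nn.Sigmoid(),
    nn.Linear(hnodes, onodes),
    nn.Sigmoid(),
).to(device)

# made-up input record; the tensor is created directly on the chosen device
inputs_list = [0.5] * inodes
inputs = torch.tensor(inputs_list, dtype=torch.float32, device=device).view(1, inodes)

# the forward pass runs on whichever device holds the model and the inputs
outputs = model(inputs)
print(outputs.shape, outputs.device)
```

The pattern is the same as before: the model is moved once, and every tensor fed to it is created on the same device.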

The results from the GPU enabled version of the code are:

real    6m6.328s
user    5m57.443s
sys     0m13.488s

That is faster, at 366 seconds, which is about 25% less time than the 494 seconds of the CPU run. We're seeing some encouraging results.

Let's do more runs, just to be scientific and collate the results:

run   CPU (s)  GPU (s)

1     494      366
2     483      372
3     451      355

mean  476.0    364.3

The GPU based network is consistently faster, taking roughly a quarter less time on each run.
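The averages and the relative saving can be checked with a few lines of Python, using only the three pairs of timings collated above:

```python
from statistics import mean

# wall-clock seconds for the three runs
cpu_runs = [494, 483, 451]
gpu_runs = [366, 372, 355]

cpu_mean = mean(cpu_runs)
gpu_mean = mean(gpu_runs)

# fraction of time saved by the GPU, and the equivalent speed-up factor
saving = 1 - gpu_mean / cpu_mean
speedup = cpu_mean / gpu_mean

print(f"CPU mean {cpu_mean:.1f}s, GPU mean {gpu_mean:.1f}s")
print(f"time saved {saving:.1%}, speed-up {speedup:.2f}x")
```

Three runs is a very small sample, so this is only a rough guide.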

Perhaps we expected the code to be much much faster? Well for such a small network, the overheads corrode the benefits. The GPU approach really shines for much larger networks and data.

Let's do a better experiment and compare the PyTorch code in CPU and GPU mode, varying the number of hidden layer nodes.  Here are the results:

nodes  CPU (s)  GPU (s)

200    463      362
1000   803      356
2000   1174     366
5000   3390     518

Visualising this ...

We can now see the benefit of PyTorch using the GPU. As the scale of the network grows (hidden layer nodes here), the time it takes for the GPU to complete training rises very slowly, compared to the CPU, where it rises quickly.
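To put numbers on that widening gap, this short sketch divides the CPU time by the GPU time for each hidden layer size, using the timings from the table above. The function name speedups is made up for illustration:

```python
# hidden nodes -> (CPU seconds, GPU seconds), taken from the table above
timings = {
    200: (463, 362),
    1000: (803, 356),
    2000: (1174, 366),
    5000: (3390, 518),
}

def speedups(table):
    """Return how many times faster the GPU run was for each network size."""
    return {nodes: cpu / gpu for nodes, (cpu, gpu) in table.items()}

ratios = speedups(timings)
for nodes, ratio in ratios.items():
    print(f"{nodes:>5} hidden nodes: GPU is {ratio:.1f}x faster")
```

The ratio grows from about 1.3 at 200 hidden nodes to about 6.5 at 5000.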

One more tweak .. the contributors at GitHub suggested setting an environment variable to control how many CPU threads the task is managed by. See here.  In my Google GPU instance I'll set this to OMP_NUM_THREADS=2. The resulting duration is 361 seconds .. so not much improved. We didn't see an improvement when we tried it on the CPU only code, earlier. I did see that fewer threads were being used, by using the top utility, but at these scales I didn't see a difference.
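The variable can be set in the shell before launching the script, or from Python itself. The sketch below does the latter, on the assumption that it runs before the numerical libraries are imported, since the threading runtime is expected to read the variable when it starts up. The helper name limit_threads is hypothetical:

```python
import os

def limit_threads(count):
    """Ask OpenMP-based libraries to use at most count CPU threads."""
    # assumption: this is called before importing torch or numpy
    os.environ["OMP_NUM_THREADS"] = str(count)

limit_threads(2)
print(os.environ["OMP_NUM_THREADS"])
```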

Friday, 7 April 2017

Neural Network in Forth

I love how people have been inspired to make their own neural networks in their own way, sometimes using the R or Julia programming languages.

I was very pleasantly surprised that Robin had decided to make neural networks in Forth.

Forth is an interesting language - you can read about it here, and here - it is a small, efficient and fast language, with applications often close to the metal.

You can follow Robin's progress here: