Make Your First GAN with PyTorch is now available!
Amazon printed edition: https://www.amazon.com/dp/B085RNKXPD.
All code is on github: https://github.com/makeyourownneuralnetwork/gan
Sample pages:
Saturday, 14 March 2020
Thursday, 5 March 2020
Gradient Descent Unstable For GANs?
When training neural networks we use gradient descent to find a path down a loss function to find the combination of learnable parameters that minimise the error. This is a very well researched area and techniques today are very sophisticated, the Adam optimiser being a good example.
The dynamics of a GAN are different to a simple neural network. The generator and discriminator networks are trying to achieve opposing objectives. There are parallels between a GAN and adversarial games where one player is trying to maximise an objective while the other is trying to minimise it, each undoing the benefit of the opponent’s previous move.
Is the gradient descent method of finding the correct, or even good enough, combination of learnable parameters suitable for such adversarial games? This might seem like an unnecessary question, but the answer is rather interesting.
The following is a very simple objective function:
One player has control over the values of x and is trying to maximise the objective f. A second player has control over y and is trying to minimise the objective f.
Let’s visualise this function to get a feel for it. The following picture shows a surface plot of f = x·y from three slightly different angles.
We can see that the surface of f = x·y is a saddle. That means, along one direction the values rise then fall, but in another direction, the values fall then rise.
The following picture shows the same function from above, using colours to indicate the values of f. Also marked are the directions of increasing gradient.
If we used our intuition to find a good solution to this adversarial game, we would probably say the best answer is the middle of that saddle at (x,y) = (0,0). At this point, if one player sets x = 0, the second player can’t affect the the value of f no matter what value of y is chosen. The same applies if y = 0, no value of x can change the value of f. The actual value of f at this point is also the best compromise. Elsewhere there are as many higher values of f as there are lower, so it seems like a good compromise.
You can explore the surface interactively yourself using the math3d.org website:
Let’s now move away from intuition and work out the answer by simulating both players using gradient descent, each trying to find a good solution for themselves.
You’ll remember from Make Your Own Neural Network that parameters are adjusted by a small amount that depends on the gradient of the objective function.
The reason we have different signs in these update rules is that y is trying to minimise f by moving down the gradient, but x is trying to maximise f by moving up the gradient. That lr is the usual learning rate.
Because we know f = x·y we can write those update rules with the gradients worked out.
We can write some code to pick starting values for x and y, and then repeatedly apply these update rules to get successive x and y values.
The following shows how x and y evolve as training progresses.
We can see that the values of x and y don’t converge, but oscillate with ever greater amplitude. Trying different starting values leads to the same behaviour. Reducing the learning rate merely delays the inevitable divergence.
This is bad. It shows that gradient descent can’t find a good solution to this simple adversarial game, and even worse, the method leads to disastrous divergence.
The following picture shows x and y plotted together. We can see the values orbit around the ideal point (0,0) but run away from it.
It can be shown mathematically (see below) that the best case scenario is that (x,y) orbits in a fixed circle around the (0,0) without getting closer to it, but this is only when the update step is infinitesimally small. As soon we have a finite step size, as we do when approximate that continuous process in discrete steps, the orbit diverges.
You can explore the code which plays this adversarial game using gradient descent here:
We’ve shown that gradient descent fails to find a solution to an adversarial game with a very simple objective function. In fact, it doesn’t just fail to find a solution, it catastrophically diverges. In contrast, gradient descent used in the normal way to minimise a function is guaranteed to find a minimum, even if it isn’t the global minimum.
Does this mean GAN training will fail in general? No.
Realistic GANs with meaningful data will have much more complex loss functions, and that can reduce the chances of runaway divergence. That’s why GAN training throughout this book has worked fairly well. But this analysis does indicate why training GANs is hard, and can become chaotic. Orbiting around a good solution might also explain why some GANs seem to progress onto different modes of single-mode collapse with extended training rather than improving the quality of images themselves.
Fundamentally, gradient descent is the wrong approach for GANs, even if it works well enough in many cases. Finding optimisation techniques designed for adversarial dynamics like those in GANs is currently an open research question, with some researchers already publishing encouraging results.
Above we stated that (x,y) orbits as a circle when two players each use gradient descent to optimise f = x·y in opposite directions. Here we’ll do the maths to show why it is a circle.
Let’s look at the update rules again.
If we want to know how x and y evolve over time t, we can write:
If we take the second derivatives with respect to t, we get the following.
You may remember from school algebra that expressions of the form d2y/dt2 = - a2x have a solution the form y = sin(at) or y = cos(at). To satisfy the first derivatives above, we need x and y to be the following combination.
These describe (x,y) moving around a unit circle with angular speed lr.
The dynamics of a GAN are different to a simple neural network. The generator and discriminator networks are trying to achieve opposing objectives. There are parallels between a GAN and adversarial games where one player is trying to maximise an objective while the other is trying to minimise it, each undoing the benefit of the opponent’s previous move.
Is the gradient descent method of finding the correct, or even good enough, combination of learnable parameters suitable for such adversarial games? This might seem like an unnecessary question, but the answer is rather interesting.
Simple Adversarial Example
The following is a very simple objective function:
One player has control over the values of x and is trying to maximise the objective f. A second player has control over y and is trying to minimise the objective f.
Let’s visualise this function to get a feel for it. The following picture shows a surface plot of f = x·y from three slightly different angles.
We can see that the surface of f = x·y is a saddle. That means, along one direction the values rise then fall, but in another direction, the values fall then rise.
The following picture shows the same function from above, using colours to indicate the values of f. Also marked are the directions of increasing gradient.
If we used our intuition to find a good solution to this adversarial game, we would probably say the best answer is the middle of that saddle at (x,y) = (0,0). At this point, if one player sets x = 0, the second player can’t affect the the value of f no matter what value of y is chosen. The same applies if y = 0, no value of x can change the value of f. The actual value of f at this point is also the best compromise. Elsewhere there are as many higher values of f as there are lower, so it seems like a good compromise.
You can explore the surface interactively yourself using the math3d.org website:
Let’s now move away from intuition and work out the answer by simulating both players using gradient descent, each trying to find a good solution for themselves.
You’ll remember from Make Your Own Neural Network that parameters are adjusted by a small amount that depends on the gradient of the objective function.
The reason we have different signs in these update rules is that y is trying to minimise f by moving down the gradient, but x is trying to maximise f by moving up the gradient. That lr is the usual learning rate.
Because we know f = x·y we can write those update rules with the gradients worked out.
We can write some code to pick starting values for x and y, and then repeatedly apply these update rules to get successive x and y values.
The following shows how x and y evolve as training progresses.
We can see that the values of x and y don’t converge, but oscillate with ever greater amplitude. Trying different starting values leads to the same behaviour. Reducing the learning rate merely delays the inevitable divergence.
This is bad. It shows that gradient descent can’t find a good solution to this simple adversarial game, and even worse, the method leads to disastrous divergence.
The following picture shows x and y plotted together. We can see the values orbit around the ideal point (0,0) but run away from it.
It can be shown mathematically (see below) that the best case scenario is that (x,y) orbits in a fixed circle around the (0,0) without getting closer to it, but this is only when the update step is infinitesimally small. As soon we have a finite step size, as we do when approximate that continuous process in discrete steps, the orbit diverges.
You can explore the code which plays this adversarial game using gradient descent here:
Gradient Descent Isn’t Ideal For Adversarial Games
We’ve shown that gradient descent fails to find a solution to an adversarial game with a very simple objective function. In fact, it doesn’t just fail to find a solution, it catastrophically diverges. In contrast, gradient descent used in the normal way to minimise a function is guaranteed to find a minimum, even if it isn’t the global minimum.
Does this mean GAN training will fail in general? No.
Realistic GANs with meaningful data will have much more complex loss functions, and that can reduce the chances of runaway divergence. That’s why GAN training throughout this book has worked fairly well. But this analysis does indicate why training GANs is hard, and can become chaotic. Orbiting around a good solution might also explain why some GANs seem to progress onto different modes of single-mode collapse with extended training rather than improving the quality of images themselves.
Fundamentally, gradient descent is the wrong approach for GANs, even if it works well enough in many cases. Finding optimisation techniques designed for adversarial dynamics like those in GANs is currently an open research question, with some researchers already publishing encouraging results.
Why A Circular Orbit?
Above we stated that (x,y) orbits as a circle when two players each use gradient descent to optimise f = x·y in opposite directions. Here we’ll do the maths to show why it is a circle.
Let’s look at the update rules again.
If we want to know how x and y evolve over time t, we can write:
If we take the second derivatives with respect to t, we get the following.
You may remember from school algebra that expressions of the form d2y/dt2 = - a2x have a solution the form y = sin(at) or y = cos(at). To satisfy the first derivatives above, we need x and y to be the following combination.
These describe (x,y) moving around a unit circle with angular speed lr.
Monday, 17 February 2020
Calculating the Output Size of Convolutions and Transpose Convolutions
Convolution is common in neural networks which work with images, either as classifiers or as generators. When designing such convolutional neural networks, the shape of data emerging from each convolution layer needs to be worked out.
Here we’ll see how this can be done step-by-step with configurations of convolution that we’re likely to see working with images.
In particular, transposed convolutions are thought of as difficult to grasp. Here we’ll show that they’re not difficult at all by working though some examples which all follow a very simple recipe.
In this first simple example we apply a 2 by 2 kernel to an input of size 6 by 6, with stride 1.
The picture shows how the kernel moves along the image in steps of size 1. The areas covered by the kernel do overlap but this is not a problem. Across the top of the image, the kernel can take 5 positions, which is why the output is 5 wide. Down the image, the kernel can also take 5 positions, which is why the output is a 5 by 5 square. Easy!
The PyTorch function for this convolution is:
nn.Conv2d(in_channels, out_channels, kernel_size=2, stride=1)
This second example is the same as the previous one, but we now have a stride of 2.
We can see the kernel moves along the image in steps of size 2. This time the areas covered by the kernel don’t overlap. In fact, because the kernel size is the same as the stride, the image is covered without overlaps or gaps. The kernel can take 3 positions across and down the image, so the output is 3 by 3.
The PyTorch function for this convolution is:
nn.Conv2d(in_channels, out_channels, kernel_size=2, stride=2)
This third example is the same as the previous one, but this time we use a padding of 1.
By setting padding to 1, we extend all the image edges by 1 pixel, with values set to 0. That means the image width has grown by 2. We apply the kernel to this extended image. The picture shows the kernel can take 4 positions across the image. This is why the output is 4 by 4.
The PyTorch function for this convolution is:
nn.Conv2d(in_channels, out_channels, kernel_size=2, stride=2, padding=2)
This example illustrates the case where the chosen kernel size and stride mean it doesn’t reach the end of the image.
Here, the 2 by 2 kernel moves with a step size of 2 over the 5 by 5 image. The last column of the image is not covered by the kernel.
The easiest thing to do is to just ignore the uncovered column, and this is in fact the approach taken by many implementations, including PyTorch. That’s why the output is 2 by 2.
For medium to large images, the loss of information from the very edge of the image is rarely a problem as the meaningful content is usually in the middle of the image. Even if it wasn’t, the fraction of information lost is very small.
If we really wanted to avoid any information being lost, we’d adjust some of the option. We could add a padding to ensure no part of the input image was missed, or we could adjust the kernel and stride sizes so they matches the image size.
The transpose convolution is commonly used to expand a tensor to a larger tensor. This is the opposite of a normal convolution which is used to reduce a tensor to a smaller tensor.
In this example we use a 2 by 2 kernel again, set to stride 2, applied to a 3 by 3 input.
The process for transposed convolution has a few extra steps but is not complicated.
First we create an intermediate grid which has the original input’s cells spaced apart with a step size set to the stride. In the picture above, we can see the pink cells spaced apart with a step size of 2. The new in-between cells have value 0.
Next we extend the edges of the intermediate image with additional cells with value 0. We add the maximum amount of these so that a kernel in the top left covers one of the original cells. This is shown in the picture at the top left of the intermediate grid. If we added another ring of cells, the kernel would no longer cover the original pink cell.
Finally, the kernel is moved across this intermediate grid in step sizes of 1. This step size is always 1. The stride option is used to set how far apart the original cells are in the intermediate grid. Unlike normal convolution, here the stride is not used to decide how the kernel moves.
The kernel moving across this 7 by 7 intermediate grid gives us an output of 6 by 6.
Notice how this transformation of a 3 by 3 input to a 6 by 6 output is the opposite of Example 2 which transformed an input of size 6 by 6 to an output of size 3 by 3, using the same kernel size and stride options.
The PyTorch function for this transpose convolution is:
nn.ConvTranspose2d(in_channels, out_channels, kernel_size=2, stride=2)
In the previous example we used a stride of 2 because it is easier to see how it is used in the process. In this example we use a stride of 1.
The process is exactly the same. Because the stride is 1, the original cells are spaced apart without a gap in the intermediate grid. We then grow the intermediate grid with the maximum number of additional outer rings so that a kernel in the top left can still cover one of the original cells. We then move the kernel with step size 1 over this intermediate 7 by 7 grid to give an output of size 6 by 6.
You’ll notice this is the opposite transformation to Example 1.
The PyTorch function for this transpose convolution is:
nn.ConvTranspose2d(in_channels, out_channels, kernel_size=2, stride=1)
In this transpose convolution example we introduce padding. Unlike the normal convolution where padding is used to expand the image, here it is used to reduce it.
We have a 2 by 2 kernel with stride set to 2, and an input of size 3 by 3, and we have set padding to 1.
We create the intermediate grid just as we did in Example 5. The original cells are spaced 2 apart, and the grid is expanded so that the kernel can cover one of the original values.
The padding is set to 1, so we remove 1 ring from around the grid. This leaves the grid at size 5 by 5. Applying the kernel to this grid gives us an output of size 4 by 4.
The PyTorch function for this transpose convolution is:
nn.ConvTranspose2d(in_channels, out_channels, kernel_size=2, stride=2, padding=1)
Assuming we’re working with square shaped input, with equal width and height, the formula for calculating the output size for a convolution is:
The L-shaped brackets take the mathematical floor of the value inside them. That means the largest integer below or equal to the given value. For example, the floor of 2.3 is 2.
If we use this formula for Example 3, we have input size = 6, padding = 1, kernel size = 2. The calculation inside the floor brackets is (6 + 2 - 1 -1) /2 + 1, which is 4. The floor of 4 remains 4, which is the size of the output.
Again, assuming square shaped tensors, the formula for transposed convolution is:
Let’s try this with Example 7, where the input size = 3, stride = 2, padding = 1, kernel size = 2. The calculation is then simply 2*2 - 2 + 1 + 1 = 4, so the output is of size 4.
On the PyTorch references pages you can read about more general formulae, which can work with rectangular tensors and also additional configuration options we’ve not needed here.
Here we’ll see how this can be done step-by-step with configurations of convolution that we’re likely to see working with images.
In particular, transposed convolutions are thought of as difficult to grasp. Here we’ll show that they’re not difficult at all by working though some examples which all follow a very simple recipe.
Example 1: Convolution With Stride 1, No Padding
In this first simple example we apply a 2 by 2 kernel to an input of size 6 by 6, with stride 1.
The picture shows how the kernel moves along the image in steps of size 1. The areas covered by the kernel do overlap but this is not a problem. Across the top of the image, the kernel can take 5 positions, which is why the output is 5 wide. Down the image, the kernel can also take 5 positions, which is why the output is a 5 by 5 square. Easy!
The PyTorch function for this convolution is:
nn.Conv2d(in_channels, out_channels, kernel_size=2, stride=1)
Example 2: Convolution With Stride 2, No Padding
This second example is the same as the previous one, but we now have a stride of 2.
We can see the kernel moves along the image in steps of size 2. This time the areas covered by the kernel don’t overlap. In fact, because the kernel size is the same as the stride, the image is covered without overlaps or gaps. The kernel can take 3 positions across and down the image, so the output is 3 by 3.
The PyTorch function for this convolution is:
nn.Conv2d(in_channels, out_channels, kernel_size=2, stride=2)
Example 3: Convolution With Stride 2, With Padding
This third example is the same as the previous one, but this time we use a padding of 1.
By setting padding to 1, we extend all the image edges by 1 pixel, with values set to 0. That means the image width has grown by 2. We apply the kernel to this extended image. The picture shows the kernel can take 4 positions across the image. This is why the output is 4 by 4.
The PyTorch function for this convolution is:
nn.Conv2d(in_channels, out_channels, kernel_size=2, stride=2, padding=2)
Example 4: Convolution With Coverage Gaps
This example illustrates the case where the chosen kernel size and stride mean it doesn’t reach the end of the image.
Here, the 2 by 2 kernel moves with a step size of 2 over the 5 by 5 image. The last column of the image is not covered by the kernel.
The easiest thing to do is to just ignore the uncovered column, and this is in fact the approach taken by many implementations, including PyTorch. That’s why the output is 2 by 2.
For medium to large images, the loss of information from the very edge of the image is rarely a problem as the meaningful content is usually in the middle of the image. Even if it wasn’t, the fraction of information lost is very small.
If we really wanted to avoid any information being lost, we’d adjust some of the option. We could add a padding to ensure no part of the input image was missed, or we could adjust the kernel and stride sizes so they matches the image size.
Example 5: Transpose Convolution With Stride 2, No Padding
The transpose convolution is commonly used to expand a tensor to a larger tensor. This is the opposite of a normal convolution which is used to reduce a tensor to a smaller tensor.
In this example we use a 2 by 2 kernel again, set to stride 2, applied to a 3 by 3 input.
The process for transposed convolution has a few extra steps but is not complicated.
First we create an intermediate grid which has the original input’s cells spaced apart with a step size set to the stride. In the picture above, we can see the pink cells spaced apart with a step size of 2. The new in-between cells have value 0.
Next we extend the edges of the intermediate image with additional cells with value 0. We add the maximum amount of these so that a kernel in the top left covers one of the original cells. This is shown in the picture at the top left of the intermediate grid. If we added another ring of cells, the kernel would no longer cover the original pink cell.
Finally, the kernel is moved across this intermediate grid in step sizes of 1. This step size is always 1. The stride option is used to set how far apart the original cells are in the intermediate grid. Unlike normal convolution, here the stride is not used to decide how the kernel moves.
The kernel moving across this 7 by 7 intermediate grid gives us an output of 6 by 6.
Notice how this transformation of a 3 by 3 input to a 6 by 6 output is the opposite of Example 2 which transformed an input of size 6 by 6 to an output of size 3 by 3, using the same kernel size and stride options.
The PyTorch function for this transpose convolution is:
nn.ConvTranspose2d(in_channels, out_channels, kernel_size=2, stride=2)
Example 6: Transpose Convolution With Stride 1, No Padding
In the previous example we used a stride of 2 because it is easier to see how it is used in the process. In this example we use a stride of 1.
The process is exactly the same. Because the stride is 1, the original cells are spaced apart without a gap in the intermediate grid. We then grow the intermediate grid with the maximum number of additional outer rings so that a kernel in the top left can still cover one of the original cells. We then move the kernel with step size 1 over this intermediate 7 by 7 grid to give an output of size 6 by 6.
You’ll notice this is the opposite transformation to Example 1.
The PyTorch function for this transpose convolution is:
nn.ConvTranspose2d(in_channels, out_channels, kernel_size=2, stride=1)
Example 7: Transpose Convolution With Stride 2, With Padding
In this transpose convolution example we introduce padding. Unlike the normal convolution where padding is used to expand the image, here it is used to reduce it.
We have a 2 by 2 kernel with stride set to 2, and an input of size 3 by 3, and we have set padding to 1.
We create the intermediate grid just as we did in Example 5. The original cells are spaced 2 apart, and the grid is expanded so that the kernel can cover one of the original values.
The padding is set to 1, so we remove 1 ring from around the grid. This leaves the grid at size 5 by 5. Applying the kernel to this grid gives us an output of size 4 by 4.
The PyTorch function for this transpose convolution is:
nn.ConvTranspose2d(in_channels, out_channels, kernel_size=2, stride=2, padding=1)
Calculating Output Sizes
Assuming we’re working with square shaped input, with equal width and height, the formula for calculating the output size for a convolution is:
The L-shaped brackets take the mathematical floor of the value inside them. That means the largest integer below or equal to the given value. For example, the floor of 2.3 is 2.
If we use this formula for Example 3, we have input size = 6, padding = 1, kernel size = 2. The calculation inside the floor brackets is (6 + 2 - 1 -1) /2 + 1, which is 4. The floor of 4 remains 4, which is the size of the output.
Again, assuming square shaped tensors, the formula for transposed convolution is:
Let’s try this with Example 7, where the input size = 3, stride = 2, padding = 1, kernel size = 2. The calculation is then simply 2*2 - 2 + 1 + 1 = 4, so the output is of size 4.
On the PyTorch references pages you can read about more general formulae, which can work with rectangular tensors and also additional configuration options we’ve not needed here.
- nn.ConvTranspose2d https://pytorch.org/docs/stable/nn.html#convtranspose2d
More Reading
- Convolutional neural networks: https://en.wikipedia.org/wiki/Convolutional_neural_network
- Convolutions in image classification and generation: http://makeyourownalgorithmicart.blogspot.com/2019/06/generative-adversarial-networks-part-iv.html
Subscribe to:
Posts (Atom)