Converting FC Layers to Conv Layers
Motivation
Technology is remarkable when it inspires other technologies, and the field of computer vision has done this time and again. The simple idea of convolving 2D data such as an image with another 2D matrix called a kernel, with some stride and padding, changed computer vision forever and opened up many new possibilities.
One such possibility is for computers to understand what is in an image, scene or frame. CNNs revolutionized the image classification task, but their real power lies in the fact that they retain spatial information, unlike a dense neural network. That spatial information can be modified, manipulated and interpolated to understand more from an image than just a class label.
Semantic segmentation is one such aspect of computer vision and image processing in which every pixel has a story to convey. We humans can judge and infer a great deal of spatial information just by looking at an image. Semantic segmentation opened this door for computers too, enabling advances in technologies like autonomous vehicles.
From Image Classification to Semantic Segmentation
One of the key ideas you will come across when learning semantic segmentation is the Fully Convolutional Network (FCN). In an image classification task, the output of the convolution blocks is usually flattened and passed through a few fully connected layers that produce the classification result, and the spatial information retained up to that point is lost. In classification, the convolution operation serves to reduce the number of features fed to the dense fully connected network; in other words, convolution downsamples 2D data such as an image. But, as mentioned above, the spatial information discarded by the FC layers could be used to extract further insight.
A Convolutional Neural Network used for image classification is usually termed a standard or vanilla CNN. Here's a nice blog to understand the underlying principles of CNNs.
You can also find my implementation of image classification using a CNN here.
Fully Connected to Fully Convolutional Network
Thus each fully connected (FC) layer has to be converted to an equivalent Conv layer. Let's first see what happens after one convolution operation.
Here, an input volume of shape (5x5x3) is convolved with a kernel of size (3x3), with zero padding (p = 0) and a stride of one. The resulting output volume is of shape (3x3x20), assuming 20 filters were used. Thus the image is downsampled while spatial information is retained.
The resulting output volume can be given by this formula:
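With input size n, filter size f, padding p and stride s (the same notation used later in this post), the standard relation is:

$$ n_{out} = \left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1 $$

Plugging in the example above gives (5 + 2·0 − 3)/1 + 1 = 3, so the output is 3×3 spatially, with depth equal to the number of filters (20 here).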
It is worth noting that the only difference between FC and CONV layers is that the neurons in the CONV layer are connected only to a local region in the input and that many of the neurons in a CONV volume share parameters. However, the neurons in both layers still compute dot products, so their functional form is identical.
Thus, to convert an FC layer to a Conv layer, set the filter size (f) to be exactly the spatial size of the input volume; the output will then be equivalent to that of the FC layer. To understand this, let's look at a practical example.
Understanding with an Example
Let’s take a sequential VGG16 network pretrained on images of shape (224x224x3).
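A minimal sketch of loading it with Keras (assuming tensorflow.keras and ImageNet weights) could look like this:

```python
from tensorflow.keras.applications import VGG16

# Load VGG16 pretrained on ImageNet with its fully connected top,
# which expects inputs of shape (224, 224, 3)
model = VGG16(weights='imagenet', include_top=True)
model.summary()
```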
As we can see, the last three layers are fully connected (FC) layers, namely 'fc1', 'fc2' and 'predictions'. The input of shape (224, 224, 3) gets downsampled to a (7, 7, 512) convolutional feature volume by a sequence of convolution, activation and pooling layers. That volume is then flattened by a layer named 'flatten', and the stored spatial information gets destroyed at this layer.
Let us take the convolutional layer which outputs a volume of size 7×7×512, followed by an FC layer with 4096 neurons, i.e. the output of the FC layer is 1×4096 for a single image. This can be equivalently expressed as a Conv layer with f=7, p=0, s=1 and nc=4096. Doing the maths with the formula above, (7 + 0 − 7)/1 + 1 = 1, so we end up with a volume of shape 1×1×4096. Likewise the other two FC layers can be converted to equivalent Conv layers.
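As a quick sanity check (a sketch added here, not part of the original example), we can compare the parameter counts of the two formulations on a 7×7×512 input:

```python
from tensorflow.keras import layers, models

# FC formulation: flatten the 7x7x512 volume, then a Dense layer with 4096 units
fc = models.Sequential([
    layers.Input(shape=(7, 7, 512)),
    layers.Flatten(),
    layers.Dense(4096),
])

# Conv formulation: f=7, p=0 ('valid'), s=1, nc=4096 filters
conv = models.Sequential([
    layers.Input(shape=(7, 7, 512)),
    layers.Conv2D(4096, (7, 7), strides=1, padding='valid'),
])

# Both should report 7*7*512*4096 + 4096 = 102,764,544 parameters
print(fc.count_params(), conv.count_params())
```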
Build a Fully Convolutional Network
A fully convolutional network can be built by simply replacing the FC layers with their equivalent Conv layers. In the VGG16 example we can do so by first removing the last four layers (the flatten layer and the three FC layers). One way to do this is to pop layers from the model stack, where each pop removes the last layer, as sketched below.
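Here is a hedged sketch of this step, using the functional API to cut the model at the last pooling layer rather than literally calling pop(), and assuming the `model` loaded above:

```python
from tensorflow.keras.models import Model

# Rebuild the model up to 'block5_pool', the last pooling layer, which
# outputs the 7x7x512 volume. This has the same effect as popping the
# last four layers (flatten, fc1, fc2, predictions) one by one.
truncated = Model(inputs=model.input,
                  outputs=model.get_layer('block5_pool').output)
truncated.summary()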
In order to add the equivalent layers, we can use the Keras functional API, as in the sketch below.
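Continuing the sketch, the three FC layers can be replaced by their Conv equivalents (the layer names below are illustrative, not from the original post):

```python
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.models import Model

# 'fc1' (4096 units)         -> Conv2D with a 7x7 kernel (f=7, p=0, s=1)
x = Conv2D(4096, (7, 7), activation='relu', name='fc1_conv')(truncated.output)
# 'fc2' (4096 units)         -> 1x1 convolution
x = Conv2D(4096, (1, 1), activation='relu', name='fc2_conv')(x)
# 'predictions' (1000 ways)  -> 1x1 convolution with softmax
x = Conv2D(1000, (1, 1), activation='softmax', name='predictions_conv')(x)

fcn = Model(inputs=truncated.input, outputs=x)
```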
We can observe that the total number of parameters remains the same as in VGG16.
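A quick way to check this, continuing the same sketch:

```python
# Both counts should be identical (roughly 138M parameters for VGG16 with its top)
print(model.count_params())
print(fcn.count_params())
```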
Conclusion
This blog talks about the essence of fully convolutional networks without going into too much detail, and shows how to implement one. Semantic segmentation with an FCN will be the follow-up blog. With an understanding of FCNs and how to actually implement them, we can move forward to a better understanding of semantic segmentation and its implementation.