Understanding Convolutional Neural Networks (CNNs)

In this article, we'll discuss CNNs mainly from a theoretical angle. If you'd like to get your hands dirty with full implementations, look forward to my future articles on CNNs and their applications.

So let's get started!

What is a convolutional neural network?

A convolutional neural network is a class of deep neural networks that is mainly used for analyzing visual imagery.

In practice, many real-world examples have shown that CNNs perform very well on visual data (in this case, image and video understanding), generally better than other classes of neural networks.

Applications of CNNs

Some applications of CNNs are image recognition (which includes image classification, object detection, image segmentation and instance segmentation), audio classification, video classification and text classification (this is a rare case though).

Why you should use CNNs for visual imagery and not regular neural networks

A regular feed-forward neural network receives an input (as a single vector) and transforms it through a series of hidden layers. Each layer is made up of a set of neurons, where each neuron is fully connected to all the neurons in the previous layer.

Visual recognition is a very simple task for humans thanks to our cognitive abilities. This is not true for computers, which only process numbers. To a computer, an image is a grid of pixels, and each pixel value (per channel, for an RGB or colored image) ranges from 0 to 255. A grid of pixels is a matrix with rows and columns.

A grayscale image is a two-dimensional object (width and height), while a colored image is a three-dimensional object: width, height and channels (RGB in this case).

So, in the case of feed-forward neural networks (FFNNs), the N-dimensional pixel representation of the input image is flattened to a column vector before being passed through the network, because FFNNs only take vectors as input. The drawbacks of using a regular NN are:

Lots of parameters and easy to overfit.
Loss of spatial and temporal dependencies in an image during flattening.
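To put a rough number on the first point, here is a small sketch that counts parameters for a fully connected layer on a flattened image versus a convolutional layer. The 3x128x128 image and the 16 filters of size 4x4 mirror the worked example later in this article; the 100 hidden neurons are an assumption made purely for illustration.

```python
def dense_params(n_inputs, n_neurons):
    """Weights plus one bias per neuron in a fully connected layer."""
    return n_inputs * n_neurons + n_neurons


def conv_params(in_channels, n_filters, filter_size):
    """Weights plus one bias per filter in a convolutional layer."""
    return in_channels * filter_size * filter_size * n_filters + n_filters


channels, height, width = 3, 128, 128
flattened = channels * height * width      # length of the flattened input vector

# Assumed layer sizes: 100 hidden neurons vs. 16 filters of size 4x4.
fc_count = dense_params(flattened, 100)
conv_count = conv_params(channels, 16, 4)

print(f"fully connected: {fc_count:,} parameters")
print(f"convolutional:   {conv_count:,} parameters")
```

The convolutional layer needs far fewer parameters because each filter is reused at every location instead of having one weight per pixel.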

CNNs (or ConvNets) can look at an image as a whole and capture the spatial and temporal dependencies.

ConvNets base their assumptions on the properties of natural images:

Locality : Nearby pixels are more strongly correlated.

Translation Invariance: Meaningful patterns can occur anywhere in the image.

Weight Sharing: The same network parameters are used to detect local patterns at many locations in the image.

Topographical structure: Each element in the picture is composed of other, smaller elements.

Hierarchy: Local low-level features are composed into larger, more abstract features.

Components of a Convolutional Network

Convolutional Layers

Convolutional layers are made of filters (a.k.a. weights or kernels) which convolve across the input to extract features. The filters are randomly initialized and learn to pick out meaningful features from the input.

Basically, each filter represents a feature, and the same filter is applied at every position of the input to extract that feature, hence parameter sharing. In summary, ConvNets learn filters that give an optimal representation of the input data. The figure below shows how the convolution (or cross-correlation) operation is done.

Stride: this is the step size with which the filters slide across the input during the cross-correlation operation. Strides of size 1 or 2 are commonly used in practice.
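As a rough sketch of the sliding operation described above, here is a plain-Python cross-correlation with a configurable stride. The function name `cross_correlate2d`, the toy image and the hand-written edge kernel are all made up for illustration; in a real ConvNet the kernel values are learned, and the sketch assumes a square kernel with no padding.

```python
def cross_correlate2d(image, kernel, stride=1):
    """Slide a square `kernel` over `image` (lists of lists), no padding."""
    f = len(kernel)
    out_h = (len(image) - f) // stride + 1
    out_w = (len(image[0]) - f) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            # Multiply the kernel with the window it currently covers and sum.
            for a in range(f):
                for b in range(f):
                    total += image[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output


# Toy image: dark left half, bright right half.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-written kernel that responds to a dark-to-bright vertical edge.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = cross_correlate2d(image, kernel, stride=1)
strided_map = cross_correlate2d(image, kernel, stride=2)
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
print(strided_map)  # [[0, 0], [0, 0]]
```

With a stride of 1 the same four weights pick up the edge in every row. With a stride of 2 the windows skip over the edge entirely, which shows how a larger stride trades detail for a smaller output.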

Padding: this means adding values around all sides of the input, typically so that the filters fit and the output volume has whole-number dimensions. For example, zero padding means zeros are added around the input. The padding types are:

Valid padding: This basically implies no padding; the filters use only the exact values of the input.

Same padding: values (often zeros) are padded evenly around the input so that, with a stride of 1, the output keeps the same spatial size as the input. This type of padding is commonly used in practice because every input value, including those at the borders, gets processed by the filters.
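Here is a minimal sketch of zero padding on a 2D grid. The helper name `zero_pad` and the 3x3 input are assumptions for illustration only.

```python
def zero_pad(image, p):
    """Add `p` rows/columns of zeros on every side of a 2D list."""
    width = len(image[0]) + 2 * p
    padded = [[0] * width for _ in range(p)]            # top border
    for row in image:
        padded.append([0] * p + list(row) + [0] * p)    # left and right borders
    padded.extend([0] * width for _ in range(p))        # bottom border
    return padded


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

padded = zero_pad(image, 1)
for row in padded:
    print(row)
```

With a 3x3 filter and a stride of 1, the unpadded (valid) input would give a 1x1 output, while the input padded with P=1 gives a 3x3 output, the same size as the original.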

Below is the formula for calculating the output spatial dimensions of any ConvNet layer (that is, the input dimensions of the next layer):

W_out = (W_in + 2P - F) / S + 1

H_out = (H_in + 2P - F) / S + 1

W_in is the width of the input,

H_in is the height of the input,

F is the size of filters,

S is the Stride,

P is the amount of zero padding.

Note: the number of filters or kernels used in the convolution corresponds to the depth received by the next layer as input.

For example, given a colored image of dimension 3x128x128, filters of size 4x4, a stride of S=2 and zero padding of P=1, if we decide to use 16 filters at the first layer, the dimension received by the next layer will be:

W = (128 + 2x1 - 4) / 2 + 1 = 64

H = (128 + 2x1 - 4) / 2 + 1 = 64

The input dimension on the next layer will be 16x64x64.
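The worked example above can be checked with a few lines of Python. This is only a sketch: the function names are made up here, and raising an error when the filter does not tile the input evenly is an assumption about how one might handle that case.

```python
def conv_output_size(size, filter_size, stride, padding):
    """Output width or height of a convolution: (size + 2P - F) / S + 1."""
    span = size + 2 * padding - filter_size
    if span % stride != 0:
        raise ValueError("filter does not tile the padded input evenly")
    return span // stride + 1


def conv_output_shape(in_shape, n_filters, filter_size, stride, padding):
    """Shape (depth, height, width) passed on to the next layer."""
    _, height, width = in_shape
    return (
        n_filters,  # the depth equals the number of filters
        conv_output_size(height, filter_size, stride, padding),
        conv_output_size(width, filter_size, stride, padding),
    )


shape = conv_output_shape((3, 128, 128), n_filters=16,
                          filter_size=4, stride=2, padding=1)
print(shape)  # (16, 64, 64)
```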

Activation Functions

An activation function adds non-linearity to the network. Without it, a stack of layers would collapse into a single linear transformation; with it, a neural network can approximate a very broad class of functions (Universal Approximation Theorem). Some of the commonly used activation functions are softmax, ReLU, Leaky ReLU, hyperbolic tangent (tanh), Maxout, etc.
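For reference, here is a minimal sketch of a few of these functions in plain Python. The default slope of 0.01 for Leaky ReLU is a common choice, not something fixed by this article, and Maxout is left out because it needs several linear units per output.

```python
import math


def relu(x):
    """Keep positive values, clamp negatives to zero."""
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    """Like ReLU, but lets a small fraction of negative values through."""
    return x if x > 0 else slope * x


def tanh(x):
    """Hyperbolic tangent, squashes values into (-1, 1)."""
    return math.tanh(x)


def softmax(values):
    """Turn a list of scores into probabilities that sum to 1."""
    largest = max(values)                      # subtract the max for numerical stability
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


print(relu(-2.0), relu(3.0))
print(leaky_relu(-2.0))
print(tanh(0.0))
print(softmax([1.0, 2.0, 3.0]))
```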

Pooling Layers

Pooling layers basically down-sample, or reduce the spatial dimensions of, the feature map. A feature map is the result of convolving the filters over an input. Pooling is done for two purposes:

To reduce redundant feature information in the feature map.
To reduce computational cost.

Pooling operations can take either the maximum value or the average value of each window.
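A small sketch of both pooling variants, assuming a 2x2 window with a stride of 2 (a common setting, but an assumption here). The function name `pool2d` and the toy feature map are made up for illustration.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Down-sample a 2D list by taking the max or average of each window."""
    if mode not in ("max", "avg"):
        raise ValueError("mode must be 'max' or 'avg'")
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [
                feature_map[i * stride + a][j * stride + b]
                for a in range(size)
                for b in range(size)
            ]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [9, 2, 1, 0],
    [3, 4, 5, 6],
]

max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="avg")
print(max_pooled)  # [[7, 8], [9, 6]]
print(avg_pooled)  # [[4.0, 5.0], [4.5, 3.0]]
```

Either way, the 4x4 feature map shrinks to 2x2, so later layers have a quarter of the values to process.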

[Figure: max pooling and average pooling]

Batch Normalization Layers

The original paper says it "reduces the internal covariate shift", that is, the change in the distribution of each layer's inputs as the earlier layers are updated during training. Batch normalization is done by standardizing (mean=0, std=1) the activations of the previous layer over each mini-batch.

Batch normalization helps CNNs train faster and better.
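The standardization step can be sketched as follows for a single feature across a mini-batch. This is a simplification: real batch normalization layers also apply a learnable scale and shift afterwards and keep running statistics for inference, which are left out here. The `eps` value is an assumed small constant that avoids division by zero.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Standardize one feature across a mini-batch (a list of floats)."""
    n = len(batch)
    mean = sum(batch) / n
    variance = sum((x - mean) ** 2 for x in batch) / n
    return [(x - mean) / math.sqrt(variance + eps) for x in batch]


# Hypothetical activations of one neuron for a mini-batch of four samples.
activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)

new_mean = sum(normalized) / len(normalized)
new_std = math.sqrt(sum((x - new_mean) ** 2 for x in normalized) / len(normalized))
print(normalized)
print(new_mean, new_std)  # close to 0 and 1
```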

Other types of normalization include:

Group Normalization: This is often used in image/object detection tasks. It also performs well in classification tasks (and is often recommended) because it's more stable and theoretically simpler.

Layer Normalization and Instance Normalization: These are heavily used in language problems.

Fully Connected Layers

The output of the convolutional layers is flattened and passed into a regular neural network for classification or regression, depending on the task.

The standard spatial convolutional neural network structure is:

INPUT=>CONV LAYER=>RELU=>POOL LAYER … => FCN.
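To see how the pieces fit together, here is a sketch that only tracks shapes through one CONV => RELU => POOL block followed by flattening for the fully connected part. It reuses the numbers from the earlier worked example; the 2x2 pooling with a stride of 2 is an assumption, and the helper names are made up.

```python
def conv_shape(shape, n_filters, filter_size, stride, padding):
    """Shape after a convolutional layer, input given as (depth, height, width)."""
    _, h, w = shape
    return (
        n_filters,
        (h + 2 * padding - filter_size) // stride + 1,
        (w + 2 * padding - filter_size) // stride + 1,
    )


def pool_shape(shape, size=2, stride=2):
    """Shape after a pooling layer; the depth is left unchanged."""
    c, h, w = shape
    return (c, (h - size) // stride + 1, (w - size) // stride + 1)


def flatten_size(shape):
    """Length of the vector handed to the fully connected layers."""
    c, h, w = shape
    return c * h * w


shape = (3, 128, 128)                      # INPUT
shape = conv_shape(shape, 16, 4, 2, 1)     # CONV LAYER -> (16, 64, 64)
# RELU is applied element-wise, so the shape does not change.
shape = pool_shape(shape)                  # POOL LAYER -> (16, 32, 32)
fc_inputs = flatten_size(shape)            # flattened input to the FC part

print(shape, fc_inputs)  # (16, 32, 32) 16384
```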

I hope you've learnt something.

Thanks for reading!

Below are some resources to learn more:

cs231n.github.io/convolutional-networks/#case

github.com/madewithml/basics

atcold.github.io/pytorch-Deep-Learning