Understanding the Convolution Operation in CNN with an Example

Jan. 6, 2025 | Dr. S. Lovelyn Rose | Chief Technology Officer & Head of AI Research

The convolutional neural network (CNN) gets its name from the hidden layers that perform the convolution operation. Convolution is the basic operation in a CNN, and it helps extract spatially aware features. The layer that performs this operation is called the convolution layer.

The Convolution Layer

The convolution layer produces a summary of where each feature is present in its input. In this layer, the input is convolved with a set of weights, also called filters.

What is Convolution?

Convolution is a linear operation that computes the dot product of a local receptive field and the filter matrix. As the filter matrix slides over the input, it produces one convolved value per position, and together these values form a matrix of convolved values. Nonlinearity may then be introduced to the output of the convolution layer using a nonlinear activation function.

Convolution and Discrete Signals

Convolution is a mathematical operation borrowed from digital signal processing, used here to determine the amount of correlation between two signals. The concept extends to digital images, where it finds the correlation between two images. If one of the images is smaller, convolution determines the correlation of the smaller image with every possible window of the same size in the other image. Given a discrete signal x(n) of length N and a filter f(n) of length M, the amount of correlation between them is determined as
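A standard way to write this sum is sketched below. It is a reconstruction from the symbols used in this section (x, f, n, M), with m as an assumed summation index, and is not necessarily the exact form used in the original figure.

```latex
(x * f)[n] = \sum_{m=-M/2}^{M/2} x[n - m]\, f[m]
```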

Here, the filter is defined over the set of integers from -M/2 to M/2.

Cross-correlation

The convolution operation assumes that the filter f is flipped and then shifted across x to obtain (x*f)[n]. A related term used in this context is cross-correlation. The only difference between convolution and cross-correlation is that the filter is not flipped in cross-correlation.
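The flip can be checked numerically. The sketch below uses NumPy's `np.convolve` and `np.correlate`; the signal and filter values are illustrative assumptions and do not come from this article.

```python
import numpy as np

# Illustrative values only (assumed for this sketch).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # signal
f = np.array([1.0, 0.0, -1.0])            # filter that is not symmetric, so flipping matters

# Convolution: the filter is flipped, then slid across the signal.
conv = np.convolve(x, f, mode="valid")

# Cross-correlation: the filter is slid across the signal without flipping.
xcorr = np.correlate(x, f, mode="valid")

# Flipping the filter by hand turns cross-correlation into convolution.
xcorr_flipped = np.correlate(x, f[::-1], mode="valid")

print(conv, xcorr, xcorr_flipped)
```

With a symmetric filter the two operations would give the same result, which is one reason the distinction is easy to overlook.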

So, what we usually call convolution in a CNN is actually cross-correlation. Given a 2D input image X of size M x M and a 2D kernel W of size N x N, the operation can be generalized and written as in the following equation. The feature response at pixel position (i,j) is obtained by computing the dot product between a window of size N x N centered at pixel (i,j) and the filter.
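One common way to write this feature response is sketched below. It is a reconstruction under stated assumptions: Y denotes the feature response, a and b are summation indices, and N is odd so that the window can be centered on (i,j). The plus signs reflect the unflipped filter of cross-correlation.

```latex
Y(i, j) = \sum_{a=-\lfloor N/2 \rfloor}^{\lfloor N/2 \rfloor} \; \sum_{b=-\lfloor N/2 \rfloor}^{\lfloor N/2 \rfloor} X(i + a,\, j + b)\, W(a, b)
```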

Figure 1. Convolution Operation

Figure 1 shows the convolution between an image of size 4x4 and a filter of size 3x3. Table 1 shows the computation of the filter responses. The receptive field (the portion of the image) that contributes to each feature response is highlighted.

Table 1. Computation of filter responses using convolution operation
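The same kind of computation can be sketched in NumPy. The pixel and filter values below are illustrative assumptions, not the values from Figure 1 or Table 1, and `cross_correlate2d` is a hypothetical helper name. The sketch assumes a stride of 1 and no padding, and it indexes each window by its top-left corner rather than its center.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and take dot products."""
    h, w = image.shape
    n = kernel.shape[0]
    out = np.zeros((h - n + 1, w - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = image[i:i + n, j:j + n]      # receptive field for this output cell
            out[i, j] = np.sum(window * kernel)   # dot product with the (unflipped) filter
    return out

# Illustrative 4x4 image and 3x3 filter (assumed values).
image = np.arange(1, 17, dtype=float).reshape(4, 4)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)

response = cross_correlate2d(image, kernel)
print(response.shape)   # a 4x4 input and a 3x3 filter leave 2x2 valid positions
print(response)
```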

Understanding Convolution with an Example

This section shows how to perform convolution using filters with more than one channel. Figure 2 illustrates the filter response computation along with the terminology associated with the operation.

Figure 2. Filter response for source pixel in an input image

For the purpose of explanation, let us consider a convolutional layer that accepts a 7x7 RGB image. To apply convolution, we need a filter with the same number of channels as the input. Let us consider a 3x3 kernel with 3 channels. The color intensities of the sample image are shown in Figure 3.

Figure 3. Channel Color Intensity of sample RGB image of size 7 x 7

The 3x3x3 filter shown in Figure 4 is convolved with the input image.

Figure 4. Sample kernel of size 3 x 3

Let us see the convolution operation in detail for the first cell.

All the other cells are filled in the same way. The resulting matrix, termed the activation map, is given below.

Note that the final value at each position is a single scalar, even when multiple channels are used, because the products from all channels are summed together. Also, the dimension of the resulting matrix is 5x5, while the original image was 7x7. In general, with a stride of 1 and no padding, an M x M input and an N x N kernel produce an output of size (M - N + 1) x (M - N + 1). When the kernel size is increased from 3x3 to 5x5, 4 rows and 4 columns at the edges (2 on each side) are lost to the border effect, resulting in a matrix of size 3x3. A non-linearity is applied to the filter responses before they are processed by subsequent layers.
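The shape behaviour described above can be checked with a short NumPy sketch. The image and kernel values are random placeholders, not the values from Figures 3 and 4, and `conv_single_filter` is a hypothetical helper name. ReLU is used only as one example of a non-linearity, since the article does not name a specific activation function.

```python
import numpy as np

def conv_single_filter(image, kernel):
    """Apply one multi-channel filter. image: (H, W, C), kernel: (k, k, C); stride 1, no padding."""
    h, w, c = image.shape
    k = kernel.shape[0]
    if kernel.shape != (k, k, c):
        raise ValueError("kernel must be square and have as many channels as the image")
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Products from all channels are summed into one scalar per position.
            out[i, j] = np.sum(image[i:i + k, j:j + k, :] * kernel)
    return out

rng = np.random.default_rng(0)                               # placeholder data (assumed)
image = rng.integers(0, 256, size=(7, 7, 3)).astype(float)   # 7x7 RGB image
kernel3 = rng.integers(-1, 2, size=(3, 3, 3)).astype(float)  # 3x3x3 filter
kernel5 = rng.integers(-1, 2, size=(5, 5, 3)).astype(float)  # 5x5x3 filter

map3 = conv_single_filter(image, kernel3)
map5 = conv_single_filter(image, kernel5)
print(map3.shape, map5.shape)        # output sizes follow 7 - k + 1

activated = np.maximum(map3, 0.0)    # ReLU, chosen here only as an example non-linearity
```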
