Translation invariance and CNNs

Ajay krishnan
7 min read · Mar 25, 2022

Translation invariance is a result of the pooling operation.

Before we start, make sure you understand how convolution layers extract features from images. You can check out the Convolution 101 tutorial, where we discussed the big picture behind powerful image models, built an edge detector from scratch using TensorFlow, and displayed the output of the convolution (the feature map).

How a computer sees images

An image is made of pixels. The more pixels an image has, the sharper and clearer it becomes. The problem is that computers don’t understand colors; they only understand numbers. A computer therefore uses a color model to represent colors. The most commonly used color model is RGB, where each pixel in an image is represented as a mix of red, green, and blue.

With the RGB color model, the color information in each channel is stored in 8 bits. An 8-bit number can store values from 0 to 255. Since there are 3 channels and each channel stores its color information in 8 bits, it takes 24 bits in total to store the color information of each pixel.

In contrast, grayscale images contain pixels with just one value: the light intensity. Here 0 means black, 255 means white, and every value in between represents a shade of gray.

Since a computer only understands numbers, every pixel of a color image is represented by three numbers, corresponding to the amounts of red, green, and blue present in that pixel.
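
As a minimal sketch of this representation (the pixel values here are made up purely for illustration), a color image is just a 3-D array of numbers:

import numpy as np

# a 2x2 color image: each pixel is [red, green, blue], 8 bits per channel
tiny_image = np.array([
    [[255,   0,   0], [  0, 255,   0]],   # a red pixel, a green pixel
    [[  0,   0, 255], [255, 255, 255]],   # a blue pixel, a white pixel
], dtype=np.uint8)

print(tiny_image.shape)  # (2, 2, 3): height, width, color channels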

Why Convolutional Neural Networks?

The traditional machine learning approach consists of fully connected dense layers, for example a Multi-Layer Perceptron (MLP), which effectively compares images pixel by pixel.

Traditional machine learning methods do not handle translations well.

To a computer, these are two different images.

The traditional pixel-by-pixel comparison is not robust to such translations. The computer sees different information at corresponding pixel locations when we give it an image and a flipped version of the same image, so it concludes that they are different images.

The objects inside an image are often the same under different spatial translations, so we need a model that is robust to these translations. This is why we use CNNs in the image domain.

Translation invariance makes CNNs powerful models for image processing. Invariance to translation means that the model produces (approximately) the same response regardless of how its input is shifted.

Mathematics behind convolution

The main component of convolution for feature extraction is a filter, also called a kernel. We slide the filter over the image, computing a weighted sum at each position.

[Stanford lectures link]

The kernel is simply a filter used to extract features from images: a matrix that moves over the image, computing a weighted sum of the data inside its receptive field.

The kernel moves over the input data by the stride value. If stride=1, the kernel moves one pixel at a time; if stride=2, it moves two pixels at a time. In short, the kernel is used to extract low-level features like edges, corners, and curves from the image.
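
To make the sliding weighted sum concrete, here is a minimal NumPy sketch of the operation with stride 1 and no padding (convolve2d is a hypothetical helper for illustration; note that deep learning libraries actually compute this cross-correlation and call it convolution):

import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image`, computing a weighted sum at each step."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # weighted sum over the kernel's current receptive field
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out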

Let’s define a simple array to act as an image, and another array to act as the kernel, to see the mathematics behind the scenes.

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

image = np.array([
    [0, 0, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0],
])

# this kernel extracts vertical lines from an image
vertical = np.array([
    [1, -1],
    [1, -1],
])
The image contains a vertical line at its center; the kernel we defined can extract this vertical line from the image.
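
The snippets below call a get_filter helper from the article’s companion notebook, which isn’t shown in the post. A plausible minimal reconstruction, assuming it simply reshapes the 2-D kernel into the [height, width, in_channels, out_channels] filter format that tf.nn.conv2d expects, would be:

def get_filter(kernel, is_color=True):
    # hypothetical reconstruction of the notebook's helper:
    # reshape a 2-D kernel to conv2d's [height, width, in_channels, out_channels]
    kernel = tf.cast(kernel, tf.float32)
    kernel = tf.reshape(kernel, [*kernel.shape, 1, 1])
    # tile the kernel across the 3 color channels for color inputs
    return tf.tile(kernel, [1, 1, 3 if is_color else 1, 1])
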
# let's see convolution in action
image = tf.cast(image, tf.float32)
# add a batch dimension and a color channel: [1, height, width, 1]
image = tf.reshape(image, shape=[1, *image.shape, 1])

# CONVOLUTION
convolution = tf.nn.conv2d(image,
                           filters=get_filter(vertical, is_color=False),
                           strides=1,
                           padding='SAME')
Output of convolution

We can apply the ReLU activation function to isolate the pattern (the vertical line) extracted by the convolution operation (sliding the kernel over the image and computing weighted sums).

## CONVOLUTION: extract patterns in the image
plt.subplot(1, 2, 1)
plt.imshow(tf.squeeze(convolution))
plt.axis('off')

## ACTIVATION FUNCTION: isolate the patterns
plt.subplot(1, 2, 2)
plt.imshow(tf.squeeze(tf.nn.relu(convolution)))
plt.axis('off')

plt.show()
[ left ] Convolution output → tf.nn.relu( convolution output ) [ right ]

Pooling: summarizing the feature map

Notice that after applying the ReLU activation function, the feature map ends up with a lot of black area, that is, areas containing only 0s.

These black areas don’t tell us much about the patterns we are extracting; they do, however, encode the position of the pattern within the feature map.

What we would like to do is summarize the feature map to retain only the most useful part: the feature itself. This is exactly what pooling layers do; they take a window of the feature map and replace the activations within the window by a summary.

Max pooling

Max pooling takes a window of the feature map and replaces it with the maximum activation in that window.
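
As a tiny worked example (the feature map values here are made up), 2×2 max pooling with stride 2 keeps only the largest activation in each window, halving each spatial dimension:

import tensorflow as tf

fmap = tf.constant([[1., 3., 2., 0.],
                    [4., 2., 1., 1.],
                    [0., 1., 5., 2.],
                    [2., 0., 1., 3.]])
# add batch and channel dimensions: [1, 4, 4, 1]
fmap = tf.reshape(fmap, [1, 4, 4, 1])

pooled = tf.nn.max_pool2d(fmap, ksize=2, strides=2, padding='SAME')
print(tf.squeeze(pooled).numpy())
# [[4. 2.]
#  [2. 5.]]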

from skimage.draw import circle_perimeter

circle = np.zeros(shape=(64, 64))
rr, cc = circle_perimeter(32, 32, 8, shape=(64, 64))
circle[rr, cc] = 1

plt.figure(figsize=(15, 15))
plt.subplot(1, 5, 1)
plt.imshow(circle)
plt.axis('off')
plt.title(circle.shape)

# reshape to add a batch dimension and a color channel
circle_ = tf.reshape(circle, shape=(1, *circle.shape, 1))

for i in range(4):
    plt.subplot(1, 5, i + 2)
    # max pooling halves each spatial dimension at every step
    circle_ = tf.nn.max_pool2d(circle_,
                               ksize=2,
                               strides=2,
                               padding='SAME')
    plt.imshow(tf.squeeze(circle_))
    plt.axis('off')
    plt.title(tf.squeeze(circle_).shape)

plt.show()
MAX POOLING OPERATION

Average pooling

Average pooling takes a window of the feature map and replaces it with the average activation in that window.
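
Continuing with the same made-up 4×4 feature map from the max pooling sketch above, average pooling reports the mean of each window instead of its maximum:

# fmap is the same [1, 4, 4, 1] tensor defined in the max pooling sketch
avg_pooled = tf.nn.avg_pool2d(fmap, ksize=2, strides=2, padding='SAME')
print(tf.squeeze(avg_pooled).numpy())
# [[2.5  1.  ]
#  [0.75 2.75]]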

# reset to the original circle before pooling again
circle_ = tf.reshape(circle, shape=(1, *circle.shape, 1))

plt.figure(figsize=(15, 15))
plt.subplot(1, 5, 1)
plt.imshow(circle)
plt.axis('off')
plt.title(circle.shape)

for i in range(4):
    plt.subplot(1, 5, i + 2)
    # average pooling
    circle_ = tf.nn.avg_pool2d(circle_,
                               ksize=2,
                               strides=2,
                               padding='SAME')
    plt.imshow(tf.squeeze(circle_))
    plt.axis('off')
    plt.title(tf.squeeze(circle_).shape)

plt.show()
AVERAGE POOLING

Global pooling

Instead of downsampling patches of the input feature map, global pooling downsamples the entire feature map to a single value. This is the same as setting the window size to the size of the input feature map.
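
A minimal sketch, using tf.reduce_mean over the spatial axes (this is equivalent to average pooling with a window the size of the whole feature map, which is what Keras’s GlobalAveragePooling2D layer computes):

import tensorflow as tf

# a random batch of feature maps: [batch, height, width, channels]
feature_maps = tf.random.uniform(shape=[1, 8, 8, 3])

# global average pooling: reduce each channel's entire map to one value
global_avg = tf.reduce_mean(feature_maps, axis=[1, 2])
print(global_avg.shape)  # (1, 3): one summary value per channel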

Why use pooling layers?

  • Pooling layers are used to reduce the dimensions of the feature maps.
  • The pooling layer summarizes the features present in a region of the feature map generated by a convolution layer, so further operations are performed on summarized features instead of precisely positioned ones. This makes the model more robust to spatial translations, as the sketch below demonstrates.
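
Here is a small sketch of that robustness (the feature map values are made up): shifting a feature by one pixel leaves the max-pooled summary unchanged.

import tensorflow as tf

# a feature map with a vertical activation in column 2
a = tf.constant([[0., 0., 1., 0.],
                 [0., 0., 1., 0.],
                 [0., 0., 1., 0.],
                 [0., 0., 1., 0.]])
# the same feature shifted one pixel to the right
b = tf.roll(a, shift=1, axis=1)

def pool(x):
    x = tf.reshape(x, [1, 4, 4, 1])  # add batch and channel dimensions
    return tf.squeeze(tf.nn.max_pool2d(x, ksize=2, strides=2, padding='SAME'))

print(pool(a).numpy())  # [[0. 1.] [0. 1.]]
print(pool(b).numpy())  # [[0. 1.] [0. 1.]] -- identical despite the shift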

Let’s see how a convolution kernel extracts patterns from an image: GitHub link

Input image
# bottom sobel kernel (see the kernel reference at the end of this post)
bottom_sobel = np.array([[-1., -2., -1.],
                         [ 0.,  0.,  0.],
                         [ 1.,  2.,  1.]])

# input_img: the color input image, shaped [1, height, width, 3]
extracted_features = tf.nn.conv2d(input_img,
                                  filters=get_filter(bottom_sobel),
                                  strides=1,
                                  padding='SAME')
# let's view the output of convolution
plt.imshow(tf.squeeze(extracted_features));

The convolution kernel extracts patterns from the image

Convolution output before applying activation function

Now let’s isolate the patterns by applying the ReLU activation.

# let's apply the ReLU activation -- it isolates the features
feature_map = tf.nn.relu(extracted_features)
# let's view the feature map
plt.imshow(tf.squeeze(feature_map));

We can see that applying the activation function isolates the features.

convolution + ReLU activation

Now let’s apply max pooling to intensify the extracted patterns.

# use max pooling to summarize the feature map
summarize_feature_map = tf.nn.max_pool2d(feature_map,
                                         ksize=2,
                                         strides=2,
                                         padding='SAME')
# visualize
plt.imshow(tf.squeeze(summarize_feature_map));
Max pooling intensifies the extracted patterns

That’s it, folks. Today we learned the mathematics behind the convolution operation for feature extraction: by sliding convolution kernels over an image, we can extract patterns from it. The output of the convolution is passed through an activation function to isolate the patterns in the feature map, and those patterns are then intensified by the pooling step.

Resources

Convolution kernels

An image kernel is a small matrix used to apply effects such as blurring, sharpening, outlining, or embossing. In machine learning, kernels are used for feature extraction.

Find more at Kernel (image processing) on Wikipedia.


Notebooks
