Exploring pre-trained Convolutional layers and Kernels
For an interactive experience with this post's code, check out this Kaggle notebook.
In this post, we will explore how Convolutional Neural Networks (CNNs) "see" the world. We will do this by implementing some kernels from scratch and convolving them with images. We will also visualise the features learned by the first two layers of the VGG16 model to get an understanding of what these models actually learn during training. This post was inspired by the paper Visualizing and Understanding Convolutional Networks, which I recommend to anyone who wants to build some intuition about what convolutional networks actually learn during training.
What exactly are Convolutional and Pooling layers?
In this section, we will make our own kernel and apply it to an image. In my first post, I explained that kernels are small 2D matrices that slide over images. So, to convolve an image is to slide the kernel across every pixel, performing an element-wise multiplication with the underlying image pixels and summing the results. This process creates a new pixel in the output image. For a more visual representation of this process, please go see the videos by Futurology and 3Blue1Brown; they both do a great job working through this process visually.
So let us create a kernel. Below is an edge detection kernel implemented in NumPy: a hand-crafted matrix designed to pick out edges and outlines within an image.
Quick note: In real-world programming, you won't typically implement these kernels by hand. Libraries like OpenCV handle the heavy lifting. For now, focus on understanding what kernels do and how they work before diving deep into the math and implementation specifics. Playing around with them first often builds better intuition.
import cv2 as cv
from matplotlib import pyplot as plt
import numpy as np

# Edge detection kernel taken from
# https://en.wikipedia.org/wiki/Kernel_(image_processing)
kernel = np.array([
    [ 0, -1,  0],
    [-1,  4, -1],
    [ 0, -1,  0]])
We will be applying the edge detection kernel above to the following picture of Pikachu.
Here is the full code for applying it to the image:
img = cv.imread("/pikachu.jpg", cv.IMREAD_UNCHANGED)
img = cv.cvtColor(img, cv.COLOR_BGR2RGB) # breaking some opencv best practices
edge_img = cv.filter2D(img, -1, kernel)
plt.imshow(edge_img)
As you can see, we applied a kernel to the image of Pikachu and got back the outline of the image. This simple hand-made edge detection convolution we used mirrors what the first layers in CNNs do: they detect edges and basic features. However, the CNN learns to create these filters automatically through backpropagation and a good loss function. These models can even learn very complex kernels to pick up some features, such as floppy dog ears and cat whiskers.
This is basically what the kernel does at each position to generate a single output pixel:

Image:        Kernel:
[1 2 3]       [ 0 -1  0]
[4 5 6]       [-1  4 -1]
[7 8 9]       [ 0 -1  0]

Output pixel = (1*0)  + (2*-1) + (3*0) +
               (4*-1) + (5*4)  + (6*-1) +
               (7*0)  + (8*-1) + (9*0)
             = 0 - 2 + 0 - 4 + 20 - 6 + 0 - 8 + 0
             = 0
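If you want to see this sliding-window process in code, here is a minimal from-scratch sketch (no padding, stride of 1, single channel). The post itself relies on cv.filter2D for the real work, and note that, like cv.filter2D and most CNN libraries, this sketch does not flip the kernel, so it is technically cross-correlation rather than a textbook convolution.

def convolve2d(image, kernel):
    """Naive sliding-window 'convolution': multiply element-wise, then sum.
    No padding, stride 1, single channel."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]   # patch under the kernel
            out[i, j] = np.sum(window * kernel)  # element-wise multiply and sum
    return out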
Gaussian Blur
A popular kernel in the image processing world is the Gaussian kernel, which is mainly used to blur images. This blurring is often applied before thresholding an image for tasks such as background subtraction, noise removal, or cleaner edge detection. Below is a function we can use to create the Gaussian blur kernel from the mathematical formula below.
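The formula in question is the 2D Gaussian, which the function below samples at each kernel position and then normalises:

$$G(x, y) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right)$$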
For more details about the Gaussian Blur kernel, please read the Wikipedia article, this good blog post about it, and this amazing post on a forum.
def create_gaussian_kernel(sigma: float, size: int):
    kernel = np.zeros((size, size))
    m = size // 2
    n = size // 2
    for x in range(-m, m + 1):
        for y in range(-n, n + 1):
            expo_pow = -1 * ((x**2 + y**2) / (2 * sigma**2))
            denominator = 2 * np.pi * sigma**2
            val = (1 / denominator) * np.exp(expo_pow)
            kernel[x + m][y + n] = val
    return kernel / kernel.sum()  # normalise so the kernel values sum to 1
gauss_kernel = create_gaussian_kernel(0.8, 3)
gauss_kernel
gauss_img = cv.filter2D(img, -1, gauss_kernel)
plt.imshow(gauss_img)
This is the result of applying the generated kernel to the picture of Pikachu
The sigma argument controls how strong the blur is. I spent quite a bit of time debugging why some images were blurring while others were not. After some experimenting, and with the help of Claude, I realised that different images respond differently to kernels depending on their resolution and noise. For example, higher-resolution pictures require larger kernels and bigger values of sigma to blur properly. To find the right kernel size for a given sigma value, use the formula below. Remember that kernel sizes need to be odd, such as 3x3, 5x5, or 7x7.
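As a quick illustration using the create_gaussian_kernel function from above (the values here are just examples, and the rule of thumb of roughly size = 2 * ceil(3 * sigma) + 1 is a common convention, not necessarily the exact formula referred to above): a larger sigma paired with a larger odd-sized kernel produces a visibly stronger blur than the 3x3 kernel we used.

# Illustrative values only: a bigger sigma needs a bigger (odd) kernel to blur properly.
light_blur = cv.filter2D(img, -1, create_gaussian_kernel(0.8, 3))    # mild blur
heavy_blur = cv.filter2D(img, -1, create_gaussian_kernel(3.0, 19))   # much stronger blur
plt.imshow(heavy_blur)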
What do CNN models see?
In this section, we will extract the first two layers of the pre-trained ImageNet VGG16 model and examine the features it learned to identify on its own without manual configuration. The code for importing the model and extracting the first two convolutional layers is provided below.
from keras.applications import VGG16

conv_layers = []
model = VGG16()
# print(model.summary())

counter = 0
for layer in model.layers:
    if "conv" in layer.name:
        conv_layers.append(layer)
        counter += 1
    if counter == 2:
        break
print(conv_layers)
Now let's apply the first layer to the Pikachu picture and see what the model picks up.
# resize image
re_img = cv.resize(img, (224, 224)) # VGG16 takes 224X224 images as input
out_img = conv_layers[0](np.expand_dims(re_img, axis=0)/255) # Creating a batch and normalisation
out_img.shape
fig, axes = plt.subplots(8, 8, figsize=(10, 10))
for i, ax in enumerate(axes.flat):
    f = out_img[0][:, :, i].numpy()
    ax.imshow(f, cmap='gray')
    ax.axis("off")
plt.tight_layout()  # Adjust subplot parameters for a tight layout
plt.show()
As you can see, the first layer of VGG16 learnt to construct edge detection kernels similar to the one we hand-made earlier. This shows how the first couple of CNN layers learn to pick up simple features, while layers further down start detecting more complex patterns. In the feature maps above, the light values are the ones the convolution picked up as important during training. You can see that some of the filters focus on picking up the eyes, but most of them focus on the outline of Pikachu.
Now let's try with a real animal and see what sort of features VGG16 picks up from this picture of a cat.
Now let's pass this cat image through the first Convolutional layer as we did with the Pikachu image.
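A minimal sketch of that step is shown below; the cat's file path here is a placeholder rather than the one used in the original notebook, and the plotting mirrors what we did for Pikachu.

cat = cv.imread("/cat.jpg", cv.IMREAD_UNCHANGED)  # hypothetical path to the cat picture
cat = cv.cvtColor(cat, cv.COLOR_BGR2RGB)
re_cat = cv.resize(cat, (224, 224))               # VGG16 expects 224x224 input
out_cat = conv_layers[0](np.expand_dims(re_cat, axis=0) / 255)  # batch dimension + normalisation

fig, axes = plt.subplots(8, 8, figsize=(10, 10))
for i, ax in enumerate(axes.flat):
    ax.imshow(out_cat[0][:, :, i].numpy(), cmap='gray')
    ax.axis("off")
plt.tight_layout()
plt.show()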
Now let's pass the feature map we generated to the second convolutional layer and see what kind of kernels it learned. As a quick refresher, remember that during the forward pass, the neural network sequentially passes the output of each layer into the next layer. This process is simulated below:
# passing the first layer's output through the second convolutional layer
out_cat = conv_layers[1](out_cat)
print(out_cat.shape)

fig, axes = plt.subplots(8, 8, figsize=(10, 10))
for i, ax in enumerate(axes.flat):
    f = out_cat[0][:, :, i].numpy()
    ax.imshow(f, cmap='gray')
    ax.axis("off")
plt.tight_layout()  # Adjust subplot parameters for a tight layout
plt.savefig("cat_layer_2.png")
plt.show()
As you can see, the second layer has learned to detect more abstract and robust features compared to the first. Some of the filters clearly capture the outline of the cat, while others seem to pick up on the different fur tones, distinguishing between lighter and darker areas. Some of the filters learned to identify the cat's eyes, and the most interesting part is that a good chunk of the filters could discern the texture of the cat's fur.
Max vs Average Pooling
We have seen what convolutional layers do, but now let us explore what pooling layers do. They mostly just downsample the images; in simpler terms, they reduce the spatial dimensions of the image. Okay, in even simpler terms, they make the image smaller by taking either the mean or the max value in a given window. In the standard setup, the kernel size, or in this case the pool size, matches the stride. For example, a 2x2 pooling kernel with a stride of 2 will halve the size of the image. The window slides across the image and replaces each region with either its max or its average pixel value. Now let's see this in action.
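As a tiny made-up example before we touch a real image, here is what 2x2 pooling with a stride of 2 produces on a 4x4 matrix:

x = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 8, 3, 2],
              [7, 6, 1, 0]])
# Max pooling keeps the largest value in each 2x2 window:
# [[4, 8],
#  [9, 3]]
# Average pooling keeps the mean of each 2x2 window:
# [[2.5, 6.5],
#  [7.5, 1.5]]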
import keras
import tensorflow as tf
max_pool = keras.layers.MaxPooling2D(pool_size=(2, 2))
average_pool = keras.layers.AveragePooling2D(pool_size=(2, 2))
cat_dim_extend = (np.expand_dims(re_cat, axis=0)/255)
max_cat = max_pool(cat_dim_extend)
average_cat = average_pool(cat_dim_extend)
cat_imgs = [cat_dim_extend, max_cat, average_cat]
print(f"Orginal size before pooling : {cat_dim_extend.shape}")
print(f"Max Pool Size : {max_cat.shape}")
print(f"Average Pool Size : {average_cat.shape}")
fig, axes = plt.subplots(1, 3, figsize=(10, 10))
for i, ax in enumerate(axes.flat):
    img = tf.squeeze(cat_imgs[i]).numpy()
    ax.imshow(img)
    ax.axis("off")
plt.tight_layout()  # Adjust subplot parameters for a tight layout
plt.show()
From the images above, you can see that average pooling resulted in a smoother image compared to max pooling. Both methods compressed the image into smaller spatial dimensions. You can think of pooling as a crude form of image resizing; the difference is that pooling keeps either the most prominent (max pooling) or the most representative (average pooling) value within each window. Pooling reduces the amount of computation and the number of parameters in downstream layers, leading to faster processing and lower memory consumption in your neural network.
Conclusion
In this post, we explored kernels, looked at what they do, and implemented some of our own. We examined the first two convolutional layers of the VGG16 model and saw what features it learnt to pick up during training. Lastly, we took a quick look at the result of pooling an image. It took me quite a long time to learn some of this material, but with time and curiosity, you can reach a level where you somewhat understand these concepts.
Series
- LeNet Implementation
- Exploring pre-trained Convolutional layers (you are here right now)
- AlexNet Implementation
- VGG vs Inception: Just how deep can they go?
- ResNet and Skip Connections
- Autoencoder
- Variational Autoencoders (VAEs)
- Generative Adversarial Networks (GANs)