Demystifying LeNet-5: A Deep Dive into CNN Foundations

24 May, 2025

Disclaimer: This blog post is more of a reference guide and personal notes. So, it should not be taken as a tutorial or a beginner's guide to Computer Vision, CNNs, or a deep discussion of the paper Gradient-Based Learning Applied to Document Recognition.

For an interactive experience, please visit this Kaggle notebook

A bit of history about CNNs

CNNs were first introduced in 1989, notably in Yann LeCun's early work. This is surprising since CNNs only gained major prominence in 2012 on the back of the AlexNet image classification model. In the early days, CNNs and other neural networks were hard to train because of the vanishing gradient problem and the limited computational resources available. Support Vector Machines were often used instead because they were easier to train and did not require a large dataset to be effective. The emergence of GPUs, large datasets like ImageNet, and advances in training neural networks, such as batch normalization and better activation functions, helped overcome these challenges and made CNNs a mainstay of the machine learning field.

My thoughts on the paper: Gradient-Based Learning

After reading the paper, I found myself needing to revisit a few core concepts that the authors assumed familiarity with. Below are the notes I took while trying to better understand what the paper was referring to.

1. What is a Convolution?

There are plenty of sites and videos that explain convolutions better than I could, so I have left links below that I found useful when learning about them. To understand convolutions, you need to know three concepts: the kernel size, the number of filters, and the stride. Think of the kernel size as the size of the window that slides over the image. This can be a (3x3) kernel for a 2D convolution, which is what we use for images. The number of filters, on the other hand, is the number of patterns or features we want to extract from the image. The more filters, the more information the model can learn, but the more weights you add. The stride is how many pixels the kernel moves at each step. If the stride is 1, the kernel moves one pixel at a time across the image. In the image below, the yellow region is the kernel, and you can see it move around the image with a stride of 1, producing the convolved feature. This image uses only one filter, but deep learning models usually use many more per layer, such as 64 or 256.

convolution operation

The paper highlights a big advantage of convolutional networks: parameter sharing. Let us go over this with an example using an image from the MNIST dataset. Suppose we used a dense neural network for this classification, starting with a small dense layer of 8 neurons. The first step would be to flatten the (28x28x1) image into a 784-element vector. Each of the 8 neurons then needs 784 weights, which gives the dense layer 6,272 weights. Adding the 8 bias terms brings the total to 6,280 parameters. This is not bad, but it does not scale: the parameter count and the computation grow quickly as the dense layer and the images get bigger.

In contrast, a convolution with a (5x5x1) kernel and a large filter count of 128 has only 3,328 parameters, calculated as follows:

$$\text{parameters} = (\text{kernel height} \times \text{kernel width} \times \text{input channels} + 1) \times \text{number of filters}$$
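The two parameter counts above can be checked with a few lines of Python. This is a minimal sketch; the helper names `dense_params` and `conv2d_params` are made up for illustration and are not part of any library.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights plus one bias per unit for a fully connected layer."""
    return (n_inputs + 1) * n_units


def conv2d_params(kernel_h: int, kernel_w: int, in_channels: int, n_filters: int) -> int:
    """Weights plus one bias per filter for a 2D convolution."""
    return (kernel_h * kernel_w * in_channels + 1) * n_filters


print(dense_params(28 * 28 * 1, 8))  # dense layer from the text: 6280
print(conv2d_params(5, 5, 1, 128))   # convolution from the text: 3328
```

Note that the convolution count does not depend on the image size at all, which is the practical effect of parameter sharing.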

The convolution in the example above has a large number of filters and can capture far more information with fewer weights than the dense layer. Convolutions are also great because they learn local patterns and can recognize a feature regardless of where it appears in the image. An example would be learning that dogs have floppy ears no matter where the dog is located in a particular image.

2. What is subsampling?

Subsampling (or downsampling) reduces the spatial dimensions of an image. This layer summarizes the information in the image into a more condensed form. Think of it like compressing the information within an image. Subsampling mainly takes the form of pooling operations such as max and average pooling. For example, subsampling a 28x28 image with a 2x2 kernel and a stride of 2 gives a 14x14 output. As the example shows, this amounts to dividing the width and height of the image by the kernel size. Standard max and average pooling layers have no weights or parameters, so they do not learn during backpropagation. (The subsampling layers in the original paper did carry a trainable coefficient and bias per feature map, but the average pooling used in this implementation does not.)
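To make the 28x28 to 14x14 example concrete, here is a rough NumPy sketch of 2x2 average pooling. It assumes non-overlapping windows (a stride equal to the kernel size) and an input with even height and width; the function name `average_pool_2x2` is only illustrative.

```python
import numpy as np


def average_pool_2x2(image: np.ndarray) -> np.ndarray:
    """Average non-overlapping 2x2 windows of a (H, W) array with even H and W."""
    h, w = image.shape
    # Split each axis into (number of windows, window size),
    # then average over the two window-size axes.
    return image.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


image = np.arange(28 * 28, dtype=np.float32).reshape(28, 28)
pooled = average_pool_2x2(image)
print(image.shape, "->", pooled.shape)
```

There is nothing to train here: the output is a fixed function of the input, which is why such a layer adds no parameters to the model.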

3. What is a feature map?

In the paper, a feature map is the output of a convolutional layer. The formula for the output size of a convolution is below. A helpful learning exercise is to convert this formula into a Python script and use it to calculate your layer outputs one by one to see the resulting feature map sizes. This will save you many headaches later.

$$\text{Output Dimension} = \frac{\text{Input Dimension} - \text{Kernel Size} + 2 \times \text{Padding}}{\text{Stride}} + 1$$
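A small sketch of that script might look like the following. The function name `conv_output_dim` is made up, and integer division is an assumption on my part to handle cases where the window does not fit evenly. The loop walks through the LeNet-5 layer sizes used in this post, treating each pooling layer as a 2x2 window with a stride of 2.

```python
def conv_output_dim(input_dim: int, kernel_size: int, padding: int = 0, stride: int = 1) -> int:
    """Output width or height of a convolution or pooling window along one axis."""
    # Integer division: a window that does not fully fit is dropped.
    return (input_dim - kernel_size + 2 * padding) // stride + 1


size = 32  # padded MNIST input
for name, kernel, stride in [
    ("conv 5x5", 5, 1),
    ("pool 2x2", 2, 2),
    ("conv 5x5", 5, 1),
    ("pool 2x2", 2, 2),
]:
    size = conv_output_dim(size, kernel, stride=stride)
    print(f"{name}: {size}x{size}")
```

The sizes printed are 28, 14, 10, and 5, so the final 5x5 feature maps with 16 channels flatten to 400 values.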

4. What are RBF networks?

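The paper uses Radial Basis Function (RBF) units in the output layer to classify the MNIST digits. Each RBF unit computes the squared Euclidean distance between the 84-dimensional input and a parameter vector for its class, so a smaller output means a better match. This kind of output layer is not used much these days. In this implementation of the model, the RBF layer was swapped out for a softmax-activated dense layer.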
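As a rough illustration of what a Euclidean RBF output computes, here is a NumPy sketch. It assumes the squared-distance form y_i = sum_j (x_j - w_ij)^2, and the prototype vectors are random +1/-1 placeholders chosen for this example, not the values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 84 inputs (the size of the last hidden layer) and
# 10 classes, with one placeholder prototype vector per class.
prototypes = rng.choice([-1.0, 1.0], size=(10, 84))


def rbf_outputs(x: np.ndarray, prototypes: np.ndarray) -> np.ndarray:
    """Squared Euclidean distance from x to each class prototype."""
    return ((x[None, :] - prototypes) ** 2).sum(axis=1)


x = prototypes[3]  # an input that exactly matches the prototype of class 3
scores = rbf_outputs(x, prototypes)
# The predicted class is the one with the smallest distance,
# whereas with softmax it is the one with the largest probability.
print(scores.argmin())
```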

LeNet-5: The building blocks

LeNet-5 was designed to recognize handwritten digits from the MNIST dataset. A quick note: MNIST images have a size of (28x28x1), but the inputs in the paper are (32x32x1). This shows that the input was padded, so the dataset has been padded in this implementation as well. LeNet-5 has 7 layers, not counting the input. They are shown below, with the Flatten step listed as its own row:

Layer	Output size	Feature maps	Kernel size	Activation
Input	32x32x1	1	-	-
Conv2d	28x28x6	6	5x5	tanh
AveragePooling	14x14x6	6	2x2	tanh
Conv2d	10x10x16	16	5x5	tanh
AveragePooling	5x5x16	16	2x2	tanh
Flatten	400	-	-	-
Dense	120	-	-	tanh
Dense	84	-	-	tanh
Dense	10	-	-	softmax
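The table can also be used to estimate how many trainable parameters this version of the network has. The sketch below assumes every convolution is fully connected to all input channels and that the pooling and flatten rows carry no parameters, which matches a plain Keras implementation but not every detail of the original paper. The helper names are illustrative only.

```python
def conv2d_params(kernel: int, in_channels: int, filters: int) -> int:
    """Square-kernel convolution: weights plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * filters


def dense_params(n_inputs: int, units: int) -> int:
    """Fully connected layer: weights plus one bias per unit."""
    return (n_inputs + 1) * units


layer_params = {
    "Conv2d 5x5, 6 filters": conv2d_params(5, 1, 6),
    "Conv2d 5x5, 16 filters": conv2d_params(5, 6, 16),
    "Dense 120": dense_params(400, 120),
    "Dense 84": dense_params(120, 84),
    "Dense 10": dense_params(84, 10),
}

for name, count in layer_params.items():
    print(f"{name}: {count:,}")
print(f"Total: {sum(layer_params.values()):,}")
```

Under these assumptions the total comes to 61,706 parameters, most of them in the first dense layer.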

Implementing LeNet-5

Here's a TensorFlow/Keras implementation of LeNet-5, adapted for the MNIST dataset. We pad the input to 32x32 to match the original architecture and use average pooling layers and tanh activations just like in the paper.

import tensorflow as tf
from keras import Model, layers, optimizers
from keras.datasets import mnist
from keras.ops.image import pad_images

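class LeNet(Model):
    """LeNet-5 style network with tanh activations and average pooling."""

    def __init__(self, in_shape, num_labels):
        super().__init__()
        # Two 5x5 convolutions with 6 and 16 filters
        self.c1 = layers.Conv2D(6, (5, 5), activation="tanh", input_shape=in_shape)
        self.c2 = layers.Conv2D(16, (5, 5), activation="tanh")
        # Pooling has no weights, so one layer is reused after both convolutions
        self.pool = layers.AveragePooling2D((2, 2))
        self.flat = layers.Flatten()
        self.dense1 = layers.Dense(120, activation="tanh")
        self.dense2 = layers.Dense(84, activation="tanh")
        # Softmax output in place of the paper's RBF layer
        self.lastDense = layers.Dense(num_labels, activation="softmax")

    def call(self, inputs):
        x = self.c1(inputs)  # 32x32x1 -> 28x28x6
        x = self.pool(x)     # -> 14x14x6
        x = self.c2(x)       # -> 10x10x16
        x = self.pool(x)     # -> 5x5x16
        x = self.flat(x)     # -> 400
        x = self.dense1(x)
        x = self.dense2(x)
        return self.lastDense(x)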

Importing the MNIST Dataset

# Load the dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Add a channel dimension: (N, 28, 28) -> (N, 28, 28, 1)
x_train = tf.reshape(x_train, (x_train.shape[0], x_train.shape[1], x_train.shape[2], 1))
x_test = tf.reshape(x_test, (x_test.shape[0], x_test.shape[1], x_test.shape[2], 1))

# Convert the pixels to floats in the range [0, 1]
x_train = tf.cast(x_train, tf.float32) / 255
x_test = tf.cast(x_test, tf.float32) / 255


# Pad the images from 28x28 to 32x32 (2 pixels on each side)
pad_x_train = pad_images(
    x_train,
    top_padding=2,
    bottom_padding=2,
    left_padding=2,
    right_padding=2,
    data_format="channels_last",
)

pad_x_test = pad_images(
    x_test,
    top_padding=2,
    bottom_padding=2,
    left_padding=2,
    right_padding=2,
    data_format="channels_last",
)

mn_shape = (32, 32, 1)
mnist_model = LeNet(mn_shape, 10)

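mnist_model.compile(
    optimizer=optimizers.Adam(),
    loss="sparse_categorical_crossentropy",  # labels are integer class ids
    metrics=["accuracy"],
)

# Train with a batch size of 8, holding out 20% of the training data for validation
mnist_model.fit(pad_x_train, y_train, batch_size=8, epochs=5, validation_split=0.2)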

Then we evaluate the model.

mnist_model.evaluate(pad_x_test, y_test)

The model achieves 98% test accuracy on MNIST, which is impressive given the simplicity of the architecture. However, its performance on more complex datasets like CIFAR-10 or CIFAR-100 is limited due to its shallow structure and low parameter count. If you want to see the model's performance on the CIFAR-10 and CIFAR-100 datasets, you can check out the Kaggle notebook. These results were omitted for brevity's sake, to keep this post short for people who just want to see the implementation.

In short, the model trained on the CIFAR-100 dataset exhibited signs of overfitting and slow convergence. This showed the limitations of this architecture. In the next installment, we will look into AlexNet and how it overcomes these issues.

Series

LeNet Implementation (You are here right now)
Exploring pre-trained Convolutional layers
AlexNet Implementation
VGG vs Inception: Just how deep can they go?
ResNet and Skip Connections
Autoencoder
Variational Autoencoders (VAEs)
Generative Adversarial Networks (GANs)

Useful links

#CNNS #LeNet #ML