Maybe-Ray

Denoising Images of Cats and Dogs with Autoencoders

This is part 6 of my series. In the previous post, we looked at ResNet and Skip Connections, where we implemented the model and discussed the degradation problem and how skip connections mitigate it. We then used a pre-trained version of the model for transfer learning on the Caltech-256 dataset.

In this post, we are going to look at autoencoders. Autoencoders are a type of neural network used to learn low-dimensional representations of high-dimensional data. They consist of an encoder and a decoder: the encoder takes in the high-dimensional data and converts it into a low-dimensional representation, and the decoder takes this low-dimensional representation and converts it into the desired output.

An example of this would be the process of data compression. The encoder takes the file secret.txt and compresses it down to the new file secret.zip. The decoder will take the secret.zip file and decompress it back into the original file secret.txt.

Autoencoders have many use cases, such as data compression, data reconstruction, feature extraction and anomaly detection. They can be viewed as a nonlinear generalisation of Principal Component Analysis (PCA).

Parts of AutoEncoders

In this section, we are going to go through the three building blocks of an AutoEncoder, which consist of an encoder, a bottleneck and a decoder.

image of autoencoder

1) Encoder: The Encoder is responsible for taking in data. This data can be anything from vectors to images to sequential data; it all depends on the architecture of the network. The encoder transforms this data into a lower-dimensional representation. For example, it can take in an image from the MNIST Dataset and convert it into a 2-dimensional vector that represents that particular image.

An image from the MNIST Dataset can be flattened and represented as a 784-dimensional vector. In the example above, we take this 784-dimensional vector and allow our encoder to learn some 2-dimensional representation of it.

2) BottleNeck: The BottleNeck represents what we want the encoder to output. In our 784-dimensional vector example above, we could have our encoder output either 2, 3 or even 10-dimensional representations of the data. It all depends on how much information we want to represent. The bigger the bottleneck, the more information that can be preserved. The term latent space is also used in the literature; it refers to the representation that the autoencoder learns of the data.

3) Decoder: The Decoder is responsible for taking in the latent representation and reconstructing it into some desired output. In the MNIST image example, the decoder will take in the 2-dimensional latent vector and convert it back into the original image.

A simple AutoEncoder

We are going to build a 3-layer autoencoder. The encoder will take in an image from the MNIST dataset. It will then learn a 2-dimensional latent space from the image. The decoder will convert this 2-dimensional latent space back to the original image. So we are just teaching this neural network to compress these images and learn how to reconstruct them.

All the code for this can be found in this Kaggle Notebook. This post will only focus on the model and leave out the code for training the model and the dataset preprocessing.

The model is broken down in the table below:

Layer    Size
Dense    28x28
Dense    2
Dense    28x28

Binary Cross-Entropy (BCE) was used as the loss function. This treats each pixel as its own binary classification problem: is the pixel closer to 0 or to 1? Mean Squared Error could also have been used for this task and would have given good results, but BCE was chosen because, combined with a Sigmoid output, it produces better gradients during backpropagation (the error signal does not vanish when the Sigmoid saturates).

The Sigmoid activation function is used for the final output because it clamps the values to the range between 0 and 1. Tanh could have been used as well, but Sigmoid was chosen because BCE expects predictions in [0, 1] and because the MNIST pixel values, once normalised, are mostly close to either 0 or 1.
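
As a rough, self-contained illustration (not code from the notebook), here is how both losses can be computed on a handful of predicted pixel values in [0, 1]:

import torch
import torch.nn as nn

# Hypothetical reconstructed pixels and their targets, both in [0, 1]
prediction = torch.tensor([0.9, 0.1, 0.8, 0.2])
target = torch.tensor([1.0, 0.0, 1.0, 0.0])

bce = nn.BCELoss()(prediction, target)  # treats each pixel as a Bernoulli probability
mse = nn.MSELoss()(prediction, target)  # penalises the squared pixel error
print(bce.item(), mse.item())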

The code below implements a very simple autoencoder that takes a 28x28 image, flattens it into a 784-dimensional vector, and returns a 784-dimensional reconstruction. The bottleneck will be 2-dimensional by default.

import torch
import torch.nn as nn

class simpleAutoEncoder(nn.Module):

    def __init__(self, latent_dim=2):
        super().__init__()
        self.model = nn.Sequential(
            nn.Flatten(), # Flattens the 28x28 image into 784 vector
            nn.Linear(28*28, latent_dim), # Encoder
            nn.ReLU(),
            nn.Linear(latent_dim, 28*28), # Decoder
            nn.Sigmoid()
        )

    def forward(self, x):
        output = self.model(x)
        return output

The code below shows the setup of the model, loss function and optimiser to use in the training loop.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # use a GPU if available

latent_dim = 2
simple_autoencoder_model = simpleAutoEncoder(latent_dim).to(device)
loss_fn = nn.BCELoss()
optimizer = torch.optim.Adam(simple_autoencoder_model.parameters())
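
The full training loop is in the notebook, but a minimal sketch of one epoch could look like this (train_loader is assumed to be a DataLoader yielding batches of MNIST images scaled to [0, 1]):

for images, _ in train_loader:  # labels are ignored, this is unsupervised
    images = images.to(device)
    targets = images.view(images.size(0), -1)  # flatten to match the 784-dim output

    reconstructions = simple_autoencoder_model(images)
    loss = loss_fn(reconstructions, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()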

Results

Below are some of the original images shown next to their reconstructions. You can play around with the code and change the latent_dim size to any number you like to see how the quality of the reconstructed images changes.

MNIST images

As you can see, our model produced good results when reconstructing these images from just a 2-dimensional vector. They seem a bit blurry, but you can experiment and increase the size of the latent dimension to get better results.

Let's Denoise Images of Cats and Dogs

Now, for a bit more fun, let us use an autoencoder to denoise a bunch of images. We will add some Gaussian noise to pictures of Cats and Dogs and see how well our model learns to denoise them.
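
As a rough sketch of how that noise could be added (the sigma value here is just a placeholder, not necessarily the one used in the notebook):

import torch

def add_gaussian_noise(images, sigma=0.1):
    # images is a batch of tensors with pixel values in [0, 1]
    noise = torch.randn_like(images) * sigma
    return (images + noise).clamp(0.0, 1.0)  # keep pixels in a valid range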

We will be using the RED-NET (very deep Residual Encoder-Decoder Networks) framework from the paper Image Restoration Using Convolutional Auto-encoders with Symmetric Skip Connections.

Picture of RED-NET

RED-NET maps every convolutional layer to a mirrored deconvolutional layer, which is where the "Symmetric" in the paper's title comes from. It is almost like a U-Net, except there is no downsampling and the skip connections are added element-wise rather than concatenated. In other words, this architecture adds skip connections just like the original ResNet did.

Model setup

The full code for the whole training process, dataset setup and model configuration can be found in this Kaggle notebook. I will mainly focus on the results here.

The code snippets below show a partial implementation of the RED-NET model.
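
The encoder_cnn_block and decoder_cnn_block helpers they rely on are defined in the notebook; a plausible sketch of them (my assumption, not the exact notebook code) is a size-preserving convolution or transposed convolution followed by batch normalisation:

import torch.nn as nn

def encoder_cnn_block(in_channels, out_channels):
    # Convolution + BatchNorm; the ReLU is applied in Encoder.forward
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_channels),
    )

def decoder_cnn_block(in_channels, out_channels, include_norm=True):
    # Transposed convolution mirroring the encoder block
    layers = [nn.ConvTranspose2d(in_channels, out_channels, kernel_size=3, padding=1)]
    if include_norm:
        layers.append(nn.BatchNorm2d(out_channels))
    return nn.Sequential(*layers)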

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()

        # conv block layers
        self.features = nn.Sequential(
            encoder_cnn_block(3, 32),
            encoder_cnn_block(32, 64),
            encoder_cnn_block(64, 128),
            encoder_cnn_block(128, 128),
            encoder_cnn_block(128, 256),
            encoder_cnn_block(256, 256),
            encoder_cnn_block(256, 512),
            encoder_cnn_block(512, 512),
        )

        self.relu = nn.ReLU()


    def forward(self, x):
        skips = []
        for block in self.features:
            x = block(x)
            x = self.relu(x)
            skips.append(x)  # keep each activation for the decoder's skip connections

        return x, skips

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.sig = nn.Sigmoid()
        self.relu = nn.ReLU()

        # Transpose conv block layers
        self.feature_recontruction = nn.Sequential(
            decoder_cnn_block(512, 512),
            decoder_cnn_block(512, 256),
            decoder_cnn_block(256, 256), 
            decoder_cnn_block(256, 128), 
            decoder_cnn_block(128, 128),
            decoder_cnn_block(128, 64),
            decoder_cnn_block(64, 32),
            decoder_cnn_block(32, 3, include_norm=False)
        )

    def forward(self, x, skips):
        skips = skips[::-1]  # reverse so the deepest encoder features are used first
        last_index = len(self.feature_recontruction) - 1
        for index, block in enumerate(self.feature_recontruction):
            x = block(x + skips[index])  # symmetric skip connection: add, not concatenate

            if index == last_index:
                x = self.sig(x)  # squash the final output into [0, 1]
            else:
                x = self.relu(x)
        return x
      
class Autoencoder(nn.Module):

    def __init__(self):
        super().__init__()
        self.encoder = Encoder()
        self.decoder = Decoder()

    def forward(self, x):
        encode, skips = self.encoder(x)
        decode_img = self.decoder(encode, skips)
        return decode_img
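
Assuming size-preserving helper blocks like the sketch above, a quick shape check on a dummy batch of RGB images should return a tensor of the same shape as the input:

import torch

model = Autoencoder()
dummy = torch.rand(2, 3, 128, 128)  # two fake 128x128 RGB images (the size is arbitrary)
output = model(dummy)
print(output.shape)  # expected: torch.Size([2, 3, 128, 128])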

Results of the training

The Gaussian noise on the training images was given a sigma of 1. With larger values of sigma, the images often become too noisy to make out what they contain. The results of the denoising are below: the first row shows the images with Gaussian noise, the second row shows the original images, and the last row shows the denoised RED-NET outputs.

RED-NET Results

Notice that the generated images are blurry; this is a result of the loss function. Mean Squared Error does not optimise for perceptual image quality, so it tends to produce slightly blurry images. Autoencoders also compress images into low-dimensional representations, so some information is lost when mapping back into the high-dimensional space.

Sampling from latent space

In our MNIST example, we compressed images into just 2 dimensions. So if we sample two random numbers from a distribution and feed them into our decoder, it should generate an image for us.

Let's try to sample from a normal distribution and see what the decoder generates.
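
With the simple model above, the decoder is just the last two layers of the nn.Sequential, so the experiment can be sketched roughly like this:

# Draw 16 random 2-dimensional latent vectors from a standard normal distribution
z = torch.randn(16, latent_dim).to(device)

# The decoder is the Linear(latent_dim, 784) + Sigmoid at the end of the model
decoder = simple_autoencoder_model.model[3:]

with torch.no_grad():
    generated = decoder(z).view(-1, 28, 28)  # reshape back into 28x28 images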

sampling latent space

You will notice that the results are not that good. The numbers nine and four appear quite a bit in the example above. We need a way to sample from a given distribution and get back a diverse set of images.

Typically, the latent space of Autoencoders is sparse and lacks a clear structure. A well-ordered latent space will have explicit relationships among the different variables within it. For example, the numbers four and nine should be positioned closer together in this space because they look similar, just as the numbers five and six should be.
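
One way to see this for yourself is to encode the test set and scatter-plot the 2-dimensional codes coloured by digit (test_loader here is an assumed MNIST DataLoader, and matplotlib is only used for the plot):

import torch
import matplotlib.pyplot as plt

codes, labels = [], []
encoder = simple_autoencoder_model.model[:2]  # Flatten + Linear(784, latent_dim)

with torch.no_grad():
    for images, targets in test_loader:
        codes.append(encoder(images.to(device)).cpu())
        labels.append(targets)

codes = torch.cat(codes)
labels = torch.cat(labels)

plt.scatter(codes[:, 0], codes[:, 1], c=labels, cmap="tab10", s=2)
plt.colorbar()
plt.show()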

This is where Variational Autoencoders come into play: they are good at ordering the latent space and generating diverse results from a given distribution. I will cover Variational Autoencoders in my next post.

Extra Resources on Autoencoders

I have linked some good resources I used when learning about Autoencoders below:

An Introduction to Autoencoders

The paper that introduced Autoencoders

Good Video on Latent space

A video on Denoising Autoencoders

Series

  1. Demystifying LeNet-5: A Deep Dive into CNN Foundations
  2. Exploring pre-trained Convolutional layers and Kernels
  3. AlexNet: My introduction to Deep Computer Vision models
  4. VGG vs Inception: Just how deep can they go?
  5. ResNet and Skip Connections
  6. Denoising Images of Cats and Dogs with Autoencoders (you are here right now)
  7. Variational Autoencoders (VAEs)
  8. Generative Adversarial Networks (GANs)