VGG v GoogleNet: Just how deep can they go?

30 Aug, 2025

This is part 4 of my series. In the previous post, we explored AlexNet, where we implemented the model. We then used a pre-trained version of AlexNet to transfer learn on the Caltech-256 dataset.

Going deeper

The VGG and GoogleNet/Inception models have become staples in the deep learning world and are still used today as either Architectural guides or base models in transfer learning tasks.

The papers that introduced these models aimed to make CNNs deeper. The deeper the network becomes, the harder it becomes to train. This is because of the vanishing gradient problem. The vanishing gradient problem happens during backpropagation, where the updates to the weights become smaller as the algorithm updates the earlier layers. This makes it very difficult to train very deep networks. VGG and Inception try to mitigate this problem in two completely different ways that we will discuss below

VGG

VGG is the simpler model of the duo, and so we will start with it. The major findings of the VGG paper had to do more with memory efficiency in terms of weights. The paper made a slight design choice or an incremental improvement on the current CNN architecture that had been used in models such as AlexNet.

They proposed that stacking two (3x3) kernels is the same as having one (5x5) kernel. The word "same" is used in this context to mean that these two operations are equivalent to each other.

Stacking convolutional layers has two added advantages. The first advantage of this approach is that the input goes through two activations instead of one. This allows the model to be more discriminative and learn more features.

The second advantage is that the stacked convolutions have fewer weights compared to the equivalent singular convolutional layer. As an example, 3 stacked Conv (3X3) layers are equivalent to just one (7x7) convolutional layer.

To fully grasp this concept, let us use some helpful Python. The function below calculates the weights of a convolutional layer

def calculate_conv_weights(kernel_size, filters):
    return kernel_size**2 * filters

And we run the code below:

single = calculate_conv_weights(7, 256)

stack = sum([calculate_conv_weights(3, 256) for _ in range(3)])

print(f"7x7 kernel has {single} parameters")
print(f"3x3 stacked kernel has {stack} parameters")
print(f"so a {100 - ((stack / single) * 100):.2f} % decrease in parameter count")

The following should be outputted

7x7 kernel has 12544 parameters
3x3 stacked kernel has 6912 parameters
so a 44.90 % decrease in parameter count

As we can see, there is a decrease of about 45% in terms of parameters. This is under the assumption that the stacked convolutions have the same number of channels.

Implementing VGG

We will implement the model in this section. This implementation structure is similar to the one used by PyTorch. A slight modification was made in this version, where BatchNormalization was used to normalize the model's outputs and smooth out the training process. The full code for the implementation can be found in this Kaggle Notebook

The code below defines the building blocks for a stack of CNNS and will be used to build up more complex stacks later on in the post.

def vgg_block(in_channels, out_channels):
     return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, 3, padding=1), 
        nn.BatchNorm2d(out_channels),
        nn.ReLU()
     )
    

def vgg_stack(in_channels, out_channels):
    return nn.Sequential(
        vgg_block(in_channels, out_channels),
        vgg_block(out_channels, out_channels),
    )

The version of the model implemented below is based on the 11-Weight version of VGG. This version, according to the paper, was used for warm starting the deeper networks for faster training. Warm Starting in this context means training a smaller version of a model on a subset of the data for a few epochs. The smaller model weights are then used to initialize the weights of the early layers of the larger VGG models to achieve faster convergence.

class vgg_11_bn(nn.Module):
    def __init__(self, num_labels, FC_size):
        super().__init__()

        self.layer_1 = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2, stride=2))

        self.layer_2 = nn.Sequential(
           nn.Conv2d(64, 128, 3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.MaxPool2d(2, stride=2))

        self.layer_3 = vgg_stack(128, 256)
        self.layer_4 = vgg_stack(256, 512)

        self.features = nn.Sequential(
            self.layer_1,
            self.layer_2,
            self.layer_3,
            self.layer_4,
        )
        

        self.classification = nn.Sequential(
            nn.LazyLinear(FC_size),
            nn.LazyBatchNorm1d(),
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.LazyLinear(FC_size),
            nn.LazyBatchNorm1d(),
            nn.ReLU(),
            nn.Dropout(p=0.3),
            nn.LazyLinear(num_labels)
        )

        self.flat = nn.Flatten()
    
    def forward(self, x):
        features = self.features(x)
        flattened = self.flat(features)
        return self.classification(flattened)

Pre-training the VGG

In this section, we will use the ImageNet pre-trained VGG model to train on the Caltech dataset and see how well it performs against AlexNet

For the full code, please check out this Kaggle notebook. I will just go over the code snippets that relate to the model.

# Loading the pre-trained AlexNet model with ImageNet weights
model = vgg11_bn(weights='IMAGENET1K_V1')

# freezing layers
for i in model.parameters():
    i.requires_grad = False

# Creating our classifier 
classifier = nn.Sequential(
    nn.Dropout(0.5),
    nn.Linear(25088, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(inplace=True),
    nn.Dropout(0.3),
    nn.Linear(128, num_classes)
)

# adding our new classifier to our model, and making sure its parameters are trainable 

model.classifier = classifier

for i in model.classifier.parameters():
     i.requires_grad = True

Results

After 7 epochs, a validation accuracy of 74.4% and a loss of 1.1306 were achieved. The VGG model did way better than Alexnet, which received an accuracy of 60% and a loss of 1.460.

GoogleNet

The GoogleNet team dealt with trying to make CNNs deeper in a completely different way. Instead of optimizing the current infrastructure, like what the VGG team did with stacking Convolutional layers and having 3x3 kernel sizes. Inception came with a new layer called the Inception block.

inception block

This layer spreads out the convolutions, then concatenates them together. You might not get how this allows for the network to go deeper, and I did not at first, well, until I looked at things through the process of backpropagation

If all these layers were stacked together in a sequential order, that would mean the gradients get smaller the deeper they go. If the network is wide, then these layers will have larger gradients because they are not connected together.

The block just uses multiple feature extractors (the convolutional layers) to get features independently, combine them, and send them out. This minimizes the effect of the vanishing gradient problem since the layers are not stacked together in a sequential order.

In the paper, some of the inception blocks use auxiliary classifiers to help boost the gradients during backpropagation. This reduces the effect of the vanishing gradient problem by allowing stronger gradients to be propagated back to earlier layers. This makes the training process more robust.

Implementing GoogleNet

The code below implements an inception block. The mid_block method is used to represent the layers that go through the 1x1 convolution, followed by either the 3x3 or 5x5 convolutional layers.

The debug variable was used when I was making the layer as a way to check the shapes of all the outputs. A quick note: the MaxPooling layer in the Inception block has a stride of 1; this took me quite a while to figure out, as it was not directly mentioned in the paper, but rather implied in Figure 3. I originally used padding as a way to match all the shapes before concatenating all the tensors.

class Inception_block(nn.Module):

    def mid_block(self, in_chns, filter_maps, kernel_size):
        return nn.Sequential(
            nn.Conv2d(in_chns, filter_maps[0], 1, padding="same"),
            nn.ReLU(),
            nn.Conv2d(filter_maps[0], filter_maps[1], kernel_size, padding="same"),
            nn.BatchNorm2d(filter_maps[1]),
            nn.ReLU(),
        )

    def __init__(
        self,
        c_1_inchannel,
        c_1_outchannel,
        block_2_channels,
        block_3_channels,
        block_4_out_channels,
        debug= False
    ):
        super().__init__()
        self.debug = debug
        self.block_1 = nn.Conv2d(c_1_inchannel, c_1_outchannel, 1, padding="same")

        self.second_block = self.mid_block(c_1_inchannel, block_2_channels, 3)
        self.third_block = self.mid_block(c_1_inchannel, block_3_channels, 5)

        self.fourth_block = nn.Sequential(
            nn.MaxPool2d(3, 1, padding=1),
            nn.Conv2d(c_1_inchannel, block_4_out_channels, 1, padding="same"),
            nn.BatchNorm2d(block_4_out_channels),
            nn.ReLU(),
        )

    def forward(self, x):
        block_1 = self.block_1(x)
        block_2 = self.second_block(x)
        block_3 = self.third_block(x)
        block_4 = self.fourth_block(x)
        if self.debug: [print(i.shape) for i in (block_1, block_2, block_3, block_4)]
        return torch.cat((block_1, block_2, block_3, block_4), 1)  # channel concat

The code below is for the full GoogleNet implementation; some slight adjustments were made, including the removal of the local response normalisation layers. Batch Normalisation is included in this implementation for faster convergence. The full code for the GoogleNet Implementation can be found in the Kaggle Notebook

class GoogleNet(nn.Module):
    def __init__(self, num_classes):
        super().__init__()

        self.no_incpetion_pass = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=2),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(3, 2),
            nn.Conv2d(64, 192, 3, stride=1, padding=2),
            nn.BatchNorm2d(192),
            nn.ReLU(),
            nn.MaxPool2d(3, 2),
        )

        self.inception_block_3 = nn.Sequential(
            Inception_block(192, 64, (96, 128), (16, 32), 32),
            Inception_block(256, 128, (128, 192), (32, 96), 64),
            nn.MaxPool2d(3, 2, padding=1),
        )

        self.inception_block_4a = Inception_block(480, 192, (96, 208), (16, 48), 64)

        self.inception_block_4b_to_4d = nn.Sequential(
            Inception_block(512, 160, (112, 224), (24, 64), 64),
            Inception_block(512, 128, (128, 256), (24, 64), 64),
            Inception_block(512, 112, (144, 288), (32, 64), 64),
        )

        self.inception_block_4e_to_5b = nn.Sequential(
            Inception_block(528, 256, (160, 320), (32, 128), 128),
            nn.MaxPool2d(3, 2, padding=1),
            Inception_block(832, 256, (160, 320), (32, 128), 128),
            Inception_block(832, 384, (192, 384), (48, 128), 128),
            nn.AvgPool2d(7),
            nn.Flatten(),
            nn.Dropout(p=0.7),
        )

        self.aux_1 = self.make_auxiliary_classfier(num_classes, 512)
        self.aux_2 = self.make_auxiliary_classfier(num_classes, 528)
        self.fc = nn.Linear(1024, num_classes)

        self._initialize_weights()

    def make_auxiliary_classfier(self, num_classes, in_chns):
        return nn.Sequential(
            nn.AvgPool2d(5, stride=3),
            nn.Conv2d(in_chns, 128, 1, stride=1),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(2048, 1024),
            nn.Dropout(p=0.4),
            nn.Linear(1024, num_classes),
        )

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.constant_(m.bias, 0)

    def forward(self, x):
        pass_1 = self.no_incpetion_pass(x)
        block_1 = self.inception_block_3(pass_1)
        block_2 = self.inception_block_4a(
            block_1
        )  # wiil go through auxiliary classfifer
        block_3 = self.inception_block_4b_to_4d(
            block_2
        )  # wiil go through auxiliary classfifer
        block_4 = self.inception_block_4e_to_5b(block_3)
        if self.training:
            return self.fc(block_4), self.aux_1(block_2), self.aux_2(block_3)
        else:
            return self.fc(block_4)

GoogleNet Transfer learning

The full code for the transfer learning can be found in this Kaggle notebook. I will only be focusing on the model and will leave out sections that will deal with the dataset setup in this post.

model = googlenet(weights='IMAGENET1K_V1')

# freezing layers
for i in model.parameters():
    i.requires_grad = False


model.fc = nn.Linear(in_features=1024, out_features=num_classes)

Results

GoogleNet received a loss of 0.816 and an accuracy of 80.4%. This was a good result in terms of only having one trainable linear layer.

I trained the 19-layer VGG model to compare the results and the VGG model. The classifier bit of the model was trainable, and it included 3 Linear layers, 2 Dropout layers, and a ReLU activation function. This model can be seen in the VGG Kaggle Notebook. The 19-Layer VGG had a loss of 0.801 and an accuracy of 81.1%.

Conclusion

VGG and GoogleNet serve as advances in making CNN models deeper, more memory-efficient, and faster at inference time. These model architectures take two completely different approaches to this task, as the VGG architecture takes the route of using stacked 3X3 Convolutional layers. The GoogleNet models make use of the Inception module, which widens the network, allowing for better feature extraction.

Series

LeNet Implementation
Exploring pre-trained Convolutional layers
AlexNet Implementation
VGG vs Inception: Just how deep can they go? (you are here right now)
ResNet and Skip Connections
Autoencoder
Variational Autoencoders (VAEs)
Generative Adversarial Networks (GANs)