AlexNet: My introduction to Deep Computer Vision models
This is part 2, following my previous post on LeNet-5, where we implemented and trained it on the MNIST dataset. This post walks through the AlexNet architecture and how different frameworks implement it. After that, we will use transfer learning so that AlexNet can learn the Caltech-256 dataset.
The paper ImageNet Classification with Deep Convolutional Neural Networks is known as the catalyst for the CNN boom of the early 2010s. This paper introduced several improvements to the CNN architecture, such as:
- ReLU as an activation function for the Convolutional layers
- Large filters for enhanced feature extraction
- Overlapping pooling
- Dropout layers to regularize the model
ReLU and Weight initialization
The AlexNet authors compared training a CNN with ReLU and tanh activation functions and observed that the model converged faster with ReLU. This led to ReLU being used as the activation function throughout the whole model.
The use of ReLU has some drawbacks, such as the dying ReLU problem, which causes dead neurons. The dying ReLU (or dead neuron) problem occurs when the linear transformation inside a neuron produces a negative value, so applying ReLU turns the output into zero. From the ReLU formula, f(x) = max(0, x), we can see that a neuron outputs zero whenever its input is negative. If this happens for every input, the neuron is considered dead: the gradient through it is also zero, so its weights are not updated during backpropagation.
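To make the dead-neuron behaviour concrete, here is a minimal sketch with a single hypothetical neuron. The weight, bias, and input are made-up values chosen so that the pre-activation is negative; this is an illustration, not code from the paper.

```python
import torch

# A single hypothetical neuron: z = w * x + b, followed by ReLU
w = torch.tensor(-1.0, requires_grad=True)
b = torch.tensor(-0.5, requires_grad=True)
x = torch.tensor(2.0)

z = w * x + b          # pre-activation is -2.5, i.e. negative
out = torch.relu(z)    # ReLU clamps the negative value to zero
out.backward()         # backpropagate through the neuron

# Both gradients are zero, so a gradient step would leave w and b unchanged
print(out.item(), w.grad.item(), b.grad.item())
```

As long as the pre-activation stays negative for every input, no gradient reaches the weights and the neuron cannot recover on its own.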
I bring this point up because the dying ReLU problem can be mitigated when the model weights are initialized properly for that particular activation function. The choice of weight initialization can profoundly impact how fast a model trains. The right initialization keeps most pre-activations from starting out deep in the negative range, which mitigates the dead neuron problem and lets all neurons contribute. Poor weight initialization, on the other hand, can cause the model to spend the initial epochs just trying to learn basic features.
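As one possible illustration, the sketch below applies He (Kaiming) initialization, a scheme designed for ReLU networks. This is my choice for the example and an assumption on my part; it is not the initialization used in the AlexNet paper. The helper name `init_relu_weights` and the small two-layer network are hypothetical.

```python
import torch
from torch import nn


def init_relu_weights(module):
    """Hypothetical helper: He initialization for conv and linear layers."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)


torch.manual_seed(0)
# Small stand-in network, not AlexNet itself
net = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256), nn.ReLU())
net.apply(init_relu_weights)

# Fraction of outputs that are non-zero for a random batch
x = torch.randn(64, 256)
active_fraction = (net(x) > 0).float().mean().item()
print(f"active outputs: {active_fraction:.2f}")
```

With zero biases and symmetric weights, roughly half of the outputs should be active at the start of training, which is the kind of healthy starting point the paragraph above describes.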
Overlapping Pooling
Standard pooling uses a stride equal to the pooling window size, so neighbouring windows never share inputs. Overlapping pooling is when the stride is smaller than the pooling size; AlexNet uses 3x3 windows with a stride of 2. Because neighbouring windows share a row or column of inputs, each pooled feature map retains slightly more information from the convolutional layer. Overlapping pooling does not appear as frequently in later papers, or at least in the ones I have read.
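The difference is easier to see on a tiny input. The sketch below compares non-overlapping pooling (2x2 window, stride 2) with AlexNet-style overlapping pooling (3x3 window, stride 2) on a hypothetical 5x5 feature map filled with the values 0 to 24.

```python
import torch
from torch import nn

# Hypothetical 5x5 feature map with values 0..24, shaped (batch, channels, H, W)
x = torch.arange(25, dtype=torch.float32).reshape(1, 1, 5, 5)

non_overlapping = nn.MaxPool2d(kernel_size=2, stride=2)  # window == stride
overlapping = nn.MaxPool2d(kernel_size=3, stride=2)      # window > stride

plain_out = non_overlapping(x)
overlap_out = overlapping(x)

# Both outputs are 2x2, but the 3x3 windows share the middle row and column
print(plain_out.squeeze())
print(overlap_out.squeeze())
```

Both layers downsample to the same output size here, but the 2x2 windows ignore the last row and column of the input, while the overlapping 3x3 windows cover the whole map.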
Dropout Layers
To stop the model from overfitting, the authors added dropout to the first two dense layers. Dropout works by setting the output of each neuron to zero at random with a given probability. For example, if we set the dropout probability to 0.2 on a dense layer of size 10, then on average 2 neurons will have their output zeroed in each forward pass. This stops neurons from relying on the presence of particular other neurons, which pushes the network to learn more robust features. Dropout layers are only active during model training and are deactivated during inference.
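Here is a small sketch of how a dropout layer behaves in training versus inference mode in PyTorch, using the same probability of 0.2 and a layer output of size 10 from the example above. The all-ones input is a made-up value to make the effect easy to read.

```python
import torch
from torch import nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.2)
x = torch.ones(1, 10)  # stand-in for the output of a dense layer of size 10

drop.train()           # training mode: each unit is zeroed with probability 0.2
train_out = drop(x)    # surviving units are scaled by 1 / (1 - 0.2) = 1.25

drop.eval()            # inference mode: dropout is a no-op
eval_out = drop(x)

print(train_out)
print(eval_out)
```

Each unit is dropped independently, so the number of zeroed units varies between forward passes and is 2 only on average. PyTorch rescales the surviving units during training so that the expected output matches what the next layer sees at inference time.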
My implementation
From my initial understanding, I had come to the conclusion that this was the model described in the paper.
from torch import nn


class AlexNet(nn.Module):
    def __init__(self, num_labels, dense_layer_num=4096):
        super().__init__()
        # Lazy layers infer their input size on the first forward pass
        self.c1 = nn.LazyConv2d(96, 11, stride=4)  # the paper uses a stride of 4 here
        self.c2 = nn.LazyConv2d(256, 5)
        self.c3 = nn.LazyConv2d(384, 3)
        self.c4 = nn.LazyConv2d(384, 3)
        self.c5 = nn.LazyConv2d(256, 3)
        # Overlapping pooling: 3x3 window, stride 2
        self.pool = nn.MaxPool2d(3, 2)
        self.flat = nn.Flatten()
        # Dropout
        self.drop = nn.Dropout(p=0.5)
        self.relu = nn.ReLU()
        # Local response normalization with the paper's n=5, k=2, alpha=1e-4, beta=0.75
        self.norm = nn.LocalResponseNorm(5, alpha=1e-4, beta=0.75, k=2)
        # Dense layers
        self.d1 = nn.LazyLinear(dense_layer_num)
        self.d2 = nn.LazyLinear(dense_layer_num)
        self.d3 = nn.LazyLinear(num_labels)

    def forward(self, x):
        # ReLU follows every convolutional layer
        x = self.relu(self.c1(x))
        x = self.norm(x)
        x = self.pool(x)
        x = self.relu(self.c2(x))
        x = self.norm(x)
        x = self.pool(x)
        x = self.relu(self.c3(x))
        x = self.relu(self.c4(x))
        x = self.relu(self.c5(x))
        x = self.pool(x)
        # Flattening and entering the dense layers
        x = self.flat(x)
        x = self.d1(x)
        x = self.drop(x)
        x = self.relu(x)
        x = self.d2(x)
        x = self.drop(x)
        x = self.relu(x)
        return self.d3(x)
Not bad, right? This screams junior software developer.
My version of the model was a pure combination of all the parameters that were mentioned in the paper. It did not register in my head that I had completely overlooked some key aspects. The biggest one was the structure of the model: the authors of the paper had essentially split the model into two halves trained on two GPUs. The two halves exchanged data only at certain layers, namely the third convolutional layer and the dense (fully connected) layers. This can be seen in the image below.
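On a single device, one common way to express this kind of two-way split is a grouped convolution, where each half of the filters only sees half of the input channels. The sketch below is an assumption-laden illustration rather than the authors' code: the channel counts (96 in, 256 out, 5x5 kernel) mirror the second convolutional layer, and the 27x27 input size is a made-up value.

```python
import torch
from torch import nn

# Regular convolution: every filter sees all 96 input channels
full_conv = nn.Conv2d(96, 256, kernel_size=5)
# Grouped convolution: two independent "towers", each mapping 48 -> 128 channels
two_tower_conv = nn.Conv2d(96, 256, kernel_size=5, groups=2)

x = torch.randn(1, 96, 27, 27)  # hypothetical feature map
full_out = full_conv(x)
tower_out = two_tower_conv(x)

full_params = sum(p.numel() for p in full_conv.parameters())
tower_params = sum(p.numel() for p in two_tower_conv.parameters())

print(full_out.shape, tower_out.shape)
print(full_params, tower_params)
```

Both layers produce an output of the same shape, but the grouped version has roughly half the weights because no filter crosses the boundary between the two towers.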
I decided to read the paper One weird trick for parallelizing convolutional neural networks because it was mentioned in the PyTorch documentation.
The paper discusses parallelizing model training from both data and model perspectives. The CNN layers were trained using data parallelism, while the Dense layers were trained using model parallelism.
I particularly did not understand the part that stated
The main difference is that it consists of one “tower” instead of two
Did the author mean that the model took only one half of the original architecture and dropped the other half, or that the two halves were combined into one big model with all the parameters together?
So, I decided to review the PyTorch implementation of the model on GitHub and noticed that it was a peculiar combination of both approaches. The first convolutional layer had 64 filters, and I could not find where this value came from. The rest of the model seemed to mirror the original implementation.
I dug into the Caffe implementation of AlexNet and noticed it resembled my original implementation with a slight difference in the padding of outputs.
This somehow reassured me that I did not completely mess up in some aspects. I also needed to pay better attention to the output of my model to see if padding was implemented on particular Conv Layers, because the paper does not explicitly state anything.
I noticed that in the original paper, the model took 6 days to train on two GPUs. In the best case, where 8 GPUs were used, the model took 15.91 hours to train. So I decided to use the pre-trained weights from the PyTorch implementation and apply transfer learning to the Caltech-256 dataset.
Transfer Learning
In all honesty, I strongly advise against training deep neural networks like AlexNet from scratch, and that applies to several of the models in this series. AlexNet was trained on the massive ImageNet dataset, which is 130 gigabytes, using multiple GPUs. I lack these kinds of computational resources, which makes training from scratch impractical on all counts. A more approachable option is to use a pre-trained model through transfer learning or fine-tuning. I say this from experience, having spent weeks and countless weekend evenings trying to get satisfactory results from training custom deep CNNs from scratch.
And on that note, let us use transfer learning to teach AlexNet how to classify Caltech-256.
I will only be showing the code snippets, but you can find the full code in this Kaggle Notebook.
Setting up the Model
This section just goes over initialising the model and freezing the initial layers.
import torch
from torch import nn
from torchvision.models import alexnet

model = alexnet(weights='IMAGENET1K_V1')
# Freeze every pre-trained layer
for param in model.parameters():
    param.requires_grad = False
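This part of the code replaces the original classifier with a new, smaller one, which is the only part of the model that gets trained at first.
# New classifier head; 9216 = 256 * 6 * 6 flattened features from the convolutional layers
classifier = nn.Sequential(
    nn.Dropout(0.5),
    nn.Linear(9216, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(inplace=True),
    nn.Dropout(0.3),
    nn.Linear(128, num_classes)
)
model.classifier = classifier
# Only the new head is trainable
for param in model.classifier.parameters():
    param.requires_grad = True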
Cross-entropy loss was used with the Adam optimiser and a learning rate scheduler that lowers the learning rate when training plateaus.
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.classifier.parameters())
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=2)
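To show how the scheduler reacts to a plateau, here is a minimal sketch that feeds it a made-up sequence of validation losses. The tiny linear model and the loss values are hypothetical stand-ins, not numbers from my training run; the scheduler uses PyTorch's default reduction factor of 0.1.

```python
import torch
from torch import nn

model = nn.Linear(4, 3)  # hypothetical stand-in for the classifier head
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=2)

# Made-up validation losses that stop improving after the second epoch
fake_val_losses = [1.0, 0.9, 0.9, 0.9, 0.9, 0.9]

lrs = []
for val_loss in fake_val_losses:
    # In a real loop, training and validation for one epoch would happen here
    scheduler.step(val_loss)  # the scheduler watches the validation loss
    lrs.append(optimizer.param_groups[0]["lr"])

print(lrs)
```

With `patience=2`, the scheduler tolerates two epochs without improvement and cuts the learning rate on the third, so the value passed to `scheduler.step` needs to be the metric you actually care about, such as the validation loss.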
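The model was trained for 15 epochs with the pre-trained layers frozen; then all the layers were unfrozen and trained for an extra 6 epochs.
for param in model.parameters():
    param.requires_grad = True
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Rebuild the scheduler so that it tracks the new optimizer
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=2)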
Results
The results of the training were subpar; we got a validation accuracy of 64.5% and a test accuracy of 65.9%. There are two likely reasons for this. The first is that the Caltech-256 dataset is relatively small, so there might not have been enough examples for the model to learn from. Data augmentation was applied through PyTorch's AutoAugment() function, but this was not a strong enough regularization method.
The second reason might have been that AlexNet was too deep a model for the dataset. When I used the convolutional layers as frozen feature extractors, I got at most 60% accuracy. When I unfroze all the layers and trained them together with the classifier, the training process became very slow. Using the convolutional layers as feature extractors seemed to be the most efficient way to train the model. I could possibly get a better model if I trained the classifier a bit longer and made adjustments to the model.
At some point during the transfer learning process, I thought I was doing something wrong, but when I tried using ResNet-18, I got better results and faster convergence. So the subpar results might have been because of the model, or I just needed to experiment more with my transfer learning setup.
Conclusion
The original AlexNet paper initiated the deep learning boom, leading to larger or, in this case, deeper models and improved architectures. It was a pioneer in terms of both parameter count and top-1 and top-5 scores on the ImageNet dataset. Going through the different implementations and papers was a great but frustrating experience.
Useful links
ImageNet Classification with Deep Convolutional Neural Networks
Amazing Notes on CNNs from Stanford
Great video talking about AlexNet and ML
Series
- LeNet Implementation
- Exploring pre-trained Convolutional layers
- AlexNet: My introduction to Deep Computer Vision models (You are here right now)
- VGG vs Inception: Just how deep can they go?
- ResNet and Skip Connections
- Autoencoder
- Variational Autoencoders (VAEs)
- Generative Adversarial Networks (GANs)