A Revised History of Deep Learning - Issue #1
Welcome to the first issue of the Deep Learning for Computer Vision newsletter. This newsletter will be a home for deep learning trends and ideas; we won't just list ideas, we will also dive deep into them. I plan to write one issue per week, but we will see how it goes!
Before leaping into the latest advances, let's take a step back into the history of deep learning from 1958 to 2022. It is good to understand how these things started.
1958: The Rise of Perceptrons
In 1958, Frank Rosenblatt invented the perceptron, a very simple machine that would later become the core and origin of today's intelligent machines.
The perceptron was a very simple binary classifier that could determine whether or not a given input image belonged to a given class. To achieve that, it used a unit step activation function: the output is 1 if the input is greater than 0, and 0 otherwise.
Here is the algorithm of the perceptron.
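To make the idea concrete, here is a minimal sketch of a perceptron with a unit step activation, trained with the classic perceptron learning rule (an illustrative NumPy example, not Rosenblatt's original formulation):

```python
import numpy as np

def unit_step(z):
    # Output 1 if the input is greater than 0, and 0 otherwise.
    return np.where(z > 0, 1, 0)

def train_perceptron(X, y, lr=0.1, epochs=10):
    # X: (n_samples, n_features), y: binary labels in {0, 1}
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_hat = unit_step(np.dot(w, xi) + b)
            # Perceptron learning rule: nudge weights toward the correct label.
            w += lr * (yi - y_hat) * xi
            b += lr * (yi - y_hat)
    return w, b

# Toy usage: learn a logical AND.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print(unit_step(X @ w + b))  # -> [0 0 0 1]
```

On this toy AND problem the weights converge after a handful of passes, which is exactly the kind of linearly separable task a single perceptron can handle.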
Frank Rosenblatt's intent was not to build the perceptron as an algorithm, but as a machine. The perceptron was implemented in hardware as the Mark I Perceptron, a purely electromechanical machine. It had 400 photocells (or photodetectors), its weights were encoded in potentiometers, and weight updates during learning were performed by electric motors. Below is the Mark I Perceptron.
Just as neural networks make headlines today, the perceptron was all over the news back then. The New York Times reported it was "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence". Today, we all know that machines still struggle to walk, talk, see, write, and reproduce themselves, and consciousness is a different story altogether!
The sole goal of the Mark I Perceptron was to recognize images, and at the time it could only distinguish two categories. It took time before people learned that adding more layers (a perceptron is a single-layer neural network) could give the network the ability to learn complex functions. That led to multi-layer perceptrons (MLPs).
1982~1986: Recurrent Neural Networks (RNNs)
A few years after multi-layer perceptrons showed potential in solving image recognition problems, people started thinking about how to model sequential data such as texts.
Recurrent Neural Networks are a class of neural networks designed to process sequences. Unlike feedforward networks such as multi-layer perceptrons (MLPs), RNNs have an internal feedback loop that remembers the state of information at each time step.
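To illustrate that feedback loop, here is a minimal sketch of a vanilla (Elman-style) RNN cell in NumPy; the names and sizes are illustrative rather than taken from any particular paper:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # The hidden state h is the feedback loop: it carries information
    # from previous time steps into the current one.
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Illustrative sizes: 8-dimensional inputs, 16-dimensional hidden state.
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(8, 16)) * 0.1
W_hh = rng.normal(size=(16, 16)) * 0.1
b_h = np.zeros(16)

h = np.zeros(16)
sequence = rng.normal(size=(5, 8))  # a sequence of 5 time steps
for x_t in sequence:
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)  # the same weights are reused at every step
```

The hidden state h is the only channel through which earlier inputs influence later ones, which is part of why long sequences are hard for this simple cell.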
The very first RNN cells appeared somewhere between 1982 and 1986, but they didn't attract much attention because simple RNN cells struggle with long sequences, mainly due to short memory and unstable gradients.
1998: LeNet-5, One of the Earliest Convolutional Neural Network Architectures
LeNet-5 is one of the earliest ConvNet architectures. It was used for document recognition in 1998 and was made of three parts: 2 convolution layers, 2 subsampling (pooling) layers, and 3 fully connected layers. There were no activation functions in the convolutional layers.
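As a rough sketch, that layout could be written in modern PyTorch as follows (a simplified rendition for illustration only; the original 1998 model used trainable subsampling coefficients, tanh-like squashing functions, and an RBF output layer):

```python
import torch.nn as nn

# Simplified LeNet-5-style layout: 2 conv layers, 2 pooling layers, 3 fully connected layers.
lenet5_like = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),   # C1: 32x32x1 -> 28x28x6
    nn.AvgPool2d(2),                  # S2: 28x28x6 -> 14x14x6
    nn.Conv2d(6, 16, kernel_size=5),  # C3: 14x14x6 -> 10x10x16
    nn.AvgPool2d(2),                  # S4: 10x10x16 -> 5x5x16
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120),       # C5/F5
    nn.Tanh(),
    nn.Linear(120, 84),               # F6
    nn.Tanh(),
    nn.Linear(84, 10),                # output: 10 digit classes
)
```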
As the paper says, LeNet-5 was deployed commercially and read several million checks every day. Below is the architecture of LeNet-5. The image is taken from its paper.
LeNet-5 was really influential at the time, but it (and ConvNets in general) didn't get much attention until about two decades later! It built upon earlier works such as the first-ever convolutional neural network by Fukushima, backpropagation (Rumelhart et al., 1986), and backpropagation applied to handwritten zip code recognition (LeCun et al., 1989).
1997: Long Short-Term Memory (LSTM)
Simple RNN cells cannot handle long sequences due to unstable gradients. LSTMs are a version of RNNs designed to handle long sequences; an LSTM is essentially an RNN cell upgraded with extra machinery for keeping information around.
The key design difference of the LSTM cell is that it has gates, which are the basis of its ability to control the flow of information over many time steps.
In short, an LSTM uses gates to control the flow of information from the current time step to the next in the following 4 ways (see the sketch after this list):
The input gate decides which new information from the current input should be written to the cell state.
The forget gate decides which information in the cell state (the long-term memory) is no longer relevant and should be discarded.
The cell-state update combines the outputs of the input and forget gates to produce the new cell state.
The output gate controls which information is exposed to the hidden state and passed on to the next time step.
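As a rough sketch, a single LSTM step with those gates can be written like this (standard LSTM equations in NumPy, with illustrative parameter names; exact variants differ across papers):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b hold the parameters of the four gates, stacked: input, forget, output, candidate.
    z = x_t @ W + h_prev @ U + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input, forget, output gates
    g = np.tanh(g)                                # candidate cell values
    c = f * c_prev + i * g                        # forget old information, write new information
    h = o * np.tanh(c)                            # output gate controls what is exposed
    return h, c
```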
The ability of LSTMs to handle long sequences made them a suitable neural network architecture for various sequential tasks such as text classification, sentiment analysis, speech recognition, image caption generation, and machine translation.
LSTMs are a powerful architecture, but they are computationally expensive. The GRU (Gated Recurrent Unit) was introduced in 2014 to address that; it has fewer parameters than the LSTM and often works just as well.
2012: ImageNet Challenge, AlexNet, and the Rise of ConvNets
It's almost impossible to talk about the history of neural networks and deep learning without talking about the ImageNet Large Scale Visual Recognition Challenge(ILSVRC) and AlexNet.
The main goal of the ImageNet challenge was to evaluate image classification and object recognition architectures on a large dataset. It led to lots of new, powerful, and interesting visual architectures, which we will review briefly.
The challenge started in 2010, but things changed in 2012 when AlexNet won with a top-5 error rate of 15.3%, nearly half the error rate of the previous winner. AlexNet was made of 5 convolution layers (some followed by max-pooling layers), 3 fully connected layers, and a softmax layer. AlexNet established the idea that deep convolutional neural networks can work well on visual recognition tasks. But at the time, things weren't deeper yet!
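For a sense of that shape, here is a simplified AlexNet-style sketch in PyTorch (layer sizes follow the commonly cited single-stream configuration; dropout and local response normalization are omitted for brevity):

```python
import torch.nn as nn

# Simplified AlexNet-style layout: 5 conv layers (some followed by max pooling),
# then 3 fully connected layers.
alexnet_like = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),  # 1000 ImageNet classes (softmax is applied in the loss)
)
```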
In the years that followed, ConvNet architectures kept getting bigger and working better. For example, VGG, which has 19 layers, achieved an error rate of 7.3% in the 2014 challenge. GoogLeNet (Inception-v1) took that further and reduced the error to 6.7%. In 2015, ResNet (Deep Residual Networks) extended that and reduced the error rate to 3.6%, showing that with residual connections we can train even deeper networks (of over 100 layers), something that wasn't possible before. People kept finding that deeper networks work better, which led to other new architectures such as ResNeXt, Inception-ResNet, DenseNet, Xception, etc. You can find the summary and implementation of those and other modern architectures here.
2014: Deep Generative Networks
Generative networks are used for generating or synthesizing new data samples, such as images and music, that resemble the training data.
There are many types of generative networks, but the most popular type is Generative Adversarial Networks (GANs), created by Ian Goodfellow in 2014. GANs are made of two main components: a generator that generates fake samples and a discriminator that distinguishes real samples from the fake samples produced by the generator. The generator and discriminator are complete adversaries. They are trained in alternation, and during training they play a zero-sum game. The generator continuously generates fake samples that try to fool the discriminator, while the discriminator tries hard to spot those fake samples (using real samples as reference). At every training iteration, the generator gets better at producing fake samples that are close to real, and the discriminator must raise its bar to tell unrealistic samples from real ones.
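A minimal sketch of that alternating zero-sum game might look like the following in PyTorch (the toy generator, discriminator, and "real" data below are placeholders, not any published GAN):

```python
import torch
import torch.nn as nn

# Toy example: the generator maps noise to fake 2-D samples, the discriminator
# scores samples as real (1) or fake (0). Both are stand-ins for real models.
G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(64, 2) + 3.0  # stand-in for real data

    # 1) Discriminator step: push real scores toward 1 and fake scores toward 0.
    fake = G(torch.randn(64, 16)).detach()
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Generator step: try to fool the discriminator into scoring fakes as real.
    fake = G(torch.randn(64, 16))
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```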
GANs have been one of the hottest topics in the deep learning community, best known for generating realistic-looking images of things that do not exist, and deepfakes. You can learn more about GANs here, here, and here. And if you are interested in the latest advances in GANs, you can read about StyleGAN2 or try the demos of DualStyleGAN, ArcaneGAN, and AnimeGANv2. For a complete list of GAN resources, check the Awesome GANs repository. The image below (taken from here) illustrates GANs.
GANs are one type of generative model. Other popular types of generative models are Variational Autoencoders (VAEs), autoencoders, and diffusion models.
2017: Transformers and Attention
It's 2017. The ImageNet Challenge is over. New ConvNet architectures have been made. Everyone in the computer vision community is happy with the current progress. Core computer vision tasks (image classification, object detection, image segmentation) are no longer as complicated as before. People can generate photorealistic images with GANs. NLP seems to be lagging behind. Then something pops up and headlines are made all over the web: a new neural network architecture based purely on attention. NLP is energized again, and a few years later attention goes on to dominate other modalities (most notably vision). No more recurrence and convolutions. The architecture is dubbed the Transformer.
Five years later, here I am writing about this huge invention. A Transformer is a class of neural networks based purely on attention mechanisms. The Transformer uses no recurrent networks or convolutions; it is instead made of multi-head attention, residual connections, layer normalization, fully connected layers, and positional encodings that preserve the order of the sequence. The image below illustrates the Transformer architecture.
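At the heart of the architecture is scaled dot-product attention. Here is a minimal single-head sketch in PyTorch (no masking or multi-head splitting; the shapes are illustrative):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, sequence_length, d_model)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # similarity of every query to every key
    weights = torch.softmax(scores, dim=-1)                   # attention weights sum to 1 over the sequence
    return weights @ v                                        # weighted sum of the values

# Illustrative usage: a batch of 2 sequences, 10 tokens each, 64-dimensional embeddings.
x = torch.randn(2, 10, 64)
out = scaled_dot_product_attention(x, x, x)  # self-attention: queries, keys, values all come from x
print(out.shape)  # torch.Size([2, 10, 64])
```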
The Transformer has completely revolutionized NLP and it is currently doing the same thing in computer vision. In NLP, it’s been used in machine translation, text summarization, speech recognition, text completion, document search, etc...
You can learn more about transformers in its paper Attention is All You Need.
2018 - 2020s (Today)
Since 2017, deep learning algorithms, applications, and techniques have skyrocketed. For clarity, the later developments are grouped by category. In each category, we revisit the key trends and some of the most important breakthroughs.
Vision Transformers
Soon after transformers showed significant performance in NLP, some awesome people were eager to bring attention to images. In the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, a group of Google researchers showed that a standard transformer, slightly modified to operate directly on a sequence of image patches, could produce substantial results on image classification datasets. They dubbed their architecture the Vision Transformer (ViT), and it holds records on a number of computer vision benchmarks (as of this writing, ViT is the state-of-the-art classification model on CIFAR-10).
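The key step is turning an image into a sequence of patch embeddings that a standard transformer encoder can consume. A hedged sketch of that step in PyTorch (patch size and dimensions are illustrative; ViT also adds a class token and positional embeddings):

```python
import torch
import torch.nn as nn

# Turn a 224x224 RGB image into a sequence of 16x16 patches, each projected to a vector.
# A strided convolution is a compact way to do "split into patches + linear projection".
patch_size, embed_dim = 16, 768
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

images = torch.randn(1, 3, 224, 224)
patches = patch_embed(images)                # (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)  # (1, 196, 768): 196 patch tokens, like words in a sentence
print(tokens.shape)
```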
The ViT designers were not the first people to try to use attention in recognition tasks. An early example is found in the paper Attention Augmented Convolutional Networks, which sought to combine self-attention and convolutions (the motivation for eventually getting rid of convolutions comes largely from the spatial inductive biases that CNNs introduce). Another example is found in the paper Visual Transformers: Token-based Image Representation and Processing for Computer Vision, which ran a transformer on filter-based tokens, or visual tokens. Those two papers and many others not listed here pushed the boundaries of some baseline architectures (mostly ResNet) but didn't beat the benchmarks of the time. ViT was really one of the greatest papers; one of its greatest insights was to use image patches as the input representation. The designers didn't change much else about the transformer architecture.
The image is taken from the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
In addition to using image patches, the traits that made vision transformers a powerful architecture are the extreme parallelism of the transformer and its scaling behavior. But like everything in life, nothing is perfect: in the beginning, ViT didn't perform well on downstream vision tasks (object detection and segmentation).
It was after the introduction of the Swin Transformer that vision transformers started being used as backbone networks for downstream vision tasks such as object detection and image segmentation. The central highlight of the Swin Transformer's performance is its use of shifted windows between consecutive self-attention layers. The image below depicts how the Swin Transformer builds hierarchical feature maps, in contrast to the Vision Transformer (ViT).
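To give a flavor of the shifted-window idea, here is a small illustrative sketch of partitioning a feature map into non-overlapping windows and cyclically shifting it between layers (a simplification of what the Swin paper does; the window size and shapes are made up for the example):

```python
import torch

def window_partition(x, window_size):
    # x: (batch, height, width, channels) -> (num_windows * batch, window_size, window_size, channels)
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

x = torch.randn(1, 56, 56, 96)    # an early-stage feature map
windows = window_partition(x, 7)  # self-attention is computed inside each 7x7 window
print(windows.shape)              # torch.Size([64, 7, 7, 96])

# In the next layer, the feature map is cyclically shifted before partitioning,
# so information can flow across the previous window boundaries.
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))
shifted_windows = window_partition(shifted, 7)
```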
Vision transformers have been one of the most exciting research areas of recent times. There are many vision transformer papers we could talk about here, but you can learn more about them in the paper Transformers in Vision: A Survey. Other recent vision transformers you might check out (optional) are CrossViT, ConViT, and SepViT.
Vision and Language (V-L) Models
Vision and language models are often referred to as multimodal models. They are models that involve both vision and language, for tasks such as text-to-image generation (given text, generate images that match the description), image captioning (given an image, generate its description), and visual question answering (given an image and a question about what's in it, generate the answer). The success of transformers in both the vision and language domains has largely contributed to treating multimodal models as a single unified network.
Virtually all vision and language tasks leverage pre-training techniques. In computer vision, that means fine-tuning a network that was pretrained on a large dataset (usually ImageNet); in NLP, it often means fine-tuning a pretrained BERT. To learn more about pretraining in V-L tasks, read the paper A Survey of Vision-Language Pre-Trained Models. For a general overview of vision and language tasks and datasets, check the paper Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods.
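As a quick illustration of what fine-tuning a pretrained vision backbone looks like in practice, here is a generic PyTorch/torchvision sketch (not tied to any particular V-L paper; note that the pretrained-weights argument name varies across torchvision versions):

```python
import torch.nn as nn
import torchvision

# Start from a ResNet-50 pretrained on ImageNet, then swap the classifier head
# for a new task. Here only the new head is trained, as a simple example.
model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
for p in model.parameters():
    p.requires_grad = False                     # freeze the pretrained backbone
model.fc = nn.Linear(model.fc.in_features, 10)  # new head for a 10-class downstream task
# The new head's parameters are trainable by default; train with your usual loop and optimizer.
```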
Just last week (as of writing this), OpenAI released DALL·E 2 (a revamped DALL·E), a vision-language model that can generate photorealistic images from text. There are many existing text-to-image models, but the resolution, caption-image match, and photorealism of DALL·E 2 are quite excellent. DALL·E 2 is not available to the public yet, but you can join the waitlist. Below are examples of some images created by DALL·E 2.
The DALL·E 2 generated images presented above are taken from OpenAI people such as @sama, @ilyasut, @model_mechanic, and openaidalle. For more technical details about DALL·E 2, you can read its paper.
Large Language Models (LLMs)
Language models are used for many purposes. They can be used to predict the next word or character in a sentence, summarize a document, translate text from one language to another, recognize speech, or convert text to speech.
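As a tiny illustration of next-word prediction, here is a sketch using the Hugging Face transformers library and the publicly available GPT-2 checkpoint (an illustrative example, not tied to the models discussed below):

```python
from transformers import pipeline

# Next-word prediction with a small publicly available language model (GPT-2).
generator = pipeline("text-generation", model="gpt2")
print(generator("Deep learning is", max_new_tokens=5)[0]["generated_text"])
```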
Whoever brought us transformers could be blamed for how far language models have gone in terms of parameter count (though really no one is to blame; transformers are one of the greatest inventions of the 2010s, and it's just shocking (and amazing) that large models keep working better when given enough data and compute). Over the last 5 years, language models have kept growing in size.
It all started a year after the paper Attention Is All You Need. In 2018, OpenAI released GPT (Generative Pre-trained Transformer), one of the largest language models of its time. A year later, OpenAI released GPT-2, a model with 1.5 billion parameters. Another year later, they released GPT-3, which has 175 billion parameters and was trained on 570GB of text. With 175B parameters, the whole model is about 700GB. To understand how big GPT-3 is: according to Lambda Labs, it would require 366 years and $4.6M to train it on the lowest-priced GPU cloud on the market!
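That 700GB figure is roughly the raw parameter storage at 32-bit precision, as a quick back-of-the-envelope check shows (assuming 4 bytes per parameter; half-precision storage would roughly halve it):

```python
params = 175e9           # 175 billion parameters
bytes_per_param = 4      # 32-bit floats (fp16 would be 2 bytes per parameter)
size_gb = params * bytes_per_param / 1e9
print(f"{size_gb:.0f} GB")  # -> 700 GB
```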
The GPT-n series models were just the beginning. There have been other models close in size to GPT-3, or even bigger. Some examples: NVIDIA's Megatron-LM has 8.3B parameters. DeepMind's recent Gopher has 280B parameters. Just last week (April 12th, 2022), DeepMind released another language model, dubbed Chinchilla, with 70B parameters, which outperforms many language models despite being smaller than Gopher, GPT-3, and Megatron-Turing NLG (530B parameters). The Chinchilla paper showed that existing language models are undertrained: whenever the model size is doubled, the amount of training data should be doubled too. But then, in almost the same week, came Google's Pathways Language Model (PaLM) with 540 billion parameters!
Disclaimer: I don't work much in NLP. There may be other important and bigger language models whose existence I'm not aware of.
For more about the latest trends in large language models, check the awesome thread by Papers With Code.
Code Generation Models
Code generation is a task that involves completing a given piece of code or generating an entire working function from natural language. Put simply, these are AI systems that can write computer programs. As you can guess, modern code generators are based on the Transformer.
We wouldn't be wrong to say that people have long dreamed of letting computers write their own programs (like everything else we dream of teaching computers to do, even the things we know are moonshots), but code generators really got attention after OpenAI released Codex. Codex is GPT-3 fine-tuned on public GitHub repositories and other public source code. OpenAI said, "OpenAI Codex is a general-purpose programming model, meaning that it can be applied to essentially any programming task (though results may vary). We've successfully used it for transpilation, explaining code, and refactoring code. But we know we've only scratched the surface of what can be done." Codex now powers GitHub Copilot, an AI pair programmer.
After I got access to Copilot, I was amazed by its capabilities. As someone who doesn't write Java programs, I used it to prepare for my mobile applications (Java) exam. It's pretty cool that an AI helped me prepare for an academic exam!
A few months after OpenAI released Codex, DeepMind released AlphaCode, a transformer-based language model that can solve competitive programming questions. The AlphaCode release blog post says, "AlphaCode achieved an estimated rank within the top 54% of participants in programming competitions by solving new problems that require a combination of critical thinking, logic, algorithms, coding, and natural language understanding." Solving programming questions (or competitive programming in general) is super hard (anyone who has done tech interviews can agree with that), and as Dzmitry said, beating "human level is still light years away".
And just last week, scientists from Meta AI released InCoder, a generative model that can generate and edit programs.
More papers and models about code generation can be found here.
Perceptrons Again
For a long time before ConvNets and transformers rose, deep learning revolved around perceptrons. ConvNets then showed excellent performance in various recognition tasks, replacing MLPs, and with what vision transformers are currently showing, they too seem to be a promising architecture. But did perceptrons die completely? Probably not.
In the exact same month, May 2021, two perceptron-based papers were released. One is MLP-Mixer: An all-MLP Architecture for Vision and the other is Pay Attention to MLPs (gMLP).
MLP-Mixer claimed that neither convolutions nor attention are necessary. Using only multi-layer perceptrons (MLPs), it achieved high accuracy on image classification datasets. The important highlight of MLP-Mixer is that it contains two main types of MLP layers: one applied independently to each image patch (channel mixing) and one applied across patches (spatial or token mixing).
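Here is a small, hedged sketch of a Mixer-style block in PyTorch (dimensions are illustrative; the real MLP-Mixer prescribes specific hidden sizes and stacks many such blocks):

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    # A simplified Mixer-style block: one MLP mixes information across patches (tokens),
    # another mixes information across channels within each patch.
    def __init__(self, num_patches=196, dim=512, token_hidden=256, channel_hidden=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_patches, token_hidden), nn.GELU(), nn.Linear(token_hidden, num_patches))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim))

    def forward(self, x):                        # x: (batch, num_patches, dim)
        y = self.norm1(x).transpose(1, 2)        # (batch, dim, num_patches): mix across patches
        x = x + self.token_mlp(y).transpose(1, 2)
        return x + self.channel_mlp(self.norm2(x))  # mix across channels within each patch

tokens = torch.randn(2, 196, 512)
print(MixerBlock()(tokens).shape)  # torch.Size([2, 196, 512])
```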
gMLP also showed that by avoiding self-attention and convolutions (the current de facto choices in NLP and vision), one can achieve great accuracy on various image recognition and NLP tasks.
You’re obviously not going to get state-of-the-art performance using MLPs but it’s fascinating how comparable they are to state-of-the-art deep networks.
ConvNets Again: A ConvNet for the 2020s
Since the introduction of the Vision Transformer (in 2020), research in computer vision has revolved around transformers (in NLP, the transformer is already the norm). The vanilla Vision Transformer (ViT) achieved state-of-the-art results on image classification but didn't work well for downstream vision tasks (object detection and segmentation). With the introduction of the Swin Transformer, it didn't take long for vision transformers to take over downstream vision tasks too.
Many people (including myself) love ConvNets. ConvNets just work, and it's hard to let go of something that works. This love of the architecture led some awesome scientists to go back in time and study how to modernize ConvNets (ResNet, to be precise) with the traits that make vision transformers so appealing. In particular, they explored the question: how do design decisions in transformers impact ConvNets' performance?
Saining Xie and his colleagues at Meta AI followed a roadmap they state clearly in the paper and ended up with a ConvNet architecture dubbed ConvNeXt. ConvNeXt achieves results comparable to the Swin Transformer on different benchmarks. You can learn more about the roadmap they took through ModernConvNets (a summary and implementation of modern CNN architectures) and through the paper itself, which I think is very well written!
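To give a taste of where that roadmap lands, here is a hedged sketch of a ConvNeXt-style block in PyTorch (simplified: the paper's version also includes layer scale and stochastic depth):

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    # Transformer-inspired ConvNet block: a large depthwise conv (a big receptive field,
    # echoing self-attention), then an inverted-bottleneck MLP with GELU, plus a residual connection.
    def __init__(self, dim=96):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # depthwise 7x7
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # pointwise expansion (1x1 conv as a Linear)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)  # pointwise projection back

    def forward(self, x):                       # x: (batch, dim, height, width)
        residual = x
        x = self.dwconv(x).permute(0, 2, 3, 1)  # to channels-last for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        return residual + x.permute(0, 3, 1, 2) # back to channels-first

print(ConvNeXtBlock()(torch.randn(1, 96, 56, 56)).shape)  # torch.Size([1, 96, 56, 56])
```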
The reason I chose to give this single paper its own section is not only that it is entitled A ConvNet for the 2020s, or that it presents a great mapping between transformers and ConvNets, but also that it makes a great end to this issue.
Conclusion
Deep learning is a huge and vibrant field, and it's very hard to outline everything that has happened; I have only scratched the surface. There are more papers than anyone could possibly read, and it's hard to keep track of everything. For example, we didn't talk about reinforcement learning and systems such as AlphaGo, protein folding with AlphaFold (one of the biggest scientific breakthroughs), the evolution of deep learning frameworks (such as TensorFlow and PyTorch), hardware accelerators, Graph Neural Networks (GNNs), geometric deep learning, etc. And there are surely other important parts of deep learning history, algorithms, and applications that we didn't talk about.
As a little disclaimer and as you may have noticed, I am biased toward deep learning for computer vision. There are probably other important deep learning techniques designed exclusively for NLP that I didn’t touch on because I am not aware of them.
Also, it's hard to know exactly when a particular technique was published or who published it first, because most new things tend to be inspired by prior works. With that said, this outline of the history reflects my own perspective, and there may be dates or years that I didn't get exactly right.
I plan to write more about the latest trends in deep learning and computer vision. If you would like to stay on top of the latest techniques and algorithms, be sure to join us. You can also follow me on Twitter at @Jeande_d.
Thanks for reading!
----------------------------------
P.S AI supermodels <....> I know how the brain works.
----------------------------------
Cover photo: Frank Rosenblatt. Image by Cornell Rare and Manuscript Collections, taken from IEEE Spectrum. Colorized with DeepAI Image Colorization API.