Flamingo: A Visual Language Model for Few-Shot Learning - Issue #3
In last week's newsletter, we talked about Pix2seq, an elegant object detection framework that takes inspiration from language models. This week, we will talk about Flamingo, a single unified visual language model from DeepMind that can perform a wide range of vision and language tasks without being fine-tuned on those particular tasks, an approach typically called few-shot learning.
As always, we will cover the motivation behind the paper, walk through the architecture, and shed light on the results.
Motivation
Transfer learning and fine-tuning are the norm in the deep learning community: rather than training a vision or language model from scratch, you adapt a pre-trained model to the new task. But fine-tuning comes with a challenge. The number of samples you need is indeed smaller than when training from scratch, but it is still large, roughly thousands of samples. And those samples still require annotation, which is not easy.
Fine-tuning works great provided you have enough data for the new task, but how can we solve a new task without having to collect and annotate thousands of samples? This problem is often addressed by techniques called zero-shot learning and few-shot learning. In brief, zero-shot learning lets you perform a completely new task without fine-tuning, such as recognizing categories of images that the classifier has never seen during training. In few-shot learning, as the name suggests, you only need a few samples as input prompts to perform a new task. Few-shot learning has been applied to vision-specific tasks (such as image classification) and language-specific tasks (such as machine translation), but most joint vision-and-language (multimodal) tasks still require pre-training on large datasets and fine-tuning on the target task.
Enter Flamingo, a state-of-the-art visual language model that can perform various joint vision-language tasks such as visual question answering, visual dialogue, and image captioning using few-shot learning. For example, given two images and their captions, Flamingo can use them to predict the caption of an unseen image. Two shots!
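To make that concrete, here is a rough sketch of how such a two-shot captioning prompt could be laid out, with image placeholders interleaved with text. The "<image>" tag and the captions are invented for illustration; this is not the paper's exact prompt format.

```python
# Hypothetical two-shot captioning prompt: two (image, caption) support examples
# followed by the query image. In the real model, actual image features (not a
# literal "<image>" string) are interleaved with the text tokens.
prompt = (
    "<image> A dog catching a frisbee in the park. "
    "<image> A plate of pasta with tomato sauce. "
    "<image> "  # the unseen image; the model continues with its predicted caption
)
print(prompt)
```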
Flamingo takes inspiration from recent NLP work on few-shot learning such as Google's PaLM, DeepMind's Chinchilla, and GPT-3 (Language Models are Few-Shot Learners).
One of the intriguing things about Flamingo is that it can handle inputs where images, videos, and text are arbitrarily interleaved.
In summary, the main contribution of Flamingo is using few-shot learning to perform numerous vision-language tasks such as visual question answering. Flamingo achieved state-of-the-art few-shot results on almost all of the vision-language benchmarks it was evaluated on. We will take a look at the results in a later section.
Flamingo Architecture
The Flamingo visual language model takes visual and text inputs and produces text outputs. The image below, taken from the paper, shows an overview of its architecture.
The main components of Flamingo are:
Vision Encoder: a pre-trained NFNet (a ResNet variant that is free from batch normalization) that provides rich feature representations. As the paper states, NFNet was chosen as the backbone because it offers a great trade-off between performance and efficiency. The vision encoder is frozen during training so its learned weights are not changed.
Transformer Decoder: made of pre-trained text-only language model blocks (Chinchilla in the largest Flamingo) interleaved with new cross-attention blocks that are trained from scratch and take the output of the Perceiver Resampler as input. Just like the vision encoder, the language model is frozen while the whole architecture is trained.
Perceiver Resampler: takes the large, variable-sized feature maps from the vision encoder and produces a fixed number of visual tokens, connecting the vision encoder to the language model. As the paper states, resampling the feature maps significantly reduces the computational complexity.
The vision encoder, perceiver resampler, and language model are the main components of Flamingo; the architecture contains other minor components, which you can find in the paper. A rough sketch of how the main pieces fit together follows below.
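Here is a minimal PyTorch-style sketch of a Perceiver-style resampler and a gated cross-attention block, assuming toy dimensions and module names of my own choosing. It is a simplified illustration of the wiring described above, not the paper's actual implementation.

```python
# A minimal sketch of the Flamingo-style wiring: a Perceiver-style resampler
# squeezes visual features into a fixed set of tokens, and a zero-initialized
# tanh gate lets new cross-attention layers be added to a frozen LM without
# disturbing it at initialization. Dimensions, names, and the simplified blocks
# are assumptions for illustration, not the paper's code.
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Compress a variable number of visual features into a fixed set of visual tokens."""
    def __init__(self, dim=1024, num_latents=64, heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))  # learned queries
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, visual_feats):  # visual_feats: (B, N, dim); N varies with image/video size
        q = self.latents.unsqueeze(0).expand(visual_feats.size(0), -1, -1)
        x, _ = self.attn(q, visual_feats, visual_feats)  # latents cross-attend to visual features
        return x + self.ff(x)                            # -> (B, num_latents, dim), fixed size

class GatedCrossAttentionBlock(nn.Module):
    """New block trained from scratch; injects visual tokens into the frozen LM's text stream."""
    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0, so the frozen LM is untouched at init

    def forward(self, text_hidden, visual_tokens):
        attended, _ = self.attn(text_hidden, visual_tokens, visual_tokens)
        return text_hidden + torch.tanh(self.gate) * attended

if __name__ == "__main__":
    feats = torch.randn(2, 196, 1024)             # stand-in for frozen NFNet features of 2 images
    tokens = PerceiverResampler()(feats)          # -> (2, 64, 1024) visual tokens
    text = torch.randn(2, 32, 1024)               # stand-in for hidden states of the frozen LM
    fused = GatedCrossAttentionBlock()(text, tokens)
    print(tokens.shape, fused.shape)
```

In the full model, several such gated blocks are interleaved with the frozen language model layers, and only the resampler and the new blocks receive gradients; the zero-initialized gate means the frozen language model's behavior is unchanged at the start of training.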
Flamingo is trained on three kinds of datasets: image-text pairs, video-text pairs, and the MultiModal MassiveWeb (M3W) dataset of interleaved images and text. These datasets are scraped from the web and are not manually annotated. You just throw raw images, text, and videos at the Flamingo model and that's it.
Also, for generalization reasons, Flamingo is not trained on data from the downstream or target tasks.
More details about the architecture of Flamingo and the training dataset can be found in the paper.
Flamingo is a few-shot learner in vision-language, just as GPT-3 is a few-shot learner in language.
Results on Benchmark Datasets
The overall goal of Flamingo is to perform well on diverse and challenging vision-language tasks. The authors evaluated their model on 16 different benchmark datasets. Across all 16 datasets, Flamingo outperformed all previous few-shot state-of-the-art models.
When fine-tuned, Flamingo also went on to set new overall state-of-the-art results on 5 of the 9 tasks it was fine-tuned on. The table below shows the tasks on which fine-tuned Flamingo outperformed the existing state of the art.
Another thing that is unsurprising (it is conventional wisdom for modern networks) but important to touch on before we look at the actual results: the larger the Flamingo model and the more shots in the prompt, the better the results.
Results on Few-Shot Samples
The Flamingo paper contains many qualitative results that you can check. Below, we highlight a few samples taken from the paper.
Given a few input images and questions as a prompt, Flamingo can answer a new question without being fine-tuned.
With just a few prompts, Flamingo can hold a dialogue!
Given a video and a related question, Flamingo can answer that question too!
Key Takeaways
Flamingo is a single unified vision-language model that can perform a wide range of vision-language tasks, such as image captioning and visual question answering, using a few-shot learning approach. Flamingo takes video, image, and text inputs and generates output in text form. It can solve both open-ended tasks such as visual question answering and close-ended tasks such as image classification.
In summary, some of the intriguing things about Flamingo are:
Fusing frozen vision and language models.
Using tons of unlabelled data collected from the web.
Being able to handle all kinds of inputs (videos, images, text).
Achieving state-of-the-art few-shot results, beating the previous state of the art on all 16 vision-language benchmarks it was evaluated on.
For more about Flamingo, you can read the accompanying blog post or the paper itself, which is quite long.
Thanks for reading!
Until the next time...
--------------------
P.S. One question every day and a thing every month!









