Deep Learning Revision - Issue #5
Welcome to the fifth issue of Deep Learning Revision. This week, we will look at two of the latest trends in deep learning research, two things from the community that I found useful, and one open-source highlight from GitHub.
Deep Learning Trends
UViM - A Unified Modeling Approach for Vision with Learned Guiding Codes
UViM presents a single unified framework for handling different computer vision tasks such as panoptic segmentation (labeling every pixel with a semantic label and delineating every object instance), image colorization (converting grayscale images to color images), and depth estimation (predicting the distance of each image pixel from the camera).
Most computer vision tasks, such as object detection and segmentation, produce high-dimensional structured outputs, and because the networks that handle them differ so much in design, it is hard to build a single network that handles all of those tasks. The motivation of UViM is to explore how a number of visual tasks can be performed together using a single model.
From an architectural point of view, UViM is made of two parts. The first part is a feed-forward model (the base model) that directly maps the input image to its output. The other part is a language model made of an encoder and an autoregressive decoder (both built from Vision Transformers).
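To make that description concrete, here is a minimal, hypothetical sketch of how the two parts could fit together. The module names, shapes, and code-injection mechanism are my own illustrative assumptions, not the paper's actual design (the real implementation lives in Big Vision and uses full Vision Transformers):

```python
import torch
import torch.nn as nn

class BaseModel(nn.Module):
    """Feed-forward model: maps an image, aided by a guiding code, to the task output."""
    def __init__(self, code_dim=256, out_channels=3):
        super().__init__()
        self.backbone = nn.Conv2d(3, 64, kernel_size=3, padding=1)  # stand-in for a ViT backbone
        self.code_proj = nn.Linear(code_dim, 64)
        self.head = nn.Conv2d(64, out_channels, kernel_size=1)

    def forward(self, image, code):
        features = self.backbone(image)                                # (B, 64, H, W)
        features = features + self.code_proj(code)[:, :, None, None]  # inject the guiding code
        return self.head(features)

class LanguageModel(nn.Module):
    """Encoder plus autoregressive decoder (both ViTs in the paper) that produces the guiding code."""
    def __init__(self, code_dim=256):
        super().__init__()
        self.encoder = nn.Conv2d(3, code_dim, kernel_size=16, stride=16)  # patchify stand-in for a ViT encoder
        layer = nn.TransformerDecoderLayer(d_model=code_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)

    def forward(self, image, code_tokens):
        memory = self.encoder(image).flatten(2).transpose(1, 2)  # (B, num_patches, code_dim)
        return self.decoder(code_tokens, memory)                 # decode guiding-code tokens

base, lm = BaseModel(), LanguageModel()
image = torch.randn(1, 3, 64, 64)
code = lm(image, torch.randn(1, 8, 256)).mean(dim=1)  # pool decoded tokens into one guiding code
output = base(image, code)                            # (1, 3, 64, 64)
```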
Something that stands out about UViM is that it achieves great results with simple training techniques, such as basic data augmentation (cropping and horizontal flipping). More technical details can be found in the paper, and its implementation is available in Big Vision.
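For instance, an augmentation pipeline of that kind might look like the following in torchvision (an illustrative sketch; the paper's actual code is in Big Vision, which is JAX-based):

```python
import torchvision.transforms as T

# Simple augmentations of the kind the paper relies on:
# random cropping and horizontal flipping.
augment = T.Compose([
    T.RandomResizedCrop(224),   # crop a random region and resize to the model's input size
    T.RandomHorizontalFlip(),   # flip left-right with probability 0.5
    T.ToTensor(),               # PIL image -> float tensor in [0, 1]
])
```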
Do Vision Transformers See Like Convolutional Neural Networks?
Convolutional Neural Networks (also known as ConvNets or CNNs) have been the primary architecture for most computer vision tasks since 2012, when they first showed breakthrough performance in image recognition. At the beginning of the 2020s, the Vision Transformer (ViT), a slight modification of a standard language Transformer, showed results comparable to ConvNets on various visual recognition tasks. If you follow the tech community (research in particular), you probably know that CNNs versus ViTs is an ongoing debate.
The paper Do Vision Transformers See Like Convolutional Neural Networks? explores how Vision Transformers and ConvNets differ in the quality of their internal representations, or simply how they extract robust features from the input image. In particular, the authors explore the internal learning mechanisms of the Vision Transformer and how they differ from those of Deep Residual Networks (ResNets).
In brief, some of the things the authors found (using Centered Kernel Alignment, or CKA, to compare representations; see the sketch after this list) are:
ViT learns uniform representations from the first layer to the last, while ResNets do not maintain similarity between low-level and high-level features.
In the early layers, both ViT and ResNet learn very similar features; the representations start to diverge in the middle layers.
Skip (shortcut) connections improve the performance and representations of Vision Transformers even more than they do for ResNets.
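The comparisons above are made with Centered Kernel Alignment (CKA), a similarity measure between two layers' activations. Here is a minimal NumPy sketch of linear CKA (my own illustrative version, not the authors' code):

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two activation matrices of shape (num_examples, num_features).

    Returns a similarity score in [0, 1]; 1 means the representations match
    up to an orthogonal transformation and isotropic scaling.
    """
    x = x - x.mean(axis=0)                # center each feature
    y = y - y.mean(axis=0)
    dot = np.linalg.norm(x.T @ y) ** 2    # ||Y^T X||_F^2
    norm_x = np.linalg.norm(x.T @ x)      # ||X^T X||_F
    norm_y = np.linalg.norm(y.T @ y)      # ||Y^T Y||_F
    return dot / (norm_x * norm_y)

# Comparing a layer's representation with itself gives 1.0:
acts = np.random.randn(128, 64)
print(linear_cka(acts, acts))  # ~1.0
```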
Understanding how Vision Transformers work is an ongoing area of research. For more about Vision Transformer representations, you can also read this awesome tutorial by Sayak on Keras.io and this blog post.
From the Community
Making Deep Learning Go Brrrr From First Principles
Machine learning (deep learning in particular) can be very ambiguous: deciding what to do to improve a model can feel like tossing a coin. Horace wrote a great blog post on approaching deep learning from first principles. Is your validation loss worse than your training loss? You're overfitting. Is your training loss similar to your validation loss? You're underfitting. Is your Colab session crashing on the first epoch? Maybe buy Colab Pro :-). Find out more!
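Those rules of thumb are easy to encode; here is a toy sketch of my own (the tolerance is arbitrary, purely for illustration):

```python
def diagnose(train_loss, val_loss, tol=0.1):
    # Crude heuristic version of the rules of thumb above.
    if val_loss > train_loss * (1 + tol):
        return "overfitting: validation loss is much worse than training loss"
    return "underfitting: training and validation loss are close; the model can fit more"

print(diagnose(train_loss=0.3, val_loss=0.9))  # overfitting
```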
Autoregressive Models in Deep Learning
Autoregressive models are neural network architectures that use previous outputs to predict the next output (sequence models). They are like recurrent networks without a hidden state. These kinds of models show up in many papers today, so I thought it'd be best to read about them. Find out more.
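To illustrate the idea, here is a minimal sketch of autoregressive generation in PyTorch. The toy next-token model is my own, standing in for whatever architecture a given paper actually uses; the point is the loop that feeds each output back in as input:

```python
import torch
import torch.nn as nn

class TinyARModel(nn.Module):
    """Toy next-token model: embeds tokens and predicts a distribution over the next one."""
    def __init__(self, vocab_size=100, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        return self.head(self.embed(tokens))  # (B, T, vocab_size) logits

@torch.no_grad()
def generate(model, prompt, steps=10):
    tokens = prompt
    for _ in range(steps):
        logits = model(tokens)[:, -1]                     # logits for the next token only
        next_token = logits.argmax(dim=-1, keepdim=True)  # greedy pick; sampling also works
        tokens = torch.cat([tokens, next_token], dim=1)   # feed the output back in as input
    return tokens

model = TinyARModel()
print(generate(model, torch.zeros(1, 1, dtype=torch.long)))  # grows the sequence one token at a time
```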
Open-source Highlight
Activeloop Hub is an open-source tool that makes it easy to manage and visualize datasets for deep learning models. It works with PyTorch and TensorFlow.
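A minimal usage sketch, based on Hub's quickstart (version-dependent, and the dataset path is just an example; check the project's docs for the current API):

```python
import hub  # pip install hub

# Load a dataset hosted by Activeloop and stream it straight into PyTorch.
ds = hub.load("hub://activeloop/mnist-train")          # example hosted dataset
dataloader = ds.pytorch(num_workers=2, batch_size=32)  # TensorFlow users: ds.tensorflow()
```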
Thanks for reading! Do you have a comment or a suggestion for future newsletters? Ideas and feedback are welcome. Let me know here.
----------------



