Authors: D. Silverberg, M. Zhai

Coming to New Orleans on June 19, the annual IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR) will invite machine learning experts and researchers to share their knowledge on modeling, mobile AI, computer vision for AR/VR, and much more.

A range of intriguing papers will also attract attention at CVPR, but how can you learn about the reports that stand out from the pack? That’s where we come in, and below are our picks for the top five papers at CVPR, in no particular order:

Shunted Self-Attention via Multi-Scale Token Aggregation

Sucheng Ren, Daquan Zhou, Shengfeng He, Jiashi Feng, Xinchao Wang

A common constraint in computer vision tasks is that many Vision Transformer models give every token feature within a layer a similar receptive field, which limits how well each layer captures multi-scale features. To address this challenge, the paper’s authors propose a novel strategy, termed shunted self-attention (SSA), that allows Vision Transformer models to model attention at hybrid scales within each attention layer.

Shunted Self-Attention deals with two kinds of mismatch between standard transformer architectures and advanced tasks in computer vision. The first mismatch is the need for extensive downscaling to make self-attention computationally feasible on high-resolution images. The second mismatch is the fact that, unlike text, images have no ‘natural’ tokenization and contain objects and details at multiple scales. While the original Vision Transformer brushed these issues aside by splitting an image into a fixed grid of 16x16-pixel patch tokens, an important recent wave of ‘pyramid’ transformers tries to solve both issues by stacking progressively downscaled transformer blocks into a CNN-like pyramid of feature maps.

In addition to the introduction of a pyramidal structure, specialized transformer architectures for computer vision also aim at improving computational efficiency within transformer blocks by modifying the attention formula. This paper presents Shunted Self-Attention, a new SOTA refinement of the ‘spatial reduction’ approach to efficient attention: The original spatial reduction method from Wang et al. saves compute by merging tokens when computing Key and Value (but not Query) matrices. Shunted Self-Attention makes significant performance gains using the same compute budget by varying the degree of spatial reduction across different attention heads within each layer. 
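To make the idea of per-head spatial reduction concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: it works on a 1D token sequence rather than an image grid, it skips the learned Query/Key/Value projections, and it merges tokens by simple averaging where the paper uses learned layers. The names `merge_tokens`, `attend`, and `shunted_attention` are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def merge_tokens(tokens, rate):
    """Average every `rate` consecutive tokens (assumed stand-in for learned spatial reduction)."""
    merged = []
    for start in range(0, len(tokens), rate):
        group = tokens[start:start + rate]
        merged.append([sum(col) / len(group) for col in zip(*group)])
    return merged

def attend(queries, keys, values):
    """Plain scaled dot-product attention for a single head."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def shunted_attention(tokens, head_dim, rates):
    """Each head gets its own feature slice and its own reduction rate."""
    outputs = [[] for _ in tokens]
    for h, rate in enumerate(rates):
        lo, hi = h * head_dim, (h + 1) * head_dim
        head_tokens = [t[lo:hi] for t in tokens]
        reduced = merge_tokens(head_tokens, rate)         # keys and values are merged
        head_out = attend(head_tokens, reduced, reduced)  # queries keep full resolution
        for row, o in zip(outputs, head_out):
            row.extend(o)
    return outputs

tokens = [[float(i + d) for d in range(4)] for i in range(8)]  # 8 tokens, 4 features
out = shunted_attention(tokens, head_dim=2, rates=[1, 4])      # fine head, coarse head
print(len(out), len(out[0]))  # 8 4
```

With `rates=[1, 4]`, the first head attends over all 8 tokens while the second attends over only 2 merged tokens, so a single layer mixes fine and coarse context while the output keeps one vector per input token.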

Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer

Wang Zeng, Sheng Jin, Wentao Liu, Chen Qian, Ping Luo, Wanli Ouyang, Xiaogang Wang

This paper offers another promising approach to the problem of tokenization and downscaling in transformers for dense vision tasks. Focusing on human-centric tasks that call for fine-grained modeling of human figures but tolerate coarse modeling of the ambient scene, it proposes a clustering-based approach to downscaling in a transformer pyramid. Instead of downsampling tokens based on fixed receptive fields, Not All Tokens Are Equal clusters neighboring tokens using a DPC-KNN algorithm and then merges the tokens in each cluster with a weighted average using learned importance scores. The result is a dynamic pyramid in which semantically homogeneous image regions gradually compress into single tokens, while semantically heterogeneous regions maintain fine-grained representations.
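The clustering-and-merging step can be sketched in a few lines of plain Python. This follows one common formulation of DPC-KNN (local density from the k nearest neighbors, multiplied by the distance to the nearest denser token) and is an illustration rather than the paper's code. The importance scores are passed in as plain numbers here, whereas the paper learns them, and the exponential weighting is an assumption of this sketch. The function names are hypothetical.

```python
import math

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dpc_knn_centers(tokens, k, num_clusters):
    """Pick cluster centres: tokens that are both dense and far from any denser token."""
    n = len(tokens)
    d2 = [[dist2(a, b) for b in tokens] for a in tokens]
    # Local density: high when the k nearest neighbours are close.
    density = []
    for i in range(n):
        nearest = sorted(d2[i][j] for j in range(n) if j != i)[:k]
        density.append(math.exp(-sum(nearest) / len(nearest)))
    # Distance indicator: gap to the closest token with strictly higher density.
    max_d2 = max(max(row) for row in d2)
    scores = []
    for i in range(n):
        higher = [d2[i][j] for j in range(n) if density[j] > density[i]]
        indicator = math.sqrt(min(higher) if higher else max_d2)
        scores.append(density[i] * indicator)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)[:num_clusters]

def merge_clusters(tokens, importance, centers):
    """Assign each token to its nearest centre, then take an importance-weighted average."""
    members = {c: [] for c in centers}
    for i, t in enumerate(tokens):
        nearest = min(centers, key=lambda c: dist2(t, tokens[c]))
        members[nearest].append(i)
    merged = []
    for c in centers:
        weights = [math.exp(importance[i]) for i in members[c]]  # assumed weighting
        total = sum(weights)
        merged.append([
            sum(w * tokens[i][d] for w, i in zip(weights, members[c])) / total
            for d in range(len(tokens[0]))
        ])
    return merged

# Two tight groups of toy 2D "token features".
tokens = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
importance = [0.0] * len(tokens)  # equal importance gives a plain average
centers = dpc_knn_centers(tokens, k=2, num_clusters=2)
merged = merge_clusters(tokens, importance, centers)
print(sorted(centers), merged)
```

In this toy input, six tokens collapse into two merged tokens, one per group. Raising the importance of a token pulls its cluster's merged representation towards it.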

To leverage the overall resulting pyramid of tokens for a dense prediction task, the model aggregates the tokens from the different levels of the pyramid using upsampling and additional transformer processing. The technique achieves SOTA results in dense human-centric vision tasks like pose estimation, mesh reconstruction, and keypoint localization, and remains competitive with SOTA in general classification tasks.

Correlation Verification for Image Retrieval

Seongwon Lee, Hongje Seong, Suhyeon Lee, Euntai Kim

Image retrieval is a well-known problem in computer vision: the goal is to sort a database of images by their similarity to a given query image. Image retrieval systems traditionally use a two-stage process when looking for matches to a query image: the system begins by generating a coarse global ranking of all images within the queried database, and then performs a fine-grained reranking of the top candidates. While neural network methods have revolutionized the global ranking stage in recent years, fine-grained reranking still typically relies on the “old school” computer vision method known as geometric verification. Though previous neural network SOTA in reranking has failed to surpass this non-neural technique, this paper proposes a novel image retrieval reranking network named Correlation Verification Networks that gradually compresses dense feature correlation into image similarity while learning diverse geometric matching patterns from various image pairs.

In the first stage of the image-retrieval process, the method applies a ResNet feature extractor trained with a hybrid classification and momentum-contrastive loss, then coarsely ranks all images by the cosine similarity of their features to the query. In the reranking stage, the method starts by using pyramid transformers to process the query and top candidates into multi-scale feature maps. It proceeds by convolving horizontal slices from the query’s and candidate’s feature pyramids against each other to get a 5D tensor of cross-scale feature correlations, then applies a 4D CNN to compress the tensor to a final correlation score. The overall second-stage network is trained for contrastive alignment of query/target pairs, assisted by a curriculum-learning regime to smooth training.
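The two-stage flow can be illustrated with a small plain-Python sketch. It makes several assumptions: descriptors and feature maps are tiny hand-written lists rather than network outputs, the dense correlation is computed for a single pair of scales (giving a 4D tensor, not the full cross-scale one), and the learned 4D CNN is swapped for a simple placeholder score (the mean of each query cell's best match). The function names are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def coarse_rank(query_desc, database_descs, top_k):
    """Stage 1: rank all images by cosine similarity of global descriptors."""
    sims = [cosine(query_desc, d) for d in database_descs]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:top_k]

def dense_correlation(query_map, cand_map):
    """corr[i][j][k][l] = cosine similarity of query cell (i, j) and candidate cell (k, l)."""
    return [[[[cosine(q, c) for c in cand_row] for cand_row in cand_map]
             for q in q_row] for q_row in query_map]

def match_score(corr):
    """Placeholder for the learned 4D CNN: average best match per query cell."""
    best = [max(max(row) for row in cell) for q_row in corr for cell in q_row]
    return sum(best) / len(best)

# Stage 1: coarse global ranking.
query_desc = [1.0, 0.0]
database = [[0.0, 1.0], [1.0, 0.1], [0.9, 0.5], [-1.0, 0.0]]
top = coarse_rank(query_desc, database, top_k=2)

# Stage 2: rerank the shortlist using dense local correlations (2x2 toy feature maps).
query_map = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, -1.0]]]
candidate_maps = {
    1: [[[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]]],  # uniform features
    2: query_map,                                              # same local structure
}
reranked = sorted(
    top,
    key=lambda i: match_score(dense_correlation(query_map, candidate_maps[i])),
    reverse=True,
)
print(top, reranked)
```

Here the coarse stage prefers image 1, but the dense comparison promotes image 2 because its local features line up with the query's. That is the kind of correction a reranking stage exists to make.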

Crafting Better Contrastive Views for Siamese Representation Learning

Xiangyu Peng, Kai Wang, Zheng Zhu, Mang Wang, Yang You

In the years since SimCLR achieved a breakthrough in self-supervised learning for vision by contrastively aligning the codes of two augmentations of an image, a variety of contrastive and non-contrastive methods have achieved state of the art with variations on the basic ‘Siamese’ design. Previous research has shown that much of the power of these methods depends on the choice of augmentation techniques, with some augmentations proving more crucial than others. While many augmentations (e.g., translation, color-jitter, rotation) offer different degrees of benefit depending on the learning method (e.g., BYOL, MoCo, SimSiam), the use of random crops has proven crucial across the board. This paper proposes a simple, low-compute upgrade to random crops, ContrastiveCrop, that reliably boosts representation learning in all SOTA ‘Siamese’ methods.

The idea behind ContrastiveCrop is to minimize two kinds of pathologies that can occur when generating image pairs with random crops: false-positive pairs where one crop misses the object, and trivial pairs where both crops are nearly identical. To minimize the chance of false-positive pairs, ContrastiveCrop defines a bounding box around the object using heatmaps and restricts crops to the bounding box. To minimize the chance of trivial pairs, ContrastiveCrop samples crops from a center-suppressed beta distribution, which encourages variety in crops. Measured in benefits for downstream classification (both fine-tuned and linear), ContrastiveCrop consistently contributes 0.5 to 1.5 percentage points of accuracy on the most challenging vision benchmarks.
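Both ingredients can be sketched with Python's standard library. This is an illustration under stated assumptions, not the paper's code: the heatmap is a hand-written grid, the threshold and the `alpha` value are arbitrary choices, and the helper names are hypothetical. The one fact the sketch relies on is that a Beta(alpha, alpha) distribution with alpha below 1 is U-shaped, so samples concentrate near the edges of the box rather than at its center.

```python
import random

def heatmap_box(heatmap, threshold):
    """Bounding box (x0, y0, x1, y1) around all cells whose activation exceeds `threshold`."""
    rows = [r for r, row in enumerate(heatmap) if any(v > threshold for v in row)]
    cols = [c for c in range(len(heatmap[0])) if any(row[c] > threshold for row in heatmap)]
    return min(cols), min(rows), max(cols) + 1, max(rows) + 1

def sample_crop_center(box, alpha=0.6, rng=random):
    """Sample a crop centre inside `box` from a centre-suppressed distribution.

    Beta(alpha, alpha) with alpha < 1 puts more mass near 0 and 1 than near 0.5.
    """
    x0, y0, x1, y1 = box
    u = rng.betavariate(alpha, alpha)
    v = rng.betavariate(alpha, alpha)
    return x0 + u * (x1 - x0), y0 + v * (y1 - y0)

heatmap = [
    [0.0, 0.1, 0.1, 0.0],
    [0.1, 0.9, 0.8, 0.0],
    [0.0, 0.7, 0.9, 0.1],
    [0.0, 0.0, 0.1, 0.0],
]
box = heatmap_box(heatmap, threshold=0.5)
rng = random.Random(0)
view_a = sample_crop_center(box, rng=rng)
view_b = sample_crop_center(box, rng=rng)
print(box, view_a, view_b)
```

Restricting centers to the box keeps both views on the object, and the U-shaped sampling makes it less likely that the two views land on nearly the same spot.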

MAT: Mask-Aware Transformer for Large Hole Image Inpainting

Wenbo Li, Zhe Lin, Kun Zhou, Lu Qi, Yi Wang, Jiaya Jia

If you work in computer vision, you’re familiar with the challenge of inpainting, which requires reconstructing missing regions in an image. While traditional inpainting methods based on CNNs scale cheaply and perform well on texture-heavy images with simple global structures, they struggle when images display complex long-range dependencies between fine details. Transformer architectures, on the other hand, are a natural fit for modeling long-range dependencies but are impractically compute-heavy at high resolutions. In this paper, the researchers present a transformer-based model for large-hole inpainting that aims to unify the merits of transformers and convolutions to efficiently process high-resolution images.

This paper observes that the recent Swin transformer design for building low-compute transformer architectures is an excellent basis for large-hole inpainting, since Swin transformers calculate attention within shifting local windows that gradually propagate dependencies across the image grid. Using a modified Swin transformer as a basis, the new Mask-Aware Transformer architecture gradually propagates information from the visible rims of an image towards the center of the hole. High-resolution feature maps from the transformer blocks then form the basis for final upsampling with a StyleGAN-like CNN decoder.
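The way shifting local windows carry information inwards can be shown with a toy simulation. This is a sketch, not the paper's architecture: it tracks only which tokens have received any visible information, uses a 1D row of tokens instead of an image grid, truncates windows at the borders instead of wrapping around, and assumes that a window containing at least one informed token informs all of its tokens. The function names are hypothetical.

```python
def propagate_validity(valid, window, shift):
    """One windowed layer: any informed token in a window informs the whole window."""
    n = len(valid)
    updated = list(valid)
    # Shifted windows start `shift` tokens early, so they straddle the unshifted borders.
    for start in range(-shift, n, window):
        idx = [i for i in range(start, start + window) if 0 <= i < n]
        if any(valid[i] for i in idx):
            for i in idx:
                updated[i] = True
    return updated

def layers_to_fill(valid, window):
    """Count layers, alternating unshifted and half-shifted windows, until the hole is covered."""
    if not any(valid):
        raise ValueError("at least one token must be visible")
    layers = 0
    while not all(valid):
        shift = 0 if layers % 2 == 0 else window // 2
        valid = propagate_validity(valid, window, shift)
        layers += 1
    return layers

# 16 tokens: two visible tokens on each rim, a 12-token hole in the middle.
mask = [True] * 2 + [False] * 12 + [True] * 2
print(layers_to_fill(mask, window=4))  # 3
```

Without the alternating shift, information would never cross a window border and the middle of the hole would stay empty, which is why the shifting pattern matters for this task.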

While the majority of the paper’s contribution is in carefully selecting, calibrating, and integrating many recent tricks from the computer vision literature, one original trick plays a key enabling role: The researchers observe that the residual connection in transformer blocks causes instability when input images have large holes (i.e. many tokens at near-zero in the early layers), and replace residual connections with learned fusion operations.



How Computer Vision is used in financial services

Computer vision is of great importance to financial services, with applications including OCR, object detection and recognition, video analysis, and more. At Borealis AI, we work on problems in financial services, and many of the machine learning models we research and use are similar to those in computer vision. Please read more about how a computer vision background and its problems translate into what we see in finance, here.