What’s on your CVPR 2023 reading list? Our team of researchers has curated a fantastic selection of accepted papers, which we’ve organized by topic to make it easier for you to navigate. Here are the categories we’ve focused on: optimization, generalization, train/test-time adaptation, self-supervised learning, diffusion models, active learning for fine-tuning, dynamic scene rendering, prompt learning and causality. For each category, we’ve selected papers that we think are particularly noteworthy, and we’ve included a summary of each paper along with our thoughts on its potential impact.

We’re thrilled to share this reading list with you and hope you find this year’s CVPR research as fascinating as we do. See you at #CVPR2023!

Topics

  • Generalization and Optimization
  • Train/Test-Time Adaptation
  • Diffusion Models
  • Self-supervised Learning
  • Active learning for fine-tuning 
  • Dynamic scene rendering
  • Prompt Learning
  • Causality

Generalization and Optimization

Sharpness-Aware Gradient Matching for Domain Generalization

Pengfei Wang, Zhaoxiang Zhang, Zhen Lei and Lei Zhang

by Hossein Sharifi

Why do we recommend it?

Models that converge to flat regions of the loss landscape often generalize better than those that converge to sharp valleys. To encourage this, sharpness-aware minimization (SAM) methods have been proposed with the objective of minimizing the maximum loss in a neighbourhood around the model parameters. SAM first adversarially computes a parameter perturbation that maximizes the empirical risk and then minimizes the loss of the model with the perturbed parameters. However, minimizing the perturbed loss alone is not guaranteed to converge to a flat minimum.
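
As a rough illustration of this two-step procedure, a single SAM-style update can be sketched in a few lines of PyTorch; the model, loss function, optimizer and the radius rho below are generic placeholders rather than the paper's implementation.

```python
import torch

def sam_step(model, loss_fn, x, y, optimizer, rho=0.05):
    """One sharpness-aware update: perturb the weights toward higher loss,
    then descend using gradients of the loss at the perturbed weights."""
    optimizer.zero_grad()

    # 1) Gradient of the clean loss at the current weights.
    loss_fn(model(x), y).backward()

    # 2) Move to the (approximate) worst point in an L2 ball of radius rho.
    with torch.no_grad():
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        grad_norm = torch.norm(torch.stack([g.norm() for g in grads]))
        perturbations = []
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            perturbations.append((p, e))

    # 3) Gradient of the perturbed loss, evaluated at the perturbed weights.
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()

    # 4) Restore the original weights and apply the optimizer update.
    with torch.no_grad():
        for p, e in perturbations:
            p.sub_(e)
    optimizer.step()
```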

This paper studied SAM and its limitations through the lens of domain generalization. Given a set of input domains with different distributions (the source domains), the goal of domain generalization is to build a model that generalizes to an unseen domain of interest with its own distribution (the target domain) without having access to that domain during training. This matters because, during deployment, models can encounter data distributions different from those observed during training, and we would like the model to make accurate predictions on these target domains as well. To this end, the paper proposed Sharpness-Aware Gradient Matching (SAGM), which simultaneously minimizes the empirical risk, the perturbed loss, and the gap between them. Minimizing the first two terms ensures convergence to a minimum that is sufficiently low within a neighbourhood, and minimizing the gap ensures that the minimum lies within a flat region. The model was evaluated on multiple benchmarks such as PACS, VLCS, and DomainNet and demonstrated better performance than other domain generalization methods.
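
Schematically, the three terms above can be collected into a single objective (our own shorthand, not necessarily the paper's exact formulation; the weight λ is introduced here purely for illustration):

```latex
\min_{\theta}\;
\underbrace{\mathcal{L}(\theta)}_{\text{empirical risk}}
\;+\;\underbrace{\mathcal{L}_{p}(\theta)}_{\text{perturbed loss}}
\;+\;\lambda\,\underbrace{\bigl(\mathcal{L}_{p}(\theta)-\mathcal{L}(\theta)\bigr)}_{\text{sharpness gap}}
```

The gap term is what connects the objective to matching the gradients of the clean and perturbed losses, hence the method's name.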

Robust Generalization against Photon-Limited Corruptions via Worst-Case Sharpness Minimization

Zhuo Huang, Miaoxi Zhu, Xiaobo Xia, Li Shen, Jun Yu, Chen Gong, Bo Han, Bo Du, and Tongliang Liu

by Hossein Sharifi

Why do we recommend it?

Continuing the sharpness and generalization theme, this paper studied the role of sharpness-aware minimization when the training data contains severe noise and the goal is to optimize the worst-case empirical risk. Optimizing for the worst-case distribution is challenging because 1) such severely corrupted data is scarce, and 2) it is expensive to annotate the level of corruption. Moreover, this kind of noise, often referred to as photon-limited corruption, arises from the limited number of photons reaching the sensor when an image is captured, so different photon counts produce different levels of noise. This poses an additional challenge, since different noise levels can be viewed as multiple data distributions. We therefore need a model that generalizes robustly even when worst-case data is scarce, and that is distribution-agnostic so corruption annotations are not required.

This paper proposed SharpDRO, a method that fully exploits worst-case data via sharpness (the same principle as the paper discussed above). The hypothesis is that the scarce worst-case data is scattered sparsely in the feature space, which makes exploring the neighbourhood of each sample difficult and produces a sharp loss curve. At inference time, unseen worst-case data is likely to fall in these unexplored regions, causing inaccurate predictions. The paper therefore proposed applying sharpness only to the worst-case distribution rather than to all samples (including clean, noise-free ones), which would limit generalization. Although the authors demonstrated that this distribution-aware approach outperforms existing methods, such as domain generalization baselines, it still requires access to distribution annotations (levels of corruption), which are expensive to obtain. To resolve this, the authors proposed reusing the perturbed weights for out-of-distribution detection: the difference between the original model's predictions and the perturbed ones serves as a score for detecting worst-case samples, because predictions for those samples are expected to deviate significantly from their original values. Adding this score yields the distribution-agnostic version of the model. Results on CIFAR10, CIFAR100, and ImageNet30 showed that this version outperforms the baselines, particularly when the corruption is severe.
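
The detection score described above has a very compact form; a minimal sketch (with placeholder models and an illustrative L1 distance, not the paper's exact implementation) might look like this:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def worst_case_score(model, perturbed_model, x):
    """Score samples by how much the predictions move under weight perturbation.
    Large shifts suggest the sample lies in a poorly explored (worst-case) region."""
    p_orig = F.softmax(model(x), dim=-1)
    p_pert = F.softmax(perturbed_model(x), dim=-1)
    # Per-sample L1 distance between the two predictive distributions.
    return (p_orig - p_pert).abs().sum(dim=-1)

# Samples whose score exceeds a chosen threshold would be treated as worst-case,
# and the sharpness-aware objective applied only to them.
```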

Train/Test-Time Adaptation

Train/Test-Time Adaptation with Retrieval

Luca Zancato, Alessandro Achille, Tian Yu Liu, Matthew Trager, Pramuditha Perera, Stefano Soatto

by He Zhao

Why do we recommend it?

In the real world, well-trained machine learning models are often required to adapt to specific deployments, which may involve a complex process of scaling, securitization, certification of new models and datasets, bias evaluation and, eventually, generation of results. More recently, a new recipe has emerged – adapting models based on data observed during test time – known as Test-Time Adaptation (TTA). This paper studies TTA by retrieving information from a large, unlabelled, heterogeneous, and evolving dataset and highlights four major merits of the method: (1) automated selection of training samples, (2) adaptation of a small model to specific tasks (nimble inference), (3) reversible adaptation and (4) continual adaptation over a series of tasks.

Assuming a model already exists (pre-trained on a labelled dataset), the proposed method has three major components: (1) the target dataset, (2) the auxiliary dataset and (3) a retrieval model. The first problem to solve is the lack of ground-truth labels for the target dataset; to this end, the method uses the pre-trained model to produce pseudo-labels. The second problem is aligning the feature embeddings learned on the pre-training dataset with those of the target dataset, assuming they differ for some reason (e.g., temporal shift). A popular tool for this purpose is contrastive learning (CL), which is known to group embeddings of similar samples. In this work, similar retrievals between a query from the target dataset and the auxiliary dataset are treated as positive pairs, and unrelated ones as negative pairs, where the judgement of "similarity" is based on the probability logits behind the pseudo-labels. Furthermore, to encourage tighter clustering of embeddings within the same class, the method filters out same-class negative pairs to stabilize training. Finally, to get the most out of CL, another trick is harvesting hard negatives from outside the top-k nearest retrievals. For test-time training, the contrastive loss and the cross-entropy loss on pseudo-labels are jointly optimized. To demonstrate the robustness of the proposed method, it is evaluated against (1) multiple image classification datasets, (2) different data augmentation methods, (3) several well-known retrievers and (4) a series of ablation studies that reveal the impact of the auxiliary dataset.
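
A simplified sketch of the joint test-time objective (cross-entropy on pseudo-labels plus an InfoNCE-style contrastive term over retrieved pairs) is given below; the encoder, classifier, retrieval outputs and temperature are stand-ins, not the paper's exact components.

```python
import torch
import torch.nn.functional as F

def test_time_loss(encoder, classifier, query, positives, negatives, temperature=0.1):
    """Joint objective: cross-entropy on pseudo-labels plus a contrastive loss that
    pulls retrieved positives toward the query and pushes unrelated retrievals away."""
    q = F.normalize(encoder(query), dim=-1)          # (B, D) target-dataset queries
    pos = F.normalize(encoder(positives), dim=-1)    # (B, D) one retrieved positive each
    neg = F.normalize(encoder(negatives), dim=-1)    # (N, D) shared hard negatives

    # Pseudo-labels are the classifier's own most-likely classes for the queries.
    logits = classifier(q)
    pseudo_labels = logits.argmax(dim=-1).detach()
    ce = F.cross_entropy(logits, pseudo_labels)

    # InfoNCE-style contrastive term: the positive should beat every negative.
    l_pos = (q * pos).sum(dim=-1, keepdim=True) / temperature   # (B, 1)
    l_neg = q @ neg.t() / temperature                           # (B, N)
    contrastive = F.cross_entropy(
        torch.cat([l_pos, l_neg], dim=1),
        torch.zeros(q.size(0), dtype=torch.long, device=q.device),
    )
    return ce + contrastive
```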

TIPI: Test Time Adaptation with Transformation Invariance 

A. Tuan Nguyen, Thanh Nguyen-Tang, Ser-Nam Lim, Philip H.S. Torr

by He Zhao

Why do we recommend it?

To combat the fact that even small shifts in the data distribution can severely affect a model's performance, Test-Time Adaptation (TTA) has been adopted, as it can use unlabelled test-time data to recalibrate the model. However, this paper points out that standard TTA methods (prediction entropy minimization or pseudo-labelling) do not perform well with small batch sizes and instead proposes a novel optimization target – an invariance regularizer between the model's predictive distributions on the original data and a transformed counterpart – to address this failure.

In essence, the paper proposes using transformation functions to simulate distribution shifts and then enforcing invariance of the transformed data to the original by minimizing their KL divergence. To obtain a task-agnostic transformation, the authors experimented with common choices and eventually settled on norm-based perturbation functions, which are quite general across data domains. To enforce the invariance, they advocate the reverse KL divergence to implement the regularizer, as it better suppresses the variance of the gradients. For the experiments, the paper specifically designs studies for tiny batch sizes (e.g., 2, 5 and 10) on multiple public benchmarks and reports promising results on all of them, as well as for regular large batch sizes (e.g., 200). Meanwhile, in the last section, the authors acknowledge that, due to the large-scale transformation search, the proposed method takes more time than other approaches.
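
A minimal sketch of this idea follows: a norm-bounded input perturbation and a KL term between the two predictive distributions. The perturbation here is simple random noise and only one KL direction is written out; the paper's transformations and its choice of the reverse direction should be taken from the original work.

```python
import torch
import torch.nn.functional as F

def invariance_regularizer(model, x, epsilon=0.1):
    """KL-based invariance term between predictions on the original input and a
    norm-bounded perturbation of it (a stand-in for the paper's transformations)."""
    p_orig = F.softmax(model(x), dim=-1).detach()

    # Random perturbation rescaled to a fixed L2 norm per sample.
    delta = torch.randn_like(x).flatten(1)
    delta = epsilon * delta / (delta.norm(dim=1, keepdim=True) + 1e-12)
    x_pert = x + delta.view_as(x)

    log_p_pert = F.log_softmax(model(x_pert), dim=-1)
    p_pert = log_p_pert.exp()

    # KL(p_pert || p_orig), written out explicitly for clarity.
    return (p_pert * (log_p_pert - torch.log(p_orig + 1e-12))).sum(dim=-1).mean()
```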

Diffusion Models

Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models

Gowthami Somepalli, Vasu Singla, Micah Goldblum, Jonas Geiping and Tom Goldstein

by Ruizhi Deng

Why do we recommend it?

With the explosion of AI-generated content (AIGC) in recent years, state-of-the-art generative models trained on web-scale datasets, like DALLE2 and Stable Diffusion, demonstrate an impressive ability to generate high-quality images and a readiness for many real-world applications, including artwork creation and graphic design. A natural question to ask is whether these models are actually generating original content or simply memorizing and replicating their training data. The authors believe this question, and the ethical and legal implications it brings about, ought to be seriously considered before these generative models are adopted for commercial use. The work studies this question for various SOTA image generative models using a range of image similarity metrics.

Given the relevance and importance of the problem this paper studies, we believe it is worth recommending. To detect object-level replication in images, the work studies and compares 10 different prototypes of feature extractors from self-supervised learning and image retrieval on both synthetic and real-world datasets. The best feature extractors for replication detection are then applied to diffusion models, including DDPMs and latent diffusion models, trained on datasets of various scales. The work finds that replication is easily detected for models trained on small- and medium-scale datasets. Even for larger-scale Stable Diffusion models, however, content replication of various forms is still found frequently in the generated samples.
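
In spirit, the detection step reduces to nearest-neighbour search in a learned feature space. A toy version, with a hypothetical `feature_extractor` and threshold standing in for the descriptors and calibration studied in the paper, might look like:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def flag_replications(feature_extractor, generated, training_images, threshold=0.9):
    """Flag generated images whose nearest training image is suspiciously similar.
    Inputs are image batches; the extractor and threshold are illustrative only."""
    g = F.normalize(feature_extractor(generated), dim=-1)        # (G, D)
    t = F.normalize(feature_extractor(training_images), dim=-1)  # (T, D)
    similarity = g @ t.t()                                       # cosine similarities
    best_score, best_index = similarity.max(dim=1)
    return best_score > threshold, best_index                    # flags and matches
```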

On Distillation of Guided Diffusion Models

Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik Kingma, Stefano Ermon, Jonathan Ho and Tim Salimans

by Ruizhi Deng

Why do we recommend it?

Classifier-free guided diffusion models have shown great success in conditional, high-quality image generation. However, these models are computationally expensive to sample from, as each sampling step requires querying both the conditional and unconditional models. The work proposes an approach to distilling state-of-the-art classifier-free guided diffusion models into models that are much faster to sample from while demonstrating comparable sample quality to the originals.

We believe this work makes an important contribution towards democratizing diffusion models and generative AI for people with limited computational resources, especially those outside the AI community. The distillation proceeds in two stages: in the first stage, the student (distilled) model is trained to match the output of the original model across various levels of guidance strength; in the second stage, the student is progressively distilled into a model that uses fewer discrete time steps. The approach can be applied to both pixel-space and latent-space diffusion models, as well as to text-guided image editing and inpainting tasks.
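
For intuition, the first-stage target is the classifier-free-guided prediction at a randomly sampled guidance strength. A heavily simplified sketch (placeholder teacher/student signatures, not the paper's code) follows:

```python
import torch
import torch.nn.functional as F

def stage_one_distillation_loss(student, teacher, x_t, t, cond, w_range=(0.0, 4.0)):
    """Train a student that takes the guidance weight w as an extra input to match
    the teacher's classifier-free-guided prediction in a single forward pass."""
    # Sample a guidance strength per example.
    w = torch.empty(x_t.size(0), device=x_t.device).uniform_(*w_range)

    with torch.no_grad():
        eps_cond = teacher(x_t, t, cond)      # conditional prediction
        eps_uncond = teacher(x_t, t, None)    # unconditional prediction
        w_ = w.view(-1, *([1] * (x_t.dim() - 1)))
        target = (1.0 + w_) * eps_cond - w_ * eps_uncond   # guided prediction

    return F.mse_loss(student(x_t, t, cond, w), target)
```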

Self-supervised Learning

Self-Supervised Learning From Images With a Joint-Embedding Predictive Architecture

Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun and Nicolas Ballas

by Vin Bhaskara

Why do we recommend it?

Self-supervised learning (SSL) has proven effective at training semantic representations of images from unlabelled data that, after minimal fine-tuning, perform on par on downstream tasks with fully-supervised models trained on labelled datasets.

Predominant SSL methods involve invariance-based pre-training, where two or more augmentations of the same image are required to map to similar semantic encodings (e.g., contrastive learning-based methods). Such methods learn high-level semantics of images and perform well on downstream tasks, such as image classification, with simple linear-probing strategies. In contrast, a complementary approach to SSL involves learning a generative model that predicts a part of the input that is masked out. Examples include denoising autoencoders, where randomly masked patches of the input are reconstructed at the pixel level. Such generative methods learn relatively lower-level image semantics, which are helpful for tasks such as instance segmentation, but require more compute due to pixel-space reconstruction.

In this paper, the authors present a modality-agnostic (easily generalizable to images, audio, and other modalities) SSL technique that combines features of invariance-based and generative SSL methods. Specifically, the proposed method is similar to generative SSL techniques in that it predicts properties of an image from a given context patch, and similar to invariance-based SSL techniques in that the loss is computed in feature space instead of pixel space. The paper shows that the representations learnt by the proposed method capture semantics at multiple granularities, require less computation, and do not need any a priori knowledge of ad hoc data augmentations. Moreover, the method is shown to perform on par with or better than existing models on diverse tasks such as image classification and instance segmentation.
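
The core recipe (predict the representations of masked target regions from a context view, with the loss computed in feature space) can be sketched as follows. The encoders and predictor are placeholders, and details such as the masking strategy and the exponential-moving-average target encoder are omitted.

```python
import torch
import torch.nn.functional as F

def feature_space_prediction_loss(context_encoder, target_encoder, predictor,
                                  context, targets):
    """Predict representations of masked target blocks from a visible context block."""
    # Target representations come from a separate encoder and receive no gradients
    # (in practice, a slowly updated copy of the context encoder).
    with torch.no_grad():
        target_repr = target_encoder(targets)

    predicted = predictor(context_encoder(context))

    # The regression happens in representation space, not pixel space.
    return F.mse_loss(predicted, target_repr)
```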

Active learning for fine-tuning 

Active Finetuning: Exploiting Annotation Budget in the Pretraining-Finetuning Paradigm

Yichen Xie, Han Lu, Junchi Yan, Xiaokang Yang, Masayoshi Tomizuka and Wei Zhan 

by Edward Smith

Why do we recommend it?

In the pretraining-finetuning paradigm, unsupervised pretraining of deep learning models is generally followed by finetuning on a sufficiently large, fully-labeled dataset representing the target data distribution of the downstream task. However, the assumption that labeled data of this magnitude is readily available for real-world applications is unreasonable and ignores the cost of manually labeling large amounts of data. While active learning schemes exist for generic data labeling, where supervised training is initialized from scratch, these methods have been shown to perform poorly in the pre-trained model setting.

To address these shortcomings, this paper introduces a new active learning task focused specifically on annotating data samples in the pretraining-finetuning setting, and proposes ActiveFT, a novel active learning method suited to it. ActiveFT aims to select, under a fixed budget, the optimal subset of samples to annotate from a large dataset, such that supervised training on these samples maximizes model performance on the downstream task. This is accomplished through continuous optimization in the high-dimensional feature space of the pre-trained model, with an objective that simultaneously minimizes the distance between the distribution of the selected embedded samples and that of the full dataset, and maximizes the diversity within the selected set. ActiveFT was applied to image classification and segmentation across a variety of datasets and demonstrated impressive improvements over strong active learning baselines in the pretraining setting.
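
A rough sketch of that kind of continuous selection objective (match the dataset's feature distribution while keeping the selected set diverse) is shown below; the exact formulation in the paper differs, and everything here, from similarity measures to weights and optimizer, is illustrative.

```python
import torch
import torch.nn.functional as F

def select_samples(features, budget, steps=200, lr=0.1, diversity_weight=1.0):
    """Optimize `budget` continuous 'centroids' in the frozen model's feature space,
    then return the indices of the nearest real samples to annotate.
    features: (N, D) tensor of pre-extracted features from the pre-trained model."""
    features = F.normalize(features, dim=-1)
    start = torch.randperm(features.size(0))[:budget]
    centroids = features[start].clone().requires_grad_(True)
    opt = torch.optim.Adam([centroids], lr=lr)

    for _ in range(steps):
        c = F.normalize(centroids, dim=-1)
        sim = features @ c.t()                                  # (N, budget)
        # Distribution matching: every point should be close to some centroid.
        match = -sim.max(dim=1).values.mean()
        # Diversity: discourage centroids from collapsing onto one another.
        pairwise = c @ c.t() - torch.eye(budget, device=c.device)
        diversity = pairwise.max(dim=1).values.mean()
        loss = match + diversity_weight * diversity
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():
        chosen = (features @ F.normalize(centroids, dim=-1).t()).argmax(dim=0)
    return chosen
```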

Dynamic scene rendering

RobustNeRF: Ignoring Distractors With Robust Losses

Sara Sabour, Suhani Vora, Daniel Duckworth, Ivan Krasin, David J. Fleet and Andrea Tagliasacchi

by Vin Bhaskara

Why do we recommend it?

Neural Radiance Fields (NeRFs) are an effective way to build 3D scene representations implicitly into the weights of a neural network. NeRFs are trained on datasets of 2D captured images paired with precise camera calibrations such that the pixel-wise errors of rendered images from known viewpoints are minimized with respect to the captured images. Recent works have shown their potential in the photorealistic rendering of novel viewpoints by effectively modelling viewpoint-dependent effects such as lighting, reflections, and specularities.

Existing NeRF models produce impressive 3D scene reconstructions from novel viewpoints when the captured 2D images are static, i.e., except for variations due to the change in viewpoint, the set of captured images is photometrically consistent. Unfortunately, that is rarely the case. Often, elements of a scene are not persistent throughout a capture session – the authors call these "distractors" – and they are hard to detect automatically without a priori knowledge and tedious to remove manually, pixel by pixel. The quality of the reconstructed 3D scene suffers under such distractors, giving rise to spurious artifacts in the form of "floaters."

In this paper, the authors propose a robust optimization framework that models distractors as outliers, without needing additional supervision or any a priori knowledge of the types of distractors, thereby making NeRF models practical for realistic scene reconstruction in dynamic environments.
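
As a much-simplified illustration of the robust-loss idea (not the paper's estimator, which reweights residuals iteratively and accounts for spatial structure), one could simply drop the highest-residual pixels from the photometric loss at each step:

```python
import torch

def trimmed_photometric_loss(rendered, target, keep_fraction=0.8):
    """Average the squared per-pixel colour error over only the lowest-residual
    pixels, so transient 'distractor' pixels with large errors are ignored."""
    residuals = (rendered - target).pow(2).mean(dim=-1).flatten()  # per-pixel error
    k = max(1, int(keep_fraction * residuals.numel()))
    kept, _ = torch.topk(residuals, k, largest=False)
    return kept.mean()
```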

DynIBaR: Neural Dynamic Image-Based Rendering

Zhengqi Li, Qianqian Wang, Forrester Cole, Richard Tucker and Noah Snavely

by Edward Smith

Why do we recommend it?

The recent introduction of Neural Radiance Fields (NeRF) has enabled dramatic improvements in the task of novel view synthesis, with high-definition renderings of scenes from arbitrary viewpoints made possible from only a few static images. However, contemporary methods often produce blurry or inaccurate renderings in the dynamic setting, where scene elements are in motion, especially when operating over long videos with complex object motions and camera trajectories. Furthermore, these methods are often limited to object-centric videos and do not scale well to complex scenes in natural environments. With these limitations, NeRF-based methods remain severely restricted in their application to true in-the-wild videos.

This paper introduces DynIBaR, a novel view synthesis method that significantly improves dynamic scene understanding within the NeRF framework. While employing a standard volumetric scene representation, the method aggregates local image features along a scene-motion-adjusted ray scheme, instead of along epipolar lines, to better account for geometry and appearance that vary over space and time. Scene motion is described by a "motion trajectory field" modelled with learned basis functions rather than standard MLPs for efficiency. Finally, the method factors the scene into static and dynamic components and models them separately, discouraging training from ignoring either element and, in turn, improving the quality of rendered images. With these contributions, DynIBaR operates accurately on dynamic videos with long durations, unbounded scenes, diverse camera trajectories, and fast, complex object motion. As a result, it achieves significant improvements over state-of-the-art models across dynamic scene datasets and clear gains in quality on in-the-wild dynamic videos.

Prompt Learning

Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting

Syed Talal Wasim, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, and Mubarak Shah

by Mengyao Zhai

Why do we recommend it?

Adapting large pre-trained transformers, such as CLIP and GPT, has gained a lot of attention. Fine-tuning is a prevalent way of adapting these pre-trained models to downstream tasks, but it is not always desirable: fine-tuning can increase supervised performance while decreasing generalization in the zero-shot setting, whereas freezing the pre-trained model retains zero-shot capability but causes a significant drop in supervised performance. Because of this, prior works typically train two separate models for the supervised and zero-shot tasks. Moreover, fine-tuning may require updating and duplicating the whole model for every downstream task, which can be very expensive.

In this paper, a unified training scheme is proposed to balance supervised and zero-shot performance by learning multimodal prompts. More specifically, the proposed method learns prompts that encode video information at three levels: (1) global video-level prompts that capture the overall video data distribution; (2) local frame-level prompts that learn per-frame, label-conditioned discriminative information; and (3) a summary prompt that summarises the per-frame discriminative information. On the text side, trainable soft prompts are learned for the text encoder. The paper provides compelling empirical evidence that this unified approach performs well in both supervised and zero-shot settings at a much lower parameter cost.
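
The basic mechanics of prompt learning with a frozen backbone (prepend a handful of trainable tokens to the input sequence and train only those) can be sketched as below. The encoder is generic, and the paper's video-, frame- and summary-level prompts add considerably more structure than this.

```python
import torch
import torch.nn as nn

class PromptedEncoder(nn.Module):
    """Wrap a frozen transformer encoder with a small set of learnable prompt tokens."""

    def __init__(self, frozen_encoder: nn.Module, embed_dim: int, num_prompts: int = 8):
        super().__init__()
        self.encoder = frozen_encoder
        for p in self.encoder.parameters():
            p.requires_grad = False                       # backbone stays frozen
        self.prompts = nn.Parameter(0.02 * torch.randn(num_prompts, embed_dim))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, embed_dim) patch/frame embeddings
        prompts = self.prompts.unsqueeze(0).expand(tokens.size(0), -1, -1)
        return self.encoder(torch.cat([prompts, tokens], dim=1))
```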

ProD: Prompting-to-disentangle Domain Knowledge for Cross-domain Few-shot Image Classification

Tianyi Ma, Yifan Sun, Zongxin Yang and Yi Yang

by Mengyao Zhai

Why do we recommend it?

For few-shot image classification, only very limited samples are provided to support transferring the classifier from source training domains to novel test domains. These limited samples can hardly represent the test data distribution or provide reliable information about the test data. Besides insufficient data, another critical challenge is the train-to-test domain gap. In this paper, a prompting method is proposed to mitigate this gap. Prompting is an effective way to adapt pre-trained transformers to different downstream tasks without fine-tuning them: in prompt-tuning, the weights of the pre-trained transformer are frozen, and prompt vectors appended to the input give the pre-trained model task-specific context.

This paper proposed learning two sets of disentangled prompts: a domain-general (DG) prompt, which consists of learnable parameters shared across all domains, and a domain-specific (DS) prompt, which is generated from backbone features of a given domain. Different from prior prompting methods, a lightweight transformer is adopted to extract general and specific domain knowledge conditioned on the combination of the features, the DG prompt and the DS prompt. The extracted domain-general and domain-specific features are used as the final representations for inference. Moreover, for few-shot classification, the DS prompt can be generated from support samples as a test-time adaptation, without the need to fine-tune the model. As a result, the DG features enable good generalization, and the DS features enable fast adaptation to novel test domains.

Causality

Learning Distortion Invariant Representation for Image Restoration from A Causality Perspective

Xin Li, Bingchen Li, Xin Jin, Cuiling Lan and Zhibo Chen

by Raquel Aoki

Why do we recommend it?

Some works on image restoration focus on improving performance on out-of-distribution samples in order to generalize better to real-world images. One common issue is a model's dependence on spurious correlations, which can introduce confounding effects specific to certain degradations. The authors of this paper argue that a robust neural network model should be distortion-invariant, as that would reduce these dependencies on spurious correlations and improve the model's generalization.

Hence, the authors proposed a new training strategy for neural network models, inspired by existing causal frameworks, whose main goal is to improve the generalization of image restoration to out-of-distribution images. The aim is to reduce dependence on spurious correlations and close back doors by creating distortion-invariant representations of the images. The proposed method, Distortion Invariant representation Learning (DIL), adopts counterfactual distortion augmentation to simulate distortion types and degrees as the confounders, and then derives a distortion-invariant representation by optimizing DIL with a meta-learning approach. The authors argue that this optimization scheme can handle the back-door criterion. The experiments support the authors' claim that DIL generalizes better to unseen distortions than existing baselines.
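
A toy sketch of the counterfactual distortion augmentation step is given below: several synthetic distortions applied to the same clean content, here paired with a simple representation-consistency penalty. The paper's meta-learning optimization and back-door adjustment are not reproduced; the distortion choices and encoder are placeholders.

```python
import torch
import torch.nn.functional as F
from torchvision.transforms import GaussianBlur

def counterfactual_distortions(clean):
    """Apply several synthetic distortions to the same clean image batch,
    simulating different distortion 'interventions' on identical content."""
    blurred = GaussianBlur(kernel_size=5, sigma=2.0)(clean)
    noisy = (clean + 0.1 * torch.randn_like(clean)).clamp(0.0, 1.0)
    low_res = F.interpolate(
        F.interpolate(clean, scale_factor=0.5, mode="bilinear", align_corners=False),
        size=clean.shape[-2:], mode="bilinear", align_corners=False,
    )
    return [blurred, noisy, low_res]

def invariance_penalty(encoder, clean):
    """Encourage the representation to be the same across distortion types."""
    feats = [encoder(d) for d in counterfactual_distortions(clean)]
    anchor = feats[0]
    return sum(F.mse_loss(f, anchor) for f in feats[1:]) / (len(feats) - 1)
```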

We recommend this paper as it adopts an interesting approach to learning domain-invariant representations from a causality perspective.  

Multi-view Adversarial Discriminator: Mine the Non-causal Factors for Object Detection in Unseen Domains

Mingjun Xu, Lingyun Qin, Weijie Chen, Shiliang Pu and Lei Zhang

by Raquel Aoki

Why do we recommend it?

In object detection, domain shift can severely degrade model performance. One way to address this problem is to learn domain-invariant features, and Domain Adversarial Learning (DAL) is one of the most common approaches for doing so. Looking at the problem through a causal lens, the authors of this paper found evidence that most existing works ignore non-causal factors and end up keeping them in the domain-invariant feature set.

In this work, the authors propose a training framework that removes non-causal factors from the domain-invariant features, the idea being that a domain-invariant feature set without these non-causal factors would generalize better. The proposed method, called Multi-view Adversarial Discriminator (MAD), contains a Spurious Correlations Generator (SCG) and a Multi-View Domain Classifier (MVDC): the SCG explores and exposes potential non-causal factors, while the MVDC discriminates and removes them. The experiments demonstrate that MAD achieves better performance than several baselines on many existing datasets and is also better at removing non-causal factors.

We recommend this paper because it shows another example of how domain-invariant sets can be improved when causality is considered. With a better domain-invariant feature set, better generalization results are also often observed.