What’s on your NeurIPS 2022 reading list? Our researchers are sharing some of their favourite accepted papers, organized by topic: time-series analysis and modelling, reinforcement learning and contrastive learning, Bayesian inference and causality, robust learning, and multi-task optimization. For each category, the team selected papers, summarized them, and offered their thoughts on the impact of the work. We hope you enjoy this #NeurIPS2022 reading list!

Topics

Time-series Analysis and Modelling

Self-Supervised Contrastive Pre-Training For Time Series via Time-Frequency Consistency

Xiang Zhang, Ziyuan Zhao, Theodoros Tsiligkaridis, Marinka Zitnik

by Tristan Sylvian

Why do we recommend?

Contrastive learning is a high-performing family of methods for tasks such as forecasting and classification. In recent months, contrastive objectives have matched or exceeded the performance of larger, more complex baselines (such as time-series transformers or MLP variants such as N-BEATS and N-HiTS). Contrastive approaches are also versatile: the same encoder, once trained on a dataset, can be fine-tuned to perform well on multiple tasks. This paper is interesting in large part because it introduces methodological and algorithmic improvements that play to these strengths.

By taking a multi-modal approach (time- and frequency-based representations), the authors obtain good transfer performance. Pre-training on one dataset and performing well on another is an essential characteristic, and one that is often overlooked in time-series forecasting. The experiments section is also extensive, showcasing the strengths (and sometimes relative weaknesses) of the algorithm in a variety of transfer settings.
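To illustrate what "time- and frequency-based representations" of the same sample can look like, here is a minimal sketch. It assumes the frequency view is a plain magnitude spectrum computed with a naive discrete Fourier transform; the function name and the toy signal are hypothetical, and the paper's actual encoders and consistency loss are not reproduced.

```python
import cmath
import math


def magnitude_spectrum(signal):
    """Naive DFT magnitudes for the non-negative frequency bins of a real signal."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n // 2 + 1)
    ]


# Toy sample: a sine wave completing 3 cycles over 32 time steps.
signal = [math.sin(2 * math.pi * 3 * t / 32) for t in range(32)]

time_view = signal                           # what a time-domain encoder would see
frequency_view = magnitude_spectrum(signal)  # what a frequency-domain encoder would see

# The frequency view concentrates the same information in a single bin.
dominant_bin = max(range(len(frequency_view)), key=frequency_view.__getitem__)
print(dominant_bin)
```

A contrastive objective in this spirit would then encourage the embeddings of the two views of one sample to sit close together and away from those of other samples.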

GT-GAN: General Purpose Time Series Synthesis with Generative Adversarial Networks

Jinsung Jeon, Jeonghak Kim, Haryong Song, Seunghyeon Cho, Noseong Park

by Tristan Sylvian

Why do we recommend?

Learning to generate high-quality synthetic data is of vital importance for many fields, including time-series modelling. Synthetic samples can be used to augment datasets, improve generalization, or simply get a better understanding of the generative process.

This paper is interesting for two main reasons. First, GANs for time series are still in their early stages: despite the performance of adversarial methods in computer vision, time-series approaches such as TimeGAN still have limitations. As a result, any improvement to this family of methods is very useful. Second, this approach is more general than others, as it applies to both regular and irregular time series (time series with missing values).

The paper builds on a relatively classical combination of adversarial modules and an autoencoder. It also integrates aspects of the continuous time-flow process literature. Taken together, these elements result in an article that is both empirically strong and theoretically innovative.

Non-stationary Transformers: Rethinking the Stationarity in Time Series Forecasting

Yong Liu, Haixu Wu, Jianmin Wang, Mingsheng Long

by Alex Pashevich

Why do we recommend?

Non-stationarity, characterized by a continuous change of statistical properties and joint distribution over time, remains one of the key challenges for time series forecasting. Applying deep learning models directly to non-stationary real-world data can significantly degrade their performance. Training pipelines therefore traditionally pre-process time series to remove non-stationarity before passing them to the model. Such pre-processed series are stripped of their inherent non-stationarity, which can make them less instructive. This problem is even more pronounced for transformers, models with high representational power, which then generate indistinguishable temporal attention for different series. The natural question for improving the forecasting performance of transformers is thus how to increase time series stationarity without impeding the model’s predictive capability.

The authors propose an elegant and effective solution to the dilemma that is shown to boost the performance of transformer models on the time series forecasting task. Technically, the contribution involves two interdependent modules that wrap each transformer block. The first module is applied to normalize a time series before feeding it into a transformer block, and the transformer attention mechanism is altered to make it aware of the pre-processing transformation. The second module is used to de-normalize the time series after the transformer block. The authors provide an extensive evaluation of their method and demonstrate that the idea can bring significant gain for four mainstream transformer models.
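The normalize/de-normalize pairing described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: `naive_model` is a hypothetical stand-in for a forecaster, the statistics are a per-window mean and standard deviation, and the attention modification that makes the transformer aware of the transformation is omitted.

```python
from statistics import fmean, pstdev


def normalize(window, eps=1e-8):
    """Standardize a window and return the statistics needed to undo the step."""
    mean = fmean(window)
    std = pstdev(window) + eps
    return [(x - mean) / std for x in window], mean, std


def denormalize(values, mean, std):
    """Map model outputs back to the original scale of the series."""
    return [v * std + mean for v in values]


def naive_model(window, horizon):
    """Hypothetical stand-in for a forecaster: repeat the last observed value."""
    return [window[-1]] * horizon


def forecast(window, horizon):
    normed, mean, std = normalize(window)       # first module: normalize the input
    prediction = naive_model(normed, horizon)   # the model only sees stationarized data
    return denormalize(prediction, mean, std)   # second module: restore the statistics


series = [100.0, 102.0, 104.0, 106.0]
prediction = forecast(series, horizon=2)
print(prediction)
```

Because the window statistics are saved and re-applied, the forecast comes back on the scale of the input series even though the model itself never sees that scale.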

Reinforcement Learning and Contrastive Learning

Contrastive Learning as Goal-Conditioned Reinforcement Learning

Benjamin Eysenbach, Tianjun Zhang, Ruslan Salakhutdinov, Sergey Levine

by Alex Pashevich

Why do we recommend?

It is easier for a reinforcement learning method to solve a task provided a good representation. Such representation is often obtained by decoupling the representation learning problem from the reinforcement learning problem. This is typically done by employing separate objectives, for example, by introducing auxiliary losses or using data augmentation. However, representation learning and reinforcement learning might have conflicting goals, and such formulation might hurt the model performance. The authors suggest that contrastive learning resembles a goal-conditioned value function as nearby states should have similar representations while far states should have dissimilar representations.

The key idea is to use contrastive learning on action-labelled trajectories to estimate the critic function for a given policy as a dot product of the representations of the current and goal states. The complete algorithm alternates between fitting the critic function using contrastive learning, improving the policy, and collecting more data. The authors provide limited convergence guarantees for the method and show experimentally that the proposed algorithm outperforms prior methods that employ data augmentation and auxiliary objectives. Among the limitations is the fact that only the goal-conditioned setting is considered; extending the approach to arbitrary reinforcement learning tasks thus remains an open problem.
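A minimal sketch of a dot-product critic trained contrastively is given below. It assumes an InfoNCE-style objective in which each state-action representation is paired with the goal reached on its own trajectory and the other goals in the batch act as negatives; the paper's exact objective may differ, and the two-dimensional representations are made up for illustration.

```python
import math


def critic(state_action_repr, goal_repr):
    """Critic value as the dot product of the two learned representations."""
    return sum(a * b for a, b in zip(state_action_repr, goal_repr))


def contrastive_loss(sa_reprs, goal_reprs):
    """InfoNCE-style loss: goal i is the positive for row i, other goals are negatives."""
    total = 0.0
    for i, sa in enumerate(sa_reprs):
        logits = [critic(sa, g) for g in goal_reprs]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_norm - logits[i]
    return total / len(sa_reprs)


# Hypothetical representations for a batch of two trajectories.
sa_reprs = [[1.0, 0.0], [0.0, 1.0]]
aligned_goals = [[1.0, 0.0], [0.0, 1.0]]   # each goal matches its own trajectory
shuffled_goals = [[0.0, 1.0], [1.0, 0.0]]  # goals paired with the wrong trajectory

print(contrastive_loss(sa_reprs, aligned_goals))
print(contrastive_loss(sa_reprs, shuffled_goals))
```

The loss is lower when each state-action representation is closest to its own goal, which is the property the critic-fitting step optimizes for.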

Bayesian Inference and Causality

Batch Bayesian Optimization on Permutations using the Acquisition Weighted Kernel

Changyong Oh, Roberto Bondesan, Efstratios Gavves, Max Welling

by Lilian Wong

Why do we recommend?

This is an interesting paper that deals with the combinatorial optimization problem of permutations. Permutations arise in many areas of scientific and engineering applications, such as chip design and 3D printing. Evaluating the cost associated with one permutation can be costly, which gives rise to Bayesian optimization.

This paper proposes LAW (“L-ensemble with Acquisition Weights”), a framework that speeds up Bayesian optimization over permutations by acquiring a batch of candidates and evaluating the batch in parallel. To do so, the paper introduces a new batch acquisition method that is applicable to the search space of permutations. This new method uses a determinantal point process (DPP) L-ensemble, augmented with the acquisition weight, and is dubbed “LAW2ORDER”.
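One way to picture an L-ensemble augmented with acquisition weights is sketched below. All specifics are assumptions made for illustration: the kernel on permutations is a Mallows kernel based on Kendall tau distance, the L-ensemble is built as weight × kernel × weight, the batch is chosen by greedy determinant maximization, and the acquisition weights are invented. The paper's kernel and selection procedure may differ.

```python
import math
from itertools import combinations, permutations


def kendall_tau_distance(p, q):
    """Number of item pairs that the permutations p and q order differently."""
    pos_p = {item: i for i, item in enumerate(p)}
    pos_q = {item: i for i, item in enumerate(q)}
    return sum(
        1 for a, b in combinations(p, 2)
        if (pos_p[a] - pos_p[b]) * (pos_q[a] - pos_q[b]) < 0
    )


def mallows_kernel(p, q, lam=0.5):
    """Similarity between permutations; lam is a hypothetical bandwidth."""
    return math.exp(-lam * kendall_tau_distance(p, q))


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n, det = len(m), 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def greedy_batch(candidates, weights, batch_size):
    """Greedily add the candidate that most increases det of the weighted L-ensemble."""
    chosen = []
    for _ in range(batch_size):
        best, best_det = None, -1.0
        for i in range(len(candidates)):
            if i in chosen:
                continue
            idx = chosen + [i]
            sub = [[weights[a] * mallows_kernel(candidates[a], candidates[b]) * weights[b]
                    for b in idx] for a in idx]
            d = determinant(sub)
            if d > best_det:
                best, best_det = i, d
        chosen.append(best)
    return [candidates[i] for i in chosen]


candidates = list(permutations(range(4)))
# Hypothetical acquisition weights: favour permutations close to the identity.
weights = [1.0 / (1.0 + kendall_tau_distance(p, (0, 1, 2, 3))) for p in candidates]
batch = greedy_batch(candidates, weights, batch_size=3)
print(batch)
```

The determinant rewards candidates that have high acquisition weight and are dissimilar from those already in the batch, so the selected points can be evaluated in parallel without being near-duplicates.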

Another contribution of this paper is that, for benchmarking purposes, the authors also implemented other batch Bayesian optimization methods that are applicable to permutation spaces because, to the best of the authors’ knowledge, no such implementations were available anywhere. The paper reports numerical improvements.

Scalable Sensitivity and Uncertainty Analyses for Causal-Effect Estimates of Continuous-Valued Interventions

Andrew Jesson, Alyson Douglas, Peter Manshausen, Maëlys Solal, Nicolai Meinshausen, Philip Stier, Yarin Gal, Uri Shalit

by Raquel Aoki

Why do we recommend?

Many researchers have been adopting machine learning (ML) approaches to perform traditional causal inference tasks, such as the estimation of causal effects in observational studies. The popularity of these ML approaches in causal inference is due to their flexibility across several types of applications, robustness to high-dimensional datasets, and good predictive power. Still, the adoption of causal inference in real-world applications requires strong assumptions, such as the positivity and ignorability assumptions, that rarely hold in practice.

Required to identify the causal effect, the ignorability assumption is currently one of the main bottlenecks to the adoption of causal inference in the real world. Several works have proposed alternatives for situations where this assumption does not hold; still, no solution is universally accepted by the causality community. Aligned with recent progress on this topic, this paper proposes an approach that uses bounds instead of a point estimate of the causal effect: while the causal effect cannot be point identified from observational data without ignorability, it is possible to calculate bounds that cover the range of all possible values. Focusing on applications with continuous treatments and high-dimensional datasets, the paper proposes the continuous treatment-effect marginal sensitivity model (CMSM) to estimate the causal effect and its bounds. The experiments show that the bounds correctly include the true causal effect under certain conditions. A limitation of this approach is its dependence on a domain expert’s belief about how much of what is unexplained in the outcome is due to confounding. Correctly setting this belief parameter can be quite challenging because, in real-world applications, the true answer is never known. Still, this work makes excellent contributions towards the estimation of causal effects from observational data in real-world applications.

Causal Inference with Non-IID Data using Linear Graphical Models

Chi Zhang, Karthika Mohan, Judea Pearl

by Raquel Aoki

Why do we recommend?

Assumptions are essential to provide guarantees for the estimated values in causal inference analysis. These same assumptions are also one of the main bottlenecks to the adoption of causal inference in real-world applications. Currently, there is no gold-standard approach for what to do when certain causal inference assumptions fail. There are, however, situations where even if a given assumption has failed, it is still possible to obtain unbiased estimates if 1) the bias is negligible; or 2) the bias can be estimated.

An important causal inference assumption is that the samples in an observational study are independent and identically distributed (IID). There are, however, several applications where the samples are not IID. The paper uses the example of estimating the causal effect of vaccine doses on the severity of sickness: vaccination might reduce the viral load of some patients, reducing the transmission rate, or confounding variables might raise or lower transmission among patients in certain locations. In both cases, some samples might not be IID, and causal effect estimators might yield biased results if such dependency is not properly addressed. The paper therefore proposes an approach based on the interaction network to identify situations where the bias from the dependency between samples can be ignored; where the bias is not ignorable, it proposes a way to remove it. A limitation of the paper is that the interaction network might not always be available. Nevertheless, the experiments provide compelling evidence that the approach, under the correct circumstances and assumptions, can correctly estimate causal effects in real-world settings where the samples are not IID and the interaction network is known.

Fast Bayesian Inference with Batch Bayesian Quadrature via Kernel Recombination

Masaki Adachi, Satoshi Hayakawa, Martin Jørgensen, Harald Oberhauser, Michael A. Osborne

by Lilian Wong

Why do we recommend?

In science, engineering and economics, models are used to simulate real-life phenomena and help scientists understand the dynamics of many processes. When multiple models are concerned, Bayesian model evidence establishes a clear criterion for model selection. Computing model evidence, however, requires integration over the likelihood, which could be intractable or expensive, especially if the likelihood does not have a closed-form expression.
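To make the integral concrete, the sketch below estimates model evidence with plain Monte Carlo, the random-sampling baseline, rather than with Bayesian quadrature. The model is a hypothetical toy chosen because its evidence has a closed form to compare against: a single observation y with likelihood N(y; θ, 1) and prior θ ~ N(0, 1), for which the evidence is N(y; 0, 2).

```python
import math
import random


def gaussian_pdf(x, mean, variance):
    """Density of a univariate Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / variance) / math.sqrt(2 * math.pi * variance)


def monte_carlo_evidence(y, num_samples, rng):
    """Average the likelihood p(y | theta) over draws of theta from the prior N(0, 1)."""
    total = 0.0
    for _ in range(num_samples):
        theta = rng.gauss(0.0, 1.0)
        total += gaussian_pdf(y, theta, 1.0)
    return total / num_samples


rng = random.Random(0)  # fixed seed so the run is repeatable
estimate = monte_carlo_evidence(y=1.0, num_samples=20000, rng=rng)
closed_form = gaussian_pdf(1.0, 0.0, 2.0)  # prior and noise variances add up

print(estimate, closed_form)
```

Each sample costs one likelihood evaluation, which is why sample-efficient alternatives become attractive when the likelihood is expensive to compute.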

Plenty of algorithms have been developed to compute model evidence or posteriors, such as variational inference, Monte Carlo methods and Bayesian Quadrature. The main advantages of quadrature methods over random sampling methods like Monte Carlo are their speed and their ability to reduce the integral variance, but their lack of parallelization has made practical application rather difficult. This paper presents a new parallelized batch Bayesian Quadrature algorithm for fast inference, named “Bayesian Alternately Subsampled Quadrature” (BASQ). Samples from the Bayesian Quadrature surrogate model are re-selected to give a sparse set of samples via a kernel recombination algorithm, and the extra time needed to increase the batch size is almost negligible. Empirically, the new algorithm outperforms existing Bayesian Quadrature methods and Nested Sampling methods.

The paper offers a comprehensive introduction to the area of Bayesian Quadrature, with links to many relevant results in the literature. It is a great read.

Robust Learning

GENIE: Higher-Order Denoising Diffusion Solvers

Tim Dockhorn, Arash Vahdat, Karsten Kreis

by Mengyao Zhai

Why do we recommend?

Denoising diffusion models (DDMs) are generative models that combine a forward diffusion process, in which a small amount of Gaussian noise is gradually added to the samples, with a reverse diffusion process, in which samples are reconstructed by gradual denoising. For high-quality image generation, the slow iterative process of solving differential equations in reverse diffusion is very time-consuming. To accelerate synthesis, this paper proposes GENIE, a novel higher-order solver based on truncated Taylor methods. Unlike existing work, GENIE explicitly uses higher-order scores without finite difference approximations for generative modelling with DDMs. This higher-order solver enables larger step sizes while staying below a certain error tolerance. More specifically, GENIE directly models the higher-order gradient terms by calculating Jacobian-vector products, which are computed automatically by differentiating the regular learnt first-order scores. To avoid computational overhead, the Jacobian-vector products are distilled into a small prediction head on top of the first-order score network. This small prediction head can be called during synthesis, which allows for fast inference. Comprehensive experiments show that GENIE outperforms all state-of-the-art solvers and generates images of better quality under all reported numbers of function evaluations (NFEs).
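The benefit of a truncated Taylor step over a first-order step can be seen on a toy problem. The sketch below is not a diffusion sampler: it assumes the scalar ODE dx/dt = -x, whose second derivative is the Jacobian-vector product of the right-hand side with itself, and it hand-codes that product where a real system would use automatic differentiation or a distilled prediction head.

```python
import math


def f(x):
    """Right-hand side of the toy ODE dx/dt = -x."""
    return -x


def jvp_f(x, v):
    """Jacobian-vector product of f at x applied to v (df/dx = -1 for this toy ODE)."""
    return -v


def euler(x, step, num_steps):
    """First-order solver: one evaluation of f per step."""
    for _ in range(num_steps):
        x = x + step * f(x)
    return x


def taylor2(x, step, num_steps):
    """Second-order truncated Taylor solver: adds the (step^2 / 2) * d2x/dt2 term."""
    for _ in range(num_steps):
        fx = f(x)
        x = x + step * fx + 0.5 * step * step * jvp_f(x, fx)
    return x


exact = math.exp(-1.0)  # true solution at t = 1 starting from x(0) = 1
euler_error = abs(euler(1.0, 0.25, 4) - exact)
taylor_error = abs(taylor2(1.0, 0.25, 4) - exact)
print(euler_error, taylor_error)
```

With the same four large steps, the second-order update lands much closer to the exact solution, which is the sense in which higher-order terms permit larger step sizes at a given error tolerance.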

Task-Free Continual Learning via Online Discrepancy Distance Learning

Fei Ye, Adrian G. Bors

by Mengyao Zhai

Why do we recommend?

Catastrophic forgetting is the key issue for continual learning: when adapting to a new task, the trained model would forget how to perform on previous tasks. Memory or experience replay, where a small memory buffer is used to store past samples and replay these past samples when learning the new task, is a commonly used method to relieve catastrophic forgetting. For a fixed-size memory buffer, memory-based methods would suffer performance degradation on previous tasks given an increasing number of tasks. Methods have been proposed to retain a memory buffer or a sub-module for each new task or freeze certain components to preserve performances on previous tasks. However, prior knowledge of tasks is not always available, and a more realistic but challenging setting is to continuously learn a sequence of tasks when explicit task information is absent. In this paper, a task-free continual learning method where task identities are not available is proposed. A new theoretical analysis framework is developed, showing that the optimal performance depends on the discrepancy distance between the memory buffer and all previously seen samples. Inspired by the theoretical model, a novel dynamic component expansion approach is proposed to expand the network while ensuring the network is compact with optimal performance. Moreover, a novel selection scheme based on a discrepancy-based criterion is proposed to encourage component diversity which further improves the performance. Experimental results show that the proposed task-free continual learning method achieves state-of-the-art performance.

The Effects of Regularization and Data Augmentation are Class Dependent

Randall Balestriero, Leon Bottou, Yann LeCun

by Leila Pishdad

Why do we recommend?

Regularization methods such as weight decay and data augmentation (DA) have been widely used to decrease the variance of machine learning models. In this paper, however, the authors show that although these general regularization methods improve average performance across classes, they are not fair in terms of the bias they introduce for individual classes. For instance, with ResNet-50 and random cropping for DA, the test accuracy for one class drops from 68% to 46% while the average performance improves. The authors show empirically that one reason could be that the percentage of the original image that random cropping must keep in order to be label-preserving varies across classes: it can be as low as 8% for some classes and as high as 50% for others. In other words, the same DA technique can be label-preserving for some classes and not for others. In addition, the authors provide a per-class statistical test with the null hypothesis that the accuracy without DA is significantly lower than the accuracy with DA. They reject the null hypothesis with 95% confidence for over 4% of classes, i.e., the per-class accuracy does not improve with DA for over 4% of the classes.
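The gap between average and per-class behaviour is easy to reproduce on paper. The sketch below uses invented labels and predictions (not results from the paper) to show how mean accuracy can rise while one class collapses.

```python
from collections import defaultdict


def per_class_accuracy(labels, predictions):
    """Accuracy computed separately for each ground-truth class."""
    correct, total = defaultdict(int), defaultdict(int)
    for y, p in zip(labels, predictions):
        total[y] += 1
        correct[y] += int(y == p)
    return {c: correct[c] / total[c] for c in total}


def mean_accuracy(labels, predictions):
    """Overall fraction of correct predictions."""
    return sum(int(y == p) for y, p in zip(labels, predictions)) / len(labels)


# Made-up toy data: 4 cats, 4 dogs, 2 foxes.
labels = ["cat"] * 4 + ["dog"] * 4 + ["fox"] * 2
without_da = ["cat", "cat", "dog", "dog", "dog", "dog", "cat", "cat", "fox", "fox"]
with_da = ["cat"] * 4 + ["dog"] * 4 + ["cat", "cat"]

print(mean_accuracy(labels, without_da), per_class_accuracy(labels, without_da))
print(mean_accuracy(labels, with_da), per_class_accuracy(labels, with_da))
```

Reporting the per-class dictionary next to the mean is a cheap way to spot the class-dependent effect described above.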

What Makes a “Good” Data Augmentation in Knowledge Distillation — A Statistical Perspective

Huan Wang, Suhas Lohit, Mike Jones, Yun Fu

by Leila Pishdad

Why do we recommend?

Knowledge distillation (KD) is a powerful approach with a wide range of applications, from training smaller and more efficient networks under resource constraints to learning representations with self-supervision. In KD, a teacher network guides the training of a student network: the loss function comprises two parts, a loss for training the student network (e.g., cross-entropy) and a loss for matching the output distribution of the student to that of the teacher (e.g., the KL divergence between the two distributions). Data augmentation (DA) can be used in KD to further improve performance without overfitting. In this paper, the authors attempt to find a predictive metric for how “good” a DA is in the context of KD and use it to further improve KD on the input side. Specifically, the authors show that the standard deviation of the teacher’s mean output probability has a statistically significant positive correlation with the student loss for different DA techniques. This implies that the “goodness” of a DA technique for KD could be student-invariant, i.e., a good DA scheme is one for which the teacher’s mean probability has a lower variance over different input samples. The secondary contribution of this work is an entropy-based data selection scheme for KD, which further decreases the variance of the teacher’s mean probability as well as the generalization error of the student.
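The two-part loss can be written down in a few lines. This is a minimal sketch for a single example: the mixing weight `alpha` and the softmax `temperature` are hypothetical hyper-parameters, and refinements used in some implementations (such as rescaling the soft term by the squared temperature) are left out.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, optionally softened by a temperature."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


def cross_entropy(probs, label):
    """Negative log-probability assigned to the true label."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kd_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=2.0):
    """Weighted sum of the hard-label loss and the teacher-matching loss."""
    hard = cross_entropy(softmax(student_logits), label)
    soft = kl_divergence(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    return alpha * hard + (1.0 - alpha) * soft


teacher = [4.0, 0.0]
matched = kd_loss([4.0, 0.0], teacher, label=0)     # student agrees with the teacher
mismatched = kd_loss([0.0, 4.0], teacher, label=0)  # student disagrees with both signals
print(matched, mismatched)
```

Data augmentation enters this picture on the input side: both networks see the augmented sample, so the augmentation shapes the teacher distribution the student is asked to match.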

Multi-task Optimization

Do Current Multi-Task Optimization Methods in Deep Learning Even Help?

Derrick Xin, Behrooz Ghorbani, Ankush Garg, Orhan Firat, Justin Gilmer

by Gabriel Oliveira

Why do we recommend?

The authors present a large-scale analysis of the effects of Multi-Task Optimization (MTO) methods. It is often assumed that these algorithms enhance the optimization dynamics of multi-task models and yield desirable solutions that cannot be achieved via scalarization. The results suggest that for a variety of tasks, for instance in language and vision, scalarization with appropriate weights can perform similarly to MTO algorithms in terms of both optimization and generalization behaviour. The observations also suggest that the final performance of multi-task models is highly sensitive to the choice of training hyper-parameters. The authors observe that researchers can report significant performance gains simply by under-tuning the competing baselines, and suggest that this situation is prevalent in the current literature.
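Scalarization simply means minimizing a fixed weighted sum of the task losses. The sketch below assumes a hypothetical toy problem with one shared parameter and two quadratic task losses that pull in opposite directions; the weights and learning rate are illustrative only.

```python
def task_losses(theta):
    """Two toy tasks that prefer theta = 1 and theta = -1 respectively."""
    return [(theta - 1.0) ** 2, (theta + 1.0) ** 2]


def task_gradients(theta):
    """Derivative of each task loss with respect to the shared parameter."""
    return [2.0 * (theta - 1.0), 2.0 * (theta + 1.0)]


def scalarized_loss(theta, weights):
    """Weighted sum of the task losses."""
    return sum(w * l for w, l in zip(weights, task_losses(theta)))


def train(weights, learning_rate=0.1, num_steps=200, theta=0.0):
    """Plain gradient descent on the scalarized objective."""
    for _ in range(num_steps):
        grad = sum(w * g for w, g in zip(weights, task_gradients(theta)))
        theta -= learning_rate * grad
    return theta


theta_balanced = train(weights=[0.5, 0.5])
theta_favour_first = train(weights=[0.75, 0.25])
print(theta_balanced, theta_favour_first)
```

Changing the weights moves the solution along the trade-off between the tasks, which is why tuning them carefully matters when scalarization is used as a baseline.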
 

FETA: Towards Specializing Foundational Models for Expert Task Applications 

Amit Alfassy, Assaf Arbelle, Oshri Halimi, Sivan Harary, Roei Herzig, Eli Schwartz, Rameswar Panda, Michele Dolfi, Christoph Auer, Kate Saenko, Peter W. J. Staar, Rogerio Feris, Leonid Karlinsky

by Gabriel Oliveira

Why do we recommend?

Foundation Models (FMs) have demonstrated unprecedented performance on multiple tasks. However, FMs still perform poorly on expert tasks, which can be defined as tasks with unseen data or data belonging to a long-tail part of the data distribution. For this reason, explicit evaluation and fine-tuning of FMs on such expert data is needed. The paper proposes the Foundation model for Expert Task (FETA) benchmark, built around teaching FMs to understand technical documentation by learning to match its graphic illustrations to the corresponding language descriptions. The authors propose fine-tuning the foundation model for expert tasks with Multi-Instance Learning (MIL) combined with Noise Contrastive Estimation (NCE), called MIL-NCE, and show that this approach is superior to previous approaches on several datasets.