What’s on your NeurIPS 2023 reading list? This year, the Conference on Neural Information Processing Systems (NeurIPS) received 12,343 full paper submissions, of which 3,500 were accepted. With that many papers, it can be hard to know where to start. That is why our team of researchers has curated a selection of accepted papers, organized by topic to make it easier to navigate. The categories we’ve focused on are optimization and generalization, self-supervised learning, time series, reinforcement learning, large language models, responsible AI, class imbalance, object-centric representations, tabular data, transformers, and double descent. For each research topic, we’ve selected papers that we think are particularly noteworthy, and we’ve included a summary of each paper along with our thoughts on its potential impact.

We’re excited to share this reading list with you and hope you find this year’s NeurIPS research as fascinating as we do. See you in New Orleans at #NeurIPS2023!


  • Generalization and Optimization
  • Large Language Models
  • Transformers
  • Time Series
  • Self-supervised Learning
  • Responsible AI
  • Double Descent 
  • Class Imbalance
  • Tabular Data
  • Object-centric Representations
  • Reinforcement Learning

Generalization and Optimization

Sharpness Minimization Algorithms Do Not Only Minimize Sharpness To Achieve Better Generalization

Kaiyue Wen, Zhiyuan Li, and Tengyu Ma

by Hossein Sharifi

Why do we recommend it?

This paper studies the relationship between sharpness and generalization in simple two-layer ReLU networks. The authors first show that the flatness of the loss landscape does not always correlate with a model’s ability to generalize: there can be very sharp models that generalize well. They then examine how network architecture and data distribution affect the generalization of the flattest minimizers, and show that a flattest minimizer need not generalize. For example, on a simple XOR problem with their two-layer network, adding or removing the biases changes how a flattest minimizer generalizes to unseen data. In settings like that, where there exists a flattest minimizer that does not generalize well, it is natural to wonder whether sharpness-aware minimization would also fail. Interestingly, the answer is no: for some architectures and data distributions, sharpness-aware minimization finds a minimizer that generalizes well even though there is a flattest minimizer that does not.

I selected this paper because it studies the notion of sharpness and its impact on generalization from both empirical (in controlled, simple settings) and theoretical points of view and provides new insights, namely the role of architecture and data distribution in SAM-style algorithms and the fact that reducing sharpness is not the only reason these optimizers generalize. This aligns with the conclusion of the other paper that I summarized (Müller et al., Normalization Layers Are All That Sharpness-Aware Minimization Needs, NeurIPS 2023) and with other recent studies, such as Andriushchenko et al., A Modern Look at the Relationship between Sharpness and Generalization, ICML 2023, which also discusses the role of hyper-parameters such as the learning rate.

Normalization Layers Are All That Sharpness-Aware Minimization Needs

Maximilian Müller, Tiffany Vlaar, David Rolnick, and Matthias Hein

by Hossein Sharifi

Why do we recommend it?

SAM (short for Sharpness-Aware Minimization) has been shown to improve generalization across different problems and architectures, but it comes with computational overhead: each update requires two forward-backward passes through the network, so it often costs 1.5-2x as much as standard SGD. This paper studies SAM-ON (short for SAM-OnlyNorm), in which sharpness-aware minimization is applied only to the normalization layers, i.e., layer norm in transformer architectures or batch norm in ResNets. Interestingly, SAM-ON outperformed SAM applied to all parameters across CIFAR benchmarks and achieved competitive performance on ImageNet. Although the results are interesting, they leave open whether the effect is specific to the normalization layers or is simply a consequence of sparsity in the training. To answer that, the authors compared SAM-ON to a sparse version of SAM that is applied to only a selected subset of parameters. SAM-ON outperformed this sparse variant at the same sparsity level, which suggests that the gain is due to the learning capacity of the normalization layers.

Another interesting observation is that although SAM-ON improves generalization, it reduces sharpness less than the original SAM does. In other words, better generalization is achievable without a significant reduction in sharpness, which, in line with some other studies, casts doubt on the correlation between sharpness and generalization. I selected this paper because it shows that a relatively simple part of the model can significantly improve performance.
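To make the cost argument concrete, here is a minimal NumPy sketch of one SAM update on a toy quadratic loss. The loss, the function names, the `mask` argument, and the hyper-parameter values are our own illustrative assumptions rather than the authors' implementation. The mask only mimics the idea of restricting the perturbation to a chosen subset of parameters, as SAM-ON does with normalization parameters.

```python
import numpy as np

def loss_and_grad(w, A):
    """Toy quadratic loss 0.5 * w^T A w and its gradient (illustrative only)."""
    return 0.5 * w @ A @ w, A @ w

def sam_step(w, A, lr=0.1, rho=0.05, mask=None):
    """One SAM update; `mask` (hypothetical) selects which parameters are perturbed."""
    if mask is None:
        mask = np.ones_like(w)
    # First gradient computation: find the ascent direction.
    _, g = loss_and_grad(w, A)
    g_masked = g * mask
    eps = rho * g_masked / (np.linalg.norm(g_masked) + 1e-12)
    # Second gradient computation: gradient at the perturbed point.
    _, g_adv = loss_and_grad(w + eps, A)
    # Update the original weights using the gradient from the perturbed point.
    return w - lr * g_adv

A = np.diag([10.0, 1.0])
w = np.array([1.0, 1.0])
w_full = sam_step(w, A)                               # perturb every parameter
w_subset = sam_step(w, A, mask=np.array([0.0, 1.0]))  # perturb only the second one
print(w_full, w_subset)
```

The two calls to `loss_and_grad` inside `sam_step` are where the extra cost relative to plain SGD comes from.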

Large Language Models

Large Language Models Are Zero-Shot Time Series Forecasters

Nate Gruver, Marc Finzi, Shikai Qiu, Andrew Gordon Wilson

by He Zhao

Why do we recommend it?

Large language models (LLMs) such as the GPT family have shown some capacity for complex reasoning through intermediate reasoning steps. These models are trained on huge amounts of web data, which contain an enormous amount of human knowledge. This raises a natural question: can LLMs compete with carefully designed deep networks trained with supervision on time series data? This paper studies LLMs for time series forecasting by encoding a time series as a string of numerical digits. The approach is entirely zero-shot and requires no fine-tuning. The key technique revisited in this work is a tokenization of time series data tailored to GPT models. Existing tokenizers do not treat a multi-digit value as a single unit and often break the input into separate pieces, which is problematic. To remedy this, the authors separate the digits with spaces, forcing each digit to be tokenized separately, and use a comma to separate the time steps of a series.
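The encoding described above can be sketched in a few lines of Python. This is only an illustration: the function name `encode_series`, the fixed precision, and the choice to drop the decimal point are assumptions made for the example, and the paper's exact preprocessing (for instance, any rescaling of values) is not reproduced here.

```python
def encode_series(values, decimals=1):
    """Encode numbers as space-separated digits, with commas between time steps."""
    steps = []
    for v in values:
        # Assumption: fix the precision and drop the decimal point.
        digits = f"{abs(v):.{decimals}f}".replace(".", "")
        sign = "-" if v < 0 else ""
        # Spaces between digits push the tokenizer towards one token per digit.
        steps.append(sign + " ".join(digits))
    return " , ".join(steps)

prompt = encode_series([15.1, 2.3, 64.0])
print(prompt)
```

The resulting string would be passed to the language model as a prompt, and the continuation it generates would be decoded back into numbers by reversing the same steps.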

One interesting finding of this paper is that LLMs perform extremely well at modelling multimodal distributions. The authors demonstrate this by showing that LLMs capture complex distributions better than Gaussian mixture models. This is a valuable takeaway because multimodality and diversity are common characteristics of time series data. Moreover, the paper reveals that LLMs have learned to (1) prefer simple explanations (an Occam’s razor prior) when solving problems, (2) capture repetition and periodicity patterns, and (3) perform addition and multiplication when extrapolating. All of these are useful qualities for time series analysis and may be worth considering in further research.

Fine-Tuning Language Models with Just Forward Passes

Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, Sanjeev Arora

by Vin Bhaskara

Why do we recommend it?

Large Language Models (LLMs) and other Foundation Models (FMs) are quickly growing in parameter count, and it is increasingly prohibitive to fine-tune them for downstream tasks without a cluster of GPUs. Although a model as large as 30 billion parameters can be run in inference mode on a single Nvidia A100 80 GB GPU, the largest model that can be trained on the same GPU is an order of magnitude smaller. This is due to the additional memory overhead of backpropagation, which needs cached activations at each layer, additional memory for storing gradients, and optimizer-specific state (for example, the running averages of past gradients stored by the Adam optimizer). How can one fine-tune a large model with a memory footprint that matches inference and still be competitive with models fine-tuned conventionally with backpropagation?

This paper shows that an adaptation of the classical zeroth-order gradient estimator can replace backpropagation and fine-tune models effectively with the same memory footprint as inference. Since zeroth-order methods require only forward passes, they can also optimize non-differentiable objectives such as F1 score and other important metrics without reinforcement learning. The authors present results across downstream tasks showing that their adaptation, called MeZO, performs competitively with conventionally fine-tuned models and enables fine-tuning 30-billion-parameter models on a single GPU.
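For readers who have not seen zeroth-order optimization before, the following is a minimal NumPy sketch of a two-point (SPSA-style) update on a toy quadratic objective. It is not the authors' code: the function names, the toy loss, and the hyper-parameters are assumptions chosen for illustration. The sketch regenerates the random direction from a seed instead of storing it, which is one way to keep memory close to that of inference.

```python
import numpy as np

def perturb(theta, scale, seed):
    """Add scale * z to theta in place, regenerating z from the seed."""
    z = np.random.default_rng(seed).standard_normal(theta.shape)
    theta += scale * z

def zeroth_order_step(theta, loss_fn, lr, eps, seed):
    """One update that needs only two forward evaluations of loss_fn."""
    perturb(theta, eps, seed)
    loss_plus = loss_fn(theta)
    perturb(theta, -2.0 * eps, seed)
    loss_minus = loss_fn(theta)
    perturb(theta, eps, seed)  # restore the original parameters
    # Scalar estimate of the directional derivative along z.
    projected_grad = (loss_plus - loss_minus) / (2.0 * eps)
    # Move against the estimated gradient, projected_grad * z.
    perturb(theta, -lr * projected_grad, seed)
    return theta

# Toy objective (hypothetical): squared distance to a fixed target vector.
target = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
loss_fn = lambda th: float(np.sum((th - target) ** 2))

theta = np.zeros(5)
initial_loss = loss_fn(theta)
for step in range(300):
    zeroth_order_step(theta, loss_fn, lr=0.05, eps=1e-3, seed=step)
final_loss = loss_fn(theta)
print(initial_loss, final_loss)
```

Because `loss_fn` is only ever evaluated, never differentiated, it could just as well return a non-differentiable metric.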

Are Emergent Abilities of Large Language Models a Mirage?

Rylan Schaeffer, Brando Miranda, Sanmi Koyejo

by Peter Forsyth

Why do we recommend it?

A phase transition is said to have occurred when the properties of a system change abruptly as a parameter passes a critical threshold. Phase transitions are ubiquitous in the sciences (between states of matter, for example) and also arise in computational systems (Wolfram 2002). Recently, researchers have reported phase transitions in large language models, a phenomenon called emergence. Capabilities like integer arithmetic, word unscrambling, and multi-task language understanding emerge discontinuously as the model scale increases (Wei et al., 2022). Small language models lack these capabilities, but large language models display a high degree of proficiency.

Schaeffer et al. convincingly argue that the reported instances of emergence are not real phase transitions but rather are artifacts of the metrics by which model performance is measured. They point out that the test-time cross-entropy loss of LLMs declines smoothly and predictably with scale (Henighan 2020, Kaplan 2020, Hoffmann 2022), but common LLM evaluation metrics are discontinuous or highly non-linear functions of test-time cross-entropy. Therefore, a smooth improvement in cross-entropy loss can translate into an abrupt change in the reported metric, creating the illusion of a phase transition.

To clinch their case, the authors show that emergence completely disappears when the evaluation metric is modified to become a smooth function of cross entropy.  Moreover, they manufacture previously unreported emergent phenomena in computer vision models by choosing a non-linear metric.
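A small numerical illustration of this argument: suppose per-token accuracy improves smoothly with scale, and an answer counts as correct only if every one of its tokens is correct. The numbers below are made up, and the assumption that tokens are independently correct is ours, made purely to keep the arithmetic simple.

```python
# Hypothetical per-token accuracies for models of increasing scale (made-up numbers).
per_token_acc = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95]
answer_length = 10  # assumed number of tokens that must all be correct

# Exact match is a highly non-linear function of per-token accuracy.
exact_match = [p ** answer_length for p in per_token_acc]

for p, em in zip(per_token_acc, exact_match):
    print(f"per-token accuracy {p:.2f} -> exact match {em:.3f}")
```

The smooth metric rises steadily, while the all-or-nothing metric stays near zero for the smaller models and then climbs steeply, which is the kind of curve that can be read as emergence.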

Schaeffer et al.’s insight is not entirely new: the possibility that metrics might play a role in emergence was hinted at in Wei et al.’s (2022) original discussion. Nevertheless, Schaeffer et al.’s detailed elaboration and experiments are a major contribution. Practically speaking, this work suggests that prior to training a large language model, we may be able to forecast many of its properties by extrapolating from small models, so long as we choose the right metrics. Moreover, this work serves as a good reminder of the fundamental importance of rigorous choice of metrics in machine learning. While Schaeffer et al.’s paper does not preclude the possibility that true emergent abilities will be discovered in the future, it does add weight to the argument that the recent astounding progress in large language models is a difference of scale, not of kind.

Faith and Fate: Limits of Transformers on Compositionality

Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, Soumya Sanyal, Sean Welleck, Xiang Ren, Allyson Ettinger, Zaid Harchaoui, Yejin Choi

by Hamed Shirzad

Why do we recommend it?

Large Language Models (LLMs) have received significant attention, not only within the machine-learning community but also in broader discussions. Numerous analyses, both praising their capabilities and underscoring their limitations, have been conducted. The question arises: Are we on the verge of the “Sparks of AGI,” or do inherent deficiencies in Transformers prevent them from reaching this level? If you are also on this roller coaster of exploration and enjoy encountering new pieces of evidence and theories, this paper will be an excellent read for you.

The paper questions the capability of Transformers in compositional tasks requiring multi-hop reasoning. Can Transformers learn generalizable rules in these tasks? How challenging is it for a Transformer to master 3-digit by 3-digit multiplication? What if we provide them with instructions or train them exhaustively? Transformers fall short of generalizing beyond their training distribution in all these scenarios, and the issue intensifies as the size of the input increases. The analysis extends to two additional tasks (Einstein’s puzzle and a dynamic programming problem), yielding similar results. What underlies these shortcomings?

The paper suggests interpreting these tasks as computation graphs, where information is transmitted along the edges, and the nodes are responsible for atomic processes on values they receive. The paper hypothesizes about Transformers’ shortcomings through a set of analyses on these graphs. The first hypothesis suggests that instead of learning the rules, Transformers perform linearized subgraph matching, finding subgraphs that have been seen in the training data. This suggests that Transformers learn patterns instead of rules, which can work in less complex computation graphs, but effectiveness reduces as the computation graph becomes more complex. Another mentioned hypothesis is error propagation in the computation graph, where a single error in a node can be propagated along the steps, leading to incorrect results. The paper presents several insightful ideas beyond the scope of this review, supported by numerous large-scale experiments and theoretical proofs, making it engaging for a broad range of readers.


Learning Transformer Programs

Dan Friedman, Alexander Wettig, Danqi Chen

by Ankit Vani

Why do we recommend it?

The transformer architecture has become a dominant paradigm for many domains and tasks thanks to its versatility and performance. A question often debated in the community is whether large models based on transformers, such as large language models (LLMs), can achieve understanding. Some argue that scaling up models will lead to better generalization, while others point out the inherent limitations of transformers on specific tasks. One possible direction to provide insight is to analyze what functions transformers can represent. RASP is a programming language proposed in ICML 2021 as a computational model for transformers. Specifically, RASP allows writing programs that can be directly mapped to transformer weights, such that running the transformer on an input is equivalent to executing the RASP program on it. This provides an interpretable way of understanding what transformers can learn, but it does not allow interpreting transformer models that have already been trained on real data.

The authors of “Learning Transformer Programs” propose a novel approach to enhance the interpretability of transformers trained on real data: they constrain training so that the resulting model can be mapped to a RASP program. This can help reveal the models’ biases and limitations, as well as demonstrate the trade-offs between optimization and learnability. The authors show that their method can achieve comparable performance to standard transformers of similar sizes while providing more transparency and insight. The authors also acknowledge that their technique is hard to scale because it relies on discrete optimization, which they implement through a Gumbel-softmax relaxation during training; this is a challenge the community can address in future work. Overall, the framework introduced by this paper is a valuable contribution to interpretable machine learning and a must-read for anyone interested in the latest developments of transformer models.
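As background, the Gumbel-softmax relaxation mentioned above replaces a hard, non-differentiable choice among discrete options with a noisy softmax whose sharpness is controlled by a temperature. The NumPy sketch below shows only the generic sampling step. The function name, logits, and temperatures are illustrative assumptions, and it says nothing about how the paper parameterizes its programs.

```python
import numpy as np

def gumbel_softmax(logits, temperature, rng):
    """Draw a relaxed (continuous) one-hot sample from a categorical distribution."""
    # Add Gumbel noise to the logits, then apply a temperature-scaled softmax.
    noisy = (logits + rng.gumbel(size=logits.shape)) / temperature
    noisy = noisy - noisy.max()  # stabilize the exponentials
    weights = np.exp(noisy)
    return weights / weights.sum()

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, -1.0])  # hypothetical scores for three discrete options
soft_sample = gumbel_softmax(logits, temperature=1.0, rng=rng)
sharp_sample = gumbel_softmax(logits, temperature=0.05, rng=rng)
print(soft_sample, sharp_sample)
```

At a low temperature the samples are typically close to one-hot vectors, so a discrete choice can be read off at the end of training by taking the argmax.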

Representational Strengths and Limitations of Transformers

Clayton Sanford, Daniel Hsu, Matus Telgarsky

by Hamed Shirzad

Why do we recommend it?

Transformers have shown great potential across diverse subdomains of deep learning, serving as the backbone for highly successful models in natural language processing, computer vision, time series, learning on graphs, and more. However, what do we know about the strengths and limitations of these models? Can they be applied to any task, or are there inherent limitations? While these models are recognized as universal function approximators, determining the depth or width needed to solve a problem remains a crucial question. If solving a problem with Transformers requires an exponentially large number of parameters, is it practically feasible to use them? Given our limited computational power and datasets, we would like to identify models with a better inductive bias for our task. This raises the question of when the inductive biases of Transformers align well with our problem and how we can expect the required depth or width to scale with the input size.

This paper explores Transformers’ capabilities through three synthetic tasks mirroring real-world problems. The first task is sparse averaging, where each token is assigned a small subset of indices and aims to calculate the average of the values stored in the tokens with those indices. The second and third problems, Match2 and Match3, involve pairwise and triple-wise matching, respectively. These are analogous to the well-known 2SUM and 3SUM problems in theoretical computer science, with the distinction that each token has an integer value and aims to determine whether it is part of a pairwise or triple-wise match summing to zero. The authors show that Transformers can perform the sparse averaging task with a single-layer network whose width scales logarithmically with the input size. In contrast, feed-forward networks or RNNs necessitate hidden dimensions scaling at least linearly with the input size, highlighting the superior inductive biases of Transformers in this context. Interestingly, there is a substantial disparity in the capabilities of Transformers between Match2 and Match3. While Match2 can be solved with logarithmic scaling in the input size, no Transformer can solve Match3 without at least linear scaling in one of its dimensions. Despite its highly theoretical nature, the paper effectively conveys motivation and insights, using informal descriptions of its theorems. Therefore, the next time you use Transformers for a problem, consider which task in this paper your problem resembles and whether Transformers are well-suited for your task.
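To make the three tasks concrete, here are brute-force Python reference implementations based on the informal descriptions above. They only define what each task's output is and are not how a Transformer would compute it. Requiring the matched tokens to be distinct and working with plain integers are our simplifying assumptions, and the paper's formal definitions may differ in such details.

```python
def sparse_average(values, index_sets):
    """For each token i, average the values at the indices assigned to it."""
    return [sum(values[j] for j in idx) / len(idx) for idx in index_sets]

def match2(xs):
    """For each token, is there another token that sums with it to zero?"""
    return [
        any(i != j and x + y == 0 for j, y in enumerate(xs))
        for i, x in enumerate(xs)
    ]

def match3(xs):
    """For each token, are there two other tokens that sum with it to zero?"""
    n = len(xs)
    return [
        any(
            len({i, j, k}) == 3 and xs[i] + xs[j] + xs[k] == 0
            for j in range(n)
            for k in range(n)
        )
        for i in range(n)
    ]

print(sparse_average([1.0, 2.0, 3.0, 4.0], [[0, 1], [2], [1, 3], [0, 1, 2, 3]]))
print(match2([3, -3, 5, 0]))
print(match3([1, 2, -3, 7]))
```

Note that the brute-force cost grows from quadratic for Match2 to cubic for Match3, which gives some intuition for why pairwise attention handles the former more naturally than the latter.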

Time Series

Encoding Time-Series Explanations through Self-Supervised Model Behavior Consistency

Owen Queen, Thomas Hartvigsen, Teddy Koker, Huan He, Theodoros Tsiligkaridis, Marinka Zitnik

by Ed Smith

Why do we recommend it?

Explainability for machine learning models can be essential for many reasons, including building trust in predictions, providing accountability and transparency, and meeting regulatory requirements. Time series models and the complex temporal signals they operate over present a difficult challenge for explainability as it is necessary to not only identify salient locations within the signals but also align them with interpretable temporal patterns. While prior works have developed post-hoc explainability in this setting by building components on top of pre-trained models, these systems have been demonstrated to suffer from a lack of faithfulness and poor stability. 

This paper introduces TIMEX, an explainability framework for time series models that boasts improved model interpretability over previous state-of-the-art systems. TIMEX trains a surrogate network to mimic a target model’s behaviour and latent structure, and adds a learned masking function that occludes the least relevant components of the input signal. The learned masks from this network can be used to rank the importance of locations in each series directly. In addition, the authors learn a set of landmarks within the embedding space of the surrogate explainer, which serve to aggregate similar explanations and make temporal patterns easy to recognize visually. TIMEX interpretability was evaluated across multiple synthetic and real-world datasets and achieved state-of-the-art or on-par performance compared to strong baseline methods. Furthermore, TIMEX’s learned landmarks were visually demonstrated to partition the latent space of explanations into easily interpretable temporal patterns.

Self-supervised Learning

Reverse Engineering Self-Supervised Learning

Ido Ben-Shaul, Ravid Shwartz-Ziv, Tomer Galanti, Shai Dekel, Yann LeCun

by Vin Bhaskara

Why do we recommend it?

Self-supervised learning (SSL) algorithms like SimCLR and VICReg are behind the training of Foundation Models (FMs) in computer vision on large corpora of unlabeled data. FMs have been shown to learn rich internal representations that can easily be fine-tuned for diverse downstream tasks. However, it has not been clear whether such learnt internal representations capture commonsensical semantic class labels at different hierarchical levels. As an example, at a coarse level, an object can be classified as an animal or a plant. At a finer level, an animal can be further classified as a dog, a cat, etc. Going one level deeper, the task could be to classify a dog among several dog breeds. Can SSL capture clusters of internal representations that correspond to categories across all such hierarchical levels of semantic classes?

This paper shows that such SSL models indeed capture semantic class categories at multiple hierarchical levels from coarse to fine. Moreover, the paper shows that the regularization term of the SSL loss objectives contributes to such implicitly learnt semantic clustering. Other interesting in-depth analyses are presented that show the continual improvement of semantic clustering with training iterations and layer depth. Potential use cases of this study include using SSL for weak labelling of large corpora of unlabeled data and understanding the generalization capabilities of learnt internal representations.

Responsible AI

Theoretical and Practical Perspectives on what Influence Functions Do

Andrea Schioppa, Katja Filippova, Ivan Titov, Polina Zablotskaia

by Vineel Nagisetty

Why do we recommend it?

An influence function (IF) identifies the training data most “influential” in a model’s prediction on a target data point. More specifically, IF estimates the change in a model’s prediction on a test data point when training the model after down-sampling or removal of a training data point. This is useful to correct any unwanted model behaviour. For instance, if we find that a model’s prediction on a target test data point is incorrect, we can use an influence function to identify the training data points responsible. We can then rectify the model to correct this behaviour by removing those problematic training points and retraining (aka Leave-Some-Out Retraining or LSOR). 

IFs have been used successfully in linear models and have shown some promise in deep learning models. However, recent works identified a discrepancy between the theoretical justifications of IF and empirical results in the deep learning setting. Specifically, they showed that the correlation between rankings of training data using LSOR and IF is low and is also affected by the model’s hyper-parameters. This raises concerns about the applicability of IFs in deep learning.

In this work, Schioppa et al. shed more light on this discrepancy. They examine several assumptions made when using IFs that could cause it and rule out all but one, parameter divergence, as the main culprit. Simply put, the accuracy of IF predictions fades as the number of retraining steps increases, because the training trajectories diverge over time and violate the first-order expansion assumption that IFs make. Equipped with this knowledge, the authors adjust how IFs are evaluated and use them successfully in model correction.

There are some interesting theoretical contributions in this paper. For instance, while the strict convexity assumptions used in Hessian-based influence functions (HIF) are not met in deep learning, this turns out to be less problematic than originally assumed. Further, the authors provide an upper bound on the parameter divergence when retraining based on IF.

There are also some practical learnings from this work. First, using a training-dynamics-based influence function (e.g. TracIn) is more favourable than HIF because it is more efficient to compute and, more importantly, because HIF provides no guarantee that retraining would result in the solution approximated by HIF. Second, the paper identifies the need for perturbation and iteration budgets and provides some insights into how they interact.
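To give a feel for what a training-dynamics-based influence score looks like, here is a minimal NumPy sketch in the spirit of TracIn: the influence of a training point on a test point is approximated by summing, over saved checkpoints, the learning rate times the dot product of their loss gradients. The linear model, synthetic data, and function names are assumptions made for illustration, and this is not the evaluation protocol used in the paper.

```python
import numpy as np

def grad(w, x, y):
    """Gradient of the squared loss 0.5 * (w @ x - y)**2 with respect to w."""
    return (w @ x - y) * x

def tracin_influence(checkpoints, lrs, train_point, test_point):
    """Sum over checkpoints of lr * <gradient at train point, gradient at test point>."""
    return sum(
        lr * float(grad(w, *train_point) @ grad(w, *test_point))
        for w, lr in zip(checkpoints, lrs)
    )

# Synthetic regression data (hypothetical).
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))
y = X @ np.array([1.0, -2.0, 0.5])

# Train with plain SGD and save one checkpoint per epoch.
w = np.zeros(3)
checkpoints, lrs = [], []
for epoch in range(5):
    checkpoints.append(w.copy())
    lrs.append(0.1)
    for x_i, y_i in zip(X, y):
        w = w - 0.1 * grad(w, x_i, y_i)

test_point = (X[0], y[0])
scores = [
    tracin_influence(checkpoints, lrs, (X[i], y[i]), test_point)
    for i in range(len(X))
]
print(scores)
```

A positive score marks a training point whose gradient steps tend to reduce the loss on the test point, and a negative score marks one that tends to increase it. Here the test point is itself a training point, so its self-influence is positive by construction.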

While this paper is interesting and furthers our understanding of when and how best to use IFs for model correction, it opens the path to several future directions. Firstly, the correction methods have room to improve (as the authors identify themselves). Secondly, and more importantly, this work paves the way for a better understanding of the tradeoff between correcting undesirable behaviour in a model and preserving desirable behaviour. Doing so can result in more efficient and effective fine-tuning, which can help save time and resources.

Rethinking Bias Mitigation: Fairer Architectures Make for Fairer Face Recognition

Samuel Dooley, Rhea Sanjay Sukthanker, John P Dickerson, Colin White, Frank Hutter, Micah Goldblum

by Vineel Nagisetty

Why do we recommend it?

Deep learning models have been shown to exhibit unfair behaviour (under several definitions of fairness). Traditionally, this was believed to be due to the training data and, consequently, techniques for improving fairness focused on modifying the training data, either via pre-processing, in-processing or post-processing techniques. Most approaches to creating fair models start by training a model to achieve optimal performance (accuracy) and then apply one or more of these processing techniques.

Dooley et al. ask an interesting question: is there a connection between network topology and fairness? They run extensive experiments and find that, surprisingly, the architectures and hyper-parameters used in the model have a significant impact on fairness under several fairness definitions. The authors employ neural architecture search (NAS) and hyper-parameter optimization (HPO), techniques that automate the design of model architectures and the selection of hyper-parameters, respectively, to find a Pareto frontier of face recognition models that outperforms existing state-of-the-art models on the joint objective of fairness and accuracy. This work opens up many future research avenues as well as practical applications. First of all, instead of (or in addition to) modifying training data or the learning objective, we may be able to improve fairness via the selection of architectures and/or hyper-parameters, which is a paradigm shift. Moreover, it advocates for adding multiple objectives to NAS, so enabling NAS techniques to handle multiple objectives effectively and efficiently can be a fruitful direction.

On the other hand, this work focused on a single (albeit useful) domain; we do not know yet whether the findings generalize. In other words, does network topology affect fairness generally, or only in facial recognition? 
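For readers less familiar with multi-objective search, the Pareto frontier mentioned in this review is simply the set of candidates that no other candidate beats on both objectives at once. Below is a minimal Python sketch; the architecture names and the (error, unfairness) values are made up for illustration and are not results from the paper.

```python
def pareto_front(models):
    """Keep models not dominated on (error, unfairness); lower is better for both."""
    front = []
    for name, err, unfair in models:
        # A model is dominated if another is at least as good on both
        # objectives and strictly better on at least one.
        dominated = any(
            e <= err and u <= unfair and (e < err or u < unfair)
            for _, e, u in models
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical search results: (name, error rate, unfairness score).
candidates = [
    ("arch_a", 0.05, 0.30),
    ("arch_b", 0.08, 0.10),
    ("arch_c", 0.09, 0.25),
    ("arch_d", 0.04, 0.40),
]
front = pareto_front(candidates)
print(front)
```

In this toy example `arch_c` drops out because `arch_b` is both more accurate and fairer, while the remaining three each represent a different trade-off between the two objectives.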

Jailbroken: How Does LLM Safety Training Fail?

Alexander Wei, Nika Haghtalab, Jacob Steinhardt

by Wenjie Zi

Why do we recommend it?

The growing capabilities of large language models are impressive, yet ensuring their alignment with human desires and safety is paramount for broader adoption and to prevent adversarial misuse. On social media, users have exploited model vulnerabilities by modifying prompts and have found multiple ways to get unsafe responses containing harmful content or personal information from LLMs. Such attacks are known as jailbreaks.

In this paper, researchers from UC Berkeley systematically investigated these attacks and identified two key failure modes that contribute to them:

Competing Objectives:

  • Model capabilities in understanding context and generating responses often conflict with safety requirements.
  • Examples include attaching output prefixes encouraging the model to continue a harmful response or instructing it to rule out common refusal responses.

Mismatched Generalization Ability:

  • Discrepancies regarding generalization between pre-training and fine-tuning stages using algorithms like RLHF: pre-training involves diverse data, while fine-tuning uses a smaller set of examples, leading to generalization gaps.
  • Instances like using Base64 encoded prompts to bypass safety checks were demonstrated.

Recent state-of-the-art models such as GPT-4 and Claude v1.3 use RLHF for alignment fine-tuning together with red-team evaluations, which makes them more robust to disallowed prompts (RLHF fine-tuning and red-teaming led GPT-4 to respond to 82% fewer disallowed prompts than GPT-3.5). However, despite these efforts, simple combinations of the failure modes above can still result in successful jailbreaks. The authors’ experiments showed that by combining the simple tricks mentioned above, they could jailbreak Claude v1.3 more than 80% of the time and GPT-4 more than 90% of the time, both on prompts that had already been flagged by red teams and on unseen prompts automatically generated by GPT-4.

It is worth noting that scaling language models does not automatically resolve safety issues; new vulnerabilities often emerge alongside increased capabilities. The authors’ argument for developing new training objectives that prioritize safety-capability parity is therefore crucial. Current models, despite advancements, remain vulnerable to misuse.

Double Descent

A U-turn on Double Descent: Rethinking Parameter Counting in Statistical Learning

Alicia Curth*, Alan Jeffares*, Mihaela van der Schaar

by Peter Forsyth

Why do we recommend it?

A famous U-shaped plot is axiomatic to classical statistical learning. In this plot, the x-axis records the number of parameters a model has, while the y-axis records the model’s generalization error. The plot is U-shaped because models with too few parameters underfit and have a high generalization error, while models with too many parameters overfit and also have a high generalization error. This plot has formed the basis for the scientific intuition of a generation of ML researchers.

It has been observed, however, that the U-shaped plot is not the whole picture.  For instance, in deep learning it is standard to choose models with sufficiently many parameters that they easily interpolate the training data.  Classical statistical learning would predict that these models should overfit. Often, they do not.  Instead, the last decade has shown the impressive generalization power of deep neural networks with very large capacities.

Belkin et al.’s seminal paper (2019) formulated the above observation as a new stylized fact called “double descent.” “Double descent” is illustrated with a new generalization plot. Before the interpolation threshold (the number of parameters at which the model interpolates the data) we have the classical U-shaped plot. After the interpolation threshold, however, generalization error declines again as the parameter count increases, in agreement with the empirical behaviour of deep neural networks. Belkin et al. argued that double descent is a general pattern not unique to neural networks: they also provided examples of double descent in the context of linear regression, trees, and gradient-boosted tree ensembles.

Curth et al. deconstruct the non-neural examples from Belkin et al.’s paper.  They point out that the x-axis of the double descent graph is misleading because, in all these examples, moving along the x-axis (i.e. increasing the number of parameters) actually corresponds to varying at least two distinct hyperparameters simultaneously. In general, one of these hyperparameters promotes interpolation of the training data, while the other promotes what might be called “smoothing.”  By separately manipulating these hyperparameters, Curth et al. show that they can eliminate the double descent phenomenon or even synthesize a plot that exhibits triple descent.

More fundamentally, Curth et al. argue that measuring model capacity by “number of parameters” is inadequate.  They cite an “effective number of parameters” metric from the classical work of Hastie and Tibshirani (1990) and show that if this metric is plotted along the x-axis, the classical U-shaped plot is restored.
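
To make the idea of an “effective number of parameters” concrete, here is a minimal sketch for a k-nearest-neighbour regressor, a simple linear smoother. It assumes the textbook definition (the trace of the smoother matrix, evaluated at the training inputs); the exact variant plotted in the paper may differ, and the function names and toy inputs below are our own illustrative choices.

```python
def knn_smoother_matrix(xs, k):
    """Return the n-by-n smoother matrix S of a k-nearest-neighbour regressor.

    Row i holds the weights that map the training targets y to the fitted
    value at xs[i], so that y_hat[i] = sum_j S[i][j] * y[j].
    """
    n = len(xs)
    matrix = []
    for i in range(n):
        # Rank training points by distance to xs[i]; ties are broken by index.
        order = sorted(range(n), key=lambda j: (abs(xs[j] - xs[i]), j))
        neighbours = set(order[:k])
        matrix.append([1.0 / k if j in neighbours else 0.0 for j in range(n)])
    return matrix


def effective_parameters(smoother):
    """Trace of the smoother matrix: the effective number of parameters."""
    return sum(smoother[i][i] for i in range(len(smoother)))


# Hypothetical 1-D training inputs (all distinct).
xs = [0.0, 0.7, 1.5, 2.1, 3.4, 4.0, 5.2, 6.3]
for k in (1, 2, 4, 8):
    p_eff = effective_parameters(knn_smoother_matrix(xs, k))
    print(f"k={k}: effective parameters = {p_eff:.1f}")
```

With distinct inputs, every point is its own nearest neighbour, so each diagonal entry is 1/k and the trace is n/k. Increasing k adds smoothing and lowers the effective parameter count, even though the model has no explicit parameters to count at all.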

Curth et al.’s work demonstrates that details matter, and both the original U-shaped plot and the double-descent picture are too simplistic to cover the rich behaviour of ML models.  Curth et al. focus exclusively on non-deep models, but their work also raises the important question of whether a similar analysis could be applied to deep neural networks.  When we increase the number of parameters of a neural network, can we decompose this into varying multiple fundamental hyperparameters that have distinct effects on generalization error?  Is there an appropriate measure of the “effective number of parameters” for deep neural networks? Curth et al.’s paper suggests these and other directions for future research.

Class Imbalance

Simplifying Neural Network Training Under Class Imbalance

Ravid Shwartz-Ziv, Micah Goldblum, Yucen Li, C. Bayan Bruss, Andrew Wilson

by Ed Smith

Why do we recommend it?

Class imbalance can severely degrade the performance of machine learning models at both training and deployment time, so careful model design is crucial in this setting. Common approaches include tailored loss terms, imbalanced sampling, and fine-tuning on the minority classes. In contrast, this paper studies how tuning simple, standard training settings affects performance under class imbalance, and offers prescriptions and considerations for practitioners. The authors first examine settings such as batch size, data augmentation, and network size on class-imbalanced tasks and determine the best choice for each. For example, smaller batch sizes and smaller networks are significantly more effective when the minority class is small, while data augmentation improves minority-class accuracy far more than it improves accuracy on the more prevalent classes. They then adapt self-supervision, label smoothing, and other standard optimization techniques to boost performance further.
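
As a reminder of what one of these standard techniques looks like, here is a minimal sketch of ordinary label smoothing with a cross-entropy loss. It is the textbook version only; the paper’s class-imbalance adaptation is not reproduced here, and the probabilities and smoothing value are made-up illustrative numbers.

```python
import math


def smooth_labels(label, num_classes, epsilon):
    """Mix the one-hot target for `label` with a uniform distribution."""
    return [
        (1.0 - epsilon) * (1.0 if c == label else 0.0) + epsilon / num_classes
        for c in range(num_classes)
    ]


def cross_entropy(target, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)


# Hypothetical prediction for a three-class problem whose true class is 0.
predicted = [0.90, 0.05, 0.05]
hard_target = smooth_labels(label=0, num_classes=3, epsilon=0.0)
soft_target = smooth_labels(label=0, num_classes=3, epsilon=0.1)

print("hard-label loss:", cross_entropy(hard_target, predicted))
print("smoothed-label loss:", cross_entropy(soft_target, predicted))
```

The smoothed target still sums to one but puts a little mass on every class, which penalizes over-confident predictions.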

The paper then analyses this training regime on artificially imbalanced image recognition datasets, real-world imbalanced tasks such as melanoma identification, and real-world tabular settings such as product classification. Using only their simple training prescriptions and adapted optimizations, the authors outperform all baseline class imbalance training strategies and achieve state-of-the-art performance on all tasks. Finally, through an analysis of the Neural Collapse phenomenon, in which the representations of samples from the same class collapse to their class mean late in training, they demonstrate that the improved performance results from better generalization to test sets than prior work achieves, as opposed to models that simply fit the training data better.

Tabular Data

When Do Neural Nets Outperform Boosted Trees on Tabular Data? 

Duncan C. McElfresh, Sujay Khandagale, Jonathan Valverde, Vishak Prasad C, Ganesh Ramakrishnan, Micah Goldblum, Colin White

Photograph of Hossein Hajimirsadeghi

by Hossein H.

Why do we recommend it?

In recent years, a vibrant and ongoing discussion has emerged around the optimal machine learning models for tabular data. Various studies have compared gradient-boosted decision trees (GBDTs) with neural networks (NNs) on tabular data, with mixed results. Some found that NNs perform best on average, while others concluded that GBDTs have the edge.

This paper conducts a large-scale study across 176 datasets using 19 ML algorithms, including popular recent techniques and baselines: three GBDTs (CatBoost, LightGBM, and XGBoost), 11 neural networks, and five baselines (Decision Tree, KNN, Logistic Regression, Random Forest, and SVM).

Some key findings can be summarized as follows.

  • No Universal Winner: Across different experiments, the paper concludes that there is no universal winner between GBDTs and NNs. The best choice varies with the dataset and the specific context of the application.
  • Performance of NNs vs GBDTs: The paper’s analysis revealed that for a significant number of datasets, the performance difference between GBDTs and NNs is negligible. In some cases, light hyperparameter tuning on a GBDT is more important than choosing between NNs and GBDTs.
  • Handling of Dataset Irregularities: GBDTs are found to be much better than NNs at handling skewed or heavy-tailed feature distributions and other forms of dataset irregularities. This indicates that GBDTs are more robust in dealing with diverse and challenging data characteristics.
  • Performance in Large Datasets: GBDTs tend to perform better on larger datasets and on those with a high ratio of size to number of features. Note that although GBDTs generally perform better than NNs on larger datasets, this doesn’t mean that every GBDT is superior to every NN on large datasets. Practitioners should therefore rely on algorithm-specific analyses when making decisions.
  • Recommendation when facing a new dataset: Start with straightforward baseline models, then apply light hyperparameter tuning to CatBoost, which, perhaps surprisingly, often yields the best performance. After that, explore the neural networks and other GBDTs that the dataset’s meta-features suggest are likely to perform well.
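
Several of the findings above depend on simple dataset properties such as feature skew, heavy tails, and the ratio of dataset size to number of features. The sketch below shows one way to compute such quantities with standard moment formulas. It is not the paper’s meta-feature code; the function names and the toy column are hypothetical.

```python
def central_moment(values, order):
    """Population central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Third standardized moment; 0 for a symmetric sample."""
    m2 = central_moment(values, 2)
    if m2 == 0.0:
        return 0.0  # constant column: no asymmetry to measure
    return central_moment(values, 3) / m2 ** 1.5


def excess_kurtosis(values):
    """Fourth standardized moment minus 3; large values suggest heavy tails."""
    m2 = central_moment(values, 2)
    if m2 == 0.0:
        return 0.0
    return central_moment(values, 4) / m2 ** 2 - 3.0


def size_to_feature_ratio(num_rows, num_features):
    """Number of rows divided by number of features."""
    return num_rows / num_features


# Hypothetical feature column with one extreme value.
column = [1.0, 1.0, 1.0, 1.0, 100.0]
print("skewness:", skewness(column))
print("excess kurtosis:", excess_kurtosis(column))
print("size/features:", size_to_feature_ratio(num_rows=5, num_features=1))
```

A strongly positive skewness or excess kurtosis on many columns would be one signal, under the findings above, to try a GBDT early.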

Finally, the paper releases the TabZilla Benchmark Suite, a collection of the 36 hardest datasets, designed to accelerate research on tabular data. All the material is available online.

Object-centric representations

Additive Decoders for Latent Variables Identification and Cartesian-Product Extrapolation 

Sebastien Lachapelle, Divyat Mahajan, Ioannis Mitliagkas, Simon Lacoste-Julien

by Ankit Vani

Why do we recommend it?

Compositional generalization is a crucial challenge for machine learning in the real world. It refers to the ability to apply learned knowledge to new situations composed of familiar elements. For example, humans can understand and communicate complex scenes, concepts and ideas by systematically combining simpler components. This requires a compositional representation of information that captures the data’s underlying structure and causal factors. Existing machine learning models often lack such a representation and fail to generalize beyond the limited combinations seen during training or empirically demonstrate improved compositionality without a theoretical explanation. A fundamental question that needs concrete answers is: How can we ensure that a model learns the true latent variables that generate the data rather than spurious correlations and artifacts?

To answer this question, this paper proposes a novel identifiability analysis, asking whether a model can recover the ground-truth latent variables up to a permutation and element-wise invertible transformation when shown only a subset of possible compositions during training. Unlike previous approaches that impose strong assumptions on the latent distribution, such as independence or factorial structure, this paper allows for arbitrary dependencies among the latent factors. Instead, the paper considers structural assumptions of a class of decoders, called additive decoders, that sum the output of decoders for each latent factor separately. The paper proves that when the ground-truth decoder is additive and possible values for each factor of variation are present in the training set, trained additive decoders can identify the true latent factors and extrapolate to novel combinations of factors not seen during training. For instance, an additive decoder can generate a “blue box with a red ball” even if it only saw the concepts “blue box” and “red ball” separately in other images. The authors define this property as “Cartesian-product extrapolation” and demonstrate empirically that additive decoders are essential for achieving it. This paper provides valuable insights into the theoretical aspects of learning compositionally generalizable representations.
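
The structure of an additive decoder can be sketched in a few lines. This toy example is our own illustration, not the authors’ model: each latent factor has its own hand-written decoder (here, a one-dimensional “image” with one lit pixel per object), and the full decoder simply sums their outputs. All names, the image width, and the training combinations are hypothetical.

```python
WIDTH = 8  # number of pixels in the toy one-dimensional image


def decode_box(position):
    """Hypothetical decoder for the 'box' factor: intensity 1.0 at `position`."""
    return [1.0 if i == position else 0.0 for i in range(WIDTH)]


def decode_ball(position):
    """Hypothetical decoder for the 'ball' factor: intensity 2.0 at `position`."""
    return [2.0 if i == position else 0.0 for i in range(WIDTH)]


def additive_decoder(box_position, ball_position):
    """Sum the per-factor decoder outputs to produce the full image."""
    box_part = decode_box(box_position)
    ball_part = decode_ball(ball_position)
    return [b + c for b, c in zip(box_part, ball_part)]


# Combinations "seen in training": each factor takes all of its values,
# but only along an L-shaped subset of the full grid of combinations.
seen = {(box, 0) for box in range(WIDTH)} | {(0, ball) for ball in range(WIDTH)}

# A combination outside that subset, i.e. in the rest of the Cartesian product.
novel = (2, 5)
print("seen during training:", novel in seen)
print("decoded image:", additive_decoder(*novel))
```

Because the output is a sum of per-factor terms, knowing each term on its own is enough to render any pairing of factor values, which is the intuition behind Cartesian-product extrapolation.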

Reinforcement Learning 

Behavior Alignment via Reward Function Optimization

Dhawal Gupta, Yash Chandak, Scott M. Jordan, Philip S. Thomas, Bruno Castro da Silva

by Raihan Seraj

Why do we recommend it?

This paper introduces a new framework, based on a bilevel objective, for learning reward functions that align a reinforcement learning agent with its intended behaviour. Incorporating domain knowledge often involves heuristic auxiliary rewards that encode the designer’s intentions and expedite learning. A popular way to perform such reward shaping is through potential-based schemes, which add auxiliary rewards while retaining the optimal policy. Even so, the success of integrating auxiliary rewards depends heavily on problem complexity and on the designer’s skill in crafting heuristic rewards that don’t deviate from the intended behaviour.
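
For readers unfamiliar with potential-based shaping, the minimal sketch below shows the standard form of the shaping term, gamma * phi(s') - phi(s), on a toy trajectory. It illustrates the classical scheme only, not the bilevel method proposed in the paper; the potential function, states, and rewards are hypothetical.

```python
def shaped_reward(reward, state, next_state, potential, gamma):
    """Environment reward plus the potential-based shaping term."""
    return reward + gamma * potential(next_state) - potential(state)


def discounted_return(rewards, gamma):
    """Discounted sum of a reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def potential(state):
    """Hypothetical potential: states closer to the goal (state 3) score higher."""
    return float(state)


gamma = 0.9
states = [0, 1, 2, 3]      # toy trajectory that walks towards the goal
rewards = [0.0, 0.0, 1.0]  # sparse environment reward on the final step

shaped = [
    shaped_reward(r, s, s_next, potential, gamma)
    for r, s, s_next in zip(rewards, states[:-1], states[1:])
]

original_return = discounted_return(rewards, gamma)
shaped_return = discounted_return(shaped, gamma)
# The shaping terms telescope, so the difference depends only on the endpoints.
endpoint_term = gamma ** len(rewards) * potential(states[-1]) - potential(states[0])

print("original return:", original_return)
print("shaped return:", shaped_return)
print("difference:", shaped_return - original_return)
print("endpoint term:", endpoint_term)
```

The difference between the shaped and original returns depends only on the first and last states of the trajectory, which is the telescoping property behind the policy-preservation guarantee of potential-based shaping.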

The authors present a novel bilevel optimization approach to design parameterized reward functions. This approach addresses poorly designed auxiliary functions by automatically integrating them into the inner and outer objectives of the optimization problem, effectively utilizing designer-specified auxiliary rewards to shape desired behaviours. Moreover, this method demonstrates how employing a bilevel formulation for reward function design and careful optimization reduces bias in the underlying RL algorithms.

Additionally, the paper introduces an algorithm for parameter learning based on the bilevel objective, utilizing implicit gradients. To validate the efficacy of their proposed objective, the paper considers a wide array of RL test suites, ranging from basic grid-world setups to more complex locomotion tasks, demonstrating the scalability and robustness of their method for designing rewards.

Keep updated on everything related to NeurIPS 2023

The Borealis AI team will be at NeurIPS all week, showcasing our research and products. We’ll also be meeting with brilliant minds in the AI community and chatting about our open roles. 

Do you want to stay up to date on all things #NeurIPS2023? We’re sharing highlights from the conference and additional papers on our LinkedIn and Twitter. Check it out!