We’re ready for the 61st Annual Meeting of the Association for Computational Linguistics, the premier international conference on computational linguistics. From July 9th to July 14th, 2023, join us in Toronto, Canada, for ACL’23.

In an era where large language models like ChatGPT are revolutionizing the fields of Natural Language Processing (NLP) and AI, the anticipation for ACL’23 is palpable. These cutting-edge models have propelled NLP into uncharted territories, powering applications such as machine translation, speech recognition, intelligent web searching, and more.

This year’s conference brings together a diverse community of researchers, tackling various computational challenges related to natural language. It encompasses an extensive array of research areas, all united by their shared commitment to computational approaches in understanding and processing language. To cut through the noise, we’ve curated a reading list categorized by topics such as causality, LLM, decoding, reasoning, diffusion models, language model evaluation, small language models, and chain-of-thoughts. Each selected paper is paired with a summary and our thoughts on its potential impact.

Explore our research team’s recommendations and get ready to be captivated by this year’s research.


  • Causality
  • Diffusion Models
  • LLM
  • Decoding
  • Language Model Evaluation
  • Natural Language Processing
  • Small Language Model
  • Chain-of-thoughts


Contrastive Decoding: Open-ended Text Generation as Optimization

Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettlemoyer, Mike Lewis

by Peter Forsyth

Why do we recommend it?

The lion’s share of recent progress in text generation can be attributed to scale: by spending more money deploying more advanced hardware to train larger transformers for longer, leading NLP labs have created systems that generate text of astonishing fluency and coherence.  It is natural to ask whether fluent, coherent text can be generated more cheaply using smaller models.  Li et al. cleverly observe that when a medium-sized and a small language model both generate the incorrect continuation for a prompt, the medium-sized model generally assigns a higher probability to the correct continuation than the small model. For example, GPT-2 XL and GPT-2 Small both incorrectly continue the prompt “Barack Obama was born in Hawaii.  He was born in …”, but GPT-2 XL assigns higher probability to “1961”, the correct continuation.  The authors propose contrastive decoding, which exploits this observation. 

Contrastive decoding uses two language models, an expert and an amateur. It selects tokens that have non-negligible probability according to the expert and to which the expert assigns a much larger probability than the amateur does.  The technique works well, enabling a small and a medium-sized model to be combined to generate text of comparable quality to that of a large model. Contrastive decoding seems to be in the air at ACL: I notice one paper that directly builds on Li et al.’s method and another that concurrently proposes something similar.
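
To make the selection rule concrete, here is a minimal sketch of a single decoding step. It is not the authors' implementation: the toy probabilities, the function names, and the cutoff value `alpha=0.1` are assumptions chosen purely for illustration. Tokens the expert considers implausible are dropped first, and the remaining tokens are ranked by how much more the expert likes them than the amateur does.

```python
import math

def contrastive_scores(expert_probs, amateur_probs, alpha=0.1):
    """Score candidate next tokens by contrasting an expert with an amateur.

    expert_probs / amateur_probs: dicts mapping token -> next-token probability.
    alpha: hypothetical plausibility cutoff, relative to the expert's top token.
    """
    cutoff = alpha * max(expert_probs.values())
    scores = {}
    for token, p_expert in expert_probs.items():
        if p_expert < cutoff:
            continue  # the expert finds this token implausible, so drop it
        # Floor the amateur probability to avoid log(0) for unseen tokens.
        p_amateur = max(amateur_probs.get(token, 0.0), 1e-12)
        scores[token] = math.log(p_expert) - math.log(p_amateur)
    return scores

def pick_token(expert_probs, amateur_probs, alpha=0.1):
    """Return the plausible token with the largest expert-vs-amateur gap."""
    scores = contrastive_scores(expert_probs, amateur_probs, alpha)
    return max(scores, key=scores.get)

# Made-up next-token distributions for a prompt about Obama's birth.
expert = {"Hawaii": 0.40, "1961": 0.30, "the": 0.29, "zebra": 0.01}
amateur = {"Hawaii": 0.50, "1961": 0.05, "the": 0.40, "zebra": 0.05}

choice = pick_token(expert, amateur)
print(choice)
```

In this toy case both models rank "Hawaii" first, but the expert assigns far more probability to "1961" than the amateur does, so the contrast favours it.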

Diffusion Models

SSD-LM: Semi-autoregressive Simplex-based Diffusion Language Model for Text Generation and Modular Control

Xiaochuang Han, Sachin Kumar, Yulia Tsvetkov

by Peter Forsyth

Why do we recommend it?

In language modelling, the autoregressive paradigm is dominant.  Nevertheless, autoregressive language models have known drawbacks. For instance, controlling the style or sentiment of generated text requires either prompt engineering, which can be unreliable, or fine-tuning, which can be expensive.  Moreover, certain erroneous, hallucinatory reasoning patterns can be attributed to strictly left-to-right generation: when asked to solve a logic problem, an autoregressive model may output the wrong answer and then attempt to justify it retrospectively.

Han et al. propose to address some of these drawbacks with a diffusion language model that generates all tokens in a block of text simultaneously through iterative refinement. This makes it easy to control the style and sentiment of the text by incorporating an off-the-shelf classifier into the refinement steps.  Diffusion models are popular in computer vision, and Han et al. are not the first to propose applying the same idea to natural language.  However, unlike previous diffusion language models, Han et al.’s model appears to compete with comparable autoregressive models in the quality of its generated text.  The authors leverage a few technical tricks to achieve this: they perform their diffusion in the log-transform of the token probability simplex, and they make use of a semi-autoregressive step as part of their iterative refinement.
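
As a rough illustration of what working in a log-transformed simplex can look like, the sketch below maps a token to "almost one-hot" logits, perturbs them with Gaussian noise, and projects back onto the probability simplex with a softmax. This is our own simplified reading, not the authors' code: the constant `k`, the noise scale, the tiny vocabulary, and all function names are assumptions made for illustration.

```python
import math
import random

def token_to_logits(token_id, vocab_size, k=5.0):
    """Almost-one-hot logits: +k for the chosen token, -k for every other entry."""
    return [k if i == token_id else -k for i in range(vocab_size)]

def add_noise(logits, sigma, rng):
    """Perturb each logit with independent Gaussian noise (a toy forward step)."""
    return [x + rng.gauss(0.0, sigma) for x in logits]

def softmax(logits):
    """Project logits back onto the probability simplex."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
clean = token_to_logits(token_id=2, vocab_size=4)
noisy = add_noise(clean, sigma=1.0, rng=rng)

clean_probs = softmax(clean)
noisy_probs = softmax(noisy)
decoded = max(range(len(noisy_probs)), key=noisy_probs.__getitem__)
print(decoded, [round(p, 4) for p in noisy_probs])
```

Because every intermediate state is a distribution over the vocabulary, a classifier that consumes token distributions can, in principle, nudge the state during refinement, which is what makes the modular control described above possible.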

While diffusion models still have a way to go before they can pose a serious challenge to autoregressive models in the language domain, Han et al.’s work is an important step.


A Causal Framework to Quantify the Robustness of Mathematical Reasoning with Language Models

Alessandro Stolfo, Zhijing Jin, Kumar Shridhar, Bernhard Schölkopf, Mrinmaya Sachan 

by Keyi Tang

Why do we recommend it?

Mathematical reasoning is one of the key tasks for tracking the capabilities of large language models (LLMs). Although state-of-the-art LLMs have achieved impressive results on various mathematical reasoning benchmarks, it remains unclear whether these models have learned superficial patterns in the data or have truly mastered the mathematical concepts necessary to consistently solve variations of similar problems.

This paper proposes a formal quantitative framework based on causal inference. By conducting controlled interventions on the mathematical operation, the operands, and the textual framing of the underlying problems, it disentangles and separately measures the causal effects of the different factors that influence LLM predictions in mathematical reasoning. The authors apply this framework to thirteen pre-trained language models and report the following intriguing findings:

  1. With an increase in model size, the model becomes more sensitive to various interventions.
  2. Mathematical reasoning robustness does not always improve with increasing model size. While the GPT-3 Davinci and LLaMA families showed significant improvement in mathematical reasoning robustness, other models did not exhibit the same trend.
  3. Models that demonstrate good robustness when dealing with two-operand problems exhibit unstable behaviour when faced with three-operand problems. This discovery suggests that LLMs become more sensitive to spurious correlations between the text surface form and the results as the complexity of the problem increases.
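
The intervention idea can be sketched in a few lines. The example below is a simplified illustration rather than the paper's actual procedure: the problem template, the hand-written answer distributions standing in for model outputs, and the choice of total variation distance as the effect measure are all our assumptions. It compares how much a model's answer distribution shifts when the operands change but the result stays the same, versus when the result changes too.

```python
def render(template, n1, n2):
    """Fill a word-problem template with two operands."""
    return template.format(n1=n1, n2=n2)

def total_variation(p, q):
    """Half the L1 distance between two distributions over candidate answers."""
    answers = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in answers)

template = "Ann has {n1} apples and buys {n2} more. How many apples does she have?"
original = render(template, 3, 4)      # result is 7
same_result = render(template, 2, 5)   # operands change, result is still 7
new_result = render(template, 3, 6)    # operands change, result becomes 9

# Hypothetical model answer distributions for each prompt (not real outputs).
p_original = {"7": 0.8, "9": 0.1, "12": 0.1}
p_same_result = {"7": 0.6, "9": 0.1, "12": 0.3}
p_new_result = {"7": 0.2, "9": 0.7, "12": 0.1}

effect_same_result = total_variation(p_original, p_same_result)
effect_new_result = total_variation(p_original, p_new_result)
print(round(effect_same_result, 3), round(effect_new_result, 3))
```

Under this reading, a robust model would show a small shift when only the surface form of the operands changes and a large shift when the correct result changes.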

Large Language Model Evaluation

Can Large Language Models Be an Alternative to Human Evaluation?

Cheng-Han Chiang, Hung-yi Lee

by Wenjie Zi

Why do we recommend it?

Evaluating the performance of a large language model (LLM) is crucial, as it serves as the guiding principle for model optimization. In business domains, especially highly regulated ones like finance, gaining a comprehensive understanding of the expected performance of LLMs is vital to ensure no risks are overlooked.

While numerous evaluation benchmarks have been released (Chang et al. (2023)), challenges remain, particularly for natural language generation tasks where there is no single correct answer (Gehrmann et al. (2022)). Typically, two approaches are used for such tasks: automatic evaluation using human-designed metrics, such as extensions of BLEU or ROUGE, and human evaluation. However, human-designed metrics are known to be less robust, as a simple change in wording can lead to significant variations in scores.

In this paper, the researchers explored the possibility of using LLMs to evaluate their own performance. They focused on the open-ended story generation task as an example. Although the correlation scores between evaluations by human experts and LLMs were considered weak to moderate, the researchers observed that human experts tended to agree with the evaluation scores provided by advanced LLMs like GPT-3 or ChatGPT. The lower correlation scores may stem from the inherent difficulty of the task, as even the correlation between two human experts is relatively low. Additionally, the researchers conducted experiments in which they changed the prompts and found that the LLMs’ evaluations were not very sensitive to prompt variations. When identifying adversarial samples, they found that while ChatGPT was less critical of the unnatural and artificial aspects of adversarial samples than human experts were, it still rated adversarial samples significantly lower than real samples.
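
To give a feel for how agreement between human and LLM ratings can be quantified, here is a small sketch that computes a rank correlation (Kendall's tau-b) between two lists of ratings. The ratings are invented for illustration, and we assume Kendall's tau only as one common choice of measure; the function is a plain-Python stand-in rather than the evaluation code used in the paper.

```python
def kendall_tau_b(xs, ys):
    """Kendall's tau-b rank correlation, with a correction for tied ratings."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue  # tied in both rankings: contributes nothing
            elif dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1  # both raters order this pair the same way
            else:
                discordant += 1  # the raters disagree on this pair
    untied = concordant + discordant
    denom = ((untied + tied_x_only) * (untied + tied_y_only)) ** 0.5
    return (concordant - discordant) / denom if denom else 0.0

# Invented 1-to-5 ratings of five stories by a human expert and by an LLM.
human_ratings = [1, 2, 3, 4, 5]
llm_ratings = [2, 1, 3, 5, 4]

tau = kendall_tau_b(human_ratings, llm_ratings)
print(round(tau, 3))
```

A value near 1 means the two raters order the stories almost identically, while a value near 0 means their orderings are essentially unrelated.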

These findings suggest a promising direction: leveraging LLMs to evaluate their own generated results. Given the low requirement for human labelling and high reproducibility, this approach can be easily incorporated into existing evaluation or validation systems.

Smaller Language Model

Teaching Small Language Models to Reason

Lucie Charlotte Magister, Jonathan Mallinson, Jakub Adamek, Eric Malmi, Aliaksei Severyn

by Wenjie Zi

Why do we recommend it?

Large language models have demonstrated a remarkable ability to conduct complex reasoning through prompting techniques such as chain-of-thought. This complex reasoning ability enables the development of more intelligent AI systems than ever before.

However, deploying solutions based on LLMs comes with many challenges, including long inference times and massive resource requirements (memory/GPU). Additionally, the MLOps tools that support putting LLMs into production are developing rapidly but are still at an early stage. Hence, smaller language models that achieve comparable performance are of significant interest, given their advantages in practical applications.

In this paper, researchers developed a straightforward yet promising strategy inspired by ‘self-instruct’ (Wang et al. (2022)). They used two LLMs, PaLM (540B) and GPT-3 (175B), to generate chain-of-thought responses, which were then used to fine-tune a considerably more compact model, T5 XXL (11B). On the GSM8K benchmark, which consists of 8.5k grade-school mathematics word problems, the fine-tuned T5 XXL (11B) more than doubled its pre-fine-tuning accuracy (from 8% to 22%). While still far from the performance of PaLM (540B) at 57%, this approach shows potential for advancing the proficiency of smaller models. Similar findings have been observed with Stanford’s Alpaca model (Taori et al. (2023)), which used responses generated by GPT-3 (175B) to fine-tune LLaMA (7B) and achieved a level of performance similar to that of GPT-3 (175B).
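
The data-construction step behind this kind of teacher-to-student fine-tuning can be sketched as follows. This is a simplified illustration, not the authors' pipeline: the canned teacher, the answer-extraction rule, and the decision to keep only rationales whose final answer matches the reference are assumptions on our part.

```python
def build_finetuning_set(problems, teacher, extract_answer):
    """Collect (question, rationale) pairs whose final answer is correct."""
    examples = []
    for problem in problems:
        rationale = teacher(problem["question"])
        # Assumed filtering step: discard rationales with a wrong final answer.
        if extract_answer(rationale) == problem["answer"]:
            examples.append({"input": problem["question"], "target": rationale})
    return examples

# Hypothetical stand-in for a large teacher model: canned chain-of-thought text.
CANNED = {
    "Tom has 3 pens and buys 4 more. How many pens does he have?":
        "Tom starts with 3 pens. 3 + 4 = 7. The answer is 7.",
    "A box holds 6 eggs. How many eggs are in 2 boxes?":
        "There are 6 eggs and 2 boxes. 6 + 2 = 8. The answer is 8.",
}

def toy_teacher(question):
    return CANNED[question]

def toy_extract_answer(rationale):
    """Take whatever follows the last 'The answer is' as the final answer."""
    return rationale.rsplit("The answer is", 1)[-1].strip(" .")

problems = [
    {"question": "Tom has 3 pens and buys 4 more. How many pens does he have?", "answer": "7"},
    {"question": "A box holds 6 eggs. How many eggs are in 2 boxes?", "answer": "12"},
]

dataset = build_finetuning_set(problems, toy_teacher, toy_extract_answer)
print(len(dataset), dataset[0]["target"])
```

The resulting pairs would then serve as input and target sequences for fine-tuning the smaller model, so that it learns to produce the reasoning steps as well as the final answer.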

A recent development from Microsoft, phi-1 (1.3B) (Gunasekar et al. (2023)), has pushed the performance boundaries of smaller models through improved dataset quality. We are excited to see the synergies emerging from different directions in the field, aiming to narrow the performance gap between smaller language models and LLMs.