This blog post is based on our paper accepted to the International Conference on Learning Representations (ICLR) 2024. Please refer to the paper AutoCast++: Enhancing World Event Prediction with Zero-shot Ranking-based Context Retrieval for full details. Code is available here.

Predicting World Events with Machine Learning: A Serious Endeavour

Deep learning research is seeing a surge of interest in predicting real-world events, a domain with considerable potential. Imagine machines that not only process data but also forecast future events by analyzing news articles. This idea, once confined to science fiction, is now a fast-growing area of research.

Traditionally, machine learning predictions have focused on time-series data, such as stock market trends or weather forecasts. Forecasting real-world events, by contrast, requires sifting through vast amounts of unstructured text in news articles. Despite significant progress, such as the release of the Autocast dataset [1], machine learning’s ability to predict events from news content is still in its infancy and lags behind human forecasters.

Our research seeks to improve event prediction accuracy by addressing three key questions:

  • How do we identify the most relevant news articles for prediction?
  • Which methods effectively process these articles to extract key insights?
  • How can we improve the learning process to better utilize information from news articles?

Introducing AutoCast++: An Overview of Our Methodology

We propose a novel approach to improve world event predictions using news articles, incorporating a zero-shot ranking-based retriever-reader model. This model aims to refine the selection and processing of relevant articles and optimize learning from the information they provide. We detail the components of our methodology, each designed to tackle specific challenges in event forecasting.

Figure 1: Full architecture diagram of our approach.


All of our experiments were conducted on the recently released Autocast dataset (from which our approach takes its name). This dataset contains a large set of real-life questions on world events, along with the responses from human forecasters. The questions are diverse in both topic and geographical location, and come in three types: multiple-choice, binary (True/False), and numerical, where the goal is to output a number (or, for example, a date). We compare against the baselines introduced in the paper that accompanies the dataset.

Preliminary. Our model’s foundation is a retriever-reader framework that selects relevant news articles from a large database for any forecasting query. It starts by identifying relevant articles based on their publication date, title, and relevance score to the query. These articles, along with the query, are then processed by our reader network to predict event outcomes using an encoder-decoder transformer architecture, specifically Fusion-in-Decoder (FiD) built on T5. The full architecture diagram can be seen in Figure 1.

Task-Aligned Retrieval Module. We developed a task-aligned retrieval module that enhances article selection by incorporating zero-shot news relevance assessment and considering news timeliness based on human feedback. It re-ranks articles initially identified by a basic retriever (e.g., BM25) using a relevance score from a pre-trained Language Model (LM) without task-specific training. This is achieved by using a simple prompt to elicit a relevance ranking from an LLM.

We also introduced a method to prioritize recent articles, on the premise that recency correlates with better prediction performance, as it does for human forecasters. An example of the impact of our retrieval module in terms of article relevance can be seen in Figure 2. A detailed view of the retriever can be found in Figure 3.
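The re-ranking idea described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation from the paper: `relevance_fn` is a hypothetical stand-in for the LLM-derived zero-shot relevance score, the exponential recency decay and its half-life are placeholders for the human-feedback-based timeliness score, and combining the two by multiplication is also an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List


@dataclass
class Article:
    title: str
    text: str
    published: date
    bm25_score: float  # score from the first-stage retriever (e.g., BM25)


def recency_weight(published: date, question_close: date,
                   half_life_days: float = 30.0) -> float:
    """Weight in (0, 1]; articles closer to the question's close date score higher.

    ASSUMPTION: an exponential half-life decay is used purely for illustration.
    """
    age_days = max((question_close - published).days, 0)
    return 0.5 ** (age_days / half_life_days)


def rerank(articles: List[Article],
           relevance_fn: Callable[[str, Article], float],
           question: str,
           question_close: date,
           top_k: int = 10) -> List[Article]:
    """Re-rank first-stage candidates by relevance multiplied by recency.

    ASSUMPTION: relevance_fn is a hypothetical callable returning a score in
    [0, 1], e.g. parsed from an LLM's answer to a relevance prompt.
    """
    scored = []
    for article in articles:
        relevance = relevance_fn(question, article)
        score = relevance * recency_weight(article.published, question_close)
        scored.append((score, article))
    # Sort on the score only, so Article objects are never compared.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [article for _, article in scored[:top_k]]
```

In this sketch the first-stage BM25 score only determines which candidates reach the re-ranker; whether it should also enter the final score is a design choice left open here.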

Neural Reader with Unsupervised Distillation. After retrieval, our reader processes the articles to answer forecasting questions. We use unsupervised distillation for abstractive summarization, turning lengthy articles into concise summaries. These summaries enable the FiD reader to generate forecasting answers, enhancing the precision and effectiveness of our predictions.
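To make the reader's input concrete, here is a small sketch of how the summarized articles might be packed for a FiD-style reader, which encodes each (question, article) pair independently and lets the decoder attend over all encodings jointly. The `summarize` callable is a hypothetical placeholder for the summarization step, the `question: … title: … context: …` template is an assumed input format, and `context_size=10` is chosen to mirror the context length of 10 reported in our results table.

```python
from typing import Callable, List, Tuple


def build_reader_inputs(question: str,
                        ranked_articles: List[Tuple[str, str]],
                        summarize: Callable[[str], str],
                        context_size: int = 10) -> List[str]:
    """Build one input string per retrieved article for a FiD-style reader.

    ranked_articles holds (title, body) pairs, best-ranked first.
    ASSUMPTION: summarize is a hypothetical stand-in for the summarizer, and
    the "question: / title: / context:" template is illustrative.
    """
    inputs = []
    for title, body in ranked_articles[:context_size]:
        summary = summarize(body)  # concise summary replaces the full article
        inputs.append(f"question: {question} title: {title} context: {summary}")
    return inputs
```

Each returned string would be tokenized and encoded separately; only the decoder sees all articles at once, which keeps the encoder cost linear in the number of articles.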

Figure 2: Comparison between the vanilla (i.e. BM25) retriever and our approach.

Figure 3: Representation of our retriever module


Training Objectives with Human Alignment Loss. We propose modifications to the autoregressive generative decoder loss, including numerical question binning for direct value prediction and a human annotation alignment loss. This aligns the model’s predictions with human forecasters’ probabilistic responses, offering a nuanced learning approach from real-world forecasting.
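The two ingredients named above can be illustrated with a short sketch. Both functions are assumptions made for illustration: `to_bin` shows one simple way to discretize a numerical answer into equal-width bins (the bin count of 20 is arbitrary), and `alignment_loss` shows one plausible form of an alignment term, a cross-entropy between the human forecasters' probability distribution and the model's predicted distribution. The exact formulation in the paper may differ.

```python
import math
from typing import List, Sequence


def to_bin(value: float, low: float, high: float, num_bins: int = 20) -> int:
    """Map a numerical answer in [low, high] to an equal-width bin index.

    ASSUMPTION: equal-width bins and num_bins=20 are illustrative choices.
    """
    if high <= low:
        raise ValueError("high must be greater than low")
    frac = (value - low) / (high - low)
    frac = min(max(frac, 0.0), 1.0)  # clamp out-of-range answers
    return min(int(frac * num_bins), num_bins - 1)


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def alignment_loss(logits: Sequence[float],
                   human_probs: Sequence[float],
                   eps: float = 1e-12) -> float:
    """Cross-entropy of the model's distribution against the human forecast.

    ASSUMPTION: human_probs is the crowd forecast over answer choices and
    sums to 1; cross-entropy is one plausible choice of alignment term.
    """
    probs = softmax(logits)
    return -sum(h * math.log(p + eps) for h, p in zip(human_probs, probs))
```

With binning, a numerical question can be trained with the same classification-style objective as multiple-choice questions, and the alignment term rewards predictions that track the probabilities human forecasters assigned rather than only the final resolved answer.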

Together, these components aim to make machine learning forecasts of world events more accurate and timely through careful selection and analysis of news articles.

Key Results

Our evaluation on the Autocast dataset, covering economics, politics, and technology, uses annotated questions and news articles from the Common Crawl corpus (2016-2022). Our metrics are accuracy for true/false and multiple-choice questions and absolute error for numerical predictions. The dataset is split at mid-2021, with questions from mid-2021 to mid-2022 held out for testing. Using pre-trained T5 weights and the FiD Static architecture, our model achieved significant improvements over the FiD Static baseline in the 0.2B model size category: a 48% relative increase in multiple-choice accuracy, an 8% relative increase in true/false accuracy, and a 19% reduction in absolute error for numerical predictions.

| Model | Full Context Length | Model Size | T/F ↑ | MCQ ↑ | Num. ↓ |
|---|---|---|---|---|---|
| Small-size Models | | | | | |
| UnifiedQA (Khashabi et al., 2022) | n/a | 0.2B | 45.4 | 23.5 | 34.5 |
| T5 (Raffel et al., 2020) | n/a | 0.2B | 61.3 | 24.0 | 20.5 |
| FiD Static (Zou et al., 2022) | 10 | 0.2B | 62.0 | 29.6 | 24.5 |
| FiD Temporal (Zou et al., 2022) | query-specific | 0.6B | 62.0 | 33.5 | 23.9 |
| Autocast++ (ours) | 10 | 0.2B | 66.7 | 43.8 | 19.8 |
| Middle-size Models | | | | | |
| UnifiedQA (Khashabi et al., 2022) | n/a | 0.8B | 48.2 | 23.5 | 34.5 |
| T5 (Raffel et al., 2020) | n/a | 0.8B | 60.0 | 29.1 | 21.7 |
| FiD Static (Zou et al., 2022) | 10 | 0.8B | 64.1 | 32.4 | 21.8 |
| FiD Temporal (Zou et al., 2022) | query-specific | 1.5B | 63.8 | 32.4 | 21.0 |
| Autocast++ (ours) | 10 | 0.8B | 67.3 | 44.0 | 19.9 |
| Large-size Models | | | | | |
| UnifiedQA (Khashabi et al., 2022) | n/a | 2.8B | 54.9 | 25.1 | 34.5 |
| T5 (Raffel et al., 2020) | n/a | 2.8B | 60.0 | 26.8 | 21.9 |
| FiD Static (Zou et al., 2022) | 10 | 2.8B | 65.4 | 35.8 | 19.9 |
| FiD Temporal (Zou et al., 2022) | query-specific | 4.3B | 62.9 | 36.9 | 19.5 |
| Autocast++ (ours) | 10 | 2.8B | 67.9 | 44.1 | 19.8 |

Figure 4: Key results of our method, which outperforms baselines with larger parameter counts.

These results indicate that larger model sizes do not necessarily lead to better forecasting performance. Our 0.2B model outperformed the 2.8B FiD Static baseline on all three metrics, highlighting the importance of optimization and architectural improvements.
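As a quick sanity check, the headline percentages can be reproduced as relative changes between the FiD Static and Autocast++ rows at 0.2B in the results table. The snippet below uses only values from those two rows; the dictionary keys are labels chosen for this example.

```python
def relative_change(baseline: float, ours: float) -> float:
    """Relative change of `ours` versus `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


# Values taken from the 0.2B rows of the results table.
fid_static = {"tf": 62.0, "mcq": 29.6, "num": 24.5}
autocast_pp = {"tf": 66.7, "mcq": 43.8, "num": 19.8}

changes = {k: relative_change(fid_static[k], autocast_pp[k]) for k in fid_static}
for metric, pct in changes.items():
    print(f"{metric}: {pct:+.1f}%")
# tf: +7.6%
# mcq: +48.0%
# num: -19.2%
```

Rounded, these give the 8%, 48%, and 19% figures quoted earlier; for the numerical metric a negative change is the improvement, since lower absolute error is better.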

For more information on this project, we invite readers to explore our full paper here and join us at the upcoming ICLR conference in Vienna for discussions on future directions in event forecasting.


Our study presents a comprehensive approach to improving machine learning models for event forecasting. By integrating a Task-Aligned Retrieval Module, an Enhanced Neural Article Reader, and a Human-Aligned Loss Function, we achieved significant accuracy improvements.

These results highlight the potential of our methodology to bridge the gap between theoretical applications and practical utility in real-world event forecasting. While promising, our findings also pave the way for further research and development in this evolving field.