Introduction

Time-series modeling has seen vast improvements in recent years, owing both to the introduction of powerful new deep learning architectures and to the ever-increasing volume of available training data, in particular in the financial sector. At Borealis AI, we are interested in improving the modeling of real-world time-series data. In capital markets, applications include asset pricing, volatility estimation, market regime classification, and many more. We also consider empirical issues such as structured dependencies, expressive uncertainty, and non-stationary dynamics. At its core, this quantitative view of market microstructure relies on robust and generalizable representations.

Despite the field’s impressive evolution, dealing with large volumes of data often precludes access to labels and highlights the need for effective self-supervised learning strategies, which have already achieved impressive success in computer vision (e.g., MoCo [1]) and natural language processing (e.g., BERT [2]). While forecasting is inherently self-supervised and can therefore reap some of the benefits of these approaches, namely the ability to utilize unlabelled data, it also has limitations, which motivates the need for alternative self-supervised methods such as contrastive learning. Three main arguments motivate the use of contrastive approaches:

Firstly, contrastive learning may be more conducive to generalization. Learning representations by solving a task such as forecasting encourages a model to discard information that is not directly relevant to forecasting. As a consequence, using the same representation for another downstream task might not work as desired. Other fields, such as zero-shot learning, have already noted that contrastive objectives generalize better than other representation learning approaches, including reconstruction and prediction [3]. Contrastive approaches to time-series representation learning have the potential to help solve various challenges in time-series modeling, in particular ones that involve multiple downstream tasks, which is of vital importance for a better understanding of real-world financial data.


Figure 1: Overview of Time-Series Forecasting Methods. The X-axis shows different backbones, the Y-axis different time-series (TS) treatments. Various methods operate at the intersection of backbone model and TS treatments. We show DeepAR [4], Deep State Space [5], ESRNN [6], TS2Vec [7], CoST [8], Autoformer [9], Informer [10], and ScaleFormer [11].

Secondly, contrastive approaches offer greater inherent flexibility than the alternatives. As we will see in later sections, contrastive approaches like InfoNCE can be modulated to give more emphasis to strong/weak negative samples, instance or temporal contrast, etc., depending on the needs of the downstream task. Being able to modulate an approach, and having access to a large body of literature on the benefits and drawbacks of different variations, is a significant advantage in practical applications and can lead to increased robustness with respect to the often challenging properties of financial time-series, such as non-stationarity.

Finally, contrastive approaches tend to empirically outperform their more traditional counterparts, in particular when large volumes of data are available. This is already well established in the computer vision domain, where contrastive approaches derived from InfoNCE dominate representation learning, and similar evidence has started to emerge for time-series downstream tasks such as classification, forecasting, and anomaly detection; and the field is still quite new!

Background on Time-Series Forecasting

Time-series forecasting plays an important role in many domains, including weather forecasting [12], inventory planning [13], astronomy [14], and economic and financial forecasting [15]. One particular challenge of time-series data is the need to capture seasonal trends [16]. The literature on time-series architectures is vast, and a comprehensive review is beyond the scope of this article; see Fig. 1 for a (high-level) overview of the methods discussed below.

Early approaches, such as ARIMA [17] and exponential smoothing [18], were followed by the introduction of neural network-based approaches involving either variants of Recurrent Neural Networks (RNNs) [4,5] or Temporal Convolutional Networks (TCNs) [19]. Time-series Transformers leverage self-attention to learn complex patterns and dynamics from time-series data [20,21]. Binh and Matteson [22] propose a probabilistic, non-autoregressive transformer-based model that integrates state-space models. The originally quadratic complexity in time and memory was subsequently lowered to $O(L \log L)$ by enforcing sparsity in the attention mechanism, e.g., with the ProbSparse attention of the Informer model [10] and the LogSparse attention mechanism [23]. While these attention mechanisms operate on a point-wise basis, Autoformer [9] uses a cross-correlation-based attention mechanism with trend/cycle decomposition, which results in improved performance. ScaleFormer [11] iteratively refines a forecasted time-series at multiple scales with shared weights, architecture adaptations, and a specially-designed normalization scheme, achieving state-of-the-art performance.

Recently, contrastive learning has been applied to time-series forecasting: TS2Vec [7] employs contrastive learning hierarchically over augmented context views to obtain a robust contextual representation for each timestamp, while CoST [8] uses contrastive learning to learn disentangled seasonal-trend representations for time-series forecasting.

Main Ingredients of Contrastive Learning

General formulation of a contrastive objective

Contrastive learning relies on encouraging the representation of a given example to be close to that of positive pairs while pushing it away from that of negative pairs. The distance used to define the notion of “closeness” is somewhat arbitrary; cosine similarity is a common choice. As we will see in the rest of this blog, there is a lot of leeway in how negative pairs are defined. The InfoNCE loss constitutes the basis of almost all recent methods, in particular in the time-series domain, and variants of it are very commonly used in computer vision and natural language processing. Its standard formulation is:

\begin{align*}
    \mathcal{L} = - \sum_{b} \log \frac{\exp(\mathbf{z}_b \cdot \mathbf{z}^+_b/\tau)}{\exp(\mathbf{z}_b \cdot \mathbf{z}^+_b/\tau) + \sum_{n \in \mathcal{N}_b}\exp(\mathbf{z}_b \cdot \mathbf{z}^-_n / \tau)},
\end{align*}

where $\mathbf{z}_b$ is the representation of example $b$, $\mathbf{z}^+_b$ is a positive sample (obtained from the example by a set of small random transformations, such as scaling), $\tau$ is a temperature parameter, and $\mathcal{N}_b$ indexes the negative samples.

The formulation above is very flexible; for instance, one can modify the definition of negative pairs (e.g., to perform contrastive learning in the temporal or instance domain), enhance negative samples to create stronger negatives, adapt the temperature parameter, or change the learning objective itself as exemplified by some of the methods presented in the next section.
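To make the formulation concrete, here is a minimal PyTorch sketch of the InfoNCE loss above (the function name and tensor shapes are illustrative, not taken from any particular library); it assumes the representations are already L2-normalized so that dot products correspond to cosine similarities.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z, z_pos, z_neg, temperature=0.1):
    """Minimal InfoNCE loss sketch.

    z:     (B, D)    anchor representations
    z_pos: (B, D)    positive representations (one per anchor)
    z_neg: (B, N, D) negative representations (N per anchor)
    All inputs are assumed to be L2-normalized, so dot products
    correspond to cosine similarities.
    """
    # Similarity between each anchor and its positive: (B, 1)
    pos_sim = torch.sum(z * z_pos, dim=-1, keepdim=True) / temperature
    # Similarity between each anchor and its negatives: (B, N)
    neg_sim = torch.einsum("bd,bnd->bn", z, z_neg) / temperature
    # Put the positive at index 0; cross-entropy with label 0 then
    # computes -log softmax at the positive, i.e. the loss above.
    logits = torch.cat([pos_sim, neg_sim], dim=1)
    labels = torch.zeros(z.size(0), dtype=torch.long, device=z.device)
    return F.cross_entropy(logits, labels)
```

Expressed this way, the loss is simply a cross-entropy where the positive always sits at index 0, which is why implementations typically reuse standard cross-entropy primitives.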

Varieties of Contrastive Learning Objectives

In the previous section, we detailed a “standard” learning objective: InfoNCE. A number of alternatives to this objective exist, which we mention for the sake of completeness. An earlier contrastive loss formulation [24] learns a function that maps an input into a low-dimensional target space such that the norm in the target space approximates the “semantic” distance in the input space. The triplet loss [25] is similar to the modern contrastive loss in the sense that it minimizes the distance between similar samples and maximizes the distance between dissimilar ones; however, for the triplet loss, a positive and a negative sample are simultaneously taken as input, together with the anchor sample, to compute the loss. Noise Contrastive Estimation (NCE) [26] uses logistic regression to discriminate the target data from noise.
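For comparison with the InfoNCE sketch above, a margin-based triplet loss can be sketched as follows (a minimal illustration; PyTorch also provides `nn.TripletMarginLoss` for this purpose):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Margin-based triplet loss over a batch of (B, D) embeddings.

    Pulls each anchor towards its positive and pushes it away from its
    negative until the two are separated by at least `margin`.
    """
    d_pos = F.pairwise_distance(anchor, positive)  # (B,)
    d_neg = F.pairwise_distance(anchor, negative)  # (B,)
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()
```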

Most of the more recent contrastive objectives, however, arose as variations of InfoNCE. MoCo [1], originally designed for learning image representations but later applied to time-series, trains a visual representation encoder by matching an encoded query $q$ to a dictionary of encoded keys using a contrastive loss. CoST [8] applies the MoCo variant of contrastive learning to time-series forecasting, making use of a momentum encoder to obtain representations of the positive pair and a dynamic dictionary with a queue to obtain negative pairs. SimCLR [27] shows that the composition of multiple data augmentation operations is crucial in defining the contrastive prediction tasks that yield effective representations, and that a substantially larger batch can provide more negative pairs. MoCo v2 [28] improves on MoCo by replacing its 1-layer fully-connected projection head with a 2-layer MLP head with ReLU for the unsupervised training stage; further modifications include blur augmentation and a cosine learning rate schedule.
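The sketch below illustrates, under simplifying assumptions, the two MoCo-style ingredients mentioned above: the key (momentum) encoder is kept as an exponential moving average of the query encoder, and freshly encoded keys are pushed into a fixed-size FIFO queue that serves as the pool of negatives. The function names and the `max_size` value are illustrative; this is not the reference MoCo or CoST implementation.

```python
import torch

@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m=0.999):
    """EMA update: the key encoder slowly tracks the query encoder."""
    for q_param, k_param in zip(query_encoder.parameters(),
                                key_encoder.parameters()):
        k_param.data.mul_(m).add_(q_param.data, alpha=1.0 - m)

@torch.no_grad()
def enqueue(queue, keys, max_size=4096):
    """Append newly encoded keys to the negative queue (FIFO)."""
    queue = torch.cat([queue, keys], dim=0)
    return queue[-max_size:]  # keep only the most recent max_size keys
```

Decoupling the number of negatives from the batch size in this way is what allows MoCo-style methods to use large pools of negatives without prohibitively large batches.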

What constitutes a positive/negative sample?

Contrastive learning approaches rely on defining positive and negative pairs that constitute good priors for the type of information the representation should encode. In the image domain, there are relatively few ways to define relevant positive and negative pairs: the most common approach by far treats augmented versions of an input image as positive pairs, and augmented versions of a different image from the dataset as negative pairs. In the time-series domain, however, there are more possibilities.

Instance contrast, where positive pairs are representations from the same example at different time steps and negative pairs are extracted from different examples in a mini-batch, is a natural idea that has been incorporated frequently in recent works. An alternative known as temporal contrast constructs negative pairs from representations of the time-series at distant time steps. Additionally, recent works [8] also consider a contrastive loss in the frequency domain. The different possibilities currently used are listed below and summarized in Figure 2.


Figure 2: Different Approaches to Contrastive Learning for Time-series.

Instance Contrastive Loss. When considering instance contrastive losses, the negative samples are taken from the other examples in the batch. To avoid the constraint imposed by the batch size, methods such as Momentum Contrast (MoCo) maintain a queue of encoded negative samples.
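A minimal sketch of such an instance contrastive loss is shown below, following the description above: the positive comes from the same series at a different time step, and the other series in the mini-batch act as negatives. Tensor shapes and names are illustrative assumptions, and a MoCo-style queue could replace the in-batch negatives.

```python
import torch
import torch.nn.functional as F

def instance_contrastive_loss(z, t_anchor, t_pos, temperature=0.1):
    """Instance contrast for time-series representations.

    z: (B, T, D) per-timestamp representations of a mini-batch of B
    series. For each series, the representation at `t_anchor` is the
    anchor, the same series at `t_pos` is the positive, and the other
    B - 1 series at `t_pos` are the negatives.
    """
    B, T, D = z.shape
    anchors = F.normalize(z[:, t_anchor], dim=-1)  # (B, D)
    keys = F.normalize(z[:, t_pos], dim=-1)        # (B, D)
    logits = anchors @ keys.t() / temperature      # (B, B)
    labels = torch.arange(B, device=z.device)      # positives on the diagonal
    return F.cross_entropy(logits, labels)
```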

Temporal Contrast. Since time-series encoders yield a representation for each time step in the original series, another possibility is to draw negative samples from the set of representations of the input example at time steps sufficiently far away. This encourages the representation to retain time-specific information.
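The sketch below shows one way such temporal negatives could be sampled; the resulting anchor/positive/negative triplets can then be fed to the `info_nce_loss` helper sketched earlier in this post. Thresholds such as `min_gap` and `num_neg` are illustrative choices, not values from any particular paper.

```python
import torch

def temporal_negatives(z, anchor_t, min_gap=50, num_neg=16):
    """Sample per-series negatives from timestamps far from the anchor.

    z: (B, T, D) per-timestamp representations. Returns a
    (B, num_neg, D) tensor of representations taken at least `min_gap`
    steps away from `anchor_t` (same series, distant in time).
    Assumes T is large enough that such timestamps exist.
    """
    B, T, D = z.shape
    candidates = torch.arange(T, device=z.device)
    candidates = candidates[(candidates - anchor_t).abs() >= min_gap]
    idx = candidates[torch.randint(len(candidates), (num_neg,), device=z.device)]
    return z[:, idx]

# Anchor and positive at nearby timestamps, negatives far away in time:
# loss = info_nce_loss(z[:, t], z[:, t + 1], temporal_negatives(z, t))
```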

Contrastive Learning in the Frequency Domain. More recently, analogous losses were introduced in frequency space by CoST, which first maps the encoded time-series into the frequency domain using an FFT and then computes separate losses by mapping the complex-valued result to its real-valued phase and amplitude. These losses add a strong inductive bias to the learned representation, encouraging it to preserve seasonal information.
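A minimal sketch of this frequency-domain mapping is given below (it only illustrates the transform; CoST's actual loss computation differs in its details): the per-timestamp representations are transformed with a real FFT along the time axis, and the complex result is split into real-valued amplitude and phase, on which contrastive losses can then be computed.

```python
import torch

def amplitude_and_phase(z):
    """Map per-timestamp representations to the frequency domain.

    z: (B, T, D) time-domain representations. The FFT is taken along
    the time axis; the complex result is split into real-valued
    amplitude and phase spectra.
    """
    freq = torch.fft.rfft(z, dim=1)   # (B, T//2 + 1, D), complex-valued
    amplitude = freq.abs()            # magnitude spectrum
    phase = torch.angle(freq)         # phase spectrum
    return amplitude, phase
```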

Of course, there are many other possibilities, some of which could be the key to learning stronger representations!

Moving Forward

As we have seen, contrastive time-series representation learning is a relatively new field with a lot of potential for impactful research. At Borealis AI, we care deeply about better understanding real-world time-series, in particular in finance. If you would like to help us improve the performance and robustness of contrastive time-series representation learning, we would be happy to hear from you.

References