Authors: G. Oliveira, F. Tung
Two overlapping data lines, showing a corresponding pattern on the left and right of the intersecting area. The background is a gradient of navy blue to royal blue, the data lines are white, and two data points mustard yellow and light green.

Single-task learning (STL) models are the most traditional approach in machine learning and have been extremely successful in many applications. This approach assumes that the model is required to output a single prediction target for a given input sample, such as a class label or a regression value. If two output targets are associated with the same input data, then two independent models are trained: one for each target or task. STL may be suitable for situations in which the tasks are very different from each other and in which computational efficiency may be ignored. However, when the tasks are related, STL models are parameter inefficient [18,25]. In addition, in some applications, the synergy among tasks can help a jointly trained model better capture shared patterns that would otherwise be missed by independent training. 

In contrast to STL, multi-task learning (MTL) optimizes a single model to perform multiple related tasks simultaneously, aiming to improve generalization and parameter efficiency across tasks. In this case, two or more output targets are associated with the same input data. Effective multi-tasking learning typically requires task balancing to prevent one or more tasks from dominating the optimization, to decrease negative transfer, and to avoid overfitting. Standard MTL settings usually assume a homogeneous set of tasks, for example all tasks are classification or regression tasks, and usually they are non-sequential data. This scenario can greatly benefit MTL approaches with strong shared representations. In contrast, heterogeneous multi-task learning is defined by multiple classes of tasks, such as classification, regression with single or multi label characteristics and temporal data, being optimized simultaneously. The latter setting is more realistic but lacks further exploration. In this post, we share a novel method that we recently developed for heterogeneous MTL.

A mathematical diagram showing 4 depictions of the model flow to tower A/B and Task A/B. Figure 1. Neural Network Architectures.(a) Single-Task Learning, (b) Multi-task learning Hard-parameter sharing [1], (c) Multi-gate Mixture-of-Experts (MMoE)[2], (d) MMoEEx -- proposed method.

Mixture of Experts for MTL

Hard-parameter sharing networks [1], shown in Figure 1.b are one of the pillars of multi-task learning. These networks are composed of a shared bottom and task-specific branches. Ma et al. [2] suggested that a unique shared bottom might not be enough to generalize for all tasks in an application, and proposed to use several shared bottoms, or what they call experts. The experts are combined using gate functions, and their combination is forwarded to the towers. The final architecture is called Multi-gate Mixture-of-Experts(MMoE), and is shown in Figure 1.c. MMoE generalizes better than its traditional hard-parameter sharing counterpart, but there are two weaknesses: first, it lacks a task-balancing mechanism; second, the only source of diversity among the experts is due to the random initialization. Although the experts can indeed be diverse enough if they specialize in different tasks, there are no guarantees that this will happen in practice. We propose the Multi-gate Mixture-of-Experts with Exclusivity (MMoEEx) (Figure 1.d), a model that induces more diversity among the experts and has a task-balancing component.

Multi-gate Mixture-of-Experts with Exclusivity

Multi-gate Mixture-of-Experts with Exclusivity (MMoEEx) takes its inspiration from ensemble learning, where diversity among their learners tend to generalize better. MMoEEx can be divided in three parts: gates, experts and towers. Considering an application with $K$ tasks, input data $x \in \mathbb{R}^d$, the gate function $g^k()$ is defined as:

 \begin{equation}
\label{eq:g}
g^k(x) = \text{softmax}(W^k x), \forall k \in \{0,...,K\} \tag{1}
\end{equation}

where $W^k \in \mathbb{R}^{E \times d}$ are learnable weights and $E$ is the number of experts, defined by the user. The gates control the contribution of each expert to each task.

The experts $f_e(),\forall e\in \{0,...,E\}$, and our implementation is very flexible to accept several experts architectures, which is essential to work with applications with different data types. For example, if working with temporal data, the experts can be LSTMs, GRUs, RNNs; for non-temporal data, the experts can be dense layers. The number of experts $E$ is defined by the user. The experts and gates' outputs are combined as follows:

\begin{equation}
\label{eq:f}
f^k(x) = \sum_{e=0}^Eg^k(x)f_e(x), \forall k \in \{0,...,K\} \tag{2}
\end{equation}

The $f^k()$ are input to the towers, the task-specific part of the architecture. Their design depends on the data type and tasks. The towers $h^k$ output the task predictions as follows:

 \begin{equation}
y^k = h^k(f^k(x)), \forall k \in \{0,...,K \} \tag{3}
\end{equation}

Previous Mixture of Experts models like [2] leverage several experts to make their final predictions; however, they rely on indirect approaches, such as random initialization, to foster diversity among the experts, and on the expectation that the gate function will learn how to combine these experts. Here we propose a mechanism to induce diversity among the experts, defined as $\textit{exclusivity}$.

Exclusivity: We set $\alpha E$ experts to be exclusively connected to one task. The value $\alpha\in[0,1]$ controls the proportion of experts that will be $\textit{exclusive}$. If $\alpha=1$, all experts are exclusive, and if $\alpha=0$, all experts are shared (same as MMoE). An exclusive expert is randomly assigned to one of the tasks $T_k$, but the task $T_k$ can still be associated with other exclusive experts and shared experts. 
MMoEEx, similarly to MMoE, relies on the expectation that gate functions will learn how to combine the experts. Our approach induces more diversity by forcing some of these gates to be 'closed' to some experts, and the exclusivity mechanism is used to close part of the gates. The remaining non-closed gates learn to combine the output of each expert based on the input data, according to Equation 1.

MAML - MTL optimization

Competing task optimization is another challenge of optimizing heterogenous tasks. The goal of the MAML-MTL optimization is to balance the tasks on the gradient level. Finn et al. [3] proposed the Model-agnostic Meta-learning (MAML), a two-step optimization approach originally intend to be used with transfer-learning and few-shot learning due to its fast convergence. Initial attempts to apply MAML to MTL show that MAML can balance the tasks on the gradient level and yield better results than some existing task-balancing approaches [4]. The core idea is that MAML's temporary update yields smoothed losses, which also smooth the gradients on direction and magnitude. However, differently from [4], we do not freeze task-specific layers during the intermediate/inner update. The MAML-MTL approach is shown in Figure 2. The approach consists of evaluating each task loss. After that each task loss is used to temporarily update the network which are re-evaluated and the task specific temporarily losses are aggregated to form the final loss which will provide the actual network update.

A mathematical diagram showing 6 depictions of the models 'in progress' flow. Figure 2. MAML-MTL algorithm.


Heterogenous MTL Experiments

The Medical Information Mart for Intensive Care (MIMIC-III) database was proposed by [5] to be a benchmark dataset for MTL in time-series data. It contains metrics of patients from over 40,000 intensive care units (ICU) stays. This dataset has 4 tasks: two binary tasks, one temporal multi-label task, and one temporal classification. Figure 3 shows the neural network adopted in our work and where each task is calculated. 

Five rows by 5 columns of colour blocks, the bottom two rows are grey x5, third are yellow x5, fourth red x5, and fifth has one green and one blue. Figure 3. The input data $x_t$ has 76 features, and the size of the hidden layer $h_t$ depends on the model adopted. There are four tasks: the decompensation $d_t$ and LOS $l_t$ calculated at each time step, mortality $m_{48}$, and the phenotype $p_T$, both calculated only once per patient.

The full set of results for MIMIC-III dataset is presented in Table 1. We compared our approach with the multitask channel wise LSTM (MCW-LSTM) [6], single task trained network, shared bottom, MMoE [2] and MMoEEx.

MMoEEx outperforms all the compared approaches except on the Phenotype (Pheno) task. For both time series tasks (LOS and Decomp) our approach outperforms all baselines. It is worth noting that for the LOS task, which is the hardest task on MIMIC-III, we present a relative improvement superior to $40$ percentage points when compared to multitask channel wise LSTM [6] and over $16$ percentage points to MMoE.

Method Pheno LOS Decomp Ihm $\Delta$
MCW-LSTM[6]  77.4 $45.0$ $90.5$ $87.0$ $+0.28%$
Single Task [6] $77.0$ $45.0$ $91.0$ $86.0$ -
Shared Bottom $73.36$ $30.60$ $94.12$ $82.71$ $-9.28%$
MMoE 75.09 $54.48$ $96.20$ $90.44$ $+7.36%$
MMoEEx $72.44$ 63.45 96.82 90.73 +11.74%

Table 1: Final results MIMIC-III. MMoEEx outperforms all the compared baselines with the exception to Pheno where the work from [6] is the best approach. Our approach can provide a relative improvement superior to $40$ percentage points when compared to the Multitask channel wise LSTM for the LOS task.

Diversity Score Experiments

We measured how diverse MMoEEx experts are compared to traditional MMoE.
The diversity among experts can be scored by the distance between the experts' outputs $f_e, \forall e\in\{0,..., E\}$. Considering a pair of experts $i$ and $j$, the distance between them is defined as:

\begin{equation}
    d_{i,j} = \sqrt{\sum_{n=0}^N(f_i(x_n)-f_j(x_n))^2} \tag{4}
\end{equation}

where $N$ is the number of samples in the dataset, $d_{i,j} = d_{j,i}$, and a matrix $D \in \mathbb{R}^{E\times E}$ is used to keep all the distances. To scale the distances into $d_{i,j}\in [0,1]$, we divide the raw entries in the distance matrix $D$ by the maximum distance observed, $\text{max}(D)$. A pair of experts $i,j$ with $d_{i,j} = 0$ are considered identical, and experts distances $d_{i,j}$ close to 0 are considered very similar; analogously, experts with $d_{i,j}$ close to 1 are considered very dissimilar. To compare the overall distance between the experts of a model, we define the $\textit{diversity score}$ $\bar{d}$ as the mean entry in $D$. 

Two comparative grid graphs with numbers in each grid box. A diagonal white section appears from top left to bottom right of the axis, which gradually shade into dark blue around the bottom and right edges of the axis. Lighter the colour lower the number. Figure 4. MMoE ($\bar{d}=0.311$) and MMoEEx($\bar{d}=0.445$) heatmap in the MIMIC-III dataset. The MMoE has 12 shared experts $\textit{versus}$ 6 shared and 6 exclusive experts in the MMoEEx model. Darker colors indicate more dissimilarities between two experts and, therefore, more diversity.

We analyze the diversity score of the MMoE and MMoEEx experts on MIMIC-III. The MMoE and MMoEEx models compared have the same neural network structure, but the MMoEEx uses the MAML - MTL optimization and has the diversity enforced. The MMoEEx model in Figure 4 was created with $\alpha = 0.5$ and exclusivity. In other words, half of the experts in the MMoEEx model were randomly assigned to be exclusive to one of the tasks, while in the MMoE model all experts are shared among all tasks. Figure 4 shows a heatmap of the distances $D^{MMoE}$ and $D^{MMoEEx}$ calculated on the MIMIC-III testing set with 12 experts. MMoE's heatmap has overall lighter colors, indicating smaller diversity scores, compared with MMoEEx. Quantitatively, MMoEEx produces a relative lift of $43\%$ in diversity score.

Conclusions and Discussions

We presented a novel multi-task learning approach called Multi-gate Mixture-of-Experts with Exclusivity (MMoEEx), which extends previous methods by introducing an exclusivity mechanism that induces more diversity among experts, allowing the network to learn representations that are more effective for heterogeneous MTL. We also introduce a two-step optimization approach called MAML-MTL, which balances tasks at the gradient level and enhances MMoEEx's capability to optimize imbalanced tasks.

MTL has achieve critical mass in multiple areas like natural language processing [7, 8, 9], computer vision [10, 11, 12], reinforcement learning [13, 14] and multi-modal learning [15, 16]. Standard soft/hard parameter sharing approaches are a well established technique to handle multiple tasks. While they show improvements over single task learning for tasks with similar characteristics, it is not fully explored how MTL can further improve heterogeneous task scenarios. Hybrid approaches like mixture of experts can mitigate several limitations of standard approaches and further extend its capabilities when coupled with specialized optimization methods. Optimization methods for MTL are in their infancy, and more research on meta-learning task balancing can greatly benefit MTL research. We hope this work inspires the community to further investigate multi-task learning at the network architecture and optimization levels.

 


[1]  Rich Caruana. Multitask learning: A knowledge-based source of inductive bias. In Proceedings of the Tenth International Conference on Machine Learning, 1993.

[2]  Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H Chi. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018.

[3]  Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017.

[4]  Sungjae Lee and Youngdoo Son. Multitask learning with single gradient step update for task balancing. arXiv preprint arXiv:2005.09910, 2020. 8

[5]  Hrayr Harutyunyan, Hrant Khachatrian, David C Kale, Greg Ver Steeg, and Aram Galstyan. Multitask learning and benchmarking with clinical time series data. Scientific data, 6(1):1–18, 2019.

[6]  Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-Wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iii, a freely accessible critical care database. Scientific data, 3(1):1–9, 2016.

[7]  Victor Sanh, Thomas Wolf, and Sebastian Ruder. A hierarchical multi-task approach for learning embeddings from semantic tasks. In AAAI Conference on Artificial Intelligence, volume 33, 2019.

[8]  Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. In Annual Meeting of the Association for Computational Linguistics, 2019.

[9]  Cagla Aksoy, Alper Ahmetoglu, and Tunga Gung ¨ or. Hierar- ¨ chical multitask learning approach for BERT. arXiv preprint arXiv:2011.04451, 2020.

[10]  Ximeng Sun, Rameswar Panda, Rogerio Feris, and Kate Saenko. AdaShare: Learning what to share for efficient deep multi-task learning. In Advances in Neural Information Processing Systems, 2020.

[11]  Shikun Liu, Edward Johns, and Andrew J Davison. End-to-end multi-task learning with attention. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[12]  Simon Vandenhende, Stamatios Georgoulis, and Luc Van Gool. MTI-Net: Multi-scale task interaction networks for multi-task learning. In European Conference on Computer Vision, 2020.

[13]  Lerrel Pinto and Abhinav Gupta. Learning to push by grasping: Using multiple tasks for effective learning. arXiv preprint arXiv:1609.09025, 2016. 9

[14]  Matteo Hessel, Hubert Soyer, Lasse Espeholt, Wojciech Czarnecki, Simon Schmitt, and Hado van Hasselt. Multi-task deep reinforcement learning with popart. Technical report, DeepMind, 2019.

[15]  Subhojeet Pramanik, Priyanka Agrawal, and Aman Hussain. OmniNet: A unified architecture for multi-modal multi-task learning. arXiv preprint arXiv:1907.07804, 2019.

[16]  Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 12-in-1: Multi-task vision and language representation learning. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2020.

Authors