Single-task learning (STL) models are the most traditional approach in machine learning and have been extremely successful in many applications. This approach assumes that the model is required to output a single prediction target for a given input sample, such as a class label or a regression value. If two output targets are associated with the same input data, then two independent models are trained: one for each target or task. STL may be suitable for situations in which the tasks are very different from each other and in which computational efficiency may be ignored. However, when the tasks are related, STL models are parameter inefficient [18,25]. In addition, in some applications, the synergy among tasks can help a jointly trained model better capture shared patterns that would otherwise be missed by independent training.

Figure 1. Neural Network Architectures.(a) Single-Task Learning, (b) Multi-task learning Hard-parameter sharing [1], (c) Multi-gate Mixture-of-Experts (MMoE)[2], (d) MMoEEx — proposed method.

## Mixture of Experts for MTL

Hard-parameter sharing networks [1], shown in Figure 1.b are one of the pillars of multi-task learning. These networks are composed of a shared bottom and task-specific branches. Ma et al. [2] suggested that a unique shared bottom might not be enough to generalize for all tasks in an application, and proposed to use several shared bottoms, or what they call experts. The experts are combined using gate functions, and their combination is forwarded to the towers. The final architecture is called Multi-gate Mixture-of-Experts(MMoE), and is shown in Figure 1.c. MMoE generalizes better than its traditional hard-parameter sharing counterpart, but there are two weaknesses: first, it lacks a task-balancing mechanism; second, the only source of diversity among the experts is due to the random initialization. Although the experts can indeed be diverse enough if they specialize in different tasks, there are no guarantees that this will happen in practice. We propose the Multi-gate Mixture-of-Experts with Exclusivity (MMoEEx) (Figure 1.d), a model that induces more diversity among the experts and has a task-balancing component.

## Multi-gate Mixture-of-Experts with Exclusivity

Multi-gate Mixture-of-Experts with Exclusivity (MMoEEx) takes its inspiration from ensemble learning, where diversity among their learners tend to generalize better. MMoEEx can be divided in three parts: gates, experts and towers. Considering an application with $K$ tasks, input data $x \in \mathbb{R}^d$, the gate function $g^k()$ is defined as:

\label{eq:g}
g^k(x) = \text{softmax}(W^k x), \forall k \in \{0,…,K\} \tag{1}

where $W^k \in \mathbb{R}^{E \times d}$ are learnable weights and $E$ is the number of experts, defined by the user. The gates control the contribution of each expert to each task.

The experts $f_e(),\forall e\in \{0,…,E\}$, and our implementation is very flexible to accept several experts architectures, which is essential to work with applications with different data types. For example, if working with temporal data, the experts can be LSTMs, GRUs, RNNs; for non-temporal data, the experts can be dense layers. The number of experts $E$ is defined by the user. The experts and gates’ outputs are combined as follows:

$$\label{eq:f}f^k(x) = \sum_{e=0}^Eg^k(x)f_e(x), \forall k \in \{0,…,K\} \tag{2}$$

The $f^k()$ are input to the towers, the task-specific part of the architecture. Their design depends on the data type and tasks. The towers $h^k$ output the task predictions as follows:

y^k = h^k(f^k(x)), \forall k \in \{0,…,K \} \tag{3}

Previous Mixture of Experts models like [2] leverage several experts to make their final predictions; however, they rely on indirect approaches, such as random initialization, to foster diversity among the experts, and on the expectation that the gate function will learn how to combine these experts. Here we propose a mechanism to induce diversity among the experts, defined as $\textit{exclusivity}$.

Exclusivity: We set $\alpha E$ experts to be exclusively connected to one task. The value $\alpha\in[0,1]$ controls the proportion of experts that will be $\textit{exclusive}$. If $\alpha=1$, all experts are exclusive, and if $\alpha=0$, all experts are shared (same as MMoE). An exclusive expert is randomly assigned to one of the tasks $T_k$, but the task $T_k$ can still be associated with other exclusive experts and shared experts. MMoEEx, similarly to MMoE, relies on the expectation that gate functions will learn how to combine the experts. Our approach induces more diversity by forcing some of these gates to be ‘closed’ to some experts, and the exclusivity mechanism is used to close part of the gates. The remaining non-closed gates learn to combine the output of each expert based on the input data, according to Equation 1.

## MAML – MTL optimization

Figure 2. MAML-MTL algorithm.

## Heterogenous MTL Experiments

The Medical Information Mart for Intensive Care (MIMIC-III) database was proposed by [5] to be a benchmark dataset for MTL in time-series data. It contains metrics of patients from over 40,000 intensive care units (ICU) stays. This dataset has 4 tasks: two binary tasks, one temporal multi-label task, and one temporal classification. Figure 3 shows the neural network adopted in our work and where each task is calculated.

Figure 3. The input data $x_t$ has 76 features, and the size of the hidden layer $h_t$ depends on the model adopted. There are four tasks: the decompensation $d_t$ and LOS $l_t$ calculated at each time step, mortality $m_{48}$, and the phenotype $p_T$, both calculated only once per patient.

The full set of results for MIMIC-III dataset is presented in Table 1. We compared our approach with the multitask channel wise LSTM (MCW-LSTM) [6], single task trained network, shared bottom, MMoE [2] and MMoEEx.

MMoEEx outperforms all the compared approaches except on the Phenotype (Pheno) task. For both time series tasks (LOS and Decomp) our approach outperforms all baselines. It is worth noting that for the LOS task, which is the hardest task on MIMIC-III, we present a relative improvement superior to $40$ percentage points when compared to multitask channel wise LSTM [6] and over $16$ percentage points to MMoE.

#### Diversity Score Experiments

We measured how diverse MMoEEx experts are compared to traditional MMoE.The diversity among experts can be scored by the distance between the experts’ outputs $f_e, \forall e\in\{0,…, E\}$. Considering a pair of experts $i$ and $j$, the distance between them is defined as:

$$d_{i,j} = \sqrt{\sum_{n=0}^N(f_i(x_n)-f_j(x_n))^2} \tag{4}$$

where $N$ is the number of samples in the dataset, $d_{i,j} = d_{j,i}$, and a matrix $D \in \mathbb{R}^{E\times E}$ is used to keep all the distances. To scale the distances into $d_{i,j}\in [0,1]$, we divide the raw entries in the distance matrix $D$ by the maximum distance observed, $\text{max}(D)$. A pair of experts $i,j$ with $d_{i,j} = 0$ are considered identical, and experts distances $d_{i,j}$ close to 0 are considered very similar; analogously, experts with $d_{i,j}$ close to 1 are considered very dissimilar. To compare the overall distance between the experts of a model, we define the $\textit{diversity score}$ $\bar{d}$ as the mean entry in $D$.

Figure 4. MMoE ($\bar{d}=0.311$) and MMoEEx($\bar{d}=0.445$) heatmap in the MIMIC-III dataset. The MMoE has 12 shared experts $\textit{versus}$ 6 shared and 6 exclusive experts in the MMoEEx model. Darker colors indicate more dissimilarities between two experts and, therefore, more diversity.

We analyze the diversity score of the MMoE and MMoEEx experts on MIMIC-III. The MMoE and MMoEEx models compared have the same neural network structure, but the MMoEEx uses the MAML – MTL optimization and has the diversity enforced. The MMoEEx model in Figure 4 was created with $\alpha = 0.5$ and exclusivity. In other words, half of the experts in the MMoEEx model were randomly assigned to be exclusive to one of the tasks, while in the MMoE model all experts are shared among all tasks. Figure 4 shows a heatmap of the distances $D^{MMoE}$ and $D^{MMoEEx}$ calculated on the MIMIC-III testing set with 12 experts. MMoE’s heatmap has overall lighter colors, indicating smaller diversity scores, compared with MMoEEx. Quantitatively, MMoEEx produces a relative lift of $43\%$ in diversity score.

## Conclusions and Discussions

We presented a novel multi-task learning approach called Multi-gate Mixture-of-Experts with Exclusivity (MMoEEx), which extends previous methods by introducing an exclusivity mechanism that induces more diversity among experts, allowing the network to learn representations that are more effective for heterogeneous MTL. We also introduce a two-step optimization approach called MAML-MTL, which balances tasks at the gradient level and enhances MMoEEx’s capability to optimize imbalanced tasks.

MTL has achieve critical mass in multiple areas like natural language processing [7, 8, 9], computer vision [10, 11, 12], reinforcement learning [13, 14] and multi-modal learning [15, 16]. Standard soft/hard parameter sharing approaches are a well established technique to handle multiple tasks. While they show improvements over single task learning for tasks with similar characteristics, it is not fully explored how MTL can further improve heterogeneous task scenarios. Hybrid approaches like mixture of experts can mitigate several limitations of standard approaches and further extend its capabilities when coupled with specialized optimization methods. Optimization methods for MTL are in their infancy, and more research on meta-learning task balancing can greatly benefit MTL research. We hope this work inspires the community to further investigate multi-task learning at the network architecture and optimization levels.

[1]  Rich Caruana. Multitask learning: A knowledge-based source of inductive bias. In Proceedings of the Tenth International Conference on Machine Learning, 1993.

[2]  Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H Chi. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018.

[3]  Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017.

[4]  Sungjae Lee and Youngdoo Son. Multitask learning with single gradient step update for task balancing. arXiv preprint arXiv:2005.09910, 2020. 8

[5]  Hrayr Harutyunyan, Hrant Khachatrian, David C Kale, Greg Ver Steeg, and Aram Galstyan. Multitask learning and benchmarking with clinical time series data. Scientific data, 6(1):1–18, 2019.

[6]  Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-Wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iii, a freely accessible critical care database. Scientific data, 3(1):1–9, 2016.

[7]  Victor Sanh, Thomas Wolf, and Sebastian Ruder. A hierarchical multi-task approach for learning embeddings from semantic tasks. In AAAI Conference on Artificial Intelligence, volume 33, 2019.

[8]  Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. In Annual Meeting of the Association for Computational Linguistics, 2019.

[9]  Cagla Aksoy, Alper Ahmetoglu, and Tunga Gung ¨ or. Hierar- ¨ chical multitask learning approach for BERT. arXiv preprint arXiv:2011.04451, 2020.

[10]  Ximeng Sun, Rameswar Panda, Rogerio Feris, and Kate Saenko. AdaShare: Learning what to share for efficient deep multi-task learning. In Advances in Neural Information Processing Systems, 2020.

[11]  Shikun Liu, Edward Johns, and Andrew J Davison. End-to-end multi-task learning with attention. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[12]  Simon Vandenhende, Stamatios Georgoulis, and Luc Van Gool. MTI-Net: Multi-scale task interaction networks for multi-task learning. In European Conference on Computer Vision, 2020.

[13]  Lerrel Pinto and Abhinav Gupta. Learning to push by grasping: Using multiple tasks for effective learning. arXiv preprint arXiv:1609.09025, 2016. 9

[14]  Matteo Hessel, Hubert Soyer, Lasse Espeholt, Wojciech Czarnecki, Simon Schmitt, and Hado van Hasselt. Multi-task deep reinforcement learning with popart. Technical report, DeepMind, 2019.

[15]  Subhojeet Pramanik, Priyanka Agrawal, and Aman Hussain. OmniNet: A unified architecture for multi-modal multi-task learning. arXiv preprint arXiv:1907.07804, 2019.

[16]  Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 12-in-1: Multi-task vision and language representation learning. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2020.