It is a common belief that training deep transformers from scratch requires large datasets. Consequently, for small datasets, people usually use shallow and simple additional layers on top of pre-trained models during fine-tuning. This work shows that this does not always need to be the case: with proper initialization and optimization, the benefits of very deep transformers can carry over to challenging tasks with small datasets, including Text-to-SQL semantic parsing and logical reading comprehension. In particular, we successfully train 48 layers of transformers, comprising 24 fine-tuned layers from pre-trained RoBERTa and 24 relation-aware layers trained from scratch. With fewer training steps and no task-specific pre-training, we obtain the state-of-the-art performance on the challenging cross-domain Text-to-SQL parsing benchmark Spider. We achieve this by deriving a novel Data-dependent Transformer Fixed-update initialization scheme (DT-Fixup), inspired by the prior T-Fixup work. Further error analysis shows that increasing depth can help improve generalization on small datasets for hard cases that require reasoning and structural understanding.
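The sketch below illustrates the overall recipe described in the abstract: a pre-trained 24-layer RoBERTa encoder with additional transformer layers stacked on top and trained from scratch, whose freshly initialized weights are shrunk by a small factor before training, in the spirit of DT-Fixup. This is a minimal PyTorch illustration under stated assumptions, not the authors' released code: a plain nn.TransformerEncoderLayer stands in for the paper's relation-aware layers, and dt_fixup_scale is a hypothetical constant, whereas the paper derives the actual factor from the depth of the new stack and the magnitude of the input representations.

# Minimal sketch (not the authors' implementation) of stacking transformer layers
# trained from scratch on top of pre-trained RoBERTa, with a DT-Fixup-style rescaling
# of the newly initialized weights. The exact data-dependent factor is derived in the
# paper; here `dt_fixup_scale` is just a placeholder constant.
import torch
import torch.nn as nn
from transformers import RobertaModel


class DeepStackedEncoder(nn.Module):
    """24 pre-trained RoBERTa layers + 24 extra transformer layers trained from scratch."""

    def __init__(self, num_extra_layers: int = 24, dt_fixup_scale: float = 0.1):
        super().__init__()
        self.roberta = RobertaModel.from_pretrained("roberta-large")  # 24 layers, hidden=1024
        hidden = self.roberta.config.hidden_size

        # Plain transformer encoder layers stand in for the paper's relation-aware layers.
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=8, dim_feedforward=4 * hidden, batch_first=True
        )
        self.extra = nn.TransformerEncoder(layer, num_layers=num_extra_layers)

        # DT-Fixup-style step: after standard initialization, shrink the weight matrices
        # of the new layers so the very deep stack starts training stably. Biases and
        # LayerNorm parameters (1-D) are left untouched, as are the RoBERTa weights.
        with torch.no_grad():
            for param in self.extra.parameters():
                if param.dim() > 1:
                    param.mul_(dt_fixup_scale)

    def forward(self, input_ids, attention_mask):
        hidden_states = self.roberta(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # True marks padding positions to be ignored by the extra layers.
        return self.extra(hidden_states, src_key_padding_mask=attention_mask == 0)

Per the abstract, the point of such a scaled initialization is that the resulting 48-layer stack can be trained on a small dataset without task-specific pre-training; see the paper for the derivation of the actual data-dependent factor.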
BibTeX
@inproceedings{xu-etal-2021-optimizing,
    title = "Optimizing Deeper Transformers on Small Datasets",
    author = "Xu, Peng and
      Kumar, Dhruv and
      Yang, Wei and
      Zi, Wenjie and
      Tang, Keyi and
      Huang, Chenyang and
      Cheung, Jackie Chi Kit and
      Prince, Simon J.D. and
      Cao, Yanshuai",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.163",
    doi = "10.18653/v1/2021.acl-long.163",
    pages = "2089--2102",
    abstract = "It is a common belief that training deep transformers from scratch requires large datasets. Consequently, for small datasets, people usually use shallow and simple additional layers on top of pre-trained models during fine-tuning. This work shows that this does not always need to be the case: with proper initialization and optimization, the benefits of very deep transformers can carry over to challenging tasks with small datasets, including Text-to-SQL semantic parsing and logical reading comprehension. In particular, we successfully train 48 layers of transformers, comprising 24 fine-tuned layers from pre-trained RoBERTa and 24 relation-aware layers trained from scratch. With fewer training steps and no task-specific pre-training, we obtain the state-of-the-art performance on the challenging cross-domain Text-to-SQL parsing benchmark Spider. We achieve this by deriving a novel Data-dependent Transformer Fixed-update initialization scheme (DT-Fixup), inspired by the prior T-Fixup work. Further error analysis shows that increasing depth can help improve generalization on small datasets for hard cases that require reasoning and structural understanding.",
}
Related Research
- Adversarial Robustness with AdverTorch Black-box Estimators. M. Safa and G. Castiglione. (Computer Vision; Natural Language Processing; Responsible AI)
- Inside our Top 5 Responsible AI Papers from ACL 2022. D. Silverberg.
- Our NeurIPS 2021 Reading List. Y. Cao, K. Y. C. Lui, T. Durand, J. He, P. Xu, N. Mehrasa, A. Radovic, A. Lehrmann, R. Deng, A. Abdi, M. Schlegel, and S. Liu. (Computer Vision; Data Visualization; Graph Representation Learning; Learning and Generalization; Natural Language Processing; Optimization; Reinforcement Learning; Time Series Modelling; Unsupervised Learning)