Abstract
Retrosynthesis is the task of determining the precursor molecules from which a target molecule can be synthesized. As previous work has shown, deep learning techniques such as Transformer networks achieve good results on this task: retrosynthesis is cast as a machine translation problem in which the Transformer predicts the precursor molecules given a string representation of the target molecule. Previous research has focused on training on a single machine; in this article we investigate the effect of scaling the training of Transformer networks for the retrosynthesis task to supercomputers. We examine the issues that arise when scaling Transformers to multiple machines, such as learning rate scheduling and the choice of optimizer, and present strategies that improve on previously reported results. By training on multiple machines we increase the top-1 accuracy by \(2.5\%\), to \(43.6\%\). In an attempt to improve results further, we experiment with increasing the number of parameters in the Transformer network, but find that the larger models are prone to overfitting, which we attribute to the small dataset used for training. On these runs we achieve a scaling efficiency of nearly \(70\%\).
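The abstract names learning rate scheduling as a key issue when scaling Transformer training across workers, and reports a scaling efficiency of nearly \(70\%\). A minimal sketch of the two ideas involved, assuming the standard Transformer warmup-then-decay schedule and the common linear scaling rule for large effective batch sizes (the paper's exact schedule and constants are not given here, so the values below are illustrative):

```python
def noam_lr(step: int, d_model: int = 512, warmup: int = 4000) -> float:
    """Standard Transformer schedule: linear warmup for `warmup` steps,
    then inverse-square-root decay."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

def scaled_lr(step: int, n_workers: int, **kwargs) -> float:
    """Linear scaling rule: multiply the base rate by the worker count,
    since the effective batch size grows proportionally with workers."""
    return n_workers * noam_lr(step, **kwargs)

def scaling_efficiency(t_single: float, t_multi: float, n_workers: int) -> float:
    """Fraction of ideal linear speedup achieved:
    (single-worker time / multi-worker time) / worker count."""
    return (t_single / t_multi) / n_workers
```

For example, a job that runs 44.8 times faster on 64 workers has a scaling efficiency of 0.70, matching the figure reported in the abstract.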
Acknowledgment
We thank the anonymous referees for their constructive comments, which helped to improve the paper. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under Grant Agreement No. 814416. We would also like to thank Intel for providing us with the resources to run on the Endeavour Supercomputer.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Mollinga, J., Codreanu, V. (2022). Scaling Out Transformer Models for Retrosynthesis on Supercomputers. In: Arai, K. (eds) Intelligent Computing. Lecture Notes in Networks and Systems, vol 283. Springer, Cham. https://doi.org/10.1007/978-3-030-80119-9_4
DOI: https://doi.org/10.1007/978-3-030-80119-9_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-80118-2
Online ISBN: 978-3-030-80119-9
eBook Packages: Intelligent Technologies and Robotics (R0)