Abstract
In this paper we present a novel approach to emotion recognition from speech. The core idea is that not all parts of an utterance convey emotional information, so we propose to separate each utterance into emotional and neutral parts, cleaning up the database and making its labels less ambiguous. We first estimate embeddings of short speech intervals using a speaker recognition convolutional neural network trained on the VoxCeleb2 dataset with the triplet loss. Sequences of these features are then processed by a recurrent neural network that predicts an emotion label for the whole utterance. Training proceeds in two stages: first we train a model to recognize neutral frames within an utterance; then we separate the corpus into emotional and neutral parts and train an improved model on the filtered data. Experiments on the IEMOCAP corpus show that the final model achieves 66% unweighted accuracy (UA) on four emotions, outperforming known approaches such as out-of-the-box Connectionist Temporal Classification (CTC) and local attention by more than 4%.
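To make the pipeline concrete, below is a minimal PyTorch sketch of the two-stage scheme described above. It is an illustration under stated assumptions, not the authors' implementation: the VoxCeleb2-trained speaker CNN is assumed to be available pretrained (random tensors stand in for its per-frame embeddings), and the layer sizes, the linear neutral-frame detector, and the 0.5 threshold are all hypothetical choices.

import torch
import torch.nn as nn

class EmotionRNN(nn.Module):
    """GRU over a sequence of frame embeddings -> one emotion label per utterance."""
    def __init__(self, embed_dim=512, hidden_dim=128, n_emotions=4):
        super().__init__()
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_emotions)

    def forward(self, frames):                  # frames: (batch, time, embed_dim)
        _, h_n = self.rnn(frames)               # h_n: (1, batch, hidden_dim)
        return self.head(h_n[-1])               # logits: (batch, n_emotions)

def drop_neutral_frames(frames, neutral_detector, threshold=0.5):
    """Stage-1 to stage-2 hand-off: remove frames classified as neutral,
    keeping only the emotional part of a single utterance."""
    with torch.no_grad():
        p_neutral = torch.sigmoid(neutral_detector(frames)).squeeze(-1)
    kept = frames[p_neutral < threshold]        # (kept_time, embed_dim)
    return kept if len(kept) > 0 else frames    # fall back if everything looks neutral

# Usage sketch: random tensors stand in for the speaker CNN's per-frame outputs.
neutral_detector = nn.Linear(512, 1)            # stage-1 frame classifier (assumed form)
model = EmotionRNN()
embeddings = torch.randn(200, 512)              # 200 frame embeddings of one utterance
emotional = drop_neutral_frames(embeddings, neutral_detector)
logits = model(emotional.unsqueeze(0))          # (1, 4) scores over four emotions

The speaker network itself would be trained separately with a triplet objective (PyTorch provides nn.TripletMarginLoss for this); that stage is omitted from the sketch.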
References
Pitaloka, D.A., Wulandari, A., Basaruddin, T., Liliana, D.Y.: Enhancing CNN with preprocessing stage in automatic emotion recognition. Procedia Comput. Sci. 116, 523–529 (2017)
Bitouk, D., Verma, R., Nenkova, A.: Class-level spectral features for emotion recognition. Speech Commun. 52(7–8), 613–625 (2010)
Busso, C., Bulut, M., Lee, C.C., Kazemzadeh, A., Mower, E., Kim, S., Chang, J.N., Lee, S., Narayanan, S.S.: IEMOCAP: interactive emotional dyadic motion capture database. Lang. Resour. Eval. 42(4), 335–359 (2008)
Chernykh, V., Sterling, G., Prihodko, P.: Emotion recognition from speech with recurrent neural networks (2017)
Chung, J.S., Nagrani, A., Zisserman, A.: VoxCeleb2: deep speaker recognition. In: Proceedings of INTERSPEECH (2018)
Ghosh, S., Laksana, E., Morency, L.P., Scherer, S.: Learning representations of affect from speech, pp. 1–10 (2015)
Graves, A., Fernández, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the International Conference on Machine Learning, ICML, pp. 369–376 (2006)
Griffin, D., Lim, J.: Signal estimation from modified short-time Fourier transform. IEEE Trans. Acoust. Speech Signal Process. 32(2), 236–243 (1984)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR, abs/1512.03385 (2015)
Hoffer, E., Ailon, N.: Deep metric learning using triplet network. Lecture Notes in Computer Science, vol. 9370, pp. 84–92 (2015)
Lee, J., Tashev, I.: High-level feature representation using recurrent neural network for speech emotion recognition. In: Proceedings of INTERSPEECH, pp. 1537–1540 (2015)
Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323, 533–536 (1986)
Satt, A., Rozenberg, S., Hoory, R.: Efficient emotion recognition from speech using deep learning on spectrograms. In: Proceedings of INTERSPEECH 2017, pp. 1089–1093 (2017)
Schuller, B., Rigoll, G.: Timing levels in segment-based speech emotion recognition. In: Proceedings of INTERSPEECH 2006, International Conference on Spoken Language Processing (ICSLP), pp. 1818–1821 (2006)
Mirsamadi, S., Barsoum, E., Zhang, C.: Automatic speech emotion recognition using recurrent neural networks with local attention. In: Proceedings of the 42nd IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2017, pp. 2227–2231 (2017)
Tripathi, S., Beigi, H.: Multi-modal emotion recognition on IEMOCAP dataset using deep learning (2018)
Wang, Z.-Q., Tashev, I.: Learning utterance-level representations for speech emotion and age/gender recognition using deep neural networks. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. 5150–5154 (2017)
Xia, R., Liu, Y.: A multi-task learning framework for emotion recognition using 2D continuous space. IEEE Trans. Affect. Comput. 8(1), 3–14 (2017)
Zhang, C., Koishida, K.: End-to-end text-independent speaker verification with triplet loss on short utterances. In: Proceedings of INTERSPEECH 2017, pp. 1487–1491 (2017)
Acknowledgements
This work is part of the Emotion Recognition Project at the Neurodata Lab company.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Sterling, G., Kazimirova, E. (2020). End-to-End Emotion Recognition From Speech With Deep Frame Embeddings And Neutral Speech Handling. In: Arai, K., Bhatia, R. (eds) Advances in Information and Communication. FICC 2019. Lecture Notes in Networks and Systems, vol 70. Springer, Cham. https://doi.org/10.1007/978-3-030-12385-7_76
DOI: https://doi.org/10.1007/978-3-030-12385-7_76
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-12384-0
Online ISBN: 978-3-030-12385-7
eBook Packages: Intelligent Technologies and Robotics (R0)