Automatic Recognition System for Dysarthric Speech Based on MFCC’s, PNCC’s, JITTER and SHIMMER Coefficients

Zaidi, Brahim-Fares; Boudraa, Malika; Selouani, Sid-Ahmed; Addou, Djamel; Yakoub, Mohammed Sidi

doi:10.1007/978-3-030-17798-0_40


Conference paper
First Online: 24 April 2019


Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 944)

Abstract

The aim of this work is to improve the automatic recognition of dysarthric speech. In this context, we compare two speech parameterization techniques: the recently proposed Power Normalized Cepstral Coefficients and the Mel-Frequency Cepstral Coefficients. We concatenate several variants of JITTER and SHIMMER with these parameterizations to improve an automatic recognition system for dysarthric words. The goal is to help vulnerable people who have speech problems (dysarthric voices) and to help the doctor make a first diagnosis of the patient’s disease. To this end, an automatic recognition system for continuous pathological speech has been developed, based on Hidden Markov Models and the Hidden Markov Model Toolkit. For our tests, we used the Nemours database, which contains 11 speakers with dysarthric voices.


1 Introduction

Dysarthria is a difficulty in speaking, mainly due to a dysfunction of the organs that form words in the mouth [9], and is not caused by a problem of phonation (the voice). A person with dysarthria has difficulty using or controlling the muscles of the mouth, tongue, larynx or vocal cords that make speech possible. Dysarthria can be caused by diseases that affect the nerves and muscles. As a result, a person with dysarthria may become isolated from those around them, which compromises employability and social relations.

To cope with this disorder, we have built an interface for an automatic dysarthric speech recognition system [4] that helps not only the patient but also the doctor, who can use it to make a preliminary diagnosis.

The aim of this paper is to improve this automatic dysarthric speech recognition system, which is based on Hidden Markov Models (HMM) and the Hidden Markov Model Toolkit (HTK) [5,6,7]. To do so, we calculate several variants (parameters) of JITTER and SHIMMER [3] and then combine these parameters with the MFCC’s and PNCC’s coefficients [1, 2]. Finally, we compare the results to identify the most relevant parameter for the automatic recognition of dysarthric speech.

To our knowledge, this work is the first to use the NEMOURS database [8] to improve an automatic dysarthric speech recognition system by combining the two parameters JITTER and SHIMMER. This combination is a promising approach to improving communication between people with speech disorders and normal speakers.

This paper is organized as follows: Sect. 2 presents JITTER and SHIMMER, Sect. 3 presents the proposed technique and its results, Sect. 4 discusses the results, and Sect. 5 concludes.

2 Jitter and Shimmer

2.1 Jitter Variants

We define jitter as a quantification of cycle-to-cycle $ F_0 $ perturbations (small deviations from exact periodicity). However, there is no formal, unequivocal and rigorous definition [10], which has led to the development of many jitter variants (Schoentgen and de Guchteneere [11]; Baken and Orlikoff [12]). Jitter can be computed using either the $ F_0 $ contour or the inversely proportional pitch period contour $ T_0 = 1 / F_0 $. Typically, researchers focus on the latter. Tsanas et al. [13] investigated the possible differences in quantifying the information in the speech signal using either the $ F_0 $ contour or the $ T_0 $ contour, and concluded that neither approach led to improved quantification of vocal severity. Specifically, the jitter variants we used are:

The mean absolute difference of $ F_0 $ estimates between successive cycles:
$$ Jitter_{F_{0,abs}} = \frac{1}{N} \sum_{i = 1}^{N - 1} \left| F_{0,i} - F_{0,i + 1} \right| $$
(1)

where $ N $ is the number of $ F_0 $ computations.

The mean absolute difference of $ F_0 $ between successive cycles divided by the mean $ F_0 $, expressed in percent (%):
$$ Jitter_{F_{0,\%}} = 100 \cdot \frac{\frac{1}{N} \sum_{i = 1}^{N - 1} \left| F_{0,i} - F_{0,i + 1} \right|}{\frac{1}{N} \sum_{i = 1}^{N} F_{0,i}} $$
(2)
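
As a minimal sketch of Eqs. (1) and (2), the Python snippet below computes the absolute and relative jitter of a short F0 contour. The function names and the sample values are hypothetical and chosen only for illustration; the 1/N scaling follows the equations as printed.

```python
def jitter_abs(f0):
    """Eq. (1): summed |F0[i] - F0[i+1]| over successive cycles, scaled by 1/N."""
    n = len(f0)
    return sum(abs(a - b) for a, b in zip(f0, f0[1:])) / n


def jitter_percent(f0):
    """Eq. (2): Eq. (1) divided by the mean F0, expressed in percent."""
    mean_f0 = sum(f0) / len(f0)
    return 100.0 * jitter_abs(f0) / mean_f0


f0 = [100.0, 102.0, 98.0, 100.0]  # hypothetical F0 estimates in Hz
print(jitter_abs(f0))      # 2.0
print(jitter_percent(f0))  # 2.0
```

With a mean F0 of 100 Hz, the two measures coincide numerically in this example, which makes the normalisation in Eq. (2) easy to check by hand.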
Perturbation quotient measures using $ K $ cycles (we used $ K = 5 $):
$$ Jitter_{F_{0,PQ1,K}} = \frac{\frac{1}{N - K + 1} \sum_{i = k_{1}}^{N - k_{2}} \left[ \frac{1}{K} \sum_{j = i - k_{2}}^{i + k_{2}} \left| F_{0,i} - F_{0,i + 1} \right| \right]}{\frac{1}{N} \sum_{i = 1}^{N} F_{0,i}} $$
(3)

The summation limits are consistent with $ k_2 = (K - 1)/2 $ and $ k_1 = k_2 + 1 $, so that each inner window spans $ K $ cycles and the outer sum has $ N - K + 1 $ terms.
Perturbation quotient using an autoregressive model:
$$ Jitter_{F_{0,PQ3,K}} = \frac{\frac{1}{N - p} \sum_{i = p + 1}^{N} \left[ \sum_{j = i - p}^{i} a_{j} \left( F_{0,j} - \frac{1}{N} \sum_{n = 1}^{N} F_{0,n} \right) \right]}{\frac{1}{N} \sum_{i = 1}^{N} F_{0,i}} $$
(4)

Here, $ \left\{ a_{j} \right\}_{j = 1}^{p} $ are the autoregressive model coefficients, which are estimated from the $ F_0 $ contour using the Yule-Walker equations. Following the suggestion of Schoentgen and de Guchteneere [11], we used $ p = 5 $ coefficients. Instead of quantifying only the average absolute difference between two successive $ F_0 $ estimates, we quantify the absolute (weighted) average difference between the mean $ F_0 $ estimate and the $ F_0 $ estimates of the previous $ p $ time windows. Thus Eq. (4) is effectively a generalization of Eq. (2).
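
To illustrate how autoregressive coefficients could be estimated from a contour, the following Python sketch solves the Yule-Walker equations with the Levinson-Durbin recursion. The function names, the biased autocorrelation estimate, the predictor convention (x[n] is predicted as a_1 x[n-1] + ... + a_p x[n-p]) and the sample contour are all assumptions made for this example; the paper does not specify them.

```python
def autocorrelation(x, max_lag):
    """Biased autocorrelation estimate of the mean-removed sequence."""
    n = len(x)
    mean = sum(x) / n
    c = [v - mean for v in x]
    return [sum(c[i] * c[i + lag] for i in range(n - lag)) / n
            for lag in range(max_lag + 1)]


def yule_walker(r, p):
    """Solve the Yule-Walker equations for lags r[0..p] by Levinson-Durbin.

    Returns (a, err): a[k-1] is the weight of x[n-k] and err is the
    final prediction-error power.
    """
    a = [0.0] * (p + 1)
    err = r[0]
    for m in range(1, p + 1):
        acc = r[m] - sum(a[k] * r[m - k] for k in range(1, m))
        refl = acc / err                      # reflection coefficient
        updated = a[:]
        updated[m] = refl
        for k in range(1, m):
            updated[k] = a[k] - refl * a[m - k]
        a = updated
        err *= 1.0 - refl * refl
    return a[1:], err


# Sanity check on the exact autocorrelation of an AR(1) process with weight 0.5
coeffs, _ = yule_walker([1.0, 0.5, 0.25, 0.125], p=3)
print(coeffs)  # [0.5, 0.0, 0.0]

# Hypothetical F0 contour in Hz, with p = 5 as in the text
f0 = [100.0, 102.0, 98.0, 101.0, 99.0, 103.0,
      97.0, 100.0, 102.0, 99.0, 101.0, 98.0]
a, err = yule_walker(autocorrelation(f0, 5), p=5)
print(len(a))  # 5
```

The AR(1) check recovers a single non-zero weight, which is the expected behaviour when the higher-order lags carry no additional information.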

Mean absolute and normalized mean squared perturbations:
$$ Jitter_{F_{0,p1}} = \frac{1}{N} \sum_{i = 1}^{N - 1} \left| F_{0,i} - \frac{1}{N} \sum_{j = 1}^{N} F_{0,j} \right| $$
(5)

$$ Jitter_{F_{0,p2}} = \frac{\frac{1}{N} \sum_{i = 1}^{N - 1} \left( F_{0,i} - F_{0,i + 1} \right)^{2}}{\left( \frac{1}{N} \sum_{i = 1}^{N} F_{0,i} \right)^{2}} $$
(6)
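
The two perturbation measures of Eqs. (5) and (6) can be sketched in Python as follows. The function names and sample values are hypothetical. The sketch keeps the summation limits as printed (i = 1 to N - 1) and reads the summand of Eq. (6) as the squared difference between successive cycles, which is an interpretation rather than something the paper states explicitly.

```python
def jitter_p1(f0):
    """Eq. (5): absolute deviations from the mean F0, summed over i = 1..N-1."""
    n = len(f0)
    mean_f0 = sum(f0) / n
    return sum(abs(x - mean_f0) for x in f0[:-1]) / n


def jitter_p2(f0):
    """Eq. (6): squared successive differences over the squared mean F0."""
    n = len(f0)
    mean_f0 = sum(f0) / n
    numerator = sum((a - b) ** 2 for a, b in zip(f0, f0[1:])) / n
    return numerator / mean_f0 ** 2


f0 = [100.0, 102.0, 98.0, 100.0]  # hypothetical F0 estimates in Hz
print(jitter_p1(f0))  # 1.0
print(jitter_p2(f0))  # approximately 0.0006
```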

Furthermore, we computed jitter-like measures using the standard deviation of the contour. We also calculated the difference between the mean $ F_0 $ obtained from the estimation algorithm and the average of age- and gender-matched healthy controls. This information is summarized in Fig. 1 [10].

In addition, we computed frequency modulation (FM) [10]:

$$ FM = \frac{\max \left( F_{0,i} \right)_{i = 1}^{N} - \min \left( F_{0,i} \right)_{i = 1}^{N}}{\max \left( F_{0,i} \right)_{i = 1}^{N} + \min \left( F_{0,i} \right)_{i = 1}^{N}} $$

(7)
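
A short Python sketch of Eq. (7), assuming the contour is available as a plain list of F0 estimates (the function name and values are hypothetical):

```python
def frequency_modulation(f0):
    """Eq. (7): range of the F0 contour divided by the sum of its extremes."""
    highest, lowest = max(f0), min(f0)
    return (highest - lowest) / (highest + lowest)


f0 = [100.0, 102.0, 98.0, 100.0]  # hypothetical F0 estimates in Hz
print(frequency_modulation(f0))   # 0.02
```

The measure is dimensionless and equals zero for a perfectly flat contour.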

The contour was also analysed using the nonlinear Teager-Kaiser energy operator (TKEO) $ \Psi $ [14], and we computed the mean, the standard deviation and the 5th, 25th, 75th and 95th percentile values of the TKEO output, where $ \Psi $ is defined as:

$$ \Psi \left( X_{n} \right) = X_{n}^{2} - X_{n + 1} \cdot X_{n - 1} $$

(8)

The TKEO quantifies both the amplitude modulation (AM) and the frequency modulation (FM) content of an oscillating signal.
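
The sketch below applies Eq. (8) in Python and shows why the operator reflects both amplitude and frequency: for a pure tone A cos(ωn), the output is constant and equal to A² sin²(ω). The summary statistics at the end mirror those listed above; the function name, the sample contour and the percentile convention of `statistics.quantiles` are assumptions of this example, not details taken from the paper.

```python
import math
import statistics


def tkeo(x):
    """Eq. (8) at interior samples; the two end points lack a neighbour."""
    return [x[n] ** 2 - x[n + 1] * x[n - 1] for n in range(1, len(x) - 1)]


# For a pure tone A*cos(w*n) the operator is constant: A^2 * sin(w)^2
amp, omega = 2.0, 0.3
tone = [amp * math.cos(omega * n) for n in range(50)]
psi = tkeo(tone)
expected = (amp * math.sin(omega)) ** 2
print(max(abs(v - expected) for v in psi) < 1e-9)  # True

# Summary statistics on a hypothetical F0 contour in Hz
contour = [100.0, 102.0, 98.0, 101.0, 99.0, 103.0, 97.0, 100.0]
psi_c = tkeo(contour)
cuts = statistics.quantiles(psi_c, n=20)  # cut points at 5%, 10%, ..., 95%
summary = {
    "mean": statistics.mean(psi_c),
    "std": statistics.stdev(psi_c),
    "p5": cuts[0], "p25": cuts[4], "p75": cuts[14], "p95": cuts[18],
}
print(summary)
```

Percentile definitions differ between tools, so values at the extremes (5th and 95th) may not match those of another implementation on short contours.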

2.2 Shimmer Variants

The preceding section defined jitter as the cycle-to-cycle $ F_0 $ perturbations. Shimmer is the analogue of jitter for the amplitude of the speech signal rather than $ F_0 $. To derive the shimmer variants, we applied the same calculations as for the jitter variants, using the amplitude contour $ A_0 $ instead of the $ F_0 $ contour in Eqs. (1) to (7). We defined the $ A_0 $ contour using the maximum amplitude value within each glottal cycle, after using DYPSA [15, 16] to obtain the glottal cycles. Alternatively, the $ A_0 $ contour can be defined over fixed signal segments (e.g. 25 ms) instead of glottal cycles, or using the minimum amplitude values. The shimmer variants differ from the jitter variants in that we used K = 3, 5, and 11 in Eqs. (3, 5, 4), to conform to the traditional amplitude perturbation quotient measures used by standard reference software such as PRAAT. We also computed shimmer in decibels (dB), since this measure has often been used previously:

$$ Shimmer_{dB} = \frac{1}{N} \sum_{i = 1}^{N - 1} 20 \cdot \left| \log_{10} \frac{A_{0,i}}{A_{0,i + 1}} \right| $$

(9)
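
A minimal Python sketch of Eq. (9), assuming the amplitude contour is a list of strictly positive peak amplitudes, one per glottal cycle (the function name and values are hypothetical):

```python
import math


def shimmer_db(a0):
    """Eq. (9): summed |20*log10(A0[i] / A0[i+1])| over cycles, scaled by 1/N."""
    n = len(a0)
    return sum(20.0 * abs(math.log10(a / b)) for a, b in zip(a0, a0[1:])) / n


a0 = [1.0, 10.0, 1.0]  # hypothetical peak amplitudes, 20 dB apart
print(round(shimmer_db(a0), 2))  # 13.33
```

Each of the two steps contributes 20 dB, and the 1/N scaling with N = 3 gives 40/3 dB.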

3 Proposed Technique

We calculated several variants of JITTER and SHIMMER and concatenated them with two types of speech parameterization: the MFCC’s coefficients and the recently proposed PNCC’s coefficients.

We tested this technique, for the first time, on the NEMOURS dysarthric database.

3.1 Results of Speech Recognition with MFCC’s and PNCC’s Coefficients

For each 25 ms frame of the dysarthric speech signal (NEMOURS), we calculated the MFCC’s coefficients and the PNCC’s coefficients together with their first and second derivatives, which gives a vector of 39 coefficients per frame.

Table 1 shows that word accuracy is better with the MFCC’s coefficients than with the PNCC’s coefficients.
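
The following Python sketch shows one way to append first and second derivatives to a sequence of static frames. It assumes 13 static coefficients per frame (so that 3 × 13 = 39), a regression-style delta over ±2 frames with edge frames replicated, and placeholder values; none of these settings is stated in the paper, and all function names are hypothetical.

```python
def deltas(frames, theta=2):
    """Regression deltas over +/- theta frames, replicating the edge frames."""
    last = len(frames) - 1
    denom = 2 * sum(k * k for k in range(1, theta + 1))
    out = []
    for t in range(len(frames)):
        row = []
        for d in range(len(frames[0])):
            acc = 0.0
            for k in range(1, theta + 1):
                ahead = frames[min(t + k, last)][d]
                behind = frames[max(t - k, 0)][d]
                acc += k * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


def add_derivatives(frames):
    """Concatenate static, first-derivative and second-derivative features."""
    first = deltas(frames)
    second = deltas(first)
    return [s + a + b for s, a, b in zip(frames, first, second)]


# Placeholder static features: 6 frames of 13 coefficients (assumed size)
static = [[float(t + d) for d in range(13)] for t in range(6)]
full = add_derivatives(static)
print(len(full), len(full[0]))  # 6 39
```

On this linear ramp the interior first derivatives equal 1.0, which is a quick check that the regression weights are normalised correctly.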

Table 1. Results of speech recognition with coefficients: MFCC’s/PNCC’s.


3.2 Results of Speech Recognition with Coefficients: MFCC’s + JITTER

For each 25 ms frame of the dysarthric speech signal (NEMOURS), we calculated the MFCC’s coefficients and the different JITTER types, concatenated them, and added their first and second derivatives, which gives a vector of 42 coefficients per frame (see Fig. 2).

The results in Table 2 are obtained by concatenating the MFCC’s coefficients with different types of JITTER. According to Table 2, the best word accuracy (43.69%) is obtained with MFCC’s + JITTER Relative, and the best word correction (50.40%) is obtained with MFCC’s + JITTER Absolute.
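
The paper does not define its two scores. Assuming they correspond to the percent-correct and accuracy figures reported by the HTK `HResults` tool (an assumption), word correction ignores insertion errors while word accuracy penalises them, which would explain why the correction figures are the higher of the two. The Python sketch below uses hypothetical error counts, not values from the paper's tables.

```python
def percent_correct(n_words, deletions, substitutions):
    """Percent correct: insertions are not counted as errors."""
    return 100.0 * (n_words - deletions - substitutions) / n_words


def percent_accuracy(n_words, deletions, substitutions, insertions):
    """Accuracy: insertions are additionally subtracted from the hits."""
    return 100.0 * (n_words - deletions - substitutions - insertions) / n_words


# Hypothetical counts for a 200-word test set
print(percent_correct(200, 30, 70))       # 50.0
print(percent_accuracy(200, 30, 70, 12))  # 44.0
```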

Table 2. Results of speech recognition with coefficients: MFCC’s + JITTER.


3.3 Results of Speech Recognition with Coefficients: MFCC’s + SHIMMER

Table 3 shows that the best word correction (51.67%) is obtained with MFCC’s + SHIMMER Relative as well as with MFCC’s + SHIMMER Ampl PQ11 classical Baken (see Fig. 3), and that the best word accuracy (47.97%) is obtained with MFCC’s + SHIMMER CV.

Table 3. Concatenation technique of MFCC’s coefficients with SHIMMER.


3.4 Results of Speech Recognition with Coefficients: PNCC’s + SHIMMER

For each 25 ms frame of the dysarthric speech signal (NEMOURS), we calculated the PNCC’s coefficients, concatenated them with different variants of SHIMMER, and then calculated their first and second derivatives, which gives a vector of 42 coefficients per frame.

Table 4 shows that the best word accuracy (48.19%) is obtained with PNCC’s + SHIMMER CV, and that the best word correction (52.83%) is obtained with PNCC’s + SHIMMER Ampl PQ3 classical Baken.

Table 4. Results of speech recognition with coefficients: PNCC’s + SHIMMER.


3.5 Results of Speech Recognition with Coefficients: MFCC/PNCC + JITTER + SHIMMER

For each 25 ms frame of the dysarthric speech signal (NEMOURS), we calculated the MFCC’s/PNCC’s coefficients and different types of JITTER and SHIMMER, concatenated them, and added their first and second derivatives, which gives a vector of 45 coefficients per frame (see Fig. 4).

According to Table 5, the best word accuracy (47.59%) and the best word correction (50.37%) are both obtained with PNCC’s + JITTER + SHIMMER RELATIVE.
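
The vector sizes reported in this section (39, 42 and 45) are consistent with 13 cepstral coefficients per frame plus one jitter value and one shimmer value, each tripled by the first and second derivatives. The Python sketch below illustrates that bookkeeping; the per-frame sizes, the function names and the placeholder values are assumptions, since the paper gives only the final dimensions.

```python
def concatenate(base, extra):
    """Append the perturbation features of each frame to its cepstral vector."""
    if len(base) != len(extra):
        raise ValueError("both streams need one row per frame")
    return [b + e for b, e in zip(base, extra)]


def final_dimension(n_static):
    """Static coefficients plus their first and second derivatives."""
    return 3 * n_static


cepstra = [[0.0] * 13 for _ in range(4)]  # placeholder cepstral frames (assumed 13)
jitter = [[0.5] for _ in range(4)]        # one jitter value per frame (assumed)
shimmer = [[0.1] for _ in range(4)]       # one shimmer value per frame (assumed)

with_jitter = concatenate(cepstra, jitter)
with_both = concatenate(with_jitter, shimmer)

print(final_dimension(len(cepstra[0])))      # 39
print(final_dimension(len(with_jitter[0])))  # 42
print(final_dimension(len(with_both[0])))    # 45
```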

Table 5. Results of speech recognition with coefficients: MFCC’s / PNCC’s + JITTER + SHIMMER.


4 Discussions

Table 2 shows the concatenation of the MFCC’s coefficients with several variants of JITTER. According to this table, the best JITTER parameter for automatic speech recognition is JITTER Relative.

Comparing Table 1, which represents the baseline system, with Table 2 shows that word accuracy is better with the baseline system, and more precisely with the MFCC’s coefficients.

Table 3 shows the concatenation of the MFCC’s coefficients with several SHIMMER variants. According to this table, the best SHIMMER parameter in terms of word accuracy is SHIMMER CV, and the best SHIMMER parameters in terms of word correction are SHIMMER Relative and SHIMMER Ampl PQ11 classical Baken.

Table 4 shows the concatenation of the PNCC’s coefficients with several SHIMMER variants. According to this table, the best SHIMMER parameter in terms of word accuracy is SHIMMER CV, and the best SHIMMER parameter in terms of word correction is SHIMMER Ampl PQ3 classical Baken.

Comparing Tables 3 and 4, we conclude that concatenating SHIMMER with the PNCC’s coefficients gives better word accuracy and word correction than concatenating it with the MFCC’s coefficients.

Table 5 shows the concatenation of the MFCC’s/PNCC’s coefficients with JITTER and SHIMMER. According to this table, the best results are obtained with the combination of PNCC’s, JITTER and SHIMMER.

Comparing Tables 2, 3, 4 and 5 with the baseline system (Table 1), we conclude that the best results in terms of word accuracy and word correction are obtained by concatenating SHIMMER CV or SHIMMER Ampl PQ3 classical Baken with the PNCC’s coefficients.

5 Conclusion

In this paper, we computed several variants of JITTER and SHIMMER and combined them with the MFCC’s and PNCC’s speech parameterization techniques to improve an automatic recognition system for dysarthric speech.

Comparing the recognition results with our baseline system (Table 1) shows that the best results are obtained by combining the PNCC’s coefficients with SHIMMER CV or with SHIMMER Ampl PQ3 classical Baken.

Our current challenge is to further improve automatic dysarthric speech recognition, in order to give hope to people who have difficulty speaking and to enable them to communicate with normal speakers.

References

1. Kim, C., Stern, R.M.: Power Normalized Cepstral Coefficients (PNCC) for robust speech recognition. IEEE Trans. Audio Speech Lang. Process. 24, 1315 (2016)
2. Mohammed, A., Mansour, A., Ghulam, M., Mohammed, Z., Mesallam, T.A., Malki, K.H., Mohamed, F., Mekhtiche, M.A., Mohamed, B.: Automatic speech recognition of pathological voice. Indian J. Sci. Technol. 8, 32 (2015)
3. Tsanas, A.: Accurate telemonitoring of Parkinson’s disease symptom severity using nonlinear speech signal processing and statistical machine learning. University of Oxford, June 2012
4. Zaidi, B.F., Selouani, S.A., Boudraa, M., Hamdani, G.: Human/machine interface dialog integrating new information and communication technology for pathological voice. In: IEEE Xplore, Future Technologies Conference (FTC), San Francisco, CA, USA, January 2017
5. Alam, M.J., Kenny, P., Dumouchel, P., O’Shaughnessy, D.: Robust feature extractors for continuous speech recognition. In: IEEE Xplore, European Signal Processing Conference (EUSIPCO), Lisbon, Portugal, November 2014
6. Dua, M., Aggarwal, R.K., Kadyan, V., Dua, S.: Punjabi automatic speech recognition using HTK. Int. J. Comput. Sci. Issues 9(4), 359 (2012)
7. Young, S., Kershaw, D., Odell, J., Ollason, D., Valtchev, V., Woodland, P.: The HTK Book, version 3.1, pp. 1–277 (2006)
8. Menéndez-Pidal, X., Polikoff, J.B., Peters, S.M., Leonzio, J.E., Bunnell, H.T.: The Nemours database of dysarthric speech. J. IEEE (in press)
9. Darley, F.L., Aronson, A.E., Brown, J.R.: Differential diagnostic patterns of dysarthria. J. Speech Lang. Hear. Res. 12, 246–269 (1969)
10. Titze, I.R.: Principles of Voice Production. National Center for Voice and Speech, Iowa City, USA, 2nd printing (2000)
11. Schoentgen, J., de Guchteneere, R.: Time series analysis of jitter. J. Phon. 23, 189–201 (1995)
12. Baken, R.J., Orlikoff, R.F.: Clinical Measurement of Speech and Voice, 2nd edn. Singular Thomson Learning, San Diego (2000)
13. Tsanas, A., Little, M.A., McSharry, P.E., Ramig, L.O.: Nonlinear speech analysis algorithms mapped to a standard metric achieve clinically useful quantification of average Parkinson’s disease symptom severity. J. R. Soc. Interface 8, 842–855 (2011)
14. Kaiser, J.: On a simple algorithm to calculate the ‘energy’ of a signal. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1990), pp. 381–384, Albuquerque, NM, USA, April 1990
15. Kounoudes, A., Naylor, P.A., Brookes, M.: The DYPSA algorithm for estimation of glottal closure instants in voiced speech. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 349–352, Orlando, FL (2002)
16. Naylor, P.A., Kounoudes, A., Gudnason, J., Brookes, M.: Estimation of glottal closure instants in voiced speech using the DYPSA algorithm. IEEE Trans. Audio Speech Lang. Process. 15, 34–43 (2007)


Author information

Authors and Affiliations

Laboratory of Speech Communication and Signal Processing (LSCSP), U.S.T.H.B University, Algiers, Algeria
Brahim-Fares Zaidi, Malika Boudraa & Djamel Addou
Laboratory of Research in Human-System Interaction (LARHSI), University of Moncton, Shippagan Campus, Moncton, Canada
Sid-Ahmed Selouani & Mohammed Sidi Yakoub


Corresponding author

Correspondence to Brahim-Fares Zaidi.

Editor information

Editors and Affiliations

Saga University, Saga, Saga, Japan
Kohei Arai
The Science and Information (SAI) Organization, Bradford, West Yorkshire, UK
Supriya Kapoor

About this paper

Cite this paper

Zaidi, BF., Boudraa, M., Selouani, SA., Addou, D., Yakoub, M.S. (2020). Automatic Recognition System for Dysarthric Speech Based on MFCC’s, PNCC’s, JITTER and SHIMMER Coefficients. In: Arai, K., Kapoor, S. (eds) Advances in Computer Vision. CVC 2019. Advances in Intelligent Systems and Computing, vol 944. Springer, Cham. https://doi.org/10.1007/978-3-030-17798-0_40


DOI: https://doi.org/10.1007/978-3-030-17798-0_40
Published: 24 April 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-17797-3
Online ISBN: 978-3-030-17798-0
eBook Packages: Intelligent Technologies and Robotics (R0)
