1 Introduction

Dysarthria is a speech difficulty caused mainly by a dysfunction of the organs that form words in the mouth [9], and not by a problem of phonation (the voice). A person with dysarthria has difficulty using or controlling the muscles of the mouth, tongue, larynx or vocal cords that make speech possible. Dysarthria can be caused by diseases that affect the nerves and muscles. As a result, a person with dysarthria may become isolated from family and other people, which compromises employability and social relations.

To address this disorder, we previously built an interface for an automatic dysarthric speech recognition system [4], intended to help not only the patient but also the doctor in making a preliminary diagnosis.

The aim of this paper is to improve this automatic dysarthric speech recognition system, which is based on HMMs and HTK [5,6,7]. To do so, we calculate several variants (parameters) of JITTER and SHIMMER [3] and then combine these parameters with the MFCC and PNCC coefficients [1, 2]. Finally, we compare the results to identify the most relevant parameter for automatic dysarthric speech recognition.

To our knowledge, this work is the first to use the NEMOURS database [8] to improve an automatic dysarthric speech recognition system by combining the two parameters JITTER and SHIMMER. This combination is a promising approach to improving communication between people with speech disorders and speakers without speech disorders.

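This paper is organized as follows. Section 2 presents the JITTER and SHIMMER variants. Section 3 describes the proposed technique and its results, Sect. 4 discusses these results, and Sect. 5 concludes the paper.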

2 Jitter and Shimmer

2.1 Jitter Variants

We define jitter as a quantification of the cycle-to-cycle \( F0 \) perturbations (small deviations from exact periodicity). However, there is no formal, unequivocal and rigorous definition [10], which has led to the development of many jitter variants (Schoentgen and de Guchteneere [11]; Baken and Orlikoff [12]). Jitter can be computed using either the F0 contour or the inversely proportional pitch period contour \( T0 = 1 / F0 \). Typically, researchers focus on the latter. Tsanas et al. [13] investigated the possible differences in the quantification of the information in the speech signal when using either the F0 contour or the T0 contour, and concluded that neither approach led to improved quantification of vocal severity. Specifically, the jitter variants we used are:

  • The mean absolute difference of \( F0 \) estimates between successive cycles:

    $$ Jitter_{{F_{0,abs} }} = \frac{1}{N} \mathop \sum \limits_{i = 1}^{N - 1} \left| {F_{0,i} - F_{0,i + 1} } \right| $$
    (1)

where \( N \) is the number of \( F0 \) computations.

  • The mean absolute difference of \( F0 \) between successive cycles divided by the mean \( F0 \), expressed in percent (%):

    $$ Jitter_{F_{0,\%}} = 100 \cdot \frac{\frac{1}{N} \sum\nolimits_{i = 1}^{N - 1} \left| F_{0,i} - F_{0,i + 1} \right|}{\frac{1}{N} \sum\nolimits_{i = 1}^{N} F_{0,i}} $$
    (2)
  • Perturbation quotient measures using K cycles (we used K = 5):

    $$ Jitter_{F_{0,PQ1,K}} = \frac{\frac{1}{N - K + 1} \sum\nolimits_{i = k_{1}}^{N - k_{2}} \left[ \frac{1}{K} \sum\nolimits_{j = i - k_{2}}^{i + k_{2}} \left| F_{0,i} - F_{0,i + 1} \right| \right]}{\frac{1}{N} \sum\nolimits_{i = 1}^{N} F_{0,i}} $$
    (3)
  • Perturbation quotient using an autoregressive model:

    $$ Jitter_{F_{0,PQ3,K}} = \frac{\frac{1}{N - p} \sum\nolimits_{i = p + 1}^{N} \left[ \sum\nolimits_{j = i - p}^{i} a_{j} \left( F_{0,j} - \frac{1}{N} \sum\nolimits_{n = 1}^{N} F_{0,n} \right) \right]}{\frac{1}{N} \sum\nolimits_{i = 1}^{N} F_{0,i}} $$
    (4)

Here, \( \left\{ a_{j} \right\}_{j = 1}^{p} \) are the autoregressive model coefficients, which are estimated from the F0 contour using the Yule-Walker equations. Following the suggestion of Schoentgen and de Guchteneere [11], we used \( p = 5 \) coefficients. Instead of quantifying only the average absolute difference between two successive \( F0 \) estimates, we quantify the absolute (weighted) average difference between the mean \( F0 \) estimate and the \( F0 \) estimates of the previous \( p \) time windows. Thus, Eq. (4) is effectively a generalization of Eq. (2).

  • Mean absolute and normalized mean squared perturbations:

    $$ Jitter_{{F_{0,p1} }} = \frac{1}{N} \mathop \sum \limits_{i = 1}^{N - 1} \left| {F_{0,i} - \frac{1}{N}\mathop \sum \limits_{j = 1}^{N} F_{0,j} } \right| $$
    (5)
    $$ Jitter_{F_{0,p2}} = \frac{\frac{1}{N} \sum\nolimits_{i = 1}^{N - 1} \left( F_{0,i} - F_{0,i + 1} \right)^{2}}{\left( \frac{1}{N} \sum\nolimits_{i = 1}^{N} F_{0,i} \right)^{2}} $$
    (6)
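As an illustration, the measures of Eqs. (1), (2), (5) and (6) can be sketched in a few lines of Python. This is a minimal sketch and not the implementation used in our system: the function name `jitter_variants` and the example contour are hypothetical, the squared term of Eq. (6) is read as \( F_{0,i} - F_{0,i+1} \), and the perturbation quotients of Eqs. (3) and (4) are left out.

```python
from statistics import fmean


def jitter_variants(f0):
    """Jitter measures of Eqs. (1), (2), (5) and (6) for an F0 contour in Hz."""
    n = len(f0)
    mean_f0 = fmean(f0)
    # Differences between successive cycles: N - 1 terms, normalised by N
    # exactly as written in the equations.
    diffs = [a - b for a, b in zip(f0[:-1], f0[1:])]

    jitter_abs = sum(abs(d) for d in diffs) / n                 # Eq. (1)
    jitter_pct = 100.0 * jitter_abs / mean_f0                   # Eq. (2)
    jitter_p1 = sum(abs(f - mean_f0) for f in f0[:-1]) / n      # Eq. (5), i = 1..N-1
    jitter_p2 = (sum(d * d for d in diffs) / n) / mean_f0 ** 2  # Eq. (6)

    return {"abs": jitter_abs, "percent": jitter_pct,
            "p1": jitter_p1, "p2": jitter_p2}


# Hypothetical four-cycle contour, for illustration only.
print(jitter_variants([100.0, 102.0, 98.0, 100.0]))
```

For this toy contour the absolute jitter is 2 Hz and the relative jitter is 2%, since the mean \( F0 \) is 100 Hz.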

Furthermore, we computed jitter-like measures using the standard deviation of the contour. Additionally, we calculated the difference between the mean \( F0 \) given by the estimation algorithm and the average \( F0 \) of age- and gender-matched healthy controls. The reference values are summarized in Fig. 1 [10].

Fig. 1. Life-span changes of the fundamental frequency F0 as a function of gender for the ages 20–90 years [5].

In addition, we computed frequency modulation (FM) [10]:

$$ FM = \frac{\max \left( F_{0,i} \right)_{i = 1}^{N} - \min \left( F_{0,i} \right)_{i = 1}^{N}}{\max \left( F_{0,i} \right)_{i = 1}^{N} + \min \left( F_{0,i} \right)_{i = 1}^{N}} $$
(7)
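Eq. (7) is a normalized range of the contour and can be sketched as follows. The function name `frequency_modulation` and the example values are hypothetical and serve only as an illustration.

```python
def frequency_modulation(f0):
    """Eq. (7): (max - min) / (max + min) of an F0 contour in Hz."""
    highest, lowest = max(f0), min(f0)
    return (highest - lowest) / (highest + lowest)


# Hypothetical contour ranging from 90 Hz to 110 Hz: (110 - 90) / (110 + 90).
print(frequency_modulation([90.0, 100.0, 110.0]))
```

The measure is 0 for a perfectly flat contour and approaches 1 as the minimum becomes negligible compared with the maximum.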

The contour was also analysed using the nonlinear Teager-Kaiser energy operator (TKEO) \( \varPsi \) [14]: we computed the mean, the standard deviation and the 5th, 25th, 75th and 95th percentile values of \( \varPsi \left( \cdot \right) \) applied to the contour, where \( \varPsi \) is defined as:

$$ \varPsi \left( X_{n} \right) = X_{n}^{2} - X_{n + 1} \cdot X_{n - 1} $$
(8)

The TKEO quantifies both the amplitude modulation (AM) and the frequency modulation (FM) content of an oscillating signal.
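A minimal sketch of Eq. (8) and of the summary statistics is given below. It rests on assumptions that are not taken from our implementation: the helper names `tkeo`, `percentile` and `tkeo_summary` are hypothetical, the percentiles use linear interpolation, and the standard deviation is the population one. The test tone illustrates why the operator captures both AM and FM: for a pure tone \( A \cos(\omega n) \) its output is the constant \( A^{2} \sin^{2}(\omega) \), which depends on both the amplitude and the frequency.

```python
import math
from statistics import fmean, pstdev


def tkeo(x):
    """Eq. (8) evaluated at every interior sample of the sequence x."""
    return [x[n] ** 2 - x[n + 1] * x[n - 1] for n in range(1, len(x) - 1)]


def percentile(sorted_vals, q):
    """Percentile q (0..100) of a sorted list, with linear interpolation."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def tkeo_summary(contour):
    """Mean, standard deviation and 5/25/75/95th percentiles of the TKEO."""
    psi = sorted(tkeo(contour))
    stats = {"mean": fmean(psi), "std": pstdev(psi)}
    for q in (5, 25, 75, 95):
        stats[f"p{q}"] = percentile(psi, q)
    return stats


# Pure tone with amplitude 2 and digital frequency 0.3 rad/sample:
# every TKEO value equals 4 * sin(0.3)**2 up to rounding error.
tone = [2.0 * math.cos(0.3 * n) for n in range(50)]
print(tkeo_summary(tone)["mean"], 4.0 * math.sin(0.3) ** 2)
```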

2.2 Shimmer Variants

In the preceding section, jitter was defined as the cycle-to-cycle F0 perturbations. Shimmer is the analogue of jitter for the amplitude of the speech signal rather than for F0. To derive the shimmer variants, we applied the same calculations presented in the preceding section for the jitter variants, using the amplitude contour A0 instead of the F0 contour in Eqs. (1)–(7). We defined the A0 contour as the maximum amplitude value within each glottal cycle, after using DYPSA [15, 16] to obtain the glottal cycles. Alternatively, the A0 contour can be defined over fixed signal segments (e.g. 25 ms) instead of glottal cycles, or using the minimum amplitude values. The shimmer variants differ from the jitter variants in that we used K = 3, 5 and 11 in Eqs. (3), (4) and (5), to conform to the traditional amplitude perturbation quotient measures used by standard reference software programs such as PRAAT. We also computed shimmer in decibels (dB), since this measure has often been used previously:

$$ Shimmer_{dB} = \frac{1}{N} \sum\limits_{i = 1}^{N - 1} 20 \cdot \left| \log_{10} \frac{A_{0,i}}{A_{0,i + 1}} \right| $$
(9)
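The A0 contour and Eq. (9) can be sketched as follows. This is an illustration under stated assumptions, not our implementation: the glottal cycle boundaries are taken as given (DYPSA itself is not implemented here), the amplitude of a cycle is taken as its largest absolute sample value, and the names `a0_contour` and `shimmer_db` as well as the example data are hypothetical.

```python
import math


def a0_contour(signal, cycle_starts):
    """Peak absolute amplitude within each glottal cycle.

    cycle_starts holds the sample indices that delimit the cycles
    (assumed to come from a glottal closure detector such as DYPSA).
    """
    return [max(abs(s) for s in signal[begin:end])
            for begin, end in zip(cycle_starts[:-1], cycle_starts[1:])]


def shimmer_db(a0):
    """Eq. (9): N - 1 successive amplitude ratios in dB, normalised by N."""
    total = sum(20.0 * abs(math.log10(a / b))
                for a, b in zip(a0[:-1], a0[1:]))
    return total / len(a0)


# Hypothetical signal with two cycles of three samples each.
signal = [0.0, 1.0, -3.0, 2.0, 0.5, -0.5]
a0 = a0_contour(signal, [0, 3, 6])   # peaks: 3.0 and 2.0
print(a0, shimmer_db(a0))
```

The other shimmer variants follow by passing the A0 contour, instead of the F0 contour, to the jitter computations.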

3 Proposed Technique

We calculated several variants of JITTER and SHIMMER and concatenated them with two types of speech parameterization: the MFCC coefficients and the recently proposed PNCC coefficients.

We tested this technique, for the first time to our knowledge, on the NEMOURS dysarthric speech database.

3.1 Results of Speech Recognition with MFCC and PNCC Coefficients

For each 25 ms frame of the dysarthric speech signal (NEMOURS), we calculated the MFCC coefficients and the PNCC coefficients together with their first and second derivatives, which gives a vector of 39 coefficients per frame.

Table 1 shows that the word accuracy obtained with the MFCC coefficients is better than that obtained with the PNCC coefficients.

Table 1. Results of speech recognition with coefficients: MFCC/PNCC.
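The two scores reported in the tables can be illustrated with the definitions used by the HTK scoring tool HResults, on the assumption that "word correction" corresponds to the HTK percentage of correct words and "word accuracy" to the HTK accuracy. The function name `htk_word_scores` and the error counts below are hypothetical and are not results from our experiments.

```python
def htk_word_scores(n_ref, deletions, substitutions, insertions):
    """Return (percent correct, percent accuracy) as defined for HResults.

    n_ref is the number of words in the reference transcriptions.
    """
    hits = n_ref - deletions - substitutions
    correct = 100.0 * hits / n_ref
    # Accuracy additionally penalises the words inserted by the recogniser.
    accuracy = 100.0 * (hits - insertions) / n_ref
    return correct, accuracy


# Hypothetical counts: 100 reference words, 10 deleted, 30 substituted,
# 5 inserted.
print(htk_word_scores(100, 10, 30, 5))   # (60.0, 55.0)
```

Because insertions are only subtracted in the accuracy, the word accuracy can never exceed the word correction for the same recognition output.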

3.2 Results of Speech Recognition with Coefficients: MFCC + JITTER

For each 25 ms frame of the dysarthric speech signal (NEMOURS), we calculated the MFCC coefficients and the different JITTER types, concatenated them, and added their first and second derivatives, which gives a vector of 42 coefficients per frame (see Fig. 2).

Fig. 2. Concatenation technique of MFCC coefficients with JITTER.
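The concatenation can be sketched as follows, to show how the vector sizes of 39, 42 and 45 coefficients arise. The sketch rests on assumptions: 13 static cepstral coefficients per frame (inferred from 39 = 3 × 13), derivatives computed with the regression formula commonly used for delta coefficients over a window of ±2 frames with replicated edge frames, and hypothetical names `deltas` and `build_observations`. In our system the derivatives are computed by the recognition toolkit, not by this code.

```python
def deltas(frames, window=2):
    """Regression-based derivatives of a list of per-frame feature vectors.

    Frames beyond the first and last one are replaced by the edge frame.
    """
    n = len(frames)
    denom = 2.0 * sum(t * t for t in range(1, window + 1))
    out = []
    for i in range(n):
        row = []
        for k in range(len(frames[i])):
            acc = 0.0
            for t in range(1, window + 1):
                ahead = frames[min(i + t, n - 1)][k]
                behind = frames[max(i - t, 0)][k]
                acc += t * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


def build_observations(cepstra, *extras):
    """Append per-frame extras (jitter, shimmer) to the static cepstra,
    then add the first and second derivatives of the whole static vector."""
    static = [list(c) + [e[i] for e in extras] for i, c in enumerate(cepstra)]
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Hypothetical data: 10 frames of 13 cepstral coefficients, one jitter
# value and one shimmer value per frame.
cepstra = [[float(t + k) for k in range(13)] for t in range(10)]
jitter = [0.01] * 10
shimmer = [0.10] * 10
print(len(build_observations(cepstra)[0]))                   # 39
print(len(build_observations(cepstra, jitter)[0]))           # 42
print(len(build_observations(cepstra, jitter, shimmer)[0]))  # 45
```

Each additional per-frame parameter therefore adds three coefficients to the observation vector: its static value and its two derivatives.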

The results in Table 2 are obtained by concatenating the MFCC coefficients with different types of JITTER. According to Table 2, the best word accuracy (43.69%) is obtained with MFCC + JITTER Relative, and the best word correction (50.40%) is obtained with MFCC + JITTER Absolute.

Table 2. Results of speech recognition with coefficients: MFCC + JITTER.

3.3 Results of Speech Recognition with Coefficients: MFCC + SHIMMER

Table 3 shows that the best word correction (51.67%) is obtained with MFCC + SHIMMER Relative as well as with MFCC + SHIMMER Ampl PQ11 classical Baken (see Fig. 3), and that the best word accuracy (47.97%) is obtained with MFCC + SHIMMER CV.

Table 3. Results of speech recognition with coefficients: MFCC + SHIMMER.
Fig. 3. Concatenation technique of MFCC coefficients with SHIMMER.

3.4 Results of Speech Recognition with Coefficients: PNCC + SHIMMER

For each 25 ms frame of the dysarthric speech signal (NEMOURS), we calculated the PNCC coefficients, concatenated them with the different variants of SHIMMER, and then calculated their first and second derivatives, which gives a vector of 42 coefficients per frame.

Table 4 shows that the best word accuracy (48.19%) is obtained with PNCC + SHIMMER CV, and that the best word correction (52.83%) is obtained with PNCC + SHIMMER Ampl PQ3 classical Baken.

Table 4. Results of speech recognition with coefficients: PNCC + SHIMMER.

3.5 Results of Speech Recognition with Coefficients: MFCC/PNCC + JITTER + SHIMMER

For each 25 ms frame of the dysarthric speech signal (NEMOURS), we calculated the MFCC/PNCC coefficients and the different types of JITTER and SHIMMER, concatenated them, and added their first and second derivatives, which gives a vector of 45 coefficients per frame (see Fig. 4).

Fig. 4. Results of speech recognition with the coefficients: MFCC/PNCC + JITTER + SHIMMER.

According to Table 5, the best word accuracy and word correction are obtained with PNCC + JITTER + SHIMMER RELATIVE, at 47.59% and 50.37%, respectively.

Table 5. Results of speech recognition with coefficients: MFCC/PNCC + JITTER + SHIMMER.

4 Discussions

Table 2 shows the results of concatenating the MFCC coefficients with several variants of JITTER. According to this table, the best JITTER parameter for automatic speech recognition in terms of word accuracy is JITTER Relative.

Comparing Table 1, which represents the baseline system, with Table 2 shows that the word accuracy is better with the baseline system, more precisely with the MFCC coefficients alone.

Table 3 shows the results of concatenating the MFCC coefficients with several SHIMMER variants. According to this table, the best SHIMMER parameter in terms of word accuracy is SHIMMER CV, and the best SHIMMER parameters in terms of word correction are SHIMMER Relative and SHIMMER Ampl PQ11 classical Baken.

Table 4 shows the results of concatenating the PNCC coefficients with several SHIMMER variants. According to this table, the best SHIMMER parameter in terms of word accuracy is SHIMMER CV, and the best SHIMMER parameter in terms of word correction is SHIMMER Ampl PQ3 classical Baken.

Comparing Tables 3 and 4, we conclude that the best results in terms of word accuracy and word correction are obtained when SHIMMER is concatenated with the PNCC coefficients rather than with the MFCC coefficients.

Table 5 shows the results of concatenating the MFCC/PNCC coefficients with JITTER and SHIMMER. According to this table, the best results are obtained with the combination of PNCC, JITTER and SHIMMER.

Comparing Tables 2, 3, 4 and 5 with the baseline system (Table 1), we conclude that the best results are obtained by concatenating the PNCC coefficients with SHIMMER CV (48.19% word accuracy) or with SHIMMER Ampl PQ3 classical Baken (52.83% word correction).

5 Conclusion

In this paper, we computed several variants of JITTER and SHIMMER and combined them with the MFCC and PNCC speech parameterization techniques in order to improve an automatic dysarthric speech recognition system.

According to the recognition results, and compared with our baseline system (Table 1), the best results are obtained by combining the PNCC coefficients with SHIMMER CV or with SHIMMER Ampl PQ3 classical Baken.

Our current challenge is to further improve automatic dysarthric speech recognition, in order to give hope to people who have difficulty speaking and to enable them to communicate with speakers without speech disorders.