3.1 Formant Model Estimation                                                    

 

In the LP model of speech, each complex-conjugate pole pair corresponds to a second-order resonator. The resonance frequency of each pole pair is associated with a peak in spectral energy, i.e. a formant candidate. The pole radius is related to the concentration of local energy and hence to the bandwidth of the formant candidate. Although automatic formant analysis of speech has long received considerable attention and a variety of approaches have been developed, calculating accurate formant features from the speech signal is still considered a non-trivial problem. The accuracy of formant tracking using conventional frame-based LPC analysis is affected by the following factors [7]:

  1. The number of LPC model coefficients.

  2. The influence of pitch (glottal formant) on the first formant.

  3. Formant merging.

  4. Rapid formant variation that may occur in consonant vowel transitions or diphthongs.

  5. Source-vocal tract interaction (pre-condition in LPC analysis).

  6. Effects of lip radiation and internal loss on formant bandwidth and frequency.
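The relation between LP poles and formant candidates described at the start of this section can be illustrated with a minimal sketch. It assumes the standard conversion from a pole at radius r and angle θ to a frequency F = θ·fs/(2π) and a 3 dB bandwidth BW = −ln(r)·fs/π; the function name `poles_to_formants` and the example pole positions are hypothetical and are not taken from [7].

```python
import numpy as np

def poles_to_formants(lpc_coeffs, fs):
    """Map the roots of A(z) = 1 + a1 z^-1 + ... + ap z^-p to
    (frequency_hz, bandwidth_hz) candidates, sorted by frequency."""
    poles = np.roots(lpc_coeffs)
    # Keep one pole from each complex-conjugate pair (positive imaginary part).
    poles = poles[np.imag(poles) > 1e-9]
    freqs = np.angle(poles) * fs / (2.0 * np.pi)      # pole angle  -> frequency
    bws = -np.log(np.abs(poles)) * fs / np.pi         # pole radius -> bandwidth
    order = np.argsort(freqs)
    return freqs[order], bws[order]

# Build an all-pole model from two known resonances, then recover them.
fs = 8000.0                                  # assumed sampling rate
true_f = np.array([500.0, 1500.0])           # illustrative resonance frequencies
true_bw = np.array([60.0, 100.0])            # illustrative bandwidths
radius = np.exp(-np.pi * true_bw / fs)
theta = 2.0 * np.pi * true_f / fs
pole_set = np.concatenate([radius * np.exp(1j * theta),
                           radius * np.exp(-1j * theta)])
a = np.real(np.poly(pole_set))               # LPC polynomial coefficients
est_f, est_bw = poles_to_formants(a, fs)
```

In practice the coefficients would come from LPC analysis of a speech frame rather than from known poles; the round trip above only checks the conversion itself.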

 

In this section, an improved method is suggested to tackle the first four factors in LPC analysis.

Formant classification is described in [8,9]. Each formant candidate is represented by a feature vector [Fk, BWk, Ik, ΔFk, ΔBWk, ΔIk]: the formant frequency Fk, bandwidth BWk, and intensity Ik, together with the slopes of their time trajectories ΔFk, ΔBWk, and ΔIk. A 2-D HMM (Figure 6) with three left-to-right states across time and four left-to-right states across frequency is used to classify the formant candidates in each frame among four sequential formant clusters. Given a set of training data, the distribution of each formant vector in each state is modelled by a multivariate Gaussian mixture distribution trained using the expectation-maximization (EM) algorithm. The kth state along the frequency axis models the distribution of the kth formant. Formant tracks are obtained using a Viterbi search to find the most likely path of formants given the HMMs [8,9]. Figure 7 shows a block diagram of the formant estimation procedure.
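To make the frequency-axis part of this classification concrete, the following is a heavily simplified sketch, not the method of [8,9]. It assumes a single Gaussian per state on frequency only (instead of a Gaussian mixture on the full six-dimensional vector), ignores the time axis and transition probabilities, and allows a candidate to stay in the same state or move to any higher state. The function name `classify_candidates` and the state means and standard deviations are hypothetical.

```python
import numpy as np

def classify_candidates(freqs, means, stds):
    """Assign frequency-sorted candidates of one frame to left-to-right
    formant states with a Viterbi search (state index 0 = first formant)."""
    freqs = np.asarray(freqs, dtype=float)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    # Gaussian log-likelihood of each candidate under each state: (n, K).
    ll = (-0.5 * ((freqs[:, None] - means[None, :]) / stds[None, :]) ** 2
          - np.log(stds[None, :] * np.sqrt(2.0 * np.pi)))
    n, K = ll.shape
    score = np.full((n, K), -np.inf)
    back = np.zeros((n, K), dtype=int)
    score[0] = ll[0]
    for t in range(1, n):
        for k in range(K):
            # Left-to-right constraint: the previous state index must be <= k.
            prev = int(np.argmax(score[t - 1, :k + 1]))
            score[t, k] = score[t - 1, prev] + ll[t, k]
            back[t, k] = prev
    # Backtrack from the best final state.
    path = [int(np.argmax(score[-1]))]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Hypothetical state parameters (Hz), chosen only for illustration.
means = [300.0, 2200.0, 3000.0, 3600.0]
stds = [100.0, 200.0, 200.0, 250.0]
labels_full = classify_candidates([300.0, 2300.0, 3000.0, 3700.0], means, stds)
labels_dup = classify_candidates([280.0, 350.0, 2250.0], means, stds)
```

In the second call two low-frequency candidates compete for the first state, which is the kind of ambiguity the full model resolves with the additional features and the time-axis states.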

 

In [6], it is observed that rapid formant variation across phoneme boundaries is the dominant factor affecting the accuracy of formant estimation in continuous speech. To reduce these effects, five additional rules are applied as follows:

 

(a)    A pre-emphasis filter is used to eliminate the effect of pitch (the glottal formant).

(b)   Very short phonetic segments, which may have excessive co-articulation of formants, are discarded.

(c)    Lower limits are placed on the bandwidth of formant candidates and on the LPC model order to avoid over-modeling.

(d)   Only formant candidates from frames in the central (i.e. target) part of phoneme segments are used.

(e)    After HMM modeling of the formant candidates, the mixture component with the largest variance in each HMM state is not used. Large-variance mixture components are associated with formant candidate values that fall in between the formant frequencies.
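Rules (b), (d) and (e) above can be sketched as two small helpers. This is an illustration under stated assumptions: the minimum segment length `min_frames`, the retained fraction `keep_fraction`, the renormalisation of the remaining mixture weights, and the function names are all hypothetical choices, not values or procedures given in [6].

```python
import numpy as np

def central_frames(start, end, keep_fraction=0.5, min_frames=5):
    """Frame indices in the central part of the segment [start, end).
    Segments shorter than min_frames are discarded (empty result)."""
    n = end - start
    if n < min_frames:                       # rule (b): drop very short segments
        return np.arange(0)
    margin = int(round(n * (1.0 - keep_fraction) / 2.0))
    return np.arange(start + margin, end - margin)   # rule (d): central part

def drop_widest_component(weights, means, variances):
    """Rule (e): remove the mixture component with the largest variance
    and renormalise the remaining weights (renormalisation is assumed)."""
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    keep = np.arange(len(weights)) != int(np.argmax(variances))
    kept_w = weights[keep]
    return kept_w / kept_w.sum(), means[keep], variances[keep]

# A 20-frame segment keeps its middle 10 frames; a 3-frame segment is dropped.
kept = central_frames(10, 30)
short = central_frames(0, 3)

# Illustrative mixture for one state: the broad component centred between
# formants is removed.
w, m, v = drop_widest_component([0.5, 0.3, 0.2],
                                [2200.0, 1700.0, 2300.0],
                                [100.0 ** 2, 400.0 ** 2, 150.0 ** 2])
```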

 

The idea behind these rules is to use the steadiest part (the target) of the formants in each vowel. Figure 8 shows histograms of the formant distributions of the vowel /IY/ from an Australian speaker. Each peak represents a formant. In Figure 8(a), where no pre-emphasis is applied, the glottal formant due to the pitch effect (the first peak) could be mistaken for the first formant (the second peak), whereas in Figure 8(b) the glottal formant is eliminated by pre-emphasis. In Figures 8(a), 8(b) and 8(c), the hump around 1700 Hz is easily mistaken for the second formant, although /IY/ does not have any formant in that frequency range. After applying rules (b) to (e), the hump disappears in Figure 8(d), the second and third formants become clearer, and the HMM curve closely fits the distribution of formant candidates. The formant frequencies are finally obtained by averaging the central parts of the formant trajectories.
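The effect of pre-emphasis on the low-frequency glottal peak can be illustrated with a first-order filter y[n] = x[n] − α·x[n−1]. This is a common choice and is an assumption here: the text does not specify the filter, and the coefficient α = 0.97, the sampling rate and the function names are hypothetical.

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """First-order pre-emphasis: y[n] = x[n] - alpha * x[n-1]."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]                       # no previous sample for the first output
    y[1:] = x[1:] - alpha * x[:-1]
    return y

def pre_emphasis_gain_db(freq_hz, fs, alpha=0.97):
    """Magnitude response of 1 - alpha * z^-1 at freq_hz, in dB."""
    w = 2.0 * np.pi * freq_hz / fs
    return 20.0 * np.log10(np.abs(1.0 - alpha * np.exp(-1j * w)))

fs = 16000.0                                      # assumed sampling rate
gain_pitch = pre_emphasis_gain_db(100.0, fs)      # region of the glottal peak
gain_f2 = pre_emphasis_gain_db(2000.0, fs)        # region of higher formants
filtered = pre_emphasis(np.ones(4))               # a constant is almost removed
```

With these assumed settings the filter attenuates the pitch region far more than the higher-formant region, which is consistent with the first peak disappearing between Figure 8(a) and Figure 8(b).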

                                                                 

                                                                  

Figure 6: An example of parallel left-to-right state HMMs used to further split the formant model. It is equivalent to a 2-D HMM.

 

 

                                                

 

Figure 7: Block Diagram of Formant Estimation

                                                        

                                         

Figure 8: Histograms of Formant Distributions of /IY/ from an Australian Speaker

(thin red solid line: histogram of formant candidate distribution; blue dash dot line: Gaussian HMM curve)

(a)     Without Pre-emphasis,

(b)     With Pre-emphasis,

(c)     Discarding short segments with limited bandwidth and LP order

(d)     Taking the central part of segments, with limited bandwidth and LP order

 
