Modelling of voice quality correlates

 

Fitting an LF-model to inverse filtered signals

Fitting an optimal LF model to the estimated glottal pulse has several benefits. Firstly, it can remove a significant amount of the systematic and random error present in the glottal pulse estimate. The drawback is that a poor fit can irreversibly corrupt the signal. Hence, it is essential that LF model fitting is a sufficiently robust process and that its limitations are well known. However, the main reason for using LF model fitting is that it provides a systematic, fast and, most importantly, automatic way to parameterise the glottal pulse estimate.
Various methods are being investigated for the extraction of voice source parameters from speech. Once an estimate of the glottal flow signal has been obtained, an LF glottal flow model (LF-GFM) is fitted to it. The performance of the fitting criterion is to be evaluated against synthetic speech. The performance of the fitting process depends on the accuracy of the optimisation algorithm and on the cost function. The result deteriorates as the random and systematic errors in the estimated glottal pulse increase. If these errors are so substantial that the estimated signal loses its correspondence to the true glottal signal, then the entire process ceases to have a purpose. In view of this, the extent of validity of LF model fitting is investigated.

This problem is best solved in a pitch-synchronous manner, that is, an LF pulse is fitted optimally to each period of the speech separately. Since any optimisation algorithm requires a set of initial values, the LF parameterisation process can be roughly divided into two stages:

  1. Derivation of initial estimates
  2. Constrained non-linear optimisation


The need for the second stage comes from the fact that the estimate of the glottal pulse derivative can at times be very noisy compared with a synthetically generated one, so any parameters obtained by direct estimation through heuristic rules would be highly susceptible to error. The required robustness could not be obtained in this way.

Derivation of initial estimates: A good first estimate of the LF parameters is required, as the probability of finding the global optimum increases when the initial estimates are improved. The estimates are obtained independently of any values corresponding to other pitch periods. According to Helmer Strik, better results are obtained in this manner (see [14]). However, this will be further investigated. An obvious way to deal with ripple and relatively high-frequency noise is to pass the estimate of the glottal flow derivative through a low-pass filter. The choice of filter is restricted to those that have a ripple-free impulse response. A possible candidate for this purpose is a 7-point Blackman window. Apart from its desired effect, the low-pass filtering alters the shape of the glottal derivative signal and consequently the values of the LF parameters. The manner and extent of this change can be found by passing a synthetic pulse of similar characteristics through the same filter. The estimated parameter values can then be corrected accordingly.
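The smoothing and correction steps described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names, the edge handling (repeating the end samples) and the use of a single spike as the "synthetic pulse" are assumptions made only to keep the example short.

```python
import math

def blackman_window(n):
    """Blackman window of n points (n >= 2), scaled so the taps sum to 1."""
    raw = [0.42 - 0.5 * math.cos(2 * math.pi * k / (n - 1))
           + 0.08 * math.cos(4 * math.pi * k / (n - 1)) for k in range(n)]
    total = sum(raw)
    return [w / total for w in raw]

def smooth(signal, window):
    """Zero-phase FIR smoothing; edge samples are repeated (an assumption)."""
    half = len(window) // 2
    last = len(signal) - 1
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(window):
            j = min(max(i + k - half, 0), last)  # clamp the index at both ends
            acc += w * signal[j]
        out.append(acc)
    return out

def ee_correction(synthetic_pulse, window):
    """Factor by which smoothing shrinks the negative peak of a known pulse.

    Multiplying an Ee value measured on the smoothed estimate by this factor
    compensates for the attenuation introduced by the filter.
    """
    before = -min(synthetic_pulse)
    after = -min(smooth(synthetic_pulse, window))
    return before / after

window = blackman_window(7)
spike = [0.0] * 4 + [-1.0] + [0.0] * 4   # extreme case: one-sample peak
factor = ee_correction(spike, window)
```

A one-sample spike is the worst case for attenuation; for a realistic pulse the synthetic signal passed to `ee_correction` would be generated from the current parameter estimates, and the same idea could be applied to the timing parameters.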

The te parameter is perhaps the easiest to estimate. It corresponds to the instant at which the glottal derivative signal reaches its minimum within the period. Ee is the magnitude of the signal at this instant. tp can then be estimated as the first zero crossing to the left of te. tc can be found as the first sample to the right of te whose magnitude is smaller than a certain preset threshold value. Similarly, t0 can be estimated as an instant to the left of tp at which the signal is lower than a certain threshold value, constrained by the value of the open quotient. It is particularly hard to obtain a good estimate of Ta. This is in itself the subject of much discussion, and various solutions have been proposed. The simplest method of estimating this parameter is to set its value as a direct function of tc and te.

In this case simplicity comes at the price of accuracy. A good level of accuracy can be achieved by estimation in the frequency domain. An FFT is used to obtain the spectrum of the normalised return phase (each sample divided by Ee), that is, the section of the pulse between te and tc. The magnitude of the spectrum resembles the DC component of the return phase.
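The time-domain rules for te, Ee, tp, tc and t0 given above can be sketched as follows. This is a hedged illustration only: the threshold is taken as a fraction of Ee, the open quotient is approximated as (te - t0) divided by the period length, and `toy_pulse` is a hypothetical test signal, not the LF model. Ta is deliberately left out, since its estimation is not fully specified here.

```python
import math

def toy_pulse(n=100, t0=20, tp_len=34.5, open_len=50, alpha=0.03, tau=3.0):
    """Hypothetical test pulse: growing sinusoid, then exponential return."""
    d = [0.0] * n
    te = t0 + open_len
    for i in range(t0, te + 1):
        t = i - t0
        d[i] = math.exp(alpha * t) * math.sin(math.pi * t / tp_len)
    for i in range(te + 1, n):
        d[i] = d[te] * math.exp(-(i - te) / tau)
    return d

def initial_estimates(d, threshold_ratio=0.05, max_open_quotient=0.9):
    """Heuristic estimates (sample indices) for one pitch period of d."""
    n = len(d)
    te = min(range(n), key=lambda i: d[i])   # instant of the negative peak
    ee = -d[te]                              # magnitude at te
    thr = threshold_ratio * ee               # assumed threshold definition
    # tp: walk left from te until the signal is no longer negative
    tp = te
    while tp > 0 and d[tp] < 0:
        tp -= 1
    # tc: first sample right of te whose magnitude is below the threshold
    tc = min(te + 1, n - 1)
    while tc < n - 1 and abs(d[tc]) >= thr:
        tc += 1
    # t0: from the positive peak, walk left until below the threshold
    t0 = max(range(tp + 1), key=lambda i: d[i])
    while t0 > 0 and d[t0] >= thr:
        t0 -= 1
    # open quotient constraint (assumed form): the open phase cannot start
    # earlier than max_open_quotient periods before te
    t0 = max(t0, te - int(max_open_quotient * n))
    return {"t0": t0, "tp": tp, "te": te, "tc": tc, "ee": ee}

est = initial_estimates(toy_pulse())
```

On a noisy estimate these rules would normally be applied after low-pass filtering, because a single noisy sample can displace the minimum or the zero crossing.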

Optimisation algorithm: The initial estimates are often within an error range of ±7%. The te parameter is particularly accurate, with an error range of ±4.5%, while the error in Ta can reach up to 20%. Hence, the aim of the constrained non-linear optimisation is to further refine the accuracy of these parameters. The methods being investigated are the following: dynamic time warping [15], simplex [16], steepest descent [17] and minimum root mean square error dynamic filtering. The latter three methods have been widely investigated for this purpose, both in isolation and in combination [14]. These methods, particularly when used in combination, considerably improve the accuracy of the initial estimates and limit the error range to 6%. However, when the LF fitting was performed on an "ideal" (synthetically generated) glottal pulse, the fit error was merely reduced to 2-4%, which means that the procedure could be further improved. The dynamic time warping algorithm, being specifically designed to optimise time alignment, could provide these improvements.
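As an illustration of constrained refinement, the sketch below minimises a root mean square error cost by steepest descent with a backtracking step and projection onto box bounds. Everything in it is an assumption made for the example: `toy_model` is a simplified stand-in for the LF model (fixed growth rate, plain exponential return phase, Ee normalised to 1), the bounds are arbitrary, and the starting point is simply the true value perturbed by the error ranges quoted above.

```python
import math

ALPHA = 0.03  # fixed growth rate of the toy open phase (assumption)

def toy_model(params, n):
    """Simplified pulse with parameters (tp, te, ta); NOT the full LF model."""
    tp, te, ta = params

    def open_phase(t):
        return math.exp(ALPHA * t) * math.sin(math.pi * t / tp)

    scale = -1.0 / open_phase(te)   # forces the value -1 (Ee = 1) at te
    return [scale * open_phase(i) if i <= te else -math.exp(-(i - te) / ta)
            for i in range(n)]

def rms_error(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def project(p, bounds):
    """Clamp every parameter into its [low, high] interval."""
    return [min(max(v, lo), hi) for v, (lo, hi) in zip(p, bounds)]

def fit(target, start, bounds, iterations=200, h=1e-4):
    n = len(target)

    def cost(p):
        return rms_error(target, toy_model(p, n))

    x = project(start, bounds)
    fx = cost(x)
    step = 1.0
    for _ in range(iterations):
        grad = []
        for k in range(len(x)):          # central-difference gradient
            up, down = x[:], x[:]
            up[k] += h
            down[k] -= h
            grad.append((cost(up) - cost(down)) / (2 * h))
        norm = math.sqrt(sum(g * g for g in grad))
        if norm < 1e-12:
            break
        improved = False
        while step > 1e-8:               # shrink the step until the cost drops
            cand = project([v - step * g / norm for v, g in zip(x, grad)],
                           bounds)
            fc = cost(cand)
            if fc < fx:
                x, fx, step, improved = cand, fc, step * 2, True
                break
            step /= 2
        if not improved:
            break
    return x, fx

target = toy_model([54.5, 70.0, 3.0], 100)          # "ideal" synthetic pulse
start = [54.5 * 1.07, 70.0 * 1.045, 3.0 * 1.2]      # perturbed initial values
bounds = [(45.0, 60.0), (62.0, 80.0), (0.5, 10.0)]  # keep tp < te < 2 * tp
start_cost = rms_error(target, toy_model(start, 100))
fitted, final_cost = fit(target, start, bounds)
```

The bounds here do double duty: they express prior knowledge about plausible parameter values and they keep the toy model well defined. A simplex search could be substituted for the descent loop without changing the cost function or the projection.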

Dynamic Time Warping
Dynamic time warping (DTW) was generally considered the most important speech recognition technique before the advent of HMMs. This project intends to use DTW to time-align a synthetically generated glottal derivative pulse with the one obtained through inverse filtering. The initial estimates, obtained as described in the previous discussion, will be used to synthesise the LF signal. The motivation for employing this technique is that it allows the simultaneous estimation of three of the five parameters that define the LF model, arguably the most perceptually important ones. It is the time-based ratio parameters that uniquely identify the various types of glottal pulse and voice quality. The strength of dynamic time warping lies in the way it computes the distance between the input stream and the template. Instead of comparing the value of the input stream with that of the template stream at the same time t, the space of mappings from the time sequence of the input stream to that of the template stream is searched so that the total distance is minimised. The mapping need not be linear, but it is confined by some practical limits; for example, the mapping function must be monotonically non-decreasing so that the sequence of events in the input and the template is preserved. Figure 9 illustrates some of the properties of dynamic time warping.
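A minimal sketch of the search described above is given below, using the textbook dynamic programming recursion with an absolute-difference local cost and the three basic step directions. The local cost, the step pattern and the helper `aligned_indices` are assumptions chosen for illustration; the project may use different local constraints.

```python
def dtw(measured, template):
    """Return (total distance, warping path) between two sequences.

    The path is a list of (measured index, template index) pairs that is
    monotonically non-decreasing in both indices.
    """
    n, m = len(measured), len(template)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(measured[i - 1] - template[j - 1])
            acc[i][j] = local + min(acc[i - 1][j - 1],   # advance both
                                    acc[i - 1][j],       # advance measured
                                    acc[i][j - 1])       # advance template
    # trace the cheapest path back from the end of both sequences
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((acc[i - 1][j - 1], i - 1, j - 1),
                      (acc[i - 1][j], i - 1, j),
                      (acc[i][j - 1], i, j - 1))
    path.reverse()
    return acc[n][m], path

def aligned_indices(path, template_index):
    """Measured-signal indices that the path maps to one template index."""
    return [i for i, j in path if j == template_index]

# Template with its negative peak (te) at index 5; the "measured" pulse is a
# time-stretched copy whose peak sits at index 7.
template = [0, 1, 2, 1, 0, -3, -1, 0]
measured = [0, 0, 1, 2, 2, 1, 0, -3, -1, 0]
distance, path = dtw(measured, template)
te_in_measured = aligned_indices(path, 5)
```

Because the synthetic template is generated from known parameters, the positions of its timing instants are known exactly, and the warping path transfers them onto the inverse-filtered pulse in a single pass.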

Any comments, questions or suggestions are very welcome and should be directed to emir.turajlic@brunel.ac.uk.