Fitting an LF-model to inverse filtered signals
There are several benefits of fitting an optimal LF model to the
estimated glottal pulse. Firstly, it can
remove a significant amount of systematic and random errors present in
the glottal pulse estimate. The drawback is that with poor performance
it can irreversibly corrupt the signal.
Hence, it is essential that the LF model fitting is a sufficiently
robust process and that its limitations are well known.
However, the main reason for using LF model fitting is that it
provides a systematic, fast and, most importantly, automatic way
to parameterise the glottal pulse estimate.
Various methods are
being investigated for the extraction of voice source parameters from
speech. Having obtained an
estimate of glottal flow signal, an LF-GFM model is fitted to this
estimate.
The performance of the criterion is to be evaluated against
the synthetic speech. The
performance of the fitting process depends on the accuracy of the
optimisation algorithm and the cost function.
The result deteriorates with the increasing presence of random
and systematic errors in the estimated glottal pulse.
If the extent of these errors is so substantial that the
estimated signal loses correspondence to the actual glottal signal,
then the entire process ceases to have a purpose.
With respect to the above discussion, the extent of validity of
the LF model fitting is investigated.
Solving this problem is
best done in a pitch synchronous manner.
This implies that the LF pulse is to be fitted to each period of
the speech separately in an optimal manner.
Since any optimisation algorithm would require a set of initial
values, the LF parameterisation process can be roughly divided into two
stages:
- Derivation of initial estimates
- Constrained non-linear optimisation
The need for the second
stage comes from the fact that the estimate of the glottal pulse
derivative can at times be very noisy compared with a synthetically
generated one, and therefore any parameters obtained by direct estimation
through heuristic rules would be highly susceptible to error.
The required robustness could not be obtained in this way.
Derivation of initial estimates:
A good first estimate of
LF-parameters is required as the probability of finding the global
optimum is enhanced if the initial estimates are improved.
The estimates are obtained independently of any values
corresponding to other pitch periods.
According to Helmer Strik, better results are obtained in this
manner (see [14]). However,
this will be further investigated.
An obvious way to deal with ripple and relatively high-frequency
noise is to pass the estimate of the glottal flow derivative
through a low-pass filter. The
choice of filter is restricted to those that have a ripple-free
impulse response. A
possible candidate for this purpose is a 7-point Blackman window.
Apart from its desired effect, the low-pass filtering alters the
shape of the glottal derivative signal and consequently the values of
the LF parameters. The manner
and the extent of this change can be found by passing a synthetic pulse
of similar characteristics through the same filter.
The estimated parameter values can then be corrected
accordingly.
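The smoothing step described above can be sketched as follows. This is
only a minimal illustration, assuming the window is applied as a
symmetric, unit-gain FIR kernel and that the edge samples are held
constant; the function names blackman and smooth are hypothetical and do
not come from the project itself.

```python
import math

def blackman(n_points):
    """Classic Blackman window; its two end samples are numerically zero."""
    m = n_points - 1
    return [0.42 - 0.5 * math.cos(2 * math.pi * n / m)
            + 0.08 * math.cos(4 * math.pi * n / m)
            for n in range(n_points)]

def smooth(signal, n_points=7):
    """Smooth `signal` with a symmetric Blackman kernel of unit DC gain.

    The kernel is centred on each sample, so no delay is introduced.
    Edge handling (holding the first/last sample) is an assumption.
    """
    window = blackman(n_points)
    total = sum(window)
    kernel = [w / total for w in window]
    half = n_points // 2
    last = len(signal) - 1
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, coeff in enumerate(kernel):
            j = min(max(i + k - half, 0), last)  # clamp index at the edges
            acc += coeff * signal[j]
        out.append(acc)
    return out
```

Passing a synthetic pulse with known parameters through the same smooth
function and comparing the parameters before and after is one way to
measure the correction mentioned above.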
The te parameter is perhaps the easiest to estimate. It
corresponds to the instant when the glottal derivative signal reaches
its local minimum. Ee is the magnitude of the signal at this instant.
tp can then be estimated as the first zero crossing
to the left of te. tc can be found as the first
sample to the right of te whose magnitude is smaller than a certain
preset threshold value. Similarly,
t0 can be estimated as an instant to the left of tp
at which the signal is lower than a certain threshold value, subject to
a constraint imposed by the value of the open quotient.
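The heuristic rules above can be sketched for a single pitch period as
follows. This is an illustration under stated assumptions, not the
project's implementation: the threshold is taken relative to Ee, the
open-quotient constraint is approximated by limiting how far t0 may lie
before tc, and the names initial_estimates, threshold and max_oq are
hypothetical. All instants are returned as sample indices.

```python
def initial_estimates(d, threshold=0.05, max_oq=0.9):
    """Heuristic LF timing estimates for one period `d` of the
    glottal flow derivative (a list of samples).

    `threshold` (relative to Ee) and `max_oq` (upper bound on the open
    quotient) are illustrative assumptions.
    """
    # te: instant of the main negative peak; Ee: its magnitude.
    te = min(range(len(d)), key=lambda i: d[i])
    ee = abs(d[te])
    level = threshold * ee

    # tp: first non-negative sample met when walking left from te.
    tp = te
    while tp > 0 and d[tp] < 0.0:
        tp -= 1

    # tc: first sample right of te whose magnitude is below the threshold.
    tc = len(d) - 1
    for i in range(te + 1, len(d)):
        if abs(d[i]) < level:
            tc = i
            break

    # t0: start from the positive peak left of tp and walk left until the
    # signal drops below the threshold, bounded by the open quotient.
    earliest = max(0, tc - int(max_oq * len(d)))
    t0 = max(range(earliest, tp + 1), key=lambda i: d[i])
    while t0 > earliest and d[t0] >= level:
        t0 -= 1

    return {"t0": t0, "tp": tp, "te": te, "tc": tc, "Ee": ee}
```

Because every rule depends on a single sample or threshold crossing, the
estimates are sensitive to noise, which is why they serve only as
starting values for the optimisation stage.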
It is particularly hard to obtain a good estimate of Ta.
This is in itself a subject of much discussion, and various
solutions have been proposed. The
simplest method of estimating this parameter is to set its
value as a direct function of tc and te.
In this case
simplicity comes at the price of accuracy.
A good level of accuracy can be achieved by estimation in the
frequency domain. An FFT is used to
obtain the spectrum of the normalised (each sample divided by Ee) return
phase, that is, the section of the pulse between te and tc.
The magnitude of the spectrum resembles the DC component of the
return phase.
Optimisation algorithm:
The initial estimates are often within an error range of ±7%.
The te parameter is particularly accurate, with an error range of
±4.5%, while the Ta error range can reach up to 20%.
Hence, the aim of the constrained non-linear optimisation is to
further refine the accuracy of these parameters.
The methods being investigated are the following: dynamic time
warping [15], simplex [16], steepest descent [17] and minimum root mean
square error dynamic filtering. The
latter three methods have been widely investigated for this purpose,
both in isolation and in combination [14].
These methods, particularly when used in combination, considerably
improve the accuracy of the initial estimates and limit the error range
to 6%. However, when the LF fitting was performed on an "ideal"
(synthetically generated) glottal pulse, the fit error was merely
reduced to 2-4%, which means that the procedure could be further improved.
The dynamic time warping algorithm, being specifically designed to
optimise the time alignment process, could provide these improvements.
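To make the idea of refining an initial estimate against a cost
function concrete, the sketch below fits only Ta to the return phase by
minimising a root mean square error within fixed bounds. It is an
illustration, not one of the methods listed above: a one-dimensional
golden-section search is swapped in for simplicity, and it assumes the
cost has a single minimum between the bounds. The return phase follows
the standard LF definition, E(t) = -(Ee / (eps Ta)) (exp(-eps (t - te))
- exp(-eps (tc - te))), with eps Ta = 1 - exp(-eps (tc - te)). The
function names and the choice of bounds are hypothetical.

```python
import math

def return_phase(ta, n, ee=1.0):
    """LF return phase sampled at t - te = 0 .. n-1, so tc - te = n - 1."""
    tb = n - 1
    eps = 1.0 / ta
    # Fixed-point solution of eps * Ta = 1 - exp(-eps * Tb);
    # this converges when Ta is well below Tb.
    for _ in range(100):
        eps = (1.0 - math.exp(-eps * tb)) / ta
    scale = ee / (eps * ta)
    tail = math.exp(-eps * tb)
    return [-scale * (math.exp(-eps * t) - tail) for t in range(n)]

def rms_error(ta, observed, ee):
    """Cost function: RMS difference between model and observation."""
    model = return_phase(ta, len(observed), ee)
    squares = sum((m - o) ** 2 for m, o in zip(model, observed))
    return math.sqrt(squares / len(observed))

def refine_ta(observed, ee, lower, upper, iterations=60):
    """Bounded golden-section search for the Ta minimising rms_error."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lower, upper
    for _ in range(iterations):
        c = b - g * (b - a)
        d = a + g * (b - a)
        if rms_error(c, observed, ee) < rms_error(d, observed, ee):
            b = d   # minimum lies in [a, d]
        else:
            a = c   # minimum lies in [c, b]
    return 0.5 * (a + b)
```

In practice the bounds would be set around the initial estimate, for
example according to its expected error range, and all LF parameters
would be refined jointly rather than Ta alone.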
Dynamic Time Warping
Dynamic time warping was generally considered the most important speech
recognition technique until the advent of HMMs.
This project intends to make use of DTW to time-align a
synthetically generated glottal derivative pulse with the one obtained
through the inverse filtering. The
initial estimate values, obtained according to the previous discussion,
will be used to synthesise the LF signal.
The motivation for employing this technique is that it allows a
simultaneous estimate of three, and arguably the most perceptually
important, of the five parameters that define the LF model.
It is the time-based ratio parameters that uniquely identify
various types of glottal pulse and voice quality.
The ingenuity of dynamic time warping lies in the computation
of the distance between input streams and templates.
Instead of comparing the input stream value to that of the template
stream at time t, the space of mappings from the time sequence of the
input stream to that of the template stream is searched so that the
total distance is minimised. The mapping is
not linear but is confined by some practical limits; for example, the
mapping function must be monotonically non-decreasing so that the
sequence of events in the input and the template is preserved. Figure 9
illustrates some of the properties of dynamic time warping.
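The search over mappings described above is usually carried out by
dynamic programming. The following is a minimal sketch, assuming an
absolute-difference local distance and the three basic step directions;
further practical limits, such as slope or band constraints, are left
out. The function name dtw is hypothetical.

```python
def dtw(template, observed):
    """Align `observed` to `template`.

    Returns (total distance, warping path), where the path is a list of
    (template index, observed index) pairs. Only the steps (1, 0),
    (0, 1) and (1, 1) are allowed, so the mapping is monotonically
    non-decreasing and the order of events is preserved.
    """
    n, m = len(template), len(observed)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(template[i - 1] - observed[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # template advances
                                     cost[i][j - 1],      # observed advances
                                     cost[i - 1][j - 1])  # both advance

    # Backtrack from the end of both sequences to recover the mapping.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path
```

With the synthesised LF pulse as the template, the observed indices that
the path pairs with the template's known timing instants give candidate
values for the corresponding instants in the inverse filtered pulse.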