Interactive speech recognition systems are only useful if they can
run with live input.  The problem with live input, as opposed to
pre-recorded data, is that the exact start and end of the utterance
are unknown.  One technique to deal with this problem is to record
a fixed size utterance (e.g., 5 seconds) and assume that the user will
speak the entire utterance within the time period.  A recognizer which
has silence models for the start and end of the utterance can thus
parse such an utterance. However, such a scheme is obviously prone
to errors and is computationally wasteful because the entire input
buffer must be searched.

The obvious solution is an endpointer which identifies the start and
end of utterance.  The problem is that endpointing an utterance, like
speech recognition itself, is non-trivial.

This is an endpointing algorithm designed for real-time input of
a speech signal.  "Real-time" means that the signal is processed
in parallel with its recording.  This allows a speech recognition
system to run in parallel with the input of the utterance.

This algorithm calculates and uses "cheap" parameters, RMS energy and
zero crossing counts.  Thus, this algorithm can run in real-time on
any microprocessor without the need for a DSP.
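
As an illustration of how cheap these measurements are, here is a sketch
of a per-frame RMS energy and zero-cross computation (the names below
are illustrative, not the actual interface):

    #include <cmath>
    #include <cstddef>

    // Hypothetical per-frame measurements; names are illustrative.
    struct FrameMeasures {
        double rms;        // RMS energy of the frame
        int    zerocross;  // simple zero-cross count of the frame
    };

    // One pass over the samples and a single sqrt(): cheap enough to
    // run in real time on a general purpose processor.
    FrameMeasures measure_frame(const short *frame, std::size_t n)
    {
        double sumsq = 0.0;
        int crossings = 0;
        for (std::size_t i = 0; i < n; i++) {
            sumsq += (double)frame[i] * (double)frame[i];
            if (i > 0 && ((frame[i-1] < 0) != (frame[i] < 0)))
                crossings++;
        }
        FrameMeasures m;
        m.rms = std::sqrt(sumsq / (double)n);
        m.zerocross = crossings;
        return m;
    }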

Because the signal is end-pointed in real-time, errors can and do
occur in identifying the start and end of the actual utterance.
Thus, the labels, or tags, that this endpointer gives for each
frame of data are somewhat "fuzzy".  That is, the endpointer will
tentatively label a frame but may indicate at a later frame that
the identification of a previous frame was in error.  This requires
special handling by the speech recognition system: it must be capable
of restarting recognition after a false start and of continuing the
search past possible end of utterance frames.

The endpointer is driven by passing it one frame of data at a time.  The
endpointer will check the frame to determine whether it is part of the
utterance and return a label, or tag, for the frame.  The possible
labels are the following:

	EP_NONE
	EP_NOSTARTSILENCE
	EP_SILENCE
	EP_SIGNAL
	EP_RESET
	EP_MAYBEEND
	EP_NOTEND
	EP_ENDOFUTT

EP_NONE - This is a NULL label which the endpointer never returns.  It
is convenient to have for labeling frames for which the endpointer is
turned off.

EP_NOSTARTSILENCE - The first frame is so loud or noisy that it does not
"look" like background silence.  This depends on absolute thresholds and
can generate a false positive for really noisy signals or a false negative
for really quiet signals.  See theory of operation below.

EP_SILENCE - This label is returned for silence frames before the start
of the utterance.

EP_SIGNAL - This is returned for each frame that appears to be contained
in the utterance signal.  The first EP_SIGNAL frame marks the start of the
utterance.

EP_RESET - This indicates a false start condition.  The previous EP_SIGNAL
frames were, in fact, not part of the utterance.  The recognition system
should reset itself and start over.

EP_MAYBEEND - This label indicates the possible end of utterance.  The
frame which has this label is actually one frame after the possible last
frame of the utterance.  As this is a tentative label, the recognition
system should either do end of utterance processing or save its state at
this point for end of utterance processing.  In either case, the recognition
system must continue searching, including this frame, until the end of
utterance has been confirmed.

EP_NOTEND - The previous EP_MAYBEEND label was wrong.  The utterance is
continuing.  The recognition system can now forget its possible end of utterance
state.

EP_ENDOFUTT - This label confirms the actual end of utterance.  The real
end of utterance was the last EP_SIGNAL frame before the last EP_MAYBEEND
labeled frame.
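
A minimal sketch of a driving loop that acts on these labels follows.
The Endpointer and Recognizer interfaces here are hypothetical stand-ins
for the real classes; only the handling of each label follows the
descriptions above.

    #include <cstddef>

    // Labels as described above (the actual values are defined by the
    // implementation).
    enum EpTag { EP_NONE, EP_NOSTARTSILENCE, EP_SILENCE, EP_SIGNAL,
                 EP_RESET, EP_MAYBEEND, EP_NOTEND, EP_ENDOFUTT };

    // Hypothetical interfaces standing in for the real endpointer and
    // recognizer classes.
    struct Endpointer {
        virtual EpTag frame(const short *buf, std::size_t n) = 0;
    };
    struct Recognizer {
        virtual void search(const short *buf, std::size_t n) = 0;
        virtual void reset() = 0;        // forget everything after a false start
        virtual void save_state() = 0;   // checkpoint at a possible end
        virtual void drop_state() = 0;   // the possible end was wrong
        virtual void finish() = 0;       // wrap up from the last checkpoint
    };

    // Feed the endpointer one frame at a time and act on the returned tag.
    void recognize_live(Endpointer &ep, Recognizer &rec,
                        const short *signal, std::size_t nsamples,
                        std::size_t step)
    {
        for (std::size_t pos = 0; pos + step <= nsamples; pos += step) {
            const short *buf = signal + pos;
            switch (ep.frame(buf, step)) {
            case EP_SILENCE:               // still in the leading silence
                break;
            case EP_NOSTARTSILENCE:        // user may have spoken too soon
            case EP_SIGNAL:                // frame belongs to the utterance
                rec.search(buf, step);
                break;
            case EP_RESET:                 // false start: discard and restart
                rec.reset();
                break;
            case EP_MAYBEEND:              // tentative end: checkpoint ...
                rec.save_state();
                rec.search(buf, step);     // ... but keep searching
                break;
            case EP_NOTEND:                // the tentative end was wrong
                rec.drop_state();
                rec.search(buf, step);
                break;
            case EP_ENDOFUTT:              // confirmed end of utterance
                rec.finish();
                return;
            default:                       // EP_NONE: endpointer turned off
                break;
            }
        }
    }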


Theory of operation:
For each frame of data, the endpointer calculates the RMS energy and the
zero-cross count.  The first few frames are assumed to be background
silence and are used to initialize various thresholds. If there is no
starting silence (the user speaks too soon), then the endpointer will
mislabel the first stretch of speech (which may be one or more words) until a
silence is reached.  Similarly, if there is no ending silence, then the
endpointer will not mark the end of utterance.

A running average of the background silence is kept by averaging the
last few silence frames.  This background silence is used to set the
energy thresholds and the Schmitt trigger for the zero-cross counter.
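
A rough sketch of these two devices, with illustrative names and
constants (the real implementation derives its values from the
constructor arguments and the measured silence):

    #include <cstddef>

    // Illustrative constants only.
    static const double SILENCE_DECAY = 0.9;  // weight of the running average
    static const double HYSTERESIS    = 2.0;  // trigger band, in silence RMS units

    // Running average of the background silence, updated only on frames
    // that have been labeled silence.  'rms' would be seeded from the
    // first few (assumed silent) frames.
    struct SilenceModel {
        double rms;
        void update(double frame_rms) {
            rms = SILENCE_DECAY * rms + (1.0 - SILENCE_DECAY) * frame_rms;
        }
    };

    // Zero-cross counter with hysteresis (a software Schmitt trigger):
    // a crossing is counted only when the signal swings from below -band
    // to above +band or vice versa, so low-level noise riding on the
    // silence floor does not inflate the count.
    int schmitt_zerocross(const short *frame, std::size_t n, double band)
    {
        int count = 0;
        int state = 0;                    // +1 above +band, -1 below -band
        for (std::size_t i = 0; i < n; i++) {
            if (frame[i] > band) {
                if (state < 0) count++;
                state = 1;
            } else if (frame[i] < -band) {
                if (state > 0) count++;
                state = -1;
            }
        }
        return count;
    }

The trigger band would be tied to the silence estimate, for example
band = HYSTERESIS * silence.rms, so that the count tracks the dominant
frequency of the signal rather than the noise floor.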

The endpointer contains over a dozen thresholds and settings which are used
to determine frication, voicing, and silence.  These thresholds have been
determined empirically.

The sampling rate, window size in samples, and the step size in samples
are passed to the class constructor.  These three arguments are used to
calculate the internal thresholds (actual zero-cross count values for
frequencies and number of frames for durations).  Any or all of the
internal thresholds can be changed by specifying them in the class
constructor.
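
The exact constructor signature is defined by the implementation; the
sketch below only illustrates the idea of turning durations and
frequencies into frame counts and zero-cross counts (all names and
default values here are hypothetical):

    // Hypothetical constructor sketch; argument names and defaults are
    // illustrative, not the actual interface.
    class endpointer {
    public:
        endpointer(int rate,    // sampling rate in Hz
                   int window,  // window size in samples
                   int step,    // step size in samples
                   double min_voice_hz = 150.0,  // overridable threshold
                   double min_utt_sec  = 0.25)   // overridable threshold
        {
            // A duration becomes a number of frames; a frequency becomes
            // a zero-cross count per window.
            min_utt_frames = (int)(min_utt_sec * rate / step + 0.5);
            min_voice_zc   = (int)(2.0 * min_voice_hz * window / rate + 0.5);
        }
    private:
        int min_utt_frames;  // duration threshold, in frames
        int min_voice_zc;    // frequency threshold, in crossings per window
    };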

CAVEATS:
The endpointer will fail if there is no starting silence or ending silence.
If there is no starting silence, then the first syllable up to the first
stop consonant will be lost.  If there is no ending silence, then the last
syllable will be lost or no end of utterance will be determined.

The endpointer makes no distinction between noise and speech.  Impulse
noises will fool it.  The endpointer tends to be conservative in that it
will err by including noises with the signal rather than cutting out part
of the actual speech signal.  So, a good recognition system must model
noise.

Large amplitude background white noise may cause the endpointer to miss
fricatives, weak or strong.  If the background noise is known a priori, then
the endpointer thresholds can be adjusted to cope with the noise.