diff main/audio/endpoint.doc @ 0:6b33357c7561 octave-forge
Initial revision

author:   pkienzle
date:     Wed, 10 Oct 2001 19:54:49 +0000
parents:
children:
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/main/audio/endpoint.doc	Wed Oct 10 19:54:49 2001 +0000
@@ -0,0 +1,124 @@

Interactive speech recognition systems are only useful if they can
run with live input. The problem with live input, as opposed to
pre-recorded data, is that the exact start and end of the utterance
are unknown. One technique for dealing with this problem is to record
a fixed-size utterance (e.g., 5 seconds) and assume that the user will
speak the entire utterance within that period. A recognizer which
has silence models for the start and end of the utterance can then
parse such an utterance. However, such a scheme is obviously prone
to errors and is computationally wasteful, because the entire input
buffer must be searched.

The obvious solution is an endpointer which identifies the start and
end of the utterance. The problem is that endpointing an utterance,
like speech recognition itself, is non-trivial.

This is an endpointing algorithm designed for real-time input of
a speech signal. "Real-time" means that the signal is processed
in parallel with its recording. This allows a speech recognition
system to run in parallel with the input of the utterance.

This algorithm calculates and uses "cheap" parameters, RMS energy and
zero-crossing counts. Thus, the algorithm can run in real-time on
any microprocessor without the need for a DSP.

Because the signal is endpointed in real-time, errors can and do
occur in identifying the start and end of the actual utterance.
Thus, the labels, or tags, that this endpointer gives for each
frame of data are somewhat "fuzzy". That is, the endpointer will
tentatively label a frame but may indicate at a later frame that
the identification of a previous frame was in error. This requires
special handling by the speech recognition system: it must be
capable of restarting recognition after false starts and of
continuing the search after possible end-of-utterance frames.
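The two "cheap" frame parameters mentioned above can be sketched as follows. This is a minimal illustration with hypothetical helper names, not the endpointer's actual code; in particular, the real endpointer runs the zero-cross counter through a Schmitt trigger set from the background-silence level rather than counting raw sign changes.

```cpp
#include <cmath>
#include <cstddef>

// RMS energy of one frame of samples.
double frame_rms(const short *frame, std::size_t n)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        sum += (double)frame[i] * frame[i];
    return std::sqrt(sum / n);
}

// Zero-crossing count of one frame: the number of sign changes
// between consecutive samples.
int frame_zero_cross(const short *frame, std::size_t n)
{
    int count = 0;
    for (std::size_t i = 1; i < n; ++i)
        if ((frame[i - 1] < 0) != (frame[i] < 0))
            ++count;
    return count;
}
```

High zero-cross counts with low energy suggest frication; high energy with low counts suggests voicing; low values of both suggest silence.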
The caller passes one frame of data at a time to the endpointer. The
endpointer checks each frame to determine whether it is part of the
utterance and returns a label, or tag, for the frame. The possible
labels are the following:

    EP_NONE
    EP_NOSTARTSILENCE
    EP_SILENCE
    EP_SIGNAL
    EP_RESET
    EP_MAYBEEND
    EP_NOTEND
    EP_ENDOFUTT

EP_NONE - A NULL label which the endpointer itself never returns.
It is convenient to have for labeling frames for which the endpointer
is turned off.

EP_NOSTARTSILENCE - The first frame is so loud or noisy that it does
not "look" like background silence. This depends on absolute
thresholds and can generate a false positive for really noisy signals
or a false negative for really quiet signals. See the theory of
operation below.

EP_SILENCE - Returned for silence frames before the start of the
utterance.

EP_SIGNAL - Returned for each frame that appears to be contained in
the utterance signal. The first EP_SIGNAL frame marks the start of
the utterance.

EP_RESET - Indicates a false-start condition. The previous EP_SIGNAL
frames were, in fact, not part of the utterance. The recognition
system should reset itself and start over.

EP_MAYBEEND - Indicates a possible end of utterance. The frame with
this label is actually one frame after the possible last frame of the
utterance. As this is a tentative label, the recognition system
should either do end-of-utterance processing or save its state at
this point for end-of-utterance processing. In either case, the
recognition system must continue searching, including this frame,
until the end of utterance has been confirmed.

EP_NOTEND - The previous EP_MAYBEEND label was wrong; the utterance
is continuing. The recognition system can now forget its possible
end-of-utterance state.

EP_ENDOFUTT - Confirms the actual end of utterance. The real end of
utterance was the last EP_SIGNAL frame before the last EP_MAYBEEND
labeled frame.


Theory of operation:
For each frame of data, the endpointer calculates the RMS energy and
the zero-cross count. The first few frames are assumed to be
background silence and are used to initialize various thresholds. If
there is no starting silence (the user speaks too soon), then the
endpointer will mislabel the first syllable (which may be one or more
words) until a silence is reached. Similarly, if there is no ending
silence, then the endpointer will not mark the end of utterance.

A running average of the background silence is kept, computed over
the last few silence frames. This background silence is used to set
the energy thresholds and the Schmitt trigger for the zero-cross
counter.

The endpointer contains over a dozen thresholds and settings which
are used to determine frication, voicing, and silence. These
thresholds have been determined empirically.

The sampling rate, window size in samples, and step size in samples
are passed to the class constructor. These three arguments are used
to calculate the internal thresholds (actual zero-cross count values
for frequencies and number of frames for durations). Any or all of
the internal thresholds can be changed by specifying them in the
class constructor.

CAVEATS:
The endpointer will fail if there is no starting silence or ending
silence. If there is no starting silence, then the first syllable up
to the first stop consonant will be lost. If there is no ending
silence, then the last syllable will be lost or no end of utterance
will be determined.

The endpointer makes no distinction between noise and speech; impulse
noises will fool it. The endpointer tends to be conservative in that
it will err by including noises with the signal rather than cutting
out part of the actual speech signal. So, a good recognition system
must model noise.
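The conversion of the three constructor arguments into internal units can be sketched with two illustrative formulas (these are the obvious conversions, not necessarily the exact constants used in the code): a pure tone at f Hz crosses zero 2f times per second, and a duration maps to a frame count via the step size.

```cpp
// Zero-cross count a pure tone at freq_hz would produce within one
// analysis window of window_samples samples at rate_hz.
int zero_cross_threshold(double freq_hz, int window_samples, double rate_hz)
{
    return (int)(2.0 * freq_hz * window_samples / rate_hz);
}

// Number of analysis frames spanning a duration in seconds when the
// window advances by step_samples samples per frame.
int duration_frames(double seconds, int step_samples, double rate_hz)
{
    return (int)(seconds * rate_hz / step_samples);
}
```

For example, at 8 kHz with a 160-sample window, a 2 kHz frication threshold corresponds to a zero-cross count of 80 per window.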
Large-amplitude background white noise may cause the endpointer to
miss fricatives, weak or strong. If the background noise is known a
priori, then the endpointer thresholds can be adjusted to cope with
the noise.
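The label protocol described above — restart on EP_RESET, checkpoint on EP_MAYBEEND, discard the checkpoint on EP_NOTEND, stop on EP_ENDOFUTT — can be exercised with a small driver sketch. The enum values mirror the tags listed earlier, but the function and its interface are hypothetical, standing in for whatever state handling a real recognizer would do.

```cpp
#include <vector>

// Tag values mirroring the labels described above.
enum EpTag { EP_NONE, EP_NOSTARTSILENCE, EP_SILENCE, EP_SIGNAL,
             EP_RESET, EP_MAYBEEND, EP_NOTEND, EP_ENDOFUTT };

// Replay a stream of tags the way a recognizer must, returning the
// number of frames accepted as utterance signal.
int drive(const std::vector<EpTag> &tags)
{
    int signal_frames = 0;
    for (EpTag t : tags) {
        switch (t) {
        case EP_SIGNAL:
            ++signal_frames;        // frame is part of the utterance
            break;
        case EP_RESET:
            signal_frames = 0;      // false start: reset and start over
            break;
        case EP_MAYBEEND:
            // tentative end: save recognizer state here, keep searching
            break;
        case EP_NOTEND:
            // the tentative end was wrong: forget the saved state
            break;
        case EP_ENDOFUTT:
            return signal_frames;   // end of utterance confirmed
        default:
            break;                  // silence and unused tags
        }
    }
    return signal_frames;
}
```

The essential point is that EP_SIGNAL frames before an EP_RESET, and any state saved at an EP_MAYBEEND later followed by EP_NOTEND, must both be discardable.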