
47 Speech Recognition by Machine



- 47.1 Introduction
- 47.2 Characterization of Speech Recognition Systems
- 47.3 Sources of Variability of Speech
- 47.5 Speech Recognition by Pattern Matching
- 47.6 Connected Word Recognition
- 47.7 Continuous Speech Recognition
- 47.8 Speech Recognition System Issues
- 47.9 Practical Issues in Speech Recognition
- 47.10 ASR Applications
- Initial attempts at providing human-machine communications led to the development of the keyboard, the mouse, the trackball, the touch screen, and the joystick.
- This capability is required for tasks in which the human is controlling the actions of the machine using only limited speaking capability, e.g., while speaking simple commands or sequences of words from a limited vocabulary (e.g., digit sequences for a telephone number).
- In the more general case, usually referred to as speech understanding, the machine need only recognize a limited subset of the user's input speech, namely the speech that specifies enough about the requested action that the machine can either respond appropriately or initiate some action in response to what was understood.
- 47.2 Characterization of Speech Recognition Systems.
- The size of the recognition vocabulary, including:
- The knowledge of the user's speech patterns, including:
- systems which integrate acoustic and linguistic knowledge, where the linguistic knowledge is generally represented via syntactic and semantic constraints on the output of the recognition system.
- 47.3 Sources of Variability of Speech.
- Speech recognition by machine is inherently difficult because of the variability in the signal.
- 47.4.1 The Acoustic-Phonetic Approach [1].
- This is the basis of the acoustic-phonetic approach which postulates that there exist finite, distinctive phonetic units (phonemes) in spoken language, and that these units are broadly characterized by a set of acoustic properties that are manifest in the speech signal over time.
- The first step in the acoustic-phonetic approach is a segmentation and labeling phase in which the speech signal is segmented into stable acoustic regions, followed by attaching one or more phonetic labels to each segmented region, resulting in a phoneme lattice characterization of the speech (see Fig.
- In the validation process, linguistic constraints of the task (i.e., the vocabulary, the syntax, and other semantic rules) are invoked in order to access the lexicon for word decoding based on the phoneme lattice.
- 47.4.2 “Pattern-Matching” Approach [2].
- In the pattern-comparison stage of the approach, a direct comparison is made between the unknown speech (the speech to be recognized) and each possible pattern learned in the training stage, in order to determine the identity of the unknown according to the goodness of match of the patterns.
- 47.4.3 Artificial Intelligence Approach [3, 4].
- A decision scheme determines the word or phonetic class of the unknown speech based on the matching scores with respect to the stored reference patterns.
- The first type, called a nonparametric reference pattern [5] (or often a template), is a pattern created from one or more spoken tokens (exemplars) of the sound associated with the pattern.
- The second type, called a statistical reference model, is created as a statistical characterization (via a fixed type of model) of the behavior of a collection of tokens of the sound associated with the pattern.
- The hidden Markov model [6] is an example of the statistical model.
- We now discuss the elements of the pattern recognition model and show how it has been used in isolated word, connected word, and continuous speech recognition systems.
- 47.5.1 Speech Analysis.
- The purpose of the speech analysis block is to transform the speech waveform into a parsimonious representation which characterizes the time varying properties of the speech.
- Studies in psychoacoustics also suggest that our auditory perception of sound power and loudness involves compression, leading to the use of the logarithmic power spectrum and the cepstrum [8], which is the Fourier transform of the log-spectrum.
- The low-order cepstral coefficients (up to 10 to 20) provide a parsimonious representation of the short-time speech segment which is usually sufficient for phonetic identification.
- The cepstral parameters are often augmented by the so-called delta cepstrum [9], which characterizes dynamic aspects of the time-varying speech process.
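The analysis steps above can be sketched in a few lines. The frame length, sampling rate, number of coefficients, and the simple difference used for the delta cepstrum are illustrative assumptions, not values prescribed by the text:

```python
import numpy as np

def cepstrum(frame, n_coeffs=13):
    """Real cepstrum of one short-time frame: inverse DFT of the log power spectrum."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    log_power = np.log(power + 1e-12)          # small floor avoids log(0)
    return np.fft.irfft(log_power, n=len(frame))[:n_coeffs]

def delta_cepstrum(cepstra):
    """First-order time difference of a (frames x coeffs) cepstral sequence."""
    return np.gradient(cepstra, axis=0)

# A 15-ms frame at 8 kHz (120 samples) of a synthetic voiced-like signal
t = np.arange(120) / 8000.0
frame = np.sin(2 * np.pi * 100 * t) + 0.5 * np.sin(2 * np.pi * 200 * t)
c = cepstrum(frame)
print(c.shape)   # → (13,)
```

Production front ends typically add mel-frequency warping and windowing before this step; the sketch keeps only the log-spectrum/cepstrum relationship described in the text.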
- 47.5.2 Pattern Training.
- Clustering training, in which a large number of versions of the sound pattern (extracted from a wide range of talkers) are used to create one or more templates or a reliable statistical model of the sound pattern.
- The basic premise of the HMM is that a Markov chain can be used to describe the probabilistic nature of the temporal sequence of sounds in speech, i.e., the phonemes in the speech, via a probabilistic state sequence.
- Instead, the states manifest themselves through the second component of the HMM which is a set of output distributions governing the production of the speech features in each state (the spectral characterization of the sounds).
- In other words, the output distributions (which are observed) represent the local statistical knowledge of the speech pattern within the state, and the Markov chain characterizes, through a set of state transition probabilities, how these sound processes evolve from one sound to another.
- FIGURE 47.3: Characterization of a word (or phrase, or subword) using an N(= 5)-state, left-to-right HMM, with continuous observation densities in each state of the model.
- Within each state (assumed to represent a stable acoustic region), the spectral features of the speech signal are characterized by the output distribution of that state.
- The states represent the changing temporal nature of the speech signal.
- The training problem for HMMs consists of estimating the parameters of the statistical distributions within each state (e.g., means, variances, mixture gains, etc.), along with the state transition probabilities.
- Well-established techniques (e.g., the Baum-Welch method [10] or the segmental K-means method [11]) have been defined for doing this pattern training efficiently.
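To make the state/output-distribution decomposition concrete, the sketch below evaluates the likelihood of a feature sequence under a small left-to-right HMM via the forward algorithm, the evaluation step at the core of Baum-Welch training. The three-state topology, single-Gaussian (rather than mixture) output densities, and all parameter values are hypothetical:

```python
import numpy as np

def log_gauss(x, means, variances):
    # Log of a univariate Gaussian density, evaluated per state (vectorized)
    return -0.5 * (np.log(2 * np.pi * variances) + (x - means) ** 2 / variances)

def forward_loglik(obs, log_trans, log_init, means, variances):
    """Total log-likelihood of an observation sequence under a Gaussian-output HMM."""
    log_alpha = log_init + log_gauss(obs[0], means, variances)
    for x in obs[1:]:
        # Sum over predecessor states in the log domain, then emit x in each state
        log_alpha = np.logaddexp.reduce(
            log_alpha[:, None] + log_trans, axis=0) + log_gauss(x, means, variances)
    return np.logaddexp.reduce(log_alpha)

# Hypothetical 3-state left-to-right model: each state self-loops or advances
with np.errstate(divide="ignore"):                 # log(0) = -inf is intended
    log_trans = np.log(np.array([[0.6, 0.4, 0.0],
                                 [0.0, 0.6, 0.4],
                                 [0.0, 0.0, 1.0]]))
    log_init = np.log(np.array([1.0, 0.0, 0.0]))
means = np.array([0.0, 3.0, 6.0])
variances = np.ones(3)
obs = np.array([0.1, 0.2, 2.9, 3.1, 5.8, 6.2])
ll = forward_loglik(obs, log_trans, log_init, means, variances)
print(ll)
```

The zeros in the transition matrix encode the left-to-right constraint of Fig. 47.3; working in the log domain avoids the numerical underflow that plagues naive forward recursions on long utterances.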
- 47.5.3 Pattern Matching.
- When the reference pattern consists of a probabilistic model, such as an HMM, pattern matching is equivalent to using the statistical knowledge contained in the model to assess the likelihood that the speech which produced the model would be realized as the unknown pattern.
- FIGURE 47.4: Results of time aligning two versions of the word “seven”, showing linear alignment of the two utterances (top panel).
- HMMs provide an implicit time normalization as part of the process for measuring likelihood.
- The top panel of the figure shows the log energy contours of the two patterns (for the spoken word “seven”).
- It can be seen that the inherent duration of the two patterns, 30 and 35 frames (where each frame is a 15-ms segment of speech), is different and that linear alignment is grossly inadequate for internally aligning events within the two patterns (compare the locations of the vowel peaks in the two patterns).
- The nonlinear, time-warped alignment is shown at the bottom of Fig. 47.4, in contrast to the linear alignment of the patterns at the top.
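For template-based recognizers, the nonlinear alignment of Fig. 47.4 is classically computed by dynamic time warping. A minimal sketch follows; the Euclidean local distance and the symmetric three-move step pattern are common choices, assumed here rather than taken from the text:

```python
import numpy as np

def dtw_distance(ref, unk):
    """Optimal nonlinear alignment cost between two feature sequences
    (frames x coeffs) of possibly different lengths."""
    n, m = len(ref), len(unk)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(ref[i - 1] - unk[j - 1])   # frame distance
            D[i, j] = local + min(D[i - 1, j],       # stretch reference
                                  D[i, j - 1],       # stretch unknown
                                  D[i - 1, j - 1])   # advance both
    return D[n, m]

# Two hypothetical tokens of the same word with different durations,
# mimicking the 30- and 35-frame "seven" patterns of Fig. 47.4
rng = np.random.default_rng(0)
token_a = rng.standard_normal((30, 13))
token_b = np.vstack([token_a[:5], token_a])        # same word, stretched onset
print(dtw_distance(token_a, token_a))              # → 0.0
```

The recursion compresses or expands either pattern locally, which is exactly why the vowel peaks that linear alignment misplaces can be brought into correspondence.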
- 47.5.4 Decision Strategy.
- The decision strategy takes all the matching scores (from the unknown pattern to each of the stored reference patterns) into account, finds the “closest” match, and decides if the quality of the match is good enough to make a recognition decision.
- If not, the user is asked to provide another token of the speech (e.g., the word or phrase) for another recognition attempt.
- This is necessary because often the user may speak words that are incorrect in some sense (e.g., hesitation, incorrectly spoken word, etc.) or simply outside of the vocabulary of the recognition system.
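A minimal sketch of such a decision rule, in which the best match is accepted only if its score clears a rejection threshold; the distance values and threshold are hypothetical:

```python
def decide(scores, threshold):
    """Pick the best-matching reference pattern; reject if the match is too
    poor, so the user can be prompted to repeat the utterance."""
    best_word = min(scores, key=scores.get)   # smaller distance = better match
    if scores[best_word] > threshold:
        return None                           # no confident match: reject
    return best_word

scores = {"yes": 4.2, "no": 9.1, "stop": 11.7}
print(decide(scores, threshold=6.0))                    # → yes
print(decide({"yes": 8.0, "no": 9.1}, threshold=6.0))   # → None
```

Real systems often base rejection on the margin between the best and second-best scores rather than an absolute threshold, but the control flow, accept, or re-prompt the user, is the same.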
- 47.5.5 Results of Isolated Word Recognition.
- Using the pattern recognition framework of Fig. 47.2, and using either the nonparametric template approach or the statistical HMM method to derive reference patterns, a wide variety of tests of the recognizer have been performed on telephone speech with isolated word inputs in both speaker-dependent (SD) and speaker-independent (SI) modes.
- Vocabulary sizes have ranged from as few as 10 words (i.e., the digits zero–nine) to as many as 1109 words.
- TABLE 47.1 Performance of Isolated Word Recognizers.
- 47.6 Connected Word Recognition.
- In this section we consider extensions of the basic processing methods described in previous sections to the recognition of connected word strings, in which the vocabulary is represented by a set of reference patterns {R_1, R_2, ..., R_V}, each representing one of the words in the vocabulary.
- FIGURE 47.5: Illustration of the problem of matching a connected word string, spoken fluently, using whole word patterns concatenated together to provide the best match.
- 47.6.1 Performance of Connected Word Recognizers.
- In the next section we will see how we exploit linguistic constraints of the task to improve recognition accuracy for word strings beyond the level one would expect on the basis of word error rates of the system.
- TABLE 47.2 Performance of Connected Word Recognizers.
- 47.7 Continuous Speech Recognition.
- First of all, as the size of the vocabulary of the recognizer grows, it becomes impractical to train patterns for each individual word in the vocabulary.
- Hence, continuous speech recognizers generally use sub-word speech units as the basic patterns to be trained, and use a lexicon to define the structure of word patterns in terms of the sub-word units.
- In order to achieve good recognition performance, account must be taken of the word grammar so as to constrain the set of possible recognized sentences.
- Finally, the spoken sentence often must make sense according to a semantic model of the task which the recognizer is asked to perform.
- Again, by explicitly including these semantic constraints on the spoken sentence, as part of the recognition process, performance of the system improves.
- Choice of a representation of words in the recognition vocabulary, in terms of the sub-word units;
- 47.7.1 Sub-Word Speech Units and Acoustic Modeling.
- Since the number of phonemes is limited, it is usually straightforward to collect sufficient speech training data for reliable estimation of statistical models of the phonemes.
- The resulting sub-word speech models are usually referred to as “context-independent” phone-like units (CI-PLU), since each unit is trained independently of the context of neighboring units.
- When context-dependent units are also modeled (in addition to the CI-PLUs), the “resolution” of the acoustic models is increased, and the performance of the recognition system improves.
- 47.7.2 Word Modeling From Sub-Word Units.
- Hence, for each word in the recognition vocabulary, the lexicon contains a baseform (or standard) pronunciation of the word, as well as alternative pronunciations, as appropriate.
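A toy sketch of such a lexicon, expanding each word into one sub-word model sequence per pronunciation. The words, the ARPAbet-style phone labels, and the string stand-ins for trained phone models are all illustrative assumptions:

```python
# Hypothetical lexicon: each word maps to a baseform pronunciation plus
# alternatives, written as sequences of phone-like units
lexicon = {
    "data":  [["d", "ey", "t", "ax"], ["d", "ae", "t", "ax"]],
    "seven": [["s", "eh", "v", "ax", "n"]],
}

def word_models(word, phone_models):
    """Expand a word into one model sequence per listed pronunciation."""
    return [[phone_models[p] for p in pron] for pron in lexicon[word]]

# Stand-in "models": in a real system these would be trained sub-word HMMs
phone_models = {p: f"HMM<{p}>" for pron_list in lexicon.values()
                for pron in pron_list for p in pron}
print(len(word_models("data", phone_models)))   # → 2
```

Because every word is built by concatenating a small shared inventory of phone models, adding a new word to the vocabulary requires only a new lexicon entry, not new acoustic training data.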
- 47.7.3 Language Modeling Within the Recognizer.
- In order to determine the best match to a spoken sentence, a continuous speech recognition system has to evaluate both an acoustic match score (corresponding to the “local” acoustic matches of the words in the sentence) and a language match score (corresponding to the match of the words to the grammar and syntax of the task).
- The language match scores are computed according to a production model of the syntax and the semantics.
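A toy sketch of combining the two scores with a bigram language model; the bigram probabilities, the floor probability for unseen word pairs, and the language-model weight are all hypothetical values:

```python
import math

# Hypothetical bigram probabilities P(w_i | w_{i-1}) for a toy task grammar
bigram = {("<s>", "show"): 0.5, ("show", "me"): 0.9, ("me", "flights"): 0.4}

def sentence_score(words, acoustic_logliks, lm_weight=8.0):
    """Combine per-word acoustic log-likelihoods with a bigram language score.
    Unseen word pairs receive a small floor probability."""
    history = ["<s>"] + words[:-1]
    lm = sum(math.log(bigram.get((h, w), 1e-4))
             for h, w in zip(history, words))
    return sum(acoustic_logliks) + lm_weight * lm

score = sentence_score(["show", "me", "flights"], [-120.0, -80.0, -150.0])
print(score < -350.0)   # → True: the language score lowers the combined total
```

The weight on the language term is needed because acoustic log-likelihoods are computed per frame and therefore dwarf the per-word language probabilities; in practice it is tuned on held-out data.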
- 47.7.4 Performance of Continuous Speech Recognizers.
- TABLE 47.3 Performance of Continuous Speech Recognition Systems.
- 47.8 Speech Recognition System Issues.
- 47.8.1 Robust Speech Recognition [18].
- Noise is an inevitable component of the acoustic environment and is normally considered additive with the speech.
- Distortion refers to modification of the spectral characteristics of the signal by the room, the transducer (microphone), the channel (e.g., transmission), etc.
- Adaptive methods differ from invariant methods in the way the characteristics of the operating environment are taken into account.
- Invariant methods assume no explicit knowledge of the signal environment, while adaptive methods attempt to estimate the adverse condition and adjust the signal (or the reference models) accordingly in order to achieve reliable matching results..
- This distortion model leads to the method of cepstral mean subtraction and, more generally, signal bias removal [24] which makes a maximum likelihood estimate of the bias due to distortion and subtracts the estimated bias from the cepstral features before pattern matching is performed.
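Cepstral mean subtraction itself is a one-line operation; this sketch verifies on synthetic data that a fixed linear channel, which adds a constant bias to every cepstral frame, is canceled by subtracting the per-utterance mean:

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Remove the per-utterance mean from each cepstral coefficient.
    A fixed linear channel shifts the log-spectrum (and hence the cepstrum)
    by a constant, so subtracting the long-term mean cancels the distortion."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# A constant channel bias added to every frame disappears after CMS
frames = np.random.randn(100, 13)            # hypothetical cepstral sequence
biased = frames + np.full(13, 0.5)           # fixed channel distortion
clean = cepstral_mean_subtraction(biased)
ref = cepstral_mean_subtraction(frames)
print(np.allclose(clean, ref))   # → True
```

The price of the technique is that any genuinely constant component of the speech itself is removed along with the channel, which is why it is applied per utterance rather than per frame.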
- 47.8.2 Speaker Adaptation [25].
- 47.8.3 Keyword Spotting [26] and Utterance Verification [27].
- 47.8.4 Barge-In.
- With “barge-in”, the recognizer needs to be activated and listening from the beginning of the system prompt.
- 47.9 Practical Issues in Speech Recognition.
- 47.10 ASR Applications.
- Lesser, V.R., Fennell, R.D., Erman, L.D. and Reddy, D.R., Organization of the Hearsay-II Speech Understanding System, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-23(1), Feb. 1975.
- [11] Juang, B.H. and Rabiner, L.R., The segmental k-means algorithm for estimating parameters of hidden Markov models, IEEE Trans. Acoustics, Speech, and Signal Processing, 38(9), Sept. 1990.
- Mansour, D. and Juang, B.H., The short-time modified coherence representation and noisy speech recognition, IEEE Trans. Acoustics, Speech, and Signal Processing, 37(6), June 1989.
- Lee, C.-H., Lin, C.-H. and Juang, B.H., A study on speaker adaptation of the parameters of continuous density hidden Markov models, IEEE Trans. Signal Processing, 39(4), April 1991.
- [28] Jelinek, F., The development of an experimental discrete dictation recognizer, Proc. IEEE, 73(11), Nov. 1985.
