Low-bitrate distributed speech recognition for packet-based and wireless communication

IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 10, NO. 8, NOVEMBER 2002

Low-Bitrate Distributed Speech Recognition for Packet-Based and Wireless Communication

Alexis Bernard, Student Member, IEEE, and Abeer Alwan, Senior Member, IEEE

Abstract—In this paper, we present a framework for developing source coding, channel coding and decoding, as well as erasure concealment techniques adapted for distributed (wireless or packet-based) speech recognition. It is shown that speech recognition, as opposed to speech coding, is more sensitive to channel errors than channel erasures, and appropriate channel coding design criteria are determined. For channel decoding, we introduce a novel technique for combining soft decision decoding with error detection at the receiver. Frame erasure concealment techniques are used at the decoder to deal with unreliable frames. At the recognition stage, we present a technique to modify the recognition engine itself to take into account the time-varying reliability of the decoded feature after channel transmission. The resulting engine, referred to as weighted Viterbi recognition, further improves recognition accuracy. Together, source coding, channel coding, and the modified recognition engine are shown to provide good recognition accuracy over a wide range of communication channels with bitrates of 1.2 kbps or less.

Index Terms—Automatic speech recognition, distributed speech recognition (DSR), joint channel decoding-speech recognition, soft decision decoding, weighted Viterbi algorithm, wireless and packet (IP) communication.

I. INTRODUCTION

IN DISTRIBUTED speech recognition (DSR) systems, speech features are acquired by the client and transmitted to the server for recognition. This enables low-power, low-complexity devices to perform speech recognition.
Applications include voice-activated web portals, menu browsing, and voice-operated personal digital assistants.

This paper investigates channel coding, channel decoding, source coding, and speech recognition techniques suitable for DSR systems over error-prone channels (Fig. 1). The goal is to provide high recognition accuracy over a wide range of channel conditions with low bitrate, delay, and complexity for the client.

Wireless communication is a challenging environment for speech recognition. The communication link is characterized by time-varying, low signal-to-noise ratio (SNR) channels. Previous studies have suggested alleviating the effect of channel errors by adapting acoustic models [1] and automatic speech recognition (ASR) front-ends [2] to different channel conditions, or by modeling GSM noise and holes [3]. Other studies analyzed the effect of random and burst errors in the GSM bitstream for remote speech recognition applications [4]. Finally, [5] and [6] evaluate the reliability of the decoded feature to provide robustness against channel errors.

Manuscript received September 25, 2001; revised August 7, 2002. This work was supported in part by the NSF, HRL, STM, and Broadcom through the University of California Micro Program. Portions of this work were presented at the IEEE International Conference on Acoustics, Speech and Signal Processing, Salt Lake City, UT, May 7-11, 2001, and the Eurospeech conference in Aalborg, Denmark, September 3-7, 2001. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Harry Printz. The authors are with the Speech Processing and Auditory Perception Laboratory, Electrical Engineering Department, University of California, Los Angeles, CA 90095-1594 USA. Digital Object Identifier 10.1109/TSA.2002.808141.

Fig. 1. Block diagram of a remote speech recognition system.
Similarly, packet-switched networks constitute a difficult environment. The communication link in IP-based systems is characterized by packet losses, mainly due to congestion at routers. Packet loss recovery techniques include silence substitution, noise substitution, repetition, and interpolation [7]-[9].

In terms of source coding for DSR, there are three possible approaches. The first approach bases recognition on the decoded speech signal, after speech coding and decoding. However, it is shown in [10]-[12] that this method suffers from significant recognition degradation at low bitrates. A second approach is to build a DSR engine based on speech coding parameters, without re-synthesizing the speech signal [13]-[16]. The third approach performs recognition on quantized ASR features, and provides a good tradeoff between bitrate and recognition accuracy [17]-[20]. This paper presents contributions in several areas of DSR systems based on quantized ASR features.

In the area of channel coding, it is first explained and experimentally verified that speech recognition, as opposed to speech coding, is more sensitive to channel errors than channel erasures. Two types of channels are analyzed: independent and bursty channels. Second, efficient channel coding techniques for error detection based on linear block codes are presented.

In the area of channel decoding, the merits of soft and hard decision decoding are discussed, and a new technique for performing error detection with soft decision decoding is presented. The soft decision channel decoder, which introduces additional complexity only at the server, is shown to outperform the widely used hard decision decoding.

In the area of speech recognition, the recognition engine is modified to include a time-varying weighting factor depending on the quality of each decoded feature after transmission over time-varying channels.
Following frame erasure concealment, an estimate of the quality of the substituted features is taken into account using a weighted Viterbi recognizer (WVR). Together, erasure concealment and WVR improve the robustness of the DSR system against channel noise, extending the range of channel conditions over which wireless or Internet-based speech recognition can be sustained.

Source coding, channel coding, and speech recognition techniques are then combined to provide high recognition accuracy over a large range of channel conditions for two types of speech recognition features: perceptual linear prediction (PLP) and Mel-frequency cepstral coefficients (MFCC).

This paper is organized as follows. Section II analyzes the effect of channel errors and erasures on recognition accuracy. Section III provides a description of the channel encoders used to efficiently protect the recognition features. In Section IV, different channel decoding techniques are presented. Section V presents the weighted Viterbi recognition (WVR) algorithm. Techniques alleviating the effect of erasures using WVR are proposed in Section VI. Finally, Section VII illustrates the performance of the overall speech recognition system applied to quantized PLP and MFCC features.

II. EFFECT OF CHANNEL ERASURES AND ERRORS

In this section, we study how channel errors and erasures affect the Viterbi speech recognizer. We then present techniques for minimizing recognition degradation due to transmission of speech features over noisy channels.

Throughout this paper, speech recognition experiments consist of continuous digit recognition based on 4-kHz bandwidth speech signals. Training is done using speech from 110 male and female speakers from the Aurora-2 database [18], for a total of 2200 digit strings. The feature vector consists of PLP or Mel-frequency cepstral coefficients with their first and second derivatives.
As specified by the Aurora-2 ETSI standard [18], hidden Markov (HMM) word models contain 16 states with 6 mixtures each, and are trained using the Baum–Welch algorithm assuming a diagonal covariance matrix. Recognition tests contain 1000 digit strings spoken by 100 speakers (male and female), for a total of 3241 digits.

A. Effect of Channel Erasures and Errors on DSR

The emphasis in remote ASR is recognition accuracy, not playback. Recognition is performed by computing the likelihood of the feature vectors over time and by selecting the element in the dictionary that most likely produced that sequence of observations. The nature of this task implies different criteria for designing channel encoders and decoders than those used in speech coding/playback applications.

The likelihood of observing a given sequence of features given a hidden Markov model is computed by searching through a trellis for the most probable state sequence. The Viterbi algorithm (VA) presents a dynamic programming solution to find the most likely path through a trellis. For each state $j$, at time $t$, the likelihood of each path is computed by multiplying the transition probabilities $a_{ij}$ between states and the output probabilities $b_j(o_t)$ along that path. The partial likelihood $\phi_j(t)$ is computed efficiently using the following recursion:

$$\phi_j(t) = \max_i \left\{ \phi_i(t-1)\, a_{ij} \right\} b_j(o_t). \qquad (1)$$

The probability of observing the $D$-dimensional feature $o_t$ is

$$b_j(o_t) = \sum_{k=1}^{M} c_{jk}\, \mathcal{N}\!\left(o_t; \mu_{jk}, \Sigma_{jk}\right) \qquad (2)$$

where $M$ is the number of mixture components, $c_{jk}$ is the mixture weight, and the parameters of the multivariate Gaussian mixture are its mean vector $\mu_{jk}$ and covariance matrix $\Sigma_{jk}$.

Fig. 2. Illustration of the consequences of a channel erasure and error on the most likely paths taken in the trellis by the received sequence of observations, given a 16-state word digit model. The erasure and error occur at frame number 17.

Fig. 2 analyzes the effect of a channel error and erasure in the VA. Assume first a transmission free of channel errors. The best path through the trellis is the line with no marker. Assume now that a channel error occurs at time $t$.
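As an illustration of the recursion in (1) and the mixture likelihood in (2), the following sketch implements both in the log domain for numerical stability. The toy two-state model and its parameters are invented for this example; the symbol names mirror the equations above.

```python
import math

def log_gauss_diag(o, mean, var):
    """Log density of a diagonal-covariance Gaussian (one mixture term of Eq. 2)."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(o, mean, var))

def log_output_prob(o, mixtures):
    """log b_j(o_t): log-sum-exp over mixture components (Eq. 2).
    mixtures: list of (weight c_jk, mean vector, variance vector)."""
    logs = [math.log(c) + log_gauss_diag(o, m, v) for c, m, v in mixtures]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))

def viterbi(observations, log_a, states):
    """Log-domain form of Eq. (1):
    phi_j(t) = max_i [phi_i(t-1) + log a_ij] + log b_j(o_t).
    log_a: matrix of transition log-probabilities; states: mixtures per state."""
    n = len(states)
    phi = [log_output_prob(observations[0], states[j]) for j in range(n)]
    back = []
    for o in observations[1:]:
        prev, ptrs, phi = phi, [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: prev[i] + log_a[i][j])
            ptrs.append(best_i)
            phi.append(prev[best_i] + log_a[best_i][j]
                       + log_output_prob(o, states[j]))
        back.append(ptrs)
    # backtrack the most likely state sequence
    path = [max(range(n), key=lambda j: phi[j])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path)), max(phi)
```

On a two-state toy model with well-separated Gaussians, the recovered path switches states exactly when the observations move from one mean to the other.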
The decoded feature is $o_t'$ as opposed to $o_t$, and the associated probabilities for each state may differ considerably ($b_j(o_t') \neq b_j(o_t)$), which will disturb the state metrics $\phi_j(t)$. A large discrepancy between $b_j(o_t')$ and $b_j(o_t)$ can force the best path in the trellis to branch out from the error-free best path. Consequently, many features may be accounted for in the overall likelihood computation using the state model $j'$ instead of the correct state model $j$, which will once again modify the probability of the observations, since $b_{j'}(o_t) \neq b_j(o_t)$.

On the other hand, channel erasures have little effect on the likelihood computation. State metrics are not disturbed, since the probability of the missing observation cannot be computed. Also, note that not updating the state metrics ($\phi_j(t) = \phi_j(t-1)$) is not as likely to create a path split between the best paths with and without an erasure as a channel error is. Hence, channel erasures typically do not propagate through the trellis.

B. Simulations of Channel Erasures and Errors

In this section, we simulate the effects of channel erasures and channel errors on DSR.

Fig. 3. Simulation of the effect of channel erasures and errors on continuous digit recognition performance using the Aurora-2 database and PLP features. Recognition accuracies are represented in percent on a gray scale.

Fig. 3 illustrates the effect of randomly inserted channel erasures and errors in the communication between the client and the server. The transmitted feature vector consists of 5 PLP cepstral coefficients, enough to represent two observable peaks in the perceptual spectrum and the spectral tilt. Erasures are simulated by removing the corresponding frame from the observation sequence.
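This simulation protocol can be sketched as follows. An erased frame is simply dropped from the observation sequence; an erroneous frame is replaced by a randomly chosen vector. As a simple stand-in for drawing from the true feature statistics, this sketch (an assumption, not the paper's exact procedure) reuses another frame of the same utterance as the replacement vector.

```python
import random

def corrupt(features, p_erasure=0.0, p_error=0.0, rng=None):
    """Simulate channel degradations on a feature-vector sequence.
    Erasure: the frame is removed from the observation sequence.
    Error:   the frame is replaced by a vector drawn at random from the
             utterance's own frames (stand-in for the feature distribution)."""
    rng = rng or random.Random(0)
    out = []
    for f in features:
        r = rng.random()
        if r < p_erasure:
            continue                          # erased frame: simply dropped
        elif r < p_erasure + p_error:
            out.append(rng.choice(features))  # error: plausible but wrong vector
        else:
            out.append(f)
    return out
```

Feeding the corrupted sequence to the recognizer then reproduces the two regimes discussed above: shortened sequences for erasures, and misleading observations for errors.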
Channel errors, on the other hand, are simulated by replacing the feature vector with another vector, chosen randomly according to the statistical distribution of the features. This simulation technique has the merit of being independent of the source coding algorithm. It is valid especially for low-bitrate quantization schemes, which are highly sensitive to channel errors.

Fig. 3 shows that channel errors, which propagate through the trellis, have a disastrous effect on recognition accuracy, while the recognizer is able to operate with almost no loss of accuracy with up to 15% channel erasures. This confirms results obtained in [19] for isolated digit recognition based on PLP coefficients and in [5] for MFCCs. Note that computation of the temporal derivatives at the receiver accentuates error propagation.

The results indicate that a very important attribute of any channel encoder designed for remote recognition applications should be error detection more than error correction. Sections III and IV present innovative techniques to maximize the error detection capabilities of linear block codes suitable for DSR applications. For the remainder of this section, we assume that all transmission errors are detected and replaced by erasures. Models for erasure channels are presented next.

C. Gilbert–Elliot Models for Erasure Channels

Two types of erasure channels are analyzed. In the first type, channel erasures occur independently. In the second type, channel erasures occur in bursts, which is typically the case for correlated fading channels in wireless communication or for IP-based communication systems, where fading or network congestion may cause a series of consecutive packets to be dropped.

For independent-erasure channels, erasures are inserted randomly with a given probability. A classic model for bursty channels is the Gilbert–Elliot model [21], in which the transmission is modeled as a Markov system where the channel is assigned one of two states: good or bad.

TABLE I. Gilbert–Elliot Test Channels (Probabilities in %)
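A minimal sketch of this two-state model follows. The symbol names are notational choices of this sketch: p_gb and p_bg denote the good-to-bad and bad-to-good transition probabilities, and e_g, e_b the per-state erasure probabilities, with the values 0.01 and 0.80 used in this paper as defaults.

```python
import random

def gilbert_elliot_stats(p_gb, p_bg, e_g=0.01, e_b=0.80):
    """Stationary bad-state probability and overall erasure rate of a
    two-state (good/bad) Gilbert-Elliot channel."""
    p_bad = p_gb / (p_gb + p_bg)      # stationary probability of the bad state
    p_good = 1.0 - p_bad
    p_erasure = p_good * e_g + p_bad * e_b
    return p_bad, p_erasure

def gilbert_elliot_run(n, p_gb, p_bg, e_g=0.01, e_b=0.80, rng=None):
    """Simulate n frames; returns a list of booleans (True = frame erased).
    Erasures cluster in bursts while the chain sits in the bad state."""
    rng = rng or random.Random(0)
    bad = False
    erasures = []
    for _ in range(n):
        # two-state Markov transition
        if bad and rng.random() < p_bg:
            bad = False
        elif not bad and rng.random() < p_gb:
            bad = True
        erasures.append(rng.random() < (e_b if bad else e_g))
    return erasures
```

For instance, with p_gb = 0.05 and p_bg = 0.45 the chain spends 10% of the time in the bad state, giving an average erasure probability of 0.9 x 0.01 + 0.1 x 0.80 = 0.089, and a long simulated run produces an empirical rate close to that value.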
With such a model, characterized by the state transition probabilities $p_{gb}$ (good to bad) and $p_{bg}$ (bad to good), there is a probability $P_g = p_{bg}/(p_{gb}+p_{bg})$ of being in the good state and a probability $P_b = p_{gb}/(p_{gb}+p_{bg})$ of being in the bad state. If the probabilities of channel erasures are $e_g$ and $e_b$ for the good and bad states, respectively, the overall average probability of erasure is $P_e = P_g\, e_g + P_b\, e_b$. Throughout this paper, $e_g$ will be considered to be equal to 0.01 and $e_b$ is set to 0.80. Different types of bursty channels are analyzed, depending on $p_{gb}$ and $p_{bg}$, which in turn determine how bursty the channel is. Table I summarizes the properties of the bursty channels studied, including the probability (in percent) of being in the bad state $P_b$, the overall probability of erasure $P_e$, and the average length (in frames) of a burst of erasures.

The Gilbert–Elliot model parameters are selected based on values reported in the literature on Gilbert models for packet-based (IP) networks [22], [23] and wireless communication channels [24]-[26].

III. CHANNEL CODING FOR DSR SYSTEMS

The analysis in Section II indicates that the most important requirements for a channel coding scheme for DSR are a low probability of undetected error (< 0.5%) and a large enough probability of correct decoding (> 90%). This section presents techniques to detect most channel errors. Corrupted frames are then ignored (erased), and the frame erasure concealment techniques presented in Section VI can be applied.

For packet-based transmission, frames are typically either received or lost, but not in error. Frame erasures can be detected by analyzing the ordering of the received packets, and there is no need for sophisticated error detection techniques.

With wireless communication, transmitted bits are altered during transmission.
Based on the values of the received bits, the receiver can either correctly decode the message (with probability $P_{cd}$ of correct decoding), detect a transmission error (with probability $P_{ed}$ of error detection), or fail to detect such an error (with probability $P_{ud}$ of undetected error). Since the number of source information bits necessary to code each frame can be very low (6-40 bits/frame) for efficient ASR feature coding schemes [19], linear block codes are favored over convolutional or trellis codes for delay and complexity considerations, as well as for their ability to provide error detection for each frame independently.

Fig. 4. Illustration of the different decoding strategies. (a) Hard decoding, (b) soft decoding, and (c) λ-soft decoding.

With soft decision decoding, $P_{ed} = 0$, allowing only for correct or erroneous decoding. Consequently, both $P_{cd}$ and $P_{ud}$ increase, which ultimately decreases recognition performance. We propose in the following section a technique to combine the advantage of soft decision decoding with the error detection capability of hard decision decoding.

C. Modified Soft Decision Decoding (λ-Soft)

In order to accept a decision provided by the soft decoder, one would like to evaluate the probability that the decoded codevector was the one transmitted. Such an a posteriori probability is given by

$$P(c_i \mid r) = \frac{p(r \mid c_i)\, P(c_i)}{p(r)}$$

which is complex and requires the knowledge of $p(r)$, which is difficult to evaluate.

Another solution is to perform error detection based on the ratio of the likelihoods of the two most probable codevectors. Assuming that all codewords are equiprobable, the ratio of the likelihoods of the two most probable codevectors $c_1$ and $c_2$ (the two closest codevectors from the received vector $r$, at Euclidean distances $d_1$ and $d_2$ from $r$) is given by

$$\Lambda = \frac{p(r \mid c_1)}{p(r \mid c_2)} = \exp\!\left(\frac{d_2^2 - d_1^2}{2\sigma^2}\right) \qquad (5)$$

$$= \exp\!\left(\frac{d\,(d_2' - d_1')}{2\sigma^2}\right) \qquad (6)$$

where $d$ is the Euclidean distance between the two closest codevectors $c_1$ and $c_2$, while $d_1'$ and $d_2'$ are the distances from $c_1$ and $c_2$ to the projection of the received codevector onto the line joining $c_1$ and $c_2$.
The important factor in (6) is

$$\lambda = \frac{d_2' - d_1'}{d}. \qquad (7)$$

If $\lambda = 0$, both codevectors are equally probable and the decision of the maximum-likelihood (ML) decoder should be rejected. If $\lambda = 1$, a correct decision is almost guaranteed, since the block codes used are chosen according to channel conditions so that the minimum Euclidean distance between any two codevectors is at least several times as large as the expected noise.

Fig. 4(c) shows an example of λ-soft decision decoding of the same (2, 1) code. Error detection can be declared when $\lambda$ is smaller than a threshold $\lambda_0$. Classic soft decision decoding is a particular case of modified soft decision decoding with $\lambda_0 = 0$. The area for error detection grows as $\lambda_0$ increases.

D. Comparison of Channel Decoding Performances

For comparison, consider the (10, 7) SED block code of Table II over an independent Rayleigh fading channel at 5 dB SNR. Hard decoding yields probabilities of correct decoding, error detection, and undetected error that are insufficient to provide good recognition results. With soft decision decoding, on the other hand, the probability of undetected errors $P_{ud}$ is too large.

Fig. 5 illustrates the performance of the λ-soft decision decoding scheme for the same code over the same channel for different values of $\lambda_0$. Note first that λ-soft decision decoding with $\lambda_0 = 0$ corresponds to classic soft decision decoding. With increasing $\lambda_0$, however, one can rapidly reduce $P_{ud}$ to the desired
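The λ-soft rule can be sketched as follows for BPSK-modulated codevectors. The geometric reconstruction (projection onto the line joining the two nearest codevectors, reliability factor λ = (d2' - d1')/d, and the default threshold value) follows the description above but is an assumption of this sketch, not the paper's exact implementation.

```python
def lam_soft_decode(r, codewords, lam0=0.2):
    """Soft-decision ML decoding with likelihood-ratio error detection.
    codewords: BPSK codevectors (tuples of +/-1 per bit). Returns
    (codeword, None) on a confident decision, or (None, 'erasure') when
    the reliability factor lambda falls below the threshold lam0."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    ranked = sorted(codewords, key=lambda c: dist2(r, c))
    c1, c2 = ranked[0], ranked[1]          # two nearest codevectors
    d2 = dist2(c1, c2)                     # squared distance d^2 between them
    # fractional position t of r's projection on the line from c1 to c2
    t = sum((x - a) * (b - a) for x, a, b in zip(r, c1, c2)) / d2
    lam = 1.0 - 2.0 * t    # lambda = (d2' - d1')/d: 1 at c1, 0 halfway
    if lam < lam0:
        return None, 'erasure'   # codevectors nearly equally likely: erase
    return c1, None
```

With lam0 = 0 the decoder never erases (classic soft decision decoding); raising the threshold converts likely undetected errors into erasures, mirroring the tradeoff discussed above.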