What is the basic pipeline of a speech recognition system?

Ravi Vishwakarma · Answer

1. Audio Input

- What happens: A microphone captures the raw speech signal.
- Format: Usually a digitized waveform, stored in a file such as .wav or (compressed) .mp3.

2. Preprocessing

- Tasks:
  - Noise reduction
  - Voice activity detection (removing silence)
  - Normalization
- Goal: Clean and standardize the signal for feature extraction.

3. Feature Extraction

- Common techniques:
  - MFCC (Mel-Frequency Cepstral Coefficients)
  - Spectrogram
  - Log Mel-filterbanks
- Purpose: Convert the audio waveform into numerical features that represent its phonetic content.

4. Acoustic Modeling

- Models: DNN, CNN, RNN, LSTM, Transformers
- Role: Map features (like MFCCs) to phonetic units (phones or senones).

5. Language Modeling

- Models: N-grams, RNNs, Transformer-based models (such as GPT; BERT-style models are mainly used to rescore candidate transcriptions)
- Purpose: Score and predict word sequences using grammar and context.

6. Decoding

- Process: Combines the acoustic model, the language model, and a pronunciation dictionary (lexicon) to search for the text that best matches the signal.
- Output: The most likely word sequence (transcription).

7. Post-processing

- Includes:
  - Punctuation insertion
  - Capitalization
  - Error correction (optional)
- Goal: Produce a clean and readable transcription.

End-to-End Models

Modern systems (like Whisper and DeepSpeech) use end-to-end deep learning models. Instead of separate acoustic, pronunciation, and language models, a single network maps the audio (typically its spectral features) directly to text.
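To make steps 2, 3, and 6 more concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not how any particular toolkit implements these stages: the function names, the 16 kHz sample rate, the 25 ms / 10 ms framing, the 40 mel bands, and the `lm_weight` value are all choices made for this example. `log_mel_features` peak-normalizes a waveform and computes log Mel-filterbank features; `pick_best_hypothesis` mimics the core idea of decoding by combining an acoustic score and a language-model score for each candidate transcription (a real decoder searches a much larger space and also uses the lexicon).

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sample_rate / 2, n_fft // 2 + 1)
    bank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def log_mel_features(signal, sample_rate=16000, frame_len=400, hop=160,
                     n_fft=512, n_mels=40):
    """Return an array of shape (n_frames, n_mels) of log mel energies.

    Assumes the signal is at least frame_len samples long.
    """
    signal = np.asarray(signal, dtype=np.float64)
    # Preprocessing: peak normalization (assumed choice for this sketch).
    peak = np.max(np.abs(signal))
    if peak > 0:
        signal = signal / peak
    # Slice the waveform into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # Power spectrum per frame, then project onto the mel filterbank.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(mel_energy + 1e-10)  # small offset avoids log(0)

def pick_best_hypothesis(hypotheses, lm_weight=0.5):
    """hypotheses: list of (text, acoustic_logprob, lm_logprob) tuples.

    Returns the text with the highest combined score.
    """
    return max(hypotheses, key=lambda h: h[1] + lm_weight * h[2])[0]

if __name__ == "__main__":
    # One second of a synthetic 440 Hz tone stands in for real speech.
    t = np.arange(16000) / 16000.0
    features = log_mel_features(np.sin(2 * np.pi * 440.0 * t))
    print(features.shape)

    # Made-up scores: the acoustic model slightly prefers the second
    # candidate, but the language model strongly prefers the first.
    candidates = [
        ("recognize speech", -12.0, -4.0),
        ("wreck a nice beach", -11.5, -9.0),
    ]
    print(pick_best_hypothesis(candidates))
```

Taking a discrete cosine transform of each row of the log mel energies would yield MFCCs. The language-model weight controls how much linguistic plausibility can override the acoustic evidence: with `lm_weight=0` the second candidate above would win on acoustic score alone.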
