Utterance-to-Phoneme Mapping using Viterbi Training

Project Overview

Developed an Automatic Speech Recognition (ASR) model that maps spoken utterances to phoneme sequences, a core subtask of speech recognition systems. The project focuses on learning the alignment between acoustic features and phoneme labels using pyramidal recurrent encoders and CTC-based training.

Technical Approach

  • Implemented a pyramidal Bi-directional LSTM (pBLSTM) encoder that halves the time resolution at each layer, shortening long audio sequences before the output layers (a minimal sketch follows this list)
  • Used Connectionist Temporal Classification (CTC) Loss to handle the unknown, variable-length alignment between input frames and output phoneme sequences (see the loss example after this list)
  • Incorporated multiple feature extraction and sequence modeling approaches:
    • CNN-pBLSTMs (a convolutional front end feeding pyramidal Bi-directional LSTMs)
    • CNN-LSTMs for hierarchical feature learning
    • ResNets for robust representation learning
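
A minimal sketch of the pBLSTM downsampling step in PyTorch; the class name, dimensions, and pair-wise frame stacking below are illustrative assumptions, not the project's exact code:

```python
import torch
import torch.nn as nn

class pBLSTM(nn.Module):
    """Pyramidal BLSTM: concatenates pairs of adjacent frames, halving
    the sequence length before a bidirectional LSTM processes it."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        # Feature dim doubles because two consecutive frames are stacked.
        self.blstm = nn.LSTM(input_dim * 2, hidden_dim,
                             bidirectional=True, batch_first=True)

    def forward(self, x):                    # x: (B, T, F)
        B, T, F = x.shape
        if T % 2:                            # drop a trailing frame if T is odd
            x = x[:, :-1, :]
            T -= 1
        x = x.reshape(B, T // 2, F * 2)      # (B, T/2, 2F)
        out, _ = self.blstm(x)               # (B, T/2, 2 * hidden_dim)
        return out
```

Stacking three such layers cuts the frame rate by 8×, which both speeds up the layers above and widens the temporal context each of their steps sees.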
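
With the encoder emitting per-frame class scores, training against unsegmented phoneme labels reduces to PyTorch's `nn.CTCLoss`; the shapes and dummy tensors here are illustrative:

```python
import torch
import torch.nn as nn

criterion = nn.CTCLoss(blank=0, zero_infinity=True)

# Encoder outputs over V classes (phonemes + blank at index 0), shape (T, B, V)
T, B, V = 200, 8, 42
logits = torch.randn(T, B, V, requires_grad=True)
log_probs = logits.log_softmax(dim=-1)

targets = torch.randint(1, V, (B, 30))                    # phoneme ids, no blanks
input_lengths = torch.full((B,), T, dtype=torch.long)     # frames per utterance
target_lengths = torch.full((B,), 30, dtype=torch.long)   # labels per utterance

loss = criterion(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```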

Decoding Strategies

  • Implemented Greedy Search for fast phoneme sequence decoding: take the most likely class at each frame, then collapse repeats and remove blanks (sketched after this list)
  • Developed Beam Search decoding to improve accuracy by keeping multiple high-probability prefixes at each time step (see the sketch below)
  • Tuned beam width and other search parameters to balance accuracy against decoding time
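
Greedy decoding is a per-frame argmax followed by the CTC collapse rule (merge repeated symbols, then drop blanks); a minimal sketch, assuming blank index 0:

```python
import torch

def greedy_decode(log_probs, blank=0):
    """log_probs: (T, V) per-frame log-probabilities for one utterance."""
    best_path = log_probs.argmax(dim=-1).tolist()  # most likely class per frame
    decoded, prev = [], blank
    for idx in best_path:
        if idx != prev and idx != blank:           # collapse repeats, skip blanks
            decoded.append(idx)
        prev = idx
    return decoded
```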
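
Beam search keeps the best prefixes rather than the single best path. Below is a simplified CTC prefix beam search without a language model, in the standard formulation; the function name and defaults are my own, and the project may equally have used a library decoder:

```python
import numpy as np

def ctc_beam_search(log_probs, beam_width=10, blank=0):
    """Prefix beam search over a (T, V) matrix of per-frame log-probs.
    Each prefix tracks (log P ending in blank, log P ending in non-blank)
    so that repeated symbols are merged correctly."""
    T, V = log_probs.shape
    beams = {(): (0.0, -np.inf)}                   # empty prefix, P = 1

    for t in range(T):
        new_beams = {}

        def add(prefix, lb, lnb):
            pb, pnb = new_beams.get(prefix, (-np.inf, -np.inf))
            new_beams[prefix] = (np.logaddexp(pb, lb), np.logaddexp(pnb, lnb))

        for prefix, (lpb, lpnb) in beams.items():
            total = np.logaddexp(lpb, lpnb)
            for c in range(V):
                lp = log_probs[t, c]
                if c == blank:                     # blank: prefix unchanged
                    add(prefix, total + lp, -np.inf)
                elif prefix and c == prefix[-1]:   # repeated symbol
                    add(prefix, -np.inf, lpnb + lp)        # repeat collapses
                    add(prefix + (c,), -np.inf, lpb + lp)  # new copy via blank
                else:                              # emit a new symbol
                    add(prefix + (c,), -np.inf, total + lp)

        # prune to the beam_width most probable prefixes
        beams = dict(sorted(new_beams.items(),
                            key=lambda kv: np.logaddexp(*kv[1]),
                            reverse=True)[:beam_width])

    return list(max(beams, key=lambda p: np.logaddexp(*beams[p])))
```

Widening the beam trades decoding time for accuracy, with diminishing returns past a modest width; that is the trade-off tuned above.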

Evaluation Method

  • Employed Levenshtein Distance (Edit Distance) to measure phoneme-level differences between predicted and ground-truth sequences (a small implementation follows this list)
  • Analyzed confusion matrices to identify problematic phoneme pairs
  • Conducted ablation studies to quantify the contribution of each architectural component
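
The phoneme-level edit distance is the standard dynamic-programming recurrence over token lists; a minimal sketch (the ARPAbet example at the end is illustrative):

```python
def levenshtein(ref, hyp):
    """Edit distance between two phoneme sequences: the minimum number
    of insertions, deletions, and substitutions turning ref into hyp."""
    m, n = len(ref), len(hyp)
    # dp[i][j] = distance between ref[:i] and hyp[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[m][n]

print(levenshtein(["HH", "AH", "L", "OW"], ["HH", "EH", "L", "OW"]))  # -> 1
```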

Technologies Used

  • Python
  • TensorFlow
  • Keras
  • PyTorch
  • Signal processing libraries

Applications

This system serves as a foundation for speech recognition applications, pronunciation assessment tools, and language learning technologies. The phoneme-level processing enables fine-grained analysis of speech patterns and accent characteristics.

See Repo Here