Lip Reading Using Convolutional Neural Networks
DOI: https://doi.org/10.3126/joeis.v4i1.81574
Keywords: Lip Reading, Convolutional Neural Networks (CNN), Bidirectional Long Short-Term Memory (BiLSTM)
Abstract
Lip reading, or the decoding of speech from facial movements, is crucial for enhancing communication for individuals with hearing or speech impairments, as well as for generating accurate captions when audio is compromised. Traditional Automatic Speech Recognition (ASR) systems often fail in noisy environments, creating a need for robust visual-based alternatives. The main objective of this study was to develop and evaluate a highly accurate, visual-only automated lip-reading system based on a novel deep-learning architecture.
The methodology employed a hybrid model that combined 3D Convolutional Neural Networks (CNNs) for spatial feature extraction from video frames and Bidirectional Long Short-Term Memory (BiLSTM) networks to analyze temporal dependencies. This model was trained on the GRID corpus dataset, which contains thousands of spoken sentences. Performance was evaluated using Word Error Rate (WER) and Character Error Rate (CER) metrics.
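The following is a minimal sketch of this kind of hybrid 3D CNN + BiLSTM architecture, assuming grayscale mouth-region crops and a small character vocabulary; layer counts, kernel sizes, and input dimensions are illustrative and not the authors' exact configuration.

```python
# Hedged sketch of a 3D CNN + BiLSTM lip-reading model (PyTorch).
# Assumes clips of 75 grayscale frames of 64x128 mouth crops and a
# 28-symbol character vocabulary (letters, space, CTC blank).
import torch
import torch.nn as nn

class LipReadingNet(nn.Module):
    def __init__(self, vocab_size=28, hidden_size=256):
        super().__init__()
        # 3D convolutions extract spatio-temporal features from the frame stack.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),   # pool spatially only
            nn.Conv3d(32, 64, kernel_size=(3, 3, 3), padding=1),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        # BiLSTM models temporal dependencies across the frame sequence.
        self.lstm = nn.LSTM(
            input_size=64 * 16 * 32,   # channels * pooled height * pooled width
            hidden_size=hidden_size,
            num_layers=2,
            batch_first=True,
            bidirectional=True,
        )
        # Per-frame character logits, typically trained with a CTC loss.
        self.classifier = nn.Linear(2 * hidden_size, vocab_size)

    def forward(self, x):            # x: (batch, 1, frames, height, width)
        feats = self.frontend(x)     # (batch, channels, frames, H', W')
        b, c, t, h, w = feats.shape
        feats = feats.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
        out, _ = self.lstm(feats)    # (batch, frames, 2 * hidden_size)
        return self.classifier(out)  # (batch, frames, vocab_size)

# Example: a batch of 2 clips, each 75 frames of 64x128 grayscale crops.
logits = LipReadingNet()(torch.randn(2, 1, 75, 64, 128))
print(logits.shape)  # torch.Size([2, 75, 28])
```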
The implemented model demonstrated strong performance, achieving an average WER of 0.1706 and an average CER of 0.0712 on 50 unseen test videos. This translates to a word prediction accuracy of approximately 83% and a character prediction accuracy of 93%. The study concludes that the hybrid CNN-BiLSTM architecture is highly effective for visual speech recognition. The findings have significant implications for creating practical assistive technologies that can serve as a hearing aid for the deaf and a voice for the mute, ultimately improving accessibility and communication.
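For clarity on how the reported error rates map to the stated accuracies, the sketch below computes WER and CER with the standard edit-distance definitions (not necessarily the authors' exact evaluation script): word accuracy is approximately 1 - 0.1706 ≈ 83% and character accuracy is approximately 1 - 0.0712 ≈ 93%. The example sentences are hypothetical GRID-style utterances.

```python
# Hedged sketch of WER/CER computation via Levenshtein edit distance.
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (r != h))    # substitution
    return dp[-1]

def wer(ref, hyp):
    """Word Error Rate: word-level edits divided by reference word count."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    """Character Error Rate: character-level edits divided by reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

ref = "place blue in a two now"       # hypothetical GRID-style reference
hyp = "place blue at a two now"       # hypothetical model prediction
print(wer(ref, hyp), cer(ref, hyp))   # 1/6 ≈ 0.167, 2/23 ≈ 0.087
```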
License: Copyright is held by the authors.