ExerLiteNet: Lightweight CNN-LSTM Architecture for Binary Exercise Recognition from Webcam RGB Video

Ranish Ghimire; Rojin Maharjan; Riya Bhattarai; Sahaj Shakya

doi:10.3126/injet.v3i2.95537

Authors

Ranish Ghimire, Computer and Electronics Department, Kantipur Engineering College, Lalitpur, Nepal
Rojin Maharjan, Computer and Electronics Department, Kantipur Engineering College, Lalitpur, Nepal
Riya Bhattarai, Computer and Electronics Department, Kantipur Engineering College, Lalitpur, Nepal
Sahaj Shakya, Computer and Electronics Department, Kantipur Engineering College, Lalitpur, Nepal

Keywords:

Exercise Recognition, MobileNetV3Small, CNN-LSTM, Spatiotemporal Learning, Transfer Learning, RGB Video, Lightweight Deep Learning, Fine-tuning, Overfit

Abstract

Exercise recognition from video underpins digital workout tracking systems and automated fitness monitoring tools that can provide coaching without expensive equipment. This paper presents ExerLiteNet, a lightweight deep learning model for resource-constrained devices that classifies two common resistance exercises, squats and bicep curls, from standard RGB webcam video. The model combines a fine-tuned MobileNetV3Small CNN, applied to each video frame through a time-distributed wrapper to extract spatial features, with a Long Short-Term Memory (LSTM) network that captures motion patterns across a sequence of 15 frames. The training data comprised 299 videos across the two exercise classes, averaging 10 seconds each, collected from stock video sources and webcam recordings. After fine-tuning, the model achieved a classification accuracy of 92.08% on the test set, outperforming a frozen ResNet50 baseline (82.31%) and an intermediate frozen MobileNetV3Small + LSTM configuration (87.08%). In terms of computational efficiency, the proposed model has a total size of approximately 178 MB and requires 1.8 GFLOPs per inference sequence, compared with approximately 553 MB and 15.43 GFLOPs for VGG16+LSTM and approximately 3.8 GFLOPs for ResNet50, which has no temporal modeling capability. The model runs at approximately 15 frames per second, with an end-to-end inference latency of 2.3 seconds per sequence on a standard webcam setup. These results show that combining a lightweight convolutional architecture with sequence modeling can achieve competitive accuracy while remaining practical for deployment on everyday hardware, without specialized sensors or depth cameras.
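
The efficiency comparison above can be made concrete with a little arithmetic. The following is a minimal sketch that uses only the figures reported in the abstract; the names `REPORTED` and `cost_ratio` are hypothetical helpers introduced here for illustration, and the ResNet50 model size is left as `None` because the abstract does not state it.

```python
# Figures as reported in the abstract. None marks a value the abstract does not give.
REPORTED = {
    "ExerLiteNet": {"size_mb": 178.0, "gflops": 1.8},
    "VGG16+LSTM": {"size_mb": 553.0, "gflops": 15.43},
    "ResNet50": {"size_mb": None, "gflops": 3.8},
}


def cost_ratio(model, metric, reference="ExerLiteNet"):
    """Return how many times larger `metric` is for `model` than for `reference`.

    Returns None when either value is not reported.
    """
    value = REPORTED[model][metric]
    base = REPORTED[reference][metric]
    if value is None or base is None:
        return None
    return value / base


if __name__ == "__main__":
    for name in ("VGG16+LSTM", "ResNet50"):
        for metric in ("size_mb", "gflops"):
            ratio = cost_ratio(name, metric)
            shown = "not reported" if ratio is None else f"{ratio:.2f}x"
            print(f"{name} {metric} relative to ExerLiteNet: {shown}")
```

On these reported figures, VGG16+LSTM needs roughly 8.6 times the GFLOPs and about 3.1 times the storage of ExerLiteNet, while ResNet50 needs about 2.1 times the GFLOPs.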
