ExerLiteNet: Lightweight CNN-LSTM Architecture for Binary Exercise Recognition from Webcam RGB Video

Authors

  • Ranish Ghimire, Computer and Electronics Department, Kantipur Engineering College, Lalitpur, Nepal
  • Rojin Maharjan, Computer and Electronics Department, Kantipur Engineering College, Lalitpur, Nepal
  • Riya Bhattarai, Computer and Electronics Department, Kantipur Engineering College, Lalitpur, Nepal
  • Sahaj Shakya, Computer and Electronics Department, Kantipur Engineering College, Lalitpur, Nepal

DOI:

https://doi.org/10.3126/injet.v3i2.95537

Keywords:

Exercise Recognition, MobileNetV3Small, CNN-LSTM, Spatiotemporal Learning, Transfer Learning, RGB Video, Lightweight Deep Learning, Fine-tuning, Overfit

Abstract

Exercise recognition from video is important for building digital workout tracking systems and automated fitness monitoring tools that can provide coaching without expensive equipment. This paper presents ExerLiteNet, a lightweight deep learning model for resource-constrained devices that classifies two common resistance exercises, squats and bicep curls, from standard RGB webcam video. The proposed model combines a fine-tuned MobileNetV3Small CNN, applied to each video frame through a time-distributed wrapper to extract spatial features, with a Long Short-Term Memory (LSTM) network that captures motion patterns across a sequence of 15 frames. The training data consisted of 299 videos across the two exercise classes, averaging 10 seconds each, collected from both stock video sources and webcam recordings. After fine-tuning, the model achieved a classification accuracy of 92.08% on the test dataset, outperforming a frozen ResNet50 baseline (82.31%) and an intermediate frozen MobileNetV3Small + LSTM configuration (87.08%). In terms of computational efficiency, the proposed model has a total size of approximately 178 MB and requires only 1.8 GFLOPs per inference sequence. This is substantially lower than VGG16+LSTM, at approximately 553 MB and 15.43 GFLOPs, and lower than ResNet50, which requires approximately 3.8 GFLOPs and has no temporal modeling capability. The model also runs at approximately 15 frames per second, with an end-to-end inference latency of 2.3 seconds per sequence on a standard webcam setup. These results show that combining a lightweight convolutional architecture with sequence modeling can achieve competitive accuracy while remaining practical for deployment on everyday hardware, without specialized sensors or depth cameras.
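To make the size of the reported gaps concrete, the following is a minimal sketch that recomputes the ratios and accuracy differences from the figures quoted in the abstract. The dictionary layout and the function names (`ratio`, `compare`) are illustrative assumptions, not part of the paper. The sketch also assumes that the frozen ResNet50 baseline and the ResNet50 GFLOPs figure refer to the same model, and that the GFLOPs values are measured on a comparable basis, neither of which the abstract states explicitly.

```python
# Figures copied from the abstract; None marks values the abstract does not report.
# Assumption: the 82.31% accuracy and the 3.8 GFLOPs figure describe the same ResNet50 baseline.
REPORTED = {
    "ExerLiteNet": {"size_mb": 178.0, "gflops": 1.8, "accuracy": 92.08},
    "VGG16+LSTM": {"size_mb": 553.0, "gflops": 15.43, "accuracy": None},
    "ResNet50 (frozen)": {"size_mb": None, "gflops": 3.8, "accuracy": 82.31},
    "MobileNetV3Small+LSTM (frozen)": {"size_mb": None, "gflops": None, "accuracy": 87.08},
}


def ratio(baseline, proposed):
    """Return baseline / proposed, or None when either value is unreported."""
    if baseline is None or proposed is None:
        return None
    return baseline / proposed


def compare(reported, proposed="ExerLiteNet"):
    """Compare every other model against the proposed one.

    size_ratio and gflops_ratio are "times larger than the proposed model";
    accuracy_gain is the proposed model's advantage in percentage points.
    """
    ref = reported[proposed]
    rows = {}
    for name, metrics in reported.items():
        if name == proposed:
            continue
        acc = metrics["accuracy"]
        rows[name] = {
            "size_ratio": ratio(metrics["size_mb"], ref["size_mb"]),
            "gflops_ratio": ratio(metrics["gflops"], ref["gflops"]),
            "accuracy_gain": None if acc is None else round(ref["accuracy"] - acc, 2),
        }
    return rows


if __name__ == "__main__":
    for model, row in compare(REPORTED).items():
        print(model, row)
```

Under these assumptions, VGG16+LSTM is roughly 3.1 times larger and needs roughly 8.6 times the GFLOPs, ResNet50 needs roughly 2.1 times the GFLOPs, and fine-tuning adds 5.00 percentage points over the frozen MobileNetV3Small + LSTM configuration.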

Published

2026-06-18

How to Cite

Ghimire, R., Maharjan, R., Bhattarai, R., & Shakya, S. (2026). ExerLiteNet: Lightweight CNN-LSTM Architecture for Binary Exercise Recognition from Webcam RGB Video. International Journal on Engineering Technology, 3(2), 171–186. https://doi.org/10.3126/injet.v3i2.95537