A Comparative Analysis of Image Captioning in Nepali Language Using Deep Learning

Authors

  • Prabhas Parajuli, Advanced College of Engineering and Management, Tribhuvan University, Nepal
  • Sumita Dangal, Advanced College of Engineering and Management, Tribhuvan University, Nepal
  • Prafulla Shrestha, Advanced College of Engineering and Management, Tribhuvan University, Nepal
  • Saru Pradhan, Advanced College of Engineering and Management, Tribhuvan University, Nepal
  • Bipun Man Pati, Advanced College of Engineering and Management, Tribhuvan University, Nepal
  • Ukesh Thapa, Advanced College of Engineering and Management, Tribhuvan University, Nepal

DOI:

https://doi.org/10.3126/jost.v5i1.92659

Keywords:

Image Processing, Nepali Image Captioning, Transformer Models, Self-Attention

Abstract

Image captioning combines image processing and Natural Language Processing (NLP). Generating a grammatically correct description of an image in Nepali is particularly challenging owing to the morphological richness and free word order of Nepali grammar, as well as the intricacies of the Devanagari script. This work presents a comparative analysis of deep learning architectures for sequence modeling, with a particular emphasis on Transformer-based and Long Short-Term Memory (LSTM)-based models. Specifically, we assess the effectiveness of feature extractors combined with sequence decoders in six configurations: Convolutional Neural Network (CNN)+Transformer, VGG16+LSTM, ResNet101+LSTM, EfficientNetB0+LSTM, Vision Transformer (ViT)+Transformer, and Global Context Vision Transformer (GCViT)+Transformer. The models are rigorously assessed on a Nepali-language captioning dataset using a comprehensive set of performance metrics: BLEU-1, BLEU-2, BLEU-3, and BLEU-4 for n-gram overlap, along with METEOR. Results indicate that LSTM-based models consistently outperform their Transformer-based counterparts across all metrics, achieving higher BLEU and METEOR scores. ResNet101+LSTM demonstrates particularly strong performance, suggesting that the benefits of end-to-end attention-based Transformer architectures are not guaranteed on smaller datasets. The findings provide clear empirical benchmarks for model selection in Nepali image captioning, highlighting the superior capability of LSTM-based approaches in generating linguistically accurate and semantically rich captions, which is vital for enhancing accessibility and digital inclusion for Nepali-speaking communities.
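
To make the evaluation protocol concrete, the sketch below computes cumulative BLEU-1 through BLEU-4 with NLTK's corpus_bleu, mirroring the metrics named above. The Nepali caption strings are illustrative placeholders rather than examples from the paper's dataset, and the smoothing choice is an assumption the abstract does not specify.

    # Minimal BLEU-1..4 evaluation sketch (requires NLTK).
    from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

    # One tokenized reference caption per image; in practice each image
    # may carry several reference captions.
    references = [[["एउटा", "कुकुर", "घाँसमा", "दौडिरहेको", "छ"]]]
    hypotheses = [["कुकुर", "घाँसमा", "दौडिरहेको", "छ"]]

    smooth = SmoothingFunction().method1  # avoids zero scores on short captions
    for n in range(1, 5):
        weights = tuple(1.0 / n for _ in range(n))  # cumulative BLEU-n weights
        score = corpus_bleu(references, hypotheses, weights=weights,
                            smoothing_function=smooth)
        print(f"BLEU-{n}: {score:.4f}")

Likewise, a minimal PyTorch sketch of the best-performing configuration, a ResNet101 encoder feeding an LSTM decoder, is given below. The embedding and hidden sizes, the frozen backbone, and the image-feature-as-first-token decoding scheme are illustrative assumptions, not the paper's exact implementation.

    import torch
    import torch.nn as nn
    from torchvision import models

    class ResNetLSTMCaptioner(nn.Module):
        """Illustrative ResNet101 encoder + LSTM decoder for captioning."""
        def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
            super().__init__()
            backbone = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
            # Drop the classifier head; keep the 2048-d pooled feature.
            self.encoder = nn.Sequential(*list(backbone.children())[:-1])
            for p in self.encoder.parameters():
                p.requires_grad = False  # frozen, pretrained feature extractor
            self.feat_proj = nn.Linear(2048, embed_dim)
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, images, captions):
            # images: (B, 3, 224, 224); captions: (B, T) token ids
            feats = self.encoder(images).flatten(1)      # (B, 2048)
            feats = self.feat_proj(feats).unsqueeze(1)   # (B, 1, E)
            embeds = self.embed(captions)                # (B, T, E)
            # The image feature acts as the first "token" of the sequence.
            inputs = torch.cat([feats, embeds], dim=1)   # (B, T+1, E)
            hidden, _ = self.lstm(inputs)
            return self.out(hidden)                      # (B, T+1, vocab_size)

During training, the logits at positions 0..T-1 would be compared against the caption tokens shifted by one position (teacher forcing), with a cross-entropy loss over the Nepali vocabulary.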

Published

2026-04-20

How to Cite

Parajuli, P., Dangal, S., Shrestha, P., Pradhan, S., Pati, B. M., & Thapa, U. (2026). A Comparative Analysis of Image Captioning in Nepali Language Using Deep Learning. Journal of Science and Technology, 5(1), 7–12. https://doi.org/10.3126/jost.v5i1.92659

Section

Articles