A Comparative Analysis of Image Captioning in Nepali Language Using Deep Learning

Authors

  • Prabhas Parajuli, Advanced College of Engineering and Management, Tribhuvan University, Nepal
  • Sumita Dangal, Advanced College of Engineering and Management, Tribhuvan University, Nepal
  • Prafulla Shrestha, Advanced College of Engineering and Management, Tribhuvan University, Nepal
  • Saru Pradhan, Advanced College of Engineering and Management, Tribhuvan University, Nepal
  • Bipun Man Pati, Advanced College of Engineering and Management, Tribhuvan University, Nepal
  • Ukesh Thapa, Advanced College of Engineering and Management, Tribhuvan University, Nepal

DOI:

https://doi.org/10.3126/jost.v5i1.92659

Keywords:

Image Processing, Nepali Image Captioning, Transformer Models, Self-Attention

Abstract

Image captioning combines image processing and Natural Language Processing (NLP). Generating a grammatically correct description of an image in Nepali is particularly challenging owing to the morphological richness and free word order of Nepali grammar, as well as the intricacies of the Devanagari script. This work presents a comparative analysis of deep learning architectures for sequence modeling, with a particular emphasis on Transformer-based and Long Short-Term Memory (LSTM)-based models. Specifically, we assess the effectiveness of feature extractors combined with sequence decoders in six configurations: Convolutional Neural Network (CNN)+Transformer, VGG16+LSTM, ResNet101+LSTM, EfficientNetB0+LSTM, Vision Transformer (ViT)+Transformer, and Global Context Vision Transformer (GCViT)+Transformer. The models are rigorously assessed on a Nepali-language captioning dataset using a comprehensive set of performance metrics: BLEU-1, BLEU-2, BLEU-3, and BLEU-4 for n-gram overlap, along with METEOR. Results indicate that LSTM-based models consistently outperform their Transformer-based counterparts across all metrics, achieving higher BLEU and METEOR scores. ResNet101+LSTM demonstrates particularly strong performance, suggesting that the benefits of end-to-end attention-based Transformer architectures are not guaranteed on smaller datasets. The findings provide clear empirical benchmarks for model selection in Nepali image captioning, highlighting the superior capability of LSTM-based approaches in generating linguistically accurate and semantically rich captions, which is vital for enhancing accessibility and digital inclusion for Nepali-speaking communities.
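
To make the evaluation protocol concrete, the sketch below computes cumulative BLEU-1 through BLEU-4 with NLTK's corpus_bleu, mirroring the metrics named above. The Nepali caption strings are illustrative placeholders rather than examples from the paper's dataset, and the smoothing choice is an assumption the abstract does not specify.

    # Minimal BLEU-1..4 evaluation sketch (requires NLTK).
    from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

    # One tokenized reference caption per image; in practice each image
    # may carry several reference captions.
    references = [[["एउटा", "कुकुर", "घाँसमा", "दौडिरहेको", "छ"]]]
    hypotheses = [["कुकुर", "घाँसमा", "दौडिरहेको", "छ"]]

    smooth = SmoothingFunction().method1  # avoids zero scores on short captions
    for n in range(1, 5):
        weights = tuple(1.0 / n for _ in range(n))  # cumulative BLEU-n weights
        score = corpus_bleu(references, hypotheses, weights=weights,
                            smoothing_function=smooth)
        print(f"BLEU-{n}: {score:.4f}")

Likewise, a minimal PyTorch sketch of the best-performing configuration, a ResNet101 encoder feeding an LSTM decoder, is given below. The embedding and hidden sizes, the frozen backbone, and the image-feature-as-first-token decoding scheme are illustrative assumptions, not the paper's exact implementation.

    import torch
    import torch.nn as nn
    from torchvision import models

    class ResNetLSTMCaptioner(nn.Module):
        """Illustrative ResNet101 encoder + LSTM decoder for captioning."""
        def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
            super().__init__()
            backbone = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
            # Drop the classifier head; keep the 2048-d pooled feature.
            self.encoder = nn.Sequential(*list(backbone.children())[:-1])
            for p in self.encoder.parameters():
                p.requires_grad = False  # frozen, pretrained feature extractor
            self.feat_proj = nn.Linear(2048, embed_dim)
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, images, captions):
            # images: (B, 3, 224, 224); captions: (B, T) token ids
            feats = self.encoder(images).flatten(1)      # (B, 2048)
            feats = self.feat_proj(feats).unsqueeze(1)   # (B, 1, E)
            embeds = self.embed(captions)                # (B, T, E)
            # The image feature acts as the first "token" of the sequence.
            inputs = torch.cat([feats, embeds], dim=1)   # (B, T+1, E)
            hidden, _ = self.lstm(inputs)
            return self.out(hidden)                      # (B, T+1, vocab_size)

During training, the logits at positions 0..T-1 would be compared against the caption tokens shifted by one position (teacher forcing), with a cross-entropy loss over the Nepali vocabulary.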

Published

2026-04-20

How to Cite

Parajuli, P., Dangal, S., Shrestha, P., Pradhan, S., Pati, B. M., & Thapa, U. (2026). A Comparative Analysis of Image Captioning in Nepali Language Using Deep Learning. Journal of Science and Technology, 5(1), 7–12. https://doi.org/10.3126/jost.v5i1.92659

Section

Articles