Image Captioning in Nepali Using CNN and Transformer Decoder

Rabin Budhathoki; Suresh Timilsina

doi:10.3126/jes2.v2i1.60391

Authors

Rabin Budhathoki Department of Electronics and Computer Engineering, IOE, Pashchimanchal Campus, Tribhuvan University, Nepal
Suresh Timilsina Department of Electronics and Computer Engineering, IOE, Pashchimanchal Campus, Tribhuvan University, Nepal

Keywords:

BLEU, CNN, Flickr8k, METEOR, MobileNetV3, RNN, Transformer

Abstract

Image captioning has attracted considerable attention from deep learning researchers. It combines image- and text-based deep learning techniques to generate written descriptions of images automatically. Research on image captioning in the Nepali language has been limited, with most studies focusing on English datasets, and no datasets are publicly available in Nepali. Most previous work is based on the RNN-CNN approach, which produces inferior results compared with image captioning using the Transformer model. Similarly, the BLEU score alone cannot adequately assess the quality of the generated captions. To address these gaps, in this work the well-known “Flickr8k” English dataset is translated into Nepali and then manually corrected to ensure accurate translations. The conventional Transformer comprises encoder and decoder modules, both of which contain a multi-head attention mechanism, which makes the model complex and computationally expensive. Hence, we propose a novel approach in which the encoder module of the Transformer is removed entirely and only the decoder is used, in conjunction with a CNN that acts as a feature extractor. Image features are extracted using MobileNetV3 Large, and the Transformer decoder processes these feature vectors together with the input text sequence to generate appropriate captions. The system's effectiveness is measured using the BLEU and METEOR scores, which assess the quality and precision of the generated captions.
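Since the abstract relies on BLEU as one of its evaluation metrics, a minimal sketch of sentence-level BLEU may help make the metric concrete. It is not the authors' evaluation code: it assumes whitespace-tokenized captions, uniform n-gram weights, and no smoothing, and the function names and the Nepali example caption are invented for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n):
    """Modified n-gram precision: each candidate n-gram count is clipped by
    the maximum count of that n-gram in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def sentence_bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights and no smoothing (assumption)."""
    precisions = [clipped_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0
    c = len(candidate)
    # Brevity penalty uses the reference length closest to the candidate length.
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return bp * geo_mean


# Hypothetical example caption ("A dog is running on the grass"), not from the dataset.
reference = "एउटा कुकुर घाँसमा दौडिरहेको छ".split()
candidate = "एउटा कुकुर घाँसमा दौडिरहेको छ".split()
score = sentence_bleu(candidate, [reference])
```

Because BLEU only counts overlapping n-grams, a correct caption that uses different wording or word order can score poorly, which is consistent with the abstract's point that BLEU alone is not a sufficient measure of caption quality.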

Published

2023-12-06

How to Cite

Budhathoki, R., & Timilsina, S. (2023). Image Captioning in Nepali Using CNN and Transformer Decoder. Journal of Engineering and Sciences, 2(1), 41-48. https://doi.org/10.3126/jes2.v2i1.60391

Issue

Vol. 2 No. 1 (2023)

Section

Articles

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

CC BY: This license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.
