Lightweight Attention-Guided CNN–LSTM for Image Captioning
DOI: https://doi.org/10.3126/jkbc.v7i1.88398
Index Terms—Image Captioning, Visual Attention, Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), Deep Learning

Abstract
Automatically generating meaningful captions for images is a fundamental problem at the intersection of computer vision and natural language processing. Existing models often struggle with complex scenes, object relationships, and computational efficiency. In this paper, we introduce a lightweight image captioning method that combines a VGG-16 convolutional network for robust spatial feature extraction with a soft attention mechanism and an LSTM decoder, allowing the model to selectively attend to salient regions of an image while generating each word of its caption. The model is trained and evaluated on the Flickr8k dataset, which contains 8,000 images with five reference captions per image. Experimental results show competitive performance, with BLEU-1 through BLEU-4 scores ranging from 0.53 down to 0.10, indicating that the model can identify objects and generate coherent, context-aware image descriptions. The proposed method offers an efficient and explainable solution that bridges visual content and natural language, contributing to more accessible and intelligent multimedia technology.
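To illustrate the architecture described in the abstract, the following is a minimal PyTorch sketch of an attention-guided CNN–LSTM captioner: VGG-16 convolutional features, a soft (additive) attention module, and an LSTM decoder that attends to the feature map at every decoding step. This is not the authors' released implementation; the module names, embedding and hidden sizes, and vocabulary size are assumptions chosen for illustration.

import torch
import torch.nn as nn
from torchvision.models import vgg16

class SoftAttention(nn.Module):
    """Additive (Bahdanau-style) soft attention over spatial feature vectors."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, L, feat_dim) spatial locations; hidden: (B, hidden_dim)
        e = self.score(torch.tanh(self.feat_proj(feats) +
                                  self.hidden_proj(hidden).unsqueeze(1)))  # (B, L, 1)
        alpha = torch.softmax(e, dim=1)             # attention weights over locations
        context = (alpha * feats).sum(dim=1)        # (B, feat_dim) weighted context
        return context, alpha.squeeze(-1)

class CaptionDecoder(nn.Module):
    """LSTM decoder that attends to VGG-16 feature maps at each time step."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512,
                 feat_dim=512, attn_dim=256):
        super().__init__()
        self.cnn = vgg16(weights=None).features      # 512 x 7 x 7 feature maps for 224x224 input
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attend = SoftAttention(feat_dim, hidden_dim, attn_dim)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, 224, 224); captions: (B, T) token ids (teacher forcing)
        feats = self.cnn(images)                      # (B, 512, 7, 7)
        feats = feats.flatten(2).transpose(1, 2)      # (B, 49, 512) spatial locations
        B, T = captions.shape
        h = feats.new_zeros(B, self.lstm.hidden_size)
        c = feats.new_zeros(B, self.lstm.hidden_size)
        logits = []
        for t in range(T):
            context, _ = self.attend(feats, h)        # focus on salient image regions
            x = torch.cat([self.embed(captions[:, t]), context], dim=1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)             # (B, T, vocab_size)

# Toy forward pass with an assumed vocabulary of 5,000 words.
model = CaptionDecoder(vocab_size=5000)
scores = model(torch.randn(2, 3, 224, 224), torch.randint(0, 5000, (2, 12)))
print(scores.shape)  # torch.Size([2, 12, 5000])

In practice the VGG-16 backbone would be initialized with pretrained ImageNet weights and typically frozen, and the per-step attention weights can be visualized over the 7x7 grid, which is what makes this style of model explainable.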