A Hybrid Cross-Lingual News Aggregator for Nepali Media using mT5 and Dense Vector Embeddings

Authors

  • Sapen Chhetri, Nepal College of Information Technology, Pokhara University
  • Ahwan Poudyal, Nepal College of Information Technology, Pokhara University
  • Deepak Gurung, Nepal College of Information Technology, Pokhara University
  • Babita Adhikari, Nepal College of Information Technology, Pokhara University
  • Ashim Khadka, Nepal College of Information Technology, Pokhara University

DOI:

https://doi.org/10.3126/injet.v3i2.95500

Keywords:

Hybrid Search, mT5 Summarization, Cross-Lingual Retrieval, Dense Vector Embeddings, Zero-Shot Classification, Nepali NLP

Abstract

The rapid proliferation of digital journalism in Nepal has created a fragmented information landscape: readers must navigate hundreds of independent portals, which imposes significant cognitive load and widens the semantic gap in information retrieval. To address this, the proposed system is built around a hybrid search engine that fuses lexical precision (BM25) with conceptual matching via dense vector embeddings. This dual-path retrieval mechanism enables cross-lingual access: English queries retrieve semantically relevant Nepali content, with an average cosine similarity exceeding 0.72. To reduce information overload, a synthesis layer built on a fine-tuned mT5 (Multilingual Text-to-Text Transfer Transformer) model distills long-form journalism into concise abstractive summaries, achieving a ROUGE-1 score of 0.33. In addition, zero-shot classification based on Natural Language Inference (NLI) dynamically categorizes unstructured news streams into thematic verticals without manual labeling. Experimental results show that the proposed framework significantly improves retrieval recall and organizational efficiency, providing a scalable solution for modernizing regional news consumption in low-resource linguistic environments.
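
The dual-path retrieval idea described above can be illustrated with a minimal sketch. The abstract does not state how the two scores are fused, so everything here is an assumption made for illustration: the min-max normalization, the weighted-sum fusion with a hypothetical `alpha` parameter, the BM25 constants `k1` and `b`, and the hand-made two-dimensional vectors standing in for real multilingual embeddings.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for the query."""
    n = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n
    df = Counter()
    for d in docs_tokens:
        df.update(set(d))  # document frequency of each term
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def min_max(xs):
    """Rescale scores to [0, 1]; all-equal inputs map to 0."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0] * len(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def hybrid_rank(query_tokens, query_vec, docs_tokens, doc_vecs, alpha=0.5):
    """Rank document indices by alpha * lexical + (1 - alpha) * dense.

    The weighted sum is an assumed fusion rule, not the paper's.
    """
    lexical = min_max(bm25_scores(query_tokens, docs_tokens))
    dense = min_max([cosine(query_vec, v) for v in doc_vecs])
    fused = [alpha * l + (1 - alpha) * d for l, d in zip(lexical, dense)]
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)


# Toy corpus: two Nepali-only headlines and one mixed-language headline.
docs_tokens = [
    ["निर्वाचन", "नतिजा"],
    ["मौसम", "पूर्वानुमान"],
    ["election", "commission", "नेपाल"],
]
# Hand-made stand-ins for embeddings from a multilingual encoder.
doc_vecs = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.3]]
query_tokens = ["election", "results"]
query_vec = [1.0, 0.0]

print(hybrid_rank(query_tokens, query_vec, docs_tokens, doc_vecs, alpha=0.5))
print(hybrid_rank(query_tokens, query_vec, docs_tokens, doc_vecs, alpha=0.0))
```

In this toy setup the English query shares no tokens with the Nepali-only headlines, so BM25 alone cannot surface them. The dense path is what lets the first headline rank above the unrelated one, which is the cross-lingual behavior the abstract attributes to the embedding component.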

Author Biography

Sapen Chhetri, Nepal College of Information Technology, Pokhara University

Published

2026-06-18

How to Cite

Chhetri, S., Poudyal, A., Gurung, D., Adhikari, B., & Khadka, A. (2026). A Hybrid Cross-Lingual News Aggregator for Nepali Media using mT5 and Dense Vector Embeddings. International Journal on Engineering Technology, 3(2), 62–70. https://doi.org/10.3126/injet.v3i2.95500