Extractive Nepali Question Answering System
DOI:
https://doi.org/10.3126/kjse.v9i1.78368

Keywords:
Extractive Question Answering, Low-Resource NLP, Nepali Language, BERT, SQuAD

Abstract
There is a noticeable gap in language processing tools and resources for Nepali, a language spoken by more than 17 million people [1] yet significantly underrepresented in computational linguistics. We present an Extractive Nepali Question Answering System designed to generate precise, contextually accurate responses in Nepali. To address the lack of high-quality training data, we contribute three key datasets: Nepali and Hindi translations of SQuAD 1.1, a Nepali translation of XQuAD for benchmarking, and a curated Nepali QA dataset derived from Belebele's multiple-choice (MCQ) data. To mitigate translation-induced answer span loss, we use translation-invariant tokens, improving span retention from 50% to 93%, and we evaluate translation quality through human assessment and GPT-4, confirming that answer spans are faithfully preserved. We evaluate our models on XQuAD and our curated dataset, demonstrating the effectiveness of fine-tuning multilingual models for Nepali QA. Our best-performing model achieves an exact match (EM) score of 72.99 and an F1 score of 84.13 on XQuAD-Nepali. These results establish a strong baseline for Nepali QA and highlight the impact of cross-lingual transfer from data in the same language family. All datasets and code are publicly available, encouraging further advancements in Nepali NLP research.
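To illustrate the span-retention idea the abstract describes, the following is a minimal Python sketch, assuming a marker pair that the machine-translation system passes through unchanged. The marker strings, function names, and the wrap/recover workflow are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the translation-invariant-token idea: wrap the answer
# span in markers the MT system leaves untouched, translate, then recover
# the span from the marked output. Markers and names here are assumptions,
# not the paper's implementation.

MARK_OPEN, MARK_CLOSE = "⟦", "⟧"  # assumed pass-through markers

def wrap_answer(context: str, answer_start: int, answer_text: str) -> str:
    """Insert invariant markers around the answer span in the source context."""
    end = answer_start + len(answer_text)
    return (context[:answer_start] + MARK_OPEN + answer_text + MARK_CLOSE
            + context[end:])

def recover_span(translated: str):
    """Locate the translated answer between the markers.

    Returns (clean_context, answer_text, answer_start), or None when the
    translation dropped or reordered the markers (the residual-loss case).
    """
    i = translated.find(MARK_OPEN)
    j = translated.find(MARK_CLOSE)
    if i == -1 or j == -1 or j < i:
        return None  # markers lost in translation: discard this example
    answer = translated[i + len(MARK_OPEN):j]
    clean = translated.replace(MARK_OPEN, "").replace(MARK_CLOSE, "")
    return clean, answer, i  # answer starts at index i once markers are removed

# Usage with a hypothetical `translate_ne` function and a SQuAD-style record:
#   marked = wrap_answer(context, ans["answer_start"], ans["text"])
#   result = recover_span(translate_ne(marked))
```

Under these assumptions, examples where `recover_span` returns None are discarded, which is consistent with retention rising to, rather than reaching, 100%.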