Preprocessing of Nepali News Corpus for Downstream Tasks
Keywords: Text processing, conjuncts, language models, glyphs, Nepali corpus
Text collected from online resources contains many errors, which lead to incorrect learning outcomes in automatic language learning tasks. In this paper, we discuss a Nepali text preprocessing pipeline for generating a clean corpus. The pipeline is tested using a language model to observe the impact of each step on the learning task. The relevance of this work lies in systematizing the procedure for developing a standard Nepali corpus.