Preprocessing of Nepali News Corpus for Downstream Tasks

Authors

  • Sushil Awale Integrated ICT Private LTD, Kupondole, Lalitpur, Nepal
  • Suraj Prasai Integrated ICT Private LTD, Kupondole, Lalitpur, Nepal
  • Birodh Rijal
  • Santa B. Basnet

DOI:

https://doi.org/10.3126/nl.v35i01.46553

Keywords:

Text processing, conjuncts, language models, glyphs, Nepali corpus

Abstract

Text collected from online resources introduces many errors, which result in incorrect learning outcomes in automatic language learning tasks. In this paper, we discuss a Nepali text preprocessing pipeline for generating a clean corpus. The pipeline is tested using a language model to observe the impact of each step on the learning task. The relevance of this work lies in systematizing the procedure for developing a standard Nepali corpus.

Published

2022-07-11

How to Cite

Awale, S., Prasai, S., Rijal, B., & Basnet, S. B. (2022). Preprocessing of Nepali News Corpus for Downstream Tasks. Nepalese Linguistics, 35(01), 1–6. https://doi.org/10.3126/nl.v35i01.46553

Section

Articles