Nepali Dialogue Corpus

Research Group: BBMMLL
Status: Open for further support

This project created a comprehensive Nepali dialogue corpus using weakly supervised methods to support development of dialogue-based NLP applications.

Background

Comprehensive Nepali dialogue corpora are scarce, which limits the development and evaluation of dialogue-based applications and systems. Existing corpora, where available, are often limited in size, tied to a specific domain, or lacking in diversity, which hinders their use in broader research and development. Robust methodologies for collecting dialogues in a weakly supervised manner are also absent.

Research Aim

Our goal is threefold: first, to evaluate existing Nepali dialogue corpora and identify their limitations; second, to establish a new benchmark for Nepali dialogue data by curating a diverse and comprehensive corpus that addresses those limitations; and third, to explore methodologies for collecting dialogues in a weakly supervised manner, leveraging platforms like Twitter, to enhance scalability and diversity.

Outcomes

This project developed a comprehensive Nepali dialogue corpus by reviewing existing datasets and exploring weakly supervised methods to collect diverse conversational data from platforms like Twitter. The resulting corpus serves as a benchmark for dialogue-based NLP applications in Nepali and provides validated datasets for training and evaluating machine learning models with improved generalization capabilities.

Team

Dr. Binod Bhattarai, Adj. Research Scientist