Nepali Stemmer

Wednesday, April 15, 2020 - 1 min

This is my third project on the Nepali language. It performs morphological segmentation of a given word into a stem and a suffix.

Abstract

Nepali is a morphologically rich language, which makes computational tasks harder than they are for English. Morphological segmentation therefore plays a vital role in improving downstream NLP tasks such as Named Entity Recognition, Sentiment Analysis, Coreference Resolution and many others. In this post, two methods are presented: a simple rule-based method and one based on a non-parametric Bayesian model.

Examples:

अमेरिकाद्वारा -> अमेरिका द्वारा
रामलाई -> राम लाई

Methods

Rule-based

This simple rule-based method is based on hindi-stemmer. It iteratively strips suffixes (postpositions) from the end of a word until no further suffix can be separated. We also need to provide a dictionary and lists of stems and suffixes.
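The iterative stripping described above can be sketched roughly as follows. This is a minimal illustration, not the code of hindi-stemmer or of this project: the function name `segment`, the `min_stem_len` guard, and the tiny suffix and stem sets are all assumptions made for the example.

```python
def segment(word, suffixes, stems, min_stem_len=2):
    """Split `word` into [stem, suffix, ...] by repeatedly stripping suffixes.

    `suffixes` and `stems` are hypothetical stand-ins for the suffix list
    and stem list the method expects. Stripping stops as soon as the
    remaining word is a known stem or no suffix matches any more.
    """
    found = []
    # Try longer suffixes first so a short suffix does not shadow a long one.
    ordered = sorted(suffixes, key=len, reverse=True)
    while word not in stems:
        for suffix in ordered:
            # Keep at least `min_stem_len` code points in the stem.
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
                found.insert(0, suffix)
                word = word[: -len(suffix)]
                break
        else:
            break  # no suffix matched: nothing more to separate
    return [word] + found


# Toy resources for illustration only.
SUFFIXES = {"लाई", "द्वारा", "हरू", "मा"}
STEMS = {"राम", "अमेरिका", "घर"}

print(" ".join(segment("रामलाई", SUFFIXES, STEMS)))         # राम लाई
print(" ".join(segment("अमेरिकाद्वारा", SUFFIXES, STEMS)))  # अमेरिका द्वारा
```

Checking the stem list before each strip keeps a known stem from being split further when it happens to end in something that looks like a suffix.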

Bayesian

This is an unsupervised, non-parametric Bayesian method that learns to segment words into stem and suffix from a given training corpus. It requires no extra resources such as a dictionary or hand-written rules. The method is based on the Chinese Restaurant Process. At its heart, we use a Beta-Geometric conjugate prior over the length of a given word and Gibbs sampling for inference.
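To make the conjugacy concrete, here is a small sketch of the textbook Beta-Geometric update. It is not the project's implementation: the function names, the hyperparameters `a` and `b`, and the choice of a geometric distribution over lengths k = 1, 2, ... with P(k | p) = (1 - p)^(k - 1) p are assumptions for illustration.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def posterior_params(a, b, lengths):
    """Beta(a, b) prior on p plus observed lengths gives a Beta posterior.

    Each observation counts as one "success" and (k - 1) "failures".
    """
    return a + len(lengths), b + sum(k - 1 for k in lengths)


def predictive(k, a, b):
    """P(length = k) with p integrated out under Beta(a, b)."""
    return exp(log_beta(a + 1, b + k - 1) - log_beta(a, b))


# Hypothetical usage: start from a uniform prior and observe three lengths.
a_post, b_post = posterior_params(1.0, 1.0, [3, 4, 5])
print(a_post, b_post)
print(predictive(4, a_post, b_post))
```

Because the posterior stays in the Beta family, the success probability can be integrated out in closed form, which is what keeps the conditional probabilities needed by a Gibbs sampler cheap to compute.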

GitHub

Deployment

Rule-based
- A simple Flask-based web app deployed on the Heroku platform
- A PyPI package is also available for convenience
Bayesian
- This is also a Flask-based web app, but deployed on AWS Elastic Beanstalk through a CI/CD pipeline using GitHub Actions
- Note: The deployed version may not produce the desired results, as the hyperparameters still need more tuning

Final Note:

This project is limited to single-split segmentation and inflectional morphemes.
The rule-based stemmer is a hobby project, and the Bayesian project was completed with the help of Sushil Awale during my summer internship at NAAMII.
There are no publications for this project.