Nepali Stemmer

This is my third project on the Nepali language. It performs morphological segmentation, splitting a given word into its stem and suffix.

Abstract

Nepali is a morphologically rich language, which makes many computational tasks harder than they are for English. Morphological segmentation therefore plays a vital role in improving downstream NLP tasks such as named entity recognition, sentiment analysis, coreference resolution, and many others. This post presents two methods: a simple rule-based method and one based on a non-parametric Bayesian model.

Examples:

Methods

Rule-based

This simple rule-based method is based on hindi-stemmer. It iteratively strips suffixes (postpositions) from a word until no further suffix can be separated. It requires a dictionary, a stem list, and a suffix list to be provided.
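
To make the stripping loop concrete, here is a minimal sketch of iterative suffix removal. It is not the project's actual code: the function name `strip_suffixes`, the tiny `SUFFIXES` and `KNOWN_STEMS` lists, and the `min_stem_len` guard are all assumptions made for illustration.

```python
# Illustrative resources only. The project's real dictionary, stem list,
# and suffix list are not shown in this post, so these are assumptions.
SUFFIXES = ["हरू", "लाई", "मा", "ले", "को"]  # plural marker and common postpositions
KNOWN_STEMS = {"घर"}  # hypothetical stem list


def strip_suffixes(word, suffixes=SUFFIXES, known_stems=KNOWN_STEMS, min_stem_len=2):
    """Return (stem, suffixes in surface order) by repeated suffix stripping."""
    ordered = sorted(suffixes, key=len, reverse=True)  # try longer suffixes first
    stripped = []
    # Stop as soon as the remainder is a known stem.
    while word not in known_stems:
        for suffix in ordered:
            # Only strip if enough characters remain to form a stem.
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
                word = word[: -len(suffix)]
                stripped.insert(0, suffix)  # keep suffixes in left-to-right order
                break
        else:
            break  # no suffix matched, nothing more to separate
    return word, stripped


print(strip_suffixes("घरहरूमा"))
```

With these toy lists, घरहरूमा ("in the houses") is split into the stem घर followed by the suffixes हरू and मा. The length guard counts Unicode code points, which is a simplification for Devanagari.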

Bayesian

This is an unsupervised, non-parametric Bayesian method that learns to segment words into stem and suffix from a given training corpus. It requires no extra resources such as a dictionary or hand-written rules. The method is based on the Chinese Restaurant Process. At its heart, it uses a beta-geometric conjugate prior over the length of a given word and Gibbs sampling for inference.
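
The sketch below shows one way these pieces can fit together. It is a simplified illustration, not the model used in this project. It assumes one Chinese Restaurant Process for stems and one for suffixes, a base distribution that combines a beta-geometric length prior with uniform characters, and a Gibbs step that resamples one split point per word. All names (`beta_geometric_pmf`, `CRP`, `gibbs_segment`) and hyperparameters (`alpha`, `a`, `b`) are hypothetical.

```python
import math
import random
from collections import Counter


def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_geometric_pmf(k, a=1.0, b=1.0):
    """P(k) for k = 0, 1, 2, ...: geometric with its parameter integrated out under Beta(a, b)."""
    return math.exp(log_beta(a + 1, b + k) - log_beta(a, b))


class CRP:
    """Count-based CRP predictive: (count + alpha * base) / (total + alpha)."""

    def __init__(self, alpha, alphabet_size, min_len):
        self.alpha, self.alphabet_size, self.min_len = alpha, alphabet_size, min_len
        self.counts, self.total = Counter(), 0

    def base(self, s):
        # Length prior times a uniform choice for each character (an assumption).
        return beta_geometric_pmf(len(s) - self.min_len) * self.alphabet_size ** (-len(s))

    def prob(self, s):
        return (self.counts[s] + self.alpha * self.base(s)) / (self.total + self.alpha)

    def add(self, s):
        self.counts[s] += 1
        self.total += 1

    def remove(self, s):
        self.counts[s] -= 1
        self.total -= 1
        if self.counts[s] == 0:
            del self.counts[s]


def gibbs_segment(words, iterations=50, alpha=1.0, seed=0):
    """Sample a (stem, suffix) split for each non-empty word; the suffix may be empty."""
    rng = random.Random(seed)
    alphabet_size = len({ch for w in words for ch in w})
    stems = CRP(alpha, alphabet_size, min_len=1)
    suffixes = CRP(alpha, alphabet_size, min_len=0)
    splits = []
    for w in words:  # random initial segmentation
        k = rng.randint(1, len(w))
        splits.append(k)
        stems.add(w[:k])
        suffixes.add(w[k:])
    for _ in range(iterations):
        for i, w in enumerate(words):
            # Remove this word's current analysis from the counts.
            stems.remove(w[: splits[i]])
            suffixes.remove(w[splits[i]:])
            # Score every split point given all other words, then sample one.
            points = range(1, len(w) + 1)
            weights = [stems.prob(w[:j]) * suffixes.prob(w[j:]) for j in points]
            k = rng.choices(points, weights=weights)[0]
            splits[i] = k
            stems.add(w[:k])
            suffixes.add(w[k:])
    return [(w[:k], w[k:]) for w, k in zip(words, splits)]


# Toy romanized corpus, for illustration only.
toy_words = ["gharma", "gharko", "gharle", "kitabma", "kitabko", "kitable"]
for stem, suffix in gibbs_segment(toy_words, iterations=100):
    print(stem, "+", suffix)
```

The output is a sample from the posterior, so it varies with the seed and the corpus. On a corpus this small the splits may not be linguistically sensible.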

GitHub

Deployment

References

Final Note: