This is my third project on the Nepali language. It performs morphological segmentation of a given word into a stem and a suffix.
Nepali is a morphologically rich language, which makes many computational tasks harder than they are for English. Morphological segmentation therefore plays a vital role in improving downstream NLP tasks such as named entity recognition, sentiment analysis, coreference resolution, and many others. This post presents two methods: a simple rule-based method and a method based on a non-parametric Bayesian model.
Examples:
This simple rule-based method is based on hindi-stemmer. It iteratively strips suffixes (postpositions) from the end of a word until no further separation is possible. It requires a dictionary, a stem list, and a suffix list as input.
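The iterative stripping loop can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function name `segment`, the longest-suffix-first ordering, and the tiny romanized stem and suffix lists are all assumptions made for the example, and the dictionary and stem list are merged into a single `stems` set for brevity.

```python
def segment(word, stems, suffixes):
    """Split `word` into (stem, [suffixes]) by stripping known suffixes from the right.

    `stems` and `suffixes` are illustrative stand-ins for the real resource lists.
    """
    stripped = []
    # Try longer suffixes first so a short suffix does not shadow a longer one.
    ordered = sorted(suffixes, key=len, reverse=True)
    # Stop as soon as the remainder is a known stem.
    while word not in stems:
        for suffix in ordered:
            # Require a non-empty remainder so the word is never consumed entirely.
            if word.endswith(suffix) and len(word) > len(suffix):
                word = word[: -len(suffix)]
                stripped.insert(0, suffix)  # keep suffixes in surface order
                break
        else:
            break  # no suffix matched, so nothing more can be separated
    return word, stripped


# Hypothetical romanized data: "ghar" (house), "haru" (plural), "ma" (in).
stem, parts = segment("gharharuma", {"ghar"}, ["haru", "ma", "le", "lai", "ko"])
print(stem, parts)
```

Checking the remainder against the stem list before each stripping step is one way to avoid over-stripping a stem that happens to end in a suffix-like string. Whether the original method does this in the same order is not stated here.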
This is an unsupervised non-parametric Bayesian method that learns to segment words into stem and suffix from a training corpus. It requires no extra resources such as a dictionary or hand-written rules. The method is based on the Chinese Restaurant Process. At its heart, it uses a beta-geometric conjugate prior over the length of a given word and Gibbs sampling for inference.
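To make these ingredients concrete, here is a small sketch of one way they could fit together. It is an assumption-laden illustration and not the project's actual model. In this sketch, stems and suffixes each get their own Chinese Restaurant Process, and the base distribution scores a new morph by a beta-geometric length probability times uniform characters. Each Gibbs step resamples the split point of one word given all the others. All function names, the Beta(1, 1) hyperparameters, and `alpha` are hypothetical choices.

```python
import math
import random
from collections import Counter


def log_beta(x, y):
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def beta_geometric_pmf(k, a=1.0, b=1.0):
    """P(length = k), k = 0, 1, 2, ..., with the geometric parameter
    integrated out under a Beta(a, b) prior."""
    return math.exp(log_beta(a + 1.0, b + k) - log_beta(a, b))


def base_prob(morph, alphabet_size):
    """Base distribution: beta-geometric length times uniform characters."""
    return beta_geometric_pmf(len(morph)) * (1.0 / alphabet_size) ** len(morph)


def crp_prob(morph, counts, total, alpha, alphabet_size):
    """CRP predictive probability: reuse a morph in proportion to its count,
    or generate it afresh from the base distribution."""
    return (counts[morph] + alpha * base_prob(morph, alphabet_size)) / (total + alpha)


def gibbs_sweep(words, splits, stem_counts, suffix_counts, alpha, alphabet_size, rng):
    """Resample the split point of every word once, in place."""
    total = len(words) - 1  # every other word contributes one stem and one suffix
    for i, word in enumerate(words):
        # Remove word i's current analysis from the counts.
        stem_counts[word[: splits[i]]] -= 1
        suffix_counts[word[splits[i]:]] -= 1
        # Score every split: the stem must be non-empty, the suffix may be empty.
        points = list(range(1, len(word) + 1))
        weights = [
            crp_prob(word[:k], stem_counts, total, alpha, alphabet_size)
            * crp_prob(word[k:], suffix_counts, total, alpha, alphabet_size)
            for k in points
        ]
        splits[i] = rng.choices(points, weights=weights)[0]
        # Add the newly sampled analysis back.
        stem_counts[word[: splits[i]]] += 1
        suffix_counts[word[splits[i]:]] += 1


def segment_corpus(words, iterations=50, alpha=1.0, seed=0):
    """Run Gibbs sampling over non-empty `words`; return (stem, suffix) pairs."""
    rng = random.Random(seed)
    alphabet_size = len(set("".join(words)))
    splits = [rng.randint(1, len(w)) for w in words]  # random initial splits
    stem_counts = Counter(w[:k] for w, k in zip(words, splits))
    suffix_counts = Counter(w[k:] for w, k in zip(words, splits))
    for _ in range(iterations):
        gibbs_sweep(words, splits, stem_counts, suffix_counts, alpha, alphabet_size, rng)
    return [(w[:k], w[k:]) for w, k in zip(words, splits)]


# Hypothetical romanized toy corpus; the sampled output is stochastic.
corpus = ["gharma", "gharko", "kitabma", "kitabko", "ghar", "kitab"]
for stem, suffix in segment_corpus(corpus):
    print(stem, suffix)
```

Because the sampler is stochastic and the toy corpus is tiny, the printed segmentations are only one sample and can differ between seeds. A real run would use a much larger corpus and typically discard early iterations as burn-in.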