Technocrazian Hack the Code

GSoC Community Bonding - VibhakthiGenerator


The community bonding period of Google Summer of Code 2016 comes to an end today. During this period, participants were expected to mingle with the community, get to know its basic workflow, make small contributions to the community's projects, and understand its coding standards, deployment models, version control schemes, communication model, and so on.


I had been contributing to the community before GSoC, in the form of localization, packaging, etc. Still, as a contribution during the GSoC community bonding period, I have written a small Python library that can be used to generate different Vibhakthi forms of Malayalam words (like രാമൻ -> രാമന്റെ, രാമനെ, രാമനോട്, രാമനാൽ etc). The project was inspired by Santhoshettan's similar project using the jQuery.i18n library. The code is available in my personal repo now, and I will be pushing it to the organization's repo soon. I am yet to make it follow libindic's directory structure.

The library currently uses a rule-based approach to generate the Vibhakthi forms, which is not 100% accurate. It fails on words ending with letters (usually Chillu characters) whose base forms are ambiguous. For example, words ending with ർ can have either ര or റ as the base form. Different words ending with ർ that have a similar structure give different results when the same vibhakthi is applied.

Example:
അനിവർ + സംബന്ധിക = അനിവറിന്റെ
മലർ + സംബന്ധിക = മലരിന്റെ
കൗരവർ + സംബന്ധിക = കൗരവരുടെ
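To make the ambiguity concrete, here is a minimal sketch of how a rule table for the സംബന്ധിക form might look. This is not the actual code of my library; the names `RULES` and `genitive_candidates` are hypothetical, and the table covers only the two chillu endings discussed above. Code points are written as escapes so that the encoding of the chillu characters is unambiguous.

```python
# Hypothetical rule table for the sambandhika (genitive) form.
# Each chillu ending maps to a list of (base consonant, suffix) pairs.
GEN_AFTER_N = "\u0d4d\u0d31\u0d46"          # ്റെ
GEN_INTE = "\u0d3f\u0d28\u0d4d\u0d31\u0d46"  # ിന്റെ
GEN_UDE = "\u0d41\u0d1f\u0d46"               # ുടെ

RULES = {
    # ൻ has a single base form, ന
    "\u0d7b": [("\u0d28", GEN_AFTER_N)],
    # ർ is ambiguous: base ര or റ, and the suffix differs too
    "\u0d7c": [
        ("\u0d30", GEN_INTE),  # as in മലർ -> മലരിന്റെ
        ("\u0d31", GEN_INTE),  # as in അനിവർ -> അനിവറിന്റെ
        ("\u0d30", GEN_UDE),   # as in കൗരവർ -> കൗരവരുടെ
    ],
}


def genitive_candidates(word):
    """Return every genitive form the rules allow for `word`.

    Words whose last character has no rule yield an empty list.
    """
    rules = RULES.get(word[-1:], [])
    return [word[:-1] + base + suffix for base, suffix in rules]


raman = "\u0d30\u0d3e\u0d2e\u0d7b"  # രാമൻ
malar = "\u0d2e\u0d32\u0d7c"        # മലർ
print(genitive_candidates(raman))
print(genitive_candidates(malar))
```

For a word ending in ൻ the table gives exactly one form, but for a word ending in ർ it can only list the candidates. Picking the right one needs knowledge about the word itself, which the rules do not have.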

(Thanks to Santhoshettan for the following info) This shows a drawback of the rule-based method: we need to develop a method that also considers word etymology. That can be expected once Machine Learning techniques become clearer and more usable for Malayalam. For now, since no such library exists, I follow the principle that "99% is better than 0%" and guess the library is worth using until we find something better. I plan to use this library during the development of the spell checker (I will post its proposal in detail soon), which I still have to dig into further. For now, you can try out the library at this online demo. I would be happy if you could test it and report issues and suggestions.


Also, to handle inflections, the spell checker should have a stemming phase in between. So I read the stemmer code of libindic and found that it is actually halfway between a stemmer and a lemmatizer (and I intend to keep it as such). However, the existing method is unreliable, producing many false positives, and has to be improved. I intend to follow the rule-based approach for it, with an option to crowdsource the root word corpus. This will be the first phase of the project.
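As a rough idea of what I have in mind, the sketch below strips a known suffix and accepts the result only if it appears in a root word corpus. This is an assumption about the design, not libindic's existing stemmer code; `SUFFIX_RULES`, `stem` and the tiny `roots` set are hypothetical names and data used only for illustration.

```python
# Hypothetical suffix rules: (inflected ending, replacement ending).
# Only the ൻ-ending forms of രാമൻ from earlier in this post are covered.
SUFFIX_RULES = [
    ("\u0d28\u0d4d\u0d31\u0d46", "\u0d7b"),  # ന്റെ -> ൻ
    ("\u0d28\u0d4b\u0d1f\u0d4d", "\u0d7b"),  # നോട് -> ൻ
    ("\u0d28\u0d3e\u0d7d", "\u0d7b"),        # നാൽ -> ൻ
    ("\u0d28\u0d46", "\u0d7b"),              # നെ -> ൻ
]


def stem(word, roots):
    """Return the root of `word` if a rule yields a known root, else `word`.

    Longer suffixes are tried first. A candidate is accepted only when it
    is present in `roots`, which is meant to cut down false positives.
    """
    for suffix, replacement in sorted(
        SUFFIX_RULES, key=lambda rule: len(rule[0]), reverse=True
    ):
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in roots:
                return candidate
    return word


# A stand-in for the crowdsourced root word corpus.
roots = {"\u0d30\u0d3e\u0d2e\u0d7b"}  # രാമൻ
print(stem("\u0d30\u0d3e\u0d2e\u0d28\u0d4d\u0d31\u0d46", roots))  # input: രാമന്റെ
```

The corpus lookup is the important part here: a rule that matches by accident leaves the word unchanged instead of producing a wrong root, so the quality of the stemmer grows with the size of the corpus.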