
GSoC Update: Week #1 and #2

Heyo,

The last two weeks of GSoC mostly involved working on the Stemmer module of my proposal. I had discussions with Hrishi and Vasudev about the directory structure to use for the stemmer. I proposed the .. format because it gives more visibility to libindic. Since both of them agreed (I will be converting all the existing modules to this format after GSoC), I first ported the existing indicstemmer module to this directory structure. With directions and suggestions from Vasudev and Hrishi, we (Jerin, who is working on the Sandhi Splitter, and I) set up the development environment: pbr as the packaging tool, testtools as the testing framework, Travis CI for continuous integration, and tox for local automation and testing.
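For reference, a minimal tox.ini along these lines looks roughly like the sketch below; the exact environments and dependency list in the repository may differ.

```ini
# Minimal sketch of a tox configuration for running the test suite locally.
[tox]
envlist = py27,py34

[testenv]
deps =
    testtools
commands =
    python -m unittest discover
```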

Work on Stemmer

There are several problems with the existing stemmer implementation. One is the high count of false positives. This is because Malayalam has root words that match the structure of an inflected word. An example is ആപത്ത്, which looks similar to എറണാകുളത്ത്. The former is a root word, whereas the latter is an inflected word. So, based on the stemmer rule ത്ത്=ം, which is used to handle എറണാകുളത്ത്, ആപത്ത് will also get stemmed to ആപം, which is a false positive. What we need is a root word corpus containing the possible root words in Malayalam (we will need some crowdsourcing to keep it updated), so that the input word can be checked against it to detect whether it is already a root word.
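A minimal sketch of that guard, with a toy rule table and corpus; the names `rules` and `root_words` are illustrative, not the actual module API.

```python
# -*- coding: utf-8 -*-
# Toy rule table and root-word corpus; the real stemmer loads these from
# its rules file and corpus, respectively.
rules = {u'ത്ത്': u'ം'}                     # suffix -> replacement
root_words = {u'ആപത്ത്', u'എറണാകുളം'}        # tiny stand-in for the corpus


def stem_once(word):
    """Apply a single rule, but only if the word is not already a root word."""
    if word in root_words:                  # root words are returned unchanged
        return word
    for suffix, replacement in rules.items():
        if word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word


print(stem_once(u'എറണാകുളത്ത്'))   # എറണാകുളം -- inflected word, gets stemmed
print(stem_once(u'ആപത്ത്'))        # ആപത്ത് -- root word, left untouched
```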

Another problem with the existing stemmer is that it cannot handle multiple levels of inflection. An example is അതിലേക്ക് (into that) : അതിൽ + ഏക്ക് : അത് (that) + ഇൽ (in) + ഏക്ക് (to). We need a multi-level suffix stripping algorithm to handle such cases. I wrote an iterative suffix stripping algorithm that keeps stripping and transforming suffixes until a root word is encountered or no rule matches.
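Roughly, the iteration looks like the sketch below. The rule table here is a toy one I made up to show a plural + locative chain (കാളകളിൽ → കാളകൾ → കാള), and a plain set stands in for the root word corpus; the real rules and corpus are larger.

```python
# -*- coding: utf-8 -*-
# Toy rules: strip the locative ending first, then the plural marker.
rules = {u'ളിൽ': u'ൾ', u'കൾ': u''}
root_words = {u'കാള'}


def stem(word):
    """Iteratively strip suffixes until a root word is found or no rule matches."""
    while word not in root_words:
        for suffix, replacement in rules.items():
            if word.endswith(suffix):
                stripped = word[:-len(suffix)] + replacement
                break
        else:                         # no rule matched: give up
            return word
        if stripped == word:          # guard against rules that change nothing
            return word
        word = stripped
    return word


print(stem(u'കാളകളിൽ'))   # കാള, after two levels of stripping
```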

Since a linear list is obviously a poor choice for storing a large dataset, and tries are well suited to textual data, I decided to go with a trie for storing the root word corpus. Tailoring a data structure to suit my needs is one of the last tasks in my GSoC proposal, so for now I used an existing trie implementation, marisa-trie, which is available as a Python library.
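A quick sketch of how the corpus lookup can work with marisa-trie; the one-word-per-line corpus file format is an assumption on my part.

```python
# -*- coding: utf-8 -*-
import codecs

import marisa_trie


def load_corpus(path):
    """Build a trie from a UTF-8 corpus file with one root word per line."""
    with codecs.open(path, encoding='utf-8') as corpus:
        words = [line.strip() for line in corpus if line.strip()]
    return marisa_trie.Trie(words)


# Inline example instead of a file, just to show the membership check.
root_words = marisa_trie.Trie([u'ആപത്ത്', u'കാള', u'അത്'])
print(u'ആപത്ത്' in root_words)     # True  -> root word, do not stem
print(u'അതിലേക്ക്' in root_words)   # False -> candidate for suffix stripping
```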

I have added tests for all the 7 vibhakthis, the conjunctive ('and') form of words (അവനും, രാമുവും), some of the plural forms (കാളകൾ), etc., and the test coverage as of now is 100%.
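A hypothetical test in the same spirit, written with testtools; the import path, class name, and the stem() return type (a dict mapping each input word to its stem) are assumptions about the module's API.

```python
# -*- coding: utf-8 -*-
from testtools import TestCase

# Assumed import path and class name for the ported stemmer module.
from libindic.stemmer import Malayalam


class LocativeStemTest(TestCase):
    def test_locative(self):
        stemmer = Malayalam()
        # stem() is assumed to return {input_word: stemmed_word}.
        result = stemmer.stem(u'എറണാകുളത്ത്')
        self.assertEqual(u'എറണാകുളം', result[u'എറണാകുളത്ത്'])
```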

As always, the code and updates are available in the GitHub repo.