GSoC Update: Week #3 and #4

Balasankar C

GSoC

Jun 23, 2016

Heyo,

[Sorry for the delay in the post]

I spent the last two weeks mainly testing the stemmer module and the rules defined for it. In the process, I found that a rule-based model runs into many issues, because different types of inflection applied to different parts of speech can yield the same inflected form. This can only be solved by a machine learning algorithm that incorporates a morphological analyzer, which is outside the scope of my proposal. So I decided to move forward with the stemmer.
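A toy sketch of that ambiguity, using English suffixes rather than the actual Malayalam rules in indicstemmer. The rule table and the function name `candidate_analyses` are hypothetical and only meant to show why suffix rules alone cannot pick the right stem:

```python
# Hypothetical rule table, NOT the real indicstemmer rules:
# (suffix, replacement, part of speech, inflection type)
RULES = [
    ("ves", "f", "noun", "plural"),
    ("s", "", "verb", "third-person singular"),
]


def candidate_analyses(word):
    """Return every (stem, pos, inflection) that the rules allow for `word`."""
    analyses = []
    for suffix, replacement, pos, inflection in RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            analyses.append((stem, pos, inflection))
    return analyses


# Both rules match the same surface form, but they give different stems.
print(candidate_analyses("leaves"))
# [('leaf', 'noun', 'plural'), ('leave', 'verb', 'third-person singular')]
```

Without knowing the part of speech from context, nothing in the rules themselves says whether "leaf" or "leave" is the correct stem.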

I tried to handle verb inflections, such as tense changes, using rules, and got a subset of them working. The remaining forms need more careful analysis, so I've decided to get the system working first and optimize it later.

I've also decided to tag the rules so that the history of stemming can be preserved. The stemmer will now output the stem along with the tags of the rules that were applied. This metadata can help with a problem I faced while developing VibhakthiGenerator, where the same letter can be inflected into different forms.
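A minimal sketch of the tagging idea, again with toy English rules instead of the real ones. The tag names, the `TAGGED_RULES` table, `MIN_STEM_LENGTH` and `stem_with_tags` are all assumptions made for illustration and are not taken from the indicstemmer code:

```python
# Hypothetical tagged rules, tried in order: (tag, suffix, replacement)
TAGGED_RULES = [
    ("POSSESSIVE", "'s", ""),
    ("PLURAL", "s", ""),
    ("AGENT", "er", ""),
]

# Assumed guard so that short words are not stripped down to nothing.
MIN_STEM_LENGTH = 3


def stem_with_tags(word):
    """Strip suffixes repeatedly and record the tag of each rule applied."""
    tags = []
    changed = True
    while changed:
        changed = False
        for tag, suffix, replacement in TAGGED_RULES:
            candidate = word[: len(word) - len(suffix)] + replacement
            if word.endswith(suffix) and len(candidate) >= MIN_STEM_LENGTH:
                word = candidate
                tags.append(tag)
                changed = True
                break  # restart from the first rule on the shortened word
    return word, tags


print(stem_with_tags("teacher's"))  # ('teach', ['POSSESSIVE', 'AGENT'])
print(stem_with_tags("teachers"))   # ('teach', ['PLURAL', 'AGENT'])
```

Both words end up with the same stem, but the tag list records how each one got there, which is the kind of history a generator could use later to rebuild the right inflected form.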

I also spent some time cleaning up the code further and setting up local testing tools, such as a CLI and a web interface.

The PR was accepted by Vasudev, and the changes are now part of the indicstemmer codebase.

BTW, it is time for the midterm evaluations of GSoC 2016, where mentors evaluate the progress of their students and give them a pass/fail grade. Students, in turn, get to evaluate their mentors, including the communication with them and the input they provide. I have already completed mine and am waiting for my mentor to finish theirs. Hopefully, everything will go well.