Hemantakumar Hegde
2020-03-21
Data source
Development
…continued
Choices to make about removing sparse words, foreign-language words and misspellings; character encoding; which features to keep or drop; treatment of punctuation and accented letters; preserving capitalization; etc.
Starting to think about the required algorithm from scratch!
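As a rough illustration of the cleaning choices listed above, here is a minimal Python sketch. The function names (`normalize`, `drop_sparse`), the flags, and the `min_count` threshold are hypothetical and chosen only for this example; the slides do not say which options were finally selected.

```python
import re
import unicodedata
from collections import Counter

def normalize(text, lowercase=True, strip_accents=True):
    """Apply one possible set of cleaning choices and return word tokens."""
    if strip_accents:
        # Decompose accented letters, then drop the combining marks.
        text = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
    if lowercase:
        text = text.lower()
    # Replace punctuation with spaces, but keep apostrophes (e.g. "it's").
    text = re.sub(r"[^\w\s']", " ", text)
    return text.split()

def drop_sparse(token_lists, min_count=2):
    """Remove words seen fewer than min_count times across all sentences."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return [[t for t in toks if counts[t] >= min_count] for toks in token_lists]
```

Each choice is a flag or a threshold here, so the effect of keeping capitalization or accents can be compared without rewriting the pipeline.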
Performance
Ended up with an algorithm which started as a Markov chain and is now similar to the Katz backoff model! How my algorithm works:
Tried creating n2gram, n3gram, and so on up to n6gram sets of word tokens (created only within individual sentences), but ONLY using the n6grams now, as I noticed the others were redundant.
The algorithm tokenizes the input text and stems it (as the training data features were also stemmed).
Then it tries to match up to 5 words of the input (in sequence) to the stored n6grams. If it cannot match all 5 words, it falls back to matching only 4, and so on, down to just 1.
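The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the author's implementation: the slides do not say how shorter contexts are matched against the 6-grams, so the sketch assumes the last k input words are compared with the k words immediately before the final word of each 6-gram. The `stem` function is a toy placeholder for a real stemmer, and all names (`build_index`, `predict`, `N`) are hypothetical.

```python
import re
from collections import Counter, defaultdict

N = 6  # only 6-grams are stored, as described in the slides

def stem(word):
    # Toy placeholder stemmer (assumption): strips a few common suffixes.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    return [stem(w) for w in re.findall(r"[a-z']+", text.lower())]

def build_index(corpus):
    """Index each 6-gram by the 1..5 words that precede its final word."""
    index = defaultdict(Counter)
    # N-grams are built within individual sentences only.
    for sentence in re.split(r"[.!?]+", corpus):
        tokens = tokenize(sentence)
        for i in range(len(tokens) - N + 1):
            *context, target = tokens[i:i + N]
            for k in range(1, N):
                index[tuple(context[-k:])][target] += 1
    return index

def predict(index, text, top=3):
    """Try the last 5 input words first, then back off to 4, 3, 2, 1."""
    tokens = tokenize(text)
    for k in range(min(N - 1, len(tokens)), 0, -1):
        candidates = index.get(tuple(tokens[-k:]))
        if candidates:
            return [word for word, _ in candidates.most_common(top)]
    return []  # no context of any length matched

corpus = ("I would like to go home now. We would like to go home today. "
          "They want to go out tonight.")
index = build_index(corpus)
print(predict(index, "I would like to go"))  # full 5-word context matches
print(predict(index, "she wanted to go"))    # backs off to a 2-word context
```

Unlike a full Katz backoff model, this sketch does no discounting or probability redistribution; it only ranks candidates by raw count at the longest context that matches.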
Thank you