Algorithm
- Bigrams, trigrams, 4-grams and 5-grams were extracted from the corpus
- They were combined into one dataframe
- The data.table package was used
- N-grams with a count below 5 were discarded
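The counting and pruning steps above can be sketched as follows. This is a minimal Python illustration, not the original R/data.table code; the function names `count_ngrams` and `prune` and the toy token list are assumptions made for the example.

```python
from collections import Counter

def count_ngrams(tokens, orders=(2, 3, 4, 5)):
    """Count every n-gram of the given orders in a list of tokens."""
    counts = Counter()
    for n in orders:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def prune(counts, min_count=5):
    """Keep only n-grams seen at least min_count times."""
    return {ngram: c for ngram, c in counts.items() if c >= min_count}

# Toy corpus (hypothetical): one phrase repeated, plus a rare variant.
tokens = ("thanks for the follow " * 6 + "thanks for the help").split()
table = prune(count_ngrams(tokens))
```

All orders end up in one table keyed by the n-gram itself, which mirrors the single combined dataframe described above. Rare n-grams such as ("the", "help") are dropped by the threshold.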
Algorithm (continued)
- The next word was predicted using Stupid Backoff
- The backoff factor (lambda) was set to 0.4
- For out-of-vocabulary words, a default prediction was made
- Top 5 words with scores are displayed in the app
- Prediction takes around 1 second
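A minimal Python sketch of Stupid Backoff scoring with lambda = 0.4 is shown here for illustration. It assumes the count table also holds unigram counts (needed as denominators for bigram scores), and the function name `predict`, the default word "the" and the toy counts are hypothetical rather than taken from the app.

```python
def predict(context, counts, k=5, lam=0.4, default="the", max_order=5):
    """Return up to k (word, score) pairs ranked by Stupid Backoff score."""
    # Only the last max_order - 1 words can match a stored n-gram.
    context = tuple(context)[-(max_order - 1):]
    scores = {}
    weight = 1.0
    while context:
        denom = counts.get(context, 0)
        if denom:
            for ngram, c in counts.items():
                if len(ngram) == len(context) + 1 and ngram[:-1] == context:
                    # setdefault keeps the score from the longest context.
                    scores.setdefault(ngram[-1], weight * c / denom)
        context = context[1:]   # back off to a shorter context
        weight *= lam           # each backoff step costs a factor lambda
    if not scores:
        # Nothing matched (e.g. out-of-vocabulary input): default prediction.
        return [(default, 0.0)]
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

# Toy counts (hypothetical), including unigrams for the denominators.
counts = {
    ("for",): 10, ("the",): 20,
    ("for", "the"): 8, ("the", "first"): 9, ("the", "same"): 5,
    ("for", "the", "first"): 6,
}
top = predict(["thanks", "for", "the"], counts)
```

In this example "first" ranks above "same" because it is found after one backoff step, while "same" is only reached after two. The linear scan over the table keeps the sketch short; an index keyed by context would avoid it.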
Further Improvements
- Only about 5% of the data could be used due to RAM constraints
- The speed could be improved by precomputing scores as well
- 6-grams or higher orders could be considered
- Kneser-Ney smoothing or other algorithms could be used
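One way the precomputation idea could look is sketched below in Python. It is an assumption about how such a table might be built, not something the app does: the helper `precompute_top_k` and the toy counts are hypothetical, and the backoff factor would still be applied at query time.

```python
from collections import defaultdict
import heapq

def precompute_top_k(counts, k=5):
    """Map each context to its k best (word, relative frequency) pairs."""
    by_context = defaultdict(list)
    for ngram, c in counts.items():
        context, word = ngram[:-1], ngram[-1]
        # Skip unigrams (empty context) and contexts with no stored count.
        if context and context in counts:
            by_context[context].append((word, c / counts[context]))
    return {
        ctx: heapq.nlargest(k, cands, key=lambda wc: wc[1])
        for ctx, cands in by_context.items()
    }

# Toy counts (hypothetical), including unigrams for the denominators.
counts = {
    ("for",): 10, ("the",): 20,
    ("for", "the"): 8, ("the", "first"): 9, ("the", "same"): 5,
    ("for", "the", "first"): 6,
}
top_k = precompute_top_k(counts)
```

With such a table, a prediction becomes one dictionary lookup per backoff level instead of a scan over all n-grams.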