Siti Salwani Yaacob
4th June 2020
Three large files were downloaded and extracted from the SwiftKey dataset. Each file was reduced to 20k, and the reduced files were combined into one large corpus. The corpus was then cleaned by removing unneeded text characteristics.
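The preparation steps above can be sketched roughly as follows. This is a Python illustration only, not the code used for this project. It assumes that "20k" means a number of lines per file and that cleaning means lowercasing and dropping digits and punctuation; the text states neither. The names `sample_lines`, `clean`, and `build_corpus` are hypothetical.

```python
import random
import re

def sample_lines(lines, k, seed=42):
    """Return at most k lines, chosen at random (seed makes it repeatable)."""
    lines = list(lines)
    if len(lines) <= k:
        return lines
    return random.Random(seed).sample(lines, k)

def clean(text):
    """Assumed cleaning: lowercase, keep only letters, apostrophes and spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)    # digits, punctuation, symbols -> space
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace

def build_corpus(sources, k):
    """Sample each source down to k lines, clean them, and pool the results."""
    corpus = []
    for lines in sources:
        corpus.extend(clean(line) for line in sample_lines(lines, k))
    return [line for line in corpus if line]  # drop lines left empty by cleaning
```

Calling `build_corpus([blog_lines, news_lines, twitter_lines], 20000)` with three lists of strings would return one pooled list of cleaned lines; the three list names are placeholders.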
The cleaned data were then tokenised three times with RWeka: once each for 1-grams, 2-grams, and 3-grams.
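For readers unfamiliar with n-gram tokenisation, here is a minimal Python sketch of the idea. It is not RWeka and does not reproduce its options; it assumes simple whitespace splitting, and `ngrams` and `count_ngrams` are hypothetical names.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(lines, max_n=3):
    """Count 1-grams up to max_n-grams over a list of text lines."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = line.split()  # assumption: tokens are whitespace-separated
        for n in counts:
            counts[n].update(ngrams(tokens, n))
    return counts
```

With the lines `"the cat sat"` and `"the cat ran"`, the 2-gram `("the", "cat")` is counted twice and each 3-gram once.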
The algorithm predicts the next word from the last 3 text inputs the user entered. It searches the 3-grams first; if no next word is found there, it tries the 2-grams, then the 1-grams. If nothing is found, it falls back to a “default”: the most frequently seen word.
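The back-off search described above might look something like the sketch below. It is a Python illustration under stated assumptions, not the project's implementation: a 3-gram lookup is assumed to match on the two preceding words and a 2-gram lookup on the one preceding word, and the 1-gram step and the "default" are collapsed into a single fallback to the most frequent word. `build_model` and `predict_next` are hypothetical names.

```python
from collections import Counter, defaultdict

def build_model(lines):
    """Build next-word tables keyed by the preceding one or two words."""
    unigrams = Counter()
    bigrams = defaultdict(Counter)   # (w1,)    -> counts of the following word
    trigrams = defaultdict(Counter)  # (w1, w2) -> counts of the following word
    for line in lines:
        t = line.split()
        unigrams.update(t)
        for i in range(len(t) - 1):
            bigrams[(t[i],)][t[i + 1]] += 1
        for i in range(len(t) - 2):
            trigrams[(t[i], t[i + 1])][t[i + 2]] += 1
    return unigrams, bigrams, trigrams

def predict_next(text, model):
    """Try the 3-gram table, then the 2-gram table, then the most common word."""
    unigrams, bigrams, trigrams = model
    words = text.lower().split()
    for table, n in ((trigrams, 2), (bigrams, 1)):
        context = tuple(words[-n:])
        if len(context) == n and context in table:
            return table[context].most_common(1)[0][0]
    top = unigrams.most_common(1)    # default: the most frequently seen word
    return top[0][0] if top else None
```

Trying the longest context first means the prediction uses as much of what the user typed as the data can support, and only gives up context when no match exists.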