For this capstone project, my goal was to build an application similar to SwiftKey. Since SwiftKey is a mobile application, model size matters as much as prediction quality.
The final application uses a backoff model that tries to predict the next word based on the n previous words in the text.
The application uses 2 dictionaries:
- 3-word dictionary
- 2-word dictionary
The dictionaries were built using data available here. Only a sample from each source (news, blogs and Twitter) was used to build them.
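As a rough illustration of how such dictionaries could be built, the sketch below samples a fraction of the lines from a source and counts every run of n consecutive words. The function names (`sample_lines`, `build_ngram_counts`), the lowercase whitespace tokenization, and the sampling scheme are assumptions made for this example; the project's actual preprocessing is not described here.

```python
import random
from collections import Counter


def sample_lines(lines, fraction, seed=0):
    """Keep roughly `fraction` of the lines (hypothetical sampling scheme)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


def build_ngram_counts(lines, n):
    """Count every run of n consecutive words across the given lines."""
    counts = Counter()
    for line in lines:
        # Assumption: lowercase and split on whitespace; real text would
        # likely need punctuation and profanity handling as well.
        tokens = line.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Toy corpus standing in for the news, blogs and Twitter samples.
corpus = ["I love you", "I love it", "We love you"]
three_word = build_ngram_counts(corpus, 3)
two_word = build_ngram_counts(corpus, 2)
```

The number of entries in each counter (`len(three_word)`, `len(two_word)`) gives a simple proxy for dictionary size.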
The application first tries to predict the next word using the 3-word dictionary (combinations of 3 consecutive words). If it finds several possible options, it proposes the 3 most frequent ones. If it finds nothing, it falls back to the 2-word dictionary. If that lookup also finds nothing, the application makes no prediction.
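The lookup order described above can be sketched as follows. This is a minimal illustration, not the application's actual code: it assumes that the 3-word dictionary is keyed by the two preceding words and the 2-word dictionary by the single preceding word, and the names `build_lookup` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict


def build_lookup(ngram_counts):
    """Index n-gram counts by context (all words except the last)."""
    lookup = defaultdict(Counter)
    for ngram, count in ngram_counts.items():
        lookup[ngram[:-1]][ngram[-1]] += count
    return lookup


def predict_next(text, three_word_lookup, two_word_lookup, top_k=3):
    """Try the 3-word dictionary first, then back off to the 2-word one."""
    tokens = text.lower().split()
    for context_len, lookup in ((2, three_word_lookup), (1, two_word_lookup)):
        if len(tokens) < context_len:
            continue  # not enough typed words for this dictionary
        candidates = lookup.get(tuple(tokens[-context_len:]))
        if candidates:
            # Propose the most frequent continuations, at most top_k.
            return [word for word, _ in candidates.most_common(top_k)]
    return []  # nothing found in either dictionary: no prediction


# Toy counts, invented for the example.
three_word = build_lookup({
    ("i", "love", "you"): 5, ("i", "love", "it"): 3,
    ("i", "love", "pizza"): 2, ("i", "love", "cats"): 1,
})
two_word = build_lookup({("love", "you"): 4, ("the", "end"): 2})
```

With these toy counts, `predict_next("I love", three_word, two_word)` is answered from the 3-word dictionary, while `predict_next("we love", three_word, two_word)` falls back to the 2-word dictionary because the pair "we love" was never seen.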
The main tradeoff here is dictionary size versus prediction quality, so I studied several strategies (varying the training set size and the prediction method) to find the best balance. See slide "Accuracy vs Dictionary size".
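One way to compare strategies is to measure how often the true next word appears among the proposed words on held-out text. The sketch below is an assumption about how such a comparison could be run, not the evaluation actually used for the slide; `top_k_accuracy` is a hypothetical helper, and `predict` is any function that maps the typed text to a list of proposed words.

```python
def top_k_accuracy(predict, sentences):
    """Share of word positions where the true next word is among the proposals."""
    hits = 0
    total = 0
    for sentence in sentences:
        tokens = sentence.lower().split()
        # Use every prefix of at least one word as context for the next word.
        for i in range(1, len(tokens)):
            context = " ".join(tokens[:i])
            if tokens[i] in predict(context):
                hits += 1
            total += 1
    return hits / total if total else 0.0


# Stand-in predictor that always proposes the same words (illustration only).
def always_b(text):
    return ["b"]


score = top_k_accuracy(always_b, ["a b", "a c"])
```

Running this for dictionaries built from different sample sizes, and recording the number of entries in each dictionary alongside the score, would give the points for an accuracy versus size comparison.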