Around the world, people are spending an increasing amount of time on their mobile devices for email and social networking. But typing on mobile devices can be a serious pain.
Microsoft SwiftKey builds a smart virtual keyboard that makes it easier for people to type on their mobile devices. One cornerstone of this keyboard is its predictive text models. In this capstone project, I will build a predictive text web application hosted on the shinyapps website. The application will present several options for what the next word might be, based on the preceding words. For example, when typing “i will”, the app will present probable words like “go”, “bring”, and “stay”.
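To make the idea of next-word suggestion concrete, here is a minimal Python sketch of a count-based predictor that backs off from a two-word context to a one-word context. It is not the code behind the app: the function names, the toy corpus, and the simple back-off rule are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def train_next_word_model(sentences, context_size=2):
    """Count which word follows each context of 1..context_size words."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(1, len(tokens)):
            # Record the next word under every available context length.
            for n in range(1, context_size + 1):
                if i - n < 0:
                    break
                context = tuple(tokens[i - n:i])
                model[context][tokens[i]] += 1
    return model


def predict_next_words(model, text, k=3, context_size=2):
    """Return up to k candidates, trying the longest context first."""
    tokens = text.lower().split()
    for n in range(context_size, 0, -1):
        context = tuple(tokens[-n:])
        if len(context) == n and context in model:
            return [word for word, _ in model[context].most_common(k)]
    return []  # no known context: nothing to suggest


# Hypothetical toy corpus, only to exercise the functions.
toy_corpus = [
    "i will go home",
    "i will go now",
    "i will bring food",
    "i will stay here",
    "we will go soon",
]
model = train_next_word_model(toy_corpus)
suggestions = predict_next_words(model, "i will")
```

On this toy corpus, “go” is the top suggestion for “i will” because it follows that context most often. When the two-word context has never been seen, the sketch falls back to the last word alone.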
The data comes from a large database of text drawn from three sources (blogs, news, and tweets) in four languages: English, German, Russian, and Finnish. I chose the English database for building the predictive text model.
| File name | Size (MB) | Lines |
|---|---|---|
| en_US.blogs.txt | 200.4242 | 899288 |
| en_US.news.txt | 196.2775 | 77259 |
| en_US.twitter.txt | 159.3641 | 2360148 |
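
A summary like the one in this table can be produced with a few lines of code. The sketch below is an assumption about how one might compute it in Python, not the method used for the report. The helper name `summarize_text_file` and the demo file are hypothetical.

```python
import os
import tempfile


def summarize_text_file(path):
    """Return (file name, size in MB, number of lines) for one text file."""
    size_mb = os.path.getsize(path) / (1024 ** 2)
    # Binary mode counts lines by newline bytes only, independent of encoding.
    with open(path, "rb") as handle:
        line_count = sum(1 for _ in handle)
    return os.path.basename(path), round(size_mb, 4), line_count


# Demo on a small temporary file; the real corpus files are not assumed present.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("first line\nsecond line\nthird line\n")
    demo_summary = summarize_text_file(demo_path)
```
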
The predictive algorithm generates words based on the preceding sequence of words, so our strategy is to train the model on words only. I removed profanity, symbols, tags, numbers, dates, and similar non-word content to make the training data set clean and well suited to the predictive algorithm.
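The cleaning described above can be sketched roughly as follows in Python. This is an illustration under assumptions, not the report's actual pipeline: the regular expressions, the function name `clean_line`, and the two-word `PROFANITY` set are placeholders, and a real project would load a full profanity lexicon.

```python
import re

# Hypothetical placeholder list; substitute a real profanity lexicon.
PROFANITY = {"darn", "heck"}


def clean_line(line, profanity=PROFANITY):
    """Lowercase a line and keep only plain word tokens."""
    text = line.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)                 # mentions and hashtags
    # Drop digits (numbers, dates), symbols, and punctuation. This also drops
    # non-ASCII letters, which is a simplification for an English-only corpus.
    text = re.sub(r"[^a-z'\s]", " ", text)
    tokens = [token.strip("'") for token in text.split()]
    return [token for token in tokens if token and token not in profanity]


cleaned = clean_line("Meet @bob on 12/05/2014 at http://example.com #fun, darn it!")
```

Here `cleaned` keeps only the ordinary words of the sentence. The mention, the date, the URL, the hashtag, and the placeholder profanity term are all removed.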
Wordcloud of the corpus.
Word count histogram of the corpus.
Cluster dendrogram of the similarities of features on each data source.
Keyness statistics of the corpus.
Top 20 unigram frequencies of each data source.
Top 20 bigram frequencies of each data source.
Top 20 trigram frequencies of each data source.
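Frequencies like those in the unigram, bigram, and trigram charts come from counting consecutive word sequences. A minimal Python sketch, assuming each document is already a list of cleaned tokens, is given below. The function names and sample data are hypothetical.

```python
from collections import Counter


def ngram_counts(token_lists, n):
    """Count n-grams (as tuples) without crossing document boundaries."""
    counts = Counter()
    for tokens in token_lists:
        # Zipping n shifted copies of the list yields consecutive n-grams.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


def top_ngrams(token_lists, n, k=20):
    """Return the k most frequent n-grams as (text, count) pairs."""
    return [(" ".join(gram), count)
            for gram, count in ngram_counts(token_lists, n).most_common(k)]


# Hypothetical sample, standing in for one cleaned data source.
sample = [["i", "will", "go"], ["i", "will", "stay"], ["we", "will", "go"]]
unigrams = ngram_counts(sample, 1)
bigrams = ngram_counts(sample, 2)
trigrams = ngram_counts(sample, 3)
top_unigram = top_ngrams(sample, 1, k=1)
```

Running `top_ngrams` once per source with `n` set to 1, 2, and 3 would give the data for three such charts per source.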
Cumulative frequency of corpus terms against the number of terms.
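The cumulative-frequency view answers a practical question: how many of the most frequent terms are needed to cover a given share of all word occurrences? The Python sketch below shows one way to compute that. The function name and the toy counts are assumptions for illustration.

```python
from collections import Counter


def terms_needed_for_coverage(term_counts, coverage):
    """Smallest number of top-ranked terms whose counts reach `coverage`
    (a fraction between 0 and 1) of all term occurrences."""
    total = sum(term_counts.values())
    running = 0
    for rank, (_, count) in enumerate(term_counts.most_common(), start=1):
        running += count
        if running >= coverage * total:
            return rank
    return len(term_counts)


# Hypothetical toy counts: 10 occurrences spread over 4 distinct terms.
toy_counts = Counter({"the": 5, "of": 3, "cat": 1, "dog": 1})
half_coverage = terms_needed_for_coverage(toy_counts, 0.5)
ninety_coverage = terms_needed_for_coverage(toy_counts, 0.9)
```

In the toy data, a single term already covers half of all occurrences, while 90% coverage needs three of the four terms. Natural-language corpora usually show a similar pattern, where a small share of the vocabulary covers most of the text.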
The best way to maximize the accuracy of next-word prediction is to have a clean and concise training corpus, which makes blogs and news the best data sources for our prediction app because their text is better suited to the prediction algorithm.
After cleaning the corpus and getting the training data set ready for implementation, the next steps are: