Overview

Around the world, people are spending an increasing amount of time on their mobile devices for activities such as email and social networking. But typing on mobile devices can be a serious pain.

Microsoft SwiftKey builds a smart virtual keyboard that makes it easier for people to type on their mobile devices. One cornerstone of this smart keyboard is its predictive text models. In this capstone project, I will build a predictive text web application hosted on shinyapps.io. The application will present several options for what the next word might be, based on the preceding words. For example, when typing “i will”, the app will present probable words like “go”, “bring”, and “stay”.

Data

The data comes from a large database of text drawn from three sources (blogs, news, and tweets) in four languages: English, German, Russian, and Finnish. I chose the English database for building the predictive text model.

The English corpus files:
File name            Size (MB)    Lines
en_US.blogs.txt      200.4242     899288
en_US.news.txt       196.2775     77259
en_US.twitter.txt    159.3641     2360148

Exploratory data analysis

The predictive algorithm generates words based on the preceding sequence of words, so our strategy is to train the model on words only. I removed profanity, symbols, tags, numbers, dates, etc. to keep the training data set clean and optimized for the predictive algorithm.
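
The cleaning step described above can be sketched roughly as follows. This is a minimal Python illustration, not the code used for this report: the function name `clean_line`, the regular expressions, and the two-word `PROFANITY` set are all assumptions made for the example, and a real run would load a full profanity list.

```python
import re

# Hypothetical placeholder list; the report does not say which profanity list was used.
PROFANITY = {"badword", "worseword"}

def clean_line(line: str) -> list[str]:
    """Lowercase one line of text and keep only alphabetic word tokens."""
    text = line.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop URLs
    text = re.sub(r"[@#]\w+", " ", text)                 # drop Twitter-style mentions and hashtags
    text = re.sub(r"[^a-z'\s]", " ", text)               # drop digits, dates, symbols, punctuation
    tokens = [t.strip("'") for t in text.split()]        # trim stray apostrophes
    return [t for t in tokens if t and t not in PROFANITY]

print(clean_line("RT @user: I'll bring 2 #cakes on 12/05, badword! http://t.co/x"))
```

Note that this sketch only keeps ASCII letters and straight apostrophes, so curly quotes and accented characters would need to be normalized first.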

  • The wordcloud of the corpus shows that the most frequent words are stop words such as “the”, “is”, and “and”.
Wordcloud of the corpus.

  • Blogs and news both have more types (unique tokens) and a higher word count than tweets, as the histogram illustrates.
Word count histogram of the corpus.

  • Blogs and news have similar characteristics, as the cluster dendrogram below shows:
Cluster dendrogram of the similarities of features on each data source.

  • Comparing word frequencies between tweets and the reference documents (blogs and news), the keyness statistics show that blogs and news use stop words more frequently than tweets.
Keyness statistics of the corpus.

  • An n-gram is a consecutive subsequence of length n of some sequence of tokens. The figures below illustrate the frequencies of unigrams, bigrams, and trigrams for each data source after removing stop words.
Top 20 unigram frequencies of each data source.

Top 20 bigram frequencies of each data source.

Top 20 trigram frequencies of each data source.
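
To make the n-gram definition concrete, here is a small Python sketch of how such frequency tables could be produced from already-tokenized text. The helper names `ngrams` and `top_ngrams` and the tiny `STOPWORDS` set are illustrative assumptions; the figures in this report were not necessarily produced this way.

```python
from collections import Counter

# Tiny illustrative subset; a real analysis would use a full stop word list.
STOPWORDS = {"the", "is", "and", "a", "to", "of"}

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams(token_lists, n, k=20, stopwords=STOPWORDS):
    """Count n-grams over many token lists after dropping stop words."""
    counts = Counter()
    for tokens in token_lists:
        kept = [t for t in tokens if t not in stopwords]
        counts.update(ngrams(kept, n))
    return counts.most_common(k)

docs = [["i", "will", "go"], ["i", "will", "stay"]]
print(top_ngrams(docs, 2, k=1))  # the bigram ("i", "will") occurs twice
```

Because stop words are removed before the n-grams are formed, words that were separated by a stop word in the original text become adjacent in this sketch.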

  • The cumulative frequency of the terms in the corpus shows that a dictionary of only 118 terms accounts for 50% of all word occurrences in the corpus, while 7042 terms are needed to cover 90%.

The cumulative frequency of corpus terms and their respective number of terms.
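
The coverage figures come from accumulating term counts from the most frequent term downward. A minimal Python sketch of that calculation, using a hypothetical helper `terms_needed` and toy counts rather than the real corpus, might look like this:

```python
from collections import Counter

def terms_needed(counts: Counter, coverage: float) -> int:
    """Smallest number of top-ranked terms whose counts reach the given share of all tokens."""
    total = sum(counts.values())
    running = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        running += freq
        if running >= coverage * total:
            return rank
    return len(counts)

# Toy counts for illustration only, not taken from the corpus.
toy = Counter({"the": 5, "cat": 3, "sat": 1, "mat": 1})
print(terms_needed(toy, 0.5), terms_needed(toy, 0.9))
```

Applied to the unigram counts of the full corpus, a function like this is how figures of the kind quoted above (terms needed for 50% and 90% coverage) can be obtained.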

Conclusion

The best way to maximize the accuracy of next-word prediction is to train on a clean and concise corpus, which makes blogs and news the best data sources for our prediction app because their text is better suited to the prediction algorithm.

Next steps

After cleaning the corpus and getting the training data set ready for implementation, the next steps are:

  1. Choosing an optimal prediction algorithm with respect to accuracy, response time, and data volume.
    • The preferred choice is the Stupid Backoff model with smoothing, since it is fast and doesn’t need much data storage.
  2. Developing a Shiny web application to host the prediction model on the shinyapps.io server.
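
As a rough idea of what the first step could look like, the following Python sketch implements the Stupid Backoff scoring scheme: use the relative frequency of the longest matching n-gram, and multiply by a fixed factor each time the context is shortened. The class name `StupidBackoff`, its methods, and the toy sentences are assumptions for illustration; the factor 0.4 is the value suggested in the original Stupid Backoff paper (Brants et al., 2007). The final app may be implemented differently.

```python
from collections import Counter, defaultdict

ALPHA = 0.4  # back-off factor suggested by Brants et al. (2007)

class StupidBackoff:
    """Illustrative next-word predictor; not the final app's implementation."""

    def __init__(self, sentences, max_order=3):
        self.max_order = max_order
        self.counts = Counter()             # n-gram tuple -> count, all orders together
        self.followers = defaultdict(set)   # context tuple -> words seen right after it
        self.total = 0                      # total number of tokens (unigram denominator)
        for tokens in sentences:
            self.total += len(tokens)
            for n in range(1, max_order + 1):
                for i in range(len(tokens) - n + 1):
                    gram = tuple(tokens[i:i + n])
                    self.counts[gram] += 1
                    self.followers[gram[:-1]].add(gram[-1])

    def predict(self, context, k=3):
        """Return the k highest-scoring next words for the given preceding words."""
        keep = self.max_order - 1
        context = tuple(context)[-keep:] if keep > 0 else ()
        scores, penalty = {}, 1.0
        while True:
            denom = self.counts[context] if context else self.total
            for word in self.followers.get(context, ()):
                if word not in scores:  # keep the score from the longest matching context
                    scores[word] = penalty * self.counts[context + (word,)] / denom
            if not context:
                break
            context = context[1:]       # back off to a shorter context
            penalty *= ALPHA
        return sorted(scores, key=lambda w: (-scores[w], w))[:k]

model = StupidBackoff([["i", "will", "go"], ["i", "will", "go"],
                       ["i", "will", "stay"], ["you", "will", "bring"]])
print(model.predict(["i", "will"]))
```

The scores are not normalized probabilities, which is what keeps the method cheap: prediction only needs count lookups, so response time and storage depend mainly on how many n-grams are kept.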