SwiftKey Capstone Project

Stanislav Prikhodko
8/23/2015

The final capstone project for the Data Science Specialization on Coursera: a Shiny application for next-word prediction.

https://stas.shinyapps.io/WordPrediction

Introduction, Description of Data

This application benefits from large text data collected from the following sources:

  • Twitter
  • News
  • Blogs

The raw data came as three large text files. I took a 10% sample and loaded it into a term-document matrix using the tm and NLP packages, after removing numbers, punctuation, and special characters. I then computed bigrams, trigrams, and quadrigrams and converted each set into a data frame with one column per word plus a frequency column. Each n-gram data frame was saved in a *.RData file for convenient use in the Shiny app.
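The counting step above can be sketched as follows. This is a minimal illustration in Python, not the original R code built on tm and NLP; the function names (clean, ngram_counts, to_rows) and the two sample lines are hypothetical, and the cleaning rule (replace everything except letters with a space) is an assumption about what the preprocessing did.

```python
import re
from collections import Counter

def clean(text):
    """Lowercase the text and replace numbers, punctuation, and
    special characters with spaces, then split into words."""
    return re.sub(r"[^a-z\s]", " ", text.lower()).split()

def ngram_counts(lines, n):
    """Count n-grams line by line; each key is a tuple of n words."""
    counts = Counter()
    for line in lines:
        words = clean(line)
        # zip over n shifted copies of the word list yields the n-grams
        counts.update(zip(*(words[i:] for i in range(n))))
    return counts

def to_rows(counts):
    """One row per n-gram: the n words followed by the frequency,
    most frequent first (mirrors the data frame layout)."""
    return [(*gram, freq) for gram, freq in counts.most_common()]

# Hypothetical sample lines, for illustration only.
sample = ["A lot of people, 2 of them!", "a lot of fun"]
trigrams = ngram_counts(sample, 3)
rows = to_rows(trigrams)
```

Keeping each word in its own field from the start means no string splitting is needed later when the rows are turned into a table.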

Description of the Algorithm

As mentioned on the previous slide, the data frames derived from the n-grams are the only input for the Shiny app. For example, below is a sample from the trigram data frame, which has 3 columns for the words and 1 column for the frequency.

       V1   V2   V3 Frequency
17843   a  bit more        65
17856   a  bit   of       207
156708 as well   as       344
179720 be able   to       324
31226   a  lot   of       608

Once the user enters some text, the application splits it into an array of individual words and uses the last 4 words to look up the best matches in the appropriate n-gram data frame.
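A minimal sketch of this lookup, written in Python rather than the R used by the app. The rows are copied from the trigram sample above; the helper names (last_words, trigram_matches) are hypothetical, and splitting on whitespace is an assumption about how the input is tokenized.

```python
# Rows copied from the trigram sample: (V1, V2, V3, Frequency).
TRIGRAMS = [
    ("a", "bit", "more", 65),
    ("a", "bit", "of", 207),
    ("as", "well", "as", 344),
    ("be", "able", "to", 324),
    ("a", "lot", "of", 608),
]

def last_words(text, k=4):
    """Split the input on whitespace and keep at most the last k words."""
    return text.lower().split()[-k:]

def trigram_matches(text):
    """Return (next_word, frequency) pairs whose first two words equal
    the last two words of the input, most frequent first."""
    context = tuple(last_words(text)[-2:])
    if len(context) < 2:
        return []
    hits = [(w3, f) for w1, w2, w3, f in TRIGRAMS if (w1, w2) == context]
    return sorted(hits, key=lambda h: h[1], reverse=True)

print(trigram_matches("I need a bit"))  # [('of', 207), ('more', 65)]
```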

Description of the Application, Instructions to use

The application has an input text box where the user can enter multiple words. The response is printed in the text area underneath the welcome message.

The application starts with the largest n-gram (e.g., the quadrigram if 4 words are available) and falls back to smaller n-grams if no matches were found at the previous stage. It picks the candidate with the highest frequency as the best match and then calculates an approximate probability based on the total frequency. The application usually responds in about a second.
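The fallback logic can be sketched like this. It is an illustration in Python under stated assumptions, not the app's actual R code: an order-n table is assumed to be keyed by its first n-1 words, and the approximate probability is assumed to be the best candidate's frequency divided by the total frequency of all candidates for the same context. The function name predict and the bigram counts are hypothetical; the trigram counts come from the sample shown earlier.

```python
def predict(words, tables):
    """Try the largest n-gram order first, then back off to smaller ones.

    tables maps an order n (2 or more) to {context_tuple: {next_word: frequency}},
    where context_tuple holds the first n-1 words of the n-gram.
    Returns (best_word, order_used, approximate_probability) or None.
    """
    for n in sorted(tables, reverse=True):
        ctx_len = n - 1
        if len(words) < ctx_len:
            continue  # not enough input words for this order
        context = tuple(words[-ctx_len:])
        candidates = tables[n].get(context)
        if candidates:
            best = max(candidates, key=candidates.get)
            prob = candidates[best] / sum(candidates.values())
            return best, n, prob
    return None

tables = {
    3: {("a", "bit"): {"of": 207, "more": 65}, ("a", "lot"): {"of": 608}},
    2: {("bit",): {"of": 5, "more": 3}},  # made-up counts, illustration only
}

print(predict(["a", "bit"], tables))     # trigram match: 'of' with 207 / 272
print(predict(["tiny", "bit"], tables))  # no trigram match, backs off to the bigram
```

Returning the order that produced the match makes it easy to see when the fallback was triggered.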

Conclusion

During this capstone project I learned some interesting tricks and gained some very important experience:

  1. Never leave the most important things for the last day!
  2. For any problem there are multiple good solutions!
  3. Keep things as simple as possible. Do not overcomplicate!

About tricks… It was pretty painful to convert the trigrams into a data frame that would be more convenient for me. Once you have a matrix with trigram frequencies, you still have to split each three-word string into three individual words so that they end up in three different columns. strsplit() is very slow! It turned out to be faster to save the strings to a file and then just read the file back in.

Thanks to Coursera for such a fun project!