Stanislav Prikhodko
8/23/2015
The final capstone project for the Data Science Specialization on Coursera.com: a Shiny application for next-word prediction.
This application benefits from large text data collected from the following sources:
The raw data existed as three big text files that I sampled (10%) and loaded into a term-document matrix using the packages tm and NLP (numbers, punctuation, and special characters were removed prior to doing this). Once the bigrams, trigrams, and quadrigrams had been calculated, they were converted into data frames with a frequency column and with each word in an individual column. Each n-gram table was saved in a *.RData file for convenient use in the Shiny app.
As mentioned on the previous slide, the data frames derived from the n-grams are the only input for the Shiny app. For example, below is a sample from the trigram table: a data frame with 3 columns for the words and 1 column for the frequency.
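The counting step can be illustrated with a minimal base R sketch. It does not reproduce the tm/NLP pipeline used in the project; the sample text, the cleaning regular expression, and the helper `make_ngrams` are all assumptions made for illustration.

```r
# Hypothetical sample text standing in for the sampled corpus.
lines <- c("A lot of people like a lot of things.",
           "There is a lot of data, as well as noise!")

# Lower-case the text, then drop everything except letters and spaces
# (this removes numbers, punctuation and special characters).
clean <- gsub("[^a-z ]", "", tolower(lines))

# Build all n-word sequences for one line of text (hypothetical helper).
make_ngrams <- function(line, n) {
  words <- strsplit(trimws(line), " +")[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count how often each trigram occurs, most frequent first.
trigrams <- unlist(lapply(clean, make_ngrams, n = 3))
counts <- sort(table(trigrams), decreasing = TRUE)
head(counts)
```

The same helper could be called with `n = 2` or `n = 4` to get bigram or quadrigram counts.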
       V1   V2   V3 Frequency
17843   a  bit more        65
17856   a  bit   of       207
156708 as well   as       344
179720 be able   to       324
31226   a  lot   of       608
Once the user has entered some text, the application splits it into a vector of individual words and uses the last 4 words to look up the best matches in the appropriate n-gram table.
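A minimal sketch of this splitting step, assuming the input is cleaned the same way as the corpus (lower-cased, non-letters removed). The function name `last_words` and its default `k = 4` are hypothetical, not taken from the app's source.

```r
# Split the user's text into lower-case words and keep the last k of them.
last_words <- function(text, k = 4) {
  cleaned <- gsub("[^a-z ]", "", tolower(text))
  words <- strsplit(trimws(cleaned), " +")[[1]]
  tail(words, k)
}

last_words("Thanks a lot for being able to")
```

If fewer than `k` words are entered, `tail()` simply returns whatever is available, so the lookup can start from a smaller n-gram table.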
The application has an input text box where the user can enter multiple words. The response is printed in the text area underneath the welcome message.
The application starts from the biggest n-gram (e.g., the quadrigram if 4 words are available) and then falls back to smaller n-grams if no matches were found at the previous stage. It takes the match with the highest frequency as the best one and then calculates an approximate probability based on the total frequency. The application usually responds in about a second.
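The fallback logic can be sketched roughly as follows. This is an illustration under stated assumptions, not the app's actual code: each n-gram table is assumed to use its last word column as the prediction and the preceding columns as context, and "total frequency" is taken to mean the summed frequency of all matching rows. The function `predict_next` is hypothetical, the trigram rows come from the sample table, and the bigram rows are made up.

```r
# Trigram rows from the sample table; the bigram rows are made-up values.
trigram <- data.frame(V1 = c("a", "a", "a"),
                      V2 = c("bit", "bit", "lot"),
                      V3 = c("more", "of", "of"),
                      Frequency = c(65, 207, 608),
                      stringsAsFactors = FALSE)
bigram <- data.frame(V1 = c("of", "of"),
                     V2 = c("the", "course"),
                     Frequency = c(30, 10),
                     stringsAsFactors = FALSE)

# words:  character vector of the user's last words
# tables: list of n-gram data frames, biggest n first
predict_next <- function(words, tables) {
  for (tab in tables) {
    n_ctx <- ncol(tab) - 2              # context words this table needs
    if (length(words) < n_ctx) next     # not enough input, try a smaller n-gram
    ctx <- tail(words, n_ctx)
    keep <- rep(TRUE, nrow(tab))
    for (i in seq_len(n_ctx)) keep <- keep & tab[[i]] == ctx[i]
    hits <- tab[keep, , drop = FALSE]
    if (nrow(hits) > 0) {
      best <- hits[which.max(hits$Frequency), ]
      return(list(word = best[[n_ctx + 1]],
                  prob = best$Frequency / sum(hits$Frequency)))
    }
  }
  NULL                                  # nothing matched in any table
}

predict_next(c("a", "bit"), list(trigram, bigram))
predict_next(c("lot", "of"), list(trigram, bigram))
```

The first call is answered from the trigram table; the second finds no trigram starting with "lot of" and falls back to the bigram table.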
During this capstone project I've learned some interesting tricks and gained some very important experience:
1. Never leave the most important things to the last day!
2. For any problem there are multiple good solutions!
3. Be as simple as possible. Do not overcomplicate things!
About the tricks: it was pretty painful to convert the trigrams into a data frame that would be more convenient for me. Once you have a matrix with trigram frequencies, you still have to split each three-word string into three individual words so that they end up in three different columns. strsplit() is very slow for this! It turned out to be faster to save the strings to a file and then just read the file back in.
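One way the file round trip might look is sketched below. The exact calls used in the project are not shown in the slides, so the use of `writeLines()` and `read.table()` here is an assumption; the trigram strings and frequencies are taken from the sample table.

```r
# Trigram strings and their frequencies (values from the sample table).
tri <- c("a bit more", "a bit of", "a lot of")
freq <- c(65, 207, 608)

# Write the strings to a temporary file, one trigram per line...
tmp <- tempfile(fileext = ".txt")
writeLines(tri, tmp)

# ...and let read.table() split each line on spaces into columns V1, V2, V3.
df <- read.table(tmp, sep = " ", colClasses = "character",
                 quote = "", comment.char = "", stringsAsFactors = FALSE)
df$Frequency <- freq
unlink(tmp)   # remove the temporary file
df
```

Reading every column as character keeps words such as "T" or "F" from being converted to logical values, and disabling quote and comment handling keeps apostrophes and "#" from breaking lines.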
Thanks to Coursera for such a fun project!