Eric Scuccimarra
2018-01-28
The problem was to create a text prediction system using text taken from online news, blogs and Twitter. This presentation describes the steps taken to accomplish this as well as the final algorithm.
The data was provided by SwiftKey in multiple languages. Only the English data was used for this project.
A sample of the raw data is below:
load("raw_twitter.RData")
head(twitter, 3)
[1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
[2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
[3] "they've decided its more fun if I don't."
The raw data is filtered to remove profanity and then subsampled to keep the data to a manageable size. Numbers and non-ASCII characters are removed, and the filtered data from all three sources are combined into one corpus.
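In rough outline the cleaning step looks like the sketch below. The sampling rate and the profanity list here are placeholders rather than the values actually used:

library(tm)

# Placeholder profanity list; the real list is much longer
profanity <- c("badword1", "badword2")

set.seed(42)
clean_source <- function(lines, keep = 0.25) {
  lines <- sample(lines, round(length(lines) * keep))  # subsample to a manageable size
  lines <- iconv(lines, "UTF-8", "ASCII", sub = "")    # drop non-ASCII characters
  lines <- gsub("[0-9]+", "", lines)                   # remove numbers
  # drop any line containing a profane word
  lines[!grepl(paste(profanity, collapse = "|"), lines, ignore.case = TRUE)]
}

# Combine the filtered text from all three sources into one corpus
corpus <- VCorpus(VectorSource(c(clean_source(twitter),
                                 clean_source(blogs),
                                 clean_source(news))))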
A TermDocumentMatrix is created using the tm library, keeping words of all lengths (tm's default drops words shorter than three characters). The matrix is sorted by frequency to yield the most frequently occurring words:
  the   to  and    a   of   in  for that   is   on with said
18723 8634 8510 8392 7121 6511 3380 3320 2722 2701 2387 2373
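The counts above come from something like the following; the wordLengths control lowers tm's default three-character minimum so that short words like "a" and "to" are counted:

tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf)))
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)  # most frequent words first
head(freq, 12)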
I did not remove stop words because I felt that doing so would negatively impact the quality of the predictions. I did, however, remove misspelled words and other non-English words.
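One simple way to do this is to keep only terms that appear in an English word list. A minimal sketch, assuming a plain-text dictionary file (the file name is a placeholder):

dictionary <- tolower(readLines("english_words.txt"))  # placeholder word list
freq <- freq[names(freq) %in% dictionary]              # drop terms not in the dictionary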
After a great deal of trial and error, I arrived at a process for building a model of the text.
The ngrams are reversed because I believe that words towards the end of the string are more important predictors than those at the beginning. This structure allows words which do not match any ngram to be ignored, which provides flexibility when predicting for input that does not exist in the model.
There are optional parameters in the functions that create the model to set the number of matches kept for each ngram and whether to remove ngrams which occur only once in the corpus. By default, the top three matches are kept and all ngrams are retained, as illustrated in the sketch below.
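A minimal sketch of how such a model-building function might look, assuming the cleaned text is stored in a character vector called lines; the function and column names are illustrative, not the exact implementation:

build_ngrams <- function(lines, n = 3, top = 3, min_count = 1) {
  tokens <- strsplit(tolower(lines), "\\s+")
  pairs <- do.call(rbind, lapply(tokens, function(ws) {
    if (length(ws) <= n) return(NULL)
    idx <- seq_len(length(ws) - n)
    data.frame(
      # store the preceding n words reversed, so the last word comes first
      key = vapply(idx, function(i) paste(rev(ws[i:(i + n - 1)]), collapse = " "), ""),
      prediction = ws[idx + n],
      stringsAsFactors = FALSE
    )
  }))
  counts <- aggregate(count ~ key + prediction, cbind(pairs, count = 1), sum)
  counts <- counts[counts$count >= min_count, ]         # optionally drop rare ngrams
  counts <- counts[order(counts$key, -counts$count), ]
  # keep only the top matches for each ngram
  do.call(rbind, by(counts, counts$key, head, top))
}

model <- build_ngrams(lines)   # defaults: top 3 matches, all ngrams retained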
I attempted to fit some decision tree models to the data, but the amount of RAM required made this impossible. Instead I use a simple matching algorithm: the loop walks through the words of the input string, starting from the last word, and uses each one to narrow the set of candidate ngrams.
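Below is a minimal sketch of the prediction loop, using the ngram table from the sketch above; the function name is illustrative, and the step numbers in the comments correspond to the steps discussed below:

predict_next <- function(input, model, n_matches = 3) {
  words <- rev(strsplit(tolower(input), "\\s+")[[1]])  # reverse: last word first
  pattern <- character(0)
  candidates <- model    # fall back to the whole table, so something always matches
  for (w in words) {
    # Step 1: extend the reversed pattern with the next word back from the input
    trial <- c(pattern, w)
    keep <- vapply(strsplit(candidates$key, " "),
                   function(k) identical(k[seq_along(trial)], trial),
                   logical(1))
    # Step 2: if this word filters out all of the potential matches, ignore it
    if (!any(keep)) next
    # Step 3: otherwise narrow the candidate set and continue with the next word
    pattern <- trial
    candidates <- candidates[keep, ]
  }
  # return up to n_matches predictions, most frequent first
  head(candidates$prediction[order(-candidates$count)], n_matches)
}

predict_next("How are", model)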
Steps 2 and 3 inside the for loop are designed to ensure that some match is returned for every string, and to allow flexibility for strings which are not contained in the ngrams. If a word in the string filters out all of the potential matches, that word is ignored and the loop continues.
Up to three possible matches are returned for every input string.
A Shiny application was created to demonstrate this algorithm; it is available at https://ericscuccimarra.shinyapps.io/TextPrediction2/.
The application uses a dataset which filters out 75% of the text from the Twitter and blog data in order to provide a reasonable response speed. Using more data would result in better predictions.
The source code is available on GitHub at https://github.com/escuccim/DataScienceCapstone
A more detailed presentation of the algorithm and the data is available here: http://rpubs.com/skooch/355342
There are some problems with my algorithm, which can be fixed with more data and time: