Capstone Project for the Johns Hopkins-Coursera Data Science Series
Mark Culp, August 27, 2017
The data set used for this project was provided by SwiftKey, a company based in central London. The SwiftKey “corpora” used in the project comprised:
The R “tm” and “RWeka” packages were used to mine the text and create a series of 2- and 3-term “n-grams.” These n-grams predict the next word a user types based on either the last word the user typed or the combination of the last two words the user typed.
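As an illustration of what 2- and 3-term n-grams look like, here is a minimal Python sketch. It is not the project's R code and does not use “tm” or “RWeka”; the function names `tokenize` and `count_ngrams`, the simple regular-expression tokenizer, and the sample sentence are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (assumed rule)."""
    return re.findall(r"[a-z']+", text.lower())


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical sample text, not taken from the SwiftKey corpora.
tokens = tokenize("I love data science and I love data mining")
bigrams = count_ngrams(tokens, 2)   # 2-term n-grams
trigrams = count_ngrams(tokens, 3)  # 3-term n-grams

print(bigrams.most_common(2))
print(trigrams.most_common(1))
```

In each n-gram, the leading word or words act as the context and the last word is the candidate prediction, so a 3-term n-gram such as ("i", "love", "data") suggests "data" after "I love".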
To cover every 2- and 3-term n-gram contained in the corpora, the entire data set was processed in chunks of approximately 10 megabytes of text. The “merge.R” script then merged the 2- and 3-term n-grams into a single data frame that summarized the n-gram word counts. The final, merged data frame contained approximately 36.8 million lines of text.
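The chunk-then-merge approach can be sketched in Python as follows. This is an assumption-laden illustration rather than a port of “tokenizer.R” or “merge.R”: the helper names `iter_chunks` and `count_bigrams`, the tiny `max_chars` limit, and the sample corpus are hypothetical, and only 2-term n-grams are counted to keep the example short.

```python
from collections import Counter


def iter_chunks(lines, max_chars):
    """Yield lists of whole lines whose combined length stays within max_chars."""
    chunk, size = [], 0
    for line in lines:
        if chunk and size + len(line) > max_chars:
            yield chunk
            chunk, size = [], 0
        chunk.append(line)
        size += len(line)
    if chunk:
        yield chunk


def count_bigrams(lines):
    """Count 2-term n-grams within each line of a chunk."""
    counts = Counter()
    for line in lines:
        words = line.lower().split()
        counts.update(zip(words, words[1:]))
    return counts


# Hypothetical corpus; the limit stands in for the roughly 10 MB chunks.
corpus = ["the cat sat", "the cat ran", "a dog sat", "the cat sat"]

merged = Counter()
for chunk in iter_chunks(corpus, max_chars=25):
    # Adding the per-chunk counts gives the same totals as one big pass.
    merged.update(count_bigrams(chunk))

print(merged.most_common(3))
```

Because the chunks are split on line boundaries and n-grams are counted within a line, summing the per-chunk counts loses nothing compared with processing the whole corpus at once, while only one chunk needs to be held in memory at a time.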
An Alienware computer with 16 GB of RAM was used to process the data. The “tokenizer.R” script took anywhere from one hour to 90 minutes to filter and create each n-gram pair. The “merge.R” script took about 20 to 30 minutes to merge the n-gram pairs and add them to the final data frame.
The final data frame was parsed alphabetically into a series of 42 R object files (.rds) for quick retrieval by the text prediction application. These files preserve most of the unique word combinations discovered:
Two final reduction scripts shrank the data further, trading some of the application's final accuracy for size while preserving the most popular words and word combinations:
The final application predicts the most commonly used word combinations obtained from the SwiftKey corpora. The most likely guess is derived from the user's input and appended to it. Up to three alternative guesses are shown in an alternative-answer text box.
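A minimal Python sketch of this kind of lookup is shown below. It is an illustration only, not the application's actual logic: the names `build_model` and `predict`, the training sentences, and the choice to try the two-word context first and then fall back to the last word alone are assumptions.

```python
from collections import Counter, defaultdict


def build_model(sentences):
    """Map one-word and two-word contexts to counts of the word that follows."""
    model = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(1, len(words)):
            model[(words[i - 1],)][words[i]] += 1
            if i >= 2:
                model[(words[i - 2], words[i - 1])][words[i]] += 1
    return model


def predict(model, text, n_alternatives=3):
    """Return (best guess, alternatives), trying the two-word context first."""
    words = text.lower().split()
    ranked = []
    for context in (tuple(words[-2:]), tuple(words[-1:])):
        for word, _ in model.get(context, Counter()).most_common():
            if word not in ranked:
                ranked.append(word)
    if not ranked:
        return None, []
    return ranked[0], ranked[1:1 + n_alternatives]


# Hypothetical training text, not taken from the SwiftKey corpora.
model = build_model([
    "i love data science",
    "i love data mining",
    "i love data science",
    "we love coffee",
])

user_input = "I love"
best, alternatives = predict(model, user_input)
print(user_input + " " + best)  # most likely guess appended to the input
print(alternatives)             # up to three alternative guesses
```

Ranking the two-word matches ahead of the one-word matches favors the more specific context, and the one-word matches then fill any remaining alternative slots.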