Sean Jackson
July 17, 2016
Download the Swiftkey-provided news, blog, and Twitter corpora
Obtain samples of corpora
Clean corpora of foul language and prepare for tokenization
Tokenize corpora to create text data frames for basis of predictive model
Create Shiny app using cleaned corpora data frames to predict user inputs
After downloading and sampling the corpora, the tm library was used to clean the sample: converting the text to lower case, removing special characters, foul language, numbers, and punctuation, and finally stripping the extra white space left by the other changes. Initial testing showed that removing stop words and stemming words lowered the model's predictive ability.
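The cleaning steps can be illustrated with a minimal sketch. This is not the tm code used in the project; it is a Python approximation of the same sequence of steps. The function name `clean_text` and the `BLOCKED_WORDS` list are hypothetical placeholders (the actual foul-language list is not given here), and deleting apostrophes rather than replacing them with spaces is an assumption.

```python
import re

# Hypothetical placeholder list; the project's actual foul-language list is not shown.
BLOCKED_WORDS = {"darn", "heck"}

def clean_text(text, blocked=BLOCKED_WORDS):
    """Lower-case the text, drop non-letters and blocked words, collapse white space."""
    text = text.lower()
    # Assumption: delete apostrophes so "don't" becomes "dont" instead of "don t".
    text = text.replace("'", "")
    # Replace numbers, punctuation, and other special characters with spaces.
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() discards the extra white space created by the substitutions above.
    words = [w for w in text.split() if w not in blocked]
    return " ".join(words)

if __name__ == "__main__":
    print(clean_text("Well, DARN it: I don't have 3 apples!!  "))
```

Stop words are deliberately left in place, matching the observation that removing them hurt prediction.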
The RWeka library was used to tokenize the cleaned text into bigrams, trigrams, and quadgrams. These were then sorted by frequency and split into data frames for final use in the Shiny app.
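As a rough illustration of this step, the sketch below counts n-grams and sorts them by frequency. It is a Python stand-in, not the RWeka tokenizer; the names `ngram_counts` and `frequency_table` are hypothetical, and splitting on white space assumes the text has already been cleaned.

```python
from collections import Counter

def ngram_counts(lines, n):
    """Count every run of n consecutive words across the given lines."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def frequency_table(counts):
    """Return (ngram, count) pairs, most frequent first; ties are broken alphabetically."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

if __name__ == "__main__":
    sample = ["i love data science", "i love data"]
    for n in (2, 3, 4):  # bigrams, trigrams, quadgrams
        print(n, frequency_table(ngram_counts(sample, n)))
```

Each sorted table plays the role of one of the data frames described above.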
The application lets the user enter English text, then attempts to predict the next word based on up to the three most recent words entered. The app also displays what the user has entered and includes supporting tabs that explain the project.
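One plausible way to predict from "up to the prior three words" is a simple back-off lookup: try the quadgram table first, then fall back to trigrams and bigrams when no match is found. The sketch below assumes that strategy; the app's actual lookup logic is not described here, and `build_tables` and `predict_next` are hypothetical names.

```python
from collections import Counter, defaultdict

def build_tables(lines, max_n=4):
    """For n = 2..max_n, map each (n-1)-word prefix to a Counter of the words that follow it."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for line in lines:
        tokens = line.split()
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                prefix = tuple(tokens[i:i + n - 1])
                tables[n][prefix][tokens[i + n - 1]] += 1
    return tables

def predict_next(text, tables, max_n=4):
    """Return the most frequent follower of the longest matching prefix, or None."""
    tokens = text.lower().split()
    for n in range(max_n, 1, -1):  # longest context first, then back off
        if len(tokens) < n - 1:
            continue  # not enough words typed for this n-gram order
        prefix = tuple(tokens[-(n - 1):])
        followers = tables[n].get(prefix)
        if followers:
            return followers.most_common(1)[0][0]
    return None

if __name__ == "__main__":
    corpus = ["i love data science", "i love data science",
              "i love data analysis", "we love pizza"]
    tables = build_tables(corpus)
    print(predict_next("I love data", tables))
    print(predict_next("they love", tables))
```

In the second call no trigram starts with "they love", so the lookup backs off to the bigram table and uses only the last word.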
I want to give special thanks to Coursera, Johns Hopkins University, and Swiftkey for making this experience possible. I have learned many useful skills, which have helped spark my journey into data science.
Coursera Data Science [www.coursera.org/specializations/jhu-data-science]
Johns Hopkins University [https://www.jhu.edu/]
Swiftkey [https://swiftkey.com/en]
Shiny App [https://seanpj.shinyapps.io/TextPrediction/]