Coursera Data Science Capstone Presentation

Sean Jackson
July 17, 2016

Goals:

  • Download the SwiftKey-provided news, blog, and Twitter corpora

  • Obtain samples of corpora

  • Clean corpora of foul language and prepare for tokenization

  • Tokenize corpora to create text data frames for basis of predictive model

  • Create Shiny app using cleaned corpora data frames to predict user inputs

Data Preparation Methodology

After the corpora were downloaded and sampled, the tm library was used to clean the sample: converting the text to lower case, removing special characters, foul language, numbers, and punctuation, and finally stripping any extra white space left behind by the other changes. Initial testing showed that removing stop words and stemming words lowered the model's predictive ability.
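The cleaning steps described above can be sketched as follows. This is an illustrative Python sketch only, not the tm code used in the project; the function name `clean_text` and the placeholder word list `BLOCKED_WORDS` are assumptions made for the example, and "special characters" is assumed here to mean non-ASCII characters.

```python
import re
import string

# Hypothetical placeholder list; a real profanity list would be loaded from a file.
BLOCKED_WORDS = {"darn", "heck"}

def clean_text(text, blocked=BLOCKED_WORDS):
    """Apply the cleaning steps in the order listed in the methodology."""
    text = text.lower()                               # lower case
    text = re.sub(r"[^\x00-\x7f]", " ", text)         # special (non-ASCII) characters
    # Drop blocked words, ignoring punctuation attached to them.
    words = [w for w in text.split()
             if w.strip(string.punctuation) not in blocked]
    text = " ".join(words)
    text = re.sub(r"\d+", " ", text)                  # numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    return " ".join(text.split())                     # collapse leftover white space

print(clean_text("Well, DARN it: 3 cats   sat!"))
```

Stop words are deliberately left in, matching the finding that removing them hurt prediction.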

The RWeka library was used to tokenize the text into bigrams, trigrams, and quadgrams. These were then sorted by frequency and split into separate data frames for final use in the Shiny app.
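As a rough illustration of this step, the sketch below builds n-gram frequency tables sorted from most to least frequent. It is written in Python using only the standard library rather than the R tokenizer used in the project; `ngrams` and `ngram_table` are hypothetical helper names, and splitting on white space is an assumed simplification of the real tokenizer.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_table(lines, n):
    """Count n-grams over all lines and return (ngram, count) pairs,
    sorted by descending frequency."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(line.split(), n))
    return counts.most_common()

sample = ["i love data science", "i love data"]
bigrams = ngram_table(sample, 2)
trigrams = ngram_table(sample, 3)
quadgrams = ngram_table(sample, 4)
print(bigrams[0])
```

Keeping one table per n-gram size mirrors the separate data frames described above.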

Final Application

The application lets the user enter English text, then attempts to predict the next word based on up to the last three words entered. The app also displays what the user has entered and includes supporting tabs that explain the project.
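One simple way to predict from "up to the prior three words" is to try the longest available context first and fall back to shorter ones. The app's actual lookup logic is not described here, so the Python sketch below is an assumption: `build_model` and `predict_next` are hypothetical names, and the back-off rule (three words, then two, then one) is one plausible reading rather than the app's confirmed method.

```python
from collections import Counter, defaultdict

def build_model(lines, max_n=4):
    """Map each context (a tuple of 1 to max_n - 1 words) to a Counter
    of the words observed immediately after it."""
    model = defaultdict(Counter)
    for line in lines:
        tokens = line.lower().split()
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                model[context][tokens[i + n - 1]] += 1
    return model

def predict_next(model, text, max_context=3):
    """Try the last 3 words, then 2, then 1; return the most frequent
    follower of the first context found, or None if nothing matches."""
    tokens = text.lower().split()
    for k in range(min(max_context, len(tokens)), 0, -1):
        context = tuple(tokens[-k:])
        if context in model:
            return model[context].most_common(1)[0][0]
    return None

corpus = ["thank you very much", "thank you very much",
          "thank you so much", "see you soon"]
model = build_model(corpus)
print(predict_next(model, "Thank you very"))
```

With this toy corpus, an unseen phrase such as "hello you" falls back to the single-word context "you" before giving up.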

Application Screenshot

Thanks and Credits

I want to give special thanks to Coursera, Johns Hopkins University, and SwiftKey for making this experience possible. I have learned many useful skills that have helped spark my journey into data science.

Coursera Data Science [www.coursera.org/specializations/jhu-data-science]

Johns Hopkins University [https://www.jhu.edu/]

SwiftKey [https://swiftkey.com/en]

Shiny App [https://seanpj.shinyapps.io/TextPrediction/]