Coursera Capstone Project: Word Prediction App

Johannes Spiess
March 2018

This presentation describes the application developed during the Coursera Capstone Project: a word prediction app. It was developed in RStudio using data provided by SwiftKey.

Scope of the project

The overall task of the Coursera capstone project was to produce an R Shiny app with an underlying word prediction algorithm. The exercise was split into sub-tasks (obtaining and cleaning data, exploratory analysis, development of the algorithm, and creation of the data product).

The final product can be found here: https://spiessj.shinyapps.io/WordPredictionJS/.

Method applied (I)

  • Data from HC Corpora (blogs, news, Twitter) was provided by Coursera for download.
  • One sub-task of the project consisted of an in-depth exploratory analysis (word frequencies).
  • This was when the concept of n-grams, i.e. combinations of words (unigram = one word, bigram = two words, etc.), was introduced.
  • Due to the vast amount of data provided, one challenge in the project was to balance completeness and performance.
  • I decided to work with three samples of 5,000 records each, one from each of the data sets provided (blogs, news, Twitter).
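
The sampling step can be illustrated with a minimal sketch. The app itself was built in R; the Python below is only an illustration, and the function name `sample_lines`, the fixed seed, and sampling without replacement are assumptions that the slides do not specify.

```python
import random


def sample_lines(lines, k=5000, seed=2018):
    """Return up to k records drawn at random without replacement.

    The seed value and the sampling scheme are assumptions made for
    this illustration; they are not taken from the original project.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    lines = list(lines)
    if len(lines) <= k:
        # Fewer records than requested: keep everything.
        return lines
    return rng.sample(lines, k)
```

Applied once per source (blogs, news, Twitter), this would give three samples of 5,000 records each, trading completeness for speed as described above.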

Method applied (II)

  • The extracted samples were cleaned (converted to lower case; punctuation, numbers, extra whitespace and English stopwords removed).
  • The sample data was then transformed into a so-called corpus.
  • Based upon this corpus, tokenization was applied to generate n-grams.
  • These n-grams are essentially frequency dictionaries the prediction algorithm can rely on.
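
The cleaning and tokenization steps above can be sketched as follows. This is a Python illustration under stated assumptions, not the R code used in the project: the helper names `clean` and `ngram_counts` are hypothetical, and the stopword set is a tiny stand-in for a full English stopword list.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (assumption); a real run would use a
# complete English stopword list.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def clean(text):
    """Lower-case the text, strip punctuation and digits, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^\w\s]|_", "", text)  # remove punctuation
    text = re.sub(r"\d+", "", text)        # remove numbers
    tokens = text.split()                  # split() also collapses extra whitespace
    return [t for t in tokens if t not in STOPWORDS]


def ngram_counts(token_lists, n):
    """Count every run of n consecutive tokens across all records."""
    counts = Counter()
    for tokens in token_lists:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts
```

For example, `ngram_counts([clean("The cat sat, on 2 mats!")], 2)` counts the bigrams `("cat", "sat")`, `("sat", "on")` and `("on", "mats")` once each. The resulting counters play the role of the frequency dictionaries mentioned above.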

How to use the app

User interface of app

  • Enter the word or sentence for which you wish to run the prediction on the left (1).
  • The box on the right (2) will show the prediction result.
  • In the area under the result box (3), one finds additional information on the prediction result. “It” is used as a fallback result value, as it is the most common pronoun in the English language.
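
How a lookup with the “it” fallback might work can be sketched as below. The slides do not describe the exact prediction logic, so the back-off from longer to shorter n-grams is an assumption, as are the function name `predict_next` and the layout of `tables`. Only the fallback value comes from the description above. The sketch is in Python for illustration; the app itself runs in R.

```python
from collections import Counter

FALLBACK = "it"  # fallback result named in the slides


def predict_next(text, tables):
    """Predict the next word from n-gram frequency tables.

    tables maps an n-gram order n (2 or higher) to a Counter keyed by
    n-word tuples. Longer contexts are tried first (assumed back-off).
    """
    tokens = text.lower().split()
    for n in sorted(tables, reverse=True):
        if n < 2 or len(tokens) < n - 1:
            continue  # not enough input words for this order
        context = tuple(tokens[-(n - 1):])
        # Keep the final word of every n-gram that starts with the context.
        candidates = Counter({
            gram[-1]: count
            for gram, count in tables[n].items()
            if gram[:-1] == context
        })
        if candidates:
            return candidates.most_common(1)[0][0]
    return FALLBACK  # no n-gram matched the input
```

With a bigram table holding `("good", "morning")` three times and `("good", "night")` once, the input "Good" yields "morning", while an input with no matching n-gram yields "it".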

Further information

Visit the repository on GitHub with all files: https://github.com/SpiessJ/repohannes.

Check out the Coursera Data Science Specialization: https://www.coursera.org/specializations/jhu-data-science.

Get in touch: jkhspiess@gmail.com