12 February 2017

Aim

  • The main goal of the Coursera Data Science Capstone is the creation of a text prediction application.
  • Over the course of the project we computed summary statistics, performed exploratory analysis of the data, and explored a variety of R packages such as NLP, tm, and RWeka to help us build the prediction algorithm.
  • We named our app J.A.R.V.I.S. (Just A Rather Very Intelligent System)

Acknowledgments

  • I want to thank all our teachers and fellow learners for their contributions throughout these courses. I really enjoyed this educational journey and learned a lot about data science.

The process

  • The data used in this project came from a corpus called HC Corpora (www.corpora.heliohost.org).

  • Due to the size of the data, we decided to work with a sample from the corpus.

  • We removed punctuation, numbers, stopwords (a, and, also, the, etc.), profanity (using a text file that lists most English profanity words), and common word endings (“ing”, “es”, “s”). We also converted all characters to lower case and stripped unnecessary whitespace.
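
The sampling and cleaning steps above can be sketched as follows. This is an illustrative Python sketch, not the project's R code (which relied on packages such as tm). The tiny stopword and profanity sets, the suffix rule, and the function names `sample_lines`, `strip_suffix`, and `clean_text` are all assumptions made for the example.

```python
import random
import re

# Hypothetical stand-ins: the project used a full stopword list and a
# profanity text file; these tiny sets only illustrate the idea.
STOPWORDS = {"a", "and", "also", "the"}
PROFANITY = {"darn"}
SUFFIXES = ("ing", "es", "s")  # the word endings named in the text


def sample_lines(lines, fraction, seed=42):
    """Keep each line with probability `fraction` (assumed sampling scheme)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


def strip_suffix(word):
    """Remove the first matching ending, leaving at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_text(text):
    """Lower-case, drop punctuation/numbers/stopwords/profanity, trim endings."""
    text = text.lower()
    # Replace everything except a-z and whitespace with a space; this removes
    # punctuation and digits (and, in this simple sketch, non-ASCII letters).
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() with no arguments also collapses repeated whitespace.
    words = [w for w in text.split() if w not in STOPWORDS and w not in PROFANITY]
    return " ".join(strip_suffix(w) for w in words)
```

For example, `clean_text("The 3 dogs are   running, also jumping!")` returns `"dog are runn jump"`, which shows that a crude suffix rule like this one can leave non-word stems such as "runn".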

Modelling

  • The base of our algorithm was the n-gram model.
    An n-gram model is a type of probabilistic language model for predicting the next item in a sequence, in the form of an (n − 1)-order Markov model. (More on Wikipedia.)
  • We then created 1-gram, 2-gram, 3-gram and 4-gram tokenizers and their respective term-document matrices.
  • We then created data frames with the frequency (in descending order) of each n-gram in our corpus, which we used to make the predictions.
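
As a rough illustration of how n-gram frequency tables can drive a prediction, here is a minimal Python sketch (the project itself used R tokenizers and data frames). The lookup strategy is an assumption: it tries the longest available context first and falls back to shorter ones, which the slides do not confirm. The names `ngrams`, `build_tables`, and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return all consecutive n-token tuples from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_tables(sentences, max_n=4):
    """Count unigrams, and for n = 2..max_n map each (n-1)-word context
    to a Counter of the words that followed it."""
    unigrams = Counter()
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.split()
        unigrams.update(tokens)
        for n in range(2, max_n + 1):
            for gram in ngrams(tokens, n):
                tables[n][gram[:-1]][gram[-1]] += 1
    return unigrams, tables


def predict_next(phrase, unigrams, tables, max_n=4):
    """Assumed strategy: use the longest matching context, then shorter ones,
    and finally the most frequent single word."""
    tokens = phrase.lower().split()
    for n in range(max_n, 1, -1):
        context = tuple(tokens[-(n - 1):])
        if len(context) == n - 1 and context in tables[n]:
            return tables[n][context].most_common(1)[0][0]
    return unigrams.most_common(1)[0][0] if unigrams else None


# Toy corpus, invented for the example.
corpus = [
    "i love data science",
    "i love data analysis",
    "i love data science very much",
]
unigrams, tables = build_tables(corpus)
```

With this toy corpus, `predict_next("I love data", unigrams, tables)` returns `"science"`, because "science" follows "i love data" more often than "analysis" does.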

The J.A.R.V.I.S. app

The app was designed to be user-friendly and simple to use. The tools we used were R and Shiny.

Instructions

  1. Write a short English phrase in the text box
  2. Press the prediction button
  3. A predicted word is printed