Coursera Data Science Capstone Project

12 February 2017

Aim

The main goal of the Coursera Data Science Capstone is the creation of a predictive text application.
Through the course of the project we explored summary statistics for the data, carried out exploratory analysis, and tried a variety of R packages such as NLP, tm and RWeka to help us build the prediction algorithm.
We named our app J.A.R.V.I.S. (Just A Rather Very Intelligent System).

I also want to thank all our teachers and fellow learners for their contributions throughout these courses. I really enjoyed this educational journey and learned a lot about data science.

The data used in this project came from a corpus called HC Corpora (www.corpora.heliohost.org).
Due to the size of the data, we decided to work with a sample from the corpus.
We removed punctuation, numbers, stopwords (a, and, also, the, etc.), profanity (using a text file that contains most of the profanity words in English) and common word endings (“ing”, “es”, “s”). We also converted all characters to lower case and stripped unnecessary whitespace.
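
The cleaning steps above can be sketched roughly as follows. This is an illustration in Python, not the R code used in the project (which relied on packages such as tm). The word lists are tiny hypothetical placeholders, and the rule of keeping at least three characters when stripping an ending is an assumption added here to avoid mangling short words.

```python
import re

# Hypothetical stand-ins: the project used a full stopword list and a text
# file of English profanity, neither of which is reproduced here.
STOPWORDS = {"a", "and", "also", "the"}
PROFANITY = {"darn"}  # placeholder entry only
SUFFIXES = ("ing", "es", "s")  # endings named in the text, longest first


def strip_suffix(word):
    """Remove the first matching ending, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_text(text):
    """Lower-case, drop punctuation and digits, filter words, strip endings."""
    text = text.lower()
    # Anything that is not a-z or whitespace (punctuation, numbers) becomes a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    words = text.split()  # split() also collapses repeated whitespace
    kept = [w for w in words if w not in STOPWORDS and w not in PROFANITY]
    return " ".join(strip_suffix(w) for w in kept)


print(clean_text("The 3 dogs   were barking, and also walking!"))
# prints: dog were bark walk
```

Stripping endings this way is much cruder than a real stemmer, so it should be read only as a picture of the order of the steps.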

The base of our algorithm was the n-gram model.
An n-gram model is a type of probabilistic language model that predicts the next item in a sequence using an (n − 1)-order Markov model (see Wikipedia).
We then created 1-gram, 2-gram, 3-gram and 4-gram tokenizers and their respective term-document matrices.
Finally, we created data frames with the frequency (in descending order) of each n-gram in our corpus, which we used to make the predictions.
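
To make the idea concrete, here is a minimal sketch of counting 1-grams to 4-grams and using the counts to suggest the next word. It is written in Python rather than with the R tokenizers and data frames the project used. The function names are hypothetical, and the fallback from longer to shorter contexts is an assumption for illustration; the slides do not say how the app chooses between n-gram orders.

```python
from collections import Counter, defaultdict


def build_tables(tokens, max_n=4):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    tables = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i : i + n - 1])  # empty tuple for 1-grams
            next_word = tokens[i + n - 1]
            tables[context][next_word] += 1
    return tables


def predict_next(tables, words, max_n=4, top=3):
    """Return up to `top` candidates from the longest context seen in the data."""
    # Assumed strategy: try the last 3 words, then 2, then 1, then no context.
    for size in range(min(max_n - 1, len(words)), -1, -1):
        context = tuple(words[len(words) - size :])
        if context in tables:
            # most_common plays the role of the frequency-sorted data frame
            return [w for w, _ in tables[context].most_common(top)]
    return []


# Toy corpus, invented for this example only.
tokens = "we like green tea today we like green apples today we like black tea".split()
tables = build_tables(tokens)
print(predict_next(tables, ["we", "like"]))
# prints: ['green', 'black']
```

In the toy corpus "we like" is followed by "green" twice and "black" once, so "green" is ranked first.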

The app was designed to be user-friendly and simple to use. The tools we used were R and Shiny.

Instructions