Capstone Presentation for Data Science Specialisation

Barbara M
December 2017

Next Word Prediction Shiny App

A Natural Language Processing (NLP) model was built using n-grams. The data for building the model were downloaded from https://web-beta.archive.org/web/20160930083655/http://www.corpora.heliohost.org/aboutcorpus.html

The cmscu package was used to build 1- to 4-gram models with a training dataset of 50% of the combined Twitter, news and blogs text files containing a total of 2.1 million text messages.
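As a rough illustration of what building 1- to 4-gram counts involves, here is a minimal Python sketch. It is not the code used for this project: it keeps exact counts in an in-memory `Counter` instead of using the cmscu package, and the function name `count_ngrams`, the whitespace tokenisation and the sample lines are all assumptions made for the example.

```python
from collections import Counter


def count_ngrams(lines, max_n=4):
    """Count every 1- to max_n-gram in an iterable of already-cleaned text lines."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = line.split()  # assumption: cleaned text splits on whitespace
        for n in range(1, max_n + 1):
            # Slide a window of n tokens across the line.
            for i in range(len(tokens) - n + 1):
                counts[n][" ".join(tokens[i:i + n])] += 1
    return counts


# Invented sample lines, for illustration only.
sample = ["thanks for the follow", "thanks for the help"]
counts = count_ngrams(sample)
print(counts[3]["thanks for the"])
```

Exact counting like this is only practical for small samples; it is shown here purely to make the structure of the n-gram tables concrete.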

Text Cleaning

The method for text cleaning and n-gram building was adapted from code in Dave Vinson's cmscu tutorial, available at http://davevinson.com/cmscu-tutorial.html

Stopwords were not removed as they are likely to be present in the input phrase.

A separate 1-gram dictionary was created to supply candidate completion words for the phrase. This dictionary was additionally cleaned with hunspell and had stopwords removed, giving a cleaner set of words from which to select the next word. It contains 144K words.

More information about hunspell can be found at https://cran.r-project.org/web/packages/hunspell/hunspell.pdf

Model Construction

The model itself is very simple. The last 3 words of the input phrase are prepended to each term in the 1-gram dictionary to create a 4-word phrase. A cmscu query function is then used to find the most frequent of these phrases in the 4-gram model. If no match is found using the last 3 words, the model checks the last 2 words against the 3-gram model.
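The lookup with back-off described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the app's actual R code: plain dictionaries stand in for the cmscu query function, and the name `predict_next`, the toy counts and the toy dictionary are hypothetical.

```python
def predict_next(phrase, dictionary, counts):
    """Return the dictionary word whose n-gram with the phrase tail is most frequent.

    counts maps n -> {n-gram string: count}. The last 3 words are tried against
    the 4-gram counts first; if nothing matches, the last 2 words are tried
    against the 3-gram counts. Returns None when neither lookup finds a match.
    """
    tokens = phrase.split()
    for context_len in (3, 2):
        if len(tokens) < context_len:
            continue
        context = " ".join(tokens[-context_len:])
        table = counts.get(context_len + 1, {})
        best_word, best_count = None, 0
        for word in dictionary:
            c = table.get(context + " " + word, 0)
            if c > best_count:
                best_word, best_count = word, c
        if best_word is not None:
            return best_word
    return None


# Toy counts and dictionary, invented for illustration only.
toy_counts = {
    4: {"thanks for the follow": 3, "thanks for the help": 1},
    3: {"for the follow": 3, "of the day": 5},
}
toy_dictionary = ["follow", "help", "day"]

print(predict_next("thanks for the", toy_dictionary, toy_counts))  # follow
print(predict_next("quote of the", toy_dictionary, toy_counts))    # day
```

The second call finds no 4-gram match for "quote of the", so it falls back to "of the" in the 3-gram counts.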

Model Performance: I attempted to run the benchmark.R program but could not get it to work. Instead, I measured the model's accuracy against the course quizzes. In Quiz 2 the model answered 40% of the questions correctly (questions 1, 2, 3 and 7). In Quiz 3 it answered 30% correctly (questions 5, 6 and 8).

This suggests an average accuracy of 35%.
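The averaging behind this figure can be checked with a few lines of Python. The value of 10 questions per quiz is an assumption, inferred from 4 correct answers corresponding to 40%.

```python
# Questions answered correctly in each quiz, as reported above.
quiz_correct = {"Quiz 2": [1, 2, 3, 7], "Quiz 3": [5, 6, 8]}
QUESTIONS_PER_QUIZ = 10  # assumption: implied by 4 correct answers = 40%

accuracies = {quiz: len(correct) / QUESTIONS_PER_QUIZ
              for quiz, correct in quiz_correct.items()}
average = sum(accuracies.values()) / len(accuracies)
print(f"{average:.0%}")
```

With only 20 questions in total, this is a very small sample, so the figure is a rough indication of accuracy and not a benchmark.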

Shiny App

The user enters an input phrase of at least 3 words and clicks the “Submit” button. The phrase is subjected to the same cleaning process used to build the n-grams.

The app combines up to the last 3 words of the input phrase with each dictionary term, looks up each combination in the n-gram model, and returns the word with the maximum count.

The App can be found at: https://moloneb.shinyapps.io/CapstoneProject2/