Data Science Capstone

Big Data Face
August 2015

Shiny App. Capstone Swiftkey Project.

The aim of the Data Science Specialisation Capstone project is to produce a Shiny web application that tries to predict the next word in a user-generated sentence. The data has been provided by Swiftkey and includes content from blogs, news and Twitter. The Shiny App can be accessed from the following location:

http://bigdataface.shinyapps.io/Shiny

Shiny App. Process


The data was first tokenised and processed using the 'tm' package to transform all text to lowercase, remove punctuation, strip white space and remove numbers.
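
The cleaning steps above can be illustrated with a minimal sketch. It is written in Python rather than with 'tm', so it is not the code used for the app; the function names clean_text and tokenise are hypothetical, and the order of the steps is an assumption.

```python
import re
import string

def clean_text(text):
    """Apply the four cleaning steps described above (order is an assumption)."""
    text = text.lower()                                               # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = re.sub(r"\d+", "", text)                                   # remove numbers
    text = re.sub(r"\s+", " ", text).strip()                          # strip extra white space
    return text

def tokenise(text):
    """Split the cleaned text into word tokens on white space."""
    return clean_text(text).split()

print(tokenise("Hello,  World! 123 times"))
```

Note that removing punctuation this way also drops apostrophes, so "don't" becomes "dont".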

I have also removed profane words from the text based on an external list. Note that this list is not complete, so a limited number of words considered profane may still appear in the text.
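
A list-based filter of this kind might look like the following Python sketch. The function name remove_profanity and the example list are hypothetical, and dropping individual tokens (rather than whole lines) is an assumption.

```python
def remove_profanity(tokens, banned_words):
    """Drop every token that appears in the external list (case-insensitive)."""
    banned = {word.lower() for word in banned_words}
    return [token for token in tokens if token.lower() not in banned]

# Placeholder list; a real list would be loaded from an external file.
example_list = ["darn"]
print(remove_profanity(["what", "a", "darn", "day"], example_list))
```

Any word missing from the list passes through unchanged, which is why some profane words can remain.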

The n-gram method was then used to structure the data into groups of consecutive words. The highest order I used was the 6-gram, which allows the algorithm to predict the sixth word based on the previous five.
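
As an illustration of how n-grams are formed and counted, here is a short Python sketch. The helper names ngrams and count_ngrams are hypothetical and the sample sentence is made up; the app itself was not built with this code.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(tokens, max_n=6):
    """Count n-grams of every order from 1 up to max_n."""
    return {n: Counter(ngrams(tokens, n)) for n in range(1, max_n + 1)}

tokens = "the cat sat on the mat".split()
counts = count_ngrams(tokens)
print(counts[2].most_common(3))
```

A text shorter than n words simply yields no n-grams of that order.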

Shiny App. Prediction Generation


I have used n-grams up to 6-grams to predict the next word in a sentence, combined in a back-off model. The programme first tries to match the last five words entered against the 6-grams and predict the sixth word. If it cannot find the full sequence, it looks for the last four words in the 5-grams and predicts the fifth word, and so on down through the 4-grams, 3-grams and 2-grams.
If it still cannot find a match, the most frequent word in the corpus is suggested (note: this is always the word 'the' in the English data).
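
The back-off lookup described above can be sketched in Python as follows. This is a simplified illustration, not the app's implementation: build_model and predict_next are hypothetical names, the toy corpus is invented, and picking the single most frequent continuation at each level is an assumption.

```python
from collections import Counter, defaultdict

def build_model(tokens, max_n=6):
    """Map each context of 1 to max_n - 1 words to counts of the word that follows it."""
    model = defaultdict(Counter)
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            model[context][tokens[i + n - 1]] += 1
    return model, Counter(tokens)

def predict_next(words, model, unigrams, max_n=6):
    """Try the longest context first, then back off to shorter ones."""
    for k in range(min(max_n - 1, len(words)), 0, -1):
        context = tuple(words[-k:])
        if context in model:
            return model[context].most_common(1)[0][0]
    # No context matched: fall back to the most frequent word in the corpus.
    return unigrams.most_common(1)[0][0]

tokens = "the cat sat on the mat and the cat ran".split()
model, unigrams = build_model(tokens)
print(predict_next(["the", "cat", "sat"], model, unigrams))
print(predict_next(["zzz", "sat"], model, unigrams))
print(predict_next(["unknown"], model, unigrams))
```

In this toy corpus the first call matches a three-word context, the second backs off to the single word "sat", and the third falls through to the most frequent word.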

Shiny App. How it works

Please allow 30 seconds for the first download; the algorithm will then predict based on the default text.

Insert your text into the text box and hit submit. The prediction will appear on the right as indicated.

The documentation tab contains further details.