Shiny App Presentation

Frederica Janga, Coursera Capstone Project
23-03-2017

Short Summary

  • This presentation reports on the Capstone Project of the Data Science Specialization offered by Coursera.

  • The goal of the Capstone Project is to build an algorithm that predicts the next word from an input of 1, 2 or 3 words. A publicly accessible Shiny App displays the predictions, so anyone can test the algorithm.

  • The prediction algorithm and the Shiny App are built entirely in R. The predictions are based on 3 English text files (Twitter, News and Blog), so the predicted words are in English.

Description of the Algorithm

The word prediction is done in the following steps:

  • First, the 3 big text files are cleaned: numbers and punctuation are removed and the text is converted to lowercase.

  • Then tokenizers are set up to build 2-, 3- and 4-gram DocumentTermMatrices. Each DocumentTermMatrix is reshaped into a dataframe with 2, 3 or 4 word columns plus a frequency column. Every word column holds exactly 1 word, so the number of word columns equals the order of the n-gram.

  • The Backoff algorithm is used for the prediction. With n input words, it first looks for a match in the (n+1)-gram dataframe. If there is none, it backs off to the n-gram dataframe, then to the (n-1)-gram dataframe, and so on until a match is found.

If the input consists of more than 4 words, only the last 3 input words are compared to the 4-gram dataframe.
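
The steps above can be sketched in a few lines of code. The app itself is written in R and works with DocumentTermMatrices; the Python sketch below only illustrates the logic and is not the actual implementation. All function names (clean, ngram_counts, build_tables, predict) are hypothetical, the tiny corpus is made up for the demo, and the cleaning rule is a simplification that also drops apostrophes and non-ASCII letters.

```python
import re
from collections import Counter


def clean(text):
    """Lowercase the text, drop numbers and punctuation, split into words."""
    # Assumption: everything except a-z and whitespace is replaced by a space.
    return re.sub(r"[^a-z\s]", " ", text.lower()).split()


def ngram_counts(tokens, n):
    """Count every run of n consecutive words (one tuple per n-gram)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def build_tables(lines, orders=(2, 3, 4)):
    """Build one frequency table per n-gram order (2-, 3- and 4-grams)."""
    tables = {n: Counter() for n in orders}
    for line in lines:
        tokens = clean(line)
        for n in orders:
            tables[n].update(ngram_counts(tokens, n))
    return tables


def predict(text, tables, k=1):
    """Return up to k next-word candidates using a simple backoff."""
    max_context = max(tables) - 1            # 3 words for a 4-gram table
    context = tuple(clean(text)[-max_context:])
    while context:
        # n context words are matched against the (n+1)-gram table.
        table = tables.get(len(context) + 1, {})
        candidates = Counter({gram[-1]: count
                              for gram, count in table.items()
                              if gram[:-1] == context})
        if candidates:
            return [word for word, _ in candidates.most_common(k)]
        context = context[1:]                # back off: drop the oldest word
    return []                                # no match at any order


# Made-up demo corpus, standing in for the Twitter, News and Blog files.
corpus = ["thank you so much", "Thank you so much!!",
          "thank you so kindly", "See you soon 123"]
tables = build_tables(corpus)

print(predict("Thank you so", tables))          # ['much']
print(predict("please tell you", tables))       # ['so']
print(predict("four five six see you", tables)) # ['soon']
```

In the second call neither the 4-gram nor the 3-gram table contains a match, so the sketch backs off to the 2-gram table. In the third call only the last 3 words of the input are used as context.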

Shiny App: How to use it?

  • Write a piece of text in the input box.

  • Choose the number of predicted words you want to see.

  • Choose whether the input text is shown.

  • Wait a moment…

  • The predictions for the next word(s) will appear!

Try out my app here: https://fredje8.shinyapps.io/ShinyAppMSP/

Results and Recap

It was a big challenge to build a prediction algorithm using Natural Language Processing. It was my first time working with text prediction.

By combining the different techniques I learned throughout the Coursera course, I arrived at a (simple) solution for the prediction algorithm.

The model could be improved by:

  • Using longer n-grams

  • Improving the speed of the prediction

  • Making the Shiny App even shinier