Frederica Janga, Coursera Capstone Project
23-03-2017
This presentation reports on the Capstone Project of the Data Science Specialization offered by Coursera.
The goal of the Capstone Project is to build a prediction algorithm that predicts the next word from 1, 2, or 3 input words. A Shiny App, accessible to anyone, shows the outcome of the prediction algorithm so that others can test it.
The prediction algorithm and the Shiny App are built entirely in R. The predictions are based on 3 English text files: Twitter, News, and Blog text files. The predictions are therefore in English.
The word prediction is done in the following steps:
First, the 3 big text files are cleaned: numbers and punctuation are removed and the text is converted to lowercase.
Then tokenizers are set up to build 2-, 3-, and 4-gram DocumentTermMatrices. These DocumentTermMatrices are reshaped until they consist of 2, 3, or 4 word columns and a frequency column. Every word column holds exactly 1 word, so the number of word columns depends on the n-gram size.
The backoff algorithm is used for the prediction. It first looks for a match in the (number of input words)+1-gram dataframe; if none is found, it recursively falls back to the (number of input words)-gram dataframe, then to the (number of input words)-1-gram dataframe, until a match is found.
If the number of input words is more than 4, the last 3 input words are compared to the 4-gram dataframe.
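The steps above can be sketched in a few lines of base R. This is only a minimal illustration under stated assumptions, not the code behind the app: the function names (clean_text, build_ngram_table, predict_next), the column names w1..wn and freq, and the tiny corpus are all hypothetical. The counting is done with table() instead of DocumentTermMatrices, the backoff is written as a loop rather than recursively, and an empty result is returned when nothing matches.

```r
# Step 1 (assumed helper): lowercase, strip numbers and punctuation.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:digit:][:punct:]]+", "", x)
  trimws(gsub("[[:space:]]+", " ", x))
}

# Step 2 (assumed helper): count n-grams and split them into one word
# per column (w1..wn) plus a frequency column, most frequent first.
build_ngram_table <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(words) {
    if (length(words) < n) return(character(0))
    vapply(seq_len(length(words) - n + 1),
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  stopifnot(length(grams) > 0)  # sketch assumes at least one n-gram exists
  counts <- sort(table(grams), decreasing = TRUE)
  parts <- do.call(rbind, strsplit(names(counts), " ", fixed = TRUE))
  df <- data.frame(parts, freq = as.integer(counts), stringsAsFactors = FALSE)
  names(df) <- c(paste0("w", seq_len(n)), "freq")
  df
}

# Step 3 (assumed helper): backoff. Use at most the last 3 input words,
# try the longest matching n-gram table first, then drop the oldest word.
predict_next <- function(input, tables, k = 1) {
  words <- strsplit(clean_text(input), " ", fixed = TRUE)[[1]]
  words <- tail(words, 3)
  while (length(words) > 0) {
    n <- length(words) + 1
    tab <- tables[[as.character(n)]]
    hit <- rep(TRUE, nrow(tab))
    for (i in seq_along(words)) {
      hit <- hit & tab[[paste0("w", i)]] == words[i]
    }
    if (any(hit)) {
      matched <- tab[hit, ]
      matched <- matched[order(matched$freq, decreasing = TRUE), ]
      return(head(matched[[paste0("w", n)]], k))
    }
    words <- words[-1]  # back off to a shorter context
  }
  character(0)  # no match at any level
}

# Toy corpus standing in for the Twitter, News and Blog files.
corpus <- clean_text(c("I love New York!",
                       "I love new shoes.",
                       "We love New York 24/7."))
tables <- lapply(2:4, function(n) build_ngram_table(corpus, n))
names(tables) <- as.character(2:4)

predict_next("They really love new", tables, k = 2)  # "york" "shoes"
predict_next("we love", tables)                      # "new"
```

In the first call no 4-gram starts with "really love new", so the sketch backs off to the 3-gram table, where "love new york" (seen twice) outranks "love new shoes" (seen once).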
Write a piece of text in the input box.
Choose the number of predicted words you want to see.
Choose whether the input text is shown.
Wait a moment…
The predictions for the next word(s) will appear!
Try out my app here: https://fredje8.shinyapps.io/ShinyAppMSP/
It was a big challenge to build a prediction algorithm using Natural Language Processing. It was my first time working on text prediction.
After combining the different techniques that I learned throughout the Coursera course, I found a (simple) solution for the prediction algorithm.
The model could be improved by:
Using longer n-grams
Improving the speed of the prediction
Making a shinier Shiny App