The Word Prediction Application

6th March 2018

Background

The Word Prediction Application was created as the final Capstone Project of the Data Science Specialization offered by Johns Hopkins University on Coursera.

The data comes from a corpus called HC Corpora, which contains text collected from blogs, Twitter and news sites.

This project uses the English dataset, which consists of three files:

en_US.blogs.txt
en_US.twitter.txt
en_US.news.txt

How does it work?

The application is very easy to use.

Just enter a phrase and press the Submit button.
The predicted next word is shown in the Prediction panel.

Data cleaning and preprocessing

The main R packages used for this implementation were tidytext, tidyr, dplyr and stringi.
The algorithm was trained on randomly selected data (5% of the blogs and news data sets and 0.1% of the Twitter data set).
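
The sampling step could look roughly like the sketch below. It is an illustration in base R only: the helper name `sample_lines`, the seed and the toy inputs are assumptions and are not taken from the original project, where the input would be the lines read from the three en_US files.

```r
# Hypothetical helper: keep a random fraction of the lines of one text source.
sample_lines <- function(lines, fraction, seed = 2018) {
  set.seed(seed)                               # assumed seed, for reproducibility only
  n_keep <- round(length(lines) * fraction)    # number of lines to keep
  lines[sample.int(length(lines), n_keep)]     # sample without replacement
}

# Toy stand-ins for the real corpora, which could be read with e.g.
# readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE).
blogs   <- paste("blog line", 1:1000)
twitter <- paste("tweet", 1:10000)

blogs_sample   <- sample_lines(blogs, 0.05)     # 5% of blogs (same rate for news)
twitter_sample <- sample_lines(twitter, 0.001)  # 0.1% of twitter
```

Sampling whole lines keeps each sampled sentence intact, which matters later because n-grams are built from consecutive words.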
To improve algorithm performance, the following text cleaning was done:

Convert all characters to lower case.
Remove numbers.
Remove punctuation.
Remove extra whitespace.
Remove non-ASCII characters.
Remove stop words.
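
The cleaning steps listed above can be sketched in a few lines of base R. This is a minimal illustration, not the project's actual code: the function name `clean_text` and the short stop-word list are assumptions (the original most likely relied on tidytext and stringi, with a much longer stop-word list), and the order of the steps is one reasonable choice.

```r
# Assumed, deliberately tiny stop-word list for illustration only.
stop_words_demo <- c("the", "a", "an", "is", "of", "and", "to", "in")

# Hypothetical helper applying the cleaning steps to a character vector.
clean_text <- function(x, stop_words = stop_words_demo) {
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")    # remove non-ASCII characters
  x <- tolower(x)                                          # lower case
  x <- gsub("[[:digit:]]+", " ", x, perl = TRUE)           # remove numbers
  x <- gsub("[[:punct:]]+", " ", x, perl = TRUE)           # remove punctuation
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, " ", x, perl = TRUE)                  # remove stop words (whole words only)
  x <- gsub("\\s+", " ", x, perl = TRUE)                   # collapse extra whitespace
  trimws(x)
}

clean_text("The 3 Cafes of Paris, is GREAT!!")
# "cafes paris great"
```

Numbers and punctuation are replaced by a space rather than deleted so that neighbouring words are not glued together; the final whitespace step then tidies up the result.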

Prediction algorithm

Prediction is based on n-grams and a simple backoff model.
The algorithm takes the words from the user input and tries to find the next word.
What I actually created is a set of dataframes containing n-grams (bigrams, trigrams, quadgrams), and a function that searches these dataframes.
The search starts from the (n+1)-gram dataframe, where n is the number of words in the user input.
For example, if the user input contains two words, the function takes both words and returns the row from the trigrams dataframe whose first column matches the first word and whose second column matches the second word.
If no matching row is found, the algorithm tries to find a match in the lower-order n-gram dataframes.
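
The lookup described above can be sketched as follows, again in base R. Everything named here is hypothetical: `build_ngrams`, `predict_next`, the `freq` column and the toy corpus are illustrations, not the project's code. The sketch also makes three assumptions the text does not state: the last column of a matching row is the predicted word, the most frequent match wins, and inputs longer than three words are cut to their last three.

```r
# Hypothetical helper: count n-grams in already-cleaned lines.
# Returns a dataframe with columns w1..wn and freq, most frequent first.
build_ngrams <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(words) {
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- sort(table(grams), decreasing = TRUE)
  df <- as.data.frame(do.call(rbind, strsplit(names(counts), " ", fixed = TRUE)),
                      stringsAsFactors = FALSE)
  names(df) <- paste0("w", seq_len(n))
  df$freq <- as.integer(counts)
  df
}

# Hypothetical search function: start at the (n+1)-gram table and back off.
# `tables` is a list named by n-gram order, e.g. list("2" = ..., "3" = ...).
predict_next <- function(input, tables) {
  words <- strsplit(trimws(tolower(input)), "\\s+")[[1]]
  max_context <- max(as.integer(names(tables))) - 1
  words <- tail(words, max_context)          # assumed: keep only the last words
  while (length(words) > 0) {
    tbl <- tables[[as.character(length(words) + 1)]]
    hit <- rep(TRUE, nrow(tbl))
    for (i in seq_along(words)) hit <- hit & tbl[[i]] == words[i]
    if (any(hit)) return(tbl[[length(words) + 1]][which(hit)[1]])
    words <- words[-1]                       # back off: drop the oldest word
  }
  NA_character_                              # no match at any level
}

# Toy, already-cleaned corpus (assumed data).
corpus <- c("love new york city", "love new york city", "love new york pizza",
            "visit new york city", "new music festival")
tables <- list("2" = build_ngrams(corpus, 2),
               "3" = build_ngrams(corpus, 3),
               "4" = build_ngrams(corpus, 4))

predict_next("love new york", tables)   # found directly in the quadgrams
predict_next("tasty new york", tables)  # no quadgram match, backs off to trigrams
```

Because each table is sorted by frequency when it is built, the first matching row is also the most frequent continuation, so the search needs no extra ranking step.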

Links

Word Prediction Application: https://rasieev.shinyapps.io/next_word_app/