6th March 2018

Background

The Word Prediction Application was created as the final Capstone Project of the Data Science Specialization offered by Johns Hopkins University on Coursera.

The data comes from a corpus called HC Corpora, which contains text collected from blogs, Twitter and news.

This project uses the English data set:

  • en_US.blogs.txt
  • en_US.twitter.txt
  • en_US.news.txt

How does it work?

The application is very easy to use.

  • Just enter a phrase and press the Submit button.
  • The predicted next word is shown in the Prediction panel.

Data cleaning and preprocessing

  • The main R packages used for this implementation were tidytext, tidyr, dplyr and stringi.
  • The algorithm was trained on randomly selected data (5% of the blogs and news data sets and 0.1% of the Twitter data set).
  • To improve algorithm performance, the following text cleaning was done:
  1. Convert all characters to lower case.
  2. Remove numbers.
  3. Remove punctuation.
  4. Remove extra whitespace.
  5. Remove non-ASCII characters.
  6. Remove stop words.
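
The sampling and cleaning steps above could be sketched roughly as follows. This is a minimal base-R illustration, not the project's actual code (which relied on tidytext, tidyr, dplyr and stringi). The function names sample_lines and clean_text, the seed and the short stop-word vector are all hypothetical placeholders, and the steps run in a slightly different order than listed so that whitespace is normalized last.

```r
# Hypothetical helpers: names, seed and stop-word list are illustrative only.

# Keep a random fraction of lines (e.g. 0.05 for blogs/news, 0.001 for Twitter).
sample_lines <- function(lines, fraction, seed = 1234) {
  set.seed(seed)
  size <- round(length(lines) * fraction)
  keep <- sample.int(length(lines), size = size)
  lines[sort(keep)]
}

# Apply the six cleaning steps to a character vector of lines.
clean_text <- function(x,
                       stop_words = c("a", "an", "the", "of", "and", "to", "in")) {
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")  # remove non-ASCII
  x <- tolower(x)                                        # lower case
  x <- gsub("[0-9]+", "", x)                             # remove numbers
  x <- gsub("[[:punct:]]+", "", x)                       # remove punctuation
  # Splitting on runs of whitespace also removes the extra whitespace.
  words <- strsplit(trimws(x), "\\s+")
  vapply(words, function(w) {
    paste(w[nzchar(w) & !(w %in% stop_words)], collapse = " ")  # drop stop words
  }, character(1))
}

cleaned <- clean_text("The 2 quick, brown  foxes!")
print(cleaned)  # "quick brown foxes"
```

A real run would replace the placeholder stop-word vector with a full list, for example the one shipped with tidytext.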

Prediction algorithm

  • Prediction is based on n-grams and a simple backoff model.
  • The algorithm takes the words from the user input and tries to find the next word.
  • The implementation is a set of data frames containing n-grams (bigrams, trigrams and quadgrams) and a function that searches these data frames.
  • The search starts from the (n+1)-gram data frame, where n is the number of words in the input.
  • For example, if the user input contains two words, the function looks in the trigram data frame for a row whose first column matches the first word and whose second column matches the second word.
  • If no matching row is found, the algorithm falls back to the lower-order n-gram data frames.
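
The lookup described above might look something like the sketch below, written in base R on a toy corpus. It is an illustration under several assumptions rather than the application's code: build_ngrams, predict_next and the toy corpus are hypothetical; the preceding words are stored in a single prefix column instead of one column per word; candidates are ranked by raw frequency; and inputs longer than three words are truncated to their last three words, a case the description above does not cover.

```r
# Hypothetical sketch of n-gram tables with a simple backoff lookup.

# Build a data frame of (prefix, nxt, freq) for n-grams of order n.
build_ngrams <- function(lines, n) {
  prefix <- character(0)
  nxt <- character(0)
  for (w in strsplit(lines, " ", fixed = TRUE)) {
    if (length(w) < n) next
    for (i in seq_len(length(w) - n + 1)) {
      prefix <- c(prefix, paste(w[i:(i + n - 2)], collapse = " "))
      nxt <- c(nxt, w[i + n - 1])
    }
  }
  if (length(nxt) == 0) {
    return(data.frame(prefix = character(0), nxt = character(0),
                      freq = integer(0), stringsAsFactors = FALSE))
  }
  df <- data.frame(prefix, nxt, freq = 1L, stringsAsFactors = FALSE)
  counts <- aggregate(freq ~ prefix + nxt, data = df, FUN = sum)
  counts[order(-counts$freq), ]  # most frequent n-grams first
}

# Start at the (k+1)-gram table for k input words, then back off.
predict_next <- function(input, tables) {
  words <- strsplit(trimws(tolower(input)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  max_n <- max(as.integer(names(tables)))
  k <- min(length(words), max_n - 1)  # assumption: use only the last words
  while (k >= 1) {
    prefix <- paste(tail(words, k), collapse = " ")
    tbl <- tables[[as.character(k + 1)]]
    hits <- tbl[tbl$prefix == prefix, ]
    if (nrow(hits) > 0) return(as.character(hits$nxt[1]))
    k <- k - 1  # no match: fall back to the lower-order table
  }
  NA_character_  # nothing found at any order
}

# Toy corpus, assumed to be already cleaned.
corpus <- c("i love new york", "i love new york city",
            "i love new shoes", "new york is big")
tables <- setNames(lapply(2:4, function(n) build_ngrams(corpus, n)), 2:4)

print(predict_next("love new", tables))   # found in the trigram table
print(predict_next("brand new", tables))  # backs off to the bigram table
```

In this toy example the second query has no trigram match for "brand new", so the function retries with "new" alone in the bigram table. Growing vectors inside a loop keeps the sketch short but would be too slow for the full corpus.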

Links