Friday, August 21, 2015

The Word Prediction App

The Word Prediction App is a small Shiny application, written in R, that accepts an n-gram (a short sequence of words) as input and predicts the next word.

The Word Prediction App offers advantages over standard text typing solutions by providing

  • Increased speed for typing content
  • Increased spelling accuracy

How to use the App

  • Type any text and then choose Predict
  • The input text is cleaned and then matched against the 4-grams, 3-grams and 2-grams stored in a data.table
  • The most likely next word is then displayed as the output

How does it work

The data comes from a corpus called HC Corpora (www.corpora.heliohost.org). See the readme file at http://www.corpora.heliohost.org/aboutcorpus.html for details on the available corpora.

The corpus contains randomly selected sentences in the language of the corpus and is available in sizes of 100,000 sentences, 300,000 sentences, 1 million sentences, etc.

Using the most frequent word combinations in the corpus, the app takes the user-submitted text and calculates the most likely next word.

Modeling and approach

All text mining and natural language processing was done using a variety of well-known R packages.

Step 1: Select a sample of the dataset for further work, because the working data needs to fit within the available memory.
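As an illustration of Step 1, here is a minimal sketch of drawing a reproducible random sample of lines. The app itself is written in R; Python is used here only for illustration, and the function name `sample_lines`, the 10% fraction and the seed are assumptions rather than values taken from the app.

```python
import random

def sample_lines(lines, fraction, seed=42):
    """Keep each line independently with probability `fraction`.

    `fraction` and `seed` are illustrative values, not the app's settings.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]

# Stand-in for the real corpus: 1,000 dummy sentences.
corpus = [f"sentence number {i}" for i in range(1000)]
subset = sample_lines(corpus, fraction=0.1)
print(len(subset), "of", len(corpus), "lines kept")
```

Because each line is kept or dropped independently, the file can be read one line at a time, so the full corpus never has to sit in memory at once.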

Step 2: Clean the data by converting it to lowercase and removing punctuation, links, extra white space, numbers and all kinds of special characters.
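The cleaning in Step 2 could look roughly like the sketch below. It is written in Python for illustration only (the app uses R packages), and the regular expressions are assumptions about what counts as a link or a special character. In particular, it assumes an English corpus and drops every character outside a-z.

```python
import re

# Illustrative patterns; the app's actual cleaning rules may differ.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")   # links
NON_LETTER_RE = re.compile(r"[^a-z\s]")          # punctuation, digits, special characters
SPACE_RE = re.compile(r"\s+")                    # runs of white space

def clean_text(text):
    """Lowercase the text, strip links and non-letters, collapse white space."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()

print(clean_text("Visit http://example.com NOW!!  It costs 100 dollars."))
# prints: visit now it costs dollars
```

Links are removed before punctuation so that a URL is dropped as a whole instead of being broken into stray word fragments.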

Step 3: The clean dataset is screened and processed to remove extraneous characters and is then categorized into the most frequent word combinations of 2, 3 and 4 words (2-grams, 3-grams, 4-grams).
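To make Step 3 concrete, the sketch below counts 2-grams, 3-grams and 4-grams in a few cleaned sentences. It is a Python illustration, not the app's R code; the helper `ngram_counts` and the three sample sentences are made up for the example.

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count every run of n consecutive words across the sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        # Sentences shorter than n words contribute nothing.
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# Made-up, already cleaned sentences.
sentences = ["i love new york", "i love new ideas", "we love new york"]
bigrams = ngram_counts(sentences, 2)
trigrams = ngram_counts(sentences, 3)
fourgrams = ngram_counts(sentences, 4)

print(bigrams.most_common(1))
# prints: [(('love', 'new'), 3)]
```

Counting within each sentence, rather than across sentence boundaries, avoids creating n-grams that join the end of one sentence to the start of the next.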

Step 4: The resulting data.table is used to predict the next word from the text input and the frequencies in the underlying n-gram table.
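One way to picture Step 4 is a lookup that tries the longest context first and falls back to shorter ones. The sketch below is a Python illustration under that assumption; the longest-first order, the dictionary-based table (standing in for the R data.table) and the names `build_table` and `predict_next` are not taken from the app.

```python
from collections import Counter, defaultdict

def build_table(sentences, max_n=4):
    """Map each context of 1 to max_n-1 words to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                table[context][words[i + n - 1]] += 1
    return table

def predict_next(table, text, max_n=4):
    """Return the most frequent follower of the longest matching context, or None."""
    words = text.lower().split()
    # Try the last 3 words (4-gram), then the last 2 (3-gram), then the last 1 (2-gram).
    for k in range(min(max_n - 1, len(words)), 0, -1):
        context = tuple(words[-k:])
        if context in table:
            return table[context].most_common(1)[0][0]
    return None

# Made-up, already cleaned sentences.
sentences = ["i love new york", "i love new ideas",
             "we love new york", "new york is big"]
table = build_table(sentences)

print(predict_next(table, "They really love new"))
# prints: york
```

In this example no 4-gram starts with "really love new", so the lookup falls back to the context "love new", which is followed by "york" twice and "ideas" once.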

Acknowledgement and associated files: