08-06-2020

Introduction

This Shiny app predicts the next word from the text the user has typed.

Here is the link to the app: NLP Shiny App

Data Cleaning and Reformatting

To clean and reformat the data, I followed these steps:

  1. Load the files in the english folder: blogs.txt, news.txt and twitter.txt

  2. Randomly sample a subset of each file to save memory (2% of twitter, 10% of blogs and 50% of news).

  3. Use the tm_map function from the R package tm to clean the corpus: convert to lower case, remove punctuation, remove numbers, remove special symbols and so on.

  4. Use the R package RWeka to convert each corpus to a data frame of trigrams.

  5. Rank the trigrams in each corpus in descending order of frequency.

  6. Remove all trigrams with a frequency of 1 to reduce the size.

  7. Use the R package textstem to lemmatize each trigram in each corpus.

  8. Save the three transformed corpora as data frames in an RData file for the Shiny app to load.
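
The sampling, cleaning, trigram counting and pruning steps above can be sketched in a few lines. This is a minimal illustration in Python using only the standard library, not the R code behind the app: the function names (sample_lines, clean, trigram_table), the fixed seed and the exact cleaning rules are assumptions, and the lemmatization and saving steps are left out.

```python
import random
import re
from collections import Counter


def sample_lines(lines, fraction, seed=0):
    """Keep each line with probability `fraction` (e.g. 0.02 for twitter)."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    return [line for line in lines if rng.random() < fraction]


def clean(text):
    """Lower-case the text and delete everything except letters and spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)      # drops punctuation, digits, symbols
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace


def trigram_table(lines, min_count=2):
    """Count trigrams, drop those seen fewer than `min_count` times,
    and return (trigram, frequency) pairs sorted by descending frequency."""
    counts = Counter()
    for line in lines:
        words = clean(line).split()
        counts.update(zip(words, words[1:], words[2:]))
    kept = [(" ".join(tri), n) for tri, n in counts.items() if n >= min_count]
    return sorted(kept, key=lambda item: (-item[1], item[0]))
```

With min_count=2 the table keeps only trigrams that occur at least twice, which corresponds to removing the trigrams with a frequency of 1.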

Algorithm

Take the last two words of the user input and use a regular expression to find the trigrams that start with those two words in the selected corpora. If there are multiple matches, pick the trigram with the highest frequency and return its last word.

If no trigram matches, search the corpus again for trigrams that start with the last word of the input alone, and pick the one with the highest frequency among the results. Return the second word of that trigram.

If there is still no result, return the word “the”.
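
The lookup described above, with its two fallbacks, might look roughly like the following. It is a Python sketch under stated assumptions rather than the app's actual R code: the TRIGRAMS table and its counts are made up for illustration, the function names are hypothetical, the way the input is cleaned is assumed, and a single table stands in for whichever corpora the user selected.

```python
import re

# Hypothetical trigram table: (trigram, frequency) pairs.
TRIGRAMS = [
    ("thank you for", 12),
    ("you for the", 9),
    ("thank you so", 7),
    ("for the first", 5),
]


def best_match(prefix, trigrams):
    """Return the most frequent trigram starting with the words in `prefix`,
    or None if no trigram matches."""
    pattern = re.compile("^" + re.escape(prefix) + r"\b")
    matches = [(tri, n) for tri, n in trigrams if pattern.search(tri)]
    if not matches:
        return None
    return max(matches, key=lambda item: item[1])[0]


def predict_next_word(text, trigrams=TRIGRAMS, default="the"):
    # Assumed input cleaning: lower-case and keep letters and spaces only.
    words = re.sub(r"[^a-z\s]", "", text.lower()).split()
    if len(words) >= 2:
        hit = best_match(" ".join(words[-2:]), trigrams)
        if hit:
            return hit.split()[2]  # last word of the trigram
    if words:
        hit = best_match(words[-1], trigrams)
        if hit:
            return hit.split()[1]  # second word of the trigram
    return default                 # nothing matched
```

The word-boundary anchor in best_match keeps a prefix such as "you" from matching a trigram that starts with "your".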

About the App

The app has a side panel and a main panel.

In the side panel on the left there is a multi-selection input field where you can indicate the context of your text. You can select all three corpora (blogs, news and twitter) or only some of them.

In the text input field you can write your sentence. You don’t have to worry about capitalization, since the algorithm is case-insensitive.

The main panel displays the predicted word.

Have a lot of fun with my app!!!