Word Suggestions

Darius Kharazi
07/31/18

Description of NLP

Natural Language Processing (NLP) is a field of computer science concerned with the interaction between computers and human languages.

One of the oldest NLP problems related to word prediction is Claude Shannon's problem of assigning a probability to a word. Shannon used n-grams, defined as contiguous sequences of n items from a given sequence of text or speech, to compute the probabilities of English sentences.
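To make the idea concrete, here is a minimal Python sketch of a bigram (2-gram) estimate of a sentence's probability. It is illustrative only and is not the code used in this project; the function names and the toy corpus are assumptions, and the probability of the first word is left out for brevity.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bigram_sentence_probability(sentence, corpus_tokens):
    """Estimate P(sentence) as a product of P(word | previous word).

    Each conditional probability is a relative frequency:
    count(previous word, word) / count(previous word as a context).
    """
    bigram_counts = Counter(ngrams(corpus_tokens, 2))
    context_counts = Counter(first for first, _ in ngrams(corpus_tokens, 2))

    probability = 1.0
    for previous, word in ngrams(sentence.split(), 2):
        if context_counts[previous] == 0:
            return 0.0  # context never seen in the corpus
        probability *= bigram_counts[(previous, word)] / context_counts[previous]
    return probability


# Toy corpus, made up for this example.
corpus = "the cat sat on the mat the cat ran".split()
print(bigram_sentence_probability("the cat sat", corpus))
```

Without smoothing, any sentence containing an unseen word pair gets a probability of zero, which is one reason practical models combine several n-gram orders.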

Preprocessing

  • Ensure that characters are only ASCII-formatted
  • Strip special characters from text
  • Strip numbers from text
  • Strip extra whitespace
  • Strip URLs from text
  • Convert capital letters to lower case

Essentially, we are only analyzing basic, lower case words from the given tweets, blog posts, and news articles.
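The cleaning steps listed above could be sketched in Python as follows. This is a hedged illustration rather than the project's own preprocessing code: the `preprocess` name, the URL pattern, and the order of the steps are assumptions.

```python
import re

# Assumed URL pattern: anything starting with http://, https://, or www.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def preprocess(text):
    """Reduce raw text to lower case ASCII words separated by single spaces."""
    text = URL_PATTERN.sub(" ", text)                      # strip URLs before punctuation is touched
    text = text.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII characters
    text = text.lower()                                    # remove capital letters
    text = re.sub(r"[0-9]+", " ", text)                    # strip numbers
    text = re.sub(r"[^a-z\s]", " ", text)                  # strip special characters
    return re.sub(r"\s+", " ", text).strip()               # collapse extra whitespace


print(preprocess("Visit http://example.com NOW, 3 times!"))  # visit now times
```

URLs are removed first because stripping punctuation beforehand would break them into ordinary-looking words. Note that this sketch replaces apostrophes with spaces, so a contraction such as "it's" becomes two tokens.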

Prediction Algorithm

  • Each n-gram table includes phrases and words pulled from the data preprocessing step.

  • There are 2-gram, 3-gram, 4-gram, and 5-gram tables included in the repository and Shiny app, and they are used to predict the next word.

  • Each n-gram table includes columns with the input phrase, predicted output word, and the frequency, or probability, of the predicted word in the n-gram.
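As a rough illustration of how such tables can drive a prediction, the Python sketch below builds 2-gram to 5-gram tables from a toy corpus and looks up the next word. The function names and the corpus are hypothetical, and trying the longest matching phrase first before falling back to shorter ones is an assumed strategy, not necessarily the one the Shiny app uses.

```python
from collections import Counter, defaultdict


def build_ngram_table(tokens, n):
    """Map each (n-1)-word input phrase to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        phrase = " ".join(tokens[i:i + n - 1])
        next_word = tokens[i + n - 1]
        table[phrase][next_word] += 1
    return table


def predict_next_word(phrase, tables):
    """Return (word, relative frequency) from the longest matching context, or None."""
    words = phrase.lower().split()
    for n in sorted(tables, reverse=True):  # 5-gram table first, 2-gram table last
        context_length = n - 1
        if len(words) < context_length:
            continue
        context = " ".join(words[-context_length:])
        candidates = tables[n].get(context)
        if candidates:
            word, count = candidates.most_common(1)[0]
            return word, count / sum(candidates.values())
    return None  # no table contains the phrase


# Toy corpus, made up for this example.
corpus = "thanks for the follow thanks for the help thanks for the follow".split()
tables = {n: build_ngram_table(corpus, n) for n in range(2, 6)}

print(predict_next_word("thanks for the", tables))
```

Here the value returned with the word is its count divided by the total count of all words seen after the same input phrase, which matches the "frequency, or probability" column described above.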

Running the Shiny App

  • Simply run the Shiny app server from the Shiny directory or URL.

  • In order to run the Shiny app locally, you will need to build your own prediction model.

  • If you would like a step-by-step walkthrough of preprocessing the data, building the prediction model, and using it on an example phrase, go to the R Markdown file and configure the “data” directories, as noted.

Future Work

  • Implement more accurate prediction algorithms, such as linear regression and logistic regression, that predict the probability of observing the predicted word.

  • Test the code with languages other than English.