CAPSTONE PROJECT

ADHAM ALEID

Introduction

  • The Natural Language Processing (NLP) is an active area of research in Data Science
  • Purpose: The goal of this project was to build a predictive model for text.
  • R programming language was used to develop this prediction model, and shiny app was used for front-end interface.

FRAMEWORK

  • A large English datasets from blogs, news and twitter was downloaded. This datasets were used to build next-word predictive model.
  • To understand variation in the frequencies of words and word pairs.
  • n-gram approach, which is contiguous sequence of n words in text, was used in this model.

FRAMEWORK

  • Three order of n-gram was used in this model
  1. Unigram: next word is predicted based on the last word
  2. Bigram: next word is predicted based on the last 2 word
  3. Trigram: next word is predicted based on the last 3 word

PREDICTION APP

screenshot

  • Left, the user enters the text in the textbox, and choose the n-gram model. Right, predicted words appear in the table.

CONCLUSION

  • This prediction model exhibits substantial accuracy, especially for words used commonly.
  • The app is easy to used, and results returned usually in acceptable duration, which is good for newly developed model.
  • This model has a large potential for improvement in both accuracy and speed.