CAPSTONE PROJECT

ADHAM ALEID

Introduction

The Natural Language Processing (NLP) is an active area of research in Data Science
Purpose: The goal of this project was to build a predictive model for text.
R programming language was used to develop this prediction model, and shiny app was used for front-end interface.

FRAMEWORK

A large English datasets from blogs, news and twitter was downloaded. This datasets were used to build next-word predictive model.
To understand variation in the frequencies of words and word pairs.
n-gram approach, which is contiguous sequence of n words in text, was used in this model.

FRAMEWORK

Three order of n-gram was used in this model

Unigram: next word is predicted based on the last word
Bigram: next word is predicted based on the last 2 word
Trigram: next word is predicted based on the last 3 word

PREDICTION APP

screenshot

Left, the user enters the text in the textbox, and choose the n-gram model. Right, predicted words appear in the table.

CONCLUSION

This prediction model exhibits substantial accuracy, especially for words used commonly.
The app is easy to used, and results returned usually in less than 1 second, which is good for newly developed model.
This model has a large potential for improvement in both accuracy and speed.