Natural Language Processing

devank
10 Apr 2016

Johns Hopkins University - Coursera Data Science Capstone Project in cooperation with SwiftKey.

Procedures

The following tasks were performed:

Task 1: Getting and Cleaning the Data
Task 2: Exploratory Data Analysis
Task 3: Modeling
Task 4: Prediction Model
Task 5: Creative Exploration
Task 6: Data Product
Task 7: Slide Deck

The data comes from a corpus called HC Corpora. Existing R packages were used for text mining and natural language processing.

Underlying Algorithm

The model reads the last few words of a sentence and uses statistics from a large collection of English sentences to predict the most probable next word. The sample is drawn from blogs, news and Twitter.

First, the data sample was cleaned. Then the cleaned sample was tokenized into n-grams. The following n-grams were created to build the frequency dictionaries:

unigrams
bigrams
trigrams
quadgrams
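
As an illustration of the tokenizing and counting step, here is a minimal sketch in Python. It is not the R code used in this project: the function names, the three-sentence sample and the cleaning rule (lowercase, letters and apostrophes only, so numbers are dropped) are assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def ngram_counts(sentences, n):
    """Count every run of n consecutive tokens within each sentence."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Hypothetical sample standing in for the blogs, news and Twitter text.
sample = [
    "Thanks for the follow",
    "Thanks for the help",
    "Thanks for the follow back",
]

# One frequency table per n-gram size: 1 = unigrams ... 4 = quadgrams.
tables = {n: ngram_counts(sample, n) for n in range(1, 5)}

# Show the most frequent trigrams with their counts.
print(tables[3].most_common(2))
```

Each table maps an n-gram (a tuple of words) to the number of times it occurs, which plays the role of a frequency dictionary.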

The resulting n-gram data.frames are used to predict the next word for a given input.
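
One common way to combine n-gram tables for prediction is a simple backoff: try the longest context first and fall back to shorter ones. The sketch below, in Python, assumes that strategy. The source does not say exactly how the tables are combined, so `predict_next`, the dictionary layout and the toy counts are all hypothetical.

```python
def predict_next(words, tables, max_context=3):
    """Return the most frequent next word for the longest matching context.

    tables maps n-gram size to {ngram tuple: count}. Returns None when
    no table contains a matching context.
    """
    context = [w.lower() for w in words][-max_context:]
    while context:
        # A context of k words is looked up in the (k + 1)-gram table.
        table = tables.get(len(context) + 1, {})
        key = tuple(context)
        candidates = {
            ngram[-1]: count
            for ngram, count in table.items()
            if ngram[:-1] == key
        }
        if candidates:
            return max(candidates, key=candidates.get)
        context = context[1:]  # back off: drop the oldest word
    return None

# Toy counts, invented for the example.
tables = {
    2: {("the", "follow"): 2, ("the", "help"): 1},
    3: {("for", "the", "follow"): 2, ("for", "the", "help"): 1},
    4: {("thanks", "for", "the", "follow"): 2,
        ("thanks", "for", "the", "help"): 1},
}

print(predict_next("thanks for the".split(), tables))  # quadgram match
print(predict_next("see the".split(), tables))         # backs off to bigrams
print(predict_next(["zebra"], tables))                 # no match at all
```

A result of `None` corresponds to the case where no prediction is possible, which an app could display as N/A.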

Instructions

Enter a word in the input field provided.

Application Screenshot

The predicted word appears below.
The number of words to predict is shown on the right.
N/A appears if the app cannot predict a word.

An embedded live version is loaded here.

Not working? Please visit: https://devank.shinyapps.io/NPLshinny.

Next Steps?

Currently only the last 3 words are considered. This could be expanded to take more of the sentence into account.
Currently numbers are not considered.
The output is a continuous sentence that never finishes. The model needs to look at the whole sentence.