Natural Language Processing

devank
10 Apr 2016

Johns Hopkins University - Coursera Data Science Capstone Project in cooperation with SwiftKey.

Procedures

The following tasks were performed:

  • Task 1: Getting and Cleaning the Data
  • Task 2: Exploratory Data Analysis
  • Task 3: Modeling
  • Task 4: Prediction Model
  • Task 5: Creative Exploration
  • Task 6: Data Product
  • Task 7: Slide Deck

The data comes from a corpus called HC Corpora. Existing R packages were used for text mining and natural language processing.

Underlying Algorithm

The model reads the last few words of a sentence and uses statistics from a large collection of English sentences to predict the most probable next word. The sample is drawn from blogs, news and Twitter.
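
The lookup described above can be sketched as follows. This is a minimal illustration in Python, not the app's actual R code. The toy frequency tables, the name `predict_next`, and the back-off from longer to shorter contexts are all assumptions made for the example; the slides do not say how the n-gram tables are combined.

```python
from collections import Counter

# Hypothetical toy frequency tables: context length -> {context: next-word counts}.
# In a real model these would be built from the corpus sample.
TABLES = {
    3: {("thanks", "for", "the"): Counter({"follow": 5, "help": 3})},
    2: {("for", "the"): Counter({"first": 7, "follow": 5})},
    1: {("the",): Counter({"same": 9, "first": 7})},
}

def predict_next(text, tables=TABLES, max_context=3):
    """Return the most frequent next word, or None if no context matches."""
    words = text.lower().split()
    # Assumed strategy: try the longest context first, then back off.
    for n in range(min(max_context, len(words)), 0, -1):
        context = tuple(words[-n:])
        candidates = tables.get(n, {}).get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return None  # the app would show N/A in this case
```

With these toy tables, a three-word match is preferred, and shorter contexts are only consulted when the longer one is unseen.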

First the data sample was cleaned. Then the cleaned sample was tokenized into n-grams. The following n-grams were created to build the frequency dictionaries:
  • unigrams
  • bigrams
  • trigrams
  • quadgrams
These frequency data.frames are used to predict the next word for the text entered.
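
The cleaning and tokenizing steps can be sketched like this. It is an illustrative Python version only; the project itself used existing R packages. The cleaning rules (lowercasing, keeping only letters and apostrophes) and the names `clean` and `ngram_counts` are assumptions, and the three sample sentences are made up.

```python
import re
from collections import Counter

def clean(text):
    """Lowercase the text and keep only letters and apostrophes (assumed rules)."""
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)  # digits and punctuation become spaces
    return text.split()

def ngram_counts(sentences, n):
    """Count every run of n consecutive tokens within each sentence."""
    counts = Counter()
    for sentence in sentences:
        tokens = clean(sentence)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Made-up sample standing in for the blogs, news and Twitter text.
sample = ["I love New York.", "I love data!", "New York is big"]
unigrams = ngram_counts(sample, 1)
bigrams = ngram_counts(sample, 2)
trigrams = ngram_counts(sample, 3)
quadgrams = ngram_counts(sample, 4)
```

Each table maps a tuple of words to how often it occurs, which plays the role of the frequency dictionaries listed above.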

Instructions

Enter a word in the input field provided.

Application Screenshot

  • The predicted word appears below.
  • The number of words to predict is shown on the right.
  • N/A appears if the app cannot make a prediction.

An embedded live version is loaded here.

Not working? Please visit: https://devank.shinyapps.io/NPLshinny.

Next Steps?

  • Currently only the last 3 words are considered. This could be expanded to use more of the sentence.
  • Currently numbers are not considered.
  • The output is a continuous sentence that never finishes. The model needs to look at the whole sentence.