Introduction

This milestone report is part of the Capstone Project in the Data Science Specialization offered by Johns Hopkins University on Coursera. Its main objective is to develop an understanding of the statistical properties of the dataset that can be applied to Natural Language Processing (NLP) in order to build a predictive text application. In this application, as the user types text, a predictive model will recommend the next possible word(s) to append to the input stream. The final product will be deployed as a Shiny application, which will let users type input text and see suggestions for the next word in a web-based environment.

The text data can be downloaded here and is provided in four different languages. We will be using the English corpora. The model will be trained on a document corpus compiled from the following three sources of text data:

  • Blogs
  • Twitter
  • News

Data Processing

Data Summary

Each source contains more than 30 million words, which is a substantial asset for building a prediction algorithm. The variables AWP and MWP are the mean and median number of words per line, respectively.

       Items          FileName  Size_MB    Words   Lines      AWP MWP
  1:   Blogs   en_US.blogs.txt 200.4242 37334131  899288 41.75107  28
  2: Twitter en_US.twitter.txt 159.3641 30373543 2360148 12.75063  12
  3:    News    en_US.news.txt 196.2775 34372530 1010242 34.40997  32
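
The columns in the summary above can be computed per file along the following lines. This is a minimal sketch in Python, not the R code used for the report; the function name `summarize_lines` is hypothetical, and splitting on whitespace is an assumption, so counts may differ slightly from the table depending on the tokenizer.

```python
import statistics

def summarize_lines(lines):
    """Return line count, word count, and mean/median words per line."""
    # Assumption: a "word" is any whitespace-separated token.
    words_per_line = [len(line.split()) for line in lines]
    return {
        "Lines": len(words_per_line),
        "Words": sum(words_per_line),
        "AWP": statistics.mean(words_per_line),    # mean words per line
        "MWP": statistics.median(words_per_line),  # median words per line
    }

# Tiny made-up sample, not taken from the corpus.
sample = ["the quick brown fox", "hello world", "a b c d e f"]
summary = summarize_lines(sample)
print(summary)
```

For a real file, the same function could be applied to the lines read from disk, for example `summarize_lines(open(path, encoding="utf-8"))`, where `path` points to one of the three text files.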

Data Cleaning

Build document-feature matrices

The predictive model for the Shiny application will handle unigrams, bigrams, and trigrams. The quanteda package is used to write functions that construct matrices of unigrams, bigrams, and trigrams from the tokenized data.
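
To illustrate what these matrices capture, the sketch below counts n-grams with the Python standard library. It is not the quanteda code used in the report: the helper names `tokenize` and `ngram_counts` and the simple letters-and-apostrophes tokenizer are assumptions, and the result is a corpus-wide count table (the column totals of a document-feature matrix) instead of a full per-document matrix.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())

def ngram_counts(documents, n):
    """Count n-grams across documents; n-grams never span two documents."""
    counts = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Made-up documents for illustration only.
docs = ["I love data science", "I love text data"]
unigrams = ngram_counts(docs, 1)
bigrams = ngram_counts(docs, 2)
trigrams = ngram_counts(docs, 3)
print(bigrams["i love"])
```

Counting within each document separately avoids creating artificial n-grams from the last word of one line and the first word of the next.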

Exploratory Data Analysis

Unigram Analysis

Plotting the frequencies of the top ten unigrams.

Bigram Analysis

Plotting the frequencies of the top ten bigrams.

Trigram Analysis

Plotting the frequencies of the top ten trigrams.
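
The rankings behind such plots can be obtained by sorting n-gram counts in descending order. The following is a minimal Python sketch under stated assumptions: `top_ngrams` is a hypothetical helper, the token list is made up, and the plotting step itself is omitted.

```python
from collections import Counter

def top_ngrams(tokens, n, k=10):
    """Return the k most frequent n-grams as (n-gram, count) pairs."""
    # zip over n shifted copies of the token list yields consecutive n-tuples.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams).most_common(k)

# Made-up token list for illustration only.
tokens = "the cat sat on the mat and the cat ran".split()
for n, label in [(1, "unigram"), (2, "bigram"), (3, "trigram")]:
    print(label, top_ngrams(tokens, n, k=3))
```

The returned pairs map directly onto a bar chart, with the n-gram on one axis and its count on the other.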

Plans for prediction algorithm and Shiny App

The final part of this capstone project is to build a predictive algorithm that will be deployed as a Shiny app, which provides the user interface in the web browser. The Shiny app will take multiple words of input in a text box and output a prediction of the next word.

The predictive algorithm will be created using an n-gram model with a frequency lookup similar to that performed in the exploratory data analysis section of this report. Based on the exploratory analysis, here are the plans:

  • Increase the order of the n-grams, possibly up to 5 or 6.
  • Create a model that first looks up the unigram from the entered text.
  • Once the text is entered, find the most common matching bigram, then move to higher-order n-grams.
  • Increase or decrease the percentage of training data to balance efficiency and accuracy.
  • Efficiency and accuracy will be the deciding factors for the final strategy.
  • The design for the Shiny App is to be decided.
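
One possible shape for the frequency lookup is sketched below in Python. It is an assumption, not the final strategy, which is still to be decided: it uses a simple back-off arrangement that tries the longest available context first and falls back to shorter contexts, ending with the most frequent single word. The names `build_model` and `predict_next` and the training sentence are hypothetical.

```python
from collections import Counter, defaultdict

def build_model(tokens, max_n=3):
    """Map each context of up to max_n - 1 words to counts of the word that follows it."""
    model = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, nxt = tokens[i:i + n]
            model[tuple(context)][nxt] += 1
    return model

def predict_next(model, text, max_n=3, k=1):
    """Try the longest context first, then back off to shorter ones."""
    words = text.lower().split()
    for size in range(min(max_n - 1, len(words)), -1, -1):
        context = tuple(words[len(words) - size:])
        if context in model:
            return [w for w, _ in model[context].most_common(k)]
    return []

# Made-up training text for illustration only.
tokens = "the cat sat on the mat and the cat ran to the cat".split()
model = build_model(tokens)
print(predict_next(model, "on the"))   # uses the two-word context
print(predict_next(model, "dog the"))  # backs off to the one-word context
print(predict_next(model, "purple"))   # backs off to the most frequent word
```

Raising `max_n` corresponds to the plan of using higher-order n-grams; the trade-off is a larger lookup table, which is where the efficiency and accuracy considerations come in.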