The App

The app runs on a Shiny server and has a very simple user interface that anyone can use: type some text and/or click the button below it to select the next word, and then the next. It is fast and reliable, though accuracy is not at its best after 10 or 15 words. The next-word prediction Shiny app is linked below:

https://my-data-analytics.shinyapps.io/Shiny_NLP_App/

This is a presentation on the capstone project for the Coursera Data Science Specialization from Johns Hopkins University.

Data and Analysis

The data for the app was taken from three files:
- The US blogs file has 899,288 lines and approximately 4,799,000 words.
- The US news file has 1,010,242 lines and approximately 886,300 words.
- The US Twitter file has 2,360,148 lines and approximately 4,424,800 words.

Data Acquisition and Pre-Processing:
Prepared the environment by loading the needed packages and enabling multicore processing
Pulled in the data
Created a function to pull a random % of the records from each of the three US files and write them to a separate file
Read in the three sample files and combined them into one
Cleaned up some contractions in the files
Created a corpus and pre-processed the text data
Created a TermDocumentMatrix
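
The sampling and clean-up steps above can be sketched in base R. This is only a minimal illustration, not the project's actual code: the helper names `sample_lines` and `clean_text`, the 50% sample fraction, the seed, and the short list of contractions are all assumptions, and the three toy vectors stand in for the lines that would normally be read from the blogs, news and Twitter files with `readLines()`.

```r
# Hypothetical helper: keep a random fraction of the lines in a character vector.
sample_lines <- function(lines, fraction, seed = 1234) {
  set.seed(seed)                                  # make the sample reproducible
  n_keep <- ceiling(length(lines) * fraction)
  lines[sort(sample.int(length(lines), n_keep))]  # keep original line order
}

# Hypothetical helper: expand a few contractions and normalise the text.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("can't", "cannot", x, fixed = TRUE)
  x <- gsub("won't", "will not", x, fixed = TRUE)
  x <- gsub("n't", " not", x, fixed = TRUE)
  x <- gsub("[^a-z' ]", " ", x)                   # drop digits and punctuation
  x <- gsub("\\s+", " ", x)                       # collapse repeated whitespace
  trimws(x)
}

# Toy stand-ins for the three source files (made-up lines).
blogs   <- c("I can't wait for the weekend!", "New York City is busy.")
news    <- c("The mayor won't comment.", "Markets closed higher today.")
twitter <- c("Happy New Year!!!", "Isn't it great?", "see you at 5", "so tired")

# Sample each source, combine into one vector, then clean.
combined <- c(sample_lines(blogs, 0.5),
              sample_lines(news, 0.5),
              sample_lines(twitter, 0.5))
cleaned <- clean_text(combined)
print(cleaned)
```

In the real pipeline the cleaned vector would then be turned into a corpus and a TermDocumentMatrix with the tm package; that step is left out here to keep the sketch dependency-free.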

The milestone report and the exploratory analysis can be found on RPubs at the link below:

http://rpubs.com/Damjan_Stefanovski/323998

Building the Model

The R packages used to build the application were essential on my slow machine, especially for handling the large data sets and creating the n-grams. Here are some of the packages used to build the model and run the exploratory analysis:

R.utils, tm, SnowballC, NLP, ggplot2, parallel and doParallel (both loaded with quietly=T), data.table, wordcloud

Some observations stood out in the preliminary findings. The top 30 single terms are mainly common one- and two-syllable words. The largest part of the corpus comes from Twitter, which is written in brief and simple language. "New York" and "New York City" both feature among the bi-grams and tri-grams, which shows that it is quite a popular city. "Happy New Year" and "Happy Mothers Day" also have a decent place in popularity and frequency of use.

Here is a wordcloud of the unigrams:

Word Predictions

The more words were used, the better the predictions turned out to be, but many adjustments were made time and again while building the model, some because of RAM usage and some because of prediction precision. The algorithm that works best on my PC was hard to find, but it was worth it. The corpus was made from the Twitter, news and blogs data, then transformed and turned into a term-document matrix for building the bi-grams.

Here is a wordcloud from the bi-grams:

N-Grams Predictive Model

The prediction model is built from a series of n-grams extracted from the corpus. The algorithm takes the last two, three or five words of the typed text and looks for matches in the frequency tables previously built from these n-grams.
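
The lookup described above could work roughly like the base R sketch below. It assumes a simple back-off: try the longest context first and fall back to a shorter one when there is no match. The source does not say how matches are ranked or combined, so the back-off rule, the function name `predict_next`, the `max_context` parameter, and the toy frequency table with its made-up counts are all assumptions for illustration.

```r
# Toy frequency table (made-up counts): context words, the word that
# followed them, and how often that happened.
freq <- data.frame(
  context  = c("happy new", "new york", "new", "new", "york"),
  nextword = c("year", "city", "york", "year", "city"),
  count    = c(5, 4, 6, 5, 4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: predict the next word with a simple back-off.
# Tries the last `max_context` words first, then progressively fewer.
predict_next <- function(input, freq, max_context = 2) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  if (length(words) == 0) return(NA_character_)
  for (k in min(max_context, length(words)):1) {
    context <- paste(utils::tail(words, k), collapse = " ")
    hits <- freq[freq$context == context, ]
    if (nrow(hits) > 0) {
      return(hits$nextword[which.max(hits$count)])  # most frequent follower
    }
  }
  NA_character_                                     # no match at any length
}

print(predict_next("Happy new", freq))   # matches the two-word context
print(predict_next("I love new", freq))  # backs off to the one-word context
```

Keeping the tables as plain context/next-word/count rows makes the lookup a simple filter, which is one way to keep memory use low on a small machine.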

An n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs, depending on the application.
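
To make the definition concrete for word n-grams, here is a minimal base R sketch. The helper name `make_ngrams` and the example sentences are assumptions for illustration; the project itself built its n-grams from the corpus with the packages listed earlier.

```r
# Hypothetical helper: return every contiguous run of n words in a string.
make_ngrams <- function(text, n) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  if (length(words) < n) return(character(0))     # too short for any n-gram
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

sentence <- "happy new year to new york city"
bigrams  <- make_ngrams(sentence, 2)
trigrams <- make_ngrams(sentence, 3)
print(bigrams)
print(trigrams)

# Counting repeated n-grams gives a frequency table, most frequent first.
bigram_freq <- sort(table(make_ngrams("new york new york city", 2)),
                    decreasing = TRUE)
print(bigram_freq)
```

A sentence of 7 words yields 6 bi-grams and 5 tri-grams, since each n-gram starts one word later than the previous one.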

Here is a wordcloud of tri-grams:
