This is a progress report concerning the task of producing a word prediction app.
English text files taken from blogs, news articles and tweets are briefly examined within this report.
The current findings are:
The data were sourced from the Capstone Dataset (dated 16 November 2014). The files in this data set were provided by SwiftKey and researchers at the Johns Hopkins Department of Biostatistics as part of the Data Science Specialisation on Coursera.
The data themselves come from HC Corpora, a freely available corpus (body of text) intended for research purposes.
Three text files are currently under study; these are:
| Name | Description |
|---|---|
| “en_US.blogs.txt” | A text file consisting of blog entries written in US English. |
| “en_US.news.txt” | A text file consisting of news articles written in US English. |
| “en_US.twitter.txt” | A text file consisting of “tweets” from the online social networking service Twitter. |
The raw data has the following attributes:
| Name | Lines | Words | Size (bytes) |
|---|---|---|---|
| “en_US.blogs.txt” | 899288 | 37334131 | 210160014 |
| “en_US.news.txt” | 1010242 | 34372530 | 205811889 |
| “en_US.twitter.txt” | 2360148 | 30373583 | 167105338 |
| TOTALS | 4269678 | 102080244 | 583077241 |
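These figures can be reproduced along the following lines (a minimal sketch, assuming the three files sit in the working directory and the stringi package is available; exact word counts depend on the word-boundary rules used):

```r
# Sketch: line count, word count and size (bytes) for each raw file.
library(stringi)

files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")

summarise_file <- function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    Name  = f,
    Lines = length(lines),
    Words = sum(stri_count_words(lines)),
    Size  = file.size(f)            # size in bytes
  )
}

raw_summary <- do.call(rbind, lapply(files, summarise_file))
raw_summary
```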
A two-line sample of the Blog data:
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
## [2] "We love you Mr. Brown."
the news data
## [1] "He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
and the Twitter data
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
The data set requires a significant amount of processing in order to analyse its structure.
An exploration of 70% of each data set has been conducted, with the remaining 30% left untouched for later testing of the predictive model.
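A minimal sketch of this 70/30 split (the seed value and the derived file names are illustrative choices):

```r
# Sketch: reproducible line-level 70/30 split, holding out 30% for testing.
set.seed(1234)

split_file <- function(f, train_frac = 0.70) {
  lines    <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  in_train <- rbinom(length(lines), size = 1, prob = train_frac) == 1
  writeLines(lines[in_train],  sub("\\.txt$", ".train.txt", f))
  writeLines(lines[!in_train], sub("\\.txt$", ".test.txt",  f))
}

invisible(lapply(c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
                 split_file))
```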
After cleaning the data of artefacts and characters that are not relevant to the analysis, and tokenizing it (a process of splitting the lines into units, in this case words), the 70% samples showed the following characteristics (a sketch of this processing is given after the table):
| File | Sample Size | Number of words | Number of Unique Words |
|---|---|---|---|
| “en_US.blogs.txt” | 70% | 2.5768 × 10⁷ | 345208 |
| “en_US.news.txt” | 70% | 2.3411 × 10⁷ | 270262 |
| “en_US.twitter.txt” | 70% | 2.0525 × 10⁷ | 396946 |
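A minimal sketch of the cleaning and tokenizing step, assuming the 70% training files produced by the split above (the cleaning rules shown are a simplification, and the function and file names are illustrative):

```r
# Sketch: basic cleaning and tokenisation of a training sample.
clean_lines <- function(lines) {
  lines <- tolower(lines)
  lines <- gsub("[^a-z' ]", " ", lines)   # drop digits, punctuation, symbols
  lines <- gsub("\\s+", " ", lines)       # collapse repeated whitespace
  trimws(lines)
}

tokenize <- function(lines) {
  tokens <- unlist(strsplit(clean_lines(lines), " ", fixed = TRUE))
  tokens[tokens != ""]
}

# Example on the blog training sample (file name illustrative):
blog_tokens <- tokenize(readLines("en_US.blogs.train.txt",
                                  encoding = "UTF-8", skipNul = TRUE))
length(blog_tokens)          # number of words
length(unique(blog_tokens))  # number of unique words
```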
The vast majority of the identified words in each file occur relatively infrequently; in other words, most of a document is made up of a small proportion of its unique words.
The plot above shows that over 90% of the word occurrences (upper horizontal red line) identified by the algorithm are covered by less than 10% (vertical blue line) of the most frequently occurring unique words.
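This coverage figure can be computed directly from the token frequencies; a minimal sketch, assuming the blog_tokens vector from the earlier tokenizing sketch:

```r
# Sketch: proportion of unique words needed to cover 90% of all occurrences.
word_coverage <- function(tokens, target = 0.90) {
  freq   <- sort(table(tokens), decreasing = TRUE)  # most frequent first
  cum    <- cumsum(freq) / sum(freq)                # cumulative coverage
  n_need <- which(cum >= target)[1]
  n_need / length(freq)       # share of unique words required
}

word_coverage(blog_tokens)    # well under 0.10 for these samples
```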
The top 100 words by frequency for the Blog, News and Twitter samples are shown in the following clouds (a sketch of how such a cloud can be generated follows the word lists below). Note: the relative size of each word indicates how often it occurs in the document relative to the other words.
The most common words in the Blog sample
## [1] "the" "and" "to" "a" "of" "i" "in" "that" "is" "it"
## [11] "for" "you" "with" "was" "on" "my" "this" "as" "have" "be"
the news sample
## [1] "the" "to" "and" "a" "of" "in" "for" "that" "is" "on"
## [11] "with" "said" "was" "he" "it" "at" "as" "his" "i" "be"
and the Twitter sample
## [1] "the" "to" "i" "a" "you" "and" "for" "in" "of" "is"
## [11] "it" "my" "on" "that" "me" "be" "at" "with" "your" "have"
The top spots in the word-frequency rankings belong to common stop words such as “the”, “is” and “to”. Such words may well need to be removed in order to enrich the prediction vocabulary of the final product.
An analysis of the 2-, 3- and 4-grams (2-, 3- and 4-word chunks) present in the data sets is currently under way.
The initial prediction model takes the last 2, 3 and 4 words from a sentence or phrase and presents the most frequently occurring “next” word from the sample data sets.
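A minimal sketch of this lookup, assuming the n-gram frequency tables are stored as data frames with prefix, next_word and count columns sorted by count (the table structure, list layout and fallback word are illustrative assumptions, not the final implementation):

```r
# Sketch: look up the most frequent "next" word for the longest matching
# context, falling back from 4-grams to 3-grams to 2-grams.
predict_next <- function(phrase, ngram_tables) {
  words <- tokenize(phrase)                  # tokenize() from earlier sketch
  for (n in c(4, 3, 2)) {                    # try the longest context first
    if (length(words) >= n - 1) {
      prefix <- paste(tail(words, n - 1), collapse = " ")
      hits   <- ngram_tables[[as.character(n)]]
      hits   <- hits[hits$prefix == prefix, ]
      if (nrow(hits) > 0) return(hits$next_word[1])
    }
  }
  "the"                                      # fall back to a common word
}
```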
These frequency tables currently need to be reduced in size to make them feasible for an online Shiny app, where speed of prediction and the size of the app are significant considerations.
To reduce the frequency tables, infrequent terms will be removed, and stop words such as “the”, “to” and “a” will be removed from the prediction if those words are already present in the sentence.
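A minimal sketch of how this reduction and stop-word filtering might look (the cut-off value, stop-word list and column names are illustrative assumptions):

```r
# Sketch: drop rare n-grams from a table, and drop a candidate prediction
# when it is a stop word already present in the input phrase.
stop_words <- c("the", "to", "a")

prune_table <- function(tbl, min_count = 5) {
  tbl[tbl$count >= min_count, ]
}

filter_prediction <- function(candidates, phrase) {
  seen <- tokenize(phrase)                   # tokenize() from earlier sketch
  drop <- candidates %in% stop_words & candidates %in% seen
  candidates[!drop]
}
```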
Profanity filtering of predictions will be included in the Shiny app. A simple table of “illegal” prediction words will be used to filter the final predictions sent to the user: the app will process profanity when predicting the next word, but will not present profanity as a prediction.
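A minimal sketch of that final filtering step (the profanity list file name is a placeholder; any plain-text list of banned words, one per line, would work):

```r
# Sketch: remove profane candidates before returning predictions to the user.
profanity <- readLines("profanity_list.txt", encoding = "UTF-8")

safe_predictions <- function(candidates) {
  candidates[!(tolower(candidates) %in% profanity)]
}
```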