This is a progress report concerning the task of producing a word prediction app.
English text files taken from blogs, news articles and tweets are briefly examined within this report.
The current findings are:
The data were sourced from the Capstone Dataset (dated 16 November 2014). The files in this data set were provided by SwiftKey and researchers at the Johns Hopkins Department of Biostatistics as part of the Data Science Specialisation on Coursera.
The data themselves come from HC Corpora, a freely available corpus (body of text) intended for research purposes.
Three text files are currently under study; these are:
| Name | Description |
|---|---|
| “en_US.blogs.txt” | A text file consisting of blog entries written in US English. |
| “en_US.news.txt” | A text file consisting of news articles written in US English. |
| “en_US.twitter.txt” | A text file consisting of “tweets” from the online social networking service Twitter. |
The raw data has the following attributes:
| Name | Lines | Words | Size (bytes) |
|---|---|---|---|
| “en_US.blogs.txt” | 899288 | 37334131 | 210160014 |
| “en_US.news.txt” | 1010242 | 34372530 | 205811889 |
| “en_US.twitter.txt” | 2360148 | 30373583 | 167105338 |
| TOTALS | 4269678 | 102080244 | 583077241 |
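These figures can be reproduced along the following lines (a minimal sketch, assuming the three files sit in the working directory and the stringi package is available; exact word counts depend on the word-boundary rules used):

```r
# Sketch: line count, word count and size (bytes) for each raw file.
library(stringi)

files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")

summarise_file <- function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    Name  = f,
    Lines = length(lines),
    Words = sum(stri_count_words(lines)),
    Size  = file.size(f)            # size in bytes
  )
}

raw_summary <- do.call(rbind, lapply(files, summarise_file))
raw_summary
```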
A two-line sample of the Blog data:
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
## [2] "We love you Mr. Brown."
the news data
## [1] "He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
and the Twitter data
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
The data set requires a significant amount of processing in order to analyse its structure.
An exploration of 70% of each data set has been conducted, with the remaining 30% left untouched for later testing of the predictive model.
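A minimal sketch of this 70/30 split (the seed value and the derived file names are illustrative choices):

```r
# Sketch: reproducible line-level 70/30 split, holding out 30% for testing.
set.seed(1234)

split_file <- function(f, train_frac = 0.70) {
  lines    <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  in_train <- rbinom(length(lines), size = 1, prob = train_frac) == 1
  writeLines(lines[in_train],  sub("\\.txt$", ".train.txt", f))
  writeLines(lines[!in_train], sub("\\.txt$", ".test.txt",  f))
}

invisible(lapply(c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
                 split_file))
```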
After cleaning the data of artefacts and characters that are not relevant to the analysis, and tokenizing it (a process of splitting the lines into units, in this case words), the 70% samples showed the following characteristics (a sketch of this processing is given after the table):
| File | Sample Size | Number of words | Number of Unique Words |
|---|---|---|---|
| “en_US.blogs.txt” | 70% | 2.5768 × 10⁷ | 345208 |
| “en_US.news.txt” | 70% | 2.3411 × 10⁷ | 270262 |
| “en_US.twitter.txt” | 70% | 2.0525 × 10⁷ | 396946 |
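A minimal sketch of the cleaning and tokenizing step, assuming the 70% training files produced by the split above (the cleaning rules shown are a simplification, and the function and file names are illustrative):

```r
# Sketch: basic cleaning and tokenisation of a training sample.
clean_lines <- function(lines) {
  lines <- tolower(lines)
  lines <- gsub("[^a-z' ]", " ", lines)   # drop digits, punctuation, symbols
  lines <- gsub("\\s+", " ", lines)       # collapse repeated whitespace
  trimws(lines)
}

tokenize <- function(lines) {
  tokens <- unlist(strsplit(clean_lines(lines), " ", fixed = TRUE))
  tokens[tokens != ""]
}

# Example on the blog training sample (file name illustrative):
blog_tokens <- tokenize(readLines("en_US.blogs.train.txt",
                                  encoding = "UTF-8", skipNul = TRUE))
length(blog_tokens)          # number of words
length(unique(blog_tokens))  # number of unique words
```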
The vast majority of the identified words in each file occur relatively infrequently; in other words, most of a document is made up of a small proportion of its unique words.
The plot above shows that over 90% of the word occurrences (upper horizontal red line) identified by the algorithm are covered by less than 10% (vertical blue line) of the most frequently occurring unique words.
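This coverage figure can be computed directly from the token frequencies; a minimal sketch, assuming the blog_tokens vector from the earlier tokenizing sketch:

```r
# Sketch: proportion of unique words needed to cover 90% of all occurrences.
word_coverage <- function(tokens, target = 0.90) {
  freq   <- sort(table(tokens), decreasing = TRUE)  # most frequent first
  cum    <- cumsum(freq) / sum(freq)                # cumulative coverage
  n_need <- which(cum >= target)[1]
  n_need / length(freq)       # share of unique words required
}

word_coverage(blog_tokens)    # well under 0.10 for these samples
```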
The top 100 words by frequency for the Blog, News and Twitter samples are shown in the following clouds (a sketch of how such a cloud can be generated follows the word lists below). Note: the relative size of each word indicates how often it occurs in the document relative to the other words.
The most common words in the Blog sample
## [1] "the" "and" "to" "a" "of" "i" "in" "that" "is" "it"
## [11] "for" "you" "with" "was" "on" "my" "this" "as" "have" "be"
the news sample
## [1] "the" "to" "and" "a" "of" "in" "for" "that" "is" "on"
## [11] "with" "said" "was" "he" "it" "at" "as" "his" "i" "be"
and the Twitter sample
## [1] "the" "to" "i" "a" "you" "and" "for" "in" "of" "is"
## [11] "it" "my" "on" "that" "me" "be" "at" "with" "your" "have"
The top spots in the word-frequency rankings belong to common stop words such as “the”, “is” and “to”. Such words may well need to be removed in order to enrich the prediction vocabulary of the final product.
An analysis of the 2-, 3- and 4-grams (2-, 3- and 4-word chunks) present in the data sets is currently under way.
The initial prediction model takes the last 2, 3 and 4 words from a sentence or phrase and presents the most frequently occurring “next” word from the sample data sets.
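A minimal sketch of this lookup, assuming the n-gram frequency tables are stored as data frames with prefix, next_word and count columns sorted by count (the table structure, list layout and fallback word are illustrative assumptions, not the final implementation):

```r
# Sketch: look up the most frequent "next" word for the longest matching
# context, falling back from 4-grams to 3-grams to 2-grams.
predict_next <- function(phrase, ngram_tables) {
  words <- tokenize(phrase)                  # tokenize() from earlier sketch
  for (n in c(4, 3, 2)) {                    # try the longest context first
    if (length(words) >= n - 1) {
      prefix <- paste(tail(words, n - 1), collapse = " ")
      hits   <- ngram_tables[[as.character(n)]]
      hits   <- hits[hits$prefix == prefix, ]
      if (nrow(hits) > 0) return(hits$next_word[1])
    }
  }
  "the"                                      # fall back to a common word
}
```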
These frequency tables currently need to be reduced in size to make them feasible for an online Shiny app, where speed of prediction and the size of the app are significant considerations.
To reduce the frequency tables, infrequent terms will be removed, and stop words such as “the”, “to” and “a” will be removed from the prediction if those words are already present in the sentence.
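A minimal sketch of how this reduction and stop-word filtering might look (the cut-off value, stop-word list and column names are illustrative assumptions):

```r
# Sketch: drop rare n-grams from a table, and drop a candidate prediction
# when it is a stop word already present in the input phrase.
stop_words <- c("the", "to", "a")

prune_table <- function(tbl, min_count = 5) {
  tbl[tbl$count >= min_count, ]
}

filter_prediction <- function(candidates, phrase) {
  seen <- tokenize(phrase)                   # tokenize() from earlier sketch
  drop <- candidates %in% stop_words & candidates %in% seen
  candidates[!drop]
}
```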
Profanity filtering of predictions will be included in the Shiny app. A simple table of “illegal” prediction words will be used to filter the final predictions sent to the user: the app will process profanity when predicting the next word, but will not present profanity as a prediction.
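A minimal sketch of that final filtering step (the profanity list file name is a placeholder; any plain-text list of banned words, one per line, would work):

```r
# Sketch: remove profane candidates before returning predictions to the user.
profanity <- readLines("profanity_list.txt", encoding = "UTF-8")

safe_predictions <- function(candidates) {
  candidates[!(tolower(candidates) %in% profanity)]
}
```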