Natural language generation (NLG) is a subset of natural language processing (NLP) that uses grammars and statistical models extracted from human-written text. SwiftKey® has built its business on these concepts to predict the next word a user types. We have been tasked with first extracting and building datasets from representative text and then modeling that text so we can do the same. This project is an attempt to recreate that functionality in the R programming language.
Our dataset consists of three files, each extracted from a different part of the web and representing a very different style of language. The colloquial nature of blog postings is in sharp contrast to the more formal structure of news reports. Both the news and blog files contain lines holding entire paragraphs from their sources, whereas Twitter lines are constrained to 140 characters. The Twitter file therefore contains many more lines but far fewer words and characters than the other two.
| Filename | Line Count | Word Count* | Character Count | Megabytes |
|---|---|---|---|---|
| en_US.blogs.txt | 899,288 (21%) | 37,334,690 (37%) | 210,160,014 (36%) | 200 |
| en_US.news.txt | 1,010,242 (23%) | 34,372,720 (34%) | 205,811,889 (35%) | 196 |
| en_US.twitter.txt | 2,360,148 (55%) | 30,374,206 (30%) | 167,105,338 (27%) | 159 |
| TOTAL | 4,269,678 | 102,081,616 | 583,077,241 | 555 |
* Word count is based on Unix's `wc` command; R's tm package returns blogs: 29,465,729, news: 28,263,823, twitter: 23,490,398.
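For reference, the sketch below shows one way these counts could be reproduced in R. The file paths are assumed to be in the working directory, and `stri_count_words` tokenizes differently from `wc -w`, which is the same reason the tm counts in the footnote differ.

```r
# Sketch: reproduce the line/word/character counts in R.
# Assumes the three files sit in the working directory.
library(stringi)

files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")

stats <- lapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    file  = f,
    lines = length(lines),
    words = sum(stri_count_words(lines)),  # tokenization differs from `wc -w`
    chars = sum(nchar(lines)),             # characters, not bytes as in `wc -c`
    mb    = round(file.size(f) / 1024^2)
  )
})

do.call(rbind, stats)
```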
The following table lists, for each pair of files, the most frequent word* that appears in one file but is missing entirely from the other, together with its occurrence count in the file where it was found. Note that words are matched as complete words, not as substrings of other words.
| found in | not found in blogs | not found in news | not found in twitter |
|---|---|---|---|
| en_US.blogs.txt | — | fucking (1,366) | stampin (466) |
| en_US.news.txt | Dimora (1,477) | — | square-foot (1,080) |
| en_US.twitter.txt | tryna (1,388) | fuck (11,771) | — |
* Word counts are based on results from Unix's `grep` command, e.g. `grep -iw tryna en_US.twitter.txt | wc -l`.
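The whole-word lookups above were done with `grep -iw`; the following is a rough R equivalent, sketched under the assumption that the full files fit in memory, for finding the most frequent words in one file that never occur in another. Because stringi tokenizes the text rather than matching whole words in raw lines, the exact counts may differ slightly.

```r
# Sketch: find the most frequent words in one file that never occur in another.
library(stringi)

word_freq <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  table(tolower(unlist(stri_extract_all_words(lines))))
}

blogs   <- word_freq("en_US.blogs.txt")
twitter <- word_freq("en_US.twitter.txt")

# Most frequent Twitter words that never appear in the blogs file
only_in_twitter <- twitter[!(names(twitter) %in% names(blogs))]
head(sort(only_in_twitter, decreasing = TRUE), 5)
```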
The histograms of word frequencies are extremely right-skewed. This is to be expected, since a small number of words are used far more often than the rest. The breakdown of word frequencies for each document is displayed below.
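A minimal sketch of how one of these frequency histograms could be produced in R is shown below; the choice of file and the log10 scale on the x-axis are illustrative assumptions, the latter made because of the heavy skew.

```r
# Sketch: histogram of word frequencies for one file, on a log10 scale.
library(stringi)

lines <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
freqs <- table(tolower(unlist(stri_extract_all_words(lines))))

hist(log10(as.numeric(freqs)),
     breaks = 50,
     main   = "Word frequencies in en_US.news.txt",
     xlab   = "log10(occurrences per word)")
```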
The variability of the language can be seen in simple word clouds built from each set of documents. These clouds are produced by removing very common English words such as 'the' that provide no meaningful information (stop words), converting all characters to lowercase, and then combining words into their stems. In this way we can see the frequency of the terms actually used. A quick glance clearly shows the more formal nature of the news documents compared to the Twitter or blog posts.
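The sketch below outlines how such a word cloud could be built with the `tm`, `SnowballC`, and `wordcloud` packages; the 20,000-line random sample is an assumption made to keep memory use manageable and is not taken from the report.

```r
# Sketch: build a word cloud for one source using the steps described above:
# stop-word removal, lower-casing, and stemming.
library(tm)
library(SnowballC)   # provides the stemmer used by stemDocument()
library(wordcloud)

set.seed(1234)
lines <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
samp  <- sample(lines, 20000)

corpus <- VCorpus(VectorSource(samp))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, stemDocument)

tdm   <- TermDocumentMatrix(corpus)
freqs <- sort(slam::row_sums(tdm), decreasing = TRUE)  # slam is a tm dependency;
                                                       # avoids densifying the matrix

wordcloud(names(freqs), freqs, max.words = 100, random.order = FALSE)
```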
We have demonstrated that different prediction models will be needed depending on the type of text being processed. For instance, the most common trigram in the Twitter file is 'thanks for the', whereas in both the news and blog files 'one of the' is the most prevalent. To build a more accurate predictor, we will need to know which type of data we are working with.
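A rough sketch of how the trigram counts could be reproduced in base R with `stringi` follows; the per-line tokenization is an illustrative choice so that trigrams never span line boundaries.

```r
# Sketch: count the most frequent trigrams in a file.
library(stringi)

top_trigrams <- function(path, n = 5) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  tri <- unlist(lapply(stri_extract_all_words(tolower(lines)), function(w) {
    if (length(w) < 3 || anyNA(w)) return(character(0))
    # slide a window of three words across the line
    paste(w[1:(length(w) - 2)], w[2:(length(w) - 1)], w[3:length(w)])
  }))
  head(sort(table(tri), decreasing = TRUE), n)
}

top_trigrams("en_US.twitter.txt")   # "thanks for the" is expected on top
```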
We see no reason not to use the same routines when devising our predictions; however, we will need to establish which model to apply when choosing the predicted word. For instance, seeing the word 'Dimora' in the context should point us to the news model, as the news file is the only one containing that word. If the character count goes beyond 140, we know the text cannot be a tweet and should not be handled by the Twitter model. In this way, distinctive words or phrases can be linked to individual data sources and cause us to switch to the corresponding model, as sketched below.
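As a purely hypothetical illustration, the sketch below encodes this selection heuristic; the function name, the placeholder word list, and the returned labels are all assumptions rather than part of the project's actual prediction code.

```r
# Hypothetical sketch of the source-selection heuristic described above.
# choose_model() and news_only_words are placeholders only.
choose_model <- function(text, news_only_words = c("dimora", "square-foot")) {
  if (nchar(text) > 140) {
    return("blogs_or_news")          # too long to be a tweet
  }
  words <- tolower(unlist(strsplit(text, "\\s+")))
  if (any(words %in% news_only_words)) {
    return("news")                   # contains a word unique to the news file
  }
  "general"                          # otherwise fall back to a combined model
}

choose_model("Dimora was indicted last week")   # returns "news"
```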