The ability to predict the next word a website user may wish to type has a number of applications. In order to build a data product that can predict the next word a user will type, text data from Twitter, blogs, and news stories were obtained. The data were all in English, and formed training data for a prediction model that could power the data product.
Because of the large size of these documents, a 1% sample of each was used for model-building. Here I present the results of a descriptive/exploratory analysis of the three data sources.
The first step in analyzing the Twitter data was to read in the data stream and create a character vector object from it. This then allowed for exploration of the available data.
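As a rough illustration, the ingest and sampling steps can be carried out along the following lines (a minimal sketch, assuming the raw file is named en_US.twitter.txt in the working directory; the file path and seed are illustrative, not the exact values used in the analysis):

# Read the full Twitter file into a character vector (one tweet per element)
con <- file("en_US.twitter.txt", open = "rb")
tweets <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

# Draw the 1% random sample used for the exploratory analysis
set.seed(1234)                                   # illustrative seed
tweets_sample <- sample(tweets, round(0.01 * length(tweets)))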
The full Twitter file was written in English and consisted of 2,360,148 tweets. The random 1% sample of the data contained 23,655 tweets comprising 302,194 words.
Digging into the data, I created bigrams (two-word n-grams) from the individual words in the text and generated a term-document matrix (tdm) from them. The summary of this matrix reports the number of terms (bigrams), the number of documents, and the sparsity of the matrix of occurrence counts for each term in each document. The term-document matrix contained 144,162 terms across the 23,655 tweets:
## <<TermDocumentMatrix (terms: 144162, documents: 23655)>>
## Non-/sparse entries: 266597/3409885513
## Sparsity : 100%
## Maximal term length: 67
## Weighting : term frequency (tf)
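For reference, the matrix summarized above can be built roughly as follows (a sketch, assuming the sampled tweets are stored in the character vector tweets_sample from the ingest sketch; the exact cleaning steps in the original analysis may differ):

library(tm)
library(RWeka)   # provides NGramTokenizer for building bigrams

# Build a corpus from the sampled tweets and apply light cleaning
corpus <- VCorpus(VectorSource(tweets_sample))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

# Tokenizer that returns two-word phrases (bigrams)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

# Bigram term-document matrix: one row per bigram, one column per tweet
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
tdm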
As the term-document matrix summary shows, the matrix is sparse, with only 266,597 non-zero entries out of roughly 3.4 billion possible entries. In fact, a histogram of bigram frequencies shows that most of the bigrams are used 10 or fewer times.
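Such a histogram can be produced directly from the sparse matrix (a sketch; the bin width and cutoff are illustrative):

library(slam)   # row_sums() works on the sparse term-document matrix

bigram_counts <- row_sums(tdm)    # total occurrences of each bigram
hist(bigram_counts[bigram_counts <= 50],
     breaks = 50,
     main = "Bigram frequencies (Twitter sample)",
     xlab = "Occurrences of bigram")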
I then wanted to get a sense of the most common bigram phrases among those generated. Using the findFreqTerms() function from the tm package, I was able to produce vectors of the most common bigrams.
library(tm)
findFreqTerms(tdm, lowfreq = 200)
## [1] "a great" "and i" "are you" "at the" "for a"
## [6] "for the" "going to" "have a" "have to" "i am"
## [11] "i dont" "i have" "i just" "i know" "i love"
## [16] "i think" "i was" "if you" "in a" "in the"
## [21] "is a" "is the" "it was" "need to" "of the"
## [26] "on the" "so much" "thank you" "thanks for" "to be"
## [31] "to get" "to see" "to the" "want to" "will be"
## [36] "with the" "you are"
This output shows bigrams with at least 200 occurrences in the corpus. Some of these are common English phrases that include articles and so-called “stop words.” While natural language processing often removes these words from the corpus when attempting to determine topics or sentiment, I have left them in here, since my main concern is predicting the next word the user would like to type. I then changed the function call to include only those phrases with at least 500 occurrences in the corpus:
findFreqTerms(tdm, lowfreq = 500)
## [1] "for the" "in the" "of the" "on the"
Here the list has been narrowed to phrases that are surely among the most common in the English language: a preposition followed by ‘the’.
The blog data consisted of 899,288 lines of text written in English. Again, because of the size of the data file, a 1% random sample was used; it contained 8,992 lines.
The term-document matrix (tdm) from the bigrams of the blog data showed that there were 184,324 bigrams in the 8,990 documents:
## <<TermDocumentMatrix (terms: 184324, documents: 8990)>>
## Non-/sparse entries: 345691/1656727069
## Sparsity : 100%
## Maximal term length: 98
## Weighting : term frequency (tf)
The term-document matrix summary again shows a sparse matrix, with 345,691 non-zero entries out of roughly 1.7 billion possible entries. As in the Twitter data, a histogram of bigram frequencies shows that most of the bigrams are used 10 or fewer times.
As in the Twitter data, there are many bigrams here with at least 200 occurrences:
library(tm)
findFreqTerms(tdm, lowfreq = 200)
## [1] "a few" "a lot" "all the" "and a" "and i"
## [6] "and the" "as a" "at the" "but i" "by the"
## [11] "for a" "for the" "from the" "going to" "have a"
## [16] "have been" "have to" "i am" "i dont" "i had"
## [21] "i have" "i think" "i was" "if you" "in a"
## [26] "in my" "in the" "into the" "is a" "is the"
## [31] "it is" "it was" "of a" "of my" "of the"
## [36] "on the" "one of" "out of" "that i" "that the"
## [41] "the first" "the same" "they are" "this is" "to be"
## [46] "to do" "to get" "to make" "to the" "want to"
## [51] "was a" "when i" "will be" "with a" "with the"
## [56] "you can"
In these data the phrases are more formal and less elemental than in the Twitter data. In hindsight this is sensible, since blogs tend to be more focused than tweets. It is also noticeable that there are more high-frequency bigrams in the blog data. This too makes sense: blogs should create clusters of text about similar topics, and thus might produce more commonly used bigrams. It may also stem from the fact that there are likely fewer authors in the blog data, and individual writers tend to reuse the same grammatical constructions.
I again changed the function call to include only those phrases with at least 500 occurrences in the corpus:
findFreqTerms(tdm, lowfreq = 500)
## [1] "and the" "for the" "i was" "in the" "it was" "of the" "on the"
## [8] "to be" "to the"
As with the earlier list of bigrams appearing at least 200 times, there are more bigrams here than there were in the Twitter data; yet, similar to the Twitter data, the bigrams with at least 500 occurrences tend to be prepositions combined with the article ‘the’.
The news data consisted of 77,259 lines of text, containing 2,643,969 words, again written in English. The 1% random sample of the data therefore had 772 lines with 369,463 words.
According to the TDM for the news data, there were 20,030 bigrams in the 772 documents:
## <<TermDocumentMatrix (terms: 20030, documents: 772)>>
## Non-/sparse entries: 24369/15438791
## Sparsity : 100%
## Maximal term length: 34
## Weighting : term frequency (tf)
Just as in the prior two data sources, the terms form a sparse matrix, with 24,369 non-zero entries out of roughly 15.5 million possible entries. As in the previous data sources, most of the bigrams are used 10 or fewer times.
It is in the bigram frequency counts that the news data diverge from the Twitter and blog data. Whereas those sources had many bigrams used more than 200 times, in these data no bigram is used that often. Instead, the frequency threshold must be lowered to 50 to see even a small number of bigrams:
library(tm)
findFreqTerms(tdm, lowfreq = 50)
## [1] "at the" "for the" "in the" "of the" "on the" "to the"
The most frequent terms here are almost identical to those found in the other two data sources. However, there are not many phrases that are used at least 50 times, so to be more inclusive, I changed the frequency to 10:
findFreqTerms(tdm, lowfreq = 10)
## [1] "a lot" "according to" "all the" "and a"
## [5] "and other" "and the" "as a" "as the"
## [9] "at a" "at least" "at the" "but the"
## [13] "by a" "by the" "for a" "for the"
## [17] "from a" "from the" "going to" "has been"
## [21] "have been" "have to" "he said" "he was"
## [25] "i dont" "i think" "in a" "in his"
## [29] "in the" "is a" "is the" "it is"
## [33] "it was" "it would" "last year" "lot of"
## [37] "more than" "need to" "new york" "of a"
## [41] "of his" "of the" "on a" "on the"
## [45] "one of" "out of" "over the" "part of"
## [49] "said he" "said the" "she said" "some of"
## [53] "that he" "that the" "the first" "the game"
## [57] "the next" "the same" "there are" "to a"
## [61] "to be" "to do" "to have" "to make"
## [65] "to the" "want to" "we have" "will be"
## [69] "with a" "with the" "would be"
The phrases here tend to impart a sense of third-person narrative, appropriate for a news story.
After having explored the data from the three sources, it seems as though the best approach is to first generate a list of bigram terms and their frequencies. I can then match a given input word to a bigram by the first term in the bigram. The most probable bigram (or perhaps top few most probable) can then be mined to provide its second word as a choice to the application user.
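As a concrete sketch, the bigram list could be stored as a simple lookup table split into first word, second word, and frequency (the object name bigram_freq and its columns are illustrative):

library(slam)   # row_sums() works on the sparse term-document matrix

freqs <- row_sums(tdm)                      # total count of each bigram
parts <- strsplit(names(freqs), " ")        # split "word1 word2" into pieces

bigram_freq <- data.frame(
  word1 = vapply(parts, `[`, character(1), 1),
  word2 = vapply(parts, `[`, character(1), 2),
  freq  = as.numeric(freqs),
  stringsAsFactors = FALSE
)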
To create a list of words that can be matched, I may have to use the unique words in the three data sources. If the user types one of these “known” words, a choice can be offered; if not, nothing will appear.
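A minimal sketch of this matching logic, using the hypothetical bigram_freq table above (the function name and its interface are illustrative, not the final product):

# Given an input word, return the most frequent second words of bigrams
# whose first word matches; return nothing if the word is "unknown"
predict_next_word <- function(input_word, bigram_freq, n = 3) {
  matches <- bigram_freq[bigram_freq$word1 == tolower(input_word), ]
  if (nrow(matches) == 0) return(character(0))
  matches <- matches[order(matches$freq, decreasing = TRUE), ]
  head(matches$word2, n)    # top n candidate next words
}

# Example: suggestions following the word "of" (likely "the", given the data)
predict_next_word("of", bigram_freq)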
To demonstrate the product, an app will be created in Shiny. The app will have a text box to input a word, and then the user will be given a list of probable next words. To implement this model in a website or other location would require more complex scripting, and would be outside the scope of this capstone project.
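A bare-bones version of such an app might look like the following (a sketch only, reusing the hypothetical predict_next_word() function and bigram_freq table above; the actual capstone app may be organized differently):

library(shiny)

ui <- fluidPage(
  titlePanel("Next-word suggestions"),
  textInput("word", "Type a word:"),
  verbatimTextOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderPrint({
    req(input$word)    # wait until the user has typed something
    predict_next_word(input$word, bigram_freq)
  })
}

shinyApp(ui = ui, server = server)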