Introduction

This is the Milestone Report for the Capstone Project in the Data Science Specialization offered by Johns Hopkins University at Coursera.

The final goal of the project is to produce a prediction algorithm for the next word that a user might want to type in a smartphone, based on the previously used words. In this Milestone Report we focus on an exploratory data analysis of three documents with text data from blogs, news, and Twitter. This corpus of text data will provide the foundation for our prediction algorithm.

A first look at the data

We use the tm R package to read and preprocess the text data and to create “Term Document Matrices,” i.e., summaries of how many times a given word or n-gram appears in a document. (An n-gram is a sequence of n words; 2-grams and 3-grams are also called bigrams and trigrams, respectively.) For this initial exploratory data analysis we use a subset of the text data comprising around 1% of the total number of entries in each document. The following data frame shows the number of entries that we’ll work with in each document:

##    source entries
## 1   blogs    9102
## 2    news   10222
## 3 twitter   23678
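The report does not show how the 1% subset was drawn. The following is a minimal sketch in base R of one way to do it, assuming each source has already been read into a character vector with one entry per element; the helper name `sample_entries`, the seed, and the toy input are hypothetical.

```r
# Hypothetical helper: keep a random fraction of the entries of one text source.
sample_entries <- function(entries, fraction = 0.01, seed = 1234) {
  set.seed(seed)                                   # make the subset reproducible
  n_keep <- max(1, round(length(entries) * fraction))
  keep <- sort(sample.int(length(entries), n_keep))  # sort to preserve file order
  entries[keep]
}

# Toy stand-in for one source; real data would typically come from readLines().
toy_entries <- paste("entry", 1:1000)
subset_entries <- sample_entries(toy_entries, fraction = 0.01)
length(subset_entries)  # 10 entries, i.e. 1% of 1000
```

Fixing the seed means the same subset is drawn every time, so the counts reported later can be reproduced.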

tm provides several “transformations” that can be applied to the data before any computations are done. We have applied the following transformations to our dataset: stripping extra whitespace, converting to lowercase, removing punctuation, and removing numbers.

The rationale for these transformations is hopefully clear: we want a dictionary of comparable words, without complications such as whitespace or punctuation. Converting all words to lowercase allows us to identify, e.g., “Love” and “love” as the same word. Since there are infinitely many possible numbers, there is no point in trying to predict them.

Removing punctuation, however, has the undesired effect of turning, e.g., “I’m” into “Im,” which is not a word. In the next phase of our work we’ll have to address this issue by excluding apostrophes from the list of punctuation symbols to be removed.
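As an illustration of the cleaning steps and of the proposed apostrophe fix, here is a minimal sketch in base R rather than tm. The function name `clean_text` is hypothetical, and the regular expressions are one possible choice, not the transformations actually used for this report.

```r
# Hypothetical cleaning function: lowercase, drop numbers, drop punctuation
# except apostrophes, and collapse whitespace.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("\u2019", "'", x, fixed = TRUE)     # typographic apostrophe -> plain one
  x <- gsub("[[:digit:]]+", " ", x)             # remove numbers
  x <- gsub("[^[:alnum:][:space:]']", " ", x)   # remove punctuation, keep apostrophes
  x <- gsub("[[:space:]]+", " ", x)             # collapse runs of whitespace
  trimws(x)
}

clean_text("I\u2019m sure   Love & love are 2 words!")
# "i'm sure love love are words"
```

Because the apostrophe is excluded from the characters being removed, “I’m” survives as “i'm” instead of collapsing to “im.”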

There are two transformations that we chose not to apply: removing stopwords and stemming.

“Stopwords,” such as “our,” “in,” and “very,” are kept in the database because our prediction algorithm must be able to predict them when they arise. “Stemming” is the process of reducing inflected forms to a common stem, e.g., “walks,” “walked,” and “walking” to “walk.” (Mapping “am,” “is,” “are,” “was,” and “were” to “be” is the closely related process of lemmatization.) This is useful if one wants to know how often a word is used in English, in any of its variants, but is useless in our context, since we care about precisely which variant must be predicted by the algorithm.

After all these preliminaries, we create Term Document Matrices for single words (1-grams), bigrams, and trigrams.
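To make the notion of bigrams and trigrams concrete, the sketch below builds n-grams from a vector of tokens in base R and tabulates them. It is not the tm-based code behind this report; `make_ngrams` and the toy sentence are hypothetical.

```r
# Hypothetical helper: all sequences of n consecutive tokens, joined by spaces.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens <- strsplit("thanks for the follow and thanks for the help",
                   " ", fixed = TRUE)[[1]]

# Count how often each n-gram occurs, most frequent first.
bigram_counts  <- sort(table(make_ngrams(tokens, 2)), decreasing = TRUE)
trigram_counts <- sort(table(make_ngrams(tokens, 3)), decreasing = TRUE)

bigram_counts[["thanks for"]]       # 2
trigram_counts[["thanks for the"]]  # 2
```

A sequence of T tokens yields T - n + 1 n-grams, so the 9 tokens above give 8 bigrams and 7 trigrams.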

As a first result from our analysis, we quote the number of unique words, bigrams, and trigrams appearing in all documents, as well as the number of instances in which these words, bigrams, or trigrams occur.

##      item unique instances
## 1    word  56953    792567
## 2  bigram 473702   1002112
## 3 trigram 852807   1002109

For instance, we find that our text data, comprising all three documents, contains a total of 792567 word instances, of which 56953 are unique (i.e., different). This means that each word appears 13.9 times on average in our data.
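The distinction between unique terms and instances, and the average quoted above, can be sketched as follows. The helper `summarise_terms` and the toy vector are hypothetical; the final line only re-does the arithmetic with the figures from the table.

```r
# Hypothetical helper: unique terms, total instances, and mean instances per term.
summarise_terms <- function(terms) {
  counts <- table(terms)
  data.frame(unique        = length(counts),
             instances     = sum(counts),
             mean_per_term = sum(counts) / length(counts))
}

toy_terms <- c("the", "cat", "and", "the", "dog", "and", "the", "bird")
summarise_terms(toy_terms)   # 5 unique terms, 8 instances, 1.6 per term

# Average for the word counts reported in the table above.
round(792567 / 56953, 1)     # 13.9
```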

In particular, for words we break down this information according to source document:

##    source unique instances
## 1   blogs  29242    288223
## 2    news  30860    277268
## 3 twitter  25951    227076

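This suggests that Twitter users draw on a more limited vocabulary (25951 different words in our database) than news authors, who used 30860 different words. The Twitter sample also contains fewer word instances (227076 versus 277268), however, so part of this difference may simply reflect sample size.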

The most common words, bigrams, and trigrams

According to an article about the Oxford English Corpus (OEC), around 7000 “lemmas” (with a lemma being the base form of a word) account for approximately 90% of all words used in the English language. The fifty most common words in our database are the following:

##     word instances  frequency
## 1    the     47742 0.06023718
## 2    and     24258 0.03060688
## 3    for     10890 0.01374016
## 4   that     10375 0.01309038
## 5    you      9634 0.01215544
## 6   with      7097 0.00895445
## 7    was      6290 0.00793624
## 8   this      5441 0.00686503
## 9   have      5380 0.00678807
## 10   are      5012 0.00632376
## 11   but      4848 0.00611683
## 12   not      4202 0.00530176
## 13  from      3747 0.00472768
## 14   its      3644 0.00459772
## 15  they      3274 0.00413088
## 16   all      3185 0.00401859
## 17  said      3132 0.00395172
## 18  will      3127 0.00394541
## 19  your      3105 0.00391765
## 20   his      3017 0.00380662
## 21  just      3017 0.00380662
## 22   out      3010 0.00379779
## 23 about      2971 0.00374858
## 24   one      2960 0.00373470
## 25  what      2748 0.00346721
## 26  when      2702 0.00340918
## 27  like      2689 0.00339277
## 28   has      2596 0.00327543
## 29   who      2483 0.00313286
## 30   can      2463 0.00310762
## 31  more      2396 0.00302309
## 32   get      2334 0.00294486
## 33   had      2255 0.00284519
## 34  were      2168 0.00273542
## 35  time      2150 0.00271270
## 36 would      2135 0.00269378
## 37 there      2130 0.00268747
## 38   her      2104 0.00265467
## 39 their      2094 0.00264205
## 40  some      2009 0.00253480
## 41   she      1958 0.00247045
## 42   new      1930 0.00243513
## 43   our      1915 0.00241620
## 44  dont      1882 0.00237456
## 45  been      1862 0.00234933
## 46   how      1802 0.00227362
## 47  good      1762 0.00222316
## 48   now      1702 0.00214745
## 49   day      1701 0.00214619
## 50  know      1639 0.00206796

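The word “the” is by far the most common in our database, appearing 47742 times. This corresponds to a staggering 6.02 percent of all words counted. Note that no word shorter than three letters (e.g., “a,” “of,” or “to”) appears in this list, even though such words feature in the most frequent bigrams reported below. This is consistent with tm’s default of discarding words of fewer than three characters when building a Term Document Matrix, so the word counts in this report probably exclude them.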

We can also determine how many unique words we need to cover a given percentage of all word instances. This data frame gives a summary:

##   percentage words
## 1        25%    32
## 2        50%   319
## 3        75%  2314
## 4        90%  9998
## 5        95% 21574
## 6        99% 49028

From this, we learn that, e.g., we need 9998 unique words to account for 90% of all words appearing in our documents, in rough agreement with the OEC figure (a lemma groups several word forms, so a somewhat larger count of distinct words is to be expected). In other words: a dictionary made up of 9998 words is rich enough to contain, in principle, the next word that a user will type about 90% of the time.
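The coverage figures above follow from sorting the word counts in decreasing order and accumulating them until the desired share of all instances is reached. A minimal sketch in base R, with a hypothetical helper `words_for_coverage` and invented toy counts:

```r
# Hypothetical helper: smallest number of most-frequent words whose counts
# add up to at least the requested share of all word instances.
words_for_coverage <- function(counts, coverage) {
  sorted    <- sort(as.numeric(counts), decreasing = TRUE)
  cum_share <- cumsum(sorted) / sum(sorted)
  which(cum_share >= coverage)[1]
}

# Toy counts, invented for illustration (100 instances in total).
toy_counts <- c(the = 50, and = 25, that = 10, you = 10, with = 5)

sapply(c(0.50, 0.75, 0.90),
       function(p) words_for_coverage(toy_counts, p))
# 1 2 4
```

In the toy data the single most common word already covers half of all instances, while reaching 90% takes four of the five words, mirroring the steep growth seen in the table.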

The following density plot confirms that a relatively small number of words appear a disproportionately large number of times, while a long “tail” of words show up only rarely. Note that the horizontal axis is on a logarithmic scale.

The most frequent bigrams (appearing at least 500 times in our database) and trigrams (appearing at least 100 times) are, in that order:

##  [1] "all the"   "and a"     "and i"     "and the"   "as a"     
##  [6] "at the"    "be a"      "but i"     "by the"    "for a"    
## [11] "for the"   "from the"  "going to"  "have a"    "have to"  
## [16] "i am"      "i dont"    "if you"    "i have"    "i love"   
## [21] "in a"      "in the"    "is a"      "is the"    "i think"  
## [26] "it is"     "it was"    "i was"     "of a"      "of the"   
## [31] "on a"      "one of"    "on the"    "out of"    "that i"   
## [36] "that the"  "the first" "this is"   "to a"      "to be"    
## [41] "to do"     "to get"    "to see"    "to the"    "want to"  
## [46] "was a"     "will be"   "with a"    "with the"

##  [1] "a couple of"        "a lot of"           "as well as"        
##  [4] "be able to"         "going to be"        "i dont know"       
##  [7] "i have a"           "i have to"          "im going to"       
## [10] "it was a"           "i want to"          "looking forward to"
## [13] "one of my"          "one of the"         "out of the"        
## [16] "part of the"        "some of the"        "thanks for the"    
## [19] "thank you for"      "the end of"         "the first time"    
## [22] "there is a"         "the rest of"        "this is a"         
## [25] "to be a"            "you have to"        "you want to"
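Lists like the two above can be obtained by keeping only the n-grams whose count reaches a threshold. The sketch below does this in base R on a named vector of counts; `frequent_terms` is a hypothetical helper and the counts are invented for illustration.

```r
# Hypothetical helper: names of all terms occurring at least min_count times,
# returned in alphabetical order.
frequent_terms <- function(counts, min_count) {
  sort(names(counts)[counts >= min_count])
}

# Toy bigram counts (invented, not taken from the report's data).
toy_bigram_counts <- c("of the" = 900, "in the" = 750,
                       "to be" = 480, "i love" = 120)

frequent_terms(toy_bigram_counts, 500)
# "in the" "of the"
```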

Frequencies

The following bar plot shows the frequency of appearance for the twenty most common words, given as an average number of instances per 1000 words, and broken down according to source (blogs, news, Twitter).

“The” is very common in news and blogs, but less so on Twitter, where users concerned about the 140-character limit presumably choose to omit it. Similarly, “you” is much more common on Twitter than in news and blogs, suggesting that people use the platform to communicate with each other directly, without explicitly addressing a larger audience.
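Frequencies per 1000 words make sources of different sizes comparable: each word count is divided by the total number of word instances in its source and multiplied by 1000. A minimal sketch, with a hypothetical helper `rate_per_1000` and invented per-source counts:

```r
# Hypothetical helper: instances of a word per 1000 words of its source.
rate_per_1000 <- function(word_counts, total_words) {
  1000 * word_counts / total_words
}

# Toy per-source counts (invented for illustration).
toy <- data.frame(source = c("blogs", "news", "twitter"),
                  the    = c(120, 70, 15),
                  you    = c(20, 4, 13),
                  total  = c(2000, 1000, 500))

rates <- data.frame(source = toy$source,
                    the    = rate_per_1000(toy$the, toy$total),
                    you    = rate_per_1000(toy$you, toy$total))
rates
# the: 60, 70, 30 per 1000 words; you: 10, 4, 26 per 1000 words

# Overall rate of "the" from the counts reported earlier.
round(rate_per_1000(47742, 792567), 1)  # 60.2
```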

The following bar plot shows the frequency of the twenty most common bigrams.

We note that personal forms, such as “I have,” “I was,” and “and I” are much more common in blogs than in news or Twitter.

Finally, the following bar plot shows the frequency of the twenty most common trigrams.

“Thanks for the” and “thank you for,” which suggest a personal communication, are barely present in news and blogs, but appear to be Twitter favorites.

Final thoughts

There are significant differences in the English language used in blogs, news, and Twitter. Our results are consistent with the view that news employs a more formal language, while blogs tend to assume a “diary” form, and Twitter is widely used for direct communication between individuals.

It would be interesting to see whether our algorithm can be tuned to account for these differences in use. For instance, a keyboard app could be made aware of what kind of document the user is typing and adjust its predictions accordingly.